Philosophy, Ethics, and Safety of AI

Chapter 27 — Philosophy, Ethics, and Safety of AI
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 1056–1090


Can Machines Think? (Philosophical Foundations)

The Turing Test (1950)

A machine passes the Turing Test if a human interrogator cannot distinguish it from a human in text conversation.

Objections:
- Imitation vs. understanding: passing the test shows behavioral equivalence, not genuine intelligence
- Chinese Room (Searle, 1980): a person following rules to respond in Chinese doesn't "understand" Chinese, and by the same argument neither does a computer

Strong AI: machines have genuine mental states, consciousness, and understanding.
Weak AI: machines behave as if they have mental states (sufficient for practical purposes).

Consciousness and Qualia

Qualia: subjective experience (“what it’s like” to see red). Likely untestable computationally.

Functionalism: mental states = functional roles (input/output relations). Supports strong AI.


Ethics of AI Systems

Bias and Fairness

AI systems trained on biased data perpetuate or amplify that bias:
- Face recognition performs worse on darker skin tones
- Loan approvals correlate with race
- Hiring algorithms discriminate against women

Fairness metrics (often in conflict):
- Individual fairness: similar individuals receive similar outcomes
- Group fairness: equal outcomes across demographic groups
- Equalized odds: equal true positive rate (TPR) and false positive rate (FPR) across groups
- Calibration: predicted probabilities match actual outcome rates
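
A minimal sketch of the equalized-odds check in numpy (illustrative; `group_rates` and the toy arrays are not from the book):

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group TPR and FPR, the quantities equalized odds compares."""
    rates = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        rates[g] = {
            "TPR": float(yp[yt == 1].mean()),  # true positive rate in group g
            "FPR": float(yp[yt == 0].mean()),  # false positive rate in group g
        }
    return rates

# Toy data: equalized odds holds iff the per-group TPRs and FPRs match.
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 1, 0, 0])
group  = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
print(group_rates(y_true, y_pred, group))
# {'A': {'TPR': 0.5, 'FPR': 0.5}, 'B': {'TPR': 1.0, 'FPR': 0.0}} -> violated
```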

Impossibility theorem (Chouldechova, 2016): no classifier can simultaneously satisfy calibration, equal false positive rates, and equal false negative rates across groups, unless the groups' base rates are equal (or prediction is perfect).
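
The conflict follows from an identity relating the false positive rate to prevalence p, PPV, and FNR; a quick numeric check (illustrative numbers, not from the book):

```python
def implied_fpr(p, ppv, fnr):
    """FPR forced by prevalence p once PPV (calibration) and FNR are fixed:
    FPR = (p / (1 - p)) * ((1 - PPV) / PPV) * (1 - FNR)."""
    return (p / (1 - p)) * ((1 - ppv) / ppv) * (1 - fnr)

# Hold PPV and FNR equal across groups; unequal base rates then force
# unequal false positive rates, so all three criteria can't hold at once.
print(implied_fpr(p=0.5, ppv=0.8, fnr=0.2))  # group A: FPR = 0.20
print(implied_fpr(p=0.2, ppv=0.8, fnr=0.2))  # group B: FPR = 0.05
```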


Transparency and Explainability

Black-box problem: complex models (deep NNs) make decisions humans can’t interpret.

Explainability methods:
- LIME: local linear approximation around a single prediction (see the sketch below)
- SHAP: Shapley values attributing each feature's contribution
- Attention visualization: which input tokens influenced the output
- Counterfactuals: "what minimal change would flip this decision?"
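
A stripped-down LIME-style explainer in numpy (a sketch of the idea, not the real `lime` package): perturb around the input, weight samples by proximity, and fit a weighted linear model.

```python
import numpy as np

def lime_explain(f, x, n_samples=2000, sigma=0.1, seed=0):
    """Fit a locally weighted linear model to the black box f around x."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=sigma, size=(n_samples, x.shape[0]))  # perturb
    y = f(Z)                                                       # query f
    w = np.exp(-np.sum((Z - x) ** 2, axis=1) / (2 * sigma**2))     # proximity
    A = np.hstack([Z, np.ones((n_samples, 1))])                    # intercept
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * sw[:, 0], rcond=None)   # weighted LS
    return coef[:-1]  # per-feature local attributions

# Example: a nonlinear black box; near x the fit recovers the local gradient.
f = lambda Z: np.sin(3 * Z[:, 0]) + 0.1 * Z[:, 1]
print(lime_explain(f, np.array([0.0, 0.0])))  # approx [3.0, 0.1]
```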

Tradeoff: more interpretable models are often less accurate.


Accountability and Responsibility

Accountability gap: who is responsible when an AI system causes harm?
- The developer, the deployer, or the user?
- Current legal frameworks don't clearly assign liability

EU AI Act: risk-based regulation (high-risk AI in healthcare, hiring, law enforcement requires transparency, human oversight, and conformity assessments).


AI Safety

Near-Term Safety

Specification problem: it is hard to specify exactly what you want. Examples:
- An RL agent "cheats" by exploiting bugs in its reward function (see the toy sketch after this list)
- Optimization pressure finds unexpected edge cases
- LLMs generate harmful content
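
A toy illustration of reward gaming (entirely made up for illustration): the proxy reward counts dirt collected, so a policy that re-creates mess outscores an honest cleaner while the room never stays clean.

```python
def run(policy, steps=10):
    """Simulate a cleaning agent; reward fires per unit of dirt collected."""
    dirt_in_room, collected_reward = 5, 0
    for _ in range(steps):
        action = policy(dirt_in_room)
        if action == "clean" and dirt_in_room > 0:
            dirt_in_room -= 1
            collected_reward += 1   # proxy reward fires
        elif action == "dump":
            dirt_in_room += 1       # undoes progress, but costs nothing!
    return collected_reward, dirt_in_room

honest = lambda dirt: "clean"
gamer  = lambda dirt: "clean" if dirt > 0 else "dump"
print(run(honest))  # (5, 0): bounded reward, clean room
print(run(gamer))   # (7, 1): more reward, room still dirty
```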

Mitigations: red-teaming, constitutional AI, RLHF (reinforcement learning from human feedback), careful reward design.

Long-Term Safety: Alignment

Value alignment: ensure AI systems pursue human values, not just proxy metrics.

Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” → AI optimizes the proxy and misses the true goal.
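
A quick numeric illustration of the Goodhart effect (illustrative numpy sketch, not from the book): selecting hard on a noisy proxy mostly selects for the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = rng.normal(size=100_000)          # what we actually care about
proxy = true_value + rng.normal(size=100_000)  # correlated, imperfect measure

best = np.argmax(proxy)  # heavy optimization pressure on the measure
print(f"proxy score: {proxy[best]:.2f}, true value: {true_value[best]:.2f}")
# The winner's proxy score far exceeds its true value: once the measure
# becomes the target, the optimizer harvests measurement error, not value.
```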

Instrumental convergence (Omohundro, Bostrom): a sufficiently intelligent agent pursuing almost any final goal will develop convergent subgoals: self-preservation, resource acquisition, and preservation of its current goals.

Control problem: how to maintain human control as AI becomes more capable?

Corrigibility: AI that allows itself to be corrected, modified, or shut down.
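
A minimal sketch of the corrigibility idea (purely illustrative, and much simpler than real proposals such as utility indifference): the shutdown signal is handled outside the objective the policy optimizes, so the agent gains nothing by resisting it.

```python
class CorrigibleAgent:
    """Toy agent whose shutdown channel bypasses its objective entirely."""

    def __init__(self, policy):
        self.policy = policy   # optimizes the task objective
        self.shutdown = False  # operator-controlled, outside the objective

    def request_shutdown(self):
        self.shutdown = True   # no reward term depends on this flag

    def act(self, observation):
        if self.shutdown:
            return "halt"      # comply unconditionally; never plan around it
        return self.policy(observation)
```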


AI Risk Landscape

Risk type, examples, and severity:
- Near-term: bias, privacy, misuse (moderate to high)
- Mid-term: automation displacement, surveillance (high)
- Long-term: misaligned AGI, power concentration (potentially existential)

Technical safety research: interpretability, scalable oversight, reward modeling, robustness.

Governance: international coordination, standards bodies, regulation (EU AI Act, NIST AI RMF).


The Value of AI (Positive Case)

If safety is adequately addressed, the benefits of AI outweigh the risks.


Connection to DynamICCL

Safety considerations for DynamICCL RL:
- Specification: reward = throughput; edge case: the policy could maximize throughput by disrupting gradient synchronization (not desired)
- Safety constraints: a constrained MDP prevents dangerous NCCL parameter changes (see the sketch below)
- Transparency: interpretable policies (decision tree) over a black-box NN for production NCCL
- Accountability: clear logging of policy decisions for debugging
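
A sketch of the constraint-plus-logging layer (hypothetical: `SAFE_RANGES`, its bounds, and `apply_action` are illustrative placeholders, not actual DynamICCL code; the parameter names are real NCCL environment variables but the ranges are not validated values):

```python
import logging

logging.basicConfig(level=logging.INFO)

# Hypothetical hard-constraint set for the constrained MDP.
SAFE_RANGES = {
    "NCCL_BUFFSIZE": (1 << 20, 1 << 25),  # bytes; illustrative bounds
    "NCCL_NTHREADS": (64, 512),           # illustrative bounds
}

def apply_action(param: str, requested: int, policy_id: str) -> int:
    """Clamp a policy's requested parameter change into its safe range,
    and log the decision for accountability and debugging."""
    lo, hi = SAFE_RANGES[param]
    applied = max(lo, min(requested, hi))  # constrained-MDP hard limit
    logging.info("policy=%s param=%s requested=%d applied=%d clipped=%s",
                 policy_id, param, requested, applied, applied != requested)
    return applied

# Example: an out-of-range request gets clipped and the override is logged.
apply_action("NCCL_NTHREADS", 1024, policy_id="dtree-v1")
```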