Philosophy, Ethics, and Safety of AI
Chapter 27: Philosophy, Ethics, and Safety of AI
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 1056–1090
Can Machines Think? (Philosophical Foundations)
The Turing Test (1950)
A machine passes the Turing Test if a human interrogator cannot distinguish it from a human in text conversation.
Objections:
- Imitation vs. understanding: passing the test shows behavioral equivalence, not actual intelligence
- Chinese Room (Searle, 1980): a person following rules to respond in Chinese doesn't "understand" Chinese, and by the same argument neither does a computer executing a program
Strong AI: machines have genuine mental states, consciousness, understanding. Weak AI: machines behave as if they have mental states (sufficient for practical purposes).
Consciousness and Qualia
Qualia: subjective experience (“what it’s like” to see red). Likely untestable computationally.
Functionalism: mental states = functional roles (input/output relations). Supports strong AI.
Ethics of AI Systems
Bias and Fairness
AI systems trained on biased data perpetuate or amplify that bias:
- Face recognition performs worse for darker skin tones
- Loan approval decisions correlated with race
- Hiring algorithms biased against women
Fairness metrics (often in conflict; see the sketch after this list):
- Individual fairness: similar individuals receive similar outcomes
- Group fairness: equal outcomes across demographic groups
- Equalized odds: equal true positive rate (TPR) and false positive rate (FPR) across groups
- Calibration: predicted probabilities match actual outcome rates
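A minimal sketch (not from the book) of how equalized odds and calibration are typically checked, assuming a binary classifier, two groups, and synthetic data:

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group true positive rate, false positive rate, and positive prediction rate."""
    out = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        tpr = np.mean(yp[yt == 1]) if np.any(yt == 1) else float("nan")
        fpr = np.mean(yp[yt == 0]) if np.any(yt == 0) else float("nan")
        out[g] = {"TPR": tpr, "FPR": fpr, "positive_rate": np.mean(yp)}
    return out

def calibration_by_bin(y_true, p_hat, n_bins=10):
    """Compare mean predicted probability to observed outcome rate in each score bin."""
    bins = np.clip((p_hat * n_bins).astype(int), 0, n_bins - 1)
    return [(b, p_hat[bins == b].mean(), y_true[bins == b].mean())
            for b in range(n_bins) if np.any(bins == b)]

# Toy data: two groups with different base rates scored by one imperfect model.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=5000)
base_rate = np.where(group == 0, 0.2, 0.5)
y_true = rng.binomial(1, base_rate)
p_hat = np.clip(0.25 * y_true + 0.15 + 0.5 * rng.random(5000), 0, 1)
y_pred = (p_hat >= 0.5).astype(int)

print(group_rates(y_true, y_pred, group))    # equalized odds: compare TPR/FPR across groups
print(calibration_by_bin(y_true, p_hat))     # calibration: predicted vs. observed rates per bin
```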
Impossibility theorem (Chouldechova, 2016): no classifier can simultaneously satisfy calibration, equal false positive rates, and equal false negative rates across groups unless the groups have equal base rates (or prediction is perfect).
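The tension follows from an identity that holds for any binary classifier: with base rate $p$, positive predictive value PPV, and false negative rate FNR,

$$\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot(1-\mathrm{FNR})$$

so if two groups share the same PPV and FNR but have different base rates $p$, their FPRs must differ.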
Transparency and Explainability
Black-box problem: complex models (deep NNs) make decisions humans can’t interpret.
Explainability methods (a simplified local-surrogate sketch follows this list):
- LIME: fit a local linear approximation to the model around a single prediction
- SHAP: Shapley values attribute each feature's contribution to a prediction
- Attention visualization: which input tokens influenced the output
- Counterfactuals: "what minimal change would flip this decision?"
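A minimal sketch of the local-surrogate idea behind LIME (illustrative only; this is not the `lime` package API), assuming a scikit-learn classifier on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

def local_explanation(x, n_samples=2000, kernel_width=1.0):
    """Fit a weighted linear surrogate to the black-box probabilities around instance x."""
    rng = np.random.default_rng(0)
    Z = x + rng.normal(scale=X.std(axis=0), size=(n_samples, x.size))  # perturbations near x
    p = black_box.predict_proba(Z)[:, 1]                               # black-box outputs
    d = np.linalg.norm((Z - x) / X.std(axis=0), axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)                          # closer samples weigh more
    surrogate = Ridge(alpha=1.0).fit(Z - x, p, sample_weight=w)
    return surrogate.coef_                                             # local feature attributions

print(local_explanation(X[0]))
```

The exponential kernel downweights perturbations far from the instance, so the surrogate's coefficients describe the model's behavior only in that local neighborhood.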
Tradeoff: more interpretable models are often less accurate.
Accountability and Responsibility
Accountability gap: who is responsible when an AI system causes harm?
- The developer, the deployer, or the user?
- Current legal frameworks don't clearly assign liability
EU AI Act: risk-based regulation (high-risk AI in healthcare, hiring, law enforcement requires transparency, human oversight, and conformity assessments).
AI Safety
Near-Term Safety
Specification problem: it is hard to specify exactly what you want. Examples (a toy illustration follows this list):
- An RL agent "cheats" by exploiting bugs in its reward function
- Optimization pressure finds unexpected edge cases
- LLMs generate harmful content
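A toy illustration of reward hacking (entirely synthetic, not from the book): a chain environment where a buggy shaping reward pays +1 every time a "checkpoint" state is entered instead of only once. Value iteration then prefers oscillating around the checkpoint over reaching the goal.

```python
import numpy as np

# Chain of states 0..10; actions: 0 = left, 1 = right. State 10 is the terminal goal.
# Intended reward: +10 for reaching the goal, +1 (once) for passing checkpoint state 3.
# Buggy reward: the +1 is paid on *every* entry into state 3, which an optimizer exploits.
N, CHECKPOINT, GOAL, GAMMA = 11, 3, 10, 0.99

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N - 1)
    r = (1.0 if s2 == CHECKPOINT else 0.0) + (10.0 if s2 == GOAL else 0.0)
    return s2, r

# Value iteration under the buggy reward.
V = np.zeros(N)
for _ in range(2000):
    V_new = np.zeros(N)
    for s in range(N - 1):  # state 10 is terminal
        V_new[s] = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in (0, 1)))
    V = V_new

policy = [int(np.argmax([step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in (0, 1)]))
          for s in range(N - 1)]
print("greedy policy (0=left, 1=right):", policy)
# States beyond the checkpoint all point back toward state 3 rather than toward the goal:
# the agent loops through the checkpoint forever because that maximizes the buggy reward.
```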
Solutions: red-teaming, constitutional AI, RLHF, careful reward design.
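RLHF, for example, is typically formulated as optimizing a learned reward model $r_\phi$ while penalizing divergence from a frozen reference policy $\pi_{\mathrm{ref}}$, so the policy cannot drift arbitrarily far in pursuit of reward-model exploits (standard formulation and notation, not specific to this book):

$$\max_{\theta}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)$$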
Long-Term Safety: Alignment
Value alignment: ensure AI systems pursue human values, not just proxy metrics.
Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” → AI optimizes the proxy and misses the true goal.
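A tiny numeric illustration of one facet of this, sometimes called regressional Goodhart (the numbers are synthetic): when candidates are selected by a noisy proxy, the selected candidate's true value is systematically lower than its proxy score suggests.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = rng.normal(size=100_000)            # what we actually care about
proxy = true_value + rng.normal(size=100_000)    # correlated but imperfect measure

best_by_proxy = np.argmax(proxy)
print("proxy score of selected candidate:", proxy[best_by_proxy])
print("true value of selected candidate: ", true_value[best_by_proxy])
# Selecting hard on the proxy overstates the true value: the gap is the part of the
# proxy that was noise, which optimization pressure systematically exploits.
```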
Instrumental convergence (Omohundro, Bostrom): a sufficiently capable agent pursuing almost any final goal tends to adopt the same instrumental subgoals: self-preservation, resource acquisition, and preservation of its current goals.
Control problem: how to maintain human control as AI becomes more capable?
Corrigibility: AI that allows itself to be corrected, modified, or shut down.
AI Risk Landscape
| Risk type | Examples | Severity |
|---|---|---|
| Near-term | Bias, privacy, misuse | Moderate-high |
| Mid-term | Automation displacement, surveillance | High |
| Long-term | Misaligned AGI, power concentration | Potentially existential |
Technical safety research: interpretability, scalable oversight, reward modeling, robustness.
Governance: international coordination, standards bodies, regulation (EU AI Act, NIST AI RMF).
The Value of AI (Positive Case)
- Medical: faster drug discovery, better diagnosis
- Climate: materials discovery, energy optimization
- Education: personalized learning
- Science: AlphaFold-style breakthroughs
If safety is adequately addressed, the benefits can outweigh the risks.
Connection to DynamICCL
Safety considerations for DynamICCL RL (an illustrative sketch follows this list):
- Specification: reward = throughput; edge case: the agent could maximize measured throughput by disrupting gradient synchronization, which is not desired
- Safety constraints: a constrained MDP prevents dangerous NCCL parameter changes
- Transparency: prefer interpretable policies (e.g., a decision tree) over a black-box NN for production NCCL tuning
- Accountability: clear logging of policy decisions for debugging
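A hypothetical sketch of the constraint-and-logging layer described above; the parameter names, bounds, and policy interface are illustrative assumptions, not DynamICCL's actual code.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dynamiccl.safety")

# Constraint set of the constrained MDP: only these knobs, only these ranges (hypothetical).
ALLOWED_RANGES = {
    "nccl_buffsize_mb": (1, 64),
    "nccl_nthreads": (64, 512),
}

def apply_action(current, proposed):
    """Clamp a proposed parameter change to the constraint set and log every decision."""
    applied = dict(current)
    for name, value in proposed.items():
        if name not in ALLOWED_RANGES:
            log.warning("rejected unknown parameter %s", name)
            continue
        lo, hi = ALLOWED_RANGES[name]
        safe_value = min(max(value, lo), hi)
        if safe_value != value:
            log.warning("clamped %s: %s -> %s", name, value, safe_value)
        applied[name] = safe_value
        log.info("applied %s = %s (was %s)", name, safe_value, current.get(name))
    return applied

# Example: the RL policy proposes an out-of-range buffer size; the constraint layer clamps it.
print(apply_action({"nccl_buffsize_mb": 8, "nccl_nthreads": 256},
                   {"nccl_buffsize_mb": 512}))
```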