Philosophy, Ethics, and Safety of AI
Chapter 27: Philosophy, Ethics, and Safety of AI
Book: Artificial Intelligence: A Modern Approach (Russell & Norvig, 4th ed.)
Pages: 1056–1090
Can Machines Think? (Philosophical Foundations)
The Turing Test (1950)
A machine passes the Turing Test if a human interrogator cannot distinguish it from a human in text conversation.
Objections:
- Imitation vs. understanding: passing the test shows behavioral equivalence, not actual intelligence
- Chinese Room (Searle, 1980): a person following rules to respond in Chinese doesn't "understand" Chinese, and by the same argument neither does a computer executing a program
Strong AI: machines have genuine mental states, consciousness, understanding. Weak AI: machines behave as if they have mental states (sufficient for practical purposes).
Consciousness and Qualia
Qualia: subjective experience (“what it’s like” to see red). Likely untestable computationally.
Functionalism: mental states = functional roles (input/output relations). Supports strong AI.
Ethics of AI Systems
Bias and Fairness
AI systems trained on biased data perpetuate or amplify that bias:
- Face recognition performs worse for darker skin tones
- Loan approval decisions correlated with race
- Hiring algorithms biased against women
Fairness metrics (often in conflict; see the sketch after this list):
- Individual fairness: similar individuals receive similar outcomes
- Group fairness: equal outcomes across demographic groups
- Equalized odds: equal true positive rate (TPR) and false positive rate (FPR) across groups
- Calibration: predicted probabilities match actual outcome rates
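A minimal sketch (not from the book) of how equalized odds and calibration are typically checked, assuming a binary classifier, two groups, and synthetic data:

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group true positive rate, false positive rate, and positive prediction rate."""
    out = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        tpr = np.mean(yp[yt == 1]) if np.any(yt == 1) else float("nan")
        fpr = np.mean(yp[yt == 0]) if np.any(yt == 0) else float("nan")
        out[g] = {"TPR": tpr, "FPR": fpr, "positive_rate": np.mean(yp)}
    return out

def calibration_by_bin(y_true, p_hat, n_bins=10):
    """Compare mean predicted probability to observed outcome rate in each score bin."""
    bins = np.clip((p_hat * n_bins).astype(int), 0, n_bins - 1)
    return [(b, p_hat[bins == b].mean(), y_true[bins == b].mean())
            for b in range(n_bins) if np.any(bins == b)]

# Toy data: two groups with different base rates scored by one imperfect model.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=5000)
base_rate = np.where(group == 0, 0.2, 0.5)
y_true = rng.binomial(1, base_rate)
p_hat = np.clip(0.25 * y_true + 0.15 + 0.5 * rng.random(5000), 0, 1)
y_pred = (p_hat >= 0.5).astype(int)

print(group_rates(y_true, y_pred, group))    # equalized odds: compare TPR/FPR across groups
print(calibration_by_bin(y_true, p_hat))     # calibration: predicted vs. observed rates per bin
```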
Impossibility theorem (Chouldechova, 2016): no classifier can simultaneously satisfy calibration, equal false positive rates, and equal false negative rates across groups unless the groups have equal base rates (or prediction is perfect).
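The tension follows from an identity that holds for any binary classifier: with base rate $p$, positive predictive value PPV, and false negative rate FNR,

$$\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot(1-\mathrm{FNR})$$

so if two groups share the same PPV and FNR but have different base rates $p$, their FPRs must differ.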
Transparency and Explainability
Black-box problem: complex models (deep NNs) make decisions humans can’t interpret.
Explainability methods (a simplified local-surrogate sketch follows this list):
- LIME: fit a local linear approximation to the model around a single prediction
- SHAP: Shapley values attribute each feature's contribution to a prediction
- Attention visualization: which input tokens influenced the output
- Counterfactuals: "what minimal change would flip this decision?"
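A minimal sketch of the local-surrogate idea behind LIME (illustrative only; this is not the `lime` package API), assuming a scikit-learn classifier on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)

def local_explanation(x, n_samples=2000, kernel_width=1.0):
    """Fit a weighted linear surrogate to the black-box probabilities around instance x."""
    rng = np.random.default_rng(0)
    Z = x + rng.normal(scale=X.std(axis=0), size=(n_samples, x.size))  # perturbations near x
    p = black_box.predict_proba(Z)[:, 1]                               # black-box outputs
    d = np.linalg.norm((Z - x) / X.std(axis=0), axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)                          # closer samples weigh more
    surrogate = Ridge(alpha=1.0).fit(Z - x, p, sample_weight=w)
    return surrogate.coef_                                             # local feature attributions

print(local_explanation(X[0]))
```

The exponential kernel downweights perturbations far from the instance, so the surrogate's coefficients describe the model's behavior only in that local neighborhood.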
Tradeoff: more interpretable models are often less accurate.
Accountability and Responsibility
Accountability gap: who is responsible when an AI system causes harm?
- The developer, the deployer, or the user?
- Current legal frameworks don't clearly assign liability
EU AI Act: risk-based regulation (high-risk AI in healthcare, hiring, law enforcement requires transparency, human oversight, and conformity assessments).
AI Safety
Near-Term Safety
Specification problem: it is hard to specify exactly what you want. Examples (a toy illustration follows this list):
- An RL agent "cheats" by exploiting bugs in its reward function
- Optimization pressure finds unexpected edge cases
- LLMs generate harmful content
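A toy illustration of reward hacking (entirely synthetic, not from the book): a chain environment where a buggy shaping reward pays +1 every time a "checkpoint" state is entered instead of only once. Value iteration then prefers oscillating around the checkpoint over reaching the goal.

```python
import numpy as np

# Chain of states 0..10; actions: 0 = left, 1 = right. State 10 is the terminal goal.
# Intended reward: +10 for reaching the goal, +1 (once) for passing checkpoint state 3.
# Buggy reward: the +1 is paid on *every* entry into state 3, which an optimizer exploits.
N, CHECKPOINT, GOAL, GAMMA = 11, 3, 10, 0.99

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N - 1)
    r = (1.0 if s2 == CHECKPOINT else 0.0) + (10.0 if s2 == GOAL else 0.0)
    return s2, r

# Value iteration under the buggy reward.
V = np.zeros(N)
for _ in range(2000):
    V_new = np.zeros(N)
    for s in range(N - 1):  # state 10 is terminal
        V_new[s] = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in (0, 1)))
    V = V_new

policy = [int(np.argmax([step(s, a)[1] + GAMMA * V[step(s, a)[0]] for a in (0, 1)]))
          for s in range(N - 1)]
print("greedy policy (0=left, 1=right):", policy)
# States beyond the checkpoint all point back toward state 3 rather than toward the goal:
# the agent loops through the checkpoint forever because that maximizes the buggy reward.
```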
Solutions: red-teaming, constitutional AI, RLHF, careful reward design.
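RLHF, for example, is typically formulated as optimizing a learned reward model $r_\phi$ while penalizing divergence from a frozen reference policy $\pi_{\mathrm{ref}}$, so the policy cannot drift arbitrarily far in pursuit of reward-model exploits (standard formulation and notation, not specific to this book):

$$\max_{\theta}\; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big] \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)$$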
Long-Term Safety: Alignment
Value alignment: ensure AI systems pursue human values, not just proxy metrics.
Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.” → AI optimizes the proxy and misses the true goal.
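A tiny numeric illustration of one facet of this, sometimes called regressional Goodhart (the numbers are synthetic): when candidates are selected by a noisy proxy, the selected candidate's true value is systematically lower than its proxy score suggests.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = rng.normal(size=100_000)            # what we actually care about
proxy = true_value + rng.normal(size=100_000)    # correlated but imperfect measure

best_by_proxy = np.argmax(proxy)
print("proxy score of selected candidate:", proxy[best_by_proxy])
print("true value of selected candidate: ", true_value[best_by_proxy])
# Selecting hard on the proxy overstates the true value: the gap is the part of the
# proxy that was noise, which optimization pressure systematically exploits.
```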
Instrumental convergence (Omohundro, Bostrom): a sufficiently capable agent pursuing almost any final goal tends to adopt the same instrumental subgoals: self-preservation, resource acquisition, and preservation of its current goals.
Control problem: how to maintain human control as AI becomes more capable?
Corrigibility: AI that allows itself to be corrected, modified, or shut down.
AI Risk Landscape
| Risk type | Examples | Severity |
|---|---|---|
| Near-term | Bias, privacy, misuse | Moderate-high |
| Mid-term | Automation displacement, surveillance | High |
| Long-term | Misaligned AGI, power concentration | Potentially existential |
Technical safety research: interpretability, scalable oversight, reward modeling, robustness.
Governance: international coordination, standards bodies, regulation (EU AI Act, NIST AI RMF).
The Value of AI (Positive Case)
- Medical: faster drug discovery, better diagnosis
- Climate: materials discovery, energy optimization
- Education: personalized learning
- Science: AlphaFold-style breakthroughs
If safety is adequately addressed, the benefits can outweigh the risks.
Connection to DynamICCL
Safety considerations for DynamICCL RL (an illustrative sketch follows this list):
- Specification: reward = throughput; edge case: the agent could maximize measured throughput by disrupting gradient synchronization, which is not desired
- Safety constraints: a constrained MDP prevents dangerous NCCL parameter changes
- Transparency: prefer interpretable policies (e.g., a decision tree) over a black-box NN for production NCCL tuning
- Accountability: clear logging of policy decisions for debugging
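A hypothetical sketch of the constraint-and-logging layer described above; the parameter names, bounds, and policy interface are illustrative assumptions, not DynamICCL's actual code.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dynamiccl.safety")

# Constraint set of the constrained MDP: only these knobs, only these ranges (hypothetical).
ALLOWED_RANGES = {
    "nccl_buffsize_mb": (1, 64),
    "nccl_nthreads": (64, 512),
}

def apply_action(current, proposed):
    """Clamp a proposed parameter change to the constraint set and log every decision."""
    applied = dict(current)
    for name, value in proposed.items():
        if name not in ALLOWED_RANGES:
            log.warning("rejected unknown parameter %s", name)
            continue
        lo, hi = ALLOWED_RANGES[name]
        safe_value = min(max(value, lo), hi)
        if safe_value != value:
            log.warning("clamped %s: %s -> %s", name, value, safe_value)
        applied[name] = safe_value
        log.info("applied %s = %s (was %s)", name, safe_value, current.get(name))
    return applied

# Example: the RL policy proposes an out-of-range buffer size; the constraint layer clamps it.
print(apply_action({"nccl_buffsize_mb": 8, "nccl_nthreads": 256},
                   {"nccl_buffsize_mb": 512}))
```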