Phase 1 | Individual readings and ideas

JJ March 23, 2022

1. Goal Misalignment

1.1 Risks from Learned Optimization: Introduction

1.2 Gradient Hacking

1.3 Specification Gaming

1.4 Avoiding Reward Hacking: Concrete Problems in AI Safety (pages 7-11)
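The readings above all concern agents optimising a proxy objective that diverges from the designer's intent. A minimal hypothetical sketch (the author's illustration, not taken from any of the papers): an agent that greedily maximises a misspecified checkpoint reward loops forever instead of finishing the race, so the true objective is never achieved.

```python
# Toy specification-gaming example (hypothetical; not from the readings).
# The proxy reward pays for any checkpoint visit, even a repeated one,
# so a greedy proxy-maximiser never takes the "finish" action.

def proxy_reward(action):
    # Misspecified reward: every checkpoint visit pays, including repeats.
    return 1 if action == "loop_checkpoint" else 0

def true_reward(trajectory):
    # What the designer actually wanted: finish the course.
    return 10 if "finish" in trajectory else 0

actions = ["loop_checkpoint", "finish"]

# Greedy proxy-maximising policy: at each step, pick the action with
# the highest proxy reward.
trajectory = [max(actions, key=proxy_reward) for _ in range(5)]

print(trajectory)               # the agent loops on the checkpoint every step
print(true_reward(trajectory))  # 0 -- the true goal is never achieved
```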

2. Robustness

Safe Reinforcement Learning

2.1 Safety Gym: Imposing safety constraints during training

2.2 A Comprehensive Survey on Safe Reinforcement Learning

2.3 Learning human objectives by evaluating hypothetical behaviours

Adversarial Examples

2.4 Testing Robustness Against Unforeseen Adversaries

2.5 Natural Adversarial Examples

2.6 Adversarial Policies: Attacking Deep Reinforcement Learning
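The core idea behind the adversarial-examples readings can be shown on a toy linear classifier (a hedged sketch, not any paper's method): perturb the input by a small step against the sign of the gradient, FGSM-style, and the predicted label flips even though the input barely changes.

```python
# Toy FGSM-style adversarial perturbation on a linear classifier
# (illustrative only; real attacks target deep networks).

def sign(x):
    return (x > 0) - (x < 0)

w = [2.0, -1.0]   # linear classifier: score = w . x, predict class 1 if score > 0
x = [0.3, 0.4]    # clean input: score = 0.2 -> class 1
eps = 0.15        # small perturbation budget

score = sum(wi * xi for wi, xi in zip(w, x))

# The gradient of the score w.r.t. the input is just w;
# step against its sign to push the score down.
x_adv = [xi - eps * sign(wi) for xi, wi in zip(x, w)]
adv_score = sum(wi * xi for wi, xi in zip(w, x_adv))

print(score, adv_score)  # 0.2 -> -0.25: the label flips under a tiny change
```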


2.7 Corrigibility

2.8 Towards a Mechanistic Understanding of Corrigibility

3. Learning from Humans

3.1 Learning from Human Preferences

3.2 Cooperative Inverse Reinforcement Learning

3.3 Learning to Summarise from Human Feedback

3.4 Aligning AI with Shared Human Values
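A common thread in this section is inferring a reward function from human judgements rather than specifying it by hand. A minimal sketch of that idea (the author's toy illustration, not the papers' code): fit reward weights from pairwise preferences under a Bradley-Terry model, so the preferred trajectory in each pair gets the higher predicted reward.

```python
# Toy preference-based reward learning (illustrative sketch).
# A hidden "human" prefers trajectories with a larger feature 0;
# we recover a reward function from pairwise comparisons alone.
import math
import random

random.seed(0)

def human_prefers(a, b):
    # Stand-in for a human comparing two trajectory feature vectors.
    return a[0] > b[0]

def reward(w, feats):
    return sum(wi * fi for wi, fi in zip(w, feats))

# Pairwise comparison data: (winner, loser) feature vectors.
pairs = []
for _ in range(200):
    a = [random.random(), random.random()]
    b = [random.random(), random.random()]
    pairs.append((a, b) if human_prefers(a, b) else (b, a))

# Bradley-Terry log-likelihood, maximised by plain gradient ascent:
# P(win > lose) = sigmoid(reward(win) - reward(lose)).
w = [0.0, 0.0]
lr = 0.5
for _ in range(100):
    grad = [0.0, 0.0]
    for win, lose in pairs:
        p = 1 / (1 + math.exp(-(reward(w, win) - reward(w, lose))))
        for i in range(2):
            grad[i] += (1 - p) * (win[i] - lose[i])
    w = [wi + lr * gi / len(pairs) for wi, gi in zip(w, grad)]

print(w)  # the weight on feature 0 dominates: the preference was recovered
```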

4. Mechanistic Interpretability

4.1 Interpretability (An Overview)

  • Discovering Features and Circuits

4.1.1 An Introduction to Circuits

4.1.2 Curve Detectors

4.1.3 Curve Circuits

4.1.4 High-Low Frequency Detectors

4.1.5 Visualising Weights

  • Scaling Circuits to Larger Models

4.1.6 Equivariance 

4.1.7 Branch Specialisation

4.1.8 Clusterability in Neural Networks

4.2 Feature Visualisation
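Feature visualisation, at its simplest, is gradient ascent on the input to maximise one unit's activation. A hedged toy version (a linear unit instead of a convolutional network, purely to show the mechanic):

```python
# Toy feature visualisation (illustrative sketch, not the papers' method):
# gradient ascent on the *input* to maximise one fixed unit's activation.

def activation(x, w):
    return sum(wi * xi for wi, xi in zip(w, x))

w = [1.0, -2.0, 0.5]   # weights of the unit we want to "visualise"
x = [0.0, 0.0, 0.0]    # start from a blank input
lr = 0.1
for _ in range(50):
    # For a linear unit, the gradient w.r.t. the input is just w.
    x = [xi + lr * wi for xi, wi in zip(x, w)]
    # Keep the input bounded, as real feature visualisation regularises images.
    x = [max(-1.0, min(1.0, xi)) for xi in x]

print(x)  # the input saturates toward sign(w): the pattern this unit responds to
```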

5. Inner Alignment

5.1 Relaxed adversarial training for inner alignment

6. Outer Alignment and Decomposing Tasks

6.1 Outer alignment and imitative amplification

6.2 Supervising strong learners by amplifying weak experts

6.3 Recursively summarising books with human feedback
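The decomposition pattern running through this section can be sketched in a few lines (a hypothetical toy, echoing the recursive book-summarisation setup; the "summariser" here is a trivial stand-in, not a learned model): summarise chunks, then summarise the summaries, until one summary remains.

```python
# Toy recursive task decomposition (illustrative sketch).

def summarise_leaf(text):
    # Stand-in for a learned summariser: keep only the first sentence.
    first = text.split(". ")[0]
    return first if first.endswith(".") else first + "."

def recursive_summarise(chunks, fan_in=2):
    # Summarise each chunk, then repeatedly merge and re-summarise
    # groups of `fan_in` summaries until a single summary remains.
    summaries = [summarise_leaf(c) for c in chunks]
    while len(summaries) > 1:
        merged = [" ".join(summaries[i:i + fan_in])
                  for i in range(0, len(summaries), fan_in)]
        summaries = [summarise_leaf(m) for m in merged]
    return summaries[0]

print(recursive_summarise(["A b. C d.", "E f. G h.", "I j. K l."]))
```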

7. Iterated Amplification

7.1 Iterated amplification sequence on the alignment forum

  • This is a collection of Alignment Forum posts on Iterated Amplification, curated by Paul Christiano. Please feel free to peruse the list and choose the readings that seem most interesting.

7.2 Learning Complex Goals with Iterated Amplification

7.3 Recursive Reward Modelling: Scalable agent alignment via reward modelling 

Learning Normativity: An Alternative Approach to Iterated Amplification

7.4 Learning Normativity: A Research Agenda and Recursive Quantilizers II
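The quantilizer idea behind the last reading admits a very small sketch (the author's illustration, under the simplifying assumption of a finite action set): instead of taking the argmax action, which invites Goodharting on a misspecified utility, sample from the top q-fraction of actions under a trusted base distribution.

```python
# Toy quantilizer (illustrative sketch of the concept).
import random

random.seed(1)

def quantilize(actions, utility, q=0.1):
    # Rank candidate actions by the (possibly misspecified) utility,
    # then sample uniformly from the top q-fraction instead of taking
    # the single utility-maximising action.
    ranked = sorted(actions, key=utility, reverse=True)
    top = ranked[:max(1, int(len(ranked) * q))]
    return random.choice(top)

pick = quantilize(list(range(100)), utility=lambda a: a, q=0.1)
print(pick)  # some action from the top 10, not necessarily the argmax
```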

8. Forecasting

8.1 Forecasting transformative AI: the “biological anchors” method in a nutshell