Phase 1 | Individual readings and ideas

JJ March 23, 2022

1. Goal Misalignment

1.1 Risks from Learned Optimization: Introduction

1.2 Gradient Hacking

1.3 Specification Gaming

1.4 Avoiding Reward Hacking: Concrete Problems in AI Safety (pages 7-11)
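The readings above all concern agents optimising a proxy objective that diverges from the designer's intent. A minimal hypothetical sketch (the author's illustration, not taken from any of the papers): an agent that greedily maximises a misspecified checkpoint reward loops forever instead of finishing the race, so the true objective is never achieved.

```python
# Toy specification-gaming example (hypothetical; not from the readings).
# The proxy reward pays for any checkpoint visit, even a repeated one,
# so a greedy proxy-maximiser never takes the "finish" action.

def proxy_reward(action):
    # Misspecified reward: every checkpoint visit pays, including repeats.
    return 1 if action == "loop_checkpoint" else 0

def true_reward(trajectory):
    # What the designer actually wanted: finish the course.
    return 10 if "finish" in trajectory else 0

actions = ["loop_checkpoint", "finish"]

# Greedy proxy-maximising policy: at each step, pick the action with
# the highest proxy reward.
trajectory = [max(actions, key=proxy_reward) for _ in range(5)]

print(trajectory)               # the agent loops on the checkpoint every step
print(true_reward(trajectory))  # 0 -- the true goal is never achieved
```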

2. Robustness

Safe Reinforcement Learning

2.1 Safety Gym: Imposing safety constraints during training

2.2 A Comprehensive Survey on Safe Reinforcement Learning

2.3 Learning human objectives by evaluating hypothetical behaviours

Adversarial Examples

2.4 Testing Robustness Against Unforeseen Adversaries

2.5 Natural Adversarial Examples

2.6 Adversarial Policies: Attacking Deep Reinforcement Learning
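The core idea behind the adversarial-examples readings can be shown on a toy linear classifier (a hedged sketch, not any paper's method): perturb the input by a small step against the sign of the gradient, FGSM-style, and the predicted label flips even though the input barely changes.

```python
# Toy FGSM-style adversarial perturbation on a linear classifier
# (illustrative only; real attacks target deep networks).

def sign(x):
    return (x > 0) - (x < 0)

w = [2.0, -1.0]   # linear classifier: score = w . x, predict class 1 if score > 0
x = [0.3, 0.4]    # clean input: score = 0.2 -> class 1
eps = 0.15        # small perturbation budget

score = sum(wi * xi for wi, xi in zip(w, x))

# The gradient of the score w.r.t. the input is just w;
# step against its sign to push the score down.
x_adv = [xi - eps * sign(wi) for xi, wi in zip(x, w)]
adv_score = sum(wi * xi for wi, xi in zip(w, x_adv))

print(score, adv_score)  # 0.2 -> -0.25: the label flips under a tiny change
```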


2.7 Corrigibility

2.8 Towards a Mechanistic Understanding of Corrigibility

3. Learning from Humans

3.1 Learning from Human Preferences

3.2 Cooperative Inverse Reinforcement Learning

3.3 Learning to Summarise from Human Feedback

3.4 Aligning AI with Shared Human Values
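A common thread in this section is inferring a reward function from human judgements rather than specifying it by hand. A minimal sketch of that idea (the author's toy illustration, not the papers' code): fit reward weights from pairwise preferences under a Bradley-Terry model, so the preferred trajectory in each pair gets the higher predicted reward.

```python
# Toy preference-based reward learning (illustrative sketch).
# A hidden "human" prefers trajectories with a larger feature 0;
# we recover a reward function from pairwise comparisons alone.
import math
import random

random.seed(0)

def human_prefers(a, b):
    # Stand-in for a human comparing two trajectory feature vectors.
    return a[0] > b[0]

def reward(w, feats):
    return sum(wi * fi for wi, fi in zip(w, feats))

# Pairwise comparison data: (winner, loser) feature vectors.
pairs = []
for _ in range(200):
    a = [random.random(), random.random()]
    b = [random.random(), random.random()]
    pairs.append((a, b) if human_prefers(a, b) else (b, a))

# Bradley-Terry log-likelihood, maximised by plain gradient ascent:
# P(win > lose) = sigmoid(reward(win) - reward(lose)).
w = [0.0, 0.0]
lr = 0.5
for _ in range(100):
    grad = [0.0, 0.0]
    for win, lose in pairs:
        p = 1 / (1 + math.exp(-(reward(w, win) - reward(w, lose))))
        for i in range(2):
            grad[i] += (1 - p) * (win[i] - lose[i])
    w = [wi + lr * gi / len(pairs) for wi, gi in zip(w, grad)]

print(w)  # the weight on feature 0 dominates: the preference was recovered
```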

4. Mechanistic Interpretability

4.1 Interpretability (An Overview)

  • Discovering Features and Circuits

4.1.1 An Introduction to Circuits

4.1.2 Curve Detectors

4.1.3 Curve Circuits

4.1.4 High-Low Frequency Detectors

4.1.5 Visualising Weights

  • Scaling Circuits to Larger Models

4.1.6 Equivariance 

4.1.7 Branch Specialisation

4.1.8 Clusterability in Neural Networks

4.2 Feature Visualisation
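Feature visualisation, at its simplest, is gradient ascent on the input to maximise one unit's activation. A hedged toy version (a linear unit instead of a convolutional network, purely to show the mechanic):

```python
# Toy feature visualisation (illustrative sketch, not the papers' method):
# gradient ascent on the *input* to maximise one fixed unit's activation.

def activation(x, w):
    return sum(wi * xi for wi, xi in zip(w, x))

w = [1.0, -2.0, 0.5]   # weights of the unit we want to "visualise"
x = [0.0, 0.0, 0.0]    # start from a blank input
lr = 0.1
for _ in range(50):
    # For a linear unit, the gradient w.r.t. the input is just w.
    x = [xi + lr * wi for xi, wi in zip(x, w)]
    # Keep the input bounded, as real feature visualisation regularises images.
    x = [max(-1.0, min(1.0, xi)) for xi in x]

print(x)  # the input saturates toward sign(w): the pattern this unit responds to
```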

5. Inner Alignment

5.1 Relaxed adversarial training for inner alignment

6. Outer Alignment and Decomposing Tasks

6.1 Outer alignment and imitative amplification

6.2 Supervising strong learners by amplifying weak experts

6.3 Recursively summarising books with human feedback
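The decomposition pattern running through this section can be sketched in a few lines (a hypothetical toy, echoing the recursive book-summarisation setup; the "summariser" here is a trivial stand-in, not a learned model): summarise chunks, then summarise the summaries, until one summary remains.

```python
# Toy recursive task decomposition (illustrative sketch).

def summarise_leaf(text):
    # Stand-in for a learned summariser: keep only the first sentence.
    first = text.split(". ")[0]
    return first if first.endswith(".") else first + "."

def recursive_summarise(chunks, fan_in=2):
    # Summarise each chunk, then repeatedly merge and re-summarise
    # groups of `fan_in` summaries until a single summary remains.
    summaries = [summarise_leaf(c) for c in chunks]
    while len(summaries) > 1:
        merged = [" ".join(summaries[i:i + fan_in])
                  for i in range(0, len(summaries), fan_in)]
        summaries = [summarise_leaf(m) for m in merged]
    return summaries[0]

print(recursive_summarise(["A b. C d.", "E f. G h.", "I j. K l."]))
```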

7. Iterated Amplification

7.1 Iterated amplification sequence on the alignment forum

  • This is a collection of Alignment Forum posts on Iterated Amplification, curated by Paul Christiano. Please feel free to peruse the list and choose the readings that seem most interesting.

7.2 Learning Complex Goals with Iterated Amplification

7.3 Recursive Reward Modelling: Scalable agent alignment via reward modelling 

Learning Normativity: An Alternative Approach to Iterated Amplification

7.4 Learning Normativity: A Research Agenda and Recursive Quantilizers II
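The quantilizer idea behind the last reading admits a very small sketch (the author's illustration, under the simplifying assumption of a finite action set): instead of taking the argmax action, which invites Goodharting on a misspecified utility, sample from the top q-fraction of actions under a trusted base distribution.

```python
# Toy quantilizer (illustrative sketch of the concept).
import random

random.seed(1)

def quantilize(actions, utility, q=0.1):
    # Rank candidate actions by the (possibly misspecified) utility,
    # then sample uniformly from the top q-fraction instead of taking
    # the single utility-maximising action.
    ranked = sorted(actions, key=utility, reverse=True)
    top = ranked[:max(1, int(len(ranked) * q))]
    return random.choice(top)

pick = quantilize(list(range(100)), utility=lambda a: a, q=0.1)
print(pick)  # some action from the top 10, not necessarily the argmax
```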

8. Forecasting

8.1 Forecasting transformative AI: the “biological anchors” method in a nutshell