I read this book after hearing Brian Christian interviewed on the Ezra Klein podcast – and it is especially relevant in the ChatGPT moment. The “alignment problem” of the title refers to the alignment between “machine learning and human values” – what happens when models amplify bias or when rewards incentivize the wrong behavior. Each of the main chapters focuses on the way a specific piece of technology forces us to rethink our assumptions about some philosophical or moral issue: trying to train a machine to recognize patterns of criminal behavior, or even to play video games, can easily reveal the tacit assumptions and hidden complexity in our commonsense reasoning and behavior. The book is nuanced and brimming with research – it restlessly turns problems over again and again to show them from different angles.

I was immediately drawn into the first chapter – “Representation” – on the application of neural nets to word embeddings in large textual data sets.

“Bias in machine-learning systems is often a direct result of the data on which the systems are trained–making it incredibly important to understand who is represented in those datasets, and to what degree, before using them to train systems that will affect real people. But what do you do if your dataset is as inclusive as possible–say, something approximating the entirety of written English, some hundred billion words–and it’s the world itself that’s biased?” (33-34)

The chapter goes on to document the challenge of debiasing models when the bias exists in the training data but we don’t want to see it come back out the other side. Language has the power both to reflect culture and, to some degree, to shape it. Machine learning models can do even more because of the scale at which they operate – they are amplifiers of undercurrents.

One of the techniques described in the chapter revolves around the concept of “word embeddings,” a way of representing language that rests on a simple assumption: “Words will tend to be found near words that are ‘similar’ to themselves” (36). This lets the machine build a model that represents the associations between words as vectors. Many of the examples illustrate how patterns of association encode patterns of bias. Models are more likely to recall pairs like “man/doctor” and “woman/nurse” because those associations are part of the training data. To combat these biased associations, humans have had to go back to the data and try to untangle the problematic patterns that pervade the written record.
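To make the vector idea concrete, here is a minimal, self-contained sketch – not from the book, and using tiny four-dimensional vectors I invented purely for illustration – of how a biased association like “man/doctor” vs. “woman/nurse” shows up in the geometry of embeddings, and how one family of debiasing methods (in the spirit of work the chapter discusses) projects a “gender direction” out of occupation words:

```python
import numpy as np

# Toy 4-dimensional "embeddings" -- invented for illustration only.
# Real embeddings (e.g. word2vec trained on ~100 billion words) have
# hundreds of dimensions learned from co-occurrence patterns.
vectors = {
    "man":    np.array([ 1.0, 0.2, 0.1, 0.0]),
    "woman":  np.array([-1.0, 0.2, 0.1, 0.0]),
    "doctor": np.array([ 0.6, 0.9, 0.3, 0.1]),
    "nurse":  np.array([-0.6, 0.9, 0.3, 0.1]),
}

def cosine(a, b):
    """Cosine similarity: how closely two word vectors point the same way."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The classic analogy arithmetic: doctor - man + woman lands near nurse,
# i.e. the gendered pattern in the data is baked into the geometry.
analogy = vectors["doctor"] - vectors["man"] + vectors["woman"]
print("doctor - man + woman vs. nurse:", round(cosine(analogy, vectors["nurse"]), 3))

# One proposed remedy: identify a "bias direction" from word pairs
# and remove that component from occupation words.
gender_direction = vectors["man"] - vectors["woman"]
gender_direction = gender_direction / np.linalg.norm(gender_direction)

def debias(v, direction):
    """Subtract the component of v that lies along the bias direction."""
    return v - (v @ direction) * direction

for word in ("doctor", "nurse"):
    d = debias(vectors[word], gender_direction)
    print(word, "vs. man / woman after debiasing:",
          round(cosine(d, vectors["man"]), 3),
          round(cosine(d, vectors["woman"]), 3))
```

After the projection, “doctor” and “nurse” sit at equal distance from “man” and “woman” – which is exactly the kind of intervention the chapter examines, along with the harder question of whether erasing the pattern from the geometry actually erases it from the system’s behavior.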

“We must take great care not to ignore the things that are not easily quantified or do not admit themselves into our models. The danger, paraphrasing Hannah Arendt, is not so much that our models are false but that they might become true.” (326)

The point Christian makes highlights several risks. One is that the world is constantly changing. A machine learning model is like a snapshot of one aspect of the universe at one particular time in one particular place. It is a far more comprehensive snapshot than one can generate from one’s own personal experience, but a model is still limited and might fail to represent the world in its next mutation. A language model “trained on 2016 data was slowly bleeding out its accuracy as the world moved on” (325). The implication is that we need to be careful to contextualize the predictions that come from machine learning models – they are vast and amazing but also quite limited in their representations.

We are in danger of losing control of the world not to AI or to machines but to models. To formal, often numerical specifications for what exists and for what we want.

Disagreements about those two categories – “what exists” and “what we want” – have long been the domain of philosophy, psychology, theology, politics, economics, and other explanatory systems in which the moral implications of how we answer these questions were often explicitly part of the explanation being put forth. What is the world, and how do we create structures within it that contribute to human thriving? These are the basic human questions, and to the extent that machine learning can help us answer them, we need to be sure that our values shape the way we use our tools, rather than the other way around.

But it’s difficult to reconcile the fact that humans are biased with the claim that humans can be a check on the bias in artificial intelligence. If models capture human bias, what tools do we have to revise them, considering our own imperfect capacity for reason? We’re caught between massive amplifications of our own mistakes and our own propensity for making yet more mistakes.

To me, AI doesn’t seem all that bad when used with skepticism and critical thinking. Unfortunately, we can’t even seem to operate the internet with the appropriate level of skepticism and critical thinking – opinions diverge on just where we should direct each of those capacities.