At this stage, the AI is “aligned” with AI safety principles, typically by being fed instructions on how to refuse or deflect requests. Safety is an elastic concept. At the top of the safety hierarchy, alignment is supposed to ensure that AI will not give out dangerously false information or develop what in a human we’d call harmful intentions (the robots-destroying-humanity scenario). Next is keeping it from giving out information that could immediately be put to harmful use—how to kill yourself, how to make meth. Beyond that, though, the notion of AI safety includes the much squishier goal of avoiding toxicity. “Whenever you’re trying to train the model to be safer, you add filters, you add classifiers, and then you’re reducing unsafe usage,” Jan Leike, a co-head of alignment at OpenAI, told me earlier this year, before Altman’s ouster. “But you’re also potentially refusing some use cases that are totally legitimate.”
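To make the trade-off Leike describes concrete: a safety classifier is a gate that sits between the user's request and the model's answer, refusing anything it flags. The sketch below is purely illustrative, assuming a toy keyword check in place of a trained classifier and hypothetical names (`UNSAFE_TOPICS`, `generate_reply`); it is not how OpenAI or any particular lab implements its filters.

```python
# Illustrative sketch only: a "safety filter" between a user request and a
# model's reply. Real systems use trained classifiers, not keyword lists;
# the names here (UNSAFE_TOPICS, generate_reply) are hypothetical stand-ins.

UNSAFE_TOPICS = {"make meth", "methods of self-harm"}

def classify_unsafe(prompt: str) -> bool:
    """Crude stand-in for a trained safety classifier."""
    text = prompt.lower()
    return any(topic in text for topic in UNSAFE_TOPICS)

def generate_reply(prompt: str) -> str:
    """Placeholder for the underlying language model."""
    return f"(model answer to: {prompt})"

def answer(prompt: str) -> str:
    # The filter runs before the model's knowledge is consulted, which is
    # why legitimate requests can be refused along with harmful ones.
    if classify_unsafe(prompt):
        return "I can't help with that."
    return generate_reply(prompt)

if __name__ == "__main__":
    print(answer("How do I season a cast-iron pan?"))
    print(answer("Tell me how to make meth."))
```

A blunter filter refuses more, both the harmful and the legitimate; that over-refusal is the cost Leike is pointing to.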
This trade-off is sometimes called an “alignment tax.” The power of generative AI is that it combines humanlike abilities to interpret texts or carry on a discussion with a very un-humanlike reservoir of knowledge. Alignment partly overrides this, replacing some of what the model has learned with a narrower set of answers. “A stronger alignment reduces the cognitive ability of the model,” says Eric Hartford, a former senior engineer at Microsoft, Amazon, and eBay who has created influential training techniques for uncensored models. In his view, ChatGPT “has been getting less creative and less intelligent over time,” even as the technology undeniably improves.
Read more:
Gimein, M. (2023, November 24). AI’s Spicy-Mayo Problem. The Atlantic. https://www.theatlantic.com/ideas/archive/2023/11/ai-safety-regulations-uncensored-models/676076/