As easy as one-two-three? No. Easy as one-three. OpenAI has skipped o2 and called their new ‘frontier’ model o3 (Altman hinted that this may have been to avoid a clash with the telco O2). I like the idea that AI is undermining everything we were told makes good marketing - no real branding, no messaging, just get it out there with benchmarks.
Transcending human intelligence
Not sure we've realised that we now have AGI. ARC-AGI has been solved at 87.7% (the human threshold is 85%), along with other benchmarks... we will, of course, move the bar higher, find ways to avoid recognising the achievement... then pass it again... The ARC Prize is a not-for-profit with an AGI benchmark whose tasks demand novel skills, so the AI cannot simply memorise the answers.
To give you some idea of the speed and scale of progress:
Software engineering
On real-world software engineering tasks, o3 scores 71%, more than 20 percentage points better than the o1 models. AI has already established itself as a solid, useful and widely used coding tool. People have been letting AI write code, then iterating on the result. This takes it to another level. One wonders how long the job of coder will survive at this rate.
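To make that 'let the AI code, then iterate' workflow concrete, here is a minimal sketch of the loop. The generate_patch function is a stand-in for whichever model API you use - it is an assumption for illustration, not a real library call; the rest simply runs the project's tests and feeds failures back to the model.

# Minimal sketch of the generate-then-iterate coding loop described above.
# generate_patch() is a placeholder for a call to your model provider; it is
# an assumption for illustration, not a real API.
import subprocess

def generate_patch(task: str, feedback: str) -> str:
    """Ask the model for code, given the task and the last round of test output."""
    raise NotImplementedError("wire this up to your model provider")

def solve(task: str, max_rounds: int = 5) -> str | None:
    feedback = ""
    for _ in range(max_rounds):
        code = generate_patch(task, feedback)
        with open("candidate.py", "w") as f:
            f.write(code)
        # Run the test suite; its output becomes the feedback for the next attempt.
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return code  # tests pass, accept the AI-written code
        feedback = result.stdout + result.stderr
    return None  # no passing solution within the round limit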
Maths and science
o3 is also superb at maths and PhD-level science questions, scoring 88.7%; a typical PhD gets around 70%. FrontierMath is the toughest mathematical dataset – extremely hard problems – and o3 reaches 25% accuracy, which is a strong result.
GPQA Diamond
An interesting measure of how far we have come is GPQA Diamond. It's a clever test that compares novices with Google search, human domain experts and the model itself. Experts get 81% right in their fields; highly skilled non-experts with 30 minutes per question and Google access get 22%. GPT-4 got 37% at the start of 2024, o1 got 78%, and o3 gets 87.7%. That is astonishing.
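A quick way to see the gap is to line those figures up against the expert baseline. The numbers below are simply the ones quoted above; the snippet prints each score and its distance from the experts.

# GPQA Diamond figures quoted above, compared against the domain-expert baseline.
scores = {
    "Skilled non-experts (Google, 30 min/question)": 22,
    "GPT-4 (early 2024)": 37,
    "o1": 78,
    "Domain experts": 81,
    "o3": 87.7,
}
expert_baseline = scores["Domain experts"]
for name, pct in scores.items():
    print(f"{name:48s} {pct:5.1f}%  ({pct - expert_baseline:+.1f} vs experts)")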
AGI
Artificial General Intelligence suffers from a problem of definition, as do most abstractions at this level. Does it mean:
- Specific human reasoning skills (maths, science etc)
- Average human competences
- Greater than all of humanity
There is no single singularity; there is a spectrum or constellation of possible targets here. What is clear is that these targets are being hit, not in one go, but one by one, sometimes in clusters. Maths and science are easy to measure but also fiendishly difficult to achieve, so this is a real milestone. Yet they focus on clear rationality. To be fair, critics were telling us that AI would never get here, never mind get here so fast. We should celebrate this, as many of the problems we face with climate, energy, healthcare and education may well be solved - not in the sense of final solutions, but better solutions.
Problem solving in real life is messier and more of a challenge. That's why the agentic move is so interesting, as it tackles this set of human capabilities. We have brains that evolved in a specific environment where we had to solve specific problems. This is where the dynamic, interrogative, dialogic nature of AI helps enormously. It has already made great strides in this direction.
Embodied AI, in the physical world of elevators, cars, cabs, trucks, drones, ships and submersibles, has also taken great leaps. This is another set of targets being hit one by one. One could argue that neurological targets are also on our hotlist - Neuralink is a good example, where technology compensates for our neurological deficits.
AI may help us understand neuroscience and the brain, solve engineering problems to accelerate fusion, help with drug discovery and many other intractable problems. What we can be sure of is the increased impact of AI on productivity. The leaps in efficacy prove this.
Conclusion
This is a warning to those who claim that scaling is over. Sutskever was right – there is still a way to go, and other techniques are clearly delivering the goods. It is clear that AI is delivering faster than expected. The consequences of AGI are closer than expected, with huge productivity gains on the horizon. That is the subject of the book I am currently writing.
Future issues
One issue needs discussion - compute costs. I think this will be solved. We saw a 250x decline in token costs in 20 months, so the $20 cost of a hard problem is likely to come down to cents.
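As a back-of-envelope check on that claim, assume the observed 250x fall over 20 months continues at the same pace and apply it to a $20-per-problem price today. The figures are only the ones quoted above; the extrapolation is obviously speculative.

# Back-of-envelope extrapolation of the cost argument above: a 250x fall in
# token costs over 20 months, applied to a $20 price per hard problem today.
decline_factor = 250
months_observed = 20
monthly_rate = decline_factor ** (1 / months_observed)  # ~1.32x cheaper per month

price_today = 20.00  # dollars per hard problem
for months_ahead in (6, 12, 20):
    projected = price_today / (monthly_rate ** months_ahead)
    print(f"In {months_ahead:2d} months: ~${projected:.2f} per problem")
# At the same pace, 20 months out the $20 problem costs about 8 cents.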
One can now start to ask how the cost of compute compares to the cost of humans doing similar tasks and roles in organisations. The productivity game looks as though it will start with coding, where much of the work can be automated.
On problems to solve:
- fusion
- medical research
- personalised tutors
- optimising political policies
- next-generation batteries
- cheap renewables
A final thought on AI being self-generative: at what point does this technology start working on itself to solve the frontier problems and advance even more quickly?