Pelayo Arbués

Recent Notes

Why Software Engineers Should Learn a Bit of Data Science
Apr 01, 2025
A recommender beast
Feb 05, 2025
The next generation of weak learners
Jan 28, 2025

See 89 more →

❯

Literature Notes

❯

❯

So AI Won't Scale, Now What?

So AI Won't Scale, Now What?

Apr 16, 202513 min read

articles
literature-note

Metadata

Author: James Ravenscroft
Full Title: So AI Won’t Scale, Now What?
URL: https://brainsteam.co.uk/2024/11/12/so-ai-wont-scale-what-next/

Highlights

Rumours that the famed Transformer architecture that powers GPT, Claude and Llama, won’t keep getting better with more compute (the scaling law) have been around for a while. Even recently, AI companies and pundits have been denying that this is the case, hyping upcoming model releases and raising obscene amounts of money. However, over the last few days leaks and rumours have emerged suggesting that the big silicon valley AI companies are now waking up to the reality that the scaling law might have stopped working. (View Highlight)
Update: Overnight Ilya Sutskever himself has added his voice to the choir of experts claiming that the scaling laws have hit a wall. This is a significant inflection point since Sutskever is responsible for leading much of OpenAI’s work on GPT-3 and GPT-4 before leaving to start his own research company earlier this year. He is big on AGI and superintelligence so this admission carries a lot of weight.* (View Highlight)
To those unfamiliar with AI and NLP research, ChatGPT might appear to have been an overnight sensation from left field. However, science isn’t a “big bang” that happens in a vaccum. Sudden-seeming breakthroughs are usually the result of many years of incremental research reaching an inflection point. GPT-4o and friends would not be possible without the invention of The Transformer 5 years prior and a number of smaller advancements leading up to the launch of ChatGPT in 2022. (View Highlight)
(View Highlight)
The Transformer is a deep learning neural network algorithm that was invented by Googlers Vaswani et al in 2017. They work by loosely modelling how humans pay attention to contextual clues about which words are more or less important whilst reading. Between 2017 and 2022 lots of incremental improvements were made to the transformer including the ability to generate text based on an input “prompt” ( Radford et al. 2018, Lewis et al 2019 and Raffel et al 2020. ). In 2022, the formulation of Reinforcement Learning Human Feedback (RLHF) by Ouyang et al. at OpenAI made the process of prompting models significantly more intuitive for non-technical users and made user-facing LLM systems like ChatGPT feasible. Of course all of this work can be traced back to earlier breakthroughs in deep learning by people like Hinton, LeCunn and Bengio. (View Highlight)
This is all to say that it’s very difficult to predict which incremental advancement will lead to a tipping point that investors think is saleable. If it turns out that we have hit the limits of transformers, we may not see a major breakthrough in AI performance for a long time, leading to a long AI winter. Or, we might find that some new post-transformer architecture emerges from an unknown lab tomorrow and eats OpenAI’s breakfast overnight. (View Highlight)
Current generation frontier models by companies like Google, OpenAI, Anthropic and Mistral are very impressive at some tasks. Extracting information from audio streams or images of hand-written notes, helping software engineers write code more quickly. The kinds of tasks they’re great at are low-risk high-effort grunt work and tasks that can be automatically quality assured (for example generated application code can be unit tested). Even with a non-zero error rate models may be better at dull tasks than a bored human who doesn’t want to be doing them in the first place. (View Highlight)
But, let’s face it: the huge amount of money being thrown at LLMs is not to solve the boring tasks but because investors think that it’s possible to significantly reduce costs and “automate the majority of work”. As impressive as current frontier models are, everyone who works with them knows that they are not up to the task of completely automating a large amount of knowledge work (unless you use questionable methods to have the model rate it’s own abilities of course). We’ve seen tonnes of examples of high profile, embarrassing LLM fails in chatbots, search results and even McDonalds recently canned their AI-powered drive through trial. (View Highlight)
LLMs are also really good at natural language tasks. That is, tasks that NLP researchers (like myself) have historically found interesting because they unlock other use cases, but which are probably not interesting to end users and consumers. For example, LLMs are great at classifying and categorising text, at extracting names of companies; people and places (NER), at determining whether two sentences imply the same thing or not (NLI) and many others. These use cases can be combined with traditional software engineering to build platforms that do things like make podcasts searchable or make it easy for researchers to find relevant scientific papers. However, selling models to the kinds of people building these systems is a business-to-business play rather than a business-to-consumer play and (as I explain below) it’s a pretty commoditised space to play in too. (View Highlight)
Most office workers and knowledge workers are now familiar with LLMs and are using and enjoying them on a regular basis and finding them genuinely useful. However, the general population is a little more wary. A UK Government Survey conducted earlier this year found that only 44% of adults surveyed agreed that AI will benefit society and 20% said disagreed with this statement. Interviewees cited concerns about personal data usage, fake news and cyber crime. (View Highlight)
Beyond the politics and the drama, when it comes to getting hands on with AI, the picture is also mixed. Some AI products like Cursor, a tool for writing software using an AI assistant have launched to great fanfare and been very well received. Some AI products like the infamous humane pin struggle to find product-market fit and flop. AI features in established products have seen lukewarm receptions. After a short experimentation window, Discord shut down their AI chatbot at the end of 2023. More recently, it turned out that Microsoft’s Office 365 Copilot Pro subscription is so unpopular that MS are giving up trying to sell it as a standalone product and bundling it into their standard Office subscription along with a mandatory price bump. Additionally, even though ChatGPT reportedly has over 200 million weekly active users, only 11 million (approximately 5%) of them are paying for Pro. (View Highlight)
Frontier models are incredibly expensive to run. It is estimated that OpenAI currently loses $2 f ore v ery$ 1 they make. Microsoft are thought to be losing $20 per user per month on Github Copilot. There is a race-to-the-bottom for LLM API pricing to encourage businesses to build their businesses using a particular provider’s model. Frontier models have become commoditised but they’re not even profitable yet. Companies like Meta, Microsoft and Mistral have been providing small-but-mighty models that can perform a large number of AI tasks locally for free which significantly undercuts frontier model providers, forcing them to respond in kind by releasing lower cost, less powerful versions of their flagship models. Local models can be fine tuned on specific tasks and for some use cases, they can work just as well as frontier models and eat into AI providers’ higher-end customer base too. (View Highlight)
There’s a human cost to these models. Many AI products have been enabled by outsourcing the development of datasets and moderation of incoming requests to workers on poverty wages. These workers are regularly reviewing upsetting and offensive material and receiving limited or no support. Earlier this year it was revealed that Amazon’s “just walk out” AI shop was powered by offshore workers paid £1-2 per hour to watch CCTV footage. (View Highlight)
The miracle that AI companies are praying for would be a sudden breakthrough stemming from experimental “post transformer” model architectures. . Approaches like Mamba or RWKV or one of their descendants could lead to the continuation of scaling laws which would mean “business as usual” for AI firms. (View Highlight)
On the other hand, if a new model architecture doesn’t emerge quickly, AI companies will need to focus on cutting costs and driving up prices of their offerings to make themselves profitable. If they can’t continue to generate record breaking valuations and speculative investment, their record-breaking losses will surely mean that their days are numbered. (View Highlight)
One possible avenue for exploration is quantisation and model miniaturisation. Through projects like llama.cpp we’ve already seen that it’s possible to run models with many billions of parameters (the ’neurons’ in the neural network) on a laptop or even a mobile phone. This is possible through the compression of the model parameters to take up significantly less memory, throwing out some information but preserving enough that the model still works a bit like how JPEG image files work. Recently researchers at Microsoft have started to explore whether it’s possible to represent neural network parameters in 1 bit, that is, the smallest possible unit of information. Other researchers found that it might be possible to just throw out a large number of parameters all together. Assuming the performance losses are acceptable or minimal, quantizing flagship models could potentially allow AI companies to serve their APIs at a much lower cost and use less electricity and water cooling. I suspect all of the big providers are already following this particular research thread very closely. (View Highlight)
Shrinking models using techniques like quantisation and miniaturisation so that they can run on less powerful could be one strategy for cutting costs. Photo by Mathieu Stern on Unsplash (View Highlight)
Recent work by Akyürek et al 2024. has shown that state of the art performance can be obtained using small models. Their approach, Test-Time-Tuning, involves slightly tweaking the model’s parameters, normally frozen after training, at runtime. It’s similar to giving examples in your prompt but instead of giving the model something additional to read, we give its brain cells a quick and targeted jolt to focus it on the task at hand. It’s still an emerging technique and we’ve got a lot to learn about how it could be applied in practice and at scale. However, it could potentially accelerate the miniaturisation of LLMs quite significantly. (View Highlight)
A complementary approach would be to move to dedicated hardware. GPUs are typically used for graphics processing (that’s where the G comes from) and although there’s a fair amount of mathematical crossover with deep learning, they are not specifically optimised for AI workloads. Some companies like groq are already producing custom chips that are optimised for AI model inference at increased efficiency. (View Highlight)
It’s possible (although perhaps less likely) that AI becomes much more de-centralised. New computers and phones now ship with AI-specific chips and both Apple and Microsoft have made headlines recently with their (somewhat different) approaches to local AI. Perhaps OpenAI will end up licensing a miniaturised version of GPT-4 to hardware vendors who will ship it as part of your operating system. That would certainly cut (or perhaps externalise) running costs both fiscal and environmental. Historically vendors have claimed that they’re keeping these models away from end users for safety reasons. On the other hand, assuming no breakthrough, end users already have access to tools like StableDiffusion and Llama 3 and they can already generate fake news and fake images of politicians. Running models locally also makes it difficult for AI companies to capture data for further training. (View Highlight)
If Trump’s tariffs come to pass then the cost of running models either in US data centres or on US consumers’ devices goes up. For AI companies this incentivises them to either offshore inference capabilities outside of the US (since services are not subject to tariffs ) or pursue the decentralised “run on your device” model explored above. Of course, end consumers will also have to deal with a price rise if they want that new phone or laptop which could put AI-capable devices, and thus, AI companies’ models, out of reach of more US customers. (View Highlight)
It’s also possible that dedicated AI companies like OpenAI, Anthropic and Mistral will go the way of the dodo, their offerings shuttered and their IP auctioned off to the FAANGs of this world. Make no mistake, LLMs will not go anywhere. The companies that buy up the IP could find ways to run LLMs at a low cost and have them run on your device (like Apple are starting to do and Microsoft are experimenting with). In this scenario, the consumer-facing LLMs just become a value-add feature for whatever the company is selling: their devices, their operating system or their social media platform. (View Highlight)
When GPT-4 first launched and the assumption was that scaling laws would hold and GPT-5 would be exponentially better, it was a commonly held belief that OpenAI’s dominance of the LLM space was unassailable. However, another interesting side effect of “no scaling law and no big breakthrough” is that it calls into question the unassailable position of big AI companies. We’re already seeing “super mini” LLMs with 8 billion parameters that can compete with GPT-4 at specific tasks, in some cases across many different benchmarks. If ‘Open Source’ models can operate in the same ballpark as GPT-4 then there’s less of a reason for B2B customers to stick with OpenAI. (View Highlight)
All of this said, running LLMs at scale across clusters of GPUs is not a trivial activity that a 4-person startup is going to want to try to take on. I envisage further commoditisation of the LLM API world with more services like groq and fireworks offering cheap-as-chips LLM inference using “good enough” models that OpenAI can’t profitably compete with. Firms that are big enough might decide to build an internal LLMOps team and manage this stuff themselves as they scale up. (View Highlight)
While the scaling law held, big tech leaders have asked for record-breaking investment in AI to help us unlock super intelligent AI and use it to solve societal problems, climate goals be damned. However, in lieu of an all-powerful GPT-5, its a little easier for investors and the general public to take these claims with a pinch of salt. Against the backdrop of user-hostile increases to [subscription fees](https://www.forbes.com/sites/tonifitzgerald/2024/08/14/rising-streaming-subscription-prices-how-high-is-too-high-for-netflix-disney-and-more/, new limitations and more ads across the tech industry, AI companies will have to tread carefully. They will have to be quicker to demonstrate the value of what they’re building right now and dial back the futuristic rhetoric. They will have to show their distrustful customer-base that they are committed to climate goals. They will need to take practical steps towards more transparent and equitable data collection. They’ll have to make sure that offshore workers are paid a fair wage and that they’re not ruining anyone’s mental health by having them look at the worst of humanity all day. (View Highlight)
Although the train might be leaving the AI hype station for now, LLMs are here to stay and researchers will continue to make incremental progress towards their next goals and towards AGI. We might see another AI winter while we wait for the next big breakthrough to happen but we will also see some genuinely useful new software emerge from the 2022-2024 boom. (View Highlight)

Graph View

Metadata
Highlights

Now Reading

![CDATA[Not Boring by Packy McCormick]]>
Apr 16, 2025

See 1293 more →

Created with Quartz, © 2025

Bluesky
Linkedin
Mastodon
Twitter
Unsplash
GitHub
RSS