GPT4’s level of performance has been replicated within multiple organizations. GPT3’s level of performance has been reproduced by many. GPT2-level models can be trained by almost anyone (probably on the order of $1k and a few hours). The early idea that models themselves could be moats has been so resoundingly defeated that people no longer expect language model providers to have any moats.
In this era of massive investment into AI, the land grab phase, we’ll see the most providers trying to enter the space. It has turned out that many of these providers have reached largely similar tiers of capabilities, which may not always be the case as scaling costs get even higher (and future technical challenges emerge). Market share distributions will be defined by who can create sticky user behavior. To set the stage, I’m bringing up this image I Tweeted last week. It shows that since Gemini 1.5 Pro on February 15th, we’ve seen four GPT4-class models: Gemini, Claude 3 (GPT4 Turbo tier, actually), Mistral Large, and Inflection 2.5.
I expected to have Llama 3 by now, but it seems the scope of what they’re doing is constantly changing. I have heard credible rumors that it is done training and credible rumors that multiple teams are trying different things and the best one wins. So, who knows when we’ll get this model?
These broadly available GPT4-class models will very rapidly create a high-entropy situation in the next few months — prices will fluctuate, companies will fight for usage, and narratives will shift fast. This fluctuation is organizations searching for a competitive advantage and a moat.
The companies that have users interacting with their models consistently have moats through data and habits. The models themselves are not a moat, as I discussed at the end of last year when I tried to predict machine learning moats, but there are things in the modern large language model (LLM) space that open-source will really struggle to replicate. Concretely, that difference is access to quality, diverse training prompts for fine-tuning. While I want open-source to win out for personal philosophical and financial reasons, this obviously is not a walk in the park for the open-source community. It’ll be a siege of a castle with, you guessed it, a moat. We’ll see if the moat holds.
Having the best cheap model could be another way to create a moat. While the companies training openly available, permissive-ish models don’t normally have the economies of scale to drive down inference costs, Google, Anthropic (a borderline inclusion on capital assets), and OpenAI do, and will probably use a cheap model as a loss leader. We need more data on the paid versus free tier usage of the various applications, and especially conversion numbers. We likely won’t get this data, so don’t buy into the PR narratives you hear too much.
The model I’m expecting to test this hypothesis is Claude 3 Sonnet. Can they get many users to unsubscribe for a better free model? Unlikely. Can they bring more people in from OpenAI’s free tier? Probably. Both Anthropic and OpenAI probably have no chance of winning an inference price competition versus Google, though.
In some ways, Claude 3’s timing seems like their last shot to get a consumer footprint (with GPT4.5-Turbo around the corner). Google’s recent “ship it” attitude and OpenAI’s larger consumer footprint will be extremely hard to beat. I don’t even expect the Claude 3 step to matter much for paid users. Disruption theory on the internet has long been driven by the need for a dramatically better experience or price, not just marginal gains.
The biggest moat killer for LLMs is having all levels of the stack available in the open. Model weights are just one piece of this, with user data and infrastructure also playing a huge role.
The commoditization of GPT4-class models is a phenomenal indicator of the progression of the open LLM ecosystem. There are countless individuals who can easily pay the price it takes to create a model like Claude 3 and release it to the world. Depending on your accounting, the ballpark price will probably be on the order of $100 million (yes, I know you can make many different arguments).
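To make “many different arguments” concrete, here is one back-of-envelope accounting, a sketch assuming the standard ~6ND FLOPs rule for dense transformers, H100-class GPUs at 989 TFLOPS peak BF16, and my own guesses for utilization and rental price (none of these are any lab’s disclosed numbers):

```python
def pretraining_cost_usd(params, tokens, peak_flops=989e12,
                         mfu=0.4, usd_per_gpu_hour=2.5):
    """Rough compute-only cost of a dense-transformer pretraining run.

    Assumptions (mine, for illustration): the ~6*N*D FLOPs rule of thumb,
    H100-class peak BF16 throughput of 989 TFLOPS, 40% model FLOPs
    utilization, and $2.50 per GPU-hour rental pricing.
    """
    total_flops = 6 * params * tokens
    gpu_hours = total_flops / (peak_flops * mfu) / 3600
    return gpu_hours * usd_per_gpu_hour

# GPT-2 scale (1.5B params, ~30B tokens): a few hundred dollars of compute.
print(f"GPT-2 class: ${pretraining_cost_usd(1.5e9, 30e9):,.0f}")
# A hypothetical GPT-4-class run (500B params, 10T tokens): tens of millions.
print(f"GPT-4 class: ${pretraining_cost_usd(500e9, 10e12):,.0f}")
```

The second number lands around $50 million for the final training run alone; add failed experiments, data, and staff, and you are quickly in the $100 million ballpark. The first number is also a sanity check on the $1k GPT2 figure above.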
The other method to solve this resource problem is via the coordination of open actors. Most of the issues in pretraining development come down to engineering time rather than true trade secrets. Small organizations training models are constantly guessing at which feature will most improve their next model — doing all of them is how you get GPT4-class base models.
In short, the open capabilities of RLHF and all methods of preference fine-tuning severely lag their closed counterparts. Direct Preference Optimization (DPO), the algorithm providing boosts to usability, AlpacaEval, and MT Bench for open chat models, isn’t a solution; it’s a starting point (as I’ve written about extensively). We have a lot of work to do to truly have local chat agents at GPT4 quality. Primarily, I would recommend digging into data and continuing to expand evaluation tools centered around these methods. We’re just pushing the first few rotations of the flywheel, and efforts on data and evaluation tend to compound the most, unlike methods judged with vibes-based evaluation. This is epitomized by the fact that most fine-tuning data is derivative of GPT4 or a related model like Claude. OpenAI’s John Schulman already gave a very good talk on why training on distilled data can limit the performance of a model.
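For those who haven’t implemented it, the DPO objective is simple enough to fit in a few lines, which is part of why it spread so fast. A minimal sketch in PyTorch (the function and argument names are mine, and this omits the per-token log-prob bookkeeping a real trainer needs):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probs of a response given
    its prompt, under either the policy being trained or the frozen
    reference model. beta scales the implicit KL penalty.
    """
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Increase the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

# Toy usage: random log-probs for a batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```

The algorithm itself is almost trivial; all the leverage is in the preference data you feed it, which is exactly the data problem this section is about.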
We need what Soumith called the “sinkhole” on his recent appearance on the Latent Space Podcast (which covers more actual ground than the Lex and Yann interview everyone has been asking me to comment on). The sinkhole is where all the prompts and feedback labels from all of the open providers would go. HuggingChat, Ollama, local llama, and anyone hosting open models needs to do the hard work of getting permissions to transfer and share data.
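As a sketch of what a sinkhole contribution might look like, here is a hypothetical record format (this is my illustration, not anything Soumith or these providers have specified; the consent and license fields are the hard part described above):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferenceRecord:
    """Hypothetical pooled-feedback record an open provider could contribute."""
    prompt: str                     # user prompt as sent to the model
    chosen: str                     # response the user preferred or accepted
    rejected: Optional[str] = None  # alternative response shown, if any
    model_id: str = ""              # open model name and version that responded
    provider: str = ""              # e.g., "HuggingChat" or "Ollama"
    license: str = ""               # data-sharing terms attached to the record
    user_consented: bool = False    # explicit permission to redistribute
```

The engineering here is easy; getting `user_consented` to be true at scale, across many independent hosts, is the coordination problem.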
The total usage of open models at moderately sized providers is probably less than the likes of OpenAI with 100 million consumer users, but it is almost surely enough to provide sufficient training signal. Coordination in the open is part of the problem that open-source has definitionally always fought — open-source systems have many different stakeholders and therefore many opinions on what to do or not do with the data. I don’t see the data coordination problem being meaningfully solved with open models anytime soon — there will be a meaningful gap in preference alignment between open and closed models.
The people who are ready to have a short-term impact on the development of LLMs are the best LLM sommeliers. People who don’t use models don’t speak the same language. There are countless executives and prominent people denying the usefulness of these models. It’s fine to ignore these people on the grounds of personal experience — LLMs feel like they’re getting more useful on a quarterly basis. It’s not that LLMs are unlocking new things they can do; it’s that the things we’ve been trying to do with these models have gotten more reliable.
There are very few reasons to argue for the saturation of abilities of LLMs other than the ratio of cost to performance gain. As long as the performance gain comes, the marginal cost of tech businesses will almost surely make it worth it. Many people pay for multiple $20-per-month subscriptions for AI, so if one is much better, getting them to pay $50 per month seems totally reasonable. This calculation does not include the potential for transformative new applications and use cases, which are definitionally hard to predict.
I don’t even feel like I use the models enough to get a sense of how powerful they are, and I use them at a high level almost daily. There’s a lot of alpha to unlock with deep familiarity — I know multiple people who regularly have a jailbreak that works to unlock special behaviors. Supposedly, it’s still not so hard to get the model to think it needs to do something it was trained not to do in order to help you.
With the probably coming Grok release, it’s wild how many mid-sized (roughly 7 to 70 billion parameters) open models there are. I can’t help but think how easy it would be for the leading closed companies to crush these models if they wanted to. Sometimes the open models look like playthings relative to API models. 6T tokens of training data for Google is totally small beans. It’s a bifurcated market, with the press around it designed to make it confusing by overloading the terminology of state of the art.
I’m interested in the Chinese open models as a source of randomness. I’m not an expert, but it seems likely their incentives are not to play the exact same game as the US and European technology companies (obviously true in some ways given the chip bans). This, political risk aside, seems like a very likely net benefit for the progression of open LLMs.