The model isn’t the product, the system around it is
For teams that aren’t building models, the rapid pace of innovation is a boon as they migrate from one SOTA model to the next, chasing gains in context size, reasoning capability, and price-to-value to build better and better products. This progress is as exciting as it is predictable. Taken together, this means models are likely to be the least durable component in the system.
Instead, focus your efforts on what’s going to provide lasting value, such as:
Evals: To reliably measure performance on your task across models (a minimal harness is sketched below)
Guardrails: To prevent undesired outputs no matter the model
Caching: To reduce latency and cost by avoiding the model altogether
Data flywheel: To power the iterative improvement of everything above
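To make the first of these concrete, here is a minimal sketch of a model-agnostic eval harness: score every candidate model against the same cases behind one interface, so migrating to next month's SOTA model is a one-line change. The `call_model` callable and the example cases are illustrative placeholders, not a prescribed API.

```python
from typing import Callable

EvalCase = tuple[str, str]  # (input prompt, expected substring in output)

CASES: list[EvalCase] = [
    ("Extract the invoice number from: 'Invoice #4821, due May 1'", "4821"),
    ("Classify the sentiment of: 'The checkout flow is broken again.'", "negative"),
]

def run_evals(call_model: Callable[[str], str]) -> float:
    """Return the fraction of cases the model passes."""
    passed = sum(
        expected.lower() in call_model(prompt).lower()
        for prompt, expected in CASES
    )
    return passed / len(CASES)

# Usage: compare candidates before migrating.
# score_new = run_evals(lambda p: call_provider(p, model="new-sota-model"))
```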
These components create a thicker moat of product quality than raw model capabilities.
But that doesn’t mean building at the application layer is risk-free. Don’t point your shears at the same yaks that OpenAI or other model providers will need to shave if they want to provide viable enterprise software.
For example, some teams invested in building custom tooling to validate structured output from proprietary models; minimal investment here is important, but a deep one is not a good use of time. OpenAI needs to ensure that when you ask for a function call, you get a valid function call—because all of their customers want this. Employ some “strategic procrastination” here, build what you absolutely need, and await the obvious expansions to capabilities from providers.
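As a sketch of what "minimal investment" might look like, the following validates a model's JSON output against a schema and retries once with the error message, rather than building a deep bespoke framework that providers will soon obsolete. It assumes pydantic and a hypothetical `call_model` function:

```python
import json
from pydantic import BaseModel, ValidationError

class FunctionCall(BaseModel):
    name: str
    arguments: dict

def get_validated_call(call_model, prompt: str, retries: int = 1) -> FunctionCall:
    """Parse and validate the model's output; retry with the error on failure."""
    for _ in range(retries + 1):
        raw = call_model(prompt)
        try:
            return FunctionCall(**json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as err:
            # Feed the error back so the model can self-correct on the retry.
            prompt += f"\n\nYour last reply was invalid ({err}). Return only valid JSON."
    raise ValueError("model never produced a valid function call")
```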
Building a product that tries to be everything to everyone is a recipe for mediocrity. To create compelling products, companies need to specialize in building sticky experiences that keep users coming back.
Consider a generic RAG system that aims to answer any question a user might ask. The lack of specialization means that the system can’t prioritize recent information, parse domain-specific formats, or understand the nuances of specific tasks. As a result, users are left with a shallow, unreliable experience that doesn’t meet their needs, leading to churn.
To address this, focus on specific domains and use cases. Narrow the scope by going deep rather than wide. This will create domain-specific tools that resonate with users. Specialization also allows you to be upfront about your system’s capabilities and limitations. Being transparent about what your system can and cannot do demonstrates self-awareness, helps users understand where it can add the most value, and thus builds trust and confidence in the output.
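As one example of going deep rather than wide, "prioritizing recent information" in retrieval can be as simple as blending an exponential recency decay into each document's similarity score. A sketch, assuming you already have cosine similarities and timestamps; the 30-day half-life is an illustrative assumption to tune per domain:

```python
import math
from datetime import datetime, timezone

HALF_LIFE_DAYS = 30  # illustrative: tune to how quickly your domain goes stale

def score(similarity: float, published: datetime) -> float:
    """Downweight documents by age: half the weight every HALF_LIFE_DAYS."""
    age_days = (datetime.now(timezone.utc) - published).days
    return similarity * math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)
```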
DevOps is not fundamentally about reproducible workflows or shifting left or empowering two-pizza teams—and it’s definitely not about writing YAML files.
DevOps is about shortening the feedback cycles between work and its outcomes so that improvements accumulate instead of errors. Its roots go back, via the Lean Startup movement, to Lean Manufacturing and the Toyota Production System, with its emphasis on Single Minute Exchange of Die and Kaizen.
MLOps has adapted the form of DevOps to ML. We have reproducible experiments and we have all-in-one suites that empower model builders to ship. And Lordy, do we have YAML files.
But as an industry, MLOps didn’t adopt the function of DevOps. It didn’t shorten the feedback gap between models and their inferences and interactions in production.
Hearteningly, the field of LLMOps has shifted away from thinking about hobgoblins of little minds like prompt management and towards the hard problems that block iteration: production monitoring and continual improvement, linked by evaluation.
Already, we have interactive arenas for neutral, crowd-sourced evaluation of chat and coding models – an outer loop of collective, iterative improvement. Tools like LangSmith, Log10, LangFuse, W&B Weave, HoneyHive, and more promise to not only collect and collate data about system outcomes in production, but also to leverage them to improve those systems by integrating deeply with development. Embrace these tools or build your own.
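If you do build your own, the essential function is modest: persist every production interaction together with whatever outcome signal you can capture, so evaluation and iteration have raw material. A minimal sketch, with an illustrative schema rather than any particular tool's:

```python
import json, time, uuid

def log_interaction(prompt: str, response: str, model: str,
                    feedback: str | None = None, path: str = "traces.jsonl"):
    """Append one production trace; feedback might be a thumbs rating or a user edit."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "model": model,
        "prompt": prompt,
        "response": response,
        "feedback": feedback,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```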
Consider a few misguided ventures that waste your team’s time:
Building custom text-to-SQL capabilities for your business.
Building a chatbot to talk to your documentation.
Integrating your company’s knowledge base with your customer support chatbot.
While the above are the “hello world” of LLM applications, none of them make sense for a product company to build itself. These are general problems for many businesses with a large gap between promising demo and dependable component—the customary domain of software companies. Investing valuable R&D resources in general problems being tackled en masse by the current Y Combinator batch is a waste.
If this sounds like trite business advice, it’s because in the frothy excitement of the current hype wave, it’s easy to mistake anything “LLM” as cutting-edge, accretive differentiation, missing which applications are already old hat.
If you have worked in ML for a long time, you may jump to the idea of “human-in-the-loop”, but not so fast: HITL machine learning is a paradigm built on human experts ensuring that ML models behave as predicted. While related, here we are proposing something more subtle. LLM-driven systems should not be the primary drivers of most workflows today; they should merely be a resource.
Centering humans and asking how an LLM can support their workflow leads to significantly different product and design decisions. Ultimately, it will drive you to build different products than competitors who try to rapidly offshore all responsibility to LLMs; better, more useful, and less risky products.
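In code, the difference is small but consequential: the LLM proposes, the human disposes. A sketch of the "LLM as a resource" pattern for a support workflow, with a hypothetical `call_model` helper:

```python
def handle_ticket(ticket: str, call_model) -> str:
    """The LLM drafts; the human reviews, edits, and sends. Never autonomous."""
    draft = call_model(f"Draft a reply to this support ticket:\n{ticket}")
    print(f"Suggested reply:\n{draft}")
    edited = input("Press Enter to accept, or type a revised reply: ")
    return edited or draft
```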
The previous sections have delivered a firehose of techniques and advice. It’s a lot to take in. Let’s consider the minimum useful set of advice: if a team wants to build an LLM product, where should they begin?
Over the past year, we’ve seen enough to be confident that successful LLM applications follow a consistent trajectory. We walk through this basic “getting started” playbook in this section. The core idea is to start simple and only add complexity as needed. A decent rule of thumb is that each level of sophistication typically requires at least an order of magnitude more effort than the one before it. With this in mind…
Prompt engineering comes first
Start with prompt engineering. Use all the techniques we discussed in the tactics section earlier. Chain-of-thought, n-shot examples, and structured input and output are almost always a good idea. Prototype with the most capable models before trying to squeeze performance out of weaker models.
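To make those techniques concrete, here is a sketch of a single prompt combining n-shot examples, a chain-of-thought nudge, and structured output; the extraction task and examples are purely illustrative:

```python
def build_prompt(email: str) -> str:
    """n-shot examples + chain-of-thought nudge + structured output, in one prompt."""
    return (
        "You extract meeting details from emails.\n"
        "Think step by step, then give your final answer as JSON:\n"
        '{"date": ..., "attendees": [...]}\n\n'
        'Input: "Let\'s sync Tuesday at 3pm - Alice, Bob"\n'
        'Output: {"date": "Tuesday 3pm", "attendees": ["Alice", "Bob"]}\n\n'
        'Input: "Standup moved to Friday, just the platform team"\n'
        'Output: {"date": "Friday", "attendees": ["platform team"]}\n\n'
        f'Input: "{email}"\n'
        "Output:"
    )
```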
Only if prompt engineering cannot achieve the desired level of performance should you consider finetuning. This will come up more often if there are non-functional requirements (e.g., data privacy, complete control, cost) that block the use of proprietary models and thus require you to self-host. Just make sure those same privacy requirements don’t block you from using user data for finetuning!
In 1971, the researchers at Xerox PARC predicted the future: the world of networked personal computers that we are now living in. They helped birth that future by playing pivotal roles in the invention of the technologies that made it possible, from Ethernet and graphics rendering to the mouse and the window.
But they also engaged in a simple exercise: they looked at applications that were very useful (e.g. video displays) but were not yet economical (i.e. enough RAM to drive a video display was many thousands of dollars). Then they looked at historic price trends for that technology (a la Moore’s Law) and predicted when those technologies would become economical.
We can do the same for LLM technologies, even though we don’t have something quite as clean as transistors per dollar to work with. Take a popular, long-standing benchmark, like the Massive Multitask Language Understanding (MMLU) dataset, and a consistent input approach (five-shot prompting). Then, compare the cost to run language models with various performance levels on this benchmark over time.
Figure. For a fixed cost, capabilities are rapidly increasing. For a fixed capability level, costs are rapidly decreasing. Created by co-author Charles Frye using public data on May 13, 2024.
In the four years since the launch of OpenAI’s davinci model as an API, the cost of running a model with equivalent performance on that task at the scale of one million tokens (about one hundred copies of this document) has dropped from $20 to less than 10¢ – a halving time of just six months. Similarly, the cost to run Meta’s LLaMA 3 8B, via an API provider or on your own, is just 20¢ per million tokens as of May 2024, and it has similar performance to OpenAI’s text-davinci-003, the model that enabled ChatGPT. That model also cost about $20 per million tokens when it was released in late November 2022. That’s two orders of magnitude in just 18 months – the same timeframe in which Moore’s Law predicts a mere doubling.
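That halving time is straightforward to sanity-check: a 200x price drop over roughly four years implies the price halves about every six months.

```python
# Napkin check: $20 -> $0.10 per million tokens over ~4 years (48 months).
import math

halvings = math.log2(20 / 0.10)  # a 200x drop is ~7.6 halvings
print(48 / halvings)             # ~6.3 months per halving
```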
Meanwhile, when Pac-Man was released in 1980, $1 of today’s money would buy you a credit, good to play for a few minutes or tens of minutes – call it six games per hour, or $6 per hour. This napkin math suggests that a compelling LLM-enhanced gaming experience will become economical sometime in 2025.
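To see what that $6-per-hour bar implies, here is the napkin math as code: the token budget an hour of play can afford at a given price. The prices are illustrative assumptions, not quotes:

```python
def tokens_per_hour(budget_usd: float, price_per_million_usd: float) -> float:
    """How many tokens an hour of play can afford under a cost ceiling."""
    return budget_usd / price_per_million_usd * 1_000_000

print(tokens_per_hour(6.0, 30.0))  # 200k tokens/hour at an illustrative frontier $30/M
print(tokens_per_hour(6.0, 0.20))  # 30M tokens/hour at LLaMA 3 8B's 20¢/M
```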
These trends are new, only a few years old. But there is little reason to expect this process to slow down in the next few years. Even as we perhaps use up low-hanging fruit in algorithms and datasets, like scaling past the “Chinchilla ratio” of ~20 tokens per parameter, deeper innovations and investments inside the data center and at the silicon layer promise to pick up the slack.
And this is perhaps the most important strategic fact: what is a completely infeasible floor demo or research paper today will become a premium feature in a few years and then a commodity shortly after. We should build our systems, and our organizations, with this in mind.
We get it, building LLM demos is a ton of fun. With just a few lines of code, a vector database, and a carefully crafted prompt, we create ✨ magic ✨. And in the past year, this magic has been compared to the internet, the smartphone, and even the printing press.
Unfortunately, as anyone who has worked on shipping real-world software knows, there’s a world of difference between a demo that works in a controlled setting and a product that operates reliably at scale.