Applied LLMs - What We ve Learned From A Year of Building with LLMs

rw-book-cover

Metadata

Author: Eugene Yan, Bryan Bischof, Charles Frye, Hamel Husain, Jason Liu, Shreya Shankar
Full Title: Applied LLMs - What We ve Learned From A Year of Building with LLMs
URL: https://readwise.io/reader/document_raw_content/182552986

Highlights

The model isn’t the product, the system around it is For teams that aren’t building models, the rapid pace of innovation is a boon as they migrate from one SOTA model to the next, chasing gains in context size, reasoning capability, and price-to-value to build better and better products. This progress is as exciting as it is predictable. Taken together, this means models are likely to be the least durable component in the system. (View Highlight)
Instead, focus your efforts on what’s going to provide lasting value, such as: Evals: To reliably measure performance on your task across models Guardrails: To prevent undesired outputs no matter the model Caching: To reduce latency and cost by avoiding the model altogether Data hywheel: To power the iterative improvement of everything above (View Highlight)
These components create a thicker moat of product quality than raw model capabilities. (View Highlight)
But that doesn’t mean building at the application layer is risk-free. Don’t point your shears at the same yaks that OpenAI or other model providers will need to shave if they want to provide viable enterprise software. For example, some teams invested in building custom tooling to validate structured output from proprietary models; minimal investment here is important, but a deep one is not a good use of time. OpenAI needs to ensure that when you ask for a function call, you get a valid function call—because all of their customers want this. Employ some “strategic procrastination” here, build what you absolutely need, and await the obvious expansions to capabilities from providers. (View Highlight)
Building a product that tries to be everything to everyone is a recipe for mediocrity. To create compelling products, companies need to specialize in building sticky experiences that keep users coming back. (View Highlight)
Consider a generic RAG system that aims to answer any question a user might ask. The lack of specialization means that the system can’t prioritize recent information, parse domain-specific formats, or understand the nuances of specific tasks. As a result, users are left with a shallow, unreliable experience that doesn’t meet their needs, leading to churn. (View Highlight)
To address this, focus on specific domains and use cases. Narrow the scope by going deep rather than wide. This will create domain-specific tools that resonate with users. Specialization also allows you to be upfront about your system’s capabilities and limitations. Being transparent about what your system can and cannot do demonstrates self-awareness, helps users understand where it can add the most value, and thus builds trust and confidence in the output. (View Highlight)
DevOps is not fundamentally about reproducible workhows or shifting left or empowering two pizza teams—and it’s definitely not about writing YAML files. (View Highlight)
DevOps is about shortening the feedback cycles between work and its outcomes so that improvements accumulate instead of errors. Its roots go back, via the Lean Startup movement, to Lean Manufacturing and the Toyota Production System, with its emphasis on Single Minute Exchange of Die and Kaizen. (View Highlight)
MLOps has adapted the form of DevOps to ML. We have reproducible experiments and we have all-in-one suites that empower model builders to ship. And Lordy, do we have YAML files. But as an industry, MLOps didn’t adopt the function of DevOps. It didn’t shorten the feedback gap between models and their inferences and interactions in production. Hearteningly, the field of LLMOps has shifted away from thinking about hobgoblins of little minds like prompt management and towards the hard problems that block iteration: production monitoring and continual improvement, linked by evaluation. (View Highlight)
Already, we have interactive arenas for neutral, crowd-sourced evaluation of chat and coding models – an outer loop of collective, iterative improvement. Tools like LangSmith, Log10, LangFuse, W&B Weave, HoneyHive, and more promise to not only collect and collate data about system outcomes in production, but also to leverage them to improve those systems by integrating deeply with development. Embrace these tools or build your own. (View Highlight)
Most successful businesses are not LLM businesses. Simultaneously, most businesses have opportunities to be improved by LLMs. This pair of observations often mislead leaders into hastily retrofitting systems with LLMs at increased cost and decreased quality and releasing them as ersatz, vanity “AI” features, complete with the now- dreaded sparkle icon. There’s a better way: focus on LLM applications that truly align with your product goals and enhance your core operations. (View Highlight)
Consider a few misguided ventures that waste your team’s time: Building custom text-to-SQL capabilities for your business. Building a chatbot to talk to your documentation. Integrating your company’s knowledge base with your customer support chatbot. (View Highlight)
While the above are the hellos-world of LLM applications, none of them make sense for a product company to build themselves. These are general problems for many businesses with a large gap between promising demo and dependable component—the customary domain of software companies. Investing valuable R&D resources on general problems being tackled en masse by the current Y Combinator batch is a waste. If this sounds like trite business advice, it’s because in the frothy excitement of the current hype wave, it’s easy to mistake anything “LLM” as cutting-edge, accretive differentiation, missing which applications are already old hat. (View Highlight)
Right now, LLM-powered applications are brittle. They required an incredible amount of safe-guarding and defensive engineering, yet remain hard to predict. Additionally, when tightly scoped these applications can be wildly useful. This means that LLMs make excellent tools to accelerate user workhows. While it may be tempting to imagine LLM-based applications fully replacing a workhow, or standing in for a job function, today the most effective paradigm is a human-computer centaur (Centaur chess). When capable humans are paired with LLM capabilities tuned for their rapid utilization, productivity and happiness doing tasks can be massively increased. One of the hagship applications of LLMs, GitHub CoPilot, demonstrated the power of these workhows: (View Highlight)
For those who have worked in ML for a long time, you may jump to the idea of “human-in-the-loop”, but not so fast: HITL Machine Learning is a paradigm built on Human experts ensuring that ML models behave as predicted. While related, here we are proposing something more subtle. LLM-driven systems should not be the primary drivers of most workhows today, they should merely be a resource. By centering humans, and asking how an LLM can support their workhow, this leads to significantly different product and design decisions. Ultimately, it will drive you to build different products than competitors who try to rapidly offshore all responsibility to LLMs; better, more useful, and less risky products. (View Highlight)
The previous sections have delivered a firehose of techniques and advice. It’s a lot to take in. Let’s consider the minimum useful set of advice: if a team wants to build an LLM product, where should they begin? (View Highlight)
Over the past year, we’ve seen enough to be confident that successful LLM applications follow a consistent trajectory. We walk through this basic “getting started” playbook in this section. The core idea is to start simple and only add complexity as needed. A decent rule of thumb is that each level of sophistication typically requires at least an order of magnitude more effort than the one before it. With this in mind… (View Highlight)
Prompt engineering comes first Start with prompt engineering. Use all the techniques we discussed in the tactics section before. Chain-of- thought, n-shot examples, and structured input and output are almost always a good idea. Prototype with the most highly capable models before trying to squeeze performance out of weaker models. Only if prompt engineering cannot achieve the desired level of performance should you consider finetuning. This will come up more often if there are non-functional requirements (e.g., data privacy, complete control, cost) that block the use of proprietary models and thus require you to self-host. Just make sure those same privacy requirements don’t block you from using user data for finetuning! (View Highlight)
Build evals and kickstart a data flywheel Even teams that are just getting started need evals. Otherwise, you won’t know whether your prompt engineering is sufficient or when your finetuned model is ready to replace the base model. Effective evals are specific to your tasks and mirror the intended use cases. The first level of evals that we recommend is unit testing. These simple assertions detect known or hypothesized failure modes and help drive early design decisions. Also see other task-specific evals for classification, summarization, etc. While unit tests and model-based evaluations are useful, they don’t replace the need for human evaluation. Have people use your model/product and provide feedback. This serves the dual purpose of measuring real-world performance and defect rates while also collecting high-quality annotated data that can be used to finetune future models. This creates a positive feedback loop, or data hywheel, which compounds over time: Human evaluation to assess model performance and/or find defects Use the annotated data to finetune the model or update the prompt Repeat For example, when auditing LLM-generated summaries for defects we might label each sentence with fine- grained feedback identifying factual inconsistency, irrelevance, or poor style. We can then use these factual inconsistency annotations to train a hallucination classifier or use the relevance annotations to train a relevance-reward model. As another example, LinkedIn shared about their success with using model-based evaluators to estimate hallucinations, responsible AI violations, coherence, etc. in their write-up By creating assets that compound their value over time, we upgrade building evals from a purely operational expense to a strategic investment, and build our data hywheel in the process. (View Highlight)
In 1971, the researchers at Xerox PARC predicted the future: the world of networked personal computers that we are now living in. They helped birth that future by playing pivotal roles in the invention of the technologies that made it possible, from Ethernet and graphics rendering to the mouse and the window. But they also engaged in a simple exercise: they looked at applications that were very useful (e.g. video displays) but were not yet economical (i.e. enough RAM to drive a video display was many thousands of dollars). Then they looked at historic price trends for that technology (a la Moore’s Law) and predicted when those technologies would become economical. (View Highlight)
We can do the same for LLM technologies, even though we don’t have something quite as clean as transistors per dollar to work with. Take a popular, long-standing benchmark, like the Massively-Multitask Language Understanding dataset, and a consistent input approach (five-shot prompting). Then, compare the cost to run language models with various performance levels on this benchmark over time. Figure. For a fixed cost, capabilities are rapidly increasing. For a fixed capability level, costs are rapidly decreasing. Created by co-author Charles Frye using public data on May 13, 2024. (View Highlight)
In the four years since the launch of OpenAI’s davinci model as an API, the cost of running a model with equivalent performance on that task at the scale of one million tokens (about one hundred copies of this document) has dropped from $20 t o l ess t han 10¢- aha l v in g t im eo f j u s t s i x m o n t h s . S imi l a r l y, t h ecos tt or u n M e t a ’ s LL a M A 38 B, v iaan A P I p ro v i d eroro n yo u ro w n, i s j u s t 20¢ p er mi ll i o n t o k e n s a so f M a yo f 2024, an d i t ha ss imi l a r p er f or man ce t o Op e n A I ’ s t e x t - d a v in c i - 003, t h e m o d e lt ha t e nab l e d C ha tGPT . T ha t m o d e l a l socos t ab o u t$ 20 per million tokens when it was released in late November of 2023. That’s two orders of magnitude in just 18 months – the same timeframe in which Moore’s Law predicts a mere doubling. (View Highlight)
Now, let’s consider an application of LLMs that is very useful (powering generative video game characters, a la Park et al[) but is not yet economical (their cost was estimated at $625 p er h o u r] (h ttp s : // a r x i v . or g / ab s /2310.02172) h ere) . S in ce t ha tp a p er w a s p u b l i s h e d in A ugu s t o f 2023, t h ecos t ha s d ro pp e d ro ug h l yo n eor d ero f ma g ni t u d e, t o$ 62.50 per hour. We might expect it to drop to $6.25 per hour in another nine months. (View Highlight)
Meanwhile, when Pac-Man was released in 1980, $1 o f t o d a y ’ s m o n ey w o u l d b u yyo u a cre d i t, g oo d t o pl a y f or a f e w min u t esor t e n so f min u t es - c a ll i t s i xg am es p er h o u r, or$ 6 per hour. This napkin math suggests that a compelling LLM-enhanced gaming experience will become economical sometime in 2025. These trends are new, only a few years old. But there is little reason to expect this process to slow down in the next few years. Even as we perhaps use up low-hanging fruit in algorithms and datasets, like scaling past the “Chinchilla ratio” of ~20 tokens per parameter, deeper innovations and investments inside the data center and at the silicon layer promise to pick up the slack. And this is perhaps the most important strategic fact: what is a completely infeasible hoor demo or research paper today will become a premium feature in a few years and then a commodity shortly after. We should build our systems, and our organizations, with this in mind. (View Highlight)
We get it, building LLM demos is a ton of fun. With just a few lines of code, a vector database, and a carefully crafted prompt, we create ✨ magic ✨ . And in the past year, this magic has been compared to the internet, the smartphone, and even the printing press. Unfortunately, as anyone who has worked on shipping real-world software knows, there’s a world of difference between a demo that works in a controlled setting and a product that operates reliably at scale. (View Highlight)
There’s a large class of problems that are easy to imagine and build demos for, but extremely hard to make products out of. For example, self-driving: It’s easy to demo a car self-driving around a block; making it into a product takes a decade. - Andrej Karpathy (View Highlight)
Take, for example, self-driving cars. The first car was driven by a neural network in 1988. Twenty-five years later, Andrej Karpathy took his first demo ride in a Waymo. A decade after that, the company received its driverless permit. That’s thirty-five years of rigorous engineering, testing, refinement, and regulatory navigation to go from prototype to commercial product. Across industry and academia, we’ve observed the ups and downs for the past year: Year 1 of N for LLM applications. We hope that the lessons we’ve learned—from tactics like evals, prompt engineering, and guardrails, to operational techniques and building teams to strategic perspectives like which capabilities to build internally—help you in year 2 and beyond, as we all build on this exciting new technology together. (View Highlight)