Pretty much every company I know is looking for a way to benefit from Large Language Models. Even if their executives don’t see much applicability, their investors likely do, so they’re staring at the blank page nervously trying to come up with an idea. It’s straightforward to make an argument for LLMs improving internal efficiency somehow, but it’s much harder to describe a believable way that LLMs will make your product more useful to your customers.
Many folks in the industry are still building their mental model for LLMs, which leads to many reasoning errors about what LLMs can do and how we should use them. Two unhelpful mental models I see many folks have regarding LLMs are:
LLMs are magic: anything that a human can do, an LLM can probably do roughly as well and vastly faster
LLMs are the same as reinforcement learning: current issues with hallucinations and accuracy are caused by small datasets. Accuracy problems will be solved with larger training sets, and we can rely on confidence scores to reduce the impact of hallucinations
I’d instead suggest these pillars for a useful mental model around LLMs:
LLMs can predict reasonable responses to any prompt – an LLM will confidently provide a response to any textual prompt you write, and will increasingly provide a response to text plus other forms of media like images or video
You cannot know whether a given response is accurate – LLMs generate unexpected results, called hallucinations, and you cannot concretely know when they are wrong. There are no confidence scores for reasoning about a specific answer from an LLM
You can estimate accuracy for a model and a given set of prompts using evals – by running the model against a known set of prompts and scoring its responses, you can estimate how likely it is to perform well in a given scenario (see the sketch after this list)
You can generally increase accuracy by using a larger model, but it’ll cost more and have higher latency – for example, GPT 4 is a larger model than GPT 3.5, and generally provides higher quality responses. However, it’s meaningfully more expensive (~20x more expensive) and meaningfully slower (2-5x slower). That said, quality, cost and latency are improving at every price point. You should expect the year-over-year performance at a given cost, latency or quality point to meaningfully improve over the next five years (e.g. you should expect to get GPT 4 quality at the price and latency of GPT 3.5 in 12-24 months)
Models generally get more accurate as the training corpus grows in size – the accuracy of reinforcement learning tends to grow predictably as the dataset grows. That remains generally true, but is less predictable with LLMs. Small models generally underperform large models. Large models generally outperform small models with higher quality data. Supplementing large general models with specific data is called “fine-tuning,” and it’s currently ambiguous when fine-tuning will outperform using a larger model.
Even the fastest LLMs are not that fast – even a fast LLM might take 10+ seconds to provide a reasonably sized response. If you need to perform multiple iterations of prompts and responses, or to use a larger model, it might take a minute or two to complete. These will get faster, but they aren’t fast today
Even the most expensive LLMs are not that expensive for B2B usage. Even the cheapest LLM is not that cheap for Consumer usage – because pricing is driven by usage volume, this is a technology that’s very easy to justify for B2B businesses with smaller volumes of paying usage. Conversely, it’s very challenging to figure out how you’re going to pay for significant LLM usage in a Consumer business without the risk of significantly shrinking your margin
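Since evals are the main tool for reasoning about accuracy, here’s a minimal sketch of what one can look like in practice. It isn’t tied to any provider: `call_model` is a hypothetical helper wrapping whichever model API you use, and the cases and graders are purely illustrative.

```python
# Minimal eval harness sketch. `call_model` is whatever function you use to
# send a prompt to your LLM and return its text response.
from typing import Callable, Iterable

# Each case pairs a prompt with a grader that decides whether a response is
# acceptable; graders can be exact-match checks, regexes, or another LLM call.
EVAL_CASES = [
    {"prompt": "Extract the invoice total from: 'Total due: $412.50'",
     "grade": lambda r: "412.50" in r},
    {"prompt": "Is Paris the capital of France? Answer yes or no.",
     "grade": lambda r: r.strip().lower().startswith("yes")},
]

def run_evals(cases: Iterable[dict], call_model: Callable[[str], str]) -> float:
    """Return the fraction of cases the model passes: an accuracy estimate
    for this model on this prompt set, not a per-response guarantee."""
    results = [case["grade"](call_model(case["prompt"])) for case in cases]
    return sum(results) / len(results)

# Usage (hypothetical): accuracy = run_evals(EVAL_CASES, my_provider_call)
```

The important property is the last one: evals tell you how a model tends to behave on a class of prompts, which is the closest you can get to the per-answer confidence scores that LLMs don’t provide.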
LLMs can still provide significant value to the business, for example by increasing the efficiency of validating that the paperwork matches the user-supplied information, but the user themselves won’t see much benefit beyond perhaps faster validation of their application.
However, you can adjust the workflows to make them more valuable:
1. User creates an account
2. Product asks user to provide paperwork
3. Product uses LLM to extract values from paperwork (sketched after this list)
4. User validates that the extracted data is correct, providing some adjustments
5. Internal team reviews the user’s adjustments, along with any high risk issues raised by a rule engine of some sort
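As a rough illustration of steps 3–5, here is what the extraction and review gate might look like. The `call_model` helper, the field names, and the thresholds are illustrative assumptions, not a real rule engine.

```python
# Sketch: extract structured fields with an LLM, let the user adjust them,
# then route risky cases to internal review.
import json
from typing import Callable

def extract_fields(paperwork_text: str, call_model: Callable[[str], str]) -> dict:
    prompt = (
        "Extract applicant_name, date_of_birth, and annual_income from the "
        "document below. Respond with JSON only.\n\n" + paperwork_text
    )
    # Hallucinations are possible, so downstream steps must validate this.
    return json.loads(call_model(prompt))

def needs_internal_review(extracted: dict, user_adjusted: dict) -> bool:
    # Simple illustrative rules: large user corrections or high stated
    # incomes go to a human reviewer; everything else proceeds automatically.
    income_changed = extracted.get("annual_income") != user_adjusted.get("annual_income")
    high_income = float(user_adjusted.get("annual_income") or 0) > 250_000
    return income_changed or high_income
```

The point of the gate is that the LLM’s output is never trusted directly: either the user or the internal team confirms it before it drives a decision.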
The technical complexity of these two products is functionally equivalent, but the user experience is radically different. The internal team’s experience is improved as well. My belief is that many existing products will find they can only meaningfully improve their user experience with LLMs by rethinking their workflows.
One solution to navigate large datasets within a fixed token window is Retrieval Augmented Generation (RAG). As a concrete example, imagine a dating app that matches individuals based on their free-form answer to the question, “What is your relationship with books, tv shows, movies and music, and how has it changed over time?” No token window is large enough to include every user in the dating app’s database in the prompt, but you could find twenty plausible matching users by filtering on location, then include those twenty users’ free-form answers in the prompt and match amongst them.
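Here is a sketch of that retrieval step. The `embed` and `call_model` helpers are hypothetical wrappers around your embedding and chat models, and the profile fields are illustrative.

```python
# RAG retrieval sketch for the dating-app example: filter, rank by
# embedding similarity, then prompt the LLM with only the top candidates.
from typing import Callable
import numpy as np

def match_candidates(user: dict, profiles: list[dict],
                     embed: Callable[[str], np.ndarray],
                     call_model: Callable[[str], str], k: int = 20) -> str:
    # Cheap structured filter first (location), then semantic ranking.
    nearby = [p for p in profiles if p["city"] == user["city"]]
    query = embed(user["media_answer"])

    def similarity(profile: dict) -> float:
        vec = embed(profile["media_answer"])
        return float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))

    top_k = sorted(nearby, key=similarity, reverse=True)[:k]
    # Only the top-k answers fit in the token window; the LLM ranks among them.
    prompt = ("Given the user's answer below, rank these candidates by "
              "compatibility and explain why.\n\nUser: " + user["media_answer"] +
              "\n\nCandidates:\n" +
              "\n".join(f"- {p['name']}: {p['media_answer']}" for p in top_k))
    return call_model(prompt)
```

In practice you would precompute and index the candidate embeddings rather than embedding every profile per request, which is part of why RAG needs real retrieval infrastructure underneath it.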
Where I see folks get into trouble is treating RAG as a solution to a search problem, rather than recognizing that RAG requires useful search as part of its implementation. An effective approach to RAG depends on a high-quality retrieval and filtering mechanism to work well at any non-trivial scale. For example, with only a high-level view of RAG, some folks might think they can replace their search technology (e.g. Elasticsearch) with RAG, but that’s only true if your dataset is very small and you can tolerate much higher response latencies.
Model performance, essentially the quality of response for a given budget in either dollars or milliseconds, is going to continue to improve, but it’s not going to keep improving at this rate absent significant technology breakthroughs in the application or processing of LLMs. I’d expect those breakthroughs to happen, but to become less frequent after the first several years and to slow from there. It’s hard to determine where we are in that cycle because there’s still an extraordinary amount of capital flowing into this space.
In addition to technical breakthroughs, the other aspect driving innovation is building increasingly large models. It’s unclear if today’s limiting factor is availability of Nvidia GPUs, larger datasets to train models upon that are plausibly legal, capital to train new models, or financial models suggesting that the discounted future cashflow from training larger models doesn’t meet a reasonable payback period. My assumption is that all of these have been or will be the limiting constraint on LLM innovation over time, and various competitors will be best suited to make progress depending on which constraint is most relevant. (Lots of fascinating albeit fringe scenarios to contemplate here, e.g. imagine a scenario where the US government disbands copyright laws to allow training on larger datasets because it fears losing the LLM training race to countries that don’t respect US copyright laws.)
As discussed in the workflow section, many companies already have humans performing validation work who can now move into supervising LLM responses rather than generating the responses themselves. In other scenarios, it’s possible to adjust your product’s workflows so that external users serve as the human-in-the-loop (HITL) instead. I suspect most products will depend on both techniques, along with heuristics to determine when internal review is necessary.
As mentioned before, LLMs often generate confidently wrong responses. HITL is the design principle for preventing action on confidently wrong responses, in part because it shifts responsibility (specifically, legal liability) away from the LLM itself and onto the specific human in the loop. For example, if you use Github Copilot to generate some code that causes a security breach, you are responsible for that security breach, not Github Copilot. Every large-scale adoption of LLMs today is being done in a mode that shifts responsibility for the responses to the user.
There’s a strong desire for a world where LLMs replace software engineers, or where software engineers move into a supervisory role rather than writing software. For example, an entrepreneur wants to build a copy of Reddit and uses an LLM to generate that implementation. There’s enough evidence that you can assume it’s possible today to go from zero to one on a new product idea in a few weeks with an LLM and some debugging skills.
I used to think this was very important, but my sense is that LLM hosting is already essentially equivalent to other cloud services (e.g. you can get Anthropic via AWS or OpenAI via Azure), and that very few companies will benefit from spending much time worrying about LLM availability. I do think that getting direct access to LLMs via cloud providers – companies that are well-versed in scalability – is likely the winning pick here as well.