Meta’s new Llama 3.3 70B is a genuinely GPT-4 class Large Language Model that runs on my laptop.
Just 20 months ago I was amazed to see something that felt GPT-3 class run on that same machine. The quality of models that are accessible on consumer hardware has improved dramatically in the past two years.
My laptop is a 64GB MacBook Pro M2, which I got in January 2023—two months after the initial release of ChatGPT. All of my experiments running LLMs on a laptop have used this same machine.
I had a moment of déjà vu the day before yesterday, when I ran Llama 3.3 70B on the same laptop for the first time.
Meta’s announcement claims that the model “delivers similar performance to Llama 3.1 405B with cost effective inference that’s feasible to run locally on common developer workstations.”
Everything I’ve seen so far from Llama 3.3 70B suggests that it holds up to that standard. I honestly didn’t think this was possible—I assumed that anything as useful as GPT-4 would require many times more resources than are available to me on my consumer-grade laptop.
I’m so excited by the continual efficiency improvements we’re seeing in running these impressively capable models. In the proprietary hosted world it’s giving us incredibly cheap and fast models like Gemini 1.5 Flash, GPT-4o mini and Amazon Nova. In the openly licensed world it’s giving us increasingly powerful models we can run directly on our own devices.
I don’t expect that this model would work well with much less than my 64GB of RAM. The first time I tried it, the model consumed every remaining bit of available memory and hard-crashed my Mac! For my second attempt I made sure not to have Firefox and VS Code running at the same time, and it worked just fine.
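Some rough arithmetic explains why: the default quantized build of a 70B model stores each parameter in roughly four bits, so the weights alone take somewhere around 40GB before you account for context and everything else running on the machine. If you want to try it yourself, Ollama (the tool this article focuses on) is the easiest route. A minimal sketch, assuming Ollama is installed and that llama3.3 is the tag in its model library:

```bash
# Download the quantized 70B weights (roughly 40GB) and start an interactive chat.
ollama pull llama3.3
ollama run llama3.3

# Or pass a single prompt directly from the shell.
ollama run llama3.3 "Write a Python function that merges two sorted lists."
```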
One of my current favorites for that is LiveBench, which calls itself “a challenging, contamination-free LLM benchmark” and tests a large array of models with a comprehensive set of different tasks.
llama-3.3-70b-instruct-turbo currently sits in position 19 on their table, a place ahead of Claude 3 Opus (my favorite model for several months after its release in March 2024) and just behind April’s GPT-4 Turbo and September’s GPT-4o.
Llama 3.3 is currently the most impressive model I’ve managed to run on my own hardware, but I’ve had several other positive experiences recently.
A couple of weeks ago I tried Qwen’s QwQ model, which implements a similar chain-of-thought pattern to OpenAI’s o1 series but again runs comfortably on my own device.
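That one is also easy to experiment with locally. A hedged sketch, assuming QwQ is available under the qwq tag in Ollama’s model library (check the library for the current name and size):

```bash
# QwQ prints a long visible chain of thought before settling on its final answer.
ollama run qwq "How many Rs are in the word strawberry?"
```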
Meta’s Llama 3.2 models are interesting as well: tiny 1B and 3B models (those should run even on a Raspberry Pi) that are way more capable than I would have expected—plus Meta’s first multi-modal vision models at 11B and 90B sizes. I wrote about those in September.
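The small ones are trivial to pull down and play with. Another sketch, assuming the llama3.2 tags in Ollama’s library (the 3B build is the default tag, with the 1B as a separate tag):

```bash
# The default llama3.2 tag is the 3B model; the 1B variant has its own tag.
ollama run llama3.2 "Summarize the plot of Hamlet in two sentences."
ollama run llama3.2:1b "Summarize the plot of Hamlet in two sentences."
```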
I’ve been mostly unconvinced by the ongoing discourse around LLMs hitting a plateau. The areas I’m personally most excited about are multi-modality (images, audio and video as input) and model efficiency. Both of those have had enormous leaps forward in the past year.
I don’t particularly care about “AGI”. I want models that can do useful things that I tell them to do, quickly and inexpensively—and that’s exactly what I’ve been getting more of over the past twelve months.
Even if progress on these tools entirely stopped right now, the amount I could get done with just the models I’ve downloaded and stashed on a USB drive would keep me busy and productive for years.
I focused on Ollama in this article because it’s the easiest option, but I also managed to run a version of Llama 3.3 using Apple’s excellent MLX library, which just celebrated its first birthday.
Here’s how I ran the model with MLX, using uv to fire up a temporary virtual environment:
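(Treat this as a sketch of the shape of the command rather than an exact transcript: it assumes the mlx-community 4-bit quantization of the model, and the prompt and token limit are placeholders. The mlx-lm command-line entry points have also shifted between releases, so check its docs if this one has moved.)

```bash
# uv installs mlx-lm into a throwaway environment and runs it in one step.
# mlx-community/Llama-3.3-70B-Instruct-4bit is the 4-bit build published on Hugging Face;
# it downloads on first use and needs tens of gigabytes of disk and RAM.
uv run --with mlx-lm python -m mlx_lm.generate \
  --model mlx-community/Llama-3.3-70B-Instruct-4bit \
  --prompt "Write a Python function that merges two sorted lists." \
  --max-tokens 1000
```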