Highlights

  • What AI should you actually use? Not five years from now. Not in some hypothetical future. Today. (View Highlight)
  • AI models are gaining capabilities at an increasingly rapid rate, new companies are releasing new models, and nothing is well documented or well understood. In fact, in the few days I have been working on this draft, I had to add an entirely new model and update the chart below multiple times due to new releases. (View Highlight)
  • To pick the right AI model for you, you need to know what each one can do. I decided to focus here on the major AI companies that offer easy-to-use apps you can run on your phone and that give you access to their most up-to-date AI models. Right now, to consistently access a frontier model with a good app, you are going to need to pay around $20/month (at least in the US), with a couple of exceptions. (View Highlight)
  • We are going to go through things in detail, but, for most people, there are three good choices right now: Claude from Anthropic, Google’s Gemini, and OpenAI’s ChatGPT. (View Highlight)
  • For most people starting to use AI, the most important goal is to ensure that you have access to a frontier model with its own app. (View Highlight)
  • The problem is that most of the AI companies push you towards their smaller AI models if you don’t pay for access, and sometimes even if you do. Generally, smaller models are much faster to run, slightly less capable, and much cheaper for the AI companies to operate. For example, GPT-4o mini is the smaller version of GPT-4o and Gemini Flash is the smaller version of Gemini. (View Highlight)
  • Right now, for Claude you want to use Claude 3.5 Sonnet (which consistently outperforms its larger sibling Claude 3 Opus), for Gemini you want to use Gemini 2.0 Flash (though full Gemini 2.0 is expected very soon), and for ChatGPT you want to use GPT-4o (except when tackling complex problems that benefit from o1’s reasoning capabilities). (View Highlight)
  • Imagine an AI that can converse with you in real-time, seeing what you see, hearing what you say, and responding naturally – that’s “Live Mode” (though it goes by various names). (View Highlight)
  • You are actually seeing three advances in AI working together: First, multimodal speech lets the AI handle voice natively, unlike most AI models that use separate systems to convert between text and speech. This means it can theoretically generate any sound, though OpenAI limits this for safety. Second, multimodal vision lets the AI see and analyze real-time video. Third, internet connectivity provides access to current information. (View Highlight)
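
To make the “separate systems” point concrete, here is a minimal sketch of the two architectures. Every function in it is a hypothetical stub, not any vendor’s real API; the point is only that a classic pipeline converts to text and back (losing information), while a natively multimodal model handles audio directly.

```python
# Hypothetical sketch: classic voice pipeline vs. native multimodal speech.
# All functions are made-up stubs standing in for real systems.

def speech_to_text(audio: bytes) -> str:       # system 1: transcription
    return "transcribed words only"            # tone and emphasis are lost here

def chat_model(text: str) -> str:              # system 2: text-only LLM
    return f"reply to: {text}"

def text_to_speech(text: str) -> bytes:        # system 3: voice synthesis
    return text.encode()

def classic_voice_assistant(audio_in: bytes) -> bytes:
    """Three chained systems: information is dropped at each conversion."""
    return text_to_speech(chat_model(speech_to_text(audio_in)))

def multimodal_model(audio: bytes, image: bytes) -> bytes:
    """Stand-in for one model that consumes and emits audio natively, so it
    can in principle produce any sound (hence the vendor safety limits)."""
    return b"native audio reply"

def live_mode_assistant(audio_in: bytes, video_frame: bytes) -> bytes:
    return multimodal_model(audio=audio_in, image=video_frame)

print(classic_voice_assistant(b"hello"))
print(live_mode_assistant(b"hello", b"frame"))
```
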
  • Right now, only ChatGPT offers a full multimodal Live Mode for all paying customers. It’s the little icon all the way to the right of the prompt bar (ChatGPT is full of little icons). But Google has already demonstrated a Live Mode for its Gemini model, and I expect we will see others soon. (View Highlight)
  • For those who are watching the AI space, by far the most important advance in the last few months has been the development of reasoning models. As I explained in my post about o1, it turns out that if you let an AI “think” about a problem before answering, you get better results. The longer the model thinks, generally, the better the outcome. Behind the scenes, it’s cranking through a whole thought process you never see, only showing you the final answer. Interestingly, when you peek behind that curtain, you find these AIs think in ways that feel eerily human. (View Highlight)
  • That was the thinking process of DeepSeek-R1, one of only a few reasoning models that have been released to the public. It is also an unusual model in many ways: it is an excellent model from China [1]; it is open source, so anyone can download and modify it; and it is cheap to run (and is currently offered for free by its parent company, DeepSeek). Google also offers a reasoning version of its Gemini 2.0 Flash. However, the most capable reasoning models right now are the o1 family from OpenAI. These are confusingly named, but, in order of capability, there are o1-mini, o1, and o1-pro. A new series of models, o3 (OpenAI could not get the rights to the o2 name, making things even more baffling), is expected at any moment, and o3-mini is likely to be a very good model. (View Highlight)
  • Reasoning models aren’t chatty assistants – they’re more like scholars. You’ll ask a question, wait while they ‘think’ (sometimes minutes!), and get an answer. You want to make sure that the question you give them is very clear and has all the context they need. For very hard questions, especially in academic research, math, or computer science, you will want to use a reasoning model. Otherwise, a standard chat model is fine. (View Highlight)
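
A rough way to picture what these models do internally is a two-pass prompt: generate hidden working first, then produce only the final answer. The sketch below fakes that with a stub `complete()` function (an assumption for illustration, not a real API); actual reasoning models do this natively, at far greater depth, and never show you the intermediate text.

```python
# Hypothetical sketch of "think before answering". complete() is a stub,
# not a real API; o1-style models do this internally and invisibly.

def complete(prompt: str) -> str:
    """Stand-in for a call to an ordinary chat model."""
    return f"[model output for: {prompt[:40]}...]"

def reasoned_answer(question: str, context: str) -> str:
    # Pass 1: let the model write out its working ("thinking").
    thoughts = complete(
        f"Context: {context}\nQuestion: {question}\n"
        "Work through this step by step before deciding on an answer."
    )
    # Pass 2: ask for a final answer conditioned on that working.
    # A real reasoning model never shows you the `thoughts` text.
    return complete(
        f"Given this reasoning:\n{thoughts}\n"
        f"State only the final answer to: {question}"
    )

# As the highlight advises: give the full context up front, in one clear question.
print(reasoned_answer(
    question="Which of our three suppliers minimizes total landed cost?",
    context="Supplier quotes, shipping rates, and the tariff table pasted here...",
))
```
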
  • Not all AIs can access the web and do searches to learn new information past their original training. Currently, Gemini, Grok, DeepSeek, Copilot and ChatGPT can search the web actively, while Claude cannot. This capability makes a huge difference when you need current information or fact-checking, but not all models use their internet connections fully, so you will still need to fact-check. (View Highlight)
  • Most of the LLMs that generate images do so by actually using a separate image generation tool. They do not have direct control over what that tool does, they just send a prompt to it and then show you the picture that results. That is changing with multimodal image creation, which lets the AI directly control the images it makes. For right now, Gemini’s Imagen 3 leads the pack, but honestly? They’ll all handle your basic “otter holding a sign saying ‘This is ____’ as it sits on a pink unicorn float in the middle of a pool” just fine. (View Highlight)
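
The handoff described above is easy to sketch. Both functions below are made-up stubs, not a real API; the point is the one-way flow: the LLM writes a prompt, a separate generator renders it, and the LLM never inspects or corrects the result.

```python
# Hypothetical sketch of how most chatbots make images today: the LLM only
# writes a text prompt and hands it to a separate tool it cannot steer.

def llm_write_image_prompt(request: str) -> str:
    """The chat model's entire contribution: a refined text prompt."""
    return f"Photorealistic render: {request}, soft lighting, high detail"

def separate_image_tool(prompt: str) -> bytes:
    """A distinct image model; the LLM never sees or adjusts its output."""
    return b"<image bytes>"

def chatbot_image(request: str) -> bytes:
    prompt = llm_write_image_prompt(request)   # one-way handoff...
    return separate_image_tool(prompt)         # ...then just show the result

# A multimodal image model would collapse these two steps into one system
# that can inspect and revise its own image as it generates it.
chatbot_image("an otter on a pink unicorn float in a pool")
```
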
  • All AIs are pretty good at writing code, but only a few models (mostly Claude and ChatGPT, but also Gemini to a lesser extent) have the ability to execute the code directly. Doing so lets you do a lot of exciting things. For example, this is the result of telling o1, using the Canvas feature (which you need to turn on by typing /canvas): “create an interactive tool that visually shows me how correlation works, and why correlation alone is not a great descriptor of the underlying data in many cases. make it accessible to non-math people and highly interactive and engaging” (View Highlight)
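
The interactive tool itself can’t be reproduced in a highlights file, but the statistical point behind that prompt is easy to demonstrate. Here is a minimal sketch of my own (not the tool o1 generated): two datasets whose correlations both come out strongly positive, yet whose underlying shapes are completely different.

```python
# Minimal sketch of the point the prompt asks o1 to visualize: datasets with
# similar correlation can have very different underlying structure.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)

# Dataset A: a genuinely linear relationship plus noise.
y_linear = 0.8 * x + rng.normal(0, 0.6, x.size)

# Dataset B: a curved (quadratic) relationship with the same linear trend.
# A scatterplot looks nothing like dataset A.
y_curved = 0.8 * x + 0.3 * x**2 - 0.9 + rng.normal(0, 0.6, x.size)

for name, y in [("linear", y_linear), ("curved", y_curved)]:
    r = np.corrcoef(x, y)[0, 1]
    print(f"{name}: Pearson r = {r:.2f}")
# Both r values come out strongly positive (roughly 0.8-0.9), yet only one
# relationship is linear -- which is why correlation alone is a poor
# descriptor of the underlying data.
```
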
  • Further, when models can code and use external files, they are capable of doing data analysis. Want to analyze a dataset? ChatGPT’s Code Interpreter will do the best job on statistical analyses, Claude does less statistics but often is best at interpretation, and Gemini tends to focus on graphing. None of them are great with Excel files full of formulas and tabs yet, but they do a good job with structured data. (View Highlight)
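
Under the hood, this kind of analysis is ordinary Python run by the model. Here is a sketch of a typical first pass, using a made-up in-memory dataset in place of an uploaded file:

```python
# Sketch of the kind of first-pass analysis a tool like Code Interpreter runs.
# The dataset here is synthetic, standing in for an uploaded CSV.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "region": rng.choice(["north", "south"], size=200),
    "revenue": rng.normal(100, 20, size=200),
    "units": rng.integers(1, 50, size=200),
})

print(df.describe())                                          # summary statistics
print(df.isna().sum())                                        # missing-value check
print(df.groupby("region")["revenue"].agg(["mean", "std", "count"]))
print(df.select_dtypes("number").corr())                      # correlations before modeling
```
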
  • It is very useful for your AI to take in data from the outside world. Almost all of the major AIs include the ability to process images. The models can often infer a huge amount from a picture. Far fewer models do video (which is actually processed as images at 1 frame every second or two). Right now that can only be done by Google’s Gemini, though ChatGPT can see video in Live Mode. (View Highlight)
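
That “one frame every second or two” detail is easy to make concrete. Here is a sketch of the sampling step using OpenCV (the file name is a placeholder, and the call to a vision model is omitted):

```python
# Sketch of how "video understanding" typically works: sample the video as
# still images (~1 frame per second) and send those to a vision model.
# Requires opencv-python; "clip.mp4" is a hypothetical file.
import cv2

cap = cv2.VideoCapture("clip.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30      # source frame rate, with a fallback

frames, i = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if i % int(fps) == 0:                  # keep roughly one frame per second
        frames.append(frame)
    i += 1
cap.release()

print(f"sampled {len(frames)} frames to send to the vision model")
```
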
  • And, while all the AI models can work with documents, they aren’t equally good at all formats. Gemini, GPT-4o (but not o1), and Claude can process PDFs with images and charts, while DeepSeek can only read the text. No model is particularly good at Excel or PowerPoint (though Microsoft Copilot does a bit better here, as you might expect), though that will change soon. The different models also have different amounts of memory (“context windows”), with Gemini having by far the most, capable of holding up to 2 million tokens (roughly 1.5 million words) at once. (View Highlight)
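
Context windows are measured in tokens, so a quick way to check whether a document fits is to count them. Here is a sketch using the tiktoken library (the window sizes in the loop are illustrative; check each vendor’s docs for current limits):

```python
# Sketch: estimating whether a document fits in a model's context window.
# Token counts vary by tokenizer; o200k_base is the GPT-4o-family encoding.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

document = "Paste or load your document text here... " * 1000
n_tokens = len(enc.encode(document))

# Illustrative window sizes, not authoritative limits.
for model, window in [("a 128k-token model", 128_000),
                      ("a 2M-token model (Gemini-class)", 2_000_000)]:
    fits = "fits" if n_tokens <= window else "does NOT fit"
    print(f"{n_tokens:,} tokens {fits} in {model}")
```
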
  • If you happen to like the personality of a particular AI, you may be willing to put up with fewer features or weaker capabilities. You can try out the free versions of multiple AIs to get a sense for that. That said, for most people, you probably want to pick among the paid versions of ChatGPT, Claude, or Gemini. (View Highlight)
  • ChatGPT currently has the best Live Mode in its Advanced Voice Mode. The other big advantage of ChatGPT is that it does everything, often in somewhat confusing ways – OpenAI has AI models specialized in hard problems (the o1 series) and models for chat (GPT-4o); some models can write and run complex software programs (though it is hard to know which); there are systems that remember past interactions and scheduling systems; and there are movie-making tools and early software agents. (View Highlight)
  • Gemini does not yet have as good a Live Mode, but that is supposed to be coming soon. For now, Gemini’s advantage is a family of powerful models including reasoners, very good integration with search, and a pretty easy-to-use user interface, as you might expect from Google. It also has top-flight image and video generation. (View Highlight)
  • Claude has the smallest number of features of any of these three systems, and really only has one model you care about – Claude 3.5 Sonnet. But Sonnet is very, very good. It often seems to be clever and insightful in ways that the other models are not. A lot of people end up using Claude as their primary model as a result, even though it is not as feature-rich. (View Highlight)
  • While it is very new, you might also consider DeepSeek if you want a very good all-around model with excellent reasoning. If you subscribe to X, you get Grok for free, and the team at xAI is scaling up capabilities incredibly quickly, with a soon-to-be-released new model, Grok 3, promising to be the largest model ever trained. And if you have Copilot, you can use that, as it includes a mix of Microsoft and OpenAI models, though I find the lack of transparency over which model it is using at any given moment somewhat confusing. (View Highlight)