Highlights

  • I have had access to the much-rumored OpenAI “Strawberry” enhanced reasoning system for over a month, and now that it is public, I can finally share some thoughts. It is amazing, still limited, and, perhaps most importantly, a signal of where things are heading.
  • The new AI model, called GPT-o1 (why are the AI companies so bad at names?), lets the AI “think through” a problem before solving it. This lets it address very hard problems that require planning and iteration, like novel math or science questions. In fact, it can now beat human PhD experts in solving extremely hard physics problems.
  • To be clear, GPT-o1 doesn’t do everything better. It is not a better writer than GPT-4o, for example. But for tasks that require planning, the changes are quite large. For example, here is me giving GPT-o1 the instruction: “Figure out how to build a teaching simulator using multiple agents and generative AI, inspired by the paper below and considering the views of teachers and students. write the code and be detailed in your approach.” I then pasted in the full text of our paper. The only other prompt I gave was “build the full code.” You can see what the system produced below.
  • Crossword puzzles are especially hard for LLMs because they require iterative solving: trying and rejecting many answers that all affect each other. This is something LLMs can’t do, since they can only add a token/word at a time to their answer. When I give the prompt to Claude, for example, it first comes up with an answer for 1 down (it guesses STAR, which is wrong) and then is stuck trying to figure out the rest of the puzzle with that answer, ultimately failing to even come close. Without a planning process, it has to just charge ahead. (A sketch of the kind of backtracking search a crossword demands appears after these highlights.)
  • But what happens when I give this to Strawberry? The AI “thinks” about the problem first, for a full 108 seconds (most problems are solved in much shorter times). You can see its thoughts, a sample of which is below (there was a lot more I did not include), and they are super illuminating - it is worth a moment to read some of them.
  • The LLM iterates repeatedly, creating and rejecting ideas. The results are pretty impressive, and it does well… but GPT-o1 is still seemingly based on GPT-4o, and it is a little too literal to solve this rather unfair puzzle. The answer to 1 down, “Galaxy cluster,” is not a reference to real galaxies but rather to the Samsung Galaxy phone (this stumped me, too) - “APPS.” Stuck on real galaxies, the AI instead kept trying out the names of actual galactic clusters before deciding 1 down is COMA (which is a real galactic cluster - I had no idea). Thus, the rest of the results are not correct and do not fit the rules exactly, but are pretty creative: 1 across is CONS, 12 across is OUCH, 15 across is MUSICIANS, etc.
  • So GPT-o1 does things that would have been impossible without Strawberry, but it still isn’t flawless: errors and hallucinations still happen, and it is still limited by the “intelligence” of GPT-4o as the underlying model. Since getting the new model, I haven’t stopped using Claude to critique my posts - Claude is still better at style - but I did stop using it for anything involving complex planning or problem solving. GPT-o1 represents a huge leap in those areas.
  • Using GPT-o1 means confronting a paradigm change in AI. Planning is a form of agency, where the AI arrives at conclusions about how to solve a problem on its own, without our help. You can see from the video above that the AI does so much thinking and heavy lifting, churning out complete results, that my role as a human partner feels diminished. It just does its thing and hands me an answer. Sure, I can sift through its pages of reasoning to spot mistakes, but I no longer feel as connected to the AI output, or that I am playing as large a role in shaping where the solution is going. This isn’t necessarily bad, but it is different.
  • As these systems level up and inch towards true autonomous agents, we’re going to need to figure out how to stay in the loop - both to catch errors and to keep our fingers on the pulse of the problems we’re trying to crack. GPT-o1 is pulling back the curtain on AI capabilities we might not have seen coming, even with its current limitations. This leaves us with a crucial question: How do we evolve our collaboration with AI as it evolves? That is a problem that GPT-o1 cannot yet solve.
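
To make the crossword point above concrete, here is a minimal sketch of the backtracking search that crossword solving requires: try a candidate answer, check it against every intersecting entry, and undo it when the grid dead-ends. The grid layout, slot names, and candidate word lists are hypothetical toy data invented for illustration, not the Samsung puzzle from the post.

```python
from typing import Optional

# Each slot lists the (row, col) grid cells it occupies. Slots that
# share a cell constrain each other, which is what makes the puzzle
# iterative: 1-across and 1-down intersect at (0, 0), and 1-down and
# 3-across intersect at (2, 0).
SLOTS = {
    "1-across": [(0, 0), (0, 1), (0, 2)],
    "1-down":   [(0, 0), (1, 0), (2, 0)],
    "3-across": [(2, 0), (2, 1), (2, 2)],
}

# Hypothetical candidate answers per slot (a real solver would derive
# these from the clues). Chosen so the first 1-down guess dead-ends,
# forcing the solver to reject it and revisit an earlier choice.
CANDIDATES = {
    "1-across": ["DOG"],
    "1-down":   ["DIG", "DAB"],
    "3-across": ["BAT", "BUS"],
}


def fits(word: str, slot: str, grid: dict) -> bool:
    """A word fits if it agrees with every letter already on the grid."""
    return all(
        grid.get(cell, letter) == letter
        for letter, cell in zip(word, SLOTS[slot])
    )


def solve(slots: list, grid: dict) -> Optional[dict]:
    """Fill slots depth-first, backtracking whenever entries conflict."""
    if not slots:
        return dict(grid)  # every slot filled consistently
    slot, rest = slots[0], slots[1:]
    for word in CANDIDATES[slot]:
        if not fits(word, slot, grid):
            continue  # immediate conflict: reject and try the next word
        print(f"trying {word} in {slot}")
        new_cells = [c for _, c in zip(word, SLOTS[slot]) if c not in grid]
        for letter, cell in zip(word, SLOTS[slot]):
            grid[cell] = letter
        result = solve(rest, grid)
        if result is not None:
            return result
        print(f"dead end; removing {word} from {slot}")
        for cell in new_cells:  # undo the placement and backtrack
            del grid[cell]
    return None  # nothing fits here; the caller must rethink its entry


if __name__ == "__main__":
    print(solve(list(SLOTS), {}))
```

Running this prints the solver trying DIG for 1-down, hitting a dead end at 3-across, removing DIG, and succeeding with DAB. The `del grid[cell]` step is exactly what a model emitting one token at a time has no equivalent of: once Claude commits to STAR for 1 down, plain autoregressive generation offers no way to unwind that choice, which is the gap the repeated creating-and-rejecting in Strawberry’s “thinking” phase addresses.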