Highlights

  • One way to think about these new models is as a specialized extension of the chain of thought prompting pattern—the “think step by step” trick that we’ve been exploring as a community for a couple of years now, first introduced in the paper Large Language Models are Zero-Shot Reasoners in May 2022. (View Highlight) (A minimal sketch of this prompting trick appears after the highlights.)
  • Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them. (View Highlight)
  • Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason. (View Highlight)
  • I don’t really like the term “reasoning” because I don’t think it has a robust definition in the context of LLMs, but OpenAI have committed to using it here and I think it does an adequate job of conveying the problem these new models are trying to solve. (View Highlight)
  • For applications that need image inputs, function calling, or consistently fast response times, the GPT-4o and GPT-4o mini models will continue to be the right choice. However, if you’re aiming to develop applications that demand deep reasoning and can accommodate longer response times, the o1 models could be an excellent choice. (View Highlight)
  • No system prompt support—the models use the existing chat completion API but you can only send user and assistant messages. (View Highlight)
  • Most interesting is the introduction of “reasoning tokens”—tokens that are not visible in the API response but are still billed and counted as output tokens. These tokens are where the new magic happens. (View Highlight) (A sketch of an o1-preview API call that accounts for these tokens appears after the highlights.)
  • Thanks to the importance of reasoning tokens—OpenAI suggests allocating a budget of around 25,000 of these for prompts that benefit from the new models—the output token allowance has been increased dramatically—to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini! These are an increase from the gpt-4o and gpt-4o-mini models which both currently have a 16,384 output token limit. (View Highlight)
  • Limit additional context in retrieval-augmented generation (RAG): When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response. (View Highlight) (A brief sketch of this appears after the highlights.)
  • Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users. (View Highlight)
  • So two key reasons here: one is around safety and policy compliance: they want the model to be able to reason about how it’s obeying those policy rules without exposing intermediary steps that might include information that violates those policies. The second is what they call competitive advantage—which I interpret as wanting to avoid other models being able to train against the reasoning work that they have invested in. (View Highlight)
  • I’m not at all happy about this policy decision. As someone who develops against LLMs, interpretability and transparency are everything to me—the idea that I can run a complex prompt and have key details of how that prompt was evaluated hidden from me feels like a big step backwards. (View Highlight)
  • It’s going to take a while for the community to shake out the best practices for when and where these models should be applied. I expect to continue mostly using GPT-4o (and Claude 3.5 Sonnet), but it’s going to be really interesting to see us collectively expand our mental model of what kind of tasks can be solved using LLMs given this new class of model. (View Highlight)
  • I expect we’ll see other AI labs, including the open model weights community, start to replicate some of these results with their own versions of models that are specifically trained to apply this style of chain-of-thought reasoning. (View Highlight)
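
The first highlight describes the zero-shot chain-of-thought trick from Large Language Models are Zero-Shot Reasoners. As a minimal sketch (not from the post, with a placeholder model name and question), it amounts to appending “Let’s think step by step.” to an ordinary prompt:

```python
# Zero-shot chain-of-thought prompting: append "Let's think step by step."
# to the question and let the model spell out its intermediate reasoning.
from openai import OpenAI

client = OpenAI()

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than "
    "the ball. How much does the ball cost?"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: any non-reasoning chat model
    messages=[{"role": "user", "content": question + "\n\nLet's think step by step."}],
)
print(response.choices[0].message.content)
```

The o1 models bake a much more elaborate version of this pattern into the model itself via reinforcement learning, rather than relying on the prompt.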
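
The API highlights describe three constraints for o1-preview: only user and assistant messages, a much larger output token allowance, and hidden “reasoning tokens” that are billed as output tokens. Below is a minimal sketch of a call that respects those constraints; the max_completion_tokens parameter and the usage field names are as documented by OpenAI at the o1-preview launch, so treat them as assumptions rather than details confirmed in this post.

```python
# Sketch of an o1-preview chat completion call based on the constraints
# described above: no system prompt, and a generous completion budget so
# the hidden reasoning tokens have room to run.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        # Only user and assistant roles are accepted; no system message.
        {
            "role": "user",
            "content": "Plan a zero-downtime migration of a 2 TB PostgreSQL "
                       "database to a new region, step by step.",
        }
    ],
    # o1 models take max_completion_tokens rather than max_tokens; reasoning
    # tokens count against it, and OpenAI suggests leaving roughly 25,000 of
    # the budget for them on prompts that benefit from these models.
    max_completion_tokens=32768,
)

print(response.choices[0].message.content)

# The reasoning tokens never appear in the response, but the usage object
# reports how many were spent (and billed).
print(response.usage.completion_tokens_details.reasoning_tokens)
```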
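
Finally, the RAG tip above boils down to trimming retrieved context before it reaches the prompt. A minimal sketch of that idea, with a hypothetical list of retriever scores standing in for a real retrieval step:

```python
# Limit additional context in RAG: keep only the highest-scoring chunks so
# the model isn't tempted to overcomplicate its response.
scored_chunks = [  # (relevance score, chunk) pairs from a hypothetical retriever
    (0.91, "Invoices are generated on the 1st of each month."),
    (0.47, "The billing service is written in Go."),
    (0.12, "Office plants are watered on Fridays."),
]

TOP_K = 2  # include only the most relevant chunks

context = "\n\n".join(
    chunk for _score, chunk in sorted(scored_chunks, reverse=True)[:TOP_K]
)

prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    "Question: When are invoices generated?"
)
print(prompt)
```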