September 20, 2024

ChatGPT o1: What’s cool, what’s hype, and what happens next

Last Thursday, OpenAI announced the launch of its not-so-secret “Strawberry” model, now called o1. It’s been live for a little over a week and the dust has started to settle, so let’s break down what makes this launch exciting, what the new model actually does, and what the future might look like with AI that “thinks.”

What is o1?

o1 is OpenAI's newest family of models, representing a significant leap forward in AI capabilities – it can reason. The family has two variants: o1-mini and o1-preview. Here’s what sets these models apart from their predecessors:

  • Advanced Reasoning: o1 models spend more time "thinking" before responding. They can break down complex challenges into more manageable components, allowing for deeper analysis and more thoughtful outputs.

  • Complex Problem Solving: Advanced reasoning means they can tackle complex tasks that stumped previous generations of models.

    For example, GPT-4o fails spectacularly when asked to create a grammatically correct sentence that doesn’t use the same word twice. o1-preview thinks for 34 seconds, provides potential solutions, refines them, then generates a sentence that passes with flying colors.

  • Specialized Strengths: o1 excels in coding, mathematics, and scientific reasoning, outperforming GPT-4o, Claude, Gemini, and other frontier models in these areas.

  • Trade-offs: While powerful, o1 models aren’t the best at everything. For example, they’re slightly worse than GPT-4o when given certain writing tasks.

  • Accessibility: Available to paid users at no extra cost, but with significant usage limits to manage demand.

Current Limitations

OpenAI’s o1 also has some important limitations to be aware of:

  • No web browsing capabilities
  • Unable to upload or process files or images
  • Lacks code interpreter (advanced data analysis) functionality
  • Cannot generate images using DALL-E

These limitations may be temporary, as OpenAI often introduces new features over time.

Overall, the release of o1 represents a break from OpenAI’s previous consumer (non-API) releases – in the past, the most recently released model has by and large been the top-performing model for almost every task. Now, o1 is the premier model for some tasks while GPT-4o remains the best for others, requiring users to decide which model is right for their task.

Bird’s-eye view: What’s groundbreaking vs. just hype

As with every OpenAI launch, o1 represents a mix of genuine advancement and marketing hype:

Real breakthroughs:

  • The model is trained to spend more time reasoning, which mimics human "thinking," and opens up new frontiers for problem-solving.

  • The model also has the ability to refine its thinking, try different strategies, recognize mistakes, and recover from those mistakes.

Just hype (for now):

  • When chatting with o1 models, users can see the “thinking” section before the response. However, this isn’t always a faithful representation of what the model is actually thinking. It’s a smart gimmick for your benefit – to help you understand something is happening while you wait.

  • The o1-preview model is just that, a preview, so we aren’t seeing the model’s full potential. OpenAI seems to take a staged approach to releasing models: to understand potential safety impacts, to gather data, and to manage user expectations.

What they mean when they say “advanced reasoning” or “thinking”

"Advanced reasoning" is more marketing shorthand than a literal description of the model's processes. In essence, o1 is designed to spend more time processing information before responding. This means it can handle more complex user inputs and mimic human problem-solving approaches.

When you see “thinking” on the screen, what it’s actually doing is using a chain of thought to attempt to solve your problem, similar to how a human might think for a while before responding to a question. OpenAI has honed and refined o1’s problem solving strategies during its training so that it can recognize mistakes and then correct them, break down tasks into smaller, more manageable ones, and try a different approach if the current one isn’t working.

The biggest impact on the user is that you might wait anywhere from a few seconds to a full minute for the model to answer your query or complete your task – and then you get a more focused and accurate outcome. This is most impactful in a few niche areas:

  • Academic analysis: Unlike GPT-4o, o1 is on par with PhD students in physics, chemistry, and biology benchmarks. This represents a significant leap forward in the model's ability to understand and apply complex scientific concepts.
  • Advanced education: o1 could serve as a more advanced tutor, capable of adapting its teaching style to individual students' needs and tackling complex subjects with ease.
  • Mathematical and coding expertise: The model shows exceptional capabilities in solving complex mathematics problems and complicated coding tasks. While it can’t currently run the code it generates, the code it generates is exceptional and thorough, making it great for data analysts and developers.
  • Legal analysis: The model's ability to consider multiple factors and precedents could be valuable in legal research and case analysis.
  • Financial work: o1's mathematical prowess could revolutionize how we approach financial forecasting and risk assessment.
  • Medical implications: While not a replacement for human doctors, o1 could assist in analyzing complex medical cases and suggesting potential diagnoses or treatment options.

These high-level capabilities probably won’t significantly impact daily tasks for most ChatGPT users, who aren’t regularly engaging in PhD-level research or complex coding projects. For routine tasks like writing assistance, general knowledge queries, or creative brainstorming, most users will likely find o1 less suitable than GPT-4o.

However, there’s one more common use case where o1 may be a significant step forward from GPT-4o – AI as a thought partner. To test this, we took one of the prompts from our AI for Strategic Decision-Making course and tested it against GPT-4o and o1-preview.

We used the prompt: “I want to understand the industry benchmarks for free trial – what should I expect for: conversion rate to free trial and conversion rate to purchase among free trial users. Include data for opt-in and opt-out of a free trial.” Here’s how GPT-4o and o1-preview responded:

GPT-4o

The response from GPT-4o isn’t necessarily wrong, but it’s generally basic information without much nuance or deeper insight. This might be fine if you want to check some quick assumptions, but there’s not a lot of color in the response.

o1-preview

The model thought for 11 seconds, offering insights into the thought process as it tackled the task. Its response was too extensive to include here but it was incredibly comprehensive.

Notably, o1-preview provided some unsolicited breakdowns, such as distinguishing between B2B and B2C scenarios. This level of detail, which GPT-4o missed, emerged naturally. The model even elaborated on free trial variations, considering cases where upfront payment might be required.

This depth of analysis demonstrates o1-preview’s ability to break down problems and to anticipate and address nuanced aspects of a task without explicit direction.

How to prompt this new model

o1’s release has one additional major implication – it represents a fundamental shift in how we interact with LLMs.

With o1's built-in reasoning capabilities, OpenAI has eliminated the need for explicit prompting techniques traditionally relied upon with previous models. You no longer need to instruct the model to think step-by-step, provide reasoning, or pause before responding because OpenAI has integrated these prompts into o1's training.

Here’s how to prompt o1 successfully:

  1. Keep it simple and direct: o1 doesn't need the same level of guidance as previous models. Instead of detailed, step-by-step instructions, try more open-ended prompts that allow o1 to leverage its reasoning capabilities and decide how it’ll tackle a problem on its own.
  2. Avoid unnecessary prompting phrases: Phrases like "think step-by-step" or "explain your reasoning" are now redundant. o1 will do this automatically.
  3. Focus on the core question or task: Minimize context or background information. o1 is capable of inferring necessary context, so concentrate on clearly stating your main query or objective.
  4. Embrace complexity: Don't shy away from asking complex, multi-faceted questions. o1 thrives on challenges that require deep analysis and interconnected thinking.
  5. Be patient: o1 might take longer to respond. This isn't a flaw – it's a feature that allows for more thorough and considered responses.
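To make point 2 concrete, here’s a minimal sketch of what trimming the old chain-of-thought scaffolding out of a legacy prompt might look like before sending it to o1. The `strip_scaffolding` helper and its phrase list are illustrative examples we made up for this post, not an OpenAI utility:

```python
# Illustrative sketch: remove explicit reasoning instructions that o1
# now handles internally. The helper name and phrase list below are
# hypothetical, not part of any official OpenAI tooling.

SCAFFOLDING_PHRASES = [
    "Let's think step by step.",
    "Think step-by-step.",
    "Explain your reasoning.",
    "Take a deep breath before answering.",
]

def strip_scaffolding(prompt: str) -> str:
    """Drop legacy chain-of-thought phrases from a prompt."""
    for phrase in SCAFFOLDING_PHRASES:
        prompt = prompt.replace(phrase, "")
    # Collapse any doubled spaces left behind by the removals.
    return " ".join(prompt.split())

legacy_prompt = (
    "Think step-by-step. Explain your reasoning. "
    "What pricing model should a B2B SaaS free trial use?"
)
print(strip_scaffolding(legacy_prompt))
# -> "What pricing model should a B2B SaaS free trial use?"
```

In practice you’d pass the cleaned prompt straight to the model; the point is simply that this scaffolding adds nothing for o1 and only makes prompts noisier.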

The bottom line

The shift from GPT-4o to o1 isn't a simple upgrade where everything just gets better. It's more like switching from a Swiss Army knife to a specialized tool set.

This marks a significant turning point in the evolution of LLMs. We can expect to see more specialized models that rapidly improve their reasoning capabilities. Here are three macro implications.

The end of the “one size fits all” model?

Unlike the clear progression from GPT-3.5 to GPT-4 to GPT-4o, where each new model consistently outperformed its predecessor across tasks, o1 is a departure from this trend. A task that worked well with GPT-4o might not yield the best result with o1.

Does this signal the end of “general-purpose” models? It’s unlikely, but it does introduce a new challenge: determining which model is best suited for a specific task. My take is that OpenAI will eventually automate this decision-making by routing requests to the most appropriate model. But at least for now, we’ll have to think critically about which model to use for a given task.

Maintaining the human in the loop

As o1 and other models take on more complex reasoning and spend more time “thinking” before responding, the human as an overseer becomes more critical than ever. Since the current reasoning shown to users doesn't actually reflect the model’s true process, it’s crucial that users critically evaluate outputs.

An increased focus on latency

In the future, LLMs will likely ponder problems for much longer than a few seconds or minutes. We might even get to the point where models “think” for hours, days, weeks, or even years. This extended “thinking time” introduces a new variable in our interaction with AI:

  • Task Selection: We’ll need to be more discerning about which tasks we assign to which model. Poorly defined instructions or unsuitable tasks could waste real time as thinking stretches from minutes to hours.
  • Workflow Adaptation: Unlike the quick back and forth possible with earlier models, working with more advanced reasoning models may require us to adjust our workflows to accommodate the longer thinking time.

The bottom line is: unlike with earlier leaps in model capability, don’t jump straight to using o1 for all your LLM tasks. Historically, once the hype around a new model settles, everyday users can easily identify practical applications. With o1's release, we're in uncharted territory.

Greg Shove
Chase Ballard