May 15, 2024

GPT-4o: What’s cool, what’s hype, and what happens next

On Monday, OpenAI announced the launch of its new and improved model, GPT-4o. Let’s break down what makes this launch exciting, why it’s making big tech sweat, and what we should be worried about (if anything).

What is GPT-4o?

GPT-4o is OpenAI’s newest flagship model – here’s the TL;DR on what it improves upon:

  • Same GPT-4-level intelligence, but much faster responses
  • Accepts prompts and provides answers in text, voice, and visual modes (aka, you’ll be able to turn on your camera and talk to it like a coworker)
  • Free for everyone, though free users will have capacity limits
  • Includes a desktop app (not just a browser version)
  • Speaks 50 different languages and in multiple tones, including sarcasm, delight, and even singing

This new model is being rolled out to ChatGPT Plus and Team users now, with availability for Enterprise users coming soon. For now, it looks like only the desktop and mobile apps will have full multi-modal capabilities, and those will be rolling out to all users in the coming weeks.

What’s exciting vs. what’s just marketing

Some of these product features are truly groundbreaking, and some are padding out the press release. Here’s our take.

Breakthrough:

  • Multi-modal capabilities in one platform (voice, text, vision)
  • Much faster responses, enabling real-time conversation
  • Can speak in 50 different languages
  • Free (with capacity limits)
  • Desktop and mobile apps, which will certainly enhance the usefulness of ChatGPT

Just marketing (for now):

  • Singing, “sarcasm”, and other voices – it’s a cool trick, but we’re not convinced that it adds much to the user experience for ChatGPT to sound snarky.
  • The "linear algebra" demo made the problem sound very complex, but it was actually quite straightforward. Real-world tests will determine if the model has significantly improved advanced reasoning capabilities.
  • Two GPT-4o’s interacting and singing felt more like a gimmick than a useful application, and the demos were hard to watch.

The biggest breakthrough: Multi-modal in one platform

The biggest breakthrough by far is that GPT-4o is “natively multi-modal”. This means it can answer and understand the world through voice, text, and images, all in one interface.

So instead of having to type out context for a problem you want ChatGPT to solve, you can turn on your camera and show the model the problem in real time – and get responses almost instantaneously. The new desktop app will also give ChatGPT the ability to analyze desktop screens and take screenshots for anything you want to discuss.
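If you want to poke at this multi-modal input yourself, the same text-plus-image prompting is available to developers through OpenAI's API. Here's a minimal sketch, assuming the openai Python SDK (v1+) is installed, your API key is set in the OPENAI_API_KEY environment variable, and the image URL is just a stand-in:

# Minimal sketch: send text plus an image to GPT-4o and print its answer.
# Assumes the `openai` Python package (v1+) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is going on in this screenshot?"},
                # Hypothetical image URL; replace with your own image or screenshot
                {"type": "image_url", "image_url": {"url": "https://example.com/screenshot.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)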

(Note: take all company-provided demos with a grain of salt until the model has fully rolled out.)

For us, this has huge implications for AI coaching, teaching, and engaging with the world as it changes. Imagine GPT-4o sitting in a meeting and adapting to a developing crisis as quickly as a coworker would, or a child having a 24/7 tutor that can adapt its teaching style to their needs.

To echo Ethan Mollick, this model also throws open the doors of accessibility. The fact that it’s free and available in dozens of languages means a critical mass of people will be able to benefit from the leading LLM, not just GPT-3.5.

But wait … why is it free?

OpenAI has made the GPT-4o model available to everyone, including their free users. Free users will have a limit on the number of messages they can send with GPT-4o. Once that limit is reached, they’ll be switched over to GPT-3.5 (which we’ve noted before is a seriously lackluster experience).

“Plus” users, on the other hand, will have a 5x higher message limit than free users. Team and Enterprise users will have even higher limits. We don’t know what the limit will be for free users yet, but OpenAI will have to make it low enough to entice people to the paid version, since the quality is the same.

How is this possible from a business model perspective?

Option 1: They lowered the cost of running the model. OpenAI gave a big shout-out to Nvidia and Jensen Huang, so we assume they’ve made serious improvements to the cost of serving the model. But we shouldn’t forget that Microsoft’s significant investment in OpenAI gives them the runway to operate at a loss to gain market share and improve their models. They have a track record of prioritizing consumer growth over immediate profitability.

Option 2: Their growth focus is enterprise buyers and developer access. Given the huge demand for GPT-3 and GPT-4 API access, making GPT-4o free in ChatGPT (and cutting its API price) is likely a strategic move to drive mass adoption in the face of rising competition from Gemini and Claude.

Option 3: We’re paying for this with our data. With access to our desktops, cameras, and microphones, OpenAI has a vast well of data to train the next generation of models on. That data well gets more valuable as they open the aperture and attract more free users. It's worth noting that access to microphones and cameras isn't new – these capabilities have been in OpenAI's mobile app for some time – and the intention behind the data collection isn’t to sell it for targeted advertising. But a good rule of thumb is: “If you're not paying for it, you're not the customer; you're the product being sold." In OpenAI's case, the focus isn’t on monetizing user data directly, but on using our contributions to make its models more valuable.

What this means for the AI space

We will see this play out rather quickly – here are our predictions:

Prediction #1: More free, best-in-class models are coming soon. In a defensive response, other frontier model developers will likely make their best models free as well. However, they’re unlikely to recover market share: OpenAI's proactive strategy lets it capitalize on growth, while competitors are forced to react and adapt.

Prediction #2: Apple will enhance Siri with GPT-4o's capabilities (hopefully at WWDC in June). The real differentiator for Apple will be in how they handle privacy and performing actions with Siri (what the Humane Pin and Rabbit r1 should have been).

Prediction #3: Google will stumble – again – by trying to make augmented reality glasses happen. Google already missed the bus with LLMs, and while they debuted a multimodal AI assistant this week at Google I/O, there’s a catch: they’re marketing it for use with AR glasses, which no one is rushing to buy.

Prediction #4: ChatGPT’s desktop app will soon be able to take control of a computer and perform actions. Letting LLMs run code on a computer to complete tasks seems like the natural next step.

Prediction #5: Multi-modal input will enable LLMs to build their own, improved world models by learning about the real world through video and text, and video generation will start to improve rapidly with new training data from users.

Greg Shove
Chase Ballard