It wasn’t long ago that open-source models were eating the dust of GPT-3.5 and Claude, but the tables have started to turn. We tested four models for ourselves to see how they stack up.
We kept it simple — no multimodal stuff like vision or audio processing, just good old-fashioned natural language. Each model got the same prompt: “Imagine you are a third grade teacher. Can you explain the theory of relativity to a third-grade class?”
TL;DR: Our favorite (and least favorite) open-source models
Gemma takes the top spot for providing a clear outline, thought-provoking discussion questions, and by far the most succinct and easy-to-grasp definitions of the prompt’s key concepts.
LLaVa takes last place for its lack of clear and concise information. More than any other model, it sacrificed clarity to achieve a specific tone.
Setting a baseline: How Claude Opus, GPT-4, and Gemini Pro performed on the same task
Before we determine which open-source models aced the test, let’s set a baseline with the big three LLMs (Claude, GPT, and Gemini).
Claude Opus
Our grade: 7.5/10
Where it excelled: It had great narrative structure tailored for the audience of the prompt. It used one consistent analogy to describe the entire theory.
Where it fell short: It prioritized the narrative over clarity in some places which made the analogy less useful in understanding the concept.
ChatGPT (GPT-4)
Our grade: 9/10
Where it excelled: It provided more relatable analogies that would be easy for the audience provided in the prompt to understand. It started simple and built upon its explanations to get more complex.
Where it fell short: Its response is pretty hard to fault, but if we were nitpicking we’d say there were some parts of the theory it only alluded to rather than fully explaining.
Gemini
Our grade: 8/10
Where it excelled: It used the same narrative style and analogies as Claude but was much more conversational and clear in explaining the concept.
Where it fell short: It provided discussion suggestions that were tangential to the prompt. It also referenced examples that are likely too complex for the audience we provided.
The best of the big three
Each of these is able to adequately break down complex concepts using storytelling and analogies, but GPT-4 provided the most accessible answer for novices with the most relatable context.
Rating the open-source models
When we talk about model size, we’re talking about the number of parameters — the internal values a model learns and adjusts during training (2 billion, 7 billion, etc.). More parameters generally means more capacity to capture knowledge, but smaller models can still pack a punch because it’s not just about size – efficiency and performance matter too.
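To make “parameters” concrete, here’s a toy sketch (the network shape here is made up purely for illustration) showing how weights and biases add up in a simple fully connected model — LLMs scale this same idea into the billions:

```python
def count_parameters(layer_sizes):
    """Count weights + biases for a simple fully connected network.

    layer_sizes: e.g. [512, 256, 10] means 512 inputs feeding
    256 hidden units feeding 10 outputs.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one learned weight per input-output connection
        total += n_out         # one learned bias per output unit
    return total

# Even a tiny three-layer network racks up parameters quickly:
# 512*256 + 256 + 256*10 + 10 = 133,898
print(count_parameters([512, 256, 10]))
```

Every one of those counted values gets tuned during training, which is why parameter count is the standard shorthand for model size.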
Mistral Medium
Creator: Mistral, a French startup founded in April 2023 by former employees of Meta and Google DeepMind.
Model Size: Medium
Our grade: 6/10
Where it excelled: Uses a relatable and conversational tone, gives examples appropriate for the prompt.
Where it fell short: The analogies it uses are not very clear. It doesn’t quite break down the concepts simply enough or provide any kind of summary to reiterate the main points.
Our Take: Mistral's language model is a reliable, customizable choice for those who want to fine-tune their AI assistant to their preferences. It's been around for a while, offering smaller, open-source models that can even run on your laptop. However, if you need an instant, off-the-shelf solution, you might want to look elsewhere for now.
Gemma
Creator: Google
Model Size: Small and Medium
Our grade: 8/10
Where it excelled: Very clear, succinct, and digestible organization of information. Provided discussion questions that were more relevant than Gemini’s.
Where it fell short: While it explains things very simply, it doesn’t use analogies or comparisons to help illustrate a concept. Provides more of an overview than a teachable narrative.
Our Take: Gemma is our go-to choice for ready-to-use AI language models. It's an assistant that doesn't need much training, perfect for building your own AI applications without extra bulk. Versatile, efficient, and ready to tackle almost any task, Gemma is the star player for creating something amazing with open-source language models.
DBRX
Creator: Databricks
Model Size: Large
Our grade: 7/10
Where it excelled: Clear, no-frills response that gets to the point immediately. Uses simple and concrete examples that make the concepts digestible.
Where it fell short: Doesn’t provide any kind of summative statement to wrap up the key learnings.
Our Take: Despite its impressive 132 billion parameters, DBRX couldn't quite match Gemma's performance in our tests. While DBRX has the brawn, Gemma has the brains and finesse to deliver the results we needed. This proves that bigger isn't always better in AI – sometimes, a smaller, more focused model like Gemma can outperform larger counterparts by better understanding and responding to specific needs.
LLaVa 1.5
Creators: Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee
Model Size: Medium-Large
Our grade: 5/10
Where it excelled: It didn’t really excel at anything, but it does achieve a friendly and approachable tone. It attempted analogies to help simplify a complex topic, but they weren’t very successful.
Where it fell short: Its analogies confuse more than help. It introduces confusing concepts – such as things falling upwards and “special clocks” – and never completely explains anything. A summary statement would make its response more successful.
Our Take: While LLaVa's text generation left room for improvement in our tests, its true potential may lie in its vision capabilities, which we didn't test here. If you need a model that can tackle both text and images, LLaVa is a decent starting point. A little fine-tuning can help it adapt to your specific needs.
Are open-source models eclipsing the frontier?
Our final verdict: Almost. A couple of these open-source responses were nearly as good as the ones provided by the big three, but none of them achieved the narrative and tone that Opus, GPT-4, and Gemini were able to capture.
Of course, there is more to consider than just how it does with a basic prompt. If you’re contemplating going with an open-source model, look at things like cost per million tokens, overall speed, and the ability to customize and control the model to your specific use case.
And if you’re interested in how these stack up with some of the other chatbots on the market, check out our LLM guide.