A Brief History of Local Llamas

"TinyLlama logo" by The TinyLlama project. Licensed under Apache 2.0

My Background

Most my knowledge of this field comes from a few guest lectures, and the indispensable r/localllama community, which always has the latest news about local llamas. I’ve become a fan of the local llama movement in December 2023, so the “important points” covered here are coming from a retrospective.

Throughout this piece, the terms “large language model” and “llama” are used interchangeably. Same goes for the terms “open source and locally hosted llama” and “local llama”.

Whenever you see numbers like 7B, that means the llama has 7 billion parameters. More parameters means the model is smarter but bigger.

Modern History

The modern history of llamas begins in 2022. 90% of sources llama papers cite these days are from 2022 onwards.

Here’s a brief timeline:

March 2022: InstructGPT paper is pre-print.
November 2022: ChatGPT is released.
March 2023: LLaMA (open source) and GPT4 is released.
July 2023: LLaMA 2 is released, alongside GGUF quantization.
August 2023: AWQ quantization paper.
September 2023: Mistral 7B is released.
December 2023: Mixtral 8x7B becomes the first MoE local llama.

Early 2022

In March 2022, OpenAI released a paper improving the conversational ability of their then uncontested GPT3. Interest in InstructGPT was largely limited to the academic community. As of writing, InstructGPT remains the last major paper OpenAI released on this topic.

The remainder of 2022 was mostly focused on text-to-image models. OpenAI’s DALL-E 2 lead the way, but the open source community kept pace with Stable Diffusion. Midjourney eventually ends up on top by mid-2022, and causes a huge amount of controversy with artists by winning the Colorado State Fair’s fine-art contest.

Late 2022

In late November 2022, OpenAI released ChatGPT, which is generally speculated to be a larger version of InstructGPT. This model is single-handedly responsible for the NLP Boom. ChatGPT revolutionized the corporate perception on chatbots and AI in general. It was considered a form of disruptive innovation in the search engine market, leading Google to hold record layoffs in January 2023.

At this point, the local llama movement practically didn’t exist. Front-ends, especially for chatbot role-play as later exemplified by SillyTavern, began development, but they were all still running through the OpenAI’s ChatGPT API.

Early 2023

In March 2023, Meta kick-started the local llama movement, by releasing LLaMA a 65B parameter llama that was open source! Benchmarks aside, it was not very good. ChatGPT continued to be viewed as considerably better. However, it provided a strong base for further iteration, and gave the local llama community a much stronger model than any other local llama at the time.

Around this time, OpenAI released GPT4, a model that undeniably broke through all records. In fact, GPT4 remains the best llama, as of December 2023. The original ChatGPT is now referred to as GPT3.5, to disambiguate it from GPT4. This caused much of the local llama community to continue focusing on building frontends, while using GPT4’s API for inference. Nothing open source was even remotely close to GPT4 at this point.

Mid 2023

We finally see the local llama movement really take off around August 2023. Meta released LLaMA2, which has decent performance even on its 7B version. One key contribution of LLaMA2 was the GGUF quantization format. This format allows a model to be run on a mix of RAM and VRAM, which meant many home computers could now run 4-bit quantized 7B models! Previously, most enthusiasts would have to rent cloud GPUs to run their “local” llamas. Quantizing into GGUF is a very expensive process, so TheBloke on Huggingface emerges the defacto source for pre-quantized llamas.

Based on LLaMA, the open source llama.cpp becomes the leader of local llama inference backends. Its support extends far beyond only running LLaMA2, it’s the first major backend to support running GGUF quantizations!

In addition, the Activation-aware Weight Quantization (AWQ) paper is released at this time. It uses mixed-quantization to increase both the speed and performance of quantized models. This is especially true for very heavily quantized models like 4-bit quantization, that has become the standard for the local llama community at this point. AWQ lacks support anywhere at this time.

In late September 2023, out of nowhere came a French startup with a 7B model that made leaps on top of LLaMA2 7B. Mistral remains the best local llama until mid-December 2023. Huge work in improving model tuning, particularly character creation and code-assistant models, is done on top of Mistral 7B.

Late 2023

It’s worth noting that at this point, there is still no local competitor to GPT3.5. But that was all about to change on December 11th 2023, when Mistral released Mixtral 8x7B. Mixtral is the first major local llama to use the same technique as GPT4; a mixture of experts (MoE). While about 1/40th the speculated size of GPT4 and 1/3rd of GPT3.5, Mixtral is able to go toe-to-toe with GPT3.5 both in benchmarks and user reviews. This is hailed as a landmark achievement by the local llama community, demonstrating that open source models are able to compete with commercially developed ones. The achievement is amplified by comparing Mixtral against Google freshly-unveiled Gemini models in the same week. R/localllama reviews generally suggest Mixtral pulls ahead of Gemini in conversational tasks.

Apple unexpectedly joins the local llama movement, by open-sourcing their Ferret model in mid-December. This model builds upon LLaVA, previously the leading multi-modal llama for images in the local llama community.

In very late December, llama.cpp merges AWQ support. In the coming year, I expect AWQ to largely replace GPTQ in the local llama community, though GGUF will remain more popular in general.

Just 3 days from the end of the year, TinyLlama releases their 3 trillion token checkpoint, on their 1.1B model. This miniscule model sets a new lower-bound for the number of neurons required to make a capable llama, enabling more users to locally host their llama. In practice, TinyLlama easily goes toe-to-toe with Microsoft’s closed-source Phi-2 2.7B released just a few weeks earlier. This is a huge win for the open source community, demonstrating how open source models can get ahead of commercial ones.

Going Forward

With the release of Mixtral, the local llama community is hoping 2024 will be the turning point where open source models finally break ahead of commercial models. However as of writing, it’s very unclear how the community will break through GPT4, the llama that remains uncontested in practice.

Early 2024

This is where we currently are! Hence, things are just dates for now. We’ll see how much impact they have in a retrospective:

2024-01-22: Bard with Gemini-Pro defeats all models except GPT4-Turbo in chatbot arena. This is seen as questionably fair, since bard has internet access.
2024-01-29: miqu gets released. This is a suspected Mistral_Medium leak. Despite only having a 4bit-quantized version, it’s ahead of all current locallamas.
2024-01-30: Yi-34B is the largest local llama for language-vision. LLaVA 1.6 based on top of it sets new records in vision performance.
2024-02-08: Google releases Gemini Advanced, a GPT4 competitor with similar pricing. Public opinion seems to be that it’s quite a bit worse that GPT4, except it’s less censored and much better at creative writing.
2024-02-15: Google releases Gemini Pro 1.5, with 1 million tokens of context! Third party testing on r/localllama shows it’s effectively about to query very large codebases, beating out GPT4 (with 32k context) on every test.
2024-02-15: OpenAI releases Sora, a text-to-video model for up to 60s of video. A huge amount of hype starts up around it “simulating the world”, but it’s only open to a very small tester group.
2024-02-26: Mistral releases Mistral-Large and simultaneously removes all the mentions of a commitment to open source from their website. They revert this change the following day, after the community backlash.
2024-03-27: DataBricks open sources DBRX, a 132B parameter MoE with 36B parameters active per forward pass. It was trained on 12T tokens. According to user evaluation, it beats Mixtral for all uses.
2024-04-18: Meta releases LLaMA3 8b and 70b. 70b is the new best open model, right around Claude3 Sonnet and above older gpt4 versions!
2024-05-13: OpenAI releases gpt-4o. This multimodal model is able to take in different forms of input and output, like speech and text, no model-chain required. It also beats all other models comfortably
2024-05-28: Gemini-pro-1.5 is second only to gpt-4o, all other gpt-4 models are ranked lower, but not for english
2024-05-29: Mistral releases codestral, a 22B closed model. User reviews have put it above gpt4o and opus for coding (me too)
2024-06-20: Anthropic releases Sonnet 3.5. Despite being the medium model, it’s ranked higher for coding on lmsys than gpt-4o and user reviews vastly prefer its voice
2024-07-16: Mistral releases codestral-mamba 7B, with up to 256k “tested” tokens of support. Codestral-mamba 7B is under Apache 2.0, marking the first major foundational model provider not using a transformers llama
2024-07-23: LLaMA 3.1 405B, an open weights model by Meta, finally definitively defeats GPT-4 across all relevant benchmarks!
2024-09-12: OpenAI releases their o1 series. These models use hidden autoregression to “reason” better across tasks. This has been well observed before, but open llamas don’t have access to reasoning-style data (expensive)
2024-09-19: Qwen2.5 is released, in every size imaginable. This model remains at the top of the open-weights scene until the end of the year.
2024-10-22: Anthropic releases a follow-up to Sonnet 3.5, with Sonnet 3.5 (new). Along with OpenAI’s AMA, the industry trend seems to be towards medium sized models as opposed to large ones (Gemini Ultra, Claude Opus, GPT-5).
2024-11-08: Google finally releases an OpenAI-API compatible endpoint for their gemini models, joining Mistral and Ollama in OpenAI-API compatibility
2024-11-14: Gemini Exp 1114 breaks through chatbot arena, ahead of even OpenAI’s o1 models! They keep it up with 1206 and Flash2.0
2024-11-27: QwQ, a model from the Qwen team, is released, becoming the first major “reasoning” open-weights llama. It’s great!
2024-12-20: o3 from openai is released, setting new records on the ARC-AGI benchmark. More importantly though, the community has recognized we’re firmly in time of inference-time compute, instead of pretraining
2024-12-23: QvQ, a multi-modal QwQ is released, being the first major reasoning open-weights llama with multi-modality. It’s also great!
2024-12-26: Deepseek v3 becomes the largest open weights model released. It beats even sonnet 3.5 in several coding benchmarks!
2025-01-20: Deepseek R1 (based on v3) is released. It beats OpenAI’s own o1 models at their own benchmarks, marking the first time an open model definitively beats closed ones. It causes a wave of commitment to localllamas from big companies, including mistral switching back to Apache 2.0 and even OpenAI considering releasing something (TBA).
2025-02-07: The NoLiMa paper is released. This study demonstrated that all llamas still significantly underperform on long contexts.
2025-02-27: OpenAI releases GPT-4.5, largely speculated to be a failed GPT-5. Despite being the largest model released by a leading AI company, it’s underwhelming, demonstrating that scale is certainly not all llamas need.
2025-03-12: Google releases Gemma 3 27B, with a massive 128k context! It outperforms all open models at it size, though not too revolutionary.
2025-03-25: Google releases Gemini 2.5 Pro, which now does reasoning as well. It beats everything on benchmarks and marks the last big player to use reasoning.
2025-04-29: Qwen3 is released. They’re well received by the community, notably for having very sparse MOE architectures. They also released a 600m parameter model, which is quite coherent and runs at 45t/s on cpu!
2025-05-21: Mistral releases devstral. This model is built to work with the agentic OpenHands framework. Community feedback is generally negative.
2025-05-28: Deepseek updates R1 quietly, though this model ends up outperforming everything, even Gemini 2.5 Pro (at that point).
2025-05-30: “How much do language models memorize?” paper is released. This paper finds a compression rate of about 3.6bits per weight. While a terrible privacy paper, this finding is quite helpful in understanding what scale does to llamas.
2025-06-09: Apple releases the “The Illusion of Thinking” paper, which largely claims that reasoning doesn’t help. This is very controversial.

Going Forward

There’s surprisingly little work in multimodal localllamas as of now. It’s likely the next year will see improvements for voice and image support. Piratically of note is llama.cpp/ollama’s lack of support for audio. Voice-to-voice models are likely to also become more widespread in the community.

A largely unrelated front is reasoning. Aside from QwQ, there aren’t any major reasoning local models, despite the o1 series having some of the heighest zero-shot performance of all llamas. It’s likely the local community will push on this frontier as well, though it doesn’t seem to be as important.