GGUF vs MLX: Two Takes on the Local LLM Model Format

6 minute read

If you have ever downloaded a model to run on your own machine, you have met the
two formats that dominate the conversation today: GGUF and MLX. They both
solve the same surface problem — “package a large language model so it loads fast
and runs efficiently on hardware I actually own” — but they come from very
different worlds. GGUF grew out of a cross-platform, CPU-first inference engine.
MLX is Apple’s bet on its own silicon. Understanding where each one came from
explains almost every trade-off between them.

A Bit of History

GGUF: born from llama.cpp and GGML

GGUF (GPT-Generated Unified Format) is the file format used by
llama.cpp, Georgi Gerganov’s C/C++
inference engine. To understand GGUF you have to understand what came before it.

When llama.cpp first appeared in early 2023, models were stored in GGML
(GPT-Generated Model Language) files. GGML was wonderfully simple — a binary
header followed by raw tensor data — but that simplicity was also its problem.
Each model architecture had its own loading code, metadata lived in separate
files or was hard-coded into the loader, and the header layout changed often
enough that a file produced by one version of llama.cpp would silently fail to
load in the next. There was an intermediate format, GGJT, but it never fully
solved the versioning mess.

On August 21, 2023, the llama.cpp team introduced GGUF as a
backwards-incompatible successor. The key idea was to make the file
self-describing: architecture details, tokenizer data, quantization parameters,
and the quantized weights all live in a single, portable file with a structured
key/value metadata block. No side-car configs, no hard-coded assumptions, and an
extensible layout that could evolve without breaking old files. Within a few
months it became the de-facto standard for distributing local models — go to
Hugging Face and you will find tens of thousands of *-GGUF repos.

MLX: Apple builds for its own silicon

MLX came from a different direction
entirely. Released by Apple’s machine-learning research group in December
2023
, MLX is not primarily a file format — it is an array framework for
Apple silicon, in the spirit of NumPy, PyTorch, and JAX. It was authored by
Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert, and it
is built around the unified memory architecture of Apple’s M-series chips,
where the CPU and GPU share the same physical RAM and arrays can move between
them with no copy.

The “MLX format” people refer to when they say “an MLX model” is really a
.safetensors file plus MLX-specific metadata, produced by the
mlx-lm tooling. The mlx_lm.convert
command pulls an original model (typically Hugging Face safetensors), optionally
quantizes it with MLX’s group-size / bits-per-weight scheme, and writes it back
out as safetensors with the quantization recipe recorded in the metadata. So
where GGUF invented a brand-new container, MLX leaned on the already-popular,
safe-by-design safetensors format and layered its own conventions on top.

How They Actually Differ

Dimension GGUF (llama.cpp) MLX (Apple)
Origin llama.cpp / GGML lineage, 2023 Apple ML research, Dec 2023
Container Custom single-file binary safetensors + metadata
Target hardware Cross-platform: CPU, CUDA, Metal, Vulkan, ROCm, SYCL Apple silicon only
Self-describing Yes — tokenizer + arch + weights in one file Mostly; relies on safetensors + config
Quantization K-quants (Q4_K_M, Q5_K_M, Q6_K…), I-quants w/ imatrix Group-size + bits (e.g. 4-bit, mixed-bit recipes)
Primary runtime llama.cpp, Ollama, LM Studio mlx-lm, MLX Swift, LM Studio
Ecosystem size Huge — the default on Hugging Face Growing fast, Apple-centric

Quantization

GGUF’s quantization scheme is one of its biggest strengths. The K-quants
(Q4_K_M, Q5_K_M, Q6_K, and friends) use block-wise quantization with mixed
precision — the _M (“medium”) variants keep the most sensitive layers at higher
precision to preserve quality at almost no size cost. The newer I-quants push
compression even further using lookup tables, but they work best when paired with
an importance matrix (imatrix) calibrated on a representative dataset. The net
result is a rich menu of size/quality trade-offs, which is exactly why a single
model on Hugging Face often ships a dozen GGUF variants.

MLX takes a simpler, more uniform approach: you pick a group size and a
number of bits per weight (4-bit being the common default), with mixed-bit
recipes available for finer control. It is less of a sprawling menu and more of a
clean dial, which fits MLX’s “designed by researchers, for researchers” ethos.

Hardware and performance

This is where the philosophical split shows up in real numbers.

MLX is tuned for one thing and does it well. On Apple silicon, MLX is
frequently 30–50% faster than llama.cpp in the decode (token-generation)
phase, and optimized 7B models have been measured sustaining well over 200
tokens/sec. Because it owns the whole stack down to the Metal kernels and unified
memory, there is very little overhead.

But the win is workload-dependent. Benchmarks show MLX can spend a
disproportionate amount of time in the prefill phase (processing your prompt).
In one M1 Max test with a ~650-token prompt, MLX’s effective throughput
(prefill + decode combined) came out around 13 tok/s while the GGUF/llama.cpp run
hit ~20 tok/s — because MLX spent the bulk of its time chewing through the prompt.
The rule of thumb that emerges: MLX shines when output is long relative to input
(creative writing, summaries, long explanations), while GGUF/llama.cpp holds up
better when prompts are long and answers are short
(RAG, classification,
extraction).

And of course, GGUF runs everywhere. The same file works on a CPU-only Linux
box, an NVIDIA workstation via CUDA, an AMD card via ROCm, or a Mac via Metal.
MLX simply does not run off Apple silicon.

Pros and Cons

GGUF

Pros Cons
Truly cross-platform — one file runs on CPU, CUDA, Metal, Vulkan, ROCm, SYCL Not hardware-specialized — leaves performance on the table on Apple silicon vs MLX
Self-contained — tokenizer, architecture, and weights in a single portable file Quantization zoo can overwhelm — which of a dozen variants do you actually want?
Richest quantization menu — K-quants and imatrix I-quants for fine-grained size/quality control Conversion can be fiddly for newer/unusual architectures until llama.cpp adds support
Massive ecosystem — default on Hugging Face; supported by Ollama, LM Studio, and more  
Battle-tested from Raspberry Pis to server GPUs  

MLX

Pros Cons
Best-in-class speed on Apple silicon, especially for long-generation workloads Apple silicon only — zero portability to NVIDIA, AMD, or plain CPU boxes
Unified-memory native — no CPU↔GPU copies, efficient memory use on Macs Smaller ecosystem than GGUF, though catching up quickly
Clean, composable framework — a full NumPy/PyTorch-style array library, so training and fine-tuning are first-class Weaker on prefill-heavy / long-prompt workloads in current benchmarks
Builds on safetensors, a safe, well-understood container Quantization is simpler but less granular than GGUF’s K/I-quant lineup
Backed by Apple with fast-moving, active development (Swift/C APIs for native apps)  

Which Should You Use?

The honest answer is “it depends on your hardware and your workload,” but here is
the short version:

  • You ship to varied hardware, or run anything other than a Mac? GGUF. The
    portability and ecosystem are unbeatable, and “convert once, run anywhere” is a
    real advantage.
  • You live entirely on Apple silicon and care about maximum local speed,
    especially for long generations or fine-tuning?
    MLX. It was built for exactly
    your machine.
  • You are on a Mac but unsure? Tools like LM Studio and Ollama now
    support both, so it costs little to download a GGUF and an MLX build of the same
    model and benchmark your prompts. Real-world throughput depends heavily on your
    prompt-to-output ratio, so measure rather than guess.

Two formats, two philosophies: GGUF chose portability and a universal
container
, MLX chose depth on a single platform. Neither is “winning” — they
are optimizing for different things, and the local-LLM world is better for having
both.


Source repositories

Further reading

Categories:

Updated: