GGUF vs MLX: Two Takes on the Local LLM Model Format
If you have ever downloaded a model to run on your own machine, you have met the
two formats that dominate the conversation today: GGUF and MLX. They both
solve the same surface problem — “package a large language model so it loads fast
and runs efficiently on hardware I actually own” — but they come from very
different worlds. GGUF grew out of a cross-platform, CPU-first inference engine.
MLX is Apple’s bet on its own silicon. Understanding where each one came from
explains almost every trade-off between them.
A Bit of History
GGUF: born from llama.cpp and GGML
GGUF (GPT-Generated Unified Format) is the file format used by
llama.cpp, Georgi Gerganov’s C/C++
inference engine. To understand GGUF you have to understand what came before it.
When llama.cpp first appeared in early 2023, models were stored in GGML
(GPT-Generated Model Language) files. GGML was wonderfully simple — a binary
header followed by raw tensor data — but that simplicity was also its problem.
Each model architecture had its own loading code, metadata lived in separate
files or was hard-coded into the loader, and the header layout changed often
enough that a file produced by one version of llama.cpp would silently fail to
load in the next. There was an intermediate format, GGJT, but it never fully
solved the versioning mess.
On August 21, 2023, the llama.cpp team introduced GGUF as a
backwards-incompatible successor. The key idea was to make the file
self-describing: architecture details, tokenizer data, quantization parameters,
and the quantized weights all live in a single, portable file with a structured
key/value metadata block. No side-car configs, no hard-coded assumptions, and an
extensible layout that could evolve without breaking old files. Within a few
months it became the de-facto standard for distributing local models — go to
Hugging Face and you will find tens of thousands of *-GGUF repos.
MLX: Apple builds for its own silicon
MLX came from a different direction
entirely. Released by Apple’s machine-learning research group in December
2023, MLX is not primarily a file format — it is an array framework for
Apple silicon, in the spirit of NumPy, PyTorch, and JAX. It was authored by
Awni Hannun, Jagrit Digani, Angelos Katharopoulos, and Ronan Collobert, and it
is built around the unified memory architecture of Apple’s M-series chips,
where the CPU and GPU share the same physical RAM and arrays can move between
them with no copy.
The “MLX format” people refer to when they say “an MLX model” is really a
.safetensors file plus MLX-specific metadata, produced by the
mlx-lm tooling. The mlx_lm.convert
command pulls an original model (typically Hugging Face safetensors), optionally
quantizes it with MLX’s group-size / bits-per-weight scheme, and writes it back
out as safetensors with the quantization recipe recorded in the metadata. So
where GGUF invented a brand-new container, MLX leaned on the already-popular,
safe-by-design safetensors format and layered its own conventions on top.
How They Actually Differ
| Dimension | GGUF (llama.cpp) | MLX (Apple) |
|---|---|---|
| Origin | llama.cpp / GGML lineage, 2023 | Apple ML research, Dec 2023 |
| Container | Custom single-file binary | safetensors + metadata |
| Target hardware | Cross-platform: CPU, CUDA, Metal, Vulkan, ROCm, SYCL | Apple silicon only |
| Self-describing | Yes — tokenizer + arch + weights in one file | Mostly; relies on safetensors + config |
| Quantization | K-quants (Q4_K_M, Q5_K_M, Q6_K…), I-quants w/ imatrix | Group-size + bits (e.g. 4-bit, mixed-bit recipes) |
| Primary runtime | llama.cpp, Ollama, LM Studio | mlx-lm, MLX Swift, LM Studio |
| Ecosystem size | Huge — the default on Hugging Face | Growing fast, Apple-centric |
Quantization
GGUF’s quantization scheme is one of its biggest strengths. The K-quants
(Q4_K_M, Q5_K_M, Q6_K, and friends) use block-wise quantization with mixed
precision — the _M (“medium”) variants keep the most sensitive layers at higher
precision to preserve quality at almost no size cost. The newer I-quants push
compression even further using lookup tables, but they work best when paired with
an importance matrix (imatrix) calibrated on a representative dataset. The net
result is a rich menu of size/quality trade-offs, which is exactly why a single
model on Hugging Face often ships a dozen GGUF variants.
MLX takes a simpler, more uniform approach: you pick a group size and a
number of bits per weight (4-bit being the common default), with mixed-bit
recipes available for finer control. It is less of a sprawling menu and more of a
clean dial, which fits MLX’s “designed by researchers, for researchers” ethos.
Hardware and performance
This is where the philosophical split shows up in real numbers.
MLX is tuned for one thing and does it well. On Apple silicon, MLX is
frequently 30–50% faster than llama.cpp in the decode (token-generation)
phase, and optimized 7B models have been measured sustaining well over 200
tokens/sec. Because it owns the whole stack down to the Metal kernels and unified
memory, there is very little overhead.
But the win is workload-dependent. Benchmarks show MLX can spend a
disproportionate amount of time in the prefill phase (processing your prompt).
In one M1 Max test with a ~650-token prompt, MLX’s effective throughput
(prefill + decode combined) came out around 13 tok/s while the GGUF/llama.cpp run
hit ~20 tok/s — because MLX spent the bulk of its time chewing through the prompt.
The rule of thumb that emerges: MLX shines when output is long relative to input
(creative writing, summaries, long explanations), while GGUF/llama.cpp holds up
better when prompts are long and answers are short (RAG, classification,
extraction).
And of course, GGUF runs everywhere. The same file works on a CPU-only Linux
box, an NVIDIA workstation via CUDA, an AMD card via ROCm, or a Mac via Metal.
MLX simply does not run off Apple silicon.
Pros and Cons
GGUF
| Pros | Cons |
|---|---|
| Truly cross-platform — one file runs on CPU, CUDA, Metal, Vulkan, ROCm, SYCL | Not hardware-specialized — leaves performance on the table on Apple silicon vs MLX |
| Self-contained — tokenizer, architecture, and weights in a single portable file | Quantization zoo can overwhelm — which of a dozen variants do you actually want? |
| Richest quantization menu — K-quants and imatrix I-quants for fine-grained size/quality control | Conversion can be fiddly for newer/unusual architectures until llama.cpp adds support |
| Massive ecosystem — default on Hugging Face; supported by Ollama, LM Studio, and more | |
| Battle-tested from Raspberry Pis to server GPUs |
MLX
| Pros | Cons |
|---|---|
| Best-in-class speed on Apple silicon, especially for long-generation workloads | Apple silicon only — zero portability to NVIDIA, AMD, or plain CPU boxes |
| Unified-memory native — no CPU↔GPU copies, efficient memory use on Macs | Smaller ecosystem than GGUF, though catching up quickly |
| Clean, composable framework — a full NumPy/PyTorch-style array library, so training and fine-tuning are first-class | Weaker on prefill-heavy / long-prompt workloads in current benchmarks |
| Builds on safetensors, a safe, well-understood container | Quantization is simpler but less granular than GGUF’s K/I-quant lineup |
| Backed by Apple with fast-moving, active development (Swift/C APIs for native apps) |
Which Should You Use?
The honest answer is “it depends on your hardware and your workload,” but here is
the short version:
- You ship to varied hardware, or run anything other than a Mac? GGUF. The
portability and ecosystem are unbeatable, and “convert once, run anywhere” is a
real advantage. - You live entirely on Apple silicon and care about maximum local speed,
especially for long generations or fine-tuning? MLX. It was built for exactly
your machine. - You are on a Mac but unsure? Tools like LM Studio and Ollama now
support both, so it costs little to download a GGUF and an MLX build of the same
model and benchmark your prompts. Real-world throughput depends heavily on your
prompt-to-output ratio, so measure rather than guess.
Two formats, two philosophies: GGUF chose portability and a universal
container, MLX chose depth on a single platform. Neither is “winning” — they
are optimizing for different things, and the local-LLM world is better for having
both.
Source repositories
- llama.cpp / GGUF: https://github.com/ggml-org/llama.cpp
- MLX: https://github.com/ml-explore/mlx
- mlx-lm (conversion & inference): https://github.com/ml-explore/mlx-lm
Further reading
- GGUF Format: A Complete Guide to Local LLM Inference (DataCamp)
- GGUF File Format (llama.cpp DeepWiki)
- Model Conversion & Quantization (mlx-lm DeepWiki)
- Benchmarking Apple’s MLX vs. llama.cpp (Andreas Kunar)
- 57 tok/s on Screen, 3 tok/s in Practice: MLX vs llama.cpp (famstack.dev)
- llama.cpp quantization README