2026-05-05

UniLLM vs mistral.rs

Where UniLLM and mistral.rs overlap, where their abstractions differ, and how to choose. This is an honest comparison based on each project's published documentation.


mistral.rs is a fast LLM inference platform written in Rust, focused on practical serving and a wide menu of inference features. UniLLM is a modular LLM inference runtime, also written in Rust, organised around three composable core abstractions. Both projects are doing inference in Rust. They get there with different shapes.

What each project is

mistral.rs is a feature-dense Rust inference platform. It ships HTTP and Python bindings, supports a broad list of model families and modalities, exposes quantized inference paths, and emphasises the practical kit — chat templates, sampling features, device mapping, an OpenAI-compatible HTTP API. Its centre of gravity is on the serving side.

UniLLM is a runtime expressed through traits. Its README leads with “A modular LLM inference runtime written in Rust” and the architecture document leads with the three core layers: TensorCore, ModelCore, WeightLoaderCore. The 47 supported architectures all flow through the same Model trait via the model_config! macro. Continuous-batching, chunked-prefill, and admission control live in a scheduler crate. The KV cache is a hybrid RadixAttention + PagedAttention design in its own crate. The project is honest in its ROADMAP about what is built, what is tested, and what is still in progress.

Where they overlap

Both projects:

  • Are written in Rust, with no Python runtime dependency at execution time.
  • Implement transformer architectures across LLaMA, Mistral, Mixtral, Qwen, Gemma, Phi, and others.
  • Read SafeTensors and GGUF weights.
  • Aim at production-shaped inference paths.
  • Care about throughput, not just one-shot inference.

They are doing the same kind of work. The differences are about how the runtime is composed and what’s in scope.

How they differ

Abstraction posture

mistral.rs presents itself as a platform: install it, point it at a model, and it serves. The architecture is internal; users interact with the pipeline and the HTTP layer.

UniLLM presents itself as three traits and the consequences. The README and the architecture document both emphasise that:

  • Tensor is one type with Arc<dyn TensorStorage> storage; Device is an enum of CPU, CUDA(usize), and Metal(usize); and operations live on a TensorOps trait plus a functional ops_fn module.
  • Model is one trait with forward, generate, from_weights, and to_device, and the model_config! macro generates the boilerplate so a new architecture is mostly its forward pass.
  • WeightLoader is format-agnostic with from_safetensors, from_gguf, from_pytorch, and auto_detect, returning a unified ModelWeights container.
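
To make the shape of that split concrete, here is a minimal sketch of how the three layers could fit together. It is not UniLLM's actual code: the trait and type names come from the list above, but every signature, the CpuStorage type, and the toy Scale model are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Device enum as described in the text.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Device {
    Cpu,
    Cuda(usize),
    Metal(usize),
}

/// Storage behind a tensor. The method set is hypothetical.
pub trait TensorStorage: Send + Sync {
    fn as_f32(&self) -> &[f32];
    fn device(&self) -> Device;
}

/// Plain host-memory storage used by this sketch.
struct CpuStorage(Vec<f32>);

impl TensorStorage for CpuStorage {
    fn as_f32(&self) -> &[f32] {
        &self.0
    }
    fn device(&self) -> Device {
        Device::Cpu
    }
}

/// One tensor type; the backend hides behind the trait object.
#[derive(Clone)]
pub struct Tensor {
    storage: Arc<dyn TensorStorage>,
}

impl Tensor {
    pub fn from_vec(data: Vec<f32>) -> Self {
        Tensor { storage: Arc::new(CpuStorage(data)) }
    }
    pub fn data(&self) -> &[f32] {
        self.storage.as_f32()
    }
    pub fn device(&self) -> Device {
        self.storage.device()
    }
}

/// Unified container a loader would return, whatever the file format.
pub struct ModelWeights {
    pub tensors: HashMap<String, Tensor>,
}

/// The four methods named in the text, with assumed signatures.
pub trait Model: Sized {
    fn from_weights(weights: &ModelWeights) -> Result<Self, String>;
    fn forward(&self, input: &Tensor) -> Tensor;
    fn to_device(self, device: Device) -> Result<Self, String>;
    /// Default generate: feed each output back in as the next input.
    fn generate(&self, input: &Tensor, steps: usize) -> Tensor {
        let mut x = input.clone();
        for _ in 0..steps {
            x = self.forward(&x);
        }
        x
    }
}

/// Toy "architecture": multiplies every element by one learned scalar.
pub struct Scale {
    factor: f32,
}

impl Model for Scale {
    fn from_weights(weights: &ModelWeights) -> Result<Self, String> {
        let t = weights.tensors.get("scale").ok_or("missing tensor: scale")?;
        let factor = t.data().first().copied().ok_or("empty tensor: scale")?;
        Ok(Scale { factor })
    }
    fn forward(&self, input: &Tensor) -> Tensor {
        Tensor::from_vec(input.data().iter().map(|v| v * self.factor).collect())
    }
    fn to_device(self, device: Device) -> Result<Self, String> {
        // The sketch only has a CPU backend.
        match device {
            Device::Cpu => Ok(self),
            other => Err(format!("{:?} is not available in this sketch", other)),
        }
    }
}
```

In this arrangement a new architecture only has to supply from_weights and forward; generate comes from the trait's default method, which is roughly the role the text assigns to the model_config! macro.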

If you intend to extend the runtime — add an architecture, change the tensor backend, plug in a new weight format — UniLLM is shaped to make those extensions bounded. mistral.rs is shaped to make first-time deployment fast.

Serving surface

mistral.rs ships an HTTP API and Python bindings out of the box. That’s part of why it exists.

UniLLM does not yet ship an HTTP server. The ROADMAP names it as item #4 on the near-term priority list, including OpenAI-compatible /v1/chat/completions endpoints over the existing axum dependency, with SSE streaming. Today, UniLLM inference runs either through the CLI (cargo run --bin unillm -p unillm-runtime -- generate --prompt "...") or programmatically.
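
For readers writing their own serving layer in the meantime, the wire format of OpenAI-style SSE streaming is small enough to sketch with the standard library alone. The functions below are hypothetical helpers, not part of UniLLM or axum, and the JSON chunk is trimmed to the fields needed to show the framing (real chunks also carry fields such as id and model).

```rust
/// Escape a string for embedding inside a JSON string literal.
pub fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \u00XX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Frame one generated text fragment as a server-sent event.
/// Each event is a "data: " line followed by a blank line.
pub fn sse_delta(content: &str) -> String {
    format!(
        "data: {{\"object\":\"chat.completion.chunk\",\"choices\":[{{\"index\":0,\"delta\":{{\"content\":\"{}\"}}}}]}}\n\n",
        json_escape(content)
    )
}

/// Terminator event that closes an OpenAI-style stream.
pub fn sse_done() -> &'static str {
    "data: [DONE]\n\n"
}
```

A serving layer would write one sse_delta event per generated fragment to the response body and finish with sse_done.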

If you need to expose an HTTP endpoint today without writing one yourself, mistral.rs covers more ground. If you’re building your own serving layer in Rust and want to embed the runtime inside it, UniLLM is shaped for that.

Quantized inference

mistral.rs has explicit quantized inference paths and a wide selection of quantization schemes.

UniLLM’s current GGUF support dequantizes Q4_0 and Q8_0 weights to f32 at load time and runs inference in f32. The runtime ships SIMD kernels (AVX2, AVX-512, NEON) for quantized matmul, RMSNorm, RoPE, and SwiGLU. Direct quantized inference (Q4_K and Q5_K matmul without dequantization) is on UniLLM’s long-term roadmap as a TensorCore change behind the existing loader.
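
As an illustration of what dequantizing at load time involves, the sketch below expands Q8_0 and Q4_0 blocks into f32 values using the block layout defined by ggml: 32 weights per block sharing one scale. It is not UniLLM's loader code. The struct and function names are hypothetical, and the scale is held as f32 here for simplicity, whereas GGUF files store it as f16.

```rust
/// Weights per block in ggml's Q4_0 and Q8_0 formats.
pub const BLOCK: usize = 32;

/// One Q8_0 block: a shared scale and 32 signed 8-bit quants.
pub struct Q8Block {
    pub d: f32, // f16 on disk; widened here (assumption of this sketch)
    pub qs: [i8; BLOCK],
}

/// One Q4_0 block: a shared scale and 32 4-bit quants packed two per byte.
pub struct Q4Block {
    pub d: f32, // f16 on disk; widened here (assumption of this sketch)
    pub qs: [u8; BLOCK / 2],
}

/// Q8_0: each weight is simply scale * quant.
pub fn dequantize_q8_0(blocks: &[Q8Block]) -> Vec<f32> {
    let mut out = Vec::with_capacity(blocks.len() * BLOCK);
    for b in blocks {
        out.extend(b.qs.iter().map(|&q| b.d * q as f32));
    }
    out
}

/// Q4_0: nibbles are unsigned 0..=15 with an implicit offset of 8.
/// The low nibble of byte j is weight j; the high nibble is weight j + 16.
pub fn dequantize_q4_0(blocks: &[Q4Block]) -> Vec<f32> {
    let mut out = Vec::with_capacity(blocks.len() * BLOCK);
    for b in blocks {
        let mut vals = [0.0f32; BLOCK];
        for (j, &byte) in b.qs.iter().enumerate() {
            let lo = (byte & 0x0F) as i32 - 8;
            let hi = (byte >> 4) as i32 - 8;
            vals[j] = lo as f32 * b.d;
            vals[j + BLOCK / 2] = hi as f32 * b.d;
        }
        out.extend_from_slice(&vals);
    }
    out
}
```

Once weights have been expanded this way, every downstream operator sees ordinary f32 tensors, which is why the file's quantization format does not reduce memory use at inference time.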

If your model is large enough that running at f32 is impractical and you need to lean on quantization to fit in memory, mistral.rs is the runtime that can serve it today.
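
The arithmetic behind that trade-off is easy to check. The sketch below estimates weight storage only (no KV cache or activations), using the ggml block sizes: a Q4_0 block stores 32 weights in 18 bytes and a Q8_0 block stores 32 weights in 34 bytes, against 4 bytes per weight at f32. The function names and the 7-billion-parameter figure in the note that follows are illustrative assumptions, not measurements of either project.

```rust
/// Bytes needed to hold `params` weights as f32.
pub fn f32_bytes(params: u64) -> u64 {
    params * 4
}

/// Number of 32-weight blocks needed, rounding up.
fn blocks(params: u64) -> u64 {
    (params + 31) / 32
}

/// Q4_0: 2-byte f16 scale + 16 bytes of packed nibbles per block.
pub fn q4_0_bytes(params: u64) -> u64 {
    blocks(params) * 18
}

/// Q8_0: 2-byte f16 scale + 32 one-byte quants per block.
pub fn q8_0_bytes(params: u64) -> u64 {
    blocks(params) * 34
}
```

For a hypothetical 7-billion-parameter model this gives 28.0 GB at f32, about 7.4 GB at Q8_0, and about 3.9 GB at Q4_0. A runtime that dequantizes at load time pays the f32 figure in RAM whichever format the file uses.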

KV cache and scheduler design

UniLLM ships the cache and scheduler as named, separable crates:

  • crates/kv — hybrid RadixAttention + PagedAttention with an adaptive tiering policy.
  • crates/scheduler — continuous batching, chunked prefill, admission control.

That design is the centrepiece of UniLLM’s serving story, with one honest caveat from the ROADMAP: the KV cache is implemented and tested but is not yet wired into the autoregressive generation loop. Connecting it is item #3 on the near-term list.
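
To show how the three scheduler ideas interact, here is a toy step planner. It is a sketch under stated assumptions, not the crates/scheduler API: the Seq, Work, and Scheduler types and the two limits (max_running for admission control, token_budget per step) are all hypothetical.

```rust
use std::collections::VecDeque;

/// A request being served. `prefilled` counts prompt tokens already processed.
pub struct Seq {
    pub id: u32,
    pub prompt_len: usize,
    pub prefilled: usize,
}

/// One unit of work planned for a single forward pass.
#[derive(Debug, PartialEq)]
pub enum Work {
    Prefill { id: u32, tokens: usize },
    Decode { id: u32 },
}

pub struct Scheduler {
    waiting: VecDeque<Seq>,
    running: Vec<Seq>,
    max_running: usize,  // admission control limit
    token_budget: usize, // tokens processed per step across the batch
}

impl Scheduler {
    pub fn new(max_running: usize, token_budget: usize) -> Self {
        Scheduler { waiting: VecDeque::new(), running: Vec::new(), max_running, token_budget }
    }

    pub fn submit(&mut self, id: u32, prompt_len: usize) {
        self.waiting.push_back(Seq { id, prompt_len, prefilled: 0 });
    }

    /// Remove a finished sequence, freeing a slot for the next step.
    pub fn finish(&mut self, id: u32) {
        self.running.retain(|s| s.id != id);
    }

    /// Plan one step of the batch.
    pub fn step(&mut self) -> Vec<Work> {
        // Admission control: only admit while there is capacity.
        while self.running.len() < self.max_running {
            match self.waiting.pop_front() {
                Some(s) => self.running.push(s),
                None => break,
            }
        }
        let mut budget = self.token_budget;
        let mut plan = Vec::new();
        // Continuous batching: every fully prefilled sequence decodes one token.
        for s in &self.running {
            if s.prefilled == s.prompt_len && budget > 0 {
                plan.push(Work::Decode { id: s.id });
                budget -= 1;
            }
        }
        // Chunked prefill: spend the remaining budget on prompt chunks.
        for s in &mut self.running {
            if s.prefilled < s.prompt_len && budget > 0 {
                let chunk = (s.prompt_len - s.prefilled).min(budget);
                plan.push(Work::Prefill { id: s.id, tokens: chunk });
                s.prefilled += chunk;
                budget -= chunk;
            }
        }
        plan
    }
}
```

With max_running = 2 and token_budget = 8, a 10-token prompt is split across two steps, and a third request is only admitted once one of the running sequences finishes. Decoding is planned before prefill so that long prompts cannot starve sequences that are already generating.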

mistral.rs has its own approach to these problems, integrated into its pipeline. The UniLLM design is more openly compositional; the mistral.rs design is more integrated.

Hardware backends

UniLLM’s tensor abstraction is wired for CPU, CUDA, and Metal via Candle feature flags, but the project’s ROADMAP is explicit that “All inference runs on CPU” today. The CPU path has SIMD kernels for the operators that dominate decoder time. GPU acceleration is on the near-term priority list.

mistral.rs supports CPU and accelerated backends today, and is the more obvious choice if you need GPU performance immediately.

Model coverage

Both projects implement many architectures. UniLLM lists 47 across ten categories, among them core LLMs, GPT family, code models, MoE, RWKV / linear attention, vision-language, audio / speech, encoders, and specialised architectures, all behind one Model trait. The honest line from the ROADMAP: only LLaMA is currently validated end-to-end with real GGUF weights. The others are correct at the trait level and are being progressively validated as real-weight integration tests come online.

mistral.rs’s coverage is also broad and is the kind of detail that changes between releases; the project’s own model list is the canonical answer.

When you’d pick which

Pick mistral.rs when:

  • You need an HTTP API and broad serving features today.
  • Quantized inference at low memory footprint is non-negotiable.
  • You want a more integrated platform and are less concerned with extending the runtime’s abstractions yourself.
  • GPU support today matters for your workload.

Pick UniLLM when:

  • You’re building your own serving layer and want a runtime expressed through small, named traits.
  • The three-layer split (TensorCore / ModelCore / WeightLoaderCore) maps cleanly onto how you want to evolve the system.
  • You want the KV cache (RadixAttention + PagedAttention) and scheduler (continuous batching, chunked prefill, admission control) as workspace crates you can reason about independently.
  • You’re comfortable with the project’s published state — CPU today, LLaMA validated end-to-end, the rest progressively coming online — and want to follow or contribute to its trajectory.

The ROADMAP and the architecture document are the live source of truth for everything above.


Source: github.com/cognisoc/unillm · Docs: docs.cognisoc.com/unillm/