UniLLM vs Candle
How UniLLM and HuggingFace's Candle differ in scope, abstraction, and intended audience. Honest comparison based on each project's published documentation.
Candle is HuggingFace’s minimalist ML framework for Rust. UniLLM is a modular LLM inference runtime written in Rust. They overlap, but they aim at different layers, and choosing between them is mostly a question of what you’re actually building.
What each project is
Candle is a tensor / ML framework with a PyTorch-flavoured API in Rust. It provides tensor operations, a Module trait, and reference implementations of many models. It supports CPU, CUDA, and Metal backends. The HuggingFace team uses it as the substrate for Rust-native ML work across their ecosystem.
UniLLM is an inference runtime. Its README states it plainly: a modular LLM inference runtime, organised around three composable abstractions — TensorCore, ModelCore, and WeightLoaderCore. The 47 architectures share one Model trait, the KV cache and scheduler live as first-class workspace crates, and the project’s reason for existing is to run LLMs in production-shaped paths, not to be a general-purpose ML framework.
Where they overlap
Both projects:
- Are pure Rust, with no Python runtime dependency at execution time.
- Provide a tensor type and tensor operations.
- Implement transformer architectures.
- Support multiple weight formats (SafeTensors and GGUF are common ground).
- Run on CPU and target GPUs via their respective backends.
Both can sit underneath an LLM inference application. They are not interchangeable in the strict sense, but the use cases shade into each other.
How they differ
Scope and intent
Candle is framework-shaped. It gives you tensors and Modules and asks you to compose them. The model zoo is large and growing, but the framework’s job is to make new model code easy to write, not to operate a production inference path.
UniLLM is runtime-shaped. The whole project assumes you want to serve models: continuous batching, chunked prefill, admission control, and a hybrid KV cache are first-class crates. The 47 supported architectures are all expressed through one Model trait so the runtime can treat them uniformly. The README headline number is 201 passing tests across the workspace, including the cache and scheduler.
The abstraction layout
Candle’s primary abstraction is the tensor and the Module trait, in the PyTorch lineage. UniLLM’s primary abstraction is the three-layer split: Tensor plus TensorOps (one type, dispatched by device), Model plus model_config! macro (one trait, 47 architectures), and WeightLoader plus ModelWeights (format-agnostic loading). The intent is that adding an architecture is mostly the forward pass, adding a backend is one TensorOps impl, and adding a format is one loader.
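To make the three-layer split concrete, here is a minimal sketch of how such a layering could look. The trait names Tensor, TensorOps, Model, WeightLoader, and ModelWeights come from the description above, but every signature here is an assumption, and CpuOps, BiasOnly, and InMemoryLoader are hypothetical names invented for illustration. This is not UniLLM's actual API, and the model_config! macro is not shown.

```rust
use std::collections::HashMap;

// All signatures below are illustrative assumptions, not UniLLM's real API.

/// Layer 1: a single tensor type (flat f32 storage only, for brevity).
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub data: Vec<f32>,
}

/// Layer 1: the operations a backend has to provide. One impl per backend.
pub trait TensorOps {
    fn add(&self, a: &Tensor, b: &Tensor) -> Tensor;
}

/// Hypothetical CPU backend.
pub struct CpuOps;

impl TensorOps for CpuOps {
    fn add(&self, a: &Tensor, b: &Tensor) -> Tensor {
        assert_eq!(a.data.len(), b.data.len(), "shape mismatch");
        let data = a.data.iter().zip(&b.data).map(|(x, y)| x + y).collect();
        Tensor { data }
    }
}

/// Layer 3: format-agnostic result of loading, keyed by tensor name.
pub type ModelWeights = HashMap<String, Tensor>;

/// Layer 3: one impl per on-disk format.
pub trait WeightLoader {
    fn load(&self) -> ModelWeights;
}

/// Hypothetical loader that serves tensors from memory instead of a file.
pub struct InMemoryLoader(pub Vec<(String, Vec<f32>)>);

impl WeightLoader for InMemoryLoader {
    fn load(&self) -> ModelWeights {
        self.0
            .iter()
            .map(|(name, data)| (name.clone(), Tensor { data: data.clone() }))
            .collect()
    }
}

/// Layer 2: one trait that every architecture implements.
pub trait Model {
    fn forward(&self, ops: &dyn TensorOps, input: &Tensor) -> Tensor;
}

/// Toy "architecture" whose forward pass only adds a bias vector.
pub struct BiasOnly {
    bias: Tensor,
}

impl BiasOnly {
    /// Builds the model from loaded weights; returns None if "bias" is missing.
    pub fn from_weights(weights: &ModelWeights) -> Option<Self> {
        weights.get("bias").cloned().map(|bias| BiasOnly { bias })
    }
}

impl Model for BiasOnly {
    fn forward(&self, ops: &dyn TensorOps, input: &Tensor) -> Tensor {
        // The model never touches a backend directly, only the ops trait.
        ops.add(input, &self.bias)
    }
}
```

In this shape, a caller loads weights through any WeightLoader, builds a Model from them, and passes in whichever TensorOps backend is available. The model code does not change when the loader or the backend does, which is the property the split is aiming for.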
KV cache and scheduler
This is the cleanest line between the projects. UniLLM ships:
- A hybrid KV cache crate: RadixAttention layered on top of PagedAttention, with an adaptive tiering policy.
- A scheduler crate with continuous batching, chunked prefill, and admission control.
Candle does not ship these as first-class library components. They’re the kind of thing a runtime built on Candle would have to add.
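For readers unfamiliar with prefix sharing over a paged cache, the idea can be sketched as a trie keyed by block-sized token chunks: prompts that start with the same tokens resolve to the same cache blocks. This is a deliberately simplified illustration, not UniLLM's implementation. PrefixCache, Node, and BLOCK_TOKENS are hypothetical names, the block size of 4 is arbitrary, and eviction, reference counting, and the tiering policy are all left out.

```rust
use std::collections::HashMap;

/// Tokens per cache block. Assumed value, chosen small for readability.
pub const BLOCK_TOKENS: usize = 4;

/// One trie node: the cache block it owns plus children keyed by the next chunk.
#[derive(Default)]
struct Node {
    block_id: usize,
    children: HashMap<Vec<u32>, Node>,
}

/// Hypothetical prefix cache: a trie over block-sized token chunks.
#[derive(Default)]
pub struct PrefixCache {
    root: Node,
    next_block: usize,
}

impl PrefixCache {
    /// Registers a prompt and returns (block table, number of reused blocks).
    /// Only full blocks are cached; a trailing partial chunk is ignored.
    pub fn insert(&mut self, tokens: &[u32]) -> (Vec<usize>, usize) {
        let mut node = &mut self.root;
        let mut table = Vec::new();
        let mut reused = 0;
        for chunk in tokens.chunks_exact(BLOCK_TOKENS) {
            if node.children.contains_key(chunk) {
                // Same prefix seen before: share the existing block.
                reused += 1;
            } else {
                // New prefix: allocate a fresh block id for this chunk.
                let block_id = self.next_block;
                self.next_block += 1;
                node.children.insert(
                    chunk.to_vec(),
                    Node { block_id, children: HashMap::new() },
                );
            }
            node = node
                .children
                .get_mut(chunk)
                .expect("chunk was found or just inserted");
            table.push(node.block_id);
        }
        (table, reused)
    }
}
```

Two prompts that share a system prompt would get identical leading entries in their block tables, so the keys and values for that shared prefix only need to be computed once.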
The honest caveat: UniLLM’s KV cache is implemented and tested but, per its own ROADMAP, is not yet wired into the autoregressive generation loop. The mechanics are correct; the integration is the next step. Candle simply doesn’t take that question on at all in its current scope.
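The scheduler side can be illustrated the same way. The sketch below shows continuous batching with a simple admission-control rule: a waiting request joins the running batch at a step boundary only if its worst-case KV block reservation fits in the free budget. Request, Scheduler, and all field names are hypothetical, the reservation rule is an assumption, and chunked prefill is not modelled. It is not a description of UniLLM's scheduler crate.

```rust
use std::collections::VecDeque;

/// Hypothetical request record.
#[derive(Debug, Clone)]
pub struct Request {
    pub id: u32,
    pub prompt_len: usize,
    pub max_new_tokens: usize,
    pub generated: usize,
}

impl Request {
    pub fn new(id: u32, prompt_len: usize, max_new_tokens: usize) -> Self {
        Request { id, prompt_len, max_new_tokens, generated: 0 }
    }
}

/// Hypothetical scheduler with a fixed budget of KV cache blocks.
pub struct Scheduler {
    block_size: usize,              // tokens per KV block
    free_blocks: usize,             // blocks not currently reserved
    waiting: VecDeque<Request>,     // FIFO admission queue
    running: Vec<(Request, usize)>, // request plus blocks reserved for it
}

impl Scheduler {
    pub fn new(total_blocks: usize, block_size: usize) -> Self {
        Scheduler {
            block_size,
            free_blocks: total_blocks,
            waiting: VecDeque::new(),
            running: Vec::new(),
        }
    }

    pub fn submit(&mut self, req: Request) {
        self.waiting.push_back(req);
    }

    /// Worst-case blocks for a request: prompt plus every token it may generate.
    fn blocks_needed(&self, req: &Request) -> usize {
        let tokens = req.prompt_len + req.max_new_tokens;
        (tokens + self.block_size - 1) / self.block_size
    }

    /// One decode step. Returns the ids of requests that finished in this step.
    pub fn step(&mut self) -> Vec<u32> {
        // Admission control: admit in FIFO order while the reservation fits.
        while let Some(front) = self.waiting.front() {
            let need = self.blocks_needed(front);
            if need > self.free_blocks {
                break;
            }
            let req = self.waiting.pop_front().expect("front exists");
            self.free_blocks -= need;
            self.running.push((req, need));
        }
        // Continuous batching: every running request advances by one token.
        for (req, _) in self.running.iter_mut() {
            req.generated += 1;
        }
        // Retire finished requests and release their blocks immediately.
        let mut finished = Vec::new();
        let mut still_running = Vec::new();
        for (req, blocks) in self.running.drain(..) {
            if req.generated >= req.max_new_tokens {
                self.free_blocks += blocks;
                finished.push(req.id);
            } else {
                still_running.push((req, blocks));
            }
        }
        self.running = still_running;
        finished
    }

    pub fn running_ids(&self) -> Vec<u32> {
        self.running.iter().map(|(r, _)| r.id).collect()
    }
}
```

The point of the pattern is that a short request leaving the batch frees capacity for a waiting one on the very next step, without waiting for the longest request to finish. Note that in this simplified version a request larger than the whole budget would block the queue; a real scheduler would reject it.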
Quantization and weight formats
Both projects read SafeTensors, and both can read GGUF. UniLLM’s GGUF path includes dequantization for the common Q4_0 / Q8_0 cases, returning f32 tensors that the rest of the runtime treats uniformly. Direct quantized matmul (running on Q4 / Q8 weights without dequantizing) is on UniLLM’s long-term roadmap; it is not part of the current weight loader. Candle exposes a quantized tensor type and quantized GGUF inference, which is a different trade-off: less memory for the weights at inference time, in exchange for quantization-aware kernels.
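As a rough illustration of what dequantizing to f32 means for these two formats, the sketch below follows the usual GGUF block layout: 32 values per block with one scale, stored as signed bytes for Q8_0 and as packed 4-bit values with an offset of 8 for Q4_0. The function names are hypothetical and this is not UniLLM's loader code. One simplification is an assumption made for brevity: on disk the scale is a half-precision float, while here it is passed in already converted to f32.

```rust
/// Values per quantization block in Q4_0 and Q8_0.
pub const BLOCK: usize = 32;

/// Q8_0: each value is one signed byte multiplied by the block scale.
pub fn dequant_q8_0(scale: f32, quants: &[i8; BLOCK]) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (o, &q) in out.iter_mut().zip(quants.iter()) {
        *o = scale * q as f32;
    }
    out
}

/// Q4_0: 16 bytes hold 32 unsigned 4-bit values, re-centred by subtracting 8.
/// Low nibbles fill the first half of the block, high nibbles the second half.
pub fn dequant_q4_0(scale: f32, packed: &[u8; BLOCK / 2]) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (j, &byte) in packed.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[j] = scale * lo as f32;
        out[j + BLOCK / 2] = scale * hi as f32;
    }
    out
}
```

After this step every weight is a plain f32, which is why the rest of a runtime can treat quantized and unquantized checkpoints the same way. The cost is that the memory saving of the quantized format is lost once the weights are loaded.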
Hardware backends today
Candle exposes CPU, CUDA, and Metal backends. UniLLM’s tensor abstraction is wired for CPU, CUDA(usize), and Metal(usize) devices, but the project’s own ROADMAP is explicit that “All inference runs on CPU” today and GPU acceleration is on the near-term priority list. The CPU path ships SIMD kernels (AVX2, AVX-512, NEON) for the operators that dominate decoder time.
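For context on what a CPU SIMD path typically involves, here is a minimal sketch of runtime dispatch for a dot product, the inner loop of most matmul kernels. It uses only the standard library's std::arch intrinsics and feature detection. The function names are hypothetical and this is not taken from UniLLM's kernels; AVX-512 and NEON variants would follow the same pattern and are omitted.

```rust
/// Portable scalar fallback, used on any CPU.
pub fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// AVX2 build of the same loop: 8 f32 lanes per iteration.
/// Must only be called after AVX2 support has been confirmed.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let n = a.len().min(b.len());
    let chunks = n / 8;
    unsafe {
        let mut acc = _mm256_setzero_ps();
        for i in 0..chunks {
            // Unaligned loads: slices carry no 32-byte alignment guarantee.
            let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
            acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
        }
        // Horizontal sum of the 8 lanes, then the scalar tail.
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
        let mut sum: f32 = lanes.iter().sum();
        for i in chunks * 8..n {
            sum += a[i] * b[i];
        }
        sum
    }
}

/// Picks the fastest available implementation at runtime.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed on the line above.
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

Because the vector version sums in a different order than the scalar one, results can differ in the last bits for general floating-point inputs. Kernel tests usually compare against the scalar path with a tolerance rather than for exact equality.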
Model coverage
Both projects implement many architectures. Candle’s model zoo grows by contributions to its examples and integrations. UniLLM implements 47 architectures behind one trait, but states openly that only LLaMA is currently validated end-to-end with real GGUF weights. The other 46 have correct forward-pass implementations covered by unit tests with dummy tensors, and are being validated as the project moves forward.
When you’d pick which
Pick Candle when:
- You want a general-purpose Rust ML framework.
- You’re building something that isn’t a serving runtime — fine-tuning code, evaluation harnesses, custom model research.
- You want a broad ecosystem maintained by the HuggingFace team.
- Your inference path is single-request or simple, and the cache / scheduler design space isn’t yours to solve.
Pick UniLLM when:
- You’re building an inference runtime, not a framework.
- You want first-class KV cache and scheduling abstractions in the workspace, not as a thing you write yourself.
- You like the three-layer split (TensorCore/ModelCore/WeightLoaderCore) and the consequence that adding architectures is bounded work.
- You’re comfortable with the project’s published state: CPU today, validated on LLaMA end-to-end, with the other 46 architectures implemented at the trait level and being progressively validated.
What to track
The fastest-moving items on UniLLM’s near-term roadmap are real-weight validation for more architectures, GPU backends (the tensor abstraction is wired for them already via Candle feature flags), wiring the KV cache into the generation loop, and an HTTP API server. Each of those collapses one of the lines above. The ROADMAP is the document to watch.
Source: github.com/cognisoc/unillm · Docs: docs.cognisoc.com/unillm/