vLLM Blog

Introducing AIBrix: A Scalable, Cost-Effective Control Plane for vLLM

2025-02-21T00:00:00+00:00

Today, we are excited to announce vllm-project/aibrix: a batteries-included vLLM Kubernetes serving stack developed by ByteDance. Started in early 2024, AIBrix has been successfully deployed to support multiple business use cases across ByteDance, demonstrating its scalability and effectiveness in large-scale deployments.

While vLLM makes deploying a single serving instance easy, deploying vLLM at scale presents unique challenges in routing, autoscaling, and fault tolerance. AIBrix is an open-source initiative designed to provide the essential building blocks to construct scalable inference infrastructure. It delivers a cloud-native solution optimized for deploying, managing, and scaling large language model (LLM) inference, tailored specifically to enterprise needs.

The initial release focuses on the following key features:

High-Density LoRA Management: Streamlined support for lightweight, low-rank adaptations of models.
LLM Gateway and Routing: Efficiently manage and direct traffic across multiple models and replicas.
LLM App-Tailored Autoscaler: Dynamically scale inference resources based on real-time demand.
Unified AI Runtime: A versatile sidecar enabling metric standardization, model downloading, and management.
Distributed Inference: Scalable architecture to handle large workloads across multiple nodes.
Distributed KV Cache: Enables high-capacity, cross-engine KV reuse.
Cost-efficient Heterogeneous Serving: Enables mixed GPU inference to reduce costs with SLO guarantees.
GPU Hardware Failure Detection: Proactive detection of GPU hardware issues.

AIBrix Vision & Industry Collaboration

AIBrix is built on the principle of system and inference engine co-design, with a primary focus on constructing scalable inference systems on Kubernetes in a cloud-native way. Moving forward, we will continue exploring the co-design approach through initiatives such as:

Expanding distributed KV cache to support a wider range of scenarios, including Prefill & Decode (P&D) aggregation, request migration, and cross-instance KV reuse, improving memory efficiency and inference flexibility.
Adopting traditional resource management principles such as QoS, priority, and fairness for LLM inference, enabling request-level multi-tenancy and efficient resource allocation.
Applying roofline-based profiling to optimize computational efficiency and deliver strong SLO-guaranteed inference performance across diverse workloads.

As part of this mission, we actively collaborate with industry leaders to drive open, cloud-native solutions for LLM serving.

“ByteDance has been a phenomenal partner in helping Google drive standardization of LLM serving in Kubernetes through Working Group Serving and contributing to the Gateway API Inference Extension. We are excited to continue collaborating on shared components that will enable AIBrix and large scale inference platforms” - Clayton Coleman, Distinguished Engineer and Inference Lead for GKE

“vLLM has seen explosive growth worldwide, becoming a cornerstone of LLM inference. AIBrix is a promising project that builds on this momentum, offering powerful capabilities to productionize vLLM while driving innovation in open-source LLM inference” - Robert Nishihara, Co-Founder of Anyscale & Co-Creator of Ray

Explore More

Check out the repo at https://github.com/vllm-project/aibrix and dive into our blog post for an in-depth look at AIBrix’s architecture and key capabilities. For a deeper understanding, explore our white paper on design philosophy and results, and follow the documentation to get started with deployment and integration. Join the aibrix channel in the vLLM Slack to discuss with the developers.

FAQ

How is AIBrix different from the vLLM production stack?

AIBrix is an open-source release from ByteDance with a focus on large-scale use cases and cloud-native solutions. Production stack, managed by the UChicago LMCache team, is an open framework that welcomes everyone to extend, experiment, and contribute. You can see the production stack’s roadmap here.
AIBrix is an instantiation of what a powerful K8s stack can be and has been in production for the past 6+ months. Production stack is a from-scratch implementation focused on iterating on each building block with feedback and contributions from the community.
Production stack’s desired strength is to leverage built-in KV cache-focused optimizations (transfer, blending, routing), especially beneficial in long-context and prefill-heavy workloads. In the near term, production stack plans to leverage components from AIBrix.

Is AIBrix a community driven project?

Absolutely. The purpose of open-sourcing it under the vLLM project organization is to open it up for collaboration with both practitioners and researchers. Many enhancements are planned, and the core developers believe the future is open source!

How is AIBrix different from other cloud native solutions such as KServe, KubeAI, and others?

AIBrix offers more native integration with vLLM. By designing with only an inference engine in mind, AIBrix can prioritize features such as fast model loading, autoscaling, and LoRA management.

Distributed Inference with vLLM

2025-02-17T00:00:00+00:00

Motivation

Serving large models often leads to memory bottlenecks, such as the dreaded CUDA out of memory error. To tackle this, there are two main solutions:

Reduce Precision – Utilizing FP8 and lower-bit quantization methods can reduce memory usage. However, this approach may impact accuracy and scalability, and is not sufficient by itself as models grow beyond hundreds of billions of parameters.
Distributed Inference – Spreading model computations across multiple GPUs or nodes enables scalability and efficiency. This is where distributed architectures like tensor parallelism and pipeline parallelism come into play.

vLLM Architecture and Large Language Model Inference Challenges

LLM inference poses unique challenges compared to training:

Unlike training, which focuses purely on throughput with known static shapes, inference requires low latency and dynamic workload handling.
Inference workloads must efficiently manage KV caches, speculative decoding, and prefill-to-decode transitions.
Large models often exceed single-GPU capacity, requiring advanced parallelization strategies.

To address these issues, vLLM provides:

Tensor parallelism to shard each model layer across multiple GPUs within a node.
Pipeline parallelism to distribute contiguous sections of model layers across multiple nodes.
Optimized communication kernels and control plane architecture to minimize CPU overhead and maximize GPU utilization.

GPU Parallelism Techniques in vLLM

Tensor Parallelism

Problem: Model Exceeds Single GPU Capacity

As models grow, a single GPU cannot accommodate them, necessitating multi-GPU strategies. Tensor parallelism shards model weights across GPUs, allowing concurrent computation for lower latency and enhanced scalability.

This approach, originally developed for training in Megatron-LM (Shoeybi et al., 2019), has been adapted and optimized in vLLM for inference workloads.

Tensor Parallelism relies on two primary techniques:

Column Parallelism: Splitting weight matrices along columns and concatenating results after computation.
Row Parallelism: Splitting matrices along rows, summing partial results post-computation.

As a specific example, let’s break down how this parallelism works for the MLP (multi-layer perceptron) layers in Llama models:

Column parallelism applies to up-projection operations.
Element-wise activation functions (e.g., SILU) operate on sharded outputs.
Row parallelism is used in down-projection, with an all-reduce operation to aggregate final results.
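The three steps above can be sketched with NumPy on a single machine. This is a minimal illustration, not vLLM's implementation: the function names are hypothetical, the "ranks" are simulated with a loop, and the MLP is simplified to one up-projection followed by a down-projection (the real Llama MLP also has a gate projection).

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x), applied element-wise.
    return x / (1.0 + np.exp(-x))

def mlp_reference(x, w_up, w_down):
    # Unsharded MLP: up-projection, activation, down-projection.
    return silu(x @ w_up) @ w_down

def mlp_tensor_parallel(x, w_up, w_down, tp_size):
    # Column parallelism: each simulated rank holds a slice of W_up's columns.
    up_shards = np.split(w_up, tp_size, axis=1)
    # Row parallelism: each rank holds the matching slice of W_down's rows.
    down_shards = np.split(w_down, tp_size, axis=0)
    partial_outputs = []
    for w_up_r, w_down_r in zip(up_shards, down_shards):
        hidden_r = silu(x @ w_up_r)  # the activation only needs the local shard
        partial_outputs.append(hidden_r @ w_down_r)
    # The all-reduce is modeled as a sum over the per-rank partial results.
    return np.sum(partial_outputs, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w_up = rng.standard_normal((16, 64))
w_down = rng.standard_normal((64, 16))

out_ref = mlp_reference(x, w_up, w_down)
out_tp = mlp_tensor_parallel(x, w_up, w_down, tp_size=4)
print(np.allclose(out_ref, out_tp))
```

The sharded and unsharded results agree up to floating-point error, and only one communication step (the final sum) is needed for the whole MLP block.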

Tensor parallelism distributes inference computation across multiple GPUs, maximizing the memory bandwidth and compute available. Latency improves because memory bandwidth is effectively multiplied: sharding the model weights lets multiple GPUs read their shards in parallel, reducing the bottleneck that a single GPU would encounter.

Source: Sebastian Raschka, 2023.

However, tensor parallelism requires high-bandwidth interconnects between GPUs, such as NVLink or InfiniBand, to keep the overhead of the additional communication low.

Pipeline Parallelism

Problem: Model Exceeds Multi-GPU Capacity

For extremely large models (e.g., DeepSeek R1, Llama 3.1 405B), a single node may not suffice. Pipeline parallelism shards models across nodes, each handling specific contiguous model layers.

How It Works

Each GPU loads and processes a distinct set of layers.
Send/Receive Operations: Intermediate activations are transmitted between GPUs as computation progresses.

This results in lower communication overhead compared to tensor parallelism since data transfer occurs once per pipeline stage.
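As a rough sketch of the idea, the snippet below splits a stack of layers into contiguous stages and counts the activation hand-offs between them. The helper names and the toy `tanh` layers are assumptions made for illustration; real pipeline stages live on different GPUs and exchange activations with send/receive operations.

```python
import numpy as np

def split_into_stages(layers, num_stages):
    # Assign contiguous layers to stages; any remainder goes to the first stages.
    base, extra = divmod(len(layers), num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)
        stages.append(layers[start:start + size])
        start += size
    return stages

def run_pipeline(x, stages):
    transfers = 0
    for i, stage in enumerate(stages):
        for w in stage:
            x = np.tanh(x @ w)  # toy layer standing in for a transformer block
        if i < len(stages) - 1:
            transfers += 1  # stands in for one send/receive of activations
    return x, transfers

rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) * 0.1 for _ in range(10)]
stages = split_into_stages(layers, num_stages=4)
x0 = rng.standard_normal((2, 8))
y_pipe, transfers = run_pipeline(x0, stages)

# Reference: run all layers sequentially on one device.
y_seq = x0
for w in layers:
    y_seq = np.tanh(y_seq @ w)

print([len(s) for s in stages], transfers, np.allclose(y_pipe, y_seq))
```

With 10 layers over 4 stages, activations cross a stage boundary only 3 times per forward pass, whereas tensor parallelism communicates inside every layer.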

Pipeline Parallelism reduces memory constraints across GPUs but does not inherently decrease inference latency as tensor parallelism does. To mitigate throughput inefficiencies, vLLM incorporates advanced pipeline scheduling, ensuring that all GPUs remain active by optimizing micro-batch execution.

Combining Tensor Parallelism and Pipeline Parallelism

As a general rule of thumb, think of the applications of parallelism like this:

Use pipeline parallelism across nodes and tensor parallelism within nodes when interconnects are slow.
If interconnects are efficient (e.g., NVLink, InfiniBand), tensor parallelism can extend across nodes.
Combining both techniques intelligently reduces unnecessary communication overhead and maximizes GPU utilization.
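The rule of thumb above can be written down as a small helper. This is a hypothetical function for illustration only, not part of vLLM; its output maps onto vLLM's `--tensor-parallel-size` and `--pipeline-parallel-size` options, and a real deployment should still be checked against the model's constraints and benchmarked.

```python
def choose_parallelism(num_nodes, gpus_per_node, fast_cross_node_interconnect):
    """Pick tensor/pipeline parallel sizes following the rule of thumb.

    Hypothetical helper: it only encodes the heuristic from the text.
    """
    if fast_cross_node_interconnect:
        # Fast links between nodes: tensor parallelism can span every GPU.
        return {
            "tensor_parallel_size": num_nodes * gpus_per_node,
            "pipeline_parallel_size": 1,
        }
    # Slow links between nodes: keep tensor parallelism inside each node
    # and use pipeline parallelism across nodes.
    return {
        "tensor_parallel_size": gpus_per_node,
        "pipeline_parallel_size": num_nodes,
    }

slow = choose_parallelism(num_nodes=2, gpus_per_node=8, fast_cross_node_interconnect=False)
fast = choose_parallelism(num_nodes=2, gpus_per_node=8, fast_cross_node_interconnect=True)
print(slow)
print(fast)
```

In both cases the product of the two sizes equals the total number of GPUs, so every GPU holds exactly one shard of the model.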

Performance Scaling and Memory Effects

While the basic principles of parallelization suggest linear scaling, in practice, the performance improvements can be super-linear due to memory effects. With either Tensor Parallelism or Pipeline Parallelism, throughput improvements can arise in non-obvious ways due to the memory available for KV Cache increasing super-linearly.

This super-linear scaling effect occurs because larger caches allow larger batch sizes, so more requests are processed in parallel with better memory locality, resulting in improved GPU utilization beyond what might be expected from simply adding more compute resources. In the above graph you can see that between TP=1 and TP=2 the number of KV cache blocks increases by 13.9x, which yields 3.9x more token throughput, much more than the linear 2x we would expect from using 2 GPUs instead of 1.
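A back-of-the-envelope calculation shows why the memory left for the KV cache can grow much faster than the GPU count. The numbers below are hypothetical and chosen only for illustration; they are not the measurements behind the figures quoted above, and the helper name is an assumption.

```python
def kv_cache_memory_gb(num_gpus, gpu_memory_gb, weights_gb, overhead_gb_per_gpu=0.0):
    # Weights are assumed to be sharded evenly across the GPUs.
    per_gpu_weights = weights_gb / num_gpus
    # Whatever remains on each GPU after weights and overhead can hold KV cache.
    free_per_gpu = gpu_memory_gb - per_gpu_weights - overhead_gb_per_gpu
    return max(free_per_gpu, 0.0) * num_gpus

# Hypothetical setup: 80 GB GPUs, 70 GB of weights, 4 GB of overhead per GPU.
tp1 = kv_cache_memory_gb(num_gpus=1, gpu_memory_gb=80.0, weights_gb=70.0, overhead_gb_per_gpu=4.0)
tp2 = kv_cache_memory_gb(num_gpus=2, gpu_memory_gb=80.0, weights_gb=70.0, overhead_gb_per_gpu=4.0)
ratio = tp2 / tp1
print(tp1, tp2, round(ratio, 1))
```

With these assumed numbers, one GPU has 6 GB left for the KV cache while two GPUs have 82 GB in total, roughly 13.7x more from doubling the hardware, because the weights take up a much smaller fraction of each GPU.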

Conclusion

Serving large models efficiently requires a combination of Tensor Parallelism, Pipeline Parallelism, and performance optimizations like Chunked Prefill. vLLM enables scalable inference by leveraging these techniques while ensuring adaptability across different hardware accelerators. As we continue to enhance vLLM, staying informed about new developments such as expert parallelism for Mixture of Experts (MoE) and expanded quantization support will be crucial for optimizing AI workloads.

Come to the Bi-weekly Office Hours to learn more about LLM inference optimizations and vLLM!

Acknowledgement

Thanks to Sangbin Cho (xAI) for originating some of the figures.

Introducing vLLM Inference Provider in Llama Stack

2025-01-27T00:00:00+00:00

We are excited to announce that vLLM inference provider is now available in Llama Stack through the collaboration between the Red Hat AI Engineering team and the Llama Stack team from Meta. This article provides an introduction to this integration and a tutorial to help you get started using it locally or deploying it in a Kubernetes cluster.

What is Llama Stack?

Llama Stack defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented in the form of interoperable APIs with a broad set of Service Providers providing their implementations.

Llama Stack focuses on making it easy to build production applications with a variety of models - ranging from the latest Llama 3.3 model to specialized models like Llama Guard for safety and other models. The goal is to provide pre-packaged implementations (aka “distributions”) which can be run in a variety of deployment environments. The Stack can assist you in your entire app development lifecycle - start iterating on local, mobile or desktop and seamlessly transition to on-prem or public cloud deployments. At every point in this transition, the same set of APIs and the same developer experience are available.

Each specific implementation of an API is called a “Provider” in this architecture. Users can swap providers via configuration. vLLM is a prominent example of a high-performance API backing the inference API.

vLLM Inference Provider

Llama Stack provides two vLLM inference providers:

Remote vLLM inference provider through vLLM’s OpenAI-compatible server;
Inline vLLM inference provider that runs alongside the Llama Stack server.

In this article, we will demonstrate the functionality through the remote vLLM inference provider.

Tutorial

Prerequisites

Linux operating system
Hugging Face CLI if you’d like to download the model via CLI.
OCI-compliant container technologies like Podman or Docker (can be specified via the CONTAINER_BINARY environment variable when running llama stack CLI commands).
Kind for Kubernetes deployment.
Conda for managing Python environment.

Get Started via Containers

Start vLLM Server

We first download the “Llama-3.2-1B-Instruct” model using the Hugging Face CLI. Note that you’ll need to request access to the model and then specify your Hugging Face token when logging in.

mkdir /tmp/test-vllm-llama-stack
huggingface-cli login --token 
huggingface-cli download meta-llama/Llama-3.2-1B-Instruct --local-dir /tmp/test-vllm-llama-stack/.cache/huggingface/hub/models/Llama-3.2-1B-Instruct

Next, let’s build the vLLM CPU container image from source. Note that while we use it for demonstration purposes, there are plenty of other images available for different hardware and architectures.

git clone git@github.com:vllm-project/vllm.git /tmp/test-vllm-llama-stack/vllm
cd /tmp/test-vllm-llama-stack/vllm
podman build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .

We can then start the vLLM container:

podman run -it --network=host \
   --group-add=video \
   --ipc=host \
   --cap-add=SYS_PTRACE \
   --security-opt seccomp=unconfined \
   --device /dev/kfd \
   --device /dev/dri \
   -v /tmp/test-vllm-llama-stack/.cache/huggingface/hub/models/Llama-3.2-1B-Instruct:/app/model \
   --entrypoint='["python3", "-m", "vllm.entrypoints.openai.api_server", "--model", "/app/model", "--served-model-name", "meta-llama/Llama-3.2-1B-Instruct", "--port", "8000"]' \
    vllm-cpu-env

We can get a list of models and test a prompt once the model server has started:

curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'

Start Llama Stack Server

Once we verify that the vLLM server has started successfully and is able to serve requests, we can then build and start the Llama Stack server.

First, we clone the Llama Stack source code and create a Conda environment that includes all the dependencies:

git clone git@github.com:meta-llama/llama-stack.git /tmp/test-vllm-llama-stack/llama-stack
cd /tmp/test-vllm-llama-stack/llama-stack
conda create -n stack python=3.10
conda activate stack
pip install .

Next, we build the container image with llama stack build:

cat > /tmp/test-vllm-llama-stack/vllm-llama-stack-build.yaml << "EOF"
name: vllm
distribution_spec:
  description: Like local, but use vLLM for running LLM inference
  providers:
    inference: remote::vllm
    safety: inline::llama-guard
    agents: inline::meta-reference
    vector_io: inline::faiss
    datasetio: inline::localfs
    scoring: inline::basic
    eval: inline::meta-reference
    post_training: inline::torchtune
    telemetry: inline::meta-reference
image_type: container
EOF

export CONTAINER_BINARY=podman
LLAMA_STACK_DIR=. PYTHONPATH=. python -m llama_stack.cli.llama stack build --config /tmp/test-vllm-llama-stack/vllm-llama-stack-build.yaml --image-name distribution-myenv

Once the container image has been built successfully, we can copy the generated vllm-run.yaml to /tmp/test-vllm-llama-stack/vllm-llama-stack-run.yaml and change the models field as follows:

models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: vllm
  provider_model_id: null

Then we can start the Llama Stack Server with the image we built via llama stack run:

export INFERENCE_ADDR=host.containers.internal
export INFERENCE_PORT=8000
export INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct
export LLAMA_STACK_PORT=5000

LLAMA_STACK_DIR=. PYTHONPATH=. python -m llama_stack.cli.llama stack run \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env VLLM_URL=http://$INFERENCE_ADDR:$INFERENCE_PORT/v1 \
--env VLLM_MAX_TOKENS=8192 \
--env VLLM_API_TOKEN=fake \
--env LLAMA_STACK_PORT=$LLAMA_STACK_PORT \
/tmp/test-vllm-llama-stack/vllm-llama-stack-run.yaml

Alternatively, we can run the following podman run command instead:

podman run --security-opt label=disable -it --network host -v /tmp/test-vllm-llama-stack/vllm-llama-stack-run.yaml:/app/config.yaml -v /tmp/test-vllm-llama-stack/llama-stack:/app/llama-stack-source \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env VLLM_URL=http://$INFERENCE_ADDR:$INFERENCE_PORT/v1 \
--env VLLM_MAX_TOKENS=8192 \
--env VLLM_API_TOKEN=fake \
--env LLAMA_STACK_PORT=$LLAMA_STACK_PORT \
--entrypoint='["python", "-m", "llama_stack.distribution.server.server", "--yaml-config", "/app/config.yaml"]' \
localhost/distribution-myenv:dev

Once the Llama Stack server has started successfully, we can test an inference request:

Via Bash:

llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"

Output:

ChatCompletionResponse(
    completion_message=CompletionMessage(
        content="Hello! I'm an AI, a conversational AI model. I'm a type of computer program designed to understand and respond to human language. My creators have 
trained me on a vast amount of text data, allowing me to generate human-like responses to a wide range of questions and topics. I'm here to help answer any question you 
may have, so feel free to ask me anything!",
        role='assistant',
        stop_reason='end_of_turn',
        tool_calls=[]
    ),
    logprobs=None
)

Via Python:

import os
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")

# List available models
models = client.models.list()
print(models)

response = client.inference.chat_completion(
    model_id=os.environ["INFERENCE_MODEL"],
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"}
    ]
)
print(response.completion_message.content)

Output:

[Model(identifier='meta-llama/Llama-3.2-1B-Instruct', metadata={}, api_model_type='llm', provider_id='vllm', provider_resource_id='meta-llama/Llama-3.2-1B-Instruct', type='model', model_type='llm')]
Here is a haiku about coding:

Columns of code flow
Logic codes the endless night
Tech's silent dawn rise

Deployment on Kubernetes

Instead of starting the Llama Stack and vLLM servers locally, we can deploy them in a Kubernetes cluster. We’ll use a local Kind cluster for demonstration purposes:

kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test

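Start the vLLM server as a Kubernetes Pod and Service. The Pod below expects a Secret named hf-token-secret that holds your Hugging Face token (remember to use your actual token) and a PersistentVolumeClaim named vllm-models, so both must exist in the cluster:

cat <<EOF | kubectl apply -f -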
---
apiVersion: v1
kind: Pod
metadata:
  name: vllm-server
  labels:
    app: vllm
spec:
  containers:
  - name: llama-stack
    image: localhost/vllm-cpu-env:latest
    command:
        - bash
        - -c
        - |
          MODEL="meta-llama/Llama-3.2-1B-Instruct"
          MODEL_PATH=/app/model/$(basename $MODEL)
          huggingface-cli login --token $HUGGING_FACE_HUB_TOKEN
          huggingface-cli download $MODEL --local-dir $MODEL_PATH --cache-dir $MODEL_PATH
          python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH --served-model-name $MODEL --port 8000
    ports:
      - containerPort: 8000
    volumeMounts:
      - name: llama-storage
        mountPath: /app/model
    env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token-secret
            key: token
  volumes:
  - name: llama-storage
    persistentVolumeClaim:
      claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm
  ports:
  - port: 8000
    targetPort: 8000
  type: NodePort
EOF

We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):

$ kubectl logs vllm-server
...
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Then we can copy the previously created vllm-llama-stack-run.yaml to /tmp/test-vllm-llama-stack/vllm-llama-stack-run-k8s.yaml and update the inference provider as follows:

providers:
  inference:
  - provider_id: vllm
    provider_type: remote::vllm
    config:
      url: http://vllm-server.default.svc.cluster.local:8000/v1
      max_tokens: 4096
      api_token: fake

Once we have defined the run configuration for Llama Stack, we can build an image with that configuration and the server source code:

cat >/tmp/test-vllm-llama-stack/Containerfile.llama-stack-run-k8s <



We can then start the Llama Stack server by deploying a Kubernetes Pod and Service:

cat <


We can check that the Llama Stack server has started:

$ kubectl logs vllm-server
...
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     ASGI 'lifespan' protocol appears unsupported.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)


Now let’s forward the Kubernetes service to a local port and test some inference requests against it via the Llama Stack Client:

kubectl port-forward service/llama-stack-service 5000:5000
llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"


You can learn more about the different providers and functionalities of Llama Stack in the official documentation.

Acknowledgement

We’d like to thank the Red Hat AI Engineering team for the implementation of the vLLM inference providers, contributions to many bug fixes, improvements, and key design discussions. We also want to thank the Llama Stack team from Meta and the vLLM team for their timely PR reviews and bug fixes.



vLLM V1: A Major Upgrade to vLLM’s Core Architecture

2025-01-27T00:00:00+00:00






We are thrilled to announce the alpha release of vLLM V1, a major upgrade to vLLM’s core architecture. Based on lessons we learned over the past 1.5 years of vLLM development, we revisited key design decisions, consolidated various features, and simplified the codebase to enhance flexibility and scalability. V1 already achieves state-of-the-art performance and is set to gain even more optimizations. Best of all, users can enable V1 seamlessly—just set the VLLM_USE_V1=1 environment variable without any changes to the existing API. After testing and feedback collection in the coming weeks, we plan to transition V1 into the default engine.

Why vLLM V1?

Learning from vLLM V0

Over the past 1.5 years, vLLM has achieved remarkable success in supporting diverse models, features, and hardware backends. However, while our community scaled horizontally, we faced challenges making the systems simple and integrating various optimizations vertically across the stack. Features were often developed independently, making it difficult to combine them effectively and cleanly. Over time, technical debt accumulated, prompting us to revisit our foundational design.

Goals of V1

Based on the above motivation, vLLM V1 is designed to:

  Provide a simple, modular, and easy-to-hack codebase.
  Ensure high performance with near-zero CPU overhead.
  Combine key optimizations into a unified architecture.
  Require zero configs by enabling features/optimizations by default.


Scope of V1

vLLM V1 introduces a comprehensive re-architecture of its core components, including the scheduler, KV cache manager, worker, sampler, and API server. However, it still shares a lot of code with vLLM V0, such as model implementations, GPU kernels, distributed control plane, and various utility functions. This approach allows V1 to leverage the extensive coverage and stability established by V0 while delivering significant enhancements to performance and code complexity.

What’s New in vLLM V1?

1. Optimized Execution Loop & API Server







As a full-fledged continuous batching engine and OpenAI-compatible API server, vLLM’s core execution loop relies on CPU operations to manage request states between model forward passes. As GPUs are getting faster and significantly reducing model execution times, the CPU overhead for tasks like running the API server, scheduling work, preparing inputs, de-tokenizing outputs, and streaming responses to users becomes increasingly pronounced. This issue is particularly noticeable with smaller models like Llama-8B running on NVIDIA H100 GPUs, where execution time on the GPU is as low as ~5ms.

In the v0.6.0 release, vLLM introduced a multiprocessing API server utilizing ZeroMQ for IPC, enabling overlap between the API server and AsyncLLM. vLLM V1 extends this by integrating the multiprocessing architecture deeper into the core of AsyncLLM, creating an isolated EngineCore execution loop that focuses exclusively on the scheduler and model executor. This design allows for greater overlap of CPU-intensive tasks—such as tokenization, multimodal input processing, de-tokenization, and request streaming—with the core execution loop, thereby maximizing model throughput.

2. Simple & Flexible Scheduler







vLLM V1 introduces a simple yet flexible scheduler. It removes the traditional distinction between “prefill” and “decode” phases by treating user-given prompt tokens and model-generated output tokens uniformly. Scheduling decisions are represented as a simple dictionary, e.g., {request_id: num_tokens}, which specifies the number of tokens to process for each request at each step. We find that this representation is general enough to support features such as chunked prefills, prefix caching, and speculative decoding. For instance, chunked-prefill scheduling is seamlessly implemented: with a fixed token budget, the scheduler dynamically decides how many tokens to allocate to each request (as shown in the figure above).
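The following is a minimal sketch of a scheduler that produces a {request_id: num_tokens} dictionary under a fixed token budget. It is an illustration of the idea rather than vLLM's scheduler: the `Request` fields, the first-come-first-served order, and the `advance` helper are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    request_id: str
    num_tokens: int               # prompt tokens plus output tokens known so far
    num_computed_tokens: int = 0  # tokens already processed by the model

def schedule(requests, token_budget):
    # Prompt tokens and output tokens are treated the same way:
    # each request simply has some number of tokens still to process.
    decisions = {}
    for req in requests:
        if token_budget == 0:
            break
        remaining = req.num_tokens - req.num_computed_tokens
        if remaining <= 0:
            continue
        n = min(remaining, token_budget)
        decisions[req.request_id] = n
        token_budget -= n
    return decisions

def advance(requests, decisions):
    # Apply one step: mark tokens as computed, and pretend one new output
    # token is sampled for every request whose known tokens are all computed.
    for req in requests:
        req.num_computed_tokens += decisions.get(req.request_id, 0)
        if req.num_computed_tokens == req.num_tokens:
            req.num_tokens += 1

requests = [
    Request("r1", num_tokens=101, num_computed_tokens=100),  # decoding
    Request("r2", num_tokens=51, num_computed_tokens=50),    # decoding
    Request("r3", num_tokens=500),                            # new 500-token prompt
]
step1 = schedule(requests, token_budget=256)
advance(requests, step1)
step2 = schedule(requests, token_budget=256)
print(step1)
print(step2)
```

In this example the 500-token prompt of r3 is split into chunks of 254 and 246 tokens across two steps while r1 and r2 keep decoding one token per step, which is the chunked-prefill behavior described above.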

3. Zero-Overhead Prefix Caching

vLLM V1, like V0, uses hash-based prefix caching and LRU-based cache eviction. In V0, enabling prefix caching sometimes causes significant CPU overhead, degrading performance when the cache hit rate is low. As a result, it is disabled by default. In V1, we optimize the data structure for constant-time cache eviction and carefully minimize Python object creation overhead. As a result, V1’s prefix caching introduces near-zero performance degradation, even when the cache hit rate is 0%.







Here are some benchmark results. In our experiments, we observed that V1’s prefix caching causes less than a 1% decrease in throughput even when the cache hit rate is 0%, while it improves performance several times over when the cache hit rate is high. Thanks to the near-zero overhead, we now enable prefix caching by default in V1.
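To make hash-based lookup and constant-time LRU eviction concrete, here is a small sketch. It is not vLLM's data structure: the block size, the class name, and the use of `OrderedDict` as the LRU structure are assumptions chosen to keep the example short.

```python
from collections import OrderedDict

BLOCK_SIZE = 4  # tokens per KV cache block (illustrative value)

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    # Each full block is hashed together with its parent's hash, so a block
    # hash identifies the entire prefix up to and including that block.
    hashes, parent = [], None
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = tuple(token_ids[start:start + block_size])
        parent = hash((parent, block))
        hashes.append(parent)
    return hashes

class PrefixCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block hash -> True, least recently used first

    def lookup(self, token_ids):
        # Count how many leading blocks of the sequence are already cached.
        hits = 0
        for h in block_hashes(token_ids):
            if h not in self.blocks:
                break
            self.blocks.move_to_end(h)  # O(1) recency update
            hits += 1
        return hits

    def insert(self, token_ids):
        for h in block_hashes(token_ids):
            if h in self.blocks:
                self.blocks.move_to_end(h)
                continue
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # O(1) eviction of the LRU block
            self.blocks[h] = True

cache = PrefixCache(capacity=2)
seq_a = [1, 2, 3, 4, 5, 6, 7, 8]
seq_b = [11, 12, 13, 14, 15, 16, 17, 18]
cache.insert(seq_a)
hits_full = cache.lookup(seq_a + [9])                      # both blocks reused
hits_partial = cache.lookup([1, 2, 3, 4, 50, 60, 70, 80])  # only the first block
cache.insert(seq_b)                                        # evicts both blocks of seq_a
hits_after_eviction = cache.lookup(seq_a)
hits_b = cache.lookup(seq_b)
print(hits_full, hits_partial, hits_after_eviction, hits_b)
```

Both lookup and eviction cost a constant amount of work per block, so a cache miss adds little overhead. A production cache would also track which blocks are in use by running requests and prefer evicting the tail of a sequence first, which this sketch omits.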

4. Clean Architecture for Tensor-Parallel Inference







vLLM V1 introduces a clean and efficient architecture for tensor-parallel inference, effectively addressing the limitations of V0. In V0, the scheduler and Worker 0 are colocated within the same process to reduce the inter-process communication overhead when broadcasting input data to workers. However, this design introduces an asymmetric architecture, increasing complexity. V1 overcomes this by caching request states on the worker side and transmitting only incremental updates (diffs) at each step. This optimization minimizes inter-process communication, allowing the scheduler and Worker 0 to operate in separate processes, resulting in a clean, symmetric architecture. Moreover, V1 abstracts away most distributed logic, enabling workers to operate the same way for both single-GPU and multi-GPU setups.
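The idea of caching request states on the worker and sending only diffs can be sketched as follows. The message layout (`new`, `appended`, `finished`) and the `Worker` class are hypothetical and serve only to show why the per-step payload stays small.

```python
class Worker:
    """Keeps a local copy of request state and applies incremental updates."""

    def __init__(self):
        self.requests = {}  # request_id -> list of token ids

    def apply_update(self, update):
        # New requests arrive once with their full prompt.
        for request_id, prompt in update.get("new", {}).items():
            self.requests[request_id] = list(prompt)
        # Running requests only receive the tokens added since the last step.
        for request_id, tokens in update.get("appended", {}).items():
            self.requests[request_id].extend(tokens)
        # Finished requests are dropped from the cached state.
        for request_id in update.get("finished", []):
            self.requests.pop(request_id, None)

worker = Worker()
worker.apply_update({"new": {"a": [1, 2, 3]}})
worker.apply_update({"new": {"b": [7]}, "appended": {"a": [4]}})
worker.apply_update({"appended": {"a": [5], "b": [8]}})
state_mid = {k: list(v) for k, v in worker.requests.items()}
worker.apply_update({"appended": {"b": [9]}, "finished": ["a"]})
print(state_mid)
print(worker.requests)
```

After the first message for a request, each step carries only a handful of token ids per request instead of the full request state, and every worker can apply the same update independently.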

5. Efficient Input Preparation







In vLLM V0, input tensors and metadata for the model are recreated at each step, often leading to significant CPU overhead. To optimize this, V1 implements the Persistent Batch technique, which caches the input tensors and only applies the diffs to them at each step. Additionally, V1 minimizes the CPU overhead of updating the tensors by extensively using NumPy operations instead of Python’s native ones.
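A persistent batch can be pictured as a preallocated array that is updated in place. The sketch below is an assumption-laden illustration (class name, array layout, and row management are all hypothetical), showing how vectorized NumPy writes replace per-step reconstruction.

```python
import numpy as np

class PersistentBatch:
    def __init__(self, max_requests, max_len):
        # Buffers are allocated once and reused across steps.
        self.token_ids = np.zeros((max_requests, max_len), dtype=np.int64)
        self.num_tokens = np.zeros(max_requests, dtype=np.int64)
        self.row_of = {}  # request_id -> row index in the buffers
        self.free_rows = list(range(max_requests - 1, -1, -1))

    def add_request(self, request_id, prompt):
        row = self.free_rows.pop()
        self.row_of[request_id] = row
        self.token_ids[row, :len(prompt)] = prompt
        self.num_tokens[row] = len(prompt)

    def append_tokens(self, new_tokens):
        # new_tokens maps request_id -> newly sampled token id.
        rows = np.array([self.row_of[r] for r in new_tokens], dtype=np.int64)
        values = np.array(list(new_tokens.values()), dtype=np.int64)
        # One vectorized write applies the diff for every running request.
        self.token_ids[rows, self.num_tokens[rows]] = values
        self.num_tokens[rows] += 1

    def remove_request(self, request_id):
        row = self.row_of.pop(request_id)
        self.num_tokens[row] = 0
        self.free_rows.append(row)  # the row is reused by a later request

batch = PersistentBatch(max_requests=4, max_len=8)
batch.add_request("a", [1, 2, 3])
batch.add_request("b", [4, 5])
batch.append_tokens({"a": 9, "b": 7})
row_a = batch.token_ids[0, :4].tolist()
row_b = batch.token_ids[1, :3].tolist()
batch.remove_request("a")
batch.add_request("c", [6])
print(row_a, row_b, batch.row_of)
```

Only the rows that changed are touched at each step, and the buffers themselves are never reallocated.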

6. torch.compile and Piecewise CUDA Graphs







V1 leverages vLLM’s torch.compile integration to automatically optimize the model. This allows V1 to efficiently support a wide variety of models while minimizing the need to write custom kernels. Furthermore, V1 introduces piecewise CUDA graphs to alleviate the limitations of CUDA graphs. We are preparing dedicated blog posts on the torch.compile integration and piecewise CUDA graphs, so stay tuned for more updates!

7. Enhanced Support for Multimodal LLMs

vLLM V1 treats multimodal large language models (MLLMs) as first-class citizens and introduces several key improvements in their support.

First, V1 optimizes multimodal input preprocessing by moving it to a non-blocking process. For example, image files (e.g., JPG or PNG) must be converted into tensors of pixel values, cropped, and transformed before being fed into the model. This preprocessing can consume significant CPU cycles, possibly leaving the GPU idle. To address this, V1 offloads the preprocessing task to a separate process, preventing it from blocking the GPU worker, and adds a preprocessing cache so that processed inputs can be reused across requests if they share the same multimodal input.

Second, V1 introduces prefix caching for multimodal inputs. In addition to the hash of token IDs, image hashes are used to identify the KV cache for image inputs. This improvement is especially beneficial for multi-turn conversations that include image inputs.

Third, V1 enables chunked-prefill scheduling for MLLMs with the “encoder cache.” In V0, image inputs and text inputs had to be processed in the same step because the LLM decoder’s image placeholder token depends on the vision embeddings, which are discarded after the step. With the encoder cache, V1 temporarily stores the vision embeddings, allowing the scheduler to split the text inputs into chunks and process them across multiple steps without needing to regenerate vision embeddings every step.
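The multimodal prefix caching described above, where image hashes are combined with token IDs to identify cached KV blocks, can be sketched like this. The function names and the choice of SHA-256 are assumptions for illustration, not vLLM's actual hashing scheme.

```python
import hashlib

def image_hash(image_bytes):
    # Content hash of the raw image data.
    return hashlib.sha256(image_bytes).hexdigest()

def block_key(parent_key, block_token_ids, image_hashes=()):
    # The key covers the parent block, the token ids in this block, and the
    # hashes of any images whose placeholder tokens fall inside the block.
    payload = repr((parent_key, tuple(block_token_ids), tuple(image_hashes)))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Placeholder token ids are identical for every image, so token ids alone
# cannot tell two different images apart.
placeholder_block = [32000, 32000, 32000, 32000]
cat = image_hash(b"bytes of a cat picture")
dog = image_hash(b"bytes of a dog picture")

key_cat = block_key(None, placeholder_block, [cat])
key_cat_again = block_key(None, placeholder_block, [cat])
key_dog = block_key(None, placeholder_block, [dog])
print(key_cat == key_cat_again, key_cat == key_dog)
```

The same image in a later turn of a conversation maps to the same key and can reuse its cached KV blocks, while a different image behind identical placeholder tokens gets a different key.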

8. FlashAttention 3

The final piece of the puzzle for vLLM V1 was integrating FlashAttention 3. Given the high level of dynamism in V1—such as combining prefill and decode within the same batch—a flexible and high-performance attention kernel was essential. FlashAttention 3 effectively addresses this requirement, offering robust support for a wide range of features while maintaining excellent performance across diverse use cases.

Performance

Thanks to the extensive architectural enhancements, vLLM V1 achieves state-of-the-art throughput and latency, delivering up to 1.7x higher throughput compared to V0 (without multi-step scheduling).
These dramatic performance gains stem from comprehensive CPU overhead reductions across the entire stack.
The improvements are even more pronounced for vision-language models (VLMs) like Qwen2-VL, thanks to V1’s enhanced support for VLMs.


  Text Models: Llama 3.1 8B & Llama 3.3 70B








We measured the performance of vLLM V0 and V1 on Llama 3.1 8B and Llama 3.3 70B models using the ShareGPT dataset.
V1 demonstrated consistently lower latency than V0, especially at high QPS, thanks to the higher throughput it achieves.
Given that the kernels used for V0 and V1 are almost identical, the performance difference is mainly due to the architectural improvements (reduced CPU overheads) in V1.


  Vision-language Models: Qwen2-VL








We evaluated the performance on VLMs by testing Qwen2-VL using the VisionArena dataset.
V1 delivered even larger speedups over V0 thanks to its improved VLM support, driven by two key improvements: offloading input processing to a separate process and implementing more flexible scheduling for multimodal queries.
We would also like to point out that prefix caching is now natively supported for multimodal models in V1, but we will skip the benchmark results here.


  Looking Forward


While these improvements are significant, we view them as just the beginning.
The redesigned architecture provides a solid foundation that will enable rapid development of new features.
We look forward to sharing additional enhancements in the coming weeks.
Stay tuned for more updates!

Limitations & Future Work

While vLLM V1 shows promising results, it is still in its alpha stage and lacks several features from V0. Here’s a clarification:

Model Support:

V1 supports decoder-only Transformers like Llama, mixture-of-experts (MoE) models like Mixtral, and several VLMs such as Qwen2-VL. All quantization methods are supported. However, V1 currently does not support encoder-decoder architectures like multimodal Llama 3.2, Mamba-based models like Jamba, or embedding models. Please check out our documentation for a more detailed list of the supported models.

Feature Limitations:

V1 currently lacks support for the log probs and prompt log probs sampling parameters, pipeline parallelism, structured decoding, speculative decoding, Prometheus metrics, and LoRA. We are actively working to close this feature gap and add brand-new optimizations to the V1 engine.

Hardware Support:

V1 currently supports only Ampere or later NVIDIA GPUs. We are actively working to extend support to other hardware backends such as TPU.

Finally, please note that you can continue using V0 and maintain backward compatibility by not setting VLLM_USE_V1=1.

How to Get Started

To use vLLM V1:

  Install the latest version of vLLM with pip install vllm --upgrade.
  Set the environment variable export VLLM_USE_V1=1.
  Use vLLM’s Python API or OpenAI-compatible server (vllm serve ). You don’t need any change to the existing API.


Please try it out and share your feedback!

Acknowledgment

We gratefully acknowledge that the design of vLLM V1 builds upon and enhances several open-source LLM inference engines, including LightLLM, LMDeploy, SGLang, TGI, and TRT-LLM. These engines have significantly influenced our work, and we have gained valuable insights from them.

The V1 re-architecture is a continued joint effort across the entire vLLM team and community. Below is an incomplete list of contributors to this milestone:


  UC Berkeley, Neural Magic (now Red Hat), Anyscale, and Roblox mainly drove the effort together.
  Woosuk Kwon initiated the project and implemented the scheduler and model runner.
  Robert Shaw implemented the optimized execution loop and API server.
  Cody Yu implemented efficient prefix caching for text and image inputs.
  Roger Wang led the overall enhanced MLLM support in V1.
  Kaichao You led the torch.compile integration and implemented the piecewise CUDA graphs.
  Tyler Michael Smith implemented the tensor parallelism support with Python multiprocessing.
  Rui Qiao implemented the tensor parallelism support with Ray and is implementing pipeline parallelism support.
  Lucas Wilkinson added support for FlashAttention 3.
  Alexander Matveev implemented the optimized preprocessor for multimodal inputs and is implementing TPU support.
  Sourashis Roy implemented the logit penalties in the sampler.
  Cyrus Leung led the MLLM input processing refactoring effort and helped integrate it into V1.
  Russell Bryant addressed several multiprocess-related issues.
  Nick Hill optimized the engine loop and API server.
  Ricky Xu and Chen Zhang helped refactor the KV cache manager.
  Jie Li and Michael Goin helped with MLLM support and optimization.
  Aaron Pham is implementing the structured decoding support.
  Varun Sundar Rabindranath is implementing the multi-LoRA support.
  Andrew Feldman is implementing the log probs and prompt log probs support.
  Lily Liu is implementing the speculative decoding support.
  Kuntai Du is implementing the prefill disaggregation and KV Cache transfer support.
  Simon Mo and Zhuohan Li contributed to the V1 system design.



High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”
2025-01-21T00:00:00+00:00



TL;DR

  vLLM boasts the largest open-source community, but what does it take to transform vLLM from the best single-node LLM engine to a premier LLM serving system?
  Today, we release “vLLM production-stack”, a vLLM-based full inference stack that introduces two major advantages:
    
      10x better performance (3-10x lower response delay & 2-5x higher throughput) with prefix-aware request routing and KV-cache sharing.
      Easy cluster deployment with built-in support for fault tolerance, autoscaling, and observability.
    
  
  And the best part? It’s open-source—so everyone can get started right away! [https://github.com/vllm-project/production-stack]


The Context


In the AI arms race, it’s no longer just about who has the best model—it’s about who has the best LLM serving system.

vLLM has taken the open-source community by storm, with unparalleled hardware and model support plus an active ecosystem of top-notch contributors. But until now, vLLM has mostly focused on single-node deployments.

How do we extend its power into a full-stack inference system that any organization can deploy at scale with high reliability, high throughput, and low latency? That’s precisely why the LMCache team and the vLLM team built vLLM production-stack.





Introducing “vLLM Production-Stack”
vLLM Production-stack is an open-source reference implementation of an inference stack built on top of vLLM, designed to run seamlessly on a cluster of GPU nodes. It adds four critical functionalities that complement vLLM’s native strengths:

  KV cache sharing & storage to speed up inference when context is reused (powered by the LMCache project).
  Prefix-aware routing that sends queries to the vLLM instance already holding the relevant context KV cache.
  Observability of individual engine status and query-level metrics (TTFT, TBT, throughput).
  Autoscaling to handle dynamics of workloads.
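As a rough illustration of the query-level metrics named above, the sketch below derives TTFT, mean TBT, and output throughput from per-token timestamps. The `request_metrics` function and its input format are assumptions for the example; the post does not describe how production-stack collects or aggregates these values.

```python
def request_metrics(arrival_time, token_times):
    """Compute TTFT, mean TBT, and output throughput for one request.

    arrival_time: when the request arrived, in seconds.
    token_times:  timestamps at which each output token was emitted.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    # Time to first token: delay between arrival and the first output token.
    ttft = token_times[0] - arrival_time
    # Time between tokens: mean gap between consecutive output tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_tbt = sum(gaps) / len(gaps) if gaps else 0.0
    # Throughput: output tokens per second over the whole request.
    total = token_times[-1] - arrival_time
    throughput = len(token_times) / total if total > 0 else float("inf")
    return {"ttft": ttft, "tbt": mean_tbt, "tokens_per_s": throughput}


metrics = request_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
```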


Comparison with Alternatives:

Below is a quick snapshot comparing vLLM production-stack with its closest counterparts:




The Design
The vLLM production-stack architecture builds on top of vLLM’s powerful single-node engine to provide a cluster-wide solution.

At a high level:

  Applications send LLM inference requests.
  Prefix-aware routing checks if the requested context is already cached within the memory pool of one instance. It then forwards the request to the node with the pre-computed cache.
  Autoscaling and a cluster manager watch the overall load and spin up new vLLM nodes if needed.
  Observability modules gather metrics like TTFT (Time-To-First-Token), TBT (Time-Between-Tokens), and throughput, giving you real-time insights into your system’s health.
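The routing step can be sketched as follows. This is a simplified model, not the production-stack router: `PrefixAwareRouter`, its in-memory bookkeeping, and the instance names are hypothetical. It assumes the router can see which token sequences each instance has served, and it prefers the instance with the longest matching prefix, breaking ties by lower load.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixAwareRouter:
    def __init__(self, instances):
        # instance name -> token sequences previously served there
        self.cached = {name: [] for name in instances}
        self.load = {name: 0 for name in instances}

    def route(self, tokens):
        def score(name):
            best = max(
                (common_prefix_len(tokens, seq) for seq in self.cached[name]),
                default=0,
            )
            # Prefer a longer cached prefix, then a less loaded instance.
            return (best, -self.load[name])

        choice = max(self.cached, key=score)
        self.cached[choice].append(list(tokens))
        self.load[choice] += 1
        return choice


router = PrefixAwareRouter(["vllm-0", "vllm-1"])
first = router.route([1, 2, 3, 4])           # no cache anywhere yet
second = router.route([9, 9, 9])             # unrelated prompt, goes to the idle instance
followup = router.route([1, 2, 3, 4, 5, 6])  # extends the first conversation
```

The follow-up request lands on the instance that already holds the KV cache for its first four tokens, which is what lets that instance skip recomputing the shared prefix.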






Advantage #1: Easy Deployment

Use the Helm chart to deploy vLLM production-stack to your k8s cluster by running a single command:
sudo helm repo add llmstack-repo https://lmcache.github.io/helm/ &&\
  sudo helm install llmstack llmstack-repo/vllm-stack 


For more details, please refer to the detailed README in the vLLM production-stack repo. Tutorials on setting up a k8s cluster and customizing Helm charts are also available.

Advantage #2: Better Performance
We benchmarked a multi-round Q&A workload on vLLM production-stack and other setups, including vLLM + KServe and a commercial endpoint service.
The results show that the vLLM stack outperforms the other setups across key metrics (time to first token and inter-token latency).









Advantage #3: Effortless Monitoring
Track your LLM inference cluster in real time with key metrics, including latency distributions, number of requests over time, and KV cache hit rate.





Conclusion
We’re thrilled to unveil vLLM Production Stack—the next step in transforming vLLM from a best-in-class single-node engine into a full-scale LLM serving system. 
We believe the vLLM stack will open new doors for organizations seeking to build, test, and deploy LLM applications at scale without sacrificing performance or simplicity.

If you’re as excited as we are, don’t wait!

  Clone the repo: https://github.com/vllm-project/production-stack
  Kick the tires
  Let us know what you think!
  Interest Form


Join us to build a future where every application can harness the power of LLM inference—reliably, at scale, and without breaking a sweat.
Happy deploying!

Contacts:

  vLLM slack
  LMCache slack



Structured Decoding in vLLM: a gentle introduction
2025-01-14T00:00:00+00:00
TL;DR:


  Structured decoding allows precise control over LLM output formats
  vLLM now supports both outlines and XGrammar backends for structured decoding
  Recent XGrammar integration brings up to 5x improvement in time per output token (TPOT) under load
  The upcoming v1 release focuses on enhanced performance and scheduler-level mask broadcasting for mixed-request batch support


vLLM is a high-throughput and efficient inference engine for running large language models (LLMs). In this post, we will explore the annotated history of language models, describe the current state of structured decoding in vLLM as well as the recent integration with XGrammar, and share our tentative roadmap for future improvements.


  We would also invite readers to approach this blog post from a philosophical perspective; in the process, we try to posit that structured decoding represents a fundamental shift in how we think about LLM outputs. It also plays an important role in building complex agentic systems.


For more information about vLLM, please check out our documentation.

Language models: A brief historical context

In 1950, Alan Turing proposed that a high-speed digital computer, programmed with rules, could exhibit emergent behaviour of intelligence (Turing, 1950). This led to two main approaches in AI development:


  
    Good Old-Fashioned AI (GOFAI): A paradigm quickly emerged among researchers in the 1950s in which expert systems (or symbolic reasoning systems) were designed to replicate the decision-making capabilities of a human specialist¹. Haugeland referred to this paradigm as Good Old-Fashioned AI (GOFAI) (Haugeland, 1997). However, it quickly ran into funding problems because its semantic representations could not scale up to generalised tasks (a period also known as the “AI Winter” (Hendler, 2008)).
  
  
    New-Fangled AI (NFAI): Concurrently, Donald Norman’s Parallel Distributed Processing (Rumelhart et al., 1986) group investigated variations of Rosenblatt’s perceptron (Rosenblatt, 1958), proposing hidden layers within the network, alongside the inputs and outputs, to extrapolate appropriate responses based on what the network had learned during the training process. These connectionist networks were often built on top of statistical methods². Given the abundance of data and the unprecedented amount of compute made available by Moore’s Law³, connectionist networks have come to dominate both research and production use cases, most notably variants of decoder-only transformers⁴ for text generation tasks. As such, most modern transformer variants are considered NFAI systems.
  


In summary:


  GOFAI systems are deterministic and rule-based, since their intentionality is injected through explicit programming
  NFAI systems are often considered “black-box” models (in: input, out: some output) and are data-driven, given the networked complexity of their internal representations


Why do we need structured decoding?


  

Shoggoth as GPTs. In a sense, RLHF, or any post-training method, is an injection of rules (a GOFAI system) into a large compound AI system



LLMs excel at the following heuristic: given a blob of text, the model will generate a contiguous piece of text that it predicts as the most probable tokens. For example, if you give it a Wikipedia article, the model should produce text consistent with the remainder of said article.

These models work well under one assumption: the input prompt must be coherent and well-structured around the problem the user wants to solve. In other words, LLMs can be unpredictable when you need output in a specific format. Think of asking a model to generate JSON: without guidance, it might produce valid text that breaks the JSON specification⁵.

This is where structured decoding comes in. It enables LLMs to generate outputs that follow a desired structure while preserving the non-deterministic nature of the system.

Companies like OpenAI have recognized this need, implementing features like JSON mode to constrain⁶ the output format. If you have built with such functionality before (for example, agentic workflows, function calling, or coding assistants), chances are you were using structured decoding under the hood.


  Guided decoding is to LLMs what validation is to APIs: it acts as a guarantee that what comes out matches what you expect. Guided decoding ensures structural integrity, which allows developers to integrate LLMs into their applications with ease!


Structured decoding and vLLM

In simple terms, structured decoding gives LLMs a “template” to follow. Users provide a schema that “influences” the model’s output, ensuring compliance with the desired structure:



From a technical perspective, an inference engine can modify the probability distribution over next tokens by applying a bias (often via logit masks) to all tokens, derived from a given schema. To apply these biases, Outlines proposed guided generation via a finite-state machine (FSM) built from the schema (Willard & Louf, 2023). This allows us to track the current state during decoding and filter out invalid tokens by applying a logit bias to the output.
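The mechanism can be sketched with a hand-written token-level FSM. This is a toy under stated assumptions, not the Outlines or vLLM implementation: the seven-token vocabulary, the `FSM` table, and the fake `biased_logits` "model" are all invented for the example. It shows how masking disallowed logits to negative infinity forces the output to follow the schema even when the model prefers an invalid token.

```python
import math

VOCAB = ["{", "}", '"ok"', ":", "true", "false", "hello"]

# Token-level FSM for the language {"ok": true} | {"ok": false}.
# state -> {allowed token: next state}
FSM = {
    0: {"{": 1},
    1: {'"ok"': 2},
    2: {":": 3},
    3: {"true": 4, "false": 4},
    4: {"}": 5},
    5: {},  # accepting state: nothing more may be generated
}


def mask_logits(logits, state):
    """Set the logit of every token the FSM forbids in `state` to -inf."""
    allowed = FSM[state]
    return [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]


def guided_greedy_decode(logits_fn, max_steps=10):
    state, out = 0, []
    for _ in range(max_steps):
        if not FSM[state]:
            break
        masked = mask_logits(logits_fn(out), state)
        tok = VOCAB[max(range(len(VOCAB)), key=masked.__getitem__)]
        out.append(tok)
        state = FSM[state][tok]
    return out


def biased_logits(_prefix):
    # A fake model that always prefers "hello", then "false".
    return [0.1, 0.2, 0.3, 0.4, 0.5, 0.9, 5.0]


result = guided_greedy_decode(biased_logits)
```

The unconstrained model would emit "hello" at every step; with the mask applied, the only freedom left is the choice between "true" and "false" in state 3.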


  

courtesy of LMSys, 2024.



In vLLM, you can use this by passing a JSON schema to the sampling params (either through the Python SDK or HTTP requests).
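For the HTTP path, a request body might look like the sketch below. It assumes the `guided_json` extra field that vLLM's OpenAI-compatible server accepted around the time of this post; the field name may differ in your version, so check the documentation before relying on it. The model name and the schema are hypothetical, and the example only builds the payload without sending it.

```python
import json

# Hypothetical schema describing the object we want back.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

payload = {
    "model": "my-model",  # hypothetical model name
    "messages": [{"role": "user", "content": "Describe a user as JSON."}],
    # Assumption: `guided_json` is the extra request field carrying the schema.
    "guided_json": schema,
}

# Body to POST to the server's chat completions endpoint.
body = json.dumps(payload)
```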


  Note: in some cases, it can even improve the native decoding performance for LLM!


Previous limitations in vLLM

There are a few limitations in vLLM’s current support for the Outlines backend:


  Slow decoding: The FSM has to be constructed at the token level, meaning it can only transition the state one token per step. Therefore, it can only decode one token at a time, resulting in slow decoding.
  Batch processing bottlenecks: The implementation in vLLM relies heavily on logit processors⁷. As such, it sits on the critical path of the sampling process. In batching use cases, compiling the FSM per request and computing the mask synchronously means that all requests in a given batch get blocked, resulting in high time-to-first-token (TTFT) and lower throughput.
    
      We found that compiling the FSM is a relatively expensive task, making it a significant contributor to the increased TTFT.
    
  
  Performance issues with CFG mode: With the Outlines integration, while JSON mode is relatively fast, CFG mode runs significantly slower and can occasionally crash the engine.
  Limited advanced feature support: Techniques like jump-forward decoding are currently not possible with the logit-processor approach. They require prefilling a set of k next tokens, whereas logit processors can only deal with the next token.


Integration with XGrammar

XGrammar introduces a new technique that batches constrained decoding via a pushdown automaton (PDA). You can think of a PDA as a “collection of FSMs, and each FSM represents a context-free grammar (CFG).” One significant advantage of a PDA is its recursive nature, which allows us to execute multiple state transitions. XGrammar also includes additional optimisations (for those who are interested) to reduce grammar compilation overhead.
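To see why a stack matters, consider arbitrarily nested brackets, which no finite-state machine can track but a pushdown automaton handles naturally. The sketch below is a toy recognizer, not XGrammar's algorithm; `allowed_next` and `accepts` are hypothetical helpers. The set returned by `allowed_next` plays the role of the token mask at each step.

```python
CLOSERS = {"}": "{", "]": "["}


def allowed_next(stack):
    """Tokens a PDA would permit next, given the current stack."""
    allowed = {"{", "["}  # a new nested value may always be opened
    if stack:
        # Only the innermost open bracket may be closed.
        allowed.add("}" if stack[-1] == "{" else "]")
    return allowed


def accepts(tokens):
    """Return True if `tokens` is a sequence of properly nested brackets."""
    stack = []
    for tok in tokens:
        if tok not in allowed_next(stack):
            return False
        if tok in CLOSERS:
            stack.pop()        # closing bracket: return to the enclosing rule
        else:
            stack.append(tok)  # opening bracket: recurse into a nested rule
    return not stack           # accept only when every bracket is closed


nested_ok = accepts(list("{[{}]}"))
crossed = accepts(list("{[}]"))
unclosed = accepts(list("{{"))
```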

This advancement addresses limitation (1) by moving grammar compilation out of Python into C, utilising pthread. Additionally, XGrammar lays the groundwork for addressing limitation (4) in future releases. Below are performance comparisons between the XGrammar and Outlines backends:


  
  

courtesy of Michael Goin (Red Hat).



In vLLM’s v0 architecture, we’ve implemented XGrammar as a logit processor, optimizing it with caching for tokenizer data. While the performance improvements are encouraging, we believe there’s still significant room for optimization.

There are still a few usability concerns in the XGrammar v0 integration that must be addressed to reach feature parity across all use cases:


  It is yet to support grammars other than GBNF format (PR on vLLM: github)
  It is yet to support regex
  It is yet to support complex JSON that uses regex patterns or numeric ranges
    
      There are a few PRs trying to cover this usage. There was one bugfix PR on vLLM and one upstream
    
  



  vLLM now has basic support for XGrammar by default. In cases where we know XGrammar is insufficient to serve the request, we fall back to Outlines.

  Note that vLLM also includes support for lm-format-enforcer. However, from our testing we found that in some long-context test cases, lm-format-enforcer fails to enforce correct outputs and is not up to par with Outlines in terms of performance.


Tentative plans for v1

With the release of v1 on the horizon, we’re working on a tentative plan for structured decoding:


  Moving guided decoding towards scheduler-level:
    
      Reason: At the scheduler level, we have more context about which requests use structured decoding, so it shouldn’t block other requests within the batch (tentatively addressing limitation (2)). In a sense, this moves guided decoding outside of the critical path.
      This would allow for more natural vertical integration with jump-forward decoding (address limitation (4)).
    
  
  Allowing bit-mask calculation in one process instead of in each GPU worker
    
      Reason: We can broadcast this bit-mask to each GPU worker instead of repeating this process per GPU worker.
      We will carefully analyze the bandwidth implications of broadcasting masks for every sample of each request that uses guided decoding.
    
  
  Good baseline for speculative decoding and tool-use
    
      Reason: XGrammar includes plans to support tool-use, such that we can move away from Python’s tool parser.
      Tree scoring in speculative decoding can then use the same API as jump-forward decoding (which depends on the integration of guided decoding at the scheduler level).
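To give a feel for the bit-mask idea and its bandwidth cost, the sketch below packs the set of allowed token IDs into 32-bit words, one bit per vocabulary entry, and applies the mask to a logit vector. This layout is an assumption chosen for illustration; the post does not specify the mask format, and `pack_bitmask` and `apply_bitmask` are hypothetical helpers. The vocabulary size of 128,000 is likewise an example value.

```python
import math


def pack_bitmask(allowed_ids, vocab_size):
    """Pack the set of allowed token IDs into 32-bit words."""
    words = [0] * math.ceil(vocab_size / 32)
    for tid in allowed_ids:
        words[tid // 32] |= 1 << (tid % 32)
    return words


def apply_bitmask(logits, words):
    """Set the logits of disallowed tokens to -inf."""
    return [
        l if (words[i // 32] >> (i % 32)) & 1 else -math.inf
        for i, l in enumerate(logits)
    ]


vocab_size = 70
mask = pack_bitmask({0, 33, 69}, vocab_size)
masked = apply_bitmask([1.0] * vocab_size, mask)
kept = [i for i, l in enumerate(masked) if l != -math.inf]

# Size of one packed mask for a 128,000-token vocabulary, in bytes.
bytes_per_mask = math.ceil(128_000 / 32) * 4
```

Under these assumptions a single mask is about 16 KB, so broadcasting one mask per guided request per step is small compared with the logits themselves, but the cost still grows linearly with batch size.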
    
  


NOTE: If you have any more suggestions, we are more than happy to take them into consideration. Consider joining vLLM slack via #feat-structured-output.

Acknowledgements

We want to thank the vLLM team, XGrammar team, Aaron Pham (BentoML), Michael Goin (Red Hat), Chendi Xue (Intel), and Russell Bryant (Red Hat) for their valuable feedback and collaboration on bringing XGrammar to vLLM and the continuous effort to improve structured decoding in vLLM.

References


  Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473
  Haugeland, J. (1997). Mind Design II: Philosophy, Psychology, and Artificial Intelligence. The MIT Press. https://doi.org/10.7551/mitpress/4626.001.0001
  Hendler, J. (2008). Avoiding Another AI Winter. IEEE Intelligent Systems, 23(2), 2–4. https://doi.org/10.1109/MIS.2008.20
  Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation.
  Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361
  Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781
  Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. https://doi.org/10.1037/h0042519
  Rumelhart, D. E., McClelland, J. L., & Group, P. R. (1986). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations. The MIT Press. https://doi.org/10.7551/mitpress/5236.001.0001
  Shortliffe, E. H. (1974). MYCIN: A Rule-Based Computer Program for Advising Physicians Regarding Antimicrobial Therapy Selection (Technical Report STAN-CS-74-465). Stanford University.
  Statistical Machine Translation. (n.d.). IBM Models. Statistical Machine Translation Survey. http://www2.statmt.org/survey/Topic/IBMModels
  Turing, A. M. (1950). i.—Computing Machinery And Intelligence. Mind, LIX(236), 433–460. https://doi.org/10.1093/mind/LIX.236.433
  Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2023). Attention Is All You Need. arXiv preprint arXiv:1706.03762
  Willard, B. T., & Louf, R. (2023). Efficient Guided Generation for Large Language Models. arXiv preprint arXiv:2307.09702





  
    

      Allen Newell and Herbert Simon’s work at RAND initially showed that computers can simulate important aspects of intelligence.

      Another notable application was found in the medical domain (Haugeland, 1997). MYCIN, developed at Stanford University in the 1970s, diagnosed and recommended treatments for blood infections (Shortliffe, 1974). MYCIN’s developers recognized the importance of justifying recommendations, implementing what were known as “rule traces” to explain the system’s reasoning in human-understandable terms. ↩
    
    

      In the 1990s, IBM released a sequence of complex statistical models that were trained to perform machine translation tasks (Statistical Machine Translation, n.d.) (see also: this lecture from Cornell).

      In 2001, a bag-of-words (BoW) variant model was trained on 0.3B tokens and was considered SOTA at the time (Mikolov et al., 2013). These earlier works proved to the research community that statistical modelling triumphs over its symbolic counterpart for language processing, given that it can capture the general patterns of large corpora of text. ↩
    
    

      In 2017, the landmark paper “Attention Is All You Need” introduced the Transformer architecture (Vaswani et al., 2023) for neural machine translation tasks, which is based on the attention mechanism first proposed by Bahdanau et al. (2016).

      OpenAI then introduced the scaling law for neural language models (Kaplan et al., 2020), which sets off the race towards building these systems based on foundational language models. ↩
    
    

      Prior to attention-based transformers, seq-to-seq models used RNNs, given their ability to handle longer context lengths and their better memory. However, RNNs are more susceptible to vanishing/exploding gradients compared to feed-forward networks, and thus LSTM (Hochreiter & Schmidhuber, 1997) was proposed to solve this problem. Yet one of the main problems with LSTMs is that they tend to have poor recall of data they saw many steps ago.

      The Attention paper addresses this problem by encoding additional positional data into the inputs. The paper also proposed an encoder-decoder architecture for translation tasks; however, most text-generation models nowadays are decoder-only, given their superior performance on zero-shot tasks.

      One of the many reasons why attention-based transformers work better than LSTMs is that transformers are very scalable and hardware-aware (you can’t just arbitrarily add more LSTM blocks and hope for better long-term retention). For more information, please refer back to the original paper. ↩
    
    

      One might argue that we can reliably achieve this through few-shot prompting, e.g., “Give me a JSON that yields the address of users. Example output can be …”. However, there is no guarantee that the generated output is valid JSON. This is because these models are probabilistic systems: they “sample” the next result based on the distribution of the data they were trained on.

      One might also argue that one should use models fine-tuned specifically for JSON output in such cases. However, fine-tuning often requires extensive training and a lot more labor to curate data, monitor progress, and perform evaluation, which demands resources that not everyone can afford. ↩
    
    
      Note that the phrases “[structured/constrained/guided] decoding” are used interchangeably; they all refer to the same mechanism of “using a format for the model to structurally sample outputs.” ↩
    
    
      See this blog post from HuggingFace for using logit processors to control the generation process. ↩
    
  



Installing and Developing vLLM with Ease
2025-01-10T00:00:00+00:00
The field of LLM inference is advancing at an unprecedented pace. With new models and features emerging weekly, the traditional software release pipeline often struggles to keep up. At vLLM, we aim to provide more than just a software package. We’re building a system—a trusted, trackable, and participatory ecosystem for LLM inference. This blog post highlights how vLLM enables users to install and develop with ease while staying at the forefront of innovation.

TL;DR:


  Flexible and fast installation options from stable releases to nightly builds.
  Streamlined development workflow for both Python and C++/CUDA developers.
  Robust version tracking capabilities for production deployments.


Seamless Installation of vLLM Versions

Install Released Versions

We periodically release stable versions of vLLM to the Python Package Index, ensuring users can easily install them using standard Python package managers. For example:

pip install vllm


For those who prefer a faster package manager, uv has been gaining traction in the vLLM community. After setting up a Python environment with uv, installing vLLM is straightforward:

uv pip install vllm


Refer to the documentation for more details on setting up uv. Using a simple server-grade setup (Intel 8th Gen CPU), we observe that uv is roughly 200x faster than pip in wall-clock time:

# with cached packages, clean virtual environment
$ time pip install vllm
...
pip install vllm 59.09s user 3.82s system 83% cpu 1:15.68 total

# with cached packages, clean virtual environment
$ time uv pip install vllm
...
uv pip install vllm 0.17s user 0.57s system 193% cpu 0.383 total
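The speedup can be checked directly from the two "total" values in the timing output above. The `to_seconds` helper is a small utility written for this example; it parses the minutes:seconds form printed by the shell's `time` command.

```python
def to_seconds(stamp):
    """Convert a `time`-style total such as '1:15.68' or '0.383' to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


pip_total = to_seconds("1:15.68")  # wall-clock total reported for pip
uv_total = to_seconds("0.383")     # wall-clock total reported for uv
speedup = pip_total / uv_total
print(f"pip: {pip_total:.2f}s, uv: {uv_total:.3f}s, speedup: {speedup:.1f}x")
```

The ratio of wall-clock totals comes out just under 200x. Note that this measures installation with cached packages into a clean virtual environment, as stated in the comments of the transcript.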


Install the Latest vLLM from the Main Branch

To meet the community’s need for cutting-edge features and models, we provide nightly wheels for every commit on the main branch.

Using pip:

pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly


Adding --pre ensures pip includes pre-released versions in its search.

Using uv:

uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly


Development Made Simple

We understand that an active, engaged developer community is the backbone of innovation. That’s why vLLM offers smooth workflows for developers, regardless of whether they’re modifying Python code or working with kernels.

Python Developers

For Python developers who need to tweak and test vLLM’s Python code, there’s no need to compile kernels. This setup enables you to start development quickly.

git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install -e .


The VLLM_USE_PRECOMPILED=1 flag instructs the installer to use pre-compiled CUDA kernels instead of building them from source, significantly reducing installation time. This is perfect for developers focusing on Python-level features like API improvements, model support, or integration work.

This lightweight process runs efficiently, even on a laptop. Refer to our documentation for more advanced usage.

C++/Kernel Developers

For advanced contributors working with C++ code or CUDA kernels, we incorporate a compilation cache to minimize build time and streamline kernel development. Please check our documentation for more details.

Track Changes with Ease

The fast-evolving nature of LLM inference means interfaces and behaviors are still stabilizing. vLLM has been integrated into many workflows, including OpenRLHF, veRL, open_instruct, LLaMA-Factory, etc. We collaborate with these projects to stabilize interfaces and behaviors for LLM inference. To facilitate the process, we provide powerful tools for these advanced users to track changes across versions.

Installing a Specific Commit

To simplify tracking and testing, we provide wheels for every commit in the main branch. Users can easily install any specific commit, which can be particularly useful to bisect and track the changes.

We recommend using uv to install a specific commit:

# use full commit hash from the main branch
export VLLM_COMMIT=72d9c316d3f6ede485146fe5aabd4e61dbc59069
uv pip install vllm --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT}


In uv, packages in --extra-index-url have higher priority than the default index, which makes it possible to install a development version that predates the latest public release (at the time of writing, v0.6.6.post1).

In contrast, pip combines packages from --extra-index-url and the default index and chooses only the latest version, which makes it difficult to install a development version that predates the released version. Therefore, pip users need to specify a placeholder wheel name to install a specific commit:

# use full commit hash from the main branch
export VLLM_COMMIT=33f460b17a54acb3b6cc0b03f4a17876cff5eafd
pip install https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
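The difference between the two resolution behaviours described above can be modelled in a few lines. This is a deliberately simplified sketch, not the actual resolver logic of uv or pip: the version tuples are made-up toy data, and `resolve_first_index_wins` and `resolve_merge_all` are hypothetical names.

```python
# Toy data: versions are plain tuples so they compare naturally.
DEFAULT_INDEX = {"vllm": [(0, 6, 5), (0, 6, 6)]}  # public releases
EXTRA_INDEX = {"vllm": [(0, 6, 4)]}               # wheel built from an older commit


def resolve_first_index_wins(package, indexes):
    """uv-like behaviour as described above: the first index that has the package decides."""
    for index in indexes:
        if package in index:
            return max(index[package])
    raise LookupError(package)


def resolve_merge_all(package, indexes):
    """pip-like behaviour as described above: pool all candidates, take the newest."""
    candidates = [v for index in indexes for v in index.get(package, [])]
    if not candidates:
        raise LookupError(package)
    return max(candidates)


order = [EXTRA_INDEX, DEFAULT_INDEX]  # the extra index is consulted first
uv_choice = resolve_first_index_wins("vllm", order)
pip_choice = resolve_merge_all("vllm", order)
```

With the first strategy the older commit's wheel is selected; with the second, the newer public release always wins, which is why pip users have to point at the wheel URL directly.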


Conclusion

At vLLM, our commitment extends beyond delivering high-performance software. We’re building a system that empowers trust, enables transparent tracking of changes, and invites active participation. Together, we can shape the future of AI, pushing the boundaries of innovation while making it accessible to all.

For collaboration requests or inquiries, reach out at vllm-questions@lists.berkeley.edu. Join our growing community on GitHub or connect with us on the vLLM Slack. Together, let’s drive AI innovation forward.

Acknowledgments

We extend our gratitude to the uv community — particularly Charlie Marsh — for creating a fast, innovative package manager. Special thanks to Kevin Luu (Anyscale), Daniele Trifirò (Red Hat), and Michael Goin (Neural Magic) for their invaluable contributions to streamlining workflows. Kaichao You and Simon Mo from the UC Berkeley team lead these efforts.


vLLM 2024 Retrospective and 2025 Vision
2025-01-10T00:00:00+00:00
The vLLM community achieved remarkable growth in 2024, evolving from a specialized inference engine to become the de facto serving solution for the open-source AI ecosystem. This transformation is reflected in our growth metrics:


  GitHub stars grew from 14,000 to 32,600 (2.3x)
  Contributors expanded from 190 to 740 (3.8x)
  Monthly downloads surged from 6,000 to 27,000 (4.5x)
  GPU hours increased approximately 10x over the last six months
  Explore more usage data at https://2024.vllm.ai


vLLM has established itself as the leading open-source LLM serving and inference engine, with widespread adoption in production applications (e.g., powering Amazon Rufus and LinkedIn AI features). Our bi-monthly meetups have become strategic gatherings for partnerships with industry leaders like IBM, AWS, and NVIDIA, marking our progress toward becoming the universal serving solution for the open-source AI ecosystem. Read on for more details about vLLM’s 2024 achievements and 2025 roadmap!

This blog is based on the 16th session of the bi-weekly vLLM Office Hours. Watch the recording here.



2024 Achievements: Scaling Models, Hardware, and Features

Community Contributions and Growth


  

vLLM Main Contributor Groups (by Commits)



2024 was an exceptional year for vLLM! Our contribution community has expanded dramatically to include:


  15+ full-time contributors across 6+ organizations
  20+ active organizations as key stakeholders and sponsors
  Contributions from top institutions including UC Berkeley, Neural Magic, Anyscale, Roblox, IBM, AMD, Intel, and NVIDIA, as well as individual developers worldwide
  A thriving ecosystem connecting model creators, hardware vendors, and optimization developers
  Well-attended bi-weekly office hours facilitating transparency, community growth, and strategic partnerships


These numbers reflect more than growth—they demonstrate vLLM’s role as critical infrastructure in the AI ecosystem, supporting everything from research prototypes to production systems serving millions of users.

Expanding Model Support


  

Usage by Model Architecture in Serving



At the beginning of 2024, vLLM supported only a handful of models. By year’s end, the project had evolved to support performant inference for almost 100 model architectures, spanning nearly every prominent open-source large language model (LLM) as well as multimodal (image, audio, video), encoder-decoder, speculative decoding, classification, embedding, and reward models. Notably, vLLM introduced production support for state-space language models, exploring the future of non-transformer language models.

Broadening Hardware Compatibility


  

GPU Hours Breakdown by Hardware Vendor



From the initial hardware target of NVIDIA A100 GPUs, vLLM has expanded to support:


  NVIDIA GPUs: First-class optimizations for H100, with support for every NVIDIA GPU from V100 and newer.
  AMD GPUs: Support for MI200, MI300, and Radeon RX 7900 series - with rapidly growing adoption for MI300X.
  Google TPUs: Support for TPU v4, v5p, v5e, and the latest v6e.
  AWS Inferentia and Trainium: Support for trn1/inf2 instances.
  Intel Gaudi (HPU) and GPU (XPU): Leveraging Intel GPU and Gaudi architectures for AI workloads.
  CPUs: Featuring support for a growing list of ISAs - x86, ARM, and PowerPC.


vLLM’s hardware compatibility has broadened to address diverse user requirements while incorporating performance improvements. Importantly, vLLM is on the path to ensure that all models work on all hardware platforms, with all the optimizations enabled.

Delivering Key Features


  

Increasing Percentage of vLLM Deployments with Quantization



vLLM’s 2024 development roadmap emphasized performance, scalability, and usability:


  Weight and Activation Quantization: Added support for diverse quantization methods and kernels, enabling efficient inference across hardware platforms. Notable integrations include activation quantization for FP8+INT8, Marlin+Machete kernels for GPTQ/AWQ/wNa16, FP8 KV Cache, AQLM, QQQ, HQQ, bitsandbytes, and GGUF. Over 20% of vLLM deployments now use quantization.
  Automatic Prefix Caching: Reduced costs and improved latency for context-heavy applications.
  Chunked Prefill: Enhanced stability of inter-token latency for interactive applications.
  Speculative Decoding: Accelerated token generation through simultaneous token prediction and validation, supporting draft models, n-gram matching in prompts, and MLP speculators like Medusa or EAGLE.
  Structured Outputs: Provided high-performance capabilities for applications requiring specific formats like JSON or pydantic schemas.
  Tool Calling: Enabled models with supported chat templates to generate tool calls autonomously, facilitating data processing and agentic flows.
  Distributed Inference: Introduced pipeline parallelism and disaggregated prefill to effectively scale workloads across GPUs and nodes.




Our 2025 Vision

In 2025, we anticipate a significant push in the boundaries of scaling for both pretraining and inference-time scaling. We believe that open-source models are rapidly catching up to proprietary ones, and through distillation, these massive models are becoming smaller, more intelligent, and more practical for production deployment.

Emerging Model Capabilities: GPT-4o Class Models served on single node

Our vision is ambitious yet concrete: enabling GPT-4o level performance on a single GPU, GPT-4o on a single node, and next generation scale capabilities on a modest cluster. To achieve this, we’re focusing on three key optimization frontiers:


  
    KV cache and attention optimization with sliding windows, cross-layer attention, and native quantization
  
  
    MoE optimizations targeting architecture with shared experts and large numbers of fine-grained experts
  
  
    Extended long context support through alternative architectures like state space models
  


Beyond raw performance, we’re tailoring vLLM for specialized vertical applications. Each use case demands specific optimizations: reasoning applications need custom tokens and flexible reasoning steps, coding requires fill-in-the-middle capabilities and prompt lookup decoding, agent frameworks benefit from tree-based caching, and creative applications need diverse sampling strategies including beam search variants and contrastive decoding.

We’re also expanding vLLM’s role in the model training process. Recent adoption by prominent researchers like John Schulman signals our growing importance in post-training workflows. We’ll provide tight integration with data curation and post-training processes, making vLLM an essential tool across the full AI development lifecycle.

Practical Scale: Powering Thousands of Production Clusters

As LLMs become the backbone of modern applications, we envision vLLM powering thousands of production clusters running 24/7. These aren’t experimental deployments—they’re mission-critical systems handling constant traffic for product features, maintained by dedicated platform teams.

To support this scale, we’re making vLLM truly battery-included for production applications. Quantization, prefix caching, and speculative decoding will become default features rather than optional optimizations. Structured output generation will be standard rather than exceptional. We’re developing comprehensive recipes for routing, caching, and auto-scaling that span the full lifecycle of production deployments.

As deployments scale beyond single replicas, we’re creating stable interfaces for cluster-level solutions. This includes robust default configurations tuned for popular models and hardware platforms, along with flexible optimization paths for diverse use cases. We’re fostering a community dedicated to pushing the boundaries of vLLM efficiency, ensuring our platform evolves to meet new challenges.

Open Architecture: The Foundation of Our Future

The key to vLLM’s continued success lies in its open architecture. We’re shipping a ground-up rearchitecture with our V1 release that exemplifies this philosophy. Every component – from model architectures to scheduling policies, memory management to sampling strategies – is designed to be modified and extended in both research and private forks.

Our commitment to openness extends beyond just code. We’re introducing:


  
    Pluggable architectures for seamless integration of new models, hardware backends, and custom extensions
  
  
    First-class torch.compile support, enabling custom operation fusion passes and rapid experimentation
  
  
    A flexible component system that supports private extensions while maintaining core stability
  


We’re doubling down on community development, coordinating engineering efforts across organizations while celebrating ecosystem projects. This includes growing our core team through a clear recruitment process and organizational structure. The goal isn’t just to make vLLM the best choice technically – it’s to ensure that everyone who invests in vLLM finds themselves better off for having done so.

Our architecture is more than just a technical choice; it’s a commitment to creating a connected ecosystem through extensibility and modification rather than lock-in. By making vLLM both powerful and customizable, we ensure its place at the heart of the AI inference ecosystem.



A Bit of Reflection

As we reflect on vLLM’s journey, some key themes emerge that have shaped our growth and continue to guide our path forward.

Building Bridges in the AI Ecosystem

What started as an inference engine has evolved into something far more significant: a platform that bridges previously distinct worlds in the AI landscape. Model creators, hardware vendors, and optimization specialists have found in vLLM a unique amplifier for their contributions. When hardware teams develop new accelerators, vLLM provides immediate access to a broad application ecosystem. When researchers devise novel optimization techniques, vLLM offers a production-ready platform to demonstrate real-world impact. This virtuous cycle of contribution and amplification has become core to our identity, driving us to continuously improve the platform’s accessibility and extensibility.

Managing Growth While Maintaining Excellence

Our exponential growth in 2024 brought both opportunities and challenges. The rapid expansion of our codebase and contributor base created unprecedented velocity, enabling us to tackle ambitious technical challenges and respond quickly to community needs. However, this growth also increased the complexity of our codebase. Rather than allowing technical debt to accumulate, we made the decisive choice to invest in our foundation. The second half of 2024 saw us undertake an ambitious redesign of vLLM’s core architecture, culminating in what we now call our V1 architecture. This wasn’t just a technical refresh – it was a deliberate move to ensure that our platform remains maintainable and modular as we scale to meet the needs of an expanding AI ecosystem.

Pioneering a New Model of Open Source Development

Perhaps our most unique challenge has been building a world-class engineering organization through a network of sponsored volunteers. Unlike traditional open source projects that rely on funding from a single organization, vLLM is charting a different course. We’re creating a collaborative environment where multiple organizations contribute not just code, but resources and strategic direction. This model brings novel challenges in coordination, planning, and execution, but it also offers unprecedented opportunities for innovation and resilience. We’re learning – and sometimes inventing – best practices for everything from distributed decision-making to remote collaboration across organizational boundaries.

Our Unwavering Commitment

Through all these changes and challenges, our fundamental mission remains clear: building the world’s fastest and easiest-to-use open-source LLM inference and serving engine. We believe that by lowering the barriers to efficient AI inference, we can help make advanced AI applications more practical and accessible for everyone. This isn’t just about technical excellence – it’s about creating a foundation that enables the entire AI community to move forward faster, together.



Usage Data Collection

The metrics and insights throughout this post are powered by vLLM’s usage system, which collects anonymized deployment data. Each vLLM instance generates a UUID and reports technical metrics including:


  Hardware specs (GPU count/type, CPU architecture, available memory)
  Model configuration (architecture, dtype, tensor parallelism degree)
  Runtime settings (quantization type, prefix caching enabled)
  Deployment context (cloud provider, platform, vLLM version)


This telemetry helps prioritize optimizations for common hardware configurations and identify which features need performance improvements. The data is collected locally in ~/.config/vllm/usage_stats.json. Users can opt out by setting VLLM_NO_USAGE_STATS=1, DO_NOT_TRACK=1, or creating ~/.config/vllm/do_not_track. The implementation details and full schema are available in our usage stats documentation.
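The opt-out rules above can be sketched as a small check. This is an illustration only, not vLLM's actual implementation: the function name `usage_stats_enabled` and its parameters are hypothetical, and only the two environment variables and the marker file name come from the text.

```python
import tempfile
from pathlib import Path


def usage_stats_enabled(env, config_dir):
    """Return False if any of the three opt-out signals described above is present.

    Hypothetical helper for illustration: `env` is a mapping of environment
    variables and `config_dir` stands in for ~/.config/vllm.
    """
    if env.get("VLLM_NO_USAGE_STATS") == "1":
        return False
    if env.get("DO_NOT_TRACK") == "1":
        return False
    if (Path(config_dir) / "do_not_track").exists():
        return False
    return True


# Demonstrate with a temporary directory instead of the real config directory.
with tempfile.TemporaryDirectory() as tmp:
    default_on = usage_stats_enabled({}, tmp)
    env_opt_out = usage_stats_enabled({"DO_NOT_TRACK": "1"}, tmp)
    (Path(tmp) / "do_not_track").touch()  # create the marker file
    file_opt_out = usage_stats_enabled({}, tmp)

print(default_on, env_opt_out, file_opt_out)
```

Passing the environment and directory as arguments keeps the sketch free of side effects on a real installation.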



Join the Journey

vLLM’s 2024 journey demonstrates the transformative potential of open-source collaboration. With a clear vision for 2025, the project is poised to redefine AI inference, making it more accessible, scalable, and efficient. Whether through code contributions, attending vLLM Office Hours, or adopting vLLM in production, every participant helps shape the future of this fast-moving project.

As we enter 2025, we continue to encourage community participation through:


  Contributing Code: Help refine vLLM’s core functionality or extend its capabilities—many RFCs and features need additional support
  Providing Feedback: Share insights on features and use cases to shape vLLM’s roadmap via GitHub, Slack, Discord, or events
  Building with vLLM: Adopt the platform in your projects, develop your expertise, and share your experience


Join the vLLM Developer Slack to get mentored by project leaders and work at the forefront of AI inference innovation.

Together, we’ll advance open-source AI innovation in 2025!


Serving LLMs on AMD MI300X: Best Practices
2024-10-23T00:00:00+00:00
TL;DR: vLLM unlocks incredible performance on the AMD MI300X, achieving 1.5x higher throughput and 1.7x faster time-to-first-token (TTFT) than Text Generation Inference (TGI) for Llama 3.1 405B. It also achieves 1.8x higher throughput and 5.1x faster TTFT than TGI for Llama 3.1 70B. This guide explores 8 key vLLM settings to maximize efficiency, showing you how to leverage the power of open-source LLM inference on AMD. If you just want to see the optimal parameters, jump to the Quick Start Guide.





   



vLLM vs. TGI performance comparison for Llama 3.1 405B on 8 x MI300X (BF16, 32 QPS).






   



vLLM vs. TGI performance comparison for Llama 3.1 70B on 8 x MI300X (BF16, 32 QPS).


Introduction

Meta recently announced they’re running 100% of their live Llama 3.1 405B model traffic on AMD MI300X GPUs, showcasing the power and readiness of AMD’s ROCm platform for large language model (LLM) inference. This exciting news coincides with the release of ROCm 6.2, which brings significant improvements to vLLM support, making it easier than ever to harness the power of AMD GPUs for LLM inference.

ROCm, AMD’s answer to CUDA, might be less familiar to some, but it’s rapidly maturing as a robust and performant alternative.  With vLLM, harnessing this power is easier than ever.  We’ll show you how.

vLLM vs. TGI

vLLM unlocks incredible performance on the AMD MI300X, achieving 1.5x higher throughput and 1.7x faster time-to-first-token (TTFT) than Text Generation Inference (TGI) for Llama 3.1 405B. It also achieves 1.8x higher throughput and 5.1x faster TTFT than TGI for Llama 3.1 70B.

On Llama 3.1 405B, vLLM demonstrates significantly better performance compared to TGI in both time to first token (TTFT) and throughput across various query-per-second (QPS) scenarios. For TTFT, vLLM achieves approximately 3.8x faster response times on average compared to TGI at 16 QPS in the optimized configuration. Throughput-wise, vLLM consistently outperforms TGI, with the highest throughput of 5.76 requests/second on the ShareGPT dataset at 1000 QPS in the optimized setup, compared to TGI’s 3.55 requests/second.

Even in the default configuration, vLLM shows superior performance compared to TGI. For instance, at 16 QPS, vLLM’s default configuration achieves a throughput of 4.05 requests/second versus TGI’s 2.58 requests/second. This performance advantage is maintained across different QPS levels, highlighting vLLM’s efficiency in handling large language model inference tasks.










vLLM vs. TGI performance for Llama 3.1 405B on 8 x MI300X (BF16, QPS 16, 32, 1000; see Appendix for commands).


How to run vLLM with Optimal Performance

Key Settings and Configurations

We’ve been extensively testing various vLLM settings to identify optimal configurations for MI300X.  Here’s what we’ve learned:


  Chunked Prefill: The rule of thumb is to disable it for now on MI300X in most cases for better performance.
  Multi-Step Scheduling: Significant gains in GPU utilization and overall performance can be achieved with multi-step scheduling. Set the --num-scheduler-steps to a value between 10 and 15 to optimize GPU utilization and performance.
  Prefix Caching: Combining prefix caching with chunked prefill can enhance performance in specific scenarios. However, if user requests have a low prefix caching hit rate, it might be advisable to disable both chunked prefill and prefix caching.
  Graph Capture: When working with models that support long context lengths, set the --max-seq-len-to-capture to 16384. However, be aware that increasing this value doesn’t always guarantee performance improvements and may sometimes lead to degradation due to suboptimal bucket sizes.
  AMD-Specific Optimizations: Disabling NUMA balancing and tuning NCCL_MIN_NCHANNELS can yield further performance improvements.
  KV Cache Data Type: For optimal performance, use the default KV cache data type, which automatically matches the model’s data type.
  Tensor Parallelism: For throughput optimization, use the minimum tensor parallelism (TP) that accommodates the model weights and context, and run multiple vLLM instances. For latency optimization, set TP equal to the number of GPUs in a node.
  Maximum Number of Sequences: To optimize performance, increase --max-num-seqs to 512 or higher, based on your GPU’s memory and compute resources. This can significantly improve resource utilization and throughput, especially for models handling shorter inputs and outputs.
  Use CK Flash Attention: The CK Flash Attention implementation is a lot faster than the Triton implementation.


Detailed Analysis and Experiments

Case 1: Chunked Prefill

Chunked prefill is an experimental feature in vLLM that allows large prefill requests to be divided into smaller chunks batched together with decode requests. This improves system efficiency by overlapping compute-bound prefill requests with memory-bound decode requests. You can enable it by setting enable_chunked_prefill=True in the LLM constructor or using the --enable-chunked-prefill command line option.
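To make the batching idea concrete, here is a minimal sketch of how one long prompt might be sliced across steps. It assumes a fixed per-step token budget, a constant number of running decode requests (one token each per step), and that decodes are scheduled before prefill chunks. The function name and the numbers are hypothetical, and vLLM's real scheduler handles many more cases.

```python
def chunked_prefill_schedule(prompt_len, num_decode_requests, token_budget):
    """Return a list of (decode_tokens, prefill_tokens) pairs, one per step.

    Each step first reserves one token per running decode request, then
    fills the remaining budget with the next slice of the long prompt.
    """
    if num_decode_requests >= token_budget:
        raise ValueError("token budget leaves no room for prefill chunks")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget - num_decode_requests)
        steps.append((num_decode_requests, chunk))
        remaining -= chunk
    return steps


# Illustrative numbers: a 2000-token prompt arrives while 112 requests are
# decoding, with a per-step budget of 512 tokens.
steps = chunked_prefill_schedule(prompt_len=2000, num_decode_requests=112, token_budget=512)
for i, (decode_tokens, prefill_tokens) in enumerate(steps):
    print(f"step {i}: {decode_tokens} decode tokens + {prefill_tokens} prefill tokens")
```

With these numbers the prompt is processed in five steps of 400 prefill tokens each, and the decode requests keep producing a token at every step instead of waiting for the whole prefill.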

In our experiment, tuning the chunked prefill values gave a slight improvement over disabling chunked prefill altogether. However, if you’re not sure whether to enable chunked prefill, simply start by disabling it; you should generally see better performance than with the default settings. This is specific to MI300X GPUs.
















Case 2: Number of scheduler steps

Multi-step scheduling was introduced in vLLM v0.6.0, promising higher GPU utilization and better overall performance. As detailed in this blog post, the performance boost comes from performing scheduling and input preparation once and then running the model for a number of consecutive steps without interrupting the GPU. Spreading the CPU overhead across these steps reduces GPU idle time and improves performance.

To enable multi-step scheduling, set the --num-scheduler-steps argument to a number larger than 1, which is the default value. We found that multi-step scheduling gives diminishing returns as the value increases, so we stick with an upper bound of 15.
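A toy cost model shows why the returns diminish. It assumes each scheduling round costs a fixed amount of CPU time that is shared evenly by the GPU steps that follow. The function name and the millisecond figures are hypothetical and are not measurements from our benchmarks.

```python
def avg_step_time_ms(gpu_ms, cpu_overhead_ms, num_scheduler_steps):
    """Average wall time per decode step when one round of CPU scheduling
    is shared by `num_scheduler_steps` consecutive GPU steps (toy model)."""
    return gpu_ms + cpu_overhead_ms / num_scheduler_steps


# Illustrative numbers only.
GPU_MS = 20.0
CPU_MS = 10.0

times = {n: avg_step_time_ms(GPU_MS, CPU_MS, n) for n in (1, 5, 10, 15, 30)}
for n, t in times.items():
    print(f"steps={n:>2}  avg step time={t:.2f} ms  speedup vs 1 step={times[1] / t:.2f}x")
```

In this model, going from 1 to 10 steps removes 9 ms of the 10 ms overhead, while going from 15 to 30 recovers only about a third of a millisecond.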
















Case 3: Chunked Prefill and Prefix caching

Chunked Prefill and prefix caching are optimization techniques in vLLM that improve performance by breaking large prefills into smaller chunks for efficient batching and reusing cached KV (key-value) computations for shared prefixes across queries, respectively.

By default, vLLM will automatically enable the chunked prefill feature if a model has a context length of more than 32k tokens. The maximum number of tokens to be chunked for prefill is set to 512 by default.

Before we dive deep into the graph, we’ll first try to explain the terminology used in the experiment. Fresh Run refers to the situation where the prefix caching memory is not populated at all. 2nd Run refers to rerunning the benchmark script again after the Fresh Run. In general, when rerunning the ShareGPT benchmark dataset on the 2nd Run, we get around a 50% prefix caching hit-rate.
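The difference between a Fresh Run and a 2nd Run can be illustrated with a toy block-level prefix cache. This sketch assumes an unbounded cache keyed by the full token prefix of each block. The function name, block size, and token lists are hypothetical, and because nothing is ever evicted here, the second pass hits 100% rather than the roughly 50% we measured on ShareGPT.

```python
def prefix_cache_hit_rate(requests, cache, block_size=4):
    """Process token-id requests against a set-based prefix cache.

    Each full block is keyed by the entire prefix up to and including it,
    so a block only hits if everything before it also matches.
    Returns the fraction of full blocks that were already cached.
    """
    hits = total = 0
    for tokens in requests:
        for end in range(block_size, len(tokens) + 1, block_size):
            key = tuple(tokens[:end])
            total += 1
            if key in cache:
                hits += 1
            else:
                cache.add(key)
    return hits / total if total else 0.0


requests = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1, 2, 3, 4, 9, 9, 9, 9],  # shares its first block with the request above
    [7, 7, 7, 7],
]
cache = set()
fresh_run = prefix_cache_hit_rate(requests, cache)   # cache starts empty
second_run = prefix_cache_hit_rate(requests, cache)  # cache already populated
print(f"fresh run: {fresh_run:.0%}, 2nd run: {second_run:.0%}")
```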

Looking at the graphs below, we can make three observations about this experiment.

  Based on the comparison of Bar 2 (red) with the baseline (blue), there is a huge gain in performance.
  Based on the comparison of Bar 3 (yellow), Bar 5 (orange) and Bar 6 (teal) with the baseline, the chunked prefill performance depends on the user request input prompt length distribution.
  In our experiments, the prefix caching hit rates of Bar 3 (yellow) and Bar 4 (green) are around 0.9% and 50%, respectively. Comparing Bar 3 (yellow) and Bar 4 (green) with the baseline and Bar 2 (red) suggests that if user requests do not have a high prefix caching hit rate, disabling both chunked prefill and prefix caching is a good rule of thumb.

















Case 4: Max sequence length to capture

The --max-seq-len-to-capture argument in vLLM controls the maximum sequence length that can be handled by CUDA/HIP graphs, which optimize performance by capturing and replaying GPU operations. If a sequence exceeds this length, the system reverts to eager mode, executing operations one by one, which can be less efficient. This applies to both regular and encoder-decoder models.

Our benchmarks reveal an interesting trend: increasing --max-seq-len-to-capture doesn’t always improve performance and can sometimes even degrade it. This might be due to how vLLM creates buckets for different sequence lengths.

Here’s why:

  Bucketing: vLLM uses buckets to group sequences of similar lengths, optimizing graph capture for each bucket.
  Optimal Buckets: Initially, the buckets are finely grained (e.g., [4, 8, 12,…, 2048, 4096]), allowing for efficient graph capture for various sequence lengths.
  Coarser Buckets: Increasing --max-seq-len-to-capture can lead to coarser buckets (e.g., [4, 8, 12, 2048, 8192]).
  Performance Impact: When input sequences fall into these larger, less precise buckets, the captured CUDA/HIP graphs may not be optimal, potentially leading to reduced performance.


Therefore, while capturing longer sequences with CUDA/HIP graphs seems beneficial, it’s crucial to consider the potential impact on bucketing and overall performance. Finding the optimal --max-seq-len-to-capture value may require experimentation to balance graph capture efficiency with appropriate bucket sizes for your specific workload.
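The bucketing hypothesis can be illustrated with a toy model in which each sequence is rounded up to the smallest bucket that fits it. The bucket lists and helper names below are hypothetical and chosen only to contrast fine and coarse spacing; they are not vLLM's actual bucket values.

```python
import bisect


def pick_bucket(seq_len, buckets):
    """Return the smallest bucket >= seq_len, or None if the sequence is
    longer than every bucket (the eager-mode fallback described above)."""
    i = bisect.bisect_left(buckets, seq_len)
    return buckets[i] if i < len(buckets) else None


def padding_waste(seq_len, buckets):
    """Fraction of the chosen bucket that is padding rather than real tokens."""
    bucket = pick_bucket(seq_len, buckets)
    return None if bucket is None else (bucket - seq_len) / bucket


# Hypothetical sorted bucket lists.
fine = [256, 512, 1024, 2048, 4096]
coarse = [256, 2048, 8192]

for seq_len in (600, 3000, 6000):
    print(seq_len, pick_bucket(seq_len, fine), pick_bucket(seq_len, coarse))
```

In this model the coarse list covers a 6000-token sequence that the fine list cannot, but a 600-token sequence lands in a 2048 bucket instead of a 1024 one, so most of the captured shape is padding.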
















Case 5: AMD Recommended Environment Variables

To further optimize vLLM performance on AMD MI300X, we can leverage AMD-specific environment variables.


  Disabling NUMA Balancing: Non-Uniform Memory Access (NUMA) balancing can sometimes hinder GPU performance. As recommended in the AMD MAD repository, disabling it can prevent potential GPU hangs and improve overall efficiency. This can be achieved with the following command:
  # disable automatic NUMA balancing
  sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
  # check if NUMA balancing is disabled (returns 0 if disabled)
  cat /proc/sys/kernel/numa_balancing
  0
    
  
  Tuning NCCL Communication: The NVIDIA Collective Communications Library (NCCL) is used for inter-GPU communication. For MI300X, the AMD vLLM fork performance document suggests setting the NCCL_MIN_NCHANNELS environment variable to 112 to potentially enhance performance.


In our tests, enabling these two configurations yielded a slight performance improvement. This aligns with the findings in the “NanoFlow: Towards Optimal Large Language Model Serving Throughput” paper, which indicates that while optimizing network communication is beneficial, the impact might be limited since LLM inference is primarily dominated by compute-bound and memory-bound operations.

Even though the gains might be small, fine-tuning these environment variables can contribute to squeezing out the maximum performance from your AMD system.
















Case 6: KVCache Type Auto/FP8

By default, vLLM automatically allocates a KV cache type that matches the model’s data type. However, vLLM also supports native FP8 on MI300X, which we can exploit to reduce the memory requirement of the KV cache and thereby increase the deployable context length of the model.

We compare the Auto KV cache type with the FP8 KV cache type. The figure below shows that the Auto KV cache type (red) achieves a higher requests-per-second rate than the FP8 KV cache type (yellow). In theory, this might be due to quantization overhead with the Llama-3.1-70B-Instruct (bfloat16) model, but since the overhead seems to be small, FP8 could still be a good tradeoff in some cases in exchange for a large reduction in KV cache memory.
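The size of the memory saving can be estimated with simple arithmetic. The sketch below assumes the Llama-3.1-70B shape of 80 layers, 8 KV heads, and a head dimension of 128; these values are assumptions taken from the published model configuration and should be checked against the model's config.json. The helper name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache stored for one token: a key and a value vector
    per KV head in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Assumed Llama-3.1-70B shape; verify against the model's config.json.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

bf16 = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_element=2)
fp8 = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_element=1)

print(f"BF16: {bf16 / 1024:.0f} KiB per token")
print(f"FP8:  {fp8 / 1024:.0f} KiB per token")
print(f"Tokens per GiB of cache: BF16 {2**30 // bf16}, FP8 {2**30 // fp8}")
```

Under these assumptions, FP8 halves the per-token footprint from 320 KiB to 160 KiB, which doubles the number of tokens that fit in the same cache budget.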
















Case 7: Performance Difference between TP 4 and TP 8

Tensor parallelism is a technique for distributing the computational load of large models. It works by splitting individual tensors across multiple devices, allowing for parallel processing of specific operations or layers. This approach reduces the memory footprint of the model and enables scaling across multiple GPUs.

While increasing the tensor parallelism degree can improve performance by providing more compute resources, the gains aren’t always linear. This is because communication overhead increases as more devices are involved, and the workload on each individual GPU decreases. Given the substantial processing power of the MI300X, smaller workloads per GPU can actually lead to underutilization, further hindering performance scaling.

Therefore, when optimizing for throughput, we recommend launching multiple instances of vLLM instead of aggressively increasing tensor parallelism. This approach tends to yield more linear performance improvements. However, if minimizing latency is the priority, increasing the tensor parallelism degree may be the more effective strategy.
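As a rough planning aid, the minimum tensor parallelism can be estimated from weight size alone. The sketch below is a simplification with several assumptions: 192 GB of memory per MI300X, BF16 weights at 2 bytes per parameter, power-of-two TP degrees, a hypothetical 20% of each GPU reserved for KV cache, and no accounting for activation memory. The function name is hypothetical.

```python
def min_tensor_parallel(weights_gb, gpu_mem_gb, kv_headroom=0.2, gpus_per_node=8):
    """Smallest power-of-two TP degree whose combined memory holds the weights
    while leaving `kv_headroom` of each GPU free for KV cache.
    Returns None if even a full node is not enough."""
    tp = 1
    while tp <= gpus_per_node:
        usable_gb = tp * gpu_mem_gb * (1 - kv_headroom)
        if weights_gb <= usable_gb:
            return tp
        tp *= 2
    return None


GPU_MEM_GB = 192  # assumed per-GPU memory for MI300X
for name, params_billion in (("70B", 70), ("405B", 405)):
    weights_gb = params_billion * 2  # BF16: 2 bytes per parameter
    tp = min_tensor_parallel(weights_gb, GPU_MEM_GB)
    print(f"{name}: min TP = {tp}, instances per 8-GPU node = {8 // tp}")
```

This only gives a lower bound for throughput-oriented deployments. Long contexts need more KV cache headroom, and latency-oriented deployments may still prefer a higher TP degree.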
















Case 8: Effect of Maximum Number of (Parallel) Sequences

The --max-num-seqs argument specifies the maximum number of sequences that can be processed per iteration. This parameter controls the number of concurrent requests in a batch, impacting memory usage and performance. In the ShareGPT benchmark, due to the shorter input and output length of the samples, the Llama-3.1-70B-Instruct hosted on MI300X can process a large number of requests per iteration. In our experiment, --max-num-seqs remains a limiting factor even when it is set to 1024.
















Quick Start Guide
If you are not sure about the deployment setting and the distribution of the user requests, you could:


  Use CK Flash Attention (though we don’t show it here, the CK Flash Attention implementation is a lot faster than its Triton counterpart)
    
      export VLLM_USE_TRITON_FLASH_ATTN=0
    
  
  Disable chunked prefill with --enable-chunked-prefill=False
  Disable prefix caching
  If the model supports long context length, set the --max-seq-len-to-capture to 16384
  Set --num-scheduler-steps to 10 or 15.
  Set the AMD environment:
    
      sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' 
      export NCCL_MIN_NCHANNELS=112
    
  
  Increase --max-num-seqs to 512 and above, depending on the GPU memory and compute resource of the GPUs.


VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve meta-llama/Llama-3.1-70B-Instruct --host 0.0.0.0 --port 8000 -tp 4 --max-num-seqs 1024 --max-seq-len-to-capture 16384 --served-model-name meta-llama/Llama-3.1-70B-Instruct --enable-chunked-prefill=False --num-scheduler-steps 15


For quick setup, we have built a Docker image of vLLM 0.6.2 (commit: cb3b2b9ba4a95c413a879e30e2b8674187519a93) and pushed it to the GitHub Container Registry.
To download the image:
# v0.6.2 post
docker pull ghcr.io/embeddedllm/vllm-rocm:cb3b2b9
# P.S. We also have compiled the image for v0.6.3.post1 at commit 717a5f8
docker pull ghcr.io/embeddedllm/vllm-rocm:v0.6.3.post1-717a5f8


To launch a docker container with the image run:
# The -v mount is only needed if you have pre-downloaded the model weights; otherwise remove that line.
sudo docker run -it \
   --network=host \
   --group-add=video \
   --ipc=host \
   --cap-add=SYS_PTRACE \
   --security-opt seccomp=unconfined \
   --device /dev/kfd \
   --device /dev/dri \
   -v /path/to/hfmodels:/app/model \
   ghcr.io/embeddedllm/vllm-rocm:cb3b2b9 \
   bash


Now launch the LLM server with the parameters that we have found:

VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve meta-llama/Llama-3.1-70B-Instruct --host 0.0.0.0 --port 8000 -tp 4 --max-num-seqs 1024 --max-seq-len-to-capture 16384 --served-model-name meta-llama/Llama-3.1-70B-Instruct --enable-chunked-prefill=False --num-scheduler-steps 15


Conclusion
This guide has explored the power of vLLM for serving large language models on AMD MI300X GPUs. By meticulously tuning key settings like chunked prefill, multi-step scheduling, and CUDA graph capture, we’ve demonstrated how to achieve substantial performance gains over standard configurations and alternative serving solutions. vLLM unlocks significantly higher throughput and faster response times, making it an ideal choice for deploying LLMs on AMD hardware.

However, it’s important to acknowledge that our exploration has focused primarily on general chatbot usage with short inputs and outputs. Further investigation is needed to optimize vLLM for specific use cases like summarization or long-form content generation. Additionally, a deeper dive into the performance differences between Triton and CK attention kernels could yield further insights.

We also want to acknowledge this wonderful blog post by Leonard Lin on how to further optimize vLLM for MI300X, including hipBLAS vs hipBLASLt, CK Flash Attention vs Triton Flash Attention, Tensor Parallelism vs Pipeline Parallelism, etc.

Acknowledgements
This blog post was drafted by the team at Embedded LLM. Thank you to Hot Aisle Inc. for sponsoring the MI300X server used for benchmarking vLLM.

Appendix

Server Specification

The following is the configuration of the amazing Hot Aisle server:

  CPU: 2 x Intel Xeon Platinum 8470
  GPU: 8 x AMD Instinct MI300X Accelerators

The model and software that we used in the benchmark are as follows:

  Model: meta-llama/Llama-3.1-405B-Instruct and meta-llama/Llama-3.1-70B-Instruct
  vLLM (v0.6.2): vllm-project/vllm on GitHub, commit: cb3b2b9ba4a95c413a879e30e2b8674187519a93
  Dataset: ShareGPT
  Benchmark script: benchmarks/benchmark_serving.py in the repository


We have built the ROCm compatible vLLM docker from Dockerfile.rocm found in the repository (we have pushed the docker image of the vLLM version that we have used to run our benchmark. Get it by docker pull ghcr.io/embeddedllm/vllm-rocm:cb3b2b9).
All of the benchmarks are run in the docker container instance, and are run with 4 MI300X GPUs using CK Flash Attention with VLLM_USE_TRITON_FLASH_ATTN=0.

Detailed Benchmark Configuration


  Configuration: vLLM Default Configuration
  Command: VLLM_RPC_TIMEOUT=30000 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve Llama-3.1-405B-Instruct -tp 8 --max-num-seqs 1024 --max-num-batched-tokens 1024

  Configuration: TGI Default Configuration
  Command: ROCM_USE_FLASH_ATTN_V2_TRITON=false TRUST_REMOTE_CODE=true text-generation-launcher --num-shard 8 --sharded true --max-concurrent-requests 1024 --model-id Llama-3.1-405B-Instruct

  Configuration: vLLM (This Guide)
  Command: VLLM_RPC_TIMEOUT=30000 VLLM_USE_TRITON_FLASH_ATTN=0 vllm serve Llama-3.1-405B-Instruct -tp 8 --max-seq-len-to-capture 16384 --enable-chunked-prefill=False --num-scheduler-steps 15 --max-num-seqs 1024

  Configuration: TGI (This Guide)
  Command: ROCM_USE_FLASH_ATTN_V2_TRITON=false TRUST_REMOTE_CODE=true text-generation-launcher --num-shard 8 --sharded true --max-concurrent-requests 1024 --max-total-tokens 131072 --max-input-tokens 131000 --model-id Llama-3.1-405B-Instruct



How Speculative Decoding Boosts vLLM Performance by up to 2.8x
2024-10-17T00:00:00+00:00
Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down speculative decoding in vLLM, how it works, and the performance improvements it brings.

This content is based on a session from our bi-weekly vLLM Office Hours, where we discuss techniques and updates to optimize vLLM performance. You can view the session slides here. If you prefer watching, you can view the full recording on YouTube. We’d love to see you attend future sessions - please register!

An Introduction to Speculative Decoding

Speculative decoding (Leviathan et al., 2023) is a key technique in reducing latency during token generation in large language models (LLMs). This approach leverages smaller models to handle simpler token predictions while utilizing larger models to verify or adjust those predictions. By doing this, speculative decoding accelerates generation without sacrificing accuracy, making it a lossless yet highly efficient method for optimizing LLM performance.

Why can speculative decoding reduce latency? Traditionally, LLMs generate tokens one at a time in an autoregressive manner. For example, given a prompt, the model generates three tokens T1, T2, T3, each requiring a separate forward pass. Speculative decoding transforms this process by allowing multiple tokens to be proposed and verified in one forward pass.

Here’s how the process works:


  Draft Model: A smaller, more efficient model proposes tokens one by one.
  Target Model Verification: The larger model verifies these tokens in a single forward pass. It confirms correct tokens and corrects any incorrect ones.
  Multiple Tokens in One Pass: Instead of generating one token per pass, this method processes multiple tokens simultaneously, reducing latency.







As shown in the picture above, the draft model proposes five tokens: ["I", "like", "cooking", "and", "traveling"]. These are then forwarded to the target model for parallel verification. In this example, the target model rejects the third token, "cooking", and replaces it with "playing"; the tokens proposed after the rejected position are discarded. As a result, only the first three tokens, ["I", "like", "playing"], are generated in this step.
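The accept-and-correct step in this example can be sketched in a few lines of Python. This is a minimal illustration under greedy decoding, not vLLM's implementation: `verify_greedy`, `toy_target`, and `TARGET_SENTENCE` are hypothetical names, and the toy target is queried position by position, whereas a real engine scores all proposed positions in a single batched forward pass.

```python
from typing import Callable, List, Sequence


def verify_greedy(
    context: Sequence[str],
    draft_tokens: Sequence[str],
    target_next_token: Callable[[Sequence[str]], str],
) -> List[str]:
    """Return the tokens emitted in one speculative step under greedy decoding."""
    emitted: List[str] = []
    for proposed in draft_tokens:
        # What the target model itself would generate at this position.
        expected = target_next_token(list(context) + emitted)
        emitted.append(expected)
        if expected != proposed:
            # First mismatch: keep the target's correction, discard the rest.
            return emitted
    # Every proposal was accepted, so the target's pass yields one extra token.
    emitted.append(target_next_token(list(context) + emitted))
    return emitted


# Hypothetical stand-in for the target model: it always continues a fixed sentence.
TARGET_SENTENCE = ["I", "like", "playing", "piano", "and", "singing"]


def toy_target(prefix: Sequence[str]) -> str:
    return TARGET_SENTENCE[len(prefix)]


draft = ["I", "like", "cooking", "and", "traveling"]
print(verify_greedy([], draft, toy_target))  # ['I', 'like', 'playing']
```

Because every emitted token is the one the target model would have produced anyway, the output matches plain greedy decoding; only the number of target passes changes.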


By using this approach, speculative decoding speeds up token generation, making it an effective method for both small-scale and large-scale language model deployments.

How Speculative Decoding Works in vLLM

In vLLM, speculative decoding is integrated with the system’s continuous batching architecture, where different requests are processed together in a single batch, enabling higher throughput. vLLM uses two key components to implement this:


  Draft Runner: This runner is responsible for executing the smaller model to propose candidate tokens.
  Target Runner: The target runner verifies the tokens by running the larger model.


vLLM’s system is optimized to handle this process efficiently, allowing speculative decoding to work seamlessly with continuous batching, which increases the overall system performance.






Diagram illustrating how the draft and target runners interact within the vLLM batching system.


To implement speculative decoding in vLLM, two crucial components had to be modified:


  Scheduler: The scheduler was adjusted to handle multiple token slots within a single forward pass, enabling the simultaneous generation and verification of several tokens.
  Memory Manager: The memory manager now handles the KV cache for both the draft and target models, ensuring smooth processing during speculative decoding.







System architecture of speculative decoding in vLLM.


Types of Speculative Decoding Supported in vLLM

vLLM supports three types of speculative decoding, each tailored to different workloads and performance needs:

Draft Model-Based Speculative Decoding







This is the most commonly used form of speculative decoding, where a smaller model predicts the next tokens, and a larger model verifies them. A common example would be using a Llama 68M model to predict tokens for a Llama 2 70B model. This approach requires careful selection of the draft model to balance accuracy and overhead.

Choosing the correct draft model is essential for maximizing the efficiency of speculative decoding. The draft model needs to be small enough to avoid creating significant overhead but still accurate enough to provide a meaningful performance boost.

However, selecting the right draft model can be challenging. For example, for models like Llama 3, finding a suitable draft model is difficult due to differences in vocabulary size. Speculative decoding requires that the draft and target models share the same vocabulary, and in some cases this limits its use. Therefore, in the following sections, we introduce several draft-model-free speculative decoding methods.

Prompt Lookup Decoding






An example of prompt lookup decoding. Given the prompt, we build all 2-grams as lookup keys. The value for each key is the three tokens that follow it in the prompt. During generation, we check whether the most recent 2-gram matches any key. If so, we propose the tokens stored as its value.


Otherwise known as n-gram matching, this approach is effective for use cases like summarization and question-answering, where there is a significant overlap between the prompt and the answer. Instead of using a small model to propose tokens, the system speculates based on the information already available in the prompt. This works particularly well when the large model repeats parts of the prompt in its answers.
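The lookup described in the caption above can be sketched as follows. This is a simplified illustration rather than vLLM's implementation: `build_ngram_table` and `propose_from_prompt` are hypothetical helper names, tokens are plain words instead of token IDs, and only the first occurrence of each 2-gram is kept.

```python
from typing import Dict, List, Sequence, Tuple

NgramTable = Dict[Tuple[str, ...], List[str]]


def build_ngram_table(
    prompt_tokens: Sequence[str], n: int = 2, num_following: int = 3
) -> NgramTable:
    """Map every n-gram in the prompt to the tokens that follow it."""
    table: NgramTable = {}
    # Stop early enough that each key has at least one following token.
    for i in range(len(prompt_tokens) - n):
        key = tuple(prompt_tokens[i : i + n])
        following = list(prompt_tokens[i + n : i + n + num_following])
        table.setdefault(key, following)  # keep the first occurrence only
    return table


def propose_from_prompt(
    table: NgramTable, generated: Sequence[str], n: int = 2
) -> List[str]:
    """Propose tokens if the last n generated tokens appear in the prompt."""
    if len(generated) < n:
        return []
    return table.get(tuple(generated[-n:]), [])


prompt = "the quick brown fox jumps over the lazy dog".split()
table = build_ngram_table(prompt)
print(propose_from_prompt(table, ["a", "quick", "brown"]))  # ['fox', 'jumps', 'over']
print(propose_from_prompt(table, ["hello", "world"]))  # []
```

The proposal costs only a dictionary lookup, which is why this method adds very little overhead when the answer does repeat parts of the prompt.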

Medusa/Eagle/MLPSpeculator





Picture from https://github.com/FasterDecoding/Medusa.
In the example, three heads are used to propose tokens for the following three positions. Head 1 is proposing ["is", "'", "the"] for the first position. Head 2 is proposing ["difficult", "is", "'"] for the second position. Head 3 is proposing ["not", "difficult", "a"] for the third position. All heads take the output of the last transformer block as the input.


In this method, additional layers (or heads) are added to the large model itself, allowing it to predict multiple tokens in a single forward pass. This reduces the need for a separate draft model, instead leveraging the large model’s own capacity for parallel token generation. Though preliminary, this method shows promise for improving efficiency as more optimized kernels are developed.



Speculative Decoding Performance Insights: Speedups and Trade-offs

Speculative decoding offers significant performance benefits in low-QPS (queries per second) environments. For example, in testing on the ShareGPT dataset, vLLM demonstrated up to a 1.5x speedup in token generation when using draft model-based speculative decoding. Similarly, prompt lookup decoding has shown speedups of up to 2.8x when applied to summarization datasets, such as CNN/DailyMail.






   



Performance comparison showing speculative decoding delivering up to a 1.5x speedup at QPS=1 for Llama3-70B on ShareGPT with 4xH100 using a draft model (turboderp/Qwama-0.5B-Instruct), and up to a 2.8x speedup at QPS=1 for Llama3-70B on CNN/DailyMail with 4xH100 using n-grams.


However, in high-QPS environments, speculative decoding may introduce performance trade-offs. The extra compute required to propose and verify tokens can sometimes slow down the system when it is already compute-bound, as seen when the number of requests per second increases. In such cases, the overhead of speculative decoding can outweigh its benefits, leading to reduced performance.
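One way to reason about this trade-off is to estimate how many tokens a single target forward pass is expected to yield. The sketch below uses the closed form from Leviathan et al. (2023) under the simplifying assumption that each drafted token is accepted independently with the same probability; `expected_tokens_per_step` is a hypothetical helper name, and real acceptance rates vary by position and workload.

```python
def expected_tokens_per_step(
    acceptance_rate: float, num_speculative_tokens: int
) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes every drafted token is accepted independently with probability
    `acceptance_rate`, giving (1 - a**(k + 1)) / (1 - a) for k drafted tokens.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_speculative_tokens < 0:
        raise ValueError("num_speculative_tokens must be non-negative")
    if acceptance_rate == 1.0:
        # All k proposals are accepted, plus one extra token from the target.
        return float(num_speculative_tokens + 1)
    return (1.0 - acceptance_rate ** (num_speculative_tokens + 1)) / (
        1.0 - acceptance_rate
    )


for rate in (0.3, 0.6, 0.8):
    print(rate, round(expected_tokens_per_step(rate, 5), 2))
```

The estimate is bounded by the number of drafted tokens plus one, while the cost of drafting and verifying grows with every extra proposed token. When the GPUs are already compute-bound, that extra work competes with other requests in the batch, which is consistent with the slowdown described above.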






At high QPS, we see a 1.4x slowdown for Llama3-70B on ShareGPT with 4xH100 and a 1.8x slowdown for Llama3-70B on CNN/DailyMail with 4xH100.


On the Roadmap: Dynamic Adjustments for Better Performance

To overcome the limitations of speculative decoding in high-QPS settings, vLLM is working on implementing dynamic speculative decoding, which is also one of the active research directions in vLLM; feel free to check out the paper for more detail. This feature will allow vLLM to adjust the number of speculative tokens based on system load and the accuracy of the draft model. At a high level, dynamic speculative decoding shortens the proposed length when system load is high. However, the reduction is less pronounced when the average token acceptance rate is high, as shown in the picture below.








In the future, the system will be able to automatically modify the degree of speculation at each step, ensuring speculative decoding is always beneficial, regardless of the workload. This will allow users to activate speculative decoding without worrying about whether it will slow down their system.
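To make the intended behavior concrete, here is a purely hypothetical heuristic with the two properties described above: the proposal length shrinks as load rises, and it shrinks less when the acceptance rate is high. It is not the algorithm from the paper or from vLLM; the function name, the normalized `load` input, and the linear formula are all assumptions made for illustration.

```python
def choose_num_speculative_tokens(
    max_tokens: int, load: float, acceptance_rate: float
) -> int:
    """Pick a proposal length from system load and draft accuracy.

    `load` is a hypothetical utilization signal normalized to [0, 1], and
    `acceptance_rate` is the recent average token acceptance rate in [0, 1].
    """
    if not (0.0 <= load <= 1.0 and 0.0 <= acceptance_rate <= 1.0):
        raise ValueError("load and acceptance_rate must be in [0, 1]")
    # Higher load shrinks the proposal; a high acceptance rate damps the shrink.
    shrink = load * (1.0 - acceptance_rate)
    return max(0, int(max_tokens * (1.0 - shrink)))


for load in (0.0, 0.5, 1.0):
    print(load, [choose_num_speculative_tokens(5, load, a) for a in (0.3, 0.9)])
```

A result of 0 corresponds to turning speculation off for that step, which is the limiting case when the system is saturated and the draft model is rarely right.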

How to Use Speculative Decoding in vLLM

Setting up speculative decoding in vLLM is straightforward. When launching the vLLM server, you simply need to include the necessary flags to specify the speculative model, the number of speculative tokens, and the tensor parallel size.

The following code configures vLLM in offline mode to use speculative decoding with a draft model, speculating 5 tokens at a time:

from vllm import LLM

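llm = LLM(
    model="facebook/opt-6.7b",  # target model that verifies the proposals
    speculative_model="facebook/opt-125m",  # smaller draft model that proposes tokens
    num_speculative_tokens=5,  # number of tokens proposed per step
)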
outputs = llm.generate("The future of AI is")

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


The following code configures vLLM to use speculative decoding where proposals are generated by matching n-grams in the prompt:

from vllm import LLM

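llm = LLM(
    model="facebook/opt-6.7b",
    speculative_model="[ngram]",  # propose by n-gram matching instead of a draft model
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,  # longest n-gram window to match
    ngram_prompt_lookup_min=1,  # shortest n-gram window to match
)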
outputs = llm.generate("The future of AI is")

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


At times, you may want the draft model to operate with a different tensor parallel size than the target model to improve efficiency. This lets the draft model use fewer resources and incur less communication overhead, leaving the more resource-intensive computations to the target model. In vLLM, you can configure the draft model to use a tensor parallel size of 1 while the target model uses a size of 4, as demonstrated in the example below.

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,  # target model is sharded across 4 GPUs
    speculative_model="ibm-fms/llama3-70b-accelerator",
    speculative_draft_tensor_parallel_size=1,  # draft model runs on a single GPU
)
outputs = llm.generate("The future of AI is")

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")




Future updates (paper, RFC) will allow vLLM to automatically choose the number of speculative tokens, removing the need for manual configuration and simplifying the process even further.

Follow our docs on Speculative Decoding in vLLM to get started. Join our bi-weekly office hours to ask questions and give feedback.

Conclusion: The Future of Speculative Decoding in vLLM

Speculative decoding in vLLM delivers substantial performance improvements, especially in low-QPS environments. As dynamic adjustments are introduced, it will become a highly effective tool even in high-QPS settings, making it a versatile and essential feature for reducing latency and increasing efficiency in LLM inference.

