
Deploying vLLM on Linux

This guide walks you through setting up vLLM — a high-throughput, memory-efficient inference and serving engine — on a Linux machine with an NVIDIA GPU. By the end you’ll have a quantized model running locally, secured with an API key, exposed to the public internet via a Cloudflare Tunnel, and registered on Token Router so it can earn credits.


  • OS: a Linux distribution (Ubuntu 22.04 or 24.04 recommended).
  • GPU: an NVIDIA GPU with Compute Capability 7.0 or higher (T4, RTX 30xx/40xx, A10G, A100, H100, …).
  • NVIDIA drivers: confirm they’re installed by running nvidia-smi. You do not need to install the full CUDA Toolkit by hand — vLLM ships its own CUDA runtime via the PyTorch wheels.
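
The GPU requirement above can be checked from the shell. This is a minimal sketch, assuming your driver is recent enough that nvidia-smi accepts the compute_cap query field; the helper name cc_ok and the variable MIN_CC are made up for this example.

```shell
#!/bin/sh
# Minimum compute capability stated in the prerequisites above.
MIN_CC="7.0"

# cc_ok CC [MIN]: succeed if compute capability CC is at least MIN.
# (Hypothetical helper; the comparison is numeric, done in awk.)
cc_ok() {
    awk -v cc="$1" -v min="${2:-$MIN_CC}" 'BEGIN { exit !((cc + 0) >= (min + 0)) }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Assumption: the installed driver supports the compute_cap query field.
    nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader 2>/dev/null |
    while IFS=, read -r name cc; do
        cc=$(printf '%s' "$cc" | tr -d '[:space:]')
        [ -n "$cc" ] || continue          # skip lines we cannot parse
        if cc_ok "$cc"; then
            echo "OK: $name (compute capability $cc)"
        else
            echo "Too old for vLLM: $name (compute capability $cc)"
        fi
    done
else
    echo "nvidia-smi not found: install the NVIDIA driver first"
fi
```

If the query field is not recognised on your system, plain nvidia-smi still confirms that the driver is installed; you would then look up the card’s compute capability by model name.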

The most robust way to install vLLM is with uv, a fast Python environment manager that handles the heavy PyTorch and CUDA dependencies cleanly.

curl -LsSf https://astral.sh/uv/install.sh | sh

Step 2 — Create and activate a virtual environment

uv venv vllm-env --python 3.12
source vllm-env/bin/activate

With the environment active, install vLLM into it:

uv pip install vllm

This automatically pulls the correct PyTorch wheel, FlashAttention, and vLLM’s custom CUDA kernels.


3. Choosing quantization and downloading a model


Quantization stores the model’s weights at lower precision so they fit within your GPU’s VRAM. Because token generation is usually limited by memory bandwidth, smaller weights often speed it up as well.

  • AWQ (Activation-aware Weight Quantization): the current standard for 4-bit on NVIDIA GPUs. Excellent accuracy and highly optimized kernels in vLLM. Ideal for consumer cards (e.g. a 4090) and server cards (e.g. an A10G).
  • FP8 (8-bit floating point): on newer architectures (Ada Lovelace, Hopper, Blackwell), FP8 offers big speedups with virtually zero quality loss.
  • GPTQ: an older 4-bit standard, still widely supported but generally outperformed by AWQ in vLLM.
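
To see why the bit width matters, here is a rough back-of-the-envelope calculation. It is a sketch under simplifying assumptions: it counts only the weights (parameters × bits ÷ 8), treats the model as a nominal 14 billion parameters, and ignores the KV cache, activations, and the scales and unquantized layers that real checkpoints carry, so actual VRAM use will be higher. The function name weights_gb is made up for this example.

```shell
#!/bin/sh
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Approximate weight size in GB (1 GB = 10^9 bytes): parameters * bits / 8.
# This is a lower bound, not a prediction of total VRAM use.
weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

echo "14B at 16-bit: $(weights_gb 14 16) GB"
echo "14B at  8-bit: $(weights_gb 14 8) GB"
echo "14B at  4-bit: $(weights_gb 14 4) GB"
```

Under these assumptions, the 16-bit weights alone (28 GB) would not fit on a 24 GB card, while the 4-bit weights (7 GB) leave headroom for the KV cache.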

Unlike desktop apps, vLLM downloads the model directly from the Hugging Face Hub the first time you run it. We’ll use an AWQ-quantized Qwen model:

vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
--max-model-len 8192

Stop the server with Ctrl+C for now — we’ll restart it with authentication enabled in the next step.

4. Securing the endpoint with an API key


Because this endpoint will be exposed to the public internet, it must require authentication. vLLM supports API-key authentication on its OpenAI-compatible routes: pass the --api-key flag, and use a long random value rather than the placeholder shown here.

vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
--max-model-len 8192 \
--api-key "sk-my_super_secret_key_2026"
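
The key in the command above is only a placeholder. One way to generate a stronger one is sketched below, assuming /dev/urandom is available (it is on any mainstream Linux distribution); the function name gen_api_key and the variable API_KEY are made up for this example.

```shell
#!/bin/sh
# gen_api_key: print "sk-" followed by 64 hex characters (32 random bytes).
gen_api_key() {
    printf 'sk-%s\n' "$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')"
}

API_KEY="$(gen_api_key)"   # hypothetical variable name

# Keep a copy that only your user can read, for example:
#   umask 077; printf '%s\n' "$API_KEY" > "$HOME/.vllm_api_key"
# Then start the server with the same flag as above:
#   vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ --max-model-len 8192 --api-key "$API_KEY"
```

Passing the key through a variable also keeps the literal secret out of your shell history.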

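vLLM listens on port 8000 by default, on all network interfaces unless you pass --host, so it is reachable locally at http://127.0.0.1:8000. Since the tunnel only needs loopback access, you can add --host 127.0.0.1 to keep the server off your LAN. Confirm it’s up locally: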

curl http://127.0.0.1:8000/v1/models \
-H "Authorization: Bearer sk-my_super_secret_key_2026"
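
On the first run the server can take a while to answer, because it downloads and loads the model first. The check above can be wrapped in a small retry loop. This is a sketch rather than anything vLLM ships; the function name wait_for_vllm and its default of 60 attempts, 5 seconds apart, are arbitrary choices for illustration.

```shell
#!/bin/sh
# wait_for_vllm BASE_URL API_KEY [TRIES] [DELAY_SECONDS]
# Poll BASE_URL/v1/models until it answers with a successful HTTP status,
# or give up after TRIES attempts. Returns 0 on success, 1 on timeout.
wait_for_vllm() {
    wfv_url=$1; wfv_key=$2; wfv_tries=${3:-60}; wfv_delay=${4:-5}
    wfv_i=0
    while [ "$wfv_i" -lt "$wfv_tries" ]; do
        if curl -fsS -o /dev/null --max-time 5 \
            -H "Authorization: Bearer $wfv_key" "$wfv_url/v1/models" 2>/dev/null; then
            return 0
        fi
        wfv_i=$((wfv_i + 1))
        sleep "$wfv_delay"
    done
    return 1
}

# Usage (commented out so the snippet has no side effects):
#   wait_for_vllm http://127.0.0.1:8000 "sk-my_super_secret_key_2026" \
#       && echo "server is ready" || echo "server did not come up in time"
```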

5. Exposing the endpoint via Cloudflare Tunnel


We’ll use cloudflared to create a secure reverse-proxy tunnel to your local machine, so there is no need to open firewall ports or configure port forwarding on your router.

Download and install the Debian package (adjust for your distro if you’re not on Ubuntu/Debian):

curl -L --output cloudflared.deb https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
sudo dpkg -i cloudflared.deb

Step 2 — Quick tunnel (no domain required)


For rapid testing or temporary access:

cloudflared tunnel --url http://127.0.0.1:8000

This prints an https://<random>.trycloudflare.com URL. The address is temporary and changes every time you restart the tunnel.
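
If you start the quick tunnel from a script, you may want to capture that URL instead of reading it off the terminal. The sketch below assumes you have redirected cloudflared’s output to a file; the function name extract_tunnel_url and the file name tunnel.log are made up for this example.

```shell
#!/bin/sh
# extract_tunnel_url: read log text on stdin and print the first
# https://<random>.trycloudflare.com URL found in it.
extract_tunnel_url() {
    grep -oE 'https://[a-zA-Z0-9-]+\.trycloudflare\.com' | head -n 1
}

# Usage (commented out so the snippet has no side effects):
#   cloudflared tunnel --url http://127.0.0.1:8000 > tunnel.log 2>&1 &
#   sleep 10    # give the tunnel a moment to register
#   extract_tunnel_url < tunnel.log
```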

Step 3 — Persistent tunnel (recommended for earning)

If you have a domain managed by Cloudflare, a named tunnel gives you a stable URL that survives restarts:

# Authenticate cloudflared with your Cloudflare account
cloudflared tunnel login
# Create a named tunnel
cloudflared tunnel create vllm-tunnel
# Point a hostname at it
cloudflared tunnel route dns vllm-tunnel ai.yourdomain.com
# Run it
cloudflared tunnel run --url http://127.0.0.1:8000 vllm-tunnel
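
As an alternative to passing --url on every run, cloudflared can read the routing from a configuration file. The sketch below writes one. The function name write_tunnel_config is made up for this example, and the credentials path is an assumption: use the JSON file path that cloudflared tunnel create printed for your tunnel.

```shell
#!/bin/sh
# write_tunnel_config TUNNEL_NAME CREDENTIALS_FILE HOSTNAME OUT_FILE
# Write a cloudflared config that sends HOSTNAME to the local vLLM port
# and answers every other request with 404.
write_tunnel_config() {
    cat > "$4" <<EOF
tunnel: $1
credentials-file: $2
ingress:
  - hostname: $3
    service: http://127.0.0.1:8000
  - service: http_status:404
EOF
}

# Usage (commented out; replace the credentials path with your own):
#   write_tunnel_config vllm-tunnel "$HOME/.cloudflared/<TUNNEL-UUID>.json" \
#       ai.yourdomain.com "$HOME/.cloudflared/config.yml"
#   cloudflared tunnel --config "$HOME/.cloudflared/config.yml" run vllm-tunnel
```

The final catch-all rule is required by cloudflared whenever a rule with a hostname is present.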

Your NVIDIA-powered vLLM server is now reachable from anywhere. Fire a request at your Cloudflare URL:

curl https://ai.yourdomain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-my_super_secret_key_2026" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct-AWQ",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Write a bash script to parse a JSON file."}
    ],
    "temperature": 0.7
  }'
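
Before registering the node, it is worth confirming that the public URL really rejects requests that lack the key. This is a minimal sketch, assuming the server answers 401 when the key is missing or wrong; the function names http_status and describe_status are made up for this example.

```shell
#!/bin/sh
# http_status URL [API_KEY]: print only the HTTP status code of a GET request.
# curl reports 000 when it cannot connect at all.
http_status() {
    if [ -n "${2:-}" ]; then
        curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
            -H "Authorization: Bearer $2" "$1"
    else
        curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1"
    fi
}

# describe_status CODE: map a status code to a short verdict.
describe_status() {
    case "$1" in
        200) echo "ok" ;;
        401) echo "rejected: missing or wrong API key" ;;
        000) echo "unreachable: tunnel or server is down" ;;
        *)   echo "unexpected status $1" ;;
    esac
}

# Usage (commented out so the snippet has no side effects):
#   describe_status "$(http_status https://ai.yourdomain.com/v1/models)"
#   describe_status "$(http_status https://ai.yourdomain.com/v1/models sk-my_super_secret_key_2026)"
```

The first call should report a rejection and the second should report ok. If both report ok, the server is running without --api-key.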

Turn that endpoint into an earning node:

  1. Sign in to the Token Router dashboard with GitHub.
  2. Open Instances → Add instance.
  3. Provide:
    • Model — the model id you’re serving (e.g. Qwen/Qwen2.5-14B-Instruct-AWQ).
    • Endpoint URL — your public tunnel URL, including /v1 (e.g. https://ai.yourdomain.com/v1).
    • Upstream API key — the sk-… key from step 4. It’s encrypted at rest and never stored in the clear.
  4. (Optional) Fill in the hardware and software inventory so the network understands your capacity.
  5. Save. Once active, your node enters rotation and receives traffic whenever it’s the least-loaded healthy candidate for that model.

Your Linux box is now a paid member of the Token Router network. See Earn Credits for how payouts work.