Deploying vLLM on Linux

This guide walks you through setting up vLLM — a high-throughput, memory-efficient inference and serving engine — on a Linux machine with an NVIDIA GPU. By the end you’ll have a quantized model running locally, secured with an API key, exposed to the public internet via a Cloudflare Tunnel, and registered on Token Router so it can earn credits.

1. Prerequisites: hardware and OS

OS: a Linux distribution (Ubuntu 22.04 or 24.04 recommended).
GPU: an NVIDIA GPU with Compute Capability 7.0 or higher (T4, RTX 30xx/40xx, A10G, A100, H100, …).
NVIDIA drivers: confirm they’re installed by running nvidia-smi. You do not need to install the full CUDA Toolkit by hand: the PyTorch wheels that vLLM depends on bundle the CUDA runtime libraries, so only the driver has to be present on the host.
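
The compute capability requirement can also be checked from the command line. The sketch below assumes a driver recent enough that nvidia-smi understands the compute_cap query field; the helper name meets_min_cc is made up for illustration.

```shell
# Hypothetical helper: succeed if a compute capability string such as "8.6"
# meets the 7.0 minimum quoted above. Only the major version needs checking.
meets_min_cc() {
  local major
  major="${1%%.*}"        # "8.6" -> "8"
  [ "$major" -ge 7 ]
}

# On the GPU host (assumes nvidia-smi supports the compute_cap field):
#   nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
#   cc="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
#   meets_min_cc "$cc" && echo "GPU meets the minimum" || echo "GPU too old"
```

If the query field is not recognised, look up the card's compute capability on NVIDIA's website instead.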

2. Installing vLLM

The most robust way to install vLLM is with uv, a fast Python environment manager that handles the heavy PyTorch and CUDA dependencies cleanly.

Step 1 — Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

Step 2 — Create and activate a virtual environment

uv venv vllm-env --python 3.12
source vllm-env/bin/activate

Step 3 — Install vLLM

uv pip install vllm

This automatically pulls the correct PyTorch wheel, FlashAttention, and vLLM’s custom CUDA kernels.

3. Choosing quantization and downloading a model

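Quantization stores the model weights at lower precision so the model fits within your GPU’s VRAM. Because token generation is usually limited by memory bandwidth, smaller weights often improve generation speed as well.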

Choosing the right quantization

AWQ (Activation-aware Weight Quantization): the current standard for 4-bit on NVIDIA GPUs. Excellent accuracy and highly optimized kernels in vLLM. Ideal for consumer cards (e.g. a 4090) and server cards (e.g. an A10G).
FP8 (8-bit floating point): on architectures with native FP8 support (Ada Lovelace, Hopper, Blackwell), FP8 roughly halves weight memory relative to FP16 and can deliver significant speedups with very little quality loss.
GPTQ: an older 4-bit standard, still widely supported but generally outperformed by AWQ in vLLM.
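
To see why the bit width matters, a back-of-the-envelope estimate of weight memory helps. The sketch below is an approximation only: it counts the weights alone (parameters times bits, divided by 8 bits per byte) and ignores quantization scales, the KV cache, and activations, all of which need VRAM too. The helper name weight_gb is hypothetical.

```shell
# Hypothetical helper: approximate weight memory in GB.
# $1 = parameter count in billions, $2 = bits per weight.
weight_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weight_gb 14 16   # 14B model at 16-bit: 28.0
weight_gb 14 4    # same model at 4-bit:  7.0
```

Under these assumptions a 14B model drops from roughly 28 GB of weights at 16-bit to roughly 7 GB at 4-bit, which is what brings it within reach of a 24 GB card with room left for the KV cache.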

Serving the model

There is no separate download step: vLLM fetches the model from the Hugging Face Hub the first time you serve it and caches it locally for later runs. We’ll use an AWQ-quantized Qwen model:

vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --max-model-len 8192

Stop the server with Ctrl+C for now — we’ll restart it with authentication enabled in the next step.

4. Enabling API authentication

Because we’re exposing this endpoint to the web, securing it is mandatory. vLLM natively supports API-key auth on its OpenAI-compatible routes: just append the --api-key flag. The key below is a placeholder; substitute a long random value of your own:

vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
  --max-model-len 8192 \
  --api-key "sk-my_super_secret_key_2026"
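
Rather than typing a memorable key, you may prefer to generate a random one. The following is a minimal sketch using only standard Unix tools; the helper name gen_api_key and the variable API_KEY are illustrative, and the sk- prefix is just the convention used in this guide.

```shell
# Hypothetical helper: print "sk-" followed by 64 random hex characters
# (32 bytes read from /dev/urandom).
gen_api_key() {
  printf 'sk-%s\n' "$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')"
}

# Usage:
#   API_KEY="$(gen_api_key)"
#   vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
#     --max-model-len 8192 \
#     --api-key "$API_KEY"
```

Keep the generated value somewhere safe: you will need it again for the test request and when registering the node.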

vLLM listens on port 8000 by default, so the API is reachable locally at http://127.0.0.1:8000. Confirm it’s up:

curl http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer sk-my_super_secret_key_2026"
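
It is also worth confirming that a request without the key is rejected. The sketch below compares the HTTP status with and without the bearer token; both helper names are hypothetical, and it assumes the server answers 401 when the key is missing and 200 when it is correct.

```shell
# Hypothetical helper: print the HTTP status code of a GET request.
# $1 = URL, $2 = optional bearer token.
http_status() {
  if [ -n "${2:-}" ]; then
    curl -s -o /dev/null -w '%{http_code}' -H "Authorization: Bearer $2" "$1"
  else
    curl -s -o /dev/null -w '%{http_code}' "$1"
  fi
}

# Hypothetical helper: interpret the two codes (without key, then with key).
auth_verdict() {
  if [ "$1" = "401" ] && [ "$2" = "200" ]; then
    echo "auth enforced"
  else
    echo "unexpected: no-key=$1 with-key=$2"
  fi
}

# Usage (API_KEY holds the key you passed to --api-key):
#   auth_verdict "$(http_status http://127.0.0.1:8000/v1/models)" \
#                "$(http_status http://127.0.0.1:8000/v1/models "$API_KEY")"
```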

5. Exposing the endpoint via Cloudflare Tunnel

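We’ll use cloudflared to create a secure reverse-proxy tunnel to your local machine. The connection is initiated outbound from your machine, so you don’t need to open inbound firewall ports or set up port forwarding on your router.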

Step 1 — Install cloudflared

Download and install the Debian package (adjust for your distro if you’re not on Ubuntu/Debian):

curl -L --output cloudflared.deb https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
sudo dpkg -i cloudflared.deb

Step 2 — Quick tunnel (no domain required)

For rapid testing or temporary access:

cloudflared tunnel --url http://127.0.0.1:8000

This prints a https://<random>.trycloudflare.com URL. The hostname is random and changes every time you restart the command.
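
If you want to script against the quick tunnel, the URL can be pulled out of the cloudflared output. This is only a sketch: it assumes the URL appears verbatim somewhere in the log text, and the helper name extract_tunnel_url is made up.

```shell
# Hypothetical helper: read log text on stdin and print the first
# trycloudflare.com URL found in it.
extract_tunnel_url() {
  grep -o 'https://[a-zA-Z0-9-]*\.trycloudflare\.com' | head -n 1
}

# Usage (capture both stdout and stderr in one terminal, read the log from another):
#   cloudflared tunnel --url http://127.0.0.1:8000 2>&1 | tee cloudflared.log
#   extract_tunnel_url < cloudflared.log
```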

Step 3 — Persistent tunnel (recommended for earning)

If you have a domain managed by Cloudflare, a named tunnel gives you a stable URL that survives restarts:

# Authenticate cloudflared with your Cloudflare account
cloudflared tunnel login

# Create a named tunnel
cloudflared tunnel create vllm-tunnel

# Point a hostname at it
cloudflared tunnel route dns vllm-tunnel ai.yourdomain.com

# Run it
cloudflared tunnel run --url http://127.0.0.1:8000 vllm-tunnel
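
As an alternative to passing --url on every run, the same routing can live in a cloudflared config file. The sketch below writes one; the helper name write_tunnel_config is hypothetical, the tunnel UUID is whatever cloudflared tunnel create printed for you, and the credentials path assumes the default ~/.cloudflared location.

```shell
# Hypothetical helper: write a cloudflared config for a named tunnel.
# $1 = tunnel UUID, $2 = public hostname, $3 = output file.
write_tunnel_config() {
  cat > "$3" <<EOF
tunnel: $1
credentials-file: $HOME/.cloudflared/$1.json
ingress:
  - hostname: $2
    service: http://127.0.0.1:8000
  - service: http_status:404
EOF
}

# Usage (replace <TUNNEL-UUID> with your own value):
#   write_tunnel_config <TUNNEL-UUID> ai.yourdomain.com ~/.cloudflared/config.yml
#   cloudflared tunnel --config ~/.cloudflared/config.yml run vllm-tunnel
```

The final catch-all rule returns 404 for any hostname that is not listed, which cloudflared expects as the last ingress entry.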

6. Testing your setup

Your NVIDIA-powered vLLM server is now reachable from anywhere. Fire a request at your Cloudflare URL:

curl https://ai.yourdomain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-my_super_secret_key_2026" \
  -d '{
    "model": "Qwen/Qwen2.5-14B-Instruct-AWQ",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Write a bash script to parse a JSON file."}
    ],
    "temperature": 0.7
  }'
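
The response is a JSON object in the OpenAI chat-completions shape, with the reply under choices[0].message.content. If jq is installed, a small filter prints just the text; the helper name reply_text is illustrative.

```shell
# Hypothetical helper: read a chat-completions response on stdin and
# print only the assistant's reply. Requires jq.
reply_text() {
  jq -r '.choices[0].message.content'
}

# Usage: add -s to the curl command above and pipe it through the helper:
#   curl -s https://ai.yourdomain.com/v1/chat/completions ... | reply_text
```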

7. Register it on Token Router

Turn that endpoint into an earning node:

Sign in to the Token Router dashboard with GitHub.
Open Instances → Add instance.
Provide:
- Model — the model id you’re serving (e.g. Qwen/Qwen2.5-14B-Instruct-AWQ).
- Endpoint URL — your public tunnel URL, including /v1 (e.g. https://ai.yourdomain.com/v1).
- Upstream API key — the sk-… key from step 4. It’s encrypted at rest and never stored in the clear.
(Optional) Fill in the hardware and software inventory so the network understands your capacity.
Save. Once active, your node enters rotation and receives traffic whenever it’s the least-loaded healthy candidate for that model.
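
To make "least-loaded healthy candidate" concrete, here is a toy illustration of that selection rule. It is not Token Router's actual scheduler, whose internals this guide does not describe; the input format, node names, and helper name are all invented for the example.

```shell
# Toy illustration only. Input on stdin: one node per line as
#   <name> <healthy: 1 or 0> <requests in flight>
# Prints the name of the healthy node with the fewest requests in flight.
pick_least_loaded() {
  awk '$2 == 1 { print $3, $1 }' | sort -n | head -n 1 | cut -d' ' -f2
}

printf '%s\n' 'node-a 1 5' 'node-b 0 0' 'node-c 1 2' | pick_least_loaded   # node-c
```

Here node-b is idle but unhealthy, so it is skipped, and node-c wins over node-a because it has fewer requests in flight.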

Your Linux box is now a paid member of the Token Router network. See Earn Credits for how payouts work.