Deploying vLLM on Linux
This guide walks you through setting up vLLM — a high-throughput, memory-efficient inference and serving engine — on a Linux machine with an NVIDIA GPU. By the end you’ll have a quantized model running locally, secured with an API key, exposed to the public internet via a Cloudflare Tunnel, and registered on Token Router so it can earn credits.
1. Prerequisites: hardware and OS
- OS: a Linux distribution (Ubuntu 22.04 or 24.04 recommended).
- GPU: an NVIDIA GPU with Compute Capability 7.0 or higher (T4, RTX 30xx/40xx, A10G, A100, H100, …).
- NVIDIA drivers: confirm they’re installed by running nvidia-smi. You do not need to install the full CUDA Toolkit by hand, because the PyTorch wheels that vLLM depends on bundle the CUDA runtime.
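The GPU requirement above can be checked from the shell. This is a minimal sketch, assuming a driver recent enough that nvidia-smi supports the compute_cap query field; meets_min_cc is a hypothetical helper name used only for illustration.

```shell
# Return success when a compute capability such as "8.6" is at least 7.0.
meets_min_cc() {
    awk -v cc="$1" 'BEGIN { exit !(cc + 0 >= 7.0) }'
}

# Only query the GPU when the driver tools are present.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader |
    while IFS=, read -r name cc; do
        # The CSV value carries a leading space; strip it before comparing.
        cc=$(printf '%s' "$cc" | tr -d ' ')
        if meets_min_cc "$cc"; then
            echo "OK: $name (compute capability $cc)"
        else
            echo "Too old: $name (compute capability $cc)"
        fi
    done
else
    echo "nvidia-smi not found; install the NVIDIA driver first"
fi
```

If the compute_cap field is not recognized on your driver, compare the GPU name printed by nvidia-smi against NVIDIA's published compute capability table instead.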
2. Installing vLLM
The most robust way to install vLLM is with uv, a fast Python environment manager that handles the heavy PyTorch and CUDA dependencies cleanly.
Step 1 — Install uv
    curl -LsSf https://astral.sh/uv/install.sh | sh
Step 2 — Create and activate a virtual environment
    uv venv vllm-env --python 3.12
    source vllm-env/bin/activate
Step 3 — Install vLLM
    uv pip install vllm
This automatically pulls the correct PyTorch wheel, FlashAttention, and vLLM’s custom CUDA kernels.
3. Choosing quantization and downloading a model
Quantization stores the model’s weights at lower precision so the model fits within your GPU’s VRAM, and it usually improves token-generation speed as well.
Choosing the right quantization
- AWQ (Activation-aware Weight Quantization): a widely used 4-bit format on NVIDIA GPUs, with excellent accuracy and highly optimized kernels in vLLM. Suits both consumer cards (e.g. a 4090) and server cards (e.g. an A10G).
- FP8 (8-bit floating point): on newer architectures with hardware FP8 support (Ada Lovelace, Hopper, Blackwell), FP8 offers large speedups with minimal quality loss.
- GPTQ: an older 4-bit standard, still widely supported but generally outperformed by AWQ in vLLM.
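To see why the bit width matters for fitting a model in VRAM, the weight footprint can be estimated as parameters times bits per weight divided by 8. The sketch below is a rough approximation only: weight_gb is a hypothetical helper, and the figure ignores the KV cache, activations, and quantization metadata such as scales, all of which need additional memory.

```shell
# Approximate weight memory in GB: parameters (in billions) * bits per weight / 8.
# Ignores KV cache, activations, and quantization metadata such as scales.
weight_gb() {
    awk -v params_b="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params_b * bits / 8 }'
}

echo "14B at 16-bit: $(weight_gb 14 16) GB"
echo "14B at FP8:    $(weight_gb 14 8) GB"
echo "14B at 4-bit:  $(weight_gb 14 4) GB"
```

For a 14B-parameter model this gives roughly 28 GB, 14 GB, and 7 GB of weights respectively, which is why a 4-bit build is the practical choice on a 24 GB card.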
Serving the model
Unlike desktop apps, vLLM downloads the model directly from the Hugging Face Hub the first time you run it. We’ll use an AWQ-quantized Qwen model:
    vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
        --max-model-len 8192
Stop the server with Ctrl+C for now; we’ll restart it with authentication enabled in the next step.
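Because the first start downloads and loads the weights, the server may take a while before it accepts requests. A minimal sketch for waiting until it is ready, meant to be run from a second terminal: wait_for_server is a hypothetical helper, and the /health path is assumed to be served by your vLLM version on the default port.

```shell
# Poll a URL about once per second until it answers with an HTTP success
# status, or give up after roughly the given number of attempts.
wait_for_server() {
    url=$1
    timeout=${2:-600}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if curl -fsS -o /dev/null --max-time 2 "$url" 2>/dev/null; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Example (commented out so the snippet has no side effects when sourced):
# wait_for_server http://127.0.0.1:8000/health 600 && echo "vLLM is ready"
```

The function returns a non-zero status on timeout, so it can gate later steps in a startup script.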
4. Enabling API authentication
Because we’re exposing this endpoint to the web, securing it is mandatory. vLLM natively supports API-key authentication on its OpenAI-compatible routes. Pass the key with the --api-key flag:
    vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ \
        --max-model-len 8192 \
        --api-key "sk-my_super_secret_key_2026"
vLLM listens on port 8000 by default, so the API is reachable locally at http://127.0.0.1:8000. Confirm it’s up:
    curl http://127.0.0.1:8000/v1/models \
        -H "Authorization: Bearer sk-my_super_secret_key_2026"
5. Exposing the endpoint via Cloudflare Tunnel
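Before putting the server behind a public URL, it is worth confirming that requests without the key are actually rejected. The sketch below assumes vLLM answers 401 to a keyless request on /v1 routes and 200 to a correctly keyed one; http_status and auth_is_enforced are hypothetical helpers, and API_KEY stands for the key passed to --api-key.

```shell
# Print the HTTP status for GET $1, optionally sending $2 as a bearer token.
http_status() {
    if [ -n "${2:-}" ]; then
        curl -s -o /dev/null -w '%{http_code}' -H "Authorization: Bearer $2" "$1"
    else
        curl -s -o /dev/null -w '%{http_code}' "$1"
    fi
}

# Succeed only when the keyless request was rejected (401)
# and the keyed request worked (200).
auth_is_enforced() {
    [ "$1" = "401" ] && [ "$2" = "200" ]
}

# Example:
# base=http://127.0.0.1:8000/v1/models
# auth_is_enforced "$(http_status "$base")" "$(http_status "$base" "$API_KEY")" \
#     && echo "auth OK" || echo "auth NOT enforced"
```

If the keyless request returns 200, the server was most likely started without --api-key and should not be exposed yet.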
We’ll use cloudflared to create a secure reverse-proxy tunnel to your local machine, with no need to open firewall ports or set up router port-forwarding.
Step 1 — Install cloudflared
Download and install the Debian package (adjust for your distro if you’re not on Ubuntu/Debian):
    curl -L --output cloudflared.deb https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
    sudo dpkg -i cloudflared.deb
Step 2 — Quick tunnel (no domain required)
For rapid testing or temporary access:
    cloudflared tunnel --url http://127.0.0.1:8000
This prints a https://<random>.trycloudflare.com URL.
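When the quick tunnel is started from a script, the generated URL has to be picked out of cloudflared's output. A minimal sketch, assuming the URL appears verbatim in the captured log text: extract_tunnel_url is a hypothetical helper, and the sample line is invented for illustration rather than copied from real cloudflared output.

```shell
# Read cloudflared log text on stdin and print the first quick-tunnel URL found.
extract_tunnel_url() {
    grep -oE 'https://[a-zA-Z0-9-]+\.trycloudflare\.com' | head -n 1
}

# Hypothetical log line, for illustration only.
sample='INF |  https://example-words-here.trycloudflare.com  |'
printf '%s\n' "$sample" | extract_tunnel_url
```

In practice you might capture both output streams to a file, for example with `cloudflared tunnel --url http://127.0.0.1:8000 2>&1 | tee cloudflared.log`, and then run `extract_tunnel_url < cloudflared.log` from another terminal.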
Step 3 — Persistent tunnel (recommended for earning)
If you have a domain managed by Cloudflare, a named tunnel gives you a stable URL that survives restarts:
    # Authenticate cloudflared with your Cloudflare account
    cloudflared tunnel login
    # Create a named tunnel
    cloudflared tunnel create vllm-tunnel
    # Point a hostname at it
    cloudflared tunnel route dns vllm-tunnel ai.yourdomain.com
    # Run it
    cloudflared tunnel run --url http://127.0.0.1:8000 vllm-tunnel
6. Testing your setup
Your NVIDIA-powered vLLM server is now reachable from anywhere. Fire a request at your Cloudflare URL:
    curl https://ai.yourdomain.com/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer sk-my_super_secret_key_2026" \
        -d '{
            "model": "Qwen/Qwen2.5-14B-Instruct-AWQ",
            "messages": [
                {"role": "system", "content": "You are a helpful coding assistant."},
                {"role": "user", "content": "Write a bash script to parse a JSON file."}
            ],
            "temperature": 0.7
        }'
7. Register it on Token Router
Turn that endpoint into an earning node:
- Sign in to the Token Router dashboard with GitHub.
- Open Instances → Add instance.
- Provide:
  - Model: the model id you’re serving (e.g. Qwen/Qwen2.5-14B-Instruct-AWQ).
  - Endpoint URL: your public tunnel URL, including /v1 (e.g. https://ai.yourdomain.com/v1).
  - Upstream API key: the sk-… key from step 4. It’s encrypted at rest and never stored in the clear.
- (Optional) Fill in the hardware and software inventory so the network understands your capacity.
- Save. Once active, your node enters rotation and receives traffic whenever it’s the least-loaded healthy candidate for that model.
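The values entered in the form can be sanity-checked from the shell. This sketch assumes the endpoint URL should be HTTPS and end in /v1, and that the /v1/models response lists each served model under an "id" field. Both helper names are hypothetical, and neither talks to Token Router itself.

```shell
# Succeed when the endpoint URL is HTTPS and ends in /v1.
valid_endpoint_url() {
    case $1 in
        https://*/v1) return 0 ;;
        *) return 1 ;;
    esac
}

# Read a /v1/models JSON response on stdin; succeed when it lists model id $1.
# Whitespace is stripped first so both compact and pretty-printed JSON match.
model_is_listed() {
    tr -d ' \n' | grep -qF "\"id\":\"$1\""
}

# Example:
# curl -s https://ai.yourdomain.com/v1/models -H "Authorization: Bearer $API_KEY" \
#     | model_is_listed "Qwen/Qwen2.5-14B-Instruct-AWQ" && echo "model is served"
```

The string match is deliberately crude; a JSON-aware tool would be more robust if one is available on your machine.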
Your Linux box is now a paid member of the Token Router network. See Earn Credits for how payouts work.