Deploying llama.cpp on Windows
This guide provides step-by-step instructions for setting up llama.cpp on a Windows machine with an NVIDIA GPU. We’ll download pre-compiled CUDA binaries, pick a GGUF quantized model, run an OpenAI-compatible server with API authentication, expose it via a Cloudflare Tunnel, and register it on Token Router so it can earn credits.
1. Prerequisites: hardware and drivers
Section titled “1. Prerequisites: hardware and drivers”- OS: Windows 10 or Windows 11.
- GPU: an NVIDIA GPU (e.g. RTX 3060, 4090).
- NVIDIA drivers: keep them current via GeForce Experience or the NVIDIA website.
- CUDA Toolkit (optional): the pre-compiled binaries include the runtime they need, but installing the NVIDIA CUDA Toolkit maximizes compatibility.
2. Installing llama.cpp
Section titled “2. Installing llama.cpp”The easiest way to run llama.cpp on Windows — without configuring Visual Studio and CMake — is to use the official pre-compiled binaries.
- Open the llama.cpp GitHub Releases page.
- Download the latest Windows CUDA zip. For example:
llama-bxxxx-bin-win-cuda-cu12.2-x64.zip(choosecu12for newer drivers,cu11for older ones). - Extract the contents to a dedicated folder, e.g.
C:\llama\.
3. Downloading a model and choosing quantization
Section titled “3. Downloading a model and choosing quantization”llama.cpp uses the GGUF format, which runs highly quantized models efficiently.
Choosing the right quantization
Section titled “Choosing the right quantization”- Q4_K_M (4-bit): the universal sweet spot — an excellent balance of VRAM usage, speed, and quality. Recommended.
- Q5_K_M (5-bit): slightly better quality than Q4, a bit more VRAM.
- Q8_0 (8-bit): near-unquantized quality, but high VRAM (an 8B model is ~8–9 GB).
Downloading a GGUF model
Section titled “Downloading a GGUF model”- Go to Hugging Face (popular uploaders include
bartowskiandlmstudio-community). - Search for a model, e.g.
Llama-3-8B-Instruct-GGUF. - On the Files and versions tab, download the
*Q4_K_M.gguffile. - Move the
.gguffile into yourC:\llama\folder.
4. Running the server and enabling API auth
Section titled “4. Running the server and enabling API auth”llama.cpp includes a built-in HTTP server (llama-server.exe) with an OpenAI-compatible API. We’ll offload the model to the GPU and secure it with an API key.
- Open Command Prompt or PowerShell.
- Navigate to your folder:
cd C:\llama\. - Start the server:
llama-server.exe -m your_model_q4_k_m.gguf -c 4096 -ngl 99 --port 8080 --api-key "sk-my_super_secret_key_2026"Parameter breakdown:
-m— path to your model file.-c 4096— context window in tokens (adjust to your model and VRAM).-ngl 99— number of GPU layers to offload; 99 pushes all layers onto the NVIDIA GPU for maximum speed.--port 8080— the local port to serve on.--api-key— enforces authentication.
Leave this window open to keep the server running. Confirm it’s up locally in a second terminal:
curl http://127.0.0.1:8080/v1/models -H "Authorization: Bearer sk-my_super_secret_key_2026"5. Exposing the endpoint via Cloudflare Tunnel
Section titled “5. Exposing the endpoint via Cloudflare Tunnel”To reach your Windows server from the internet without port-forwarding, use a Cloudflare Tunnel.
Step 1 — Install cloudflared for Windows
Section titled “Step 1 — Install cloudflared for Windows”- Download the Windows executable:
cloudflared-windows-amd64.exe. - Move it to a folder (e.g.
C:\cloudflared\) and rename it tocloudflared.exe. - (Optional) Add
C:\cloudflared\to your Windows System PATH so you can runcloudflaredfrom anywhere.
Step 2 — Quick tunnel (no domain required)
Section titled “Step 2 — Quick tunnel (no domain required)”Open a new Command Prompt and run:
C:\cloudflared\cloudflared.exe tunnel --url http://127.0.0.1:8080This prints a https://<random>.trycloudflare.com URL for immediate testing.
Step 3 — Persistent tunnel (recommended for earning)
Section titled “Step 3 — Persistent tunnel (recommended for earning)”If you manage a domain through Cloudflare, a named tunnel gives you a stable URL:
cloudflared.exe tunnel logincloudflared.exe tunnel create llama-tunnelcloudflared.exe tunnel route dns llama-tunnel ai.yourdomain.comcloudflared.exe tunnel run --url http://127.0.0.1:8080 llama-tunnel6. Testing your setup
Section titled “6. Testing your setup”With both llama-server.exe and cloudflared running, your Windows machine is a secure, globally accessible AI endpoint. Test it from any computer:
curl https://ai.yourdomain.com/v1/chat/completions ^ -H "Content-Type: application/json" ^ -H "Authorization: Bearer sk-my_super_secret_key_2026" ^ -d "{ \"model\": \"default\", \"messages\": [ { \"role\": \"user\", \"content\": \"Write a PowerShell script to list directories.\" } ] }"7. Register it on Token Router
Section titled “7. Register it on Token Router”Turn that endpoint into an earning node:
- Sign in to the Token Router dashboard with GitHub.
- Open Instances → Add instance.
- Provide:
- Model — the model id you’re serving (llama.cpp reports the served model; you can also use the name you advertise, e.g.
llama-3-8b-instruct). - Endpoint URL — your public tunnel URL, including
/v1(e.g.https://ai.yourdomain.com/v1). - Upstream API key — the
sk-…key from step 4. It’s encrypted at rest and never stored in the clear.
- Model — the model id you’re serving (llama.cpp reports the served model; you can also use the name you advertise, e.g.
- (Optional) Fill in the hardware and software inventory so the network understands your capacity.
- Save. Once active, your node enters rotation and receives traffic whenever it’s the least-loaded healthy candidate for that model.
Your Windows machine is now a paid member of the Token Router network. See Earn Credits for how payouts work.