Skip to content

Deploying llama.cpp on Windows

This guide provides step-by-step instructions for setting up llama.cpp on a Windows machine with an NVIDIA GPU. We’ll download pre-compiled CUDA binaries, pick a GGUF quantized model, run an OpenAI-compatible server with API authentication, expose it via a Cloudflare Tunnel, and register it on Token Router so it can earn credits.


  • OS: Windows 10 or Windows 11.
  • GPU: an NVIDIA GPU (e.g. RTX 3060, 4090).
  • NVIDIA drivers: keep them current via GeForce Experience or the NVIDIA website.
  • CUDA Toolkit (optional): the pre-compiled binaries include the runtime they need, but installing the NVIDIA CUDA Toolkit maximizes compatibility.

The easiest way to run llama.cpp on Windows — without configuring Visual Studio and CMake — is to use the official pre-compiled binaries.

  1. Open the llama.cpp GitHub Releases page.
  2. Download the latest Windows CUDA zip. For example: llama-bxxxx-bin-win-cuda-cu12.2-x64.zip (choose cu12 for newer drivers, cu11 for older ones).
  3. Extract the contents to a dedicated folder, e.g. C:\llama\.

3. Downloading a model and choosing quantization

Section titled “3. Downloading a model and choosing quantization”

llama.cpp uses the GGUF format, which runs highly quantized models efficiently.

  • Q4_K_M (4-bit): the universal sweet spot — an excellent balance of VRAM usage, speed, and quality. Recommended.
  • Q5_K_M (5-bit): slightly better quality than Q4, a bit more VRAM.
  • Q8_0 (8-bit): near-unquantized quality, but high VRAM (an 8B model is ~8–9 GB).
  1. Go to Hugging Face (popular uploaders include bartowski and lmstudio-community).
  2. Search for a model, e.g. Llama-3-8B-Instruct-GGUF.
  3. On the Files and versions tab, download the *Q4_K_M.gguf file.
  4. Move the .gguf file into your C:\llama\ folder.

4. Running the server and enabling API auth

Section titled “4. Running the server and enabling API auth”

llama.cpp includes a built-in HTTP server (llama-server.exe) with an OpenAI-compatible API. We’ll offload the model to the GPU and secure it with an API key.

  1. Open Command Prompt or PowerShell.
  2. Navigate to your folder: cd C:\llama\.
  3. Start the server:
llama-server.exe -m your_model_q4_k_m.gguf -c 4096 -ngl 99 --port 8080 --api-key "sk-my_super_secret_key_2026"

Parameter breakdown:

  • -m — path to your model file.
  • -c 4096 — context window in tokens (adjust to your model and VRAM).
  • -ngl 99 — number of GPU layers to offload; 99 pushes all layers onto the NVIDIA GPU for maximum speed.
  • --port 8080 — the local port to serve on.
  • --api-key — enforces authentication.

Leave this window open to keep the server running. Confirm it’s up locally in a second terminal:

curl http://127.0.0.1:8080/v1/models -H "Authorization: Bearer sk-my_super_secret_key_2026"

5. Exposing the endpoint via Cloudflare Tunnel

Section titled “5. Exposing the endpoint via Cloudflare Tunnel”

To reach your Windows server from the internet without port-forwarding, use a Cloudflare Tunnel.

Step 1 — Install cloudflared for Windows

Section titled “Step 1 — Install cloudflared for Windows”
  1. Download the Windows executable: cloudflared-windows-amd64.exe.
  2. Move it to a folder (e.g. C:\cloudflared\) and rename it to cloudflared.exe.
  3. (Optional) Add C:\cloudflared\ to your Windows System PATH so you can run cloudflared from anywhere.

Step 2 — Quick tunnel (no domain required)

Section titled “Step 2 — Quick tunnel (no domain required)”

Open a new Command Prompt and run:

C:\cloudflared\cloudflared.exe tunnel --url http://127.0.0.1:8080

This prints a https://<random>.trycloudflare.com URL for immediate testing.

Section titled “Step 3 — Persistent tunnel (recommended for earning)”

If you manage a domain through Cloudflare, a named tunnel gives you a stable URL:

cloudflared.exe tunnel login
cloudflared.exe tunnel create llama-tunnel
cloudflared.exe tunnel route dns llama-tunnel ai.yourdomain.com
cloudflared.exe tunnel run --url http://127.0.0.1:8080 llama-tunnel

With both llama-server.exe and cloudflared running, your Windows machine is a secure, globally accessible AI endpoint. Test it from any computer:

Terminal window
curl https://ai.yourdomain.com/v1/chat/completions ^
-H "Content-Type: application/json" ^
-H "Authorization: Bearer sk-my_super_secret_key_2026" ^
-d "{ \"model\": \"default\", \"messages\": [ { \"role\": \"user\", \"content\": \"Write a PowerShell script to list directories.\" } ] }"

Turn that endpoint into an earning node:

  1. Sign in to the Token Router dashboard with GitHub.
  2. Open Instances → Add instance.
  3. Provide:
    • Model — the model id you’re serving (llama.cpp reports the served model; you can also use the name you advertise, e.g. llama-3-8b-instruct).
    • Endpoint URL — your public tunnel URL, including /v1 (e.g. https://ai.yourdomain.com/v1).
    • Upstream API key — the sk-… key from step 4. It’s encrypted at rest and never stored in the clear.
  4. (Optional) Fill in the hardware and software inventory so the network understands your capacity.
  5. Save. Once active, your node enters rotation and receives traffic whenever it’s the least-loaded healthy candidate for that model.

Your Windows machine is now a paid member of the Token Router network. See Earn Credits for how payouts work.