explaingit

castlen3/rtx3060-qwen3.6-35b-guide

16 · HTML · Audience · developer · Complexity · 3/5 · Setup · hard

TLDR

A benchmark guide showing how to run a 35-billion-parameter AI language model at a usable speed (about 27 tokens/sec) on a consumer NVIDIA RTX 3060 with 12 GB of video memory, with exact commands and lessons learned.

Mindmap

mindmap
  root((rtx3060-qwen3.6-35b))
    What it does
      Benchmark guide
      Local AI on 12GB GPU
      Performance tuning tips
    Hardware
      NVIDIA RTX 3060 12GB
      X99 workstation
    Software
      llama.cpp
      Qwen3.6-35B model
    Key Findings
      27 tokens per second
      n-cpu-moe matters most
      GPU fallback bug fix
    Data
      Raw CSV benchmarks
      Exact CLI commands


Things people build with this

USE CASE 1

Set up a fully local 35-billion-parameter AI assistant on your RTX 3060 using the exact llama.cpp commands from the guide.

USE CASE 2

Diagnose why your local AI model is running on CPU instead of GPU by applying the detection check described in the guide.

USE CASE 3

Tune the n-cpu-moe parameter and model variant choice to squeeze the best tokens-per-second out of a 12 GB GPU.

USE CASE 4

Extend the raw CSV benchmark data to compare additional models or parameter settings on similar consumer hardware.
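As a rough illustration of Use Cases 3 and 4, the sketch below groups benchmark rows by model and n-cpu-moe setting and reports the fastest combination. The column names (`model`, `n_cpu_moe`, `tokens_per_sec`) and every number in the sample data are hypothetical placeholders, not the repository's actual CSV schema or results; adjust them to match the real file.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical sample data: the column names and all numbers below are
# placeholders for illustration, not values taken from the repository.
SAMPLE_CSV = """model,n_cpu_moe,tokens_per_sec
model-a,10,10.0
model-a,10,12.0
model-a,20,20.0
model-a,20,22.0
model-b,20,15.0
"""


def mean_speed_by_setting(csv_text):
    """Return {(model, n_cpu_moe): mean tokens/sec} for the rows in csv_text."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["model"], int(row["n_cpu_moe"]))
        groups[key].append(float(row["tokens_per_sec"]))
    return {key: mean(values) for key, values in groups.items()}


def best_setting(csv_text):
    """Return the (model, n_cpu_moe) pair with the highest mean tokens/sec."""
    means = mean_speed_by_setting(csv_text)
    return max(means, key=means.get)


if __name__ == "__main__":
    for (model, n_cpu_moe), tps in sorted(mean_speed_by_setting(SAMPLE_CSV).items()):
        print(f"{model} n_cpu_moe={n_cpu_moe}: {tps:.1f} tok/s")
    print("best:", best_setting(SAMPLE_CSV))
```

To use it on real data, read the repository's CSV file into a string and pass it in place of `SAMPLE_CSV`, renaming the keys if the actual headers differ.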

Tech stack

llama.cpp · NVIDIA CUDA · HTML

Getting it running

Difficulty · hard · Time to first run · 1h+

Requires an NVIDIA RTX 3060 12GB GPU, llama.cpp built with CUDA support, and a local download of the Qwen3.6-35B model weights.

No license information was mentioned in the explanation.

In plain English

This repository, written in Traditional Chinese, is a detailed benchmark guide for running the Qwen3.6-35B-A3B large language model on a consumer graphics card with only 12 GB of video memory. The card tested is an NVIDIA RTX 3060 12GB in an older X99 workstation platform. The guide asks whether ordinary home hardware can run a 35-billion-parameter model at a usable speed, and the tests say yes: the author reached around 27 tokens per second, which they describe as comfortable for daily use. Tokens are the units of text a model generates, so 27 per second means responses appear quickly rather than trickling in word by word. The guide provides the exact command-line parameters to reproduce this result with llama.cpp, a widely used program for running AI models locally.

The guide documents several lessons learned during testing. The most consequential was that one llama.cpp binary was silently falling back to CPU-only mode even though a GPU was present, which made generation about 2.5 times slower. A simple check command reveals whether the GPU is actually being used. The author also found that the n-cpu-moe parameter had more impact on speed than the parameters most people tune first. Finally, on 12 GB of video memory the standard version of the model outperforms the speculative-decoding variant, because that variant's draft component consumes video memory that would otherwise hold more of the model on the GPU.

The repository contains organized notes covering the test environment, software setup, model details, commands used, analysis of results, and conclusions. Raw benchmark data in CSV format is included for anyone who wants to extend the testing or create charts.
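To make these throughput figures concrete, here is a small back-of-the-envelope sketch. The 27 tokens per second and the roughly 2.5 times CPU-fallback slowdown come from the summary above; the 500-token reply length is an assumption chosen only for illustration.

```python
def seconds_for_reply(num_tokens, tokens_per_sec):
    """Time to generate num_tokens at a steady tokens_per_sec rate."""
    return num_tokens / tokens_per_sec


GPU_TPS = 27.0            # reported speed with the GPU in use
FALLBACK_SLOWDOWN = 2.5   # reported slowdown in silent CPU-only mode
REPLY_TOKENS = 500        # assumed reply length, for illustration only

# Implied speed when the binary silently falls back to CPU-only mode.
fallback_tps = GPU_TPS / FALLBACK_SLOWDOWN

if __name__ == "__main__":
    gpu_s = seconds_for_reply(REPLY_TOKENS, GPU_TPS)
    cpu_s = seconds_for_reply(REPLY_TOKENS, fallback_tps)
    print(f"GPU:          {GPU_TPS:.1f} tok/s, {gpu_s:.1f} s per reply")
    print(f"CPU fallback: {fallback_tps:.1f} tok/s, {cpu_s:.1f} s per reply")
```

Under these assumptions a 500-token reply takes about 18.5 seconds on the GPU and about 46.3 seconds in fallback mode, which is why a speed far below the reported figure is worth checking before tuning anything else.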

Copy-paste prompts

Prompt 1
Give me the exact llama.cpp command to run Qwen3.6-35B-A3B on an RTX 3060 12GB at around 27 tokens per second, based on the settings in this benchmark guide.
Prompt 2
How do I check whether llama.cpp is actually using my GPU or silently falling back to CPU? What command should I run and what output confirms GPU usage?
Prompt 3
Why does the standard Qwen3.6-35B model outperform the speculative-decoding variant on a 12 GB GPU, and which llama.cpp flags control this?

Verify against the repo before relying on details.