Set up a fully local 35-billion-parameter AI assistant on your RTX 3060 using the exact llama.cpp commands from the guide.
Diagnose why your local AI model is running on CPU instead of GPU by applying the detection check described in the guide.
Tune the n-cpu-moe parameter and model variant choice to squeeze the best tokens-per-second out of a 12 GB GPU.
Extend the raw CSV benchmark data to compare additional models or parameter settings on similar consumer hardware.
Requires an NVIDIA RTX 3060 12GB GPU, llama.cpp built with CUDA support, and the Qwen3.6-35B-A3B model weights.
This repository, written in Traditional Chinese, is a detailed benchmark guide for running a large AI language model, Qwen3.6-35B-A3B, on a consumer graphics card with only 12 gigabytes of video memory. The card tested is the NVIDIA RTX 3060 12GB, paired with an older X99 workstation platform. The guide asks whether ordinary home hardware can run a 35-billion-parameter model at a usable speed, and the tests say yes.

The author reached around 27 tokens per second, which they describe as comfortable for daily use. Tokens are the units of text an AI model generates, so 27 per second means responses appear quickly rather than trickling in word by word. The guide gives the exact command-line parameters needed to reproduce this result with llama.cpp, a widely used program for running AI models locally.

The guide documents several lessons learned during testing. The most consequential was that one version of the llama.cpp binary silently fell back to CPU-only mode even though a GPU was present, which made generation about 2.5 times slower. A simple check command reveals whether the GPU is actually being used. The author also found that adjusting the n-cpu-moe parameter affected speed more than the parameters most people tune first. Finally, with 12 gigabytes of video memory the standard version of the model outperforms the speculative-decoding variant, because that variant's draft component consumes video memory that would otherwise hold more of the model on the GPU.

The repository contains organized notes covering the test environment, software setup, model details, commands used, analysis of results, and conclusions. Raw benchmark data in CSV format is included for anyone who wants to extend the testing or create charts.
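As a rough illustration of how the CSV data could be extended and compared, the Python sketch below reads benchmark rows and expresses each configuration's speed as a multiple of a chosen baseline. The column names (`config`, `tokens_per_second`) and the row labels are hypothetical, since the repository's actual CSV layout is not described here. The two sample rows are placeholders derived from the approximate figures above (about 27 tokens per second on GPU and a roughly 2.5 times slower CPU fallback). They are not rows copied from the repository.

```python
import csv
import io

# Placeholder data. The header names and row labels are assumptions, and the
# values only approximate the figures quoted in the summary (27 tok/s on GPU,
# about 2.5x slower on CPU). Replace this with the repository's real CSV.
SAMPLE_CSV = """config,tokens_per_second
gpu_offload,27.0
cpu_fallback,10.8
"""


def load_results(text):
    """Parse CSV text into a {config: tokens_per_second} dictionary."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["config"]: float(row["tokens_per_second"]) for row in reader}


def relative_speed(results, baseline):
    """Express each configuration's speed as a multiple of the baseline's."""
    base = results[baseline]
    return {name: tps / base for name, tps in results.items()}


results = load_results(SAMPLE_CSV)
ratios = relative_speed(results, baseline="cpu_fallback")

# Print configurations from fastest to slowest.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {results[name]:.1f} tok/s ({ratio:.2f}x baseline)")
```

To compare another model or a different n-cpu-moe setting, append a row for it and rerun the script. To read a file on disk rather than the inline sample, pass the file's contents to `load_results`.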
Author: castlen3 on gitmyhub.
Verify against the repo before relying on details.