Benchmark Gemma 4 27B against Claude Haiku, GPT-4o mini, Llama 4, Qwen3, and Mistral on shared prompts
Run an Ed25519-signed audit chain so evaluation runs are tamper-evident
Generate EU AI Act, NIST AI RMF, and ISO 42001 compliance summaries from a run
Operate a federated evaluation that shares only differentially private aggregates
Mock mode runs anywhere, but the real-model mode needs Ollama plus a GPU backend such as AMD MI300X or CUDA.
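The federated mode listed above is described only as sharing differentially private aggregates; the mechanism is not stated. As a rough illustration, here is a minimal sketch that assumes the Laplace mechanism applied to a mean of per-prompt scores clipped to [0, 1]. The function name dp_mean and its parameters are hypothetical and are not taken from the CITADEL code.

```python
import random


def dp_mean(scores, epsilon, rng=None):
    """Return a noisy mean of scores under the Laplace mechanism (illustrative only).

    Assumptions: each score is clipped to [0, 1], the number of scores is public,
    and neighbouring datasets differ by replacing one score.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = rng or random.Random()

    # Clip so that a single score can move the mean by at most 1/n.
    clipped = [min(1.0, max(0.0, s)) for s in scores]
    true_mean = sum(clipped) / len(clipped)

    sensitivity = 1.0 / len(clipped)
    scale = sensitivity / epsilon  # Laplace scale b = sensitivity / epsilon

    # The difference of two i.i.d. exponentials with mean b is Laplace(0, b).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_mean + noise


# Hypothetical per-prompt scores from one participating site.
noisy = dp_mean([0.9, 0.8, 1.0, 0.7], epsilon=1.0, rng=random.Random(42))
```

A smaller epsilon adds more noise and gives stronger privacy. The standard random module is used here only to keep the sketch dependency-free; a real deployment would need a vetted noise source and a privacy budget tracked across releases.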
CITADEL is a hackathon submission by a solo author in Tashkent, Uzbekistan, written for the Gemma 4 Good Hackathon in May 2026. The author frames it as Consumer Reports for AI: an open evaluation setup that lets researchers anywhere compare language models using the same kind of process that large labs run internally. It compares Google's Gemma 4 27B with five other frontier models: Claude Haiku 4.5, GPT-4o mini, Llama 4 Scout, Qwen3-35B, and Mistral-7B.

The README organises the project into thirteen numbered layers, labelled L0 through L12. The lower layers cover networking; hardware abstraction across AMD MI300X, CUDA, and CPU backends; test suites such as the author's own Epistemic Curie Benchmark v2 plus MMLU-Pro and HumanEval; and model adapters, including a local Ollama path. The higher layers add a Streamlit dashboard; an Ed25519-signed audit chain with SHA-256 hash linking, so that any tampering breaks the chain; a federated evaluation mode that shares only differentially private aggregates; an automatic generator of compliance reports for the EU AI Act, NIST AI RMF, ISO 42001, HIPAA, and PCI-DSS; and a sketched marketplace for fine-tuned models.

A live demo runs on Streamlit, and a three-minute video walkthrough is linked. The README reports a real pilot run on AMD MI300X hardware on 2026-05-18. That run used Gemma 3 27B, because Gemma 4 was not yet available in Ollama on that date. The pilot numbers include 87.5% authority resistance on 8 prompts and a throughput of 72.8 tokens per second. A separate benchmark table, produced in mock mode, ranks Claude Haiku 4.5 first and Gemma 4 27B third.

The quick start is a git clone, a pip install of the requirements, and a Python runner that supports a mock mode and a real-model mode. The full test suite is reported as 76 of 76 passing. The code is MIT licensed, and the author notes that the project is not affiliated with Google.
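The hash linking behind the audit chain described above can be illustrated with a short sketch. It shows only the SHA-256 linking step, using the Python standard library. The Ed25519 signature on each entry is left out, because it would need a third-party library. The entry layout and the names entry_hash, append_entry, and verify_chain are assumptions made for illustration, not the project's actual format.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed placeholder predecessor for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> None:
    """Append a record that commits to the hash of the record before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append({
        "prev_hash": prev_hash,
        "payload": payload,
        "hash": entry_hash(prev_hash, payload),
    })


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edited payload or reordered entry fails."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical evaluation records, not real results.
chain = []
append_entry(chain, {"run": 1, "model": "model-a", "score": 0.5})
append_entry(chain, {"run": 2, "model": "model-b", "score": 0.75})
intact = verify_chain(chain)

chain[0]["payload"]["score"] = 0.99  # simulate tampering with an old record
still_valid = verify_chain(chain)
```

Hash linking alone only detects edits by someone who cannot recompute the later hashes. Signing each entry hash with a private key, as the README's Ed25519 design does, is what would stop an attacker from rebuilding the whole chain after a change.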
Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.