explaingit

srkrz23/citadel

0 | Python | Audience · researcher | Complexity · 4/5 | Active | License | Setup · moderate

TLDR

Hackathon framework that benchmarks Gemma 4 27B against five other LLMs across MMLU-Pro, HumanEval, and a custom Epistemic Curie suite, with a signed audit chain and a Streamlit dashboard.

Mindmap

mindmap
  root((CITADEL))
    Inputs
      Model adapters
      Benchmark prompts
      Hardware backends
    Outputs
      Streamlit dashboard
      Signed audit chain
      Compliance reports
      Benchmark rankings
    Use Cases
      Compare frontier LLMs
      Run federated evals
      Generate AI Act reports
    Tech Stack
      Python
      Streamlit
      Ollama
      CUDA
      Ed25519

Things people build with this

USE CASE 1

Benchmark Gemma 4 27B against Claude Haiku, GPT-4o mini, Llama 4, Qwen3, and Mistral on shared prompts

USE CASE 2

Run an Ed25519-signed audit chain so evaluation runs are tamper-evident

USE CASE 3

Generate EU AI Act, NIST AI RMF, and ISO 42001 compliance summaries from a run

USE CASE 4

Operate a federated evaluation that shares only differentially private aggregates
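To make "differentially private aggregates" concrete, here is a minimal sketch of one common approach, the Laplace mechanism applied to a mean of bounded scores. It is not taken from the CITADEL code. The function names, the score range of 0 to 1, and the choice of the Laplace mechanism are all assumptions made for illustration, and the sensitivity used assumes the number of scores is public and neighbouring datasets differ in one score.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(scores, epsilon, rng, lower=0.0, upper=1.0):
    """Return a noisy mean of `scores` (hypothetical helper, not CITADEL's API).

    Scores are clipped to [lower, upper] so that changing one score moves the
    mean by at most (upper - lower) / n, which is the sensitivity used below.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    clipped = [min(max(s, lower), upper) for s in scores]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (upper - lower) / len(clipped)
    # Laplace mechanism: noise scale is sensitivity divided by epsilon.
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    local_scores = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
    # Only this noisy value would leave the site, never the raw scores.
    print(private_mean(local_scores, epsilon=1.0, rng=rng))
```

A smaller `epsilon` gives stronger privacy and noisier aggregates, so a federated setup along these lines has to trade ranking precision against the privacy budget.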

Tech stack

Python · Streamlit · Ollama · CUDA · Ed25519

Getting it running

Difficulty · moderate | Time to first run · 30 min

Mock mode runs anywhere, but the real-model mode needs Ollama plus a GPU backend such as AMD MI300X or CUDA.

MIT license, free to use, modify, and redistribute with attribution.

In plain English

CITADEL is a hackathon submission by a solo author in Tashkent, Uzbekistan, written for the Gemma 4 Good Hackathon in May 2026. The author frames it as Consumer Reports for AI: an open evaluation setup that lets researchers anywhere measure language models against each other using the same kind of process that large labs run internally. It compares Google's Gemma 4 27B to five other models: Claude Haiku 4.5, GPT-4o mini, Llama 4 Scout, Qwen3-35B, and Mistral-7B.

The README organises the project into thirteen numbered layers, labelled L0 through L12. Lower layers cover networking, hardware abstraction across AMD MI300X, CUDA, and CPU backends, test suites (the author's own Epistemic Curie Benchmark v2 plus MMLU-Pro and HumanEval), and model adapters, including a local Ollama path. Higher layers add a Streamlit dashboard, an Ed25519-signed audit chain whose entries are linked by SHA-256 hashes so that any tampering breaks the chain, a federated evaluation mode that shares only differentially private aggregates, an auto-generator of compliance reports for the EU AI Act, NIST AI RMF, ISO 42001, HIPAA, and PCI-DSS, and a sketched marketplace for fine-tuned models. A live demo runs on Streamlit and a three-minute video walkthrough is linked.

The README reports a real pilot run on AMD MI300X hardware on 2026-05-18. That run used Gemma 3 27B, because Gemma 4 was not yet available in Ollama at that date. Pilot numbers include 87.5% authority resistance on 8 prompts and 72.8 tokens per second. A separate mock-run benchmark table ranks Claude Haiku 4.5 first and Gemma 4 27B third.

Quick start is a git clone, a pip install of the requirements, and a Python runner that supports a mock mode and a real-model mode. The full test suite is reported as 76 of 76 passing. The code is MIT licensed and the author notes that the project is not affiliated with Google.
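The claim that tampering breaks the audit chain follows from the hash linking itself. The sketch below illustrates only that linking step, using the Python standard library. It is not the repository's implementation: the entry layout, the all-zero genesis value, and the function names are assumptions, and the Ed25519 signing step is deliberately left out (in a signed chain, each entry's hash would additionally be signed with the operator's private key).

```python
import hashlib
import json

# Assumed placeholder for the predecessor of the first entry.
GENESIS_HASH = "0" * 64


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(chain: list, payload: dict) -> None:
    """Append a new entry that commits to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    chain.append(
        {
            "prev_hash": prev_hash,
            "payload": payload,
            "hash": entry_hash(prev_hash, payload),
        }
    )


def verify_chain(chain: list) -> bool:
    """Recompute every hash in order and check each link to its predecessor."""
    prev_hash = GENESIS_HASH
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


chain = []
append_entry(chain, {"event": "run_started", "model": "mock"})
append_entry(chain, {"event": "score_recorded", "score": 0.875})
print(verify_chain(chain))  # True

# Editing an earlier payload invalidates its stored hash.
chain[0]["payload"]["model"] = "edited"
print(verify_chain(chain))  # False
```

Hash linking alone only detects edits by someone who cannot rewrite the whole chain. The signature is what stops an attacker from recomputing every hash after a change, which is presumably why the project pairs the two.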

Copy-paste prompts

Prompt 1
Help me wire a new model adapter for a locally hosted Llama 4 server into the L4 layer.
Prompt 2
Walk me through the Ed25519 audit chain and where I would add a new event type.
Prompt 3
Show me how to swap the Epistemic Curie Benchmark v2 prompts for my own evaluation set.
Prompt 4
Add a CSV export of the mock-run leaderboard alongside the Streamlit dashboard.
Prompt 5
Run the suite in mock mode on CPU and explain which layers I can disable safely.

Generated 2026-05-22 · Model: sonnet-4-6 · Verify against the repo before relying on details.