srkrz23/citadel

Analysis updated 2026-06-24

★ 0 | Python | Audience · researcher | Complexity · 4/5 | License | Setup · moderate

Mindmap

mindmap
  root((CITADEL))
    Inputs
      Model adapters
      Benchmark prompts
      Hardware backends
    Outputs
      Streamlit dashboard
      Signed audit chain
      Compliance reports
      Benchmark rankings
    Use Cases
      Compare frontier LLMs
      Run federated evals
      Generate AI Act reports
    Tech Stack
      Python
      Streamlit
      Ollama
      CUDA
      Ed25519

What do people build with it?

USE CASE 1

Benchmark Gemma 4 27B against Claude Haiku, GPT-4o mini, Llama 4, Qwen3, and Mistral on shared prompts

USE CASE 2

Run an Ed25519-signed audit chain so evaluation runs are tamper-evident

USE CASE 3

Generate EU AI Act, NIST AI RMF, and ISO 42001 compliance summaries from a run

USE CASE 4

Operate a federated evaluation that shares only differentially private aggregates
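The idea behind use case 4 can be illustrated with a minimal sketch. It assumes the Laplace mechanism applied to a mean of scores clipped to [0, 1]; the summary does not say which mechanism CITADEL actually uses, and `private_mean`, `epsilon`, and the sample scores are hypothetical names and values chosen for illustration only.

```python
import random


def private_mean(scores, epsilon, rng):
    """Return a noisy mean of scores using the Laplace mechanism (illustrative only).

    Assumes the number of scores is public and each score lies in [0, 1].
    """
    # Clip so that one participant can shift the mean by at most 1/n.
    clipped = [min(1.0, max(0.0, s)) for s in scores]
    n = len(clipped)
    sensitivity = 1.0 / n
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sum(clipped) / n + noise


# Hypothetical per-prompt scores held by one site in a federated run.
local_scores = [0.5, 1.0, 0.0, 0.5]
shared_value = private_mean(local_scores, epsilon=1.0, rng=random.Random(0))
```

Only `shared_value` would leave the site in this sketch; the raw per-prompt scores stay local. A smaller `epsilon` adds more noise and gives stronger privacy.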

What is it built with?

Python, Streamlit, Ollama, CUDA, Ed25519

How does it compare?

	srkrz23/citadel	0xhassaan/nn-from-scratch	a-little-hoof/dsr
Stars	0	0	0
Language	Python	Python	Python
Setup difficulty	moderate	moderate	hard
Complexity	4/5	4/5	5/5
Audience	researcher	developer	researcher

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate | Time to first run · 30 min

Mock mode runs anywhere, but the real-model mode needs Ollama plus a GPU backend such as AMD MI300X or CUDA.

MIT license, free to use, modify, and redistribute with attribution.

In plain English

CITADEL is a hackathon submission by a solo author in Tashkent, Uzbekistan, written for the Gemma 4 Good Hackathon in May 2026. The author frames it as Consumer Reports for AI: an open evaluation setup that lets researchers anywhere measure language models against each other using the same kind of process that large labs run internally. It compares Google's Gemma 4 27B to five other frontier models: Claude Haiku 4.5, GPT-4o mini, Llama 4 Scout, Qwen3-35B, and Mistral-7B.

The README organises the project into thirteen numbered layers, labelled L0 through L12. Lower layers cover networking, hardware abstraction across AMD MI300X, CUDA, and CPU back ends, test suites (the author's own Epistemic Curie Benchmark v2 plus MMLU-Pro and HumanEval), and model adapters, including a local Ollama path. Higher layers add a Streamlit dashboard; an Ed25519-signed audit chain with SHA-256 hash linking, so that any tampering breaks the chain; a federated evaluation mode that shares only differentially private aggregates; an automatic generator of compliance reports for the EU AI Act, NIST AI RMF, ISO 42001, HIPAA, and PCI-DSS; and a sketched marketplace for fine-tuned models.

A live demo runs on Streamlit, and a three-minute video walkthrough is linked. The README reports a real pilot run on AMD MI300X hardware on 2026-05-18 against Gemma 3 27B, used because Gemma 4 was not yet available in Ollama at that date. Pilot numbers include 87.5% authority resistance on 8 prompts and 72.8 tokens per second. A mock-run benchmark table ranks Claude Haiku 4.5 first and Gemma 4 27B third.

Quick start is a git clone, a pip install of the requirements, and a Python runner that supports a mock mode and a real-model mode. The full test suite is reported as 76 of 76 passing. The code is MIT licensed, and the author notes that the project is not affiliated with Google.
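To make the SHA-256 hash linking concrete, here is a minimal sketch of a tamper-evident chain using only the Python standard library. It is not taken from the CITADEL code: the entry layout and the names `entry_hash`, `append_entry`, and `verify_chain` are assumptions for illustration. The Ed25519 signature over each entry hash is deliberately left out; in a real chain it would be added with a cryptography library.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with this entry's payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(chain, payload):
    """Append a payload, linking it to the hash of the last entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"prev": prev_hash, "payload": payload, "hash": entry_hash(prev_hash, payload)}
    )


def verify_chain(chain):
    """Return True only if every link and every stored hash still matches."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(entry["prev"], entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical events from an evaluation run.
chain = []
append_entry(chain, {"event": "run_started", "model": "model-a"})
append_entry(chain, {"event": "score_recorded", "score": 0.875})
ok_before = verify_chain(chain)

# Editing an earlier entry invalidates its stored hash.
chain[0]["payload"]["model"] = "model-b"
ok_after = verify_chain(chain)
```

In this sketch `ok_before` is true and `ok_after` is false, because the edited payload no longer matches its stored hash. Hash linking alone does not stop someone from recomputing the whole chain, which is the gap a per-entry signature is meant to close.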

Copy-paste prompts

Prompt 1

Help me wire a new model adapter for a locally hosted Llama 4 server into the L4 layer.

Prompt 2

Walk me through the Ed25519 audit chain and where I would add a new event type.

Prompt 3

Show me how to swap the Epistemic Curie Benchmark v2 prompts for my own evaluation set.

Prompt 4

Add a CSV export of the mock-run leaderboard alongside the Streamlit dashboard.

Prompt 5

Run the suite in mock mode on CPU and explain which layers I can disable safely.

Frequently asked questions

What is citadel?

citadel is a hackathon evaluation framework that benchmarks Gemma 4 27B against five other LLMs on MMLU-Pro, HumanEval, and a custom Epistemic Curie suite, with a signed audit chain and a Streamlit dashboard.

What language is citadel written in?

Mainly Python. The stack also includes Streamlit, Ollama, CUDA, and Ed25519.

What license does citadel use?

MIT license, free to use, modify, and redistribute with attribution.

How hard is citadel to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is citadel for?

Mainly researchers.

Verify against the repo before relying on details.