explaingit

cs-whh/skilltrain-langfuse

Analysis updated 2026-05-18

Stars · 1 Language · Python Audience · developer Complexity · 3/5 License · MIT Setup · moderate

TLDR

A Python workflow that tests AI agent skills against a Langfuse dataset and keeps changes only when evaluator scores improve, replacing guesswork with evidence.

Mindmap

mindmap
  root((SkillTrain Langfuse))
    What it does
      Test AI skills
      Compare before and after
      Keep only improvements
    How it works
      Build dataset
      Run baseline
      Analyze failures
      Apply minimal change
      Run candidate
    Evaluators
      Rule-based checks
      LLM as judge
      Human review
    Tools used
      Python
      Langfuse
      Claude Code CLI
    Getting started
      Copy to skills dir
      Set Langfuse keys
      Use template script

What do people build with it?

USE CASE 1

Build a test dataset of edge cases for an AI agent skill and use Langfuse evaluator scores to decide which prompt changes actually improve behavior.

USE CASE 2

Run LLM-as-a-judge evaluators automatically against your agent's outputs to get scored feedback without manual review on every test case.

USE CASE 3

Use the experiment template script to compare a baseline skill run against a candidate run and track which changes to keep in version control.
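As a rough illustration of the first two use cases, the sketch below models a test case with `input` and `expected_output` fields and scores agent outputs with a simple rule-based check. `DatasetItem`, `contains_expected`, and the sample data are hypothetical names invented for this example; they are not the Langfuse SDK types or anything taken from the repo.

```python
from dataclasses import dataclass


@dataclass
class DatasetItem:
    """Hypothetical local stand-in for one test case (not a Langfuse SDK class)."""
    item_id: str
    input: str
    expected_output: str


def contains_expected(output: str, item: DatasetItem) -> float:
    """Rule-based check: 1.0 if the expected text appears in the output, else 0.0."""
    return 1.0 if item.expected_output.lower() in output.lower() else 0.0


# Two illustrative cases: a normal request and an empty-input edge case.
items = [
    DatasetItem("refund-01", "I want my money back", "refund policy"),
    DatasetItem("empty-01", "", "ask a clarifying question"),
]

# Hypothetical agent outputs keyed by item id.
outputs = {
    "refund-01": "Per our Refund Policy, you can return items within 30 days.",
    "empty-01": "Hello! How can I help?",
}

scores = {item.item_id: contains_expected(outputs[item.item_id], item) for item in items}
print(scores)
```

A substring check like this is deliberately crude. It is the kind of cheap rule that can run on every case, with LLM-as-a-judge or human review reserved for behavior a rule cannot capture.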

What is it built with?

Python, Langfuse

How does it compare?

Repo                                  Stars  Language  Setup difficulty  Complexity  Audience
cs-whh/skilltrain-langfuse            1      Python    moderate          3/5         developer
a-bissell/unleash-lite                1      Python    hard              4/5         researcher
abhiinnovates/whatsapp-hr-assistant   1      Python    hard              3/5         developer

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate Time to first run · 1h+

Requires a Langfuse account or self-hosted instance, API keys in .env, and a Python environment configured for your agent runtime.
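Before a first run it can help to confirm that the keys are actually visible to Python. The sketch below assumes the variable names used by the Langfuse SDK (`LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and optionally `LANGFUSE_HOST` for a self-hosted instance); check the repo's own setup notes for the exact names it expects. `missing_langfuse_vars` is a hypothetical helper, and the snippet assumes the `.env` file has already been loaded into the environment.

```python
import os

# Assumed names, following the Langfuse SDK convention; verify against the repo.
REQUIRED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")
OPTIONAL_VARS = ("LANGFUSE_HOST",)  # typically only needed for self-hosted setups


def missing_langfuse_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_langfuse_vars()
if missing:
    print("Missing Langfuse settings:", ", ".join(missing))
else:
    print("Langfuse keys found.")
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching real credentials.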

MIT license, meaning you can use, modify, and distribute it freely for any purpose including commercial use.

In plain English

SkillTrain Langfuse is a Python project that provides a structured workflow for improving AI agent behaviors using measurable evidence rather than guesswork. It targets developers who are building or maintaining "AI skills," which are sets of instructions or configured behaviors that tell an AI agent how to handle a particular type of task.

The core problem it addresses is that AI skills often need tuning as real users expose failures and edge cases, but changing a skill without a way to measure the effect makes it hard to know whether the change actually helped. This project pairs each round of changes with a formal test run using Langfuse, an external platform that records the inputs, outputs, and scores for each test case, so a "before" and "after" can be compared side by side.

The workflow follows a fixed loop. First, a test dataset is prepared with representative inputs and examples of expected behavior. An evaluator is configured, using rules, an AI-as-judge approach, or a human reviewer, to score each output. The current skill is run against all test cases to establish a baseline. Low-scoring cases and evaluator comments are then analyzed to understand exactly what is failing. A minimal change is made to the skill, and the same dataset and evaluator are run again. A change is kept only if scores improve without introducing new problems.

The project is structured as a single Codex skill, meaning it is designed to be invoked from within an AI coding assistant. It includes a Python experiment template script for running test batches, reference documents for the Langfuse API, and notes for using Claude Code CLI as the agent runtime being tested against the dataset. The README is written primarily in Chinese with an English version provided separately. The project is licensed under MIT.
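The keep-or-discard rule that closes the loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the repo's implementation: `should_keep` is a hypothetical function, scores are assumed to be per-item numbers where higher is better, and "new problems" is interpreted here as any item whose score drops by more than `tolerance`.

```python
from statistics import mean


def should_keep(baseline, candidate, tolerance=0.0):
    """Decide whether a candidate run beats the baseline.

    Both arguments map item id -> score (higher is better). The change is
    kept only if the mean score improves and no shared item regresses by
    more than `tolerance`. Returns (keep, regressed_item_ids).
    """
    shared = baseline.keys() & candidate.keys()
    if not shared:
        raise ValueError("baseline and candidate share no items")
    improved = mean(candidate[i] for i in shared) > mean(baseline[i] for i in shared)
    regressions = sorted(i for i in shared if candidate[i] < baseline[i] - tolerance)
    return improved and not regressions, regressions


baseline = {"a": 0.5, "b": 1.0, "c": 0.0}
clean_win = {"a": 1.0, "b": 1.0, "c": 0.5}    # better on average, nothing got worse
mixed_result = {"a": 1.0, "b": 0.5, "c": 1.0}  # better on average, but "b" regressed

print(should_keep(baseline, clean_win))
print(should_keep(baseline, mixed_result))
```

Returning the regressed item ids alongside the decision points directly at the cases worth inspecting before the next minimal change.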

Copy-paste prompts

Prompt 1
Using skilltrain-langfuse, set up a Langfuse dataset with 10 test cases for my customer-support AI skill. Show me the JSON format for each dataset item including input and expected_output fields.
Prompt 2
Walk me through adapting the eval_experiment_template.py script to call my local Python agent and record its outputs as a Langfuse experiment run.
Prompt 3
I want to use LLM-as-a-judge to score my agent's outputs in skilltrain-langfuse. Write a Langfuse evaluator prompt that checks for completeness and avoids hallucinated facts.
Prompt 4
Show me how to run the skilltrain-langfuse experiment script with --limit 1 and a specific --item-id to smoke test a single dataset case before a full run.
Prompt 5
My skilltrain-langfuse candidate run scored lower than baseline. Help me roll back the skill change in git and document what failed in the evaluator comments.
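Prompt 4 refers to `--limit` and `--item-id` options for smoke testing. How the repo's template actually parses them is not shown here, so the sketch below is only one plausible shape: `build_parser` and `select_items` are hypothetical, and treating `--item-id` as repeatable is an assumption.

```python
import argparse


def build_parser():
    """Hypothetical argument parser for an experiment runner."""
    parser = argparse.ArgumentParser(description="Run a dataset experiment (sketch)")
    parser.add_argument("--limit", type=int, default=None,
                        help="run at most this many dataset items")
    parser.add_argument("--item-id", action="append", default=None,
                        help="run only this item id (assumed repeatable)")
    return parser


def select_items(item_ids, args):
    """Filter by --item-id first, then cap the result with --limit."""
    selected = [i for i in item_ids if not args.item_id or i in args.item_id]
    return selected if args.limit is None else selected[: args.limit]


all_ids = ["case-a", "case-b", "case-c"]
smoke_args = build_parser().parse_args(["--limit", "1", "--item-id", "case-b"])
print(select_items(all_ids, smoke_args))
```

Running a single known case first keeps a misconfigured key or evaluator from consuming a full dataset run.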

Frequently asked questions

What is skilltrain-langfuse?

A Python workflow that tests AI agent skills against a Langfuse dataset and keeps changes only when evaluator scores improve, replacing guesswork with evidence.

What language is skilltrain-langfuse written in?

Mainly Python. The stack also includes Langfuse.

What license does skilltrain-langfuse use?

MIT license, meaning you can use, modify, and distribute it freely for any purpose including commercial use.

How hard is skilltrain-langfuse to set up?

Setup difficulty is rated moderate, with roughly 1h+ to a first successful run.

Who is skilltrain-langfuse for?

Mainly developers.

Verify against the repo before relying on details.