Analysis updated 2026-07-04 · repo last pushed 2026-06-04
Test whether synthetic patient gene data preserves real disease signals before sharing it with outside labs.
Compare different synthetic data generation methods to see which produces the most useful fake datasets.
Train a machine learning model on synthetic data and check if it can accurately classify real patients.
Generate final benchmark charts from pre-computed summary files without accessing raw patient data.
| | nanda-aditya/rna-syn-bench | 0xhassaan/nn-from-scratch | a-little-hoof/dsr |
|---|---|---|---|
| Stars | 0 | 0 | 0 |
| Language | Python | Python | Python |
| Last pushed | 2026-06-04 | — | — |
| Maintenance | Active | — | — |
| Setup difficulty | hard | moderate | hard |
| Complexity | 4/5 | 4/5 | 5/5 |
| Audience | researcher | developer | researcher |
Figures from each repo's GitHub metadata at analysis time.
Reproducing the full results requires raw patient data locked behind a managed access agreement, but the charts can be regenerated from the included summary files without it.
When medical researchers want to share patient gene expression data to speed up discoveries, they run into a major privacy hurdle: the data is so detailed that patients could potentially be re-identified. One workaround is to share "synthetic" data: fake patient cohorts generated to mirror the real ones. But how do you know whether the fake data is good enough to use? This project provides a scoring system that tests synthetic gene expression data in three areas.
First, it checks differential expression fidelity: do the fake patients show the same gene-level disease signals as the real patients? Second, it tests machine learning utility by training a predictive model on the synthetic data and checking whether it can accurately classify real, held-out patients. Third, it runs a privacy risk check to confirm that the synthetic data generator did not simply memorize and reproduce real patients, which would defeat the purpose of using fake data.
This tool is designed for computational biologists, data scientists, and researchers working with RNA-seq data (a common method for measuring gene activity). For example, a research hospital might want to share a dataset of lung cancer patients with an external lab. Before doing so, it would use this benchmark to evaluate its synthetic dataset and confirm that the external lab could still recover the correct disease-associated genes and patterns without accessing the real, sensitive patient files.
The project includes pre-computed results for four specific medical cohorts, including lung cancer, sepsis, and pediatric inflammatory bowel disease. You can reproduce all the final charts directly from the included summary files without access to the raw patient data.
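
The differential expression fidelity check described above can be illustrated with a small sketch. This is not the repository's implementation: the function names, the pseudocount, and the two summary metrics (correlation of log2 fold changes and share of genes with matching direction) are assumptions chosen for illustration, and a real analysis would normally rely on a dedicated differential expression tool rather than raw group means.

```python
import math

def log2_fold_changes(cases, controls, pseudocount=1.0):
    """Per-gene log2 fold change of mean expression, cases vs. controls."""
    n_genes = len(cases[0])
    lfc = []
    for g in range(n_genes):
        case_mean = sum(sample[g] for sample in cases) / len(cases)
        ctrl_mean = sum(sample[g] for sample in controls) / len(controls)
        lfc.append(math.log2((case_mean + pseudocount) / (ctrl_mean + pseudocount)))
    return lfc

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def de_fidelity(real_cases, real_controls, syn_cases, syn_controls):
    """Compare the disease signal (fold changes) in real vs. synthetic cohorts."""
    real_lfc = log2_fold_changes(real_cases, real_controls)
    syn_lfc = log2_fold_changes(syn_cases, syn_controls)
    same_direction = sum((r > 0) == (s > 0) for r, s in zip(real_lfc, syn_lfc))
    return {
        "lfc_correlation": pearson(real_lfc, syn_lfc),
        "sign_agreement": same_direction / len(real_lfc),
    }

# Toy data (made up): rows are patients, columns are three genes.
real_cases = [[20, 4, 9], [24, 6, 11]]
real_controls = [[10, 10, 10], [12, 10, 8]]
syn_cases = [[18, 5, 10], [22, 5, 12]]
syn_controls = [[11, 9, 9], [11, 11, 9]]

scores = de_fidelity(real_cases, real_controls, syn_cases, syn_controls)
print(scores)
```

A correlation near 1 and a sign agreement near 1 would suggest the synthetic cohort shifts the same genes in the same direction as the real one; low values would suggest the disease signal was lost during generation.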
Running the full pipeline from scratch requires the raw data, which is locked behind a managed access agreement to protect the original patients. The project benchmarks three different synthetic data generation methods, including a deep learning approach and a simpler statistical method, providing a ready-made framework to compare how different generation techniques perform.
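
To make the comparison of generation methods concrete, the sketch below scores two hypothetical generators on the machine learning utility axis by training on synthetic data and testing on real held-out patients. The nearest-centroid classifier, the generator names, and the toy expression values are all assumptions for illustration; the source does not say which model or metric the benchmark uses.

```python
import math

def fit_centroids(samples, labels):
    """Nearest-centroid model: one mean expression vector per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))

def tstr_accuracy(syn_x, syn_y, real_x, real_y):
    """Train on synthetic samples, then measure accuracy on real samples."""
    centroids = fit_centroids(syn_x, syn_y)
    hits = sum(predict(centroids, x) == y for x, y in zip(real_x, real_y))
    return hits / len(real_y)

# Real held-out patients (toy values, two genes each).
real_x = [[9, 1], [8, 2], [1, 9], [2, 8]]
real_y = ["disease", "disease", "control", "control"]

# Synthetic cohorts from two hypothetical generators.
labels = ["disease", "disease", "control", "control"]
synthetic_cohorts = {
    "generator_a": [[10, 0], [8, 2], [0, 10], [2, 8]],   # keeps class separation
    "generator_b": [[10, 0], [8, 2], [9, 1], [7, 3]],    # controls look like cases
}

results = {
    name: tstr_accuracy(syn_x, labels, real_x, real_y)
    for name, syn_x in synthetic_cohorts.items()
}
print(results)
```

The generator whose synthetic cohort yields the higher accuracy on real patients preserved more of the signal a downstream model needs, which is the kind of ranking a benchmark like this is meant to produce.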
A scoring system that tests fake (synthetic) patient gene data for real biological signals, machine learning usefulness, and privacy protection before sharing it for research.
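
The privacy dimension can be sketched as a nearest-neighbor memorization check. This is one common way to probe for copied records, not necessarily what the repository does: the function names, the tolerance, and the two reported shares are assumptions for illustration. The idea is that a generator that has not memorized its training patients should produce samples that sit no closer to the training cohort than to an unseen real holdout cohort.

```python
import math

def nearest_distance(sample, cohort):
    """Euclidean distance from one sample to its closest neighbor in a cohort."""
    return min(math.dist(sample, other) for other in cohort)

def privacy_check(synthetic, real_train, real_holdout, copy_tol=1e-9):
    """Summarize how often synthetic samples hug the training patients."""
    closer_to_train = 0
    near_copies = 0
    for sample in synthetic:
        d_train = nearest_distance(sample, real_train)
        d_holdout = nearest_distance(sample, real_holdout)
        closer_to_train += d_train < d_holdout   # bool counts as 0 or 1
        near_copies += d_train <= copy_tol       # effectively a duplicated patient
    n = len(synthetic)
    return {
        "train_closer_share": closer_to_train / n,
        "near_copy_share": near_copies / n,
    }

# Toy expression profiles (made up, two genes per patient).
real_train = [[1.0, 2.0], [5.0, 5.0], [9.0, 1.0]]
real_holdout = [[2.0, 2.0], [6.0, 4.0], [8.0, 2.0]]
synthetic = [
    [1.0, 2.0],   # exact copy of a training patient
    [5.8, 4.2],
    [8.9, 1.1],
    [2.0, 2.5],
]

report = privacy_check(synthetic, real_train, real_holdout)
print(report)
```

A `train_closer_share` well above 0.5, or any nonzero `near_copy_share`, would be a warning sign that the generator reproduces training patients rather than generalizing from them.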
Mainly Python. Related topics: RNA-seq, machine learning.
Active — commit in last 30 days (last push 2026-06-04).
Setup difficulty is rated hard, with roughly 30 minutes to a first successful run.
Mainly researchers.
Verify against the repo before relying on details.