explaingit

nanda-aditya/rna-syn-bench

Analysis updated 2026-07-04 · repo last pushed 2026-06-04

Stars · 0 | Language · Python | Audience · researcher | Complexity · 4/5 | Maintenance · Active | Setup · hard

TLDR

A scoring system that tests fake (synthetic) patient gene data for real biological signals, machine learning usefulness, and privacy protection before sharing it for research.

Mindmap

mindmap
  root((repo))
    What it does
      Tests synthetic gene data
      Checks biological signals
      Measures privacy risk
    Use cases
      Share data safely
      Validate before sharing
      Compare generators
    Cohorts included
      Lung cancer
      Sepsis
      Pediatric IBD
    Methods benchmarked
      Deep learning approach
      Statistical method
    Audience
      Computational biologists
      Data scientists

What do people build with it?

USE CASE 1

Test whether synthetic patient gene data preserves real disease signals before sharing it with outside labs.

USE CASE 2

Compare different synthetic data generation methods to see which produces the most useful fake datasets.

USE CASE 3

Train a machine learning model on synthetic data and check if it can accurately classify real patients.

USE CASE 4

Generate final benchmark charts from pre-computed summary files without accessing raw patient data.
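
Use case 3 follows a pattern often called "train on synthetic, test on real". The sketch below illustrates the idea only and is not taken from the repo: the nearest-centroid classifier, the function names (`fit_centroids`, `predict`, `accuracy`), and the two-gene toy data are all assumptions chosen to keep the example dependency-free.

```python
import math
from statistics import mean


def fit_centroids(samples, labels):
    """Return one mean expression profile (centroid) per class label."""
    centroids = {}
    for label in set(labels):
        members = [s for s, l in zip(samples, labels) if l == label]
        # zip(*members) iterates gene by gene across the class members.
        centroids[label] = [mean(gene_values) for gene_values in zip(*members)]
    return centroids


def predict(centroids, sample):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))


def accuracy(centroids, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(centroids, s) == l for s, l in zip(samples, labels))
    return hits / len(samples)


# Hypothetical toy data: two genes per patient, two classes.
syn_X = [[9.0, 1.0], [8.0, 2.0], [1.0, 9.0], [2.0, 8.0]]
syn_y = ["disease", "disease", "healthy", "healthy"]
real_X = [[8.5, 1.5], [7.0, 3.0], [1.5, 8.5], [3.0, 7.0]]
real_y = ["disease", "disease", "healthy", "healthy"]

# Train only on synthetic patients, then score on real held-out patients.
tstr_model = fit_centroids(syn_X, syn_y)
tstr_accuracy = accuracy(tstr_model, real_X, real_y)
print(f"Train-on-synthetic, test-on-real accuracy: {tstr_accuracy:.2f}")
```

In practice the score is most informative next to a baseline trained on real data and evaluated on the same held-out patients, so the gap attributable to the synthetic data is visible.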

What is it built with?

Python · RNA-seq · Machine Learning · Deep Learning

How does it compare?

| | nanda-aditya/rna-syn-bench | 0xhassaan/nn-from-scratch | a-little-hoof/dsr |
|---|---|---|---|
| Stars | 0 | 0 | 0 |
| Language | Python | Python | Python |
| Last pushed | 2026-06-04 | | |
| Maintenance | Active | | |
| Setup difficulty | hard | moderate | hard |
| Complexity | 4/5 | 4/5 | 5/5 |
| Audience | researcher | developer | researcher |

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · hard | Time to first run · 30 min

Reproducing the full results requires raw patient data that is locked behind a managed access agreement. The charts, however, can be regenerated from the included summary files without the raw data.

In plain English

When medical researchers want to share patient gene expression data to speed up discoveries, they run into a major privacy hurdle: the data is so detailed that patients could potentially be re-identified. One workaround is to share "synthetic" data: fake patient cohorts generated to mirror the real ones. But how do you know whether the fake data is good enough to use? This project provides a scoring system that tests synthetic gene expression data on three fronts.

First, it checks "differential expression fidelity": do the fake patients show the same gene-level disease signals as the real patients? Second, it tests machine learning utility by training a predictive model on the synthetic data and seeing whether it can accurately classify real, held-out patients. Third, it runs a privacy risk check to confirm that the synthetic data generator didn't simply memorize and reproduce real patients, which would defeat the purpose of using fake data.

This tool is designed for computational biologists, data scientists, and researchers working with RNA-seq data (a common method for measuring gene activity). For example, a research hospital might want to share a dataset of lung cancer patients with an external lab. Before doing so, it would use this benchmark to check that the external lab could still recover the correct gene-level signals and disease patterns from the synthetic dataset without accessing the real, sensitive patient files.

The project includes pre-computed results for four specific medical cohorts, including lung cancer, sepsis, and pediatric inflammatory bowel disease. Notably, you can reproduce all the final charts directly from the included summary files without needing access to the raw patient data.
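
To make the first check concrete, here is a minimal sketch of a differential expression fidelity score. It is not the repo's code: the function names (`log2_fold_changes`, `pearson`, `de_fidelity`), the pseudocount, and the toy numbers are all assumptions for illustration. It compares per-gene log2 fold changes (cases vs. controls) between a real and a synthetic cohort, then reports their Pearson correlation and the fraction of genes whose direction of change agrees.

```python
import math
from statistics import mean


def log2_fold_changes(cases, controls, pseudocount=1.0):
    """Per-gene log2 fold change of mean expression, cases vs. controls.

    Each argument is a list of samples; each sample is a list of
    expression values in the same gene order. The pseudocount
    (an assumption) avoids division by zero for unexpressed genes.
    """
    case_means = [mean(g) for g in zip(*cases)]
    ctrl_means = [mean(g) for g in zip(*controls)]
    return [
        math.log2((c + pseudocount) / (k + pseudocount))
        for c, k in zip(case_means, ctrl_means)
    ]


def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def de_fidelity(real_cases, real_controls, syn_cases, syn_controls):
    """Compare disease-vs-control signals between real and synthetic data."""
    real_lfc = log2_fold_changes(real_cases, real_controls)
    syn_lfc = log2_fold_changes(syn_cases, syn_controls)
    same_sign = sum((r > 0) == (s > 0) for r, s in zip(real_lfc, syn_lfc))
    return {
        "lfc_correlation": pearson(real_lfc, syn_lfc),
        "sign_agreement": same_sign / len(real_lfc),
    }


# Hypothetical toy cohorts: three genes, two patients per group.
real_cases = [[100, 10, 50], [120, 12, 54]]
real_controls = [[10, 100, 50], [12, 120, 46]]
syn_cases = [[90, 14, 51], [110, 10, 55]]
syn_controls = [[14, 95, 49], [10, 105, 47]]

result = de_fidelity(real_cases, real_controls, syn_cases, syn_controls)
print(result)
```

A real analysis would normally use a dedicated differential expression method with proper normalization and significance testing; the fold-change comparison here only conveys the shape of the idea.
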
Running the full pipeline from scratch requires the raw data, which is locked behind a managed access agreement to protect the original patients.

The project benchmarks three different synthetic data generation methods, including a deep learning approach and a simpler statistical method, providing a ready-made framework for comparing how different generation techniques perform.
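
The privacy check described above asks whether a generator simply memorized real patients. One generic way to probe that is a nearest-neighbor distance comparison. The sketch below is an assumption about how such a check could look, not the repo's method: the names (`nearest_distance`, `memorization_check`), the Euclidean metric, and the toy data are hypothetical. Synthetic samples that sit much closer to the training patients than held-out real patients do are a warning sign.

```python
import math
from statistics import median


def nearest_distance(sample, reference):
    """Euclidean distance from one sample to its closest reference sample."""
    return min(math.dist(sample, ref) for ref in reference)


def memorization_check(synthetic, real_train, real_holdout):
    """Compare synthetic-to-train distances with a held-out real baseline."""
    syn_d = [nearest_distance(s, real_train) for s in synthetic]
    hold_d = [nearest_distance(h, real_train) for h in real_holdout]
    syn_med, hold_med = median(syn_d), median(hold_d)
    return {
        # Synthetic samples identical to a training patient.
        "exact_copies": sum(d == 0.0 for d in syn_d),
        # Well below 1 means synthetic data hugs the training data
        # more tightly than unseen real patients do.
        "distance_ratio": syn_med / hold_med if hold_med > 0 else float("inf"),
    }


# Hypothetical toy data: two genes per patient.
real_train = [[1.0, 2.0], [4.0, 6.0], [9.0, 1.0]]
real_holdout = [[2.0, 2.0], [4.0, 8.0]]

copied = memorization_check([[1.0, 2.0], [4.0, 6.0]], real_train, real_holdout)
novel = memorization_check([[2.0, 3.0], [5.0, 7.0]], real_train, real_holdout)
print("copied:", copied)
print("novel: ", novel)
```

Running the same check on the output of each generator gives one comparable privacy number per method, which fits the side-by-side comparison the benchmark is built for.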

Copy-paste prompts

Prompt 1
Using the rna-syn-bench framework, write a script that loads my synthetic RNA-seq dataset and runs all three benchmark evaluations: biological signal fidelity, ML utility, and privacy risk.
Prompt 2
Help me compare three synthetic RNA-seq generation methods using this benchmark's scoring system and produce a summary chart showing which method best preserves disease signals.
Prompt 3
I have synthetic and real RNA-seq data. Walk me through training a classifier on the synthetic data and evaluating it on held-out real patients using this benchmark's ML utility test.
Prompt 4
Using the included pre-computed summary files, generate the final benchmark charts for the lung cancer and sepsis cohorts without needing access to the raw patient data.

Frequently asked questions

What is rna-syn-bench?

A scoring system that tests fake (synthetic) patient gene data for real biological signals, machine learning usefulness, and privacy protection before sharing it for research.

What language is rna-syn-bench written in?

Mainly Python. The stack also includes RNA-seq, Machine Learning, and Deep Learning.

Is rna-syn-bench actively maintained?

Active: a commit in the last 30 days (last push 2026-06-04).

How hard is rna-syn-bench to set up?

Setup difficulty is rated hard, with roughly 30 minutes to a first successful run.

Who is rna-syn-bench for?

Mainly researchers.

Verify against the repo before relying on details.