stas00/ml-engineering

Analysis updated 2026-06-24

★ 17,913 | Python | Audience · ops devops | Complexity · 4/5 | Setup · easy

Mindmap

mindmap
  root((ml-engineering))
    What it does
      Practical handbook
      Copy-paste solutions
      PDF and EPUB
    Topics
      Hardware accelerators
      Networking storage
      SLURM orchestration
      Training and inference
      Debugging
    Tools
      Benchmark scripts
      Comparison tables
    Audience
      ML engineers
      Cluster operators


What do people build with it?

USE CASE 1

Pick the right GPU, interconnect and storage for a large-model training cluster.

USE CASE 2

Debug a hanging distributed training job using the book's troubleshooting checklist.

USE CASE 3

Benchmark inter-node network throughput with the included Python scripts.

USE CASE 4

Set up SLURM job scripts to run multi-node LLM training reliably.
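For use case 3, the arithmetic behind an inter-node throughput number can be sketched as below. This is a minimal illustration, not one of the book's scripts: the function name and arguments are hypothetical, and the bus-bandwidth correction factor 2(n-1)/n is the convention commonly used for ring all-reduce benchmarks. You would supply the payload size and elapsed time from your own measurement.

```python
from typing import Tuple


def allreduce_bandwidth_gbps(
    payload_bytes: int, seconds: float, world_size: int
) -> Tuple[float, float]:
    """Return (algbw, busbw) in GB/s for a single all-reduce.

    Hypothetical helper for illustration only.
    algbw is the payload size divided by the elapsed time.
    busbw scales algbw by 2*(n-1)/n, the ring all-reduce factor,
    so results stay comparable across different numbers of ranks.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if world_size < 2:
        raise ValueError("an all-reduce needs at least 2 ranks")
    algbw = payload_bytes / seconds / 1e9
    busbw = algbw * 2 * (world_size - 1) / world_size
    return algbw, busbw


if __name__ == "__main__":
    # Assumed measurement: 4 GB reduced across 16 ranks in 0.5 s.
    algbw, busbw = allreduce_bandwidth_gbps(4_000_000_000, 0.5, 16)
    print(f"algbw={algbw:.1f} GB/s, busbw={busbw:.1f} GB/s")
```

Comparing the resulting busbw against the advertised link speed of the interconnect gives a rough sense of how much of the network a job is actually using.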

What is it built with?

Python, GPU, SLURM

How does it compare?

	stas00/ml-engineering	fastapi/sqlmodel	xming521/weclone
Stars	17,913	17,928	17,885
Language	Python	Python	Python
Setup difficulty	easy	easy	hard
Complexity	4/5	2/5	4/5
Audience	ops devops	developer	general

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · easy | Time to first run · 5 min

It's a reference book: just read it. Applying its advice requires real GPU cluster hardware.

License is not stated in the explanation.

In plain English

Machine Learning Engineering Open Book is a free, open-source reference guide for engineers who train, fine-tune, and run large language models (LLMs) and multimodal models (models that work with text and images together). Its author gained hands-on experience leading the training of very large open-source AI models and shares practical, copy-paste-ready solutions from those real-world projects. The book is technical and is aimed at ML engineers and infrastructure operators, not beginners to AI.

It covers a wide range of practical topics organized into chapters: hardware (GPUs and other accelerators, memory, storage, and networking between machines in a cluster), orchestration tools for managing computing resources (including SLURM, a system widely used in research clusters), guides for running model training jobs, tips for running model inference efficiently, and a substantial section on debugging and troubleshooting problems that arise during large-scale training.

Throughout the material you will find benchmark tools (Python scripts), comparison tables of accelerator performance and network speeds, and step-by-step guides for common tasks such as testing inter-node connectivity, debugging hanging distributed training jobs, and managing cluster workloads. The book is also available as downloadable PDF and EPUB files.

You would use this resource if you are responsible for training or serving large AI models on a cluster of GPUs and need practical guidance on hardware choices, network configurations, debugging, and system-level troubleshooting, rather than conceptual introductions to machine learning.
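As a small illustration of the kind of problem the debugging material addresses, one generic way to get a first clue about a hanging Python process is the standard library's `faulthandler` module, which can dump every thread's stack trace if the process is still running after a timeout. This is a sketch under assumptions, not necessarily the method the book recommends; the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical.

```python
import faulthandler
import sys
import time


def arm_hang_watchdog(timeout_s: float, log_file=None) -> None:
    """Dump all thread stacks every `timeout_s` seconds until disarmed.

    Hypothetical helper. `log_file` must be a real file object with a
    file descriptor; it defaults to stderr.
    """
    if log_file is None:
        log_file = sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


def disarm_hang_watchdog() -> None:
    """Cancel the pending dump once the guarded section has finished."""
    faulthandler.cancel_dump_traceback_later()


if __name__ == "__main__":
    arm_hang_watchdog(timeout_s=0.5)
    time.sleep(1.2)  # stand-in for a collective call that never returns
    disarm_hang_watchdog()
```

In a multi-node job, each rank would typically write to its own log file, so that the stack traces show which ranks are stuck and where.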

Copy-paste prompts

Prompt 1

Summarize the ml-engineering chapter on GPU memory and give me a checklist for picking between A100 80GB and H100 for a 70B-parameter model.

Prompt 2

My multi-node training job hangs after a few hours. Walk me through the debugging steps from ml-engineering's troubleshooting chapter.

Prompt 3

Give me a SLURM sbatch script template (based on ml-engineering's examples) for an 8-node, 64-GPU LLM training run.

Prompt 4

Use the ml-engineering benchmarks to compare InfiniBand vs RoCE vs Ethernet for distributed training. Which should I buy first?

Prompt 5

I'm new to running LLM inference at scale. Build me a reading order through ml-engineering covering inference + networking + debugging.
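The back-of-envelope memory arithmetic behind a question like Prompt 1 can be sketched as follows. The bytes-per-parameter figures are assumptions stated in the code (2 bytes per parameter for bf16 weights only, and roughly 18 bytes per parameter for mixed-precision training with an Adam-style optimizer); the helper name is hypothetical, and activations, buffers, and fragmentation are ignored, so real requirements are higher.

```python
import math
from typing import Tuple


def min_gpus_for_states(
    n_params: int, bytes_per_param: float, gpu_mem_gb: float = 80.0
) -> Tuple[float, int]:
    """Return (total_gb, minimum GPU count) to hold the model states.

    Hypothetical helper. Uses 1 GB = 1e9 bytes and assumes the states
    can be sharded evenly across GPUs. Activation memory is not counted.
    """
    total_gb = n_params * bytes_per_param / 1e9
    return total_gb, math.ceil(total_gb / gpu_mem_gb)


if __name__ == "__main__":
    n = 70_000_000_000  # 70B parameters
    # Assumption: bf16 weights only (inference), 2 bytes per parameter.
    print("weights only:", min_gpus_for_states(n, 2))
    # Assumption: ~18 bytes per parameter for mixed-precision training
    # (low-precision weights, fp32 master copy, optimizer states, gradients).
    print("training states:", min_gpus_for_states(n, 18))
```

Because the A100 80GB and the common 80 GB H100 variant have the same memory capacity, this calculation alone does not separate them; the choice then rests on compute throughput, memory bandwidth, interconnect, and price.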

Frequently asked questions

What is ml-engineering?

A free open-source handbook for ML engineers training and serving large AI models on GPU clusters, full of practical hardware, networking and debugging advice.

What language is ml-engineering written in?

Mainly Python. The stack also includes GPU and SLURM.

What license does ml-engineering use?

License is not stated in the explanation.

How hard is ml-engineering to set up?

Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.

Who is ml-engineering for?

Mainly ops and DevOps engineers.


Verify against the repo before relying on details.