rightnow-ai/automegakernel

★ 33 · Python


In plain English

AutoMegaKernel (AMK) is a Python toolkit that compiles the entire forward pass of a large language model into a single CUDA kernel, called a megakernel, instead of launching many separate GPU operations for each generated token. It currently works with HuggingFace Llama models.

The goal is to cut the overhead that accumulates when activations are written back to GPU memory between operations. In a standard setup, generating each token triggers many separate GPU kernel launches, one per layer operation, and intermediate results travel through high-bandwidth GPU memory (HBM) between them. A megakernel keeps the whole computation in fast on-chip memory and prefetches the next layer's weights while the current layer computes. This matters most for single-request, low-batch inference: voice assistants, real-time tools, and AI agents. AMK does not claim to beat throughput-optimized servers handling many requests at once.

The reported results show AMK's int8-quantized megakernel outperforming NVIDIA's cuBLAS library at batch-1 decode on inference-class GPUs: on the L4 and L40S, it runs 1.18 to 1.33 times faster than cuBLAS at full bf16 precision. On training-class GPUs such as the A100 and H100, AMK trails cuBLAS, and the README states this plainly. The win on inference GPUs comes from reading fewer bytes (int8 loads half the data of bf16), not from a better-performing kernel at the same precision.

A coding agent (Claude Code or Codex) drives the system through a structured interface: an MCP server, commands, and a schedule validator that checks proposed changes before they touch the GPU. The validator accepted zero unsafe schedules across 7,160 adversarial tests. A proposed change that would cause a deadlock or race condition is rejected at validation time rather than hanging the GPU. An unattended 10-minute autoresearch run improved the megakernel's performance 1.47 times over its starting schedule.
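
The summary above does not say how AMK's validator works internally. As a minimal sketch of the general idea of rejecting a deadlocking schedule before it reaches the GPU, the example below treats a schedule as a hypothetical map from each task to the tasks it waits on, then uses Python's standard `graphlib` module to detect circular waits. The function name `validate_schedule`, the task names, and the schedule representation are all assumptions made for illustration, not AMK's API. The sketch covers only circular waits and waits on undefined tasks; it makes no attempt at race detection.

```python
from graphlib import TopologicalSorter, CycleError


def validate_schedule(waits_on):
    """Check a hypothetical schedule for waits that can never be satisfied.

    waits_on maps each task name to the set of task names it must wait for.
    Returns (True, execution_order) or (False, reason).
    """
    known = set(waits_on)
    for task, deps in waits_on.items():
        # A wait on a task that is never scheduled would block forever.
        missing = set(deps) - known
        if missing:
            return False, f"{task} waits on undefined task(s): {sorted(missing)}"
    try:
        # static_order() raises CycleError if the wait graph has a cycle.
        order = list(TopologicalSorter(waits_on).static_order())
    except CycleError as err:
        # The second element of err.args lists the nodes forming the cycle.
        return False, "circular wait: " + " -> ".join(err.args[1])
    return True, order


# Illustrative task names only; they are not taken from AMK.
good = {"rmsnorm": set(), "qkv": {"rmsnorm"}, "attn": {"qkv"}}
bad = {"qkv": {"attn"}, "attn": {"qkv"}}

print(validate_schedule(good))
print(validate_schedule(bad))
```

The point of checking on the host is that a circular wait shows up as an ordinary Python return value, whereas the same mistake inside a running kernel would leave thread blocks spinning on each other indefinitely.
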
Coverage today is the Llama model family on CUDA (sm_75 through sm_120). The README notes that broadening to more model families, hardware targets, and programming languages is the central direction of future work.
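
The byte-count argument made earlier (int8 reads half the data of bf16) can be checked with back-of-the-envelope arithmetic. The sketch below assumes batch-1 decode is limited purely by reading every weight once from GPU memory, which is a simplification. The parameter count and bandwidth figure are placeholder values chosen for illustration, not measurements from the repo.

```python
def min_decode_ms(n_params, bytes_per_param, bandwidth_gb_s):
    """Lower bound on per-token latency in milliseconds, assuming every
    weight is read exactly once from GPU memory and nothing else costs time."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e3


N_PARAMS = 8e9          # hypothetical 8B-parameter model
BANDWIDTH_GB_S = 300.0  # placeholder; substitute your GPU's published bandwidth

bf16_ms = min_decode_ms(N_PARAMS, 2, BANDWIDTH_GB_S)  # bf16: 2 bytes per weight
int8_ms = min_decode_ms(N_PARAMS, 1, BANDWIDTH_GB_S)  # int8: 1 byte per weight

print(f"bf16 floor: {bf16_ms:.2f} ms/token")
print(f"int8 floor: {int8_ms:.2f} ms/token")
print(f"ideal speedup from halving bytes: {bf16_ms / int8_ms:.2f}x")
```

Under this model, halving the bytes per weight can at best double decode speed. The reported 1.18 to 1.33 times speedup sits below that ceiling, which is expected because the model ignores activation and KV-cache traffic, compute time, and synchronization.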



Verify against the repo before relying on details.