zartbot/gfd

★ 11 | C++ | Audience · researcher | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((gfd))
    What it does
      Fast GPU data transfer
      Scatter-gather batching
      LLM inference speedup
    Tech stack
      C++
      CUDA
      AVX-512
      NVIDIA GPU
    Modes
      CPU-initiated
      GPU-triggered
      Production auto mode
    Results
      14 to 53x speedup
      Up to 53 GB per second
      Multi-GPU support


Things people build with this

USE CASE 1

Speed up LLM inference on NVIDIA GPUs by replacing slow per-transfer calls with a single batched memory operation.

USE CASE 2

Integrate the production auto mode into your AI inference pipeline so the library handles CPU-GPU data coordination automatically.

USE CASE 3

Benchmark GPU data transfer performance on your hardware using the included benchmark examples to compare against standard CUDA methods.

USE CASE 4

Enable multi-GPU inference workloads that require fast parallel data movement from scattered RAM locations.

Tech stack

C++, CUDA, NVIDIA GPU, AVX-512

Getting it running

Difficulty · hard | Time to first run · 1h+

Requires an NVIDIA GPU with the CUDA toolkit installed and a CPU with AVX-512 support. No prebuilt binaries are provided.
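Before attempting a build, it can help to confirm that the CPU reports AVX-512 at all. The sketch below is not part of GFD; `cpu_has_avx512f` is a hypothetical helper name, and checking only the AVX-512 Foundation subset (`avx512f`) is an assumption, since the summary does not say which AVX-512 extensions the library needs. It relies on the GCC/Clang builtin `__builtin_cpu_supports` and simply reports `false` on other compilers or non-x86 targets.

```cpp
#include <cassert>

// Hypothetical helper, not a GFD API: returns true when the running CPU
// reports the AVX-512 Foundation feature. Only meaningful on x86 with
// GCC or Clang; every other configuration falls through to `false`.
bool cpu_has_avx512f() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false;
#endif
}
```

Calling `cpu_has_avx512f()` at the start of a small test program and printing the result is enough to tell whether the machine meets this part of the requirement.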

No license is specified, so reuse terms are unclear.

In plain English

GFD is a C++ library that solves a specific performance problem in large language model inference on a machine with a dedicated graphics card. When an AI model processes a conversation, it needs to keep track of what was said earlier, and that memory often lives scattered across the system's main RAM. Moving it onto the graphics card for the next computation requires many small data transfers, and doing those one at a time is slow.

The problem is that standard graphics card transfer functions carry a fixed overhead per call, around one to two microseconds each. When thousands of small pieces of data need to move, those overheads add up and the connection between the CPU and graphics card runs at only a fraction of its theoretical speed. In benchmarks using standard methods, the measured transfer speed was about 3 gigabytes per second even though the hardware is capable of much more.

GFD works around this by reorganizing who does what. Instead of sending many individual transfer requests, the library uses background CPU threads to gather the scattered pieces of data into a single contiguous block in a staging area, then sends that block to the graphics card in one large operation. It also uses a specialized CPU instruction set called AVX-512 to do the gathering work faster. According to the benchmarks in the README, the result is transfer speeds between 14 and 53 times faster than the standard approach, depending on the data size.

The library offers several modes suited to different situations: one where the CPU initiates transfers directly, one where the graphics card triggers them and then continues computing in parallel, and a higher-level mode intended for production use where developers write only the compute logic and the library handles the coordination automatically. The project includes benchmarks, example code, and multi-GPU support. It is written for NVIDIA graphics cards using CUDA, the programming platform NVIDIA provides for GPU computing. No license is specified in the README.
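The gather-then-send idea described above can be sketched in plain C++. This is an illustration under stated assumptions, not GFD's actual code or API: `Chunk`, `gather_into_staging`, and `modeled_seconds` are hypothetical names, the gather is single-threaded and uses `std::memcpy` rather than background threads and AVX-512, and the staging buffer is an ordinary `std::vector` where a real pipeline would presumably use pinned host memory followed by one large host-to-device copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical descriptor for one scattered piece of host memory.
struct Chunk {
    const std::uint8_t* src;  // where the piece lives in RAM
    std::size_t len;          // its size in bytes
};

// Copy every chunk back-to-back into `staging` and return the bytes written.
// The caller would then issue ONE large transfer of `staging` instead of
// one transfer call per chunk.
std::size_t gather_into_staging(const std::vector<Chunk>& chunks,
                                std::vector<std::uint8_t>& staging) {
    std::size_t total = 0;
    for (const Chunk& c : chunks) total += c.len;
    staging.resize(total);

    std::size_t offset = 0;
    for (const Chunk& c : chunks) {
        // Skip empty chunks so memcpy never sees a null pointer.
        if (c.len != 0) std::memcpy(staging.data() + offset, c.src, c.len);
        offset += c.len;
    }
    return offset;
}

// Toy cost model (an assumption, not a measurement): each transfer call pays
// a fixed overhead, then the bytes move at the link bandwidth. The CPU time
// spent gathering is deliberately ignored.
double modeled_seconds(std::size_t calls, std::size_t total_bytes,
                       double overhead_s, double bandwidth_bytes_per_s) {
    return static_cast<double>(calls) * overhead_s +
           static_cast<double>(total_bytes) / bandwidth_bytes_per_s;
}
```

As a rough feel for the numbers, take 10,000 chunks of 4 KiB, 1.5 microseconds of overhead per call (the midpoint of the range quoted above), and an assumed 50 GB/s link. Under those illustrative figures the model puts per-chunk transfers at roughly 2.6 GB/s, while a single batched transfer runs close to the link rate. The chunk count, chunk size, and link rate are made up for the example and are not taken from the README.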

Copy-paste prompts

Prompt 1

I'm running LLM inference on an NVIDIA GPU and hitting slow CPU-to-GPU transfer speeds. Show me how to integrate the GFD library's production auto mode so it handles memory movement automatically.

Prompt 2

Walk me through building the GFD C++ library with CUDA support and running the included benchmarks to measure transfer speeds on my hardware.

Prompt 3

Explain how GFD's scatter-gather approach works: how do the background CPU threads collect scattered memory pieces and why does sending one large block beat many small transfers?

Prompt 4

Show me how to use GFD's GPU-triggered mode where the graphics card initiates the transfer and continues computing in parallel while data moves.


Verify against the repo before relying on details.