Speed up LLM inference on NVIDIA GPUs by replacing slow per-transfer calls with a single batched memory operation.
Integrate the production auto mode into your AI inference pipeline so the library handles CPU-GPU data coordination automatically.
Benchmark GPU data transfer performance on your hardware using the included benchmark examples to compare against standard CUDA methods.
Enable multi-GPU inference workloads that require fast parallel data movement from scattered RAM locations.
Requires an NVIDIA GPU with the CUDA toolkit installed and a CPU with AVX-512 support; no prebuilt binaries are provided.
GFD is a C++ library that solves a specific performance problem that comes up when running large language model inference on a machine with a dedicated graphics card. When an AI model processes a conversation, it needs to keep track of what was said earlier, and that memory often lives scattered across the system's main RAM. Moving it onto the graphics card for the next computation requires many small data transfers, and doing those one at a time is slow.

The problem is that standard graphics card transfer functions have a fixed overhead cost per call, around one to two microseconds each. When thousands of small pieces of data need to move, those overhead costs add up and the connection between the CPU and graphics card runs at only a fraction of its theoretical speed. In benchmarks using standard methods, the measured transfer speed was about 3 gigabytes per second even though the hardware is capable of much more.

GFD works around this by reorganizing who does what. Instead of sending many individual transfer requests, the library uses background CPU threads to gather the scattered pieces of data into a single contiguous block in a staging area, then sends that block to the graphics card in one large operation. It also uses a specialized CPU instruction set called AVX-512 to do the gathering work faster. The result, according to the benchmark results in the README, is transfer speeds between 14 and 53 times faster than the standard approach depending on the data size.

The library offers several modes suited to different situations: one where the CPU initiates transfers directly, one where the graphics card triggers them and then continues computing in parallel, and a higher-level mode intended for production use where developers write only the compute logic and the library handles the coordination automatically. The project includes benchmarks, example code, and multi-GPU support.
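The gather-then-send idea described above can be illustrated with a minimal host-side sketch. This is not GFD's API: `HostChunk` and `gather_to_staging` are hypothetical names chosen for illustration, plain `std::memcpy` stands in for the AVX-512 copy path, and the final GPU transfer is only indicated in a comment so the example runs without CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical descriptor for one scattered piece of host memory.
struct HostChunk {
    const unsigned char* src;  // start of the piece in main RAM
    std::size_t size;          // length of the piece in bytes
};

// Gather every chunk into one contiguous staging buffer using worker threads.
// `staging` must be large enough to hold the sum of all chunk sizes.
// Returns the total number of bytes written.
std::size_t gather_to_staging(const std::vector<HostChunk>& chunks,
                              unsigned char* staging,
                              unsigned num_threads) {
    assert(staging != nullptr);
    if (num_threads == 0) num_threads = 1;

    // A prefix sum assigns each chunk a fixed destination offset, so the
    // workers write to disjoint regions and need no locking.
    std::vector<std::size_t> offsets(chunks.size());
    std::size_t total = 0;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        offsets[i] = total;
        total += chunks[i].size;
    }

    // Each worker copies every num_threads-th chunk, starting at index t.
    auto copy_strided = [&](unsigned t) {
        for (std::size_t i = t; i < chunks.size(); i += num_threads) {
            std::memcpy(staging + offsets[i], chunks[i].src, chunks[i].size);
        }
    };

    if (num_threads == 1) {
        copy_strided(0);  // no need to spawn a thread for the serial case
    } else {
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < num_threads; ++t) {
            workers.emplace_back(copy_strided, t);
        }
        for (auto& w : workers) w.join();
    }

    // In a CUDA program, a single host-to-device copy of `total` bytes
    // (for example one cudaMemcpyAsync from a pinned staging buffer)
    // would follow here, replacing one transfer call per chunk.
    return total;
}
```

The point of the design is that the fixed per-call overhead is paid once for the whole block instead of once per piece; the cost moves to CPU-side copying, which can be spread over several threads.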
It is written for NVIDIA graphics cards using CUDA, the programming platform NVIDIA provides for GPU computing. No license is specified in the README.
Verify against the repo before relying on details.