nvidia/tensorrt-llm

★ 13,627 | Python | Audience · developer | Complexity · 5/5 | Setup · hard

Mindmap

mindmap
  root((tensorrt-llm))
    What It Does
      GPU optimization
      Model compilation
      Faster inference
    Features
      Multi-GPU support
      Memory reduction
      High throughput
    Supported Models
      Language models
      Image generation
      Video generation
    Use Cases
      Production APIs
      High-volume serving
      Multi-user systems


Things people build with this

USE CASE 1

Optimize a language model to run faster on NVIDIA GPUs and reduce response latency in a production API.

USE CASE 2

Spread a large language model across multiple GPUs to handle models that would not fit on a single card.

USE CASE 3

Build a scalable AI API that serves more simultaneous user requests by using GPU memory optimizations.
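As a rough illustration of why Use Case 2 comes up, the back-of-the-envelope sketch below estimates how much GPU memory a model's weights alone would need and how many cards that implies. The model size (70 billion parameters), the 16-bit weight format, and the 80 GB card size are illustrative assumptions, not figures from the repository, and the helper names are hypothetical. The estimate ignores the key-value cache, activations, and runtime overhead, so real deployments need more headroom than this lower bound suggests.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 10^9 bytes).

    bytes_per_param=2 assumes 16-bit weights (an assumption for this sketch).
    """
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, gpu_memory_gb: float,
                         bytes_per_param: int = 2) -> int:
    """Lower bound on the number of GPUs needed just to hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


# Hypothetical example: a 70-billion-parameter model on 80 GB cards.
print(weight_memory_gb(70e9))          # 140.0 (GB of weights at 16 bits each)
print(min_gpus_for_weights(70e9, 80))  # 2 (140 GB does not fit on one 80 GB card)
```

Lower-precision weights shrink the estimate proportionally: passing `bytes_per_param=1` halves the weight footprint, which is one reason memory-reduction features matter for serving.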

Tech stack

Python, C++, CUDA, TensorRT, NVIDIA GPU

Getting it running

Difficulty · hard | Time to first run · 1 day+

Requires NVIDIA GPU hardware with CUDA and compatible drivers. CPU-only and non-NVIDIA hardware are not supported.

The license is not specified in this explanation; check the repository directly for terms.

In plain English

TensorRT-LLM is a toolkit from NVIDIA for running AI language models faster on NVIDIA graphics cards. Language models are the systems behind chatbots and text generators like GPT. Running these models quickly requires a lot of computation, and this library squeezes more performance out of the hardware by compiling the model into an optimized format before running it.

The core idea is that a language model loaded directly from a research framework is not as fast as it could be on dedicated hardware. TensorRT-LLM takes those models and applies a set of low-level optimizations specific to NVIDIA GPUs, including techniques that reduce memory usage and increase how many requests the system can handle at once. The result is faster responses and the ability to serve more users simultaneously compared to running the model without these optimizations.

The library supports a wide range of popular language models and works with multi-GPU setups, meaning you can spread a large model across several graphics cards to handle models that would not fit on one. It also supports image and video generation models, not just text. Developers interact with it using Python, and it includes examples and documentation for common use cases.

This tool is primarily aimed at engineers deploying AI models in production, such as building an API that responds to user queries. It is not a tool for training models or for casual experimentation without programming knowledge. NVIDIA publishes a series of technical blog posts linked from the repository that describe specific performance improvements and advanced configuration options for those who want to dig into the details.
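To give a feel for what "interacting with it using Python" can look like, here is a minimal sketch. It assumes the high-level `LLM` and `SamplingParams` classes from the project's quick-start documentation; the model name is a placeholder chosen for illustration, the wrapper function `generate_text` is hypothetical, and the exact arguments may differ between releases, so treat this as a starting point to check against the installed version, not as a definitive recipe. Running it requires an NVIDIA GPU with TensorRT-LLM installed.

```python
from typing import List


def generate_text(prompts: List[str],
                  model: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model (assumption)
                  tensor_parallel_size: int = 1,
                  max_tokens: int = 64) -> List[str]:
    """Hypothetical wrapper: return one generated completion per prompt."""
    if not prompts:
        return []  # nothing to do, so avoid loading the model at all

    # Imported lazily so this file can be loaded on a machine without
    # TensorRT-LLM; the import (and a GPU) is only needed for real requests.
    from tensorrt_llm import LLM, SamplingParams

    # tensor_parallel_size > 1 asks the library to shard the model across
    # that many GPUs (the multi-GPU setup described above).
    llm = LLM(model=model, tensor_parallel_size=tensor_parallel_size)
    sampling = SamplingParams(max_tokens=max_tokens)

    outputs = llm.generate(prompts, sampling)
    return [out.outputs[0].text for out in outputs]


# Example call (needs an NVIDIA GPU and TensorRT-LLM installed):
# for text in generate_text(["The capital of France is"]):
#     print(text)
```

In a real service the `LLM` object would normally be created once at startup and reused across requests, since building or loading the optimized model is the expensive step.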

Copy-paste prompts

Prompt 1

I have a Llama model and NVIDIA GPUs. Walk me through using TensorRT-LLM to compile and run it faster than a standard HuggingFace setup.

Prompt 2

Help me configure TensorRT-LLM for multi-GPU inference to split a large language model across two A100s.

Prompt 3

I want to build a Python API that serves an LLM using TensorRT-LLM. Show me a minimal example that takes a prompt and returns a response.

Prompt 4

What are the performance differences between running a language model with TensorRT-LLM versus plain PyTorch on an NVIDIA GPU?


Verify against the repo before relying on details.