opennmt/ctranslate2

★ 4,484 | C++ | Audience · developer | Complexity · 3/5 | Setup · moderate

Mindmap

mindmap
  root((CTranslate2))
    What it does
      Fast inference
      Model conversion
      Quantization
    Tech stack
      C++ core
      Python bindings
      CPU and GPU
    Use cases
      Machine translation
      Text generation
      Summarization
    Audience
      AI developers
      NLP engineers


Things people build with this

USE CASE 1

Convert a Llama, Mistral, or Gemma model and run text generation with up to 4x less memory than the standard 32-bit format.

USE CASE 2

Build a fast machine translation pipeline that works on CPU without requiring a GPU.

USE CASE 3

Speed up text summarization by swapping your standard inference framework for CTranslate2.

USE CASE 4

Run multilingual models on cheap cloud instances using 8-bit quantization to cut memory and cost.
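The memory savings in the use cases above follow from simple arithmetic on bytes per weight. The sketch below is a rough estimate only: it assumes memory is dominated by the weights and ignores quantization scale factors, activations, and runtime buffers. The 7-billion parameter count is a hypothetical example, not a figure from the project.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_parameters: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_parameters * BYTES_PER_WEIGHT[dtype] / 2**30


# Hypothetical model size, chosen only for illustration.
num_parameters = 7_000_000_000

for dtype in BYTES_PER_WEIGHT:
    gib = weight_memory_gib(num_parameters, dtype)
    print(f"{dtype:>8}: {gib:.1f} GiB")
```

Going from 32-bit to 8-bit weights divides the weight storage by four, which is where the "up to 4x" figure comes from; real savings are usually somewhat smaller once the other buffers are counted.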

Tech stack

C++, Python, CUDA

Getting it running

Difficulty · moderate | Time to first run · 30 min

Models must be converted to CTranslate2 format using the included converter before inference can begin.
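As a minimal sketch of that conversion step, the snippet below assembles a call to `ct2-transformers-converter`, the command-line converter for Hugging Face Transformers models that is installed with the Python package. The helper name `build_convert_command`, the model ID, and the output directory are illustrative assumptions; check the converter's `--help` output for the options available in your installed version.

```python
import shlex


def build_convert_command(model, output_dir, quantization=None):
    """Build the argument list for the Transformers converter CLI.

    `build_convert_command` is a hypothetical helper written for this example.
    """
    cmd = [
        "ct2-transformers-converter",
        "--model", model,
        "--output_dir", output_dir,
    ]
    if quantization:
        # Optional: store weights at reduced precision, e.g. "int8" or "float16".
        cmd += ["--quantization", quantization]
    return cmd


# Illustrative model and output directory; substitute your own.
cmd = build_convert_command(
    "Helsinki-NLP/opus-mt-en-fr", "opus-mt-en-fr-ct2", quantization="int8"
)
print(shlex.join(cmd))

# To run the conversion for real (requires ctranslate2 and transformers):
# import subprocess
# subprocess.run(cmd, check=True)
```

The printed command can also be pasted directly into a shell. Conversion is a one-time step: the output directory is then what the inference code loads.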

In plain English

CTranslate2 is a C++ library, also available as a Python package, for running AI language models faster and with less memory than general-purpose training frameworks. It does not train models. It takes an already-trained model, converts it to an optimized format, and then runs it at high speed for tasks like translation, text summarization, or text generation.

The library supports a wide range of model architectures, including models behind many translation systems, text summarizers, and open-weight language models such as Llama, Mistral, and Gemma. Compatible models need to be converted using the provided tools before they can be used. Converters are included for several popular training frameworks, so most users can bring their existing models over without writing custom conversion code.

Speed comes from several techniques applied automatically during inference: merging certain computation steps, removing padding from inputs, reordering batches to minimize wasted time, and using reduced numerical precision. The library can store and compute weights in 16-bit or 8-bit formats rather than the standard 32-bit, which shrinks model size on disk by up to 4x and often speeds up computation with minimal accuracy loss.

The library runs on both CPU and GPU and detects the best backend for the current hardware automatically. Supported CPU architectures include x86-64 and ARM64, with integrations for several math acceleration libraries. Python users can install it with pip and start translating or generating text in a few lines of code. Documentation is available at the project site, and the project is maintained with backward compatibility in mind.
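For a sense of what "a few lines of code" looks like, here is a minimal sketch of running an already-converted translation model from Python. It uses `ctranslate2.Translator` and its `translate_batch` method, which operate on lists of tokens rather than raw strings. The helper names `load_translator` and `best_hypotheses`, the model directory, and the example tokens are assumptions made for illustration; tokenization depends on the model and is not shown.

```python
def load_translator(model_dir, device="auto"):
    """Load a converted model; "auto" lets the library pick CPU or GPU."""
    # Imported lazily so the helper below can be read and tested
    # without the package installed (pip install ctranslate2).
    import ctranslate2

    return ctranslate2.Translator(model_dir, device=device)


def best_hypotheses(translator, token_batches):
    """Translate a batch of tokenized sentences and keep the top result for each."""
    results = translator.translate_batch(token_batches)
    return [result.hypotheses[0] for result in results]


# Example usage, assuming a model was converted into "opus-mt-en-fr-ct2"
# and the input was tokenized with that model's own tokenizer
# (the tokens below are illustrative):
#
# translator = load_translator("opus-mt-en-fr-ct2")
# print(best_hypotheses(translator, [["▁Hello", "▁world", "!", "</s>"]]))
```

The output is again a list of tokens per sentence, so a final detokenization step with the same tokenizer is needed to get readable text.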

Copy-paste prompts

Prompt 1

Convert my HuggingFace Llama model to CTranslate2 format and show me a Python script to run text generation with it.

Prompt 2

Write a Python script using CTranslate2 that translates a list of English sentences to French using a Helsinki-NLP MarianMT model.

Prompt 3

Show me how to benchmark CTranslate2 against the standard HuggingFace pipeline for speed and memory on the same model.

Prompt 4

How do I enable INT8 quantization in CTranslate2 and what accuracy tradeoff should I expect?

Prompt 5

Set up CTranslate2 to run on CPU with ARM64 and process translation requests in batches.


Verify against the repo before relying on details.