explaingit

snatchev/linguist

Analysis updated 2026-07-05 · repo last pushed 2015-01-03

Ruby · Audience: developer · Complexity: 2/5 · Dormant · Setup: moderate

TLDR

Linguist is the tool GitHub uses to detect what programming languages are in your repository. It identifies languages by file extension and content, filters out third-party and generated files, and lets you manually override results.

Mindmap

mindmap
  root((repo))
    What it does
      Detects repo languages
      Powers syntax highlighting
      Filters vendored code
      Excludes generated files
    How it works
      Matches file extensions
      Applies common-sense rules
      Uses statistical classifier
    Customization
      Override via gitattributes
      Flag paths as vendored
    Community
      Open source on GitHub
      Submit pull requests
      Languages in YAML file
    Tech stack
      Ruby

What do people build with it?

USE CASE 1

Fix a repository showing the wrong primary language on GitHub.

USE CASE 2

Exclude vendored or generated files from your language statistics.

USE CASE 3

Force GitHub to count specific files as a particular language.

USE CASE 4

Identify the programming languages in a local codebase.
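The first three use cases usually come down to a few lines in a .gitattributes file at the repository root. The fragment below is a minimal sketch, assuming the `linguist-language` and `linguist-vendored` attributes that Linguist documents for overrides. Every path and pattern in it is a hypothetical placeholder, and matching follows ordinary gitattributes rules, so nested directories may need a different pattern.

```gitattributes
# Count every .h file as C++ (hypothetical choice for a C++ project)
*.h linguist-language=C++

# Flag third-party code as vendored so it is left out of language stats
third_party/* linguist-vendored

# Opt a file back in when the default vendoring rules would skip it
vendor/our_own_code.js linguist-vendored=false
```

Overrides like these only affect how files are counted and highlighted; they do not change the files themselves.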

What is it built with?

Ruby

How does it compare?

| Repo | snatchev/linguist | joshuakgoldberg/mastodon | moritzheiber/mysql |
| --- | --- | --- | --- |
| Language | Ruby | Ruby | Ruby |
| Last pushed | 2015-01-03 | 2024-05-11 | 2013-08-18 |
| Maintenance | Dormant | Dormant | Dormant |
| Setup difficulty | moderate | hard | moderate |
| Complexity | 2/5 | 4/5 | 3/5 |
| Audience | developer | ops devops | ops devops |

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · moderate · Time to first run · 30 min

Linguist is a Ruby library. Running it locally against your own repository requires a Ruby installation and some configuration.

Linguist is maintained by GitHub as open source, though the specific license is not stated in this analysis.

In plain English

Linguist is the tool GitHub uses to figure out what programming languages are in your code repository. When you look at a repo on GitHub and see a colored bar showing "70% Ruby, 20% JavaScript, 10% CSS," that breakdown is produced by this library. It also powers the syntax highlighting you see when you browse files on the site.

The detection process works in layers. Most files are identified by their extension: a .rb file is Ruby, a .py file is Python. But some extensions are ambiguous. A .h file could be C, C++, or Objective-C. For those cases, Linguist first applies some common-sense rules, then falls back to a statistical classifier that looks at the actual content of the file to make an educated guess. Beyond detection, it also filters out "noise" files, such as vendored third-party code sitting in directories like vendor/ and generated files like minified JavaScript, so they don't skew your language stats or clutter diffs.

Anyone who manages a GitHub repository benefits from this, though most people never think about it because it just works in the background. The people who interact with it directly are typically those whose repo language gets misidentified (say, a project showing up as "HTML" when it's really a JavaScript app) and who want to fix it. Linguist lets you override its defaults by adding a .gitattributes file to your project, where you can explicitly tell it which language a file should be counted as, or flag certain paths as vendored so they're excluded from stats.

The project is notable for its transparency and community-driven approach. GitHub actively encourages users to submit pull requests when a language is misdetected, and the full list of recognized languages lives in a human-readable YAML file that anyone can read and propose changes to. It's a rare example of a core platform feature being maintained as open source that the community can directly shape.
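The layered process described above (extension lookup, then common-sense rules, then a statistical classifier) can be sketched in a few lines of Ruby. This is a toy illustration, not Linguist's actual code. Every constant and method name here (`EXTENSIONS`, `HEURISTICS`, `TOKEN_HINTS`, `classify`, `detect_language`) is hypothetical, the regex rules are invented for the example, and a simple token count stands in for the real statistical classifier.

```ruby
# Hypothetical, simplified sketch of layered language detection.
# None of these names or rules are taken from Linguist itself.

# Layer 1: map each extension to its candidate languages.
EXTENSIONS = {
  ".rb" => ["Ruby"],
  ".py" => ["Python"],
  ".h"  => ["C", "C++", "Objective-C"]
}.freeze

# Layer 2: hand-written rules for ambiguous extensions (illustrative only).
HEURISTICS = {
  ".h" => ->(content) {
    if content =~ /^\s*@(interface|protocol|end)\b/
      "Objective-C"
    elsif content =~ /^\s*(class\s+\w+|namespace\s+\w+|template\s*<)/
      "C++"
    end
  }
}.freeze

# Layer 3: a crude stand-in for a statistical classifier.
# It counts how often language-specific tokens occur in the content.
TOKEN_HINTS = {
  "C"           => ["malloc", "printf", "struct", "typedef"],
  "C++"         => ["std::", "template", "namespace", "virtual"],
  "Objective-C" => ["@interface", "@property", "NSString"]
}.freeze

def classify(content, candidates)
  candidates.max_by do |lang|
    TOKEN_HINTS.fetch(lang, []).sum { |token| content.scan(token).size }
  end
end

def detect_language(path, content)
  ext = File.extname(path).downcase
  candidates = EXTENSIONS[ext]
  return nil if candidates.nil?                     # unknown extension
  return candidates.first if candidates.size == 1   # unambiguous

  rule = HEURISTICS[ext]
  guess = rule && rule.call(content)
  return guess if guess                             # a rule matched

  classify(content, candidates)                     # fall back to content
end

puts detect_language("app.rb", "puts 1")
puts detect_language("vec.h", "namespace geo {\nclass Vec {};\n}")
puts detect_language("util.h", "typedef struct { int x; } point;")
```

The point of the ordering is cost and certainty: the cheap, unambiguous check runs first, and content is only inspected when the extension alone cannot settle the question.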

Copy-paste prompts

Prompt 1
My GitHub repo is showing up as HTML but it is a JavaScript app. How do I use a .gitattributes file to fix the language detection?
Prompt 2
How do I configure Linguist to mark my vendor directory as vendored so it does not skew my repo language stats?
Prompt 3
Write a .gitattributes snippet that tells Linguist to treat all .h files in my project as C++ instead of C.
Prompt 4
How does Linguist detect languages for files with ambiguous extensions like .h, and how can I override its guess?

Frequently asked questions

What is linguist?

Linguist is the tool GitHub uses to detect what programming languages are in your repository. It identifies languages by file extension and content, filters out third-party and generated files, and lets you manually override results.

What language is linguist written in?

Ruby.

Is linguist actively maintained?

Dormant — no commits in 2+ years (last push 2015-01-03).

What license does linguist use?

Linguist is maintained by GitHub as open source, though the specific license is not stated in this analysis.

How hard is linguist to set up?

Setup difficulty is rated moderate, with roughly 30min to a first successful run.

Who is linguist for?

Mainly developers.

Verify against the repo before relying on details.