explaingit

dharmamitra/dharmamitra-lexicon

Analysis updated 2026-05-18

Stars · 3 | Audience · researcher | Complexity · 1/5 | License · CC BY 4.0 | Setup · easy

TLDR

A CC BY 4.0 dataset of word-level translation spans between Sanskrit, Tibetan, and Chinese Buddhist canonical texts, with character offsets into the target sentence for each source term.

Mindmap

mindmap
  root((repo))
    What it is
      Translation spans
      Buddhist texts
      3 language pairs
    Languages
      Sanskrit to Tibetan
      Sanskrit to Chinese
      Chinese to Tibetan
    Data Format
      JSONL files
      Character offsets
      Source and target
    How Created
      Parallel alignment
      Gemma span model
      Hallucination filter

What do people build with it?

USE CASE 1

Build a cross-language Buddhist terminology lookup tool using the character-offset span data

USE CASE 2

Train or evaluate machine translation models on classical Buddhist text alignments

USE CASE 3

Study how individual Sanskrit, Tibetan, or Chinese Buddhist terms are translated across different canonical texts
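
The first use case can be sketched as a small in-memory index. This is a minimal illustration, not the project's own tooling: the field names `source`, `target_sentence`, `start`, and `end` are hypothetical stand-ins for whatever keys the real records use, the sample values are invented for the example, and the offsets are assumed to be end-exclusive.

```python
from collections import Counter, defaultdict


def build_lookup(records):
    """Map each source term to a count of the target spans it aligns to.

    Assumes (hypothetically) that each record carries the source term,
    the full target sentence, and end-exclusive character offsets.
    """
    lookup = defaultdict(Counter)
    for record in records:
        sentence = record["target_sentence"]
        # Recover the translated span by slicing the target sentence.
        span = sentence[record["start"]:record["end"]]
        lookup[record["source"]][span] += 1
    return lookup


# Invented sample records in transliteration, for illustration only.
records = [
    {"source": "dharma", "target_sentence": "chos thams cad", "start": 0, "end": 4},
    {"source": "dharma", "target_sentence": "chos kyi sku", "start": 0, "end": 4},
    {"source": "sarva", "target_sentence": "chos thams cad", "start": 5, "end": 14},
]

lookup = build_lookup(records)
# Most frequent target rendering of the source term "dharma".
top_rendering = lookup["dharma"].most_common(1)
```

Counting spans per source term, rather than storing a single translation, keeps the variation visible when the same term is rendered differently across texts.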

What is it built with?

JSONL, Python, Gemma

How does it compare?

| | dharmamitra/dharmamitra-lexicon | 0marildo/imago | abdurrafey237/rag-chatbot |
|---|---|---|---|
| Stars | 3 | 3 | 3 |
| Language | - | Python | Jupyter Notebook |
| Setup difficulty | easy | easy | moderate |
| Complexity | 1/5 | 2/5 | 3/5 |
| Audience | researcher | general | general |

Figures from each repo's GitHub metadata at analysis time.

How do you get it running?

Difficulty · easy | Time to first run · 5 min

The repository contains only JSONL data files. There is no code to install: read the files in your language of choice.
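
As a rough sketch of what "reading the files" can look like, the snippet below parses a JSONL stream with Python's standard `json` module. The record keys (`source`, `target_sentence`, `start`, `end`) and the sample values are assumptions made for illustration; check the actual files for the real schema.

```python
import io
import json


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSONL text stream."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON") from exc


# An in-memory stand-in for a data file; keys and values are hypothetical.
sample = io.StringIO(
    '{"source": "dharma", "target_sentence": "chos thams cad", "start": 0, "end": 4}\n'
    '\n'
    '{"source": "sarva", "target_sentence": "chos thams cad", "start": 5, "end": 14}\n'
)

records = list(read_jsonl(sample))
```

With a real file, the same function can be fed an open file object, for example `open(path, encoding="utf-8")`, where `path` points at one of the repository's JSONL files.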

Use freely for any purpose, including commercial use, as long as you credit the Dharmamitra project.

In plain English

Dharmamitra Lexicon is a dataset that maps individual words and phrases from ancient Buddhist texts in one language to their equivalent spans in another language. It covers Sanskrit, Tibetan, and Chinese in three pairings: Sanskrit to Tibetan, Sanskrit to Chinese, and Chinese to Tibetan. Each record pinpoints where in the target sentence the translation of a given source word appears, down to the character position. This data powers a live lookup tool at lexicon.dharmamitra.org.

The records come from aligned parallel texts: pairs of sentences in which a passage from a Buddhist canonical text in one language is matched against its translation in another. To build the dataset, each source sentence was first segmented into individual words, which is especially difficult in Sanskrit and Classical Chinese. A machine learning model trained for this task then inserted markers into the target sentence to show which stretch of text corresponds to each source word or phrase. Any proposed match that is not a literal substring of the target sentence is rejected, so every recorded span occurs verbatim in the target text rather than being invented by the model.

Each JSONL file in the repository covers one source text paired with one target text. A file might contain, for example, all the Sanskrit-to-Tibetan span alignments for one specific Buddhist sutra. The format is plain text with one JSON object per line, so any programming language with a JSON parser can read it. The license is Creative Commons Attribution 4.0 (CC BY 4.0), which allows you to use and adapt the data in any project, including commercial ones, as long as you credit the Dharmamitra project.
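
The literal-substring rule and the character offsets described above can be illustrated with a short sketch. It is a simplification under stated assumptions, not the project's pipeline: the real process places markers inside the target sentence, whereas this example simply searches for the first occurrence of the candidate. The function names are hypothetical, offsets are assumed to be end-exclusive, and they are counted in Unicode code points (Python's string indexing), which may differ from the unit the dataset uses.

```python
def locate_span(target_sentence, candidate):
    """Return (start, end) offsets of candidate, or None if it is rejected.

    A candidate that is not a literal substring of the target sentence
    is treated as invented and dropped.
    """
    if not candidate:
        return None
    start = target_sentence.find(candidate)  # first occurrence only
    if start == -1:
        return None
    return start, start + len(candidate)


def offsets_are_consistent(target_sentence, start, end, candidate):
    """Check that stored offsets really select the stored span text."""
    in_bounds = 0 <= start < end <= len(target_sentence)
    return in_bounds and target_sentence[start:end] == candidate


# Invented example sentence in transliteration, for illustration only.
target = "chos thams cad"
accepted = locate_span(target, "thams cad")    # found in the sentence
rejected = locate_span(target, "sangs rgyas")  # absent, so rejected
```

A check like `offsets_are_consistent` is a cheap sanity test to run when loading the data, since it catches any mismatch between the offset unit assumed by your code and the one used in the files.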

Copy-paste prompts

Prompt 1
I want to look up how a specific Sanskrit Buddhist term is translated into Tibetan using the Dharmamitra lexicon. Walk me through the JSONL record format and how to query it.
Prompt 2
How was the Dharmamitra lexicon dataset created? Explain the span projection method and how it prevents hallucinated alignments.
Prompt 3
I want to build a highlighted cross-reference viewer for Buddhist texts using the start/end character offsets in this dataset. What is the data model and how do I load a file?
Prompt 4
What is the difference between a token and a phrase record type in the Dharmamitra lexicon, and how are multi-word phrases reconstructed differently for Sanskrit vs Chinese sources?

Frequently asked questions

What is dharmamitra-lexicon?

A CC BY 4.0 dataset of word-level translation spans between Sanskrit, Tibetan, and Chinese Buddhist canonical texts, with character offsets into the target sentence for each source term.

What license does dharmamitra-lexicon use?

Creative Commons Attribution 4.0 (CC BY 4.0). Use freely for any purpose, including commercial use, as long as you credit the Dharmamitra project.

How hard is dharmamitra-lexicon to set up?

Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.

Who is dharmamitra-lexicon for?

Mainly researchers.

Verify against the repo before relying on details.