Analysis updated 2026-05-18
Build a cross-language Buddhist terminology lookup tool using the character-offset span data
Train or evaluate machine translation models on classical Buddhist text alignments
Study how individual Sanskrit, Tibetan, or Chinese Buddhist terms are translated across different canonical texts
| | dharmamitra/dharmamitra-lexicon | 0marildo/imago | abdurrafey237/rag-chatbot |
|---|---|---|---|
| Stars | 3 | 3 | 3 |
| Language | — | Python | Jupyter Notebook |
| Setup difficulty | easy | easy | moderate |
| Complexity | 1/5 | 2/5 | 3/5 |
| Audience | researcher | general | general |
Figures from each repo's GitHub metadata at analysis time.
The repository contains only JSONL data files. There is no code to install, so you can read the files in your language of choice.
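As a minimal sketch of what reading the files involves, the Python snippet below parses JSONL one line at a time using only the standard library. The helper name `iter_jsonl` and the demo keys `id` and `text` are invented for illustration and are not taken from the repository; check the real files for the actual field names.

```python
import io
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines, e.g. a trailing newline
            yield json.loads(line)


# Stand-in for a real file; the keys "id" and "text" are invented for this demo.
sample = io.StringIO('{"id": 1, "text": "a"}\n\n{"id": 2, "text": "b"}\n')
records = list(iter_jsonl(sample))
print(len(records))
```

For a real file, pass the file object returned by `open(path, encoding="utf-8")` in place of `sample`. Opening as UTF-8 is the safe choice here, since the records contain Sanskrit, Tibetan, and Chinese text.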
Dharmamitra Lexicon is a dataset that maps individual words and phrases from ancient Buddhist texts in one language to their equivalent spans in another language. The three languages covered are Sanskrit, Tibetan, and Chinese, and the dataset captures all three pairings: Sanskrit to Tibetan, Sanskrit to Chinese, and Chinese to Tibetan. Each record pinpoints exactly where in the target sentence the translation of a given source word appears, down to the character position. This data powers a live lookup tool at lexicon.dharmamitra.org.
The records come from aligned parallel texts, meaning pairs of sentences where a Buddhist canonical text in one language is matched against its translation in another. To build the dataset, each source sentence was processed to identify individual word boundaries, which is especially complex in Sanskrit and Classical Chinese. A machine learning model trained for this specific task then inserted markers into the target sentence to show which stretch of text corresponds to each source word or phrase. Any proposed match that is not a literal substring of the target sentence is rejected, which keeps the dataset free of invented alignments.
Each JSONL file in the repository covers one source text paired with one target text. A file might contain, for example, all the Sanskrit-to-Tibetan span alignments for one specific Buddhist sutra. The format is plain text with one JSON object per line, so it can be read by any programming language without a special library.
The license is Creative Commons Attribution 4.0, which allows you to use and adapt the data in any project, including commercial ones, as long as you credit the Dharmamitra project.
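The literal-substring rule and the character offsets described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's actual pipeline: the function names are hypothetical, the Tibetan phrase (in Wylie transliteration) is a toy example rather than a record from the dataset, and offsets are assumed to count Unicode code points, which is how Python indexes strings. The dataset may use a different offset unit, so verify that against the files.

```python
from typing import Optional, Tuple


def locate_span(target: str, span: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first literal occurrence of span, else None."""
    if not span:
        return None
    start = target.find(span)
    if start == -1:
        return None  # not a literal substring: the proposed match is rejected
    return start, start + len(span)


def offsets_match(target: str, span: str, start: int, end: int) -> bool:
    """Check that stored offsets really slice out the stored span text."""
    return 0 <= start <= end <= len(target) and target[start:end] == span


# Toy target sentence, invented for this sketch (not taken from the dataset).
target_sentence = "sangs rgyas kyi chos"
found = locate_span(target_sentence, "chos")       # present in the target
rejected = locate_span(target_sentence, "dharma")  # absent, so rejected
print(found, rejected)
```

A check like `offsets_match` is a cheap sanity test to run over every record after loading, since a mismatch would indicate that the offset unit or the field interpretation is wrong.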
A CC BY 4.0 dataset of word-level translation spans between Sanskrit, Tibetan, and Chinese Buddhist canonical texts, with character offsets into the target sentence for each source term.
Use freely for any purpose, including commercial use, as long as you credit the Dharmamitra project.
Setup difficulty is rated easy, with roughly 5 minutes to a first successful run.
Audience: mainly researchers.
Verify against the repo before relying on details.