Load the idiom dataset into an NLP project or word game to get 31,648 entries with pinyin, origin, and meaning
Build a Chinese character lookup tool that shows stroke count, radical, and pronunciation for any of 16,142 characters
Seed a database for a Chinese vocabulary learning app using the 264,434-word vocabulary file
Chinese-xinhua is a dataset repository containing Chinese language reference data in JSON format. It was assembled by one developer who scraped and cleaned data from various websites while building a Chinese idiom word game, and then published it so others would not need to repeat the same collection work.

The repository contains four data files. The idiom file holds 31,648 entries, each with the idiom text, its pronunciation in pinyin, its origin, an example sentence, and an explanation of its meaning. Idioms in Chinese are typically four-character fixed phrases rooted in historical stories or classical literature. The word file covers 16,142 individual Chinese characters, with fields for stroke count, radical (the base component used to look up a character in a dictionary), pronunciation, and a detailed explanation. The vocabulary file holds 264,434 words or common expressions with brief definitions. The xiehouyu file contains 14,032 entries of a special type of two-part Chinese riddle where the first part is a setup and the second is the punchline.

All data is stored as plain JSON arrays, making it straightforward to load into any programming language or database. The repository includes the scraping scripts used to collect the data. There is no API or application code, just the data files and collection tooling.

The author notes clearly that the repository has no commercial purpose and that all content was gathered from public websites. It is intended for developers building Chinese language tools, natural language processing projects, or language learning applications who need a ready-made structured dataset.
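Because each file is a plain JSON array, loading one and building a lookup table takes only a few lines. The following is a minimal Python sketch, not code from the repository: the helper names `load_entries` and `build_index` are hypothetical, and the `word` key used for indexing is an assumption about the field names, which should be checked against the actual files.

```python
import json
from pathlib import Path


def load_entries(path):
    """Load one data file and return its top-level JSON array as a list."""
    with Path(path).open(encoding="utf-8") as f:
        entries = json.load(f)
    if not isinstance(entries, list):
        raise ValueError(f"expected a JSON array in {path}")
    return entries


def build_index(entries, key="word"):
    """Map each entry's `key` value to the entry itself.

    `key="word"` is an assumed field name; entries that are not objects
    or that lack the key are skipped. Later duplicates overwrite earlier ones.
    """
    return {e[key]: e for e in entries if isinstance(e, dict) and key in e}
```

With a local copy of the data, something like `build_index(load_entries("idiom.json"))` would give constant-time lookup by idiom text; the file name here is likewise a placeholder. Opening the file with an explicit UTF-8 encoding avoids platform-dependent defaults when reading Chinese text.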
Verify against the repo before relying on details.