explaingit

pwxcoo/chinese-xinhua

11,544 | Python | Audience · developer | Complexity · 1/5 | Setup · easy

TLDR

A ready-made dataset of Chinese language reference data in JSON format, covering over 31,000 idioms, 16,000 characters, 264,000 words, and 14,000 two-part riddles. Load the files directly into any project.

Mindmap

mindmap
  root((repo))
    What It Is
      Chinese language data
      JSON datasets
      Scraped reference data
    Data Files
      31648 idioms
      16142 characters
      264434 words
      14032 riddles
    Use Cases
      NLP projects
      Word games
      Language learning apps
    Tech
      Python scrapers
      Plain JSON format
    Audience
      Developers
      NLP researchers

Things people build with this

USE CASE 1

Load the idiom dataset into an NLP project or word game to get 31,648 entries with pinyin, origin, and meaning

USE CASE 2

Build a Chinese character lookup tool that shows stroke count, radical, and pronunciation for any of 16,142 characters

USE CASE 3

Seed a database for a Chinese vocabulary learning app using the 264,434-word vocabulary file
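As a rough illustration of use case 2, a character lookup can be a plain dictionary keyed by character. This is only a sketch: the field names (`word`, `strokes`, `radicals`, `pinyin`) are assumptions, and the two sample entries are hand-written stand-ins for the real file contents, so check the actual keys in the character file first.

```python
def build_char_index(entries, key="word"):
    """Map each character to its entry. `key` is an assumed field name."""
    return {entry[key]: entry for entry in entries if key in entry}


def describe(index, char):
    """Return a short summary dict for one character, or None if unknown."""
    entry = index.get(char)
    if entry is None:
        return None
    # Assumed field names; .get() keeps the sketch tolerant of missing keys.
    return {
        "strokes": entry.get("strokes"),
        "radical": entry.get("radicals"),
        "pinyin": entry.get("pinyin"),
    }


# Hand-written sample entries, not copied from the dataset.
sample = [
    {"word": "好", "strokes": "6", "radicals": "女", "pinyin": "hǎo"},
    {"word": "木", "strokes": "4", "radicals": "木", "pinyin": "mù"},
]
index = build_char_index(sample)
print(describe(index, "好"))
```

With the real data, `sample` would be replaced by the parsed contents of the character file. Building the index once keeps each lookup constant-time across all 16,142 characters.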

Tech stack

Python · JSON

Getting it running

Difficulty · easy | Time to first run · 5 min
No license is specified; the author states that the data is for non-commercial use and was scraped from public websites.

In plain English

Chinese-xinhua is a dataset repository containing Chinese language reference data in JSON format. It was assembled by one developer who scraped and cleaned data from various websites while building a Chinese idiom word game, and then published it so others would not need to repeat the same collection work.

The repository contains four data files. The idiom file holds 31,648 entries, each with the idiom text, its pronunciation in pinyin, its origin, an example sentence, and an explanation of its meaning. Idioms in Chinese are typically four-character fixed phrases rooted in historical stories or classical literature. The word file covers 16,142 individual Chinese characters, with fields for stroke count, radical (the base component used to look up a character in a dictionary), pronunciation, and a detailed explanation. The vocabulary file holds 264,434 words or common expressions with brief definitions. The xiehouyu file contains 14,032 entries of a special type of two-part Chinese riddle where the first part is a setup and the second is the punchline.

All data is stored as plain JSON arrays, making it straightforward to load into any programming language or database. The repository includes the scraping scripts used to collect the data. There is no API or application code, just the data files and collection tooling.

The author notes clearly that the repository has no commercial purpose and that all content was gathered from public websites. It is intended for developers building Chinese language tools, natural language processing projects, or language learning applications who need a ready-made structured dataset.
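Because each file is a plain JSON array, loading one into a database takes only a few lines. The sketch below reads an array and seeds an SQLite table from vocabulary entries. The field names `ci` and `explanation`, the table layout, and the sample entries are assumptions for illustration, not something the description above confirms.

```python
import json
import sqlite3


def load_json_array(path):
    """Read a dataset file and check that it holds a JSON array."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array in {path}")
    return data


def seed_words(conn, entries, word_key="ci", meaning_key="explanation"):
    """Insert entries into a `words` table. Both key names are assumed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY, meaning TEXT)"
    )
    rows = [(e[word_key], e.get(meaning_key, "")) for e in entries if word_key in e]
    # INSERT OR IGNORE keeps the first copy of a duplicated word.
    conn.executemany("INSERT OR IGNORE INTO words VALUES (?, ?)", rows)
    conn.commit()


def search_meaning(conn, text):
    """Return the words whose meaning contains `text`."""
    cur = conn.execute(
        "SELECT word FROM words WHERE meaning LIKE ? ORDER BY word",
        (f"%{text}%",),
    )
    return [row[0] for row in cur]


# Hand-written sample entries standing in for load_json_array(<vocabulary file>).
sample = [
    {"ci": "学习", "explanation": "通过阅读、听讲等获得知识"},
    {"ci": "学校", "explanation": "进行教育的机构"},
    {"ci": "学习", "explanation": "重复条目"},
]
conn = sqlite3.connect(":memory:")
seed_words(conn, sample)
print(search_meaning(conn, "教育"))
```

An in-memory database keeps the example self-contained; passing a file name to `sqlite3.connect` instead would persist the table.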

Copy-paste prompts

Prompt 1
I want to build a Chinese idiom quiz game using the chinese-xinhua dataset. Help me write a Python script that loads idiom.json, picks a random idiom, hides its explanation, and lets the user guess.
Prompt 2
Using the chinese-xinhua character data, help me build a simple Flask API endpoint that accepts a Chinese character and returns its stroke count, radical, and pronunciation.
Prompt 3
I want to import all 264,434 words from chinese-xinhua into a SQLite database so I can search and filter them by meaning. Write me the Python script to do that.
Prompt 4
Help me analyze the xiehouyu riddle dataset from chinese-xinhua: load the JSON, count how many riddles share the same punchline, and print the top 10 most common punchlines.
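For orientation, the counting step in Prompt 4 might look roughly like the sketch below. The field names `riddle` and `answer` are assumptions, and the sample entries are placeholders rather than real dataset rows.

```python
from collections import Counter


def top_punchlines(entries, n=10, answer_key="answer"):
    """Return the n most common punchlines as (punchline, count) pairs.

    `answer_key` is an assumed field name for the second half of each entry.
    """
    counts = Counter(e[answer_key] for e in entries if answer_key in e)
    return counts.most_common(n)


# Placeholder entries; real data would come from the parsed xiehouyu file.
sample = [
    {"riddle": "setup-1", "answer": "punchline-a"},
    {"riddle": "setup-2", "answer": "punchline-b"},
    {"riddle": "setup-3", "answer": "punchline-a"},
]
for punchline, count in top_punchlines(sample, n=2):
    print(punchline, count)
```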

Verify against the repo before relying on details.