embedding/chinese-word-vectors

★ 12,217 | Python | Audience · researcher | Complexity · 2/5 | Setup · moderate

Mindmap

mindmap
  root((chinese-word-vectors))
    What it is
      Pre-trained vectors
      100 plus sets
    Training Methods
      Word2Vec dense
      PPMI sparse
    Context Types
      Whole words
      Character fragments
    Text Sources
      Baidu Encyclopedia
      Wikipedia Chinese
      News and social media
    Evaluation
      CA8 benchmark
      Evaluation toolkit


Things people build with this

USE CASE 1

Load a pre-trained Baidu Encyclopedia word vector set into your NLP model to represent Chinese words numerically without any training compute.

USE CASE 2

Compare different vector sets using the included CA8 benchmark to pick the best one for your Chinese text classification task.

USE CASE 3

Use news-domain vectors from People's Daily or Sogou News to fine-tune a model that needs to understand journalistic or financial Chinese.

Tech stack

Python | Word2Vec

Getting it running

Difficulty · moderate | Time to first run · 30 min

Vector files are hosted on Baidu Netdisk and Google Drive rather than in the repository, so you must download them separately before use.

In plain English

This repository is a library of pre-trained word vectors for the Chinese language. Word vectors are files where each word in a vocabulary has been converted into a long list of numbers that capture the word's meaning and its relationships to other words. Machine learning systems use these numerical representations to process and understand text. Rather than training these from scratch (which requires significant computing resources), developers can download a pre-built set and plug it into their own projects.

The project provides more than 100 different sets of word vectors, giving users choices across three dimensions. The first is the training method: either dense vectors trained with Word2Vec (a widely used algorithm) or sparse vectors trained with a different statistical approach called PPMI. The second is the type of context used during training: some sets use whole words as context, others use character fragments (useful in Chinese where words can be broken into meaningful parts), and some combine both. The third dimension is the source text: the vectors were trained on different datasets including Baidu Encyclopedia, Chinese Wikipedia, People's Daily News, Sogou News, financial news, Zhihu (a Q&A platform similar to Quora), Weibo (a social media platform), classical Chinese literature, and a large mixed dataset combining several sources.

The pre-trained files are in a plain text format where each line starts with a word followed by its vector values separated by spaces. They are hosted on Baidu Netdisk and Google Drive rather than directly in the repository, because the files are large.

Alongside the vectors, the project includes a benchmark dataset called CA8, which tests how well word vectors capture analogical relationships in Chinese. An evaluation toolkit is also provided so researchers can measure and compare the quality of different vector sets.
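The plain-text layout described above (a word, then its space-separated values) can be parsed with a few lines of Python. The sketch below is illustrative, not the project's own loader: the function name `load_vectors`, the `limit` parameter, and the sample data are made up for the example. It also assumes that a first line holding exactly two integers is a "count dimension" header, as in the common word2vec text format.

```python
import io


def load_vectors(lines, limit=None):
    """Parse 'word v1 v2 ...' lines into a dict of {word: [float, ...]}."""
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.split()
        # Assumption: a first line with exactly two integers is a
        # "vocabulary_size dimension" header and carries no vector.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        # Skip blank or malformed lines that have no vector values.
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        # Optionally stop early, since real files can be very large.
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


# Tiny in-memory sample standing in for a downloaded file (made-up values).
sample = io.StringIO("2 3\n中国 0.1 0.2 0.3\n北京 0.4 0.5 0.6\n")
vecs = load_vectors(sample)
print(len(vecs), vecs["北京"])
```

For a real download, pass an open file instead of the sample, for example `open(path, encoding="utf-8")` with `path` pointing at wherever you saved the file. If gensim is installed, `KeyedVectors.load_word2vec_format(path, binary=False)` reads the same word2vec text layout, provided the file carries the header line; check the repository to confirm that for the set you picked.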
The vectors and dataset were introduced in a paper presented at ACL 2018, a major academic conference for language processing research. The repository asks users to cite that paper if they use these resources in their own work.
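To make the idea of an analogy benchmark such as CA8 concrete, here is a minimal sketch of the widely used vector-offset method (often called 3CosAdd): for "a is to b as c is to ?", pick the word whose vector is closest in cosine similarity to b - a + c. Whether the project's evaluation toolkit scores CA8 exactly this way is an assumption to verify against the repository. The function names and the toy three-dimensional vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a : b :: c : d'."""
    # Offset target: b - a + c, computed component by component.
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors (made up): man, woman, king, queen, apple.
toy = {
    "男人": [1.0, 0.0, 0.0],
    "女人": [1.0, 1.0, 0.0],
    "国王": [1.0, 0.0, 1.0],
    "王后": [1.0, 1.0, 1.0],
    "苹果": [0.0, 0.0, -1.0],
}
answer = solve_analogy(toy, "男人", "女人", "国王")
print(answer)  # → 王后
```

An analogy benchmark then reports the fraction of questions for which the returned word matches the expected one, which gives a single accuracy number for comparing vector sets.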

Copy-paste prompts

Prompt 1

How do I load a Word2Vec file from chinese-word-vectors into Python's gensim library and find the most similar words to a given term?

Prompt 2

Which chinese-word-vectors dataset should I use for sentiment analysis on Weibo posts, and how do I integrate it with PyTorch embeddings?

Prompt 3

Show me how to use the chinese-word-vectors evaluation toolkit to benchmark and compare different vector files on the CA8 analogy dataset.

Prompt 4

How do I convert the plain-text chinese-word-vectors format into a format compatible with Hugging Face Transformers tokenizer embeddings?


Verify against the repo before relying on details.