Load a pre-trained Baidu Encyclopedia word vector set into your NLP model to represent Chinese words numerically without any training compute.
Compare different vector sets using the included CA8 benchmark to pick the best one for your Chinese text classification task.
Use news-domain vectors from People's Daily or Sogou News to fine-tune a model that needs to understand journalistic or financial Chinese.
Vector files are hosted on Baidu Netdisk and Google Drive rather than in the repository, so you must download them separately before use.
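Once a file is downloaded, loading it can be sketched as below. This is a minimal illustration, not the project's own loader: it assumes the plain text layout of one word per line followed by space-separated values, and it skips a leading "count dimension" header line if one happens to be present (an assumption borrowed from the common word2vec text convention). The function names `load_vectors` and `cosine` are hypothetical, and the three sample words use made-up 2-dimensional values purely for demonstration.

```python
import io
import math


def load_vectors(lines, limit=None):
    """Parse word-vector text lines into a {word: [float, ...]} dict.

    Assumes one entry per line: a word, then space-separated values.
    A first line of exactly two integers is treated as a
    "<count> <dimension>" header and skipped (an assumption).
    """
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue  # looks like a header, not a word entry
        if len(parts) < 2:
            continue  # blank or malformed line
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break  # stop early to keep memory use small
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Tiny in-memory stand-in for a downloaded file (values are made up).
sample = io.StringIO("3 2\n北京 1.0 0.0\n上海 0.8 0.6\n苹果 0.0 1.0\n")
vectors = load_vectors(sample)
similarity = cosine(vectors["北京"], vectors["上海"])
```

For a real file, the same function could be given an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to wherever you saved the decompressed download. Passing `limit` keeps only the first entries, which is useful for a quick look at a large file.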
This repository is a library of pre-trained word vectors for the Chinese language. A word vector file maps each word in a vocabulary to a long list of numbers that captures the word's meaning and its relationships to other words. Machine learning systems use these numerical representations to process and understand text. Rather than training vectors from scratch, which requires significant computing resources, developers can download a pre-built set and plug it into their own projects.
The project provides more than 100 different sets of word vectors, giving users choices across three dimensions. The first is the training method: either dense vectors trained with Word2Vec (a widely used algorithm) or sparse vectors trained with a different statistical approach called PPMI. The second is the type of context used during training: some sets use whole words as context, others use character fragments (useful in Chinese, where words can be broken into meaningful parts), and some combine both. The third is the source text: the vectors were trained on different datasets including Baidu Encyclopedia, Chinese Wikipedia, People's Daily News, Sogou News, financial news, Zhihu (a Q&A platform similar to Quora), Weibo (a social media platform), classical Chinese literature, and a large mixed dataset combining several sources.
The pre-trained files are in a plain text format where each line starts with a word followed by its vector values separated by spaces. Because the files are large, they are hosted on Baidu Netdisk and Google Drive rather than directly in the repository.
Alongside the vectors, the project includes a benchmark dataset called CA8, which tests how well word vectors capture analogical relationships in Chinese. An evaluation toolkit is also provided so researchers can measure and compare the quality of different vector sets.
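To make the idea of an analogy benchmark concrete, the sketch below scores a vector set on "a is to b as c is to d" questions using 3CosAdd, a common method that picks the word closest to b - a + c. It is an illustration under stated assumptions, not the project's evaluation toolkit: the function names are hypothetical, the CA8 file format is not parsed here, and the five toy vectors are made up so the example runs on its own.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned as-is)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def solve_analogy(unit, a, b, c):
    """3CosAdd: return the word whose unit vector is closest to b - a + c.

    `unit` maps words to unit-length vectors; a, b, c are excluded
    from the candidates, as is usual for analogy tests.
    """
    target = normalize([y - x + z for x, y, z in zip(unit[a], unit[b], unit[c])])
    best_word, best_score = None, -2.0
    for word, vec in unit.items():
        if word in (a, b, c):
            continue
        score = sum(p * q for p, q in zip(vec, target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, d) questions answered correctly.

    Questions containing out-of-vocabulary words are skipped.
    """
    unit = {w: normalize(v) for w, v in vectors.items()}
    correct = total = 0
    for a, b, c, d in questions:
        if any(w not in unit for w in (a, b, c, d)):
            continue
        total += 1
        if solve_analogy(unit, a, b, c) == d:
            correct += 1
    return correct / total if total else 0.0


# Made-up 3-dimensional vectors: the third axis plays the role of "capital".
toy_vectors = {
    "中国": [1.0, 0.0, 0.0],
    "北京": [1.0, 0.0, 1.0],
    "日本": [0.0, 1.0, 0.0],
    "东京": [0.0, 1.0, 1.0],
    "苹果": [-1.0, -1.0, 0.0],
}
toy_questions = [
    ("中国", "北京", "日本", "东京"),
    ("日本", "东京", "中国", "北京"),
]
accuracy = analogy_accuracy(toy_vectors, toy_questions)
```

Running the same scoring function over two different vector sets with the same question list gives a simple basis for comparing them, which is the kind of comparison the benchmark is meant to support.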
The vectors and dataset were introduced in a paper presented at ACL 2018, a major academic conference for language processing research. The repository asks users to cite that paper if they use these resources in their own work.