Extract all names of people, organizations, and places from a document collection and normalize dates and numbers into standard formats.
Build a search or summarization tool that understands the grammatical structure of text, not just keyword matching.
Analyze multilingual text in Arabic, Chinese, French, German, Hungarian, Italian, or Spanish for academic research.
Resolve coreference in a document to map pronouns and noun phrases back to the entity they refer to.
Language model files must be downloaded separately per language: the English models are bundled, but others require additional downloads.
Stanford CoreNLP is a Java library from Stanford University that takes raw text and extracts structured information from it. Given a sentence or a document, it tags each word's part of speech, finds each word's base form (lemma), recognizes names of people, organizations, and places, normalizes dates and numbers into standard formats, parses the grammatical structure of sentences, and determines when different phrases in the text refer to the same entity (coreference). These are building blocks for search tools, document summarizers, and other applications that need to understand language rather than just match keywords.
The toolkit was first built for English but now supports Arabic, Chinese, French, German, Hungarian, Italian, and Spanish at varying levels of depth. The underlying techniques are a mix of rule-based logic, traditional machine learning models, and newer deep learning components, depending on the task. It is widely used in academic research, commercial products, and government applications.
To use it, add the library to a Java project via Maven or by downloading the jar files directly. Language models, the trained files the library needs to do its analysis, are downloaded separately per language. Smaller English models come bundled by default; larger specialized ones are available as additional downloads or from the Hugging Face Hub. Once the models are in place, running all of the analysis tools on a piece of text takes about two lines of code.
The project is released under the GNU General Public License version 2 or later. That license permits free use and modification, but it does not allow you to incorporate the library into proprietary software you distribute to others without releasing your source code. The README covers build instructions for both Ant and Maven, model download links for each supported language, and links to the main documentation site at stanfordnlp.github.io/CoreNLP.
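As a rough illustration of the "two lines of code" setup described above, the sketch below builds the configuration that a CoreNLP pipeline reads, using only java.util.Properties so it compiles without the library. The class name CoreNlpConfigSketch and the helper buildProps are hypothetical names chosen for this example. The annotator names are the standard CoreNLP ones, and the pipeline calls are left as comments because they need the CoreNLP jar and English models on the classpath.

```java
import java.util.Properties;

public class CoreNlpConfigSketch {

    // Standard CoreNLP annotator names, listed so that each annotator
    // comes after the ones it depends on (for example, ner needs lemma,
    // and coref needs a parse).
    static final String[] ANNOTATORS = {
        "tokenize", "ssplit", "pos", "lemma", "ner", "parse", "coref"
    };

    // Hypothetical helper: builds the Properties object that is passed
    // to the pipeline constructor.
    static Properties buildProps() {
        Properties props = new Properties();
        props.setProperty("annotators", String.join(",", ANNOTATORS));
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        System.out.println(props.getProperty("annotators"));

        // With CoreNLP and its English models available, the analysis
        // itself would look roughly like this (not executed here):
        //   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        //   CoreDocument doc = new CoreDocument("Stanford is in California.");
        //   pipeline.annotate(doc);
        // After that, doc.sentences() and doc.entityMentions() expose
        // the per-sentence annotations and the recognized named entities.
    }
}
```

Trimming the annotator list to only what an application needs (for example, stopping at ner) is a common way to reduce load time and memory, since each annotator loads its own model.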
Stable releases come out several times a year; the latest development code is always available directly from the repository.
Verify against the repo before relying on details.