Collect tweets or search results and classify them as positive or negative using the built-in sentiment analysis.
Scrape a website's HTML and extract structured data using the bundled parser without extra dependencies.
Cluster a set of documents by topic using the built-in vector space model and clustering tools, or classify them with the K-Nearest Neighbors classifier.
Officially supports Python 2.7 and 3.6. Newer Python versions may have compatibility issues because the project is no longer actively maintained.
Pattern is a Python library for pulling information from the web and making sense of it. It combines four capabilities in one package: scraping content from websites and web services, analyzing the natural language in that content, running basic machine learning on the results, and visualizing how things connect to each other in a network.
The web mining part can talk to services like Google, Twitter, and Wikipedia through their APIs, and includes a general-purpose web crawler and an HTML parser for extracting structured data from pages.
The language processing part can identify parts of speech in text (for example, whether a word is a noun or an adjective), perform sentiment analysis to estimate whether a piece of text sounds positive or negative, and look up word relationships through WordNet, a database of how English words relate to each other.
The machine learning tools cover common techniques: vector space models for representing documents as numeric vectors, clustering for grouping similar items together, and classification algorithms including K-Nearest Neighbors, Support Vector Machines, and a Perceptron.
The README includes a worked example that collects tweets tagged with #win or #fail, pulls out the adjectives using the part-of-speech tagger, and trains a classifier to predict which category a new tweet belongs to.
Pattern supports both Python 2.7 and Python 3.6 and can be installed with pip. It bundles its own copies of several algorithms and data sets, so it has few external dependencies. The project comes from academic research and has an associated paper in the Journal of Machine Learning Research. It is BSD-licensed and was developed by a university research group, with contributions from many people over the years.
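The idea behind the #win/#fail example (adjectives as features, a vector space model, and a K-Nearest Neighbors vote) can be sketched with the Python standard library alone. This is a minimal illustration and not Pattern's own code or API: the names count, cosine, and TinyKNN are hypothetical, the training adjectives are made up, and the part-of-speech tagging step is skipped by starting from adjective lists that are assumed to be already extracted.

```python
import math
from collections import Counter


def count(words):
    """Bag-of-words vector: map each word to how often it occurs."""
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyKNN:
    """Hypothetical k-nearest-neighbors classifier over word-count vectors."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (vector, label) pairs

    def train(self, vector, label):
        self.examples.append((vector, label))

    def classify(self, vector):
        # Score every stored example, keep only those sharing a word
        # with the query, and let the k most similar ones vote.
        scored = [(cosine(vector, v), label) for v, label in self.examples]
        nearest = sorted(
            (s for s in scored if s[0] > 0),
            key=lambda s: s[0],
            reverse=True,
        )[: self.k]
        if not nearest:
            return None  # no overlap with any training example
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


# Made-up adjective lists standing in for tagged tweets.
training = [
    (["sweet", "great"], "WIN"),
    (["awesome", "sweet"], "WIN"),
    (["happy", "great"], "WIN"),
    (["stupid", "broken"], "FAIL"),
    (["terrible", "stupid"], "FAIL"),
    (["slow", "broken"], "FAIL"),
]

knn = TinyKNN(k=3)
for adjectives, label in training:
    knn.train(count(adjectives), label)

print(knn.classify(count(["sweet"])))
print(knn.classify(count(["stupid"])))
```

In this sketch a query that shares no adjectives with the training data returns None instead of a guess. That is a design choice of the illustration, not a statement about how Pattern behaves.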
Verify against the repo before relying on details.