explaingit

clips/pattern

8,858 · Python · Audience · data · Complexity · 2/5 · License · Setup · easy

TLDR

A Python library that combines web scraping, natural language processing, and machine learning in one package. Collect data from websites and social APIs, analyze text sentiment and grammar, and classify documents without installing many separate tools.

Mindmap

mindmap
  root((pattern))
    Web Mining
      Google and Twitter APIs
      Web crawler
      HTML parser
    Language Processing
      Part-of-speech tagging
      Sentiment analysis
      WordNet lookup
    Machine Learning
      Vector space models
      KNN and SVM classifier
      Document clustering
    Audience
      Data researchers
      NLP beginners

Things people build with this

USE CASE 1

Collect tweets or search results and classify them as positive or negative using the built-in sentiment analysis.
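To make the idea concrete, here is a minimal plain-Python sketch of lexicon-based polarity scoring, which is one common way to approach sentiment. It does not use Pattern's API. The `LEXICON` scores, the `polarity` and `label` helpers, and the sample posts are all made up for illustration.

```python
# Hypothetical mini-lexicon: word -> polarity in [-1.0, 1.0].
# These scores are invented for illustration; they are not Pattern's data.
LEXICON = {
    "great": 0.8,
    "love": 0.6,
    "sweet": 0.4,
    "awful": -0.9,
    "stupid": -0.7,
    "broken": -0.5,
}


def polarity(text):
    """Average the polarity of known words; 0.0 when none are known."""
    words = [w.strip(".,!?#").lower() for w in text.split()]
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def label(text, threshold=0.0):
    """Map a polarity score to a coarse label."""
    score = polarity(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Made-up sample posts.
posts = [
    "Love the new release, great work!",
    "Stupid autocorrect, totally broken.",
    "Shipping on Tuesday.",
]
labels = [label(p) for p in posts]
print(labels)
```

A real lexicon would be far larger and would also need to deal with negation ("not great") and intensifiers ("very"), which this sketch ignores.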

USE CASE 2

Scrape a website's HTML and extract structured data using the bundled parser without extra dependencies.
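As a rough illustration of what extracting structured data from HTML involves, the sketch below pulls link targets and link text out of a page. It uses Python's standard-library `html.parser` as a stand-in, not Pattern's bundled parser. The `LinkExtractor` class and the `PAGE` snippet are hypothetical.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Made-up page content for illustration.
PAGE = """
<ul>
  <li><a href="/docs">Documentation</a></li>
  <li><a href="/blog">Project <b>blog</b></a></li>
</ul>
"""

parser = LinkExtractor()
parser.feed(PAGE)
links = parser.links
print(links)
```

Text inside nested tags such as `<b>` is still captured, because the extractor only stops collecting when the closing `</a>` arrives.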

USE CASE 3

Cluster a set of documents by topic using the built-in vector space model and clustering tools, or classify them with the K-Nearest Neighbors classifier.
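The sketch below shows the vector space idea in plain Python: each document becomes a bag-of-words count vector, and cosine similarity measures how close two documents are. The grouping step is a simple greedy threshold pass chosen for brevity, not k-means or hierarchical clustering, and none of this is Pattern's API. The function names, the `threshold` value, and the sample documents are assumptions.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a document as a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group(docs, threshold=0.3):
    """Greedy single-pass grouping: join the first cluster whose seed
    document is similar enough, otherwise start a new cluster."""
    vectors = [vectorize(d) for d in docs]
    clusters = []  # each cluster is a list of indices into docs
    for i, v in enumerate(vectors):
        for cluster in clusters:
            if cosine(v, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Made-up documents for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock prices fell sharply today",
    "investors sold stock as prices fell",
]
clusters = group(docs)
print(clusters)
```

Raw counts let frequent words like "the" dominate the similarity, which is why real vector space models usually add a weighting scheme such as TF-IDF or remove stop words first.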

Tech stack

Python · WordNet

Getting it running

Difficulty · easy · Time to first run · 30 min

Officially supports Python 2.7 and 3.6. It may have compatibility issues with newer Python versions, as the project is no longer actively maintained.

Use freely for any purpose, including commercial use, as long as you keep the copyright notice and license text included with the package.

In plain English

Pattern is a Python library for pulling information from the web and making sense of it. It combines several different capabilities in one package: scraping content from websites and web services, analyzing the natural language in that content, running basic machine learning on the results, and visualizing how things connect to each other in a network.

The web mining part can talk to services like Google, Twitter, and Wikipedia through their APIs, and includes a general-purpose web crawler and an HTML parser for extracting structured data from pages.

The language processing part can identify parts of speech in text, such as whether a word is a noun or adjective, perform sentiment analysis to guess whether a piece of text sounds positive or negative, and look up word relationships through WordNet, which is a database of how English words relate to each other.

The machine learning tools cover common techniques: vector space models for representing documents as numbers, clustering for grouping similar items together, and classification algorithms including K-Nearest Neighbors, Support Vector Machines, and a Perceptron.

The README includes a worked example that collects tweets tagged with #win or #fail, pulls out the adjectives using the part-of-speech tagger, and trains a classifier to predict which category a new tweet belongs to.

Pattern supports both Python 2.7 and Python 3.6 and can be installed with pip. It bundles its own copies of several algorithms and data sets, so it does not have many external dependencies. The project comes from academic research and has an associated paper in the Journal of Machine Learning Research. It is BSD-licensed and was developed at a university research group, with contributions from many people over the years.
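The #win / #fail workflow described above can be sketched in plain Python to show how the pieces fit together. This is not the README's code and does not call Pattern: the `ADJECTIVES` set is a hypothetical stand-in for a real part-of-speech tagger, `TinyKNN` is an illustrative k-nearest-neighbours classifier, and the tweets are invented.

```python
import math
from collections import Counter

# Stand-in for a part-of-speech tagger: a hypothetical hand-written
# adjective list. A real tagger would decide this from context.
ADJECTIVES = {"sweet", "awesome", "happy", "stupid", "broken", "slow"}


def adjectives(text):
    """Count the known adjectives in a piece of text."""
    words = (w.strip(".,!?#").lower() for w in text.split())
    return Counter(w for w in words if w in ADJECTIVES)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyKNN:
    """k-nearest-neighbours by majority vote over cosine similarity."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (vector, label) pairs

    def train(self, vector, label):
        self.examples.append((vector, label))

    def classify(self, vector):
        # Rank training examples from most to least similar, then vote.
        # Neighbours with zero similarity still get a vote in this sketch.
        ranked = sorted(self.examples,
                        key=lambda ex: cosine(vector, ex[0]),
                        reverse=True)
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0]


# Made-up training tweets, labelled by their hashtag.
TWEETS = [
    ("Sweet new bike, so happy #win", "WIN"),
    ("Awesome gig last night #win", "WIN"),
    ("Happy with this sweet deal #win", "WIN"),
    ("Stupid autocorrect again #fail", "FAIL"),
    ("Broken screen and slow wifi #fail", "FAIL"),
    ("Slow train, stupid delays #fail", "FAIL"),
]

knn = TinyKNN(k=3)
for text, label in TWEETS:
    vector = adjectives(text)
    if vector:  # skip tweets with no known adjectives
        knn.train(vector, label)

prediction_a = knn.classify(adjectives("sweet potato burger"))
prediction_b = knn.classify(adjectives("stupid autocorrect"))
print(prediction_a, prediction_b)
```

Keeping only adjectives shrinks each tweet to the words most likely to carry opinion, so even a small training set gives the classifier something to separate on.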

Copy-paste prompts

Prompt 1
Using clips/pattern, show me how to pull tweets containing a hashtag and run sentiment analysis to count how many are positive versus negative.
Prompt 2
How do I use pattern's part-of-speech tagger to extract all adjectives from a paragraph of text in Python?
Prompt 3
Show me how to use pattern to scrape Wikipedia for a search term and extract the first paragraph of the result page.
Prompt 4
Using pattern's machine learning tools, how do I train a classifier on a set of labelled sentences and then predict the category of a new sentence?

Verify against the repo before relying on details.