fb55/htmlparser2

★ 4,763 | TypeScript | Audience · developer | Complexity · 2/5 | Setup · easy

Mindmap

mindmap
  root((htmlparser2))
    What it does
      Parse HTML
      Parse XML
      Handle messy markup
    How it works
      Streaming callbacks
      Low memory usage
      Few dependencies
    Ecosystem
      domhandler
      Cheerio
      CSS selector support
    Config options
      XML strict mode
      Case sensitivity
      Entity decoding


Things people build with this

USE CASE 1

Parse a downloaded HTML page to extract all links, headings, or specific elements for web scraping.

USE CASE 2

Process real-world messy HTML from third-party sources that has unclosed or mismatched tags without crashing.

USE CASE 3

Stream-parse a large HTML file without loading the whole document into memory to keep resource usage low.
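The three use cases above can be combined in one small sketch: pull every link out of deliberately messy HTML while feeding the parser in pieces. It assumes a recent htmlparser2 release installed from npm. The helper name extractLinks and the hard-coded chunks are illustrative only; in a real program the chunks would come from a file or network stream.

```typescript
import { Parser } from "htmlparser2";

// Hypothetical helper (not part of htmlparser2): collects the href of
// every <a> tag seen while the input is fed to the parser chunk by chunk.
function extractLinks(chunks: Iterable<string>): string[] {
  const found: string[] = [];
  const parser = new Parser({
    // Called once per opening tag, after its attributes have been read.
    // In the default HTML mode, tag names arrive lowercased.
    onopentag(name, attribs) {
      if (name === "a" && attribs.href !== undefined) {
        found.push(attribs.href);
      }
    },
  });
  for (const chunk of chunks) {
    parser.write(chunk); // only the current chunk needs to be in memory
  }
  parser.end(); // signal end of input so pending data is flushed
  return found;
}

// Messy input: <p> and <b> are never closed, and the second href value
// is split across two chunks.
const links = extractLinks([
  '<p>Unclosed paragraph <a href="/first">one<b>bold',
  '<div><a href="https://exam',
  'ple.com/second">two</a>',
]);
console.log(links); // → ["/first", "https://example.com/second"]
```

Because the handler only records what it needs, nothing resembling a document tree is ever built, which is what keeps memory usage flat for large inputs.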

Tech stack

TypeScript, JavaScript, Node.js

Getting it running

Difficulty · easy | Time to first run · 5 min

License information not provided in the explanation.

In plain English

htmlparser2 is a JavaScript library for reading HTML and XML documents in code. When you have a string of HTML and want to work with its contents, for example to find specific elements, extract text, or modify the structure, you need a parser to turn that string into something your program can work with. htmlparser2 does this job, and its main selling point is speed: it is faster than most alternatives.

The library is described as "forgiving," which means it handles messy, real-world HTML that does not strictly follow the rules. Web pages in practice often have unclosed tags, mismatched nesting, and other quirks, and htmlparser2 processes these without crashing or producing unusable output. If you need strict compliance with the official HTML specification, the README points toward a different library, parse5, which trades some speed for exactness.

The API works by firing callbacks as the parser reads through the document. You provide functions that run when a tag opens, when text is encountered, when a tag closes, and so on. This streaming approach keeps memory usage low because the library does not need to build a complete tree of the document before your code can start acting on it.

For cases where you do want a full document tree in memory, a companion library called domhandler collects the parser's events into a DOM-like tree of nodes. Other libraries in the same ecosystem add CSS selector support and a jQuery-like API on top of that; the best known is Cheerio.

The library is published to npm, and its only runtime dependencies are a handful of small companion packages from the same author, such as domhandler and entities. It is one of the most widely downloaded npm packages because it sits beneath many other tools that need to process HTML.

Configuration options include an XML mode for feeds and other non-HTML documents, control over whether tag and attribute names are lowercased, and a setting to turn off decoding of HTML character entities such as &amp;.
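As a rough illustration of the in-memory tree and the configuration options described above, the sketch below uses parseDocument and the DomUtils helpers that htmlparser2 re-exports. It assumes a recent htmlparser2 release installed from npm. The sample markup and variable names are made up for the example.

```typescript
import { parseDocument, DomUtils } from "htmlparser2";

// Default HTML mode: tag names are lowercased, entities are decoded, and
// the second <LI> implicitly closes the first one.
const htmlDoc = parseDocument("<UL><LI>Fish &amp; chips<LI>Tea</UL>");
const itemTexts = DomUtils.getElementsByTagName("li", htmlDoc.children).map(
  (li) => DomUtils.textContent(li),
);
console.log(itemTexts); // → ["Fish & chips", "Tea"]

// decodeEntities: false leaves character references exactly as written.
const rawDoc = parseDocument("<p>Fish &amp; chips</p>", {
  decodeEntities: false,
});
const rawText = DomUtils.textContent(rawDoc.children);
console.log(rawText); // → "Fish &amp; chips"

// xmlMode: true is meant for feeds and other non-HTML documents. Tag
// names keep their original case, so "pubDate" must be matched exactly.
const feed = parseDocument(
  "<rss><channel><item>" +
    "<title>Hello</title><pubDate>Mon, 01 Jan 2024</pubDate>" +
    "</item></channel></rss>",
  { xmlMode: true },
);
const titles = DomUtils.getElementsByTagName("title", feed.children).map(
  (node) => DomUtils.textContent(node),
);
const dates = DomUtils.getElementsByTagName("pubDate", feed.children).map(
  (node) => DomUtils.textContent(node),
);
console.log(titles, dates); // → ["Hello"] ["Mon, 01 Jan 2024"]
```

Building the tree gives up the memory advantage of the callback interface, so this style suits small and medium documents where convenient querying matters more than throughput.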

Copy-paste prompts

Prompt 1

Using htmlparser2, help me write a Node.js script that parses a downloaded HTML string and extracts all href link values.

Prompt 2

Show me how to combine htmlparser2 with domhandler to build a full DOM tree and then query it with CSS selectors via Cheerio.

Prompt 3

Help me parse an RSS feed with htmlparser2 in XML mode and pull out each article title and publication date.

Prompt 4

Show me how to stream-parse a large HTML file with htmlparser2 and count every paragraph element without loading the file fully into memory.


Verify against the repo before relying on details.