Parse a downloaded HTML page to extract all links, headings, or specific elements for web scraping.
Process real-world messy HTML from third-party sources that has unclosed or mismatched tags without crashing.
Stream-parse a large HTML file without loading the whole document into memory to keep resource usage low.
htmlparser2 is a JavaScript library for reading HTML and XML documents in code. When you have a string of HTML text and you want to do something with its contents, such as find specific elements, extract text, or modify the structure, you need a parser to break that string into a meaningful structure your program can work with. htmlparser2 does this job, and its main selling point is that it does it faster than most alternatives.
The library is described as "forgiving," which means it handles messy, real-world HTML that does not strictly follow the rules. Web pages in practice often have unclosed tags, mismatched nesting, and other quirks. htmlparser2 processes these without crashing or producing garbage output. If you need strict compliance with the official HTML specification, the README points toward a different library called parse5, which trades some speed for exactness.
The API works by firing callbacks as the parser reads through the document. You provide functions that run when a tag opens, when text is encountered, when a tag closes, and so on. This streaming approach keeps memory usage low because the library does not need to build a complete picture of the document before your code can start acting on it.
For cases where you do want a full document tree in memory, a companion library called domhandler converts the parser's output into a standard document object. Other libraries in the same ecosystem add CSS selector support and a jQuery-like API on top of that, with the most well-known being Cheerio.
The library is published to npm and has no runtime dependencies. It is one of the most widely downloaded npm packages because it sits beneath many other tools that need to process HTML.
Configuration options include an XML mode for stricter parsing, control over case handling for tag and attribute names, and a setting to turn off decoding of HTML character entities like the ampersand escape sequence.
Verify against the repo before relying on details.