Scrape a web page and extract specific data like prices, headlines, or links using CSS selectors.
Sanitize user-submitted HTML in a forum or comment system to remove script tags and other dangerous content.
Modify HTML content programmatically: update links, change text, or restructure elements before serving it.
Parse messy HTML from legacy systems and clean it up before importing into a modern application.
jsoup is a Java library for reading, editing, and cleaning HTML and XML. If you have a web page or an HTML string and you want to pull specific information out of it, or modify its contents, jsoup gives you the tools to do that without writing low-level string manipulation code.

The library parses HTML the same way a modern web browser does, following the WHATWG HTML5 specification. This means even broken or messy HTML from real websites, the kind that has unclosed tags or unusual formatting, still produces a sensible document tree rather than an error.

You can use jsoup to fetch a web page directly from a URL, read an HTML file from disk, or parse an HTML string you already have in memory. Once the input is parsed, you can search the document using CSS selectors (the same kind used in web stylesheets) or navigate the element tree manually. You can read text, pull attribute values, find all links, or target specific sections by their ID or class name.

Editing is also supported. You can change the text or HTML inside elements, set attributes, add or remove elements, and then get back a clean HTML string as output. This makes jsoup useful for reformatting content before storing it or for building HTML programmatically.

One practical use case highlighted in the README is safety filtering. If your application accepts HTML from users, such as in a forum or comment system, jsoup can strip out tags and attributes that could be used for cross-site scripting attacks. You define which tags and attributes are allowed, and jsoup removes everything else.

jsoup is open source under the MIT license and has been actively maintained since 2009. It is added to a Java project as a single dependency declaration in Maven or Gradle. The project also documents Android support for developers building mobile applications.
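The parse, select, edit, and clean steps described above can be sketched roughly as follows. This is a minimal illustration, not code taken from the repository: the class name `JsoupSketch`, its helper methods, and the sample HTML are hypothetical, and it assumes a reasonably recent jsoup release on the classpath (one that provides `org.jsoup.safety.Safelist`; older releases used a differently named class for the same purpose).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only.
final class JsoupSketch {

    // Parse an in-memory HTML string and collect the href value of every link,
    // using a CSS attribute selector.
    static List<String> extractLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Replace the text of the first <h1>, if there is one, and return the
    // HTML of the document body after the edit.
    static String setHeadline(String html, String newText) {
        Document doc = Jsoup.parse(html);
        Element headline = doc.selectFirst("h1");
        if (headline != null) {
            headline.text(newText);
        }
        return doc.body().html();
    }

    // Keep only the tags and attributes permitted by jsoup's built-in
    // "basic" safelist; everything else, including <script>, is dropped.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Deliberately messy input: the <p> is never closed.
        String page = "<h1>Old title</h1><p>See <a href='/docs'>the docs</a>";
        System.out.println(extractLinks(page));
        System.out.println(setHeadline(page, "New title"));
        System.out.println(sanitize("<p>Hello<script>alert(1)</script></p>"));
    }
}
```

Fetching from a URL follows the same pattern: `Jsoup.connect(url).get()` returns a `Document` that can be queried the same way. It is left out here so the sketch runs without network access. The safelist passed to `sanitize` is only one possible policy; an application would choose or build the list of allowed tags to match its own needs.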
Verify against the repo before relying on details.