explaingit

jhy/jsoup

11,365 Java Audience · developer Complexity · 2/5 License Setup · easy

TLDR

A Java library for fetching, parsing, editing, and sanitizing HTML and XML. It handles messy real-world web pages and lets you find content with the same CSS selectors you use in stylesheets.

Mindmap

mindmap
  root((repo))
    What it does
      Parse HTML XML
      CSS selector search
      Safety filtering
    Inputs
      URL fetch
      File on disk
      HTML string
    Use cases
      Web scraping
      Content sanitizing
      HTML editing
    Tech
      Java library
      Maven Gradle
      Android support

Things people build with this

USE CASE 1

Scrape a web page and extract specific data like prices, headlines, or links using CSS selectors.

USE CASE 2

Sanitize user-submitted HTML in a forum or comment system to remove script tags and other dangerous content.

USE CASE 3

Modify HTML content programmatically: update links, change text, or restructure elements before serving them.

USE CASE 4

Parse messy HTML from legacy systems and clean it up before importing into a modern application.
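
The scraping and messy-HTML use cases above can be sketched in a few lines. This is a minimal illustration rather than code from the repo: the class name `SelectorSketch`, the sample markup, and the `https://example.com/` base URI are all made up for the example, and it assumes a reasonably recent jsoup release on the classpath.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, for illustration only.
class SelectorSketch {
    // Text of every element matching the CSS query; elements with no text are skipped.
    static List<String> texts(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).eachText();
    }

    // Link targets resolved against baseUri through the "abs:" attribute prefix.
    static List<String> links(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.select("a[href]").eachAttr("abs:href");
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: unquoted attribute values, unclosed <li> and <a>.
        String html = "<ul><li class=item>Tea<li class=item>Coffee</ul><a href='/about'>About";
        System.out.println(texts(html, "li.item"));
        System.out.println(links(html, "https://example.com/"));
    }
}
```

For a live page, `Jsoup.connect(url).get()` returns the same `Document` type, so the selector calls would stay unchanged.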

Tech stack

Java · Maven · Gradle

Getting it running

Difficulty · easy Time to first run · 5min
Use freely for any purpose including commercial projects, as long as you keep the copyright notice.

In plain English

jsoup is a Java library for reading, editing, and cleaning HTML and XML. If you have a web page or an HTML string and you want to pull specific information out of it, or modify its contents, jsoup gives you the tools to do that without writing low-level string manipulation code.

The library parses HTML the same way a modern web browser does, following the WHATWG HTML5 specification. This means even broken or messy HTML from real websites, the kind that has unclosed tags or unusual formatting, will still produce a sensible result rather than an error.

You can use jsoup to fetch a web page directly from a URL, read an HTML file from disk, or parse an HTML string you already have in memory. Once parsed, you can search the document using CSS selectors (the same kind used in web stylesheets) or by navigating the tree structure of elements manually. You can read text, pull attribute values, find all links, or target specific sections by their ID or class name.

Editing is also supported. You can change the text or HTML inside elements, set attributes, add or remove elements, and then get back a clean HTML string as output. This makes jsoup useful for things like reformatting content before storing it, or building HTML programmatically.

One practical use case highlighted in the README is safety filtering. If your application accepts HTML from users, such as in a forum or comment system, jsoup can strip out tags and attributes that could be used for cross-site scripting attacks. You define which tags are allowed, and jsoup removes everything else.

jsoup is open source under the MIT license and has been actively maintained since 2009. It is added to a Java project using Maven or Gradle with a single dependency line. The project also documents Android support for developers building mobile applications.
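
As a rough sketch of the editing and safety-filtering ideas described above, the example below builds an allow-list and rewrites a fragment. The class name `CleanAndEditSketch`, the method names, the `banner` class, and the choice of allowed tags are assumptions made for illustration. It also assumes a jsoup version that ships `org.jsoup.safety.Safelist` (older releases named this class `Whitelist`).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

// Hypothetical helper class, for illustration only.
class CleanAndEditSketch {
    // Keep only <b>, <i>, and <a href> with http/https targets; drop everything else.
    static String sanitize(String untrustedHtml) {
        Safelist allowed = Safelist.none()
            .addTags("b", "i", "a")
            .addAttributes("a", "href")
            .addProtocols("a", "href", "http", "https");
        return Jsoup.clean(untrustedHtml, allowed);
    }

    // Rewrite a fragment: retitle the first <h1>, tag links, remove ".banner" elements.
    static String edit(String html, String newTitle) {
        Document doc = Jsoup.parseBodyFragment(html);
        Element title = doc.selectFirst("h1");
        if (title != null) {
            title.text(newTitle);                        // replaces the element's text
        }
        doc.select("a[href]").attr("rel", "nofollow");   // sets the attribute on every match
        doc.select(".banner").remove();                  // detaches matches from the tree
        return doc.body().html();
    }

    public static void main(String[] args) {
        System.out.println(sanitize(
            "<b>Hi</b><script>alert(1)</script><a href=\"https://example.com\" onclick=\"x()\">link</a>"));
        System.out.println(edit(
            "<h1>Old title</h1><div class=\"banner\">Promo</div><a href=\"/docs\">Docs</a>", "Updated"));
    }
}
```

Starting from `Safelist.none()` and adding tags explicitly keeps the allowed set visible in one place; jsoup also provides presets such as `Safelist.basic()` for a broader default.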

Copy-paste prompts

Prompt 1
Using jsoup, write Java code to fetch a Wikipedia page and extract all h2 and h3 heading texts as a list.
Prompt 2
Show me how to use jsoup to sanitize user-submitted HTML for a comment section, allowing only bold, italic, and anchor tags.
Prompt 3
Write a jsoup program that reads a list of 50 URLs, fetches each page, and extracts the content of the meta description tag.
Prompt 4
How do I use jsoup to find all image tags on a page and collect their src attribute values into a list?

Verify against the repo before relying on details.