What is the use of Jsoup?

Web scraping with Java


I can't find a good Java-based web scraping API. The site I have to scrape doesn't offer an API either. I want to iterate over a set of web pages and extract the HTML titles and other things from their DOM trees.

Are there any options other than web scraping?

Reply:


Extracting the title isn't difficult, and you have many options. Search Stack Overflow for "Java HTML parsers". One of them is Jsoup.

You can navigate the page via its DOM if you know the page structure (see http://jsoup.org/cookbook/extracting-data/dom-navigation).

It's a good library and I've used it in my recent projects.
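For instance, a minimal sketch with Jsoup might look like the following (the URL and the CSS selector are just placeholders; use your own):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TitleScraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page (example.com is a placeholder URL)
        Document doc = Jsoup.connect("https://example.com/").get();

        // The <title> element is exposed directly
        System.out.println("Title: " + doc.title());

        // Other parts of the DOM can be reached with CSS-style selectors
        for (Element heading : doc.select("h1, h2")) {
            System.out.println("Heading: " + heading.text());
        }
    }
}
```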







Your best bet is to use Selenium WebDriver, because it:

  1. Provides visual feedback to the coder (you can watch your scraping in action and see where it stops).

  2. Is accurate and consistent, as it directly controls the browser you are using.

  3. Is slow. It doesn't hammer websites the way HtmlUnit does, but sometimes you don't want to hit a site too fast.

    HtmlUnit is fast but terrible when it comes to JavaScript and AJAX.
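As a rough sketch of the Selenium approach (assuming Chrome and a placeholder URL and selector):

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class SeleniumScrape {
    public static void main(String[] args) {
        // Starts a real Chrome window, so you can watch the scrape as it runs
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://example.com/"); // placeholder URL

            // The page title and any element found by CSS selector are available
            System.out.println("Title: " + driver.getTitle());
            for (WebElement link : driver.findElements(By.cssSelector("a"))) {
                System.out.println(link.getText() + " -> " + link.getAttribute("href"));
            }
        } finally {
            driver.quit();
        }
    }
}
```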



HtmlUnit can be used for web scraping; it supports visiting pages and filling out and submitting forms. I've used this in my own project. It's a good Java library for web scraping. Read more here.
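A minimal sketch of that workflow could look like this. It assumes the 2.x com.gargoylesoftware package names (3.x releases renamed them to org.htmlunit), and the URL, form name, and field names are hypothetical placeholders:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class HtmlUnitScrape {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            // JavaScript handling is limited, so it is often disabled for plain scraping
            client.getOptions().setJavaScriptEnabled(false);

            HtmlPage page = client.getPage("https://example.com/search"); // placeholder URL
            System.out.println("Title: " + page.getTitleText());

            // Fill out and submit a form (form and input names here are hypothetical)
            HtmlForm form = page.getFormByName("search");
            HtmlTextInput query = form.getInputByName("q");
            query.type("java web scraping");
            HtmlSubmitInput submit = form.getInputByName("go");
            HtmlPage results = submit.click();
            System.out.println("Result page title: " + results.getTitleText());
        }
    }
}
```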




You could look into jwht-scrapper!

This is a complete scraping framework that has all the features a developer would expect from a web scraper:

It works with the jwht-htmltopojo lib (https://github.com/whimtrip/jwht-htmltopojo), which itself uses Jsoup, mentioned by several other people here.

Together they will help you build great scrapers that map HTML directly to POJOs and bypass the classic scraper problems in just a few minutes!

Hope this could help some people here!

Disclaimer: I'm the one who developed it; don't hesitate to share your comments with me!


Take a look at an HTML parser like TagSoup, HTMLCleaner, or NekoHTML.


If you want to automate the scraping of large amounts of pages or data, you can try Götz ETL.

It's completely model-driven, like a real ETL tool. The data structure, task workflow, and pages to visit are defined with a set of XML definition files, and no coding is required. Queries can be written either with Jsoup selectors or with XPath via HtmlUnit.



For tasks of this type, I usually use crawler4j + Jsoup.

With crawler4j I download the pages of a domain; you can specify which URLs to follow with a regular expression.

Using jsoup, I "analyzed" the HTML data that you were looking for and downloaded using crawler4j.

You could download the data with Jsoup alone, but crawler4j makes it easier to find links. Another benefit of crawler4j is that it is multithreaded and you can configure the number of concurrent threads.

https://github.com/yasserg/crawler4j/wiki
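A rough sketch of that combination might look like the crawler below; the domain regex is a placeholder, and the CrawlController setup boilerplate (which is also where the number of concurrent crawler threads is passed in) is omitted:

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.regex.Pattern;

// crawler4j discovers and fetches the pages; Jsoup parses the HTML of each one.
public class MyCrawler extends WebCrawler {

    // Placeholder regex: only follow URLs on the target domain
    private static final Pattern ALLOWED = Pattern.compile("https://example\\.com/.*");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return ALLOWED.matcher(url.getURL()).matches();
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            String html = ((HtmlParseData) page.getParseData()).getHtml();
            // Hand the downloaded HTML to Jsoup for the actual extraction
            Document doc = Jsoup.parse(html, page.getWebURL().getURL());
            System.out.println(page.getWebURL().getURL() + " -> " + doc.title());
        }
    }
}
```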


I usually use Selenium, which is automation testing software. You control a real browser through a WebDriver, so you won't have any problems with JavaScript, and sites usually have a harder time detecting you when you use the full browser; headless browsers are easier to identify.
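For example, switching between a full browser window and headless mode is a one-line option. A sketch assuming Chrome, with a placeholder URL:

```java
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessToggle {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        // Uncomment to run without a visible window;
        // note that some sites detect headless browsers more easily.
        // options.addArguments("--headless");
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com/"); // placeholder URL
            System.out.println(driver.getTitle());
        } finally {
            driver.quit();
        }
    }
}
```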
