Friday, August 10, 2012

Data Retrieval & Classification using jsoup


In domain-specific data retrieval, the main idea is to fetch content that falls within a certain domain and display it to the user. The relevant data may not be a whole website, or even a whole web page; it may be a small section within a page. A technical approach is therefore needed to retrieve just that data. A few systems and libraries can retrieve specific data from websites, and among them jsoup looks promising because of its features.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. jsoup is open source, which makes it a good fit for this project because it can be modified to suit the purpose. Because it is developed specifically for the Java environment, it also fits the development process well. The project needs to extract certain sections of a given website whose URL is available. jsoup suits this because it can parse HTML from a URL, from a file, or from a string. It can therefore fetch the data from the given URL, store it, and then extract or scrape sections from it.
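As a minimal sketch of the three input sources mentioned above, the class below wraps one jsoup call for each. The class name LoadSketch, the method names, the UTF-8 charset, and the sample HTML are assumptions made for illustration; only the string variant is run in main, so the example works without network access.

```java
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class; the names are illustrative, not part of jsoup.
public class LoadSketch {

    // Fetch and parse a live page. The URL is supplied by the caller.
    public static Document fromUrl(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    // Parse a saved HTML file. UTF-8 is assumed here; baseUri is used
    // to resolve relative links found in the file.
    public static Document fromFile(File file, String baseUri) throws IOException {
        return Jsoup.parse(file, "UTF-8", baseUri);
    }

    // Parse HTML that is already held in memory as a string.
    public static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document doc = fromString(
            "<html><head><title>Sample</title></head>"
            + "<body><p>Hello</p></body></html>");
        System.out.println(doc.title()); // prints "Sample"
    }
}
```

All three methods return the same Document type, so the later extraction steps do not need to know where the HTML came from.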
The jsoup API has many features that can improve the extraction process. For example, data can be extracted by reading the DOM structure of a website. Since websites are built with HTML, jsoup can parse a page, walk its DOM structure, and return the intended content. jsoup represents HTML tags as elements, and their attributes and data can be read by referring to them. Elements provide a range of DOM-like methods to find other elements and to extract and manipulate their data. The DOM getters are contextual: called on a parent document, they find matching elements under the document; called on a child element, they find elements under that child. jsoup provides many such getters. This makes data extraction easy, because only the intended sections need to be extracted rather than whole web pages.
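The contextual behaviour of the getters can be sketched as follows. The sample HTML, the element ids "nav" and "content", and the class and method names are assumptions made for this example; the jsoup calls used are getElementById, getElementsByTag, and select.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class; the HTML and ids are made up for illustration.
public class SectionSketch {

    public static final String HTML =
        "<html><body>"
        + "<div id=\"nav\"><a href=\"/home\">Home</a></div>"
        + "<div id=\"content\"><p>First</p><p>Second</p>"
        + "<a href=\"/more\">More</a></div>"
        + "</body></html>";

    // Returns the text of every <p> under the element with the given id,
    // or an empty list when no such element exists.
    public static List<String> paragraphsIn(String html, String sectionId) {
        Document doc = Jsoup.parse(html);
        Element section = doc.getElementById(sectionId);
        List<String> texts = new ArrayList<>();
        if (section == null) {
            return texts;
        }
        // Called on the child element, the getter searches only under it.
        for (Element p : section.getElementsByTag("p")) {
            texts.add(p.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element content = doc.getElementById("content");

        // The same query gives different results depending on the context.
        System.out.println(doc.select("a").size());     // 2: whole document
        System.out.println(content.select("a").size()); // 1: content only

        System.out.println(paragraphsIn(HTML, "content")); // [First, Second]
    }
}
```

Narrowing to a section first and querying inside it is what keeps navigation links and other unrelated parts of the page out of the result.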
The extracted data needs to be classified by content type, that is, as text, images, videos, links, and so on, and jsoup can be used for this categorization. It can identify text, links, and images separately, so the extraction process can be split along those lines. jsoup identifies the content type from the HTML tags and then uses the matching functions to extract each type. Content classification can therefore be done with jsoup at the same time as the data is extracted.
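One possible way to do this tag-based classification is sketched below. The mapping of content types to the tags p, a, img, and video, the class name ClassifySketch, the sample HTML, and the example.com base URI are all assumptions for illustration; a real page may need different selectors. The "abs:" attribute prefix asks jsoup to resolve relative URLs against the base URI.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical classifier; the tag-to-type mapping is an assumption.
public class ClassifySketch {

    public static final String SAMPLE =
        "<html><body>"
        + "<p>Welcome</p>"
        + "<a href=\"/about\">About</a>"
        + "<img src=\"img/logo.png\">"
        + "<video src=\"clip.mp4\"></video>"
        + "</body></html>";

    // Groups page content by type, keyed as text, links, images, videos.
    public static Map<String, List<String>> classify(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Map<String, List<String>> byType = new LinkedHashMap<>();
        byType.put("text", collect(doc, "p", null));
        byType.put("links", collect(doc, "a[href]", "abs:href"));
        byType.put("images", collect(doc, "img[src]", "abs:src"));
        byType.put("videos", collect(doc, "video[src]", "abs:src"));
        return byType;
    }

    // Collects the element text when attribute is null, otherwise the
    // value of the named attribute, for every element matching cssQuery.
    private static List<String> collect(Document doc, String cssQuery, String attribute) {
        List<String> values = new ArrayList<>();
        for (Element el : doc.select(cssQuery)) {
            values.add(attribute == null ? el.text() : el.attr(attribute));
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, List<String>> result = classify(SAMPLE, "http://example.com/");
        for (Map.Entry<String, List<String>> entry : result.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Because each content type is just a selector and an attribute, further types can be added with one more line in classify.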
Considering the requirements of the data extraction process, jsoup is a capable tool for the job. Its features and functions make Java-based data extraction much easier, and because it is open source, it is also cost effective.