Friday, August 10, 2012

Data Retrieval & Classification using jsoup


In domain-specific data retrieval, the main idea is to fetch the content that falls within a certain domain and display it to the user. The relevant data may not be a whole website or even a whole web page; it might be a small section within a page. A technical approach is therefore needed to retrieve just that section. Several systems and libraries can extract specific data from websites; among them, jsoup looks promising for this purpose because of its features.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. jsoup is open source, which suits this project well because it can be modified to fit the purpose, and because it is written for the Java environment it fits the development process too. The project needs to extract certain sections of a given website whose URL is available. jsoup suits this because it can parse HTML from a URL, a file, or a string. It can therefore fetch the page from the given URL, hold the parsed data, and extract (scrape) sections from within it.
The jsoup API has many features that help the extraction process. For example, data can be extracted by reading the DOM structure of a page. Since web pages are written in HTML, jsoup can parse a page, walk its DOM structure, and pull out the intended content. HTML tags and attributes are easy to identify in jsoup, and data can be retrieved by referring to them. jsoup represents these as elements, and elements provide a range of DOM-like methods to find elements and to extract and manipulate their data. The DOM getters are contextual: called on a parent document they find matching elements under the document; called on a child element they find elements under that child. jsoup provides many such getters. This makes data extraction easy, because only the intended sections need to be extracted, without grabbing entire web pages.
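To make the contextual getters concrete, here is a minimal sketch, assuming the jsoup jar is on the classpath. The HTML string, the ids news and footer, and the class name ContextualGetters are made up for illustration; they are not part of the project.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ContextualGetters {
    // Hypothetical sample page, parsed from a string to keep the sketch offline.
    static final String HTML =
        "<html><body>"
        + "<div id=\"news\"><a href=\"/a\">First</a><a href=\"/b\">Second</a></div>"
        + "<div id=\"footer\"><a href=\"/about\">About</a></div>"
        + "</body></html>";

    // Counts <a> elements under whatever element is passed in.
    // A Document is itself an Element, so both calls below use the same getter.
    static int countLinks(Element scope) {
        return scope.getElementsByTag("a").size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element news = doc.getElementById("news");

        // Called on the document: finds every link in the page.
        System.out.println("links in document: " + countLinks(doc));
        // Called on a child element: finds only the links under that child.
        System.out.println("links in #news: " + countLinks(news));
    }
}
```

When the page lives at a URL rather than in a string, `Jsoup.connect(url).get()` returns the same kind of Document, so the rest of the code would stay unchanged.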
The extracted data needs to be classified according to content type: text, images, videos, links, and so on. jsoup can be used for this categorization as well. It can identify text, links, and images separately, so the extraction process can be split by type. Content type is determined from the HTML tags, and a suitable method is then used to extract each type. Thus content classification can be achieved with jsoup while the data is being extracted.
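A rough sketch of this tag-based classification could look like the following, again assuming jsoup is on the classpath. The class ContentClassifier, its three lists, and the example.com sample page are hypothetical names chosen for illustration, and treating only p, a, and img tags is a simplification.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ContentClassifier {
    final List<String> texts = new ArrayList<>();
    final List<String> links = new ArrayList<>();
    final List<String> images = new ArrayList<>();

    // Sorts the content of an HTML page into text, links, and images by tag.
    // baseUri lets jsoup turn relative URLs into absolute ones.
    static ContentClassifier classify(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        ContentClassifier result = new ContentClassifier();
        for (Element p : doc.select("p")) {
            result.texts.add(p.text());            // visible text of each paragraph
        }
        for (Element a : doc.select("a[href]")) {
            result.links.add(a.attr("abs:href"));  // absolute link target
        }
        for (Element img : doc.select("img[src]")) {
            result.images.add(img.attr("abs:src")); // absolute image location
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <a href=\"/docs\">docs</a></p><img src=\"logo.png\">";
        ContentClassifier c = classify(html, "http://example.com/");
        System.out.println("text:   " + c.texts);
        System.out.println("links:  " + c.links);
        System.out.println("images: " + c.images);
    }
}
```

Other content types could be handled the same way with another selector per type.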
Considering the requirements of the data extraction process, jsoup is a capable tool for the job. Its features make Java-based data extraction much easier, and because it is open source it is cost-effective as well.


Sunday, May 27, 2012

Using OpenJPA in Development

By now you know what OpenJPA is for, but I haven't yet shown you how to use it in practice. Let me explain with a simple example. I am using the NetBeans IDE, which makes OpenJPA easy to use because it creates the entity classes automatically. Follow the steps below.
1. Create a new NetBeans web project.
2. Right-click the project and, from the menu, select New --> Other --> Persistence --> Persistence Unit.
You will get the following dialog box.
3. Leave the name as it is, and in the persistence provider section select OpenJPA. If it is not in the list, you need to create a new library: go to New Library and add your openjpa-all-2.1.0 jar. (You need to have OpenJPA unzipped in a folder.)
4. Then click OK to finish creating the persistence unit.
5. Now create a new package to hold the entity files (for example, com.openjpa.domain).
6. Right-click the package and select New --> Other --> Persistence --> Entity Classes from Database.
Here you need to select your database connection; the database schema is read from that connection.
For an example, see the figure below.
7. Once the data source is selected, the tables in the database are displayed, and you can select which tables to import as entity classes.
8. Once you have chosen the tables, NetBeans creates the entity classes automatically.
9. If you take a look at these classes, you can see that the table mappings and queries have been generated.
10. Then you can create your DAO files and access the data objects using the named queries or normal queries.
11. Then you can use them in your servlets and JSPs as required.

So those are the steps to create OpenJPA entity objects using NetBeans. I hope this post helped you understand the basics of OpenJPA. See you in the next post.


MVC Architecture of OpenJPA


In the development process, the standard approach is to use the Model View Controller (MVC) architecture, which keeps each component separate.
1. Model – The model represents the data and the rules that govern access to and updates of that data. Basically, it contains the database objects to be accessed.
2. View – The view is the set of web pages we display. HTML and JSP pages fall under the View category.
3. Controller – This is the controlling mechanism between the model and the view. It communicates with both and translates and transports data objects between them.

Figure MVC Architecture of JPA

1. DAO – DAO means Data Access Object; it provides an abstract interface to the database. Basically, the DAO layer is where we write the queries that access the database. We can write as many DAOs, with as many methods, as we need. When writing a DAO we can use the OpenJPA named queries directly.
2. Domain – Domain means the entity classes for the tables in the database. For each table, we can create a mapping entity class and work with the table's rows as objects. This is much more convenient than mapping relational data by hand. Domain classes are also called POJOs (Plain Old Java Objects).
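The layering described above can be sketched in plain Java. This is only an illustration under stated assumptions: Customer, CustomerDao, InMemoryCustomerDao, and CustomerController are hypothetical names, and the DAO here reads from an in-memory list so the example runs without a database. A JPA-backed DAO would run its queries through an entity manager instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Domain: a POJO standing in for one row of a hypothetical CUSTOMER table.
class Customer {
    private final int id;
    private final String name;

    Customer(int id, String name) {
        this.id = id;
        this.name = name;
    }

    int getId() { return id; }
    String getName() { return name; }
}

// DAO: the abstract interface the rest of the application talks to.
interface CustomerDao {
    List<Customer> findAll();
    Optional<Customer> findById(int id);
}

// In-memory stand-in for a database-backed DAO (assumption for this sketch).
class InMemoryCustomerDao implements CustomerDao {
    private final List<Customer> rows = new ArrayList<>();

    InMemoryCustomerDao(List<Customer> seed) {
        rows.addAll(seed);
    }

    public List<Customer> findAll() {
        return new ArrayList<>(rows); // defensive copy
    }

    public Optional<Customer> findById(int id) {
        return rows.stream().filter(c -> c.getId() == id).findFirst();
    }
}

// Controller: fetches domain objects via the DAO and prepares data for the view.
class CustomerController {
    private final CustomerDao dao;

    CustomerController(CustomerDao dao) {
        this.dao = dao;
    }

    String displayName(int id) {
        return dao.findById(id).map(Customer::getName).orElse("(not found)");
    }
}

class MvcSketch {
    public static void main(String[] args) {
        CustomerDao dao = new InMemoryCustomerDao(
            Arrays.asList(new Customer(1, "Alice"), new Customer(2, "Bob")));
        CustomerController controller = new CustomerController(dao);
        // A servlet or JSP (the view) would render this string.
        System.out.println(controller.displayName(1));
        System.out.println(controller.displayName(9));
    }
}
```

Because the controller depends only on the CustomerDao interface, the in-memory class could later be swapped for a database-backed one without touching the controller or the view.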

To connect the database to the domain classes, you need to create an XML file containing the connection details. By default this file is named “persistence.xml”. It is similar to a normal database connection setup, but it is defined in XML format. A typical persistence.xml file would contain the following details.
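As a rough sketch, such a file for OpenJPA might look like this. The unit name ExamplePU, the entity class com.openjpa.domain.Customer, and the MySQL URL, driver, user, and password are all placeholder assumptions; only the provider class and the javax.persistence.jdbc.* property names are standard.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<persistence version="2.0"
             xmlns="http://java.sun.com/xml/ns/persistence"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://java.sun.com/xml/ns/persistence
                                 http://java.sun.com/xml/ns/persistence/persistence_2_0.xsd">
  <!-- "ExamplePU" is a placeholder persistence unit name -->
  <persistence-unit name="ExamplePU" transaction-type="RESOURCE_LOCAL">
    <provider>org.apache.openjpa.persistence.PersistenceProviderImpl</provider>
    <!-- hypothetical entity class in the domain package -->
    <class>com.openjpa.domain.Customer</class>
    <properties>
      <!-- placeholder connection details; replace with your own -->
      <property name="javax.persistence.jdbc.url" value="jdbc:mysql://localhost:3306/exampledb"/>
      <property name="javax.persistence.jdbc.driver" value="com.mysql.jdbc.Driver"/>
      <property name="javax.persistence.jdbc.user" value="example_user"/>
      <property name="javax.persistence.jdbc.password" value="example_password"/>
    </properties>
  </persistence-unit>
</persistence>
```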
This holds the connection details, and there are two ways to retrieve the data objects. One is through an entity manager; the other is through a session factory, as used in Hibernate. I used the entity manager factory in the DAO layer.
I hope this post helps you understand the MVC architecture of OpenJPA. Stay in touch :)