Tuesday 21 February 2012

Java-based HTML Parser

Hello All,

A few days back I was thrown into a challenge: automating a report that was available only on the internet as an HTML page. I initially thought it would be easy, as the source must have exposed some APIs or services that I could consume. To my utter disappointment there was nothing like that, although they did have RSS feeds. To add to the bitterness, those feeds were not good enough to produce the required report format.

I then had to find other ways of extracting the data from the source, and after some brainstorming I came up with the idea of parsing the HTML itself. I thought it would be quick and easy to do, and also a reliable way forward.

As a golden rule, before creating my own HTML parser I did some googling, and to my sheer surprise quite a few already exist. I then had to choose the right one for me, and my choice became obvious when I found an open source one.

Jsoup it is. I downloaded the required library and, you won't believe it, with a simple two lines of code I was able to extract the data as per my needs. It offers so many ways to extract data that I don't need to think about traversing nodes and their child nodes, etc.
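To give a feel for it, here is a minimal sketch of that kind of extraction. It is not the exact code from my report: the class name ReportScraper, the helper extractCells, the table id "report" and the sample HTML are all made up for illustration. It assumes the Jsoup jar is on the classpath and parses an inline string, so it runs without a network connection. For a live page, the same two calls would be Jsoup.connect(url).get() followed by select(...).

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical example class; the name and the sample data are assumptions.
class ReportScraper {

    // Parses the given HTML and returns the text of every element
    // matching the CSS-style selector, in document order.
    static List<String> extractCells(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);       // line 1: build the document
        Elements cells = doc.select(cssQuery);  // line 2: pick out the data

        List<String> values = new ArrayList<>();
        for (Element cell : cells) {
            values.add(cell.text());
        }
        return values;
    }

    public static void main(String[] args) {
        // Stand-in for the real report page (hypothetical markup).
        String html = "<table id='report'>"
                + "<tr><td>Alpha</td><td>10</td></tr>"
                + "<tr><td>Beta</td><td>20</td></tr>"
                + "</table>";

        // Select every cell of the table with id "report".
        for (String value : extractCells(html, "table#report td")) {
            System.out.println(value);
        }
    }
}
```

The selector does the node traversal for me: I describe which elements I want and Jsoup finds them, however deeply they are nested.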

I recommend you give it a try; it offers so much that you will surely be as surprised as I was.

In my next post I will highlight one more such utility library, for working with MS Office and OpenOffice products.

Cheers,
Namish
