Wednesday, February 11, 2009

HTML Scraping using JavaScript (for Google Gadgets)

Some friends at my company were working on Google Gadgets. Many of these gadgets depended on data gathered from other websites that lack any XML or RSS service providing this data in a direct way.

Since this is a problem we will face every now and then, we started thinking about a more generic solution to use in any gadget depending on such a source of data.

The solutions we considered were these three:
  1. use a scraping service such as Dapper or Yahoo Pipes to do the scraping on our behalf and return a well-formed XML file to use in any gadget
  2. create a Google App Engine application that we call; it scrapes the data and returns XML to us
  3. use JavaScript to scrape the HTML pages directly
The first and second solutions may seem the same, and essentially they are, except that Dapper isn't that reliable (it sometimes fails under extra load) while Google App Engine has proven to survive high request rates.

Anyway, I liked the third solution and said to myself, let's give it a try and see whether it will be performant enough or not. I thought scraping HTML using JavaScript would be an easy matter in any Google Gadget, but it turned out not to be like that at all. I will summarize the trials I made here, starting from the ones that failed and ending with the solution that worked.

  1. Depending on the Google API method "_IG_FetchXmlContent". This failed immediately because the method expects an XML document and was faced with an HTML page instead; it gave me a parse error on the DOCTYPE line. The result is FAILURE.
  2. Depending on the Google API method "_IG_FetchContent". This gave us the HTML as-is, so it was time to parse it using the DOM parsers already built into browsers. I tried doing so in Firefox but again got a parse error, because this is an HTML document, not an XML one, and the available parsers only expect XML. The result is FAILURE.
  3. Repeating step 2, but after using a regular expression to take only the inner HTML of the body tag. The DOM parser failed on one of the comment lines present in the HTML page; comments may appear in many pages, so this isn't an acceptable generic solution. The result is FAILURE.
  4. Using a regular expression to get the body's inner HTML, adding it to a hidden div, and then using normal JavaScript methods for traversing DOM nodes with this div as the root. The result is SUCCESS.
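The regex step from the fourth trial can be tried on its own. A minimal sketch (the sample page content here is made up for illustration):

```javascript
// A small sample page, including the kind of comment line that broke the
// DOM parsers in trial 3 (sample content is made up for illustration).
var sampleHtml =
  '<html><head><title>t</title></head>\n' +
  '<body class="x">\n<!-- a comment that broke the DOM parsers -->\n' +
  '<p>data</p>\n</body></html>';

// Extract everything between <body ...> and </body>.
var match = /<body.*?>((.|\n|\r)*)<\/body>/.exec(sampleHtml);
var bodyData = match[1];
// bodyData now holds the body's inner HTML, comments included —
// ready to be dropped into a hidden div's innerHTML.
```

Since the extracted fragment is never parsed as XML, the comments and loose HTML that defeated the earlier trials no longer matter.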
Since the fourth trial was successful, I made a generic method that anyone can use in a gadget. This simple method just fetches the HTML and scrapes it using your scraping function. To understand what I mean, have a look at the function definition first:

scrapeHTMLBody = function(url, dataHolderId, scrapeFunction){}

As this definition shows, the function takes a few parameters:
  • url — the URL to retrieve the HTML from
  • dataHolderId — the id of the hidden div that the retrieved HTML will be added to
  • scrapeFunction — a function that takes the hidden div as its root element and uses JavaScript to extract the desired data (everyone should write their own according to what they want to retrieve)
And this is the implementation:

scrapeHTMLBody = function(url, dataHolderId, scrapeFunction){
  // Fetch the raw HTML, then hand it off for extraction.
  _IG_FetchContent(url, function(responseText){
    operate(responseText, dataHolderId, scrapeFunction);
  });
}


operate = function(responseText, dataHolderId, scrapeFunction){
  // Pull out everything between <body ...> and </body>.
  var body = /<body.*?>((.|\n|\r)*)<\/body>/.exec(responseText);
  var bodyData = body[1];
  // Inject it into the hidden div so it becomes a traversable DOM tree.
  _gel(dataHolderId).innerHTML = bodyData;
  scrapeFunction(dataHolderId);
}


These two functions fetch the HTML page, extract the body's inner HTML, and then call the scrape function, passing it the id of the hidden div that now contains the page body (this div must already exist in the gadget's markup).
It is now your responsibility to write the scraping function you need, treating this div as the root of your DOM tree.

This is an example of a scraping function I defined:

scrape = function(dataHolderId){
  var elements = _gel(dataHolderId).getElementsByClassName('main');

  // Collect the inner HTML of the first child of every other 'main' element.
  var noktas = [];
  var num = elements.length;
  for(var i = 0; i < num; i += 2) noktas.push(elements[i].childNodes[0].innerHTML);

  // Render each extracted item as a paragraph in the gadget.
  for(var i = 0; i < noktas.length; i++){
    var e = document.createElement('p');
    e.innerHTML = noktas[i];
    document.body.appendChild(e);
  }
}
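To wire everything together, a gadget just calls scrapeHTMLBody with the target URL, the hidden div's id, and the scrape function. The sketch below stubs out the gadget API (_IG_FetchContent and _gel) with hypothetical stand-ins so the flow can be followed outside a gadget container; inside a real gadget those are supplied by the runtime and the stubs are not needed. The URL, div id, and page content are placeholders, not from any real gadget.

```javascript
// Stub for the gadget fetch API: pretend the request returned this page.
// (Illustration only — a real gadget's _IG_FetchContent does the HTTP call.)
function _IG_FetchContent(url, callback) {
  callback('<html><head></head><body><p class="item">hello</p></body></html>');
}

// Stub for _gel: a minimal element registry standing in for the gadget DOM.
var fakeDom = { dataHolder: { innerHTML: '' } };
function _gel(id) { return fakeDom[id]; }

// The two functions from the post, unchanged:
var operate = function(responseText, dataHolderId, scrapeFunction){
  var body = /<body.*?>((.|\n|\r)*)<\/body>/.exec(responseText);
  _gel(dataHolderId).innerHTML = body[1];
  scrapeFunction(dataHolderId);
};

var scrapeHTMLBody = function(url, dataHolderId, scrapeFunction){
  _IG_FetchContent(url, function(responseText){
    operate(responseText, dataHolderId, scrapeFunction);
  });
};

// A trivial scrape function that just logs what landed in the hidden div:
scrapeHTMLBody('http://example.com/page', 'dataHolder', function(id){
  console.log(_gel(id).innerHTML); // '<p class="item">hello</p>'
});
```

The only gadget-side requirement this sketch hides is the hidden div itself, which must be declared in the gadget's markup with the id you pass as dataHolderId.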


That's it. I think you are now ready to use these two functions in any gadget whose data source needs to be scraped.
This approach should also hold up better, since all processing happens on the client machine rather than on some other server.
