Since this is a problem we will face every now and then, we started thinking about a more generic solution to reuse in any gadget that depends on such a data source.
The solution we reached was one of these three:
- using a scraping service such as Dapper or Yahoo Pipes to do the scraping on our behalf and return a well-formed XML file to use in any gadget
- creating a Google App Engine application that we call, which scrapes the data and returns XML to us
- using JavaScript to scrape the HTML pages ourselves
Anyway, I liked the third solution and said to myself, let's give it a try and see whether it will be performant enough or not. I thought scraping HTML using JS was an easy matter that could be done in any Google gadget, but I was proven wrong. I will summarize the trials I made here, starting from those that failed and ending with the solution that worked.
- Depending on the Google API method "_IG_FetchXmlContent". This failed immediately because the method expects an XML document and was handed an HTML page; it gave me a parse error on the DOCTYPE line. The result is FAILURE
- Depending on the Google API method "_IG_FetchContent". This gave us the HTML as-is, and it was then time to parse it using the DOM parsers already built into browsers. I tried doing so in Firefox but again got a parse error, because the document is HTML, not XML, and the available parsers only expect XML. The result is FAILURE (a sketch of these first two attempts follows this list)
- Repeating step 2, but first using a regular expression to take only the inner HTML of the body tag. The DOM parser failed on one of the comment lines present in the HTML page, which may appear in many pages, so this isn't a generic solution. The result is FAILURE
- Using a regular expression to get the body's inner HTML, adding it to a hidden div, and then using normal JS methods for traversing DOM nodes with this div as my root. The result is SUCCESS
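Here is a minimal sketch of what the first two failed attempts looked like. The callback signatures follow the legacy gadgets API; the URL is just a placeholder, and the exact error reporting may differ between browsers:

_IG_FetchXmlContent("http://example.com/page.html", function(xmlDoc) {
    // fails: the page is HTML, not XML, so parsing stops at the DOCTYPE line
});

_IG_FetchContent("http://example.com/page.html", function(responseText) {
    // we get the raw HTML, but feeding it to an XML parser fails too,
    // since real-world HTML is usually not well-formed XML
    var doc = new DOMParser().parseFromString(responseText, "text/xml");
    // in Firefox, doc.documentElement.nodeName is "parsererror" here
});

The function that implements the successful approach looks like this: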
scrapeHTMLBody = function(url, dataHolderId, scrapeFunction){
As this definition shows, the function takes three parameters:
- url: the URL to retrieve the HTML from
- dataHolderId: the id of the hidden div that the retrieved HTML will be added to
- scrapeFunction: a function that takes the hidden div as a root element and uses JS to get the desired data (everyone should write their own according to what they want to retrieve)
scrapeHTMLBody = function(url, dataHolderId, scrapeFunction){
    // fetch the remote page as raw text, then hand it to operate
    _IG_FetchContent(url, function(responseText){
        operate(responseText, dataHolderId, scrapeFunction);
    });
}
operate = function(responseText, dataHolderId, scrapeFunction){
    // grab everything between <body ...> and </body>
    var body = /<body.*?>([\s\S]*?)<\/body>/.exec(responseText);
    if (!body) return; // no body tag found, nothing to scrape
    // let the browser's lenient HTML parser build the DOM inside the hidden div
    document.getElementById(dataHolderId).innerHTML = body[1];
    scrapeFunction(dataHolderId);
}
These two functions fetch the HTML page, extract the body's inner HTML, and then call the scrape function, passing it the id of the hidden div that now contains the page body. This works where the XML parsers failed because assigning to innerHTML invokes the browser's own HTML parser, which is far more forgiving of comments, unclosed tags, and other non-XML constructs.
It is now your responsibility to write the desired scraping function, keeping in mind that this div is the root of your DOM tree.
Here is an example of a scraping function I defined (the element selection on the first line was lost from the original post; the getElementsByTagName call below is a reconstruction, so adjust it to the page you are scraping):
scrape = function(dataHolderId){
    // assumption: the data sits in 'td' elements under the holder div;
    // change the tag name to whatever elements hold your data
    var elements = document.getElementById(dataHolderId).getElementsByTagName('td');
    var noktas = [];
    var num = elements.length;
    // every second element holds an item we want; grab its first child's HTML
    for(var i=0 ; i<num ; i+=2) noktas.push(elements[i].childNodes[0].innerHTML);
    // render each scraped item as a paragraph in the gadget
    for(var i=0 ; i<noktas.length ; i++){
        var e = document.createElement('p');
        e.innerHTML = noktas[i];
        document.body.appendChild(e);
    }
}
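To wire everything together, a hypothetical usage would look like the following. The gadget's markup must contain the hidden holder div, and the URL is a placeholder for whatever page you want to scrape:

<div id="dataHolder" style="display:none"></div>

scrapeHTMLBody("http://example.com/jokes.html", "dataHolder", scrape);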
That's it. I think you are now ready to use these two functions in any gadget whose data source needs to be scraped.
This method should also be better in that all processing is done on the client machine rather than on some other server.