animesh kumar

Running water never grows stale. Keep flowing!

Using node.js and jquery to scrape websites


I have been playing with Node.js for the last few days and am totally head over heels. Madly in love! It’s awesome to see how much you can build with how little. I have ranted about Node.js earlier and done some comparisons too. It’s fast, really fast. And it’s the plain old Javascript we have been using for many, many years now. I thought I would build a real-world application with it to see how well it holds water. At first I thought of building something on top of Riak, but that felt like running too fast. Instead I picked something simpler that deals only with Node.js. Before that, though, it makes sense to brush up on some Javascript fundamentals.

Javascript objects

Yes. Javascript is an object-oriented language. But it’s different from traditional classical OO languages like Java and Ruby.

  1. One obvious difference is the syntax.
  2. The other major one is that other languages have methods, while Javascript has first-class functions.

First-class functions. What does that mean? It means that functions are expressions: they can be assigned to variables and passed around easily. Does that sound like a closure in Ruby? It does indeed. Actually, it’s a little more than that; I will come back to this some other time. For now, let’s find out how we can create objects and use them. I will show you two ways to do it.
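To make that concrete, here is a minimal sketch of functions being treated as plain values (the names here are made up for illustration):

```javascript
// A function assigned to a variable, just like any other value
var greet = function(name) {
	return 'Hello, ' + name;
};

// A function passed as an argument to another function
function applyTwice(fn, value) {
	return fn(fn(value));
}

var shout = function(s) { return s + '!'; };

console.log(greet('Node'));              // Hello, Node
console.log(applyTwice(shout, 'run'));   // run!!
```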

The Classical way

Here is a constructor function for a Shape object. It accepts two parameters and saves them into respective instance variables.

function Shape(width, height) {
	this.width = width;        // instance variable width
	this.height = height;      // instance variable height
	this.getArea = function() {     // function to calculate Area, notice the assignment.
		return this.width * this.height;
	};
}

var rectangle = new Shape(2, 5);     // instantiate a new Shape object
console.log(rectangle.getArea());    // calculate the area: 10

Javascript uses prototype chains to let you add new functions or variables to an object on the fly. You can read more about this here: http://www.packtpub.com/article/using-prototype-property-in-javascript

I will add a new function to calculate the perimeter of my Shape object.

Shape.prototype.getPerimeter = function() {
	return 2 * (this.width + this.height);
};

console.log(rectangle.getPerimeter());   // 14

What happened here? Notice that even though ‘rectangle’ had already been created, it could still access the newly added perimeter function. Wasn’t that awesome? Javascript is intelligent, dude. When you ask for a property, it looks at the current object first, and if it’s not found there, it walks up the object’s prototype chain looking for what you asked. And since we added the new function to the prototype, it’s found seamlessly. There is a lot of interesting stuff going on here; you should read about it. I would suggest buying Manning’s Secrets of the JavaScript Ninja, if you are really serious about it.
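You can actually see where the lookup lands with `hasOwnProperty`, which distinguishes an object’s own properties from ones inherited via the prototype. A small self-contained sketch (it restates the Shape pieces so the snippet runs on its own):

```javascript
function Shape(width, height) {
	this.width = width;
	this.height = height;
}
Shape.prototype.getArea = function() {
	return this.width * this.height;
};

var rect = new Shape(2, 5);

// 'width' lives on the instance itself...
console.log(rect.hasOwnProperty('width'));    // true
// ...but 'getArea' is found by walking up the prototype chain
console.log(rect.hasOwnProperty('getArea'));  // false
console.log(rect.getArea());                  // 10
```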

Now, let’s try to extend Shape. I will create a new constructor function for Square.

function Square(side){
	this.width = side;
	this.height = side;
}

Square.prototype = new Shape();

var sq = new Square(4);
console.log(sq.getArea());

I created a new Square constructor and replaced its prototype with an instance of Shape, so Square picks up all of Shape’s functionality and behavior. Easy… huh?
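One wrinkle with `Square.prototype = new Shape();` is that the parent constructor runs with no arguments at that point. A common alternative, sketched below under the same Shape definition, is to invoke the parent constructor from the child with `call`, so the instance variables are set per object:

```javascript
function Shape(width, height) {
	this.width = width;
	this.height = height;
	this.getArea = function() {
		return this.width * this.height;
	};
}

function Square(side) {
	Shape.call(this, side, side);  // borrow Shape's constructor for this instance
}

var sq = new Square(4);
console.log(sq.getArea());  // 16
```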

The Prototypal way

Let’s do the same thing without using constructors now. Just plain prototypes!

var Shape = {
	getArea: function () {
		return this.width * this.height;
	},
	getPerimeter: function() {
		return 2 * (this.width + this.height);
	}
};

var rec = Object.create(Shape);
rec.width = 2;
rec.height = 5;

console.log(rec.getArea());

Now that you have the Shape object, you can easily add new functions to it, or even use it as the prototype of another object. However, I find this approach a little clumsy; I would rather stick to the classical way. Take your pick. To each his own!
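For completeness, here is a sketch of inheriting the prototypal Shape into another object; `SquareProto` and `setSide` are made-up names for illustration:

```javascript
var Shape = {
	getArea: function() {
		return this.width * this.height;
	}
};

// A new object whose prototype is Shape
var SquareProto = Object.create(Shape);
SquareProto.setSide = function(side) {
	this.width = side;
	this.height = side;
	return this;
};

// Objects created from SquareProto inherit getArea through the chain
var sq = Object.create(SquareProto).setSide(4);
console.log(sq.getArea());  // 16
```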

Node.js Modules

Node uses the CommonJS module system: a simple module-loading scheme where files and modules are in one-to-one correspondence. Here is the API: http://nodejs.org/api.html. The example above can be ported to a Node.js module like this:

First, create Shape.js

function Shape(width, height) {
	this.width = width;        // instance variable width
	this.height = height;      // instance variable height
	this.getArea = function() {     // function to calculate Area, notice the assignment.
		return this.width * this.height;
	};
}

// Export this constructor (note: module.exports, not exports.module)
module.exports = Shape;

And now, use it:

var Shape = require('./Shape');

var rectangle = new Shape(2, 5);
console.log(rectangle.getArea());

Node.js loads and runs each module in its own sandbox, which staves off name collisions. That’s the benefit you get, apart from a properly structured code base.

Writing a screen scraping application

I will write a simple application to capture details from various websites. The beautiful thing is that Javascript has been handling DOM objects for years; in fact, Javascript was created to manipulate the DOM. No wonder it’s more mature at this than any other HTML parsing library. And given the many elegant frameworks available, like Prototype, MooTools, and jQuery, scraping websites with Node.js should be easy and fun. Let’s do it. Let’s write an application to collect data from various book-selling websites.

Create a basic searcher.js module. It provides the fundamental skeleton for writing website-specific scrapers.

// External Modules
var request = require('ahr'), // Abstract-HTTP-request https://github.com/coolaj86/abstract-http-request
sys = require('sys'),		// System
events = require('events'),	// EventEmitter
jsdom = require('jsdom');	// JsDom https://github.com/tmpvar/jsdom

var jQueryPath = 'http://code.jquery.com/jquery-1.4.2.min.js';
var headers = {'content-type':'application/json', 'accept': 'application/json'};

// Export searcher
module.exports = Searcher;

function Searcher(param) {
	if (param.headers) {
		this.headers = param.headers;
	} else {
		this.headers = headers;
	}

	this.merchantName = param.merchantName;
	this.merchantUrl = param.merchantUrl;
	this.id = param.merchantUrl;
}

// Inherit from EventEmitter
Searcher.prototype = new events.EventEmitter();

Searcher.prototype.search = function(query, collector) {
	var self = this;
	var url = self.getSearchUrl(query);

	console.log('Connecting to... ' + url);

	request({uri: url, method: 'GET', headers: self.headers, timeout: 10000}, function(err, response, html) {
		if (err) {
			self.onError({error: err, searcher: self});
			self.onComplete({searcher: self});
		} else {
			console.log('Fetched content from... ' + url);
			// create DOM window from HTML data
			var window = jsdom.jsdom(html).createWindow();
			// load jquery with DOM window and call the parser!
			jsdom.jQueryify(window, jQueryPath, function() {
				self.parseHTML(window);
				self.onComplete({searcher: self});
			});
		}
	});
}

// Implemented in inheriting class
Searcher.prototype.getSearchUrl = function(query) {
	throw "getSearchUrl() is unimplemented!";
}
// Implemented in inheriting class
Searcher.prototype.parseHTML = function(window) {
	throw "parseHTML() is unimplemented!";
}
// Emits 'item' events when an item is found.
Searcher.prototype.onItem = function(item) {
	this.emit('item', item);
}
// Emits 'complete' event when searcher is done
Searcher.prototype.onComplete = function(searcher) {
	this.emit('complete', searcher);
}
// Emit 'error' events
Searcher.prototype.onError = function(error) {
	this.emit('error', error);
}

Searcher.prototype.toString = function() {
	return this.merchantName + "(" + this.merchantUrl + ")";
}

Now, the code to scrape Rediff Books. I will name it searcher-rediff.js

var Searcher = require('./searcher');

var searcher = new Searcher({
	merchantName: 'Rediff Books',
	merchantUrl: 'http://books.rediff.com'
});

module.exports = searcher;

searcher.getSearchUrl = function(query) {
	return this.merchantUrl + "/book/" + query;
}

searcher.parseHTML = function(window) {
	var self = this;

	window.$('div[id="prod_detail"]').each(function(){
		var item  = window.$(this);

		var title = item.find('#prod_detail2').find('font[id="book-titl"]').text();
		var link = item.find('#prod_detail2').find('a').attr('href');
		var author = item.find('#prod_detail2').find('font[id="book-auth"]').text();
		var price = item.find('#prod_detail2').find('font[id="book-pric"]').text();

		self.onItem({
			title: title,
			link: link,
			author: author,
			price: price
		});
	});
}

Run it now.

var searcher = require('./searcher-rediff');

searcher.on('item', function(item){
	console.log('Item found >> ' + item)
});

searcher.on('complete', function(searcher){
	console.log('searcher done!');
});

searcher.search("Salman");

What did I do?

  1. First, I wrote a skeleton Searcher class. This class
    1. makes a request to the merchant’s search URL (built by the getSearchUrl function), then
    2. fetches the HTML data from there, then
    3. creates a DOM window object using the ‘jsdom’ module, which
    4. gets jQuery loaded into it, and then
    5. the parseHTML function is executed.
  2. Second, I wrote another class that extends Searcher and interacts with Rediff. This class implements
    1. the getSearchUrl function, to return the appropriate search URL to connect to, and
    2. the parseHTML function, to scrape data from the DOM window object. This is very interesting: you can use all your jQuery knowledge to pick elements and parse data from inside them, just like in the old days when you added styles or data to random elements.

Now, if I want to search, say, Flipkart along with Rediff, I just need to write a Flipkart-specific implementation, say searcher-flipkart.js

var Searcher = require('./searcher');

var searcher = new Searcher({
	merchantName: 'Flipkart',
	merchantUrl: 'http://www.flipkart.com'
});

module.exports = searcher;

searcher.getSearchUrl = function(query) {
	return this.merchantUrl + "/search-book" + '?query=' + query;
}

searcher.parseHTML = function(window) {
	var self = this;

	window.$('.search_result_item').each(function(){
		var item  = window.$(this);

		var title = item.find('.search_result_title').text().trim().replace(/\n/g, "");
		var link = self.merchantUrl + item.find('.search_result_title').find("a").attr('href');
		var price = item.find('.search_results_list_price').text().trim().replace(/\n/g, "");

		self.onItem({
			title: title,
			link: link,
			price: price
		});
	});
}

I have also written a Runner class to execute multiple searchers in parallel and collect the results into an array. You can find the entire source code here: https://github.com/anismiles/jsdom-based-screen-scraper. Chill!
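The Runner itself isn’t shown in this post; a minimal sketch of one (hypothetical names, assuming each searcher emits ‘item’ and ‘complete’ events as above) could look like this:

```javascript
// runner.js -- an illustrative sketch, not the actual Runner from the repo
function Runner(searchers) {
	this.searchers = searchers;
}

Runner.prototype.run = function(query, done) {
	var items = [];
	var pending = this.searchers.length;

	this.searchers.forEach(function(searcher) {
		// Collect every item any searcher finds
		searcher.on('item', function(item) {
			items.push(item);
		});
		// When the last searcher completes, hand back the collected results
		searcher.on('complete', function() {
			pending -= 1;
			if (pending === 0) {
				done(items);
			}
		});
		searcher.search(query);
	});
};

module.exports = Runner;
```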

What’s next? I am going to write about Node.js pretty feverishly, so you’d better stay posted. How about a blog engine on Riak?


Written by Animesh

November 29, 2010 at 3:47 pm

30 Responses



  2. I have a question: It’s possible to scrape sites with javascript in the page ??? What I am asking is for example:
    This is the page source.

    document.write(“Test”);

    And this is the result after javascript be processed by the parser.

    document.writeln(“Test”);

    Test

    Thank you,
    Ventura

    Jorge Ventura

    December 8, 2010 at 9:19 am

    • Ventura,

Yeah, it’s possible, but you might need a way to execute the on-page javascript in a sandbox. Node.js can easily help you do that.

      Animesh

      January 2, 2011 at 2:35 pm

you can scrape sites with JS on the page using jsdom’s jsdom.env() function.
      Check it out under the headline Easy Mode on the jsdom github page https://github.com/tmpvar/jsdom

      aaron

      March 15, 2011 at 3:13 pm


  4. I am sorry, I was trying to post HTML code but it doesn’t work here.

    Ventura

    Jorge Ventura

    December 8, 2010 at 9:22 am

  5. Great example. Thanks for sharing.

    I did run into an issue when trying to run your example for searcher-rediff.js. When it tries to create the window via jsdom, it throws a stack that starts with this error:

    TypeError: Cannot read property ‘protocol’ of undefined

    Does this mean jsdom cannot correctly parse the html anymore because it’s changed?

    Rob

    January 2, 2011 at 11:51 am

    • Rob,

      Did you check the HTML content? Is it getting fetched properly? BTW, against which URL this error comes?

      -Animesh

      Animesh

      January 2, 2011 at 2:36 pm

      • Yes. The HTML content is coming back properly. This is the url I am fetching. http://books.rediff.com/book/Salman. The error is thrown at this line in searcher.js
        var window = jsdom.jsdom(html).createWindow();

        I believe it has to do with the inline javascript call in the HTML that looks like this:

        s.src = (document.location.protocol == “https:” ? “https://sb” : “http://b”) + “.scorecardresearch.com/beacon.js”;

        Rob

        January 2, 2011 at 11:29 pm

      • I see. I will look into this and revert. However, did it work for other urls?

        Animesh

        January 3, 2011 at 10:01 am

  6. Hi Animesh, wondering if this node.js will work for this situation? http://stackoverflow.com/questions/5054818/php-page-protection-for-cron-task-only

    Is it compatible with current mysql?

    Do we have to learn up this as a whole new language? or can easily reuse certain thing in php etc?

    wonderful

    February 20, 2011 at 3:01 pm

  7. Great writeup. I’m trying to run the searcher-server code, and I keep getting:

    TypeError: Object # has no method ‘on’
    at Object. (/Users/avishai/Downloads/anismiles-jsdom-based-screen-scraper-f0c79d3/searcher-server.js:9:10)
    at param (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/middleware/router.js:146:21)
    at param (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/middleware/router.js:157:15)
    at pass (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/middleware/router.js:162:10)
    at Object.router [as handle] (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/middleware/router.js:168:6)
    at next (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/index.js:218:15)
    at Server.handle (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/index.js:231:3)
    at Server.emit (events.js:45:17)
    at HTTPParser.onIncoming (http.js:1078:12)
    at HTTPParser.onHeadersComplete (http.js:87:31)

    Do you know why this might be?

    Avishai

    March 4, 2011 at 1:51 am

    • Avishai, what version of Node are you using? I see that you got Connect’s 0.5.10 version, which I think should be fine.

      -Animesh

      Animesh

      March 4, 2011 at 6:39 pm

      • The fundamental reason behind this bug should be something to do with EventEmitter. Let me explain:

        1. searcher.js inherits from EventEmitter
        (Line-26) Searcher.prototype = new process.EventEmitter;

        2. searcher-rediff.js, searcher-flipkart.js and searcher-landmarkonthenet.js extend from searcher.js, so they also inherit from EventEmitter.

        3. ‘on’ method is actually defined in EventEmitter.

        So, i think, for some reason, searcher.js is not able to inherit from EventEmitter and hence the method ‘on’ is missing.

        Animesh

        March 4, 2011 at 6:42 pm


  9. Is there a good way to do this on websites that require you to log in first before running a search?

    Avishai

    April 4, 2011 at 7:02 pm

    • i think, using POST method you can easily do a login on a site.

      Animesh

      April 4, 2011 at 7:12 pm

  10. Your “Javascript objects” helped a lot. Thanks.

  11. “Square.prototype = new Shape(); ”

    Here Shape constructor is called without any argument.But its definition has two arguments -width and length. I checked it. It works. Does javascript also creates default constructors with no arguments itself?

    • No. Think of JS not as a logical/democratic world, it’s more like anarchy. :) JS, internally, accepts params as a key-val pair and when you don’t pass anything… key-val pair just goes blank. It’s not an error. And if you try to look for these params, you will see ‘undefined’. One more difference between ‘undefined’ and ‘null’… eh?

      Chill!

      Animesh

      December 30, 2011 at 10:23 am

  12. Hi, this post is really interesting, and while I’m trying to get the picture, I don’t understand how the function searcher.getSearchUrl = function(query) { return this.merchantUrl + “/book/” + query; } in searcher-rediff.js gets called. Thanks a lot.

    Yaver

    December 29, 2011 at 5:40 pm


  14. Hi Animesh,

    Sorry for being naive, would this be required to run on the server side – reason that I ask this is that I have a a need to scrape a website and show results in a mobile application using phonegap and I was wondering if this script could run on the client side or would it need to be deployed on the server side. Also could you please give an example of how to use POST for the website that requires login (I have the username and password).

    Thanks
    Tarun

    Tarun

    April 1, 2013 at 2:00 am

    • Sure you can run this on client side. However you will need to modify it a bit.

      Animesh

      April 1, 2013 at 10:09 am



  17. to greatly simplify and speed up your code, try promise-parser

    http://www.npmjs.org/package/promise-parser

    JD

    June 12, 2014 at 3:20 am

  18. you should check out promise-parser

    http://www.npmjs.org/package/promise-parser

    http://github.com/rc0x03/node-promise-parser

    Features

    Fast: uses libxml C bindings
    Lightweight: no dependencies like jQuery, cheerio, or jsdom
    Clean: promise based interface- no more nested callbacks
    Flexible: supports both CSS and XPath selectors

    JD

    June 12, 2014 at 3:23 am

