animesh kumar

Running water never grows stale. Keep flowing!

Using node.js and jquery to scrape websites


I have been playing with Node.js for the last few days and am totally head over heels. Madly in love! It’s awesome to see how much you can build with how little. I have ranted about Node.js earlier and done some comparisons too. It’s fast, really fast. And it’s the plain old Javascript we have been using for many, many years now. I thought I would build a real-world application with it to see how well it holds water. At first I thought of building something on top of Riak, but that felt like running too fast. Instead I picked something simpler that deals only with Node.js. Before that, though, it makes sense to brush up on some Javascript fundamentals.

Javascript objects

Yes. Javascript is an object-oriented language. But it’s different from traditional classical OO languages like Java and Ruby.

  1. One obvious difference is the syntax.
  2. The other major one is that other languages have methods, while Javascript has first-class functions.

First-class functions. What does that mean? It means that functions are expressions: they can be assigned to variables and passed around easily. Does that sound like a closure in Ruby? It does indeed. Actually, it’s a little more than that; I will come back to this some other time. For now, let’s find out how we can create objects and use them. I will show you two ways to do it.
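To make that concrete, here is a minimal sketch of functions being treated as plain values (the names here are made up for illustration):

```javascript
// A function assigned to a variable, just like any other value
var greet = function(name) {
	return 'Hello, ' + name;
};

// A function passed as an argument to another function
function applyTwice(fn, value) {
	return fn(fn(value));
}

var shout = function(s) { return s + '!'; };

console.log(greet('Node'));              // Hello, Node
console.log(applyTwice(shout, 'run'));   // run!!
```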

The Classical way

Here is a constructor function for a Shape object. It accepts two parameters and saves them into respective instance variables.

function Shape(width, height) {
	this.width = width;        // instance variable width
	this.height = height;      // instance variable height
	this.getArea = function() {     // function to calculate Area, notice the assignment.
		return this.width * this.height;
	};
}

var rectangle = new Shape(2, 5);     // instantiate a new Shape object
console.log(rectangle.getArea());    // calculate the area: 10

Javascript uses prototype chains to let you add new functions or variables to an object on the fly. You can read more about this here: http://www.packtpub.com/article/using-prototype-property-in-javascript

I will add a new function to calculate the perimeter of my Shape object.

Shape.prototype.getPerimeter = function() {
	return 2 * (this.width + this.height);
};

console.log(rectangle.getPerimeter());   // 14

What happened here? Notice that even though ‘rectangle’ had already been created, it could still access the newly added perimeter function. Wasn’t that awesome? Javascript is intelligent, dude. When you ask for a property, it looks at the current object first, and if it’s not found there, it walks up the object’s prototype chain looking for what you asked. And since we added the new function to the prototype, it’s found seamlessly. There is a lot of interesting stuff going on here; you should read about it. I would suggest buying Manning’s Secrets of the JavaScript Ninja, if you are really serious about it.
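You can actually see where the lookup lands with `hasOwnProperty`, which distinguishes an object’s own properties from ones inherited via the prototype. A small self-contained sketch (it restates the Shape pieces so the snippet runs on its own):

```javascript
function Shape(width, height) {
	this.width = width;
	this.height = height;
}
Shape.prototype.getArea = function() {
	return this.width * this.height;
};

var rect = new Shape(2, 5);

// 'width' lives on the instance itself...
console.log(rect.hasOwnProperty('width'));    // true
// ...but 'getArea' is found by walking up the prototype chain
console.log(rect.hasOwnProperty('getArea'));  // false
console.log(rect.getArea());                  // 10
```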

Now, let’s try to extend Shape. I will create a new constructor function for Square.

function Square(side){
	this.width = side;
	this.height = side;
}

Square.prototype = new Shape();

var sq = new Square(4);
console.log(sq.getArea());

I created a new Square constructor and replaced its prototype with an instance of Shape, so Square picks up all of Shape’s functionality and behavior. Easy… huh?
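One wrinkle with `Square.prototype = new Shape();` is that the parent constructor runs with no arguments at that point. A common alternative, sketched below under the same Shape definition, is to invoke the parent constructor from the child with `call`, so the instance variables are set per object:

```javascript
function Shape(width, height) {
	this.width = width;
	this.height = height;
	this.getArea = function() {
		return this.width * this.height;
	};
}

function Square(side) {
	Shape.call(this, side, side);  // borrow Shape's constructor for this instance
}

var sq = new Square(4);
console.log(sq.getArea());  // 16
```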

The Prototypal way

Let’s do the same thing without using constructors now. Just plain prototypes!

var Shape = {
	getArea: function () {
		return this.width * this.height;
	},
	getPerimeter: function() {
		return 2 * (this.width + this.height);
	}
};

var rec = Object.create(Shape);
rec.width = 2;
rec.height = 5;

console.log(rec.getArea());

Now that you have the Shape object, you can easily add new functions to it, or even use it as the prototype of another object. However, I find this approach a little clumsy; I would rather stick to the classical way. Take your pick. To each his own!
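For completeness, here is a sketch of inheriting the prototypal Shape into another object; `SquareProto` and `setSide` are made-up names for illustration:

```javascript
var Shape = {
	getArea: function() {
		return this.width * this.height;
	}
};

// A new object whose prototype is Shape
var SquareProto = Object.create(Shape);
SquareProto.setSide = function(side) {
	this.width = side;
	this.height = side;
	return this;
};

// Objects created from SquareProto inherit getArea through the chain
var sq = Object.create(SquareProto).setSide(4);
console.log(sq.getArea());  // 16
```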

Node.js Modules

Node uses the CommonJS module system: a simple module-loading scheme where files and modules are in one-to-one correspondence. Here is the API: http://nodejs.org/api.html. The example above can be ported to a Node.js module like this:

First, create Shape.js

function Shape(width, height) {
	this.width = width;        // instance variable width
	this.height = height;      // instance variable height
	this.getArea = function() {     // function to calculate Area, notice the assignment.
		return this.width * this.height;
	};
}

// Export this constructor (note: module.exports, not exports.module)
module.exports = Shape;

And now, use it:

var Shape = require('./Shape');

var rectangle = new Shape(2, 5);
console.log(rectangle.getArea());

Node.js loads and runs each module in its own sandbox, which staves off name collisions. That’s the benefit you get, apart from a properly structured code base.

Writing a screen scraping application

I will write a simple application to capture details from various websites. The beautiful thing is that Javascript has been handling DOM objects for years; in fact, Javascript was created to manipulate the DOM. No wonder it’s more mature at this than any other HTML parsing library. And given the many elegant frameworks available, like Prototype, MooTools, and jQuery, scraping websites with Node.js should be easy and fun. Let’s do it. Let’s write an application to collect data from various book-selling websites.

Create a basic searcher.js module. It provides the fundamental skeleton for writing website-specific scrapers.

// External Modules
var request = require('ahr'), // Abstract-HTTP-request https://github.com/coolaj86/abstract-http-request
sys = require('sys'),		// System
events = require('events'),	// EventEmitter
jsdom = require('jsdom');	// JsDom https://github.com/tmpvar/jsdom

var jQueryPath = 'http://code.jquery.com/jquery-1.4.2.min.js';
var headers = {'content-type':'application/json', 'accept': 'application/json'};

// Export searcher
module.exports = Searcher;

function Searcher(param) {
	if (param.headers) {
		this.headers = param.headers;
	} else {
		this.headers = headers;
	}

	this.merchantName = param.merchantName;
	this.merchantUrl = param.merchantUrl;
	this.id = param.merchantUrl;
}

// Inherit from EventEmitter
Searcher.prototype = new events.EventEmitter();

Searcher.prototype.search = function(query, collector) {
	var self = this;
	var url = self.getSearchUrl(query);

	console.log('Connecting to... ' + url);

	request({uri: url, method: 'GET', headers: self.headers, timeout: 10000}, function(err, response, html) {
		if (err) {
			self.onError({error: err, searcher: self});
			self.onComplete({searcher: self});
		} else {
			console.log('Fetched content from... ' + url);
			// create DOM window from HTML data
			var window = jsdom.jsdom(html).createWindow();
			// load jquery with DOM window and call the parser!
			jsdom.jQueryify(window, jQueryPath, function() {
				self.parseHTML(window);
				self.onComplete({searcher: self});
			});
		}
	});
}

// Implemented in inheriting class
Searcher.prototype.getSearchUrl = function(query) {
	throw "getSearchUrl() is unimplemented!";
}
// Implemented in inheriting class
Searcher.prototype.parseHTML = function(window) {
	throw "parseHTML() is unimplemented!";
}
// Emits 'item' events when an item is found.
Searcher.prototype.onItem = function(item) {
	this.emit('item', item);
}
// Emits 'complete' event when searcher is done
Searcher.prototype.onComplete = function(searcher) {
	this.emit('complete', searcher);
}
// Emit 'error' events
Searcher.prototype.onError = function(error) {
	this.emit('error', error);
}

Searcher.prototype.toString = function() {
	return this.merchantName + "(" + this.merchantUrl + ")";
}

Now, the code to scrape Rediff Books. I will name it searcher-rediff.js

var Searcher = require('./searcher');

var searcher = new Searcher({
	merchantName: 'Rediff Books',
	merchantUrl: 'http://books.rediff.com'
});

module.exports = searcher;

searcher.getSearchUrl = function(query) {
	return this.merchantUrl + "/book/" + query;
}

searcher.parseHTML = function(window) {
	var self = this;

	window.$('div[id="prod_detail"]').each(function(){
		var item  = window.$(this);

		var title = item.find('#prod_detail2').find('font[id="book-titl"]').text();
		var link = item.find('#prod_detail2').find('a').attr('href');
		var author = item.find('#prod_detail2').find('font[id="book-auth"]').text();
		var price = item.find('#prod_detail2').find('font[id="book-pric"]').text();

		self.onItem({
			title: title,
			link: link,
			author: author,
			price: price
		});
	});
}

Run it now.

var searcher = require('./searcher-rediff');

searcher.on('item', function(item){
	console.log('Item found >> ' + item)
});

searcher.on('complete', function(searcher){
	console.log('searcher done!');
});

searcher.search("Salman");

What did I do?

  1. First, I wrote a skeleton Searcher class. This class
    1. makes a request to the merchant’s search URL (built by the getSearchUrl function), then
    2. fetches the HTML data from there, then
    3. creates a DOM window object using the ‘jsdom’ module, which
    4. gets jQuery loaded into it, and then
    5. the parseHTML function is executed.
  2. Second, I wrote another class that extends Searcher and interacts with Rediff. This class implements
    1. the getSearchUrl function, to return the appropriate search URL to connect to, and
    2. the parseHTML function, to scrape data from the DOM window object. This is very interesting: you can use all your jQuery knowledge to pick elements and parse data from inside them, just like in the old days when you added styles or data to random elements.

Now, if I want to search, say, Flipkart along with Rediff, I just need to write a Flipkart-specific implementation, say searcher-flipkart.js

var Searcher = require('./searcher');

var searcher = new Searcher({
	merchantName: 'Flipkart',
	merchantUrl: 'http://www.flipkart.com'
});

module.exports = searcher;

searcher.getSearchUrl = function(query) {
	return this.merchantUrl + "/search-book" + '?query=' + query;
}

searcher.parseHTML = function(window) {
	var self = this;

	window.$('.search_result_item').each(function(){
		var item  = window.$(this);

		var title = item.find('.search_result_title').text().trim().replace(/\n/g, "");
		var link = self.merchantUrl + item.find('.search_result_title').find("a").attr('href');
		var price = item.find('.search_results_list_price').text().trim().replace(/\n/g, "");

		self.onItem({
			title: title,
			link: link,
			price: price
		});
	});
}

I have also written a Runner class to execute multiple searchers in parallel and collect the results into an array. You can find the entire source code here: https://github.com/anismiles/jsdom-based-screen-scraper. Chill!
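The Runner itself isn’t shown in this post; a minimal sketch of one (hypothetical names, assuming each searcher emits ‘item’ and ‘complete’ events as above) could look like this:

```javascript
// runner.js -- an illustrative sketch, not the actual Runner from the repo
function Runner(searchers) {
	this.searchers = searchers;
}

Runner.prototype.run = function(query, done) {
	var items = [];
	var pending = this.searchers.length;

	this.searchers.forEach(function(searcher) {
		// Collect every item any searcher finds
		searcher.on('item', function(item) {
			items.push(item);
		});
		// When the last searcher completes, hand back the collected results
		searcher.on('complete', function() {
			pending -= 1;
			if (pending === 0) {
				done(items);
			}
		});
		searcher.search(query);
	});
};

module.exports = Runner;
```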

What’s next? I am going to write about Node.js pretty feverishly, so you’d better stay posted. How about a blog engine on Riak?


Written by Animesh

November 29, 2010 at 3:47 pm

30 Responses



  2. I have a question: It’s possible to scrape sites with javascript in the page ??? What I am asking is for example:
    This is the page source.

    document.write(“Test”);

    And this is the result after javascript be processed by the parser.

    document.writeln(“Test”);

    Test

    Thank you,
    Ventura

    Jorge Ventura

    December 8, 2010 at 9:19 am

    • Ventura,

Yeah, it’s possible, but you might need a way to execute the on-page javascript in a sandbox. Node.js can easily help you do that.

      Animesh

      January 2, 2011 at 2:35 pm

you can scrape sites with JS on the page using jsdom’s jsdom.env() function.
      Check it out under the headline Easy Mode on the jsdom github page https://github.com/tmpvar/jsdom

      aaron

      March 15, 2011 at 3:13 pm


  4. I am sorry, I was trying to post HTML code but it doesn’t work here.

    Ventura

    Jorge Ventura

    December 8, 2010 at 9:22 am

  5. Great example. Thanks for sharing.

    I did run into an issue when trying to run your example for searcher-rediff.js. When it tries to create the window via jsdom, it throws a stack that starts with this error:

    TypeError: Cannot read property ‘protocol’ of undefined

    Does this mean jsdom cannot correctly parse the html anymore because it’s changed?

    Rob

    January 2, 2011 at 11:51 am

    • Rob,

      Did you check the HTML content? Is it getting fetched properly? BTW, against which URL this error comes?

      -Animesh

      Animesh

      January 2, 2011 at 2:36 pm

      • Yes. The HTML content is coming back properly. This is the url I am fetching. http://books.rediff.com/book/Salman. The error is thrown at this line in searcher.js
        var window = jsdom.jsdom(html).createWindow();

        I believe it has to do with the inline javascript call in the HTML that looks like this:

        s.src = (document.location.protocol == “https:” ? “https://sb” : “http://b”) + “.scorecardresearch.com/beacon.js”;

        Rob

        January 2, 2011 at 11:29 pm

      • I see. I will look into this and revert. However, did it work for other urls?

        Animesh

        January 3, 2011 at 10:01 am

  6. Hi Animesh, wondering if this node.js will work for this situation? http://stackoverflow.com/questions/5054818/php-page-protection-for-cron-task-only

    Is it compatible with current mysql?

    Do we have to learn up this as a whole new language? or can easily reuse certain thing in php etc?

    wonderful

    February 20, 2011 at 3:01 pm

  7. Great writeup. I’m trying to run the searcher-server code, and I keep getting:

    TypeError: Object # has no method ‘on’
    at Object. (/Users/avishai/Downloads/anismiles-jsdom-based-screen-scraper-f0c79d3/searcher-server.js:9:10)
    at param (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/middleware/router.js:146:21)
    at param (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/middleware/router.js:157:15)
    at pass (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/middleware/router.js:162:10)
    at Object.router [as handle] (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/middleware/router.js:168:6)
    at next (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/index.js:218:15)
    at Server.handle (/Users/avishai/.node_libraries/.npm/connect/0.5.10/package/lib/connect/index.js:231:3)
    at Server.emit (events.js:45:17)
    at HTTPParser.onIncoming (http.js:1078:12)
    at HTTPParser.onHeadersComplete (http.js:87:31)

    Do you know why this might be?

    Avishai

    March 4, 2011 at 1:51 am

    • Avishai, what version of Node are you using? I see that you got Connect’s 0.5.10 version, which I think should be fine.

      -Animesh

      Animesh

      March 4, 2011 at 6:39 pm

      • The fundamental reason behind this bug should be something to do with EventEmitter. Let me explain:

        1. searcher.js inherits from EventEmitter
        (Line-26) Searcher.prototype = new process.EventEmitter;

        2. searcher-rediff.js, searcher-flipkart.js and searcher-landmarkonthenet.js extend from searcher.js, so they also inherit from EventEmitter.

        3. ‘on’ method is actually defined in EventEmitter.

        So, i think, for some reason, searcher.js is not able to inherit from EventEmitter and hence the method ‘on’ is missing.

        Animesh

        March 4, 2011 at 6:42 pm


  9. Is there a good way to do this on websites that require you to log in first before running a search?

    Avishai

    April 4, 2011 at 7:02 pm

    • i think, using POST method you can easily do a login on a site.

      Animesh

      April 4, 2011 at 7:12 pm

  10. Your “Javascript objects” helped a lot. Thanks.

  11. “Square.prototype = new Shape(); ”

    Here Shape constructor is called without any argument.But its definition has two arguments -width and length. I checked it. It works. Does javascript also creates default constructors with no arguments itself?

    • No. Think of JS not as a logical/democratic world, it’s more like anarchy. :) JS, internally, accepts params as a key-val pair and when you don’t pass anything… key-val pair just goes blank. It’s not an error. And if you try to look for these params, you will see ‘undefined’. One more difference between ‘undefined’ and ‘null’… eh?

      Chill!

      Animesh

      December 30, 2011 at 10:23 am

  12. Hi, this post is really interesting, and while I’m trying to get the picture, I don’t understand how the function searcher.getSearchUrl = function(query) { return this.merchantUrl + “/book/” + query; } in searcher-rediff.js gets called. Thanks a lot.

    Yaver

    December 29, 2011 at 5:40 pm


  14. Hi Animesh,

    Sorry for being naive, would this be required to run on the server side – reason that I ask this is that I have a a need to scrape a website and show results in a mobile application using phonegap and I was wondering if this script could run on the client side or would it need to be deployed on the server side. Also could you please give an example of how to use POST for the website that requires login (I have the username and password).

    Thanks
    Tarun

    Tarun

    April 1, 2013 at 2:00 am

    • Sure you can run this on client side. However you will need to modify it a bit.

      Animesh

      April 1, 2013 at 10:09 am



  17. to greatly simplify and speed up your code, try promise-parser

    http://www.npmjs.org/package/promise-parser

    JD

    June 12, 2014 at 3:20 am

  18. you should check out promise-parser

    http://www.npmjs.org/package/promise-parser

    http://github.com/rc0x03/node-promise-parser

    Features

    Fast: uses libxml C bindings
    Lightweight: no dependencies like jQuery, cheerio, or jsdom
    Clean: promise based interface- no more nested callbacks
    Flexible: supports both CSS and XPath selectors

    JD

    June 12, 2014 at 3:23 am

