animesh kumar

Running water never grows stale. Keep flowing!

Sisyphus: slow – silent – violent – slow

with 3 comments


is his climb
up the hill
tethered to Gods
stone Gods
full of gaze and words.

is his manner
on the top
he sits alone
drenched in sweat and tear
until he is pushed down.
Gods laugh and merry
fingers pointed at
their falling dolly.

is his fall
down into creeks
great Sisyphus
turned into stone
by the light of morning
doomed to disappear
without meaning.

He knows not
why he does what he does
his destiny chained before
he was even born.

Written by Animesh

December 17, 2010 at 2:50 pm

Using node.js and jquery to scrape websites

with 31 comments


I have been playing with Node.js for the last few days and am totally head over heels. Madly in love! It’s awesome to see how much you can build with how little. I have ranted about Node.js earlier and did some comparisons too. It’s fast, really fast. And it’s the plain old Javascript we have been using for many, many years now. I thought I would build a real-world application with it to see how well it holds water. At first I thought of making something on top of Riak, but that felt like running too fast. Instead I picked something simpler that deals only with Node.js. Before that, I think it makes sense to brush up on some Javascript fundamentals.

Javascript objects

Yes. Javascript is an object oriented language. But it’s different from traditional classical OO languages like Java and Ruby.

  1. One obvious difference is in syntax, and
  2. the other major one is that other languages have methods while Javascript has first-class functions.

First-class functions. What does that mean? It means that functions are expressions: they can be assigned to a variable and easily passed around. Does it sound like a closure in Ruby? It does indeed. Well spotted, though it’s a little more than that. I will come back to this some other time. For now, let’s find out how we can create objects and use them. I will show you two ways to do it.
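Before that, here is a quick sketch of what "first-class" means in practice. The names `square`, `applyTwice` and `makeCounter` are made up for this illustration.

```javascript
// A function is a value: it can be stored in a variable...
var square = function (n) {
	return n * n;
};

// ...passed to another function as an argument...
function applyTwice(fn, value) {
	return fn(fn(value));
}

// ...and returned from a function, closing over local state.
function makeCounter() {
	var count = 0;
	return function () {
		count += 1;
		return count;
	};
}

var counter = makeCounter();
counter();
console.log(applyTwice(square, 3));   // 81
console.log(counter());               // 2
```

The counter keeps its own `count` between calls, which is the closure part of the story.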

The Classical way

Here is a constructor function for object Shape. It accepts two parameters and saves them into respective instance variables.

function Shape(width, height) {
	this.width = width;        // instance variable width
	this.height = height;      // instance variable height
	this.getArea = function() {     // function to calculate Area, notice the assignment.
		return this.width * this.height;
	};
}

var rectangle = new Shape (2, 5);    // instantiate a new Shape object
console.log (rectangle.getArea());   // calculate the area: 10

Javascript uses prototype chains to add new functions or variables to an object on the fly. You should read more about this thing here:

I will add a new function to calculate the perimeter of my Shape object.

Shape.prototype.getPerimiter = function() {
	return 2 * (this.width + this.height);
};

console.log (rectangle.getPerimiter());   // calculate the perimeter: 14

What happened here? Did you notice that even though ‘rectangle’ was already defined, it could access the newly added function to calculate the perimeter? Wasn’t that awesome? Javascript is intelligent, dude. If you ask for something, it looks in the current object, and if it is not found there, it goes up the object’s prototype chain to look for what you asked for. And since we added the new function to the prototype, it’s found without any fuss. There is a lot of interesting stuff going on here; you must read about it. I would suggest buying Manning’s Javascript Ninja if you are really serious about it.
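If you want to see the lookup for yourself, the sketch below checks where each property actually lives. It repeats a trimmed-down Shape so that it runs on its own.

```javascript
function Shape(width, height) {
	this.width = width;
	this.height = height;
}

var rectangle = new Shape(2, 5);

// Added to the prototype after 'rectangle' was created.
Shape.prototype.getPerimiter = function () {
	return 2 * (this.width + this.height);
};

// 'width' lives on the object itself; 'getPerimiter' is found one step up the chain.
console.log(rectangle.hasOwnProperty('width'));                      // true
console.log(rectangle.hasOwnProperty('getPerimiter'));               // false
console.log(Object.getPrototypeOf(rectangle) === Shape.prototype);   // true
console.log(rectangle.getPerimiter());                               // 14
```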

Now, let’s try to extend Shape. I will create a new constructor function for Square.

function Square(side){
	this.width = side;
	this.height = side;
}

Square.prototype = new Shape();

var sq = new Square(4);
console.log (sq.getArea());   // calculate the area: 16

I created a new Square class and replaced its prototype with an instance of Shape, so it gets all the functionality and behavior of Shape. Easy… huh?
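Here is a self-contained sketch that checks the inheritance really works. Resetting `constructor` is my own housekeeping addition and is not needed for the example to run.

```javascript
function Shape(width, height) {
	this.width = width;
	this.height = height;
	this.getArea = function () {
		return this.width * this.height;
	};
}

function Square(side) {
	this.width = side;
	this.height = side;
}

// Every Square now delegates to a Shape instance.
Square.prototype = new Shape();
// Assumption: optional housekeeping so that sq.constructor points back at Square.
Square.prototype.constructor = Square;

var sq = new Square(4);
console.log(sq.getArea());           // 16
console.log(sq instanceof Square);   // true
console.log(sq instanceof Shape);    // true
```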

The Prototypal way

Let’s do the same thing without using constructors now. Just plain prototypes!

var Shape = {
	getArea: function () {
		return this.width * this.height;
	},
	getPerimiter: function() {
		return 2 * (this.width + this.height);
	}
};

var rec = Object.create(Shape);
rec.width = 2;
rec.height = 5;

console.log (rec.getArea());   // calculate the area: 10


Now that you have the Shape object, you can easily add new functions to its prototype chain, or even have another object inherit from it. However, I find this approach a little clumsy. I would rather stick to the classical way. Take your pick. To each his own!
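For completeness, this is roughly how inheritance looks in the prototypal style. It is only a sketch, and `setSide` is a hypothetical helper invented for this example.

```javascript
var Shape = {
	getArea: function () {
		return this.width * this.height;
	}
};

// A square is a shape whose sides are equal; 'setSide' is a made-up helper.
var Square = Object.create(Shape);
Square.setSide = function (side) {
	this.width = side;
	this.height = side;
	return this;
};

var sq = Object.create(Square).setSide(4);
console.log(sq.getArea());              // 16
console.log(Shape.isPrototypeOf(sq));   // true
```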

Node.js Modules

Node uses the CommonJS module system. It has a simple module loading system where files and modules are in one-to-one correspondence. Here is the API: The above example can be ported to the Node.js module ecosystem as explained below:

First, create Shape.js

function Shape(width, height) {
	this.width = width;        // instance variable width
	this.height = height;      // instance variable height
	this.getArea = function() {     // function to calculate Area, notice the assignment.
		return this.width * this.height;
	};
}

// Export this module
module.exports = Shape;

And now, use this

var Shape = require('./Shape');

var rectangle = new Shape (2, 5);
console.log (rectangle.getArea());

Node.js loads and runs each module in a sandbox which staves off any possible name collision. That’s the benefit you get apart from having a properly structured code base.
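You can watch this isolation at work with a small experiment. The sketch below writes two throwaway module files into a temporary directory (only so that it stays self-contained) and shows that their identically named top-level variables do not clash. The file names `a.js` and `b.js` are made up.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Write two throwaway modules that both declare a top-level variable called 'name'.
var dir = fs.mkdtempSync(path.join(os.tmpdir(), 'modules-'));
fs.writeFileSync(path.join(dir, 'a.js'),
	"var name = 'module a';\nmodule.exports = function () { return name; };\n");
fs.writeFileSync(path.join(dir, 'b.js'),
	"var name = 'module b';\nmodule.exports = function () { return name; };\n");

var a = require(path.join(dir, 'a.js'));
var b = require(path.join(dir, 'b.js'));

// Each module kept its own 'name'; neither overwrote the other.
console.log(a());   // module a
console.log(b());   // module b
```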

Writing a screen scraping application

I will write a simple application to capture details from various websites. The beautiful thing is Javascript has been handling DOM objects for years. In fact Javascript was created to handle DOM objects. No wonder that it’s more mature than any other html parsing library. Also, given that there are many elegant frameworks like Prototype, Mootools, JQuery etc. available to use, scraping websites with Node.js should be easy and fun. Let’s do it. Let’s write an application to collect data from various book selling websites.

Create a basic searcher.js module. It would provide the fundamental skeleton for writing website specific tool.

// External Modules
var request = require('ahr'), // Abstract-HTTP-request
sys = require('sys'),		// System
events = require('events'),	// EventEmitter
jsdom = require('jsdom');	// JsDom

var jQueryPath = '';
var headers = {'content-type':'application/json', 'accept': 'application/json'};

// Export searcher
module.exports = Searcher;

function Searcher(param) {
	if (param.headers) {
		this.headers = param.headers;
	} else {
		this.headers = headers;
	}

	this.merchantName = param.merchantName;
	this.merchantUrl = param.merchantUrl;
}

// Inherit from EventEmitter
Searcher.prototype = new events.EventEmitter();

// Fetches the search page and hands its DOM over to parseHTML().
// NOTE: the method name 'search' is an assumption.
Searcher.prototype.search = function(query, collector) {
	var self = this;
	var url = self.getSearchUrl(query);

	console.log('Connecting to... ' + url);

	request({uri: url, method: 'GET', headers: self.headers, timeout: 10000}, function(err, response, html) {
		if (err) {
			self.onError({error: err, searcher: self});
			self.onComplete({searcher: self});
		} else {
			console.log('Fetched content from... ' + url);
			// create DOM window from HTML data
			var window = jsdom.jsdom(html).createWindow();
			// load jquery with DOM window and call the parser!
			jsdom.jQueryify(window, jQueryPath, function() {
				self.parseHTML(window);
				self.onComplete({searcher: self});
			});
		}
	});
};

// Implemented in inherited class
Searcher.prototype.getSearchUrl = function(query) {
	throw "getSearchUrl() is unimplemented!";
};

// Implemented in inherited class
Searcher.prototype.parseHTML = function(window) {
	throw "parseHTML() is unimplemented!";
};

// Emits 'item' events when an item is found.
Searcher.prototype.onItem = function(item) {
	this.emit('item', item);
};

// Emits 'complete' event when searcher is done
Searcher.prototype.onComplete = function(searcher) {
	this.emit('complete', searcher);
};

// Emit 'error' events
Searcher.prototype.onError = function(error) {
	this.emit('error', error);
};

Searcher.prototype.toString = function() {
	return this.merchantName + "(" + this.merchantUrl + ")";
};

Now, the code to scrape Rediff Books. I will name it searcher-rediff.js

var Searcher = require('./searcher');

var searcher = new Searcher({
	merchantName: 'Rediff Books',
	merchantUrl: ''
});

module.exports = searcher;

searcher.getSearchUrl = function(query) {
	return this.merchantUrl + "/book/" + query;
};

searcher.parseHTML = function(window) {
	var self = this;

	// Selector matching one search result on the page.
	// The real value is not shown here; this empty string is only a placeholder.
	var resultSelector = '';

	window.$(resultSelector).each(function() {
		var item  = window.$(this);

		var title = item.find('#prod_detail2').find('font[id="book-titl"]').text();
		var link = item.find('#prod_detail2').find('a').attr('href');
		var author = item.find('#prod_detail2').find('font[id="book-auth"]').text();
		var price = item.find('#prod_detail2').find('font[id="book-pric"]').text();

		// hand every book found over to the 'item' listeners
		self.onItem({
			title: title,
			link: link,
			author: author,
			price: price
		});
	});
};

Run it now.

var searcher = require('./searcher-rediff');

searcher.on('item', function(item){
	console.log('Item found >> ' + JSON.stringify(item));
});

searcher.on('complete', function(searcher){
	console.log('searcher done!');
});

What did I do?

  1. First, I wrote a skeleton searcher class. This class makes the
    1. request to the merchant’s search URL (this URL is built in getSearchUrl function), then
    2. fetches the html data from here, then
    3. by using ‘jsdom’ module creates DOM’s window object which further
    4. gets parsed by ‘jquery’, and
    5. function parseHTML is executed.
  2. Second, I wrote another class that extends from searcher and intends to interact with Rediff. This class implements,
    1. getSearchUrl function to return appropriate search URL to connect to, and
    2. parseHTML function to scrape data from DOM’s window object. This is very interesting. You can use all your jquery knowledge to pick elements and parse data from inside the elements. Just like you did in old days when you added styles or data to random elements.
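The flow described above can be tried out without any network access or extra modules. In the sketch below a hypothetical `fetchHtml` stub stands in for the HTTP request, a regular expression stands in for jsdom and jQuery, and the merchant name, URL and HTML are all invented. It shows the shape of the pattern and is not the real searcher.

```javascript
var EventEmitter = require('events').EventEmitter;
var util = require('util');

// Base class: owns the fetch-then-parse flow and the event plumbing.
function Searcher(param) {
	EventEmitter.call(this);
	this.merchantName = param.merchantName;
	this.merchantUrl = param.merchantUrl;
}
util.inherits(Searcher, EventEmitter);

// Hypothetical stand-in for the HTTP request: returns canned HTML asynchronously.
Searcher.prototype.fetchHtml = function (url, callback) {
	setImmediate(function () {
		callback(null, '<li class="book">Dune</li><li class="book">Emma</li>');
	});
};

Searcher.prototype.search = function (query) {
	var self = this;
	self.fetchHtml(self.getSearchUrl(query), function (err, html) {
		if (err) {
			self.emit('error', err);
		} else {
			self.parseHTML(html);
		}
		self.emit('complete', self);
	});
};

// Hooks that each merchant-specific searcher must override.
Searcher.prototype.getSearchUrl = function () {
	throw new Error('getSearchUrl() is unimplemented!');
};
Searcher.prototype.parseHTML = function () {
	throw new Error('parseHTML() is unimplemented!');
};

// A merchant-specific searcher only fills in the two hooks.
var searcher = new Searcher({ merchantName: 'Example Books', merchantUrl: 'http://books.example' });
searcher.getSearchUrl = function (query) {
	return this.merchantUrl + '/search?q=' + encodeURIComponent(query);
};
searcher.parseHTML = function (html) {
	var pattern = /<li class="book">([^<]*)<\/li>/g;
	var match;
	while ((match = pattern.exec(html)) !== null) {
		this.emit('item', { title: match[1] });
	}
};

var found = [];
searcher.on('item', function (item) { found.push(item.title); });
searcher.on('complete', function () { console.log(found.join(', ')); });   // Dune, Emma
searcher.search('classic novels');
```

The base class never needs to know which merchant it is talking to; it only calls the two hooks and emits events.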

Now, if I want to search say Flipkart along with Rediff, I just need to write a Flipkart specific implementation, say searcher-flipkart.js

var Searcher = require('./searcher');

var searcher = new Searcher({
	merchantName: 'Flipkart',
	merchantUrl: ''
});

module.exports = searcher;

searcher.getSearchUrl = function(query) {
	return this.merchantUrl + "/search-book" + '?query=' + query;
};

searcher.parseHTML = function(window) {
	var self = this;

	// Selector matching one search result on the page.
	// The real value is not shown here; this empty string is only a placeholder.
	var resultSelector = '';

	window.$(resultSelector).each(function() {
		var item  = window.$(this);

		var title = item.find('.search_result_title').text().trim().replace(/\n/g, "");
		var link = self.merchantUrl + item.find('.search_result_title').find("a").attr('href');
		var price = item.find('.search_results_list_price').text().trim().replace(/\n/g, "");

		// hand every book found over to the 'item' listeners
		self.onItem({
			title: title,
			link: link,
			price: price
		});
	});
};

I have also written a Runner class to execute the multiple searchers in parallel and collect results into an array. You can find the entire source code here: Chill!
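To give a feel for the idea, here is a rough sketch of how several searchers could be run side by side with the results collected into one array. It is not the Runner from the source code mentioned above: `runAll` and `makeFakeSearcher` are hypothetical names, and the fake searchers only emit canned items after a timer.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical stand-in for a real searcher: emits canned items, then 'complete'.
function makeFakeSearcher(name, titles, delayMs) {
	var searcher = new EventEmitter();
	searcher.search = function () {
		setTimeout(function () {
			titles.forEach(function (title) {
				searcher.emit('item', { merchant: name, title: title });
			});
			searcher.emit('complete', searcher);
		}, delayMs);
	};
	return searcher;
}

// Starts every searcher at once and calls 'done' when the last one completes.
function runAll(searchers, query, done) {
	var results = [];
	var pending = searchers.length;
	if (pending === 0) {
		done(results);
		return;
	}
	searchers.forEach(function (searcher) {
		searcher.on('item', function (item) { results.push(item); });
		searcher.on('complete', function () {
			pending -= 1;
			if (pending === 0) { done(results); }
		});
		searcher.search(query);
	});
}

runAll([
	makeFakeSearcher('A', ['one', 'two'], 20),
	makeFakeSearcher('B', ['three'], 5)
], 'anything', function (results) {
	console.log(results.length);   // 3
});
```

Since nothing blocks, the total time is roughly that of the slowest searcher rather than the sum of all of them.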

What’s next? I am going to write on Node.js pretty feverishly. You better keep posted. How about a blog engine on Riak?

Written by Animesh

November 29, 2010 at 3:47 pm

WTF is node.js and what’s the fuss all about?

with 6 comments


You must have been hearing about Node.js for quite some time. Me too! Everybody is talking about it, writing about it. I am tired. So I think I should try it myself. By definition, node.js is a library written for Google’s V8 that does evented I/O. V8 is a JavaScript engine written in C++ being used in Google Chrome, and it’s veryveryvery fast.

Point to note here is evented I/O. Traditionally you would wait for input/output to finish before moving further with your execution, but in evented environment you don’t wait, rather you get informed about I/O completion and meanwhile you could do whatever you want. Cool eh? Let’s cement it with an example. Say, you want to find out the last edited time of a file. Traditionally, you would do it this way:

// read file
Stat stat = readFileStat( 'file-path' );
// operation
useStatInfo( stat );

In evented environment, you would do it this way:

readFileStat( 'file-path', function ( result ) {
	// operation
	useStatInfo( result );
} );

In this case, once the file stat has been read, the result is passed to the callback function. You don’t have to wait. Do you see that? This enables evented systems to handle a larger number of requests simultaneously, because there is no thread to spawn, no heap to allocate.
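The pseudo-code above maps onto Node’s own `fs.stat` call. The sketch below stats the node binary itself (`process.execPath`), chosen only because that file is sure to exist, and records the order in which things happen.

```javascript
var fs = require('fs');

var order = [];

// Ask for the file's metadata; the callback runs later, once the I/O has finished.
fs.stat(process.execPath, function (err, stats) {
	if (err) { throw err; }
	order.push('stat finished');
	console.log('last modified: ' + stats.mtime);
	console.log(order.join(' -> '));   // stat requested -> stat finished
});

// This line runs immediately, before the callback above.
order.push('stat requested');
```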

You have been doing this kind of thing with closures and Java anonymous classes since eternity. But JavaScript makes it all more natural and simpler. And that’s where Node.js shines. Let me list down the main things:

  1. It’s JavaScript. JavaScript’s anonymous functions and closures are perfect for callback definitions.
  2. Everything everywhere is asynchronous. There are no threads. Everything has been built up from scratch and everything is event driven.
  3. No old baggage. That is, nothing has been carried over from the old synchronous, threaded world. That’s a good thing though a little limiting right now since there aren’t many packages. But that would soon be taken care of. There is a huge community toiling here.
  4. Focus on dealing with data. You don’t have to focus on networks or protocols. Just focus on your data and your flow. Simple?
  5. It’s small.
  6. It’s fast.
  7. It’s easy.

Now don’t start thinking of Node.js as another framework like Rails, Django, Sinatra etc. Don’t. Node.js doesn’t only help you build a web application; it goes further and helps you build the application server itself. Node.js is a framework for building scalable network programs. It could run over HTTP, raw TCP or whatever. You don’t have to worry about it.


UPDATE: You should use Node Version Manager instead of bare installation.

I am using ubuntu-9.10-desktop-i386 in Oracle’s VirtualBox on Windows 7. Here is a nice tutorial to do it yourself: I think you could also use Cygwin to run Node.js, but personally I would rather not. Linux feels much easier.

  1. Ensure you have all the essentials necessary.
    sudo apt-get update
    sudo apt-get install git-core
    sudo apt-get install build-essential
  2. Clone the Node.js repository:
    git clone git://
  3. Now configure and install:
    cd node
    ./configure && make && sudo make install

That’s it. You are done. Now, let’s make ourselves a small and pretty HTTP server.

var http = require("http");

http.createServer(function(request, response) {
	response.writeHead(200, {"Content-Type": "text/html"});
	response.write("Hello World!");
	response.end();
}).listen(8080);

console.log("Server running at http://localhost:8080/");

This script uses the http module to create an HTTP server. The anonymous function passed to http.createServer is called on each request, and listen(8080) binds the server to port 8080. Save this script to a file named helloworld.js.

Now run this server,

node helloworld.js

Go to http://localhost:8080/ in your browser, you will see “Hello World!”


It all might have seemed so simple, eh? I know. It stunned me too. So I decided to benchmark it. I created identical apps in PHP (using PHP5 with Apache2 mod_php) and Node.js. Both apps rendered a single html page with similar content. I used Apache Benchmark tool to run the comparison.

ab -n 10000 -c 10
ab -n 10000 -c 10

PHP			2988.3 requests/sec
Node.js		5391.2 requests/sec

Node.js wins by a huge margin. Wondering why? Remember Node.js is an event-driven framework, so unlike servers such as Apache it doesn’t spawn a thread for each connection or even use a pool of threads. Instead it has a single thread running an event loop that executes the callbacks, so it needs only a small heap allocation and leaves a much smaller footprint.
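A tiny experiment makes the single thread visible. In this sketch a timer is due immediately, yet it cannot fire until the synchronous busy loop has given the thread back to the event loop. The 50 ms figure is arbitrary.

```javascript
var log = [];

// Queue a callback; it cannot run until the current code has returned to the event loop.
setTimeout(function () {
	log.push('timer');
	console.log(log.join(' -> '));   // start -> busy loop done -> timer
}, 0);

log.push('start');

// Synchronous work holds the single thread, so the timer has to wait.
var end = Date.now() + 50;
while (Date.now() < end) { /* spin */ }
log.push('busy loop done');
```

This is also the catch: one slow, CPU-heavy callback delays everything else queued behind it.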

So Node.js indeed handled a lot of concurrent connections like a breeze. I thought I would experiment a bit further, so I introduced a 2-second sleep. That way, there would be many connections piling up and waiting for a response.

ab -n 2500 -c 350
ab -n 2500 -c 350

PHP			27.3 requests/sec
Node.js		148.7 requests/sec

Amazing, isn’t it? Now, I am officially swept over by it.
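Why do waiting connections cost Node.js so little? The sketch below assumes the sleep is done with a timer, which is the natural way in Node.js but is my assumption about the benchmark app. It simulates 350 requests that each wait 100 ms (scaled down from 2 seconds); `handleRequest` is a made-up name.

```javascript
// Each simulated request "sleeps" without blocking: it just schedules a timer.
function handleRequest(delayMs, done) {
	setTimeout(done, delayMs);
}

var total = 350;     // same concurrency as the ab run
var delayMs = 100;   // scaled down from the 2 seconds used in the benchmark
var finished = 0;
var startedAt = Date.now();

for (var i = 0; i < total; i++) {
	handleRequest(delayMs, function () {
		finished += 1;
		if (finished === total) {
			// All waits overlap, so this is close to one delay, not 350 of them.
			console.log('all done in ~' + (Date.now() - startedAt) + ' ms');
		}
	});
}
```

A server that blocks a thread per sleeping request would need 350 threads to do the same.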


In the next blog, I will create a simple web application with Node.js and Riak. Meanwhile, if Node.js indeed aroused your curiosity, you can read more,

  1. Ryan’s presentation
  2. Node.js API
  3. How to node

Written by Animesh

November 11, 2010 at 2:09 pm

Posted in Technology


Fiddling with Cassandra 0.7-beta2

with 11 comments


I have been dilly-dallying with Cassandra 0.7 for quite some time. My intention was to build Cassandra 0.7 support into Kundera (a JPA 1.0 compliant ORM library to work with Cassandra). I must admit that I was often very upset about the lack of documentation on Cassandra and on the libraries I had planned to use, Pelops and Hector. So I decided that I should post my findings to help you.

Now since Cassandra 0.7 beta-2 has been released, I will concentrate my talk around this release.

Installing Cassandra 0.7

  • Download 0.7.0-beta2 (released on 2010-10-01) from here:
  • Extract the archive to some location, say D:\apache-cassandra-0.7.0-beta2
  • Set CASSANDRA_HOME environment variable to D:\apache-cassandra-0.7.0-beta2
  • You can also update your PATH variable to include $CASSANDRA_HOME/bin
  • Now, to start the server you would need to run this command:
    > cassandra -start

That’s it.

Okay, now that you’ve got the basics right, I would like to tell you a few important things about this new Cassandra release.

  1. Unlike 0.6.x versions, 0.7.x employs YAML instead of XML, that is, you are going to find cassandra.yaml instead of storage-conf.xml.
  2. 0.7 allows you to manage the entire cluster (Keyspaces, Column Families, everything) from the Thrift API.
  3. There is also support for Apache Avro. (I haven’t explored this though, so no more comment)
  4. 0.7 comes with secondary index features. What does it mean? It means, you can look for your data not just by Row Identifier, but also by Column Values. Interesting huh?

If you look into cassandra.yaml, you will find a default Keyspace1 and a few Column Families too, but Cassandra doesn’t load them. I am not sure why. Theoretically, everything defined in the yaml file should have been created at startup. I am going to dig into this. Anyway, for now, let’s create a Keyspace and a few Column Families ourselves. We can use the Thrift API (or the Cassandra client, which uses Thrift itself) or the JMX interface.

Dealing with Cassandra Client

Cassandra comes with a command line interface tool cassandra-cli. This tool is really really impressive. You should certainly spend some time with it.

  • Start the client,
    > cassandra-cli
  • Connect to server,
    > [default@unknown] connect localhost/9160
  • Create a new keyspace, (I picked this up from cassandra.yaml)
    > [default@unknown] create keyspace Keyspace1 with replication_factor=1
  • Create Column Families,
    > [default@unknown] use Keyspace1
    > [default@Keyspace1] create column family Standard1 with column_type = 'Standard' and comparator = 'BytesType'
  • Describe keyspace,
    > [default@Keyspace1] describe keyspace Keyspace1

And so on. Use ‘help’ to learn more about cassandra-cli.


As I mentioned above, you can also use JMX to check what Keyspaces and Column Families exist in your server. But there is a little problem. Cassandra does not come with the mx4j-tools.jar, so you need to download and copy this jar to Cassandra’s lib folder. Download it from here:

Now, just run ‘jconsole’ and pick ‘org.apache.cassandra.thrift.CassandraDaemon’ process.

Java clientèle

Well, there are two serious contenders, Pelops and Hector. Both have released experimental support for version 0.7. I had worked with Pelops earlier, so I thought it was time to give Hector a chance.

  • Download Hector (Sync release with Cassandra 0.7.0-beta2) from here:
    You can also use ‘git clone‘ to download the latest source.
  • Hector is a maven project. To compile the source into ‘jar’, just extract the release and run,
    > mvn package

My first program

To start with Hector, I thought I would write a very small program to insert a Column and then fetch it back. If you remember, in the previous section we already created a keyspace ‘Keyspace1‘ and a Column Family ‘Standard1‘, and now we are going to make use of them.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.exceptions.HectorException;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.ColumnQuery;
import me.prettyprint.hector.api.query.QueryResult;

public class HectorFirstExample {

	public static void main(String[] args) throws Exception {

		String keyspaceName = "Keyspace1";
		String columnFamilyName = "Standard1";
		String serverAddress = "localhost:9160";

		// Create Cassandra cluster
		Cluster cluster = HFactory.getOrCreateCluster("Cluster-Name", serverAddress);
		// Create Keyspace
		Keyspace keyspace = HFactory.createKeyspace(keyspaceName, cluster);

		try {
			// Mutation
			Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
			// Insert a new column with row-id 'id-1'
			mutator.insert("id-1", columnFamilyName, HFactory.createStringColumn("Animesh", "Kumar"));

			// Look up the same column
			ColumnQuery<String, String, String> columnQuery = HFactory.createStringColumnQuery(keyspace);
			columnQuery.setColumnFamily(columnFamilyName).setKey("id-1").setName("Animesh");
			QueryResult<HColumn<String, String>> result = columnQuery.execute();

			System.out.println("Read HColumn from cassandra: " + result.get());
		} catch (HectorException e) {
			e.printStackTrace();
		}
	}
}

That was simple. By the way, ‘Nate McCall‘ has written a set of example classes to help us understand Hector with Cassandra 0.7. Check it out here:

I am working towards introducing Cassandra 0.7 support in Kundera, and will be publishing my findings intermittently.

Written by Animesh

October 14, 2010 at 9:26 pm

Posted in Technology



with 2 comments


the heart is broken
in thousands pieces, she says
and this time
there is no turning back
this is no hole, she sighs, to fill up
this is a big crack.

across the window
I see things that glow
angels flying down to take me, she says
I’m going away, farther than you can reach
time and again, I tried to repair
nothing touches me now, no love, no beseech.

I sweep the floor
dirt and tears come along
I know what to do
get a box and fix it up with some glue
you can’t fix it, don’t you see,
she growls, you can’t un-rape me.

Written by Animesh

September 28, 2010 at 3:09 pm

Posted in Poetry


Lost and never found

with one comment


All that I am or I pretend
sullen dreams of bleak end

Lost and never found
fallen leaves on the ground

Truth is a lie in disguise
solemn words of a skewed voice

Flowers sprawled by the riverbed
roots entangled into the mud

I seek love and love seeks a clown
run away before you’re drowned

Written by Animesh

September 27, 2010 at 5:26 pm

Posted in Poetry


I have a paper heart

with 4 comments


I have a paper heart
too easy to rip off my chest
fold, twist and turn
city, house or nest.

Around me, I’ve made collections
of ceramic vows, muddy tears
china-stone celebrations
all cornered into
dirty crumpled affections.

You and I glued into a story
tragic like always, but fairy
inking the ends together
with dull colors of cherry

Written by Animesh

September 20, 2010 at 7:38 pm

Posted in Poetry