animesh kumar

Running water never grows stale. Keep flowing!

Posts Tagged ‘Tutorials’

Connecting to Cassandra – 1

with 13 comments

Cassandra uses the Apache Thrift framework as its client API. Apache Thrift is a remote procedure call framework for “scalable cross-language services development”. You define data types and service interfaces in a Thrift definition file, from which the compiler generates code in your chosen languages. Effectively, it combines a software stack with a code generation engine to build services that work efficiently and seamlessly across a number of languages.

Apache Thrift, though a state-of-the-art engineering feat, is not the best choice for a client API, especially for Cassandra.

  1. A Cassandra cluster has multiple nodes, and a client can connect to any of them at any time. This is an amazing thing: if a node goes down, the client can connect to any other available node without bringing the system down. Alas! Apache Thrift doesn’t support this inherently; you need to make your client aware of node failures and write a strategy to pick the next live node.
  2. Thrift doesn’t support connection pooling. So you either connect to the server every time, or keep a connection alive for a long period of time. Or, perhaps, write your own connection pool. Sad!
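The failover strategy from point 1 can be sketched without any Thrift code. This is only a minimal sketch: `NodeConnector` is a hypothetical callback standing in for "open a Thrift connection to this node", and `FailoverPicker` is an illustrative name, not something Thrift or Cassandra provides.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for "open a Thrift connection to host:port".
// Returns true if the node accepted the connection.
interface NodeConnector {
    boolean tryConnect(String hostAndPort);
}

// Walks a fixed list of nodes and returns the first one that is alive.
class FailoverPicker {
    private final List<String> nodes;

    FailoverPicker(List<String> nodes) {
        this.nodes = nodes;
    }

    // Returns the first reachable node, or null if every node is down.
    String pickAlive(NodeConnector connector) {
        for (String node : nodes) {
            if (connector.tryConnect(node)) {
                return node;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FailoverPicker picker = new FailoverPicker(
                Arrays.asList("10.0.0.1:9160", "10.0.0.2:9160", "10.0.0.3:9160"));
        // Pretend the first node is down; the addresses are made up.
        String alive = picker.pickAlive(node -> !node.startsWith("10.0.0.1:"));
        System.out.println("Connected to " + alive);
    }
}
```

A real client would also remember which nodes failed recently and retry them later; that bookkeeping is left out here.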

There are a few clients available which make these things easier for you. They are wrappers over Thrift that save you from a lot of nuisance. Anyhow, since even those clients work on top of Thrift, it makes sense to learn Thrift to make our foundation strong.

Let’s first create a dummy Keyspace for ourselves:

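<Keyspace Name="AddressBook">
    <ColumnFamily CompareWith="UTF8Type" Name="Users" />

    <!-- Necessary for Cassandra -->
    <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
    <ReplicationFactor>1</ReplicationFactor>
    <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>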

We created a new Keyspace “AddressBook” which has a ColumnFamily “Users” whose columns are sorted by the “UTF8Type” policy.

Connect to Cassandra Server:

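import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.protocol.TProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;
import org.apache.thrift.transport.TTransportException;

private TTransport transport = null;
private Cassandra.Client client = null;

public Cassandra.Client connect(String host, int port) {
    try {
        transport = new TSocket(host, port);
        TProtocol protocol = new TBinaryProtocol(transport);
        client = new Cassandra.Client(protocol); // assign the field, don't shadow it
        transport.open();
        return client;
    } catch (TTransportException e) {
        return null;
    }
}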

The above code is pretty fundamental:

  1. Opens up a Socket at the given host and port.
  2. Defines a protocol, in this case, it’s binary.
  3. And instantiates the client object.
  4. Returns client object for further operations.

Note: Cassandra uses “9160” as its default port.

Disconnect from Cassandra Server:

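public void disconnect() {
    try {
        if (null != transport) {
            transport.flush(); // push out anything still buffered
            transport.close();
        }
    } catch (TTransportException e) { /* nothing more to do while closing */ }
}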

To close the connection in a decent way, you should invoke “flush” to take care of any data that might still be in the transport buffer.

Store a data object:

Let’s say, our User object is something like below:

public class User {
    // unique
    private String username;
    private String email;
    private String phone;
    private String zip;

    // getters and setters here.
}

To model one User in Cassandra, we need 3 columns to store email, phone and zip, and the row key would be the username. Right? Let’s create a list to store these columns.

List<ColumnOrSuperColumn> columns = new ArrayList<ColumnOrSuperColumn>();

The List contains ColumnOrSuperColumn objects. Cassandra gives us an aggregate object which can contain either a Column or a SuperColumn. You wonder why? Because Apache Thrift doesn’t support inheritance. Anyway, now we will create the columns and store them in this list.

// generate a timestamp.
long timestamp = new Date().getTime();
ColumnOrSuperColumn c = null;

// add email
c = new ColumnOrSuperColumn();
c.setColumn(new Column("email".getBytes("utf-8"), user.getEmail().getBytes("utf-8"), timestamp));
columns.add(c);

// add phone
c = new ColumnOrSuperColumn();
c.setColumn(new Column("phone".getBytes("utf-8"), user.getPhone().getBytes("utf-8"), timestamp));
columns.add(c);

// add zip
c = new ColumnOrSuperColumn();
c.setColumn(new Column("zip".getBytes("utf-8"), user.getZip().getBytes("utf-8"), timestamp));
columns.add(c);

Okay, so we have the list of columns populated. Now, we need a Map which will hold the rows, that is, the lists of columns. The key to this map will be the name of the ColumnFamily.

Map<String, List<ColumnOrSuperColumn>> data = new HashMap<String, List<ColumnOrSuperColumn>>();
data.put("Users", columns); // “Users” is our ColumnFamily Name.

Great. We have everything in place. Now, we will use client.batch_insert to store everything at once. This will create a row in the ColumnFamily, identified by the given key.

client.batch_insert( "AddressBook",          // Keyspace
                      user.getUsername(),    // Row identifier key
                      data,                  // Map which contains the list of columns.
                      ConsistencyLevel.ANY   // Consistency level. Explained below.
);

The ConsistencyLevel parameter is used for both read and write operations to determine when a request made by the client counts as successful. ConsistencyLevel.ANY means that a write is successful once it has been written to at least one node. See the Cassandra Wiki for detailed information.
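To make the levels a bit more concrete, here is a rough sketch of how many acknowledgements a request waits for at each level. The `Level` enum and `RequiredAcks` helper are hypothetical and only model the idea; they are not Cassandra’s own ConsistencyLevel class, and only four levels are covered.

```java
// Hypothetical model of the consistency levels discussed above;
// this is NOT Cassandra's own ConsistencyLevel class.
enum Level { ANY, ONE, QUORUM, ALL }

class RequiredAcks {
    // Number of acknowledgements a request waits for, given the
    // replication factor of the keyspace.
    static int forLevel(Level level, int replicationFactor) {
        switch (level) {
            case ANY:    return 1; // writes only: any node will do, even one holding just a hint
            case ONE:    return 1; // one actual replica
            case QUORUM: return replicationFactor / 2 + 1; // a majority of the replicas
            case ALL:    return replicationFactor; // every replica
            default:     throw new IllegalArgumentException("unknown level " + level);
        }
    }

    public static void main(String[] args) {
        for (Level level : Level.values()) {
            System.out.println(level + " with RF=3 waits for " + forLevel(level, 3) + " ack(s)");
        }
    }
}
```

The lower the number, the faster and more available the request, at the cost of weaker guarantees about what other clients will read.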

In the next blog, we will see how to delete and update a record in Cassandra.

Written by Animesh

May 24, 2010 at 10:42 am

Posted in Technology


Apache Lucene and Cassandra

with 5 comments


I am trying to find ways to extend and scale Lucene using the latest data storage mechanisms, like Cassandra, SimpleDB etc. Why? Agreed, Lucene is wonderful and blazingly fast, with features like incremental indexing and all. But managing and scaling storage, reads, writes and index optimizations sucks big time. We do have Solr, JBoss’ Infinispan, Berkeley’s DbDirectory etc., but the approach they have adopted is very conventional and does not leverage any of the latest developments in non-relational, highly scalable and available data stores like Cassandra, CouchDB etc.

And then, I came across Lucandra: an attempt to use Cassandra as the underlying data storage mechanism for Lucene. Doesn’t the name (Lucene + Cassandra) say so? 🙂

Why Cassandra?

  1. Well, Cassandra is one of the most popular and widely used “NoSQL” systems.
  2. Flexible: Cassandra is a scalable and easy to administer column-oriented data store. Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
  3. Decentralized: Cassandra does not rely on a global file system, but uses decentralized peer to peer “Gossip”, and so, it has no single point of failure, and introducing new nodes to the cluster is dead simple.
  4. Fault-Tolerant: Cassandra also has built-in multi-master write, replication, rack awareness, and can handle dead nodes gracefully.
  5. Highly Available: Writes and reads offer a tunable ConsistencyLevel, all the way from “writes never fail” to “block for all replicas to be readable,” with the quorum level in the middle.
  6. And, Cassandra has a thriving community and is in production at companies like Facebook, Digg, Twitter etc.

Cool. The idea sounds awesome. But wait, before we look into how Lucandra actually implements it, let’s try to find out what the possible ways of implementing it are. We need to understand the Lucene stack first, and where and how it can be extended.

Lucene Stack

There are 3 elementary components: IndexReader, IndexWriter and Directory. IndexWriter writes the inverted index of a document to disk with the help of a Directory implementation. IndexReader reads from the index using the same Directory.

But, there is a catch. Lucene is not very well designed and its APIs are closed.

  1. Very poor OO design. There are classes, packages but almost no design pattern usage.
  2. Almost no use of interfaces. Query, HitCollector etc. are all subclasses of an abstract class, so:
    1. You’ll have to constantly cast your custom query objects to a Query in order to be able to use your objects in native Lucene calls.
    2. It’s a pain to apply AOP and auto-proxying.
  3. Some classes which should have been inner are not, and anonymous classes are used for complex operations where you would typically need to override their behavior.

There are many more. Point is that Lucene is designed in such a way that you will upset your code purity no matter how you do it.


Anyhow, to extend Lucene, there are 2 approaches:

  1. Either write a custom Directory implementation, or
  2. write custom IndexReader and IndexWriter classes.

Incorporating Cassandra by writing a custom Directory

This involves extending the abstract Directory class. There are many examples to consult, like Lucene Jdbc Directory, Berkeley’s DbDirectory etc.
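The appeal of this approach is that the index-writing and index-reading code never needs to know where the bytes end up. Below is a toy sketch of that idea. Note that `ToyDirectory` and `MapDirectory` are made-up names and this is NOT Lucene’s Directory API; the real abstract class has more methods to implement.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the idea behind Lucene's Directory: a flat namespace of
// named "files". This is NOT Lucene's API, only an illustration.
interface ToyDirectory {
    void writeFile(String name, byte[] content);
    byte[] readFile(String name); // null if the file does not exist
}

// In-memory implementation. A Cassandra-backed one could, for example,
// keep each file's bytes in a column of a row keyed by the file name.
class MapDirectory implements ToyDirectory {
    private final Map<String, byte[]> files = new HashMap<String, byte[]>();

    public void writeFile(String name, byte[] content) {
        files.put(name, content.clone());
    }

    public byte[] readFile(String name) {
        byte[] content = files.get(name);
        return content == null ? null : content.clone();
    }
}

class ToyDirectoryDemo {
    // Callers only see the interface, so the storage backend can be
    // swapped without touching them.
    static String roundTrip(ToyDirectory dir, String name, String text) {
        dir.writeFile(name, text.getBytes(StandardCharsets.UTF_8));
        return new String(dir.readFile(name), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new MapDirectory(), "segments_1", "term -> doc ids"));
    }
}
```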

Incorporating Cassandra by writing custom IndexReader and IndexWriter

This is a crude approach: writing custom IndexReader and IndexWriter classes. Note again that Lucene’s native reader/writer classes don’t implement any interfaces, and hence it will be difficult to plug our custom reader/writer classes into any existing code. Well, but that’s what you get. Another thing is that the native IndexReader/IndexWriter classes do a lot more than just indexing and reading. They use analyzers to analyze the supplied document and calculate terms and term frequencies, to name a few. We need to make sure that we don’t miss any of these, or Lucene won’t do what we expect it to do.

Lucandra follows this approach. It implements custom IndexWriter and IndexReader classes. I am going to explore it further, and come back with what I find there.

Read it here: Lucandra – an inside story!


Do you know where the name Lucene comes from? Lucene is Doug Cutting’s wife’s middle name, and her maternal grandmother’s first name. Lucene is a common Armenian first name.

And what about Cassandra? In Greek mythology, Cassandra is an unheeded prophetess, and the name is said to mean “Inflaming Men with Love”. She is a figure of both the epic tradition and tragedy. Remember the movie Troy? The movie was not exactly what Homer wrote, though; the story was altered to create more appealing cinematic drama.

Written by Animesh

May 19, 2010 at 7:26 am

Posted in Technology


Cassandra – Data Model

with 7 comments


Cassandra is a completely different concept for a database. If you are coming from an RDBMS background, you are sure going to have a tough time understanding the fundamentals. For example, you won’t find anything like tables, columns, constraints, indexes, queries etc., at least not in the sense of relational databases. Cassandra has an altogether different approach toward data models.

Column

The column is the lowest/smallest data container in Cassandra. It’s a tuple (triplet) that contains a name, a value and a timestamp.

user: {
    name: user,
    value: animesh.kumar,
    timestamp: 98989L
}

Name and Value are both binary (technically byte[]) and can be of any length.
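In Java terms, the triplet looks roughly like the class below. Cassandra’s Thrift API ships its own generated Column class; `ToyColumn` is a made-up name that only mirrors its shape, and the helper assumes UTF-8 for turning strings into bytes.

```java
import java.nio.charset.StandardCharsets;

// Illustrative model of the (name, value, timestamp) triplet described above.
class ToyColumn {
    final byte[] name;     // binary, any length
    final byte[] value;    // binary, any length
    final long timestamp;

    ToyColumn(byte[] name, byte[] value, long timestamp) {
        this.name = name;
        this.value = value;
        this.timestamp = timestamp;
    }

    // Convenience factory for the common case of text names and values.
    static ToyColumn of(String name, String value, long timestamp) {
        return new ToyColumn(name.getBytes(StandardCharsets.UTF_8),
                             value.getBytes(StandardCharsets.UTF_8),
                             timestamp);
    }

    @Override
    public String toString() {
        return new String(name, StandardCharsets.UTF_8) + " = "
                + new String(value, StandardCharsets.UTF_8) + " @ " + timestamp;
    }

    public static void main(String[] args) {
        System.out.println(ToyColumn.of("user", "animesh.kumar", 98989L));
    }
}
```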

SuperColumn

To understand SuperColumn, try to look at it as a tuple which, instead of containing a binary value, contains a Map of an unbounded number of Columns.

homeAddress: {
    name: homeAddress,
    value: {
        street: {
            timestamp: 98989L
        city: {
            timestamp: 98989L
            value:Bombay Hospital,
            timestamp: 98989L
            timestamp: 98989L

SuperColumn can be summarized as a column of Columns. It has a Map-style container to hold an unbounded number of Columns (the key has to be the same as the name of the Column). Also notice that SuperColumns don’t have a timestamp component.


Now that we have two elementary data models, i.e. Column and SuperColumn, we need some mechanism to hold them together, or group them.

Column Family

ColumnFamily is a structure that can hold an unbounded number of rows. For most people with an RDBMS background, it is the structure that most resembles a Table.

A ColumnFamily has:

  1. A name (think of the name of a Table),
  2. A map with a key (Row Identifier, like Primary Key) and
  3. A value which is a Map containing Columns.

For the Map with the columns: the key has to be the same as the name of the Column.

Profile = {

SuperColumn Family

SuperColumnFamily is easy once you have gotten through ColumnFamily. Instead of having Columns in the innermost Map, we have SuperColumns. So it just adds an extra dimension.

AddressBook = {
    smile.animesh: {  // 1st row.
        ashish: {
        papa: {
    }, // end row
    ashish: {     // 2nd row.
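If it helps, both families can be pictured as nested Java maps. The sketch below uses Strings instead of byte[] and drops timestamps to stay readable; the row keys, names and values in it are made up purely for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

class FamilyShapes {
    // ColumnFamily: row key -> (column name -> value)
    static Map<String, Map<String, String>> columnFamily() {
        Map<String, String> columns = new TreeMap<String, String>();
        columns.put("email", "user1@example.com");
        columns.put("zip", "00000");

        Map<String, Map<String, String>> rows = new TreeMap<String, Map<String, String>>();
        rows.put("user1", columns);
        return rows;
    }

    // SuperColumnFamily: row key -> (super column name -> (column name -> value))
    static Map<String, Map<String, Map<String, String>>> superColumnFamily() {
        Map<String, String> columns = new TreeMap<String, String>();
        columns.put("city", "Some City");
        columns.put("street", "Some Street");

        Map<String, Map<String, String>> superColumns = new TreeMap<String, Map<String, String>>();
        superColumns.put("friend1", columns);

        Map<String, Map<String, Map<String, String>>> rows =
                new TreeMap<String, Map<String, Map<String, String>>>();
        rows.put("user1", superColumns);
        return rows;
    }

    public static void main(String[] args) {
        // Two lookups in a ColumnFamily, three in a SuperColumnFamily.
        System.out.println(columnFamily().get("user1").get("zip"));
        System.out.println(superColumnFamily().get("user1").get("friend1").get("city"));
    }
}
```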

Keyspace

A Keyspace is the outermost grouping of the data. From an RDBMS point of view, you can compare it to your database schema; normally you have one per application.

A Keyspace contains the ColumnFamilies. There is no relationship between the ColumnFamilies. They are NOT like tables in MySQL: you can’t join them, nor can you enforce any constraints between them. They are just separate containers.

Update: May 19, 2010 at 6:12 pm IST

Okay! So we have learnt the basics. You might need some time before you can start thinking in Cassandra DataModel’s terms.
Anyhow, let’s revise what we learnt in brief:

  1. Column is the basic data holding entity,
  2. SuperColumn contains a Map of Columns,
  3. ColumnFamily is where Cassandra stores all Columns; it loosely resembles a database Table.
  4. SuperColumn Family is just a ColumnFamily of SuperColumns.

Phew! It hasn’t yet got digested fully. It will take some time. 🙂

The next thing to learn about Cassandra is that it does NOT have any SQL-like query features, so you can NOT sort the data when you are fetching it. Rather, Cassandra sorts the data as soon as it is put into the cluster, and it always remains sorted. Columns are sorted by their names, and the sorting mechanism is defined in the ColumnFamily’s definition, using the “CompareWith” attribute.

Cassandra comes with the following sorting options, though you can write your own sorting behavior if you need.

  1. BytesType: Simple sort by byte value. No validation is performed.
  2. AsciiType: Like BytesType, but validates that the input can be parsed as US-ASCII.
  3. UTF8Type: A string encoded as UTF8
  4. LongType: A 64bit long
  5. LexicalUUIDType: A 128bit UUID, compared lexically (by byte value)
  6. TimeUUIDType: A 128bit version 1 UUID, compared by timestamp

Let’s try to understand it using some examples. Let’s say we have a raw Column set, i.e. one which is not yet stored in Cassandra.

9:  {name: 9,  value: Ronald},
3:  {name: 3,  value: John},
15: {name: 15, value: Eric}

And, suppose that we have a ColumnFamily with UTF8Type sorting option.

<ColumnFamily CompareWith="UTF8Type" Name="Names"/>

Then, Cassandra will sort like,

15: {name: 15,  value: Eric},
3:  {name: 3,    value: John},
9:  {name: 9,   value: Ronald}

And with another ColumnFamily with LongType sorting option,

<ColumnFamily CompareWith="LongType" Name="Names"/>

Result will be like,

3:  {name: 3,  value: John},
9:  {name: 9,  value: Ronald},
15: {name: 15, value: Eric}
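You can reproduce both orderings with a plain Java TreeMap and two comparators. This is only an analogy: `AS_TEXT` and `AS_LONG` are hypothetical stand-ins for UTF8Type and LongType, and Java compares Strings by UTF-16 code units, which gives the same result as UTF-8 for these digit-only names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

class SortDemo {
    // Stand-in for UTF8Type: compare the names as text.
    static final Comparator<String> AS_TEXT = (a, b) -> a.compareTo(b);

    // Stand-in for LongType: compare the names as 64-bit numbers.
    static final Comparator<String> AS_LONG =
            (a, b) -> Long.compare(Long.parseLong(a), Long.parseLong(b));

    // Inserts the three raw columns from above and returns their names
    // in the order the sorted map keeps them.
    static List<String> sortedNames(Comparator<String> order) {
        TreeMap<String, String> columns = new TreeMap<String, String>(order);
        columns.put("9", "Ronald");
        columns.put("3", "John");
        columns.put("15", "Eric");
        return new ArrayList<String>(columns.keySet());
    }

    public static void main(String[] args) {
        System.out.println("as text: " + sortedNames(AS_TEXT)); // [15, 3, 9]
        System.out.println("as long: " + sortedNames(AS_LONG)); // [3, 9, 15]
    }
}
```

As text, "15" comes first because the character '1' sorts before '3' and '9'; as numbers, 15 is simply the largest.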

The same rules of sorting also get applied to SuperColumns. However, in this case we also need to specify a second sorting rule using the "CompareSubcolumnsWith" attribute for internal Columns’ sorting behavior.

For example, consider the following definition:

<ColumnFamily ColumnType="Super" CompareWith="UTF8Type"
CompareSubcolumnsWith="LongType" Name="Posts"/>

In this case, SuperColumns will be sorted by UTF8Type policy, and Columns by LongType policy.

If you need a custom sorting policy, you can easily write one.

  1. Create a class that extends org.apache.cassandra.db.marshal.AbstractType.
  2. Package this class in a Java Archive (JAR) and add it to the /lib folder of your Cassandra installation.
  3. Specify the fully qualified class name in the CompareWith or CompareSubcolumnsWith attribute. That’s it.
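Since column names are raw byte[] values, the heart of any custom policy is a comparison over byte arrays. The sketch below shows only that part, as a plain java.util.Comparator<byte[]>. The class name `CaseInsensitiveOrder` is made up, and wiring the logic into an AbstractType subclass (plus whatever other methods that class requires in your Cassandra version) is deliberately left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;

// Orders column names as UTF-8 text, ignoring upper/lower case.
// A plain byte-wise sort would put "Banana" before "apple" ('B' < 'a').
class CaseInsensitiveOrder implements Comparator<byte[]> {
    public int compare(byte[] a, byte[] b) {
        String left = new String(a, StandardCharsets.UTF_8);
        String right = new String(b, StandardCharsets.UTF_8);
        return left.compareToIgnoreCase(right);
    }

    public static void main(String[] args) {
        CaseInsensitiveOrder order = new CaseInsensitiveOrder();
        byte[] apple = "apple".getBytes(StandardCharsets.UTF_8);
        byte[] banana = "Banana".getBytes(StandardCharsets.UTF_8);
        System.out.println("apple before Banana? " + (order.compare(apple, banana) < 0));
    }
}
```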

So, that’s all about Cassandra data models. Now, in our next step, we will write a Cassandra client and see how deep the rabbit hole goes!

Written by Animesh

May 18, 2010 at 6:12 am

Posted in Technology


Cassandra – First Touch

with 2 comments



Go to the Apache Cassandra download page and get yourself the latest copy.

Once downloaded, extract the zip file to some directory, say, D:\iLabs\apache-cassandra-0.6.1

Minimal Setup

Cassandra keeps all important information in storage-conf.xml. We will talk about this in detail later; for now, let’s just tell Cassandra where to store Logs and Data.

  1. Let’s create a directory,
  2. Create 2 subdirectories, one for Logs and another for Data.
  3. Modify D:/Lab/Cassandra/conf/storage-conf.xml for following information:

Ignite the engine

  1. Make sure that you have JAVA_HOME set correctly.
  2. Also, make sure that ports 8080 and 9160 are available. Generally, 9160 remains free, but Tomcat or JBoss might be running on 8080. Please shut down Tomcat, JBoss, or whatever server you have on 8080.
  3. Open command prompt, and go to Cassandra directory: D:\iLabs\apache-cassandra-0.6.1
  4. Run:
    D:\iLabs\apache-cassandra-0.6.1>bin\cassandra.bat -f
    Starting Cassandra Server
    Listening for transport dt_socket at address: 8888
    INFO 13:09:00,234 Auto DiskAccessMode determined to be standard
    INFO 13:09:00,531 Sampling index for D:\iLabs\cassanra-data\data\
    INFO 13:09:00,559 Sampling index for D:\iLabs\cassanra-data\data\
    INFO 13:09:00,567 Replaying D:\iLabs\cassanra-data\commitlog\
    INFO 13:09:00,607 Creating new commitlog segment D:/iLabs/cassanra-data/
    INFO 13:09:00,748 LocationInfo has reached its threshold; switching in a
    freshMemtable at CommitLogContext(file='D:/iLabs/cassanra-data/commitlog\
    CommitLog-1274081940607.log', position=133)
    INFO 13:09:00,752 Enqueuing flush of Memtable(LocationInfo)@20827431
    INFO 13:09:00,756 Writing Memtable(LocationInfo)@20827431
    INFO 13:09:00,948 Completed flushing D:\iLabs\cassanra-data\data\system\
    INFO 13:09:00,996 Log replay complete
    INFO 13:09:01,046 Saved Token found: 23289801966927000784786040626191443480
    INFO 13:09:01,047 Saved ClusterName found: Test Cluster
    INFO 13:09:01,061 Starting up server gossip
    INFO 13:09:01,128 Binding thrift service to localhost/
    INFO 13:09:01,136 Cassandra starting up...
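If the server refuses to start instead, a busy port is a likely suspect. The port check from step 2 can be done with a few lines of Java; this is a minimal sketch, and `PortCheck` is just an illustrative name. It treats a port as free when a server socket can be bound to it.

```java
import java.io.IOException;
import java.net.ServerSocket;

class PortCheck {
    // A port counts as free if we can bind a server socket to it.
    // The socket is closed again right away by try-with-resources.
    static boolean isFree(int port) {
        try (ServerSocket socket = new ServerSocket(port)) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (int port : new int[] {8080, 9160}) {
            System.out.println("Port " + port + (isFree(port) ? " is free" : " is in use"));
        }
    }
}
```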

Hallelujah! Engine is revved up.

Note: it’s listening on localhost, port 9160.

Let’s prance!

The Cassandra distribution comes with the Cassandra CLI, an interactive command line tool. We will use this tool to test our server.

  1. Open another command prompt, and go to Cassandra directory: D:\iLabs\apache-cassandra-0.6.1
  2. Run:
    Starting Cassandra Client
    Welcome to cassandra CLI.
    Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
  3. Now, connect with the server:
    cassandra> connect localhost/9160
    Connected to: "Test Cluster" on localhost/9160
  4. Insert a key/value:
    cassandra> set Keyspace1.Standard1['0']['msg'] = 'Hello World!'
    Value inserted.
  5. Query it back:
    cassandra> get Keyspace1.Standard1['0']['msg']
    => (column=6d7367, value=Hello World!, timestamp=1274086005825000)
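The odd-looking `column=6d7367` in that output is nothing more than the column name `msg` printed as hexadecimal bytes, presumably because Standard1 compares names as raw bytes. A small sketch to convince yourself (`HexNames` is an illustrative name, and UTF-8 is assumed for the text-to-bytes step):

```java
import java.nio.charset.StandardCharsets;

class HexNames {
    // Renders each byte as two lowercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    // Reverses toHex: every pair of hex digits becomes one byte.
    static byte[] fromHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(toHex("msg".getBytes(StandardCharsets.UTF_8)));          // 6d7367
        System.out.println(new String(fromHex("6d7367"), StandardCharsets.UTF_8)); // msg
    }
}
```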

We have successfully connected to the Cassandra server, inserted a value, and fetched it back. But a few things might have been overwhelming, right? Let’s try to understand them.
Let’s have a quick look at the schema. Remember, it is defined in the same file we modified earlier in the setup step: D:/Lab/Cassandra/conf/storage-conf.xml
You will notice the following entry there:

<Keyspace Name="Keyspace1">
    <ColumnFamily Name="Standard1" CompareWith="BytesType" />
    <!-- other ColumnFamily definitions and settings omitted -->
</Keyspace>


Keyspace1 is the name of the schema, Standard1 is a collection of rows, and each row has an ordered set of key-value pairs. We will learn more about Keyspaces and Data Schemas in the next blog.

Written by Animesh

May 17, 2010 at 11:01 am

Posted in Technology
