animesh kumar

Running water never grows stale. Keep flowing!

Posts Tagged ‘HPC

Cassandra – Data Model

with 7 comments

[tweetmeme source=”anismiles” only_single=false http://www.URL.com%5D

Cassandra is a completely different concept for a database. And if you are coming from the RDBMS background, you sure are going to have a tough time understanding the fundamentals, for example, you won’t find anything like tables, columns, constraints, indexes, queries etc. at least not in the since sense of relational databases. Cassandra has an altogether different approach toward DataModels.

Column

The column is the lowest/smallest data container in Cassandra. It’s a tuple (triplet) that contains a name, a value and a timestamp.

user: {
    name: user,
    value: animesh.kumar,
    timestamp: 98989L
}

Name and Value are both binary (technically byte[]) and can be of any length.

SuperColumn

To understand SuperColumn, try to look at it as a tuple which instead of containing binary values, contains a Map of unbounded Columns.

homeAddress: {
    name: homeAddress,
    value: {
        street: {
            name:street,
            value:MG-101,
            timestamp: 98989L
        },
        city: {
            name:city,
            value:Indore,
            timestamp: 98989L
        },
        landmark:{
            name:landmark,
            value:Bombay Hospital,
            timestamp: 98989L
        },
        zip:{
            name:zip,
            value:452001,
            timestamp: 98989L
        }
    }
}

SuperColumn can be summarized as a column of Columns. It has got a Map styled containers to hold unbounded number of Columns (key has to be the same as the name of the Column). Also notice that SuperColumns don’t have a timestamp component.

Grouping

Now, since we got two elementary DataModels, i.e. Column and SuperColumn, we need some mechanisms to hold them together, or group them.

Column Family

ColumnFamily is a structure that can keep an infinite number of rows – for most people with an RDBMS background – which is the structure that resembles a Table the most.

A ColumnFamily has:

  1. A name (think of the name of a Table),
  2. A map with a key (Row Identifier, like Primary Key) and
  3. A value which is a Map containing Columns.

For the Map with the columns: the key has to be the same as the name of the Column.

Profile = {
    smile.animesh:{
        username:{
            name:username,
            value:smile.animesh
        },
        email:{
            name:email,
            value:smile.animesh@gmail.com
        },
        age:{
            name:age,
            value:25
        }
    },
    ashish:{
    ...
    },
    santosh:{
    ...
    }
}

SuperColumn Family

SuperColumnFamily is easy after you have gotten through ColumnFamily. Instead of having Columns in the inner most Map we have SuperColumns. So it just adds an extra dimension.

AddressBook = {
    smile.animesh: {  // 1st row.
        ashish: {
            name:ashish,
            value:{
                street:{
                    name:street,
                    value:Dhurwa,
                },
                zip:{
                    name:zip,
                    value:9898
                },
                city:{
                    name:city,
                    value:Indore
                }
        },
        papa: {
            name:papa,
            value:{
                street:{
                    name:street,
                    value:Rajwada,
                },
                zip:{
                    name:zip,
                    value:83400
                },
                city:{
                    name:city,
                    value:Ranchi
                }
        },
    }, // end row
    ashish: {     // 2nd row.
    ...
    },
}

Keyspaces

A Keyspace is the outer most grouping of the data. From an RDBMS point of view you can compare this to your database schema, normally you have one per application.

A Keyspace contains the ColumnFamilies. There is no relationship between the ColumnFamiliies. They are just separate containers. They are NOT like tables in MySQL – you can’t join them, neither can you enforce any constraint. They are just separate containers.

Update: May 19, 2010 at 6:12 pm IST

Okay! So we have learnt the basics. You might need some time before you can start thinking in Cassandra DataModel’s terms.
Anyhow, let’s revise what we learnt in brief:

  1. Column is the basic data holding entity,
  2. SuperColumn contains a Map of Columns,
  3. ColumnFamily is where Cassandra stores all Columns; it loosely resembles databases’ Table.
  4. SuperColumn Family is just a ColumnFamily of SuperColumns.

Phew! It hasn’t yet got digested fully. It will take some time. 🙂

Next thing to learn about Cassandra is that it does NOT have any SQL like query features, so you can NOT sort the data when you are fetching it. Rather Cassandra sorts the data as soon as the data is put into the Cassandra clusters, and always remains sorted. Columns are sorted by their names, and the mechanism of sorting can be defined and controlled at the ColumnFamily’s definition, using “CompareWith” attribute.

Cassandra comes with following sorting options, though you can write your own sorting behavior it you need.

  1. BytesTypeSimple sort by byte value. No validation is performed.
  2. AsciiType:   Like BytesType, but validates that the input can be parsed as US-ASCII.
  3. UTF8TypeA string encoded as UTF8
  4. LongTypeA 64bit long
  5. LexicalUUIDType: A 128bit UUID, compared lexically (by byte value)
  6. TimeUUIDType: A 128bit version 1 UUID, compared by timestamp

Let’s try to understand it using some examples. Let’s say we have a raw Column set, i.e. which is not yet stored in Cassandra.

9:  {name: 9,  value: Ronald},
3:  {name: 3,  value: John},
15: {name: 15, value: Eric}

And, suppose that we have a ColumnFamily with UTF8Type sorting option.

<ColumnFamily CompareWith="UTF8Type" Name="Names"/>

Then, Cassandra will sort like,

15: {name: 15,  value: Eric},
3:  {name: 3,    value: John},
9:  {name: 9,   value: Ronald}

And with another ColumnFamily with LongType sorting option,

<ColumnFamily CompareWith="LongType" Name="Names"/>

Result will be like,

3:  {name: 3,  value: John},
9:  {name: 9,  value: Ronald},
15: {name: 15, value: Eric}

The same rules of sorting also get applied to SuperColumns. However, in this case we also need to specify a second sorting rule using the "CompareSubcolumnsWith" attribute for internal Columns’ sorting behavior.

For example consider following definition:

<ColumnFamily ColumnType="Super" CompareWith="UTF8Type"
CompareSubcolumnsWith="LongType" Name="Posts"/>

In this case, SuperColumns will be sorted by UTF8Type policy, and Columns by LongType policy.

If your need asks for custom sorting policies, you can easily write one.

  1. Create a Class extending org.apache.cassandra.db.marshal.AbstractType class.
  2. Package this class in a Java Archive and add it to the /lib folder of your Cassandra installation.
  3. Specify the fully qualified classname in the CompareSubcolumnsWith or CompareWith attribute. That’s it.

So, that’s all about Cassandra DataModels. Now, in our next step, we will write Cassandra client and see how deep the rabbit hole goes!

Written by Animesh

May 18, 2010 at 6:12 am

Posted in Technology

Tagged with , , ,

Cassandra – First Touch

with 2 comments

[tweetmeme source=”anismiles” only_single=false http://www.URL.com%5D

Download

Go to Apache Cassandra download page, and get yourself a latest copy.
Link: http://cassandra.apache.org/download/

Once downloaded, extract the zip file to some directory, say, D:\iLabs\apache-cassandra-0.6.1

Minimal Setup

Cassandra keeps all important information in storage-conf.xml. We will talk about this in details later, for now, let’s just tell Cassandra where to store Logs and Data?

  1. Let’s create a directory,
    D:\iLabs\cassanra-data
  2. Create 2 subdirectories, one for Logs and another for Data.
    D:\iLabs\cassanra-data\commitlog
    D:\iLabs\cassanra-data\data
  3. Modify D:/Lab/Cassandra/conf/storage-conf.xml for following information:
    <CommitLogDirectory>D:\iLabs\cassanra-data\commitlog</CommitLogDirectory>
    <DataFileDirectories>
    <DataFileDirectory>D:\iLabs\cassanra-data\data</DataFileDirectory>
    </DataFileDirectories>
    

Ignite the engine

  1. Make sure that you have JAVA_HOME set correctly.
  2. Also, make sure that port 8080 and 9160 is available. Generally, 9160 remains free, and Tomcat or Jboss might be running on 8080. Please shut down, Tomcat/Jboss or whatever server you have on 8080.
  3. Open command prompt, and go to Cassandra directory: D:\iLabs\apache-cassandra-0.6.1
  4. Run:
    D:\iLabs\apache-cassandra-0.6.1>bin\cassandra.bat -f
    Starting Cassandra Server
    Listening for transport dt_socket at address: 8888
    INFO 13:09:00,234 Auto DiskAccessMode determined to be standard
    INFO 13:09:00,531 Sampling index for D:\iLabs\cassanra-data\data\
    Keyspace1\Standard1-1-Data.db
    INFO 13:09:00,559 Sampling index for D:\iLabs\cassanra-data\data\
    system\LocationInfo-1-Data.db
    INFO 13:09:00,567 Replaying D:\iLabs\cassanra-data\commitlog\
    CommitLog-1274081403711.log
    INFO 13:09:00,607 Creating new commitlog segment D:/iLabs/cassanra-data/
    commitlog\CommitLog-1274081940607.log
    INFO 13:09:00,748 LocationInfo has reached its threshold; switching in a
    freshMemtable at CommitLogContext(file='D:/iLabs/cassanra-data/commitlog\
    CommitLog-1274081940607.log', position=133)
    INFO 13:09:00,752 Enqueuing flush of Memtable(LocationInfo)@20827431
    INFO 13:09:00,756 Writing Memtable(LocationInfo)@20827431
    INFO 13:09:00,948 Completed flushing D:\iLabs\cassanra-data\data\system\
    LocationInfo-2-Data.db
    INFO 13:09:00,996 Log replay complete
    INFO 13:09:01,046 Saved Token found: 23289801966927000784786040626191443480
    INFO 13:09:01,047 Saved ClusterName found: Test Cluster
    INFO 13:09:01,061 Starting up server gossip
    INFO 13:09:01,128 Binding thrift service to localhost/127.0.0.1:9160
    INFO 13:09:01,136 Cassandra starting up...
    

Hallelujah! Engine is revved up.

Note: it’s listening to localhost/port:9160.

Let’s prance!

Cassandra distribution comes with Cassandra CLI, which is interactive command line tool. We will use this tool to test our server.

  1. Open another command prompt, and go to Cassandra directory: D:\iLabs\apache-cassandra-0.6.1
  2. Run:
    D:\iLabs\apache-cassandra-0.6.1>bin\cassandra-cli.bat
    Starting Cassandra Client
    Welcome to cassandra CLI.
    Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
    cassandra>
  3. Now, connect with the server:
    cassandra> connect localhost/9160
    Connected to: "Test Cluster" on localhost/9160
  4. Insert a key/value:
    cassandra> set Keyspace1.Standard1['0']['msg'] = 'Hello World!'
    Value inserted.
  5. Query it back:
    cassandra> get Keyspace1.Standard1['0']['msg']
    => (column=6d7367, value=Hello World!, timestamp=1274086005825000)

We have successfully connected to Cassandra server, inserted a value, and fetched it back. But, few things might have been overwhelming, right? Let’s try to understand that.
Let’s have a quick look at the schema. Remember, it is defined in the same file we modified earlier in the setup step: D:/Lab/Cassandra/conf/storage-conf.xml
You will notice below entry there:

<Keyspace Name="Keyspace1">
<ColumnFamily Name="Standard1" CompareWith="BytesType" />

...
</Keyspace>

Keyspace1 is the name of the schema, Standard1 is a collection of rows, and each row has an ordered set of key-value pairs. We will learn more about Keyspaces and Data Schemas in the next blog.

Written by Animesh

May 17, 2010 at 11:01 am

Posted in Technology

Tagged with , , ,