animesh kumar

Running water never grows stale. Keep flowing!

Apache Lucene and Cassandra

with 5 comments

I am trying to find ways to extend and scale Lucene to use various latest data storing mechanisms, like Cassandra, simpleDB etc. Why? Agree that Lucene is wonderful, blazingly high-performance with features like incremental indexing and all. But managing and scaling storage, reads, writes and index-optimizations sucks big time. Though we have Solr, Jboss’ Infinispan, and Berkeley’s DbDirectory etc. but the approach they have adopted is very conventional and do not leverage upon any of latest technological developments in non-relational, highly scalable and available data stores like Cassandra, couchDB etc.

And then, I came across Lucandra: an attempt to use Cassandra as an underlying data storage mechanism for Lucene. Ain’t the name(Lucene + Cassandra) say so? :)

Why Cassandra?

  1. Well, Cassandra is one of the most popular and widely used “NoSql” systems.
  2. Flexible: Cassandra is a scalable and easy to administer column-oriented data store. Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
  3. Decentralized: Cassandra does not rely on a global file system, but uses decentralized peer to peer “Gossip”, and so, it has no single point of failure, and introducing new nodes to the cluster is dead simple.
  4. Fault-Tolerant: Cassandra also has built-in multi-master write, replication, rack awareness, and can handle dead nodes gracefully.
  5. Highly Available: Writes and reads offer a tunable ConsistencyLevel, all the way from “writes never fail” to “block for all replicas to be readable,” with the quorum level in the middle.
  6. And, Cassandra has a thriving community and is at production for products like Facebook, Digg, Twitter etc.

Cool. The idea sounds awesome. But wait, before we look into how Lucandra actually implements it, let’s try to find what are the possible ways of implementation. We need to understand the Lucene stack first, and where and how it can be extended?

Lucene Stack

There are 3 elementary components, IndexReader, IndexWriter and Directory. IndexWriter writes reverse indexes of a document with the help of Directory implementation to a disk. IndexReader reads from the indexes using the same Directory.

But, there is a catch. Lucene is not very well designed and its APIs are closed.

  1. Very poor OO design. There are classes, packages but almost no design pattern usage.
  2. Almost no use of interfaces. Query, HitCollector etc. are all subclasses of an abstract class, so:
    1. You’ll have to constantly cast your custom query objects to a Query in order to be able to use your objects in native Lucene calls.
    2. It’s pain to apply AOP and auto-proxying.
  3. Some classes which should have been inner are not, and anonymous classes are used for complex operations where you would typically need to override their behavior.

There are many more. Point is that Lucene is designed in such a way that you will upset your code purity no matter how you do it.

Read more:
http://www.jroller.com/melix/entry/why_lucene_isn_t_that
http://lucene.grantingersoll.com/2008/03/28/why-lucene-isnt-that-good-javalobby/

Anyhow, to extend Lucene, there are 2 approaches:

  1. Either write a custom Directory implementation, or
  2. write custom IndexReader and IndexWriter classes.

Incorporating Cassandra by writing a custom Directory

This involves extending abstract Directory class. There are many examples like Lucene Jdbc Directory, Berkeley’s DbDirectory etc. for consultation.

Incorporating Cassandra by writing custom IndexReader and IndexWriter

This is a crude approach: writing custom IndexReader and IndexWriter classes. Note again, that native Lucene’s reader/writer classes don’t implement any Interfaces and hence it will be difficult to plug and use our custom reader/writer classes in any existing code. Well, but that’s what you get. Another thing is that, native IndexReader/IndexWriter classes perform a lot of additional logic than just indexing and reading. They use analyzers to analyze the supplied document, calculate terms, term frequencies to name few. We need to make sure that we don’t miss any of these lest Lucene shouldn’t do what we expect it to do.

Lucandra follows this approach. It has written a custom IndexWriter and IndexReader classes. I am going to explore more on it, and come back with what I find there.

Read it here: Lucandra – an inside story!

Trivia

Do you know where does the name Lucene come from? Lucene is Doug Cutting‘s wife’s middle name, and her maternal grandmother’s first name. Lucene is a common Armenian first name.

And, what about Cassandra? In Greek mythology the name Cassandra means “Inflaming Men with Love” or an unheeded prophetess. She is a figure both of the epic tradition and of tragedy. Remember the movie Troy? Although, the movie was a not exactly what Odysseus wrote, but it was polluted to create more appealing cinematic drama. Read here: http://ancienthistory.about.com/cs/grecoromanmyth1/a/troymoviereview.htm

About these ads

Written by Animesh

May 19, 2010 at 7:26 am

Posted in Technology

Tagged with , , , , ,

5 Responses

Subscribe to comments with RSS.

  1. thx,good article!

    medcl

    June 22, 2010 at 1:33 pm

  2. Hello, I used your Lucandra, When I inserted the 500w’s data, but The response time alwars greater than 1s. I don’t know is my a problem with Cassandra’s config. Do you have to be a performance test? Can you show me the test report? Could you have real-time search of idea to share with me,or other information.My email: okwangxing@gmail.com.Thank you.

    wangxing

    July 29, 2010 at 8:17 am

  3. Good article!!

    Can you share some of your Lucandra implementation?

    Siddharth

    October 3, 2010 at 6:30 pm

  4. hi nice article
    do you have any code example,please send me
    i will test it

    muhammed

    March 18, 2011 at 8:47 pm

  5. [...] 2. Animesh Kumar. Apache Lucene and Cassandra  [...]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 204 other followers

%d bloggers like this: