Apache Lucene and Cassandra

[tweetmeme source=”anismiles” only_single=false http://www.URL.com%5D

I am trying to find ways to extend and scale Lucene to use various latest data storing mechanisms, like Cassandra, simpleDB etc. Why? Agree that Lucene is wonderful, blazingly high-performance with features like incremental indexing and all. But managing and scaling storage, reads, writes and index-optimizations sucks big time. Though we have Solr, Jboss’ Infinispan, and Berkeley’s DbDirectory etc. but the approach they have adopted is very conventional and do not leverage upon any of latest technological developments in non-relational, highly scalable and available data stores like Cassandra, couchDB etc.

And then, I came across Lucandra: an attempt to use Cassandra as an underlying data storage mechanism for Lucene. Ain’t the name(Lucene + Cassandra) say so? 🙂

Why Cassandra?

Well, Cassandra is one of the most popular and widely used “NoSql” systems.
Flexible: Cassandra is a scalable and easy to administer column-oriented data store. Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
Decentralized: Cassandra does not rely on a global file system, but uses decentralized peer to peer “Gossip”, and so, it has no single point of failure, and introducing new nodes to the cluster is dead simple.
Fault-Tolerant: Cassandra also has built-in multi-master write, replication, rack awareness, and can handle dead nodes gracefully.
Highly Available: Writes and reads offer a tunable ConsistencyLevel, all the way from “writes never fail” to “block for all replicas to be readable,” with the quorum level in the middle.
And, Cassandra has a thriving community and is at production for products like Facebook, Digg, Twitter etc.

Cool. The idea sounds awesome. But wait, before we look into how Lucandra actually implements it, let’s try to find what are the possible ways of implementation. We need to understand the Lucene stack first, and where and how it can be extended?

Lucene Stack

There are 3 elementary components, IndexReader, IndexWriter and Directory. IndexWriter writes reverse indexes of a document with the help of Directory implementation to a disk. IndexReader reads from the indexes using the same Directory.

But, there is a catch. Lucene is not very well designed and its APIs are closed.

Very poor OO design. There are classes, packages but almost no design pattern usage.
Almost no use of interfaces. Query, HitCollector etc. are all subclasses of an abstract class, so:
1. You’ll have to constantly cast your custom query objects to a Query in order to be able to use your objects in native Lucene calls.
2. It’s pain to apply AOP and auto-proxying.
Some classes which should have been inner are not, and anonymous classes are used for complex operations where you would typically need to override their behavior.

There are many more. Point is that Lucene is designed in such a way that you will upset your code purity no matter how you do it.

Anyhow, to extend Lucene, there are 2 approaches:

Either write a custom Directory implementation, or
write custom IndexReader and IndexWriter classes.

Incorporating Cassandra by writing a custom Directory

This involves extending abstract Directory class. There are many examples like Lucene Jdbc Directory, Berkeley’s DbDirectory etc. for consultation.

Incorporating Cassandra by writing custom IndexReader and IndexWriter

This is a crude approach: writing custom IndexReader and IndexWriter classes. Note again, that native Lucene’s reader/writer classes don’t implement any Interfaces and hence it will be difficult to plug and use our custom reader/writer classes in any existing code. Well, but that’s what you get. Another thing is that, native IndexReader/IndexWriter classes perform a lot of additional logic than just indexing and reading. They use analyzers to analyze the supplied document, calculate terms, term frequencies to name few. We need to make sure that we don’t miss any of these lest Lucene shouldn’t do what we expect it to do.

Lucandra follows this approach. It has written a custom IndexWriter and IndexReader classes. I am going to explore more on it, and come back with what I find there.

Read it here: Lucandra – an inside story!

Trivia

Do you know where does the name Lucene come from? Lucene is Doug Cutting‘s wife’s middle name, and her maternal grandmother’s first name. Lucene is a common Armenian first name.

And, what about Cassandra? In Greek mythology the name Cassandra means “Inflaming Men with Love” or an unheeded prophetess. She is a figure both of the epic tradition and of tragedy. Remember the movie Troy? Although, the movie was a not exactly what Odysseus wrote, but it was polluted to create more appealing cinematic drama. Read here: http://ancienthistory.about.com/cs/grecoromanmyth1/a/troymoviereview.htm

Written by Animesh

May 19, 2010 at 7:26 am

Posted in Technology

Tagged with Cassandra, Java, Lucandra, Lucene, Search, Tutorials

5 Responses

Subscribe to comments with RSS.

thx,good article!

medcl

June 22, 2010 at 1:33 pm

Reply
Hello, I used your Lucandra, When I inserted the 500w’s data, but The response time alwars greater than 1s. I don’t know is my a problem with Cassandra’s config. Do you have to be a performance test? Can you show me the test report? Could you have real-time search of idea to share with me,or other information.My email: okwangxing@gmail.com.Thank you.

wangxing

July 29, 2010 at 8:17 am

Reply
Good article!!

Can you share some of your Lucandra implementation?

Siddharth

October 3, 2010 at 6:30 pm

Reply
hi nice article
do you have any code example,please send me
i will test it

muhammed

March 18, 2011 at 8:47 pm

Reply
[…] 2. Animesh Kumar. Apache Lucene and Cassandra […]

[转]Integrating Lucene with HBase | 四号程序员

April 26, 2013 at 8:02 am

Reply

animesh kumar