Lucandra – an inside story!

[tweetmeme source=”anismiles” only_single=false http://www.URL.com%5D

Lucene works with

Index,
Document,
Field and
Term.

An index contains a sequence of documents. A document is a sequence of fields. A field is a named sequence of terms. A term is a string that represents a word from text. This is the unit of search. It is composed of two elements, the text of the word, as a string, and the name of the field that the text occured in, an interned string. Note that terms may represent more than words from text fields, but also things like dates, email addresses, urls, etc.

Lucene’s index is inverted index. In normal indexes, you can look for a document to know what fields it contains. In inverted index, you look for a field to know all other documents it appears in. It’s kind of upside-down view of the world. But it makes searching blazingly fast.

On a very high level, you can think of lucene indexes as 2 buckets:

Bucket-1 keeps all the Terms (with additional info like, term frequency, position etc.) and it knows which documents have these terms.
Bucket-2 stores all leftover field info, majorly non-indexed info.

How Lucandra does it?

Lucandra needs 2 column families for each bucket described above.

Family-1 to store Term info. We call it “TermInfo”
Family-2 to store leftover info. We call it “Documents”

“TermInfo” family is a SuperColumnFamily. Each term gets stored in a separate row identified with TermKey (“index_name/field/term”) and stores SuperColumns containing Columns of various term information like, term frequency, position, offset, norms etc. This is how it looks:

"TermInfo" => {
    TermKey1:{                                        // Row 1
        docId:{
            name:docId,
            value:{
                Frequencies:{
                    name: Frequencies,
                    value: Byte[] of List[Number]
                },
                Position:{
                    name: Position,
                    value: Byte[] of List[Number]
                },
                Offsets:{
                    name: Offsets,
                    value: Byte[] of List[Number]
                },
                Norms:{
                    name: Norms,
                    value: Byte[] of List[Number]
                }
            }
        }
    },
    TermKey2 => {                                    // Row 2
    }
    ...
}

“Documents” family is a StandardColumnFamily. Each document gets stored in a separate row identified with DocId (“index_name/document_id”) and stores Columns of various storable fields. This looks like,

"Documents" => {
        DocId1: {                        // Row 1
            field1:{
                name: field1,
                value: binary storable content
            },
            field2{
                name: field2,
                value: binary storable content
            }
        },
        DocId2: {                        // Row 2
            field1:{
                name: field1,
                value: binary storable content
            },
        ...
        },
        ...
    }

The Lucandra Cassandra Keyspace looks like this:

<Keyspace Name="Lucandra">
    <ColumnFamily Name="TermInfo"
        CompareWith="BytesType"
        ColumnType="Super"
        CompareSubcolumnsWith="BytesType"
        KeysCached="10%" />
    <ColumnFamily Name="Documents"
        CompareWith="BytesType"
        KeysCached="10%" />

    <ReplicaPlacementStrategy>
        org.apache.cassandra.locator.RackUnawareStrategy
    </ReplicaPlacementStrategy>
    <ReplicationFactor>1</ReplicationFactor>
    <EndPointSnitch>
        org.apache.cassandra.locator.EndPointSnitch
    </EndPointSnitch>
</Keyspace>

Lucene has got many powerful features, like wildcards queries, result sorting, range queries etc. For Lucandra to have these features enabled, you must configure Cassandra with OrderedPreservingParitioner, i.e. OPP.

Cassandra comes with RandomPartitioner, i.e. RP by default, but

RP does NOT support Range Slices, and
If you scan through your keys, they will NOT come in order.

If you still insist on using RP, you might encounter some exceptions, and you might need to go to Lucandra source to amend range query sections.

java.lang.RuntimeException: InvalidRequestException(why:start key's md5 sorts after
end key's md5.this is not allowed; you probably should not specify end key at all,
under RandomPartitioner)
    at lucandra.LucandraTermEnum.loadTerms(LucandraTermEnum.java:217)
    at lucandra.LucandraTermEnum.skipTo(LucandraTermEnum.java:88)
    at lucandra.IndexReader.docFreq(IndexReader.java:163)
    at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:138)

This is what you need to change in Cassandra config:

<Partitioner>org.apache.cassandra.dht.OrderPreservingPartitioner</Partitioner>

Benefits

Since you can pull ranges of keys and groups of columns in Cassandra, you can really tune the performance of reads and minimize network IO for each query.
Since writes are indexed in Cassandra, and Cassandra replicates itself, you don’t need to worry about optimizing the indexes or reopening the index to see new writes. With Lucene you need to take care of optimizing your indexes from time to time, and you need to re-instantiate your Searcher object to see new writes.
So, with Cassandra underlying Lucene, you get a real-time distributed search engine.

Caveats

As we discussed in earlier post, you can extend Lucene either by implementing you own Directory class, or writing your own IndexReader and IndexWriter classes. And Lucandra does it using the former approach and it makes much more sense.

Read here: Apache Lucene and Cassandra

Benefits that Lucandra gets are because of Cassandra’s amazing capability to store and scale the key-value pairs. Directory class works in close proximity with IndexReader and IndexWriter to store and read indexes from some storage (filesystem and/or database). It generally receives huge chunks of sequential bytes, not a key-value pair, which would be difficult to store in Cassandra, and even if stored, it would not make optimum use of Cassandra.

Anyhow, given that Lucene is not very object oriented and almost never uses interfaces, using Lucandra’s IndexWriter and IndexReader seamlessly with your legacy codes will NOT be possible.

Lucandra’s IndexReader extends org.apache.lucene.index.IndexReader which makes this class fit for your legacy codes. You just need to instantiate it and then you can pass it around to your native code without much thought:

IndexReader indexReader = new IndexReader(INDEX_NAME, cassandraClient);
// Notice that the constructor is different.
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

But mind you, Lucandra’s IndexReader will NOT help you walk through the indexed documents. Who needs it anyway? 😉

However, Lucandra’s IndexWriter is an independent class, and doesn’t extend or relates to org.apache.lucene.index.IndexWriter in any way. That makes it impossible to use this class in your legacy codes without re-factoring. But, to ease you pain, it does implement the methods with the same signature as native’s, e.g. addDocument, deleteDocuments etc. have the same signature. If that makes you a little happy. 🙂

Also, Lucandra attempts to re-write all related logic inside its IndexWriter, for example, logic to invoke analyzer to fetch terms, calculating term frequencies, offsets etc. This too makes Lucandra a bit weird for future portability. Whenever, Lucene introduces a new thing, or changes its logic in any way, Lucanadra will need to re-implement them. For example, Lucene recently introduced Payloads which add weights to specific terms, much like spans. It works by extending Similarity class with additional logic. Lucandra doesn’t support it. And to support, Lucandra would need to amend its code.

In short, I am trying to say that the way Lucandra is implemented it would make it difficult to inherently use any future Lucene enhancements, but – God forbid! – there is no other way around. Wish Lucene had a better structure!

Anyways, right now, Lucandra supports:

Real-Time indexing
Zero optimization
Search
Sort
Range Queries
Delete
Wildcards and other Lucene magic
Faceting/Highlighting

Apart from this, the way Lucandra uses Cassandra can also have some scalability issues with large data. You can find some clue here:
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

Performance

Lucandra claims that it’s slower that Lucene. Indexing is ~10% slower, and so is reading. However, I found it must better and faster than Lucene. I wrote comparative tests to index 15K documents, and search over the index. I ran the tests on my Dell-Latitude D520 with 3GB RAM, and Lucandra (single Cassandra node) was ~35% faster than Lucene during indexing, and ~20% for search. May be, I should try with bigger set of data.

is Lucandra production ready?

There is a Twitter search app http://sparse.ly which is built on Lucandra. This service uses Lucandra exclusively, without any relational or other sort of databases. Given the depth and breadth of twitter data, and that sparse.ly is pretty popular and stable, Lucandra does seem to be production ready.

🙂 But, may be, you should read the Caveats once more and see if you are okay with them.

Written by Animesh

May 27, 2010 at 8:03 am

Posted in Technology

Tagged with Cassandra, HPC, Java, Lucandra, Lucene

14 Responses

Subscribe to comments with RSS.

[…] This post was mentioned on Twitter by Jake Luciani and Jake Luciani, Animesh Kumar. Animesh Kumar said: Lucandra – an inside story!: http://wp.me/pVyFz-hV […]

Tweets that mention Lucandra – an inside story! « Reclaiming Life -- Topsy.com

May 27, 2010 at 10:28 am

Reply
Thanks for this detailed writeup! I think it will help people who are new to lucene as well as lucandra.

One note I wasn’t aware of is the issue you had with random partitioner.

If you can reproduce this with the latest code let me know how.

Also not sure if you linked to the code itself.

http://github.com/tjake/Lucandra

I’ll add a link to this post as well
Jake

Jake

May 27, 2010 at 12:47 pm

Reply
- Jake,
  
  Thanks for taking interest and listing this blog to relevant places.
  
  I ran my test suite with the latest Lucandra version, and indeed, i did not find the creepy bug with RP. It seemed to work fine. Great work!
  
  -Animesh
  
  Animesh
  
  May 31, 2010 at 10:18 am
  
  Reply
well i hacked into lucandra source code and only one word “AMATEUR”!!!, just check numDocs in IndexReader.java and you will see its value is always 100000. This value is critical in calculation of IDF (wiki it) …
good luck anyway

Minh Le

June 18, 2010 at 1:00 pm

Reply
[…] Kundera uses Lucene to index your Entities. Beneath Lucene, Kundera uses Lucandra to store the indexes in Cassandra itself. One fun implication of using Lucene is that apart from […]

Kundera: now JPA 1.0 Compatible « Coding is an act of faith.

July 14, 2010 at 9:55 am

Reply
it seems that lucandra will pull inverted index from cassandra through network and do computation in the client machine.

If I have a key with about 50,000 records, each record with 10 fields for searching/sort. Will the network be the problem?

chen xinli

September 3, 2010 at 9:13 am

Reply
- While looking up inverse index, you don’t load everything in memory. So, I don’t think lucandra will load that many records. Network should not be an issue.
  
  Animesh
  
  September 3, 2010 at 11:41 am
  
  Reply
Nice write up thanks, can you clarrify something for me here.

In your notation would multiple doc ids be add like

“TermInfo” => {
TermKey1:{
docId:{
name:docId,
value:{
Frequencies:{
name: Frequencies,
value: Byte[] of List[Number]
},
Position:{
name: Position,
value: Byte[] of List[Number]
},
Offsets:{
name: Offsets,
value: Byte[] of List[Number]
},
Norms:{
name: Norms,
value: Byte[] of List[Number]
}
}
}
docId:{
name:docId,
value:{
Frequencies:{
name: Frequencies,
value: Byte[] of List[Number]
},
Position:{
name: Position,
value: Byte[] of List[Number]
},
Offsets:{
name: Offsets,
value: Byte[] of List[Number]
},
Norms:{
name: Norms,
value: Byte[] of List[Number]
}
}
}

Or have i misinterpreted how the data is represented?
I.E are doc IDs columns or rows?

Courtney

September 4, 2010 at 4:38 pm

Reply
- I think you are right, however, TermInfo and Document information will be stored in two separate keyspaces.
  
  Animesh
  
  September 5, 2010 at 10:17 am
  
  Reply
Could you post a link to the source code for your work?

Jeryl Cook

October 5, 2010 at 10:26 pm

Reply
- http://github.com/tjake/Lucandra
  
  Animesh
  
  October 6, 2010 at 8:48 am
  
  Reply
Animesh, great article. But can you please write down some basic start up guide, so that for a naive user it would be easy to setup and run Solandra. Since i am facing the same issue. I am unable to understand what should be the starting point while working with Solandra, how should i start configuring and running sample search app? Your help will be precious.

hardikmu

August 6, 2012 at 5:04 pm

Reply
Right here is the right blog for anyone who would like to understand this topic.
You understand so much its almost hard to argue with you
(not that I personally would want to…HaHa). You certainly
put a new spin on a subject that has been written about for ages.
Excellent stuff, just great!

Immigration Advice in Barking

February 21, 2013 at 9:39 pm

Reply
Greetings from Colorado! I’m bored at work so I decided to check out your blog on my iphone during lunch break. I really like the information you present here and can’t wait to take a look when I get
home. I’m shocked at how fast your blog loaded on my cell phone .. I’m not even using WIFI, just
3G .. Anyhow, excellent site!

Workers Injury Attorney Los Angeles

February 21, 2013 at 10:53 pm

Reply