Lucene | animesh kumar

Posts Tagged ‘Lucene’

Kundera: now JPA 1.0 Compatible

[tweetmeme source=”anismiles” only_single=false http://www.URL.com%5D

If you are new to Kundera, you should read Kundera: knight in the shining armor! to get a brief idea about it.

Kundera has reached a major milestone lately, so I thought to sum up the developments here. First and foremost, Kundera is now JPA 1.0 compatible, thought it doesn’t support relationships yet, it does support easy JPA style @Entity declarations and Linear JPA Queries. 🙂 Didn’t you always want to search over Cassandra?

To begin with let’s see what the changes are.

Kundera do not have @CassandraEntity annotation anymore. It now expects JPA @Entity.
Kundera specific @Id has been replaced with JPA @Id.
Kundera specific @Column has been replaced with JPA @Column.
@ColumnFamily, @SuperColumnFamily and @SuperColumn are still there, and are expected to be there for a long time to come, because JPA doesn’t have any of these ideas.
@Index is introduced to control indexing of an entity bean. You can safely ignore it and let Kundera do the defaults for you.

I would recommend you to read about Entity annotation rules discussed in the earlier post. Apart from the points mentioned above, everything remains the same: https://anismiles.wordpress.com/2010/06/30/kundera-knight-in-the-shining-armor/#general-rules

How to define an entity class?

@Entity						// makes it an entity class
@ColumnFamily("Authors")	// assign ColumnFamily type and name
public class Author {

	@Id	// row identifier
	String username;

	@Column(name = "email")	// override column-name
	String emailAddress;

	@Column
	String country;

	@Column(name = "registeredSince")
	Date registered;

	String name;

	public Author() { // must have a default constructor
	}

	// getters, setters etc.
}

There is an important deviation from JPA specification here.

Unlike JPA you must explicitly annotate fields/properties you want to persist. Any field/property that is not @Column annotated will be ignored by Kundera.
In short, the paradigm is reversed here. JPA assumes everything persist-able unless explicitly defined @Transient. Kundera expects everything transient unless explicitly defined @Column.

How to instantiate EntityManager?

Kundera expects some properties to be provided with before you can bootstrap it.

# kundera.properties
# Cassandra nodes to with Kundera will connect
kundera.nodes=localhost

#Cassandra port
kundera.port=9160

#Cassandra keyspace which Kundera will use
kundera.keyspace=Blog

#Whether or not EntityManager can have sessions, that is L1 cache.
sessionless=false

#Cassandra client implementation. It must implement com.impetus.kundera.CassandraClient
kundera.client=com.impetus.kundera.client.PelopsClient

You can define these properties in a java Map object, or in JPA persistence.xml or in a property file “kundera.properties” kept in the classpath.

Instantiating with persistence.xml > Just replace the provider with com.impetus.kundera.ejb.KunderaPersistence which extends JPA PersistenceProvider. And either provide Kundera specific properties in the xml file or keep “kundera.properties” in the classpath.

Instantiating in standard J2SE environment, with explicit Map object.

Map map = new HashMap();
map.put("kundera.nodes", "localhost");
map.put("kundera.port", "9160");
map.put("kundera.keyspace", "Blog");
map.put("sessionless", "false");
map.put("kundera.client", "com.impetus.kundera.client.PelopsClient");

EntityManagerFactory factory = new EntityManagerFactoryImpl("test", map);
EntityManager manager = factory.createEntityManager();

Instantiating in standard J2SE environment, with “Kundera.properties” file. Pass null to EntityManagerFactoryImpl and it will automatically look for the property file.
```
EntityManagerFactory factory = new EntityManagerFactoryImpl("test", null);
EntityManager manager = factory.createEntityManager();
```

Entity Operations

Once you have EntityManager object you are good to go, applying all your JPA skills. For example, if you want to find an Entity object by key,

	try {
		Author author = manager.find(Author.class, "smile.animesh");
	} catch (PersistenceException pe) {
		pe.printStackTrace();
	}

Similarly, there are other JPA methods for various operations: merge, remove etc.

JPA Query

Note: Kundera uses Lucene to index your Entities. Beneath Lucene, Kundera uses Lucandra to store the indexes in Cassandra itself. One fun implication of using Lucene is that apart from regular JPA queries, you can also run Lucene queries. 😉

Here are some indexing fundamentals:

By default, all entities are indexed along with with all @Column properties.
If you do not want to index an entity, annotate it like, @Index (index=false)
If you do not want to index a @column property of an entity, annotate it like, @Index (index=false)

That’s it. Here is an example of JPA query:

	// write a JPA Query
	String jpaQuery = "SELECT a from Author a";

	// create Query object
	Query query = manager.createQuery(jpaQuery);

	// get results
	List<Author> list = query.getResultList();
	for (Author a : list) {
		System.out.println(a.getUsername());
	}

Kundera also supports multiple “where” clauses with “AND”, “OR”, “=” and “like” operations.

	// find all Autors with email like anismiles
	String jpaQuery_for_emails_like = "SELECT a from Author a WHERE a.emailAddress like anismiles";

	// find all Authors with email like anismiles or username like anim
	String jpaQuery_for_email_or_name = "SELECT a from Author a WHERE a.emailAddress like anismiles OR a.username like anim";

I think this will enable you to play around with Kundera. I will be writing up more on how Kundera indexes various entities and how you can execute Lucene Queries in subsequent posts.

Kundera’s next milestones will be:

Implementation of JPA listeners, @PrePersist @PostPersist etc.
Implementation of Relationships, @OneToMany, @ManyToMany etc.
Implementation of Transactional support, @Transactional

Written by Animesh

July 14, 2010 at 9:51 am

Posted in Technology

Tagged with Cassandra, HPC, Kundera, Lucandra, Lucene, NoSql

Lucandra – an inside story!

with 14 comments

[tweetmeme source=”anismiles” only_single=false http://www.URL.com%5D

Lucene works with

Index,
Document,
Field and
Term.

An index contains a sequence of documents. A document is a sequence of fields. A field is a named sequence of terms. A term is a string that represents a word from text. This is the unit of search. It is composed of two elements, the text of the word, as a string, and the name of the field that the text occured in, an interned string. Note that terms may represent more than words from text fields, but also things like dates, email addresses, urls, etc.

Lucene’s index is inverted index. In normal indexes, you can look for a document to know what fields it contains. In inverted index, you look for a field to know all other documents it appears in. It’s kind of upside-down view of the world. But it makes searching blazingly fast.

On a very high level, you can think of lucene indexes as 2 buckets:

Bucket-1 keeps all the Terms (with additional info like, term frequency, position etc.) and it knows which documents have these terms.
Bucket-2 stores all leftover field info, majorly non-indexed info.

How Lucandra does it?

Lucandra needs 2 column families for each bucket described above.

Family-1 to store Term info. We call it “TermInfo”
Family-2 to store leftover info. We call it “Documents”

“TermInfo” family is a SuperColumnFamily. Each term gets stored in a separate row identified with TermKey (“index_name/field/term”) and stores SuperColumns containing Columns of various term information like, term frequency, position, offset, norms etc. This is how it looks:

"TermInfo" => {
    TermKey1:{                                        // Row 1
        docId:{
            name:docId,
            value:{
                Frequencies:{
                    name: Frequencies,
                    value: Byte[] of List[Number]
                },
                Position:{
                    name: Position,
                    value: Byte[] of List[Number]
                },
                Offsets:{
                    name: Offsets,
                    value: Byte[] of List[Number]
                },
                Norms:{
                    name: Norms,
                    value: Byte[] of List[Number]
                }
            }
        }
    },
    TermKey2 => {                                    // Row 2
    }
    ...
}

“Documents” family is a StandardColumnFamily. Each document gets stored in a separate row identified with DocId (“index_name/document_id”) and stores Columns of various storable fields. This looks like,

"Documents" => {
        DocId1: {                        // Row 1
            field1:{
                name: field1,
                value: binary storable content
            },
            field2{
                name: field2,
                value: binary storable content
            }
        },
        DocId2: {                        // Row 2
            field1:{
                name: field1,
                value: binary storable content
            },
        ...
        },
        ...
    }

The Lucandra Cassandra Keyspace looks like this:

<Keyspace Name="Lucandra">
    <ColumnFamily Name="TermInfo"
        CompareWith="BytesType"
        ColumnType="Super"
        CompareSubcolumnsWith="BytesType"
        KeysCached="10%" />
    <ColumnFamily Name="Documents"
        CompareWith="BytesType"
        KeysCached="10%" />

    <ReplicaPlacementStrategy>
        org.apache.cassandra.locator.RackUnawareStrategy
    </ReplicaPlacementStrategy>
    <ReplicationFactor>1</ReplicationFactor>
    <EndPointSnitch>
        org.apache.cassandra.locator.EndPointSnitch
    </EndPointSnitch>
</Keyspace>

Lucene has got many powerful features, like wildcards queries, result sorting, range queries etc. For Lucandra to have these features enabled, you must configure Cassandra with OrderedPreservingParitioner, i.e. OPP.

Cassandra comes with RandomPartitioner, i.e. RP by default, but

RP does NOT support Range Slices, and
If you scan through your keys, they will NOT come in order.

If you still insist on using RP, you might encounter some exceptions, and you might need to go to Lucandra source to amend range query sections.

java.lang.RuntimeException: InvalidRequestException(why:start key's md5 sorts after
end key's md5.this is not allowed; you probably should not specify end key at all,
under RandomPartitioner)
    at lucandra.LucandraTermEnum.loadTerms(LucandraTermEnum.java:217)
    at lucandra.LucandraTermEnum.skipTo(LucandraTermEnum.java:88)
    at lucandra.IndexReader.docFreq(IndexReader.java:163)
    at org.apache.lucene.search.IndexSearcher.docFreq(IndexSearcher.java:138)

This is what you need to change in Cassandra config:

<Partitioner>org.apache.cassandra.dht.OrderPreservingPartitioner</Partitioner>

Benefits

Since you can pull ranges of keys and groups of columns in Cassandra, you can really tune the performance of reads and minimize network IO for each query.
Since writes are indexed in Cassandra, and Cassandra replicates itself, you don’t need to worry about optimizing the indexes or reopening the index to see new writes. With Lucene you need to take care of optimizing your indexes from time to time, and you need to re-instantiate your Searcher object to see new writes.
So, with Cassandra underlying Lucene, you get a real-time distributed search engine.

Caveats

As we discussed in earlier post, you can extend Lucene either by implementing you own Directory class, or writing your own IndexReader and IndexWriter classes. And Lucandra does it using the former approach and it makes much more sense.

Read here: Apache Lucene and Cassandra

Benefits that Lucandra gets are because of Cassandra’s amazing capability to store and scale the key-value pairs. Directory class works in close proximity with IndexReader and IndexWriter to store and read indexes from some storage (filesystem and/or database). It generally receives huge chunks of sequential bytes, not a key-value pair, which would be difficult to store in Cassandra, and even if stored, it would not make optimum use of Cassandra.

Anyhow, given that Lucene is not very object oriented and almost never uses interfaces, using Lucandra’s IndexWriter and IndexReader seamlessly with your legacy codes will NOT be possible.

Lucandra’s IndexReader extends org.apache.lucene.index.IndexReader which makes this class fit for your legacy codes. You just need to instantiate it and then you can pass it around to your native code without much thought:

IndexReader indexReader = new IndexReader(INDEX_NAME, cassandraClient);
// Notice that the constructor is different.
IndexSearcher indexSearcher = new IndexSearcher(indexReader);

But mind you, Lucandra’s IndexReader will NOT help you walk through the indexed documents. Who needs it anyway? 😉

However, Lucandra’s IndexWriter is an independent class, and doesn’t extend or relates to org.apache.lucene.index.IndexWriter in any way. That makes it impossible to use this class in your legacy codes without re-factoring. But, to ease you pain, it does implement the methods with the same signature as native’s, e.g. addDocument, deleteDocuments etc. have the same signature. If that makes you a little happy. 🙂

Also, Lucandra attempts to re-write all related logic inside its IndexWriter, for example, logic to invoke analyzer to fetch terms, calculating term frequencies, offsets etc. This too makes Lucandra a bit weird for future portability. Whenever, Lucene introduces a new thing, or changes its logic in any way, Lucanadra will need to re-implement them. For example, Lucene recently introduced Payloads which add weights to specific terms, much like spans. It works by extending Similarity class with additional logic. Lucandra doesn’t support it. And to support, Lucandra would need to amend its code.

In short, I am trying to say that the way Lucandra is implemented it would make it difficult to inherently use any future Lucene enhancements, but – God forbid! – there is no other way around. Wish Lucene had a better structure!

Anyways, right now, Lucandra supports:

Real-Time indexing
Zero optimization
Search
Sort
Range Queries
Delete
Wildcards and other Lucene magic
Faceting/Highlighting

Apart from this, the way Lucandra uses Cassandra can also have some scalability issues with large data. You can find some clue here:
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

Performance

Lucandra claims that it’s slower that Lucene. Indexing is ~10% slower, and so is reading. However, I found it must better and faster than Lucene. I wrote comparative tests to index 15K documents, and search over the index. I ran the tests on my Dell-Latitude D520 with 3GB RAM, and Lucandra (single Cassandra node) was ~35% faster than Lucene during indexing, and ~20% for search. May be, I should try with bigger set of data.

is Lucandra production ready?

There is a Twitter search app http://sparse.ly which is built on Lucandra. This service uses Lucandra exclusively, without any relational or other sort of databases. Given the depth and breadth of twitter data, and that sparse.ly is pretty popular and stable, Lucandra does seem to be production ready.

🙂 But, may be, you should read the Caveats once more and see if you are okay with them.

Written by Animesh

May 27, 2010 at 8:03 am

Posted in Technology

Tagged with Cassandra, HPC, Java, Lucandra, Lucene

Apache Lucene and Cassandra

with 5 comments

[tweetmeme source=”anismiles” only_single=false http://www.URL.com%5D

I am trying to find ways to extend and scale Lucene to use various latest data storing mechanisms, like Cassandra, simpleDB etc. Why? Agree that Lucene is wonderful, blazingly high-performance with features like incremental indexing and all. But managing and scaling storage, reads, writes and index-optimizations sucks big time. Though we have Solr, Jboss’ Infinispan, and Berkeley’s DbDirectory etc. but the approach they have adopted is very conventional and do not leverage upon any of latest technological developments in non-relational, highly scalable and available data stores like Cassandra, couchDB etc.

And then, I came across Lucandra: an attempt to use Cassandra as an underlying data storage mechanism for Lucene. Ain’t the name(Lucene + Cassandra) say so? 🙂

Why Cassandra?

Well, Cassandra is one of the most popular and widely used “NoSql” systems.
Flexible: Cassandra is a scalable and easy to administer column-oriented data store. Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
Decentralized: Cassandra does not rely on a global file system, but uses decentralized peer to peer “Gossip”, and so, it has no single point of failure, and introducing new nodes to the cluster is dead simple.
Fault-Tolerant: Cassandra also has built-in multi-master write, replication, rack awareness, and can handle dead nodes gracefully.
Highly Available: Writes and reads offer a tunable ConsistencyLevel, all the way from “writes never fail” to “block for all replicas to be readable,” with the quorum level in the middle.
And, Cassandra has a thriving community and is at production for products like Facebook, Digg, Twitter etc.

Cool. The idea sounds awesome. But wait, before we look into how Lucandra actually implements it, let’s try to find what are the possible ways of implementation. We need to understand the Lucene stack first, and where and how it can be extended?

Lucene Stack

There are 3 elementary components, IndexReader, IndexWriter and Directory. IndexWriter writes reverse indexes of a document with the help of Directory implementation to a disk. IndexReader reads from the indexes using the same Directory.

But, there is a catch. Lucene is not very well designed and its APIs are closed.

Very poor OO design. There are classes, packages but almost no design pattern usage.
Almost no use of interfaces. Query, HitCollector etc. are all subclasses of an abstract class, so:
1. You’ll have to constantly cast your custom query objects to a Query in order to be able to use your objects in native Lucene calls.
2. It’s pain to apply AOP and auto-proxying.
Some classes which should have been inner are not, and anonymous classes are used for complex operations where you would typically need to override their behavior.

There are many more. Point is that Lucene is designed in such a way that you will upset your code purity no matter how you do it.

Anyhow, to extend Lucene, there are 2 approaches:

Either write a custom Directory implementation, or
write custom IndexReader and IndexWriter classes.

Incorporating Cassandra by writing a custom Directory

This involves extending abstract Directory class. There are many examples like Lucene Jdbc Directory, Berkeley’s DbDirectory etc. for consultation.

Incorporating Cassandra by writing custom IndexReader and IndexWriter

This is a crude approach: writing custom IndexReader and IndexWriter classes. Note again, that native Lucene’s reader/writer classes don’t implement any Interfaces and hence it will be difficult to plug and use our custom reader/writer classes in any existing code. Well, but that’s what you get. Another thing is that, native IndexReader/IndexWriter classes perform a lot of additional logic than just indexing and reading. They use analyzers to analyze the supplied document, calculate terms, term frequencies to name few. We need to make sure that we don’t miss any of these lest Lucene shouldn’t do what we expect it to do.

Lucandra follows this approach. It has written a custom IndexWriter and IndexReader classes. I am going to explore more on it, and come back with what I find there.

Read it here: Lucandra – an inside story!

Trivia

Do you know where does the name Lucene come from? Lucene is Doug Cutting‘s wife’s middle name, and her maternal grandmother’s first name. Lucene is a common Armenian first name.

And, what about Cassandra? In Greek mythology the name Cassandra means “Inflaming Men with Love” or an unheeded prophetess. She is a figure both of the epic tradition and of tragedy. Remember the movie Troy? Although, the movie was a not exactly what Odysseus wrote, but it was polluted to create more appealing cinematic drama. Read here: http://ancienthistory.about.com/cs/grecoromanmyth1/a/troymoviereview.htm

Written by Animesh

May 19, 2010 at 7:26 am

Posted in Technology

Tagged with Cassandra, Java, Lucandra, Lucene, Search, Tutorials

animesh kumar

Posts Tagged ‘Lucene’

Kundera: now JPA 1.0 Compatible

How to define an entity class?

How to instantiate EntityManager?

Entity Operations

JPA Query

Lucandra – an inside story!

How Lucandra does it?

Benefits

Caveats

Performance

is Lucandra production ready?

Apache Lucene and Cassandra

about me

Pages

categories

recent posts

archives

calendar

tags

search my blog

follow me

email subscription

tweeps @anismiles

bookmarks

disclaimer