"; */ ?>


Jul 11

One Small Citrus for Man; One Giant Leaf for Mankind

– Who does not like fruits!?
– Well.. that depends. Are you talking about “a structure of a plant that contains its seeds?”
– No silly, of course not! I am talking about data bases!

© by my brain

The Right Fruit for the Right Job

Now days in order to be competent in a world of Big Data you must get at least a Masters in Fruits, or as I call it an “MF Degree”. Why!? Well how about’em fruits:

You see? Very important to know which fruit to choose for your next {m|b|tr}illion dollar gig.

To expand my MF degree I love doing research in a big data space, and as I was walking around #oscon 2011 expo, I was really pleased to discover a new sort of fruits that I have not heard of before. You would think “yea, ok.. YAFDB: Yet Another Fruit DB”, but no => this one is different => this one has a kicker, this one has a.. “leaf”!

Leafing A for C

You may notice that the above fruit DBs missing that “power of the leaf”, and look rather leafless. And in the world of NoSQL databases fruit without a leaf has somewhat inconsistent properties. Well, let’s rephrase that: eventually the leaf will grow, so we can say that eventually those fruits will look consistent.

But what if a NoSQL database already came with leaf attached to it? You can’t argue that if it did, it would have a complete, consistent look to it.

Well that is quite interesting.. Why a NoSQL database can’t have a configuration to actually be consistent? Think about it.. If the data is spread/sharded/persisted to multiple nodes using a “consistent hashing” algorithm, where clients could have a guarantee that “this” data would live on “these” set of nodes, then any time an insert/update is completed ( truly committed ), any reads for that data would know exactly where/which nodes to read this data from. Since the hash is consistent.

The answer is actually obvious => by ensuring ‘C’ in a CAP theorem via consistent hash, you would need to sacrifice some of ‘A’.. Since certain data is limited by a concrete set of nodes (that client relies on), if some of those nodes are down, DB would need to lock/bring back/reconfigure/reshuffle data, and for that “moment” that data would be unAvailable. This can be improved/tuned with replication, but the “A sacrifice” remains to be there.

Well now I can actually try out the above with this new fruit DB that I discovered @ OSCON. It’s time you meet CitrusLeaf DB

Citrus DB with a Leaf Attached

You can go ahead and read their Architecture Paper with pretty pictures and quite interesting claims, but here I’ll just mention some interesting facts that are mostly not in a paper, which I gathered from talking to CitrusLeaf dudes at OSCON. By the way, they were really open about the internals of CitrusLeaf, even though it is a closed source, commercial product. So here we go:

  • The business niche CitrusLeaf aims to conquer is “Real Time Bidding” which in short is a bidding system that offers the opportunity to dynamically bid impressions ( Online Advertisement ). More about it here: http://en.wikipedia.org/wiki/Sell_Side_Platform
  • The pattern in Real Time Bidding space is 60/40 => 60% reads and 40% writes. CitrusLeaf promises to perform equally well for reads and writes
  • They claim to perform at 200,000 Transactions Per Second per node. Claim is based on 8 byte transactions, which according to CitrusLeaf folks is the usual transaction size in Real Time Bidding world
  • CitrusLeaf can use 3 different storage strategies: DRAM, SSD and Rotation Disks. They are optimized to work with SSDs, where the above benchmark drops to 20,000 Transactions Per Second for a single SSD. In a normal setup, a node would have about 4 SSD attached, where 80,000 Transactions Per Second can be achieved
  • Clients are available in C, C#, Java, Python, PHP, and Ruby
  • CitrusLeaf is ACID compliant, and uses consistent hashing to achieve ‘C’
  • Stores data in a B-Tree, since it does more (real time) reads than writes
  • Citrusleaf can store indices for 100 million records in 7 gigabytes of DRAM
  • Pricing model is per usage => e.g. per TB. Trial release includes a tracking mechanism where the system is reporting the usage

I feel like CitrusLeaf would be a cool addition to my MF degree, besides I already came up with a slogan for them: “One small citrus for man; one giant leaf for mankind” © by my brain

Jul 11

Having Cluster Fun @ Chariot Solutions

The best way to experiment with distributed computing is to have a distributed cluster of things to play with. One approach would of course be to spin off multiple Amazon EC2 instances, which would be wise and pretty cheap:

Micro instances provide 613 MB of memory and support 32-bit and 64-bit platforms on both Linux and Windows. Micro instance pricing for On-Demand instances starts at $0.02 per hour for Linux and $0.03 per hour for Windows”

However some problems are better solved/simulated by having real, “touchable” hardware, that would have real dedicated disks, dedicated cores, RAM, and would only share any kind of state with other nodes over network. Easier said that done though.. Do you have a dozen of spare (same spec’ed) PCs laying around?

But what if you had an awesome training room with, let’s say, 10 iMacs? That would look something like:

Chariot Solutions Training Room

This is in fact the real deal => “Chariot Solutions Training Room“, which is usually occupied by people learning about Scala, Clojure, Hadoop, Spring, Hibernate, Maven, Rails, etc..

So once upon a time, in after training hours, we decided to run some distributed simulations. As we were passing by the training room, we had a thought: “It’s Friday night, and as any other creatures, these beautiful machines would definitely like to hang out together”…

Cluster at Chariot Solutions

This is one of this night’s highlights: a MongoDB playground. The same Friday night we played with Riak, Cassandra, RabbitMQ and vanilla distributed Erlang. As you can imagine iMacs had a lot of fun in a process pumping data in and out via 10 Gigabit switch. And we geeked out like real men!

Jul 11

NoRAM DB => “If It Does Not Fit in RAM, I Will Quietly Die For You”

Out of 472 NoSQL databases / distributed caches that are currently available, highly buzzed and scream that only their precious brand solves the world toughest problems.. There are only a few that have less screaming and more doing that I found so far:

Choosing a Drink

See.. Choice makes things better and worse at the same time =>

if I am thirsty, and I only have water available, 
I'll take that, satisfy my thirst, 
and come back to focusing on whatever it is I was doing

at the same time

if I am thirsty, and I have 472 sodas, and 253 juices, and 83 waters available
I'll take a two hour brake to choose the right one, satisfy my thirst, 
and come back to focusing on whatever it is I was doing

It may not seem as two different experiences at first, but they are different approaches.

Especially bad is when out of those 472 sodas, #58 has such a seductive ad on a can, and it promisses you $1,000,000 prize, if only you make a sip. So you are interested ( many people drink it too ), trying it… and spitting it out immediately => now there are 471 left to go.. remaining thirst, and such a dissatisfaction

If there is CAP, there is CRACS

NoSQL is no different now days. Choosing the right data store for your task / project / problem really depends on several factors. Let’s redefine a CAP theorem (because we can), and call it CRACS instead:

  • Cost:                    Do you have spare change for several TB of RAM?
  • Reliability:            Fault Tolerance, Availability
  • Amount of data:   Do you operate Megabytes or Terabytes, or maybe Petabytes?
  • Consistency:         Can you survive two reads @ the same millisecond return different results?
  • Speed:                  Reading, which includes various aggregates AND writing

That’s pretty much it, let me know if you have you favorite that is missing from CRACS, but at least for now, five is a good number. Of course there are other important things, like simplicity ( hello Cassandra: NOT ) and ease of development ( hello VoltDB, KDB: NOT ), and of course fun to work with, and of course great documentation, and of course great community, and of course… there are others, but above five seem to nail it.

Distributed Cache: Why Redis and Hazelcast?

Well, let’s see Redis and Hazelcast are distributed caches with an optional persistent store. They score because they do just that => CACHE, they are data structure based: e.g. List, Set, Queue.. even DistributedTask, DistributedEvent, etc. Again they are upfront with you: we do awesome CACHE => and they do. I have a good feeling about GemFire, but I have not tried it, and last time I contacted them, they did not respond, so that’s that.

NoSQL: Going Beyond RAM

See, what I really learned to dislike is “if it does not fit in RAM, I will quietly die for you” NoSQL data stores ( hello MongoDB ). And it is not just the index that should entirely fit into memory, but also the data that has not yet been fsync’ed, data that was brought back for querying, data that was fsync’ed, but still hangs around, etc..

The thing to understand when comparing Riak to MongoDB is that Riak actually writes to disk, and MongoDB writes to mmap files ( memory ). Try setting Mongo’s WriteConcern to “FSYNC_SAFE”, and now compare it to Riak => that would be a fair comparison. And having LevelDB as Riak’s backend, or even good old Bitcask, Riak will take Gold. Yes, Riak’s storage is pluggable : ).

Another thing besides that obvious Mongo “RAM problem” is JSON, I mean BSON. Don’t let this ‘B‘ fool you, a key name such as “name” will take at least 4 bytes, without even getting to the actual value for this key, which affects performance, as well as storage. Riak has protocol buffers as an alternative to JSON, which can really help with the document size. And with secondary indicies on the way, it may even prove be searchable : ).

Both Riak and MongoDB struggle with Map/Reduce: MongoDB does it in a single (!) SpiderMonkey thread, and of course it has a friendly GlobalLock that does not help, but it takes a point from Riak by having secondary indicies. But both: Mongo and Riak are in the process of rewriting their MapReduce frameworks completely, so we’ll see how it goes.

Cassandra Goodness and Awkwardness

Cassandra is however a solid piece of software, with one caveat: you have to hire DataStax guys, who are really, really, really good, by the way, but you have to pay up good money for such goodness. Otherwise, on your own, you don’t need to really have a PhD to handle Cassandra [ you only really need to have a PhD in Comp Sci, if you actually can and like to hang around in school for a couple more years, but that is a topic for another blog post ].

Cassandra documentation is somewhat good, and the code is there, the only problem is it looks and feels a bit Frankenstein: murmurhash from here, bloom filter from here, here is a sprinkle of Thrift ( really!? ), in case you want to Insert/Update, here is a “mutator”, etc.. Plus adding a node is really an afterthought => if you have TBs of data, adding a node can take days, no really => days. Querying is improving with CQL roll out in 0.8, but any kind of even simplistic aggregation requires some “Thriftiveness” [ read “Awkwardness” ].

CouchDB is Awesome, but JSON

CouchDB looks and feels like an awesome choice if the size of the data is not a problem. Same as with MongoDB, “JSON only” approach is a bit weird, why only JSON? The point of no schema is NOT that it changes ALL the time => then you have just a BLOB, content of which you cannot predict, hence index. The point is, the schema ( and rather some part of it ) may/will change, so WHY the heck do I have to carry those key names (that are VARCHARs and take space) around? Again CouchDB + alternative protocol would make it in my “Real NoSQL Dudes” list easily, as I like most things about it, especially secondary indicies, Ubuntu integration, mobile expansion, and of course Erlang, but Riak has Erlang too : )

VoltDB: With Great Speed Comes Great Responsibility

As far as speed, VoltDB would probably leave most of others behind (well, besides OneTick and KDB), but it actually is not NoSQL (hence it is fully ACID), and it is not NotJustInRam store, since it IS Just RAM. Three huge caveats are:

1. Everything is a precoded store procedure => ad hoc querying is quite difficult
2. Aggregation queries can’t work with data greater than 100MB ( for a temp table )
3. Data “can” be exported to Hadoop, but there is no real integration (yet) to analyze the data that is currently in RAM along with the data in Hadoop.

But it is young and worth mentioning, as I think RAM and hardware gets cheaper, “commodity nodes” become less important, so “Scale Up” solutions may actually win back some architectures. There is of course a question of Fault Tolerance, which Erlang / AKKA based systems will solve a lot better than any “Scale Up”s, but it is a topic for another time.

Dark Hourses of NoSQL

There are others, such as Tokyo Cabinet ( or is it Kyoto Cabinet now days ), Project Voldemort => I have not tried them, but have heard good stories about them. The only problem I see with these “dark horse” solutions is lack of adoption.

Neo4j: Graph it Like You Mean It

Ok, so why Neo4j? Well, because it is a graph data store, that screws with your mind a bit (data modeling) until you get it. But once you get it, there is no excuse not to use it, especially when you create your next “social network, location based, shopping smart” start up => modeling connected things as a graph just MAKES SENSE, and Neo4j is perfect for it.

You know how fast it is to find “all” connections for a single node in a non graph data store? Well, it is longer and longer with “all” being a big number. With Neo4j, it is as fast as to find a single connection => cause it is a graph, it’s a natural data structure to work with graph shaped data. It comes with a price it is not as easy to Scale Out, at least for free. But.. it is fully transactional ( even JTA ), it persists things to disk, it is baked into Spring Data. Try it, makes your brain do a couple of front splits, but your mind feels really stretched afterwards.

The Future is 600 to 1

There is of course HBase, which Hadapt guys are promising to beat 600 to 1, so we’ll see what Hadapt brings to the table. Meanwhile I invite Riak, Redis, Hazecast and Neo4j to walk together with me the slippery slope of NoSQL.

Aug 09

Key Value Store List

B Tree

While playing with CouchDB, I decided to expand on the subject and research to see what else is out there. Apparently there are lots of cool implementations of Key Value Stores. And since to try them all will take a long time, I decided make some notes of what I found, to simplify my journey.

Here I created a simple reference list of different, non-commercial implementations of Key Value Stores. Let me know if there are more interesting projects that are not in this list, so we can keep this list updated.

Tokyo Cabinet

A C library that implements a very fast and space efficient key-value store:

The database is a simple data file containing records, each is a pair of a key and a value. Every key and value is serial bytes with variable length. Both binary data and character string can be used as a key and a value. There is neither concept of data tables nor data types. Records are organized in hash table, B+ tree, or fixed-length array.

Besides bindings for Ruby, there are also APIs for Perl, Java, and Lua available.

To share Tokyo Cabinet across machines, Tokyo Tyrant provides a server for concurrent and remote connections.

Speed and efficiency are two consistent themes for Tokyo Cabinet. Benchmarks show that it only takes 0.7 seconds to store 1 million records in the regular hash table and 1.6 seconds for the B-Tree engine. To achieve this, the overhead per record is kept at as low as possible, ranging between 5 and 20 bytes: 5 bytes for B-Tree, 16-20 bytes for the Hash-table engine. And if small overhead is not enough, Tokyo Cabinet also has native support for Lempel-Ziv or BWT compression algorithms, which can reduce your database to ~25% of it’s size (typical text compression rate). Oh, and did I mention that it is thread safe (uses pthreads) and offers row-level locking?

good intro: Tokyo Cabinet Beyond Key Value Store

Project Voldemort

Voldemort is a very cool project that comes out of LinkedIn. They seem to even be providing a full time guy doing development and support via a mailing list. Kudos to them, because Voldemort, as far as I can tell, is great. Best of all, it scales. You can add servers to the cluster, you don’t do any client side hashing, throughput is increased as the size of the cluster increases. As far as I can tell, you can handle any increase in requests by adding servers as well as those servers being fault tolerant, so a dead server doesn’t bring down the cluster.

Voldemort does have a downside for me, because I primarily use ruby and the provided client is written in java, so you either have to use JRuby (which is awesome but not always realistic) or Facebook Thrift to interact with Voldemort. This means thrift has to be compiled on all of your machines, and since Thrift uses Boost C++ library, and Boost C++ library is both slow and painful to compile, deployment of Voldemort apps is increased significantly.

Voldemort is also intersting because it has pluggable data storage backend and the bulk of it is mostly for the sharding and fault tolerance and less about data storage. Voldemort might actually be a good layer on top of Redis or Tokyo Cabinet some day.

Voldemort, it should be noted, is also only going to be worth using if you actually need to spread your data out over a cluster of servers. If your data fits on a single server in Tokyo Tyrant, you are not going to gain anything by using Voldemort. Voldemort however, might be seen as a good migration path from Tokyo * when you do hit that wall were performance isn’t enough.(from: NoSQL If Only It Was That Easy)


Apache CouchDB is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.

CouchDB provides a RESTful JSON API than can be accessed from any environment that allows HTTP requests. There are myriad third-party client libraries that make this even easier from your programming language of choice. CouchDB’s built in Web administration console speaks directly to the database using HTTP requests issued from your browser.

It’s a “distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”. Data is stored in ‘documents’, which are essentially key-value maps themselves, using the data types you see in JSON. CouchDB can do full text indexing of your documents, and lets you express views over your data in Javascript. I could imagine using CouchDB to store lots of data on users: name, age, sex, address, IM name and lots of other fields, many of which could be null, and each site update adds or changes the available fields. In situations like that it quickly gets unwieldy adding and changing columns in a database, and updating versions of your application code to match. (from: Anti RDBMS a List of Distributed Key Value Stores)

good intro: InfoQ: CouchDB From 10K Feet


MongoDB is not a key/value store, it’s quite a bit more. It’s definitely not a RDBMS either. I haven’t used MongoDB in production, but I have used it a little building a test app and it is a very cool piece of kit. It seems to be very performant and either has, or will have soon, fault tolerance and auto-sharding (aka it will scale). I think Mongo might be the closest thing to a RDBMS replacement that I’ve seen so far. It won’t work for all data sets and access patterns, but it’s built for your typical CRUD stuff. Storing what is essentially a huge hash, and being able to select on any of those keys, is what most people use a relational database for. If your DB is 3NF and you don’t do any joins (you’re just selecting a bunch of tables and putting all the objects together, AKA what most people do in a web app), MongoDB would probably kick ass for you.

Oh, and did I mention that, of all the NoSQL options out there, MongoDB is the one of the only ones being developed as a business with commercial support available? If you’re dealing with lots of other people’s data, and have a business built on the data in your DB, this isn’t trivial.

On a side note, if you use Ruby, check out MongoMapper for very easy and nice to use ruby access.(from: NoSQL If Only It Was That Easy)

MongoDB: Dwight Merriman, 10gen (slides, video)


Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google’s BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.

Cassandra was open sourced by Facebook in 2008, where it was designed by one of the authors of Amazon’s Dynamo. In a lot of ways you can think of Cassandra as Dynamo 2.0. Cassandra is in production use at Facebook but is still under heavy development.

Sounded very promising when the source was released by Facebook last year. They use it for inbox search. It’s Bigtable-esque, but uses a DHT so doesn’t need a central server (one of the Cassandra developers previously worked at Amazon on Dynamo). Unfortunately it’s languished in relative obscurity since release, because Facebook never really seemed interested in it as an open-source project. From what I can tell there isn’t much in the way of documentation or a community around the project at present. (from: Anti RDBMS a List of Distributed Key Value Stores)

good intro: Up and Running With Cassandra

MySQL Cluster / NDB

Although it is not your native Key Value Store, I found it interesting to put on the list. While it is commonly used through an SQL interface, the architecture and performance is exactly what you want: Cloud-like sharding, very good performance on key-value lookups, etc… And if you don’t want the SQL, you can use the NDB API directly, or REST through mod_ndb Apache module (http://code.google.com/p/mod-ndb/).

This would score high on your list if you evaluated it:

– Transparent sharding: Data is distributed through an md5sum hash on your primary key (or user defined key), yet you connect to whichever MySQL server you want, the partitions/shards are transparent behind that.

– Transparent re-sharding: In version 7.0, you can add more data nodes in an online manner, and re-partition tables without blocking traffic.

– Replication: Yes. (MySQL replication).

– Durable: Yes, ACID. (When using a redundant setup, which you always will.)

– Commit’s to disk: Not on commit, but with a 200ms-2000ms delay. Durability comes from committing to more than 1 node, *on commit*.

– Less than 10ms response times: You bet! 1-2 ms for quite complex queries even.

Other KVS I have in my backlog to expand on here later:


Lux IO



Oracle Berkeley DB










And as a bonus – there is another quite interesting initiative started by Yehuda Katz:


Moneta: A unified interface for key/value stores
Moneta provides a standard interface for interacting with various kinds of key/value stores.
Out of the box, it supports:
* File store for xattr
* Basic File Store
* Memcache store
* In-memory store
* The xattrs in a file system
* DataMapper
* S3
* Berkeley DB
* Redis
* Tokyo
* CouchDB
All stores support key expiration, but only memcache supports it natively. All other stores
emulate expiration.
The Moneta API is purposely extremely similar to the Hash API. In order so support an
identical API across stores, it does not support iteration or partial matches, but that
might come in a future release.

Any additional info / projects are welcome!