02
Dec 15

Time Series Database in One Line of Clojure

If you have ever worked in the financial sector, specifically in high frequency trading, you know a time series database as the well known tool that lines up all those quotes, orders and trades for financial pleasure.

There are many of these databases available. Wall Street being Wall Street would of course primarily use proprietary ones, since, well, it’s proprietary :), but to give them credit: they do outperform open source ones by a lot, at least presently (talking about millions per second).

Disrupting Time Series Business


So I decided to write an open source time series database that will beat them all, not necessarily on performance, but definitely on clarity and size. Get ready for this one line.

If you read this far that means you are ready, so let’s begin by creating a database:

(def db (sorted-map-by >))

Oh, by the way we are done. It’s the one and only: The Time Series Database.

Map is King of Data


Let’s use it. First we’ll need some data:

(def data
  {1449088877092 {:GOOG {:bid 762.74 :offer 762.79}}
   1449088876590 {:AAPL {:bid 116.60 :offer 116.70}}
   1449088877601 {:MSFT {:bid 55.22 :offer 55.27}}
   1449088877203 {:TSLA {:bid 232.57 :offer 232.72}}
   1449088875914 {:NFLX {:bid 128.95 :offer 129.05}}
   1449088870005 {:FB {:bid 105.96 :offer 106.6}}})

The format is simple {timestamp data}.

Now a function that gives us a database, as a value, with this data in it:

(defn with [db data] (merge db data))

And finally some time based queries, like before and after:

(defn before [database ts] (into {} (subseq database > ts)))
(defn after [database ts] (into {} (subseq database < ts)))

Done.

Action!


(before (with db data) 1449088877091)

{1449088876590 {:AAPL {:bid 116.6, :offer 116.7}},
 1449088875914 {:NFLX {:bid 128.95, :offer 129.05}},
 1449088870005 {:FB {:bid 105.96, :offer 106.6}}}

(after (with db data) 1449088877091)

{1449088877601 {:MSFT {:bid 55.22, :offer 55.27}},
 1449088877203 {:TSLA {:bid 232.57, :offer 232.72}},
 1449088877092 {:GOOG {:bid 762.74, :offer 762.79}}}
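One thing to keep in mind: "(into {} ...)" pours the result into a regular map, which only keeps its order while it is small. Here is a minimal sketch, assuming the result should stay a sorted map with the same comparator. The names before* and after* are illustrative and not part of the original one-liner:

```clojure
(def db (sorted-map-by >))

(defn with [db data] (merge db data))

;; `empty` on a sorted map keeps its comparator, so the result stays sorted
(defn before* [database ts] (into (empty database) (subseq database > ts)))
(defn after*  [database ts] (into (empty database) (subseq database < ts)))

;; 20 made-up ticks keyed by 1..20, enough to outgrow a small array map
(def ticks (with db (zipmap (range 1 21) (repeat {:bid 1.0}))))

(keys (before* ticks 5))     ;; => (4 3 2 1)
(keys (after* ticks 15))     ;; => (20 19 18 17 16)
(sorted? (before* ticks 5))  ;; => true
```

The trade-off is that the result is a tree map rather than a hash map, so lookups in it are logarithmic, but range queries can be chained on it.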

Beware, you, other time series databases!

P.S. Of course there is a possibility of events that came in at the exact same millisecond, so here is another line that solves it
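The line referred to above is not reproduced here, so what follows is only one possible approach and not necessarily the original one. It is a sketch that assumes events sharing a millisecond carry different tickers and can live in the same inner map. The name with-merged is hypothetical:

```clojure
(def db (sorted-map-by >))

;; hypothetical variant of `with`: entries that share a timestamp are merged
;; instead of the later one replacing the earlier one
(defn with-merged [db data] (merge-with merge db data))

(def merged
  (-> db
      (with-merged {1449088877092 {:GOOG {:bid 762.74 :offer 762.79}}})
      (with-merged {1449088877092 {:TSLA {:bid 232.57 :offer 232.72}}})))

(keys (get merged 1449088877092)) ;; both :GOOG and :TSLA are present
```

If the same ticker shows up twice within one millisecond, the later quote still wins in this sketch.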


24
Nov 15

Clojure Libraries in The Matrix

The Clojure universe is mostly built on top of libraries rather than “frameworks” or “platforms”, which makes it really flexible and lots of fun to work with. Any library can be swapped, contributed to, or even created from scratch.

There are several things that make a library great. The quality of its solution is of course the main focus, and it delivers the most value, but there are others. The one I’d like to mention is not how much a library does, but how little it should.

I like apples, you like me, you like apples


Dependencies are often overlooked when developing libraries. There are quite a few libraries that suffer from depending on something for either convenience, or for its built in example, or just in case, etc.

This results in downloading the whole Maven repository when working on a project that depends on just a few such libraries.

This can also create conflicts between the dependencies these libraries bring in and the ones the project itself requires.

We can do better, and we should.

Those people don’t know what they are doing


The reason I bring it up is not that I am tired of these libraries, or that it is time for a rant: it is simply that I do it myself. And usually, by the time I notice I did it, it takes significant rework to make sure developers who use/depend on my libraries do not have to bring in the “apples” that I like and they might not.

Useful vs. The Core


A great example of this is me including the excellent clojure/tools.logging as a top level dependency of mount. Mount manages application state lifecycle, and it seemed to only make sense that every time a state is started or stopped, mount would log it:

dev=> (mount/start)
14:34:10.813 [nREPL-worker-0] INFO  mount.core - >> starting..  app-config
14:34:10.814 [nREPL-worker-0] INFO  mount.core - >> starting..  conn
14:34:10.838 [nREPL-worker-0] INFO  mount.core - >> starting..  nyse-app
14:34:10.844 [nREPL-worker-0] INFO  mount.core - >> starting..  nrepl
:started
 
dev=> (mount/stop-except #'app.www/nyse-app)
14:34:47.766 [nREPL-worker-0] INFO  mount.core - << stopping..  nrepl
14:34:47.766 [nREPL-worker-0] INFO  mount.core - << stopping..  conn
14:34:47.766 [nREPL-worker-0] INFO  mount.core - << stopping..  app-config
:stopped

It’s useful, right? Of course it is. As a developer that depends on mount, you don’t have to do it, it is already there for you, very informative and clean.

But here is the catch:

* what if you don’t like the way it logs it?
* what if you don’t want it to log at all?
* what if you use a different library for logging?
* etc..

In other words: “what if you don’t like or need apples and you eat bananas instead?”.

It ends up that “useful” is most of the time a red flag. Stop and think whether this “useful” feature is really the core piece of functionality, or is a bolted on “nice to have”.

Novelty Freshness of Refactoring


While extra dependencies are not desirable, and the above idea to include logging was not great, what was great were the new thoughts during refactoring:

“Ok, I’ll remove logging, but now mount users won’t know anything about states..”

“Maybe they can use something like (states-with-deps) that would give them the current state of the application”:

dev=> (states-with-deps)
 
({:name app-config, :order 1, 
                    :started? true
                    :suspended? false
                    :ns #object[clojure.lang.Namespace 0x6e126efc "app.config"], 
                    :deps ()} 
 
 {:name conn, :order 2, 
              :started? true
              :suspended? false
              :ns #object[clojure.lang.Namespace 0xf1a66a6 "app.nyse"], 
              :deps ([app-config #'app.config/app-config])} 
 
 {:name nrepl, :order 3, 
               :started? true
               :suspended? false
               :ns #object[clojure.lang.Namespace 0x2c134117 "app"], 
               :deps ([app-config #'app.config/app-config])})

“That’s not bad, but what if they start/stop states selectively, or they suspended/resumed some states.. no visibility”

“Well, it’s simple, why not just return all the states that were affected by a lifecycle method?”

And that’s what I did. But I did not go through this thought process when I had logging in, since logging created an illusion of visibility and control, while in reality it gave “an ok” visibility and no control.

The solution just returns a vector of the states that were affected, keyed by the lifecycle action:

dev=> (mount/start)
{:started [#'app.config/app-config 
           #'app.nyse/conn 
           #'app/nrepl
           #'check.suspend-resume-test/web-server
           #'check.suspend-resume-test/q-listener]}

The cool additional thing, and the reason it is a vector and not a set, is that these states are in the vector in the order they were touched, in this case “started”.
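Since the affected states come back as plain data, a caller who wants logging can produce it from the return value. Here is a minimal sketch, assuming the return shape shown above. The result below is a stand-in with symbols in place of the real vars, and lifecycle-lines is an illustrative name, not a mount function:

```clojure
;; stand-in for what (mount/start) returned above; symbols replace the real vars
(def result {:started ['app.config/app-config 'app.nyse/conn 'app/nrepl]})

(defn lifecycle-lines
  "Turns {:action [state ...]} into one line per state, in the order touched."
  [result]
  (for [[action states] result
        state states]
    (str (name action) " " state)))

;; swap println for whatever logging library the application already uses
(run! println (lifecycle-lines result))
;; started app.config/app-config
;; started app.nyse/conn
;; started app/nrepl
```

The logging library, the format and whether to log at all stay with the application, not with the library.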

Rules of The Matrix


While I made a mistake, I am glad I did. It gave me lots of food for thought as well as made me do some other cool tricks with robert hooke to demonstrate how to bring the same logging back if needed.

It does feel great to only depend on Clojure itself, and a tiny tools.macro, which I use a single function from, and could potentially just grab from there, and cut my dependencies to The One.

12
Nov 15

Managing Clojure app state since (reset)

After shipping two large projects with Component, and having several in the works, I decided that Component is not for me, and wrote mount.

Here are the differences between mount and Component, and below is the story of why.

Java is Good


Over the last several years Java has been getting a bad rep from many people who now like “better” languages. Java is too verbose, too complex, too mutable, too ugly, too last century, etc.

I like Java. I liked it 10 years ago, I like it now. I have worked in many languages (who hasn’t nowadays), and I still like Java. Java is solid. Java is simple. Yes, it is.

There are many corner cases that are well documented, and there are more that are not documented, but it does not make it complex. You can learn about Monads and Endofunctors on your own, while completing and shipping successful Java projects. Category Theory is not a prerequisite. Java is stable. I like Java.

Love is not solid, it’s more


Now I can’t say I love Java. Java is my good friend, we have a solid relationship. Sometimes we go to work together, but I do not have a feeling of excitement, I am not running (well, not too fast at least) to my laptop to try this cool Java thing, I would not spend a weekend with it unless we need to.

That’s where Clojure comes in. I love it. I won’t go into details on why I love it: first of all it’s personal, second of all there are plenty of other blog posts, books and videos that make Clojure shine. I just want to state that I love it. There is a difference.

Clojure. The Beginning.


I came to Clojure several years ago from a pretty common background: lots of Java and Spring. I like Spring a lot. It makes the Java world shine, it taught me great ways to approach problems, and it has great documentation and a friendly community. I love friendly communities.

As I wrote more and more Clojure I fell in love with each new discovery, it made me think of time in a way Java didn’t. It greatly extended my reach into science behind a language.

Time went on and I started doing Clojure professionally. There is quite a difference between using Clojure side by side with Java / Scala projects, using Clojure for tools and libraries, and using Clojure professionally: for products/applications. Products are very different and very stateful beings.

Clojure Developers are People


The question of state, not a “map” or a “vector” in a lexical scope but product state, became one of great importance. And the choices here are not great.

From talking to people I concluded that

* some people keep their product (application) states in “def”s
* some in “atoms”
* others in Component
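To make the first two options concrete, here is a small sketch. Everything in it is hypothetical: connect stands in for any stateful resource, and the names conn, conn-atom, start! and stop! are made up for illustration:

```clojure
;; hypothetical resource: a "connection" is just a map here
(defn connect [uri] {:uri uri :open? true})

;; 1. state in a def: created as a side effect of loading the namespace
(def conn (connect "db://localhost"))

;; 2. state in an atom: created and torn down explicitly
(defonce conn-atom (atom nil))
(defn start! [] (reset! conn-atom (connect "db://localhost")))
(defn stop!  [] (reset! conn-atom nil))

(start!)
(:uri @conn-atom) ;; => "db://localhost"
(stop!)
```

Neither option says anything about the order in which several such states should start or stop, and that is the gap the third option aims to fill.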

The “Clojure Gods” talk a lot about state, but very rarely about an application state and codebase organization. I doubt Datomic uses Component, but I don’t know.

Component was gaining popularity, and the JUXT people I talked to (great people btw) seemed very enthusiastic about it, so I decided to give it a go.

Component Framework


When I started creating projects with Component, I had already learned to like the Clojure way of functions and namespaces, so the object (record) oriented approach of Component was immediately suspicious.

Component is not exactly Spring of course. It aims to structure a stateful application, so it is reloadable in the REPL, since REPL restart time is, well, slow.

In order to do that effectively, Component requires a whole app buy in, which makes it a framework rather than a library: another “unClojure” feeling that stayed with me while using Component.

Spring is a framework, and I like it. But in Java world that’s the culture. “Frameworks” is the approach. It is well accepted and tools are built around this.

Clojure world is all about libraries, and I love it, the same way I love open source solutions vs. closed packaged ones, i.e. open systems vs. closed systems. High cohesion, loose coupling, win win.

At Large


I understand that many people like Component, and I think their projects are based on it. It is not exactly evident though, since most of the products Component would be needed for are proprietary: enterprise (a.k.a. “at large”), “startups”, etc. But there are several open source ones that look really good. A couple of examples:

* Onyx
* BirdWatch

although BirdWatch switched from Component to systems-toolbox:

“I have thrown out the Component library in master on GitHub, and I find using the systems-toolbox much more straight-forward”

I like Scotch


Long story short, while I delivered with Component, it did not deliver for me.

I rewrote Component projects with mount, and already shipped two of them. Both are live and blooming in prod.

Another team I work with liked this approach a lot, and rewrote one of their products in mount as well. First thing one of their developers said after trying it: “oh.. look, it’s like Clojure again!”.

It is, it is Clojure again for me too. I get it, Component may “do it” for many people, but alternatives are great!

I like Java, I love Clojure.


08
Oct 15

iterator-seq: chunks and hasNext

A couple of interesting facts to keep in mind about iterator-seq:

  • it calls hasNext right away
  • it “caches” reads by chunks of 32 items

Its implementation is quite simple and returns a lazy seq, but the above two are good to keep in mind when working with iterators:

private static final int CHUNK_SIZE = 32;
public static ISeq chunkIteratorSeq(final Iterator iter){
    if(iter.hasNext()) {
        return new LazySeq(new AFn() {
            public Object invoke() {
                Object[] arr = new Object[CHUNK_SIZE];
                int n = 0;
                while(iter.hasNext() && n < CHUNK_SIZE)
                    arr[n++] = iter.next();
                return new ChunkedCons(new ArrayChunk(arr, 0, n), chunkIteratorSeq(iter));
            }
        });
    }
    return null;
}
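Both facts are easy to observe at the REPL. Here is a small sketch, assuming a Clojure version where iterator-seq is backed by the chunked implementation above. The counting-iterator helper is made up for this illustration:

```clojure
(defn counting-iterator
  "Iterator over 0..n-1 that counts its hasNext/next calls (illustrative helper)."
  [n]
  (let [has-next-calls (atom 0)
        next-calls     (atom 0)
        i              (atom -1)]
    {:has-next-calls has-next-calls
     :next-calls     next-calls
     :iter (reify java.util.Iterator
             (hasNext [_] (swap! has-next-calls inc) (< (inc @i) n))
             (next [_] (swap! next-calls inc) (swap! i inc)))}))

(def it (counting-iterator 100))
(def s (iterator-seq (:iter it)))

@(:has-next-calls it) ;; => 1, called before any element is asked for
@(:next-calls it)     ;; => 0

(first s)             ;; => 0
@(:next-calls it)     ;; => 32, a whole chunk was read to serve one element
```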

Why 32? I like 42 better!


32 is a good choice for the CHUNK_SIZE since it matches the number of child nodes in Clojure (persistent) collections:

static public PersistentVector create(ISeq items){
    Object[] arr = new Object[32];
    int i = 0;
    for(;items != null && i < 32; items = items.next())
        arr[i++] = items.first();
 
    if(items != null) {  // >32, construct with array directly
        PersistentVector start = new PersistentVector(32, 5, EMPTY_NODE, arr);
        TransientVector ret = start.asTransient();
        for (; items != null; items = items.next())
            ret = ret.conj(items.first());
        return ret.persistent();
    } else if(i == 32) {   // exactly 32, skip copy
        return new PersistentVector(32, 5, EMPTY_NODE, arr);
    } else {  // <32, copy to minimum array and construct
        Object[] arr2 = new Object[i];
        System.arraycopy(arr, 0, arr2, 0, i);
        return new PersistentVector(i, 5, EMPTY_NODE, arr2);
    }
}

The Dark Side of “hasNext()”


But before creating a lazy seq, the first call “iterator-seq” makes is iter.hasNext(). While this makes sense (why create a seq if there is nothing to create it from), a thing to keep in mind is the implementation of the iterator that is passed to “iterator-seq”. Here is an example from my recent HBase journey.

cbass wraps an HBase Scanner in “iterator-seq”:

(let [results (-> (.iterator (.getScanner h-table (scan-filter criteria)))
                  iterator-seq)

Once “iterator-seq” makes a call to iter.hasNext(), the HBase scanner goes out and fetches the first result based on its filter. While this sounds ok, internally, depending on the HBase client caching configuration, it may end up fetching lots and lots of data to “cache” locally before returning a single item. Not exactly the “lazy seq behavior” one would expect. More about it here.

To conclude: it is always good to keep a fresh copy of Clojure source code in the head 🙂

08
Oct 15

HBase Scan: Let me cache it for you

HBase uses Scanners to SELECT things FROM table WHERE this AND that. An interesting bit that HBase scanners use is caching.

Since HBase is designed to sit between the low level HDFS and a high level SQL front end, many of its APIs are “naturally leaking”, which makes it harder to come up with good names, since they have to fit both worlds.

Scanner “caching” is a good example of a feature name that makes perfect sense internally in HBase, and is more than confusing for an HBase client.

Caching scans.. what!?


Here is how it works in a nutshell:

Scanners are iterators that return results by calling “.next” on them.

Once the “.next” is called on a scanner, it will go and fetch the next result that matches its filter (i.e. similar to the SQL’s WHERE clause).

Here is the tricky part. Before the scanner returns that single result, it caches a chunk of results, so the next call to “.next” would come from this chunk, which effectively is a local “cache”. This is done so the trip to the region server for each call to “.next” can be avoided:

@Override
public Result next() throws IOException {
  // If the scanner is closed and there's nothing left in the cache, next is a no-op.
  if (cache.size() == 0 && this.closed) {
    return null;
  }
  if (cache.size() == 0) {
    loadCache();
  }
 
  if (cache.size() > 0) {
    return cache.poll();
  }
 
  // if we exhausted this scanner before calling close, write out the scan metrics
  writeScanMetrics();
  return null;
}

The snippet above is from HBase ClientScanner.

The loadCache() fetches a chunk of results (read “cache size”) from a region server, and stuffs it into cache, which is just a Queue<Result> that will be drained on all subsequent calls to “.next”.
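The mechanics are easy to model without HBase. The sketch below is a toy model in plain Clojure, not the HBase API: make-scanner, its :next! function and the trip counter are all made up to mirror the next() / loadCache() flow described above, and it is not thread safe:

```clojure
(defn make-scanner
  "Toy model of a caching scanner: `rows` stands in for what a region
  server holds, `cache-size` for how many rows one trip brings back."
  [rows cache-size]
  (let [remaining (atom (seq rows))
        cache     (atom clojure.lang.PersistentQueue/EMPTY)
        trips     (atom 0)]
    {:trips trips
     :next! (fn []
              ;; empty cache and rows left: one "trip" loads the next chunk
              (when (and (empty? @cache) (seq @remaining))
                (let [[chunk more] (split-at cache-size @remaining)]
                  (swap! trips inc)
                  (reset! remaining more)
                  (swap! cache into chunk)))
              ;; serve from the cache; nil once everything is drained
              (when-let [row (peek @cache)]
                (swap! cache pop)
                row))}))

(def scanner (make-scanner [:a :b :c :d :e] 2))
(def next! (:next! scanner))

(vec (repeatedly 6 next!)) ;; => [:a :b :c :d :e nil]
@(:trips scanner)          ;; => 3 trips for 5 rows
```

With a cache size of 1 every call becomes a trip. With a huge cache size the very first call pulls everything, which is exactly the trade-off the scanner settings control.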

Oh.. and Max Result Size is in bytes


Why is that tricky? Well, the tricky part is the size of this “chunk” that will be cached by the scanner on each request.

By default, this size is Integer.MAX_VALUE rows. This means that the scanner will try to cache as much as possible: either Integer.MAX_VALUE rows or “hbase.client.scanner.max.result.size” bytes, whichever is reached first, where the latter is 2MB by default. Depending on the size of the final result set, and on how scanners are used, this can get unwieldy pretty quickly.

“hbase.client.scanner.max.result.size” is in bytes (not a number of rows), so it is not exactly high level SQL’s “LIMIT” / “ROWNUM”. If not properly set, “caching” may end up either taking all the memory or simply returning timeouts (e.g. “ScannerTimeoutException” which could get routed to “OutOfOrderScannerNextException”s, etc.) from the region server. Hence getting these two, “max size” and “cache size”, to play together nicely is important to scan smoothly.

Here is another good explanation by my friend Dan on why HBase caching is tricky and what it really does behind the covers.

Clients are Humans


If scanner “caching” were named a bit differently (“fetch size”, for example, would not be too bad), it would be a bit more obvious in the HBase client API. But then behind the covers, these values are truly “cached” and read from cache in each call to “scanner.next()”.

Both make sense, but I would argue that a client facing API should take precedence here. HBase developers already know the internals, and fetch size (as a property set to Scan) would not hurt them. “caching” can still be used internally under HBase covers.