"; */ ?>

2 Dec 15

Time Series Database in One Line of Clojure

If you have ever worked in the financial sector, specifically in high frequency trading, you know a time series database as the well known tool that keeps all those quotes, orders and trades in order for financial pleasure.

There are many of these databases available. Wall Street, being Wall Street, would of course primarily use proprietary ones, since, well, it’s proprietary :). But to give them credit: they do outperform open source ones by a lot, at least at the moment (we are talking millions of events per second).

Disrupting Time Series Business


So I decided to write an open source time series database that will outperform them all: not necessarily in performance, but definitely in clarity and size. Get ready for this one line.

If you read this far that means you are ready, so let’s begin by creating a database:

(def db (sorted-map-by >))

Oh, by the way, we are done. It’s the one and only: The Time Series Database.

Map is King of Data


Let’s use it. First we’ll need some data:

(def data
  {1449088877092 {:GOOG {:bid 762.74 :offer 762.79}}
   1449088876590 {:AAPL {:bid 116.60 :offer 116.70}}
   1449088877601 {:MSFT {:bid 55.22 :offer 55.27}}
   1449088877203 {:TSLA {:bid 232.57 :offer 232.72}}
   1449088875914 {:NFLX {:bid 128.95 :offer 129.05}}
   1449088870005 {:FB {:bid 105.96 :offer 106.6}}})

The format is simple: {timestamp data}.

Now a query that returns the database, as a value, with this data merged in:

(defn with [db data] (merge db data))

And finally some time based queries, like before and after:

;; note: with a descending comparator, subseq's test reads against the sort order,
;; so "> ts" selects keys that sort after ts, i.e. numerically smaller (earlier) timestamps
(defn before [database ts] (into {} (subseq database > ts)))
(defn after  [database ts] (into {} (subseq database < ts)))

done.

Action!


(before (with db data) 1449088877091)
 
{1449088876590 {:AAPL {:bid 116.6, :offer 116.7}},
 1449088875914 {:NFLX {:bid 128.95, :offer 129.05}},
 1449088870005 {:FB {:bid 105.96, :offer 106.6}}}
(after (with db data) 1449088877091)
 
{1449088877601 {:MSFT {:bid 55.22, :offer 55.27}},
 1449088877203 {:TSLA {:bid 232.57, :offer 232.72}},
 1449088877092 {:GOOG {:bid 762.74, :offer 762.79}}}

Beware, you, other time series databases!

P.S. Of course there is a possibility of events coming in at the exact same millisecond, so here is another line that solves it.
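One possible way to get there, sketched here without claiming it is the post’s original line: merge the per-timestamp maps instead of overwriting them, so that two different symbols arriving in the same millisecond both survive:

;; merge-with keeps both {:GOOG ..} and {:AAPL ..} under the same timestamp
;; (the same symbol at the same millisecond would still need, e.g., a composite key)
(defn with [db data] (merge-with merge db data))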


24 Nov 15

Clojure Libraries in The Matrix

The Clojure universe is mostly built on top of libraries rather than “frameworks” or “platforms”, which makes it really flexible and lots of fun to work with. Any library can be swapped, contributed to, or even created from scratch.

There are several things that make a library great. The quality of its solution is of course the main one, and it delivers the most value, but there are others. The one I’d like to mention is not how much a library does, but how little it should.

I like apples, you like me, you like apples


Dependencies are often overlooked when developing libraries. Quite a few libraries suffer from depending on something for convenience, or for their built-in examples, or just in case, etc.

This results in downloading the whole maven repository when working on a project that depends on just a few such libraries.

It can also create conflicts between the dependencies these libraries bring in and the dependencies the project actually requires.
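To make it concrete, here is a hypothetical project.clj fragment (some/library and the versions are made up) showing the kind of workaround consumers end up writing: excluding a transitive dependency a library pulled in “for convenience”, so it does not clash with the version the project actually needs:

(defproject my-app "0.1.0"
  :dependencies [[some/library "1.2.3" :exclusions [org.clojure/tools.logging]]
                 [org.clojure/tools.logging "1.2.4"]])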

We can do better, and we should.

Those people don’t know what they are doing


The reason I bring this up is not that I am tired of these libraries, or that it is time for a rant; it is simply that I do it myself. And usually, by the time I notice, it requires significant rework to make sure developers that use/depend on my libraries do not get “apples” that I like and they might not.

Useful vs. The Core


A great example of this is me including the excellent clojure/tools.logging as a top level dependency of mount. Mount manages the application state lifecycle, so it seemed to only make sense that every time a state is started or stopped, mount would log it:

dev=> (mount/start)
14:34:10.813 [nREPL-worker-0] INFO  mount.core - >> starting..  app-config
14:34:10.814 [nREPL-worker-0] INFO  mount.core - >> starting..  conn
14:34:10.838 [nREPL-worker-0] INFO  mount.core - >> starting..  nyse-app
14:34:10.844 [nREPL-worker-0] INFO  mount.core - >> starting..  nrepl
:started
 
dev=> (mount/stop-except #'app.www/nyse-app)
14:34:47.766 [nREPL-worker-0] INFO  mount.core - << stopping..  nrepl
14:34:47.766 [nREPL-worker-0] INFO  mount.core - << stopping..  conn
14:34:47.766 [nREPL-worker-0] INFO  mount.core - << stopping..  app-config
:stopped

It’s useful, right? Of course it is. As a developer that depends on mount, you don’t have to set any of this up; it is already there for you, very informative and clean.

But here is the catch:

* what if you don’t like the way it logs it?
* what if you don’t want it to log at all?
* what if you use a different library for logging?
* etc..

In other words: “what if you don’t like or need apples and you eat bananas instead?”.

It turns out that “useful” is, most of the time, a red flag. Stop and think whether this “useful” feature is really a core piece of functionality, or a bolted on “nice to have”.

Novelty Freshness of Refactoring


While extra dependencies are not desirable, and the above idea of including logging was not great, what was great were the new thoughts that came up during refactoring:

“Ok, I’ll remove logging, but now mount users won’t know anything about states..”

“Maybe they can use something like (states-with-deps) that would give them the current state of the application”:

dev=> (states-with-deps)
 
({:name app-config, :order 1, 
                    :started? true
                    :suspended? false
                    :ns #object[clojure.lang.Namespace 0x6e126efc "app.config"], 
                    :deps ()} 
 
 {:name conn, :order 2, 
              :started? true
              :suspended? false
              :ns #object[clojure.lang.Namespace 0xf1a66a6 "app.nyse"], 
              :deps ([app-config #'app.config/app-config])} 
 
 {:name nrepl, :order 3, 
               :started? true
               :suspended? false
               :ns #object[clojure.lang.Namespace 0x2c134117 "app"], 
               :deps ([app-config #'app.config/app-config])})

“That’s not bad, but what if they start/stop states selectively, or they suspend/resume some states.. no visibility”

“Well, it’s simple, why not just return all the states that were affected by a lifecycle method?”

And that’s what I did. But I did not go through this thought process when I had logging in, since logging created an illusion of visibility and control, while in reality it gave only an “ok” visibility and no control.

The solution just returns a vector of states that were affected:

dev=> (mount/start)
{:started [#'app.config/app-config 
           #'app.nyse/conn 
           #'app/nrepl
           #'check.suspend-resume-test/web-server
           #'check.suspend-resume-test/q-listener]}

An additional cool thing, and the reason it is a vector and not a set, is that the states appear in the vector in the order they were touched, in this case “started”.

Rules of The Matrix


While I made a mistake, I am glad I did. It gave me lots of food for thought, and it also pushed me to some other cool tricks with robert.hooke to demonstrate how to bring the same logging back if needed.
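For the curious, here is a minimal sketch (assuming the application, not mount, pulls in robert.hooke and clojure.tools.logging) of how that logging could be hooked back in:

(require '[robert.hooke :refer [add-hook]]
         '[clojure.tools.logging :as log]
         '[mount.core :as mount])

;; a hook that logs whatever a mount lifecycle function returns (i.e. the affected states)
(defn log-lifecycle [f & args]
  (let [result (apply f args)]
    (log/info "lifecycle >>" result)
    result))

(add-hook #'mount/start #'log-lifecycle)
(add-hook #'mount/stop  #'log-lifecycle)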

It does feel great to depend only on Clojure itself and a tiny tools.macro, from which I use a single function that I could potentially just grab from there and cut my dependencies down to The One.

12 Nov 15

Managing Clojure app state since (reset)

After shipping two large projects with Component, and having several more in the works, I decided that Component is not for me, and wrote mount.

Here are the differences between mount and Component, and below is the story of why.

Java is Good


Over the last several years Java has been getting a bad rap from many people who now like “better” languages. Java is too verbose, too complex, too mutable, too ugly, too last century, etc.

I like Java. I liked it 10 years ago, and I like it now. I have worked in many languages, who doesn’t nowadays, and I still like Java. Java is solid. Java is simple. Yes, it is.

There are many corner cases that are well documented, and there are more that are not, but that does not make it complex. You can learn about Monads and Endofunctors on your own while completing and shipping successful Java projects. Category Theory is not a prerequisite. Java is stable. I like Java.

Love is not solid, it’s more


Now, I can’t say I love Java. Java is a good friend of mine; we have a solid relationship. Sometimes we go to work together, but there is no feeling of excitement: I am not running (well, not too fast at least) to my laptop to try this cool Java thing, and I would not spend a weekend with it unless we had to.

That’s where Clojure comes in. I love it. I won’t go into details on why: first of all, it’s personal; second of all, there are plenty of other blog posts, books and videos that make Clojure shine. I just want to state that I love it. There is a difference.

Clojure. The Beginning.


I came to Clojure several years ago from a pretty common background: lots of Java and Spring. I like Spring a lot. It makes the Java world shine, it taught me great ways to approach problems, and it has great documentation and a friendly community. I love friendly communities.

As I wrote more and more Clojure, I fell in love with each new discovery; it made me think of time in a way Java didn’t. It greatly extended my reach into the science behind a language.

Time went on and I started doing Clojure professionally. Using Clojure side by side with Java / Scala projects, using Clojure for tools and libraries, and using Clojure professionally for products/applications are quite different experiences. Products are very different and very stateful beings.

Clojure Developers are People


The question of state, not just a “map” or a “vector” in a lexical scope, but product (application) state, became one of great importance. And the choices here are not great.

From talking to people I concluded that

* some people keep their product (application) states in “def”s
* some in “atoms”
* others in Component

The “Clojure Gods” talk a lot about state, but very rarely about application state and codebase organization. I doubt Datomic uses Component, but I don’t know.

Since Component was gaining popularity, and I had talked to the JUXT people (great people btw) who seemed very enthusiastic about it, I decided to give it a go.

Component Framework


When I started creating projects with Component, I had already learned to like the Clojure way of functions and namespaces, so the object (record) oriented approach of Component was immediately suspicious.

Component is not exactly Spring, of course. It aims to structure a stateful application so it is reloadable in the REPL, since REPL restart time is, well, slow.

In order to do that effectively, Component requires whole-app buy-in, which makes it a framework rather than a library: another “unClojure” feeling that stayed with me while using Component.
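To make the contrast concrete, here is a rough, simplified sketch of the two styles; connect, disconnect and config below are placeholders, not real APIs:

(ns app.db
  (:require [com.stuartsierra.component :as component]
            [mount.core :refer [defstate]]))

;; placeholders standing in for a real datastore client
(defn connect    [uri]  {:uri uri :open? true})
(defn disconnect [conn] (assoc conn :open? false))
(def  config {:uri "datomic:mem://app"})

;; Component: state lives in a record, and the rest of the app
;; receives it as an injected dependency of the system map
(defrecord Conn [config]
  component/Lifecycle
  (start [this] (assoc this :conn (connect (:uri config))))
  (stop  [this] (disconnect (:conn this)) (assoc this :conn nil)))

;; mount: state is just a var in a namespace that other namespaces refer to directly
(defstate conn
  :start (connect (:uri config))
  :stop  (disconnect conn))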

Spring is a framework, and I like it. But in the Java world that’s the culture: “frameworks” are the approach. It is well accepted, and tools are built around it.

Clojure world is all about libraries, and I love it, the same way I love open source solutions vs. closed packaged ones, i.e. open systems vs. closed systems. High cohesion, loose coupling, win win.

At Large


I understand that many people like Component, and I think their projects are based on it, although that is not exactly evident, since most of the products Component would be needed for are proprietary: enterprise (a.k.a. “at large”), “startups”, etc. But there are several open source ones that look really good. A couple of examples:

* Onyx
* BirdWatch

although BirdWatch switched from Component to systems-toolbox:

“I have thrown out the Component library in master on GitHub, and I find using the systems-toolbox much more straight-forward”

I like Scotch


Long story short, while I delivered with Component, it did not deliver for me.

I rewrote Component projects with mount, and already shipped two of them. Both are live and blooming in prod.

Another team I work with liked this approach a lot, and rewrote one of their products with mount as well. The first thing one of their developers said after trying it: “oh.. look, it’s like Clojure again!”.

It is, it is Clojure again for me too. I get it, Component may “do it” for many people, but alternatives are great!

I like Java, I love Clojure.


8 Oct 15

iterator-seq: chunks and hasNext

A couple of interesting facts to keep in mind about iterator-seq:

  • it calls hasNext right away
  • it “caches” reads by chunks of 32 items

Its implementation is quite simple and returns a lazy seq, but the above two are good to keep in mind when working with iterators:

private static final int CHUNK_SIZE = 32;
public static ISeq chunkIteratorSeq(final Iterator iter){
    if(iter.hasNext()) {
        return new LazySeq(new AFn() {
            public Object invoke() {
                Object[] arr = new Object[CHUNK_SIZE];
                int n = 0;
                while(iter.hasNext() && n < CHUNK_SIZE)
                    arr[n++] = iter.next();
                return new ChunkedCons(new ArrayChunk(arr, 0, n), chunkIteratorSeq(iter));
            }
        });
    }
    return null;
}
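Both points are easy to observe at the REPL; here is a small sketch with an iterator that announces every hasNext call:

;; an iterator over 0..99 that announces every hasNext call
(defn noisy-iterator []
  (let [n (atom -1)]
    (reify java.util.Iterator
      (hasNext [_] (println "hasNext?") (< @n 99))
      (next    [_] (swap! n inc)))))

(def xs (iterator-seq (noisy-iterator)))
;; "hasNext?" prints right away, before anything is consumed from xs

(first xs)
;; => 0, but "hasNext?" prints a few dozen more times: a whole chunk of 32 items is read eagerly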

Why 32? I like 42 better!


32 is a good choice for the CHUNK_SIZE since it matches the number of child nodes in Clojure (persistent) collections:

static public PersistentVector create(ISeq items){
    Object[] arr = new Object[32];
    int i = 0;
    for(;items != null && i < 32; items = items.next())
        arr[i++] = items.first();
 
    if(items != null) {  // >32, construct with array directly
        PersistentVector start = new PersistentVector(32, 5, EMPTY_NODE, arr);
        TransientVector ret = start.asTransient();
        for (; items != null; items = items.next())
            ret = ret.conj(items.first());
        return ret.persistent();
    } else if(i == 32) {   // exactly 32, skip copy
        return new PersistentVector(32, 5, EMPTY_NODE, arr);
    } else {  // <32, copy to minimum array and construct
        Object[] arr2 = new Object[i];
        System.arraycopy(arr, 0, arr2, 0, i);
        return new PersistentVector(i, 5, EMPTY_NODE, arr2);
    }
}

The Dark Side of “hasNext()”


But before creating a lazy seq, the first call “iterator-seq” makes is iter.hasNext(). While this makes sense (why create a seq if there is nothing to create it from), a thing to keep in mind is the implementation of the iterator that is passed to “iterator-seq”. Here is an example from my recent HBase journey.

cbass wraps an HBase Scanner in “iterator-seq”:

(let [results (-> (.iterator (.getScanner h-table (scan-filter criteria)))
                  iterator-seq)]
  ...)

Once “iterator-seq” makes a call to iter.hasNext(), the HBase scanner goes out and fetches the first result based on its filter. While this sounds ok, internally, depending on the HBase client caching configuration, it may end up fetching lots and lots of data to “cache” locally before returning a single item. Not exactly the “lazy seq behavior” one can expect. More about it here.
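One way to tame this, sketched below with the plain HBase client API (outside of cbass), is to lower the scanner caching so the client does not prefetch a large batch just to hand back a single row:

(import '(org.apache.hadoop.hbase.client Scan))

;; fetch one row per RPC instead of the (potentially large) configured default
(def small-scan (doto (Scan.) (.setCaching 1)))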

To conclude: it is always good to keep a fresh copy of the Clojure source code in your head :)

6 Oct 15

HBase: Delete By Anything

HBase Cookies


“Big data” is this great old buzz that no longer surprises anyone. Even politicians used it way back in 2012 to win the elections.

But the field has become so large and fragmented that simple CRUD operations, which we used to just take for granted, now require a whole new approach depending on which data store we have to deal with.

This short tale is about a tiny little feature in the universe of HBase, namely a “delete by” feature, which is quite simple in Datomic or SQL databases, but not that trivial in HBase due to the way its cookie crumbles.

Delete it like you mean it!


There is often a case where rows need to be deleted by a filter similar to the one used in a scan (i.e. by row key prefix, time range, etc.). HBase does not really help there, besides providing a BulkDeleteEndpoint coprocessor.

This is not ideal, as it delegates work to HBase “stored procedures” (which is effectively what coprocessors are). It really pays off during massive data manipulation, since everything happens directly on the server, but in simpler cases, which are many, coprocessors are less than ideal.

cbass achieves “deleting by anything” with a trivial flow: “scan + multi delete”, packed into a “delete-by” function which preserves the “scan” syntax:

user=> (scan conn "galaxy:planet")
{"earth"
 {:age "4.543 billion years",
  :inhabited? true,
  :population 7125000000},
 "neptune" {:age "4.503 billion years", :inhabited? :unknown},
 "pluto" {},
 "saturday" {:age "24 hours", :inhabited? :sometimes},
 "saturn" {:age "4.503 billion years", :inhabited? :unknown}}
user=> (delete-by conn "galaxy:planet" :from "sat" :to "saz")
;; deleting [saturday saturn], since they both match the 'from/to' criteria

look ma, no saturn, no saturday:

user=> (scan conn "galaxy:planet")
{"earth"
 {:age "4.543 billion years",
  :inhabited? true,
  :population 7125000000},
 "neptune" {:age "4.503 billion years", :inhabited? :unknown},
 "pluto" {}}

and of course any other criteria that is available in “scan” is available in “delete-by”.
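Under the hood the flow is conceptually just “collect the matching row keys, then send one batched delete”. Here is a simplified sketch of that idea using the plain HBase client API; it is an illustration, not cbass’s actual implementation:

(import '(org.apache.hadoop.hbase.client Delete Table)
        '(org.apache.hadoop.hbase.util Bytes))

;; turn every matched row key into a Delete and send them as a single batch
(defn delete-rows [^Table table row-keys]
  (.delete table (java.util.ArrayList.
                   (map #(Delete. (Bytes/toBytes ^String %)) row-keys))))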

Delete Row Key Function


Most of the time HBase row keys are salted with a prefix. This is done to avoid “RegionServer hotspotting”.

“delete-by” internally does a “scan” and returns the keys that matched. Hence, in order to delete these keys, they have to be “re-salted” according to the custom key design.

cbass addresses this by taking an optional delete-key-fn, which allows “putting some salt back” on those keys.

Here is a real world example:

;; HBase data
 
user=> (scan conn "table:name")
{"���|8276345793754387439|transfer" {...},
 "���|8276345793754387439|match" {...},
 "���|8276345793754387439|trade" {...},
 "�d\k^|28768787578329|transfer" {...},
 "�d\k^|28768787578329|match" {...},
 "�d\k^|28768787578329|trade" {...}}

a couple of observations about the key:

  • it is prefixed with salt
  • it is pipe delimited

In order to delete, say, all keys that start with 8276345793754387439, besides providing :from and :to, we would need to provide a :row-key-fn that de-salts and splits the key, and then a :delete-key-fn that can reassemble it back:

(delete-by conn "table:name" :row-key-fn (comp split-key without-salt)
                             :delete-key-fn (fn [[x p]] (with-salt x p))
                             :from (-> "8276345793754387439" salt-pipe)
                             :to   (-> "8276345793754387439" salt-pipe+))

The *salt, *split and *pipe functions are not from cbass, since they are application specific. They are here to illustrate how “delete-by” can be used to take on the real world.

;; HBase data after the "delete-by"
 
user=> (scan conn "table:name")
{"�d\k^|28768787578329|transfer" {...},
 "�d\k^|28768787578329|match" {...},
 "�d\k^|28768787578329|trade" {...}}