"; */ ?>


08
Oct 15

HBase Scan: Let me cache it for you

HBase uses Scanners to SELECT things FROM table WHERE this AND that. An interesting bit about HBase scanners is caching.

Since HBase is designed to sit between a low level HDFS and a high level SQL front end, many of its APIs “naturally leak”, which makes it harder to come up with good names, since they have to fit both worlds.

Scanner “caching” is a good example of a feature name that makes perfect sense internally in HBase, and is more than confusing for an HBase client.

Caching scans.. what!?


Here is how it works in a nutshell:

Scanners are iterators that return results when “.next” is called on them.

Once the “.next” is called on a scanner, it will go and fetch the next result that matches its filter (i.e. similar to the SQL’s WHERE clause).

Here is the tricky part. Before the scanner returns that single result, it caches a chunk of results, so the next call to “.next” would come from this chunk, which effectively is a local “cache”. This is done so the trip to the region server for each call to “.next” can be avoided:

@Override
public Result next() throws IOException {
  // If the scanner is closed and there's nothing left in the cache, next is a no-op.
  if (cache.size() == 0 && this.closed) {
    return null;
  }
  if (cache.size() == 0) {
    loadCache();
  }
 
  if (cache.size() > 0) {
    return cache.poll();
  }
 
  // if we exhausted this scanner before calling close, write out the scan metrics
  writeScanMetrics();
  return null;
}

The snippet above is from HBase’s ClientScanner.

loadCache() fetches a chunk of results (of “caching” size) from a region server and stuffs it into cache, which is just a Queue&lt;Result&gt; that gets drained by all subsequent calls to “.next”.
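Stripped of HBase specifics, the mechanics can be sketched with a plain queue: one (simulated) region server trip fills a chunk, and subsequent calls to “.next” drain it locally. ChunkedScanner below is an illustration of the idea, not the actual ClientScanner:

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;

// A toy "scanner" mimicking ClientScanner's caching behavior: each trip to
// the (simulated) region server fetches a chunk of results, and subsequent
// next() calls drain the local queue.
class ChunkedScanner {
    private final Iterator<String> regionServer; // stands in for RPCs to a region server
    private final int chunkSize;                 // stands in for scanner "caching"
    private final Queue<String> cache = new ArrayDeque<>();
    int rpcCount = 0;                            // how many "server trips" were made

    ChunkedScanner(List<String> rows, int chunkSize) {
        this.regionServer = rows.iterator();
        this.chunkSize = chunkSize;
    }

    String next() {
        if (cache.isEmpty()) {
            loadCache();
        }
        return cache.poll(); // null once the scanner is exhausted
    }

    private void loadCache() {
        if (!regionServer.hasNext()) return;
        rpcCount++; // one "trip" fills the whole chunk
        for (int i = 0; i < chunkSize && regionServer.hasNext(); i++) {
            cache.add(regionServer.next());
        }
    }
}

public class Main {
    public static void main(String[] args) {
        ChunkedScanner s = new ChunkedScanner(List.of("r1", "r2", "r3", "r4", "r5"), 2);
        while (s.next() != null) { /* drain the scanner */ }
        // 5 rows with a chunk of 2 => 3 server trips instead of 5
        System.out.println(s.rpcCount); // 3
    }
}
```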

Oh.. and Max Result Size is in bytes


Why is that tricky? Well, the tricky part is the size of this “chunk” that will be cached by the scanner on each request.

By default, this size is Integer.MAX_VALUE, which means that the scanner will try to cache as much as possible: up to Integer.MAX_VALUE rows, or whatever fits into “hbase.client.scanner.max.result.size”, which is 2MB by default. Depending on the size of the final result set, and on how scanners are used, this can get unwieldy pretty quickly.

“hbase.client.scanner.max.result.size” is in bytes (not a number of rows), so it is not exactly high level SQL’s “LIMIT” / “ROWNUM”. If not set properly, “caching” may end up either taking all the memory or simply returning timeouts (e.g. “ScannerTimeoutException”, which can get routed into “OutOfOrderScannerNextException”s, etc.) from the region server. Hence getting these two, “max size” and “cache size”, to play together nicely is important for smooth scanning.
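The interplay boils down to simple arithmetic: a single scanner RPC returns at most “caching” rows, and at most “max.result.size” bytes worth of them, whichever cap is hit first. The helper and the average row size below are illustrative assumptions, not HBase code (only the 2MB default comes from the post):

```java
// Back-of-the-envelope: how "caching" (a row cap) and
// "hbase.client.scanner.max.result.size" (a byte cap) bound one scanner RPC.
public class Main {
    // rows a single RPC can return: whichever cap kicks in first
    static long rowsPerRpc(long caching, long maxResultSizeBytes, long avgRowBytes) {
        long byBytes = Math.max(1, maxResultSizeBytes / avgRowBytes);
        return Math.min(caching, byBytes);
    }

    public static void main(String[] args) {
        long twoMb = 2 * 1024 * 1024;
        // caching left at Integer.MAX_VALUE: the 2MB byte cap kicks in first
        System.out.println(rowsPerRpc(Integer.MAX_VALUE, twoMb, 4 * 1024)); // 512
        // caching set to 100 rows: the row cap kicks in first
        System.out.println(rowsPerRpc(100, twoMb, 4 * 1024)); // 100
    }
}
```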

Here is another good explanation by my friend Dan on why HBase caching is tricky and what it really does behind the covers.

Clients are Humans


If scanner “caching” were named a bit differently, for example “fetch size” would not be too bad, it would be a lot more obvious in the HBase client API. Then again, behind the covers these values truly are “cached”, and are read from a cache on each call to “scanner.next()”.

Both make sense, but I would argue that the client facing API should take precedence here. HBase developers already know the internals, and a fetch size (as a property set on Scan) would not hurt them: “caching” can still be used internally under the HBase covers.


06
Oct 15

HBase: Delete By Anything

HBase Cookies


“Big data” is this great old buzz that no longer surprises anyone. Even politicians used it way back in 2012 to win the elections.

But the field has become so large and fragmented that simple CRUD operations, which we used to take for granted, now require a whole new approach depending on which data store we have to deal with.

This short tale is about a tiny little feature in the universe of HBase. Namely a “delete by” feature. Which is quite simple in Datomic or SQL databases, but it is not that trivial in HBase due to the way its cookie crumbles.

Delete it like you mean it!


There is often a case where rows need to be deleted by a filter similar to the one used in a scan (i.e. by row key prefix, time range, etc.). HBase does not really help there beyond providing a BulkDeleteEndpoint coprocessor.

This is not ideal, as it delegates work to HBase “stored procedures” (which is effectively what coprocessors are). It really pays off during massive data manipulation, since everything happens directly on the server, but in simpler cases, which are many, coprocessors are less than ideal.

cbass achieves “deleting by anything” by a trivial flow: “scan + multi delete” packed in a “delete-by” function which preserves the “scan“‘s syntax:

user=> (scan conn "galaxy:planet")
{"earth"
 {:age "4.543 billion years",
  :inhabited? true,
  :population 7125000000},
 "neptune" {:age "4.503 billion years", :inhabited? :unknown},
 "pluto" {},
 "saturday" {:age "24 hours", :inhabited? :sometimes},
 "saturn" {:age "4.503 billion years", :inhabited? :unknown}}
user=> (delete-by conn "galaxy:planet" :from "sat" :to "saz")
;; deleting [saturday saturn], since they both match the 'from/to' criteria

look ma, no saturn, no saturday:

user=> (scan conn "galaxy:planet")
{"earth"
 {:age "4.543 billion years",
  :inhabited? true,
  :population 7125000000},
 "neptune" {:age "4.503 billion years", :inhabited? :unknown},
 "pluto" {}}

and of course any other criteria that is available in “scan” is available in “delete-by”.
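Under some assumed simplifications (a sorted map standing in for a table, since HBase keeps row keys sorted), the “scan + multi delete” flow can be sketched in Java; deleteBy here is an illustration, not cbass or HBase code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// The "scan + multi delete" flow behind a delete-by, sketched over a sorted
// map standing in for an HBase table (row keys are sorted, as in HBase).
public class Main {
    static List<String> deleteBy(TreeMap<String, String> table, String from, String to) {
        // 1. "scan": collect the keys matching the from/to criteria
        List<String> matched = new ArrayList<>(table.subMap(from, to).keySet());
        // 2. "multi delete": remove the matched keys in one batch
        matched.forEach(table::remove);
        return matched;
    }

    public static void main(String[] args) {
        TreeMap<String, String> planets = new TreeMap<>();
        for (String p : List.of("earth", "neptune", "pluto", "saturday", "saturn"))
            planets.put(p, "{}");
        System.out.println(deleteBy(planets, "sat", "saz")); // [saturday, saturn]
        System.out.println(planets.keySet());                // [earth, neptune, pluto]
    }
}
```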

Delete Row Key Function


Most of the time HBase keys are prefixed (salted with a prefix). This is done to avoid “RegionServer hotspotting“.

“delete-by” internally does a “scan” and returns the keys that matched. Hence, in order to delete these keys, they have to be “re-salted” according to the custom key design.

cbass addresses this by taking an optional delete-key-fn, which allows putting “some salt back” on those keys.

Here is a real world example:

;; HBase data
 
user=> (scan conn "table:name")
{"���|8276345793754387439|transfer" {...},
 "���|8276345793754387439|match" {...},
 "���|8276345793754387439|trade" {...},
 "�d\k^|28768787578329|transfer" {...},
 "�d\k^|28768787578329|match" {...},
 "�d\k^|28768787578329|trade" {...}}

a couple observations about the key:

  • it is prefixed with salt
  • it is piped delimited

In order to delete, say, all keys that start with 8276345793754387439, besides providing :from and :to, we would need to provide a :row-key-fn that would de-salt and split the key, and then a :delete-key-fn that can reassemble it back:

(delete-by conn "table:name" :row-key-fn (comp split-key without-salt)
                             :delete-key-fn (fn [[x p]] (with-salt x p))
                             :from (-> "8276345793754387439" salt-pipe)
                             :to   (-> "8276345793754387439" salt-pipe+))

*salt, *split and *pipe functions are not from cbass, since they are application specific. They are here to illustrate the point of how “delete-by” can be used to take on the real world.

;; HBase data after the "delete-by"
 
user=> (scan conn "table:name")
{"�d\k^|28768787578329|transfer" {...},
 "�d\k^|28768787578329|match" {...},
 "�d\k^|28768787578329|trade" {...}}

06
Oct 15

Adding Simple to HBase

Mutate and Complect!


The usual trend in functional programming is “immutable” => good, “mutable” => bad. Not true in all cases, but it is true for most, especially when multiple threads, processes, or machines are involved.

HBase APIs are very much based on mutation. Since there are so many different ways to, for example, “scan” data, instead of using overloaded constructors or builders, HBase relies on setters. Count the number of setters in Scan, for example.

This just does not sit well with “immutable is good” feeling.

A long time HBaser might not agree, but I believe the learning curve is quite steep for HBase newcomers. This is due to many things: the Hadoop architecture, the data model, row key design, coprocessors, all the cool things it does. But mainly, I think, it is due to a heavy set of APIs that are just not simple.

Connecting “with” HBase


Here is an example from the HBase book on how to find all columns in a row and family that start with “abc”. In SQL this would be done with something like:

SELECT * FROM <table> WHERE <row> LIKE 'abc%';

In HBase (this is a book example) it would be:

HTableInterface t = ...;
byte[] row = ...;
byte[] family = ...;
byte[] prefix = Bytes.toBytes("abc");
Scan scan = new Scan(row, row);        // (optional) limit to one row
scan.addFamily(family);                // (optional) limit to one family
Filter f = new ColumnPrefixFilter(prefix);
scan.setFilter(f);
scan.setBatch(10);                     // set this if there could be many columns returned
ResultScanner rs = t.getScanner(scan);
for (Result r = rs.next(); r != null; r = rs.next()) {
  for (KeyValue kv : r.raw()) {
    // each kv represents a column
  }
}
rs.close();

and that is given that the data is not actually read into a comprehensible data structure (that is done in a nested loop), and that concepts like row / family / column / scan, etc. are well understood. I say it is not that simple. But can it be?

I say yes, it can. How about:

(scan conn table-name :starts-with "abc")

While a connection (conn) needs to be created, and a family might be added if needed, this is a much simpler way to “connect with” HBase.
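One common way a row prefix scan like this is implemented underneath is to translate the prefix into a start/stop row range, bumping the last character of the prefix to form the (exclusive) stop row. The sketch below runs over a sorted map standing in for a table; it illustrates the idea, not cbass internals, and ignores edge cases like a prefix ending in the maximum character:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// A ":starts-with" prefix scan as a start/stop row range over sorted keys.
public class Main {
    static SortedMap<String, String> scanStartsWith(TreeMap<String, String> table, String prefix) {
        // stop row: the prefix with its last character bumped by one
        char last = prefix.charAt(prefix.length() - 1);
        String stop = prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
        return table.subMap(prefix, stop); // [prefix, stop) covers all "prefix*" keys
    }

    public static void main(String[] args) {
        TreeMap<String, String> t = new TreeMap<>();
        t.put("abc1", "v1");
        t.put("abc2", "v2");
        t.put("abd1", "v3");
        System.out.println(scanStartsWith(t, "abc").keySet()); // [abc1, abc2]
    }
}
```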

These are some of the reasons cbass was created: mainly to add “simple” to HBase.


12
Aug 15

Plain Old Clojure Object

Those times you need to have Java APIs.. Some of these APIs need to return data. In Clojure it is usually a map:

{:q "What is..?" :a 42}

In Java it is not that simple, for several reasons: Java maps are mutable, there are no idiomatic tools to inspect or destructure them, Java (programmers) like different types for different POJOs, etc.

So this data needs to be encapsulated in a way Java likes it, usually in the form of an object with private fields and getters and no behavior, i.e. POJOs.
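For reference, this is the kind of Java shape we are talking about: private final fields, getters, no behavior. PlainOldJavaObject is a hypothetical example, not from any library:

```java
// A minimal POJO: private final fields, getters, no behavior.
final class PlainOldJavaObject {
    private final boolean human;
    private final boolean found;
    private final String planet;

    PlainOldJavaObject(boolean human, boolean found, String planet) {
        this.human = human;
        this.found = found;
        this.planet = planet;
    }

    boolean isHuman() { return human; }
    boolean isFound() { return found; }
    String planet() { return planet; }
}

public class Main {
    public static void main(String[] args) {
        PlainOldJavaObject pojo = new PlainOldJavaObject(true, true, "42");
        System.out.println(pojo.isHuman()); // true
        System.out.println(pojo.planet());  // 42
    }
}
```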

Of course a Clojure project may have Java sources where these POJOs can live, but why not just stick to Clojure all the way and create them using gen-class? Why? Because it is fun, and also because we can easily :require and use other Clojure libraries in these POJOs in case we need to.

JSL: Java as a second language


Oh yea, and let’s call them POCOs, cause they kind of are:

(ns poco)
 
(gen-class 
  :name org.stargate.PlainOldClojureObject
  :state "state"
  :init "init"
  :constructors {[Boolean Boolean String] []}
  :methods [[isHuman [] Boolean]
            [isFound [] Boolean]
            [planet [] String]])
 
(defn -init [human? found? planet]
  [[] (atom {:human? human?
             :found? found?
             :planet planet})])
 
(defn- get-field [this k]
  (@(.state this) k))
 
(defn -isHuman [this]
  (get-field this :human?))
 
(defn -isFound [this]
  (get-field this :found?))
 
(defn -planet [this]
  (get-field this :planet))
 
(defn -toString [this]
  (str @(.state this)))

This compiles and behaves exactly like a Java POJO would, since it is a POJO, I mean POCO:

user=> (import '[org.stargate PlainOldClojureObject])
org.stargate.PlainOldClojureObject
 
user=> (def poco (PlainOldClojureObject. true true "42"))
#'user/poco
 
user=> poco
#object[org.stargate.PlainOldClojureObject 0x68033b90 "{:human? true, :found? true, :planet \"42\"}"]
 
user=> (.isHuman poco)
true
user=> (.isFound poco)
true
user=> (.planet poco)
"42"

Of course there are records, but POCOs are just more fun :)


23
Apr 15

Question Everything

Feeding Da Brain


In the 90s you would say: “I am a programmer”. Some would reply with “o.. k”. The more insightful would reply with a question: “which programming language?”.

21st century.. socially accepted terminology has changed a bit: now you say “I am a developer”. Some would ask “which programming language?”. The more insightful would ask: “which of these 42 languages do you use the most?”

The greatest thing about using several at the same time is that feeling of constant adjustment as I jump between the languages. It feels like my brain goes through exuberant synaptogenesis over and over again building those new formations.

   What's for dinner today honey?
   Asynchronous brain refactoring with a gentle touch of "mental polish"

With all these new synapses, I came to love the fact that something that seemed so holy and “crystal right” before, now gets questioned and can easily be dismissed. Was it wrong all along? No. Did it change? No. So what changed then? Well.. perception did.

Inmates of the “Gang of Four” Prison


Design patterns are these “ways” of doing things that cripple new programmers and imprison many senior ones. Instead of having the ability to think freely, we have all these “software standard patterns”, which mostly have to do with the limitations of the technology at the time.

Take the big guys, like C++ / Java / C#: while they have many great features and ideas, these languages have a horrible story of “behavior and state”: you always have to guard something, whether it is from multiple threads or from other people misusing it. The languages themselves promote reuse over decoupling: i.e. “let’s inherit that behavior”, etc..

So how do we overcome these risks and limitations? Simple: let’s create dozens of “ways” that all developers will follow to fight this together. Oh, yea, and let’s make it industry standard, call them patterns, teach them in schools, and select people by how well they can “apply” these patterns to “any” problem at hand.

Not all developers bought into this cult, of course. Here are Peter Norvig’s notes from 1996, where he “dismisses” 16 of the 23 Gang of Four patterns by just using functions, types, modules, etc.

Builder Pattern vs. Immutable Data Structures


The Builder pattern makes sense, unless.. several things. There is a great short post, “Builders vs. Option Maps”, that talks about Builder pattern limitations:

* Builders are too verbose
* Builders are not data structures
* Builders are mutable
* Builders can’t easily compose
* Builders are order dependent

Due to mutable data structures (in Java/C# and the like), Builders still make sense for things like Google protobufs with simple (i.e. primitive) types, but in most cases where immutable things need to be created, it is best to use immutable data structures, for the reasons above.

While jumping between languages, I often need to create things in Clojure that are implemented in Java with Builders. This is not always easy, especially when Builders rely on external state and/or need to be passed around (i.e. to achieve a certain level of composition).

Let’s say I need to create a notification client that, by design (on the Java side of things), takes some initial state (i.e. an external system connection props), and then event handlers (callbacks) are registered on it one by one, before it gets built, i.e. builds a final, immutable notification client:

SomeClassBuilder builder = SomeClass.newBuilder()
                             .setState( state )
                             .setAnotherThing( thing );
 
builder.register( notificationHandlerOne );
builder.register( notificationHandlerTwo );
...
builder.register( notificationHandlerN );
 
builder.build();

In case you need to decouple the “register events” logic from the monolithic piece above, you would pass that builder to a caller that would pass it down the chain. It is something that seems “normal” to do (at least to a “9 to 5” developer), since methods with side effects do not really raise any eyebrows in OO languages. In fact, most methods in those languages have side effects.

I quite often see people designing builders like the one above (with lots of external state), and when I need to use them from Clojure, it becomes very apparent that the above is not well designed:

;; creates a "mutable" builder..
(defn- make-bldr [state thing]
  (-> (SomeClass/newBuilder)
      (.setState state)
      (.setAnotherThing thing)))
 
;; wraps "builder.register(foo)" into a composable function
(defn register-event-handler! [bldr handler]
    ;; in case handler is just a Clojure function wrap it with something ".register" will accept
    (.register bldr handler))
 
(defn notification-client [state thing]
  (let [bldr (make-bldr state thing)]
    ;; ... do things that would call "register-event-handler!" passing them the "bldr"
    (.build bldr)))

Things that immediately feel “off”: returning a mutable builder from “make-bldr”, mutating that builder in “register-event-handler!”, and returning the mutated builder back.. Not at all common in Clojure.

Again, the goal is to “decouple the logic to register events from notification client creation”; if both can happen at the same time, the above Builder would work.

In Clojure it would just be a map. All data structures in Clojure are immutable, so there would be no intermediate mutable holder/builder; it would always be an immutable map. Once all the handlers are registered, this map would be passed to a function that creates a notification client with these handlers and starts it with “state” and “thing”.
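The same idea can be approximated even in Java: instead of mutating a shared builder, each registration returns a new immutable value, and the build step happens once at the end. Registrations and the handler type below are hypothetical sketches, not any real notification API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

// The "option map" idea in Java terms: each register() returns a NEW
// immutable value instead of mutating a shared builder.
final class Registrations {
    final List<Consumer<String>> handlers;

    Registrations() { this(List.of()); }

    private Registrations(List<Consumer<String>> handlers) { this.handlers = handlers; }

    // returns a new Registrations; the original is untouched and safe to share
    Registrations register(Consumer<String> handler) {
        List<Consumer<String>> next = new ArrayList<>(handlers);
        next.add(handler);
        return new Registrations(Collections.unmodifiableList(next));
    }
}

public class Main {
    public static void main(String[] args) {
        Registrations none = new Registrations();
        Registrations two = none.register(e -> {}).register(e -> {});
        // "none" was passed around and never mutated under us
        System.out.println(none.handlers.size()); // 0
        System.out.println(two.handlers.size());  // 2
    }
}
```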

Mocking Suspicions


Another synapse formation, created by using many languages at the same time, convinced me that if I have to use a mock to test something, that something needs a close look, and would potentially welcome refactoring.

The most common case for mocking is:

A method of a component "A" takes a component "B" that depends on a component "C".

So if I want to test A’s method, I can just mock B and not worry about C.

The flaw here is:

"B" that depends on a component "C".

These things are extremely beneficial to question. I used to use Spring a lot, and when I did, I loved it. I learned from it, taught it to others, and had a great sense of accomplishment when high complexity could be broken down into simple pieces and wired back together with Spring.

Time went on, and in Erlang or Clojure, or even Groovy for that matter, I used Spring less and less. I still use it for all my Java work, but not as much. So if 10 years ago:

"B" that depends on a component "C".

was a way of life, now, every time I see it, I ask: why? Does “B” have to depend on “C”? Can “B” be stateless and take “C” whenever it needs it, rather than be injected with it and carry that state burden?

If before “B” was:

public class B {
 
  private final C c;
 
  public B ( C c ) {
    this.c = c;
  }
 
  public Profit doBusiness() {
    return new Profit( c.doYourBusiness() + 42 );
  }
}

Can it instead just be:

public final class B {
  public static Profit doBusiness( C c ) {
    return new Profit( c.doYourBusiness() + 42 );
  }
}

In most cases it can! It really can. The problem is that not enough of us question that dependency, but we should.

This does not mean “B” no longer depends on “C”; it means something more: there is no “B” (“there is no spoon..”) as it just becomes a module that does not need to be passed around as an instance. The only thing left of “B” is “doBusiness( C c )”. Do we need to mock “C” now? Can it, its instance, disappear the way “B” did? Most likely, and even if it can’t for whatever reason (i.e. someone else’s code), I did question it, and so should you.


The more synapse formations I go through the better I learn to question pretty much everything. It is fun, and it pays back with beautiful revelations. I love my brain :)