06
Oct 15

HBase: Delete By Anything

HBase Cookies


“Big data” is this great old buzz that no longer surprises anyone. Even politicians used it way back in 2012 to win the elections.

But the field became so large and fragmented that simple CRUD operations, which we used to take for granted, now require a whole new approach depending on which data store we have to deal with.

This short tale is about a tiny little feature in the universe of HBase. Namely, a “delete by” feature, which is quite simple in Datomic or SQL databases, but not that trivial in HBase due to the way its cookie crumbles.

Delete it like you mean it!


There is often a case where rows need to be deleted by a filter that is similar to the one used in scan (e.g. by row key prefix, time range, etc.). HBase does not really help there besides providing a BulkDeleteEndpoint coprocessor.

This is not ideal as it delegates work to HBase “stored procedures” (effectively this is what coprocessors are). It really pays off during massive data manipulation, since it does happen directly on the server, but in simpler cases, which are many, coprocessors are less than ideal.

cbass achieves “deleting by anything” by a trivial flow: “scan + multi delete” packed in a “delete-by” function which preserves the “scan” syntax:

user=> (scan conn "galaxy:planet")
{"earth"
 {:age "4.543 billion years",
  :inhabited? true,
  :population 7125000000},
 "neptune" {:age "4.503 billion years", :inhabited? :unknown},
 "pluto" {},
 "saturday" {:age "24 hours", :inhabited? :sometimes},
 "saturn" {:age "4.503 billion years", :inhabited? :unknown}}
user=> (delete-by conn "galaxy:planet" :from "sat" :to "saz")
;; deleting [saturday saturn], since they both match the 'from/to' criteria

look ma, no saturn, no saturday:

user=> (scan conn "galaxy:planet")
{"earth"
 {:age "4.543 billion years",
  :inhabited? true,
  :population 7125000000},
 "neptune" {:age "4.503 billion years", :inhabited? :unknown},
 "pluto" {}}

and of course any other criterion that is available in “scan” is also available in “delete-by”.
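To make the “scan + multi delete” flow a bit more tangible, here is a minimal sketch in plain Java. It assumes a TreeMap as a stand-in for an HBase table (both keep row keys sorted), and the names DeleteBySketch, scanKeys and deleteBy are made up for illustration. It is not cbass code, just the shape of the idea:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// A sketch only: a TreeMap stands in for an HBase table, since both keep
// row keys sorted. None of these names come from cbass or the HBase client.
class DeleteBySketch {

    // "scan": collect the row keys in [from, to)
    static List<String> scanKeys(NavigableMap<String, String> table, String from, String to) {
        return new ArrayList<>(table.subMap(from, true, to, false).keySet());
    }

    // "delete-by": scan first, then delete every matched key in one batch
    static List<String> deleteBy(NavigableMap<String, String> table, String from, String to) {
        List<String> matched = scanKeys(table, from, to);
        table.keySet().removeAll(matched);   // the "multi delete" step
        return matched;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> planets = new TreeMap<>();
        for (String p : List.of("earth", "neptune", "pluto", "saturday", "saturn")) {
            planets.put(p, "...");
        }
        System.out.println(deleteBy(planets, "sat", "saz")); // [saturday, saturn]
        System.out.println(planets.keySet());                // [earth, neptune, pluto]
    }
}
```

The range here is start inclusive and stop exclusive, which is also how HBase scans treat start and stop rows by default.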

Delete Row Key Function


Most of the time HBase keys are prefixed (salted with a prefix). This is done to avoid “RegionServer hotspotting”.

“delete-by” internally does a “scan” and returns keys that matched. Hence in order to delete these keys they have to be “re-salt-ed” according to the custom key design.

cbass addresses this by taking an optional delete-key-fn, which allows putting some salt back on those keys.

Here is a real world example:

;; HBase data
 
user=> (scan conn "table:name")
{"���|8276345793754387439|transfer" {...},
 "���|8276345793754387439|match" {...},
 "���|8276345793754387439|trade" {...},
 "�d\k^|28768787578329|transfer" {...},
 "�d\k^|28768787578329|match" {...},
 "�d\k^|28768787578329|trade" {...}}

a couple of observations about the key:

  • it is prefixed with salt
  • it is pipe delimited

In order to delete, say, all keys that start with 8276345793754387439, besides providing :from and :to, we would need to provide a :row-key-fn that would desalt and split the key, and then a :delete-key-fn that can reassemble it:

(delete-by conn "table:name" :row-key-fn (comp split-key without-salt)
                             :delete-key-fn (fn [[x p]] (with-salt x p))
                             :from (-> "8276345793754387439" salt-pipe)
                             :to   (-> "8276345793754387439" salt-pipe+))

*salt, *split and *pipe functions are not from cbass, since they are application specific. They are here to illustrate the point of how “delete-by” can be used to take on the real world.

;; HBase data after the "delete-by"
 
user=> (scan conn "table:name")
{"�d\k^|28768787578329|transfer" {...},
 "�d\k^|28768787578329|match" {...},
 "�d\k^|28768787578329|trade" {...}}
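The desalt / resalt round trip is easier to see on a toy key design. Below is a small Java sketch under loud assumptions: the key layout "salt|id|event", the way the salt is derived, and the names SaltedKeySketch, saltOf, withSalt and withoutSalt are all hypothetical, since real salting is application specific:

```java
import java.util.List;

// Hypothetical key design for illustration: "<salt>|<id>|<event>", where the
// salt is derived from the id. Real salting schemes are application specific.
class SaltedKeySketch {

    // assumed salt: two hex digits taken from the id's hash
    static String saltOf(String id) {
        return String.format("%02x", Math.floorMod(id.hashCode(), 256));
    }

    // plays the role of a :delete-key-fn: puts the salt back to form the stored key
    static String withSalt(String id, String event) {
        return saltOf(id) + "|" + id + "|" + event;
    }

    // plays the role of a :row-key-fn: drops the salt and splits into [id, event]
    static List<String> withoutSalt(String rowKey) {
        String[] parts = rowKey.split("\\|", 3);
        return List.of(parts[1], parts[2]);
    }

    public static void main(String[] args) {
        String key = withSalt("8276345793754387439", "trade");
        List<String> parts = withoutSalt(key);
        // re-salting the scanned parts gives back the exact key to delete
        System.out.println(withSalt(parts.get(0), parts.get(1)).equals(key)); // true
    }
}
```

The point is only that whatever the row key function takes apart, the delete key function has to be able to put back together, byte for byte.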

06
Oct 15

Adding Simple to HBase

Mutate and Complect!


The usual trend in functional programming is “immutable” => good, “mutable” => bad. Not true for all cases, but it is true for most, especially when multiple threads, processes, machines are involved.

HBase APIs are very much based on mutation. Since there are so many different ways to, for example, “scan” data, instead of using overloaded constructors or builders, HBase relies on setters. Count the number of setters in Scan, for example.

This just does not sit well with the “immutable is good” feeling.

A long time HBaser might not agree, but I believe the learning curve is quite steep for HBase newcomers. This is due to many things: Hadoop architecture, data model, row key design, co-processors, all the cool things it does. But mainly, I think, this is due to a heavy set of APIs that are just not simple.

Connecting “with” HBase


Here is an example from the HBase book on how to find all columns in a row and family that start with “abc”. In SQL this would be done with something like:

SELECT * FROM <table> WHERE <row> LIKE 'abc%';

In HBase (this is a book example) it would be:

HTableInterface t = ...;
byte[] row = ...;
byte[] family = ...;
byte[] prefix = Bytes.toBytes("abc");
Scan scan = new Scan(row, row);        // (optional) limit to one row
scan.addFamily(family);                // (optional) limit to one family
Filter f = new ColumnPrefixFilter(prefix);
scan.setFilter(f);
scan.setBatch(10);                     // set this if there could be many columns returned
ResultScanner rs = t.getScanner(scan);
for (Result r = rs.next(); r != null; r = rs.next()) {
  for (KeyValue kv : r.raw()) {
    // each kv represents a column
  }
}
rs.close();

and that is given that data is not actually read into a comprehensible data structure (done in a nested loop), and concepts like row / family / column / scan, etc. are well understood. I say it is not that simple. But can it be?

I say yes, it can. How about:

(scan conn table-name :starts-with "abc")

while a connection (conn) needs to be created and a family might be added if needed, this is a much simpler way to “connect with” HBase.

These are some of the reasons cbass was created: mainly to add “simple” to HBase.


12
Aug 15

Plain Old Clojure Object

Those times you need to have Java APIs.. Some of these APIs need to return data. In Clojure it is usually a map:

{:q "What is..?" :a 42}

In Java it is not that simple for several reasons.. Java maps are mutable, there are no idiomatic tools to inspect or destructure them, Java (programmers) like different types for different POJOs, etc.

So this data needs to be encapsulated in a way Java likes it, usually in a form of an object with private fields and getters with no behavior, i.e. POJOs.

Of course a Clojure project may have Java sources, where these POJOs can live, but why not just stick to Clojure all the way and create them using gen-class? Why? Because it is fun, and also because we can easily :require and use other Clojure libraries in these POJOs in case we need to.

JSL: Java as a second language


Oh yea, and let’s call them POCOs, cause they kind of are:

(ns poco)
 
(gen-class 
  :name org.stargate.PlainOldClojureObject
  :state "state"
  :init "init"
  :constructors {[Boolean Boolean String] []}
  :methods [[isHuman [] Boolean]
            [isFound [] Boolean]
            [planet [] String]])
 
(defn -init [human? found? planet]
  [[] (atom {:human? human?
             :found? found?
             :planet planet})])
 
(defn- get-field [this k]
  (@(.state this) k))
 
(defn -isHuman [this]
  (get-field this :human?))
 
(defn -isFound [this]
  (get-field this :found?))
 
(defn -planet [this]
  (get-field this :planet))
 
(defn -toString [this]
  (str @(.state this)))

This compiles and behaves exactly like a Java POJO would, since it is a POJO, I mean POCO:

user=> (import '[org.stargate PlainOldClojureObject])
org.stargate.PlainOldClojureObject
 
user=> (def poco (PlainOldClojureObject. true true "42"))
#'user/poco
 
user=> poco
#object[org.stargate.PlainOldClojureObject 0x68033b90 "{:human? true, :found? true, :planet \"42\"}"]
 
user=> (.isHuman poco)
true
user=> (.isFound poco)
true
user=> (.planet poco)
"42"

Of course there are records, but POCOs are just more fun :)


23
Apr 15

Question Everything

Feeding Da Brain


In the 90s you would say: “I am a programmer”. Some would reply with “o.. k”. The more insightful would reply with a question: “which programming language?”.

21st century.. socially accepted terminology has changed a bit, now you would say “I am a developer”. Some would ask “which programming language?”. The more insightful would reply with a question: “which out of these 42 languages do you use the most?”

The greatest thing about using several at the same time is that feeling of constant adjustment as I jump between the languages. It feels like my brain goes through exuberant synaptogenesis over and over again building those new formations.

   What's for dinner today honey?
   Asynchronous brain refactoring with a gentle touch of "mental polish"

With all these new synapses, I came to love the fact that something that seemed so holy and “crystal right” before, now gets questioned and can easily be dismissed. Was it wrong all along? No. Did it change? No. So what changed then? Well.. perception did.

Inmates of the “Gang of Four” Prison


Design patterns are these “ways” of doing things that cripple new programmers, and imprison many senior ones. Instead of having the ability to think freely, we have all these “software standard patterns” which mostly have to do with limitations of “technology at the time”.

Take the big guys, like C++ / Java / C#. While they have many great features and ideas, these languages have a horrible story of “behavior and state”: you always have to guard something, whether it is from multiple threads, or from other people misusing it. The languages themselves promote reuse vs. decoupling, i.e. “let’s inherit that behavior”, etc.

So how do we overcome these risks and limitations? Simple: let’s create dozens of “ways” that all developers will follow to fight this together. Oh, yea, and let’s make it industry standard, call them patterns, teach them in schools, and select people by how well they can “apply” these patterns to “any” problem at hand.

Not all developers bought into this cult of course. Here are Peter Norvig’s notes from 1996, where he “dismisses” 16 out of 23 patterns from the Gang of Four by just using functions, types, modules, etc.

Builder Pattern vs. Immutable Data Structures


Builder pattern makes sense unless.. several things. There is a great “Builders vs. Option Maps” short post that talks about builder pattern limitations:

* Builders are too verbose
* Builders are not data structures
* Builders are mutable
* Builders can’t easily compose
* Builders are order dependent

Due to mutable data structures (in Java/C#/alike) Builders still make sense for things like Google protobufs with simple (i.e. primitive) types, but for most cases where immutable things need to be created it is best to use immutable data structures for the above reasons.

While jumping between the languages, I often need to create things in Clojure that are implemented in Java with Builders. This is not always easy, especially when Builders rely on external state and/or when Builders need to be passed around (i.e. to achieve a certain level of composition).

Let’s say I need to create a notification client that, by design (on the Java side of things), takes some initial state (e.g. external system connection props), and then has event handlers (callbacks) registered on it one by one before it gets built, i.e. before it builds a final, immutable notification client:

SomeClassBuilder builder = SomeClass.newBuilder()
                             .setState( state )
                             .setAnotherThing( thing );
 
builder.register( notificationHandlerOne );
builder.register( notificationHandlerTwo );
...
builder.register( notificationHandlerN );
 
builder.build();

In case you need to decouple “register events” logic from this monolithic piece above, you would pass that builder to the caller that would pass it down the chain. It is something that seems “normal” to do (at least to a “9 to 5” developer), since methods with side effects do not really raise any eyebrows in OO languages. In fact most methods in those languages have side effects.

I quite often see people designing builders such as the one above (with lots of external state), and when I need to use them in Clojure, it becomes very apparent that the above is not well designed:

;; creates a "mutable" builder..
(defn- make-bldr [state thing]
  (-> (SomeClass/newBuilder)
      (.setState state)
      (.setAnotherThing thing)))
 
;; wraps "builder.register(foo)" into a composable function
(defn register-event-handler! [bldr handler]
    ;; in case handler is just a Clojure function wrap it with something ".register" will accept
    (.register bldr handler))
 
(defn notification-client [state thing]
  (let [bldr (make-bldr state thing)]
    ;; ... do things that would call "register-event-handler!" passing them the "bldr"
    (.build bldr)))

Things that immediately feel “off” are: returning a mutable builder from “make-bldr”, mutating that builder in “register-event-handler!”, and returning that mutated builder back.. Not something common in Clojure at all.

Again, the goal is to “decouple logic to register events from notification client creation”. If both can happen at the same time, the above Builder would work.

In Clojure it would just be a map. Clojure’s core data structures are immutable, so there would be no intermediate mutable holder/builder, it would always be an immutable map. When all handlers are registered, this map would be passed to a function that would create a notification client with these handlers and start it with “state” and “thing”.
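The same “data first, build last” idea can be approximated even on the Java side. Here is a rough sketch, where everything is made up for illustration (NotificationSketch, register, notificationClient, and handlers modeled as plain Consumer<String>): handlers are collected in immutable lists, and the client is only created once all of them are known:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// A sketch with made-up names: handlers are gathered as immutable data,
// and the client is created in one step once all of them are known.
class NotificationSketch {

    // returns a new immutable list; the list passed in is never changed
    static List<Consumer<String>> register(List<Consumer<String>> handlers,
                                           Consumer<String> handler) {
        List<Consumer<String>> updated = new ArrayList<>(handlers);
        updated.add(handler);
        return List.copyOf(updated);
    }

    // the "client" here is just a function that fans an event out to every handler
    static Consumer<String> notificationClient(String state, List<Consumer<String>> handlers) {
        return event -> handlers.forEach(h -> h.accept(state + ": " + event));
    }

    public static void main(String[] args) {
        List<Consumer<String>> none = List.of();
        List<Consumer<String>> one = register(none, msg -> System.out.println("one got " + msg));
        List<Consumer<String>> two = register(one, msg -> System.out.println("two got " + msg));

        notificationClient("prod", two).accept("ping");
        // one got prod: ping
        // two got prod: ping
        System.out.println(none.size() + " " + one.size() + " " + two.size()); // 0 1 2
    }
}
```

Since “register” never mutates anything, the lists can be passed down any chain of callers without anybody having to guard them.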

Mocking Suspicions


Another synapse formation, created from using many languages at the same time, convinced me that if I have to use a mock to test something, that something needs a close look, and would potentially welcome refactoring.

The most common case for mocking is:

A method of a component "A" takes a component "B" that depends on a component "C".

So if I want to test A’s method, I can just mock B and not worry about C.

The flaw here is:

"B" that depends on a component "C".

These things are extremely beneficial to question. I used to use Spring a lot, and when I did, I loved it. Learned from it, taught it to others, and had a great sense of accomplishment when high complexity could be broken down to simple pieces and rewired together with Spring.

Time went on, and in Erlang or Clojure, or even Groovy for that matter, I used Spring less and less. I still use it for all my Java work, but not as much. So if 10 years ago:

"B" that depends on a component "C".

was a way of life, now, every time I see it, I ask: why? Does “B” have to depend on “C”? Can “B” be stateless and take “C” whenever it needs it, rather than be injected with it and carry that state burden?

If before “B” was:

public class B {
 
  private final C c;
 
  public B ( C c ) {
    this.c = c;
  }
 
  public Profit doBusiness() {
    return new Profit( c.doYourBusiness() + 42 );
  }
}

Can it instead just be:

public final class B {
  public static Profit doBusiness( C c ) {
    return new Profit( c.doYourBusiness() + 42 );
  }
}

In most cases it can! It really can. The problem is that not enough of us question that dependency, but we should.

This does not mean “B” no longer depends on “C”, it means something more: there is no “B” (“there is no spoon..”) as it just becomes a module, which does not need to be passed around as an instance. The only thing that is left from “B” is “doBusiness( C c )”. Do we need to mock “C” now? Can it, or its instance, disappear the way “B” did? Most likely, and even if it can’t for whatever reason (e.g. it is someone else’s code), I did question it, and so should you.
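Here is a small sketch of what testing the stateless “doBusiness” could look like. The post never shows “C” or “Profit”, so their shapes below (a single-method interface and a tiny value class) are assumptions, as is the wrapper name NoMockSketch. With those in place, a lambda can stand in for “C”, and no mocking library is involved:

```java
// A sketch: "C" and "Profit" are not defined in the post, so their shapes
// below are assumptions made only to keep the example self-contained.
class NoMockSketch {

    // assumed shape of "C": a single-method interface
    interface C {
        int doYourBusiness();
    }

    // assumed shape of "Profit": a tiny immutable value
    static final class Profit {
        final int amount;
        Profit(int amount) { this.amount = amount; }
    }

    // "B" is only a home for the function, so it never gets instantiated
    static final class B {
        private B() {}

        static Profit doBusiness(C c) {
            return new Profit(c.doYourBusiness() + 42);
        }
    }

    public static void main(String[] args) {
        // a lambda plays the role of "C" for the duration of the call
        Profit profit = B.doBusiness(() -> 8);
        System.out.println(profit.amount); // 50
    }
}
```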


The more synapse formations I go through the better I learn to question pretty much everything. It is fun, and it pays back with beautiful revelations. I love my brain :)


02
Jul 14

Pom Pom Clojure

Fun with lein, Money with maven


This is the second time, while doing Clojure projects, that I have faced a problem with a customer’s “build team” that knows what Java is, loves Maven, but does not believe in Mr. Leiningen. Hence all of the lein niceties (plugins, one liners, tasks, etc.) now need to be converted to “pom.xml”s.

A good start is “lein pom”. While it only scratches the surface, it does generate a “pom.xml” with most of the dependencies. But in most cases it needs to be “massaged” well in order to fit a real maven build process.

Usual Suspects


Besides the most common “lein repl”, here is what I usually need lein to do:

* Compile Clojure code
* Some files need to be AOT compiled
* Run Clojure tests
* Compile ClojureScript

(not Clojure specific, but I’ll include it anyway)

* Compile protobuf
* Create a JAR for most projects
* Create a self executing “uberjar” for others

When Clojure is “Ahead Of Time”


Compiling, AOTing and running tests can be done with the Clojure Maven Plugin:

<plugin>
    <groupId>com.theoryinpractise</groupId>
    <artifactId>clojure-maven-plugin</artifactId>
    <version>1.3.20</version>
    <extensions>true</extensions>
    <executions>
        <execution>
            <id>compile</id>
            <phase>compile</phase>
            <goals>
                <goal>compile</goal>
            </goals>
        </execution>
    </executions>
    <configuration>
        <namespaces>
            <namespace>whatsapp.core</namespace>
        </namespaces>
        <compileDeclaredNamespaceOnly>true</compileDeclaredNamespaceOnly>
        <sourceDirectories>
            <sourceDirectory>src</sourceDirectory>
        </sourceDirectories>
        <testSourceDirectories>
            <testSourceDirectory>test</testSourceDirectory>
        </testSourceDirectories>
    </configuration>
</plugin>

notice “namespaces” and “compileDeclaredNamespaceOnly”, this is how AOT is done for selected namespaces.

For AOT it’s good to remember that “a side effect of compiling Clojure code is loading the namespaces in order to make macros and functions they use available”, here are AOT compilation gotchas to keep in mind.

Compiling ClojureScript


This one is a bit trickier. If it is possible to convince a build team to install lein as a tool that is used for the build process (e.g. similar to “protoc” to compile protobufs), then to compile ClojureScript, the lein cljsbuild plugin can be added to the profile:

vi ~/.lein/profiles.clj
{:user {:plugins [[lein-cljsbuild "1.0.0"]]}}

and an exec maven plugin can be used to relay the execution to “lein”:

<plugin>
    <artifactId>exec-maven-plugin</artifactId>
    <groupId>org.codehaus.mojo</groupId>
    <executions>
        <execution>
            <id>compiling ClojureScript</id>
            <phase>generate-sources</phase>
            <goals>
                <goal>exec</goal>
            </goals>
            <configuration>
                <executable>lein</executable>
                <arguments>
                    <argument>cljsbuild</argument>
                    <argument>once</argument>
                </arguments>
            </configuration>
        </execution>
    </executions>
</plugin>

In fact, if “lein” is installed, it can be used via “exec-maven-plugin” to do everything else as well, but it all depends on build teams’ restrictions. For example, financial customers may have extremely strict “policies”/“rules”/“opinions”.

A couple more options to explore for building ClojureScript would be lein maven plugin and zi-cljs. Here is a related discussion on a ClojureScript google group.

Making Shippables


“lein uberjar” with some config in “project.clj” is all that is needed in the “lein” world. In the maven universe the maven shade plugin will do just that:

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.3</version>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <mainClass>org.gitpod.WhatsApp</mainClass>
                    </transformer>
                </transformers>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                        </excludes>
                    </filter>
                </filters>
            </configuration>
        </execution>
    </executions>
</plugin>

above will create a self executing JAR with all dependencies included and with an entry point (-main) in “org.gitpod.WhatsApp”.

Google Protocol Buffers


With lein it is as simple as plugging in lein protobuf. In maven, it is not as simple, but also not terribly difficult, and it is solved via maven-protoc-plugin:

<plugin>
    <groupId>com.google.protobuf.tools</groupId>
    <artifactId>maven-protoc-plugin</artifactId>
    <version>0.3.2</version>
    <extensions>true</extensions>
    <executions>
        <execution>
            <goals>
                <goal>compile</goal>
            </goals>
            <phase>generate-sources</phase>
        </execution>
    </executions>
    <configuration>
        <protocExecutable>${PROTOBUF_HOME}/src/protoc</protocExecutable>
        <protoSourceRoot>resources/proto</protoSourceRoot>
        <outputDirectory>target/classes</outputDirectory>
        <!--<additionalProtopathElements>-->
        <!--    <param>${PROTOBUF_HOME}/src/google/protobuf</param>-->
        <!--</additionalProtopathElements>-->
    </configuration>
</plugin>

here is a repository it currently lives at:

<pluginRepositories>
    <pluginRepository>
        <id>protoc-plugin</id>
        <url>http://sergei-ivanov.github.com/maven-protoc-plugin/repo/releases/</url>
    </pluginRepository>
</pluginRepositories>

notice “additionalProtopathElements”. In case clojure-protobuf is used with extensions, a path to “descriptor.proto” can be specified in “additionalProtopathElements”.