TechAscent - Distinctly Sorted - A Clojure Performance Koan

2023-11-01

Distinctly Sorted - A Clojure Performance Koan

Setting the Stage

A recent ClojureVerse post raised an interesting question about specifically how to use transducers to get distinct, sorted collections of values on multiple keys/columns. The conversation veered towards performance, which raised a slightly different question - what is the fastest way to get a distinct, sorted set of the values from columns in datasets on the JVM?

The dataset we settled on can be defined as such:

(def naive-vec-data
  (->> (for [_ (range 1e6)]
         {:name (str "some-name-" (rand-int 10000))
          :supply (rand-int 50000)})
       (vec)))
;;165ms

Spoiler alert: ham-fisted is one way to go fast.

So, start by defining the dataset in terms of those structures. repeatedly from the lazy-noncaching namespace builds the sequence almost twice as fast as clojure.core/for, but more meaningfully, hamf's vector provides a faster two arg reduce implementation on the subvec path (which comes up when parallelizing the solution later). While Clojure's base persistent vector's subvec implementation does not derive from IReduceInit and would fall back to an iterator-based reduction.

> (def vec-data
    (->> (fn [] {:name (str "some-name-" (rand-int 10000))
                 :supply (rand-int 50000)})
         (lznc/repeatedly 1000000)
         (hamf/vec)))
;;110ms

Per the original post, the task is to find a sorted collection of the distinct values in both columns - :name and :supply. A simple answer to this is:

(def answer [(->> vec-data (map :name) distinct sort)
             (->> vec-data (map :supply) distinct sort)])
;;274ms

We define one helper - key-op to avoid repeating ourselves too much:

(defn key-op
  [f]
  (mapv f [:name :supply]))

And throughout this article we will be using criterium's quick-bench method.

Some Initial Explorations

The initial suggested answer for the transducer form uses the xforms library:

(crit/quick-bench
 (xforms/transjuxt (key-op (fn [k] (comp (map k)
                                         (distinct)
					 (xforms/sort)
					 (xforms/into []))))
                   vec-data))
;;166ms

This answer is faster than the pure Clojure one - and one might be tempted to stop there and write a blog post amazed with transducers. But first, ask a few questions. Given that the first step is make the data distinct, which is a O(N) operation, and second to sort, which is O(N(log(N)). No big algorithmic mistakes, so how fast can it get?

Working from the initial pure clojure answer - one can get a bit of time back by simply using the lazy-noncaching version of map:

(crit/quick-bench (key-op #(->> vec-data (lznc/map %) distinct sort)))
;;224ms

And even more by using a set as opposed to clojure.core/distinct:

(crit/quick-bench (key-op #(->> vec-data (lznc/map %) set sort)))
;;140ms

Already faster than the transducer form, which is a good place to start. However, ham-fisted has fast sets - faster both in construction time and lookup time than Clojure's base sets - with the very same equality semantics.

(crit/quick-bench (key-op #(->> vec-data (lznc/map %) hamf/mut-set sort)))
;;59ms

So ~5x faster than the simple clojure.core one, and ~3x faster than the xforms/transjuxt one.

Bring in the Datasets

One insight into this problem is that it is doing a columnwise summary of the data. This brought up the idea of using our columnwise data analytics system - tech.ml.dataset.

(require '[tech.v3.dataset :as ds])

Construction of the dataset can be a one-liner:

(crit/quick-bench (ds/->dataset vec-data))
;;127ms

Given the data is already columnar, and in a container, and the key-set is fixed, it can be faster:

(crit/quick-bench (ds/->dataset {:name (lznc/map :name vec-data)
                                 :supply (lznc/map :supply vec-data)}))
;;44.ms

When it's necessary to construct many datasets from incoming data over and over again, one can move to a a parallelized formulation that is faster yet. It uses hamf's pgroups operation to divide up the index space and create many datasets in parallel then concat them at the end. The dataset concat operation has been carefully optimized for exactly situations like this:

(defn load-dataset-cwise
  ([] (load-dataset-cwise vec-data))
  ([vec-data] (ds/->dataset {:name (lznc/map :name vec-data)
                             :supply (lznc/map :supply vec-data)})))

(crit/quick-bench
 (->> (hamf/pgroups
       (count vec-data)
       (fn [^long sidx ^long eidx]
         (load-dataset-cwise (hamf/subvec vec-data sidx eidx))))
      (apply ds/concat)))
;;16ms

So, loading datasets should not itself be a bottleneck. Once we have a dataset, operations on columns get a bit faster:

(def ds (load-dataset-cwise vec-data))

(crit/quick-bench (key-op #(->> (ds %) hamf/mut-set sort)))
;;41ms

That's ~30% faster than the previous best, above.

Parallelizing Set Creation

Now, the profiler indicates that the long pole in tent is creating the sets. Parallelizing set creation saves time in one of two ways. Either by using a concurrent set, or by creating many hashsets in parallel and unioning them together. These can be compared.

First, the concurrent hashset. To save time diving through Javadocs, hamf's set namespace contains a function that can create a concurrent hashset in the way any Clojure programmer would expect.

(crit/quick-bench
 (key-op
  (fn [k]
    (let [s (hamf-set/java-concurrent-hashset)]
      (->> (hamf/pgroups
            (count vec-data)
            (fn [sidx eidx]
             (->> (lznc/map k (hamf/subvec vec-data sidx eidx))
                  (.addAll s))))
           (dorun))
           (sort s)))))
;;19ms

All hamf sets and maps have heavily optimized union operations. Furthermore things like union are set up so that if the left argument is mutable - either unshared or transient - then the data can be modified in place. The fast union operations are designed for these types of situations.

(crit/quick-bench
 (key-op
  (fn [k]
    (->> (hamf/pgroups
          (count vec-data)
          (fn [sidx eidx]
            (hamf/mut-set (lznc/map k (hamf/subvec vec-data sidx eidx)))))
         (reduce hamf/union)
	 (sort)))))
;;55ms

So in this case the concurrent hashset won. This is because in this data, the sets are relatively large - 10000 or 50000 things respectively. so the more expensive but lockless add pathway for the concurrent hashset can outrun the union of the relatively large sets.

Release The Ducks

A very reasonable question at this point is, "isn't this what SQL systems are for?". This is a great excuse to exercise our DuckDB bindings to observe the performance characteristics of DuckDB on this problem. Specifically, to investigate the speed of loading the data, and using prepared statements for query.

The repository associated with this blog post includes a script to enable duckdb on either Linux or mac machines:

source scripts/enable-duckdb

Then, restart the repl and load the library:

distinct-sort> (require '[tmducken.duckdb :as duckdb])
nil
distinct-sort> (duckdb/initialize!)
Nov 02, 2023 11:23:21 AM clojure.tools.logging$eval13914$fn__13917 invoke
INFO: Attempting to load duckdb from "/Users/chrispnuernberger/dev/tech.all/distinct-sort/binaries/libduckdb.dylib"
true
distinct-sort> (def db (duckdb/open-db))
#'distinct-sort/db
distinct-sort> (def conn (duckdb/connect db))
#'distinct-sort/conn

Now load the data. Recall that we defined already defined ds as the TMD dataset of vec-data. As we are going to benchmark this pathway we will define a function to drop then do a single-threaded load of data. The default TMD dataset name is :_unnamed and this gets translated into the string '_unnamed' in the sql database.

(defn load-duckdb-data
  [conn]
  (try (duckdb/drop-table! conn "_unnamed")
       (catch Exception e))
  (duckdb/create-table! conn ds)
  (duckdb/insert-dataset! conn ds))

distinct-sort> (crit/quick-bench (load-duckdb-data conn))
;;341ms

DuckDB does a bit of analysis on the incoming data and as such the load times are more than loading a TMD dataset. Once loaded, however, you have a high performance C++ system at your fingertips :-).

Begin by creating a vector of duckdb prepared statements. Once created, prepared statements can be called using the same conventions as normal Clojure functions. The default return type of prepared statements is a lazy noncaching sequence of datasets but to make this fair we request results concatenated into a single dataset.

distinct-sort> (def prepared-statements
                 (key-op (fn [k]
                           (let [kn (name k)]
                             (duckdb/prepare conn (str "select distinct " kn " from _unnamed order by " kn)
                                             {:result-type :single})))))
#'distinct-sort/prepared-statements
distinct-sort> (first prepared-statements)
#duckdb-prepared-statement-0["select distinct name from _unnamed order by name"]

The '-0' trailing the statement name is the argument count. We can call a statement just like a normal Clojure function and in this case the result is a single dataset:

distinct-sort> (ds/head ((first prepared-statements)))
:_unnamed [5 1]:

|           name |
|----------------|
|    some-name-0 |
|    some-name-1 |
|   some-name-10 |
|  some-name-100 |
| some-name-1000 |

So the benchmark is simple:

distinct-sort> (crit/quick-bench (mapv #(%) prepared-statements))
;;11ms

This is the fastest so far by a factor of 2 -- Nice!

The Gloves Come Off

There are some cases where C++ will beat the JVM and nothing can be done. This is not one of those cases.

What follows is a JVM solution as fast as the C++ solution. The trick is to exploit the exact semantics of the dataset.

The dataset schema:

distinct-sort> (mapv meta (vals ds))
[{:categorical? true, :name :name, :datatype :string, :n-elems 1000000}
 {:name :supply, :datatype :int32, :n-elems 1000000}]

The :supply column has an int32 datatype. Producing a sorted set of any integer datatype <= 32 bits on the JVM can be accomplished efficiently with roaring bitmap. It has a bulk constructor that takes an integer array and builds the relevant structures very fast. dtype-next has a namespace, tech.v3.datatype.bitmap, devoted to working with these great tools.

The conversion of an int32 buffer to an int32 array is very very fast - in this case the buffer is itself backed by an int32 array so the int-array call boils down into Arrays/copyOfRange.

(crit/quick-bench
 (key-op (fn [k]
           (let [col (ds k)]
             (if (= :int32 (dtype/elemwise-datatype col))
               (->> (hamf/pgroups
                     (ds/row-count ds)
                     (fn [^long sidx ^long eidx]
                       (doto (RoaringBitmap.)
                         (.add (dtype/->int-array (dtype/sub-buffer col sidx (- eidx sidx)))))))
                    (reduce #(do (.or ^RoaringBitmap %1 %2) %1))
                    (bitmap/->random-access))
               (->> (hamf/pgroups
                     (ds/row-count ds)
                     (fn [^long sidx ^long eidx]
                       (hamf/mut-set (dtype/sub-buffer col sidx (- eidx sidx)))))
                    (reduce hamf/union)
                    sort))))))
;;11ms

OK, that was with hamf's union. How about a concurrent hashmap approach?

(crit/quick-bench
 (key-op (fn [k]
           (let [col (ds k)]
             (if (= :int32 (dtype/elemwise-datatype col))
               (->> (hamf/pgroups
                     (ds/row-count ds)
                     (fn [^long sidx ^long eidx]
                       (doto (RoaringBitmap.)
                         (.add (dtype/->int-array (dtype/sub-buffer col sidx (- eidx sidx)))))))
                    (reduce #(do (.or ^RoaringBitmap %1 %2) %1))
                    (bitmap/->random-access))
               (let [s (hamf-set/java-concurrent-hashset)]
                 (->> (hamf/pgroups
                       (ds/row-count ds)
                       (fn [^long sidx ^long eidx]
                         (.addAll s (dtype/sub-buffer col sidx (- eidx sidx)))))
                      dorun)
                 (sort s)))))))
;;10ms

In some runs, the Java solution measures faster than the DuckDB solution, so there is probably room to play the JVM-startup-argument game if there's a real reason to win consistently. This was done from the repl which uses the g1gc. For example, in these situations it's possible for the parallelGC to perform better.

Some Final Thoughts

There are various ways to implement a somewhat involved columnwise summarization. The naive Clojure solution was (as you'd expect), beautiful, simple, and easy, and somewhat easier to find a faster path.

Other ideas abound, java treeset exists, the JVM has parallel sorts. There are explorations in the repository, so you can play. However, the reason they aren't in the blog post is they weren't fast enough to be interesting.

It's obviously encouraging that you can actually get pretty good performance just substituting a few characters in the naive solution. The best current bespoke solutions are much longer in lines but can be much faster and the use of DuckDB in this post is itself fairly naive. Algorithms which require generic filter-type operations such as range queries are going to be very very fast with DuckDB.

It bears repeating that the Java concurrent hashmap and concurrent hashset are fantastic datastructures. For those interested, the concurrent hashmap's put implementation uses CAS operations if the bucket is empty and a lock on the first node if it is not - meaning that put operations are effectively lockless in many situations. That being said, all the java.util.* datastructures use strict equal semantics, so if you make a map or set of longs and attempt to do a lookup with an integer, then the pain train is coming to town.

There really is a lot to learn about high performance engineering on the JVM. It really is its own beast, with key difference from native systems like Rust or C++. A post like this will never be the final say, it's more interesting to lift the Clojure community's knowledge in this manner one or two levels.

The problem with high performance programming is the expression of the problem and in that sense Clojure helps a lot - you get a correct solution so quickly that there's time left over to try a few different variants for speed. Each attempt is often economical in terms of lines added and effort. If the program produces the correct output after 20 minutes, then one has the rest of the day to play the benchmark game. Often, this beats an attempt in a 'faster' language where fights so hard to get any correct answer that there's never time to try enough different strategies to really find the fastest.

Hope you enjoyed this article and hope you got something out of it! Let us know on Slack or Zulip or Reddit or ClojureVerse.

TechAscent: Now you're sad you asked about a distinct sort, or maybe you're not (!)