Clojure and HBase: Iterate Lazily over a Scan

Posted 2019-05-31 13:54

Question:

Let's say I want to print the output of an HBase table scan in Clojure.

(defmulti scan (fn [table & args] (map class args)))

(defmethod scan [java.lang.String java.lang.String] [table start-key end-key]
    (let [scan (Scan. (Bytes/toBytes start-key) (Bytes/toBytes end-key))]
        (let [scanner (.getScanner table scan)]
            (doseq [result scanner]
                (prn
                    (Bytes/toString (.getRow result))
                    (get-to-map result))))))

where get-to-map turns the result into a map. It could be run like this:

(hbase.table/scan table "key000001" "key999999")

But what if I want to let the user do something with the scan results? I could let them pass in a function as a callback to be applied to each result. But my question is this: what do I return if I want the user to be able to lazily iterate over each result

(Bytes/toString (.getRow result))
(get-to-map result)

and not retain the previous results, as might happen in a simplistic implementation with lazy-seq?

Answer 1:

If you accept a callback argument, you can just call it inside the doseq:

(defmulti scan (fn [f table & args] (mapv class args))) ; mapv returns vector

(defmethod scan [String String] [f table start-key end-key]
               ; ^- java.lang classes are imported implicitly
  (let [scan ...
        scanner ...] ; no need for two separate lets
    (doseq [result scanner]
      ; call f here, e.g.
      (f result))))

Here f will be called once per result. Its return value, as well as the result itself, will be discarded immediately. You can of course call f with some preprocessed version of result, e.g. (f (foo result) (bar result)).
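As a rough, self-contained sketch of the callback style, the following uses a sorted map as a stand-in for the HBase table (an assumption made purely so the example runs without a cluster) and subseq to mimic the half-open key range of a Scan. The names table and seen are hypothetical.

```clojure
;; Dispatch on the classes of the arguments that follow f and table.
(defmulti scan (fn [f table & args] (mapv class args)))

(defmethod scan [String String] [f table start-key end-key]
  ;; Stand-in for a real scanner: walk the keys in [start-key, end-key).
  ;; doseq holds on to nothing once f has returned for a given entry.
  (doseq [[k v] (subseq table >= start-key < end-key)]
    (f k v)))

;; Hypothetical in-memory "table", sorted by row key.
(def table
  (sorted-map "key000001" {:cf "a"}
              "key000002" {:cf "b"}
              "key000009" {:cf "c"}))

;; The callback decides what to do with each row; here it records the key.
(def seen (atom []))
(scan (fn [k v] (swap! seen conj k)) table "key000001" "key000009")
```

With the real API, the body of the doseq would iterate over the scanner instead of the subseq call; the dispatch and the callback would stay the same.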

You could also return a sequence or vector of results to the client and let it do its own processing. If the sequence is lazy, you need to make sure that any resources backing it stay open for the duration of the processing, and presumably that they are closed afterwards (see with-open). The processing code would need to execute inside the with-open form and be finished by the time it returns.
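The pitfall can be illustrated without HBase. In this sketch a BufferedReader over a string stands in for the scanner (an assumption for illustration only), and the function names are hypothetical. Letting the lazy sequence escape with-open means the rest of it is realized after the reader is closed, whereas finishing the work inside with-open is safe.

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical stand-in for opening a scanner: a reader over three "rows".
(defn open-rows []
  (BufferedReader. (StringReader. "key000001\nkey000002\nkey000003")))

;; Broken: the lazy seq is returned from with-open, so most of it is
;; realized only after the reader has been closed.
(defn rows-lazy-broken []
  (with-open [rdr (open-rows)]
    (line-seq rdr)))

;; Fine: all processing happens while the reader is still open.
(defn count-rows []
  (with-open [rdr (open-rows)]
    (count (line-seq rdr))))

;; Forcing the escaped seq fails because the backing reader is closed.
(def outcome
  (try
    (doall (rows-lazy-broken))
    (catch Exception _ :stream-closed)))

(def n (count-rows))
```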

For example, to return a vector of preprocessed results to the client you could do

(defmethod scan ...
  (let [...]
    (mapv (fn preprocess-result [result]
            (result->map result))
          scanner)))

The client can then do whatever it wants with them. Use map instead of mapv to return a lazy sequence. If the client then needs to open and close a resource, you could accept it as an argument to scan, so that the client could say:

(with-open [r (some-resource)]
  ; or mapv, dorun+map, doall+for, ...
  (doseq [result (scan r ...)]
    (do-stuff-with result)))
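A minimal runnable sketch of this client-owned-resource pattern follows. fake-scanner is a hypothetical stand-in for an HBase ResultScanner: like the real interface it is both Iterable and Closeable, but it yields plain maps and records when it is closed. scan-seq is likewise an illustrative name, not part of any library.

```clojure
;; Hypothetical stand-in for a ResultScanner: Iterable and Closeable.
(defn fake-scanner [rows closed?]
  (reify
    java.lang.Iterable
    (iterator [_] (.iterator ^java.lang.Iterable rows))
    java.io.Closeable
    (close [_] (reset! closed? true))))

(defn scan-seq
  "Returns a lazy seq of row keys. The caller owns and closes the scanner."
  [scanner]
  (map :row scanner))

(def closed? (atom false))
(def seen (atom []))

;; The client opens the resource, consumes the lazy seq inside with-open,
;; and the scanner is closed when the form exits.
(with-open [s (fake-scanner [{:row "key000001"} {:row "key000002"}] closed?)]
  (doseq [k (scan-seq s)]
    (swap! seen conj k)))
```

Because doseq walks the sequence without holding a reference to its head, rows that have already been processed can be garbage collected, which is the non-retaining behaviour the question asks about.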