Fast way to count how often pairs appear in Clojur

2020-06-25 04:13发布

I need to count the number of times that particular pairs of integers occur in some process (a heuristic detecting similar images using locality sensitive hashing - integers represent images and the "neighbours" below are images that have the same hash value, so the count indicates how many different hashes connect the given pair of images).

The counts are stored as a map from (ordered) pair to count (matches below).

The input (nbrs-list below) is a list of a list of integers that are considered neighbours, where every distinct (ordered) pair in the inner list ("the neighbours") should be counted. So, for example, if nbrs-list is [[1,2,3],[10,8,9]] then the pairs are [1,2],[1,3],[2,3],[8,9],[8,10],[9,10].

The routine collect is called multiple times; the matches parameter accumulates results and the nbrs-list is new on each call. The smallest number of neighbours (size of the inner list) is 1 and the largest ~1000. Each integer in nbrs-list occurs just once in any call to collect (this implies that, on each call, no pair occurs more than once) and the integers cover the range 0 - 1,000,000 (so the size of nbrs-list is less than 1,000,000 - since each value occurs just once and sometimes they occur in groups - but typically larger than 100,000 - as most images have no neighbours).

I have pulled the routines out of a larger chunk of code, so they may contain small edit errors, sorry.

(defn- flatten-1
  [list]
  (apply concat list))

(defn- pairs
  ([nbrs]
    (let [nbrs (seq nbrs)]
      (if nbrs (pairs (rest nbrs) (first nbrs) (rest nbrs)))))
  ([nbrs a bs]
    (lazy-seq
      (let [bs (seq bs)]
        (if bs
          (let [b (first bs)]
            (cons (if (> a b) [[b a] 1] [[a b] 1]) (pairs nbrs a (rest bs))))
          (pairs nbrs))))))

(defn- pairs-map
  [nbrs]
  (println (count nbrs))
  (let [pairs-list (flatten-1 (pairs nbrs))]
    (apply hash-map pairs-list)))

(defn- collect
  [matches nbrs-list]
  (let [to-add (map pairs-map nbrs-list)]
    (merge-with + matches (apply (partial merge-with +) to-add))))

So the above code expands each set of neighbours to ordered pairs; creates a map from pairs to 1; then combines maps using addition of values.

I'd like this to run faster. I don't see how to avoid the O(n^2) expansion of pairs, but I imagine I can at least reduce the constant overhead by avoiding so many intermediate structures. At the same time, I'd like the code to be fairly compact and readable...

Oh, and now I am exceeding the "GC overhead limit". So reducing memory use/churn is also a priority :o)

[Maybe this is too specific? I am hoping the lessons are general and haven't seem many posts about optimising "real" clojure code. Also, I can profile etc, but my code seems so ugly I am hoping there's an obvious, cleaner approach - particularly for the pairs expansion.]

4条回答
劳资没心,怎么记你
2楼-- · 2020-06-25 04:54

I guess you want the frequency with which each pair occurs ?

Try function frequencies. It uses transients under the hood, which should avoid GC overheads.

查看更多
家丑人穷心不美
3楼-- · 2020-06-25 04:57

the above suggestions seemed to help, but were insufficient. i finally got decent performance with:

  • packing pairs into a long value. this works because MAX_LONG > 1e12 and long instances are comparable (so work well as hash keys, unlike long[2]). this had a significant effect on lowering memory use compared to [n1 n2].

  • using a TLongByteHashMap primitive type hash map and mutating it.

  • handling the pair code with nested doseq loops (or nested for loops when using immutable data structures).

  • improving my locality sensitive hash. a big part of the problem was that it was too weak, so finding too many neighbours - if the neighbours of a million images are ill-constrained then you get a million million pairs, which consumes a little too much memory...

the inner loop now looks like:

(doseq [i (range 1 (count nbrs))]
  (doseq [j (range i)]
    (let [pair (pack size (nth nbrs i) (nth nbrs j))]
      (if-let [value (.get matches pair)]
        (.put matches pair (byte (inc value)))
        (.put matches pair (byte 0))))))

where

(defn- pack
  [size n1 n2]
  (+ (* size n1) n2))

(defn- unpack
  [size pair]
  [(int (/ pair size)) (mod pair size)])
查看更多
Fickle 薄情
4楼-- · 2020-06-25 05:00

(I hope I haven't misunderstood your question)

If you just want to count the pairs in lists as this, then [[1,2,3],[8,9,10]] is

(defn count-nbr-pairs [n]
   (/ (* n (dec n))
      2))

(defn count-pairs [nbrs-list]
   (apply + (map #(count-nbr-pairs (count %)) nbrs-list)))

(count-pairs [[1 2 3 4] [8 9 10]])
; => 9

This of course assumes, that you don't need to remove duplicate pairs.

查看更多
疯言疯语
5楼-- · 2020-06-25 05:01
=>(require '[clojure.math.combinatorics :as c])

=>(map #(c/combinations % 2) [[1 2 3] [8 9 10]])

(((1 2) (1 3) (2 3)) ((8 9) (8 10) (9 10)))

It's a pretty small library, take a look at the source

Performance wise you're looking at around the following number for your usecase of 1K unique values under 1M

=> (time (doseq
           [i (c/combinations (take 1000 (shuffle (range 1000000))) 2)]))

"Elapsed time: 270.99675 msecs"

That's including generating the target set, which takes about 100 ms on my machine.

查看更多
登录 后发表回答