jq: groupby and nested json arrays

Posted 2019-09-19 12:12

Question:

Let's say I have: [[1,2], [3,9], [4,2], [], []]

I would like to know the scripts to get:

  • The number of nested lists that are non-empty and the number that are empty, i.e. I want to get: [3,2]

  • The number of nested lists that contain the number 3 and the number that do not, i.e. I want to get: [1,4]

  • The number of nested lists whose elements sum to less than 4 and the number whose elements do not, i.e. I want to get: [3,2]

That is, basic examples of partitioning nested data.

Answer 1:

Since stackoverflow.com is not a coding service, I'll confine this response to the first question, with the hope that it will convince you that learning jq is worth the effort.

Let's begin by refining the question about the counts of the lists "which are/are not empty" to emphasize that the first number in the answer should correspond to the number of empty lists (2), and the second number to the rest (3). That is, the required answer should be [2,3].

Solution using built-in filters

The next step might be to ask whether group_by can be used. If the ordering did not matter, we could simply write:

group_by(length==0) | map(length)

This returns [3,2], which is not quite what we want. It's now worth checking the documentation about what group_by is supposed to do. On checking the details at https://stedolan.github.io/jq/manual/#Builtinoperatorsandfunctions, we see that by design group_by does indeed sort by the grouping value.

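Since false < true in jq, we can fix our first attempt by grouping on the opposite condition, so that the empty lists (for which the condition is false) come first and the result is [2,3]: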

group_by(length > 0) | map(length)

That's nice, but since group_by is doing so much work when all we really need is a way to count, it's clear we should be able to come up with a more efficient (and hopefully less opaque) solution.
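As a quick check, both group_by variants can be run against the sample input from the question. This is only a sketch: it assumes jq is installed and on the PATH, and the shell variable names are illustrative.

```shell
# Sample input from the question.
input='[[1,2], [3,9], [4,2], [], []]'

# Grouping key is (length == 0): false (non-empty) sorts before true (empty).
nonempty_first=$(printf '%s' "$input" | jq -c 'group_by(length == 0) | map(length)')

# Grouping key is (length > 0): false (empty) sorts before true (non-empty).
empty_first=$(printf '%s' "$input" | jq -c 'group_by(length > 0) | map(length)')

echo "$nonempty_first"   # [3,2]
echo "$empty_first"      # [2,3]
```

Note that group_by only emits groups that actually occur, so an input with no empty lists would yield a one-element result rather than a count of 0.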

An efficient solution

At its core the problem boils down to counting, so let's define a generic tabulate filter for producing the counts of distinct string values. Here's a def that will suffice for present purposes:

# Produce a JSON object recording the counts of distinct
# values in the given stream, which is assumed to consist 
# solely of strings.
def tabulate(stream):
  reduce stream as $s ({}; .[$s] += 1);

An efficient solution can now be written down in just two lines:

tabulate(.[] | length==0 | tostring )
| [.["true", "false"]]

QED
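The same counting pattern appears to carry over to the other two partitions in the question. The sketch below assumes jq 1.5 or later (for the two-argument any) is available in the shell; partition_counts is a hypothetical helper that is not part of the answer, and it reports the count for which the condition is true first, substituting 0 when a class is absent.

```shell
input='[[1,2], [3,9], [4,2], [], []]'

# tabulate is the definition given above; partition_counts is a
# hypothetical wrapper introduced here for illustration only.
defs='
  def tabulate(stream):
    reduce stream as $s ({}; .[$s] += 1);
  def partition_counts(cond):
    tabulate(.[] | cond | tostring)
    | [.["true", "false"] | . // 0];
'

# Empty vs. non-empty lists (the case worked through above).
empty=$(printf '%s' "$input" | jq -c "$defs partition_counts(length == 0)")

# Lists that contain the number 3 vs. those that do not.
has3=$(printf '%s' "$input" | jq -c "$defs partition_counts(any(.[]; . == 3))")

# Lists whose sum is less than 4 vs. the rest; add yields null for an
# empty list, so the sum of [] is treated as 0 here.
small=$(printf '%s' "$input" | jq -c "$defs partition_counts((add // 0) < 4)")

echo "$empty"   # [2,3]
echo "$has3"    # [1,4]
echo "$small"   # [3,2]
```

Treating the sum of an empty list as 0 is a choice made for this sketch; it matches the [3,2] the question asks for.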

p.s.

The function named tabulate above is sometimes called bow (for "bag of words"). In some ways, that would be a better name, especially as it would make sense to reserve the name tabulate for similar functionality that would work for arbitrary streams.
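One possible way to count values in a stream that is not restricted to strings is to serialise each value with tojson before using it as a key. The name tabulate_any is an assumption made for this sketch, and the keys of the result are JSON texts rather than the original values.

```shell
input='[[1,2], [3,9], [4,2], [], []]'

# -S sorts the keys of the output object so the result is deterministic.
length_counts=$(printf '%s' "$input" | jq -c -S '
  # Hypothetical variant (the name is an assumption): each value is
  # serialised with tojson so that any JSON value can serve as a key.
  def tabulate_any(stream):
    reduce stream as $s ({}; .[$s | tojson] += 1);

  # Count the nested lists by their length.
  tabulate_any(.[] | length)
')

echo "$length_counts"   # {"0":2,"2":3}
```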