JQ: count number of objects per group, for a subse

Posted 2019-07-01 21:24

Question:

I need to count the number of objects in each group with jq, but only for the N most recent objects.

Sample input, for N=3:

{"modified":"Mon Sep 25 14:20:00 +0000 2018","object_id":1,"group_id":"C"}
{"modified":"Mon Sep 25 14:23:00 +0000 2018","object_id":2,"group_id":"A"}
{"modified":"Mon Sep 25 14:21:00 +0000 2018","object_id":3,"group_id":"B"}
{"modified":"Mon Sep 25 14:22:00 +0000 2018","object_id":4,"group_id":"A"}

Expected output:

{"A": 2}
{"B": 1}

I can't even select a date-based subset that preserves the structure of the objects. This is the best I have managed:

 [
   .modified |= strptime("%a %b %d %H:%M:%S %z %Y") |
   .modified |= mktime |
   .modified |= strftime("%Y-%m-%d %H:%M:%S")
 ]  |
 sort_by(.modified) |
 .[] |
 {modified, object_id, group_id}

For some reason, results are still unsorted.

I'm also failing to convert such a list to an array to select only N most recent entries.

And after that I will need to count number of objects per group in some way.


Overall, it looks like I need a very intuitive explanation of how arrays and lists of objects convert to each other, how to modify some of their fields, and how to then extract only the fields required. The tutorials I've found so far did not help, unfortunately.

Answer 1:

Assuming your input file is:

cat file
{"modified":"Mon Sep 25 14:20:00 +0000 2018","object_id":1,"class_id":"C"}
{"modified":"Mon Sep 25 14:23:00 +0000 2018","object_id":2,"class_id":"A"}
{"modified":"Mon Sep 25 14:21:00 +0000 2018","object_id":3,"class_id":"B"}
{"modified":"Mon Sep 25 14:22:00 +0000 2018","object_id":4,"class_id":"A"}

You can try the following:

<file jq -s '
   [ .[] | 
     (.modified |= (strptime("%a %b %d %H:%M:%S +0000 %Y") | mktime)) 
   ] | 
   sort_by(.modified) |              # sort using converted time
   .[-3:] |                          # take the last 3
   group_by(.class_id) |             # group ids together
   .[] |                             # emit each group separately
   {(.[0].class_id): length}'        # build an object: class id as key, group size as value
{
  "A": 2
}
{
  "B": 1
}

Note that on my system the %z directive of strptime does not work, so I replaced it with the literal +0000 (the offset is not used in the time conversion anyway).
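
On the question's point about arrays versus lists of objects, here is a minimal sketch using a hypothetical two-object input with an id field (not the question's data). Without -s the filter runs once per input object, so wrapping in [...] yields one-element arrays and sort_by has nothing to reorder, which would explain the unsorted results in the question. With -s the whole stream is collected into one array, and .[] turns an array back into a stream.

```shell
# Hypothetical two-object input stream (one JSON object per line).
input='{"id":2}
{"id":1}'

# Without -s: the filter runs once per object, so each array holds one item.
per_object=$(printf '%s\n' "$input" | jq -c '[.] | sort_by(.id)')
echo "$per_object"

# With -s: the whole stream becomes one array, which sort_by can reorder.
slurped=$(printf '%s\n' "$input" | jq -c -s 'sort_by(.id)')
echo "$slurped"

# .[] turns the sorted array back into a stream of objects.
streamed=$(printf '%s\n' "$input" | jq -c -s 'sort_by(.id) | .[]')
echo "$streamed"
```

The first command prints the two objects in input order, each wrapped in its own array. Only the slurped variants see both objects at once and can sort them.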



Answer 2:

The accepted answer uses the -s command-line option, which requires that the entire input data fit into memory. For very large data sets, this may not be possible.

Since the release of jq 1.5 (in 2015), an alternative is available. Here, therefore, a memory-efficient solution using inputs is presented.

The key functionality is encapsulated in the following jq filter:

# Return an array of n items as if by 
# [stream] | sort_by(filter) | .[-n:]
def maxn(stream; filter; n):
  def maxn:
    sort_by(filter) | .[-n :];
  reduce stream as $x ([]; . + [$x] | maxn);

A solution to the problem at hand (with N==3) can now be obtained in just three additional lines:

maxn(inputs; .modified | strptime("%a %b %d %H:%M:%S +0000 %Y") | mktime; 3)
| group_by(.class_id)[]
| {(.[0].class_id): length}

Note that this assumes the -n command-line option is used. If it is omitted, jq reads the first input object itself before inputs runs, so that object will be skipped.
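
For reference, the pieces above could be put together on the command line roughly as follows. This is a sketch: it assumes the four sample objects from the first answer are fed on standard input, and the -c option is added only to get compact output.

```shell
# The four sample objects from the first answer, one per line.
input='{"modified":"Mon Sep 25 14:20:00 +0000 2018","object_id":1,"class_id":"C"}
{"modified":"Mon Sep 25 14:23:00 +0000 2018","object_id":2,"class_id":"A"}
{"modified":"Mon Sep 25 14:21:00 +0000 2018","object_id":3,"class_id":"B"}
{"modified":"Mon Sep 25 14:22:00 +0000 2018","object_id":4,"class_id":"A"}'

# -n stops jq from consuming the first object itself, so inputs sees all four.
result=$(printf '%s\n' "$input" | jq -n -c '
  def maxn(stream; filter; n):
    def maxn: sort_by(filter) | .[-n :];
    reduce stream as $x ([]; . + [$x] | maxn);

  maxn(inputs; .modified | strptime("%a %b %d %H:%M:%S +0000 %Y") | mktime; 3)
  | group_by(.class_id)[]
  | {(.[0].class_id): length}
')
echo "$result"
```

At no point does the reduction hold more than N+1 objects, which is where the memory saving over -s comes from.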

Large N

For large datasets, if the value of N is also large, it would probably be worth the trouble to tweak the above to use jq's support for binary search (bsearch) instead of sort_by. It might similarly be worthwhile to cache the mktime values.
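
One way the mktime caching might look is sketched below. The name maxn_cached is hypothetical and not part of jq. It evaluates the key filter once per item and carries [key, item] pairs through the reduction. It still calls sort_by at every step, so the bsearch idea is not covered here.

```shell
# The four sample objects from the first answer, one per line.
input='{"modified":"Mon Sep 25 14:20:00 +0000 2018","object_id":1,"class_id":"C"}
{"modified":"Mon Sep 25 14:23:00 +0000 2018","object_id":2,"class_id":"A"}
{"modified":"Mon Sep 25 14:21:00 +0000 2018","object_id":3,"class_id":"B"}
{"modified":"Mon Sep 25 14:22:00 +0000 2018","object_id":4,"class_id":"A"}'

result_cached=$(printf '%s\n' "$input" | jq -n -c '
  # Hypothetical variant of maxn: compute the key f once per item,
  # keep [key, item] pairs while reducing, then drop the keys at the end.
  def maxn_cached(stream; f; n):
    reduce (stream | [f, .]) as $pair ([]; . + [$pair] | sort_by(.[0]) | .[-n :])
    | map(.[1]);

  maxn_cached(inputs; .modified | strptime("%a %b %d %H:%M:%S +0000 %Y") | mktime; 3)
  | group_by(.class_id)[]
  | {(.[0].class_id): length}
')
echo "$result_cached"
```

Sorting on the stored numeric key means strptime and mktime run once per input object instead of once per object on every reduction step.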