Query combinations with nested array of records in

2020-02-01 02:58发布

问题:

I'm working on a Rails application that utilizes the Postgres JSON data type. I have a JSON column called data in a table called reports. Let's say I have multiple entries like this:

Entry 1: {"objects":[{"album": 1, "src":"fooA.png", "pos": "top"},   {"album": 2, "src":"barB.png", "pos": "top"}],   "background":"background.png"}
Entry 2: {"objects":[{"album": 1, "src":"fooA.png", "pos": "top"},   {"album": 2, "src":"barC.png", "pos": "top"}],   "background":"bacakground.png"}
Entry 3: {"objects":[{"album": 1, "src":"fooA.png", "pos": "middle"},{"album": 2, "src":"barB.png", "pos": "middle"}],"background":"background.png"}
Entry 4: {"objects":[{"album": 1, "src":"fooA.png", "pos": "top"},   {"album": 3, "src":"barB.png", "pos": "top"}],   "background":"backgroundA.png"}

What I would like to do is return the different combinations of entries that have the same album, src, and background (NOTE: within the objects node, order of array elements does not matter). For instance, the query should match entries 1,3 as one group, entry 2, as another, and etc. The goal is to find the top 3 most common combinations. I know how to do this using Ruby, but I would have to query a large sample of entries, then iterate over all of them. It seems more efficient to use Postgres if it can handle this task. I'm not enough of a SQL expert to know if this is possible.

This is the result I am looking for. Within objects, entries 1 and 3 both contain {"album": 1, "src":"fooA.png"}, {"album": 2, "src":"barB.png"}, as well as both have matching backgrounds. I would like to group them as one combination with a count of 2.

Since entry 2 does not match any entries under this criteria, then it is another combination with a count of 1. Entry 4 is also considered another combination with a count of 1. So the result I'm after would be:

ids  |  count
--------------
1,3  | 2
2    | 1
4    | 1

or

combinations                                                                                                                               | count
---------------------------------------------------------------------------------------------------------------------------------------------------
{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"},  {"album": 2, "src":"barB.png", "pos": "top"}],  "background":"background.png"}  | 2
{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"},  {"album": 2, "src":"barC.png", "pos": "top"}],  "background":"bacakground.png"} | 1
{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"},  {"album": 3, "src":"barB.png", "pos": "top"}],  "background":"backgroundA.png"} | 1

Whichever is easier to achieve.

In my actual data, I have values other than just album and src in the array of JSON within the objects node. You'll notice that I've included pos to show this case. I only care about using the album, src, and background values to match the combos. I was hoping to ignore any other values.

Note

When I was testing Erwin's solution, I kept getting this error and I know why:

ERROR:  cannot call json_populate_recordset on a nested object

My json values are are actually a little more complex. For example:

{"objects":[{"album": 1, "src":"fooA.png", "pos": "top", filters: []},  {"album": 2, "src":"barB.png", "pos": "top", filters: []}

Obviously, filters is a nested object and is not supported by json_populate_recordset. However, I think I can work around this if there is no simple alternative. Again, I assume this is possible?

UPDATE

Due to a typo in my sample data above (which was my fault), this solution is a bit incomplete. When the typo is fixed, it solution does not work. Find the answer to that situation here. But Erwin's solution is still an answer to cases similar to what was described above.

回答1:

Given this table (which you should have provided in a form like this):

CREATE TABLE reports (rep_id int primary key, data json);
INSERT INTO reports (rep_id, data)
VALUES 
  (1, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"},   {"album": 2, "src":"barB.png", "pos": "top"}],   "background":"background.png"}')
, (2, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"},   {"album": 2, "src":"barC.png", "pos": "top"}],   "background":"bacakground.png"}')
, (3, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "middle"},{"album": 2, "src":"barB.png", "pos": "middle"}],"background":"background.png"}')
, (4, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"},   {"album": 3, "src":"barB.png", "pos": "top"}],   "background":"backgroundA.png"}')
;

JSON records of well known, translatable type

Use json_populate_recordset() for unnesting the recordset "objects". The function requires a registered row type to define the names and data types of resulting columns. For the purpose of this demo or generally for ad-hoc queries, a temp table modeled after "objects" provides the same:

CREATE TEMP TABLE obj(album int, src text, pos text);

To find the top 3 most common combinations ... of entries that have the same album, src, and background:

SELECT array_agg(r.rep_id) AS ids, count(*) AS ct
FROM   reports r
     , json_populate_recordset(null::obj, r.data->'objects') o
GROUP  BY r.data->>'background'
        , o.album
        , o.scr
ORDER  BY count(*) DESC
LIMIT  3;

Each object counts, no matter whether from the same row or not. You did not define how to handle that exactly. Consequently, rep_id can pop up multiple times in the array ids. Add DISTINCT to array_agg() to fold possible duplicates. The count ct can be greater then the length of the array ids in this case.

Requires Postgres 9.3 for the JSON functions and operators and the implicit JOIN LATERAL.

JSON records of unknown or untranslatable type

json_array_elements() just unnests the json array without transforming the result into a SQL row. Access individual fields with JSON operators accordingly.

SELECT array_agg(r.rep_id) AS ids, count(*) AS ct
FROM   reports r
     , json_array_elements(r.data->'objects') o
GROUP  BY r.data->>'background'
        , o->>'album'
        , o->>'scr'
ORDER  BY count(*) DESC
LIMIT  3;