My real problem has to do with recording which of a very large number of anti-virus products agree that a given sample is a member of a given anti-virus family. The database has millions of samples, with tens of anti-virus products voting on each sample. I want to ask a query like "For the malware containing the name 'XYZ' which sample had the most votes, and which vendors voted for it?" and get results like:
"BadBadVirus"
V1 V2 V3 V4 V5 V6 V7
Sample 1 - 4 votes 1 0 1 0 0 1 1
Sample 2 - 5 votes 1 0 1 0 1 1 1
Sample 3 - 5 votes 1 0 1 0 1 1 1
total 14 3 3 2 3 3
Which might be used to tell me that Vendor 2 and Vendor 4 either don't know how to detect this malware, or that they name it something different.
I'm going to try to generalize my question slightly while hopefully not breaking your ability to help me. Assume that I have five voters (Alex, Bob, Carol, Dave, Ed) who have been asked to look at five photographs (P1, P2, P3, P4, P5) and decide what the "main subject" of the photograph is. For our example, we'll just assume they were limited to "Cat", "Dog", or "Horse". Not every voter votes on every thing.
The data is in the database in this form:
Photo, Voter, Decision
(1, 'Alex', 'Cat')
(1, 'Bob', 'Dog')
(1, 'Carol', 'Cat')
(1, 'Dave', 'Cat')
(1, 'Ed', 'Cat')
(2, 'Alex', 'Cat')
(2, 'Bob', 'Dog')
(2, 'Carol', 'Cat')
(2, 'Dave', 'Cat')
(2, 'Ed', 'Dog')
(3, 'Alex', 'Horse')
(3, 'Bob', 'Horse')
(3, 'Carol', 'Dog')
(3, 'Dave', 'Horse')
(3, 'Ed', 'Horse')
(4, 'Alex', 'Horse')
(4, 'Bob', 'Horse')
(4, 'Carol', 'Cat')
(4, 'Dave', 'Horse')
(4, 'Ed', 'Horse')
(5, 'Alex', 'Dog')
(5, 'Bob', 'Cat')
(5, 'Carol', 'Cat')
(5, 'Dave', 'Cat')
(5, 'Ed', 'Cat')
The objective is that given a photo topic we are looking for, we'd like to know how many voters thought that WAS the main point of that photo, but also list WHICH VOTERS thought that.
Query for: "Cat"
Total Alex Bob Carol Dave Ed
1 - 4 1 0 1 1 1
2 - 3 1 0 1 1 0
3 - 0 0 0 0 0 0
4 - 1 0 0 1 0 0
5 - 4 0 1 1 1 1
------------------------------------
total 12 2 1 4 3 2
Query for: "Dog"
Total Alex Bob Carol Dave Ed
1 - 1 0 1 0 0 0
2 - 2 0 1 0 0 1
3 - 1 0 0 1 0 0
4 - 0 0 0 0 0 0
5 - 1 1 0 0 0 0
------------------------------------
total 5 1 2 1 0 1
Is that something I can do with the data in the format that I have it stored?
I'm having difficulty getting a query that does that - although it's simple enough to dump the data out and then write a program to do that, I'd really like to be able to do it IN THE DATABASE if I can.
Thanks for any suggestions.
Using the same sample data as Clodoaldo ("create table vote...") and using the plpythonu function make_pivot_table (below), you can run:
The make_pivot_table function definition is:
The query for the cat:
If the number of voters is big or not known then it can be done dynamically:
The above code created the table pivot with all the decisions:
Your wish implies transfering some of the data (the names) into column headings, i.e. the schema of the resulting table. As this is somewhere between inconvenient and impossible, I would recommending just sorting and summing the data in sql, and doing the rest outside of the database.
and