Has anyone used tablefunc to pivot on multiple variables, as opposed to only using row name? The documentation notes:
The "extra" columns are expected to be the same for all rows with the same row_name value.
I'm not sure how to do this without combining the columns that I want to pivot on (which I highly doubt will give me the speed I need). One possible way would be to make the entity numeric and add it to the timestamp as milliseconds, but this seems like a shaky way to proceed.
I've edited the data used in a response to this question: PostgreSQL Crosstab Query.
CREATE TEMP TABLE t4 (
timeof timestamp
,entity character
,status integer
,ct integer);
INSERT INTO t4 VALUES
('2012-01-01', 'a', 1, 1)
,('2012-01-01', 'a', 0, 2)
,('2012-01-02', 'b', 1, 3)
,('2012-01-02', 'c', 0, 4);
SELECT * FROM crosstab(
'SELECT timeof, entity, status, ct
FROM t4
ORDER BY 1,2,3'
,$$VALUES (1::text), (0::text)$$)
AS ct ("Section" timestamp, "Attribute" character, "1" int, "0" int);
Returns:
 Section             | Attribute | 1 | 0
---------------------+-----------+---+---
 2012-01-01 00:00:00 | a         | 1 | 2
 2012-01-02 00:00:00 | b         | 3 | 4
So, as the documentation states, the 'extra' column, aka 'Attribute', is assumed to be the same for each row name, aka 'Section'. Thus, it reports 'b' for the second row even though 'entity' also has a 'c' value for that 'timeof' value.
Desired Output:
Section | Attribute | 1 | 0
--------------------------+-----------+---+---
2012-01-01 00:00:00 | a | 1 | 2
2012-01-02 00:00:00 | b | 3 |
2012-01-02 00:00:00 | c | | 4
Any thoughts or references?
A little more background: I potentially need to do this for billions of rows, and I'm testing out storing this data in long and wide formats to see whether I can use tablefunc to go from long to wide format more efficiently than with regular aggregate functions.
I'll have about 100 measurements made every minute for around 300 entities. Often, we will need to compare the different measurements made for a given second for a given entity, so we will need to go to wide format very often. Also, the measurements made on a particular entity are highly variable.
EDIT: I found a resource on this: http://www.postgresonline.com/journal/categories/24-tablefunc.
In my original question I should have used sample data like t4 above, where a single 'timeof' has rows for more than one 'entity'. With that I have to pivot on both 'timeof' and 'entity'. Since tablefunc only uses one column for the pivoting, you need to find a way to stuff both dimensions into that column (http://www.postgresonline.com/journal/categories/24-tablefunc). I went with an array, just like the example in that link; see the sketch below. FWIW, I tried using a character array, and so far it looks like that is faster for my setup (PostgreSQL 9.2.3).
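A minimal sketch of that array trick, run against the t4 data from above (my real query is larger, but the shape is the same):

SELECT * FROM crosstab(
   $$SELECT ARRAY[timeof::text, entity::text]  -- both pivot dimensions stuffed into one row_name
          , status, ct
     FROM   t4
     ORDER  BY 1$$
  ,$$VALUES (1::text), (0::text)$$)
AS ct ("Section" text[], "1" int, "0" int);

Each distinct array value now forms its own row, so 'b' and 'c' at '2012-01-02' no longer collapse into one group.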
With the array as row name, the result matches the desired output above.
I'm curious how this performs on a much larger data-set and will report back at a later date.
Ok, so I ran this on a table closer to my use case. Either I'm doing it wrong or crosstab is not suitable for my use.
First I made some similar data:
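The original data-generation block didn't survive; a hypothetical reconstruction along these lines yields roughly the 14M rows mentioned in the answer below (table and column names assumed):

CREATE TEMP TABLE test (
  timeof timestamp
 ,entity int
 ,status int
 ,ct     int);

INSERT INTO test
SELECT ts, e, s, (random() * 100)::int
FROM   generate_series(timestamp '2012-01-01'
                     , timestamp '2012-01-01 23:59'
                     , interval '1 minute') ts
     , generate_series(1, 300) e
     , generate_series(0, 31) s;  -- 1440 * 300 * 32, roughly 13.8M rows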
Then I ran the crosstab code a couple times:
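That block is also missing; presumably it was the array version applied to the big table, something like:

SELECT * FROM crosstab(
   $$SELECT ARRAY[timeof::text, entity::text], status, ct
     FROM   test
     ORDER  BY 1$$
  ,$$VALUES (1::text), (0::text)$$)  -- only two categories shown; the real test pivoted more
AS ct ("Section" text[], "1" int, "0" int)
LIMIT 1000;  -- applied only after crosstab has already processed every row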
Obtaining this on the third try:
Then I ran the standard solution a couple times:
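The 'standard solution' block is missing too; a conditional-aggregate pivot of the assumed shape:

SELECT timeof, entity
     , max(CASE WHEN status = 1 THEN ct END) AS "1"
     , max(CASE WHEN status = 0 THEN ct END) AS "0"
FROM   test
GROUP  BY 1, 2
ORDER  BY 1, 2
LIMIT  1000;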
Obtaining this on the third try:
So, for my case, it appears so far that crosstab is not a solution. And this is just one day of data, when I'll have multiple years. In fact, I will probably have to go with wide-format (not normalized) tables, despite the fact that the set of measurements made for an entity varies and new ones get introduced, but I won't go into that here.
Here's some of my settings using Postgres 9.2.3:
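(The settings listing itself is lost; a query like this reports the ones that usually matter for this kind of work:)

SELECT name, setting, unit
FROM   pg_settings
WHERE  name IN ('shared_buffers', 'work_mem', 'effective_cache_size');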
The problem with your query is that b and c share the same timestamp 2012-01-02 00:00:00, and you have the timestamp column timeof first in your query - so, even though you added bold emphasis, b and c are just extra columns that fall into the same group 2012-01-02 00:00:00. Only the first (b) is returned, since (quoting the manual):

The "extra" columns are expected to be the same for all rows with the same row_name value.

Bold emphasis mine.
Just reverse the order of the first two columns to make entity the row name and it works as desired. entity must be unique, of course.
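The answer's query block didn't survive; reordering the columns of the query above gives this shape:

SELECT * FROM crosstab(
   'SELECT entity, timeof, status, ct
    FROM   t4
    ORDER  BY 1, 2'
  ,$$VALUES (1::text), (0::text)$$)
AS ct ("Attribute" character, "Section" timestamp, "1" int, "0" int);

With entity as row name, b and c each get their own output row.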
Reiterate the required order of columns in the source query: row_name first, extra columns next, category (as defined by the second parameter) and value last.

Extra columns are filled from the first row of each row_name partition. Values from other rows are ignored; there is only one column per row_name to fill. Typically those would be the same for every row of one row_name, but that's up to you.

For the different setup in your answer:
No wonder the queries in your test perform terribly. Your test setup has 14M rows, and you process all of them before throwing most of them away with LIMIT 1000. For a reduced result set, add WHERE conditions or a LIMIT to the source query! On top of that, the array you work with is needlessly expensive. I generate a surrogate row name with dense_rank() instead; see the sketch below.
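Not the answer's exact code (that lives behind the fiddle link); a sketch of the dense_rank() idea against the hypothetical test table from above:

SELECT * FROM crosstab(
   $$SELECT dense_rank() OVER (ORDER BY timeof, entity)::int AS row_name
          , timeof, entity, status, ct
     FROM   test
     WHERE  timeof < timestamp '2012-01-01 00:30'  -- reduce the set in the source query
     ORDER  BY 1$$
  ,$$VALUES (1::text), (0::text)$$)
AS ct (row_name int, timeof timestamp, entity int, "1" int, "0" int);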
db<>fiddle here - with a simpler test setup and fewer rows.