Consider the following table, which records price changes for products belonging to different industries and categories.
Column | Type | Modifiers
-----------------+--------+-----------
category_id | bigint | not null
product_id | bigint | not null
industry_id | bigint | not null
time | bigint | not null
price | bigint | not null
product_info | json | not null
Indexes:
"price_change_pk" PRIMARY KEY, btree (category_id, product_id, price, "time")
Foreign-key constraints:
"orders_industry_id" FOREIGN KEY (industry_id) REFERENCES industry_info(industry_id)
"orders_product_id" FOREIGN KEY (product_id) REFERENCES device_info(product_id)
"orders_category_id" FOREIGN KEY (categoy_id) REFERENCES category_info(category_id)
To be clear, the column values will be:
category_id
- a separate table maps the id (a unique bigint value) to the category name - hundreds of categories
(Electronics, Fashion, Health, Sports, Toys, Books)
industry_id
- a separate table maps the id (a unique bigint value) to the industry name - several thousands of industries per category
(Nokia, Apple, Microsoft, PeterEngland, Rubik, Nivia, Cosco)
product_id
- a separate table maps the id (a unique bigint value) to the product name - millions of products per industry
time
(unix time in milliseconds, as bigint) - the time at which the price was modified
price
- several thousands of distinct values - (200, 10000, 14999, 30599, 450)
product_info
- a JSON that holds extra details of the product (the number of key/value pairs may vary)
{seller:"ABC Assured", discount:10, model:XYZ, EMIoption:true, EMIvalue:12, festival_offer:28, market_stat:comingsoon}
The table is queried in several ways to chart the trend of product price changes over a day/week/month, bucketed into hour/day/week/month ranges. The trend may be based on the number of price changes, or on the number of unique products modified.
For example: Google Sample Trend.
Storing the JSON as-is (as a string) uses more storage. So I tried mapping each key and each value in the JSON to an incrementing serial id in a separate table, and storing those ids instead.
Like
Keys (citext, bigint)
seller - 1
discount - 2
model - 3
EMIoption - 4
EMIvalue - 5
festival_offer - 6
...
...
currency - 25
Values (citext, bigint)
ABC Assured - 1
10 - 2
XYZ - 3
true - 4
12 - 5
28 - 6
comingsoon - 7
...
...
ZYX - 106
rupees - 107
american dollars - 108
canadian dollars - 109
Prime seller - 110
{seller:"ABC Assured", discount:10, model:XYZ, EMIoption:true, EMIvalue:12, festival_offer:28, market_stat:comingsoon, curreny: rupees}
becomes
{"1":1, "2":2", "3":3, "4":4, "5":5, "6":6, "7":7, "25":107}
{seller:"Prime seller", discount:10, model:XYZ, EMIoption:true, EMIvalue:12, festival_offer:28, market_stat:comingsoon, curreny: "canadian dollars"}
becomes
{"1":110, "2":2", "3":3, "4":4, "5":5, "6":6, "7":7, "25":109}
For a data set of about 20M rows, this reduced storage by about 1.5 GB.
As the key/value cardinality increases, the serial ids grow longer. So I tried storing the decimal ids as hexadecimal.
{"1":1, "2":2", "3":3, "4":4, "5":5, "6":6, "7":7, "25":107}
becomes
{"1":1, "2":2", "3":3, "4":4, "5":5, "6":6, "7":7, "19":"6B"}
{"1":110, "2":2", "3":106, "4":4, "5":5, "6":6, "7":7, "25":109}
becomes
{"1":, "2":2", "3":"6A", "4":4, "5":5, "6":6, "7":7, "19":"6D"}
So, does storing these decimal integers as hexadecimal:
- Save further storage space? (visually it looks compressed)
- Does JSON retain the data types of keys/values, or are they stored as strings?
- Compress the data?
- Improve read performance?
- Or can it be improved in any other way? (indexing, or anything else?)
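As to the data-type question: JSON does distinguish numbers, strings, and booleans, which can be checked in psql (a quick illustration of my own, not from the schema above):

SELECT json_typeof('{"discount": 10}'::json -> 'discount');    -- number
SELECT json_typeof('{"discount": "10"}'::json -> 'discount');  -- string
-- the ->> operator, however, always returns text, which is why the
-- discount filter in the query samples below needs a cast:
-- (product_info ->> 'discount')::int = 10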
In a plain PostgreSQL setup, queries take several minutes to complete. Since this is time-series data, we use the TimescaleDB extension; its chunking (partitioning) mechanism speeds up query execution, but we need results in sub-seconds.
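(For context, the hypertable would be created roughly like this; the one-day chunk interval is my assumption, given that time is a millisecond epoch:)

-- TimescaleDB: chunk price_change on the bigint time column, one-day chunks
SELECT create_hypertable('price_change', 'time',
                         chunk_time_interval => 86400000);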
Query samples:
To count how many times the price was changed to 500, for all products in a given category, within a month, grouped by day:
select count(*), to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as unit, price
from price_change
where category_id = 1000000010 and time between 1514745000000 and 1517423400000
and price = 500
group by price, unit;
To count how many times the price was changed to any of (100,200,300,400,500,600,700,800,900,1000), for all products in a given category, over the last 10 months, grouped by month:
select count(*), to_char(date_trunc('month', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as unit, price
from price_change
where category_id = 1000000010 and time between 1514745000000 and 1517423400000
and price in (100,200,300,400,500,600,700,800,900,1000) group by price, unit;
To select details of products whose price changed in a given time range, in a given category:
select product_id, product_name, price, to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as timestamp
from price_change
join products using (product_id)
where price_change.category_id = 1000000010
and price_change.time between 1514745000000 and 1517423400000;
To select industry and product id details for prices changed in a given time range, in a given category:
select industry_id, product_id, price
from price_change
join industries using (industry_id)
where price_change.category_id = 1000000010
and price_change.time between 1514745000000 and 1517423400000;
To select product price change details in a time range, with a discount of 10%, in a specific category:
select product_id, product_name, price, to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as timestamp
from price_change
join products using (product_id)
where price_change.category_id = 1000000010
and price_change.time between 1514745000000 and 1517423400000
and (product_info->>'discount')::int = 10;
To select product price change details in a time range, sold by a specific seller, in a specific category:
select product_id, product_name, price, to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as timestamp
from price_change
join products using (product_id)
where price_change.category_id = 1000000010
and price_change.time between 1514745000000 and 1517423400000
and product_info->>'seller'='ABC Assured';
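On the indexing question above: one option worth testing (my suggestion, not something already in place) is an expression index on the JSON fields used in WHERE clauses:

-- expression index serving the seller filter (works on json, since ->> returns text)
CREATE INDEX price_change_seller_idx
    ON price_change ((product_info ->> 'seller'));
-- if the column were jsonb, a GIN index over the whole document is another option:
-- CREATE INDEX price_change_info_idx ON price_change USING gin (product_info jsonb_path_ops);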
In most cases, the query will not include category_id among the selected columns.