Consider the following table that records the changes in prices of different products belonging to different companies of different categories.
Column | Type | Modifiers
-----------------+--------+-----------
category_id | bigint | not null
product_id | bigint | not null
industry_id | bigint | not null
time | bigint | not null
price | bigint | not null
product_info | json | not null
Indexes:
"price_change_pk" PRIMARY KEY, btree (category_id, product_id, price, "time")
Foreign-key constraints:
"orders_industry_id" FOREIGN KEY (industry_id) REFERENCES industry_info(industry_id)
"orders_product_id" FOREIGN KEY (product_id) REFERENCES device_info(product_id)
"orders_category_id" FOREIGN KEY (categoy_id) REFERENCES category_info(category_id)
To be clear, the column values will be:
category_id
- a separate table maps the id (a unique bigint value) to the category name - hundreds of categories
(Electronics, Fashion, Health, Sports, Toys, Books)
industry_id
- a separate table maps the id (a unique bigint value) to the industry name - several thousands of industries per category
(Nokia, Apple, Microsoft, PeterEngland, Rubik, Nivia, Cosco)
product_id
- a separate table maps the id (a unique bigint value) to the product name - millions of products per industry
time
- unix time in milliseconds (as bigint) - the time at which the price was modified
price
- several thousands of distinct values - (200, 10000, 14999, 30599, 450)
product_info
- a json that holds the extra details of the product (the number of key/value pairs may vary)
{"seller":"ABC Assured", "discount":10, "model":"XYZ", "EMIoption":true, "EMIvalue":12, "festival_offer":28, "market_stat":"comingsoon"}
The table is queried in several ways to analyze the trend of product price changes, as a chart, over a day/week/month, in hour/day/week/month buckets. The trend may be based on the number of products, or the number of unique products, being modified.
For example, Google Sample Trend.
Storing the JSON as-is (as a string) uses more storage. So I tried storing each key and each value of the JSON in separate lookup tables with an incrementing serial id, and using those ids in the JSON instead.
Like this:
Keys (citext, bigint)
seller - 1
discount - 2
model - 3
EMIoption - 4
EMIvalue - 5
festival_offer - 6
...
...
currency - 25
Values (citext, bigint)
ABC Assured - 1
10 - 2
XYZ - 3
true - 4
12 - 5
28 - 6
comingsoon - 7
...
...
ZYX - 106
rupees - 107
american dollars - 108
canadian dollars - 109
Prime seller - 110
{seller:"ABC Assured", discount:10, model:XYZ, EMIoption:true, EMIvalue:12, festival_offer:28, market_stat:comingsoon, curreny: rupees}
becomes
{"1":1, "2":2", "3":3, "4":4, "5":5, "6":6, "7":7, "25":107}
{seller:"Prime seller", discount:10, model:XYZ, EMIoption:true, EMIvalue:12, festival_offer:28, market_stat:comingsoon, curreny: "canadian dollars"}
becomes
{"1":110, "2":2", "3":3, "4":4, "5":5, "6":6, "7":7, "25":109}
For a data set of about 20M rows, this reduced storage by about 1.5GB.
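A minimal sketch of those two mapping tables, with table/column names that are my own shorthand:
-- requires: create extension if not exists citext;
create table json_keys (
    key_id   bigserial primary key,      -- incrementing serial id
    key_name citext    not null unique   -- seller, discount, model, ...
);
create table json_values (
    value_id   bigserial primary key,
    value_text citext    not null unique  -- ABC Assured, 10, XYZ, ...
);
The json stored in price_change then holds only {key_id: value_id} pairs, and the original document is rebuilt on read by joining back against these two tables.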
As the key/value cardinality grows, the serial numbers grow too. So I tried storing the decimal ids as hexadecimals.
{"1":1, "2":2", "3":3, "4":4, "5":5, "6":6, "7":7, "25":107}
becomes
{"1":1, "2":2", "3":3, "4":4, "5":5, "6":6, "7":7, "19":"6B"}
{"1":110, "2":2", "3":106, "4":4, "5":5, "6":6, "7":7, "25":109}
becomes
{"1":, "2":2", "3":"6A", "4":4, "5":5, "6":6, "7":7, "19":"6D"}
So, does storing these decimal integers as hexadecimal integers:
- Save storage space further? (because visually it seems compressed)
- Does the JSON retain the data types of the keys/values, or are they stored as strings?
- Make the data more compressible?
- Improve read performance?
- Or can it be improved in any other way? (indexing, or anything else?)
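One way to check the hex conversion and the type question directly (to_hex() and jsonb_typeof() are built-in PostgreSQL functions; the sample document is taken from the examples above):
select to_hex(107), to_hex(110);   -- '6b', '6e'

select jsonb_typeof(doc->'2')  as discount_id_type,   -- 'number'
       jsonb_typeof(doc->'19') as currency_id_type    -- 'string'
from (select '{"2":2, "19":"6B"}'::jsonb as doc) s;
So numeric ids are kept as JSON numbers, while hex ids like "6B" have to be stored as JSON strings.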
In a plain PostgreSQL setup, the queries take several minutes to complete. Since this is time-series data, we use the TimescaleDB extension, and its chunking mechanism boosts query execution, but we need results in sub-seconds.
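For reference, the hypertable setup looks roughly like this (a sketch, assuming the bigint time column is in milliseconds and a one-day chunk interval; the actual configuration may differ):
create extension if not exists timescaledb;
select create_hypertable('price_change', 'time',
                         chunk_time_interval => 86400000);   -- 1 day in ms; integer time columns need an explicit interval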
Query samples: To check how many times the price was changed to 500, for all products in a given category, in a month, grouped by day:
select count(*), to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as unit, price
from price_change
where category_id = 1000000010 and time between 1514745000000 and 1517423400000
and price = 500
group by price, unit;
To check how many times the price was changed to any of (100, 200, 300, 400, 500, 600, 700, 800, 900, 1000), for all products in a given category, in the last 10 months, grouped by month:
select count(*), to_char(date_trunc('month', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as unit, price
from price_change
where category_id = 1000000010 and time between 1514745000000 and 1517423400000
and price in (100,200,300,400,500,600,700,800,900,1000) group by price, unit;
To select the details of products whose price has changed in the given time range, in a given category:
select product_id, product_name, price, to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as timestamp
from price_change
join products using (product_id)
where price_change.category_id = 1000000010
and price_change.time between 1514745000000 and 1517423400000;
To select the industry and product ids whose price has changed in the given time range, in a given category:
select industry_id, product_id, price
from price_change
join industries using (industry_id)
where price_change.category_id = 1000000010
and price_change.time between 1514745000000 and 1517423400000;
To select product price change details in a time range, with a discount of 10%, in a specific category:
select product_id, product_name, price, to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as timestamp
from price_change
join products using (product_id)
where price_change.category_id = 1000000010
and price_change.time between 1514745000000 and 1517423400000
and (product_info->>'discount')::int = 10;
To select product price change details in a time range, sold by a specific seller, in a specific category:
select product_id, product_name, price, to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as timestamp
from price_change
join products using (product_id)
where price_change.category_id = 1000000010
and price_change.time between 1514745000000 and 1517423400000
and product_info->>'seller'='ABC Assured';
In most cases, the query will not contain category_id in the select columns.
It would help if you also provided some examples of what you typically query on. There are different ways to optimize indexes and how data is written to disk, and they depend very much on what type of query you are running (more specifically, on what is in your where clause). If your where clauses look into the JSON, you should consider either breaking those fields out into columns, or building indexes on the JSON itself, as sketched below.
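For example, something along these lines (a sketch with placeholder index names, assuming product_info is stored as, or converted to, jsonb, which indexes better than json):
-- expression index for equality lookups on a single key, e.g. the seller filter above
create index price_change_seller_idx on price_change ((product_info->>'seller'));

-- or a GIN index that supports containment (@>) queries across the whole document
create index price_change_info_gin_idx on price_change using gin (product_info jsonb_path_ops);
With the first index, a clause like product_info->>'seller' = 'ABC Assured' can be answered from the index instead of re-parsing the JSON of every row in the time range.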
It sounds like one of your concerns is storage. Because TimescaleDB and PostgreSQL are relational, they do take up more storage than, perhaps, a columnar store that might feature better compression characteristics. You could consider using something like ZFS to compress things as well.