Optimizing the storage space, query speed, and JSON column data of a PostgreSQL table

Posted 2019-06-14 02:58

Consider the following table, which records the changes in prices of different products belonging to different companies, in different categories.

     Column    |  Type  | Modifiers
-----------------+--------+-----------
 category_id   | bigint | not null
 product_id    | bigint | not null
 industry_id   | bigint | not null
 time          | bigint | not null
 price         | bigint | not null
 product_info  | json   | not null

Indexes:
    "price_change_pk" PRIMARY KEY, btree (category_id, product_id, price, "time")

Foreign-key constraints:
    "orders_industry_id" FOREIGN KEY (industry_id) REFERENCES industry_info(industry_id)
    "orders_product_id" FOREIGN KEY (product_id) REFERENCES device_info(product_id)
    "orders_category_id" FOREIGN KEY (categoy_id) REFERENCES category_info(category_id)

To be clear, the column values will be:

category_id - a separate table maps the id (a unique bigint value) to the category name - hundreds of categories

(Electronics, Fashion, Health, Sports, Toys, Books)

industry_id - a separate table maps the id (a unique bigint value) to the industry name - several thousands of industries in a category

(Nokia, Apple, Microsoft, PeterEngland, Rubik, Nivia, Cosco)

product_id - a separate table maps the id (a unique bigint value) to the product name - millions of products in an industry

time (unix epoch in milliseconds, as bigint) - the time at which the price was modified

price - several thousand distinct values - (200, 10000, 14999, 30599, 450)

product_info - a json that holds the extra details of the product (the number of key/value pairs may vary)

{seller:"ABC Assured", discount:10, model:XYZ, EMIoption:true, EMIvalue:12, festival_offer:28, market_stat:comingsoon}

The table is queried in several ways to analyze the trend of a product's price changes, charted over a day/week/month using hour/day/week/month buckets. The trend may be based on the number of price changes or on the number of unique products modified.

For example, something like a Google Trends sample chart.

Storing the JSON as-is (as text) uses more storage. So I tried mapping each distinct key and value in the JSON to an incrementing serial id in separate lookup tables, and storing those ids instead.

Like

Keys (citext, bigint)
seller - 1
discount - 2
model - 3
EMIoption - 4
EMIvalue - 5
festival_offer - 6
...
...
currency - 25

Values (citext, bigint)
ABC Assured - 1
10 - 2
XYZ - 3
true - 4
12 - 5
28 - 6
comingsoon - 7
...
...
ZYX - 106
rupees - 107
american dollars - 108
canadian dollars - 109
Prime seller - 110

{seller:"ABC Assured", discount:10, model:XYZ, EMIoption:true, EMIvalue:12, festival_offer:28, market_stat:comingsoon, curreny: rupees}

becomes

{"1":1, "2":2", "3":3, "4":4, "5":5, "6":6, "7":7, "25":107}


{seller:"Prime seller", discount:10, model:XYZ, EMIoption:true, EMIvalue:12, festival_offer:28, market_stat:comingsoon, curreny: "canadian dollars"}

becomes

{"1":110, "2":2", "3":3, "4":4, "5":5, "6":6, "7":7, "25":109}


For a data set of about 20M rows, this reduced storage by about 1.5 GB.
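A minimal sketch of what those lookup tables might look like (the table and column names here are my own assumption, not the actual schema):

-- Hypothetical dictionary tables behind the key/value encoding described above:
-- every distinct JSON key and value gets a small serial id, and product_info
-- then stores only those ids.
CREATE EXTENSION IF NOT EXISTS citext;

CREATE TABLE json_keys (
    key_id   bigserial PRIMARY KEY,
    key_name citext    NOT NULL UNIQUE
);

CREATE TABLE json_values (
    value_id   bigserial PRIMARY KEY,
    value_text citext    NOT NULL UNIQUE
);

-- Encoding a document then becomes one lookup-or-insert per key and per value:
INSERT INTO json_keys (key_name) VALUES ('seller')
ON CONFLICT (key_name) DO NOTHING;

INSERT INTO json_values (value_text) VALUES ('ABC Assured')
ON CONFLICT (value_text) DO NOTHING;

The trade-off is that every read that needs the original keys/values back has to join or look up against these tables.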

As the key/value cardinality grows, the serial ids grow larger. So I tried storing the decimal ids as hexadecimal.

{"1":1, "2":2", "3":3, "4":4, "5":5, "6":6, "7":7, "25":107}

becomes

{"1":1, "2":2", "3":3, "4":4, "5":5, "6":6, "7":7, "19":"6B"}


{"1":110, "2":2", "3":106, "4":4, "5":5, "6":6, "7":7, "25":109}

becomes

{"1":, "2":2", "3":"6A", "4":4, "5":5, "6":6, "7":7, "19":"6D"}


So, does storing these decimal integers as hexadecimal values:

  1. Save storage space further? (visually it seems more compact)
  2. Does JSON retain the data type of the values, or are they stored as strings? (see the type-check sketch after this list)
  3. Make the data more compressible?
  4. Improve read performance?
  5. Or can it be improved in any other way? (indexing, or anything else?)
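Regarding question 2, a quick way to check how a value ends up being typed inside a json/jsonb document is json_typeof()/jsonb_typeof() (a small sketch, not tied to the table above):

-- Reports the JSON type of a stored value, so you can verify whether numbers
-- stay numbers or get stored as strings.
SELECT json_typeof('{"discount": 10}'::json -> 'discount');        -- number
SELECT json_typeof('{"discount": "10"}'::json -> 'discount');      -- string
SELECT jsonb_typeof('{"EMIoption": true}'::jsonb -> 'EMIoption');  -- boolean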

In a plain PostgreSQL setup, the queries take several minutes to complete. Since this is time-series data, we use the TimescaleDB extension, and its chunking mechanism speeds up query execution, but we still need results in sub-seconds.
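For context, this is roughly the TimescaleDB setup for a bigint millisecond-epoch time column; the one-day chunk interval below is an assumption, not our actual production setting:

-- Hypothetical hypertable setup: with a bigint time column the chunk interval
-- is given in the same unit (milliseconds here).
SELECT create_hypertable('price_change', 'time',
                         chunk_time_interval => 86400000);  -- 1 day in ms

-- time_bucket() can then do the day/week/month bucketing in the same unit:
SELECT time_bucket(86400000, "time") AS day_bucket,
       price,
       count(*)
FROM   price_change
WHERE  category_id = 1000000010
  AND  "time" BETWEEN 1514745000000 AND 1517423400000
GROUP  BY day_bucket, price;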

Query samples: To check how many times the price was changed to 500, for all products in a given category, over a month, grouped by day.

select count(*), to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as unit, price 
from price_change 
where category_id = 1000000010 and time between 1514745000000 and 1517423400000 
  and price = 500 
group by price, unit;

To check how many times the price was changed to any of (100, 200, 300, 400, 500, 600, 700, 800, 900, 1000), for all products in a given category, over the last 10 months, grouped by month.

select count(*), to_char(date_trunc('month', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as unit, price 
from price_change 
where category_id = 1000000010 and time between  1514745000000 and 1517423400000  
   and price in (100,200,300,400,500,600,700,800,900,1000) group by price, unit;

To select the details of products whose price changed in the given time range, in a given category

select product_id, product_name, price, to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as timestamp 
from price_change 
  join products using (product_id) 
where price_change.category_id = 1000000010 
  and price_change.time between 1514745000000 and 1517423400000;

To select the industry and product id details for products whose price changed in the given time range, in a given category

select industry_id, product_id, price 
from price_change 
  join industries using (industry_id) 
where price_change.category_id = 1000000010 
  and price_change.time between 1514745000000 and 1517423400000;

To select product price change details in a time range, with a discount of 10%, in a specific category

select product_id, product_name, price, to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as timestamp 
from price_change 
  join products using (product_id) 
where price_change.category_id = 1000000010 
  and price_change.time between 1514745000000 and 1517423400000
  and (product_info->>'discount')::int = 10;

To select product price change details in a time range, sold by a specific seller, in a specific category

select product_id, product_name, price, to_char(date_trunc('day', to_timestamp(time/1000) at time zone 'Asia/Kolkata'), 'YYYY/MM/DD') as timestamp 
from price_change 
  join products using (product_id) 
where price_change.category_id = 1000000010 
  and price_change.time between 1514745000000 and 1517423400000
  and product_info->>'seller'='ABC Assured';

In most cases, the query will not contain category_id in the select columns.

1 Answer
你好瞎i
Answered 2019-06-14 03:14

It would help if you also provided some examples of what you typically query on. There are different ways to optimize indexes and how data is written on disk, and they depend very much on what type of query you are running (more specifically, on what is in your WHERE clause). If your WHERE clauses look into the JSON, you should consider either breaking those fields out into columns, or building indexes on the JSON itself.
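For example, if product_info were stored as jsonb (plain json cannot be GIN-indexed), something along these lines could serve the seller/discount filters shown in the question; the index names and the cast are just a sketch:

-- Convert the column so it can be indexed and queried with jsonb operators.
ALTER TABLE price_change
    ALTER COLUMN product_info TYPE jsonb USING product_info::jsonb;

-- A GIN index supports containment filters such as
--   product_info @> '{"seller": "ABC Assured"}'
CREATE INDEX price_change_product_info_gin
    ON price_change USING gin (product_info jsonb_path_ops);

-- Or a targeted expression index for a single hot key:
CREATE INDEX price_change_discount_idx
    ON price_change (((product_info ->> 'discount')::int));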

It sounds like one of your concerns is storage. Because TimescaleDB and PostgreSQL are relational, they do take up more storage than, say, a columnar store with better compression characteristics. You could also consider using something like ZFS to compress the data at the filesystem level.
