Optimizing a query returning a lot of records

Posted on 2020-05-01 06:00

Question:

I am not very experienced with SQL, and I have a doubt about how to optimize a query. I am using MySQL.

I have this DB schema:

And this is the query that returns the last price (the row with the most recent price_date in the Market_Commodity_Price_Series table) of a specific commodity in a specific market.

It contains a lot of joins to retrieve all the related information:

SELECT MCPS.id AS series_id,
        MD_CD.market_details_id AS market_id,
        MD_CD.commodity_details_id AS commodity_id,
        MD.market_name AS market_name,
        MCPS.price_date AS price_date,
        MCPS.avg_price AS avg_price,
        CU.ISO_4217_cod AS currency, 
        MU.unit_name AS measure_unit, 
        CD.commodity_name_en,
        CN.commodity_name 
FROM Market_Commodity_Price_Series AS MCPS
INNER JOIN MeasureUnit AS MU ON MCPS.measure_unit_id = MU.id
INNER JOIN Currency AS CU ON MCPS.currency_id = CU.id
INNER JOIN MarketDetails_CommodityDetails AS MD_CD ON MCPS.market_commodity_details_id = MD_CD.id
INNER JOIN MarketDetails AS MD ON MD_CD.market_details_id = MD.id
INNER JOIN CommodityDetails AS CD ON MD_CD.commodity_details_id = CD.id
INNER JOIN CommodityName AS CN ON CD.id = CN.commodity_details_id
INNER JOIN Languages AS LN ON CN.language_id  = LN.id
WHERE MD.id = 4
AND CD.id = 4 
AND LN.id = 1
ORDER BY price_date DESC LIMIT 1

My doubt is this: with the previous query I am extracting all the records related to a specific commodity in a specific market from the Market_Commodity_Price_Series table, performing a lot of joins, ordering these records by the price_date field, and limiting the result to the last one.

I think that it could be expensive because I can have a lot of records (the Market_Commodity_Price_Series table contains daily information).

This query works, but I think it can be done in a smarter way.

So I thought that I could do something like this:

1) Select the record related to the last price of a specific commodity in a specific market using a query like this:

SELECT measure_unit_id, 
        currency_id, 
        market_commodity_details_id, 
        MAX(price_date) price_date
FROM Market_Commodity_Price_Series AS MCPS
INNER JOIN MarketDetails_CommodityDetails AS MD_CD ON MCPS.market_commodity_details_id = MD_CD.id
WHERE MD_CD.market_details_id = 4
AND MD_CD.commodity_details_id = 4
GROUP BY measure_unit_id, currency_id, market_commodity_details_id

which returns a single record containing this information:

measure_unit_id      currency_id          market_commodity_details_id price_date
--------------------------------------------------------------------------------
1                    2                    24                          05/10/2017

Use this output like a table (I don't know the exact name, maybe a view?) and join this "table" to the other required information that is in the MeasureUnit, Currency, MarketDetails, CommodityDetails, CommodityName and Languages tables.

I think that this could be better because this way I am using MAX(price_date) to extract only the record related to the latest price in the Market_Commodity_Price_Series table, instead of obtaining all the records, ordering them, and limiting to the latest one.

Furthermore, most of the JOIN operations are performed on the single record returned by the previous query and not on all the records returned by the first version of my query (potentially hundreds or thousands).

Could this be a smart solution?

If yes... what is the correct syntax to join the output of this query (treating it as a table) with the other tables?
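
Something like this is what I have in mind, wrapping the aggregating query in the FROM clause as a derived table (just a sketch, I am not sure this is the correct syntax):

SELECT MD_CD.market_details_id AS market_id,
        MD_CD.commodity_details_id AS commodity_id,
        MD.market_name AS market_name,
        LAST_P.price_date AS price_date,
        CU.ISO_4217_cod AS currency,
        MU.unit_name AS measure_unit,
        CD.commodity_name_en,
        CN.commodity_name
FROM (
    SELECT measure_unit_id,
            currency_id,
            market_commodity_details_id,
            MAX(price_date) AS price_date
    FROM Market_Commodity_Price_Series AS MCPS
    INNER JOIN MarketDetails_CommodityDetails AS MD_CD ON MCPS.market_commodity_details_id = MD_CD.id
    WHERE MD_CD.market_details_id = 4
    AND MD_CD.commodity_details_id = 4
    GROUP BY measure_unit_id, currency_id, market_commodity_details_id
) AS LAST_P
INNER JOIN MeasureUnit AS MU ON LAST_P.measure_unit_id = MU.id
INNER JOIN Currency AS CU ON LAST_P.currency_id = CU.id
INNER JOIN MarketDetails_CommodityDetails AS MD_CD ON LAST_P.market_commodity_details_id = MD_CD.id
INNER JOIN MarketDetails AS MD ON MD_CD.market_details_id = MD.id
INNER JOIN CommodityDetails AS CD ON MD_CD.commodity_details_id = CD.id
INNER JOIN CommodityName AS CN ON CD.id = CN.commodity_details_id
INNER JOIN Languages AS LN ON CN.language_id = LN.id
WHERE LN.id = 1
-- Note: MCPS.id (series_id) and avg_price are not available here because the
-- derived table is aggregated; getting them back would need one more join to
-- Market_Commodity_Price_Series on (market_commodity_details_id, price_date).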

Answer 1:

You've done a reasonably good job of writing an efficient query.

You didn't use SELECT *, which can mess up performance in a query with lots of joins, because it generates bloated and redundant intermediate result sets. But your intermediate result set -- the one you apply ORDER BY to -- is not bloated.

Your WHERE col = val clauses mostly mention primary keys of tables (I guess). That's good.

Your big table Market_Commodity_Price_Series could maybe use a compound covering index. Similarly, some other tables may need that kind of index. But that should be the topic of another question.
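
For example, a compound index on the big table might look like this (just a sketch; the index name is made up, and the right column order depends on how the query actually filters and sorts):

-- Join/filter column first, then the sort column, so MySQL can locate the
-- rows for one market/commodity pair and read them already ordered by date.
CREATE INDEX idx_mcps_details_date
    ON Market_Commodity_Price_Series (market_commodity_details_id, price_date);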

Your proposed optimization -- ordering an intermediate result set consisting mostly of id values -- would help a lot if you were doing ORDER BY ... LIMIT and using LIMIT to discard most of your results. But you are not doing that.

Without knowing more about your data, it's hard to offer a crisp opinion. But if it were me, I'd use your first query. I'd keep an eye on it (and on your other complex queries) as you go into production. When (not if) performance starts to deteriorate, you can use EXPLAIN to figure out the best way to index your tables. You've done a good job of writing a query that will get your application up and running. Go with it!
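
Using EXPLAIN is as simple as prefixing the query. For example, on a trimmed-down version of yours (the key and rows columns in the output show whether an index is being used and roughly how many rows MySQL expects to scan):

EXPLAIN
SELECT MCPS.id, MCPS.price_date
FROM Market_Commodity_Price_Series AS MCPS
INNER JOIN MarketDetails_CommodityDetails AS MD_CD ON MCPS.market_commodity_details_id = MD_CD.id
WHERE MD_CD.market_details_id = 4
AND MD_CD.commodity_details_id = 4
ORDER BY MCPS.price_date DESC
LIMIT 1;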



Answer 2:

JOINs -- particularly on primary keys -- are not necessarily expensive. It looks like your joins are following the data model.

I wouldn't start optimizing the query without understanding its performance characteristics. How long does it take to run? How many records are being sorted to get the most recent?
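
A quick way to check the second question yourself (using the market_commodity_details_id = 24 from your example output):

-- How many rows does ORDER BY price_date DESC LIMIT 1 have to look through?
SELECT COUNT(*)
FROM Market_Commodity_Price_Series
WHERE market_commodity_details_id = 24;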

Your WHERE clause appears to be limiting the data considerably. You can also set up an index to help with the WHERE clause -- however, because the fields come from different tables, it can be tricky to use indexes for all of them.

You have a complicated data model that is a bit difficult to follow. It seems possible that you are getting a Cartesian product due to multiple n-m relationships. If so, that can have a big impact on performance, and pre-aggregating the data along each dimension is the way to go.

However, I wouldn't start optimizing the query without understanding how the current one behaves.



Answer 3:

One approach is to make a separate read-model table (this comes from the CQRS approach) containing all the necessary attributes just for the SELECT, with no joins at all. However, you will need to update the read-model table each time the other tables change. One more option is to create a view.
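
A minimal sketch of the view option (the view name is made up; the body is just the join from the original query, with the filters left out so they can be applied per request):

CREATE VIEW Market_Commodity_Prices_View AS
SELECT MCPS.id AS series_id,
        MD_CD.market_details_id AS market_id,
        MD_CD.commodity_details_id AS commodity_id,
        MD.market_name AS market_name,
        MCPS.price_date AS price_date,
        MCPS.avg_price AS avg_price,
        CU.ISO_4217_cod AS currency,
        MU.unit_name AS measure_unit,
        CD.commodity_name_en,
        CN.commodity_name,
        LN.id AS language_id
FROM Market_Commodity_Price_Series AS MCPS
INNER JOIN MeasureUnit AS MU ON MCPS.measure_unit_id = MU.id
INNER JOIN Currency AS CU ON MCPS.currency_id = CU.id
INNER JOIN MarketDetails_CommodityDetails AS MD_CD ON MCPS.market_commodity_details_id = MD_CD.id
INNER JOIN MarketDetails AS MD ON MD_CD.market_details_id = MD.id
INNER JOIN CommodityDetails AS CD ON MD_CD.commodity_details_id = CD.id
INNER JOIN CommodityName AS CN ON CD.id = CN.commodity_details_id
INNER JOIN Languages AS LN ON CN.language_id = LN.id;

-- The application then queries the view like a single table:
SELECT * FROM Market_Commodity_Prices_View
WHERE market_id = 4 AND commodity_id = 4 AND language_id = 1
ORDER BY price_date DESC
LIMIT 1;

Keep in mind that a plain MySQL view does not store or precompute anything: it runs the underlying joins on every query. Only the read-model table actually removes the join cost at read time, at the price of keeping it in sync on every write.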