Big tables and analysis in MySQL

Posted 2019-07-21 03:30

Question:

For my startup, I track everything myself rather than relying on Google Analytics. This is nice because I can actually have IPs and user IDs and everything.

This worked well until my tracking table grew to about 2 million rows. The table is called acts, and records:

  • ip
  • url
  • note
  • account_id

...where available.

Now, trying to do something like this:

SELECT COUNT(distinct ip) 
  FROM acts
  JOIN users ON(users.ip = acts.ip) 
 WHERE acts.url LIKE '%some_marketing_page%';

Basically never finishes. I switched to this:

SELECT COUNT(distinct ip) 
  FROM acts
  JOIN users ON(users.ip = acts.ip) 
 WHERE acts.note = 'some_marketing_page';

But it is still very slow, despite having an index on note.

I am obviously no MySQL pro. My question is:

How do companies with lots of data track things like funnel conversion rates? Is it possible to do this in MySQL, and am I just missing some knowledge? If not, what books / blogs can I read about how sites do this?

Answer 1:

While getting towards 'respectable', 2 million rows is still a relatively small size for a table, so faster performance is typically possible.

As you found out, a leading wildcard is particularly inefficient, and we'll have to find a solution for it if that use case is common in your application.

It could just be that you do not have the right set of indexes. Before I proceed, however, I wish to stress that while indexes typically improve the performance of SELECT statements of all kinds, they systematically hurt the performance of "CUD" operations (the SQL INSERT, UPDATE, and DELETE verbs, i.e. the queries which write to the database rather than just read from it). In some cases the negative impact of indexes on write queries can be very significant.

My reason for particularly stressing the ambivalent nature of indexes is that your application appears to do a fair amount of data collection as a normal part of its operation, so you will need to watch for possible degradation as INSERT queries slow down. A possible alternative is to perform the data collection into a relatively small table/database with no or very few indexes, and to regularly import the data from this input database into the database where the actual data mining takes place. (Once imported, the rows can be deleted from the "input database", keeping it small and fast for its INSERT duty.)
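A minimal sketch of that staging idea, using the column list from the question (the table name acts_in and the column types are my own assumptions):

CREATE TABLE acts_in (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ip         VARCHAR(45),    -- wide enough for IPv6
  url        VARCHAR(2048),
  note       VARCHAR(255),
  account_id INT UNSIGNED
);                           -- deliberately no secondary indexes

-- Periodic batch move into the indexed analysis table, e.g. from cron:
START TRANSACTION;
INSERT INTO acts (ip, url, note, account_id)
  SELECT ip, url, note, account_id FROM acts_in;
DELETE FROM acts_in;
COMMIT;

-- (A production version would bound both statements by a captured
-- MAX(id), so rows inserted mid-copy are not deleted uncopied.)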

Another concern/question is the width of a row in the acts table (the number of columns and the sum of the widths of these columns). Bad performance could be tied to rows being too wide, resulting in too few rows per leaf node of the table, and hence a deeper-than-needed tree structure.

Back to the indexes...
In view of the few queries in the question, it appears that you could benefit from an ip + note index (an index built on at least these two keys, in this order). A full analysis of the index situation, and frankly a possible review of the database schema, cannot be done here (not enough info for that), but the general process is to list the most common use cases and see which database indexes could help with them. You can gather insight into how particular queries are handled, before or after indexes are added, with MySQL's EXPLAIN command.
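For instance, a sketch of that index plus an EXPLAIN check (the index name is hypothetical, and EXPLAIN output layout varies by MySQL version):

CREATE INDEX idx_acts_ip_note ON acts (ip, note);

EXPLAIN
SELECT COUNT(DISTINCT ip)
  FROM acts
  JOIN users ON (users.ip = acts.ip)
 WHERE acts.note = 'some_marketing_page';

-- In the output, the "key" column for acts should name
-- idx_acts_ip_note (or another usable index); type: ALL would
-- indicate a full table scan.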

Normalization OR denormalization (or indeed a combination of both!) is often a viable way to improve performance of mining operations as well.
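As one hypothetical illustration of the denormalization route (the daily_note_visitors table and the created_at column are assumptions; the question's schema lists no timestamp), a pre-aggregated daily rollup lets funnel queries read a handful of small rows instead of scanning acts:

CREATE TABLE daily_note_visitors (
  day      DATE         NOT NULL,
  note     VARCHAR(255) NOT NULL,
  visitors INT UNSIGNED NOT NULL,  -- COUNT(DISTINCT ip) for that day/note
  PRIMARY KEY (day, note)
);

-- Refreshed periodically, e.g. nightly:
REPLACE INTO daily_note_visitors (day, note, visitors)
SELECT DATE(created_at), note, COUNT(DISTINCT ip)
  FROM acts
 WHERE created_at >= CURDATE() - INTERVAL 1 DAY
 GROUP BY DATE(created_at), note;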



Answer 2:

Why the JOIN? If we can assume that no IP makes it into acts without an associated record in users, then you don't need the join:

SELECT COUNT(distinct ip) FROM acts
WHERE acts.url LIKE '%some_marketing_page%';

If you really do need the JOIN, it might pay to first select the distinct IPs from acts and then JOIN that result to users (you'll have to look at the execution plan and experiment to see whether this is actually faster).
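One way to write that, shown here against the note-based variant of the query (semantically equivalent to the original join; compare EXPLAIN output and timings for both forms):

SELECT COUNT(DISTINCT a.ip)
  FROM (SELECT DISTINCT ip
          FROM acts
         WHERE note = 'some_marketing_page') AS a
  JOIN users ON (users.ip = a.ip);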

Secondly, that LIKE with a leading wildcard is going to cause a full table scan of acts and also necessitate some expensive text searching. You have three choices to improve this:

  1. Decompose the url into component parts before you store it, so that the search matches a column value exactly (see the sketch after this list).

  2. Require the search term to appear at the beginning of the url field, not in the middle.

  3. Investigate a full text search engine that will index the url field in such a way that even an internal LIKE search can be performed against indexes.
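A sketch of options 1 and 2, assuming a hypothetical page column that the application fills with the interesting path segment at insert time:

-- Option 1: exact match on a decomposed, indexed column.
ALTER TABLE acts ADD COLUMN page VARCHAR(255);
CREATE INDEX idx_acts_page ON acts (page);

SELECT COUNT(DISTINCT ip)
  FROM acts
 WHERE page = 'some_marketing_page';

-- Option 2: a prefix LIKE (no leading wildcard) can still use an
-- ordinary index on the column it searches.
SELECT COUNT(DISTINCT ip)
  FROM acts
 WHERE page LIKE 'some_marketing_page%';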

Finally, for searches on acts.note: if an index on note doesn't provide a sufficient improvement, I'd consider calculating and storing an integer hash of note and searching on that.
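A sketch of the hash idea using MySQL's built-in CRC32() (the answer doesn't name a hash function, so this is one choice among several); the equality check on note stays in the query to filter out hash collisions:

ALTER TABLE acts ADD COLUMN note_hash INT UNSIGNED;
UPDATE acts SET note_hash = CRC32(note);
CREATE INDEX idx_acts_note_hash ON acts (note_hash);

SELECT COUNT(DISTINCT ip)
  FROM acts
 WHERE note_hash = CRC32('some_marketing_page')
   AND note = 'some_marketing_page';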



Answer 3:

Try running EXPLAIN on your query and look to see if there are any table scans. (EXPLAIN PLAN is the Oracle spelling; in MySQL the statement is just EXPLAIN.)

Should this be a LEFT JOIN?

Maybe this site can help.