-->

Moving records to another table vs. “flag” column

2020-07-11 07:31发布

问题:

Suppose I have a table that represents "task queue" (tens of millions of records).

Tasks can be "queued" or "done".

Which performs better if we want to grab 10 task to process?

  • Single table with "flag" column like ENUM / BIT / TINYINT flagging task as done or not (eventually index on column)
  • Separate tables for queued tasks, and completed task and deleting each completed task from queued with insertion to completed

Note that at the begining, we have a few or none completed tasks, but as processing is going, there will be milions of already done tasks.

回答1:

It probably doesn't matter, but if it were me, I would use the one table. Here's my reasoning:

First and foremost, we must assume good indexes on this table, which is what's going to make the lookup fast. With appropriate indexes, if you want to query for queued tasks, it won't matter whether the number of "done" tasks is 10 or 10 billion, the DBMS will only look at the queued ones.

Second, as a task moves from "queued" to "done", you're going to update its status. And this requires a bit of reorganization of the index by the DBMS, but that's OK, they've been doing that with high efficiency for something like 30 years now.

If you were to split them out into separate tables, essentially the maintenance of moving a record from one place to another would be put on your code instead of in the DBMS index reorganization code. Which of those code bases is better tested and more performant? :)

One final argument - if you put it all in one big table, further performance tweaking of the management of these tasks becomes a DBMS configuration issue, as opposed to a software development issue. That's a big win in my book. There's all sorts of crazy configuration stuff you can do to improve performance in any DBMS, including things like vertical and horizontal partitioning. Those things won't be options if the way you've distributed your data is via some scheme that's embedded in your software.

So bottom line - if you do the 2 table approach, I think it's going to perform very similarly to if you do the one table approach, once you take into account the extra work your code will have to do to move records around. If you delete an "open" task from one table and stick it into a "done" table, keep in mind the DBMS still will have to update the "open" index on the source table. Because there's likely not going to be a big performance difference, you should use the one table approach because it's less work for you, and gives you more flexibility later (speed improvements via configuration, vice software)