SQL Server and SqlDataReader - Trillion Records -

2019-02-13 22:30发布

问题:

I've never tried this - so I don't know if I'd run into memory issues.

But can a SqlDataReader read a trillion records? It's all streamed correct? I'm a little green to what the SQL/TDS protocol is doing under the covers.

UPDATE Translate Trillion to mean very large number. I probably should have said something like 1 billion or 100 million.

回答1:

Yes, that will stream... but I don't think you should actually try to do it.

If you could read a million records per second (which sounds unlikely to me) you'd still need 12 days to read a trillion records... that's a lot of work to risk losing half way through.

Now I realise you probably don't really want to read a trillion records, literally, but my point is that if you can separate your "large amount" of work into logical batches anyway, that's probably a good idea.



回答2:

There are a few details.

  • SqlDataReader will normally read an entire row in memory and cache it. This includes any BLOB fields, so you can end up caching several 2GB fields in memory (XML, VARBINARY(MAX), VARCHAR(MAX), NVARCHAR(MAX)). If such fields are a concern then you must pass in the CommandBehavior.SequentialAccess to ExecuteReader and use the streaming capabilities of the SqlClient specific types like SqlBytes.Stream.

  • A connection is busy until the SqlDataReader completes. This creates transactional problems because you won't be able to to any processing in the database in the same transaciton, because the connection is busy. Trying to open a different conneciton and enroll in the same transaction will fail, as loop back distributed transacitons are prohibited. The loution is to use MARS. You do so by setting MultipleActiveResultSets=True on the connection. This allows you to issue command on the same connection while a data reader is still active (typical fetch-process-fetch loop). Read the link to Christian Kleinerman's with great care, make sure you understand the issues and restrictions around MARS and transactions, they're quite subtle and counter intuitive.

  • Lengthy processing in the client will block the server. Your query will still be executing all this time and the server will have to suspend it when the communication pipe fills up. A query consumes a worker (or more if it has parallel plans) and workes are a very scarce commodity in a server (they equate roughly to threads). You won't be bale to afford many clients processing huge result sets at their own leissure.

  • Transaction size. Processing a trillion records on one transaction is never going to work. The log will have to grow to accomodate the entire transaction and won't truncate and reuse the VLFs, resulting in huge log growth.

  • Recovery time. If processing fails at the 999 billionth record it will have to rollback all the work done, so it will take another '12' days just to rollback.



回答3:

Yes - it might take a while (as long as your SQL isn't doing anything silly trying to take a snapshot or anything), but if your server can stream it out, the SqlDataReader shouldn't have a memory usage problem.