Sharing transactions between connections

Posted 2019-07-16 15:45

Question:

I have a FileShare crawler (it collects permissions and stores them somewhere for a later audit). It currently starts multiple threads to crawl the same folder, to speed up the process.

In C#, each SqlConnection object has its own SqlTransaction, initiated by the SqlConnection.BeginTransaction() call.
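
For illustration, a minimal sketch (the dbo.SharePermissions table and SaveShare helper are hypothetical): the transaction lives and dies with the connection that created it.

    using System.Data.SqlClient;

    static void SaveShare(string connectionString, string path, string acl)
    {
        // A transaction is scoped to the connection that created it; a command
        // on a different SqlConnection cannot enlist in this transaction.
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            using (var cmd = new SqlCommand(
                "INSERT INTO dbo.SharePermissions (Path, Acl) VALUES (@p, @a)", conn, tx))
            {
                cmd.Parameters.AddWithValue("@p", path);
                cmd.Parameters.AddWithValue("@a", acl);
                cmd.ExecuteNonQuery();
                tx.Commit();
            }
        }
    }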

Here is the pseudo code of the current solution (a rough C# sketch follows the list):

  1. Get the list of folders
  2. For each folder, get the list of sub-folders
  3. For each sub-folder, start a thread to collect file shares
  4. Each thread saves the collected data to the database
  5. Run audit reports on the database

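Steps 2-4 translate into something like the following rough sketch, reusing the hypothetical SaveShare helper above; because every task runs on its own connection and commits independently, one failed task leaves all the other tasks' rows behind:

    using System.IO;
    using System.Threading.Tasks;

    static void CrawlFolder(string connectionString, string folder)
    {
        Parallel.ForEach(Directory.EnumerateDirectories(folder), subFolder =>
        {
            // Each task commits on its own connection; if this task throws,
            // rows already committed by sibling tasks stay in the database,
            // producing the partial scan described below.
            foreach (var file in Directory.EnumerateFiles(subFolder))
                SaveShare(connectionString, file, GetAclString(file));
        });
    }

    // Hypothetical stand-in for the actual permission-collection logic.
    static string GetAclString(string path) => "<acl placeholder>";
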
The problem arises when one of the sub-folder threads fails: we end up with a partial folder scan that cannot be detected easily. The main reason is that each thread runs on a separate connection.

I would like each folder to be committed in the same transaction, rather than ending up with an incomplete scan (the current situation when some threads fail). No transaction concept is implemented yet, but I am evaluating the options.

Based on the comments of this answer, a producer/consumer queue would be an option, but unfortunately memory is a limit (due to the number of started threads). If the producer/consumer buffer is spilled to disk to overcome the RAM limit, the execution time goes up (disk I/O being far slower than memory I/O). I guess I am stuck with a memory/time compromise. Any other suggestions?
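
For what it's worth, a BlockingCollection<T> with a bounded capacity gives the producer/consumer pattern a hard RAM ceiling without spilling to disk: producers simply block when the buffer is full. A minimal sketch (the capacity, row shape, and SaveShare helper are placeholders):

    using System.Collections.Concurrent;
    using System.Threading.Tasks;

    var queue = new BlockingCollection<(string Path, string Acl)>(boundedCapacity: 10_000);

    // A single consumer drains the queue and writes to the database.
    var consumer = Task.Run(() =>
    {
        foreach (var row in queue.GetConsumingEnumerable())
            SaveShare(connectionString, row.Path, row.Acl);   // hypothetical helper
    });

    // Producer threads call queue.Add(row); Add blocks once 10,000 rows
    // are buffered, so memory use stays bounded regardless of thread count.
    queue.CompleteAdding();   // signal that no more rows will be produced
    consumer.Wait();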

Answer 1:

It is possible to share the same transaction across multiple connections in SQL Server using the obsolete bound-sessions feature (sp_getbindtoken/sp_bindsession). I have never used it, I wouldn't base new development on it, and it also seems unnecessary here.

Can't you just have all the producers use the same connection and transaction? Put a lock around it. This obviously bottlenecks the process, but it might still be fast enough.
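
A minimal sketch of that idea (the SharedWriter class and dbo.SharePermissions table are hypothetical): every worker funnels its writes through one connection and one transaction, serialized by a lock:

    using System.Data.SqlClient;

    class SharedWriter
    {
        private readonly object _gate = new object();
        private readonly SqlConnection _conn;
        private readonly SqlTransaction _tx;

        public SharedWriter(string connectionString)
        {
            _conn = new SqlConnection(connectionString);
            _conn.Open();
            _tx = _conn.BeginTransaction();   // one transaction for the whole crawl
        }

        public void Save(string path, string acl)
        {
            lock (_gate)   // SqlConnection is not thread-safe; serialize all access
            {
                using (var cmd = new SqlCommand(
                    "INSERT INTO dbo.SharePermissions (Path, Acl) VALUES (@p, @a)",
                    _conn, _tx))
                {
                    cmd.Parameters.AddWithValue("@p", path);
                    cmd.Parameters.AddWithValue("@a", acl);
                    cmd.ExecuteNonQuery();
                }
            }
        }

        public void Commit()   { _tx.Commit();   _conn.Dispose(); }
        public void Rollback() { _tx.Rollback(); _conn.Dispose(); }
    }

If any crawler thread fails, call Rollback() and nothing from the scan is persisted; otherwise a single Commit() makes the whole scan visible at once.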

You say you execute INSERT statements. For bulk inserts you can use the SqlBulkCopy class, which is much faster. Batch up the rows and only execute a bulk insert once you have well over 1,000 rows buffered.
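
Sketching that on top of the hypothetical SharedWriter above (the threshold and column types are assumptions): rows accumulate in a DataTable and are flushed in one SqlBulkCopy round trip once the buffer passes the threshold:

    using System.Data;

    // Added inside the hypothetical SharedWriter class from the previous sketch:
    private const int FlushThreshold = 5000;
    private readonly DataTable _buffer = CreateBuffer();

    private static DataTable CreateBuffer()
    {
        var t = new DataTable();
        t.Columns.Add("Path", typeof(string));
        t.Columns.Add("Acl", typeof(string));
        return t;
    }

    public void Add(string path, string acl)
    {
        lock (_gate)
        {
            _buffer.Rows.Add(path, acl);
            if (_buffer.Rows.Count >= FlushThreshold)
                Flush();
        }
    }

    private void Flush()   // callers must hold _gate
    {
        using (var bulk = new SqlBulkCopy(_conn, SqlBulkCopyOptions.Default, _tx))
        {
            bulk.DestinationTableName = "dbo.SharePermissions";
            bulk.WriteToServer(_buffer);   // one round trip for thousands of rows
        }
        _buffer.Clear();
    }

A final Flush() before Commit() would push out any remaining buffered rows.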

I don't even see the need for producer/consumer here. It would indeed benefit performance by pipelining production with consumption, but it also introduces far more complex threading. If you want to go this route, stream the produced rows directly into the database without intermediate buffering: either wrap them in an IDataReader and hand that to SqlBulkCopy.WriteToServer (which has no IEnumerable overload), or pass an IEnumerable<SqlDataRecord> to a table-valued parameter.
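
A sketch of the table-valued-parameter variant, assuming a hypothetical dbo.ShareRowType table type created once on the server (e.g. CREATE TYPE dbo.ShareRowType AS TABLE (Path NVARCHAR(400), Acl NVARCHAR(MAX))). The iterator yields rows lazily, so nothing is buffered in memory:

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using Microsoft.SqlServer.Server;

    static IEnumerable<SqlDataRecord> ToRecords(IEnumerable<(string Path, string Acl)> rows)
    {
        var meta = new[]
        {
            new SqlMetaData("Path", SqlDbType.NVarChar, 400),
            new SqlMetaData("Acl",  SqlDbType.NVarChar, SqlMetaData.Max)
        };
        foreach (var row in rows)
        {
            var rec = new SqlDataRecord(meta);   // produced lazily, one row at a time
            rec.SetString(0, row.Path);
            rec.SetString(1, row.Acl);
            yield return rec;
        }
    }

    static void InsertStreaming(SqlConnection conn, SqlTransaction tx,
                                IEnumerable<(string Path, string Acl)> rows)
    {
        using (var cmd = new SqlCommand(
            "INSERT INTO dbo.SharePermissions (Path, Acl) SELECT Path, Acl FROM @rows",
            conn, tx))
        {
            var p = cmd.Parameters.AddWithValue("@rows", ToRecords(rows));
            p.SqlDbType = SqlDbType.Structured;
            p.TypeName = "dbo.ShareRowType";   // hypothetical table type, see above
            cmd.ExecuteNonQuery();
        }
    }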