Need recommendations on pushing the envelope with

I am designing an application, one aspect of which is that it is supposed to be able to receive massive amounts of data into SQL database. I designed the database stricture as a single table with bigint identity, something like this one:

CREATE TABLE MainTable
(
   _id bigint IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    field1, field2, ...
)

I will omit how am I intending to perform queries, since it is irrelevant to the question I have.

I have written a prototype, which inserts data into this table using SqlBulkCopy. It seemed to work very well in the lab. I was able to insert tens of millions records at a rate of ~3K records/sec (full record itself is rather large, ~4K). Since the only index on this table is autoincrementing bigint, I have not seen a slowdown even after significant amount of rows was pushed.

Considering that the lab SQL server was a virtual machine with relatively weak configuration (4Gb RAM, shared with other VMs disk sybsystem), I was expecting to get significantly better throughput on the physical machine, but it didn't happen, or lets say the performance increase was negligible. I could, maybe get 25% faster inserts on physical machine. Even after I configured 3-drive RAID0, which performed 3 times faster than a single drive (measured by a benchmarking software), I got no improvement. Basically: faster drive subsystem, dedicated physical CPU and double RAM almost didn't translate into any performance gain.

I then repeated the test using biggest instance on Azure (8 cores, 16Gb), and I got the same result. So, adding more cores did not change insert speed.

At this time I have played around with following software parameters without any significant performance gain:

Modifying SqlBulkInsert.BatchSize parameter
Inserting from multiple threads simultaneously, and adjusting # of threads
Using table lock option on SqlBulkInsert
Eliminating network latency by inserting from a local process using shared memory driver

I am trying to increase performance at least 2-3 times, and my original idea was that throwing more hardware would get tings done, but so far it doesn't.

So, can someone recommend me:

What resource could be suspected a bottleneck here? How to confirm?
Is there a methodology I could try to get reliably scalable bulk insert improvement considering there is a single SQL server system?

UPDATE I am certain that load app is not a problem. It creates record in a temporary queue in a separate thread, so when there is an insert it goes like this (simplified):

===>start logging time
int batchCount = (queue.Count - 1) / targetBatchSize + 1;
Enumerable.Range(0, batchCount).AsParallel().
    WithDegreeOfParallelism(MAX_DEGREE_OF_PARALLELISM).ForAll(i =>
{
    var batch = queue.Skip(i * targetBatchSize).Take(targetBatchSize);
    var data = MYRECORDTYPE.MakeDataTable(batch);
    var bcp = GetBulkCopy();
    bcp.WriteToServer(data);
});
====> end loging time

timings are logged, and the part that creates a queue never takes any significant chunk

UPDATE2 I have implemented collecting how long each operation in that cycle takes and the layout is as follows:

queue.Skip().Take() - negligible
MakeDataTable(batch) - 10%
GetBulkCopy() - negligible
WriteToServer(data) - 90%

UPDATE3 I am designing for standard version of SQL, so I cannot rely on partitioning, since it's only available in Enterprise version. But I tried a variant of partitioning scheme:

created 16 filegroups (G0 to G15),
made 16 tables for insertion only (T0 to T15) each bound to its individual group. Tables are with no indexes at all, not even clustered int identity.
threads that insert data will cycle through all 16 tables each. This makes it almost a guarantee that each bulk insert operation uses its own table

That did yield ~20% improvement in bulk insert. CPU cores, LAN interface, Drive I/O were not maximized, and used at around 25% of max capacity.

UPDATE4 I think it is now as good as it gets. I was able to push inserts to a reasonable speeds using following techniques:

Each bulk insert goes into its own table, then results are merged into main one
Tables are recreated fresh for every bulk insert, table locks are used
Used IDataReader implementation from here instead of DataTable.
Bulk inserts done from multiple clients
Each client is accessing SQL using individual gigabit VLAN
Side processes accessing the main table use NOLOCK option
I examined sys.dm_os_wait_stats, and sys.dm_os_latch_stats to eliminate contentions

I have a hard time to decide at this point who gets a credit for answered question. Those of you who don't get an "answered", I apologize, it was a really tough decision, and I thank you all.

UPDATE5: Following item could use some optimization:

Used IDataReader implementation from here instead of DataTable.

Unless you run your program on machine with massive CPU core count, it could use some re-factoring. Since it is using reflection to generate get/set methods, that becomes a major load on CPUs. If performance is a key, it adds a lot of performance when you code IDataReader manually, so that it is compiled, instead of using reflection

回答1:

For recommendations on tuning SQL Server for bulk loads, see the Data Loading and Performance Guide paper from MS, and also Guidelines for Optimising Bulk Import from books online. Although they focus on bulk loading from SQL Server, most of the advice applies to bulk loading using the client API. This papers apply to SQL 2008 - you don't say which SQL Server version you're targetting
Both have quite a lot of information which it's worth going through in detail. However, some highlights:

Minimally log the bulk operation. Use bulk-logged or simple recovery. You may need to enable traceflag 610 (but see the caveats on doing this)
Tune the batch size
Consider partitioning the target table
Consider dropping indexes during bulk load

Nicely summarised in this flow chart from Data Loading and Performance Guide:

As others have said, you need to get some peformance counters to establish the source of the bottleneck, since your experiments suggest that IO might not be the limitation. Data Loading and Performance Guide includes a list of SQL wait types and performance counters to monitor (there are no anchors in the document to link to but this is about 75% through the document, in the section "Optimizing Bulk Load")

UPDATE

It took me a while to find the link, but this SQLBits talk by Thomas Kejser is also well worth watching - the slides are available if you don't have time to watch the whole thing. It repeats some of the material linked here but also covers a couple of other suggestions for how to deal with high incidences of particular performance counters.

回答2:

It seems you have done a lot however I am not sure if you have had chance to study Alberto Ferrari SqlBulkCopy Performance Analysis report, which describes several factors to consider the performance related with SqlBulkCopy. I would say lots of things discussed in that paper is still worth trying to that would good to try first.

回答3:

I am not sure why you are not getting 100% utilization on CPU, IO or memory. But if you simply want to improve your bulk load speeds, here is something to consider:

Partition you data file into different files. Or if they are coming from different sources, then simply create different data files.
Then run multiple bulk inserts simultaneously.

Depending on your situation the above may not be feasible; but if you can then I am sure it should improve your load speeds.