I have an Excel document with about 250,000 rows which takes forever to import. I have done many variations of this import, however there are a few requirements:

- Need to validate the data in each cell
- Must check if a duplicate exists in the database
- If a duplicate exists, update the entry
- If no entry exists, insert a new one
I have used parallelization as much as possible, however I am sure that there must be some way to get this import to run much faster. Any assistance or ideas would be greatly appreciated.
Note that the database is on a LAN, and yes, I know I haven't used parameterized SQL commands (yet).
public string BulkUserInsertAndUpdate()
{
    DateTime startTime = DateTime.Now;
    try
    {
        ProcessInParallel();
        Debug.WriteLine("Time taken: " + (DateTime.Now - startTime));
    }
    catch (Exception ex)
    {
        return ex.Message;
    }
    return "";
}
private IEnumerable<Row> ReadDocument()
{
    using (SpreadsheetDocument spreadSheetDocument = SpreadsheetDocument.Open(_fileName, false))
    {
        WorkbookPart workbookPart = spreadSheetDocument.WorkbookPart;
        Sheet ss = workbookPart.Workbook.Descendants<Sheet>().SingleOrDefault(s => s.Name == "User");
        if (ss == null)
            throw new Exception("There was a problem trying to import the file. Please ensure that the Sheet's name is: User");

        WorksheetPart worksheetPart = (WorksheetPart)workbookPart.GetPartById(ss.Id);
        OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
        StringTablePart = workbookPart.SharedStringTablePart;

        while (reader.Read())
        {
            if (reader.ElementType == typeof(Row))
            {
                do
                {
                    if (reader.HasAttributes)
                    {
                        var rowNum = int.Parse(reader.Attributes.First(a => a.LocalName == "r").Value);
                        if (rowNum == 1) // skip the header row
                            continue;
                        var row = (Row)reader.LoadCurrentElement();
                        yield return row;
                    }
                } while (reader.ReadNextSibling()); // Skip to the next row
                break; // We just looped through all the rows so no need to continue reading the worksheet
            }
        }
    }
}
private void ProcessInParallel()
{
    // Use ConcurrentQueue to enable safe enqueueing from multiple threads.
    var exceptions = new ConcurrentQueue<Exception>();

    Parallel.ForEach(ReadDocument(), (row, loopState) =>
    {
        List<Cell> cells = row.Descendants<Cell>().ToList();
        if (string.IsNullOrEmpty(GetCellValue(cells[0], StringTablePart)))
            return;

        // validation code goes here....

        try
        {
            using (SqlConnection connection = new SqlConnection("user id=sa;password=D3vAdm!n@;server=196.30.181.143;database=TheUnlimitedUSSD;MultipleActiveResultSets=True"))
            {
                connection.Open();
                SqlCommand command = new SqlCommand("SELECT count(*) FROM dbo.[User] WHERE MobileNumber = '" + mobileNumber + "'", connection);
                var userCount = (int)command.ExecuteScalar();
                if (userCount > 0)
                {
                    // update
                    command = new SqlCommand("UPDATE [user] SET NewMenu = " + (newMenuIndicator ? "1" : "0") + ", PolicyNumber = '" + policyNumber + "', Status = '" + status + "' WHERE MobileNumber = '" + mobileNumber + "'", connection);
                    command.ExecuteScalar();
                    Debug.WriteLine("Update cmd");
                }
                else
                {
                    // insert
                    command = new SqlCommand("INSERT INTO dbo.[User] ( MobileNumber , Status , PolicyNumber , NewMenu ) VALUES ( '" + mobileNumber + "' , '" + status + "' , '" + policyNumber + "' , " + (newMenuIndicator ? "1" : "0") + " )", connection);
                    command.ExecuteScalar();
                    Debug.WriteLine("Insert cmd");
                }
            }
        }
        catch (Exception ex)
        {
            exceptions.Enqueue(ex);
            Debug.WriteLine(ex.Message);
            loopState.Break();
        }
    });

    // Throw the exceptions here after the loop completes.
    if (exceptions.Count > 0)
        throw new AggregateException(exceptions);
}
Some tips for enhanced processing, as I believe this is what you need rather than a straight code fix. I'm not sure all of them will apply to your specific situation:
I would suggest that you do a bulk import WITHOUT any validation to an intermediary (staging) table, and only then do all the validation via SQL. Your spreadsheet's data will then be in a structure similar to a SQL table. This is what I have done with industrial-strength imports of 3 million+ rows from Excel and CSV, with great success.
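Here is a minimal sketch of that approach, assuming a staging table dbo.User_Staging with the same columns as dbo.[User]; the staging table name, the DataTable input, and the batch size are my assumptions, not something from your code:

    using System.Data;
    using System.Data.SqlClient;

    public class StagingImport
    {
        // Loads every spreadsheet row into the staging table in one streamed
        // bulk copy, then upserts into the real table set-based on the server.
        public void BulkLoadAndMerge(DataTable rows, string connectionString)
        {
            using (var connection = new SqlConnection(connectionString))
            {
                connection.Open();

                // 1. Push all rows into the staging table.
                using (var bulkCopy = new SqlBulkCopy(connection))
                {
                    bulkCopy.DestinationTableName = "dbo.User_Staging"; // hypothetical staging table
                    bulkCopy.BatchSize = 10000;
                    bulkCopy.WriteToServer(rows);
                }

                // 2. Insert-or-update into dbo.[User] in a single set-based statement.
                const string mergeSql = @"
                    MERGE dbo.[User] AS target
                    USING dbo.User_Staging AS source
                        ON target.MobileNumber = source.MobileNumber
                    WHEN MATCHED THEN
                        UPDATE SET NewMenu = source.NewMenu,
                                   PolicyNumber = source.PolicyNumber,
                                   Status = source.Status
                    WHEN NOT MATCHED THEN
                        INSERT (MobileNumber, Status, PolicyNumber, NewMenu)
                        VALUES (source.MobileNumber, source.Status, source.PolicyNumber, source.NewMenu);";

                using (var merge = new SqlCommand(mergeSql, connection))
                {
                    merge.CommandTimeout = 300; // large merges can take a while
                    merge.ExecuteNonQuery();
                }
            }
        }
    }

The validation then becomes a SQL pass over the staging table (deleting or flagging bad rows before the MERGE), which is usually far faster than validating row by row from C#.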
Mostly I'd suggest you check that your parallelism is optimal. Since your bottlenecks are likely to be disk IO on the Excel file and IO to the SQL server, I'd suggest that it may not be. You've parallelised across those two processes (so each of them is reduced to the speed of the slowest), and your parallel threads will be fighting over the database and potentially slowing each other down. There's no point having (say) eight threads if your hard disk can't keep up with one; it just creates overhead.
Two things I'd suggest. First: take out all the parallelism and see if it's actually helping. If you parse the whole file into a single Queue in memory on one thread, then run the whole thing into the database, you might find it's faster.
Then, I'd try splitting it to just two threads: one to process the incoming file to the Queue, and one to take the items from the Queue and push them into the database. This way you have one thread per slow resource that you're handling - so you minimise contention - and each thread is blocked by only one resource - so you're handling that resource as optimally as possible.
This is the real trick of multithreaded programming. Throwing extra threads at a problem doesn't necessarily improve performance. What you're trying to do is minimise the time that your program is waiting idly for something external (such as disk or network IO) to complete. If one thread only waits on the Excel file, and one thread only waits on the SQL server, and what they do in between is minimal (which, in your case, it is), you'll find your code will run as fast as those external resources will allow it to.
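To illustrate the two-thread split, here is a minimal sketch, assuming your existing ReadDocument() iterator and a hypothetical ImportUser(row) method that does the per-row validation and insert/update:

    using System.Collections.Concurrent;
    using System.Threading.Tasks;
    using DocumentFormat.OpenXml.Spreadsheet;

    private void ProcessWithTwoThreads()
    {
        // Bounded queue so the Excel reader can't run arbitrarily far ahead
        // of the database writer and balloon memory usage.
        using (var queue = new BlockingCollection<Row>(boundedCapacity: 1000))
        {
            // Producer: the only thread touching the Excel file.
            var reader = Task.Run(() =>
            {
                foreach (var row in ReadDocument())
                    queue.Add(row);
                queue.CompleteAdding(); // tell the consumer no more rows are coming
            });

            // Consumer: the only thread touching the database.
            var writer = Task.Run(() =>
            {
                foreach (var row in queue.GetConsumingEnumerable())
                    ImportUser(row); // hypothetical: validate + insert/update one row
            });

            Task.WaitAll(reader, writer);
        }
    }

If the database side still turns out to be the bottleneck, you can batch rows on the consumer side, but measure this simple version first.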
Also, you mention it yourself, but using parameterised SQL isn't just a cool thing to point out: it will increase your performance. At the moment, you're creating a new SqlCommand for every insert, which has overhead. If you switch to a parameterised command, you can keep the same command throughout and just change the parameter values, which will save you some time. I don't think this is possible in a parallel ForEach (I doubt you can reuse the SqlCommand across threads), but it'd work fine with either of the approaches above.
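For example, here is a minimal sketch of one reusable parameterised command, using the table and column names from your queries; UserRecord, the parameter sizes, and the IF EXISTS upsert (which also drops the separate SELECT count(*) round trip) are my assumptions, not your code:

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    // Hypothetical DTO holding one validated spreadsheet row.
    public class UserRecord
    {
        public string MobileNumber { get; set; }
        public string Status { get; set; }
        public string PolicyNumber { get; set; }
        public bool NewMenuIndicator { get; set; }
    }

    private void UpsertUsers(IEnumerable<UserRecord> records, SqlConnection connection)
    {
        const string sql = @"
            IF EXISTS (SELECT 1 FROM dbo.[User] WHERE MobileNumber = @MobileNumber)
                UPDATE dbo.[User]
                SET NewMenu = @NewMenu, PolicyNumber = @PolicyNumber, Status = @Status
                WHERE MobileNumber = @MobileNumber
            ELSE
                INSERT INTO dbo.[User] (MobileNumber, Status, PolicyNumber, NewMenu)
                VALUES (@MobileNumber, @Status, @PolicyNumber, @NewMenu)";

        using (var command = new SqlCommand(sql, connection))
        {
            // Parameter types and sizes are guesses; adjust them to your schema.
            command.Parameters.Add("@MobileNumber", SqlDbType.VarChar, 20);
            command.Parameters.Add("@Status", SqlDbType.VarChar, 50);
            command.Parameters.Add("@PolicyNumber", SqlDbType.VarChar, 50);
            command.Parameters.Add("@NewMenu", SqlDbType.Bit);

            foreach (var record in records)
            {
                // Only the values change between rows; the command and its
                // cached query plan are reused.
                command.Parameters["@MobileNumber"].Value = record.MobileNumber;
                command.Parameters["@Status"].Value = record.Status;
                command.Parameters["@PolicyNumber"].Value = record.PolicyNumber;
                command.Parameters["@NewMenu"].Value = record.NewMenuIndicator;
                command.ExecuteNonQuery();
            }
        }
    }

As a bonus, the parameters also remove the SQL injection risk that the string-concatenated commands currently have.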