Any way to speed up this Excel import?

Posted 2019-04-13 03:53

I have an Excel document with about 250,000 rows that takes forever to import. I have tried many variations of this import, but there are a few requirements:

  • Need to validate the data in each cell
  • Must check if a duplicate exists in the database
  • If a duplicate exists, update the entry
  • If no entry exists, insert a new one

I have used parallelization as much as possible; however, I am sure there must be some way to make this import run much faster. Any assistance or ideas would be greatly appreciated.

Note that the database is on a LAN, and yes, I know I haven't used parameterized SQL commands (yet).

        public string BulkUserInsertAndUpdate()
        {
            DateTime startTime = DateTime.Now;
            try
            {
                ProcessInParallel();
                Debug.WriteLine("Time taken: " + (DateTime.Now - startTime));
            }
            catch (Exception ex)
            {
                return ex.Message;
            }

            return "";
        }


        private IEnumerable<Row> ReadDocument()
        {
            using (SpreadsheetDocument spreadSheetDocument = SpreadsheetDocument.Open(_fileName, false))
            {
                WorkbookPart workbookPart = spreadSheetDocument.WorkbookPart;

                Sheet ss = workbookPart.Workbook.Descendants<Sheet>().SingleOrDefault(s => s.Name == "User");

                if (ss == null)
                    throw new Exception("There was a problem trying to import the file. Please insure that the Sheet's name is: User");

                WorksheetPart worksheetPart = (WorksheetPart)workbookPart.GetPartById(ss.Id);

                // OpenXmlReader is IDisposable, so dispose it when we're done
                using (OpenXmlReader reader = OpenXmlReader.Create(worksheetPart))
                {
                    StringTablePart = workbookPart.SharedStringTablePart;

                    while (reader.Read())
                    {
                        if (reader.ElementType == typeof(Row))
                        {
                            do
                            {
                                if (reader.HasAttributes)
                                {
                                    var rowNum = int.Parse(reader.Attributes.First(a => a.LocalName == "r").Value);

                                    // Skip the header row
                                    if (rowNum == 1)
                                        continue;

                                    var row = (Row)reader.LoadCurrentElement();
                                    yield return row;
                                }

                            } while (reader.ReadNextSibling()); // Move on to the next row

                            break; // We just looped through all the rows, so no need to keep reading the worksheet
                        }
                    }
                }
            }
        }

        private void ProcessInParallel()
        {
            // Use ConcurrentQueue to enable safe enqueueing from multiple threads. 
            var exceptions = new ConcurrentQueue<Exception>();


            Parallel.ForEach(ReadDocument(), (row, loopState) =>
                {

                    List<Cell> cells = row.Descendants<Cell>().ToList();

                    if (string.IsNullOrEmpty(GetCellValue(cells[0], StringTablePart)))
                        return;

                    // validation code goes here (it populates mobileNumber, newMenuIndicator, policyNumber and status)...


                    try
                    {
                        using (SqlConnection connection = new SqlConnection("user id=sa;password=D3vAdm!n@;server=196.30.181.143;database=TheUnlimitedUSSD;MultipleActiveResultSets=True"))
                        {
                            connection.Open();
                            SqlCommand command = new SqlCommand("SELECT count(*) FROM dbo.[User] WHERE MobileNumber = '" + mobileNumber + "'", connection);
                            var userCount = (int) command.ExecuteScalar();
                            if (userCount > 0)
                            {
                                // update
                                command = new SqlCommand("UPDATE [user] SET NewMenu = " + (newMenuIndicator ? "1" : "0") + ", PolicyNumber = '" + policyNumber + "', Status = '" + status + "' WHERE MobileNumber = '" + mobileNumber + "'", connection);
                                command.ExecuteNonQuery(); // UPDATE returns no result set, so ExecuteNonQuery is the right call
                                Debug.WriteLine("Update cmd");
                            }
                            else
                            {
                                // insert
                                command = new SqlCommand("INSERT INTO dbo.[User] ( MobileNumber , Status , PolicyNumber ,  NewMenu ) VALUES  ( '" + mobileNumber + "' , '" + status + "' ,  '" + policyNumber + "' ,  " + (newMenuIndicator ? "1" : "0") + " )", connection);
                                command.ExecuteNonQuery(); // likewise for INSERT
                                Debug.WriteLine("Insert cmd");
                            }
                        }
                    }
                    catch (Exception ex)
                    {
                        exceptions.Enqueue(ex);
                        Debug.WriteLine(ex.Message);
                        loopState.Break();
                    }
                });

            // Throw the exceptions here after the loop completes. 
            if (exceptions.Count > 0)
                throw new AggregateException(exceptions);

        }

3 Answers
等我变得足够好 · 2019-04-13 04:07

Some tips for faster processing (which I believe is what you need, rather than a code fix):

  • Have Excel check for duplicate rows beforehand; it's a really decent tool for weeding out the obsolete rows. If A and B were duplicates, you'd create A and then update it with B's data. This way, you can weed out A and only create B.
  • Don't process it as an .xls(x) file; convert it to a CSV (if you haven't already).
  • Create some stored procedures on your database. I generally dislike stored procedures when used in projects for simple data retrieval, but they work wonders for automated scripts that need to run efficiently. Just add a create procedure (I assume the update will be unnecessary after you've weeded out the duplicates in tip 1); see the sketch right after this list.
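
For illustration, a minimal sketch of the stored-procedure route, assuming a hypothetical dbo.UpsertUser procedure built around a single MERGE (the procedure name, parameter sizes and connectionString are placeholders, not from the question):

    // Requires: using System.Data; using System.Data.SqlClient;
    //
    // T-SQL to create the hypothetical procedure (run once against the database):
    //
    //   CREATE PROCEDURE dbo.UpsertUser
    //       @MobileNumber VARCHAR(20), @Status VARCHAR(20),
    //       @PolicyNumber VARCHAR(50), @NewMenu BIT
    //   AS
    //   MERGE dbo.[User] AS t
    //   USING (SELECT @MobileNumber AS MobileNumber) AS s
    //       ON t.MobileNumber = s.MobileNumber
    //   WHEN MATCHED THEN
    //       UPDATE SET NewMenu = @NewMenu, PolicyNumber = @PolicyNumber, Status = @Status
    //   WHEN NOT MATCHED THEN
    //       INSERT (MobileNumber, Status, PolicyNumber, NewMenu)
    //       VALUES (@MobileNumber, @Status, @PolicyNumber, @NewMenu);

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand("dbo.UpsertUser", connection))
    {
        command.CommandType = CommandType.StoredProcedure;
        command.Parameters.AddWithValue("@MobileNumber", mobileNumber);
        command.Parameters.AddWithValue("@Status", status);
        command.Parameters.AddWithValue("@PolicyNumber", policyNumber);
        command.Parameters.AddWithValue("@NewMenu", newMenuIndicator);
        connection.Open();
        command.ExecuteNonQuery(); // one round trip per row instead of two
    }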

Some tips I'm not sure will help your specific situation:

  • Use LINQ instead of building command strings by hand; a LINQ provider generates parameterized SQL for you. However, suddenly switching to LINQ is not something you can do in the blink of an eye, so you'll need to weigh the effort against how much you need it. A rough sketch follows.
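
As a rough illustration only, an Entity Framework sketch of the same upsert (UserContext and the User entity are hypothetical mappings to dbo.[User], and this still pays one round trip per row, so it helps safety and maintainability more than raw speed):

    // Requires the EntityFramework package; UserContext/User are assumed, not from the question.
    using (var db = new UserContext())
    {
        // Find the existing user, or start a new one if there is no duplicate
        var user = db.Users.SingleOrDefault(u => u.MobileNumber == mobileNumber);
        if (user == null)
        {
            user = new User { MobileNumber = mobileNumber };
            db.Users.Add(user);
        }
        user.Status = status;
        user.PolicyNumber = policyNumber;
        user.NewMenu = newMenuIndicator;
        db.SaveChanges(); // the provider emits parameterized SQL under the hood
    }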
聊天终结者 · 2019-04-13 04:14

I would suggest that you do a bulk import WITHOUT any validation into an intermediary (staging) table, and only then do all the validation via SQL. Your spreadsheet's data will then be in a similar structure to a SQL table. This is what I have done with industrial-strength imports of 3 million rows + from Excel and CSV, with great success. A rough sketch follows.
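
A sketch of that staging-table approach with SqlBulkCopy, assuming a dbo.UserStaging table with matching columns and that cells 0-3 hold MobileNumber, Status, PolicyNumber and NewMenu (the table name, column order and connectionString are assumptions):

    // Requires: using System.Data; using System.Data.SqlClient;
    var table = new DataTable();
    table.Columns.Add("MobileNumber", typeof(string));
    table.Columns.Add("Status", typeof(string));
    table.Columns.Add("PolicyNumber", typeof(string));
    table.Columns.Add("NewMenu", typeof(bool));

    // Read every row straight into memory; no per-row validation yet
    foreach (var row in ReadDocument())
    {
        var cells = row.Descendants<Cell>().ToList();
        table.Rows.Add(
            GetCellValue(cells[0], StringTablePart),
            GetCellValue(cells[1], StringTablePart),
            GetCellValue(cells[2], StringTablePart),
            GetCellValue(cells[3], StringTablePart) == "1");
    }

    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        // One streamed bulk write instead of ~250k individual round trips
        using (var bulk = new SqlBulkCopy(connection))
        {
            bulk.DestinationTableName = "dbo.UserStaging";
            bulk.BatchSize = 10000;
            bulk.WriteToServer(table);
        }

        // Validation and the insert/update decision then happen server-side,
        // e.g. one set-based MERGE from dbo.UserStaging into dbo.[User].
    }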

混吃等死 · 2019-04-13 04:19

Mostly I'd suggest you check that your parallelism is optimal. Since your bottlenecks are likely to be disk IO on the Excel file and IO to the SQL server, I'd suggest that it may not be. You've parallelised across those two processes (so each of them is reduced to the speed of the slowest); your parallel threads will be fighting over the database and potentially slowing each other down. There's no point having (say) eight threads if your hard disk can't keep up with one; it just creates overhead.

Two things I'd suggest. First, take out all the parallelism and see whether it's actually helping. If you parse the whole file into a single Queue in memory on one thread, then push the whole thing into the database, you might find it's faster.

Then, I'd try splitting it into just two threads: one to read the incoming file into the Queue, and one to take items off the Queue and push them into the database. This way you have one thread per slow resource that you're handling, so you minimise contention, and each thread is blocked by only one resource, so you're handling that resource as optimally as possible. A sketch of this pattern follows.
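
A minimal sketch of that two-thread pipeline using BlockingCollection (SaveRow is a hypothetical method wrapping your existing validation plus insert/update for one row; to test the single-threaded variant from the previous paragraph, just fill the queue completely before starting the consumer):

    // Requires: using System.Collections.Concurrent; using System.Threading.Tasks;
    var queue = new BlockingCollection<Row>(boundedCapacity: 5000);

    // Producer: the only thread touching the Excel file
    var producer = Task.Run(() =>
    {
        try
        {
            foreach (var row in ReadDocument())
                queue.Add(row); // blocks if the consumer falls behind
        }
        finally
        {
            queue.CompleteAdding(); // tell the consumer no more rows are coming
        }
    });

    // Consumer: the only thread talking to SQL Server
    var consumer = Task.Run(() =>
    {
        foreach (var row in queue.GetConsumingEnumerable())
            SaveRow(row); // hypothetical: validate + insert/update one row
    });

    Task.WaitAll(producer, consumer);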

This is the real trick of multithreaded programming. Throwing extra threads at a problem doesn't necessarily improve performance. What you're trying to do is minimise the time that your program is waiting idly for something external (such as disk or network IO) to complete. If one thread only waits on the Excel file, and one thread only waits on the SQL server, and what they do in between is minimal (which, in your case, it is), you'll find your code will run as fast as those external resources will allow it to.

Also, you mention it yourself, but using parameterised SQL isn't just a cool thing to point out: it will improve your performance. At the moment, you're creating a new SqlCommand for every insert, which has overhead. If you switch to a parameterised command, you can keep the same command throughout and just change the parameter values, which will save you some time. I don't think this is possible inside a Parallel.ForEach (I doubt you can reuse a SqlCommand across threads), but it would work fine with either of the approaches above; see the sketch below.
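
A sketch of that reuse on the database thread (the users sequence and the column sizes are assumptions; Prepare requires explicitly sized parameters, which is why they're declared up front):

    // Requires: using System.Data; using System.Data.SqlClient;
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "UPDATE dbo.[User] SET NewMenu = @newMenu, PolicyNumber = @policy, " +
        "Status = @status WHERE MobileNumber = @mobile", connection))
    {
        command.Parameters.Add("@newMenu", SqlDbType.Bit);
        command.Parameters.Add("@policy", SqlDbType.VarChar, 50);
        command.Parameters.Add("@status", SqlDbType.VarChar, 20);
        command.Parameters.Add("@mobile", SqlDbType.VarChar, 20);
        connection.Open();
        command.Prepare(); // the execution plan is built once and reused

        foreach (var user in users) // one SqlCommand for every row
        {
            command.Parameters["@newMenu"].Value = user.NewMenuIndicator;
            command.Parameters["@policy"].Value = user.PolicyNumber;
            command.Parameters["@status"].Value = user.Status;
            command.Parameters["@mobile"].Value = user.MobileNumber;
            command.ExecuteNonQuery();
        }
    }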
