C# Excel Interop Slow when looping through cells

Posted 2019-03-20 11:02

Question:

I am trying to extract all text data from an Excel document in C# and am having performance issues. In the following code I open the Workbook, loop over all worksheets, and loop over all cells in the used range, extracting the text from each cell as I go. The problem is, this takes 14 seconds to execute.

public class ExcelFile
{
    public string Path = @"C:\test.xlsx";
    private Excel.Application xl = new Excel.Application();
    private Excel.Workbook WB;
    public string FullText;
    private Excel.Range rng;
    private Dictionary<string, string> Variables;
    public ExcelFile()
    {
        WB = xl.Workbooks.Open(Path);
        xl.Visible = true;
        foreach (Excel.Worksheet CurrentWS in WB.Worksheets)
        {
            rng = CurrentWS.UsedRange;
            for (int i = 1; i < rng.Count; i++)
            { FullText += rng.Cells[i].Value; }
        }
        WB.Close(false);
        xl.Quit();
    }
}

Whereas in VBA I would do something like this, which takes ~1 second:

Sub run()
    Dim strText As String
    For Each ws In ActiveWorkbook.Sheets
        For Each c In ws.UsedRange
            strText = strText & c.Text
        Next c
    Next ws
End Sub

Or, even faster (less than 1 second):

Sub RunFast()
    Dim strText As String
    Dim varCells As Variant
    For Each ws In ActiveWorkbook.Sheets
        varCells = ws.UsedRange
        For i = 1 To UBound(varCells, 1)
            For j = 1 To UBound(varCells, 2)
                strText = strText & CStr(varCells(i, j))
            Next j
        Next i
    Next ws
End Sub

Perhaps something is happening in the for loop in C# that I'm not aware of? Is it possible to load a range into an array-type object (as in my last example) to allow iteration over just the values, not the cell objects?

Answer 1:

Excel and C# run in completely different environments. C# runs on the .NET Framework using managed memory, while Excel is a native C++ application that runs in unmanaged memory. Translating data between the two (a process called "marshaling") is extremely expensive in terms of performance.

Tweaking your code isn't going to help. For loops, string construction, etc. are all blazingly fast compared to the marshaling process. The only way you are going to get significantly better performance is to reduce the number of trips that have to cross the interprocess boundary. Extracting data cell by cell is never going to get you the performance you want.

Here are a few options:

  1. Write a Sub or Function in VBA that does everything you want, then call it via interop.

  2. Use interop to save the worksheet to a temporary file in CSV format, then open the file using C#. You will need to loop through and parse the file to get it into a useful data structure, but this loop will go much faster.

  3. Use interop to save a range of cells to the clipboard, then use C# to read the clipboard directly.
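Option 2 could be sketched roughly as follows. The export itself would be a single interop call (for example `Worksheet.SaveAs` with `Excel.XlFileFormat.xlCSV`), after which everything runs in managed code. The `CsvSketch` class and its `Parse` method are hypothetical names made up for this illustration, and the sketch assumes a comma as the separator, which may not match your regional settings.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvSketch
{
    // Splits CSV text into rows of fields. Handles quoted fields,
    // doubled quotes ("") inside them, and line breaks inside quotes.
    // Assumption: the field separator is a comma.
    public static List<string[]> Parse(string text)
    {
        var rows = new List<string[]>();
        var fields = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < text.Length && text[i + 1] == '"')
                {
                    field.Append('"'); // doubled quote is an escaped quote
                    i++;
                }
                else if (c == '"') inQuotes = false;
                else field.Append(c);
            }
            else if (c == '"') inQuotes = true;
            else if (c == ',')
            {
                fields.Add(field.ToString());
                field.Clear();
            }
            else if (c == '\n')
            {
                // End of a record: close the current field and the row.
                fields.Add(field.ToString());
                field.Clear();
                rows.Add(fields.ToArray());
                fields.Clear();
            }
            else if (c != '\r') field.Append(c);
        }

        // Flush a final row that has no trailing newline.
        if (field.Length > 0 || fields.Count > 0)
        {
            fields.Add(field.ToString());
            rows.Add(fields.ToArray());
        }
        return rows;
    }
}
```

You would feed it the exported file, e.g. `CsvSketch.Parse(File.ReadAllText(tempPath))`, where `tempPath` is whatever temporary path you saved to. Note that CSV holds the displayed text of the cells, so formatting such as dates and rounded numbers comes through as shown in Excel.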



Answer 2:

I use this function. The loops only convert the result to an array that starts at index 0; the main work is done in a single call, object[,] tmp = range.Value, which pulls the whole range across in one trip.

public object[,] GetTable(int row, int col, int width, int height)
{
    object[,] arr = new object[height, width];

    Range c1 = (Range)Worksheet.Cells[row + 1, col + 1];
    Range c2 = (Range)Worksheet.Cells[row + height, col + width];
    Range range = Worksheet.get_Range(c1, c2);

    object[,] tmp = range.Value;

    for (int i = 0; i < height; ++i)
    {
        for (int j = 0; j < width; ++j)
        {
            arr[i, j] = tmp[i + tmp.GetLowerBound(0), j + tmp.GetLowerBound(1)];
        }
    }

    return arr;
}
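One caveat with GetTable as written: for a range that is a single cell, Range.Value returns the bare cell value rather than a 2-D array, so the assignment to object[,] would fail at run time. A minimal sketch of a helper that tolerates both shapes is below. `RangeValueSketch` and `ToZeroBased` are hypothetical names for this illustration, not part of the interop API, and the sketch deliberately takes a plain `object` so it can be tried without Excel.

```csharp
using System;

public static class RangeValueSketch
{
    // Converts whatever Range.Value handed back into a zero-based object[,].
    public static object[,] ToZeroBased(object value)
    {
        var source = value as object[,];
        if (source == null)
        {
            // Single cell (or an empty cell, i.e. null): wrap it in a 1x1 table.
            return new object[,] { { value } };
        }

        // Arrays coming from Excel start at index 1, so read the real bounds
        // instead of assuming them.
        int rowBase = source.GetLowerBound(0);
        int colBase = source.GetLowerBound(1);
        int rows = source.GetLength(0);
        int cols = source.GetLength(1);

        var result = new object[rows, cols];
        for (int i = 0; i < rows; i++)
        {
            for (int j = 0; j < cols; j++)
            {
                result[i, j] = source[i + rowBase, j + colBase];
            }
        }
        return result;
    }
}
```

In GetTable this would replace the direct assignment, along the lines of `object[,] arr = RangeValueSketch.ToZeroBased(range.Value);`.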


Answer 3:

One thing which will speed it up is to use a StringBuilder instead of += on the previous string. Strings are immutable in C# and therefore you are creating a ton of extra strings during your process of creating the final string.

Additionally, you may improve performance by looping over the row and column positions instead of looping over a single cell index.

Here is the code changed with a StringBuilder and row, column positional looping:

public class ExcelFile
{
    public string Path = @"C:\test.xlsx";
    private Excel.Application xl = new Excel.Application();
    private Excel.Workbook WB;
    public string FullText;
    private Excel.Range rng;
    private Dictionary<string, string> Variables;
    public ExcelFile()
    {
        StringBuilder sb = new StringBuilder();
        WB = xl.Workbooks.Open(Path);
        xl.Visible = true;

        foreach (Excel.Worksheet CurrentWS in WB.Worksheets)
        {
            rng = CurrentWS.UsedRange;
            for (int i = 1; i <= rng.Rows.Count; i++)
            {
                for (int j = 1; j <=  rng.Columns.Count; j++)
                {
                    sb.Append(rng.Cells[i, j].Value); // C# is case-sensitive: Append, not append
                }
            }
        }
        FullText = sb.ToString();
        WB.Close(false);
        xl.Quit();
    }
}


Answer 4:

I sympathize with you, pwwolff. Looping through Excel cells can be expensive. Antonio and Max are both correct, but John Wu's answer sums it up nicely. Using a StringBuilder may speed things up, and building an object array from the used range is, IMHO, about as fast as you are going to get using interop. I understand there are third-party libraries that may perform better. With interop, looping through each cell will take an unacceptable amount of time if the file is large.

In the tests below I used a workbook with a single sheet whose used range has 11 columns and 100 rows of data. Using an object array implementation this took a little over a second. With 735 rows it took around 40 seconds.

I put three buttons on a form with a multi-line text box. The first button uses your posted code. The second button takes the ranges out of the loops. The third button uses an object array approach. Each one is significantly faster than the one before it. I used a text box on the form to output the data; you can use a string as you do now, but a StringBuilder would be better if you must have one big string.

Again, if the files are large you may want to consider another implementation. Hope this helps.

private void button1_Click(object sender, EventArgs e) {
  Stopwatch sw = new Stopwatch();
  MessageBox.Show("Start DoExcel...");
  sw.Start();
  DoExcel();
  sw.Stop();
  MessageBox.Show("End DoExcel...Took: " + sw.Elapsed.Seconds + " seconds and " + sw.Elapsed.Milliseconds + " Milliseconds");
}

private void button2_Click(object sender, EventArgs e) {
  MessageBox.Show("Start DoExcel2...");
  Stopwatch sw = new Stopwatch();
  sw.Start();
  DoExcel2();
  sw.Stop();
  MessageBox.Show("End DoExcel2...Took: " + sw.Elapsed.Seconds + " seconds and " + sw.Elapsed.Milliseconds + " Milliseconds");
}

private void button3_Click(object sender, EventArgs e) {
  MessageBox.Show("Start DoExcel3...");
  Stopwatch sw = new Stopwatch();
  sw.Start();
  DoExcel3();
  sw.Stop();
  MessageBox.Show("End DoExcel3...Took: " + sw.Elapsed.Seconds + " seconds and " + sw.Elapsed.Milliseconds + " Milliseconds");
}

// object[,] array implementation
private void DoExcel3() {
  textBox1.Text = "";
  string Path = @"D:\Test\Book1 - Copy.xls";
  Excel.Application xl = new Excel.Application();
  Excel.Workbook WB;
  Excel.Range rng;

  WB = xl.Workbooks.Open(Path);
  xl.Visible = true;
  int totalRows = 0;
  int totalCols = 0;
  foreach (Excel.Worksheet CurrentWS in WB.Worksheets) {
    rng = CurrentWS.UsedRange;
    totalCols = rng.Columns.Count;
    totalRows = rng.Rows.Count;
    object[,] objectArray = (object[,])rng.Cells.Value;
    // The array returned by Excel is 1-based, so the bounds are inclusive.
    for (int row = 1; row <= totalRows; row++) {
      for (int col = 1; col <= totalCols; col++) {
        if (objectArray[row, col] != null)
          textBox1.Text += objectArray[row,col].ToString();
      }
      textBox1.Text += Environment.NewLine;
    }
  }
  WB.Close(false);
  xl.Quit();
  Marshal.ReleaseComObject(WB);
  Marshal.ReleaseComObject(xl);
}

// Range taken out of loops
private void DoExcel2() {
  textBox1.Text = "";
  string Path = @"D:\Test\Book1 - Copy.xls";
  Excel.Application xl = new Excel.Application();
  Excel.Workbook WB;
  Excel.Range rng;

  WB = xl.Workbooks.Open(Path);
  xl.Visible = true;
  int totalRows = 0;
  int totalCols = 0;
  foreach (Excel.Worksheet CurrentWS in WB.Worksheets) {
    rng = CurrentWS.UsedRange;
    totalCols = rng.Columns.Count;
    totalRows = rng.Rows.Count;
    // Rows and Cells are 1-based, so the bounds are inclusive.
    for (int row = 1; row <= totalRows; row++) {
      for (int col = 1; col <= totalCols; col++) {
        textBox1.Text += rng.Rows[row].Cells[col].Value;
      }
      textBox1.Text += Environment.NewLine;
    }
  }
  WB.Close(false);
  xl.Quit();
  Marshal.ReleaseComObject(WB);
  Marshal.ReleaseComObject(xl);
}

// original posted code
private void DoExcel() {
  textBox1.Text = "";
  string Path = @"D:\Test\Book1 - Copy.xls";
  Excel.Application xl = new Excel.Application();
  Excel.Workbook WB;
  Excel.Range rng;

  WB = xl.Workbooks.Open(Path);
  xl.Visible = true;
  foreach (Excel.Worksheet CurrentWS in WB.Worksheets) {
    rng = CurrentWS.UsedRange;
    for (int i = 1; i < rng.Count; i++) {
      textBox1.Text += rng.Cells[i].Value;
    }
  }
  WB.Close(false);
  xl.Quit();
  Marshal.ReleaseComObject(WB);
  Marshal.ReleaseComObject(xl);
}
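
If you must end up with one big string, the inner loops of DoExcel3 could append to a StringBuilder instead of to textBox1.Text, since every += on the text box builds a new string. Here is a minimal sketch of just that part, written against a plain object[,] so it can be tried without Excel. `FlattenSketch` and `Flatten` are hypothetical names for this illustration, and I have not timed this variant.

```csharp
using System;
using System.Text;

public static class FlattenSketch
{
    // Joins every non-null value of a 2-D array into one string,
    // one line per row, honoring whatever lower bounds the array has
    // (arrays read from Excel start at 1, ordinary C# arrays at 0).
    public static string Flatten(object[,] values)
    {
        var sb = new StringBuilder();
        for (int row = values.GetLowerBound(0); row <= values.GetUpperBound(0); row++)
        {
            for (int col = values.GetLowerBound(1); col <= values.GetUpperBound(1); col++)
            {
                object cell = values[row, col];
                if (cell != null)
                {
                    sb.Append(cell); // Append(object) calls ToString() on the value
                }
            }
            sb.AppendLine(); // same row separator as Environment.NewLine
        }
        return sb.ToString();
    }
}
```

In DoExcel3 that would look something like `textBox1.Text += FlattenSketch.Flatten(objectArray);` once per worksheet, so the text box is touched once per sheet rather than once per cell.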