Having trouble reading excel file with the OpenXML

2019-02-28 14:34发布

问题:

I have a function that reads from an excel file and stores the results in a DataSet. I have another function that writes to an excel file. When I try to read from a regular human generated excel file, the excel reading function returns a blank DataSet, but when I read from the excel file generated by the writing function, it works perfectly fine. The function then will not work on a regular generated excel file, even when I just copy and paste the contents of the function generated excel file. I finally tracked it down to this, but I have no idea where to go from here. Is there something wrong with my code? Any help is greatly appreciated. Thanks in advance!

Here is the excel generating function.

public static Boolean writeToExcel(string fileName, DataSet data)
    {
        Boolean answer = false;
        using (SpreadsheetDocument excelDoc = SpreadsheetDocument.Create(tempPath + fileName, SpreadsheetDocumentType.Workbook))

        {

            WorkbookPart workbookPart = excelDoc.AddWorkbookPart();

            workbookPart.Workbook = new Workbook();

            WorksheetPart worksheetPart = workbookPart.AddNewPart<WorksheetPart>();

            Sheets sheets = excelDoc.WorkbookPart.Workbook.AppendChild<Sheets>(new Sheets());
            Sheet sheet = new Sheet()
            {
                Id = excelDoc.WorkbookPart.GetIdOfPart(worksheetPart),
                SheetId = 1,
                Name = "Page1"
            };

            sheets.Append(sheet);

            CreateWorkSheet(worksheetPart, data);
            answer = true;
        }
        return answer;

    }
private static void CreateWorkSheet(WorksheetPart worksheetPart, DataSet data)
    {
        Worksheet worksheet = new Worksheet();

        SheetData sheetData = new SheetData();

        UInt32Value currRowIndex = 1U;
        int colIndex = 0;
        Row excelRow;
        DataTable table = data.Tables[0];

        for (int rowIndex = -1; rowIndex < table.Rows.Count; rowIndex++)
        {
            excelRow = new Row();
            excelRow.RowIndex = currRowIndex++;
            for (colIndex = 0; colIndex < table.Columns.Count; colIndex++)
            {
                Cell cell = new Cell()
                {
                    CellReference = Convert.ToString(Convert.ToChar(65 + colIndex)),
                    DataType = CellValues.String
                };

                CellValue cellValue = new CellValue();
                if (rowIndex == -1)
                {
                    cellValue.Text = table.Columns[colIndex].ColumnName.ToString();
                }
                else
                {
                    cellValue.Text = (table.Rows[rowIndex].ItemArray[colIndex].ToString() != "") ? table.Rows[rowIndex].ItemArray[colIndex].ToString() : "*";
                }

                cell.Append(cellValue);
                excelRow.Append(cell);
            }
            sheetData.Append(excelRow);
        }
        SheetFormatProperties formattingProps = new SheetFormatProperties()
        {
            DefaultColumnWidth = 20D,
            DefaultRowHeight = 20D
        };

        worksheet.Append(formattingProps);
        worksheet.Append(sheetData);
        worksheetPart.Worksheet = worksheet;

    }

while the reading function is as following

public static void readInventoryExcel(string fileName, ref DataSet set)
    {
        using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(fileName, false))
        {
            WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
            WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
            SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
            int count = -1;
            foreach (Row r in sheetData.Elements<Row>())
            {

                if (count >= 0)
                {
                    DataRow row = set.Tables[0].NewRow();
                    row["SerialNumber"] = r.ChildElements[1].InnerXml;
                    row["PartNumber"] = r.ChildElements[2].InnerXml;
                    row["EntryDate"] = r.ChildElements[3].InnerXml;
                    row["RetirementDate"] = r.ChildElements[4].InnerXml;
                    row["ReasonForReplacement"] = r.ChildElements[5].InnerXml;
                    row["RetirementTech"] = r.ChildElements[6].InnerXml;
                    row["IncludeInMaintenance"] = r.ChildElements[7].InnerXml;
                    row["MaintenanceTech"] = r.ChildElements[8].InnerXml;
                    row["Comment"] = r.ChildElements[9].InnerXml;
                    row["Station"] = r.ChildElements[10].InnerXml;
                    row["LocationStatus"] = r.ChildElements[11].InnerXml;
                    row["AssetName"] = r.ChildElements[12].InnerXml;
                    row["InventoryType"] = r.ChildElements[13].InnerXml;
                    row["Description"] = r.ChildElements[14].InnerXml;
                    set.Tables[0].Rows.Add(row);
                }
                count++;
            }
        }

回答1:

I think this is caused by the fact that you have only one sheet whereas Excel has three. I'm not certain but I think the sheets are returned in reverse order so you should change the line:

WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();

to

WorksheetPart worksheetPart = workbookPart.WorksheetParts.Last();

It might be safer to search for the WorksheetPart if you can identify it by the sheet name. You need to find the Sheet first then use the Id of that to find the SheetPart:

private WorksheetPart GetWorksheetPartBySheetName(WorkbookPart workbookPart, string sheetName)
{
    //find the sheet first.
    IEnumerable<Sheet> sheets = workbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>().Where(s => s.Name == sheetName);

    if (sheets.Count() > 0)
    {
        string relationshipId = sheets.First().Id.Value;
        WorksheetPart worksheetPart = (WorksheetPart)workbookPart.GetPartById(relationshipId);
        return worksheetPart;
    }

    return null;
}

You can then use:

WorksheetPart worksheetPart = GetWorksheetPartBySheetName(workbookPart, "Sheet1");

There are a couple of other things I've noticed whilst looking at your code which you may (or may not!) be interested in:

In your code you are only reading the InnerXml so it might not matter to you but the way Excel stores strings is different to the way you are writing them so reading an Excel generated file may not give you the values you expect. In your example you are writing the string directly to the cell like this:

But Excel uses a SharedStrings concept where all strings are written to a separate XML file called sharedStrings.xml. That file contains the strings used in the Excel file with a reference and it's that value that is stored in the cell value in the sheet XML.

The sharedString.xml looks like this:

And the Cell then looks like this:

The 47 in the <v> element is a reference to the 47th shared string. Note that the type (the t attribute) in your generated XML is str but the type in the Excel generated file is s. This denotes yours is an inline string and theirs is a shared string.

You can read the SharedStrings just as you would any other part:

var stringTable = workbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();

if (stringTable != null)
{
    sharedString = stringTable.SharedStringTable.ElementAt(int.Parse(value)).InnerText;
}

Secondly, if you look at the cell reference that your code generates and the cell reference that Excel generates you can see you are only outputting the column and not the row (e.g. you output A instead of A1). To fix this you should change the line:

CellReference = Convert.ToString(Convert.ToChar(65 + colIndex)),

to

CellReference = Convert.ToString(Convert.ToChar(65 + colIndex) + rowIndex.ToString()),

I hope that helps.



回答2:

I ran into a similar issue a while back trying to do this for Word documents (procedurally generated worked fine, but human-generated did not). I found this tool to be very helpful:

http://www.microsoft.com/en-us/download/details.aspx?id=30425

Basically, it looks at a file and shows you the code that Microsoft would generate to read it, as well as the xml structure of the file itself. As usual for Microsoft products, there are quite a few menus and it's not very intuitive, but after clicking around for a bit you will be able to see exactly what is going on with any two files. I would recommend you open a working excel file and a non-working one and compare the difference to see what's causing your issue.



回答3:

Below is the OpenXML code that I use to read in a particular Worksheet from an Excel file, into a DataTable.

First, here's how you'd call it:

DataTable dt = OpenXMLHelper.ExcelWorksheetToDataTable("C:\\SQL Server\\SomeExcelFile.xlsx", "Mikes Worksheet");

And here's the code:

    public class OpenXMLHelper
    {
        public static DataTable ExcelWorksheetToDataTable(string pathFilename, string worksheetName)
        {
            DataTable dt = new DataTable(worksheetName);

            using (SpreadsheetDocument document = SpreadsheetDocument.Open(pathFilename, false))
            {
                // Find the sheet with the supplied name, and then use that 
                // Sheet object to retrieve a reference to the first worksheet.
                Sheet theSheet = document.WorkbookPart.Workbook.Descendants<Sheet>().Where(s => s.Name == worksheetName).FirstOrDefault();
                if (theSheet == null)
                    throw new Exception("Couldn't find the worksheet: " + worksheetName);

                // Retrieve a reference to the worksheet part.
                WorksheetPart wsPart = (WorksheetPart)(document.WorkbookPart.GetPartById(theSheet.Id));
                Worksheet workSheet = wsPart.Worksheet;

                string dimensions = workSheet.SheetDimension.Reference.InnerText;       //  Get the dimensions of this worksheet, eg "B2:F4"

                int numOfColumns = 0;
                int numOfRows = 0;
                CalculateDataTableSize(dimensions, ref numOfColumns, ref numOfRows);
                System.Diagnostics.Trace.WriteLine(string.Format("The worksheet \"{0}\" has dimensions \"{1}\", so we need a DataTable of size {2}x{3}.", worksheetName, dimensions, numOfColumns, numOfRows));

                SheetData sheetData = workSheet.GetFirstChild<SheetData>();
                IEnumerable<Row> rows = sheetData.Descendants<Row>();

                string[,] cellValues = new string[numOfColumns, numOfRows];

                int colInx = 0;
                int rowInx = 0;
                string value = "";
                SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;

                //  Iterate through each row of OpenXML data
                foreach (Row row in rows)
                {
                    for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
                    {
                        //  *DON'T* assume there's going to be one XML element for each item in each row...
                        Cell cell = row.Descendants<Cell>().ElementAt(i);
                        if (cell.CellValue == null || cell.CellReference == null)
                            continue;                       //  eg when an Excel cell contains a blank string

                        //  Convert this Excel cell's CellAddress into a 0-based offset into our array (eg "G13" -> [6, 12])
                        colInx = GetColumnIndexByName(cell.CellReference);             //  eg "C" -> 2  (0-based)
                        rowInx = GetRowIndexFromCellAddress(cell.CellReference)-1;     //  Needs to be 0-based  

                        //  Fetch the value in this cell
                        value = cell.CellValue.InnerXml;
                        if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
                        {
                            value = stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
                        }

                        cellValues[colInx, rowInx] = value;
                    }
                    dt.Rows.Add(dataRow);
                }

                //  Copy the array of strings into a DataTable
                for (int col = 0; col < numOfColumns; col++)
                    dt.Columns.Add("Column_" + col.ToString());

                for (int row = 0; row < numOfRows; row++)
                {
                    DataRow dataRow = dt.NewRow();
                    for (int col = 0; col < numOfColumns; col++)
                    {
                        dataRow.SetField(col, cellValues[col, row]);
                    }
                    dt.Rows.Add(dataRow);
                }

#if DEBUG
                //  Write out the contents of our DataTable to the Output window (for debugging)
                string str = "";
                for (rowInx = 0; rowInx < maxNumOfRows; rowInx++)
                {
                    for (colInx = 0; colInx < maxNumOfColumns; colInx++)
                    {
                        object val = dt.Rows[rowInx].ItemArray[colInx];
                        str += (val == null) ? "" : val.ToString();
                        str += "\t";
                    }
                    str += "\n";
                }
                System.Diagnostics.Trace.WriteLine(str);
#endif
                return dt;
            }
        }

        private static void CalculateDataTableSize(string dimensions, ref int numOfColumns, ref int numOfRows)
        {
            //  How many columns & rows of data does this Worksheet contain ?  
            //  We'll read in the Dimensions string from the Excel file, and calculate the size based on that.
            //      eg "B1:F4" -> we'll need 6 columns and 4 rows.
            //
            //  (We deliberately ignore the top-left cell address, and just use the bottom-right cell address.)
            try
            {
                string[] parts = dimensions.Split(':');     // eg "B1:F4" 
                if (parts.Length != 2)
                    throw new Exception("Couldn't find exactly *two* CellAddresses in the dimension");

                numOfColumns = 1 + GetColumnIndexByName(parts[1]);     //  A=1, B=2, C=3  (1-based value), so F4 would return 6 columns
                numOfRows = GetRowIndexFromCellAddress(parts[1]);
            }
            catch
            {
                throw new Exception("Could not calculate maximum DataTable size from the worksheet dimension: " + dimensions);
            }
        }

        public static int GetRowIndexFromCellAddress(string cellAddress)
        {
            //  Convert an Excel CellReference column into a 1-based row index
            //  eg "D42"  ->  42
            //     "F123" ->  123
            string rowNumber = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^0-9 _]", "");
            return int.Parse(rowNumber);
        }

        public static int GetColumnIndexByName(string cellAddress)
        {
            //  Convert an Excel CellReference column into a 0-based column index
            //  eg "D42" ->  3
            //     "F123" -> 5
            var columnName = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^A-Z_]", "");
            int number = 0, pow = 1;
            for (int i = columnName.Length - 1; i >= 0; i--)
            {
                number += (columnName[i] - 'A' + 1) * pow;
                pow *= 26;
            }
            return number - 1;
        }
    }

Just to mention, some of our company's Excel Worksheets have one or more blank rows at the top. Strangely, this prevented some other OpenXML libraries from reading in such Worksheets properly.

This code deliberately creates a DataTable with one value for each of the cells in the Worksheet, even the blank ones at the top.