Alternatives for enhanced reading and parsing text

Posted 2019-04-11 23:41

Question:

I need to read from a variety of different text files (some are delimited and some are fixed width). I've considered parsing the files line by line (slow using the StreamReader.ReadLine-type methods) and reading the files using the ODBC text driver (faster), but does anyone have any other (better) suggestions? I'm using .NET/C#.

Answer 1:

I'm not sure you could really do a text-and-Excel file parser, unless by Excel file you mean a comma/pipe/tab delimited file, which is actually just another text file. Reading actual Excel files requires you to use the MS Office libraries.

For delimited text file parsing, you could look into FileHelpers -- open source and they pretty much have it covered. Not sure if it will match your speed requirements though.



Answer 2:

Ignoring the Excel part (which you say isn't important):

I've found LINQ to be fairly useful in parsing txt files (pipe-delimited or csv).

e.g. This reads a pipe-delimited file, skipping the header row, and produces an IEnumerable<string[]> as the result:

var records = from line in File.ReadAllLines(@"c:\blah.txt").Skip(1) // skip the header row
              let parts = line.Split('|')                            // split each line on the pipe
              select parts;



Answer 3:

Answering my own question:

I ended up using the Microsoft.VisualBasic.FileIO.TextFieldParser object, see:

http://msdn.microsoft.com/en-us/library/f68t4563.aspx

(example of implementation here)

This allows me to handle csv files without worrying about whether fields are enclosed in quotes, contain commas, contain escaped quotes, etc.
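
For readers who want to see roughly what that looks like, here is a minimal sketch. The class and method names (CsvSketch, ParseCsv) are made up for illustration and are not from the linked example; the parser members used are the standard TextFieldParser ones. On .NET Framework this assumes a reference to Microsoft.VisualBasic.dll.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvSketch
{
    // Hypothetical helper: reads comma-delimited text from any TextReader
    // and returns one string[] per row.
    public static List<string[]> ParseCsv(TextReader reader)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(reader))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            // Lets a quoted field contain the delimiter itself.
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }
        return rows;
    }
}
```

To read from disk you could call CsvSketch.ParseCsv(new StreamReader(path)). For the fixed width files mentioned in the question, the same class can probably be used by setting TextFieldType to FieldType.FixedWidth and calling SetFieldWidths with the column widths, though I have only sketched the delimited case here.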



Answer 4:

If the files are relatively small you can use the File class. It has these methods which may help you:

  • ReadAllBytes
  • ReadAllLines
  • ReadAllText


Answer 5:

Your question is a little vague. I assume that the text files contain structured data, not just random lines of text.

If you are parsing the files yourself then .NET has a library function to read all the lines from a text file into an array of strings (File.ReadAllLines). If you know your files are small enough to hold in memory, then you can use this method and iterate over the array using a regular expression to validate & extract the fields.
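
As a rough sketch of that approach: the record layout below (id|name|amount) is purely an assumption for illustration, as are the names RegexLineSketch and ExtractFields. Lines that do not match the pattern are simply skipped.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexLineSketch
{
    // Assumed layout, for illustration only: id|name|amount, e.g. "42|Widget|19.99".
    private static readonly Regex LinePattern = new Regex(
        @"^(?<id>\d+)\|(?<name>[^|]+)\|(?<amount>\d+(?:\.\d+)?)$",
        RegexOptions.Compiled);

    // Validates each line against the pattern and extracts the three fields.
    public static List<string[]> ExtractFields(IEnumerable<string> lines)
    {
        var records = new List<string[]>();
        foreach (string line in lines)
        {
            Match m = LinePattern.Match(line);
            if (!m.Success)
            {
                continue; // line failed validation; skip it
            }
            records.Add(new[]
            {
                m.Groups["id"].Value,
                m.Groups["name"].Value,
                m.Groups["amount"].Value
            });
        }
        return records;
    }
}
```

With a real file you would pass in the array from File.ReadAllLines, e.g. RegexLineSketch.ExtractFields(File.ReadAllLines(path)).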

Excel files are a different ball game. .XLS files are binary, not text, so you would need to use a 3rd party library to access them. .XLSX files (Excel 2007 and later) are ZIP archives containing XML data, so you would need to unzip the archive and then use an XML parser to get at the data. I would not recommend writing your own XML parser, unless you feel the need for the intellectual exercise.



Answer 6:

I agree with John,

For example:-

using System.IO;

...

public class Program {
  public static void Main() {
    foreach (string s in File.ReadAllLines(@"c:\foo\bar\something.txt")) {
      // Do something with each line...
    }
  }
}


Answer 7:

Reading a file is not slow if you read the whole file at once using the File class and the methods suggested by John. Depending on the file's size and what you want to do with it, this may use more or less memory. I'd suggest you try File.ReadAllText (or whichever method is appropriate for you).



Answer 8:

Regarding reading XLS Files:

If you have Microsoft Office XP or later, you have access to the .NET Office libraries that ship with it, which let you "natively" read XLS files, Word, PPT, etc. Please note that with Office XP you have to select them manually during installation (unless you had .NET installed beforehand).

I don't know if these libraries are available as a separate package if you don't have Microsoft Office.

For some obscure reason, all these libraries (including the latest versions from Office 2007, a.k.a. Office 12) are COM components that are a pain to use, cause ugly dependencies and are not backwards compatible. For example, if you have some methods that work with Office 11 and you install them at a customer site that has Office 12, they don't work, because some interfaces were changed. So you need to maintain two sets of "libraries" and methods to deal with that. The same holds true if you use the Office 12 libraries to program and your customer has Office 11. Your libraries don't work. :S

I don't know why Microsoft never created a Microsoft.Office.XXXX managed library (wrapper) around those ugly things.

Anyway, your question is quite strange, but try to follow some of the advice here. Good luck!



Answer 9:

The ODBC text driver is now rather out of date - it has no Unicode support.

Amazingly MS Excel still uses it, so if you open a Unicode CSV in Excel 2007 (rather than import it) you lose all non-ASCII chars.

Your best bet is to use .NET's file reading methods, as others have suggested.