Alternatives for enhanced reading and parsing text

Posted 2019-04-12 00:03

I need to read from a variety of different text files (some are delimited, some are fixed-width). I've considered parsing the files line by line (slow using the StreamReader.ReadLine type methods) and reading the files through the ODBC text driver (faster), but does anyone have any other (better) suggestions? I'm using .NET/C#.

9 Answers
甜甜的少女心
#2 · 2019-04-12 00:27

Your question is a little vague. I assume that the text files contain structured data, not just random lines of text.

If you are parsing the files yourself then .NET has a library function to read all the lines from a text file into an array of strings (File.ReadAllLines). If you know your files are small enough to hold in memory, then you can use this method and iterate over the array using a regular expression to validate & extract the fields.
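
As a rough sketch of that approach, assuming a made-up three-field layout (id,name,amount) that is not from the question, the regular expression can both validate each line and pull out the fields through named groups. The class name, the pattern and the file path argument are all illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class RegexLineParser
{
    // Hypothetical layout "id,name,amount", e.g. "42,Widget,19.99".
    private static readonly Regex LinePattern = new Regex(
        @"^(?<id>\d+),(?<name>[^,]+),(?<amount>\d+(\.\d+)?)$",
        RegexOptions.Compiled);

    // Returns the extracted fields of every line that matches the pattern.
    public static List<string[]> Parse(IEnumerable<string> lines)
    {
        var rows = new List<string[]>();
        foreach (string line in lines)
        {
            Match m = LinePattern.Match(line);
            if (!m.Success)
                continue; // line failed validation; it is skipped here, but could be logged

            rows.Add(new[]
            {
                m.Groups["id"].Value,
                m.Groups["name"].Value,
                m.Groups["amount"].Value
            });
        }
        return rows;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: RegexLineParser <file>");
            return;
        }

        // ReadAllLines loads the whole file, so this suits small files only.
        foreach (string[] row in Parse(File.ReadAllLines(args[0])))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

A regex like this does not cope with quoted fields that contain commas, so it only fits files whose format you control.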

Excel files are a different ball game. .XLS files are binary, not text, so you would need to use a 3rd party library to access them. .XLSX files from Excel 2007 are ZIP archives containing XML data, so once again you would need to unzip the archive and then use an XML parser to get at the data. I would not recommend writing your own XML parser, unless you feel the need for the intellectual exercise.
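
To illustrate the .XLSX point, here is a minimal sketch using the framework's own ZipArchive and XDocument classes. It assumes the first worksheet is stored under the conventional part name xl/worksheets/sheet1.xml, which is common but not guaranteed, and it only prints the raw cell values. The class and method names are made up for the example:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public static class XlsxPeek
{
    // XML namespace used by SpreadsheetML worksheet parts.
    private static readonly XNamespace SheetNs =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Returns "cellRef = rawValue" for each cell of the assumed first worksheet.
    public static List<string> ReadRawCells(Stream xlsx)
    {
        var cells = new List<string>();
        // Third argument (leaveOpen = true): the caller keeps ownership of the stream.
        using (var zip = new ZipArchive(xlsx, ZipArchiveMode.Read, true))
        {
            // Assumption: conventional part name; a robust reader would resolve it via the workbook part.
            ZipArchiveEntry entry = zip.GetEntry("xl/worksheets/sheet1.xml");
            if (entry == null)
                return cells;

            using (Stream part = entry.Open())
            {
                XDocument doc = XDocument.Load(part);
                foreach (XElement c in doc.Descendants(SheetNs + "c"))
                {
                    string reference = (string)c.Attribute("r");
                    // For cells marked t="s" this value is an index into xl/sharedStrings.xml,
                    // not the text itself; this sketch does not resolve it.
                    string raw = (string)c.Element(SheetNs + "v");
                    cells.Add(reference + " = " + raw);
                }
            }
        }
        return cells;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: XlsxPeek <file.xlsx>");
            return;
        }

        using (FileStream fs = File.OpenRead(args[0]))
        {
            foreach (string cell in ReadRawCells(fs))
                Console.WriteLine(cell);
        }
    }
}
```

For anything beyond a quick look (shared strings, dates, styles, multiple sheets) a dedicated library is probably the saner route.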

▲ chillily
#3 · 2019-04-12 00:29

I agree with John. For example:

using System.IO;

...

public class Program {
  public static void Main() {
    foreach (string s in File.ReadAllLines(@"c:\foo\bar\something.txt")) {
      // Do something with each line...
    }
  }
}
我想做一个坏孩纸
#4 · 2019-04-12 00:30

Regarding reading XLS Files:

If you have Microsoft Office XP or above, you have access to the already included .NET SDK Office Libraries, with which you can "natively" read XLS files, Word, PPT, etc. Please note that under Office XP you have to manually select them during installation (unless you had .NET installed beforehand).

I don't know if these libraries are available as a separate package if you don't have Microsoft Office.

For some obscure reason, all these libraries (including the latest versions from Office 2007, a.k.a. Office 12) are COM components that are a pain to use, cause ugly dependencies and are not backwards compatible. For example, if you have some methods that work with Office 2003 (Office 11) and you install them at a customer site with Office 12, they don't work, because some interfaces were changed. So you need to maintain two sets of "libraries" and methods to deal with that. The same holds true if you program against the Office 12 libraries and your customer has Office 11: your libraries don't work. :S

I don't know why Microsoft never created a Microsoft.Office.XXXX managed library (wrapper) around those ugly things.

Anyway, your question is quite strange; try to follow some of the advice here. Good luck!

做自己的国王
#5 · 2019-04-12 00:31

The ODBC text driver is now rather out of date - it has no Unicode support.

Amazingly MS Excel still uses it, so if you open a Unicode CSV in Excel 2007 (rather than import it) you lose all non-ASCII chars.

Your best bet is to use .NET's file reading methods, as others have suggested.
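
A small sketch of what that might look like when Unicode matters: StreamReader takes an explicit encoding and can also honor a byte-order mark if the file has one. UTF-8 as the fallback is an assumption here, as are the class and method names; substitute whatever encoding your files really use.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class UnicodeLineReader
{
    // Streams the file line by line, so the whole file is never held in memory.
    public static IEnumerable<string> ReadLines(string path)
    {
        // UTF-8 is the assumed fallback; the third argument (true) lets a
        // byte-order mark at the start of the file override it.
        using (var reader = new StreamReader(path, Encoding.UTF8, true))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: UnicodeLineReader <file>");
            return;
        }

        foreach (string line in ReadLines(args[0]))
            Console.WriteLine(line);
    }
}
```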

我欲成王,谁敢阻挡
#6 · 2019-04-12 00:33

Answering my own question:

I ended up using the Microsoft.VisualBasic.FileIO.TextFieldParser object, see:

http://msdn.microsoft.com/en-us/library/f68t4563.aspx

(example of implementation here)

This lets me handle CSV files without worrying about whether fields are enclosed in quotes or contain commas, escaped quotes, etc.
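
For anyone who wants a starting point, here is a minimal sketch of how TextFieldParser can be used for both delimited and fixed-width input. It is not the implementation linked above; the class name, method names and column widths are made up for illustration. On .NET Framework the project needs a reference to Microsoft.VisualBasic.dll.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class FieldParserDemo
{
    // Parses comma-delimited text; quoted fields may contain commas.
    public static List<string[]> ParseCsv(TextReader input)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            while (!parser.EndOfData)
                rows.Add(parser.ReadFields());
        }
        return rows;
    }

    // Parses fixed-width text; widths are the column sizes in characters.
    public static List<string[]> ParseFixedWidth(TextReader input, params int[] widths)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.FixedWidth;
            parser.SetFieldWidths(widths);
            while (!parser.EndOfData)
                rows.Add(parser.ReadFields());
        }
        return rows;
    }

    public static void Main()
    {
        // Inline sample data so the sketch runs without any files.
        foreach (string[] row in ParseCsv(new StringReader("1,\"Smith, John\",London")))
            Console.WriteLine(string.Join(" | ", row));

        // Hypothetical layout: 4-character id, 10-character name, 5-character amount.
        foreach (string[] row in ParseFixedWidth(new StringReader("42  Widget    19.99"), 4, 10, 5))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

The parser also accepts a file path or a Stream in its constructor, so the same two methods cover the delimited and fixed-width files from the question.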

倾城 Initia
#7 · 2019-04-12 00:36

If the files are relatively small you can use the File class. It has these methods which may help you:

  • ReadAllBytes
  • ReadAllLines
  • ReadAllText
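
A quick sketch of how the three methods differ, assuming the file fits comfortably in memory. The class and method names are just for illustration:

```csharp
using System;
using System.IO;

public static class WholeFileReads
{
    // Reads the same file three ways and summarizes what each call returned.
    public static string Describe(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);   // raw bytes, no decoding
        string[] lines = File.ReadAllLines(path); // one string per line, line endings removed
        string text = File.ReadAllText(path);     // whole file as a single string

        return string.Format("{0} bytes, {1} lines, {2} chars",
            bytes.Length, lines.Length, text.Length);
    }

    public static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: WholeFileReads <file>");
            return;
        }

        Console.WriteLine(Describe(args[0]));
    }
}
```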