I am trying to read some text files, where each line needs to be processed. At the moment I am just using a StreamReader, and then reading each line individually.
I am wondering whether there is a more efficient way (in terms of LoC and readability) to do this using LINQ without compromising operational efficiency. The examples I have seen involve loading the whole file into memory, and then processing it. In this case however I don't believe that would be very efficient. In the first example the files can get up to about 50k, and in the second example, not all lines of the file need to be read (sizes are typically < 10k).
You could argue that nowadays it doesn't really matter for files this small, but I believe that sort of approach leads to inefficient code.
First example:
// Open file
using (var file = System.IO.File.OpenText(_LstFilename))
{
    // Read file
    while (!file.EndOfStream)
    {
        String line = file.ReadLine();
        // Ignore empty lines
        if (line.Length > 0)
        {
            // Create addon
            T addon = new T();
            addon.Load(line, _BaseDir);
            // Add to collection
            collection.Add(addon);
        }
    }
}
Second example:
// Open file
using (var file = System.IO.File.OpenText(datFile))
{
    // Compile regexs
    Regex nameRegex = new Regex("IDENTIFY (.*)");
    while (!file.EndOfStream)
    {
        String line = file.ReadLine();
        // Check name
        Match m = nameRegex.Match(line);
        if (m.Success)
        {
            _Name = m.Groups[1].Value;
            // Remove me when other values are read
            break;
        }
    }
}
It's simpler to read a line and check whether or not it's null than to check for EndOfStream all the time.
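A minimal sketch of that null-check loop, for illustration only. The class and method names are made up here, and a StringReader stands in for File.OpenText(path) so the snippet runs on its own:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class NullCheckLoop
{
    // Collects every non-empty line, stopping when ReadLine returns null
    // (end of input) instead of polling EndOfStream.
    static List<string> ReadNonEmptyLines(TextReader reader)
    {
        var lines = new List<string>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length > 0)
            {
                lines.Add(line);
            }
        }
        return lines;
    }

    static void Main()
    {
        // Assumption: in real code this would be File.OpenText(path).
        using (TextReader reader = new StringReader("first\n\nsecond\n"))
        {
            foreach (string line in ReadNonEmptyLines(reader))
            {
                Console.WriteLine(line); // prints "first" then "second"
            }
        }
    }
}
```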
However, I also have a LineReader class in MiscUtil which makes all of this a lot simpler. Basically it exposes a file (or a Func<TextReader>) as an IEnumerable<string>, which lets you do LINQ stuff over it.
The heart of LineReader is its implementation of IEnumerable<string>.GetEnumerator; almost all the rest of the source is just giving flexible ways of setting up dataSource (which is a Func<TextReader>).
You can write a LINQ-based line reader pretty easily using an iterator block:
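The original snippet is not preserved here, so the following is only a sketch of what such an iterator block might look like. The parse delegate, the class name and the temp-file demo are assumptions added for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LineSource
{
    // Lazily yields one parsed record per line. The file is opened when
    // enumeration starts and closed when it ends or the enumerator is disposed.
    public static IEnumerable<T> ReadFrom<T>(string file, Func<string, T> parse)
    {
        using (var reader = File.OpenText(file))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return parse(line); // hypothetical per-line parsing step
            }
        }
    }

    static void Main()
    {
        // Demo data written to a temp file so the sketch is runnable.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "1", "2", "3" });

        foreach (int n in ReadFrom(path, line => int.Parse(line)))
        {
            Console.WriteLine(n * 2);
        }

        File.Delete(path);
    }
}
```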
or to make Jon happy:
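Again a sketch rather than the original code: here the reader yields plain strings and the parsing or filtering moves into a LINQ query. The query itself and the demo file contents are illustrative assumptions:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LineQuery
{
    // Yields raw lines only; no parsing happens in the reader.
    public static IEnumerable<string> ReadFrom(string file)
    {
        using (var reader = File.OpenText(file))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    static void Main()
    {
        // Demo data written to a temp file so the sketch is runnable.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "alpha", "", "beta" });

        // Filtering and projection are expressed in the query and run lazily,
        // one line at a time, as the foreach below pulls results.
        var nonEmpty = from line in ReadFrom(path)
                       where line.Length > 0
                       select line.ToUpperInvariant();

        foreach (string line in nonEmpty)
        {
            Console.WriteLine(line);
        }

        File.Delete(path);
    }
}
```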
then you have ReadFrom(...) as a lazily evaluated sequence without buffering, perfect for Where etc.
Note that if you use OrderBy or the standard GroupBy, it will have to buffer the data in memory; if you need grouping and aggregation, "PushLINQ" has some fancy code to allow you to perform aggregations on the data but discard it (no buffering). Jon's explanation is here.
NOTE: You need to watch out for the IEnumerable<T> solution, as it will result in the file being open for the duration of processing. For example, with Marc Gravell's response:
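The snippet that belonged here is missing; the following sketch shows the kind of loop presumably meant. DoLongProcessOn is a hypothetical stand-in for slow per-record work, and ReadFrom is repeated so the example is self-contained:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;

static class OpenFileDemo
{
    static IEnumerable<string> ReadFrom(string file)
    {
        using (var reader = File.OpenText(file))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    // Hypothetical slow processing step.
    static void DoLongProcessOn(string record)
    {
        Thread.Sleep(100);
        Console.WriteLine(record);
    }

    static void Main()
    {
        // Demo data written to a temp file so the sketch is runnable.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "a", "b", "c" });

        // The reader inside ReadFrom is disposed only when this loop ends,
        // because each iteration pulls the next line from the open file.
        foreach (string record in ReadFrom(path))
        {
            DoLongProcessOn(record);
        }

        File.Delete(path);
    }
}
```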
the file will remain open for the whole of the processing.
Thanks all for your answers! I decided to go with a mixture, mainly focusing on Marc's though, as I will only need to read lines from a file. I guess you could argue separation is needed everywhere, but heh, life is too short!
Regarding keeping the file open, that isn't going to be an issue in this case, as the code is part of a desktop application.
Lastly I noticed you all used lowercase string. I know in Java there is a difference between capitalised and non capitalised string, but I thought in C# lowercase string was just a reference to capitalised String?