I have been challenged with producing a method that reads very large text files into a program. These files can range from 2 GB to 100 GB.
The idea so far has been to read, say, a couple of thousand lines of text into the method at a time.
At the moment the program is set up using a StreamReader that reads the file line by line and processes the necessary areas of data found on each line.
    using (StreamReader reader = new StreamReader("FileName"))
    {
        string nextline = reader.ReadLine();
        string textline = null;
        while (nextline != null)
        {
            textline = nextline;
            Row rw = new Row();
            var property = from matchID in xmldata
                           from matching in matchID.MyProperty
                           where matchID.ID == textline.Substring(0, 3).TrimEnd()
                           select matching;
            string IDD = textline.Substring(0, 3).TrimEnd();
            foreach (var field in property)
            {
                Field fl = new Field();
                fl.Name = field.name;
                fl.Data = textline.Substring(field.startByte - 1, field.length).TrimEnd();
                fl.Order = order;
                fl.Show = true;
                order++;
                rw.ID = IDD;
                rw.AddField(fl);
            }
            rec.Rows.Add(rw);
            nextline = reader.ReadLine();
            if ((nextline == null) || (NewPack == nextline.Substring(0, 3).TrimEnd()))
            {
                d.ID = IDs.ToString();
                d.Records.Add(rec);
                IDs++;
                DataList.Add(d.ID, d);
                rec = new Record();
                d = new Data();
            }
        }
    }
The program goes on further and populates a class (I just decided not to post the rest).
I know that once the program is given an extremely large file, out-of-memory exceptions will occur.
That is my current problem. So far I have been googling several approaches, with many people just answering "use a StreamReader and reader.ReadToEnd". I know ReadToEnd won't work for me, as I will get those memory errors.
Finally, I have been looking into async as a way of creating a method that will read a certain number of lines and wait for a call before processing the next batch of lines.
This brings me to my problem: I am struggling to understand async, and I can't seem to find any material that will help me learn it. I was hoping someone here could help me find a way to understand async.
Of course, if anyone knows of a better way to solve this problem, I am all ears.
EDIT: Added the remainder of the code to put an end to any confusion.
Your problem isn't synchronous vs. asynchronous; it's that you're reading the entire file and storing parts of the file in memory before you do something with that data.
If you were reading each line, processing it and writing the result to another file/database, then StreamReader will let you process multi-GB (or TB) files.
There's only a problem if you're storing portions of the file until you finish reading it; then you can run into memory issues (but you'd be surprised how large you can let Lists and Dictionaries get before you run out of memory).
What you need to do is save your processed data as soon as you can, and not keep it in memory (or keep as little in memory as possible).
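To make the "process it and write it out straight away" idea concrete, here is a minimal sketch. The names StreamingExample, ProcessFile and the transform delegate are hypothetical and only stand in for whatever per-line processing you actually do; the output is assumed to be another text file rather than a database.

```csharp
using System;
using System.IO;

static class StreamingExample
{
    // Reads inputPath one line at a time, transforms each line, and writes the
    // result straight to outputPath. Only the current line is held in memory,
    // so the size of the input file does not matter.
    public static long ProcessFile(string inputPath, string outputPath,
                                   Func<string, string> transform)
    {
        long count = 0;
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Hypothetical processing step: replace with your own logic.
                writer.WriteLine(transform(line));
                count++;
            }
        }
        return count; // number of lines processed
    }
}
```

Something like `StreamingExample.ProcessFile("in.txt", "out.txt", line => line.TrimEnd())` would then run in roughly constant memory, because nothing is accumulated between iterations.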
With files that large you may need to keep your working set (the data you are processing) in a database; something like SQL Server Express or SQLite might do, but again, it depends on how large your working set gets.
Hope this helps, don't hesitate to ask further questions in the comments, or edit your original question, I'll update this answer if I can help in any way.
Update - Paging/Chunking
You need to read the text file in chunks of one page, and allow the user to scroll through the "pages" in the file. As the user scrolls you read and present them with the next page.
Now, there are a couple of things you can do to help yourself. Always keep about 10 pages in memory; this allows your app to stay responsive if the user pages up or down a couple of pages very quickly. In the application's idle time (the Application Idle event) you can read in the next few pages, and throw away pages that are more than five pages before or after the current page.
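A rough sketch of that sliding window of pages might look like the following. PageCache, GetPage and PrefetchNext are hypothetical names, and the loadPage delegate is assumed to be whatever routine reads one page of lines from the file; the idea is that PrefetchNext would be called from your idle handler.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

sealed class PageCache
{
    private readonly Dictionary<int, List<string>> pages = new Dictionary<int, List<string>>();
    private readonly Func<int, List<string>> loadPage; // assumed: reads one page from disk
    private readonly int pageCount;
    private readonly int radius;
    private int current;

    public PageCache(Func<int, List<string>> loadPage, int pageCount, int radius = 5)
    {
        this.loadPage = loadPage;
        this.pageCount = pageCount;
        this.radius = radius;
    }

    public int Count { get { return pages.Count; } }
    public bool Contains(int pageNumber) { return pages.ContainsKey(pageNumber); }

    // Returns the requested page, loading it if needed, then discards every
    // cached page that is more than `radius` pages away from it.
    public List<string> GetPage(int pageNumber)
    {
        current = pageNumber;
        List<string> page;
        if (!pages.TryGetValue(pageNumber, out page))
        {
            page = loadPage(pageNumber);
            pages[pageNumber] = page;
        }
        foreach (int key in pages.Keys.Where(k => Math.Abs(k - pageNumber) > radius).ToList())
            pages.Remove(key);
        return page;
    }

    // Loads the nearest page around the current one that is not cached yet.
    // Returns false when the whole window is already in memory.
    public bool PrefetchNext()
    {
        for (int distance = 1; distance <= radius; distance++)
        {
            foreach (int candidate in new[] { current + distance, current - distance })
            {
                if (candidate >= 0 && candidate < pageCount && !pages.ContainsKey(candidate))
                {
                    pages[candidate] = loadPage(candidate);
                    return true;
                }
            }
        }
        return false;
    }
}
```

Loading one page per idle call keeps each call short, so the UI thread is never blocked for long; whether that is the right granularity depends on your page size.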
Paging backwards is a problem, because you don't know where each line begins or ends in the file, and therefore you don't know where each page begins or ends. So for paging backwards, as you read down through the file, keep a list of offsets to the start of each page (Stream.Position); then you can quickly Seek to a given position and read the page in from there.
If you need to allow the user to search through the file, then you pretty much read through the file line by line (remembering the page offsets as you go) looking for the text; when you find something, read in and present them with that page.
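Here is a minimal sketch of the offset list and the seek. One thing to be aware of: StreamReader reads ahead into its own buffer, so the Position of the underlying stream does not correspond to the line you have just read. This sketch therefore counts newline bytes directly. PageIndex, BuildOffsets and ReadPage are hypothetical names, and the code assumes '\n'-terminated lines in an encoding such as ASCII or UTF-8, where the byte 0x0A only ever means "newline".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class PageIndex
{
    // Scans the file once and returns the byte offset at which each page starts.
    // Assumption: lines end in '\n' and the encoding is ASCII or UTF-8.
    public static List<long> BuildOffsets(string path, int linesPerPage)
    {
        var offsets = new List<long> { 0 };
        var buffer = new byte[64 * 1024];
        long position = 0;   // byte offset of buffer[0] within the file
        int linesInPage = 0;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            int read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < read; i++)
                {
                    if (buffer[i] == (byte)'\n' && ++linesInPage == linesPerPage)
                    {
                        offsets.Add(position + i + 1); // next page starts after this newline
                        linesInPage = 0;
                    }
                }
                position += read;
            }
            // If the last full page ended exactly at end of file, the final
            // offset points at an empty page, so drop it.
            if (offsets.Count > 1 && offsets[offsets.Count - 1] == fs.Length)
                offsets.RemoveAt(offsets.Count - 1);
        }
        return offsets;
    }

    // Seeks straight to the start of the requested page and reads its lines.
    public static List<string> ReadPage(string path, IList<long> offsets,
                                        int pageNumber, int linesPerPage)
    {
        var lines = new List<string>(linesPerPage);
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offsets[pageNumber], SeekOrigin.Begin);
            using (var reader = new StreamReader(fs, Encoding.UTF8))
            {
                string line;
                while (lines.Count < linesPerPage && (line = reader.ReadLine()) != null)
                    lines.Add(line);
            }
        }
        return lines;
    }
}
```

The sketch builds the whole index up front for simplicity; in a viewer you would more likely append offsets as the user pages forward, as described above. With one long per page, the index stays small even for a 100 GB file.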
You can speed everything up by pre-processing the file into a database. There are grid controls that will work off a dynamic dataset (they will do the paging for you), and you get the benefit of built-in searches and filters.
So, from a certain point of view, this is reading the file asynchronously, but only from the user's point of view. From a technical point of view, we tend to mean something else when we talk about doing something asynchronously in programming.