Need a way to sort a 100 GB log file by date [closed]

Posted 2019-01-21 13:01

So, for some strange reason I end up with a 100GB log file that is unsorted (actually it's partially sorted), while the algorithms that I'm attempting to apply require sorted data. A line in the log file looks like so

data <date> data data more data

I have access to C# 4.0 and about 4 GB of RAM on my workstation. I would imagine that merge-sort of some kind would be best here, but short of implementing these algorithms myself - I want to ask if there's some kind of a shortcut I could take.

Incidentally, parsing the date string with DateTime.Parse() is very slow and takes up a lot of CPU time - the chugging rate is a measly 10 MB/sec. Is there a faster way than the following?

    // Fast path for strings that start with "yyyy-MM-dd": slices out the
    // fixed-position fields instead of going through DateTime.Parse().
    public static DateTime Parse(string data)
    {
        int year, month, day;

        // The layout is fixed, so the TryParse return values are deliberately ignored.
        int.TryParse(data.Substring(0, 4), out year);
        int.TryParse(data.Substring(5, 2), out month);
        int.TryParse(data.Substring(8, 2), out day);

        return new DateTime(year, month, day);
    }

I wrote that to speed up DateTime.Parse() and it actually works well, but it's still taking a bucket-load of cycles.

Note that for the current log file I'm interested in hours, minutes and seconds as well. I know that I can provide DateTime.Parse() with a format string, but that doesn't seem to speed it up all that much.
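For reference, here is a variant I could try that also pulls out hours, minutes and seconds and avoids the Substring allocations by indexing characters directly. It assumes the timestamp always sits at a fixed, well-formed offset in the line - both assumptions about my log format rather than anything verified:

    // Sketch: parse "yyyy-MM-dd HH:mm:ss" starting at a known column, no allocations.
    // Assumes the timestamp layout is fixed; no validation is performed.
    public static DateTime ParseFast(string line, int offset)
    {
        int year  = (line[offset]      - '0') * 1000 + (line[offset + 1]  - '0') * 100
                  + (line[offset + 2]  - '0') * 10   + (line[offset + 3]  - '0');
        int month = (line[offset + 5]  - '0') * 10   + (line[offset + 6]  - '0');
        int day   = (line[offset + 8]  - '0') * 10   + (line[offset + 9]  - '0');
        int hour  = (line[offset + 11] - '0') * 10   + (line[offset + 12] - '0');
        int min   = (line[offset + 14] - '0') * 10   + (line[offset + 15] - '0');
        int sec   = (line[offset + 17] - '0') * 10   + (line[offset + 18] - '0');

        return new DateTime(year, month, day, hour, min, sec);
    }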

I'm looking for a nudge in the right direction, thanks in advance.

EDIT: Some people have suggested that I use string comparison to compare dates. That would work for the sorting phase, but I still need to parse the dates for the algorithms. I also still have no idea how to sort a 100 GB file on 4 GB of free RAM without doing it manually.

EDIT 2: Well, thanks to several suggestions that I use Windows sort, I found out that there's a similar tool for Linux. Basically you call sort and it fixes everything for you. As we speak it's doing something, and I hope it'll finish soon. The command I'm using is

sort -k 2b 2008.log > 2008.sorted.log

-k 2b specifies that I want to sort on the second column (field), which is a date-time string in the usual YYYY-MM-DD hh:mm:ss.msec format. I must admit that the man pages don't explain all the options very well, but I found a lot of examples by running info coreutils 'sort invocation'.

I'll report back with results and timings. This part of the log is about 27GB. I am thinking of sorting 2009 and 2010 separately and then merging the results into a single file with the sort -m option.

Edit 3 Well, checking iotop suggests that it's reading in small chunks of the data file and then furiously doing something in order to process them. This process seems to be quite slow. =(

sort isn't using any memory, and only a single core. When it does read data from the drive it's not processing anything. Am I doing something wrong?

Edit 4 Three hours in and it's still doing the same thing. Now I'm at that stage where I want to try playing with the parameters of the command, but I'm three hours invested... I'll abort it in about 4 hours, and try to put it on for overnight computation with smarter memory and space parameters...

Edit 5 Before I went home, I restarted the process with the following command:

sort -k 2b --buffer-size=60% -T ~/temp/ -T "/media/My Passport" 2010.log -o 2010.sorted.log

It returned this, this morning:

sort: write failed: /media/My Passport/sortQAUKdT: File too large

Wraawr! I thought I would just add as many hard drives as possible to speed this process up. Apparently adding a USB-drive was the worst idea ever. At the moment I can't even tell if it's about FAT/NTFS or some such, because fdisk is telling me that the USB drive is a "wrong device"... no kidding. I'll try to give it another go later, for now let's put this project into the maybe failed pile.

Final Notice This time it worked, with the same command as above, but without the problematic external hard drive. Thank you all for your help!

Benchmarking

Using two workstation-grade hard disks (at least 70 MB/sec read/write IO) on the same SATA controller, it took me 162 minutes to sort a 30 GB log file. I need to sort another 52 GB file tonight; I'll post how that goes.

16 Answers
你好瞎i
#2 · 2019-01-21 13:47

Actually I don't have many ideas about the date conversion, but here is what I would try:

  1. A database with an index on the date column (to make it easy to search the data afterwards).
  2. Use bulk insert to load the data into that database.
  3. Some way to parallelize the reading (I think parallel LINQ would work well and is very easy to use; see the sketch after this list).
  4. Lots of patience (the most important/hardest part).
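Just as a sketch of item 3, not tested - something along these lines with PLINQ (requires System.Linq and System.IO; Parse stands for whatever date extraction you end up using):

    // Sketch: stream the file and parse lines in parallel, preserving order.
    // File.ReadLines reads lazily, so the whole 100 GB is never in memory at once.
    var entries = File.ReadLines("2008.log")
        .AsParallel()
        .AsOrdered()
        .Select(line => new { Line = line, Date = Parse(line) });

    foreach (var entry in entries)
    {
        // hand entry.Date / entry.Line to the bulk insert here
    }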
趁早两清
#3 · 2019-01-21 13:49

I do need to parse dates for the algorithms.

On *NIX, I would generally first convert the dates into something simple, suitable for text comparison, and make it the first word on the line. It's too early for date/time object creation. My usual date representation is YYYYMMDD-hhmmss.millis. Make sure all files use the same date format.

I still have no idea how to sort 100GB file on 4GB of free ram, without doing it manually.

As you have already figured out, merge sort is the only option.

So to me the task falls into the following steps:

  1. Dumb conversion to make the dates sortable. Complexity: read/write 100 GB sequentially.

  2. Split the data into chunks of usable size, e.g. 1 GB, and sort every chunk with a plain quick sort before writing it to disk. Complexity: read/write 100 GB sequentially; memory for the quick sort.

  3. Merge-sort the small files into one large one. This can be done step-wise, using a program that takes two files and merges them into a new one (see the sketch after this list). Complexity: read/write 100 GB sequentially, log(N) times (where N is the number of files). HDD space requirement: 2 × 100 GB (the last merge combines two 50 GB files into a single 100 GB file).

  4. A program to automate the previous step: pick two (e.g. the smallest) files, start the program to merge-sort them into a new file, and remove the two original files. Repeat until only one file remains.

  5. (Optional) Split the 100 GB sorted file into smaller chunks of manageable size. After all, you are going to do something with them. Number them sequentially or put the first and last timestamps into the file name.
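For step 3, a rough sketch of the two-file merge in C# (since that is what you have at hand). It assumes step 1 already moved the normalized date to the front of every line, so plain ordinal string comparison gives the right order:

    // Sketch: merge two files whose lines are already sorted (step 3).
    // Assumes ordinal comparison of whole lines matches the desired order.
    static void MergeFiles(string leftPath, string rightPath, string outPath)
    {
        using (var left = new StreamReader(leftPath))
        using (var right = new StreamReader(rightPath))
        using (var output = new StreamWriter(outPath))
        {
            string a = left.ReadLine();
            string b = right.ReadLine();

            while (a != null && b != null)
            {
                if (string.CompareOrdinal(a, b) <= 0)
                {
                    output.WriteLine(a);
                    a = left.ReadLine();
                }
                else
                {
                    output.WriteLine(b);
                    b = right.ReadLine();
                }
            }

            // Drain whichever input still has lines left.
            for (; a != null; a = left.ReadLine()) output.WriteLine(a);
            for (; b != null; b = right.ReadLine()) output.WriteLine(b);
        }
    }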

General concept: do not try to find a way to do it fast; piping 100 GB around will take time anyway. Plan for the program at every step to run overnight as a batch, without your attention.

On Linux this is all doable with shell/sort/awk/Perl, and I do not think it would be a problem to write it all in any other programming language. This is potentially four programs, but all of them are rather simple to code.

够拽才男人
#4 · 2019-01-21 13:52

You could try implementing a radix sort. Because radix sort scans the whole list sequentially and only a few times, it can help avoid a gigantic number of scans and seeks over your 100 GB file.

Radix sort classifies your records on one part of the key per iteration. That part can be a digit, or a date/time component such as year, month, or day. In that case you don't even need to convert the string to a DateTime; you can convert only the specific part to an int.

Edit:

For sorting purposes, you can create a temporary binary file with only two columns: the DateTime (DateTime.ToBinary() as an Int64) and the line's offset in the source file (as an Int64).

You then get a much smaller file with fixed-size records, only 16 bytes per record, which you can sort much faster (the IO, at least, will be faster).

Once the temp file is sorted, you can recreate the full sorted 100 GB log file from it.
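A rough, untested sketch of building that temp index file (Parse is the question's own date parser, or whatever extracts the date from a line; tracking byte offsets this way assumes the file is UTF-8 with a consistent newline, which may not hold for your log):

    // Sketch: write (timestamp, byte offset) pairs - 16 bytes per record.
    // Requires System.IO and System.Text.
    static void WriteIndex(string logPath, string indexPath)
    {
        using (var reader = new StreamReader(logPath))
        using (var writer = new BinaryWriter(File.Create(indexPath)))
        {
            long offset = 0;
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.Write(Parse(line).ToBinary());   // 8 bytes: the sort key
                writer.Write(offset);                   // 8 bytes: where the line starts
                offset += Encoding.UTF8.GetByteCount(line) + Environment.NewLine.Length;
            }
        }
    }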

Luminary・发光体
#5 · 2019-01-21 13:54

Why don't you try this relatively unknown tool from Microsoft called LogParser? It basically allows you to run an SQL query over a CSV file (or any other formatted text file).

It saves you the trouble of pumping the data into a database, doing your sort, and pumping it back out again.

老娘就宠你
#6 · 2019-01-21 13:55

Just to answer your question about sorting a long file that doesn't fit into memory - you'll need to use an external sorting algorithm such as merge sort. The process is roughly as follows:

  • Partition the input into several parts that fit into memory and can be sorted using standard in-memory sorting algorithms (e.g. 100 MB or larger - you'll need to keep ~4 parts in memory at once). Sort all the parts and write them back to disk (see the sketch after this list).

  • Read two parts from disk (both already sorted) and merge them, which can be done simply by iterating over the two inputs simultaneously. Write the merged data set to another place on disk. Note that you don't need to read a whole part into memory - just read and write it in blocks as you go.

  • Repeat merging parts until you have only a single part (which will be a sorted file with all the data from your original input data set).
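A rough sketch of the first phase (the chunk size is arbitrary, and it assumes whole-line ordinal comparison matches date order - if the date isn't at the start of the line, swap in a comparison on the date substring instead):

    // Sketch: split the input into sorted runs and write each run to disk.
    // Requires System.Collections.Generic and System.IO.
    static List<string> WriteSortedChunks(string inputPath, int linesPerChunk)
    {
        var chunkFiles = new List<string>();
        var buffer = new List<string>(linesPerChunk);

        foreach (string line in File.ReadLines(inputPath))
        {
            buffer.Add(line);
            if (buffer.Count == linesPerChunk)
                chunkFiles.Add(FlushChunk(buffer, chunkFiles.Count));
        }
        if (buffer.Count > 0)
            chunkFiles.Add(FlushChunk(buffer, chunkFiles.Count));

        return chunkFiles;   // feed these into the merge phase
    }

    static string FlushChunk(List<string> buffer, int index)
    {
        buffer.Sort(StringComparer.Ordinal);          // in-memory sort of one part
        string path = "chunk_" + index + ".tmp";
        File.WriteAllLines(path, buffer);
        buffer.Clear();
        return path;
    }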

You mentioned that the data is already partially sorted, so it would be a good idea to pick an in-memory sorting algorithm for the first phase that is efficient in that case. You can see some suggestions in this question (though I'm not sure whether the answer will be the same for very large data sets - it depends on how partially sorted the input is).

Juvenile、少年°
#7 · 2019-01-21 13:59

The best way of optimising the parsing of the dates is to not parse them at all.

As the dates are in ISO 8601 format, you can just compare them as strings. There is no parsing needed at all.

Regarding the sorting, you should be able to make effective use of the fact that it's partially sorted. One approach could be to read the file and write the records into separate files divided by time range, for example daily or hourly. If you make each file small enough, you can then read them into memory, sort them, and just merge all the files.
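A rough sketch of that first idea, bucketing lines into per-day files on the date substring (dateStart - the column where the date begins - is an assumption about the line layout, and a year of data means a few hundred open writers, which is usually fine):

    // Sketch: route each line into a per-day file keyed on its "yyyy-MM-dd" part.
    // Requires System.Collections.Generic and System.IO.
    static void SplitByDay(string inputPath, int dateStart)
    {
        var writers = new Dictionary<string, StreamWriter>();
        try
        {
            foreach (string line in File.ReadLines(inputPath))
            {
                string day = line.Substring(dateStart, 10);   // "yyyy-MM-dd"
                StreamWriter writer;
                if (!writers.TryGetValue(day, out writer))
                {
                    writer = new StreamWriter(day + ".part");
                    writers.Add(day, writer);
                }
                writer.WriteLine(line);
            }
        }
        finally
        {
            foreach (var w in writers.Values) w.Dispose();
        }
    }

Each .part file can then be sorted in memory and the results concatenated in date order.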

Another approach could be to read the file and write the records that are in order into one file and the rest into another file. Sort the second file (possibly applying this process recursively if it's large) and zip the two files together - i.e. a modified merge sort.
