Hi guys :) I'm really confused by one task :-/
There is one file every day, of 2,000,000 to 4,000,000 lines, each line containing a unique 15-digit number, like this:
850025000010145
401115000010152
400025000010166
770025555010152
512498004158752
From the beginning of the current year you accumulate one such file per day. I have to compare every line of today's file against all the previous files since the beginning of the year and return only the numbers that have never appeared before in any of the checked files.
Which language and algorithm should I use? How should I implement it?
You should be able to do this without having to write any code beyond a simple script (e.g. bash, Windows batch, PowerShell, etc.). There are standard tools that make quick work of this type of thing.
First, you have some number of files that each contain from 2 million to 4 million numbers. It's difficult to work with all those files, so the first thing you want to do is create a combined file that's sorted. The simple-minded way to do that is to concatenate all the files into a single file, sort it, and remove duplicates. For example, using the GNU/Linux cat and sort commands:
cat file1 file2 file3 file4 > combined
sort -u combined > combined_sort
(The -u removes duplicates.)
The problem with that approach is that you end up sorting a very large file. Figure 4 million lines at 15 characters plus a newline each, and that's about 64 megabytes per file; with almost 100 days of files you're working with over 6 gigabytes, and a whole year's worth of data would be about 23 gigabytes. That takes a long time.
So instead, sort each individual file, then merge them:
sort -u file1 >file1_sort
sort -u file2 >file2_sort
...
sort -m -u file1 file2 file3 > combined_sorted
The -m switch merges the already-sorted files.
Now what you have is a sorted list of all the identifiers you've seen so far. You want to compare today's file with that. First, sort today's file:
sort -u today >today_sort
Now, you can compare the files and output only the lines unique to today's file:
comm -2 -3 today_sort combined_sort
-2 says to suppress lines that occur only in the second file, and -3 says to suppress lines that are common to both files. So all you'll get is the lines in today_sort that don't exist in combined_sort.
Now, if you're going to do this every day, then you need to take the output from the comm command and merge it with combined_sort so that you can use that combined file tomorrow. That prevents you from having to rebuild the combined_sort file every day. So:
comm -2 -3 today_sort combined_sort > new_values
Then:
sort -m combined_sort new_values > combined_sort_new
You'd probably want to name the file with the date, so you'd have combined_sort_20140401, combined_sort_20140402, etc.
So if you started at the beginning of the year and wanted to do this every day, your script would look something like:
sort -u $todays_file > todays_sorted_file
comm -2 -3 todays_sorted_file $old_combined_sort > todays_uniques
sort -m -u $old_combined_sort todays_sorted_file > $new_combined_sort
(The extra -u here keeps numbers that are already in the combined file from being stored twice.)
$todays_file, $old_combined_sort, and $new_combined_sort are parameters that you pass on the command line. So, if the script was called "daily":
daily todays_file.txt all_values_20140101 all_values_20140102
If you must solve the problem by hand:
- Convert the strings to 64-bit integers. This saves space (2x to 4x) and speeds up calculations.
- Sort the current file of integers.
- Merge the current file with the old data file (already sorted), selecting the new numbers.
The merging step can look like the merge step of MergeSort, as sketched below.
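Here is a minimal Java sketch of that merge, assuming both input files are already sorted and de-duplicated, with one 15-digit number per line (the file names are placeholders, not anything fixed by the task):
import java.io.*;

public class SelectNew {
    public static void main(String[] args) throws IOException {
        try (BufferedReader today = new BufferedReader(new FileReader("today_sorted.txt"));
             BufferedReader history = new BufferedReader(new FileReader("history_sorted.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("new_numbers.txt"))) {
            String t = today.readLine();
            String h = history.readLine();
            while (t != null) {
                long tv = Long.parseLong(t);  // 15 digits fit comfortably in a 64-bit long
                long hv = (h == null) ? Long.MAX_VALUE : Long.parseLong(h);
                if (tv < hv) {
                    out.println(t);           // never seen before: emit it
                    t = today.readLine();
                } else if (tv == hv) {
                    t = today.readLine();     // already known, skip it
                } else {
                    h = history.readLine();   // advance the history side
                }
            }
        }
    }
}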
You can store ranges of numbers in separate files to keep any single file from getting too large.
P.S. I wanted to propose using a bit map, but it would have a size of about 125 TB: one bit for each of the 10^15 possible 15-digit values is 10^15 / 8 = 1.25 * 10^14 bytes.
One solution could be to build a prefix tree based on the previous n-1 files (suppose the n-th file was created today). The most time-consuming build process has to be done only once. After you build the prefix tree, you can save it as a file (google how to serialize such a tree).
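As a rough sketch, the tree could be a digit trie like the following; the class and method names are just what the snippet below assumes, not a fixed API:
class DigitTrie {
    private static final class Node {
        Node[] next = new Node[10]; // one child per digit 0-9
        boolean terminal;           // true if a stored number ends here
    }
    private final Node root = new Node();

    boolean contains(String number) {
        Node cur = root;
        for (int i = 0; i < number.length(); i++) {
            cur = cur.next[number.charAt(i) - '0'];
            if (cur == null) return false;
        }
        return cur.terminal;
    }

    void insert(String number) {
        Node cur = root;
        for (int i = 0; i < number.length(); i++) {
            int d = number.charAt(i) - '0';
            if (cur.next[d] == null) cur.next[d] = new Node();
            cur = cur.next[d];
        }
        cur.terminal = true;
    }
}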
Run the program to check the new file:
int counter = 0;
DigitTrie tree = /* the trie built from the previous n-1 files */;
try (BufferedReader br = new BufferedReader(new FileReader("new_file.txt"))) {
    String line = br.readLine();
    while (line != null) {
        if (!tree.contains(line)) { // never seen in any earlier file
            counter++;              // count (or output) the new number
            tree.insert(line);      // remember it for future runs
        }
        line = br.readLine();
    }
}
So every day you run this (pseudo-)code, get the unique numbers, and update the tree.
contains takes O(m) time, where m is the number of characters (here m = 15); insert takes O(m) time too.
I would suggest Java.