Comparison of extra-large subsets of strings

Posted 2019-05-24 12:03

Hi guys :) I'm really confused by one task :-/

Every day there is a file of 2,000,000 to 4,000,000 lines, each containing a unique 15-digit number, like this:

850025000010145
401115000010152
400025000010166
770025555010152
512498004158752

From the beginning of the current year you accumulate one such file per day. I have to compare every line of today's file against all previous files since the start of the year, and return only those numbers that have never appeared before in any of the checked files.

Which language and algorithm should I use? How would I implement it?

3 Answers
Animai°情兽 · 2019-05-24 12:19

You should be able to do this without having to write any code beyond a simple script (e.g. bash, Windows batch, PowerShell). There are standard tools that make quick work of this type of thing.

First, you have some number of files that each contain from 2 million to 4 million numbers. It's difficult to work with all those files, so the first thing you want to do is create a combined file that's sorted. The simple-minded way to do that is to concatenate all the files into a single file, sort it, and remove duplicates. For example, using the GNU/Linux cat and sort commands:

cat file1 file2 file3 file4 > combined
sort -u combined > combined_sort

(The -u flag removes duplicates.)

The problem with that approach is that you end up sorting a very large file. Figure 4 million lines at 16 bytes each (15 characters plus a newline), which is about 64 MB per file; with almost 100 days of files you're working with roughly 7 gigabytes, and a whole year's worth of data would be about 25 gigabytes. That takes a long time.

So instead, sort each individual file, then merge them:

sort -u file1 > file1_sort
sort -u file2 > file2_sort
...
sort -m -u file1_sort file2_sort file3_sort > combined_sort

The -m switch merges the already-sorted files.

Now what you have is a sorted list of all the identifiers you've seen so far. You want to compare today's file with that. First, sort today's file:

sort -u today >today_sort

Now, you can compare the files and output only the lines unique to today's file:

comm -2 -3 today_sort combined_sort

-2 says suppress lines that occur only in the second file, and -3 says to suppress lines that are common to both files. So all you'll get is the lines in today_sort that don't exist in combined_sort.

Now, if you're going to do this every day, then you need to take the output from the comm command and merge it with combined_sort so that you can use that combined file tomorrow. That prevents you from having to rebuild the combined_sort file every day. So:

comm -2 -3 today_sort combined_sort > new_values

Then:

sort -m combined_sort new_values > combined_sort_new

You'd probably want to name the file with the date, so you'd have combined_sort_20140401 and combined_sort_20140402, etc.

So if you started at the beginning of the year and wanted to do this every day, your script would look something like:

sort -u $todays_file > todays_sorted_file
comm -2 -3 todays_sorted_file $old_combined_sort > todays_uniques
sort -m $old_combined_sort todays_uniques > $new_combined_sort

$todays_file, $old_combined_sort, and $new_combined_sort are parameters that you pass on the command line. So, if the script were called "daily":

daily todays_file.txt all_values_20140101 all_values_20140102
再贱就再见 · 2019-05-24 12:35

If you must solve the problem by hand:
- Convert the strings to 64-bit integers. This saves space (2x to 4x) and speeds up calculations.
- Sort the current file of integers.
- Merge the current file with the old data file (already sorted), selecting the new numbers.

The merging step can look like the merge step of MergeSort (see the sketch below). You can store ranges of numbers in separate files to avoid extra-large file sizes.
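
For illustration, here is a rough sketch of that merge step in Java. The class and method names are mine and the sample values are made up; it assumes today's numbers and the accumulated history have already been parsed into longs (a 15-digit number fits easily in a signed 64-bit integer, e.g. via Long.parseLong) and sorted:

import java.util.Arrays;

// Rough sketch of the merge step (names and sample data are illustrative).
class MergeNew {
    // Both arrays must be sorted; 'today' should also be de-duplicated.
    // Prints the values from 'today' that are absent from 'history'.
    static void printNew(long[] today, long[] history) {
        int i = 0, j = 0;
        while (i < today.length) {
            if (j == history.length || today[i] < history[j]) {
                System.out.println(today[i++]);  // never met before: report it
            } else if (today[i] > history[j]) {
                j++;                             // advance in the history
            } else {
                i++; j++;                        // present in both: not new
            }
        }
    }

    public static void main(String[] args) {
        long[] today   = {512498004158752L, 400025000010166L, 999999999999999L};
        long[] history = {400025000010166L, 850025000010145L};
        Arrays.sort(today);
        Arrays.sort(history);
        printNew(today, history);  // prints 512498004158752 and 999999999999999
    }
}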

P.S. I wanted to propose using a bitmap, but it would be about 125 TB in size (one bit for each of the 10^15 possible 15-digit numbers).

Emotional °昔 · 2019-05-24 12:42

One solution could be to build a prefix tree (trie) from the previous n-1 files (suppose the n-th file was created today). The most time-consuming part, building the tree, has to be done only once. After you build the prefix tree, you can save it to a file (google this topic).
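
For illustration, here is a minimal sketch of such a digit trie; the class name Trie and its insert/contains methods are my own, chosen to match the usage below:

// Minimal digit-trie sketch (hypothetical class; matches the usage below).
class Trie implements java.io.Serializable {
    // One child per digit '0'..'9'; 15-digit numbers give a fixed depth of 15.
    private final Trie[] children = new Trie[10];
    private boolean terminal;  // true if a stored number ends at this node

    public void insert(String number) {
        Trie node = this;
        for (int i = 0; i < number.length(); i++) {
            int d = number.charAt(i) - '0';
            if (node.children[d] == null) {
                node.children[d] = new Trie();
            }
            node = node.children[d];
        }
        node.terminal = true;
    }

    public boolean contains(String number) {
        Trie node = this;
        for (int i = 0; i < number.length(); i++) {
            int d = number.charAt(i) - '0';
            if (node.children[d] == null) {
                return false;
            }
            node = node.children[d];
        }
        return node.terminal;
    }
}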

Run the program to check the new file:

int counter = 0;  // how many numbers in today's file were never seen before
try (BufferedReader br = new BufferedReader(new FileReader("new_file.txt"))) {
    String line;
    while ((line = br.readLine()) != null) {
        if (!tree.contains(line)) {  // never seen in any previous file
            counter++;               // it is one of today's new numbers
            tree.insert(line);       // remember it for future runs
        }
    }
}

So every day you run this (pseudo-)code, get the unique numbers, and update the tree. One way to persist the tree between runs is sketched below.
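
Persisting the tree could be as simple as Java object serialization, since the Trie sketch above implements java.io.Serializable (the helper class and file name here are illustrative):

import java.io.*;

// Sketch of saving/loading the tree between daily runs (helper names are mine).
class TrieStore {
    static void save(Trie tree, String path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path))) {
            out.writeObject(tree);
        }
    }

    static Trie load(String path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (Trie) in.readObject();
        }
    }
}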

contains takes O(m) time, where m is the number of characters (here m = 15).

insert takes O(m) time, too.

I would suggest Java.
