Invert Large Hash Table in Cpp or other fast langu

2020-07-25 12:36发布

问题:

I am looking for efficient C++ (or other fast) to invert a huge hash table.

The number of hash keys is on the order of 200,000,000; and the number of possible elements in each hash key is of the order of 100,000.

I'd like to know what'd be a good way to (efficiently) invert such table such that now the elements are the keys and the keys are the elements.

Right now I have the data in my hard drive stored in a file called hash_file.txt. The file looks like:

>1
T1
T3
T4
T100
>2
T4
T77
T9980
etc.

Where, >1,...,>200,000,000 are all the possible keys of the original hash table; and T1,...,T100000 are all the possible elements for each key. Note: The hash table is quite sparse with no more of a few hundred elements per key.

The output, inverted hash table would look like this in this example:

>T1
1
>T3
1
>T100
1
>T4
1
2
>T77
2
>T9980
2

I tried some naive code and took forever, and run out of mem, so I am looking for good suggestions to start with.

回答1:

This is a pretty simply approach; worth a try (remember to build with optimisation enabled, but preferably not disabling assert ;-)).

#include <iostream>
#include <vector>
#include <cassert>

int main()
{
    char c;
    int n;
    int key = -1;
    const int max_t = 100000;
    std::vector<std::vector<int>> v(max_t + 1);
    while (std::cin >> c >> n)
        if (c == '>')
            key = n;
        else
        {
            assert(c == 'T');
            assert(key != -1);
            assert(0 <= n && n < v.size());
            v[n].push_back(key);
        }
    assert(std::cin.eof());
    for (int i = 0; i < v.size(); ++i)
    {
        if (v[i].empty()) continue;
        std::cout << ">T" << i << '\n';
        for (int j = 0; j < v[i].size(); ++j)
             std::cout << v[i][j] << '\n';
    }
}

(the output order is numeric not lexicographical like in your question... if you cared you could look for / write an algorithm to iterate in "i" in such a way mirroring lexicographical ordering)



回答2:

Although your question is framed around using an in-memory hash to invert the relationship of these items, as per the comments all you really want to do is get the output and the means is not important.

Since the amount of data you are working with, loading it all into memory is probably not going to be practical no matter what data structure you choose. So you are going to need some method that includes only a part of the data into memory at once.

I would be inclined to use a database for a task like this. Create a table which has two columns - the existing 'key' column, and the 'T' value column. Put an index on the value column. Then run a query which gives you the output you want.

Here is an example I knocked up using Postgresql:

create table bigmap (
  key integer,
  value text
);

create index on bigmap(value);

insert into bigmap(key,value) values (1, 'T1');
insert into bigmap(key,value) values (1, 'T3');
insert into bigmap(key,value) values (1, 'T4');
insert into bigmap(key,value) values (1, 'T100');
insert into bigmap(key,value) values (2, 'T4');
insert into bigmap(key,value) values (2, 'T77');
insert into bigmap(key,value) values (2, 'T9980');

select value,key from bigmap order by value,key;

 value | key
-------+-----
 T1    |   1
 T100  |   1
 T3    |   1
 T4    |   1
 T4    |   2
 T77   |   2
 T9980 |   2
(7 rows)

Populating the database from your input file should be relatively trivial. You could write a program in C++ to do this, but depending on how often you want to do it, you might be better off to use eg. perl

The advantage of using a database is that they already have efficient routines for sorting and indexing such data, and also have built in handling for preparing large query results using temporary files if the amount of memory available is not sufficient.

Also, if you want to find all the keys for a specific T-Value, it's easy:

select value,key from bigmap where value='T100';
 value | key
-------+-----
 T100  |   1
(1 row)


标签: c++ hashmap