Java Project: Make HashMap (including Load-Store)

Posted 2019-04-12 07:14

Question:

I am writing code for our server in which I have to find users' access type by URL.

At the beginning, we saw 100 million distinct URLs accessed per day. Over time, that has grown to nearly 600 million distinct URLs per day.

For 100 million, what we did is the following:

1) Build a HashMap backed by parallel arrays, whose keys are one part of the URL (represented as a long) and whose values are the other part of the URL (represented as an int) - a key can have multiple values.

2) Then search the HashMap to find how many times a URL was accessed.

Now, as the HashTable has become larger, what we do is the following:

1) Build two or three separate HashTables, and load and store them (on the general file system) to find how many times a URL was accessed.

Now, the issue is:

1) Though the HashTable lookup performance is quite nice, the code takes a lot of time loading/storing the HashTable (we are using FileChannel; it takes 16-19 seconds to load/store a HashTable of 200 million entries, with a load factor of 0.5).
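For illustration only, here is a minimal sketch of this kind of FileChannel bulk load/store, assuming the table is held in parallel arrays and showing just the long[] key array (class and file names are made up; this is not the asker's actual code, and a real table would also persist the int[] values and any metadata):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Bulk store/load of one parallel array through FileChannel. For very large
// arrays you would write in fixed-size chunks instead of one big buffer.
public class ArrayStore {

    // Write the whole key array in one sequential pass.
    static void storeKeys(Path file, long[] keys) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(keys.length * Long.BYTES)
                                   .order(ByteOrder.nativeOrder());
        buf.asLongBuffer().put(keys);            // bulk copy into the buffer
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
        }
    }

    // Read the key array back in one sequential pass.
    static long[] loadKeys(Path file, int count) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(count * Long.BYTES)
                                   .order(ByteOrder.nativeOrder());
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (buf.hasRemaining() && ch.read(buf) != -1) {
                // keep reading until the buffer is full or EOF
            }
        }
        buf.flip();
        long[] keys = new long[count];
        buf.asLongBuffer().get(keys);            // bulk copy out of the buffer
        return keys;
    }
}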

What we are trying to ask is:

1) Any comments on how to solve this issue?

2) How can we reduce the load/store time? (I asked before, but it seems FileChannel is the best way.)

3) Would storing a HashTable larger than memory and caching it repeatedly be a good solution? If so, how should we do that (at least some pointers)? We tried it by using

RandomAccessFile raf = new RandomAccessFile("array.dat", "rw");
IntBuffer map = raf.getChannel()
        .map(FileChannel.MapMode.READ_WRITE, 0, 1 << 30)
        .order(ByteOrder.nativeOrder())
        .asIntBuffer();

However, this gives worse performance than before.

Thanks.

NB:

1) As per previous suggestions on Stack Overflow, we tried a NoSQL DB, TokyoCabinet, but in our experience a custom HashTable gives better performance on 100 million key-value pairs.

2) Pre-reading data to warm the disk cache is not possible, because our application starts working as soon as the system starts, and the system is restarted the next day.

What we forgot to mention is:

1) Our application is part of a project to be deployed on a small campus, so we assume no more than 800 million URLs are accessed. You can therefore treat the 600-700 million figure as fixed.

2) Our main concern is performance.

3) We have to run our application locally.

Edit: the code of our HashMap can be found here.

Answer 1:

It might be best to access the table as a memory-mapped buffer. That way, you could simply implement random access to the file without worrying about loading and storing, and leave caching to the operating system. I see that your current implementation already uses memory-mapped access for reading and writing, but it still loads things into the Java heap in between. Avoid this data duplication and copying! Treat the backing file itself as the data structure, and only access the portions of it that you actually need, only when you need them.
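For illustration, a minimal sketch of that idea, assuming the keys live in a single file of native-order longs (class and parameter names are made up; a single mapping is limited to 2 GB, so a larger table would need several mappings):

import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.LongBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Treat the file itself as the array of long keys: no explicit load/store step,
// the OS pages data in and out as it is touched.
public class MappedKeyArray {
    private final LongBuffer keys;

    MappedKeyArray(Path file, long entryCount) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mapped =
                    ch.map(FileChannel.MapMode.READ_WRITE, 0, entryCount * Long.BYTES);
            this.keys = mapped.order(ByteOrder.nativeOrder()).asLongBuffer();
        }
    }

    long get(int slot)            { return keys.get(slot); }   // random read
    void  put(int slot, long key) { keys.put(slot, key); }     // random write
}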

Within that file, hash maps will work if you are really, really sure that hash collisions are not an issue. Otherwise I'd go for a B+ tree there, with nodes about the size of your hard disk pages. That way, each disk access will yield a lot more usable data than just a single key, resulting in a shallower tree and fewer individual disk operations.

I guess others will have implemented stuff like this, but if you prefer your own hash map implementation, you might prefer to write your own memory-mapped B+ trees as well.



Answer 2:

The whole approach sounds ridiculous to me. I gather what you really want to achieve is a simple access counter per distinct URL. By its very nature, this data is frequently written but rarely ever read.

For this purpose, I would simply have a database table and add a new entry for every access (it can serve as a log as well). When you need to figure out how often any URL was accessed, this can easily be done with a SELECT COUNT from the table (depending on how much additional data you store along with the URL entries, you can even do constrained counts, such as how often it was accessed yesterday, last week, etc.).
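For illustration, a hedged sketch of that approach using plain JDBC; the table name url_access, its columns, and the JDBC URL are assumptions, not something the answer specifies:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Log one row per access, count on demand. Assumed schema:
//   CREATE TABLE url_access (url VARCHAR(2048), accessed_at TIMESTAMP);
public class AccessLog {
    private final Connection conn;

    AccessLog(String jdbcUrl) throws SQLException {
        this.conn = DriverManager.getConnection(jdbcUrl);
    }

    void record(String url) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO url_access (url, accessed_at) VALUES (?, CURRENT_TIMESTAMP)")) {
            ps.setString(1, url);
            ps.executeUpdate();
        }
    }

    long countAccesses(String url) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT COUNT(*) FROM url_access WHERE url = ?")) {
            ps.setString(1, url);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }
}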

This defers all the work to the point where the result is actually needed.

BTW, you may be able to retrieve the access counts from the web server's log files as well, so maybe you don't need to write any data yourself. Look into this first.



Answer 3:

You can use a caching framework like JCS. A billion key-value pairs should not be a problem.

http://commons.apache.org/jcs/
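A minimal sketch of how this might look, assuming Apache Commons JCS 2.x is on the classpath and a region named urlCounts is configured in cache.ccf (for example with an indexed-disk auxiliary so the cache can spill to disk); names and values are illustrative:

import org.apache.commons.jcs.JCS;
import org.apache.commons.jcs.access.CacheAccess;

// The region "urlCounts" must be configured in cache.ccf.
public class JcsExample {
    public static void main(String[] args) {
        CacheAccess<Long, Integer> cache = JCS.getInstance("urlCounts");

        cache.put(123456789L, 42);               // store a key/count pair
        Integer count = cache.get(123456789L);   // null if not cached
        System.out.println("count = " + count);
    }
}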



Answer 4:

Definitely try Redis; I think it beats anything else hands down.
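The answer names no client, so as an assumption here is a minimal sketch using the Jedis client against a local Redis instance:

import redis.clients.jedis.Jedis;

// Count URL accesses with Redis INCR; Redis keeps the data in memory
// and can persist it to disk (RDB/AOF) on its own schedule.
public class RedisCounter {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            long count = jedis.incr("hits:http://example.com/page");  // atomic increment
            String stored = jedis.get("hits:http://example.com/page");
            System.out.println(count + " / " + stored);
        }
    }
}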



Answer 5:

You can use Berkeley DB, which is basically a key/value store written in C for ultimate performance. It's an Oracle product (open source, though), so I would take it seriously.
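As an illustration (using Berkeley DB Java Edition, the pure-Java sibling of the C product, which has a very similar API; database and directory names are made up), a minimal put/get sketch might look like this:

import java.io.File;
import com.sleepycat.bind.tuple.IntegerBinding;
import com.sleepycat.bind.tuple.LongBinding;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

// A long-key -> int-value store; the "bdb-env" directory must already exist.
public class BdbStore {
    public static void main(String[] args) throws Exception {
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        Environment env = new Environment(new File("bdb-env"), envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        Database db = env.openDatabase(null, "urlCounts", dbCfg);

        DatabaseEntry key = new DatabaseEntry();
        DatabaseEntry value = new DatabaseEntry();
        LongBinding.longToEntry(123456789L, key);    // URL part as long
        IntegerBinding.intToEntry(42, value);        // other part / count as int
        db.put(null, key, value);

        DatabaseEntry found = new DatabaseEntry();
        if (db.get(null, key, found, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
            System.out.println(IntegerBinding.entryToInt(found));
        }

        db.close();
        env.close();
    }
}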



Answer 6:

If your application has to run locally without using any external computing power, there is no solution more performant than direct memory access: the only data structure that can give you better performance than a HashMap is an array, where access to every element is O(1). However, this requires knowing in advance how many items you have, having a unique addressing index per element, and being able to reserve a significant amount of contiguous memory.

After arrays, which as described are suitable only for limited cases, you have HashTables; however, as the size of the data grows, the cost of collisions and dynamic resizing increases and degrades performance.

You can refer to the java.util.HashMap javadoc, and also to Wikipedia (http://en.wikipedia.org/wiki/Hash_table), to understand the following:

  • How expensive is your hash function to compute?
  • How well distributed are the hash values?
  • What load factor are you using, i.e. what cost will you pay for collision resolution?
  • How often will you need to resize your HashMap before it fully contains all the data? (See the sizing sketch after this list.)
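As a small illustration of the last two points, a sketch of pre-sizing a HashMap so that no resize happens while loading; the figures are examples, not the asker's actual numbers:

import java.util.HashMap;
import java.util.Map;

// HashMap resizes when size exceeds capacity * loadFactor, so choosing
// capacity >= expectedEntries / loadFactor up front avoids every rehash.
public class PreSizedMap {
    public static void main(String[] args) {
        int expectedEntries = 1_000_000;    // example figure
        float loadFactor = 0.5f;            // the load factor mentioned in the question

        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
        Map<Long, Integer> map = new HashMap<>(initialCapacity, loadFactor);

        for (long i = 0; i < expectedEntries; i++) {
            map.put(i, (int) (i % 1000));   // no resize occurs during this loop
        }
        System.out.println(map.size());
    }
}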

If your performance degrades when building your HashMap (which I actually believe is a ConcurrentHashMap, since if you build it in parallel it has to be thread-safe), you might want to investigate why that happens.

A simple but easy starting point would be to replace your HashMap with a TreeMap, whose performance is a deterministic function of its size, and compare the two.


If, on the other hand, I misinterpreted your question and you have the opportunity to scale the computation across multiple machines, there are plenty of interesting solutions on the market, as others have already pointed out, to which I would add Cassandra.

These solutions improve performance by distributing the load among multiple nodes, but inside each node they use well-known algorithms for fast and efficient addressing.



Answer 7:

It's not clear from the question and the follow-up discussion, but what is the nature of your queries? The two situations are very different:
a) working through all ~700 million URLs during each working day, or
b) hitting only some small subset of those ~700 million URLs.

So: what's the ratio of # of queries to the # of URLs?

From your descriptions, it sounds like you may be loading/unloading the different files representing different portions of your array... which suggests random queries, which suggests (b).

As well, I gather you've already recognized that "all-in-memory" isn't feasible (i.e. you've broken the array across multiple files), so an optimal disk-access algorithm seems to be the next order of business, no?

Have you tried, per query, a simple seek to offset (n * arrayElementSize) in the file and just reading a few pages into memory (do you have/know a maximum number of values per key)? You've already got (computed) the base index into your array, so this should be easy to prototype.
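A minimal sketch of that per-query positional read, with illustrative element sizes, file names, and an assumed upper bound on values per key:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Seek to slot * elementSize and pull in just a few elements per query,
// instead of loading the whole table.
public class SlotReader {
    private static final int ELEMENT_SIZE = Integer.BYTES; // one int value per slot
    private static final int MAX_VALUES_PER_KEY = 16;      // assumed upper bound

    static int[] readValues(FileChannel ch, long firstSlot) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(MAX_VALUES_PER_KEY * ELEMENT_SIZE)
                                   .order(ByteOrder.nativeOrder());
        ch.read(buf, firstSlot * ELEMENT_SIZE);   // positional read, no seek state
        buf.flip();
        int[] values = new int[buf.remaining() / ELEMENT_SIZE];
        buf.asIntBuffer().get(values);
        return values;
    }

    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("values.dat"),
                StandardOpenOption.READ)) {
            int[] values = readValues(ch, 123_456L);  // base index computed from the hash
            System.out.println(values.length + " values read");
        }
    }
}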



Answer 8:

I would suggest you use Oracle Coherence Cache. You get all the benefits of a HashTable; it has all the methods that Map has.

Performance-wise, you can store data as per your requirements. Please have a look.
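For illustration only (assuming a recent Coherence release with a configured cache named urlCounts; names and types are made up), usage is Map-like:

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

// NamedCache implements java.util.Map, so the calls mirror a HashMap.
public class CoherenceExample {
    public static void main(String[] args) {
        CacheFactory.ensureCluster();                      // join/start the cluster
        NamedCache<Long, Integer> cache = CacheFactory.getCache("urlCounts");

        cache.put(123456789L, 42);                         // Map-style put
        Integer count = cache.get(123456789L);             // Map-style get
        System.out.println("count = " + count);

        CacheFactory.shutdown();
    }
}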



Answer 9:

You can try HugeCollections; I think it was written for this purpose:

HugeCollections
Library to support collections with millions or billions of entries.

specifically HugeMap



Answer 10:

Use the open-source SQLite in-memory database.



Answer 11:

If I understand you correctly, your data structure is not that big

(32 + 64) bits per entry * 600 million entries ≈ 57.6 billion bits, i.e. roughly a 7.2 GB structure in memory

The map data structure would consume some space too. I've found out the hard way that Trove is one of the most memory-efficient collection libraries around. I'd use a TLongIntHashMap to store long keys and int values. It stores raw primitives, so you bypass the Long and Integer wrapper objects.
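A minimal sketch of that, assuming Trove 3.x on the classpath (capacity and values below are illustrative):

import gnu.trove.map.hash.TLongIntHashMap;

// Primitive long -> int map from Trove; no boxing, so far less heap overhead
// than HashMap<Long, Integer>.
public class TroveExample {
    public static void main(String[] args) {
        // initial capacity, load factor, "no entry" key and value sentinels
        TLongIntHashMap counts = new TLongIntHashMap(1_000_000, 0.5f, -1L, -1);

        counts.put(123456789L, 1);
        counts.adjustOrPutValue(123456789L, 1, 1);   // increment, or insert 1 if absent

        int count = counts.get(123456789L);          // returns -1 (noEntryValue) if missing
        System.out.println("count = " + count);
    }
}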



Answer 12:

It seems you have a mostly read-only dataset that does not fit in memory, and you need fast key lookups. I am afraid there is no silver-bullet solution here, only a few possible tradeoffs.

If you access the 600M records all over the place, then no matter what you do, you are going to be limited by disk random-access speed (not sequential access speed). Use FileChannel.map to access the file directly (no, don't read the contents of the file into memory; just operate on the MappedByteBuffer, and your OS will take care of caching for you). Investing in an SSD looks like a good way to spend money (or maybe just buy some more memory?).

This is a campus environment, right? Maybe you can use computers in a lab to build a memcached/Redis/etc. cluster? Maybe you could use it off-hours?

If you access identifiable pieces of data at the same time (i.e. now we analyze domain a, then b, etc.), then splitting the data into buckets is a good idea: keep the related data physically close to help caching. Or maybe pre-sort the URLs and access them in binary-search fashion?

If some probability of collisions is acceptable, maybe storing only 64-bit hashes of the URLs, rather than the full URLs, as keys is acceptable? With some gymnastics you could probably get away with not storing the keys at all.
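As one possible illustration of the 64-bit hashing idea (FNV-1a is chosen here only as an example; the answer does not prescribe a particular hash):

import java.nio.charset.StandardCharsets;

// Reduce a URL to a 64-bit key with FNV-1a. Collisions are possible but rare
// for well-distributed 64-bit hashes.
public class UrlHash64 {
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    static long hash64(String url) {
        long hash = FNV_OFFSET_BASIS;
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= FNV_PRIME;
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.printf("%016x%n", hash64("http://example.com/some/page"));
    }
}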

Those are my ideas for the moment.