Advice on handling large data volumes

Posted 2019-01-22 09:58

So I have a "large" number of "very large" ASCII files of numerical data (gigabytes altogether), and my program will need to process the entirety of it sequentially at least once.

Any advice on storing/loading the data? I've thought of converting the files to binary to make them smaller and for faster loading.

Should I load everything into memory all at once?
If not, what's a good way of loading the data partially?
What are some Java-relevant efficiency tips?

11 answers
Melony?
#2 · 2019-01-22 10:34

You might want to have a look at the entries in the Wide Finder Project (do a Google search for "wide finder" java).

The Wide Finder project involves reading through lots of lines in log files, so look at the Java implementations and see what worked there and what didn't.

放荡不羁爱自由
#3 · 2019-01-22 10:36

You really haven't given us enough info to help you. Do you need to load each file in its entirety in order to process it? Or can you process it line by line?

Loading an entire file at a time is likely to result in poor performance even for files that aren't terribly large. Your best bet is to define a buffer size that works for you and read/process the data a buffer at a time.
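As a rough illustration of reading a buffer at a time, here is a minimal sketch in Java. It assumes the files hold whitespace-separated decimal numbers and that the processing is a simple running sum; the class name `StreamingSum`, the method `sum`, and the 64 KiB buffer size are all made up for the example, so tune them to your own data.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class StreamingSum {
    // Reads whitespace-separated numbers one line at a time, so memory use is
    // bounded by the reader's buffer plus the longest line, not the file size.
    // bufferSizeChars is an assumed tuning knob, not a recommended value.
    static double sum(Reader source, int bufferSizeChars) throws IOException {
        try (BufferedReader in = new BufferedReader(source, bufferSizeChars)) {
            double total = 0.0;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                for (String token : line.split("\\s+")) {
                    total += Double.parseDouble(token);
                }
            }
            return total;
        }
    }

    public static void main(String[] args) throws IOException {
        // For a real file, pass Files.newBufferedReader(path) instead.
        System.out.println(sum(new StringReader("1.5 2.5\n3\n"), 1 << 16));
    }
}
```

Because `sum` takes any `Reader`, the same loop works on a file, a socket, or a decompressing stream without changes.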

可以哭但决不认输i
#4 · 2019-01-22 10:37

Without any additional insight into what kind of processing is going on, here are some general thoughts from when I have done similar work.

  1. Write a prototype of your application (maybe even "one to throw away") that performs some arbitrary operation on your data set. See how fast it goes. If the simplest, most naive thing you can think of is acceptably fast, no worries!

  2. If the naive approach does not work, consider pre-processing the data so that subsequent runs will run in an acceptable length of time. You mention having to "jump around" in the data set quite a bit. Is there any way to pre-process that out? Or, one pre-processing step can be to generate even more data - index data - that provides byte-accurate location information about critical, necessary sections of your data set. Then, your main processing run can utilize this information to jump straight to the necessary data.
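The index idea from step 2 could be sketched roughly as follows. This is only one possible shape, assuming newline-terminated ASCII records; the class `LineOffsetIndex`, its methods, and the `stride` parameter are hypothetical names for the example. One sequential pass records the byte offset where every `stride`-th line begins, and later runs seek straight to an offset with `RandomAccessFile`.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class LineOffsetIndex {
    // One sequential pass: record the byte offset at which every
    // stride-th line starts (line 0, line stride, line 2*stride, ...).
    static List<Long> build(Path file, int stride) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;        // byte offset of the byte about to be read
            long lineNo = 0;     // zero-based number of the current line
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    if (lineNo % stride == 0) {
                        offsets.add(pos);
                    }
                    atLineStart = false;
                }
                pos++;
                if (b == '\n') {
                    lineNo++;
                    atLineStart = true;
                }
            }
        }
        return offsets;
    }

    // Later runs: jump directly to an indexed line instead of rescanning.
    static String readIndexedLine(Path file, List<Long> offsets, int entry) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offsets.get(entry));
            return raf.readLine(); // adequate for ASCII data
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        List<Long> index = build(file, 1000);
        System.out.println("indexed " + index.size() + " positions");
    }
}
```

In practice the offsets would be written to a small side file so the indexing pass runs only once; that part is left out here to keep the sketch short.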

So, to summarize, my approach would be to try something simple right now and see what the performance looks like. Maybe it will be fine. Otherwise, look into processing the data in multiple steps, saving the most expensive operations for infrequent pre-processing.

Don't "load everything into memory". Just perform file accesses and let the operating system's disk page cache decide when you get to actually pull things directly out of memory.

Viruses.
#5 · 2019-01-22 10:40

If you need to access the data more than once, load it into a database. Most databases have some sort of bulk loading utility. If the data can all fit in memory, and you don't need to keep it around or access it that often, you can probably write something simple in Perl or your favorite scripting language.

迷人小祖宗
#6 · 2019-01-22 10:49

If your numerical data is regularly sampled and you need random access, consider storing it in a quadtree.

小情绪 Triste *
#7 · 2019-01-22 10:51

You could convert to binary, but then you have more than one copy of the data (the original plus the binary version) if you need to keep the original around.
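For what the conversion itself might look like, here is a hedged sketch. It assumes every value parses as a `double` and stores each one as a raw 8-byte big-endian value via `DataOutputStream`; the class `AsciiToBinary` and its methods are invented for the example. Note that this is not guaranteed to shrink the data: a short value such as `1` takes two bytes as text but eight as a `double`, so measure on your own files.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.StringTokenizer;

public class AsciiToBinary {
    // Streams whitespace-separated numbers from asciiIn and writes each one
    // to binaryOut as an 8-byte double. Returns how many values were written.
    static long convert(Path asciiIn, Path binaryOut) throws IOException {
        long count = 0;
        try (BufferedReader in = Files.newBufferedReader(asciiIn, StandardCharsets.US_ASCII);
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(Files.newOutputStream(binaryOut)))) {
            String line;
            while ((line = in.readLine()) != null) {
                StringTokenizer tokens = new StringTokenizer(line);
                while (tokens.hasMoreTokens()) {
                    out.writeDouble(Double.parseDouble(tokens.nextToken()));
                    count++;
                }
            }
        }
        return count;
    }

    // Reads the binary file back sequentially; no text parsing is needed.
    static double sumBinary(Path binaryIn) throws IOException {
        long n = Files.size(binaryIn) / Double.BYTES;
        double total = 0.0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(binaryIn)))) {
            for (long i = 0; i < n; i++) {
                total += in.readDouble();
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        long n = convert(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("wrote " + n + " values");
    }
}
```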

It may be practical to build some kind of index on top of your original ASCII data, so that if you need to go through the data again you can do it faster on subsequent passes.

To answer your questions in order:

Should I load everything into memory all at once?

Not if you don't have to. For some files you may be able to, but if you're just processing sequentially, do some kind of buffered read through the files one by one, storing whatever you need along the way.

If not, what's a good way of loading the data partially?

BufferedReader and similar classes are the simplest option, although you could look deeper into FileChannel and related classes to use memory-mapped I/O to go through the data one window at a time.
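A minimal sketch of the memory-mapped variant, assuming a fixed window size chosen by the caller: `FileChannel.map` maps one read-only window at a time, and the loop walks the bytes of each window. The class `MappedWindows`, the method `countNewlines`, and the idea of counting newlines are placeholders for whatever per-byte or per-record work you actually do.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedWindows {
    // Counts newline bytes by mapping the file one window at a time.
    // windowBytes must be positive and at most Integer.MAX_VALUE, because a
    // single mapping cannot be larger than that.
    static long countNewlines(Path file, long windowBytes) throws IOException {
        long count = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            for (long start = 0; start < size; start += windowBytes) {
                long length = Math.min(windowBytes, size - start);
                MappedByteBuffer window =
                        channel.map(FileChannel.MapMode.READ_ONLY, start, length);
                while (window.hasRemaining()) {
                    if (window.get() == '\n') {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // 64 MiB windows are an arbitrary choice for the example.
        System.out.println(countNewlines(Paths.get(args[0]), 64L * 1024 * 1024));
    }
}
```

Counting bytes is insensitive to window boundaries, but parsing real records is not: a number can straddle two windows, so a real parser would need to carry the partial record over to the next window.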

What are some Java-relevant efficiency tips?

That really depends on what you're doing with the data itself!
