So, I'm trying to do read-only access of a zip file in Java, decompressing it in a multithreaded manner, because my standard simple single-threaded solution of ZipFile/ZipEntry, using the enumeration and input streams and what-not, takes about five full seconds just to decompress a 50-meg zipfile into memory, when it takes one second AT MOST for my disk to read it without decompressing.
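(For reference, the single-threaded version I'm talking about is essentially the following sketch; the per-entry processing is elided:)

// imports: java.io.*, java.util.*, java.util.zip.*
try (ZipFile zip = new ZipFile("/path/to/file.zip")) {
    Enumeration<? extends ZipEntry> entries = zip.entries();
    while (entries.hasMoreElements()) {
        ZipEntry entry = entries.nextElement();
        try (InputStream in = zip.getInputStream(entry)) {
            byte[] data = in.readAllBytes(); // decompress this entry into memory
            // ... use data ...
        }
    }
}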
However, the entire Java zip library is synchronized to an incredibly obnoxious degree, no doubt because it's all abstracted for reading/writing/etc. in the same code instead of having nice, efficient, non-synchronized read-only code.
I've looked at third-party Java libraries, and all of them are either massive VFS libraries that are worse than using an elephant gun to shoot a fly, or else the only reason they have a performance benefit is they multithread to the extent that most of the threads are blocking on disk IO anyway.
All I want to do is pull a zipfile into a byte[], fork some threads, and work on it. There's no reason any synchronization would be needed in any way for anything, because I use each of the unzipped files separately in memory, without interaction.
Why must this be so difficult?
The fastest way to achieve this with Java is to use NIO. You can map the file directly into memory by using a MappedByteBuffer.
// imports: java.nio.MappedByteBuffer, java.nio.channels.FileChannel,
//          java.nio.file.Paths, java.nio.file.StandardOpenOption
FileChannel channel = FileChannel.open(Paths.get("/path/to/zip"),
                                       StandardOpenOption.READ);
MappedByteBuffer buffer =
        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
Now buffer contains a memory-mapped region of your whole file. You can do whatever you want with it, for example pass an offset and a length to a thread. I don't know which zip lib supports that, but apparently you have something like that already.
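For illustration, a hypothetical helper along these lines would give each worker its own independent, read-only view of one region (the helper name and parameters are my own):

// imports: java.nio.ByteBuffer, java.nio.MappedByteBuffer
// duplicate() shares the mapped memory but gives an independent
// position/limit, so concurrent read-only access needs no locking.
static ByteBuffer regionView(MappedByteBuffer whole, int offset, int length) {
    ByteBuffer view = whole.duplicate();
    view.position(offset);
    view.limit(offset + length);
    return view.slice(); // indices in the returned view start at 0
}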
FYI, I tested a bit with a 50 MB single-file archive and it took less than 200 ms on average to read it with a usual ZipInputStream - I think that you are trying to optimize pretty much nothing here.
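A measurement of that sort looks roughly like this (the buffer size and timing details here are my own choices, not the exact test):

// imports: java.io.*, java.util.zip.*
long start = System.currentTimeMillis();
try (ZipInputStream zin = new ZipInputStream(
        new BufferedInputStream(new FileInputStream("/path/to/zip")))) {
    byte[] buf = new byte[8192];
    while (zin.getNextEntry() != null) {
        while (zin.read(buf) != -1) {
            // decompress and discard; only the elapsed time matters here
        }
    }
}
System.out.println((System.currentTimeMillis() - start) + " ms");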
Just for posterity's sake, after some testing back and forth, the answer I finally ended up using goes as follows (with complete iterations starting from scratch with closed files in a while (true) loop):
Use DataInputStream.readFully to pull the entire (50 meg, in this case) zip file into a byte[].
Spawn worker threads (one per physical CPU core, 4 in my case) which each take that byte[] and create a ZipInputStream(ByteArrayInputStream). The first worker skips 0 entries, the second skips 1, the third skips 2, etc., so they're all offset from each other by one. The worker threads do not synchronize at all, so they all have their own local copies of the zip file's metadata and what-not. This is thread-safe because the zip file is read-only and the workers are not sharing decompressed data.
Each worker thread reads an entry and processes it, and then skips enough entries so that they all again are offset by one. So the first thread reads entries 0,4,8..., the second reads 1,5,9..., and so forth.
All the workers are pulled back in with .join(). (The sketch after this list puts the steps together.)
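In code, the whole scheme boils down to something like this (the method name is mine, error handling is trimmed, and readAllBytes needs Java 9+):

// imports: java.io.*, java.util.zip.*
static void unzipStriped(String path, int nThreads)
        throws IOException, InterruptedException {
    // Step 1: pull the whole zip file into a byte[].
    byte[] zipBytes = new byte[(int) new File(path).length()];
    try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
        in.readFully(zipBytes);
    }

    // Step 2: one worker per core, each with its own ZipInputStream over
    // the shared byte[]; worker t handles entries t, t+n, t+2n, ...
    Thread[] workers = new Thread[nThreads];
    for (int t = 0; t < nThreads; t++) {
        final int offset = t;
        workers[t] = new Thread(() -> {
            try (ZipInputStream zin =
                     new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
                for (int i = 0; i < offset; i++) {           // initial stagger
                    if (zin.getNextEntry() == null) return;
                }
                while (zin.getNextEntry() != null) {
                    byte[] data = zin.readAllBytes();        // decompress this entry
                    // ... process data ...
                    for (int i = 0; i < nThreads - 1; i++) { // keep the stagger
                        if (zin.getNextEntry() == null) return;
                    }
                }
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        });
        workers[t].start();
    }

    // Step 3: pull the workers back in.
    for (Thread w : workers) {
        w.join();
    }
}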
My times were as follows:
Reading the zip file into the byte[] with no unzipping at all (just the IO) gives an average of 0.1 sec for every iteration.
Using straight ZipFile directly on the underlying file as normal yields an initial spike of 0.5 sec, followed by an average of 0.26 sec for each iteration thereafter (starting fresh after closing the previous ZipFile).
Reading the ZipFile into a byte[] and creating a ZipInputStream(ByteArrayInputStream) with it, with no multithreading at all, results in an initial spike of 0.3 sec, followed by an average of 0.26 sec for each iteration thereafter, showing that disk caching was having an effect that rendered random access and initial read equivalent.
Reading the ZipFile into a byte[], spawning 4 worker threads with that byte[] as described above, and waiting for them to finish brought the time back down to an average of 0.1 sec for every iteration.
So, the verdict is: by this method I have successfully brought the processing of a moderately sized zipfile on a moderately powerful computer down to the time it takes to simply physically read the file, with the additional decompression step no longer noticeable at all. Obviously this same method on a huge zip file with tens of thousands of entries would still yield a massive speedup.
Seems I wasn't trying to optimize away nothing, considering I reduced the processing time for my sample file (which is around the size of the biggest one I'll need to work with) to 38% of that of the simple single-threaded method.
Considering how incredibly well this hack-job worked, imagine the possible speedup with a native Java zip-reader class actually designed to do this without the synchronization built in.
As you noticed, all the methods in ZipFile are synchronized. But that only stops multiple threads from running at the same time on a single ZipFile instance; separate ZipFile instances opened on the same exact zipfile on disk do not block each other.
If you want multiple threads to read from the same zipfile in a scalable way, you must open one ZipFile instance per thread. That way, the per-instance lock in the ZipFile methods does not block all but one thread from reading from the zipfile at a time. It also means that when each thread closes the ZipFile after it's done reading, it closes its own instance, not a shared instance, so you don't get an exception on the second and subsequent close.
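A minimal sketch of that pattern (the helper name and how the entry names get divided up are mine):

// imports: java.io.*, java.util.*, java.util.zip.*
// Each worker thread calls this with its own share of the entry names.
static void readMyShare(String zipPath, List<String> myNames) throws IOException {
    try (ZipFile zip = new ZipFile(zipPath)) { // this thread's private instance
        for (String name : myNames) {
            try (InputStream in = zip.getInputStream(zip.getEntry(name))) {
                byte[] data = in.readAllBytes(); // decompress; no lock contention
                // ... process data ...
            }
        }
    } // closes only this thread's own instance
}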
Protip: if you really care about speed, you can get more performance by reading all the ZipEntry objects from the first ZipFile instance and sharing them with all threads, to avoid duplicating the work of reading the ZipEntry objects in each thread separately. A ZipEntry object is not tied to a specific ZipFile instance per se; ZipEntry just records metadata that will work with any ZipFile object representing the same zipfile that the ZipEntry came from.
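For example, something along these lines, where myIndex and nThreads are assumed to identify the worker and the thread count:

// imports: java.io.*, java.util.*, java.util.zip.*
// Read the entry metadata once, up front:
List<? extends ZipEntry> shared;
try (ZipFile first = new ZipFile("/path/to/zip")) {
    shared = Collections.list(first.entries());
}
// Then, inside each worker thread, reuse the shared entries with a
// private ZipFile instance; the entries work with any ZipFile opened
// on the same zipfile:
try (ZipFile own = new ZipFile("/path/to/zip")) {
    for (int j = myIndex; j < shared.size(); j += nThreads) {
        try (InputStream in = own.getInputStream(shared.get(j))) {
            byte[] data = in.readAllBytes();
            // ... process data ...
        }
    }
}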