Reading zip file efficiently in Java

2019-02-18 00:05发布

问题:

I working on a project which works on a very large amount of data. I have a lot(thousands) of zip files, each containing ONE simple txt file with thousands of lines(about 80k lines). What I am currently doing is the following:

for(File zipFile: dir.listFiles()){
ZipFile zf = new ZipFile(zipFile);
ZipEntry ze = (ZipEntry) zf.entries().nextElement();
BufferedReader in = new BufferedReader(new InputStreamReader(zf.getInputStream(ze)));
...

In this way I can read the file line by line, but it is definetely too slow. Given the large number of files and lines that need to be read, I need to read them in a more efficient way.

I have looked for a different approach, but I haven't been able to find anything. What I think I should use are the java nio APIs intended right for intensive I/O operations, but I don't know how to use them with zip files.

Any help would really be appreciated.

Thanks,

Marco

回答1:

I have a lot(thousands) of zip files. The zipped files are about 30MB each, while the txt inside the zip file is about 60/70 MB. Reading and processing the files with this code takes a lot of hours, around 15, but it depends.

Let's do some back-of-the-envelope calculations.

Let's say you have 5000 files. If it takes 15 hours to process them, this equates to ~10 seconds per file. The files are about 30MB each, so the throughput is ~3MB/s.

This is between one and two orders of magnitude slower than the rate at which ZipFile can decompress stuff.

Either there's a problem with the disks (are they local, or a network share?), or it is the actual processing that is taking most of the time.

The best way to find out for sure is by using a profiler.



回答2:

The right way to iterate a zip file

final ZipFile file = new ZipFile( FILE_NAME );
try
{
final Enumeration<? extends ZipEntry> entries = file.entries();
while ( entries.hasMoreElements() )
{
    final ZipEntry entry = entries.nextElement();
    System.out.println( entry.getName() );
    //use entry input stream:
    readInputStream( file.getInputStream( entry ) )
}
}
finally
{
file.close();
}

private static int readInputStream( final InputStream is ) throws IOException {
final byte[] buf = new byte[ 8192 ];
int read = 0;
int cntRead;
while ( ( cntRead = is.read( buf, 0, buf.length ) ) >=0  )
{
    read += cntRead;
}
return read;
}

Zip file consists of several entries, each of them has a field containing the number of bytes in the current entry. So, it is easy to iterate all zip file entries without actual data decompression. java.util.zip.ZipFile accepts a file/file name and uses random access to jump between file positions. java.util.zip.ZipInputStream, on the other hand, is working with streams, so it is unable to freely jump. That’s why it has to read and decompress all zip data in order to reach EOF for each entry and read the next entry header.

What does it mean? If you already have a zip file in your file system – use ZipFile to process it regardless of your task. As a bonus, you can access zip entries either sequentially or randomly (with rather small performance penalty). On the other hand, if you are processing a stream, you’ll need to process all entries sequentially using ZipInputStream.

Here is an example. A zip archive (total file size = 1.6Gb) containing three 0.6Gb entries was iterated in 0.05 sec using ZipFile and in 18 sec using ZipInputStream.



回答3:

You can use the new file API like this:

Path jarPath = Paths.get(...);
try (FileSystem jarFS = FileSystems.newFileSystem(jarPath, null)) {
    Path someFileInJarPath = jarFS.getPath("/...");
    try (ReadableByteChannel rbc = Files.newByteChannel(someFileInJarPath, EnumSet.of(StandardOpenOption.READ))) {
        // read file
    }
}

The code is for jar files, but I think it should work for zips as well.



回答4:

You can try this code

try
    {

        final ZipFile zf = new ZipFile("C:/Documents and Settings/satheesh/Desktop/POTL.Zip");

        final Enumeration<? extends ZipEntry> entries = zf.entries();
        ZipInputStream zipInput = null;

        while (entries.hasMoreElements())
        {
            final ZipEntry zipEntry=entries.nextElement();
            final String fileName = zipEntry.getName();
        // zipInput = new ZipInputStream(new FileInputStream(fileName));
            InputStream inputs=zf.getInputStream(zipEntry);
            //  final RandomAccessFile br = new RandomAccessFile(fileName, "r");
                BufferedReader br = new BufferedReader(new InputStreamReader(inputs, "UTF-8"));
                FileWriter fr=new FileWriter(f2);
            BufferedWriter wr=new BufferedWriter(new FileWriter(f2) );

            while((line = br.readLine()) != null)
            {
                wr.write(line);
                System.out.println(line);
                wr.newLine();
                wr.flush();
            }
            br.close();
            zipInput.closeEntry();
        }


    }
    catch(Exception e)
    {
        System.out.print(e);
    }
    finally
    {
        System.out.println("\n\n\nThe had been extracted successfully");

    }

this code works in a good manner.



回答5:

Intel has made an improved version of zlib, which Java uses internally peroform zip/unzip. It requires you to patch zlib sources with Interl's IPP paches. I made a benchmark showing 1.4x to 3x gains in throughput.