What is the best way to add non-ASCII file names to a zip file using Java, in such a way that the files can be properly read in both Windows and Linux?
Here is one attempt, adapted from https://truezip.dev.java.net/tutorial-6.html#Example, which works in Windows Vista but fails in Ubuntu Hardy. In Hardy the file name is shown as abc-ЖДФ.txt in file-roller.
import java.io.IOException;
import java.io.PrintStream;
import de.schlichtherle.io.File;
import de.schlichtherle.io.FileOutputStream;
public class Main {
public static void main(final String[] args) throws IOException {
try {
PrintStream ps = new PrintStream(new FileOutputStream(
"outer.zip/abc-åäö.txt"));
try {
ps.println("The characters åäö works here though.");
} finally {
ps.close();
}
} finally {
File.umount();
}
}
}
Unlike java.util.zip, truezip allows specifying zip file encoding. Here's another sample, this time explicitly specifiying the encoding. Neither IBM437, UTF-8 nor ISO-8859-1 works in Linux. IBM437 works in Windows.
import java.io.IOException;
import de.schlichtherle.io.FileOutputStream;
import de.schlichtherle.util.zip.ZipEntry;
import de.schlichtherle.util.zip.ZipOutputStream;
public class Main {
public static void main(final String[] args) throws IOException {
for (String encoding : new String[] { "IBM437", "UTF-8", "ISO-8859-1" }) {
ZipOutputStream zipOutput = new ZipOutputStream(
new FileOutputStream(encoding + "-example.zip"), encoding);
ZipEntry entry = new ZipEntry("abc-åäö.txt");
zipOutput.putNextEntry(entry);
zipOutput.closeEntry();
zipOutput.close();
}
}
}
Non-ASCII file names are not reliable across ZIP implementations and are best avoided. There is no provision for storing a charset setting in ZIP files; clients tend to guess with 'the current system codepage', which is unlikely to be what you want. Many combinations of client and codepage can result in inaccessible files.
Sorry!
You can still use the Apache Commons implementation of the zip stream : http://commons.apache.org/compress/apidocs/org/apache/commons/compress/archivers/zip/ZipArchiveOutputStream.html#setEncoding%28java.lang.String%29
Calling setEncoding("UTF-8") on your stream should be enough.
In Zip files, according to the spec owned by PKWare, the encoding of file names and file comments is IBM437. In 2007 PKWare extended the spec to also allow UTF-8. This says nothing about the encoding of the files contained within the zip. Only the encoding of the filenames.
I think all tools and libraries (Java and non Java) support IBM437 (which is a superset of ASCII), and fewer tools and libraries support UTF-8. Some tools and libs support other code pages. For example if you zip something using WinRar on a computer running in Shanghai, you will get the Big5 code page. This is not "allowed" by the zip spec but it happens anyway.
The DotNetZip library for .NET does Unicode, but of course that doesn't help you if you are using Java!
Using the Java built-in support for ZIP, you will always get IBM437. If you want an archive with something other than IBM437, then use a third party library, or create a JAR.
From a quick look at the TrueZIP manual - they recommend the JAR format:
This probably means that the API is using the java.util.zip package for its implementation; that documentation states that it is still using a ZIP format from 1996. Unicode support wasn't added to the PKWARE .ZIP File Format Specification until 2006.
Miracles indeed happen, and Sun/Oracle did really fix the long-living bug/rfe:
Now it's possible to set up filename encodings upon creating the zip file/stream (requires Java 7).
Did it actually fail or was just a font issue? (e.g. font having different glyphs for those charcodes) I've seen similar issues in Windows where rendering "broke" because the font didn't support the charset but the data was actually intact and correct.