What is the best way to add non-ASCII file names to a zip file using Java, in such a way that the files can be properly read in both Windows and Linux?
Here is one attempt, adapted from https://truezip.dev.java.net/tutorial-6.html#Example, which works in Windows Vista but fails in Ubuntu Hardy. In Hardy the file name is shown as abc-ЖДФ.txt in file-roller.
import java.io.IOException;
import java.io.PrintStream;
import de.schlichtherle.io.File;
import de.schlichtherle.io.FileOutputStream;
public class Main {
public static void main(final String[] args) throws IOException {
try {
PrintStream ps = new PrintStream(new FileOutputStream(
"outer.zip/abc-åäö.txt"));
try {
ps.println("The characters åäö works here though.");
} finally {
ps.close();
}
} finally {
File.umount();
}
}
}
Unlike java.util.zip, truezip allows specifying zip file encoding. Here's another sample, this time explicitly specifiying the encoding. Neither IBM437, UTF-8 nor ISO-8859-1 works in Linux. IBM437 works in Windows.
import java.io.IOException;
import de.schlichtherle.io.FileOutputStream;
import de.schlichtherle.util.zip.ZipEntry;
import de.schlichtherle.util.zip.ZipOutputStream;
public class Main {
public static void main(final String[] args) throws IOException {
for (String encoding : new String[] { "IBM437", "UTF-8", "ISO-8859-1" }) {
ZipOutputStream zipOutput = new ZipOutputStream(
new FileOutputStream(encoding + "-example.zip"), encoding);
ZipEntry entry = new ZipEntry("abc-åäö.txt");
zipOutput.putNextEntry(entry);
zipOutput.closeEntry();
zipOutput.close();
}
}
}
The encoding for the File-Entries in ZIP is originally specified as IBM Code Page 437. Many characters used in other languages are impossible to use that way.
The PKWARE-specification refers to the problem and adds a bit. But that is a later addition (from 2007, thanks to Cheeso for clearing that up, see comments). If that bit is set, the filename-entry have to be encoded in UTF-8. This extension is described in 'APPENDIX D - Language Encoding (EFS)', that is at the end of the linked document.
For Java it is a known bug, to get into trouble with non-ASCII-characters. See bug #4244499 and the high number of related bugs.
My colleague used as workaround URL-Encoding for the filenames before storing them into the ZIP and decoding after reading them. If you control both, storing and reading, that may be a workaround.
EDIT: At the bug someone suggests using the ZipOutputStream from Apache Ant as workaround. This implementation allows the specification of an encoding.