POI XSSF / XLSX hashing indeterminism with Message

2020-06-06 02:14发布

问题:

There seems to be a problem with getting deterministic hash values for the POI XLSX format, with MessageDigest SHA-256 implementation, even for empty ByteArray streams. This happens randomly, after several hundreds or even only thousands of iterations.

The relevant code snippets used to reproduce the problem:

// TestNG FileTest:
@Test(enabled = true) // indeterminism at random iterations, such as 400 or 1290
public void emptyXLSXTest() throws IOException, NoSuchAlgorithmException {
    final Hasher hasher = new HasherImpl();
    boolean differentSHA256Hash = false;
    for (int i = 0; i < 10000; i++) {
        final ByteArrayOutputStream excelAdHoc1 = BusinessPlanInMemory.getEmptyExcel("xlsx");
        final ByteArrayOutputStream excelAdHoc2 = BusinessPlanInMemory.getEmptyExcel("xlsx");

        byte[] expectedByteArray = excelAdHoc1.toByteArray();
String expectedSha256 = hasher.sha256(expectedByteArray);
byte[] actualByteArray = excelAdHoc2.toByteArray();
String actualSha256 = hasher.sha256(actualByteArray);

if (!expectedSha256.equals(actualSha256)) {
            differentSHA256Hash = true;
            System.out.println("ITERATION: " + i);
            System.out.println("EXPECTED HASH: " + expectedSha256);
            System.out.println("ACTUAL HASH: " + actualSha256);
            break;
        }
    }
    Assert.assertTrue(differentSHA256Hash, "Indeterminism did not occur");
}

Referenced Hasher and POI code:

// HasherImpl class:
public String sha256(final InputStream stream) throws IOException, NoSuchAlgorithmException {
    final MessageDigest digest = MessageDigest.getInstance("SHA-256");
    final byte[] bytesBuffer = new byte[300000]; 
    int bytesRead = -1;
    while ((bytesRead = stream.read(bytesBuffer)) != -1) {
        digest.update(bytesBuffer, 0, bytesRead);
    }
    final byte[] hashedBytes = digest.digest();
    return bytesToHex(hashedBytes);
}

Tried to eliminate indeterminism due to meta data like creation time, to no avail:

// POI BusinessPlanInMemory helper class:
public static ByteArrayOutputStream getEmptyExcel(final String fileextension) throws IOException {
    Workbook wb;

    if (fileextension.equals("xls")) {
        wb = new HSSFWorkbook();
    }
    else {
        wb = new XSSFWorkbook();
        final POIXMLProperties props = ((XSSFWorkbook) wb).getProperties();
        final POIXMLProperties.CoreProperties coreProp = props.getCoreProperties();
        coreProp.setCreated("");
        coreProp.setIdentifier("1");
        coreProp.setModified("");
    }

    wb.createSheet();

    final ByteArrayOutputStream excelStream = new ByteArrayOutputStream();
    wb.write(excelStream);
    wb.close();
    return excelStream;
}

The HSSF / XLS format seems not to be affected by the problem described. Does anybody have a clue, what could be causing this, if not a bug in POI itself? Basically, the code above refers to https://poi.apache.org/spreadsheet/examples.htmlBusinessPlan example

Thanks for your input!

回答1:

This is not a definitive answer but this is my suspicion what happens:

docx and xlsx file formats are basically a bunch of zipped-up xml-files. This can easily be seen when renaming them to .zip and opening with your favorite zip-tool.

When examining a file created by word I noticed that the change-timestamp of all files contained in the archive is always 1980-01-01 00:00:00 while in those created with POI it will show the actual timestamp the file was created.

So my I suspect that your problem occurs when there is a timestamp-difference between one or more of the files in excelAdHoc1 and excelAdHoc2. This might happen when the clock switches to the next second while creating one or the other file.

This would not affect XLS-files since the HSSF-format is not of the "zipped xml"-type and thus does not contain any nested files that might have different timestamps.

To change the timestamps after writing the file you could try using the `java.util.zip``-package. I haven't tested it but this should do the trick:

ZipFile file = new ZipFile(pathToFile);
Enumeration<ZipEntry> e = file.entries();
while(e.hasMoreElements()) {
    ZipEntry entry = e.nextElement();
    entry.setTime(0L);
}


回答2:

Ok, I've found a way to reset all XSLX file entry file time attributes, according to some example found here at SO. Unfortunately, only file entries seem to be accessible by methods like via ZipFile or OPCPackage. I could not find a solution to also access and reset the folders inside the archive, which also have differing time attributes.

So far, I didn't succeed in eliminating the differing properties of the POI-generated XLSX-archives, to get same SHA256 hashes out of two otherwise identical files, for which the differing attributes seem to be the reason.

private void resetOPCPTimeAttributes(File file)
        throws InvalidFormatException, IOException, OpenXML4JException, XmlException {

    OPCPackage opcp = ZipPackage.open(file);
    resetZipfileContentTimeAttributes(opcp.getParts());

    opcp.flush();
    opcp.close();
}

private void resetZipfileContentTimeAttributes(List<PackagePart> parts) throws InvalidFormatException {

    ArrayList<PackagePart> subParts = null;
    for (PackagePart part: parts) {

        PackageProperties props = part.getPackage().getPackageProperties();
        props.setLastModifiedByProperty("");
        props.setCreatedProperty("");
        props.setModifiedProperty("");

        subParts = part.getPackage().getParts();

        while (subParts != null) {
            resetZipfileContentTimeAttributes(subParts);
        }
    }
}

Edit:

In the meantime (until I or someone else finds a solution for manipulating folder meta data inside the Zip archive), I've switched to the deep comparing solution here: Comparing XLSX files