Having trouble deleting keys from a Solr collection for files.
Updating the Solr collection with this:
<cfoutput query="fileQuery">
<cfset theFile = defaultpath & "#fileID#.pdf" />
<cfif fileExists(theFile)>
<cfindex
action="update"
collection="file_vault_solr"
type="file"
key="#theFile#"
title="#documentName#"
body="fileNumber,documentName"
custom1="/filevault/#filealias#"
custom2="#fileNumber#"
custom3="#documentName#"
>
</cfif>
</cfoutput>
However, when attempting to delete the key from the catalog it simply doesn't work. Here's the code being used to (try to) delete the keys:
<cfoutput query="deletedFile">
<cfset theFile = defaultpath & "#fileID#.pdf" />
<!--- Remove the deleted file from the collection. --->
<cfindex
collection="file_vault_solr"
type="file"
action="Delete"
key="#theFile#"
>
</cfoutput>
The key is not deleted, however. The only thing that has worked has been to purge the whole catalog and re-index all of the documents.
Any insights?
After a lot of debugging I found the cause.
The reason for this behavior is a very… uh… unfortunate uhm… "design decision" Adobe took when implementing the interface between ColdFusion and Solr.
So you have a Solr collection of indexed files and want to selectively purge the ones that no longer exist on disk. I'm pretty sure that's the exact situation you've been in.
Let's assume:
- there is a file called /path/to/file on your system and
- it is indexed in the Solr collection foo.
When you issue a <cfindex collection="foo" action="delete" key="/path/to/file">, ColdFusion sends the following HTTP request to Solr:
POST /solr/foo/update?wt=xml&version=2.2 (application/xml; charset=UTF-8)
<delete><id>1247603285</id></delete>
This is a perfectly reasonable request that Solr will happily fulfill. The only strange thing is the number in the <id>. In any case, the file will be gone from the index after this operation.
Re-index the file and delete it from disk. Now:
- there no longer is a file called /path/to/file on your system, but
- it is still indexed in the Solr collection foo.
Let's do the same <cfindex action="delete"> operation again.
POST /solr/foo/update?wt=xml&version=2.2 (application/xml; charset=UTF-8)
<delete><id>/path/to/file</id></delete>
Huh? Shouldn't there be a number in the ID?
As it turns out, someone at Adobe thought it would be a jolly smart idea to use numbers for unique IDs of indexed files, to, uhhh, save space, I assume.
However, for some inexplicable reason this only happens when the file in question still exists. If it no longer exists, ColdFusion will notice and pass the path instead.
Inspecting the number reveals that it would fit into a 32 bit signed integer value. (I've checked, there are plenty of negative values in the uid field of the collection.)
So this looks as if they use some kind of hashing algorithm that returns 32 bits and chuck that into an int. CRC32 springs to mind, but that's not it. Also, java.util.zip.CRC32 returns a long, so there wouldn't be any negative values in the first place.
The other readily available 32 bit hash in Java is ... java.lang.Object.hashCode(), which java.lang.String overrides.
Bingo.
"/path/to/file".hashCode() // -> 1247603285
So the solution is to never delete a file by its path, but always like this:
<cfindex collection="foo" action="delete" key="#path.hashCode()#">
For files that no longer exist this does the right thing.
More importantly: For files that still exist this does the right thing as well - ColdFusion would have sent the hash code anyway.
Until Adobe fixes this problem this is a safe and easy work-around.
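The hash-code claim is easy to check outside ColdFusion. The following Java sketch recomputes the documented String.hashCode() formula by hand, compares it with the built-in method, and contrasts it with java.util.zip.CRC32, whose getValue() returns a non-negative long. The class and method names are made up for illustration, and feeding UTF-8 bytes to CRC32 is an assumption. It shows nothing about ColdFusion's internals beyond what the captured requests suggest.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

final class SolrKeyHash {
    // Re-implements the documented String.hashCode() formula:
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], using wrapping int arithmetic.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // CRC32 of the path's bytes (UTF-8 is an assumption); getValue() returns a
    // long holding an unsigned 32 bit checksum, so it is never negative.
    static long crc32(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        String path = "/path/to/file";
        // 1247603285, the number seen in the <id> of the captured request
        System.out.println("hashCode : " + path.hashCode());
        System.out.println("manual   : " + manualHash(path));
        System.out.println("CRC32    : " + crc32(path));
    }
}
```

Because the result is a signed int, other paths can just as well produce negative ids, which fits the negative values seen in the uid field.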
Note that the file path is case sensitive and must match exactly with the one stored in the index.
A quick <cfsearch collection="foo" name="foo"> without any criteria will return all index entries, so retrieving the exact path of orphaned entries is not a big problem.
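Because the id that ends up in Solr is a hash of the path string, even a one-character difference in case produces an unrelated id. A small Java sketch of this, assuming (as the captured requests suggest) that the id is simply the decimal form of the path's hashCode(). The class name, the helper and the mistyped sample path are hypothetical:

```java
final class CaseSensitiveKey {
    // Hypothetical helper: the value to pass as the cfindex key, assuming the
    // stored id is String.hashCode() of the indexed path, written as a decimal.
    static String deleteKey(String indexedPath) {
        return Integer.toString(indexedPath.hashCode());
    }

    public static void main(String[] args) {
        String indexed = "/path/to/file";   // path exactly as stored in the index
        String mistyped = "/path/to/File";  // same name, different case
        System.out.println(indexed + " -> " + deleteKey(indexed));
        System.out.println(mistyped + " -> " + deleteKey(mistyped));
        System.out.println("same key? " + deleteKey(indexed).equals(deleteKey(mistyped)));
    }
}
```

This is why it is worth copying the path verbatim from the search result rather than rebuilding it from other data.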
Eric Lippert explains object hash codes and why it is a bad idea to use them for anything "practical" in an application. It's a .NET article, but it applies to Java just as well.
It boils down to: Adobe should store the actual path in the Solr collection and leave the performance optimization they seem to have attempted to Solr.
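One concrete reason hash codes make poor unique ids is that distinct strings can share one. The Java sketch below takes the well-known pair "Aa" and "BB", which hash to the same value, and embeds them in two made-up file paths. Whether ColdFusion would actually mix up two such documents is not something confirmed here. The example only shows that the ids would coincide.

```java
final class HashCollision {
    // True when two strings produce the same 32 bit String.hashCode().
    static boolean sameHash(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        // Hypothetical paths; "Aa" and "BB" both hash to 2112, and a shared
        // prefix and suffix keep the overall hash codes equal.
        String first = "/files/Aa.pdf";
        String second = "/files/BB.pdf";
        System.out.println("equal paths? " + first.equals(second));
        System.out.println("equal hash?  " + sameHash(first, second));
        System.out.println(first.hashCode() + " / " + second.hashCode());
    }
}
```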
I've filed Bug 3589991 against Adobe's ColdFusion bug database.
The key has to match exactly what is in Solr's index. So ensure that "defaultpath" is the same in both and check that the case matches as I believe Solr is case sensitive.
To debug this, I would suggest adding status="myStatusVar" to the cfindex call, on both the add and the delete, and then inspecting it to see what is going on. If the delete is not returning a deleted count, there is a key mismatch.
<cfindex
collection="file_vault_solr"
type="file"
action="Delete"
key="#theFile#"
status="myStatusVar"
>