So I've installed Community 4.0.a and extended the mimetype list using mimetype-map.xml, as I did before in 3.4:
<alfresco-config area="mimetype-map">
   <config evaluator="string-compare" condition="Mimetype Map">
      <mimetypes>
         <mimetype mimetype="application/dita+xml" text="true" display="DITA">
            <extension default="true" display="DITA Topic">dita</extension>
            <extension default="true" display="DITA Map">ditamap</extension>
            <extension default="true" display="DITA Conditional Processing Profile">ditaval</extension>
         </mimetype>
etc...
But each time I import a DITA file, it is recognised either as an XML file or as PLAIN TEXT. I've dug into it, and it looks like the cause is Apache Tika, which analyzes the beginning of the file to determine its mimetype.
How do I bypass Tika so that my custom mimetype-map is used (from the code, it looks like Tika is triggered first, and if it finds something then it's game over)?
Do I have to extend Tika by writing my own parser?
The Mimetype matching logic in 4.0 has changed slightly, now that the content is available for detection, rather than just the filename. As part of this, if Tika is very sure about what a file is, then this will be preferred.
In most cases, this means that for common but incorrectly named files, Tika can help correct mistakes. For non-standard files, Tika will decline to offer a strong suggestion, and the Alfresco name-based matching will be used as before. (In cases where Tika and Alfresco differ on the canonical form of the mimetype, the Alfresco version is preferred.)
There are a small number of cases where the file type is actually a specialisation of a common type, and Tika knows about the parent type but not the specific one. In this case, Tika strongly suggests the parent type, and we've no way to realise that the new type added to Alfresco is based on it. (Tika has a hierarchy of mimetypes, while Alfresco just has a flat list.) For this small number of cases, Tika needs guiding too.
The usual fix is to report a Tika bug, and have the filetype added upstream. (For very custom types, you need to add a Tika custom-mimetypes.xml too, which defines the hierarchy + glob.)
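As a rough illustration of what such a file might look like, here is a minimal sketch using the DITA type from the question. It assumes that Tika picks up a file named custom-mimetypes.xml from org/apache/tika/mime/ on the classpath, and that application/xml is the right parent type for DITA. The globs are simply copied from the mimetype-map above, so check all of this against the Tika version you are running.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Assumed location: org/apache/tika/mime/custom-mimetypes.xml on the classpath -->
<mime-info>
  <mime-type type="application/dita+xml">
    <!-- The hierarchy: tells Tika this type is a specialisation of plain XML
         (assumption: application/xml is what Tika currently reports for these files) -->
    <sub-class-of type="application/xml"/>
    <_comment>DITA</_comment>
    <!-- The globs: filename patterns taken from the question's mimetype-map -->
    <glob pattern="*.dita"/>
    <glob pattern="*.ditamap"/>
    <glob pattern="*.ditaval"/>
  </mime-type>
</mime-info>
```

The sub-class-of entry is the important part. Without it, Tika has no reason to prefer the name-based match over the parent type it detected from the content.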
In this DITA case, I've opened TIKA-784 and added a provisional fix. This has now gone into Alfresco too.