Define a MIME type for .TXT files for Tika

Posted 2019-08-26 07:41

Question:

I want to define a dedicated MIME type, text/txt, for *.txt files so that Tika can apply a more specific parser than the one it uses for text/plain files.

The glob *.txt is already part of the definition of text/plain in tika-mimetypes.xml. As far as I can tell, custom-mimetypes.xml cannot redefine an existing MIME type; it can only add new globs or magic patterns. Moreover, if I define text/txt directly in tika-mimetypes.xml as a subtype of text/plain with *.txt as its only glob, Tika still detects a .txt file as text/plain.

Is it absurd to define a subtype of text/plain just for .txt files? If not, can it be defined using custom-mimetypes.xml alone? If that is not possible, what is the easiest way to extend Tika so that it parses .txt files differently from, say, STEP 3D CAD .stp files or .cfg files?

The use case in detail: I have a large data source made up of (recursive) archives. Some of the plain-text files in it are huge, and I don't want Tika to parse them. However, I do want to keep all the .txt files.

Edit: to be specific, I don't want to keep .cfg files either (*.cfg is also a glob of text/plain).
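
To make the intended behaviour concrete, here is a minimal sketch of the decision I am after, written in plain Java with no Tika dependency. The class `TxtFilter` and its methods are hypothetical names made up for illustration; they are not part of Tika. The sketch assumes the detected type is available as a string such as "text/plain" and that the file name is known for each archive entry.

```java
import java.util.Locale;

/** Hypothetical pre-filter illustrating the use case; not a Tika class. */
final class TxtFilter {

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    static String extensionOf(String fileName) {
        // Strip any directory part first so dots in folder names are ignored.
        int slash = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        String base = fileName.substring(slash + 1);
        int dot = base.lastIndexOf('.');
        return dot < 0 ? "" : base.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /**
     * Decides whether a file should be parsed.
     * Files detected as text/plain are kept only if they end in .txt
     * (so .stp, .cfg and other plain-text files are skipped);
     * every other detected type is left to normal handling.
     */
    static boolean shouldParse(String fileName, String detectedType) {
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String baseType = detectedType.split(";", 2)[0].trim();
        if (!"text/plain".equalsIgnoreCase(baseType)) {
            return true;
        }
        return "txt".equals(extensionOf(fileName));
    }

    public static void main(String[] args) {
        String[][] samples = {
            {"notes.TXT", "text/plain"},
            {"part.stp", "text/plain"},
            {"app.cfg", "text/plain"},
            {"doc.pdf", "application/pdf"},
        };
        for (String[] s : samples) {
            System.out.println(s[0] + " -> " + shouldParse(s[0], s[1]));
        }
    }
}
```

This only expresses the filtering rule on the file name. What I would prefer is for Tika itself to report a distinct type for .txt files, so that the same rule could be applied by choosing a parser per MIME type instead of inspecting names outside Tika.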