What are the correct content-types for XML, HTML and XHTML documents?
I need to write a simple crawler that only fetches these kinds of files.
Nowadays http://example.net/index.html can serve, for example, a JPEG file thanks to mod_rewrite, so I need to check the Content-Type in the response header and compare it against a list of allowed content types.
Where can I get such a list from?
HTML: text/html, full stop.
XHTML: application/xhtml+xml, or, only if following the HTML compatibility guidelines, text/html. See the W3 Media Types Note.
XML: text/xml, application/xml (RFC 2376).
There are also many other media types based around XML, for example application/rss+xml or image/svg+xml. It's a safe bet that any unrecognised but registered type ending in +xml is XML-based. See the IANA list for registered media types ending in +xml.
(For unregistered x- types, all bets are off, but you'd hope +xml would be respected.)
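For the crawler in the question, the check could look roughly like the sketch below. It is only an illustration: the names ALLOWED_TYPES, media_type and is_markup are made up here, and it assumes you already have the raw Content-Type header value as a string from whatever HTTP client you use. Parameters such as charset are stripped before comparing, and anything ending in +xml is accepted on the "safe bet" basis described above.

```python
# Exact media types named in the answer above.
ALLOWED_TYPES = {
    "text/html",
    "application/xhtml+xml",
    "text/xml",
    "application/xml",
}


def media_type(header_value: str) -> str:
    """Return the bare media type: drop parameters such as charset, lower-case it."""
    return header_value.split(";", 1)[0].strip().lower()


def is_markup(header_value: str) -> bool:
    """True if the Content-Type value is HTML, XHTML, XML, or any +xml type."""
    mt = media_type(header_value)
    # The +xml suffix rule is the "safe bet" heuristic, not an exhaustive list.
    return mt in ALLOWED_TYPES or mt.endswith("+xml")


if __name__ == "__main__":
    for value in ("text/html; charset=utf-8", "image/jpeg", "application/rss+xml"):
        print(value, "->", is_markup(value))
```

Note that the +xml rule also lets image/svg+xml through; if the crawler should skip SVG, drop the suffix check and list the wanted types explicitly instead.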