This question was similar in that it dealt with hidden file types. I am struggling with a related problem: I need to process only files that contain text, in folders that hold many different file types (pictures, text, music). I am using os.walk, which lists EVERYTHING, including files without an extension, such as Icon files. I am on Linux and would be satisfied to filter for only txt files. One way is to check the filename extension, and this post explains nicely how that is done.
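For reference, the extension check I have in mind looks roughly like the sketch below. The function name `iter_txt_files` is just something I made up for illustration; `os.walk` and `os.path.splitext` are the standard library calls.

```python
import os

def iter_txt_files(root):
    """Yield paths under root whose extension is .txt (case-insensitive)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # splitext gives an empty extension for names like 'Icon',
            # so extensionless files are skipped here.
            _base, ext = os.path.splitext(name)
            if ext.lower() == ".txt":
                yield os.path.join(dirpath, name)
```

This only trusts the name, so a mislabeled file still passes and a text file without an extension is still dropped.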
But this still leaves mislabeled files and files without an extension. There are hex values, known as magic numbers or file signatures, that uniquely identify file types (see here and here). Unfortunately, magic numbers do not exist for text files (see here).
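Magic numbers can at least be used the other way round, to rule out files that are clearly not text. A minimal sketch, assuming a small hand-picked table of well-known signatures (the table and the name `sniff_signature` are my own illustration, not a complete list):

```python
# A few widely documented file signatures; far from exhaustive.
KNOWN_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip",
    b"ID3": "mp3 with ID3v2 tag",
}

def sniff_signature(path):
    """Return a label if the file starts with a known signature, else None."""
    with open(path, "rb") as fh:
        head = fh.read(8)  # the longest signature in the table is 8 bytes
    for magic, kind in KNOWN_SIGNATURES.items():
        if head.startswith(magic):
            return kind
    return None
```

A result of `None` only means "not one of the listed binary formats"; it does not prove the file is text, and a text file that happens to begin with `ID3` or `%PDF-` would be misclassified.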
One strategy I have come up with is to parse the first bunch of characters and check that they are words by doing a dictionary lookup (I am only dealing with English texts), then proceed with the full text processing only if that check passes. This approach seems rather heavy and expensive (a bunch of dictionary lookups for each file). Another approach is simply to look for the word 'the', which is unlikely to be frequent in a data file but is commonly found in text files. But false negatives would cause me to lose text files for processing. I tried asking Google for the longest text without the word 'the' but had no luck with that.
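To make the dictionary idea concrete, here is a rough sketch of what I mean. The names `load_vocabulary` and `looks_like_english`, the 2048-byte sample size, and the 0.6 threshold are all assumptions I picked for illustration, as is the word list path `/usr/share/dict/words`, which exists on many Linux systems but not all. Holding the word list in a set keeps each lookup cheap, and only the start of each file is read.

```python
import re

def load_vocabulary(path="/usr/share/dict/words"):
    """Load a word list into a set (path is an assumption, one word per line)."""
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return {line.strip().lower() for line in fh if line.strip()}

def looks_like_english(path, vocabulary, sample_bytes=2048, min_ratio=0.6):
    """Guess whether a file is English text from the words in its first bytes."""
    with open(path, "rb") as fh:
        sample = fh.read(sample_bytes)
    if b"\x00" in sample:
        # NUL bytes are a common sign of binary data.
        return False
    text = sample.decode("utf-8", errors="ignore").lower()
    words = re.findall(r"[a-z]+", text)
    if not words:
        return False
    hits = sum(1 for word in words if word in vocabulary)
    # Accept the file if enough of the sampled tokens are known words.
    return hits / len(words) >= min_ratio
```

Because this uses a ratio rather than requiring a single word like 'the', an inventory list with no articles could still pass. Source code also contains many English words, so the threshold would need tuning against real files.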
I do not know if this is the appropriate forum for this kind of question; it is almost a question of AI rather than computer science or coding. It is not as difficult as gibberish detection. The texts may not be semantically or syntactically correct: they might just be lists of words, like the inventory of a stockroom, but they might also be prose or poetry. I just do not want to process files that could be byte code, source code, or collections of alphanumeric characters that are not English words.