How to identify binary and text files using Python

I need to identify which files in a directory are binary and which are text.

I tried using mimetypes, but it isn't a good fit in my case because it can't identify the MIME type of every file, and I have some strange ones here... I just need to know: binary or text. Simple? But I couldn't find a solution...

Thanks

Tags: python text binary file-type

4 Answers

叛逆

Reply #2 · 2020-02-25 08:16

It might be possible to use libmagic to guess the MIME type of the file using python-magic. If you get back something in the "text/*" namespace, it is likely a text file, while anything else is likely a binary file.
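A minimal sketch of that idea, assuming the third-party python-magic package and the libmagic library are installed. The function names `mime_is_text` and `is_text_file` are made up for this example; `magic.from_file(path, mime=True)` is the python-magic call that returns a MIME type string.

```python
def mime_is_text(mime_type):
    """Return True if a MIME type string is in the text/* namespace."""
    return mime_type.startswith("text/")


def is_text_file(path):
    """Guess whether the file at `path` is text, based on libmagic's MIME type."""
    # Imported here so the helper above works even without python-magic installed.
    import magic  # third-party: python-magic (needs libmagic)

    return mime_is_text(magic.from_file(path, mime=True))
```

Note that some formats most people would call text may be reported outside "text/*" (for example "application/json"), so you may want to extend the check for your own files.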

做自己的国王

Reply #3 · 2020-02-25 08:27

Thanks everybody, I found a solution that suited my problem. I found this code at http://code.activestate.com/recipes/173220/ and I changed just a little piece to suit me.

It works fine.

# Updated for Python 3: the file is read as bytes, so string.maketrans
# and "from __future__ import division" are no longer needed.

def istext(filename):
    # Read the first 512 bytes in binary mode so nothing is decoded
    with open(filename, "rb") as f:
        s = f.read(512)
    # Printable ASCII plus newline, carriage return, tab and backspace
    text_characters = bytes(range(32, 127)) + b"\n\r\t\b"
    if not s:
        # Empty files are considered text
        return True
    if b"\0" in s:
        # Files with null bytes are likely binary
        return False
    # Get the non-text bytes by deleting every text character
    t = s.translate(None, text_characters)
    # If more than 30% non-text characters, then
    # this is considered a binary file
    if len(t) / len(s) > 0.30:
        return False
    return True

聊天终结者

Reply #4 · 2020-02-25 08:27

It's inherently not simple. There's no way of knowing for sure, although you can take a reasonably good guess in most cases.

Things you might like to do:

Look for known magic numbers in binary signatures
Look for the Unicode byte-order-mark at the start of the file
If the file is regularly 00 xx 00 xx 00 xx (for arbitrary xx) or vice versa, that's quite possibly UTF-16
Otherwise, look for zero bytes in the file; a file containing a zero byte is unlikely to be a text file in a single-byte encoding.

But it's all heuristic - it's quite possible to have a file which is a valid text file and a valid image file, for example. It would probably be nonsense as a text file, but legitimate in some encoding or other...
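The checks listed above could be strung together roughly like this. It is only a sketch: the magic-number table is a small illustrative sample, the 0.9 / 0.1 thresholds in the UTF-16 test and the 1024-byte sample size are arbitrary assumptions, treating an empty file as text is a choice rather than a rule, and the function names are invented for this example.

```python
import codecs

# A few well-known signatures; illustrative only, far from exhaustive.
MAGIC_NUMBERS = (
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\xff\xd8\xff",       # JPEG
    b"GIF87a", b"GIF89a",  # GIF
    b"%PDF-",              # PDF
    b"PK\x03\x04",         # ZIP (also docx, jar, ...)
    b"\x7fELF",            # ELF executables
    b"\x1f\x8b",           # gzip
)

# Unicode byte-order marks for UTF-8, UTF-16 and UTF-32.
BOMS = (
    codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE,
)


def looks_like_utf16(data):
    """Guess UTF-16 when one byte of each pair is almost always zero."""
    if len(data) < 4:
        return False
    even, odd = data[0::2], data[1::2]
    even_zero = even.count(0) / len(even)
    odd_zero = odd.count(0) / len(odd)
    # The 00 xx 00 xx pattern (or xx 00 xx 00); thresholds are assumptions.
    return (even_zero > 0.9 and odd_zero < 0.1) or (odd_zero > 0.9 and even_zero < 0.1)


def guess_is_text(data):
    """Return True if the bytes are probably text. Heuristic, not a proof."""
    if not data:
        return True  # assumption: empty input counts as text
    if data.startswith(MAGIC_NUMBERS):
        return False  # known binary signature
    if data.startswith(BOMS):
        return True  # Unicode byte-order mark
    if looks_like_utf16(data):
        return True
    # A zero byte is unlikely in single-byte-encoded text.
    return b"\x00" not in data


def guess_file_is_text(path, sample_size=1024):
    """Apply the heuristic to the first `sample_size` bytes of a file."""
    with open(path, "rb") as f:
        return guess_is_text(f.read(sample_size))
```

To classify a whole directory, you could call `guess_file_is_text` on each regular file returned by `os.scandir`, keeping in mind that the result is a guess either way.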

chillily

Reply #5 · 2020-02-25 08:28

If your script is running on *nix, you could use something like this:

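import subprocess

def is_text(fn):
    # Ask file(1) to describe the file; -b omits the file name from the
    # output so a name containing "text" cannot cause a false match
    msg = subprocess.check_output(["file", "-b", fn])
    # The output is bytes in Python 3, so decode it before searching
    return "text" in msg.decode(errors="replace")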
