Question:

I am using boost::filesystem to search and process files in a directory. Instead of processing every regular file (as identified by boost::filesystem::is_regular_file()), I want to process only text files (or at least skip binary files).

Is there a way I can achieve that even if files do not have an extension?

I would highly appreciate a platform independent solution.

Answer 1:

Use libmagic.

Libmagic is available on all major platforms (and many minor ones).

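#include <boost/filesystem.hpp>
#include <boost/range.hpp>
#include <iostream>
#include <magic.h>

using namespace boost;
namespace fs = filesystem;

int main() {
    // MAGIC_COMPRESS makes libmagic look inside compressed files as well
    auto handle = ::magic_open(MAGIC_NONE|MAGIC_COMPRESS);
    if (!handle || ::magic_load(handle, NULL) != 0) {  // NULL loads the default database
        std::cerr << "cannot initialize libmagic\n";
        if (handle) ::magic_close(handle);
        return 1;
    }

    for (fs::directory_entry const& x : make_iterator_range(fs::directory_iterator("."), {})) {
        // magic_file() returns NULL on error
        auto type = ::magic_file(handle, x.path().native().c_str());
        std::cout << x.path() << "\t" << (type? type : "UNKNOWN") << "\n";
    }

    ::magic_close(handle);
}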

This prints, e.g.:

sehe@desktop:~/custom/boost/status$ /tmp/test 
"./Jamfile.v2"  ASCII text
"./explicit-failures.xsd"   XML document text
"./expected_results.xml"    XML document text
"./explicit-failures-markup.xml"    XML document text

You can use the flags passed to magic_open() to control the level of detail of the classification, e.g. with MAGIC_MIME:

sehe@desktop:~/custom/boost/status$ /tmp/test 
"./Jamfile.v2"  text/plain; charset=us-ascii
"./explicit-failures.xsd"   application/xml; charset=us-ascii
"./expected_results.xml"    application/xml; charset=us-ascii
"./explicit-failures-markup.xml"    application/xml; charset=utf-8
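
To turn MIME output like the above into a text/binary decision, one option is to look at the MIME type and the reported charset. The following is a minimal sketch, not part of libmagic: looks_like_text is a hypothetical helper name, and the rule (accept anything starting with "text/", otherwise accept any charset other than "binary") is an assumption that you may need to tune for your files.

```cpp
#include <string>

// Hypothetical helper (not a libmagic function): decides whether a MIME
// description such as "text/plain; charset=us-ascii" denotes a text file.
bool looks_like_text(const char* mime) {
    if (!mime) return false;  // magic_file() returns NULL on error
    const std::string s(mime);

    // Anything in the text/ family is accepted directly.
    if (s.compare(0, 5, "text/") == 0) return true;

    // Assumption: formats such as application/xml are still text when
    // libmagic reports a charset other than "binary".
    const std::string key = "charset=";
    const auto pos = s.find(key);
    if (pos == std::string::npos) return false;
    return s.compare(pos + key.size(), 6, "binary") != 0;
}
```

In the loop above you would pass the result of ::magic_file() (with MAGIC_MIME set) to this helper and skip the entry when it returns false.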

Or, when loading just /etc/magic instead of the default database:

sehe@desktop:~/custom/boost/status$ /tmp/test 
"./Jamfile.v2"  ASCII text
"./explicit-failures.xsd"   ASCII text
"./expected_results.xml"    ASCII text, with very long lines
"./explicit-failures-markup.xml"    UTF-8 Unicode text

Answer 2:

There is no perfect solution.

You can make an educated guess by inspecting the content of the file. Text files often contain only printable ASCII characters, which gives you some hint, but they may also contain multi-byte UTF-8 sequences that look misleading if, for example, the text is written in hieroglyphs. Many file formats contain magic numbers in their headers, but there is no common convention for where that magic number is located, so you can easily construct a file containing the magic numbers of five different formats, each in its correct place.
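
As a rough illustration of the "printable ASCII" hint, here is a minimal sketch that measures how much of a buffer is printable ASCII. The function names and the 0.95 threshold are assumptions made for this example only. Note that it shows the weakness described above: valid UTF-8 text in a non-Latin script scores low and would be misclassified.

```cpp
#include <cstddef>
#include <string>

// Fraction of bytes that are printable ASCII (0x20..0x7E) or tab, newline,
// or carriage return. Returns 1.0 for an empty buffer.
double printable_ascii_ratio(const std::string& data) {
    if (data.empty()) return 1.0;
    std::size_t printable = 0;
    for (unsigned char c : data) {
        if ((c >= 0x20 && c <= 0x7E) || c == '\t' || c == '\n' || c == '\r')
            ++printable;
    }
    return static_cast<double>(printable) / data.size();
}

// Hypothetical decision rule: the default threshold is an arbitrary assumption.
bool probably_ascii_text(const std::string& data, double threshold = 0.95) {
    return printable_ascii_ratio(data) >= threshold;
}
```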

Sometimes it's really hard to decide what type of a file you see:

cat =13 /*/ >/dev/null 2>&1; echo "Hello, world!"; exit
*
*  This program works under cc, f77, and /bin/sh.
*
*/; main() {
      write(
cat-~-cat
     /*,'(
*/
     ,"Hello, world!"
     ,
cat); putchar(~-~-~-cat); } /*
     ,)')
      end
*/

Is that a sh-script, C source code or f77 source code?

I suggest taking a close look at the source of the file command, which makes a best effort at doing what you are trying to do.

Answer 3:

You could steal from less. less considers a file binary if more than 5 characters in the first 256 bytes satisfy !isprint(c) && !iscntrl(c) in the current locale.

This too, is a heuristic (which is why less always says "this may be a binary file"), but it is a common one that usually works, and you can adjust the threshold if you're having trouble with some files.
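
A minimal sketch of that rule as described here (it is not taken from the less source): read_prefix and may_be_binary are hypothetical names, and the window and threshold are parameters so that you can adjust them.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <fstream>
#include <string>

// Reads up to n bytes from the start of a file; returns an empty string
// if the file cannot be opened.
std::string read_prefix(const std::string& path, std::size_t n = 256) {
    std::ifstream in(path, std::ios::binary);
    std::string buf(n, '\0');
    in.read(&buf[0], static_cast<std::streamsize>(n));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// True if more than max_odd of the first `window` bytes are neither
// printable nor control characters in the current C locale.
bool may_be_binary(const std::string& data,
                   std::size_t window = 256, std::size_t max_odd = 5) {
    const std::size_t n = std::min(window, data.size());
    std::size_t odd = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Cast to unsigned char: passing negative values to <cctype>
        // functions is undefined behavior.
        const unsigned char c = static_cast<unsigned char>(data[i]);
        if (!std::isprint(c) && !std::iscntrl(c)) ++odd;
    }
    return odd > max_odd;
}
```

A caller would write may_be_binary(read_prefix(path)) for each regular file. Because the result depends on the locale, a file that passes in one locale may be flagged in another.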

Answer 4:

Using libmagic, you can find the type of a file. man libmagic gives detailed information.

Here is an example:

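  magic_t myt = magic_open(MAGIC_NONE);
  /* fullfilename must be large enough to hold dir_name + "/" + filename */
  sprintf(fullfilename, "%s/%s", dir_name, filename);
  magic_load(myt, NULL);  /* NULL loads the default magic database */
  printf("file type is %s\n", magic_file(myt, fullfilename));
  magic_close(myt);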