I am using boost::filesystem to search and process files in a directory.
But instead of processing every file (checked with boost::filesystem::is_regular_file()), I want to process only text files (or at least ignore binary files).
Is there a way to achieve that even if the files do not have an extension?
I would highly appreciate a platform-independent solution.
Use libmagic.
Libmagic is available on all major platforms (and many minor ones).
#include <boost/filesystem.hpp>
#include <boost/range.hpp>
#include <iostream>
#include <magic.h>
using namespace boost;
namespace fs = filesystem;
int main() {
    // MAGIC_COMPRESS also looks inside compressed files
    auto handle = ::magic_open(MAGIC_NONE|MAGIC_COMPRESS);
    if (!handle) return 1; // could not create the magic cookie
    ::magic_load(handle, NULL); // NULL loads the default magic database
    for (fs::directory_entry const& x : make_iterator_range(fs::directory_iterator("."), {})) {
        // string() yields a narrow string on every platform (native() is wide on Windows)
        auto type = ::magic_file(handle, x.path().string().c_str());
        std::cout << x.path() << "\t" << (type? type : "UNKNOWN") << "\n";
    }
    ::magic_close(handle);
}
Prints, e.g.
sehe@desktop:~/custom/boost/status$ /tmp/test
"./Jamfile.v2" ASCII text
"./explicit-failures.xsd" XML document text
"./expected_results.xml" XML document text
"./explicit-failures-markup.xml" XML document text
You can use the flags to control the detail of classification, e.g. MAGIC_MIME:
sehe@desktop:~/custom/boost/status$ /tmp/test
"./Jamfile.v2" text/plain; charset=us-ascii
"./explicit-failures.xsd" application/xml; charset=us-ascii
"./expected_results.xml" application/xml; charset=us-ascii
"./explicit-failures-markup.xml" application/xml; charset=utf-8
Or loading just /etc/magic:
sehe@desktop:~/custom/boost/status$ /tmp/test
"./Jamfile.v2" ASCII text
"./explicit-failures.xsd" ASCII text
"./expected_results.xml" ASCII text, with very long lines
"./explicit-failures-markup.xml" UTF-8 Unicode text
There is no perfect solution.
You can make an educated guess by inspecting the content of the file. Text files often contain only printable ASCII, which gives you a hint, but they may also contain multi-byte UTF-8 sequences that look misleading, for example if the text is written in hieroglyphs. Many file formats put a magic word in their header, but there is no common convention about where that magic word is located, so you can easily construct a file that contains the magic words of 5 different formats, all in the right place.
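As a rough illustration of the magic-word idea, here is a minimal sketch that compares the first bytes of a file against a few well-known signatures (PNG, PDF, ELF, ZIP). The function name match_signature and the tiny table are made up for this example, and it assumes the signature always sits at offset 0, which is exactly the convention many formats do not follow.

```cpp
#include <cstddef>
#include <string>

// One entry of a (deliberately tiny) signature table.
struct Signature {
    const char* name;    // human-readable label
    const char* bytes;   // magic bytes expected at offset 0
    std::size_t length;  // number of magic bytes
};

// Hypothetical helper: returns a label for the first matching signature,
// or nullptr if the leading bytes match nothing in the table.
const char* match_signature(const std::string& head) {
    static const Signature known[] = {
        {"PNG image",    "\x89PNG\r\n\x1a\n", 8},
        {"PDF document", "%PDF-",             5},
        {"ELF object",   "\x7f" "ELF",        4},
        {"ZIP archive",  "PK\x03\x04",        4},
    };
    for (const Signature& s : known) {
        if (head.size() >= s.length &&
            head.compare(0, s.length, s.bytes, s.length) == 0) {
            return s.name;
        }
    }
    return nullptr;
}
```

A nullptr result only means "no known binary signature"; it does not prove the file is text, which is why libmagic and file combine many more tests.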
Sometimes it's really hard to decide what type of a file you see:
cat =13 /*/ >/dev/null 2>&1; echo "Hello, world!"; exit
*
* This program works under cc, f77, and /bin/sh.
*
*/; main() {
write(
cat-~-cat
/*,'(
*/
,"Hello, world!"
,
cat); putchar(~-~-~-cat); } /*
,)')
end
*/
Is that a sh-script, C source code or f77 source code?
I suggest you take a close look at the source of the file command, which makes a best effort at what you are trying to do.
You could steal from less. less considers a file binary if more than 5 characters in the first 256 bytes are !isprint(c) && !iscntrl(c) in the current locale.
This, too, is a heuristic (which is why less always says "this may be a binary file"), but it is a common one that usually works, and you can adjust the threshold if you have trouble with some files.
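A minimal sketch of that heuristic as described above (not code taken from less itself): the names looks_binary and file_looks_binary are invented for this example, and the window and threshold are parameters so they can be tuned.

```cpp
#include <cctype>
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: counts bytes in the first `window` bytes that are
// neither printable nor control characters in the current C locale, and
// reports "binary" when there are more than `threshold` of them.
bool looks_binary(const std::string& data,
                  std::size_t window = 256,
                  std::size_t threshold = 5) {
    const std::size_t n = data.size() < window ? data.size() : window;
    std::size_t suspicious = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // The <cctype> functions require a value representable as unsigned char.
        const unsigned char c = static_cast<unsigned char>(data[i]);
        if (!std::isprint(c) && !std::iscntrl(c)) {
            ++suspicious;
        }
    }
    return suspicious > threshold;
}

// Hypothetical wrapper: reads up to 256 bytes of the file and applies the
// check. A file that cannot be opened yields an empty buffer and is
// therefore reported as "not binary"; handle that case separately if needed.
bool file_looks_binary(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string head(256, '\0');
    in.read(&head[0], static_cast<std::streamsize>(head.size()));
    head.resize(static_cast<std::size_t>(in.gcount()));
    return looks_binary(head);
}
```

Note that in the default "C" locale every byte above 0x7F counts as suspicious, so UTF-8 text with many non-ASCII characters near the start would be flagged; setting a suitable locale or raising the threshold changes that.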
Using libmagic, you can find the type of a file. man libmagic gives the detailed info.
Here is an example:
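` magic_t myt = magic_open(MAGIC_NONE);
/* fullfilename is assumed to be a char array declared elsewhere */
snprintf(fullfilename, sizeof fullfilename, "%s/%s", dir_name, filename);
magic_load(myt, NULL); /* NULL loads the default magic database */
const char *type = magic_file(myt, fullfilename); /* NULL on error */
printf("file type is %s\n", type ? type : "unknown");
magic_close(myt);
`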
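To turn a libmagic result into a text-or-binary decision, one option is to open the handle with MAGIC_MIME and inspect the returned string. The sketch below is only the string check, so it needs nothing but the standard library. The name mime_looks_like_text is made up, and the rule (a text/ type, or any charset other than binary) is an assumption based on the usual shape of MAGIC_MIME output, such as "application/xml; charset=us-ascii".

```cpp
#include <string>

// Hypothetical helper: decides from a MAGIC_MIME-style string such as
// "text/plain; charset=us-ascii" whether the file is likely to be text.
// Assumption: libmagic reports "charset=binary" for non-text content.
bool mime_looks_like_text(const std::string& mime) {
    // Any text/* media type counts as text.
    if (mime.compare(0, 5, "text/") == 0) {
        return true;
    }
    // Otherwise fall back to the charset parameter, if there is one.
    const std::string key = "charset=";
    const std::string::size_type pos = mime.find(key);
    if (pos == std::string::npos) {
        return false;  // no charset information: do not claim it is text
    }
    return mime.compare(pos + key.size(), 6, "binary") != 0;
}
```

With a handle opened via magic_open(MAGIC_MIME), the result of magic_file(handle, path) can be passed straight to this function after checking that it is not NULL.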