Do we have any method/approach of removing duplicate files?

Published 2019-09-08 03:18

Question:

Do we have any method/approach in Boost.Filesystem to remove duplicate files from a particular directory using C++?

I have retrieved all the files in a particular directory using the code below. Now I want to find the duplicates and then remove them.

Code to list files recursively in a directory using boost filesystem:

#include <iostream>
#include <boost/filesystem.hpp>

namespace fs = boost::filesystem;

void listFiles()
{
    fs::path sourceFolder;
    std::cout << "SourceFolder: ";
    std::cin >> sourceFolder;

    for (fs::recursive_directory_iterator it(sourceFolder), end_itr; it != end_itr; ++it)
    {
        if (!fs::is_regular_file(it->status()))
            continue;
        std::cout << it->path().filename() << std::endl;
    }
    std::cout << "Thanks for using file manager";
}

Thanks in advance.

Answer 1:

No. That's a rather specific use case, so you'll have to write the code yourself.

Basically, the best procedure has three steps. First, sort the files by size: different size, different files. Second, for all files of identical size, read the first 4 KB and compare those (skip this step for small files). Finally, if the first 4 KB is identical, compare the whole file.
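The first two steps above can be sketched as follows. This is a sketch, not Boost API code: it uses C++17 `std::filesystem` (whose interface closely mirrors Boost.Filesystem), and the function names `bucket_by_size` and `first_4k_equal` are my own.

```cpp
#include <algorithm>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <map>
#include <vector>

namespace fs = std::filesystem;

// Step 1: bucket files by size; only files of equal size can be duplicates.
std::map<std::uintmax_t, std::vector<fs::path>> bucket_by_size(const fs::path& dir)
{
    std::map<std::uintmax_t, std::vector<fs::path>> buckets;
    for (fs::recursive_directory_iterator it(dir), end; it != end; ++it)
        if (fs::is_regular_file(it->status()))
            buckets[fs::file_size(it->path())].push_back(it->path());
    return buckets;
}

// Step 2: cheap pre-check -- compare only the first 4 KB of two files.
bool first_4k_equal(const fs::path& a, const fs::path& b)
{
    char ba[4096] = {}, bb[4096] = {};
    std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
    fa.read(ba, sizeof ba);
    fb.read(bb, sizeof bb);
    return fa.gcount() == fb.gcount() &&
           std::equal(ba, ba + fa.gcount(), bb);
}
```

Only buckets with more than one entry need the 4 KB check, and only pairs that pass it need the full step-3 comparison.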



Answer 2:

@MSalters already gave an idea of how to approach this. It sounds like you'd be better off hashing the files' contents and then comparing the hashes for equality; relying purely on size is not good enough. By comparing hashes you can detect that files are equal across the whole file system.
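A minimal sketch of a content hash, using FNV-1a purely for illustration (it is fast but not collision-resistant; for real deduplication you would use a cryptographic hash such as SHA-256 from a library like OpenSSL). The function name `fnv1a_file` is my own:

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Hash a file's entire contents with 64-bit FNV-1a, reading 4 KB at a time.
// Illustrative only: equal hashes strongly suggest, but do not prove, equal
// content; a final byte-wise compare (or a cryptographic hash) is safer.
std::uint64_t fnv1a_file(const std::string& path)
{
    std::uint64_t h = 1469598103934665603ull;        // FNV offset basis
    std::ifstream in(path, std::ios::binary);
    char buf[4096];
    while (in.read(buf, sizeof buf), in.gcount() > 0)
        for (std::streamsize i = 0; i < in.gcount(); ++i)
            h = (h ^ static_cast<unsigned char>(buf[i])) * 1099511628211ull; // FNV prime
    return h;
}
```

Two files with identical content always produce the same hash, so sorting or bucketing by hash value finds duplicate candidates across the whole tree in one pass.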



Answer 3:

@murrekat @MSalters did not suggest relying on the size alone. Instead, it's a very sane idea to pre-select potential matches based on size, because you could be looking at days to generate hashes for a large volume of data, and by the time you're done they'd be out of date :)

All fdupe tools I know of take this approach: a fast, cheap pre-selection (preferably based on file-stat info), and only compare content when there's a potential match.

Doing a blockwise compare often trumps hash comparison, as it can be done streaming, and the candidate can be discarded as soon as a difference is detected, removing the need to read the whole file at all.
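A minimal sketch of such a streaming blockwise compare with early exit (`same_content` is a name I'm choosing, not a library function):

```cpp
#include <algorithm>
#include <fstream>
#include <string>

// Streaming blockwise comparison: read both files 4 KB at a time and bail
// out at the first differing block, so a mismatch near the start of two
// large files costs almost nothing.
bool same_content(const std::string& a, const std::string& b)
{
    std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
    if (!fa || !fb)
        return false;                       // treat unreadable files as distinct
    char ba[4096], bb[4096];
    while (fa && fb) {
        fa.read(ba, sizeof ba);
        fb.read(bb, sizeof bb);
        if (fa.gcount() != fb.gcount() ||
            !std::equal(ba, ba + fa.gcount(), bb))
            return false;                   // early exit on first difference
    }
    return fa.eof() && fb.eof();
}
```

Unlike hashing, this never touches bytes beyond the first difference, which is exactly the early-out advantage described above.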

Comparing full-content hashes can be beneficial in some cases:

  1. when you have many files that don't change (you can store the pre-calculated hash in a database, which balances the cost of reading the whole file to compute the hash against the fact that you don't have to on any subsequent run);

  2. when you anticipate that some files are duplicated on a large scale. In that case you expect the comparison not to early-out a significant percentage of the time, and a stored hash lets you avoid re-reading one side of the comparison;

  3. when you expect (many) sets of duplicates larger than two, basically for the same reason as #2.



Tags: c++ boost