Fastest way to compare directory state, or hashing

Posted 2020-06-03 01:08

We have a PHP application and think it might be advantageous for the application to know whether its makeup has changed since the last execution. This is mainly for managing caches, since our applications are sometimes accessed by people who don't remember to clear the cache after changes. (Changing the people is the obvious answer but, alas, not really achievable.)

We've come up with this, which is the fastest we've managed to eke out, averaging 0.08 seconds on a developer machine for a typical project. We've experimented with shasum, md5 and crc32, and md5 is the fastest. We are basically md5ing the contents of every file, then md5ing that output. Security isn't a concern; we're just interested in detecting filesystem changes via a differing checksum.

time (find application/ -path '*/.svn' -prune -o -type f -print0 | xargs -0 md5 | md5)

I suppose the question is, can this be optimised any further?

(I realise that pruning .svn will have a cost, but find takes the least time of all the components, so the cost will be pretty minimal. We're testing this on a working copy at the moment.)
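To make the "changed since the last execution" part concrete, here is a minimal sketch of the compare step. The helper names dir_checksum and has_changed and the state-file argument are hypothetical, and it swaps the question's md5 for POSIX cksum purely so that it runs on both BSD and GNU systems; it says nothing about which hash is fastest. The per-file lines are sorted so the result does not depend on traversal order.

```shell
#!/bin/sh
# Sketch only: dir_checksum and has_changed are hypothetical helper names.

# Print one checksum covering every regular file under $1, skipping .svn.
# cksum stands in for md5 here (portability assumption, not a speed claim).
dir_checksum() {
    ( cd "$1" && find . -name .svn -prune -o -type f -exec cksum {} + ) \
        | LC_ALL=C sort | cksum
}

# has_changed DIR STATEFILE
# Returns 0 if DIR differs from the checksum saved in STATEFILE (and saves
# the new checksum), 1 if nothing changed, 2 if DIR is not a directory.
# The very first run has no saved checksum, so it reports a change.
has_changed() {
    [ -d "$1" ] || return 2
    new=$(dir_checksum "$1")
    old=$(cat "$2" 2>/dev/null)
    if [ "$new" = "$old" ]; then
        return 1
    fi
    printf '%s\n' "$new" > "$2"
    return 0
}
```

Usage would be along the lines of `has_changed application/ /tmp/app.checksum && clear_cache`, where clear_cache is whatever the application uses to flush its caches.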

6 answers
不美不萌又怎样
#2 · 2020-06-03 01:45

Instead of going by file contents, you can use the same technique with filename and timestamps:

find . -name '.svn' -prune -o -type f -printf '%m%c%p' | md5sum

This is much faster than reading and hashing the contents of each file.
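A slightly more defensive variant of the same idea, as a sketch: it assumes GNU find, sort and md5sum, and meta_fingerprint is a hypothetical helper name. It records mode, size and modification time (rather than %c, the status-change time), separates records with NUL bytes so adjacent fields cannot run together, and sorts them so traversal order does not matter.

```shell
#!/bin/sh
# Sketch only: meta_fingerprint is a hypothetical helper name.
# Assumes GNU find (-printf), GNU sort (-z) and md5sum.

# Print a fingerprint of file metadata (mode, size, mtime, path) under $1,
# skipping .svn directories. File contents are never read.
meta_fingerprint() {
    ( cd "$1" && find . -name .svn -prune -o -type f -printf '%m %s %T@ %p\0' ) \
        | LC_ALL=C sort -z | md5sum | cut -d' ' -f1
}
```

The trade-off is that an edit which keeps the same size and restores the old mtime goes unnoticed, which is usually acceptable for cache invalidation.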

爱情/是我丢掉的垃圾
#3 · 2020-06-03 01:49

Instead of actively searching for changes, why not get notified when something changes? Have a look at PHP's FAM (File Alteration Monitor) API:

FAM monitors files and directories, notifying interested applications of changes. More information about FAM is available at http://oss.sgi.com/projects/fam/. A PHP script may specify a list of files for FAM to monitor using the functions provided by this extension. The FAM process is started when the first connection from any application to it is opened. It exits after all connections to it have been closed.

CAVEAT: requires an additional daemon on the machine and the PECL extension is unmaintained.

够拽才男人
#4 · 2020-06-03 01:54

We didn't want to use FAM, since we would need to install it on the server, and that's not always possible. Sometimes clients insist that we deploy on their broken infrastructure. Since FAM is discontinued, it's also hard to get that change approved through the red tape.

The only way to improve the speed of the original version in the question is to keep the file list as succinct as possible, i.e. only hash the directories and files that really matter if changed. Cutting out directories that aren't relevant can give big speed boosts.
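As a rough sketch of trimming the list: the pruned directory names (cache, logs) and the extensions (*.php, *.ini) below are assumptions for illustration, not something from our project, and relevant_files is a hypothetical helper name. The output can be fed to whatever hashing step is already in use.

```shell
#!/bin/sh
# Sketch only: relevant_files is a hypothetical helper name, and the pruned
# directories and file extensions are illustrative assumptions.

# List, in a stable order, only the files under $1 whose changes matter.
relevant_files() {
    ( cd "$1" && find . \
        \( -name .svn -o -name cache -o -name logs \) -prune \
        -o -type f \( -name '*.php' -o -name '*.ini' \) -print ) \
        | LC_ALL=C sort
}
```

Pruning whole directories pays off most, because find never descends into them at all.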

Past that, the application was using the function to check for changes so that it could clear the cache when there were any. Since we don't really want to halt the application while it's doing this, this sort of thing is best farmed out carefully as an asynchronous process using fsockopen. That gives the best 'speed boost' overall; just be careful of race conditions.
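The same "don't block, and don't race" idea can be sketched outside PHP. This is not the fsockopen approach described above; it swaps in a plain background shell job guarded by a lock directory, and run_check_async is a hypothetical helper name. mkdir is used as the lock because it either creates the directory or fails, so two overlapping checks cannot both proceed.

```shell
#!/bin/sh
# Sketch only: run_check_async is a hypothetical helper name, and this is a
# background shell job, not the fsockopen technique from the answer.

# run_check_async LOCKDIR COMMAND [ARGS...]
# Start COMMAND in the background and return immediately. If LOCKDIR already
# exists, another check is assumed to be running and nothing is started.
run_check_async() {
    lock=$1
    shift
    (
        mkdir "$lock" 2>/dev/null || exit 0   # lost the race: skip this run
        "$@"
        rc=$?
        rmdir "$lock"                          # release the lock
        exit "$rc"
    ) >/dev/null 2>&1 &
}
```

If the job is killed before it reaches rmdir, the lock stays behind, so a real deployment would also want some way to expire stale locks.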

Marking this as the 'answer' and upvoting the FAM answer.

手持菜刀,她持情操
#5 · 2020-06-03 02:01

Since you have svn, why don't you go by revisions? I realise you are skipping the svn folders, but I suppose you did that for the speed advantage and that you do not have modified files on your production servers...

That being said, you do not have to reinvent the wheel.

You could speed up the process by only looking at metadata read from the directory indexes (modification timestamp, file size, etc.).

You could also stop as soon as you spot a difference (when something has changed, this should theoretically halve the time on average), and so on. There is a lot you could do.
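One way to sketch the "stop at the first difference" idea with metadata only is to compare modification times against a stamp file that is touched after each successful check. changed_since is a hypothetical helper name. Directories are deliberately included in the search, since deleting or renaming a file updates its parent directory's mtime.

```shell
#!/bin/sh
# Sketch only: changed_since is a hypothetical helper name.

# changed_since STAMPFILE DIR
# Succeeds if anything under DIR (outside .svn) was modified after STAMPFILE.
# head exits after the first line, so find is stopped early by the closed
# pipe instead of walking the rest of the tree.
changed_since() {
    first=$(find "$2" -name .svn -prune -o -newer "$1" -print | head -n 1)
    [ -n "$first" ]
}
```

A caller would run something like `changed_since /tmp/app.stamp application/ && clear_cache` and then `touch /tmp/app.stamp`; both the stamp path and clear_cache are placeholders. Changes that land within the same timestamp tick as the stamp can be missed.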

I honestly think the best way in this case is to just use the tools already available.

The Linux tool diff has a -q option (brief: it reports only whether files differ, not how).

You will need to use it with the recursive parameter -r as well.

diff -r -q dir1/ dir2/

It uses a lot of optimisations, and I seriously doubt you can significantly improve upon it without considerable effort.
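In a script you usually only need the exit status. A small sketch, where trees_differ is a hypothetical helper name and the -x .svn exclusion assumes a diff that supports -x (GNU and BSD diff do):

```shell
#!/bin/sh
# Sketch only: trees_differ is a hypothetical helper name.
# Assumes a diff that supports -q and -x (GNU or BSD diff).

# trees_differ DIR1 DIR2
# Returns 0 if the trees differ, 1 if they are identical, 2 on error
# (for example when one of the directories does not exist).
trees_differ() {
    rc=0
    diff -r -q -x .svn "$1" "$2" >/dev/null 2>&1 || rc=$?
    case $rc in
        0) return 1 ;;   # diff found no differences
        1) return 0 ;;   # diff found differences
        *) return 2 ;;   # diff itself failed
    esac
}
```

This does require keeping a reference copy of the tree to compare against, which a stored checksum avoids.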

可以哭但决不认输i
#6 · 2020-06-03 02:04

You can be notified of filesystem modifications using the inotify extension.

It can be installed with pecl:

pecl install inotify

Or manually (download, phpize && ./configure && make && make install as usual).

This is a raw binding over the Linux inotify syscalls, and is probably the fastest solution on Linux.

See this example of a simple tail implementation: http://svn.php.net/viewvc/pecl/inotify/trunk/tail.php?revision=262896&view=markup


If you want a higher-level library, or support for non-Linux systems, take a look at Lurker.

It works on any system, and can use inotify when it's available.

See the example from the README:

use Lurker\Event\FilesystemEvent;
use Lurker\ResourceWatcher;

$watcher = new ResourceWatcher;
$watcher->track('an arbitrary id', '/path/to/views');

// Called for every change detected under the tracked path
$watcher->addListener('an arbitrary id', function (FilesystemEvent $event) {
    echo $event->getResource() . ' was ' . $event->getTypeString();
});

$watcher->start();
祖国的老花朵
#7 · 2020-06-03 02:10

You should definitely be using inotify. It's fast and easy to configure, and there are multiple options: use it directly from bash or PHP, or dedicate a simple node-inotify instance to this task.

But inotify does not work on Windows. There you can easily write a command-line application with FileSystemWatcher or FindFirstChangeNotification and call it via exec.

If you are looking for a PHP-only solution, it's pretty difficult and you might not get the performance you want, because the only way is to scan the folder continuously.

Here is a Simple Experiment

  • Don't use in production
  • Cannot handle large file sets
  • Does not support file monitoring
  • Only supports NEW, DELETED and MODIFIED events
  • Does not support recursion

Example

if (php_sapi_name() !== 'cli')
    die("CLI ONLY");

date_default_timezone_set("America/Los_Angeles");

$sm = new Monitor(__DIR__ . "/test");

// Add hook when new files are created
$sm->hook(function ($file) {
    // Send a mail or log to a file
    printf("#EMAIL NEW FILE %s\n", $file);
}, Monitor::NOTIFY_NEW);

// Add hook when files are Modified
$sm->hook(function ($file) {
    // Do something meaningful like calling curl
    printf("#HTTP POST  MODIFIED FILE %s\n", $file);
}, Monitor::NOTIFY_MODIFIED);

// Add hook when files are Deleted
$sm->hook(function ($file) {
    // Crazy ... Send SMS fast or IVR the Boss that you messed up
    printf("#SMS DELETED FILE %s\n", $file);
}, Monitor::NOTIFY_DELETED);

// Start Monitor
$sm->start();

Cache Used

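// Simple Cache
// Can be replaced with Memcache
class Cache {
    public $f;

    function __construct() {
        // In-memory temporary stream, opened for reading and writing
        $this->f = fopen("php://temp", "w+");
    }

    // Single-slot store: $k is accepted but ignored
    function get($k) {
        rewind($this->f);
        return json_decode(stream_get_contents($this->f), true);
    }

    function set($k, $data) {
        // Truncate first so a shorter payload does not leave stale bytes
        // behind, which would make the stored JSON invalid
        ftruncate($this->f, 0);
        rewind($this->f);
        fwrite($this->f, json_encode($data));
        return $k;
    }

    function run() {
    }
}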

The Experiment Class

// The Experiment
class Monitor {
    private $dir;
    private $info;
    private $timeout = 1; // sec
    private $timeoutStat = 60; // sec
    private $cache;
    private $current, $stable, $hook = array();
    const NOTIFY_NEW = 1;
    const NOTIFY_MODIFIED = 2;
    const NOTIFY_DELETED = 4;
    const NOTIFY_ALL = 7;

    function __construct($dir) {
        $this->cache = new Cache();
        $this->dir = $dir;
        $this->info = new SplFileInfo($this->dir);
        $this->scan(true);
    }

    public function start() {
        $i = 0;
        $this->stable = (array) $this->cache->get(md5($this->dir));

        while(true) {
            // Clear System Cache at Intervals
            if ($i % $this->timeoutStat == 0) {
                clearstatcache();
            }

            $this->scan(false);

            if ($this->stable != $this->current) {
                $this->cache->set(md5($this->dir), $this->current);
                $this->stable = $this->current;
            }

            sleep($this->timeout);
            $i ++;

            // printf("Memory Usage : %0.4f \n", memory_get_peak_usage(false) /
            // 1024);
        }
    }

    private function scan($new = false) {
        $rdi = new FilesystemIterator($this->dir, FilesystemIterator::SKIP_DOTS);

        $this->current = array();
        foreach($rdi as $file) {

            // Skip files that are not readable (continue, so that one
            // unreadable file does not abort the whole scan)
            if (! $file->isReadable())
                continue;

            // Use the real path as-is; json_encode() does its own escaping
            $path = $file->getRealPath();
            $keyHash = md5($path);
            $fileHash = $file->isFile() ? md5_file($path) : "#";

            $hash = array();
            $hash["t"] = $file->getMTime();
            $hash["h"] = $fileHash;
            $hash["f"] = $path;

            $this->current[$keyHash] = json_encode($hash);
        }

        if ($new === false) {
            $this->process();
        }
    }

    public function hook(Callable $call, $type = Monitor::NOTIFY_ALL) {
        $this->hook[$type][] = $call;
    }

    private function process() {
        // Hooks registered with a combined mask such as NOTIFY_ALL are stored
        // under that mask, so an isset() check on a single flag would skip
        // them. notify() does the bitmask matching and returns early when
        // there is nothing to report.

        // Keys present now but not in the last stable snapshot
        $diff = array_flip(array_diff(array_keys($this->current), array_keys($this->stable)));
        $this->notify(array_intersect_key($this->current, $diff), self::NOTIFY_NEW);

        // Keys present in the last stable snapshot but gone now
        $deleted = array_flip(array_diff(array_keys($this->stable), array_keys($this->current)));
        $this->notify(array_intersect_key($this->stable, $deleted), self::NOTIFY_DELETED);

        // Keys present in both snapshots whose recorded mtime or hash differs
        $this->notify(array_diff_assoc(array_intersect_key($this->stable, $this->current), array_intersect_key($this->current, $this->stable)), self::NOTIFY_MODIFIED);
    }

    private function notify(array $files, $type) {
        if (empty($files))
            return;

        foreach($this->hook as $t => $hooks) {
            if ($t & $type) {
                foreach($hooks as $hook) {
                    foreach($files as $file) {
                        $info = json_decode($file, true);
                        $hook($info['f'], $type);
                    }
                }
            }
        }
    }
}