Making a large processing job smaller

Posted 2019-02-18 08:39

This is the code I'm using as I work my way to a solution.

    public function indexAction()
    {
        //id3 options
        $options = array("version" => 3.0, "encoding" => Zend_Media_Id3_Encoding::ISO88591, "compat" => true);
        //path to collection
        $path = APPLICATION_PATH . '/../public/Media/Music/';//Currently Approx 2000 files
        //inner iterator
        $dir = new RecursiveDirectoryIterator($path, RecursiveDirectoryIterator::SKIP_DOTS);
        //iterator
        $iterator = new RecursiveIteratorIterator($dir, RecursiveIteratorIterator::SELF_FIRST);
        foreach ($iterator as $file) {
            if (!$file->isDir() && $file->getExtension() === 'mp3') {
                //real path to mp3 file
                $filePath = $file->getRealPath();
                Zend_Debug::dump($filePath);//current results: accepted path no errors
                $id3 = new Zend_Media_Id3v2($filePath, $options);
                //reset per file so tags from the previous file do not leak into this one
                $data = array();
                foreach ($id3->getFramesByIdentifier("T*") as $frame) {
                    $data[$frame->identifier] = $frame->text;
                }
                Zend_Debug::dump($data);//currently can scan the whole collection without timing out, but APIC data not being processed.
            }
        }
    }

The problem: Process a file system of mp3 files in multiple directories. Extract id3 tag data to a database (3 tables) and extract the cover image from the tag to a separate file.

I can handle the actual extraction and data handling. My issue is with output.

With the way that Zend Framework 1.x handles output buffering, outputting an indicator that the files are being processed is difficult. In an old-style PHP script, without output buffering, you could print a bit of HTML on every iteration of the loop and have some indication of progress.

I would like to be able to process each album's directory, output the results and then continue on to the next album's directory, requiring user intervention only on certain errors.

Any help would be appreciated.

JavaScript is not the solution I'm looking for. I feel that this should be possible within the constructs of PHP and a ZF 1 MVC.

I'm doing this mostly for my own enlightenment; it seems a very good way to learn some important concepts.

[EDIT]
Ok, how about some ideas on how to break this down into smaller chunks. Process one chunk, commit, process next chunk, kind of thing. In or out of ZF.
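
A minimal sketch of that chunking idea, assuming one album per directory. The function names `groupMp3ByDirectory` and `processInChunks` and the `$processAlbum` callback are hypothetical, not part of ZF; the callback stands in for the ID3 extraction and database commit for one album.

```php
<?php
/**
 * Group file paths by their parent directory so that each album
 * can be processed and committed as one chunk.
 *
 * @param array|Traversable $paths file paths (strings or SplFileInfo objects)
 * @return array map of directory => list of mp3 paths, sorted by directory
 */
function groupMp3ByDirectory($paths)
{
    $albums = array();
    foreach ($paths as $path) {
        // SplFileInfo casts to its path name, so iterator items work here too
        $path = (string) $path;
        if (strtolower(pathinfo($path, PATHINFO_EXTENSION)) !== 'mp3') {
            continue;
        }
        $albums[dirname($path)][] = $path;
    }
    ksort($albums);
    return $albums;
}

/**
 * Hand one album at a time to $processAlbum (hypothetical callback that
 * extracts tags and commits to the database) and return the directories done.
 */
function processInChunks(array $albums, $processAlbum)
{
    $done = array();
    foreach ($albums as $dir => $files) {
        call_user_func($processAlbum, $dir, $files);
        $done[] = $dir;
    }
    return $done;
}
```

The `RecursiveIteratorIterator` from the action above could be passed straight to `groupMp3ByDirectory`, and the callback is the natural place to report progress for each album.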

[EDIT]
I'm beginning to see the problem with what I'm trying to accomplish. It seems that output buffering is not just happening in ZF, it's happening everywhere from ZF all the way to the browser. Hmmmmm...

4 Answers
SAY GOODBYE
#2 · 2019-02-18 09:21

Introduction

This is a typical example of what you should not do, because:

  • You are parsing ID3 tags in PHP, which is slow, and parsing many files in one request makes it slower still.

  • RecursiveDirectoryIterator walks every file in the folder and its subfolders, and from what I can see there is no limit. It may be 2,000 files today and 100,000 the next day. Total processing time is unpredictable and could take hours in some cases.

  • There is a high dependence on a single file system. With your current architecture the files are stored on the local system, so it would be difficult to split the files and do proper load balancing.

  • You are not checking whether a file's information has already been extracted, so every run loops over and re-extracts the same files.

  • There is no locking, so the process can be started several times simultaneously, slowing down the whole server.

Solution 1 : With Current Architecture

My advice is not to use a loop or RecursiveDirectoryIterator to process the files in bulk.

Target each file as soon as it is uploaded or transferred to the server. That way you are only working with one file at a time, and the processing time is spread out.

Solution 2: Job Queue (Proposed Solution)

Your problem is exactly what job queues are designed for. You are also not limited to implementing the parsing in PHP; you can take advantage of C or C++ for performance.

Advantages

  • Transfer jobs to other machines or processes that are better suited to do the work
  • Do work in parallel and load balance the processing
  • Reduce the latency of page views in high-volume web applications by running time-consuming tasks asynchronously
  • Mix languages: client in PHP, server in C

Examples have tested

Expected Process Client

  • Connect to the job queue, e.g. Gearman
  • Connect to the database, e.g. MongoDB or Redis
  • Loop over the folder path
  • Check the file extension
  • If the file is an mp3, generate a file hash, e.g. with sha1_file
  • Check whether the file has already been sent for processing
  • Send the hash and file to the job server

Expected Process Server

  • Connect to the job queue, e.g. Gearman
  • Connect to the database, e.g. MongoDB or Redis
  • Receive the hash / file
  • Extract the ID3 tag
  • Update the DB with the ID3 tag information

Finally this processing can be done on multiple servers in parallel
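
The client-side steps above can be sketched roughly as follows. This is only an illustration: `enqueueNewFiles` is a hypothetical name, the `$seen` array stands in for the database lookup, and `$enqueue` is a hypothetical callback that would hand the job to whatever queue you use (with the PHP Gearman extension it could wrap `GearmanClient::doBackground`).

```php
<?php
/**
 * Queue each mp3 at most once, keyed by the hash of its contents.
 *
 * @param array|Traversable $paths   candidate file paths
 * @param array             $seen    hashes already sent, as hash => true
 *                                   (stand-in for the database check)
 * @param callable          $enqueue hypothetical callback: function ($hash, $path)
 * @return array the updated set of seen hashes
 */
function enqueueNewFiles($paths, array $seen, $enqueue)
{
    foreach ($paths as $path) {
        $path = (string) $path;
        // Check the file extension and skip anything that is not a regular file
        if (strtolower(pathinfo($path, PATHINFO_EXTENSION)) !== 'mp3' || !is_file($path)) {
            continue;
        }
        // Hash the contents so renamed or duplicated files are detected
        $hash = sha1_file($path);
        if ($hash === false || isset($seen[$hash])) {
            continue;
        }
        // Send hash and path to the job server, then remember the hash
        call_user_func($enqueue, $hash, $path);
        $seen[$hash] = true;
    }
    return $seen;
}
```

Keying on the content hash rather than the path means a second run over the same collection queues nothing new, which addresses the duplication point from the introduction.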

放我归山
#3 · 2019-02-18 09:41

One solution would be to use a job queue, such as Gearman. Gearman is an excellent solution for this kind of problem, and it is easily integrated with Zend Framework (http://blog.digitalstruct.com/2010/10/17/integrating-gearman-into-zend-framework/).

It will allow you to create a worker to process each "chunk", so your request can continue unblocked while the job is processed. That is very handy for long-running processes such as music or image processing: http://gearman.org/index.php?id=getting_started

萌系小妹纸
#4 · 2019-02-18 09:42

I would suggest using a plugin.

class Postpone extends Zend_Controller_Plugin_Abstract
{

    private $tail;

    private $callback;


    function __construct ($callback = array())
    {
        $this->callback = $callback;
    }


    public function setRequest (Zend_Controller_Request_Abstract $request)
    {
        /*
         * We use layout, which essentially contains some html and a placeholder for action output.
         * We put the marker into this placeholder in order to figure out "the tail" -- the part of layout that goes after placeholder.
         */
        $mark = '---cut-here--';
        $layout = $this->getLayout ();

        $layout->content = $mark;

        /*
         * Now we have it.
         */
        $this->tail = preg_replace ("/.*$mark/s", '', $layout->render ());
    }


    public function postDispatch (Zend_Controller_Request_Abstract $request)
    {
        $response = $this->getResponse ();

        $response->sendHeaders ();

        /*
         * The layout generates its output to the default section of the response.
         * This output includes "the tail".
         * We don't need this tail shown right now, because we have callback to do.
         * So we remove it here for a while, but we'll show it later.
         */
        echo substr ($this->getResponse ()
            ->getBody ('default'), 0, - strlen ($this->tail));

        /*
         * Since we have just echoed the result, we don't need it in the response. Do we?
         */
        Zend_Controller_Front::getInstance ()->returnResponse(true);
        $response->clearBody ();

        /*
         * Now to business.
         * We execute that calculation intensive callback.
         */
        if (! empty ($this->callback) && is_callable ($this->callback))
        {
            call_user_func ($this->callback);
        }

        /*
         * We sure don't want to leave behind the tail.
         * Output it so html looks consistent.
         */
        echo $this->tail;
    }


    /**
     * Returns layout object
     */
    function getLayout ()
    {
        $layout_plugin = Zend_Controller_Front::getInstance ()->getPlugin ('Zend_Layout_Controller_Plugin_Layout');
        return $layout = $layout_plugin->getLayout ();
    }
}




class IndexController extends Zend_Controller_Action
{


    /*
     * This is a calculation intensive action
     */
    public function indexAction ()
    {
        /*
         * Zend_Layout in its current implementation accumulates the whole action output inside itself.
         * This fact hampers our intention to gradually output the result.
         * What we do here is we defer execution of our intensive calculation in form of callback into the Postpone plugin.
         * The scenario is:
         * 1. Application started
         * 2. Layout is started
         * 3. Action gets executed (except callback) and its output is collected by layout.
         * 4. Layout output goes to response.
         * 5. Postpone::postDispatch outputs first part of the response (without the tail).
         * 6. Postpone::postDispatch calls the callback. Its output goes straight to the browser.
         * 7. Postpone::postDispatch prints the tail.
         */
        $this->getFrontController ()
            ->registerPlugin (new Postpone (function  ()
        {
            /*
             * A simulated calculation.
             * Put your actual calculations here.
             */
            echo str_repeat(" ", 5000);
            foreach (range (1, 500) as $x)
            {
                echo "<p>$x</p><br />\n";
                usleep(61500);
                flush();
            }
        }), 1000);
    }
}
放荡不羁爱自由
#5 · 2019-02-18 09:44

I'm not familiar with how Zend Framework works, so I will give you some general advice. When a process iterates over many items and may run for a long time, it is generally advised to move it into a background process or, for a web application, into a cron job.

If the process is for a single site, you can implement something like this in your cron job (note: rough pseudo-code):

<?php

$targetdir = "/path/to/mp3";
$logdir = "/path/to/log/";

//if the state file exists, a previous cron run is still in progress;
//stop here so that two runs never process the same files at once
if(file_exists($logdir."current-state")){
    exit;
}

//start the run: reset the log and write the state file
file_put_contents($logdir."current-log", "process started at ".date("Y-m-d H:i:s")."\n");
file_put_contents($logdir."current-state", "started\t".date("Y-m-d H:i:s"));
$dirh = opendir($targetdir);
//compare against false explicitly so a file named "0" does not end the loop
while(false !== ($file = readdir($dirh))){
    //ignore the current and parent directory entries
    if(in_array($file, array('.', '..'))) continue;

    //do whatever processing you want here:


    //you might want to log each file, too:
    file_put_contents($logdir."current-log", "processing file {$file}\n", FILE_APPEND);


}
closedir($dirh);
//append, otherwise the per-file log lines above would be overwritten
file_put_contents($logdir."current-log", "process finished at ".date("Y-m-d H:i:s")."\n", FILE_APPEND);

//the run is finished, delete the state file:
unlink($logdir."current-state");

Next, in your PHP file for the web, you can add a snippet to, say, the admin page, the footer, or whatever page you want, to see the progress:

<?php

$logdir = "/path/to/log/"; //same log directory the cron job writes to

if(file_exists($logdir."current-state")){
    echo "<strong>A background process is running.</strong>";
} else {
    echo "<strong>No background process is running.</strong>";
}
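
One caveat with the `current-state` marker in the cron script: if the script dies midway, the file is left behind and every later run exits immediately. A possible alternative, sketched here with hypothetical helper names `acquireLock` and `releaseLock`, is an advisory lock taken with PHP's `flock`, which the operating system releases when the process ends.

```php
<?php
/**
 * Try to take an exclusive, non-blocking lock on $lockFile.
 *
 * @param string $lockFile path of the lock file (created if missing)
 * @return resource|null the open handle on success, null if the lock is held elsewhere
 */
function acquireLock($lockFile)
{
    // 'c' opens for writing without truncating and creates the file if needed
    $handle = fopen($lockFile, 'c');
    if ($handle === false) {
        return null;
    }
    // LOCK_NB makes flock return immediately instead of waiting for the lock
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return null;
    }
    return $handle;
}

/**
 * Release a lock obtained from acquireLock().
 */
function releaseLock($handle)
{
    flock($handle, LOCK_UN);
    fclose($handle);
}
```

The cron script would call `acquireLock` at the top, exit if it returns null, and call `releaseLock` at the end. The lock is advisory, so it only protects against other processes that also ask for it.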