How to read math equations from a docx file using PHP code

Posted 2019-04-10 21:23

Question:

I am trying to read a docx file from PHP. The text is read successfully, but some equations in the Word document are missing from the output. I am new to PHP and do not know how to read them, so please suggest some ideas. The function I have tried for reading the document is:

function index()
{
    $document = 'file_path';
    $text_output = $this->read_docx($document);
    echo nl2br($text_output);

}
private function read_docx($filename) 
{
    var_dump($filename);
    $striped_content = '';
    $content = '';

    $zip = zip_open($filename);

    if (!$zip || is_numeric($zip))
        return false;

    while ($zip_entry = zip_read($zip)) {

        if (zip_entry_open($zip, $zip_entry) == FALSE)
            continue;

        if (zip_entry_name($zip_entry) != "word/document.xml")
            continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }// end while

    zip_close($zip);

    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $striped_content = strip_tags($content);

    return $striped_content;
}

This is a sample of the math equations in the docx file that I am trying to read and render to an HTML page. Thanks.

Answer 1:

I went through https://msdn.microsoft.com/en-us/library/aa982683(v=office.12).aspx#Office2007ManipulatingXMLDocs_exploring in full and parsed the XML using PHP's XMLReader:

$document = 'url';
/*Function to extract images*/ 
function readZippedImages($filename) 
{
    $final_arr = array(); // ordered list of array("text"=>...) and array("image"=>...)
    $repo = array();      // relationship Id => target path, relative to word/

    /*Create a new ZIP archive object and open the received archive file*/
    $zip = new ZipArchive;
    if (true !== $zip->open($filename)) 
    {
        return false;
    }

    // Look the parts up by name; their index inside the archive is not guaranteed.
    $check_id = $zip->getFromName('word/_rels/document.xml.rels');
    $check = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($check === false) 
    {
        return false;
    }

    //====================parsing the relationships (Id => Target)==================//
    if ($check_id !== false) 
    {
        $reader_relation = new XMLReader();
        $reader_relation->xml($check_id);
        while ($reader_relation->read())
        {
            if ($reader_relation->nodeType == XMLReader::ELEMENT && $reader_relation->localName == 'Relationship')
            {
                $read_content_id = $reader_relation->getAttribute("Id");
                $read_content_name = $reader_relation->getAttribute("Target");
                $repo[$read_content_id] = $read_content_name;
            }
        }
        $reader_relation->close();
    }

    //====================parsing through the document==================//
    $reader = new XMLReader();
    //Loading from PHP variable
    $reader->xml($check);
    while ($reader->read())
    {
        $node_loc = $reader->localName;
        if ($node_loc == '#text')//parsing all the text from document using #text tag
        {
            array_push($final_arr, array("text" => $reader->value));
        }
        elseif ($reader->nodeType == XMLReader::ELEMENT && $node_loc == 'blip')//parsing all the images using blip tag which is under drawing tag
        {
            $attri_r = $reader->getAttribute("r:embed");
            if (isset($repo[$attri_r]))
            {
                $image_stream = showimage($filename, $repo[$attri_r]);//return the base64 string
                array_push($final_arr, array("image" => $image_stream));
            }
        }
    }
    $reader->close();
    //==================xml parser end============================//

    return $final_arr;
}


function showimage($zip_file_original, $file_name_image) 
{
    $file_name_image = 'word/'.$file_name_image;// getting the image in the zip using its name
    $z_show = new ZipArchive();
    if ($z_show->open($zip_file_original) !== true) {
        echo "File not found.";
        return false;
    }

    // Read the whole image; no HTTP headers are sent because the data is returned, not output.
    $image = $z_show->getFromName($file_name_image);
    $z_show->close();
    if ($image === false) {
        echo "Could not load image.";
        return false;
    }

    return base64_encode($image);//return the base64 string for the current image.
}
$final_arr = readZippedImages($document);

Print $final_arr and you will get all the text and images in the document.
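
Since the goal is to show the result on an HTML page, the array of text and image parts could be rendered roughly as below. This is only a sketch: renderParts is a hypothetical helper name, and the image/jpeg default is an assumption, because a docx archive may also contain PNG, GIF or EMF images.

```php
<?php
/**
 * Turn a list of array("text" => ...) / array("image" => ...) parts into HTML.
 * renderParts is a hypothetical helper; $mime is an assumed default.
 */
function renderParts(array $parts, string $mime = 'image/jpeg'): string
{
    $html = '';
    foreach ($parts as $part) {
        if (isset($part['text'])) {
            // Escape the text so characters such as < and & display literally.
            $html .= htmlspecialchars($part['text'], ENT_QUOTES, 'UTF-8');
        } elseif (!empty($part['image'])) {
            // Embed the base64 string directly as a data URI.
            $html .= '<img src="data:' . $mime . ';base64,' . $part['image'] . '" alt="">';
        }
    }
    return $html;
}

// Sample data in the same shape as $final_arr (the values are made up).
$sample = array(
    array("text" => "a < b"),
    array("image" => base64_encode("fake image bytes")),
);
echo renderParts($sample), "\n";
```

Entries whose image value is false (an image that could not be loaded) are skipped rather than rendered as a broken tag.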



Answer 2:

First of all, it is a very bad idea to parse XML using a regular expression. Instead, use PHP's XML parser, which is designed for this kind of task.
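
As a rough illustration of the parser-based approach, the paragraph text could be collected with DOMDocument and DOMXPath instead of string replacement. This is a sketch, not a complete reader: paragraphTexts is a hypothetical name, it takes the contents of word/document.xml as a string, and it ignores tables, tabs and line breaks.

```php
<?php
/**
 * Return the plain text of every paragraph (w:p) in a word/document.xml string.
 * paragraphTexts is a hypothetical helper used for illustration only.
 */
function paragraphTexts(string $documentXml): array
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($documentXml)) {
        return array();
    }
    $xpath = new DOMXPath($dom);
    // Main WordprocessingML namespace used by w:p, w:r and w:t.
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $paragraphs = array();
    foreach ($xpath->query('//w:p') as $p) {
        $text = '';
        // Concatenate the text runs that belong to this paragraph.
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $paragraphs[] = $text;
    }
    return $paragraphs;
}

// A tiny hand-written document body (not taken from a real file).
$xml = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:body>'
     . '<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
     . '<w:p><w:r><w:t>Second</w:t></w:r></w:p>'
     . '</w:body></w:document>';
print_r(paragraphTexts($xml));
```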

You need to read the Office Open XML specification (the standard used by Microsoft Office) to learn about the internal data structure that Microsoft uses for storing this kind of math equation.
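
For equations created with Word's built-in equation editor, the markup is Office Math Markup Language: m:oMath elements inside word/document.xml, with the visible characters stored in m:t elements. A minimal sketch of pulling those out is shown below. The function name extractEquations is hypothetical, and the sketch assumes the equations really are stored this way; equations inserted as embedded objects or pictures would not be found by it.

```php
<?php
/**
 * Collect every m:oMath element from a word/document.xml string.
 * Returns the raw markup plus the concatenated m:t text of each equation.
 * extractEquations is a hypothetical helper used for illustration only.
 */
function extractEquations(string $documentXml): array
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($documentXml)) {
        return array();
    }
    $xpath = new DOMXPath($dom);
    // Office math namespace used by m:oMath, m:f, m:t and related elements.
    $xpath->registerNamespace('m', 'http://schemas.openxmlformats.org/officeDocument/2006/math');

    $equations = array();
    foreach ($xpath->query('//m:oMath') as $math) {
        $text = '';
        foreach ($xpath->query('.//m:t', $math) as $t) {
            $text .= $t->textContent;
        }
        $equations[] = array(
            'xml'  => $dom->saveXML($math), // full structure: fractions, radicals, ...
            'text' => $text,                // characters only, structure is lost
        );
    }
    return $equations;
}

// Hand-written sample: the fraction a/b (not taken from a real file).
$mathXml = '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'
         . ' xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"><w:body><w:p>'
         . '<m:oMath><m:f><m:num><m:r><m:t>a</m:t></m:r></m:num>'
         . '<m:den><m:r><m:t>b</m:t></m:r></m:den></m:f></m:oMath>'
         . '</w:p></w:body></w:document>';
print_r(extractEquations($mathXml));
```

The text field alone loses the layout of the equation (the fraction above becomes just its characters), so rendering it properly on an HTML page means walking the structure kept in the xml field.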