2014-03-05 83 views

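Reading MS Word files in PHP

I read an MS Word file by first opening it as a ZIP archive and then extracting its XML.

But this removes the line breaks, and that bothers me. What should I do?

I use this code: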

function get_docx_content($filename) {
    // Check the extension (pathinfo avoids passing explode() to end() by reference)
    $ext = strtolower(pathinfo($filename, PATHINFO_EXTENSION));

    // if it's a docx file
    if ($ext == 'docx')
        $dataFile = "word/document.xml";
    // else it must be an odt file
    else
        $dataFile = "content.xml";

    // Create a new ZIP archive object
    $zip = new ZipArchive;

    // Open the archive file; give up if that fails
    if (true !== $zip->open($filename)) {
        return false;
    }

    $content = false;
    // If successful, search for the data file in the archive
    if (($index = $zip->locateName($dataFile)) !== false) {
        // Index found! Now read it to a string
        $text = $zip->getFromIndex($index);
        // Load XML from a string, ignoring errors and warnings
        // (loadXML must be called on an instance, not statically)
        $xml = new DOMDocument();
        $xml->loadXML($text, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
        // Remove XML formatting tags and keep the text
        $content = strip_tags($xml->saveXML());
    }
    // Close the archive file (the original returned before reaching this)
    $zip->close();
    return $content;
}
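
The line breaks most likely disappear because strip_tags() throws away the paragraph elements (<w:p>) together with every other tag, so nothing is left to mark where one paragraph ended. A minimal sketch of one way around that, assuming a docx file: walk the <w:p> elements with DOM and join their text with newlines. The function name docx_xml_to_text is made up for this illustration, and nested paragraphs (for example inside text boxes) are not handled.

```php
<?php
// Hypothetical helper (not from the original post): convert the XML of
// word/document.xml to plain text while keeping paragraph breaks.
function docx_xml_to_text($xmlString)
{
    // getFromIndex() returns false on failure; loadXML() rejects empty input
    if (!is_string($xmlString) || $xmlString === '') {
        return false;
    }
    $doc = new DOMDocument();
    if (!$doc->loadXML($xmlString, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return false;
    }
    // WordprocessingML main namespace (the usual "w:" prefix)
    $ns = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
    $lines = array();
    // Each <w:p> element is one paragraph
    foreach ($doc->getElementsByTagNameNS($ns, 'p') as $paragraph) {
        $line = '';
        // Visit the paragraph's descendants in document order
        foreach ($paragraph->getElementsByTagNameNS($ns, '*') as $node) {
            if ($node->localName === 't') {
                $line .= $node->textContent;   // run text
            } elseif ($node->localName === 'br') {
                $line .= "\n";                 // manual line break
            } elseif ($node->localName === 'tab') {
                $line .= "\t";                 // tab character
            }
        }
        $lines[] = $line;
    }
    // One output line per paragraph
    return implode("\n", $lines);
}
```

To try it, replace the strip_tags($xml->saveXML()) call in get_docx_content() with docx_xml_to_text($text). ODT files use different element names in content.xml, so this sketch only covers the docx branch.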

Answer


Try this and see whether it helps.

<?php 



/***************************************************************** 
This approach uses detection of NUL (chr(00)) and end line (chr(13)) 
to decide where the text is: 
- divide the file contents up by chr(13) 
- reject any slices containing a NUL 
- stitch the rest together again 
- clean up with a regular expression 
*****************************************************************/ 

function parseWord($userDoc) 
{ 
    // Open in binary mode so the bytes are read unchanged on every platform
    $fileHandle = fopen($userDoc, "rb"); 
    $line = @fread($fileHandle, filesize($userDoc)); 
    fclose($fileHandle); 
    $lines = explode(chr(0x0D),$line); 
    $outtext = ""; 
    foreach($lines as $thisline) 
     { 
     $pos = strpos($thisline, chr(0x00)); 
     // Keep only non-empty slices that contain no NUL byte
     if (($pos === FALSE)&&(strlen($thisline)>0)) 
      { 
      $outtext .= $thisline." "; 
      } 
     } 
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/","",$outtext); 
    return $outtext; 
} 

$userDoc = "cv.doc"; 

$text = parseWord($userDoc); 
echo $text; 


?> 
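
Note that parseWord() joins the surviving slices with a space, so paragraph boundaries are flattened here as well. A rough variant that keeps one line per slice might look like the sketch below. The name parseWordLines and the choice to take the raw bytes as a string (rather than a file name) are assumptions made for this illustration, and the character whitelist is only in the spirit of the cleanup step above. The same caveat applies as for the original: this is a heuristic for old binary .doc files, not a real parser.

```php
<?php
// Hypothetical variant of parseWord(): takes the raw bytes instead of a
// file name and joins the kept slices with "\n" instead of " ".
function parseWordLines($bytes)
{
    $kept = array();
    // Divide the contents up by carriage return (chr(13))
    foreach (explode(chr(0x0D), $bytes) as $slice) {
        // Reject empty slices and slices containing a NUL byte
        if ($slice === '' || strpos($slice, chr(0x00)) !== false) {
            continue;
        }
        // Drop every character outside a small whitelist
        $kept[] = preg_replace('/[^a-zA-Z0-9\s,.\-@\/_()]/', '', $slice);
    }
    // Stitch the rest together, one slice per line
    return implode("\n", $kept);
}
```

With a file on disk it could be called as parseWordLines(file_get_contents("cv.doc")).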

If you want to look into it further, see:

http://www.blogs.zeenor.com/it/read-ms-word-docx-ms-word-2007-file-document-using-php.html


No, unfortunately it produces a mess of hundreds of unknown characters, while my document only contains 12 meaningful words.