2009-05-26 86 views
48

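Parsing huge XML files in PHP

I'm trying to parse the DMOZ content/structure XML files into MySQL, but all the existing scripts for this are very old and don't work well. How can I open a large (1 GB+) XML file in PHP for parsing?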

+0

http://amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb/ It's so simple in Ruby – 2014-02-19 21:07:00

Answers

74

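Only two PHP APIs are really suited to processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read a continuous stream rather than loading the entire tree into memory (which is what SimpleXML and DOM do).

As an example, you might want to look at this partial parser for the DMOZ catalog: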

<?php 

class SimpleDMOZParser 
{ 
    protected $_stack = array(); 
    protected $_file = ""; 
    protected $_parser = null; 

    protected $_currentId = ""; 
    protected $_current = ""; 

    public function __construct($file) 
    { 
     $this->_file = $file; 

     $this->_parser = xml_parser_create("UTF-8"); 
     xml_set_object($this->_parser, $this); 
     xml_set_element_handler($this->_parser, "startTag", "endTag"); 
    } 

    public function startTag($parser, $name, $attribs) 
    { 
     array_push($this->_stack, $this->_current); 

     if ($name == "TOPIC" && count($attribs)) { 
      $this->_currentId = $attribs["R:ID"]; 
     } 

     if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) { 
      echo $attribs["R:RESOURCE"] . "\n"; 
     } 

     $this->_current = $name; 
    } 

    public function endTag($parser, $name) 
    { 
     $this->_current = array_pop($this->_stack); 
    } 

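    public function parse() 
    { 
     $fh = fopen($this->_file, "r"); 
     if (!$fh) { 
      die("Epic fail!\n"); 
     } 

     while (!feof($fh)) { 
      $data = fread($fh, 4096); 
      // xml_parse() returns 0 on a well-formedness error; stop instead of silently continuing 
      if (!xml_parse($this->_parser, $data, feof($fh))) { 
       fclose($fh); 
       die(sprintf("XML error: %s at line %d\n", 
        xml_error_string(xml_get_error_code($this->_parser)), 
        xml_get_current_line_number($this->_parser))); 
      } 
     } 

     // release the file handle once the whole stream has been fed to the parser 
     fclose($fh); 
    } 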
} 

$parser = new SimpleDMOZParser("content.rdf.u8"); 
$parser->parse(); 
+0

Most definitely the best answer for handling large XML – Evert 2009-05-26 21:27:27

+9

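This is a great answer, but it took me a long time to figure out that you need to use [xml_set_default_handler()](http://php.net/manual/en/function.xml-set-default-handler.php) to access the XML node data. With the code above you can only see the names of the nodes and their attributes. – DirtyBirdNJ 2012-01-18 17:53:56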
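
A minimal sketch of what the comment above is getting at, assuming the goal is to collect the text inside certain elements. It uses `xml_set_character_data_handler()`, the expat callback dedicated to text nodes, instead of the default handler the comment mentions. The function name `collectElementText` and its parameters are made up for illustration and are not part of the answer's parser.

```php
<?php

/**
 * Streams XML from an open file handle and returns the text content of
 * every element named $wanted. Hypothetical helper for illustration only;
 * it assumes the wanted elements are not nested inside each other.
 */
function collectElementText($fh, string $wanted): array
{
    $results = [];
    $buffer  = null; // null means "currently not inside a wanted element"

    $parser = xml_parser_create("UTF-8");
    // Keep element names as written instead of upper-casing them
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attribs) use (&$buffer, $wanted) {
            if ($name === $wanted) {
                $buffer = ""; // start collecting text
            }
        },
        function ($p, $name) use (&$buffer, &$results, $wanted) {
            if ($name === $wanted && $buffer !== null) {
                $results[] = $buffer;
                $buffer = null; // stop collecting
            }
        }
    );

    // The parser may deliver one text node in several pieces, so append
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$buffer) {
            if ($buffer !== null) {
                $buffer .= $data;
            }
        }
    );

    // Feed the parser in small blocks so memory use stays flat
    while (!feof($fh)) {
        $data = (string) fread($fh, 4096);
        if (!xml_parse($parser, $data, feof($fh))) {
            throw new RuntimeException(
                xml_error_string(xml_get_error_code($parser))
            );
        }
    }

    return $results;
}
```

For a file the size of the DMOZ dump you would process each value inside the end-tag callback rather than accumulate everything in `$results`; the array is only there to keep the sketch short.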

4

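This isn't a great solution, but just to throw another option out there:

You can break many large XML files into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with is).

For example, if your document looks like this: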

<dmoz> 
    <listing>....</listing> 
    <listing>....</listing> 
    <listing>....</listing> 
    <listing>....</listing> 
    <listing>....</listing> 
    <listing>....</listing> 
    ... 
</dmoz> 

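You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in your root-level tag, and then load them via simplexml/domxml (I used domxml when taking this approach).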
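
The chunking idea could be sketched roughly as follows. This is only an illustration under simplifying assumptions: it follows the `<dmoz>`/`<listing>` layout shown above, the function name `readListingBatches` is invented, the elements carry no attributes, are not nested or self-closing, and the tag text never appears inside comments or CDATA.

```php
<?php

/**
 * Reads $blockSize bytes at a time and yields one SimpleXMLElement per
 * block, containing only the complete <$element> items seen so far.
 * Illustrative sketch, not a general-purpose XML splitter.
 */
function readListingBatches($fh, string $element = 'listing', int $blockSize = 1048576): Generator
{
    $open    = '<' . $element . '>';
    $close   = '</' . $element . '>';
    $carry   = '';    // unparsed tail carried over between reads
    $started = false; // true once the first item has been located

    while (!feof($fh)) {
        $carry .= (string) fread($fh, $blockSize);

        if (!$started) {
            // Drop the XML declaration and the root start tag
            $first = strpos($carry, $open);
            if ($first === false) {
                continue;
            }
            $carry   = substr($carry, $first);
            $started = true;
        }

        // Everything up to the last complete close tag can be parsed now
        $last = strrpos($carry, $close);
        if ($last === false) {
            continue;
        }
        $cut      = $last + strlen($close);
        $complete = substr($carry, 0, $cut);
        $carry    = substr($carry, $cut); // partial item waits for the next read

        // Wrap the items in an artificial root so the fragment is well-formed
        yield simplexml_load_string('<dmoz>' . $complete . '</dmoz>');
    }
}
```

Each yielded batch can be walked with `foreach ($batch->listing as $item)`. The leftover root end tag at the end of the file is simply discarded.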

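Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that you're stuck with either the chunking strategy above or the old SAX/expat library. I don't know about the rest of you, but I hate writing and maintaining SAX/expat parsers.

Note, however, that this approach is not really practical when your document doesn't consist of many identical bottom-level elements. It works well for any sort of list of files, URLs, and so on, but wouldn't make sense for parsing a large HTML document.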
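
For PHP 5.1.2 and later, a sketch of the XMLReader route might look like the following. It assumes the elements of interest are siblings under one parent, as in the `<listing>` example; the function name `streamElements` is hypothetical, and handing each element to SimpleXML is just one convenient choice.

```php
<?php

/**
 * Streams a file with XMLReader and yields each <$element> as a
 * SimpleXMLElement. Only one element is held in memory at a time.
 * Function name and parameters are illustrative.
 */
function streamElements(string $path, string $element): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    // Advance node by node until the first matching element
    while ($reader->read() && $reader->name !== $element) {
        // keep scanning
    }

    while ($reader->name === $element) {
        // readOuterXml() returns the element including its own tags
        yield simplexml_load_string($reader->readOuterXml());

        // Jump to the next element with this name, skipping the subtree
        $reader->next($element);
    }

    $reader->close();
}
```

A caller would loop with `foreach (streamElements('content.rdf.u8', 'Topic') as $node)`, where the file and element names are whatever your document actually uses.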

9

I recently had to parse some pretty large XML documents, and needed a method to read one element at a time.

If you have the following file complex-test.xml:

<?xml version="1.0" encoding="UTF-8"?> 
<Complex> 
    <Object> 
    <Title>Title 1</Title> 
    <Name>It's name goes here</Name> 
    <ObjectData> 
     <Info1></Info1> 
     <Info2></Info2> 
     <Info3></Info3> 
     <Info4></Info4> 
    </ObjectData> 
    <Date></Date> 
    </Object> 
    <Object></Object> 
    <Object> 
    <AnotherObject></AnotherObject> 
    <Data></Data> 
    </Object> 
    <Object></Object> 
    <Object></Object> 
</Complex> 

and you want to return the <Object/>s:

PHP:

require_once('class.chunk.php'); 

$file = new Chunk('complex-test.xml', array('element' => 'Object')); 

while ($xml = $file->read()) { 
    $obj = simplexml_load_string($xml); 
    // do some parsing, insert to DB whatever 
} 

########### 
Class File 
########### 

<?php 
/** 
* Chunk 
* 
* Reads a large file in as chunks for easier parsing. 
* 
* The chunks returned are whole <$this->options['element']/>s found within file. 
* 
* Each call to read() returns the whole element including start and end tags. 
* 
* Tested with a 1.8MB file, extracted 500 elements in 0.11s 
* (with no work done, just extracting the elements) 
* 
* Usage: 
* <code> 
* // initialize the object 
* $file = new Chunk('chunk-test.xml', array('element' => 'Chunk')); 
* 
* // loop through the file until all lines are read 
* while ($xml = $file->read()) { 
*  // do whatever you want with the string 
*  $o = simplexml_load_string($xml); 
* } 
* </code> 
* 
* @package default 
* @author Dom Hastings 
*/ 
class Chunk { 
    /** 
    * options 
    * 
    * @var array Contains all major options 
    * @access public 
    */ 
    public $options = array(
    'path' => './',  // string The path to check for $file in 
    'element' => '',  // string The XML element to return 
    'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk 
); 

    /** 
    * file 
    * 
    * @var string The filename being read 
    * @access public 
    */ 
    public $file = ''; 
    /** 
    * pointer 
    * 
    * @var integer The current position the file is being read from 
    * @access public 
    */ 
    public $pointer = 0; 

    /** 
    * handle 
    * 
    * @var resource The fopen() resource 
    * @access private 
    */ 
    private $handle = null; 
    /** 
    * reading 
    * 
    * @var boolean Whether the script is currently reading the file 
    * @access private 
    */ 
    private $reading = false; 
    /** 
    * readBuffer 
    * 
    * @var string Used to make sure start tags aren't missed 
    * @access private 
    */ 
    private $readBuffer = ''; 

    /** 
    * __construct 
    * 
    * Builds the Chunk object 
    * 
    * @param string $file The filename to work with 
    * @param array $options The options with which to parse the file 
    * @author Dom Hastings 
    * @access public 
    */ 
    public function __construct($file, $options = array()) { 
    // merge the options together 
    $this->options = array_merge($this->options, (is_array($options) ? $options : array())); 

    // check that the path ends with a /
    if (substr($this->options['path'], -1) != '/') { 
     $this->options['path'] .= '/'; 
    } 

    // normalize the filename 
    $file = basename($file); 

    // make sure chunkSize is an int 
    $this->options['chunkSize'] = intval($this->options['chunkSize']); 

    // check it's valid 
    if ($this->options['chunkSize'] < 64) { 
     $this->options['chunkSize'] = 512; 
    } 

    // set the filename 
    $this->file = realpath($this->options['path'].$file); 

    // check the file exists 
    if (!file_exists($this->file)) { 
     throw new Exception('Cannot load file: '.$this->file); 
    } 

    // open the file 
    $this->handle = fopen($this->file, 'r'); 

    // check the file opened successfully 
    if (!$this->handle) { 
     throw new Exception('Error opening file for reading'); 
    } 
    } 

    /** 
    * __destruct 
    * 
    * Cleans up 
    * 
    * @return void 
    * @author Dom Hastings 
    * @access public 
    */ 
    public function __destruct() { 
    // close the file resource 
    fclose($this->handle); 
    } 

    /** 
    * read 
    * 
    * Reads the first available occurrence of the XML element $this->options['element'] 
    * 
    * @return string The XML string from $this->file 
    * @author Dom Hastings 
    * @access public 
    */ 
    public function read() { 
    // check we have an element specified 
    if (!empty($this->options['element'])) { 
     // trim it 
     $element = trim($this->options['element']); 

    } else { 
     $element = ''; 
    } 

    // initialize the buffer 
    $buffer = false; 

    // if the element is empty 
    if (empty($element)) { 
     // let the script know we're reading 
     $this->reading = true; 

     // read in the whole doc, cos we don't know what's wanted 
     while ($this->reading) { 
     $buffer .= fread($this->handle, $this->options['chunkSize']); 

     $this->reading = (!feof($this->handle)); 
     } 

     // return it all 
     return $buffer; 

    // we must be looking for a specific element 
    } else { 
     // set up the strings to find 
     $open = '<'.$element.'>'; 
     $close = '</'.$element.'>'; 

     // let the script know we're reading 
     $this->reading = true; 

     // reset the global buffer 
     $this->readBuffer = ''; 

     // this is used to ensure all data is read, and to make sure we don't send the start data again by mistake 
     $store = false; 

     // seek to the position we need in the file 
     fseek($this->handle, $this->pointer); 

     // start reading 
     while ($this->reading && !feof($this->handle)) { 
     // store the chunk in a temporary variable 
     $tmp = fread($this->handle, $this->options['chunkSize']); 

     // update the global buffer 
     $this->readBuffer .= $tmp; 

     // check for the open string 
     $checkOpen = strpos($tmp, $open); 

     // if it wasn't in the new buffer 
     if (!$checkOpen && !($store)) { 
      // check the full buffer (in case it was only half in this buffer) 
      $checkOpen = strpos($this->readBuffer, $open); 

      // if it was in there 
      if ($checkOpen) { 
      // set it to the remainder 
      $checkOpen = $checkOpen % $this->options['chunkSize']; 
      } 
     } 

     // check for the close string 
     $checkClose = strpos($tmp, $close); 

     // if it wasn't in the new buffer 
     if (!$checkClose && ($store)) { 
      // check the full buffer (in case it was only half in this buffer) 
      $checkClose = strpos($this->readBuffer, $close); 

      // if it was in there 
      if ($checkClose) { 
      // set it to the remainder plus the length of the close string itself 
      $checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize']; 
      } 

     // if it was 
     } elseif ($checkClose) { 
      // add the length of the close string itself 
      $checkClose += strlen($close); 
     } 

     // if we've found the opening string and we're not already reading another element 
     if ($checkOpen !== false && !($store)) { 
      // if we've found the end element too 
      if ($checkClose !== false) { 
      // append the string only between the start and end element 
      $buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen)); 

      // update the pointer 
      $this->pointer += $checkClose; 

      // let the script know we're done 
      $this->reading = false; 

      } else { 
      // append the data we know to be part of this element 
      $buffer .= substr($tmp, $checkOpen); 

      // update the pointer 
      $this->pointer += $this->options['chunkSize']; 

      // let the script know we're gonna be storing all the data until we find the close element 
      $store = true; 
      } 

     // if we've found the closing element 
     } elseif ($checkClose !== false) { 
      // update the buffer with the data up to and including the close tag 
      $buffer .= substr($tmp, 0, $checkClose); 

      // update the pointer 
      $this->pointer += $checkClose; 

      // let the script know we're done 
      $this->reading = false; 

     // if we've found the closing element, but half in the previous chunk 
     } elseif ($store) { 
      // update the buffer 
      $buffer .= $tmp; 

      // and the pointer 
      $this->pointer += $this->options['chunkSize']; 
     } 
     } 
    } 

    // return the element (or the whole file if we're not looking for elements) 
    return $buffer; 
    } 
} 
+0

Thanks. This was really helpful. – 2014-11-11 16:58:47

12

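This is a very similar question to Best way to process large XML in PHP, but with a very good, specific, upvoted answer that addresses the particular problem of parsing the DMOZ catalog. However, since this is a good Google hit for large XML files in general, I will repost my answer from the other question as well:

My take on it: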

https://github.com/prewk/XmlStreamer

A simple class that extracts all children of the XML root element while streaming the file. Tested on a 108 MB XML file from pubmed.com.

class SimpleXmlStreamer extends XmlStreamer { 
    public function processNode($xmlString, $elementName, $nodeIndex) { 
     $xml = simplexml_load_string($xmlString); 

     // Do something with your SimpleXML object 

     return true; 
    } 
} 

$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml"); 
$streamer->parse(); 
+0

This is great! Thanks. One question: how do you get the root node's attributes using this? – 2013-10-15 10:35:31

+0

@gyaani_guy I don't think that's currently possible, unfortunately. – oskarth 2013-12-22 21:53:09

+4

This just loads the whole file into memory! – 2014-03-07 16:14:34