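Parsing huge XML files in PHP
I'm trying to parse the DMOZ content/structure XML files into MySQL, but all the existing scripts for this are very old and don't work well. How can I open a large (1 GB+) XML file in PHP for parsing?
Answers
There are only two PHP APIs that are really suited to processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read a continuous stream rather than loading the entire tree into memory (which is what SimpleXML and DOM do).
As an example, you may want to look at this partial parser for the DMOZ catalog: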
<?php
class SimpleDMOZParser
{
protected $_stack = array();
protected $_file = "";
protected $_parser = null;
protected $_currentId = "";
protected $_current = "";
public function __construct($file)
{
$this->_file = $file;
$this->_parser = xml_parser_create("UTF-8");
xml_set_object($this->_parser, $this);
xml_set_element_handler($this->_parser, "startTag", "endTag");
}
public function startTag($parser, $name, $attribs)
{
array_push($this->_stack, $this->_current);
if ($name == "TOPIC" && count($attribs)) {
$this->_currentId = $attribs["R:ID"];
}
if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
echo $attribs["R:RESOURCE"] . "\n";
}
$this->_current = $name;
}
public function endTag($parser, $name)
{
$this->_current = array_pop($this->_stack);
}
public function parse()
{
$fh = fopen($this->_file, "r");
if (!$fh) {
die("Epic fail!\n");
}
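while (!feof($fh)) {
$data = fread($fh, 4096);
// feed the chunk to expat; the third argument flags the final chunk
if (!xml_parse($this->_parser, $data, feof($fh))) {
die("XML error: " . xml_error_string(xml_get_error_code($this->_parser)) . "\n");
}
}
fclose($fh);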
}
}
$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
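For processing large XML, this is most definitely the best answer. – Evert 2009-05-26 21:27:27
This is a great answer, but it took me a long time to figure out that you need to use [xml_set_default_handler()](http://php.net/manual/en/function.xml-set-default-handler.php) to access the XML node data; with the code above you can only see the names of the nodes and their attributes. – DirtyBirdNJ 2012-01-18 17:53:56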
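To illustrate the comment above, here is a minimal sketch of reading node text with expat. It uses xml_set_character_data_handler(), a related handler that receives only text content, rather than the default handler the commenter mentions. The function name collectText and its parameters are made up for this example, and it takes the XML as a string only to keep the sketch short.

```php
<?php
// Hypothetical helper: returns the text of every element named $wanted.
// Element names arrive upper-cased because expat case folding is on by default.
function collectText(string $xml, string $wanted): array
{
    $texts  = [];
    $inside = false;
    $buffer = '';

    $parser = xml_parser_create('UTF-8');
    xml_set_element_handler(
        $parser,
        // start tag: begin collecting when the wanted element opens
        function ($p, $name, $attribs) use (&$inside, &$buffer, $wanted) {
            if ($name === $wanted) {
                $inside = true;
                $buffer = '';
            }
        },
        // end tag: store what was collected
        function ($p, $name) use (&$inside, &$buffer, &$texts, $wanted) {
            if ($name === $wanted) {
                $texts[] = trim($buffer);
                $inside  = false;
            }
        }
    );
    // The text handler may fire several times per element, so append.
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$inside, &$buffer) {
            if ($inside) {
                $buffer .= $data;
            }
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }
    return $texts;
}
```

In the streaming parser above, the same three handlers would be registered once in the constructor and fed chunk by chunk through xml_parse().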
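I would suggest using a SAX-based parser rather than a DOM-based one.
Information on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm
This isn't a great solution, but just to throw another option out there:
You can break many large XML files into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with is).
For example, if your document looks like this: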
<dmoz>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
<listing>....</listing>
...
</dmoz>
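You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in your root-level tag, and then load them via simplexml/domxml (I used domxml when taking this approach).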
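A rough sketch of that chunk-and-wrap idea, using SimpleXML instead of domxml. The function name readInBatches and its parameters are invented for illustration. It assumes the repeated element is never nested inside itself and is written without attributes, so plain string search is enough to find element boundaries.

```php
<?php
// Hypothetical helper: yields SimpleXMLElement batches of complete <$element> items.
function readInBatches(string $path, string $root, string $element, int $chunkSize = 1048576): Generator
{
    $open  = '<' . $element . '>';
    $close = '</' . $element . '>';
    $fh = fopen($path, 'r');
    if (!$fh) {
        throw new RuntimeException("Cannot open $path");
    }
    $buffer = '';
    while (!feof($fh)) {
        $buffer .= fread($fh, $chunkSize);
        $first = strpos($buffer, $open);
        $last  = strrpos($buffer, $close);
        if ($first === false || $last === false || $last < $first) {
            continue; // no complete element buffered yet, keep reading
        }
        $end = $last + strlen($close);
        // Wrap the complete elements in an artificial root so they parse on their own.
        $fragment = substr($buffer, $first, $end - $first);
        yield simplexml_load_string("<$root>$fragment</$root>");
        // Keep the unfinished tail for the next round.
        $buffer = substr($buffer, $end);
    }
    fclose($fh);
}
```

Each yielded batch is a small document of its own, so memory use stays roughly proportional to the chunk size rather than to the file size.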
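Frankly, I prefer this approach if you're using PHP < 5.1.2. In 5.1.2 and later, XMLReader is available, which is probably the best option, but before that you're stuck with either the chunking strategy above or the old SAX/expat library. I don't know about everyone else, but I hate writing and maintaining SAX/expat parsers.
Note, however, that this approach is not really practical when your document doesn't consist of many identical bottom-level elements (for example, it works well for any sort of list of files or URLs, but wouldn't make sense for parsing a large HTML document).
I recently had to parse some pretty large XML documents, and needed a way to read one element at a time.
If you have the following file complex-test.xml: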
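For the XMLReader route, a minimal sketch might look like the following. The function name streamElements is made up for this example. It assumes the wanted elements are not nested inside each other, and it hands each one to SimpleXML for convenient access.

```php
<?php
// Hypothetical helper: streams $path and yields each <$element> as a SimpleXMLElement.
function streamElements(string $path, string $element): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }
    // Advance to the first matching element.
    while ($reader->read() && $reader->name !== $element) {
        // keep reading
    }
    while ($reader->name === $element) {
        // Only this one element is expanded into a string and parsed.
        yield new SimpleXMLElement($reader->readOuterXml());
        // Skip the subtree just handled and move to the next element of that name.
        $reader->next($element);
    }
    $reader->close();
}
```

Only one element is held in memory at a time, so this should stay flat on memory even for files in the gigabyte range, as long as the individual elements are small.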
<?xml version="1.0" encoding="UTF-8"?>
<Complex>
<Object>
<Title>Title 1</Title>
<Name>It's name goes here</Name>
<ObjectData>
<Info1></Info1>
<Info2></Info2>
<Info3></Info3>
<Info4></Info4>
</ObjectData>
<Date></Date>
</Object>
<Object></Object>
<Object>
<AnotherObject></AnotherObject>
<Data></Data>
</Object>
<Object></Object>
<Object></Object>
</Complex>
and want to return the <Object/>s:
PHP:
require_once('class.chunk.php');
$file = new Chunk('complex-test.xml', array('element' => 'Object'));
while ($xml = $file->read()) {
$obj = simplexml_load_string($xml);
// do some parsing, insert to DB whatever
}
###########
Class File
###########
<?php
/**
* Chunk
*
* Reads a large file in as chunks for easier parsing.
*
* The chunks returned are whole <$this->options['element']/>s found within file.
*
* Each call to read() returns the whole element including start and end tags.
*
* Tested with a 1.8MB file, extracted 500 elements in 0.11s
* (with no work done, just extracting the elements)
*
* Usage:
* <code>
* // initialize the object
* $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
*
* // loop through the file until all lines are read
* while ($xml = $file->read()) {
* // do whatever you want with the string
* $o = simplexml_load_string($xml);
* }
* </code>
*
* @package default
* @author Dom Hastings
*/
class Chunk {
/**
* options
*
* @var array Contains all major options
* @access public
*/
public $options = array(
'path' => './', // string The path to check for $file in
'element' => '', // string The XML element to return
'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk
);
/**
* file
*
* @var string The filename being read
* @access public
*/
public $file = '';
/**
* pointer
*
* @var integer The current position the file is being read from
* @access public
*/
public $pointer = 0;
/**
* handle
*
* @var resource The fopen() resource
* @access private
*/
private $handle = null;
/**
* reading
*
* @var boolean Whether the script is currently reading the file
* @access private
*/
private $reading = false;
/**
* readBuffer
*
* @var string Used to make sure start tags aren't missed
* @access private
*/
private $readBuffer = '';
/**
* __construct
*
* Builds the Chunk object
*
* @param string $file The filename to work with
* @param array $options The options with which to parse the file
* @author Dom Hastings
* @access public
*/
public function __construct($file, $options = array()) {
// merge the options together
$this->options = array_merge($this->options, (is_array($options) ? $options : array()));
// check that the path ends with a /
if (substr($this->options['path'], -1) != '/') {
$this->options['path'] .= '/';
}
// normalize the filename
$file = basename($file);
// make sure chunkSize is an int
$this->options['chunkSize'] = intval($this->options['chunkSize']);
// check it's valid
if ($this->options['chunkSize'] < 64) {
$this->options['chunkSize'] = 512;
}
// set the filename
$this->file = realpath($this->options['path'].$file);
// check the file exists
if (!file_exists($this->file)) {
throw new Exception('Cannot load file: '.$this->file);
}
// open the file
$this->handle = fopen($this->file, 'r');
// check the file opened successfully
if (!$this->handle) {
throw new Exception('Error opening file for reading');
}
}
/**
* __destruct
*
* Cleans up
*
* @return void
* @author Dom Hastings
* @access public
*/
public function __destruct() {
// close the file resource
fclose($this->handle);
}
/**
* read
*
* Reads the first available occurrence of the XML element $this->options['element']
*
* @return string The XML string from $this->file
* @author Dom Hastings
* @access public
*/
public function read() {
// check we have an element specified
if (!empty($this->options['element'])) {
// trim it
$element = trim($this->options['element']);
} else {
$element = '';
}
// initialize the buffer
$buffer = false;
// if the element is empty
if (empty($element)) {
// let the script know we're reading
$this->reading = true;
// read in the whole doc, cos we don't know what's wanted
while ($this->reading) {
$buffer .= fread($this->handle, $this->options['chunkSize']);
$this->reading = (!feof($this->handle));
}
// return it all
return $buffer;
// we must be looking for a specific element
} else {
// set up the strings to find
$open = '<'.$element.'>';
$close = '</'.$element.'>';
// let the script know we're reading
$this->reading = true;
// reset the global buffer
$this->readBuffer = '';
// this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
$store = false;
// seek to the position we need in the file
fseek($this->handle, $this->pointer);
// start reading
while ($this->reading && !feof($this->handle)) {
// store the chunk in a temporary variable
$tmp = fread($this->handle, $this->options['chunkSize']);
// update the global buffer
$this->readBuffer .= $tmp;
// check for the open string
$checkOpen = strpos($tmp, $open);
// if it wasn't in the new buffer
if (!$checkOpen && !($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkOpen = strpos($this->readBuffer, $open);
// if it was in there
if ($checkOpen) {
// set it to the remainder
$checkOpen = $checkOpen % $this->options['chunkSize'];
}
}
// check for the close string
$checkClose = strpos($tmp, $close);
// if it wasn't in the new buffer
if (!$checkClose && ($store)) {
// check the full buffer (in case it was only half in this buffer)
$checkClose = strpos($this->readBuffer, $close);
// if it was in there
if ($checkClose) {
// set it to the remainder plus the length of the close string itself
$checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
}
// if it was
} elseif ($checkClose) {
// add the length of the close string itself
$checkClose += strlen($close);
}
// if we've found the opening string and we're not already reading another element
if ($checkOpen !== false && !($store)) {
// if we've found the end element too
if ($checkClose !== false) {
// append the string only between the start and end element
$buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
} else {
// append the data we know to be part of this element
$buffer .= substr($tmp, $checkOpen);
// update the pointer
$this->pointer += $this->options['chunkSize'];
// let the script know we're gonna be storing all the data until we find the close element
$store = true;
}
// if we've found the closing element
} elseif ($checkClose !== false) {
// update the buffer with the data up to and including the close tag
$buffer .= substr($tmp, 0, $checkClose);
// update the pointer
$this->pointer += $checkClose;
// let the script know we're done
$this->reading = false;
// if we've found the closing element, but half in the previous chunk
} elseif ($store) {
// update the buffer
$buffer .= $tmp;
// and the pointer
$this->pointer += $this->options['chunkSize'];
}
}
}
// return the element (or the whole file if we're not looking for elements)
return $buffer;
}
}
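Thanks. This was really helpful. – 2014-11-11 16:58:47
This is a very similar question to Best way to process large XML in PHP, but with a very good, specific, upvoted answer that addresses the particular problem of parsing the DMOZ catalog. However, since this is a good Google hit for large XML files in general, I will repost my answer from the other question here as well:
My take on it: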
https://github.com/prewk/XmlStreamer
A simple class that extracts all children of the XML root element while streaming the file. Tested on a 108 MB XML file from pubmed.com.
class SimpleXmlStreamer extends XmlStreamer {
public function processNode($xmlString, $elementName, $nodeIndex) {
$xml = simplexml_load_string($xmlString);
// Do something with your SimpleXML object
return true;
}
}
$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();
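This is great! Thanks. One question: how can I get the root node's attributes using this? – 2013-10-15 10:35:31
@gyaani_guy I don't think that's currently possible, unfortunately. – oskarth 2013-12-22 21:53:09
This just loads the entire file into memory! – 2014-03-07 16:14:34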
http://amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb/ It's this simple in Ruby. – 2014-02-19 21:07:00