Echoing UTF-8 text when scraping a web page

I'm using this code to scrape specific data from a website:

<!DOCTYPE html> 
    <head> 
    <meta http-equiv="content-type" content="text/html; charset=utf-8"> 

    <title>scrap</title> 
    </head> 
    <body> 
<?php 
$url = 'http://xn--mgbaam1d9c.com'; 
$html = file_get_contents($url); 

libxml_use_internal_errors(true); 
$doc = new DOMDocument; 
$doc->loadHTML($html); 
$xpath = new DOMXpath($doc); 

// A name attribute on a <div>??? 
$node = $xpath->query('//div[@class="list"]')->item(0); 

echo $node->textContent; 

?> 

</body> 
</html>

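It works fine, but:

The output shows only one scraped result, and I want it to show all results (the website has pagination).
The results are in Arabic, and they are displayed garbled, like in this image - http://i.stack.imgur.com/Z9VMn.png

How can I make it fetch all the results and display them in Arabic exactly as they appear on the site?

Thanks in advance.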

2013-10-31 Youssef Subehi

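You only get the first item because of .item(0). Take a look at what $xpath->query returns: a DOMNodeList, which has a length property.
Use iconv to convert the encoding from windows-1256 to utf-8.

Something like this: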

$nodeList = $xpath->query('//div[@class="list"]'); 

for ($i = 0; $i < $nodeList->length; $i++) { 
    $node = $nodeList->item($i); 
    echo iconv('WINDOWS-1256','UTF-8',$node->textContent); 
}

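Edit: mb_convert_encoding does not support Windows-1256, so I switched to iconv instead.

You can also retrieve the encoding dynamically from the content of the HTML meta tag: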

$fromEncoding = ''; 
$contentType = $xpath->query('//meta[@http-equiv="content-type"]')->item(0)->getAttribute('content'); 
preg_match('/charset=([A-Za-z0-9_-]+)$/',$contentType,$contentTypeMatches); 
if (isset($contentTypeMatches[1])) { 
    $fromEncoding = strtoupper($contentTypeMatches[1]); 
}
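
To show how the detected value could be used, here is a minimal sketch that feeds it into the same iconv loop as above. The helper name detectCharset, its fallback parameter, and the inline sample markup are assumptions made for illustration and are not part of the original answer. The regular expression is also slightly relaxed (case-insensitive, no end-of-string anchor) so that a trailing semicolon or space does not prevent a match.

```php
<?php
// Hypothetical helper: read the charset declared in the page's
// <meta http-equiv="content-type"> tag, or return $default if none is found.
function detectCharset(DOMXPath $xpath, string $default): string
{
    $meta = $xpath->query('//meta[@http-equiv="content-type"]')->item(0);
    if ($meta instanceof DOMElement
        && preg_match('/charset=([A-Za-z0-9_-]+)/i', $meta->getAttribute('content'), $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Inline sample markup (an assumption) so the sketch runs without a network request.
$html = '<html><head>'
      . '<meta http-equiv="content-type" content="text/html; charset=windows-1256">'
      . '</head><body><div class="list">one</div><div class="list">two</div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Fall back to the hard-coded value from the loop above when no meta tag is present.
$fromEncoding = detectCharset($xpath, 'WINDOWS-1256');

// Same loop as before, with the detected encoding in place of the literal.
$texts = [];
foreach ($xpath->query('//div[@class="list"]') as $node) {
    $texts[] = iconv($fromEncoding, 'UTF-8', $node->textContent);
}

echo $fromEncoding, PHP_EOL;
echo implode(PHP_EOL, $texts), PHP_EOL;
```

The sample text is plain ASCII, so the conversion leaves it unchanged here. On a real page it is worth checking the output, since the result depends on how the page declares its encoding.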

2013-11-01 00:21:42 zamnuts
