Guessing charset encoding in PHP

I want to write my own web crawler in PHP using cURL.

[...] 
mb_internal_encoding('UTF-8'); 
mb_language('uni'); 
$this->_curl = curl_init(); 
curl_setopt($this->_curl, CURLOPT_SSL_VERIFYPEER, false); 
curl_setopt($this->_curl, CURLOPT_SSL_VERIFYHOST, 0); 
curl_setopt($this->_curl, CURLOPT_FOLLOWLOCATION, false); 
curl_setopt($this->_curl, CURLOPT_MAXREDIRS, 0); 
curl_setopt($this->_curl, CURLOPT_TIMEOUT, 10); 
curl_setopt($this->_curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10'); 
curl_setopt($this->_curl, CURLOPT_HEADER, true); 
curl_setopt($this->_curl, CURLOPT_RETURNTRANSFER, true); 
$header = array(
      "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
      "Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3", 
      "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7", 
      "Keep-Alive: 115", 
      "Connection: keep-alive", 
); 
curl_setopt($this->_curl, CURLOPT_HTTPHEADER, $header); 
curl_setopt($this->_curl, CURLOPT_URL, $url); 
curl_setopt($this->_curl, CURLOPT_POST, false); 
curl_setopt($this->_curl, CURLOPT_POSTFIELDS, array()); 
curl_setopt($this->_curl, CURLOPT_HTTPGET, true); 
$page = curl_exec($this->_curl); 
[...]

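The problem is the charset of the websites. As you can see at

http://blog.163.com/drewes_4711/blog/static/179317021201151624826557/

there is a header "Content-Type: ...;charset=GBK", so I can do mb_convert_encoding($content, "UTF-8", "GBK"); but what should I do with

http://tech.hexun.com/2011-06-21/130756909.html

It seems to use the same charset, but it is not given in the HTTP headers. So I have huge problems with German umlauts, Chinese, and other Asian languages... Is there any module or snippet that I can use with cURL to determine the charset of any downloaded HTML page?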


2011-06-25 rabudde

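That second link contains:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

before any data that is not plain ASCII. So if the HTTP headers give no clue, you could try to parse the document (assuming plain ASCII rather than UTF-8, which might break) until you find that tag.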

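This is obviously not guaranteed to work. If the server doesn't send an encoding and the page doesn't contain that tag either, you're out of luck. There is no general way to reliably detect the encoding of arbitrary data.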
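The two-step lookup described above (HTTP header first, then the meta tag) could be sketched roughly as follows. The function name detect_charset and both regular expressions are illustrative assumptions, not something from the question's code, and the 4096-byte scan window is an arbitrary choice:

```php
<?php
// Hypothetical helper: name, regexes, and scan window are illustrative only.
function detect_charset(?string $contentType, string $html): ?string
{
    // 1. Prefer the charset parameter of the HTTP Content-Type header.
    if ($contentType !== null
        && preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }

    // 2. Otherwise scan the start of the body, treated as plain ASCII bytes,
    //    for a <meta ...> charset declaration.
    $head = substr($html, 0, 4096);
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $head, $m)) {
        return strtoupper($m[1]);
    }

    // 3. Neither the header nor the document declares a charset.
    return null;
}
```

With cURL, the header value can be read via curl_getinfo($ch, CURLINFO_CONTENT_TYPE) after curl_exec(). A null result from the sketch means the caller still has to guess.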


2011-06-25 06:58:45 Mat

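@mu, @Mat: Great. I don't know why I didn't look for this myself. But when I send 'Accept-Charset: utf-8', isn't it correct to assume the content is UTF-8 if I can't find any content type in the headers or in the content? – rabudde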

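In an ideal world, you would indeed not have this problem. But if your application can only cope with well-configured, standards-compliant web servers and web pages... then you're going to miss out on a lot :-) – Mat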

Sure. I'm now going to try parsing the headers and the content for a charset definition; after that I'll check with 'mb_check_encoding($string, 'UTF-8')', and otherwise I'll run 'utf8_encode($string)' – rabudde
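The fallback chain from this comment might look something like the sketch below. The name to_utf8 is hypothetical, and the mbstring extension is assumed to be loaded. Note that utf8_encode() only converts from ISO-8859-1 and is deprecated as of PHP 8.2, so the sketch uses the equivalent mb_convert_encoding() call instead:

```php
<?php
// Hypothetical helper; assumes the mbstring extension is available.
function to_utf8(string $data, ?string $declared = null): string
{
    // 1. Trust a declared charset if mbstring recognises the name.
    if ($declared !== null) {
        try {
            // PHP 7 emits a warning and returns false for unknown names,
            // PHP 8 throws ValueError; both cases fall through to step 2.
            $converted = @mb_convert_encoding($data, 'UTF-8', $declared);
            if (is_string($converted)) {
                return $converted;
            }
        } catch (\ValueError $e) {
            // Unknown encoding name: ignore and keep guessing.
        }
    }

    // 2. Bytes that already form valid UTF-8 are returned unchanged.
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }

    // 3. Last resort: assume ISO-8859-1, which is what utf8_encode() assumed.
    return mb_convert_encoding($data, 'UTF-8', 'ISO-8859-1');
}
```

The last step is only a guess: GBK bytes with no declaration would be misread as ISO-8859-1 and come out as mojibake, which matches the caveat in the answer that undeclared encodings cannot be detected reliably.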
