PHP DOMDocument: removing special characters

I'm using DOMDocument's getElementsByTagName to retrieve a website's title.

Here is my code:

$doc = new DOMDocument(); 
@$doc->loadHTML($strData); 
$doc->encoding = 'utf-8'; 
$doc->saveHTML(); 
$titleNode = $doc->getElementsByTagName("title");

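It works fine, but when there are special characters in the title, the retrieved data is wrong. I'm getting "Some More Google Plus Invite Workarounds #wrapper { background:url(/) no-repeat 50% 0; } body { background:#CFD8E2; }" instead of the actual title.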

I tried the following to replace the special characters, but it didn't work:

// Replace all special characters into space 
    $specialChars = array('~','`','!','@','#','$','%','^','&','*','(',')','-','_','=','+','|','\\',']','[','}','{','"','\'',':',';','/','?','.',',','>','<'); 
     foreach ($specialChars as $a) { 
     $titleNode = str_replace($a, ' ', $titleNode); 

    }

I get an empty title instead. The <title> value looks something like this:

<title>Some More Google Plus Invite Workarounds < Communication, Social Networking < PC World India News < PC World.in</title>

So what should I do?

Source

2011-07-07 nuttynibbles

Hmm, is it reading the "less than" sign (<) as the start of an HTML tag? –

Yes, it is.. after reading that, it jumps ahead and reads the CSS styles. – nuttynibbles

+1 for using a parser! –

It looks like your HTML is malformed. If you have a stray < in the title, I'm surprised you aren't getting Warning: DOMDocument::loadHTML(): error parsing attribute name in Entity, line: 1 in <path> on line <line>.
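One likely reason the warning never showed up is the @ in front of $doc->loadHTML() in the question, which suppresses it. A minimal sketch of collecting the parser's complaints instead of hiding them follows; the helper name collectParseErrors is made up for illustration, and which messages (if any) you get depends on the libxml2 version PHP was built against.

```php
<?php
// Hypothetical helper: parse HTML and return libxml's messages instead of
// suppressing them with @.
function collectParseErrors(string $html): array
{
    // Buffer libxml errors internally rather than emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError with message, line and column fields.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    // Empty the buffer and restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $messages;
}

// Assumed sample input with a stray "<" in the title.
$sample = '<html><head><title>A < B</title></head><body></body></html>';
print_r(collectParseErrors($sample));
```

If the list comes back empty for your page, the parser did not object to the markup, and the problem lies elsewhere.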

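As for the replacement: if you replace every < and > in an HTML document, you won't be able to retrieve any elements from it, because there won't be any elements left: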

<head><title>Foo</title></head>

becomes

headtitleFoo/title/head

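Unfortunately, there isn't much you can do to work around this - bad HTML is bad HTML. If you know in advance that you can expect this kind of problem, you could do something with preg_replace (maybe preg_replace('#\s<\s#', ' &lt; ', $input);? Note that PHP's PCRE functions have no g modifier; preg_replace already replaces every match. Or preg_match('#title[^>]*>(.*)</title#', $input, $matches)?) or with substr, but you may just be up a creek.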

Source

2011-07-07 03:26:09 cwallenpoole

Yep, the HTML pages I'm crawling are malformed. What I did was replace the < > only inside the title value, so it doesn't affect the rest of the HTML =D – nuttynibbles

I had a look at the site; this is a problem because they don't use proper HTML entities in the title:

<title>Some More Google Plus Invite Workarounds < Communication, Social Networking < PC World India News < PC World.in</title>

I think DOMDocument has a problem with that and treats it as the end of the tag. As a workaround, you could add '<' to $specialChars to avoid the problem.
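A rough sketch of that idea, escaping the stray characters only inside <title> before the markup reaches DOMDocument. It assumes the page has a single <title>...</title> pair and that the stray < or > sits between the two tags; the function name extractTitle is invented here, and this is one possible workaround rather than the only one.

```php
<?php
// Hypothetical helper: entity-encode the text between <title> and </title>,
// then let DOMDocument parse the repaired markup.
function extractTitle(string $html): ?string
{
    $fixed = preg_replace_callback(
        '#(<title[^>]*>)(.*?)(</title>)#is',
        function (array $m): string {
            // Decode first so entities already present are not double-encoded.
            $text = html_entity_decode($m[2], ENT_QUOTES, 'UTF-8');
            return $m[1] . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . $m[3];
        },
        $html,
        1 // only the first <title> pair
    ) ?? $html; // fall back to the raw input if the regex engine fails

    // Keep parser complaints out of the output without using @.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($fixed);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // getElementsByTagName returns a DOMNodeList; take the first node's text.
    $nodes = $doc->getElementsByTagName('title');
    return $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
}

$page = '<html><head><title>Some More Google Plus Invite Workarounds < Communication, '
      . 'Social Networking < PC World India News < PC World.in</title>'
      . '<style>#wrapper { background:url(/) no-repeat 50% 0; }</style></head><body></body></html>';
echo extractTitle($page), "\n";
```

Note that the code in the question passes the DOMNodeList itself to str_replace; the string functions need the node's textContent (or nodeValue), as in the last lines of the helper.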

Source

2011-07-07 03:22:51 iHaveacomputer

Added!! For now I'm just using str_replace so that it doesn't break the page. Without it, my site breaks the moment it displays the title – nuttynibbles

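// Fetch the page over a raw socket, then pull the title out with a regex.
$fp = fsockopen("www.domain.com", 80, $errno, $errstr, 30);
if (!$fp) {
    echo "$errstr ($errno)<br />\n";
} else {
    // The request line needs spaces: METHOD SP PATH SP VERSION.
    $out = "GET / HTTP/1.1\r\n";
    $out .= "Host: www.domain.com\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    $buffer = '';
    while (!feof($fp)) {
        $buffer .= fgets($fp, 128);
    }
    fclose($fp);
    // Capture whatever sits between the first two tags that mention "title".
    preg_match('#<.*?title.*?>(.*?)<.*?title.*?>#', $buffer, $matches);
    var_dump($matches);
}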

Source

2011-07-07 03:25:48

Wouldn't that regex also match

Something

Something else

' – cwallenpoole

I'll try your solution and let you know later =D – nuttynibbles

@cwallenpoole I think you're right. The regex probably needs some reworking to avoid false positives like the one above. –
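To make the false-positive concern concrete, here is a small comparison of the loose pattern from the answer with a tighter one anchored on the tag name. The sample markup, the function names looseTitle and strictTitle, and the tighter pattern are all illustrative assumptions, not something posted in this thread.

```php
<?php
// The pattern from the answer: any tag that merely mentions "title" qualifies.
function looseTitle(string $html): ?string
{
    return preg_match('#<.*?title.*?>(.*?)<.*?title.*?>#', $html, $m) ? $m[1] : null;
}

// Assumed tighter variant: the tag name must be exactly "title" (\b), the
// closing tag must be </title>, and the s flag lets the text span lines.
function strictTitle(string $html): ?string
{
    return preg_match('#<title\b[^>]*>(.*?)</title\s*>#is', $html, $m) ? trim($m[1]) : null;
}

// Hypothetical page where a title attribute appears before the real element.
$html = '<html><head><meta title="x"><title>Real title</title></head></html>';

var_dump(looseTitle($html));  // matches from <html ... title="x"> and captures nothing
var_dump(strictTitle($html)); // captures the text of the real <title> element
```

The tighter pattern also tolerates a stray < inside the title, since the lazy group runs up to the first </title>, but it is still a regex over HTML and will miss cases a real parser handles.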
