Strip all tag attributes from an HTML string

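I have a form where users can style their description input with TinyMCE, so my users can insert HTML. I already use strip_tags to strip almost all HTML elements, but users can still enter malicious data, such as this: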

<strong onclick="window.location='http://example.com'">Evil</strong>

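I want to prevent users from doing this by stripping all attributes from all tags, except for the style attribute.

I can only find solutions that strip all attributes, or that strip only specified ones. I just want to keep the style attribute.

I have tried DOMDocument, but it seems to add DOCTYPE and html tags on its own and outputs the result as a whole HTML document. It also sometimes seems to add HTML entities at random, such as an upside-down question mark.

Here is my DOMDocument implementation: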

//Example "evil" input 
$description = "<p><strong onclick=\"alert('evil');\">Evil</strong></p>"; 

//Strip all tags from description except these 
$description = strip_tags($description, '<p><br><a><b><i><u><strong><em><span><sup><sub>'); 

//Strip attributes from tags (to prevent inline Javascript) 
$dom = new DOMDocument(); 
$dom->loadHTML($description); 
foreach($dom->getElementsByTagName('*') as $element)
{
    //Attributes cannot be removed directly because DOMNamedNodeMap implements Traversable incorrectly
    //Attributes are first saved to an array and then looped over later
    $attributes_to_remove = array();
    foreach($element->attributes as $name => $value)
    {
        if($name != 'style')
        {
            $attributes_to_remove[] = $name;
        }
    }

    //Loop over saved attributes and remove them
    foreach($attributes_to_remove as $attribute)
    {
        $element->removeAttribute($attribute);
    }
}
echo $dom->saveHTML();


2015-10-20 Hugo Zink

DOMDocument::loadHTML() has two options that solve this problem.

$dom->loadHTML($description, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

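But they are only available in libxml >= 2.7.8. If you have an older version, you can try a different approach:

If you know you are expecting a fragment, you can load it as usual and save only the children of the body element.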

$description = <<<'HTML' 
<strong onclick="alert('evil');" style="text-align:center;">Evil</strong> 
HTML; 

$dom = new DOMDocument();
$dom->loadHTML($description);
foreach($dom->getElementsByTagName('*') as $element) {
    // Copy the attribute map into an array keyed by attribute name
    $attributes_to_remove = iterator_to_array($element->attributes);
    // Keep the style attribute
    unset($attributes_to_remove['style']);
    foreach($attributes_to_remove as $attribute => $value) {
        $element->removeAttribute($attribute);
    }
}
// Output only the children of body, without the added html and body wrappers
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $node) {
    echo $dom->saveHTML($node);
}

Output:

<strong style="text-align:center;">Evil</strong>
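Where libxml is recent enough, the two flags mentioned above can be combined with the same attribute-stripping loop. The following is only a sketch: the function name `strip_attributes_except_style` and the temporary `<div>` wrapper are assumptions made for illustration, not part of the original answer. The wrapper gives the fragment a single root element, so that input with several top-level tags still parses under one node.

```php
<?php
// Hypothetical helper (not from the thread): strips every attribute except
// `style` from an HTML fragment and returns the fragment without wrappers.
function strip_attributes_except_style(string $html): string
{
    $dom = new DOMDocument();
    // Assumption: libxml >= 2.7.8, so both flags exist. The temporary <div>
    // gives the fragment a single root element to parse under.
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    $root = $dom->documentElement; // the temporary <div>

    foreach ($root->getElementsByTagName('*') as $element) {
        // Copy the live attribute map first, then remove from the copy.
        foreach (iterator_to_array($element->attributes) as $name => $attribute) {
            if ($name !== 'style') {
                $element->removeAttribute($name);
            }
        }
    }

    // Serialize only the children of the wrapper, dropping the <div> itself.
    $result = '';
    foreach ($root->childNodes as $node) {
        $result .= $dom->saveHTML($node);
    }
    return $result;
}

echo strip_attributes_except_style(
    '<strong onclick="alert(\'evil\');" style="text-align:center;">Evil</strong>'
);
```

Because no html or body elements are implied, the return value should contain only the tags that were in the input.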


2015-10-20 13:18:05 ThW

I don't know whether this is more or less what you meant to do...

$description = "<p><strong onclick=\"alert('evil');\">Evil</strong></p>"; 
$description = strip_tags($description, '<p><br><a><b><i><u><strong><em><span><sup><sub>'); 

$dom=new DOMDocument; 
$dom->loadHTML($description); 
$tags=$dom->getElementsByTagName('*'); 

foreach($tags as $tag){
    if($tag->hasAttributes()){
        // Copy the live attribute map first; removing while iterating it can skip attributes
        $attributes=iterator_to_array($tag->attributes);
        foreach($attributes as $name => $attrib) $tag->removeAttribute($name);
    }
}
echo $dom->saveHTML();
/* Will echo out `Evil` in bold but without the `onclick` */


2015-10-20 08:19:02 RamRaider

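This is almost identical to the code I posted earlier. My code (and yours) inserts HTML entities as well as 'html' and 'body' tags, which is exactly what I am trying to prevent. I need a solution that does not use DOMDocument and does not try to "fix" the HTML (because the HTML is not a whole document). –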

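To be fair, I ran this code inside an existing page and saw no problems. When I ran it "as is", with no existing html markup around it, it did add all of the HTML tags by itself, just as you said. – RamRaider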
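The stray entities and upside-down question marks mentioned in the question are not addressed in the thread. One likely cause (an assumption, not confirmed by the asker) is that loadHTML() falls back to a single-byte encoding when the input declares no charset, which garbles UTF-8 text. A minimal sketch of a common workaround is to prepend an encoding declaration before parsing. The function name `strip_attributes_utf8` is hypothetical, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical helper; assumes $html is a UTF-8 encoded fragment.
function strip_attributes_utf8(string $html): string
{
    $dom = new DOMDocument();
    // The encoding declaration is a widely used workaround (not proposed in
    // the thread) that tells the parser to read the input as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    foreach ($dom->getElementsByTagName('*') as $element) {
        // Copy the attribute map before removing anything from it.
        foreach (iterator_to_array($element->attributes) as $name => $attribute) {
            if ($name !== 'style') {
                $element->removeAttribute($name);
            }
        }
    }

    // Return only the children of body, as in the fragment approach above.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $result = '';
    foreach ($body->childNodes as $node) {
        $result .= $dom->saveHTML($node);
    }
    return $result;
}

echo strip_attributes_utf8('<p onclick="x()">¿Qué?</p>');
```

Depending on the PHP and libxml versions, non-ASCII characters may come back either as raw UTF-8 or as named entities; both render the same in a browser.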
