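How can I be 100% sure of removing JS inside HTML tags?

I need to save some data together with some HTML tags, so I can't run strip_tags over the whole text, and I can't use htmlentities, because the text must still be formatted by the tags. To protect other users against XSS, I have to remove any JavaScript from inside the tags. How can I be 100% sure of removing JS inside HTML tags?

What is the best way to do this?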


2013-04-09 BASILIO

http://stackoverflow.com/questions/1886740/php-remove-javascript – Michal 2013-04-09 16:07:28

If you are looking to do the filtering with JavaScript, a similar question was asked at http://stackoverflow.com/questions/295566/sanitize-rewrite-html-on-the-client-side. – KernelPanik 2013-04-09 16:10:55

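If you need to save HTML tags in the database and later print them back to the browser, there is no 100% secure way to do that using only built-in PHP functions. It is easy when there are no HTML tags: when you only have text, you can use the built-in PHP functions to clean it.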
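For the text-only case, a minimal sketch could look like this. The helper name escapeText is hypothetical and UTF-8 input is an assumption; htmlspecialchars itself is the PHP built-in.

```php
<?php
// Hypothetical helper (not from the answer): escape plain text for HTML output.
function escapeText(string $text): string
{
    // ENT_QUOTES escapes both single and double quotes; UTF-8 is assumed.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Prints: &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
echo escapeText('<script>alert("xss")</script>');
```

This only works because every tag is turned into inert text, which is exactly what cannot be done when some tags have to survive.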

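There are some functions that clean XSS out of text, but they are not 100% secure, and there is always some way for XSS to slip through unnoticed. Your regex example is fine, but what if I use, say, < script>alert('xss')</script>, or some other combination that the regex misses and the browser still executes?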
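To make the point concrete, here is a small sketch of how a regex filter can miss things. The function naiveStripScripts and its pattern are made up for illustration; they are not the regex from the question.

```php
<?php
// Hypothetical naive filter (not from the thread): strip literal <script> blocks with a regex.
function naiveStripScripts(string $html): string
{
    return preg_replace('@<script>.*?</script>@is', '', $html);
}

// The exact form the pattern expects is removed.
$caught = naiveStripScripts('<p>Hi<script>alert(1)</script></p>');

// A script tag with an attribute no longer matches the literal "<script>".
$missedTag = naiveStripScripts('<p>Hi<script type="text/javascript">alert(1)</script></p>');

// An event handler attribute needs no script tag at all.
$missedAttr = naiveStripScripts('<p>Hi<img src="x" onerror="alert(1)"></p>');

var_dump($caught, $missedTag, $missedAttr);
```

Only the first input comes back clean; the other two pass through unchanged. Tightening the pattern fixes these two cases but not the next variation someone thinks of.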

The best way to do this is to use something like HTML Purifier.
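A rough sketch of what that could look like, assuming HTML Purifier is installed (for example through the Composer package ezyang/htmlpurifier) and its classes are already autoloadable. The wrapper name purifyUserHtml and the particular whitelist are assumptions for illustration.

```php
<?php
// Assumes HTML Purifier's classes are autoloadable,
// for example after: require 'vendor/autoload.php';

// Hypothetical wrapper; the HTMLPurifier classes and directives are the library's own.
function purifyUserHtml(string $dirty): string
{
    $config = HTMLPurifier_Config::createDefault();
    // Example whitelist: a few formatting tags plus links and images.
    $config->set('HTML.Allowed', 'p,b,i,u,a[href],img[src|alt]');
    // Restrict URLs to http and https.
    $config->set('URI.AllowedSchemes', array('http' => true, 'https' => true));
    // Disable the on-disk definition cache so the sketch needs no writable directory.
    $config->set('Cache.DefinitionImpl', null);

    $purifier = new HTMLPurifier($config);
    return $purifier->purify($dirty);
}
```

In a real application you would probably keep the cache enabled and build the purifier once rather than on every call.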

Also note that there are two levels of security: first when things go into your database, and second when they come back out of it.

Hope this helps!


2013-04-09 16:13:06 Matija

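There are 100% secure ways to do it: use an HTML parser (a real parser, not a regex-based one) together with a whitelist of tags and attributes. All Stack Exchange sites do this. – zneak 2013-04-09 16:15:58

Didn't I link HTML Purifier in my answer? :) What I said is that it is not 100% secure to use PHP's built-in functions or regular expressions. – Matija 2013-04-09 16:17:56

I was mainly addressing the first paragraph of your answer. – zneak 2013-04-09 16:19:10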

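I suggest you use DOMDocument (with loadHTML) to load said HTML, remove every kind of tag and every attribute you don't want to see, and save the result back to HTML (using saveXML or saveHTML). You can do this by recursively iterating over the children of the document root and replacing unwanted tags with their inner contents. Since loadHTML loads code in a way similar to how browsers do, it is much safer than using regular expressions.

EDIT: Here is a "purify" function I made: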

<?php

function purifyNode($node, $whitelist)
{
    $children = array();
    // copy childNodes since we're going to iterate over it and modify the collection
    foreach ($node->childNodes as $child) {
        $children[] = $child;
    }

    foreach ($children as $child) {
        if ($child->nodeType == XML_ELEMENT_NODE) {
            purifyNode($child, $whitelist);
            if (!isset($whitelist[strtolower($child->nodeName)])) {
                // tag not whitelisted: move its children up, then drop the tag itself
                while ($child->childNodes->length > 0) {
                    $node->insertBefore($child->firstChild, $child);
                }

                $node->removeChild($child);
            } else {
                $attributes = $whitelist[strtolower($child->nodeName)];
                // copy attributes since we're going to iterate over it and modify the collection
                $childAttributes = array();
                foreach ($child->attributes as $attribute) {
                    $childAttributes[] = $attribute;
                }

                foreach ($childAttributes as $attribute) {
                    if (!isset($attributes[$attribute->name]) || !preg_match($attributes[$attribute->name], $attribute->value)) {
                        $child->removeAttribute($attribute->name);
                    }
                }
            }
        }
    }
}

function purifyHTML($html, $whitelist)
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // make sure <html> doesn't have any attributes
    while ($doc->documentElement->hasAttributes()) {
        $doc->documentElement->removeAttributeNode($doc->documentElement->attributes->item(0));
    }

    purifyNode($doc->documentElement, $whitelist);
    $html = $doc->saveHTML();
    $fragmentStart = strpos($html, '<html>') + 6; // 6 is the length of <html>
    return substr($html, $fragmentStart, -8); // 8 is the length of </html> + 1
}

?>

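You call purifyHTML with an unsafe HTML string and a predefined whitelist of tags and attributes. The whitelist format is 'tag' => array('attribute' => 'regex'). Tags that are not in the whitelist are stripped, and their contents are inlined into the parent tag. Attributes that are not listed for a given tag are removed as well, and so are attributes that are listed but whose value does not match the regex.

Here is an example: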

<?php 

$html = <<<HTML 
<p>This is a paragraph.</p> 
<p onclick="alert('xss')">This is an evil paragraph.</p> 
<p><a href="javascript:evil()">Evil link</a></p> 
<p><script>evil()</script></p> 
<p>This is an evil image: <img src="error.png" onerror="evil()"/></p> 
<p>This is nice <b>bold text</b>.</p> 
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p> 
HTML; 

// whitelist format: tag => array(attribute => regex) 
$whitelist = array(
    'b' => array(), 
    'i' => array(), 
    'u' => array(), 
    'p' => array(), 
    'img' => array('src' => '@\Ahttp://.+\Z@', 'alt' => '@.*@'), 
    'a' => array('href' => '@\Ahttp://.+\Z@') 
); 

$purified = purifyHTML($html, $whitelist); 
echo $purified; 

?>

The result is:

<p>This is a paragraph.</p> 
<p>This is an evil paragraph.</p> 
<p><a>Evil link</a></p> 
<p>evil()</p> 
<p>This is an evil image: <img></p> 
<p>This is nice <b>bold text</b>.</p> 
<p>This is a nice image: <img src="http://example.org/image.png" alt="Nice image"></p>

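Obviously, you don't want to allow any on* attributes, and I would advise against allowing style because of weird proprietary properties such as behavior. Make sure every URL attribute is validated with a decent regex that matches the full string (\Aregex\Z).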
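A quick sketch of why the anchors matter. The two patterns and the sample URLs below are illustrative assumptions, not part of the whitelist above.

```php
<?php
// Illustrative patterns: the same URL check with and without anchors.
$unanchored = '@http://.+@';
$anchored   = '@\Ahttp://.+\Z@';

// A javascript: URL that merely contains an http:// URL somewhere inside it.
$evil = 'javascript:evil()//http://example.org/';
$good = 'http://example.org/image.png';

var_dump(preg_match($unanchored, $evil)); // int(1): matched in the middle of the string
var_dump(preg_match($anchored, $evil));   // int(0): the string does not start with http://
var_dump(preg_match($anchored, $good));   // int(1)
```

Note that \Z still accepts a single trailing newline; \z is the stricter end-of-string anchor if that matters for your input.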


2013-04-09 16:06:25 zneak

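Can it handle fragments of HTML, or will it try to create a full document, <html> tags and all? – cHao 2013-04-09 16:16:10

@cHao, it will try to create a full document, but you only need to iterate over what is inside <body>. Also, if you use the recursive approach and don't whitelist html and body, it should work just as if it were a fragment. – zneak 2013-04-09 16:22:02

I bet I can break this. – Hogan 2013-04-09 18:46:28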
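On the fragment question raised in the comments, one possible alternative to cutting the wrapper off by string offsets is to serialize only the children of the body element. This is a sketch under assumptions: the helper name bodyInnerHtml is hypothetical, and encoding handling is left out.

```php
<?php
// Hypothetical helper: parse an HTML fragment and return only what ends up inside <body>.
function bodyInnerHtml(string $fragment): string
{
    if (trim($fragment) === '') {
        return ''; // nothing to parse; loadHTML does not accept empty input
    }

    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // Serialize each child on its own, which leaves out the doctype, <html> and <body>.
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

echo bodyInnerHtml('<p>Hello <b>world</b></p>');
```

The same loop could run after the tree has been cleaned, so the whitelist logic does not need to know about the wrapper elements.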

If you want to allow specific tags, you have to parse the HTML.

There is already a good library for that purpose: HTML Purifier (open source under the LGPL).


2013-04-09 16:06:58 ComFreek

I wrote this code; you can pass it a list of tags and a list of attributes to remove:

function RemoveTagAttribute($Dom, $Name)
{
    $finder = new DomXPath($Dom);
    if (!is_array($Name)) {
        $Name = array($Name);
    }
    foreach ($Name as $Attribute) {
        $Attribute = strtolower($Attribute);
        // query again after each pass until no element carries the attribute
        do {
            $tag = $finder->query("//*[@" . $Attribute . "]");
            foreach ($tag as $T) {
                if ($T->hasAttribute($Attribute)) {
                    $T->removeAttribute($Attribute);
                }
            }
        } while ($tag->length > 0);
    }
    return $Dom;
}

function RemoveTag($Dom, $Name)
{
    if (!is_array($Name)) {
        $Name = array($Name);
    }
    foreach ($Name as $tagName) {
        $tagName = strtolower($tagName);
        // the node list is live, so repeat until every matching element is gone
        do {
            $tag = $Dom->getElementsByTagName($tagName);
            foreach ($tag as $T) {
                $T->parentNode->removeChild($T);
            }
        } while ($tag->length > 0);
    }
    return $Dom;
}

For example:

// $HTML is assumed to hold the untrusted markup
$dom = new DOMDocument;
$HTML = str_replace("&", "&amp;", $HTML); // disguise &s going IN to loadHTML()
// $dom->substituteEntities = true; // collapse &s going OUT to transformToXML()
$dom->recover = TRUE;
@$dom->loadHTML('<?xml encoding="UTF-8">' . $HTML);
// dirty fix
foreach ($dom->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $dom->removeChild($item); // remove hack
    }
}
$dom->encoding = 'UTF-8'; // insert proper
$dom = RemoveTag($dom, "script");
$dom = RemoveTagAttribute($dom, array("onmousedown", "onclick"));
echo $dom->saveHTML();
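The example above names two event attributes explicitly, which leaves every other on* handler in place. A possible variation, sketched here with a hypothetical helper name, removes all attributes whose name starts with "on" in one XPath query. It does nothing about javascript: URLs or other vectors.

```php
<?php
// Hypothetical helper: remove every attribute whose name starts with "on".
function removeEventHandlerAttributes(DOMDocument $dom): void
{
    $xpath = new DOMXPath($dom);
    // Copy the matches into an array before modifying the tree.
    $attributes = iterator_to_array($xpath->query("//@*[starts-with(name(), 'on')]"));
    foreach ($attributes as $attribute) {
        $attribute->ownerElement->removeAttributeNode($attribute);
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<p onclick="a()" onmouseover="b()" class="note">Hi</p>');
removeEventHandlerAttributes($dom);
echo $dom->saveHTML();
```

This is still a blacklist, so the whitelist approach from the other answers remains the safer default.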


2013-04-09 17:31:57
