How do I strip all HTML tags in JavaScript, with exceptions?

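I've been beating my head against this for the longest time now, and I'm hoping someone can help. Basically, I have a WYSIWYG field where users can enter formatted text. But of course they will copy and paste from Word, the web, etc., so I have a JS function that catches pasted input. I already have a function that strips ALL formatting from the text, but I'd like it to keep tags such as p and br so the result isn't just one big mess.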

Any regex ninjas out there? Here is what I have so far, and it works. It just needs to allow some tags.

o.node.innerHTML=o.node.innerHTML.replace(/(<([^>]+)>)/ig,"");


2010-03-06 Code Monkey

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – waiwai933 2010-03-06 16:33:25

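The browser already has a perfectly good parsed HTML tree in o.node. Serialising the document content to HTML (using innerHTML), trying to hack it about with a regex (which cannot parse HTML reliably), and then re-parsing the result back into document content by setting innerHTML... is a bit perverse, really.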

Instead, inspect the element and attribute nodes you already have inside o.node and remove the ones you don't want, e.g.:

filterNodes(o.node, {p: [], br: [], a: ['href']});

Defined as:

// Remove elements and attributes that do not meet a whitelist lookup of lowercase element
// name to list of lowercase attribute names.
//
function filterNodes(element, allow) {
    // Recurse into child elements
    //
    Array.fromList(element.childNodes).forEach(function(child) {
        if (child.nodeType===1) {
            filterNodes(child, allow);

            var tag= child.tagName.toLowerCase();
            // Own-property check, so that a pasted <constructor> element is not
            // mistaken for a whitelisted tag via Object.prototype.
            if (Object.prototype.hasOwnProperty.call(allow, tag)) {

                // Remove unwanted attributes
                //
                Array.fromList(child.attributes).forEach(function(attr) {
                    if (allow[tag].indexOf(attr.name.toLowerCase())===-1)
                        child.removeAttributeNode(attr);
                });

            } else {

                // Replace unwanted elements with their contents
                //
                while (child.firstChild)
                    element.insertBefore(child.firstChild, child);
                element.removeChild(child);
            }
        }
    });
}

// ECMAScript Fifth Edition (and JavaScript 1.6) array methods used by `filterNodes`. 
// Because not all browsers have these natively yet, bodge in support if missing. 
// 
if (!('indexOf' in Array.prototype)) {
    Array.prototype.indexOf= function(find, ix /*opt*/) {
        for (var i= ix || 0, n= this.length; i<n; i++)
            if (i in this && this[i]===find)
                return i;
        return -1;
    };
}
if (!('forEach' in Array.prototype)) {
    Array.prototype.forEach= function(action, that /*opt*/) {
        for (var i= 0, n= this.length; i<n; i++)
            if (i in this)
                action.call(that, this[i], i, this);
    };
}

// Utility function used by filterNodes. This is really just `Array.prototype.slice()` 
// except that the ECMAScript standard doesn't guarantee we're allowed to call that on 
// a host object like a DOM NodeList, boo. 
// 
Array.fromList= function(list) {
    var array= new Array(list.length);
    for (var i= 0, n= list.length; i<n; i++)
        array[i]= list[i];
    return array;
};


2010-03-06 16:29:46 bobince

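Great function, and a clever approach! Works like a charm. The only thing left over (sometimes) is <!-- junk -->. I'm guessing that's because they aren't nodes. Is there any way to get rid of those? If not, it's still great! – 2010-03-06 17:25:40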

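They're comment nodes. You can get rid of them with '... else if (child.nodeType===8) { element.removeChild(child); }' ('8' being 'COMMENT_NODE' and '1' being 'ELEMENT_NODE', although IE doesn't give you the constant names, so you have to use the numbers). – bobince 2010-03-06 20:14:49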
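Spelled out as a standalone helper, the comment-node check from the comment above might look like the sketch below. The name `removeComments` is made up for illustration and is not part of the original answer; the sketch assumes only the standard `childNodes`, `nodeType` and `removeChild` members.

```javascript
// Hypothetical helper (not from the original answer): removes every
// comment node below `element`, leaving elements and text in place.
function removeComments(element) {
    // Walk backwards so that removing a child does not shift the
    // indexes of the children that are still to be visited.
    for (var i = element.childNodes.length - 1; i >= 0; i--) {
        var child = element.childNodes[i];
        if (child.nodeType === 8) {          // 8 = COMMENT_NODE
            element.removeChild(child);
        } else if (child.nodeType === 1) {   // 1 = ELEMENT_NODE
            removeComments(child);           // recurse into nested elements
        }
    }
}
```

In the asker's setup it would presumably be called as `removeComments(o.node)` before or after `filterNodes`. The same branch could instead be folded into `filterNodes` itself, as the comment suggests.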

Is the code above safe against XSS (see http://stackoverflow.com/questions/18370188/securely-strip-html-tags-in-javascript-with-whitelist)? – 2013-08-22 10:49:03

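First, I'm not sure that a regex is the right tool for this. A user might enter invalid HTML (forgetting a > or putting a > inside an attribute), and then the regex will fail. I don't know, though, whether a parser would be any better or more bulletproof.

Second, your regex contains some unnecessary parentheses.

Third, you can use a lookahead to exclude certain tags: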

o.node.innerHTML=o.node.innerHTML.replace(/<(?!\s*\/?(br|p)\b)[^>]+>/ig,"");

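Explanation:

< matches the opening angle bracket.

(?!\s*\/?(br|p)\b) asserts that it is not possible to match, at this position, zero or more whitespace characters, zero or one /, and either br or p, followed by a word boundary. The word boundary matters because without it the lookahead would also trigger on tags such as <pre> or <param ...>, and those would be kept.

[^>]+ matches one or more characters that are not a closing angle bracket.

> matches the closing angle bracket.

Note that you may run into trouble if a closing angle bracket appears somewhere inside a tag, for example inside an attribute value.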

So this will match (and strip)

<pre> <a href="dot.com"> </a> </pre>

and leave <p>, </p>, <br> and the like alone.
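To try the pattern outside the editor, here is a small string-only sketch. The function name `stripExceptPAndBr` and the sample strings are invented for illustration; the regular expression is the one from this answer.

```javascript
// Hypothetical helper: strips every tag except <p> and <br> from a string,
// using the negative-lookahead pattern explained above.
function stripExceptPAndBr(html) {
    return html.replace(/<(?!\s*\/?(br|p)\b)[^>]+>/ig, "");
}

// <pre> and <a> are stripped; <p>, </p> and <br /> survive.
var sample = '<pre><a href="dot.com">link</a></pre><p>kept<br />too</p>';
console.log(stripExceptPAndBr(sample));
// logs: link<p>kept<br />too</p>

// The caveat about a ">" inside a tag: the match ends at the first ">",
// so part of the attribute value leaks into the output.
console.log(stripExceptPAndBr('<a title="a>b">x</a>'));
// logs: b">x
```

Note that a kept tag is left completely untouched, attributes included, so this approach does nothing about something like an onclick attribute on a pasted <p>.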


2010-03-06 15:23:13

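Hmm, I just tried it and it still strips everything... What would you suggest instead of a regex? I don't want to find and replace every type of tag. – 2010-03-06 15:27:05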

Sorry, I misread your post the first time ('b' instead of 'br'). Could you try it again? – 2010-03-06 15:32:46

Works perfectly for me! Thanks! :) – podeig 2010-11-05 10:38:40
