Web-Harvest - removing special characters

I'm trying to scrape a page that has some spaces after an anchor:

</a>&nbsp;&nbsp;|&nbsp;&nbsp;

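I can't seem to find a way to specify that text: either I trigger a processor error, or I fail to detect the string at all. Either way the html-to-xml conversion fails, because the XML is not well-formed while those characters are present. So I need to remove everything that follows the closing anchor tag (note that there are div tags and other elements elsewhere in the document).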

My code:

<xpath expression="/"> 
    <regexp replace="true"> 
      <regexp-pattern>(nbsp;)</regexp-pattern> 
       <regexp-source> 
        <html-to-xml omitcomments="true" advancedxmlescape="true" prunetags="head,script,meta,meta ,p,base,br,link,img,image,input,option,nbsp;"> 
         <http url="http://mysite.org/map/aindex/" method="get" /> 
        </html-to-xml> 
       </regexp-source> 
       <regexp-result> 
        <template></template> 
       </regexp-result> 
     </regexp> 
</xpath>

I think my problem is with the regex pattern. I've tried:

  &nbsp; 
    \& nbsp; (without the space in between -- SO doesn't display that correctly)
    \s+\|\s+

and so on. I even tried putting the expression inside a CDATA element, but I couldn't get that to work either.

Any ideas?


2012-10-13 user991945

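This looks like another good example of why regex-based web scraping is flawed. I hope you figure out how to make it work. Here's a fun, classic Stack-O answer: http://stackoverflow.com/a/1732454/564406 – David

For &nbsp; in the regex pattern, you can try using \u00A0.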
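
The suggestion above can be tried outside Web-Harvest first. Below is a minimal sketch in Python (not a Web-Harvest config), under a few assumptions: the function name `remove_nbsp` and the sample markup are hypothetical, modeled on the snippet in the question, and it is not known whether the text reaching the regex still contains the literal entity `&nbsp;` or the already decoded character U+00A0, so the pattern accepts both. The `\u00A0` escape is also valid in Java regular expressions, which Web-Harvest presumably uses given that it is a Java tool (an assumption, not confirmed in this thread).

```python
import re

# Matches one or more non-breaking spaces in either form:
# the literal entity text "&nbsp;" or the decoded character U+00A0.
NBSP = re.compile(r"(?:&nbsp;|\u00A0)+")


def remove_nbsp(text: str) -> str:
    """Return text with every run of non-breaking spaces removed.

    Hypothetical helper for illustration only.
    """
    return NBSP.sub("", text)


if __name__ == "__main__":
    # Sample inputs modeled on the question's snippet (assumed, not real page data).
    entity_form = "<a href='#'>A</a>&nbsp;&nbsp;|&nbsp;&nbsp;<a href='#'>B</a>"
    decoded_form = "<a href='#'>A</a>\u00A0\u00A0|\u00A0\u00A0<a href='#'>B</a>"
    print(remove_nbsp(entity_form))
    print(remove_nbsp(decoded_form))
```

If the decoded form is the one that gets cleaned, that would suggest the entity has already been converted to U+00A0 before the regex runs, which might explain why patterns written against `&nbsp;` never matched.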


2012-12-08 22:21:01 Alexander
