Removing HTML code in R using gsub

I have HTML code in R similar to the following:

"</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA"

I want to use gsub to remove the unwanted HTML code so that the output is:

XXXX YYYY ZZZZ AAAA

I tried <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1> as shown here, but it failed. Why?

How can I do this in R? Thanks.

Source

2011-08-14 lokheart

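It would probably be cleaner to extract the names from the HTML code using the 'XML' library and 'xPath' queries. If you post a link to a web page containing the HTML code, plenty of people can give you pointers on how to extract the information you need. – Ramnath

Be careful... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Iterator

Should this question and the other one be merged? http://stackoverflow.com/questions/7057374/remove-anything-within-a-pair-of-parenthesis-using-gsub-in-r – Iterator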
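
Ramnath's parser suggestion could look roughly like the sketch below. It is an illustration only: it assumes the 'XML' package is installed (the block is skipped otherwise), it relies on the HTML parser tolerating the stray closing tag at the start of the fragment, and the variable names are placeholders.

```r
# The string from the question.
x <- "</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA"

# Assumption: the XML package is available; nothing runs if it is not.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(x, asText = TRUE)                      # lenient HTML parser
  texts <- XML::xpathSApply(doc, "//text()", XML::xmlValue)    # every text node
  texts <- trimws(unlist(texts))                               # strip surrounding spaces
  texts <- texts[nzchar(texts)]                                # drop whitespace-only nodes
  print(paste(texts, collapse = " "))
}
```

Selecting text nodes rather than the a elements keeps the trailing AAAA, which is not wrapped in a link.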

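I suggest you heed the warnings from @Ramnath and @Iterator and use a parser instead, but here is the best I can do with your string and a regex:

(after adding a missing </a> to the end of your input string)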

x <- "</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA</a>"

Code:

x1 <- gsub("<([[:alpha:]][[:alnum:]]*)(.[^>]*)>([.^<]*)", "\\3", x) 
x1 
[1] "</a> XXXX</a> YYYY</a> ZZZ</a> AAAA</a>" 

gsub("</a>", "", x1) 
[1] " XXXX YYYY ZZZ AAAA"
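
If every tag is to be dropped anyway, the two steps above can probably be collapsed into one pattern. This is a minimal sketch, not part of the original answer: it assumes no attribute value contains a literal > and that there are no comments or scripts in the input (a parser remains the safer route). The helper name strip_tags is made up for illustration.

```r
# Hypothetical helper: remove anything that looks like a tag, then tidy spaces.
strip_tags <- function(html) {
  no_tags  <- gsub("<[^>]+>", "", html)            # drop <...> runs, opening or closing
  squeezed <- gsub("[[:space:]]+", " ", no_tags)   # collapse repeated whitespace
  trimws(squeezed)                                 # trim leading/trailing spaces
}

# The string from the question, without the extra closing tag.
x <- "</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050\">YYYY</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\"> <a href=\"category.php?c=100050&brand=Motorola\">ZZZZ</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA"

strip_tags(x)
# [1] "XXXX YYYY ZZZZ AAAA"
```

Collapsing whitespace afterwards matters because each removed tag leaves its surrounding spaces behind.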

Source

2011-08-14 18:28:25 Andrie

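No 'perl = TRUE'? I always feel like I'm living dangerously if I don't use it in my R regex functions. – Iterator

Sadly, I'm not of the perl generation, so I always use 'perl = FALSE'. Personal preference, I guess... – Andrie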
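
As a side note on this exchange, the pattern from the question can be tried with 'perl = TRUE' to see where it goes wrong. The sketch below is an illustration, not something posted in the thread. It suggests the likely culprit is the case-sensitive class [A-Z], since the tags in the string are lowercase, and that even with case ignored the pattern only handles paired tags. The shortened input and the names pat, x and y are placeholders.

```r
# Pattern from the question, with backslashes doubled for an R string literal.
pat <- "<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>"

# Shortened input: one paired <a> tag and one unpaired <img> tag.
x <- "<a href=\"group.php?g=1\">XXXX</a> <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA"

# [A-Z] is case-sensitive, so the lowercase <a> and <img> never match.
grepl(pat, x, perl = TRUE)
# [1] FALSE

# With ignore.case = TRUE the paired <a>...</a> is replaced by its contents;
# the <img> tag stays because the pattern requires a matching closing tag.
y <- gsub(pat, "\\2", x, perl = TRUE, ignore.case = TRUE)
y
# [1] "XXXX <img src=\"images/arrow_orange.gif\" width=\"8\" height=\"12\">AAAA"
```

So switching engines alone does not rescue the original pattern; unpaired tags such as img need a pattern that does not depend on a closing tag.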
