How can I extract URLs that contain dashes from HTML using R?

I have some HTML that looks like this:

<ul><li><a href="http://www.website.com/index.aspx" target="_blank">Website</a></li> 
<li><a href="http://website.com/index.html" target="_blank">Website</a></li> 
<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li> 
<li><a href="http://website2.org/index.htm" target="_blank">Website 2</a></li> 
<li><a href="http://www.another-site.com/">Another Site</a></li>

Using

m <- regexpr("http://\\S*/?", links, perl = TRUE)
links <- regmatches(links, m)

gets me my links, except that the ones with dashes in them are truncated, like this:

http://www.website.com/index.aspx 
http://website.com/index.html 
http://www.website 
http://website2.org/index.htm 
http://www.another-site.com/

I thought \S matches any non-whitespace character. What is going on here?

Source

2013-08-22 William Gunn

I can't reproduce your problem. If I replace each " with \" so that I can import the text with readLines, everything works as you intend. – thelatemail
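The comment above can be checked with a small base R sketch. It assumes the HTML is held as a character vector named links, one element per line, as in the question. The second pattern, [^"\s<>]+, is not from the thread; it is an illustrative alternative that stops at the closing quote instead of at the next whitespace.

```r
# The sample HTML from the question, one string per line.
links <- c(
  '<ul><li><a href="http://www.website.com/index.aspx" target="_blank">Website</a></li>',
  '<li><a href="http://website.com/index.html" target="_blank">Website</a></li>',
  '<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li>',
  '<li><a href="http://website2.org/index.htm" target="_blank">Website 2</a></li>',
  '<li><a href="http://www.another-site.com/">Another Site</a></li>'
)

# Pattern from the question: \S* runs up to the next whitespace, so dashes are
# kept, but the closing quote of the href attribute is captured as well.
m <- regexpr("http://\\S*/?", links, perl = TRUE)
greedy <- regmatches(links, m)

# Illustrative alternative (an assumption, not the asker's code):
# stop at a double quote, whitespace, or an angle bracket.
m2 <- regexpr("http://[^\"\\s<>]+", links, perl = TRUE)
urls <- regmatches(links, m2)

print(greedy)
print(urls)
```

On this input the dash is not what ends the match, which is consistent with the truncation not being reproducible. If the real data behaves differently, the hyphens in it may be a different character (for example a Unicode dash), which is worth checking before blaming the pattern.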

Use XML::getHTMLLinks

For example:

library(XML) 
# assuming your HTML document is 'foo.html'

getHTMLLinks(doc = 'foo.html') 
# [1] "http://www.website.com/index.aspx" "http://website.com/index.html"  "http://www.website-with-dashes.org" 
# [4] "http://website2.org/index.htm"  "http://www.another-site.com/"

Using regular expressions to parse HTML is not necessarily straightforward. https://stackoverflow.com/a/1732454/1385941 is an interesting read.
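If the HTML is already in memory rather than in a file, a sketch along these lines may work. It assumes the XML package is installed; the variable name html_text and the added closing </ul> are illustrative and not part of the original post.

```r
library(XML)

# Hypothetical in-memory copy of the question's HTML (closing </ul> added).
html_text <- paste0(
  '<ul><li><a href="http://www.website.com/index.aspx" target="_blank">Website</a></li>',
  '<li><a href="http://website.com/index.html" target="_blank">Website</a></li>',
  '<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li>',
  '<li><a href="http://website2.org/index.htm" target="_blank">Website 2</a></li>',
  '<li><a href="http://www.another-site.com/">Another Site</a></li>',
  '</ul>'
)

# asText = TRUE makes htmlParse treat the string as content, not as a file name.
doc <- htmlParse(html_text, asText = TRUE)

# Collect the href attribute of every <a> element in the parsed tree.
urls <- getHTMLLinks(doc)
print(urls)
```

Because the parser reads the attribute values directly, quoting and dashes inside the URLs need no special handling.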

Source

2013-08-22 06:08:29 mnel

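Yes, I had read that, but I thought my application was simple enough that I could get away with it. This answer doesn't solve my exact problem, but it points me toward a different, and probably better, way of solving it.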
