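Extracting a URL and link title in R
I am having trouble extracting a specific piece of text from a website's source code. I can extract the entire list, but I only need one country, for example Argentina.
The source code is: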
<div class="article-content">
<div class="RichTextElement">
<div><h3 style="background-color: transparent; color: rgb(51, 51, 51);"><span style="font-weight: normal; font-family: Verdana;">Afghanistan - </span><span style="background-color: transparent; font-weight: normal; font-family: Verdana;"><a title="Tax Authority in Afganistan" href="http://mof.gov.af/en" style="background-color: transparent; color: rgb(51, 51, 51);">Ministry of Finance</a><br />Argentina - <a title="Tax Authority in Argentina" href="http://www.afip.gob.ar/english/" style="background-color: transparent; color: rgb(51, 51, 51);">Federal Administration of Public Revenues</a><br />
I only need "Federal Administration of Public Revenues" and "http://www.afip.gob.ar/english/".
So far, I have:
argurl <- readLines("http://oceantax.co.uk/links/tax-authorities-worldwide.html")
strong <- as.matrix(grep("<br />",argurl))
strong1starts <- grep("<br />Argentina",argurl)
rowst1st <- which(grepl(strong1starts, strong))
strong1ends <- strong[rowst1st + 1 ,]-1
data1 <- as.matrix(argurl[strong1starts:strong1ends])
[Don't use regular expressions to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags): instead, take a look at the [rvest](https://github.com/hadley/rvest) package for parsing HTML in R – 2015-02-24 18:34:39
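As a rough illustration of what the comment suggests, here is a minimal sketch using rvest. It assumes the link can be identified by the `title` attribute shown in the question's HTML ("Tax Authority in Argentina"). The fragment is pasted inline (with the style attributes trimmed) so the example runs without network access, and the variable names `html`, `page`, `link`, and `result` are made up for illustration. Whether the live page still has this structure is not something the snippet checks.

```r
library(rvest)  # provides html_node(), html_text(), html_attr(); re-exports read_html()

# Inline copy of the fragment from the question (style attributes trimmed).
html <- '<div class="article-content"><div class="RichTextElement"><div><h3>
<span>Afghanistan - </span><span>
<a title="Tax Authority in Afganistan" href="http://mof.gov.af/en">Ministry of Finance</a><br />
Argentina - <a title="Tax Authority in Argentina" href="http://www.afip.gob.ar/english/">Federal Administration of Public Revenues</a><br />
</span></h3></div></div></div>'

# Parse the HTML string into a document tree.
page <- read_html(html)

# Assumption: the Argentina link is the <a> whose title attribute has this exact value.
link <- html_node(page, 'a[title="Tax Authority in Argentina"]')

# Pull out the visible link text and the href attribute.
result <- c(name = html_text(link), url = html_attr(link, "href"))
print(result)
```

To try this against the real page, `read_html()` could be given the URL from the question instead of the inline string; a looser selector such as `a[title*="Argentina"]` may be more forgiving if the title text differs slightly.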