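R: XPath expression returns links outside the selected element

I am using R to scrape links from the main table on that page, using XPath syntax. The main table is the third table on the page, and I only want the links that point to magazine articles.

My code follows: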

require(XML) 
(x = htmlParse("http://www.numerama.com/magazine/recherche/125/hadopi/date")) 
(y = xpathApply(x, "//table")[[3]]) 
(z = xpathApply(y, "//table//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")) 
(links = unique(z))

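If you look at the output, the final links do not come from the main table but from the sidebar, even though I selected the main table in my third line by making object y contain only the third table.

What am I doing wrong? What is the correct/more efficient way to code this with XPath?

Note: XPath novice writing.

Answered (really quickly), thanks a lot! My solution is below.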

extract <- function(x) { 
    message(x) 
    html = htmlParse(paste0("http://www.numerama.com/magazine/recherche/", x, "/hadopi/date")) 
    html = xpathApply(html, "//table")[[3]] 
    html = xpathApply(html, ".//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href") 
    html = gsub("#ac_newscomment", "", html) 
    html = unique(html) 
} 

d = lapply(1:125, extract) 
d = unlist(d) 
write.table(d, "numerama.hadopi.news.txt", row.names = FALSE)

This saves all the links to news items with the keyword "Hadopi" on this website.
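The clean-up step inside extract() can be tried offline. This is only a sketch on made-up hrefs (the vector below is hypothetical, not real output from the site), showing how removing the "#ac_newscomment" fragment lets unique() collapse the article link and its comment link into one entry:

```r
# Hypothetical hrefs, shaped like what the XPath query might return.
hrefs <- c(
  "/magazine/1-first-article.html",
  "/magazine/1-first-article.html#ac_newscomment",
  "/magazine/2-second-article.html"
)

# fixed = TRUE treats the pattern as a literal string, not a regex.
clean <- unique(sub("#ac_newscomment", "", hrefs, fixed = TRUE))
print(clean)
```

Without the sub() call, unique() would keep both variants of the first link, because the two strings differ by the fragment.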

Source

2013-05-18 Fr.

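You need to start the pattern with . if you want to restrict the search to the current node. A leading / goes back to the start of the document (even if the root node is not in y).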

xpathSApply(y, ".//a/@href")
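To see the difference without hitting the site, here is a minimal sketch on a made-up two-table document (the HTML and the names doc, main, absolute and relative are assumptions for illustration only). It runs the same query on one table node, once with a leading // and once with .//:

```r
library(XML)

# Hypothetical miniature page: a "main" table and a sidebar table.
doc <- htmlParse(
  '<html><body>
     <table id="main"><tr><td><a href="/magazine/1">one</a></td></tr></table>
     <table id="side"><tr><td><a href="/magazine/2">two</a></td></tr></table>
   </body></html>',
  asText = TRUE
)

# Pick the first table node, as in the question's second step.
main <- xpathApply(doc, "//table")[[1]]

# "//a" starts from the document root, even though we pass a node.
absolute <- as.character(xpathSApply(main, "//a/@href"))

# ".//a" starts from the context node, so only its descendants match.
relative <- as.character(xpathSApply(main, ".//a/@href"))

print(absolute)
print(relative)
```

The absolute form should pick up the sidebar link as well, which is the symptom described in the question; the relative form stays inside the selected table.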

Alternatively, you can extract the third table directly with XPath:

xpathApply(x, "//table[3]//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
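One caveat that may be worth checking on the real page: //table[3] selects every table that is the third table child of its own parent, while (//table)[3] selects the third table in document order, which is what xpathApply(x, "//table")[[3]] picks. The two agree when the tables are siblings, but not otherwise. A small sketch on a hypothetical document (not the Numerama markup) where they differ:

```r
library(XML)

# Hypothetical layout: three tables, but no parent holds three of them.
doc <- htmlParse(
  '<html><body>
     <div>
       <table id="t1"><tr><td><a href="/magazine/a">a</a></td></tr></table>
       <table id="t2"><tr><td><a href="/magazine/recherche/b">b</a></td></tr></table>
     </div>
     <div>
       <table id="t3"><tr><td><a href="/magazine/c">c</a></td></tr></table>
     </div>
   </body></html>',
  asText = TRUE
)

# Third table child of some parent: nothing matches in this layout.
per_parent <- xpathSApply(doc, "//table[3]", xmlGetAttr, "id")

# Third table in document order: the parentheses apply [3] to the whole set.
in_doc_order <- xpathSApply(doc, "(//table)[3]", xmlGetAttr, "id")

# Same href filter as above, applied to the third table in document order.
links <- as.character(xpathSApply(
  doc,
  "(//table)[3]//a[contains(@href, '/magazine/') and not(contains(@href, '/recherche/'))]/@href"
))

print(length(per_parent))
print(in_doc_order)
print(links)
```

If the tables on the target page are all children of the same element, both forms return the same table and the shorter one is fine.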

Source

2013-05-18 20:19:12

This works; question edited to reflect the answer. Thanks!
