Scraping hierarchical data

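I am trying to scrape the list of department stores by continent/country from the global list of department stores. I am running the following code to get the continents first because, as we can see, the XML hierarchy is laid out so that the countries under each continent are not child nodes of that continent.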

> library(XML)
> url <- "http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
> doc <- htmlTreeParse(url, useInternalNodes = TRUE)
> nodeNames = getNodeSet(doc, "//h2/span[@class='mw-headline']") 
> # For Africa 
> xmlChildren(nodeNames[[1]]) 
$a 
<a href="/wiki/Africa" title="Africa">Africa</a> 

attr(,"class") 
[1] "XMLInternalNodeList" "XMLNodeList"   
> xmlSize(nodeNames[[1]]) 
[1] 1

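I know I can run a separate getNodeSet command for the countries, but I just want to make sure I am not missing something. Is there a smarter way to get all the data within each continent and then, in turn, within each country?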

2013-02-01 user1848018

Given the structure of your document, it might be easier to parse it with SAX rather than with a DOM tree. – juba

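Using XPath, several paths can be combined with the | separator. So I use it to get the countries and the shops in the same list. Then I get a second list containing only the countries, and I use that second list to split the first one.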

library(XML)
url <- "http://en.wikipedia.org/wiki/List_of_department_stores_by_country"
xmltext <- htmlTreeParse(url, useInternalNodes = TRUE)

## Combined XPath: shop entries (li) and country headings (h3), returned in document order
cont.shops <- xpathApply(xmltext, '//*[@id="mw-content-text"]/ul/li|
            //*[@id="mw-content-text"]/h3', xmlValue)
cont.shops <- do.call(rbind, cont.shops)     ## from list to one-column character matrix


head(cont.shops)     ## first element is country followed by shops 
    [,1]     
[1,] "[edit] Â Tunisia"  
[2,] "Magasin GÃƒÂ©nÃƒÂ©ral" 
[3,] "Mercure Market"  
[4,] "Promogro"    
[5,] "Geant"     
[6,] "Carrefour"    
## Get all the countries in one list
contries <- xpathApply(xmltext, '//*[@id="mw-content-text"]/h3', xmlValue)
contries <- do.call(rbind, contries)      ## from list to one-column character matrix

head(contries)
    [,1]     
[1,] "[edit] Â Tunisia"  
[2,] "[edit] Â Morocco"  
[3,] "[edit] Â Ghana"  
[4,] "[edit] Â Kenya"  
[5,] "[edit] Â Nigeria"  
[6,] "[edit] Â South Africa"

Now I do some processing to split cont.shops by country.

dd <- which(cont.shops %in% contries)     ## positions of the country entries in cont.shops
freq <- c(diff(dd), length(cont.shops) - tail(dd, 1) + 1)  ## entries per country, heading included
contries.f <- rep(contries, freq)      ## grouping vector: one country label per entry


ll <- split(cont.shops,contries.f)
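
The diff/rep bookkeeping above can also be written as a running count of heading rows. What follows is only a minimal sketch on made-up data, not the method used in this answer: `entries` and `headers` are hypothetical stand-ins for cont.shops and contries, and it assumes that the first entry is a heading and that no shop has exactly the same text as a country heading.

```r
# Hypothetical stand-ins for cont.shops and contries (toy data, not scraped).
entries <- c("Tunisia", "Monoprix", "Carrefour",
             "Morocco", "Alpha 55",
             "Ghana")
headers <- c("Tunisia", "Morocco", "Ghana")

# TRUE wherever an entry is a country heading.
is_header <- entries %in% headers

# cumsum() increments at each heading, so every entry carries the index
# of the most recent heading at or above it.
group <- cumsum(is_header)

# Split the shops (heading rows dropped) by country.
# A factor with explicit levels keeps countries that have no shops.
shops_by_country <- split(entries[!is_header],
                          factor(headers[group[!is_header]], levels = headers))

print(shops_by_country)
```

Unlike the split above, this variant leaves the heading itself out of each group and keeps an empty entry for a country that has no shops listed.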

I can check the result:

> ll[[contries[1]]] 
[1] "[edit] Â Tunisia"  "Magasin GÃƒÂ©nÃƒÂ©ral" "Mercure Market"  "Promogro"    "Geant"     
[6] "Carrefour"    "Monoprix"    
> ll[[contries[2]]] 
[1] "[edit] Â Morocco"               
[2] "Alpha 55, one 6-story store in Casablanca"         
[3] "Galeries Lafayette, to open in 2011[1] within Morocco Mall, in Casablanca"
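
Another way to pair each country with its shops is to start from every h3 node and follow the XPath `following-sibling` axis to the next list. This is a hedged sketch on a made-up HTML fragment, not the approach used above: the fragment only imitates the assumed layout (an h3 heading followed by a sibling ul), and the real page may differ, for instance by having several lists or none under a heading.

```r
library(XML)

# Toy HTML mirroring the assumed layout: each country <h3> is followed
# by a sibling <ul> of shops (made-up fragment, not the real page).
html <- '<div id="mw-content-text">
  <h3>Tunisia</h3><ul><li>Monoprix</li><li>Carrefour</li></ul>
  <h3>Morocco</h3><ul><li>Alpha 55</li></ul>
</div>'
doc <- htmlParse(html, asText = TRUE)

# One node per country heading.
h3_nodes <- getNodeSet(doc, '//*[@id="mw-content-text"]/h3')

# For each heading, take the <li> items of the first <ul> that follows it.
shops <- lapply(h3_nodes, function(h3) {
  xpathSApply(h3, "following-sibling::ul[1]/li", xmlValue)
})
names(shops) <- sapply(h3_nodes, xmlValue)

print(shops)
```

Note that `following-sibling::ul[1]` picks the next list even when another heading comes first, so a country without its own list would be given the next country's shops; the split-based approach above does not have that problem.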

2013-02-01 20:18:12 agstudy

This is very helpful. Thank you – user1848018
