2017-09-07

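How to get "Categories" from Wikipedia using rvest in R?

I am using rvest in R to get the categories (the bottom-most part of a Wikipedia page). I used SelectorGadget to identify the HTML node for category extraction. The code I used is as follows: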

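library(rvest)  # also provides read_html() and the %>% pipe

thepage <- read_html("https://en.wikipedia.org/wiki/San_Diego")
# Select the whole category container and extract its text
Categories <- thepage %>%
  html_nodes("#mw-normal-catlinks") %>%
  html_text()
Categories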

The result is as follows:

"Categories: San Diego1769 establishments in California1850 establishments in CaliforniaCities in San Diego County, CaliforniaCounty seats in CaliforniaIncorporated cities and towns in CaliforniaPopulated coastal places in CaliforniaPopulated places established in 1769San Antonio-San Diego Mail LineSan Diego County, CaliforniaSan Diego metropolitan areaSpanish mission settlements in North AmericaSpecial economic zones of the United StatesStagecoach stops in the United States" 

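As you can see, there is no separator between the categories. The first category is "San Diego", the second is "1769 establishments in California", and so on. How can I get these categories as a list, or otherwise separate them?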

Answer

Each category is a list item, so you need to select the list items rather than the container:

thepage %>% 
    html_nodes(".mw-normal-catlinks ul li") %>% 
    html_text() 

 [1] "San Diego"                                    "1769 establishments in California"
 [3] "1850 establishments in California"            "Cities in San Diego County, California"
 [5] "County seats in California"                   "Incorporated cities and towns in California"
 [7] "Populated coastal places in California"       "Populated places established in 1769"
 [9] "San Antonio-San Diego Mail Line"              "San Diego County, California"
[11] "San Diego metropolitan area"                  "Spanish mission settlements in North America"
[13] "Special economic zones of the United States"  "Stagecoach stops in the United States"
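
To see why selecting the list items works, here is a minimal offline sketch. The `snippet` string is a hypothetical, trimmed-down imitation of the category block (it is not copied from the live page), and the variable names `page`, `categories` and `links` are only for illustration. Because `html_text()` returns one string per matched node, matching each `li` gives one string per category, whereas matching the single container gives one concatenated string.

```r
library(rvest)

# Hypothetical, simplified stand-in for the category block,
# so the example runs without a network connection.
snippet <- '<div id="mw-normal-catlinks" class="mw-normal-catlinks">
  <a href="/wiki/Help:Category">Categories</a>:
  <ul>
    <li><a href="/wiki/Category:San_Diego">San Diego</a></li>
    <li><a href="/wiki/Category:County_seats_in_California">County seats in California</a></li>
  </ul>
</div>'

page <- read_html(snippet)

# One node per <li>, so html_text() yields one string per category.
categories <- page %>%
  html_nodes("#mw-normal-catlinks ul li") %>%
  html_text()

# The anchors inside the list items also carry each category's link.
links <- page %>%
  html_nodes("#mw-normal-catlinks ul li a") %>%
  html_attr("href")

categories
links
```

If the result is needed as an R list rather than a character vector, `as.list(categories)` should do it.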