我要出門的肢體和猜的數據你,在從網址:https://en.wikipedia.org/wiki/List_of_largest_cities。
如果是這種情況,我建議你實際上嘗試重新讀取數據(不知道你是如何將數據轉化爲R的),因爲這可能會讓你的生活更輕鬆。
這裏有一種方法在讀取數據:
library(rvest)
URL <- "https://en.wikipedia.org/wiki/List_of_largest_cities"
XPATH <- '//*[@id="mw-content-text"]/table[2]'
cities <- URL %>%
read_html() %>%
html_nodes(xpath=XPATH) %>%
html_table(fill = TRUE)
下面介紹一下當前數據的樣子。仍然需要進行清理(請注意,其中一些曾在合併單元格名稱從「行跨度」和種類列):
head(cities[[1]])
## City Nation Image Population Population Population
## 1 Image City proper Metropolitan area Urban area[7]
## 2 Shanghai China 24,256,800[8] 34,750,000[9] 23,416,000[a]
## 3 Karachi Pakistan 23,500,000[10] 25,400,000[11] 25,400,000
## 4 Beijing China 21,516,000[12] 24,900,000[13] 21,009,000
## 5 Dhaka Bangladesh 16,970,105[14] 15,669,000 18,305,671[15][not in citation given]
## 6 Delhi India 16,787,941[16] 24,998,000 21,753,486[17]
從那裏,清理可能是這樣的:
cities <- cities[[1]][-1, ]
names(cities) <- c("City", "Nation", "Image", "Pop_City", "Pop_Metro", "Pop_Urban")
cities["Image"] <- NULL
head(cities)
cities[] <- lapply(cities, function(x) type.convert(gsub("\\[.*|,", "", x)))
head(cities)
# City Nation Pop_City Pop_Metro Pop_Urban
# 2 Shanghai China 24256800 34750000 23416000
# 3 Karachi Pakistan 23500000 25400000 25400000
# 4 Beijing China 21516000 24900000 21009000
# 5 Dhaka Bangladesh 16970105 15669000 18305671
# 6 Delhi India 16787941 24998000 21753486
# 7 Lagos Nigeria 16060303 13123000 21000000
str(cities)
# 'data.frame': 163 obs. of 5 variables:
# $ City : Factor w/ 162 levels "Abidjan","Addis Ababa",..: 133 74 12 41 40 84 66 148 53 102 ...
# $ Nation : Factor w/ 59 levels "Afghanistan",..: 13 41 13 7 25 40 54 31 13 25 ...
# $ Pop_City : num 24256800 23500000 21516000 16970105 16787941 ...
# $ Pop_Metro: int 34750000 25400000 24900000 15669000 24998000 13123000 13520000 37843000 44259000 17712000 ...
# $ Pop_Urban: num 23416000 25400000 21009000 18305671 21753486 ...
但之後你只剩下3行。你的預期產量將如何? – Sotos
城市和國家的名單繼續。對於那個很抱歉。讓我編輯問題並顯示我的輸出應該如何。 – Abrar
您是否也許錯誤地將數據讀入了R?你是否保證數據每5行更改一次或者可能丟失數據? – A5C1D2H2I1M1N2O1R2T1