2013-09-22 99 views
3

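Importing raw data into R

Can anyone help me import this data from a text or .dat file into R? The fields are separated by spaces, but a city name such as NEW YORK should not be split into two fields.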

1 NEW YORK 7,262,700 
2 LOS ANGELES 3,259,340 
3 CHICAGO 3,009,530 
4 HOUSTON 1,728,910 
5 PHILADELPHIA 1,642,900 
6 DETROIT 1,086,220 
7 SAN DIEGO 1,015,190 
8 DALLAS 1,003,520 
9 SAN ANTONIO 914,350 
10 PHOENIX 894,070 

Answers

4

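For your particular data, where the only spaces that are not field separators occur between uppercase letters, consider using a regular expression: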

gsub("([A-Z]+) ([A-Z]+)", "\\1-\\2", "1 NEW YORK 7,262,700") 
# [1] "1 NEW-YORK 7,262,700" 
gsub("([A-Z]+) ([A-Z]+)", "\\1-\\2", "3 CHICAGO 3,009,530") 
# [1] "3 CHICAGO 3,009,530" 

You can then treat the remaining spaces as field separators.
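As a minimal sketch of that last step: `read.table` can parse the hyphenated lines directly through its `text` argument, since whitespace is its default separator. The names `raw`, `hyphenated` and `cities`, and the column names, are illustrative assumptions and do not come from the answer above.

```r
# Sample lines from the question (illustrative subset)
raw <- c("1 NEW YORK 7,262,700",
         "3 CHICAGO 3,009,530",
         "9 SAN ANTONIO 914,350")

# Join two-word city names with a hyphen so each line has exactly three fields
hyphenated <- gsub("([A-Z]+) ([A-Z]+)", "\\1-\\2", raw)

# Whitespace is the default field separator for read.table
cities <- read.table(text = hyphenated, header = FALSE,
                     col.names = c("num", "city", "population"),
                     stringsAsFactors = FALSE)

# Undo the hyphenation and strip the thousands separators
cities$city <- gsub("-", " ", cities$city, fixed = TRUE)
cities$population <- as.numeric(gsub(",", "", cities$population, fixed = TRUE))
```

This sketch assumes city names of at most two words. In a three-word name, only the first space would be replaced, because `gsub` resumes searching after the second word of each match.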

+2

The second '[A-Z]' should be followed by a '+' rather than a '*'; otherwise "CHICAGO" ends up with a hyphen attached to it.

+0

Thanks, Hugh! – Mike

1

Expanding on @Hugh's answer, I would try the following, although it is not particularly efficient.

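# Read the file, one line per element
lines <- scan("cities.txt", sep="\n", what="character") 
# Join two-word city names with a hyphen so every line has three fields
lines <- unlist(lapply(lines, function(x) { 
    gsub(pattern="([a-zA-Z]+) ([a-zA-Z]+)", replacement="\\1-\\2", x) 
})) 

# Preallocate the result, one row per input line
citiesDF <- data.frame(num = rep(0, length(lines)), 
         city = rep("", length(lines)), 
         population = rep(0, length(lines)), 
         stringsAsFactors=FALSE) 

for (i in seq_along(lines)) { 
    # Split on runs of spaces: number, hyphenated city, population
    splitted <- strsplit(lines[i], " +") 
    citiesDF[i, "num"] <- as.numeric(splitted[[1]][1]) 
    # Restore the space in the city name
    citiesDF[i, "city"] <- gsub("-", " ", splitted[[1]][2]) 
    # Drop the thousands separators before converting
    citiesDF[i, "population"] <- as.numeric(gsub(",", "", splitted[[1]][3])) 
} 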
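
Since `gsub` and `strsplit` are both vectorized, the same idea can be sketched without a loop. This is only an illustration, and it assumes every line splits into exactly three fields after hyphenation. The inline sample vector stands in for the file contents, and the names `joined` and `parts` are hypothetical.

```r
# Illustrative input; in practice this would come from scan() or readLines()
lines <- c("1 NEW YORK 7,262,700 ",
           "3 CHICAGO 3,009,530 ",
           "10 PHOENIX 894,070 ")

# gsub() works on the whole vector at once, so lapply() is not needed
joined <- gsub("([a-zA-Z]+) ([a-zA-Z]+)", "\\1-\\2", lines)

# Split every line on runs of spaces and stack the pieces into a character matrix
parts <- do.call(rbind, strsplit(joined, " +"))

citiesDF <- data.frame(num = as.numeric(parts[, 1]),
                       city = gsub("-", " ", parts[, 2], fixed = TRUE),
                       population = as.numeric(gsub(",", "", parts[, 3], fixed = TRUE)),
                       stringsAsFactors = FALSE)
```

If some line produced a different number of fields, `rbind` would recycle values with a warning, so it is worth checking the field count before relying on the result.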
+0

Thanks, Manetheran – Mike

4

A variation on the theme... but first, some sample data:

cat("1 NEW YORK 7,262,700", 
    "2 LOS ANGELES 3,259,340", 
    "3 CHICAGO 3,009,530", 
    "4 HOUSTON 1,728,910", 
    "5 PHILADELPHIA 1,642,900", 
    "6 DETROIT 1,086,220", 
    "7 SAN DIEGO 1,015,190", 
    "8 DALLAS 1,003,520", 
    "9 SAN ANTONIO 914,350", 
    "10 PHOENIX 894,070", sep = "\n", file = "test.txt") 

Step 1: Read the data in with readLines

x <- readLines("test.txt") 

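Step 2: Figure out a regular expression that can be used to insert delimiters. Here, the pattern appears to be (reading from the end of the line) a run of digits and commas, preceded by a space, preceded by some words in ALL CAPS. We can capture those groups and insert tab delimiters (\t). The extra backslashes escape them properly.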

gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x) 
# [1] "1\t NEW YORK \t7,262,700"  "2\t LOS ANGELES \t3,259,340" 
# [3] "3\t CHICAGO \t3,009,530"  "4\t HOUSTON \t1,728,910"  
# [5] "5\t PHILADELPHIA \t1,642,900" "6\t DETROIT \t1,086,220"  
# [7] "7\t SAN DIEGO \t1,015,190" "8\t DALLAS \t1,003,520"  
# [9] "9\t SAN ANTONIO \t914,350" "10\t PHOENIX \t894,070" 

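Step 3: Since we know our gsub works, and we know that read.delim has a "text" argument that can be used in place of the "file" argument, we can call read.delim directly on the gsub result: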

out <- read.delim(text = gsub("([A-Z ]+)(\\s?[0-9,]+$)", "\\\t\\1\\\t\\2", x), 
        header = FALSE, strip.white = TRUE) 
out 
#    V1           V2        V3
# 1   1     NEW YORK 7,262,700
# 2   2  LOS ANGELES 3,259,340
# 3   3      CHICAGO 3,009,530
# 4   4      HOUSTON 1,728,910
# 5   5 PHILADELPHIA 1,642,900
# 6   6      DETROIT 1,086,220
# 7   7    SAN DIEGO 1,015,190
# 8   8       DALLAS 1,003,520
# 9   9  SAN ANTONIO   914,350
# 10 10      PHOENIX   894,070

A possible final step is to convert the third column to numeric:

out$V3 <- as.numeric(gsub(",", "", out$V3)) 
+0

Thanks, Mahto – Mike