2014-02-08

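Reading a file without delimiters in which one variable holds complex names, e.g. baseball players

I have an external data file like the one below, with no delimiters: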

PLAYER TEAM STUFF1 STUFF2 
Jim Smith NYY 100 200 
Jerry Johnson Jr. PHI 100 200 
Andrew C. James STL 200 200 
A. J. Williams CWS 100 200 
Felix Rodriguez BAL 100 100 

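How can I read this file? I am thinking of using readLines and splitting each string before any sequence of three consecutive capital letters. However, I do not know how to do that.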

And what if only the first letter of the team name is capitalized?

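Below is a similar file, in which the name is followed directly by the numeric columns. I can read this data with the code shown after it: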

 TEAM STUFF1 STUFF2 
     New York Yankees 100 200 
     Philadelphia Phillies 100 200 
     Boston Red Sox 200 200 
     Los Angeles Angels 100 200 
     Chicago White Sox 100 100 
     Chicago Cubs 200 100 
     New York Mets 200 200 
     San Francisco Giants 100 300 
     Minnesota Twins 100 300 
     St. Louis Cardinals 200 300 

Here is the code that reads the second data set:

setwd('c:/users/mmiller21/simple R programs/') 

my.data3 <- readLines('team.names.with.spaces.txt') 

# split between desired columns 

my.data4 <- do.call(rbind, strsplit(my.data3, split = "(?<=[ ])(?=[0-9])", perl = TRUE)) 

# returns string w/o leading or trailing whitespace 
# This function is not mine and was found on Stack Overflow  
trim <- function (x) gsub("^\\s+|\\s+$", "", x) 

my.data5 <- trim(my.data4) 

# remove header 
my.data6 <- my.data5[-1,] 

# convert to data.frame 
my.data6 <- data.frame(my.data6, stringsAsFactors = FALSE) 

my.data6[,2] <- as.numeric(my.data6[,2]) 
my.data6[,3] <- as.numeric(my.data6[,3]) 
my.data6 
                      X1  X2  X3
1       New York Yankees 100 200
2  Philadelphia Phillies 100 200
3         Boston Red Sox 200 200
4     Los Angeles Angels 100 200
5      Chicago White Sox 100 100
6           Chicago Cubs 200 100
7          New York Mets 200 200
8   San Francisco Giants 100 300
9        Minnesota Twins 100 300
10   St. Louis Cardinals 200 300

Thank you for any suggestions. I would prefer a solution in base R.

Answers


This splits each string before three consecutive capital letters:

setwd('c:/users/mmiller21/simple R programs/') 

my.data3 <- readLines('player.names.with.spaces.txt') 

strsplit(my.data3, split = "(?<=[ ])(?=[A-Z]{3})", perl = TRUE) 

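I can probably manage the rest from there. I am still interested, though, in how to read the file if only the first letter of the team name is capitalized.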

Here is the result of the code above:

[[1]] 
[1] "PLAYER " "TEAM " "STUFF1 " "STUFF2" 

[[2]] 
[1] "Jim Smith "  "NYY 100 200" 

[[3]] 
[1] "Jerry Johnson Jr. " "PHI 100 200" 

[[4]] 
[1] "Andrew C. James " "STL 200 200"  

[[5]] 
[1] "A. J. Williams " "CWS 100 200"  

[[6]] 
[1] "Felix Rodriguez " "BAL 100 100"  
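
To finish from this point, here is a minimal base R sketch. It assumes every data line splits into exactly two pieces, as in the output above, and it types the sample lines in directly instead of reading them from the file. The names `parts`, `rest` and `players` are only illustrative:

```r
# Sample lines standing in for readLines('player.names.with.spaces.txt')
my.data3 <- c("PLAYER TEAM STUFF1 STUFF2 ",
              "Jim Smith NYY 100 200 ",
              "Jerry Johnson Jr. PHI 100 200 ",
              "Andrew C. James STL 200 200 ",
              "A. J. Williams CWS 100 200 ",
              "Felix Rodriguez BAL 100 100 ")

# Drop the header, trim, and split before the three-capital team code
parts <- strsplit(trimws(my.data3[-1]),
                  split = "(?<=[ ])(?=[A-Z]{3})", perl = TRUE)
parts <- do.call(rbind, parts)  # column 1: player, column 2: "TEAM n n"

# No field in the second piece contains a space, so read.table can parse it
rest <- read.table(text = parts[, 2],
                   col.names = c("TEAM", "STUFF1", "STUFF2"),
                   stringsAsFactors = FALSE)

players <- data.frame(PLAYER = trimws(parts[, 1]), rest,
                      stringsAsFactors = FALSE)
players
```

The column names are hard-coded here because the header line is split into four pieces by the same pattern and so cannot be bound together with the two-piece data rows.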

Here is a solution for the case where some team codes have three capital letters and others have two, as in the following data set:

PLAYER TEAM STUFF1 STUFF2 
Jim Smith NYY 100 200 
Jerry Johnson Jr. TB 100 200 
Andrew C. James STL 200 200 
A. J. Williams TB 100 200 
Felix Rodriguez CWS 100 100 

my.data3 <- readLines('player.names.with.spaces3.txt') 

strsplit(my.data3, split = "(?<=[ ])((?=[A-Z]{2})|(?=[A-Z]{3}))", perl = TRUE) 
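
A small observation on the pattern above: any run of three capitals also begins with two capitals, so the second alternative looks redundant, and `(?=[A-Z]{2})` on its own should give the same split points. A quick check on two of the sample lines (typed in here instead of being read from the file):

```r
x <- c("Jim Smith NYY 100 200",
       "Jerry Johnson Jr. TB 100 200")

# Pattern with both alternatives, as used above
both <- strsplit(x, split = "(?<=[ ])((?=[A-Z]{2})|(?=[A-Z]{3}))", perl = TRUE)

# Shorter pattern that only looks for two capitals
two <- strsplit(x, split = "(?<=[ ])(?=[A-Z]{2})", perl = TRUE)

identical(both, two)
```

Both patterns would also split inside a name if a token after the first one starts with two capitals (a made-up "Jim AJ Smith", for instance), so they rely on the names being in mixed case.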

If the team names are not all in capital letters, as in this data set:

PLAYER TEAM STUFF1 STUFF2 
Jim Smith NYY 100 200 
Jerry Johnson Jr. Phi 100 200 
Andrew C. James StL 200 200 
A. J. Williams CWS 100 200 
Felix Rodriguez Bal 100 100 

then the following code seems to work, by splitting more than once:

setwd('c:/users/mmiller21/simple R programs/') 

my.data3 <- readLines('player.names.with.spaces2.txt') 

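# first split: separate "player + team" from the numeric columns
my.data4 <- strsplit(my.data3, split = "(?<=[ ])(?=[0-9])", perl = TRUE) 

# bind into a matrix and drop the header row
my.data5 <- do.call(rbind, my.data4) 
my.data5 <- my.data5[-1, ] 

# returns string w/o leading or trailing whitespace 
trim <- function (x) gsub("^\\s+|\\s+$", "", x) 

my.data6 <- trim(my.data5) 

# second split: break column 1 at its last space, separating player from team
my.data7 <- strsplit(my.data6[, 1], ' (?=[^ ]+$)', perl = TRUE) 

my.data8 <- do.call(rbind, my.data7) 

my.data9 <- trim(my.data8) 

# recombine player, team, and the two numeric columns
my.data10 <- cbind(my.data9, my.data6[, 2:3]) 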
my.data10 

Here is the result:

     [,1]                [,2]  [,3]  [,4] 
[1,] "Jim Smith"         "NYY" "100" "200"
[2,] "Jerry Johnson Jr." "Phi" "100" "200"
[3,] "Andrew C. James"   "StL" "200" "200"
[4,] "A. J. Williams"    "CWS" "100" "200"
[5,] "Felix Rodriguez"   "Bal" "100" "100"
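
As a possible alternative to splitting twice, a single regular expression with capture groups might do the same job in base R. This is only a sketch: it assumes every data line ends with a team token followed by two integer columns, and the names `pattern`, `fields` and `players` are illustrative:

```r
# Sample lines standing in for readLines('player.names.with.spaces2.txt')
x <- c("PLAYER TEAM STUFF1 STUFF2 ",
       "Jim Smith NYY 100 200 ",
       "Jerry Johnson Jr. Phi 100 200 ",
       "Andrew C. James StL 200 200 ",
       "A. J. Williams CWS 100 200 ",
       "Felix Rodriguez Bal 100 100 ")

x <- trimws(x[-1])  # drop the header and surrounding whitespace

# group 1: name (anything ending in a non-space), group 2: team token,
# groups 3 and 4: the two numeric columns
pattern <- "^(.*[^ ]) +([^ ]+) +([0-9]+) +([0-9]+)$"

# regexec() returns the full match plus one entry per capture group
fields <- do.call(rbind, regmatches(x, regexec(pattern, x)))

players <- data.frame(PLAYER = fields[, 2],
                      TEAM   = fields[, 3],
                      STUFF1 = as.integer(fields[, 4]),
                      STUFF2 = as.integer(fields[, 5]),
                      stringsAsFactors = FALSE)
players
```

Because the team is taken as "the last token before the numbers", this does not depend on how the team code is capitalized. A line that does not fit the pattern returns an empty match, so the `rbind` step would need a check before being used on messier files.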

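Here is a simple solution that meets your requirements. It splits each line into whitespace-separated tokens and then reassembles the name. It assumes that the name is the only field containing more than one token. Note that the spacing inside a name may not be preserved exactly, and that it will not work properly if the file has embedded tabs instead of spaces: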

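library(stringr) 
# this file must hold the player data (name, team, two numbers) from the question
lines <- readLines("team.names.with.spaces.txt") 
for (line in lines[-1]) {                        # skip the header line
    toks <- strsplit(str_trim(line), " +")[[1]]  # split on runs of spaces
    ntoks <- length(toks) 
    # the last three tokens are the team and the two numbers; the rest is the name
    name <- paste(toks[1:(ntoks-3)], collapse=' ') 
    team <- toks[ntoks-2] 
    num1 <- as.integer(toks[ntoks-1]) 
    num2 <- as.integer(toks[ntoks]) 
    print(line) 
    print(name) 
    print(team) 
    print(num1) 
    print(num2) 
} 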

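I would not recommend dropping str_trim() unless your file is always cleanly constructed, in which case you could probably remove the stringr dependency. The output looks like this: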

[1] "Jim Smith NYY 100 200" 
[1] "Jim Smith" 
[1] "NYY" 
[1] 100 
[1] 200 
[1] "Jerry Johnson Jr. PHI 100 200" 
[1] "Jerry Johnson Jr." 
[1] "PHI" 
[1] 100 
[1] 200 
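
If the goal is a data frame and not printed values, and tabs may turn up between fields, one option is to split on any whitespace and collect the rows. The sketch below uses only base R; `parse_line()` is a hypothetical helper written for this example, and the sample lines (including the tab-separated one) are typed in for illustration:

```r
# parse_line() is a hypothetical helper, not part of the loop above
parse_line <- function(line) {
  # split on runs of spaces or tabs
  toks <- strsplit(trimws(line), "[[:space:]]+")[[1]]
  n <- length(toks)
  # the last three tokens are team and the two numbers; the rest is the name
  data.frame(PLAYER = paste(toks[seq_len(n - 3)], collapse = " "),
             TEAM   = toks[n - 2],
             STUFF1 = as.integer(toks[n - 1]),
             STUFF2 = as.integer(toks[n]),
             stringsAsFactors = FALSE)
}

lines <- c("PLAYER TEAM STUFF1 STUFF2",
           "Jim Smith NYY 100 200",
           "Jerry Johnson Jr.\tPHI\t100 200",
           "A. J.  Williams CWS 100 200 ")

# skip the header and stack one row per line
players <- do.call(rbind, lapply(lines[-1], parse_line))
players
```

As with the loop, the name is rebuilt with single spaces, so the original spacing inside a name is not kept.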

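As an alternative, you can use str_locate(), which copes more robustly with multiple spaces or punctuation inside the name (for example compound surnames written with commas):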

library(stringr) 
x="Jerry Johnson Jr. PHI 100 200" 
ndx = str_locate(x," +[A-Z]{3} +[0-9]+ +[0-9]+")[1] 
name = substr(x,1,ndx-1); 
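
Since the question asks for base R, roughly the same idea can be sketched with `regexpr()`, which is vectorized over all lines. This assumes, like the pattern above, a three-capital team code followed by two integers. The third sample line is made up to show extra spaces and a comma inside the name:

```r
x <- c("Jim Smith NYY 100 200",
       "Jerry Johnson Jr. PHI 100 200",
       "Pat O'Neil,  Sr. BAL 100 100")  # made-up line

# position where " TEAM num num" starts in each line
ndx <- as.integer(regexpr(" +[A-Z]{3} +[0-9]+ +[0-9]+", x))

# everything before that position is the name, spacing left untouched
name <- substr(x, 1, ndx - 1)

# the remainder has no embedded spaces within fields, so read.table can parse it
rest <- read.table(text = substring(x, ndx),
                   col.names = c("TEAM", "STUFF1", "STUFF2"),
                   stringsAsFactors = FALSE)

data.frame(PLAYER = name, rest, stringsAsFactors = FALSE)
```

`regexpr()` returns -1 for a line that does not match, so such lines would need to be filtered out first on a less regular file.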