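Reading complex names without delimiters, e.g. baseball players
I have an external data file like the one below, with no delimiters between the fields: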
PLAYER TEAM STUFF1 STUFF2
Jim Smith NYY 100 200
Jerry Johnson Jr. PHI 100 200
Andrew C. James STL 200 200
A. J. Williams CWS 100 200
Felix Rodriguez BAL 100 100
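How can I read this file? I am considering using readLines
and splitting each string before any sequence of three consecutive uppercase letters. However, I do not know how to do that.
And what should I do if only the first letter of the team name is capitalized?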
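The splitting idea described above could be sketched in base R roughly as follows. This is only a minimal sketch under some assumptions: the sample lines are typed in directly instead of being read with readLines, every data line ends with exactly two numeric fields, and the team code is a single run of three uppercase letters. The names `player_lines`, `pattern`, `parts` and `players` are made up for illustration.

```r
# Sample lines copied from the file above; in practice they would come from readLines()
player_lines <- c(
  "PLAYER TEAM STUFF1 STUFF2",
  "Jim Smith NYY 100 200",
  "Jerry Johnson Jr. PHI 100 200",
  "Andrew C. James STL 200 200",
  "A. J. Williams CWS 100 200",
  "Felix Rodriguez BAL 100 100"
)

# Drop the header line
body <- player_lines[-1]

# Four capture groups: name, three uppercase letters, number, number.
# Anchoring at both ends forces the team code to be the token just before the numbers.
pattern <- paste0(
  "^(.*)[[:space:]]+([[:upper:]]{3})",
  "[[:space:]]+([0-9]+)[[:space:]]+([0-9]+)[[:space:]]*$"
)

# regexec() returns the full match plus one entry per capture group
parts <- do.call(rbind, regmatches(body, regexec(pattern, body)))

# Column 1 of parts is the full match, so the fields start at column 2
players <- data.frame(
  PLAYER = parts[, 2],
  TEAM   = parts[, 3],
  STUFF1 = as.numeric(parts[, 4]),
  STUFF2 = as.numeric(parts[, 5]),
  stringsAsFactors = FALSE
)
players
```

If only the first letter of the team were capitalized, the `[[:upper:]]{3}` group could be swapped for `[^[:space:]]+`, which treats the team as the last space-free token before the two numbers. That assumes the team never contains a space.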
Below is a similar file, in which the name is followed by columns of numbers. I am able to read this data with the code further below:
TEAM STUFF1 STUFF2
New York Yankees 100 200
Philadelphia Phillies 100 200
Boston Red Sox 200 200
Los Angeles Angels 100 200
Chicago White Sox 100 100
Chicago Cubs 200 100
New York Mets 200 200
San Francisco Giants 100 300
Minnesota Twins 100 300
St. Louis Cardinals 200 300
Here is the code that reads the second data set:
setwd('c:/users/mmiller21/simple R programs/')
my.data3 <- readLines('team.names.with.spaces.txt')
# split between desired columns: at each position preceded by a space and followed by a digit
my.data4 <- do.call(rbind, strsplit(my.data3, split = "(?<=[ ])(?=[0-9])", perl = TRUE))
# returns the string without leading or trailing whitespace
# This function is not mine; it was found on Stack Overflow
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
my.data5 <- trim(my.data4)
# remove header
my.data6 <- my.data5[-1,]
# convert to data.frame
my.data6 <- data.frame(my.data6, stringsAsFactors = FALSE)
my.data6[,2] <- as.numeric(my.data6[,2])
my.data6[,3] <- as.numeric(my.data6[,3])
my.data6
X1 X2 X3
1 New York Yankees 100 200
2 Philadelphia Phillies 100 200
3 Boston Red Sox 200 200
4 Los Angeles Angels 100 200
5 Chicago White Sox 100 100
6 Chicago Cubs 200 100
7 New York Mets 200 200
8 San Francisco Giants 100 300
9 Minnesota Twins 100 300
10 St. Louis Cardinals 200 300
Thank you for any suggestions. I would prefer a solution in base R.