2014-02-26 18 views
1

How do I modify a regular expression in R? I have a CSV file:

AdvertiserName,Market 
Wells Fargo,Gary INMetro Chicago IL Metro 
EMC,Los Angeles CAMetro Boston MA Metro 
Apple,Cupertino CA Metro 

and this regular expression:

res <- 
gsub('(.*) ([A-Z]{2})*Metro (.*) ([A-Z]{2}) .*','\\1,\\2:\\3,\\4', 
    xx$Market) 

The "Market" column now looks like "Gary IN MetroChicago IL Metro" instead of "Gary INMetro Chicago IL Metro", and the CSV file looks like this:

AdvertiserName,CampaignName 
Wells Fargo,Gary IN MetroChicago IL Metro 
EMC,Los Angeles CA MetroBoston MA Metro 
Apple,Cupertino CA Metro 

How should I modify the regular expression to get the desired output?

AdvertiserName,City,State 
Wells Fargo,Gary,IN 
Wells Fargo,Chicago,IL 
EMC,Los Angeles,CA 
EMC,Boston,MA 
Apple,Cupertino,CA 

I am new to R; any help is appreciated.

+0

"The 'Market' column now looks like 'Gary MetroChicago IL Metro' instead of 'Gary MetroChicago IL Metro'". Huh? What's the difference? – Hugh

+0

@Hugh: Changed it, thanks for noticing. – user3188390

Answers

2

Here is an approach using strsplit:

# read file 
dat <- read.csv("filename.csv", stringsAsFactors = FALSE) 

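# split strings: either on a space that precedes a two-letter state code,
# or on " Metro" (a space plus a capitalized word) that follows a state code
splitted <- strsplit(dat$CampaignName,
                     "( (?=[A-Z]{2}))|((?<=[A-Z]{2}) [A-Z][a-z]+)", perl = TRUE)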
# [[1]] 
# [1] "Gary" "IN"  "Chicago" "IL"  
# 
# [[2]] 
# [1] "Los Angeles" "CA"   "Boston"  "MA"   
# 
# [[3]] 
# [1] "Cupertino" "CA"  

# create one data frame
# (SIMPLIFY = FALSE keeps the result a list even when every row has the same number of cities)
setNames(as.data.frame(do.call(rbind,
                               mapply(cbind,
                                      dat$AdvertiserName,
                                      lapply(splitted, function(x)
                                        matrix(x, ncol = 2, byrow = TRUE)),
                                      SIMPLIFY = FALSE))),
         c("AdvertiserName", "City", "State"))
#   AdvertiserName        City State
# 1    Wells Fargo        Gary    IN
# 2    Wells Fargo     Chicago    IL
# 3            EMC Los Angeles    CA
# 4            EMC      Boston    MA
# 5          Apple   Cupertino    CA
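As a possible alternative to splitting, the locations can also be extracted directly. The following is only a sketch, not the answerer's method: it assumes every location has the form "City ST Metro" and that the column is named CampaignName as in the question. The names `pattern`, `matches`, `locs` and `result` are made up for illustration.

```r
# Sample data from the question, read inline so the example is self-contained
dat <- read.csv(text = "AdvertiserName,CampaignName
Wells Fargo,Gary IN MetroChicago IL Metro
EMC,Los Angeles CA MetroBoston MA Metro
Apple,Cupertino CA Metro", stringsAsFactors = FALSE)

# Assumed shape of one location: "<City> <two-letter state> Metro".
# The lazy ".+?" stops at the first state code that is followed by " Metro".
pattern <- "(.+?) ([A-Z]{2}) Metro"
matches <- regmatches(dat$CampaignName,
                      gregexpr(pattern, dat$CampaignName, perl = TRUE))

# Flatten the matches and pull city and state out of each "City ST Metro" string
locs <- unlist(matches)
result <- data.frame(
  AdvertiserName = rep(dat$AdvertiserName, lengths(matches)),
  City  = sub(" [A-Z]{2} Metro$", "", locs),
  State = sub("^.* ([A-Z]{2}) Metro$", "\\1", locs),
  stringsAsFactors = FALSE
)
result
```

Every step here is vectorized, and `rep()` with `lengths()` repeats each advertiser once per location found in its row.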
+0

Where can I learn regular expressions in R? Any ideas? Also, thanks for your suggestion and answer. – user3188390

+1

@user3188390 Have you looked at '?regex'? –
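For readers new to the lookaround syntax described in '?regex', here is a small sketch of the two constructs used in the strsplit answer. The sample string comes from the question; the variable names are invented for illustration, and `perl = TRUE` is required for lookarounds to work.

```r
x <- "Gary IN MetroChicago IL Metro"

# Lookahead (?=...): match a space only when two capital letters follow it;
# the letters themselves are not consumed
with_commas <- gsub(" (?=[A-Z]{2})", ",", x, perl = TRUE)
with_commas
# [1] "Gary,IN MetroChicago,IL Metro"

# Lookbehind (?<=...): match " Metro" only when two capital letters precede it
without_metro <- gsub("(?<=[A-Z]{2}) Metro", "", x, perl = TRUE)
without_metro
# [1] "Gary INChicago IL"
```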

0

This is a bit dirty. Edits welcome.

# Read in the csv file (saved here as a .txt)
y <- readLines("Stackoverflow20140226.txt")

# Every time we see a state, shove a comma in
# (gsub is vectorized over y, so no explicit loop is needed)
y <- gsub("([A-Z]{2}) ", "\\1, ", y)

tf <- tempfile() 
writeLines(y, tf) 

# Trick read.csv into thinking there are more columns
# (named n_cols so that it does not mask the base function ncol)
n_cols <- max(count.fields(tf, sep = ","))
x <- read.csv(tf, fill = TRUE, header = FALSE, skip = 1,
              col.names = paste("V", seq_len(n_cols), sep = ""))
unlink(tf) 
# Use reshape to melt the data frame 
library(reshape2) 
xx <- melt(x, id.vars=1, measure.vars = 2:ncol(x)) 

xx$variable <- NULL 
names(xx) <- c("AdvertiserName", "CampaignName") 

xx 
    AdvertiserName  CampaignName 
1 Wells Fargo   Gary IN 
2   EMC Los Angeles CA 
3   Apple  Cupertino CA 
4 Wells Fargo MetroChicago IL 
5   EMC MetroBoston MA 
6   Apple   Metro 
7 Wells Fargo   Metro 
8   EMC   Metro 
9   Apple   
+0

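Please see http://stackoverflow.com/questions/22032481/how-to-include-a-new-column-when-using-base-r, it will give you a better idea of what I need. Thank you for your help; I appreciate your time and effort. This solution may differ slightly from my requirement: in my experience so far, using a for loop in R is not a good idea, and I have millions of rows of data. Any further help is appreciated. – user3188390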
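On the concern about loops and large inputs, one loop-free possibility is sketched below. It is not taken from either answer and rests on two assumptions: every location ends with the literal text " Metro", and every state is a two-letter code at the end of its location. The vectors `advertiser` and `campaign` stand in for the two CSV columns; all names are illustrative.

```r
advertiser <- c("Wells Fargo", "EMC", "Apple")
campaign   <- c("Gary IN MetroChicago IL Metro ",
                "Los Angeles CA MetroBoston MA Metro ",
                "Cupertino CA Metro ")

# Split on the literal " Metro"; fixed = TRUE avoids the regex engine entirely.
# trimws() removes the trailing space so no blank piece is left at the end.
parts <- strsplit(trimws(campaign), " Metro", fixed = TRUE)

# Each piece now looks like "<City> <ST>": the state is the last two
# characters, the city is everything before the separating space
locs    <- unlist(parts)
n_chars <- nchar(locs)

result <- data.frame(
  AdvertiserName = rep(advertiser, lengths(parts)),
  City  = substring(locs, 1, n_chars - 3),
  State = substring(locs, n_chars - 1, n_chars),
  stringsAsFactors = FALSE
)
result
```

All of these calls operate on whole vectors at once, so the same code applies unchanged to a column with many rows; how it performs on millions of rows would need to be measured on the real data.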