我已經得到了我想要轉換爲正確的大小寫全部大寫所有者名稱的列表的適當資本與公司名稱混合名稱字符串
owner1<-c("DXXXXX JOSEPH V JR","MIRNA NXXXXX","ADRIAN TXXXX",
"CUTLER PXXXXXXXXX LLC","GVM PXXXXXXXXX LLC",
"EARLENA RXXXXXXX","NATHANIEL TXXXXX","DXXXXXX DONNA",
"LXXXX ELAINE E TR","SXXXXXX KIMBERLY")
)
希望的輸出:
owner1
1: Dxxxxx Joseph V. Jr
2: Mirna Nxxxxx
3: Adrian Txxxx
4: Cutler Pxxxxxxxxx LLC
5: GVM Pxxxxxxxxx LLC
6: Earlena Rxxxxxxx
7: Nathaniel Txxxxx
8: Dxxxxxx Donna
9: Lxxxx Elaine E. TR
10: Sxxxxxx Kimberly
一個很大的第一步是在?chartr
提到的.simpleCap
功能的版本:
.simpleCap <- function(x) {
s <- strsplit(tolower(x), " ")[[1]]
paste(toupper(substring(s, 1, 1)), substring(s, 2),
sep = "", collapse = " ")
}
這是問題的一大塊,但未能在4,5和9,我可以補充這種治療的關鍵短語(LLC,TR等)分開,但是這仍然留下像觀察5
這裏是至今(我已經得到了奇妙的功能加快通過以下@ eipi10的解決方案,它向量化的.simpleCap
功能,允許整個功能用於載體):
to.proper<-function(strings){
#vectorized version of .simpleCap;
# I've also built in that I know `strings` is all caps
res<-gsub("\\b([A-Z])([A-Z]+)*","\\U\\1\\L\\2",strings,perl=T)
#In my data, some Irish/Scottish names separated the MC prefix
# Also, re-capitalize following a hyphen
res<-gsub("\\bMc\\s","Mc",gsub("(-.)","\\U\\1",res,perl=T))
for (init in c("[A-Z]","Inc","Assoc","Co",
"Jr","Sr","Tr","Bros")){
#Add a period after common abbreviations
res<-gsub(paste0("\\b(",init,")\\b"),"\\1.",res)
}
for (abbr in c("[B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,}",
"Pa","Ii","Iii","Iv","Lp","Tj",
"Xiv","Ll","Yml","Us")){
#Re-capitalize any string of >=3 consonants (excluding
# Y for such names as LYNN and WYNN), as well as
# some other common phrases that need upper-casing
res<-gsub(paste0("\\b(",abbr,")\\b"),"\\U\\1",res,perl=T)
}
#Re-capitalize post-Mc letters, e.g. in Mcmahon
gsub("\\bMc([a-z])","Mc\\U\\1",res,perl=T)
}
對於在這個過程中單獨留下潛在的不可預測的縮寫(特別是像那些不常見的觀察5中的那些縮寫),有什麼想法?
我想你可能需要後綴的一些列表離開'LLC,TR'了比賽,而不是在資本 – akrun
使用除了@ akrun的建議,你有沒有從嘗試stri_trans_totitle() stringi包? – lawyeR
@lawyeR這也應該給同樣的問題。我試過了:-) – akrun