2015-05-22
1

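Proper capitalization of name strings mixed with company names

I have a list of owner names, all in upper case, that I want to convert to proper capitalization. The strings mix personal names with company names.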

owner1<-c("DXXXXX JOSEPH V JR","MIRNA NXXXXX","ADRIAN TXXXX", 
      "CUTLER PXXXXXXXXX LLC","GVM PXXXXXXXXX LLC", 
      "EARLENA RXXXXXXX","NATHANIEL TXXXXX","DXXXXXX DONNA", 
      "LXXXX ELAINE E TR","SXXXXXX KIMBERLY") 

Desired output:

                   owner1
 1:   Dxxxxx Joseph V. Jr
 2:          Mirna Nxxxxx
 3:          Adrian Txxxx
 4: Cutler Pxxxxxxxxx LLC
 5:    GVM Pxxxxxxxxx LLC
 6:      Earlena Rxxxxxxx
 7:      Nathaniel Txxxxx
 8:         Dxxxxxx Donna
 9:    Lxxxx Elaine E. TR
10:      Sxxxxxx Kimberly

A great first step is a version of the .simpleCap function mentioned in ?chartr:

.simpleCap <- function(x) { 
    s <- strsplit(tolower(x), " ")[[1]] 
    paste(toupper(substring(s, 1, 1)), substring(s, 2), 
      sep = "", collapse = " ") 
} 

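This handles a big chunk of the problem, but it fails on observations 4, 5 and 9. I could add special handling for key phrases (LLC, TR, etc.), but that still leaves cases like observation 5, where the abbreviation (GVM) is not one I could anticipate.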
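To make the failure concrete, here is a small check of .simpleCap on the three problem rows. The names `problem_rows` and `result` are only illustrative and are not part of my real code.

```r
# .simpleCap as defined above (works on one string at a time)
.simpleCap <- function(x) {
  s <- strsplit(tolower(x), " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}

# Observations 4, 5 and 9 from owner1
problem_rows <- c("CUTLER PXXXXXXXXX LLC",
                  "GVM PXXXXXXXXX LLC",
                  "LXXXX ELAINE E TR")

# sapply() is needed because .simpleCap is not vectorized
result <- sapply(problem_rows, .simpleCap, USE.NAMES = FALSE)
print(result)
# [1] "Cutler Pxxxxxxxxx Llc" "Gvm Pxxxxxxxxx Llc"    "Lxxxx Elaine E Tr"
```

LLC, GVM and TR all lose their capitals, and the lone initial E gets no period.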

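Here is what I have so far (the function got a wonderful speed-up from @eipi10's solution below, which vectorizes the .simpleCap function so that the whole thing can be applied to a vector):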

to.proper<-function(strings){ 
    #vectorized version of .simpleCap; 
    # I've also built in that I know `strings` is all caps 
    res<-gsub("\\b([A-Z])([A-Z]+)*","\\U\\1\\L\\2",strings,perl=T) 
    #In my data, some Irish/Scottish names separated the MC prefix 
    # Also, re-capitalize following a hyphen 
    res<-gsub("\\bMc\\s","Mc",gsub("(-.)","\\U\\1",res,perl=T)) 
    for (init in c("[A-Z]","Inc","Assoc","Co", 
       "Jr","Sr","Tr","Bros")){ 
    #Add a period after common abbreviations 
    res<-gsub(paste0("\\b(",init,")\\b"),"\\1.",res) 
    } 
    for (abbr in c("[B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,}", 
       "Pa","Ii","Iii","Iv","Lp","Tj", 
       "Xiv","Ll","Yml","Us")){ 
    #Re-capitalize any string of >=3 consonants (excluding 
    # Y for such names as LYNN and WYNN), as well as 
    # some other common phrases that need upper-casing 
    res<-gsub(paste0("\\b(",abbr,")\\b"),"\\U\\1",res,perl=T) 
    } 
    #Re-capitalize post-Mc letters, e.g. in Mcmahon 
    gsub("\\bMc([a-z])","Mc\\U\\1",res,perl=T) 
} 

Any ideas on how to leave potentially unpredictable abbreviations alone in this process (especially uncommon ones like the one in observation 5)?

+1

I think you may need some list of suffixes so that 'LLC, TR' are left out of the match rather than re-capitalized. – akrun

+1

In addition to @akrun's suggestion, have you tried stri_trans_totitle() from the stringi package? – lawyeR

+0

@lawyeR That should give the same problem. I tried it :-) – akrun

Answers

2

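Here is a function that uses regular expressions to convert strings to title case (adapted from @BenBolker's answer to a question I asked on SO a while back).

The function is written so that you can pass an exceptions argument to handle special cases like GVM. I'm not sure this is flexible enough for your needs, since you have to hard-code the exceptions, but I thought I'd post it and see whether anyone can suggest improvements.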

dat = data.frame(owner1 = c("DXXXXX JOSEPH V JR","MIRNA NXXXXX","ADRIAN TXXXX", 
            "CUTLER PXXXXXXXXX LLC","GVM PXXXXXXXXX LLC", 
            "EARLENA RXXXXXXX","NATHANIEL TXXXXX","DXXXXXX DONNA", 
            "LXXXX ELAINE E TR","SXXXXXX KIMBERLY")) 

# Convert a string to title case 
tc = function(strings, exceptions="\\b(gvm)\\b") { 

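    # Convert to title case, keeping a trailing " LLC", " TR", " FBO" or " LP" in caps.
    # The leading space belongs inside the group; the \\b after the suffix stops
    # " TR" from matching the start of a longer word such as "TRACY".
    title.case = gsub("\\b([a-zA-Z])([a-zA-Z]+)*( (?:LLC|TR|FBO|LP)\\b)?", 
        "\\U\\1\\L\\2\\U\\3", strings, perl=TRUE) 

    # Add a period after initials (presumed to be any lone capital letter) 
    title.case = gsub(" ([A-Z]) ", " \\1. ", title.case) 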

    # Deal with exceptions 
    title.case = gsub(exceptions, "\\U\\1", title.case, perl=TRUE, ignore.case=TRUE) 

    return(title.case) 
} 

dat$title.case = tc(dat$owner1) 

                  owner1            title.case
1     DXXXXX JOSEPH V JR   Dxxxxx Joseph V. Jr
2           MIRNA NXXXXX          Mirna Nxxxxx
3           ADRIAN TXXXX          Adrian Txxxx
4  CUTLER PXXXXXXXXX LLC Cutler Pxxxxxxxxx LLC
5     GVM PXXXXXXXXX LLC    GVM Pxxxxxxxxx LLC
6       EARLENA RXXXXXXX      Earlena Rxxxxxxx
7       NATHANIEL TXXXXX      Nathaniel Txxxxx
8          DXXXXXX DONNA         Dxxxxxx Donna
9      LXXXX ELAINE E TR    Lxxxx Elaine E. TR
10      SXXXXXX KIMBERLY      Sxxxxxx Kimberly
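
One possible way to avoid hard-coding a regular expression is to pass the protected tokens as a plain character vector and build the pattern from it. The sketch below is only an illustration of that idea: `to_title_keep` and `keep_upper` are hypothetical names, and it does not add periods after initials.

```r
# Hypothetical helper: title-case every word, then restore a user-supplied
# set of tokens (suffixes, company acronyms, ...) to upper case.
to_title_keep <- function(strings, keep_upper = c("LLC", "TR", "GVM")) {
  # Upper-case the first letter of each word, lower-case the rest
  out <- gsub("\\b([A-Za-z])([A-Za-z]*)", "\\U\\1\\L\\2", strings, perl = TRUE)
  # Build one whole-word alternation from the token vector
  pattern <- paste0("\\b(", paste(keep_upper, collapse = "|"), ")\\b")
  # Re-capitalize any protected token, whatever its current case
  gsub(pattern, "\\U\\1", out, perl = TRUE, ignore.case = TRUE)
}

kept <- to_title_keep(c("GVM PXXXXXXXXX LLC", "LXXXX ELAINE E TR"))
print(kept)
# [1] "GVM Pxxxxxxxxx LLC" "Lxxxx Elaine E TR"
```

The token list still has to be maintained by hand, so this only moves the hard-coding into data; it does not detect unknown abbreviations.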
+0

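Big props for the vectorized version of the `.simpleCap` function I was using; it sped up my code a great deal. I ended up settling on a function close to the one you showed. Mine is more tailored; to generalize it, I would probably pass 'exceptions' and 'initialize' arguments. – MichaelChirico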

+0

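I'm also using the following to find out which 2-letter consonant phrases are around and handle them one by one: regmatches(string,regexpr("\\b[B-DF-HJ-NP (unfortunately, because of the large number of abbreviations such as Jr, Sr, Co, Sc (school), Ch (church), as well as some Vietnamese names like Ng, etc.) – MichaelChirico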
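
The pattern in the comment above is cut off, so the following is only a guess at the idea it describes: list every two-letter, consonant-only token in the all-caps data so each can be reviewed by hand. The character class is an assumption borrowed from the consonant class used in to.proper (Y excluded), and `owners`, `hits` and `two_letter` are illustrative names.

```r
# Sample all-caps input; the last entry is a made-up example of a two-letter name
owners <- c("DXXXXX JOSEPH V JR", "GVM PXXXXXXXXX LLC",
            "LXXXX ELAINE E TR", "NG ANH")

# gregexpr() finds every match per string (regexpr() would stop at the first)
hits <- regmatches(owners,
                   gregexpr("\\b[B-DF-HJ-NP-TV-XZ]{2}\\b", owners, perl = TRUE))

# Collapse to the distinct tokens that need a case-by-case decision
two_letter <- sort(unique(unlist(hits)))
print(two_letter)
# [1] "JR" "NG" "TR"
```

Tokens such as Co are not caught by a consonant-only class, so a list like this is a starting point for review rather than a complete inventory.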