2015-07-21

How can I extract only the uppercase part of a string in Stata?

Here is a sample of the data:

part1 
"Cambridge, Maryland TEST MODEL SEADROME" 
"L.B. MAYER HONORED" 
"A TOWN MOVES" 
"U.S. SAVINGS BONDS RALLY" 
"N.D. NOSES OUT S.M.U. BY 27 TO 20" 
"Philadelphia, Pa. BURN 2,300 SQUEALERS" 
"Odd Bits In To-day's News" 
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING" 
"Risk Death in Daring Race" 
"Philadelphia, PA. IT'S HIGHER EDUCATION" 
"806 DECORATIONS" 
"Snow Hauled 20 Miles For Skiers" 
"F.D.R. ASKS VICTORY EFFORT" 

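Each string either has a mixed-case part followed by an uppercase part, or is entirely uppercase. I have been trying to use regular expressions to extract only the uppercase part of each string, without any luck. The best I have managed is to identify strings that begin and end with three uppercase letters: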

generate title = regexs(0) if regexm(part1, "^[A-Z][A-Z][A-Z].*[A-Z][A-Z][A-Z]$") 

I also tried the following, which I took from another question on this forum:

generate title = regexs(0) if(regexm(part1, "\b[A-Z]{2,}\b")) 

This is supposed to find words with at least two uppercase letters in a row, but it returns only missing values for me. I am using Stata 13.1 for Mac.

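Not sure what you want: to get the segments that are entirely uppercase? Try

^[^a-z]+$

Negated classes might not be supported, though. If that does not work, you will have to try a workaround such as

^[A-Z][0-9A-Z~\'@#$%^&*()_+'=\]\[{}\\|'?";:/,><-]+$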

Answers

0

As @stribizhev pointed out, negation could be one way to do it:

clear 
set more off 

input /// 
str70 myvar 
"Cambridge, Maryland TEST MODEL SEADROME" 
"L.B. MAYER HONORED" 
"A TOWN MOVES" 
"U.S. SAVINGS BONDS RALLY" 
"N.D. NOSES OUT S.M.U. BY 27 TO 20" 
"Philadelphia, Pa. BURN 2,300 SQUEALERS" 
"Odd Bits In To-day's News" 
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING" 
"Risk Death in Daring Race" 
"Philadelphia, PA. IT'S HIGHER EDUCATION" 
"806 DECORATIONS" 
"Snow Hauled 20 Miles For Skiers" 
"F.D.R. ASKS VICTORY EFFORT" 
end 

gen title = trim(regexs(2)) if regexm(myvar, "([,.]*)([^a-z]*$)") 

list title 

The result is:

. list title

     +-----------------------------------------------+
     |                                         title |
     |-----------------------------------------------|
  1. |                           TEST MODEL SEADROME |
  2. |                            L.B. MAYER HONORED |
  3. |                                  A TOWN MOVES |
  4. |                      U.S. SAVINGS BONDS RALLY |
  5. |             N.D. NOSES OUT S.M.U. BY 27 TO 20 |
     |-----------------------------------------------|
  6. |                          BURN 2,300 SQUEALERS |
  7. |                                               |
  8. | N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING |
  9. |                                               |
 10. |                     PA. IT'S HIGHER EDUCATION |
     |-----------------------------------------------|
 11. |                               806 DECORATIONS |
 12. |                                               |
 13. |                    F.D.R. ASKS VICTORY EFFORT |
     +-----------------------------------------------+

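I think this is close to what you want, but it is not perfect. It is hard to imagine a simple way to clean the strings if they do not have some regular structure. For example, compare the input and output for observations 6 and 10: "Pa." contains a lowercase letter and is dropped, while "PA." does not and is kept.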
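One possible second pass for cases like observation 10 is sketched below. It rests on an assumption that is not in the question: when the original string contains lowercase letters, a leading all-caps token that ends in a period is taken to be a state abbreviation and removed. The rule is a heuristic and would also strip a genuine title that starts with something like "U.S." after a mixed-case city name.

```stata
clear
input str70 myvar
"Cambridge, Maryland TEST MODEL SEADROME"
"L.B. MAYER HONORED"
"Philadelphia, Pa. BURN 2,300 SQUEALERS"
"Philadelphia, PA. IT'S HIGHER EDUCATION"
end

* first pass, as above: keep the trailing run that has no lowercase letters
gen title = trim(regexs(2)) if regexm(myvar, "([,.]*)([^a-z]*$)")

* second pass (assumption): only for strings that had a mixed-case prefix,
* drop a leading all-caps token ending in a period, such as "PA. " or "N.Y. "
replace title = trim(regexr(title, "^[A-Z][A-Z.]*\. ", "")) ///
    if regexm(myvar, "[a-z]")

list myvar title
```

Strings that were entirely uppercase to begin with, such as "L.B. MAYER HONORED", are excluded by the if condition, so their leading initials are left alone.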

If you have a database of titles, you can compare and match against it after the initial cleaning. See, for example, ssc describe strgroup.

0

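The implication of the question seems to be that you expect a regular expression specification to pull out all instances. However reasonable that might be, it is not how regular expressions work in Stata. You need to loop over instances. This uses moss (ssc install moss), which has this as its main purpose. (The allusion to gathering moss is a characteristically feeble play on words by the program's second author, if he is reading this.)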

clear 
input str100 part1 
"Cambridge, Maryland TEST MODEL SEADROME" 
"L.B. MAYER HONORED" 
"A TOWN MOVES" 
"U.S. SAVINGS BONDS RALLY" 
"N.D. NOSES OUT S.M.U. BY 27 TO 20" 
"Philadelphia, Pa. BURN 2,300 SQUEALERS" 
"Odd Bits In To-day's News" 
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING" 
"Risk Death in Daring Race" 
"Philadelphia, PA. IT'S HIGHER EDUCATION" 
"806 DECORATIONS" 
"Snow Hauled 20 Miles For Skiers" 
"F.D.R. ASKS VICTORY EFFORT" 
end 
compress 

moss part1, match("([A-Z]+)") regex 
egen wanted = concat(_match*), p(" ") 
l wanted 

     +--------------------------------------------------+
     |                                           wanted |
     |--------------------------------------------------|
  1. |                          C M TEST MODEL SEADROME |
  2. |                                L B MAYER HONORED |
  3. |                                     A TOWN MOVES |
  4. |                          U S SAVINGS BONDS RALLY |
  5. |                        N D NOSES OUT S M U BY TO |
     |--------------------------------------------------|
  6. |                               P P BURN SQUEALERS |
  7. |                                        O B I T N |
  8. | S S N Y DIAVOLO IS STAR AT BRILLIANT SPA OPENING |
  9. |                                          R D D R |
 10. |                       P PA IT S HIGHER EDUCATION |
     |--------------------------------------------------|
 11. |                                      DECORATIONS |
 12. |                                        S H M F S |
 13. |                        F D R ASKS VICTORY EFFORT |
     +--------------------------------------------------+

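I assume you want spaces between the results; otherwise they are hard to read. You did not say whether punctuation between the uppercase letters should be kept; if you want it, you need to modify the regular expression accordingly.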
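If installing moss is not an option, the same idea of looping over pieces can be sketched with built-in functions only. This is an illustration under assumptions rather than a replacement for moss: it treats a space-delimited word as wanted when it contains no lowercase letter, so punctuation and digits inside such words are kept, and it discards a result that has no run of two capitals. The names nwords and maxwords are made up for the example.

```stata
clear
input str70 part1
"Cambridge, Maryland TEST MODEL SEADROME"
"N.D. NOSES OUT S.M.U. BY 27 TO 20"
"Snow Hauled 20 Miles For Skiers"
"F.D.R. ASKS VICTORY EFFORT"
end

* number of space-delimited words per string, and the largest such count
gen nwords = wordcount(part1)
summarize nwords, meanonly
local maxwords = r(max)

* loop over word positions, appending each word with no lowercase letter
gen str80 wanted = ""
forvalues i = 1/`maxwords' {
    replace wanted = wanted + " " + word(part1, `i') ///
        if `i' <= nwords & !regexm(word(part1, `i'), "[a-z]")
}
replace wanted = trim(wanted)

* assumption: a result without two capitals in a row is only stray digits
replace wanted = "" if !regexm(wanted, "[A-Z][A-Z]")

list part1 wanted
```

Working word by word avoids the single capitals that "([A-Z]+)" picks up from mixed-case words such as "Cambridge", at the cost of keeping all-caps tokens such as "PA." that precede the real title.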

0

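I can't think of a single rule that will cleanly parse this type of data in a single command. Usually the best strategy is to target the easy cases first and then move on to the harder ones, until diminishing returns make further effort unattractive.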

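When using regular expressions, it is important to watch for unintended matches, especially when the number of observations is large. I use listsome (from SSC) for this kind of work.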

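It looks like part1 often starts with a city name followed by a state name or abbreviation. Here is code that handles the easy cases and the city/state cases: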

clear 
input str70 part1 
"Cambridge, Maryland TEST MODEL SEADROME" 
"L.B. MAYER HONORED" 
"A TOWN MOVES" 
"U.S. SAVINGS BONDS RALLY" 
"N.D. NOSES OUT S.M.U. BY 27 TO 20" 
"Philadelphia, Pa. BURN 2,300 SQUEALERS" 
"Odd Bits In To-day's News" 
"Saratoga Springs, N.Y. DIAVOLO IS STAR AT BRILLIANT SPA OPENING" 
"Risk Death in Daring Race" 
"Philadelphia, PA. IT'S HIGHER EDUCATION" 
"806 DECORATIONS" 
"Snow Hauled 20 Miles For Skiers" 
"F.D.R. ASKS VICTORY EFFORT" 
end 

* take care of the easy cases where there are no lowercase letters 
gen title = part1 if !regexm(part1,"[a-z]") 

* this type of string work is easier if text is aligned to the left 
leftalign // (from SSC) 

* target cases of City, State at the start of part1. 
* with complex patterns, it's easy to miss unintended matches when 
* lots of obs are involved so use -listsome- (from SSC) to track changes 
gen title0 = title 
replace title = trim(regexs(3)) if regexm(part1,"^([A-Z][a-z ]*)+, ([A-Z][a-z]*\.?)+([^a-z]+$)") 
listsome if title != title0 

list part1 title