2012-10-22 28 views
6

Correct use of gsub / regular expressions in R?

I have a long list of strings, such as this machine-readable example:

A <- list(c("Biology","Cell Biology","Art","Humanities, Multidisciplinary; Psychology, Experimental","Astronomy & Astrophysics; Physics, Particles & Fields","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods","Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science","Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability")) 

So it looks like this:

> A 
[[1]] 
[1] "Biology" 
[2] "Cell Biology" 
[3] "Art" 
[4] "Humanities, Multidisciplinary; Psychology, Experimental" 
[5] "Astronomy & Astrophysics; Physics, Particles & Fields" 
[6] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods" 
[7] "Geriatrics & Gerontology" 
[8] "Gerontology" 
[9] "Management" 
[10] "Operations Research & Management Science" 
[11] "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic" 
[12] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability" 

I would like to replace these terms with their categories and remove duplicates, to get this result:

[1] "Science" 
[2] "Science" 
[3] "Arts & Humanities" 
[4] "Arts & Humanities; Social Sciences" 
[5] "Science" 
[6] "Social Sciences; Science" 
[7] "Science" 
[8] "Social Sciences" 
[9] "Social Sciences" 
[10] "Science" 
[11] "Science" 
[12] "Social Sciences; Science" 

So far I have only got this:

stringedit <- function(A) 
{ 
    A <-gsub("Biology", "Science", A) 
    A <-gsub("Cell Biology", "Science", A) 
    A <-gsub("Art", "Arts & Humanities", A) 
    A <-gsub("Humanities, Multidisciplinary", "Arts & Humanities", A) 
    A <-gsub("Psychology, Experimental", "Social Sciences", A) 
    A <-gsub("Astronomy & Astrophysics", "Science", A) 
    A <-gsub("Physics, Particles & Fields", "Science", A) 
    A <-gsub("Economics", "Social Sciences", A) 
    A <-gsub("Mathematics", "Science", A) 
    A <-gsub("Mathematics, Applied", "Science", A) 
    A <-gsub("Mathematics, Interdisciplinary Applications", "Science", A) 
    A <-gsub("Social Sciences, Mathematical Methods", "Social Sciences", A) 
    A <-gsub("Geriatrics & Gerontology", "Science", A) 
    A <-gsub("Gerontology", "Social Sciences", A) 
    A <-gsub("Management", "Social Sciences", A) 
    A <-gsub("Operations Research & Management Science", "Science", A) 
    A <-gsub("Computer Science, Artificial Intelligence", "Science", A) 
    A <-gsub("Computer Science, Information Systems", "Science", A) 
    A <-gsub("Engineering, Electrical & Electronic", "Science", A) 
    A <-gsub("Statistics & Probability", "Science", A) 
} 
B <- lapply(A, stringedit) 

But it does not work correctly:

> B 
[[1]] 
[1] "Science" 
[2] "Cell Science" 
[3] "Arts & Humanities" 
[4] "Arts & Humanities; Social Sciences" 
[5] "Science; Science" 
[6] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences" 
[7] "Science" 
[8] "Social Sciences" 
[9] "Social Sciences" 
[10] "Operations Research & Social Sciences Science" 
[11] "Computer Science, Arts & Humanitiesificial Intelligence; Science; Science" 
[12] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences; Science" 

How can I get the correct output shown above?
Thank you very much in advance for your consideration!

+0

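Whenever you find yourself ending up with many similar lines of code, you are violating the lovely [DRY principle](http://en.wikipedia.org/wiki/Don%27t_repeat_yourself). So it is time for a redesign, most likely a wrapper passed to some kind of 'apply' function or another loop-like helper. – aL3xa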

Answers

4

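Let me start with an example. You have the string "Cell Biology". The first substitution, A <-gsub("Biology", "Science", A), turns it into "Cell Science". After that, the pattern "Cell Biology" no longer matches, so the string is never replaced correctly.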
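To make the failure concrete, here is a minimal sketch that replays two of the substitutions from the question on single strings. The variable name `x` is only for illustration; the patterns and replacements are taken from the question's code.

```r
# Each gsub() call sees the result of the previous one, so order matters.
x <- "Cell Biology"
x <- gsub("Biology", "Science", x)        # now "Cell Science"
x <- gsub("Cell Biology", "Science", x)   # pattern no longer present, nothing changes
x

# Unanchored patterns also match inside longer words:
# "Art" is found at the start of "Artificial".
gsub("Art", "Arts & Humanities", "Computer Science, Artificial Intelligence")
```

Both effects are visible in the question's output, for example in elements [2] and [11].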

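Since you are not actually using regular expressions, I would rather use a kind of hash (a named vector) for the substitution: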

myhash <- c("Science", "Science", "Arts & Humanities", "Arts & Humanities", "Social Sciences", 
    "Science", "Science", "Social Sciences", "Science", "Science", "Science", "Social Sciences", 
    "Science", "Social Sciences", "Social Sciences", "Science", "Science", "Science", "Science", 
    "Science") 

names(myhash) <- c("Biology", "Cell Biology", "Art", "Humanities, Multidisciplinary", 
    "Psychology, Experimental", "Astronomy & Astrophysics", "Physics, Particles & Fields", "Economics", 
    "Mathematics", "Mathematics, Applied", "Mathematics, Interdisciplinary Applications", 
    "Social Sciences, Mathematical Methods", "Geriatrics & Gerontology", "Gerontology", "Management", 
    "Operations Research & Management Science", "Computer Science, Artificial Intelligence", 
    "Computer Science, Information Systems", "Engineering, Electrical & Electronic", 
    "Statistics & Probability") 

Now, given a string such as "Biology", you can quickly look up its category:

myhash[ "Biology" ] 
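As a small aside, the sketch below shows two properties of name-based indexing that the rest of this answer relies on: it is vectorised, and it matches names exactly. The vector `lookup` is a hypothetical two-entry miniature of `myhash`, used only so the snippet runs on its own.

```r
# Hypothetical miniature of the lookup table, for illustration only.
lookup <- c("Biology" = "Science", "Art" = "Arts & Humanities")

# Indexing by name accepts a whole character vector at once.
lookup[c("Art", "Biology")]

# Names are matched exactly, so an unknown (or only partially matching)
# name yields NA instead of a wrong category.
unname(lookup["Cell Biology"])
```

Exact matching is what avoids the "Cell Science" problem, at the cost of returning NA for any term that is missing from the table.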

I don't know why you want to use a list rather than a vector of strings, so I will simplify your case a bit:

A <- c("Biology","Cell Biology","Art", 
    "Humanities, Multidisciplinary; Psychology, Experimental", 
    "Astronomy & Astrophysics; Physics, Particles & Fields", 
    "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods", 
    "Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science", 
    "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic", 
    "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability") 

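The lookup does not work for the combined strings (those containing ";"). You can, however, split them using strsplit. Then you can use unique to avoid repeated terms, and the paste function to put the pieces back together.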

stringedit <- function(x) { 
    # first, split into subterms 
    a.all <- unlist(strsplit(x, "; *")) ; 
    paste(unique(myhash[ a.all ]), collapse= "; ") 
} 

unlist(lapply(A, stringedit )) 

Here is the result, as requested:

[1] "Science"       "Science"       "Arts & Humanities"     "Arts & Humanities; Social Sciences" 
[5] "Science"       "Social Sciences; Science"   "Science"       "Social Sciences"     
[9] "Social Sciences"     "Science"       "Science"       "Social Sciences; Science" 

Of course, you could also call the *apply functions several times, like this:

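a.spl <- strsplit(A, "; *")                              # split each entry into subterms
a.spl <- lapply(a.spl, function(x) unique(myhash[ x ]))  # look up categories, drop duplicates
sapply(a.spl, paste, collapse = "; ")                    # glue each entry back together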

This is neither more nor less efficient than the previous code.

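Yes, you could achieve the same with regular expressions, but first, it would involve splitting the strings anyway and then using regular expressions like ^Biology$ to make sure they match "Biology" but not "Cell Biology", and so on, unless you want to go for constructs like ".*Biology". Finally, you would have to get rid of the duplicates anyway. All in all, in my opinion, it is (i) less explicit (= more error prone) and (ii) not worth the effort.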
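For comparison, here is a rough sketch of what the anchored regular-expression route might look like. It is not part of the original answer: `patterns` is a hypothetical, deliberately tiny table and `categorise` is an illustrative helper name.

```r
# Sketch only: anchored patterns mapped to categories (hypothetical subset).
patterns <- c("^Biology$"      = "Science",
              "^Cell Biology$" = "Science",
              "^Art$"          = "Arts & Humanities")

categorise <- function(x) {
  parts <- strsplit(x, "; *")[[1]]          # splitting is still required
  for (p in names(patterns)) {
    parts <- sub(p, patterns[[p]], parts)   # ^ and $ prevent partial matches
  }
  paste(unique(parts), collapse = "; ")     # duplicates still have to be removed
}

categorise("Biology; Cell Biology")
categorise("Art")
```

The split, the deduplication and the paste are all still there, which is the point made above: the regular expressions add little over a plain lookup.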

+0

IMO, a bad idea. In every loop iteration you are 'strsplit'-ing each string. You should do that only once. – aL3xa

+0

I am only 'strsplit'-ing 'length(A)' times; that is not much different from the number of splits you get from 'lapply(A, strsplit, ";")'. – January

+0

Thank you very much for your solution, @January! – user1496104

2

And how about using switch:

science.category <- function(science){ 
    switch(science, 
      "Biology" =, 
      "Cell Biology" =, 
      "Astronomy & Astrophysics" =, 
      "Physics, Particles & Fields" =, 
      "Mathematics" =, 
      "Mathematics, Applied" =, 
      "Mathematics, Interdisciplinary Applications" =, 
      "Geriatrics & Gerontology" =, 
      "Operations Research & Management Science" =, 
      "Computer Science, Artificial Intelligence" =, 
      "Computer Science, Information Systems" =, 
      "Engineering, Electrical & Electronic" =, 
      "Statistics & Probability" = "Science", 
      "Art" =, 
      "Humanities, Multidisciplinary" = "Arts & Humanities", 
      "Psychology, Experimental" =, 
      "Economics" =, 
      "Social Sciences, Mathematical Methods" =, 
      "Gerontology" =, 
      "Management" = "Social Sciences", 
      NA 
      ) 
} 

a <- unlist(lapply(A, strsplit, split = " *; *"), recursive = FALSE) 
a1 <- lapply(a, function(x) unique(sapply(x, science.category))) 
sapply(a1, paste, collapse = "; ") 

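Of course, this works only as long as you pass the correct strings as the switch argument. A single mismatch and you end up with NA. For more advanced usage, you should write your own wrapper around the grep family of functions, or even use agrep (handle with care).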
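One way to catch such mismatches before they silently become NA is to list the terms that are not in the table. The following is a minimal sketch under the assumption that the known course names are available as a character vector; `known`, `input` and `terms` are hypothetical names, and the data is a made-up subset.

```r
# Hypothetical subset of known course names and some input to check.
known <- c("Biology", "Cell Biology", "Art")
input <- c("Biology; Art", "Cell biology")      # note the lower-case "b"

# Split everything into single terms and report those without an exact match.
terms <- unlist(strsplit(input, " *; *"))
setdiff(terms, known)

# agrep() can suggest approximate candidates for an unmatched term;
# treat its suggestions as hints to review, not as automatic replacements.
agrep("Cell biology", known, max.distance = 0.1, value = TRUE)
```

Reviewing the unmatched terms by hand keeps the mapping exact, while the approximate matches only help to spot typos.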

+1

Although you missed the call to 'science.category' between the 'strsplit' and 'sapply' calls. – January

+0

Hahaha, great! =) Thanks for spotting it! =) – aL3xa

+0

@January, fixed, thanks for the tip. – aL3xa

5

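The easiest way I have found is to use a two-column data.frame as a lookup table, with one column for the course name and one for the category. Here is an example: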

course.categories <- data.frame(
    Course = 
    c("Art", "Humanities, Multidisciplinary", "Biology", "Cell Biology", 
    "Astronomy & Astrophysics", "Physics, Particles & Fields", "Mathematics", 
    "Mathematics, Applied", "Mathematics, Interdisciplinary Applications", 
    "Geriatrics & Gerontology", "Operations Research & Management Science", 
    "Computer Science, Artificial Intelligence", 
    "Computer Science, Information Systems", 
    "Engineering, Electrical & Electronic", "Statistics & Probability", 
    "Psychology, Experimental", "Economics", 
    "Social Sciences, Mathematical Methods", 
    "Gerontology", "Management"), 
    Category = 
    c("Arts & Humanities", "Arts & Humanities", "Science", "Science", 
    "Science", "Science", "Science", "Science", "Science", "Science", 
    "Science", "Science", "Science", "Science", "Science", "Social Sciences", 
    "Social Sciences", "Social Sciences", "Social Sciences", "Social Sciences")) 

Then, assuming A is a list as in your question:

sapply(strsplit(unlist(A), "; "), 
     function(x) 
     paste(unique(course.categories[match(x, course.categories[["Course"]]), 
             "Category"]), 
       collapse = "; ")) 
# [1] "Science"       "Science"       
# [3] "Arts & Humanities"     "Arts & Humanities; Social Sciences" 
# [5] "Science"       "Social Sciences; Science"   
# [7] "Science"       "Social Sciences"     
# [9] "Social Sciences"     "Science"       
# [11] "Science"       "Social Sciences; Science" 

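match compares the values from A with the course names in the course.categories dataset and returns the row on which each match occurs; this is used to extract the category the course belongs to. Then unique makes sure each category appears only once, and paste puts things back together.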
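The role of match can be seen in isolation in the short sketch below. The vectors `courses` and `categories` are a hypothetical three-entry stand-in for the two columns of course.categories.

```r
# Hypothetical stand-in for the two columns of the lookup table.
courses    <- c("Art", "Biology", "Economics")
categories <- c("Arts & Humanities", "Science", "Social Sciences")

# match() returns, for each input value, its position in `courses`
# (NA when the value is not found).
idx <- match(c("Economics", "Biology", "Unknown"), courses)
idx               # 3 2 NA

# Those positions then index the parallel vector of categories.
categories[idx]   # "Social Sciences" "Science" NA
```

As with the named-vector lookup, a course that is missing from the table shows up as NA rather than as a wrong category.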

+0

Thank you very much for your suggestion, @mrdwab! – user1496104