2013-05-09 26 views
2

Return the original search term with grep in R

I have a list of items and a list of search terms, and I am trying to do two things:

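  1. Search the items for a match against any of the search terms, and return TRUE if a match is found.
  2. For every item that returns TRUE (i.e., has a match), also return the original search term that it matched in step 1.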

So, given the following data frame:

             items
1             alex
2 alex is a person
3   this is a test
4            false
5    this is cathy

And the following list of search terms:

"alex"  "bob"  "cathy"  "derrick" "erica"  "ferdinand" 

I would like to create the following output:

             items matches original
1             alex    TRUE     alex
2 alex is a person    TRUE     alex
3   this is a test   FALSE     <NA>
4            false   FALSE     <NA>
5    this is cathy    TRUE    cathy

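Step 1 is straightforward, but I am having trouble with step 2. To create the "matches" column, I use grepl() to create a variable that is TRUE if a row of d$items contains one of the search terms, and FALSE otherwise.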

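For step 2, my thought was that I should be able to use grep() with value = T, as shown in the code below. However, this returns the wrong value: instead of the original search term that grep matched, it returns the value of the item that matched. So I get the following output: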

             items matches         original
1             alex    TRUE             alex
2 alex is a person    TRUE alex is a person
3   this is a test   FALSE             <NA>
4            false   FALSE             <NA>
5    this is cathy    TRUE    this is cathy
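The difference between the two behaviours can be seen in a small sketch. It uses a simplified pattern ("alex|cathy") that is an assumption for illustration, not the pattern from the question: grep(value = TRUE) returns the elements of the searched vector that contain a match, whereas regexpr() combined with regmatches() returns the matched substring itself.

```r
items <- c("alex", "alex is a person", "this is a test", "false", "this is cathy")
pattern <- "alex|cathy"  # simplified pattern, for illustration only

# grep(value = TRUE) returns the elements of `items` that contain a match
matched_items <- grep(pattern, items, value = TRUE)

# regexpr() locates the first match in each element; regmatches() extracts
# the matched substring and drops elements that have no match
matched_text <- regmatches(items, regexpr(pattern, items))

print(matched_items)
print(matched_text)
```

Both results have one entry per matching item, so either can be assigned back into the subset of matched rows.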

Here is the code I am using now. Any ideas would be much appreciated!

# Dummy data and search terms 
d = data.frame(items = c("alex", "alex is a person", "this is a test", "false", "this is cathy")) 
searchTerms = c("alex", "bob", "cathy", "derrick", "erica", "ferdinand") 

# Return TRUE iff a search term is found in the items column as a whole word, not between letters
d$matches = grepl(paste("(^| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", 
    searchTerms, "($| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", sep = "", 
    collapse = "|"), d[,1], ignore.case = TRUE 
) 

# Subset data 
dMatched = d[d$matches==T,] 

# This is where the problem is: return the value that was originally matched with grepl above 
dMatched$original = grep(paste("(^| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", 
    searchTerms, "($| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", sep = "", 
    collapse = "|"), dMatched[,1], ignore.case = TRUE, value = TRUE 
) 


d$original[d$matches==T] = dMatched$original 
+1

You can replace the long string of letters with '[:alpha:]'. – Thomas 2013-05-09 19:08:52

+1

You may want to look at the 'regmatches' function. – Dason 2013-05-09 19:09:54

+1

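@Thomas: Thanks for the tip. However, [:alpha:] and the other predefined character classes don't seem to work for me. It must have something to do with my locale. From the regex documentation on character classes: "(Because their interpretation is locale- and implementation-dependent, they are best avoided.) The only portable way to specify all ASCII letters is to list them all as the character class [ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]." – Steve 2013-05-09 19:15:11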
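One detail worth checking when a predefined class appears not to work: [:alpha:] is only valid inside a bracket expression, so it has to be written as [[:alpha:]] (a letter) or [^[:alpha:]] (a non-letter). Whether that was the cause here is an assumption, since the thread does not say. A minimal sketch with a shortened list of search terms and one extra item ("alexander") added for illustration:

```r
items <- c("alex", "alex is a person", "this is a test", "false",
           "this is cathy", "alexander")
searchTerms <- c("alex", "bob", "cathy")

# [:alpha:] must sit inside a bracket expression:
# [[:alpha:]] matches a letter, [^[:alpha:]] matches a non-letter
left  <- "(^|[^[:alpha:]])"
right <- "($|[^[:alpha:]])"
pattern <- paste0(left, searchTerms, right, collapse = "|")

# TRUE only when a term appears as a whole word
has_match <- grepl(pattern, items, ignore.case = TRUE)
print(has_match)
```

The extra item "alexander" is not flagged, because the "alex" inside it is followed by a letter.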

Answers

2

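It's not exactly what you want, but you can use the termco function from qdap to do this. It will also help in cases where there are two names in the same sentence: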

library(qdap) 
termco(d$items, 1:nrow(d), searchTerms) 

## > termco(d$items, 1:nrow(d), searchTerms) 
## nrow(d word.count  alex bob  cathy derrick erica ferdinand 
## 1  1   1 1(100.00%) 0   0  0  0   0 
## 2  2   4 1(25.00%) 0   0  0  0   0 
## 3  3   4   0 0   0  0  0   0 
## 4  4   1   0 0   0  0  0   0 
## 5  5   3   0 0 1(33.33%)  0  0   0 

To get what you are after using qdap:

+0

This is a very nice solution, and I think I like it better because it also tells you whether multiple search terms match the list of items. Thanks! – Steve 2013-05-09 21:49:56
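The point about multiple search terms matching one item can also be sketched in base R. This is not qdap's termco and does not reproduce its output; it is a rough equivalent that builds one logical column per search term. The sixth item ("alex met cathy") and the use of \\b word boundaries with perl = TRUE are assumptions added for illustration.

```r
items <- c("alex", "alex is a person", "this is a test", "false",
           "this is cathy", "alex met cathy")
searchTerms <- c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

# One logical column per search term; \\b is a word boundary, so a term
# is only counted when it appears as a whole word
hits <- sapply(searchTerms, function(term) {
  grepl(paste0("\\b", term, "\\b"), items, ignore.case = TRUE, perl = TRUE)
})

# Collapse each row into a comma-separated list of the terms that matched
# (an empty string when nothing matched)
matched_terms <- apply(hits, 1, function(row) paste(searchTerms[row], collapse = ", "))

print(hits)
print(matched_terms)
```

Keeping the full logical matrix makes it easy to count matches per term with colSums(hits) or per item with rowSums(hits).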

3

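Thanks to Dason for the helpful hint! I was able to solve my problem by using regmatches(). Here is my code, starting from where the problem was in the original question: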

# This is where the problem was: return the text that was originally matched with grepl above 
m = regexpr(paste("(^| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", 
    searchTerms, "($| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", sep = "", 
    collapse = "|"), dMatched[,1], ignore.case = TRUE 
) 

dMatched$original = regmatches(dMatched[,1], m) 

d$original[d$matches==T] = dMatched$original 

This returns the following output, which is exactly what I wanted:

             items matches original
1             alex    TRUE     alex
2 alex is a person    TRUE     alex
3   this is a test   FALSE     <NA>
4            false   FALSE     <NA>
5    this is cathy    TRUE    cathy
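One caveat with the pattern above: the groups before and after each search term consume the neighbouring character, so regmatches() can return the term together with an adjacent space (for example " cathy"). A possible variant, sketched below, uses zero-width word boundaries (\\b) so that only the term itself is captured. The switch to perl = TRUE and the tolower() normalisation are assumptions of this sketch, not part of the answer above.

```r
d <- data.frame(items = c("alex", "alex is a person", "this is a test",
                          "false", "this is cathy"),
                stringsAsFactors = FALSE)
searchTerms <- c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

# \\b matches at a word boundary without consuming a character, so the
# extracted text is the term alone (assumed variant of the original pattern)
pattern <- paste0("\\b(", paste(searchTerms, collapse = "|"), ")\\b")

# Step 1: flag the items that contain any search term as a whole word
d$matches <- grepl(pattern, d$items, ignore.case = TRUE, perl = TRUE)

# Step 2: extract the matched text for the flagged rows only
m <- regexpr(pattern, d$items, ignore.case = TRUE, perl = TRUE)
d$original <- NA_character_
# tolower() maps a case-insensitive hit back to the lower-case search term
d$original[d$matches] <- tolower(regmatches(d$items, m))

print(d)
```

Because regexpr() reports only the first match in each item, an item containing several search terms yields just the first one; gregexpr() would be needed to collect them all.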