Return the original search term from grep in R

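I have a list of items and a list of search terms, and I'm trying to do two things:

1. Search through the items for a match to any of the search terms, and return TRUE if a match is found.
2. For all the items that return TRUE (i.e., there is a match), also return the original search term that was matched in step 1.

So, given the following data frame: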

             items
1             alex
2 alex is a person
3   this is a test
4            false
5    this is cathy

and the following list of search terms:

"alex"  "bob"  "cathy"  "derrick" "erica"  "ferdinand"

I would like to create the following output:

             items matches original
1             alex    TRUE     alex
2 alex is a person    TRUE     alex
3   this is a test   FALSE     <NA>
4            false   FALSE     <NA>
5    this is cathy    TRUE    cathy

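Step 1 is straightforward, but I'm having trouble with step 2. To create the "matches" column, I use grepl() to create a variable that is TRUE if a row of d$items contains one of the search terms, and FALSE otherwise.

For step 2, my thinking was that I should be able to use grep() with value = T, as in the code below. However, this returns the wrong value: instead of the original search term that grep matched, it returns the value of the item that matched. So I get the following output:

             items matches         original
1             alex    TRUE             alex
2 alex is a person    TRUE alex is a person
3   this is a test   FALSE             <NA>
4            false   FALSE             <NA>
5    this is cathy    TRUE    this is cathy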
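The difference between the three calls may be easier to see in isolation. A minimal sketch, using a hypothetical toy vector x and pattern rather than the data above: grepl() says whether each element matches, grep() gives the positions of the matching elements, and grep(value = TRUE) gives those elements themselves. None of them reports which alternative of the pattern produced the match.

```r
# Toy inputs (hypothetical, not the data from the question)
x <- c("alex is a person", "this is a test")
pattern <- "alex|cathy"

is_match   <- grepl(pattern, x)               # one logical per element of x
match_idx  <- grep(pattern, x)                # integer positions of matching elements
match_elts <- grep(pattern, x, value = TRUE)  # the matching elements, not the search term

print(is_match)
print(match_idx)
print(match_elts)
```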

Here is the code I'm using now. Any ideas would be much appreciated!

# Dummy data and search terms
d = data.frame(items = c("alex", "alex is a person", "this is a test", "false", "this is cathy"))
searchTerms = c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

# Return TRUE iff a search term is found in the items column, not between letters
d$matches = grepl(paste("(^| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])",
    searchTerms, "($| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", sep = "",
    collapse = "|"), d[,1], ignore.case = TRUE
)

# Subset data to the rows that matched
dMatched = d[d$matches == TRUE,]

# This is where the problem is: return the value that was originally matched with grepl above
dMatched$original = grep(paste("(^| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])",
    searchTerms, "($| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", sep = "",
    collapse = "|"), dMatched[,1], ignore.case = TRUE, value = TRUE
)

# Copy the result back into the full data frame; unmatched rows stay NA
d$original[d$matches == TRUE] = dMatched$original

2013-05-09 Steve

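You can replace the long strings of letters with '[:alpha:]'. – Thomas 2013-05-09 19:08:52

You might want to look at the 'regmatches' function. – Dason 2013-05-09 19:09:54

@Thomas: Thanks for the tip. However, [:alpha:] and the other predefined character classes don't seem to work for me. It must have something to do with my locale. From the regex documentation on character classes: "(Because their interpretation is locale- and implementation-dependent, they are best avoided.) The only portable way to specify all ASCII letters is to list them all as the character class [ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]." – Steve 2013-05-09 19:15:11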
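One possible reason the predefined classes appeared not to work (an assumption, since the comment does not show the pattern that was tried) is that [:alpha:] is only recognised inside a bracket expression, so it has to be written [[:alpha:]] or, negated, [^[:alpha:]]. Below is a sketch of the question's matching step written that way. The vector name items is introduced here for illustration, and which characters count as alphabetic still depends on the locale.

```r
items <- c("alex", "alex is a person", "this is a test", "false", "this is cathy")
searchTerms <- c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

# Note the doubled brackets: the class name sits inside a bracket expression.
# Each term must be preceded and followed by a non-letter or a string edge.
pattern <- paste0("(^|[^[:alpha:]])", searchTerms, "($|[^[:alpha:]])",
                  collapse = "|")

matches <- grepl(pattern, items, ignore.case = TRUE)
print(matches)
```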

Not exactly what you asked for, but you can use qdap's termco function to do this. It will help in cases where two names appear in the same sentence:

library(qdap) 
termco(d$items, 1:nrow(d), searchTerms) 

## > termco(d$items, 1:nrow(d), searchTerms)
##   nrow(d word.count       alex bob     cathy derrick erica ferdinand
## 1      1          1 1(100.00%)   0         0       0     0         0
## 2      2          4  1(25.00%)   0         0       0     0         0
## 3      3          4          0   0         0       0     0         0
## 4      4          1          0   0         0       0     0         0
## 5      5          3          0   0 1(33.33%)       0     0         0

To get what you're after using qdap:

2013-05-09 19:54:55

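This is a really nice solution, and I think I like it better because it also tells you whether multiple search terms match an item in the list. Thanks! – Steve 2013-05-09 21:49:56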
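For readers without qdap, the same per-term idea can be roughly sketched in base R. This is not termco's implementation, only an illustration: the names items, hits and matched_terms are hypothetical, and the sketch uses the word boundary \b (written "\\b" in an R string) with perl = TRUE in place of the explicit letter list. \b also treats digits and underscores as word characters, so the two patterns are not strictly equivalent.

```r
items <- c("alex", "alex is a person", "this is a test", "false", "this is cathy")
searchTerms <- c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

# Logical matrix: one row per item, one column per search term
hits <- sapply(searchTerms, function(term) {
  grepl(paste0("\\b", term, "\\b"), items, ignore.case = TRUE, perl = TRUE)
})

# For each item, list every term that matched, or NA when none did
matched_terms <- apply(hits, 1, function(row) {
  if (any(row)) paste(searchTerms[row], collapse = ", ") else NA_character_
})

print(hits)
print(matched_terms)
```

An item containing two names, such as "alex and cathy", would come back as "alex, cathy" rather than only the first match.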

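Thanks to Dason for the helpful hint! I was able to solve my problem by using regmatches(). Here is my code, starting from the point in the original question where the problem was: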

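# This is where the problem is: return the value that was originally matched with grepl above
# regexpr() records where the first match starts in each item and how long it is
m = regexpr(paste("(^| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])",
    searchTerms, "($| |[^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ])", sep = "",
    collapse = "|"), dMatched[,1], ignore.case = TRUE
)

# regmatches() extracts the matched substring of each item
dMatched$original = regmatches(dMatched[,1], m)

# Copy the result back into the full data frame; unmatched rows stay NA
d$original[d$matches == TRUE] = dMatched$original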

This returns the following output, which is exactly what I wanted:

             items matches original
1             alex    TRUE     alex
2 alex is a person    TRUE     alex
3   this is a test   FALSE     <NA>
4            false   FALSE     <NA>
5    this is cathy    TRUE    cathy
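One detail worth knowing about this approach: the delimiter groups are part of the pattern, so the text extracted by regmatches() includes any adjacent delimiter (for row 2 the match is "alex" plus the following space, and for row 5 it is the preceding space plus "cathy"). With ignore.case = TRUE the extracted text also keeps the item's capitalisation rather than the search term's. A hedged variant that avoids both, assuming the zero-width word boundary \b with perl = TRUE is an acceptable stand-in for the explicit letter list (it treats digits and underscores as word characters, unlike the original). The names items, found and original are illustrative.

```r
items <- c("alex", "alex is a person", "this is a test", "false", "this is cathy")
searchTerms <- c("alex", "bob", "cathy", "derrick", "erica", "ferdinand")

# \\b is zero-width, so the match is the term alone with no delimiter attached
pattern <- paste0("\\b(", paste(searchTerms, collapse = "|"), ")\\b")
m <- regexpr(pattern, items, ignore.case = TRUE, perl = TRUE)
hit <- as.vector(m != -1)

# regmatches() returns only the matched pieces, so fill a full-length vector
found <- rep(NA_character_, length(items))
found[hit] <- regmatches(items, m)

# Map the matched text back to the search term as originally spelled
original <- searchTerms[match(tolower(found), tolower(searchTerms))]

d <- data.frame(items = items, matches = hit, original = original,
                stringsAsFactors = FALSE)
print(d)
```

Because everything is computed on the full vector, the intermediate subset dMatched is not needed in this version.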

2013-05-09 19:26:03 Steve
