2016-09-30
1

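Filter a data frame based on (exact) matching values between two columns

I have a data frame with two columns. One column contains sentences and the other contains words. For example: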

words sentences 
loose Loose connection several times a day on my tablet. 
loud People don't speak loud or clear enough to hear voicemails 
vice I strongly advice you to fix this issue 
advice I strongly advice you to fix this issue 

Now I want to filter this data frame so that I keep only those rows where the word exactly matches a word in the sentence:

words sentences 
loose Loose connection several times a day on my tablet. 
loud People don't speak loud or clear enough to hear voicemails 
advice I strongly advice you to fix this issue 

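The word "vice" does not match exactly (it only occurs inside "advice"), so that row has to be removed. I have nearly 20k rows in the data frame. Can someone suggest an approach for this task that does not cost too much performance?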

Answers

2

You can try something like the following:

df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),] 

   words                                                  sentences
1  loose         Loose connection several times a day on my tablet.
2   loud People don't speak loud or clear enough to hear voicemails
4 advice                    I strongly advice you to fix this issue
+0

This approach is faster than using str_detect, so I am accepting this answer. – Venu

1

The simplest way is to use the stringr package:

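library(stringr)

df <- data.frame(words = c("went", "zero", "vice"),
                 sent = c("a man went to the park", "one minus one is 0", "any advice?"),
                 stringsAsFactors = FALSE)

# Pad both columns with spaces so that only space-delimited whole words match
df$words <- paste0(" ", df$words, " ")
df$sent <- paste0(" ", df$sent, " ")

# TRUE where the padded word occurs in the padded sentence of the same row
df$match <- str_detect(df$sent, df$words)

# Keep the matching rows and drop the helper column
df.res <- df[df$match, ]
df.res$match <- NULL
df.res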
+0

This does not give the desired output for the OP's data. – Jaap

+0

Edited. Is that still the case? –

+0

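It works now, but it is certainly not the simplest solution anymore. Also, the contents of the "sent" column have been changed, which is not what the OP intended. – Jaap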
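For what it's worth, the padding idea can probably be kept without altering the columns by building the padded strings on the fly. The following is only a sketch in base R, not the answerer's code: it swaps str_detect for grepl with fixed = TRUE, and the names padded_words, padded_sent and keep are made up for illustration.

```r
df <- data.frame(words = c("went", "zero", "vice"),
                 sent = c("a man went to the park", "one minus one is 0", "any advice?"),
                 stringsAsFactors = FALSE)

# Padded copies (hypothetical helper names); df itself is left untouched
padded_words <- paste0(" ", df$words, " ")
padded_sent <- paste0(" ", df$sent, " ")

# Row-wise literal search: is the padded word inside the padded sentence?
keep <- mapply(function(w, s) grepl(w, s, fixed = TRUE),
               padded_words, padded_sent, USE.NAMES = FALSE)

df.res <- df[keep, ]
df.res
```

Note that space padding only finds words delimited by spaces and is case sensitive, so a word followed by punctuation (such as "tablet.") or written with a capital (such as "Loose") would be missed.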

3

Using:

library(stringi) 
df[stri_detect_regex(tolower(df$sentences), paste0('\\b',df$words,'\\b')),] 

gives:

words             sentences 
1 loose   Loose connection several times a day on my tablet. 
2 loud People don't speak loud or clear enough to hear voicemails 
4 advice     I strongly advice you to fix this issue 

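Explanation:

  • Convert the capitals in the sentences to lowercase with tolower.
  • Create a vector of regular expressions by wrapping the words in words in word boundaries (\\b) with paste0.
  • Use stri_detect_regex from the stringi package to see in which rows there is a match, which results in a logical vector of TRUE & FALSE values.
  • Subset with the logical vector.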
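The steps above can be sketched one at a time. This is a minimal illustration in base R only, so grepl called row-wise through mapply stands in for stri_detect_regex; the intermediate names sentences_lower, patterns and keep are made up here. It assumes the words contain no regex metacharacters.

```r
df <- data.frame(
  words = c("loose", "loud", "vice", "advice"),
  sentences = c("Loose connection several times a day on my tablet.",
                "People don't speak loud or clear enough to hear voicemails",
                "I strongly advice you to fix this issue",
                "I strongly advice you to fix this issue"),
  stringsAsFactors = FALSE
)

# Step 1: lowercase the sentences
sentences_lower <- tolower(df$sentences)

# Step 2: one regular expression per row, the word wrapped in word boundaries
patterns <- paste0("\\b", df$words, "\\b")

# Step 3: row-wise detection (base-R stand-in for stri_detect_regex)
keep <- mapply(function(p, s) grepl(p, s, perl = TRUE),
               patterns, sentences_lower, USE.NAMES = FALSE)

# Step 4: subset with the logical vector
result <- df[keep, ]
result
```

The pattern \\bvice\\b does not match inside "advice" because there is no word boundary between "d" and "v", which is why row 3 is dropped.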

As an alternative, you can also use str_detect from the stringr package (which is actually a wrapper around the stringi package):

library(stringr) 
df[str_detect(tolower(df$sentences), paste0('\\b',df$words,'\\b')),] 

Used data:

df <- structure(list(words = c("loose", "loud", "vice", "advice"), 
        sentences = c("Loose connection several times a day on my tablet.", 
            "People don't speak loud or clear enough to hear voicemails", 
            "I strongly advice you to fix this issue", "I strongly advice you to fix this issue")), 
       .Names = c("words", "sentences"), class = "data.frame", row.names = c(NA, -4L)) 