2015-02-05 177 views
-2

Match two data sets and extract the matching words. I have the following data frame:

dataFrame <- data.frame(sent = c(1,1,2,2,3,3,3,4,5), word = c("good printer", "wireless easy", "just right size", 
                  "size perfect weight", "worth price", "website great tablet", 
                  "pan nice tablet", "great price", "product easy install"), val = c(1,2,3,4,5,6,7,8,9)) 

The data frame dataFrame looks like this:

sent    word val 
    1   good printer 1 
    1  wireless easy 2 
    2  just right size 3 
    2 size perfect weight 4 
    3   worth price 5 
    3 website great tablet 6 
    3  pan nice tablet 7 
    4   great price 8 
    5 product easy install 9 

Then I have a vector of words:

nouns <- c("printer", "wireless", "weight", "price", "tablet") 

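I need to extract only these words (nouns) from the data frame and add the extracted word to a new column (e.g. extract) in the data frame.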

I would really appreciate any help or advice. Many thanks in advance.

Desired output:

sent    word val extract 
    1   good printer 1 printer 
    1  wireless easy 2 wireless 
    2  just right size 3 size 
    2 size perfect weight 4 weight 
    3   worth price 5 price 
    3 website great tablet 6 tablet 
    3  pan nice tablet 7 tablet 
    4   great price 8 price 
    5 product easy install 9 remove this row (no match) 
+4

Didn't you ask a similar question an hour ago? How did you respond to the comments there? http://stackoverflow.com/questions/28344070/extracting-words-from-sentence-which-match-with-words-in-dictionary – lawyeR 2015-02-05 14:27:38

+1

I have closed the other question; please don't post the same question twice. – 2015-02-05 14:36:43

+0

Sorry, I wanted to change the task and unfortunately ended up duplicating it. – martinkabe 2015-02-05 14:40:39

Answers

2

Here is a simple solution using the stringi package (size is not in your nouns list, BTW).

library(stringi) 
transform(dataFrame, 
      extract = stri_extract_all(word, 
      regex = paste(nouns, collapse = "|"), 
      simplify = TRUE)) 

# sent     word val extract 
# 1 1   good printer 1 printer 
# 2 1  wireless easy 2 wireless 
# 3 2  just right size 3  <NA> 
# 4 2 size perfect weight 4 weight 
# 5 3   worth price 5 price 
# 6 3 website great tablet 6 tablet 
# 7 3  pan nice tablet 7 tablet 
# 8 4   great price 8 price 
# 9 5 product easy install 9  <NA> 
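
The output above keeps unmatched rows as <NA>, while the question asks for those rows to be removed. A minimal base R sketch of that last step, assuming only the first match per row is wanted. Here regexpr and regmatches stand in for stri_extract_all, and the name matched is introduced only for illustration:

```r
dataFrame <- data.frame(
  sent = c(1, 1, 2, 2, 3, 3, 3, 4, 5),
  word = c("good printer", "wireless easy", "just right size",
           "size perfect weight", "worth price", "website great tablet",
           "pan nice tablet", "great price", "product easy install"),
  val = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  stringsAsFactors = FALSE
)
nouns <- c("printer", "wireless", "weight", "price", "tablet")

# One alternation pattern, as in the answer above
pattern <- paste(nouns, collapse = "|")

# Position of the first match in each row (-1 means no match)
m <- regexpr(pattern, dataFrame$word)

# Fill the new column only where a match exists
dataFrame$extract <- NA_character_
dataFrame$extract[m != -1] <- regmatches(dataFrame$word, m)

# Drop the rows without a match (assumed name: matched)
matched <- dataFrame[!is.na(dataFrame$extract), ]
```

With the stringi result, the same filter applies: keep the rows where extract is not NA.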
+1

Thanks a lot, David, great work. This is exactly what I was looking for. – martinkabe 2015-02-05 14:39:43

0

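Here is another solution. It is slightly more complicated, but it also removes the rows where there is no match between nouns and dataFrame$word.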

dataFrame <- data.frame(sent = c(1,1,2,2,3,3,3,4,5), 
                        word = c("good printer", "wireless easy", "just right size", 
                                 "size perfect weight", "worth price", "website great tablet", 
                                 "pan nice tablet", "great price", "product easy install"), 
                        val = c(1,2,3,4,5,6,7,8,9)) 

nouns <- c("printer", "wireless", "weight", "price", "tablet") 

test <- character()   # matched noun for each row that is kept 
df.del <- integer()   # indices of rows without any match 

for (i in seq_len(nrow(dataFrame))) { 
  # nouns that occur as whole words in row i 
  hits <- intersect(nouns, unlist(strsplit(as.character(dataFrame$word[i]), " "))) 
  if (length(hits) == 0) { 
    df.del <- c(df.del, i) 
  } else { 
    test <- c(test, hits[1])   # if several nouns match, keep the first 
  } 
} 

# drop unmatched rows; the guard avoids emptying the data frame when every row matches 
if (length(df.del) > 0) dataFrame <- dataFrame[-df.del, ] 
dataFrame$extract <- test 

Output, without the rows that had no match:

sent     word val extract 
1 1   good printer 1 printer 
2 1  wireless easy 2 wireless 
4 2 size perfect weight 4 weight 
5 3   worth price 5 price 
6 3 website great tablet 6 tablet 
7 3  pan nice tablet 7 tablet 
8 4   great price 8 price 
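
The loop above produces one extracted noun per row. If a row can contain several nouns, one option is to collect all of them. A rough sketch under that assumption; the sample sentences and the names tokens, hits, keep and result are made up for illustration and are not part of the question:

```r
nouns <- c("printer", "wireless", "weight", "price", "tablet")

# Hypothetical input: the second sentence contains three nouns
words <- c("good printer", "wireless printer price", "just right size")

# Split each sentence into words and keep those that are nouns
tokens <- strsplit(words, " ", fixed = TRUE)
hits <- lapply(tokens, function(tok) intersect(nouns, tok))

# Join multiple matches into one string per row
extract <- vapply(hits, paste, character(1), collapse = ", ")

# Keep only the rows with at least one match
keep <- lengths(hits) > 0
result <- data.frame(word = words[keep], extract = extract[keep],
                     stringsAsFactors = FALSE)
```

Note that intersect(nouns, tok) returns the matches in the order of nouns, not in the order they appear in the sentence.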
0

Here is another solution using loops and an if statement.

word <- dataFrame$word 
n <- length(word) 
m <- length(nouns) 
# default label for rows in which no noun is found 
extract <- rep("remove", n) 

for (i in seq_len(n)) { 
  g <- as.character(word[i]) 
  for (j in seq_len(m)) { 
    # grepl matches substrings; if several nouns match, the last one wins 
    if (grepl(nouns[j], g)) extract[i] <- nouns[j] 
  } 
} 

dataFrame$extract <- extract 

# sent     word val extract 
#1 1   good printer 1 printer 
#2 1  wireless easy 2 wireless 
#3 2  just right size 3 remove 
#4 2 size perfect weight 4 weight 
#5 3   worth price 5 price 
#6 3 website great tablet 6 tablet 
#7 3  pan nice tablet 7 tablet 
#8 4   great price 8 price 
#9 5 product easy install 9 remove
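
One thing to watch with grepl is that it matches substrings, so a shorter noun can match inside a longer word. A small sketch of the difference, assuming a hypothetical noun "table" that is not in the original nouns vector:

```r
# "table" is a made-up entry used only to show the pitfall
nouns <- c("table", "price")
sentence <- "pan nice tablet"

# Plain substring test: "table" is found inside "tablet"
substring_hit <- vapply(nouns, function(nn) grepl(nn, sentence, fixed = TRUE),
                        logical(1))

# Whole-word test: \\b anchors the noun at word boundaries
whole_word_hit <- vapply(nouns,
                         function(nn) grepl(paste0("\\b", nn, "\\b"), sentence, perl = TRUE),
                         logical(1))
```

For the nouns in the question this makes no difference, since none of them is a substring of another word in the data.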