Extract only the relevant comments from a list of comments

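Continuing my exploration of text analysis, I have hit another roadblock. I understand the logic but don't know how to do this in R. Here is what I want to do: I have 2 CSVs. The first contains 10,000 comments and the second contains a list of words. I want to select all the comments that contain any of the words in the second CSV. How can I go about it?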

For example:

**CSV 1:** 
this is a sample set 
the comments are not real 
this is a random set of words 
hope this helps the problem case 
thankyou for helping out 
i have learned a lot here 
feel free to comment 

**CSV 2** 
sample 
set 
comment 

**Expected output:** 
this is a sample set 
the comments are not real 
this is a random set of words 
feel free to comment

Please note: different forms of a word should also count as matches. For example, both "comment" and "comments" are considered.


2016-05-25 eclairs

The two files are lists of comments and words, respectively. – eclairs

Can you make your example reproducible? – Sotos

We can `paste` the elements of the second dataset into a single pattern and then use `grep` on the comments.

v1 <- scan("file2.csv", what = "")   # words to look for
lines1 <- readLines("file1.csv")     # one comment per line
grep(paste(v1, collapse = "|"), lines1, value = TRUE)
#[1] "this is a sample set"          "the comments are not real"
#[3] "this is a random set of words" "feel free to comment"
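
For a version that runs without any files, here is a minimal sketch of the same idea. The vectors `comments` and `words` are stand-ins made up for the contents of the two CSVs, assuming one item per line. Because `grep` matches substrings by default, the pattern `comment` also picks up `comments`.

```r
# Hypothetical in-memory stand-ins for file1.csv and file2.csv.
comments <- c(
  "this is a sample set",
  "the comments are not real",
  "this is a random set of words",
  "hope this helps the problem case",
  "thankyou for helping out",
  "i have learned a lot here",
  "feel free to comment"
)
words <- c("sample", "set", "comment")

# Join the words into one regular expression: "sample|set|comment".
pattern <- paste(words, collapse = "|")

# value = TRUE returns the matching comments rather than their indices.
relevant <- grep(pattern, comments, value = TRUE)
print(relevant)
```

If case should not matter, `grep` also accepts `ignore.case = TRUE`. Substring matching cuts both ways: `set` would also match a word such as `reset`.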


2016-05-25 09:39:57 akrun

First, create two objects called `lines` and `words.to.match` from your files. You could do it like this:

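# header=FALSE assumes the files have no header row, as in the example above
lines <- read.csv('csv1.csv', header=FALSE, stringsAsFactors=FALSE)[[1]]
words.to.match <- read.csv('csv2.csv', header=FALSE, stringsAsFactors=FALSE)[[1]]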

Let's say they look like this:

lines <- c(
    'this is a sample set', 
    'the comments are not real', 
    'this is a random set of words', 
    'hope this helps the problem case', 
    'thankyou for helping out', 
    'i have learned a lot here', 
    'feel free to comment' 
) 
words.to.match <- c('sample', 'set', 'comment')

Then you can compute the matches with two nested `*apply` functions:

matches <- mapply(
    function(words, line)
        any(sapply(words, grepl, line, fixed=TRUE)),
    list(words.to.match),
    lines
)
matched.lines <- lines[which(matches)]

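What is going on here? I use `mapply` to evaluate the function once for each line in `lines`, passing `words.to.match` as the other argument. Note that `list(words.to.match)` has length 1, so this argument is simply recycled for every call. Then, inside the `mapply` function, I call `sapply` to check whether any of the words matches the line (each match is tested with `grepl`).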
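
The same per-line logic can be spelled out with a named helper, which may make the two levels easier to follow. This is only a sketch of an equivalent formulation, not the code from the answer: the helper name `line.has.word` is made up for illustration, and `vapply` is swapped in for `mapply`/`sapply` so that the result type is checked.

```r
lines <- c(
  "this is a sample set",
  "the comments are not real",
  "this is a random set of words",
  "hope this helps the problem case",
  "thankyou for helping out",
  "i have learned a lot here",
  "feel free to comment"
)
words.to.match <- c("sample", "set", "comment")

# Hypothetical helper: TRUE if at least one word occurs in the given line.
line.has.word <- function(line, words) {
  hits <- vapply(words, function(w) grepl(w, line, fixed = TRUE), logical(1))
  any(hits)
}

# Outer loop over lines; the full word list is passed unchanged to every call.
matches <- vapply(lines, line.has.word, logical(1),
                  words = words.to.match, USE.NAMES = FALSE)
matched.lines <- lines[matches]
print(matched.lines)
```

Passing `words = words.to.match` as a fixed extra argument plays the role that recycling `list(words.to.match)` plays in the `mapply` call.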

This is not necessarily the most efficient solution, but I find it easier to understand. Another way to compute `matches` is:

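matches <- lapply(words.to.match, grepl, lines, fixed=TRUE)  # one logical vector per word
matches <- do.call("rbind", matches)                         # words x lines logical matrix
matches <- apply(matches, 2, any)                            # TRUE where any word matched the line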

I don't like this solution, because you need a `do.call("rbind", ...)`, which is a bit hacky.
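
If the `rbind` step is the main objection, one possible alternative is to combine the per-word logical vectors with `Reduce` and the element-wise `|` operator. This is a sketch of a different technique, not something proposed in the answer; the data are the same example vectors.

```r
lines <- c(
  "this is a sample set",
  "the comments are not real",
  "this is a random set of words",
  "hope this helps the problem case",
  "thankyou for helping out",
  "i have learned a lot here",
  "feel free to comment"
)
words.to.match <- c("sample", "set", "comment")

# One logical vector per word, each as long as `lines`.
per.word <- lapply(words.to.match, grepl, lines, fixed = TRUE)

# Element-wise OR across the list: TRUE where any word matched the line.
matches <- Reduce(`|`, per.word)
matched.lines <- lines[matches]
print(matched.lines)
```

This avoids building the intermediate matrix, at the cost of being a little less explicit about the words-by-lines structure.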


2016-05-25 09:55:01 bogdata
