Count occurrences of a specific word in the rows of a data frame in R

I have a dataset with 2 columns and multiple rows. The first column is the ID, the second column is the text that belongs to it.

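I want to add more columns that sum up how many times certain strings occur in the text of each row. The strings would be "\n Positive\n", "\n Neutral\n" and "\n Negative\n".

An example of the dataset: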

Id, Content 
2356, I like cheese.\n Positive\nI don't want to be here.\n Negative\n 
3456, I am alone.\n Neutral\n

In the end it should look like this:

Id, Content,Positiv, Neutral, Negativ 
2356, I like cheese.\n Positive\nI don't want to be here.\n Negative\n,1 ,0 ,1 
3456, I am alone.\n Neutral\n, 0, 1, 0

So far I have tried it like this, but it does not give the correct result:

getCount1 <- function(data, keyword) 
{ 
Positive <- str_count(Dataset$CONTENT, keyword) 
return(data.frame(data,Positive)) 
} 
Stufe1 <-getCount1(Dataset,'\n Positive\n') 
################################################################ 
getCount2 <- function(data, keyword) 
{ 
Neutral <- str_count(Stufe1$CONTENT, keyword) 
return(data.frame(data,Neutral)) 
} 
Stufe2 <-getCount2(Stufe1,'\n Neutral\n') 
##################################################### 
getCount3 <- function(data, keyword) 
{ 
Negative <- str_count(Stufe2$CONTENT, keyword) 
return(data.frame(data,Negative)) 
} 
Stufe3 <-getCount3(Stufe2,'\n Negative\n')


2014-07-03 Carlo

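And in this case, the number of matches should be zero, right? Look up 'gregexpr' and 'regmatches' as a starting point. Alternatively, there are several packages you could use, such as "stringr" or "stringi". – A5C1D2H2I1M1N2O1R2T1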

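Welcome to StackOverflow! Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [minimal reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). This will make it much easier for others to help you. – Jaap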

I assume this is what you need.

Sample data:

id <- c(1:4) 
text <- c('I have a Dataset with 2 columns a', 
      'nd multiple rows. first column ID', 'second column the text which', 
      'n the text which belongs to it.') 
dataset <- data.frame(id,text)

Function to find the count:

library(stringr) 
getCount <- function(data,keyword) 
{ 
    wcount <- str_count(data$text, keyword)  # use the `data` argument, not the global `dataset`
    return(data.frame(data,wcount)) 
}

Calling getCount should then give the updated dataset:

> getCount(dataset,'second')
  id                              text wcount
1  1 I have a Dataset with 2 columns a      0
2  2 nd multiple rows. first column ID      0
3  3      second column the text which      1
4  4   n the text which belongs to it.      0


2014-07-03 09:59:23

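This works fairly well, but there is still a problem, because I am not searching for a single word but for an expression. It works if I use it with "Positive", but if I try it with the expression \n Positive\n, it does not. – Carlo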

Can you update the question with a better sample? I just tried '\n Positive' and it gave me the proper count. –

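I have updated the question with a better example and posted the code based on your solution, but in my case it does not work. It only works when I search for Positive, Neutral and Negative. – Carlo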
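One possible explanation for the behaviour described in these comments, offered only as a guess because the real data is not shown: the text may contain a literal backslash followed by `n` (two characters) rather than an actual newline. The sketch below uses base R only; `count_fixed` and both sample strings are hypothetical names introduced for illustration.

```r
# Hypothetical samples: one with real newlines, one with a literal
# backslash followed by "n" (as might happen after a CSV import).
real_newline <- "I like cheese.\n Positive\n"
literal_bs   <- "I like cheese.\\n Positive\\n"

# Count literal (non-regex) occurrences of `pattern` in each element of `x`.
count_fixed <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, fixed = TRUE)))
}

count_fixed(real_newline, "\n Positive\n")    # pattern uses a real newline
count_fixed(literal_bs,   "\n Positive\n")    # real newline vs. literal "\n" text
count_fixed(literal_bs,   "\\n Positive\\n")  # escaped backslash matches the literal text
```

If the second call returns 0 on the real data while the third finds the labels, the column holds literal backslashes and the pattern needs the doubled `\\n`.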

To offer some options, let's start with a slightly modified version of @on_the_shores_of_linux_sea's dataset.

id <- c(1:4) 
text <- c('I have a Dataset with 2 columns a', 
      'nd multiple rows. first column ID rows', 
      'second column the text which', 
      'n the text which belongs to it.') 
dataset <- data.frame(id,text)

Sticking with base R functions, you can come up with a function like this.

wordCounter <- function(invec, word, ...) { 
    vapply(regmatches(invec, gregexpr(word, invec, ...)), length, 1L) 
}

You would use it like this:

## allows other arguments to gregexpr 
wordCounter(dataset$text, "id", ignore.case = TRUE) 
# [1] 0 1 0 0 
wordCounter(dataset$text, "id") 
# [1] 0 0 0 0 
wordCounter(dataset$text, "rows") 
# [1] 0 2 0 0 
wordCounter(dataset$text, "second", ignore.case = TRUE) 
# [1] 0 0 1 0
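Applied to the question itself, the same helper could add one count column per label. This is a minimal sketch, assuming the `Content` column holds real newline characters; the `Dataset` below is a hypothetical reconstruction of the question's example, and `sentiments` is a name introduced here for illustration.

```r
# Hypothetical reconstruction of the question's example data;
# assumes Content contains real newlines around each label.
Dataset <- data.frame(
  Id = c(2356, 3456),
  Content = c("I like cheese.\n Positive\nI don't want to be here.\n Negative\n",
              "I am alone.\n Neutral\n"),
  stringsAsFactors = FALSE
)

# Same helper as above: count matches of `word` in each element of `invec`.
wordCounter <- function(invec, word, ...) {
  vapply(regmatches(invec, gregexpr(word, invec, ...)), length, 1L)
}

sentiments <- c("Positive", "Neutral", "Negative")
for (label in sentiments) {
  # fixed = TRUE treats the pattern as a literal string, not a regex
  pattern <- paste0("\n ", label, "\n")
  Dataset[[label]] <- wordCounter(Dataset$Content, pattern, fixed = TRUE)
}

Dataset[, c("Id", sentiments)]
```

Matches do not overlap, so two labels that share a single newline between them (for example "\n Positive\n Negative\n") would be undercounted with these patterns.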

Another option, if you want to go with a ready-made solution, would be to use the "stringi" package, which has a nice set of stri_count* functions. Here, I've used stri_count_fixed:

library(stringi) 
stri_count_fixed(dataset$text, "rows") 
# [1] 0 2 0 0


2014-07-03 10:17:21 A5C1D2H2I1M1N2O1R2T1

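This can also be done without loading any extra library, as pointed out by Ananda. My solution is as follows, provided that the 2-column table is called dataset and the string to look for is mystring: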

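# Count literal (fixed = TRUE) occurrences of `motif` in a single string.
countOccurr = function(text,motif) { 
res = gregexpr(motif,text,fixed=TRUE)[[1]] 
# gregexpr returns -1 when there is no match
ifelse(res[1] == -1, 0, length(res)) 
} 

dataset = cbind(dataset, count = vapply(dataset[,2], countOccurr, 1, motif=mystring))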

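Beware: the second column of your data frame must be of mode character if you want to avoid problems (the data frame given as sample data by @on-the-shores-of-linux-sea keeps mode factor, which is fine for his solution but not for mine). Otherwise, cast it using as.character(dataset[,2]).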
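The conversion mentioned above can be sketched as follows. The two-row `dataset` and the value of `mystring` are hypothetical; `stringsAsFactors = TRUE` is set explicitly to mimic the default of R versions before 4.0.0, where text columns became factors.

```r
# Hypothetical data; stringsAsFactors = TRUE mimics the pre-4.0.0 default.
dataset <- data.frame(id = 1:2,
                      text = c("first rows rows", "second"),
                      stringsAsFactors = TRUE)
is.factor(dataset$text)   # the text column is a factor here

# Cast the column to character before counting.
dataset$text <- as.character(dataset$text)

# Same helper as in the answer above.
countOccurr <- function(text, motif) {
  res <- gregexpr(motif, text, fixed = TRUE)[[1]]
  ifelse(res[1] == -1, 0, length(res))
}

mystring <- "rows"  # hypothetical search string
dataset$count <- vapply(dataset$text, countOccurr, 1,
                        motif = mystring, USE.NAMES = FALSE)
dataset
```

Alternatively, passing `stringsAsFactors = FALSE` when building the data frame avoids the factor column in the first place.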


2014-07-03 11:03:14 jaybee
