Compute n-grams for each row of text data in R

I have a data column in the following format:

Text

Hello world 
Hello 
How are you today 
I love stackoverflow 
blah blah blahdy

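I want to compute the 3-grams for each row of this dataset, possibly with the textcnt() function from the tau package. However, when I tried it, it gave me a single numeric vector with the n-grams for the entire column. How can I apply this function to each observation in my data separately?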

2013-07-09 Brian Vanover

You can use 'sapply'. –

@TylerRinker Thanks, Tyler. However, sapply did not work. I used it like this: > trigram_title <- sapply(eta_dedup$title, textcnt(eta_dedup$title, method = "ngram")) Error in match.fun(FUN): 'textcnt(eta_dedup$title, method = "ngram")' is not a function, character or symbol –

It is better to *show what you did* than to just mention it. – Arun
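
For reference, the per-row computation the question asks for can be sketched in base R without any package. This is only an illustration under assumptions: `word_ngrams` is a hypothetical helper (not part of tau or any other package), and it treats an n-gram as a sequence of n whitespace-separated, lower-cased words.

```r
# Hypothetical helper (not from any package): word n-grams for one string.
word_ngrams <- function(line, n = 3L) {
  # Lower-case the line and split it on runs of whitespace.
  words <- strsplit(tolower(trimws(line)), "[[:space:]]+")[[1]]
  # Lines with fewer than n words have no n-grams.
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1L)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1L)], collapse = " "),
         character(1))
}

Text <- c("Hello world", "Hello", "How are you today",
          "I love stackoverflow", "blah blah blahdy")

# lapply() calls the helper once per element, so each row gets its own result.
trigrams_by_row <- lapply(Text, word_ngrams, n = 3L)
```

The result is a list with one character vector per row; rows with fewer than three words yield an empty vector.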

Is this what you're after?

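library("RWeka")
library("tm")

# Tokenizer that emits word trigrams only (min = max = 3)
TrigramTokenizer <- function(x) NGramTokenizer(x,
                                Weka_control(min = 3, max = 3))
# Using Tyler's method of making the 'Text' object here
# VCorpus is used because newer tm versions ignore custom tokenizers
# for the SimpleCorpus that Corpus() returns
tdm <- TermDocumentMatrix(VCorpus(VectorSource(Text)),
                          control = list(tokenize = TrigramTokenizer))

inspect(tdm)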

A term-document matrix (4 terms, 5 documents) 

Non-/sparse entries: 4/16 
Sparsity   : 80% 
Maximal term length: 20 
Weighting   : term frequency (tf) 

                      Docs
Terms                  1 2 3 4 5
  are you today        0 0 1 0 0
  blah blah blahdy     0 0 0 0 1
  how are you          0 0 1 0 0
  i love stackoverflow 0 0 0 1 0

2013-07-09 19:35:59 Ben

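Thanks, Ben. This lets me easily compute token similarity between strings. –

I get the following error when trying the "tdm <-" line: Error in .jnew(name): java.lang.ClassNotFoundException –

Sounds like a problem with your Java installation; perhaps the path is not set correctly. – Ben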
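
On the token-similarity use mentioned in the comments: the thread does not say which measure was used, so the following is only a sketch that assumes Jaccard similarity over sets of trigrams. The helper `jaccard` is hypothetical and uses base R only.

```r
# Hypothetical helper: Jaccard similarity between two sets of tokens.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  # Similarity is undefined when both sets are empty.
  if (length(a) == 0L && length(b) == 0L) return(NA_real_)
  length(intersect(a, b)) / length(union(a, b))
}

# Trigrams of two example strings (made up for illustration).
x <- c("how are you", "are you today")
y <- c("how are you", "are you there")

sim <- jaccard(x, y)  # one shared trigram out of three distinct ones
```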

Here is an approach using the qdap package:

## Text <- readLines(n=5) 
## Hello world 
## Hello 
## How are you today 
## I love stackoverflow 
## blah blah blahdy 

library(qdap) 
ngrams(Text, seq_along(Text), 3)

This is a list, and you can access its components with typical list indexing.

Edit:

As for your first approach, try it like this:

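library(tau)
# Pass the function itself; extra arguments are forwarded to each call.
# Note: method = "ngram" counts character n-grams; method = "string" counts word sequences.
sapply(Text, textcnt, method = "ngram")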

## sapply(eta_dedup$title, textcnt, method = "ngram")
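
The error reported in the question comments arises because sapply() expects a function as its second argument, not the result of calling one. A minimal base R sketch of the difference, where `count_words` is a hypothetical stand-in for a per-string function such as textcnt():

```r
# Hypothetical stand-in for a per-string counting function.
count_words <- function(x, sep = " ") {
  length(strsplit(x, sep, fixed = TRUE)[[1]])
}

Text <- c("Hello world", "Hello", "How are you today")

# Wrong: count_words(Text) is evaluated first, and its result (a number)
# is not a function, so match.fun() inside sapply() raises an error.
res_wrong <- tryCatch(sapply(Text, count_words(Text)),
                      error = function(e) "error")

# Right: pass the function itself; extra arguments go to every call.
res_right <- sapply(Text, count_words, sep = " ")
```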

2013-07-09 19:05:12

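Thanks, Tyler! I will explore your qdap package. For now I think I will use Ben's RWeka/tm solution, because it presents the data in a way that makes it easy to compute similarity. –

Here is how to do it with the quanteda package: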

txt <- c("Hello world", "Hello", "How are you today", "I love stackoverflow", "blah blah blahdy")

require(quanteda)
# Newer quanteda releases dropped the ngrams argument of dfm(); there the equivalent is
# dfm(tokens_ngrams(tokens(txt), n = 3, concatenator = " "))
dfm(txt, ngrams = 3, concatenator = " ", verbose = FALSE)
## Document-feature matrix of: 5 documents, 4 features.
## 5 x 4 sparse Matrix of class "dfmSparse"
##        features
## docs    how are you are you today i love stackoverflow blah blah blahdy
##   text1           0             0                    0                0
##   text2           0             0                    0                0
##   text3           1             1                    0                0
##   text4           0             0                    1                0
##   text5           0             0                    0                1
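
To make the shape of that output concrete, here is a rough base R sketch of how a document-by-feature count matrix can be assembled from per-document trigrams. It is not how quanteda works internally; the `docs` list is typed in by hand for illustration, and the matrix is dense rather than sparse.

```r
# Per-document trigrams, written out by hand for the five example lines.
docs <- list(
  text1 = character(0),
  text2 = character(0),
  text3 = c("how are you", "are you today"),
  text4 = "i love stackoverflow",
  text5 = "blah blah blahdy"
)

# Every distinct trigram becomes one column (feature).
features <- unique(unlist(docs))

# Count each feature within each document; vapply() returns features x docs,
# so transpose to get one row per document.
m <- t(vapply(docs,
              function(d) as.integer(table(factor(d, levels = features))),
              integer(length(features))))
colnames(m) <- features
```

Each row of `m` then holds the trigram counts for a single document, which is the per-observation view the question asks for.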

2015-12-10 10:47:58

I guess the OP wanted to use tau, but the others did not use that package. Here is how you do it in tau:

library(tau)

data <- "Hello world\nHello\nHow are you today\nI love stackoverflow\nblah blah blahdy"

# method = "string" counts word sequences; n = 2L gives bigrams
bigram_tau <- textcnt(data, n = 2L, method = "string", recursive = TRUE)

This will be a trie, but you can format it as a more classic data frame with the tokens and their sizes:

r <- data.frame(counts = unclass(bigram_tau), size = nchar(names(bigram_tau)))
format(r)

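I strongly recommend tau because it performs very well on large data. I have used it to create bigrams from 1 GB of data, and it was both fast and smooth.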

2016-07-10 19:39:24 ambodi
