How to extract n-grams in R when RWeka cannot be installed

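I want to get n-grams from this vector in R. I cannot install RWeka/rJava no matter what I do, so I found the ngram package as an alternative. However, the script below has a problem and does not work.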

library(tm) 
library(ngram) 
text=c("A vector of n-grams","listed in decreasing blocks","it is a vector","it works a little differently","there are many vectors","another vector") 
myCorpus=VCorpus(VectorSource(text)) 
bigram_tokenizer <- function(x) 
ngram_asweka(x, min = 2, max = 2) 
bigram_tdm <- DocumentTermMatrix(myCorpus) 
findFreqTerms(bigram_tdm, 3)

What is causing the character(0) result, and how do I deal with it? Thanks!

2017-04-08 santoku

"vector" also occurs only twice... Try adding an extra string: 'text <- c(text, "another vector")'

'character(0)' means that nothing was found

Thanks @EnriquePérezHerrero, I added it and the result now returns "vector". But since I specified min = 2 for n, why is no bigram containing "vector" returned? – santoku
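
A likely reason no bigrams come back: in the script above, bigram_tokenizer is defined but never handed to DocumentTermMatrix(), so tm falls back to its default word tokenizer and the matrix only holds single words. The sketch below passes the tokenizer through the control list. It assumes the tm and ngram packages are installed. Wrapping the input in content() and paste() is my own assumption about how to turn a tm document into the single string that ngram_asweka() expects, and the threshold of 2 is chosen only for illustration.

```r
library(tm)
library(ngram)

text <- c("A vector of n-grams", "listed in decreasing blocks", "it is a vector",
          "it works a little differently", "there are many vectors", "another vector")

# Same corpus type as in the question (VCorpus).
my_corpus <- VCorpus(VectorSource(text))

# tm calls the tokenizer with one document at a time, so pull out the
# text and collapse it into a single string before asking for bigrams.
# Caveat: ngram_asweka() fails on a document with fewer than 2 words.
bigram_tokenizer <- function(x) {
  ngram_asweka(paste(content(x), collapse = " "), min = 2, max = 2)
}

# The tokenizer only takes effect when it is passed in `control`.
bigram_dtm <- DocumentTermMatrix(my_corpus,
                                 control = list(tokenize = bigram_tokenizer))

Terms(bigram_dtm)                       # should now list two-word terms
findFreqTerms(bigram_dtm, lowfreq = 2)  # bigrams occurring at least twice
```

Note that tm lower-cases text by default when building the matrix, so counts may differ from a case-sensitive tally.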

Finding bigrams is easier with the ngram package: https://cran.r-project.org/web/packages/ngram/vignettes/ngram-guide.pdf

library(ngram) 

text <- c("A vector of n-grams", 
     "listed in decreasing blocks", 
     "it is a vector", 
     "it works a little differently", 
     "there are many vectors", 
     "a vector") 
bigrams <- ngram(text, n = 2) 
phrase_table <- get.phrasetable(bigrams) 

phrase_table 

#                ngrams freq       prop
# 1            a vector    2 0.11764706
# 2            a little    1 0.05882353
# 3  little differently    1 0.05882353
# 4            it works    1 0.05882353
# 5           there are    1 0.05882353
# 6   decreasing blocks    1 0.05882353
# 7       in decreasing    1 0.05882353
# 8           listed in    1 0.05882353
# 9               it is    1 0.05882353
# 10               is a    1 0.05882353
# 11           A vector    1 0.05882353
# 12         of n-grams    1 0.05882353
# 13          vector of    1 0.05882353
# 14            works a    1 0.05882353
# 15           are many    1 0.05882353
# 16       many vectors    1 0.05882353
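
To get something close to what findFreqTerms() was meant to do in the question, the phrase table can be filtered on its freq column, since get.phrasetable() returns an ordinary data frame. A minimal sketch, assuming the same text vector as above; the cutoff of 2 and the name `frequent` are arbitrary choices for illustration:

```r
library(ngram)

text <- c("A vector of n-grams",
          "listed in decreasing blocks",
          "it is a vector",
          "it works a little differently",
          "there are many vectors",
          "a vector")

phrase_table <- get.phrasetable(ngram(text, n = 2))

# Keep only bigrams that occur at least twice (case-sensitive counts).
frequent <- phrase_table[phrase_table$freq >= 2, ]
frequent
```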

2017-04-08 11:56:07

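Thanks! Very helpful. I am wondering: if I want to create a TDM, how can I pass this bigram as a control parameter, or should I first convert it to a bigram phrase_table and then create the TDM? – santoku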
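
For the second route mentioned in this comment, one possible sketch is to build one phrase table per document and cross-tabulate them into a term-by-document count table. The result is not a tm TermDocumentMatrix object, just a base R table laid out the same way. The names per_doc, long and tdm_like are made up for illustration, and the code assumes every document has at least two words.

```r
library(ngram)

text <- c("A vector of n-grams",
          "listed in decreasing blocks",
          "it is a vector",
          "it works a little differently",
          "there are many vectors",
          "a vector")

# One bigram phrase table per document, tagged with the document index.
per_doc <- lapply(seq_along(text), function(i) {
  pt <- get.phrasetable(ngram(text[i], n = 2))
  data.frame(doc    = i,
             ngrams = trimws(pt$ngrams),  # drop any padding around the phrase
             freq   = pt$freq,
             stringsAsFactors = FALSE)
})
long <- do.call(rbind, per_doc)

# Rows are bigrams, columns are documents, cells are counts.
tdm_like <- xtabs(freq ~ ngrams + doc, data = long)

# Bigrams whose total count across documents is at least 2.
rownames(tdm_like)[rowSums(tdm_like) >= 2]
```

Counts here are case-sensitive, so "A vector" and "a vector" stay separate rows; lower-casing the text first with tolower() would merge them.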
