2017-04-08 66 views
0

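How to extract n-grams in R when RWeka cannot be installed

I am trying to get n-grams from this vector in R. I could not install RWeka/rJava no matter what I tried, so I looked at the ngram package as an alternative. However, this script has a problem and does not work.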

library(tm) 
library(ngram) 
text=c("A vector of n-grams","listed in decreasing blocks","it is a vector","it works a little differently","there are many vectors","another vector") 
myCorpus=VCorpus(VectorSource(text)) 
bigram_tokenizer <- function(x) 
ngram_asweka(x, min = 2, max = 2) 
bigram_tdm <- DocumentTermMatrix(myCorpus) 
findFreqTerms(bigram_tdm, 3) 

What is causing the character(0) result, and how do I deal with it? Thanks!

+1

"vector" only appears twice... try adding an extra string: 'text <- c(text, "another vector")'

+0

'character(0)' means that nothing was found

+0

Thanks @EnriquePérezHerrero, I added it and the result now returns "vector", but since I specified min = 2 for n, why doesn't it return a bigram such as "a vector"? – santoku

Answer

2

Finding bigrams is easier with the ngram package: https://cran.r-project.org/web/packages/ngram/vignettes/ngram-guide.pdf

library(ngram) 

text <- c("A vector of n-grams", 
     "listed in decreasing blocks", 
     "it is a vector", 
     "it works a little differently", 
     "there are many vectors", 
     "a vector") 
bigrams <- ngram(text, n = 2) 
phrase_table <- get.phrasetable(bigrams) 

phrase_table 

#               ngrams freq       prop
#1            a vector    2 0.11764706
#2            a little    1 0.05882353
#3  little differently    1 0.05882353
#4            it works    1 0.05882353
#5           there are    1 0.05882353
#6   decreasing blocks    1 0.05882353
#7       in decreasing    1 0.05882353
#8           listed in    1 0.05882353
#9               it is    1 0.05882353
#10               is a    1 0.05882353
#11           A vector    1 0.05882353
#12         of n-grams    1 0.05882353
#13          vector of    1 0.05882353
#14            works a    1 0.05882353
#15           are many    1 0.05882353
#16       many vectors    1 0.05882353
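
Note that the table above counts "a vector" and "A vector" separately. As a rough sketch of how to get something closer to what findFreqTerms does, you could lowercase the text first and then filter the phrase table by frequency. The names min_freq and frequent are just illustrative, and the trimws() call is there on the assumption that get.phrasetable may leave trailing whitespace on the n-gram strings:

```r
library(ngram)

text <- c("A vector of n-grams",
          "listed in decreasing blocks",
          "it is a vector",
          "it works a little differently",
          "there are many vectors",
          "a vector")

# Lowercase first so "A vector" and "a vector" are counted together
bigrams <- ngram(tolower(text), n = 2)
phrase_table <- get.phrasetable(bigrams)

# Tidy the n-gram strings (assumption: they may carry trailing whitespace)
phrase_table$ngrams <- trimws(phrase_table$ngrams)

# Keep only bigrams seen at least min_freq times (illustrative threshold)
min_freq <- 2
frequent <- phrase_table[phrase_table$freq >= min_freq, ]
frequent
```

Lowering min_freq to 1 gives back the full table, so the threshold plays the same role as the second argument of findFreqTerms.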
+0

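Thanks! Most helpful. I wonder: if I want to create a tdm, how do I pass this bigram tokenizer as a control parameter, or should I first convert the text into a bigram phrase_table and then create the tdm? – santoku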
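
On the control parameter: in the script from the question, bigram_tokenizer is defined but never handed to DocumentTermMatrix, so tm falls back to its default single-word tokenization. That is why only single words such as "vector" come back. A minimal sketch of passing the tokenizer through the control list is below. It assumes that ngram_asweka is happy with one string per document, so the document content is collapsed with paste() first; treat it as one possible approach, not the only one:

```r
library(tm)
library(ngram)

text <- c("A vector of n-grams",
          "listed in decreasing blocks",
          "it is a vector",
          "it works a little differently",
          "there are many vectors",
          "another vector")
myCorpus <- VCorpus(VectorSource(text))

# The tokenizer receives one document at a time; turn it into a single
# string before asking ngram_asweka for its bigrams
bigram_tokenizer <- function(x) {
  ngram_asweka(paste(as.character(x), collapse = " "), min = 2, max = 2)
}

# Hand the tokenizer to tm through the control list
bigram_tdm <- DocumentTermMatrix(myCorpus,
                                 control = list(tokenize = bigram_tokenizer))

# Bigrams that occur at least twice across the corpus
findFreqTerms(bigram_tdm, lowfreq = 2)
```

Every document here has at least two words. A document shorter than n words may need a guard in the tokenizer, since there are no bigrams to return for it.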