2017-05-05

Extracting n-grams with R

I am trying to extract 3-grams from the Nirvana lyrics below, and for this I am using the ngramrr package.

require(ngramrr) 
require(tm) 
require(magrittr) 

nirvana <- c("hello hello hello how low", "hello hello hello how low", 
      "hello hello hello how low", "hello hello hello", 
      "with the lights out", "it's less dangerous", "here we are now", "entertain us", 
      "i feel stupid", "and contagious", "here we are now", "entertain us", 
      "a mulatto", "an albino", "a mosquito", "my libido", "yeah", "hey yay") 

ngramrr(nirvana[1], ngmax = 3) 

Corpus(VectorSource(nirvana)) 

The call to ngramrr() gives me this result:

[1] "hello" "hello" "hello" "how" "low" "hello hello" "hello hello"
[8] "hello how" "how low" "hello hello hello" "hello hello how" "hello how low"

I would like to know how to build a TermDocumentMatrix whose terms are this list of n-grams (up to trigrams).

Thanks

I would use `quanteda` and convert to `tm` format: `nirvana %>% tokens(ngrams = 1:3) %>% dfm %>% convert(to = "tm")`

@amatsuo_net Thank you, could you help me with an R example?

@Cath Thanks ;)

Answer

My comment above was almost complete, but here it is in full:

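library(quanteda)  # tokens(), tokens_ngrams(), dfm(), convert()

# Originally written as tokens(ngrams = 1:3); newer quanteda releases
# dropped that argument, so the n-grams are built with tokens_ngrams().
nirvana %>% tokens() %>% tokens_ngrams(n = 1:3) %>% # generate 1- to 3-gram tokens (words joined by "_")
    dfm %>% # generate dfm
    convert(to = "tm") %>% # convert to tm's document-term-matrix
    t # transpose it to term-document-matrix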
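
For comparison, here is a minimal sketch that stays within tm and ngramrr, as in the question. It assumes tm's custom-tokenizer mechanism (`control = list(tokenize = ...)`) and uses `VCorpus` rather than `Corpus`, because recent tm versions may build a simple corpus from `Corpus(VectorSource(...))` that ignores custom tokenizers. The name `ngram_tokenizer` is made up for this illustration, and the code has not been checked against every tm or ngramrr version.

```r
library(tm)
library(ngramrr)

nirvana <- c("hello hello hello how low", "hello hello hello how low",
             "hello hello hello how low", "hello hello hello",
             "with the lights out", "it's less dangerous", "here we are now", "entertain us",
             "i feel stupid", "and contagious", "here we are now", "entertain us",
             "a mulatto", "an albino", "a mosquito", "my libido", "yeah", "hey yay")

# Hypothetical helper: tokenize one document into 1- to 3-grams.
# as.character() extracts the text whether tm passes a document object
# or a plain character vector.
ngram_tokenizer <- function(x) ngramrr(as.character(x), ngmax = 3)

# VCorpus is used so that the custom tokenizer is honoured.
corpus <- VCorpus(VectorSource(nirvana))

# Terms are the n-grams, documents are the individual lyric lines.
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = ngram_tokenizer))

inspect(tdm)
```

Note that tm's default `wordLengths` setting drops terms shorter than three characters, so very short unigrams such as "a" or "us" will not appear as terms unless that control option is changed.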