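Problems with R's tm package

I've been trying to follow a Udemy tutorial on text mining tweets with the tm package in R.

So far, many of the functions specified in the tutorial (and in the tm PDF on cran.org) have produced a series of errors, and it isn't clear to me how to resolve them. I'm coding in RStudio version 1.0.143 on macOS Sierra. Below are the code and errors from my attempt to make a wordcloud from a set of tweets: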

nyttweets <- searchTwitter("#NYT", n=1000) 
nytlist <- sapply(nyttweets, function(x) x$getText()) 
nytcorpus <- Corpus(VectorSource(nytlist))

This is where I hit the first error:

nytcorpus <- tm_map(nytcorpus, tolower) 
**Warning message: 
In mclapply(content(x), FUN, ...) : 
all scheduled cores encountered errors in user code**

I've seen suggestions to do the following, which leads to a different error:

nytcorpus <- tm_map(nytcorpus, tolower, mc.cores=1) 
**Error in FUN(X[[1L]], ...) : invalid multibyte string 1**

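If I instead pass `lazy=TRUE` to tolower and the other functions that follow, I don't get an error when I run them. However, when I finally try to construct the wordcloud, I run into a large number of warnings: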

library("twitteR") 
library('wordcloud') 
library('SnowballC') 
library('tm') 
nytcorpus <- tm_map(nytcorpus, tolower, lazy=TRUE) 
nytcorpus <- tm_map(nytcorpus, removePunctuation, lazy=TRUE) 
nytcorpus <- tm_map(nytcorpus, function(x) removeWords(x, stopwords()), 
lazy=TRUE) 
nytcorpus <- tm_map(nytcorpus, PlainTextDocument) 
wordcloud(nytcorpus, min.freq=4, scale=c(5,1), random.color=F, max.word=45, 
random.order=F) 
**Warning messages: 
1: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F, : 
'removewords' could not be fit on page. It will not be plotted. 
2: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F, : 
"try-error" could not be fit on page. It will not be plotted. 
3: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F, : 
applicable could not be fit on page. It will not be plotted. 
4: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F, : 
object could not be fit on page. It will not be plotted. 
5: In wordcloud(nytcorpus, min.freq = 4, scale = c(5, 1), random.color = F, : 
usemethod("removewords", could not be fit on page. It will not be plotted.**

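I don't know why wordcloud is trying to plot words from the function calls and error messages themselves, such as 'removewords' or "try-error", instead of words from the NYT tweets. I've seen suggestions to wrap the function in content_transformer, for example: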

nytcorpus <- tm_map(nytcorpus, content_transformer(tolower))

However, I then just get the "all scheduled cores encountered errors in user code" warning again.

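This is very frustrating, and I'm not sure whether I should drop the tm package altogether, especially if there is something better. Any advice is much appreciated.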

Source

2017-06-28 sjc725

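What version of `tm` are you using? "invalid multibyte string 1" means your text probably has an incorrect encoding. If you can provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), we can test with sample input. Something that doesn't depend on `searchTwitter`, whose results change over time. – MrFlick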
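For what it's worth, the encoding problem can be reproduced without Twitter. The following is a minimal base-R sketch, not the asker's actual data: the `tweets` vector is a made-up sample in which one string carries a stray Latin-1 byte, which is the kind of input that would typically make `tolower()` fail with "invalid multibyte string" in a UTF-8 locale. How `iconv()` treats such bytes can vary by platform, so treat the cleaning step as an assumption to verify on your own machine.

```r
# Hypothetical sample: the second element contains a raw Latin-1 byte (\xe9)
# that is not valid UTF-8, similar to what scraped tweet text may contain.
tweets <- c("Breaking NEWS from #NYT", "Caf\xe9 review in the TIMES")

# validUTF8() flags elements whose bytes are not well-formed UTF-8.
is_valid <- validUTF8(tweets)

# Re-encode, substituting undecodable bytes with their hex code (sub = "byte").
clean <- iconv(tweets, from = "UTF-8", to = "UTF-8", sub = "byte")

# With the invalid byte replaced, tolower() has only valid text to work on.
lowered <- tolower(clean)
```

Checking `is_valid` first shows which elements are the troublemakers before anything is passed on to tm.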

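So, I see that I was using 0.6-2, and my R was too old for the package's dependencies. After updating to R 3.4 and tm 0.7-1, I no longer get the errors. However, I did have to add this, which I assume changes the encoding of the tweets I pulled: `nytcorpus <- tm_map(nytcorpus, content_transformer(function(x) iconv(x, to='UTF-8-MAC', sub='byte')))` – sjc725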
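Putting the pieces from the question and the comments together, the whole pipeline might look roughly like the sketch below. It assumes a recent tm is installed. `nytlist` here is invented sample text standing in for the `searchTwitter` results, `fix_encoding` is just a name introduced for illustration, and `to = "UTF-8"` is swapped in for the macOS-only `'UTF-8-MAC'` so the example is portable. Base functions such as `tolower` are wrapped in `content_transformer` so the documents stay documents. tm's own transformations are passed to `tm_map` directly.

```r
library(tm)

# Hypothetical stand-in for the text returned by searchTwitter().
nytlist <- c("Breaking NEWS from the #NYT newsroom!",
             "The NYT newsroom covers breaking news daily.")

nytcorpus <- VCorpus(VectorSource(nytlist))

# Repair the encoding first (portable variant of the 'UTF-8-MAC' fix above).
fix_encoding <- content_transformer(function(x) iconv(x, to = "UTF-8", sub = "byte"))
nytcorpus <- tm_map(nytcorpus, fix_encoding)

# Wrap base R functions; tm's own transformations need no wrapper.
nytcorpus <- tm_map(nytcorpus, content_transformer(tolower))
nytcorpus <- tm_map(nytcorpus, removePunctuation)
nytcorpus <- tm_map(nytcorpus, removeWords, stopwords("english"))

# Term frequencies across the corpus, sorted from most to least frequent.
tdm <- TermDocumentMatrix(nytcorpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
```

From there, passing the words and counts explicitly, e.g. `wordcloud(names(freq), freq, max.words = 45, random.order = FALSE)`, makes it obvious which terms are being plotted, so stray tokens like "try-error" would show up in `freq` before they ever reach the plot.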

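tm has recently been trying to improve its speed, and there seems to have been a major overhaul involving Rcpp, which the package was not originally built with. Perhaps the tutorial you're following is based on an older version of tm, which could be one reason you're running into problems.

I would give quanteda a try.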

http://quanteda.io/

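The main reason is that it is faster by several orders of magnitude (although, as noted above, this may have changed recently). quanteda is built on top of stringi and data.table, which are already highly optimized in C++ and C. Essentially, quanteda takes advantage of some of the fastest R programming work to date. In my experience it is also more stable, which makes sense given the maturity of the packages it depends on.

As you'll soon find out, speed matters a great deal when building and analyzing document-term matrices, especially if you create n-grams of various lengths. So the best approach is to use the fastest package.

Justin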
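To give a feel for what that looks like, here is a minimal quanteda sketch. It assumes a reasonably recent quanteda release, and the `txt` vector is invented sample text rather than real tweets. It tokenizes, lowercases, drops stopwords, builds unigrams and bigrams, and then forms the document-feature matrix.

```r
library(quanteda)

# Hypothetical sample text standing in for the tweets.
txt <- c("Breaking news from the NYT newsroom",
         "The NYT newsroom covers breaking news daily")

# Tokenize, lowercase, and drop English stopwords.
toks <- tokens(txt, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Unigrams and bigrams; bigrams are joined with "_" by default.
toks_ng <- tokens_ngrams(toks, n = 1:2)

# Document-feature matrix and its most frequent features.
dfmat <- dfm(toks_ng)
top <- topfeatures(dfmat, 10)
```

Because stopwords are removed before the n-grams are built, bigrams can span the gap where a stopword used to be. Whether that's what you want depends on the analysis.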

Source

2017-06-28 18:47:55 Justin
