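I ran LDA on Linux and did not get characters such as "ø" in Topic 2. When I run the same analysis on Windows, however, they show up. Does anyone know how to deal with this? I am using the packages quanteda and topicmodels. R on Windows apparently cannot handle certain characters: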
> terms(LDAModel1,5)
Topic 1 Topic 2
[1,] "car" "ø"
[2,] "build" "ù"
[3,] "work" "network"
[4,] "drive" "ces"
[5,] "musk" "new"
Edit:
Data: https://www.dropbox.com/s/tdr9yok7tp0pylz/technology201501.csv
The code is as follows:
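library(quanteda)
library(topicmodels)
# Read the CSV and build a corpus from its "title" column
myCorpus <- corpus(textfile("technology201501.csv", textField = "title"))
# Build the document-feature matrix: drop English stopwords, stem, and strip numbers, punctuation and separators
myDfm <- dfm(myCorpus, ignoredFeatures = stopwords("english"), stem = TRUE, removeNumbers = TRUE, removePunct = TRUE, removeSeparators = TRUE)
# Remove additional uninformative features
myDfm <- removeFeatures(myDfm, c("reddit", "redditors", "redditor", "nsfw", "hey", "vs", "versus", "ur", "they'r", "u'll", "u.", "u", "r", "can", "anyone", "will", "amp", "http", "just"))
# Keep only features that appear in at least this many documents
sparsityThreshold <- round(ndoc(myDfm) * (1 - 0.9999))
myDfm2 <- trim(myDfm, minDoc = sparsityThreshold)
# Fit a 25-topic LDA model with Gibbs sampling
LDAModel1 <- LDA(quantedaformat2dtm(myDfm2), 25, 'Gibbs', list(iter = 4000, seed = 123))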
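Different locale settings, I would guess. – 2016-01-13 03:19:37
You haven't really provided enough data to make the problem reproducible. My guess is that the problem is the file encoding. Windows assumes files are in "latin-1" encoding, while your Linux OS probably assumes UTF-8. It is important to know which encoding the data file uses and to read the data in with that encoding. You don't show any import steps, so it is hard to know what you may have done. – MrFlick
I tried different encodings as described in https://support.rstudio.com/hc/en-us/articles/200532197-Character-Encoding, but it did not work. – user1569341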
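As a rough illustration of the point about reading data with a declared encoding, here is a minimal base-R sketch. It assumes the input is UTF-8 and uses a small hypothetical sample file (`tmp`) rather than the question's data. It does not go through quanteda's `textfile()`, so it may or may not address the behaviour described in the question.

```r
# Hypothetical sample file: a "title" column containing the UTF-8 bytes for "ø".
# The bytes are written directly so the file content does not depend on the locale.
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(charToRaw("title\n"), as.raw(c(0xc3, 0xb8)), charToRaw(" network\n")),
  tmp
)

# Declare the encoding when reading. `encoding = "UTF-8"` marks the strings
# as UTF-8 instead of letting R assume the platform's native encoding.
# (`fileEncoding = "UTF-8"` is the alternative that re-encodes to the native
# encoding while reading.)
dat <- read.csv(tmp, encoding = "UTF-8", stringsAsFactors = FALSE)

# Normalise to UTF-8 and inspect what was actually read
titles <- enc2utf8(dat$title)
Encoding(titles)   # how R has marked the strings
utf8ToInt(titles)  # code points, which do not depend on console rendering
```

Checking code points with `utf8ToInt()` helps separate two cases: the data being read incorrectly, and the data being correct but displayed incorrectly by the Windows console.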