2015-06-24

Text mining in R with quanteda

I have a dataset of Facebook posts (exported via netvizz) that I am analyzing with the quanteda package in R. This is my R code:

# Load the relevant dictionary (relevant for analysis) 
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC") 

# Read File 
# Facebook posts can be exported with FB Netvizz
# https://apps.facebook.com/netvizz 
# Load FB posts as .csv-file from .zip-file 
fbpost <- read.csv("D:/FB-com.csv", sep=";") 

# Define the relevant column(s) 
fb_test <- as.character(fbpost$comment_message) # one column with 2700 entries
# Define as corpus 
fb_corp <-corpus(fb_test) 
class(fb_corp) 

# LIWC Application 
fb_liwc<-dfm(fb_corp, dictionary=liwcdict) 
View(fb_liwc) 

Everything works until:

> fb_liwc<-dfm(fb_corp, dictionary=liwcdict) 
Creating a dfm from a corpus ... 
    ... indexing 2,760 documents 
    ... tokenizing texts, found 77,923 total tokens 
    ... cleaning the tokens, 1584 removed entirely 
    ... applying a dictionary consisting of 68 key entries 
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(docs = c("text1", : 
    invalid 'dimnames' given for data frame 

How would you interpret the error message? Are there any suggestions for solving the problem?

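It is hard to say, since I don't have your text input file, but what happens if you try `dfm(inaugTexts, dictionary = liwcdict)`? I have the `LIWC2001_English.dic` file, and the `dfm` command works fine on `inaugTexts`, although it is slow and needs a rewrite to optimize it (that is next on my list).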

This is now fixed in the dev branch, which you can install as described in the answer below.

Answer

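There was a bug in quanteda version 0.7.2 that caused dfm() to fail when a dictionary was applied and one of the documents contained no features. Your example fails because some of the Facebook "documents" end up with all of their features removed during the cleaning stage.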
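The error message itself comes from base R rather than from the LIWC dictionary: assigning `dimnames` to a data frame is rejected when the supplied names do not match the object's dimensions. The sketch below reproduces that mechanism in base R only. The toy data frame and names are hypothetical, and the idea that quanteda 0.7.2 hit this because a featureless document dropped out of an intermediate table is an assumption, not something verified against its source.

```r
# Toy table with two rows, standing in for per-document dictionary counts
# (hypothetical data; not quanteda's actual internal object)
counts <- data.frame(posemo = c(2, 0), negemo = c(1, 3))

# Three document names for a two-row table: one "document" has gone missing
doc_names <- c("text1", "text2", "text3")

# Capture the error message instead of stopping the script
msg <- tryCatch({
  dimnames(counts) <- list(docs = doc_names, features = names(counts))
  "no error"
}, error = function(e) conditionMessage(e))

print(msg)
```

The mismatch between the number of names and the number of rows is what triggers the complaint, which fits the explanation that some documents ended up empty.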

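This is not only fixed in 0.8.0; the underlying implementation of dictionaries in dfm() has also changed, which makes it significantly faster. (LIWC is still a large and complicated dictionary, and its reliance on regular expressions means it is still much slower than simply indexing tokens. We will keep working on optimizing this.)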
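To decide whether an installation is affected, the installed version can be compared against 0.8.0. This is a small base-R sketch; the helper name `needs_upgrade` is hypothetical and not part of quanteda.

```r
# Hypothetical helper: TRUE when the installed version predates the fix
needs_upgrade <- function(installed, fixed = "0.8.0") {
  package_version(installed) < package_version(fixed)
}

old_version <- needs_upgrade("0.7.2")
new_version <- needs_upgrade("0.8.0")

# With quanteda installed, one could call (not run here):
# needs_upgrade(as.character(packageVersion("quanteda")))
```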

devtools::install_github("kbenoit/quanteda") 
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC") 
mydfm <- dfm(inaugTexts, dictionary = liwcdict) 
## Creating a dfm from a character vector ... 
## ... indexing 57 documents 
## ... lowercasing 
## ... tokenizing 
## ... shaping tokens into data.table, found 134,024 total tokens 
## ... applying a dictionary consisting of 68 key entries 
## ... summing dictionary-matched features by document 
## ... indexing 68 feature types 
## ... building sparse matrix 
## ... created a 57 x 68 sparse dfm 
## ... complete. Elapsed time: 14.005 seconds. 
topfeatures(mydfm, decreasing=FALSE) 
## Fillers Nonfl Swear  TV Eating Sleep Groom Death Sports Sexual 
##  0  0  0  42  47  49  53  76  81  100 

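It will also work if a document contains zero features after tokenization and cleaning, which is probably what was breaking the older dfm on the Facebook texts you are using: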

mytexts <- inaugTexts 
mytexts[3] <- "" 
mydfm <- dfm(mytexts, dictionary = liwcdict, verbose = FALSE) 
which(rowSums(mydfm)==0) 
## 1797-Adams 
##   3
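If upgrading is not an option, a possible workaround is to drop the posts that are likely to end up empty before building the corpus. The sketch below uses base R only; the sample vector is a hypothetical stand-in for `comment_message`, and testing for "at least one letter" only approximates quanteda's cleaning step, so treat it as an assumption rather than an exact match.

```r
# Hypothetical stand-in for the comment_message column
fb_test <- c("Great post!", "", "   ", ":-) !!!", "12345", NA, "Thanks a lot")

# Keep only entries that contain at least one letter; entries made up of
# whitespace, punctuation, or digits are assumed to be emptied by cleaning
has_letters <- !is.na(fb_test) & grepl("[[:alpha:]]", fb_test)
fb_clean <- fb_test[has_letters]

# Positions of the dropped entries, so they can be inspected or logged
dropped <- which(!has_letters)
```

The filtered vector `fb_clean` can then be passed to `corpus()` as in the question, while `dropped` records which posts were left out.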