2015-12-29

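Including ID numbers in dfm() output

I have a dataset with a column of ID numbers and a column of text, and I am running a LIWC analysis on the text data using the quanteda package. Here is an example of my data setup: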

mydata <- data.frame(
    id = c(19, 101, 43, 12),
    text = c("No wonder, then, that ever gathering volume from the mere transit ",
     "So that in many cases such a panic did he finally strike, that few ",
     "But there were still other and more vital practical influences at work",
     "Not even at the present day has the original prestige of the Sperm Whale"),
    stringsAsFactors = FALSE
)

I have been able to run the LIWC analysis using scores <- dfm(as.character(mydata$text), dictionary = liwc)

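However, when I view the results (View(scores)), I find that the function does not carry the original ID numbers (19, 101, 43, 12) through to the final LIWC results. Instead, the row.names column contains non-descriptive identifiers (e.g., "text1", "text2").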


How can I get the dfm() function to include the ID numbers in its output? Thank you!

Answer


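It sounds like you want the row names of the dfm object to be the ID numbers from your mydata$id. This happens automatically if you declare those IDs as the document names of the texts. The easiest way to do that is to create a quanteda corpus object from your data.frame.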

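The corpus() call below assigns the docnames from your id variable. Note: the "Text" column in the summary() output looks like a numeric value, but it is actually the document name of each text.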

require(quanteda) 
myCorpus <- corpus(mydata[["text"]], docnames = mydata[["id"]]) 
summary(myCorpus) 
# Corpus consisting of 4 documents. 
# 
# Text Types Tokens Sentences
#   19    11     11         1
#  101    13     14         1
#   43    12     12         1
#   12    12     14         1
# 
# Source: /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit 
# Created: Tue Dec 29 11:54:00 2015 
# Notes: 

From there, the document names automatically become the row labels in the dfm. (You can add the dictionary = argument for your LIWC application.)

myDfm <- dfm(myCorpus, verbose = FALSE) 
head(myDfm) 
# Document-feature matrix of: 4 documents, 45 features. 
# (showing first 4 documents and first 6 features) 
#      features
# docs  no wonder then that ever gathering
#   19   1      1    1    1    1         1
#   101  0      0    0    2    0         0
#   43   0      0    0    0    0         0
#   12   0      0    0    0    0         0
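
For the dictionary step itself, here is a minimal sketch of the same idea with a dictionary applied. It assumes a more recent quanteda release, where (as far as I know) the texts are tokenized with tokens() first and the dictionary is applied with dfm_lookup() rather than through a dictionary = argument to dfm(). Since the LIWC dictionary is not freely distributable, toyDict is a made-up stand-in with two hypothetical categories; swap in your own liwc object.

```r
library(quanteda)

mydata <- data.frame(
    id = c(19, 101, 43, 12),
    text = c("No wonder, then, that ever gathering volume from the mere transit ",
             "So that in many cases such a panic did he finally strike, that few ",
             "But there were still other and more vital practical influences at work",
             "Not even at the present day has the original prestige of the Sperm Whale"),
    stringsAsFactors = FALSE
)

# Hypothetical stand-in for the LIWC dictionary (not the real categories)
toyDict <- dictionary(list(
    negation    = c("no", "not"),
    conjunction = c("but", "so", "then")
))

# Declare the IDs as document names; they are converted to character here
myCorpus <- corpus(mydata[["text"]], docnames = as.character(mydata[["id"]]))

# Tokenize, build the dfm, then collapse features into dictionary categories
myDfm  <- dfm(tokens(myCorpus, remove_punct = TRUE))
scores <- dfm_lookup(myDfm, dictionary = toyDict)

# The row labels of the result are the original IDs
docnames(scores)

# Optional: a plain data.frame with the document names as a column
scoresDf <- convert(scores, to = "data.frame")
```

Because the IDs travel with the documents as docnames, the dictionary scores stay keyed to them, and the data.frame from convert() can be joined back to mydata if needed.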

Perfect, thank you! – abclist19