2015-05-11 30 views
4

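tm package: inspect() returns character counts instead of content

Whenever I run the inspect() function from the tm R package, I get back a character count rather than the content of the documents. This happens no matter which data source I use.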

Here is my code:

library(tm) 

data <- c("one two three", "two three four", "three four five") 

corp <- VCorpus(VectorSource(data)) 

inspect(corp) 

My output, for example:

inspect(corp) 

VCorpus 

Metadata: corpus specific: 0, document level (indexed): 0 
Content: documents: 3 

[[1]] 
PlainTextDocument 

Metadata: 7 

Content: chars: 13 

[[2]] 
PlainTextDocument 

Metadata: 7 

Content: chars: 14 

[[3]] 
PlainTextDocument 
Metadata: 7 

Content: chars: 15 

What I want instead is the content of each document.

Here is another example, using the Ovid text files that ship with the tm package and are referenced in "Introduction to the tm Package" by Ingo Feinerer: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf

Code:

txt <- system.file("texts", "txt", package = "tm") 
ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"), 
                readerControl = list(language = "lat")) 
inspect(ovid[1:2]) 

What I want, and what it is supposed to output:

<<VCorpus>> 
Metadata: corpus specific: 0, document level (indexed): 0 
Content: documents: 2 

[[1]] 
<<PlainTextDocument (metadata: 7)>> 
    Si quis in hoc artem populo non novit amandi, 
hoc legat et lecto carmine doctus amet. 
arte citae veloque rates remoque moventur, 
arte leves currus: arte regendus amor. 
curribus Automedon lentisque erat aptus habenis, 
Tiphys in Haemonia puppe magister erat: 
me Venus artificem tenero praefecit Amori; 
Tiphys et Automedon dicar Amoris ego. 
ille quidem ferus est et qui mihi saepe repugnet: 
sed puer est, aetas mollis et apta regi. 
Phillyrides puerum cithara perfecit Achillem, 
atque animos placida contudit arte feros. 
qui totiens socios, totiens exterruit hostes, 
creditur annosum pertimuisse senem. 
[[2]] 
<<PlainTextDocument (metadata: 7)>> 
quas Hector sensurus erat, poscente magistro 
verberibus iussas praebuit ille manus. 
Aeacidae Chiron, ego sum praeceptor Amoris: 
saevus uterque puer, natus uterque dea. 
sed tamen et tauri cervix oneratur aratro, 
frenaque magnanimi dente teruntur equi; 
et mihi cedet Amor, quamvis mea vulneret arcu 
pectora, iactatas excutiatque faces. 
quo me fixit Amor, quo me violentius ussit, 

What it outputs for me:

<<VCorpus>> 
Metadata: corpus specific: 0, document level (indexed): 0 
Content: documents: 2 

[[1]] 
<<PlainTextDocument>> 
Metadata: 7 
Content: chars: 49 
Content: chars: 48 
Content: chars: 46 
Content: chars: 47 
Content: chars: 0 
Content: chars: 52 
Content: chars: 48 
Content: chars: 46 
Content: chars: 46 
Content: chars: 53 
Content: chars: 0 
Content: chars: 49 
Content: chars: 49 
Content: chars: 50 
Content: chars: 49 
Content: chars: 44 

[[2]] 
<<PlainTextDocument>> 
Metadata: 7 
Content: chars: 48 
Content: chars: 47 
Content: chars: 47 
Content: chars: 48 
Content: chars: 46 
Content: chars: 0 
Content: chars: 48 
Content: chars: 49 
Content: chars: 45 
Content: chars: 47 
Content: chars: 45 
Content: chars: 0 
Content: chars: 51 
Content: chars: 42 
Content: chars: 45 
Content: chars: 48 
Content: chars: 44 
+2

Could you provide at least some of your text content, along with the code you ran to get to inspect(corp)? – lawyeR

+0

I added it to the original question. – Cbx

+0

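I ran your code on your data and got what you want. Maybe you need to update your tm package? > inspect(corp) <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>> [[1]] <<PlainTextDocument (metadata: 7)>> one two three [[2]] <<PlainTextDocument (metadata: 7)>> two three four [[3]] <<PlainTextDocument (metadata: 7)>> three four five – lawyeR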

Answers

0

This is because something changed in the latest version of the 'tm' package, 0.6-1, released on 7 May. I simply went back to version 0.6, and it works.

  1. Download the archive tm_0.6.tar.gz from this link: http://cran.r-project.org/src/contrib/Archive/tm/

  2. Install it through RStudio: Tools -> Install Packages -> Package Archive File -> select tm_0.6.tar.gz and install.

That's all :)
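The same downgrade can probably be done from the R console instead of the RStudio menus. This is a minimal sketch, assuming a working build toolchain and that tm's dependencies are already installed. `archive_url` and `install_archived_tm` are illustrative names, and the URL is just the archive directory from step 1 with the file name appended:

```r
# Assumed location of the archived source package: the directory from
# step 1 plus the file name tm_0.6.tar.gz.
archive_url <- "http://cran.r-project.org/src/contrib/Archive/tm/tm_0.6.tar.gz"

# Hypothetical helper: installs a source archive directly from a URL.
# repos = NULL tells install.packages() to treat `url` as a package file
# rather than a package name to look up in a repository.
install_archived_tm <- function(url = archive_url) {
  install.packages(url, repos = NULL, type = "source")
}
```

Calling `install_archived_tm()` performs the installation. Afterwards, `packageVersion("tm")` in a fresh session shows which version is actually loaded.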

3

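Version 0.6-1 of the tm package changed the way documents are printed to the screen. It now outputs a compact representation of the document rather than the document text itself.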

To get the document text, you need to apply the as.character() function to a document in the corpus.

For example, using the ovid example (tm version 0.6-2 is used here):

> txt <- system.file("texts", "txt", package = "tm") 
> ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"), 
    readerControl = list(language = "lat")) 

The new inspect() function outputs a compact representation of each document:

> inspect(ovid[1:2]) 
<<VCorpus>> 
Metadata: corpus specific: 0, document level (indexed): 0 
Content: documents: 2 

[[1]] 
<<PlainTextDocument>> 
Metadata: 7 
Content: chars: 676 

[[2]] 
<<PlainTextDocument>> 
Metadata: 7 
Content: chars: 700 

To get the full text of a document, apply the as.character() function to the document you want to inspect (note that the output has been truncated):

> as.character(ovid[[1]]) 
[1] " Si quis in hoc artem populo non novit amandi,"  
[2] "   hoc legat et lecto carmine doctus amet."  
[3] " arte citae veloque rates remoque moventur,"  
[4] "   arte leves currus: arte regendus amor." 

To clean up the displayed output, combine the above with the writeLines() function:

> writeLines(as.character(ovid[[1]])) 
    Si quis in hoc artem populo non novit amandi, 
     hoc legat et lecto carmine doctus amet. 
    arte citae veloque rates remoque moventur, 
     arte leves currus: arte regendus amor. 

To do this for multiple documents in the corpus, combine the above with the lapply() function (output truncated):

> lapply(ovid[1:2], as.character) 
$ovid_1.txt 
[1] " Si quis in hoc artem populo non novit amandi,"  
[2] "   hoc legat et lecto carmine doctus amet."  
[3] " arte citae veloque rates remoque moventur,"  
[4] "   arte leves currus: arte regendus amor." 

$ovid_2.txt 
[1] " quas Hector sensurus erat, poscente magistro" 
[2] "   verberibus iussas praebuit ille manus."  
[3] " Aeacidae Chiron, ego sum praeceptor Amoris:"  
[4] "   saevus uterque puer, natus uterque dea." 

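Finally, to clean up this output and roughly reproduce the previous inspect() behavior, try the l_ply() function from the plyr package as follows (output truncated):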

> library(plyr) 
> l_ply(ovid[1:2], function(doc) { 
    print(doc) # output summary of document 
    writeLines("") # output blank line between results 
    writeLines(as.character(doc)) # output clean document text 
    writeLines("") # output blank line between results 
    }) 

<<PlainTextDocument>> 
Metadata: 7 
Content: chars: 676 

    Si quis in hoc artem populo non novit amandi, 
     hoc legat et lecto carmine doctus amet. 
    arte citae veloque rates remoque moventur, 
     arte leves currus: arte regendus amor. 

<<PlainTextDocument>> 
Metadata: 7 
Content: chars: 700 

    quas Hector sensurus erat, poscente magistro 
     verberibus iussas praebuit ille manus. 
    Aeacidae Chiron, ego sum praeceptor Amoris: 
     saevus uterque puer, natus uterque dea. 
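Applied to the three-document corpus from the question, the same idea can be sketched with base R only, so plyr is not needed. `show_docs` is a hypothetical helper written for this illustration and is not part of tm. It relies on lapply() working on a corpus, as in the example above:

```r
library(tm)

# The example corpus from the question
data <- c("one two three", "two three four", "three four five")
corp <- VCorpus(VectorSource(data))

# Text of every document, as a list of character vectors
texts <- lapply(corp, as.character)

# Hypothetical helper: print each document's compact summary,
# followed by its text and a blank separator line.
show_docs <- function(corpus) {
  invisible(lapply(corpus, function(doc) {
    print(doc)                     # compact summary of the document
    writeLines(as.character(doc))  # the document text itself
    writeLines("")                 # blank line between documents
  }))
}

show_docs(corp)
```

Wrapping the loop in invisible() keeps lapply() from echoing its return value at the console, so only the printed summaries and document text appear.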

Hope this helps!