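tm package: inspect() returns character counts instead of content
Whenever I run the inspect() function from the tm R package, I get back a character count rather than the content of the documents. This happens no matter what data source I use.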
Here is my code:
library(tm)
data <- c("one two three", "two three four", "three four five")
corp <- VCorpus(VectorSource(data))
inspect(corp)
My output, for example:
inspect(corp)
VCorpus
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 3
[[1]]
PlainTextDocument
Metadata: 7
Content: chars: 13
[[2]]
PlainTextDocument
Metadata: 7
Content: chars: 14
[[3]]
PlainTextDocument
Metadata: 7
Content: chars: 15
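But what I want is the text of each document rather than its character count.
Below is another example, using the Ovid text files that ship with the tm package by default and that are referenced in Ingo Feinerer's "Introduction to the tm Package" vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
Code: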
txt <- system.file("texts", "txt", package = "tm")
ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
                readerControl = list(language = "lat"))
inspect(ovid[1:2])
What I want, and what the output should be:
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2
[[1]]
<<PlainTextDocument (metadata: 7)>>
Si quis in hoc artem populo non novit amandi,
hoc legat et lecto carmine doctus amet.
arte citae veloque rates remoque moventur,
arte leves currus: arte regendus amor.
curribus Automedon lentisque erat aptus habenis,
Tiphys in Haemonia puppe magister erat:
me Venus artificem tenero praefecit Amori;
Tiphys et Automedon dicar Amoris ego.
ille quidem ferus est et qui mihi saepe repugnet:
sed puer est, aetas mollis et apta regi.
Phillyrides puerum cithara perfecit Achillem,
atque animos placida contudit arte feros.
qui totiens socios, totiens exterruit hostes,
creditur annosum pertimuisse senem.
[[2]]
<<PlainTextDocument (metadata: 7)>>
quas Hector sensurus erat, poscente magistro
verberibus iussas praebuit ille manus.
Aeacidae Chiron, ego sum praeceptor Amoris:
saevus uterque puer, natus uterque dea.
sed tamen et tauri cervix oneratur aratro,
frenaque magnanimi dente teruntur equi;
et mihi cedet Amor, quamvis mea vulneret arcu
pectora, iactatas excutiatque faces.
quo me fixit Amor, quo me violentius ussit,
What it outputs for me:
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2
[[1]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 49
Content: chars: 48
Content: chars: 46
Content: chars: 47
Content: chars: 0
Content: chars: 52
Content: chars: 48
Content: chars: 46
Content: chars: 46
Content: chars: 53
Content: chars: 0
Content: chars: 49
Content: chars: 49
Content: chars: 50
Content: chars: 49
Content: chars: 44
[[2]]
<<PlainTextDocument>>
Metadata: 7
Content: chars: 48
Content: chars: 47
Content: chars: 47
Content: chars: 48
Content: chars: 46
Content: chars: 0
Content: chars: 48
Content: chars: 49
Content: chars: 45
Content: chars: 47
Content: chars: 45
Content: chars: 0
Content: chars: 51
Content: chars: 42
Content: chars: 45
Content: chars: 48
Content: chars: 44
Could you provide at least some of your text content, along with the code you ran to get to inspect(corp)? – lawyeR
I added it to the original question. – Cbx
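I ran your code on your data and got what you wanted. Maybe you need to update your tm package? > inspect(corp) <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>> [[1]] <<PlainTextDocument (metadata: 7)>> one two three [[2]] <<PlainTextDocument (metadata: 7)>> two three four [[3]] <<PlainTextDocument (metadata: 7)>> three four five – lawyeR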
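Whatever inspect() prints in a given tm release, the document text can also be read directly from the corpus. Below is a minimal sketch of that, reusing the three-document example from the question. It assumes tm 0.6 or later, where as.character() has a method for PlainTextDocument; the variable name `texts` is only for illustration. Printing the installed version first helps when comparing output with someone else's.

```r
library(tm)

# Show which tm release is installed; inspect() output differs between releases.
print(packageVersion("tm"))

data <- c("one two three", "two three four", "three four five")
corp <- VCorpus(VectorSource(data))

# Extract the raw text of every document instead of relying on inspect().
texts <- lapply(corp, as.character)

# Print each document's content under its index.
for (i in seq_along(texts)) {
  cat(sprintf("[[%d]]\n", i))
  writeLines(texts[[i]])
}
```

For a single document, writeLines(as.character(corp[[1]])) does the same thing. If inspect() still shows only character counts, updating tm (as suggested in the comment above) is worth trying.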