Subsetting a corpus based on the content of its text files

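I am using R and the tm package to do some text analysis. I am trying to build a subset of a corpus based on whether a certain expression is found within the content of the individual text files.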

I create a corpus of 20 text files (thank you, lukeA, for this example):

reut21578 <- system.file("texts", "crude", package = "tm") 
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

I would now like to select only those text files that contain the string "price reduction" in order to create a subset corpus.

Inspecting the first text file in the corpus, I know that at least one text file contains the string:

writeLines(as.character(corp[[1]]))

What would be the best way to go about this?


2016-03-24 tarti

Here is one way, using tm_filter:

library(tm) 
reut21578 <- system.file("texts", "crude", package = "tm") 
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)) 

(corp_sub <- tm_filter(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE)))) 
# <<VCorpus>> 
# Metadata: corpus specific: 0, document level (indexed): 0 
# Content: documents: 1 

cat(content(corp_sub[[1]])) 
# Diamond Shamrock Corp said that 
# effective today it had cut its contract prices for crude oil by 
# 1.50 dlrs a barrel. 
#  The reduction brings its posted price for West Texas 
# Intermediate to 16.00 dlrs a barrel, the copany said. 
#  "The price reduction today was made in the light of falling # <===== 
# oil product prices and a weak crude oil market," a company 
# spokeswoman said. 
#  Diamond is the latest in a line of U.S. oil companies that 
# have cut its contract, or posted, prices over the last two days 
# citing weak oil markets. 
# Reuter

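How did I get there? By looking at the package's vignette, searching for "subset", and then looking at the examples for tm_filter (help: ?tm_filter), which is mentioned there. It may also be worth looking at ?grep to check the options for pattern matching.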
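As a rough illustration of the pattern-matching options mentioned above, here is a minimal base R sketch. The strings in `docs` are made-up stand-ins for what `content(x)` might return, not text taken from the crude corpus, and the variable names are only for illustration.

```r
# Toy documents standing in for content(x); made up for this example.
docs <- c(
  "The price reduction today was made in the light of falling prices.",
  "Prices rose, so no reduction was announced.",
  "A PRICE REDUCTION is expected next week."
)

# fixed = TRUE treats the pattern as a literal string (case-sensitive).
# grepl() already returns one logical per element, so any(grep(...)) is not needed here.
hits_fixed <- grepl("price reduction", docs, fixed = TRUE)

# ignore.case is ignored (with a warning) when fixed = TRUE,
# so the case-insensitive search uses a regular expression instead.
hits_nocase <- grepl("price reduction", docs, ignore.case = TRUE)

print(hits_fixed)
print(hits_nocase)
```

Inside tm_filter, the same grepl() call could presumably replace any(grep(...)), wrapped in any() in case the document content spans several lines.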


2016-03-24 15:41:39 lukeA

@lukeA's solution works. I would like to offer another solution, which I prefer.

library(tm)

reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain))

# One TRUE/FALSE per document: does its content contain the phrase?
corpTF <- lapply(corp, function(x) any(grep("price reduction", content(x), fixed=TRUE)))

# Store the result in each document's metadata under the tag "mySubset"
for(i in 1:length(corp))
  corp[[i]]$meta["mySubset"] <- corpTF[i]

# Keep only the documents tagged TRUE
idx <- meta(corp, tag ="mySubset") == 'TRUE'
filtered <- corp[idx]

cat(content(filtered[[1]]))

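Because this solution uses a meta tag, we can see every corpus element labelled with the tag mySubset, with the value "TRUE" for the elements we selected and "FALSE" otherwise.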


2016-03-24 19:53:15 Vezir

Thank you very much for adding this. I agree, this is very useful! – tarti

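Here is a simpler way using the quanteda package, one that is more consistent with how methods already defined for other R objects are reused. quanteda has a subset method for corpus objects that works the same way as the subset method for a data.frame, but selects on logical vectors, including document variables defined in the corpus. Below, I extract the texts from the corpus using the texts() method for a corpus object, and use the result in grepl() to search for your pair of words.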

require(tm) 
data(crude) 

require(quanteda) 
# corpus constructor recognises tm Corpus objects 
(qcorpus <- corpus(crude)) 
## Corpus consisting of 20 documents. 
# use subset method 
(qcorpussub <- subset(qcorpus, grepl("price\\s+reduction", texts(qcorpus)))) 
## Corpus consisting of 1 document. 

# see the context 
## kwic(qcorpus, "price reduction") 
##      contextPre   keyword    contextPost 
## [127, 45:46] copany said." The [ price reduction ] today was made in the

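Note: I wrote the regular expression with "\s+" because you could have some variation of spaces, tabs, or newlines between the two words instead of just a single space.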
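To make the difference concrete, here is a small base R sketch comparing a literal single-space match with the "\\s+" regular expression. The strings in `txt` are invented for illustration and do not come from the corpus.

```r
# Toy strings, made up for illustration.
txt <- c(
  "the price reduction today",    # single space
  "the price\nreduction today",   # newline between the words
  "the price  reduction today",   # two spaces
  "the pricereduction today"      # no whitespace at all
)

# Literal match: only a single space between the words is accepted.
fixed_hits <- grepl("price reduction", txt, fixed = TRUE)

# Regex match: one or more whitespace characters (space, tab, newline).
regex_hits <- grepl("price\\s+reduction", txt)

print(fixed_hits)
print(regex_hits)
```

The literal search only finds the first string, while the regular expression also finds the phrase when it is split across a line break or separated by extra spaces.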


2016-03-24 21:27:50
