R text mining documents from CSV file

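First, I apologize for duplicating a question that was asked on Aug 1 '13. I could not comment on the original question because commenting requires 50 reputation, which I don't have. The original question is: R text mining documents from CSV file (one row per doc).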

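I am trying to use the tm package in R. I have a CSV file of article abstracts in which each row is a different abstract, and I want each row to be a separate document in the corpus. There are 2000 rows in my dataset.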

I ran the following code, as Ben suggested in the earlier question:

# change this file location to suit your machine 
file_loc <- "C:/Users/.../docs.csv" 
# change TRUE to FALSE if you have no column headings in the CSV 
x <- read.csv(file_loc, header = TRUE) 
require(tm) 
corp <- Corpus(DataframeSource(x)) 
docs <- DocumentTermMatrix(corp)

When I check the class:

# checking class 
class(docs) 
[1] "DocumentTermMatrix" "simple_triplet_matrix"

The problem is that tm transformations do not work on this class:

# Preparing the Corpus 
# Simple Transforms 
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x)) 
docs <- tm_map(docs, toSpace, "/")

I get this error:

Error in UseMethod("tm_map", x) : 
no applicable method for 'tm_map' applied to an object of class "c('DocumentTermMatrix', 'simple_triplet_matrix')"

Or with this other code:

docs <- tm_map(docs, toSpace, "/|@|nn|")

I get the same error:

Error in UseMethod("tm_map", x) : 
no applicable method for 'tm_map' applied to an object of class "c('DocumentTermMatrix', 'simple_triplet_matrix')"

Your help would be greatly appreciated.

2016-03-28 Sahara

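You must apply your function to the 'Corpus' object, not to the 'DocumentTermMatrix'. After 'corp <- Corpus(DataframeSource(x))', try 'corp <- tm_map(corp, toSpace, "/")', and only then create your 'DocumentTermMatrix'. – nicola

@nicola Thank you very much. You are absolutely right, and I got it running. However, it only seems to work until I create my dtm. The last line of code is 'docs <- tm_map(docs, stemDocument)', and 'inspect(docs[16])' gives 'content: chars: 1190', which looks fine to me. But when I create the dtm, 'dim(dtm)' returns '[1] 2004 0'. Yes, I have 2004 documents, but 0 terms?! Nothing in my matrix?! Please advise. – Sahara

That really depends on your data. Without seeing them it is impossible to say anything. Go through your corpus step by step and see what happens. – nicola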
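The order suggested in the comments above (transform the corpus first, build the matrix last, and look at a document after each step) can be sketched roughly as follows. This is only a minimal sketch under some assumptions: 'abstracts' is a made-up toy vector standing in for the text column of the CSV, and 'VCorpus(VectorSource(...))' is used instead of 'DataframeSource' because recent tm versions expect a data frame with 'doc_id' and 'text' columns there.

```r
library(tm)

# Hypothetical toy data standing in for the abstracts column of the CSV
abstracts <- c("text/mining with R",
               "one row @ one document",
               "a third abstract")

# Build the corpus first: one element of the vector becomes one document
corp <- VCorpus(VectorSource(abstracts))

# Transformations are applied to the corpus, not to the DocumentTermMatrix
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corp <- tm_map(corp, toSpace, "/|@")

# Check a document after each step to see where text gets lost, if it does
print(content(corp[[1]]))  # "text mining with R"

# Only now create the matrix and check that it has term columns
dtm <- DocumentTermMatrix(corp)
print(dim(dtm))
```

If 'dim(dtm)' shows 0 columns on real data, repeating the 'content(corp[[i]])' check after every 'tm_map' call is one way to find the step at which the documents stop holding plain text.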

The code

docs <- tm_map(docs, toSpace, "/|@|nn|")

must be replaced with

docs <- tm_map(docs, toSpace, "/|@|\\|")

Then it works fine.
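To see what the escaped pipe in the corrected pattern does, base R 'gsub' is enough and tm is not needed. A small sketch, where the sample string 's' is made up for illustration and is not from the original data:

```r
# Hypothetical sample string, not taken from the original dataset
s <- "a/b@c|d"

# "\\|" is an escaped, literal pipe, so "/", "@" and "|" are all replaced
fixed <- gsub("/|@|\\|", " ", s)
print(fixed)    # "a b c d"

# An unescaped "|" only separates alternatives; with just "/|@" the
# literal pipe in the text is left untouched
partial <- gsub("/|@", " ", s)
print(partial)  # "a b c|d"
```

Inside 'toSpace' the same pattern is handed to 'gsub', so the behaviour on each document's text should be the same as on this plain string.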

2016-04-01 07:02:16 Sahara
