Wrangling IRC archives into a corpus for text mining

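Suppose I want to text-mine IRC archives, using archives such as this one as the source and parsing a corpus that spans months or years.

In R, what would be the overall strategy for approaching this problem?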

2014-02-28 histelheim

Here is some starter code for the scraping part.

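library(XML)

# Base URL of the site hosting the IRC archive
rootUri <- "http://donttreadonme.co.uk"

# Parse the channel index page and collect every link on it
doc <- htmlParse(paste0(rootUri, "/rubinius/index.html"))
links <- xpathSApply(doc, "//a/@href")

# Keep only the 2014 log pages and strip ".." from the relative paths
links <- grep("rubinius/2014", links, value = TRUE)
links <- gsub("..", "", links, fixed = TRUE)

# Read the first table on each page (first five pages only, as a trial run)
messages <- lapply(links[1:5], function(l) {
    doc <- htmlParse(paste0(rootUri, l))
    readHTMLTable(doc, which = 1, header = FALSE)
})

# Stack the per-page tables into one data frame: V1 = time, V2 = nick, V3 = message
messages <- do.call(rbind, messages)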

##              V1            V2
## href.1 00:33:57     travis-ci
## href.2 05:04:23     travis-ci
## href.3 05:27:44     travis-ci
## href.4 10:00:59 yorickpeterse
## href.5 13:23:36 yorickpeterse
## href.6 13:23:53 yorickpeterse
##                                                                                           V3
## href.1     [travis-ci] rubinius/rubinius/master (fcc5b8c - Brian Shirai): The build passed.
## href.2 [travis-ci] rubinius/rubinius/master (901a6bc - Brian Shirai): The build was broken.
## href.3  [travis-ci] rubinius/rubinius/master (5cffe7b - Brian Shirai): The build was fixed.
## href.4                                                                              morning
## href.5           oh what the fuck RubyGems, why do you need the ext builder during runtime?
## href.6                                this better not be because I forgot --rubygems ignore
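Once the messages are in a data frame, one possible next step is to group them into documents and count terms. The following is only a minimal sketch in base R, not a finished pipeline. It makes several assumptions: the `date` column is not produced by the scraping code (that table holds only time, nick and message, so the date would have to be taken from each log page's URL), the sample rows are abridged from the output shown earlier, the column names `time`, `nick` and `text` stand in for V1, V2 and V3, and `tokenize` is a hypothetical helper written for this illustration.

```r
# Hypothetical sample shaped like the scraped table, plus an assumed `date` column
messages <- data.frame(
  date = c("2014-02-27", "2014-02-27", "2014-02-28", "2014-02-28"),
  time = c("00:33:57", "10:00:59", "13:23:36", "13:23:53"),
  nick = c("travis-ci", "yorickpeterse", "yorickpeterse", "yorickpeterse"),
  text = c("[travis-ci] rubinius/rubinius/master: The build passed.",
           "morning",
           "why do you need the ext builder during runtime?",
           "this better not be because I forgot --rubygems ignore"),
  stringsAsFactors = FALSE
)

# Drop automated build notifications (assumption: the bot's nick is "travis-ci")
human <- messages[messages$nick != "travis-ci", ]

# One document per day: concatenate all of that day's messages
docs <- vapply(split(human$text, human$date), paste, character(1), collapse = " ")

# Naive tokeniser: lower-case, split on runs of non-letters, drop empty strings
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^a-z]+")[[1]]
  words[nzchar(words)]
}

# Tokens per document, then term counts over the whole corpus
tokens <- lapply(docs, tokenize)
term_freq <- sort(table(unlist(tokens)), decreasing = TRUE)
```

Grouping by day is just one choice; splitting on `nick` instead would give one document per speaker. If a dedicated text-mining package is preferred, a named character vector like `docs` can be handed to it, for example with `tm::VCorpus(tm::VectorSource(docs))`.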


2014-02-28 02:33:07
