How can I extract individual documents from a whole corpus in R?

# 1 
preliminari report intern algebra languag cacm decemb 1958 perli a 
j samelson k ca581203 jb march 22 1978 8 28 
pm 100 5 1 123 5 1 164 5 1 
1 5 1 1 5 1 1 5 1 205 
5 1 210 5 1 214 5 1 1982 5 
1 398 5 1 642 5 1 669 5 1 
1 6 1 1 6 1 1 6 1 1 
6 1 1 6 1 1 6 1 1 6 
1 1 6 1 1 6 1 1 6 1 
165 6 1 196 6 1 196 6 1 1273 
6 1 1883 6 1 324 6 1 43 6 
1 53 6 1 91 6 1 410 6 1 
3184 6 1 
# 2 
extract of root by repeat subtract for digit comput cacm 
decemb 1958 sugai i ca581202 jb march 22 1978 8 
29 pm 2 5 2 2 5 2 2 5 
2 
# 3 
techniqu depart on matrix program scheme cacm decemb 1958 friedman 
m d ca581201 jb march 22 1978 8 30 pm 
3 5 3 3 5 3 3 5 3 
# 4 
glossari of comput engin and program terminolog cacm novemb 1958 
ca581103 jb march 22 1978 8 32 pm 4 5 
4 4 5 4 4 5 4 
# 5 
two squar root approxim cacm novemb 1958 wadei w g 
ca581102 jb march 22 1978 8 33 pm 5 5 
5 5 5 5 5 5 5 
# 6 
the us of comput in inspect procedur cacm novemb 1958 
muller m e ca581101 jb march 22 1978 8 33 
pm 6 5 6 6 5 6 6 5 6 
477 5 6 6 6 6 
# 7 
glossari of comput engin and program terminolog cacm octob 1958 
ca581003 jb march 22 1978 8 35 pm 7 5 
7 7 5 7 7 5 7 
# 8 
on the equival and transform of program scheme cacm octob 
1958 friedman m d ca581002 jb march 22 1978 8 
36 pm 8 5 8 8 5 8 8 5 
8 
# 9 
propos for a...

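I have this corpus of about 3000 retrieved documents. I want to create a new folder in which to keep these documents as 1.txt, 2.txt, and so on. Each document starts with #. For example, 1.txt would contain everything from # 1 up to the start of # 2, 2.txt would contain everything from # 2 up to the start of # 3, and so on. Any help is appreciated.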

2017-04-21 prai

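The numbers shown in bold are actually # 1, # 2, # 3, # 4, and so on. I don't know how they got converted to bold; my hashes disappeared when I posted the question :/ – prai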

Also, the corpus is in a .txt file. – prai

Please try to reformat your question so that it is readable. Also give us a clear question and show what you have tried so far. –

Let's assume your corpus is in a file named corpus.txt that looks like this:

# 1 
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, 
sed diam nonumy eirmod tempor invidunt ut labore et dolore 
magna aliquyam erat, sed diam voluptua. 
# 2 
At vero eos et accusam et justo duo dolores et ea rebum. 
Stet clita kasd gubergren, no sea takimata sanctus est 
Lorem ipsum dolor sit amet. 
# 3 
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, 
sed diam nonumy eirmod tempor invidunt ut labore et dolore 
magna aliquyam erat, sed diam voluptua. At vero eos et 
accusam et justo duo dolores et ea rebum.

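You can import the data with readLines and then find the indices of the elements that start with #. Then split the text at these indices and write the separate text files in a loop. Note that the header lines themselves (# 1, # 2, ...) are not written to the output files.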

Example:

## Import corpus: 
textVec <- readLines("corpus.txt") 

## Find indices of the lines starting with '#': 
indexVec <- c(grep("^#", textVec), length(textVec) + 1) 

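## Split corpus (header lines are dropped; seq() with length.out 
## also copes with a header that has no body lines after it): 
textList <- lapply(seq_len(length(indexVec) - 1), 
    function(ii) textVec[seq(indexVec[ii] + 1, 
        length.out = indexVec[ii + 1] - indexVec[ii] - 1)]) 

## Generate text files 1.txt, 2.txt, ... in the working directory: 
for (ii in seq_along(textList)) writeLines(textList[[ii]], con = paste0(ii, ".txt"))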
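
The question also asks for the files to go into a new folder, and it can be read as wanting each file to start with its "# n" line. A minimal sketch of one way to do that follows. It is a variant, not the method above: it groups lines with cumsum() and split() instead of index ranges. The function name split_corpus, its keep_header argument and the folder name "docs" are made up for illustration. Files are numbered by position in the corpus, which is assumed to match the number in the header, as it does in the sample.

```r
## Hypothetical helper: split a '#'-delimited corpus into one file per document.
split_corpus <- function(corpus_file, out_dir, keep_header = FALSE) {
  textVec <- readLines(corpus_file)
  is_header <- grepl("^#", textVec)

  ## cumsum() gives every line the count of headers seen so far, i.e. the
  ## position of the document it belongs to (0 = text before the first '#').
  doc_id <- cumsum(is_header)
  keep <- doc_id > 0 & (keep_header | !is_header)

  ## Fixing the factor levels keeps documents that have a header but no body.
  groups <- factor(doc_id[keep], levels = seq_len(sum(is_header)))
  docs <- split(textVec[keep], groups)

  ## Create the output folder if needed and write 1.txt, 2.txt, ...
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (nm in names(docs)) {
    writeLines(docs[[nm]], con = file.path(out_dir, paste0(nm, ".txt")))
  }
  invisible(docs)
}

## Usage on a small made-up corpus written to a temporary location:
corpus_file <- tempfile(fileext = ".txt")
writeLines(c("# 1", "first line", "second line", "# 2", "third line"), corpus_file)
out_dir <- file.path(tempdir(), "docs")
docs <- split_corpus(corpus_file, out_dir)
```

Calling split_corpus("corpus.txt", "docs", keep_header = TRUE) would instead keep the "# n" line at the top of each file.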

2017-04-21 18:36:19 ikop

It solved my problem. Thank you very much, ikop :D – prai
