Extracting a table from a text file

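I am trying to extract a table from a text file and have found several earlier posts here that address similar problems. However, none of them seems to solve my problem efficiently. The most helpful answer I found was to an earlier question of mine here: R: removing header, footer and sporadic column headings when reading csv file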

An example dummy text file contains:

> 
> 
> ############################################################################### 
> 
> # Display AICc Table for the models above 
> 
> 
> collect.models(, adjust = FALSE) 
     model npar AICc DeltaAICc weight Deviance 
13  P1 19 94  0.00  0.78  9 
12  P2 21 94  2.64  0.20  9 
10  P3 15 94  9.44  0.02  9 
2  P4 11 94 619.26  0.00  9 
> 
> 
> ############################################################################### 
> 
> # the three lines below count the number of errors in the code above 
> 
> cat("ERROR COUNT:", .error.count, "\n") 
ERROR COUNT: 0 
> options(error = old.error.fun) 
> rm(.error.count, old.error.fun, new.error.fun) 
> 
> ########## 
> 
>

I wrote the code below to extract the desired table:

my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log') 

top <- '> collect.models\\(, adjust = FALSE)' 
bottom <- '> # the three lines below count the number of errors in the code above' 

my.data <- my.data[-c(grep(bottom, my.data):length(my.data))] 
my.data <- my.data[-c(1:grep(top, my.data))] 
my.data <- my.data[c(1:(length(my.data)-4))] 
aa  <- as.data.frame(my.data) 
aa 

write.table(my.data, 'c:/users/mmiller21/simple R programs/dummy.log.extraction.txt', quote=F, col.names=F, row.name=F) 
my.data2 <- read.table('c:/users/mmiller21/simple R programs/dummy.log.extraction.txt', header = TRUE, row.names = c(1)) 
my.data2 
    model npar AICc DeltaAICc weight Deviance 
13 P1 19 94  0.00 0.78  9 
12 P2 21 94  2.64 0.20  9 
10 P3 15 94  9.44 0.02  9 
2  P4 11 94 619.26 0.00  9

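I would prefer to avoid having to write and then read my.data to obtain the desired data frame. Just before that step, the current code returns my.data as a vector of strings: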

[1] "  model npar AICc DeltaAICc weight Deviance" "13  P1 19 94  0.00  0.78  9" 
[3] "12  P2 21 94  2.64  0.20  9" "10  P3 15 94  9.44  0.02  9" 
[5] "2  P4 11 94 619.26  0.00  9"

Is there some way to convert the vector of strings above into a data frame like the one in dummy.log.extraction.txt without writing and then reading my.data?

The line:

aa <- as.data.frame(my.data)

returns the following, which looks like what I want:

#            my.data 
# 1  model npar AICc DeltaAICc weight Deviance 
# 2 13  P1 19 94  0.00  0.78  9 
# 3 12  P2 21 94  2.64  0.20  9 
# 4 10  P3 15 94  9.44  0.02  9 
# 5 2  P4 11 94 619.26  0.00  9

But:

dim(aa) 
# [1] 5 1

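If I could split aa into columns, I think I would have what I want without having to write and then read my.data.

I found the post: Extracting Data from Text Files. However, in the answers posted there, the table in question seems to have a fixed number of rows. In my case the number of rows can vary between 1 and 20. Also, I would prefer to use base R. In my case, I think the number of lines between bottom and the last row of the table is constant (here 4).

I also found the post: How to extract data from a text file using R or PowerShell? However, in my case the column widths are not fixed, and I do not know how to split the strings (or rows) so that there are only seven columns.

Given all of the above, perhaps my question really is how to split the object aa into columns. Thank you for any advice or assistance.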
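To make the splitting step concrete, here is a minimal base R sketch. It assumes every table line splits cleanly on runs of whitespace and that no field contains a space; the vector is copied from the output shown earlier, and trimws() needs R 3.2.0 or later.

```r
# The character vector from the question, reproduced inline so the sketch is self-contained.
my.data <- c("  model npar AICc DeltaAICc weight Deviance",
             "13  P1 19 94  0.00  0.78  9",
             "12  P2 21 94  2.64  0.20  9",
             "10  P3 15 94  9.44  0.02  9",
             "2  P4 11 94 619.26  0.00  9")

# Trim leading/trailing blanks, then split each line on runs of whitespace.
fields <- strsplit(trimws(my.data), "[[:space:]]+")

# The first element holds the six column names; every other element holds
# seven fields (a row label followed by six values).
header <- fields[[1]]
body   <- do.call(rbind, fields[-1])

# Use the first field as row names and the remaining six as columns.
aa <- as.data.frame(body[, -1, drop = FALSE], stringsAsFactors = FALSE)
names(aa)    <- header
rownames(aa) <- body[, 1]

# Convert columns that look numeric; 'model' stays character.
aa[] <- lapply(aa, type.convert, as.is = TRUE)

dim(aa)
str(aa)
```

This does by hand what read.table does when given the same lines, so it is mainly useful if the fields need custom handling before conversion.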

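EDIT:

The actual logs are produced by a supercomputer and contain up to 90,000 lines. However, the number of lines varies greatly from log to log. That is why I am using top and bottom.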

2013-07-04 Mark Miller

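Your data looks like output from an R console session. One wonders why the table was not exported, or why you cannot run the R code to obtain it. – Roland

The R file is run on a supercomputer, and the table is taken from the log returned by that machine. I do not know how to make the supercomputer output a table for me.

Your real log file may be quite different and more complex, but for this one you can use read.table directly; you just have to play with the right arguments.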

data <- read.table("c:/users/mmiller21/simple R programs/dummy.log", 
        comment.char = ">", 
        nrows = 4, 
        skip = 1, 
        header = TRUE, 
        row.names = 1) 

str(data) 
## 'data.frame': 4 obs. of 6 variables: 
## $ model : Factor w/ 4 levels "P1","P2","P3",..: 1 2 3 4 
## $ npar  : int 19 21 15 11 
## $ AICc  : int 94 94 94 94 
## $ DeltaAICc: num 0 2.64 9.44 619.26 
## $ weight : num 0.78 0.2 0.02 0 
## $ Deviance : int 9 9 9 9 

data 
## model npar AICc DeltaAICc weight Deviance 
## 13 P1 19 94  0.00 0.78  9 
## 12 P2 21 94  2.64 0.20  9 
## 10 P3 15 94  9.44 0.02  9 
## 2  P4 11 94 619.26 0.00  9

2013-07-04 08:07:04 dickoa

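Thank you. I should have mentioned that the log files contain about 20,000 lines, which is why I am using top and bottom. However, your answer may be helpful.

read.table and its family now have an option to read from text: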

> df <- read.table(text = paste(my.data, collapse = "\n")) 
> df 
    model npar AICc DeltaAICc weight Deviance 
13 P1 19 94  0.00 0.78  9 
12 P2 21 94  2.64 0.20  9 
10 P3 15 94  9.44 0.02  9 
2  P4 11 94 619.26 0.00  9 
> summary(df) 
model  npar   AICc  DeltaAICc   weight   Deviance 
P1:1 Min. :11.0 Min. :94 Min. : 0.00 Min. :0.000 Min. :9 
P2:1 1st Qu.:14.0 1st Qu.:94 1st Qu.: 1.98 1st Qu.:0.015 1st Qu.:9 
P3:1 Median :17.0 Median :94 Median : 6.04 Median :0.110 Median :9 
P4:1 Mean :16.5 Mean :94 Mean :157.84 Mean :0.250 Mean :9 
     3rd Qu.:19.5 3rd Qu.:94 3rd Qu.:161.90 3rd Qu.:0.345 3rd Qu.:9 
     Max. :21.0 Max. :94 Max. :619.26 Max. :0.780 Max. :9
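As a small illustration of the text argument, the sketch below passes the character vector directly, one element per line, and compares the result with reading through an explicit textConnection. The vector is copied from the question; the names df1, df2 and con are only for illustration. In both calls read.table detects the header automatically because the first line has one field fewer than the rows that follow.

```r
# The character vector from the question, reproduced inline.
my.data <- c("  model npar AICc DeltaAICc weight Deviance",
             "13  P1 19 94  0.00  0.78  9",
             "12  P2 21 94  2.64  0.20  9",
             "10  P3 15 94  9.44  0.02  9",
             "2  P4 11 94 619.26  0.00  9")

# Option 1: hand the vector to read.table through the 'text' argument.
df1 <- read.table(text = my.data)

# Option 2: read from an explicit text connection, then close it ourselves.
con <- textConnection(my.data)
df2 <- read.table(con)
close(con)

# Both routes should give the same data frame.
all.equal(df1, df2)
dim(df1)
```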

2013-07-04 07:54:50 kohske

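Thank you. I should have mentioned that the log files contain about 20,000 lines, which is why I am using top and bottom. However, your answer may be helpful.

It seems strange that you have to read R console output. In any case, you can use the fact that your table rows begin with a number, and extract the lines of interest with something like ^[0-9]+. Then use read.table as shown by @kohske.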

ll <- readLines('c:/users/mmiller21/simple R programs/dummy.log') 
idx <- which(grepl('^[0-9]+',ll)) 
idx <- c(min(idx)-1,idx) ## header line 
read.table(text=ll[idx]) 
model npar AICc DeltaAICc weight Deviance 
13 P1 19 94  0.00 0.78  9 
12 P2 21 94  2.64 0.20  9 
10 P3 15 94  9.44 0.02  9 
2  P4 11 94 619.26 0.00  9

2013-07-04 08:02:47 agstudy

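Thank you. I should have mentioned that the log files contain about 20,000 lines, which is why I am using top and bottom. However, your answer may be helpful.

Thanks to those who posted answers. Because of the size, complexity and variability of the actual log files, I think I need to keep using the variables top and bottom. However, I used elements of dickoa's answer to come up with the following.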

my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log') 

top <- '> collect.models\\(, adjust = FALSE)' 
bottom <- '> # the three lines below count the number of errors in the code above' 

my.data <- my.data[-c(grep(bottom, my.data):length(my.data))] 
my.data <- my.data[-c(1:grep(top, my.data))] 

x <- read.table(text=my.data, comment.char = ">") 
x 

# model npar AICc DeltaAICc weight Deviance 
# 13 P1 19 94  0.00 0.78  9 
# 12 P2 21 94  2.64 0.20  9 
# 10 P3 15 94  9.44 0.02  9 
# 2  P4 11 94 619.26 0.00  9

Here is simpler code:

my.data <- readLines('c:/users/mmiller21/simple R programs/dummy.log') 

top <- '> collect.models\\(, adjust = FALSE)' 
bottom <- '> # the three lines below count the number of errors in the code above' 

my.data <- my.data[grep(top, my.data):grep(bottom, my.data)] 

x <- read.table(text=my.data, comment.char = ">") 
x
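If the same extraction has to be run on many logs, the marker-window idea could be wrapped in a function. The sketch below is only one possible arrangement: extract_table and log.lines are hypothetical names, the log is a shortened in-memory copy of the dummy file, and the markers are matched with fixed = TRUE so that the parentheses and dots need no regex escaping.

```r
# Hypothetical helper (not from the thread): parse the table between two marker lines.
extract_table <- function(lines, top, bottom) {
  start <- grep(top, lines, fixed = TRUE)
  end   <- grep(bottom, lines, fixed = TRUE)
  # Fail loudly if a marker is missing, duplicated, or out of order.
  stopifnot(length(start) == 1, length(end) == 1, start < end)
  # Lines beginning with ">" inside the window are dropped as comments.
  read.table(text = lines[start:end], comment.char = ">")
}

# Shortened in-memory stand-in for the dummy log.
log.lines <- c(
  "> ",
  "> # Display AICc Table for the models above ",
  "> collect.models(, adjust = FALSE) ",
  "     model npar AICc DeltaAICc weight Deviance ",
  "13  P1 19 94  0.00  0.78  9 ",
  "12  P2 21 94  2.64  0.20  9 ",
  "10  P3 15 94  9.44  0.02  9 ",
  "2  P4 11 94 619.26  0.00  9 ",
  "> ",
  "> # the three lines below count the number of errors in the code above ",
  "ERROR COUNT: 0 "
)

x <- extract_table(log.lines,
                   top    = "> collect.models(, adjust = FALSE)",
                   bottom = "> # the three lines below count")
x
```

For a real log, lines would come from readLines() instead. The single-match check assumes each marker appears exactly once per file; logs with several tables would need a different rule.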

2013-07-04 08:59:48
