2013-07-03 23 views
1

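I have a file that contains a mix of data and text. I want to read it so that only the lines with three coordinates are kept. By three coordinates I mean lines in a format such as 490353.36, 3755632.81, 109.73. In other words, I want to keep the data that follows each SURFACE LINE. The data holds x, y and z coordinates for different cross-sections. In short: read the data in R only where there are three columns.

Sample data is shown below: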

ENDSTREAMNETWORK: 

BEGIN CROSS-SECTIONS: 

    CROSS-SECTION: 
    STREAM ID:Sipsey Fork  
    REACH ID:Sipsey Fork  
    STATION:13.60 
    NODE NAME:     
    CUT LINE: 
     490353.358391478 , 3755632.80772044 
     490254.511677942 , 3755640.28160111 
     490229.8 , 3755642.15 
     490205.088314326 , 3755644.01839947 
     490130.953109393 , 3755649.62143546 
    SURFACE LINE: 
    490353.36, 3755632.81, 109.73 
    490341.00, 3755633.74, 103.63 
    490331.74, 3755634.44, 97.54 
    490276.13, 3755638.65, 91.44 
    490263.78, 3755639.58, 85.34 
    490254.51, 3755640.28, 79.25 
    490254.51, 3755640.28, 79.25 
    490242.16, 3755641.22, 75.59 
    490229.80, 3755642.15, 75.59 
    490217.44, 3755643.08, 75.59 
    490205.09, 3755644.02, 79.25 
    490205.09, 3755644.02, 79.25 
    490186.55, 3755645.42, 85.34 
    490177.29, 3755646.12, 91.44 
    490158.75, 3755647.52, 97.54 
    490146.40, 3755648.45, 103.63 
    490130.95, 3755649.62, 109.73 
    END: 

    CROSS-SECTION: 
    STREAM ID:Sipsey Fork  
    REACH ID:Sipsey Fork  
    STATION:13.552* 
    NODE NAME:     
    CUT LINE: 
     490348.236792825 , 3755554.44864345 
     490248.581497463 , 3755561.99219479 
     490223.87626427 , 3755563.8637565 
     490199.171038808 , 3755565.73531763 
     490122.732478269 , 3755571.5258566 
    SURFACE LINE: 
    490348.24, 3755554.45, 109.73 
    490335.78, 3755555.39, 103.68 
    490332.73, 3755555.62, 101.72 
    490326.44, 3755556.10, 97.65 
    490321.09, 3755556.50, 96.98 
    490279.74, 3755559.63, 92.42 
    490270.38, 3755560.34, 91.35 
    490262.42, 3755560.94, 87.53 
    490258.64, 3755561.23, 85.56 
    490257.92, 3755561.29, 85.22 
    490253.65, 3755561.61, 82.50 
    490248.58, 3755561.99, 79.27 
    490248.58, 3755561.99, 79.27 
    490245.75, 3755562.21, 78.40 
    490243.64, 3755562.37, 77.73 
    490236.08, 3755562.94, 75.58 
    490223.88, 3755563.86, 75.58 
    490212.36, 3755564.74, 75.58 
    490209.15, 3755564.98, 76.44 
    490206.21, 3755565.20, 77.24 
    490200.50, 3755565.63, 78.84 
    490199.17, 3755565.74, 79.26 
    490199.17, 3755565.74, 79.26 
    490197.66, 3755565.85, 79.78 
    490193.00, 3755566.20, 81.22 
    490186.72, 3755566.68, 83.20 
    490182.06, 3755567.03, 84.83 
    490180.06, 3755567.18, 85.47 
    490170.51, 3755567.91, 91.44 
    490170.23, 3755567.93, 91.52 
    490151.40, 3755569.35, 97.45 
    490141.55, 3755570.10, 102.06 
    490138.66, 3755570.32, 103.48 
    490133.49, 3755570.71, 105.53 
    490122.73, 3755571.53, 109.73 
    END: 

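I have thousands of lines like the ones above. I just want to collect all the rows that have three comma-separated columns and save them as a data frame in R.

The desired output for the sample data above is shown below. The commas should be removed as well.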

 490353.36, 3755632.81, 109.73 
    490341.00, 3755633.74, 103.63 
    490331.74, 3755634.44, 97.54 
    490276.13, 3755638.65, 91.44 
    490263.78, 3755639.58, 85.34 
    490254.51, 3755640.28, 79.25 
    490254.51, 3755640.28, 79.25 
    490242.16, 3755641.22, 75.59 
    490229.80, 3755642.15, 75.59 
    490217.44, 3755643.08, 75.59 
    490205.09, 3755644.02, 79.25 
    490205.09, 3755644.02, 79.25 
    490186.55, 3755645.42, 85.34 
    490177.29, 3755646.12, 91.44 
    490158.75, 3755647.52, 97.54 
    490146.40, 3755648.45, 103.63 
    490130.95, 3755649.62, 109.73 
    490348.24, 3755554.45, 109.73 
    490335.78, 3755555.39, 103.68 
    490332.73, 3755555.62, 101.72 
    490326.44, 3755556.10, 97.65 
    490321.09, 3755556.50, 96.98 
    490279.74, 3755559.63, 92.42 
    490270.38, 3755560.34, 91.35 
    490262.42, 3755560.94, 87.53 
    490258.64, 3755561.23, 85.56 
    490257.92, 3755561.29, 85.22 
    490253.65, 3755561.61, 82.50 
    490248.58, 3755561.99, 79.27 
    490248.58, 3755561.99, 79.27 
    490245.75, 3755562.21, 78.40 
    490243.64, 3755562.37, 77.73 
    490236.08, 3755562.94, 75.58 
    490223.88, 3755563.86, 75.58 
    490212.36, 3755564.74, 75.58 
    490209.15, 3755564.98, 76.44 
    490206.21, 3755565.20, 77.24 
    490200.50, 3755565.63, 78.84 
    490199.17, 3755565.74, 79.26 
    490199.17, 3755565.74, 79.26 
    490197.66, 3755565.85, 79.78 
    490193.00, 3755566.20, 81.22 
    490186.72, 3755566.68, 83.20 
    490182.06, 3755567.03, 84.83 
    490180.06, 3755567.18, 85.47 
    490170.51, 3755567.91, 91.44 
    490170.23, 3755567.93, 91.52 
    490151.40, 3755569.35, 97.45 
    490141.55, 3755570.10, 102.06 
    490138.66, 3755570.32, 103.48 
    490133.49, 3755570.71, 105.53 
    490122.73, 3755571.53, 109.73 
+0

If you are on Linux or have awk available, this one-liner may help: awk '{FS = ","} {if (NF == 3) print}' raw_text – dickoa

Answers

3

I would do something like this, first reading the text file with readLines:

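# read the whole file, one element per line
tt <- readLines("myfile.txt") 
# pattern: optional spaces, then three fields separated by two commas
pat <- "^[ ]*(.*),(.*),(.*)[ ]*$" 
# keep only the matching lines and rebuild each one as field1,field2,field3
tt <- gsub(pat, "\\1,\\2,\\3", grep(pat, tt, value=TRUE)) 
# parse the filtered lines as comma-separated data
dat <- read.table(textConnection(tt), sep=",", header=FALSE) 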

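The idea: first we read the whole file into tt so that we can make all the required changes, filter the lines we want, and so on. Next we have to decide which lines to keep and which to discard. For that we build a pattern: zero or more spaces, followed by the first field, then a comma, the second field, another comma, the third field, and finally zero or more spaces. Note that (.*) matches any characters, not only digits, so the pattern really selects lines containing at least two commas; in this file those are exactly the lines with three comma-separated columns. We first use pat with grep to filter the lines and keep only those that match (using value=TRUE). Then we use gsub to strip the leading spaces and keep just what sits between the commas (I don't think this is strictly necessary, but it does no harm). At that point we have the data we need. All that remains is to pass it to textConnection and read it with read.table as usual. Hope this helps.

The lines of code are already separated. Just run them one at a time and look at the output, and you should understand it right away.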
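A minimal, self-contained sketch of the same grep-then-read idea, run on a few lines copied from the question instead of a file. The stricter pattern num_pat and the column names x, y, z are assumptions added for illustration, not part of the answer above: the pattern only accepts three plain decimal numbers, whereas (.*) accepts any text between the commas.

```r
# A few lines copied from the sample data in the question
tt <- c("    CUT LINE: ",
        "     490353.358391478 , 3755632.80772044 ",
        "    SURFACE LINE: ",
        "    490353.36, 3755632.81, 109.73 ",
        "    490341.00, 3755633.74, 103.63 ",
        "    END: ")

# Hypothetical stricter pattern: exactly three unsigned decimal numbers,
# separated by commas, with optional spaces around each field
num <- "[0-9]+(\\.[0-9]+)?"
num_pat <- paste0("^ *", num, " *, *", num, " *, *", num, " *$")

# Keep only the lines that match the pattern
keep <- grep(num_pat, tt, value = TRUE)

# Parse the kept lines; strip.white drops the padding around each field
dat <- read.table(text = keep, sep = ",", header = FALSE,
                  strip.white = TRUE,
                  col.names = c("x", "y", "z"))
str(dat)
```

With a numeric-only pattern, a stray text line that happens to contain two commas would be rejected instead of ending up in the data frame.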

+0

Ugh. readLines is exactly what I was looking for. Nice. – nograpes

+0

+1 Very nice approach – dickoa

+0

@Arun: Thanks a lot, Arun. Could you add some text explaining what each line of the code does? –

3

This is so ugly that I almost didn't post it. But it works. I read in your data like this:

raw<-read.table(textConnection('ENDSTREAMNETWORK: 

BEGIN CROSS-SECTIONS: 

    CROSS-SECTION: 
    STREAM ID:Sipsey Fork  
    REACH ID:Sipsey Fork  
    STATION:13.60 
    NODE NAME:     
    CUT LINE: 
     490353.358391478 , 3755632.80772044 
     490254.511677942 , 3755640.28160111 
     490229.8 , 3755642.15 
     490205.088314326 , 3755644.01839947 
     490130.953109393 , 3755649.62143546 
    SURFACE LINE: 
    490353.36, 3755632.81, 109.73 
    490341.00, 3755633.74, 103.63 
    490331.74, 3755634.44, 97.54 
    490276.13, 3755638.65, 91.44 
    490263.78, 3755639.58, 85.34 
    490254.51, 3755640.28, 79.25 
    490254.51, 3755640.28, 79.25 
    490242.16, 3755641.22, 75.59 
    490229.80, 3755642.15, 75.59 
    490217.44, 3755643.08, 75.59 
    490205.09, 3755644.02, 79.25 
    490205.09, 3755644.02, 79.25 
    490186.55, 3755645.42, 85.34 
    490177.29, 3755646.12, 91.44 
    490158.75, 3755647.52, 97.54 
    490146.40, 3755648.45, 103.63 
    490130.95, 3755649.62, 109.73 
    END: 

    CROSS-SECTION: 
    STREAM ID:Sipsey Fork  
    REACH ID:Sipsey Fork  
    STATION:13.552* 
    NODE NAME:     
    CUT LINE: 
     490348.236792825 , 3755554.44864345 
     490248.581497463 , 3755561.99219479 
     490223.87626427 , 3755563.8637565 
     490199.171038808 , 3755565.73531763 
     490122.732478269 , 3755571.5258566 
    SURFACE LINE: 
    490348.24, 3755554.45, 109.73 
    490335.78, 3755555.39, 103.68 
    490332.73, 3755555.62, 101.72 
    490326.44, 3755556.10, 97.65 
    490321.09, 3755556.50, 96.98 
    490279.74, 3755559.63, 92.42 
    490270.38, 3755560.34, 91.35 
    490262.42, 3755560.94, 87.53 
    490258.64, 3755561.23, 85.56 
    490257.92, 3755561.29, 85.22 
    490253.65, 3755561.61, 82.50 
    490248.58, 3755561.99, 79.27 
    490248.58, 3755561.99, 79.27 
    490245.75, 3755562.21, 78.40 
    490243.64, 3755562.37, 77.73 
    490236.08, 3755562.94, 75.58 
    490223.88, 3755563.86, 75.58 
    490212.36, 3755564.74, 75.58 
    490209.15, 3755564.98, 76.44 
    490206.21, 3755565.20, 77.24 
    490200.50, 3755565.63, 78.84 
    490199.17, 3755565.74, 79.26 
    490199.17, 3755565.74, 79.26 
    490197.66, 3755565.85, 79.78 
    490193.00, 3755566.20, 81.22 
    490186.72, 3755566.68, 83.20 
    490182.06, 3755567.03, 84.83 
    490180.06, 3755567.18, 85.47 
    490170.51, 3755567.91, 91.44 
    490170.23, 3755567.93, 91.52 
    490151.40, 3755569.35, 97.45 
    490141.55, 3755570.10, 102.06 
    490138.66, 3755570.32, 103.48 
    490133.49, 3755570.71, 105.53 
    490122.73, 3755571.53, 109.73 
    END:'),sep='\n',stringsAsFactors=FALSE) 

Then I wrangled it into a data.frame:

vec<-unlist(raw) 

start<-grep('SURFACE LINE:',vec)+1 
end<-grep('END:',vec)-1 

data<-do.call(rbind, 
lapply(seq_along(start), 
    # sep=',' splits on the commas so the columns are read as numbers
    function(x) read.table(textConnection(vec[start[x]:end[x]]),sep=',')) 
) 
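Because this approach locates each SURFACE LINE block separately, it can also record which cross-section each row came from. The sketch below is an extension for illustration, not part of the answer: the section column, the names first and last, and the shortened sample in vec are all assumptions, and it assumes every block contains at least one data row.

```r
# Shortened stand-in for the file contents, one element per line
vec <- c("CROSS-SECTION:", "SURFACE LINE:",
         "490353.36, 3755632.81, 109.73",
         "490341.00, 3755633.74, 103.63",
         "END:",
         "CROSS-SECTION:", "SURFACE LINE:",
         "490348.24, 3755554.45, 109.73",
         "END:")

first <- grep("SURFACE LINE:", vec, fixed = TRUE) + 1  # first data row of each block
last  <- grep("END:", vec, fixed = TRUE) - 1           # last data row of each block

blocks <- lapply(seq_along(first), function(i) {
  d <- read.table(text = vec[first[i]:last[i]], sep = ",",
                  col.names = c("x", "y", "z"))
  d$section <- i   # hypothetical id: order of appearance in the file
  d
})

# Stack the per-section data frames into one
dat <- do.call(rbind, blocks)
```
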
2

Not the shortest, but easier to understand, in my opinion:

raw_text <- "ENDSTREAMNETWORK: 

BEGIN CROSS-SECTIONS: 

    CROSS-SECTION: 
    STREAM ID:Sipsey Fork  
    REACH ID:Sipsey Fork  
    STATION:13.60 
    NODE NAME:     
    CUT LINE: 
     490353.358391478 , 3755632.80772044 
     490254.511677942 , 3755640.28160111 
     490229.8 , 3755642.15 
     490205.088314326 , 3755644.01839947 
     490130.953109393 , 3755649.62143546 
    SURFACE LINE: 
    490353.36, 3755632.81, 109.73 
    490341.00, 3755633.74, 103.63 
    490331.74, 3755634.44, 97.54 
    490276.13, 3755638.65, 91.44 
    490263.78, 3755639.58, 85.34 
    490254.51, 3755640.28, 79.25 
    490254.51, 3755640.28, 79.25 
    490242.16, 3755641.22, 75.59 
    490229.80, 3755642.15, 75.59 
    490217.44, 3755643.08, 75.59 
    490205.09, 3755644.02, 79.25 
    490205.09, 3755644.02, 79.25 
    490186.55, 3755645.42, 85.34 
    490177.29, 3755646.12, 91.44 
    490158.75, 3755647.52, 97.54 
    490146.40, 3755648.45, 103.63 
    490130.95, 3755649.62, 109.73 
    END:" 

Here are the steps:

## read the data 
raw_data <- readLines(textConnection(raw_text)) 

## split by "," 
split_list <- strsplit(raw_data, ",") 

## check for 3 columns 
data <- split_list[sapply(split_list, length) == 3] 

## remove space and "," 
data <- lapply(data, function(x) gsub("\\s+|\\,", "", x)) 

## bind the data 
do.call("rbind", data) 


##  [,1]  [,2]   [,3]  
## [1,] "490353.36" "3755632.81" "109.73" 
## [2,] "490341.00" "3755633.74" "103.63" 
## [3,] "490331.74" "3755634.44" "97.54" 
## [4,] "490276.13" "3755638.65" "91.44" 
## [5,] "490263.78" "3755639.58" "85.34" 
## [6,] "490254.51" "3755640.28" "79.25" 
## [7,] "490254.51" "3755640.28" "79.25" 
## [8,] "490242.16" "3755641.22" "75.59" 
## [9,] "490229.80" "3755642.15" "75.59" 
## [10,] "490217.44" "3755643.08" "75.59" 
## [11,] "490205.09" "3755644.02" "79.25" 
## [12,] "490205.09" "3755644.02" "79.25" 
## [13,] "490186.55" "3755645.42" "85.34" 
## [14,] "490177.29" "3755646.12" "91.44" 
## [15,] "490158.75" "3755647.52" "97.54" 
## [16,] "490146.40" "3755648.45" "103.63" 
## [17,] "490130.95" "3755649.62" "109.73" 
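The result above is a character matrix, while the question asks for a data frame. A minimal sketch of the conversion, using only the first three rows of that output; the column names x, y, z are an assumption, since the source does not name the columns.

```r
# Character matrix shaped like the output above (first three rows only)
m <- rbind(c("490353.36", "3755632.81", "109.73"),
           c("490341.00", "3755633.74", "103.63"),
           c("490331.74", "3755634.44", "97.54"))

# Convert every cell to numeric, restore the three-column shape,
# and wrap the result in a data frame
dat <- as.data.frame(matrix(as.numeric(m), ncol = 3))
names(dat) <- c("x", "y", "z")   # assumed column names
str(dat)
```
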
0

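I would like to recommend another approach. As @dickoa points out, if you are a Linux or Mac user you can let a third-party program such as awk or egrep do the filtering for you. There is no need to filter manually outside R; a single system call does it. Both of the following work. With awk, as suggested by @dickoa: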

read.table(text = system("awk '{FS = \",\"} {if (NF == 3) print}' test.txt", 
         intern = TRUE), 
      sep = ',') 

With egrep:

read.table(text = system("egrep '^[^,]+,[^,]+,[^,]+$' test.txt", intern = TRUE), 
      sep = ',') 

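The advantage is that this does not force R to read all of the data into memory, which may help if you are reading very large files. It is also shorter than the other suggested answers.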
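Where awk and egrep are not available (for example on a plain Windows setup), a similar memory saving can be approximated in R alone by reading the file in chunks through a connection. This is only a sketch under stated assumptions: read_three_cols and chunk_size are hypothetical names, the pattern is the one from the egrep call, and the kept rows still accumulate in memory while the discarded text does not.

```r
# Hypothetical helper: read `path` in chunks of `chunk_size` lines and keep
# only the lines with exactly three comma-separated fields
read_three_cols <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break          # end of file reached
    hits <- grep("^[^,]+,[^,]+,[^,]+$", lines, value = TRUE)
    if (length(hits) > 0L) {
      kept[[length(kept) + 1L]] <- read.table(text = hits, sep = ",")
    }
  }
  do.call(rbind, kept)                      # NULL if nothing matched
}

# Usage on a small temporary file standing in for test.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("SURFACE LINE:",
             "490353.36, 3755632.81, 109.73",
             " 490353.358391478 , 3755632.80772044",
             "490341.00, 3755633.74, 103.63",
             "END:"), tmp)
dat <- read_three_cols(tmp, chunk_size = 2L)
```
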
