2013-04-11 61 views
2
y<-readLines("output.txt") 

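After reading in the txt file, I need to format the data into a data frame with a fixed number of columns. I need to get rid of the letters and of any line that does not have 21 columns. I am doing the following to strip out "-" and any letters. The goal is to parse the .txt file so the data can be read in.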

p<-gsub("-","",p)          # strip the dashes
p<-gsub("[A-Za-z]","",p)   # strip letters; the range A-z in [aA-zZ] can also match [ \ ] ^ _ `

System configuration: lcpu=96 mem=196608MB ent=16.00

kthr   memory       page      faults     cpu    time 
----------- --------------------- ------------------------------------ ------------------ ----------------------- -------- 
    r b p  avm  fre fi fo pi po fr  sr in  sy cs us sy id wa pc ec hr mi se 
19 0 0 21337487 7123470  0 201  0  0  0  0 3576 66723 30304 19 4 77 0 5.97 37.3 00:02:30 
27 0 0 21337431 7121069  0 123  0  0  0  0 4298 81526 36157 19 4 78 0 5.61 35.1 00:03:00 
18 0 0 21333631 7122351  0 195  0  0  0  0 3696 65163 30794 23 4 74 0 6.49 40.6 00:03:30 
19 0 0 21333590 7119082  0 194  0  0  0  0 5217 102823 47621 27 5 68 0 7.79 48.7 00:04:00 

    kthr   memory       page      faults     cpu    time 
----------- --------------------- ------------------------------------ ------------------ ----------------------- -------- 
    r b p  avm  fre fi fo pi po fr  sr in  sy cs us sy id wa pc ec hr mi se 
    20 0 0 21347610 7204383  0 167  0  0  0  0 3645 73642 33333 21 3 75 0 6.21 38.8 00:12:30 
    16 0 0 21347576 7201448  0 110  0  0  0  0 4882 84287 40503 23 4 73 0 6.77 42.3 00:13:00 

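Once I have stripped out the unwanted characters, I am left with some empty lines. This is not a data frame yet. How do I get rid of the empty lines here?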

Answers

3

You can do this with readLines and count.fields: keep only the lines that have exactly 21 fields and pass them to read.table.

# path is the path to your data file 
read.table(text=readLines(path)[count.fields(path, blank.lines.skip=FALSE) == 21]) 

# V1 V2 V3  V4  V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20  V21 
# 1 19 0 0 21337487 7123470 0 201 0 0 0 0 3576 66723 30304 19 4 77 0 5.97 37.3 00:02:30 
# 2 27 0 0 21337431 7121069 0 123 0 0 0 0 4298 81526 36157 19 4 78 0 5.61 35.1 00:03:00 
# 3 18 0 0 21333631 7122351 0 195 0 0 0 0 3696 65163 30794 23 4 74 0 6.49 40.6 00:03:30 
# 4 19 0 0 21333590 7119082 0 194 0 0 0 0 5217 102823 47621 27 5 68 0 7.79 48.7 00:04:00 
# 5 20 0 0 21347610 7204383 0 167 0 0 0 0 3645 73642 33333 21 3 75 0 6.21 38.8 00:12:30 
# 6 16 0 0 21347576 7201448 0 110 0 0 0 0 4882 84287 40503 23 4 73 0 6.77 42.3 00:13:00 
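
The one-liner above can be tried without the original file. The following is a minimal sketch, assuming a small hypothetical sample written to a temporary file (the `lines` vector and the `keep` helper are illustrative names, not part of the answer). It also guards against `NA` counts, which `count.fields` can return for lines inside a multi-line quoted field.

```r
# Hypothetical sample in the same layout as the question's file
lines <- c(
  "System configuration: lcpu=96 mem=196608MB ent=16.00",
  "",
  "    kthr   memory       page      faults     cpu    time",
  "    r b p  avm  fre fi fo pi po fr  sr in  sy cs us sy id wa pc ec hr mi se",
  "19 0 0 21337487 7123470  0 201  0  0  0  0 3576 66723 30304 19 4 77 0 5.97 37.3 00:02:30",
  "",
  "    20 0 0 21347610 7204383  0 167  0  0  0  0 3645 73642 33333 21 3 75 0 6.21 38.8 00:12:30"
)
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# One count per line, because blank lines are not skipped
n_fields <- count.fields(path, blank.lines.skip = FALSE)

# Keep only the lines with exactly 21 whitespace-separated fields
keep <- !is.na(n_fields) & n_fields == 21
df <- read.table(text = readLines(path)[keep], stringsAsFactors = FALSE)

print(dim(df))
```

Because `read.table` infers the column types, the numeric columns come back as numbers and only the time stamp in the last column stays as text.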
1

Regular expressions can help.

### For each element of your object "text", search for lines where...
    # we start at the beginning of the line, match a blank repeated
    # any number of times (including zero), then reach the end of the line
index <- grepl('^[[:blank:]]*$', text)

### Now that we know which lines are empty or contain only blanks, drop them
### (a logical index is safe even when no line matches)
text <- text[!index]
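
A small self-contained check of the blank-line filter, as a sketch on made-up input (the vectors `text`, `no_blanks` and the other names below are hypothetical). It also illustrates a general pitfall of removing elements by position: when `grep` finds nothing it returns `integer(0)`, and negative indexing with an empty index drops every element instead of none.

```r
# Made-up lines: two data rows, an empty line, spaces only, a tab only
text <- c("19 0 0", "", "   ", "\t", "20 0 0")

# TRUE for lines that are empty or consist only of spaces/tabs
is_blank <- grepl("^[[:blank:]]*$", text)
cleaned <- text[!is_blank]
print(cleaned)

# Pitfall: positional removal when nothing matches
no_blanks <- c("a", "b")
idx <- grep("^[[:blank:]]*$", no_blanks)                      # integer(0)
dropped_all <- no_blanks[-idx]                                # character(0)
kept <- no_blanks[!grepl("^[[:blank:]]*$", no_blanks)]        # both elements kept
print(length(dropped_all))
print(length(kept))
```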
0
dat <- readLines(textConnection(' 
    kthr   memory       page      faults     cpu    time 
----------- --------------------- ------------------------------------ ------------------ ----------------------- -------- 
    r b p  avm  fre fi fo pi po fr  sr in  sy cs us sy id wa pc ec hr mi se 
    19 0 0 21337487 7123470  0 201  0  0  0  0 3576 66723 30304 19 4 77 0 5.97 37.3 00:02:30 
    27 0 0 21337431 7121069  0 123  0  0  0  0 4298 81526 36157 19 4 78 0 5.61 35.1 00:03:00 
    18 0 0 21333631 7122351  0 195  0  0  0  0 3696 65163 30794 23 4 74 0 6.49 40.6 00:03:30 
    19 0 0 21333590 7119082  0 194  0  0  0  0 5217 102823 47621 27 5 68 0 7.79 48.7 00:04:00 

    kthr   memory       page      faults     cpu    time 
----------- --------------------- ------------------------------------ ------------------ ----------------------- -------- 
    r b p  avm  fre fi fo pi po fr  sr in  sy cs us sy id wa pc ec hr mi se 
    20 0 0 21347610 7204383  0 167  0  0  0  0 3645 73642 33333 21 3 75 0 6.21 38.8 00:12:30 
    16 0 0 21347576 7201448  0 110  0  0  0  0 4882 84287 40503 23 4 73 0 6.77 42.3 00:13:00')) 

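dat <- gsub('-','',dat)             # remove the dashes (the separator rows become blank)
dat <- gsub('[ ]{1,}','|',dat)      # collapse each run of spaces into a single "|"
dat <- strsplit(dat,split='\\|')    # split every line into its fields
n <- sapply(dat,length)             # fields per line; leading spaces add one empty field
dat[n==24]                          # header rows: 23 names plus the empty leading field
col.names <- dat[n==24][[1]] 
dat <- do.call(rbind,dat[n==22])    # data rows: 21 values plus the empty leading field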

You get this character matrix (not yet a data.frame; the first column is empty because of the leading spaces):

[,1] [,2] [,3] [,4] [,5]  [,6]  [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] 
[1,] "" "19" "0" "0" "21337487" "7123470" "0" "201" "0" "0" "0" "0" "3576" "66723" "30304" "19" "4" "77" "0" "5.97" "37.3" 
[2,] "" "27" "0" "0" "21337431" "7121069" "0" "123" "0" "0" "0" "0" "4298" "81526" "36157" "19" "4" "78" "0" "5.61" "35.1" 
[3,] "" "18" "0" "0" "21333631" "7122351" "0" "195" "0" "0" "0" "0" "3696" "65163" "30794" "23" "4" "74" "0" "6.49" "40.6" 
[4,] "" "19" "0" "0" "21333590" "7119082" "0" "194" "0" "0" "0" "0" "5217" "102823" "47621" "27" "5" "68" "0" "7.79" "48.7" 
[5,] "" "20" "0" "0" "21347610" "7204383" "0" "167" "0" "0" "0" "0" "3645" "73642" "33333" "21" "3" "75" "0" "6.21" "38.8" 
[6,] "" "16" "0" "0" "21347576" "7201448" "0" "110" "0" "0" "0" "0" "4882" "84287" "40503" "23" "4" "73" "0" "6.77" "42.3" 
    [,22]  
[1,] "00:02:30" 
[2,] "00:03:00" 
[3,] "00:03:30" 
[4,] "00:04:00" 
[5,] "00:12:30" 
[6,] "00:13:00" 

I think you still need to convert the data to numeric...
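
One possible way to do that conversion, sketched on a small hypothetical stand-in for the matrix (the object `m` below has fewer columns than the real one and is made up for illustration): drop the all-empty column, turn the matrix into a data frame, and convert every column except the time stamp with `as.numeric`.

```r
# Hypothetical stand-in for the character matrix: an empty first column,
# some numeric fields, and a time stamp in the last column
m <- rbind(
  c("", "19", "0", "5.97", "00:02:30"),
  c("", "27", "0", "5.61", "00:03:00")
)

# Drop columns that are empty in every row
m <- m[, colSums(m != "") > 0, drop = FALSE]

# Character matrix -> data frame (columns are named V1, V2, ...)
df <- as.data.frame(m, stringsAsFactors = FALSE)

# Convert everything except the last (time) column to numeric
last <- ncol(df)
df[-last] <- lapply(df[-last], as.numeric)

str(df)
```

The time column is left as text here. Whether to parse it further (for example into seconds) depends on what the data is used for.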