2013-08-05 51 views
0

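I have some space-delimited numeric data. I tried to read it into R with read.table, but I ran into problems on lines where the space separator is missing, so many values end up stuck together. How can I read this data correctly? I tried changing some read.table parameters, but that was not enough. How do I read stuck-together data in R?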

The original data is here: https://dl.dropboxusercontent.com/u/74190377/data.txt

A sample of the data looks like this:

structure(list(id = c("60019660101", "60019660102", "60019660103", 
"60019660104", "60019660105", "60019660106", "60019660107", "60019660108", 
"60019660109", "60019660110", "60019660111", "60019660112", "60019660113", 
"60019660114", "60019660115", "60019660116", "60019660117", "60019660118", 
"60019660119-10.6-12.4-11.9-11.6"), name1 = c("4.3", "7.4", "5.8", 
"4.3", "-3.5-12.9", "-6.6-13.3", "-5.7", "-5.0-11.4", "-7.5-12.0", 
"-8.8-15.3-11.5-19.5", "-9.8-16.4-13.1-22.3", "-8.9-17.4-10.9-20.0", 
"-7.3", "-5.8-10.5", "-5.4-13.6", "-9.4-20.4-14.4-26.3", "-7.9-15.6-10.3-19.4", 
"-8.7-11.2-10.5-16.0", "1.3"), name2 = c(".7", "3.8", "3.0", 
"-4.1", "-8.6", "-8.6-16.3", "-7.5", "-8.9-11.0", "-9.6-17.6", 
".0", ".6", "2.4", "-9.2", "-6.9", "-8.3", ".0", "1.2", ".8", 
"34-99.0"), name3 = c("3.4", "5.5", "4.2", "-1.9", "-5.6", "6.1", 
"-6.6", "1.8", "1.6", "20-99.0", "18", "17-99.0", "-8.5", "-8.0", 
"-9.1", "33", "33-99.0", "34-99.0", "-.9"), name4 = c("1.0", 
"1.9", "1.8", "-2.4", "1.5", "21-99.0", "-7.9", "25-99.0", "27-99.0", 
"-.9", "1.5", "-.9", "-9.1", "6.1", ".1", "4.6", "-.9", "-.9", 
"-.9"), name5 = c("1.0", "1.6", "10.9", "7.2", "17-99.0", "-.9", 
"1.0", "-.9", "-.9", "-.9", "-.9", "-.9", "2.4", "25-99.0", "33-99.0", 
"-.9", "-.9", "-.9", "-.9"), name6 = c("-9", "-9", "-9", "7-99.0", 
"-.9", "-.9", "27-99.0", "-.9", "-.9", "-.9", "-.9", "-.9", "20-99.0", 
"-.9", "-.9", "-.9", "-.9", "-.9", "-.9"), name7 = c(3.1, 3.7, 
2.7, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, 
-0.9, -0.9, -0.9, -0.9, -0.9, -0.9), name8 = c(-0.9, -0.9, -0.9, 
-0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, 
-0.9, -0.9, -0.9, -0.9, NA), name9 = c(-0.9, -0.9, -0.9, -0.9, 
-0.9, -0.9, -0.9, -0.9, -0.9, NA, -0.9, NA, -0.9, -0.9, -0.9, 
-0.9, NA, NA, NA), name10 = c(-0.9, -0.9, -0.9, -0.9, -0.9, NA, 
-0.9, NA, NA, NA, NA, NA, -0.9, -0.9, -0.9, NA, NA, NA, NA), 
    name11 = c(9.6, 7.8, 9, -0.9, NA, NA, -0.9, NA, NA, NA, NA, 
    NA, -0.9, NA, NA, NA, NA, NA, NA), name12 = c(-0.9, -0.9, 
    -0.9, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA)), .Names = c("id", "name1", "name2", "name3", 
"name4", "name5", "name6", "name7", "name8", "name9", "name10", 
"name11", "name12"), class = "data.frame", row.names = c(NA, 
-19L)) 

Here is my (bad) output:

       id    name1  name2 name3 name4 name5 name6 name7 name8 name9 name10 name11 name12 
1      60019660101     4.3  .7  3.4  1.0  1.0  -9 3.1 -0.9 -0.9 -0.9 9.6 -0.9 
2      60019660102     7.4  3.8  5.5  1.9  1.6  -9 3.7 -0.9 -0.9 -0.9 7.8 -0.9 
3      60019660103     5.8  3.0  4.2  1.8 10.9  -9 2.7 -0.9 -0.9 -0.9 9.0 -0.9 
4      60019660104     4.3  -4.1 -1.9 -2.4  7.2 7-99.0 -0.9 -0.9 -0.9 -0.9 -0.9  NA 
5      60019660105   -3.5-12.9  -8.6 -5.6  1.5 17-99.0  -.9 -0.9 -0.9 -0.9 -0.9  NA  NA 
6      60019660106   -6.6-13.3 -8.6-16.3  6.1 21-99.0  -.9  -.9 -0.9 -0.9 -0.9  NA  NA  NA 
7      60019660107    -5.7  -7.5 -6.6 -7.9  1.0 27-99.0 -0.9 -0.9 -0.9 -0.9 -0.9  NA 
8      60019660108   -5.0-11.4 -8.9-11.0  1.8 25-99.0  -.9  -.9 -0.9 -0.9 -0.9  NA  NA  NA 
9      60019660109   -7.5-12.0 -9.6-17.6  1.6 27-99.0  -.9  -.9 -0.9 -0.9 -0.9  NA  NA  NA 
10      60019660110 -8.8-15.3-11.5-19.5  .0 20-99.0  -.9  -.9  -.9 -0.9 -0.9 NA  NA  NA  NA 
11      60019660111 -9.8-16.4-13.1-22.3  .6  18  1.5  -.9  -.9 -0.9 -0.9 -0.9  NA  NA  NA 
12      60019660112 -8.9-17.4-10.9-20.0  2.4 17-99.0  -.9  -.9  -.9 -0.9 -0.9 NA  NA  NA  NA 
13      60019660113    -7.3  -9.2 -8.5 -9.1  2.4 20-99.0 -0.9 -0.9 -0.9 -0.9 -0.9  NA 
14      60019660114   -5.8-10.5  -6.9 -8.0  6.1 25-99.0  -.9 -0.9 -0.9 -0.9 -0.9  NA  NA 
15      60019660115   -5.4-13.6  -8.3 -9.1  .1 33-99.0  -.9 -0.9 -0.9 -0.9 -0.9  NA  NA 
16      60019660116 -9.4-20.4-14.4-26.3  .0  33  4.6  -.9  -.9 -0.9 -0.9 -0.9  NA  NA  NA 
17      60019660117 -7.9-15.6-10.3-19.4  1.2 33-99.0  -.9  -.9  -.9 -0.9 -0.9 NA  NA  NA  NA 
18      60019660118 -8.7-11.2-10.5-16.0  .8 34-99.0  -.9  -.9  -.9 -0.9 -0.9 NA  NA  NA  NA 
19 60019660119-10.6-12.4-11.9-11.6     1.3 34-99.0  -.9  -.9  -.9  -.9 -0.9 NA NA  NA  NA  NA 

This is how the correct data should look:

60019660101 4.3 .7  3.4  1.0 1.0 -9  3.1 -.9 -.9 -.9 9.6 -.9 
    60019660102 7.4 3.8  5.5  1.9 1.6 -9  3.7 -.9 -.9 -.9 7.8 -.9 
    60019660103 5.8 3.0  4.2  1.8 10.9 -9  2.7 -.9 -.9 -.9 9.0 -.9 
    60019660104 4.3 -4.1 -1.9 -2.4 7.2  7 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660105 -3.5 -12.9 -8.6 -5.6 1.5  17 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660106 -6.6 -13.3 -8.6 -16.3 6.1  21 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660107 -5.7 -7.5 -6.6 -7.9 1.0  27 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660108 -5.0 -11.4 -8.9 -11.0 1.8  25 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660109 -7.5 -12.0 -9.6 -17.6 1.6  27 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660110 -8.8 -15.3 -11.5 -19.5 .0  20 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660111 -9.8 -16.4 -13.1 -22.3 .6  18 1.5 -.9 -.9 -.9 -.9 -.9 
    60019660112 -8.9 -17.4 -10.9 -20.0 2.4  17 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660113 -7.3 -9.2 -8.5 -9.1 2.4  20 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660114 -5.8 -10.5 -6.9 -8.0 6.1  25 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660115 -5.4 -13.6 -8.3 -9.1  .1  33 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660116 -9.4 -20.4 -14.4 -26.3 .0  33 4.6 -.9 -.9 -.9 -.9 -.9 
    60019660117 -7.9 -15.6 -10.3 -19.4 1.2  33 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660118 -8.7 -11.2 -10.5 -16.0 .8  34 -99.0 -.9 -.9 -.9 -.9 -.9 
    60019660119 -10.6 -12.4 -11.9 -11.6 1.3  34 -99.0 -.9 -.9 -.9 -.9 -.9 
+0

Can you show us an example of what you get and what you would like to get? – James

Answers

5

You seem to have fixed-width formatted data.

read.fwf("https://dl.dropboxusercontent.com/u/74190377/data.txt", 
     widths=c(13,5,5,5,5,7,4,5,5,5,5,5,5)) 

#   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 
#1 60019660101 4.3 0.7 3.4 1.0 1.0 -9 3.1 -0.9 -0.9 -0.9 9.6 -0.9 
#2 60019660102 7.4 3.8 5.5 1.9 1.6 -9 3.7 -0.9 -0.9 -0.9 7.8 -0.9 
#3 60019660103 5.8 3.0 4.2 1.8 10.9 -9 2.7 -0.9 -0.9 -0.9 9.0 -0.9 
#4 60019660104 4.3 -4.1 -1.9 -2.4 7.2 7 -99.0 -0.9 -0.9 -0.9 -0.9 -0.9 
#5 60019660105 -3.5 -12.9 -8.6 -5.6 1.5 17 -99.0 -0.9 -0.9 -0.9 -0.9 -0.9 
<snip> 
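
To try the fixed-width approach without downloading the file, here is a minimal sketch. The two sample lines are built in code so that they match the widths used above; the assumption (not confirmed by the question) is that the real file right-aligns every value inside those column widths. The names `vals`, `pad_fields`, and `lines` are made up for this illustration.

```r
# Column widths taken from the read.fwf call above.
widths <- c(13, 5, 5, 5, 5, 7, 4, 5, 5, 5, 5, 5, 5)

# Two rows of values from the question (rows 4 and 19), as strings.
vals <- list(
  c("60019660104", "4.3", "-4.1", "-1.9", "-2.4", "7.2", "7",
    "-99.0", "-.9", "-.9", "-.9", "-.9", "-.9"),
  c("60019660119", "-10.6", "-12.4", "-11.9", "-11.6", "1.3", "34",
    "-99.0", "-.9", "-.9", "-.9", "-.9", "-.9")
)

# Hypothetical helper: left-pad each value with spaces to its column width
# and glue the fields together, which mimics a fixed-width line.
pad_fields <- function(v, w) {
  paste0(strrep(" ", w - nchar(v)), v, collapse = "")
}
lines <- vapply(vals, pad_fields, character(1), w = widths)

# In the second line the values touch each other, as in the problem file.
print(lines)

# read.fwf splits by character position, so missing spaces do not matter.
con <- textConnection(lines)
dat <- read.fwf(con, widths = widths)
close(con)
print(dat)
```

Because the split is done by position rather than by separator, a value such as `-10.6` that fills its whole 5-character column is still read as its own field.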
1

I would do a regex fix on the original file. Any editor (even MS Word!) can do this:

FIND "-"

REPLACE "(space or tab)-"

Replace all

After that, read.table should work just fine.

+0

Is that possible in R? –

+0

@JoteN Yes, by using `gsub`, but Roland's answer is better, at least for this specific source file format. –
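
A minimal sketch of the `gsub` route mentioned in the comment, for anyone who wants to stay inside R. The sample lines are typed inline in the style of the question's data so the example runs on its own; for the real file you would presumably read the lines with `readLines()` first. It assumes every `-` in the file is a minus sign that starts a new value (no hyphenated ids, no exponents such as `1e-5`).

```r
# Two sample lines in the style of the question's data.
raw <- c(
  "60019660104  4.3 -4.1 -1.9 -2.4  7.2  7-99.0 -.9 -.9 -.9 -.9 -.9",
  "60019660119-10.6-12.4-11.9-11.6  1.3 34-99.0 -.9 -.9 -.9 -.9 -.9"
)

# Put a space in front of every minus sign so fused values are separated.
# fixed = TRUE treats "-" as a literal string rather than a regex.
fixed_lines <- gsub("-", " -", raw, fixed = TRUE)

# By default read.table treats any run of whitespace as one separator,
# so the extra spaces added above are harmless.
dat2 <- read.table(text = fixed_lines, header = FALSE)
print(dat2)
```

This relies on the separator being recoverable from the minus signs alone. Two positive values that were fused together could not be split this way, which is one reason the fixed-width approach is more robust for this file.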