2017-04-15
1

Reading a text file in R and storing the lines that follow a matching line

I have a .txt file in this format:

-------------------------------------------------------------------------------------------------------------- 
m5a2              A2. Confirm how much time child lives with respondent 
-------------------------------------------------------------------------------------------------------------- 

        type: numeric (byte) 
       label: BM_101F 

       range: [-9,7]      units: 1 
     unique values: 8      missing .: 0/4898 

      tabulation: Freq. Numeric Label 
          1383  -9 -9 Not in wave 
          4  -2 -2 Don't know 
          2  -1 -1 Refuse 
          3272   1 1 all or most of the time 
          29   2 2 about half of the time 
          76   3 3 some of the time 
          80   4 4 none of the time 
          52   7 7 only on weekends 

-------------------------------------------------------------------------------------------------------------- 
m5a3             A3. Number of months ago child stopped living with you 
-------------------------------------------------------------------------------------------------------------- 

        type: numeric (int) 
       label: NUMERIC, but 44 nonmissing values are not labeled 

       range: [-9,120]      units: 1 
     unique values: 47      missing .: 0/4898 

       examples: -9 -9 Not in wave 
         -6 -6 Skip 
         -6 -6 Skip 
         -6 -6 Skip 

-------------------------------------------------------------------------------------------------------------- 

What matters to me is the code name (e.g. m5a2), the description (A2. Confirm how much time child lives with respondent), and finally the responses:

tabulation: Freq. Numeric Label 
          1383  -9 -9 Not in wave 
          4  -2 -2 Don't know 
          2  -1 -1 Refuse 
          3272   1 1 all or most of the time 
          29   2 2 about half of the time 
          76   3 3 some of the time 
          80   4 4 none of the time 
          52   7 7 only on weekends 

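I need to read these three items into a list for further processing.

I have tried the following, which works for retrieving the code name and description.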

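fileName <- "../data/ff_mom_cb9.txt"
conn <- file(fileName, open = "r")
linn <- readLines(conn)
L <- list()
# stop one line early so that linn[i + 1] always exists
for (i in seq_len(max(length(linn) - 1, 0))) {
    if (linn[i] == "--------------------------------------------------------------------------------------------------------------" && linn[i + 1] != "") {
        # the non-empty line after a dashed line holds the code name and description
        L[[length(L) + 1]] <- linn[i + 1]
    } else {
        # read until hit the next dashed line
    }
}
close(conn)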

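There are a few things I am confused about:
1. I don't know how to make it read lines until it hits the next dashed line.
2. If I want to be able to search visually and retrieve the data easily, is storing the data I read in a list the right approach?

Thanks.

Answers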

0

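This is going to be somewhat problematic because the format is quite irregular from item to item. Here is a run on the codebook text of the first item: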

txt <- "m5a2              A2. Confirm how much time child lives with respondent 
-------------------------------------------------------------------------------------------------------------- 

        type: numeric (byte) 
       label: BM_101F 

       range: [-9,7]      units: 1 
     unique values: 8      missing .: 0/4898 

      tabulation: Freq. Numeric Label 
          1383  -9 -9 Not in wave 
          4  -2 -2 Don't know 
          2  -1 -1 Refuse 
          3272   1 1 all or most of the time 
          29   2 2 about half of the time 
          76   3 3 some of the time 
          80   4 4 none of the time 
          52   7 7 only on weekends 
" 
Lines <- readLines(textConnection(txt)) 
# isolate lines with letter in first column 
Lines[grep("^[a-zA-Z]", Lines)] 
# Now replace long runs of spaces with commas and scan: 

scan(text=sub("[ ]{10,100}", ",", Lines[grep("^[a-zA-Z]", Lines)]), 
    sep=",", what="") 
#----
# Read 2 items
# [1] "m5a2"
# [2] "A2. Confirm how much time child lives with respondent"

The "tabulation" line can be used to create the column labels.

colnames <- scan(text=sub(".*tabulation[:]", "", 
        Lines[grep("tabulation[:]", Lines)]), sep="", what="") 
#Read 3 items 

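The comma-substitution strategy needs to be a bit more involved for the later lines. First isolate the rows where a digit is the first non-space character: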

dataRows <- Lines[grep("^[ ]*\\d", Lines)] 

Then substitute a comma for the pattern "digit followed by 2 or more spaces" and read the result with read.csv:

myDat <- read.csv(text= 
         gsub("(\\d)[ ]{2,}", "\\1,", dataRows), 
        header=FALSE ,col.names=colnames) 

#------------
myDat
#     V1 V2                         V3
# 1 1383 -9             -9 Not in wave
# 2    4 -2              -2 Don't know
# 3    2 -1                  -1 Refuse
# 4 3272  1  1 all or most of the time
# 5   29  2   2 about half of the time
# 6   76  3         3 some of the time
# 7   80  4         4 none of the time
# 8   52  7         7 only on weekends
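
The comma substitution depends on how many spaces separate the columns, and that can differ between files or get collapsed when the text is copied. As a hedged alternative (not part of the approach above), here is a minimal sketch that captures the three fields by pattern instead. The names `pat`, `parts` and `tab` are just illustrative, and it assumes every data row is "frequency, numeric code, label".

```r
# Sample rows copied from the codebook excerpt in the question
dataRows <- c("          1383  -9 -9 Not in wave",
              "          4  -2 -2 Don't know",
              "          3272   1 1 all or most of the time")

# Capture: frequency, numeric code, then the rest of the line as the label
pat <- "^[[:space:]]*([0-9]+)[[:space:]]+(-?[0-9]+)[[:space:]]+(.*[^[:space:]])[[:space:]]*$"
m <- regmatches(dataRows, regexec(pat, dataRows))

# Each match is c(full match, group 1, group 2, group 3); stack them row-wise
parts <- do.call(rbind, m)

tab <- data.frame(Freq    = as.integer(parts[, 2]),
                  Numeric = as.integer(parts[, 3]),
                  Label   = parts[, 4],
                  stringsAsFactors = FALSE)
tab
```

Rows that do not fit the pattern yield an empty match and are silently dropped by `rbind`, so it is worth comparing `nrow(tab)` with `length(dataRows)` before trusting the result.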

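If the Lines object held the entire file, you could loop over the multiple items using a counter generated from cumsum(grepl("^-------", Lines)). Counting the dashed lines in the full file gives: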

Lines <- readLines("http://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_mom_cb9.txt") 
sum(grepl("^-------", Lines)) 
#----------------------
# [1] 1966
# Warning messages:
# 1: In grepl("^-------", Lines) :
#     input string 6995 is invalid in this locale
# 2: In grepl("^-------", Lines) :
#     input string 7349 is invalid in this locale
# 3: In grepl("^-------", Lines) :
#     input string 7350 is invalid in this locale
# 4: In grepl("^-------", Lines) :
#     input string 7352 is invalid in this locale
# 5: In grepl("^-------", Lines) :
#     input string 7353 is invalid in this locale
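
To make the cumsum idea concrete, here is a rough sketch on a shortened stand-in for the file. It assumes that every item header sits between exactly two dashed lines, as in the excerpt in the question, and that the file starts with a dashed line. The names `is_rule`, `item_id`, `records` and `codebook` are made up for the illustration.

```r
# Shortened stand-in for the codebook; in practice Lines comes from readLines()
Lines <- c("-----------",
           "m5a2              A2. Confirm how much time child lives with respondent",
           "-----------",
           "",
           "      tabulation: Freq. Numeric Label",
           "          1383  -9 -9 Not in wave",
           "-----------",
           "m5a3             A3. Number of months ago child stopped living with you",
           "-----------",
           "",
           "       examples: -9 -9 Not in wave",
           "-----------")

is_rule <- grepl("^-------", Lines)

# Two dashed lines per item (above and below its header), so halve the
# running count of dashed lines and round up to get an item number
item_id <- ceiling(cumsum(is_rule) / 2)

# Drop the dashed lines and group the remaining lines by item
records <- split(Lines[!is_rule], item_id[!is_rule])

# The first line of each record is "name<spaces>description"
headers <- vapply(records, function(r) r[1], character(1))
codebook <- data.frame(name        = unname(sub("[ ].*$", "", headers)),
                       description = unname(sub("^[^ ]+[ ]+", "", headers)),
                       stringsAsFactors = FALSE)
codebook
```

Each element of `records` then holds everything between one header and the next dashed line, which is the "read until the next dashed line" step the question asks about. Keeping the records in a named list (or the summary in a data frame) makes it easy to look items up by code name later.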

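My "hand-held scan()-er" (that is, reading the file by eye) suggested to me that there are only two types of codebook records: "tabulation" (probably for items with about 10 or fewer distinct values) and "examples" (for items with more). They have different structures (as the snippet in your question shows), so it would probably be enough to build and deploy just two kinds of parsing logic. I think the tools described above are up to the task.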
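
A small sketch of how the two record types might be told apart before parsing. The function name `record_type` and the "other" fallback are my own assumptions; the only thing taken from the codebook is that a record contains either a "tabulation:" or an "examples:" line.

```r
# One record (the lines between two item headers), copied from the question
record <- c("m5a3             A3. Number of months ago child stopped living with you",
            "",
            "        type: numeric (int)",
            "       examples: -9 -9 Not in wave",
            "         -6 -6 Skip")

# Classify a record by which keyword line it contains
record_type <- function(rec) {
  if (any(grepl("tabulation:", rec, fixed = TRUE))) {
    "tabulation"
  } else if (any(grepl("examples:", rec, fixed = TRUE))) {
    "examples"
  } else {
    "other"   # anything that fits neither pattern, to be inspected by hand
  }
}

record_type(record)
```

The returned string can then be passed to `switch()` to pick the matching parser for each record.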

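The warnings all relate to the character "\x92" being used as an apostrophe. Regular expressions and R share the escape character "\", so you need to escape the escape. The lines can be corrected with: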

Lines <- gsub("\\\x92", "'", Lines) 
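
If that pattern still triggers "invalid in this locale" warnings (which can happen in a UTF-8 locale), a byte-wise, fixed-string replacement is one possible workaround. This is only a sketch on a made-up sample line; the variable names `x` and `fixed` are illustrative.

```r
# "\x92" is the byte used as an apostrophe in the file;
# build a sample line that contains it
x <- "4  -2 -2 Don\x92t know"

# Match the raw byte literally, without interpreting the string
# in the current locale
fixed <- gsub("\x92", "'", x, fixed = TRUE, useBytes = TRUE)
fixed
```

Another option, assuming the file is Windows-1252 encoded, would be to convert it with iconv() right after reading. The accepted encoding names vary by platform, so that route needs checking on your system.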
-1

How about this?

df <- read.table("file.txt", 
      header = FALSE) 
df 
+0

It seems to read only one line at a time into 'df'? – Waht

+0

What happens if you try this? –

+0

Or, if your file has a header row, just change it to TRUE instead of FALSE –