2013-05-30 71 views
0

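Rearranging data for a pandas DataFrame?

I receive a tab-delimited file from a server that outputs the answers to each question, respondent by respondent. I would like to import the data into a pandas DataFrame in which the columns are the questions and each row holds one respondent's answers. Here is what one respondent looks like: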

[2072] Anonymous 
Q-0 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.14 Student (Graduate/ Undergraduate) 
Q-1 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00  
Q-1 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 1|1|1|1|4| 
Q-2 [01] Sat 25 May 2013 7:43 PM UTC +0000 1.00 1-3 
Q-3 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.50 Male 
Q-4 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.33 18-24 
Q-5 [01] Sat 25 May 2013 7:43 PM UTC +0000 1.00  
Q-6 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 Prefer not to answer 
Q-7 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.50 Yes 
Q-8 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.13 Bachelor's Degree 
Q-9 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 Other 
Q-10 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 Mathematics 
Q-11 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.33 High school 
Q-11 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.33 College (introductory courses) 
Q-12 [01] Sat 25 May 2013 7:43 PM UTC +0000 1.00 Professional 
Q-13 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.50 Mac OS X 
Q-14 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.25 Every week 
Q-15 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 A test that proves or disproves of some abstract theory about the world 
Q-16 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00  
Q-17 [01] Sat 25 May 2013 7:43 PM UTC +0000 2.00 Yes 
Q-18 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00  
Q-19 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.20 Timely feedback from the instructor 
Q-20 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00  

There is a carriage return between each respondent's answers. Thanks for your help!

+0

Hmm... why the downvote, gang? This seems like a good use case that might apply to others. –

Answers

1

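The non-trivial step is delimiting each respondent's block. How about rewriting the file so that every line is prefixed with the respondent's ID? For example, in the case of "Anonymous", I see "2072".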

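import re

with open('filename') as src, open('new_file', 'w') as dst:
    respondent_id = None
    for line in src:
        # A line is either a header like "[####] Student_Name" or an answer like "Q-...".
        m = re.match(r'\[(\d+)\]', line)
        if m:
            # Header line: remember the respondent ID and move on.
            respondent_id = m.group(1)
            continue
        if not line.strip():
            # Skip the blank line that separates respondents.
            continue
        # Answer line: write it back as "####<TAB>Q-...".
        dst.write(respondent_id + '\t' + line)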

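Then load the modified file with pandas read_csv, assigning the first two columns to the index. (Together they will form a MultiIndex.) Then use unstack to turn the question level of the index into columns.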

(Full disclosure: I tested the regular expression, but I have not tested the whole thing.)
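
The read_csv and unstack steps described above could look roughly like the sketch below. It is not tested against the real export: it assumes the rewritten file has a tab between the respondent ID and the question label, it keeps only three columns (the names `respondent`, `question`, and `answer` are made up here), and respondent 2073 is invented for illustration. Because the sample repeats some questions (Q-11 appears twice), the sketch joins repeated answers before calling unstack, which otherwise raises an error on duplicate index entries.

```python
import io

import pandas as pd

# Hypothetical rewritten file: respondent ID, question label, answer (tab-separated).
# The real file has more columns (timestamp, score); they are left out here.
rewritten = (
    "2072\tQ-0\tStudent (Graduate/ Undergraduate)\n"
    "2072\tQ-3\tMale\n"
    "2072\tQ-11\tHigh school\n"
    "2072\tQ-11\tCollege (introductory courses)\n"
    "2073\tQ-0\tFaculty\n"
    "2073\tQ-3\tFemale\n"
)

# Put the first two columns into the index; together they form a MultiIndex.
df = pd.read_csv(
    io.StringIO(rewritten),
    sep="\t",
    header=None,
    names=["respondent", "question", "answer"],
    index_col=["respondent", "question"],
)

# Drop empty answers, then collapse repeated (respondent, question) pairs
# into one string so that the index is unique before reshaping.
answers = (
    df["answer"]
    .dropna()
    .groupby(level=["respondent", "question"])
    .agg("; ".join)
)

# Move the question level of the index into the columns.
wide = answers.unstack("question")
print(wide)
```

Joining repeated answers with "; " is only one possible choice; keeping the first answer or numbering the repeats would work as well, depending on what the duplicates mean in the survey.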

+0

Actually, if they are fixed-size chunks (e.g. 10 lines each), you could just read the file in and then BinGroup, I think – Jeff

+0

Cool. I didn't know that was a thing. –

+0

Actually, it is easier to do this: `df.groupby(df.index.to_series() // 3).sum()` (every 3 rows). `BinGrouper` needs the labels to be specified directly – Jeff
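
A minimal sketch of the fixed-size grouping suggested in this comment, assuming a default 0..n-1 integer index and a block size of 3. The `score` column and its values are made up, and the approach only applies if every respondent really has the same number of lines, which the sample in the question does not confirm.

```python
import pandas as pd

# Hypothetical frame with a default integer index; the values are illustrative.
df = pd.DataFrame({"score": [1, 2, 3, 4, 5, 6]})

# Integer-divide each row label by the block size:
# rows 0-2 get block label 0, rows 3-5 get block label 1.
block = df.index.to_series() // 3

# Aggregate each block of three rows.
totals = df.groupby(block).sum()
print(totals)
```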

0

Here is what worked for me:

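import re

with open('filename') as src, open('new_file', 'w') as dst:
    for line in src:
        # Header lines start with the bracketed respondent ID, e.g. "[2072]".
        m = re.match(r'\[\d+\]', line)
        if m:
            # Keep the ID with its brackets and skip the header line itself.
            respondent_id = m.group()
            continue
        # Prefix every other line with the most recent ID (no separator is added).
        dst.write(respondent_id + line)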