2015-12-08 160 views

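I am using the JotForm Configurable List widget to collect data, but I am having trouble parsing the resulting data correctly. When I read the CSV with pandas in Python: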

testdf = pd.read_csv("TestLoad.csv")

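the data is read in as two records, with the details stored in the "Information" column. I understand why it is parsed that way, but I would like to break the details out into multiple records, as shown below.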

Any help would be appreciated.

Sample CSV

"Date","Information","Type" 
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New" 
"2015-12-06","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","New" 

Current result

Date  Information                  Type 
2015-12-06 First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA; New 
2015-12-06 First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA; New 
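
To make the starting point concrete, here is a minimal sketch that reproduces the current result from the sample above. It reads the sample from an in-memory string instead of TestLoad.csv, which is my own substitution so that the snippet is self-contained.

```python
import io

import pandas as pd

# Sample rows copied from the question; read from memory instead of TestLoad.csv.
SAMPLE = '''"Date","Information","Type"
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"
"2015-12-06","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","New"
'''

testdf = pd.read_csv(io.StringIO(SAMPLE))

# The quoted "Information" field is kept as one string per row,
# so two CSV lines become two records with three columns.
print(testdf.shape)
print(testdf.columns.tolist())
print(testdf.loc[0, "Information"])
```

Because the whole "Information" field is quoted, the commas and semicolons inside it are treated as ordinary text, which is why each CSV line ends up as a single record.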

Desired result

Date        First  Last   School  Type
2015-12-06  Tom    Smith  MCAA    New
2015-12-06  Tammy  Smith  MCAA    New
2015-12-06  Jim    Jones  MCAA    New
2015-12-06  Jane   Jones  MCAA    New

Answers

2

This is filler text that moderators require in order to keep an answer. Here is the data I used:

"Date","Information","Type" 
"2015-12-07","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","Old" 
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New" 

import csv
import datetime as dt
import pprint
import re

import pandas as pd

records = []  # One complete record (dict) per person

colon_pairs = r"""
    (\w+)   # A 'word' character, one or more times, captured in group 1, followed by...
    :       # a colon, followed by...
    \s*     # whitespace, zero or more times, followed by...
    (\w+)   # a 'word' character, one or more times, captured in group 2.
"""

colon_pairs_per_person = 3

with open("csv1.csv", encoding='utf-8') as f:
    next(f)  # Skip the header line
    record = {}

    for date, info, the_type in csv.reader(f):
        info_parser = re.finditer(colon_pairs, info, flags=re.X)

        for i, match_obj in enumerate(info_parser):
            key, val = match_obj.groups()
            record[key] = val

            if (i + 1) % colon_pairs_per_person == 0:  # Done with the info for one person
                # Parse the date so that the DataFrame rows can be sorted by date.
                record['Date'] = dt.datetime.strptime(date, '%Y-%m-%d')
                record['Type'] = the_type

                records.append(record)
                record = {}

pprint.pprint(records)
df = pd.DataFrame(
    sorted(records, key=lambda record: record['Date'])
)
print(df)
df.set_index('Date', inplace=True)
print(df)

--output:-- 
[{'Date': datetime.datetime(2015, 12, 7, 0, 0), 
    'First': 'Jim', 
    'Last': 'Jones', 
    'School': 'MCAA', 
    'Type': 'Old'}, 
{'Date': datetime.datetime(2015, 12, 7, 0, 0), 
    'First': 'Jane', 
    'Last': 'Jones', 
    'School': 'MCAA', 
    'Type': 'Old'}, 
{'Date': datetime.datetime(2015, 12, 6, 0, 0), 
    'First': 'Tom', 
    'Last': 'Smith', 
    'School': 'MCAA', 
    'Type': 'New'}, 
{'Date': datetime.datetime(2015, 12, 6, 0, 0), 
    'First': 'Tammy', 
    'Last': 'Smith', 
    'School': 'MCAA', 
    'Type': 'New'}] 

        Date  First   Last School Type
0 2015-12-06    Tom  Smith   MCAA  New
1 2015-12-06  Tammy  Smith   MCAA  New
2 2015-12-07    Jim  Jones   MCAA  Old
3 2015-12-07   Jane  Jones   MCAA  Old

            First   Last School Type
Date
2015-12-06    Tom  Smith   MCAA  New
2015-12-06  Tammy  Smith   MCAA  New
2015-12-07    Jim  Jones   MCAA  Old
2015-12-07   Jane  Jones   MCAA  Old
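
The code above assumes that every person has exactly three key/value pairs and that each value is a single `\w+` word. If a value might contain a space or a hyphen, or a field might be missing, one possible variation is to treat the semicolons as person boundaries instead of counting pairs. The sketch below is only an illustration: `iter_people` and `PAIR` are hypothetical names of my own, and the sample is the answer's data with the first row trimmed to one person to show a varying count.

```python
import csv
import io
import re

import pandas as pd

# Hypothetical pattern: key, colon, then everything up to the next comma or semicolon.
PAIR = re.compile(r"(\w+):\s*([^,;]+)")


def iter_people(lines):
    """Yield one dict per person from an iterable of CSV lines (header included)."""
    for row in csv.DictReader(lines):
        # Each person's block ends with a semicolon; the trailing empty chunk is skipped.
        for chunk in row["Information"].split(";"):
            fields = {key: value.strip() for key, value in PAIR.findall(chunk)}
            if fields:
                yield {"Date": row["Date"], **fields, "Type": row["Type"]}


# Illustrative data: the answer's sample, with the first row cut down to one person.
SAMPLE = '''"Date","Information","Type"
"2015-12-07","First: Jim, Last: Jones, School: MCAA;","Old"
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"
'''

df = pd.DataFrame(list(iter_people(io.StringIO(SAMPLE))))
df["Date"] = pd.to_datetime(df["Date"])  # Allows sorting by date, as in the answer
print(df)
```

Splitting on the semicolon first means a person with a missing field still becomes one row (with a NaN in that column) rather than shifting the pairs of the next person.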
+0

7stud - Thanks for the solution. This is the approach I ended up using, because the number of people in a record can be 1:n. – Zymurgist66
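
For a 1:n number of people per record, a pandas-only variation is also possible. The following is a sketch rather than anything from the answers: it assumes a pandas version that provides `DataFrame.explode` (0.25 or later) and that every person block has the `First`, `Last`, `School` fields in that order, as in the sample.

```python
import io

import pandas as pd

# Sample rows copied from the question.
SAMPLE = '''"Date","Information","Type"
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"
"2015-12-06","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","New"
'''

df = pd.read_csv(io.StringIO(SAMPLE))

# Drop the trailing semicolon, split into one string per person,
# then give each person their own row.
people = (
    df.assign(Information=df["Information"].str.strip("; ").str.split(";"))
      .explode("Information")
      .reset_index(drop=True)
)

# Named groups become the column names of the extracted frame.
fields = people["Information"].str.extract(
    r"First:\s*(?P<First>[^,]+),\s*Last:\s*(?P<Last>[^,]+),\s*School:\s*(?P<School>[^,;]+)"
)

result = pd.concat([people[["Date"]], fields, people[["Type"]]], axis=1)
print(result)
```

Since `explode` produces as many rows as there are list items, the same code handles one person or many per record without a fixed column layout.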

0

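I used a regular-expression separator with the python engine so that I could specify multiple delimiters. Then I used the usecols parameter to specify which columns of the csv file you want in the data frame. The header is not read from the file, because it does not contain any data, so I skip the first row. I read the first set of records and the second set of records into two data frames, and then concatenate the two data frames.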

a = pd.read_csv('sample.csv', sep=',|:|;', skiprows = 1, usecols = (0,2,4,6, 14), header = None, engine='python') 
b = pd.read_csv('sample.csv', sep=',|:|;', skiprows = 1, usecols = (0,8,10,12,14), header = None, engine='python') 
a.columns = ['Date', 'First', "Last", 'School', 'Type'] 
b.columns = ['Date', 'First', "Last", 'School', 'Type'] 
final_data = pd.concat([a,b], axis = 0) 

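If you need the order preserved, so that the second name appears directly below the first name, you can sort by the index. I use mergesort because it is a stable sort, which ensures that the first Information record (the one on the left in the CSV row) stays above the second Information record (the one on the right).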

final_data.sort_index(kind='mergesort', inplace = True) 
>>>final_data 
     Date  First Last  School Type 
0 "2015-12-06" Tom Smith MCAA "New" 
0 "2015-12-06" Tammy Smith MCAA "New" 
1 "2015-12-06" Jim Jones MCAA "New" 
1 "2015-12-06" Jane Jones MCAA "New" 
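
To illustrate why the stable sort matters here, the sketch below builds the two frames by hand (an assumption made to keep the snippet self-contained; the values mirror the output above without the quotes) and shows how `concat` followed by a mergesort on the index interleaves them.

```python
import pandas as pd

cols = ["Date", "First", "Last", "School", "Type"]

# Hand-built stand-ins for the two frames read with usecols in the answer.
a = pd.DataFrame(
    [["2015-12-06", "Tom", "Smith", "MCAA", "New"],
     ["2015-12-06", "Jim", "Jones", "MCAA", "New"]],
    columns=cols,
)
b = pd.DataFrame(
    [["2015-12-06", "Tammy", "Smith", "MCAA", "New"],
     ["2015-12-06", "Jane", "Jones", "MCAA", "New"]],
    columns=cols,
)

stacked = pd.concat([a, b], axis=0)
print(stacked.index.tolist())  # Index labels repeat: a's rows first, then b's

# A stable sort keeps rows with equal index labels in their concat order,
# so each row from `a` stays above the row from `b` with the same label.
final_data = stacked.sort_index(kind="mergesort")
print(final_data)
```

With an unstable sort kind, rows sharing an index label could in principle come out in either order, which is the reason for choosing mergesort.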

Edit: Included the second set of records in the data. Changed the axis to 0.

+0

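Thank you for the approach. I was able to replicate it, but when I tried it, the code did not find the second name in each row (e.g., Tammy Smith and Jane Jones). Is there something I need to do differently to iterate through the text in the "Information" column? – Zymurgist66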

+0

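@Zymurgist66 Do the records have to appear so that Tom Smith is directly above Tammy Smith? In any case, I edited my answer to read both sets of names and to provide an option for preserving the order. – imp9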

+0

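user1435522 - No, the order is not relevant. The initial example I tested had only 2 people per record. When I tried it on the whole data set, I found that the number of people could be 1:n, so I ended up needing to iterate over the people. – Zymurgist66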