Python / pandas CSV parsing

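I am using the JotForm Configurable List widget to collect data, but I am having trouble parsing the resulting data correctly. When I use

testdf = pd.read_csv("TestLoad.csv")

the data is read in as two records, with the details stored in the "Information" column. I understand why it is parsed this way, but I would like to break the details out into multiple records, as shown below.

Any help would be appreciated.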

Sample CSV

"Date","Information","Type" 
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New" 
"2015-12-06","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","New"

Current result

Date  Information                  Type 
2015-12-06 First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA; New 
2015-12-06 First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA; New

Desired result

Date        First  Last   School  Type
2015-12-06  Tom    Smith  MCAA    New
2015-12-06  Tammy  Smith  MCAA    New
2015-12-06  Jim    Jones  MCAA    New
2015-12-06  Jane   Jones  MCAA    New

2015-12-08 Zymurgist66

This is filler text, which the moderators require in order to keep this as an answer. Here is the data I used:

"Date","Information","Type" 
"2015-12-07","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","Old" 
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"

import csv
import datetime as dt
import pprint
import re

import pandas as pd

records = []  # Construct a complete record for each person

colon_pairs = r"""
    (\w+)   # Match a 'word' character, one or more times, captured in group 1, followed by...
    :       # a colon, followed by...
    \s*     # whitespace, 0 or more times, followed by...
    (\w+)   # a 'word' character, one or more times, captured in group 2.
"""

colon_pairs_per_person = 3

with open("csv1.csv", encoding='utf-8') as f:
    next(f)  # skip header line
    record = {}

    for date, info, the_type in csv.reader(f):
        info_parser = re.finditer(colon_pairs, info, flags=re.X)

        for i, match_obj in enumerate(info_parser):
            key, val = match_obj.groups()
            record[key] = val

            if (i + 1) % colon_pairs_per_person == 0:  # then done with info for a person
                # Parse the date so that you can sort the DataFrame rows by date.
                record['Date'] = dt.datetime.strptime(date, '%Y-%m-%d')
                record['Type'] = the_type

                records.append(record)
                record = {}

pprint.pprint(records)
df = pd.DataFrame(
    sorted(records, key=lambda record: record['Date'])
)
print(df)
df.set_index('Date', inplace=True)
print(df)

--output:-- 
[{'Date': datetime.datetime(2015, 12, 7, 0, 0), 
    'First': 'Jim', 
    'Last': 'Jones', 
    'School': 'MCAA', 
    'Type': 'Old'}, 
{'Date': datetime.datetime(2015, 12, 7, 0, 0), 
    'First': 'Jane', 
    'Last': 'Jones', 
    'School': 'MCAA', 
    'Type': 'Old'}, 
{'Date': datetime.datetime(2015, 12, 6, 0, 0), 
    'First': 'Tom', 
    'Last': 'Smith', 
    'School': 'MCAA', 
    'Type': 'New'}, 
{'Date': datetime.datetime(2015, 12, 6, 0, 0), 
    'First': 'Tammy', 
    'Last': 'Smith', 
    'School': 'MCAA', 
    'Type': 'New'}] 

        Date  First   Last School Type
0 2015-12-06    Tom  Smith   MCAA  New
1 2015-12-06  Tammy  Smith   MCAA  New
2 2015-12-07    Jim  Jones   MCAA  Old
3 2015-12-07   Jane  Jones   MCAA  Old

            First   Last School Type
Date
2015-12-06    Tom  Smith   MCAA  New
2015-12-06  Tammy  Smith   MCAA  New
2015-12-07    Jim  Jones   MCAA  Old
2015-12-07   Jane  Jones   MCAA  Old
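
To see what the pattern captures on its own, here is a minimal sketch that applies the same verbose regex to one "Information" string taken from the sample data. The names `pairs` and `people` are illustrative only, and the grouping assumes exactly three fields per person, as in the answer above.

```python
import re

# Same verbose pattern as above: "key: value" pairs made of word characters.
colon_pairs = r"""
    (\w+)   # key, e.g. First
    :       # literal colon
    \s*     # optional whitespace
    (\w+)   # value, e.g. Tom
"""

info = "First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;"

# findall returns one (key, value) tuple per match, in order of appearance.
pairs = re.findall(colon_pairs, info, flags=re.X)
print(pairs)

# Group every three pairs into one person (assumes three fields per person).
people = [dict(pairs[i:i + 3]) for i in range(0, len(pairs), 3)]
print(people)
```

Note that `\w+` only matches word characters, so a value containing a space or a hyphen would be cut short; the pattern would need adjusting for such data.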

2015-12-08 05:23:05 7stud

7stud - Thanks for your solution. This is the approach I ended up using, because the number of people in a record can be 1:n. – Zymurgist66

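I used a regular-expression separator with the python engine so that I could specify multiple delimiters. I then used the usecols parameter to select which columns of the CSV file go into the data frame. The header is not read from the file because it does not contain any data, so I skipped the first row. I read the first set of records and the second set of records into two data frames, then concatenated the two data frames.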

import pandas as pd

# Split on commas, colons and semicolons; a regex separator requires the python engine.
a = pd.read_csv('sample.csv', sep=',|:|;', skiprows=1, usecols=(0, 2, 4, 6, 14), header=None, engine='python')
b = pd.read_csv('sample.csv', sep=',|:|;', skiprows=1, usecols=(0, 8, 10, 12, 14), header=None, engine='python')
a.columns = ['Date', 'First', 'Last', 'School', 'Type']
b.columns = ['Date', 'First', 'Last', 'School', 'Type']
final_data = pd.concat([a, b], axis=0)

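If you need the order preserved, so that the second name appears directly below the first name, you can sort by the index. I use mergesort because it is a stable sort, which ensures that the first Information record (the one on the left) stays above the Information record on the right.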

final_data.sort_index(kind='mergesort', inplace=True)
>>> final_data
           Date   First    Last  School   Type
0  "2015-12-06"     Tom   Smith    MCAA  "New"
0  "2015-12-06"   Tammy   Smith    MCAA  "New"
1  "2015-12-06"     Jim   Jones    MCAA  "New"
1  "2015-12-06"    Jane   Jones    MCAA  "New"
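
As a small illustration of why the stable sort matters, the sketch below uses two hypothetical stand-in frames (not read from the CSV) that share the same index, concatenates them, and sorts by index with mergesort. Rows with equal index values keep the order they had after the concat.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames built above
# (first person and second person of each CSV row).
a = pd.DataFrame({"First": ["Tom", "Jim"]}, index=[0, 1])
b = pd.DataFrame({"First": ["Tammy", "Jane"]}, index=[0, 1])

combined = pd.concat([a, b], axis=0)             # index is 0, 1, 0, 1
ordered = combined.sort_index(kind="mergesort")  # stable: ties keep concat order

print(ordered["First"].tolist())  # ['Tom', 'Tammy', 'Jim', 'Jane']
```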

Edit: Included the second set of records in the data. Changed axis to 0.

2015-12-08 02:42:33 imp9

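Thanks for your approach. I was able to replicate it, but when I tried it, the code did not find the second name in each row (e.g., Tammy Smith and Jane Jones). Is there something I need to do differently to iterate through the text in the "Information" column? – Zymurgist66

@Zymurgist66 Do the records have to appear so that Tom Smith is directly above Tammy Smith? In any case, I edited my reply to read both sets of names and added an option for maintaining the order. – imp9

user1435522 - No, the order is not relevant. The initial example I tested had only 2 people per record. When I tried it with the whole dataset, I found that the number of people can be 1:n, so I ultimately needed to iterate over the people. – Zymurgist66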
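
For the 1:n case mentioned in the comment above, a pandas-only variant might look like the following. This is a sketch rather than either answerer's method: it assumes every person has exactly the three fields First, Last and School in that order, that people are separated by semicolons, and it uses an in-memory copy of the sample data in which the second row has been cut down to one person to show the varying count.

```python
import io

import pandas as pd

# Sample data from the question; the second row is trimmed to one person
# (an assumption made here to demonstrate a varying number of people).
csv_text = '''"Date","Information","Type"
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New"
"2015-12-06","First: Jim, Last: Jones, School: MCAA;","New"
'''

df = pd.read_csv(io.StringIO(csv_text))

# Drop the trailing ';' and split so each row holds a list with one item per person.
df["Information"] = df["Information"].str.rstrip("; ").str.split(";")

# explode() turns each list item into its own row, repeating Date and Type.
people = df.explode("Information").reset_index(drop=True)
people["Information"] = people["Information"].str.strip()

# Pull the three fields out of each person's text using named groups.
fields = people["Information"].str.extract(
    r"First:\s*(?P<First>[^,]+),\s*Last:\s*(?P<Last>[^,]+),\s*School:\s*(?P<School>.+)"
)

result = pd.concat([people[["Date"]], fields, people[["Type"]]], axis=1)
print(result)
```

Because the split happens on the semicolon rather than on a fixed column position, the number of people per row does not need to be known in advance.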
