DataFrame

Say the data I have to store looks like this:

[[[{}][{}]]]

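or, in words, a list of lists of two lists of dictionaries, where:

{}: a dictionary containing the data from a single frame of an observed event. (There are two observers/stations, hence two dictionaries.)

[{}][{}]: two lists of all the individual frames associated with a single event, one per observer/station.

[[{}][{}]]: a list of all events from a single night of observation.

[[[{}][{}]]]: a list of all nights.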

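Hopefully that is clear. What I want to do is create two pandas DataFrames, one holding all the dictionaries from station_1 and the other holding all the dictionaries from station_2.

My current approach is as follows (where data is the data structure described above):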

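import pandas as pd

# Collect one DataFrame per night for each station
all_station_1 = []
all_station_2 = []

for night in range(len(data)):

    station_1 = pd.DataFrame(data[night][0])
    station_2 = pd.DataFrame(data[night][1])

    all_station_1.append(station_1)
    all_station_2.append(station_2)

# Stack the per-night frames into one DataFrame per station
all_station_1 = pd.concat(all_station_1)
all_station_2 = pd.concat(all_station_2)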

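My understanding, though, is that the for loop must be extremely inefficient, and as I scale this script up from my sample dataset, the cost could easily become unmanageable.

So any advice on a smarter way to do this would be much appreciated! pandas is so user-friendly that I feel there must be an efficient way to handle any kind of data structure, but I haven't been able to find it myself. Thanks!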
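One possible direction, sketched under assumptions rather than measured: if each night holds one list of frame dictionaries per station (as data[night][0] and data[night][1] in the loop above suggest), the dictionaries can be flattened into a single list per station and passed to pd.DataFrame once, instead of building one DataFrame per night and concatenating. The sample values and the helper name frames_for_station are invented for illustration.

```python
import pandas as pd

# Toy stand-in for `data`: nights -> [station_1 frames, station_2 frames].
# The keys and values are invented for illustration.
data = [
    [[{"frame": 0, "x": 1.0}, {"frame": 1, "x": 1.5}], [{"frame": 0, "x": 2.0}]],
    [[{"frame": 0, "x": 3.0}], [{"frame": 0, "x": 4.0}, {"frame": 1, "x": 4.5}]],
]


def frames_for_station(data, station):
    """Flatten every night's frame dictionaries for one station into one list."""
    return [frame for night in data for frame in night[station]]


# One DataFrame constructor call per station instead of one per night plus concat
all_station_1 = pd.DataFrame(frames_for_station(data, 0))
all_station_2 = pd.DataFrame(frames_for_station(data, 1))
```

This still loops in Python, but only over plain dictionaries; whether it is noticeably faster on the real dataset would need to be timed.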

2016-11-17 Sam L.

You could try `pd.read_json()`. – Khris

A sample of the data in `[[[{}][{}]]]` form, together with the desired DataFrame, would be helpful for testing.

Sure, I have provided a sample here: https://www.dropbox.com/s/8b4zqq6nhzbie4p/datasample.txt?dl=0

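I don't think you can really avoid using a loop here, unless you want to call jq via sh. See this answer.

Anyway, using your full sample, I managed to parse it into a multi-indexed DataFrame, which I think is what you want.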

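import datetime  # needed so that eval() can resolve datetime.datetime(...)
import re
import json

import pandas as pd

# The sample file holds the whole structure on a single line
with open('datasample.txt', 'r') as f:
    data = f.readline()

# Replace single quotes with double quotes: I did that in the .txt file itself, you could do it using re

# Fix the datetime problem: turn each datetime.datetime(...) literal into a quoted ISO 8601 string.
# eval() executes whatever it is given, so only run this on a file you trust.
cleaned_data = re.sub(
    r'(datetime\.datetime\(.*?\))',
    lambda x: '"' + eval(x.group(0)).isoformat() + '"',
    data,
)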
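If calling eval on file contents is a concern, the same substitution can be done by parsing the integer arguments directly. This is a minimal sketch, assuming every datetime.datetime(...) call in the file contains only comma-separated integers; the names DATETIME_CALL and datetime_call_to_iso and the sample string are invented for illustration.

```python
import re
from datetime import datetime

# Matches datetime.datetime(<comma-separated integers>) and captures the arguments
DATETIME_CALL = re.compile(r"datetime\.datetime\(([\d,\s]+)\)")


def datetime_call_to_iso(match):
    """Turn one matched datetime.datetime(...) call into a quoted ISO 8601 string."""
    parts = [int(p) for p in match.group(1).split(",")]
    return '"' + datetime(*parts).isoformat() + '"'


# Invented one-entry sample in the same style as the data file
sample = '[{"DateTime": datetime.datetime(2011, 12, 13, 22, 15, 37, 880000)}]'
cleaned_sample = DATETIME_CALL.sub(datetime_call_to_iso, sample)
print(cleaned_sample)  # [{"DateTime": "2011-12-13T22:15:37.880000"}]
```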

Now that the string from the file is valid JSON, we can load it:

json_data = json.loads(cleaned_data)

We can then process it into a DataFrame:

# List to store the dfs before concat
all_ = []
for n, night in enumerate(json_data):
    for s, station in enumerate(night):
        events = pd.DataFrame(station)
        # Set index to the event number
        events = events.set_index('###')
        # Prepend night number and station number to index
        events.index = pd.MultiIndex.from_tuples([(n, s, x) for x in events.index])
        all_.append(events)

df_all = pd.concat(all_)
# Rename the index levels
df_all.index.names = ['Night', 'Station', 'Event']
# Convert to datetime
df_all['DateTime'] = pd.to_datetime(df_all['DateTime'])
df_all

(Truncated) result:
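As a rough illustration only, and not the output of the code above, the sketch below builds a tiny frame with the same three index levels (Night, Station, Event) from invented values, then uses DataFrame.xs to pull out one frame per station, which is what the question asks for. The names station_1 and station_2 follow the question; the "value" column and its contents are made up.

```python
import pandas as pd

# Invented stand-in for df_all with the same index levels as above
index = pd.MultiIndex.from_tuples(
    [(0, 0, 1), (0, 0, 2), (0, 1, 1), (1, 0, 1), (1, 1, 1)],
    names=["Night", "Station", "Event"],
)
df_all = pd.DataFrame({"value": [10, 11, 20, 12, 21]}, index=index)

# Cross-section on the Station level; the remaining index is (Night, Event)
station_1 = df_all.xs(0, level="Station")
station_2 = df_all.xs(1, level="Station")
```

Keeping everything in one multi-indexed frame and slicing on demand avoids maintaining two separate frames, though either layout should work.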

2016-11-17 11:17:18

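Thank you very much for your time! I follow most of what is going on, but would like to ask two more questions: 1. What is the problem you fixed with datetime? I can fix it at the source, since I have access to the script used to prepare the data file. 2. How would I use re to replace the single quotes? I'm not sure how to keep my quotes from being interpreted as string delimiters. Thanks again!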

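It would help if you formatted the datetimes as ISO-format strings in the first place. Here, I had to evaluate each datetime.datetime(...) call and turn it into an ISO string for JSON. Your file contains `datetime.datetime(2011, 12, 13, 22, 15, 37, 880000)`, which I converted to `"2011-12-13T22:15:37.880000"`. Also, strictly formatted JSON requires double quotes ("), not single quotes ('), so you can either change the script that prepares the data, or check [this question](http://stackoverflow.com/questions/4033633/handling-lazy-json-in-python-expecting-property-name) to do it afterwards in Python.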
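Fixing this at the source could look roughly like the sketch below, assuming the preparation script holds the data as nested Python lists and dictionaries with datetime values. json.dumps writes double-quoted strings, and its default hook can emit ISO strings for datetimes, so no regex or quote replacement would be needed afterwards. The helper name encode_datetime and the sample record are invented for illustration.

```python
import json
from datetime import datetime


def encode_datetime(obj):
    """Fallback for json.dumps: serialise datetime objects as ISO 8601 strings."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Cannot serialise {type(obj).__name__}")


# Invented one-frame example in a nights -> stations -> frames layout
data = [[[{"###": 1, "DateTime": datetime(2011, 12, 13, 22, 15, 37, 880000)}], []]]

text = json.dumps(data, default=encode_datetime)  # double quotes, ISO dates
round_trip = json.loads(text)  # loads directly, no cleaning step needed
```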
