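Converting an XML file to a DataFrame more efficiently
I am trying to load a large (53 MB) XML file into a pandas DataFrame. Here are 3 rows of the actual data (from the NTSB aviation accident report public database); the real file has 77,257 rows: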
<?xml version="1.0"?>
<DATA xmlns="http://www.ntsb.gov">
<ROWS>
<ROW EventId="20150901X74304" InvestigationType="Accident" AccidentNumber="GAA15CA244" EventDate="09/01/2015" Location="Truckee, CA" Country="United States" Latitude="" Longitude="" AirportCode="" AirportName="" InjurySeverity="" AircraftDamage="" AircraftCategory="" RegistrationNumber="N786AB" Make="JOE SALOMONE" Model="SUPER CUB SQ2" AmateurBuilt="" NumberOfEngines="" EngineType="" FARDescription="" Schedule="" PurposeOfFlight="" AirCarrier="" TotalFatalInjuries="" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="" WeatherCondition="" BroadPhaseOfFlight="" ReportStatus="Preliminary" PublicationDate=""/>
<ROW EventId="20150901X92332" InvestigationType="Accident" AccidentNumber="CEN15LA392" EventDate="08/31/2015" Location="Houston, TX" Country="United States" Latitude="29.809444" Longitude="-95.668889" AirportCode="IWS" AirportName="WEST HOUSTON" InjurySeverity="Non-Fatal" AircraftDamage="Substantial" AircraftCategory="Airplane" RegistrationNumber="N452CS" Make="CESSNA" Model="T240" AmateurBuilt="No" NumberOfEngines="" EngineType="" FARDescription="Part 91: General Aviation" Schedule="" PurposeOfFlight="Instructional" AirCarrier="" TotalFatalInjuries="" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="2" WeatherCondition="VMC" BroadPhaseOfFlight="LANDING" ReportStatus="Preliminary" PublicationDate="09/04/2015"/>
<ROW EventId="20150729X33718" InvestigationType="Accident" AccidentNumber="CEN15FA325" EventDate="" Location="Truth or Consequences, NM" Country="United States" Latitude="33.250556" Longitude="-107.293611" AirportCode="TCS" AirportName="TRUTH OR CONSEQUENCES MUNI" InjurySeverity="Fatal(2)" AircraftDamage="Substantial" AircraftCategory="Airplane" RegistrationNumber="N32401" Make="PIPER" Model="PA-28-151" AmateurBuilt="No" NumberOfEngines="1" EngineType="Reciprocating" FARDescription="Part 91: General Aviation" Schedule="" PurposeOfFlight="Personal" AirCarrier="" TotalFatalInjuries="2" TotalSeriousInjuries="" TotalMinorInjuries="" TotalUninjured="" WeatherCondition="" BroadPhaseOfFlight="UNKNOWN" ReportStatus="Preliminary" PublicationDate="08/10/2015"/>
</ROWS>
</DATA>
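The code below, which I adapted from here, works, but it is very slow on this data (over 30 minutes on my system). I could not get the solution posted for the original example to work, because my XML is structured differently. Is there a more efficient way to load this data?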
import pandas as pd
import xml.etree.ElementTree as ET

path_to_xml_file = mypath  # mypath: location of the NTSB XML file

# Load xml file data
tree = ET.parse(path_to_xml_file)
root = tree.getroot()

# Grab list of column names from the attributes of the first ROW
aviationdata_column_names = root[0][0].attrib.keys()

# Create empty dataframe
aviationdata_df = pd.DataFrame(columns=aviationdata_column_names)

# Loop through tree and append to dataframe
# (range(len(...)) rather than len(...)-1, so the last ROW is not skipped)
for i in range(len(root[0])):
    new_row = pd.Series(root[0][i].attrib)
    new_row.name = i
    aviationdata_df = aviationdata_df.append(new_row)
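A likely cause of the slowness is that `DataFrame.append` returns a new DataFrame on every call, so the loop copies the growing frame once per row. The sketch below is one possible alternative, not a tested fix for the full 53 MB file: it collects each ROW's attributes into a list of dicts and builds the DataFrame once. It assumes the DATA/ROWS/ROW layout shown above; `SAMPLE_XML` and `xml_rows_to_dataframe` are hypothetical names used only for illustration, and the sample keeps just three of the attributes.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# Shortened stand-in for the real file (assumption: same DATA/ROWS/ROW layout).
SAMPLE_XML = b"""<?xml version="1.0"?>
<DATA xmlns="http://www.ntsb.gov">
<ROWS>
<ROW EventId="20150901X74304" Location="Truckee, CA" Make="JOE SALOMONE"/>
<ROW EventId="20150901X92332" Location="Houston, TX" Make="CESSNA"/>
<ROW EventId="20150729X33718" Location="Truth or Consequences, NM" Make="PIPER"/>
</ROWS>
</DATA>
"""


def xml_rows_to_dataframe(source):
    """Parse the XML and build the DataFrame in a single constructor call."""
    root = ET.parse(source).getroot()
    rows_element = root[0]  # the single <ROWS> child of <DATA>
    # One dict per <ROW>; no DataFrame is touched inside the loop.
    records = [dict(row.attrib) for row in rows_element]
    return pd.DataFrame(records)


aviationdata_df = xml_rows_to_dataframe(io.BytesIO(SAMPLE_XML))
print(aviationdata_df.shape)
```

For the real file, a path can be passed in place of the `io.BytesIO` object, since `ET.parse` accepts either a filename or a file object. Every value arrives as a string, and empty attributes arrive as empty strings, so numeric and date columns would still need converting afterwards.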
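Various solutions to similar questions are posted around the internet (here, here and here), but I have had no luck implementing them. Version issues may be responsible for some of that (I am using Python 2.7).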
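Because solutions written for other files tend to depend on the exact element names and nesting, it may help to match ROW elements by tag rather than by position. The sketch below uses `xml.etree.ElementTree.iterparse` from the standard library, which reports elements as the parser finishes them. It is an illustration under assumptions: the tag includes the `http://www.ntsb.gov` namespace declared in the sample, and `NTSB_ROW_TAG`, `iter_row_attributes` and `SAMPLE_XML` are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET

import pandas as pd

# ElementTree spells namespaced tags as "{namespace}localname";
# the namespace here is copied from the sample's xmlns attribute.
NTSB_ROW_TAG = "{http://www.ntsb.gov}ROW"

# Shortened stand-in for the real file (assumption: same layout).
SAMPLE_XML = b"""<?xml version="1.0"?>
<DATA xmlns="http://www.ntsb.gov">
<ROWS>
<ROW EventId="20150901X74304" ReportStatus="Preliminary"/>
<ROW EventId="20150901X92332" ReportStatus="Preliminary"/>
</ROWS>
</DATA>
"""


def iter_row_attributes(source, row_tag=NTSB_ROW_TAG):
    """Yield one attribute dict per ROW element as the file is parsed."""
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == row_tag:
            yield dict(element.attrib)
            element.clear()  # release the attributes that were just copied


streamed_df = pd.DataFrame(list(iter_row_attributes(io.BytesIO(SAMPLE_XML))))
print(streamed_df)
```

Matching on the tag means the function does not rely on ROWS being the first child of the root, which is the kind of structural difference that can make a borrowed solution return an empty frame.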
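@jezrael When I try the first 3 of those 4 methods, they return a DataFrame with 0 or 1 elements. How do I get these methods to loop through the XML file and pull in the whole structure? – Edward
@jezrael Alas, this is an exercise in learning to work with various file formats. I have been asked to use XML for this particular dataset. – Edward
@jezrael - I appreciate the link to the question I mentioned earlier. I suspect the XML is too complicated for you! – Parfait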