2013-08-26 21 views
0

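I have been exploring an XML-based API for something work-related; it serves warehouse data. Ideally, I would like to do some analysis in Python with pandas. How can I parse the following multidimensional, XML-like data into a DataFrame?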

Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC)) , None, StringAggregateDimensionValue(value=u'VIRTUALLY_LABELED_CASE') ], quantity=127) , 

Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC)) , StringAggregateDimensionValue(value=u'PPTransMergeNonCon') , StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW') ], quantity=15) 

Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 27, 21, 0, tzinfo= UTC)) , StringAggregateDimensionValue(value=u'PPTransFRA1') , StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW') ], quantity=8) , 

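The data looks like the stream above after I did some find-and-replace in Vim (I know I could script this in Python). How can I best convert this odd format into pandas? Ideally I want the datetime, the string aggregate dimension values, and the quantity. Note that the data to be parsed contains a lot of None values. Once it is in a DataFrame the analysis will be easy, but I am a bit stumped here (and feeling very much like a n00b).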

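Edit: Here is the raw data exactly as I receive it, with no replacements applied, which is what I want to parse. It is not really XML, so an XML parser does not work.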

[<DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 26, 20, 30, tzinfo=<UTC 
>))>, <StringAggregateDimensionValue(value=u'PPTransCGN1')>, < 
StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW')>], quantity=992)>, < 
StringAggregateDimensionValue(value=u'PPTransLEJ1')>, <StringAggregateDimensionValue(
value=u'PRIME_BIN_RANDOM_STOW')>], quantity=945)>, <Aggregate(
aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime.datetime(2013 
, 8, 23, 19, 30, tzinfo=<UTC>))>, None, <StringAggregateDimensionValue(value=u'TOTE')>], 
quantity=87)>, <Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(
value=datetime.datetime(2013, 8, 27, 17, 30, tzinfo=<UTC>))>, < 
StringAggregateDimensionValue(value=u'PPTransMUC3')>, <StringAggregateDimensionValue(
value=u'TOTE')>], quantity=14)>, <Aggregate(aggregate_dimension_value_list=[< 
DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 27, 20, 30, tzinfo=<UTC 
>))>, <StringAggregateDimensionValue(value=u'PPTransEUK5')>, < 
StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW')>], quantity=339)>, < 
Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime. 
datetime(2013, 8, 26, 20, 30, tzinfo=<UTC>))>, <StringAggregateDimensionValue(value=u 
'PPTransCGN1')>, <StringAggregateDimensionValue(value=u'TOTE')>], quantity=1731)>, < 
Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime. 
datetime(2013, 8, 26, 19, 30, tzinfo=<UTC>))>, <StringAggregateDimensionValue(value=u 
'PPTransEUK5')>, quantity=444)>, <Aggregate(aggregate_dimension_value_list=[< 
DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 26, 19, 30, tzinfo=<UTC 
>))>, <StringAggregateDimensionValue(value=u'PPTransEUK5')>, < 
StringAggregateDimensionValue(value=u'TOTE')>], quantity=28)>, <Aggregate(
aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime.datetime(2013 
, 8, 28, 19, 30, tzinfo=<UTC>))>, <StringAggregateDimensionValue(value=u'PPTransORY1')>, 
<StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW')>], quantity=69)>, < 
Aggregate(aggregate_dimension_value_list=<Aggregate(aggregate_dimension_value_list=[< 
DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 26, 19, 30, tzinfo=<UTC 
>))>, <StringAggregateDimensionValue(value=u'PPTransMAD4')>, < 
StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW')>], quantity=47)>, < 
Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(value=datetime. 
datetime(2013, 8, 26, 21, 0, tzinfo=<UTC>))>, None, None], quantity=78)> 
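
The Vim find-and-replace step could also be scripted. The following is only a rough sketch under a few assumptions: the helper name `normalize_dump` is made up for illustration, the hard line wraps carry no meaning (so the whitespace around them can be dropped), and no quoted value contains `<`, `>` or a newline. It turns the raw dump into one `Aggregate(...)` expression per line; malformed records are passed through as they are rather than repaired.

```python
import re


def normalize_dump(raw):
    """Turn the wrapped '<Aggregate(...)>' dump into one Aggregate(...) per line."""
    # Undo the hard line wrapping (assumes no value contains a newline).
    text = re.sub(r"\s*\n\s*", "", raw)
    # '<UTC>' (also seen as '<UTC >') becomes the bare name UTC.
    text = re.sub(r"<\s*UTC\s*>", "UTC", text)
    # Drop the '<' in front of a class name and the '>' after a closing paren.
    text = re.sub(r"<\s*(?=[A-Za-z])", "", text)
    text = re.sub(r"\)\s*>", ")", text)
    # Break the stream between one record and the next Aggregate( call.
    text = re.sub(r"\),\s*(?=Aggregate\()", ")\n", text)
    return [line.strip(" ,") for line in text.splitlines() if line.strip(" ,")]


# Two records copied from the raw dump above, line wraps included.
raw = """<Aggregate(aggregate_dimension_value_list=[<DateAggregateDimensionValue(
value=datetime.datetime(2013, 8, 27, 17, 30, tzinfo=<UTC>))>, <
StringAggregateDimensionValue(value=u'PPTransMUC3')>, <StringAggregateDimensionValue(
value=u'TOTE')>], quantity=14)>, <Aggregate(aggregate_dimension_value_list=[<
DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 26, 21, 0, tzinfo=<UTC
>))>, None, None], quantity=78)>"""

lines = normalize_dump(raw)
for line in lines:
    print(line)
```

Each resulting line has the same shape as the cleaned-up sample at the top of the question.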

If it starts out as actual XML, why not use an XML parser, as in this answer: http://stackoverflow.com/a/16993660/1240268


It is not actually XML. It is "XML-like", but a bit more complicated than that. When I try to parse it as XML, I get errors...


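I would suggest looking at the errors; parsing it as XML(-like) would be preferable (faster, safer). You could hack around it by defining these names as functions (mostly identity or tuple), then evaluating the text (but be careful if this data does not come from a trusted source) and using from_records.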
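
One way to read the suggestion in this comment is sketched below. It is an illustration, not the commenter's own code: the helper names `_value` and `_aggregate`, the `NAMES` dictionary and the column labels are all made up here, and every record is assumed to carry exactly three dimension values. Emptying `__builtins__` only hides the built-in functions; it does not make `eval` safe for untrusted input. The pandas step is skipped if pandas is not installed.

```python
import datetime

UTC = datetime.timezone.utc  # stands in for the UTC name used in the dump


def _value(value):
    """Identity: DateAggregateDimensionValue(value=x) simply evaluates to x."""
    return value


def _aggregate(aggregate_dimension_value_list, quantity):
    """Flatten one Aggregate(...) into a (timestamp, s1, s2, quantity) tuple."""
    timestamp, s1, s2 = aggregate_dimension_value_list
    return (timestamp, s1, s2, quantity)


# The only names the evaluated text is allowed to see.
NAMES = {
    "__builtins__": {},  # hides the builtins; this is not a real sandbox
    "datetime": datetime,
    "UTC": UTC,
    "DateAggregateDimensionValue": _value,
    "StringAggregateDimensionValue": _value,
    "Aggregate": _aggregate,
}

sample = """
Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC)) , None, StringAggregateDimensionValue(value=u'VIRTUALLY_LABELED_CASE') ], quantity=127) ,
Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue(value=datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC)) , StringAggregateDimensionValue(value=u'PPTransMergeNonCon') , StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW') ], quantity=15)
"""

rows = [eval(line.strip(" ,"), NAMES)
        for line in sample.splitlines() if line.strip()]

try:
    import pandas as pd
except ImportError:
    pd = None

if pd is not None:
    # from_records accepts a list of tuples plus the column labels.
    df = pd.DataFrame.from_records(
        rows, columns=["timestamp", "s1", "s2", "quantity"])
    print(df)
```

Records with a missing dimension stay as `None` in the tuple, so they end up as missing values in the corresponding column.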

Answers

1

If you would prefer something more along the lines of a parser, here is a pyparsing stab at your problem:

from pyparsing import Suppress,QuotedString,Word,alphas,nums,alphanums,Keyword,Optional 
import datetime 

# define UTC timezone for sake of eval 
if hasattr(datetime,"timezone"): 
    UTC = datetime.timezone(datetime.timedelta(0),"UTC") 
else: 
    UTC = None 

_ = Suppress 
evaltokens = lambda s,l,t: eval(''.join(t)) 

timevalue = 'datetime.datetime' + QuotedString('(', endQuoteChar=')', unquoteResults=False) 
timevalue.setParseAction(evaltokens) 

strvalue = 'u' + QuotedString("'", unquoteResults=False) 
strvalue.setParseAction(evaltokens) 

nonevalue = Keyword("None").setParseAction(lambda s,l,t: [None]) 
intvalue = Word(nums).setParseAction(lambda s,l,t: int(t[0])) 

COMMA = Optional(_(",")) 

valuedexpr = lambda expr: (Word(alphas) + "(" + "value" + "=" + expr + ")").setParseAction(lambda t: t[4]) 

lineexpr = (_("Aggregate(aggregate_dimension_value_list=[") + 
      valuedexpr(timevalue)("timestamp") + COMMA + 
      (nonevalue | valuedexpr(strvalue))("s1") + COMMA + 
      (nonevalue | valuedexpr(strvalue))("s2") + COMMA + 
     "]" + COMMA + 
     "quantity=" + intvalue("qty")) 

Use lineexpr.searchString to pull out the data for each aggregate:

for data in lineexpr.searchString(sample):
    # sample holds the Aggregate(...) text shown in the question
    print(data.dump())
    print(data.qty)
    print("")

Giving:

[datetime.datetime(2013, 8, 28, 19, 30), None, u'VIRTUALLY_LABELED_CASE', ']', 'quantity=', 127] 
- qty: 127 
- s1: None 
- s2: VIRTUALLY_LABELED_CASE 
- timestamp: 2013-08-28 19:30:00 
127 

[datetime.datetime(2013, 8, 28, 19, 30), u'PPTransMergeNonCon', u'PRIME_BIN_RANDOM_STOW', ']', 'quantity=', 15] 
- qty: 15 
- s1: PPTransMergeNonCon 
- s2: PRIME_BIN_RANDOM_STOW 
- timestamp: 2013-08-28 19:30:00 
15 

[datetime.datetime(2013, 8, 27, 21, 0), u'PPTransFRA1', u'PRIME_BIN_RANDOM_STOW', ']', 'quantity=', 8] 
- qty: 8 
- s1: PPTransFRA1 
- s2: PRIME_BIN_RANDOM_STOW 
- timestamp: 2013-08-27 21:00:00 
8 

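dump() shows all the named results available to you. Note how the quantity is accessed directly as data.qty; the results name 'qty' was defined in "quantity=" + intvalue("qty"). timestamp, s1, and s2 can be accessed the same way. (There is still a little eval'ing going on in here; cleaning that up is left as an exercise for the reader.)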
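
As one possible take on that exercise, the two `eval` calls could be swapped for explicit conversions. This is a sketch and not part of the original answer: the helper names `parse_datetime_call` and `parse_string_literal` are hypothetical, Python 3 is assumed, and `UTC` is assumed to be the only time zone name that appears in the data.

```python
import ast
import datetime
import re

_DATETIME_CALL = re.compile(r"datetime\.datetime\(([^)]*)\)")


def parse_datetime_call(text):
    """Build a datetime from text like 'datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC)'."""
    match = _DATETIME_CALL.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a datetime.datetime(...) call: %r" % text)
    positional = []
    tzinfo = None
    for arg in match.group(1).split(","):
        name, sep, value = arg.partition("=")
        if sep:
            # Only tzinfo=UTC is recognised (an assumption about the data).
            if name.strip() != "tzinfo" or value.strip() != "UTC":
                raise ValueError("unsupported argument: %r" % arg)
            tzinfo = datetime.timezone.utc
        else:
            positional.append(int(arg))
    return datetime.datetime(*positional, tzinfo=tzinfo)


def parse_string_literal(text):
    """Decode a quoted literal such as u'TOTE' without executing any code."""
    value = ast.literal_eval(text)
    if not isinstance(value, str):
        raise ValueError("not a string literal: %r" % text)
    return value


stamp = parse_datetime_call("datetime.datetime(2013, 8, 28, 19, 30, tzinfo= UTC)")
label = parse_string_literal("u'VIRTUALLY_LABELED_CASE'")
print(stamp, label)
```

In the parser above, these could presumably replace `evaltokens`, for example `timevalue.setParseAction(lambda t: parse_datetime_call(''.join(t)))` and `strvalue.setParseAction(lambda t: parse_string_literal(''.join(t)))`, since the joined tokens are exactly the text each helper expects.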

Edit:

Here is the pyparsing parser modified to handle your native XML-like stuff. The changes are really quite slight:

from pyparsing import Suppress,QuotedString,Word,alphas,nums,alphanums,Keyword,Optional, ungroup 
import datetime 

# define UTC timezone for sake of eval 
if hasattr(datetime,"timezone"): 
    UTC = datetime.timezone(datetime.timedelta(0),"UTC") 
else: 
    UTC = None 

_ = Suppress 
evaltokens = lambda s,l,t: eval(''.join(t)) 

timevalue = 'datetime.datetime' + QuotedString('(', endQuoteChar=')', unquoteResults=False) 
replUTC = lambda s,l,t: ''.join(t).replace("< UTC>","UTC").replace("<UTC >","UTC").replace("<UTC>","UTC") 
timevalue.setParseAction(replUTC, evaltokens) 

strvalue = 'u' + QuotedString("'", unquoteResults=False) 
strvalue.setParseAction(evaltokens) 

nonevalue = Keyword("None").setParseAction(lambda s,l,t: [None]) 
intvalue = Word(nums).setParseAction(lambda s,l,t: int(t[0])) 

COMMA = Optional(_(",")) 
LT,GT,LPAR,RPAR,LBRACK,RBRACK = map(Suppress,"<>()[]") 

#~ valuedexpr = lambda expr: (Word(alphas) + "(" + "value" + "=" + expr + ")").setParseAction(lambda t: t[4]) 
valuedexpr = lambda expr: ungroup(LT + (Word(alphas) + "(" + "value" + "=" + expr("value") + ")" + GT).setParseAction(lambda t: t.value)) 

#~ lineexpr = (_("Aggregate(aggregate_dimension_value_list=[") + 
      #~ valuedexpr(timevalue)("timestamp") + COMMA + 
      #~ (nonevalue | valuedexpr(strvalue))("s1") + COMMA + 
      #~ (nonevalue | valuedexpr(strvalue))("s2") + COMMA + 
     #~ "]" + COMMA + 
     #~ "quantity=" + intvalue("qty")) 

lineexpr = (LT + "Aggregate" + LPAR + "aggregate_dimension_value_list" + "=" + LBRACK + 
      valuedexpr(timevalue)("timestamp") + COMMA + 
      (nonevalue | valuedexpr(strvalue))("s1") + COMMA + 
      (nonevalue | valuedexpr(strvalue))("s2") + 
     RBRACK + COMMA + 
     "quantity=" + intvalue("qty") + RPAR + GT) 

From your pasted text (some of which is malformed), this gives:

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 20, 30), u'PPTransCGN1', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 992] 
- qty: 992 
- s1: PPTransCGN1 
- s2: PRIME_BIN_RANDOM_STOW 
- timestamp: 2013-08-26 20:30:00 
992 

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 23, 19, 30), None, u'TOTE', 'quantity=', 87] 
- qty: 87 
- s1: None 
- s2: TOTE 
- timestamp: 2013-08-23 19:30:00 
87 

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 27, 17, 30), u'PPTransMUC3', u'TOTE', 'quantity=', 14] 
- qty: 14 
- s1: PPTransMUC3 
- s2: TOTE 
- timestamp: 2013-08-27 17:30:00 
14 

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 27, 20, 30), u'PPTransEUK5', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 339] 
- qty: 339 
- s1: PPTransEUK5 
- s2: PRIME_BIN_RANDOM_STOW 
- timestamp: 2013-08-27 20:30:00 
339 

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 20, 30), u'PPTransCGN1', u'TOTE', 'quantity=', 1731] 
- qty: 1731 
- s1: PPTransCGN1 
- s2: TOTE 
- timestamp: 2013-08-26 20:30:00 
1731 

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 19, 30), u'PPTransEUK5', u'TOTE', 'quantity=', 28] 
- qty: 28 
- s1: PPTransEUK5 
- s2: TOTE 
- timestamp: 2013-08-26 19:30:00 
28 

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 28, 19, 30), u'PPTransORY1', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 69] 
- qty: 69 
- s1: PPTransORY1 
- s2: PRIME_BIN_RANDOM_STOW 
- timestamp: 2013-08-28 19:30:00 
69 

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 19, 30), u'PPTransMAD4', u'PRIME_BIN_RANDOM_STOW', 'quantity=', 47] 
- qty: 47 
- s1: PPTransMAD4 
- s2: PRIME_BIN_RANDOM_STOW 
- timestamp: 2013-08-26 19:30:00 
47 

['Aggregate', 'aggregate_dimension_value_list', '=', datetime.datetime(2013, 8, 26, 21, 0), None, None, 'quantity=', 78] 
- qty: 78 
- s1: None 
- s2: None 
- timestamp: 2013-08-26 21:00:00 
78 

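I can't seem to replicate this. Which version of pyparsing are you using? I have 1.5.7. I don't get any error messages, just a blank file when I print the results to a text file...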


Pyparsing 2.0.1, Python 3.3. But there is nothing version-specific in this parser; it should run just fine under 1.5.7. (If you are using Python 2.6 or 2.7, you can safely upgrade to Pyparsing 2.0.1.) – PaulMcG


Thanks! This helps a lot.

0

You can define minimal classes Aggregate, DateAggregateDimensionValue, and StringAggregateDimensionValue, and then eval each line in turn:

import datetime 

# define UTC timezone for sake of eval 
if hasattr(datetime,"timezone"): 
    UTC = datetime.timezone(datetime.timedelta(0),"UTC") 
else: 
    UTC = None 

# define minimal classes to eval initializers 
class AggregateDimensionValue(object):
    def __init__(self, value):
        self.value = value

class StringAggregateDimensionValue(AggregateDimensionValue): pass
class DateAggregateDimensionValue(AggregateDimensionValue): pass

class Aggregate(object):
    def __init__(self, aggregate_dimension_value_list, quantity):
        self.timestamp, self.s1, self.s2 = aggregate_dimension_value_list
        # pull values out of parsed "aggregate" instances
        self.timestamp = self.timestamp.value
        if self.s1 is not None:
            self.s1 = self.s1.value
        if self.s2 is not None:
            self.s2 = self.s2.value
        self.quantity = quantity

Using these minimal classes, eval the input strings:

for line in sample.splitlines():
    if not line.strip():
        continue
    # strip surrounding spaces and the trailing comma before evaluating
    obj = eval(line.strip(' ,'))
    print(obj.__dict__)

Giving:

{'timestamp': datetime.datetime(2013, 8, 28, 19, 30), 's1': None, 'quantity': 127, 's2': u'VIRTUALLY_LABELED_CASE'} 
{'timestamp': datetime.datetime(2013, 8, 28, 19, 30), 's1': u'PPTransMergeNonCon', 'quantity': 15, 's2': u'PRIME_BIN_RANDOM_STOW'} 
{'timestamp': datetime.datetime(2013, 8, 27, 21, 0), 's1': u'PPTransFRA1', 'quantity': 8, 's2': u'PRIME_BIN_RANDOM_STOW'} 

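Of course, all the usual warnings about using eval apply here, such as watching out for any malicious code that might be injected. But I suspect you control this input file yourself, so if you inject your own malicious code, you have only yourself to blame.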
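
If the input ever comes from somewhere less trusted, one option is to inspect each line with the standard `ast` module before handing it to eval. The sketch below is an illustration under assumptions, not a guarantee of safety: the function name `is_safe_aggregate` and both whitelists are made up here, and Python 3.8 or later is assumed (where literals parse as `ast.Constant`).

```python
import ast

# Names and syntax node types that appear in a well-formed Aggregate(...) line.
ALLOWED_NAMES = {"Aggregate", "DateAggregateDimensionValue",
                 "StringAggregateDimensionValue", "datetime", "UTC"}
ALLOWED_NODES = (ast.Expression, ast.Call, ast.keyword, ast.List, ast.Name,
                 ast.Attribute, ast.Constant, ast.Load)


def is_safe_aggregate(line):
    """Return True if the line uses only the calls, names and literals seen in the dump."""
    try:
        tree = ast.parse(line.strip(" ,"), mode="eval")
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            return False
        if isinstance(node, ast.Name) and node.id not in ALLOWED_NAMES:
            return False
        # The only attribute access in the data is datetime.datetime.
        if isinstance(node, ast.Attribute) and node.attr != "datetime":
            return False
    return True


good = ("Aggregate(aggregate_dimension_value_list=[ DateAggregateDimensionValue("
        "value=datetime.datetime(2013, 8, 27, 21, 0, tzinfo= UTC)) , "
        "StringAggregateDimensionValue(value=u'PPTransFRA1') , "
        "StringAggregateDimensionValue(value=u'PRIME_BIN_RANDOM_STOW') ], quantity=8) , ")
bad = "__import__('os').system('echo hi')"

print(is_safe_aggregate(good), is_safe_aggregate(bad))
```

A line would then be passed to eval only when the check returns True, and skipped or logged otherwise.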