2013-03-27 22 views
Simple regular expression

I have a string that looks like the following:

rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036 

Now, what I want to do is

extract timestamp: 1340496000 
     event: EP002960010145 

The issue is the part around tmsid: I don't know what the %3D after it is. Sometimes it's %3D%6D, and I think it could even be %16D? I'm not sure.

Is there a robust way to extract these two fields from the string above?

Thanks

Answers

You are looking at URL-quoted (percent-encoded) data:

>>> from urllib2 import unquote 
>>> unquote('rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036') 
'rand_id:?tmsid=1340496000_EP002960010145_11_0_10050_1_2_10036' 
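To show what the individual escapes mean, here is a minimal sketch. It assumes Python 3, where `unquote` lives in `urllib.parse` rather than `urllib2`. Each `%XX` is one byte written as two hex digits, so `%6D` (mentioned in the question) would decode to the letter `m`, not to a separator.

```python
from urllib.parse import unquote

# Decode each escape on its own to see which character it stands for.
codes = ("%3A", "%3F", "%3D", "%6D")
decoded = {code: unquote(code) for code in codes}

print(decoded)
# {'%3A': ':', '%3F': '?', '%3D': '=', '%6D': 'm'}
```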

You could split on the first =, perhaps, then split again on _:

>>> unquoted = unquote('rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036') 
>>> unquoted.split('=', 1)[1].split('_') 
['1340496000', 'EP002960010145', '11', '0', '10050', '1', '2', '10036'] 
>>> timestamp, event = unquoted.split('=', 1)[1].split('_')[:2] 
>>> timestamp, event 
('1340496000', 'EP002960010145') 
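Since the question asks about a regular expression, the same two fields can also be pulled out with `re` once the string is unquoted. This is only a sketch, written for Python 3: the pattern and the name `TMSID_RE` are assumptions, based on the one sample string, that the timestamp is all digits and the event id is alphanumeric.

```python
import re
from urllib.parse import unquote

raw = 'rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036'

# Assumed layout: "tmsid=", digits, an underscore, then an alphanumeric id.
TMSID_RE = re.compile(r'tmsid=(\d+)_([A-Za-z0-9]+)')

match = TMSID_RE.search(unquote(raw))
if match:
    timestamp, event = match.groups()
    print(timestamp, event)  # 1340496000 EP002960010145
```

Decoding first means the pattern does not need to care which percent-escape was used for the separator.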

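If instead the data has multiple fields, in which case you would also find & in there, you are better off parsing everything after the question mark as a URL query string, using urlparse.parse_qs():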

>>> from urlparse import parse_qs 
>>> parse_qs(unquoted.split('?', 1)[1]) 
{'tmsid': ['1340496000_EP002960010145_11_0_10050_1_2_10036']} 
>>> parsed = parse_qs(unquoted.split('?', 1)[1]) 
>>> timestamp, event = parsed['tmsid'][0].split('_', 2)[:2] 
>>> timestamp, event 
('1340496000', 'EP002960010145')
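
For Python 3, where `unquote` and `parse_qs` both live in `urllib.parse`, the query-string approach could be wrapped up roughly as follows. The function name `extract_tmsid_fields` and the choice to return `None` on malformed input are illustrative assumptions, not part of the original answer.

```python
from urllib.parse import parse_qs, unquote


def extract_tmsid_fields(raw):
    """Return (timestamp, event) from a percent-encoded string, or None."""
    unquoted = unquote(raw)
    # Treat everything after the first '?' as the query string.
    _, sep, query = unquoted.partition('?')
    if not sep:
        return None
    values = parse_qs(query).get('tmsid')
    if not values:
        return None
    # Only the first two underscore-separated fields are needed.
    parts = values[0].split('_', 2)
    if len(parts) < 2:
        return None
    return parts[0], parts[1]


raw = 'rand_id%3A%3Ftmsid%3D1340496000_EP002960010145_11_0_10050_1_2_10036'
print(extract_tmsid_fields(raw))  # ('1340496000', 'EP002960010145')
```

Returning `None` instead of raising keeps the caller's handling of unexpected input simple; raising an exception would be an equally reasonable choice.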