Split the text on `|`, then skip everything until the first `gb` field; the next element is the ID:
from itertools import dropwhile
text = iter(text.split('|'))
next(dropwhile(lambda s: s != 'gb', text))
id = next(text)
Demo:
>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> text = iter(text.split('|'))
>>> next(dropwhile(lambda s: s != 'gb', text))
'gb'
>>> id = next(text)
>>> id
'EDL26483.1'
In other words, there is no need for a regular expression.
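If the text might not contain a `gb` field at all, the bare `next()` calls raise `StopIteration`. The following is a minimal sketch of a safer wrapper, not part of the original answer: the function name `first_gb_id` and its `default` parameter are hypothetical, and it relies on the two-argument form of the built-in `next()`.

```python
from itertools import dropwhile

def first_gb_id(text, default=None):
    """Return the field right after the first 'gb' field, or default.

    Hypothetical helper for illustration only.
    """
    fields = iter(text.split('|'))
    # Consume everything up to and including the first 'gb' field.
    if next(dropwhile(lambda s: s != 'gb', fields), None) is None:
        return default  # no 'gb' field in the text
    # The next field is the ID; fall back to default if 'gb' was last.
    return next(fields, default)
```

Called on a header with no `gb` entry, this returns `default` instead of raising.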
Turn it into a generator function to get all the IDs:
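from itertools import dropwhile

def extract_ids(text):
    text = iter(text.split('|'))
    while True:
        try:
            # Skip fields up to and including the next 'gb' marker.
            next(dropwhile(lambda s: s != 'gb', text))
            yield next(text)
        except StopIteration:
            # Input exhausted; Python 3.7+ needs an explicit return (PEP 479).
            return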
This gives:
>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]'
>>> list(extract_ids(text))
['EDL26483.1', 'AAI37799.1']
Or you can use it in a simple loop:
for id in extract_ids(text):
    print(id)
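The same idea extends to the other database tags in the header, such as `ref` or `sp`. Below is a rough sketch under that assumption; the name `ids_after_marker` and the `marker` parameter are made up for illustration. It uses a plain `for` loop over the iterator, so running out of fields simply ends the loop.

```python
def ids_after_marker(text, marker='gb'):
    """Yield the field that follows each occurrence of marker.

    Hypothetical generalisation for illustration only.
    """
    fields = iter(text.split('|'))
    for field in fields:
        if field == marker:
            # The field right after the marker is the ID; stop quietly
            # if the marker happens to be the very last field.
            following = next(fields, None)
            if following is None:
                return
            yield following

header = ('>gi|124486857|ref|NP_001074751.1| inhibitor '
          '>gi|148694536|gb|EDL26483.1| mCG128548 '
          '>gi|223460980|gb|AAI37799.1| Ibtk')
print(list(ids_after_marker(header)))         # ['EDL26483.1', 'AAI37799.1']
print(list(ids_after_marker(header, 'ref')))  # ['NP_001074751.1']
```

Because the loop and the inner `next()` share one iterator, each ID is consumed exactly once and is never mistaken for a marker itself.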
Is the '>' character part of the text? – 2013-02-13 20:56:56
Something along the lines of \|gb\|(.*?\|) –
dutt
2013-02-13 20:59:23