Parsing an ID from text with Python

I have text like this:

>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]

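From this text I want to parse the IDs that come after |gb| and write them into a list.

I tried using a regular expression, but I could not get it to work.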

2013-02-13 Leo.peis

Is the '>' character part of the text? – 2013-02-13 20:56:56

Something along the lines of \|gb\|(.*?\|) – dutt 2013-02-13 20:59:23

A regular expression should work here, as long as the | (pipe) characters are escaped:

import re
re.findall(r'gb\|([^|]*)\|', 'gb|AB1234|')

2013-02-13 20:59:40 Hoopdady

Split, then skip everything until the first gb; the next element is the ID:

from itertools import dropwhile 

text = iter(text.split('|')) 
next(dropwhile(lambda s: s != 'gb', text)) 
id = next(text)

Demo:

>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]' 
>>> text = iter(text.split('|')) 
>>> next(dropwhile(lambda s: s != 'gb', text)) 
'gb' 
>>> id = next(text) 
>>> id 
'EDL26483.1'

In other words, there is no need for a regular expression.

Turned into a generator function, this yields all the IDs:

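from itertools import dropwhile

def extract_ids(text):
    parts = iter(text.split('|'))
    while True:
        try:
            # Skip ahead to the next 'gb' field; the field after it is the ID.
            next(dropwhile(lambda s: s != 'gb', parts))
            yield next(parts)
        except StopIteration:
            # Input exhausted. On Python 3.7+ a StopIteration escaping a
            # generator becomes a RuntimeError, so end the generator here.
            return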

This gives:

>>> text = '>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]' 
>>> list(extract_ids(text)) 
['EDL26483.1', 'AAI37799.1']

Or you can use it in a simple loop:

for id in extract_ids(text):
    print(id)

2013-02-13 21:00:15

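That looks like a lot of work for something a simple regular expression could handle. – Hoopdady 2013-02-13 21:06:09

@Hoopdady: No; I used more text to explain how it works, but the method is only 4 lines in total. It is an alternative approach, and besides that, it works well. – 2013-02-13 21:07:26

But you're right, it probably doesn't deserve a downvote – Hoopdady 2013-02-13 21:07:34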

In [1]: import re 

In [2]: text = ">gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]" 

In [3]: re.findall(r'gb\|([^\|]+)', text)[0] 
Out[3]: 'EDL26483.1'

2013-02-13 21:01:38 brwnj

re.findall(r'gi\|([0-9]+)\|', u'''>gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]''')

works for me:

[u'124486857', u'341941060', u'148694536', u'223460980']

2013-02-13 21:02:36 hd1

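This is the wrong information; the ID comes after the 'gb' key, not 'gi'. – 2013-02-13 21:05:54

In this case you can manage without a regular expression: just split on '|gb|', then split the second part on '|' and take the first item: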

s = 'the string from the question'
r = s.split('|gb|')
# r is a list; the ID is at the start of the part after the first '|gb|'
r[1].split('|')[0]

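Of course, you will have to add checks for the case where the split returns more or fewer than 2 items, but I think splitting first and working on the returned list will be faster than using a regular expression.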
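A minimal sketch of what those checks might look like, assuming the header is a single string and that every ID of interest is preceded by the literal marker '|gb|'. The function names first_gb_id and all_gb_ids are made up for this illustration.

```python
def first_gb_id(header):
    """Return the ID after the first '|gb|' marker, or None if there is none."""
    # partition() always returns a 3-tuple, so no length check is needed;
    # marker is an empty string when '|gb|' does not occur at all.
    _, marker, rest = header.partition('|gb|')
    if not marker:
        return None
    return rest.split('|', 1)[0]


def all_gb_ids(header):
    """Return the ID after every '|gb|' marker, in order of appearance."""
    # Every chunk except the first one starts right after a '|gb|' marker.
    chunks = header.split('|gb|')[1:]
    return [chunk.split('|', 1)[0] for chunk in chunks]
```

Both helpers return an empty result (None or an empty list) instead of raising when the marker is missing, which covers the "fewer than 2 items" case; all_gb_ids covers the "more than 2 items" case.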

2013-02-13 21:03:35

>>> import re
>>> match_object = re.findall(r"\|gb\|(.*?)\|", ">gi|124486857|ref|NP_001074751.1| inhibitor of Bruton tyrosine kinase [Mus musculus] >gi|341941060|sp|Q6ZPR6.3|IBTK_MOUSE RecName: Full=Inhibitor of Bruton tyrosine kinase; Short=IBtk >gi|148694536|gb|EDL26483.1| mCG128548, isoform CRA_d [Mus musculus] >gi|223460980|gb|AAI37799.1| Ibtk protein [Mus musculus]")
>>> print(match_object)
['EDL26483.1', 'AAI37799.1']

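The regular expression means "match any character (.), any number of times (*), but as few of them as possible (?), and capture only that group (the parentheses); the group must come immediately after '|gb|' and be immediately followed by another '|'."

I used "\|" because the "|" character denotes alternation in a regular expression.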
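To see why the "?" matters, here is a small sketch comparing the lazy quantifier with a greedy one and with a negated character class. The sample string is a shortened, made-up header used only for illustration.

```python
import re

# Shortened, hypothetical header with two |gb| entries.
sample = "|gb|EDL26483.1| first >gi|2|gb|AAI37799.1| second"

# Lazy: .*? stops at the first '|' it can, so each ID is captured separately.
lazy = re.findall(r"\|gb\|(.*?)\|", sample)

# Greedy: .* runs to the last '|' in the string and swallows both entries.
greedy = re.findall(r"\|gb\|(.*)\|", sample)

# A negated character class gives the same result as the lazy version here.
negated = re.findall(r"\|gb\|([^|]*)\|", sample)

print(lazy)     # ['EDL26483.1', 'AAI37799.1']
print(greedy)   # ['EDL26483.1| first >gi|2|gb|AAI37799.1']
print(negated)  # ['EDL26483.1', 'AAI37799.1']
```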

2013-02-13 21:04:50 rkday

Assuming a is the variable holding your string...

>>> import re 
>>> a = ">gi|124486857|ref|NP_001074751.1| ..." 
>>> re.findall(r"(?:\|gb\|)([a-zA-Z0-9.]+)(?:\|)", a) 
['EDL26483.1', 'AAI37799.1']
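If IDs for other database tags in the same header are needed, a pattern of this kind could be parameterised by tag. This is only a sketch: the helper name ids_after_tag is made up for illustration, and the character class assumes IDs consist of letters, digits, dots and underscores.

```python
import re

def ids_after_tag(text, tag="gb"):
    """Return every ID found between '|<tag>|' and the next '|'."""
    # re.escape guards against tags that contain regex metacharacters.
    pattern = r"\|" + re.escape(tag) + r"\|([A-Za-z0-9_.]+)\|"
    return re.findall(pattern, text)
```

Because the pattern requires a leading '|', it will not match the 'gi' fields in the question's text, which are preceded by '>' instead.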

2013-02-13 21:10:27 obimod
