2016-12-02 60 views
0

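Python: parsing between two lines that share the same keyword

I know how to parse between two lines when the starting "target word" and the ending "target word" are different.

For example, if I want to parse between X and Y: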

import sys

# X and Y are the start and end marker strings
parse = False
for line in open(sys.argv[1]):
    if Y in line:
        parse = False
    if parse:
        print(line)
    if X in line:
        parse = True
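
As a self-contained sketch of that flag pattern (the function name `lines_between` and the sample lines are made up for illustration, they are not from my real data):

```python
def lines_between(lines, start, end):
    """Yield the lines strictly between a line containing `start`
    and the next line containing `end`."""
    parse = False
    for line in lines:
        if end in line:
            parse = False      # stop before the end marker is emitted
        if parse:
            yield line
        if start in line:
            parse = True       # start with the line after the start marker


sample = ["header", "X begins", "a", "b", "Y ends", "c"]
print(list(lines_between(sample, "X", "Y")))
```

Note that if `start` and `end` are the same string, the flag is switched off and straight back on at every marker line, so this pattern alone cannot tell one block from the next.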

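I'm stuck on a slightly different problem, where the word I want to parse between is the same word at both ends. In this example there are 4 different homolog groups, and I want to extract the human/mouse pairs within each homolog group. So I want to turn this file: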

1:_HomoloGene:_141209.Gene_conserved_in_Mammals 
LOC102724657       Homo_sapiens 
Gm12569         Mus_musculus 
2:_HomoloGene:_141208.Gene_conserved_in_Euarchontoglires  
LOC102724737       Homo_sapiens 
LOC102636216       Mus_musculus 
3:_HomoloGene:_141152.Gene_conserved_in_Euarchontoglires  
LOC728763        Homo_sapiens 
E030010N07Rik       Mus_musculus 
E030010N09Rik       Mus_musculus 
E030010N010Rik       Mus_musculus 
E030010N08Rik       Mus_musculus 
LOC102551034       Rattus_norvegicus 
4:_HomoloGene:_141054.Gene_conserved_in_Boreoeutheria  
LOC102723572       Homo_sapiens 
LOC102157295       Canis_lupus_familiaris 
LOC102633228       Mus_musculus 

into a Homo_sapiens / Mus_musculus comparison like this:

Homo_sapiens Mus_musculus 
LOC102724657 Gm12569 
LOC102724737 LOC102636216 
LOC728763  E030010N07Rik 
LOC728763  E030010N09Rik 
LOC728763  E030010N010Rik 
LOC728763  E030010N08Rik 
LOC102723572 LOC102633228 

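I don't have any nearly working code to show. This is one example of what I've tried (I have also tried regular expressions and splitting the lines on the word "HomoloGene"):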

import sys 
ListOfLines = open(sys.argv[1]) 
for line in ListOfLines: 
     if "HomoloGene" in line: 
       if "HomoloGene" in ListOfLines.next(): 
         print line 
         print "**" 
       else: 
         print ListOfLines.next() 

Thanks.

Answers

3

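The commented code below produces the desired result on your example. To understand it, you may need to read up on what it uses: regular expressions, `defaultdict`, `itertools.product`, and argument-list unpacking.

Code: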

import sys 
import re 
from collections import defaultdict 
import itertools 

#define the pairs of words we want to compare 
compare = ['Homo_sapiens', 'Mus_musculus'] 

#define some regular expressions to split up the input data file 
#this searches for a digit, a colon, and matches the rest of the line 
group_re = re.compile(r"\n?\d+:.*\n") 
#this matches non-whitespace, followed by whitespace, and then non-whitespace, returning the two non-whitespace sections 
line_re = re.compile(r"(\S+)\s+(\S+)") 

#to store our resulting comparisons 
comparison = [] 

#open and read in the datafile 
with open(sys.argv[1]) as f:
    datafile = f.read()
#use our regular expression to split the datafile into homolog groups 
for dataset in group_re.split(datafile): 
    #ignore empty matches 
    if dataset.strip()=='': continue 
    #split our group into lines 
    dataset = dataset.split('\n') 
    #use our regular expression to match each line, pulling out the two bits of data 
    dataset = [line_re.match(line).groups() for line in dataset if line.strip()!=''] 
    #build a dictionary to store our words 
    words = defaultdict(list) 
    #loop through our group dataset, grouping each line by its word 
    for v, k in dataset: words[k].append(v) 
    #add the results to our output list. Note here we are unpacking an argument list 
    comparison+=itertools.product(*[words[w] for w in compare]) 

#print out the words we wanted to compare 
print('\t'.join(compare)) 
#loop through our output dataset 
for combination in comparison: 
    #print each comparison, spaced with a tab character 
    print('\t'.join(combination)) 
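
To see the core step in isolation, here is a minimal sketch of the grouping and pairing for a single homolog group. The `rows` list is a shortened copy of group 3 from the sample data, and the names `rows` and `pairs` are only for illustration:

```python
from collections import defaultdict
import itertools

# (gene, species) rows for one homolog group, shortened from group 3 of the sample
rows = [
    ("LOC728763", "Homo_sapiens"),
    ("E030010N07Rik", "Mus_musculus"),
    ("E030010N09Rik", "Mus_musculus"),
    ("LOC102551034", "Rattus_norvegicus"),
]
compare = ["Homo_sapiens", "Mus_musculus"]

# group the gene names by species
words = defaultdict(list)
for gene, species in rows:
    words[species].append(gene)

# unpack one list per species into product() to get every cross-species pair
pairs = list(itertools.product(*[words[s] for s in compare]))
print(pairs)
```

If a group has no gene for one of the species, its list is empty and `itertools.product` yields no pairs for that group, which is why groups without a human or mouse entry simply drop out of the output.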
1

This is a two-part problem. First extract the homolog groups into a dictionary, then iterate over the groups and print the pairs.

#!/usr/bin/env python
import re

data = {}
group = None
# Opens the text file
with open("genes.txt", "r") as f:
    # reads the lines
    for line in f:
        # When there is a number followed by : at the line start -> new group
        match = re.search(r"^([0-9]+):", line)
        if match:
            # extracts the group number and puts it to the dict
            group = match.group(1)
            # adds the species as entries with empty lists as values
            data[group] = {"Homo_sapiens": [], "Mus_musculus": []}
        else:
            # splits the line on whitespace (also removes the \n)
            text = line.split()
            # skips blank lines and anything before the first group header
            if group is None or len(text) < 2:
                continue
            # if the species is in the group, add the gene name to the list
            if text[1] in data[group]:
                data[group][text[1]].append(text[0])
# Here you go with your parsed data
print(data)
# Now we feed it into the text format you want
print("Homo_sapiens\t\tMus_musculus")
# go through groups in numeric order
for gr in sorted(data, key=int):
    # go through the Hs genes
    for hs_gene in data[gr]["Homo_sapiens"]:
        # get all the associated Ms genes
        for ms_gene in data[gr]["Mus_musculus"]:
            # print the pairs
            print(hs_gene + "\t\t" + ms_gene)
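
The same two steps can also be sketched without an intermediate dictionary of groups, by letting `itertools.groupby` split the lines on the header rows. This is an alternative sketch rather than the code above: the function name `human_mouse_pairs`, the `HEADER` pattern, and the shortened sample are assumptions made for illustration.

```python
import itertools
import re

# assumption: a group header is any line starting with a number and a colon
HEADER = re.compile(r"^\d+:")


def human_mouse_pairs(lines, species=("Homo_sapiens", "Mus_musculus")):
    """Return (first-species gene, second-species gene) pairs, group by group."""
    pairs = []
    # groupby yields alternating runs of header lines and data lines
    for is_header, block in itertools.groupby(
        lines, key=lambda l: bool(HEADER.match(l))
    ):
        if is_header:
            continue
        genes = {name: [] for name in species}
        for line in block:
            fields = line.split()
            # ignore blank lines and species we are not comparing
            if len(fields) >= 2 and fields[1] in genes:
                genes[fields[1]].append(fields[0])
        pairs.extend(itertools.product(genes[species[0]], genes[species[1]]))
    return pairs


sample = """\
1:_HomoloGene:_141209.Gene_conserved_in_Mammals
LOC102724657       Homo_sapiens
Gm12569         Mus_musculus
4:_HomoloGene:_141054.Gene_conserved_in_Boreoeutheria
LOC102723572       Homo_sapiens
LOC102157295       Canis_lupus_familiaris
LOC102633228       Mus_musculus
""".splitlines()

for hs_gene, mm_gene in human_mouse_pairs(sample):
    print(hs_gene + "\t" + mm_gene)
```

One caveat of this sketch: any data lines that appear before the first header are treated as a group of their own.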

Hope this helps.

+0

Don't you think the group number could go above 9? – alexis

+0

Good point. Fixed accordingly. – CDe

+0

s/'if match != None:'/'if match:'/. You forgot to drop the old definition of 'group', so your code is still broken. – alexis
