2015-04-29 35 views
3

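Finding matches between two files using Python

I am analyzing sequencing data and have several candidate genes whose functions I need to look up.

After editing the available human database, I want to compare my candidate genes against it and output the function of each candidate gene.

I only have basic Python skills, so I thought a script might speed up the work of finding the functions of the candidate genes.

File 1 contains the candidate genes and looks like this: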

Gene 
AQP7 
RLIM 
SMCO3 
COASY 
HSPA6 

The database, file2.csv, looks like this:

Gene function 
PDCD6 Programmed cell death protein 6 
CDC2 Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2 Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2 Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2 Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 

Desired output:

Gene(from file1) ,function(matching from file2) 

I tried this code:

file1 = 'file1.csv'
file2 = 'file2.csv'
output = 'file3.txt'

with open(file1) as inf:
    match = set(line.strip() for line in inf)

with open(file2) as inf, open(output, 'w') as outf:
    for line in inf:
        if line.split(' ',1)[0] in match:
            outf.write(line)

I only get an empty file.

I also tried using set intersection:

with open('file1.csv', 'r') as ref:
    with open('file2.csv','r') as com:
        with open('common_genes_function','w') as output:
            same = set(ref).intersection(com)
            print(same)

That did not work either.

Please help, otherwise I will have to do this by hand.

+0

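Have you tried looking into Python's 'csv' module? It has many methods that make parsing CSV files easy. You could load the genes from 'file1' into an array and then match each item in the array against the data that the csv module loads into memory. – Bhargav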

+0

How are the genes in file1 related to the functions in file2? Are the CDC2 and PDCD genes in file1? – lapinkoira

+0

The genes in file1 should be present in file2, because file2 is the complete human database. The data shown above is only part of the contents. –
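The csv-module approach suggested in the comments could be sketched roughly as follows. This is only an illustration: the helper names `load_functions` and `load_genes` are hypothetical, the sample data is made up, and it assumes the database columns are separated by a tab, which the question does not confirm.

```python
import csv
import io


def load_functions(handle, delimiter="\t"):
    """Return {gene: function} from a delimited file; the first entry wins."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # skip the header row
    functions = {}
    for row in reader:
        if len(row) < 2:  # skip blank or incomplete rows
            continue
        functions.setdefault(row[0].strip(), row[1].strip())
    return functions


def load_genes(handle):
    """Return the candidate gene names, skipping the header and blank lines."""
    next(handle, None)  # skip the header line
    return [line.strip() for line in handle if line.strip()]


# Made-up in-memory stand-ins for file2.csv and file1.csv (assumed tab-separated)
database = io.StringIO(
    "Gene\tfunction\n"
    "PDCD6\tProgrammed cell death protein 6\n"
    "CDC2\tCell division cycle 2\n"
)
candidates = io.StringIO("Gene\nCDC2 \nAQP7 \n")

functions = load_functions(database)
matches = [(g, functions[g]) for g in load_genes(candidates) if g in functions]
print(matches)
```

With real files, the `io.StringIO` objects would be replaced by `open('file2.csv', newline='')` and `open('file1.csv')`.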

Answers

2

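I would suggest using the pandas merge function. However, it requires an unambiguous delimiter between the "Gene" and "function" columns. In my example, I assume it is a tab: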

import pandas as pd

# Placeholder paths; point these at your own files
filepath1 = 'file1.csv'
filepath2 = 'file2.csv'
filepath3 = 'file3.csv'

# Open the files as pandas DataFrames (assumes tab-separated columns)
file1 = pd.read_csv(filepath1, sep='\t')
file2 = pd.read_csv(filepath2, sep='\t')

# Merge on the 'Gene' column with an inner join, so the result
# is the intersection of both datasets
file3 = pd.merge(file1, file2, how='inner', on=['Gene'], suffixes=['1', '2'])
file3.to_csv(filepath3, sep=',')
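To see what the inner merge does without touching any files, here is a small self-contained sketch on made-up, tab-separated sample data held in memory. The variable names and sample rows are assumptions for illustration. It also drops repeated database rows first, since the database excerpt in the question lists CDC2 several times and an inner join would otherwise repeat the match.

```python
import io

import pandas as pd

# Made-up sample data, tab-separated, purely for illustration
candidates_text = "Gene\nAQP7\nCDC2\nHSPA6\n"
database_text = (
    "Gene\tfunction\n"
    "PDCD6\tProgrammed cell death protein 6\n"
    "CDC2\tCell division cycle 2\n"
    "CDC2\tCell division cycle 2\n"
    "AQP7\tAquaporin 7\n"
)

candidates = pd.read_csv(io.StringIO(candidates_text), sep="\t")
database = pd.read_csv(io.StringIO(database_text), sep="\t")

# Keep one row per gene so duplicated database entries do not repeat in the result
database = database.drop_duplicates(subset="Gene")

# The inner join keeps only genes present in both tables
matched = pd.merge(candidates, database, how="inner", on="Gene")
print(matched.to_csv(index=False))
```

HSPA6 is absent from the sample database, so it does not appear in `matched`. Using `how="left"` instead would keep it with an empty function, which can help spot candidates that are missing from the database.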
1

Using basic Python, you could try the following:

import re

# Map each gene name to its function, keeping the first entry seen
gene_function = {}
with open('file2.csv', 'r') as infile:
    for line in infile.readlines()[1:]:  # skip the header row
        match = re.search(r"(\w+)\s+(.*)", line.strip())
        if match is None:  # skip blank or malformed lines
            continue
        gene = match.group(1)
        function = match.group(2)
        if gene not in gene_function:
            gene_function[gene] = function

# Print "gene, function" for every candidate gene found in the database
with open('file1.csv', 'r') as infile:
    genes = [i.strip() for i in infile.readlines()[1:]]
    for gene in genes:
        if gene in gene_function:
            print("{}, {}".format(gene, gene_function[gene]))
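One possible reason the attempt in the question wrote nothing is a separator mismatch: `line.split(' ', 1)` splits only on a literal space, so if the database is tab-separated (an assumption, the question does not say), the first token still contains the tab and part of the function text. The sketch below shows that effect on a made-up line and uses `str.split(None, 1)`, which splits on any run of whitespace. The function names `parse_line` and `match_genes` are hypothetical.

```python
def parse_line(line):
    """Split a line into (gene, function) on the first run of whitespace."""
    parts = line.strip().split(None, 1)
    if not parts:
        return None
    function = parts[1] if len(parts) > 1 else ""
    return parts[0], function


def match_genes(candidate_lines, database_lines):
    """Return 'gene,function' rows for candidates found in the database."""
    # Skip the header line of each input and ignore blank lines
    wanted = {line.strip() for line in candidate_lines[1:] if line.strip()}
    rows, seen = [], set()
    for line in database_lines[1:]:
        parsed = parse_line(line)
        if parsed and parsed[0] in wanted and parsed[0] not in seen:
            seen.add(parsed[0])  # report each gene only once
            rows.append("{},{}".format(*parsed))
    return rows


# Made-up tab-separated sample line
sample_line = "CDC2\tCell division cycle 2\n"

# Splitting on a literal space leaves the tab inside the first token
print(repr(sample_line.split(' ', 1)[0]))

rows = match_genes(
    ["Gene\n", "AQP7 \n", "CDC2 \n"],
    ["Gene function\n", "PDCD6 Programmed cell death protein 6 \n",
     sample_line, sample_line],
)
print(rows)
```

If a function description itself contains commas, as some entries in the question's database do, writing the rows with `csv.writer` would quote them properly.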