2017-07-16

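Parsing a TSV into Python with a very specific format

I have a tsv file containing a network. Here is a snippet. Column 0 contains a unique ID, and column 1 contains an alternative ID (not necessarily unique). Every pair of columns after that contains an 'interactor' and an interaction score.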

11746909_a_at A1CF    SHPRH 0.11081568  TRIM10 0.11914056 
11736238_a_at ABCA5   ANKS1A  0.1333185  CCDC90B 0.14495682 
11724734_at ABCB8    HYKK 0.09577321  LDB3 0.09845833 
11723976_at ABCC8   FAM161B 0.15087105   ID1 0.14801268 
11718612_a_at ABCD4   HOXC6 0.23559235  LCMT2 0.12867001 
11758217_s_at ABHD17C   FZD7 0.46334574  HIVEP3 0.24272481 

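So, for example, A1CF is connected to SHPRH and TRIM10 with scores of 0.11081568 and 0.11914056 respectively. I would like to use pandas to convert this data into a "flat" format that looks like this: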

11746909_a_at A1CF SHPRH 0.11081568 
         TRIM10 0.11914056 
11736238_a_at ABCA5 ANKS1A 0.1333185 
         CCDC90B 0.14495682 
...... and so on........ ........ .... 

Note that each row can have an arbitrary number of (interactor, score) pairs.

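I tried setting columns 0 and 1 as the index, then naming the columns with df.colnames = ['Interactor', 'Weight']*int(df.shape[1]/2) and using pandas.groupby, but so far my attempts have not been successful. Can anyone suggest a way to do this?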

You may want to refresh your memory on [asking a question that produces an output dataframe] and [mcve]. – boardrider

Answer

Something like what you specified above shouldn't be too hard:

from collections import OrderedDict 
import pandas as pd 


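def open_network_tsv(filepath):
    """
    Read the tsv file, yielding every non-empty line split by tabs
    """
    with open(filepath) as network_file:
        # Iterate over the file lazily instead of loading it all with readlines()
        for line in network_file:
            line_columns = line.strip().split('\t')
            # Skip blank lines, which cannot supply both an ID and an alias
            if len(line_columns) >= 2:
                yield line_columns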

def get_connections(potential_conns):
    """
    Get the connections of a particular line, grouped
    in (interactor, score) pairs
    """
    # Step through the even positions; a trailing interactor
    # without a matching score is ignored
    for idx in range(0, len(potential_conns) - 1, 2):
        yield potential_conns[idx], potential_conns[idx + 1]


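def create_connections_df(filepath):
    """
    Build the desired dataframe, one row per (interactor, score) pair
    """
    # A list of pairs keeps the column order on every Python version
    connections = OrderedDict([
        ('uniq_id', []),
        ('alias', []),
        ('interactor', []),
        ('score', []),
    ])
    for line in open_network_tsv(filepath):
        # The first two fields are the IDs; the rest alternate interactor, score
        uniq_id, alias, *potential_conns = line
        for interactor, score in get_connections(potential_conns):
            connections['uniq_id'].append(uniq_id)
            connections['alias'].append(alias)
            connections['interactor'].append(interactor)
            # Scores are kept as strings here; pd.to_numeric can convert them later
            connections['score'].append(score)
    return pd.DataFrame(connections)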
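
For comparison, here is a more compact sketch of the same idea that pairs the fields with slicing and zip instead of index arithmetic. It assumes the fields really are separated by single tabs and that every score parses as a float. The name flatten_network and the in-memory sample (the first two rows of the question's snippet) are hypothetical and only there for illustration.

```python
import io

import pandas as pd


def flatten_network(lines):
    """Turn wide 'id, alias, interactor, score, ...' rows into long rows."""
    rows = []
    for line in lines:
        fields = line.strip().split('\t')
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        uniq_id, alias, rest = fields[0], fields[1], fields[2:]
        # rest[0::2] holds the interactors and rest[1::2] the scores;
        # zip silently drops a dangling interactor that has no score
        for interactor, score in zip(rest[0::2], rest[1::2]):
            rows.append((uniq_id, alias, interactor, float(score)))
    return pd.DataFrame(rows, columns=['uniq_id', 'alias', 'interactor', 'score'])


# Assumption: a tab-separated sample standing in for the real file
sample = io.StringIO(
    "11746909_a_at\tA1CF\tSHPRH\t0.11081568\tTRIM10\t0.11914056\n"
    "11736238_a_at\tABCA5\tANKS1A\t0.1333185\tCCDC90B\t0.14495682\n"
)
flat = flatten_network(sample)
print(flat)
```

Passing an open file object instead of the StringIO works the same way, since both yield one line per iteration.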

Afterwards, you could perhaps call dataframe.set_index(['uniq_id', 'alias']) or dataframe.groupby(['uniq_id', 'alias']) on the output.
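
To make that last suggestion concrete, here is a small sketch of what the two calls do on a long-format frame. The frame is built by hand from the first two rows of the question's snippet, and the variable names indexed and summary are only illustrative.

```python
import pandas as pd

# Long-format data, one row per (interactor, score) pair
dataframe = pd.DataFrame({
    'uniq_id': ['11746909_a_at', '11746909_a_at', '11736238_a_at', '11736238_a_at'],
    'alias': ['A1CF', 'A1CF', 'ABCA5', 'ABCA5'],
    'interactor': ['SHPRH', 'TRIM10', 'ANKS1A', 'CCDC90B'],
    'score': [0.11081568, 0.11914056, 0.1333185, 0.14495682],
})

# A two-level index: with default display options pandas prints a repeated
# index label only once, which resembles the layout asked for in the question
indexed = dataframe.set_index(['uniq_id', 'alias'])
print(indexed)

# groupby gives per-ID aggregates, e.g. the number of interactors
# and the highest score for each (uniq_id, alias) pair
summary = dataframe.groupby(['uniq_id', 'alias'])['score'].agg(['count', 'max'])
print(summary)
```

set_index only changes how the rows are labelled, while groupby is the one to reach for when a per-ID computation is needed.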