2014-02-06 67 views
0

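Skipping different types of comment lines in csv.DictReader

I have several tab-delimited files that I want to read into dicts using csv.DictReader. Before the actual data starts, each file contains several comment lines beginning with '#' or '\t'. The number of comment lines varies from file to file. I have been trying the approach outlined in this post, but I can't seem to get it to work.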

Here is my current code:

def load_database_snps(inputFile):
    '''This function takes a txt tab delimited input file (in house database) and returns a list of dictionaries for each variant'''
    idStore = []  # empty list for storing variant records
    with open(inputFile, 'r+') as varin:
        idStoreDictgroup = csv.DictReader((row for row in varin if row.startswith('hr', 1, 2)), delimiter='\t')  # create a generator; dictionary per snp (row) in the file
        idStoreDictgroup.fieldnames = [field.strip() for field in idStoreDictgroup.fieldnames]  # strip whitespace from field names
        print(type(idStoreDictgroup))
        for d in idStoreDictgroup:  # iterate over dictionaries in varin_dictgroup
            print(d)
            idStore.append(d)  # attach to var_list
    return idStore

Here is an example of an input file:

## SM=Sample,AD=Total Allele Depth, DP=Total Depth 
## het;;; and homo;;; are breakdowns of variant read counts per sample - chr1:10002921 T>G AD=34 het:4;11;7;12 (sum=34) 


     Hetereozygous          Homozygous          
    Chr  Start  End   ref   |A|  |C|  |G|  |T|  HetCount  |A|  |C|  |G|  |T|  HomCount  TotalCount  SampleCount 
    chr1 10001102  10001102  T  0  0  SM=1;AD=22;DP=38  0  1  0  0  0  0  0  1  138  het:22; homo:- 
    chr1 10002921  10002921  T  0  0  SM=4;AD=34;DP=63  0  4  0  0  0  0  0  4  138  het:4;11;7;12; homo:- 

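I want to read every line that starts with 'Chr' or 'chr'. I think it isn't working because I need to iterate over the reader to reformat the field names, and using a generator means it is exhausted before the rows are read into the dictionaries.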

The error message I get is:

Traceback (most recent call last):
  File "snp_freq_V1-1_export.py", line 99, in <module>
    snp_check_wrapper(inputargs.snpstocheck, inputargs.snp_database_location)
  File "snp_freq_V1-1_export.py", line 92, in snp_check_wrapper
    snpDatabase = load_database_snps(databaseInputFile) #store database variants in snp_database (a dictionary)
  File "snp_freq_V1-1_export.py", line 53, in load_database_snps
    idStoreDictgroup.fieldnames = [field.strip() for field in idStoreDictgroup.fieldnames] #strip whitespace from field names
TypeError: 'NoneType' object is not iterable

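I have also tried the inverse of my current code, explicitly excluding lines that start with '#' and '\t'. That doesn't work either and just gives me an empty dictionary.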
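A note on the traceback above: `str.startswith(prefix, start, end)` tests only the slice `row[start:end]`, so `row.startswith('hr', 1, 2)` compares a one-character slice against a two-character prefix and can never be true. The generator then yields no rows, `csv.DictReader` finds no header line, and `fieldnames` stays `None`. A minimal sketch of that behaviour, using made-up in-memory lines rather than the real file:

```python
import csv

lines = ["## a comment\n", "chr1\t10001102\tT\n"]

# startswith('hr', 1, 2) looks only at row[1:2], a single character,
# so it can never match the two-character prefix 'hr'.
matches = [row for row in lines if row.startswith('hr', 1, 2)]

# Widening the slice to row[1:3] lets the prefix match 'chr...' lines.
widened = [row for row in lines if row.startswith('hr', 1, 3)]

# With nothing to read, DictReader finds no header row.
reader = csv.DictReader(iter(matches), delimiter='\t')
empty_fieldnames = reader.fieldnames
```

Even with the wider slice, this test would still miss lines that begin with whitespace, so it is probably not a complete fix on its own.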

+1

Is there only one header per file? That is, the comment/header block above doesn't repeat more than once per file?

+0

Yes. From the example file, I would like it to use the Chr Start ... line as the header and all subsequent lines as the values for my dictionaries.

Answers

1

What you should be able to do is skip all the leading lines until one starts with chr, for example:

import csv
from itertools import dropwhile

with open('somefile') as fin:
    # Skip every leading line until one starts with 'chr' (case-insensitive)
    start = dropwhile(lambda L: not L.lower().lstrip().startswith('chr'), fin)
    for row in csv.DictReader(start, delimiter='\t'):
        pass  # do something with row
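As a rough illustration of why this works: `dropwhile` discards items only until the predicate first fails, then passes everything else through untested, so the header line and every line after it reach the reader. The sample lines and the `not_header` helper below are invented for the sketch:

```python
from itertools import dropwhile

sample = [
    "## comment line\n",
    "\n",
    "\tHeterozygous\tHomozygous\n",
    "Chr\tStart\tEnd\n",
    "chr1\t10001102\t10001102\n",
    "# a later comment is NOT dropped\n",
]

def not_header(line):
    # True for every line that does not begin with 'chr' (any case)
    return not line.lower().lstrip().startswith('chr')

# Only the leading run of non-'chr' lines is discarded.
kept = list(dropwhile(not_header, sample))
```

One consequence is that comment lines appearing after the header would still be parsed as data, which is fine here because the comments only occur at the top of each file.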
+0

Works nicely for skipping the unwanted lines. How can I remove the whitespace from my dictionary keys?

+0

@s_boardman That wasn't your question... :)

+0

Yes, I have managed to fix it now. Many thanks for your help! :)
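The thread does not show how the whitespace in the keys was eventually removed. One possible approach is to strip `fieldnames` after skipping the preamble, since `DictReader` reads the header row on first access to that attribute. A sketch under that assumption; `load_rows` and the sample text are hypothetical and not taken from the original script:

```python
import csv
import io
from itertools import dropwhile

def load_rows(lines):
    """Skip leading non-'chr' lines, then return one dict per data row."""
    start = dropwhile(
        lambda line: not line.lower().lstrip().startswith('chr'), lines
    )
    reader = csv.DictReader(start, delimiter='\t')
    # Accessing fieldnames consumes only the header row; guard against
    # an input that has no header at all (fieldnames would be None).
    if reader.fieldnames is None:
        return []
    reader.fieldnames = [name.strip() for name in reader.fieldnames]
    return list(reader)

# Made-up sample with padded header names, standing in for a real file
sample = io.StringIO(
    "## SM=Sample\n"
    "\n"
    " Chr \tStart\t End\n"
    "chr1\t10001102\t10001102\n"
)
rows = load_rows(sample)
```

This strips only the keys; values keep any surrounding whitespace unless they are cleaned separately.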