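BioPython: amino acid sequence contains 'J', cannot compute molecular weight
The data I am working with comes from an Excel file that holds amino acid sequences at index 1. I am trying to compute several properties from these sequences using BioPython. My current code is: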
import xlrd
import sys
from Bio.SeqUtils.ProtParam import ProteinAnalysis

print '~~~~~~~~~~~~~~~ EXCEL PARSER FOR PVA/NON-PVA DATA ~~~~~~~~~~~~~~~'
print 'Path to Excel file:', str(sys.argv[1])
fname = sys.argv[1]
workbook = xlrd.open_workbook(fname, 'rU')
print ''
print 'The sheet names that have been found in the Excel file: '
sheet_names = workbook.sheet_names()
number_of_sheet = 1
for sheet_name in sheet_names:
    print '*', number_of_sheet, ': ', sheet_name
    number_of_sheet += 1

with open("thefile.txt","w") as f:
    lines = []
    f.write('LENGTH.SEQUENCE,SEQUENCE,MOLECULAR.WEIGHT\n')
    for sheet_name in sheet_names:
        worksheet = workbook.sheet_by_name(sheet_name)
        print 'opened: ', sheet_name
        for i in range(1, worksheet.nrows):
            row = worksheet.row_values(i)
            analysed_seq = ProteinAnalysis(row[1].encode('utf-8'))
            weight = analysed_seq.molecular_weight()
            lines.append('{},{},{}\n'.format(row[2], row[1].encode('utf-8'), weight))
    f.writelines(lines)
It worked fine until I added the molecular weight calculation. Now it fails with the following error:
Traceback (most recent call last):
  File "Excel_PVAdata_Parser.py", line 28, in <module>
    weight = analysed_seq.molecular_weight()
  File "/usr/lib/python2.7/dist-packages/Bio/SeqUtils/ProtParam.py", line 114, in molecular_weight
    total_weight += aa_weights[aa]
KeyError: 'J'
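I looked at the data file in Excel, and it shows that the amino acid sequences do not contain a J. Does anyone know of a BioPython package that handles "unknown amino acids", or have another suggestion?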
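This question seems very package-specific; it does not concern any standard Python module. I suggest asking on a dedicated forum or on the project's GitHub page. – Raf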
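One possible workaround, offered as a minimal sketch rather than a BioPython feature: in the IUPAC one-letter code, 'J' stands for "leucine or isoleucine". The two residues are isomers with the same mass, so replacing 'J' with 'L' should leave the molecular weight unchanged. The names `prepare_sequence` and `STANDARD_AMINO_ACIDS` below are hypothetical helpers written for this illustration; they are not part of BioPython. The helper also reports any other non-standard letters, which can reveal what a cell actually contains.

```python
# Hypothetical helper, not part of BioPython: clean a raw sequence string
# before it is handed to ProteinAnalysis.

# One-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def prepare_sequence(raw_sequence):
    """Return (cleaned_sequence, unknown_letters) for a raw protein string.

    'J' (leucine or isoleucine) is rewritten as 'L'. Both residues have
    the same mass, so the molecular weight is not affected. Any other
    non-standard letter is left in place and reported, so the caller
    can decide whether to skip or log that row.
    """
    cleaned = raw_sequence.strip().upper().replace("J", "L")
    unknown = sorted(set(cleaned) - STANDARD_AMINO_ACIDS)
    return cleaned, unknown


cleaned, unknown = prepare_sequence("MKJVJA")
# cleaned is "MKLVLA" and unknown is an empty list
```

Inside the row loop, one could call `prepare_sequence(row[1])` and pass the cleaned string to `ProteinAnalysis` only when the list of unknown letters is empty, writing the remaining rows to a log instead. Printing `repr(row[1])` for a failing row may also help confirm whether a 'J' is really present in the cell, since the KeyError suggests at least one sequence contains it.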