Extracting FASTA moonlighting protein sequences with Python

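I want to use Python to extract the FASTA files containing the amino acid sequences from the Moonlighting Protein Database (www.moonlightingproteins.org/results.php?search_text=). Since it is an iterative process, I would rather learn how to program it than do it by hand, b/c come on, we are in 2016. The problem is that I don't know how to write the code, because I'm a rookie programmer :(. The basic pseudocode would be: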

for protein_name in site: www.moonlightingproteins.org/results.php?search_text=: 

     go to the uniprot option 

     download the fasta file 

     store it in a .txt file inside a given folder

Thanks in advance.

Source

2016-09-20 Manolo Flores

I suggest googling "introduction to web scraping with Python" or similar terms and experimenting a bit. Right now your question is too abstract. – Swier

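I strongly recommend asking the authors of the database first! From the site:

I would like to use the MoonProt Database in a project to analyze the amino acid sequences or structures using bioinformatics.

Please contact us at [email protected] if you are interested in using the MoonProt database for analysis of sequences and/or structures of moonlighting proteins.

Suppose you find something interesting: how would you cite it in your paper or thesis? "The sequences were scraped from a public web page without the authors' consent." It is better to give credit to the original researchers.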

Here is a good introduction to scraping.

But, getting back to your original question.

import requests
from lxml import html

# Download one protein at a time; change 3 to any other id
page = requests.get('http://www.moonlightingproteins.org/detail.php?id=3')
# Convert the HTML document into something we can parse in Python
tree = html.fromstring(page.content)
# Get all table cells
cells = tree.xpath('//td')

for i, cell in enumerate(cells):
    # If the cell text looks like a FASTA sequence, print it
    if cell.text and cell.text.startswith('>'):
        print(cell.text)
    # If a table cell mentions UniProt, print the link from the next cell
    # (the bounds check avoids an IndexError on the last cell)
    if 'UniProt' in cell.text_content() and i + 1 < len(cells):
        link = cells[i + 1].find('a')
        if link is not None and 'href' in link.attrib:
            print(link.attrib['href'])
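
The code above only prints what it finds. To cover the remaining steps of the pseudocode in the question (going to UniProt and storing the FASTA in a .txt file inside a given folder), a few helpers along these lines might do. This is only a rough sketch using the standard library: the names `accession_from_link`, `uniprot_fasta_url` and `save_fasta` are made up for illustration, it assumes the UniProt link ends in the accession (not verified against the MoonProt pages), and it assumes UniProt's current REST URL pattern for FASTA downloads.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def accession_from_link(href):
    """Return the last path segment of a UniProt link.

    Assumption: the link ends in the accession, e.g. .../uniprot/P12345
    """
    path = urlparse(href).path.rstrip('/')
    return path.rsplit('/', 1)[-1]


def uniprot_fasta_url(accession):
    """Build the UniProt REST URL that serves one entry as FASTA."""
    return 'https://rest.uniprot.org/uniprotkb/{}.fasta'.format(accession)


def save_fasta(fasta_text, folder):
    """Write one FASTA record to <folder>/<name>.txt and return the path."""
    fasta_text = fasta_text.strip()
    if not fasta_text.startswith('>'):
        raise ValueError('not a FASTA record: header must start with ">"')
    # The header is the first line without the leading ">"
    header = fasta_text.splitlines()[0][1:]
    # Keep only characters that are safe in a file name
    name = re.sub(r'[^A-Za-z0-9_.-]+', '_', header).strip('_')[:80] or 'unnamed'
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / (name + '.txt')
    out_path.write_text(fasta_text + '\n', encoding='utf-8')
    return out_path
```

In the loop above, `print(cell.text)` could then become `save_fasta(cell.text, 'moonprot_fasta')` (the folder name is arbitrary), and the printed UniProt link could be passed through `accession_from_link` and `uniprot_fasta_url` before fetching the result with `requests.get`. Wrapping the whole thing in a loop over the id values, with a short pause between requests, would give the iteration asked for in the question.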

Source

2016-09-20 21:17:04
