2014-01-07

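Downloading a file using Python, BeautifulSoup and Selenium

I want to download the first PDB file from the search results (the download link is given below). I am using Python, Selenium and BeautifulSoup. This is the code I have developed so far.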

import urllib2 
from BeautifulSoup import BeautifulSoup 
from selenium import webdriver 


uni_id = "P22216" 

# set parameters 
download_dir = "/home/home/Desktop/" 
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id 

print "url - ", url 


# opening the url 
text = urllib2.urlopen(url).read(); 

#print "text : ", text 
soup = BeautifulSoup(text); 
#print soup 
print 


table = soup.find("table", {"class":"queryBlue"}) 
#print "table : ", table 

status = 0 
rows = table.findAll('tr') 
for tr in rows: 
    try: 
        cols = tr.findAll('td') 
        if cols: 
            link = cols[1].find('a').get('href') 
            print "link : ", link 
            if link: 
                if status==1: 
                    main_url = "http://www.rcsb.org" + link 
                    print "main_url-----", main_url 
                    status = False 
                    browser.click(main_url) 
        status+=1 

    except: 
        pass 

I am getting None.
How can I download the first file in the search results (2YGV in this case)?

Download link is : /pdb/protein/P32447 
+0

Works for me. I get '/pdb/explore/explore.do?structureId=2YGV'. What is the problem? Can't you download it? – ton1c

+0

I get that too, but how do I download the file? That is my question. – sam

Answer

2

I'm not sure exactly what you want to download, but here is how to download the 2YGV file:

import urllib 
import urllib2 
from bs4 import BeautifulSoup  

uni_id = "P22216"  
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id  
text = urllib2.urlopen(url).read()  
soup = BeautifulSoup(text)  
link = soup.find("span", {"class":"iconSet-main icon-download"}).parent.get("href")  
urllib.urlretrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb") 

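This script downloads the file from the link on the page. It does not need Selenium; I used urllib to retrieve the file. You can read this post for more information on how to download files with urllib.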
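
The script builds the download URL by string concatenation and the file name with `link.split("=")[-1]`. As a minimal sketch of the same two steps in Python 3 (an assumption, since the thread uses Python 2), the standard `urllib.parse` module can resolve the relative link and read the `structureId` parameter by name. The helper names `absolute_url` and `output_filename` are hypothetical, and the href is the one quoted in the comments.

```python
from urllib.parse import urljoin, urlparse, parse_qs

BASE_URL = "http://www.rcsb.org/"  # same host the answer's script uses


def absolute_url(href):
    """Resolve a site-relative href against BASE_URL without doubling slashes."""
    return urljoin(BASE_URL, href)


def output_filename(href, param="structureId", extension=".pdb"):
    """Build a local file name from the query parameter named `param`."""
    values = parse_qs(urlparse(href).query).get(param)
    if not values:
        raise ValueError("no %r parameter in %r" % (param, href))
    return values[0] + extension


href = "/pdb/explore/explore.do?structureId=2YGV"  # link quoted in the comments
print(absolute_url(href))
print(output_filename(href))
```

In Python 3 the download itself would then be `urllib.request.urlretrieve(absolute_url(href), output_filename(href))`. Whether the site still serves these URLs is not something this sketch checks.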


Edit:

Alternatively, use this code to find the download link (it all depends on which file you want to download and from which URL):

import urllib 
import urllib2 
from bs4 import BeautifulSoup 


uni_id = "P22216" 
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id 
text = urllib2.urlopen(url).read() 
soup = BeautifulSoup(text) 
table = soup.find("table", {"class":"queryBlue"}) 
link = table.find("a", {"class":"tooltip"}).get("href") 
urllib.urlretrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb") 
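
The lookup above takes the first `<a class="tooltip">` inside the `queryBlue` table. The sketch below shows the same selection offline in Python 3, swapping BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages. The class `FirstResultLink`, the function `first_result_link` and the sample HTML are made up for illustration. Only the two class names and the 2YGV link come from the thread.

```python
from html.parser import HTMLParser


class FirstResultLink(HTMLParser):
    """Remember the href of the first <a class="tooltip"> inside <table class="queryBlue">."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # > 0 while inside the results table
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "table":
            # Start counting at the results table; count nested tables too.
            if self.table_depth or "queryBlue" in classes:
                self.table_depth += 1
        elif tag == "a" and self.table_depth and self.href is None:
            if "tooltip" in classes:
                self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1


def first_result_link(html):
    """Return the first matching href, or None when there is no result."""
    parser = FirstResultLink()
    parser.feed(html)
    return parser.href


# Invented sample page; the real page layout is not verified here.
SAMPLE = """
<a class="tooltip" href="/ignored">outside the table</a>
<table class="queryBlue">
  <tr><td>1</td><td><a class="tooltip" href="/pdb/explore/explore.do?structureId=2YGV">2YGV</a></td></tr>
  <tr><td>2</td><td><a class="tooltip" href="/pdb/explore/explore.do?structureId=SECOND">second hit</a></td></tr>
</table>
"""
print(first_result_link(SAMPLE))
```

Returning `None` when nothing matches lets the caller test for a missing result before building a file name.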

Here is an example of how to do what you asked in the comments:

import mechanize 
from bs4 import BeautifulSoup 


SEARCH_URL = "http://www.rcsb.org/pdb/home/home.do" 

l = ["YGL130W", "YDL159W", "YOR181W"] 
browser = mechanize.Browser() 

for item in l: 
    browser.open(SEARCH_URL) 
    browser.select_form(nr=0) 
    browser["q"] = item 
    html = browser.submit() 

    soup = BeautifulSoup(html) 
    table = soup.find("table", {"class":"queryBlue"}) 
    if table: 
        link = table.find("a", {"class":"tooltip"}).get("href") 
        browser.retrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb")[0] 
        print "Downloaded " + item + " as " + str(link.split("=")[-1]) + ".pdb" 
    else: 
        print item + " was not found" 

Output:

Downloaded YGL130W as 3KYH.pdb 
Downloaded YDL159W as 3FWB.pdb 
YOR181W was not found 
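
The loop above searches, downloads and reports in one place. A rough Python 3 sketch of the same per-ID logic, with the search and the download passed in as functions, can be tried without network access. The name `download_first_hits` and both callbacks are hypothetical, and `FAKE_RESULTS` is stand-in data that mirrors the output shown above and assumes the href format quoted in the comments.

```python
def download_first_hits(ids, find_link, save):
    """Save the first hit for each ID; return (downloaded, missing)."""
    downloaded = {}
    missing = []
    for item in ids:
        link = find_link(item)  # href of the first result, or None
        if link is None:
            missing.append(item)
            continue
        filename = link.split("=")[-1] + ".pdb"  # same naming rule as the answer
        save(link, filename)
        downloaded[item] = filename
    return downloaded, missing


# Stand-in data mirroring the output shown above; no request is made.
FAKE_RESULTS = {
    "YGL130W": "/pdb/explore/explore.do?structureId=3KYH",
    "YDL159W": "/pdb/explore/explore.do?structureId=3FWB",
}
saved = []  # records what a real `save` would have downloaded

downloaded, missing = download_first_hits(
    ["YGL130W", "YDL159W", "YOR181W"],
    find_link=FAKE_RESULTS.get,
    save=lambda link, filename: saved.append((link, filename)),
)
print(downloaded)
print(missing)
```

In real use, `find_link` would wrap the form submission and table lookup, and `save` would wrap the retrieve call.
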
+0

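I read and understood your code. Thank you. I have a list l = [YGL130W, YDL159W, YOR181W]. With it I have to go to http://www.rcsb.org/pdb/home/home.do, then take each ID and search for it on that site. The results page has a link to the PDB search; I have to click it before I can reach the PDB page to download, otherwise I get multiple PDBs. If there are multiple PDBs, I have to download the first PDB in the search results. – sam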

+1

Edited the answer. Hope this helps. – ton1c

+0

You are an amazing coder. Thanks. – sam