2016-12-05

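Extracting data from an inconsistent HTML page using BeautifulSoup4 and Python

I am trying to extract data from this webpage, and I am running into trouble because of inconsistencies in the page's HTML formatting. I have a list of OGAP IDs, and for each one I want to extract the gene name and any literature information (PMID #). Thanks to other questions here and the BeautifulSoup documentation, I can consistently get the gene name for each ID, but I am having trouble with the literature section. Below are two search terms that highlight the inconsistency.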

HTML sample that works

Search term: OG00131

<tr>
    <td colspan="4" bgcolor="#FBFFCC" class="STYLE28">Literature describing O-GlcNAcylation:
    <br>&nbsp;&nbsp;PMID:
    <a href="http://www.ncbi.nlm.nih.gov/pubmed/20068230">20068230</a>
    [CAD, ETD MS/MS]; <br>
    <br>
    </td>
</tr>

HTML sample that does not work

Search term: OG00020

<td align="top" bgcolor="#FBFFCC">
    <div class="STYLE28">Literature describing O-GlcNAcylation: </div>
    <div class="STYLE28">
    <div class="STYLE28">PMID:
     <a href="http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation">16408927</a>
     [Azide-tag, nano-HPLC/tandem MS]
    </div>
    <br>
    Site has not yet been determined. Use
    <a href="parser2.cgi?ACLY_HUMAN" target="_blank">OGlcNAcScan</a>
    to predict the O-GlcNAc site. </div>
</td>

Here is my code so far

import urllib2 
from bs4 import BeautifulSoup 

#define list of genes 

#initialize variables 
gene_list = [] 
literature = [] 
# Test list 
gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834","OG00852", "OG00131","OG00020"] 


for i in range(len(gene_listID)): 
    print gene_listID[i] 
    # Specifies URL, uses the "%" to sub in different ogapIDs based on a list provided 
    dbOGAP = "https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield=%s&select=Any" % gene_listID[i] 
    # Opens the URL as a page 
    page = urllib2.urlopen(dbOGAP) 
    # Reads the page and parses it through "lxml" format 
    soup = BeautifulSoup(page, "lxml") 

    gene_name = soup.find("td", text="Gene Name").find_next_sibling("td").text 
    print gene_name[1:] 
    gene_list.append(gene_name[1:]) 

    # PubMed IDs are located near the <td> tag with the term "Data and Source" 
    pmid = soup.find("span", text="Data and Source") 

    # Based on inspection of the website, need to move up to the parent <td> tag 
    pmid_p = pmid.parent 

    # Then we move to the next <td> tag, denoted as sibling (since they share parent <tr> (Table row) tag) 
    pmid_s = pmid_p.next_sibling 
    #for child in pmid_s.descendants: 
    # print child 
    # Now we search down the tree to find the next table data (<td>) tag 
    pmid_c = pmid_s.find("td") 
    temp_lit = [] 
    # Next we print the text of the data 
    #print pmid_c.text 
    if "No literature is available" in pmid_c.text:
        temp_lit.append("No literature is available")
        print "Not available"
    else:
        # and then print out a list of urls for each pubmed ID we have
        print "The following is available"
        for link in pmid_c.find_all('a'):
            # the <a> tag includes more than just the link address.
            # for each <a> tag found, print the address (href attribute) and extra bits
            # link.string provides the string that appears to be hyperlinked.
            # In this case, it is the pubmedID
            print link.string
            temp_lit.append("PMID: " + link.string + " URL: " + link.get('href'))
    literature.append(temp_lit) 
    print "\n" 

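So it seems the <div> element is what throws the code for a loop. Is there a way to search for any element containing the text "PMID" and return the text that follows it (and the URL, if there is a PMID number)? If not, should I check each child, looking for the text I am interested in?

I am using Python 2.7.10.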

Answers

import re

import requests
from bs4 import BeautifulSoup

gene_listID = ["OG00894", "OG00980", "OG00769", "OG00834", "OG00852", "OG00131", "OG00020"]
urls = ('https://wangj27.u.hpc.mssm.edu/cgi-bin/DB_tb.cgi?textfield={}&select=Any'.format(i) for i in gene_listID)

# PubMed links end in a numeric ID; the dots are escaped so they match literally
regex = re.compile(r'http://www\.ncbi\.nlm\.nih\.gov/pubmed/\d+')

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')

    # First <a> tag whose href points at PubMed (None if the page has no such link)
    a_tag = soup.find('a', href=regex)
    # Accept the link only if the element just before it mentions 'PMID'
    has_pmid = a_tag is not None and 'PMID' in a_tag.previous_element

    if has_pmid:
        print(a_tag.text, a_tag.next_sibling, a_tag.get("href"))
    else:
        print("Not available")

Output:

18984734 [GalNAz-Biotin tagging, CAD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/18984734 
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230 
20068230 [CAD, ETD MS/MS]; http://www.ncbi.nlm.nih.gov/pubmed/20068230 
Not available 
16408927 [Azide-tag, nano-HPLC/tandem MS]; http://www.ncbi.nlm.nih.gov/pubmed/16408927 
Not available 
16408927 [Azide-tag, nano-HPLC/tandem MS] http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation 

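Find the first a tag whose href matches the target URL (one ending in digits), then check whether 'PMID' is in the element just before it. This site is so inconsistent that I had to try many times; I hope this helps.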
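
Since the answer's code reports only the first PubMed link on each page, here is a minimal sketch of collecting every link with `find_all` instead. It is an illustration under assumptions, not a tested scraper for the live site: the names `extract_pmids`, `PUBMED_HREF`, `SAMPLE_TD` and `SAMPLE_DIV` are hypothetical, the two samples are trimmed copies of the HTML quoted in the question, the pattern is assumed to also cover `https` links, and the built-in `html.parser` is used in place of `lxml` so the snippet needs only `bs4`.

```python
import re

from bs4 import BeautifulSoup

# Assumed pattern: PubMed links over http or https, with or without a query string.
PUBMED_HREF = re.compile(r"^https?://www\.ncbi\.nlm\.nih\.gov/pubmed/(\d+)")


def extract_pmids(html):
    """Return a list of (pmid, method_note, url) tuples, one per PubMed link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Matching on the href alone ignores whether the link sits in a <td> or a <div>.
    for a_tag in soup.find_all("a", href=PUBMED_HREF):
        url = a_tag["href"]
        pmid = PUBMED_HREF.match(url).group(1)
        # The method note, e.g. "[CAD, ETD MS/MS];", is the text node right after the link.
        note = a_tag.next_sibling
        note = note.strip().rstrip(";") if isinstance(note, str) else ""
        results.append((pmid, note, url))
    return results


# Trimmed copies of the two layouts quoted in the question.
SAMPLE_TD = """<tr><td colspan="4" class="STYLE28">Literature describing O-GlcNAcylation:
<br>&nbsp;&nbsp;PMID:
<a href="http://www.ncbi.nlm.nih.gov/pubmed/20068230">20068230</a>
[CAD, ETD MS/MS]; <br><br></td></tr>"""

SAMPLE_DIV = """<td><div class="STYLE28">Literature describing O-GlcNAcylation: </div>
<div class="STYLE28"><div class="STYLE28">PMID:
<a href="http://www.ncbi.nlm.nih.gov/pubmed/16408927?dopt=Citation">16408927</a>
[Azide-tag, nano-HPLC/tandem MS]
</div><br>Site has not yet been determined.</div></td>"""

if __name__ == "__main__":
    for sample in (SAMPLE_TD, SAMPLE_DIV):
        # An empty list means the page had no PubMed link.
        print(extract_pmids(sample) or "Not available")
```

To use it against the real pages, `extract_pmids(r.text)` could be called inside the request loop. An empty list would then stand in for the "Not available" case, though the live markup may contain layouts that these two samples do not cover.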

Hey, thanks for the help. I should be able to play around with this and see whether I can get all the literature using this approach.
