
So I want to go to http://www.medhelp.org/forums/list, where there are many links to different diseases. Inside each of those links there are several pages, and each page has some of the links I want to extract from the site.

I want to collect those links, so I used this code:

import urllib.request
from bs4 import BeautifulSoup as bs

myArray = []
html_page = urllib.request.urlopen("http://www.medhelp.org/forums/list")
soup = bs(html_page, "html.parser")
temp = soup.findAll('div', attrs={'class': 'forums_link'})
for div in temp:
    myArray.append('http://www.medhelp.org' + div.a['href'])

myArray_for_questions = []
myPages = []

# This loop goes over every link on the main page -- in this case, every disease.
for link in myArray:

    # "link" is the URL of one disease forum listed on the main page.
    html_page = urllib.request.urlopen(link)
    soup1 = bs(html_page, "html.parser")

    # Collect the question links on the first page of this forum.
    temp = soup1.findAll('div', attrs={'class': 'subject_summary'})
    for div in temp:
        myArray_for_questions.append('http://www.medhelp.org' + div.a['href'])

    # Now collect the URLs of all the following pages of this forum,
    # treating "pages" as a growing frontier of pagination links.
    pages = soup1.findAll('a', href=True, attrs={'class': 'page_nav'})
    for l in pages:
        html_page_t = urllib.request.urlopen('http://www.medhelp.org' + l.get('href'))
        soup_t = bs(html_page_t, "html.parser")
        other_pages = soup_t.findAll('a', href=True, attrs={'class': 'page_nav'})
        for p in other_pages:
            mystr = 'http://www.medhelp.org' + p.get('href')
            if mystr not in myPages:
                myPages.append(mystr)
            if p not in pages:
                pages.append(p)

    # Fetch every page collected so far and pull the question links from it.
    # (Note that myPages keeps growing across iterations of the outer loop,
    # so pages found for earlier diseases are fetched again on every pass.)
    for page in myPages:
        html_page1 = urllib.request.urlopen(page)
        soup2 = bs(html_page1, "html.parser")
        temp = soup2.findAll('div', attrs={'class': 'subject_summary'})
        for div in temp:
            myArray_for_questions.append('http://www.medhelp.org' + div.a['href'])

But it takes forever to get all the links I want from all the pages. Any ideas?
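One thing I thought about is fetching the pages concurrently instead of one at a time. A minimal sketch of what I mean, using Python's standard concurrent.futures (the get_question_links helper, the placeholder input, and the max_workers value are just illustrative, not part of the code above):

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor
    from bs4 import BeautifulSoup as bs

    def get_question_links(url):
        # Fetch one page and return the question links found on it.
        soup = bs(urllib.request.urlopen(url), "html.parser")
        return ['http://www.medhelp.org' + div.a['href']
                for div in soup.findAll('div', attrs={'class': 'subject_summary'})]

    # myPages would be the list of page URLs collected by the crawler above.
    myPages = ['http://www.medhelp.org/forums/list']  # placeholder

    with ThreadPoolExecutor(max_workers=10) as pool:
        # map() runs get_question_links on the pages in parallel,
        # returning the results in the same order as the inputs.
        results = pool.map(get_question_links, myPages)

    myArray_for_questions = [url for links in results for url in links]

Since each request spends most of its time waiting on the network, running them in a pool should cut the total wall-clock time roughly in proportion to the number of workers.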

Thanks


This is too broad. Please show us what you have tried so far, and narrow down your question. – rowana


When asking a question, you generally want to include the code you are trying to get working and have a question about, or you should ask for help understanding code you found while researching the topic (with example snippets). – gavsta707


I hadn't started yet at that point. I just want to write a special-purpose web crawler, and I think it is worth doing because there are so many questions in this forum and I want to save all of them, for all of these diseases, in a file. – Sanaz
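As a minimal sketch of the "save everything to a file" step mentioned here (assuming myArray_for_questions is the list built by the crawler above; a placeholder is used so the snippet runs on its own):

    # myArray_for_questions is the list of URLs built by the crawler above.
    myArray_for_questions = ['http://www.medhelp.org/posts/example']  # placeholder

    # Write the collected question links to a text file, one URL per line.
    with open('medhelp_questions.txt', 'w') as f:
        for url in myArray_for_questions:
            f.write(url + '\n')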

Answer