Web scraping and saving the result as JSON


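I want to scrape a website with BeautifulSoup, like this:

1. From the home page, get just the names of the 40 categories.
2. Then go to each category, e.g. (startupstash.com/ideageneration/), which contains a number of subcategories.
3. Now go to each subcategory, say the first one, startupstash.com/resource/milanote/, and take the content details.
4. Do the same for all 40 categories + their subcategories + the details of each subcategory.

Could someone please give me an idea of how to approach this, how to do it with BeautifulSoup, or possibly some code? This is what I have tried so far: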

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
base_url = "http://startupstash.com/"

# Home page: collect the category links and the category names.
req_home_page = requests.get(base_url, headers=headers)
soup = BeautifulSoup(req_home_page.text, "html5lib")
links_tag = soup.find_all('li', {'class': 'categories-menu-item'})
titles_tag = soup.find_all('span', {'class': 'name'})
links, titles = [], []

for link in links_tag:
    links.append(link.a.get('href'))
for title in titles_tag:
    titles.append(title.getText())
print("HOME PAGE TITLES ARE \n", titles)

# Category pages: collect the link to every resource listed in each category.
for i in range(0, len(links)):
    req_inside_page = requests.get(links[i], headers=headers)
    page_store = BeautifulSoup(req_inside_page.text, "html5lib")
    jump_to_next = page_store.find_all('div', {'class': 'company-listing more'})
    nextlinks = []
    for div in jump_to_next:
        nextlinks.append(div.a.get("href"))
    print("LINKS SCRAPED FROM THIS CATEGORY \n", nextlinks)

    # Resource pages: collect the description and the website link.
    for j in range(0, len(nextlinks)):
        req_final_page = requests.get(nextlinks[j], headers=headers)
        page_stored = BeautifulSoup(req_final_page.text, 'html5lib')
        detail_content = page_stored.find('div', {'class': 'company-page-body body'})
        details, website = [], []
        for content in detail_content:
            details.append(content.string)
        print("DESCRIPTION ABOUT THE WEBSITE \n", details)

        detail_website = page_stored.find('div', {'id': "company-page-contact-details"})
        table = detail_website.find('table')
        for tr in table.find_all('tr')[2:]:
            tds = tr.find_all('td')[1:]
            for td in tds:
                website.append(td.a.get('href'))
                print("VISIT THE WEBSITE \n", website)

Source

2017-04-26 pupu

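What exactly is your question? Please describe what you have tried and what you could not get working. Nobody is going to write the whole scraper for you. – VeGABAU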

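@VeGABAU I only need an approach for handling this whole site: from the first page I need all the category names, then I need to go into each category, and then take the details section from the third-level page. – pupu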

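OK. First, you need to add a User-Agent to your request headers to imitate a web browser (please don't abuse the site).
Then you can extract the links from the first page with this line: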

links = [ li.a.get('href') for li in soup.find_all('li', {'class':'categories-menu-item'}) ]

Then iterate over those links and, on each category page, get the links it lists:

links = [ div.a.get('href') for div in soup.find_all('div', { 'class' : 'company-listing-more' }) ]

Finally, get the content:

content = soup.find('div', { 'class' : 'company-page-body body'}).text
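The three steps above can be tied together and written out as JSON roughly as follows. This is only a sketch under assumptions: the function names (category_links, resource_links, resource_description, crawl, save_json) and the output layout are hypothetical, the class names are the ones quoted in this thread and may not match the live site, and the fetch argument is any callable that turns a URL into HTML text, so the parsing can be tried on saved pages before touching the network.

```python
import json

from bs4 import BeautifulSoup


def category_links(home_html):
    """Map each category name on the home page to its URL."""
    soup = BeautifulSoup(home_html, "html.parser")
    found = {}
    for li in soup.find_all("li", class_="categories-menu-item"):
        if li.a is None or not li.a.get("href"):
            continue  # skip menu items without a usable link
        # Prefer the <span class="name"> text; fall back to the whole item.
        span = li.find("span", class_="name")
        found[(span or li).get_text(strip=True)] = li.a["href"]
    return found


def resource_links(category_html):
    """Return the resource page URLs listed on one category page."""
    soup = BeautifulSoup(category_html, "html.parser")
    divs = soup.find_all("div", class_="company-listing-more")
    return [d.a["href"] for d in divs if d.a is not None and d.a.get("href")]


def resource_description(resource_html):
    """Return the description text of one resource page, or None if absent."""
    soup = BeautifulSoup(resource_html, "html.parser")
    # Matches any div carrying the company-page-body class.
    body = soup.find("div", class_="company-page-body")
    return body.get_text(" ", strip=True) if body is not None else None


def crawl(fetch, base_url):
    """Build {category name: {resource URL: description}} using fetch(url)."""
    result = {}
    for name, cat_url in category_links(fetch(base_url)).items():
        result[name] = {
            url: resource_description(fetch(url))
            for url in resource_links(fetch(cat_url))
        }
    return result


def save_json(data, path):
    """Write the nested result to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh, ensure_ascii=False, indent=2)


# Possible use against the real site (assumes the requests package):
#   import requests
#   fetch = lambda url: requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
#   save_json(crawl(fetch, "http://startupstash.com/"), "startupstash.json")
```

Keeping the HTTP call outside the parsing functions is a design choice made here for illustration; it also makes it easy to add a delay between requests so the site is not hammered.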

Source

2017-04-26 20:00:15

Heartfelt thanks, dear @adam – pupu

It isn't hard; all you need to do is inspect the HTML and pick the right tags. –

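@adam, I tried several more approaches and got no results when running them. I have changed the code above; if possible, please check it once and let me know what is wrong with the home page step. All I get is "finished in x seconds"... – pupu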
