Scraping all pages of a website with Python + Beautiful Soup 4

-4

I'm very new to Python and I'm trying to extract data from a website, but I need the data from all of its pages. So far I have:

import requests
from bs4 import BeautifulSoup

# Fetch one day's page and parse it
r = requests.get("http://www.somesite.com/records/08-jan-2016/")
soup = BeautifulSoup(r.content, "html.parser")
full_info = soup.find_all("div", {"class": "col-sm-10"})

# Print the text of every matching div
for item in full_info:
    print(item.text)

This code prints the data from the current page. How can I fetch the data from all pages and export it to a file?

Regards

Source

2016-03-29 user1385619

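Define "all pages". Are they links that can be followed recursively (i.e., could you fetch them with 'wget -r')? Are they different URLs? Do they link to each other? How would you normally get to the links? You seem to have the BeautifulSoup part down. You can use 'open' to write to a file. – Kupiakos

Thanks for the reply. The URLs are formatted by date: "http://www.somesite.com/records/08-jan-2016/", "http://www.somesite.com/records/09-jan-2016/", "http://www.somesite.com/records/10-jan-2016/" and so on up to today. At the end of each page there is a button for the previous date and one for the next date. – user1385619

How do you know which dates are valid? Are you just assuming every date exists, or do you have a list? – Kupiakos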

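Personally, I would use the datetime library for the date arithmetic; that is what it is designed for. However, because datetime's strftime is locale-dependent, it is safer to build the string manually unless you plan to run this under a known locale that matches the website.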

import datetime

MONTH_NAMES = {1: 'jan', 2: 'feb', 3: 'mar'}  # and so on
ONE_DAY = datetime.timedelta(1)

def date_strings(first_date, last_date):
    current_date = first_date
    while current_date <= last_date:
        yield '{0.day:02}-{1}-{0.year:04}'.format(
            current_date, MONTH_NAMES[current_date.month])
        # If running on a US locale, you can just use:
        # yield current_date.strftime('%d-%b-%Y').lower()
        current_date += ONE_DAY

first_date = datetime.date(2016, 1, 8)
last_date = datetime.date(2016, 3, 29)

for date_string in date_strings(first_date, last_date):
    print(date_string)
    # Do whatever scraping you need using date_string
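
To connect this to the original question, here is a rough sketch of how the date generator might be combined with the scraping code and a file export. The names BASE_URL, fetch_records and scrape_to_file are made up for illustration, the tab-separated output format is just one possible choice, and it assumes that a missing date simply returns a non-200 status. The `div.col-sm-10` selector and the URL pattern are taken from the question.

```python
import datetime

# URL pattern taken from the question; the date string fills the placeholder.
BASE_URL = "http://www.somesite.com/records/{}/"
MONTH_NAMES = {1: "jan", 2: "feb", 3: "mar"}  # extend for the months you need
ONE_DAY = datetime.timedelta(days=1)


def date_strings(first_date, last_date):
    """Yield strings such as '08-jan-2016' for every day in the range."""
    current_date = first_date
    while current_date <= last_date:
        yield "{0.day:02}-{1}-{0.year:04}".format(
            current_date, MONTH_NAMES[current_date.month])
        current_date += ONE_DAY


def fetch_records(url):
    """Hypothetical helper: return the text of each div.col-sm-10 on a page."""
    # Imported here so the rest of the sketch can be tried without these
    # third-party packages installed.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        # Assumption: dates without a page answer with a non-200 status.
        return []
    soup = BeautifulSoup(response.content, "html.parser")
    # Collapse internal whitespace so each record fits on one output line.
    return [" ".join(div.get_text().split())
            for div in soup.find_all("div", {"class": "col-sm-10"})]


def scrape_to_file(first_date, last_date, out_path, fetch=fetch_records):
    """Write one tab-separated line (date, record text) per scraped record."""
    with open(out_path, "w", encoding="utf-8") as out:
        for date_string in date_strings(first_date, last_date):
            for text in fetch(BASE_URL.format(date_string)):
                out.write(date_string + "\t" + text + "\n")
```

Calling `scrape_to_file(datetime.date(2016, 1, 8), datetime.date(2016, 3, 29), "records.txt")` would then request one page per day and collect everything in records.txt. The `fetch` parameter is only there so the page-fetching step can be swapped out, for example when trying the loop without network access.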

Source

2016-03-29 07:08:21 Kupiakos

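To add to the comments on how to iterate over multiple dates: I'm not the most skilled programmer, but I would create a dictionary of key: value => month: number of days in that month. Then you can use a nested loop to build the strings to append to the URL.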

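import requests

# Number of days per month in 2016 (a leap year)
dates = {"jan": 31, "feb": 29, "mar": 31}
for month in dates:
    for day in range(1, dates[month] + 1):
        # Zero-pad the day to match URLs such as 08-jan-2016
        url = "https://www.somepage.com/{0:02d}-{1}-2016".format(day, month)
        req = requests.get(url)
        ...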
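
As a possible variation on the dictionary idea, the day counts do not have to be hard-coded: the standard library's calendar.monthrange can supply them, which also takes care of leap years. This is only a sketch; URL_TEMPLATE and urls_for are hypothetical names, the month abbreviations cover only the months shown in this thread, and the zero-padded day is an assumption based on the URLs in the question.

```python
import calendar

# Only the months visible in the thread are listed; extend as needed.
MONTH_NAMES = {1: "jan", 2: "feb", 3: "mar"}
# Host taken from the snippet above; zero-padded day is an assumption.
URL_TEMPLATE = "https://www.somepage.com/{day:02d}-{month}-{year}"


def urls_for(year, month_names=MONTH_NAMES):
    """Yield one URL per calendar day of the listed months in `year`."""
    for month_number, month_name in sorted(month_names.items()):
        # monthrange returns (weekday of the 1st, number of days in the month)
        days_in_month = calendar.monthrange(year, month_number)[1]
        for day in range(1, days_in_month + 1):
            yield URL_TEMPLATE.format(day=day, month=month_name, year=year)
```

Iterating over `urls_for(2016)` yields every day from 1 January to 31 March 2016, including 29 February, and each URL can then be passed to requests.get as in the loop above.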

Source

2016-03-29 05:09:03 Tony
