import requests
from bs4 import BeautifulSoup


def widow(max_pages):
    page = 0  # craigslist starts at page 0
    while page <= max_pages:
        # craigslist search url + current page number
        url = 'http://orlando.craigslist.org/search/cto?s=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        # my computer yelled at me if 'lxml' wasn't included. your mileage may vary
        soup = BeautifulSoup(plain_text, 'lxml')
        for link in soup.find_all('a', {'class': 'hdrlnk'}):
            href = 'http://orlando.craigslist.org' + link.get('href')  # href = /cto/'number'.html
            title = link.string
            print(title, href)  # show each listing that was found
        page += 100  # craigslist pages go 0, 100, 200, etc


widow(0)  # 0 gets the first page, replace with multiples of 100 for extra pages
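If you'd rather not glue the URLs together with `+`, here is a minimal sketch using only the standard library. The helper names `build_search_url` and `absolute_link` are made up for illustration, and it assumes the same `s=` offset parameter and host as the code above. `urljoin` also copes with an `href` that is already absolute, which plain concatenation would mangle.

```python
from urllib.parse import urlencode, urljoin

BASE = 'http://orlando.craigslist.org'  # same host as in the answer


def build_search_url(offset):
    """Hypothetical helper: search URL for a given result offset (0, 100, 200, ...)."""
    return BASE + '/search/cto?' + urlencode({'s': offset})


def absolute_link(href):
    """Hypothetical helper: resolve a relative or absolute href against BASE."""
    return urljoin(BASE, href)


print(build_search_url(100))
print(absolute_link('/cto/123.html'))
```

Inside the loop you could then write `url = build_search_url(page)` and `href = absolute_link(link.get('href'))`.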
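Holy crap, wow. I feel pretty dumb right now, haha. Thanks. – v0dkuh
Isn't this only part of the solution? `page` is incremented, but `max_pages` is set to `0` in the example. After the first page, `100 <= 0` evaluates to False and the loop exits. –
The OP's comment suggests he would call `widow(0)` to get the first page. If he called `widow(1000)`, it would keep scraping for as long as `page <= 1000`. – sisanared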
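To make the loop bound discussed in these comments concrete, here is a small sketch that mirrors the `while page <= max_pages` logic without touching the network. `offsets_visited` is a hypothetical helper, not part of the answer; it just records which offsets the loop would request.

```python
def offsets_visited(max_pages, step=100):
    """Hypothetical helper: offsets the while loop in widow() would request."""
    offsets = []
    page = 0
    while page <= max_pages:  # same condition as in widow()
        offsets.append(page)
        page += step          # same increment as in widow()
    return offsets


print(offsets_visited(0))     # [0]
print(offsets_visited(250))   # [0, 100, 200]
```

So `max_pages` is really an upper bound on the result offset, not a page count: `widow(0)` fetches one page, and `widow(1000)` fetches eleven (offsets 0 through 1000).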