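I want to learn Python 3.x so that I can scrape websites. People have recommended that I use Beautiful Soup 4 or lxml.html. Can someone point me in the right direction for a tutorial or examples of BeautifulSoup with Python 3.x? Is there a web scraping tutorial that uses Python 3?
Thanks for your help.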
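I actually just wrote a full guide on web scraping that includes some Python sample code. I wrote and tested it on Python 2.7, but according to the Wall of Shame, both of the packages I used (requests and BeautifulSoup) are fully compatible with Python 3.
Here's some code to help you get started with web scraping in Python: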
import sys

import requests
from bs4 import BeautifulSoup  # BeautifulSoup 4 is imported from the bs4 package


def scrape_google(keyword):
    # dynamically build the URL that we'll be making a request to
    url = "http://www.google.com/search?q={term}".format(
        term=keyword.strip().replace(" ", "+"),
    )

    # spoof some headers so the request appears to be coming from a browser, not a bot
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "accept-encoding": "gzip,deflate,sdch",
        "accept-language": "en-US,en;q=0.8",
    }

    # make the request to the search url, passing in the spoofed headers
    r = requests.get(url, headers=headers)  # assign the response to a variable r

    # check the status code of the response to make sure the request went well
    if r.status_code != 200:
        print("request denied")
        return
    else:
        print("scraping " + url)

    # convert the plaintext HTML markup into a DOM-like structure that we can search
    soup = BeautifulSoup(r.text, "html.parser")

    # each result is an <li> element with class="g"; this is our wrapper
    results = soup.find_all("li", "g")

    # iterate over each of the result wrapper elements
    for result in results:
        # the main link is an <h3> element with class="r"
        heading = result.find("h3", "r")
        result_anchor = heading.find("a") if heading else None
        if result_anchor is None:
            continue  # skip wrappers that lack the expected heading link

        # print out each link in the results
        print(result_anchor.contents)


if __name__ == "__main__":
    # you can pass in a keyword to search for when you run the script
    # by default, we'll search for the "web scraping" keyword
    try:
        keyword = sys.argv[1]
    except IndexError:
        keyword = "web scraping"
    scrape_google(keyword)
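
To try the URL-building and result-extraction steps without network access or third-party packages, here is a rough sketch that swaps in the standard library's `html.parser` and `urllib.parse` in place of BeautifulSoup and manual string replacement. The names `build_search_url`, `ResultLinkParser`, `extract_titles`, and `SAMPLE_HTML` are made up for illustration, and the sample markup only mimics the `<li class="g">` / `<h3 class="r">` structure that the script above assumes; it is not real search output.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_search_url(keyword):
    # urlencode escapes spaces and special characters in the query string
    return "http://www.google.com/search?" + urlencode({"q": keyword.strip()})


class ResultLinkParser(HTMLParser):
    """Collect the text of every <a> nested inside an <h3 class="r">."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.in_anchor = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "r" in classes:
            self.in_heading = True
        elif tag == "a" and self.in_heading:
            self.in_anchor = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_heading = False
        elif tag == "a":
            self.in_anchor = False

    def handle_data(self, data):
        # only keep text that sits inside a result link
        if self.in_anchor:
            self.titles[-1] += data


# hypothetical markup, shaped like the structure the script above expects
SAMPLE_HTML = """
<ol>
  <li class="g"><h3 class="r"><a href="/a">First result</a></h3></li>
  <li class="g"><h3 class="r"><a href="/b">Second result</a></h3></li>
  <li class="other"><a href="/c">Not a result</a></li>
</ol>
"""


def extract_titles(html):
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.titles


if __name__ == "__main__":
    print(build_search_url("web scraping"))
    print(extract_titles(SAMPLE_HTML))
```

For real pages, BeautifulSoup or lxml.html is usually more convenient than a hand-written parser like this one; the sketch is only meant to show the same two steps in isolation.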
If you just want to learn more about Python 3 in general and are already familiar with Python 2.x, then this article on transitioning from Python 2 to Python 3 might be helpful.
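
As a rough illustration of what such a transition involves for scraping code, the sketch below lists a few well-known Python 2 to Python 3 differences. These are not taken from the linked article, and the function name `describe` and the example values are made up for illustration. Nothing here touches the network; the `Request` object is only constructed, never sent.

```python
import urllib.parse
import urllib.request


# Python 2: print "scraping", url   ->  Python 3: print() is a function
def describe(url):
    return "scraping {}".format(url)


# Python 2: urllib.quote_plus(...)  ->  Python 3: urllib.parse.quote_plus(...)
query = urllib.parse.quote_plus("web scraping")

# Python 2: urllib2.Request / urllib2.urlopen  ->  Python 3: urllib.request
request = urllib.request.Request(
    "http://www.google.com/search?q=" + query,
    headers={"User-Agent": "Mozilla/5.0"},
)

# Python 3 keeps bytes (raw response bodies) separate from str (decoded text)
raw = "caf\u00e9".encode("utf-8")
text = raw.decode("utf-8")

# Python 3: / is true division, // is floor division
half = 7 / 2
whole = 7 // 2

if __name__ == "__main__":
    print(describe(request.full_url))
```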
If you want to do web scraping, use Python 2. [Scrapy](http://doc.scrapy.org/en/latest/intro/tutorial.html) is by far the best Python web scraping framework, and there is no 3.x equivalent. – Blender