
I want to learn Python 3.x so that I can scrape websites. People have suggested I use Beautiful Soup 4 or lxml.html. Can someone point me in the right direction to a tutorial or some examples of BeautifulSoup with Python 3.x? Is there a web scraping tutorial that uses Python 3?

Thanks for your help.


If you want to do web scraping, use Python 2. [Scrapy](http://doc.scrapy.org/en/latest/intro/tutorial.html) is by far the best Python web scraping framework, and it has no 3.x equivalent. – Blender
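
For orientation, since the comment recommends Scrapy, here is what a minimal spider looks like. This is only a sketch assuming a recent Scrapy release; the `ExampleSpider` class, the start URL, and the selectors are illustrative placeholders, not anything from the linked tutorial:

import scrapy


class ExampleSpider(scrapy.Spider):
    """Illustrative spider: downloads a page and yields the text and href of every link."""

    name = "example"
    # placeholder URL; replace with the site you actually want to crawl
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # response.css() runs a CSS selector against the downloaded page
        for anchor in response.css("a"):
            yield {
                "text": anchor.css("::text").get(),
                "href": anchor.attrib.get("href"),
            }

A spider like this can be run without a full project via `scrapy runspider example_spider.py -o links.json`.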

Answer


I actually just wrote a full guide on web scraping that includes some example Python code. I wrote and tested it in Python 2.7, but according to the Wall of Shame, both of the packages I used (requests and BeautifulSoup) are fully compatible with Python 3.

Here is some code to get you started with web scraping in Python:

import sys
import requests
from bs4 import BeautifulSoup


def scrape_google(keyword):

    # dynamically build the URL that we'll be making a request to
    url = "http://www.google.com/search?q={term}".format(
        term=keyword.strip().replace(" ", "+"),
    )

    # spoof some headers so the request appears to be coming from a browser, not a bot
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "accept-encoding": "gzip,deflate,sdch",
        "accept-language": "en-US,en;q=0.8",
    }

    # make the request to the search url, passing in the spoofed headers
    r = requests.get(url, headers=headers)  # assign the response to a variable r

    # check the status code of the response to make sure the request went well
    if r.status_code != 200:
        print("request denied")
        return
    else:
        print("scraping " + url)

    # convert the plaintext HTML markup into a DOM-like structure that we can search
    soup = BeautifulSoup(r.text, "html.parser")

    # each result is an <li> element with class="g"; this is our wrapper
    results = soup.find_all("li", "g")

    # iterate over each of the result wrapper elements
    for result in results:

        # the main link is an <h3> element with class="r"
        result_anchor = result.find("h3", "r").find("a")

        # print out each link in the results
        print(result_anchor.contents)


if __name__ == "__main__":

    # you can pass in a keyword to search for when you run the script
    # by default, we'll search for the "web scraping" keyword
    try:
        keyword = sys.argv[1]
    except IndexError:
        keyword = "web scraping"

    scrape_google(keyword)
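
Since the question also mentions lxml.html, here is a rough equivalent of the same parsing step using lxml instead of BeautifulSoup. This is just a sketch: the `scrape_google_lxml` name is mine, and the XPath expressions mirror the `li.g` / `h3.r` page structure assumed above, which Google may have changed since:

import lxml.html
import requests


def scrape_google_lxml(keyword):
    url = "http://www.google.com/search?q=" + keyword.strip().replace(" ", "+")
    r = requests.get(url, headers={"user-agent": "Mozilla/5.0"})
    if r.status_code != 200:
        print("request denied")
        return

    # parse the HTML into an element tree that we can query with XPath
    doc = lxml.html.fromstring(r.text)

    # same structure as above: each result is <li class="g">, the link sits inside <h3 class="r">
    for anchor in doc.xpath('//li[@class="g"]//h3[@class="r"]/a'):
        print(anchor.text_content(), anchor.get("href"))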

If you just want to learn more about Python 3 in general and are already familiar with Python 2.x, then this article on making the transition from Python 2 to Python 3 may be helpful.