1

Using Python, how would you go about scraping images and text from a website? For example, suppose I want to scrape the images and text here, what Python tools/libraries would I use? Any tutorials?

Answers

0

requests, scrapy and BeautifulSoup

Scrapy is optional, but requests is becoming the unofficial standard, and I haven't seen a better parsing tool than BS.
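
A minimal sketch of that combination for grabbing the text and image URLs from a page (untested; the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

r = requests.get('http://example.com')  # placeholder URL
soup = BeautifulSoup(r.content, 'html.parser')

# all visible text of the page
text = soup.get_text(separator=' ', strip=True)

# the src attribute of every <img> tag
image_urls = [img.get('src') for img in soup.find_all('img') if img.get('src')]

print(text[:200])
print(image_urls)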

+0

BS is relatively slow and pretty buggy - even BS4 – Amalgovinus

+0

Do you know of another XML manipulation framework with this level of abstraction? If so, please share; I'd personally be interested and it would strengthen the answer to this question ;) –

+0

As the other answer above mentions, lxml is sufficient these days and can parse pages that choke BS. I'm willing to give up some abstraction and get my hands dirty with XPath moonspeak if it means the page actually gets parsed – Amalgovinus
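
For illustration, a bare lxml/XPath version of what that comment describes might look like this (a rough sketch, untested; the URL and XPath expressions are only examples):

import requests
import lxml.html

r = requests.get('http://example.com')  # placeholder URL
dom = lxml.html.fromstring(r.content)

# text nodes and image sources via XPath instead of CSS selectors
texts = dom.xpath('//p/text()')
image_urls = dom.xpath('//img/@src')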

1

Please don't use regular expressions; they are not meant for parsing HTML.

Usually I use a combination of the following tools:

  • the requests module
  • lxml.html
  • beautifulsoup4 to detect the website's encoding

One approach looks something like this; I hope you get the idea (the code only illustrates the concept and is untested):

import lxml.html
import requests
from cssselect import HTMLTranslator, SelectorError
from bs4 import UnicodeDammit

# First do the HTTP request with the requests module
r = requests.get('http://example.com')
html = r.content  # raw bytes, so UnicodeDammit can detect the encoding

# Try to parse/decode the HTML result with lxml and beautifulsoup4
try:
    doc = UnicodeDammit(html, is_html=True)
    parser = lxml.html.HTMLParser(encoding=doc.declared_html_encoding)
    dom = lxml.html.document_fromstring(html, parser=parser)
    dom.resolve_base_href()
except Exception as e:
    raise SystemExit('Some error occurred while lxml tried to parse: {}'.format(e))

# Try to extract all the data we are interested in with CSS selectors!
try:
    # 'div.content a' is only an example selector; adapt it to the target DOM
    results = dom.xpath(HTMLTranslator().css_to_xpath('div.content a'))
    for element in results:
        # access elements like
        print(element.get('href'))       # the href attribute
        print(element.text_content())    # the content as text
        # or process further
        found = element.xpath(HTMLTranslator().css_to_xpath('h3.r > a:first-child'))
except Exception as e:
    print(e)
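
Since the question also asks about images, a possible follow-up step is to download whatever an <img> selector matches. This is only a sketch that reuses dom, requests and HTMLTranslator from the example above; the base URL and the naive file naming are assumptions:

import os
from urllib.parse import urljoin, urlparse

# download every image on the page (reuses `dom` and `requests` from above)
for img in dom.xpath(HTMLTranslator().css_to_xpath('img')):
    src = img.get('src')
    if not src:
        continue
    img_url = urljoin('http://example.com', src)  # placeholder base URL to resolve relative links
    data = requests.get(img_url).content
    name = os.path.basename(urlparse(img_url).path) or 'image'
    with open(name, 'wb') as f:
        f.write(data)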