1

Using Python, how would you go about scraping images and text from a website? For example, suppose I want to scrape the images and text here, what Python tools/libraries would I use? Any tutorials?

Answers

0

requests, scrapy and BeautifulSoup

Scrapy is optional, but requests is becoming the unofficial standard, and I haven't seen a better parsing tool than BS.
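
A minimal sketch of that combination for grabbing the text and image URLs from a page (untested; the URL is a placeholder):

import requests
from bs4 import BeautifulSoup

r = requests.get('http://example.com')  # placeholder URL
soup = BeautifulSoup(r.content, 'html.parser')

# all visible text of the page
text = soup.get_text(separator=' ', strip=True)

# the src attribute of every <img> tag
image_urls = [img.get('src') for img in soup.find_all('img') if img.get('src')]

print(text[:200])
print(image_urls)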

+0

BS is relatively slow and pretty buggy - even BS4 – Amalgovinus

+0

Do you know of another XML manipulation framework with this level of abstraction? If so, please share; I'd personally be interested and it would strengthen the answer to this question ;) –

+0

As the other answer above mentions, lxml is sufficient these days and can parse pages that choke BS. I'm willing to give up some abstraction and get my hands dirty with XPath moonspeak if it means the page actually gets parsed – Amalgovinus
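
For illustration, a bare lxml/XPath version of what that comment describes might look like this (a rough sketch, untested; the URL and XPath expressions are only examples):

import requests
import lxml.html

r = requests.get('http://example.com')  # placeholder URL
dom = lxml.html.fromstring(r.content)

# text nodes and image sources via XPath instead of CSS selectors
texts = dom.xpath('//p/text()')
image_urls = dom.xpath('//img/@src')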

1

Please don't use regular expressions; they are not meant for parsing HTML.

Usually I use a combination of the following tools:

  • the requests module
  • lxml.html
  • beautifulsoup4 to detect the website's encoding

One approach looks something like this; I hope you get the idea (the code only illustrates the concept and is untested):

import lxml.html
import requests
from cssselect import HTMLTranslator, SelectorError
from bs4 import UnicodeDammit

# First do the HTTP request with the requests module
r = requests.get('http://example.com')
html = r.content  # raw bytes, so UnicodeDammit can detect the encoding

# Try to parse/decode the HTML result with lxml and beautifulsoup4
try:
    doc = UnicodeDammit(html, is_html=True)
    parser = lxml.html.HTMLParser(encoding=doc.declared_html_encoding)
    dom = lxml.html.document_fromstring(html, parser=parser)
    dom.resolve_base_href()
except Exception as e:
    raise SystemExit('Some error occurred while lxml tried to parse: {}'.format(e))

# Try to extract all the data we are interested in with CSS selectors!
try:
    # 'div.content a' is only an example selector; adapt it to the target DOM
    results = dom.xpath(HTMLTranslator().css_to_xpath('div.content a'))
    for element in results:
        # access elements like
        print(element.get('href'))       # the href attribute
        print(element.text_content())    # the content as text
        # or process further
        found = element.xpath(HTMLTranslator().css_to_xpath('h3.r > a:first-child'))
except Exception as e:
    print(e)
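
Since the question also asks about images, a possible follow-up step is to download whatever an <img> selector matches. This is only a sketch that reuses dom, requests and HTMLTranslator from the example above; the base URL and the naive file naming are assumptions:

import os
from urllib.parse import urljoin, urlparse

# download every image on the page (reuses `dom` and `requests` from above)
for img in dom.xpath(HTMLTranslator().css_to_xpath('img')):
    src = img.get('src')
    if not src:
        continue
    img_url = urljoin('http://example.com', src)  # placeholder base URL to resolve relative links
    data = requests.get(img_url).content
    name = os.path.basename(urlparse(img_url).path) or 'image'
    with open(name, 'wb') as f:
        f.write(data)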