BeautifulSoup: how do I get the full URL from an img src like ../..?

So, let's say I'm trying to get the link to a specific image, like this:

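from bs4 import BeautifulSoup
import os
import urlparse  # Python 2 module; in Python 3 this is urllib.parse

# html is assumed to hold the markup already fetched from http://examplesite.com
soup = BeautifulSoup(html)
for image in soup.findAll("img"):
    src = image.get("src")  # raw attribute value, possibly a relative path
    srcd = urlparse.urlparse(src)
    path = srcd.path  # gets the path
    fn = os.path.basename(path)  # gets filename

# Let's say the webpage I was scraping had its images like this:
# <img src="../..someimage.jpg" />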

Is there a simple way to get the full URL, or will I have to use regular expressions?


2012-11-15 Kaonashi

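The full URL depends on the base URI, which in turn depends on the context (typically the page the markup was retrieved from, but beware of iframes and of a manually set [`base` URL tag](http://www.w3.org/TR/html-markup/base.html)). – Cameron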
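To illustrate the point about the base URI, here is a minimal sketch using `urljoin` from Python 3's `urllib.parse`. The helper name `resolve_src`, the `base_href` parameter, and all URLs are hypothetical and only serve the example. It assumes that a `<base href="...">` value, when the page has one, is first resolved against the page URL and then used as the base for the image path.

```python
from urllib.parse import urljoin


def resolve_src(page_url, src, base_href=None):
    """Resolve an img src against the document's base URI.

    base_href is the value of a <base href="..."> tag, if the page has one
    (hypothetical parameter for this sketch).
    """
    # A <base> href may itself be relative, so resolve it against the page URL.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, src)


# Hypothetical URLs, for illustration only.
page_url = "http://example.com/a/b/page.html"
without_base = resolve_src(page_url, "../img/pic.jpg")
with_base = resolve_src(
    page_url, "../img/pic.jpg", base_href="http://cdn.example.com/assets/v1/"
)
print(without_base)
print(with_base)
```

The same relative `src` ends up on a different host and path once a `<base>` tag is taken into account, which is why the page URL alone is not always enough.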

Use urlparse.urljoin:

>>> import urlparse 
>>> base_url = "http://example.com/foo/" 
>>> urlparse.urljoin(base_url, "../bar") 
'http://example.com/bar' 
>>> urlparse.urljoin(base_url, "/baz") 
'http://example.com/baz'
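Applied to the scraping loop from the question, this might look like the sketch below. It is written for Python 3, where `urljoin` lives in `urllib.parse`, and it assumes the page URL is known and the markup is already in a string. The names `page_url`, `html`, and `image_urls` and the sample markup are made up for illustration.

```python
from urllib.parse import urljoin  # Python 3 location of urlparse.urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and markup, used only for illustration.
page_url = "http://example.com/gallery/2012/index.html"
html = '<img src="../../someimage.jpg" /><img src="/static/logo.png" />'

soup = BeautifulSoup(html, "html.parser")

# Join each src with the URL of the page it came from; no regex needed.
image_urls = [
    urljoin(page_url, img["src"])
    for img in soup.find_all("img", src=True)  # skip <img> tags without a src
]
print(image_urls)
```

Both the `../..` style path and the root-relative path come out as absolute URLs on `example.com`.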


2012-11-15 18:24:01
