2012-11-15 164 views
1

因此,可以說,我試圖讓鏈接到一個特定的形象,就像這樣:BeautifulSoup如何從.. img src獲取網址../ ..?

from bs4 import BeautfiulSoup 
import urlparse 

soup = BeautifulSoup("http://examplesite.com") 
for image in soup.findAll("img"): 
    srcd = urlparse.urlparse(src) 
    path = srcd.path # gets the path 
    fn = os.path.basename(path) # gets filename 

# lets say the webpage i was scraping had their images like this: 
# <img src="../..someimage.jpg" /> 

有沒有簡單的方法來得到的完整網址?或者我將不得不使用正則表達式?

+0

完整網址是依賴於基URI,這是依賴於上下文的(典型地,該網頁被從檢索,但要小心iframe也手冊[的''的網址上標籤](http://www.w3.org/TR/html-markup/base.html)) – Cameron

回答

2

使用urlparse.urljoin

>>> import urlparse 
>>> base_url = "http://example.com/foo/" 
>>> urlparse.urljoin(base_url, "../bar") 
'http://example.com/bar' 
>>> urlparse.urljoin(base_url, "/baz") 
'http://example.com/baz'