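Reconstruct absolute URLs from relative URLs on a page

Given the absolute URL of a page and a relative link found within that page, is there a way to a) definitively reconstruct, or b) make a best-effort attempt to reconstruct, the absolute URL of the relative link?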

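In my case, I'm using Beautiful Soup to read an HTML file from a given URL, stripping out all the img tag sources, and trying to construct a list of absolute URLs for the page's images.

My Python function so far looks like: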

def get_image_url(page_url, image_src):

    from urlparse import urlparse
    # parsed = urlparse('http://user:[email protected]:80/path;parameters?query=argument#fragment')
    parsed = urlparse(page_url)
    url_base = parsed.netloc
    url_path = parsed.path

    if image_src.find('http') == 0:
        # It's an absolute URL, do nothing.
        pass
    elif image_src.find('/') == 0:
        # If it's a root-relative URL, append it to the base URL:
        image_src = 'http://' + url_base + image_src
    else:
        # If it's a relative URL, ?
        pass

    return image_src

Note: I don't need a Python answer, just the logic required.

2012-03-15 Yarin

Very simple:

>>> from urlparse import urljoin 
>>> urljoin('http://mysite.com/foo/bar/x.html', '../../images/img.png') 
'http://mysite.com/images/img.png'
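
For reference, here is a minimal sketch of how urljoin treats the cases from the question. It assumes Python 3, where the function is imported from urllib.parse rather than urlparse; the page URL and image paths are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL, used only for illustration.
page_url = "http://mysite.com/foo/bar/x.html"

# Already absolute: returned unchanged.
absolute = urljoin(page_url, "http://other.example/a.png")
# Root-relative: resolved against the scheme and host only.
rooted = urljoin(page_url, "/images/img.png")
# Document-relative: resolved against the page's directory.
relative = urljoin(page_url, "img.png")
# Scheme-relative: inherits the page's scheme.
scheme_relative = urljoin(page_url, "//cdn.example/img.png")

for url in (absolute, rooted, relative, scheme_relative):
    print(url)
```

So the if/elif/else branches in the question's function should not be needed; a single urljoin call covers all of them.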

2012-03-15 11:21:41

Hey coool! (Guess I did need Python..) – Yarin 2012-03-15 11:55:09

The urlparse module was renamed to urllib.parse in Python 3. So, 'from urllib.parse import urljoin' – SparkAndShine 2015-07-21 21:44:57

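Use urllib.parse.urljoin to resolve a (possibly relative) URL against a base URL.

But the base URL of a web page is not necessarily the same as the URL you fetched the document from, because HTML allows a page to specify its preferred base URL via the BASE element. The logic you need is as follows: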

import urllib.parse

# `document` is assumed to be a DOM tree of the page (e.g. from xml.dom.minidom),
# and `page_url` the URL the document was fetched from.
base_url = page_url
head = document.getElementsByTagName('head')[0]
for base in head.getElementsByTagName('base'):
    if base.hasAttribute('href'):
        base_url = urllib.parse.urljoin(base_url, base.getAttribute('href'))
        # HTML5 4.2.3 "if there are multiple base elements with href
        # attributes, all but the first are ignored."
        break

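(If you are parsing XHTML then in theory you should take the rather hairy XML Base specification into account instead. But you can probably get away without worrying about that, since no-one really uses XHTML.)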
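
Putting the two pieces together, the following is a rough sketch of the whole task using only the Python 3 standard library instead of Beautiful Soup. The names ImageCollector and absolute_image_urls, and the sample markup, are hypothetical. As a simplification, it accepts the first base element with an href anywhere in the document rather than looking only inside head.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects img src values and the first base href encountered."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.sources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base_href is None and attrs.get("href"):
            # Only the first base element with an href counts.
            self.base_href = attrs["href"]
        elif tag == "img" and attrs.get("src"):
            self.sources.append(attrs["src"])


def absolute_image_urls(page_url, html):
    """Return absolute URLs for every img src found in the html string."""
    collector = ImageCollector()
    collector.feed(html)
    collector.close()
    base_url = page_url
    if collector.base_href is not None:
        # The base href may itself be relative to the page URL.
        base_url = urljoin(page_url, collector.base_href)
    return [urljoin(base_url, src) for src in collector.sources]


# Made-up sample page for illustration.
sample_html = (
    '<html><head><base href="/static/"></head><body>'
    '<img src="a.png"><img src="/b.png">'
    '<img src="http://cdn.example/c.png">'
    '</body></html>'
)
urls = absolute_image_urls("http://mysite.com/foo/bar/x.html", sample_html)
print(urls)
```

With Beautiful Soup the parsing half would look different, but the resolution step (base href first, then each src) should stay the same.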

2012-03-15 11:59:34

Key point - thanks – Yarin 2012-03-15 16:23:29
