2017-03-06

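Unable to append base URL to create absolute links with BeautifulSoup (Python 3)

I get a list of links in the output file, but I need all of the links to be absolute. Some are absolute and some are relative. How can I prepend the base URL to the relative ones so that the CSV output contains only absolute links?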

I get all of the links back, but not all of them are absolute, e.g. /subpage instead of http://page.com/subpage

    from bs4 import BeautifulSoup
    import requests
    import csv

    j = requests.get("http://cnn.com").content
    soup = BeautifulSoup(j, "lxml")

    # Only return links to subpages, i.e. <a> tags that contain an href
    data = []
    for url in soup.find_all('a', href=True):
        print(url['href'])
        data.append(url['href'])

    print(data)

    # Write one link per row; each link is wrapped in a list so that
    # csv does not split the string into single characters
    with open("file.csv", 'w', newline='') as csvfile:
        write = csv.writer(csvfile, delimiter=' ')
        write.writerows([link] for link in data)

    # Remove duplicate lines from the file
    with open('file.csv', 'r') as f:
        content_set = set(f.readlines())

    with open('file.csv', 'w') as cleandata:
        for line in content_set:
            cleandata.write(line)

Answers


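Use urljoin to resolve each link against the base URL. Links that are already absolute are returned unchanged.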

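    from urllib.parse import urljoin  # Python 3; in Python 2 this was "from urlparse import urljoin"
    ...
    base_url = "http://cnn.com"
    # relative_url stands for each href collected from the page, e.g. "/subpage"
    absolute_url = urljoin(base_url, relative_url)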
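
To show how this could fit into the loop from the question, here is a minimal sketch. It does not fetch a page: the sample_hrefs list is made up and stands in for the url['href'] values, the absolutize helper is a hypothetical name, and an in-memory buffer is used in place of file.csv. It also assumes that dropping duplicates while collecting is acceptable, instead of rewriting the file afterwards.

```python
import csv
import io
from urllib.parse import urljoin


def absolutize(base_url, hrefs):
    """Resolve every href against base_url, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        # urljoin leaves absolute URLs untouched and resolves relative ones
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Made-up hrefs standing in for the url['href'] values in the question
sample_hrefs = [
    "/subpage",
    "http://page.com/subpage",
    "https://other.example/x",
    "about.html",
]
links = absolutize("http://page.com", sample_hrefs)

# Write one link per row; the StringIO buffer stands in for file.csv
buffer = io.StringIO()
csv.writer(buffer).writerows([link] for link in links)
print(buffer.getvalue())
```

In the original script the same idea would probably amount to calling urljoin(base_url, url['href']) inside the for loop before appending to data.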