How can I get the full link from BeautifulSoup instead of only the internal link?

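I am new to Python. I am building a crawler for the company I work for. When crawling its website, I found internal links that are not written as full links. How can I get the entire link instead of only the relative path? If this is not clear, please run the code I wrote: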

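import urllib2
from bs4 import BeautifulSoup

web_page_string = []  # never filled in this snippet

def get_first_page(seed):
    # Download the page and parse it
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page, 'html.parser')
    # Print every href exactly as written in the HTML (relative links stay relative)
    for link in soup.find_all('a'):
        print link.get('href')
    print soup


# The function returns nothing, so call it directly instead of printing its result
get_first_page('http://www.fashionroom.com.br')
print web_page_string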


2015-04-05 michelfashionroom

What do you mean by the whole link? – 2015-04-05 14:38:00

print seed + '/' + link.get('href') ? – Selcuk 2015-04-05 14:38:22

In the example above I want to get http://www.fashionroom.com.br/indexnew.html. Instead, I only get indexnew.html – michelfashionroom 2015-04-05 14:41:20
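
As a minimal sketch of one way to get that result, the standard library's urljoin can resolve a relative href against the page it was found on. This is not something proposed in the thread; the variable names seed and full_link are only illustrative, and the import fallback is there because the question uses Python 2.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, the version used in the question

# The seed URL and the relative href are both taken from the question
seed = 'http://www.fashionroom.com.br'
full_link = urljoin(seed, 'indexnew.html')
print(full_link)
```

In the crawler, the second argument would be the value returned by link.get('href').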

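Following everyone's answers, I tried putting an if in the script. If anyone sees a potential problem that I might run into in the future, please let me know.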

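import urllib2
from bs4 import BeautifulSoup

web_page_string = []  # never filled in this snippet

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page, 'html.parser')
    final_page_string = soup.get_text()
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is None:
            # <a> tags without an href attribute would break the check below
            continue
        if href.startswith('http'):
            # Already an absolute link
            print href
        else:
            # Relative link: prefix it with the seed URL
            print seed + '/' + href
    print final_page_string


# The function returns nothing, so call it directly instead of printing its result
get_first_page('http://www.fashionroom.com.br')
print web_page_string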


2015-04-05 18:35:40 michelfashionroom
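
One potential problem with prefixing seed + '/' by hand is that it can produce odd URLs for hrefs that start with a slash, and it also rewrites non-web links such as mailto: addresses. The sketch below is a hedged alternative, not the approach taken in the thread: it resolves each href with urljoin and keeps only http(s) results. The helper name resolve_hrefs and every sample href except indexnew.html are hypothetical and chosen only for illustration.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, the version used in the question


def resolve_hrefs(base_url, hrefs):
    """Return absolute http(s) URLs for the given href values."""
    resolved = []
    for href in hrefs:
        if not href:
            # link.get('href') returns None for <a> tags without an href
            continue
        absolute = urljoin(base_url, href)
        # Drop schemes a web crawler cannot fetch (mailto:, javascript:, ...)
        if absolute.startswith(('http://', 'https://')):
            resolved.append(absolute)
    return resolved


# Hypothetical href values, as link.get('href') might return them
sample_hrefs = [
    'indexnew.html',
    '/contact.html',
    'http://example.com/',
    None,
    'mailto:someone@example.com',
]
links = resolve_hrefs('http://www.fashionroom.com.br', sample_hrefs)
for url in links:
    print(url)
```

In the crawler, the list of hrefs could be built with [link.get('href') for link in soup.find_all('a')].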

