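How can I get the full link from BeautifulSoup instead of only the internal link?

I am new to Python and am building a crawler for the company I work for. When I scrape its website, some internal links are not written as full URLs. How can I get the whole link rather than only the relative path? In case that is unclear, here is the code I wrote: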
import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page)
    for link in soup.find_all('a'):
        print (link.get('href'))
    print soup

print get_first_page('http://www.fashionroom.com.br')
print web_page_string
What do you mean by the whole link? – 2015-04-05 14:38:00
`print seed + '/' + link.get('href')`? – Selcuk 2015-04-05 14:38:22
In the example above I want to get http://www.fashionroom.com.br/indexnew.html. Instead, I only get indexnew.html – michelfashionroom 2015-04-05 14:41:20
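One possible approach, offered as a minimal sketch rather than a definitive answer: resolve each `href` against the page URL with the standard library's `urljoin`. Unlike plain string concatenation, it leaves already-absolute links untouched and handles root-relative paths such as `/page.html`. The sketch below assumes Python 3 (in Python 2, which the question uses, the same function is imported from `urlparse`). The helper name `to_absolute` and the sample `href` values are hypothetical; the values stand in for what `link.get('href')` would return.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve each href against base_url, skipping missing or empty ones.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    absolute = []
    for href in hrefs:
        if not href:  # <a> tags without an href attribute yield None
            continue
        absolute.append(urljoin(base_url, href))
    return absolute


# Made-up href values standing in for link.get('href') results.
sample_hrefs = ['indexnew.html', '/contato.html', 'http://example.com/x', None]
for url in to_absolute('http://www.fashionroom.com.br', sample_hrefs):
    print(url)
```

In the code from the question, the same idea would mean joining `seed` with `link.get('href')` inside the loop, after checking that the `href` is not `None`.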