Searching for unique web links

I wrote a program that extracts web links from http://www.stevens.edu/. I am now facing the following problems with it.

1 - I want only the links that start with http or https

2 - I get a warning from BS4 about no parser being specified - solved

How can I solve this? I have not found the right direction for tackling it.

My code -

import urllib2

from bs4 import BeautifulSoup as bs

url = raw_input('Please enter the url for which you want to see unique web links -')

print "\n"

# URLs (mostly HTTP) in a complex world
# Send a browser-like User-Agent with the request
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()
soup = bs(html)  # no parser is specified here
tags = soup('a')  # shorthand for soup.find_all('a')
count = 0
web_link = []
for tag in tags:
    count = count + 1
    store = tag.get('href', None)  # None when the <a> tag has no href
    web_link.append(store)
print "Total no. of extracted web links are", count, "\n"
print web_link
print "\n"
# Remove duplicates by converting to a set and back to a list
Unique_list = set(web_link)
Unique_list = list(Unique_list)

print "No. of the Unique web links after using set method", len(Unique_list), "\n"

2016-04-22 Siddhesh Palav

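Could you clarify your question? Your definitions of "web links" and "local content" look rather vague. Do you mean you are looking for HTML files rather than CSS, or do you mean you are looking for unique domain names? Or something else? – OnGle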

For the warning, you can use soup = bs(html, "lxml"). To get only the web links you want, you can add a condition such as if 'http' in tag.get('href', None): –

if 'http' in tag.get('href', None): would still pick up stylesheets or anything else that happens to be served over http, though. – OnGle
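The condition discussed in the comments can be tightened a little. A plain substring test fails with a TypeError when a tag has no href (the value is then None), and it also matches any link that merely contains "http" somewhere in the middle. The following is a minimal sketch, not the asker's code: it assumes Python 3 (on Python 2 the same function is in the urlparse module), and the helper name unique_http_links and the sample list are made up for illustration.

```python
from urllib.parse import urlparse  # Python 3; on Python 2: from urlparse import urlparse


def unique_http_links(hrefs):
    """Return the distinct hrefs whose scheme is http or https, in first-seen order."""
    seen = set()
    unique = []
    for href in hrefs:
        if not href:
            # tag.get('href') gives None for <a> tags without an href attribute
            continue
        href = href.strip()
        if urlparse(href).scheme not in ("http", "https"):
            # skips relative paths, mailto:, javascript: and similar
            continue
        if href not in seen:
            seen.add(href)
            unique.append(href)
    return unique


# Hypothetical sample values standing in for the scraped web_link list
sample = [
    "http://www.stevens.edu/",
    None,
    "/admissions",
    "mailto:info@example.com",
    "https://www.stevens.edu/about",
    "http://www.stevens.edu/",
]
print(unique_http_links(sample))
```

As OnGle points out, filtering on the scheme alone still lets through anything served over http or https, stylesheets included, so a further check on the path or content type may be needed depending on what counts as a "web link".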

For the second problem, you need to specify the parser when you create the BS object for the page.
soup = bs(html,"html.parser")

This should get rid of your warning.
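To see the effect without fetching a page, here is a small sketch that parses an inline HTML string with the parser named explicitly. The HTML snippet is invented for illustration and stands in for the downloaded page; it assumes the bs4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page
html = """
<html><body>
  <a href="http://www.stevens.edu/">Home</a>
  <a href="https://www.stevens.edu/about">About</a>
  <a name="top">No href here</a>
  <a href="http://www.stevens.edu/">Home again</a>
</body></html>
"""

# Naming the parser explicitly avoids the "no parser specified" warning
soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that actually carry an href attribute
hrefs = [tag["href"] for tag in soup.find_all("a", href=True)]
unique_hrefs = sorted(set(hrefs))

print(len(hrefs), len(unique_hrefs))
```

Passing href=True also keeps None values out of the list, so the later set() call only ever sees real link strings.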

2016-04-22 06:45:00 sbk23
