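Below is the code I am using to retrieve all the links from a website, given the URL of its homepage. How can I make sure I find the "About Us" page of a particular website?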
import requests
from bs4 import BeautifulSoup  # BeautifulSoup 4; the old `BeautifulSoup` module is Python 2 only

url = "https://www.udacity.com"
response = requests.get(url)
page = str(BeautifulSoup(response.content, "html.parser"))


def getURL(page):
    # Find the next `a href` in the page text and return (url, end position).
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote


# Keep cutting the page after each link found until no link is left.
while True:
    url, n = getURL(page)
    page = page[n:]
    if url:
        print(url)
    else:
        break
The result is:
/uconnect
#
/
/
/
/nanodegree
/courses/all
#
/legal/tos
/nanodegree
/courses/all
/nanodegree
uconnect
/
/course/machine-learning-engineer-nanodegree--nd009
/course/data-analyst-nanodegree--nd002
/course/ios-developer-nanodegree--nd003
/course/full-stack-web-developer-nanodegree--nd004
/course/senior-web-developer-nanodegree--nd802
/course/front-end-web-developer-nanodegree--nd001
/course/tech-entrepreneur-nanodegree--nd007
http://blog.udacity.com
http://support.udacity.com
/courses/all
/veterans
https://play.google.com/store/apps/details?id=com.udacity.android
https://itunes.apple.com/us/app/id819700933?mt=8
/us
/press
/jobs
/georgia-tech
/business
/employers
/success
#
/contact
/catalog-api
/legal
http://status.udacity.com
/sitemap/guides
/sitemap
https://twitter.com/udacity
https://www.facebook.com/Udacity
https://plus.google.com/+Udacity/posts
https://www.linkedin.com/company/udacity
Process finished with exit code 0
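All I want is the URL of a site's "About Us" page, which differs from site to site. For example,
for Udacity it is https://www.udacity.com/us
and for artscape-inc it is https://www.artscape-inc.com/about-decorative-window-film/
I could search the URLs for the keyword "about", but with that approach I would miss Udacity. Can anyone suggest a good approach?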
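There's unlikely to *be* a good way - sites are free to put their about page wherever they want (or nowhere at all), and to call it whatever they like. – jonrsharpe
@jonrsharpe: yeah, fair enough. But still, how much can I reduce the number of false positives? – x0v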
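As the comment says, no rule will catch every site, but one way to cut down on false positives is to look at the visible link text as well as the URL, and to ignore links that leave the site. The following is only a rough sketch under assumptions: the keyword lists and the names `LinkCollector`, `score_link` and `find_about_candidates` are made up for illustration, the sample HTML is invented (it assumes the link to /us is labelled something like "About Us"), and it uses the standard library parser instead of BeautifulSoup so that it runs without extra installs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical keyword lists; tune them for the sites you care about.
TEXT_KEYWORDS = ("about", "who we are", "our story")
HREF_KEYWORDS = ("about",)


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def score_link(href, text):
    """Heuristic score: a match in the link text counts more than one in the URL."""
    score = 0
    if any(k in text.lower() for k in TEXT_KEYWORDS):
        score += 2
    if any(k in href.lower() for k in HREF_KEYWORDS):
        score += 1
    return score


def find_about_candidates(html, base_url):
    """Return absolute same-site URLs of likely 'about' pages, best score first."""
    parser = LinkCollector()
    parser.feed(html)
    site = urlparse(base_url).netloc
    best = {}
    for href, text in parser.links:
        absolute = urljoin(base_url, href)  # resolve relative paths such as /us
        if urlparse(absolute).netloc != site:
            continue  # skip external links (Twitter, app stores, ...)
        s = score_link(href, text)
        if s > 0:
            best[absolute] = max(s, best.get(absolute, 0))
    return sorted(best, key=lambda u: (-best[u], u))


# Invented sample markup, not the real Udacity page.
sample = """
<a href="/us">About Us</a>
<a href="/courses/all">Catalog</a>
<a href="https://twitter.com/udacity">About our tweets</a>
"""
print(find_about_candidates(sample, "https://www.udacity.com"))
# ['https://www.udacity.com/us']
```

Scoring the link text is what lets a URL such as /us be found even though it does not contain "about"; a site that labels the page with a word outside the keyword list would still be missed, so treat the result as a ranked list of candidates to check, not a definitive answer.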