How can I make sure I find the About Us page of a particular website

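Below is a snippet I am using to retrieve all the links from a website, given the URL of its homepage. How can I make sure I end up on that website's About Us page?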

import requests
from BeautifulSoup import BeautifulSoup

url = "https://www.udacity.com"
response = requests.get(url)
page = str(BeautifulSoup(response.content))


def getURL(page):
    # Find the next anchor tag; -1 means no links are left.
    start_link = page.find("a href")
    if start_link == -1:
        return None, 0
    # The URL sits between the first pair of double quotes after "a href".
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote


while True:
    url, n = getURL(page)
    page = page[n:]  # continue scanning after the link just found
    if url:
        print(url)
    else:
        break

The output is:

/uconnect 
# 
/
/
/
/nanodegree 
/courses/all 
# 
/legal/tos 
/nanodegree 
/courses/all 
/nanodegree 
uconnect 
/
/course/machine-learning-engineer-nanodegree--nd009 
/course/data-analyst-nanodegree--nd002 
/course/ios-developer-nanodegree--nd003 
/course/full-stack-web-developer-nanodegree--nd004 
/course/senior-web-developer-nanodegree--nd802 
/course/front-end-web-developer-nanodegree--nd001 
/course/tech-entrepreneur-nanodegree--nd007 
http://blog.udacity.com 
http://support.udacity.com 
/courses/all 
/veterans 
https://play.google.com/store/apps/details?id=com.udacity.android 
https://itunes.apple.com/us/app/id819700933?mt=8 
/us 
/press 
/jobs 
/georgia-tech 
/business 
/employers 
/success 
# 
/contact 
/catalog-api 
/legal 
http://status.udacity.com 
/sitemap/guides 
/sitemap 
https://twitter.com/udacity 
https://www.facebook.com/Udacity 
https://plus.google.com/+Udacity/posts 
https://www.linkedin.com/company/udacity 

Process finished with exit code 0

What I want is just the URL of a site's "About Us" page, which differs from site to site. For example:

For Udacity it is https://www.udacity.com/us

For artscape-inc it is https://www.artscape-inc.com/about-decorative-window-film/

I mean, I could try searching for the keyword "about" in the URLs, but as the examples show, that approach would miss Udacity. Can anyone suggest a good approach?

Source

2016-04-23 x0v

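It's unlikely that there *is* a good way - sites are free to put their about-me equivalent wherever they want (or nowhere at all), and to call it whatever they like. – jonrsharpe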

@jonrsharpe: yeah, fair enough, but still, how can I reduce the number of false positives? – x0v

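It is not easy to cover every possible variation of an "About Us" page link, but here is an initial idea that would work in both of the cases you have shown - check for "about" in the href attribute and in the text of the a element: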

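def about_links(elm):
    # Match <a> elements whose href or visible text mentions "about".
    if elm.name != "a":
        return False
    href = elm.get("href", "")  # not every <a> carries an href
    return "about" in href.lower() or "about" in elm.get_text().lower()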

Usage:

soup.find_all(about_links) # or soup.find(about_links)
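To make the usage concrete, here is a minimal sketch that runs the filter against a small inline page. The markup in SAMPLE_HTML is made up for illustration (it only imitates the two link styles from the question), and the filter is written with elm.get so that anchors without an href do not raise; BeautifulSoup 4 with the built-in "html.parser" is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page; not taken from any site.
SAMPLE_HTML = """
<nav><a href="/courses/all">Catalog</a></nav>
<footer>
  <a href="/us">About</a>
  <a href="/about-decorative-window-film/">Our story</a>
  <a name="top">Back to top</a>
  <a href="/contact">Contact</a>
</footer>
"""


def about_links(elm):
    # Keep only <a> elements whose href or visible text mentions "about".
    if elm.name != "a":
        return False
    href = elm.get("href", "")  # anchors without an href yield ""
    return "about" in href.lower() or "about" in elm.get_text().lower()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = [a.get("href") for a in soup.find_all(about_links)]
print(matches)
```

The first link is caught through its text ("About") even though its href is just /us, and the second through its href even though its text never says "about".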

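What you can also do to reduce the number of false positives is to check only the "footer" part of the page, e.g. find the footer element, or an element with id="footer" or with a footer class.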
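The footer restriction could be sketched roughly as follows. The helper names find_footer and footer_about_hrefs are hypothetical, the sample markup is invented, and falling back to the whole page when no footer is found is an assumption of this sketch rather than part of the suggestion above.

```python
from bs4 import BeautifulSoup


def find_footer(soup):
    """Return the first footer-like element, or None if there is none."""
    candidates = (
        soup.find("footer"),         # <footer> element
        soup.find(id="footer"),      # any element with id="footer"
        soup.find(class_="footer"),  # any element with a "footer" class
    )
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


def footer_about_hrefs(html):
    """Collect hrefs of "about"-looking links, preferring the footer."""
    soup = BeautifulSoup(html, "html.parser")
    scope = find_footer(soup)
    if scope is None:
        scope = soup  # assumption: fall back to scanning the whole page
    hrefs = []
    for a in scope.find_all("a", href=True):
        if "about" in a["href"].lower() or "about" in a.get_text().lower():
            hrefs.append(a["href"])
    return hrefs


# Hypothetical page: the header link is a false positive outside the footer.
SAMPLE_HTML = """
<header><a href="/blog/about-our-new-course">About our new course</a></header>
<div id="footer"><a href="/us">About</a> <a href="/jobs">Jobs</a></div>
"""
print(footer_about_hrefs(SAMPLE_HTML))
```

Limiting the search scope trades recall for precision: a site that links its About page only from the header would be missed whenever a footer exists.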

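Another idea, which sort of "outsources" the definition of the "about us" page, would be to google (from your script, of course) "about" + "webpage url" and grab the first search result.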

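As a side note, I've noticed you are still using BeautifulSoup version 3 - it is no longer being developed or maintained, and you should switch to BeautifulSoup 4 as soon as possible by installing it: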

pip install --upgrade beautifulsoup4

and changing your import:

from bs4 import BeautifulSoup
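With BeautifulSoup 4 in place, the manual string scanning from the question could be replaced by something along these lines. This is only a sketch: it assumes Python 3 (for urllib.parse), the helper name extract_links is hypothetical, and the sample markup is invented. On a live page the html argument would be response.content from requests.get.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the distinct absolute URLs linked from the page, in order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"].strip()
        if not href or href.startswith("#"):
            continue  # skip empty and in-page fragment links
        absolute = urljoin(base_url, href)  # resolve relative paths
        if absolute not in links:
            links.append(absolute)
    return links


# Hypothetical markup imitating the kinds of hrefs printed in the question.
SAMPLE_HTML = """
<a href="/us">About</a>
<a href="#">Menu</a>
<a href="/us">About (again)</a>
<a href="uconnect">Connect</a>
<a href="http://blog.udacity.com">Blog</a>
"""
links = extract_links(SAMPLE_HTML, "https://www.udacity.com")
print(links)
```

Resolving every href against the homepage URL also turns relative entries such as /us into full URLs that can be requested directly.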

Source

2016-04-23 20:59:18 alecxe

Also, here is a related thread: http://stackoverflow.com/a/28145856/771848. – alecxe
