Filter hyperlinks - Python

I want to filter the hyperlinks on a website and keep only those whose URL text contains words like product, service, solution, or index, so I came up with this:

import requests
from bs4 import BeautifulSoup

site = 'https://www.similarweb.com'
resp = requests.get(site)
# Pass the declared encoding to the parser only when the server sent a charset
encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
soup = BeautifulSoup(resp.content, from_encoding=encoding)

# Keep every link whose href contains one of the keywords anywhere
contact_links = []
for a in soup.find_all('a', href=True):
    if 'product' in a['href'] or 'service' in a['href'] or 'solution' in a['href'] or 'about' in a['href'] or 'index' in a['href']:
        contact_links.append(a['href'])

# Turn relative links into absolute ones by prefixing the site
contact_links2 = []
for i in contact_links:
    if i[:4] == 'http':
        contact_links2.append(i)
    else:
        contact_links2.append(site + i)

for i in contact_links2:
    print(i)

When I run this snippet against https://www.similarweb.com it returns several links, some of which are:

https://www.similarweb.com/apps/top/google/app-index/us/all/top-free 
https://www.similarweb.com/corp/solution/travel/ 
https://www.similarweb.com/corp/about/ 
http://www.thedailybeast.com/articles/2016/10/17/drudge-limbaugh-fall-for-twitter-joke-about-postal-worker-destroying-trump-ballots.html 
https://www.similarweb.com/apps/top/google/app-index/us/all/top-free

From this result, I want only the links in which nothing else follows the keyword (product, service, solution, index).

Expected output (considering only the first 5 links):

https://www.similarweb.com/corp/about/

How can I do that?


2016-11-14 Guru

Which of the example URLs do you want to remove? – 2016-11-14 08:26:25

Those where 'product', 'service', 'solution', or 'index' is not the ending word of the URL – Guru

@LutzHorn Going by the ending word, I want only the third URL in the example – Guru

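In the if condition, you should put a forward slash before and after the word you are checking. It should be if '/product/' in a['href'] ... and so on.

As mentioned in the comments, the keyword should be the ending word of the URL, so it is better to check a['href'].endswith('/product/'). Since the endswith function accepts a tuple as its argument, you can do this: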

if a['href'].endswith(('/product/', '/index/', '/about/', '/solution/', '/service/')):

This condition evaluates to true for every URL that ends with any of the strings listed in the tuple.
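To see the effect without hitting the network, here is a minimal sketch that applies the tuple form of endswith to the sample links from the question. The helper name ends_with_keyword and the constant SUFFIXES are made up for illustration.

```python
# Sample links copied from the question's output.
links = [
    "https://www.similarweb.com/apps/top/google/app-index/us/all/top-free",
    "https://www.similarweb.com/corp/solution/travel/",
    "https://www.similarweb.com/corp/about/",
    "http://www.thedailybeast.com/articles/2016/10/17/drudge-limbaugh-fall-for-twitter-joke-about-postal-worker-destroying-trump-ballots.html",
]

# str.endswith accepts a tuple and returns True if any of the suffixes matches.
SUFFIXES = ("/product/", "/service/", "/solution/", "/about/", "/index/")


def ends_with_keyword(url, suffixes=SUFFIXES):
    """Return True when the URL ends with one of the slash-delimited keywords."""
    return url.endswith(suffixes)


filtered = [url for url in links if ends_with_keyword(url)]
print(filtered)  # ['https://www.similarweb.com/corp/about/']
```

Note that this only matches hrefs written with a trailing slash; a link such as .../corp/about (no final slash) would be skipped unless '/about' is added to the tuple as well.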


2016-11-14 08:29:22 falloutcoder

Can we also include a regular expression here so that variants are covered? For example 'about', 'aboutus', 'about us' along with endswith – Guru
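One possible way to cover such variants is a regular expression anchored at the end of the URL. The following is only a sketch: the pattern, the set of variants it accepts (about, aboutus, about-us, about_us), the optional trailing slash, and the example.com URLs are all assumptions made for illustration.

```python
import re

# Assumed pattern: the keyword must be the last path segment, with an optional
# trailing slash. "about" may be followed by "us", "-us" or "_us".
KEYWORD_AT_END = re.compile(
    r"/(?:about(?:[-_]?us)?|product|service|solution|index)/?$",
    re.IGNORECASE,
)


def ends_with_keyword_variant(url):
    """Return True if the URL's last path segment is one of the keyword variants."""
    return KEYWORD_AT_END.search(url) is not None


# The first two URLs come from the question; the example.com ones are made up.
samples = [
    "https://www.similarweb.com/corp/about/",
    "https://www.similarweb.com/corp/solution/travel/",
    "https://example.com/aboutus",
    "https://example.com/about-us/",
    "https://example.com/app-index/",
]

matches = [url for url in samples if ends_with_keyword_variant(url)]
print(matches)
```

Because the pattern requires a slash immediately before the keyword, a segment such as app-index is not treated as a match. URLs carrying a query string or fragment would need those parts stripped first.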

import requests 
from bs4 import BeautifulSoup 
import re 
from urllib.parse import urljoin 


r = requests.get('https://www.similarweb.com/') 
soup = BeautifulSoup(r.text, 'lxml') 
urls = set() 

for i in soup.find_all('a', href=re.compile(r'((about)|(product)|(service)|(solution)|(index))/$')): 
    url = i.get('href') 
    abs_url = urljoin(r.url, url) 
    urls.add(abs_url) 
print(urls)
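The snippet above builds absolute URLs with urljoin and collects them in a set, rather than concatenating strings as in the question. A small offline sketch of why that helps; the href values below are made up for illustration and were not scraped from the live site.

```python
from urllib.parse import urljoin

base = "https://www.similarweb.com/"

# Illustrative href values as they might appear in <a> tags (assumed).
hrefs = [
    "/corp/about/",                             # root-relative
    "corp/solution/",                           # relative, no leading slash
    "https://www.similarweb.com/corp/about/",   # already absolute
    "//www.similarweb.com/corp/about/",         # protocol-relative
]

# urljoin resolves each form against the base URL.
absolute = [urljoin(base, href) for href in hrefs]

# A set removes the duplicates produced by different spellings of the same link.
unique = sorted(set(absolute))
print(unique)

# Plain concatenation, by contrast, breaks when the href has no leading slash.
naive = "https://www.similarweb.com" + "corp/solution/"
print(naive)  # https://www.similarweb.comcorp/solution/
```

The concatenation in the question also prefixes the site to any href that does not start with http, which includes protocol-relative links; urljoin handles those as well.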


2016-11-14 11:46:57
