Get all HTML data EXCEPT mailto: and tel: links in BS4 (Python, decompose())

I need to strip the phone numbers and email addresses out of some HTML.

I can select that data with:

description_source = soup.select('a[href^="mailto:"], a[href^="tel:"]')

But that is exactly the data I don't want.

I want to use decompose to remove it:

description_source = soup.decompose('a[href^="mailto:"]')

but I get this error:

TypeError: decompose() takes 1 positional argument but 2 were given

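I also considered using SoupStrainer, but it looks like I would have to include everything except mailto and tel to get the right information, which seems like a bit much.

My full current code is this: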

import requests 
from bs4 import BeautifulSoup as bs4 

item_number = '122124438749' 

ebay_url = "http://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item=" + item_number 
r = requests.get(ebay_url) 
html_bytes = r.text 
soup = bs4(html_bytes, 'html.parser') 

description_source = soup.decompose('a[href^="mailto:"]') 
#description_source. 

print(description_source)


2017-06-06 johnashu

Please post your complete code. –

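Try using find_all(). Find all the links in the page, check which ones have a tel: or mailto: href, then remove those with extract().

Use the lxml parser for faster processing. The official documentation also recommends it.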
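As a hedged sketch of the parser choice: lxml is a separate package, so a script can fall back to the standard-library html.parser when lxml is not installed. The helper name make_soup and the sample markup are assumptions made for illustration; they are not part of the original answer.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is available, otherwise use html.parser."""
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:
        # lxml is not installed; the built-in parser is slower but always present.
        return BeautifulSoup(html, 'html.parser')


sample = '<p>Call <a href="tel:+15550100">us</a></p>'  # made-up markup
soup = make_soup(sample)
print(soup.a['href'])
```

Either parser is fine for the link removal itself; the choice mainly affects speed and how badly broken markup gets repaired.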

import requests
from bs4 import BeautifulSoup

item_number = '122124438749'

ebay_url = "http://vi.vipr.ebaydesc.com/ws/eBayISAPI.dll?ViewItemDescV4&item=" + item_number
r = requests.get(ebay_url)
html_text = r.text
soup = BeautifulSoup(html_text, 'lxml')

# Remove every <a> whose href starts with tel: or mailto:
for link in soup.find_all('a'):
    href = link.get('href') or ''  # some <a> tags have no href at all
    if href.startswith(('tel:', 'mailto:')):
        link.extract()

print(soup.prettify())

You can also use decompose() instead of extract().
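A short sketch of that variant, reusing the CSS selectors from the question. decompose() is a method of an individual tag and takes no arguments, which is why soup.decompose('a[href^="mailto:"]') raises the TypeError shown in the question: the tags have to be selected first, then decomposed one by one. The sample markup and variable names here are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the downloaded page.
html = (
    '<div><p>Great item.</p>'
    '<a href="mailto:seller@example.com">Email</a>'
    '<a href="tel:+15550100">Call</a>'
    '<a href="http://example.com/">Shop</a></div>'
)
soup = BeautifulSoup(html, 'html.parser')

# One selector group matches both kinds of link. select() returns a list,
# so removing tags from the tree while looping over it is safe.
for tag in soup.select('a[href^="mailto:"], a[href^="tel:"]'):
    tag.decompose()  # destroys the tag and its contents; returns None

cleaned = str(soup)
print(cleaned)
```

extract() would work the same way in this loop. The difference is that extract() returns the removed tag, which helps if the removed links should be logged, while decompose() discards it.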


2017-06-06 12:32:18

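Hey, this is a great little script, and I had fun playing with it. However, it returns the data I want to remove. I want to strip every tel: and mailto: link from the HTML. The whole HTML has to be downloaded, then saved without the mailto: and tel: links. – johnashu

Oh, okay. Give me some time to edit it. –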

Also, I've changed my parser to lxml! – johnashu
