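I started writing Python only a week ago, and I'm scraping and parsing a website. How do I strip out all the tag content? I've tried several things, but I can't seem to figure it out. Any help would be much appreciated! Is there a way to simplify this? Removing content from text, Python.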
import re
import requests
import json
from bs4 import BeautifulSoup, NavigableString
url = 'http://www.grabexample.com'
geturl = requests.get(url)
some_text = geturl.text
soup = BeautifulSoup(some_text, "html.parser")
soup.prettify()
all_on_URL = soup.find_all('a')
grab_text = soup.get_text(strip=True)
parse_a_text = grab_text.replace("</a>", '').replace("<a>", '').replace("<a", '')
parse_p_text = parse_a_text.replace("</p>", '').replace("<p>", '').replace("<p", '')
parse_div_text = parse_p_text.replace("</div>", '').replace("<div>", '').replace("<div", '')
parse_li_text = parse_div_text.replace("</li>", '').replace("<li>", '').replace("<li", '')
parse_span_text = parse_li_text.replace("</span>", '').replace("<span>", '').replace("<span", '')
parse_img_text = parse_span_text.replace("</img>", '').replace("<img>", '').replace("<img", '')
parse_ul_text = parse_img_text.replace("</ul>", '').replace("<ul>", '').replace("<ul", '')
parse_ol_text = parse_ul_text.replace("</ol>", '').replace("<ol>", '').replace("<ol", '')
parse_label_text = parse_ol_text.replace("</label>", '').replace("<label>", '').replace("<label", '')
parse_h_text = parse_label_text.replace("</h>", '').replace("<h>", '').replace("<h", '')
parse_h1_text = parse_h_text.replace("</h1>", '').replace("<h1>", '').replace("<h1", '')
parse_h2_text = parse_h1_text.replace("</h2>", '').replace("<h2>", '').replace("<h2", '')
parse_h3_text = parse_h2_text.replace("</h3>", '').replace("<h3>", '').replace("<h3", '')
parse_h4_text = parse_h3_text.replace("</h4>", '').replace("<h4>", '').replace("<h4", '')
parse_h5_text = parse_h4_text.replace("</h5>", '').replace("<h5>", '').replace("<h5", '')
parse_href_text = parse_h5_text.replace("href=", '').replace("<", '').replace(">", '')
parse_box_text = parse_href_text.replace("[]", '').replace("[", '').replace("]", '')
parse_space_text = parse_box_text.replace("\n", "").replace(" ", "")
parse_colon_text = parse_space_text.replace("{", '').replace("}", '').replace("#", '')
print(parse_colon_text)
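One possible simplification, as a minimal sketch: `get_text()` already returns only the text nodes, so by the time the `.replace()` chain runs there should be no tags left for it to remove. The `sample_html` string below is a made-up stand-in for the downloaded page (it is not from the site in the question), and `page_text` is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for requests.get(url).text
sample_html = """
<div id="main">
  <h1>Title</h1>
  <p>First <a href="/x">link</a> here.</p>
  <ul><li>one</li><li>two</li></ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text() drops every tag and attribute on its own.
# separator=" " keeps text from neighbouring elements apart;
# strip=True trims each piece and skips whitespace-only pieces.
page_text = soup.get_text(separator=" ", strip=True)
print(page_text)
```

Note that the original chain also ends with `.replace(" ", "")`, which removes every space and runs all the words together; passing a separator to `get_text()` avoids that.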
Here is another way I tried to write it, but it didn't work. Am I doing something wrong here?
def notusetags():
    invalid_tags = ['a', 'div', 'span', 'class', 'p', 'img', 'li', '\n']
    # get_text(strip=True)
    for tag in invalid_tags:
        for match in soup2.findAll(tag):
            match.replaceWithChildren('')
    print soup
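A few things in that attempt look like likely culprits: `soup2` is never defined in the snippet while `soup` is what gets printed, `'class'` and `'\n'` are not tag names, and `replaceWithChildren` does not take an argument. Below is a hedged sketch of the same idea using `unwrap()`, the current bs4 name for that method. The helper name `unwrap_tags` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the site in the question
sample_html = '<div><p>Hello <a href="/x">world</a> <span>again</span></p><b>kept</b></div>'
soup = BeautifulSoup(sample_html, "html.parser")

# Only real tag names belong here ('class' is an attribute, '\n' is text).
invalid_tags = ["a", "div", "span", "p", "img", "li"]


def unwrap_tags(tree, tag_names):
    """Remove the listed tags in place but keep whatever they contain."""
    # find_all() accepts a list of names and returns all matches up front,
    # so modifying the tree inside the loop is safe.
    for match in tree.find_all(tag_names):
        match.unwrap()  # takes no arguments
    return tree


unwrap_tags(soup, invalid_tags)
result = str(soup)
print(result)  # Hello world again<b>kept</b>
```

Tags that are not in the list (here `<b>`) stay in place, which is the main difference from calling `get_text()` on the whole document.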
Here is one more approach I tried, but it didn't work either. Am I missing something here as well?
invalid_tags = ['a', 'div', 'span', 'class', 'p', 'img', 'li', '\n']
# get_text(strip=True)
for stripped in parse_text:
    if stripped.name in invalid_tags:
        s = ""
        for c in stripped.contents:
            if not isinstance(c, NavigableString):
                c = stripped(unicode(c), invalid_tags)
            s += unicode(c)
        stripped.replaceWith(s)
    testtext.append(stripped)
testtext = []
print(testtext)
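In this last attempt, `testtext` is reset to an empty list right before it is printed, `parse_text` is not defined in the snippet, `unicode` only exists in Python 2, and calling a tag like `stripped(...)` runs a search on it rather than a recursive cleanup. If the goal is simply a list of the text pieces, a minimal sketch (assuming that is the goal, and using made-up sample markup) is to let bs4 iterate the text nodes:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the site in the question
sample_html = "<ul><li> one </li><li>two</li></ul><p>three <span>four</span></p>"
soup = BeautifulSoup(sample_html, "html.parser")

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that are only whitespace; build the list once, then print it.
testtext = list(soup.stripped_strings)
print(testtext)  # ['one', 'two', 'three', 'four']
```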
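You've already loaded BeautifulSoup, so why aren't you using it to parse? That's what... – Cfreak
What are you trying to accomplish? –
Here is a good answer on the same topic: https://stackoverflow.com/questions/1765848/remove-a-tag-using-beautifulsoup-but-keep-its-contents – Cfreak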