Removing a long dash from a string

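I am trying to read HTML content from a website into Python so that I can analyse the text and decide which category it belongs to. When I try to work with the text, I run into a problem with long dashes, because they come through as NoneType. I have tried several of the fixes suggested on this site, but none of them worked.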

from bs4 import BeautifulSoup 
import re 
import urllib.request 
response = urllib.request.urlopen('website-im-opening') 
content = response.read().decode('utf-8') 
#this does not work 
content = content.translate({0x2014: None}) 
content = re.sub(u'\u2014','',content) 
#This is other part of code 
htmlcontent = BeautifulSoup(content,"html.parser") 

for cont in htmlcontent.select('p'):
    if not cont.has_attr('class'):
        print(cont.strip())  # Returns an error as text contains long dash

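Any ideas on how to filter long dashes out of a string so that I can work with the rest of the text? I could replace them with a short hyphen or remove them entirely; they are not important to me.

Thanks!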


2017-03-17 Banana

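You should clean the data after extracting it with BS4:

1. BS4 converts certain HTML entities for you; you do not need to do that yourself.
2. BS4 decodes the document for you.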

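```
from bs4 import BeautifulSoup
import urllib.request

response = urllib.request.urlopen('website-im-opening')
# Pass the raw bytes; BS4 detects the encoding and decodes them itself.
content = response.read()
htmlcontent = BeautifulSoup(content, "html.parser")

# class_=False selects only <p> tags that have no class attribute.
for cont in htmlcontent.find_all('p', class_=False):
    print(cont.text)
```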
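
The error in the question may not come from the dash at all. A BS4 `Tag` has no `strip()` method, and BS4 treats an unknown attribute such as `cont.strip` as a search for a child tag named `<strip>`, which returns `None` when no such tag exists. The sketch below assumes that is what happens here. The sample markup is made up for illustration and stands in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
html = '<p class="intro">skip me</p><p> first \u2014 second </p>'
soup = BeautifulSoup(html, "html.parser")

paragraphs = []
for cont in soup.find_all("p"):
    if cont.has_attr("class"):
        continue
    # cont.strip is resolved as "find a child tag named strip"; none exists,
    # so the attribute is None and calling it raises the TypeError.
    assert cont.strip is None
    # Extract the text first, then use ordinary str methods on it.
    text = cont.get_text().replace("\u2014", "-").strip()
    paragraphs.append(text)

print(paragraphs)
```

Calling `get_text()` (or reading `.text`) gives a plain `str`, so `strip()` and `replace()` behave as usual from that point on.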


2017-03-17 11:54:20

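OK, so I need to remove decode('utf-8'), but what do you mean in point 1? – Banana

@Banana check my update; you should not do anything to the HTML code. –

Calling cont.strip() in the for loop still gives me a "NoneType object is not callable" error. – Banana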

Would something like this do the job for you?

# will only work if dashes are at either end
# Python 3: match the character itself (U+2014), not its UTF-8 bytes
>>> a = '—asasas—'
>>> a.strip('\u2014')
'asasas'

That simply removes the long dashes. Alternatively, you could use:

# replace '[long-dash]' with '' to remove the dash instead
>>> a = '—asasas—'
>>> a.replace('\u2014', '[long-dash]')
'[long-dash]asasas[long-dash]'

or something to that effect, if you want to know where they were?
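
For reference, here is a minimal sketch of these variants on a Python 3 `str`. The sample string is made up for illustration. One assumption worth stating: in Python 3 the escape `'\xe2\x80\x94'` inside a `str` literal is three separate characters rather than the dash, so it would not match; `'\u2014'` is the dash itself.

```python
import re

EM_DASH = "\u2014"  # the long dash discussed in this thread

# Hypothetical sample text with a dash at each end and one in the middle.
sample = "\u2014left\u2014right\u2014"

removed = sample.replace(EM_DASH, "")          # delete every long dash
translated = sample.translate({0x2014: None})  # same result via translate
hyphenated = re.sub(EM_DASH, "-", sample)      # swap for a short hyphen
stripped = sample.strip(EM_DASH)               # only trims both ends

# The UTF-8 byte escape written inside a str literal does not match the dash.
byte_escape_matches = "\xe2\x80\x94" in sample

print(removed, translated, hyphenated, stripped, byte_escape_matches)
```

All of these operate on text that has already been extracted, so `translate` and `re.sub` as used in the question should work once they are applied to a `str` rather than to a BS4 tag.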


2017-03-17 11:55:16

I tried replacing them with the method you mentioned; that also returned an error. – Banana
