How to remove special characters (like `'ŒðŸ'`) from tweets

I need to strip special characters such as ðŸ‘‰ðŸ‘ŒðŸ’¦âœ¨ from tweets. To do this I followed the strategy below (I am using Python 3):

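Convert the tweet string to bytes so that the special characters show up as hexadecimal escapes, so Ã becomes \xc3\;
Using regular expressions, remove the b' and b" (at the start of the string) and the ' or " (at the end of the string) that Python adds during the conversion;
Finally, remove the hexadecimal representations, also with a regular expression.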

Here is my code:

import re 
tweet = 'b"[/Very seldom~ will someone enter your life] to question\xc3\xa2\xe2\x82\xac\xc2\xa6 "' 

#encoding to 'utf8' 
tweet_en = tweet.encode('utf8') 
#converting to string 
tweet_str = str(tweet_en) 
#eliminating the b' and b" at the beginning of the string:
tweet_nob = re.sub(r'^(b\'b\")', '', tweet_str) 
#deleting the single or double quotation marks at the end of the string: 
tweet_noendquot = re.sub(r'\'\"$', '', tweet_nob) 
#deleting hex 
tweet_regex = re.sub(r'\\x[a-f0-9]{2,}', '', tweet_noendquot) 
print('this is tweet_regex: ', tweet_regex)

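The final output is: [/Very seldom~ will someone enter your life] to question " (from which I still cannot remove the trailing "). Is there a better, more direct way to strip special characters from Twitter data? Any help would be much appreciated.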

2017-02-21 norpa

If you only want to keep ASCII characters, I think this will work:

initial_str = 'Some text ðŸ‘‰ðŸ‘ŒðŸ’¦âœ¨ and some more text'
clean_str = ''.join([c for c in initial_str if ord(c) < 128])
print(clean_str)  # Some text  and some more text (two spaces remain around the removed characters)

You can also test `ord(c) in range(...)` and pass the range of code points you want to keep (which could include emoji).
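
The range idea can be sketched roughly as follows. The helper name `keep_chars` and the specific ranges are assumptions chosen for illustration, not something from the answer itself; the emoji range in particular covers many emoji but not all of them, so adjust it to your data.

```python
# Hypothetical whitelist of code point ranges to keep (adjust as needed).
KEEP_RANGES = [
    range(0x00, 0x80),            # plain ASCII
    range(0x1F300, 0x1FAFF + 1),  # assumed range covering many, but not all, emoji
]

def keep_chars(text, ranges=KEEP_RANGES):
    """Return text with every character outside the given ranges dropped."""
    return ''.join(c for c in text if any(ord(c) in r for r in ranges))

print(keep_chars('Good job \U0001F44C caf\u00e9'))  # Good job 👌 caf
```

This keeps emoji only if the text was decoded correctly in the first place. Mojibake such as ðŸ‘‰ consists of ordinary non-ASCII code points (U+00F0, U+0178, and so on) rather than emoji, so a filter like this simply drops it.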

2017-02-21 15:44:49 squgeim
