Regular expression for a specific pattern

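I need to match occurrences that start with \u followed by a 4-character hexadecimal string, in various forms (these are not Unicode objects but literal strings in the data, which is why I want to clean the data), and I want to replace those occurrences with whitespace.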

Sample text file: Hello \u2022 Created, reviewed, \u00e9executed and maintained

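In the sample above, for instance, the strings \u2022 and \u00e9 occur. I want to find \u and pull it out together with the 4 characters that follow it (2022 and 00e9). I am looking for the right expression to fit this pattern.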
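To make the requirement concrete, here is a minimal sketch of one pattern that fits the description: a literal backslash, the letter u, then exactly four hexadecimal digits. It assumes Python 3 and that upper-case hex digits should also match; the names `ESCAPE`, `sample`, `codes`, and `cleaned` are illustrative only and do not come from the original post.

```python
import re

# Raw string, so the backslashes stay literal, as they are in the data file.
sample = r"Hello \u2022 Created, reviewed, \u00e9executed and maintained"

# Backslash, 'u', then exactly four hex digits (captured in group 1).
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

# Pull out the four-character codes that follow each \u.
codes = ESCAPE.findall(sample)

# Replace each whole \uXXXX sequence with a single space.
cleaned = ESCAPE.sub(" ", sample)

print(codes)    # ['2022', '00e9']
print(cleaned)
```

Because each sequence is replaced by a space and the surrounding spaces are kept, the cleaned text contains runs of spaces; collapsing them would be a separate step.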

Sample code:

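import io
import json
import re
from glob import glob

files = glob('Candidate Profile Data/*')

for file_ in files:
    with io.open(file_, 'r', encoding='us-ascii') as json_file:
        # io.open in text mode already returns decoded text, so no .decode() call
        json_data = json_file.read()
    # Replace runs of non-ASCII characters with a space
    json_data = re.sub(r'[^\x00-\x7F]+', ' ', json_data)
    # Replace literal backslash-n sequences with a space
    json_data = json_data.replace('\\n', ' ')
    # Replace literal \uXXXX sequences (exactly four hex digits) with a space
    json_data = re.sub(r'\\u[0-9a-fA-F]{4}', ' ', json_data)

    print(json_data)
    json_data = json.loads(json_data)
    print(json_data)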

Source

2017-04-22 Mr. Robot

If I understand correctly, you need to remove Unicode characters from the string?

@LeonardoChirivì No, that is why I explicitly mentioned that they are not actual Unicode characters but appear as literal strings in the data itself.

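Really, we need an example of your code, but as a pointer, the regular expression I think you will need is something like r'\\u[0-9a-f]{,4}'

Here is an example of it in use: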

>>> import re 
>>> my_string='Hello \\u2022 Created, reviewed, \\u00e9executed and maintained' 
>>> my_string 
'Hello \\u2022 Created, reviewed, \\u00e9executed and maintained' 
>>> re.sub(r'\\u[0-9a-f]{,4}',"",my_string) 
'Hello  Created, reviewed, executed and maintained'

I would still like to see an example of your code so that we can provide a more accurate answer.
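One detail worth checking in the pattern above is the quantifier: `{,4}` means "zero to four" hex digits, so it also matches a bare \u with nothing hexadecimal after it. The sketch below compares it with `{4}` ("exactly four"). It assumes Python 3, and the input string `text` is invented for illustration, not taken from the asker's data.

```python
import re

# Invented input: one backslash-u that is NOT an escape, and one that is.
text = r"path C:\users and bullet \u2022"

# {,4} allows zero hex digits, so the "\u" in "\users" is stripped too.
loose = re.sub(r"\\u[0-9a-f]{,4}", "", text)

# {4} requires exactly four hex digits, so only "\u2022" is removed.
strict = re.sub(r"\\u[0-9a-fA-F]{4}", "", text)

print(repr(loose))   # 'path C:sers and bullet '
print(repr(strict))  # 'path C:\\users and bullet '
```

If the data can only ever contain well-formed \uXXXX sequences, both patterns behave the same; the stricter form is just a safeguard.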

Source

2017-04-22 16:01:15

It did not work; also, I have added some sample data.

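Yes, it did work after adding the 'r' in front; I had thought it was not required. I have just added sample code showing what I am trying to do. If you could combine my code into a single regular expression, I would really appreciate it.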
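As a possible way to fold the three clean-up steps from the question into a single regular expression, the sketch below joins them with alternation. It assumes Python 3 and that every match should become one space. The names `CLEANUP` and `clean` are hypothetical helpers, not part of the original code, and the sample string is adapted from the question with an extra literal \n and a real non-ASCII character added.

```python
import re

# One alternation covering the three substitutions from the question:
#   a run of non-ASCII characters | literal backslash-n | literal \u + 4 hex digits
CLEANUP = re.compile(r"[^\x00-\x7F]+|\\n|\\u[0-9a-fA-F]{4}")

def clean(text):
    """Replace every match of CLEANUP with a single space."""
    return CLEANUP.sub(" ", text)

# Literal backslash sequences, plus one real non-ASCII character at the end.
sample = "Hello \\u2022 Created,\\nreviewed, \\u00e9executed caf\u00e9"

result = clean(sample)
print(repr(result))
```

A single pass reads the text once instead of three times. When the text is raw JSON, it is worth checking that the cleaned string still parses with json.loads before relying on it.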
