2017-05-30
0

How can I remove non-ASCII characters (for example, ??????????????) from text in a pandas DataFrame column?

I tried the following, but with no luck:

df = pd.read_csv(path, index_col=0)
for col in df.columns:
    for j in df.index:
        markup1 = str(df.ix[j, col]).replace("\r", "")
        markup1 = markup1.replace("\n", "")
        markup1 = markup1.decode('unicode_escape').encode('ascii','ignore').strip()
        soup = BeautifulSoup(markup1, 'lxml')
        df.ix[j, col] = soup.get_text()
print df.ix[j, 'requirements']

I also tried a regular expression, but that did not work either:

markup1 = str(df.ix[j, 'requirements']).replace("\r", "") 
markup1 = markup1.replace("\n", "") 
markup1 = re.sub(r'[^\x00-\x7F]+', ' ', markup1) 

I still keep getting non-ASCII characters. Any suggestions would be greatly appreciated.

I have added the first three rows of the DataFrame below:

           col1    col2 \ 
1.0       H1B SPONSOR FOR L1/L2/OPT US, NY, New York 
2.0        Graphic/Web Designer  US, TX, Austin 
3.0 Full Stack Developer (.NET or equivalent + Jav...    GR, , 

       col3 col4 \ 
1.0     NaN NaN 
2.0 Sales and Marketing NaN 
3.0     NaN NaN 

               col5 \ 
1.0 i28 Technologies has demonstrated expertise in... 
2.0 outstanding people who believe that more is po... 
3.0            NaN 

               col6 \ 
1.0 Hello,Wish you are doing good...              ... 
2.0 The Graphic/Web Designer will manage, popula... 
3.0 You?ll have to join the Moosend dojo. But, yo... 

               col7 \ 
1.0 JAVA, .NET, SQL, ORACLE, SAP, Informatica, Big... 
2.0 Bachelor?s degree in Graphic Design, Web Desig... 
3.0 ? .NET or equivalent (Java etc.)? MVC? Javascr... 

               col8 col9 
1.0            NaN f 
2.0 CSD offers a competitive benefits package for ... f 
3.0 You?ll be working with the best team in town..... f 

Answers

1

Option 1 - if you know the complete set of non-ASCII characters:

df 
Out[36]: 
     col1 col2 
0 aaЧ•¿µbb abcd 
1   hf4 efgh 
2   xxx ijk9 

df.replace(regex=True, to_replace=['Ð', '§', '±'], value='') # incomplete here 
Out[37]: 
     col1 col2 
0 aa•¿µbb abcd 
1  hf4 efgh 
2  xxx ijk9 
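
When the list of characters grows unwieldy, the same `df.replace` call can take a single negated character class instead of an explicit list. The following is only a minimal sketch: it assumes Python 3 with pandas imported as `pd`, and the small frame is made-up sample data, not the asker's file.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "col1": ["aaЧ•¿µbb", "hf4", "xxx"],
    "col2": ["abcd", "efgh", "ijk9"],
})

# [^\x00-\x7F] matches any character outside the 7-bit ASCII range,
# so every run of non-ASCII characters is replaced by an empty string.
cleaned = df.replace(to_replace=r"[^\x00-\x7F]+", value="", regex=True)
print(cleaned)
```

This only touches string cells; whether it behaves the same on Python 2.7 byte strings has not been checked here.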

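Option 2 - if you cannot specify the whole set of non-ASCII characters:

Consider using string.printable:

A string of ASCII characters which are considered printable.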

from string import printable 

printable 
Out[38]: '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c' 

df.applymap(lambda y: ''.join(filter(lambda x: 
      x in printable, y))) 
Out[14]: 
    col1 col2 
0 aabb abcd 
1 hf4 asdf 
2 xxx  

Note that if an element of the DataFrame consists entirely of non-ASCII characters, it will simply be replaced with ''.
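
The DataFrame in the question contains NaN cells, and joining a `filter` over a float raises a `TypeError` because a float is not iterable. A hedged sketch of one way to guard against that, assuming Python 3 and pandas imported as `pd`; the helper name `keep_printable` and the sample frame are invented for illustration:

```python
import string

import pandas as pd

# A set gives fast membership tests for the printable ASCII characters.
PRINTABLE = set(string.printable)


def keep_printable(value):
    """Drop characters outside string.printable; pass non-strings through."""
    if not isinstance(value, str):
        return value  # e.g. NaN or numbers are returned unchanged
    return "".join(ch for ch in value if ch in PRINTABLE)


# Hypothetical sample data, including a missing value.
df = pd.DataFrame({
    "col1": ["aa§µbb", float("nan"), "xxx"],
    "col2": ["abcd", "€£", "ijk9"],
})

# Apply the helper to every cell, one column at a time.
cleaned = df.apply(lambda column: column.map(keep_printable))
print(cleaned)
```

Mapping column by column avoids relying on `applymap`, which newer pandas releases deprecate.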

+0

Thanks for your comment. I tried the option as shown below, but the non-ASCII characters are still in the DataFrame. 'df.replace(regex = True,to_replace = [''','€','£','Ã','¬','Ð','±','½','©','• ''''','§','¥','«','¤',' - ','œ','','''','|','','™', '','Î','¿','μ',''','‡','»','Ž','®','º','Ï','ƒ','¶ '''''''''''','Γ','Ç','Ö'],value ='',inplace = True)'

+0

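That's strange. When I run this exact operation on an example 'df' containing non-ASCII characters, it returns 'df' with them removed. What are the dtypes? Also, I am on Python 3.5, but I don't see why that would make a difference.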

+0

I am using Python 2.7. The dtypes are object.

0

Taking inspiration from Brad's answer, I solved the problem by using a list of ASCII values for [0-9] [a-z] [A-Z].

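def remove_non_ascii(text):
    # Allowed code points: space (32), comma (44), period (46),
    # A-Z (65-90) and a-z (97-122). Digits (48-57) are not in this list.
    L = [32, 44, 46, 65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,97,98,99,100,101,102,103, 104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122]
    text = str(text)

    # Keep only the characters whose code point is in the allowed list.
    return ''.join(i for i in text if ord(i) in L)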
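
The same whitelist idea can be written without hand-typed code points by building the allowed set from the constants in the standard `string` module. This is a sketch for Python 3, not the answerer's code: the name `keep_allowed` and the exact whitelist (letters, digits, space, comma, period) are assumptions.

```python
import string

# Assumed whitelist: ASCII letters, digits, space, comma and period.
ALLOWED = set(string.ascii_letters + string.digits + " ,.")


def keep_allowed(text):
    """Return text with every character outside ALLOWED removed."""
    return "".join(ch for ch in str(text) if ch in ALLOWED)


print(keep_allowed("You’ll need .NET, C# and 3+ years"))
# prints: Youll need .NET, C and 3 years
```

It could then be applied to a column with something like `df['requirements'].map(keep_allowed)`. Because of the `str(text)` call, which the answer above also uses, a NaN cell becomes the literal string 'nan'.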