How do I remove non-ASCII characters (for example, ??????????????) from text in a pandas DataFrame column?
I tried the following, but with no luck:
    df = pd.read_csv(path, index_col=0)
    for col in df.columns:
        for j in df.index:
            markup1 = str(df.ix[j, col]).replace("\r", "")
            markup1 = markup1.replace("\n", "")
            markup1 = markup1.decode('unicode_escape').encode('ascii','ignore').strip()
            soup = BeautifulSoup(markup1, 'lxml')
            df.ix[j, col] = soup.get_text()
            print df.ix[j, 'requirements']
I also tried using a regular expression, but that did not work either:
    markup1 = str(df.ix[j, 'requirements']).replace("\r", "")
    markup1 = markup1.replace("\n", "")
    markup1 = re.sub(r'[^\x00-\x7F]+', ' ', markup1)
I still keep getting non-ASCII characters. Any suggestions would be greatly appreciated.
I have added the first three rows of the df below:
         col1                                               col2  \
    1.0  H1B SPONSOR FOR L1/L2/OPT                          US, NY, New York
    2.0  Graphic/Web Designer                               US, TX, Austin
    3.0  Full Stack Developer (.NET or equivalent + Jav...  GR, ,

         col3                 col4  \
    1.0  NaN                  NaN
    2.0  Sales and Marketing  NaN
    3.0  NaN                  NaN

         col5  \
    1.0  i28 Technologies has demonstrated expertise in...
    2.0  outstanding people who believe that more is po...
    3.0  NaN

         col6  \
    1.0  Hello,Wish you are doing good... ...
    2.0  The Graphic/Web Designer will manage, popula...
    3.0  You?ll have to join the Moosend dojo. But, yo...

         col7  \
    1.0  JAVA, .NET, SQL, ORACLE, SAP, Informatica, Big...
    2.0  Bachelor?s degree in Graphic Design, Web Desig...
    3.0  ? .NET or equivalent (Java etc.)? MVC? Javascr...

         col8                                               col9
    1.0  NaN                                                f
    2.0  CSD offers a competitive benefits package for ...  f
    3.0  You?ll be working with the best team in town.....  f
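One thing worth checking in the regex attempt: as posted, the result of re.sub is stored in markup1 but never written back to the DataFrame, which by itself would leave the original text untouched. Below is a minimal sketch of a per-value helper, assuming Python 3 (where str is Unicode; the question uses Python 2.7, where byte strings and unicode objects behave differently). The name strip_non_ascii and its replacement parameter are hypothetical, introduced only for illustration.

```python
import re

# Matches runs of characters outside the 7-bit ASCII range.
_NON_ASCII = re.compile(r'[^\x00-\x7F]+')


def strip_non_ascii(value, replacement=''):
    """Replace non-ASCII runs in a string; pass non-strings through unchanged.

    Hypothetical helper: leaving non-strings alone keeps NaN as NaN instead
    of turning it into the text 'nan', which str(value) would do.
    """
    if not isinstance(value, str):
        return value
    return _NON_ASCII.sub(replacement, value)


# U+2019 (right single quote) is an assumed stand-in for the '?' marks above.
sample = "You\u2019ll have to join the Moosend dojo."
print(strip_non_ascii(sample))
```

With pandas, a helper like this would typically be applied per column and assigned back, for example df[col] = df[col].map(strip_non_ascii); without the assignment the DataFrame keeps its original values.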
Thanks for your comment. I tried the option below, but the non-ASCII characters are still in the DataFrame: 'df.replace(regex=True, to_replace=[''','€','£','Ã','¬','Ð','±','½','©','• ''''','§','¥','«','¤',' - ','œ','','''','|','','™', '','Î','¿','μ',''','‡','»','Ž','®','º','Ï','ƒ','¶ '''''''''''','Γ','Ç','Ö'], value='', inplace=True)'
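That's odd. When I run this exact operation on an example 'df' containing non-ASCII characters, it returns 'df' with them removed. What are the dtypes? Also, I'm on Python 3.5, but I don't see why that would make a difference.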
I am using Python 2.7. The dtypes are object.
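For comparison, the replace call discussed in these comments can use a single character class instead of an enumerated list of symbols, so that no individual character has to be typed correctly. This is only a sketch under stated assumptions: Python 3 with a reasonably recent pandas, and a small made-up frame whose text is modeled on the rows in the question (U+2019 is an assumed stand-in for the '?' marks). It does not reproduce the Python 2.7 setup, where the behavior may differ.

```python
import pandas as pd

# Hypothetical sample data modeled on the question; not the asker's real file.
df = pd.DataFrame({
    "col6": ["You\u2019ll have to join the Moosend dojo.", None],
    "col9": ["f", "f"],
})

# One regex covers every character outside 7-bit ASCII.
# replace() returns a new frame here, so the result must be kept.
cleaned = df.replace(r"[^\x00-\x7F]+", "", regex=True)

print(cleaned)
```

Missing values are left as they are, since the pattern is only applied to string cells.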