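TypeError: sequence item 1: expected a bytes-like object, str found

I am trying to extract the English titles from a wiki titles dump, stored in a text file, using regular expressions in Python 3. The wiki dump also contains titles in other languages and some symbols. Here is my code: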
import re

with open('/Users/some/directory/title.txt', 'rb') as f:
    text = f.read()
letters_only = re.sub(b"[^a-zA-Z]", " ", text)
words = letters_only.lower().split()
print(words)
But I get an error:
TypeError: sequence item 1: expected a bytes-like object, str found
on this line: letters_only = re.sub(b"[^a-zA-Z]", " ", text)
However, I am using b'' for the pattern so that everything is handled as bytes. Here is a sample of the text file:
Destroy-Oh-Boy!!
!!Que_Corra_La_Voz!!
!!_(chess)
!!_(disambiguation)
!'O!Kung
!'O!Kung_language
!'O-!khung_language
!337$P34K
!=
!?
!?!
!?Revolution!?
!?_(chess)
!A_Luchar!
!Action_Pact!
!Action_pact!
!Adios_Amigos!
!Alabadle!
!Alarma!
!Alarma!_(album)
!Alarma!_(disambiguation)
!Alarma!_(magazine)
!Alarma!_Records
!Alarma!_magazine
!Alfaro_Vive,_Carajo!
!All-Time_Quarterback!
!All-Time_Quarterback!_(EP)
!All-Time_Quarterback!_(album)
!Alla_tu!
!Amigos!
!Amigos!_(Arrested_Development_episode)
!Arriba!_La_Pachanga
!Ask_a_Mexican!
!Atame!
!Ay,_Carmela!_(film)
!Ay,_caramba!
!BANG!
!Bang!
!Bang!_TV
!Basta_Ya!
!Bastardos!
!Bastardos!_(album)
!Bastardos_en_Vivo!
!Bienvenido,_Mr._Marshall!
!Ciauetistico!
!Ciautistico!
!DOCTYPE
!Dame!_!Dame!_!Dame!
!Decapitacion!
!Dos!
!Explora!_Science_Center_and_Children's_Museum
!F
!Forward,_Russia!
!Forward_Russia!
!Ga!ne_language
!Ga!nge_language
!Gã!ne
!Gã!ne_language
!Gã!nge_language
!HERO
!Happy_Birthday_Guadaloupe!
!Happy_Birthday_Guadalupe!
!Hello_Friends
I have searched online but could not find a solution. Any help would be appreciated.
Try re.sub("[^a-zA-Z]", " ", text) instead – imant
@imant I tried that too, but then I got the following error: **TypeError: cannot use a string pattern on a bytes-like object** – Sherlock
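
Both errors seem to come from mixing types: re.sub needs the pattern, the replacement and the input to be all bytes or all str. In the question the pattern is bytes but the replacement " " is a str; in the comment the pattern is a str but the file was read in 'rb' mode, so text is bytes. A minimal sketch of the two consistent variants follows. The function names are made up for illustration, the sample is a few lines copied from the listing above, and UTF-8 is only an assumption about how the dump is encoded.

```python
import re

# A few titles from the sample, encoded as raw bytes to mimic what
# open(path, 'rb').read() returns. UTF-8 is an assumption here.
sample_bytes = "!Alarma!_(album)\n!Gã!ne_language\n!Hello_Friends\n".encode("utf-8")


def words_from_bytes(data):
    """Bytes in, bytes out: pattern AND replacement are bytes literals."""
    letters_only = re.sub(rb"[^a-zA-Z]", b" ", data)
    return letters_only.lower().split()


def words_from_text(text):
    """Str in, str out: pattern AND replacement are str literals."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    return letters_only.lower().split()


bytes_words = words_from_bytes(sample_bytes)
# Decoding first is equivalent to opening the file in text mode,
# e.g. open(path, encoding="utf-8") (encoding assumed, not confirmed).
text_words = words_from_text(sample_bytes.decode("utf-8"))

print(bytes_words)
print(text_words)
```

Either variant avoids the TypeError. Note that [^a-zA-Z] also discards accented letters, so a title such as !Gã!ne_language is split into "g", "ne" and "language"; whether that is acceptable depends on what counts as an English title here.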