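Python text parsing & splitting

I want to write a function that puts a space before a punctuation mark when the mark comes right after an alphabetic character, and a space after the mark when it comes right before an alphabetic character. This should not happen when the neighbouring character is a digit. For example: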

("thanks." >>> "thanks ." and "hello?123!lom" >>> "hello ?123! lom")

My code below works fine when a punctuation mark occurs only once, but not when the same punctuation mark is repeated. Here is my code:

def normalize(utterance):

    # Convert to lowercase and collapse repeated whitespace
    utterance = ' '.join(utterance.lower().split())

    # List of punctuation marks
    punctuations_list = [',','.','?',':',';','!',')','(','\'']

    for punctuation in punctuations_list:
        if punctuation in utterance:
            try:
                char_before = str(utterance[utterance.index(punctuation) -1])
                char_after = str(utterance[utterance.index(punctuation) +1])
            except IndexError:
                char_after = "0"

            if char_before.isdigit()==False and char_before not in punctuations_list:
                utterance = utterance.replace(punctuation, " " + punctuation)
            if char_after.isdigit()==False and char_after not in punctuations_list:
                utterance = utterance.replace(punctuation, punctuation + " ")

    return utterance

normalize("thank you:? the time is 2:30pm") 
>>>'thank you :? the time is 2 :30pm'

The output I want is:

'thank you :? the time is 2:30pm'

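That is, with no space inside the time. I believe the problem is that the colon ":" occurs more than once. Can someone help me fix this?

The error seems to be in the following line: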

utterance = utterance.replace(punctuation, " " + punctuation)

Wherever it matches, it replaces every occurrence of the punctuation mark, but I don't know how to fix that!

Source

2017-06-22 Zing_Yang

Your problem is that the replace function replaces every occurrence of the punctuation mark.
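A minimal sketch of that behaviour, using a shortened sample string that is an assumption and not taken from the question. It also shows the optional count argument of str.replace(), which limits the number of replacements but still always starts from the leftmost match:

```python
# str.replace() rewrites every occurrence of the substring, not only the
# one whose neighbouring characters were inspected.
sample = "you: 2:30pm"

result = sample.replace(":", " :")
print(result)      # you : 2 :30pm

# The optional third argument caps the number of replacements,
# counting from the left.
first_only = sample.replace(":", " :", 1)
print(first_only)  # you : 2:30pm
```

Limiting the count only helps when the occurrence to change happens to be the first one, which is why the answers here move to a character-by-character pass.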

You can iterate over each character of utterance instead and build a new string, target, with the correct replacements:

def normalize(utterance):

    # Convert to lowercase and collapse repeated whitespace
    utterance = ' '.join(utterance.lower().split())
    # List of punctuation marks
    punctuations_list = [',','.','?',':',';','!',')','(','\'']

    target = ""
    for i, ch in enumerate(utterance):
        # Neighbouring characters; empty string at either end of the input
        char_before = utterance[i-1] if i > 0 else ""
        char_after = utterance[i+1] if i + 1 < len(utterance) else ""
        # Space before the mark only when a letter precedes it
        if ch in punctuations_list and char_before.isalpha():
            target += " "
        target += ch
        # Space after the mark only when a letter follows it
        if ch in punctuations_list and char_after.isalpha():
            target += " "
    return target

Source

2017-06-22 13:18:21 taras

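This gives the wrong output: 'hank you:? meeting at 1:30' –

Thanks. Updated it. – taras

This should do it: utterance = utterance.replace(punctuation, "" + punctuation)

Edit

As I mentioned, you should go through every character in your sentence rather than every punctuation mark. I have included some other fixes, but you will still have to deal with the doubled spaces that my changes produce.

You will end up with something like this: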

def normalize(utterance):

    # Convert to lowercase and collapse repeated whitespace
    utterance = ' '.join(utterance.lower().split())
    print(utterance)

    # List of punctuation marks
    punctuations_list = [',','.','?',':',';','!',')','(','\'']

    # Walk the characters of the sentence, not the punctuation list
    for punctuation in utterance:
        if punctuation in punctuations_list:
            print(punctuation)

            try:
                char_before = str(utterance[utterance.index(punctuation) -1])
                char_after = str(utterance[utterance.index(punctuation) +1])
            except IndexError:
                char_after = "0"

            print(char_before)

            if char_before.isdigit()==False and char_before not in punctuations_list:
                utterance = utterance.replace(char_before+punctuation, char_before+" " + punctuation)

            if char_before.isdigit()==True:
                utterance = utterance.replace(punctuation, "" + punctuation)

            if char_after.isdigit()==False and char_after not in punctuations_list:
                utterance = utterance.replace(punctuation+char_after, punctuation + " "+char_after)

    return utterance

print(normalize("thank you:? the time is 2:30pm"))
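The doubled spaces mentioned above can be collapsed with the same split-and-join idiom the function already applies on entry. A minimal sketch; the helper name collapse_spaces and the sample string are hypothetical:

```python
def collapse_spaces(text):
    # split() with no arguments splits on any run of whitespace and drops
    # empty strings, so joining with one space removes the doubled spaces.
    return ' '.join(text.split())

cleaned = collapse_spaces("thank you :?  the time is 2:30pm")
print(cleaned)  # thank you :? the time is 2:30pm
```

Note that this also strips leading and trailing whitespace, which is usually what you want here.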

Source

2017-06-22 13:10:43 Diego

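The problem is that you iterate over the punctuation list and, for each punctuation mark, pick the character before and after its first occurrence only. The replacement is then applied to every punctuation mark of that type in the whole sentence. You should change your for loop to iterate over each character of the original sentence instead of over the punctuation list. – Diego

OK, thanks for the clarification, this seems to work. –

You're welcome! It needs more tweaking, but I think it will be easier now. – Diego

You can use a regex: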

import re 

def normalize(text): 
    return re.sub(r"(?<=[a-zA-Z])(?=[,.?:;!()'])|(?<=[,.?:;!()'])(?=[a-zA-Z])", ' ', text)

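The function finds every position where a letter a-zA-Z sits directly before or after one of the characters ,.?:;!()' and inserts a space at that position.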
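As a quick check, the same pattern can be run against the strings from the question. This is a sketch only; the constant name PUNCT_BOUNDARY is hypothetical, and compiling the pattern once is a convenience rather than part of the answer:

```python
import re

# Same pattern as in the answer, compiled once.
PUNCT_BOUNDARY = re.compile(
    r"(?<=[a-zA-Z])(?=[,.?:;!()'])|(?<=[,.?:;!()'])(?=[a-zA-Z])"
)

def normalize(text):
    # Each match is an empty position between a letter and a punctuation
    # mark, so substituting ' ' inserts a space without consuming anything.
    return PUNCT_BOUNDARY.sub(' ', text)

print(normalize("thanks."))                         # thanks .
print(normalize("hello?123!lom"))                   # hello ?123! lom
print(normalize("thank you:? the time is 2:30pm"))  # thank you :? the time is 2:30pm
```

Unlike the function in the question, this version neither lowercases the text nor collapses repeated whitespace; add ' '.join(text.lower().split()) first if you need that.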

Source

2017-06-22 13:11:08

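Look at the documentation for index(), and then at find().

find():

Return the lowest index in s where the substring sub is found such that sub is wholly contained in s[start:end]. Return -1 on failure. Defaults for start and end and interpretation of negative values is the same as for slices.

I suspect that, because you use index() to set char_before and char_after, you only ever do this for the first instance of a punctuation mark, even when other instances exist in utterance. You never loop back to look past the first instance.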
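To make the difference concrete, here is a small sketch that uses the start argument of find() to visit every occurrence instead of only the first. The helper name find_all is hypothetical:

```python
def find_all(text, sub):
    # Collect the index of every occurrence of sub by restarting
    # str.find() just past the previous hit; find() returns -1 when
    # there are no more matches.
    positions = []
    start = text.find(sub)
    while start != -1:
        positions.append(start)
        start = text.find(sub, start + 1)
    return positions

sentence = "thank you:? the time is 2:30pm"
print(sentence.index(":"))      # 9
print(find_all(sentence, ":"))  # [9, 25]
```

index() only ever reports the colon after "you", so the colon inside "2:30pm" is never examined on its own.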

Source

2017-06-22 13:17:29 bluescores
