How do I split a string into words that contain single quotes in Python?

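I have a string a, and I want to return a list b containing the words in a that do not start with @ or # and do not contain any non-word characters.

However, I'm having trouble keeping words like "they're" as a single word. Note that something like "okay....so" should be split into the two words "okay" and "so".

I think the problem can be solved just by revising the regular expression. Thanks!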

import re

a = "@luke5sos are you awake now?!!! me #hashtag time! [email protected] over, now okay....so they're rich....and hopefully available?"
b = []
for word in a.split():
    # Skip empty tokens and tokens that start with @ or #.
    if word != "" and word[0] != "@" and word[0] != "#":
        # Split on runs of non-word characters; this also splits "they're" at the apostrophe.
        for item in re.split(r'\W+\'\W|\W+', word):
            if item != "":
                b.append(item)
print(b)
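
To isolate the behaviour described above, here is a minimal sketch that applies the same split pattern to two single tokens. The tokens are made-up examples; only the pattern is taken from the loop. Because an apostrophe is itself a non-word character, the \W+ alternative splits "they're" in two.

```python
import re

# Same pattern as in the loop above, applied to one token at a time.
pattern = r'\W+\'\W|\W+'

dots = re.split(pattern, "okay....so")
contraction = re.split(pattern, "they're")

print(dots)         # ['okay', 'so']
print(contraction)  # ['they', 're']
```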

Source

2014-10-06 Shengjie Zhang

What is the expected result from this? – hwnd 2014-10-06 03:23:00

['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'over', 'now', 'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available'] – 2014-10-06 03:24:31

It's easy to combine all of these rules into a single regex:

import re 
a = "@luke5sos are you awake now?!!! me #hashtag time! [email protected] over, now okay....so they're rich....and hopefully available?" 
b = re.findall(r"(?<![@#])\b\w+(?:'\w+)?", a) 
print(b)

Result:

['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'over', 'now', 'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available']

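The regex works like this:

1. Check that the match does not come right after # or @, using (?<![@#]).
2. Check that it is at the beginning of a word, using \b. This is important so that the @/# check doesn't just skip one character and carry on.
3. Match a sequence of one or more "word" type characters with \w+.
4. Optionally match an apostrophe followed by some more word-type characters with (?:'\w+)?.

Note that the fourth step is written this way so that they're counts as a single word, while only this, that, and these are matched from this, 'that', these.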
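
As a quick check of the steps above, the sketch below wraps the same pattern in a small helper. The name extract_words and both sample strings are illustrative and not part of the answer.

```python
import re

# Pattern from the answer: skip tokens that follow @ or #,
# and allow one apostrophe inside a word.
WORD_RE = re.compile(r"(?<![@#])\b\w+(?:'\w+)?")


def extract_words(text):
    """Return the words in text, ignoring @mentions and #hashtags."""
    return WORD_RE.findall(text)


plain = extract_words("@user hi #tag okay....so they're here")
quoted = extract_words("this, 'that', these")

print(plain)   # ['hi', 'okay', 'so', "they're", 'here']
print(quoted)  # ['this', 'that', 'these']
```

The second call shows why the apostrophe group requires word characters after it: the quotes around 'that' are left out of the match.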

Source

2014-10-06 03:58:08

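The following code (a) treats .... as a word separator, (b) removes trailing non-word characters such as question marks and exclamation points, and (c) rejects any word that starts with # or @ or otherwise contains non-alphabetic characters: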

import re
a = "@luke5sos are you awake now?!!! me #hashtag time! [email protected] over, now okay....so they're rich....and hopefully available?"
a = a.replace('....', ' ') 
a = re.sub('[[email protected]#$%^&]+(|$)', ' ', a) 
result = [w for w in a.split() if w[0] not in '@#' and w.replace("'",'').isalpha()] 
print(result)

This produces the desired result:

['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'now', 'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available']
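
The same three steps can also be sketched without a regular expression. This is a variant, not the answer's code: it strips trailing punctuation with str.rstrip and string.punctuation instead of re.sub, and the function name filter_words and the sample string are assumptions made for illustration. One visible difference is that a trailing comma is stripped too, so a token such as "rich," is kept.

```python
import string


def filter_words(text):
    """Keep alphabetic words (apostrophes allowed) that do not start with @ or #."""
    words = []
    # (a) treat "...." as a word separator
    for token in text.replace("....", " ").split():
        # (c) reject tokens that start with @ or #
        if token[0] in "@#":
            continue
        # (b) assumption: strip any trailing ASCII punctuation
        token = token.rstrip(string.punctuation)
        # (c) reject tokens that contain anything other than letters and apostrophes
        if token and token.replace("'", "").isalpha():
            words.append(token)
    return words


kept = filter_words("@me wake up now?!!! #tag okay....so they're rich, l33t")
print(kept)  # ['wake', 'up', 'now', 'okay', 'so', "they're", 'rich']
```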

Source

2014-10-06 03:52:41 John1024


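From what I understand, you don't need words with numbers in them, and you want to ignore all special characters other than the single quote. You can try something like this: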

import re 
a = re.sub('[^0-9a-zA-Z']+', ' ', a) 
b = a.split()

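I haven't been able to try out the syntax, but hopefully it should work. What I'm suggesting is to replace every character that is not alphanumeric or a single quote with a single space. This results in a string in which the words you need are separated by one or more spaces. Simply calling split with no arguments then splits the string into words and takes care of repeated whitespace. Hope this helps.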
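
A runnable sketch of this replace-then-split idea follows. It assumes the pattern is written with double quotes so that the apostrophe inside the character class is legal Python; the function name split_keep_apostrophes and the sample string are illustrative only.

```python
import re


def split_keep_apostrophes(text):
    """Replace runs of characters other than letters, digits and ' with a space, then split."""
    # Double-quoted raw string, so the apostrophe inside the class needs no escaping.
    cleaned = re.sub(r"[^0-9a-zA-Z']+", " ", text)
    return cleaned.split()


tokens = split_keep_apostrophes("okay....so they're rich, #tag!")
print(tokens)  # ['okay', 'so', "they're", 'rich', 'tag']
```

Note that this approach on its own keeps the text of #tag as 'tag', so words starting with @ or # would still need to be filtered out separately.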

Source

2014-10-06 03:53:50

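The string '[^0-9a-zA-Z']+' has unbalanced quotes. Did you mean "[^0-9a-zA-Z']+"? – John1024 2014-10-06 04:00:34

What I mean is that the regex should contain 0-9, a-z, A-Z and '. However, I'm assuming here that single quotes only appear in cases like "I'm" or "they're". – 2014-10-06 04:02:43

'[0-9a-zA-Z]' fails on most strings. For example: re.sub('[^0-9a-zA-Z]+', '', 'لوحةلمفاتيحلعربية'). The other solutions tokenize the string correctly. – 2017-09-27 11:18:29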
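
To illustrate the point about non-ASCII text, here is a small sketch comparing the ASCII-only character class with \w, which in Python 3 matches Unicode word characters for str patterns by default. The sample string is made up for the example.

```python
import re

text = "привет hello"

# The ASCII-only class throws away every non-Latin letter.
ascii_only = re.sub(r"[^0-9a-zA-Z]+", " ", text).split()

# \w matches Unicode letters as well, so both words survive.
unicode_aware = re.findall(r"\w+", text)

print(ascii_only)     # ['hello']
print(unicode_aware)  # ['привет', 'hello']
```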

import re 
v = re.findall(r'(?:\s|^)([\w\']+)\b', a)

Gives:

['are', 'you', 'awake', 'now', 'me', 'time', 'is', 'over', 'now', 
'okay', 'so', "they're", 'rich', 'and', 'hopefully', 'available']

Source

2014-10-06 04:21:16 perreal
