Python - Splitting words in a txt file

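I want to write a program that splits out every word in a txt file and returns a list of the words without repeating any of them. I converted my PDF book to txt and ran my program on it, but it fails completely. I don't know what I am doing wrong. Here is my code: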

def split(file):
    lines = open(file, 'rU').readlines()
    words = []
    word = ''
    for line in lines:
        for letter in line:
            if letter not in [' ', '\n', '.', ',']:
                word += letter
            elif letter in [' ', '\n', '.', ',']:
                if word not in words:
                    words.append(word)
                    word = ''

    words.sort()
    return words


for word in split('AKiss.txt'): 
    print(word, end=' ')

I have also attached AKiss.txt and the original PDF in case they are useful.

PDF - http://1drv.ms/b/s!AtZrd19H_8oyabhAx-NZvIQD_Ug

TXT - http://1drv.ms/t/s!AtZrd19H_8oyapvBvAo27rNJSwQ

Source

2017-10-17 F_Zimny

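*Without repeats*... why not use a set instead of a list? – Mangohero1

Can you describe how it fails? – glibdud

@glibdud In theory it returns the words, but some words come back more than once with slight differences, and what is really strange is that some of them do not exist in the file: "Do", "Don't", "Don't twist", "Don't twist", "Dorothy", "Dorothy" –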
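
One possible explanation for the output described in this comment, offered as a sketch rather than a confirmed diagnosis: in the question's loop, word = '' sits inside the "if word not in words:" branch, so when a word has already been seen the buffer is never cleared and the next word gets glued onto it. The variant below moves the reset out of that branch and skips empty strings. The name split_words and the choice to take a text string instead of a file name are assumptions made for illustration.

```python
def split_words(text, delimiters=(' ', '\n', '.', ',')):
    """Return the sorted unique words in text, split on the given delimiters."""
    words = []
    word = ''
    for letter in text:
        if letter not in delimiters:
            word += letter
        else:
            # Only store non-empty words that have not been seen yet ...
            if word and word not in words:
                words.append(word)
            # ... but always clear the buffer, even for a repeated word.
            word = ''
    # Flush the last word when the text does not end with a delimiter.
    if word and word not in words:
        words.append(word)
    words.sort()
    return words


print(split_words("Do not go. Do not stay, Dorothy\n"))
# → ['Do', 'Dorothy', 'go', 'not', 'stay']
```

To use it on the file from the question, the text could be read first, for example with open('AKiss.txt').read().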

You can try this:

import itertools 
words = list(set(itertools.chain.from_iterable([[''.join(c for c in b if c.isalpha()) for b in i.strip('\n').split()] for i in open('filename.txt') if i != "\n"])))
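
The one-liner above can be hard to follow. Here is a rough equivalent written as an explicit loop. The helper name unique_words is hypothetical, and like the one-liner it keeps only characters for which str.isalpha() is true, so apostrophes and digits are dropped along with punctuation.

```python
def unique_words(lines):
    """Collect unique words from an iterable of lines, keeping only letters."""
    found = set()
    for line in lines:
        for token in line.split():
            # Keep alphabetic characters only, so "Dorothy?" becomes "Dorothy".
            cleaned = ''.join(ch for ch in token if ch.isalpha())
            if cleaned:  # skip tokens that contained no letters at all
                found.add(cleaned)
    return sorted(found)


sample = ["Do you know, Dorothy?\n", "\n", "I do not know.\n"]
print(unique_words(sample))
# → ['Do', 'Dorothy', 'I', 'do', 'know', 'not', 'you']
```

Because an open file is itself an iterable of lines, the same function can be called on a file object opened with "with open('filename.txt') as f:".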

Source

2017-10-17 19:52:38 Ajax1234

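It worked, but I get the same word again with a '?' or a period attached. Is there a way to strip out not only newlines but also question marks, commas, and so on? –

@F_Zimny Please try again with the code above – Ajax1234

It works, thank you very much. Sitting in a lecture, I found 100 words I didn't know (English is not my native language) :D Thanks again. –

You may want to take a different approach: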

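def split_file(file):
    all_words = set()
    # 'rU' mode was removed in Python 3.11; plain 'r' already handles newlines
    with open(file, 'r') as f:
        for ln in f:
            words = ln.strip().split()

            # split each whitespace-separated token on '.'
            dot_split = []
            for w in words:
                dot_split.extend(w.split('.'))
            # then split each piece on ','
            comma_split = []
            for w in dot_split:
                comma_split.extend(w.split(','))

            all_words = all_words.union(set(comma_split))

    all_words.discard('')  # splitting 'end.' leaves an empty string behind
    print(sorted(all_words))

split_file('test_file.txt')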

Or, more simply, use a regular expression:

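import re

def split_file2(file):
    all_words2 = set()
    with open(file, 'r') as f:
        for ln in f:
            # inside a character class '.' is literal, so it needs no escape
            words2 = re.split(r'[ \t\n.,]+', ln.strip())
            all_words2 = all_words2.union(set(words2))
    all_words2.discard('')  # drop the empty string left by leading/trailing separators
    print(sorted(all_words2))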

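As a side note, I would not use split as a function name, because it hides functionality of the same name that you may want to use from the standard library / string library.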
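
An alternative regex sketch, not taken from the answer above: instead of listing the separators to split on, match the words themselves with re.findall. The pattern, the name find_words, and the decision to lowercase everything are assumptions for illustration. The pattern only recognises ASCII letters and the straight apostrophe, so text converted from a PDF with curly apostrophes would need a wider pattern.

```python
import re

# One or more ASCII letters, optionally followed by apostrophe + letters ("don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def find_words(text):
    """Return the sorted unique words in text, compared case-insensitively."""
    return sorted({match.lower() for match in WORD_RE.findall(text)})


print(find_words("Don't twist it, Dorothy. Don't!"))
# → ["don't", 'dorothy', 'it', 'twist']
```

Matching words rather than separators means question marks, quotes, and other punctuation are ignored without having to enumerate them.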

Source

2017-10-17 19:50:58 sophros

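I did it this way, but I get an empty list in the output. –

The line all_words.union(set(words.split('.').split(','))) should be all_words = all_words.union(set(words.split('.').split(','))) for the union to work as intended – Arunmozhi

@sophros This code has multiple errors; I tried to improve it and gave up – Arunmozhi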

Using the strip() and split() methods should help you here.
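
A minimal sketch of what that might look like, assuming that "punctuation" means the characters in string.punctuation and using a hypothetical helper name words_from_lines:

```python
import string


def words_from_lines(lines):
    """Split each line on whitespace and strip punctuation from both ends."""
    unique = set()
    for line in lines:
        for token in line.split():
            # strip() only removes leading/trailing characters,
            # so an inner apostrophe as in "don't" is kept.
            word = token.strip(string.punctuation)
            if word:
                unique.add(word)
    return sorted(unique)


print(words_from_lines(["\"Hello,\" she said.\n", "Hello again?\n"]))
# → ['Hello', 'again', 'said', 'she']
```

Since strip() leaves the middle of each token alone, hyphenated or apostrophised words survive intact, which may or may not be what is wanted for a vocabulary list.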

Source

2017-10-17 19:55:18
