Automatically remove numeric and alphanumeric words, except those containing 'ème'

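Using sed, or any other program I can run on a directory of txt files, how can I remove pure numbers and combinations of digits and letters from the files, except those that are French ordinals (e.g. 2ème)?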

For example, if a text file contains

12h03 11:00 27.8.16 23 3ème bonjour

then I only want to keep

3ème bonjour

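Edit: bonjour is kept because no digit appears in it. 3ème is kept because it ends in ème (an ordinal). The other tokens are removed because they contain digits but are not ordinals.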
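The rule stated in the edit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `keep_token` and `filter_line` are hypothetical, tokens are assumed to be separated by whitespace, and "ends in ème" is checked literally, so a token with trailing punctuation such as `3ème,` would not count as an ordinal here.

```python
import re

DIGIT = re.compile(r"\d")  # matches any single digit


def keep_token(token: str) -> bool:
    """Keep a token if it has no digits, or if it ends in 'ème'."""
    return DIGIT.search(token) is None or token.endswith("ème")


def filter_line(line: str) -> str:
    """Drop every whitespace-separated token that fails keep_token."""
    return " ".join(t for t in line.split() if keep_token(t))


print(filter_line("12h03 11:00 27.8.16 23 3ème bonjour"))  # 3ème bonjour
```

Applied to the example line from the question, only `3ème` and `bonjour` survive, which matches the desired output.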

Source

2016-09-14 jvh_ch

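What are the actual rules? You say you only want words containing 'ème', yet your example includes bonjour among the things to keep. Which part of the rule set allows 'bonjour' to be included in the output? – jwpfox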

Sorry for being unclear: a word should be removed if it contains a digit, unless it ends in 'ème'.

So the line "then I only want to keep '3ème bonjour'" should read "then I only want to keep '3ème'". Is that right? – jwpfox

jwpfox: I have never used Python, but I am willing to use it.

In the meantime, I wrote a pig-ugly R script that seems good enough for my purposes. I'll share it here.

# Sample text
text <- "Le 8 septembre à 11h30, Jean voyait les 2 filles pour la 3ème fois."

# Split on spaces
splittext <- unlist(strsplit(text, split = " "))

# Retain words that contain no digits, or that contain 'ème' or punctuation.
selecttext <- splittext[!grepl("\\d", splittext) |
                        grepl("ème", splittext) |
                        grepl("[[:punct:]]", splittext)]

# If a word contains both digits and punctuation, keep only its last character
# (assumed to be the punctuation mark). Requires the stringr package.
mixed <- grepl("\\d", selecttext) & grepl("[[:punct:]]", selecttext)
selecttext[mixed] <- stringr::str_sub(selecttext[mixed], start = -1, end = -1)

# Recombine
text2 <- paste(selecttext, collapse = " ")


> text2 
[1] "Le septembre à , Jean voyait les filles pour la 3ème fois."

The next step would be to read all the files in a directory, run them through the lines above, and overwrite the source files.

Source

2016-09-14 14:40:27

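Since you are open to a Python answer, here is one in Python 3.

Set the filepath variable to the root directory of the tree you want to work on. The script walks every file in the tree below that directory and applies the rules you gave.

Note that the rules you apply in your R code appear to differ from the rules you state in your question.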

import os
import re

filepath = 'testfiles'  # root directory of the tree to process
searchpattern = re.compile('[0-9]')  # matches any digit

for path, dirs, files in os.walk(filepath):
    for filename in files:
        curfilepath = os.path.join(path, filename)
        # Read the file and keep only the words that pass the rule.
        with open(curfilepath, mode='r', encoding='utf-8') as infile:
            cleanlines = []
            for line in infile:
                cleanline = ''
                words = line.split(' ')
                for word in words:
                    if 'ème' in word:
                        # Ordinals such as 3ème are always kept.
                        cleanline += word + ' '
                    elif searchpattern.search(word) is None:
                        # elif, so a digit-free word containing 'ème'
                        # (e.g. deuxième) is not added twice.
                        cleanline += word + ' '
                cleanline = cleanline.strip()
                if len(cleanline) > 0:
                    cleanlines.append(cleanline)

        # Overwrite the source file with the cleaned lines.
        with open(curfilepath, mode='w', encoding='utf-8') as outfile:
            for line in cleanlines:
                outfile.write(line + '\n')
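To make the difference between the two rule sets concrete, here is a small sketch that can behave either way. It is not part of the script above: `clean_line` and its `keep_punctuation` flag are hypothetical names, and "punctuation" is assumed to mean the ASCII characters in `string.punctuation`. With the flag off, a token containing digits is dropped entirely, as the question asks. With the flag on, the trailing punctuation of a dropped token is kept, which is what the R script does.

```python
import re
import string

DIGIT = re.compile(r"\d")
# One or more ASCII punctuation characters at the end of a token.
TRAILING_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]+$")


def clean_line(line: str, keep_punctuation: bool = False) -> str:
    kept = []
    for token in line.split():
        match = TRAILING_PUNCT.search(token)
        core = token[:match.start()] if match else token
        tail = match.group() if match else ""
        if DIGIT.search(core) is None or core.endswith("ème"):
            kept.append(token)   # digit-free word or ordinal: keep whole
        elif keep_punctuation and tail:
            kept.append(tail)    # R-script behaviour: keep punctuation only
    return " ".join(kept)


sample = "Le 8 septembre à 11h30, Jean voyait les 2 filles pour la 3ème fois."
print(clean_line(sample))
print(clean_line(sample, keep_punctuation=True))
```

On the sample sentence from the R script, the first call drops `11h30,` completely, while the second call leaves the comma behind and reproduces the R output. Because the ordinal check is made after trailing punctuation is split off, a token such as `3ème,` is also kept.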

Source

2016-09-15 08:55:23 jwpfox
