2017-10-11

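Python regex: iterating over the first line of every file in a directory

I want to iterate over .txt files and use the date found on the first line of each file (e.g. April 1, 1993).

This code works, but it searches the whole file rather than only the first line (note: the code below shows more of the loop than just the date matching):

The script below has been updated, and it works: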

import glob
import re
from shutil import move

from dateutil import parser

# Matches dates such as "April 1, 1993" or "Apr 1, 1993".
DATE_RE = re.compile(r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}')

articles = glob.glob("*.txt")
y = 1

for f in articles:
    wordcount = "x"
    parsed_pubdate = None  # reset for every file so an earlier file's date is never reused
    # First pass: pull the word count out of the "LENGTH:" line.
    with open(f, "r") as content:
        lines = content.readlines()
        for line in lines:
            if line[0:7] == "LENGTH:":
                lineclean = re.sub(r'[#%&\<>*?:/{}[email protected]+|=]', '', line)
                wordcount = lineclean[7:13]
                if wordcount[5] == "w":
                    wordcount = wordcount[0:4]
                elif wordcount[4] == "w":
                    wordcount = wordcount[0:3]
                elif wordcount[3] == "w":
                    wordcount = wordcount[0:2]
                elif wordcount[2] == "w":
                    wordcount = wordcount[0:1]
    # Second pass: search only the first line for the publication date.
    with open(f, "r") as content:
        first_line = next(content, "")  # "" if the file is empty
        match = DATE_RE.search(first_line)
        if match:
            parsed_pubdate = parser.parse(match.group()).strftime('%Y-%m-%d')
    try:
        # `source` is not defined in the code shown; it has to be set before this point.
        if wordcount != "x" and parsed_pubdate is not None:
            move(f, "{parsed_pubdate}_{wordcount}_{source}.txt".format(**locals()))
    except OSError:
        pass
    y += 1

To match the date only on the first line of the file, I tried adding ^\s and flags=re.MULTILINE, which gives:

match = re.search(r'^\s(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}', line, flags=re.MULTILINE).group()

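However, the program now uses only one date (the date from the last file in the folder) and applies it to every file, so every file gets the same date even though the dates in the original .txt files differ.

I have shown a larger part of the loop, but my question concerns only the regex date matching. Thanks in advance for your help!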
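The symptom described above can be reproduced in isolation. The sketch below is one plausible reading of it, not a confirmed diagnosis: when a search fails and the exception is silently ignored, the variable keeps the value from an earlier iteration. It also shows that re.MULTILINE makes ^ match at the start of every line rather than only the first, and that ^\s requires a whitespace character before the month. The sample lines and the helper name dates_with_stale_match are made up for illustration.

```python
import re

MONTHS = (r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|'
          r'Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)')
DATE = MONTHS + r'\s+\d{1,2},\s+\d{4}'

# Hypothetical first lines standing in for three different files.
first_lines = ["April 1, 1993\n", "no date on this line\n", "May 5, 1994\n"]


def dates_with_stale_match(lines, pattern):
    """Mimic a loop in which a failed search is silently ignored."""
    results = []
    match = None
    for line in lines:
        try:
            match = re.search(pattern, line, flags=re.MULTILINE).group()
        except AttributeError:  # re.search returned None, so .group() failed
            pass                # `match` still holds the previous value
        results.append(match)
    return results


# The second entry silently reuses the date of the first "file".
plain = dates_with_stale_match(first_lines, DATE)

# With ^\s in front, no sample line starts with whitespace, so nothing matches.
anchored = dates_with_stale_match(first_lines, r'^\s' + DATE)

# re.MULTILINE widens ^ to every line start; it does not restrict the
# search to the first line.
two_lines = "HEADLINE\nApril 1, 1993\n"
without_flag = re.search('^' + DATE, two_lines)
with_flag = re.search('^' + DATE, two_lines, flags=re.MULTILINE)
```

Resetting the variable at the top of each iteration, and skipping the file when nothing matched, avoids carrying a date over from another file.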

Since you strip the ":", shouldn't your wordcount slice start at 6? And if you only want to check the first line, why not first_line = content.readlines()[0]? –

@AlfredoMiranda: I would rather use 'first_line = next(content)', to avoid reading every line only to discard all but the first... –

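I tried 'first_line = content.readlines()[0]', but it gives exactly the same problem as the "first line" anchor in the regex. That is, it uses the date from only one .txt file and applies it to every file. Re: word count, that works fine in the current script. – Rens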
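The two ways of reading the first line discussed in these comments can be compared directly. This is a minimal sketch with made-up helper names and a temporary sample file; it suggests that both return the same line, so the choice between them is about efficiency rather than about which date is found.

```python
import os
import tempfile


def first_line_lazy(path):
    """Read only the first line; return '' if the file is empty."""
    with open(path, "r") as content:
        return next(content, "")


def first_line_eager(path):
    """Read every line, then keep the first; return '' if the file is empty."""
    with open(path, "r") as content:
        lines = content.readlines()
    return lines[0] if lines else ""


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical sample file for the comparison.
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w") as out:
        out.write("April 1, 1993\nLENGTH: 1234 words\n")
    lazy_result = first_line_lazy(sample)
    eager_result = first_line_eager(sample)

    # An empty file yields '' instead of raising StopIteration.
    empty = os.path.join(tmp, "empty.txt")
    open(empty, "w").close()
    empty_result = first_line_lazy(empty)
```

Passing a default to next() matters for empty files: a bare next(content) raises StopIteration there.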

Answers

import glob
import re
from shutil import move

from dateutil import parser

# Matches dates such as "April 1, 1993" or "Apr 1, 1993".
DATE_RE = re.compile(r'(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\s+\d{1,2},\s+\d{4}')

articles = glob.glob("*.txt")
y = 1

for f in articles:
    wordcount = "x"
    parsed_pubdate = None  # reset for every file so an earlier file's date is never reused
    # First pass: pull the word count out of the "LENGTH:" line.
    with open(f, "r") as content:
        lines = content.readlines()
        for line in lines:
            if line[0:7] == "LENGTH:":
                lineclean = re.sub(r'[#%&\<>*?:/{}[email protected]+|=]', '', line)
                wordcount = lineclean[7:13]
                if wordcount[5] == "w":
                    wordcount = wordcount[0:4]
                elif wordcount[4] == "w":
                    wordcount = wordcount[0:3]
                elif wordcount[3] == "w":
                    wordcount = wordcount[0:2]
                elif wordcount[2] == "w":
                    wordcount = wordcount[0:1]
    # Second pass: search only the first line for the publication date.
    with open(f, "r") as content:
        first_line = next(content, "")  # "" if the file is empty
        match = DATE_RE.search(first_line)
        if match:
            parsed_pubdate = parser.parse(match.group()).strftime('%Y-%m-%d')
    try:
        # `source` is not defined in the code shown; it has to be set before this point.
        if wordcount != "x" and parsed_pubdate is not None:
            move(f, "{parsed_pubdate}_{wordcount}_{source}.txt".format(**locals()))
    except OSError:
        pass
    y += 1
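
The date and word-count steps of the script can also be written as small functions that return None when nothing is found, which makes it harder to reuse a value from a previous file. This is only a sketch under stated assumptions: it uses the standard library's datetime.date instead of dateutil so that it runs without extra packages, it assumes the word-count line looks like "LENGTH: 1234 words" (inferred from the slicing above, not confirmed), and the names parse_pubdate, parse_wordcount and new_name are made up for illustration.

```python
import re
from datetime import date

MONTH_NUMBERS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}

DATE_RE = re.compile(
    r'(?P<month>Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|'
    r'Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)'
    r'\s+(?P<day>\d{1,2}),\s+(?P<year>\d{4})'
)

# Assumed line format: "LENGTH: 1234 words".
LENGTH_RE = re.compile(r'^LENGTH:\s*(\d+)\s*w')


def parse_pubdate(first_line):
    """Return the date on first_line as 'YYYY-MM-DD', or None if there is none."""
    m = DATE_RE.search(first_line)
    if m is None:
        return None
    month = MONTH_NUMBERS[m.group("month")[:3]]  # first three letters name the month
    try:
        return date(int(m.group("year")), month, int(m.group("day"))).isoformat()
    except ValueError:  # e.g. "Feb 30, 1993"
        return None


def parse_wordcount(lines):
    """Return the digits from the first 'LENGTH:' line, or None if there is none."""
    for line in lines:
        m = LENGTH_RE.match(line)
        if m:
            return m.group(1)
    return None


def new_name(lines, source):
    """Build the target file name, or return None if a piece is missing."""
    if not lines:
        return None
    pubdate = parse_pubdate(lines[0])
    wordcount = parse_wordcount(lines)
    if pubdate is None or wordcount is None:
        return None
    return "{}_{}_{}.txt".format(pubdate, wordcount, source)
```

A caller would read the lines of each file, call new_name, and move the file only when the result is not None.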