2014-04-25

findall function captures the wrong information

I am trying to write a piece of Python to read my file. The code is below:

import re, os 
captureLevel = [] # capture read scale. 
captureQID = [] # capture questionID. 
captureDesc = [] # capture description. 
file=open(r'E:\Grad\LIS\LIS590 Text mining\Final_Project\finalproject_data.csv','rt') 
newfile=open('finalwordlist.csv','w') 
mytext=file.read() 

for row in mytext.split('\n'): 
    grabLevel=re.findall(r'(\d{1})+\n',row) 
    captureLevel.append(grabLevel)  
    grabQID=re.findall(r'(\w{1}\d{5})',row) 
    captureQID.append(grabQID)    #ERROR LINE. 
    grabDesc=re.findall(r'\,+\s+(\w.+)',row) 
    captureDesc.append(grabDesc) 

    lineCount = 0 
    wordCount = 0 
    lines = ''.join(grabDesc).split('.') 
    for line in lines: 
      lineCount +=1 
      for word in line.split(' '): 
       wordCount +=1 
       newfile.write(''.join(grabLevel) + '|' + ''.join(grabQID) + '|' + str(lineCount) + '|' + str(wordCount) + '|' + word + '\n') 

newfile.close()

Here are three lines of my data:

a00004," another oakstr eetrequest, helped student request item",2
a00005, asked retiree if he used journal on circ list,2
a00006, asked scientist about owner of some archival notes,2

Here is the result:

22|a00002|1|1|a00002,
22|a00002|1|2|
22|a00002|1|3|scientist
22|a00002|1|4|looking
22|a00002|1|5|for

The first column of the result should be a single digit, so why does it print two digits?

Any idea what is wrong here? Thanks.

Answer


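Once again, it comes down to the difference between tabs and spaces. This needs special attention in Python: spaces are not treated as equivalent to tabs. Here is a useful link that discusses the difference: http://legacy.python.org/dev/peps/pep-0008/. In short, it recommends indenting with spaces. However, I have found that tabs also work for indentation. What matters is keeping the indentation consistent, so if you use tabs, make sure you use them everywhere.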
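If you want to check a script for this, here is a minimal sketch that flags lines whose leading whitespace contains tabs or a mix of tabs and spaces. The function name `find_mixed_indentation` and the sample string are made up for illustration; they are not from the question's code.

```python
def find_mixed_indentation(source):
    """Return (line_number, kind) pairs for lines indented with tabs.

    kind is 'tabs' when the indent contains only tabs, and 'mixed'
    when it contains both tabs and spaces. Hypothetical helper,
    written only to illustrate the point about consistency.
    """
    problems = []
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace = the part removed by lstrip(' \t').
        indent = line[:len(line) - len(line.lstrip(' \t'))]
        if '\t' in indent and ' ' in indent:
            problems.append((number, 'mixed'))
        elif '\t' in indent:
            problems.append((number, 'tabs'))
    return problems


# Illustrative input: line 2 uses spaces, line 3 a tab, line 4 both.
sample = "for row in rows:\n    count = 0\n\tcount += 1\n \tprint(count)\n"
print(find_mixed_indentation(sample))  # [(3, 'tabs'), (4, 'mixed')]
```

To run it on a real script, read the file into a string and pass it in. Note also that Python 3 raises a TabError when tabs and spaces are mixed inconsistently, and Python 2 can be started with the -tt option to turn such mixing into an error.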