2013-02-21 114 views
0

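"List index out of range" in Python

I have Python code that indexes a text file containing Arabic words. I tested the code on English text and it works fine, but when I test it on Arabic text it raises an error. Note: the text file is saved in Unicode encoding, not ANSI.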

This is my code:

from whoosh import fields, index 
import os.path 
import csv 
import codecs 
from whoosh.qparser import QueryParser 

# This list associates a name with each position in a row 
columns = ["juza","chapter","verse","voc"] 

schema = fields.Schema(juza=fields.NUMERIC, 
         chapter=fields.NUMERIC, 
         verse=fields.NUMERIC, 
         voc=fields.TEXT) 

# Create the Whoosh index 
indexname = "indexdir" 
if not os.path.exists(indexname): 
    os.mkdir(indexname) 
ix = index.create_in(indexname, schema) 

# Open a writer for the index 
with ix.writer() as writer: 
    with open("h.txt", 'r') as txtfile: 
        lines=txtfile.readlines() 

    # Read each row in the file 
    for i in lines: 

        # Create a dictionary to hold the document values for this row 
        doc = {} 
        thisline=i.split() 
        u=0 

        # Read the values for the row enumerated like 
        # (0, "juza"), (1, "chapter"), etc. 
        for w in thisline: 
            # Get the field name from the "columns" list 
            fieldname = columns[u] 
            u+=1 
            #if isinstance(w, basestring): 
            #  w = unicode(w) 
            doc[fieldname] = w 
        # Pass the dictionary to the add_document method 
        writer.add_document(**doc) 
with ix.searcher() as searcher: 
    query = QueryParser("voc", ix.schema).parse(u"بسم") 
    results = searcher.search(query) 
    print(len(results)) 
    print(results[1]) 

The error is:

Traceback (most recent call last): 
    File "C:\Python27\yarab.py", line 38, in <module> 
    fieldname = columns[u] 
IndexError: list index out of range 

This is a sample of the file:

1 1 1 كتاب 
1 1 2 قرأ 
1 1 3 لعب 
1 1 4 كتاب 
+3

Have you printed the result of 'thisline = i.split()'? It undoubtedly has more than 4 items. – StoryTeller 2013-02-21 16:10:10

+0

It is better to use the Python csv module for this. Take a look [here](http://docs.python.org/2/library/csv.html) – Crazyshezy 2013-10-23 08:53:12
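
A minimal sketch of the csv approach suggested in this comment, assuming Python 3 (where the csv module reads text directly; under Python 2.7 it works on byte strings and would need explicit decoding) and rows delimited by single spaces with no trailing space. `read_rows` and the in-memory sample are illustrative names, not part of the original code.

```python
import csv
import io

COLUMNS = ["juza", "chapter", "verse", "voc"]

def read_rows(stream):
    """Yield one dict per row of a space-delimited text stream."""
    reader = csv.DictReader(stream, fieldnames=COLUMNS, delimiter=" ",
                            skipinitialspace=True)
    for row in reader:
        yield row

# In-memory stand-in for the real file (hypothetical sample data)
sample = io.StringIO(u"1 1 1 كتاب\n1 1 2 قرأ\n")
rows = list(read_rows(sample))
print(len(rows))
```

A row with more fields than `COLUMNS` does not raise here: `DictReader` collects the surplus values under the `None` key, so such rows can be detected by checking `None in row`.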

Answers

0

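While I can't see any obvious error at the moment, I would make sure you are designing for error. Make sure you catch any case where split() returns more elements than expected and handle it promptly (for example, print the offending line and terminate). It looks like you may be dealing with malformed data.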
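
One way this check could look, as a sketch rather than a definitive fix: split the line, compare the number of fields with the number of column names, and fail with a message that shows exactly what split() returned. `parse_row` and its `line_number` parameter are hypothetical names introduced for illustration.

```python
COLUMNS = ["juza", "chapter", "verse", "voc"]

def parse_row(line, line_number):
    """Split one line into fields and map them onto COLUMNS.

    Raises ValueError if the number of fields differs from len(COLUMNS).
    """
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(
            "line %d: expected %d fields, got %d: %r"
            % (line_number, len(COLUMNS), len(fields), fields)
        )
    return dict(zip(COLUMNS, fields))

# A well-formed row maps cleanly onto the column names
doc = parse_row(u"1 1 1 كتاب", 1)

# A row with an extra token is reported instead of raising IndexError
try:
    parse_row(u"1 1 1 one two", 2)
except ValueError as exc:
    error = str(exc)
```

Note that a blank line splits into zero fields and would also raise here, so the caller may want to skip empty lines before calling `parse_row`.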

0

You are missing the Unicode header in your script. The first line should be:

# encoding: utf-8

Also, open the file with a Unicode encoding by using:

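import codecs 
# Decode the file as UTF-8 while reading, so each line is a unicode string
with codecs.open("s.txt",encoding='utf-8') as txtfile:
    lines = txtfile.readlines()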
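
To illustrate why decoding matters, here is a small sketch that reads the file as text, so split() works on Unicode characters rather than on raw bytes. It assumes the file is UTF-8, possibly with a byte order mark (hence `utf-8-sig`). If the editor saved the file as "Unicode", which on Windows often means UTF-16, the `encoding` argument would need to be `"utf-16"` instead; that is an assumption to verify against the actual file. `read_fields` and the temporary sample file are hypothetical.

```python
import io
import os
import tempfile

def read_fields(path, encoding="utf-8-sig"):
    """Return a list of token lists, one per non-blank line of the file."""
    with io.open(path, "r", encoding=encoding) as txtfile:
        return [line.split() for line in txtfile if line.strip()]

# Write a small UTF-8 sample (with a BOM) to a temporary file
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(u"1 1 1 كتاب\n1 1 2 قرأ\n".encode("utf-8-sig"))
    sample_path = tmp.name

rows = read_fields(sample_path)
os.remove(sample_path)
```

`io.open` behaves the same on Python 2.7 and Python 3, which is why it is used here in place of the built-in `open`.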