2017-02-17 24 views

Python Lucene: adding field contents to a document is not working

I am using Python Lucene to index URL pages.

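I get an error when I try to add a field to a document, and I don't know why. The error says:

JavaError: <super: <class 'JavaError'>, <JavaError object>>
    Java stacktrace:
java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
    at org.apache.lucene.document.Field.<init>(Field.java:249)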

It is raised at the line where I call: doc.add(Field("contents", text, t2))

The Python code I am using is:

# Imports are not shown in the original post; these are the ones the code needs.
import os
import re
import robotparser
import urllib2
import urlparse

import lucene
from bs4 import BeautifulSoup
from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, FieldType
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory


def IndexerForUrl(start, number, domain):
    lucene.initVM()
    # join base dir and index dir
    path = os.path.abspath("paths")
    directory = SimpleFSDirectory(Paths.get(path))  # the index

    analyzer = StandardAnalyzer()
    writerConfig = IndexWriterConfig(analyzer)
    writerConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
    writer = IndexWriter(directory, writerConfig)

    print "reading lines from sys.std..."

    # hashtable dictionary: page -> links found on that page
    D = {}
    D[start] = [start]

    numVisited = 0
    wordBool = False
    n = start

    queue = [start]
    visited = set()

    t1 = FieldType()
    t1.setStored(True)
    t1.setTokenized(False)

    t2 = FieldType()
    t2.setStored(False)
    t2.setTokenized(True)

    while numVisited < number and queue and not wordBool:
        pg = queue.pop(0)

        if pg not in visited:
            visited.add(pg)

            htmlwebpg = urllib2.urlopen(pg).read()
            # robot exclusion standard: robots.txt lives at the site root
            rp = robotparser.RobotFileParser()
            rp.set_url(urlparse.urljoin(pg, "/robots.txt"))
            rp.read()  # read robots.txt and feed it to the parser

            soup = BeautifulSoup(htmlwebpg, 'html.parser')
            # drop script and style elements before extracting the text
            for script in soup(["script", "style"]):
                script.extract()
            text = soup.get_text()

            # strip whitespace and drop empty chunks
            lines = (line.strip() for line in text.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
            text = '\n'.join(chunk for chunk in chunks if chunk)

            print text

            doc = Document()
            doc.add(Field("urlpath", pg, t2))
            if len(text) > 0:
                doc.add(Field("contents", text, t2))  # the line named in the error
            else:
                print "warning: no content in %s " % pg

            writer.addDocument(doc)
            numVisited = numVisited + 1

            linkset = set()
            # collect outgoing links that robots.txt allows
            for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
                href = link.get('href')
                if rp.can_fetch("*", href):  # can_fetch takes (useragent, url)
                    linkset.add(href)

            D[pg] = linkset
            queue.extend(D[pg] - visited)

    writer.commit()
    writer.close()
    directory.close()  # close the index
    return writer

Answer


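If a field is neither indexed nor stored, it is not represented in the index in any way, so there is no point in adding it. I'm guessing you want the FieldType t2 to be indexed. To do that, you need to set the IndexOptions, something like: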

t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) 
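
For context, here is a minimal sketch of how the two field types from the question might look with that change applied. It assumes PyLucene built against Lucene 5 or later, where IndexOptions is importable from org.apache.lucene.index. The helper name make_field_types is made up for illustration and is not part of the original code.

```python
def make_field_types():
    """Build the two FieldType objects from the question, with t2 indexed.

    Hypothetical helper. The imports are local so this file can be loaded
    on a machine without PyLucene; calling the function needs PyLucene
    installed and lucene.initVM() already called.
    """
    from org.apache.lucene.document import FieldType
    from org.apache.lucene.index import IndexOptions

    # t1: stored verbatim and not tokenized, exactly as in the question.
    t1 = FieldType()
    t1.setStored(True)
    t1.setTokenized(False)

    # t2: not stored, tokenized, and now indexed. Without the last call
    # the type is neither stored nor indexed, which is what Field rejects.
    t2 = FieldType()
    t2.setStored(False)
    t2.setTokenized(True)
    t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)

    return t1, t2
```

With t1, t2 = make_field_types() in place of the two hand-built types, doc.add(Field("contents", text, t2)) should no longer raise. Since t1 is stored, it may be the better fit for the "urlpath" field, but that is a guess about what the question intended.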

Oh, thanks. I will try it.