Integrating extracted PDF content with django-haystack

I have extracted PDF/DOCX content with Solr, and I have succeeded in running some search queries against it using the following Solr URL:

http://localhost:8983/solr/select?q=Lycee
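
For reference, the same request URL can be put together in Python with only the standard library. This is just a minimal sketch: `build_select_url` is a hypothetical helper name, and the default base URL is simply the local one shown above.

```python
from urllib.parse import urlencode


def build_select_url(query, base="http://localhost:8983/solr/select", **extra):
    """Return a Solr select URL for `query`.

    Hypothetical helper for illustration; not part of Solr or Haystack.
    Any extra keyword arguments are appended as additional query parameters.
    """
    params = {"q": query}
    params.update(extra)
    # urlencode percent-encodes values, so non-ASCII terms are safe to pass in.
    return base + "?" + urlencode(params)


print(build_select_url("Lycee"))
```

Going through `urlencode` rather than string concatenation keeps the URL valid when the search term contains spaces or accented characters.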

I want to build the same kind of query with django-haystack. I found this link, which discusses the topic:

https://github.com/toastdriven/django-haystack/blob/master/docs/rich_content_extraction.rst

But there is no "FileIndex" class in django-haystack (2.0.0 beta). How can I integrate this kind of search into django-haystack?

Source

2012-12-26 Mohamed Ali

The "FileIndex" referenced in the docs is a hypothetical subclass of haystack.indexes.SearchIndex. Here is an example:

from django.template import Context, loader
from haystack import indexes
from myapp.models import MyFile


class FileIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(model_attr='title')
    owner = indexes.CharField(model_attr='owner__name')

    def get_model(self):
        return MyFile

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

    def prepare(self, obj):
        data = super(FileIndex, self).prepare(obj)

        # This could also be a regular Python open() call, a StringIO instance
        # or the result of opening a URL. Note that due to a library limitation
        # file_obj must have a .name attribute even if you need to set one
        # manually before calling extract_file_contents:
        file_obj = obj.the_file.open()

        # Depending on your Haystack version, the backend may instead be
        # reached through self.get_backend(); check the docs for your release.
        extracted_data = self.backend.extract_file_contents(file_obj)

        # Now we'll finally perform the template processing to render the
        # text field with *all* of our metadata visible for templating:
        t = loader.select_template(('search/indexes/myapp/myfile_text.txt',))
        data['text'] = t.render(Context({'object': obj,
                                         'extracted': extracted_data}))

        return data

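So extracted_data would be replaced with the result of whatever process you came up with for extracting the PDF/DOCX content. You would then update your template to include that data.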
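
To illustrate swapping in your own extraction step, here is a minimal standard-library sketch. Both function names are hypothetical, and it assumes your extractor should return the same shape as the backend's extract_file_contents, namely a dict with "contents" and "metadata" keys. The second function only approximates in plain Python what the index template would render; it is not a replacement for the template.

```python
import io


def extract_plain_text(file_obj):
    """Hypothetical stand-in extractor for files that are already plain text.

    Returns a dict with "contents" and "metadata" keys, the shape assumed
    here for the result of the backend's extract_file_contents().
    """
    raw = file_obj.read()
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8", errors="replace")
    return {
        "contents": raw.strip(),
        "metadata": {"name": getattr(file_obj, "name", None)},
    }


def build_text_field(title, extracted):
    """Join the title and the extracted body into one searchable string."""
    # Tolerate a failed extraction (None) by indexing the title alone.
    body = extracted["contents"] if extracted else ""
    return "\n".join(part for part in (title, body) if part)


# An in-memory file with a .name attribute set by hand, as the comment
# in the index class above suggests may be necessary.
sample = io.BytesIO(b"Lycee Pierre de Fermat\n")
sample.name = "school.txt"

extracted = extract_plain_text(sample)
print(build_text_field("School list", extracted))
```

Inside prepare() you would call your extractor in place of the backend call and pass its result to the template context under the same 'extracted' key.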

Source

2014-07-21 22:59:56 user3470130
