Python - urllib3: getting text from docx using a Tika server

I am using python3, urllib3 and tika-server-1.13 to get text from different types of files. This is my Python code:

def get_text(self, input_file_path, text_output_path, content_type):
    global config

    headers = util.make_headers()
    mime_type = ContentType.get_mime_type(content_type)
    if mime_type != '':
        headers['Content-Type'] = mime_type

    with open(input_file_path, "rb") as input_file:
        fields = {
            'file': (os.path.basename(input_file_path), input_file.read(), mime_type)
        }

    retry_count = 0
    while retry_count < int(config.get("Tika", "RetriesCount")):
        response = self.pool.request('PUT', '/tika', headers=headers, fields=fields)
        if response.status == 200:
            data = response.data.decode('utf-8')
            text = re.sub(r"[\[][^\]]+[\]]", "", data)
            final_text = re.sub(r"(\n(\t\r)*\n)+", "\n\n", text)
            with open(text_output_path, "w+") as output_file:
                output_file.write(final_text)
            break
        else:
            if retry_count == (int(config.get("Tika", "RetriesCount")) - 1):
                return False
            retry_count += 1
    return True

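This code works for HTML files, but it fails when I try to extract text from a docx file.

The server responds with HTTP error code 422: Unprocessable Entity.

Following the tika-server documentation, I tried curl to check whether the server can handle the file: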

curl -X PUT --data-binary @test.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"

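And it worked.

From the tika-server docs:

422 Unprocessable Entity - Unsupported mime-type, encrypted document & etc

This is the correct MIME type (I also checked it with Tika's detection system), the type is supported, and the file is not encrypted.

I believe this has to do with how I upload the file to the Tika server. What am I doing wrong?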


2016-08-01 Danny Hambourg

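You are not uploading the data the same way. --data-binary in curl simply uploads the binary data as-is, with no encoding. In urllib3, using fields causes urllib3 to generate a multipart/form-data encoded message. On top of that, by overwriting the Content-Type header you are preventing urllib3 from setting it correctly on the request, so Tika cannot make sense of the body. Either stop updating headers['Content-Type'], or simply pass body=input_file.read() instead of fields.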
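For illustration, here is a minimal sketch of the second option, sending the raw bytes with body= so the request matches what the curl command sends. The names make_pool and put_raw_file, and the localhost:9998 endpoint, are assumptions made for this example and are not part of the original code; retries and the regex cleanup from the question are left out.

```python
def make_pool(host="localhost", port=9998):
    """Create a connection pool for a tika-server (host and port are assumed)."""
    # urllib3 is third-party; it is imported here so that put_raw_file
    # itself only depends on an object exposing a .request() method.
    import urllib3
    return urllib3.HTTPConnectionPool(host, port=port)


def put_raw_file(pool, input_file_path, mime_type):
    """PUT the file bytes unchanged to /tika, like `curl --data-binary`.

    Returns the decoded response text, or None if the status is not 200.
    """
    with open(input_file_path, "rb") as input_file:
        payload = input_file.read()

    headers = {}
    if mime_type:
        # Safe to set here: the body really is a document of this type,
        # not a multipart wrapper around it.
        headers["Content-Type"] = mime_type

    # body= sends the bytes as-is; fields= would build a multipart body.
    response = pool.request("PUT", "/tika", body=payload, headers=headers)
    if response.status != 200:
        return None
    return response.data.decode("utf-8")
```

With a server running, this could be called as put_raw_file(make_pool(), "test.docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document").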


2016-08-02 11:37:48

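I believe you can do this more easily by using the tika-python module in client mode.

If you still insist on writing your own client, the source code of that module may contain some clues about how it handles all these different MIME types... if you have a problem with *.docx, you will probably run into problems with other formats too.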


2018-01-25 18:39:37 joefromct

I just realized this question is 2 years old. Oops! – joefromct
