'content' is too large when indexing blob content for Azure Search

I set up blob indexing and full-text search in Azure as described in this article: Indexing Documents in Azure Blob Storage with Azure Search.

Some of my files fail to index and return the following error:

Field 'content' contains a term that is too large to process. The max length for UTF-8 encoded terms is 32766 bytes. The most likely cause of this error is that filtering, sorting, and/or faceting are enabled on this field, which causes the entire field value to be indexed as a single term. Please avoid the use of these options for large fields.

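The specific PDF that produces this error is 3.68 MB and contains a mix of content (text, tables, images, etc.).

The index and indexer are set up exactly as described in that article, with some added file type restrictions.

Index: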

{
    "name": "my-index",
    "fields": [{
        "name": "id",
        "type": "Edm.String",
        "key": true,
        "searchable": false
    }, {
        "name": "content",
        "type": "Edm.String",
        "searchable": true
    }]
}

Indexer:

{
    "name": "my-indexer",
    "dataSourceName": "my-data-source",
    "targetIndexName": "my-index",
    "schedule": {
        "interval": "PT2H"
    },
    "parameters": {
        "maxFailedItems": 10,
        "configuration": {
            "indexedFileNameExtensions": ".pdf,.doc,.docx,.xls,.xlsx,.ppt,.pptx,.html,.xml,.eml,.msg,.txt,.text"
        }
    }
}

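I searched their documentation and a few related articles, but couldn't find any information on this. I'm guessing that's because the feature is still in preview.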

Source

2016-07-11 valverij

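There is a limit on the size of a single term in the search index, and it also happens to be 32 KB. If the content field in your search index is marked as filterable, facetable, or sortable, you will hit this limit (regardless of whether the field is also marked searchable), because those options cause the entire field value to be indexed as one term. Typically, for large searchable content, you want to enable searchable and sometimes retrievable, but none of the others. That way you won't run into content length limits on the index side.

For more context, see this answer.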
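As a rough sketch of that advice, the index definition from the question could be rewritten along these lines. It keeps the same index and field names and only adds explicit values for the attributes discussed here; treat it as an illustration of the idea rather than the definition the article should ship with.

```json
{
    "name": "my-index",
    "fields": [{
        "name": "id",
        "type": "Edm.String",
        "key": true,
        "searchable": false
    }, {
        "name": "content",
        "type": "Edm.String",
        "searchable": true,
        "retrievable": true,
        "filterable": false,
        "sortable": false,
        "facetable": false
    }]
}
```

With filterable, sortable, and facetable turned off, the content field should only be tokenized for full-text search, so no single term needs to hold the whole document. Attributes of an existing field generally cannot be changed in place, so applying this will most likely mean recreating the index and re-running the indexer.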

Source

2016-07-11 16:44:59

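Makes sense. So by default, the field is marked as "filterable", "facetable", and/or "sortable"? – valverij

Yes, string fields are sortable/filterable/facetable by default. See the [Create Index API](https://msdn.microsoft.com/zh-cn/library/azure/dn798941.aspx) for all the details. –

We should update the sample index in the blob indexing article to set all of these attributes to false. –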
