2015-09-06

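Finding the person referenced in a document

Say I have:

  • A database of 13,000 person records, each with first name, name, birthday, street, zip code, city

  • A long text that contains the personal details of one specific person. Because it was processed by OCR, it may contain spelling errors

Here is the text: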

Harry Potter, born 25.03.1995, resident at Jahnstreet 43, London is a series of seven fantasy novels written by British author J. K. Rowling. The series chronicles the adventures of a young wizard, Harry Potter, the titular character, and his friends Ronald Weasley and Hermione Granger, all of whom are students at Hogwarts School of Witchcraft and Wizardry. The main story arc concerns Harry's quest to defeat the Dark wizard Lord Voldemort, who aims to become immortal, conquer the wizarding world, subjugate non-magical people, and destroy all those who stand in his way, especially Harry Potter. Since the release of the first novel, Harry Potter and the Philosopher's Stone, on 30 June 1997, the books have gained immense popularity, critical acclaim and commercial success worldwide.[2] The series has also had some share of criticism, including concern about the increasingly dark tone as the series progressed. As of May 2015, the books have sold more than 450 million copies worldwide, making the series the best-selling book series in history, and have been translated into 73 languages.[3][4] The last four books consecutively set records as the fastest-selling books in history, with the final installment selling roughly 11 million copies in the United States within the first 24 hours of its release. A series of many genres, including fantasy, coming of age and the British school story (with elements of mystery, thriller, adventureand romance), it has many cultural meanings and references.[5] According to Rowling, the main theme is death.[6] There are also many other themes in the series, such as prejudice and corruption.[7] 


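Now I want to find the person in the database who is referenced in this document.


I have several ideas about how to do this, but I don't know which one gives the best results. Which approach would you prefer or recommend? Thanks.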

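  1. I split the text into an array, go through each birthday in the database, and look for it with JavaScript's text.search('25.03.1995'). When there is a hit, I move on to the next field, e.g. text.search('Harry'). If several fields hit, I have found the right record.

    • Pros: easy to implement, no database commands needed, pure JavaScript
    • Cons: if the OCR makes a mistake and reads e.g. Harly instead of Harry, I cannot recognize it. The same happens if the date format differs
  2. First, I index the text with the help of the database. Then I use an approach similar to the first one and go through every column in the database, but now with the database's CONTAINS

    • Pros: faster, better results?
    • Cons: I need a database with good full-text search
  3. I split the text and search the database columns for every single word with SQL LIKE

    • Pros: I don't have to index the document; better than CONTAINS?
    • Cons: not as fast as a text index?

Thanks for your help in this matter.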
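For reference, the first idea could be sketched roughly as below. This is only a minimal sketch under assumptions: the record shape (`firstName`, `name`, `birthday` as an ISO `YYYY-MM-DD` string, `street`, `city`) and the helper names `dateVariants` and `findByExactMatch` are made up for illustration and are not the real schema. It uses `String.prototype.includes` rather than `search`, because `search` turns its argument into a regular expression, so the dots in '25.03.1995' would match any character.

```javascript
// Render an ISO date (YYYY-MM-DD) in a few formats the text might use.
// The list of formats is an assumption; extend it for your documents.
function dateVariants(iso) {
  const [y, m, d] = iso.split('-');
  return [
    `${d}.${m}.${y}`,                  // 25.03.1995
    `${d}/${m}/${y}`,                  // 25/03/1995
    `${y}-${m}-${d}`,                  // 1995-03-25
    `${Number(d)}.${Number(m)}.${y}`,  // 25.3.1995 (no leading zeros)
  ];
}

// Idea 1: keep the records whose birthday occurs literally in the text,
// then count how many of their other fields also occur literally.
function findByExactMatch(text, records) {
  const haystack = text.toLowerCase();
  return records
    .filter((r) => dateVariants(r.birthday).some((v) => haystack.includes(v)))
    .map((r) => {
      const fields = [r.firstName, r.name, r.street, r.city];
      const hits = fields.filter((f) => haystack.includes(f.toLowerCase())).length;
      return { record: r, hits };
    })
    .sort((a, b) => b.hits - a.hits); // most confirmed fields first
}
```

Calling `findByExactMatch(text, records)` returns the candidate records ordered by how many fields were confirmed. Trying several date formats softens the second drawback of idea 1, but an OCR misread inside the date or a name still causes a miss.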

Maybe some kind of fuzzy search can help you overcome the OCR errors. Try this example: http://glench.github.io/fuzzyset.js/
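To illustrate what a fuzzy comparison could look like, here is a small hand-rolled sketch. It does not use fuzzyset.js or its API. It computes a Dice coefficient over character bigrams, and the function names `bigrams`, `similarity` and `bestWordMatch` are made up for this example.

```javascript
// Split a string into overlapping character bigrams: "harry" -> ha, ar, rr, ry.
function bigrams(s) {
  const t = s.toLowerCase();
  const out = [];
  for (let i = 0; i < t.length - 1; i++) out.push(t.slice(i, i + 2));
  return out;
}

// Dice coefficient: 2 * shared bigrams / total bigrams, in the range 0..1.
function similarity(a, b) {
  const A = bigrams(a);
  const B = bigrams(b);
  if (A.length === 0 || B.length === 0) {
    return a.toLowerCase() === b.toLowerCase() ? 1 : 0;
  }
  const pool = new Map();
  for (const g of A) pool.set(g, (pool.get(g) || 0) + 1);
  let shared = 0;
  for (const g of B) {
    const n = pool.get(g) || 0;
    if (n > 0) {
      shared++;
      pool.set(g, n - 1); // each bigram of A can be matched only once
    }
  }
  return (2 * shared) / (A.length + B.length);
}

// Find the word of the text that is most similar to a database value.
function bestWordMatch(value, text) {
  const words = text.split(/[^\p{L}\p{N}]+/u).filter(Boolean);
  let best = { word: null, score: 0 };
  for (const w of words) {
    const score = similarity(value, w);
    if (score > best.score) best = { word: w, score };
  }
  return best;
}

console.log(similarity('Harry', 'Harly')); // 0.5 (2 of 4 bigrams shared)
```

With a score like this, a misread such as Harly still counts as a partial hit for Harry. The cut-off above which a score counts as a hit would have to be tuned on real OCR output.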

Answer

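Because of OCR errors, I think you will sometimes have to sort through multiple possible matches, and 13,000 entries do not need much memory. So it is probably easier to use the first approach and do everything in JS. Either way, you will have to try parsing the CSV.

I think it depends on how bad the OCR is. If it is bad, a full-text index might help.

You could also try a string distance from something like the natural module on npm.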

Thanks for your help!

OK. I added another idea that I just had.