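My first thought: for a job of only 50 pages, you will probably save more time by simply doing it by hand. But if you have a good data scientist on your team, you could try gensim. The state-of-the-art technique for comparing two phrases is word embeddings. You can think of them as a way of turning words into high-dimensional vectors (200 to 1000 dimensions) by training on millions of documents.
For example, if your query string is "Human computer interaction", you look for the documents most similar to it and get back a list of (document index, similarity score) pairs, sorted from most to least similar: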
[(2, 0.99844527), # The EPS user interface management system
(0, 0.99809301), # Human machine interface for lab abc computer applications
(3, 0.9865886), # System and human system engineering testing of EPS
(1, 0.93748635), # A survey of user opinion of computer system response time
(4, 0.90755945), # Relation of user perceived response time to error measurement
(8, 0.050041795), # Graph minors A survey
(7, -0.098794639), # Graph minors IV Widths of trees and well quasi ordering
(6, -0.1063926), # The intersection graph of paths in trees
(5, -0.12416792)] # The generation of random binary unordered trees
Source: https://radimrehurek.com/gensim/tut3.html
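
To make the idea of "rank documents by how close their vectors are to the query vector" concrete, here is a minimal sketch using only the Python standard library. It is not the gensim code from the tutorial: it swaps the trained model for plain bag-of-words counts and cosine similarity, so the scores will differ from the ones above. The stop-word list and the helper names (`vectorize`, `cosine`, `rank`) are made up for this illustration.

```python
import math
import re
from collections import Counter

# The nine titles from the output above, ordered by their document index.
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

# Tiny stop-word list, an assumption made for this sketch.
STOPWORDS = {"a", "and", "for", "in", "of", "the", "to"}


def vectorize(text):
    """Turn a string into a sparse bag-of-words vector (word -> count)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (document index, score) pairs, most similar first."""
    q = vectorize(query)
    scores = [(i, cosine(q, vectorize(d))) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda pair: -pair[1])


ranking = rank("Human computer interaction", documents)
for index, score in ranking:
    print(index, round(score, 3), documents[index])
```

With raw word counts, a document only scores above zero if it shares a literal word with the query, so "The EPS user interface management system" (index 2) gets a score of 0 here even though it ranks first in the output above. Closing that gap is exactly what a trained vector space is for: related words end up near each other even when the strings do not overlap.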