Stanford NER and POS, multithreading for large data

I am trying to parse around 23000 documents with the Stanford NER and Stanford POS taggers. I tried to implement it using the following pseudocode:

`for each in documents:
    eachSentences = PunktTokenize(each)
    # code to generate NER tags
    # code to generate POS tags on the above output`

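On a 4-core machine with 15 GB of RAM, the estimated run time for NER alone is about 945 hours. I tried to speed things up with the `threading` library, but I get the following error: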

`Exception in thread Thread-2: 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner 
    self.run() 
    File "/usr/lib/python2.7/threading.py", line 754, in run 
    self.__target(*self.__args, **self.__kwargs) 
    File "removeStopWords.py", line 75, in partofspeechRecognition 
    listOfRes_new = namedEntityRecognition(listRes[min:max]) 
    File "removeStopWords.py", line 63, in namedEntityRecognition 
    listRes_ner.append(namedEntityRecognitionResume(eachResSentence)) 
    File "removeStopWords.py", line 50, in namedEntityRecognitionResume 
    ner2Tags = ner2.tag(each.title().split()) 
    File "/home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py", line 71, in tag 
    return sum(self.tag_sents([tokens]), []) 
    File "/home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py", line 98, in tag_sents 
    os.unlink(self._input_file_path) 
OSError: [Errno 2] No such file or directory: '/tmp/tmpvMNqwB'`

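I am using NLTK version 3.2.1 and the Stanford NER and POS 3.7.0 jar files, together with the threading module. As far as I can tell, this may be caused by thread locking on /tmp. Please correct me if I am wrong. What is the best way to run this with threads, or is there a better way to implement it?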
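One possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: the failing line is `os.unlink(self._input_file_path)`, so the temp-file path lives on the tagger instance. Assuming that path is also created inside the same call, two threads sharing one tagger can overwrite each other's path, and the second `unlink` then fails with `Errno 2`. The `ToyTagger` class below is hypothetical and only mimics that pattern; the overlap is replayed step by step in one thread so the outcome is deterministic.

```python
import os
import tempfile


class ToyTagger(object):
    """Hypothetical stand-in: keeps its temp-file path on the instance."""

    def start(self):
        # Create the input file and remember its path on self.
        fd, self._input_file_path = tempfile.mkstemp(text=True)
        os.close(fd)
        return self._input_file_path

    def finish(self):
        # Delete whatever path the instance currently remembers.
        os.unlink(self._input_file_path)


def replay_shared_instance():
    """Replay two overlapping calls on one shared instance, in order."""
    shared = ToyTagger()
    path_a = shared.start()   # "thread A" creates its file
    path_b = shared.start()   # "thread B" overwrites the stored path
    shared.finish()           # A finishes first and deletes B's file
    try:
        shared.finish()       # B then tries to delete the same path again
        error = None
    except OSError as exc:
        error = exc
    os.unlink(path_a)         # clean up A's file, which nobody deleted
    return path_a, path_b, error


path_a, path_b, error = replay_shared_instance()
print(error)
```

If this reading is right, giving each thread its own tagger instance would avoid this particular collision, although it would not reduce the cost of each call.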

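I am using the 3 Class Classifier for NER and the Maxent POS Tagger.

P.S. - Please ignore the name of the Python file; I have not yet removed stopwords or punctuation from the original text.

Edit - Using cProfile and sorting by cumulative time, I got the following top 20 calls: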

600792 function calls (595912 primitive calls) in 60.795 seconds 

Ordered by: cumulative time 
List reduced from 3357 to 20 due to restriction <20> 

ncalls tottime percall cumtime percall filename:lineno(function) 
    1 0.000 0.000 60.811 60.811 removeStopWords.py:1(<module>) 
    1 0.000 0.000 58.923 58.923 removeStopWords.py:76(partofspeechRecognition) 
    28 0.001 0.000 58.883 2.103 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py:69(tag) 
    28 0.004 0.000 58.883 2.103 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py:73(tag_sents) 
    28 0.001 0.000 56.927 2.033 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:63(java) 
    141 0.001 0.000 56.532 0.401 /usr/lib/python2.7/subprocess.py:769(communicate) 
    140 0.002 0.000 56.530 0.404 /usr/lib/python2.7/subprocess.py:1408(_communicate) 
    140 0.008 0.000 56.492 0.404 /usr/lib/python2.7/subprocess.py:1441(_communicate_with_poll) 
    400 56.474 0.141 56.474 0.141 {built-in method poll} 
    1 0.001 0.001 43.522 43.522 removeStopWords.py:69(partofspeechRecognitionRes) 
    1 0.000 0.000 15.401 15.401 removeStopWords.py:62(namedEntityRecognition) 
    1 0.001 0.001 15.367 15.367 removeStopWords.py:46(namedEntityRecognitionRes) 
    141 0.004 0.000 2.302 0.016 /usr/lib/python2.7/subprocess.py:651(__init__) 
    141 0.020 0.000 2.287 0.016 /usr/lib/python2.7/subprocess.py:1199(_execute_child) 
    56 0.002 0.000 1.933 0.035 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:38(config_java) 
    56 0.001 0.000 1.931 0.034 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:599(find_binary) 
    112 0.002 0.000 1.930 0.017 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:582(find_binary_iter) 
    118 0.009 0.000 1.928 0.016 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:453(find_file_iter) 
    1 0.001 0.001 1.318 1.318 /usr/lib/python2.7/pickle.py:1383(load) 
    1 0.046 0.046 1.317 1.317 /usr/lib/python2.7/pickle.py:851(load)
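The listing can be turned into a rough projection. The sketch below uses only figures from the profile plus two assumptions that are not in the listing: one `tag()` call per sentence per tagger (as the traceback suggests), and about 22.5 sentences per document, the midpoint of the 20-25 range mentioned later in the thread. `projected_hours` is a hypothetical helper name.

```python
# Figures copied from the cProfile listing above.
TOTAL_SECONDS = 60.795   # whole profiled run
TAG_CUMTIME = 58.883     # cumulative time in stanford.py tag()
TAG_CALLS = 28           # number of tag() calls
POLL_SECONDS = 56.474    # time spent waiting in the built-in poll method

# Average cost of one tag() call, each of which launches Java once.
seconds_per_call = TAG_CUMTIME / TAG_CALLS

# Share of the run spent waiting on the Java subprocess.
poll_share = POLL_SECONDS / TOTAL_SECONDS


def projected_hours(documents, sentences_per_doc, taggers=2):
    """Extrapolate total time, assuming one tag() call per sentence per tagger."""
    calls = documents * sentences_per_doc * taggers
    return calls * seconds_per_call / 3600.0


print(round(seconds_per_call, 3))
print(round(poll_share, 3))
print(round(projected_hours(23000, 22.5), 1))
```

Under these assumptions each call costs roughly 2.1 seconds, over 90% of the run is spent waiting on the subprocess, and the projection lands on the order of 600 hours.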

Source

2017-01-31 vendaTrout

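Is this about training classifiers or applying them? 945 h seems far longer than what you should need to tag 2300 documents (or to train a tagger on them), unless the documents are huge. I suspect something is wrong with your code (e.g. creating a new tagger instance for every sentence), and I would focus on fixing that rather than trying multithreading. Try to find out which parts take so long. – lenz

23000 documents, each with about 20-25 sentences. I create one tagger instance at the start and use that same instance to classify every sentence. I am applying the NER classifier to my documents to tag them. I am using **tqdm** to predict the remaining time, and the best-case estimate is 600 hours, which seems like a lot. – vendaTrout

Ah okay, 23,000, not 2,300, my bad. Still, this is too long; you should do some profiling. – lenz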

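It seems the Python wrapper is the culprit here. The Java implementation does not take nearly as much time; it takes roughly what @Gabor Angeli mentioned. Try it out.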
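If staying in Python is preferred, one way to reduce the wrapper overhead is to pass many sentences to a single `tag_sents()` call, so Java is launched once per batch instead of once per sentence. This is a minimal sketch, not a tested fix for the setup above: `tag_documents` and `CountingTagger` are hypothetical names, and it assumes `tag_sents()` returns exactly one tagged sentence per input sentence.

```python
def tag_documents(tagger, documents, batch_size=1000):
    """Tag tokenized documents using few tag_sents() calls.

    documents: list of documents, each a list of sentences,
    each sentence a list of tokens.
    Returns the tagged sentences regrouped per document.
    """
    # Flatten all sentences into one list.
    flat = [sentence for doc in documents for sentence in doc]
    tagged = []
    for start in range(0, len(flat), batch_size):
        # One call per batch rather than one call per sentence.
        tagged.extend(tagger.tag_sents(flat[start:start + batch_size]))
    # Regroup the tagged sentences by their original document.
    result, position = [], 0
    for doc in documents:
        result.append(tagged[position:position + len(doc)])
        position += len(doc)
    return result


class CountingTagger(object):
    """Hypothetical stand-in so the sketch runs without Java or the jars."""

    def __init__(self):
        self.calls = 0

    def tag_sents(self, sentences):
        self.calls += 1
        return [[(token, "O") for token in sentence] for sentence in sentences]


docs = [[["Stanford", "is", "in", "California"], ["It", "is", "sunny"]],
        [["NLTK", "wraps", "the", "tagger"]]]
demo_tagger = CountingTagger()
tagged_docs = tag_documents(demo_tagger, docs, batch_size=2)
print(demo_tagger.calls)
```

With a real tagger, `batch_size` trades memory and per-call input size against the number of Java launches, so it is worth tuning on a small sample first.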

Hope it helps!

Source

2017-02-18 14:01:05
