Stanford NER and POS, multithreaded processing of big data

I am trying to parse roughly 23,000 documents using Stanford NER and the Stanford POS tagger. I tried to implement it with the pseudocode below:
`for each in documents:
    eachSentences = PunktTokenize(each)
    # code to run the NER tagger on eachSentences
    # code to run the POS tagger on the above output`
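For reference, a minimal runnable version of that pipeline with NLTK's Stanford wrappers could look like the sketch below; the model and jar paths are placeholders that would have to point at your local Stanford 3.7.0 installation (e.g. via CLASSPATH/STANFORD_MODELS), and `documents` is a toy stand-in for the 23,000 raw texts:

`from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

# Placeholder model/jar paths - adjust to the local Stanford installation.
ner = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz', 'stanford-ner.jar')
pos = StanfordPOSTagger('english-bidirectional-distsim.tagger', 'stanford-postagger.jar')

documents = ["Barack Obama was born in Hawaii and worked in Chicago."]  # toy stand-in

for doc in documents:
    sentences = sent_tokenize(doc)                  # Punkt sentence tokenizer
    tokens = [word_tokenize(s) for s in sentences]  # tokenize each sentence
    ner_tags = ner.tag_sents(tokens)                # NER tags for the whole document
    pos_tags = pos.tag_sents(tokens)                # POS tags on the same tokens`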
On a 4-core machine with 15 GB of RAM, the running time for NER alone is about 945 hours. I tried to speed things up with the `threading` library, but I get the error below:
`Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "removeStopWords.py", line 75, in partofspeechRecognition
listOfRes_new = namedEntityRecognition(listRes[min:max])
File "removeStopWords.py", line 63, in namedEntityRecognition
listRes_ner.append(namedEntityRecognitionResume(eachResSentence))
File "removeStopWords.py", line 50, in namedEntityRecognitionResume
ner2Tags = ner2.tag(each.title().split())
File "/home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py", line 71, in tag
return sum(self.tag_sents([tokens]), [])
File "/home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py", line 98, in tag_sents
os.unlink(self._input_file_path)
OSError: [Errno 2] No such file or directory: '/tmp/tmpvMNqwB'`
I am using NLTK version 3.2.1, the Stanford NER and POS 3.7.0 jar files, and the threading module. As far as I can tell, this might be caused by the threads' locking on /tmp. Please correct me if I am wrong; is threading the best way to do this, or is there a better approach?
I am using the 3 Class Classifier for NER and the Maxent POS Tagger.
P.S. - Please ignore the name of the Python file; I have not yet removed stop words or punctuation from the raw text.
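If the OSError really does come from the threads stepping on each other's temporary files (my reading of the traceback, not a verified diagnosis), one hedged workaround is to give every worker thread its own tagger instance instead of sharing one, since the NLTK wrapper keeps its temporary input path on the instance. A rough sketch, with placeholder paths and a hypothetical `ner_worker` helper:

`import threading
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag import StanfordNERTagger

def ner_worker(doc_chunk, out):
    # Each thread builds its own tagger, so the /tmp input file created by
    # tag_sents() is never shared (and never unlinked) across threads.
    ner = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz',
                            'stanford-ner.jar')      # placeholder paths
    for doc in doc_chunk:
        tokens = [word_tokenize(s) for s in sent_tokenize(doc)]
        out.append(ner.tag_sents(tokens))

documents = ["..."]                              # the 23,000 raw texts go here
chunks = [documents[i::4] for i in range(4)]     # one chunk per core
results = [[] for _ in chunks]
threads = [threading.Thread(target=ner_worker, args=(c, r))
           for c, r in zip(chunks, results)]
for t in threads:
    t.start()
for t in threads:
    t.join()`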
Edit - Using cProfile and sorting by cumulative time, I got the following top 20 calls:
`600792 function calls (595912 primitive calls) in 60.795 seconds
Ordered by: cumulative time
List reduced from 3357 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 60.811 60.811 removeStopWords.py:1(<module>)
1 0.000 0.000 58.923 58.923 removeStopWords.py:76(partofspeechRecognition)
28 0.001 0.000 58.883 2.103 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py:69(tag)
28 0.004 0.000 58.883 2.103 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py:73(tag_sents)
28 0.001 0.000 56.927 2.033 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:63(java)
141 0.001 0.000 56.532 0.401 /usr/lib/python2.7/subprocess.py:769(communicate)
140 0.002 0.000 56.530 0.404 /usr/lib/python2.7/subprocess.py:1408(_communicate)
140 0.008 0.000 56.492 0.404 /usr/lib/python2.7/subprocess.py:1441(_communicate_with_poll)
400 56.474 0.141 56.474 0.141 {built-in method poll}
1 0.001 0.001 43.522 43.522 removeStopWords.py:69(partofspeechRecognitionRes)
1 0.000 0.000 15.401 15.401 removeStopWords.py:62(namedEntityRecognition)
1 0.001 0.001 15.367 15.367 removeStopWords.py:46(namedEntityRecognitionRes)
141 0.004 0.000 2.302 0.016 /usr/lib/python2.7/subprocess.py:651(__init__)
141 0.020 0.000 2.287 0.016 /usr/lib/python2.7/subprocess.py:1199(_execute_child)
56 0.002 0.000 1.933 0.035 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:38(config_java)
56 0.001 0.000 1.931 0.034 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:599(find_binary)
112 0.002 0.000 1.930 0.017 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:582(find_binary_iter)
118 0.009 0.000 1.928 0.016 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:453(find_file_iter)
1 0.001 0.001 1.318 1.318 /usr/lib/python2.7/pickle.py:1383(load)
1 0.046 0.046 1.317 1.317 /usr/lib/python2.7/pickle.py:851(load)`
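Reading the profile: nearly all of the 60 seconds is spent in `subprocess.communicate`/`poll`, i.e. in launching and waiting on Java subprocesses for the 28 `tag`/`tag_sents` calls, rather than in the tagging itself. A hedged sketch of how that overhead might be amortized (reusing the `ner` instance and `documents` from the sketch above) is to batch the sentences of many documents into a single `tag_sents` call and split the result afterwards:

`# Collect the tokenized sentences of a whole batch of documents.
all_tokens, doc_lengths = [], []
for doc in documents:
    sents = [word_tokenize(s) for s in sent_tokenize(doc)]
    all_tokens.extend(sents)
    doc_lengths.append(len(sents))

tagged = ner.tag_sents(all_tokens)   # one Java subprocess for the entire batch

# Split the flat list of tagged sentences back into per-document lists.
results, i = [], 0
for n in doc_lengths:
    results.append(tagged[i:i + n])
    i += n`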
Is this about training the classifiers or about applying them? 945 h seems much longer than what you would need to tag 2,300 documents (or to train taggers on them), unless the documents are huge. I suspect something is wrong in your code (e.g. creating a new tagger instance for every sentence), and I would focus on fixing that rather than on multithreading. Try to find out which parts take so long. – lenz
23,000 documents, each with roughly 20-25 sentences. I create one tagger instance at the start and use that same instance to classify every sentence. I apply the NER classifier to my documents to tag them. I am using **tqdm** to estimate the remaining time, and even the best-case estimate is 600 hours, which still seems like a lot. – vendaTrout
Ah okay, 23,000, not 2,300, my bad. Still, that is far too long; you should do some profiling. – lenz