2017-05-02 28 views
0

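Word embedding extraction

I am using Python 2.7 and I have a file of pre-trained English embeddings. I need to look up the embedding of a given word in this file.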

The file has 300 dimensions and is formatted like this:

the -0.0279698616277 -0.00822567637943 -0.066859518431 0.0152934683231 -0.0329719520937 0.0530985715151 0.0346279291928 0.000898163363809 -0.0342044668875 -0.0358478199459 0.0330627337979 -0.0291780565785 -0.050316270082 0.0226246942919 -0.0999551118641 -0.0211768282161 -0.0650169654368 -0.13170513108 0.0136621823624 0.00761099698762 -0.0747038745232 -0.0309831087459 -0.0281774157081 -0.0381752846197 0.000854164869137 0.118230081556 -0.0544820178539 -0.0259578123228 -0.0250848970404 0.0432551614539 0.0604299831315 0.0605994794422 -0.0652365866148 0.0741619690129 -0.0122427203782 -0.0486630776978 0.0266766400501 -0.0575422338293 -0.0120115890454 0.067022888369 0.0563923322428 0.116347799963 0.0272241149902 -0.0271056717851 -0.0876134412848 -0.0160824708647 0.0478176382685 -0.0278610721008 -0.043103116023 -0.123507487497 -0.0286480325182 -0.00985009337681 -0.00749645238334 -0.00322952663845 -0.046423238718 0.103032221776 0.0821490881533 -0.121380150997 -0.00599957532621 -0.0843011157914 -0.0667407039306 0.0204320098169 -0.0953102074899 -0.0644943672828 -0.00133722007224 0.00249399062204 -0.0199877549741 -0.0494372284268 0.00730022281006 0.100155611334 0.0158984940368 0.0919811737074 -0.0762293413195 0.0495974423547 0.110083862374 -0.0737607844265 0.0507363907294 -0.0101547411817 0.01065877457 0.0437805443228 0.0801814086384 -0.0739505163318 0.0359545673486 -0.0289695742598 0.122458949531 0.0247212132806 -0.0799729263198 -0.0204555870693 -0.00530952298573 -0.0580316010527 0.0849861556452 -0.0386267797212 0.0264685290268 -0.0680456213105 0.0826555349612 -0.0264161763876 0.0344213033507 -0.0995871582083 0.0533503097378 0.037602190303 -0.061794122114 -0.00452664681682 -0.025897662482 -0.0804463278447 -0.0725472056937 -0.109343313871 0.0121977936453

I tried using .split(" "), but that splits the vector as well. Any ideas on how to search for a word and extract its vector from the file?

Answers

0

This code parses the whole file and builds a dictionary that maps each word to its embedding vector:

>>> embeddings = {} 
>>> with open("pretrained_embeddings.txt", "rb") as f: 
...  for line in f.xreadlines(): 
...   line = line.decode("utf-8") 
...   columns = line.strip().split() 
...   embeddings[columns[0]] = [float(n) for n in columns[1:]] 
... 
>>> embeddings["the"] 
[-0.0279698616277, -0.00822567637943, -0.066859518431, 0.0152934683231, -0.0329719520937, 0.0530985715151, 0.0346279291928, 0.000898163363809, -0.0342044668875, -0.0358478199459, 0.0330627337979, -0.0291780565785, -0.050316270082, 0.0226246942919, -0.0999551118641, -0.0211768282161, -0.0650169654368, -0.13170513108, 0.0136621823624, 0.00761099698762, -0.0747038745232, -0.0309831087459, -0.0281774157081, -0.0381752846197, 0.000854164869137, 0.118230081556, -0.0544820178539, -0.0259578123228, -0.0250848970404, 0.0432551614539, 0.0604299831315, 0.0605994794422, -0.0652365866148, 0.0741619690129, -0.0122427203782, -0.0486630776978, 0.0266766400501, -0.0575422338293, -0.0120115890454, 0.067022888369, 0.0563923322428, 0.116347799963, 0.0272241149902, -0.0271056717851, -0.0876134412848, -0.0160824708647, 0.0478176382685, -0.0278610721008, -0.043103116023, -0.123507487497, -0.0286480325182, -0.00985009337681, -0.00749645238334, -0.00322952663845, -0.046423238718, 0.103032221776, 0.0821490881533, -0.121380150997, -0.00599957532621, -0.0843011157914, -0.0667407039306, 0.0204320098169, -0.0953102074899, -0.0644943672828, -0.00133722007224, 0.00249399062204, -0.0199877549741, -0.0494372284268, 0.00730022281006, 0.100155611334, 0.0158984940368, 0.0919811737074, -0.0762293413195, 0.110083862374, 0.0495974423547, -0.0737607844265, 0.0507363907294, 0.01065877457, -0.0101547411817, 0.0437805443228, 0.0801814086384, -0.0739505163318, 0.0359545673486, 0.122458949531, -0.0289695742598, 0.0247212132806, -0.0799729263198, -0.0204555870693, -0.00530952298573, -0.0580316010527, 0.0849861556452, -0.0386267797212, 0.0264685290268, -0.0680456213105, 0.0826555349612, -0.0264161763876, -0.0995871582083, 0.0344213033507, 0.0533503097378, 0.037602190303, -0.061794122114, -0.00452664681682, -0.025897662482, -0.0804463278447, -0.0725472056937, -0.109343313871, 0.0121977936453] 

Notes:

  • This expects a very strict format: no blank lines, etc.
  • It works with Python 2. If you want to use Python 3, just replace f.xreadlines() with f.
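
For readers who need to get around the two caveats above, here is a minimal sketch of a variant that skips blank lines and should run unchanged on Python 2.7 and Python 3. The function name `load_embeddings` and the switch to `io.open` are my own assumptions, not part of the answer; it still expects one word followed by whitespace-separated floats on each line.

```python
import io


def load_embeddings(path):
    """Build a {word: [float, ...]} dict from a text embedding file."""
    embeddings = {}
    # io.open decodes UTF-8 text on both Python 2.7 and Python 3.
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            columns = line.split()
            if len(columns) < 2:
                continue  # skip blank lines and lines with no vector
            embeddings[columns[0]] = [float(n) for n in columns[1:]]
    return embeddings
```

Calling `load_embeddings("pretrained_embeddings.txt")["the"]` would then return the same list as in the answer, provided the file is laid out as described in the question.
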
+1

Thank you so much :) Works perfectly! –

+0

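My pleasure! Note that it assumes the words contain no spaces. If they do, a regular expression is your best option. Note: if you're happy, you can +1 my answer. :) – MiniQuark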

+0

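Thank you very much. I made a few tweaks and managed to use a regular expression to strip some characters, such as the [ at the start and the ] at the end of the vector. I would like to +1 your answer, but I can't because I'm new and this is my first question –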
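
The comment about words that contain spaces can be illustrated with a regular expression that treats the trailing run of decimal numbers as the vector and everything before it as the word. This is only a sketch: `LINE_RE` and `parse_line` are hypothetical names, and the pattern assumes every vector component is written with a decimal point (optionally with an exponent). A word that itself ends in such a number would be split incorrectly.

```python
import re

# Word = everything before the trailing run of signed decimal numbers.
LINE_RE = re.compile(
    r'^(.+?)((?:\s+-?[0-9]+\.[0-9]+(?:[eE][-+]?[0-9]+)?)+)\s*$'
)


def parse_line(line):
    """Return (word, vector) for one line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    word = match.group(1)
    vector = [float(n) for n in match.group(2).split()]
    return word, vector
```

For example, `parse_line("New York 0.1 -0.2 0.3")` keeps "New York" together as the word instead of splitting it at the first space.
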

0

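It looked to me as if each dimension takes 15 bytes, or 16 bytes if it starts with '-', so I suggest using re. The widths in the posted sample actually vary, so the pattern here matches any signed decimal instead of a fixed number of digits.

import re 
# Sample line; in practice this would be one line read from the file.
line = "the -0.0279698616277 -0.00822567637943 -0.066859518431" 
res = re.findall(r'-?[0-9]+\.[0-9]+', line) 
print(res) 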

You can give it a try. I don't have the data, so I couldn't really test it, and my suggestion may not help!

0

How about:

line = "the -0.0279698616277 -0.00822567637943 -0.0668... etc" 
word, vector = line.split(None,1) 
+0

The 'maxsplit' keyword argument only works in Python 3. The OP asked for Python 2 code. – MiniQuark

+0

True, it should be word, vector = line.split(None, 1) – BoarGules
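
Building on the `line.split(None, 1)` idea, a single word can be looked up without loading the whole file into a dictionary. The sketch below is an assumption-laden illustration rather than code from this thread: `find_vector` is a hypothetical helper, the file is assumed to be UTF-8 text, and words are assumed to contain no whitespace.

```python
import io


def find_vector(path, word):
    """Return the vector for `word` as a list of floats, or None if absent."""
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            # Positional maxsplit works on both Python 2 and Python 3.
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0] == word:
                return [float(n) for n in parts[1].split()]
    return None
```

The scan stops at the first matching line, which is cheaper for a one-off lookup. For many lookups, building the dictionary once is likely the better trade-off.
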