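I want to tokenize an input file in Python
Please advise; I am new to Python. How can I tokenize natural English text from an input file in Python?
I have read a bit about regular expressions but I am still confused, so please suggest any link or code overview.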
Try this:
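    import nltk

    # Requires NLTK's Punkt tokenizer data, installed once via nltk.download().
    # Read the whole file, then split it into word and punctuation tokens.
    with open("myfile.txt") as f:
        file_content = f.read()
    tokens = nltk.word_tokenize(file_content)
    print(tokens)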
The NLTK tutorial is also full of easy-to-follow examples: http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html
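Since the question mentions regular expressions, here is a rough sketch of what a regex-only tokenizer could look like using just the standard library. The pattern and the name regex_tokenize are my own assumptions for illustration, not part of NLTK, and the output will differ from word_tokenize (which, for example, splits contractions).

```python
import re

# Assumed pattern (illustrative only): a word, optionally with internal
# apostrophes such as "don't", or any single character that is neither
# a word character nor whitespace (i.e. punctuation).
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def regex_tokenize(text):
    """Return the word and punctuation tokens found in text, in order."""
    return TOKEN_RE.findall(text)


print(regex_tokenize("Don't panic, it's fine."))
```

This is fine for quick experiments, but for real English text a proper tokenizer such as NLTK's will handle many more edge cases.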
Using NLTK
If your file is small, open it with the context manager with open(...) as x, call .read() on it, and tokenize the result with word_tokenize():
    from nltk.tokenize import word_tokenize

    # Read the whole file into memory and tokenize it in one call.
    with open('myfile.txt') as fin:
        tokens = word_tokenize(fin.read())
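If the file is larger, open it with the context manager with open(...) as x, read it line by line in a for loop, tokenize each line with word_tokenize(), and write the tokens to an output file opened for writing: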
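    from __future__ import print_function  # only needed on Python 2
    from nltk.tokenize import word_tokenize

    # Stream the input one line at a time so the whole file never sits in memory.
    # The output file must be opened with 'w', otherwise it cannot be written to.
    with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
        for line in fin:
            tokens = word_tokenize(line)
            print(' '.join(tokens), file=fout)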
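The line-by-line approach can also be wrapped in a small reusable function. This is only a sketch: tokenize_file and its tokenize parameter are hypothetical names I made up, and the default of str.split (plain whitespace splitting) is a stand-in so the example runs without NLTK installed.

```python
def tokenize_file(in_path, out_path, tokenize=str.split):
    """Write one line of space-separated tokens per input line.

    `tokenize` is any callable mapping a string to a list of strings.
    The default, str.split, is only a placeholder that splits on whitespace.
    Returns the total number of tokens written.
    """
    total = 0
    with open(in_path, encoding='utf-8') as fin, \
            open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:  # one line at a time, so large files are fine
            tokens = tokenize(line)
            total += len(tokens)
            fout.write(' '.join(tokens) + '\n')
    return total
```

To get the same behaviour as the snippet above, you would pass NLTK's tokenizer in, e.g. tokenize_file('myfile.txt', 'tokens.txt', tokenize=word_tokenize).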
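Using spaCy
    from __future__ import print_function  # only needed on Python 2
    from spacy.lang.en import English
    from spacy.tokenizer import Tokenizer

    nlp = English()  # blank English pipeline, used here for its vocab
    tokenizer = Tokenizer(nlp.vocab)  # no rules passed, so it splits on whitespace only

    with open('myfile.txt') as fin, open('tokens.txt', 'w') as fout:
        for line in fin:
            # Calling the tokenizer returns a Doc; take the text of each token.
            tokens = [token.text for token in tokenizer(line.strip())]
            print(' '.join(tokens), file=fout)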
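What do you want to tokenize it for? Do you need to create a general-purpose tokenizer, or do you need a tokenizer/parser for a specific (programming) language?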