2017-05-01 20 views

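Extracting only words in Python

I have been using the following function to extract only the valid words from English byte strings and from Unicode strings: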

s = """\"A must-read for the business leader of today and tomorrow."--John G. O'Neill, Vice President, 3M Canada. High Performance Sales Organizations defined the true nature of market-focused sales and service operations, and helped push sales organizations into the 21st century""" 
t = 'Life is life (I want chocolate);&' 
w = u'Tú te llamabas de niña Concepción Morales!!' 

import re

def clean_words(text, separator=' '):
    # Python 2 code: `unicode` is the text type, `str` is the byte-string type.
    if isinstance(text, unicode):
        return separator.join(re.findall(r'[\w]+', text, re.U)).rstrip()
    else:
        return re.sub(r'\W+', ' ', text).replace(' ', separator).rstrip()

This has trouble with last names and apostrophes. For s it returns:

A must read for the business leader of today and tomorrow John G O Neill Vice President 3M Canada High Performance Sales Organizations defined the true nature of market focused sales and service operations and helped push sales organizations into the 21st century 

and when I tokenize that, I end up with single-character tokens.

Any suggestions?


Since you are using NLTK, why not use nltk.WordPunctTokenizer() or one of the other standard tokenizers? – DyZ


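WordPunctTokenizer seems to return similar results: word_tokenizer.tokenize(s) ['"', 'A', 'must', '-', 'read', 'for', 'the', 'business', 'leader', 'of', 'today', 'and', 'tomorrow', '."--', 'John', 'G', '.', 'O', "'", 'Neill', ',', 'Vice', 'President', ',', '3M', 'Canada', '.', 'High', 'Performance', 'Sales', 'Organizations', 'defined', 'the', 'true', 'nature', 'of', 'market', '-', 'focused', 'sales', 'and', 'service', 'operations', ',', 'and', 'helped', 'push', 'sales', 'organizations', 'into', 'the', '21st', 'century'] – spicyramen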

Answers


It looks like what you want is a Treebank tokenizer:

from nltk.tokenize import TreebankWordTokenizer 
tokenizer = TreebankWordTokenizer() 
tokenizer.tokenize(s) 
#['``', 'A', 'must-read', 'for', 'the', 'business', 'leader', 'of', 
# 'today', 'and', 'tomorrow.', "''", '--', 'John', 'G.', "O'Neill", 
# ',', 'Vice', 'President', ',', '3M', 'Canada.', 'High', 
# 'Performance', 'Sales', 'Organizations', 'defined', 'the', 'true', 
# 'nature', 'of', 'market-focused', 'sales', 'and', 'service', 
# 'operations', ',', 'and', 'helped', 'push', 'sales', 
# 'organizations', 'into', 'the', '21st', 'century'] 
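
The Treebank output above still contains punctuation-only tokens (``, '', --, and commas) and words with a trailing period (tomorrow., G., Canada.). Since the question asks for words only, a small post-filter may help. The following is a minimal sketch, not part of the original answer; the helper name keep_words is hypothetical, and it assumes that stripping ASCII punctuation from the edges of each token is acceptable for your data:

```python
import string

def keep_words(tokens):
    """Strip ASCII punctuation from both ends of each token; drop tokens left empty.

    Hypothetical helper: punctuation inside a token (the apostrophe in
    "O'Neill", the hyphen in 'must-read') is left untouched.
    """
    stripped = (tok.strip(string.punctuation) for tok in tokens)
    return [tok for tok in stripped if tok]

# A few tokens copied from the Treebank output shown above.
sample = ['``', 'A', 'must-read', 'tomorrow.', "''", '--',
          'John', 'G.', "O'Neill", ',', '3M', 'Canada.']
print(keep_words(sample))
# ['A', 'must-read', 'tomorrow', 'John', 'G', "O'Neill", '3M', 'Canada']
```

Note that this also turns abbreviations such as G. into G, which may or may not be what you want.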

Alternatively, you can use spaCy:

import spacy 
nlp = spacy.load('en')  # shortcut name from older spaCy; spaCy 3+ needs a full model name such as 'en_core_web_sm'
s_tokenized = [t.text for t in nlp(s)] 

# ['"', 'A', 'must', '-', 'read', 'for', 'the', 'business', 'leader', 'of', 
# 'today', 'and', 'tomorrow', '."--', 'John', 'G.', "O'Neill", ',', 'Vice', 
# 'President', ',', '3', 'M', 'Canada', '.', 'High', 'Performance', 'Sales', 
# 'Organizations', 'defined', 'the', 'true', 'nature', 'of', 'market', '-', 
# 'focused', 'sales', 'and', 'service', 'operations', ',', 'and', 'helped', 
# 'push', 'sales', 'organizations', 'into', 'the', '21st', 'century']
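
If adding a tokenizer library is not an option, a regex-only variant of the question's clean_words may be enough. This is a rough sketch rather than a replacement for a real tokenizer: the name clean_words_keep_joiners and the pattern are assumptions of this example, it is written for Python 3 (where every str is Unicode), and it only treats the ASCII apostrophe and hyphen as word-internal joiners:

```python
import re

# A word is a run of word characters, optionally continued by an apostrophe
# or hyphen that is immediately followed by more word characters.
# Note: \w also matches digits and the underscore.
WORD_RE = re.compile(r"\w+(?:['-]\w+)*")

def clean_words_keep_joiners(text, separator=' '):
    """Hypothetical variant of clean_words that keeps O'Neill and must-read whole."""
    return separator.join(WORD_RE.findall(text))

print(clean_words_keep_joiners('"A must-read."--John G. O\'Neill, 3M Canada.'))
# A must-read John G O'Neill 3M Canada
print(clean_words_keep_joiners('Tú te llamabas de niña Concepción Morales!!'))
# Tú te llamabas de niña Concepción Morales
```

Typographic apostrophes (’) and other dash characters are not covered by this pattern and would still split a word.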