2017-08-10

How do I perform preprocessing steps such as stop word removal, punctuation removal, stemming, and lemmatization in spaCy using Python? How do I do text preprocessing with spaCy?

I have text data in a CSV file, as paragraphs and sentences. I want to do text cleaning.

Please add an example, e.g. loading the CSV into a pandas dataframe


This is very simple and straightforward in spaCy. First, let us know what you have tried. – DhruvPathak

Answer


This can be done easily with a few commands. Also note that spaCy does not support stemming; you can refer to this thread.

import spacy 
nlp = spacy.load('en') # 'en' shortcut model; newer spaCy versions use 'en_core_web_sm' 

# sample text 
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry. \ 
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown \ 
printer took a galley of type and scrambled it to make a type specimen book. It has survived not \ 
only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. \ 
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, \ 
and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.\ 
There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration \ 
in some form, by injected humour, or randomised words which don't look even slightly believable. If you are \ 
going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the \ 
middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, \ 
making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined \ 
with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated \ 
Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc.""" 

# convert the text to a spacy document 
document = nlp(text) # all spacy documents are tokenized. You can access them using document[i] 
document[0:10] # = Lorem Ipsum is simply dummy text of the printing and 

# the good thing about spacy is that a lot of processing (lemmatization etc.) 
# is done when you convert the text to a spacy document using nlp(text). 
# You can access sentences using document.sents 
list(document.sents)[0] 

# lemmatized words can be accessed using document[i].lemma_ and you can check 
# if a word is a stopword by checking the `.is_stop` attribute of the word. 
# here I am extracting the lemmatized form of each word, provided it is not a stop word 
lemmas = [token.lemma_ for token in document if not token.is_stop] 