2017-05-28

Answer


From the NLTK docs:

Corpus reader functions are named based on the type of information they return. 
Some common examples, and their return types, are: 
- words(): list of str 
- sents(): list of (list of str) 
- paras(): list of (list of (list of str)) 
- tagged_words(): list of (str,str) tuple 
- tagged_sents(): list of (list of (str,str)) 
- tagged_paras(): list of (list of (list of (str,str))) 
- chunked_sents(): list of (Tree w/ (str,str) leaves) 
- parsed_sents(): list of (Tree with str leaves) 
- parsed_paras(): list of (list of (Tree with str leaves)) 
- xml(): A single xml ElementTree 
- raw(): unprocessed corpus contents 
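To make the nesting in the list above concrete, here is a minimal sketch using hypothetical toy data (toy_paras and the other names are made up for illustration, not part of NLTK). It only mimics the shapes; the real corpus readers return lazy corpus views that behave like lists rather than plain Python lists.

```python
from itertools import chain

# Hypothetical toy data shaped like the return value of paras():
# a list of paragraphs, each a list of sentences, each a list of str tokens.
toy_paras = [
    [["The", "jury", "met", "."], ["It", "adjourned", "."]],
    [["No", "verdict", "was", "reached", "."]],
]

# Flattening one level gives the shape of sents(): list of (list of str).
toy_sents = list(chain.from_iterable(toy_paras))

# Flattening once more gives the shape of words(): list of str.
toy_words = list(chain.from_iterable(toy_sents))

print(len(toy_paras), len(toy_sents), len(toy_words))
```

Each level of "list of" in the return type corresponds to one level of grouping (paragraph, sentence, token).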


>>> from nltk.corpus import brown 

>>> brown.tagged_words() 
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), ...] 

>>> len(brown.tagged_words()) # no. of words in the corpus. 
1161192 


>>> len(brown.tagged_sents()) # no. of sentences in the corpus. 
57340 

# Loop through the sentences and count the words in each one. 
>>> sum(len(sent) for sent in brown.tagged_sents()) # same total as len(brown.tagged_words()). 
1161192 
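
The two word counts above agree because tagged_sents() is just tagged_words() grouped by sentence. A small sketch of that relationship, using hypothetical toy data in place of the Brown corpus so it runs without downloading anything (all toy_* names are assumptions for illustration):

```python
# Hypothetical toy data shaped like brown.tagged_sents():
# a list of sentences, each a list of (word, tag) tuples.
toy_tagged_sents = [
    [("The", "AT"), ("jury", "NN"), ("met", "VBD"), (".", ".")],
    [("It", "PPS"), ("adjourned", "VBD"), (".", ".")],
]

# Flattening the sentences gives the shape of tagged_words().
toy_tagged_words = [pair for sent in toy_tagged_sents for pair in sent]

# Summing per-sentence lengths counts the same tokens as the flat list.
total_by_sentence = sum(len(sent) for sent in toy_tagged_sents)

# Dropping the tag from each pair gives the shape of words().
toy_words_only = [word for word, tag in toy_tagged_words]

print(total_by_sentence == len(toy_tagged_words))
```

On the real corpus, len(brown.tagged_words()) is the more direct way to get the token count; the per-sentence sum is mainly useful when per-sentence lengths are also needed.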