2009-12-27 13 views
3

Splitting a large string into substrings containing "n" words via Python

Source text: United States Declaration of Independence

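How can one split the above source text into a number of substrings, each containing "n" words? I use split(' ') to extract each word, but I don't know how to do this with multiple words in one operation.

I could loop through the list of words I have and build a second list by gluing together words from the first (adding spaces as I go). But that approach isn't very pythonic.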

Answers

5
text = """ 
When in the course of human Events, it becomes necessary for one People to dissolve the Political Bands which have connected them with another, and to assume among the Powers of the Earth, the separate and equal Station to which the Laws of Nature and of Nature's God entitle them, a decent Respect to the Opinions of Mankind requires that they should declare the causes which impel them to the Separation. 

We hold these Truths to be self-evident, that all Men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty, and the pursuit of Happiness--That to secure these Rights, Governments are instituted among Men, deriving their just Powers from the Consent of the Governed, that whenever any Form of Government becomes destructive of these Ends, it is the Right of the People to alter or abolish it, and to institute a new Government, laying its Foundation on such Principles, and organizing its Powers in such Form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient Causes; and accordingly all Experience hath shewn, that Mankind are more disposed to suffer, while Evils are sufferable, than to right themselves by abolishing the Forms to which they are accustomed. But when a long Train of Abuses and Usurpations, pursuing invariably the same Object, evinces a Design to reduce them under absolute Despotism, it is their Right, it is their Duty, to throw off such Government, and to provide new Guards for their future Security. Such has been the patient Sufferance of these Colonies; and such is now the Necessity which constrains them to alter their former Systems of Government. The History of the Present King of Great-Britain is a History of repeated Injuries and Usurpations, all having in direct Object the Establishment of an absolute Tyranny over these States. To prove this, let Facts be submitted to a candid World. 
""" 

words = text.split() 
subs = [] 
n = 4 
for i in range(0, len(words), n): 
    subs.append(" ".join(words[i:i+n])) 
print(subs[:10]) 

This prints:

['When in the course', 'of human Events, it', 'becomes necessary for one', 'People to dissolve the', 'Political Bands which have', 'connected them with another,', 'and to assume among', 'the Powers of the', 'Earth, the separate and', 'equal Station to which'] 

Or, as a list comprehension:

subs = [" ".join(words[i:i+n]) for i in range(0, len(words), n)] 
+0

This seems pretty pythonic. – physicsmichael 2009-12-27 07:10:47

+2

Oh. Most applications of n-grams would want ['When in the course', 'in the course of', 'the course of human', etc.] – 2009-12-27 13:06:44
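
The overlapping n-grams this comment describes come from sliding the window one word at a time instead of stepping by n. A minimal sketch, assuming plain whitespace tokenization; `sliding_ngrams` is a hypothetical helper name, not something from the answer above:

```python
def sliding_ngrams(text, n):
    """Return every run of n consecutive words, advancing one word at a time."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    words = text.split()
    # The last window starts at index len(words) - n; shorter texts yield nothing.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = "When in the course of human Events"
print(sliding_ngrams(sample, 4))
```

Compared with the non-overlapping version, the only change is the step of the range: each word now starts its own window, so a text of k words gives k - n + 1 substrings rather than roughly k / n.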

3

Are you trying to create n-grams? Here is how I do it, using NLTK:

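import re
import nltk

# Strips leading and trailing non-alphanumeric characters from a token
punct = re.compile(r'^[^A-Za-z0-9]+|[^a-zA-Z0-9]+$') 
# Matches tokens that contain at least one letter
is_word = re.compile(r'[a-z]', re.IGNORECASE) 
sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') 
# PunktWordTokenizer comes from older NLTK releases; newer ones may not provide it
word_tokenizer = nltk.tokenize.punkt.PunktWordTokenizer() 

def get_words(sentence): 
    return [punct.sub('', word) for word in word_tokenizer.tokenize(sentence) if is_word.search(word)] 

def ngrams(text, n): 
    # n-grams are built per sentence, so they never cross a sentence boundary
    for sentence in sentence_tokenizer.tokenize(text.lower()): 
        words = get_words(sentence) 
        for i in range(len(words) - (n - 1)): 
            yield ' '.join(words[i:i+n]) 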

Then:

for ngram in ngrams(sometext, 3): 
    print(ngram) 
+0

Interesting link! I will definitely consider using that toolkit in the future. – torger 2009-12-27 05:30:12

2

For large strings, an iterator is recommended for speed and a low memory footprint.

import itertools 
import re 

# Original text 
text = "When in the course of human Events, it becomes necessary for one People to dissolve the Political Bands which have connected them with another, and to assume among the Powers of the Earth, the separate and equal Station to which the Laws of Nature and of Nature's God entitle them, a decent Respect to the Opinions of Mankind requires that they should declare the causes which impel them to the Separation." 
n = 10 

# An iterator which will extract words one by one from text when needed 
words = (m.group() for m in re.finditer(r'\w+', text)) 
# The final iterator that combines words into n-length groups; 
# the last group is padded with None when the words run out 
word_groups = itertools.zip_longest(*(words,)*n) 

for g in word_groups: 
    print(g) 

This gives the following result:

('When', 'in', 'the', 'course', 'of', 'human', 'Events', 'it', 'becomes', 'necessary') 
('for', 'one', 'People', 'to', 'dissolve', 'the', 'Political', 'Bands', 'which', 'have') 
('connected', 'them', 'with', 'another', 'and', 'to', 'assume', 'among', 'the', 'Powers') 
('of', 'the', 'Earth', 'the', 'separate', 'and', 'equal', 'Station', 'to', 'which') 
('the', 'Laws', 'of', 'Nature', 'and', 'of', 'Nature', 's', 'God', 'entitle') 
('them', 'a', 'decent', 'Respect', 'to', 'the', 'Opinions', 'of', 'Mankind', 'requires') 
('that', 'they', 'should', 'declare', 'the', 'causes', 'which', 'impel', 'them', 'to') 
('the', 'Separation', None, None, None, None, None, None, None, None) 
+0

Then I would glue the words in each group tuple together with spaces? – torger 2009-12-27 07:02:30

+0

Yes, just print ' '.join(g) instead of g. – iamamac 2009-12-27 07:23:04
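
One caveat with that suggestion: the last group is padded with `None`, as the final tuple in the output above shows, and `str.join` raises a `TypeError` when it meets `None`. A minimal Python 3 sketch that drops the padding before joining; `join_word_groups` is a hypothetical helper name, and `zip_longest` is the Python 3 name for `izip_longest`:

```python
import itertools
import re


def join_word_groups(text, n):
    """Lazily yield strings of up to n words each."""
    # Lazily pull words out of the text one at a time.
    words = (m.group() for m in re.finditer(r"\w+", text))
    # Repeating the same iterator n times makes zip_longest take n words per tuple.
    for group in itertools.zip_longest(*[words] * n):
        # Drop the None padding in the last, possibly short, group.
        yield " ".join(w for w in group if w is not None)


for line in join_word_groups("When in the course of human Events, it becomes", 4):
    print(line)
```

Passing `fillvalue=""` to `zip_longest` would also avoid the error, but it leaves trailing spaces on the last string, which is why this sketch filters instead.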