2016-01-02 33 views
6

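Print the first paragraph in Python

I have a book in a text file, and I need to print the first paragraph of each section. I figured that if I find the text between a \n\n and a \n, I will have my answer. Here is my code, which doesn't work. Can you tell me where I went wrong?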

lines = [line.rstrip('\n') for line in open('G:\\aa.txt')] 

check = -1 
first = 0 
last = 0 

for i in range(len(lines)):
    if lines[i] == "":
        if lines[i+1] == "":
            check = 1
            first = i + 2
    if i+2 < len(lines):
        if lines[i+2] == "" and check == 1:
            last = i + 2
while (first < last):
    print(lines[first])
    first = first + 1

I also tried some code I found on Stack Overflow, but it just prints an empty array.

f = open("G:\\aa.txt").readlines() 
flag=False 
for line in f:
    if line.startswith('\n\n'):
        flag = False
    if flag:
        print(line)
    elif line.strip().endswith('\n'):
        flag = True

I have shared a sample section of the book below.

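THE LAY OF THE LAND

There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

Of all the kinds of interest attaching to the study of the world's wild animals, there are none that surpass the study of their minds, their morals, and the acts that they perform as the results of their mental processes.

II

WILD ANIMAL TEMPERAMENT & INDIVIDUALITY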

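What I am trying to do here is find the uppercase lines and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of the elements of this array.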

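The output should look like this:

There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.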

+0

Can you add the actual input and the expected output? –

Answers

6

If you want to group the sections, you can use itertools.groupby with the empty lines as delimiters:

from itertools import groupby

with open("in.txt") as f:
    for k, sec in groupby(f, key=lambda x: bool(x.strip())):
        if k:
            print(list(sec))

With a bit more itertools foo, we can get the sections using the uppercase titles as delimiters:

from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f, key=lambda x: x.isupper())
    for k, sec in grps:
        # if we hit a title line
        if k:
            # pull all paragraphs
            v = next(grps)[1]
            # skip two empty lines after title
            next(v, ""), next(v, "")

            # take all lines up to next empty line/second paragraph
            print(list(takewhile(lambda x: bool(x.strip()), v)))

Which would give you:

['There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.\n'] 
['What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.'] 

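Each section starts with an all-uppercase title, so once we hit one we know that two empty lines follow, then the first paragraph, and then the pattern repeats.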
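The same idea can be sketched as a function that returns the paragraphs instead of printing them. This is only a sketch under a few assumptions: the name `first_paragraphs_groupby` is made up for illustration, the input is any iterable of lines, and leading blank lines are dropped with `dropwhile` rather than by skipping exactly two, so a short header line such as `II` followed by a separate title line is tolerated as well.

```python
from itertools import dropwhile, groupby, takewhile


def first_paragraphs_groupby(lines):
    """Return the first paragraph of each section (hypothetical helper)."""
    result = []
    groups = groupby(lines, key=lambda line: line.isupper())
    for is_title, _ in groups:
        if not is_title:
            continue
        # The group right after a title holds blank lines and paragraphs.
        _, body = next(groups, (False, iter(())))
        # Skip the blank lines after the title, however many there are.
        body = dropwhile(lambda line: not line.strip(), body)
        # Collect lines up to the next blank line.
        paragraph = [line.strip() for line in takewhile(lambda line: line.strip(), body)]
        if paragraph:
            result.append(" ".join(paragraph))
    return result
```

Called as `first_paragraphs_groupby(open("in.txt"))`, it should yield one string per section for a file laid out like the sample.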

To break it down using a loop:

from itertools import groupby

def parse_sec(bk):
    with open(bk) as f:
        grps = groupby(f, key=lambda x: bool(x.isupper()))
        for k, sec in grps:
            if k:
                print("First paragraph from section titled :{}".format(next(sec).rstrip()))
                v = next(grps)[1]
                next(v, ""), next(v, "")
                for line in v:
                    if not line.strip():
                        break
                    print(line)

For your text:

In [11]: cat -E in.txt 

THE LAY OF THE LAND$ 
$ 
$ 
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.$ 
$ 
Of all the kinds of interest attaching to the study of the world's wild animals, there are none that surpass the study of their minds, their morals, and the acts that they perform as the results of their mental processes.$ 
$ 
$ 
WILD ANIMAL TEMPERAMENT & INDIVIDUALITY$ 
$ 
$ 
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created. 

The dollar signs mark the newlines. The output is:

In [12]: parse_sec("in.txt") 
First paragraph from section titled :THE LAY OF THE LAND 
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence. 

First paragraph from section titled :WILD ANIMAL TEMPERAMENT & INDIVIDUALITY 
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created. 
+0

This is cool, I can see each section using this code... but I only want to see their first paragraphs. Can I extract those? –

+0

@TuğcanDemir, what do you want to pull out of the input in your question? –

+0

I edited my question. –

0

Go through the code you found, line by line.

f = open("G:\\aa.txt").readlines() 
flag=False 
for line in f:
    if line.startswith('\n\n'):
        flag = True
    if flag:
        print(line)
    elif line.strip().endswith('\n'):
        flag = True

It seems it never sets the flag variable to True.
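A small check, using made-up lines in place of the real file, suggests why: each line returned by readlines() ends with at most one newline, so it can never start with two, and strip() removes the trailing newline before endswith is tested.

```python
# Lines as readlines() would return them: each ends with a single '\n'.
# The content here is a made-up stand-in for the real book.
lines = ["TITLE\n", "\n", "\n", "First paragraph.\n"]

# A line holds at most one newline, at its end, so this is never true.
starts = [line.startswith("\n\n") for line in lines]

# strip() removes the trailing newline, so this is never true either.
ends = [line.strip().endswith("\n") for line in lines]

print(any(starts), any(ends))  # False False
```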

If you could share a sample from your book, it would be more helpful for everyone.

+0

I shared the same code you shared, just with the flag set to True in the first block. –

+0

When I set the first flag to True, it adds 2 more empty lines after every line. –

0

This should work, as long as there are no paragraphs in all caps:

with open('file.txt') as f:
    for line in f:
        line = line.strip()
        if line:
            for c in line:
                if c < 'A' or c > 'Z':  # check for non-uppercase chars
                    break
            else:  # the line is all caps, i.e. I, II, etc., meaning a new section
                f.readline()  # discard chapter headers and empty lines
                f.readline()
                f.readline()
                print(f.readline().rstrip())  # print first paragraph

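If you also want the last paragraph, you can keep track of the last line seen that contains lowercase characters. Then, once you find an all-uppercase line (I, II, etc.), which indicates a new section, print that most recent line, since it will be the last paragraph of the previous section.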
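A rough sketch of that bookkeeping follows. The function name `first_and_last_paragraphs` is invented for illustration, it assumes every paragraph sits on a single line as in the sample, and it uses `str.isupper()` to spot header lines, which is a looser test than the A-Z check above because it also accepts titles containing spaces or `&`.

```python
def first_and_last_paragraphs(lines):
    """Yield (first, last) paragraph pairs, one per section (hypothetical helper)."""
    first = last = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information here
        if line.isupper():
            # A header line closes the previous section, if it had any text.
            if first is not None:
                yield first, last
            first = last = None
        else:
            if first is None:
                first = line  # first paragraph of the current section
            last = line  # most recent paragraph seen so far
    # The final section has no header after it, so emit it at the end.
    if first is not None:
        yield first, last
```

Something like `for first, last in first_and_last_paragraphs(open('file.txt')): print(first, last, sep='\n')` would then print both paragraphs for every section.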

+0

It prints a lot of empty lines between two unrelated sentences... –

+0

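@TuğcanDemir I made some minor changes to remove the empty lines and make the code more readable. This code (and the previous version) works with the sample you provided above. Can you provide the sample section that gave you those results? – TisteAndii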

1

There are always regular expressions....

import re 
with open("in.txt", "r") as fi: 
    data = fi.read() 
paras = re.findall(r""" 
        [IVXLCDM]+\n\n # Line of Roman numeral characters 
        [^a-z]+\n\n  # Line without lower case characters 
        (.*?)\n   # First paragraph line 
        """, data, re.VERBOSE) 
print("\n\n".join(paras))
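If a paragraph can span several lines, one possible variation is to capture everything up to the next blank line instead of a single line. This is only a sketch: the names `SECTION_RE` and `first_paragraphs_regex` are made up, and it assumes the same layout as above (Roman numeral line, blank line, title line, blank line, paragraph).

```python
import re

# Hypothetical variant of the pattern above that captures a whole paragraph.
SECTION_RE = re.compile(r"""
    ^[IVXLCDM]+\n\n      # line of Roman numeral characters
    [^a-z\n]+\n\n        # title line without lowercase characters
    (.+?)                # first paragraph, possibly several lines
    (?=\n\s*\n|\s*\Z)    # ...up to the next blank line or end of text
    """, re.VERBOSE | re.MULTILINE | re.DOTALL)


def first_paragraphs_regex(text):
    """Return each section's first paragraph with line breaks joined."""
    return [para.replace("\n", " ") for para in SECTION_RE.findall(text)]
```

With `data` read as in the snippet above, `print("\n\n".join(first_paragraphs_regex(data)))` would print the paragraphs separated by blank lines.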
+0

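The way this one keeps growing: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' [Now they have two problems](http://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/)." '[IV]+', huh? – msw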

+0

How do I print the first paragraph rather than the first line? –

+0

So, I found my way using your code too... thank you so much :) –

0

TXR solution

 
$ txr firstpar.txr data 
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence. 
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created. 

Code in firstpar.txr:

 
@(repeat) 
@num 

@title 

@firstpar 
@ (require (and (< (length num) 5) 
       [some title chr-isupper] 
       (not [some title chr-islower]))) 
@ (do (put-line firstpar)) 
@(end) 

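Basically, we search the input for matches of a three-element multi-line pattern that binds the num, title and firstpar variables. This pattern can match in the wrong places, so a require assertion adds some restrictive heuristics: the section number must be a short line, and the title line must contain some uppercase letters and no lowercase ones. This expression is written in TXR Lisp.

If we get a match that satisfies this constraint, we output the string captured in the firstpar variable.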
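For readers who don't use TXR, the same heuristic can be approximated in Python. This is a loose analogue rather than a translation: the name `first_paragraphs_window` is invented, the lines are assumed to have their newlines already stripped, and a plain sliding window stands in for TXR's pattern matching.

```python
def first_paragraphs_window(lines):
    """Scan for num / blank / title / blank / firstpar windows (hypothetical helper)."""
    found = []
    i = 0
    while i + 4 < len(lines):
        num, blank1, title, blank2, firstpar = lines[i:i + 5]
        if (blank1 == "" and blank2 == ""
                and len(num) < 5                             # short section number
                and any(c.isupper() for c in title)          # some uppercase letters
                and not any(c.islower() for c in title)):    # no lowercase letters
            found.append(firstpar)
            i += 5  # continue after the matched window
        else:
            i += 1  # slide the window down by one line
    return found
```

For a file, the lines could be prepared with `[line.rstrip("\n") for line in open("data")]` before calling the function.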