import re  
import requests as rq 
from bs4 import BeautifulSoup as bs 

r = rq.get('http://www.cbsnews.com/news/transcript-of-the-2015-gop-debate-9-pm/') 
b = bs(r.text, 'html.parser') 
debatetext = b.find('div', attrs={'class': 'entry'}).findAll('p')
pattern = re.compile(r'[A-Z][A-Z][A-Z].*:')
for line in debatetext:
    if re.search(pattern, line.text) is not None:
        print(line)


<p> BUSH: Here's what I believe. I believe we're at the verge of the greatest time to be alive in this world. </p> 
<p> But Washington is holding us back. How we tax, how we regulate. We're not embracing the energy revolution in our midst, a broken immigration system that has been politicized rather than turning it into an economic driver. </p> 
<p> We're not protecting and preserving our entitlement system or reforming for the next generation. All these things languish while we have politicians in Washington using these as wedge issues. </p> 
<p> Here's my commitment to you, because I did it as Florida. We can fix these things. We can grow economically and restore America's leadership in the world, so that everybody has a chance to rise up. I humbly ask for your vote, whenever you're gonna get to vote, whenever the primary is. Thank you all very much. </p> 



<div class="entry" itemprop="articleBody" id="article-entry">... 

<p> CARSON: -- extremely effectively.</p> 
<p> (APPLAUSE)</p> 
<p> BAIER: Gentlemen, the next series of questions deals with ObamaCare and the role of the federal government.</p> 
<p> Mr. Trump, ObamaCare is one of the things you call a disaster.</p> 
<p> TRUMP: A complete disaster, yes.</p> 
<p> BAIER: Saying it needs to be repealed and replaced.</p> 
<p> TRUMP: Correct.</p> 
<p> BAIER: Now, 15 years ago, uncalled yourself a liberal on health care. You were for a single-payer system, a Canadian-style system.</p> 
<p> Why were you for that then and why aren't you for it now? TRUMP: First of all, I'd like to just go back to one. In July of 2004, I came out strongly against the war with Iraq, because it was going to destabilize the Middle East. And I'm the only one on this stage that knew that and had the vision to say it. And that's exactly what happened.</p> 
<p> BAIER: But on ObamaCare...</p> 
<p> TRUMP: And the Middle East became totally destabilized. So I just want to say.</p> 
<p> As far as single payer, it works in Canada. It works incredibly well in Scotland. It could have worked in a different age, which is the age you're talking about here.</p> 
<p> What I'd like to see is a private system without the artificial lines around every state. I have a big company with thousands and thousands of employees. And if I'm negotiating in New York or in New Jersey or in California, I have like one bidder. Nobody can bid.</p> 
<p> You know why?</p> 
<p> Because the insurance companies are making a fortune because they have control of the politicians, of course, with the exception of the politicians on this stage.</p> 
<p> But they have total control of the politicians. They're making a fortune.</p> 
<p> Get rid of the artificial lines and you will have...</p> 
<p> (BUZZER NOISE)</p> 
<p> TRUMP: -- yourself great plans. And then we have to take care of the people that can't take care of themselves. And I will do that through a different system.</p> 
<p> (CROSSTALK)</p> 
<p> BAIER: Mr. Trump, hold up one second.</p> 
<p> PAUL: I've got a news flash...</p> 




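import pprint

## look for 3 capital letters
## assume every line starts with "<p>" (so won't test for it)
## test_data is assumed to be one string holding the "<p> ... </p>" lines, one per line

one_group = []    ## lines collected for the current speaker
for record in test_data.split("\n"):
    split_rec = record.split()
    if len(split_rec) > 1:
        ## the word after "<p>" is a name if its first 3 characters are capitals
        found = len(split_rec[1]) >= 3
        for ltr in split_rec[1][:3]:
            if ltr < "A" or ltr > "Z":
                found = False

        ## found new name so print previous block
        if found and len(one_group):
            pprint.pprint(one_group)
            one_group = []
        one_group.append(record)

## last group
print(one_group)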

pattern = re.compile(r'([A-Z]+):(.*)') 

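The + matches one or more uppercase letters, so this is just a small cleanup of the earlier regex. I also changed it to create capture groups: the first captures the uppercase letters before the ':', and the second captures any text after the ':'.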
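As a quick check of what the two capture groups return, here is a minimal sketch that runs the pattern on a few lines copied from the transcript excerpt (the last one is a shortened piece of the paragraph where the speaker tag appears mid-paragraph). The names `samples` and `results` are only for illustration.

```python
import re

# Group 1: the run of capital letters before the colon.
# Group 2: everything after the colon on that line.
pattern = re.compile(r'([A-Z]+):(.*)')

# Lines taken from the transcript excerpt; the last one is shortened.
samples = [
    "TRUMP: A complete disaster, yes.",
    "(APPLAUSE)",
    "You know why?",
    "Why were you for that then and why aren't you for it now? "
    "TRUMP: First of all, I'd like to just go back to one.",
]

results = []
for text in samples:
    m = pattern.search(text)
    # Store (name, text) for tagged lines, None for untagged ones.
    results.append(m.groups() if m else None)

for text, found in zip(samples, results):
    print(repr(text[:30]), '->', found)
```

Note that group 2 keeps the leading space after the colon, and that in the last sample the text before the mid-paragraph tag is not captured by either group.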


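To handle the problem of attaching the untagged statements that follow a line matching this initial regex pattern, I used a state machine. Note that this only works because I assume that all text following a regex match belongs to the speaker found by that match.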

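d = {}        # speaker name -> list of text blocks
name = ''
blurb = ''
state = 0     # 0 = no speaker seen yet, 1 = inside a speaker's turn
for line in debatetext:
    m = re.search(pattern, line.text)
    if m:
        name = m.group(1)
        blurb = m.group(2)
        # skip past speakers section with all caps at beginning
        if name != 'SPEAKERS':
            state = 1
            if name in d:
                d[name].append(blurb)
            else:
                d[name] = [blurb]
    else:
        # reconstructed branch (the original body is missing here): assume
        # untagged text belongs to the most recently matched speaker
        if state and name in d:
            d[name].append(line.text)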
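To try the grouping idea without fetching the page, here is a rough sketch that wraps the same logic in a function and feeds it a handful of paragraphs picked from the excerpt above (they are not contiguous in the transcript). The function name `group_by_speaker` and the `sample` list are my own, not part of the original code, and the sketch works on plain strings rather than BeautifulSoup tags.

```python
import re

pattern = re.compile(r'([A-Z]+):(.*)')

def group_by_speaker(paragraphs):
    """Map each speaker tag to the list of paragraphs attributed to them.

    Assumes every untagged paragraph belongs to the most recent speaker.
    """
    by_speaker = {}
    name = None  # no speaker seen yet
    for text in paragraphs:
        m = pattern.search(text)
        if m and m.group(1) != 'SPEAKERS':
            # New speaker tag: record the text after the colon.
            name = m.group(1)
            by_speaker.setdefault(name, []).append(m.group(2).strip())
        elif not m and name is not None:
            # Untagged paragraph: attribute it to the current speaker.
            by_speaker[name].append(text.strip())
    return by_speaker

# Paragraph texts picked from the transcript excerpt.
sample = [
    "BAIER: But on ObamaCare...",
    "TRUMP: And the Middle East became totally destabilized. So I just want to say.",
    "You know why?",
    "(BUZZER NOISE)",
    "BAIER: Mr. Trump, hold up one second.",
]

grouped = group_by_speaker(sample)
for speaker, blocks in grouped.items():
    print(speaker, blocks)
```

Because of the "everything untagged belongs to the last speaker" assumption, stage directions such as (BUZZER NOISE) end up in the current speaker's list. If that matters, they would need to be filtered out separately.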

