2015-09-14 59 views
0

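Find all text until the next regex match

I'm trying to collect all the text up to the next match using a regular expression in Python. The data is a debate transcript that is available online.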

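Currently I'm iterating over every p tag, identifying the ones that start with a speaker label, and then appending all consecutive text without a speaker label to the previous match.

I'm not sure whether this is the best way to proceed, or whether it would be easier to simply search and group all the text at once. At the moment I can only pick out the paragraphs that begin with at least three capital letters.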

import re
import requests as rq
from bs4 import BeautifulSoup as bs

r = rq.get('http://www.cbsnews.com/news/transcript-of-the-2015-gop-debate-9-pm/')
b = bs(r.text, 'html.parser')
# attrs is a dict of attribute name to value (the original passed a set by mistake)
debatetext = b.find('div', attrs={'class': 'entry'}).find_all('p')
# three capital letters, then anything up to a colon
pattern = re.compile(r'[A-Z][A-Z][A-Z].*:')
for line in debatetext:
    if pattern.search(line.text) is not None:
        print(line)

Sample text

<p> BUSH: Here's what I believe. I believe we're at the verge of the greatest time to be alive in this world. </p> 
<p> But Washington is holding us back. How we tax, how we regulate. We're not embracing the energy revolution in our midst, a broken immigration system that has been politicized rather than turning it into an economic driver. </p> 
<p> We're not protecting and preserving our entitlement system or reforming for the next generation. All these things languish while we have politicians in Washington using these as wedge issues. </p> 
<p> Here's my commitment to you, because I did it as Florida. We can fix these things. We can grow economically and restore America's leadership in the world, so that everybody has a chance to rise up. I humbly ask for your vote, whenever you're gonna get to vote, whenever the primary is. Thank you all very much. </p> 

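Ideally I'd like the three paragraphs that don't start with "BUSH:" to be appended to the first statement, or to have "BUSH:" (or whichever other candidate is talking) added to the beginning of each of those paragraphs.

Edit: larger sample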

<div class="entry" itemprop="articleBody" id="article-entry">... 


<p> CARSON: -- extremely effectively.</p> 
<p> (APPLAUSE)</p> 
<p> BAIER: Gentlemen, the next series of questions deals with ObamaCare and the role of the federal government.</p> 
<p> Mr. Trump, ObamaCare is one of the things you call a disaster.</p> 
<p> TRUMP: A complete disaster, yes.</p> 
<p> BAIER: Saying it needs to be repealed and replaced.</p> 
<p> TRUMP: Correct.</p> 
<p> BAIER: Now, 15 years ago, uncalled yourself a liberal on health care. You were for a single-payer system, a Canadian-style system.</p> 
<p> Why were you for that then and why aren't you for it now? TRUMP: First of all, I'd like to just go back to one. In July of 2004, I came out strongly against the war with Iraq, because it was going to destabilize the Middle East. And I'm the only one on this stage that knew that and had the vision to say it. And that's exactly what happened.</p> 
<p> BAIER: But on ObamaCare...</p> 
<p> TRUMP: And the Middle East became totally destabilized. So I just want to say.</p> 
<p> As far as single payer, it works in Canada. It works incredibly well in Scotland. It could have worked in a different age, which is the age you're talking about here.</p> 
<p> What I'd like to see is a private system without the artificial lines around every state. I have a big company with thousands and thousands of employees. And if I'm negotiating in New York or in New Jersey or in California, I have like one bidder. Nobody can bid.</p> 
<p> You know why?</p> 
<p> Because the insurance companies are making a fortune because they have control of the politicians, of course, with the exception of the politicians on this stage.</p> 
<p> But they have total control of the politicians. They're making a fortune.</p> 
<p> Get rid of the artificial lines and you will have...</p> 
<p> (BUZZER NOISE)</p> 
<p> TRUMP: -- yourself great plans. And then we have to take care of the people that can't take care of themselves. And I will do that through a different system.</p> 
<p> (CROSSTALK)</p> 
<p> BAIER: Mr. Trump, hold up one second.</p> 
<p> PAUL: I've got a news flash...</p> 

Answers

0

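Re: "I'm not sure whether this is the best way to proceed, or whether it would be easier to simply search and group all the text at once". The "best" way is whichever way you understand and that solves the problem. This is quick and dirty, but it should get you started.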

import pprint 

test_data=""" <div class="entry" itemprop="articleBody" id="article-entry">... 


<p> CARSON: -- extremely effectively.</p> 
<p> (APPLAUSE)</p> 
<p> BAIER: Gentlemen, the next series of questions deals with ObamaCare and the role of the federal government.</p> 
<p> Mr. Trump, ObamaCare is one of the things you call a disaster.</p> 
<p> TRUMP: A complete disaster, yes.</p> 
<p> BAIER: Saying it needs to be repealed and replaced.</p> 
<p> TRUMP: Correct.</p> 
<p> BAIER: Now, 15 years ago, uncalled yourself a liberal on health care. You were for a single-payer system, a Canadian-style system.</p> 
<p> Why were you for that then and why aren't you for it now? TRUMP: First of all, I'd like to just go back to one. In July of 2004, I came out strongly against the war with Iraq, because it was going to destabilize the Middle East. And I'm the only one on this stage that knew that and had the vision to say it. And that's exactly what happened.</p> 
<p> BAIER: But on ObamaCare...</p> 
<p> TRUMP: And the Middle East became totally destabilized. So I just want to say.</p> 
<p> As far as single payer, it works in Canada. It works incredibly well in Scotland. It could have worked in a different age, which is the age you're talking about here.</p> 
<p> What I'd like to see is a private system without the artificial lines around every state. I have a big company with thousands and thousands of employees. And if I'm negotiating in New York or in New Jersey or in California, I have like one bidder. Nobody can bid.</p> 
<p> You know why?</p> 
<p> Because the insurance companies are making a fortune because they have control of the politicians, of course, with the exception of the politicians on this stage.</p> 
<p> But they have total control of the politicians. They're making a fortune.</p> 
<p> Get rid of the artificial lines and you will have...</p> 
<p> (BUZZER NOISE)</p> 
<p> TRUMP: -- yourself great plans. And then we have to take care of the people that can't take care of themselves. And I will do that through a different system.</p> 
<p> (CROSSTALK)</p> 
<p> BAIER: Mr. Trump, hold up one second.</p> 
<p> PAUL: I've got a news flash...</p>""" 

## look for 3 capital letters
## assume every line starts with "<p>" (so won't test for it)

one_group = []
for record in test_data.split("\n"):
    record = record.strip()
    if record:
        split_rec = record.split()
        ## need a word after "<p>" that is at least 3 characters long
        found = len(split_rec) > 1 and len(split_rec[1]) >= 3
        if found:
            for ltr in split_rec[1][:3]:
                if ltr < "A" or ltr > "Z":
                    found = False

        ## found new name so print previous block
        if found and one_group:
            pprint.pprint(one_group)
            print()
            one_group = []
        one_group.append(record)

## last group
print(one_group)

Thanks for the suggestion, Curly Joe. I continued with my regex approach plus a state machine, and it works well.

1

I reworked my regex slightly, so it now looks like this:

pattern = re.compile(r'([A-Z]+):(.*)') 

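The + matches one or more capital letters, so this is just a bit of cleanup of the previous regex. I also changed it to create capture groups: the first is the run of capital letters before the ':', the second is any text after the ':'.

Now the second group (group(0) is the whole match, group(1) is the name, group(2) is the text) can be appended to a dictionary under the speaker's name, and the consecutive text can be appended after it.

To handle the unlabeled statements that follow a paragraph matching this regex, I used a state machine. Note that this only works because I assume that all text following a regex match belongs to the speaker found by that match.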
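As a small illustration of the capture groups described above, the sketch below applies the same pattern to two paragraphs taken from the sample. The variable names `labeled` and `unlabeled` are made up for this example, and plain strings stand in for the `line.text` values.

```python
import re

pattern = re.compile(r'([A-Z]+):(.*)')

# Two paragraphs from the sample, as plain strings (illustrative names)
labeled = " TRUMP: A complete disaster, yes."
unlabeled = " You know why?"

m = pattern.search(labeled)
print(m.group(0))  # whole match, from the name to the end of the line
print(m.group(1))  # speaker name
print(m.group(2))  # text after the colon, leading space included

# A paragraph without a speaker label does not match at all
print(pattern.search(unlabeled))
```

Because `re.search` scans the whole string, a label that appears in the middle of a paragraph is also found, and any text before it is not part of either group.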

d = {}
name = ''
blurb = ''
state = 0  # becomes 1 once the first real speaker has been seen
for line in debatetext:
    m = re.search(pattern, line.text)
    if m:
        name = m.group(1)
        blurb = m.group(2)
        # skip past speakers section with all caps at beginning
        if name != 'SPEAKERS':
            state = 1
            if name in d:
                d[name].append(blurb)
            else:
                d[name] = [blurb]
    else:
        # no speaker label: the text belongs to the most recent speaker
        if state:
            d[name].append(line.text)
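It took a bit of IRL help this time, but I think this solution works well in this case and may help others. I used it to parse the second debate and it worked well. I may tinker with it so that statements are added in order, so that I can combine it with Twitter data for some correlation analysis.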
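One possible way to keep the statements in speaking order is sketched below, under the same assumption that unlabeled paragraphs belong to the most recent speaker. It collects (speaker, text) pairs in a list instead of a dictionary. The function name `group_statements` and the `paragraphs` list are hypothetical, and plain strings stand in for the `line.text` values from BeautifulSoup.

```python
import re

pattern = re.compile(r'([A-Z]+):(.*)')

def group_statements(paragraphs):
    """Return (speaker, text) pairs in the order they were spoken."""
    statements = []
    for text in paragraphs:
        m = pattern.search(text)
        if m:
            # skip the all-caps SPEAKERS header, as in the answer above
            if m.group(1) == 'SPEAKERS':
                continue
            # a labeled paragraph opens a new statement
            statements.append([m.group(1), m.group(2).strip()])
        elif statements:
            # unlabeled paragraph: attach it to the most recent speaker
            statements[-1][1] += ' ' + text.strip()
    return [tuple(s) for s in statements]

# A few paragraphs from the sample (illustrative input)
paragraphs = [
    " BAIER: But on ObamaCare...",
    " TRUMP: And the Middle East became totally destabilized. So I just want to say.",
    " You know why?",
    " PAUL: I've got a news flash...",
]

result = group_statements(paragraphs)
for speaker, text in result:
    print(speaker + ': ' + text)
```

With this approach, stage directions such as "(APPLAUSE)" would also be attached to the previous speaker, so they may need to be filtered out first.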
