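Find all text until the next regex match
I'm trying to collect all the text up to the next match of a regular expression in Python. The data is a debate transcript that is available online.
Currently I iterate over every p tag, identify the ones that begin with a speaker label, and then try to append all the following paragraphs that have no speaker label to the previous match.
I'm not sure whether this is the best way to proceed, or whether it would be easier to search for and group all the text in a single pass. At the moment I can only pick out the paragraphs that start with at least three capital letters.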
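import re
import requests as rq
from bs4 import BeautifulSoup as bs

# Download the transcript page and parse the HTML.
r = rq.get('http://www.cbsnews.com/news/transcript-of-the-2015-gop-debate-9-pm/')
b = bs(r.text, 'html.parser')

# Every <p> inside the article body (<div class="entry">).
debatetext = b.find('div', attrs={'class': 'entry'}).find_all('p')

# Three capital letters, then anything, then a colon.
pattern = re.compile(r'[A-Z][A-Z][A-Z].*:')

for line in debatetext:
    # Print only the paragraphs that contain a speaker label.
    if pattern.search(line.text) is not None:
        print(line)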
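One thing worth noting about the pattern above: it is unanchored and uses a greedy `.*`, so `search` also matches when three capital letters and a colon appear anywhere later in a paragraph. A minimal sketch of a stricter check follows, assuming a speaker label is a run of three or more capital letters at the very start of the paragraph, immediately followed by a colon. The names `SPEAKER` and `split_speaker` are hypothetical, and labels containing spaces or apostrophes would need a wider character class.

```python
import re

# Assumption: the label is 3+ capital letters at the start of the text,
# directly followed by a colon (e.g. "BUSH:", "TRUMP:").
SPEAKER = re.compile(r'^\s*([A-Z]{3,}):\s*(.*)$', re.DOTALL)


def split_speaker(text):
    """Return (speaker, remainder), or (None, text) if no label leads the text."""
    m = SPEAKER.match(text)
    if m:
        return m.group(1), m.group(2).strip()
    return None, text.strip()


labeled = split_speaker(" BUSH: Here's what I believe. ")
unlabeled = split_speaker(" You know why?")
direction = split_speaker(" (APPLAUSE)")
```

Because the pattern is anchored, a label that appears in the middle of a paragraph is not detected; such paragraphs would have to be split separately.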
Sample text
<p> BUSH: Here's what I believe. I believe we're at the verge of the greatest time to be alive in this world. </p>
<p> But Washington is holding us back. How we tax, how we regulate. We're not embracing the energy revolution in our midst, a broken immigration system that has been politicized rather than turning it into an economic driver. </p>
<p> We're not protecting and preserving our entitlement system or reforming for the next generation. All these things languish while we have politicians in Washington using these as wedge issues. </p>
<p> Here's my commitment to you, because I did it as Florida. We can fix these things. We can grow economically and restore America's leadership in the world, so that everybody has a chance to rise up. I humbly ask for your vote, whenever you're gonna get to vote, whenever the primary is. Thank you all very much. </p>
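Ideally I'd like either to append the three lines that have no "BUSH:" label to the first statement, or to add "BUSH:" (or whichever other candidate is speaking) to the start of each line.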
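The first of those two outcomes, appending unlabeled paragraphs to the preceding statement, could be sketched roughly as below. This is only an illustration under the assumption that a label is three or more capital letters followed by a colon at the start of a paragraph; `group_by_speaker` and `SPEAKER` are hypothetical names, and the sample strings are shortened from the sample text above.

```python
import re

# Assumption: a speaker label is 3+ capital letters plus a colon at the start.
SPEAKER = re.compile(r'^\s*([A-Z]{3,}):\s*')


def group_by_speaker(paragraphs):
    """Merge each unlabeled paragraph into the preceding labeled one."""
    turns = []  # [speaker, text] pairs in transcript order
    for text in paragraphs:
        m = SPEAKER.match(text)
        if m:
            # New label: start a new turn with the text after the label.
            turns.append([m.group(1), text[m.end():].strip()])
        elif turns:
            # No label: continuation of whoever spoke last.
            turns[-1][1] += ' ' + text.strip()
        # Unlabeled text before the first label is skipped in this sketch.
    return [tuple(t) for t in turns]


sample = [
    " BUSH: Here's what I believe. ",
    " But Washington is holding us back. ",
    " Here's my commitment to you. ",
]
turns = group_by_speaker(sample)
```

With the code from the question, the input would be something like `[p.text for p in debatetext]`. Note that this sketch also merges parenthesized paragraphs such as (APPLAUSE) into the current turn.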
Edit: larger sample
<div class="entry" itemprop="articleBody" id="article-entry">...
<p> CARSON: -- extremely effectively.</p>
<p> (APPLAUSE)</p>
<p> BAIER: Gentlemen, the next series of questions deals with ObamaCare and the role of the federal government.</p>
<p> Mr. Trump, ObamaCare is one of the things you call a disaster.</p>
<p> TRUMP: A complete disaster, yes.</p>
<p> BAIER: Saying it needs to be repealed and replaced.</p>
<p> TRUMP: Correct.</p>
<p> BAIER: Now, 15 years ago, uncalled yourself a liberal on health care. You were for a single-payer system, a Canadian-style system.</p>
<p> Why were you for that then and why aren't you for it now? TRUMP: First of all, I'd like to just go back to one. In July of 2004, I came out strongly against the war with Iraq, because it was going to destabilize the Middle East. And I'm the only one on this stage that knew that and had the vision to say it. And that's exactly what happened.</p>
<p> BAIER: But on ObamaCare...</p>
<p> TRUMP: And the Middle East became totally destabilized. So I just want to say.</p>
<p> As far as single payer, it works in Canada. It works incredibly well in Scotland. It could have worked in a different age, which is the age you're talking about here.</p>
<p> What I'd like to see is a private system without the artificial lines around every state. I have a big company with thousands and thousands of employees. And if I'm negotiating in New York or in New Jersey or in California, I have like one bidder. Nobody can bid.</p>
<p> You know why?</p>
<p> Because the insurance companies are making a fortune because they have control of the politicians, of course, with the exception of the politicians on this stage.</p>
<p> But they have total control of the politicians. They're making a fortune.</p>
<p> Get rid of the artificial lines and you will have...</p>
<p> (BUZZER NOISE)</p>
<p> TRUMP: -- yourself great plans. And then we have to take care of the people that can't take care of themselves. And I will do that through a different system.</p>
<p> (CROSSTALK)</p>
<p> BAIER: Mr. Trump, hold up one second.</p>
<p> PAUL: I've got a news flash...</p>
Thanks for the suggestion, Curly Joe. I went ahead with my regex approach plus a state machine, and it works well.
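The comment does not show the actual code, so the following is only a guess at what a regex plus state machine approach might look like, not the commenter's implementation. The state is the current speaker; every paragraph is emitted with that speaker attached, and parenthesized all-caps paragraphs such as (APPLAUSE) are assumed to be stage directions that leave the state unchanged. `label_paragraphs`, `SPEAKER`, and `STAGE` are hypothetical names.

```python
import re

# Assumptions: labels are 3+ capital letters plus a colon at the start;
# stage directions are a whole paragraph of capitals inside parentheses.
SPEAKER = re.compile(r'^\s*([A-Z]{3,}):\s*')
STAGE = re.compile(r'^\s*\([A-Z ]+\)\s*$')


def label_paragraphs(paragraphs):
    """Yield (speaker, text) for each paragraph, carrying the speaker forward."""
    current = None  # state: who is speaking; None before the first label
    for text in paragraphs:
        if STAGE.match(text):
            # Stage direction: emit without a speaker, keep the state.
            yield None, text.strip()
            continue
        m = SPEAKER.match(text)
        if m:
            current = m.group(1)   # transition to a new speaker
            text = text[m.end():]  # drop the label from the text
        yield current, text.strip()


paras = [
    " CARSON: -- extremely effectively.",
    " (APPLAUSE)",
    " BAIER: Gentlemen, the next series of questions deals with ObamaCare.",
    " Mr. Trump, ObamaCare is one of the things you call a disaster.",
    " TRUMP: A complete disaster, yes.",
]
labeled = list(label_paragraphs(paras))
```

A label that appears mid-paragraph, as in one paragraph of the larger sample, is not handled by this sketch and would be attributed to the previous speaker.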