2017-02-15 24 views

Regex findall does not return multiple results as expected

I have the following two snippets in Python (short_sentence is a part of long_sentence here):

short_sentence = '&lt;p data-reactid=&quot;389&quot;&gt;THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.&lt;/p&gt;'

long_sentence = '<description>&lt;img src=&quot;http://cdn.static-economist.com/sites/default/files/images/print-edition/20170211_LDC811.png&quot; alt=&quot;&quot; title=&quot;&quot; height=&quot;376&quot; width=&quot;458&quot; class=&quot; blog-post-article-image blog-post-article-image__slim&quot; data-reactid=&quot;388&quot;/&gt;&lt;p data-reactid=&quot;389&quot;&gt;THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.&lt;/p&gt;&lt;p data-reactid=&quot;390&quot;&gt;To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.&lt;/p&gt;' 

I want to extract every (shortest) substring that lies between &lt;p + anything + &gt; and &lt;/p&gt;. I know that short_sentence contains one such occurrence:

THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d. 

long_sentence contains the one above and one more:

To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d. 

As far as I know, Python's re.findall() returns all occurrences of the matching substring in a text. When I run the following:

re.findall("&lt;p.*&gt;(.*?)&lt;/p&gt;", short_sentence) 

I get the correct, expected result:

['THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.'] 

However, when I try to extract the two strings from long_sentence with the following:

re.findall("&lt;p.*&gt;(.*?)&lt;/p&gt;", long_sentence) 

I still get only one occurrence (the second one):

['To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.'] 

My question is: what goes wrong in the second case? Why doesn't it return both occurrences?

Use re.findall("&lt;p.*?&gt;(.*?)&lt;/p&gt;", long_sentence) instead.

If you are trying to parse HTML or XML, consider using an HTML or XML parsing library rather than regular expressions. – Kevin
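A rough sketch of the parser-based approach suggested in this comment, assuming Python 3 and only the standard library. The ParagraphCollector class and the shortened escaped string are hypothetical stand-ins introduced for illustration; they are not from the original post.

```python
import html
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text content of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_p = False

    def handle_data(self, data):
        if self._inside_p:
            self.paragraphs[-1] += data


# Hypothetical, shortened stand-in for long_sentence (entity-escaped markup).
escaped = (
    "&lt;p data-reactid=&quot;389&quot;&gt;First paragraph.&lt;/p&gt;"
    "&lt;p data-reactid=&quot;390&quot;&gt;Second paragraph.&lt;/p&gt;"
)

collector = ParagraphCollector()
collector.feed(html.unescape(escaped))  # turn &lt;p ...&gt; back into <p ...>
collector.close()
print(collector.paragraphs)
```

Unescaping first and then parsing avoids having to reason about greedy versus lazy quantifiers at all, at the cost of a little more code.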

Answers

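p.* is greedy, so it matches as much as it can. In long_sentence it therefore runs past the first paragraph and only backs off to the last &gt; that still lets the rest of the pattern match, which is why only the second paragraph is captured. If you use p.*? instead, you will get the expected result.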
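The difference can be seen in a minimal sketch. The sample string is a hypothetical, shortened stand-in for long_sentence, and the two patterns are the one from the question and the lazy variant suggested here.

```python
import re

# Hypothetical, shortened stand-in for long_sentence (entity-escaped markup).
sample = (
    "&lt;p data-reactid=&quot;389&quot;&gt;First paragraph.&lt;/p&gt;"
    "&lt;p data-reactid=&quot;390&quot;&gt;Second paragraph.&lt;/p&gt;"
)

# Greedy: ".*" runs to the last "&gt;" that still lets the rest match,
# so the first paragraph is swallowed before the capture group starts.
greedy = re.findall(r"&lt;p.*&gt;(.*?)&lt;/p&gt;", sample)

# Lazy: ".*?" stops at the first "&gt;", so each paragraph matches on its own.
lazy = re.findall(r"&lt;p.*?&gt;(.*?)&lt;/p&gt;", sample)

print(greedy)  # ['Second paragraph.']
print(lazy)    # ['First paragraph.', 'Second paragraph.']
```

Only the quantifier after p changes; the capture group was already lazy in the question, which is why the short input worked.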

There is a bit more information on the topic here, if you need it: http://www.regular-expressions.info/repeat.html

Excerpt:

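Suppose you want to use a regex to match an HTML tag. You know the input will be a valid HTML file, so the regular expression does not need to exclude any invalid use of angle brackets. If it sits between angle brackets, it is an HTML tag.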

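Most people new to regular expressions will attempt to use <.+>. They will be surprised when they test it on a string like This is a <EM>first</EM> test. You might expect the regex to match <EM> and, when continuing after that match, </EM>.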
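The surprise the excerpt describes can be reproduced with a short sketch. The pattern <.+> and the sample string are assumed to be the ones used on the linked page; the lazy variant <.+?> is added here for comparison.

```python
import re

# Sample string assumed to match the one on the linked page.
text = "This is a <EM>first</EM> test"

# Greedy: ".+" extends to the last ">" in the string, giving one long match.
greedy_tags = re.findall(r"<.+>", text)

# Lazy: ".+?" stops at the first ">", so each tag is matched separately.
lazy_tags = re.findall(r"<.+?>", text)

print(greedy_tags)  # ['<EM>first</EM>']
print(lazy_tags)    # ['<EM>', '</EM>']
```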