2013-10-13 64 views
1

Finding the text between two strings in Python

I have text like this:

<div style="margin-left:10px;margin-right:10px;"> 
<!-- start of lyrics --> 
There are times when I've wondered<br /> 
And times when I've cried<br /> 
When my prayers they were answered<br /> 
At times when I've lied<br /> 
But if you asked me a question<br /> 
Would I tell you the truth<br /> 
Now there's something to bet on<br /> 
You've got nothing to lose<br /> 
<br /> 
When I've sat by the window<br /> 
And gazed at the rain<br /> 
With an ache in my heart<br /> 
But never feeling the pain<br /> 
And if you would tell me<br /> 
Just what my life means<br /> 
Walking a long road<br /> 
Never reaching the end<br /> 
<br /> 
God give me the answer to my life<br /> 
God give me the answer to my dreams<br /> 
God give me the answer to my prayers<br /> 
God give me the answer to my being 
<!-- end of lyrics --> 
</div> 

I want to print the lyrics of the song, but re.findall and re.search don't work in this case. How can I do it? I am using this code:

lyrics = re.findall('<div style="margin-left:10px;margin-right:10px;">(.*?)</div>', open('file.html','r').read()) 

for words in lyrics: 
    print words 
+1

[RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) –

+0

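`re.findall` and `re.search` definitely still work, and they will work in this case, so you are simply not using the right regular expression. Because you haven't posted what you are trying to do, it is hard for people to help you. – Ben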
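One likely reason the pattern in the question finds nothing: by default `.` does not match a newline, so `(.*?)` cannot span the multi-line `<div>`. The following is a minimal sketch, assuming Python 3 and a shortened sample string (`sample`, `pattern`, `without_flag` and `with_flag` are illustrative names, not from the question), that shows the effect of the `re.DOTALL` flag:

```python
import re

# Shortened stand-in for the page in the question (illustrative only).
sample = (
    '<div style="margin-left:10px;margin-right:10px;">\n'
    "<!-- start of lyrics -->\n"
    "There are times when I've wondered<br />\n"
    "And times when I've cried\n"
    "<!-- end of lyrics -->\n"
    "</div>\n"
)

pattern = r'<div style="margin-left:10px;margin-right:10px;">(.*?)</div>'

# Without a flag, "." stops at newlines, so nothing between the tags matches.
without_flag = re.findall(pattern, sample)

# With re.DOTALL, "." also matches newlines and the whole block is captured.
with_flag = re.findall(pattern, sample, re.DOTALL)

print(len(without_flag))  # number of matches without the flag
print(len(with_flag))     # number of matches with the flag
print(with_flag[0])
```

The captured text still contains the `<br />` tags and the comment markers, so those would have to be removed in a separate step.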

Answers

1

Try this:

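import re

with open(r'<file_path>', 'r') as file:
    for line in file:
        # Skip lines that start with a tag (<div>, comments, bare <br />).
        if re.match(r'^<', line) is None:
            # Keep the text before the first '<'; with no '<', find() returns -1 and the slice drops the newline.
            print(line[:line.find('<')])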

Output

There are times when I've wondered 
And times when I've cried 
When my prayers they were answered 
At times when I've lied 
But if you asked me a question 
Would I tell you the truth 
Now there's something to bet on 
You've got nothing to lose 
When I've sat by the window 
And gazed at the rain 
With an ache in my heart 
But never feeling the pain 
And if you would tell me 
Just what my life means 
Walking a long road 
Never reaching the end 
God give me the answer to my life 
God give me the answer to my dreams 
God give me the answer to my prayers 
God give me the answer to my being 

Edit: using urllib to fetch the web page and extract the lyrics:

from lxml import etree 
import urllib, StringIO 

# Rip file from URL   
resultado=urllib.urlopen('http://www.azlyrics.com/lyrics/ironmaiden/noprayerforthedying.html') 
html = resultado.read() 
# Parse html to etree 
parser= etree.HTMLParser() 
tree=etree.parse(StringIO.StringIO(html),parser) 
# Apply the xpath rule 
e = tree.xpath("//div[@style='margin-left:10px;margin-right:10px;']/text()") 
# print output 
for i in e: 
    print str(i).strip() 
+0

Sorry about the question. I am getting these lyrics from this HTML: view-source:http://www.azlyrics.com/lyrics/ironmaiden/noprayerforthedying.html –

+0

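Cool, but you should have mentioned that in the question. Anyway, it's good that you found a solution. Mine works on flat files; I designed it for efficiency. – Vivek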

+0

No, I haven't found a solution :( –

1

You shouldn't use regular expressions to parse HTML.

It looks like you are scraping a website. You could use scrapy, or lxml with XPath:

Python 2.7.5+ (default, Sep 19 2013, 13:48:49) 
>>> html = """<div style="margin-left:10px;margin-right:10px;"> 
... <!-- start of lyrics --> 
... There are times when I've wondered<br /> 
... And times when I've cried<br /> 
... When my prayers they were answered<br /> 
... At times when I've lied<br /> 
... But if you asked me a question<br /> 
... Would I tell you the truth<br /> 
... Now there's something to bet on<br /> 
... You've got nothing to lose<br /> 
... <br /> 
... When I've sat by the window<br /> 
... And gazed at the rain<br /> 
... With an ache in my heart<br /> 
... But never feeling the pain<br /> 
... And if you would tell me<br /> 
... Just what my life means<br /> 
... Walking a long road<br /> 
... Never reaching the end<br /> 
... <br /> 
... God give me the answer to my life<br /> 
... God give me the answer to my dreams<br /> 
... God give me the answer to my prayers<br /> 
... God give me the answer to my being 
... <!-- end of lyrics --> 
... </div>""" 
>>> import lxml.html 
>>> html = lxml.html.fromstring(html) 
>>> html.text_content() 
"\n\nThere are times when I've wondered\nAnd times when I've cried\nWhen my prayers they were answered\nAt times when I've lied\nBut if you asked me a question\nWould I tell you the truth\nNow there's something to bet on\nYou've got nothing to lose\n\nWhen I've sat by the window\nAnd gazed at the rain\nWith an ache in my heart\nBut never feeling the pain\nAnd if you would tell me\nJust what my life means\nWalking a long road\nNever reaching the end\n\nGod give me the answer to my life\nGod give me the answer to my dreams\nGod give me the answer to my prayers\nGod give me the answer to my being\n\n" 
>>> 
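If lxml is not installed, the same idea (let a parser find the element instead of a regular expression) can be sketched with the standard library's `html.parser`. This is only an illustration under stated assumptions: Python 3, and the names `DivTextExtractor`, `extract_lyrics` and `sample` are made up for this example and do not come from the answer above.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect text found inside the first <div> with a given style attribute."""

    def __init__(self, style):
        super().__init__()
        self.style = style
        self.depth = 0      # > 0 while inside the target <div>
        self.done = False   # True once the target <div> has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # nested <div> inside the target
        elif dict(attrs).get("style") == self.style:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        # Comments and <br /> tags never reach this method, only text does.
        if self.depth > 0:
            self.chunks.append(data)


def extract_lyrics(html, style="margin-left:10px;margin-right:10px;"):
    parser = DivTextExtractor(style)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    return [line.strip() for line in text.splitlines() if line.strip()]


# Shortened stand-in for the page (illustrative only).
sample = (
    '<div style="margin-left:10px;margin-right:10px;">\n'
    "<!-- start of lyrics -->\n"
    "There are times when I've wondered<br />\n"
    "And times when I've cried<br />\n"
    "<br />\n"
    "God give me the answer to my being\n"
    "<!-- end of lyrics -->\n"
    "</div>"
)

lines = extract_lyrics(sample)
for line in lines:
    print(line)
```

The exact-match comparison on the `style` attribute is a simplification; a real page may format the attribute differently, in which case the comparison would need to be loosened.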
+0

Sorry, the HTML code is this: http://www.azlyrics.com/lyrics/ironmaiden/noprayerforthedying.html. How do I parse it? –

+0

If you load the page, you can use this: `page.xpath('//div[@style="margin-left:10px;margin-right:10px;"]')[0].text_content()`. But that is already a different question. See the tags [tag:lxml] and [tag:xpath] – warvariuc

0

For this particular piece of HTML, I don't see why re.findall wouldn't work. Four lines of actual code, plus the text, are enough to produce the output.

from re import findall 

html = """ 
<div style="margin-left:10px;margin-right:10px;"> 
<!-- start of lyrics --> 
There are times when I've wondered<br /> 
And times when I've cried<br /> 
When my prayers they were answered<br /> 
At times when I've lied<br /> 
But if you asked me a question<br /> 
Would I tell you the truth<br /> 
Now there's something to bet on<br /> 
You've got nothing to lose<br /> 
<br /> 
When I've sat by the window<br /> 
And gazed at the rain<br /> 
With an ache in my heart<br /> 
But never feeling the pain<br /> 
And if you would tell me<br /> 
Just what my life means<br /> 
Walking a long road<br /> 
Never reaching the end<br /> 
<br /> 
God give me the answer to my life<br /> 
God give me the answer to my dreams<br /> 
God give me the answer to my prayers<br /> 
God give me the answer to my being 
<!-- end of lyrics --> 
</div> 
""" 

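# Capture the text before each <br />. str.strip('<br />') strips a set of
# characters rather than the tag, so a capture group is used instead.
raw = findall(r'(.*)<br />', html)

for line in raw:
    print(line.strip())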
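The pattern above only matches lines that end in `<br />`, so the final lyric line, which has no trailing tag, is not printed. A possible alternative, sketched here under the assumption that the page always contains the two comment markers shown in the question, is to cut out the text between the markers first and remove the tags afterwards. The function name `lyrics_between_markers` and the `sample` string are hypothetical, and the code assumes Python 3.

```python
import re


def lyrics_between_markers(html,
                           start="<!-- start of lyrics -->",
                           end="<!-- end of lyrics -->"):
    """Return the lines found between two marker strings, without <br /> tags."""
    begin = html.find(start)
    stop = html.find(end, begin + len(start)) if begin != -1 else -1
    if begin == -1 or stop == -1:
        return []  # one of the markers is missing
    body = html[begin + len(start):stop]
    # Drop every <br>, <br/> or <br /> tag, then split on the real newlines.
    body = re.sub(r"<br\s*/?>", "", body)
    return [line.strip() for line in body.splitlines() if line.strip()]


# Shortened stand-in for the page (illustrative only).
sample = """
<div style="margin-left:10px;margin-right:10px;">
<!-- start of lyrics -->
Walking a long road<br />
Never reaching the end<br />
<br />
God give me the answer to my being
<!-- end of lyrics -->
</div>
"""

lines = lyrics_between_markers(sample)
for line in lines:
    print(line)
```

Blank lines between stanzas are dropped in this sketch; keeping them would only require removing the `if line.strip()` filter.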