2014-09-02 24 views
-1

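Python RegEx to get specific text

I'm new to regex. I'm using Python to scrape a web page and pick out certain text. I've been able to extract the part I need, but with extra characters attached. In the example below, I'm trying to build an expression that returns only "Need This".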

import re 

test = '<area alt=Need This <span class=;viewot;>view 1</span>||tin view:<br /> ' \ 
     '<div class=sadfca3 24swcdsa c4566 54dscz>' \ 
     '<span class=asafwef1 41sd fd3532 safwef>' \ 
     '<img class=sfecs 234af wefw47 5awef>' \ 
     '</span> ' \ 
     '<span class=sad536 fwfad23 4s214 fsadfw>' \ 
     '<img class=&we234 fsafsdf 2323 asdfsd>' \ 
     '</span>' \ 
     '<span class=afasui2 34 ewiasd23 4fjlwe;>' \ 
     '<img class=sfawejac2 42jk hewwef32 4uafasd>' \ 
     '</span> ' \ 
     '<span class=gdfjuia w8 aw ijfaw a909>' \ 
     '<img class=asfwejhjdkh f 8sd 8 awiosa;f98a 8a' \ 
     '</span> <div class=afkj waj 98u2oi kjaf09></div>" href="jkhafu.php">' 

print("findall") 
print(re.findall(r'<area alt=?.*<span class=', str(test), re.I|re.M)) 
print("finditer") 
print(re.finditer(r'<area alt=+.*<span class=', str(test), re.I|re.M)) 
print("match") 
print(re.match(r'<area alt=+.*<span class=', str(test), re.I|re.M)) 
print("search") 
print(re.search(r'<area alt=+.*<span class=', str(test), re.I|re.M)) 
print("split") 
print(re.split(r'<area alt=+.*<span class=', str(test), flags=re.I|re.M))

re.match and re.search come closest to what I need. Here are the results from the example above:

findall 
['<area alt=Need This <span class=&quot;view&quot;>view 1</span>||time to spend in view:<br /> <div class=sadfca3 24swcdsa c4566 54dscz><span class=asafwef1 41sd fd3532 safwef><img class=sfecs 234af wefw47 5awef></span> <span class=sad536 fwfad23 4s214 fsadfw><img class=&we234 fsafsdf 2323 asdfsd></span><span class=afasui2 34 ewiasd23 4fjlwe;><img class=sfawejac2 42jk hewwef32 4uafasd></span> <span class='] 
finditer 
<callable_iterator object at 0x00493750> 
match 
<_sre.SRE_Match object; span=(0, 405), match='<area alt=Need This <span class=&quot;view&quot;>v> 
search 
<_sre.SRE_Match object; span=(0, 405), match='<area alt=Need This <span class=&quot;view&quot;>v> 
split 
['', 'gdfjuia w8 aw ijfaw a909><img class=asfwejhjdkh f 8sd 8 awiosa;f98a 8a</span> <div class=afkj waj 98u2oi kjaf09></div>" href="jkhafu.php">'] 

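Using Python 3.4, how can I use a regular expression to get only "Need This" from the string named test in the example above?

Any help would be greatly appreciated!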

Related, in case you ever do any more complex HTML parsing: http://stackoverflow.com/a/1732454/406772 – 2014-09-02 23:17:16

Your HTML is not actually valid. Share a link to the web page, or provide the relevant HTML as-is. – alecxe 2014-09-02 23:17:24

Answers

3

Use lookbehind and lookahead assertions:

(?<=area alt=).*?(?=\s+<span class=) 

Code:

>>> m = re.search(r'(?<=area alt=).*?(?=\s+<span class=)', test).group() 
>>> m 
'Need This' 
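
As a self-contained sketch of the same lookaround idea, the script below runs the pattern outside the interactive prompt. The shortened `test` string is an assumption standing in for the asker's full string, and the `result` name is only illustrative.

```python
import re

# Shortened stand-in for the asker's string (an assumption for illustration).
test = ('<area alt=Need This <span class=;viewot;>view 1</span>||tin view:<br /> '
        '<span class=asafwef1 41sd fd3532 safwef>')

# (?<=area alt=)        lookbehind: the match starts right after "area alt="
# .*?                   lazy: consume as few characters as possible
# (?=\s+<span class=)   lookahead: stop before whitespace followed by "<span class="
m = re.search(r'(?<=area alt=).*?(?=\s+<span class=)', test)

# re.search returns None when nothing matches, so guard before calling group().
result = m.group() if m else None
print(result)
```

Because `.*?` is lazy, the match ends at the first `<span class=` rather than the last one, which is what the greedy `.*` in the question ran into.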

Works fine! Thank you! – user908759 2014-09-02 23:46:19

You're welcome... – 2014-09-02 23:49:44

2

You can use this expression:

area alt=([\w\s]+)< 


The code is:

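import re
# Python 3: the ur'' prefix is not valid, a plain raw string is enough
p = re.compile(r'area alt=([\w\s]+)<')
test_str = "YOUR TEXT HERE"
# search, not match: match only succeeds at the very start of the string
m = p.search(test_str)
if m:
    # the group includes the space before "<", so strip it
    print(m.group(1).strip())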
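
A minimal sketch of how this capture-group pattern behaves on the start of the asker's string, assuming Python 3. The shortened `test_str` and the names `captured` and `cleaned` are illustrative assumptions, not part of the original answer.

```python
import re

# Shortened stand-in for the asker's string (an assumption for illustration).
test_str = '<area alt=Need This <span class=;viewot;>view 1</span>'

pattern = re.compile(r'area alt=([\w\s]+)<')

# match() anchors at position 0, where the string has "<", so it finds nothing.
anchored = pattern.match(test_str)

# search() scans the whole string and finds "area alt=" at position 1.
m = pattern.search(test_str)

# [\w\s]+ also consumes the space before "<span", so strip the capture.
captured = m.group(1) if m else None
cleaned = captured.strip() if captured is not None else None

print(repr(anchored))
print(repr(captured))
print(repr(cleaned))
```

The character class `[\w\s]` only allows word characters and whitespace, so this approach would stop early on alt text containing punctuation; the lazy lookaround pattern in the other answer does not have that restriction.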