Scraping a web page in Python using regular expressions

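I want to scrape a list from a website. The list is spread across 4 different pages. The only URL parameter that changes from page to page is "offset". So:

Page 1: offset = 0

Page 2: offset = 100

Page 3: offset = 200

Page 4: offset = 300

I have written the following code: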

import re 
import urllib 

urlHandle = urllib.urlopen("http://sampleurl.com?request=1&offset=0") 
content = urlHandle.read() 

pattern1 = re.compile('<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>') 

for match in pattern1.finditer(content): 
    print(match.group(1))

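The code above retrieves the list for offset=0; I appended "offset=0" to the URL itself. Now, because the list spans 4 pages, I tried writing the following code: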

import re 
import urllib 
import urllib2 
for i in range(0,400,100): 
    targeturl = "http://sampleurl.com?request=1&" 
    values = {'offset':i} 
    data = urllib.urlencode(values) 
    # req = urllib2.Request(targeturl,data) 
    finalurl = targeturl + data 
    urlHandle = urllib.urlopen(finalurl) 
    content = urlHandle.read() 
    pattern1 = re.compile('<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>') 
    for match in pattern1.finditer(content): 
        print(match.group(1))

Somehow it doesn't return anything. What am I doing wrong?

<<EDIT>>

I also tried the following. It didn't work either.

import re 
import urllib 
import urllib2 
for i in range(0,400,100): 
    targeturl = "http://sampleurl.com?request=1&offset=0" 
    urlHandle = urllib.urlopen(targeturl) 
    content = urlHandle.read() 
    pattern1 = re.compile('<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>') 
    for match in pattern1.finditer(content): 
        print(match.group(1))


2014-02-09 Neil

The regular expression in your second snippet is malformed:

'<a href="\/players\/\w{1}\/''\w+\d{2}\.html">([^<]*)</a>'

instead of

'<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>'

Is that a typo?
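One thing worth checking here: Python joins adjacent string literals at compile time, so the two spellings above should produce the same pattern string. The sketch below checks that and then runs the pattern against a small HTML sample. It assumes Python 3, and the sample markup and the names `PLAYER_LINK`, `SAMPLE_HTML` and `extract_names` are made up for illustration.

```python
import re

# Adjacent string literals are concatenated by the compiler, so a stray ''
# in the middle of the pattern does not change the resulting string.
SPLIT = r'<a href="\/players\/\w{1}\/' r'\w+\d{2}\.html">([^<]*)</a>'
JOINED = r'<a href="\/players\/\w{1}\/\w+\d{2}\.html">([^<]*)</a>'

# Same pattern, written as a raw string and without the unneeded escapes
# (forward slashes are not special in Python regular expressions).
PLAYER_LINK = re.compile(r'<a href="/players/\w/\w+\d{2}\.html">([^<]*)</a>')

# Hypothetical markup, invented for this sketch.
SAMPLE_HTML = (
    '<td><a href="/players/j/jordami01.html">Michael Jordan</a></td>'
    '<td><a href="/teams/CHI/1996.html">Chicago Bulls</a></td>'
    '<td><a href="/players/b/birdla01.html">Larry Bird</a></td>'
)


def extract_names(html):
    """Return the link text of every player link found in html."""
    return PLAYER_LINK.findall(html)


if __name__ == "__main__":
    print(SPLIT == JOINED)
    print(extract_names(SAMPLE_HTML))
```

If the pattern matches a saved copy of the page but the loop still prints nothing, the fetched content is the more likely culprit than the regex.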

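Also, on a different but important note, regular expressions cannot fully parse HTML (see "RegEx match open tags except XHTML self-contained tags"). You should really consider switching to an HTML parser (in Python, Scrapy does a good job of parsing content), otherwise you risk spending hours banging your head against weird bugs.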
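For a rough idea of what the parser route looks like, here is a minimal sketch. It uses the standard library's `html.parser` rather than Scrapy so that it runs without installing anything, and it assumes Python 3. The class name `PlayerLinkParser`, the helper `player_names` and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class PlayerLinkParser(HTMLParser):
    """Collect the text of <a> tags whose href starts with /players/."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_player_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; value may be None.
            href = dict(attrs).get("href") or ""
            self._in_player_link = href.startswith("/players/")
            if self._in_player_link:
                self.names.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_player_link = False

    def handle_data(self, data):
        # Text can arrive in several chunks, so append to the current name.
        if self._in_player_link:
            self.names[-1] += data


def player_names(html):
    """Return the link text of every /players/ link in html."""
    parser = PlayerLinkParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Hypothetical markup: note the extra attribute and the single quotes,
# which the regular expression from the question would not match.
SAMPLE_HTML = """
<table>
  <tr><td><a class="p" href='/players/j/jordami01.html'>Michael Jordan</a></td></tr>
  <tr><td><a href="/teams/CHI/1996.html">Chicago Bulls</a></td></tr>
  <tr><td><a href="/players/b/birdla01.html">Larry Bird</a></td></tr>
</table>
"""

if __name__ == "__main__":
    print(player_names(SAMPLE_HTML))
```

The parser does not care about attribute order or quoting style, which is the kind of variation that tends to break a hand-written pattern.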


2014-02-09 09:53:40 Robin

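Thanks for the input, @Robin. But I still don't see any change. Initially I had used "i" in the loop. That was a typo. For the second part, I did change my URL to add the offset, but I am still facing the same problem. There are many other parameters in the URL that stay constant and don't change. I have kept those in the target URL. I hope that is not the problem. – Neil

Also, I am supposed to use regular expressions. I know HTML parsers are easier to use :( – Neil

In your code, you use 'urlHandle = urllib2.urlopen(targeturl)'. Is that also a typo, and do you actually have 'urlHandle = urllib2.urlopen(req)'? Because it looks to me like you aren't using the URL with the offset parameter, which could be causing your problem. – Robin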
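A quick way to rule this out is to build the URLs first and print them before fetching anything. The sketch below assumes Python 3, where `urlencode` lives in `urllib.parse` (the question uses the Python 2 `urllib` module). `BASE_URL` is the placeholder host from the question, and `page_urls` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "http://sampleurl.com"  # placeholder host taken from the question


def page_urls(fixed_params, offsets):
    """Build one URL per offset, keeping the other parameters constant."""
    urls = []
    for offset in offsets:
        # Copy the constant parameters and add the offset for this page.
        params = dict(fixed_params, offset=offset)
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    # Print each URL so you can confirm the offset really changes per page.
    for url in page_urls({"request": 1}, range(0, 400, 100)):
        print(url)
```

Pasting one of the printed URLs into a browser shows whether the server returns the expected page for that offset.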

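Going by the title alone, the problem is "scraping with regular expressions". Don't do it. BeautifulSoup is simply a better tool. Use it. Your life will improve, your cat will sit on your lap, and I won't even mention what your wife/husband (if you don't have one, you will) will do for you.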


2014-02-12 22:52:48 mcepl
