Scrapy - get the final redirected URL

I want to get the final redirected URL in Scrapy. For example, if an anchor tag has a specific format:

<a href="http://www.example.com/index.php" class="FOO_X_Y_Z" />

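Then I need to get the URL that this URL redirects to (if it redirects at all; if it returns 200, that is fine). For example, I get the matching anchor tags like this: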

from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # extract() turns the selected @href nodes into plain strings
    anchors = hxs.select("//a[@class='FOO_X_Y_Z']/@href").extract()

    # Let's assume each anchor contains the actual link (http://...)
    for anchor in anchors:
        final_url = get_final_url(anchor)  # << I would need something like this

        # Save final_url

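So, if I visit http://www.example.com/index.php and it sends me through 10 redirects before finally stopping at http://www.example.com/final.php, that last URL is what I need get_final_url() to return.

I have thought of ways to work toward a solution, but I am asking here to see whether Scrapy already provides one.

Source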

2012-10-07 vanneto

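Again, assuming anchor contains the actual URL, I went with urllib2 to get it done:

import urllib2
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # extract() turns the selected @href nodes into plain strings
    anchors = hxs.select("//a[@class='FOO_X_Y_Z']/@href").extract()

    # Let's assume each anchor contains the actual link (http://...)
    for anchor in anchors:
        # urlopen(url, data, timeout): follows redirects, 1 second timeout
        final_url = urllib2.urlopen(anchor, None, 1).geturl()

        # Save final_url

urllib2.urlopen() returns a file-like object with two additional methods, one of which is geturl(), which returns the final URL (after all redirects). It is not part of Scrapy, but it works.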
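For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request. This is a minimal sketch, not part of the original answer: the helper name get_final_url is borrowed from the question, and the timeout parameter and its default are assumptions.

```python
from urllib.request import urlopen


def get_final_url(url, timeout=1.0):
    """Return the URL reached after urlopen has followed any redirects.

    Assumption: `url` is an absolute http(s) URL. The `timeout` default
    is illustrative, not taken from the original answer.
    """
    # urlopen follows HTTP redirects by default; geturl() reports the URL
    # of the resource that was actually retrieved.
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()
```

Note that urlopen is a blocking call, so inside a Scrapy callback it pauses the crawl while each link is resolved. That may be acceptable for a handful of anchors but is likely to be slow for many.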

Source

2012-10-10 08:29:40 vanneto

-4

It's simple:

print response.url #(inside parse())

Source

2012-10-07 15:30:32

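Actually, that just gives me the URL of the resource I have already fetched. I need the final URL of the link in the href attribute. I guess I wasn't clear enough. Thanks anyway. – vanneto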

-1

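I use response.headers, which returns a dictionary-like collection of header information. The new URL is the value stored under the 'Location' key.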

In [1]: response.headers 
Out[1]: 
{'Date': 'Thu, 09 Jun 2016 00:18:18 GMT', 
'Location': 'https:/www.protiviti.com/en-US/Pages/default.aspx', 
'Server': 'nginx/1.9.1', 
'X-Ms-Invokeapp': '1; RequireReadOnly'}
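A Location header like the one above names only the next hop, so reaching the final URL means repeating the lookup until a response no longer redirects. The sketch below illustrates that loop under stated assumptions: follow_locations and its fetch parameter are hypothetical names, and fetch stands in for whatever issues a request without following redirects and returns a (status, headers) pair.

```python
from urllib.parse import urljoin

# HTTP status codes that signal a redirect carrying a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def follow_locations(start_url, fetch, max_hops=10):
    """Follow Location headers hop by hop and return the last URL reached.

    `fetch` is a hypothetical callable: fetch(url) -> (status, headers),
    where headers is a dict and redirects are NOT followed automatically.
    """
    url = start_url
    # One extra iteration so the URL reached after max_hops redirects
    # can still be checked and returned.
    for _ in range(max_hops + 1):
        status, headers = fetch(url)
        location = headers.get("Location")
        if status not in REDIRECT_CODES or not location:
            return url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("more than %d redirects" % max_hops)
```

The max_hops limit guards against redirect loops; its default of 10 is an arbitrary choice for this sketch.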

Source

2016-06-09 00:04:59 rodriguesJD

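Checking the headers was my first thought too, but I am in a situation where the server does not update the headers to reflect the final URL – wi1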
