HttpURLConnection: fetching a page title returns "Moved Permanently"

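This is code I wrote in Groovy to get the page title from a URL. For some sites, however, I get "Moved Permanently" as the title, which I believe is caused by a 301 redirect. How can I avoid this and make HttpURLConnection follow the redirect to the right URL so that I get the correct page title?

For example, for this site I get "Moved Permanently" instead of the correct page title: http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html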

 

     def con = (HttpURLConnection) new URL(url).openConnection() 
     con.connect() 

     def inputStream = con.inputStream 

     HtmlCleaner cleaner = new HtmlCleaner() 
     CleanerProperties props = cleaner.getProperties() 

     TagNode node = cleaner.clean(inputStream) 
     TagNode titleNode = node.findElementByName("title", true); 

     def title = titleNode.getText().toString() 
     title = StringEscapeUtils.unescapeHtml(title).trim() 
     title = title.replace("\n", ""); 
     return title


2011-08-14 toy

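I can get this to work if I handle the redirects myself...

I think the problem is that the site expects cookies which it sets partway through the redirect chain, and if it doesn't get them back, it sends you to the login page.

This code obviously needs some cleanup (and there may well be a better way to do this), but it shows how the title can be extracted: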

@Grab('net.sourceforge.htmlcleaner:htmlcleaner:2.2') 
@Grab('commons-lang:commons-lang:2.6') 
import org.apache.commons.lang.StringEscapeUtils 
import org.htmlcleaner.* 

String location = 'http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html' 
String cookie = null 
String pageContent = '' 

while(location) { 
    new URL(location).openConnection().with { con ->
        // We'll do redirects ourselves
        con.instanceFollowRedirects = false

        // If we got a cookie last time round, then add it to our request
        if(cookie) con.setRequestProperty('Cookie', cookie)
        con.connect()

        // Get the response code, and the location to jump to (in case of a redirect)
        int responseCode = con.responseCode
        location = con.getHeaderField("Location")

        // Try and get a cookie the site will set, we will pass this next time round
        cookie = con.getHeaderField("Set-Cookie")

        // Read the HTML and close the inputstream
        pageContent = con.inputStream.withReader { it.text }
    }
} 

// Then, clean pageContent and get the title
HtmlCleaner cleaner = new HtmlCleaner() 
CleanerProperties props = cleaner.getProperties() 

TagNode node = cleaner.clean(pageContent) 
TagNode titleNode = node.findElementByName("title", true); 

def title = titleNode.text.toString() 
title = StringEscapeUtils.unescapeHtml(title).trim() 
title = title.replace("\n", "") 

println title
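
The loop above passes a single raw Set-Cookie value back as the Cookie header, attributes included, and assumes the Location header is an absolute URL. As a rough sketch of how those two steps could be tightened, the helpers below keep only the name=value part of each cookie and resolve a relative Location against the current URL. `cookieHeaderFrom` and `resolveLocation` are hypothetical names introduced for illustration, not part of the original answer or of any library.

```groovy
// Hypothetical helper: build one Cookie request header value from the
// Set-Cookie headers of a response, keeping only the name=value pairs.
String cookieHeaderFrom(List<String> setCookieHeaders) {
    setCookieHeaders
        .collect { it.split(';', 2)[0].trim() }  // drop Path, Expires, HttpOnly, ...
        .findAll { it.contains('=') }            // ignore empty or malformed entries
        .join('; ')
}

// Hypothetical helper: resolve a Location header, which may be relative,
// against the URL that produced it. Returns null when there is no redirect.
String resolveLocation(String currentUrl, String locationHeader) {
    locationHeader ? new URL(new URL(currentUrl), locationHeader).toString() : null
}
```

Inside the loop, these could be fed from `con.headerFields['Set-Cookie'] ?: []` and `con.getHeaderField('Location')`. Cookies from earlier hops would still need to be merged with the new ones if the site expects all of them.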

Hope it helps!


2011-08-15 11:47:10

You need to call setInstanceFollowRedirects(true) on the HttpURLConnection, i.e. after the first line, insert con.setInstanceFollowRedirects(true)

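
A small sketch for checking what the setting currently is before changing it. It runs without contacting any server, since openConnection() only creates the connection object; `http://example.com/` is a placeholder URL, not one from the question.

```groovy
// Create the connection object only; nothing is sent until connect()
// or a call such as getResponseCode() / getInputStream().
def con = (HttpURLConnection) new URL('http://example.com/').openConnection()

// Class-wide setting, applied to newly created connections
boolean classWide = HttpURLConnection.getFollowRedirects()

// Per-connection setting, initialised from the class-wide one
boolean before = con.getInstanceFollowRedirects()

// Setting it explicitly, as suggested above
con.setInstanceFollowRedirects(true)
boolean after = con.getInstanceFollowRedirects()

println "class-wide: ${classWide}, before: ${before}, after: ${after}"
```

If `before` is already true, the explicit call changes nothing, and the cause of the "Moved Permanently" title has to be looked for elsewhere.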

2011-08-14 09:25:59 mmigdol

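I tried that, but it still didn't work. I think setInstanceFollowRedirects(true) is the default. But thanks a lot for your reply. – toy

Yes, I should have tried it myself before posting. I did reproduce your symptoms, but I can't see why it happens. I tried HttpBuilder instead of HttpURLConnection, and it followed the redirects without any extra configuration. But I haven't yet been able to pass the resulting content to HtmlCleaner. – mmigdol

Isn't this something caused by the NYT paywall? –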
