2011-08-14 33 views
1

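HttpURLConnection: fetching the page title returns "Moved Permanently"

This is the code I wrote in Groovy to fetch the page title from a URL. For some websites, however, I get "Moved Permanently" back, which I think is caused by a 301 redirect. How can I avoid this, so that HttpURLConnection follows the redirect to the correct URL and I get the correct page title?

For example, for this site I get "Moved Permanently" instead of the correct page title: http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html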

 

     def con = (HttpURLConnection) new URL(url).openConnection() 
     con.connect() 

     def inputStream = con.inputStream 

     HtmlCleaner cleaner = new HtmlCleaner() 
     CleanerProperties props = cleaner.getProperties() 

     TagNode node = cleaner.clean(inputStream) 
     TagNode titleNode = node.findElementByName("title", true); 

     def title = titleNode.getText().toString() 
     title = StringEscapeUtils.unescapeHtml(title).trim() 
     title = title.replace("\n", ""); 
     return title 
 

Answers

1

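I can get this to work if I handle the redirects myself...

I think the problem is that the site expects cookies which it sets partway through the redirect chain, and if it doesn't get them back, it sends you to a login page.

This code obviously needs some cleaning up (and there is probably a better way to do it), but it shows how the title can be extracted: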

@Grab('net.sourceforge.htmlcleaner:htmlcleaner:2.2') 
@Grab('commons-lang:commons-lang:2.6') 
import org.apache.commons.lang.StringEscapeUtils 
import org.htmlcleaner.* 

String location = 'http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html' 
String cookie = null 
String pageContent = '' 

while(location) { 
    new URL(location).openConnection().with { con ->
        // We'll do redirects ourselves
        con.instanceFollowRedirects = false

        // If we got a cookie last time round, then add it to our request
        if(cookie) con.setRequestProperty('Cookie', cookie)
        con.connect()

        // Get the response code, and the location to jump to (in case of a redirect)
        int responseCode = con.responseCode
        location = con.getHeaderField("Location")

        // Try and get a cookie the site will set, we will pass this next time round
        cookie = con.getHeaderField("Set-Cookie")

        // Read the HTML and close the inputstream
        pageContent = con.inputStream.withReader { it.text }
    }
} 

// Then, clean pageContent and get the title
HtmlCleaner cleaner = new HtmlCleaner() 
CleanerProperties props = cleaner.getProperties() 

TagNode node = cleaner.clean(pageContent) 
TagNode titleNode = node.findElementByName("title", true); 

def title = titleNode.text.toString() 
title = StringEscapeUtils.unescapeHtml(title).trim() 
title = title.replace("\n", "") 

println title 

Hope it helps!
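
One thing the loop above glosses over: `getHeaderField("Set-Cookie")` returns only the last `Set-Cookie` value of a response, attributes such as `Path` included, while a site may set several cookies along the redirect chain. Below is a minimal sketch of merging them into one `Cookie` request header. The helper name `toCookieHeader` and the plain map used as a cookie jar are made up for illustration and are not part of any library:

```groovy
// Hypothetical helper: reduce raw Set-Cookie header values to one Cookie header.
// `jar` accumulates name/value pairs across requests; later values overwrite earlier ones.
String toCookieHeader(List<String> setCookieValues, Map<String, String> jar = [:]) {
    setCookieValues.each { String raw ->
        // Only the leading "name=value" segment is sent back to the server;
        // everything after the first ';' (Path, Expires, HttpOnly, ...) is an attribute.
        String pair = raw.split(';')[0].trim()
        int eq = pair.indexOf('=')
        if (eq > 0) {
            jar[pair.substring(0, eq)] = pair.substring(eq + 1)
        }
    }
    jar.collect { name, value -> "${name}=${value}" }.join('; ')
}

assert toCookieHeader(['a=1; Path=/', 'b=2; HttpOnly']) == 'a=1; b=2'
```

In the loop this could stand in for the single-header lookup, for example `cookie = toCookieHeader(con.headerFields['Set-Cookie'] ?: [], jar)` with `jar` declared once outside the loop. That assumes the server spells the header name exactly `Set-Cookie`, and it is untested against the site in question.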

0

You need to call setInstanceFollowRedirects(true) on the HttpURLConnection, i.e. after the first line, insert con.setInstanceFollowRedirects(true)
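
Whether this call changes anything can be checked locally before touching the real site. This is a small sketch, with `http://example.com/` as a placeholder URL; it only creates the connection object and never calls `connect()`:

```groovy
// openConnection() only builds the connection object; nothing is sent
// until connect() is called or the response is read.
def con = (HttpURLConnection) new URL('http://example.com/').openConnection()

// JVM-wide default, and the setting this particular instance starts with
println "global default: ${HttpURLConnection.getFollowRedirects()}"
println "this instance:  ${con.getInstanceFollowRedirects()}"
```

If both report `true`, setting the flag explicitly should make no difference to the behaviour described in the question.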

+0

I tried that, but it still didn't work. I think setInstanceFollowRedirects(true) is the default. But thank you very much for your reply. – toy

+0

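Yes, I should have tried it myself before posting. I did reproduce your symptoms, but I can't see why. I tried HttpBuilder instead of HttpURLConnection, and it followed the redirects without any other configuration. But I haven't yet been able to pass the resulting content to HtmlCleaner. – mmigdol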

+0

Isn't this something caused by the NYT paywall? –