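How to get a web page's title using an HTML parser

How can I use an HTML parser to get the title of the web page at a given URL? Could a regular expression be used to get the title? I would prefer to use an HTML parser.

I am working in Java with the Eclipse IDE.

I tried the code below, but it did not work.

Any ideas?

Thanks in advance!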

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.tags.TitleTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class TestHtml {

    public static void main(String... args) {
        Parser parser = new Parser();
        try {
            parser.setResource("http://www.yahoo.com/");
            NodeList list = parser.parse(null);
            Node node = list.elementAt(0);

            if (node instanceof TitleTag) {
                TitleTag title = (TitleTag) node;
                System.out.println(title.getText());
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}

Source

2010-07-09 smartcode

You can't parse HTML or XML with regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Glyph 2011-10-16 03:49:55

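Going by your (revised) question, the problem is that you only check the first node, Node node = list.elementAt(0);, whereas you should iterate over the list to find the title (which is not the first node). You could also pass a NodeFilter to parse() so that it returns only the TitleTag; the title would then be the first element and you would not have to iterate.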
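The difference between inspecting element 0 and searching the whole tree with a filter can be sketched without HTMLParser on the classpath. The example below is only an analogy: it uses the JDK's DOM parser in place of HTMLParser, a java.util.function.Predicate in place of NodeFilter, and assumes the input is well-formed XHTML. The class name FilteredTitleSearch and the helpers collect and findTitle are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FilteredTitleSearch {

    // Hypothetical helper: walk the tree depth-first and keep every node
    // the filter accepts, instead of looking only at the first node.
    static void collect(Node node, Predicate<Node> filter, List<Node> out) {
        if (filter.test(node)) {
            out.add(node);
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), filter, out);
        }
    }

    // Returns the text of the first <title> element, or null if there is none.
    // Assumes well-formed XHTML; real-world HTML needs a lenient parser.
    static String findTitle(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            List<Node> matches = new ArrayList<>();
            // The root element is <html>, not <title>, so checking only the
            // first node would find nothing; the filter searches everything.
            collect(doc.getDocumentElement(),
                    n -> n.getNodeType() == Node.ELEMENT_NODE
                            && "title".equalsIgnoreCase(n.getNodeName()),
                    matches);
            return matches.isEmpty() ? null : matches.get(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Example page</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(findTitle(page)); // prints: Example page
    }
}
```

Because the filter decides which nodes are kept, the caller never has to know where in the document the title sits, which is the same benefit the NodeFilter suggestion aims for.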

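Source

2010-07-09 09:05:09 Vinze

Yes, I know, but I still can't figure out how to filter for the TitleTag! Any ideas? Thanks! – smartcode 2010-07-09 09:11:48

I have never used this library, but it should be the classic approach, something like new NodeFilter() { public boolean accept(Node node) { return node instanceof TitleTag; } } – Vinze 2010-07-09 09:55:56

Thanks a lot, bro. I got the result by following your answer. Have a nice day! – smartcode 2010-07-09 10:04:11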

RegEx match open tags except XHTML self-contained tags

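You are smart not to want to use regular expressions.

To use an HTML parser, we need to know which language you are using. Since you say you are "on Eclipse", I will assume Java.

See http://www.ibm.com/developerworks/xml/library/x-domjava/ for a description, an overview, and various perspectives.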

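Source

2010-07-09 07:54:57 Borealid

Well, assuming you are using Java (though most languages have an equivalent), you can use a SAX parser (for example TagSoup, which converts any HTML to XHTML) and do this in your handler: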

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

public class MyHandler extends org.xml.sax.helpers.DefaultHandler {
    // True while the parser is inside the <title> element.
    boolean readTitle = false;
    // Accumulates the title text; characters() may be called more than once.
    StringBuilder title = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String name,
            Attributes attributes) throws SAXException {
        if (localName.equals("title")) {
            readTitle = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String name)
            throws SAXException {
        if (localName.equals("title")) {
            readTitle = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (readTitle) title.append(ch, start, length);
    }
}

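and then use it with your parser (TagSoup in this example):

org.ccil.cowan.tagsoup.Parser parser = new org.ccil.cowan.tagsoup.Parser();
MyHandler handler = new MyHandler();
parser.setContentHandler(handler);
// htmlStream: a java.io.InputStream over your HTML document
parser.parse(new org.xml.sax.InputSource(htmlStream));
return handler.title.toString();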
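To try the handler approach without TagSoup on the classpath, the same flag-based pattern can be run against the JDK's own SAX parser. This is a minimal sketch, assuming the input is already well-formed XHTML (TagSoup or a similar lenient parser is still needed for arbitrary HTML). The names SaxTitleDemo, TitleHandler and extractTitle are hypothetical. It matches on qName rather than localName because the JDK parser is not namespace-aware by default and leaves localName empty in that mode.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxTitleDemo {

    // Buffers character data only while the parser is inside <title>.
    static class TitleHandler extends DefaultHandler {
        private boolean inTitle = false;
        final StringBuilder title = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes attributes) {
            if ("title".equalsIgnoreCase(qName)) {
                inTitle = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equalsIgnoreCase(qName)) {
                inTitle = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text node, so append.
            if (inTitle) {
                title.append(ch, start, length);
            }
        }
    }

    // Parses well-formed XHTML and returns the trimmed title text
    // (an empty string if the document has no <title>).
    static String extractTitle(String xhtml) {
        try {
            TitleHandler handler = new TitleHandler();
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.title.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Hello SAX</title></head>"
                + "<body><p>content</p></body></html>";
        System.out.println(extractTitle(page)); // prints: Hello SAX
    }
}
```

Since SAX streams events instead of building a tree, this approach keeps memory use low even for large pages; the trade-off is that the handler has to track its own state with a flag.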

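Source

2010-07-09 07:55:50 Vinze

I tried the following code segment, but I still can't get the result. public class TestParser { public static void main(String... args) { try { Parser parser = new Parser(); parser.setResource("http://www.youtube.com"); NodeList list = parser.parse(null); Node node = list.elementAt(0); if (node instanceof TitleTag) { TitleTag title = (TitleTag) node; System.out.println(title.getText()); } } catch (ParserException e) { e.printStackTrace(); } } – smartcode 2010-07-09 08:36:00

You should put this in your question and state which language and which library you are using (or add the corresponding tags); you will get answers more efficiently if the question is not too vague... – Vinze 2010-07-09 08:45:38

I have edited my question; it would be great if you could give any ideas or corrections. Thanks! – smartcode 2010-07-09 08:58:51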

By the way, HTMLParser already ships with a very simple title extractor. See: http://htmlparser.sourceforge.net/samples.html

The ways to run it are (from the HTMLParser code base): execute the command

bin/parser http://website_url TITLE

or run

java -jar <path to htmlparser.jar> http://website_url TITLE

or call this method from your code

org.htmlparser.Parser.main(String[] args)

with the arguments new String[] {"<website url>", "TITLE"}

Source

2010-07-09 09:42:10 madhurtanwani

This is very easy with HtmlAgilityPack; you just need to get the HTTP request's response as a string.

String response = httpRequest.getResponseString(); // this call may need adjusting for your HTTP client
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);
HtmlNode node = doc.DocumentNode.SelectSingleNode("//title"); // finds the first title tag anywhere in the HTML document
string title = node.InnerText; // gives you the title of the page

For example, for <title>helloWorld</title>, node.InnerText contains helloWorld.

String response = httpRequest.getResponseString(); // this call may need adjusting for your HTTP client
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(response);

HtmlNode head = doc.DocumentNode.SelectSingleNode("//head"); // additional step: get head, which is a single node in the HTML, then get the title from head's children
HtmlNode node = head.SelectSingleNode("title"); // relative path: looks for title among head's children only

string title = node.InnerText; // gives you the title of the page

Source

2013-07-15 13:09:24
