Retrieving keywords from the META tag of an HTML page using Java

Using Java, I want to retrieve all the content words from an HTML page, as well as all the keywords contained in the META tag of the same page.
For example, consider this HTML source:

<html> 
<head> 
<meta name = "keywords" content = "deception, intricacy, treachery"> 
</head> 
<body> 
My very short html document. 
<br> 
It has just 2 'lines'. 
</body> 
</html>

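The content words here are: my, very, short, html, document, it, has, just, lines

Note: punctuation and the number '2' are excluded.

The keywords here are: deception, intricacy, treachery

I have created a class called WebDoc for this purpose, and this is as far as I have been able to get.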

import java.io.BufferedReader; 
import java.io.IOException; 
import java.io.InputStreamReader; 
import java.net.URL; 
import java.util.Set; 
import java.util.TreeSet; 

public class WebDoc { 

    protected URL _url; 
    protected Set<String> _contentWords; 
    protected Set<String> _keyWords; 

    public WebDoc(URL paramURL) { 
     _url = paramURL; 
    } 

    public Set<String> getContents() throws IOException { 
     //URL url = new URL(url); 
     Set<String> contentWords = new TreeSet<String>(); 
     BufferedReader in = new BufferedReader(new InputStreamReader(_url.openStream())); 
     String inputLine; 

     while ((inputLine = in.readLine()) != null) { 
      // Process each line. 
      contentWords.add(RemoveTag(inputLine)); 
      //System.out.println(RemoveTag(inputLine)); 
     } 
     in.close(); 
     System.out.println(contentWords); 
     _contentWords = contentWords; 
     return contentWords; 
    }  

    public String RemoveTag(String html) { 
     html = html.replaceAll("\\<.*?>",""); 
     html = html.replaceAll("&",""); 
     return html; 
    } 



    public Set<String> getKeywords() { 
     //NO IDEA ! 
     return null; 
    } 

    public URL getURL() { 
     return _url; 
    } 

    @Override 
    public String toString() { 
     return null; 
    } 
}

Source

2011-02-23 kooldave98

So, with RedSoxFan's answer covering the meta keywords, you only need to split your content lines. You can use a similar approach there:

Instead of

contentWords.add(RemoveTag(inputLine));

use

contentWords.addAll(Arrays.asList(RemoveTag(inputLine).split("[^\\p{L}]+")));

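.split(...) splits your line at every run of non-letters (I hope this works, please try it and report back) and returns an array of substrings, each of which should contain only letters, possibly with some empty strings in between.
Arrays.asList(...) wraps this array in a list (this needs import java.util.Arrays).
addAll(...) adds all elements of this list to the set (without duplicates).

Finally, you should remove the empty string "" from your contentWords set.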
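The steps above can be put together as a small, self-contained sketch. The class name ContentWordSplitter and the method splitWords are made up for illustration and are not part of the WebDoc class from the question; the input is assumed to be a line that already had its tags removed.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, named only for this example.
public class ContentWordSplitter {

    // Splits one tag-free line into words that consist only of letters.
    public static Set<String> splitWords(String line) {
        Set<String> words = new TreeSet<>();
        // Split on every run of non-letter characters (punctuation, digits, whitespace).
        words.addAll(Arrays.asList(line.split("[^\\p{L}]+")));
        // A line that starts with a non-letter (or is empty) yields an empty string; drop it.
        words.remove("");
        return words;
    }

    public static void main(String[] args) {
        // Prints [It, has, just, lines]
        System.out.println(splitWords("It has just 2 'lines'."));
    }
}
```

Because a TreeSet sorts by character code, capitalized words come before lowercase ones. If you want case-insensitive results, you could call toLowerCase() on the line before splitting.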

Source

2011-02-23 23:57:11

Process each line and use the following (it needs import java.util.HashSet):

public Set<String> getKeywords(String str) { 
     Set<String> s = new HashSet<String>(); 
     str = str.trim(); 
     if (str.toLowerCase().startsWith("<meta ")) { 
      if (str.toLowerCase().matches("<meta name\\s?=\\s?\"keywords\"\\scontent\\s?=\\s?\".*\"/?>")) { 
       // Returns only whats in the content attribute (case-insensitive) 
       str = str.replaceAll("(?i)<meta name\\s?=\\s?\"keywords\"\\scontent\\s?=\\s?\"(.*)\"/?>","$1"); 
       for (String st:str.split(",")) s.add(st.trim()); 
       return s; 
      } 
     } 
     return null; 
    }

If you need an explanation, let me know.
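For comparison, here is a sketch of the same idea written with java.util.regex.Pattern and Matcher. It is only one possible variant, not a replacement for a real HTML parser: the class name MetaKeywords and the method extract are made up for this example, and it assumes the meta tag sits on a single line, that name comes before content, and that both values are double-quoted. It returns an empty set instead of null when a line has no keywords, so callers can pass the result straight to addAll.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named only for this example.
public class MetaKeywords {

    // Matches <meta name="keywords" content="..."> with optional spaces around '='.
    // Assumption: name comes before content and both values use double quotes.
    private static final Pattern KEYWORDS_META = Pattern.compile(
            "<meta\\s+name\\s*=\\s*\"keywords\"\\s+content\\s*=\\s*\"([^\"]*)\"\\s*/?>",
            Pattern.CASE_INSENSITIVE);

    // Returns the keywords found in the given line, or an empty set if there are none.
    public static Set<String> extract(String line) {
        Set<String> keywords = new LinkedHashSet<>();
        Matcher m = KEYWORDS_META.matcher(line);
        if (m.find()) {
            // group(1) is the raw content attribute, e.g. "deception, intricacy, treachery".
            for (String keyword : m.group(1).split(",")) {
                String trimmed = keyword.trim();
                if (!trimmed.isEmpty()) {
                    keywords.add(trimmed);
                }
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        String line = "<meta name = \"keywords\" content = \"deception, intricacy, treachery\"> ";
        // Prints [deception, intricacy, treachery]
        System.out.println(extract(line));
    }
}
```

In getContents() you could call extract(inputLine) for every line and add the result to _keyWords.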

Source

2011-02-23 23:24:11 RedSoxFan

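Sorry, I forgot about the content. Paŭlo Ebermann gave a good answer; I would just add a check that the line is inside the body tag, otherwise you will also pick up text from the head. – RedSoxFan 2011-02-24 00:06:26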
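A rough sketch of that body check, combined with the split approach from the other answer. The class name BodyWordCollector and the method collect are made up for this example, and it assumes the opening and closing body tags each appear on their own line, as in the sample document from the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, named only for this example.
public class BodyWordCollector {

    // Collects letter-only words from lines that lie between <body> and </body>.
    public static Set<String> collect(List<String> lines) {
        Set<String> words = new TreeSet<>();
        boolean inBody = false;
        for (String line : lines) {
            String lower = line.toLowerCase();
            if (lower.contains("<body")) {
                inBody = true;
            }
            if (inBody) {
                // Strip tags, then split on runs of non-letters.
                String text = line.replaceAll("<.*?>", "");
                words.addAll(Arrays.asList(text.split("[^\\p{L}]+")));
            }
            if (lower.contains("</body>")) {
                inBody = false;
            }
        }
        // Lines that contain only tags contribute an empty string; drop it.
        words.remove("");
        return words;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList(
                "<html>", "<head>", "<title>Ignored title</title>", "</head>",
                "<body>", "My very short html document.", "<br>",
                "It has just 2 'lines'.", "</body>", "</html>");
        // Prints [It, My, document, has, html, just, lines, short, very]
        System.out.println(collect(page));
    }
}
```

The words from the title element are skipped because they appear before the body starts.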
