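Counting the words in an HTML page using a HashMap

I want to count the words in my HTML page using a HashMap, and print each word from the page together with the number of times it occurs.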

Java code

Words and their occurrence counts

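import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

public class CountWords {

    public void readFile() {

     Scanner scanner = null;
     try {
      scanner = new Scanner(new File("D:\\Test.html"));
     } catch (FileNotFoundException e) {
      // Nothing to count without the file; returning here also avoids a
      // NullPointerException on scanner.hasNext() below.
      e.printStackTrace();
      return;
     }
     // Maps each whitespace-separated token to the number of times it occurs.
     // HTML tags are counted as tokens too, because nothing strips them.
     Map<String, Integer> map = new HashMap<String, Integer>();
     while (scanner.hasNext()) {
      String word = scanner.next();
      if (map.containsKey(word)) {
       map.put(word, map.get(word) + 1);
      } else {
       map.put(word, 1);
      }
     }
     scanner.close();

     List<Map.Entry<String, Integer>> entries =
       new ArrayList<Map.Entry<String, Integer>>(map.entrySet());

     // Print the entries in reverse of the map's iteration order.
     for (int i = 0; i < map.size(); i++) {
      System.out.println(entries.get(entries.size() - i - 1).getKey()
        + " " + entries.get(entries.size() - i - 1).getValue());
     }
    }

}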

The output I'm getting contains the HTML code as well as the raw data. I only want to print the words, without seeing the HTML code.


2014-10-07 arjun narahari

Please post your output. – 2014-10-07 07:31:34

Use an HTML parser to parse the page, then count the words in the result. – Jens 2014-10-07 07:32:39
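To make the parser suggestion concrete, here is a minimal sketch. The comment does not name a parser, so the choice of the JDK's built-in Swing HTML parser (`ParserDelegator` with an `HTMLEditorKit.ParserCallback`) is an assumption, as are the class name `ParsedHtmlWordCount`, its method names, and the sample HTML string. The idea is to collect only the text the parser reports and then count words in that text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original question.
class ParsedHtmlWordCount {

    // Parses the HTML and returns only the text chunks reported by the parser.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called for text content, never for the tags themselves.
                text.append(data).append(' ');
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Counts whitespace-separated words in the extracted text.
    static Map<String, Integer> countWords(String html) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : extractText(html).trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample input chosen for illustration only.
        String html = "<html><body><p>hello world hello</p></body></html>";
        for (Map.Entry<String, Integer> entry : countWords(html).entrySet()) {
            System.out.printf("%s %d%n", entry.getKey(), entry.getValue());
        }
    }
}
```

The Swing parser targets an old HTML dialect, so how it reports the contents of style and script elements may not match what you want. A dedicated HTML parsing library may handle modern pages better.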

all {.gb1 {height:22px; margin-right:.5em; vertical-align:top} #gbar {float:left}} a.gb1,a.gb4 {text-decoration:underline 1 in:1 delete 1 g}; akW=function(b,c,d){if(b){var 1 if("cad=h"==b)return 1 valign="top"> 2014-10-07 07:38:31

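You could try the OWASP HTML Sanitizer library to extract the text from the page: https://owasp.org/index.php/OWASP_Java_HTML_Sanitizer_Project. I have used it before to sanitize user-submitted posts, but it should do what you need. Because it is a library for allowing or restricting specific tags in an HTML fragment, you can tell it to reject all HTML tags and extract only the content inside them.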

Your code would look something like this:

    PolicyFactory policy = new HtmlPolicyBuilder().toFactory();
    String safeHTML = policy.sanitize(htmlContent);

I have found it far less error-prone than attempting any kind of regular expression.

You will probably need both guava.jar and owasp-java-html-sanitizer.jar from http://owasp-java-html-sanitizer.googlecode.com/svn/trunk/distrib/lib/


2014-10-07 07:50:57 Spheniscus

You should strip the HTML tags. Here is an example: Remove HTML tags from a String

By the way, why is your output code so complicated?
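As a rough illustration of stripping tags before counting, the sketch below uses plain regular expressions from the standard library. This is an assumption about one possible approach, not necessarily the technique in the linked post, and regexes are known to be fragile on real-world HTML. The class name `TagStrippingWordCount`, its method names, and the sample input are all hypothetical. It does not decode entities such as `&amp;`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of the original answer.
class TagStrippingWordCount {

    // Naive tag removal: drops script/style blocks with their contents,
    // then replaces every remaining <...> tag with a space.
    static String stripTags(String html) {
        String withoutBlocks =
                html.replaceAll("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>", " ");
        return withoutBlocks.replaceAll("(?s)<[^>]*>", " ");
    }

    // Counts whitespace-separated words, keeping first-seen order.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample input chosen for illustration only.
        String html = "<html><head><style>.gb1{height:22px}</style></head>"
                + "<body><p>to be or not to be</p></body></html>";
        for (Map.Entry<String, Integer> entry : countWords(stripTags(html)).entrySet()) {
            System.out.printf("%s %d%n", entry.getKey(), entry.getValue());
        }
    }
}
```

In the question's code, the same idea would mean reading the file into a string, passing it through something like `stripTags`, and only then splitting it into words.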

for (Map.Entry<String, Integer> entry : map.entrySet()) { 
    System.out.printf("%s %d\n", entry.getKey(), entry.getValue()); 
}


2014-10-07 07:54:11
