How do I extract specific data from an HTML file using Jsoup?

I have the HTML of a local-language newspaper, and I want to collect only the local-language words that appear in it.

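Looking at the HTML, I noticed that all the local-language words sit under div elements with the class field-content, so I selected those elements to get the data. The div also contains nested elements, though, and the local-language words are inside them: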

<div class = "field-content"></div>

So how do I get only the local-language text out of the HTML?

URL of the website: http://www.andhrabhoomi.net/

My code:

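import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NewsScraper { // placeholder class name, not from the original post
    public static void main(String[] a) {
        try {
            Document doc = Jsoup.connect("http://www.andhrabhoomi.net/").userAgent("Mozilla").get();
            String title = doc.title();

            System.out.println("title : " + title);

            // get all links
            //Elements links = doc.select("a[href]");

            // select every div with the class "field-content"
            Elements body = doc.select("div.field-content");

            for (Element link : body) {
                // prints the whole element, markup included
                System.out.println(link);

                // get the value from href attribute
                //System.out.println("\nlink : " + link.attr("href"));
                //System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            System.out.println("error\n");
        }
    }
}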


2016-03-15 Labeo

I'm not sure exactly what you're after here, but if my guess is right this should help. If not, say so and we'll go from there.

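You'll want to change your selection so that you get only the elements with the class field-content, and then get rid of all the other HTML content. To do that, add text() to your print call, i.e. System.out.println(link.text()); as shown here: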

Elements body = doc.getElementsByClass("field-content"); 

for(Element link : body) 
{ 
    System.out.println(link.text()); 
}
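If the text you get back still mixes local-language words with English words or numbers, one possible follow-up is to filter the tokens by Unicode script. The sketch below is only an illustration, not part of the original answer: it assumes the local language is Telugu (a guess based on the site), and the class and method names are made up for the example. The sample string uses Unicode escapes for two Telugu words so the file compiles regardless of source encoding.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

public class ScriptWordFilter { // hypothetical helper class

    // Returns the whitespace-separated tokens of `text` that are written in `script`.
    static List<String> wordsInScript(String text, UnicodeScript script) {
        List<String> words = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty() && isInScript(token, script)) {
                words.add(token);
            }
        }
        return words;
    }

    // A token qualifies if it has at least one code point in `script` and
    // every other code point is script-neutral (COMMON or INHERITED, which
    // covers punctuation, digits and joiners).
    static boolean isInScript(String token, UnicodeScript script) {
        boolean sawScriptChar = false;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript s = UnicodeScript.of(cp);
            if (s == script) {
                sawScriptChar = true;
            } else if (s != UnicodeScript.COMMON && s != UnicodeScript.INHERITED) {
                return false; // a letter from some other script
            }
        }
        return sawScriptChar;
    }

    public static void main(String[] args) {
        // Assumed sample: two Telugu words mixed with an English word and a number.
        String sample = "\u0C24\u0C46\u0C32\u0C41\u0C17\u0C41 news "
                + "\u0C35\u0C3E\u0C30\u0C4D\u0C24\u0C32\u0C41 2016";
        for (String word : wordsInScript(sample, UnicodeScript.TELUGU)) {
            System.out.println(word);
        }
    }
}
```

In the loop above you would pass link.text() instead of the sample string. Tokens keep any attached punctuation, so you may want to strip that separately.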


2016-03-15 17:00:25

Thanks, it worked – Labeo

So does .text() here print the data directly, skipping the element tags? – Labeo

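.text() gets the combined text of an element, so in this case we selected the div and got all the text of the div and of every child element inside it. So yes, it pretty much strips out all the tags. However, if you are only after the text that sits directly in the div, you can use ownText(), although you may get a lot of whitespace that needs cleaning up.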
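To make the difference between text() and ownText() concrete, here is a small sketch that parses an inline HTML string instead of fetching the site. The sample markup is made up for the example (the real page's structure may differ), and the class and method names are placeholders.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TextVsOwnText { // placeholder class name

    // Assumed sample markup, not taken from the real site.
    static final String SAMPLE_HTML =
            "<div class=\"field-content\">headline <a href=\"/story\">linked words</a></div>";

    // Text of the first div.field-content, including text of all nested elements.
    static String combinedText(String html) {
        Document doc = Jsoup.parse(html);
        Element div = doc.select("div.field-content").first();
        return div.text();
    }

    // Only the text that belongs directly to the div, not to its children.
    static String ownTextOnly(String html) {
        Document doc = Jsoup.parse(html);
        Element div = doc.select("div.field-content").first();
        return div.ownText();
    }

    public static void main(String[] args) {
        System.out.println("text()    : " + combinedText(SAMPLE_HTML));
        System.out.println("ownText() : " + ownTextOnly(SAMPLE_HTML));
    }
}
```

With this markup, text() includes the words inside the nested a element, while ownText() leaves them out. If the words you want are inside the nested elements, text() is the one to use.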

The solution is:

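import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class NewsLinks { // placeholder class name, not from the original post
    public static void main(String[] args) {
        try {
            // same connection as in the question
            Document doc = Jsoup.connect("http://www.andhrabhoomi.net/").userAgent("Mozilla").get();
            String title = doc.title();

            System.out.println("title : " + title);

            //get all links
            //Elements links = doc.select("a[href]");
            //Elements body = doc.select("div.field-content");
            // select only <a> elements that are direct children of div.field-content
            Elements body = doc.select("div[class=\"field-content\"] > a");

            for (Element link : body) {
                System.out.println("---------------------------------------------------------------------------------------------------------------");
                System.out.println(link);

                // src of the first <img> inside the link (empty string if there is none)
                Elements img = link.select("img");
                System.out.print("\nSrc Img : " + img.attr("src"));

                // select("a") also matches the link element itself
                Elements tag_a = link.select("a");
                System.out.println("\nHref : " + tag_a.attr("href"));
                //System.out.println("text : " + tag_a.text());
            }
        } catch (Exception e) {
            System.out.println("error\n");
        }
    }
}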


2016-03-16 09:02:47
