Using Jsoup, how do I get all the information from every link?

    package com.muthu;

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class TestingTool
    {
        public static void main(String[] args) throws IOException
        {
            String url = "http://www.stackoverflow.com/";
            print("Fetching %s...", url);

            Document doc = Jsoup.connect(url).get();   // Fetch and parse the page
            Elements links = doc.select("a[href]");    // Every anchor that has an href
            System.out.println(doc.text());
            System.out.println("\n");

            // Print each link's absolute URL and its shortened text
            for (Element link : links)
            {
                print(" %s (%s)", link.attr("abs:href"), trim(link.text(), 35));
            }

            // Write the link names to a file; try-with-resources closes the writer
            try (BufferedWriter bw = new BufferedWriter(new FileWriter(new File("C:/tool/linknames.txt"))))
            {
                for (Element link : links)
                {
                    bw.write("Link: " + link.text().trim());
                    bw.newLine();
                }
            }
        }

        private static void print(String msg, Object... args) {
            System.out.println(String.format(msg, args));
        }

        // Shorten s to at most width characters, ending with "." when cut
        private static String trim(String s, int width) {
            if (s.length() > width)
                return s.substring(0, width-1) + ".";
            else
                return s;
        }
    }


2012-12-08 Pearl

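If you connect to a URL, only that one page is parsed. But you can 1) connect to a URL, 2) parse the information you need, 3) select all further links, 4) connect to them, and 5) continue as long as there are new links.

Considerations:

You need a list (or something similar) where you store the links you have already parsed
You have to decide whether you need only the links within this site or the external ones too
You have to skip pages like "about", "contact", etc.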
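
The five steps and the "already parsed" list above can be sketched without any network access. This is only a minimal illustration, not the answerer's code: `CrawlSketch`, `crawl` and `linksOf` are hypothetical names, and `linksOf` stands in for "fetch the page and return its links" (with Jsoup that would presumably wrap `Jsoup.connect(url).get()` and `doc.select("a[href]")`). A queue is used instead of recursion so that a large site cannot overflow the call stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class CrawlSketch {
    /**
     * Visits start and every page reachable from it, each page exactly once.
     * linksOf is a stand-in (assumption) for "fetch the page and return its links".
     */
    static Set<String> crawl(String start, Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>(); // pages already parsed, in visit order
        Deque<String> pending = new ArrayDeque<>();  // pages still waiting to be fetched
        pending.add(start);
        while (!pending.isEmpty()) {
            String url = pending.poll();
            if (!visited.add(url)) {
                continue; // add() returned false: this page was parsed before
            }
            // Step 2 (extract the data you need from the page) would go here.
            for (String next : linksOf.apply(url)) {
                if (!visited.contains(next)) {
                    pending.add(next); // steps 3 and 4: queue every new link
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "site" (made up for the demo) replaces real HTTP requests.
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/", "/b"),
                "/b", List.of("/c"),
                "/c", List.of());
        System.out.println(crawl("/", url -> site.getOrDefault(url, List.of())));
    }
}
```

Even though "/a" links back to "/", every page is visited only once, which is the purpose of the visited set.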


Edit:
(Note: you have to add some changes / error-handling code)

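import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

List<String> visitedUrls = new ArrayList<>(); // Store all links you've already visited


public void visitUrl(String url) throws IOException
{
    url = url.toLowerCase(); // Now it's case-insensitive (note: the lower-cased URL is also the one fetched)

    if(!visitedUrls.contains(url)) // Do this only if not visited yet
    {
        visitedUrls.add(url); // Remember this page so it is never fetched twice

        Document doc = Jsoup.connect(url).get(); // Connect to the URL and parse the document

        /* ... Select your data here ... */

        Elements nextLinks = doc.select("a[href]"); // Select next links - add more restrictions!

        for(Element next : nextLinks) // Iterate over all links
        {
            visitUrl(next.absUrl("href")); // Recursive call for all next links
        }
    }
}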

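You have to add more restrictions/checks in the part where the next links are selected (maybe you want to skip/ignore some of them), plus some error handling.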
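
One possible restriction is to follow only links on the same host as the start page. The helper below is a sketch under that assumption: `SameHostFilter` and `isSameHost` are made-up names, and only `java.net.URI` from the standard library is used. Malformed links are treated as "do not follow", which also covers a small part of the error handling.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class SameHostFilter {
    /** Returns true when link parses as a URL whose host equals the host of base (ignoring case). */
    static boolean isSameHost(String base, String link) {
        try {
            String baseHost = new URI(base).getHost();
            String linkHost = new URI(link).getHost();
            // getHost() is null for relative or opaque URIs such as "mailto:..."
            return baseHost != null && baseHost.equalsIgnoreCase(linkHost);
        } catch (URISyntaxException e) {
            return false; // malformed link: do not follow it
        }
    }

    public static void main(String[] args) {
        String base = "http://www.stackoverflow.com/";
        System.out.println(isSameHost(base, "http://www.stackoverflow.com/questions"));
        System.out.println(isSameHost(base, "http://twitter.com/stackoverflow"));
    }
}
```

Such a check could be applied to `next.absUrl("href")` before the recursive call. Note that the comparison is exact, so `www.stackoverflow.com` and `stackoverflow.com` count as different hosts here.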

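Edit 2:

To skip ignored links you can use this:

Create a set / list / whatever in which you store the ignored keywords
Fill it with those keywords
Before you call the visitUrl() method with the new link to parse, check whether this new URL contains any of the ignored keywords. If it contains at least one, it is skipped.

I modified the example a bit to do this (but it's not tested yet!).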

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

List<String> visitedUrls = new ArrayList<>(); // Store all links you've already visited
Set<String> ignore = new HashSet<>(); // Store all keywords you want to ignore

// ...


/*
 * Add keywords to the ignore list. Each link that contains one of these
 * words will be skipped.
 *
 * Do this in e.g. a constructor, a static block or an init method.
 */
ignore.add(".twitter.com");

// ...


public void visitUrl(String url) throws IOException
{
    url = url.toLowerCase(); // Now it's case-insensitive (note: the lower-cased URL is also the one fetched)

    if(!visitedUrls.contains(url)) // Do this only if not visited yet
    {
        visitedUrls.add(url); // Remember this page so it is never fetched twice

        Document doc = Jsoup.connect(url).get(); // Connect to the URL and parse the document

        /* ... Select your data here ... */

        Elements nextLinks = doc.select("a[href]"); // Select next links - add more restrictions!

        for(Element next : nextLinks) // Iterate over all links
        {
            boolean skip = false; // If false: parse the URL, if true: skip it
            final String href = next.absUrl("href"); // Select the 'href' attribute -> next link to parse

            for(String s : ignore) // Iterate over all ignored keywords - maybe there's a better solution for this
            {
                if(href.contains(s)) // If the URL contains an ignored keyword it will be skipped
                {
                    skip = true;
                    break;
                }
            }

            if(!skip)
                visitUrl(next.absUrl("href")); // Recursive call for all next links
        }
    }
}
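
The code comment above wonders whether there is a better solution for the keyword loop. One option, shown here only as a sketch, is a small helper built on `Stream.anyMatch`; the names `IgnoreCheck` and `isIgnored` are hypothetical and not part of the answer.

```java
import java.util.Set;

final class IgnoreCheck {
    /** Returns true if href contains at least one of the ignored keywords. */
    static boolean isIgnored(String href, Set<String> ignore) {
        // anyMatch stops at the first keyword found, like the loop with "break"
        return ignore.stream().anyMatch(href::contains);
    }

    public static void main(String[] args) {
        Set<String> ignore = Set.of(".twitter.com");
        System.out.println(isIgnored("http://www.twitter.com/stackoverflow", ignore));
        System.out.println(isIgnored("http://stackoverflow.com/questions", ignore));
    }
}
```

With this helper the loop body would shrink to `if(!isIgnored(href, ignore)) visitUrl(href);`. One thing to keep in mind: because of its leading dot, the keyword `.twitter.com` matches `www.twitter.com` but not a bare `http://twitter.com/...` link.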

Parsing the next link is done by this:

final String href = next.absUrl("href"); 
/* ... */ 
visitUrl(next.absUrl("href"));

But you should probably add some more stop conditions to this part.
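
Two common stop conditions are a maximum link depth and a maximum number of pages. The sketch below keeps the recursive shape of visitUrl() but replaces the HTTP request with a `linksOf` function so it runs on its own; `BoundedCrawl`, `maxDepth`, `maxPages` and `linksOf` are assumptions made for this illustration, not names from the answer.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class BoundedCrawl {
    /** Crawls from start, going at most maxDepth links deep and visiting at most maxPages pages. */
    static Set<String> crawl(String start, int maxDepth, int maxPages,
                             Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>();
        visit(start, 0, maxDepth, maxPages, linksOf, visited);
        return visited;
    }

    private static void visit(String url, int depth, int maxDepth, int maxPages,
                              Function<String, List<String>> linksOf, Set<String> visited) {
        // Stop when too deep, when the page budget is used up, or when the page was seen before
        if (depth > maxDepth || visited.size() >= maxPages || !visited.add(url)) {
            return;
        }
        for (String next : linksOf.apply(url)) {
            visit(next, depth + 1, maxDepth, maxPages, linksOf, visited);
        }
    }

    public static void main(String[] args) {
        // Made-up chain of pages: / -> /a -> /b -> /c
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a"),
                "/a", List.of("/b"),
                "/b", List.of("/c"),
                "/c", List.of());
        // Depth limit 1: only the start page and its direct links are visited
        System.out.println(crawl("/", 1, 10, url -> site.getOrDefault(url, List.of())));
    }
}
```

In the Jsoup version, the same two limits could be passed as extra parameters of visitUrl() and checked before `Jsoup.connect(url).get()` is called.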


2012-12-09 12:52:05 ollo

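Thanks, ollo. I can connect to the URL and get all the link names. But how can I connect to all the other links and parse the information behind them? Please give me some suggestions. Thanks in advance. – Pearl

See "Edit" for a short example. Extend it according to your requirements. – ollo

Hi ollo, how do I add the code that parses the next link? – Pearl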
