JSoup select from HTML on Unix

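I have a program that extracts certain elements (the article authors' names) from many articles on the PubMed site. The program works correctly on my PC (Windows), but when I try to run it on Unix it returns an empty list. I suspect this is because the selector syntax should be somehow different on Unix systems. The problem is that the JSoup documentation does not mention anything about this. Does anyone know anything about it? My code looks like this: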

Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString).timeout(60000).userAgent("Mozilla/25.0").get(); 
      System.out.println("connected"); 
      Elements authors = doc.select("div.auths >*"); 
      System.out.println("number of elements is " + authors.size());

The final System.out.println always reports a size of 0, so the rest of the code has nothing to work with. Thanks in advance.

Full example:

protected static void searchLink(HashMap<String, HashSet<String>> authorsMap,
                                 HashMap<String, HashSet<String>> reverseAuthorsMap,
                                 String fileLine)
        throws IOException, ParseException, InterruptedException
{

      JSONParser parser = new JSONParser(); 
      JSONObject jsonObj = (JSONObject) parser.parse(fileLine.substring(0, fileLine.length() - 1)); 
      String pmidString = (String)jsonObj.get("pmid"); 
      System.out.println(pmidString); 

      Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString).timeout(60000).userAgent("Mozilla/25.0").get(); 
      System.out.println("connected"); 
      Elements authors = doc.select("div.auths >*"); 
      System.out.println("found the element"); 

      HashSet<String> authorsList = new HashSet<>(); 
      System.out.println("authors list hashSet created"); 
      System.out.println("number of elements is " + authors.size()); 
      for (int i =0; i < authors.size(); i++) 
      { 


       // add the current name to the names list 
       authorsList.add(authors.get(i).text()); 

       // pmidList variable 
       HashSet<String> pmidList; 
       System.out.println("stage 1"); 
       // if the author name is new, then create the list, add the current pmid and put it in the map 
       if(!authorsMap.containsKey(authors.get(i).text())) 
       { 
        pmidList = new HashSet<>(); 
        pmidList.add(pmidString); 
        System.out.println("made it to searchLink"); 
        authorsMap.put(authors.get(i).text(), pmidList); 

       } 
       // if the author name has been found before, get the list of articles and add the current 
       else 
       { 
        System.out.println("Author exists in map"); 
        pmidList = authorsMap.get(authors.get(i).text()); 
        pmidList.add(pmidString); 


        authorsMap.put(authors.get(i).text(), pmidList); 
        //authorsMap.put((String) authorName, null); 
       } 

       // finally, add the pmid-authorsList to the map 
       reverseAuthorsMap.put(pmidString, authorsList); 
       System.out.println("reverseauthors populated"); 

      } 

}

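I have a thread pool, and each thread uses this method to populate the two maps. The fileLine argument is a single line that I parse as JSON, keeping only the "pmid" field. Using that string, I access the article's URL and parse the HTML for the authors' names. The rest should work (it does on my PC), but because authors.size() is always 0, the for loop directly below the "number of elements" System.out line never executes.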

Source

2013-12-09 Marios D. Lokas

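Can you provide a complete example? – aditsu

By complete I don't mean including all the processing you want to do, just providing an [sscce](http://sscce.org). – aditsu

I understand what you mean, but this is actually part of a large body of code that I can't paste here. I'm fairly sure the problem lies in the syntax inside doc.select, and nothing I could give you would help you sort it out, because it works unless it is run on Unix. Thanks for your attention –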

I tried an example that does exactly what you want:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class Test {
    public static void main(String[] args) throws IOException {
        // Default article id; can be overridden on the command line
        String docId = "24312906";
        if (args.length > 0) {
            docId = args[0];
        }

        String url = "http://www.ncbi.nlm.nih.gov/pubmed/" + docId;
        Document doc = Jsoup.connect(url).timeout(60000).userAgent("Mozilla/25.0").get();
        Elements authors = doc.select("div.auths >*");

        System.out.println("os.name=" + System.getProperty("os.name"));
        System.out.println("os.arch=" + System.getProperty("os.arch"));

        // System.out.println("doc=" + doc);
        System.out.println("authors=" + authors);
        System.out.println("authors.length=" + authors.size());

        for (Element a : authors) {
            System.out.println(" author: " + a);
        }
    }
}

My operating system is Linux:

# uname -a 
Linux graphene 3.11.0-13-generic #20-Ubuntu SMP Wed Oct 23 07:38:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
# lsb_release -a 
No LSB modules are available. 
Distributor ID: Ubuntu 
Description: Ubuntu 13.10 
Release:  13.10 
Codename:  saucy

Running the program produces:

os.name=Linux 
os.arch=amd64 
authors=<a href="/pubmed?term=Liu%20W%5BAuthor%5D&amp;cauthor=true&amp;cauthor_uid=24312906">Liu W</a> 
<a href="/pubmed?term=Chen%20D%5BAuthor%5D&amp;cauthor=true&amp;cauthor_uid=24312906">Chen D</a> 
authors.length=2 
    author: <a href="/pubmed?term=Liu%20W%5BAuthor%5D&amp;cauthor=true&amp;cauthor_uid=24312906">Liu W</a> 
    author: <a href="/pubmed?term=Chen%20D%5BAuthor%5D&amp;cauthor=true&amp;cauthor_uid=24312906">Chen D</a>

Which seems to work. Perhaps the problem is related to fileLine? You could print out the value of the URL:

System.out.println("url='" + "http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString+ "'");
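If the quoted URL looks normal at a glance, invisible characters can still hide in it. The following is a minimal sketch, not part of the original program, that escapes such characters before printing. The `UrlDebug` class, the `visible` helper, and the sample value with a trailing carriage return are all hypothetical and only illustrate the idea.

```java
final class UrlDebug {
    // Base URL taken from the question.
    static final String BASE = "http://www.ncbi.nlm.nih.gov/pubmed/";

    /** Returns s with control and non-ASCII characters replaced by visible markers. */
    static String visible(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\r') {
                out.append("\\r");
            } else if (c == '\n') {
                out.append("\\n");
            } else if (c == '\t') {
                out.append("\\t");
            } else if (c < 0x20 || c > 0x7e) {
                // Anything else outside printable ASCII is shown as a hex code.
                out.append(String.format("<0x%04x>", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical pmid value carrying a stray carriage return.
        String pmidString = "24312906\r";
        System.out.println("url='" + visible(BASE + pmidString) + "'");
    }
}
```

With a clean pmid the output is the plain URL. Any marker such as `\r` between the quotes would point at the line-parsing step rather than at the selector.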

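Since your code is not throwing an exception, I suspect you are getting a document back, just not the one your code expects. Printing out the document so you can see what actually came back might also help.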
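A full page can be too long to read in a log, so a short preview is often enough. The sketch below assumes the page source has already been obtained as a string (for example from `doc.outerHtml()`). The `ResponseCheck` class, both helpers, and the sample HTML are hypothetical, and the string test is a crude stand-in for a real selector, not a parser.

```java
final class ResponseCheck {
    /** Collapses whitespace and truncates, so a large page fits on one log line. */
    static String preview(String html, int maxChars) {
        String flat = html.replaceAll("\\s+", " ").trim();
        return flat.length() <= maxChars ? flat : flat.substring(0, maxChars) + "...";
    }

    /** Crude check: does the raw HTML mention the class the selector relies on? */
    static boolean mentionsAuthorsBlock(String html) {
        return html.contains("class=\"auths\"");
    }

    public static void main(String[] args) {
        // Hypothetical response body standing in for an unexpected page.
        String html = "<html>\n <body>\n  <p>Access denied</p>\n </body>\n</html>";
        if (!mentionsAuthorsBlock(html)) {
            System.out.println("no auths block; got: " + preview(html, 60));
        }
    }
}
```

If the preview shows an error or notice page instead of an article, the selector is not at fault; the request itself needs attention.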

Source

2013-12-09 20:01:50

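I did what you suggested, and apparently I am not getting the correct HTML back. The returned HTML explains that the site is blocking me, for God knows what reason. At least I got to the bottom of the problem. Thank you very much for your help! –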
