2015-10-18 100 views
1

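Jsoup: saving content to a database

I have an array of URLs. I want to fetch information from each URL and store it in a database. My problem is that the list of URLs is very large, so reading each URL and storing it in the database one at a time, as in the code below, takes a long time.

I know this can be done with threads, but I don't know how. Please help me, or suggest any other approach.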

try { 
    String lstUrls = "http://www.java2s.com/Tutorials/Java/Scala/index.htm\n" 
      + "http://www.java2s.com/Tutorials/Java/Scala/0020__Scala_Variables.htm\n" 
      + "http://www.java2s.com/Tutorials/Java/Scala/0040__Scala_Variable_Declarations.htm\n" 
      + "http://www.java2s.com/Tutorials/Java/Scala/0060__Scala_Semicolons.htm\n" 
      + "http://www.java2s.com/Tutorials/Java/Scala/0080__Scala_Code_Blocks.htm\n" 
      + "http://www.java2s.com/Tutorials/Java/Scala/0090__Scala_Comments.htm\n" 
      + "http://www.java2s.com/Tutorials/Java/Scala/0100__Scala_Type_Hierarchy.htm\n"; 
    String[] urls = lstUrls.split("\n"); 
    for (String url : urls) { 
     Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36").get(); 
     Elements select = doc.select("div.row"); 
     String html = select.html(); 
     System.out.println(html); 
     /* 
     insert html to database 
     */ 
    } 
} catch (IOException ex) { 
    ex.printStackTrace(); 
} 
+0

One thing you can do is queue the output and insert it into the database in a single batch, so that you hit the database only once. – turingcomplete

+0

@MaTâm If my answer helped you, please consider upvoting it. – Hasanaga

+0

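Thanks, turingcomplete. Sorry, English is not my native language, so I don't fully understand what you mean. Could you explain in more detail, or point me to documentation I should read? –

Answers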
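A minimal sketch of what the batching suggestion above could look like, assuming plain JDBC. The class name `BatchInsertSketch`, the table `pages`, and its columns `url` and `html` are hypothetical and not part of the original question. Worker threads call `enqueue` instead of touching the database, and one thread later drains the queue and sends everything in a single batch.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class BatchInsertSketch {
    /** One fetched page: its URL and the extracted HTML. */
    static final class Page {
        final String url;
        final String html;
        Page(String url, String html) { this.url = url; this.html = html; }
    }

    // Thread-safe queue, so several fetching threads can add results concurrently.
    private static final Queue<Page> PENDING = new ConcurrentLinkedQueue<>();

    /** Called by the fetching code instead of inserting immediately. */
    static void enqueue(String url, String html) {
        PENDING.add(new Page(url, html));
    }

    /** Removes and returns everything queued so far. */
    static List<Page> drain() {
        List<Page> batch = new ArrayList<>();
        Page p;
        while ((p = PENDING.poll()) != null) {
            batch.add(p);
        }
        return batch;
    }

    /** Sends all pages in one JDBC batch. Table and column names are assumptions. */
    static int insertBatch(Connection conn, List<Page> pages) throws SQLException {
        String sql = "INSERT INTO pages (url, html) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Page page : pages) {
                ps.setString(1, page.url);
                ps.setString(2, page.html);
                ps.addBatch();
            }
            return ps.executeBatch().length;
        }
    }
}
```

Once all URLs have been fetched, a call such as `insertBatch(conn, drain())` would write the queued pages in one go. For a very large list it may be safer to drain and insert every few hundred pages rather than holding everything in memory.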

3

To retrieve the data using multiple threads, you can do something like this:

ExecutorService ex = Executors.newFixedThreadPool(3); 
    String lstUrls = "http://www.java2s.com/Tutorials/Java/Scala/index.htm\n" 
      + "http://www.java2s.com/Tutorials/Java/Scala/0020__Scala_Variables.htm\n" 
      + "http://www.java2s.com/Tutorials/Java/Scala/0040__Scala_Variable_Declarations.htm\n" 
      + "http://www.java2s.com/Tutorials/Java/Scala/0060__Scala_Semicolons.htm\n" 
      + "http://www.java2s.com/Tutorials/Java/Scala/0080__Scala_Code_Blocks.htm\n" 
      + "http://www.java2s.com/Tutorials/Java/Scala/0090__Scala_Comments.htm\n" 
      + "http://www.java2s.com/Tutorials/Java/Scala/0100__Scala_Type_Hierarchy.htm\n"; 
    String[] urls = lstUrls.split("\n"); 
    for (final String url : urls) { 
     // Each URL is fetched and stored as its own task on the pool. 
     ex.execute(new Runnable() { 
      @Override 
      public void run() { 
       try { 
        Document doc = Jsoup 
          .connect(url) 
          .userAgent(
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36") 
          .get(); 
        Elements select = doc.select("div.row"); 
        String html = select.html(); 
        System.out.println(html); 
        /* 
        * insert html to database 
        */ 
       } catch (Exception e) { 
        e.printStackTrace(); 
       } 
      } 
     }); 
    } 
    // Stop accepting new tasks; tasks already queued still run, then the pool threads exit. 
    ex.shutdown(); 

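This processes the URLs with 3 threads at the same time. If you want to use more than 3 threads, change the argument in Executors.newFixedThreadPool(3) to whatever number you want.

You can find more about Executors here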

+0

This is great; I thought it would be more complicated. I will look into your approach further. Thank you very much. –

+0

@MaTâm I'm glad I could help, good luck. – Titus

+0

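Question: when I put a "finished" message after the for (final String url : urls) {} loop, that statement runs first, before the work is done. How can I print the message only after all the work has finished? –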
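One possible way to handle the question above, sketched under assumptions: the loop only submits tasks, so anything placed right after it runs immediately. Calling `shutdown()` and then `awaitTermination(...)` on an `ExecutorService` blocks until the submitted tasks have finished. The class name `AwaitCompletionSketch`, the 10-minute timeout, and the counter standing in for the real Jsoup fetch are all illustrative choices, not part of the original answer.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class AwaitCompletionSketch {
    /** Runs one task per URL and returns how many tasks completed. */
    static int runAll(String[] urls, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        final AtomicInteger done = new AtomicInteger();
        for (final String url : urls) {
            pool.execute(new Runnable() {
                @Override
                public void run() {
                    // Placeholder for the real work (fetch url, insert into database).
                    done.incrementAndGet();
                }
            });
        }
        pool.shutdown(); // no new tasks; already submitted ones keep running
        try {
            // Block until every task has finished, or the (assumed) timeout expires.
            if (!pool.awaitTermination(10, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        int n = runAll(new String[] {"url-1", "url-2", "url-3"}, 3);
        // This line is reached only after the pool has terminated.
        System.out.println("Finished processing " + n + " URLs");
    }
}
```

The "finished" message belongs after `awaitTermination` returns, not directly after the loop.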

3

I suggest compressing the data before inserting it into the database.

//PreparedStatement.setBytes(1,compress(html)); 

public static byte[] compress(String str) throws Exception { 
    if (str == null || str.length() == 0) { 
     return null; 
    } 
    ByteArrayOutputStream obj = new ByteArrayOutputStream(); 
    GZIPOutputStream gzip = new GZIPOutputStream(obj); 
    gzip.write(str.getBytes("UTF-8")); 
    gzip.close(); 
    return obj.toByteArray(); 
} 

public static String decompress(byte[] bytes) throws Exception { 
    // Copy raw bytes instead of reading line by line, so line breaks in the HTML are preserved. 
    try (GZIPInputStream gis = new GZIPInputStream(new ByteArrayInputStream(bytes)); 
      ByteArrayOutputStream out = new ByteArrayOutputStream()) { 
     byte[] buffer = new byte[8192]; 
     int n; 
     while ((n = gis.read(buffer)) != -1) { 
      out.write(buffer, 0, n); 
     } 
     return out.toString("UTF-8"); 
    } 
} 

The second approach is to save the HTML data to a file and store only the file path in the database.

long ts = System.currentTimeMillis(); 
String filePath = String.valueOf(ts)+".gz"; 
saveToFile(filePath ,html); 
--------  
public static void saveToFile(String filePath, String text) { 
    // try-with-resources closes the stream (and finishes the gzip data) even if write() fails. 
    try (GZIPOutputStream gzos = new GZIPOutputStream(new FileOutputStream(filePath))) { 
     gzos.write(text.getBytes("UTF-8")); 
    } catch (IOException ex) { 
     ex.printStackTrace(); 
    } 
} 
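A small sketch of the file-based approach that may be useful if the pages are saved from several threads: two threads can obtain the same `System.currentTimeMillis()` value, so this version uses a random UUID for the file name instead. The class name `GzipFileStoreSketch` and its methods are hypothetical; the returned path is what would be stored in the database.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipFileStoreSketch {
    /** Writes html gzip-compressed into dir and returns the new file's path. */
    static Path save(Path dir, String html) {
        // A random UUID avoids name collisions between concurrent writers.
        Path file = dir.resolve(UUID.randomUUID() + ".gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
            out.write(html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file;
    }

    /** Reads a file written by save() back into a string. */
    static String load(Path file) {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(file));
             ByteArrayOutputStream buf = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Saves text into a fresh temporary directory and reads it back. */
    static String roundTrip(String text) {
        try {
            Path dir = Files.createTempDirectory("pages");
            return load(save(dir, text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the database insert, `save(dir, html).toString()` would take the place of the timestamp-based `filePath`.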
+0

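I think execution would be faster if the URL list were split across multiple threads. I think you are also interested in reading web pages, so please try thinking through this topic with me. –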