Reactor 3.x (Java): using it for web scraping

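Reactor noob here.

This is more of a how-to question.

Say I have a website that I want to scrape, and it returns search results as a set of pages. The number of search result pages is unknown. Each search page has a link to the next page. I want to scrape all the search results from all the pages and process each search result.

How do I do this in Java using Reactor (Mono/Flux)?

I want to be as "reactive" as possible.

Basically, I am after a Reactor (3.x) version of the following imperative pseudocode: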

String url = "http://example.com/search/1";
Optional<Document> docOp = getNextPage(url);      // (1)
while (docOp.isPresent()) {
    Document doc = docOp.get();
    processDoc(doc);                              // (2)
    docOp = getNextPage(getNextUrl(doc));         // (3)
}

// (1) Get the first page of search results
// (2) Process all the search results on this page asynchronously
// (3) Find the next page URL, and get that page
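To make the loop above concrete, here is a minimal runnable sketch of the same imperative traversal. It is only an illustration: `Page`, `SITE`, and `crawl` are hypothetical names, and the "site" is an in-memory map rather than real HTTP calls. Unlike the pseudocode, a missing next link is modelled explicitly as an empty `Optional`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a parsed search page: its results plus an optional next-page URL.
final class Page {
    final List<String> results;
    final Optional<String> nextUrl;

    Page(List<String> results, Optional<String> nextUrl) {
        this.results = results;
        this.nextUrl = nextUrl;
    }
}

final class ImperativeCrawl {
    // In-memory "site" used instead of real HTTP requests (assumption for illustration).
    static final Map<String, Page> SITE = Map.of(
        "http://example.com/search/1",
            new Page(List.of("a", "b"), Optional.of("http://example.com/search/2")),
        "http://example.com/search/2",
            new Page(List.of("c"), Optional.of("http://example.com/search/3")),
        "http://example.com/search/3",
            new Page(List.of("d", "e"), Optional.empty()));

    // Stub for getNextPage: empty when the URL is unknown.
    static Optional<Page> getNextPage(String url) {
        return Optional.ofNullable(SITE.get(url));
    }

    // Follows "next" links until none is left, collecting every result in page order.
    static List<String> crawl(String firstUrl) {
        List<String> collected = new ArrayList<>();
        Optional<Page> page = getNextPage(firstUrl);                          // (1)
        while (page.isPresent()) {
            collected.addAll(page.get().results);                             // (2)
            page = page.get().nextUrl.flatMap(ImperativeCrawl::getNextPage);  // (3)
        }
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(crawl("http://example.com/search/1"));
    }
}
```

The loop stops either when a page has no next link or when the next URL cannot be fetched, which are the two termination cases any reactive version also has to cover.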


2017-02-02 Erik

With some help from https://gitter.im/reactor/reactor I arrived at this solution. It may not be ideal. I would love feedback on any problems anyone can see.

import javaslang.control.Option;
import javaslang.control.Try;
import reactor.core.publisher.Flux;

public void scrape() {

    // (1) No Document is supplied, so getSearchResultsPage() starts at the first page
    Try<Document> firstDocTry = this.getSearchResultsPage(Option.<Document>none().toTry());

    // (2) Generate a flux where each element is created from the current state
    Flux.<Try<Document>, Try<Document>>generate(() -> firstDocTry, (docTry, sink) -> {
            if (docTry.isFailure()) {
                sink.complete();                       // no more pages
                return docTry;
            }
            sink.next(docTry);                         // emit the current page first so the seed page is not skipped
            return this.getSearchResultsPage(docTry);  // the following page becomes the next state
        })
        .flatMap(docTry -> this.transformToScrapedLoads(docTry))  // (3)
        .log()
        .subscribe(scrapedLoad ->
            scrapedLoadRepo.save(scrapedLoad)                     // (4)
        );
}

protected Try<Document> getSearchResultsPage(Try<Document> docTry) {
    ...
}

protected Flux<ScrapedLoad> transformToScrapedLoads(Try<Document> docTry) {
    ...
}

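(1) Javaslang's monadic Try and Option are used here. 'firstDocTry' seeds the generator. getSearchResultsPage() knows to start at the first page of the search if no Document is supplied.

(2) A generator is used here. Each element published in the Flux is derived from the preceding element.

(3) The transform method converts each Document into a Flux; these are merged and sent to the subscriber as a single Flux.

(4) The subscriber operates on each element produced by the Flux. In this case, it persists them.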
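Notes (2) and (3) describe a general shape: derive each page from the previous one, stop when there is no further page, then flatten every page into its individual results. As a rough analogy outside Reactor, the same shape can be sketched with the JDK's `Stream.iterate` (Java 9+) and `flatMap`. This is a synchronous illustration only, not the Reactor pipeline itself; `SearchPage`, `SITE`, `fetch`, and `scrapeAll` are hypothetical names, and the pages live in an in-memory map.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class GeneratorShapeSketch {
    // Hypothetical page model: the results on a page and the key of the next page (null on the last page).
    static final class SearchPage {
        final List<String> results;
        final String next;

        SearchPage(List<String> results, String next) {
            this.results = results;
            this.next = next;
        }
    }

    // In-memory pages standing in for real search result pages (assumption for illustration).
    static final Map<String, SearchPage> SITE = Map.of(
        "p1", new SearchPage(List.of("r1", "r2"), "p2"),
        "p2", new SearchPage(List.of("r3"), "p3"),
        "p3", new SearchPage(List.of("r4"), null));

    // Returns null when there is no such page, which ends the iteration in scrapeAll.
    static SearchPage fetch(String key) {
        return key == null ? null : SITE.get(key);
    }

    static List<String> scrapeAll(String firstKey) {
        return Stream.iterate(fetch(firstKey),            // seed: the first page
                              page -> page != null,       // stop when no page is available
                              page -> fetch(page.next))   // each page is derived from the previous one
                .flatMap(page -> page.results.stream())   // flatten pages into individual results
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(scrapeAll("p1"));
    }
}
```

In this sketch the seed page is itself the first element of the sequence, so nothing is fetched and then dropped before emission starts.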


2017-02-03 20:55:47 Erik
