2017-08-18 41 views

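Using jsoup to scrape data between specific tags on a web page

I'm currently developing a program that collects the 5 most recent fan fiction stories added to my fandom on AO3 (Archive of Our Own). The stories are then added to an ArrayList I've set up, which holds the works from the past week. At the end of each week, I plan to dump the contents of the ArrayList to a text file so that I can paste it into a Reddit post on my subreddit. To prevent duplicates, I want to compare each newly parsed story with the stories currently held in the ArrayList.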

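(Additional info: the bot will check the web page every 30 minutes.)

The part I've gotten stuck on is actually parsing the web page and getting at the content between the HTML tags.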

I looked up CSS selectors, but I'm still quite confused, because almost every example uses a simple site like IMDb.

From some basic research, it looks like all the stories in the body I'm looking at sit inside a single ordered list tag.

<ol class="work index group">
    <li class="work blurb group" id="work_10504812" role="article">...</li>
    <li class="work blurb group" id="work_9656693" role="article">...</li>
    <li class="work blurb group" id="work_11814486" role="article">...</li>
    <!-- Goes on for ~20 more stories -->
    <li class="work blurb group" id="work_11687247" role="article">...</li>
</ol>

So, to be clear, each list item is a single story inside the ordered list. The contents of one list item are shown below (with the ordered list tag included).

<ol class="work index group"> 
    <li class="work blurb group" id="work_10504812" role="article"> 
    <!--title, author, fandom--> 
    <div class="header module"> 
    <h4 class="heading"> 
     <a href="/works/10504812">Pocket Healer</a> 
     by 

     <!-- do not cache --> 
     <a rel="author" href="/users/OverNoot/pseuds/OverNoot">OverNoot</a> 
    </h4> 
    <h5 class="fandoms heading"> 
     <span class="landmark">Fandoms:</span> 
     <a class="tag" href="/tags/Overwatch%20(Video%20Game)/works">Overwatch (Video Game)</a> 
     &nbsp; 
    </h5> 
    <!--required tags--> 
    <ul class="required-tags"> 
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="rating-general-audience rating" title="General Audiences"><span class="text">General Audiences</span></span></a></li> 
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="warning-no warnings" title="No Archive Warnings Apply"><span class="text">No Archive Warnings Apply</span></span></a></li> 
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="category-femslash category" title="F/F"><span class="text">F/F</span></span></a></li> 
<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="complete-no iswip" title="Work in Progress"><span class="text">Work in Progress</span></span></a></li> 
</ul> 
    <p class="datetime">17 Aug 2017</p> 
    </div> 
    <!--warnings again, cast, freeform tags--> 
    <h6 class="landmark heading">Tags</h6> 
    <ul class="tags commas"> 
    <li class="warnings"><strong><a class="tag" href="/tags/No%20Archive%20Warnings%20Apply/works">No Archive Warnings Apply</a></strong></li><li class="relationships"><a class="tag" href="/tags/Fareeha%20%22Pharah%22%20Amari*s*Angela%20%22Mercy%22%20Ziegler/works">Fareeha "Pharah" Amari/Angela "Mercy" Ziegler</a></li><li class="characters"><a class="tag" href="/tags/Fareeha%20%22Pharah%22%20Amari/works">Fareeha "Pharah" Amari</a></li> <li class="characters"><a class="tag" href="/tags/Angela%20%22Mercy%22%20Ziegler/works">Angela "Mercy" Ziegler</a></li> <li class="characters"><a class="tag" href="/tags/Winston%20(Overwatch)/works">Winston (Overwatch)</a></li> <li class="characters"><a class="tag" href="/tags/Lena%20%22Tracer%22%20Oxton/works">Lena "Tracer" Oxton</a></li><li class="freeforms"><a class="tag" href="/tags/Tiny%20Pharah%20and%20Tiny%20Mercy/works">Tiny Pharah and Tiny Mercy</a></li> <li class="freeforms"><a class="tag" href="/tags/Fluff/works">Fluff</a></li> <li class="freeforms last"><a class="tag" href="/tags/Cute/works">Cute</a></li> 
    </ul> 
    <!--summary--> 
    <h6 class="landmark heading">Summary</h6> 
    <blockquote class="userstuff summary"> 
     <p>Angela and Fareeha wake up to find tiny alternate versions of themselves have appeared and are now imprinted on them. How will these tiny Pharahs and Mercies impact their work at Overwatch and more importantly how will it impact the feelings they have for each other.</p> 
    </blockquote> 
    <!--stats--> 

    <dl class="stats"> 
     <dt class="language">Language:</dt> 
     <dd class="language">English</dd> 
    <dt class="words">Words:</dt> 
    <dd class="words">35,143</dd> 
    <dt class="chapters">Chapters:</dt> 
    <dd class="chapters">10/11</dd> 
    <dt class="comments">Comments:</dt> 
    <dd class="comments"><a href="/works/10504812?show_comments=true&amp;view_full_work=true#comments">168</a></dd> 
    <dt class="kudos">Kudos:</dt> 
    <dd class="kudos"><a href="/works/10504812?view_full_work=true#comments">438</a></dd> 
    <dt class="bookmarks">Bookmarks:</dt> 
    <dd class="bookmarks"><a href="/works/10504812/bookmarks">35</a></dd> 
    <dt class="hits">Hits:</dt> 
    <dd class="hits">5890</dd> 
    </dl> 
</li> 
</ol> 

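Basically, I want to extract the title, author, URL, summary, and rating.

So far I've worked out where the items I want to extract are located, but I have no real idea how to go about extracting them.

Title: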

<a href="/works/10504812">Pocket Healer</a> 

Author:

<a rel="author" href="/users/OverNoot/pseuds/OverNoot">OverNoot</a> 

URL:

<li class="work blurb group" id="work_10504812" role="article"> 
<!--(http://archiveofourown.com/works/<the number after 'work_'>)--> 

Summary:

<blockquote class="userstuff summary"> 
    <p> (SUMMARY GOES HERE) </p> 
</blockquote> 

Rating:

<li> <a class="help symbol question modal modal-attached" title="Symbols key" aria-controls="#modal" href="/help/symbols-key.html"><span class="rating-general-audience rating" title="General Audiences"><span class="text">General Audiences</span></span></a></li> 
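
The URL rule noted in the comment above (take the number after `work_` in the `id` attribute) can be sketched in plain Java without any HTML library. This is only an illustration: `WorkUrls` and `fromElementId` are hypothetical names, and the base URL is simply the one written in that comment.

```java
// Hypothetical helper: turns an <li> id such as "work_10504812" into a work URL.
final class WorkUrls {
    private static final String PREFIX = "work_";
    // Assumption: base URL copied from the comment in the question.
    private static final String BASE = "http://archiveofourown.com/works/";

    // Returns the work URL, or null if the id is not of the form "work_<digits>".
    static String fromElementId(String elementId) {
        if (elementId == null || !elementId.matches("work_\\d+")) {
            return null;
        }
        return BASE + elementId.substring(PREFIX.length());
    }
}
```

Returning null for an unexpected id is one possible choice; throwing an exception would work just as well if a malformed id should stop the run.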

Additional question: is it possible to iterate over the contents of the ordered list, as with a for loop?

The code I currently have set up for opening the web page is below.

while (true) {
    try {
        String url = "http://archiveofourown.org/tags/Fareeha%20%22Pharah%22%20Amari*s*Angela%20%22Mercy%22%20Ziegler/works";
        Document doc = Jsoup.connect(url).get();

        //Returns element of webpage
        doc.select("<Narrow down to ordered list>");

        //Run for loop to run through first 5 items of
        Thread.sleep(THIRTY_MINUTES);
    }
    catch (Exception ex) {
        ex.printStackTrace();
    }
}
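
As an aside, the 30-minute polling in the loop above could also be handed to a `ScheduledExecutorService` from the standard library instead of `while (true)` with `Thread.sleep`. The following is a rough sketch under that assumption; `Poller` and `start` are hypothetical names, and the task passed in stands for whatever fetches and parses the page.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around a single-threaded scheduler.
final class Poller {
    // Runs `task` right away, then again `delay` after each run finishes.
    static ScheduledExecutorService start(Runnable task, long delay, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                task.run();
            } catch (RuntimeException ex) {
                // An exception escaping the task would cancel all later runs,
                // so it is logged here and the schedule continues.
                ex.printStackTrace();
            }
        }, 0, delay, unit);
        return scheduler;
    }
}
```

A call such as `Poller.start(myCheck, 30, TimeUnit.MINUTES)` (with `myCheck` being your own `Runnable`) would then replace the loop, and `shutdownNow()` on the returned scheduler stops it.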

Answer

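You can use the Document.select(String cssSelector) method, which returns an Elements collection that you can iterate over. For example, ol.work > li returns every li element that is a direct child of the ol.work element. You can use that to loop over all the stories.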

Consider the following code:

// Each direct <li> child of the ordered list is one story
Elements ol = doc.select("ol.work > li"); 

for (Element li : ol) { 
    // The first link in the heading is the title; the rel=author link is the author
    String title = li.select("h4.heading a").first().text(); 
    String author = li.select("h4.heading a[rel=author]").text(); 
    // The work number is the id attribute without its "work_" prefix
    String id = li.attr("id").replaceAll("work_",""); 
    String url = "http://archiveofourown.com/works/" + id; 
    String summary = li.select("blockquote.summary").text(); 
    String rating = li.select("span.rating").text(); 

    System.out.println("Title: " + title); 
    System.out.println("Author: " + author); 
    System.out.println("ID: " + id); 
    System.out.println("URL: " + url); 
    System.out.println("Summary: " + summary); 
    System.out.println("Rating: " + rating); 
} 

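In this example we loop over all the li elements and extract the desired content from each one. As you can see, select is called on the li itself, so every extraction is limited to the current li element. The Element.text() method returns the element's content as plain text, stripping any tags it contains.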

Running it against the HTML you posted in your question produces the following output:

Title: Pocket Healer 
Author: OverNoot 
ID: 10504812 
URL: http://archiveofourown.com/works/10504812 
Summary: Angela and Fareeha wake up to find tiny alternate versions of themselves have appeared and are now imprinted on them. How will these tiny Pharahs and Mercies impact their work at Overwatch and more importantly how will it impact the feelings they have for each other. 
Rating: General Audiences 
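
For the other two requirements in the question (only the 5 newest stories, and no duplicates in the weekly list), one possible approach is to track the work IDs that have already been stored. The sketch below uses only the standard library; `Story` and `WeeklyStories` are hypothetical names, and it assumes the parsed list is ordered newest first, as on the works page.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical holder for some of the fields extracted in the loop above.
final class Story {
    final String id;
    final String title;
    final String author;

    Story(String id, String title, String author) {
        this.id = id;
        this.title = title;
        this.author = author;
    }
}

// Hypothetical weekly store: keeps stories in insertion order and skips known IDs.
final class WeeklyStories {
    private final List<Story> stories = new ArrayList<>();
    private final Set<String> seenIds = new HashSet<>();

    // Looks at the first `limit` entries of `parsed` and stores those whose ID
    // has not been seen before; returns how many were added.
    int addNewest(List<Story> parsed, int limit) {
        int added = 0;
        int end = Math.min(limit, parsed.size());
        for (Story s : parsed.subList(0, end)) {
            if (seenIds.add(s.id)) { // add() returns false for an ID already present
                stories.add(s);
                added++;
            }
        }
        return added;
    }

    // Returns a copy of the stored stories, e.g. for the end-of-week dump.
    List<Story> all() {
        return new ArrayList<>(stories);
    }
}
```

Inside the for loop you would build a `Story` from `id`, `title`, and `author`, collect them into a list, and call `addNewest(list, 5)` once per check. Comparing by work ID rather than by title avoids treating two different works with the same title as duplicates.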

I hope this helps.

Thank you so much for the help! That stuff was giving me quite a headache, but it works flawlessly in my code. Much appreciated! – Jayps

@Jayps I'm glad I could help :) –

Now I'd like to try doing the same thing in C#. I assume it would work the same way? I'd just need to find another library similar to jsoup (for C#)? – Jayps