How to get all the links leading to the next pages?

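I have written some code in VBA to get all the links that lead from a webpage to its next pages. The highest page number is 255. When I run my script, I get 6906 links in total. That means the loop visits the same pager links over and over and I keep collecting duplicates. After filtering out the duplicates, I can see 254 unique links. My goal is to iterate over the links without hardcoding the highest page number. Here is what I have tried so far: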

Sub YifyLink() 
    Const link = "https://www.yify-torrent.org/search/1080p/" 
    Dim http As New XMLHTTP60, html As New HTMLDocument, htm As New HTMLDocument 
    Dim x As Long, y As Long, item_link as String 

    With http 
     .Open "GET", link, False 
     .send 
     html.body.innerHTML = .responseText 
    End With 

    For Each post In html.getElementsByClassName("pager")(0).getElementsByTagName("a") 
     If InStr(post.innerText, "Last") Then 
      x = Split(Split(post.href, "-")(1), "/")(0) 
     End If 
    Next post 
    For y = 0 To x 
     item_link = link & "t-" & y & "/" 

     With http 
      .Open "GET", item_link, False 
      .send 
      htm.body.innerHTML = .responseText 
     End With 
     For Each posts In htm.getElementsByClassName("pager")(0).getElementsByTagName("a") 
      I = I + 1: Cells(I, 1) = posts.href 
     Next posts 
    Next y 
End Sub

The element that contains the links:

<div class="pager"><a href="/search/1080p/" class="current">1</a> <a href="/search/1080p/t-2/">2</a> <a href="/search/1080p/t-3/">3</a> <a href="/search/1080p/t-4/">4</a> <a href="/search/1080p/t-5/">5</a> <a href="/search/1080p/t-6/">6</a> <a href="/search/1080p/t-7/">7</a> <a href="/search/1080p/t-8/">8</a> <a href="/search/1080p/t-9/">9</a> <a href="/search/1080p/t-10/">10</a> <a href="/search/1080p/t-11/">11</a> <a href="/search/1080p/t-12/">12</a> <a href="/search/1080p/t-13/">13</a> <a href="/search/1080p/t-14/">14</a> <a href="/search/1080p/t-15/">15</a> <a href="/search/1080p/t-16/">16</a> <a href="/search/1080p/t-17/">17</a> <a href="/search/1080p/t-18/">18</a> <a href="/search/1080p/t-19/">19</a> <a href="/search/1080p/t-20/">20</a> <a href="/search/1080p/t-21/">21</a> <a href="/search/1080p/t-22/">22</a> <a href="/search/1080p/t-23/">23</a> <a href="/search/1080p/t-2/">Next</a> <a href="/search/1080p/t-255/">Last</a> </div>

The results I get (partial):

about:/search/1080p/t-20/ 
about:/search/1080p/t-21/ 
about:/search/1080p/t-22/ 
about:/search/1080p/t-23/ 
about:/search/1080p/t-255/

Source

2017-07-27 SIM

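Why scrape the links from the page (especially since they are not all on the page to begin with)? Why not generate them yourself? – YowE3K

Because the page I'm trying to work with might not be the first page. How could I hardcode the last number and then iterate? – SIM

The page you are currently scraping appears to tell you the highest page number that exists (255), so can't you just scrape that one number and then loop from 1 to that number to generate all 255 links? – YowE3K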
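One way to read the suggestion above, as a rough sketch rather than a tested solution: read the page number from the "Last" link once, then build every link in a loop. It assumes the first page's pager contains a "Last" anchor whose href ends in `t-N/`, as in the HTML shown in the question, and that the project references Microsoft XML, v6.0 and the Microsoft HTML Object Library. The names `PageNumberFromHref` and `ListPageLinks` are hypothetical.

```vba
' Hypothetical helper: pulls the page number out of an href such as
' "about:/search/1080p/t-255/". Returns 0 when no "/t-" segment is found.
Function PageNumberFromHref(ByVal href As String) As Long
    Dim pos As Long
    pos = InStrRev(href, "/t-")
    ' Val stops at the first non-numeric character, so "255/" gives 255
    If pos > 0 Then PageNumberFromHref = CLng(Val(Mid$(href, pos + 3)))
End Function

Sub ListPageLinks()
    Const link As String = "https://www.yify-torrent.org/search/1080p/"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim anchor As Object
    Dim lastPage As Long, pageNo As Long

    ' Fetch the first page only
    With http
        .Open "GET", link, False
        .send
        html.body.innerHTML = .responseText
    End With

    ' Find the "Last" anchor in the pager and read its page number
    For Each anchor In html.getElementsByClassName("pager")(0).getElementsByTagName("a")
        If InStr(anchor.innerText, "Last") > 0 Then
            lastPage = PageNumberFromHref(anchor.href)
        End If
    Next anchor

    ' Page 1 is the base link; the others follow the t-N pattern
    Cells(1, 1) = link
    For pageNo = 2 To lastPage
        Cells(pageNo, 1) = link & "t-" & pageNo & "/"
    Next pageNo
End Sub
```

This makes a single request and writes one row per page, so no duplicates are produced. It does depend on the pager exposing the last page number.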

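The idea should be to scrape the pages in a loop and look for something to check on each page; when the check fails, exit the loop.

That check could be looking up a key in a dictionary, testing whether an element exists, or any other logic specific to your problem.

For example, the problem here is that the website keeps serving page 255 for every later page number. That is our clue: we can compare an element from page (n) with the same element from page (n-1).

For example, if an element on page 256 is the same as the one on page 255, exit the loop/sub. See the sample code below: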
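The dictionary check mentioned above could look roughly like this. It is a minimal sketch: `AddIfNew` and `CollectUniqueLinks` are hypothetical names, and the sample hrefs stand in for whatever the pager returns. It uses a late-bound `Scripting.Dictionary`, so no extra reference is needed.

```vba
' Hypothetical helper: records key in the dictionary and
' returns True only the first time that key is seen.
Function AddIfNew(ByVal seen As Object, ByVal key As String) As Boolean
    If Not seen.Exists(key) Then
        seen.Add key, True
        AddIfNew = True
    End If
End Function

Sub CollectUniqueLinks()
    Dim seen As Object
    Dim hrefs As Variant, href As Variant
    Dim rowNo As Long

    Set seen = CreateObject("Scripting.Dictionary")

    ' Sample data (assumption): the third entry repeats the first
    hrefs = Array("/search/1080p/t-2/", "/search/1080p/t-3/", "/search/1080p/t-2/")

    ' Write each href to column A only if it has not been seen before
    For Each href In hrefs
        If AddIfNew(seen, CStr(href)) Then
            rowNo = rowNo + 1
            Cells(rowNo, 1) = href
        End If
    Next href
End Sub
```

In a scraping loop, the same test can serve as the exit condition: once a page yields no new keys, there is nothing left to collect.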

Sub yify()
    ' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library"
    Const mlink = "https://www.yify-torrent.org/search/1080p/t-"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, posts As Object
    Dim pageno As Long, rowno As Long

    pageno = 1
    rowno = 1

    Do
        ' Fetch page number pageno and load it into the HTML document
        With http
            .Open "GET", mlink & pageno & "/", False
            .send
            html.body.innerHTML = .responseText
        End With

        Set posts = html.getElementsByClassName("mv")
        ' Stop when the last value written matches item 17 of this page,
        ' which suggests the site has served the previous page again
        If Cells(rowno, 1) = posts(17).getElementsByTagName("a")(0).innerText Then Exit Do

        ' Write the text of the first div of each item to column A
        For Each post In posts
            With post.getElementsByTagName("div")
                If .Length Then
                    rowno = rowno + 1
                    Cells(rowno, 1) = .Item(0).innerText
                End If
            End With
        Next post
        Debug.Print "pageno: " & pageno & " completed."
        pageno = pageno + 1
    Loop
End Sub

Source

2017-07-28 18:52:55 Tehscript

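Thanks, Tehscript, for your solution. It never goes astray. Congratulations on your new achievement. I didn't know it would take longer than I expected. I hope to see 10 in place of 1 very soon. By the way, I have an amazing script; run it and you will get all the links in less than a second. Here it is: "https://www.dropbox.com/s/2na6nfvipmsobat/For%20Tehscript.txt?dl=0" – SIM

Thanks SMth80, I will take a look when I have time. – Tehscript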
