2015-12-16 26 views
4

VBA, getElementsByClassName: double quotes missing from the HTML source

I scrape some websites for fun, using VBA as my tool. I use XMLHTTP and HTMLDocument (because that is faster than InternetExplorer.Application).

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    Dim XMLHTTPReq As Object
    Dim htmlDoc As HTMLDocument

    Dim postURL As String

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

    Set XMLHTTPReq = New MSXML2.XMLHTTP

    With XMLHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    Set htmlDoc = New HTMLDocument
    With htmlDoc
        .body.innerHTML = XMLHTTPReq.responseText
    End With

    i = 0

    Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")

    For Each vr In varTemp
        ''''the next line is important to solve this issue *1
        Cells(1, 1) = vr.outerHTML
        Set varTemp2 = vr.getElementsByTagName("SPAN class=post_date")
        Cells(i + 1, 3) = varTemp2.Item(0).innerText
        ''''the next line occur 438Error''''
        Set varTemp2 = vr.getElementsByClassName("hover_inner")
        Cells(i + 1, 4) = varTemp2.innerText

        i = i + 1

    Next vr
End Sub

To track down the problem I added the line marked *1. Cells(1, 1) then showed me the following:

<DIV class="post_glass post_micro_glass" title=""><A class=hover title="" href="http://foodffs.tumblr.com/post/134291668251/sugar-free-low-carb-coffee-ricotta-mousse-really" target=_blank> 
<DIV class=hover_inner><SPAN class=post_date>............... 

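All the class attributes have lost their double quotes (""). Only the class of the first element still has them. I really don't know why this happens.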

//OK, I can parse it with getElementsByTagName("span"). But I would prefer to go by the "class" attribute.....


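http://stackoverflow.com/questions/7927905/internet-explorer-innerhtml-outputs-attributes-without-quotes I don't think HTML requires quotes around attribute values when the value contains no spaces, and what you see in outerHTML reflects IE's own representation of the markup. This is probably not the source of the error you are getting. –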


What happens if you try Set varTemp2 = vr.querySelectorAll("span.post_date")? – barrowc


Thanks, everyone! @TimWilliams I see. So is getElementsByTagName("span") the only way I can get at the innerText? – Soborubang

Answers

4

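The getElementsByClassName method is not treated as a method of an element itself, only of the parent HTMLDocument. If you want to use it to locate elements inside a DIV element, you need to create a sub-HTMLDocument built from the .outerHTML of that specific DIV element.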

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    Dim xmlHTTPReq As New MSXML2.XMLHTTP
    Dim htmlDOC As New HTMLDocument, divSUBDOC As New HTMLDocument
    Dim iDIV As Long, iSPN As Long
    Dim postURL As String, nr As Long

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

    With xmlHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    With htmlDOC
        .body.innerHTML = xmlHTTPReq.responseText
    End With

    With htmlDOC
        For iDIV = 0 To .getElementsByClassName("post_glass post_micro_glass").Length - 1
            'next empty row in column C of Sheet1
            nr = Sheet1.Cells(Sheet1.Rows.Count, 3).End(xlUp).Offset(1, 0).Row
            With .getElementsByClassName("post_glass post_micro_glass")(iDIV)
                'method 1 - run through multiples in a collection
                For iSPN = 0 To .getElementsByTagName("span").Length - 1
                    With .getElementsByTagName("span")(iSPN)
                        Select Case LCase(.className)
                            Case "post_date"
                                Sheet1.Cells(nr, 3) = .innerText
                            Case "post_notes"
                                Sheet1.Cells(nr, 4) = .innerText
                            Case Else
                                'do nothing
                        End Select
                    End With
                Next iSPN
                'method 2 - create a sub-HTML doc to facilitate getting els by classname
                divSUBDOC.body.innerHTML = .outerHTML 'only the HTML from this DIV
                With divSUBDOC
                    If CBool(.getElementsByClassName("hover_inner").Length) Then 'there is at least 1
                        'use the first
                        Sheet1.Cells(nr, 5) = .getElementsByClassName("hover_inner")(0).innerText
                    End If
                End With
            End With
        Next iDIV
    End With

End Sub

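While the other .getElementsByXXXX methods can readily retrieve collections from within another element, getElementsByClassName needs to work on what it believes to be a whole HTMLDocument, even if you have tricked it into believing that.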
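
As a rough sketch of method 2, the sub-document step can be wrapped in a small helper. The function name FirstTextByClass and its parameters are hypothetical (they do not appear in the answer), and the sketch assumes a project reference to the Microsoft HTML Object Library:

```vba
' Hypothetical helper, not part of the original answer.
' Assumes a project reference to "Microsoft HTML Object Library".
' Returns the innerText of the first descendant of el that has the given
' class name, or an empty string when there is no match.
Public Function FirstTextByClass(ByVal el As Object, ByVal className As String) As String
    Dim subDoc As HTMLDocument
    Dim matches As Object

    ' Build a throwaway document that holds only this element's HTML
    Set subDoc = New HTMLDocument
    subDoc.body.innerHTML = el.outerHTML

    ' Class lookup now runs against a whole HTMLDocument
    Set matches = subDoc.getElementsByClassName(className)
    If matches.Length > 0 Then
        FirstTextByClass = matches.Item(0).innerText
    End If
End Function
```

Inside the loop it could be called as Sheet1.Cells(nr, 5) = FirstTextByClass(someDiv, "hover_inner"), where someDiv stands for the current DIV element. Creating a fresh document on every call trades a little speed for not carrying content over between iterations.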


Thank you so much! I didn't know getElementsByClassName was special. I admire you! – Soborubang


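MDN says "You may also call getElementsByClassName() on any element; it will return only elements which are descendants of the specified root element with the given class names." I'm pretty sure I have used it that way in IE before... –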


https://developer.mozilla.org/zh-CN/docs/Web/API/Element/getElementsByClassName –

1

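Here is another approach. It is very similar to the original code but uses querySelectorAll to select the relevant span elements. One important point about this approach is that vr must be declared as a specific element type, not as an IHTMLElement or a generic Object: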

Option Explicit

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    ' Changed from generic Object to specific type - not
    ' strictly necessary to do this
    Dim XMLHTTPReq As MSXML2.XMLHTTP60
    Dim htmlDoc As HTMLDocument

    ' These declarations weren't included in the original code
    Dim i As Integer
    Dim varTemp As Object
    ' IMPORTANT: vr must be declared as a specific element type and not
    ' as an IHTMLElement or generic Object
    Dim vr As HTMLDivElement
    Dim varTemp2 As Object

    Dim postURL As String

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

    ' Changed from XMLHTTP to XMLHTTP60 as XMLHTTP is equivalent
    ' to the older XMLHTTP30
    Set XMLHTTPReq = New MSXML2.XMLHTTP60

    With XMLHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    Set htmlDoc = New HTMLDocument
    With htmlDoc
        .body.innerHTML = XMLHTTPReq.responseText
    End With

    i = 0

    Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")

    For Each vr In varTemp
        ''''the next line is important to solve this issue *1
        Cells(1, 1) = vr.outerHTML

        Set varTemp2 = vr.querySelectorAll("span.post_date")
        Cells(i + 1, 3) = varTemp2.Item(0).innerText

        Set varTemp2 = vr.getElementsByClassName("hover_inner")
        ' incorporating correction from Jeeped's comment (#56349646)
        Cells(i + 1, 4) = varTemp2.Item(0).innerText

        i = i + 1
    Next vr

End Sub

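Notes:

  • XMLHTTP is equivalent to XMLHTTP30, as described here
  • The apparent need to declare vr as a specific element type is explored in this question; but, unlike the getElementsByClassName method, querySelectorAll does not exist in any version of IHTMLElement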
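
To illustrate the first note, a minimal fetch helper that binds explicitly to the version 6.0 class might look like the sketch below. FetchHtml is a hypothetical name, the sketch assumes a project reference to Microsoft XML, v6.0, and the status check is an addition that none of the answers include:

```vba
' Hypothetical helper, not part of the original answers.
' Assumes a project reference to "Microsoft XML, v6.0".
' Returns the response body, or an empty string on any non-200 status.
Public Function FetchHtml(ByVal url As String) As String
    Dim req As MSXML2.XMLHTTP60
    Set req = New MSXML2.XMLHTTP60

    With req
        .Open "GET", url, False   ' False = synchronous request
        .send
        If .Status = 200 Then FetchHtml = .responseText
    End With
End Function
```

The result can then be assigned to htmlDoc.body.innerHTML in the same way the answers assign XMLHTTPReq.responseText.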