2016-11-05 11 views

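HTML Slurping in Groovy

I am trying to parse HTML that is handed to me as one giant String. When I reach the it.parent() call in the code below, I am able to find the key that I am looking for, but the data comes back run together, like This Is Value One In My KeyThis is Value Two in my KeyThis is Value Three In My Key and so on. What I have noticed is that the delimiter between two values is always of the form UppercaseUppercase (withoutSpaces): a capitalized word follows the previous word with no space in between.

I would like to get the values into an ArrayList one way or another. Is there a method that I am missing from the docs that can do this automatically? Is there a better way to parse all of this?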

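// Assumption: Parser is the TagSoup parser; the original post does not show its import.
@Grab('org.ccil.cowan.tagsoup:tagsoup:1.2.1')
import org.ccil.cowan.tagsoup.Parser
// On Groovy 4+, also add: import groovy.xml.XmlSlurper

class htmlParsingStuff {

    private def slurper = new XmlSlurper(new Parser())

    private List slurpItUp(String rawHTMLString) {
        List urlList = []
        def htmlParser = slurper.parseText(rawHTMLString)

        // each (not findAll) because the closure is only used for its side effects
        htmlParser.depthFirst().each {
            // Loop through all of the HTML tags to get to the key that I am looking for
            // EDIT: I see that I am able to iterate through the parent object, I just need a way to figure out how to get into that object
            if (it.text() == 'someKey') {
                // I found the key that I am looking for
                // parent() returns a node, so take its text before stripping the key
                String page = it.parent().text().replace('someKey', '')
                Map row = [page: page, type: 'Some Type']
                urlList.add(row)
            }
        }
        return urlList
    }
}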

Okay, I guess I was right then. I did not realize that you can call '.parent', '.children' and '.childNodes'.
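As a follow-up to that comment, here is a minimal sketch of how parent() and children() can keep the values separate instead of reading them as one concatenated string. The tr/td layout and the value strings are invented for illustration, and a small well-formed XML snippet stands in for the real page so that no HTML parser is needed.

```groovy
// Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)
import groovy.xml.XmlSlurper

// Hypothetical markup: one cell holds the key, its sibling cells hold the values
def xml = '<table><tr><td>someKey</td><td>Value One</td><td>Value Two</td><td>Value Three</td></tr></table>'

def root = new XmlSlurper().parseText(xml)

// Locate the cell whose text is exactly the key
def keyCell = root.depthFirst().find { it.name() == 'td' && it.text() == 'someKey' }

// text() on the parent concatenates every child, which is the run-together string
def row = keyCell.parent()
println row.text()

// Walking the children one by one keeps each value separate
def values = row.children().collect { it.text() }.findAll { it != 'someKey' }
println values
```

Whether this applies depends on the real markup: it only helps if each value sits in its own element under the same parent.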

Answers

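I can't provide working code for you, since I don't know your specific HTML.

But: do not use XmlSlurper to parse HTML. HTML is frequently not well-formed XML, so XmlSlurper on its own is not a good fit for the job.

For HTML, use a library such as JSoup. You will find it much easier to use, especially if you have some jQuery knowledge. Since you did not post your HTML snippet, I made up my own example: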
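To illustrate the point about well-formedness, here is a small sketch of what happens when a plain XmlSlurper (with no lenient parser such as TagSoup passed to its constructor) meets ordinary HTML. The input string is made up; the unclosed br tag is valid HTML but not valid XML.

```groovy
// Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)
import groovy.xml.XmlSlurper
import org.xml.sax.SAXParseException

// Made-up input: <br> is never closed, which browsers accept but an XML parser does not
def html = '<html><body><p>one<br>two</p></body></html>'

def failed = false
try {
    new XmlSlurper().parseText(html)
} catch (SAXParseException e) {
    failed = true
    println "XmlSlurper rejected the input: ${e.message}"
}
```

Passing a lenient parser to the XmlSlurper constructor, as the question does, avoids this particular failure, but the result is still queried through an XML-oriented API.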

@Grab(group='org.jsoup', module='jsoup', version='1.10.1') 
import org.jsoup.Jsoup 

def html = """ 
<html> 
<body> 
    <table> 
    <tr><td>Key 1</td></tr> 
    <tr><td>Key 2</td></tr> 
    <tr><td>Key 3</td></tr> 
    <tr><td>Key 4</td></tr> 
    <tr><td>Key 5</td></tr> 
    </table> 
</body> 
</html>""" 

def doc = Jsoup.parse(html) 
def elements = doc.select('td') 
def result = elements.collect {it.text()} 
// contains ['Key 1', 'Key 2', 'Key 3', 'Key 4', 'Key 5'] 

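To manipulate the document you would use:

// Run as its own script, with the @Grab, the Jsoup import and the html string from the example above
import org.jsoup.nodes.Element
import org.jsoup.parser.Tag

def doc = Jsoup.parse(html)
def elements = doc.select('td')
// Swap every td for a freshly built td with new text
elements.each { oldElement ->
    def newElement = new Element(Tag.valueOf('td'), '')
    newElement.text('Another key')
    oldElement.replaceWith(newElement)
}
println doc.outerHtml()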
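
If only the concatenated text is available, the run-together values from the question could also be split with a regular expression. This is a minimal sketch, not a general solution: it assumes every boundary is a lowercase letter followed directly by an uppercase letter, as in the sample string from the question, and it would also split inside a value that contains such a pair (for example a camelCase word).

```groovy
// Sample string taken from the question
def raw = 'This Is Value One In My KeyThis is Value Two in my KeyThis is Value Three In My Key'

// Zero-width split: the position must be preceded by a lowercase letter
// and followed by an uppercase letter, so no characters are consumed
List<String> values = raw.split(/(?<=[a-z])(?=[A-Z])/) as List

values.each { println it }
```

Selecting the individual elements with the parser, as in the examples above, is usually more robust, because it does not depend on how the values happen to be capitalized.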