2013-05-18 189 views
4

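Extract text between HTML tags

I have a lot of HTML files that I need to extract text from. If the text were all on one line I could do this easily, but when the tag wraps or spans multiple lines I don't know how to do it. Here is what I mean: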

<section id="MySection"> 
Some text here 
another line here <br> 
last line of text. 
</section> 

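I don't care about the <br> text unless it helps with the wrapped text. The area I want always starts with "MySection" and ends with </section>. What I want to end up with is this: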

Some text here another line here last line of text. 

I would prefer something like a VBScript or a command-line option, but I'm not sure where to start (sed?). Any help?

Answers

4

Normally you would use the Internet Explorer COM object for this:

root = "C:\base\dir" 

Set fso = CreateObject("Scripting.FileSystemObject") 
Set ie = CreateObject("InternetExplorer.Application") 

For Each f In fso.GetFolder(root).Files 
    ie.Navigate "file:///" & f.Path 
    While ie.Busy : WScript.Sleep 100 : Wend 

    text = ie.document.getElementById("MySection").innerText 

    WScript.Echo Replace(text, vbNewLine, "") 
Next 

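However, the <section> tag is not supported before IE 9, and even in IE 9 the COM object does not seem to handle it correctly, as getElementById("MySection") returns only the opening tag: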

>>> wsh.echo ie.document.getelementbyid("MySection").outerhtml 
<SECTION id=MySection> 

You could use regular expressions instead, though:

root = "C:\base\dir" 

Set fso = CreateObject("Scripting.FileSystemObject") 

Set re1 = New RegExp 
re1.Pattern = "<section id=""MySection"">([\s\S]*?)</section>" 
re1.Global = False 
re1.IgnoreCase = True 

Set re2 = New RegExp 
re2.Pattern = "(<br>|\s)+" 
re2.Global = True 
re2.IgnoreCase = True 

For Each f In fso.GetFolder(root).Files 
    html = fso.OpenTextFile(f.Path).ReadAll 

    Set m = re1.Execute(html) 
    If m.Count > 0 Then 
        ' The first submatch holds everything between the section tags 
        text = Trim(re2.Replace(m(0).SubMatches(0), " ")) 
        WScript.Echo text 
    End If 
Next 
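
For readers without Windows Script Host, the same two-pattern idea can be sketched in Python. This is only an illustration, not part of the original answer: the function name extract_section_text and the sample string are made up for the example, and it shares the regex approach's assumption that the opening tag is written exactly as <section id="MySection">.

```python
import re

# The same two patterns as the VBScript above, written for Python's re module.
SECTION_RE = re.compile(r'<section id="MySection">([\s\S]*?)</section>', re.IGNORECASE)
FILLER_RE = re.compile(r'(?:<br>|\s)+', re.IGNORECASE)


def extract_section_text(html):
    """Return the section text on one line, or None if the section is missing."""
    match = SECTION_RE.search(html)
    if match is None:
        return None
    # Collapse every run of <br> tags and whitespace into a single space.
    return FILLER_RE.sub(" ", match.group(1)).strip()


# Hypothetical sample input, copied from the question.
sample = """<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>"""

print(extract_section_text(sample))
```

Like the VBScript, this only strips <br> tags; any other markup inside the section would be left in the output.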
1

Here is a Perl one-liner solution that uses the HTML parser from the Mojolicious framework:

perl -MMojo::DOM -E ' 
    say Mojo::DOM->new(do { undef $/; <> })->at(q|#MySection|)->text 
' index.html 

Assuming index.html has the following content:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
</head> 
<body id="portada"> 
<section id="MySection"> 
Some text here 
another line here <br> 
last line of text. 
</section> 
</body> 
</html> 

It yields:

Some text here another line here last line of text. 
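
If Perl or Mojolicious is not available, a parser-based approach can also be sketched with Python's standard library html.parser module. This is a rough equivalent under stated assumptions, not a port of Mojo::DOM: the class SectionText and the function section_text are hypothetical names, and the element is located by its id attribute alone.

```python
from html.parser import HTMLParser


class SectionText(HTMLParser):
    """Collect the text found inside the element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.tag = None    # tag name of the matched element
        self.depth = 0     # greater than 0 while inside the matched element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            if dict(attrs).get("id") == self.target_id:
                self.tag = tag
                self.depth = 1
        elif tag == self.tag:
            # Nested element with the same tag name: track it so the
            # matching end tag does not close the target too early.
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def section_text(html, target_id="MySection"):
    parser = SectionText(target_id)
    parser.feed(html)
    parser.close()
    # Split on any whitespace and rejoin so the result sits on one line.
    return " ".join("".join(parser.chunks).split())


# Hypothetical sample input, shortened from the index.html above.
sample = """<body id="portada">
<section id="MySection">
Some text here
another line here <br>
last line of text.
</section>
</body>"""

print(section_text(sample))
```

Because the parser tracks tags rather than matching a literal string, this sketch keeps working if the section tag carries extra attributes or contains markup other than <br>.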
+0

+1 for suggesting a proper parser and an overall elegant solution.