Parsing custom tags in an HTML file with PowerShell

I have an HTML file, shown below, that uses a custom tag, "TBD:comment". I want to extract the content of this tag.

<HTML> 
<BODY> 
<h1> This is a heading </h1> 
<P id='para1'>First Paragraph with some Random text</P> 
<P>Second paragraph with more random text</P> 
<A href="http://Geekeefy.wordpress.com">Cool Powershell blog</A> 
<TBD:comment name="Title"><h3>Katamma katamma loge kathamma</h3> 
</TBD:comment> 
<TBD:comment name="content"><h3>Lorem Ipsum is simply dummy text of the 
printing and typesetting industry. Lorem Ipsum has been the industry's 
standard dummy text ever since the 1500s, when an unk</h3> </TBD:comment> 
</BODY> 
</HTML>

The code below does not seem to work with the custom tag.

$html = Get-Content "C:\Users\sahuBaba\Desktop\ht.html" -Raw 
$doc = New-Object -com "HTMLFILE" 
$doc.IHTMLDocument2_write($html) 

$text = $doc.body.getElementsByTagName("TBD:comment") 
"Inner Text: " + $text[1].innerText

There is no output. Can anyone help? Thanks in advance.

Source

2017-07-13 Sahu Baba

Try a regular expression:

$regex = New-Object Text.RegularExpressions.Regex "<TBD:comment.+?(>.+?)<\/TBD:comment>", ('singleline', 'multiline') 
$content = "<your html>" 
foreach($m in $regex.Matches($content)) { 
    # remove the leading '>' that the capture group includes
    $m.Groups[1].Value.Substring(1) 
}
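
As a minimal sketch of how this approach might be run end to end, the snippet below applies a regex to a shortened copy of the question's HTML. It is a variant rather than the exact pattern above: it assumes the opening tag contains no '>' inside its attributes, so `[^>]*>` can consume the rest of the opening tag and the `Substring(1)` step is not needed. The sample string and the variable names `$pattern`, `$options`, and `$bodies` are illustrative.

```powershell
# Sample input, shortened from the question's HTML (illustrative only).
$content = @'
<BODY>
<TBD:comment name="Title"><h3>Katamma katamma loge kathamma</h3>
</TBD:comment>
<TBD:comment name="content"><h3>Lorem Ipsum is simply dummy text</h3> </TBD:comment>
</BODY>
'@

# Singleline lets '.' match newlines, so a tag body may span several lines.
$options = [System.Text.RegularExpressions.RegexOptions]::Singleline

# Assumption: no '>' appears inside the opening tag's attributes.
$pattern = '<TBD:comment[^>]*>(.+?)</TBD:comment>'

# Group 1 holds the raw markup between the opening and closing tag.
$bodies = foreach ($m in [regex]::Matches($content, $pattern, $options)) {
    $m.Groups[1].Value.Trim()
}
$bodies
```

The result still contains the inner `<h3>` markup. A regex like this is only a reasonable fit for small, predictable files; it will not cope with nested `TBD:comment` tags.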

Source

2017-07-14 09:45:58 hkarask

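Thanks, it works. One small follow-up question: what if I want to get only the tag with a specific name? It would be very helpful if you could help me with this. Thanks in advance.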

Try this: $regex = New-Object Text.RegularExpressions.Regex '….+?)<\/TBD:comment>', ('singleline', 'multiline'); $content = "<your html>"; foreach ($m in $regex.Matches($content)) { $m.Groups[2].Value.Substring(1) } – hkarask
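
For the follow-up about selecting a tag by its name, one possible sketch is to build the name into the pattern. The helper `Get-TbdComment` below is hypothetical and not from this thread. It assumes that `name` is the first attribute of the tag and that its value is double-quoted, as in the question's sample.

```powershell
# Sample input, shortened from the question's HTML (illustrative only).
$content = @'
<BODY>
<TBD:comment name="Title"><h3>Katamma katamma loge kathamma</h3>
</TBD:comment>
<TBD:comment name="content"><h3>Lorem Ipsum is simply dummy text</h3> </TBD:comment>
</BODY>
'@

# Hypothetical helper: returns the body of every TBD:comment tag whose
# name attribute equals $Name (compared case-insensitively).
function Get-TbdComment {
    param(
        [Parameter(Mandatory)] [string] $Html,
        [Parameter(Mandatory)] [string] $Name
    )
    # Escape the name so characters such as '.' are matched literally.
    $pattern = '<TBD:comment\s+name="' + [regex]::Escape($Name) + '"[^>]*>(.+?)</TBD:comment>'
    # Singleline: '.' also matches newlines. IgnoreCase: 'title' matches 'Title'.
    $options = [System.Text.RegularExpressions.RegexOptions]'Singleline, IgnoreCase'
    foreach ($m in [regex]::Matches($Html, $pattern, $options)) {
        $m.Groups[1].Value.Trim()
    }
}

$title = Get-TbdComment -Html $content -Name 'Title'
$title
```

Calling `Get-TbdComment -Html $content -Name 'content'` would return the second tag's body instead. If the attribute order or quoting varies in the real file, the pattern would need to be loosened accordingly.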
