How to delete everything outside paragraph tags in Sublime Text

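This may seem odd, since the text is written for a system that automatically uploads content online, but here goes: how do I delete everything outside paragraph tags in Sublime Text?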

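I write stories (or whatever else) for upload in Sublime Text. From the Sublime Text file I create a Word 2010 .htm file (export to a plain text file, run a command-line batch in Word, then reopen the newly generated .htm in Sublime). The export.htm file is a complete HTML page, when all I need are the body entries wrapped in <p> tags. For example, from this export.htm: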

<html> 

<head> 
<meta http-equiv=Content-Type content="text/html; charset=windows-1252"> 
<meta name=Generator content="Microsoft Word 14 (filtered)"> 
<style> 
<!-- 
/* Font Definitions */ 
@font-face 
    {font-family:"Cambria Math"; 
    panose-1:2 4 5 3 5 4 6 3 2 4;} 
@font-face 
    {font-family:Calibri; 
    panose-1:2 15 5 2 2 2 4 3 2 4;} 
@font-face 
    {font-family:"Trebuchet MS"; 
    panose-1:2 11 6 3 2 2 2 2 2 4;} 
/* Style Definitions */ 
p.MsoNormal, li.MsoNormal, div.MsoNormal 
    {margin-top:0in; 
    margin-right:0in; 
    margin-bottom:10.0pt; 
    margin-left:0in; 
    line-height:115%; 
    font-size:11.0pt; 
    font-family:"Calibri","sans-serif";} 
.MsoChpDefault 
    {font-family:"Calibri","sans-serif";} 
.MsoPapDefault 
    {margin-bottom:10.0pt; 
    line-height:115%;} 
@page WordSection1 
    {size:8.5in 11.0in; 
    margin:1.0in 1.0in 1.0in 1.0in;} 
div.WordSection1 
    {page:WordSection1;} 
--> 
</style> 

</head> 

<body lang=EN-US> 

<div class=WordSection1> 

<p class=MsoNormal style='margin-top:12.0pt;text-indent:.5in'><font size=2 
face="Trebuchet MS"><span style='font-size:11.0pt;line-height:115%;font-family: 
"Trebuchet MS","sans-serif"'>This is a paragraph of story text to be uploaded 
to the online parsing system.</span></font></p> 

<p class=MsoNormal style='margin-top:12.0pt;text-indent:.5in'><font size=2 
face="Trebuchet MS"><span style='font-size:11.0pt;line-height:115%;font-family: 
"Trebuchet MS","sans-serif"'>This is a another paragraph of story text to be 
uploaded to the online parsing system.</span></font></p> 

</div> 

</body> 

</html>

The only part I want to keep is the following:

<p class=MsoNormal style='margin-top:12.0pt;text-indent:.5in'><font size=2 
face="Trebuchet MS"><span style='font-size:11.0pt;line-height:115%;font-family: 
"Trebuchet MS","sans-serif"'>This is a paragraph of story text to be uploaded 
to the online parsing system.</span></font></p> 

<p class=MsoNormal style='margin-top:12.0pt;text-indent:.5in'><font size=2 
face="Trebuchet MS"><span style='font-size:11.0pt;line-height:115%;font-family: 
"Trebuchet MS","sans-serif"'>This is a another paragraph of story text to be 
uploaded to the online parsing system.</span></font></p>

Once I have just this part of the file, I can run one more automated action (joining lines) and the file is ready to send to the online parser.

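The parser requires certain HTML text formatting but accepts only the contents of the page body (the rest of the page is generated automatically by the submission system). That means exporting HTML from a word processor, but every word processor I know of emits HTML hard-wrapped at a maximum line width. The parser sees the line breaks in the file (which HTML itself ignores) and adds <br> tags, which is why I need to run my Sublime script to join the lines in the exported file. To do that, though, I first need to clean up the export so that only the required lines (the content paragraphs) remain; otherwise the generic HTML gets merged into the single line that is uploaded to the parser.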
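To make the two steps described above concrete (keep only the <p> blocks, then join each one onto a single line), here is a minimal standalone sketch in Python. It is not the Sublime script mentioned in the question. The function names `extract_paragraphs` and `convert_file` and the output file name are hypothetical, the regular expression assumes Word never nests one <p> inside another, and writing one paragraph per output line is an assumption about what the parser tolerates.

```python
import re

# Matches one <p ...>...</p> block, including any line breaks inside it.
P_BLOCK = re.compile(r"<p\b[^>]*>.*?</p>", re.DOTALL | re.IGNORECASE)


def extract_paragraphs(html):
    """Return every <p> block found in html, each collapsed onto one line."""
    blocks = P_BLOCK.findall(html)
    # Collapse each run of whitespace (newlines included) into a single space.
    return [re.sub(r"\s+", " ", block).strip() for block in blocks]


def convert_file(src, dst, encoding="windows-1252"):
    """Read src, keep only its paragraphs, write them to dst one per line.

    The default encoding follows the charset declared in the sample export.
    """
    with open(src, encoding=encoding) as f:
        paragraphs = extract_paragraphs(f.read())
    with open(dst, "w", encoding=encoding) as f:
        f.write("\n".join(paragraphs) + "\n")
```

Calling `convert_file("export.htm", "body.txt")` would then leave only the paragraph markup in `body.txt`. If the parser also turns the line break between two paragraphs into a tag, the `"\n"` joiner can be swapped for an empty string.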

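I realize the best solution here might be to change the parser so that it ignores the unused junk in the file, but it is controlled by an uncompromising third party (it is a creative story hosting site). In any case, that is beside the point. I can handle this on my end; I just need to strip the non-paragraph parts out of the file.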

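I have found ways to manually select one tag and then grab its entire contents, but grabbing every tag of a given type, or grabbing the opposite (which is what I am asking for here: everything except the desired tags), is beyond me. I have searched Google high and low, as well as here on Stack Overflow, and come up dry.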
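The "opposite" selection described above can be computed as the gaps between successive <p> matches. The sketch below illustrates that idea in plain Python rather than inside Sublime Text. The names `outside_spans` and `delete_outside` are hypothetical, and the pattern again assumes <p> blocks are not nested.

```python
import re

# One <p ...>...</p> block, allowed to span several lines.
P_BLOCK = re.compile(r"<p\b[^>]*>.*?</p>", re.DOTALL)


def outside_spans(text):
    """Return (start, end) spans of text that lie outside every <p> block."""
    spans = []
    cursor = 0
    for match in P_BLOCK.finditer(text):
        if match.start() > cursor:
            spans.append((cursor, match.start()))
        cursor = match.end()
    if cursor < len(text):
        spans.append((cursor, len(text)))
    return spans


def delete_outside(text):
    """Remove every outside span, working backwards so offsets stay valid."""
    for start, end in reversed(outside_spans(text)):
        text = text[:start] + text[end:]
    return text
```

Note that the blank lines between paragraphs also count as "outside", so `delete_outside` leaves the paragraphs directly adjacent to one another.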

Any help is appreciated, folks.

Source

2014-01-27 user3238802

If all you are doing is wrapping some text in HTML tags, why not write a program to do that? – Blender

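Since you already have a way to grab a single tag and its contents, have you tried doing the same thing while making use of ST's multiple cursors? – skuroda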
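One way to act on every paragraph at once inside Sublime Text is a small plugin built on `view.find_all`, which returns one region per match (the same regions a multiple-cursor selection would cover). The sketch below is an assumption about how that could look, not something posted in this thread: the command name `KeepParagraphsCommand` and the helper `join_lines` are hypothetical, and the `ImportError` fallback exists only so the helper can be exercised outside the editor.

```python
import re

try:
    import sublime
    import sublime_plugin
    _TextCommand = sublime_plugin.TextCommand
except ImportError:
    # Outside Sublime Text the API is unavailable; fall back to a plain base
    # class so the helper below can still be imported and tested.
    sublime = None
    _TextCommand = object

# [\s\S] matches any character including newlines, so no dot-all flag is needed.
P_BLOCK = r"<p\b[^>]*>[\s\S]*?</p>"


def join_lines(block):
    """Collapse all whitespace runs in one <p> block into single spaces."""
    return re.sub(r"\s+", " ", block).strip()


class KeepParagraphsCommand(_TextCommand):
    """Replace the whole buffer with its <p> blocks, one per line."""

    def run(self, edit):
        view = self.view
        regions = view.find_all(P_BLOCK)
        if not regions:
            return  # nothing matched; leave the buffer untouched
        kept = "\n".join(join_lines(view.substr(r)) for r in regions)
        view.replace(edit, sublime.Region(0, view.size()), kept + "\n")
```

Saved as a .py file in the Packages/User folder, the command should be runnable from the Sublime console with `view.run_command("keep_paragraphs")`.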

Good god, Word is still using <font> tags? – MattDMo
