2017-01-24 81 views
2

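Script in Unix to remove an XML tag and its content from a file

Basically, I need to remove the parties element (and everything inside it) from a set of individual XML files named number.xml. I tried the following, but it doesn't quite produce what I need: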

cat test.xml | sed "s;<parties>;\do_opentag ;" | sed "s;</parties>;\do_closetag ;" | awk 'BEGIN { doPrint = 1; } /do_opentag/ { doPrint = 0; print $0; } /do_closetag/ { doPrint = 1; } { if (doPrint) print $0; }' | grep -v 'do_opentag\|do_closetag' 

<?xml version="1.0" encoding="UTF-8"?> 
<patent-document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" pid="58326519" doc-generation-date="2016-10-11"> 
    <bibliographic-data> 
    <application-reference> 
     <pan>46422</pan> 
    </application-reference> 
    <publication-reference> 
     <publication-office>KR</publication-office> 
     <patent-publication-date> 
     <year>2016</year> 
     <month>10</month> 
     <day>11</day> 
     </patent-publication-date> 
    </publication-reference> 
    <parties> 
     <applicants> 
     <applicant sequence="1"> 
      <name lang="EN"></name> 
      <address> 
      <location-of-work>KR</location-of-work>M 
      </address> 
     </applicant> 
     </applicants> 
    </parties> 
    </bibliographic-data> 
    <vendor>Any</vendor> 
    <document-translation-date>2016-11-24</document-translation-date>M 
    <invention-title lang="EN">Cell preservation container for liquid-based cell inspection</invention-title> 
    <abstract lang="EN">The present invention relates to a liquid for discharging liquid containing cells and cell may be a sampling which is simply eminent generated in </abstract> 
    <comment lang="EN"></comment> 
</patent-document> 

Answers

2

Parsing XML requires an XML parser. xmlstarlet is fairly simple to use. To delete the parties node:

xmlstarlet ed -P -d '//parties' file.xml 

which produces

<?xml version="1.0" encoding="UTF-8"?> 
<patent-document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" pid="58326519" doc-generation-date="2016-10-11"> 
    <bibliographic-data> 
    <application-reference> 
     <pan>46422</pan> 
    </application-reference> 
    <publication-reference> 
     <publication-office>KR</publication-office> 
     <patent-publication-date> 
     <year>2016</year> 
     <month>10</month> 
     <day>11</day> 
     </patent-publication-date> 
    </publication-reference> 

    </bibliographic-data> 
    <vendor>Any</vendor> 
    <document-translation-date>2016-11-24</document-translation-date>M 
    <invention-title lang="EN">Cell preservation container for liquid-based cell inspection</invention-title> 
    <abstract lang="EN">The present invention relates to a liquid for discharging liquid containing cells and cell may be a sampling which is simply eminent generated in </abstract> 
    <comment lang="EN"/> 
</patent-document> 
1

sed -e '/<parties>/,/<\/parties>/d' test.xml

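sed lets you apply a command to a range of lines by giving two patterns separated by a comma; the range includes both matching lines and everything between them. Here the command is d (delete lines), applied from /<parties>/ to /<\/parties>/.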
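As a small illustration of the range address, the sketch below feeds a made-up five-line input to the same sed command. The sample input and the function name `strip_parties` are assumptions for the demo, not part of the question.

```shell
#!/bin/sh
# strip_parties is a hypothetical helper name; it filters XML read from stdin.
strip_parties() {
    # Delete every line from one matching <parties> through the next
    # line matching </parties>, both matching lines included.
    sed -e '/<parties>/,/<\/parties>/d'
}

# Made-up sample input with one tag per line.
printf '%s\n' '<a>' '<parties>' '<x/>' '</parties>' '</a>' | strip_parties
```

Only the `<a>` and `</a>` lines should come out the other end.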

This depends on how your XML is formatted: the matching lines must not contain anything else that you need to keep.
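One way to check that precondition before deleting anything is sketched below. The function name `parties_tags_alone` is made up for illustration. It succeeds only when every line mentioning `<parties>` or `</parties>` holds nothing but that tag and whitespace. The check also catches the case where both tags share one line: sed looks for the closing pattern only from the following line onward, so the deletion could run much further than intended.

```shell
#!/bin/sh
# parties_tags_alone FILE  (hypothetical helper name)
# Exit status 0 if no line mixes a parties tag with other content.
parties_tags_alone() {
    # The first grep selects lines that mention either tag; the second
    # keeps only those that are NOT just the tag plus optional whitespace.
    # The pipeline succeeds when such a "mixed" line exists, so negate it.
    ! grep -E '</?parties>' "$1" |
        grep -Ev '^[[:space:]]*</?parties>[[:space:]]*$' >/dev/null
}
```

A possible use is `parties_tags_alone test.xml && sed -e '/<parties>/,/<\/parties>/d' test.xml`, falling back to an XML-aware tool when the check fails.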

If you want to edit the file in place, add the -i flag to sed.
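For the set of number-named files from the question, an in-place run could look roughly like the sketch below. The function name `strip_parties_in_dir`, the directory argument, and the `[0-9]*.xml` glob are assumptions about how the files are laid out. The `-i.bak` form keeps a backup copy of each original next to the edited file.

```shell
#!/bin/sh
# strip_parties_in_dir DIR  (hypothetical helper name)
# Edits every file in DIR whose name starts with a digit and ends in .xml.
strip_parties_in_dir() {
    for f in "$1"/[0-9]*.xml; do
        # If the glob matched nothing, the literal pattern is left; skip it.
        [ -e "$f" ] || continue
        # -i.bak edits the file in place and saves the original as FILE.bak.
        sed -i.bak -e '/<parties>/,/<\/parties>/d' "$f"
    done
}
```

Calling `strip_parties_in_dir .` would process the current directory; once the results look right, the `.bak` files can be removed.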

+0

Thanks. Almost there. For some reason I get a message saying "missing newline at end of file test.xml", and the closing tag is being dropped. Is there any way to fix this? – Cinda

+0

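Since the last line has no terminating newline, sed never processes it. I've never run into this problem myself, but [the second answer here](http://unix.stackexchange.com/questions/31947/how-to-add-a-newline-to-the-end-of-a-file) seems reasonable: `echo >> test.xml; sed -e '/<parties>/,/<\/parties>/d' test.xml` – stevesliva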
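A plain `echo >> test.xml` adds a newline every time it runs, even when the file already ends with one. A slightly more careful variant is sketched below; the function name `ensure_trailing_newline` is made up for illustration, and it appends a newline only when the last byte is something else.

```shell
#!/bin/sh
# ensure_trailing_newline FILE  (hypothetical helper name)
ensure_trailing_newline() {
    # $(tail -c 1 FILE) is empty when the last byte is a newline, because
    # command substitution strips it, and also when the file is empty.
    if [ -n "$(tail -c 1 "$1")" ]; then
        echo >> "$1"
    fi
}
```

Running `ensure_trailing_newline test.xml` before the sed command leaves files that are already well terminated unchanged.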