2017-03-23

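I'm researching how to create documentation from finished Informatica PowerCenter mappings, and the sheer number of options makes it hard to know where to start. The approach used here is to open every box in the mapping as many times as needed, copy the information into a Word document, and format it, thousands of times a week. How can I parse the XML to extract the fields for the documentation?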

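I now have a second idea that I think could be a solution: export the mapping to XML, then use a script (or a program; I've tried Excel several times, without success) to parse the XML into something easier to copy and paste, and make my life better that way.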

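The XML looks like this (cut down to as few lines as possible for the example; it may not be 100% valid, but neither is the original XML. "ValueAssigned" is a placeholder I made up to stand for whatever value is there, not the literal string that appears each time):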

Type 1 Document: 

    <!DOCTYPE POWERMART SYSTEM "ValueAssigned"> 
<POWERMART CREATION_DATE="ValueAssigned" REPOSITORY_VERSION="ValueAssigned"> 
<REPOSITORY NAME="ValueAssigned" VERSION="ValueAssigned" CODEPAGE="ValueAssigned" DATABASETYPE="ValueAssigned"> 
<FOLDER NAME="ValueAssigned" GROUP="" OWNER="ValueAssigned" SHARED="ValueAssigned" DESCRIPTION="ValueAssigned" PERMISSIONS="ValueAssigned" UUID="ValueAssigned"> 
    <CONFIG DESCRIPTION ="ValueAssigned" ISDEFAULT ="YES" NAME ="ValueAssigned" VERSIONNUMBER ="ValueAssigned"> 
     <ATTRIBUTE NAME ="Field1" VALUE =""/> 
     <ATTRIBUTE NAME ="Field2" VALUE ="NO"/> 
    <WORKFLOW DESCRIPTION ="" ISENABLED ="ValueAssigned" ISRUNNABLESERVICE ="ValueAssigned" ISSERVICE ="ValueAssigned" ISVALID ="ValueAssigned" NAME ="ValueAssigned" REUSABLE_SCHEDULER ="ValueAssigned" SCHEDULERNAME ="ValueAssigned" SERVERNAME ="ValueAssigned" SERVER_DOMAINNAME ="ValueAssigned" SUSPEND_ON_ERROR ="ValueAssigned" TASKS_MUST_RUN_ON_SERVER ="ValueAssigned" VERSIONNUMBER ="ValueAssigned"> 
     <SCHEDULER DESCRIPTION ="" NAME ="SchedulerName" REUSABLE ="ValueAssigned" VERSIONNUMBER ="ValueAssigned"> 
      <SCHEDULEINFO SCHEDULETYPE ="ONDEMAND"/> 
     </SCHEDULER> 
     <TASK DESCRIPTION ="ValueAssigned" NAME ="Start" REUSABLE ="NO" TYPE ="Start" VERSIONNUMBER ="1"/> 
     <SESSION DESCRIPTION ="ValueAssigned" ISVALID ="ValueAssigned" MAPPINGNAME ="ValueAssigned" NAME ="ValueAssigned" REUSABLE ="ValueAssigned" SORTORDER ="ValueAssigned" VERSIONNUMBER ="ValueAssigned"> 
      <SESSTRANSFORMATIONINST ISREPARTITIONPOINT ="ValueAssigned" PARTITIONTYPE ="ValueAssigned" PIPELINE ="ValueAssigned" SINSTANCENAME ="ValueAssigned" STAGE ="ValueAssigned" TRANSFORMATIONNAME ="ValueAssigned" TRANSFORMATIONTYPE ="Target Definition"> 
       <ATTRIBUTE NAME ="ValueAssigned" VALUE ="ValueAssigned"/> 
       <ATTRIBUTE NAME ="ValueAssigned" VALUE ="ValueAssigned"/> 
      </SESSTRANSFORMATIONINST> 

So, if we focus on any one tag, such as

<CONFIG DESCRIPTION ="Default session configuration object" ISDEFAULT ="YES" NAME ="default_session_config" VERSIONNUMBER ="29"> 
     <ATTRIBUTE NAME ="Field1" VALUE =""/> 
     <ATTRIBUTE NAME ="Field2" VALUE ="NO"/> 

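We can see there is a tag, CONFIG with its description, followed by several attribute names. One option I thought of is a bit naive: if I converted this into columns, using Excel or something similar, I could see a row with the root tag, then the different categories, down to the point where I can say: OK, this is the tag, this is a column with all its values, I copy it into my Word document and call it a day. As it stands, the XML has anywhere between 300 and 900 lines, and because of the quotes and the constant tags the columns are not aligned (the lines don't have the same length), so it is neither easy to read nor easy to work with (which is why I can't use column mode)...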

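I'm including the other type of document in case it gives a clearer idea of how different the information can be, and of why I don't jump straight into writing my own parser: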

<?xml version="ValueAssigned" encoding="ValueAssigned"?> 
<!DOCTYPE POWERMART SYSTEM "ValueAssigned"> 
<POWERMART CREATION_DATE="ValueAssigned" REPOSITORY_VERSION="ValueAssigned"> 
<REPOSITORY NAME="ValueAssigned" VERSION="ValueAssigned" CODEPAGE="ValueAssigned" DATABASETYPE="ValueAssigned"> 
<FOLDER NAME="ValueAssigned" GROUP="ValueAssigned" OWNER="ValueAssigned" SHARED="ValueAssigned" DESCRIPTION="ValueAssigned" PERMISSIONS="ValueAssigned" UUID="ValueAssigned"> 
    <SOURCE BUSINESSNAME ="ValueAssigned" DATABASETYPE ="ValueAssigned" DBDNAME ="ValueAssigned" DESCRIPTION ="ValueAssigned" NAME ="ValueAssigned" OBJECTVERSION ="ValueAssigned" OWNERNAME ="ValueAssigned" VERSIONNUMBER ="ValueAssigned"> 
     <SOURCEFIELD BUSINESSNAME ="ValueAssigned" DATATYPE ="ValueAssigned" DESCRIPTION ="ValueAssigned" FIELDNUMBER ="ValueAssigned" FIELDPROPERTY ="ValueAssigned" FIELDTYPE ="ValueAssigned" HIDDEN ="ValueAssigned" KEYTYPE ="ValueAssigned" LENGTH ="ValueAssigned" LEVEL ="ValueAssigned" NAME ="ValueAssigned" NULLABLE ="ValueAssigned" OCCURS ="ValueAssigned" OFFSET ="ValueAssigned" PHYSICALLENGTH ="ValueAssigned" PHYSICALOFFSET ="ValueAssigned" PICTURETEXT ="ValueAssigned" PRECISION ="ValueAssigned" SCALE ="ValueAssigned" USAGE_FLAGS ="ValueAssigned"/> 
<FOLDER NAME="ValueAssigned" GROUP="ValueAssigned" OWNER="ValueAssigned" SHARED="ValueAssigned" DESCRIPTION="ValueAssigned" PERMISSIONS="ValueAssigned" UUID="ValueAssigned"> 
    <SOURCE BUSINESSNAME ="ValueAssigned" CRCVALUE ="ValueAssigned" DATABASETYPE ="ValueAssigned" DBDNAME ="ValueAssigned" DESCRIPTION ="ValueAssigned" IBMCOMP ="ValueAssigned" NAME ="ValueAssigned" OBJECTVERSION ="ValueAssigned" OWNERNAME ="ValueAssigned" VERSIONNUMBER ="ValueAssigned"> 
     <FLATFILE CODEPAGE ="ValueAssigned" CONSECDELIMITERSASONE ="ValueAssigned" DELIMITED ="ValueAssigned" DELIMITERS ="ValueAssigned" ESCAPE_CHARACTER ="ValueAssigned" KEEPESCAPECHAR ="ValueAssigned" LINESEQUENTIAL ="ValueAssigned" MULTIDELIMITERSASAND ="ValueAssigned" NULLCHARTYPE ="ValueAssigned" NULL_CHARACTER ="ValueAssigned" PADBYTES ="ValueAssigned" QUOTE_CHARACTER ="ValueAssigned" REPEATABLE ="ValueAssigned" ROWDELIMITER ="ValueAssigned" SHIFTSENSITIVEDATA ="ValueAssigned" SKIPROWS ="ValueAssigned" STRIPTRAILINGBLANKS ="ValueAssigned"/> 
     <SOURCEFIELD BUSINESSNAME ="ValueAssigned" DESCRIPTION ="ValueAssigned" FIELDNUMBER ="ValueAssigned" FIELDPROPERTY ="ValueAssigned" FIELDTYPE ="ValueAssigned" HIDDEN ="ValueAssigned" LENGTH ="ValueAssigned" LEVEL ="ValueAssigned" NAME ="ValueAssigned" OCCURS ="ValueAssigned" OFFSET ="ValueAssigned" PHYSICALLENGTH ="ValueAssigned" PHYSICALOFFSET ="ValueAssigned"> 
+1

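It's not very clear what you want to achieve. What specific data are you interested in extracting from the XML? What do you want the information to look like in your Word document?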

+0

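Sorry about being unclear :S. I'll try to explain as if it were already solved. I want to parse as much of the XML as possible, and in such a clean way that if I want to copy the Attributes in a Session, I go to where they are and copy all of them (that's why I thought of Excel). So I'd like a process that makes it easy to copy very different data out of XML that looks like what I wrote above, keeping in mind that this XML is not always the same and does not always have the same tags.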
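To make the "copy the Attributes in a Session" goal concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the export keeps `ATTRIBUTE` elements nested under `SESSION`, as in the sample above; the names `wf_example`, `s_example`, `m_example`, and `tgt_example` and the helper `session_attributes` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down export; real files contain many more elements.
SAMPLE = """<WORKFLOW NAME="wf_example">
    <SESSION NAME="s_example" MAPPINGNAME="m_example">
        <SESSTRANSFORMATIONINST SINSTANCENAME="tgt_example" TRANSFORMATIONTYPE="Target Definition">
            <ATTRIBUTE NAME="Field1" VALUE=""/>
            <ATTRIBUTE NAME="Field2" VALUE="NO"/>
        </SESSTRANSFORMATIONINST>
    </SESSION>
</WORKFLOW>"""


def session_attributes(root):
    """Map each SESSION name to the (NAME, VALUE) pairs of every ATTRIBUTE below it."""
    result = {}
    for session in root.iter("SESSION"):
        # iter() walks all descendants, so nesting depth does not matter.
        pairs = [(a.get("NAME"), a.get("VALUE")) for a in session.iter("ATTRIBUTE")]
        result[session.get("NAME")] = pairs
    return result


attrs = session_attributes(ET.fromstring(SAMPLE))
for session_name, pairs in attrs.items():
    print(session_name)
    for name, value in pairs:
        # Tab-separated so the block pastes into Word or Excel as two columns.
        print(f"  {name}\t{value}")
```

Because the lookup goes by tag name rather than by position, it should tolerate exports whose surrounding tags differ, as long as the `SESSION` and `ATTRIBUTE` names stay the same.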

Answer

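I've done something similar in the past: converting an XML file into a flat TXT file. One of your problems is that XML is a nested format, something like a list structure.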

One thing that works is to flatten it, so that this:

<CONFIG DESCRIPTION ="Default session configuration object" ISDEFAULT ="YES" NAME ="default_session_config" VERSIONNUMBER ="29"> 
     <ATTRIBUTE NAME ="Field1" VALUE =""/> 
     <ATTRIBUTE NAME ="Field2" VALUE ="NO"/> 
</CONFIG> 

becomes

CONFIG.DESCRIPTION = "Default session configuration object" 
CONFIG.ISDEFAULT = "YES" 
CONFIG.NAME = "default_session_config" 
CONFIG.VERSIONNUMBER = "29" 
CONFIG.ATTRIBUTE[1].NAME = "Field1" 
CONFIG.ATTRIBUTE[1].VALUE = "" 
CONFIG.ATTRIBUTE[2].NAME = "Field2" 
CONFIG.ATTRIBUTE[2].VALUE = "NO" 

Basically, each line has the form XPath = value. You can implement this with an XML library, or with XSLT and xsl templates.
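As an illustration of the XML-library route, here is a minimal Python sketch built on the standard `xml.etree.ElementTree` module. It is not the code I used back then; the helper name `flatten` is made up, and it assumes that all the interesting data lives in attributes (element text is ignored) and that repeated sibling tags should get a 1-based index.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """<CONFIG DESCRIPTION ="Default session configuration object" ISDEFAULT ="YES" NAME ="default_session_config" VERSIONNUMBER ="29">
    <ATTRIBUTE NAME ="Field1" VALUE =""/>
    <ATTRIBUTE NAME ="Field2" VALUE ="NO"/>
</CONFIG>"""


def flatten(elem, path=None):
    """Yield (path, value) for every attribute of elem and its descendants."""
    path = path or elem.tag
    for name, value in elem.attrib.items():
        yield f"{path}.{name}", value
    # Count sibling tags first so only repeated tags receive an [n] index.
    totals = Counter(child.tag for child in elem)
    seen = Counter()
    for child in elem:
        seen[child.tag] += 1
        suffix = f"[{seen[child.tag]}]" if totals[child.tag] > 1 else ""
        yield from flatten(child, f"{path}.{child.tag}{suffix}")


lines = [f'{path} = "{value}"' for path, value in flatten(ET.fromstring(SAMPLE))]
print("\n".join(lines))
```

For a real export you would call `ET.parse(filename).getroot()` instead of `ET.fromstring(SAMPLE)` and write the lines to a text file.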

+0

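Thanks for the info! To give me some more ideas, would you mind saying why you converted it to a flat file? What was the purpose behind it? In my case it would be useful because I could then import the file into Excel with = as the delimiter, which would already be a step forward, so it's another approach worth considering.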
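For the Excel route, an alternative to splitting on = is to write a CSV directly, which avoids trouble when a value itself contains an equals sign. The following is only a sketch under that assumption: the sample XML, the names `s_load`, `m_load`, and `tgt`, the helper `attribute_rows`, and the three-column layout are all invented for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, cut-down export used only to demonstrate the output shape.
SAMPLE = """<SESSION NAME="s_load" MAPPINGNAME="m_load">
    <SESSTRANSFORMATIONINST SINSTANCENAME="tgt" TRANSFORMATIONTYPE="Target Definition">
        <ATTRIBUTE NAME="Field1" VALUE=""/>
        <ATTRIBUTE NAME="Field2" VALUE="NO"/>
    </SESSTRANSFORMATIONINST>
</SESSION>"""


def attribute_rows(root):
    """Yield one (tag, attribute, value) row per XML attribute, in document order."""
    for elem in root.iter():
        for name, value in elem.attrib.items():
            yield elem.tag, name, value


# An in-memory buffer keeps the sketch self-contained; to produce a file,
# use open("mapping.csv", "w", newline="") instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["tag", "attribute", "value"])
writer.writerows(attribute_rows(ET.fromstring(SAMPLE)))
csv_text = buffer.getvalue()
print(csv_text)
```

Opened in Excel, the first column can then be filtered by tag, which gives the "one tag, one column of values" view described in the question.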

+1

A flat file is more human-readable than an XML file, and it is easier to grep when working in a CLI environment.