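Trying to parse XML in Perl, but a long data string is truncated
I tried parsing an XML file with both XML::Simple and XML::Twig, and got the same result from each. The other fields in the file work fine.
The file in question can be fetched with: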
curl -s "http://apps.nlm.nih.gov/medlineplus/services/mpconnect_service.cfm?mainSearchCriteria.v.cs=2.16.840.1.113883.6.103&mainSearchCriteria.v.c=130"
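Is this a problem with the parsers or with the file? Both parsers produce the same output. The string holds HTML tags stored inside the XML.
Input field (in the XML tag named 'summary'), followed by the output after XML parsing: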
<summary type="html"><p>Toxoplasmosis is a disease caused by the parasite <em>Toxoplasma gondii</em>. More than 60 million people in the U.S. have the parasite. Most of them don't get sick. But the parasite causes serious problems for some people. These include people with weak immune systems and babies whose mothers become infected for the first time during pregnancy. Problems can include damage to the brain, eyes and other organs.</p>
^I
<p>You can get toxoplasmosis from </p>
<ul>
<li>^IWaste from an infected cat</li>
<li>^IEating contaminated meat that is raw or not well cooked </li>
<li>^IUsing utensils or cutting boards after they've had contact with raw meat </li>
<li>^IDrinking infected water </li>
<li>^IReceiving an infected organ transplant or blood transfusion</li>
</ul>
<p>Most people with toxoplasmosis don't need treatment. There are drugs to treat it for pregnant women and people with weak immune systems. </p>

<p class="NLMattribution">Centers for Disease Control and Prevention</p></summary>
Output:
<p>Toxoplasmosis is a disease caused by the parasite <em>Toxoplasma gondii</em>. More than 60 million people in the U.S. have the parasite. Most of them don't get sick. But the parasite causes serious problems for some people. These include people with weak im<p class="NLMattribution">Centers for Disease Control and Prevention</p>to treat it for pregnant women and people with weak immune systems. </p>her organs.</p>
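Solution: the XML file contains carriage returns ("&#13;"), which I believed were causing problems for the parser. After downloading the XML file, I removed the carriage returns as follows:
sed -i 's/&#13;//g' *.xml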
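The parser now works as expected.
Update: the carriage returns do not affect the parser. Only the displayed output was truncated and jumbled. Removing them did solve my problem.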
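To illustrate the update, here is a minimal shell sketch of the display effect. The sample string is made up and merely stands in for a parsed value that still contains a carriage return. It is not taken from the file above, and it assumes standard cat and tr utilities are available.

```shell
# Hypothetical sample value containing one carriage return (CR).
sample=$(printf 'first part of a long line\rSECOND')

# On a terminal, the CR sends the cursor back to column 0, so the text
# after it overwrites the start of the line and the value looks truncated.
printf '%s\n' "$sample"

# cat -v makes the CR visible as ^M, showing that no data was lost.
printf '%s\n' "$sample" | cat -v

# Stripping CRs only when displaying leaves the XML files untouched.
clean=$(printf '%s' "$sample" | tr -d '\r')
printf '%s\n' "$clean"
```

Redirecting the output to a file and opening it in an editor is another way to check that the full text is still there.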
If you know the solution, please close the question... – pavel 2011-06-06 15:10:10
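Actually, the &#13; characters would not cause a problem for the parser. I suspect they cause the problem when you print the result, especially if you are working on a Unix machine. If you write the result to a file, you should be able to see the whole text, including some ^M characters, which are what garble the text when it is printed. It's hard to say without seeing your code, though. – mirod 2011-06-07 06:19:03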
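Yes, that seems to be correct, mirod. The printed output was wrong, with some parts missing and others out of place. I have updated the post with this information. – BackstreetStruts 2011-06-07 12:48:03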