Reading and parsing a large XML file in chunks in R

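I am trying to read and process a ~5.8GB .xml file from a Wikipedia dump using R. I don't have much RAM, so I would like to process it in chunks. (Currently, calling xml2::read_xml on it freezes my computer completely.)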

The file contains one xml element for each Wikipedia page, like this:

<page> 
    <title>AccessibleComputing</title> 
    <ns>0</ns> 
    <id>10</id> 
    <redirect title="Computer accessibility" /> 
    <revision> 
     <id>631144794</id> 
     <parentid>381202555</parentid> 
     <timestamp>2014-10-26T04:50:23Z</timestamp> 
     <contributor> 
     <username>Paine Ellsworth</username> 
     <id>9092818</id> 
     </contributor> 
     <comment>add [[WP:RCAT|rcat]]s</comment> 
     <model>wikitext</model> 
     <format>text/x-wiki</format> 
     <text xml:space="preserve">#REDIRECT [[Computer accessibility]] 

{{Redr|move|from CamelCase|up}}</text> 
     <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1> 
    </revision> 
</page>

A sample of the file can be found here.

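As I see it, the file could be read in chunks, for example one page element at a time, and each processed page element could be saved as one line of a .csv file.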

I would like to end up with a data.frame with the following columns:

id, title and text.

How can I read this .xml in chunks?

Source

2016-11-03 Daniel Falbel

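I'm not sure we can solve your problem. The sample you gave us is small, so I can't really reproduce your issue. Have you tried something like [this](http://stackoverflow.com/questions/21222113/how-to-read-first-1000-lines-of-csv-file-into-r) (jlhoward's answer)?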

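Imagine a '.xml' that has many, many elements like the one in the question. I can't just read it line by line, because that breaks the xml structure. I would like to read it element by element, but I don't know how to do that... Obviously I linked to a small sample, but you can download the full file here: https://dumps.wikimedia.org/ptwiki/20161101/ It is ptwiki-20161101-pages-articles.xml.bz2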

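This can be improved, but the main idea is here. You still need to work out how many lines to read in each call to readLines(), and a way to move on to each successive block of lines, but a solution for getting the chunks is here: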

# Read only the first 2000 lines of the dump
xml <- readLines("ptwiki-20161101-pages-articles.xml", n = 2000)

# Line numbers where a <page> element opens and closes
inicio <- grep("<page>", xml, fixed = TRUE)
fim <- grep("</page>", xml, fixed = TRUE)
if (length(inicio) > length(fim)) { # if you get more beginnings than ends
    inicio <- inicio[-length(inicio)] # drop the last, incomplete page
}

chunks <- vector("list", length(inicio))

for (i in seq_along(chunks)) {
    chunks[[i]] <- xml[inicio[i]:fim[i]]
}

# One string per page; "\n" keeps the line breaks inside <text>
chunks <- vapply(chunks, paste, character(1), collapse = "\n")
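The snippet above only looks at the first 2000 lines. As a rough sketch of the part that is still left open (walking through the whole file), one option is to keep a connection open and carry any unfinished page over to the next batch of lines. The names process_pages, n_lines and handle are made up for this illustration, and the sketch assumes that <page> and </page> each sit on their own line, as in the sample; the small temporary file only stands in for the real dump.

```r
# Sketch only: `process_pages`, `n_lines` and `handle` are hypothetical names.
# Assumes <page> and </page> each sit on their own line, as in the sample.
process_pages <- function(path, n_lines = 2000L, handle = print) {
  con <- file(path, open = "r")  # bzfile(path, open = "r") is an option for a .bz2
  on.exit(close(con))
  carry <- character(0)          # lines of a page cut off by the previous batch
  n_pages <- 0L
  repeat {
    new_lines <- readLines(con, n = n_lines, warn = FALSE)
    if (length(new_lines) == 0L) break   # end of file
    buf <- c(carry, new_lines)
    inicio <- grep("<page>", buf, fixed = TRUE)
    fim <- grep("</page>", buf, fixed = TRUE)
    # Every closed page pairs the i-th start with the i-th end
    for (i in seq_along(fim)) {
      handle(paste(buf[inicio[i]:fim[i]], collapse = "\n"))
      n_pages <- n_pages + 1L
    }
    # Keep an unfinished trailing page for the next batch
    carry <- if (length(inicio) > length(fim)) {
      buf[inicio[length(inicio)]:length(buf)]
    } else {
      character(0)
    }
  }
  invisible(n_pages)
}

# Tiny stand-in for the dump, read 4 lines at a time so pages get split
demo <- tempfile(fileext = ".xml")
writeLines(c(
  "<mediawiki>",
  "  <page>", "    <title>A</title>", "    <id>1</id>", "  </page>",
  "  <page>", "    <title>B</title>", "    <id>2</id>", "  </page>",
  "  <page>", "    <title>C</title>", "    <id>3</id>", "  </page>",
  "</mediawiki>"
), demo)

pages <- character(0)
n <- process_pages(demo, n_lines = 4L,
                   handle = function(p) pages <<- c(pages, p))
```

Only one batch of lines plus at most one unfinished page is held in memory at a time, so n_lines can be tuned to the available RAM.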

Parsing a single chunk works: I tried read_xml(chunks[1]) %>% xml_nodes("text") %>% xml_text() on the chunks vector built above, and it worked.
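To get from one parsed chunk to the id, title and text columns asked for in the question, something along these lines might do. It is a sketch, not a tested solution for the full dump: page_to_row and append_row are hypothetical helper names, the page string is a shortened copy of the sample from the question, and the CSV is written to a temporary file.

```r
library(xml2)

# Hypothetical helper: turn one <page> string into a one-row data.frame.
page_to_row <- function(page_xml) {
  doc <- read_xml(page_xml)
  # Absolute paths pick the page's own <id>, not the revision or contributor <id>
  data.frame(
    id    = xml_text(xml_find_first(doc, "/page/id")),
    title = xml_text(xml_find_first(doc, "/page/title")),
    text  = xml_text(xml_find_first(doc, "/page/revision/text")),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: append one row to a CSV, writing the header only once.
append_row <- function(row, csv_path) {
  first <- !file.exists(csv_path)
  write.table(row, csv_path, sep = ",", row.names = FALSE,
              col.names = first, append = !first, qmethod = "double")
}

# Shortened copy of the sample page from the question
page <- paste(
  "<page>",
  "  <title>AccessibleComputing</title>",
  "  <id>10</id>",
  "  <revision>",
  "    <id>631144794</id>",
  "    <text xml:space=\"preserve\">#REDIRECT [[Computer accessibility]]</text>",
  "  </revision>",
  "</page>",
  sep = "\n"
)

csv_path <- tempfile(fileext = ".csv")
page_row <- page_to_row(page)
append_row(page_row, csv_path)
```

Appending row by row keeps memory use flat, at the cost of reopening the file for every page; batching several rows per write would be a reasonable variation.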

Source

2016-11-05 21:53:39
