Extracting URIs from an RDF web page in Java using the Jena library

I have written the following code to extract URIs from web pages served with the content type application/rdf-xml, for a Linked Data application.

public static void test(String url) { 
    try { 
     Model read = ModelFactory.createDefaultModel().read(url); 
     System.out.println("to go"); 
     StmtIterator si; 
     si = read.listStatements(); 
     System.out.println("to go"); 
     while(si.hasNext()) { 
      Statement s=si.nextStatement(); 
      Resource r=s.getSubject(); 
      Property p=s.getPredicate(); 
      RDFNode o=s.getObject(); 
      System.out.println(r.getURI()); 
      System.out.println(p.getURI()); 
      System.out.println(o.asResource().getURI()); 
     } 
    } 
    catch(JenaException | NoSuchElementException c) {} 
}

But for the input

<?xml version="1.0"?> 
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
    xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ex="http://example.org/stuff/1.0/"> 
    <rdf:Description rdf:about="http://www.w3.org/TR/rdf-syntax-grammar" 
     dc:title="RDF/XML Syntax Specification (Revised)"> 
     <ex:editor> 
      <rdf:Description ex:fullName="Dave Beckett"> 
       <ex:homePage rdf:resource="http://purl.org/net/dajobe/" /> 
      </rdf:Description> 
     </ex:editor> 
    </rdf:Description> 
</rdf:RDF>

the output is:

Subject URI is http://www.w3.org/TR/rdf-syntax-grammar 
Predicate URI is http://example.org/stuff/1.0/editor 
Object URI is null 
Subject URI is http://www.w3.org/TR/rdf-syntax-grammar 
Predicate URI is http://purl.org/dc/elements/1.1/title 
Website is read

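I need the output to contain all the URIs present on that web page, because I am building a web crawler for RDF pages. I need all of the following links in the output: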

 http://www.w3.org/TR/rdf-syntax-grammar 
     http://example.org/stuff/1.0/editor 
     http://purl.org/net/dajobe 
     http://example.org/stuff/1.0/fullName 
     http://www.w3.org/TR/rdf-syntax-grammar 
     http://purl.org/dc/elements/1.1/title

Source

2012-09-22 Prannoy Mittal

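Put the XML online and give us the URL. Also, you shouldn't be iterating over all the triples manually – Raffaele

See [this old answer](http://stackoverflow.com/a/12236809/315306) for a brief introduction to the query language you should use in Jena to extract information from a serialized model – Raffaele

Delete those two useless comments and edit your question to show the desired output, because I can't fully understand your question – Raffaele

Minor nitpick: you mean application/rdf+xml (note the plus sign).

Anyway, your problem is simple: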

catch(JenaException | NoSuchElementException c) {}

Bad! You are missing the error thrown here, which is why the output is truncated:

System.out.println(o.asResource().getURI());

o is not always a resource, so this will break on the triple

<http://www.w3.org/TR/rdf-syntax-grammar> dc:title "RDF/XML Syntax ..."

so you need to guard against that:

if (o.isResource()) System.out.println(o.asResource().getURI());

or, more specifically:

if (o.isURIResource()) System.out.println(o.asResource().getURI());

which will also skip the null output you are seeing.

Now write out a thousand times: I will not swallow exceptions :-)
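Putting those two fixes together, a minimal sketch of the loop might look like the following. It assumes Apache Jena 3 or later on the classpath (packages under org.apache.jena; the 2.x releases current when this was asked used com.hp.hpl.jena instead). The class name UriExtractor and the method extractUris are made up for illustration. It also guards the subject, since blank nodes have no URI, and lets exceptions propagate rather than catching them.

```java
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;

// Hypothetical helper class, not part of Jena.
final class UriExtractor {

    /** Collects every URI that appears as subject, predicate, or object. */
    static Set<String> extractUris(Model model) {
        Set<String> uris = new LinkedHashSet<>();
        StmtIterator it = model.listStatements();
        while (it.hasNext()) {
            Statement s = it.nextStatement();

            // Subjects may be blank nodes, which have no URI.
            Resource subject = s.getSubject();
            if (subject.isURIResource()) {
                uris.add(subject.getURI());
            }

            // Predicates are always named by a URI.
            uris.add(s.getPredicate().getURI());

            // Objects may be literals or blank nodes; keep only URI resources.
            RDFNode object = s.getObject();
            if (object.isURIResource()) {
                uris.add(object.asResource().getURI());
            }
        }
        return uris;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: UriExtractor <url-of-rdf-document>");
            return;
        }
        // No catch block: a parse or network failure surfaces as an exception.
        Model model = ModelFactory.createDefaultModel().read(args[0]);
        extractUris(model).forEach(System.out::println);
    }
}
```

For the input in the question this should collect the document URI, the four predicates (including ex:homePage) and http://purl.org/net/dajobe/, while the blank node for the editor contributes nothing. Using a set also removes the repeated subject URI.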

Source

2012-09-22 17:15:26 user205512

Yes, thanks a lot. It works now. –

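No, you have misunderstood what RDF is for. A crawler is a program designed to retrieve online content and index it. A simple crawler can be fed an HTML document, and it will download (possibly recursively) all the documents mentioned in the href attributes of its <a> elements.

RDF is full of URLs, so you might think it is perfect for feeding a crawler, but unfortunately the URLs in an RDF document are not meant for retrieving other documents. For instance: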

http://example.org/stuff/1.0/editor 404 Not Found
http://purl.org/net/dajobe 302 Moved Temporarily
http://example.org/stuff/1.0/fullName 404 Not Found
http://www.w3.org/TR/rdf-syntax-grammar 301 Moved Permanently
http://purl.org/dc/elements/1.1/title 302 Moved Temporarily

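Can it be a coincidence? I don't think so. The fact is that RDF is meant to describe the real world. It happens that it can be serialized as XML, but XML is not the only available serialization.

So what are the URLs in the document used for? They are used to name things. How many Johns do you know? Probably dozens, and there are thousands more Johns out there. But if I own the domain example.com, I can use the URL http://example.com/friends/John to refer to my friend John. RDF can be used to state that your friend John works at 123, Abc avenue, by means of two URLs and a string: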

"http://me.com/John" "http://me.com/works_at" "123, Abc avenue"

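This is called a triple, and the URLs it contains are not meant to point to something that a client can fetch by opening a TCP socket and speaking HTTP. Note that both your friend (John) and the predicate (works at) are referenced in the triple by URLs. But if you try those URLs in a browser, you get nothing.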
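To make that concrete, here is a small sketch that builds the same triple with Jena. It assumes Apache Jena 3 or later (org.apache.jena packages); the class and method names are invented for the example. Nothing is fetched over the network: the two URLs are used purely as names, and the address is a plain string literal.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

// Hypothetical example class, not part of Jena.
final class TripleExample {

    /** Builds a model holding the single "John works at ..." triple. */
    static Model johnWorksAt() {
        Model model = ModelFactory.createDefaultModel();

        // Subject and predicate are identified by URLs, used only as names.
        Resource john = model.createResource("http://me.com/John");
        Property worksAt = model.createProperty("http://me.com/works_at");

        // The object is a string literal, not a URL.
        john.addProperty(worksAt, "123, Abc avenue");
        return model;
    }

    public static void main(String[] args) {
        // N-Triples is one of the non-XML serializations Jena can write.
        johnWorksAt().write(System.out, "N-TRIPLE");
    }
}
```

Writing the model as N-Triples rather than RDF/XML also shows that the XML form is only one possible serialization of the same statement.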

I don't know why you want to build a crawler or what it is supposed to do, but RDF is certainly not what you need for the job.

Source

2012-09-22 16:32:30 Raffaele

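Hey, according to Tim Berners-Lee's four principles of Linked Data (http://www.w3.org/DesignIssues/LinkedData.html), looking up a URI should retrieve a description of the resource that the URI represents. –

It *should*. Unfortunately it *doesn't*. If you don't believe me, try it yourself. Besides, even if there were an HTML document at that URL, it would describe, for example, the "http://me.com/works_at" predicate, but in some completely proprietary format (table? divs? xml? other?), so how would you use it? – Raffaele

Hey, according to Linked Data, dereferencing a URI should return an HTML or RDF/XML description, depending on the headers you send with the request. I want to retrieve the RDF/XML description, and if that description contains more URIs, I want to crawl those URIs too. –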
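For what it's worth, the header-based retrieval described in the comment above could be sketched with the JDK's own HTTP client (Java 11 or later). The class RdfFetcher and its methods are hypothetical names, and whether a given server actually answers with RDF/XML is exactly the point disputed in this thread, so the sketch checks the response instead of assuming it.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class for illustration.
final class RdfFetcher {

    /** Builds a GET request that asks the server for an RDF/XML representation. */
    static HttpRequest rdfRequest(String uri) {
        return HttpRequest.newBuilder(URI.create(uri))
                .header("Accept", "application/rdf+xml")
                .GET()
                .build();
    }

    /** Returns the body only if the server really answered with RDF/XML. */
    static String fetch(HttpClient client, String uri)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(rdfRequest(uri), HttpResponse.BodyHandlers.ofString());
        String contentType = response.headers()
                .firstValue("Content-Type").orElse("").toLowerCase();
        if (response.statusCode() != 200
                || !contentType.startsWith("application/rdf+xml")) {
            throw new IOException("No RDF/XML description at " + uri
                    + " (status " + response.statusCode()
                    + ", content type '" + contentType + "')");
        }
        return response.body();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 1) {
            System.err.println("usage: RdfFetcher <uri>");
            return;
        }
        // The default client does not follow redirects, so enable them explicitly.
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        System.out.println(fetch(client, args[0]));
    }
}
```

A crawler built on this would have to treat the IOException as "nothing to crawl here" and move on, since many of the URIs in an RDF document only name things and return nothing useful.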
