2010-01-19 63 views
2

I am parsing Wikipedia markup with regular expressions in PHP, and there is a bit of text I want to remove from a page I fetched from Wikipedia:

{{Historical populations|type=USA 
| 1698|4937 
| 1712|5840 
| 1723|7248 
| 1737|10664 
| 1746|11717 
| 1756|13046 
| 1771|21863 
| 1790|33131 
| 1800|60515 
| 1810|96373 
| 1820|123706 
| 1830|202589 
| 1840|312710 
| 1850|515547 
| 1860|813669 
| 1870|942292 
| 1880|1206299 
| 1890|1515301 
| 1900|3437202 
| 1910|4766883 
| 1920|5620048 
| 1930|6930446 
| 1940|7454995 
| 1950|7891957 
| 1960|7781984 
| 1970|7894862 
| 1980|7071639 
| 1990|7322564 
| 2000|8008288 
| 2008*|8363710 
|footnote=Beginning 1900, figures are for consolidated city of five boroughs. Sources: 1698–1771,{{cite book|last=Greene and Harrington|first=|title=American Population Before the Federal Census of 1790|publisher=|location=New York|year=1932|isbn=|pages=}}, as cited in: {{cite book|last=Rosenwaike|first=Ira|title=Population History of New York City|publisher=Syracuse University Press|location=Syracuse, N.Y.|year=1972|isbn=0815621558|page=8}} 1790–1990,Gibson, Campbell.[http://www.census.gov/population/www/documentation/twps0027.html Population of the 100 Largest Cities and Other Urban Places in the United States:1790 to 1990], [[United States Census Bureau]], June 1998. Retrieved June 12, 2007. *2008 est[http://factfinder.census.gov/servlet/SAFFPopulation?_event=Search&geo_id=16000US3403940&_geoContext=01000US%7C04000US34%7C16000US3403940&_street=&_county=new+york+city&_cityTown=new+york+city&_state=04000US36&_zip=&_lang=en&_sse=on&ActiveGeoDiv=geoSelect&_useEV=&pctxt=fph&pgsl=160&_submenuId=population_0&ds_name=null&_ci_nbr=null&qr_name=null&reg=null%3Anull&_keyword=&_industry=Census Data for New York city, New York], [[United States Census Bureau]]. Retrieved June 12, 2007. 
}} 

The following section I also want to keep as plain text, but without the parts wrapped in "{{" and "}}":

New York is the most populous city in the United States, with an estimated 2008 population of 8,363,710(up from 7.3 million in 1990). This amounts to about 40.0% of New York State's population and a similar percentage of the metropolitan regional population. Over the last decade the city's population has been increasing and demographers estimate New York's population will reach between 9.2 and 9.5 million by 2030.{{cite web |title=New York City Population Projections by Age/Sex and Borough, 2000-2030 |publisher=[[New York City Department of City Planning]] |month=December | year=2006 |url=http://www.nyc.gov/html/dcp/pdf/census/projections_report.pdf |format=PDF |accessdate=2008-09-01}} See also {{cite news |last=Roberts, Sam |title=By 2025, Planners See a Million New Stories in the Crowded City |publisher=New York Times |date=February 19, 2006 |url=http://www.nytimes.com/2006/02/19/nyregion/19population.html?ex=1298005200&en=c586d38abbd16541&ei=5090&partner=rssuserland&emc=rss |accessdate=2008-09-01}} 

Thanks.

+0

Do you have an example of a regular expression you have already tried? – 2010-01-19 16:47:25

Answers

2

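The code I am currently using to clean up wiki pages is shown below. Take this page as an example:

http://en.wikipedia.org/wiki/Tel_Aviv (you can view the markup by clicking "edit this page")

This is what I get back:

", and gave way to its reputation as the Mediterranean metropolis 'that never sleeps'. Haaretz editorial It is the country's financial capital and a major performing arts and business center. Tel Aviv's urban area is the second-largest city economy in the Middle East, and is ranked 42nd among global cities by Foreign Policys 2008 Global Cities Index. It is also the most expensive city in the region, and the 17th most expensive city in the world. The cost of living in Israel is high, with Tel Aviv being its most expensive city to live in. According to Mercer, a human resources consulting firm based in New York, as of 2008 Tel Aviv is the most expensive city in the Middle East and the 14th most expensive in the world. It falls just behind Singapore and Paris and just ahead of Sydney and Dublin in this respect. By comparison, New York City is 22nd"

That is not correct. The expected result should be:

Tel Aviv-Yafo (Hebrew: תֵּל-אָבִיב-יָפוֹ; Arabic: تلأبيب, Tall'Abīb), usually called Tel Aviv, is the second-largest city in Israel, with an estimated population of 393,900. The city is situated on the Israeli Mediterranean coastline, with a land area of 51.8 square kilometres (20.0 sq mi). It is the largest and most populous city in the Gush Dan metropolitan area, home to 3.15 million people as of 2008. The city is governed by the Tel Aviv-Yafo municipality, headed by Ron Huldai.

This is the PHP code: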

function clean_wiki_text($text) 
    { 
    // first get rid of UGC HTML tags 
    $text = strip_tags($text); 

    // keep convert tag 
    $text = preg_replace("/\{\{convert\|([^\|]+)\|([^\|]+)\|[^\}]+\}\}/", "$1$2", $text); 

    // remove large blocks (treat as tags) 
    $text = preg_replace("/(<![^>]+>)/", '', $text); 
    $text = preg_replace('/\{\{\s?/', '<', $text); 
    $text = str_replace('}}', ' />', $text); 

    $text = str_replace('<! />', '', $text); 

    // more wiki formatting 
    $text = preg_replace("/'{2,6}/", '', $text); 
    $text = preg_replace("/[=\s]+External [lL]inks[\s=]+/", '', $text); 
    $text = preg_replace("/[=\s]+See [aA]lso[\s=]+/", '', $text); 
    $text = preg_replace("/[=\s]+References[\s=]+/", '', $text); 
    $text = preg_replace("/[=\s]+Notes[\s=]+/", '', $text); 
    $text = preg_replace('/\{\{([^\}]+)\}\}/', '', $text); 

    // drop page link text 
    $text = preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$2", $text); 
    // or keep it with preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', "$1 ($2)", $text); 

    $text = preg_replace('/\(\[[^\]]+\]\)/', '', $text); 
    $text = preg_replace('/\[\[([^:\]]+)\]\]/', "$1", $text); 
    $text = preg_replace('/\*?\s?\[\[([^\]]+)\]\]/', '', $text); 
    $text = preg_replace('/\*\s?\[([^\s]+)\s([^\]]+)\]/', "$2", $text); 
    $text = preg_replace('/\n(\*+\s?)/', '', $text); 
    $text = preg_replace('/\n{3,}/', "\n\n", $text); 
    $text = preg_replace('/<ref[^>]?>[^>]+>/', '', $text); 
    $text = preg_replace('/<cite[^>]?>[^>]+>/', '', $text); 

    $text = preg_replace('/={2,}/', '', $text); 
    $text = preg_replace('/{?class="[^"]+"/', "", $text); 
    $text = preg_replace('/!?\s?width="[^"]+"/', "", $text); 
    $text = preg_replace('/!?\s?height="[^"]+"/', "", $text); 
    $text = preg_replace('/!?\s?style="[^"]+"/', "", $text); 
    $text = preg_replace('/!?\s?rowspan="[^"]+"/', "", $text); 
    $text = preg_replace('/!?\s?bgcolor="[^"]+"/', "", $text); 

    $text = trim($text); 

    $text = preg_replace('/\n\n/', "<br />\n<br />\n", $text); 
    $text = preg_replace('/\r\n\r\n/', "<br />\r\n<br />\r\n", $text); 
/* 
    $config = array(
     'show-body-only' => true, 
     'clean'   => false, 
     'wrap'   => 0, 
     'show-warnings' => 0, 
     'show-errors' => 0, 
     'enclose-block-text' => false, 
     'vertical-space' => true, 
     'output-html' => true 
    ); 

    // Tidy 
    $tidy = new tidy; 
    $tidy->parseString($text, $config, 'utf8'); 
    $tidy->cleanRepair(); 

    $text = $tidy->value; 
*/ 
    $extras = array(
    // "/\((.*?)\)/is" => "", 
     "/\[(.*?)\]/is" => "" 
    ); 
    $text = preg_replace(array_keys($extras), array_values($extras), $text); 

    $text = str_replace(" ,", ',', $text); 
    $text = str_replace(", ", ',', $text); 
    $text = str_replace(",", ', ', $text); 
    $text = str_replace("(, ", '(', $text); 
    $text = str_replace(";,", ',', $text); 

    // lets keep it plain plain plain 
    $text = strip_tags($text); 
// $text = preg_replace('/\s\s+/', ' ', $text); 

    $text = str_replace("|-", '', $text); 
    $text = str_replace("|}", '', $text); 
    $text = str_replace("|", '', $text); 
    $text = str_replace('()', '', $text); 
    $text = str_replace('&nbsp;', ' ', $text); 

    $text = trim($text); 

    $text_arr = preg_split('/[\r\n]+/', $text, -1, PREG_SPLIT_NO_EMPTY); 
    $result = array(); // must be an array: appending with [] to a string is a fatal error in PHP 7.1+
    foreach ($text_arr as $paragraph) { 
     if (mb_strlen(trim($paragraph)) > 30) { 
     $result[] = $paragraph; 
     } 
    } 
    return $result; 
    } 
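
To see what the three link-related patterns in the function above do in isolation, here is a minimal sketch. The helper name strip_wiki_links and the sample markup are made up for illustration; only the regular expressions are taken from clean_wiki_text().

```php
<?php
// Hypothetical helper: applies only the three link patterns from clean_wiki_text().
function strip_wiki_links(string $text): string
{
    // [[Target|label]] -> label (targets containing ':' do not match here)
    $text = preg_replace('/\[\[([^:\|\]]+)\|([^:\]]+)\]\]/', '$2', $text);
    // [[Target]] -> Target
    $text = preg_replace('/\[\[([^:\]]+)\]\]/', '$1', $text);
    // Anything still wrapped in [[...]] (e.g. File: or Category: links) is dropped,
    // together with one optional leading "*" and one optional whitespace character
    $text = preg_replace('/\*?\s?\[\[([^\]]+)\]\]/', '', $text);
    return $text;
}

// Made-up sample markup
$sample = 'Part of the [[Tel Aviv District|district]] in [[Israel]]. [[File:Example.jpg|thumb]]';
echo strip_wiki_links($sample), "\n";
```

Note that the order matters: the catch-all third pattern has to run last, otherwise it would delete the ordinary links before their text can be kept.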
0

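It is really hard to write a regular expression when only one example is provided. From my own experience cleaning Wikipedia pages, I know that other pages will very likely look a bit different. Matching just your example is simple: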

{{.+?}}\n 

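This only works if there is a newline after the part to be removed, and only if you specify DOTALL and MULTILINE (the s and m pattern modifiers in PHP). To match every pair of double braces and the stuff inside: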

{{[^}]+}} 

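You could try doing several runs, each taking out another unwanted part. I doubt it is very feasible to match everything you need in a single regular expression.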
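
One way to read the "several runs" idea, as a rough sketch rather than a complete solution: the example in the question nests templates ({{cite book ...}} inside {{Historical populations ...}}), so a single pass of a brace-excluding pattern only removes the innermost pairs. Repeating the pass until nothing changes peels the nesting off layer by layer. The function name strip_templates and the sample string are assumptions for illustration.

```php
<?php
// Hypothetical helper: strips {{...}} templates, including nested ones,
// by repeatedly removing the innermost pairs.
function strip_templates(string $text): string
{
    do {
        // [^{}]* cannot cross a brace, so only innermost templates match
        $text = preg_replace('/\{\{[^{}]*\}\}/', '', $text, -1, $count);
    } while ($count > 0);
    return $text;
}

// Made-up sample with one nested and one simple template
$sample = "Intro.{{outer|a=1|note={{cite book|title=T}} end}}\n"
        . "Kept text.{{cite web|url=http://example.org}}";
echo strip_templates($sample), "\n";
```

A template that contains a stray single brace will be left in place by this pattern, so the result should still be checked against real pages.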

+0

Running this first, in addition to my existing code, works for that page. I will test a few other pages, but it looks good. Thanks – Simon 2010-01-19 21:34:15

1

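Just guessing here, but wouldn't it be easier and safer to use Wikipedia's own markup library (bundled with MediaWiki) to convert the markup to HTML, and then analyze that with whatever XML library you are comfortable with?

The API documentation can be found at http://svn.wikimedia.org/doc/ (in the Parser module), and it does not look complicated. Basically, all you need to do is something like this: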

<?php 

require_once '/path/to/mediawiki/Parser.php'; 
// also include whatever classes Parser depends on or use Mediawiki's autoload 
// mechanism if it has any 

// retrieve the content of your page in $content 

$parser = new Parser(); 
$html = $parser->parse($content); 

$simplexml = simplexml_load_string($html); 

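Now you have a very convenient SimpleXML object to play with. Of course, this only works if MediaWiki's parser produces valid XML (I bet it does). Also, if MediaWiki includes some kind of autoload mechanism, you can easily find it by searching MediaWiki's codebase for __autoload or spl_autoload_register.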
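
As a rough illustration of the SimpleXML step only, here is a sketch that pulls the plain text of each paragraph out of a well-formed HTML fragment. The function name paragraph_texts and the sample fragment are made up; the real parser output is assumed to be well-formed XML with a single root element, which is not verified here.

```php
<?php
// Hypothetical helper: returns the plain text of every <p> element
// in a well-formed XML/HTML fragment, or an empty array if parsing fails.
function paragraph_texts(string $html): array
{
    libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
    $xml = simplexml_load_string($html);
    if ($xml === false) {
        return array();
    }

    $paragraphs = array();
    foreach ($xml->xpath('//p') as $p) {
        // asXML() returns the element's markup; strip_tags() keeps only its text
        $paragraphs[] = trim(strip_tags($p->asXML()));
    }
    return $paragraphs;
}

// Made-up fragment standing in for the parser output
$html = '<div><p>Tel Aviv is a <a href="/wiki/City">city</a> in Israel.</p>'
      . '<table><tr><td>skip me</td></tr></table>'
      . '<p>Second paragraph.</p></div>';
print_r(paragraph_texts($html));
```

Selecting only the paragraph elements is what drops tables and similar blocks, which is the part the regular expression approach struggles with.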

Hope it helps!