Python regex for finding the contents of links in MediaWiki markup

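If I have some XML containing MediaWiki markup like the following:

"...collected in the 12th century, of which [[Alexander the Great]] was the hero, and in which he was represented, somewhat like the British [[King Arthur|Arthur]]"

what would be the appropriate arguments to something like:

re.findall([[__?__]], article_entry)

I am stumbling a bit on escaping the double square brackets, and on getting the proper link target for text like: [[Alexander of Paris|poet named Alexander]]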

2009-05-01 unmounted

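Here is an example:

import re

# Capture everything between [[ and ]], including an optional "|label" part
pattern = re.compile(r"\[\[([\w \|]+)\]\]")
text = "blah blah [[Alexander of Paris|poet named Alexander]] bldfkas"
results = pattern.findall(text)

# Keep only the link target, i.e. the part before the "|"
output = []
for link in results:
    output.append(link.split("|")[0])

# output is now ['Alexander of Paris']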

Version 2 puts more into the regex, but as a result changes the shape of the output:

import re

pattern = re.compile(r"\[\[([\w ]+)(\|[\w ]+)?\]\]")
text = "[[a|b]] fdkjf [[c|d]] fjdsj [[efg]]"
results = pattern.findall(text)

# results is [('a', '|b'), ('c', '|d'), ('efg', '')]

print([link[0] for link in results])

# prints ['a', 'c', 'efg']

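Version 3, if you only want the link target without the label. The second group is non-capturing, so findall returns plain strings:

import re

pattern = re.compile(r"\[\[([\w ]+)(?:\|[\w ]+)?\]\]")
text = "[[a|b]] fdkjf [[c|d]] fjdsj [[efg]]"
results = pattern.findall(text)

# results is ['a', 'c', 'efg']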
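One limitation worth noting: the character class [\w ] only covers word characters and spaces, so a title containing punctuation would be skipped. The following is a minimal sketch of one possible variation, assuming a link target is any run of characters other than "]" or "|". The name LINK_TARGET and the sample text are made up for illustration.

```python
import re

# Assumption: a target is any run of characters except "]" and "|".
# The optional non-capturing group swallows a "|label" part if present.
LINK_TARGET = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

text = "[[a|b]] see [[Stack Overflow (website)|the site]] and [[C++]]"
targets = LINK_TARGET.findall(text)
print(targets)

# For comparison, the stricter [\w ] class from version 3 on the same text:
strict = re.compile(r"\[\[([\w ]+)(?:\|[\w ]+)?\]\]").findall(text)
print(strict)
```

On this sample the stricter pattern only finds the first link, because parentheses and "+" are not matched by \w.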


2009-05-01 01:20:08 Unknown

I use '\[\[(.+?)\]\]' for my own purposes. It is a bit shorter. :) – Gandaro 2012-02-04 14:26:11
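A rough sketch of how the shorter non-greedy pattern could be used: it captures the whole link body, so the target and the label still have to be separated afterwards. The helper name extract_links is hypothetical, and falling back to the target when no label is given is an assumption about what a caller would want.

```python
import re

# Non-greedy: stop at the first "]]" after each "[["
WIKILINK = re.compile(r"\[\[(.+?)\]\]")

def extract_links(text):
    """Return (target, label) pairs; the label falls back to the target."""
    pairs = []
    for body in WIKILINK.findall(text):
        # partition splits on the first "|" only; sep is "" if none is found
        target, sep, label = body.partition("|")
        target = target.strip()
        pairs.append((target, label.strip() if sep else target))
    return pairs

sample = "of which [[Alexander the Great]] was somewhat like [[King Arthur|Arthur]]"
print(extract_links(sample))
```

Because "." does not match a newline by default, a link body that spans several lines would not be found by this pattern.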

RegEx: \w+( \w+)+(?=]])

Input

[[Alexander of Paris|poet named Alexander]]

Output

poet named Alexander

Input

[[Alexander of Paris]]

Output

Alexander of Paris


2009-05-01 01:52:23 ByteNirvana

That is not the desired output. ;) – Gandaro 2012-02-04 14:28:33

import re

# Capture the target, then require either "|" (a label follows) or the closing "]]"
pattern = re.compile(r"\[\[([\w ]+)(?:\||\]\])")
text = "of which [[Alexander the Great]] was somewhat like [[King Arthur|Arthur]]"
results = pattern.findall(text)
print(results)

will give the output

['Alexander the Great', 'King Arthur']


2009-05-01 01:57:28 erik

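If you are trying to get all the links from a page, it is much easier to use the MediaWiki API where possible, e.g. http://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Stack_Overflow_(website).

Note that both of these methods will miss links embedded in templates.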
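As a small illustration of the API route, the query URL above can be assembled from its parameters rather than typed by hand. This is only a sketch: the helper name build_links_query_url is made up, and the format=json parameter is an addition that does not appear in the URL quoted above. The code builds the URL but does not send any request.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"

def build_links_query_url(title):
    """Build a prop=links query URL for one page title (no request is made)."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "format": "json",  # assumption: JSON output; not part of the original URL
    }
    # urlencode percent-encodes characters such as parentheses
    return API_ENDPOINT + "?" + urlencode(params)

url = build_links_query_url("Stack_Overflow_(website)")
print(url)
```

The resulting string can be passed to whatever HTTP client is already in use.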


2009-05-01 04:23:25 pfctdayelise

I'm actually working from a dump, but thanks for the tip – unmounted 2009-05-01 08:59:57
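For the dump case, a regex like the ones above could be applied to the text of each page after parsing the XML. The sketch below uses a deliberately simplified, made-up XML snippet: real MediaWiki dumps are much larger, nest the text inside further elements, and use an XML namespace, so the element lookups would need adjusting. The names LINK_TARGET, SAMPLE_XML and links_by_page are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a target is any run of characters except "]" and "|"
LINK_TARGET = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

# Hypothetical, simplified stand-in for a dump (not the real dump schema)
SAMPLE_XML = """<pages>
  <page>
    <title>Example page</title>
    <text>of which [[Alexander the Great]] was somewhat like [[King Arthur|Arthur]]</text>
  </page>
</pages>"""

def links_by_page(xml_source):
    """Map each page title to the list of link targets found in its text."""
    root = ET.fromstring(xml_source)
    result = {}
    for page in root.iter("page"):
        title = page.findtext("title")
        wikitext = page.findtext("text") or ""  # tolerate pages without text
        result[title] = LINK_TARGET.findall(wikitext)
    return result

print(links_by_page(SAMPLE_XML))
```

For a full-size dump, loading everything with ET.fromstring would likely use too much memory; an incremental parser such as ET.iterparse is the usual alternative.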
