Getting the first link in a Wikipedia article not inside parentheses

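I'm interested in this theory: if you go to a random Wikipedia article and repeatedly click the first link that is not inside parentheses, in 95% of cases you will end up on the article about Philosophy.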

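I wanted to write a Python script that does the link fetching for me and, at the end, prints a nice list of the articles that were visited (linkA -> linkB -> linkC) and so on.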
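The chain-following and printing part can be separated from the HTML parsing. The sketch below is a minimal Python 3 illustration, not the code from this post: `follow_chain` and its `next_link` parameter are hypothetical names, and a toy dictionary stands in for real page fetches.

```python
def follow_chain(start, next_link, target="Philosophy", max_steps=100):
    """Follow first links from `start` until `target`, a loop, or a dead end.

    `next_link` is a hypothetical callable mapping a title to the next
    title (or None if the page has no usable link).
    Returns (chain, outcome); outcome is "target", "loop", "dead end" or "limit".
    """
    chain = [start]
    seen = {start}
    current = start
    for _ in range(max_steps):
        if current == target:
            return chain, "target"
        nxt = next_link(current)
        if nxt is None:
            return chain, "dead end"
        chain.append(nxt)
        if nxt in seen:          # we have been here before
            return chain, "loop"
        seen.add(nxt)
        current = nxt
    return chain, "limit"


# Toy link table standing in for real page fetches (made-up data).
toy_links = {"Human": "Species", "Species": "Biology", "Biology": "Philosophy"}
chain, outcome = follow_chain("Human", toy_links.get)
print(" -> ".join(chain), "(%s)" % outcome)
```

Passing the link lookup in as a function keeps the loop detection testable without network access; a real `next_link` would fetch and parse the page.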

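I managed to get the HTML DOM of the pages, and managed to strip out some unnecessary links as well as the description bar at the top, which leads to disambiguation pages. So far I have concluded that: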

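The DOM begins with the table that appears on the right on some pages, for example in Human. We want to ignore these links.
The valid link elements all have a <p> element somewhere among their ancestors (most often as the parent, or as the grandparent if the link is inside a <b> tag or similar). The top bar that leads to disambiguation pages does not seem to contain any <p> elements.
Invalid links contain a special word followed by a colon, e.g. Wikipedia: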
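The three rules above can be sketched with an event-based parser. This is a minimal Python 3 illustration that swaps in the standard library's `html.parser` for the DOM walk used in this post; `CandidateLinkParser` and `SKIPPED_PREFIXES` are hypothetical names, the prefix list is copied from the checks in the script below, and the sketch assumes that `<p>` and `<table>` tags are explicitly closed.

```python
from html.parser import HTMLParser

# Namespace prefixes to skip; copied from the checks in the question's script.
SKIPPED_PREFIXES = ("File:", "Wikipedia:", "Portal:", "Special:", "Help:",
                    "Template:", "Template_talk:", "Talk:", "Category:")


class CandidateLinkParser(HTMLParser):
    """Collect /wiki/ hrefs that sit inside a <p> and outside any <table>."""

    def __init__(self):
        super().__init__()
        self.p_depth = 0       # how many <p> elements are currently open
        self.table_depth = 0   # how many <table> elements are currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_depth += 1
        elif tag == "table":
            self.table_depth += 1
        elif tag == "a" and self.p_depth > 0 and self.table_depth == 0:
            href = dict(attrs).get("href") or ""
            title = href[len("/wiki/"):]
            if href.startswith("/wiki/") and not title.startswith(SKIPPED_PREFIXES):
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p" and self.p_depth > 0:
            self.p_depth -= 1
        elif tag == "table" and self.table_depth > 0:
            self.table_depth -= 1


# Made-up markup imitating an infobox table, a hatnote and a first paragraph.
sample = (
    '<table><tr><td><p><a href="/wiki/Taxonomy">Taxonomy</a></p></td></tr></table>'
    '<div>For other uses, see <a href="/wiki/Human_(disambiguation)">Human (disambiguation)</a>.</div>'
    '<p><b>Humans</b> are <a href="/wiki/Help:IPA">IPA</a> '
    '<a href="/wiki/Species">species</a>.</p>'
)
parser = CandidateLinkParser()
parser.feed(sample)
print(parser.links)
```

This only covers the table, `<p>` ancestor and namespace rules; it does not yet deal with parentheses.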

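So far, so good. But it's the parentheses that get me. In the article about Human, for example, the first link not inside parentheses is "/wiki/Species", but the script finds "/wiki/Taxonomy", which is inside them.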

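I have no idea how to go about this programmatically, since I have to look for text in some combination of parent/child nodes that may not always be the same. Any ideas?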

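My code can be seen below, but it's something I put together really quickly and am not very proud of. It is commented, however, so you can follow my line of thought (I hope :) ).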

"""Wikipedia fun""" 
import urllib2 
from xml.dom.minidom import parseString 
import time 

def validWikiArticleLinkString(href): 
    """ Takes a string and returns True if it contains the substring 
     '/wiki/' in the beginning and does not contain any of the 
     "special" wiki pages. 
    """ 
    return (href.find("/wiki/") == 0 
      and href.find("(disambiguation)") == -1 
      and href.find("File:") == -1 
      and href.find("Wikipedia:") == -1 
      and href.find("Portal:") == -1 
      and href.find("Special:") == -1 
      and href.find("Help:") == -1 
      and href.find("Template_talk:") == -1 
      and href.find("Template:") == -1 
      and href.find("Talk:") == -1 
      and href.find("Category:") == -1 
      and href.find("Bibcode") == -1 
      and href.find("Main_Page") == -1) 


if __name__ == "__main__":
    visited = []  # a list of visited links. used to avoid getting into loops

    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # need headers for the api

    currentPage = "Human"  # the page to start with

    while True:
        infile = opener.open('http://en.wikipedia.org/w/index.php?title=%s&printable=yes' % currentPage)
        html = infile.read()  # retrieve the contents of the wiki page we are at

        htmlDOM = parseString(html)  # get the DOM of the parsed HTML
        aTags = htmlDOM.getElementsByTagName("a")  # find all <a> tags

        newLink = None  # stays None if no valid link is found on this page

        for tag in aTags:
            if "href" in tag.attributes.keys():  # see if we have the href attribute in the tag
                href = tag.attributes["href"].value  # get the value of the href attribute
                if validWikiArticleLinkString(href):  # if we have one of the link types we are looking for

                    # Now come the tricky parts. We want to look for links in the main content area only,
                    # and we want the first link not in parentheses.

                    # assume the link is valid.
                    invalid = False

                    # tables which appear to the right on the site appear first in the DOM, so we need to make sure
                    # we are not looking at a <a> tag somewhere inside a <table>.
                    pn = tag.parentNode
                    while pn is not None:
                        if str(pn).find("table at") >= 0:
                            invalid = True
                            break
                        else:
                            pn = pn.parentNode

                    if invalid:  # go to next link
                        continue

                    # Next we look at the descriptive texts above the article, if any; e.g
                    # This article is about .... or For other uses, see ... (disambiguation).
                    # These kinds of links will lead into loops so we classify them as invalid.

                    # We notice that this text does not appear to be inside a <p> block, so
                    # we dismiss <a> tags which aren't inside any <p>.
                    pnode = tag.parentNode
                    while pnode is not None:
                        if str(pnode).find("p at") >= 0:
                            break
                        pnode = pnode.parentNode
                    # If we have reached the root node, which has parentNode None, we classify the
                    # link as invalid.
                    if pnode is None:
                        invalid = True

                    if invalid:
                        continue

                    ###### this is where I got stuck:
                    # now we need to look if the link is inside parentheses. below is some junk

                    # for elem in tag.parentNode.childNodes:
                    #     while elem.firstChild is not None:
                    #         elem = elem.firstChild
                    #     print elem.nodeValue

                    print href  # this will be the next link
                    newLink = href[6:]  # except for the /wiki/ part
                    break

        # if no valid link was found on this page, stop
        if newLink is None:
            print "No valid link found."
            break
        # if we have been to this link before, break the loop
        elif newLink in visited:
            print "Stuck in loop."
            break
        # or if we have reached Philosophy
        elif newLink == "Philosophy":
            print "Ended up in Philosophy."
            break
        else:
            visited.append(currentPage)  # mark this currentPage as visited
            currentPage = newLink  # make the page we found the new page to fetch
            time.sleep(5)  # sleep some to see results as debug

2012-05-17 pg-robban

You might want to try the richer interface provided by lxml. It lets you use XPath and a number of other things. – Marcin

While we're making recommendations, I'd like to drop the name BeautifulSoup as a possibly useful one. – marue

@marue Two great tastes that taste great together: lxml has a BeautifulSoup backend! – Marcin

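I found a Python script on GitHub (http://github.com/JensTimmerman/scripts/blob/master/philosophy.py) that plays this game. It uses BeautifulSoup for HTML parsing, and to cope with the parentheses issue it simply removes the text between parentheses before parsing the links.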
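A minimal Python 3 sketch of that idea follows. It is not the linked script's code: `strip_parenthesized` and `first_link_outside_parentheses` are hypothetical names, only the standard library is used, and a regular expression stands in for a real HTML parser. The one subtlety it tries to handle is that parentheses inside a tag, such as in an href, must be left alone.

```python
import re


def strip_parenthesized(html):
    """Drop text inside (...) while leaving markup, e.g. href values, untouched."""
    out = []
    depth = 0        # current nesting level of parentheses in the text
    in_tag = False   # True while we are between '<' and '>'
    for ch in html:
        if in_tag:
            if ch == ">":
                in_tag = False
            if depth == 0:       # tags inside parentheses are dropped too
                out.append(ch)
        elif ch == "<":
            in_tag = True
            if depth == 0:
                out.append(ch)
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth > 0:
                depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def first_link_outside_parentheses(paragraph_html):
    """Return the first /wiki/ href left after removing parenthesized text."""
    cleaned = strip_parenthesized(paragraph_html)
    match = re.search(r'<a\s[^>]*href="(/wiki/[^"#]+)"', cleaned)
    return match.group(1) if match else None


# Made-up paragraph imitating the opening of the Human article.
sample = ('<p><b>Humans</b> (<a href="/wiki/Taxonomy">taxonomically</a> '
          '<i>Homo sapiens</i>) are a <a href="/wiki/Species">species</a>.</p>')
print(first_link_outside_parentheses(sample))
```

Two limitations of this sketch: an unbalanced opening parenthesis in the text swallows everything after it, and it should be applied only to the paragraphs that already passed the table and `<p>` checks.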

2012-05-17 17:04:17 Ponytech
