Python 2.7 Beautiful Soup - parsing a list of links

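I want to parse all of the links on this page that share the same hierarchy. I don't get any traceback, but I don't get any data either.

I am trying to get the href from the highlighted part of the code:

My current code is: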

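import sys
import traceback

# removeNonAscii, html_parser and clean_a_url are the asker's own helpers (not shown here).
def link_parser(soup, itemsList):
    # Walk every product tile and queue "title,url" for each product link inside it.
    for item in soup.findAll("div", {"class": "tileInfo"}):
        for link in item.findAll("a", {"class": "productClick productTitle"}):
            try:
                itemsList.put(removeNonAscii(html_parser.unescape(link.string)).replace(',', ' ') + "," + clean_a_url(link['href']))
            except Exception:
                print("Formatting error: ")
                traceback.print_exc(file=sys.stdout)
    return ""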

Source

2014-09-25 user3677501

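You shouldn't link to an image. It is impossible to copy and paste from an image. – 2014-09-25 16:53:24

What data do you need? Why 'removeNonAscii' and 'clean_a_url'? You don't need to unescape HTML-encoded strings; BeautifulSoup already does that for you, and you can use 'link.text' to access the unescaped text. – 2014-09-25 16:53:59

I need the href from this tag: I use clean_a_url because I need the URLs to be consistent and clean, and I use removeNonAscii because my URLs sometimes contain non-ASCII characters. Can you show me an example of using link.text to access the unescaped text? – user3677501 2014-09-25 17:03:50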

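It looks like you're trying to scrape Target's website, perhaps this page.

You've run into one of the fundamental difficulties of web scraping: what you see is not always what you get. In this case, they AJAX in a bunch of content after the page has loaded. Notice the little pinwheel animation when the page first loads: the content you're trying to access simply does not exist in the DOM until all the various JS scripts they've got on that page have run. (And they've got a whole lot of them.)

I poked around a bit, and it looks like the code responsible for generating the content is this bit of jQuery: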

<script id="productTitleTmpl" type="text/x-jquery-tmpl" > 
     {{if $item.parent.parent.viewType != "details"}} 
      {{tmpl($data.itemAttributes) "#productBrandTmpl"}} 
     {{/if}} 
     <a class="productClick productTitle" id="prodTitle-{{= $item.parent.parent.viewType}}-{{= $item.parent.parent.currentPageNumber}}-{{= $item.parent.productCounter}}" href="/{{= productDetailPageURL}}#prodSlot={{= $item.parent.parent.viewType}}_{{= $item.parent.parent.currentPageNumber}}_{{= $item.parent.productCounter}}" title="{{= title}}" name="prodTitle_{{= $item.catalogEntryId}}"> 
      {{= $item.parent.parent.fetchProductTitleForView($item.productTitle)}} 
     </a>
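
To illustrate why the original code returns nothing without raising an error, here is a minimal sketch using only the standard library. It assumes two heavily shortened, made-up stand-ins for the page (`AS_DOWNLOADED` and `AFTER_JAVASCRIPT`); the class and function names are hypothetical as well. HTML parsers, the ones BeautifulSoup relies on included, treat the contents of a `<script>` element as plain text, so an anchor that only exists inside a template is never seen as a tag.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class AnchorCounter(HTMLParser):
    """Count <a> elements that carry a given class in the parsed markup."""

    def __init__(self, wanted_class):
        HTMLParser.__init__(self)
        self.wanted_class = wanted_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and self.wanted_class in classes:
            self.count += 1


def count_anchors(html, wanted_class="productTitle"):
    parser = AnchorCounter(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical, shortened stand-in for what urllib downloads:
# the anchor exists only as text inside a template script.
AS_DOWNLOADED = (
    '<div id="results"></div>'
    '<script id="productTitleTmpl" type="text/x-jquery-tmpl">'
    '<a class="productClick productTitle" href="/{{= productDetailPageURL}}">x</a>'
    '</script>'
)

# Hypothetical stand-in for the DOM after the scripts have run.
AFTER_JAVASCRIPT = (
    '<div class="tileInfo">'
    '<a class="productClick productTitle" href="/p/example">Example</a>'
    '</div>'
)

print(count_anchors(AS_DOWNLOADED))     # 0
print(count_anchors(AFTER_JAVASCRIPT))  # 1
```

The same check can be run against the real downloaded HTML: if it reports zero matching anchors, the links are being generated client-side.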

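So, anyway. If you're really dead set on scraping this page, you'll need to ditch urllib (or whatever you're using to fetch the HTML). Instead, visit the page with a headless browser that supports JavaScript (for example, one driven through selenium), let the JavaScript run, and then do the scraping. All of that is beyond the scope of this answer, but you can google around for the various headless-browser solutions and find one that suits you.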
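
As a rough starting point only, the headless-browser approach might look something like the sketch below. It assumes a current Python 3 with Selenium 4 and a Firefox installation that Selenium can drive, which is not the Python 2.7 setup from the question. The function names `collect_links` and `scrape` are made up for illustration, and the CSS selector is simply derived from the class names in the question's code; the real page may need a different one.

```python
def collect_links(driver, selector="div.tileInfo a.productClick.productTitle"):
    """Return (title, href) pairs for every anchor matching the selector.

    `driver` is a Selenium 4 WebDriver, or anything else that offers
    find_elements(by, value).
    """
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    anchors = driver.find_elements("css selector", selector)
    return [(a.text.strip(), a.get_attribute("href")) for a in anchors]


def scrape(url, timeout=15):
    """Load url in headless Firefox, wait for the product links, return them."""
    # Imported here so collect_links can be tried out without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # Poll until the page's JavaScript has rendered at least one link;
        # raises selenium's TimeoutException if none appears in time.
        WebDriverWait(driver, timeout).until(lambda d: collect_links(d))
        return collect_links(driver)
    finally:
        driver.quit()
```

Calling `scrape(...)` with the search page's URL would then return a list of `(title, href)` tuples, which can be cleaned up and written out the same way the original `link_parser` does.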

Source

2014-09-25 17:29:22 roippi

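Thank you very much for looking into my problem, roippi, and for saving me from wasting more time using urllib on this page. I'll start looking into headless-browser solutions. I really appreciate your answer. – user3677501 2014-09-25 17:32:46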
