2017-08-07 24 views
1

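BeautifulSoup: output starts changing after a certain number of loop iterations

I got an answer to a question I asked on SO here, and the code seems to work when I run it once.

However, when I try to run it in a loop, the results start changing after the third iteration. This is just an example that requests the same URL every time.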

from bs4 import BeautifulSoup
import requests
import re

for x in range(5):
    url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
    html = requests.get(url, headers={'Cookie': 'PHPSESSID=notimportant'})
    soup = BeautifulSoup(html.text, "lxml")

    tags = list(soup.find_all('span', {'class': 'PrintHistRed'}))
    tags.extend(list(soup.find_all('img', alt=re.compile('Radio|Checkbox')))[2:])  # 2: skip "are you an adviser" at the top
    tags.extend([t.parent for t in soup.find_all(text="No Information Filed")])

    output = []

    for entry in sorted(tags):
        if entry.name == 'img':
            alt = entry['alt']
            if 'Radio' in alt:
                output.append('NO' if 'not selected' in alt else 'YES')
            else:
                output.append('O' if 'not checked' in alt else 'X')
        else:
            output.append(entry.text)

    print(output[:9])

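I tried putting time.sleep() in various places in the code, thinking the timing must be the cause, but no luck. I also wondered whether it has something to do with the cookie, but I really have no idea...

Any help is much appreciated!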

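+0

What happens if you put the imports inside the loop?

+0

That actually makes the output start changing after the second iteration, which is interesting. Not sure whether that gives anyone a clue...

+1

OK, that is weird.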

Answer

0

You are getting the weird behavior because your code sorts objects (of type bs4.element.Tag, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag) rather than strings.

Change:

for entry in sorted(tags):

to:

for entry in tags:

The output is then:

[u'APEX INVESTMENT FUND V, L.P.', u'805-2054766781', u'Delaware', u'United States', u'$\xa07,402,178', u'$\xa05,000,000', u'47', u'4', u'28'] 
[u'APEX INVESTMENT FUND V, L.P.', u'805-2054766781', u'Delaware', u'United States', u'$\xa07,402,178', u'$\xa05,000,000', u'47', u'4', u'28'] 
[u'APEX INVESTMENT FUND V, L.P.', u'805-2054766781', u'Delaware', u'United States', u'$\xa07,402,178', u'$\xa05,000,000', u'47', u'4', u'28'] 
[u'APEX INVESTMENT FUND V, L.P.', u'805-2054766781', u'Delaware', u'United States', u'$\xa07,402,178', u'$\xa05,000,000', u'47', u'4', u'28'] 
[u'APEX INVESTMENT FUND V, L.P.', u'805-2054766781', u'Delaware', u'United States', u'$\xa07,402,178', u'$\xa05,000,000', u'47', u'4', u'28'] 
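To illustrate why the plain sort is unreliable, here is a minimal sketch that does not use BeautifulSoup at all. The Node class is a hypothetical stand-in for a parsed tag; the assumption (which I have not verified against every bs4 release) is that Tag, like Node, defines no ordering methods. Under Python 2, objects like that are ordered by memory address, which can differ from one iteration to the next; under Python 3 the same call raises TypeError. An explicit sort key avoids both.

```python
class Node(object):
    """Hypothetical stand-in for a parsed tag; it defines no ordering methods."""

    def __init__(self, name, position):
        self.name = name
        self.position = position  # index in document order, assumed to be known


nodes = [Node("img", 2), Node("span", 0), Node("span", 1)]

try:
    sorted(nodes)
    # Only reached on Python 2, where the order follows memory addresses.
    outcome = "address order"
except TypeError:
    # Python 3 refuses to order objects that define no comparison.
    outcome = "TypeError"

# An explicit key makes the order deterministic on either version.
by_position = sorted(nodes, key=lambda node: node.position)
names_in_order = [node.name for node in by_position]
print(outcome, names_in_order)
```

The point is only that sorted() needs something meaningful to compare; for parsed tags, the position in the document is the natural candidate.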

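Update in response to the comment: if you need to preserve the order, try something like this (you can condense the code further if you like; the two statements are not strictly needed):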

from bs4 import BeautifulSoup
import requests
import re

for x in range(5):
    url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
    html = requests.get(url, headers={'Cookie': 'PHPSESSID=notimportant'})
    soup = BeautifulSoup(html.text, "lxml")

    regexp = re.compile(r'Radio|Checkbox')
    mytags = []
    # A single find_all call returns the tags in document order.
    tags = soup.find_all(['span', 'img'])
    for tag in tags:
        if (tag.has_attr('class') and 'PrintHistRed' in tag['class']) or (tag.has_attr('alt') and regexp.search(tag['alt'])):
            mytags.append(tag)
        elif tag.text == "No Information Filed":
            mytags.append(tag.parent)

    output = []

    # Iterate without sorting so the document order is kept.
    for entry in mytags:
        if entry.name == 'img':
            alt = entry['alt']
            if 'Radio' in alt:
                output.append('NO' if 'not selected' in alt else 'YES')
            else:
                output.append('O' if 'not checked' in alt else 'X')
        else:
            output.append(entry.text)

    print(output)
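+0

I need the output ordered the way the items appear on the website. This does not give the output in the right order, which is what sorted(tags) (sometimes) did. I should extend the sample output to make clear what I mean.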
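One way to get the website order while keeping the three separate searches from the question is to sort on each tag's position in the document rather than on the tag itself. The sketch below is only an illustration under stated assumptions: the HTML string is a made-up miniature of the page (the real markup may differ), it uses the built-in "html.parser" so it runs without lxml, and the position dict and the labels list are hypothetical names introduced here. The dict is keyed by id(tag) so that two tags with identical content are still told apart.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical miniature of the page; the real markup may differ.
HTML = """
<div>
  <img alt="Radio button selected"/>
  <img alt="Radio button not selected"/>
  <span class="PrintHistRed">APEX INVESTMENT FUND V, L.P.</span>
  <img alt="Checkbox not checked"/>
  <p>No Information Filed</p>
  <span class="PrintHistRed">Delaware</span>
  <img alt="Radio button selected"/>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Same three searches as in the question, collected in "wrong" order.
tags = list(soup.find_all("span", {"class": "PrintHistRed"}))
tags.extend(list(soup.find_all("img", alt=re.compile("Radio|Checkbox")))[2:])
tags.extend(t.parent for t in soup.find_all(string="No Information Filed"))

# Map every tag object to its index in document order.
position = {id(tag): i for i, tag in enumerate(soup.find_all(True))}

# Sort by document position instead of comparing Tag objects directly.
ordered = sorted(tags, key=lambda tag: position[id(tag)])

labels = [tag.get("alt") or tag.get_text() for tag in ordered]
print(labels)
```

With this sample, labels comes out in page order: the fund name, the checkbox, "No Information Filed", "Delaware", then the last radio button. The same key could be dropped into the original `for entry in sorted(tags):` line.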
