2013-02-19 · 28 views · 2 votes

Python - make a script loop until a condition is met, using a different proxy address for each loop

I am the definition of a noob. I know almost nothing about Python and I'm looking for help. I can read code well enough to change variables to suit my needs, but when I try to do something the original code wasn't built for... I get lost.

So here's the deal: I found a Craigslist (CL) flagging script that originally searched every CL site and flagged posts containing a specific keyword (it was written to flag every post mentioning Scientology).

I changed it to search only the CL sites in my general area (15 sites instead of 437), and it still looks for a specific keyword, which I've changed. I want to automatically flag someone who keeps spamming; I do a lot of business on CL, so sorting the real mail from the spam is getting difficult.

I would like the script to keep looping until it can no longer find posts that meet the criteria, changing the proxy server after each loop. Also, where in the script do the proxy/IP addresses go?

I look forward to your replies.

Here is the changed code I have:

#!/usr/bin/env python
# -*- coding: utf-8 -*-


import urllib
from twill.commands import * # gives us go()

areas = ['sfbay', 'chico', 'fresno', 'goldcountry', 'humboldt', 'mendocino',
         'modesto', 'monterey', 'redding', 'reno', 'sacramento', 'siskiyou',
         'stockton', 'yubasutter', 'reno']

def expunge(url, area):
    page = urllib.urlopen(url).read() # <-- and v and vv gets you urls of ind. postings
    page = page[page.index('<hr>'):].split('\n')[0]
    page = [i[:i.index('">')] for i in page.split('href="')[1:-1] if '<font size="-1">' in i]

    for u in page:
        num = u[u.rfind('/')+1:u.index('.html')] # the number of the posting (like 34235235252)
        spam = 'https://post.craigslist.org/flag?flagCode=15&postingID='+num # url for flagging as spam
        go(spam) # flag it


print 'Checking ' + str(len(areas)) + ' areas...'

for area in ['http://' + a + '.craigslist.org/' for a in areas]:
    ujam = area + 'search/?query=james+"916+821+0590"+&catAbb=hhh'
    udre = area + 'search/?query="DRE+%23+01902542+"&catAbb=hhh'
    try:
        jam = urllib.urlopen(ujam).read()
        dre = urllib.urlopen(udre).read()
    except:
        print 'tl;dr error for ' + area

    if 'Found: ' in jam:
        print 'Found results for "James 916 821 0590" in ' + area
        expunge(ujam, area)
        print 'All "James 916 821 0590" listings marked as spam for area'

    if 'Found: ' in dre:
        print 'Found results for "DRE # 01902542" in ' + area
        expunge(udre, area)
        print 'All "DRE # 01902542" listings marked as spam for area'

If you only use 'go', only import 'go': 'from twill.commands import go' – askewchan 2013-02-19 21:17:19


ImportError: No module named go – 2013-02-19 22:18:10


Strange: http://twill.idyll.org/python-api.html says 'from twill.commands import go' – askewchan 2013-02-19 22:24:07

Answers


You can create a perpetual loop like this:

while True: 
    if condition: 
        break 

itertools has tricks for repetition: http://docs.python.org/2/library/itertools.html

In particular, check out itertools.cycle.

(These are pointers in the right direction. You can build a solution out of one, the other, or even both.)
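Putting the two hints together, a loop that draws a different proxy address on every pass could be sketched like this (the proxy list and the remaining counter are made-up placeholders, not part of the original script):

```python
import itertools

# Placeholder proxy addresses -- substitute real ones.
proxies = ['108.60.219.136:8080', '198.144.186.98:3128', '66.55.153.226:8080']
proxy_pool = itertools.cycle(proxies)  # repeats the list forever, wrapping around

remaining = 3  # stand-in for "posts still matching the search"
passes = []
while True:
    proxy = next(proxy_pool)   # next proxy for this pass
    passes.append(proxy)
    remaining -= 1             # pretend flagging removed one match
    if remaining == 0:         # condition met: nothing left to flag
        break

print(passes)                  # each pass used a different proxy
```

Each trip through the `while` pulls the next address from the cycle, so the real script would only have to do its searching and flagging between `next(proxy_pool)` and the break condition.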


Sorry, I don't get it.. I tried adding repeat() into the code, but I keep getting: Traceback (most recent call last): File '/home/quonundrum/Desktop/CL.py', line 43, in repeat('spam,4') NameError: name 'repeat' is not defined >>> – 2013-02-19 21:57:40


'import itertools as it' then call 'it.repeat()' – askewchan 2013-02-19 22:27:31


I tried it.repeat('go,4'), it.repeat('go(spam),4'), it.repeat('expunge'), it.repeat('ujam').. and a whole bunch of others... it doesn't repeat, but it doesn't give any errors either – 2013-02-19 22:54:55


I made some changes to your code. As far as I can tell, the function expunge already loops through all the results in a page, so I'm not sure what you need a loop for, but at the end there's an example of how to check whether results were found, though there is no loop to break out of.

No idea how to change the proxy/IP.

By the way, you have 'reno' twice.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib
from twill.commands import go

areas = ['sfbay', 'chico', 'fresno', 'goldcountry', 'humboldt',
         'mendocino', 'modesto', 'monterey', 'redding', 'reno',
         'sacramento', 'siskiyou', 'stockton', 'yubasutter']
queries = ['james+"916+821+0590"', '"DRE+%23+01902542"']

def expunge(url, area):
    page = urllib.urlopen(url).read() # <-- and v and vv gets you urls of ind. postings
    page = page[page.index('<hr>'):].split('\n')[0]
    page = [i[:i.index('">')] for i in page.split('href="')[1:-1] if '<font size="-1">' in i]

    for u in page:
        num = u[u.rfind('/')+1:u.index('.html')] # the number of the posting (like 34235235252)
        spam = 'https://post.craigslist.org/flag?flagCode=15&postingID='+num # url for flagging as spam
        go(spam) # flag it

print 'Checking ' + str(len(areas)) + ' areas...'

for area in areas:
    for query in queries:
        qurl = 'http://' + area + '.craigslist.org/search/?query=' + query + '+&catAbb=hhh'
        try:
            q = urllib.urlopen(qurl).read()
        except:
            print 'tl;dr error for {} in {}'.format(query, area)
            break

        if 'Found: ' in q:
            print 'Found results for {} in {}'.format(query, area)
            expunge(qurl, area)
            print 'All {} listings marked as spam for area'.format(query)
        elif 'Nothing found for that search' in q:
            print 'No results for {} in {}'.format(query, area)
            break
        else:
            break
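To get the "keep going until nothing is found" behaviour the question asks about, the area/query scan above could be wrapped in an outer loop that stops once a full pass finds no results. A minimal sketch of that control flow, where fake_results stands in for what the searches would return (it is not part of the answer's code):

```python
# fake_results maps an area to how many matching posts are still up;
# a real version would run the area/query scan from above instead.
fake_results = {'sfbay': 2, 'reno': 1, 'chico': 0}

passes = 0
while True:
    passes += 1
    found_any = False
    for area in fake_results:
        if fake_results[area] > 0:
            found_any = True          # this pass found something
            fake_results[area] -= 1   # pretend flagging removed one post
    if not found_any:
        break                         # a whole pass found nothing: stop

print('Done after %d passes' % passes)  # 3 passes for this fake data
```

The key is the found_any flag: it resets each pass, any hit sets it, and the loop only exits after a clean sweep.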

Cool, that looks much better. Is there a way to make it keep running until it doesn't get any more results? – 2013-02-19 23:57:53


You mean you expect the results page to change while the program is running? – askewchan 2013-02-20 00:02:55


It shows in the shell what it found/flagged, so I was wondering if there's a way to have the script keep running until there are no more results for the keywords being searched (i.e. all results keep getting re-flagged until they're removed). – 2013-02-20 00:27:52


I made some changes... not sure how well they work, but I'm not getting any errors. Please let me know if you see anything wrong/missing. - Thanks

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib, urllib2


# Note: each install_opener() call replaces the previous global opener,
# so as written only the last proxy (proxy5) actually handles requests.
proxy = urllib2.ProxyHandler({'https': '108.60.219.136:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
proxy2 = urllib2.ProxyHandler({'https': '198.144.186.98:3128'})
opener2 = urllib2.build_opener(proxy2)
urllib2.install_opener(opener2)
proxy3 = urllib2.ProxyHandler({'https': '66.55.153.226:8080'})
opener3 = urllib2.build_opener(proxy3)
urllib2.install_opener(opener3)
proxy4 = urllib2.ProxyHandler({'https': '173.213.113.111:8080'})
opener4 = urllib2.build_opener(proxy4)
urllib2.install_opener(opener4)
proxy5 = urllib2.ProxyHandler({'https': '198.154.114.118:3128'})
opener5 = urllib2.build_opener(proxy5)
urllib2.install_opener(opener5)


areas = ['sfbay', 'chico', 'fresno', 'goldcountry', 'humboldt',
         'mendocino', 'modesto', 'monterey', 'redding', 'reno',
         'sacramento', 'siskiyou', 'stockton', 'yubasutter']
queries = ['james+"916+821+0590"', '"DRE+%23+01902542"']

def expunge(url, area):
    page = urllib.urlopen(url).read() # <-- and v and vv gets you urls of ind. postings
    page = page[page.index('<hr>'):].split('\n')[0]
    page = [i[:i.index('">')] for i in page.split('href="')[1:-1] if '<font size="-1">' in i]

    for u in page:
        num = u[u.rfind('/')+1:u.index('.html')] # the number of the posting (like 34235235252)
        # urlopen() goes through the installed proxy opener and performs the flagging;
        # the response objects it returns are not URLs, so don't pass them to go()
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=15&postingID='+num) # flag it
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=28&postingID='+num) # flag it
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=16&postingID='+num) # flag it

print 'Checking ' + str(len(areas)) + ' areas...'

for area in areas:
    for query in queries:
        qurl = 'http://' + area + '.craigslist.org/search/?query=' + query + '+&catAbb=hhh'
        try:
            q = urllib.urlopen(qurl).read()
        except:
            print 'tl;dr error for {} in {}'.format(query, area)
            break

        if 'Found: ' in q:
            print 'Found results for {} in {}'.format(query, area)
            expunge(qurl, area)
            print 'All {} listings marked as spam for {}'.format(query, area)
            print ''
            print ''
        elif 'Nothing found for that search' in q:
            print 'No results for {} in {}'.format(query, area)
            print ''
            print ''
            break
        else:
            break
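One caveat about the proxy setup above: since urllib2.install_opener() replaces the single process-wide opener, installing five openers in a row leaves only the last one in effect. To actually rotate proxies, one approach is to keep an opener per proxy and call its open() method directly instead of installing it. Here is a sketch of that rotation logic, with make_opener as a hypothetical stand-in for urllib2.build_opener(urllib2.ProxyHandler({'https': proxy})) so the idea is testable on its own:

```python
import itertools

# The addresses from the code above; they may well be dead by now.
proxy_addresses = ['108.60.219.136:8080', '198.144.186.98:3128',
                   '66.55.153.226:8080']

def make_opener(proxy):
    # Stand-in for urllib2.build_opener(urllib2.ProxyHandler({'https': proxy}));
    # the real opener's .open(url) would fetch through that proxy.
    return {'proxy': proxy}

# Rotate through the openers instead of install_opener()-ing each in turn.
opener_pool = itertools.cycle([make_opener(p) for p in proxy_addresses])

used = [next(opener_pool)['proxy'] for _ in range(4)]
print(used)  # the fourth request wraps around to the first proxy
```

In the real script, each search or flag request would call next(opener_pool) and use that opener's open() instead of urllib.urlopen()/urllib2.urlopen().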
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib, urllib2


# Note: as above, only the last install_opener() call (proxy5) takes effect.
proxy = urllib2.ProxyHandler({'https': '108.60.219.136:8080'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
proxy2 = urllib2.ProxyHandler({'https': '198.144.186.98:3128'})
opener2 = urllib2.build_opener(proxy2)
urllib2.install_opener(opener2)
proxy3 = urllib2.ProxyHandler({'https': '66.55.153.226:8080'})
opener3 = urllib2.build_opener(proxy3)
urllib2.install_opener(opener3)
proxy4 = urllib2.ProxyHandler({'https': '173.213.113.111:8080'})
opener4 = urllib2.build_opener(proxy4)
urllib2.install_opener(opener4)
proxy5 = urllib2.ProxyHandler({'https': '198.154.114.118:3128'})
opener5 = urllib2.build_opener(proxy5)
urllib2.install_opener(opener5)


areas = ['capecod']
queries = ['rent', 'rental', 'home', 'year', 'falmouth', 'lease', 'credit',
           'tenant', 'apartment', 'bedroom', 'bed', 'bath']

def expunge(url, area):
    page = urllib.urlopen(url).read() # <-- and v and vv gets you urls of ind. postings
    page = page[page.index('<hr>'):].split('\n')[0]
    page = [i[:i.index('">')] for i in page.split('href="')[1:-1] if '<font size="-1">' in i]

    for u in page:
        num = u[u.rfind('/')+1:u.index('.html')] # the number of the posting (like 34235235252)
        # urlopen() goes through the installed proxy opener and performs the flagging
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=15&postingID='+num) # flag it
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=28&postingID='+num) # flag it
        urllib2.urlopen('https://post.craigslist.org/flag?flagCode=16&postingID='+num) # flag it

print 'Checking ' + str(len(areas)) + ' areas...'

for area in areas:
    for query in queries:
        qurl = 'http://' + area + '.craigslist.org/search/?query=' + query + '+&catAbb=hhh'
        try:
            q = urllib.urlopen(qurl).read()
        except:
            print 'tl;dr error for {} in {}'.format(query, area)
            break

        if 'Found: ' in q:
            print 'Found results for {} in {}'.format(query, area)
            expunge(qurl, area)
            print 'All {} listings marked as spam for {}'.format(query, area)
            print ''
            print ''
        elif 'Nothing found for that search' in q:
            print 'No results for {} in {}'.format(query, area)
            print ''
            print ''
            break
        else:
            break