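Using a Common Crawl location as Amazon EMR input with Python mrjob
I started using mrjob only a few days ago and have tried a few low- and mid-level tasks with it. Now I am stuck on using a Common Crawl [referred to as CC from now on] location as the input to EMR with Python mrjob.
My config file looks like this: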
runners:
  emr:
    aws_access_key_id: <AWS Access Key>
    aws_secret_access_key: <AWS Secret Access Key>
    aws_region: us-east-1
    ec2_key_pair: cslab
    ec2_key_pair_file: ~/cslab.pem
    ec2_instance_type: m1.small
    num_ec2_instances: 5
  local:
    base_tmp_dir: /tmp
In short: I am trying to get the number of words in each web page of a site.
In full: see my code below.
My code:
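import urlparse

import boto
import warc
from boto.s3.key import Key
from gzipstream import GzipStreamFile
from mrjob.job import MRJob


class MRcount(MRJob):
    # ...
    def mapper(self, _, s3_path):
        # Each input line is treated as a path into the public CC bucket.
        s3_url_parsed = urlparse.urlparse(s3_path)
        bucket_name = s3_url_parsed.netloc
        key_path = s3_url_parsed.path[1:]
        conn = boto.connect_s3()
        bucket = conn.get_bucket('aws-publicdatasets', validate=False)
        key = Key(bucket, s3_path)
        # Assumed step, missing from the posted snippet: stream the WET file record by record.
        for record in warc.WARCFile(fileobj=GzipStreamFile(key)):
            webpage_text = record.payload.read()
            # Emit the page URL with its whitespace-separated word count.
            yield record.header['warc-target-uri'], len(webpage_text.split())


if __name__ == '__main__':
    MRcount.run()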
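To check the counting logic separately from S3 and EMR, the word-count step can be sketched on its own. This is only a minimal sketch: `count_words`, `url_word_counts`, and the sample records are hypothetical names and data, not part of the job above.

```python
def count_words(text):
    """Return the number of whitespace-separated tokens in text."""
    return len(text.split())


def url_word_counts(records):
    """Yield (url, word_count) for each (url, text) pair, like the mapper does."""
    for url, text in records:
        yield url, count_words(text)


# Hypothetical sample data standing in for WET records.
sample_records = [
    ("http://example.com/a", "hello common crawl"),
    ("http://example.com/b", "one  two\nthree four"),
]

for url, count in url_word_counts(sample_records):
    print(url, count)
```

If this behaves as expected locally, any remaining failure is more likely in the S3 access or the EMR setup than in the counting itself.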
Everything was fine until now, when I tried to run it.
Command:
$ python mr_crawl.py -r emr s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/wet.paths.gz
Error:
boto.exception.S3ResponseError: S3ResponseError: 301 Moved Permanently
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>PermanentRedirect</Code><Message>The bucket you are attempting to access must be addressed using the specified endpoint. Please send all future requests to this endpoint.</Message><RequestId>06660583263444FC</RequestId><Bucket>smarkets-db</Bucket><HostId>TCZJTKZ8wo8V1h0xjkOI6grojs/r9IBkhMOcvolXv06QEtxTX89M55aLTPGOo/ht</HostId><Endpoint>eu-west-bucket.s3.amazonaws.com</Endpoint></Error>
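The error body itself says which bucket and endpoint S3 is complaining about, so it can help to pull those fields out. Below is a small sketch using only the standard library; `parse_s3_error` is a hypothetical helper, and the XML is the error text above with the `</Message>` tag assumed to be properly closed.

```python
import xml.etree.ElementTree as ET


def parse_s3_error(body):
    """Map each child tag of an S3 <Error> document to its text."""
    root = ET.fromstring(body)
    return {child.tag: child.text for child in root}


# Error body copied from the traceback (Message tag assumed well-formed).
error_body = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<Error><Code>PermanentRedirect</Code>'
    b'<Message>The bucket you are attempting to access must be addressed '
    b'using the specified endpoint. Please send all future requests to '
    b'this endpoint.</Message>'
    b'<RequestId>06660583263444FC</RequestId>'
    b'<Bucket>smarkets-db</Bucket>'
    b'<HostId>TCZJTKZ8wo8V1h0xjkOI6grojs/r9IBkhMOcvolXv06QEtxTX89M55aLTPGOo/ht</HostId>'
    b'<Endpoint>eu-west-bucket.s3.amazonaws.com</Endpoint></Error>'
)

fields = parse_s3_error(error_body)
print(fields["Code"], fields["Bucket"], fields["Endpoint"])
```

Note that the bucket reported here is `smarkets-db`, not `aws-publicdatasets`, so the redirect may concern a different bucket than the one the mapper opens.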
I thought this was caused by the region in my config file, so I removed it, but then I got a new error.
My new config file:
runners:
  emr:
    aws_access_key_id: <AWS Access Key>
    aws_secret_access_key: <AWS Secret Access Key>
    ec2_key_pair: cslab
    ec2_key_pair_file: ~/cslab.pem
    ec2_instance_type: m1.small
    num_ec2_instances: 5
  local:
    base_tmp_dir: /tmp
I now get the following SSH error:
using configs in /etc/mrjob.conf
using existing scratch bucket mrjob-4db6342a70e021ad
using s3://mrjob-4db6342a70e021ad/tmp/ as our scratch dir on S3
creating tmp directory /tmp/word_count.20140603.181541.006786
writing master bootstrap script to /tmp/word_count.20140603.181541.006786/b.py
Copying non-input files into s3://mrjob-4db6342a70e021ad/tmp/word_count.matthew.20140603.181541.006786/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-3DCN7LULSRILW
Created new job flow j-3DCN7LULSRILW
Job on job flow j-3DCN7LULSRILW failed with status FAILED: The given SSH key name was invalid
Logs are in s3://mrjob-4db6342a70e021ad/tmp/logs/j-3DCN7LULSRILW/
Scanning S3 logs for probable cause of failure
Waiting 5.0s for S3 eventual consistency
Terminating job flow: j-3DCN7LULSRILW
Traceback (most recent call last):
  File "word_count.py", line 16, in <module>
    MRcount.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 809, in _run
    self._wait_for_job_to_complete()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 1599, in _wait_for_job_to_complete
    raise Exception(msg)
Exception: Job on job flow j-3DCN7LULSRILW failed with status FAILED: The given SSH key name was invalid
Thanks,
The ssh-key name must be the same as the name in the AWS console. – Pykler
@Pykler I did not provide an ssh-key in my code. – The6thSense
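Following up on the comments: the job code never names a key, but the config file does, via `ec2_key_pair: cslab`. The sketch below shows where that name comes from. The `config` dict is a hand-written mirror of the second mrjob.conf (credentials omitted) and `emr_key_pair_settings` is a hypothetical helper, not an mrjob function.

```python
# Dict mirroring the second mrjob.conf from the question (credentials omitted).
config = {
    "runners": {
        "emr": {
            "ec2_key_pair": "cslab",
            "ec2_key_pair_file": "~/cslab.pem",
            "ec2_instance_type": "m1.small",
            "num_ec2_instances": 5,
        },
        "local": {"base_tmp_dir": "/tmp"},
    }
}


def emr_key_pair_settings(conf):
    """Return the key-pair related options of the emr runner, if any."""
    emr = conf.get("runners", {}).get("emr", {})
    return {k: v for k, v in emr.items() if k.startswith("ec2_key_pair")}


settings = emr_key_pair_settings(config)
print(settings)
# The region option is absent in this version of the config.
print("aws_region" in config["runners"]["emr"])
```

EC2 key pairs are defined per region, so one possible explanation (an assumption, not confirmed in this thread) is that with `aws_region` removed the job flow starts in a region that has no key pair named `cslab`.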