2015-08-26

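HTCondor jobs not running

I can't get HTCondor to run my jobs. I've been banging my head against this, and I've started trying random things, so I figured I should ask for guidance.

I installed HTCondor 8.2.9 from the website on Ubuntu 15.04. Here is some information about my system.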

$ cat /etc/condor/condor_config.local 
# 
# Local Condor Config 
# 

CONDOR_HOST = aidan-laptop 
DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR 
#FLOCK_TO = aidan-laptop 
FLOCK_FROM = aidan-laptop localhost 

My current hostname

$ hostname 
aidan-laptop 

My hosts file

$ cat /etc/hosts 
127.0.0.1 localhost 
127.0.1.1 aidan-laptop 

# The following lines are desirable for IPv6 capable hosts 
::1  ip6-localhost ip6-loopback 
fe00::0 ip6-localnet 
ff00::0 ip6-mcastprefix 
ff02::1 ip6-allnodes 
ff02::2 ip6-allrouters 

My current status

$ condor_status 
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@aidan-laptop LINUX      X86_64 Unclaimed Idle     0.090  1976  0+00:04:39
slot2@aidan-laptop LINUX      X86_64 Unclaimed Idle     0.000  1976  0+00:05:05
slot3@aidan-laptop LINUX      X86_64 Unclaimed Idle     0.000  1976  0+00:05:06
slot4@aidan-laptop LINUX      X86_64 Unclaimed Idle     0.000  1976  0+00:05:07
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     4     0       0         4       0          0        0

               Total     4     0       0         4       0          0        0

Looking at the queue

$ condor_q 


-- Submitter: aidan-laptop : <192.168.1.151:39444> : aidan-laptop 
ID  OWNER   SUBMITTED  RUN_TIME ST PRI SIZE CMD    
    1.0 aidan   8/26 09:27 0+00:00:00 I 0 0.0 hello.sh   
    1.1 aidan   8/26 09:27 0+00:00:00 I 0 0.0 hello.sh   
    1.2 aidan   8/26 09:27 0+00:00:00 I 0 0.0 hello.sh   

3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended 
$ date 
Wed Aug 26 09:52:33 PDT 2015 
$ lsb_release -r 
Release: 15.04 

Trying to analyze the jobs hangs for a minute and then prints an error

$ date; condor_q -pool 1.00 -analyze; date 
Wed Aug 26 09:58:01 PDT 2015 
Error: Could not fetch startd ads 
Wed Aug 26 09:59:01 PDT 2015 

And here is my StartLog across a stop and start:

$ sudo service condor stop 
$ sudo rm /var/log/condor/StartLog 
$ date; sudo service condor start 
Wed Aug 26 10:01:02 PDT 2015 
$ sleep 1m; date; condor_status 
Wed Aug 26 10:02:19 PDT 2015 
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@aidan-laptop LINUX      X86_64 Unclaimed Idle     0.160  1976  0+00:00:04
slot2@aidan-laptop LINUX      X86_64 Unclaimed Idle     0.000  1976  0+00:00:31
slot3@aidan-laptop LINUX      X86_64 Unclaimed Idle     0.000  1976  0+00:00:32
slot4@aidan-laptop LINUX      X86_64 Unclaimed Idle     0.000  1976  0+00:00:33
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     4     0       0         4       0          0        0

               Total     4     0       0         4       0          0        0
$ date; cat /var/log/condor/StartLog 
Wed Aug 26 10:02:35 PDT 2015 
08/26/15 10:01:03 ****************************************************** 
08/26/15 10:01:03 ** condor_startd (CONDOR_STARTD) STARTING UP 
08/26/15 10:01:03 ** /usr/sbin/condor_startd 
08/26/15 10:01:03 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1) 
08/26/15 10:01:03 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON 
08/26/15 10:01:03 ** $CondorVersion: 8.2.9 Aug 12 2015 BuildID: 335399 $ 
08/26/15 10:01:03 ** $CondorPlatform: x86_64_Ubuntu14 $ 
08/26/15 10:01:03 ** PID = 2487 
08/26/15 10:01:03 ** Log last touched time unavailable (No such file or directory) 
08/26/15 10:01:03 ****************************************************** 
08/26/15 10:01:03 Using config source: /etc/condor/condor_config 
08/26/15 10:01:03 Using local config sources: 
08/26/15 10:01:03 /etc/condor/condor_config.local 
08/26/15 10:01:03 config Macros = 60, Sorted = 60, StringBytes = 1596, TablesBytes = 2208 
08/26/15 10:01:03 CLASSAD_CACHING is ENABLED 
08/26/15 10:01:03 Daemon Log is logging: D_ALWAYS D_ERROR 
08/26/15 10:01:03 DaemonCore: command socket at <192.168.1.151:47358> 
08/26/15 10:01:03 DaemonCore: private command socket at <192.168.1.151:47358> 
08/26/15 10:01:09 VM-gahp server reported an internal error 
08/26/15 10:01:09 VM universe will be tested to check if it is available 
08/26/15 10:01:09 History file rotation is enabled. 
08/26/15 10:01:09 Maximum history file size is: 20971520 bytes 
08/26/15 10:01:09 Number of rotated history files is: 2 
08/26/15 10:01:09 Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto 
slot type 0: Cpus: 1.000000, Memory: 1976, Swap: 25.00%, Disk: 25.00% 
slot type 0: Cpus: 1.000000, Memory: 1976, Swap: 25.00%, Disk: 25.00% 
slot type 0: Cpus: 1.000000, Memory: 1976, Swap: 25.00%, Disk: 25.00% 
slot type 0: Cpus: 1.000000, Memory: 1976, Swap: 25.00%, Disk: 25.00% 
08/26/15 10:01:09 slot1: New machine resource allocated 
08/26/15 10:01:09 Setting up slot pairings 
08/26/15 10:01:09 slot2: New machine resource allocated 
08/26/15 10:01:09 Setting up slot pairings 
08/26/15 10:01:09 slot3: New machine resource allocated 
08/26/15 10:01:09 Setting up slot pairings 
08/26/15 10:01:09 slot4: New machine resource allocated 
08/26/15 10:01:09 Setting up slot pairings 
08/26/15 10:01:09 CronJobList: Adding job 'mips' 
08/26/15 10:01:09 CronJobList: Adding job 'kflops' 
08/26/15 10:01:09 CronJob: Initializing job 'mips' (/usr/lib/condor/libexec/condor_mips) 
08/26/15 10:01:09 CronJob: Initializing job 'kflops' (/usr/lib/condor/libexec/condor_kflops) 
08/26/15 10:01:09 slot1: State change: IS_OWNER is false 
08/26/15 10:01:09 slot1: Changing state: Owner -> Unclaimed 
08/26/15 10:01:09 State change: RunBenchmarks is TRUE 
08/26/15 10:01:09 slot1: Changing activity: Idle -> Benchmarking 
08/26/15 10:01:09 BenchMgr:StartBenchmarks() 
08/26/15 10:01:09 slot2: State change: IS_OWNER is false 
08/26/15 10:01:09 slot2: Changing state: Owner -> Unclaimed 
08/26/15 10:01:09 State change: RunBenchmarks is TRUE 
08/26/15 10:01:09 slot2: Changing activity: Idle -> Benchmarking 
08/26/15 10:01:09 slot2: Changing activity: Benchmarking -> Idle 
08/26/15 10:01:09 slot3: State change: IS_OWNER is false 
08/26/15 10:01:09 slot3: Changing state: Owner -> Unclaimed 
08/26/15 10:01:09 State change: RunBenchmarks is TRUE 
08/26/15 10:01:09 slot3: Changing activity: Idle -> Benchmarking 
08/26/15 10:01:09 slot3: Changing activity: Benchmarking -> Idle 
08/26/15 10:01:09 slot4: State change: IS_OWNER is false 
08/26/15 10:01:09 slot4: Changing state: Owner -> Unclaimed 
08/26/15 10:01:09 State change: RunBenchmarks is TRUE 
08/26/15 10:01:09 slot4: Changing activity: Idle -> Benchmarking 
08/26/15 10:01:09 slot4: Changing activity: Benchmarking -> Idle 
08/26/15 10:01:35 State change: benchmarks completed 
08/26/15 10:01:35 slot1: Changing activity: Benchmarking -> Idle 

Let me know if more information is needed.

UPDATE:

I found this in the negotiator log. I can't figure out what it means.

08/26/15 11:20:15 ---------- Started Negotiation Cycle ---------- 
08/26/15 11:20:15 Phase 1: Obtaining ads from collector ... 
08/26/15 11:20:15 Getting startd private ads ... 
08/26/15 11:20:15 condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 5 bytes from collector at <127.0.1.1:9618>. 
08/26/15 11:20:15 IO: Failed to read packet header 
08/26/15 11:20:15 Couldn't fetch ads: communication error 
08/26/15 11:20:15 Aborting negotiation cycle 

Answer

The problem was that I was using localhost. I needed to use a hostname that is tied to a non-loopback IP address.
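To make that diagnosis checkable, here is a minimal sketch that looks up a name in a hosts-format file and reports whether it maps to a loopback address. The function names (`is_loopback`, `hosts_ip_for`, `check_condor_host`) are made up for illustration and are not HTCondor tools. The sketch only reads the file it is given, so it ignores DNS and any other resolver sources.

```shell
#!/bin/sh
# Sketch: decide whether a hostname maps to a loopback address in a
# hosts-format file. Function names are illustrative, not HTCondor tools.

# Succeeds (exit status 0) when the address is an IPv4 or IPv6 loopback.
is_loopback() {
    case "$1" in
        127.*|::1) return 0 ;;
        *)         return 1 ;;
    esac
}

# Prints the first address that a hosts-format file assigns to a name.
# Usage: hosts_ip_for FILE NAME
hosts_ip_for() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                  # drop comments
        { for (i = 2; i <= NF; i++)         # fields after the address are names
              if ($i == name) { print $1; exit } }
    ' "$1"
}

# Reports how NAME is mapped in FILE, flagging loopback mappings.
# Usage: check_condor_host FILE NAME
check_condor_host() {
    ip=$(hosts_ip_for "$1" "$2")
    if [ -z "$ip" ]; then
        echo "$2: not found in $1"
    elif is_loopback "$ip"; then
        echo "$2 -> $ip (loopback)"
    else
        echo "$2 -> $ip (ok)"
    fi
}
```

Running `check_condor_host /etc/hosts "$(hostname)"` against the hosts file from the question should report `aidan-laptop -> 127.0.1.1 (loopback)`, which matches the collector address `<127.0.1.1:9618>` in the negotiator log.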

So adding

192.168.1.3 master.aidan.condor 

to /etc/hosts

and changing the local Condor config to

CONDOR_HOST = master.aidan.condor 
DAEMON_LIST = MASTER, STARTD, SCHEDD, COLLECTOR, NEGOTIATOR 
#FLOCK_TO = aidan-laptop 
FLOCK_FROM = *.aidan.condor 

and changing the permissions on the log, error, and output files so that anyone can write to them fixed it.
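The permission change can be sketched as a small helper. The function name and the file names in the usage line are hypothetical, since the submit file is not shown here; substitute whatever your submit file names for `log`, `error`, and `output`.

```shell
#!/bin/sh
# Sketch: make a job's log, error, and output files writable by everyone.
# The function name is illustrative; pass the files named in your submit file.
make_world_writable() {
    # a+w adds write permission for user, group, and others
    chmod a+w "$@"
}
```

For example, `make_world_writable hello.log hello.err hello.out` would open up three files with those (assumed) names in the current directory.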

There may still be a problem with FLOCK_FROM, but I'll play with it and see.