2016-03-01 72 views

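pgpool-II in master/slave mode: what is the easiest way to trigger a failover?

I am testing a toy PostgreSQL setup on local virtual machines to see how pgpool behaves on failover. The basic setup has two database machines (192.168.0.2 and 192.168.0.3) and one pgpool machine (192.168.0.4). 192.168.0.3 is set up as a slave of 192.168.0.2 using streaming replication. pgpool-II is configured as follows: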

listen_addresses = '*' 
backend_hostname0 = '192.168.0.2' 
backend_port0 = 5432 
backend_weight0 = 1 
backend_data_directory0 = '/var/lib/postgresql/9.4/main/' 
backend_flag0 = 'ALLOW_TO_FAILOVER' 
backend_hostname1 = '192.168.0.3' 
backend_port1 = 5432 
backend_weight1 = 1 
backend_data_directory1 = '/var/lib/postgresql/9.4/main/' 
backend_flag1 = 'ALLOW_TO_FAILOVER' 
enable_pool_hba = on 
replication_mode = false 
master_slave_mode = on 
master_slave_sub_mode = 'stream' 
fail_over_on_backend_error = true 
failover_command = '/root/pgpool_failover_stream.sh %d %H /tmp/postgresql.trigger.5432' 
load_balance_mode = false 

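I have verified that all of this works: when I change data on the master, replication works, and a sample application can connect to the master, the slave, and pgpool-II and get the results I expect.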

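Next, I started a long-running application connected to pgpool, then tried to force a failover by SSHing into the master database server and stopping postgres (as root, service postgresql stop). My application keeps executing queries correctly, but no failover happens (the script is never run). I even tested connecting directly to the master database, and in that case stopping the postgres service does crash my application.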

What am I doing wrong? Have I misconfigured pgpool? Or is there a better way to trigger a failover?

Edit: As requested, here is the part of the log where the first errors appear:

... 
2016-03-15 18:47:15: pid 1232: DEBUG: initializing backend status 
2016-03-15 18:47:15: pid 1231: DEBUG: initializing backend status 
2016-03-15 18:47:15: pid 1230: DEBUG: initializing backend status 
2016-03-15 18:47:15: pid 1209: ERROR: failed to authenticate 
2016-03-15 18:47:15: pid 1209: DETAIL: invalid authentication message response type, Expecting 'R' and received 'E' 
2016-03-15 18:47:15: pid 1209: LOG: find_primary_node: checking backend no 1 
2016-03-15 18:47:15: pid 1209: ERROR: failed to authenticate 
2016-03-15 18:47:15: pid 1209: DETAIL: invalid authentication message response type, Expecting 'R' and received 'E' 
2016-03-15 18:47:15: pid 1209: DEBUG: find_primary_node: no primary node found 
... 

Oddly, I can still connect to pgpool and run queries, so there is clearly something I don't understand here.

Edit 2: These are the errors I get after service postgresql shutdown on the master. I am showing everything up to the point where I start shutting down pgpool.

... 
2016-03-16 17:24:57: pid 1012: DEBUG: session context: clearing doing extended query messaging. DONE 
2016-03-16 17:24:57: pid 1012: DEBUG: session context: setting doing extended query messaging. DONE 
2016-03-16 17:24:57: pid 1012: DEBUG: session context: setting query in progress. DONE 
2016-03-16 17:24:57: pid 1012: DEBUG: reading backend data packet kind 
2016-03-16 17:24:57: pid 1012: DETAIL: backend:0 of 2 kind = 'E' 
2016-03-16 17:24:57: pid 1012: DEBUG: processing backend response 
2016-03-16 17:24:57: pid 1012: DETAIL: received kind 'E'(45) from backend 
2016-03-16 17:24:57: pid 1012: ERROR: unable to forward message to frontend 
2016-03-16 17:24:57: pid 1012: DETAIL: FATAL error occured on backend 
2016-03-16 17:24:57: pid 1012: DEBUG: session context: setting query in progress. DONE 
2016-03-16 17:24:57: pid 1012: DEBUG: decide where to send the queries 
2016-03-16 17:24:57: pid 1012: DETAIL: destination = 3 for query= "DISCARD ALL" 
2016-03-16 17:24:57: pid 1012: DEBUG: waiting for query response 
2016-03-16 17:24:57: pid 1012: DETAIL: waiting for backend:0 to complete the query 
2016-03-16 17:24:57: pid 1012: FATAL: unable to read data from DB node 0 
2016-03-16 17:24:57: pid 1012: DETAIL: EOF encountered with backend 
2016-03-16 17:24:57: pid 998: DEBUG: reaper handler 
2016-03-16 17:24:57: pid 998: LOG: child process with pid: 1012 exits with status 256 
2016-03-16 17:24:57: pid 998: LOG: fork a new child process with pid: 1033 
2016-03-16 17:24:57: pid 998: DEBUG: reaper handler: exiting normally 
2016-03-16 17:24:57: pid 1033: DEBUG: initializing backend status 
2016-03-16 17:25:02: pid 1031: DEBUG: PCP child receives shutdown request signal 2 
2016-03-16 17:25:02: pid 1029: LOG: child process received shutdown request signal 2 
... 

Note that my sample application does in fact die when the master shuts down.

Edit 3: After setting sr_check_period, sr_check_user, and sr_check_password to sensible values, all the previous errors are gone. These are the errors I get in the new log:

2016-03-31 17:45:00: pid 18363: DEBUG: detect error: kind: 1 
2016-03-31 17:45:00: pid 18363: DEBUG: reading backend data packet kind 
2016-03-31 17:45:00: pid 18363: DETAIL: backend:0 of 2 kind = '1' 
... 
2016-03-31 17:45:00: pid 18363: DEBUG: detect error: kind: S 

Answers


There can be several reasons why the failover script is not being executed. The first step is to set log_destination to syslog and enable debug mode (debug_level = 1).
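Once logging is on, it helps to cut the pgpool log down to the lines that bear on failover. A minimal sketch, assuming the log ends up in a plain file whose path you pass in; the function name failover_clues and the list of patterns are my own choices, not anything pgpool provides:

```shell
#!/bin/sh
# Print only the log lines that are likely to explain a missing failover:
# errors, fatal errors, anything mentioning failover, and the primary-node search.
# $1 = path to the pgpool log file (an assumption; adjust to where your syslog writes).
failover_clues() {
    grep -E 'ERROR|FATAL|failover|find_primary_node' "$1"
}

# Example (hypothetical path):
#   failover_clues /var/log/postgresql/pgpool.log
```

Run against the startup log in the question, this would keep the "failed to authenticate" and "find_primary_node" lines and drop the DEBUG noise around them.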

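I have seen cases where the failover script did not receive the %d and %H arguments (special characters), and as a result the script could not ssh to the slave and create the trigger file.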
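To make an argument problem like that visible, the script can log exactly what it received before doing anything else. A minimal sketch of such a script, assuming the argument order from the question's failover_command (%d, %H, then the trigger file path), passwordless ssh from the pgpool machine to the standby as the postgres user, and a log file under /tmp. The names promote_command, main, and LOG_FILE are illustrative; this is not the script either poster actually used:

```shell
#!/bin/sh
# Assumed log location; override by exporting LOG_FILE before the script runs.
LOG_FILE="${LOG_FILE:-/tmp/pgpool_failover.log}"

# Print the command used to promote the standby: touch the trigger file over ssh.
# $1 = new master host (%H), $2 = trigger file path.
promote_command() {
    printf 'ssh -T postgres@%s touch %s\n' "$1" "$2"
}

main() {
    failed_node="$1"    # %d: id of the node that failed
    new_master="$2"     # %H: host name of the new master
    trigger_file="$3"   # path passed literally in failover_command

    # Log the raw arguments first, so empty or missing values show up in the log.
    echo "$(date): failed_node=$failed_node new_master=$new_master trigger_file=$trigger_file" >> "$LOG_FILE"

    if [ -z "$new_master" ] || [ -z "$trigger_file" ]; then
        echo "$(date): missing arguments, not promoting" >> "$LOG_FILE"
        return 1
    fi

    echo "$(date): running: $(promote_command "$new_master" "$trigger_file")" >> "$LOG_FILE"
    ssh -T "postgres@$new_master" touch "$trigger_file" >> "$LOG_FILE" 2>&1
}

# Only act when called with arguments, as pgpool does.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

The trigger file path has to match the trigger_file setting in the standby's recovery.conf, otherwise touching it does nothing. Running the script by hand as the user pgpool runs under is a quick way to test the ssh step on its own, but note that doing so really promotes the standby.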

If you post the log file, I can give more details.

Based on the new logs: I can see an error, "failed to authenticate". Can you check whether the following pgpool parameters are configured correctly:

health_check_user
health_check_password
recovery_user
recovery_password
wd_lifecheck_user
wd_lifecheck_password
sr_check_user
sr_check_password

I hope you have followed the step of changing the postgres user's password

alter user postgres password 'yourpassword' 

and make sure you use the same password everywhere.
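Since the same credentials have to appear in several places, a quick scan of pgpool.conf for the parameters listed above can catch one that is empty or missing. A rough sketch, assuming values are written in single quotes as in the configs in this thread; missing_credentials is a made-up helper name, and a parameter written without quotes would be reported as missing:

```shell
#!/bin/sh
# Print the name of every credential parameter that is absent or set to ''
# in a pgpool.conf-style file.
# $1 = path to the config file.
missing_credentials() {
    conf_file="$1"
    for key in health_check_user health_check_password \
               recovery_user recovery_password \
               wd_lifecheck_user wd_lifecheck_password \
               sr_check_user sr_check_password; do
        # Extract the single-quoted value of an uncommented "key = 'value'" line.
        # If the key appears more than once, only the last occurrence is inspected
        # (an arbitrary choice for this sketch).
        value=$(sed -n "s/^[[:space:]]*$key[[:space:]]*=[[:space:]]*'\([^']*\)'.*/\1/p" "$conf_file" | tail -n 1)
        if [ -z "$value" ]; then
            echo "$key"
        fi
    done
}

# Example (hypothetical path):
#   missing_credentials /etc/pgpool2/pgpool.conf
```

An empty result means all eight parameters have a non-empty value; it does not prove the values are correct, only that none was left blank.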

From the logs, it looks like an authentication problem. Can you tell me which version of pgpool you are using?

This is the configuration we use with 3 machines (1 master, 1 slave, and 1 pgpool machine). I have modified the settings to match your IP addresses:

    listen_addresses = '*'
    port = 5433 
    socket_dir = '/var/run/postgresql' 
    pcp_port = 9898 
    pcp_socket_dir = '/var/run/postgresql' 

    backend_hostname0 = '192.168.0.2' 
    backend_port0 = 5432 
    backend_weight0 = 1 
    backend_data_directory0 = '/var/lib/postgresql/9.4/main' 
    backend_flag0 = 'ALLOW_TO_FAILOVER' 

    backend_hostname1 = '192.168.0.3' 
    backend_port1 = 5432 
    backend_weight1 = 1 
    backend_data_directory1 = '/var/lib/postgresql/9.4/main' 
    backend_flag1 = 'ALLOW_TO_FAILOVER' 

    enable_pool_hba = on 
    pool_passwd = '' 
    authentication_timeout = 60 
    ssl = off 
    num_init_children = 4 
    max_pool = 2 
    child_life_time = 300 
    child_max_connections = 0 
    connection_life_time = 0 
    client_idle_limit = 0 
    log_destination = 'stderr,syslog' 
    print_timestamp = on 
    log_connections = on 
    log_hostname = on 
    log_statement = on 
    log_per_node_statement = on 
    log_standby_delay = 'none' 
    syslog_facility = 'LOCAL0' 
    syslog_ident = 'pgpool' 
    debug_level = 1 
    pid_file_name = '/var/run/postgresql/pgpool.pid' 
    logdir = '/var/log/postgresql' 
    connection_cache = on 
    reset_query_list = 'ABORT; DISCARD ALL' 

    replication_mode = off 
    replicate_select = off 
    insert_lock = on 
    lobj_lock_table = '' 
    replication_stop_on_mismatch = off 
    failover_if_affected_tuples_mismatch = off 

    load_balance_mode = off 
    ignore_leading_white_space = on 
    white_function_list = '' 
    black_function_list = 'nextval,setval' 

    master_slave_mode = on 
    master_slave_sub_mode = 'stream' 
    sr_check_period = 10 
    sr_check_user = 'postgres' 
    sr_check_password = 'postgres123' 
    delay_threshold = 0 
    follow_master_command = '' 
    parallel_mode = off 
    pgpool2_hostname = 'pgmaster' 

    system_db_hostname = 'localhost' 
    system_db_port = 5432 
    system_db_dbname = 'pgpool' 
    system_db_schema = 'pgpool_catalog' 
    system_db_user = 'pgpool' 
    system_db_password = '' 

    health_check_period = 5 
    health_check_timeout = 20 
    health_check_user = 'postgres' 
    health_check_password = 'postgres123' 
    health_check_max_retries = 2 
    health_check_retry_delay = 1 

    failover_command = '/usr/sbin/failover_modified.sh %d "%H" %P /var/lib/postgresql/9.4/main/pgsql.trigger.5432' 
    failback_command = '' 
    fail_over_on_backend_error = on 
    search_primary_node_timeout = 10 

    recovery_user = 'postgres' 
    recovery_password = 'postgres123' 
    recovery_1st_stage_command = '' 
    recovery_2nd_stage_command = '' 
    recovery_timeout = 90 
    client_idle_limit_in_recovery = 0 

    use_watchdog = off 
    trusted_servers = '' 
    ping_path = '/bin' 
    wd_hostname = '' 
    wd_port = 9000 
    wd_authkey = '' 
    delegate_IP = '' 
    ifconfig_path = '/sbin' 
    if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.255.0' 
    if_down_cmd = 'ifconfig eth0:0 down' 
    arping_path = '/usr/sbin' 
    arping_cmd = 'arping -U $_IP_$ -w 1' 

    clear_memqcache_on_escalation = on 
    wd_escalation_command = '' 

    wd_lifecheck_method = 'heartbeat' 
    wd_interval = 10 
    wd_heartbeat_port = 9694 
    wd_heartbeat_keepalive = 2 
    wd_heartbeat_deadtime = 30 
    heartbeat_destination0 = '192.168.0.2' 
    heartbeat_destination_port0 = 9694 
    heartbeat_device0 = '' 

    heartbeat_destination1 = '192.168.0.3' 
    wd_life_point = 3 
    wd_lifecheck_query = 'SELECT 1' 
    wd_lifecheck_dbname = 'postgres' 
    wd_lifecheck_user = 'postgres' 
    wd_lifecheck_password = 'postgres123' 

    relcache_expire = 0 
    relcache_size = 256 
    check_temp_table = on 

    memory_cache_enabled = off 
    memqcache_method = 'shmem' 
    memqcache_memcached_host = 'localhost' 
    memqcache_memcached_port = 11211 
    memqcache_total_size = 67108864 
    memqcache_max_num_cache = 1000000 
    memqcache_expire = 0 
    memqcache_auto_cache_invalidation = on 
    memqcache_maxcache = 409600 
    memqcache_cache_block_size = 1048576 
    memqcache_oiddir = '/var/log/pgpool/oiddir' 
    white_memqcache_table_list = '' 
    black_memqcache_table_list = '' 

Also, I hope you have modified pool_hba.conf to allow access to the master and slave servers.
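A direct login from the pgpool machine to each backend, as the user named in sr_check_user, shows quickly whether the credentials are accepted. A sketch, assuming psql is installed on the pgpool machine, the postgres database exists on both backends, and the password is supplied through PGPASSWORD or ~/.pgpass; the function names are illustrative:

```shell
#!/bin/sh
# Print the psql command that logs in to one server and asks whether it is a standby.
# $1 = host, $2 = port, $3 = user.
backend_check_command() {
    printf 'psql -h %s -p %s -U %s -d postgres -At -c "SELECT pg_is_in_recovery()"\n' "$1" "$2" "$3"
}

# Run the check against one server and report a failed login.
check_backend() {
    cmd=$(backend_check_command "$1" "$2" "$3")
    echo "== $cmd"
    sh -c "$cmd" || echo "login to $1:$2 as $3 failed"
}

# Example, using the addresses from this thread:
#   check_backend 192.168.0.2 5432 postgres
#   check_backend 192.168.0.3 5432 postgres
```

On a healthy pair the master prints f and the standby prints t. This exercises each backend's own pg_hba.conf and password rather than pool_hba.conf; pointing the same command at the pgpool host and port checks the pool_hba.conf side.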

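Hi Raveesh, thanks for the reply! I have enabled logging, and even at startup I noticed some errors that look like they could be related. I have edited my question to include the relevant information. – gdoug

Can you post the logs produced after shutting down the master? I don't think these logs point to the real reason why the failover script is not being executed. –

Updated again with the requested log information. – gdoug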