2015-07-10 133 views

I modified the thegeekinthecorner example so that it can send data continuously. I am using g++ 4.9.2. Communication between the client and the server is unstable.

I tried running the uninstall option of the latest official release from here: http://downloads.openfabrics.org/OFED/

OFED Distribution Software Installation Menu 

    1) View OFED Installation Guide 
    2) Install OFED Software 
    3) Show Installed Software 
    4) Configure IPoIB 
    5) Uninstall OFED Software 

    Q) Exit 

Select Option [1-5]:5 

Uninstalling the previous version of OFED 
Running rpm -e --allmatches libibverbs libibverbs-devel libibverbs-utils libmthca libmlx4 libcxgb3 libnes libipathverbs libibcm libibumad libibumad-devel libibmad ibacm librdmacm librdmacm-utils librdmacm-devel opensm opensm-libs dapl perftest mstflint ibutils infiniband-diags qperf infinipath-psm opensm opensm-libs libipathverbs dapl libibcm libibmad libibumad libibumad-devel libibverbs libibverbs-devel libibverbs-utils libipathverbs libmthca libmlx4 librdmacm librdmacm-devel librdmacm-utils ibacm ibutils ibutils-libs libnes infinipath-psm 
Failed to uninstall the previous installation 
See /tmp/OFED.22320.logs/ofed_uninstall.log 
[[email protected] OFED-1.5.4-20110726-0732]$ 
[[email protected] OFED-1.5.4-20110726-0732]$ 

If I instead just try to install OFED, I get this:

OFED Distribution Software Installation Menu 

    1) Basic (OFED modules and basic user level libraries) 
    2) HPC (OFED modules and libraries, MPI and diagnostic tools) 
    3) All packages (all of Basic, HPC) 
    4) Customize 

    Q) Exit 

Select Option [1-4]:3 

Please choose an implementation of MVAPICH2: 

1) OFA (IB and iWARP) 
2) uDAPL 
Implementation [1]: 1 

Enable ROMIO support [Y/n]: 

Enable shared library support [Y/n]: 

Enable Checkpoint-Restart support [y/N]: 
Kernel 3.10.0-229.7.2.el7.x86_64 is not supported. 
For the list of Supported Platforms and Operating Systems see 
/mnt/gluster/Downloads/OFED-1.5.4-20110726-0732/docs/OFED_release_notes.txt 
[[email protected] OFED-1.5.4-20110726-0732]$ 

[[email protected] Release]$ lspci | grep -i mel 
02:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR/10GigE] (rev b0) 
[[email protected] Release]$ 


[[email protected] Release]$ ibv_devinfo 
hca_id: mlx4_0 
    transport:   InfiniBand (0) 
    fw_ver:    2.7.200 
    node_guid:   0025:90ff:ff1a:081c 
    sys_image_guid:   0025:90ff:ff1a:081f 
    vendor_id:   0x02c9 
    vendor_part_id:   26428 
    hw_ver:    0xB0 
    board_id:   SM_2092000001000 
    phys_port_cnt:   1 
     port: 1 
      state:   PORT_ACTIVE (4) 
      max_mtu:  4096 (5) 
      active_mtu:  4096 (5) 
      sm_lid:   1 
      port_lid:  2 
      port_lmc:  0x00 
      link_layer:  InfiniBand 

[[email protected] Release]$ ifconfig -a 

ib0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 2044 
     inet 192.168.0.1 netmask 255.255.255.0 broadcast 192.168.0.255 
     inet6 fe80::225:90ff:ff1a:71 prefixlen 64 scopeid 0x20<link> 
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8). 
     infiniband 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 txqueuelen 256 (InfiniBand) 
     RX packets 5 bytes 280 (280.0 B) 
     RX errors 0 dropped 0 overruns 0 frame 0 
     TX packets 0 bytes 0 (0.0 B) 
     TX errors 0 dropped 27 overruns 0 carrier 0 collisions 0 

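Below are the client and the server. When I run the program, the client sends messages, but the number of messages it manages to send is erratic, and there are often error messages.

Client: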

#include <iostream> 
#include <thread> 

#include <netdb.h> 
#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include <unistd.h> 
#include <rdma/rdma_cma.h> 

#define TEST_NZ(x) do { if ((x)) die("error: " #x " failed (returned non-zero)."); } while (0) 
#define TEST_Z(x) do { if (!(x)) die("error: " #x " failed (returned zero/null)."); } while (0) 

const int BUFFER_SIZE = 2048; 
const int TIMEOUT_IN_MS = 500; /* ms */ 

struct context 
{ 
    struct ibv_context *ctx; 
    struct ibv_pd *pd; 
    struct ibv_cq *cq; 
    struct ibv_comp_channel *comp_channel; 

    pthread_t cq_poller_thread; 
}; 

struct connection 
{ 
    struct rdma_cm_id *id; 
    struct ibv_qp *qp; 

    struct ibv_mr *recv_mr; 
    struct ibv_mr *send_mr; 

    char *recv_region; 
    char *send_region; 

    int num_completions; 
}; 

static pthread_t msgThread; 

static void die(const char *reason); 

static void build_context(struct ibv_context *verbs); 
static void build_qp_attr(struct ibv_qp_init_attr *qp_attr); 
static void * poll_cq(void *); 
static void post_receives(struct connection *conn); 
static void register_memory(struct connection *conn); 

static int on_addr_resolved(struct rdma_cm_id *id); 
static void on_completion(struct ibv_wc *wc); 
static int on_connection(void *context); 
static int on_disconnect(struct rdma_cm_id *id); 
static int on_event(struct rdma_cm_event *event); 
static int on_route_resolved(struct rdma_cm_id *id); 

static struct context *s_ctx = NULL; 

#include <mutex>    // std::mutex, std::unique_lock 
#include <condition_variable> // std::condition_variable 

std::mutex mtx; 
std::condition_variable cv; 

/* Shared between the sender thread and the CQ poller thread; only touch it while holding mtx. */
bool ok_to_send_next_message = true; 
bool message_available() 
{ 
    return ok_to_send_next_message; 
} 

int main(int argc, char **argv) 
{ 
    struct addrinfo *addr; 
    struct rdma_cm_event *event = NULL; 
    struct rdma_cm_id *conn= NULL; 
    struct rdma_event_channel *ec = NULL; 

    if (argc != 3) 
     die("usage: client <server-address> <server-port>"); 

    TEST_NZ(getaddrinfo(argv[1], argv[2], NULL, &addr)); 

    TEST_Z(ec = rdma_create_event_channel()); 
    TEST_NZ(rdma_create_id(ec, &conn, NULL, RDMA_PS_TCP)); 
    TEST_NZ(rdma_resolve_addr(conn, NULL, addr->ai_addr, TIMEOUT_IN_MS)); 

    freeaddrinfo(addr); 

    while (0 == rdma_get_cm_event(ec, &event)) 
     //while (rdma_get_cm_event(ec, &event)) 
    { 
     std::cout << "rdma_get_cm_event\n"; 

     struct rdma_cm_event event_copy; 

     memcpy(&event_copy, event, sizeof(*event)); 
     rdma_ack_cm_event(event); 

     if (on_event(&event_copy)) 
      break; 
    } 

    rdma_destroy_event_channel(ec); 

    return 0; 
} 

void die(const char *reason) 
{ 
    fprintf(stderr, "%s\n", reason); 
    exit(EXIT_FAILURE); 
} 

void build_context(struct ibv_context *verbs) 
{ 
    if (s_ctx) 
    { 
     if (s_ctx->ctx != verbs) 
      die("cannot handle events in more than one context."); 

     return; 
    } 

    s_ctx = (struct context *)malloc(sizeof(struct context)); 

    s_ctx->ctx = verbs; 

    TEST_Z(s_ctx->pd = ibv_alloc_pd(s_ctx->ctx)); 
    TEST_Z(s_ctx->comp_channel = ibv_create_comp_channel(s_ctx->ctx)); 
    TEST_Z(s_ctx->cq = ibv_create_cq(s_ctx->ctx, 100, NULL, s_ctx->comp_channel, 0)); /* cqe=100 is arbitrary */ 
    TEST_NZ(ibv_req_notify_cq(s_ctx->cq, 0)); 

    TEST_NZ(pthread_create(&s_ctx->cq_poller_thread, NULL, poll_cq, NULL)); 
} 

void *SendMessages(void *context) 
{ 
    static int loopcount = 0; 
    while(1) 
    { 
     std::unique_lock<std::mutex> lck(mtx); 
     cv.wait(lck, message_available); 
     //std::this_thread::sleep_for(std::chrono::microseconds(50)); 

     ok_to_send_next_message = 0; 

     struct connection *conn = (struct connection *)context; 
     struct ibv_send_wr wr, *bad_wr = NULL; 
     struct ibv_sge sge; 

     std::cout << "looping send..." << loopcount << '\n' << std::flush; 

     memset(&wr, 0, sizeof(wr)); 

     wr.wr_id = (uintptr_t)conn; 
     wr.opcode = IBV_WR_SEND; 
     wr.sg_list = &sge; 
     wr.num_sge = 1; 
     wr.send_flags = IBV_SEND_SIGNALED; 

     sge.addr = (uintptr_t)conn->send_region; 
     sge.length = BUFFER_SIZE; 
     sge.lkey = conn->send_mr->lkey; 

     snprintf(conn->send_region, BUFFER_SIZE, "message from active/client side with count %d", loopcount++); 
     TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr)); 
    } 
} 

void build_qp_attr(struct ibv_qp_init_attr *qp_attr) 
{ 
    std::cout << "build_qp_attr\n"; 

    memset(qp_attr, 0, sizeof(*qp_attr)); 

    qp_attr->send_cq = s_ctx->cq; 
    qp_attr->recv_cq = s_ctx->cq; 
    qp_attr->qp_type = IBV_QPT_RC; 

    qp_attr->cap.max_send_wr = 100; 
    qp_attr->cap.max_recv_wr = 100; 
    qp_attr->cap.max_send_sge = 1; 
    qp_attr->cap.max_recv_sge = 1; 
} 

void * poll_cq(void *ctx) 
{ 
    struct ibv_cq *cq; 
    struct ibv_wc wc; 

    while (1) 
    { 
     TEST_NZ(ibv_get_cq_event(s_ctx->comp_channel, &cq, &ctx)); 
     ibv_ack_cq_events(cq, 1); 
     TEST_NZ(ibv_req_notify_cq(cq, 0)); 

     /* Drain every completion already queued; one CQ event can cover several of them. */ 
     while (ibv_poll_cq(cq, 1, &wc) > 0) 
     { 
      std::cout << "polling\n"; 
      on_completion(&wc); 

      //if (wc.opcode == IBV_WC_SEND) 
      if (wc.status == IBV_WC_SUCCESS) 
      { 
       /* Set the flag while holding the mutex so the write cannot race with the sender. */ 
       std::lock_guard<std::mutex> lck(mtx); 
       ok_to_send_next_message = true; 
       cv.notify_one(); 
      } 
     } 
    } 

    return NULL; 
} 

void post_receives(struct connection *conn) 
{ 
    std::cout << "post_receives\n"; 

    struct ibv_recv_wr wr, *bad_wr = NULL; 
    struct ibv_sge sge; 

    wr.wr_id = (uintptr_t)conn; 
    wr.next = NULL; 
    wr.sg_list = &sge; 
    wr.num_sge = 1; 

    sge.addr = (uintptr_t)conn->recv_region; 
    sge.length = BUFFER_SIZE; 
    sge.lkey = conn->recv_mr->lkey; 

    TEST_NZ(ibv_post_recv(conn->qp, &wr, &bad_wr)); 
} 

void register_memory(struct connection *conn) 
{ 
    std::cout << "register_memory\n"; 

    conn->send_region = (char *)malloc(BUFFER_SIZE); 
    conn->recv_region = (char *)malloc(BUFFER_SIZE); 

    TEST_Z(conn->send_mr = ibv_reg_mr(
           s_ctx->pd, 
           conn->send_region, 
           BUFFER_SIZE, 
           IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE)); 

    TEST_Z(conn->recv_mr = ibv_reg_mr(
           s_ctx->pd, 
           conn->recv_region, 
           BUFFER_SIZE, 
           IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE)); 
} 

int on_addr_resolved(struct rdma_cm_id *id) 
{ 
    std::cout << "on_addr_resolved\n"; 

    struct ibv_qp_init_attr qp_attr; 
    struct connection *conn; 

    build_context(id->verbs); 
    build_qp_attr(&qp_attr); 

    TEST_NZ(rdma_create_qp(id, s_ctx->pd, &qp_attr)); 

    id->context = conn = (struct connection *)malloc(sizeof(struct connection)); 

    conn->id = id; 
    conn->qp = id->qp; 
    conn->num_completions = 0; 

    register_memory(conn); 
    post_receives(conn); 

    TEST_NZ(rdma_resolve_route(id, TIMEOUT_IN_MS)); 

    return 0; 
} 

void on_completion(struct ibv_wc *wc) 
{ 
    std::cout << "on_completion\n"; 

    struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id; 

    if (wc->status != IBV_WC_SUCCESS) 
    { 
     //die("\ton_completion: status is not IBV_WC_SUCCESS."); 
     printf("\ton_completion: status is not IBV_WC_SUCCESS."); 
     printf("\t it is %d \n", wc->status); 
     /* wc->opcode is not valid for a failed completion, so stop here. */ 
     return; 
    } 

    printf("\n"); 

    if (wc->opcode & IBV_WC_RECV) 
     printf("\treceived message: %s\n", conn->recv_region); 
    else if (wc->opcode == IBV_WC_SEND) 
     printf("\tsend completed successfully.\n"); 
    else 
     die("\ton_completion: completion isn't a send or a receive."); 

    /* Disconnect after five successful completions. */ 
    if (5 == ++conn->num_completions) 
     rdma_disconnect(conn->id); 
} 

int on_connection(void *context) 
{ 
    std::cout << "on_connection\n"; 

    TEST_NZ(pthread_create(&msgThread, NULL, SendMessages, context)); 

    return 0; 
} 

int on_disconnect(struct rdma_cm_id *id) 
{ 
    struct connection *conn = (struct connection *)id->context; 

    printf("disconnected.\n"); 

    rdma_destroy_qp(id); 

    ibv_dereg_mr(conn->send_mr); 
    ibv_dereg_mr(conn->recv_mr); 

    free(conn->send_region); 
    free(conn->recv_region); 

    free(conn); 

    rdma_destroy_id(id); 

    return 1; /* exit event loop */ 
} 

int on_route_resolved(struct rdma_cm_id *id) 
{ 
    struct rdma_conn_param cm_params; 

    printf("route resolved.\n"); 

    memset(&cm_params, 0, sizeof(cm_params)); 
    TEST_NZ(rdma_connect(id, &cm_params)); 

    return 0; 
} 

int on_event(struct rdma_cm_event *event) 
{ 
    int r = 0; 

    if (event->event == RDMA_CM_EVENT_ADDR_RESOLVED) 
     r = on_addr_resolved(event->id); 
    else if (event->event == RDMA_CM_EVENT_ROUTE_RESOLVED) 
     r = on_route_resolved(event->id); 
    else if (event->event == RDMA_CM_EVENT_ESTABLISHED) 
     r = on_connection(event->id->context); 
    else if (event->event == RDMA_CM_EVENT_DISCONNECTED) 
     r = on_disconnect(event->id); 
    else 
     die("on_event: unknown event."); 

    return r; 
} 
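
The send/completion handshake in the client relies on a flag shared between two threads. A minimal sketch of that handshake in isolation, with no RDMA calls at all, might look like the following. `SendGate` and `run_handshake` are hypothetical names used only for illustration, and the "poller" here merely simulates a completion arriving after each send. The point it tries to show is that the flag is read and written only while the mutex is held.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the client's send/completion handshake.
struct SendGate {
    std::mutex mtx;
    std::condition_variable cv;
    bool ok_to_send = true;  // guarded by mtx

    // Sender side: block until the previous send has completed, then claim the slot.
    void wait_for_turn() {
        std::unique_lock<std::mutex> lck(mtx);
        cv.wait(lck, [this] { return ok_to_send; });
        ok_to_send = false;
    }

    // Poller side: publish the completion while holding the mutex, then notify.
    void completion_arrived() {
        {
            std::lock_guard<std::mutex> lck(mtx);
            ok_to_send = true;
        }
        cv.notify_one();
    }
};

// Runs `count` simulated send/completion round trips and returns the number of sends issued.
int run_handshake(int count) {
    SendGate gate;
    std::mutex posted_mtx;
    std::condition_variable posted_cv;
    int posted = 0;  // sends handed to the simulated hardware, guarded by posted_mtx
    int sent = 0;    // touched by the calling thread only

    // Simulated CQ poller: waits for each posted send, then reports its completion.
    std::thread poller([&] {
        for (int done = 0; done < count; ++done) {
            {
                std::unique_lock<std::mutex> lck(posted_mtx);
                posted_cv.wait(lck, [&] { return posted > done; });
            }
            gate.completion_arrived();
        }
    });

    for (int i = 0; i < count; ++i) {
        gate.wait_for_turn();  // at most one send in flight
        {
            std::lock_guard<std::mutex> lck(posted_mtx);
            ++posted;          // stands in for ibv_post_send
        }
        posted_cv.notify_one();
        ++sent;
    }

    poller.join();
    return sent;
}
```

Calling `run_handshake(1000)` should return 1000 without hanging. Whether the same discipline resolves the behaviour seen with real hardware is an assumption that would need testing.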

Server:

#include <iostream> 

#include <stdio.h> 
#include <stdlib.h> 
#include <string.h> 
#include <unistd.h> 
#include <inttypes.h> 

#include <rdma/rdma_cma.h> 

#define TEST_NZ(x) do { if ((x)) die("error: " #x " failed (returned non-zero)."); } while (0) 
#define TEST_Z(x) do { if (!(x)) die("error: " #x " failed (returned zero/null)."); } while (0) 

const int BUFFER_SIZE = 2048; 

struct context 
{ 
    struct ibv_context *ctx; 
    struct ibv_pd *pd; 
    struct ibv_cq *cq; 
    struct ibv_comp_channel *comp_channel; 

    pthread_t cq_poller_thread; 
}; 

struct connection 
{ 
    struct ibv_qp *qp; 

    struct ibv_mr *recv_mr; 
    struct ibv_mr *send_mr; 

    char *recv_region; 
    char *send_region; 
}; 

static void die(const char *reason); 

static void build_context(struct ibv_context *verbs); 
static void build_qp_attr(struct ibv_qp_init_attr *qp_attr); 
static void * poll_cq(void *); 
static void post_receives(struct connection *conn); 
static void register_memory(struct connection *conn); 

static void on_completion(struct ibv_wc *wc); 
static int on_connect_request(struct rdma_cm_id *id); 
static int on_connection(void *context); 
static int on_disconnect(struct rdma_cm_id *id); 
static int on_event(struct rdma_cm_event *event); 

static struct context *s_ctx = NULL; 

int main(int argc, char **argv) 
{ 
    struct sockaddr_in6 addr; 
    struct rdma_cm_event *event = NULL; 
    struct rdma_cm_id *listener = NULL; 
    struct rdma_event_channel *ec = NULL; 
    uint16_t port = 0; 

    memset(&addr, 0, sizeof(addr)); 
    addr.sin6_family = AF_INET6; 

    TEST_Z(ec = rdma_create_event_channel()); 
    TEST_NZ(rdma_create_id(ec, &listener, NULL, RDMA_PS_TCP)); 
    TEST_NZ(rdma_bind_addr(listener, (struct sockaddr *)&addr)); 
    TEST_NZ(rdma_listen(listener, 100)); /* backlog=100 is arbitrary */ 

    //printf("[ %"PRIu32" ]\n", *addr.sin6_addr.s6_addr32); 

    port = ntohs(rdma_get_src_port(listener)); 

    printf("listening on port %d.\n", port); 

    while (rdma_get_cm_event(ec, &event) == 0) 
    { 
     struct rdma_cm_event event_copy; 

     memcpy(&event_copy, event, sizeof(*event)); 
     rdma_ack_cm_event(event); 

     if (on_event(&event_copy)) 
      break; 
    } 

    rdma_destroy_id(listener); 
    rdma_destroy_event_channel(ec); 

    return 0; 
} 

void die(const char *reason) 
{ 
    fprintf(stderr, "%s\n", reason); 
    exit(EXIT_FAILURE); 
} 

void build_context(struct ibv_context *verbs) 
{ 
    if (s_ctx) 
    { 
     if (s_ctx->ctx != verbs) 
      die("cannot handle events in more than one context."); 

     return; 
    } 

    s_ctx = (struct context *)malloc(sizeof(struct context)); 

    s_ctx->ctx = verbs; 

    TEST_Z(s_ctx->pd = ibv_alloc_pd(s_ctx->ctx)); 
    TEST_Z(s_ctx->comp_channel = ibv_create_comp_channel(s_ctx->ctx)); 
    TEST_Z(s_ctx->cq = ibv_create_cq(s_ctx->ctx, 100, NULL, s_ctx->comp_channel, 0)); /* cqe=100 is arbitrary */ 
    TEST_NZ(ibv_req_notify_cq(s_ctx->cq, 0)); 

    TEST_NZ(pthread_create(&s_ctx->cq_poller_thread, NULL, poll_cq, NULL)); 
} 

void build_qp_attr(struct ibv_qp_init_attr *qp_attr) 
{ 
    memset(qp_attr, 0, sizeof(*qp_attr)); 

    qp_attr->send_cq = s_ctx->cq; 
    qp_attr->recv_cq = s_ctx->cq; 
    qp_attr->qp_type = IBV_QPT_RC; 

    qp_attr->cap.max_send_wr = 100; 
    qp_attr->cap.max_recv_wr = 100; 
    qp_attr->cap.max_send_sge = 1; 
    qp_attr->cap.max_recv_sge = 1; 
} 

void * poll_cq(void *ctx) 
{ 
    struct ibv_cq *cq; 
    struct ibv_wc wc; 

    while (1) 
    { 
     TEST_NZ(ibv_get_cq_event(s_ctx->comp_channel, &cq, &ctx)); 
     ibv_ack_cq_events(cq, 1); 
     TEST_NZ(ibv_req_notify_cq(cq, 0)); 

     while (ibv_poll_cq(cq, 1, &wc) > 0) /* a negative return is an error, not a completion */ 
     { 
      std::cout << "polling\n"; 
      on_completion(&wc); 
     } 
    } 

    return NULL; 
} 

void post_receives(struct connection *conn) 
{ 
    std::cout << "post_receives\n"; 

    struct ibv_recv_wr wr, *bad_wr = NULL; 
    struct ibv_sge sge; 

    wr.wr_id = (uintptr_t)conn; 
    wr.next = NULL; 
    wr.sg_list = &sge; 
    wr.num_sge = 1; 

    sge.addr = (uintptr_t)conn->recv_region; 
    sge.length = BUFFER_SIZE; 
    sge.lkey = conn->recv_mr->lkey; 

    TEST_NZ(ibv_post_recv(conn->qp, &wr, &bad_wr)); 
} 

void register_memory(struct connection *conn) 
{ 
    conn->send_region = (char *)malloc(BUFFER_SIZE); 
    conn->recv_region = (char *)malloc(BUFFER_SIZE); 

    TEST_Z(conn->send_mr = ibv_reg_mr(
           s_ctx->pd, 
           conn->send_region, 
           BUFFER_SIZE, 
           IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE)); 

    TEST_Z(conn->recv_mr = ibv_reg_mr(
           s_ctx->pd, 
           conn->recv_region, 
           BUFFER_SIZE, 
           IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE)); 
} 

void on_completion(struct ibv_wc *wc) 
{ 
    if (wc->status != IBV_WC_SUCCESS) 
     die("on_completion: status is not IBV_WC_SUCCESS."); 

    if (wc->opcode & IBV_WC_RECV) 
    { 
     struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id; 
     post_receives(conn); 
     printf("received message: %s\n", conn->recv_region); 
    } 
    else if (wc->opcode == IBV_WC_SEND) 
    { 
     printf("send completed successfully.\n"); 
    } 
} 

int on_connect_request(struct rdma_cm_id *id) 
{ 
    struct ibv_qp_init_attr qp_attr; 
    struct rdma_conn_param cm_params; 
    struct connection *conn; 

    printf("received connection request.\n"); 

    build_context(id->verbs); 
    build_qp_attr(&qp_attr); 

    TEST_NZ(rdma_create_qp(id, s_ctx->pd, &qp_attr)); 

    id->context = conn = (struct connection *)malloc(sizeof(struct connection)); 
    conn->qp = id->qp; 

    register_memory(conn); 
    post_receives(conn); 

    memset(&cm_params, 0, sizeof(cm_params)); 
    TEST_NZ(rdma_accept(id, &cm_params)); 

    return 0; 
} 

int on_connection(void *context) 
{ 
    struct connection *conn = (struct connection *)context; 
    struct ibv_send_wr wr, *bad_wr = NULL; 
    struct ibv_sge sge; 

    snprintf(conn->send_region, BUFFER_SIZE, "message from passive/server side with pid %d", getpid()); 

    printf("connected. posting send...\n"); 

    memset(&wr, 0, sizeof(wr)); 

    wr.opcode = IBV_WR_SEND; 
    wr.sg_list = &sge; 
    wr.num_sge = 1; 
    wr.send_flags = IBV_SEND_SIGNALED; 

    sge.addr = (uintptr_t)conn->send_region; 
    sge.length = BUFFER_SIZE; 
    sge.lkey = conn->send_mr->lkey; 

    TEST_NZ(ibv_post_send(conn->qp, &wr, &bad_wr)); 

    return 0; 
} 

int on_disconnect(struct rdma_cm_id *id) 
{ 
    struct connection *conn = (struct connection *)id->context; 

    printf("peer disconnected.\n"); 

    rdma_destroy_qp(id); 

    ibv_dereg_mr(conn->send_mr); 
    ibv_dereg_mr(conn->recv_mr); 

    free(conn->send_region); 
    free(conn->recv_region); 

    free(conn); 

    rdma_destroy_id(id); 

    return 0; 
} 

int on_event(struct rdma_cm_event *event) 
{ 
    std::cout << "on_event\n"; 

    int r = 0; 

    if (event->event == RDMA_CM_EVENT_CONNECT_REQUEST) 
     r = on_connect_request(event->id); 
    else if (event->event == RDMA_CM_EVENT_ESTABLISHED) 
     r = on_connection(event->id->context); 
    else if (event->event == RDMA_CM_EVENT_DISCONNECTED) 
     r = on_disconnect(event->id); 
    else 
     die("on_event: unknown event."); 

    return r; 
} 
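
The server posts a single receive when the connection request arrives and re-posts one after each receive completion. If more messages arrive than there are receives posted, the receiving side is "not ready" for the extra ones. The toy model below is plain C++ with no verbs calls. `ReceiveQueueModel` and `count_rnr` are hypothetical names, and the burst behaviour is an assumption rather than something measured. It only illustrates how the number of pre-posted receives relates to the number of messages that can be in flight.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of one side's receive queue; not part of libibverbs.
class ReceiveQueueModel {
public:
    explicit ReceiveQueueModel(std::size_t preposted) : posted_(preposted) {}

    // A message arrives; it can only be delivered if a receive is posted.
    bool deliver() {
        if (posted_ == 0) {
            ++rnr_events_;  // receiver not ready
            return false;
        }
        --posted_;
        return true;
    }

    // The application re-posts one receive (the role post_receives() plays).
    void repost() { ++posted_; }

    std::size_t rnr_events() const { return rnr_events_; }

private:
    std::size_t posted_;
    std::size_t rnr_events_ = 0;
};

// Delivers `messages` messages in bursts of `burst` before the receiver gets a
// chance to re-post, and returns how many arrived with no receive posted.
std::size_t count_rnr(std::size_t preposted, std::size_t burst, std::size_t messages) {
    ReceiveQueueModel rq(preposted);
    std::size_t sent = 0;
    while (sent < messages) {
        std::size_t delivered = 0;
        for (std::size_t i = 0; i < burst && sent < messages; ++i, ++sent) {
            if (rq.deliver()) ++delivered;
        }
        // The receiver now handles its completions and re-posts one receive per message.
        for (std::size_t i = 0; i < delivered; ++i) rq.repost();
    }
    return rq.rnr_events();
}
```

In this model, `count_rnr(1, 1, 8)` reports no misses, while `count_rnr(1, 2, 8)` reports four. Pre-posting at least as many receives as the sender may have in flight (`count_rnr(4, 2, 8)`) brings it back to zero.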

Here are a few runs. The number of messages sent is completely random:

[[email protected] Release]$ ./TGKITCClient 192.168.0.1 47819 
rdma_get_cm_event 
on_addr_resolved 
build_qp_attr 
register_memory 
post_receives 
rdma_get_cm_event 
route resolved. 
rdma_get_cm_event 
on_connection 
looping send...0 
polling 
on_completion 
    received message: message from passive/server side with pid 4188 

polling 
on_completion 
    send completed successfully. 

looping send...1 
polling 
on_completion 
    send completed successfully. 

^C 
[[email protected] Release]$ 
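
A run that stalls like this is hard to diagnose from the log alone. One thing worth ruling out (an assumption, not a confirmed cause) is how completions are consumed. With a completion channel, the notification is one-shot, so a single CQ event can cover more than one completion, and the usual pattern is to re-arm and then poll until the queue is empty. The single-threaded model below uses hypothetical names (`CqModel`, `handle_events`, `stranded_after_burst`) and no verbs calls. It shows the difference between taking one completion per event and draining the queue.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical single-threaded model of a completion queue with one-shot notification.
struct CqModel {
    std::deque<int> completions;  // queued work completions
    bool armed = true;            // one-shot, in the spirit of ibv_req_notify_cq
    int pending_events = 0;       // events waiting on the completion channel

    void add_completion(int id) {
        completions.push_back(id);
        if (armed) {              // only the first completion after arming raises an event
            ++pending_events;
            armed = false;
        }
    }
};

// Processes all pending events and returns how many completions were handled.
// With drain == false, at most one completion is taken per event.
std::size_t handle_events(CqModel &cq, bool drain) {
    std::size_t handled = 0;
    while (cq.pending_events > 0) {
        --cq.pending_events;  // stands in for ibv_get_cq_event + ibv_ack_cq_events
        cq.armed = true;      // stands in for ibv_req_notify_cq
        do {
            if (cq.completions.empty()) break;
            cq.completions.pop_front();  // stands in for one successful ibv_poll_cq
            ++handled;
        } while (drain);
    }
    return handled;
}

// Two completions arrive before the poller wakes up; returns how many are left unhandled.
std::size_t stranded_after_burst(bool drain) {
    CqModel cq;
    cq.add_completion(1);
    cq.add_completion(2);
    handle_events(cq, drain);
    return cq.completions.size();
}
```

In this model, `stranded_after_burst(false)` leaves one completion sitting in the queue with no event to announce it, while `stranded_after_burst(true)` leaves none.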

Then:

[[email protected] Release]$ ./TGKITCClient 192.168.0.1 55148 
rdma_get_cm_event 
on_addr_resolved 
build_qp_attr 
register_memory 
post_receives 
rdma_get_cm_event 
route resolved. 
rdma_get_cm_event 
on_connection 
looping send...0 
polling 
on_completion 
    received message: message from passive/server side with pid 4279 

polling 
on_completion 
    send completed successfully. 

looping send...1 
polling 
on_completion 
    send completed successfully. 

looping send...2 
polling 
on_completion 
    send completed successfully. 

looping send...3 
polling 
on_completion 
    send completed successfully. 

looping send...4 
polling 
on_completion 
    send completed successfully. 

looping send...5 
polling 
on_completion 
    send completed successfully. 

looping send...6 
polling 
on_completion 
    send completed successfully. 

looping send...7 
polling 
on_completion 
    send completed successfully. 

looping send...8 
rdma_get_cm_event 
disconnected. 
polling 
on_completion 
    send completed successfully. 

    on_completion: status is not IBV_WC_SUCCESS.  it is 5 [[email protected] Release]$ 
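
The numeric status printed above is a value of `enum ibv_wc_status`. Value 5 is `IBV_WC_WR_FLUSH_ERR`, which is reported for work requests that were flushed because the queue pair had already moved to the error state, for example after a disconnect. In real code libibverbs provides `ibv_wc_status_str()` for this. The helper below is only a standalone sketch with a local copy of the first few enumerator names, and `wc_status_name` is a hypothetical name.

```cpp
#include <cassert>
#include <string>

// Local copy of the first few ibv_wc_status enumerator names, in the order they
// are declared in <infiniband/verbs.h>. Hypothetical helper, not a libibverbs API.
std::string wc_status_name(int status) {
    static const char *const names[] = {
        "IBV_WC_SUCCESS",        // 0
        "IBV_WC_LOC_LEN_ERR",    // 1
        "IBV_WC_LOC_QP_OP_ERR",  // 2
        "IBV_WC_LOC_EEC_OP_ERR", // 3
        "IBV_WC_LOC_PROT_ERR",   // 4
        "IBV_WC_WR_FLUSH_ERR",   // 5
    };
    const int count = static_cast<int>(sizeof(names) / sizeof(names[0]));
    if (status < 0 || status >= count)
        return "unknown (" + std::to_string(status) + ")";
    return names[status];
}
```

Here `wc_status_name(5)` yields `IBV_WC_WR_FLUSH_ERR`. A flush error is usually a consequence of an earlier failure or of the disconnect itself, so the first non-success completion on either side is generally the more informative one.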

Here is the server side:

on_event 
peer disconnected. 
on_event 
received connection request. 
post_receives 
on_event 
connected. posting send... 
polling 
send completed successfully. 
polling 
post_receives 
received message: message from active/client side with count 0 
polling 
post_receives 
received message: message from active/client side with count 1 
polling 
post_receives 
received message: message from active/client side with count 2 
polling 
post_receives 
received message: message from active/client side with count 3 
polling 
post_receives 
received message: message from active/client side with count 4 
polling 
post_receives 
received message: message from active/client side with count 5 
polling 
post_receives 
received message: message from active/client side with count 6 
polling 
post_receives 
received message: message from active/client side with count 7 
on_event 
peer disconnected. 

What is *thegeekinthecorner*? If it matters to the question, please provide an explanation or a link. – jpaugh

thegeekinthecorner is now a link in the original post. – Ivan

Answer


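Make sure the latest drivers and firmware are installed on the card. Beyond that, relying on the RDMA packages bundled with most OS distributions when trying to run IB is a risky game.

It is strongly recommended that applications like this use the OpenFabrics Enterprise Distribution (OFED), which provides openib, opensm and a variety of other useful InfiniBand-related packages for analysis, diagnostics and network tuning. The official OFED packages can be found on the OpenFabrics website.

Based on the question, it looks like IPoIB is being used, but no specific configuration is mentioned. IPoIB is not necessarily the best way to take advantage of the hardware resources available in the IB card.

Apart from these considerations, make sure the subnet manager is set up and configured correctly. Some switches have a built-in subnet manager that can be accessed and configured through a management interface; in other cases it may make more sense to run and configure a subnet manager on one of the nodes you are using. OpenSM is a commonly used subnet manager included in the OFED distribution, and there are many online guides for setting up and configuring a subnet manager for the type of network being built.

OFED also includes a variety of IB testing and analysis tools. ibdiagnet is a useful tool for debugging IB network issues. There are many guides online that show different ways of using that tool and the others included in OFED.

Depending on the type of IB switch being used, there may also be network management and diagnostic tools available for analyzing the network further. The configuration of the IB hardware and of the low-level software that manages it is sometimes more important for overall performance than the code actually being run. That said, it may be wise to recompile and relink against the correct version of the relevant OFED libraries if significant changes are made to the hardware or software configuration.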

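Hi Matt. I added the output of ifconfig -a to the post. As I stated here http://serverfault.com/questions/692643/infiniband-verifying-that-rdma-is-working all the tests work great. I am not using a switch. The two nodes are connected by a cable, and opensm is running on one of the nodes. It seems fine. – Ivan

Do you know of a link for updating the firmware on this particular card? – Ivan

I downloaded the latest OFED. It appears to be a bunch of .rpms. I am using CentOS 7. I used a plain old yum install to install what I have now. The OFED readme says it will remove all previous installations. Do you know whether that includes the things I installed through yum? – Ivan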
