Improving the solution of sparse linear systems

2013-08-02 54 views
3

I have written code in C++ on a Linux system that solves the linear system A x = b, where A is a sparse symmetric matrix, using the following two approaches:

  1. Using UMFPACK to sequentially factorize and do the backward/forward substitution (a minimal CPU-only sketch of this approach is given right after this list).
  2. Using UMFPACK to sequentially factorize, and then performing the backward/forward substitution with the cuSPARSE library.
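
For reference, the CPU-only path of approach 1 boils down to the following minimal sketch; the function and variable names (solveWithUmfpack, n, Ap, Ai, Ax, b, x) are placeholders for the CSC data read from the grid files, not the actual fc.cu code:

#include <stddef.h>
#include <umfpack.h>

// Minimal sketch of approach 1: factorize the 0-based CSC matrix once with UMFPACK,
// then do the backward/forward substitution to solve A x = b.
void solveWithUmfpack(int n, const int *Ap, const int *Ai, const double *Ax,
                      const double *b, double *x) {
    void *Symbolic, *Numeric;
    umfpack_di_symbolic(n, n, Ap, Ai, Ax, &Symbolic, NULL, NULL);        // ordering / symbolic analysis
    umfpack_di_numeric (Ap, Ai, Ax, Symbolic, &Numeric, NULL, NULL);     // numerical LU factorization
    umfpack_di_free_symbolic(&Symbolic);
    umfpack_di_solve(UMFPACK_A, Ap, Ai, Ax, x, b, Numeric, NULL, NULL);  // backward/forward substitution
    umfpack_di_free_numeric(&Numeric);
}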

My system configuration is: CUDA version 5.0, UMFPACK version 5.6.2, Linux kernel version Debian 3.2.46-1, graphics card: GeForce GTX Titan.

In theory, the second approach should perform better than the first, with no error. However, I have observed the following problems:

  1. The backward/forward substitution using the UMFPACK function umfpack_di_solve is almost 2x faster than the CUDA variant.
  2. For some matrices, the error between the results obtained with UMFPACK and CUDA is quite large, with a maximum error of 3.2537, while for other matrices it is of the order of 1e-16.

Attached is my tar file, which consists of the following:

  • A folder factorize_copy with the main file fc.cu that I use for solving the linear systems. It reads the sparse matrices from the grid_*_CSC.m files, which are also present in the same directory. For convenience, the results for three sparse matrices are given in text files.
  • A folder containing all the dependencies for compiling and running UMFPACK (which we also use for the factorization).

The link to the tar file is https://www.dropbox.com/s/9qfs5awclshyk3b/code.tar.gz

If you want to run the code, there is a MAKEFILE in the factorize_copy directory that I use on my system. You may need to recompile the UMFPACK libraries.

A sample output of our program for a 586 x 586 sparse matrix is also shown below (note that the errors in this case are very high compared to the other sparse matrices we have checked).

 
***** Reading the Grids 

    Reading Grids Successful 

***** Solving a sparse matrix of size: 586x586 

***** Solving the grid on umfpack 

***** Factorizing The Grid 

-------------- CPU TIME for umfpack factorization is: 0.00109107 

-------------- Wall-Clock TIME for umfpack factorization is: 0 

    Factorizing the Grid successful 

    Solving the grid on umfpack successful 

-------------- CPU TIME for umfpack solve is: 6.281e-05 

***** Allocating GPU memory and Copying data 

---------------- CPU Time for Allocating GPU memory and Copying Data: 1.6 

***** Performing b = P*b to account for the row ordering in A 

    Matrix-Vector (Pb) multiplication successful 

***** Solving the system: LUx=b 

    Analyzing Ly = b successful 

    Solving Ly = b successful 

    Analyzing Ux = y successful 

    Solving Ux = y successful 

***** Performing x = Q*x to account for the column ordering in A 

    Matrix-Vector (Qx) multiplication successful 

---------- GPU Time for the solve is: 5.68029 ms 

##### Maximum error between UMFPACK and CUDA: 3.2537 

##### Average error between UMFPACK and CUDA: 0.699926 

***** Writing the results to the output files 

    Result Written to the file 'vout_586.m' and the file 'vout_umfpack_586.m' 

(Operation Successful!) 

I would appreciate it if anyone could point out what the possible error might be in this case. If there is a better way of solving sparse linear systems with CUDA that I am missing, please let me know.

EDIT: I figured out why it was giving large errors in some cases and not in others. I had a mistake in the number of threads per block when calling the kernel functions in the code. However, I still have the speedup problem.
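
For illustration only, this is the kind of launch-configuration mistake being described (the kernel and variable names below are made up, not taken from fc.cu): truncating the block count leaves the tail of the vector untouched, so the error shows up only for certain matrix sizes.

#include <cuda_runtime.h>

#define BLOCKSIZE 256

// Hypothetical per-element kernel; threads beyond N must be masked out.
__global__ void scaleKernel(double *x, const double a, const int N) {
    const int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid < N) x[tid] *= a;
}

void launchScale(double *d_x, const double a, const int N) {
    // Buggy:   scaleKernel<<<N / BLOCKSIZE, BLOCKSIZE>>>(d_x, a, N);  // drops the last partial block
    // Correct: round the number of blocks up so that every element is covered
    const int numBlocks = (N + BLOCKSIZE - 1) / BLOCKSIZE;
    scaleKernel<<<numBlocks, BLOCKSIZE>>>(d_x, a, N);
}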

+0

What CPU does your system have? You mention UMFPACK 5.6.2 (which seems to be available), but your tar file seems to have UMFPACK 5.2.0 in it? Does that matter? Also, in your 'Makefile' you specify the '-G' switch and do not specify any '-arch' switch. I would remove the '-G' switch and specify the '-arch=sm_35' switch for the GTX Titan. –

+0

The CPU used is an Intel Core i7-4770 quad-core processor. The UMFPACK version does not matter. – yassine93

Answers

4

If you are dealing with a problem that takes a sub-millisecond amount of time on the CPU, you can hardly expect the GPU to perform faster, given all the latencies involved in GPU computation.
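
As a rough, hypothetical illustration (none of the names below come from the question's code), a micro-benchmark of the fixed overheads alone, one empty kernel launch plus one small device-to-host copy, can be written as follows; on typical systems this overhead is already a sizeable fraction of the 6.281e-05 s umfpack solve time reported above:

#include <cstdio>
#include <cuda_runtime.h>

// Times nothing but the fixed overheads: an empty kernel launch followed by a small copy.
__global__ void emptyKernel() {}

int main() {
    double h_buf[512], *d_buf;
    cudaMalloc(&d_buf, sizeof(h_buf));

    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    cudaEventRecord(start);
    emptyKernel<<<1, 1>>>();                                          // launch latency only
    cudaMemcpy(h_buf, d_buf, sizeof(h_buf), cudaMemcpyDeviceToHost);  // small-transfer latency
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Fixed overhead of one launch + one small copy: %f ms\n", ms);

    cudaFree(d_buf);
    return 0;
}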

+0

I am doing the system solves within Monte Carlo iterations. The factorization needs to be done only once, but the backward/forward substitution is performed many times within one Monte Carlo iteration. That is why I want to use the GPU for the backward/forward solves, so that I do not spend time copying data back and forth between main memory and GPU memory. By the way, the same trend is observed for sparse matrices of size 1M or larger. Is there any way to get around this and speed up the solve? – yassine93

1

This post addresses the very relevant issue of the fast solution of sparse linear systems.

As of November 2015, the cuSPARSE library provides routines for the solution of sparse linear systems based on LU factorization, in particular

cusparse<t>csrilu02 

cusparse<t>csrsv2_solve 

Furthermore, cuSPARSE provides

cusparse<t>csrcolor 

which implements graph coloring. The exploitation of graph coloring for the incomplete LU factorization is illustrated in

Graph Coloring: More Parallelism for Incomplete-LU Factorization

and in M. Naumov, P. Castonguay, J. Cohen, "Parallel Graph Coloring with Applications to the Incomplete-LU Factorization on the GPU", NVIDIA Research Technical Report, May 2015.

The idea is to apply a graph coloring algorithm to the row dependency graph associated with the coefficient matrix of the system, and then to reorder the system equations accordingly, so that the LU factorization routines can extract more parallelism.

Below, please find a fully worked example using the above ideas:

#include <stdio.h> 
#include <stdlib.h> 
#include <iostream> 
#include <assert.h> 

#include "Utilities.cuh" 

#include <cuda_runtime.h> 
#include <cusparse_v2.h> 

#define BLOCKSIZE 256 

/**************************/ 
/* SETTING UP THE PROBLEM */ 
/**************************/ 
void setUpTheProblem(double **h_A_dense, double **h_x_dense, double **d_A_dense, double **d_x_dense, const int N) { 

    // --- Host side dense matrix 
    h_A_dense[0] = (double*)calloc(N * N, sizeof(double)); 

    // --- Column-major ordering 
    h_A_dense[0][0] = 0.4612f; h_A_dense[0][4] = -0.0006f; h_A_dense[0][8] = 0.f; h_A_dense[0][12] = 0.0f; 
    h_A_dense[0][1] = -0.0006f; h_A_dense[0][5] = 0.f; h_A_dense[0][9] = 0.0723f; h_A_dense[0][13] = 0.04f; 
    h_A_dense[0][2] = 0.3566f; h_A_dense[0][6] = 0.0723f; h_A_dense[0][10] = 0.f; h_A_dense[0][14] = 0.0f; 
    h_A_dense[0][3] = 0.0f;  h_A_dense[0][7] = 0.0f;  h_A_dense[0][11] = 1.0f; h_A_dense[0][15] = 0.1f; 

    h_x_dense[0] = (double *)malloc(N * sizeof(double)); 
    h_x_dense[0][0] = 100.0; h_x_dense[0][1] = 200.0; h_x_dense[0][2] = 400.0; h_x_dense[0][3] = 500.0; 

    // --- Create device arrays and copy host arrays to them 
    gpuErrchk(cudaMalloc(&d_A_dense[0], N * N * sizeof(double))); 
    gpuErrchk(cudaMemcpy(d_A_dense[0], h_A_dense[0], N * N * sizeof(double), cudaMemcpyHostToDevice)); 

    gpuErrchk(cudaMalloc(&d_x_dense[0], N * sizeof(double))); 
    gpuErrchk(cudaMemcpy(d_x_dense[0], h_x_dense[0], N * sizeof(double), cudaMemcpyHostToDevice)); 
} 

/************************/ 
/* FROM DENSE TO SPARSE */ 
/************************/ 
void fromDenseToSparse(const cusparseHandle_t handle, double *d_A_dense, double **d_A, int **d_A_RowIndices, int **d_A_ColIndices, int *nnz, 
         cusparseMatDescr_t *descrA, const int N) { 

    cusparseSafeCall(cusparseCreateMatDescr(&descrA[0])); 
    cusparseSafeCall(cusparseSetMatType  (descrA[0], CUSPARSE_MATRIX_TYPE_GENERAL)); 
    cusparseSafeCall(cusparseSetMatIndexBase(descrA[0], CUSPARSE_INDEX_BASE_ZERO)); 

    nnz[0] = 0;        // --- Number of nonzero elements in dense matrix 
    const int lda = N;      // --- Leading dimension of dense matrix 

    // --- Device side number of nonzero elements per row 
    int *d_nnzPerVector; gpuErrchk(cudaMalloc(&d_nnzPerVector, N * sizeof(int))); 
    cusparseSafeCall(cusparseDnnz(handle, CUSPARSE_DIRECTION_ROW, N, N, descrA[0], d_A_dense, lda, d_nnzPerVector, &nnz[0])); 

    // --- Host side number of nonzero elements per row 
    int *h_nnzPerVector = (int *)malloc(N * sizeof(int)); 
    gpuErrchk(cudaMemcpy(h_nnzPerVector, d_nnzPerVector, N * sizeof(int), cudaMemcpyDeviceToHost)); 

    printf("Number of nonzero elements in dense matrix = %i\n\n", nnz[0]); 
    for (int i = 0; i < N; ++i) printf("Number of nonzero elements in row %i = %i \n", i, h_nnzPerVector[i]); 
    printf("\n"); 

    // --- Device side sparse matrix 
    gpuErrchk(cudaMalloc(&d_A[0], nnz[0] * sizeof(double))); 

    gpuErrchk(cudaMalloc(&d_A_RowIndices[0], (N + 1) * sizeof(int))); 
    gpuErrchk(cudaMalloc(&d_A_ColIndices[0], nnz[0] * sizeof(int))); 

    cusparseSafeCall(cusparseDdense2csr(handle, N, N, descrA[0], d_A_dense, lda, d_nnzPerVector, d_A[0], d_A_RowIndices[0], d_A_ColIndices[0])); 

    // --- Host side sparse matrix 
    double *h_A = (double *)malloc(nnz[0] * sizeof(double));   
    int *h_A_RowIndices = (int *)malloc((N + 1) * sizeof(*h_A_RowIndices)); 
    int *h_A_ColIndices = (int *)malloc(nnz[0] * sizeof(*h_A_ColIndices)); 
    gpuErrchk(cudaMemcpy(h_A, d_A[0], nnz[0] * sizeof(double), cudaMemcpyDeviceToHost)); 
    gpuErrchk(cudaMemcpy(h_A_RowIndices, d_A_RowIndices[0], (N + 1) * sizeof(int), cudaMemcpyDeviceToHost)); 
    gpuErrchk(cudaMemcpy(h_A_ColIndices, d_A_ColIndices[0], nnz[0] * sizeof(int), cudaMemcpyDeviceToHost)); 

    printf("\nOriginal matrix in CSR format\n\n"); 
    for (int i = 0; i < nnz[0]; ++i) printf("A[%i] = %f ", i, h_A[i]); printf("\n"); 

    printf("\n"); 
    for (int i = 0; i < (N + 1); ++i) printf("h_A_RowIndices[%i] = %i \n", i, h_A_RowIndices[i]); printf("\n"); 

    for (int i = 0; i < nnz[0]; ++i) printf("h_A_ColIndices[%i] = %i \n", i, h_A_ColIndices[i]);  

} 

/******************/ 
/* GRAPH COLORING */ 
/******************/ 
__global__ void setRowIndices(int *d_B_RowIndices, const int N) { 

    const int tid = threadIdx.x + blockDim.x * blockIdx.x; 

    if (tid == N)  d_B_RowIndices[tid] = N; 
    else if (tid < N) d_B_RowIndices[tid] = tid; 

} 

__global__ void setB(double *d_B, const int N) { 

    const int tid = threadIdx.x + blockDim.x * blockIdx.x; 

    if (tid < N) d_B[tid] = 1.f; 

} 

void graphColoring(const cusparseHandle_t handle, const int nnz, const cusparseMatDescr_t descrA, const double fractionToColor, double *d_A, 
        const int *d_A_RowIndices, const int *d_A_ColIndices, double **d_B, int **d_B_RowIndices, int **d_B_ColIndices, 
        cusparseMatDescr_t *descrB, const int N) { 

    cusparseColorInfo_t info;  cusparseSafeCall(cusparseCreateColorInfo(&info)); 

    int ncolors; 
    int *d_coloring;  gpuErrchk(cudaMalloc(&d_coloring, N * sizeof(int))); 
    gpuErrchk(cudaMalloc(&d_B_ColIndices[0], N * sizeof(int))); 
    cusparseSafeCall(cusparseDcsrcolor(handle, N, nnz, descrA, d_A, d_A_RowIndices, d_A_ColIndices, &fractionToColor, &ncolors, d_coloring, 
             d_B_ColIndices[0], info)); 

    int *h_coloring  = (int *)malloc(N * sizeof(int)); 
    int *h_B_ColIndices = (int *)malloc(N * sizeof(int)); 
    gpuErrchk(cudaMemcpy(h_coloring, d_coloring, N * sizeof(int), cudaMemcpyDeviceToHost)); 
    gpuErrchk(cudaMemcpy(h_B_ColIndices, d_B_ColIndices[0], N * sizeof(int), cudaMemcpyDeviceToHost)); 

    for (int i = 0; i < N; i++) printf("h_coloring = %i; h_B_ColIndices = %i\n", h_coloring[i], h_B_ColIndices[i]); 

    gpuErrchk(cudaMalloc(&d_B_RowIndices[0], (N + 1) * sizeof(int))); 
    int *h_B_RowIndices = (int *)malloc((N + 1) * sizeof(int)); 
    setRowIndices<<<iDivUp(N + 1, BLOCKSIZE), BLOCKSIZE>>>(d_B_RowIndices[0], N); 

    gpuErrchk(cudaMemcpy(h_B_RowIndices, d_B_RowIndices[0], (N + 1) * sizeof(int), cudaMemcpyDeviceToHost)); 
    printf("\n"); for (int i = 0; i <= N; i++) printf("h_B_RowIndices = %i\n", h_B_RowIndices[i]); 

    gpuErrchk(cudaMalloc(&d_B[0], N * sizeof(double))); 
    double *h_B = (double *)malloc(N * sizeof(double)); 
    setB<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_B[0], N); 

    gpuErrchk(cudaMemcpy(h_B, d_B[0], N * sizeof(double), cudaMemcpyDeviceToHost)); 
    printf("\n"); for (int i = 0; i < N; i++) printf("h_B = %f\n", h_B[i]); 

    // --- Descriptor for the sparse permutation matrix B 
    cusparseSafeCall(cusparseCreateMatDescr(&descrB[0])); 
    cusparseSafeCall(cusparseSetMatType  (descrB[0], CUSPARSE_MATRIX_TYPE_GENERAL)); 
    cusparseSafeCall(cusparseSetMatIndexBase(descrB[0], CUSPARSE_INDEX_BASE_ZERO)); 
} 

/*************************/ 
/* MATRIX ROW REORDERING */ 
/*************************/ 
void matrixRowReordering(const cusparseHandle_t handle, int nnzA, int nnzB, int *nnzC, cusparseMatDescr_t descrA, cusparseMatDescr_t descrB, 
         cusparseMatDescr_t *descrC, double *d_A, int *d_A_RowIndices, int *d_A_ColIndices, double *d_B, int *d_B_RowIndices, 
         int *d_B_ColIndices, double **d_C, int **d_C_RowIndices, int **d_C_ColIndices, const int N) { 

    // --- Descriptor for sparse matrix C 
    cusparseSafeCall(cusparseCreateMatDescr(&descrC[0])); 
    cusparseSafeCall(cusparseSetMatType  (descrC[0], CUSPARSE_MATRIX_TYPE_GENERAL)); 
    cusparseSafeCall(cusparseSetMatIndexBase(descrC[0], CUSPARSE_INDEX_BASE_ZERO)); 

    const int lda = N;      // --- Leading dimension of dense matrix 

    // --- Device side sparse matrix 
    gpuErrchk(cudaMalloc(&d_C_RowIndices[0], (N + 1) * sizeof(int))); 

    // --- Host side sparse matrices 
    int *h_C_RowIndices = (int *)malloc((N + 1) * sizeof(int)); 

    // --- Performing the matrix - matrix multiplication 
    int baseC; 
    int *nnzTotalDevHostPtr = &nnzC[0]; 

    cusparseSafeCall(cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST)); 

    cusparseSafeCall(cusparseXcsrgemmNnz(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, N, descrB, nnzB, 
             d_B_RowIndices, d_B_ColIndices, descrA, nnzA, d_A_RowIndices, d_A_ColIndices, descrC[0], d_C_RowIndices[0], 
             nnzTotalDevHostPtr)); 
    if (NULL != nnzTotalDevHostPtr) nnzC[0] = *nnzTotalDevHostPtr; 
    else { 
     gpuErrchk(cudaMemcpy(&nnzC[0], d_C_RowIndices[0] + N, sizeof(int), cudaMemcpyDeviceToHost)); 
     gpuErrchk(cudaMemcpy(&baseC, d_C_RowIndices[0],  sizeof(int), cudaMemcpyDeviceToHost)); 
     nnzC[0] -= baseC; 
    } 
    gpuErrchk(cudaMalloc(&d_C_ColIndices[0], nnzC[0] * sizeof(int))); 
    gpuErrchk(cudaMalloc(&d_C[0], nnzC[0] * sizeof(double))); 
    double *h_C = (double *)malloc(nnzC[0] * sizeof(double));  
    int *h_C_ColIndices = (int *)malloc(nnzC[0] * sizeof(int)); 
    cusparseSafeCall(cusparseDcsrgemm(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, N, descrB, nnzB, 
             d_B, d_B_RowIndices, d_B_ColIndices, descrA, nnzA, d_A, d_A_RowIndices, d_A_ColIndices, descrC[0], 
             d_C[0], d_C_RowIndices[0], d_C_ColIndices[0])); 

    double *h_C_dense = (double*)malloc(N * N * sizeof(double)); 
    double *d_C_dense; gpuErrchk(cudaMalloc(&d_C_dense, N * N * sizeof(double))); 
    cusparseSafeCall(cusparseDcsr2dense(handle, N, N, descrC[0], d_C[0], d_C_RowIndices[0], d_C_ColIndices[0], d_C_dense, N)); 

    gpuErrchk(cudaMemcpy(h_C ,   d_C[0],   nnzC[0] * sizeof(double), cudaMemcpyDeviceToHost)); 
    gpuErrchk(cudaMemcpy(h_C_RowIndices, d_C_RowIndices[0], (N + 1) * sizeof(int), cudaMemcpyDeviceToHost)); 
    gpuErrchk(cudaMemcpy(h_C_ColIndices, d_C_ColIndices[0], nnzC[0] * sizeof(int), cudaMemcpyDeviceToHost)); 

    printf("\nResult matrix C in CSR format\n\n"); 
    for (int i = 0; i < nnzC[0]; ++i) printf("C[%i] = %f ", i, h_C[i]); printf("\n"); 

    printf("\n"); 
    for (int i = 0; i < (N + 1); ++i) printf("h_C_RowIndices[%i] = %i \n", i, h_C_RowIndices[i]); printf("\n"); 

    printf("\n"); 
    for (int i = 0; i < nnzC[0]; ++i) printf("h_C_ColIndices[%i] = %i \n", i, h_C_ColIndices[i]); 

    gpuErrchk(cudaMemcpy(h_C_dense, d_C_dense, N * N * sizeof(double), cudaMemcpyDeviceToHost)); 

    for (int j = 0; j < N; j++) { 
     for (int i = 0; i < N; i++) 
      printf("%f \t", h_C_dense[i * N + j]); 
     printf("\n"); 
     } 

} 

/******************/ 
/* ROW REORDERING */ 
/******************/ 
void rowReordering(const cusparseHandle_t handle, int nnzB, cusparseMatDescr_t descrB, double *d_B, int *d_B_RowIndices, int *d_B_ColIndices, 
        double *d_x_dense, double **d_y_dense, const int N) { 

    gpuErrchk(cudaMalloc(&d_y_dense[0], N  * sizeof(double))); 

    const double alpha = 1.; 
    const double beta = 0.; 
    cusparseSafeCall(cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, nnzB, &alpha, descrB, d_B, d_B_RowIndices, d_B_ColIndices, d_x_dense, 
            &beta, d_y_dense[0])); 

    double *h_y_dense = (double*)malloc(N *  sizeof(double)); 
    gpuErrchk(cudaMemcpy(h_y_dense,   d_y_dense[0],   N * sizeof(double), cudaMemcpyDeviceToHost)); 

    printf("\nResult vector\n\n"); 
    for (int i = 0; i < N; ++i) printf("h_y[%i] = %f ", i, h_y_dense[i]); printf("\n"); 

} 

/*****************************/ 
/* SOLVING THE LINEAR SYSTEM */ 
/*****************************/ 
void LUDecomposition(const cusparseHandle_t handle, int nnzC, cusparseMatDescr_t descrC, double *d_C, int *d_C_RowIndices, int *d_C_ColIndices, 
        double *d_x_dense, double **d_y_dense, const int N) { 

    /******************************************/ 
    /* STEP 1: CREATE DESCRIPTORS FOR L AND U */ 
    /******************************************/ 
    cusparseMatDescr_t  descr_L = 0; 
    cusparseSafeCall(cusparseCreateMatDescr (&descr_L)); 
    cusparseSafeCall(cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ZERO)); 
    cusparseSafeCall(cusparseSetMatType  (descr_L, CUSPARSE_MATRIX_TYPE_GENERAL)); 
    cusparseSafeCall(cusparseSetMatFillMode (descr_L, CUSPARSE_FILL_MODE_LOWER)); 
    cusparseSafeCall(cusparseSetMatDiagType (descr_L, CUSPARSE_DIAG_TYPE_UNIT)); 

    cusparseMatDescr_t  descr_U = 0; 
    cusparseSafeCall(cusparseCreateMatDescr (&descr_U)); 
    cusparseSafeCall(cusparseSetMatIndexBase(descr_U, CUSPARSE_INDEX_BASE_ZERO)); 
    cusparseSafeCall(cusparseSetMatType  (descr_U, CUSPARSE_MATRIX_TYPE_GENERAL)); 
    cusparseSafeCall(cusparseSetMatFillMode (descr_U, CUSPARSE_FILL_MODE_UPPER)); 
    cusparseSafeCall(cusparseSetMatDiagType (descr_U, CUSPARSE_DIAG_TYPE_NON_UNIT)); 

    /**************************************************************************************************/ 
    /* STEP 2: QUERY HOW MUCH MEMORY USED IN LU FACTORIZATION AND THE TWO FOLLOWING SYSTEM INVERSIONS */ 
    /**************************************************************************************************/ 
    csrilu02Info_t info_C = 0; cusparseSafeCall(cusparseCreateCsrilu02Info (&info_C)); 
    csrsv2Info_t info_L = 0; cusparseSafeCall(cusparseCreateCsrsv2Info (&info_L)); 
    csrsv2Info_t info_U = 0; cusparseSafeCall(cusparseCreateCsrsv2Info (&info_U)); 

    int pBufferSize_M, pBufferSize_L, pBufferSize_U; 
    cusparseSafeCall(cusparseDcsrilu02_bufferSize(handle, N, nnzC, descrC, d_C, d_C_RowIndices, d_C_ColIndices, info_C, &pBufferSize_M)); 
    cusparseSafeCall(cusparseDcsrsv2_bufferSize (handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnzC, descr_L, d_C, d_C_RowIndices, d_C_ColIndices, info_L, &pBufferSize_L)); 
    cusparseSafeCall(cusparseDcsrsv2_bufferSize (handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnzC, descr_U, d_C, d_C_RowIndices, d_C_ColIndices, info_U, &pBufferSize_U)); 

    int pBufferSize = max(pBufferSize_M, max(pBufferSize_L, pBufferSize_U)); 
    void *pBuffer = 0; gpuErrchk(cudaMalloc((void**)&pBuffer, pBufferSize)); 

    /************************************************************************************************/ 
    /* STEP 3: ANALYZE THE THREE PROBLEMS: LU FACTORIZATION AND THE TWO FOLLOWING SYSTEM INVERSIONS */ 
    /************************************************************************************************/ 
    int structural_zero; 

    cusparseSafeCall(cusparseDcsrilu02_analysis(handle, N, nnzC, descrC, d_C, d_C_RowIndices, d_C_ColIndices, info_C, CUSPARSE_SOLVE_POLICY_NO_LEVEL, pBuffer)); 
    cusparseStatus_t status = cusparseXcsrilu02_zeroPivot(handle, info_C, &structural_zero); 
    if (CUSPARSE_STATUS_ZERO_PIVOT == status){ printf("A(%d,%d) is missing\n", structural_zero, structural_zero); } 

    cusparseSafeCall(cusparseDcsrsv2_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnzC, descr_L, d_C, d_C_RowIndices, d_C_ColIndices, info_L, CUSPARSE_SOLVE_POLICY_NO_LEVEL, pBuffer)); 
    cusparseSafeCall(cusparseDcsrsv2_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnzC, descr_U, d_C, d_C_RowIndices, d_C_ColIndices, info_U, CUSPARSE_SOLVE_POLICY_USE_LEVEL, pBuffer)); 

    /************************************/ 
    /* STEP 4: FACTORIZATION: A = L * U */ 
    /************************************/ 
    int numerical_zero; 

    cusparseSafeCall(cusparseDcsrilu02(handle, N, nnzC, descrC, d_C, d_C_RowIndices, d_C_ColIndices, info_C, CUSPARSE_SOLVE_POLICY_NO_LEVEL, pBuffer)); 
    status = cusparseXcsrilu02_zeroPivot(handle, info_C, &numerical_zero); 
    if (CUSPARSE_STATUS_ZERO_PIVOT == status){ printf("U(%d,%d) is zero\n", numerical_zero, numerical_zero); } 

    /*********************/ 
    /* STEP 5: L * z = x */ 
    /*********************/ 
    // --- Allocating the intermediate result vector 
    double *d_z_dense;  gpuErrchk(cudaMalloc(&d_z_dense, N * sizeof(double))); 

    const double alpha = 1.; 
    cusparseSafeCall(cusparseDcsrsv2_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnzC, &alpha, descr_L, d_C, d_C_RowIndices, d_C_ColIndices, info_L, d_x_dense, d_z_dense, CUSPARSE_SOLVE_POLICY_NO_LEVEL, pBuffer)); 

    /*********************/ 
    /* STEP 5: U * y = z */ 
    /*********************/ 
    gpuErrchk(cudaMalloc(&d_y_dense[0], N * sizeof(double))); 
    cusparseSafeCall(cusparseDcsrsv2_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnzC, &alpha, descr_U, d_C, d_C_RowIndices, d_C_ColIndices, info_U, d_z_dense, d_y_dense[0], CUSPARSE_SOLVE_POLICY_USE_LEVEL, pBuffer)); 

    double *h_y_dense = (double *)malloc(N * sizeof(double)); 
    gpuErrchk(cudaMemcpy(h_y_dense, d_y_dense[0], N * sizeof(double), cudaMemcpyDeviceToHost)); 
    printf("\n\nFinal result\n"); 
    for (int k=0; k<N; k++) printf("x[%i] = %f\n", k, h_y_dense[k]); 

} 

/********/ 
/* MAIN */ 
/********/ 
int main() 
{ 
    // --- Initialize cuSPARSE 
    cusparseHandle_t handle; cusparseSafeCall(cusparseCreate(&handle)); 

    /*************************************************/ 
    /* SETTING UP THE ORIGINAL LINEAR SYSTEM PROBLEM */ 
    /*************************************************/ 
    const int N  = 4;    // --- Number of rows and columns 

    double *h_A_dense; double *h_x_dense; 
    double *d_A_dense; double *d_x_dense; 
    setUpTheProblem(&h_A_dense, &h_x_dense, &d_A_dense, &d_x_dense, N); 

    /************************/ 
    /* FROM DENSE TO SPARSE */ 
    /************************/ 
    //--- Descriptor for sparse matrix A 
    cusparseMatDescr_t descrA; 

    int *d_A_RowIndices, *d_A_ColIndices; 
    double *d_A; 

    int nnzA; 

    fromDenseToSparse(handle, d_A_dense, &d_A, &d_A_RowIndices, &d_A_ColIndices, &nnzA, &descrA, N); 

    /******************/ 
    /* GRAPH COLORING */ 
    /******************/ 
    const double fractionToColor = 0.95; 

    int *d_B_RowIndices, *d_B_ColIndices; 
    double *d_B; 

    int nnzB = N;    // --- B is an N x N permutation matrix with one nonzero per row 

    cusparseMatDescr_t descrB;  
    graphColoring(handle, nnzA, descrA, fractionToColor, d_A, d_A_RowIndices, d_A_ColIndices, &d_B, &d_B_RowIndices, &d_B_ColIndices, &descrB, N); 

    /*************************/ 
    /* MATRIX ROW REORDERING */ 
    /*************************/ 
    int nnzC; 

    int *d_C_RowIndices, *d_C_ColIndices; 
    double *d_C; 

    cusparseMatDescr_t descrC; 
    matrixRowReordering(handle, nnzA, nnzB, &nnzC, descrA, descrB, &descrC, d_A, d_A_RowIndices, d_A_ColIndices, d_B, d_B_RowIndices, d_B_ColIndices, 
         &d_C, &d_C_RowIndices, &d_C_ColIndices, N); 

    /******************/ 
    /* ROW REORDERING */ 
    /******************/ 
    double *d_y_dense; 
    rowReordering(handle, nnzB, descrB, d_B, d_B_RowIndices, d_B_ColIndices, d_x_dense, &d_y_dense, N); 

    /*****************************/ 
    /* SOLVING THE LINEAR SYSTEM */ 
    /*****************************/ 
    double *d_xsol_dense; 
    LUDecomposition(handle, nnzC, descrC, d_C, d_C_RowIndices, d_C_ColIndices, d_y_dense, &d_xsol_dense, N); 

}