The final report may be accessed here.

The proposal for the project may be accessed here.

Checkpoint

Progress Update

We have roughly followed the schedule outlined in the project proposal; the revised schedule is shown in the table in a later section of this report. The work we have completed so far is summarized below.

We have managed to parallelize the implementation up to the Lanczos stage of the algorithm. The issues we are facing beyond this point are highlighted in the issues section.

The Hybrid Parallel Approach

We are using the NVIDIA cuSPARSE [2] library for node-local matrix operations that are performed on the GPU. This allows us to focus on tuning the performance of our library across nodes, while reaping the benefits of upstream updates to cuSPARSE. The goal is to produce a working library that is practically usable, with as little maintenance overhead as possible.
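
As an illustration of the kind of node-local operation we offload, the following is a minimal sketch of a single sparse matrix-vector product y = L_local * x on the GPU through the cuSPARSE csrmv routine (the CUDA-8-era interface). The function and array names are placeholders rather than our library's API, and error checking is omitted.

    #include <cusparse.h>

    // Node-local SpMV sketch: y = L_local * x on the GPU via cuSPARSE csrmv.
    // d_vals / d_rowPtr / d_colInd are the local CSR arrays of the Laplacian,
    // and d_x / d_y are dense device vectors; all are assumed to be allocated
    // and populated already.
    void local_spmv(cusparseHandle_t handle,
                    int nrows, int ncols, int nnz,
                    const double *d_vals, const int *d_rowPtr, const int *d_colInd,
                    const double *d_x, double *d_y)
    {
        cusparseMatDescr_t descr;
        cusparseCreateMatDescr(&descr);
        cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
        cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

        const double alpha = 1.0, beta = 0.0;   // y = 1.0 * L * x + 0.0 * y
        cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       nrows, ncols, nnz,
                       &alpha, descr, d_vals, d_rowPtr, d_colInd,
                       d_x, &beta, d_y);

        cusparseDestroyMatDescr(descr);
    }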

Our algorithm can be broken down into three stages:

  1. Construction of the Graph Laplacian from the input data set and its distribution across nodes.
  2. Parallel Lanczos algorithm to compute the tri-diagonal matrix.
  3. QR algorithm to compute the eigenvectors of the tri-diagonal matrix.

Constructing the un-normalized Graph Laplacian from the distributed adjacency matrix is embarrassingly parallel: each node's rows of L = D - A depend only on its own rows of the adjacency matrix. Note that even a compressed representation of the adjacency matrix can be converted directly into the compressed representation of the Laplacian. We exploit this property to construct the Graph Laplacian efficiently.
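
The following is a minimal sketch of this conversion for one node's block of rows, assuming the adjacency rows are zero-based, column-sorted, and free of self-loops; the Csr struct and the function name are illustrative only.

    #include <vector>

    // CSR container used only for this sketch.
    struct Csr {
        std::vector<int>    rowPtr;   // size nrows + 1
        std::vector<int>    colInd;   // size nnz, global column indices
        std::vector<double> val;      // size nnz
    };

    // Convert one node's block of adjacency rows into the same rows of the
    // unnormalized Laplacian L = D - A. Each output row needs only its own
    // input row plus one extra diagonal entry (the weighted degree), so the
    // conversion is independent per row and per node.
    Csr adjacency_to_laplacian(const Csr &A, int firstGlobalRow)
    {
        const int nrows = static_cast<int>(A.rowPtr.size()) - 1;
        Csr L;
        L.rowPtr.resize(nrows + 1);
        L.rowPtr[0] = 0;

        for (int i = 0; i < nrows; ++i) {
            const int globalRow = firstGlobalRow + i;
            double degree = 0.0;
            bool diagPlaced = false;

            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
                if (!diagPlaced && A.colInd[k] > globalRow) {
                    L.colInd.push_back(globalRow);   // diagonal slot, patched below
                    L.val.push_back(0.0);
                    diagPlaced = true;
                }
                degree += A.val[k];
                L.colInd.push_back(A.colInd[k]);
                L.val.push_back(-A.val[k]);          // off-diagonals of L are -A
            }
            if (!diagPlaced) {                        // diagonal is last in this row
                L.colInd.push_back(globalRow);
                L.val.push_back(0.0);
            }
            L.rowPtr[i + 1] = static_cast<int>(L.colInd.size());

            // Patch the diagonal entry with the accumulated degree.
            for (int k = L.rowPtr[i]; k < L.rowPtr[i + 1]; ++k)
                if (L.colInd[k] == globalRow) { L.val[k] = degree; break; }
        }
        return L;
    }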

The following algorithm describes the working of the naïve hybrid MPI+CUDA code that we have at this stage for the Lanczos step of the algorithm.
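
As a rough illustration of the structure of this step, the sketch below shows a host-side version of the distributed Lanczos loop under a 1-D block-row distribution. The helper spmv_local and the plain loops stand in for the cuSPARSE/GPU kernels of the actual code, the row distribution (counts/displs) is assumed to be set up beforehand, and no reorthogonalization or breakdown check is performed.

    #include <mpi.h>
    #include <cmath>
    #include <vector>

    // Placeholder for the node-local SpMV; in the real code this is a cuSPARSE
    // call on the GPU operating on the local rows of the Laplacian.
    void spmv_local(const std::vector<int> &rowPtr, const std::vector<int> &colInd,
                    const std::vector<double> &val,
                    const std::vector<double> &x_full, std::vector<double> &y_local)
    {
        const int nLocal = static_cast<int>(rowPtr.size()) - 1;
        for (int i = 0; i < nLocal; ++i) {
            double sum = 0.0;
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
                sum += val[k] * x_full[colInd[k]];
            y_local[i] = sum;
        }
    }

    // Naive distributed Lanczos: each rank owns a contiguous block of rows of L
    // and the matching slice of every Lanczos vector; counts/displs describe the
    // row distribution across ranks.
    void lanczos(const std::vector<int> &rowPtr, const std::vector<int> &colInd,
                 const std::vector<double> &val,
                 const std::vector<int> &counts, const std::vector<int> &displs,
                 int nLocal, int nGlobal, int numIters,
                 std::vector<double> &alpha, std::vector<double> &beta)
    {
        std::vector<double> q(nLocal, 1.0 / std::sqrt(double(nGlobal)));  // unit q_1
        std::vector<double> qPrev(nLocal, 0.0), w(nLocal), qFull(nGlobal);
        double betaPrev = 0.0;

        for (int j = 0; j < numIters; ++j) {
            // Every rank needs the full q_j to multiply its local rows of L.
            MPI_Allgatherv(q.data(), nLocal, MPI_DOUBLE,
                           qFull.data(), counts.data(), displs.data(),
                           MPI_DOUBLE, MPI_COMM_WORLD);
            spmv_local(rowPtr, colInd, val, qFull, w);      // w = L * q_j (local rows)

            // alpha_j = q_j . w  (local partial dot, then global sum)
            double localDot = 0.0, a = 0.0;
            for (int i = 0; i < nLocal; ++i) localDot += q[i] * w[i];
            MPI_Allreduce(&localDot, &a, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            alpha.push_back(a);

            // w = w - alpha_j * q_j - beta_{j-1} * q_{j-1}
            for (int i = 0; i < nLocal; ++i) w[i] -= a * q[i] + betaPrev * qPrev[i];

            // beta_j = ||w||  (global 2-norm)
            double localSq = 0.0, normSq = 0.0;
            for (int i = 0; i < nLocal; ++i) localSq += w[i] * w[i];
            MPI_Allreduce(&localSq, &normSq, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
            const double b = std::sqrt(normSq);
            beta.push_back(b);

            // q_{j+1} = w / beta_j
            qPrev = q;
            for (int i = 0; i < nLocal; ++i) q[i] = w[i] / b;
            betaPrev = b;
        }
    }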



Even though the naïve algorithm attempts to minimize communication by performing local matrix-vector operations in parallel across nodes, on each node's GPU, it still suffers from multiple issues. We have highlighted these, along with other issues, in the issues section.

Current Issues

Goals and Deliverables

In light of the issues that we have discovered, we have slightly refined the goals from the original proposal.

We will still deliver a working, correct C++ hybrid MPI + CUDA parallel implementation of a Lanczos-based spectral clustering algorithm, packaged as a miniature library. We expect the clustering quality of the results produced by our algorithm to be comparable to that of a sequential NumPy version of the algorithm. For simplicity, we are restricting ourselves to the case of discovering 2 clusters.
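
In the 2-cluster case, the final labeling reduces to a sign (or median) split of the Fiedler vector, i.e. the eigenvector associated with the second-smallest eigenvalue of the Laplacian. A minimal sketch, with an illustrative function name, is:

    #include <cstddef>
    #include <vector>

    // 2-way split from the Fiedler vector: vertices with a non-negative entry
    // form one cluster, the rest form the other (a median threshold works too).
    std::vector<int> two_way_partition(const std::vector<double> &fiedler)
    {
        std::vector<int> labels(fiedler.size());
        for (std::size_t i = 0; i < fiedler.size(); ++i)
            labels[i] = (fiedler[i] >= 0.0) ? 1 : 0;
        return labels;
    }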

The previous CUDA-only implementation [3] achieved a 5x speedup over a sequential MATLAB version. Even in the pessimistic case, we expect to achieve at least a 5x speedup over the sequential NumPy version as well, but on larger data sets (since we have more total memory and compute power available to us).

We will also compare our code with the CUDA-only parallel implementation of the Lanczos algorithm [3], and hope to stay competitive. We anticipate two major bottlenecks in our code: inter-node communication overhead in MPI, and the overhead of transferring data between CUDA host memory and device memory. We will spend a significant amount of time analyzing and optimizing these two.
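
To make the second overhead concrete, the sketch below reduces a single partial dot product that already resides on the GPU, first through a plain MPI build and then through a CUDA-aware MPI build that accepts device pointers directly. The function names are ours and error checking is omitted.

    #include <mpi.h>
    #include <cuda_runtime.h>

    // Plain MPI: the value takes an extra hop through host memory before the
    // reduction can run.
    double reduce_dot_plain_mpi(const double *d_partial)
    {
        double h_partial = 0.0, h_global = 0.0;
        cudaMemcpy(&h_partial, d_partial, sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Allreduce(&h_partial, &h_global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return h_global;
    }

    // CUDA-aware MPI: the library accepts device pointers directly, so the
    // explicit staging copy before the reduction disappears.
    double reduce_dot_cuda_aware_mpi(const double *d_partial, double *d_global)
    {
        MPI_Allreduce(d_partial, d_global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double h_global = 0.0;
        cudaMemcpy(&h_global, d_global, sizeof(double), cudaMemcpyDeviceToHost);
        return h_global;
    }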

We hope to achieve at least a (5 + 0.5 * number of nodes)x speedup on the two test data sets; for example, on 8 nodes the target would be at least 9x. The reasoning is that we expect the algorithm to scale with the number of nodes, providing more than the 5x speedup obtained by the CUDA-only implementation. However, we anticipate non-ideal scaling and therefore limit the per-node contribution to a more achievable 0.5x per node.

For the poster session, we will show our speedup graphs and discuss the parallelization approaches we took and the trade-offs we made. We will also show the interface to the library, the structure of its API, and the ease of use of our code.

Revised Schedule

Start Date   End Date   Work to be done
Nov 1        Nov 5      Understand the mathematics behind Spectral Clustering. Finalize an eigendecomposition algorithm. Finalize a test data set.
Nov 6        Nov 13     Implement sequential Spectral Clustering. Compare against benchmarks.
Nov 14       Nov 20     Prepare the hybrid MPI + CUDA architecture. Write subroutines using the cuSPARSE library. Design and write the outline of the basic hybrid parallel version of the Lanczos algorithm.
Nov 21       Nov 24     Finish implementing the basic hybrid Lanczos by adding CUDA-aware MPI transfers. Use the small data set to run correctness tests.
Nov 25       Nov 27     Implement the improved data distribution strategy to counter workload imbalance. Optimize the algorithm for large data sets. Work on minimizing MPI/CUDA overheads.
Nov 28       Dec 1      Run benchmark tests on large data sets. Handle the QR stage of the algorithm. Find methods to perform fair evaluations.
Dec 1        Dec 4      Run tests using the fair evaluation scheme. Compute final benchmark numbers and design the graphs/visualizations to be used in the poster/write-up.
Dec 5        Dec 12     Buffer for pending tasks and tests. Write the final project report. Prepare the poster for presentation. Create any remaining visualizations.

References

[1]  Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.

[2]  NVIDIA cuSPARSE :: CUDA Toolkit Documentation. http://docs.nvidia.com/cuda/cusparse/. Accessed: 2016-11-20.

[3]  Parallel eigensolver for graph spectral analysis on GPU. http://linhr.me/15618/. Accessed: 2016-10-31.