
Project Proposal



We intend to implement a hybrid MPI + CUDA algorithm to perform spectral clustering of an input graph. We will test the algorithm against a sequential MATLAB implementation and a parallel single-GPU CUDA implementation on one (or more) of the SNAP [1] data sets, evaluating on two fronts: clustering quality and overall computation time. Where these coarse metrics are insufficient, we will take finer-grained measurements (for example, per-phase timings) to support our claims and verify our hypotheses.


Given an undirected similarity graph G = (V, E, W), a clustering of the graph is an assignment of vertices v_i, for all i ∈ {1, 2, ..., |V|}, to clusters c_j, for all j ∈ {1, 2, ..., |C|}, such that a particular cost metric on the similarities W over the clusters is minimized.

A normalized cut of a graph is a partition of its vertices into two sets A and B that minimizes, over all possible partitions, the total weight of the edges connecting vertices in A to those in B, while keeping the sizes (volumes) of A and B comparable.
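Concretely, one standard formulation of this objective (following the notation of von Luxburg [2], with d_i denoting the weighted degree of vertex i) is:

```latex
\mathrm{Ncut}(A,B) \;=\; \frac{\mathrm{cut}(A,B)}{\mathrm{vol}(A)} \;+\; \frac{\mathrm{cut}(A,B)}{\mathrm{vol}(B)},
\qquad
\mathrm{cut}(A,B) = \sum_{i \in A,\; j \in B} w_{ij},
\qquad
\mathrm{vol}(A) = \sum_{i \in A} d_i
```

Minimizing Ncut favors a small cut weight while penalizing partitions whose volumes are very unbalanced.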


Since this problem is NP-hard, spectral clustering relaxes the constraints of the normalized cut and approximates it using graph Laplacians.

Without diving into the mathematical details (see [2], [3], [4]), the problem reduces to constructing the graph Laplacian (or the normalized graph Laplacian)

 L = D - A      (D = diagonal degree matrix, A = affinity matrix)

 L_norm = I - D^(-1/2) A D^(-1/2)
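As an illustration, the unnormalized Laplacian above can be assembled directly from the affinity matrix. The dense representation below is only a sketch; the SNAP graphs we target are large and sparse, so our actual implementation would use a sparse format such as CSR:

```cpp
#include <vector>
#include <cstddef>

// Illustrative (dense) construction of the unnormalized graph Laplacian
// L = D - A from an affinity matrix A, where D is the diagonal degree matrix.
std::vector<std::vector<double>> laplacian(const std::vector<std::vector<double>>& A) {
    std::size_t n = A.size();
    std::vector<std::vector<double>> L(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        double degree = 0.0;
        for (std::size_t j = 0; j < n; ++j) degree += A[i][j];   // d_i = sum_j w_ij
        for (std::size_t j = 0; j < n; ++j) L[i][j] = -A[i][j];  // off-diagonal: -w_ij
        L[i][i] = degree - A[i][i];                              // diagonal: d_i - w_ii
    }
    return L;
}
```

By construction, every row of L sums to zero, which is a useful sanity check for correctness tests.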

and then computing the two partitions from the eigenvector corresponding to the second-smallest eigenvalue of one of the two Laplacian matrices. One popular method for computing the eigenvalues is the Lanczos algorithm [5], which involves iterative matrix-vector and vector-vector computations.
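To make the kernel structure concrete, here is a minimal sequential sketch of the Lanczos recurrence (dense matrix, no reorthogonalization, which a robust implementation would need). The matrix-vector products and dot products are exactly the operations we plan to offload to the GPU and distribute via MPI:

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// y = M * v: the dominant kernel, and the one we intend to offload to CUDA.
Vec matvec(const Mat& M, const Vec& v) {
    Vec y(M.size(), 0.0);
    for (std::size_t i = 0; i < M.size(); ++i)
        for (std::size_t j = 0; j < v.size(); ++j)
            y[i] += M[i][j] * v[j];
    return y;
}

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Plain Lanczos recurrence for a symmetric matrix M: after k steps, the
// coefficients alpha (diagonal) and beta (off-diagonal) define a k-by-k
// tridiagonal matrix whose eigenvalues approximate those of M.
void lanczos(const Mat& M, const Vec& v0, int k,
             Vec& alpha, Vec& beta, std::vector<Vec>& V) {
    Vec v = v0;
    double nrm = std::sqrt(dot(v, v));
    for (double& x : v) x /= nrm;              // normalize the start vector
    Vec v_prev(v.size(), 0.0);
    double b = 0.0;                            // beta from the previous step
    for (int j = 0; j < k; ++j) {
        V.push_back(v);
        Vec w = matvec(M, v);                  // w = M v_j
        double a = dot(w, v);                  // alpha_j = v_j^T M v_j
        for (std::size_t i = 0; i < w.size(); ++i)
            w[i] -= a * v[i] + b * v_prev[i];  // three-term recurrence
        alpha.push_back(a);
        b = std::sqrt(dot(w, w));              // beta_{j+1}
        beta.push_back(b);
        if (b < 1e-12) break;                  // invariant subspace found
        v_prev = v;
        for (std::size_t i = 0; i < w.size(); ++i) v[i] = w[i] / b;
    }
}
```

In exact arithmetic the generated basis vectors are orthonormal, and when k equals the matrix dimension the tridiagonal matrix preserves the trace of M, both of which serve as cheap correctness checks.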

Therefore, from an engineering standpoint, spectral graph clustering algorithms reduce to steps of matrix operations.

Since GPUs are excellent candidates for speeding up matrix operations by performing independent computations in parallel, we intend to use them to speed up steps of the spectral clustering algorithm as well. Further, we intend to discover methods to distribute and perform these computations on a cluster of nodes using MPI, while continuing to exploit the parallelism offered by the GPUs on each of these nodes. The following figure illustrates our parallelization strategy.
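As a first-cut sketch of that strategy, the distributed matrix-vector product could partition matrix rows across ranks. The version below is hypothetical and simulated sequentially: the rank loop stands in for actual MPI processes, and the inner loops for the per-GPU work; in real code the partial results would be exchanged with an MPI collective such as MPI_Allgatherv.

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

// Hypothetical row-block decomposition of y = A * x across num_ranks ranks.
// Each rank owns a contiguous block of rows, computes the partial product
// locally (on its GPU), and the blocks are concatenated into the full result.
std::vector<double> distributed_matvec(const std::vector<std::vector<double>>& A,
                                       const std::vector<double>& x,
                                       int num_ranks) {
    std::size_t n = A.size();
    std::vector<double> y(n, 0.0);
    std::size_t rows_per_rank = (n + num_ranks - 1) / num_ranks;  // ceiling division
    for (int rank = 0; rank < num_ranks; ++rank) {
        std::size_t lo = std::min(n, static_cast<std::size_t>(rank) * rows_per_rank);
        std::size_t hi = std::min(n, lo + rows_per_rank);
        // This loop is what each rank would run locally (offloaded to CUDA).
        for (std::size_t i = lo; i < hi; ++i)
            for (std::size_t j = 0; j < n; ++j)
                y[i] += A[i][j] * x[j];
    }
    return y;
}
```

A key design question we will study is whether x must be fully replicated on every node (simple, but communication-heavy) or can itself be partitioned.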

To the best of our knowledge, this is the first such effort in this direction.



We will use several papers to familiarize ourselves with the mathematical theory of spectral clustering, as well as with sequential implementations of it. We will also refer to previous (non-hybrid) parallel implementations of parts of the algorithm, such as graph construction and eigenvalue computation. These are listed in the references below.

Goals and Deliverables

Plan to Achieve

We will have a working, correct implementation of a C++ hybrid MPI + CUDA parallel version of a Spectral Clustering algorithm as a miniature library. We expect the quality of the results produced by our algorithm to be comparable to those produced by a sequential version of the algorithm.

The previous CUDA-only implementation [7] achieved a 5x speedup over a sequential MATLAB version. Pessimistically, we expect to achieve at least the same 5x speedup over the sequential MATLAB version, but on larger data sets (since we have more total memory and compute power available to us).

We will also compare our code with MPI-only (if available) and CUDA-only parallel implementations of the algorithm, and hope to stay competitive. We anticipate two major bottlenecks in our code: inter-node communication overhead in MPI and CUDA host memory to device memory transfer overhead. We will spend a significant amount of time optimizing and analyzing these two.

For the poster session, we will show our speedup graphs and discuss the parallelization approaches and trade-offs that we made. We will also show the interface to the library, the API structure, and the ease of use of our code.

If we do decide to trade quality for speedup, we will make sure to document and bound the resulting error in our algorithm.

Hope to Achieve

We hope to achieve at least a (5 + 0.5 * number of nodes)x speedup on two data sets. The reason is that we expect the algorithm to scale with the number of nodes as well, providing more speedup than the 5x obtained by the CUDA-only implementation. However, we anticipate non-ideal scaling and therefore limit ourselves to the more achievable additional factor of 0.5 * number of nodes; for example, on a 4-node cluster this target works out to 7x.

Further, if our project goes better than expected, we will test against a few more data sets (with varying degrees of sparsity) and study the scaling characteristics of our code on these data sets. In the best case scenario, we would also perform isolated testing on AWS machines or on different GPUs to analyze the variability in our results.

In Case of Delays

Conversely, if our project does not go according to the timeline due to unforeseen issues, we will implement a subpart of the algorithm completely and correctly, and describe exactly what extra work would be needed to convert the partial results into the final expected result.

If our speedups are not as we expected, we will study the scaling properties of various parts of the algorithm individually and point out exactly why our algorithm does not work as we expected.

Platform Choice

Our code will be written from scratch in C++, though we may choose to use Thrust/cuBLAS helper routines. We may also study ideas from the previous CUDA-only versions to understand how they fit into our MPI + CUDA formulation of the algorithm, and adopt a few of them (while providing due credit).


Start Date | End Date | Work to be done
Nov 1  | Nov 5  | Understand the mathematics behind Spectral Clustering. Finalize an eigendecomposition algorithm. Finalize a test data set. Compute benchmarks using existing implementations.
Nov 6  | Nov 13 | Implement sequential Spectral Clustering. Compare against benchmarks.
Nov 14 | Nov 20 | Prepare the hybrid MPI + CUDA architecture. Develop subroutines for Spectral Clustering using CUDA. Distribute tasks across OpenMPI processes. Run correctness tests on small data sets.
Nov 21 | Nov 27 | Optimize the algorithm for large data sets. Work on minimizing MPI/CUDA overheads. Run benchmark tests on large data sets.
Nov 28 | Dec 4  | Buffer for pending tasks/running tests.
Dec 5  | Dec 12 | Write the final project report. Prepare the poster for presentation. Create any remaining visualizations.


[1]   Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection, June 2014.

[2]   Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[3]   Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, 2007.

[4]   Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances In Neural Information Processing Systems, pages 849–856. MIT Press, 2001.

[5]   Jane Cullum and Ralph A. Willoughby. A survey of Lanczos procedures for very large real ‘symmetric’ eigenvalue problems. Journal of Computational and Applied Mathematics, 12:37–60, 1985.

[6]   Jiri Kraus. An introduction to CUDA-aware MPI. Published 2013-03-13.

[7]   Parallel eigensolver for graph spectral analysis on GPU. Accessed 2016-10-31.

[8]   Xinlei Chen and Deng Cai. Large scale spectral clustering with landmark-based representation. In AAAI, 2011.

[9]   Donghui Yan, Ling Huang, and Michael I. Jordan. Fast approximate spectral clustering. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 907–916, New York, NY, USA, 2009. ACM.

[10]   Scott White and Padhraic Smyth. A Spectral Clustering Approach To Finding Communities in Graphs, pages 274–285.

[11]   Maxim Naumov. Fast spectral graph partitioning on GPUs. NVIDIA Developer Blog, published 2016-05-12.

[12]   Fast and efficient spectral clustering in MATLAB. MathWorks tutorial article, accessed 2016-10-31.

[13]   Clustering using scikit-learn. Accessed: 2016-10-31.

[14]   J. Zhang, G. Wu, X. Hu, S. Li, and S. Hao. A parallel k-means clustering algorithm with MPI. In 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming, pages 60–64, Dec 2011.

[15]   W. Y. Chen, Y. Song, H. Bai, C. J. Lin, and E. Y. Chang. Parallel spectral clustering in distributed systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):568–586, March 2011.