GPU computation performs the multiplication in parallel by decomposing it into a grid of thread blocks, where each block is a group of threads scheduled onto a separate core. I have 18 cores on my Sony notebook (16 on the GPU and 2 on the CPU), and that is a lot of power!
- GPGPU: CUDPP provides a library for parallel multiplication based on the classical parallel prefix (scan) approach. I tested some code (which I adapted to run on Windows Vista myself), and this approach looks quite promising;
- IBM proposes a closed-source solution for sparse matrix-vector multiplication. The declared performance is interesting, but I don't know the internals of their approach;
- This technical report from NVIDIA tackles the problem with an interesting mix of algorithmic and engineering solutions: N. Bell and M. Garland, "Efficient sparse matrix-vector multiplication on CUDA," NVIDIA Technical Report NVR-2008-004, December 2008. [PDF] [Online].