This week I implemented a GPU-resident parallel reduction for Cholla, as described in this NVIDIA blog post. Performance is similar to the reductions we currently use, but instead of requiring a complicated chunk of code and finishing the reduction on the CPU, it is entirely GPU-resident and can be done with a single device function call at the end of a transform/reduce kernel. I've implemented it for the time step calculation, but it currently returns slightly different values than the old version. The end result still appears to be correct, but I'm working out exactly where the differences lie.
I've finished implementing a kernel and a series of functions that compute the divergence of the magnetic field, reduce it at both the local and global level, check the result, and then either exit if the divergence is too high or simply report it if the value is acceptably small.
- Read and presented on Morton et al. 2022