Intro

An N-body simulation is a simulation of a system of n particles or entities, typically modeling the force of gravity between them. From classical physics (Newton's law of universal gravitation),

\[F=G\frac{m_1m_2}{r^2}\]

where F is force, G is the gravitational constant, \(m_1\) is the mass of object 1, \(m_2\) is the mass of object 2, and \(r\) is the distance between the centers of the two masses.

By applying this equation to every pair of entities and accumulating the results, we can calculate the force on every entity at any given point in time, and therefore the acceleration of every entity at the current time step, in \(O(n^2)\). We cannot simulate time continuously as in real life, so we choose a small timestep t, like t = 0.01s. Then, at each step, rather than adding the newly calculated acceleration (a) directly to each entity's current velocity (v), we add \(a\cdot t\), and likewise advance the position by \(v\cdot t\). Although this simple integration isn't perfect, it is often good enough. If we need more precision, we can make t smaller in exchange for the simulation taking more time.
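A minimal sketch of one such timestep in plain C++ (the Body struct, field names, and softening constant are illustrative assumptions, not the project's actual code):

```cpp
#include <cmath>
#include <vector>

// Hypothetical entity type, for illustration only.
struct Body {
    double x, y, z;    // position
    double vx, vy, vz; // velocity
    double m;          // mass
};

const double G = 6.674e-11; // gravitational constant

// Advance every body by one timestep dt using the naive O(n^2) method
// with simple Euler integration.
void step(std::vector<Body>& bodies, double dt) {
    const size_t n = bodies.size();
    std::vector<double> ax(n, 0.0), ay(n, 0.0), az(n, 0.0);

    // Accumulate the acceleration on each body from every other body.
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            if (i == j) continue;
            double dx = bodies[j].x - bodies[i].x;
            double dy = bodies[j].y - bodies[i].y;
            double dz = bodies[j].z - bodies[i].z;
            double r2 = dx * dx + dy * dy + dz * dz + 1e-9; // softening avoids divide-by-zero
            double r  = std::sqrt(r2);
            // a = G * m_j / r^2, directed toward body j (m_i cancels out of F = m_i * a).
            double a = G * bodies[j].m / r2;
            ax[i] += a * dx / r;
            ay[i] += a * dy / r;
            az[i] += a * dz / r;
        }
    }

    // v += a * t, then x += v * t.
    for (size_t i = 0; i < n; ++i) {
        bodies[i].vx += ax[i] * dt;
        bodies[i].vy += ay[i] * dt;
        bodies[i].vz += az[i] * dt;
        bodies[i].x  += bodies[i].vx * dt;
        bodies[i].y  += bodies[i].vy * dt;
        bodies[i].z  += bodies[i].vz * dt;
    }
}
```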

Barnes Hut

The approach described in the introduction is very simple and can be implemented in 20 or so lines of code, excluding the visualization. However, at \(O(n^2)\), it quickly slows down as n grows large. We can approximate the exact algorithm with Barnes Hut, an algorithm with time complexity \(O(n\log n)\). In three dimensions, the idea is to split space into octants recursively until each leaf suboctant contains at most one entity. Then, to calculate the acceleration on each entity, traverse down the resulting octree. For any suboctant that is close enough, fully recurse down that subtree, doing the usual naive calculations. However, for any suboctant that is far enough away from the entity currently being queried, treat all of the entities within it as a single entity (located at their center of mass, with their combined mass) for the purpose of the calculation.
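A common way to decide "far enough away" is the opening-angle test: a suboctant of side length s at distance d from the query entity is treated as a single entity when \(s/d < \theta\), for some accuracy parameter \(\theta\) (with \(\theta = 0\) recovering the exact algorithm). Here is a rough sketch of the acceleration query against an already-built octree, assuming a hypothetical Node type rather than the project's actual data structures:

```cpp
#include <cmath>

// Hypothetical octree node, for illustration only.
struct Node {
    double cx, cy, cz;   // center of mass of all bodies in this octant
    double mass;         // total mass of all bodies in this octant
    double size;         // side length of this octant's cube
    Node*  children[8];  // nullptr where a suboctant is empty
    bool   isLeaf;       // true if this node holds a single body
};

const double G = 6.674e-11;
const double THETA = 0.5; // opening angle: smaller = more accurate but slower

// Accumulate the acceleration on the body at (px, py, pz) into (ax, ay, az).
void accumulate(const Node* node, double px, double py, double pz,
                double& ax, double& ay, double& az) {
    if (node == nullptr || node->mass == 0.0) return;

    double dx = node->cx - px;
    double dy = node->cy - py;
    double dz = node->cz - pz;
    double d2 = dx * dx + dy * dy + dz * dz + 1e-9; // softening
    double d  = std::sqrt(d2);

    // Far enough away (or a leaf): treat the whole octant as one entity.
    // Note: if the leaf is the query body itself, dx = dy = dz = 0, so it adds nothing.
    if (node->isLeaf || node->size / d < THETA) {
        double a = G * node->mass / d2;
        ax += a * dx / d;
        ay += a * dy / d;
        az += a * dz / d;
        return;
    }

    // Otherwise, recurse into the suboctants.
    for (int i = 0; i < 8; ++i)
        accumulate(node->children[i], px, py, pz, ax, ay, az);
}
```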

This approximation relies on the \(r^2\) in the denominator of the force calculation: the influence of far-away entities falls off quadratically, so it is much smaller, and less important to get exactly right, than the influence of close-by entities.
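Concretely, when an octant is far enough away, its k entities with masses \(m_1,\dots,m_k\) are replaced by a single entity of mass \(\sum_i m_i\) at their center of mass, so the force on a query entity of mass m at distance d from that center of mass becomes

\[F\approx G\,\frac{m\sum_{i=1}^{k} m_i}{d^2}\]

which costs one calculation instead of k.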

If implemented inefficiently, this approach may be only barely faster than the naive approach, or even the same speed with worse accuracy, so care must be taken to ensure that the algorithm and the data structures supporting it are as efficient as possible.

CUDA

Supporting the simulation of hundreds of thousands of entities at the same time is hard, especially in terms of performance. A normal computer processor, the CPU, is primarily geared towards single-threaded code: instructions executed one after another, very quickly. In the last 15 or so years, General-Purpose Graphics Processing Units (GPGPUs) have become more popular because of their ability to perform many small calculations in parallel. However, programming them requires a somewhat different skillset than the classical CPU programming most people learn in college, so learning to utilize GPGPUs within their limits was a challenge for me.

Parallelizing the naive \(O(n^2)\) algorithm on GPGPUs is not very difficult, but the challenge comes in combining the Barnes Hut approximation algorithm with CUDA programming. Somehow, we need to construct the recursive tree of octants (octree) efficiently and in parallel, and then efficiently query this tree to calculate the acceleration for every entity.
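As a baseline before tackling the tree, a minimal CUDA sketch of the naive \(O(n^2)\) step assigns one thread per body (the names and data layout here are illustrative assumptions, not the project's actual code):

```cuda
// One thread per body: each thread reads every other body's position and
// mass, accumulates its own acceleration, and updates its own velocity.
// Positions are advanced in a separate pass so that every thread sees the
// same positions for the whole step.
__global__ void nbodyNaiveStep(const float3* pos, float3* vel,
                               const float* mass, int n, float dt, float G) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float3 a  = make_float3(0.0f, 0.0f, 0.0f);
    float3 pi = pos[i];

    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[j].x - pi.x;
        float dy = pos[j].y - pi.y;
        float dz = pos[j].z - pi.z;
        float r2   = dx * dx + dy * dy + dz * dz + 1e-6f; // softening
        float invR = rsqrtf(r2);
        float s    = G * mass[j] * invR * invR * invR;    // G * m_j / r^3
        a.x += s * dx;
        a.y += s * dy;
        a.z += s * dz;
    }

    // v += a * t
    vel[i].x += a.x * dt;
    vel[i].y += a.y * dt;
    vel[i].z += a.z * dt;
}
```

A launch such as `nbodyNaiveStep<<<(n + 255) / 256, 256>>>(d_pos, d_vel, d_mass, n, dt, G)` covers all n bodies. Real GPU n-body implementations usually tile positions through shared memory to reduce global-memory traffic, but that optimization is separate from the Barnes Hut work described here.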

As a side challenge, we should be able to visualize this efficiently in 3D space as well.