On GPU-accelerated TPT (March 2019)

  • jerrythebest
    12th Mar 2019 Member

    I've been playing around with building my own tpt from scratch for a week or two, and I finally got a prototype working.

    https://www.youtube.com/watch?v=m355xPfXkPo&feature=youtu.be

    I only have 5 elements for now and 2 of them are for debugging.

    Temperature is broken; pretty much the only things working right now are pressure, wind, and basic particle movement.


    The upside is that it's a 1024x1024 play space with hundreds of thousands of particles running at 60 fps on a laptop GTX 1050. It could comfortably handle more right now, but I expect some performance penalties as more features are added.


    I've been scratching my head over a few technical limitations that are holding me back, and I hope to discuss them here with anyone who's familiar with GPU programming.


    The challenges I've identified so far:

    • High thread divergence as the number of elements goes up.
    • Particle movement and garbage collection require some level of highly contested memory reads/writes + synchronization.
    • The air simulation currently uses too many per-thread registers. It needs to know the pressure, heat, wind, and particles around it in a 3x3 neighborhood (see the sketch after this list).
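
    To make that third point concrete, the air-sim read looks roughly like this (a simplified sketch; AirCell here is an unpacked stand-in for my actual bit-packed layout, and the field names are made up):

        // Stand-in cell layout, just to show the register cost of the 3x3 read.
        struct AirCell {
            float pressure, heat;
            float windX, windY;
            int   particle;   // occupant id, or -1
        };

        __global__ void airStep(const AirCell* in, AirCell* out, int w, int h)
        {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;

            // 9 cells * 5 fields is ~45 live values per thread before any
            // math happens; that's where the register pressure comes from.
            AirCell n[3][3];
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    n[dy + 1][dx + 1] = in[(y + dy) * w + (x + dx)];

            // ... pressure/wind update using the neighborhood goes here ...
            out[y * w + x] = n[1][1];
        }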

    High thread divergence should eventually cap out in impact, since there can only be so many element types in a single warp. The main bottlenecks are particle movement and the air sim.
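
    To illustrate where the divergence comes from (sketch; one thread per particle, and the element cases are placeholders):

        // A warp executes every case that any of its 32 threads takes, so the
        // worst-case penalty is bounded by warp size (32 distinct types),
        // no matter how many element types exist in total.
        __global__ void updateParticles(const int* type, int count)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= count) return;

            switch (type[i]) {
                case 1: /* dust update */  break;
                case 2: /* water update */ break;
                case 3: /* fire update */  break;
                // ... one case per element ...
                default: break;
            }
        }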


    In tpt, particle movement is done sequentially, so whole groups of particles can move together in a shuffle.

    For example, if particles a, b, and c all have a positive x velocity, c can move right, b can move right into the cell c was occupying, and a can move right into the cell b was occupying.
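
    In code, the sequential version is something like this (simplified to one axis; it assumes the particle list is sorted by x so the rightmost particle is processed first):

        struct Particle { int x, y, vx; };

        // Because c is processed first, it vacates its cell before b tries to
        // step into it, so a whole run of particles shuffles right in one pass.
        // 'grid' holds the index of the particle occupying each cell, or -1.
        void moveRight(int* grid, Particle* parts, int count, int w)
        {
            for (int i = count - 1; i >= 0; --i) {
                if (parts[i].vx <= 0) continue;
                int src = parts[i].y * w + parts[i].x;
                int dst = src + 1;
                if (grid[dst] == -1) {   // free because its occupant already left
                    grid[dst] = i;
                    grid[src] = -1;
                    parts[i].x += 1;
                }
            }
        }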


    But on the GPU, all particle movements are calculated simultaneously, so no particle can "take the place" of another that moves out in the same frame. This has profound impacts on the calculation of collision normals as well: the way tpt computes collision normals is too expensive to be feasible on the GPU.
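
    The parallel version of the shuffle example ends up looking something like this (sketch, using the same Particle struct and grid layout as above; atomicCAS is one way to resolve contested cells, not necessarily the best):

        // Every particle tries to claim its destination in the same frame.
        // When b's thread runs, grid[dst] still reads as c, so the claim
        // fails even though c is leaving this frame, and chains can't form:
        // the whole row shuffles right one particle per frame instead.
        __global__ void moveRightGPU(int* grid, Particle* parts, int count, int w)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= count || parts[i].vx <= 0) return;

            int src = parts[i].y * w + parts[i].x;
            int dst = src + 1;

            if (atomicCAS(&grid[dst], -1, i) == -1) {  // claim dst if empty *now*
                grid[src] = -1;   // vacate; whether a neighbor sees this in time
                                  // is exactly the nondeterminism problem
                parts[i].x += 1;
            }
        }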


    On garbage collection, I doubt I'll be able to do much, since parallel sparse-array compaction is still an unsolved problem. I'm currently just using a stack in shared memory for each group, which gets pushed to global memory after that group finishes execution.
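
    The shared-memory stack looks roughly like this (sketch; the sizes and names are made up, and type 0 stands for a dead slot):

        #define GROUP_SIZE 512

        // Dead slots are pushed onto an on-chip stack, then flushed to the
        // global free list with a single atomicAdd per block, so the heavy
        // contention stays in shared memory. Launch with GROUP_SIZE threads.
        __global__ void collectDead(const int* type, int count,
                                    int* freeList, int* freeCount)
        {
            __shared__ int stack[GROUP_SIZE];
            __shared__ int top, base;
            if (threadIdx.x == 0) top = 0;
            __syncthreads();

            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < count && type[i] == 0)
                stack[atomicAdd(&top, 1)] = i;
            __syncthreads();

            if (threadIdx.x == 0)                 // reserve a contiguous range
                base = atomicAdd(freeCount, top);
            __syncthreads();
            if (threadIdx.x < top)                // flush cooperatively
                freeList[base + threadIdx.x] = stack[threadIdx.x];
        }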


    I thought about busting out histogram pyramids, but that felt like overkill. Plus, if I were going to do something like a histogram pyramid, I might as well just implement bucket sort to help with thread divergence. Though, I haven't done performance testing yet.


    Assuming 96 KB of shared memory per SM, I would want at least 2 blocks, so 48 KB per block. If I have a group size of 512, each bucket would need 512 * 4 = 2 KB of memory. That means I can have at most 24 buckets, which isn't a lot to play with.
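
    As a back-of-envelope check (sketch; the constants assume a Volta-class SM, and in practice I'd need one bucket fewer to leave room for the counters):

        constexpr int GROUP_SIZE   = 512;
        constexpr int BUCKET_BYTES = GROUP_SIZE * 4;             // 2 KB each
        constexpr int MAX_BUCKETS  = (48 * 1024) / BUCKET_BYTES; // = 24
        constexpr int NUM_BUCKETS  = MAX_BUCKETS - 1;            // 23, ~46 KB

        // Worst case every thread lands in the same bucket, so each bucket
        // needs one slot per thread. Afterwards, process one bucket at a time
        // so each warp only ever sees one element type.
        __global__ void bucketByType(const int* type, int count)
        {
            __shared__ int bucket[NUM_BUCKETS][GROUP_SIZE];
            __shared__ int bucketLen[NUM_BUCKETS];
            for (int b = threadIdx.x; b < NUM_BUCKETS; b += blockDim.x)
                bucketLen[b] = 0;
            __syncthreads();

            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < count) {
                int b = type[i] % NUM_BUCKETS;                   // crude pick
                bucket[b][atomicAdd(&bucketLen[b], 1)] = i;
            }
            __syncthreads();
            // ... per-bucket processing goes here ...
        }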


    For the air sim, I've already done all I can to reduce the memory footprint. I even went as far as storing data in bit fields with bit masks. Honestly, I don't even know if the encoding and decoding costs are worth the memory savings. I could try using fixed-point instead of floats, but idk what impact that'll have on simulation stability, and most GPUs have word sizes of 4 bytes anyway, so that probably won't do much.
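
    For a sense of what the packing looks like (sketch; the field widths here are made up: 10 bits of pressure, 10 of heat, 6 per wind component, all as unsigned fixed-point offsets from some baseline):

        typedef unsigned int u32;

        __device__ u32 packCell(u32 pressure, u32 heat, u32 windX, u32 windY)
        {
            return  (pressure & 0x3FF)
                 | ((heat     & 0x3FF) << 10)
                 | ((windX    & 0x3F)  << 20)
                 | ((windY    & 0x3F)  << 26);
        }

        __device__ void unpackCell(u32 c, u32* pressure, u32* heat,
                                   u32* windX, u32* windY)
        {
            *pressure =  c        & 0x3FF;
            *heat     = (c >> 10) & 0x3FF;
            *windX    = (c >> 20) & 0x3F;
            *windY    = (c >> 26) & 0x3F;
        }
        // Every read is a shift+mask and every write a full re-pack; that's
        // the encode/decode cost I'm not sure pays for itself.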


    Hmm... idk. I guess anyone with ideas on how to solve any of these issues can chime in, so at least I'm getting a sanity check.

    Edited 4 times by jerrythebest. Last: 13th Mar 2019