AMD actually used Powder Toy for OpenCL demos. I contacted them to ask whether they would be willing to provide the code so it could be included in Powder Toy. I will let you know.
(I know this is a bump, but it might be totally worth it.)
Here is a link to the demo video:
http://fireuser.com/blog/amd_opencl_parallel_computing_demo_from_siggraph_asia_2008/
It's not very hard to convert code to OpenCL form. I'm eager to convert Powder Toy's code, as long as there aren't too many recursive functions (they are harder to port, though still doable up to a limited depth).
Please tell me if you need me to convert any function to OpenCL. Maybe I could write some auxiliary functions for Java or C++ as a DLL or something. Moving a fluid simulation from CPU to GPU made it 5-10 times faster than my 8-core CPU, more than 20x faster for n-body, and 50x for some random number generation.
I converted some code to OpenCL with ease within just a day or two, and I also write my own OpenCL code from scratch. Here are some examples of what I've done:
http://www.youtube.com/watch?v=YNj10f-7yTA
http://www.youtube.com/watch?v=vX-tsp1f3Qs
http://www.youtube.com/watch?v=r9gcCMl_eMc
http://www.youtube.com/watch?v=HKsqYoMlwas
http://www.youtube.com/watch?v=RHoNvTxWvHk
http://www.youtube.com/watch?v=GWsNDicrY1o
http://www.youtube.com/watch?v=602XVhl2QMY
A lot of the particle interaction code would be rather difficult to parallelise, but if you want to, you could start by trying to convert the airflow simulation.
Source code for Powder Toy can be found at https://github.com/simtr/The-Powder-Toy/tree/develop .
Airflow simulation is in the src/simulation/Air.cpp file.
Okay, I will work on air for now. Maybe adding OpenCL as an auxiliary module can do the job without rewriting other parts of the program. The aux will have an input array, an output array, and a compute method.
For example, "void Air::Invert()" seems memory-bound and a poor fit for sending to the GPU on its own, but if there are many iterations of it before rendering, then it is worth sending to the GPU along with the other kernels' buffers (or at least running on all CPU cores). A multi-iteration version can gain 5x to 10x depending on the GPU's memory bandwidth. For a single iteration, the data has to cross the ~5 GB/s PCIe bus while the CPU could read it straight from ~20 GB/s system RAM, so there is very little to gain.
But "void Air::update_air(void)" does seem worth sending to the GPU.
Maybe something like this:
Let's assume you need clear + update + invert + update_h + invert again, repeated 5 times per iteration.
Building the solver functions:
int airPressure=0;
int height_map=1;
AuxDev GPU = new AuxDev("gpu", 512, 512, 4, 4, 2);
AuxDev CPU = new AuxDev("cpu", 512, 512, 4, 4, 2);
GPU.addKernel(clear_air);
GPU.addKernel(update_air);
GPU.addKernel(invert_air);
GPU.addKernel(update_air_h);
GPU.addKernel(invert_air);
GPU.addBuffer(airPressure, 512,512,"float");
GPU.addBuffer(height_map, 512,512,"float");
GPU.wire_buffers_to_kernels(new int[]{airPressure}, update_air);
GPU.wire_buffers_to_kernels(new int[]{airPressure}, invert_air);
GPU.wire_buffers_to_kernels(new int[]{airPressure, height_map}, update_air_h);
...
CPU.addKernel(otherThings);
...
...
GPU.readBufferFromCPU_CPP_array(.....);
CPU.readBufferFromCPU_CPP_array(....);
CPU.compute();
for(int i=0;i<5;i++)
{
GPU.compute();
}
CPU.writeBufferToCPU_CPP_array(....);
GPU.writeBufferToCPU_CPP_array(.....);
I think Air::Invert() is only run in response to user input, so almost never. Clear() and ClearAirH() are also not used a lot. update_air() and update_airh() are the important functions, they get run once per simulation time step.
At each timestep, Simulation::update_particles() is run. This calls Air::update_air() once, and might call Air::update_airh() once (update_airh is optional, depending on game settings). Then particle interactions happen, which may modify air data. The frame is then rendered.
In the current update_air():
pv, vx, vy are input arrays containing air pressure, x velocity, y velocity.
bmap_blockair is an input array containing information about positions of blockages which stop airflow (such as walls).
opv, ovx, ovy are temporary storage arrays for the output. The output should go in pv/vx/vy. But the pv/vx/vy input data should not change during calculation, so output is written to opv/ovx/ovy. At the end of update_air() it is copied back to pv/vx/vy.
Is the main project Python? If not, can I open the whole project with MSVC 2012 to try some OpenCL code? Maybe I could just write an auxiliary OpenCL class and send it to you, so you can try it and post the results here.
XRES=1024, YRES=768, CELL=2
I will upload it when I fix some minor bugs.
Edit: update_air and update_airh share some buffers, so I united them into a single GPU call:
------------
CPU update_air: 32 ms
GPU update_air: 8 ms
------------
CPU update_airh: 18 ms
GPU update_airh: 4 ms
------------
GPU update_air + update_airh in a single call: 10 ms (only 3 ms is computing; the rest is write/read from a single thread, according to the CodeXL profiler)
Maybe multithreaded write/read (using a simple OpenMP body) can give even better timings, like 5-6 ms.
(Disabling the profiler changes the time from 10 ms to 7 ms, so it could even be 3-4 ms in a release build.)
------------
CPU update_particles first loop: 1 ms (will move this to the GPU later)
------------
CPU update_particles second loop: 1 ms (will move to the GPU later)
------------
CPU update_particles_i: 11 ms (will turn its nested loops into GPU calls later)
------------
Edit: CPU render_fire seems to be demanding, and GPU-doable too!
Edit: started converting render_fire (with its add_pixel function) to GPU OpenCL code.
Edit: render_fire now takes around 7 ms on my system, while the CPU version takes 45 ms.
Edit: now update_particles_i is the most cycle-consuming part. I will focus there.