- [peak performance on GPUs](#single-gpu-benchmarks) (datacenter/gaming/professional/laptop), validated with roofline model
- [DDF-shifting](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats) and other algebraic optimization to minimize round-off error
- <details><summary>powerful model extensions</summary>
  - [fully analytic PLIC](https://doi.org/10.3390/computation10020021) for efficient curvature calculation
  - improved mass conservation
  - ultra efficient implementation with only [4 kernels](https://doi.org/10.3390/computation10060092) in addition to the `stream_collide()` kernel
  - thermal LBM to simulate thermal convection
  - D3Q7 subgrid for thermal DDFs
  - in-place streaming with [Esoteric-Pull](https://doi.org/10.3390/computation10060092) for thermal DDFs
  - optional [FP16S or FP16C compression](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats) for thermal DDFs with [DDF-shifting](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats)
  - Smagorinsky-Lilly subgrid turbulence LES model to keep simulations with very large Reynolds numbers stable
  - particles with immersed-boundary method (either passive or 2-way-coupled, only supported in single-GPU mode)
  </details>
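The Smagorinsky-Lilly model listed above computes a local eddy viscosity from the strain rate tensor, which is what keeps high-Reynolds-number runs stable. A minimal sketch of the formula (not FluidX3D source code; the function name, the filter width `delta` and the constant `c_s` are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Smagorinsky-Lilly subgrid eddy viscosity: nu_t = (C_s * Delta)^2 * |S|,
// where |S| = sqrt(2 * S_ij * S_ij) is the strain rate magnitude and
// Delta is the filter (grid) width. In LBM, S_ij is available locally
// from the non-equilibrium DDFs, so this fits nicely into a per-cell kernel.
double smagorinsky_nu_t(const double S[3][3], double delta, double c_s = 0.17) {
    double s2 = 0.0; // accumulate S_ij * S_ij
    for(int i = 0; i < 3; i++)
        for(int j = 0; j < 3; j++)
            s2 += S[i][j] * S[i][j];
    const double magnitude = std::sqrt(2.0 * s2); // |S|
    return (c_s * delta) * (c_s * delta) * magnitude;
}
```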
## Solving the Visualization Problem
- FluidX3D can do simulations so large that storing the volumetric data for later rendering becomes unmanageable (like 120 GB for a single frame, hundreds of terabytes for a video)
- instead, FluidX3D allows [rendering raw simulation data directly in VRAM](https://www.researchgate.net/publication/360501260_Combined_scientific_CFD_simulation_and_interactive_raytracing_with_OpenCL), so no large volumetric files have to be exported to the hard disk (see my [technical talk](https://youtu.be/pD8JWAZ2f8o))
- the rendering is so fast that it works interactively in real time, for both rasterization and raytracing
- if no monitor is available (like on a remote Linux server), there is an ASCII rendering mode to interactively visualize the simulation in the terminal (even in WSL and/or through SSH)
- rendering is fully multi-GPU-parallelized via seamless domain decomposition rasterization
- with interactive graphics mode disabled, image resolution can be as large as VRAM allows (4K/8K/16K and above)
- (interactive) visualization modes:
  - flags (and force vectors on solid boundary cells if the extension is used)
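The storage figures above are easy to verify with a back-of-the-envelope calculation (the grid size and the choice of fields here are hypothetical examples, not the exact configuration behind the 120 GB number):

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store one frame of the velocity field of an n^3 grid
// as 3 x FP32 per cell.
constexpr std::uint64_t volumetric_frame_bytes(std::uint64_t n) {
    return n * n * n * 3u * 4u; // 3 velocity components x 4 bytes each
}
// A 2000^3 grid already needs ~96 GB per frame for velocity alone, while a
// rendered 4K framebuffer (3840 x 2160, RGBA8) is only ~33 MB - which is why
// rendering directly in VRAM sidesteps the export problem entirely.
```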
- gaming GPUs (desktop or laptop), like Nvidia GeForce, AMD Radeon, Intel Arc
- professional/workstation GPUs, like Nvidia Quadro, AMD Radeon Pro / FirePro
- integrated GPUs
- Intel Xeon Phi (requires installation of the [Intel OpenCL CPU Runtime ("oclcpuexp")](https://github.com/intel/llvm/releases?q=oneAPI+DPC%2B%2B+Compiler))
- Intel/AMD CPUs (requires installation of the [Intel OpenCL CPU Runtime ("oclcpuexp")](https://github.com/intel/llvm/releases?q=oneAPI+DPC%2B%2B+Compiler))
- even smartphone ARM GPUs
- supports parallelization across multiple GPUs on a single PC/laptop/server with PCIe communication, no SLI/Crossfire/NVLink/InfinityFabric or MPI installation required; the GPUs don't even have to be from the same vendor, but similar memory capacity and bandwidth are recommended
- works on Windows and Linux with C++17, with limited support also for macOS and Android
- supports importing and voxelizing triangle meshes from binary `.stl` files, with fast GPU voxelization
- supports exporting volumetric data as binary `.vtk` files with `lbm.<field>.write_device_to_vtk();`
- supports exporting rendered frames as `.png`/`.qoi`/`.bmp` files with `lbm.graphics.write_frame();`; encoding is handled in parallel on the CPU while the simulation on the GPU can continue without delay
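The parallel-encoding design mentioned in the last point can be sketched as follows (this is an illustrative pattern, not FluidX3D source code; `encode_and_write` and `write_frame_async` are hypothetical names, and a dummy counter stands in for the real image encoder):

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

std::atomic<int> frames_encoded{0};

// Stand-in for the time-consuming .png/.qoi/.bmp encoding and disk write.
void encode_and_write(const std::vector<std::uint8_t>& frame) {
    frames_encoded++;
}

// Hand a copy of the finished frame to a CPU worker thread, then return
// immediately so the simulation loop on the GPU is not blocked by encoding.
void write_frame_async(std::vector<std::uint8_t> frame, std::thread& worker) {
    if(worker.joinable()) worker.join(); // finish the previous frame first
    worker = std::thread([f = std::move(frame)] { encode_and_write(f); });
}
```

The key design choice is that only a cheap framebuffer copy happens on the simulation thread; the expensive encoding runs concurrently on the CPU.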
## How to get started?
1. Check the settings and extensions in [`src/defines.hpp`](src/defines.hpp) by uncommenting corresponding lines.
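Enabling a feature is a matter of uncommenting its `#define` line. The snippet below only illustrates the uncomment-to-enable pattern; the macro names shown are placeholders, so check [`src/defines.hpp`](src/defines.hpp) itself for the real ones:

```cpp
// illustrative example of the pattern used in src/defines.hpp
// (macro names are placeholders, not guaranteed to match the file):

//#define FEATURE_A   // commented out -> this extension stays disabled
#define FEATURE_B     // uncommented   -> this extension is compiled in
```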
## Single-GPU Benchmarks
Here are [performance benchmarks](https://doi.org/10.3390/computation10060092) on various hardware in MLUPs/s, or how many million lattice points are updated per second. The settings used for the benchmark are D3Q19 SRT with no extensions enabled (only LBM with implicit mid-grid bounce-back boundaries) and the setup consists of an empty cubic box with sufficient size (typically 256³). Without extensions, a single lattice point requires:
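Since LBM is memory-bound, the expected update rate follows directly from memory bandwidth divided by the bytes moved per cell update. As a rough roofline sketch (the 153-byte figure is an assumption for D3Q19 FP32 with one-pass streaming: 19 DDFs loaded + 19 stored at 4 bytes each, plus 1 flag byte):

```cpp
#include <cassert>

// Roofline-style estimate: MLUPs/s ~ memory bandwidth / bytes per cell update.
// Assumption: 153 bytes of memory traffic per cell per time step
// (19 DDF loads + 19 DDF stores at 4 bytes each, plus 1 flag byte).
double expected_mlups(double bandwidth_gb_s, double bytes_per_cell = 153.0) {
    return bandwidth_gb_s * 1.0e9 / bytes_per_cell / 1.0e6;
}
// e.g. a GPU with 1000 GB/s bandwidth should land in the ballpark of
// ~6500 MLUPs/s under these assumptions
```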