diff --git a/README.md b/README.md
index 21586ac8..ec56d81b 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via [OpenCL](https://github.com/ProjectPhysX/OpenCL-Wrapper "OpenCL-Wrapper").


-
+
Update History
@@ -55,16 +55,23 @@ The fastest and most memory efficient lattice Boltzmann CFD software, running on
- improved raytracing framerate when camera is inside fluid
- fixed skybox pole flickering artifacts
- fixed bug where moving objects during re-voxelization would leave an erroneous trail of solid grid cells behind
+- v2.6 (16.04.2023)
+ - patched OpenCL issues of Intel Arc GPUs: now VRAM allocations >4GB are possible and correct VRAM capacity is reported
## Compute Features
-- CFD model: lattice Boltzmann method (LBM)
- ⚬ streaming (part 2/2)
f0temp(x,t) = f0(x, t)
fitemp(x,t) = f(t%2 ? i : (i%2 ? i+1 : i-1))(i%2 ? x : x-ei, t) for i ∈ [1, q-1]
- ⚬ collision
ρ(x,t) = (Σi fitemp(x,t)) + 1
u(x,t) = 1∕ρ(x,t) Σi ci fitemp(x,t)
fieq-shifted(x,t) = wi ρ · ((u°ci)2∕(2c4) - (u°u)∕(2c2) + (u°ci)∕c2) + wi (ρ-1)
fitemp(x, t+Δt) = fitemp(x,t) + Ωi(fitemp(x,t), fieq-shifted(x,t), τ)
- ⚬ streaming (part 1/2)
f0(x, t+Δt) = f0temp(x, t+Δt)
f(t%2 ? (i%2 ? i+1 : i-1) : i)(i%2 ? x+ei : x, t+Δt) = fitemp(x, t+Δt) for i ∈ [1, q-1]
+- CFD model: lattice Boltzmann method (LBM)
+
+ - streaming (part 2/2)f0temp(x,t) = f0(x, t)
fitemp(x,t) = f(t%2 ? i : (i%2 ? i+1 : i-1))(i%2 ? x : x-ei, t) for i ∈ [1, q-1]
+ - collisionρ(x,t) = (Σi fitemp(x,t)) + 1
u(x,t) = 1∕ρ(x,t) Σi ci fitemp(x,t)
fieq-shifted(x,t) = wi ρ · ((u°ci)2∕(2c4) - (u°u)∕(2c2) + (u°ci)∕c2) + wi (ρ-1)
fitemp(x, t+Δt) = fitemp(x,t) + Ωi(fitemp(x,t), fieq-shifted(x,t), τ)
+ - streaming (part 1/2)f0(x, t+Δt) = f0temp(x, t+Δt)
f(t%2 ? (i%2 ? i+1 : i-1) : i)(i%2 ? x+ei : x, t+Δt) = fitemp(x, t+Δt) for i ∈ [1, q-1]
+ - velocity sets: D2Q9, D3Q15, D3Q19 (default), D3Q27
+ - collision operators: single-relaxation-time (SRT/BGK) (default), two-relaxation-time (TRT)
+
+
-- peak performance on GPUs (datacenter/gaming/professional/laptop), validated with roofline model
-- optimized to minimize memory demand:
- - traditional LBM (D3Q19) with FP64 requires ~344 Bytes/cell
- ```
- 🟧🟧🟧🟧🟧🟧🟧🟧🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦
- 🟨🟨🟨🟨🟨🟨🟨🟨🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩
- 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩
- 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩
- 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩
- 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩
- 🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥
- 🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥
- 🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥
- 🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥
- 🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥
-
- (density 🟧, velocity 🟦, flags 🟨, 2 copies of DDFs 🟩/🟥; each square = 1 Byte)
- ```
- - allows for 3 Million cells per 1 GB VRAM
- - FluidX3D (D3Q19) requires only 55 Bytes/cell with [Esoteric-Pull](https://doi.org/10.3390/computation10060092)+[FP16](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats)
- ```
- 🟧🟧🟧🟧🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟨🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩
- 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩
+- optimized to minimize VRAM footprint to 1/6 of other LBM codes
- (density 🟧, velocity 🟦, flags 🟨, DDFs 🟩; each square = 1 Byte)
- ```
+ - traditional LBM (D3Q19) with FP64 requires ~344 Bytes/cell
+ - 🟧🟧🟧🟧🟧🟧🟧🟧🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟨🟨🟨🟨🟨🟨🟨🟨🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥
(density 🟧, velocity 🟦, flags 🟨, 2 copies of DDFs 🟩/🟥; each square = 1 Byte)
+ - allows for 3 Million cells per 1 GB VRAM
+ - FluidX3D (D3Q19) requires only 55 Bytes/cell with [Esoteric-Pull](https://doi.org/10.3390/computation10060092)+[FP16](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats)
+ - 🟧🟧🟧🟧🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟨🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩
(density 🟧, velocity 🟦, flags 🟨, DDFs 🟩; each square = 1 Byte)
- allows for 19 Million cells per 1 GB VRAM
- in-place streaming with [Esoteric-Pull](https://doi.org/10.3390/computation10060092): eliminates redundant copy `B` of density distribution functions (DDFs) in memory; almost cuts memory demand in half and slightly increases performance due to implicit bounce-back boundaries; offers optimal memory access patterns for single-cell in-place streaming
- [decoupled arithmetic precision (FP32) and memory precision (FP32 or FP16S or FP16C)](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats): all arithmetic is done in FP32 for compatibility on all hardware, but DDFs in memory can be compressed to FP16S or FP16C: almost cuts memory demand in half again and almost doubles performance, without impacting overall accuracy for most setups
-- multi-GPU support on a single node (PC/laptop/server) via domain decomposition
- - allows pooling VRAM from multiple GPUs for much larger grid resolution
+ - only 8 flag bits per lattice point (can be used independently / at the same time)
+
+ - `TYPE_S` (stationary or moving) solid boundaries
+ - `TYPE_E` equilibrium boundaries (inflow/outflow)
+ - `TYPE_T` temperature boundaries
+ - `TYPE_F` free surface (fluid)
+ - `TYPE_I` free surface (interface)
+ - `TYPE_G` free surface (gas)
+ - `TYPE_X` remaining for custom use or further extensions
+ - `TYPE_Y` remaining for custom use or further extensions
+
+
+ - large cost saving: comparison of maximum single-GPU grid resolution for D3Q19 LBM
+
+ | GPU VRAM capacity | 1 GB | 2 GB | 3 GB | 4 GB | 6 GB | 8 GB | 10 GB | 11 GB | 12 GB | 16 GB | 20 GB | 24 GB | 32 GB | 40 GB | 48 GB | 64 GB | 80 GB | 94 GB | 128 GB | 192 GB | 256 GB |
+ | :------------------------------- | --------: | --------: | --------: | --------: | --------: | --------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ----------: | ----------: | ----------: |
+ | approximate GPU price | $25
GT 210 | $25
GTX 950 | $12
GTX 1060 | $50
GT 730 | $35
GTX 1060 | $70
RX 470 | $500
RTX 3080 | $240
GTX 1080 Ti | $75
Tesla M40 | $75
Instinct MI25 | $900
RX 7900 XT | $205
Tesla P40 | $600
Instinct MI60 | $5500
A100 | $2400
RTX 8000 | $10k
Instinct MI210 | $11k
A100 | >$40k
H100 NVL | ?
Max Series 1550 | - | - |
+ | traditional LBM (FP64) | 144³ | 182³ | 208³ | 230³ | 262³ | 288³ | 312³ | 322³ | 330³ | 364³ | 392³ | 418³ | 460³ | 494³ | 526³ | 578³ | 624³ | 658³ | 730³ | 836³ | 920³ |
+ | FluidX3D (FP32/FP32) | 224³ | 282³ | 322³ | 354³ | 406³ | 448³ | 482³ | 498³ | 512³ | 564³ | 608³ | 646³ | 710³ | 766³ | 814³ | 896³ | 966³ | 1018³ | 1130³ | 1292³ | 1422³ |
+ | FluidX3D (FP32/FP16) | 266³ | 336³ | 384³ | 424³ | 484³ | 534³ | 574³ | 594³ | 610³ | 672³ | 724³ | 770³ | 848³ | 912³ | 970³ | 1068³ | 1150³ | 1214³ | 1346³ | 1540³ | 1624³ |
+
+
+- cross-vendor multi-GPU support on a single PC/laptop/server
+
+ - domain decomposition allows pooling VRAM from multiple GPUs for much larger grid resolution
- each domain (GPU) can hold up to 4.29 billion (2³², 1624³) lattice points (225 GB memory)
- GPUs don't have to be identical (not even from the same vendor), but similar VRAM capacity/bandwidth is recommended
- ⚬ domain communication architecture (simplified)
-
- ```diff
- ++ .-----------------------------------------------------------------. ++
- ++ | GPU 0 | ++
- ++ | LBM Domain 0 | ++
- ++ '-----------------------------------------------------------------' ++
- ++ | selective /|\ ++
- ++ \|/ in-VRAM copy | ++
- ++ .-------------------------------------------------------. ++
- ++ | GPU 0 - Transfer Buffer 0 | ++
- ++ '-------------------------------------------------------' ++
- !! | PCIe /|\ !!
- !! \|/ copy | !!
- @@ .-------------------------. .-------------------------. @@
- @@ | CPU - Transfer Buffer 0 | | CPU - Transfer Buffer 1 | @@
- @@ '-------------------------'\ /'-------------------------' @@
- @@ pointer X swap @@
- @@ .-------------------------./ \.-------------------------. @@
- @@ | CPU - Transfer Buffer 1 | | CPU - Transfer Buffer 0 | @@
- @@ '-------------------------' '-------------------------' @@
- !! /|\ PCIe | !!
- !! | copy \|/ !!
- ++ .-------------------------------------------------------. ++
- ++ | GPU 1 - Transfer Buffer 1 | ++
- ++ '-------------------------------------------------------' ++
- ++ /|\ selective | ++
- ++ | in-VRAM copy \|/ ++
- ++ .-----------------------------------------------------------------. ++
- ++ | GPU 1 | ++
- ++ | LBM Domain 1 | ++
- ++ '-----------------------------------------------------------------' ++
- ## | ##
- ## domain synchronization barrier ##
- ## | ##
- || -------------------------------------------------------------> time ||
- ```
-
- ⚬ domain communication architecture (detailed)
-
- ```diff
- ++ .-----------------------------------------------------------------. ++
- ++ | GPU 0 | ++
- ++ | LBM Domain 0 | ++
- ++ '-----------------------------------------------------------------' ++
- ++ | selective in- /|\ | selective in- /|\ | selective in- /|\ ++
- ++ \|/ VRAM copy (X) | \|/ VRAM copy (Y) | \|/ VRAM copy (Z) | ++
- ++ .---------------------.---------------------.---------------------. ++
- ++ | GPU 0 - TB 0X+ | GPU 0 - TB 0Y+ | GPU 0 - TB 0Z+ | ++
- ++ | GPU 0 - TB 0X- | GPU 0 - TB 0Y- | GPU 0 - TB 0Z- | ++
- ++ '---------------------'---------------------'---------------------' ++
- !! | PCIe /|\ | PCIe /|\ | PCIe /|\ !!
- !! \|/ copy | \|/ copy | \|/ copy | !!
- @@ .---------. .---------.---------. .---------.---------. .---------. @@
- @@ | CPU 0X+ | | CPU 1X- | CPU 0Y+ | | CPU 3Y- | CPU 0Z+ | | CPU 5Z- | @@
- @@ | CPU 0X- | | CPU 2X+ | CPU 0Y- | | CPU 4Y+ | CPU 0Z- | | CPU 6Z+ | @@
- @@ '---------\ /---------'---------\ /---------'---------\ /---------' @@
- @@ pointer X swap (X) pointer X swap (Y) pointer X swap (Z) @@
- @@ .---------/ \---------.---------/ \---------.---------/ \---------. @@
- @@ | CPU 1X- | | CPU 0X+ | CPU 3Y- | | CPU 0Y+ | CPU 5Z- | | CPU 0Z+ | @@
- @@ | CPU 2X+ | | CPU 0X- | CPU 4Y+ | | CPU 0Y- | CPU 6Z+ | | CPU 0Z- | @@
- @@ '---------' '---------'---------' '---------'---------' '---------' @@
- !! /|\ PCIe | /|\ PCIe | /|\ PCIe | !!
- !! | copy \|/ | copy \|/ | copy \|/ !!
- ++ .--------------------..---------------------..--------------------. ++
- ++ | GPU 1 - TB 1X- || GPU 3 - TB 3Y- || GPU 5 - TB 5Z- | ++
- ++ :====================::=====================::====================: ++
- ++ | GPU 2 - TB 2X+ || GPU 4 - TB 4Y+ || GPU 6 - TB 6Z+ | ++
- ++ '--------------------''---------------------''--------------------' ++
- ++ /|\ selective in- | /|\ selective in- | /|\ selective in- | ++
- ++ | VRAM copy (X) \|/ | VRAM copy (Y) \|/ | VRAM copy (Z) \|/ ++
- ++ .--------------------..---------------------..--------------------. ++
- ++ | GPU 1 || GPU 3 || GPU 5 | ++
- ++ | LBM Domain 1 || LBM Domain 3 || LBM Domain 5 | ++
- ++ :====================::=====================::====================: ++
- ++ | GPU 2 || GPU 4 || GPU 6 | ++
- ++ | LBM Domain 2 || LBM Domain 4 || LBM Domain 6 | ++
- ++ '--------------------''---------------------''--------------------' ++
- ## | | | ##
- ## | domain synchronization barriers | ##
- ## | | | ##
- || -------------------------------------------------------------> time ||
- ```
+ - domain communication architecture (simplified)
+ ```diff
+ ++ .-----------------------------------------------------------------. ++
+ ++ | GPU 0 | ++
+ ++ | LBM Domain 0 | ++
+ ++ '-----------------------------------------------------------------' ++
+ ++ | selective /|\ ++
+ ++ \|/ in-VRAM copy | ++
+ ++ .-------------------------------------------------------. ++
+ ++ | GPU 0 - Transfer Buffer 0 | ++
+ ++ '-------------------------------------------------------' ++
+ !! | PCIe /|\ !!
+ !! \|/ copy | !!
+ @@ .-------------------------. .-------------------------. @@
+ @@ | CPU - Transfer Buffer 0 | | CPU - Transfer Buffer 1 | @@
+ @@ '-------------------------'\ /'-------------------------' @@
+ @@ pointer X swap @@
+ @@ .-------------------------./ \.-------------------------. @@
+ @@ | CPU - Transfer Buffer 1 | | CPU - Transfer Buffer 0 | @@
+ @@ '-------------------------' '-------------------------' @@
+ !! /|\ PCIe | !!
+ !! | copy \|/ !!
+ ++ .-------------------------------------------------------. ++
+ ++ | GPU 1 - Transfer Buffer 1 | ++
+ ++ '-------------------------------------------------------' ++
+ ++ /|\ selective | ++
+ ++ | in-VRAM copy \|/ ++
+ ++ .-----------------------------------------------------------------. ++
+ ++ | GPU 1 | ++
+ ++ | LBM Domain 1 | ++
+ ++ '-----------------------------------------------------------------' ++
+ ## | ##
+ ## domain synchronization barrier ##
+ ## | ##
+ || -------------------------------------------------------------> time ||
+ ```
+ - domain communication architecture (detailed)
+ ```diff
+ ++ .-----------------------------------------------------------------. ++
+ ++ | GPU 0 | ++
+ ++ | LBM Domain 0 | ++
+ ++ '-----------------------------------------------------------------' ++
+ ++ | selective in- /|\ | selective in- /|\ | selective in- /|\ ++
+ ++ \|/ VRAM copy (X) | \|/ VRAM copy (Y) | \|/ VRAM copy (Z) | ++
+ ++ .---------------------.---------------------.---------------------. ++
+ ++ | GPU 0 - TB 0X+ | GPU 0 - TB 0Y+ | GPU 0 - TB 0Z+ | ++
+ ++ | GPU 0 - TB 0X- | GPU 0 - TB 0Y- | GPU 0 - TB 0Z- | ++
+ ++ '---------------------'---------------------'---------------------' ++
+ !! | PCIe /|\ | PCIe /|\ | PCIe /|\ !!
+ !! \|/ copy | \|/ copy | \|/ copy | !!
+ @@ .---------. .---------.---------. .---------.---------. .---------. @@
+ @@ | CPU 0X+ | | CPU 1X- | CPU 0Y+ | | CPU 3Y- | CPU 0Z+ | | CPU 5Z- | @@
+ @@ | CPU 0X- | | CPU 2X+ | CPU 0Y- | | CPU 4Y+ | CPU 0Z- | | CPU 6Z+ | @@
+ @@ '---------\ /---------'---------\ /---------'---------\ /---------' @@
+ @@ pointer X swap (X) pointer X swap (Y) pointer X swap (Z) @@
+ @@ .---------/ \---------.---------/ \---------.---------/ \---------. @@
+ @@ | CPU 1X- | | CPU 0X+ | CPU 3Y- | | CPU 0Y+ | CPU 5Z- | | CPU 0Z+ | @@
+ @@ | CPU 2X+ | | CPU 0X- | CPU 4Y+ | | CPU 0Y- | CPU 6Z+ | | CPU 0Z- | @@
+ @@ '---------' '---------'---------' '---------'---------' '---------' @@
+ !! /|\ PCIe | /|\ PCIe | /|\ PCIe | !!
+ !! | copy \|/ | copy \|/ | copy \|/ !!
+ ++ .--------------------..---------------------..--------------------. ++
+ ++ | GPU 1 - TB 1X- || GPU 3 - TB 3Y- || GPU 5 - TB 5Z- | ++
+ ++ :====================::=====================::====================: ++
+ ++ | GPU 2 - TB 2X+ || GPU 4 - TB 4Y+ || GPU 6 - TB 6Z+ | ++
+ ++ '--------------------''---------------------''--------------------' ++
+ ++ /|\ selective in- | /|\ selective in- | /|\ selective in- | ++
+ ++ | VRAM copy (X) \|/ | VRAM copy (Y) \|/ | VRAM copy (Z) \|/ ++
+ ++ .--------------------..---------------------..--------------------. ++
+ ++ | GPU 1 || GPU 3 || GPU 5 | ++
+ ++ | LBM Domain 1 || LBM Domain 3 || LBM Domain 5 | ++
+ ++ :====================::=====================::====================: ++
+ ++ | GPU 2 || GPU 4 || GPU 6 | ++
+ ++ | LBM Domain 2 || LBM Domain 4 || LBM Domain 6 | ++
+ ++ '--------------------''---------------------''--------------------' ++
+ ## | | | ##
+ ## | domain synchronization barriers | ##
+ ## | | | ##
+ || -------------------------------------------------------------> time ||
+ ```
+- [peak performance on GPUs](#single-gpu-benchmarks) (datacenter/gaming/professional/laptop), validated with roofline model
- [DDF-shifting](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats) and other algebraic optimization to minimize round-off error
-- velocity sets: D2Q9, D3Q15, D3Q19 (default), D3Q27
-- collision operators: single-relaxation-time (SRT/BGK) (default), two-relaxation-time (TRT)
-- only 8 flag bits per lattice point (can be used independently / at the same time):
- - `TYPE_S` (stationary or moving) solid boundaries
- - `TYPE_E` equilibrium boundaries (inflow/outflow)
- - `TYPE_T` temperature boundaries
- - `TYPE_F` free surface (fluid)
- - `TYPE_I` free surface (interface)
- - `TYPE_G` free surface (gas)
- - `TYPE_X` remaining for custom use or further extensions
- - `TYPE_Y` remaining for custom use or further extensions
@@ -320,90 +318,97 @@ In consequence, the arithmetic intensity of this implementation is 2.37 (FP32/FP
If your GPU is not on the list yet, you can report your benchmarks [here](https://github.com/ProjectPhysX/FluidX3D/issues/8).
-| Device | FP32
[TFlops/s] | Mem
[GB] | BW
[GB/s] | FP32/FP32
[MLUPs/s] | FP32/FP16S
[MLUPs/s] | FP32/FP16C
[MLUPs/s] |
-| :---------------------------- | -----------------: | ----------: | -----------: | ---------------------: | ----------------------: | ----------------------: |
-| AMD Instinct MI250 (1 GCD) | 45.26 | 64 | 1638 | 5638 (53%) | 9030 (42%) | 8506 (40%) |
-| AMD Radeon VII | 13.83 | 16 | 1024 | 4898 (73%) | 7778 (58%) | 5256 (40%) |
-| Nvidia H100 PCIe 80GB | 51.01 | 80 | 2000 | 11128 (85%) | 20624 (79%) | 13862 (53%) |
-| Nvidia A100 SXM4 80GB | 19.49 | 80 | 2039 | 10228 (77%) | 18448 (70%) | 11197 (42%) |
-| Nvidia A100 SXM4 40GB | 19.49 | 40 | 1555 | 8522 (84%) | 16013 (79%) | 11251 (56%) |
-| Nvidia A100 PCIe 40GB | 19.49 | 40 | 1555 | 8526 (84%) | 16035 (79%) | 11088 (55%) |
-| Nvidia Tesla V100 16GB | 14.13 | 16 | 900 | 5128 (87%) | 10325 (88%) | 7683 (66%) |
-| Nvidia Quadro GV100 | 16.66 | 32 | 870 | 3442 (61%) | 6641 (59%) | 5863 (52%) |
-| Nvidia Tesla P100 16GB | 9.52 | 16 | 732 | 3295 (69%) | 5950 (63%) | 4176 (44%) |
-| Nvidia Tesla P100 12GB | 9.52 | 12 | 549 | 2427 (68%) | 4141 (58%) | 3999 (56%) |
-| Nvidia Tesla K40m | 4.29 | 12 | 288 | 1131 (60%) | 1868 (50%) | 912 (24%) |
-| Nvidia Tesla K80 (1 GPU) | 4.11 | 12 | 240 | 916 (58%) | 1642 (53%) | 943 (30%) |
-| Nvidia Tesla K20c | 3.52 | 5 | 208 | 861 (63%) | 1507 (56%) | 720 (27%) |
-| AMD Radeon RX 7900 XTX | 61.44 | 24 | 960 | 3665 (58%) | 7644 (61%) | 7716 (62%) |
-| AMD Radeon RX 7900 XT | 51.61 | 20 | 800 | 3013 (58%) | 5856 (56%) | 5986 (58%) |
-| AMD Radeon RX 6900 XT | 23.04 | 16 | 512 | 1968 (59%) | 4227 (64%) | 4207 (63%) |
-| AMD Radeon RX 6800 XT | 20.74 | 16 | 512 | 2008 (60%) | 4241 (64%) | 4224 (64%) |
-| AMD Radeon RX 5700 XT | 9.75 | 8 | 448 | 1368 (47%) | 3253 (56%) | 3049 (52%) |
-| AMD Radeon RX Vega 64 | 13.35 | 8 | 484 | 1875 (59%) | 2878 (46%) | 3227 (51%) |
-| AMD Radeon RX 580 4GB | 6.50 | 4 | 256 | 946 (57%) | 1848 (56%) | 1577 (47%) |
-| AMD Radeon HD 7850 | 1.84 | 2 | 154 | 112 (11%) | 120 ( 6%) | 635 (32%) |
-| Intel Arc A770 LE | 19.66 | 16 | 560 | 2741 (75%) | 4591 (63%) | 4626 (64%) |
-| Intel Arc A750 LE | 17.20 | 8 | 512 | 2625 (78%) | 4184 (63%) | 4238 (64%) |
-| Nvidia GeForce RTX 4090 | 82.58 | 24 | 1008 | 5624 (85%) | 11091 (85%) | 11496 (88%) |
-| Nvidia GeForce RTX 4080 | 55.45 | 16 | 717 | 3914 (84%) | 7626 (82%) | 7933 (85%) |
-| Nvidia GeForce RTX 3090 Ti | 40.00 | 24 | 1008 | 5717 (87%) | 10956 (84%) | 10400 (79%) |
-| Nvidia GeForce RTX 3090 | 39.05 | 24 | 936 | 5418 (89%) | 10732 (88%) | 10215 (84%) |
-| Nvidia GeForce RTX 3080 Ti | 37.17 | 12 | 912 | 5202 (87%) | 9832 (87%) | 9347 (79%) |
-| Nvidia RTX A6000 | 40.00 | 48 | 768 | 4421 (88%) | 8814 (88%) | 8533 (86%) |
-| Nvidia GeForce RTX 3080 | 29.77 | 10 | 760 | 4230 (85%) | 8118 (82%) | 7714 (78%) |
-| Nvidia GeForce RTX 3070 | 20.31 | 8 | 448 | 2578 (88%) | 5096 (88%) | 5060 (87%) |
-| Nvidia GeForce RTX 3060 Ti | 16.49 | 8 | 448 | 2644 (90%) | 5129 (88%) | 4718 (81%) |
-| Nvidia RTX A5000M | 16.59 | 16 | 448 | 2228 (76%) | 4461 (77%) | 3662 (63%) |
-| Nvidia GeForce RTX 3060 | 13.17 | 12 | 360 | 2108 (90%) | 4070 (87%) | 3566 (76%) |
-| Nvidia GeForce RTX 3060M | 10.94 | 6 | 336 | 2019 (92%) | 4012 (92%) | 3572 (82%) |
-| Nvidia GeForce RTX 3050M | 7.13 | 4 | 192 | 1180 (94%) | 2339 (94%) | 2016 (81%) |
-| Nvidia Quadro RTX 6000 | 16.31 | 24 | 672 | 3307 (75%) | 6836 (78%) | 6879 (79%) |
-| Nvidia Quadro RTX 8000 Pass. | 14.93 | 48 | 624 | 2591 (64%) | 5408 (67%) | 5607 (69%) |
-| Nvidia GeForce RTX 2080 Ti | 13.45 | 11 | 616 | 3194 (79%) | 6700 (84%) | 6853 (86%) |
-| Nvidia GeForce RTX 2080 Sup. | 11.34 | 8 | 496 | 2434 (75%) | 5284 (82%) | 5087 (79%) |
-| Nvidia Quadro RTX 5000 | 11.15 | 16 | 448 | 2341 (80%) | 4766 (82%) | 4773 (82%) |
-| Nvidia GeForce RTX 2060 Sup. | 7.18 | 8 | 448 | 2503 (85%) | 5035 (87%) | 4463 (77%) |
-| Nvidia Quadro RTX 4000 | 7.12 | 8 | 416 | 2284 (84%) | 4584 (85%) | 4062 (75%) |
-| Nvidia GeForce RTX 2060 KO | 6.74 | 6 | 336 | 1643 (75%) | 3376 (77%) | 3266 (75%) |
-| Nvidia GeForce RTX 2060 | 6.74 | 6 | 336 | 1681 (77%) | 3604 (83%) | 3571 (82%) |
-| Nvidia GeForce GTX 1660 Sup. | 5.03 | 6 | 336 | 1696 (77%) | 3551 (81%) | 3040 (70%) |
-| Nvidia Tesla T4 | 8.14 | 15 | 300 | 1356 (69%) | 2869 (74%) | 2887 (74%) |
-| Nvidia GeForce GTX 1660 Ti | 5.48 | 6 | 288 | 1467 (78%) | 3041 (81%) | 3019 (81%) |
-| Nvidia GeForce GTX 1660 | 5.07 | 6 | 192 | 1016 (81%) | 1924 (77%) | 1992 (80%) |
-| Nvidia GeForce GTX 1650M | 3.20 | 4 | 128 | 706 (84%) | 1214 (73%) | 1400 (84%) |
-| Nvidia Titan Xp | 12.15 | 12 | 548 | 2919 (82%) | 5495 (77%) | 5375 (76%) |
-| Nvidia GeForce GTX 1080 Ti | 12.06 | 11 | 484 | 2631 (83%) | 4837 (77%) | 4877 (78%) |
-| Nvidia GeForce GTX 1080 | 9.78 | 8 | 320 | 1623 (78%) | 3100 (75%) | 3182 (77%) |
-| Nvidia GeForce GTX 1060M | 4.44 | 6 | 192 | 983 (78%) | 1882 (75%) | 1803 (72%) |
-| Nvidia GeForce GTX 1050M Ti | 2.49 | 4 | 112 | 631 (86%) | 1224 (84%) | 1115 (77%) |
-| Nvidia Quadro P1000 | 1.89 | 4 | 82 | 426 (79%) | 839 (79%) | 778 (73%) |
-| Nvidia GeForce GTX 970 | 4.17 | 4 | 224 | 980 (67%) | 1721 (59%) | 1623 (56%) |
-| Nvidia Quadro M4000 | 2.57 | 8 | 192 | 899 (72%) | 1519 (61%) | 1050 (42%) |
-| Nvidia Tesla M60 (1 GPU) | 4.82 | 8 | 160 | 853 (82%) | 1571 (76%) | 1557 (75%) |
-| Nvidia GeForce GTX 960M | 1.51 | 4 | 80 | 442 (84%) | 872 (84%) | 627 (60%) |
-| Nvidia Quadro K2000 | 0.73 | 2 | 64 | 312 (75%) | 444 (53%) | 171 (21%) |
-| Nvidia GeForce GT 630 (OEM) | 0.46 | 2 | 29 | 151 (81%) | 185 (50%) | 78 (21%) |
-| Nvidia Quadro NVS 290 | 0.03 | 0.256 | 6 | 1 ( 2%) | 1 ( 1%) | 1 ( 1%) |
-| Apple M1 Pro GPU 16C 16GB | 4.10 | 11 | 200 | 1204 (92%) | 2329 (90%) | 1855 (71%) |
-| AMD Radeon Vega 8 (4750G) | 2.15 | 27 | 57 | 263 (71%) | 511 (70%) | 501 (68%) |
-| AMD Radeon Vega 8 (3500U) | 1.23 | 7 | 38 | 157 (63%) | 282 (57%) | 288 (58%) |
-| Intel UHD Graphics 630 | 0.46 | 7 | 51 | 151 (45%) | 301 (45%) | 187 (28%) |
-| Intel HD Graphics 5500 | 0.35 | 3 | 26 | 75 (45%) | 192 (58%) | 108 (32%) |
-| Intel HD Graphics 4600 | 0.38 | 2 | 26 | 105 (63%) | 115 (35%) | 34 (10%) |
-| Samsung ARM Mali-G72 MP18 | 0.24 | 4 | 29 | 14 ( 7%) | 17 ( 5%) | 12 ( 3%) |
-| 2x AMD EPYC 9654 | 29.49 | 1536 | 922 | 1381 (23%) | 1814 (15%) | 1801 (15%) |
-| Intel Xeon Phi 7210 | 5.32 | 192 | 102 | 415 (62%) | 193 (15%) | 223 (17%) |
-| 4x Intel Xeon E5-4620 v4 | 2.69 | 512 | 273 | 460 (26%) | 275 ( 8%) | 239 ( 7%) |
-| 2x Intel Xeon E5-2630 v4 | 1.41 | 64 | 137 | 264 (30%) | 146 ( 8%) | 129 ( 7%) |
-| 2x Intel Xeon E5-2623 v4 | 0.67 | 64 | 137 | 125 (14%) | 66 ( 4%) | 59 ( 3%) |
-| 2x Intel Xeon E5-2680 v3 | 1.92 | 64 | 137 | 209 (23%) | 305 (17%) | 281 (16%) |
-| Intel Core i9-10980XE | 3.23 | 128 | 94 | 286 (47%) | 251 (21%) | 223 (18%) |
-| Intel Core i5-9600 | 0.60 | 16 | 43 | 146 (52%) | 127 (23%) | 147 (27%) |
-| Intel Core i7-8700K | 0.71 | 16 | 51 | 152 (45%) | 134 (20%) | 116 (17%) |
-| Intel Core i7-7700HQ | 0.36 | 12 | 38 | 81 (32%) | 82 (16%) | 108 (22%) |
-| Intel Core i7-4770 | 0.44 | 16 | 26 | 104 (62%) | 69 (21%) | 59 (18%) |
-| Intel Core i7-4720HQ | 0.33 | 16 | 26 | 58 (35%) | 13 ( 4%) | 47 (14%) |
+Colors: 🔴 AMD, 🔵 Intel, 🟢 Nvidia, 🟣 Apple, 🟡 Samsung
+
+| Device | FP32
[TFlops/s] | Mem
[GB] | BW
[GB/s] | FP32/FP32
[MLUPs/s] | FP32/FP16S
[MLUPs/s] | FP32/FP16C
[MLUPs/s] |
+| :--------------------------------------------- | -----------------: | ----------: | -----------: | ---------------------: | ----------------------: | ----------------------: |
+| | | | | | | |
+| 🔴 Instinct MI250 (1 GCD) | 45.26 | 64 | 1638 | 5638 (53%) | 9030 (42%) | 8506 (40%) |
+| 🔴 Radeon VII | 13.83 | 16 | 1024 | 4898 (73%) | 7778 (58%) | 5256 (40%) |
+| 🟢 H100 PCIe 80GB | 51.01 | 80 | 2000 | 11128 (85%) | 20624 (79%) | 13862 (53%) |
+| 🟢 A100 SXM4 80GB | 19.49 | 80 | 2039 | 10228 (77%) | 18448 (70%) | 11197 (42%) |
+| 🟢 A100 SXM4 40GB | 19.49 | 40 | 1555 | 8522 (84%) | 16013 (79%) | 11251 (56%) |
+| 🟢 A100 PCIe 40GB | 19.49 | 40 | 1555 | 8526 (84%) | 16035 (79%) | 11088 (55%) |
+| 🟢 Tesla V100 16GB | 14.13 | 16 | 900 | 5128 (87%) | 10325 (88%) | 7683 (66%) |
+| 🟢 Quadro GV100 | 16.66 | 32 | 870 | 3442 (61%) | 6641 (59%) | 5863 (52%) |
+| 🟢 Tesla P100 16GB | 9.52 | 16 | 732 | 3295 (69%) | 5950 (63%) | 4176 (44%) |
+| 🟢 Tesla P100 12GB | 9.52 | 12 | 549 | 2427 (68%) | 4141 (58%) | 3999 (56%) |
+| 🟢 Tesla K40m | 4.29 | 12 | 288 | 1131 (60%) | 1868 (50%) | 912 (24%) |
+| 🟢 Tesla K80 (1 GPU) | 4.11 | 12 | 240 | 916 (58%) | 1642 (53%) | 943 (30%) |
+| 🟢 Tesla K20c | 3.52 | 5 | 208 | 861 (63%) | 1507 (56%) | 720 (27%) |
+| | | | | | | |
+| 🔴 Radeon RX 7900 XTX | 61.44 | 24 | 960 | 3665 (58%) | 7644 (61%) | 7716 (62%) |
+| 🔴 Radeon RX 7900 XT | 51.61 | 20 | 800 | 3013 (58%) | 5856 (56%) | 5986 (58%) |
+| 🔴 Radeon RX 6900 XT | 23.04 | 16 | 512 | 1968 (59%) | 4227 (64%) | 4207 (63%) |
+| 🔴 Radeon RX 6800 XT | 20.74 | 16 | 512 | 2008 (60%) | 4241 (64%) | 4224 (64%) |
+| 🔴 Radeon RX 5700 XT | 9.75 | 8 | 448 | 1368 (47%) | 3253 (56%) | 3049 (52%) |
+| 🔴 Radeon RX Vega 64 | 13.35 | 8 | 484 | 1875 (59%) | 2878 (46%) | 3227 (51%) |
+| 🔴 Radeon RX 580 4GB | 6.50 | 4 | 256 | 946 (57%) | 1848 (56%) | 1577 (47%) |
+| 🔴 Radeon HD 7850 | 1.84 | 2 | 154 | 112 (11%) | 120 ( 6%) | 635 (32%) |
+| 🔵 Arc A770 LE | 19.66 | 16 | 560 | 2741 (75%) | 4591 (63%) | 4626 (64%) |
+| 🔵 Arc A750 LE | 17.20 | 8 | 512 | 2625 (78%) | 4184 (63%) | 4238 (64%) |
+| 🟢 GeForce RTX 4090 | 82.58 | 24 | 1008 | 5624 (85%) | 11091 (85%) | 11496 (88%) |
+| 🟢 RTX 6000 Ada | 91.10 | 48 | 960 | 4997 (80%) | 10249 (82%) | 10293 (83%) |
+| 🟢 GeForce RTX 4080 | 55.45 | 16 | 717 | 3914 (84%) | 7626 (82%) | 7933 (85%) |
+| 🟢 GeForce RTX 3090 Ti | 40.00 | 24 | 1008 | 5717 (87%) | 10956 (84%) | 10400 (79%) |
+| 🟢 GeForce RTX 3090 | 39.05 | 24 | 936 | 5418 (89%) | 10732 (88%) | 10215 (84%) |
+| 🟢 GeForce RTX 3080 Ti | 37.17 | 12 | 912 | 5202 (87%) | 9832 (87%) | 9347 (79%) |
+| 🟢 RTX A6000 | 40.00 | 48 | 768 | 4421 (88%) | 8814 (88%) | 8533 (86%) |
+| 🟢 GeForce RTX 3080 | 29.77 | 10 | 760 | 4230 (85%) | 8118 (82%) | 7714 (78%) |
+| 🟢 GeForce RTX 3070 | 20.31 | 8 | 448 | 2578 (88%) | 5096 (88%) | 5060 (87%) |
+| 🟢 GeForce RTX 3060 Ti | 16.49 | 8 | 448 | 2644 (90%) | 5129 (88%) | 4718 (81%) |
+| 🟢 RTX A5000M | 16.59 | 16 | 448 | 2228 (76%) | 4461 (77%) | 3662 (63%) |
+| 🟢 GeForce RTX 3060 | 13.17 | 12 | 360 | 2108 (90%) | 4070 (87%) | 3566 (76%) |
+| 🟢 GeForce RTX 3060M | 10.94 | 6 | 336 | 2019 (92%) | 4012 (92%) | 3572 (82%) |
+| 🟢 GeForce RTX 3050M | 7.13 | 4 | 192 | 1180 (94%) | 2339 (94%) | 2016 (81%) |
+| 🟢 Quadro RTX 6000 | 16.31 | 24 | 672 | 3307 (75%) | 6836 (78%) | 6879 (79%) |
+| 🟢 Quadro RTX 8000 Pass. | 14.93 | 48 | 624 | 2591 (64%) | 5408 (67%) | 5607 (69%) |
+| 🟢 GeForce RTX 2080 Ti | 13.45 | 11 | 616 | 3194 (79%) | 6700 (84%) | 6853 (86%) |
+| 🟢 GeForce RTX 2080 Sup. | 11.34 | 8 | 496 | 2434 (75%) | 5284 (82%) | 5087 (79%) |
+| 🟢 Quadro RTX 5000 | 11.15 | 16 | 448 | 2341 (80%) | 4766 (82%) | 4773 (82%) |
+| 🟢 GeForce RTX 2060 Sup. | 7.18 | 8 | 448 | 2503 (85%) | 5035 (87%) | 4463 (77%) |
+| 🟢 Quadro RTX 4000 | 7.12 | 8 | 416 | 2284 (84%) | 4584 (85%) | 4062 (75%) |
+| 🟢 GeForce RTX 2060 KO | 6.74 | 6 | 336 | 1643 (75%) | 3376 (77%) | 3266 (75%) |
+| 🟢 GeForce RTX 2060 | 6.74 | 6 | 336 | 1681 (77%) | 3604 (83%) | 3571 (82%) |
+| 🟢 GeForce GTX 1660 Sup. | 5.03 | 6 | 336 | 1696 (77%) | 3551 (81%) | 3040 (70%) |
+| 🟢 Tesla T4 | 8.14 | 15 | 300 | 1356 (69%) | 2869 (74%) | 2887 (74%) |
+| 🟢 GeForce GTX 1660 Ti | 5.48 | 6 | 288 | 1467 (78%) | 3041 (81%) | 3019 (81%) |
+| 🟢 GeForce GTX 1660 | 5.07 | 6 | 192 | 1016 (81%) | 1924 (77%) | 1992 (80%) |
+| 🟢 GeForce GTX 1650M | 3.20 | 4 | 128 | 706 (84%) | 1214 (73%) | 1400 (84%) |
+| 🟢 Titan Xp | 12.15 | 12 | 548 | 2919 (82%) | 5495 (77%) | 5375 (76%) |
+| 🟢 GeForce GTX 1080 Ti | 12.06 | 11 | 484 | 2631 (83%) | 4837 (77%) | 4877 (78%) |
+| 🟢 GeForce GTX 1080 | 9.78 | 8 | 320 | 1623 (78%) | 3100 (75%) | 3182 (77%) |
+| 🟢 GeForce GTX 1060M | 4.44 | 6 | 192 | 983 (78%) | 1882 (75%) | 1803 (72%) |
+| 🟢 GeForce GTX 1050M Ti | 2.49 | 4 | 112 | 631 (86%) | 1224 (84%) | 1115 (77%) |
+| 🟢 Quadro P1000 | 1.89 | 4 | 82 | 426 (79%) | 839 (79%) | 778 (73%) |
+| 🟢 GeForce GTX 970 | 4.17 | 4 | 224 | 980 (67%) | 1721 (59%) | 1623 (56%) |
+| 🟢 Quadro M4000 | 2.57 | 8 | 192 | 899 (72%) | 1519 (61%) | 1050 (42%) |
+| 🟢 Tesla M60 (1 GPU) | 4.82 | 8 | 160 | 853 (82%) | 1571 (76%) | 1557 (75%) |
+| 🟢 GeForce GTX 960M | 1.51 | 4 | 80 | 442 (84%) | 872 (84%) | 627 (60%) |
+| 🟢 Quadro K2000 | 0.73 | 2 | 64 | 312 (75%) | 444 (53%) | 171 (21%) |
+| 🟢 GeForce GT 630 (OEM) | 0.46 | 2 | 29 | 151 (81%) | 185 (50%) | 78 (21%) |
+| 🟢 Quadro NVS 290 | 0.03 | 0.256 | 6 | 1 ( 2%) | 1 ( 1%) | 1 ( 1%) |
+| | | | | | | |
+| 🟣 M1 Pro GPU 16C 16GB | 4.10 | 11 | 200 | 1204 (92%) | 2329 (90%) | 1855 (71%) |
+| 🔴 Radeon Vega 8 (4750G) | 2.15 | 27 | 57 | 263 (71%) | 511 (70%) | 501 (68%) |
+| 🔴 Radeon Vega 8 (3500U) | 1.23 | 7 | 38 | 157 (63%) | 282 (57%) | 288 (58%) |
+| 🔵 UHD Graphics 630 | 0.46 | 7 | 51 | 151 (45%) | 301 (45%) | 187 (28%) |
+| 🔵 HD Graphics 5500 | 0.35 | 3 | 26 | 75 (45%) | 192 (58%) | 108 (32%) |
+| 🔵 HD Graphics 4600 | 0.38 | 2 | 26 | 105 (63%) | 115 (35%) | 34 (10%) |
+| 🟡 ARM Mali-G72 MP18 | 0.24 | 4 | 29 | 14 ( 7%) | 17 ( 5%) | 12 ( 3%) |
+| | | | | | | |
+| 🔴 2x EPYC 9654 | 29.49 | 1536 | 922 | 1381 (23%) | 1814 (15%) | 1801 (15%) |
+| 🔵 Xeon Phi 7210 | 5.32 | 192 | 102 | 415 (62%) | 193 (15%) | 223 (17%) |
+| 🔵 4x Xeon E5-4620 v4 | 2.69 | 512 | 273 | 460 (26%) | 275 ( 8%) | 239 ( 7%) |
+| 🔵 2x Xeon E5-2630 v4 | 1.41 | 64 | 137 | 264 (30%) | 146 ( 8%) | 129 ( 7%) |
+| 🔵 2x Xeon E5-2623 v4 | 0.67 | 64 | 137 | 125 (14%) | 66 ( 4%) | 59 ( 3%) |
+| 🔵 2x Xeon E5-2680 v3 | 1.92 | 64 | 137 | 209 (23%) | 305 (17%) | 281 (16%) |
+| 🔵 Core i9-10980XE | 3.23 | 128 | 94 | 286 (47%) | 251 (21%) | 223 (18%) |
+| 🔵 Core i5-9600 | 0.60 | 16 | 43 | 146 (52%) | 127 (23%) | 147 (27%) |
+| 🔵 Core i7-8700K | 0.71 | 16 | 51 | 152 (45%) | 134 (20%) | 116 (17%) |
+| 🔵 Core i7-7700HQ | 0.36 | 12 | 38 | 81 (32%) | 82 (16%) | 108 (22%) |
+| 🔵 Core i7-4770 | 0.44 | 16 | 26 | 104 (62%) | 69 (21%) | 59 (18%) |
+| 🔵 Core i7-4720HQ | 0.33 | 16 | 26 | 58 (35%) | 13 ( 4%) | 47 (14%) |
@@ -411,68 +416,41 @@ If your GPU is not on the list yet, you can report your benchmarks [here](https:
Multi-GPU benchmarks are done at the largest possible grid resolution with a cubic domain, and either 2x1x1, 2x2x1 or 2x2x2 of these cubic domains together. The percentages in brackets are single-GPU roofline model efficiency, and the multiplicator numbers in brackets are scaling factors relative to benchmarked single-GPU performance.
-| Device | FP32
[TFlops/s] | Mem
[GB] | BW
[GB/s] | FP32/FP32
[MLUPs/s] | FP32/FP16S
[MLUPs/s] | FP32/FP16C
[MLUPs/s] |
-| :---------------------------- | -----------------: | ----------: | -----------: | ---------------------: | ----------------------: | ----------------------: |
-| | | | | | | |
-| 1x AMD Instinct MI250 (1 GCD) | 45.26 | 64 | 1638 | 5638 (53%) | 9030 (42%) | 8506 (40%) |
-| 1x AMD Instinct MI250 (2 GCD) | 90.52 | 128 | 3277 | 9460 (1.7x) | 14313 (1.6x) | 17338 (2.0x) |
-| 2x AMD Instinct MI250 (4 GCD) | 181.04 | 256 | 6554 | 16925 (3.0x) | 29163 (3.2x) | 29627 (3.5x) |
-| 4x AMD Instinct MI250 (8 GCD) | 362.08 | 512 | 13107 | 27350 (4.9x) | 52258 (5.8x) | 53521 (6.3x) |
-| | | | | | | |
-| 1x AMD Radeon VII | 13.83 | 16 | 1024 | 4898 (73%) | 7778 (58%) | 5256 (40%) |
-| 2x AMD Radeon VII | 27.66 | 32 | 2048 | 8113 (1.7x) | 15591 (2.0x) | 10352 (2.0x) |
-| 4x AMD Radeon VII | 55.32 | 64 | 4096 | 12911 (2.6x) | 24273 (3.1x) | 17080 (3.2x) |
-| 8x AMD Radeon VII | 110.64 | 128 | 8192 | 21946 (4.5x) | 30826 (4.0x) | 24572 (4.7x) |
-| | | | | | | |
-| 1x Nvidia A100 SXM4 40GB | 19.49 | 40 | 1555 | 8522 (84%) | 16013 (79%) | 11251 (56%) |
-| 2x Nvidia A100 SXM4 40GB | 38.98 | 80 | 3110 | 13629 (1.6x) | 24620 (1.5x) | 18850 (1.7x) |
-| 4x Nvidia A100 SXM4 40GB | 77.96 | 160 | 6220 | 17978 (2.1x) | 30604 (1.9x) | 30627 (2.7x) |
-| | | | | | | |
-| 1x Nvidia Tesla K40m | 4.29 | 12 | 288 | 1131 (60%) | 1868 (50%) | 912 (24%) |
-| 2x Nvidia Tesla K40m | 8.58 | 24 | 577 | 1971 (1.7x) | 3300 (1.8x) | 1801 (2.0x) |
-| 3x Tesla K40m + 1x Titan Xp | 17.16 | 48 | 1154 | 3117 (2.8x) | 5174 (2.8x) | 3127 (3.4x) |
-| | | | | | | |
-| 1x Nvidia RTX A6000 | 40.00 | 48 | 768 | 4421 (88%) | 8814 (88%) | 8533 (86%) |
-| 2x Nvidia RTX A6000 | 80.00 | 96 | 1536 | 8041 (1.8x) | 15026 (1.7x) | 14795 (1.7x) |
-| 4x Nvidia RTX A6000 | 160.00 | 192 | 3072 | 14314 (3.2x) | 27915 (3.2x) | 27227 (3.2x) |
-| 8x Nvidia RTX A6000 | 320.00 | 384 | 6144 | 19311 (4.4x) | 40063 (4.5x) | 39004 (4.6x) |
-| | | | | | | |
-| 1x Nvidia Quadro RTX 8000 Pa. | 14.93 | 48 | 624 | 2591 (64%) | 5408 (67%) | 5607 (69%) |
-| 2x Nvidia Quadro RTX 8000 Pa. | 29.86 | 96 | 1248 | 4767 (1.8x) | 9607 (1.8x) | 10214 (1.8x) |
-| | | | | | | |
-| 1x Nvidia GeForce RTX 2080 Ti | 13.45 | 11 | 616 | 3194 (79%) | 6700 (84%) | 6853 (86%) |
-| 2x Nvidia GeForce RTX 2080 Ti | 26.90 | 22 | 1232 | 5085 (1.6x) | 10770 (1.6x) | 10922 (1.6x) |
-| 4x Nvidia GeForce RTX 2080 Ti | 53.80 | 44 | 2464 | 9117 (2.9x) | 18415 (2.7x) | 18598 (2.7x) |
-| 7x RTX 2080 Ti + 1x A100 40GB | 107.60 | 88 | 4928 | 16146 (5.1x) | 33732 (5.0x) | 33857 (4.9x) |
-
-
-
-## Maximum Single-Domain Grid Resolution for D3Q19 LBM
-
-| Memory | FP32/FP32 | FP32/FP16 |
-| -----: | --------: | --------: |
-| 1 GB | 224³ | 266³ |
-| 2 GB | 282³ | 336³ |
-| 3 GB | 322³ | 384³ |
-| 4 GB | 354³ | 424³ |
-| 6 GB | 406³ | 484³ |
-| 8 GB | 448³ | 534³ |
-| 10 GB | 482³ | 574³ |
-| 11 GB | 498³ | 594³ |
-| 12 GB | 512³ | 610³ |
-| 16 GB | 564³ | 672³ |
-| 20 GB | 608³ | 724³ |
-| 24 GB | 646³ | 770³ |
-| 32 GB | 710³ | 848³ |
-| 40 GB | 766³ | 912³ |
-| 48 GB | 814³ | 970³ |
-| 64 GB | 896³ | 1068³ |
-| 80 GB | 966³ | 1150³ |
-| 96 GB | 1026³ | 1222³ |
-| 128 GB | 1130³ | 1346³ |
-| 192 GB | 1292³ | 1540³ |
-| 256 GB | 1422³ | 1624³ |
-| 384 GB | 1624³ | 1624³ |
+Colors: 🔴 AMD, 🔵 Intel, 🟢 Nvidia, 🟣 Apple, 🟡 Samsung
+
+| Device | FP32
[TFlops/s] | Mem
[GB] | BW
[GB/s] | FP32/FP32
[MLUPs/s] | FP32/FP16S
[MLUPs/s] | FP32/FP16C
[MLUPs/s] |
+| :------------------------------------------------------------- | -----------------: | ----------: | -----------: | ---------------------: | ----------------------: | ----------------------: |
+| | | | | | | |
+| 🔴 1x Instinct MI250 (1 GCD) | 45.26 | 64 | 1638 | 5638 (53%) | 9030 (42%) | 8506 (40%) |
+| 🔴 1x Instinct MI250 (2 GCD) | 90.52 | 128 | 3277 | 9460 (1.7x) | 14313 (1.6x) | 17338 (2.0x) |
+| 🔴 2x Instinct MI250 (4 GCD) | 181.04 | 256 | 6554 | 16925 (3.0x) | 29163 (3.2x) | 29627 (3.5x) |
+| 🔴 4x Instinct MI250 (8 GCD) | 362.08 | 512 | 13107 | 27350 (4.9x) | 52258 (5.8x) | 53521 (6.3x) |
+| | | | | | | |
+| 🔴 1x Radeon VII | 13.83 | 16 | 1024 | 4898 (73%) | 7778 (58%) | 5256 (40%) |
+| 🔴 2x Radeon VII | 27.66 | 32 | 2048 | 8113 (1.7x) | 15591 (2.0x) | 10352 (2.0x) |
+| 🔴 4x Radeon VII | 55.32 | 64 | 4096 | 12911 (2.6x) | 24273 (3.1x) | 17080 (3.2x) |
+| 🔴 8x Radeon VII | 110.64 | 128 | 8192 | 21946 (4.5x) | 30826 (4.0x) | 24572 (4.7x) |
+| | | | | | | |
+| 🟢 1x A100 SXM4 40GB | 19.49 | 40 | 1555 | 8522 (84%) | 16013 (79%) | 11251 (56%) |
+| 🟢 2x A100 SXM4 40GB | 38.98 | 80 | 3110 | 13629 (1.6x) | 24620 (1.5x) | 18850 (1.7x) |
+| 🟢 4x A100 SXM4 40GB | 77.96 | 160 | 6220 | 17978 (2.1x) | 30604 (1.9x) | 30627 (2.7x) |
+| | | | | | | |
+| 🟢 1x Tesla K40m | 4.29 | 12 | 288 | 1131 (60%) | 1868 (50%) | 912 (24%) |
+| 🟢 2x Tesla K40m | 8.58 | 24 | 577 | 1971 (1.7x) | 3300 (1.8x) | 1801 (2.0x) |
+| 🟢 3x K40m + 1x Titan Xp | 17.16 | 48 | 1154 | 3117 (2.8x) | 5174 (2.8x) | 3127 (3.4x) |
+| | | | | | | |
+| 🟢 1x RTX A6000 | 40.00 | 48 | 768 | 4421 (88%) | 8814 (88%) | 8533 (86%) |
+| 🟢 2x RTX A6000 | 80.00 | 96 | 1536 | 8041 (1.8x) | 15026 (1.7x) | 14795 (1.7x) |
+| 🟢 4x RTX A6000 | 160.00 | 192 | 3072 | 14314 (3.2x) | 27915 (3.2x) | 27227 (3.2x) |
+| 🟢 8x RTX A6000 | 320.00 | 384 | 6144 | 19311 (4.4x) | 40063 (4.5x) | 39004 (4.6x) |
+| | | | | | | |
+| 🟢 1x Quadro RTX 8000 Pa. | 14.93 | 48 | 624 | 2591 (64%) | 5408 (67%) | 5607 (69%) |
+| 🟢 2x Quadro RTX 8000 Pa. | 29.86 | 96 | 1248 | 4767 (1.8x) | 9607 (1.8x) | 10214 (1.8x) |
+| | | | | | | |
+| 🟢 1x GeForce RTX 2080 Ti | 13.45 | 11 | 616 | 3194 (79%) | 6700 (84%) | 6853 (86%) |
+| 🟢 2x GeForce RTX 2080 Ti | 26.90 | 22 | 1232 | 5085 (1.6x) | 10770 (1.6x) | 10922 (1.6x) |
+| 🟢 4x GeForce RTX 2080 Ti | 53.80 | 44 | 2464 | 9117 (2.9x) | 18415 (2.7x) | 18598 (2.7x) |
+| 🟢 7x 2080 Ti + 1x A100 40GB | 107.60 | 88 | 4928 | 16146 (5.1x) | 33732 (5.0x) | 33857 (4.9x) |
diff --git a/src/info.cpp b/src/info.cpp
index c02ac129..1d408ad0 100644
--- a/src/info.cpp
+++ b/src/info.cpp
@@ -67,7 +67,7 @@ void Info::print_logo() const {
print("| "); print("\\ \\ / /", c); print(" |\n");
print("| "); print("\\ ' /", c); print(" |\n");
print("| "); print("\\ /", c); print(" |\n");
- print("| "); print("\\ /", c); print(" FluidX3D Version 2.5 |\n");
+ print("| "); print("\\ /", c); print(" FluidX3D Version 2.6 |\n");
print("| "); print("'", c); print(" Copyright (c) Moritz Lehmann |\n");
}
void Info::print_initialize() {
diff --git a/src/opencl.hpp b/src/opencl.hpp
index 7a21f20c..677a1c10 100644
--- a/src/opencl.hpp
+++ b/src/opencl.hpp
@@ -24,6 +24,7 @@ struct Device_Info {
uint compute_units=0u; // compute units (CUs) can contain multiple cores depending on the microarchitecture
uint clock_frequency=0u; // in MHz
bool is_cpu=false, is_gpu=false;
+ bool intel_gpu_above_4gb_patch = false; // memory allocations greater than 4GB need to be specifically enabled on Intel GPUs
uint is_fp64_capable=0u, is_fp32_capable=0u, is_fp16_capable=0u, is_int64_capable=0u, is_int32_capable=0u, is_int16_capable=0u, is_int8_capable=0u;
uint cores=0u; // for CPUs, compute_units is the number of threads (twice the number of cores with hyperthreading)
float tflops=0.0f; // estimated device FP32 floating point performance in TeraFLOPs/s
@@ -63,6 +64,19 @@ struct Device_Info {
const float arm = (float)(contains(to_lower(vendor), "arm"))*(is_gpu?8.0f:1.0f); // ARM GPUs usually have 8 cores/CU, ARM CPUs have 1 core/CU
cores = to_uint((float)compute_units*(nvidia+amd+intel+apple+arm)); // for CPUs, compute_units is the number of threads (twice the number of cores with hyperthreading)
tflops = 1E-6f*(float)cores*(float)ipc*(float)clock_frequency; // estimated device floating point performance in TeraFLOPs/s
+ if(intel==8.0f) { // fix wrong global memory reporting for Intel Arc GPUs
+ if(contains_any(name, {"A770", "0x56a0"})&&(memory==12992u)) memory = 16240u; // fix wrong (80% on Windows) memory reporting on Intel Arc A770 16GB
+ if(contains_any(name, {"A770", "0x56a0"})&&(memory== 6476u)) memory = 8096u; // fix wrong (80% on Windows) memory reporting on Intel Arc A770 8GB
+ if(contains_any(name, {"A750", "0x56a1"})&&(memory== 6476u)) memory = 8096u; // fix wrong (80% on Windows) memory reporting on Intel Arc A750 8GB
+ if(contains_any(name, {"A580", "0x56a2"})&&(memory== 6476u)) memory = 8096u; // fix wrong (80% on Windows) memory reporting on Intel Arc A580 8GB
+ if(contains_any(name, {"A380", "0x56a5"})&&(memory== 4844u)) memory = 6056u; // fix wrong (80% on Windows) memory reporting on Intel Arc A380 6GB
+ if(contains_any(name, {"A770", "0x56a0"})&&(memory==15473u)) memory = 16288u; // fix wrong (95% on Linux) memory reporting on Intel Arc A770 16GB
+ if(contains_any(name, {"A770", "0x56a0"})&&(memory== 7721u)) memory = 8128u; // fix wrong (95% on Linux) memory reporting on Intel Arc A770 8GB
+ if(contains_any(name, {"A750", "0x56a1"})&&(memory== 7721u)) memory = 8128u; // fix wrong (95% on Linux) memory reporting on Intel Arc A750 8GB
+ if(contains_any(name, {"A580", "0x56a2"})&&(memory== 7721u)) memory = 8128u; // fix wrong (95% on Linux) memory reporting on Intel Arc A580 8GB
+ if(contains_any(name, {"A380" "0x56a5"})&&(memory== 5783u)) memory = 6088u; // fix wrong (95% on Linux) memory reporting on Intel Arc A380 6GB
+ }
+ intel_gpu_above_4gb_patch = (intel==8.0f)&&(memory>4096); // enable memory allocations greater than 4GB for Intel GPUs with >4GB VRAM
}
inline Device_Info() {}; // default constructor
};
@@ -161,11 +175,12 @@ class Device {
const string kernel_code = enable_device_capabilities()+"\n"+opencl_c_code;
cl_source.push_back({ kernel_code.c_str(), kernel_code.length() });
this->cl_program = cl::Program(info.cl_context, cl_source);
+ const string build_options = string("-cl-fast-relaxed-math")+(info.intel_gpu_above_4gb_patch ? " -cl-intel-greater-than-4GB-buffer-required" : "");
#ifndef LOG
- int error = cl_program.build({ info.cl_device }, "-cl-fast-relaxed-math -w"); // compile OpenCL C code, disable warnings
+ int error = cl_program.build({ info.cl_device }, (build_options+" -w").c_str()); // compile OpenCL C code, disable warnings
if(error) print_warning(cl_program.getBuildInfo(info.cl_device)); // print build log
#else // LOG, generate logfile for OpenCL code compilation
- int error = cl_program.build({ info.cl_device }, "-cl-fast-relaxed-math"); // compile OpenCL C code
+ int error = cl_program.build({ info.cl_device }, build_options.c_str()); // compile OpenCL C code
const string log = cl_program.getBuildInfo(info.cl_device);
write_file("bin/kernel.log", log); // save build log
if((uint)log.length()>2u) print_warning(log); // print build log
@@ -210,7 +225,7 @@ template class Memory {
device.info.memory_used += (uint)(capacity()/1048576ull); // track device memory usage
if(device.info.memory_used>device.info.memory) print_error("Device \""+device.info.name+"\" does not have enough memory. Allocating another "+to_string((uint)(capacity()/1048576ull))+" MB would use a total of "+to_string(device.info.memory_used)+" MB / "+to_string(device.info.memory)+" MB.");
int error = 0;
- device_buffer = cl::Buffer(device.get_cl_context(), CL_MEM_READ_WRITE, capacity(), nullptr, &error);
+ device_buffer = cl::Buffer(device.get_cl_context(), CL_MEM_READ_WRITE|((int)device.info.intel_gpu_above_4gb_patch<<23), capacity(), nullptr, &error); // for Intel GPUs, set flag CL_MEM_ALLOW_UNRESTRICTED_SIZE_INTEL = (1<<23)
if(error==-61) print_error("Memory size is too large at "+to_string((uint)(capacity()/1048576ull))+" MB. Device \""+device.info.name+"\" accepts a maximum buffer size of "+to_string(device.info.max_global_buffer)+" MB.");
else if(error) print_error("Device buffer allocation failed with error code "+to_string(error)+".");
device_buffer_exists = true;