From f42943b24f2f65c5b9695db3243e2563064642b0 Mon Sep 17 00:00:00 2001 From: Moritz Lehmann Date: Wed, 12 Apr 2023 12:33:42 +0200 Subject: [PATCH 1/5] Updated Readme --- README.md | 257 ++++++++++++++++++++++++------------------------------ 1 file changed, 112 insertions(+), 145 deletions(-) diff --git a/README.md b/README.md index 21586ac8..7cfbfd9d 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,7 @@ The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via [OpenCL](https://github.com/ProjectPhysX/OpenCL-Wrapper "OpenCL-Wrapper").
- +
Update History @@ -61,10 +61,13 @@ The fastest and most memory efficient lattice Boltzmann CFD software, running on ## Compute Features -- CFD model: lattice Boltzmann method (LBM) -
⚬  streaming (part 2/2)

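$$f_0^\textrm{temp}(\vec{x},\\ t)=f_0(\vec{x},\\ t)$$
$$f_i^\textrm{temp}(\vec{x},\\ t)=f_{(t\\%2\\ ?\\ i\\ :\\ (i\\%2\\ ?\\ i+1\\ :\\ i-1))}(i\\%2\\ ?\\ \vec{x}\\ :\\ \vec{x}-\vec{e}_i,\\ t)\quad \mathrm{for}\quad i\in[1,\\ q-1]$$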

-
⚬  collision

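$$\rho(\vec{x},t)=\left(\sum_i f_i^\textrm{temp}(\vec{x},t)\right)+1$$

$$\vec{u}(\vec{x},t)=\frac{1}{\rho(\vec{x},t)}\sum_i \vec{c}_i f_i^\textrm{temp}(\vec{x},t)$$

$$f_i^\textrm{eq-shifted}(\vec{x},t)=w_i \rho \cdot\left(\frac{(\vec{u}\circ\vec{c}_i)^2}{2 c^4}-\frac{\vec{u}\circ\vec{u}}{2 c^2}+\frac{\vec{u}\circ\vec{c}_i}{c^2}\right)+w_i (\rho-1)$$

$$f_i^\textrm{temp}(\vec{x},\\ t+\Delta t)=f_i^\textrm{temp}(\vec{x},\\ t)+\Omega_i(f_i^\textrm{temp}(\vec{x},\\ t),\\ f_i^\textrm{eq-shifted}(\vec{x},\\ t),\\ \tau)$$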

-
⚬  streaming (part 1/2)

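$$f_0(\vec{x},\\ t+\Delta t)=f_0^\textrm{temp}(\vec{x},\\ t+\Delta t)$$
$$f_{(t\\%2\\ ?\\ (i\\%2\\ ?\\ i+1\\ :\\ i-1)\\ :\\ i)}(i\\%2\\ ?\\ \vec{x}+\vec{e}_i\\ :\\ \vec{x},\\ t+\Delta t)=f_i^\textrm{temp}(\vec{x},\\ t+\Delta t)\quad \mathrm{for}\quad i\in[1,\\ q-1]$$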

+-
CFD model: lattice Boltzmann method (LBM) + + - streaming (part 2/2)

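$$f_0^\textrm{temp}(\vec{x},\\ t)=f_0(\vec{x},\\ t)$$
$$f_i^\textrm{temp}(\vec{x},\\ t)=f_{(t\\%2\\ ?\\ i\\ :\\ (i\\%2\\ ?\\ i+1\\ :\\ i-1))}(i\\%2\\ ?\\ \vec{x}\\ :\\ \vec{x}-\vec{e}_i,\\ t)\quad \mathrm{for}\quad i\in[1,\\ q-1]$$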

+ - collision

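$$\rho(\vec{x},t)=\left(\sum_i f_i^\textrm{temp}(\vec{x},t)\right)+1$$

$$\vec{u}(\vec{x},t)=\frac{1}{\rho(\vec{x},t)}\sum_i \vec{c}_i f_i^\textrm{temp}(\vec{x},t)$$

$$f_i^\textrm{eq-shifted}(\vec{x},t)=w_i \rho \cdot\left(\frac{(\vec{u}\circ\vec{c}_i)^2}{2 c^4}-\frac{\vec{u}\circ\vec{u}}{2 c^2}+\frac{\vec{u}\circ\vec{c}_i}{c^2}\right)+w_i (\rho-1)$$

$$f_i^\textrm{temp}(\vec{x},\\ t+\Delta t)=f_i^\textrm{temp}(\vec{x},\\ t)+\Omega_i(f_i^\textrm{temp}(\vec{x},\\ t),\\ f_i^\textrm{eq-shifted}(\vec{x},\\ t),\\ \tau)$$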

+ - streaming (part 1/2)

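$$f_0(\vec{x},\\ t+\Delta t)=f_0^\textrm{temp}(\vec{x},\\ t+\Delta t)$$
$$f_{(t\\%2\\ ?\\ (i\\%2\\ ?\\ i+1\\ :\\ i-1)\\ :\\ i)}(i\\%2\\ ?\\ \vec{x}+\vec{e}_i\\ :\\ \vec{x},\\ t+\Delta t)=f_i^\textrm{temp}(\vec{x},\\ t+\Delta t)\quad \mathrm{for}\quad i\in[1,\\ q-1]$$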

+ +
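The collision step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FluidX3D's OpenCL kernel: it uses the D2Q9 velocity set for brevity, takes the SRT/BGK form `(f_eq - f) / tau` for the operator Ω with Δt = 1 in lattice units, and the names `moments`, `feq_shifted` and `collide_srt` are made up for this example. The stored values are the shifted DDFs (`f_i - w_i`), which is why the density sum adds 1.

```python
# D2Q9 velocity set in lattice units (assumption: chosen here only for brevity).
C = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1), (1, -1), (-1, 1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4
CS2 = 1.0 / 3.0  # c^2, squared lattice speed of sound


def moments(f_shifted):
    """Density and velocity from shifted DDFs (f_i - w_i is what is stored)."""
    rho = sum(f_shifted) + 1.0  # the +1 undoes the DDF shift
    ux = sum(c[0] * f for c, f in zip(C, f_shifted)) / rho
    uy = sum(c[1] * f for c, f in zip(C, f_shifted)) / rho
    return rho, ux, uy


def feq_shifted(rho, ux, uy):
    """Shifted equilibrium DDFs, term by term as in the formula above."""
    uu = ux * ux + uy * uy
    result = []
    for (cx, cy), w in zip(C, W):
        uc = ux * cx + uy * cy
        bracket = uc * uc / (2 * CS2 * CS2) - uu / (2 * CS2) + uc / CS2
        result.append(w * rho * bracket + w * (rho - 1.0))
    return result


def collide_srt(f_shifted, tau):
    """Hypothetical SRT/BGK collision: relax each DDF toward equilibrium."""
    rho, ux, uy = moments(f_shifted)
    feq = feq_shifted(rho, ux, uy)
    return [f + (fe - f) / tau for f, fe in zip(f_shifted, feq)]
```

A fluid at rest with density 1 is stored as all zeros in this shifted representation, which is the point of the shift: small deviations from equilibrium are kept with more floating-point precision.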
-- peak performance on GPUs (datacenter/gaming/professional/laptop), validated with roofline model -- optimized to minimize memory demand: - - traditional LBM (D3Q19) with FP64 requires ~344 Bytes/cell - ``` - 🟧🟧🟧🟧🟧🟧🟧🟧🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦 - 🟨🟨🟨🟨🟨🟨🟨🟨🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 - 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 - 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 - 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 - 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 - 🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥 - 🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥 - 🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥 - 🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥 - 🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥 - - (density 🟧, velocity 🟦, flags 🟨, 2 copies of DDFs 🟩/🟥; each square = 1 Byte) - ``` - - allows for 3 Million cells per 1 GB VRAM - - FluidX3D (D3Q19) requires only 55 Bytes/cell with [Esoteric-Pull](https://doi.org/10.3390/computation10060092)+[FP16](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats) - ``` - 🟧🟧🟧🟧🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟨🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 - 🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩 +-
optimized to minimize VRAM footprint to 1/6 of other LBM codes - (density 🟧, velocity 🟦, flags 🟨, DDFs 🟩; each square = 1 Byte) - ``` + - traditional LBM (D3Q19) with FP64 requires ~344 Bytes/cell
+ - 🟧🟧🟧🟧🟧🟧🟧🟧🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟨🟨🟨🟨🟨🟨🟨🟨🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥🟥
(density 🟧, velocity 🟦, flags 🟨, 2 copies of DDFs 🟩/🟥; each square = 1 Byte) + - allows for 3 Million cells per 1 GB VRAM + - FluidX3D (D3Q19) requires only 55 Bytes/cell with [Esoteric-Pull](https://doi.org/10.3390/computation10060092)+[FP16](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats)
+ - 🟧🟧🟧🟧🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟦🟨🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩
(density 🟧, velocity 🟦, flags 🟨, DDFs 🟩; each square = 1 Byte) - allows for 19 Million cells per 1 GB VRAM - in-place streaming with [Esoteric-Pull](https://doi.org/10.3390/computation10060092): eliminates redundant copy `B` of density distribution functions (DDFs) in memory; almost cuts memory demand in half and slightly increases performance due to implicit bounce-back boundaries; offers optimal memory access patterns for single-cell in-place streaming - [decoupled arithmetic precision (FP32) and memory precision (FP32 or FP16S or FP16C)](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats): all arithmetic is done in FP32 for compatibility on all hardware, but DDFs in memory can be compressed to FP16S or FP16C: almost cuts memory demand in half again and almost doubles performance, without impacting overall accuracy for most setups -- multi-GPU support on a single node (PC/laptop/server) via domain decomposition - - allows pooling VRAM from multiple GPUs for much larger grid resolution + - large cost saving: comparison of maximum single-GPU grid resolution for D3Q19 LBM + + | GPU VRAM capacity | 1 GB | 2 GB | 3 GB | 4 GB | 6 GB | 8 GB | 10 GB | 11 GB | 12 GB | 16 GB | 20 GB | 24 GB | 32 GB | 40 GB | 48 GB | 64 GB | 80 GB | 94 GB | 128 GB | 192 GB | 256 GB | + | :------------------------------- | --------: | --------: | --------: | --------: | --------: | --------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ----------: | ----------: | ----------: | + | approximate GPU price | $25
GT 210 | $25
GTX 950 | $12
GTX 1060 | $50
GT 730 | $35
GTX 1060 | $70
RX 470 | $500
RTX 3080 | $240
GTX 1080 Ti | $75
Tesla M40 | $75
Instinct MI25 | $900
RX 7900 XT | $205
Tesla P40 | $600
Instinct MI60 | $5500
A100 | $2400
RTX 8000 | $31k
Instinct MI210 | $11k
A100 | >$40k
H100 NVL | ?
Max Series 1550 | - | - | + | traditional LBM (FP64) | 144³ | 182³ | 208³ | 230³ | 262³ | 288³ | 312³ | 322³ | 330³ | 364³ | 392³ | 418³ | 460³ | 494³ | 526³ | 578³ | 624³ | 658³ | 730³ | 836³ | 920³ | + | FluidX3D (FP32/FP32) | 224³ | 282³ | 322³ | 354³ | 406³ | 448³ | 482³ | 498³ | 512³ | 564³ | 608³ | 646³ | 710³ | 766³ | 814³ | 896³ | 966³ | 1018³ | 1130³ | 1292³ | 1422³ | + | FluidX3D (FP32/FP16) | 266³ | 336³ | 384³ | 424³ | 484³ | 534³ | 574³ | 594³ | 610³ | 672³ | 724³ | 770³ | 848³ | 912³ | 970³ | 1068³ | 1150³ | 1214³ | 1346³ | 1540³ | 1624³ | + +
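The per-cell byte counts quoted above follow from the colored-square diagrams, and the arithmetic can be checked with a short sketch. The helper name `bytes_per_cell` is made up for this example, and it assumes that "1 GB" means 2^30 bytes, which is what reproduces the quoted 3 and 19 million cells. The resolutions in the table are somewhat lower than a plain cube root of these cell counts would give, so this sketch does not try to reproduce them.

```python
GIB = 1024 ** 3  # assumption: "1 GB" in the text is read as 2^30 bytes
Q = 19           # D3Q19: 19 density distribution functions (DDFs) per cell


def bytes_per_cell(float_bytes, ddf_bytes, ddf_copies, flag_bytes):
    """Sum the per-cell storage: density, 3 velocity components, flags, DDFs."""
    density = float_bytes
    velocity = 3 * float_bytes
    ddfs = ddf_copies * Q * ddf_bytes
    return density + velocity + flag_bytes + ddfs


# traditional LBM: FP64 everywhere, 8 flag bytes, two copies of the DDFs
traditional = bytes_per_cell(float_bytes=8, ddf_bytes=8, ddf_copies=2, flag_bytes=8)
# FluidX3D FP32/FP16: FP32 density/velocity, 1 flag byte, one FP16 DDF copy
fluidx3d_fp16 = bytes_per_cell(float_bytes=4, ddf_bytes=2, ddf_copies=1, flag_bytes=1)

print(traditional, GIB // traditional)      # bytes per cell, cells per GB
print(fluidx3d_fp16, GIB // fluidx3d_fp16)
```

The ratio of the two results, 55/344, is roughly 1/6, which matches the headline claim about VRAM footprint.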
+-
cross-vendor multi-GPU support on a single PC/laptop/server + + - domain decomposition allows pooling VRAM from multiple GPUs for much larger grid resolution - each domain (GPU) can hold up to 4.29 billion (2³², 1624³) lattice points (225 GB memory) - GPUs don't have to be identical (not even from the same vendor), but similar VRAM capacity/bandwidth is recommended -
⚬  domain communication architecture (simplified) - - ```diff - ++ .-----------------------------------------------------------------. ++ - ++ | GPU 0 | ++ - ++ | LBM Domain 0 | ++ - ++ '-----------------------------------------------------------------' ++ - ++ | selective /|\ ++ - ++ \|/ in-VRAM copy | ++ - ++ .-------------------------------------------------------. ++ - ++ | GPU 0 - Transfer Buffer 0 | ++ - ++ '-------------------------------------------------------' ++ - !! | PCIe /|\ !! - !! \|/ copy | !! - @@ .-------------------------. .-------------------------. @@ - @@ | CPU - Transfer Buffer 0 | | CPU - Transfer Buffer 1 | @@ - @@ '-------------------------'\ /'-------------------------' @@ - @@ pointer X swap @@ - @@ .-------------------------./ \.-------------------------. @@ - @@ | CPU - Transfer Buffer 1 | | CPU - Transfer Buffer 0 | @@ - @@ '-------------------------' '-------------------------' @@ - !! /|\ PCIe | !! - !! | copy \|/ !! - ++ .-------------------------------------------------------. ++ - ++ | GPU 1 - Transfer Buffer 1 | ++ - ++ '-------------------------------------------------------' ++ - ++ /|\ selective | ++ - ++ | in-VRAM copy \|/ ++ - ++ .-----------------------------------------------------------------. ++ - ++ | GPU 1 | ++ - ++ | LBM Domain 1 | ++ - ++ '-----------------------------------------------------------------' ++ - ## | ## - ## domain synchronization barrier ## - ## | ## - || -------------------------------------------------------------> time || - ``` - -
⚬  domain communication architecture (detailed) - - ```diff - ++ .-----------------------------------------------------------------. ++ - ++ | GPU 0 | ++ - ++ | LBM Domain 0 | ++ - ++ '-----------------------------------------------------------------' ++ - ++ | selective in- /|\ | selective in- /|\ | selective in- /|\ ++ - ++ \|/ VRAM copy (X) | \|/ VRAM copy (Y) | \|/ VRAM copy (Z) | ++ - ++ .---------------------.---------------------.---------------------. ++ - ++ | GPU 0 - TB 0X+ | GPU 0 - TB 0Y+ | GPU 0 - TB 0Z+ | ++ - ++ | GPU 0 - TB 0X- | GPU 0 - TB 0Y- | GPU 0 - TB 0Z- | ++ - ++ '---------------------'---------------------'---------------------' ++ - !! | PCIe /|\ | PCIe /|\ | PCIe /|\ !! - !! \|/ copy | \|/ copy | \|/ copy | !! - @@ .---------. .---------.---------. .---------.---------. .---------. @@ - @@ | CPU 0X+ | | CPU 1X- | CPU 0Y+ | | CPU 3Y- | CPU 0Z+ | | CPU 5Z- | @@ - @@ | CPU 0X- | | CPU 2X+ | CPU 0Y- | | CPU 4Y+ | CPU 0Z- | | CPU 6Z+ | @@ - @@ '---------\ /---------'---------\ /---------'---------\ /---------' @@ - @@ pointer X swap (X) pointer X swap (Y) pointer X swap (Z) @@ - @@ .---------/ \---------.---------/ \---------.---------/ \---------. @@ - @@ | CPU 1X- | | CPU 0X+ | CPU 3Y- | | CPU 0Y+ | CPU 5Z- | | CPU 0Z+ | @@ - @@ | CPU 2X+ | | CPU 0X- | CPU 4Y+ | | CPU 0Y- | CPU 6Z+ | | CPU 0Z- | @@ - @@ '---------' '---------'---------' '---------'---------' '---------' @@ - !! /|\ PCIe | /|\ PCIe | /|\ PCIe | !! - !! | copy \|/ | copy \|/ | copy \|/ !! - ++ .--------------------..---------------------..--------------------. 
++ - ++ | GPU 1 - TB 1X- || GPU 3 - TB 3Y- || GPU 5 - TB 5Z- | ++ - ++ :====================::=====================::====================: ++ - ++ | GPU 2 - TB 2X+ || GPU 4 - TB 4Y+ || GPU 6 - TB 6Z+ | ++ - ++ '--------------------''---------------------''--------------------' ++ - ++ /|\ selective in- | /|\ selective in- | /|\ selective in- | ++ - ++ | VRAM copy (X) \|/ | VRAM copy (Y) \|/ | VRAM copy (Z) \|/ ++ - ++ .--------------------..---------------------..--------------------. ++ - ++ | GPU 1 || GPU 3 || GPU 5 | ++ - ++ | LBM Domain 1 || LBM Domain 3 || LBM Domain 5 | ++ - ++ :====================::=====================::====================: ++ - ++ | GPU 2 || GPU 4 || GPU 6 | ++ - ++ | LBM Domain 2 || LBM Domain 4 || LBM Domain 6 | ++ - ++ '--------------------''---------------------''--------------------' ++ - ## | | | ## - ## | domain synchronization barriers | ## - ## | | | ## - || -------------------------------------------------------------> time || - ``` + - domain communication architecture (simplified) + ```diff + ++ .-----------------------------------------------------------------. ++ + ++ | GPU 0 | ++ + ++ | LBM Domain 0 | ++ + ++ '-----------------------------------------------------------------' ++ + ++ | selective /|\ ++ + ++ \|/ in-VRAM copy | ++ + ++ .-------------------------------------------------------. ++ + ++ | GPU 0 - Transfer Buffer 0 | ++ + ++ '-------------------------------------------------------' ++ + !! | PCIe /|\ !! + !! \|/ copy | !! + @@ .-------------------------. .-------------------------. @@ + @@ | CPU - Transfer Buffer 0 | | CPU - Transfer Buffer 1 | @@ + @@ '-------------------------'\ /'-------------------------' @@ + @@ pointer X swap @@ + @@ .-------------------------./ \.-------------------------. @@ + @@ | CPU - Transfer Buffer 1 | | CPU - Transfer Buffer 0 | @@ + @@ '-------------------------' '-------------------------' @@ + !! /|\ PCIe | !! + !! | copy \|/ !! 
+ ++ .-------------------------------------------------------. ++ + ++ | GPU 1 - Transfer Buffer 1 | ++ + ++ '-------------------------------------------------------' ++ + ++ /|\ selective | ++ + ++ | in-VRAM copy \|/ ++ + ++ .-----------------------------------------------------------------. ++ + ++ | GPU 1 | ++ + ++ | LBM Domain 1 | ++ + ++ '-----------------------------------------------------------------' ++ + ## | ## + ## domain synchronization barrier ## + ## | ## + || -------------------------------------------------------------> time || + ``` + - domain communication architecture (detailed) + ```diff + ++ .-----------------------------------------------------------------. ++ + ++ | GPU 0 | ++ + ++ | LBM Domain 0 | ++ + ++ '-----------------------------------------------------------------' ++ + ++ | selective in- /|\ | selective in- /|\ | selective in- /|\ ++ + ++ \|/ VRAM copy (X) | \|/ VRAM copy (Y) | \|/ VRAM copy (Z) | ++ + ++ .---------------------.---------------------.---------------------. ++ + ++ | GPU 0 - TB 0X+ | GPU 0 - TB 0Y+ | GPU 0 - TB 0Z+ | ++ + ++ | GPU 0 - TB 0X- | GPU 0 - TB 0Y- | GPU 0 - TB 0Z- | ++ + ++ '---------------------'---------------------'---------------------' ++ + !! | PCIe /|\ | PCIe /|\ | PCIe /|\ !! + !! \|/ copy | \|/ copy | \|/ copy | !! + @@ .---------. .---------.---------. .---------.---------. .---------. @@ + @@ | CPU 0X+ | | CPU 1X- | CPU 0Y+ | | CPU 3Y- | CPU 0Z+ | | CPU 5Z- | @@ + @@ | CPU 0X- | | CPU 2X+ | CPU 0Y- | | CPU 4Y+ | CPU 0Z- | | CPU 6Z+ | @@ + @@ '---------\ /---------'---------\ /---------'---------\ /---------' @@ + @@ pointer X swap (X) pointer X swap (Y) pointer X swap (Z) @@ + @@ .---------/ \---------.---------/ \---------.---------/ \---------. @@ + @@ | CPU 1X- | | CPU 0X+ | CPU 3Y- | | CPU 0Y+ | CPU 5Z- | | CPU 0Z+ | @@ + @@ | CPU 2X+ | | CPU 0X- | CPU 4Y+ | | CPU 0Y- | CPU 6Z+ | | CPU 0Z- | @@ + @@ '---------' '---------'---------' '---------'---------' '---------' @@ + !! 
/|\ PCIe | /|\ PCIe | /|\ PCIe | !! + !! | copy \|/ | copy \|/ | copy \|/ !! + ++ .--------------------..---------------------..--------------------. ++ + ++ | GPU 1 - TB 1X- || GPU 3 - TB 3Y- || GPU 5 - TB 5Z- | ++ + ++ :====================::=====================::====================: ++ + ++ | GPU 2 - TB 2X+ || GPU 4 - TB 4Y+ || GPU 6 - TB 6Z+ | ++ + ++ '--------------------''---------------------''--------------------' ++ + ++ /|\ selective in- | /|\ selective in- | /|\ selective in- | ++ + ++ | VRAM copy (X) \|/ | VRAM copy (Y) \|/ | VRAM copy (Z) \|/ ++ + ++ .--------------------..---------------------..--------------------. ++ + ++ | GPU 1 || GPU 3 || GPU 5 | ++ + ++ | LBM Domain 1 || LBM Domain 3 || LBM Domain 5 | ++ + ++ :====================::=====================::====================: ++ + ++ | GPU 2 || GPU 4 || GPU 6 | ++ + ++ | LBM Domain 2 || LBM Domain 4 || LBM Domain 6 | ++ + ++ '--------------------''---------------------''--------------------' ++ + ## | | | ## + ## | domain synchronization barriers | ## + ## | | | ## + || -------------------------------------------------------------> time || + ```
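The simplified diagram above can be read as a five-step exchange, and a toy Python model may help in following it. This is a sketch under loud assumptions: plain lists stand in for VRAM and host memory, one value stands in for a whole boundary layer of DDFs, and the `Domain` class and `exchange` function are invented for this example rather than taken from FluidX3D.

```python
class Domain:
    """Toy 1D domain; one cell at the shared face acts as the halo."""

    def __init__(self, cells):
        self.cells = list(cells)
        self.gpu_buffer = [None]  # stands in for the GPU-side transfer buffer
        self.cpu_buffer = [None]  # stands in for the CPU-side transfer buffer


def exchange(left, right):
    # 1) selective in-VRAM copy: gather the boundary layer into the GPU buffer
    left.gpu_buffer[0] = left.cells[-2]   # last interior cell of the left domain
    right.gpu_buffer[0] = right.cells[1]  # first interior cell of the right domain
    # 2) "PCIe copy": GPU transfer buffer -> CPU transfer buffer
    left.cpu_buffer[:] = left.gpu_buffer
    right.cpu_buffer[:] = right.gpu_buffer
    # 3) pointer swap: the two host buffers trade owners, no data is copied
    left.cpu_buffer, right.cpu_buffer = right.cpu_buffer, left.cpu_buffer
    # 4) "PCIe copy": CPU transfer buffer -> GPU transfer buffer
    left.gpu_buffer[:] = left.cpu_buffer
    right.gpu_buffer[:] = right.cpu_buffer
    # 5) selective in-VRAM copy: scatter the received layer into the halo
    left.cells[-1] = left.gpu_buffer[0]
    right.cells[0] = right.gpu_buffer[0]


left = Domain([1, 2, 3, None])   # halo is the last cell
right = Domain([None, 4, 5, 6])  # halo is the first cell
exchange(left, right)
print(left.cells, right.cells)
```

Swapping pointers on the host instead of copying between the two CPU buffers is what keeps the host-side cost of step 3 independent of the buffer size.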
+- [peak performance on GPUs](#single-gpu-benchmarks) (datacenter/gaming/professional/laptop), validated with roofline model - [DDF-shifting](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats) and other algebraic optimization to minimize round-off error - velocity sets: D2Q9, D3Q15, D3Q19 (default), D3Q27 - collision operators: single-relaxation-time (SRT/BGK) (default), two-relaxation-time (TRT) -- only 8 flag bits per lattice point (can be used independently / at the same time): +-
only 8 flag bits per lattice point (can be used independently / at the same time) + - `TYPE_S` (stationary or moving) solid boundaries - `TYPE_E` equilibrium boundaries (inflow/outflow) - `TYPE_T` temperature boundaries @@ -213,6 +207,8 @@ $$f_j(i\\%2\\ ?\\ \vec{x}+\vec{e}_i\\ :\\ \vec{x},\\ t+\Delta t)=f_i^\textrm{tem - `TYPE_X` remaining for custom use or further extensions - `TYPE_Y` remaining for custom use or further extensions +
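Because the flags are individual bits of one byte, several can be set on the same lattice point and tested independently. The sketch below illustrates this with plain bitmasks. The bit positions are hypothetical and chosen only for this example; the real assignment is defined in the FluidX3D source. Only flag names that appear in the list above are used, and the helper `has_flag` is made up.

```python
# Hypothetical bit positions, for illustration only.
TYPE_S = 0b00000001  # solid boundary
TYPE_E = 0b00000010  # equilibrium boundary (inflow/outflow)
TYPE_T = 0b00000100  # temperature boundary
TYPE_X = 0b01000000  # free for custom use
TYPE_Y = 0b10000000  # free for custom use


def has_flag(flags, bit):
    """True if the given flag bit is set in the cell's flag byte."""
    return flags & bit != 0


cell = TYPE_S | TYPE_T  # a solid cell that is also a temperature boundary
print(has_flag(cell, TYPE_S), has_flag(cell, TYPE_T), has_flag(cell, TYPE_E))
```

Any combination of the eight bits still fits in a single byte, which is where the 1 Byte of flags per cell in the memory layout comes from.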
+ ## Optional Compute Extensions @@ -447,35 +443,6 @@ Multi-GPU benchmarks are done at the largest possible grid resolution with a cub -## Maximum Single-Domain Grid Resolution for D3Q19 LBM - -| Memory | FP32/FP32 | FP32/FP16 | -| -----: | --------: | --------: | -| 1 GB | 224³ | 266³ | -| 2 GB | 282³ | 336³ | -| 3 GB | 322³ | 384³ | -| 4 GB | 354³ | 424³ | -| 6 GB | 406³ | 484³ | -| 8 GB | 448³ | 534³ | -| 10 GB | 482³ | 574³ | -| 11 GB | 498³ | 594³ | -| 12 GB | 512³ | 610³ | -| 16 GB | 564³ | 672³ | -| 20 GB | 608³ | 724³ | -| 24 GB | 646³ | 770³ | -| 32 GB | 710³ | 848³ | -| 40 GB | 766³ | 912³ | -| 48 GB | 814³ | 970³ | -| 64 GB | 896³ | 1068³ | -| 80 GB | 966³ | 1150³ | -| 96 GB | 1026³ | 1222³ | -| 128 GB | 1130³ | 1346³ | -| 192 GB | 1292³ | 1540³ | -| 256 GB | 1422³ | 1624³ | -| 384 GB | 1624³ | 1624³ | - - - ## FAQs ### General From 0eec60819a4b67b626f7b9accfcbbcdb75c517c3 Mon Sep 17 00:00:00 2001 From: Moritz Lehmann Date: Wed, 12 Apr 2023 12:45:56 +0200 Subject: [PATCH 2/5] Updated Readme --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 7cfbfd9d..fb55c868 100644 --- a/README.md +++ b/README.md @@ -99,7 +99,7 @@ $$f_j(i\\%2\\ ?\\ \vec{x}+\vec{e}_i\\ :\\ \vec{x},\\ t+\Delta t)=f_i^\textrm{tem | GPU VRAM capacity | 1 GB | 2 GB | 3 GB | 4 GB | 6 GB | 8 GB | 10 GB | 11 GB | 12 GB | 16 GB | 20 GB | 24 GB | 32 GB | 40 GB | 48 GB | 64 GB | 80 GB | 94 GB | 128 GB | 192 GB | 256 GB | | :------------------------------- | --------: | --------: | --------: | --------: | --------: | --------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ---------: | ----------: | ----------: | ----------: | - | approximate GPU price | $25
GT 210 | $25
GTX 950 | $12
GTX 1060 | $50
GT 730 | $35
GTX 1060 | $70
RX 470 | $500
RTX 3080 | $240
GTX 1080 Ti | $75
Tesla M40 | $75
Instinct MI25 | $900
RX 7900 XT | $205
Tesla P40 | $600
Instinct MI60 | $5500
A100 | $2400
RTX 8000 | $31k
Instinct MI210 | $11k
A100 | >$40k
H100 NVL | ?
Max Series 1550 | - | - | + | approximate GPU price | $25
GT 210 | $25
GTX 950 | $12
GTX 1060 | $50
GT 730 | $35
GTX 1060 | $70
RX 470 | $500
RTX 3080 | $240
GTX 1080 Ti | $75
Tesla M40 | $75
Instinct MI25 | $900
RX 7900 XT | $205
Tesla P40 | $600
Instinct MI60 | $5500
A100 | $2400
RTX 8000 | $10k
Instinct MI210 | $11k
A100 | >$40k
H100 NVL | ?
Max Series 1550 | - | - | | traditional LBM (FP64) | 144³ | 182³ | 208³ | 230³ | 262³ | 288³ | 312³ | 322³ | 330³ | 364³ | 392³ | 418³ | 460³ | 494³ | 526³ | 578³ | 624³ | 658³ | 730³ | 836³ | 920³ | | FluidX3D (FP32/FP32) | 224³ | 282³ | 322³ | 354³ | 406³ | 448³ | 482³ | 498³ | 512³ | 564³ | 608³ | 646³ | 710³ | 766³ | 814³ | 896³ | 966³ | 1018³ | 1130³ | 1292³ | 1422³ | | FluidX3D (FP32/FP16) | 266³ | 336³ | 384³ | 424³ | 484³ | 534³ | 574³ | 594³ | 610³ | 672³ | 724³ | 770³ | 848³ | 912³ | 970³ | 1068³ | 1150³ | 1214³ | 1346³ | 1540³ | 1624³ | From 9d07709b55a75ed520c7f8deffc6b7fcbcf8bb55 Mon Sep 17 00:00:00 2001 From: Moritz Lehmann Date: Thu, 13 Apr 2023 00:48:45 +0200 Subject: [PATCH 3/5] Cosmetics for benchmark tables Readme --- README.md | 245 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 127 insertions(+), 118 deletions(-) diff --git a/README.md b/README.md index fb55c868..d9db1f69 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,7 @@ The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via [OpenCL](https://github.com/ProjectPhysX/OpenCL-Wrapper "OpenCL-Wrapper").
- +
Update History @@ -316,90 +316,97 @@ In consequence, the arithmetic intensity of this implementation is 2.37 (FP32/FP If your GPU is not on the list yet, you can report your benchmarks [here](https://github.com/ProjectPhysX/FluidX3D/issues/8). -| Device | FP32
[TFlops/s] | Mem
[GB] | BW
[GB/s] | FP32/FP32
[MLUPs/s] | FP32/FP16S
[MLUPs/s] | FP32/FP16C
[MLUPs/s] | -| :---------------------------- | -----------------: | ----------: | -----------: | ---------------------: | ----------------------: | ----------------------: | -| AMD Instinct MI250 (1 GCD) | 45.26 | 64 | 1638 | 5638 (53%) | 9030 (42%) | 8506 (40%) | -| AMD Radeon VII | 13.83 | 16 | 1024 | 4898 (73%) | 7778 (58%) | 5256 (40%) | -| Nvidia H100 PCIe 80GB | 51.01 | 80 | 2000 | 11128 (85%) | 20624 (79%) | 13862 (53%) | -| Nvidia A100 SXM4 80GB | 19.49 | 80 | 2039 | 10228 (77%) | 18448 (70%) | 11197 (42%) | -| Nvidia A100 SXM4 40GB | 19.49 | 40 | 1555 | 8522 (84%) | 16013 (79%) | 11251 (56%) | -| Nvidia A100 PCIe 40GB | 19.49 | 40 | 1555 | 8526 (84%) | 16035 (79%) | 11088 (55%) | -| Nvidia Tesla V100 16GB | 14.13 | 16 | 900 | 5128 (87%) | 10325 (88%) | 7683 (66%) | -| Nvidia Quadro GV100 | 16.66 | 32 | 870 | 3442 (61%) | 6641 (59%) | 5863 (52%) | -| Nvidia Tesla P100 16GB | 9.52 | 16 | 732 | 3295 (69%) | 5950 (63%) | 4176 (44%) | -| Nvidia Tesla P100 12GB | 9.52 | 12 | 549 | 2427 (68%) | 4141 (58%) | 3999 (56%) | -| Nvidia Tesla K40m | 4.29 | 12 | 288 | 1131 (60%) | 1868 (50%) | 912 (24%) | -| Nvidia Tesla K80 (1 GPU) | 4.11 | 12 | 240 | 916 (58%) | 1642 (53%) | 943 (30%) | -| Nvidia Tesla K20c | 3.52 | 5 | 208 | 861 (63%) | 1507 (56%) | 720 (27%) | -| AMD Radeon RX 7900 XTX | 61.44 | 24 | 960 | 3665 (58%) | 7644 (61%) | 7716 (62%) | -| AMD Radeon RX 7900 XT | 51.61 | 20 | 800 | 3013 (58%) | 5856 (56%) | 5986 (58%) | -| AMD Radeon RX 6900 XT | 23.04 | 16 | 512 | 1968 (59%) | 4227 (64%) | 4207 (63%) | -| AMD Radeon RX 6800 XT | 20.74 | 16 | 512 | 2008 (60%) | 4241 (64%) | 4224 (64%) | -| AMD Radeon RX 5700 XT | 9.75 | 8 | 448 | 1368 (47%) | 3253 (56%) | 3049 (52%) | -| AMD Radeon RX Vega 64 | 13.35 | 8 | 484 | 1875 (59%) | 2878 (46%) | 3227 (51%) | -| AMD Radeon RX 580 4GB | 6.50 | 4 | 256 | 946 (57%) | 1848 (56%) | 1577 (47%) | -| AMD Radeon HD 7850 | 1.84 | 2 | 154 | 112 (11%) | 120 ( 6%) | 635 (32%) | -| Intel Arc A770 LE | 19.66 | 16 | 560 | 2741 (75%) 
| 4591 (63%) | 4626 (64%) | -| Intel Arc A750 LE | 17.20 | 8 | 512 | 2625 (78%) | 4184 (63%) | 4238 (64%) | -| Nvidia GeForce RTX 4090 | 82.58 | 24 | 1008 | 5624 (85%) | 11091 (85%) | 11496 (88%) | -| Nvidia GeForce RTX 4080 | 55.45 | 16 | 717 | 3914 (84%) | 7626 (82%) | 7933 (85%) | -| Nvidia GeForce RTX 3090 Ti | 40.00 | 24 | 1008 | 5717 (87%) | 10956 (84%) | 10400 (79%) | -| Nvidia GeForce RTX 3090 | 39.05 | 24 | 936 | 5418 (89%) | 10732 (88%) | 10215 (84%) | -| Nvidia GeForce RTX 3080 Ti | 37.17 | 12 | 912 | 5202 (87%) | 9832 (87%) | 9347 (79%) | -| Nvidia RTX A6000 | 40.00 | 48 | 768 | 4421 (88%) | 8814 (88%) | 8533 (86%) | -| Nvidia GeForce RTX 3080 | 29.77 | 10 | 760 | 4230 (85%) | 8118 (82%) | 7714 (78%) | -| Nvidia GeForce RTX 3070 | 20.31 | 8 | 448 | 2578 (88%) | 5096 (88%) | 5060 (87%) | -| Nvidia GeForce RTX 3060 Ti | 16.49 | 8 | 448 | 2644 (90%) | 5129 (88%) | 4718 (81%) | -| Nvidia RTX A5000M | 16.59 | 16 | 448 | 2228 (76%) | 4461 (77%) | 3662 (63%) | -| Nvidia GeForce RTX 3060 | 13.17 | 12 | 360 | 2108 (90%) | 4070 (87%) | 3566 (76%) | -| Nvidia GeForce RTX 3060M | 10.94 | 6 | 336 | 2019 (92%) | 4012 (92%) | 3572 (82%) | -| Nvidia GeForce RTX 3050M | 7.13 | 4 | 192 | 1180 (94%) | 2339 (94%) | 2016 (81%) | -| Nvidia Quadro RTX 6000 | 16.31 | 24 | 672 | 3307 (75%) | 6836 (78%) | 6879 (79%) | -| Nvidia Quadro RTX 8000 Pass. | 14.93 | 48 | 624 | 2591 (64%) | 5408 (67%) | 5607 (69%) | -| Nvidia GeForce RTX 2080 Ti | 13.45 | 11 | 616 | 3194 (79%) | 6700 (84%) | 6853 (86%) | -| Nvidia GeForce RTX 2080 Sup. | 11.34 | 8 | 496 | 2434 (75%) | 5284 (82%) | 5087 (79%) | -| Nvidia Quadro RTX 5000 | 11.15 | 16 | 448 | 2341 (80%) | 4766 (82%) | 4773 (82%) | -| Nvidia GeForce RTX 2060 Sup. 
| 7.18 | 8 | 448 | 2503 (85%) | 5035 (87%) | 4463 (77%) | -| Nvidia Quadro RTX 4000 | 7.12 | 8 | 416 | 2284 (84%) | 4584 (85%) | 4062 (75%) | -| Nvidia GeForce RTX 2060 KO | 6.74 | 6 | 336 | 1643 (75%) | 3376 (77%) | 3266 (75%) | -| Nvidia GeForce RTX 2060 | 6.74 | 6 | 336 | 1681 (77%) | 3604 (83%) | 3571 (82%) | -| Nvidia GeForce GTX 1660 Sup. | 5.03 | 6 | 336 | 1696 (77%) | 3551 (81%) | 3040 (70%) | -| Nvidia Tesla T4 | 8.14 | 15 | 300 | 1356 (69%) | 2869 (74%) | 2887 (74%) | -| Nvidia GeForce GTX 1660 Ti | 5.48 | 6 | 288 | 1467 (78%) | 3041 (81%) | 3019 (81%) | -| Nvidia GeForce GTX 1660 | 5.07 | 6 | 192 | 1016 (81%) | 1924 (77%) | 1992 (80%) | -| Nvidia GeForce GTX 1650M | 3.20 | 4 | 128 | 706 (84%) | 1214 (73%) | 1400 (84%) | -| Nvidia Titan Xp | 12.15 | 12 | 548 | 2919 (82%) | 5495 (77%) | 5375 (76%) | -| Nvidia GeForce GTX 1080 Ti | 12.06 | 11 | 484 | 2631 (83%) | 4837 (77%) | 4877 (78%) | -| Nvidia GeForce GTX 1080 | 9.78 | 8 | 320 | 1623 (78%) | 3100 (75%) | 3182 (77%) | -| Nvidia GeForce GTX 1060M | 4.44 | 6 | 192 | 983 (78%) | 1882 (75%) | 1803 (72%) | -| Nvidia GeForce GTX 1050M Ti | 2.49 | 4 | 112 | 631 (86%) | 1224 (84%) | 1115 (77%) | -| Nvidia Quadro P1000 | 1.89 | 4 | 82 | 426 (79%) | 839 (79%) | 778 (73%) | -| Nvidia GeForce GTX 970 | 4.17 | 4 | 224 | 980 (67%) | 1721 (59%) | 1623 (56%) | -| Nvidia Quadro M4000 | 2.57 | 8 | 192 | 899 (72%) | 1519 (61%) | 1050 (42%) | -| Nvidia Tesla M60 (1 GPU) | 4.82 | 8 | 160 | 853 (82%) | 1571 (76%) | 1557 (75%) | -| Nvidia GeForce GTX 960M | 1.51 | 4 | 80 | 442 (84%) | 872 (84%) | 627 (60%) | -| Nvidia Quadro K2000 | 0.73 | 2 | 64 | 312 (75%) | 444 (53%) | 171 (21%) | -| Nvidia GeForce GT 630 (OEM) | 0.46 | 2 | 29 | 151 (81%) | 185 (50%) | 78 (21%) | -| Nvidia Quadro NVS 290 | 0.03 | 0.256 | 6 | 1 ( 2%) | 1 ( 1%) | 1 ( 1%) | -| Apple M1 Pro GPU 16C 16GB | 4.10 | 11 | 200 | 1204 (92%) | 2329 (90%) | 1855 (71%) | -| AMD Radeon Vega 8 (4750G) | 2.15 | 27 | 57 | 263 (71%) | 511 (70%) | 501 (68%) | -| AMD Radeon 
Vega 8 (3500U) | 1.23 | 7 | 38 | 157 (63%) | 282 (57%) | 288 (58%) | -| Intel UHD Graphics 630 | 0.46 | 7 | 51 | 151 (45%) | 301 (45%) | 187 (28%) | -| Intel HD Graphics 5500 | 0.35 | 3 | 26 | 75 (45%) | 192 (58%) | 108 (32%) | -| Intel HD Graphics 4600 | 0.38 | 2 | 26 | 105 (63%) | 115 (35%) | 34 (10%) | -| Samsung ARM Mali-G72 MP18 | 0.24 | 4 | 29 | 14 ( 7%) | 17 ( 5%) | 12 ( 3%) | -| 2x AMD EPYC 9654 | 29.49 | 1536 | 922 | 1381 (23%) | 1814 (15%) | 1801 (15%) | -| Intel Xeon Phi 7210 | 5.32 | 192 | 102 | 415 (62%) | 193 (15%) | 223 (17%) | -| 4x Intel Xeon E5-4620 v4 | 2.69 | 512 | 273 | 460 (26%) | 275 ( 8%) | 239 ( 7%) | -| 2x Intel Xeon E5-2630 v4 | 1.41 | 64 | 137 | 264 (30%) | 146 ( 8%) | 129 ( 7%) | -| 2x Intel Xeon E5-2623 v4 | 0.67 | 64 | 137 | 125 (14%) | 66 ( 4%) | 59 ( 3%) | -| 2x Intel Xeon E5-2680 v3 | 1.92 | 64 | 137 | 209 (23%) | 305 (17%) | 281 (16%) | -| Intel Core i9-10980XE | 3.23 | 128 | 94 | 286 (47%) | 251 (21%) | 223 (18%) | -| Intel Core i5-9600 | 0.60 | 16 | 43 | 146 (52%) | 127 (23%) | 147 (27%) | -| Intel Core i7-8700K | 0.71 | 16 | 51 | 152 (45%) | 134 (20%) | 116 (17%) | -| Intel Core i7-7700HQ | 0.36 | 12 | 38 | 81 (32%) | 82 (16%) | 108 (22%) | -| Intel Core i7-4770 | 0.44 | 16 | 26 | 104 (62%) | 69 (21%) | 59 (18%) | -| Intel Core i7-4720HQ | 0.33 | 16 | 26 | 58 (35%) | 13 ( 4%) | 47 (14%) | +Colors: 🔴 AMD, 🔵 Intel, 🟢 Nvidia, 🟣 Apple, 🟡 Samsung + +| Device | FP32
[TFlops/s] | Mem
[GB] | BW
[GB/s] | FP32/FP32
[MLUPs/s] | FP32/FP16S
[MLUPs/s] | FP32/FP16C
[MLUPs/s] | +| :--------------------------------------------- | -----------------: | ----------: | -----------: | ---------------------: | ----------------------: | ----------------------: | +| | | | | | | | +| 🔴 Instinct MI250 (1 GCD) | 45.26 | 64 | 1638 | 5638 (53%) | 9030 (42%) | 8506 (40%) | +| 🔴 Radeon VII | 13.83 | 16 | 1024 | 4898 (73%) | 7778 (58%) | 5256 (40%) | +| 🟢 H100 PCIe 80GB | 51.01 | 80 | 2000 | 11128 (85%) | 20624 (79%) | 13862 (53%) | +| 🟢 A100 SXM4 80GB | 19.49 | 80 | 2039 | 10228 (77%) | 18448 (70%) | 11197 (42%) | +| 🟢 A100 SXM4 40GB | 19.49 | 40 | 1555 | 8522 (84%) | 16013 (79%) | 11251 (56%) | +| 🟢 A100 PCIe 40GB | 19.49 | 40 | 1555 | 8526 (84%) | 16035 (79%) | 11088 (55%) | +| 🟢 Tesla V100 16GB | 14.13 | 16 | 900 | 5128 (87%) | 10325 (88%) | 7683 (66%) | +| 🟢 Quadro GV100 | 16.66 | 32 | 870 | 3442 (61%) | 6641 (59%) | 5863 (52%) | +| 🟢 Tesla P100 16GB | 9.52 | 16 | 732 | 3295 (69%) | 5950 (63%) | 4176 (44%) | +| 🟢 Tesla P100 12GB | 9.52 | 12 | 549 | 2427 (68%) | 4141 (58%) | 3999 (56%) | +| 🟢 Tesla K40m | 4.29 | 12 | 288 | 1131 (60%) | 1868 (50%) | 912 (24%) | +| 🟢 Tesla K80 (1 GPU) | 4.11 | 12 | 240 | 916 (58%) | 1642 (53%) | 943 (30%) | +| 🟢 Tesla K20c | 3.52 | 5 | 208 | 861 (63%) | 1507 (56%) | 720 (27%) | +| | | | | | | | +| 🔴 Radeon RX 7900 XTX | 61.44 | 24 | 960 | 3665 (58%) | 7644 (61%) | 7716 (62%) | +| 🔴 Radeon RX 7900 XT | 51.61 | 20 | 800 | 3013 (58%) | 5856 (56%) | 5986 (58%) | +| 🔴 Radeon RX 6900 XT | 23.04 | 16 | 512 | 1968 (59%) | 4227 (64%) | 4207 (63%) | +| 🔴 Radeon RX 6800 XT | 20.74 | 16 | 512 | 2008 (60%) | 4241 (64%) | 4224 (64%) | +| 🔴 Radeon RX 5700 XT | 9.75 | 8 | 448 | 1368 (47%) | 3253 (56%) | 3049 (52%) | +| 🔴 Radeon RX Vega 64 | 13.35 | 8 | 484 | 1875 (59%) | 2878 (46%) | 3227 (51%) | +| 🔴 Radeon RX 580 4GB | 6.50 | 4 | 256 | 946 (57%) | 1848 (56%) | 1577 (47%) | +| 🔴 Radeon HD 7850 | 1.84 | 2 | 154 | 112 (11%) | 120 ( 6%) | 635 (32%) | +| 🔵 Arc A770 LE | 19.66 | 16 | 560 | 2741 (75%) | 4591 (63%) | 4626 (64%) | 
+| 🔵 Arc A750 LE | 17.20 | 8 | 512 | 2625 (78%) | 4184 (63%) | 4238 (64%) | +| 🟢 GeForce RTX 4090 | 82.58 | 24 | 1008 | 5624 (85%) | 11091 (85%) | 11496 (88%) | +| 🟢 RTX 6000 Ada | 91.10 | 48 | 960 | 4997 (80%) | 10249 (82%) | 10293 (83%) | +| 🟢 GeForce RTX 4080 | 55.45 | 16 | 717 | 3914 (84%) | 7626 (82%) | 7933 (85%) | +| 🟢 GeForce RTX 3090 Ti | 40.00 | 24 | 1008 | 5717 (87%) | 10956 (84%) | 10400 (79%) | +| 🟢 GeForce RTX 3090 | 39.05 | 24 | 936 | 5418 (89%) | 10732 (88%) | 10215 (84%) | +| 🟢 GeForce RTX 3080 Ti | 37.17 | 12 | 912 | 5202 (87%) | 9832 (87%) | 9347 (79%) | +| 🟢 RTX A6000 | 40.00 | 48 | 768 | 4421 (88%) | 8814 (88%) | 8533 (86%) | +| 🟢 GeForce RTX 3080 | 29.77 | 10 | 760 | 4230 (85%) | 8118 (82%) | 7714 (78%) | +| 🟢 GeForce RTX 3070 | 20.31 | 8 | 448 | 2578 (88%) | 5096 (88%) | 5060 (87%) | +| 🟢 GeForce RTX 3060 Ti | 16.49 | 8 | 448 | 2644 (90%) | 5129 (88%) | 4718 (81%) | +| 🟢 RTX A5000M | 16.59 | 16 | 448 | 2228 (76%) | 4461 (77%) | 3662 (63%) | +| 🟢 GeForce RTX 3060 | 13.17 | 12 | 360 | 2108 (90%) | 4070 (87%) | 3566 (76%) | +| 🟢 GeForce RTX 3060M | 10.94 | 6 | 336 | 2019 (92%) | 4012 (92%) | 3572 (82%) | +| 🟢 GeForce RTX 3050M | 7.13 | 4 | 192 | 1180 (94%) | 2339 (94%) | 2016 (81%) | +| 🟢 Quadro RTX 6000 | 16.31 | 24 | 672 | 3307 (75%) | 6836 (78%) | 6879 (79%) | +| 🟢 Quadro RTX 8000 Pass. | 14.93 | 48 | 624 | 2591 (64%) | 5408 (67%) | 5607 (69%) | +| 🟢 GeForce RTX 2080 Ti | 13.45 | 11 | 616 | 3194 (79%) | 6700 (84%) | 6853 (86%) | +| 🟢 GeForce RTX 2080 Sup. | 11.34 | 8 | 496 | 2434 (75%) | 5284 (82%) | 5087 (79%) | +| 🟢 Quadro RTX 5000 | 11.15 | 16 | 448 | 2341 (80%) | 4766 (82%) | 4773 (82%) | +| 🟢 GeForce RTX 2060 Sup. | 7.18 | 8 | 448 | 2503 (85%) | 5035 (87%) | 4463 (77%) | +| 🟢 Quadro RTX 4000 | 7.12 | 8 | 416 | 2284 (84%) | 4584 (85%) | 4062 (75%) | +| 🟢 GeForce RTX 2060 KO | 6.74 | 6 | 336 | 1643 (75%) | 3376 (77%) | 3266 (75%) | +| 🟢 GeForce RTX 2060 | 6.74 | 6 | 336 | 1681 (77%) | 3604 (83%) | 3571 (82%) | +| 🟢 GeForce GTX 1660 Sup. 
| 5.03 | 6 | 336 | 1696 (77%) | 3551 (81%) | 3040 (70%) |
+| 🟢 Tesla T4 | 8.14 | 15 | 300 | 1356 (69%) | 2869 (74%) | 2887 (74%) |
+| 🟢 GeForce GTX 1660 Ti | 5.48 | 6 | 288 | 1467 (78%) | 3041 (81%) | 3019 (81%) |
+| 🟢 GeForce GTX 1660 | 5.07 | 6 | 192 | 1016 (81%) | 1924 (77%) | 1992 (80%) |
+| 🟢 GeForce GTX 1650M | 3.20 | 4 | 128 | 706 (84%) | 1214 (73%) | 1400 (84%) |
+| 🟢 Titan Xp | 12.15 | 12 | 548 | 2919 (82%) | 5495 (77%) | 5375 (76%) |
+| 🟢 GeForce GTX 1080 Ti | 12.06 | 11 | 484 | 2631 (83%) | 4837 (77%) | 4877 (78%) |
+| 🟢 GeForce GTX 1080 | 9.78 | 8 | 320 | 1623 (78%) | 3100 (75%) | 3182 (77%) |
+| 🟢 GeForce GTX 1060M | 4.44 | 6 | 192 | 983 (78%) | 1882 (75%) | 1803 (72%) |
+| 🟢 GeForce GTX 1050M Ti | 2.49 | 4 | 112 | 631 (86%) | 1224 (84%) | 1115 (77%) |
+| 🟢 Quadro P1000 | 1.89 | 4 | 82 | 426 (79%) | 839 (79%) | 778 (73%) |
+| 🟢 GeForce GTX 970 | 4.17 | 4 | 224 | 980 (67%) | 1721 (59%) | 1623 (56%) |
+| 🟢 Quadro M4000 | 2.57 | 8 | 192 | 899 (72%) | 1519 (61%) | 1050 (42%) |
+| 🟢 Tesla M60 (1 GPU) | 4.82 | 8 | 160 | 853 (82%) | 1571 (76%) | 1557 (75%) |
+| 🟢 GeForce GTX 960M | 1.51 | 4 | 80 | 442 (84%) | 872 (84%) | 627 (60%) |
+| 🟢 Quadro K2000 | 0.73 | 2 | 64 | 312 (75%) | 444 (53%) | 171 (21%) |
+| 🟢 GeForce GT 630 (OEM) | 0.46 | 2 | 29 | 151 (81%) | 185 (50%) | 78 (21%) |
+| 🟢 Quadro NVS 290 | 0.03 | 0.256 | 6 | 1 ( 2%) | 1 ( 1%) | 1 ( 1%) |
+| | | | | | | |
+| 🟣 M1 Pro GPU 16C 16GB | 4.10 | 11 | 200 | 1204 (92%) | 2329 (90%) | 1855 (71%) |
+| 🔴 Radeon Vega 8 (4750G) | 2.15 | 27 | 57 | 263 (71%) | 511 (70%) | 501 (68%) |
+| 🔴 Radeon Vega 8 (3500U) | 1.23 | 7 | 38 | 157 (63%) | 282 (57%) | 288 (58%) |
+| 🔵 UHD Graphics 630 | 0.46 | 7 | 51 | 151 (45%) | 301 (45%) | 187 (28%) |
+| 🔵 HD Graphics 5500 | 0.35 | 3 | 26 | 75 (45%) | 192 (58%) | 108 (32%) |
+| 🔵 HD Graphics 4600 | 0.38 | 2 | 26 | 105 (63%) | 115 (35%) | 34 (10%) |
+| 🟡 ARM Mali-G72 MP18 | 0.24 | 4 | 29 | 14 ( 7%) | 17 ( 5%) | 12 ( 3%) |
+| | | | | | | |
+| 🔴 2x EPYC 9654 | 29.49 | 1536 | 922 | 1381 (23%) | 1814 (15%) | 1801 (15%) |
+| 🔵 Xeon Phi 7210 | 5.32 | 192 | 102 | 415 (62%) | 193 (15%) | 223 (17%) |
+| 🔵 4x Xeon E5-4620 v4 | 2.69 | 512 | 273 | 460 (26%) | 275 ( 8%) | 239 ( 7%) |
+| 🔵 2x Xeon E5-2630 v4 | 1.41 | 64 | 137 | 264 (30%) | 146 ( 8%) | 129 ( 7%) |
+| 🔵 2x Xeon E5-2623 v4 | 0.67 | 64 | 137 | 125 (14%) | 66 ( 4%) | 59 ( 3%) |
+| 🔵 2x Xeon E5-2680 v3 | 1.92 | 64 | 137 | 209 (23%) | 305 (17%) | 281 (16%) |
+| 🔵 Core i9-10980XE | 3.23 | 128 | 94 | 286 (47%) | 251 (21%) | 223 (18%) |
+| 🔵 Core i5-9600 | 0.60 | 16 | 43 | 146 (52%) | 127 (23%) | 147 (27%) |
+| 🔵 Core i7-8700K | 0.71 | 16 | 51 | 152 (45%) | 134 (20%) | 116 (17%) |
+| 🔵 Core i7-7700HQ | 0.36 | 12 | 38 | 81 (32%) | 82 (16%) | 108 (22%) |
+| 🔵 Core i7-4770 | 0.44 | 16 | 26 | 104 (62%) | 69 (21%) | 59 (18%) |
+| 🔵 Core i7-4720HQ | 0.33 | 16 | 26 | 58 (35%) | 13 ( 4%) | 47 (14%) |
@@ -407,39 +414,41 @@ If your GPU is not on the list yet, you can report your benchmarks [here](https:
 Multi-GPU benchmarks are done at the largest possible grid resolution with a cubic domain, and either 2x1x1, 2x2x1 or 2x2x2 of these cubic domains together. The percentages in brackets are single-GPU roofline model efficiency, and the multiplicator numbers in brackets are scaling factors relative to benchmarked single-GPU performance.
-| Device | FP32 [TFlops/s] | Mem [GB] | BW [GB/s] | FP32/FP32 [MLUPs/s] | FP32/FP16S [MLUPs/s] | FP32/FP16C [MLUPs/s] |
-| :---------------------------- | -----------------: | ----------: | -----------: | ---------------------: | ----------------------: | ----------------------: |
-| | | | | | | |
-| 1x AMD Instinct MI250 (1 GCD) | 45.26 | 64 | 1638 | 5638 (53%) | 9030 (42%) | 8506 (40%) |
-| 1x AMD Instinct MI250 (2 GCD) | 90.52 | 128 | 3277 | 9460 (1.7x) | 14313 (1.6x) | 17338 (2.0x) |
-| 2x AMD Instinct MI250 (4 GCD) | 181.04 | 256 | 6554 | 16925 (3.0x) | 29163 (3.2x) | 29627 (3.5x) |
-| 4x AMD Instinct MI250 (8 GCD) | 362.08 | 512 | 13107 | 27350 (4.9x) | 52258 (5.8x) | 53521 (6.3x) |
-| | | | | | | |
-| 1x AMD Radeon VII | 13.83 | 16 | 1024 | 4898 (73%) | 7778 (58%) | 5256 (40%) |
-| 2x AMD Radeon VII | 27.66 | 32 | 2048 | 8113 (1.7x) | 15591 (2.0x) | 10352 (2.0x) |
-| 4x AMD Radeon VII | 55.32 | 64 | 4096 | 12911 (2.6x) | 24273 (3.1x) | 17080 (3.2x) |
-| 8x AMD Radeon VII | 110.64 | 128 | 8192 | 21946 (4.5x) | 30826 (4.0x) | 24572 (4.7x) |
-| | | | | | | |
-| 1x Nvidia A100 SXM4 40GB | 19.49 | 40 | 1555 | 8522 (84%) | 16013 (79%) | 11251 (56%) |
-| 2x Nvidia A100 SXM4 40GB | 38.98 | 80 | 3110 | 13629 (1.6x) | 24620 (1.5x) | 18850 (1.7x) |
-| 4x Nvidia A100 SXM4 40GB | 77.96 | 160 | 6220 | 17978 (2.1x) | 30604 (1.9x) | 30627 (2.7x) |
-| | | | | | | |
-| 1x Nvidia Tesla K40m | 4.29 | 12 | 288 | 1131 (60%) | 1868 (50%) | 912 (24%) |
-| 2x Nvidia Tesla K40m | 8.58 | 24 | 577 | 1971 (1.7x) | 3300 (1.8x) | 1801 (2.0x) |
-| 3x Tesla K40m + 1x Titan Xp | 17.16 | 48 | 1154 | 3117 (2.8x) | 5174 (2.8x) | 3127 (3.4x) |
-| | | | | | | |
-| 1x Nvidia RTX A6000 | 40.00 | 48 | 768 | 4421 (88%) | 8814 (88%) | 8533 (86%) |
-| 2x Nvidia RTX A6000 | 80.00 | 96 | 1536 | 8041 (1.8x) | 15026 (1.7x) | 14795 (1.7x) |
-| 4x Nvidia RTX A6000 | 160.00 | 192 | 3072 | 14314 (3.2x) | 27915 (3.2x) | 27227 (3.2x) |
-| 8x Nvidia RTX A6000 | 320.00 | 384 | 6144 | 19311 (4.4x) | 40063 (4.5x) | 39004 (4.6x) |
-| | | | | | | |
-| 1x Nvidia Quadro RTX 8000 Pa. | 14.93 | 48 | 624 | 2591 (64%) | 5408 (67%) | 5607 (69%) |
-| 2x Nvidia Quadro RTX 8000 Pa. | 29.86 | 96 | 1248 | 4767 (1.8x) | 9607 (1.8x) | 10214 (1.8x) |
-| | | | | | | |
-| 1x Nvidia GeForce RTX 2080 Ti | 13.45 | 11 | 616 | 3194 (79%) | 6700 (84%) | 6853 (86%) |
-| 2x Nvidia GeForce RTX 2080 Ti | 26.90 | 22 | 1232 | 5085 (1.6x) | 10770 (1.6x) | 10922 (1.6x) |
-| 4x Nvidia GeForce RTX 2080 Ti | 53.80 | 44 | 2464 | 9117 (2.9x) | 18415 (2.7x) | 18598 (2.7x) |
-| 7x RTX 2080 Ti + 1x A100 40GB | 107.60 | 88 | 4928 | 16146 (5.1x) | 33732 (5.0x) | 33857 (4.9x) |
+Colors: 🔴 AMD, 🔵 Intel, 🟢 Nvidia, 🟣 Apple, 🟡 Samsung
+
+| Device | FP32 [TFlops/s] | Mem [GB] | BW [GB/s] | FP32/FP32 [MLUPs/s] | FP32/FP16S [MLUPs/s] | FP32/FP16C [MLUPs/s] |
+| :------------------------------------------------------------- | -----------------: | ----------: | -----------: | ---------------------: | ----------------------: | ----------------------: |
+| | | | | | | |
+| 🔴 1x Instinct MI250 (1 GCD) | 45.26 | 64 | 1638 | 5638 (53%) | 9030 (42%) | 8506 (40%) |
+| 🔴 1x Instinct MI250 (2 GCD) | 90.52 | 128 | 3277 | 9460 (1.7x) | 14313 (1.6x) | 17338 (2.0x) |
+| 🔴 2x Instinct MI250 (4 GCD) | 181.04 | 256 | 6554 | 16925 (3.0x) | 29163 (3.2x) | 29627 (3.5x) |
+| 🔴 4x Instinct MI250 (8 GCD) | 362.08 | 512 | 13107 | 27350 (4.9x) | 52258 (5.8x) | 53521 (6.3x) |
+| | | | | | | |
+| 🔴 1x Radeon VII | 13.83 | 16 | 1024 | 4898 (73%) | 7778 (58%) | 5256 (40%) |
+| 🔴 2x Radeon VII | 27.66 | 32 | 2048 | 8113 (1.7x) | 15591 (2.0x) | 10352 (2.0x) |
+| 🔴 4x Radeon VII | 55.32 | 64 | 4096 | 12911 (2.6x) | 24273 (3.1x) | 17080 (3.2x) |
+| 🔴 8x Radeon VII | 110.64 | 128 | 8192 | 21946 (4.5x) | 30826 (4.0x) | 24572 (4.7x) |
+| | | | | | | |
+| 🟢 1x A100 SXM4 40GB | 19.49 | 40 | 1555 | 8522 (84%) | 16013 (79%) | 11251 (56%) |
+| 🟢 2x A100 SXM4 40GB | 38.98 | 80 | 3110 | 13629 (1.6x) | 24620 (1.5x) | 18850 (1.7x) |
+| 🟢 4x A100 SXM4 40GB | 77.96 | 160 | 6220 | 17978 (2.1x) | 30604 (1.9x) | 30627 (2.7x) |
+| | | | | | | |
+| 🟢 1x Tesla K40m | 4.29 | 12 | 288 | 1131 (60%) | 1868 (50%) | 912 (24%) |
+| 🟢 2x Tesla K40m | 8.58 | 24 | 577 | 1971 (1.7x) | 3300 (1.8x) | 1801 (2.0x) |
+| 🟢 3x K40m + 1x Titan Xp | 17.16 | 48 | 1154 | 3117 (2.8x) | 5174 (2.8x) | 3127 (3.4x) |
+| | | | | | | |
+| 🟢 1x RTX A6000 | 40.00 | 48 | 768 | 4421 (88%) | 8814 (88%) | 8533 (86%) |
+| 🟢 2x RTX A6000 | 80.00 | 96 | 1536 | 8041 (1.8x) | 15026 (1.7x) | 14795 (1.7x) |
+| 🟢 4x RTX A6000 | 160.00 | 192 | 3072 | 14314 (3.2x) | 27915 (3.2x) | 27227 (3.2x) |
+| 🟢 8x RTX A6000 | 320.00 | 384 | 6144 | 19311 (4.4x) | 40063 (4.5x) | 39004 (4.6x) |
+| | | | | | | |
+| 🟢 1x Quadro RTX 8000 Pa. | 14.93 | 48 | 624 | 2591 (64%) | 5408 (67%) | 5607 (69%) |
+| 🟢 2x Quadro RTX 8000 Pa. | 29.86 | 96 | 1248 | 4767 (1.8x) | 9607 (1.8x) | 10214 (1.8x) |
+| | | | | | | |
+| 🟢 1x GeForce RTX 2080 Ti | 13.45 | 11 | 616 | 3194 (79%) | 6700 (84%) | 6853 (86%) |
+| 🟢 2x GeForce RTX 2080 Ti | 26.90 | 22 | 1232 | 5085 (1.6x) | 10770 (1.6x) | 10922 (1.6x) |
+| 🟢 4x GeForce RTX 2080 Ti | 53.80 | 44 | 2464 | 9117 (2.9x) | 18415 (2.7x) | 18598 (2.7x) |
+| 🟢 7x 2080 Ti + 1x A100 40GB | 107.60 | 88 | 4928 | 16146 (5.1x) | 33732 (5.0x) | 33857 (4.9x) |

From 042f51a5d3168725c19d429110d26501ccf2770a Mon Sep 17 00:00:00 2001
From: Moritz Lehmann
Date: Sat, 15 Apr 2023 10:57:41 +0200
Subject: [PATCH 4/5] Cosmetics in Readme

---
 README.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/README.md b/README.md
index d9db1f69..8e0a75d5 100644
--- a/README.md
+++ b/README.md
@@ -95,6 +95,18 @@ $$f_j(i\\%2\\ ?\\ \vec{x}+\vec{e}_i\\ :\\ \vec{x},\\ t+\Delta t)=f_i^\textrm{tem
 - allows for 19 Million cells per 1 GB VRAM
 - in-place streaming with [Esoteric-Pull](https://doi.org/10.3390/computation10060092): eliminates redundant copy `B` of density distribution functions (DDFs) in memory; almost cuts memory demand in half and slightly increases performance due to implicit bounce-back boundaries; offers optimal memory access patterns for single-cell in-place streaming
 - [decoupled arithmetic precision (FP32) and memory precision (FP32 or FP16S or FP16C)](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats): all arithmetic is done in FP32 for compatibility on all hardware, but DDFs in memory can be compressed to FP16S or FP16C: almost cuts memory demand in half again and almost doubles performance, without impacting overall accuracy for most setups
+ - only 8 flag bits per lattice point (can be used independently / at the same time)
+
+ - `TYPE_S` (stationary or moving) solid boundaries
+ - `TYPE_E` equilibrium boundaries (inflow/outflow)
+ - `TYPE_T` temperature boundaries
+ - `TYPE_F` free surface (fluid)
+ - `TYPE_I` free surface (interface)
+ - `TYPE_G` free surface (gas)
+ - `TYPE_X` remaining for custom use or further extensions
+ - `TYPE_Y` remaining for custom use or further extensions
+
+
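Because each of the eight flags listed above occupies its own bit, a single byte per lattice point can hold any combination of them at the same time. The following is a minimal Python sketch of that idea only. The class name `CellFlag`, the helper `has_flag`, and the specific bit positions are assumptions made for illustration; they are not taken from the FluidX3D sources.

```python
from enum import IntFlag


class CellFlag(IntFlag):
    """Eight independent 1-bit flags packed into one byte (bit positions are assumed)."""
    TYPE_S = 1 << 0  # (stationary or moving) solid boundaries
    TYPE_E = 1 << 1  # equilibrium boundaries (inflow/outflow)
    TYPE_T = 1 << 2  # temperature boundaries
    TYPE_F = 1 << 3  # free surface (fluid)
    TYPE_I = 1 << 4  # free surface (interface)
    TYPE_G = 1 << 5  # free surface (gas)
    TYPE_X = 1 << 6  # remaining for custom use
    TYPE_Y = 1 << 7  # remaining for custom use


def has_flag(cell: int, flag: CellFlag) -> bool:
    """Return True if `flag` is set in the flag byte `cell`."""
    return (int(cell) & int(flag)) != 0


# A solid cell that is also a temperature boundary: both bits are set at once.
cell = CellFlag.TYPE_S | CellFlag.TYPE_T

# All eight flags together fill exactly one byte.
ALL_FLAGS = sum(int(flag) for flag in CellFlag)
```

Testing a flag is a single bitwise AND, so checking several properties of a cell does not require more than the one byte that stores them.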
 - large cost saving: comparison of maximum single-GPU grid resolution for D3Q19 LBM
 | GPU VRAM capacity | 1 GB | 2 GB | 3 GB | 4 GB | 6 GB | 8 GB | 10 GB | 11 GB | 12 GB | 16 GB | 20 GB | 24 GB | 32 GB | 40 GB | 48 GB | 64 GB | 80 GB | 94 GB | 128 GB | 192 GB | 256 GB |
@@ -196,18 +208,6 @@ $$f_j(i\\%2\\ ?\\ \vec{x}+\vec{e}_i\\ :\\ \vec{x},\\ t+\Delta t)=f_i^\textrm{tem
 - [DDF-shifting](https://www.researchgate.net/publication/362275548_Accuracy_and_performance_of_the_lattice_Boltzmann_method_with_64-bit_32-bit_and_customized_16-bit_number_formats) and other algebraic optimization to minimize round-off error
 - velocity sets: D2Q9, D3Q15, D3Q19 (default), D3Q27
 - collision operators: single-relaxation-time (SRT/BGK) (default), two-relaxation-time (TRT)
-- only 8 flag bits per lattice point (can be used independently / at the same time)
-
-- - `TYPE_S` (stationary or moving) solid boundaries
-- - `TYPE_E` equilibrium boundaries (inflow/outflow)
-- - `TYPE_T` temperature boundaries
-- - `TYPE_F` free surface (fluid)
-- - `TYPE_I` free surface (interface)
-- - `TYPE_G` free surface (gas)
-- - `TYPE_X` remaining for custom use or further extensions
-- - `TYPE_Y` remaining for custom use or further extensions
-
-
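The bracketed numbers in the multi-GPU benchmark table of this patch series can be reproduced with simple arithmetic. The sketch below assumes that roofline efficiency is the memory traffic implied by the measured MLUPs/s divided by the device bandwidth, with 153 Bytes per cell update for FP32/FP32 and 77 Bytes for FP32/FP16 in D3Q19. Those two byte counts and both function names are assumptions not stated in this section; they were chosen because they match the table rows checked here.

```python
def roofline_efficiency(mlups: float, bandwidth_gb_s: float, bytes_per_update: int) -> float:
    """Fraction of the memory bandwidth used at the measured update rate.

    mlups * bytes_per_update is the traffic in MB/s; dividing by 1000 gives GB/s.
    """
    return mlups * bytes_per_update / 1000.0 / bandwidth_gb_s


def scaling_factor(multi_gpu_mlups: float, single_gpu_mlups: float) -> float:
    """Multi-GPU speedup relative to the benchmarked single-GPU performance."""
    return multi_gpu_mlups / single_gpu_mlups


# Assumed D3Q19 memory traffic per cell update (not stated in this section).
BYTES_FP32 = 153
BYTES_FP16 = 77

# Instinct MI250 (1 GCD): 1638 GB/s, 5638 MLUPs/s (FP32/FP32), 9030 MLUPs/s (FP32/FP16S)
mi250_fp32 = round(100 * roofline_efficiency(5638, 1638, BYTES_FP32))
mi250_fp16s = round(100 * roofline_efficiency(9030, 1638, BYTES_FP16))

# Instinct MI250: 2 GCDs and 8 GCDs relative to 1 GCD (FP32/FP32)
mi250_2gcd = round(scaling_factor(9460, 5638), 1)
mi250_8gcd = round(scaling_factor(27350, 5638), 1)
```

With these inputs the sketch yields 53 % and 42 % for the single-GCD row and 1.7x and 4.9x for the scaled rows, which agrees with the table entries.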
From 8c25a1f6624ce42b071c505dc34d836f69b42085 Mon Sep 17 00:00:00 2001
From: Moritz Lehmann
Date: Sun, 16 Apr 2023 12:16:31 +0200
Subject: [PATCH 5/5] FluidX3D v2.6 update: patched OpenCL issues of Intel Arc GPUs: now VRAM allocations >4GB are possible and correct VRAM capacity is reported

---
 README.md | 8 +++++---
 src/info.cpp | 2 +-
 src/opencl.hpp | 21 ++++++++++++++++++---
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index 8e0a75d5..ec56d81b 100644
--- a/README.md
+++ b/README.md
@@ -55,6 +55,8 @@ The fastest and most memory efficient lattice Boltzmann CFD software, running on
 - improved raytracing framerate when camera is inside fluid
 - fixed skybox pole flickering artifacts
 - fixed bug where moving objects during re-voxelization would leave an erroneous trail of solid grid cells behind
+- v2.6 (16.04.2023)
+ - patched OpenCL issues of Intel Arc GPUs: now VRAM allocations >4GB are possible and correct VRAM capacity is reported
@@ -66,8 +68,10 @@ The fastest and most memory efficient lattice Boltzmann CFD software, running on
 - streaming (part 2/2)

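$$f_0^\textrm{temp}(\vec{x},\\ t)=f_0(\vec{x},\\ t)$$
$$f_i^\textrm{temp}(\vec{x},\\ t)=f_{(t\\%2\\ ?\\ i\\ :\\ (i\\%2\\ ?\\ i+1\\ :\\ i-1))}(i\\%2\\ ?\\ \vec{x}\\ :\\ \vec{x}-\vec{e}_i,\\ t)\quad \textrm{for}\quad i\in[1,q-1]$$

 - collision

$$\rho(\vec{x},\\ t)=\left(\sum_i f_i^\textrm{temp}(\vec{x},\\ t)\right)+1$$

$$\vec{u}(\vec{x},\\ t)=\frac{1}{\rho(\vec{x},\\ t)}\sum_i \vec{c}_i\\ f_i^\textrm{temp}(\vec{x},\\ t)$$

$$f_i^\textrm{eq-shifted}(\vec{x},\\ t)=w_i\\ \rho\cdot\left(\frac{(\vec{u}\circ\vec{c}_i)^2}{2c^4}-\frac{\vec{u}\circ\vec{u}}{2c^2}+\frac{\vec{u}\circ\vec{c}_i}{c^2}\right)+w_i\\ (\rho-1)$$

$$f_i^\textrm{temp}(\vec{x},\\ t+\Delta t)=f_i^\textrm{temp}(\vec{x},\\ t)+\Omega_i(f_i^\textrm{temp}(\vec{x},\\ t),\\ f_i^\textrm{eq-shifted}(\vec{x},\\ t),\\ \tau)$$

 - streaming (part 1/2)

$$f_0(\vec{x},\\ t+\Delta t)=f_0^\textrm{temp}(\vec{x},\\ t+\Delta t)$$
$$f_{(t\\%2\\ ?\\ (i\\%2\\ ?\\ i+1\\ :\\ i-1)\\ :\\ i)}(i\\%2\\ ?\\ \vec{x}+\vec{e}_i\\ :\\ \vec{x},\\ t+\Delta t)=f_i^\textrm{temp}(\vec{x},\\ t+\Delta t)\quad \textrm{for}\quad i\in[1,q-1]$$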

+ - velocity sets: D2Q9, D3Q15, D3Q19 (default), D3Q27
+ - collision operators: single-relaxation-time (SRT/BGK) (default), two-relaxation-time (TRT)
-
+
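The collision step listed above (density, velocity, shifted equilibrium, relaxation) can be sketched for a single lattice point in plain Python. This is an illustration under stated assumptions, not the FluidX3D kernel: it uses the D2Q9 velocity set with an ordering chosen so that odd `i` and `i+1` are opposite directions, takes `c` to be the lattice speed of sound with `c^2 = 1/3`, and uses the SRT operator `Ω_i = -(f_i - f_i^eq)/τ` with `Δt = 1`. All function names are hypothetical.

```python
# D2Q9 velocity set (assumed ordering: odd i and i+1 point in opposite directions)
C = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, -1), (1, -1), (-1, 1)]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4  # lattice weights w_i, summing to 1
C2 = 1 / 3  # c^2, square of the lattice speed of sound (assumed meaning of c)


def moments(f):
    """Density and velocity from shifted DDFs (the shift stores f_i - w_i)."""
    rho = sum(f) + 1.0
    ux = sum(cx * fi for (cx, _), fi in zip(C, f)) / rho
    uy = sum(cy * fi for (_, cy), fi in zip(C, f)) / rho
    return rho, ux, uy


def feq_shifted(rho, ux, uy):
    """Shifted equilibrium DDFs, term by term as in the equation above."""
    uu = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(C, W):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (cu * cu / (2 * C2 * C2) - uu / (2 * C2) + cu / C2) + w * (rho - 1.0))
    return feq


def collide_srt(f, tau):
    """Single-relaxation-time (SRT/BGK) collision: relax f towards feq with time tau."""
    rho, ux, uy = moments(f)
    feq = feq_shifted(rho, ux, uy)
    return [fi - (fi - fe) / tau for fi, fe in zip(f, feq)]


def momentum(f):
    """Momentum density (sum of c_i * f_i) for checking conservation."""
    return (sum(cx * fi for (cx, _), fi in zip(C, f)),
            sum(cy * fi for (_, cy), fi in zip(C, f)))


# Fluid at rest with rho = 1 is stored as all zeros in the shifted representation.
rest = collide_srt([0.0] * 9, tau=0.8)

# An arbitrary small perturbation: collision must conserve mass and momentum.
f_in = [0.01, 0.02, -0.01, 0.0, 0.005, 0.0, 0.003, -0.002, 0.001]
f_out = collide_srt(f_in, tau=0.8)
```

The shifted representation keeps the stored values close to zero, which is the property DDF-shifting exploits to reduce round-off error; the sketch shows that density and momentum are unchanged by the collision.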