The average strategy is the average of strategies followed in each round,
for all $I \in \mathcal{I}, a \in A(I)$

$$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
 \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$

That is, each round's strategy at $I$ is weighted by the probability of player $i$ reaching $I$ under that round's strategy.

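To make the weighting concrete, here is a minimal sketch with made-up numbers (two rounds, one information set, two actions); the names are hypothetical and not part of the implementation below:

```python
# strategies sigma^t(I) played at an information set I in rounds t = 1, 2,
# and the player's reach probabilities pi_i^{sigma^t}(I) for those rounds
strategies = [{'bet': 1.0, 'pass': 0.0}, {'bet': 0.2, 'pass': 0.8}]
reach = [0.5, 1.0]

# average strategy: reach-probability-weighted average over the rounds
normalizer = sum(reach)
average_strategy = {a: sum(p * s[a] for p, s in zip(reach, strategies)) / normalizer
                    for a in strategies[0]}
print(average_strategy)  # approximately {'bet': 0.467, 'pass': 0.533}
```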

### Counterfactual regret

**Counterfactual value** $\textcolor{pink}{v_i(\sigma, I)}$ is the expected utility for player $i$
if player $i$ tried to reach $I$ (took the actions leading to $I$ with a probability of $1$).

$$\textcolor{pink}{v_i(\sigma, I)} = \sum_{z \in Z_I} \pi^\sigma_{-i}(z[I]) \pi^\sigma(z[I], z) u_i(z)$$

where $Z_I$ is the set of terminal histories reachable from $I$,
and $z[I]$ is the prefix of $z$ up to $I$.

The **immediate counterfactual regret** of not taking action $a$ at information set $I$ is

$$R^T_{i,imm}(I) = \frac{1}{T} \sum_{t=1}^T
\Big(
 \textcolor{pink}{v_i(\sigma^t |_{I \rightarrow a}, I)} - \textcolor{pink}{v_i(\sigma^t, I)}
\Big)$$

where $\sigma |_{I \rightarrow a}$ is the strategy profile $\sigma$ with the modification
that action $a$ is always taken at information set $I$.

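As a toy illustration of these two quantities (a single decision point where each action leads straight to a terminal history; all numbers are made up), the counterfactual values and the per-round regret terms can be computed like this:

```python
# opponent's probability of reaching I, i.e. pi_{-i}(z[I]) for both terminals
pi_neg_i = 0.5
# u_i(z) for the terminal history reached by each action at I
action_utility = {'bet': 1.0, 'pass': -1.0}
# current strategy sigma^t(I)
sigma = {'bet': 0.25, 'pass': 0.75}

# v_i(sigma|_{I->a}, I): counterfactual value if player i always picks action a at I
v_a = {a: pi_neg_i * u for a, u in action_utility.items()}
# v_i(sigma, I): counterfactual value under the current strategy
v = sum(sigma[a] * v_a[a] for a in sigma)
# regret of not taking each action in this round
regret = {a: v_a[a] - v for a in sigma}
print(v_a, v, regret)  # {'bet': 0.5, 'pass': -0.5} -0.25 {'bet': 0.75, 'pass': -0.25}
```
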
The strategy is calculated using regret matching.

The regret for each information set and action pair $\textcolor{orange}{R^T_i(I, a)}$ is maintained,

\begin{align}
\textcolor{coral}{r^t_i(I, a)} &=
 \textcolor{pink}{v_i(\sigma^t |_{I \rightarrow a}, I)} - \textcolor{pink}{v_i(\sigma^t, I)}
\\
\textcolor{orange}{R^T_i(I, a)} &=
 \frac{1}{T} \sum_{t=1}^T \textcolor{coral}{r^t_i(I, a)}
\end{align}

and the strategy is calculated with regret matching,

\begin{align}
\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)} =
\begin{cases}
\frac{\textcolor{orange}{R^{T,+}_i(I, a)}}{\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')}},
& \text{if } \sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')} \gt 0 \\
\frac{1}{\lvert A(I) \rvert},
& \text{otherwise}
\end{cases}
\end{align}

where $\textcolor{orange}{R^{T,+}_i(I, a)} = \max \Big(\textcolor{orange}{R^T_i(I, a)}, 0 \Big)$

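For example, regret matching plays each action in proportion to its positive cumulative regret, and falls back to a uniform strategy when no regret is positive. A minimal sketch with made-up regrets (hypothetical names, not the implementation below):

```python
def regret_matching(cumulative_regret):
    # R^{T,+}: clip the cumulative regrets at zero
    positive = {a: max(r, 0.0) for a, r in cumulative_regret.items()}
    total = sum(positive.values())
    if total > 0:
        # play each action in proportion to its positive regret
        return {a: r / total for a, r in positive.items()}
    # otherwise play uniformly at random
    return {a: 1.0 / len(positive) for a in positive}

print(regret_matching({'bet': 0.75, 'pass': -0.25}))  # {'bet': 1.0, 'pass': 0.0}
print(regret_matching({'bet': -1.0, 'pass': -2.0}))   # {'bet': 0.5, 'pass': 0.5}
```
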
The paper proves that the overall regret is bounded by the sum of the positive immediate
counterfactual regrets, so minimizing the immediate counterfactual regret at each
information set minimizes the overall regret.

### Monte Carlo CFR (MCCFR)

Computing $\textcolor{coral}{r^t_i(I, a)}$ requires expanding the full game tree
on each iteration.

The paper suggests sampling only a part of the game tree on each iteration.
The set of terminal histories $Z$ is divided into blocks $Q_j$ that together cover $Z$,
and on each iteration a block is sampled with probability $q_j$.
$q(z) = \sum_{j : z \in Q_j} q_j$ is the probability of sampling a block that contains
the terminal history $z$.

Then we get the **sampled counterfactual value** for block $j$,

$$\textcolor{pink}{\tilde{v}(\sigma, I|j)} =
 \sum_{z \in Q_j} \frac{1}{q(z)}
 \pi^\sigma_{-i}(z[I]) \pi^\sigma(z[I], z) u_i(z)$$

The paper shows that

$$\mathbb{E}_{j \sim q_j} \Big[ \textcolor{pink}{\tilde{v}(\sigma, I|j)} \Big]
 = \textcolor{pink}{v_i(\sigma, I)}$$

with a simple proof.
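
The proof is essentially a change in the order of summation. Writing
$w_z = \pi^\sigma_{-i}(z[I]) \pi^\sigma(z[I], z) u_i(z)$ for the per-terminal summand,

$$\mathbb{E}_{j \sim q_j} \Big[ \textcolor{pink}{\tilde{v}(\sigma, I|j)} \Big]
 = \sum_j q_j \sum_{z \in Q_j} \frac{w_z}{q(z)}
 = \sum_{z} \frac{w_z}{q(z)} \sum_{j : z \in Q_j} q_j
 = \sum_{z} w_z
 = \textcolor{pink}{v_i(\sigma, I)}$$

We can also check the claim numerically. Here is a small self-contained sketch with made-up
per-terminal weights $w_z$ and two blocks (all names and numbers are hypothetical, not taken
from the Kuhn Poker implementation below):

```python
import random

# per-terminal weights w_z = pi_{-i}(z[I]) * pi(z[I], z) * u_i(z)
w = {'z1': 0.3, 'z2': -0.1, 'z3': 0.4, 'z4': -0.2}
true_value = sum(w.values())

# two blocks covering the terminal histories, sampled with probabilities q_j;
# the blocks are disjoint here, so q(z) is simply q_j for the block containing z
blocks = {'Q1': ('z1', 'z2'), 'Q2': ('z3', 'z4')}
q = {'Q1': 0.7, 'Q2': 0.3}

def sampled_value():
    # sample a block j with probability q_j and compute the sampled value
    j = random.choices(list(blocks), weights=[q[k] for k in blocks])[0]
    return sum(w[z] / q[j] for z in blocks[j])

n = 100_000
estimate = sum(sampled_value() for _ in range(n)) / n
print(true_value, round(estimate, 3))  # the estimate converges to the true value
```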

Therefore we can sample a part of the game tree and calculate the regrets.
We calculate an estimate of the regrets,

$$
\textcolor{coral}{\tilde{r}^t_i(I, a)} =
 \textcolor{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} - \textcolor{pink}{\tilde{v}_i(\sigma^t, I)}
$$

and use these estimates to update $\textcolor{orange}{R^T_i(I, a)}$ and calculate
the strategy $\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)}$ on each iteration.
Finally, we calculate the overall average strategy $\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)}$.

Here is a [Kuhn Poker](kuhn/index.html) implementation where we try out CFR.

```python
class InfoSet:
    # Total regret of not taking each action $A(I_i)$,
    #
    # \begin{align}
    # \textcolor{coral}{\tilde{r}^t_i(I, a)} &=
    # \textcolor{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} -
    # \textcolor{pink}{\tilde{v}_i(\sigma^t, I)}
    # \\
    # \textcolor{orange}{R^T_i(I, a)} &=
    # \frac{1}{T} \sum_{t=1}^T \textcolor{coral}{\tilde{r}^t_i(I, a)}
    # \end{align}
    #
    # We maintain $T \textcolor{orange}{R^T_i(I, a)}$ instead of $\textcolor{orange}{R^T_i(I, a)}$
    # since the $\frac{1}{T}$ term cancels out anyway when computing the strategy
    # $\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)}$
    regret: Dict[Action, float]
    # We maintain the cumulative strategy
    # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}$$
    # to compute the overall average strategy
    #
    # $$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
    # \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}}{\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
    cumulative_strategy: Dict[Action, float]

    def __init__(self, key: str):
        ...
```

```python
def calculate_strategy(self):
    """
    Calculate current strategy using [regret matching](#RegretMatching).

    \begin{align}
    \textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)} =
    \begin{cases}
    \frac{\textcolor{orange}{R^{T,+}_i(I, a)}}{\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')}},
    & \text{if } \sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')} \gt 0 \\
    \frac{1}{\lvert A(I) \rvert},
    & \text{otherwise}
    \end{cases}
    \end{align}

    where $\textcolor{orange}{R^{T,+}_i(I, a)} = \max \Big(\textcolor{orange}{R^T_i(I, a)}, 0 \Big)$
    """
    # $$\textcolor{orange}{R^{T,+}_i(I, a)} = \max \Big(\textcolor{orange}{R^T_i(I, a)}, 0 \Big)$$
    regret = {a: max(r, 0) for a, r in self.regret.items()}
    # $$\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')}$$
    regret_sum = sum(regret.values())
    # If $\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')} \gt 0$,
    if regret_sum > 0:
        # $$\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)} =
        # \frac{\textcolor{orange}{R^{T,+}_i(I, a)}}{\sum_{a'\in A(I)}\textcolor{orange}{R^{T,+}_i(I, a')}}$$
        self.strategy = {a: r / regret_sum for a, r in regret.items()}
    # Otherwise,
    else:
        # $\lvert A(I) \rvert$
        count = len(list(a for a in self.regret))
        # $$\textcolor{lightgreen}{\sigma_i^{T+1}(I)(a)} =
        # \frac{1}{\lvert A(I) \rvert}$$
        self.strategy = {a: 1 / count for a, r in regret.items()}

def get_average_strategy(self):
    """
    ## Get average strategy

    $$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
     \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}}
     {\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
    """
    # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I) \textcolor{lightgreen}{\sigma^t(I)(a)}$$
    cum_strategy = {a: self.cumulative_strategy.get(a, 0.) for a in self.actions()}
    # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I) =
    # \sum_{a \in A(I)} \sum_{t=1}^T
    # \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}$$
    strategy_sum = sum(cum_strategy.values())
    # If $\sum_{t=1}^T \pi_i^{\sigma^t}(I) > 0$,
    if strategy_sum > 0:
        # $$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
        # \frac{\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}}
        # {\sum_{t=1}^T \pi_i^{\sigma^t}(I)}$$
        return {a: s / strategy_sum for a, s in cum_strategy.items()}
    # Otherwise,
    else:
        # $\lvert A(I) \rvert$
        count = len(list(a for a in cum_strategy))
        # $$\textcolor{cyan}{\bar{\sigma}^T_i(I)(a)} =
        # \frac{1}{\lvert A(I) \rvert}$$
        return {a: 1 / count for a, r in cum_strategy.items()}
```

```python
def walk_tree(self, h: History, i: Player, pi_i: float, pi_neg_i: float) -> float:
    """
    Returns the expected utility for the sub-game starting at history $h$,
    $$\sum_{z \in Z_h} \pi^\sigma(h, z) u_i(z)$$
    where $Z_h$ is the set of terminal histories with prefix $h$.

    While walking the tree it updates the total regrets $\textcolor{orange}{R^T_i(I, a)}$.
    """

    # If it's a terminal history $h \in Z$ return the terminal utility $u_i(h)$.
    # ... (the recursive walk over the actions at $h$, which looks up the information set `I`
    # and computes the per-action values `va` and the expected value `v`, is omitted here) ...

    # If the current player is $i$,
    # update the cumulative strategies and total regrets
    if h.player() == i:
        # Update cumulative strategies
        # $$\sum_{t=1}^T \pi_i^{\sigma^t}(I)\textcolor{lightgreen}{\sigma^t(I)(a)}
        # = \sum_{t=1}^T \Big[ \sum_{h \in I} \pi_i^{\sigma^t}(h)
        # \textcolor{lightgreen}{\sigma^t(I)(a)} \Big]$$
        for a in I.actions():
            I.cumulative_strategy[a] = I.cumulative_strategy[a] + pi_i * I.strategy[a]
        # \begin{align}
        # \textcolor{coral}{\tilde{r}^t_i(I, a)} &=
        # \textcolor{pink}{\tilde{v}_i(\sigma^t |_{I \rightarrow a}, I)} -
        # \textcolor{pink}{\tilde{v}_i(\sigma^t, I)} \\
        # &=
        # \pi^{\sigma^t}_{-i} (h) \Big(
        # \sum_{z \in Z_h} \pi^{\sigma^t |_{I \rightarrow a}}(h, z) u_i(z) -
        # \sum_{z \in Z_h} \pi^\sigma(h, z) u_i(z)
        # \Big) \\
        # T \textcolor{orange}{R^T_i(I, a)} &=
        # \sum_{t=1}^T \textcolor{coral}{\tilde{r}^t_i(I, a)}
        # \end{align}
        for a in I.actions():
            I.regret[a] += pi_neg_i * (va[a] - v)

        # Update the strategy $\textcolor{lightgreen}{\sigma^t(I)(a)}$
        I.calculate_strategy()

    # Return the expected utility for player $i$,
    # $$\sum_{z \in Z_h} \pi^\sigma(h, z) u_i(z)$$
    return v

def iterate(self):
    """
    ### Iteratively update $\textcolor{lightgreen}{\sigma^t(I)(a)}$

    This updates the strategies for $T$ iterations.
    """
```