Figure 1
Warm-up
Stage 1 trains a feasible policy network on simulated rollouts. Backpropagation through time (BPTT) through the same rollout later provides costate information.
PG-DPO
Figure 2
Stage 2 uses BPTT to produce pathwise estimates $\hat{\lambda}^{\mathrm{pw}}_{t_0}$, whose average over simulated paths gives $\hat{\lambda}_{t_0}$ at the Merton point $(t_0,x_0)$.
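The averaging step can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions that are not stated on this page: a constant-fraction policy `pi` stands in for the trained policy network, wealth follows an Euler discretisation of the Merton dynamics, terminal utility is CRRA with $U'(x)=x^{-\gamma}$, and the backward pass is written by hand rather than produced by an autodiff library. The function name and all parameter values are hypothetical.

```python
import math
import random


def pathwise_costates(x0, pi, mu, r, sigma, gamma, T, n_steps, n_paths, seed=0):
    """Return one estimate of dU(X_T)/dx0 per simulated path.

    Assumptions (illustrative only): constant risky fraction `pi`,
    Euler steps X_{k+1} = X_k * g_k, and CRRA marginal utility x**(-gamma).
    """
    rng = random.Random(seed)
    dt = T / n_steps
    estimates = []
    for _ in range(n_paths):
        # Forward rollout: store each step's factor g_k = dX_{k+1}/dX_k.
        x = x0
        factors = []
        for _ in range(n_steps):
            dw = rng.gauss(0.0, math.sqrt(dt))
            g = 1.0 + (r + pi * (mu - r)) * dt + pi * sigma * dw
            factors.append(g)
            x *= g
        # Backward pass: start from U'(X_T) and apply the chain rule
        # in reverse time, which is what BPTT does on this rollout.
        adjoint = x ** (-gamma)
        for g in reversed(factors):
            adjoint *= g
        estimates.append(adjoint)
    return estimates


# Hypothetical parameter values, chosen only for illustration.
pw = pathwise_costates(x0=1.0, pi=0.5, mu=0.08, r=0.02, sigma=0.2,
                       gamma=2.0, T=1.0, n_steps=50, n_paths=2000)
lambda_hat = sum(pw) / len(pw)  # Monte Carlo average of the pathwise estimates
```

Each entry of `pw` plays the role of one $\hat{\lambda}^{\mathrm{pw}}_{t_0}$, and `lambda_hat` plays the role of $\hat{\lambda}_{t_0}$. In practice an autodiff framework would generate the backward pass from the policy network's rollout.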
Figure 3
As $\hat{\lambda}_{t_0}$ improves, $\hat{\pi}$ moves toward the Hamiltonian critical point.
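For reference, the critical point can be written out in the textbook Merton setting. The derivation below is a sketch under assumptions not stated on this page: no consumption, wealth dynamics $dX_t = X_t\big[(r + \pi_t(\mu - r))\,dt + \pi_t\sigma\,dW_t\big]$ with constant $\mu$, $r$, $\sigma$, and an estimate $\partial_x\hat{\lambda}_{t_0}$ of the costate's derivative in wealth. Keeping only the terms that depend on $\pi$, the Hamiltonian at $(t_0,x_0)$ is

$$
\mathcal{H}(\pi) = \hat{\lambda}_{t_0}\,x_0\big(r + \pi(\mu - r)\big) + \tfrac{1}{2}\,\partial_x\hat{\lambda}_{t_0}\,\sigma^2\pi^2 x_0^2 ,
$$

and setting $\partial\mathcal{H}/\partial\pi = 0$ gives

$$
\hat{\pi} = -\,\frac{\hat{\lambda}_{t_0}}{x_0\,\partial_x\hat{\lambda}_{t_0}}\cdot\frac{\mu - r}{\sigma^2}.
$$

If utility is CRRA with risk aversion $\gamma$, the exact costate is proportional to $x^{-\gamma}$, so $\hat{\lambda}_{t_0}/(x_0\,\partial_x\hat{\lambda}_{t_0}) \to -1/\gamma$ and $\hat{\pi}$ approaches the classical Merton fraction $(\mu - r)/(\gamma\sigma^2)$.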
Scope. This page shows the idea at one Merton point $(t_0,x_0)$. In the full PG-DPO method, the same costate-to-control step is learned over a Markov state domain.
Extension. Beyond this Merton example, the PG-DPO idea can be extended to constrained control, non-Markovian dynamics, non-exponential discounting, transaction costs, and other continuous-time decision problems.