PG-DPO

Stage 1 Warm-up

Stage 2 Costate estimation and control recovery

Figure 1

Warm-up

Stage 1 trains a feasible policy network on simulated wealth paths. Backpropagation through time (BPTT) through these same rollouts later provides the costate information used in Stage 2.

Figure 2

Costate estimation

Stage 2 uses BPTT to produce pathwise estimates $\hat{\lambda}^{\mathrm{pw}}_{t_0}$, and their average over simulated paths gives $\hat{\lambda}_{t_0}$ at the Merton point $(t_0,x_0)$.
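The pathwise estimate can be illustrated with a small hand-written forward and backward pass. This is only a sketch under stated assumptions, not the PG-DPO implementation: the policy is a constant risky weight rather than a network, consumption is set to $c_t=0$, the wealth dynamics are discretized with an Euler-Maruyama scheme, and the parameter values (`mu`, `r`, `sigma`, `gamma`, horizon, step and path counts) are hypothetical. The backward loop plays the role of BPTT by applying the chain rule step by step from $U'(X_T)=X_T^{-\gamma}$ back to $x_0$.

```python
import math
import random


def pathwise_costates(x0, pi, mu, r, sigma, gamma, horizon, n_steps, n_paths, seed=0):
    """Per-path estimates of dU(X_T)/dx0 under a constant risky weight pi (c_t = 0)."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    estimates = []
    for _ in range(n_paths):
        # Forward pass: Euler-Maruyama rollout, storing each step's growth factor.
        x = x0
        factors = []
        for _ in range(n_steps):
            dw = rng.gauss(0.0, math.sqrt(dt))
            factor = 1.0 + (r + pi * (mu - r)) * dt + pi * sigma * dw
            factors.append(factor)
            x *= factor
        # Backward pass: start from U'(X_T) = X_T^(-gamma) and walk back in time.
        adjoint = x ** (-gamma)
        for factor in reversed(factors):
            adjoint *= factor  # d x_{k+1} / d x_k equals the stored growth factor
        estimates.append(adjoint)
    return estimates


# Hypothetical parameters, chosen only for illustration.
estimates = pathwise_costates(
    x0=1.0, pi=0.28, mu=0.08, r=0.02, sigma=0.2, gamma=2.0,
    horizon=1.0, n_steps=50, n_paths=2000,
)
lam_hat = sum(estimates) / len(estimates)  # Monte Carlo average of the pathwise estimates
print(lam_hat)
```

Each entry of `estimates` corresponds to one $\hat{\lambda}^{\mathrm{pw}}_{t_0}$, and `lam_hat` to their average. In practice an automatic differentiation library would replace the manual backward loop.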

Figure 3

Control recovery

Plot: risky weight $\hat{\pi}$, starting at $\hat{\pi}=0.280$, with the Hamiltonian critical point at $0.750$.

As $\hat{\lambda}_{t_0}$ improves, $\hat{\pi}$ moves toward the Hamiltonian critical point.

Model and objective

$$dX_t = X_t\bigl(r + \pi_t(\mu-r)\bigr)dt + X_t\pi_t\sigma\,dW_t - c_t\,dt$$

$$U(x)=\frac{x^{1-\gamma}}{1-\gamma},\qquad \max_{(\pi,c)} \mathbb{E}\left[U(X_T)\right]$$

Hamiltonian

$$H(x,\pi,\lambda)=\lambda x\bigl(r+\pi(\mu-r)\bigr)-\frac{1}{2}\gamma\lambda x\sigma^2\pi^2$$
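As a worked step, the critical point in $\pi$ follows from the first-order condition of this Hamiltonian, assuming $\lambda>0$ and $x>0$:

$$\partial_\pi H=\lambda x(\mu-r)-\gamma\lambda x\sigma^2\pi=0\quad\Longrightarrow\quad \pi^\star=\frac{\mu-r}{\gamma\sigma^2}$$

Since $\partial_\pi^2 H=-\gamma\lambda x\sigma^2<0$ under the same assumptions, this critical point is a maximum. The quadratic term appears to correspond to the CRRA relation $x\,\partial_x\lambda=-\gamma\lambda$.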

Recovered control

$$\hat{\pi}_t=-\frac{\hat{\lambda}_t}{X_t\,\partial_x\hat{\lambda}_t}\,\frac{\mu-r}{\sigma^2}$$
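The recovery formula can be checked numerically with a short sketch. The assumptions are illustrative and not taken from the page: the costate is given the CRRA shape $\lambda(x)=x^{-\gamma}$, the parameter values are hypothetical, and $\partial_x\hat{\lambda}$ is approximated by a central finite difference as a simple stand-in for however the full method obtains that derivative. The helper names `recover_risky_weight` and `central_difference` are likewise made up for this example.

```python
def recover_risky_weight(x, lam, dlam_dx, mu, r, sigma):
    """Evaluate pi = -(lam / (x * dlam_dx)) * (mu - r) / sigma**2."""
    return -(lam / (x * dlam_dx)) * (mu - r) / sigma ** 2


def central_difference(f, x, h=1e-4):
    """Approximate f'(x) with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# Hypothetical parameters, chosen only for illustration.
mu, r, sigma, gamma = 0.08, 0.02, 0.2, 2.0
x0 = 1.0


def crra_costate(x):
    """Assumed costate shape lambda(x) = x^(-gamma), up to a positive constant."""
    return x ** (-gamma)


# Using the analytic derivative of the assumed costate.
pi_exact = recover_risky_weight(
    x0, crra_costate(x0), -gamma * x0 ** (-gamma - 1.0), mu, r, sigma
)

# Using a finite-difference estimate of the derivative instead.
pi_fd = recover_risky_weight(
    x0, crra_costate(x0), central_difference(crra_costate, x0), mu, r, sigma
)

print(pi_exact, pi_fd)  # both close to (mu - r) / (gamma * sigma**2)
```

With an exact CRRA-shaped costate the recovered weight reduces to $(\mu-r)/(\gamma\sigma^2)$ regardless of $x$, which is why the quality of $\hat{\pi}$ depends on how well $\hat{\lambda}$ and its derivative are estimated.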

Scope. This page shows the idea at one Merton point $(t_0,x_0)$. In the full PG-DPO method, the same costate-to-control step is learned over a Markov state domain.

Extension. Beyond this Merton example, the PG-DPO idea can be extended to constrained control, non-Markovian dynamics, non-exponential discounting, transaction costs, and other continuous-time decision problems.