From Scratch PPO Implementation

Follow the full discussion on Reddit.
For the past 5 months I've been working on a from-scratch PPO implementation. I'm doing most of the work from scratch, except for numerical computation libraries such as numpy. It started with supervised learning networks and has grown into this, and I just can't seem to get it working. Every paper I read is either (a) outdated/incorrect or (b) incomplete; no paper gives a full description of what the authors actually do and which hyperparameters they use. I tried reading the SB3 code, but it's structured too differently from my implementation and spread across so many files that I can't find the little nitty-gritty details. So I'm just going to post my backward method, and if someone is willing to read it and point out mistakes or make recommendations, that would be great!

Side notes: I wrote the optimizer myself (standard gradient descent), and the critic takes only the state as input. I'm not using GAE, as I'm trying to minimize potential failure points. All the hyperparameters are standard values.
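Since the backward method itself didn't make it into the post, here is a minimal numpy sketch of the two pieces being described: plain Monte Carlo returns used in place of GAE, and the clipped surrogate loss from the PPO paper together with its gradient with respect to the new log-probabilities (the quantity you would backpropagate through the policy network). This is not the poster's code; the function names (`discounted_returns`, `ppo_clip_loss_and_grad`) and the defaults `gamma=0.99` / `clip_eps=0.2` are illustrative assumptions taken from the standard PPO formulation.

```python
import numpy as np

def discounted_returns(rewards, dones, gamma=0.99):
    """Monte Carlo returns (no GAE): G_t = r_t + gamma * G_{t+1}, reset at episode ends."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * (1.0 - dones[t]) * running
        returns[t] = running
    return returns

def ppo_clip_loss_and_grad(new_logp, old_logp, advantages, clip_eps=0.2):
    """
    Clipped surrogate objective:
        L = -mean(min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)),  r_t = exp(new_logp - old_logp)
    Returns the scalar loss and dL/d(new_logp) for backprop through the policy net.
    """
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    loss = -np.mean(np.minimum(unclipped, clipped))
    # Gradient flows only where the unclipped branch is the active minimum;
    # where the clipped branch wins, the term is constant in new_logp and the gradient is zero.
    take_unclipped = unclipped <= clipped
    dloss_dlogp = np.where(take_unclipped, -ratio * advantages, 0.0) / len(ratio)
    return loss, dloss_dlogp
```

With a state-only critic and no GAE, a common choice is `advantages = returns - values` (often normalized per batch), with the value head regressed toward the same returns via an MSE loss; whether that matches the poster's setup isn't stated in the post.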

Comments

There's unfortunately not much to read here yet...
