From Scratch PPO Implementation

Follow the full discussion on Reddit.
For the past 5 months I've been working on a from-scratch PPO implementation. I'm doing most of the work from scratch, except for numerical computation libraries such as numpy. It started with supervised learning networks and has grown into this, and I just can't seem to get it working. Every paper I read is either (a) outdated/incorrect or (b) incomplete; no paper gives a full description of what the authors actually do and which hyperparameters they use. I tried reading the SB3 code, but it's structured too differently from my implementation and spread across so many files that I can't find the little nitty-gritty details. So I'm just going to post my backward method, and if someone is willing to read it and point out mistakes or make recommendations, that would be great!

Side notes: I wrote the optimizer myself (standard gradient descent), and the critic takes only the state as input. I'm not using GAE, as I'm trying to minimize potential failure points. All the hyperparameters are standard values.
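Since the backward method itself didn't make it into the post, here is a minimal numpy sketch of the two pieces being described: plain Monte Carlo returns used in place of GAE, and the clipped surrogate loss from the PPO paper together with its gradient with respect to the new log-probabilities (the quantity you would backpropagate through the policy network). This is not the poster's code; the function names (`discounted_returns`, `ppo_clip_loss_and_grad`) and the defaults `gamma=0.99` / `clip_eps=0.2` are illustrative assumptions taken from the standard PPO formulation.

```python
import numpy as np

def discounted_returns(rewards, dones, gamma=0.99):
    """Monte Carlo returns (no GAE): G_t = r_t + gamma * G_{t+1}, reset at episode ends."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * (1.0 - dones[t]) * running
        returns[t] = running
    return returns

def ppo_clip_loss_and_grad(new_logp, old_logp, advantages, clip_eps=0.2):
    """
    Clipped surrogate objective:
        L = -mean(min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)),  r_t = exp(new_logp - old_logp)
    Returns the scalar loss and dL/d(new_logp) for backprop through the policy net.
    """
    ratio = np.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    loss = -np.mean(np.minimum(unclipped, clipped))
    # Gradient flows only where the unclipped branch is the active minimum;
    # where the clipped branch wins, the term is constant in new_logp and the gradient is zero.
    take_unclipped = unclipped <= clipped
    dloss_dlogp = np.where(take_unclipped, -ratio * advantages, 0.0) / len(ratio)
    return loss, dloss_dlogp
```

With a state-only critic and no GAE, a common choice is `advantages = returns - values` (often normalized per batch), with the value head regressed toward the same returns via an MSE loss; whether that matches the poster's setup isn't stated in the post.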

Comments

There's unfortunately not much to read here yet...
