The Stack: 3 TB of permissively licensed source code - Hugging Face and ServiceNow Research Denis Kocetkov et al 2022

Follow the full discussion on Reddit.
ServiceNow and Hugging Face have released a 3.1TB dataset of permissively licensed code in 30 programming languages. This is about 4x larger than the dataset used to train GPT-3 (though obviously ‘code only’), and 3x the size of CodeParrot, the next largest released code dataset.

Visit Website

Discover the Best of Machine Learning.

Ever having issues keeping up with everything that's going on in Machine Learning? That's where we help. We're sending out a weekly digest, highlighting the Best of Machine Learning.

The Stack: 3 TB of permissively licensed source code - Hugging Face and ServiceNow Research Denis Kocetkov et al 2022

Comments

Discover the Best of Machine Learning.