[R] Corpus differences problem

Follow the full discussion on Reddit.
I am working on a keyword generation problem. I have categories (classes), and each category has a number of data points (raw text). I have to generate keywords for each category, and those keywords will then be used to classify new data points into the right category.

The end goal is to be able to generate relevant keywords starting from only 5 data points per category. Naturally, depending on which 5 random data points are picked, the result will be more or less accurate and generalizable. Top management is asking whether there is any metric that can assess the relevance of a specific set of 5 data points, which seems infeasible if you can't see the whole data.

What I did was take the 5 data points that perform the best and the 5 that perform the worst, then try to find any differences between them. I'm a little stuck and wonder if you have any ideas 💡? Thank you 🙏🏼
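For anyone who wants to reproduce the best-versus-worst comparison, here is a minimal sketch of one way it could be set up. Everything in it is an assumption rather than my actual pipeline: keywords are simply the tokens that appear in the most seed documents, classification is by keyword overlap, and all function names (`top_keywords`, `classify`, `score_seed`, `sample_seed_scores`) are made up for illustration. It also assumes you have a labeled pool with more than 5 points per category to score against.

```python
import random
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real preprocessing."""
    return text.lower().split()

def top_keywords(docs, k=10):
    """Return the k tokens found in the most documents (ties broken alphabetically)."""
    doc_freq = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok for tok, _ in ranked[:k]}

def classify(text, keywords_by_cat):
    """Assign the category whose keyword set overlaps most with the text."""
    tokens = set(tokenize(text))
    # Iterate over sorted category names so ties resolve deterministically.
    return max(sorted(keywords_by_cat), key=lambda c: len(tokens & keywords_by_cat[c]))

def score_seed(seed_by_cat, heldout, k=10):
    """Accuracy on held-out (text, category) pairs using keywords from the seed docs."""
    keywords = {cat: top_keywords(docs, k) for cat, docs in seed_by_cat.items()}
    hits = sum(classify(text, keywords) == cat for text, cat in heldout)
    return hits / len(heldout)

def sample_seed_scores(data_by_cat, n_seed=5, n_trials=100, k=10, rng_seed=0):
    """Draw random seeds of n_seed points per category and score each on the rest.

    Assumes every category has more than n_seed documents, so that the
    held-out set is never empty.
    """
    rng = random.Random(rng_seed)
    results = []
    for _ in range(n_trials):
        seed, heldout = {}, []
        for cat, docs in data_by_cat.items():
            picked = set(rng.sample(range(len(docs)), n_seed))
            seed[cat] = [d for i, d in enumerate(docs) if i in picked]
            heldout += [(d, cat) for i, d in enumerate(docs) if i not in picked]
        results.append((score_seed(seed, heldout, k), seed))
    results.sort(key=lambda r: r[0])
    return results  # results[0] is the worst seed, results[-1] is the best
```

Calling `sample_seed_scores(data_by_cat)` on a dict that maps each category name to its list of texts gives the spread of accuracies across random seeds, and the first and last entries are the worst and best seeds to compare. Note that this still relies on the labeled pool, so it does not answer the question of judging 5 points in isolation.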
