Instruction Fine-Tuning with a Low-Resource Language

Follow the full discussion on Reddit.
I am trying to build a summarizer for conversations between a rule-based bot and a customer. To my disadvantage, the working language is Turkish. I have gathered 1,000 fine-tuning examples, and I also have a Turkish summarization dataset of 100k+ examples. As far as I have observed, instruction fine-tuning only yields good results if the target language is well represented in the LLM's pre-training data. Have you had similar experiences with low-resource languages? Any advice on how to tackle such issues? Also, do you know of any open-source LLM with a large amount of a low-resource language in its pre-training data?
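
One common remedy for this data split is two-stage adaptation: first fine-tune a multilingual base model on the large Turkish summarization corpus, then instruction-tune the result on the 1,000 bot-conversation examples. Below is a minimal sketch of the first stage, not the poster's actual setup: it assumes mT5 (whose mC4 pre-training corpus includes Turkish) and a hypothetical JSONL file with `dialogue`/`summary` fields.

```python
# Minimal sketch: stage-one fine-tuning of a multilingual seq2seq model
# on Turkish dialogue -> summary pairs. The file name and field names
# ("dialogue", "summary") are assumptions for illustration.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/mt5-base"  # mC4 pre-training includes Turkish
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical JSONL: one {"dialogue": ..., "summary": ...} object per line.
dataset = load_dataset("json", data_files="turkish_summaries.jsonl")["train"]

def preprocess(batch):
    # Prefix the task so the same checkpoint can later be instruction-tuned
    # on the 1,000 conversation examples with its own prompt format.
    inputs = ["summarize: " + d for d in batch["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-turkish-summarizer",
    per_device_train_batch_size=8,
    learning_rate=3e-4,   # a typical starting point for T5-family fine-tuning
    num_train_epochs=3,
    logging_steps=100,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The second stage would rerun the same loop on the 1,000 instruction examples, typically with a lower learning rate so the model retains the summarization ability gained in stage one.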
