Instruction Fine-Tuning with a Low-Resource Language

Follow the full discussion on Reddit.
I am trying to build a summarizer for conversations between a rule-based bot and a customer. To my disadvantage, the working language is Turkish. I have gathered 1,000 fine-tuning examples, and I also have a Turkish summarization dataset of 100k+ examples. As far as I have observed, instruction fine-tuning only yields good results if the target language is well represented in the LLM's pre-training data. Have you had similar experiences with low-resource languages? Any advice on how to tackle such issues? Also, do you know of any open-source LLM with a large amount of a low-resource language in its pre-training data?
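
One common remedy for this data split is two-stage adaptation: first fine-tune a multilingual base model on the large Turkish summarization corpus, then instruction-tune the result on the 1,000 bot-conversation examples. Below is a minimal sketch of the first stage, not the poster's actual setup: it assumes mT5 (whose mC4 pre-training corpus includes Turkish) and a hypothetical JSONL file with `dialogue`/`summary` fields.

```python
# Minimal sketch: stage-one fine-tuning of a multilingual seq2seq model
# on Turkish dialogue -> summary pairs. The file name and field names
# ("dialogue", "summary") are assumptions for illustration.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "google/mt5-base"  # mC4 pre-training includes Turkish
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical JSONL: one {"dialogue": ..., "summary": ...} object per line.
dataset = load_dataset("json", data_files="turkish_summaries.jsonl")["train"]

def preprocess(batch):
    # Prefix the task so the same checkpoint can later be instruction-tuned
    # on the 1,000 conversation examples with its own prompt format.
    inputs = ["summarize: " + d for d in batch["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-turkish-summarizer",
    per_device_train_batch_size=8,
    learning_rate=3e-4,   # a typical starting point for T5-family fine-tuning
    num_train_epochs=3,
    logging_steps=100,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

The second stage would rerun the same loop on the 1,000 instruction examples, typically with a lower learning rate so the model retains the summarization ability gained in stage one.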
