Dutch Medical NER with BERT - domain-specific difficulties

Follow the full discussion on Reddit.
The subject of my thesis is 'Dutch named entity recognition using BERT', which means I will have to do entity extraction on Dutch clinical notes using BERT. The main obstacle I see is that only two Dutch BERT models exist, both pre-trained on a Dutch corpus of books and news text, so my guess is that they will perform rather poorly on Dutch clinical notes. These notes are full of medical jargon, acronyms, shorthand notation, misspellings, sentence fragments and high terminological variation. I also don't know how this will interact with the WordPiece embeddings. Comments/remarks are much appreciated!
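To make the WordPiece concern concrete, here is a minimal sketch of the greedy longest-match-first splitting that BERT's WordPiece tokenizer applies, using a tiny hypothetical vocabulary (not an actual Dutch BERT vocab). Clinical terms missing from a books/news vocabulary get shattered into many short subword pieces, or fall back to `[UNK]` entirely:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split, as in BERT's WordPiece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # non-initial pieces carry the ## prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(cur)
        start = end
    return pieces

# Toy vocabulary of Dutch fragments; purely illustrative.
vocab = {"long", "##ont", "##steking", "patiënt", "##en"}

print(wordpiece_tokenize("longontsteking", vocab))  # 'pneumonia'
print(wordpiece_tokenize("patiënten", vocab))       # 'patients'
```

The more a clinical term fragments into generic subwords, the less its pieces carry domain-specific meaning from pre-training, which is part of why domain mismatch hurts downstream NER.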

