Capitalization may vary, and there may be unwanted punctuation marks such as ellipses, bullets or numbering. Good training data must meet the following criteria: You need plain text without any formatting and without anything in ALL CAPS. Longer sentences, i.e. longer than 5 words, are best. However, the sentences must not be too long, i.e. no more than 50 words. No bullets or numbering, as these disrupt the training process. No repeating characters, such as double spaces or ellipses (...). No tabs; these are common when a table of contents has been created manually in Word.
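The criteria above can be checked automatically before a TM is used for training. The following is only a minimal sketch in Python: the function names `is_clean_segment` and `filter_pairs`, the regular expressions and the exact thresholds are our own assumptions derived from the list above, not part of any MT engine's tooling.

```python
import re

# Thresholds taken from the criteria above (assumed to be inclusive limits).
MIN_WORDS = 6    # "longer than 5 words"
MAX_WORDS = 50   # "no more than 50 words"

# Leading bullet or numbering, e.g. "- ", "• ", "1. " or "2) ".
BULLET_RE = re.compile(r"^\s*(?:[-*•·▪]|\d+[.)])\s+")
# Repeating characters: double spaces, "..." or the single ellipsis character.
REPEAT_RE = re.compile(r" {2,}|\.{3,}|…")


def is_clean_segment(text: str) -> bool:
    """Return True if a segment meets the training-data criteria."""
    words = text.split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False
    if "\t" in text:                 # tabs, e.g. from a manual table of contents
        return False
    if BULLET_RE.match(text):        # bullets or numbering
        return False
    if REPEAT_RE.search(text):       # double spaces or ellipses
        return False
    letters = [c for c in text if c.isalpha()]
    if letters and all(c.isupper() for c in letters):   # everything in ALL CAPS
        return False
    return True


def filter_pairs(pairs):
    """Keep only (source, target) pairs in which both sides are clean."""
    return [(s, t) for s, t in pairs
            if is_clean_segment(s) and is_clean_segment(t)]
```

A pair is dropped as soon as one side fails a check, so the source and target segments stay aligned. The list of bullet characters is deliberately short and would need to be extended for real data.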
We are currently developing a system that we call a “washing machine”. Your TM is cleaned in this washing machine and can therefore provide better training data. More on that soon!

Include your terminology

Some MT engines, such as Google AutoML, allow you to include your terminology in the machine translation process. This is particularly important because, despite training, the engine may not know the technical terms or company-specific terminology and could, for example, translate brand names. The additional terminology process then overlays the first result of the machine translation and replaces the relevant technical terms with the desired terminology.
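To illustrate the idea of overlaying terminology on the first machine translation result, here is a minimal Python sketch. It is not how Google AutoML or any other engine implements its glossary feature. The function `apply_terminology` and the glossary entries are hypothetical, and we simply assume that the glossary maps an unwanted rendering in the MT output to the desired term.

```python
import re


def apply_terminology(mt_output: str, glossary: dict[str, str]) -> str:
    """Replace unwanted renderings in the MT output with the desired terms.

    `glossary` maps an unwanted rendering to the desired term (assumption).
    """
    if not glossary:
        return mt_output
    # Longest entries first, so multi-word terms win over their parts.
    variants = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(v) for v in variants) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {variant.lower(): term for variant, term in glossary.items()}
    # Fall back to the matched text if the lowercase lookup finds nothing.
    return pattern.sub(
        lambda m: lookup.get(m.group(0).lower(), m.group(0)), mt_output
    )


# Hypothetical glossary: a preferred technical term and a brand name
# that the engine translated although it should have been kept.
glossary = {
    "control cabinet": "switchgear cabinet",
    "Blue Angel": "Blauer Engel",
}
print(apply_terminology("Open the control cabinet of the Blue Angel unit.", glossary))
# prints: Open the switchgear cabinet of the Blauer Engel unit.
```

The word-boundary checks keep the replacement from touching longer words such as "control cabinets". Inflected forms would therefore need their own glossary entries in this simple approach.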

This means the translator has less work to do afterwards.

PEMT = MTPE = post-editing machine translation

Do we still need translators? Oh yes, absolutely, at least for most machine-translated texts. In this case we are talking about post-editing. In this step, the machine translation results are checked and corrected where necessary. The aim of the training is to reduce the post-editor's work to a minimum.

Generic, subject-specific engines for better machine translations

Don't get nervous if your corpus of existing translations isn't large enough.