Do commonly used OCR programs use something like translation memory for "prediction" of bad scans?
Thread poster: Adieu
Adieu  Identity Verified
Ukrainian to English
+ ...
Jan 25, 2022

I've recently had a few jobs in memoQ where the source segments provided in memoQ had very odd discrepancies with the source PDF.

Specifically, for grainy pdfs, it would "see" previous long-time occupants of official positions after titles, instead of the new person's name actually printed on the pdf.

Think pdf says "President Joe Biden" and memoQ sees "Bill Clinton" or "Donald Trump".

And this is in the SOURCE text segments, not translation memory suggestions.

[Edited at 2022-01-25 15:21 GMT]


 
Natalie  Identity Verified
Poland
Local time: 17:15
Member (2002)
English to Russian
+ ...

Moderator of this forum
SITE LOCALIZER
Don't use CAT tools for OCR Jan 25, 2022

Use special tools for doing OCR (like FineReader or similar), correct errors and then use the resulting file for translation. OCR software may warn you of low resolution, skewed text etc while CAT tools just use the converted text as is and pre-translate it using the existing TM, hence your "odd discrepancies".

 
Adieu  Identity Verified
Ukrainian to English
+ ...
TOPIC STARTER
I have no idea or control over what they use Jan 25, 2022

Uploads on the memoQ server are managed by the client.

Natalie wrote:

Use special tools for doing OCR (like FineReader or similar), correct errors and then use the resulting file for translation. OCR software may warn you of low resolution, skewed text etc while CAT tools just use the converted text as is and pre-translate it using the existing TM, hence your "odd discrepancies".


 
Natalie  Identity Verified
Poland
Local time: 17:15
Member (2002)
English to Russian
+ ...

Moderator of this forum
SITE LOCALIZER
If you have no control over the process Jan 25, 2022

you need to complain to the client and tell them about the bad quality of the files provided for translation.

 
Endre Both  Identity Verified
Germany
Local time: 17:15
English to German
Fascinating Jan 26, 2022

Do tell us if you find out about the reason for the discrepancy. One possibility is that the source for the OCR might have been a different file than the PDF you received. Or a very crude version of “Machine Learning” at work, trained on previous documents.

I am reminded of the Xerox photocopiers that randomly changed numbers in photocopies:
http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning


 
Natalie  Identity Verified
Poland
Local time: 17:15
Member (2002)
English to Russian
+ ...

Moderator of this forum
SITE LOCALIZER
These discrepancies are just fuzzies Jan 26, 2022

I.e. the result of pre-translation by MemoQ of the converted PDF. In case the translator has his own TM for this client and does not like the results of pre-translation, he may clear the target and pre-translate again using his own TM.

In other words: this problem has absolutely nothing to do with OCR (and MemoQ has nothing to do with "commonly used OCR programs").
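
To make the "fuzzies" idea concrete, here is a minimal Python sketch of fuzzy matching against a TM. It is not how memoQ actually scores matches: the `TM` contents, the `best_fuzzy_match` helper and the 0.7 threshold are all made up for illustration, and the similarity measure is simply Python's `difflib`.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: stored source segment -> stored target.
TM = {
    "President Bill Clinton signed the bill.":
        "Präsident Bill Clinton hat das Gesetz unterzeichnet.",
}


def best_fuzzy_match(segment, tm, threshold=0.7):
    """Return (score, tm_source, tm_target) for the closest TM entry, or None.

    The score is difflib's similarity ratio (0.0 to 1.0), used here as a
    stand-in for a CAT tool's match percentage.
    """
    best = None
    for tm_source, tm_target in tm.items():
        score = SequenceMatcher(None, segment, tm_source).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, tm_source, tm_target)
    return best


source = "President Joe Biden signed the bill."
match = best_fuzzy_match(source, TM)

print("Source segment :", source)
if match is not None:
    score, tm_source, tm_target = match
    print(f"Fuzzy match    : {score:.0%} against {tm_source!r}")
    print("Suggested target:", tm_target)
```

In this sketch the old name can only ever show up in the suggested target. The source segment itself is never altered by the TM lookup.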


[Edited at 2022-01-26 09:22 GMT]


 
Adieu  Identity Verified
Ukrainian to English
+ ...
TOPIC STARTER
Except they're not Jan 26, 2022

The discrepancies are on the SOURCE TEXT side in memoQ.

Natalie wrote:

I.e. the result of pre-translation by MemoQ of the converted PDF. In case the translator has his own TM for this client and does not like the results of pre-translation, he may clear the target and pre-translate again using his own TM.

In other words: this problem has absolutely nothing to do with OCR (and MemoQ has nothing to do with "commonly used OCR programs").


[Edited at 2022-01-26 09:22 GMT]


 
Rolf Keller
Germany
Local time: 17:15
English to German
Is it really the result of OCRing a PDF? Jan 26, 2022

Think pdf says "President Joe Biden" and memoQ sees "Bill Clinton" or "Donald Trump".

If (!!) there is no TM involved anywhere, the PDF must contain both names.

PDFs can contain several graphics and text layers. Example: If somebody somehow managed to over"paint" a graphics layer to "Biden" without touching a text layer (saying "Trump"), the effect you described may arise.
Another possibility: The PDF contains invisible text revisions.

As there are 999 different software modules that can read PDFs, one should never believe WYSIWYGot.

In order to mitigate such problems, authors should store PDFs in PDF/A format (even MS Word is able to do that).


 
Adieu  Identity Verified
Ukrainian to English
+ ...
TOPIC STARTER
Well I don't know what to make of this Jan 27, 2022

But I'm seeing even more of this nonsense.

Now it's email addresses: some @somethingelse domains turn into very different-looking @clientdomain ones after whatever OCR manipulations my client's minions do to upload PDFs onto their memoQ server.

Interesting.


 
Natalie  Identity Verified
Poland
Local time: 17:15
Member (2002)
English to Russian
+ ...

Moderator of this forum
SITE LOCALIZER
Once again: Jan 27, 2022

you need to inform your client about the bad quality of the files provided for translation.

 
Samuel Murray  Identity Verified
Netherlands
Local time: 17:15
Member (2006)
English to Afrikaans
+ ...
Perhaps Jan 27, 2022

Adieu wrote:
Specifically, for grainy pdfs, it would "see" previous long-time occupants of official positions after titles, instead of the new person's name actually printed on the pdf.

How curious!

I do suspect that OCR programs use dictionaries. My OCR results are better if I select the correct language for the text, even for languages that use the same character set. If I scan an Afrikaans text but I set the language to Dutch, then some of the grainy words are "recognised" as Dutch words, whereas if I set the language to Afrikaans, they're not.
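
A rough Python sketch of what such dictionary-based correction might look like is below. This is only a guess at the general idea, not how FineReader or any other real OCR engine works: the tiny `LEXICONS` word lists, the `snap_to_lexicon` helper and the 0.5 cutoff are all invented for the example. Each unrecognised token is "snapped" to the closest word in the lexicon of the selected language.

```python
from difflib import get_close_matches

# Hypothetical mini-lexicons; a real engine would use far larger word lists.
LEXICONS = {
    "nl": ["de", "het", "huis", "is", "niet", "groot"],
    "af": ["die", "huis", "is", "nie", "groot"],
}


def snap_to_lexicon(tokens, lang, cutoff=0.5):
    """Replace each unknown token with its closest lexicon word, if any.

    Tokens already in the lexicon are kept; tokens with no candidate
    above the similarity cutoff are left as they were read.
    """
    lexicon = LEXICONS[lang]
    result = []
    for token in tokens:
        if token in lexicon:
            result.append(token)
            continue
        candidates = get_close_matches(token, lexicon, n=1, cutoff=cutoff)
        result.append(candidates[0] if candidates else token)
    return result


# Simulated grainy scan of Afrikaans "die huis is nie groot nie",
# with each "i" in "die"/"nie" misread as "l".
noisy = ["dle", "huis", "is", "nle", "groot", "nle"]

print("As Dutch    :", " ".join(snap_to_lexicon(noisy, "nl")))
print("As Afrikaans:", " ".join(snap_to_lexicon(noisy, "af")))
```

With the Dutch lexicon the damaged words come out as "de" and "niet"; with the Afrikaans lexicon they come out as "die" and "nie". The same damaged input gives different "recognised" text depending on which word list is consulted.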

[Edited at 2022-01-27 20:06 GMT]


 

