Channel: Active questions tagged r - Stack Overflow

R function/package to standardize wrongly OCR-ed words?


I am scraping and OCR-ing hundreds of pages with the wonderful pdftools package. The pages repeatedly contain the names of the same persons. By and large the extraction works well, but in a few instances names are wrongly recognized, e.g. Simo instead of Simic. So I end up with e.g. 200 occurrences of Simic and 15 of Simo (and more or less the same with other names).

One way to correct this is to manually modify the wrong entries, e.g. with tidyverse's case_when and str_detect. However, this means I have to check each name and write a rule for each specific case.
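For illustration, the manual approach looks roughly like this. The data frame `df` and its `name` column are made up for this example; `case_when` comes from dplyr and `str_detect` from stringr.

```r
library(dplyr)
library(stringr)

# Toy data standing in for the extracted names (hypothetical)
df <- tibble(name = c("Simic", "Simo", "Simic", "Petrovic"))

# One hand-written rule per misrecognized name
df <- df %>%
  mutate(name = case_when(
    str_detect(name, "^Simo$") ~ "Simic",  # known OCR error
    TRUE ~ name                            # leave everything else untouched
  ))
```

Every further wrongly recognized name needs its own line inside `case_when`, which is what becomes tedious.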

So my question is whether there is any function or R package which takes on such a task and makes it a bit easier, e.g. by grouping words which differ by no more than two characters and harmonising them to the most frequent variant. Obviously this approach might cause problems if there are genuinely distinct names that are very similar, but those could be checked manually.
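To make the idea concrete, here is a minimal sketch in base R of what I have in mind. It assumes that "differ by no more than two characters" can be read as a Levenshtein edit distance of at most 2, computed with `adist()` from the utils package. The function `harmonise_names()` and its `max_dist` argument are hypothetical names I made up, not part of any package.

```r
# Hypothetical helper: map each value to the most frequent value
# within `max_dist` edits (Levenshtein distance via utils::adist).
harmonise_names <- function(x, max_dist = 2) {
  freq <- sort(table(x), decreasing = TRUE)  # unique values, most frequent first
  vals <- names(freq)
  d <- adist(vals)                           # pairwise edit distances
  canon <- setNames(vals, vals)              # each value starts as its own canonical form
  for (i in seq_along(vals)[-1]) {
    earlier <- seq_len(i - 1)                # all more frequent values
    close <- earlier[d[i, earlier] <= max_dist]
    if (length(close) > 0) {
      # adopt the canonical form of the most frequent close match
      canon[vals[i]] <- canon[vals[close[1]]]
    }
  }
  unname(canon[x])
}

# Toy data with the counts from the question
x <- c(rep("Simic", 200), rep("Simo", 15), rep("Petrovic", 50))
out <- harmonise_names(x)
table(out)
```

With this toy input, "Simo" is two edits away from "Simic" and gets replaced by it, while "Petrovic" is left alone. Short names are the risky case: with a threshold of 2, genuinely different short names can easily fall within range of each other, so the resulting mapping would still need a manual check.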

Grateful for any hint. Many thanks.

