Parallel corpus approach for name matching in record linkage

L. E. Zhukov; Sukharev J.; Popescul A.

P. 995–1000.

Record linkage, or entity resolution, is an important area of data mining, and name matching is a key component of record linkage systems. Alternative spellings of the same name are common in many applications. We use the largest collection of genealogy person records in the world, together with user search query logs, to build name-matching models. We outline the procedure for building a crowd-sourced training set and present our method, which casts the problem of learning alternative spellings as a machine translation problem at the character level. Using information retrieval evaluation methodology, we show that on our data this method substantially outperforms a number of standard, well-known phonetic and string similarity methods in terms of precision and recall. Our result can lead to a significant practical impact in entity resolution applications.
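To make the evaluation setup concrete, the following is a minimal sketch of a string similarity baseline of the kind the abstract compares against, scored with precision and recall. It is not the authors' character-level translation method. The function names, the 0.6 threshold, and the toy name lists are all assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def retrieve(query: str, vocabulary: set, threshold: float) -> set:
    """Return candidate alternative spellings scoring at or above threshold."""
    return {name for name in vocabulary
            if name != query and similarity(query, name) >= threshold}


def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Set-based precision and recall for one query."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: a name vocabulary and the spellings that a
# labeled training set might mark as true alternatives of "smith".
vocabulary = {"smith", "smyth", "smythe", "schmidt", "jones"}
relevant = {"smyth", "smythe", "schmidt"}

retrieved = retrieve("smith", vocabulary, threshold=0.6)
precision, recall = precision_recall(retrieved, relevant)
print(sorted(retrieved), precision, round(recall, 3))
```

In this toy case the baseline finds "smyth" and "smythe" but misses "schmidt", whose edit distance from "smith" is too large. Variants like this, which differ by more than a few characters, are the kind of case where a model learned from spelling pairs might help.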

Language: English

Keywords: record linkage; crowdsourcing; machine translation

In book

Proceedings of the 14th International Conference on Data Mining (ICDM 2014)

NY: IEEE Computer Society, 2014.

MuMMy: Multimodal Dataset supporting VLM-based Egyptology Research Assistant

Golyadkin M., Innokentiy Humonen, Rubanova V. et al., in: MM '25: Proceedings of the 33rd ACM International Conference on Multimedia. Association for Computing Machinery (ACM), 2025. P. 12875–12881.

We present MuMMy, the first multimodal dataset for developing research assistants that can interpret Egyptian hieroglyphic texts. It pairs images with Gardiner codes, transliteration, and English translation at two levels of granularity. We also evaluate several deep learning pipelines across OCR, transliteration, and translation tasks, revealing the complexity of the domain and the challenges posed ...

Added: November 8, 2025

Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale (CSW 2021)

Copenhagen, Denmark: CEUR Workshop Proceedings, 2021.

The second workshop on Crowd Science is organized in conjunction with the 47th International Conference on Very Large Data Bases (VLDB 2021). This workshop is the second in a series of events that has the goal of helping crowdsourcing “transition” from art to science, and tackles the research challenges that we face to make crowdsourcing ...

Added: December 13, 2021

Reflections of syntactic structures in nonautoregressive language models

Pletenev S. A., in: Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (Moscow, June 16–19, 2021). Issue 20. Russian State University for the Humanities, 2021.

Added: December 13, 2021

Uncertainty Estimation in Autoregressive Structured Prediction

Andrey Malinin, Gales M., in: Proceedings of the 9th International Conference on Learning Representations (ICLR 2021). ICLR, 2021. P. 1–31.

Added: November 1, 2021

Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets

Ryabinin M., Malinin A., Gales M., in: Advances in Neural Information Processing Systems 34 (NeurIPS 2021). Curran Associates, Inc., 2021. P. 6023–6035.

Added: October 31, 2021

Proceedings of the 3rd Workshop on Neural Generation and Translation

Association for Computational Linguistics, 2019.

This document describes the findings of the Third Workshop on Neural Generation and Translation, held in conjunction with the annual Conference on Empirical Methods in Natural Language Processing (EMNLP 2019). ...

Added: January 7, 2021

CEUR Workshop Proceedings (Proceedings of the International Conference "Internet and Modern Society" IMS-2020, 17-20 June 2020, ITMO University, St. Petersburg, Russia)

CEUR Workshop Proceedings, 2020.

The International Conference “Internet and Modern Society” (IMS-2020) was initially planned to take place in St. Petersburg, Russia. Due to the spread of COVID-19 and the ban on public events, the conference was held during 17-20 June 2020 in the format of online sessions with a discussion of papers and presentations uploaded in advance. The ...

Added: November 1, 2020