Record linkage

Record linkage, also known as deduplication, is the task of finding entries that refer to the same entity in two or more files. It is an appropriate technique for joining data sets that have no unique database key in common. A data set that has undergone record linkage is said to be linked.

Record linkage (RL) is a useful tool for data mining tasks in which the data originate from different sources or different organizations. Most commonly, RL involves joining records of persons based on name, since no national identification number or similar identifier is recorded in the data.

Naming conventions

Record linkage is the term used by statisticians, epidemiologists and historians, among others. Commercial mail and database applications refer to it as merge/purge processing or list washing. Computer scientists often refer to it as data matching or as the object identity problem. Other names for the same concept include entity resolution, duplicate detection, record matching, instance identification, deduplication, coreference resolution, reference reconciliation and database hardening. This confusion of terminology has led to few cross-references between these research communities.^[1] ^[2]

Methods of RL

The two main approaches to RL are probabilistic (PRL) and rules-based (or deterministic or exact).

PRL uses a greater number of matching variables to provide a maximum likelihood estimate among potential matches, whereas deterministic RL requires an exact match in both the content and the format of an identifier. For example, a birth date may be coded as dd/mm/yy, dd/mm/yyyy, mm/dd/yy, mm/dd/yyyy and so on. If two databases are merged deterministically, the record Juan C. Garcia, DOB 10-15-1935, in database A is unlikely to match his record in database B if he was entered there as JC Garcai, DOB 15-Oct-35. A PRL program would instead assign a likelihood score to those two records, based on the number of elements that agree between them (J--C--G--a--r--c--10/Oct--35) out of all of the available data.
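The contrast can be sketched in Python. This is only an illustration under stated assumptions, not an actual PRL implementation: the record layout, the helper names (`deterministic_match`, `parse_dob`, `initials_and_surname`, `agreement_score`), the use of `difflib.SequenceMatcher` with a 0.8 threshold to compare surnames, and the rule that a single capitalized token such as "JC" is a run of initials are all choices made for this example. The score is a plain fraction of agreeing elements, a crude stand-in for a likelihood.

```python
from difflib import SequenceMatcher

# Hypothetical records taken from the example in the text.
record_a = {"name": "Juan C. Garcia", "dob": "10-15-1935"}
record_b = {"name": "JC Garcai", "dob": "15-Oct-35"}

MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def deterministic_match(a, b):
    """Exact comparison of content and format on every field."""
    return a["name"] == b["name"] and a["dob"] == b["dob"]


def parse_dob(text, order):
    """Split a hyphen-separated date into (day, month, two-digit year).

    `order` gives the position of each part, e.g. "mdy" or "dmy".
    """
    parts = dict(zip(order, text.split("-")))
    month = parts["m"]
    month_number = MONTHS[month.lower()] if month.isalpha() else int(month)
    return int(parts["d"]), month_number, int(parts["y"]) % 100


def initials_and_surname(name):
    """Return (initials, surname) for names like 'Juan C. Garcia' or 'JC Garcai'."""
    tokens = name.replace(".", "").split()
    given, surname = tokens[:-1], tokens[-1].lower()
    # Assumption: one all-capitals token ("JC") is already a run of initials.
    if len(given) == 1 and given[0].isupper():
        initials = given[0]
    else:
        initials = "".join(token[0] for token in given)
    return initials.upper(), surname


def agreement_score(a, b, order_a="mdy", order_b="dmy"):
    """Fraction of compared elements that agree between two records."""
    initials_a, surname_a = initials_and_surname(a["name"])
    initials_b, surname_b = initials_and_surname(b["name"])
    day_a, month_a, year_a = parse_dob(a["dob"], order_a)
    day_b, month_b, year_b = parse_dob(b["dob"], order_b)
    checks = [
        initials_a == initials_b,
        # Tolerate small spelling differences in the surname.
        SequenceMatcher(None, surname_a, surname_b).ratio() >= 0.8,
        day_a == day_b,
        month_a == month_b,
        year_a == year_b,
    ]
    return sum(checks) / len(checks)


print(deterministic_match(record_a, record_b))
print(agreement_score(record_a, record_b))
```

The deterministic comparison rejects the pair because neither field is identical character for character, while the element-wise score still finds agreement once the two date formats are reduced to a common form. Note that the date layout of each file has to be supplied by the caller, since a value such as 10-11-35 is ambiguous on its own.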

History of RL theory

The initial idea goes back to Halbert L. Dunn ("Record Linkage" in: American Journal of Public Health, Vol. 36 (1946), 1412-1416). In the 1950s, Howard Borden Newcombe laid the probabilistic foundations of modern record linkage theory.

In 1969, Fellegi and Sunter formalized these ideas. Their pioneering paper "A Theory For Record Linkage" remains the mathematical foundation of many record linkage applications. It was published in the Journal of the American Statistical Association, Vol. 64 (1969), 1183-1210.

Mathematical model

In an application with two files, A and B, denote the rows (records) by $\alpha (a)$ in file A and $\beta (b)$ in file B. Assign $K$ characteristics to each record. The set of records that represent identical entities is defined by

$M=\{(a,b);a=b;a\in A,b\in B\}$

and the complement of set $M$ , namely set $U$ , representing different entities, is defined as

$U=\{(a,b);a\neq b;a\in A,b\in B\}$ .

A vector $\gamma$ is defined that contains the coded agreements and disagreements on each characteristic:

$\gamma \left[\alpha (a),\beta (b)\right]=\{\gamma ^{1}\left[\alpha (a),\beta (b)\right],...,\gamma ^{K}\left[\alpha (a),\beta (b)\right]\}$

where the superscripts $1,\dots ,K$ index the characteristics (sex, age, marital status, etc.) in the files. The conditional probabilities of observing a specific vector $\gamma$ given $(a,b)\in M$ and given $(a,b)\in U$ are defined as

$m(\gamma )=P\left\{\gamma \left[\alpha (a),\beta (b)\right]|(a,b)\in M\right\}=\sum _{(a,b)\in M}P\left\{\gamma \left[\alpha (a),\beta (b)\right]\right\}\cdot P\left[(a,b)|M\right]$

and

$u(\gamma )=P\left\{\gamma \left[\alpha (a),\beta (b)\right]|(a,b)\in U\right\}=\sum _{(a,b)\in U}P\left\{\gamma \left[\alpha (a),\beta (b)\right]\right\}\cdot P\left[(a,b)|U\right],$ respectively.
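As a rough illustration of these definitions, the following Python sketch builds the comparison vector $\gamma$ for every pair in $A\times B$ and estimates $m(\gamma )$ and $u(\gamma )$ as relative frequencies within $M$ and $U$. It rests on several assumptions that are not part of the model above: the toy records and the `entity` key that reveals true identity are invented, agreement is coded as 1 and disagreement as 0, every pair within $M$ (or $U$) is weighted equally, and the function names `comparison_vector` and `estimate_m_u` are hypothetical. In practice the true match status is usually unknown and has to be obtained from a reviewed sample or estimated.

```python
from collections import Counter
from itertools import product

# Hypothetical toy files; "entity" is the true identity, known here only
# because the data are invented for the illustration.
file_a = [
    {"entity": 1, "sex": "F", "age": 34, "marital_status": "married"},
    {"entity": 2, "sex": "M", "age": 51, "marital_status": "single"},
    {"entity": 3, "sex": "F", "age": 28, "marital_status": "single"},
]
file_b = [
    {"entity": 1, "sex": "F", "age": 34, "marital_status": "single"},
    {"entity": 2, "sex": "M", "age": 51, "marital_status": "single"},
    {"entity": 4, "sex": "M", "age": 28, "marital_status": "married"},
]

FIELDS = ("sex", "age", "marital_status")  # the K = 3 characteristics


def comparison_vector(rec_a, rec_b):
    """Code agreement (1) or disagreement (0) on each characteristic."""
    return tuple(int(rec_a[field] == rec_b[field]) for field in FIELDS)


def estimate_m_u(a_records, b_records):
    """Relative frequency of each comparison vector within M and within U."""
    m_counts, u_counts = Counter(), Counter()
    for rec_a, rec_b in product(a_records, b_records):
        gamma = comparison_vector(rec_a, rec_b)
        if rec_a["entity"] == rec_b["entity"]:
            m_counts[gamma] += 1  # the pair belongs to M
        else:
            u_counts[gamma] += 1  # the pair belongs to U
    m_total = sum(m_counts.values())
    u_total = sum(u_counts.values())
    m = {gamma: count / m_total for gamma, count in m_counts.items()}
    u = {gamma: count / u_total for gamma, count in u_counts.items()}
    return m, u


m, u = estimate_m_u(file_a, file_b)
for gamma in sorted(set(m) | set(u)):
    print(gamma, "m =", m.get(gamma, 0.0), "u =", u.get(gamma, 0.0))
```

In this toy data the vectors observed among matched pairs agree on most characteristics, whereas those observed among unmatched pairs agree on few, which is the contrast between $m(\gamma )$ and $u(\gamma )$ that the model relies on.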

Applications in historical research

Record linkage is important to social history research, since most data sets, such as census records and parish registers, were recorded long before the invention of national identification numbers. When old sources are digitized, linking the data sets is a prerequisite for longitudinal study. The process is often further complicated by the lack of standard spelling of names, family names that change according to place of dwelling, changes in administrative boundaries, and the difficulty of checking the data against other sources. Record linkage was among the most prominent themes in the history and computing field in the 1980s, but has since received less attention in research.

Applications in medical practice and research

Software implementations

Febrl: Free, open source application for RL written in Python and C by the Australian National University.
Link Plus: Free, probabilistic record linkage program developed at the US Centers for Disease Control and Prevention (CDC).

External links

Deduplication Software http://www.helpit.com

Notes

[1] Christen, P. & Churches, T.: Febrl - Freely extensible biomedical record linkage (Manual, release 0.3) p.9

[2] Ahmed Elmagarmid, Panagiotis G. Ipeirotis, Vassilios Verykios: Duplicate Record Detection: A Survey p.2
