Amy:

I believe you will find that Fellegi/Sunter assign m probabilities to
patterns of identifiers, not, as usually implemented, to individual fields
in an identifying record. F/S specifically warn against summing m
probabilities of conditionally dependent fields (such as first name and
gender), so estimates of m probabilities for individual fields may not have
much value. In fact, Belin and Rubin found that automated estimates of
false match rates based on sums of field weights tend to be wildly
optimistic.

I would not expect the F/S model to yield good estimates of
misclassification error rates unless you have good identifying information
and expect to find good matches for a large proportion of the 12K records.
In that case, a similarity score computed by summing field match weights
will likely separate the scores for record pairs into minimally overlapping
m and u distributions; the scores in the region where the distributions
overlap require clerical review. The F/S model does not help much when the
m and u distributions overlap substantially.
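As a rough illustration of the weighted-sum scoring described above: each field contributes a log likelihood-ratio weight (log2(m/u) on agreement, log2((1-m)/(1-u)) on disagreement), the weights are summed, and two thresholds split pairs into match / uncertain / non-match. The field names, m/u values, and thresholds below are purely hypothetical; real values would be estimated from the data, not assumed.

```python
import math

# Hypothetical per-field m and u probabilities (illustrative only):
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are a non-match)
FIELDS = {
    "sin":  (0.95, 0.001),
    "dob":  (0.90, 0.003),
    "name": (0.85, 0.01),
}

def field_weight(m, u, agrees):
    """Log2 likelihood-ratio weight for one field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def composite_weight(comparison):
    """Sum field weights over a comparison vector {field: agrees?}."""
    return sum(field_weight(*FIELDS[f], agrees)
               for f, agrees in comparison.items())

def classify(score, lower, upper):
    """Two thresholds divide pairs into match / uncertain / non-match."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "uncertain"

# Example pair: SIN and DOB agree, name disagrees.
score = composite_weight({"sin": True, "dob": True, "name": False})
print(round(score, 2), classify(score, lower=0.0, upper=12.0))
```

The clerical-review band Sig mentions is exactly the "uncertain" region between the two thresholds; widening it trades review effort for fewer misclassifications.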

During the last several years I have experimented with iterative methods
for determining match/non-match thresholds and with using a training sample
of clerical review outcomes to calibrate similarity scores. At some stage I
hope to have something that others will find useful.

Sig

-----Original Message-----
From: Amy
To: XXXX@XXXXX.COM
Sent: 1/13/2004 6:07 PM
Subject: Probabilistic Record Linkage using Fellegi-Sunter model

Hi there:

We're trying to link two large data sets collected from two different
sources for epidemiological research (0.6M records in file A and 12,000 in
file B). These two data sets share a few common fields such as Social
Insurance Number, name (partially), date of birth, gender, etc. Due to
incompleteness or errors in some records (missing values, typos), I guess
we have to use probabilistic linkage to find matched records.

We're planning to use the Fellegi-Sunter model, which calculates a
composite weight (likelihood ratio) for each record pair based on estimated
m (1 - error_rate) and u (1/#values) probabilities, then assigns two
thresholds to divide all record pairs into three sets: "Match",
"Non-Match", and "Uncertain".
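The parenthetical definitions above (m = 1 - error_rate, u = 1/#values) can be sketched directly. The frequency-based u below generalizes 1/#values to fields whose values are not uniformly distributed; the function names and toy data are illustrative, not from any particular linkage package.

```python
from collections import Counter

def naive_u(values):
    """Crude u estimate: probability that two randomly paired
    non-matched records agree on this field. With k equally likely
    values this reduces to 1/k; with skewed frequencies it is the
    sum of squared relative frequencies."""
    counts = Counter(values)
    n = sum(counts.values())
    return sum((c / n) ** 2 for c in counts.values())

def naive_m(error_rate):
    """Crude m estimate: a field agrees on a true match unless the
    recorded value is garbled or missing."""
    return 1.0 - error_rate

# Toy gender column: two values with equal frequencies.
genders = ["F", "M", "F", "M", "F", "M", "F", "M"]
print(naive_u(genders))  # ~0.5, i.e. 1/#values for a uniform field
print(naive_m(0.02))     # 0.98
```

In practice m and u are usually estimated jointly from the comparison data (e.g., by EM) rather than fixed by these closed-form approximations, which is where the unknown error rates mentioned below become a problem.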

Our problem here is that we are not clear about the error rates of the
matching fields in these records (Social Insurance Number, name, date of
birth), so the m probabilities are unknown. Additionally, we still don't
quite understand how the threshold values are determined. Therefore, any
pointers to estimates of error rates (i.e., previous studies), information
on how to appropriately set the threshold values, and related
software/tools would be greatly appreciated.


Amy