### sas >> Probabilistic Record Linkage using Fellegi-Sunter model

Hi there:

We're trying to link two large data sets collected from two different
sources for epidemiological research (0.6M records in file A and
12,000 in file B). The two data sets share a few common fields such
as Social Insurance Number, name (partially), date of birth, gender,
etc. Because some records are incomplete or contain errors (missing
values, typos), we believe we have to use probabilistic linkage to
find matched records.

We're planning to use the Fellegi-Sunter model, which calculates a
composite weight (a log-likelihood ratio) for each record pair based on
estimated m (1 - error_rate) and u (1/#values) probabilities, then
assigns two thresholds to divide all record pairs into three sets:
"Match", "Non-Match", "Uncertain".

Our problem here is that we are not clear about the error rates of
those matching fields (Social Insurance Number, name, date of birth),
so the m probabilities are unknown. Additionally, we still don't quite
understand how the threshold values are determined.
Therefore, any pointers to estimates of these error rates (e.g.,
previous studies), to guidance on setting the threshold values
appropriately, and to related software/tools would be greatly
appreciated.

Amy

### sas >> Probabilistic Record Linkage using Fellegi-Sunter model

Amy:
I believe that you will find that Fellegi/Sunter assign m probabilities to
patterns of identifiers, not, as usually implemented, to individual fields in
an identifying record. F/S specifically warn against summing m probabilities
of conditionally dependent fields (such as first name and gender), so
estimates of m probabilities for individual fields may not have that much
value. In fact, Belin and Rubin have found that automated estimates of false
match rates based on sums of field weights tend to be wildly optimistic.

I would not expect the F/S model to yield good estimates of
misclassification error rates unless you have good identifying information
and expect to find good matches for a large proportion of the 12K records.
In that case, a similarity score computed by summing field match weights
will likely separate scores for pairs of records into minimally overlapping
m and u distributions. The scores at which the distributions overlap require
clerical review. The F/S model does not help much when the m and u
distributions overlap substantially.

During the last several years I have experimented with iterative methods for
determining match/non-match thresholds and with using a training sample of
clerical review outcomes to calibrate similarity scores. At some stage I
hope to have something that others will find useful.
Sig
-----Original Message-----
From: Amy
To: XXXX@XXXXX.COM
Sent: 1/13/2004 6:07 PM
Subject: Probabilistic Record Linkage using Fellegi-Sunter model


### sas >> Probabilistic Record Linkage using Fellegi-Sunter model

I have a related and, hopefully, simpler question.

I am also planning a record linkage, but using fuzzy merge techniques with
PROC SQL, previously discussed on this list, where each match on a linking
variable is weighted and then summed to obtain an overall score.

What is a good method to determine the weights?

I am thinking about adapting some of the probabilistic approaches to get
some rough estimates, but am not sure if there is a better way.

The common examples I've seen are as follows:

For a unique identifier such as SSN, assume a one-in-a-million chance
of accidental agreement.

For sex, the u probability is 0.5 (assuming equal numbers of males and
females).

For both, assume an m probability of 0.9.

Therefore the weights are:

SSN = log2(.9/.000001) = 19.78
Sex = log2(.9/.5) = .85

This makes intuitive sense.

But what about a case like region, where there are 4 areas and the
population allocation is unequal, say A = .5, B = .2, C = .2, D = .1 ?

In this case, would it still be reasonable to calculate weight as
log2(.9/.25) = 1.85 ?

Basically, this reduces to estimating the u probability as 1/(number of
distinct domain values) for the particular linking variable (in this
case 1/4).

I realize that probability is not exactly correct, but is this at least
a reasonable approach, or is there a better method?
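The arithmetic above can be checked directly. A minimal sketch, using the assumed m = 0.9 and the u values from the examples (these are the post's illustrative figures, not estimates from real data):

```python
from math import log2

def match_weight(m, u):
    # Agreement weight for one linking variable: log2 of the
    # likelihood ratio P(agree | match) / P(agree | non-match).
    return log2(m / u)

m = 0.9  # assumed m probability for every field, as in the examples

print(round(match_weight(m, 0.000001), 2))  # SSN: one-in-a-million u
print(round(match_weight(m, 0.5), 2))       # sex: two equal categories
print(round(match_weight(m, 0.25), 2))      # region: u = 1/4 distinct values
```

These reproduce the 19.78, 0.85, and 1.85 figures. Note that with unequal region frequencies, u = 1/4 understates accidental agreement on the common categories, which is presumably why the probability is flagged above as not exactly correct.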

Frank

-----Original Message-----
From: Sigurd Hermansen [mailto: XXXX@XXXXX.COM ]
Sent: Tuesday, January 13, 2004 10:33 PM
Subject: Re: Probabilistic Record Linkage using Fellegi-Sunter model


```Does anyone have advice or code for doing probabilistic linkage in SAS?

Carolyn
```

```If you have SAS/OR, look at PROC ASSIGN.

On Fri, 29 Aug 2003 17:25:52 +0200, Hans Reitsma < XXXX@XXXXX.COM >
wrote:

>Dear all
>
>After a probabilistic record linkage operation of two files
>(A and B) we have a file with potential links, e.g. pairs of
>records from file A and B whose total weights are above a
>certain cut-off value. Some of these potential links are
>intertangled (is that the right word?) meaning that 1 record
>from file A can be linked to two different records of file
>B. The weights can be different, but they are both above the
>cut-off value. We know that only one record from file A (or
>B) can be linked to only one record in the other file. This
>example is easy to solve, the pair with the highest weight
>
>data). In these situations, I want to obtain the solution
>that maximises the total sum of weights of all the pairs
>that belong to that solution. This means that the link with the
>highest weight does not automatically win. Here
>are some sample data to clarify the issue.
>
>
>Cluster    id_a    id_b    weight        Desired result
>13
>25
>
>Any help, suggestions?
>
>
>
>Hans Reitsma, MD PhD
>
>Dept. of Clinical Epidemiology & Biostatistics
>
>PO Box 22700, 1100 DE, Amsterdam, The Netherlands
>
>Phone: +31-20-5663273, Fax: +31-20-6912683
>
>E-mail:  XXXX@XXXXX.COM
```
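The problem Hans describes (choose one-to-one links maximizing the total sum of weights) is the classic assignment problem that PROC ASSIGN solves. A minimal brute-force sketch, with made-up candidate pairs and weights, shows why the highest-weight link does not automatically win:

```python
from itertools import permutations

# Hypothetical candidate links within one cluster: (id_a, id_b) -> weight.
# Greedily taking the single highest-weight pair (a1-b1) would leave a2
# with only a poor option; the global optimum pairs a1-b2 and a2-b1.
weights = {
    ("a1", "b1"): 10.0,
    ("a1", "b2"): 8.0,
    ("a2", "b1"): 9.0,
    ("a2", "b2"): 1.0,
}

def best_assignment(weights):
    """Return the one-to-one pairing that maximizes the total weight."""
    a_ids = sorted({a for a, _ in weights})
    b_ids = sorted({b for _, b in weights})
    best, best_total = None, float("-inf")
    # Brute force over all ways to assign distinct B records to A records;
    # fine for small clusters, but real solvers use the Hungarian algorithm.
    for perm in permutations(b_ids, len(a_ids)):
        pairs = list(zip(a_ids, perm))
        if all(p in weights for p in pairs):
            total = sum(weights[p] for p in pairs)
            if total > best_total:
                best, best_total = pairs, total
    return best, best_total

pairs, total = best_assignment(weights)
print(pairs, total)  # a1-b2 and a2-b1: 8 + 9 = 17 beats 10 + 1 = 11
```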

```I have a dataset that is not exactly a standard time series, so
PROC ARIMA cannot be applied directly. I need to write a customized
ARMA model under the PROC MODEL framework, specifically AR(1)+MA(1).

If anyone has some sample code, it would be a great help.

TIA.
```
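The AR(1)+MA(1) structure the poster wants to express in PROC MODEL is the recursion y[t] = c + phi*y[t-1] + e[t] + theta*e[t-1]. A sketch of that recursion (parameter values are arbitrary illustrations, not estimates):

```python
def arma11(shocks, c=0.0, phi=0.5, theta=0.3, y0=0.0):
    """Generate an ARMA(1,1) series from a given shock sequence.

    Implements y[t] = c + phi*y[t-1] + e[t] + theta*e[t-1],
    starting from y0 with a zero pre-sample shock."""
    y, prev_y, prev_e = [], y0, 0.0
    for e in shocks:
        yt = c + phi * prev_y + e + theta * prev_e
        y.append(yt)
        prev_y, prev_e = yt, e
    return y

# A single unit shock at t=0 decays through both the AR and MA terms.
print(arma11([1.0, 0.0, 0.0]))
```

In PROC MODEL this same recursion is written as a model equation with lagged residuals; the sketch above only shows the data-generating arithmetic.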

```Hi All,

Has anyone used generalized linear models to model time series data,
especially multivariate time series? I came upon a book on this topic
titled "Regression Models for Time Series Analysis"; you can find it
sold on Amazon.

There is no way that my company would buy SAS/ETS, so I am now on the
steep learning curve of modeling multivariate time series data using
R. The time series material itself is new to me, and the lack of
documentation and the many different, confusing R libraries make
matters even worse. I actually focus more on model inference than on
forecasting accuracy, so maybe regression analysis is the better
option. Are there any papers or examples I can read and learn from?

Thanks.

Ming Chen
```