> How do I check that the 50K obs of the above predictors are linearly related to the FRAUD variable?

It is neither necessary nor sufficient. In fact, the model assumes that
the log odds of FRAUD, NOT FRAUD itself, are linearly related to your
predictors.

Here is a way to view it.

1) Bin a predictor into, say, 30 bins (i = 1 to 30).

2) Calculate the log odds of FRAUD for each bin.

3) Plot the log odds of FRAUD against the binned predictor values (bin mean or median).
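
The three steps above can be sketched in SAS roughly as follows. This is a minimal sketch, not tested against the poster's data: it assumes FRAUD is coded as numeric 0/1 and uses PAT_AGE as the predictor. The data set names RANKED and BINS, the new variable names, and the 0.5 added to each count (so that a bin with no fraud cases still gives a finite log odds) are illustrative choices, not part of the original post.

```sas
/* Step 1: cut PAT_AGE into about 30 equal-sized bins (BIN = 0 to 29). */
proc rank data=MYDATA groups=30 out=RANKED;
  var PAT_AGE;
  ranks BIN;
run;

/* Step 2: count events and observations per bin, and take the bin mean. */
proc means data=RANKED noprint nway;
  class BIN;
  var FRAUD PAT_AGE;
  output out=BINS sum(FRAUD)=EVENTS n(FRAUD)=N_OBS mean(PAT_AGE)=AGE_MEAN;
run;

data BINS;
  set BINS;
  /* Empirical log odds; the 0.5 is an assumed continuity correction. */
  LOGODDS = log((EVENTS + 0.5) / (N_OBS - EVENTS + 0.5));
run;

/* Step 3: plot the log odds against the bin mean. */
proc sgplot data=BINS;
  scatter x=AGE_MEAN y=LOGODDS;
  loess x=AGE_MEAN y=LOGODDS / nomarkers;
run;
```

A roughly straight pattern suggests the linear term is adequate; visible curvature suggests a transformation. For doctor-level predictors such as NUM_PATS, which take few distinct values, PROC RANK may return fewer than 30 bins because of ties.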

Based on what you see, you may apply an appropriate transformation.
There are two ways to do this: one is parametric and the other is
non-parametric. If you have a large number of events (FRAUD), the
non-parametric way is preferred. ......
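
As one possible illustration of the two routes (an assumption about what is meant, since the post does not spell them out): a parametric transformation could be the log that the question proposes, and a non-parametric one could be a spline effect. The data set name MYDATA_TR, the effect name AGE_SPL, and the knot choice are hypothetical. FRAUD is assumed to be coded 0/1, and the PAT_ID random intercept is left out only to keep the sketch short.

```sas
/* Parametric route: a log transform, as proposed in the question. */
data MYDATA_TR;
  set MYDATA;
  NUM_YNGPATS_TR = log(NUM_YNGPATS);  /* defined because NUM_YNGPATS >= 1 */
run;

/* Non-parametric route (one possible reading): a spline effect for PAT_AGE. */
proc glimmix data=MYDATA_TR;
  class PAT_SEX DOC_ID;
  effect AGE_SPL = spline(PAT_AGE / naturalcubic basis=tpf(noint)
                          knotmethod=percentiles(3));
  model FRAUD(event='1') = AGE_SPL PAT_SEX NUM_YNGPATS_TR NUM_PATS
        / dist=binary link=logit;
  random intercept / subject=DOC_ID;
run;
```

Either way, the binned log odds plot can be redrawn against the transformed predictor to check whether the relationship now looks closer to linear.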

HTH

On May 15, 11:40m, XXXX@XXXXX.COM (Tom White) wrote:

> Hello everyone,

>

> As you know, I have been trying to fit a predictive model to health care fraud using GLIMMIX.

>

> One of the assumptions is that the predictor variables in the MODEL statement

> are linearly related to the RESPONSE dependent variable (in my case I call it FRAUD).

>

> FRAUD takes on values

>

> (claim submitted by doctor is not fraudulent--about 99% of my population)

> (claim submitted by doctor is fraudulent--about 1% of my population)

>

> I have about say 100K doctors (DOC_ID) and many more patients (PAT_ID) seen by these doctors.

>

> An example model is:

>

> proc glimmix data= MYDATA;
> model FRAUD=PAT_AGE PAT_SEX NUM_YNGPATS NUM_PATS / dist=bin link=logit;
> random intercept / subject = DOC_ID;
> random intercept / subject = PAT_ID; /* Per Dale's suggestion last week or the week before */
> run;

>

> So, I'm trying to fit something like the above model.

>

> PAT_AGE and PAT_SEX are patient predictors (PAT_SEX = M or F).

>

> NUM_YNGPATS = The # of young patients the doctor sees
> NUM_PATS = The total # of patients the doctor has

>

> NUM_YNGPATS & NUM_PATS are, of course, group level predictors at the doctor (DOC_ID) level,
> i.e., all claims under the same DOC_ID will be populated by the same NUM_YNGPATS & NUM_PATS values.

>

> NUM_YNGPATS & NUM_PATS are continuous variables ranging in value
> from 1 to some upper bound (say, 40 young pats and 60 NUM_PATS).

>

> So, that's pretty much the background information.

>

> Now to my question:
> How do I check the linearity assumption associated with this logistic model for my above
> predictors PAT_AGE (and similarly NUM_YNGPATS & NUM_PATS) and the categorical PAT_SEX (M or F).

>

> For the sake of this discussion assume that I have many obs so I have, say, 50K obs I'm working with.

>

> How do I check that the 50K obs of the above predictors are linearly related to the FRAUD variable?

>

> If, per your explanation, a few of my predictors are not linearly related to FRAUD,
> do I then try to find some transformation (say take the log or something) for, say, NUM_YNGPATS
> so that
>
> NUM_YNGPATS_TR=log(NUM_YNGPATS)
>
> check linearity again and then use NUM_YNGPATS_TR in the model statement above?

>

> Thank you for your thoughts.
>
> T
