> How do I check that the 50K obs of the above predictors are linearly related to the FRAUD variable?
A linear relationship between the predictors and FRAUD itself is neither
necessary nor sufficient. In fact the model assumes that the log odds of
FRAUD, NOT FRAUD itself, is linearly related to your predictors.
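
Written out for the model in the original post, the assumption looks roughly like this. The coefficient symbols, the two random-intercept terms u and v, and the coding of FRAUD = 1 as the fraudulent level are my own notation and assumptions, not something taken from the post:

```latex
\log\frac{\Pr(\mathrm{FRAUD}=1)}{1-\Pr(\mathrm{FRAUD}=1)}
  = \beta_0
  + \beta_1\,\mathrm{PAT\_AGE}
  + \beta_2\,\mathbf{1}[\mathrm{PAT\_SEX}=\mathrm{M}]
  + \beta_3\,\mathrm{NUM\_YNGPATS}
  + \beta_4\,\mathrm{NUM\_PATS}
  + u_{\mathrm{DOC\_ID}} + v_{\mathrm{PAT\_ID}}
```

The left-hand side is the log odds; it is this quantity, not the 0/1 response, that should look linear in each continuous predictor.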
Here is a way to view it.
1) Bin a predictor into, say, 30 bins (i = 1 to 30).
2) Calculate the log odds of FRAUD for each bin.
3) Plot the log odds of FRAUD against the binned predictor values (mean or median of each bin).
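
The three steps above could be sketched in SAS roughly as follows. This is only an illustration under assumptions: FRAUD is taken to be a numeric 0/1 variable, the data set names RANKED and BINNED and the variables BIN, X_MEAN, N, EVENTS and LOGODDS are made up for the example, and the 0.5 added to each count is my own guard against bins with zero events (quite possible at a 1% event rate), not part of the recipe.

```sas
/* Sketch only: MYDATA, FRAUD and PAT_AGE come from the original post.    */
/* RANKED, BINNED, BIN, X_MEAN, N, EVENTS, LOGODDS are hypothetical names. */
/* Assumes FRAUD is numeric with 1 = fraudulent, 0 = not fraudulent.      */

/* 1) Bin the predictor into about 30 equal-sized groups (BIN = 0..29). */
/*    Heavily tied predictors may produce fewer distinct bins.          */
proc rank data=MYDATA groups=30 out=RANKED;
   var PAT_AGE;
   ranks BIN;
run;

/* 2) Per bin: mean predictor value, counts, and empirical log odds.   */
/*    The +0.5 terms are an assumed correction to avoid log(0).        */
proc sql;
   create table BINNED as
   select BIN,
          mean(PAT_AGE) as X_MEAN,
          count(*)      as N,
          sum(FRAUD)    as EVENTS,
          log((sum(FRAUD) + 0.5) / (count(*) - sum(FRAUD) + 0.5)) as LOGODDS
   from RANKED
   where BIN is not missing
   group by BIN;
quit;

/* 3) Plot the log odds against the bin means, with a smoother overlaid. */
proc sgplot data=BINNED;
   scatter x=X_MEAN y=LOGODDS;
   loess   x=X_MEAN y=LOGODDS;
run;
```

If the points fall close to a straight line, the predictor can probably go into the MODEL statement as is; a clear curve suggests a transformation. The same code can be rerun with NUM_YNGPATS or NUM_PATS in place of PAT_AGE.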
Based on what you see, you may apply a suitable transformation. There are
two approaches: one is parametric and the other is non-parametric. If you
have a large number of events (FRAUD), the non-parametric way is preferred. ......
HTH
On May 15, 11:40m, XXXX@XXXXX.COM (Tom White) wrote:
> Hello everyone,
>
> As you know, I have been trying to fit a predictive model to health care fraud using GLIMMIX.
>
> One of the assumptions is that the predictor variables in the MODEL statement
> are linearly related to the RESPONSE dependent variable (in my case I call it FRAUD).
>
> FRAUD takes on values
>
> (claim submitted by doctor is not fraudulent--about 99% of my population)
> (claim submitted by doctor is fraudulent--about 1% of my population)
>
> I have about, say, 100K doctors (DOC_ID) and many more patients (PAT_ID) seen by these doctors.
>
> An example model is:
>
> proc glimmix data=MYDATA;
> model FRAUD=PAT_AGE PAT_SEX NUM_YNGPATS NUM_PATS / dist=bin link=logit;
> random intercept / subject = DOC_ID;
> random intercept / subject = PAT_ID; /* Per Dale's suggestion last week or the week before */
> run;
>
> So, I'm trying to fit something like the above model.
>
> PAT_AGE and PAT_SEX are patient predictors (PAT_SEX = M or F)
>
> NUM_YNGPATS = The # of young patients the doctor sees
> NUM_PATS = The total # of patients the doctor has
>
> NUM_YNGPATS & NUM_PATS are, of course, group level predictors at the doctor (DOC_ID) level,
> i.e., all claims under the same DOC_ID will be populated by the same NUM_YNGPATS & NUM_PATS values.
>
> NUM_YNGPATS & NUM_PATS are continuous variables ranging in value
> from 1 to some upper bound (say, 40 young pats and 60 NUM_PATS).
>
> So, that's pretty much the background information.
>
> Now to my question:
> How do I check the linearity assumption associated with this logistic model for my above
> predictors PAT_AGE (and similarly NUM_YNGPATS & NUM_PATS) and the categorical PAT_SEX (M or F).
>
> For the sake of this discussion assume that I have many obs so I have, say, 50K obs I'm working with.
>
> How do I check that the 50K obs of the above predictors are linearly related to the FRAUD variable?
>
> If, per your explanation, a few of my predictors are not linearly related to FRAUD,
> do I then try to find some transformation (say take the log or something) for, say, NUM_YNGPATS
> so that
>
> NUM_YNGPATS_TR=log(NUM_YNGPATS)
>
> check linearity again and then use NUM_YNGPATS_TR in the model statement above?
>
> Thank you for your thoughts.
>
> T