### sas >> PROC GLIMMIX--How to Check Linearity Assumption

> How do I check that the 50K obs of the above predictors are linearly related to the FRAUD variable?

It is neither necessary nor sufficient. In fact, the model assumes that
the log odds of FRAUD, NOT FRAUD itself, is linearly related to your
predictors.

Here is a way to view it.
1) Bin a predictor into, say, 30 bins (i = 1 to 30).
2) Calculate the log odds of FRAUD for each bin.
3) Plot the log odds of FRAUD against the binned predictor values (bin mean or median).
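
The three steps above could be sketched in SAS roughly as follows. This is only an illustration: MYDATA, FRAUD (assumed here to be numeric 0/1) and PAT_AGE come from the question, while BINNED, BIN_LOGODDS, AGE_BIN, MEAN_AGE and the 0.5 added to each count are assumptions made for the example.

```sas
/* Sketch only. BINNED, BIN_LOGODDS, AGE_BIN and MEAN_AGE are hypothetical names. */

/* 1) Split PAT_AGE into 30 roughly equal-sized bins (AGE_BIN = 0 to 29). */
proc rank data=MYDATA groups=30 out=BINNED;
   var PAT_AGE;
   ranks AGE_BIN;
run;

/* 2) Per bin: size, number of events, mean predictor value, log odds.
      Assumption: 0.5 is added to the event and non-event counts so that
      a bin with no events (or only events) still gives a finite value. */
proc sql;
   create table BIN_LOGODDS as
   select AGE_BIN,
          count(*)      as N,
          sum(FRAUD)    as EVENTS,
          mean(PAT_AGE) as MEAN_AGE,
          log((sum(FRAUD) + 0.5) / (count(*) - sum(FRAUD) + 0.5)) as LOGODDS
   from BINNED
   group by AGE_BIN;
quit;

/* 3) Plot the log odds against the bin means and look for a straight line. */
proc sgplot data=BIN_LOGODDS;
   scatter x=MEAN_AGE y=LOGODDS;
run;
```

Changing GROUPS= gives fewer, wider bins; note that this ignores the random effects and looks at one predictor at a time.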

Based on what you see, you can apply a suitable transformation. There
are two ways to do so: one is parametric and the other is non-parametric.
If you have a large number of events (FRAUD), the non-parametric way is
preferred. ......

HTH

On May 15, 11:40m, XXXX@XXXXX.COM (Tom White) wrote:
> Hello everyone,
>
> As you know, I have been trying to fit a predictive model to health care fraud using GLIMMIX.
>
> One of the assumptions is that the predictor variables in the MODEL statement
> are linearly related to the RESPONSE dependent variable (in my case I call it FRAUD).
>
> FRAUD takes on values
>
> 0 (claim submitted by doctor is not fraudulent--about 99% of my population)
> 1 (claim submitted by doctor is fraudulent--about 1% of my population)
>
> I have about say 100K doctors (DOC_ID) and many more patients (PAT_ID) seen by these doctors.
>
> An example model is:
>
> proc glimmix data=MYDATA;
> class PAT_SEX;
> model FRAUD=PAT_AGE PAT_SEX NUM_YNGPATS NUM_PATS / dist=bin link=logit;
> random intercept / subject = DOC_ID;
> random intercept / subject = PAT_ID; /* Per Dale's suggestion last week or the week before */
> run;
>
> So, I'm trying to fit something like the above model.
>
> PAT_AGE and PAT_SEX are patient predictors (PAT_SEX = M or F).
>
> NUM_YNGPATS = The # of young patients the doctor sees
> NUM_PATS = The total # of patients the doctor has
>
> NUM_YNGPATS & NUM_PATS are, of course, group level predictors at the doctor (DOC_ID) level,
> i.e., all claims under the same DOC_ID will be populated by the same NUM_YNGPATS & NUM_PATS values.
>
> NUM_YNGPATS & NUM_PATS are continuous variables ranging in value
> from 1 to some upper bound (say, 40 young pats and 60 NUM_PATS).
>
> So, that's pretty much the background information.
>
> Now to my question:
> How do I check the linearity assumption associated with this logistic model for my above
> predictors PAT_AGE (and similarly NUM_YNGPATS & NUM_PATS) and the categorical PAT_SEX (M or F).
>
> For the sake of this discussion assume that I have many obs so I have, say, 50K obs I'm working with.
>
> How do I check that the 50K obs of the above predictors are linearly related to the FRAUD variable?
>
> If, per your explanation, a few of my predictors are not linearly related to FRAUD,
> do I then try to find some transformation (say take the log or something) for, say, NUM_YNGPATS
> so that
>
> NUM_YNGPATS_TR=log(NUM_YNGPATS)
>
> check linearity again and then use NUM_YNGPATS_TR in the model statement above?
>
> Thank you for your thoughts.
>
> T

### sas >> PROC GLIMMIX--How to Check Linearity Assumption

Yes, this is most certainly one way to examine the linearity
assumption. This approach works best if you have only a single
predictor variable. In the multivariate setting, where the effect
of each predictor is conditional on the effects of the other
predictors, this approach may not work as well. Also, I don't see
the need for more than 10 bins in most circumstances.

Another way to examine whether linearity holds is to include terms
in your model which represent some departure from linearity. If
there is significant improvement in the model fit when these terms
are included, that is evidence that the assumption of linearity in
the predictors does not hold. Often, this is done simply by including
polynomials of your predictors. A slightly more sophisticated approach
is to employ a spline basis for representing nonlinearity. My
favorite spline basis is restricted cubic splines, as discussed
in

Harrell, Frank. "Regression Modeling Strategies: With Applications
to Linear Models, Logistic Regression, and Survival Analysis."
Springer, 2001.

There are a couple of SAS macros available for generating splined
variables. Go to
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/SasMacros
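
As a minimal sketch of the polynomial approach (not the macros mentioned above), one could add a squared term for a continuous predictor to the model from the question and look at its test. The variable and data set names are taken from the thread; putting DOC_ID and PAT_ID on the CLASS statement and using EVENT='1' (so that the probability of FRAUD=1 is modeled) are assumptions of the example.

```sas
/* Sketch only: the quadratic term is one illustrative departure from linearity. */
proc glimmix data=MYDATA;
   class PAT_SEX DOC_ID PAT_ID;
   model FRAUD(event='1') = PAT_AGE PAT_AGE*PAT_AGE PAT_SEX
                            NUM_YNGPATS NUM_PATS
         / dist=binary link=logit solution;
   random intercept / subject=DOC_ID;
   random intercept / subject=PAT_ID;
run;
```

If the test for PAT_AGE*PAT_AGE is clearly significant, that suggests the log odds of FRAUD is not linear in PAT_AGE; the same can be repeated for NUM_YNGPATS and NUM_PATS.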

HTH,

Dale

---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: XXXX@XXXXX.COM
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------

### sas >> PROC GLIMMIX--How to Check Linearity Assumption

Wensui,

If you have p=0 or p=1 in a bin, then you need to construct fewer
(and wider) bins such that you can compute the log-odds in every bin.
Alternatively, as presented in my previous post, add terms such
as polynomials or splines which account for nonlinearity. If
those terms are significant, then there is evidence of nonlinearity.

Dale


### sas >> PROC GLIMMIX--How to Check Linearity Assumption

Why? I don't see any reason why fewer bins will not work. Widening
the bins should not affect the linearity of the log odds. It may
produce a coarse look at the linearity assumption, but that may be
just what is needed. If there are few successes, a small change
in the number of positive responses can have a big effect on the
log odds. By widening your bins, you should increase the stability
of the estimates of p (and, hence, of the log odds), which should
improve your ability to examine the assumption of linearity.
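
To put a rough number on the stability argument, one can use the standard large-sample approximation for the standard error of an empirical log odds. This formula is not stated in the thread; here n is the number of observations in a bin and p-hat the observed proportion of events in that bin:

$$
\mathrm{SE}\!\left(\log\frac{\hat p}{1-\hat p}\right) \approx \sqrt{\frac{1}{n\hat p} + \frac{1}{n(1-\hat p)}}
$$

Taking the figures from the question purely as an illustration (50K obs, about 1% fraud), 30 equal-sized bins hold about 17 events each, for an SE of roughly 0.25, while 10 bins hold about 50 events each, for an SE of roughly 0.14.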

Dale

### sas >> PROC GLIMMIX--How to Check Linearity Assumption

For a categorical predictor variable, there is not a question about
linearity of the effect estimate. Take gender. There are only
two genders. (Well, I guess hermaphrodites might count as a
separate gender. But the typical analysis does not record gender
with a hermaphrodite check box.) When gender is employed in an
analysis, you obtain a mean effect for males and a mean effect for
females. We are not interested in a mean effect for 2/3 male and
1/3 female. Therefore, we would not have any interest in testing
linearity for such a variable.

It is only when the predictor variable is at least ordinal that
the notion of linearity begins to present itself. If you have
an ordinal variable, then you can examine linearity employing
orthogonal polynomials in a CONTRAST statement. You might want to
Google "orthogonal polynomials sas contrast" (without the quotation
marks) for examples of using orthogonal polynomials to test
linearity assumptions in SAS.
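
A minimal sketch of such contrasts, assuming a hypothetical data set MYDATA_BINNED in which AGE_BIN is an ordinal CLASS variable with exactly five equally spaced levels (the coefficients below are the standard orthogonal polynomial coefficients for five levels and would need to change for a different number of levels or unequal spacing):

```sas
/* Sketch only: MYDATA_BINNED and the five-level AGE_BIN are assumptions.
   Coefficients are listed in the sorted order of the AGE_BIN levels. */
proc glimmix data=MYDATA_BINNED;
   class AGE_BIN DOC_ID;
   model FRAUD(event='1') = AGE_BIN / dist=binary link=logit solution;
   random intercept / subject=DOC_ID;
   contrast 'linear'    AGE_BIN -2 -1  0  1  2;
   contrast 'quadratic' AGE_BIN  2 -1 -2 -1  2;
   contrast 'cubic'     AGE_BIN -1  2  0 -2  1;
run;
```

A significant quadratic or cubic contrast points to a departure from linearity across the ordered levels.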

You do not have to search hard to obtain the required log odds values.
When you include the SOLUTION (or S) option on the MODEL statement,
the parameter estimates produced by the GLIMMIX procedure are on the
log odds scale: each bin's estimate is its log odds expressed as a
difference from the reference level. Note that the estimate will
therefore be zero for your reference level. As you move away from
your reference level, you are looking for the parameter estimates to
increase/decrease uniformly, assuming that the bin centers are
(approximately) equally spaced.
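
For example, the binned model from the question might be run like this to capture those estimates in a data set. MYDATA_BINNED and PE are hypothetical names, and listing DOC_ID and PAT_ID on the CLASS statement and specifying EVENT='1' are assumptions of the sketch:

```sas
/* Sketch only: save the fixed-effect solutions requested by SOLUTION. */
ods output ParameterEstimates=PE;

proc glimmix data=MYDATA_BINNED;
   class AGE_BIN DOC_ID PAT_ID;
   model FRAUD(event='1') = AGE_BIN / dist=binary link=logit solution;
   random intercept / subject=DOC_ID;
   random intercept / subject=PAT_ID;
run;

/* One row per AGE_BIN level (plus the intercept); the reference
   level appears with an estimate of zero. */
proc print data=PE;
run;
```

The Estimate column for the AGE_BIN rows can then be plotted against the bin centers to judge whether the steps are roughly uniform.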

Dale

### sas >> PROC GLIMMIX--How to Check Linearity Assumption

Thank you Dale. I was aware of the S option but I haven't had the time
yet to investigate it in detail. I thought it had something to do with
making the GLIMMIX procedure run more efficiently in the presence of
large data sets (millions of records). I am trying very hard to learn
this stuff in a short period of time ... Tom

----- Original Message -----
From: "Dale McLerran"
To: XXXX@XXXXX.COM
Subject: Re: Re: PROC GLIMMIX--How to Check Linearity Assumption
Date: Mon, 19 May 2008 11:19:40 -0700 (PDT)

--- Tom White wrote:

> Wensui and Dale,
>
> Thank you both (and Shiling Zhang) for your thoughts on this.
>
> One more observation on my part followed by a question:
>
> The binning discussed here works for continuous predictors like
> dollar amount, age, etc.
> Am I right?
>
> How about for categorical predictors like SEX (Male or Female)?
> Will Dale's idea work if I use, say, SEX*SEX (a quadratic predictor)
> on a categorical predictor like this?
>
> Clearly, in this case, I don't see how this variable could possibly
> be linearly related to my FRAUD (0 and 1) .
>

> My question:
> How do I compute the logodds of FRAUD after I do the binning and
> on my way to check linearity
> according to Shiling Zhang?
>
> EXAMPLE: Take AGE as my predictor grouped into 10 groups.
> Call this binned variable AGE_BIN.
> To compute the logodds of FRAUD I fit the model:
> proc glimmix data=MYDATA;
> class AGE_BIN;
> model FRAUD=AGE_BIN/dist= ...;
> random intercept / subject = DOC_ID;
> random intercept / subject = PAT_ID;
> run;
>
> Am I right so far?
>
> From this procedure, I can request the predicted probabilities,
> but they will already be un-transformed from log(p/(1-p)) to actual
> probabilities between 0 and 1.
>
> How do I get my hands on log(p/(1-p)) so I can plot it against AGE_BIN
> per Shiling Zhang?

```I need to check the assumption of Cox model for the treatment group,
which is the main variable of interest, and there are also quite a lot
of covariates in the data set. The follow-up time (survival time) is days
from randomization to either diagnosis of disease or censoring.

Is there a way that I can perform this assumption check in SAS? I
thought I could do a plot of the residuals of survival time against time?
Will this work?

I appreciate any input from you folks!!!

```

```Dear all,

Is there an easy explanation why the REPEATED statement has disappeared
when the macro GLIMMIX was transferred into a genuine procedure?

Ulf

```