sas >> Conducting statistical test on "large" datasets

by gregor » Thu, 29 Apr 2004 15:06:55 GMT

Hi SAS folk!

Statistics is a nice thing, but I have problems with it when I have
large datasets. What counts as "large" can be discussed, but that is another
story.

I have noticed that with large datasets more or less everything turns out to be
statistically significant. However, I cannot be sure whether these significant
differences are also important in real life.

Dealing with this issue for quite some time, I have noticed that residuals from
ordinary linear models are distributed more or less normally, i.e. the fit
of observed and expected values looks OK, but the EDF tests say that this is not
the case. I found another test, the Jarque-Bera test, which performs better.
I have provided code for the macro below.

Does anyone have any comments on my "problems"? I am wondering if there are any
tests for other distributions, e.g. lognormal, gamma, Weibull, ... (see the
sketch after the macro below). I also wonder if there is any general method/test
that helps us determine whether an effect is significant merely because of the
large number of observations or is really significant.

With regards, Gregor

* %normal_Jarque-Bera.sas
*---------------------------------------------------------------------------
* $Id: %normal_Jarque-Bera.sas,v 1.4 2004/03/23 08:02:47 gregor Exp $
*---------------------------------------------------------------------------
* What: Macro for the Jarque-Bera (skewness-kurtosis) test for normality.
* This test accounts for a large number of observations and in such
* situations performs better than the Kolmogorov-Smirnov test; the latter
* tends to reject the null hypothesis when N is large.
*--------------------------------------------------------------------------;

%put . * Jarque_Bera(data, var);

%macro Jarque_Bera(data, var);

%* --- Compute moments and other basic statistics ---;
%* (n= counts only nonmissing observations, which is what the test needs);
proc univariate data=&data noprint;
  var &var;
  output out=tmp_jarque_bera n=nobs mean=mean var=var std=std
         stdmean=stdmean min=min max=max range=range skewness=skewness
         kurtosis=kurtosis mode=mode median=median;
run;

%* --- Test statistic and p-value ---;
%* PROC UNIVARIATE reports excess kurtosis, so no subtraction of 3 is needed;
data tmp_jarque_bera;
  set tmp_jarque_bera;
  chi_data = (((skewness*skewness)/6) + ((kurtosis*kurtosis)/24)) * nobs;
  p = 1 - probchi(chi_data, 2);
run;

%* --- Printout and cleanup ---;
proc sql;
  title " --- Jarque-Bera test for normality for variable &var (&data) ---";
  select nobs, mean, std, min, max, skewness, kurtosis, chi_data, p
    from tmp_jarque_bera;
  drop table tmp_jarque_bera;
quit;

%mend;

*--------------------------------------------------------------------------;
* %normal_Jarque-Bera.sas ends here ;
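
The macro above would be called as, for example, %Jarque_Bera(mydata, resid),
where mydata and resid are hypothetical names. On the question about tests for
other distributions: one option is PROC UNIVARIATE's HISTOGRAM statement, which
fits lognormal, gamma and Weibull distributions and prints EDF goodness-of-fit
tests for each fit (the same large-N caveat applies to those tests). A minimal
sketch, assuming a positive-valued variable y in a hypothetical dataset mydata:

* A minimal sketch -- MYDATA and Y are hypothetical names. ;
proc univariate data=mydata;
  var y;
  histogram y / lognormal gamma weibull;
  * prints parameter estimates and EDF goodness-of-fit tests for each fit;
run;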

sas >> Conducting statistical test on "large" datasets

by HERMANS1 » Fri, 30 Apr 2004 01:59:55 GMT


Gregor:
I recall being impressed initially when statistical estimates of odds ratios
based on millions of observations turned out to have lower confidence limits
a little greater than one. In time I observed that many weak models have
statistical significance.

As a practical matter, all one needs to do is to increase the level of
evidence required to reject a null hypothesis; for example, require 99%
instead of 95% confidence. This makes perfect sense in the Bayesian world,
where '...extraordinary claims require extraordinary evidence ...'.
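
In SAS this is just the ALPHA= option on the modelling procedure. A minimal
sketch for the odds-ratio situation above, with hypothetical dataset and
variable names (mydata, outcome, exposure):

* A minimal sketch -- dataset and variable names are hypothetical. ;
proc logistic data=mydata;
  class exposure / param=ref;
  model outcome(event='1') = exposure / alpha=0.01 clodds=wald;
  * ALPHA=0.01 requests 99% Wald confidence limits for the odds ratios;
run;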

Of course one should understand that statistical significance has much more
significance when data are scarce. Statistical significance tests are really
asking "Do we have a large enough sample to make it highly unlikely that
many other samples of the same size from the same population would give us
different results?" A large sample makes that question moot, but it opens up
other questions:
- would the large sample, split on an attribute (say, geographic region),
  fall into large subsets, one of which gives us a different result?
- are some of the variables in a model (say, level of education and brand
  preference) acting as a proxy for an omitted variable (say, income)?

Strictly to illustrate some model validity and reliability concerns in
unrelated disciplines, see
http://www.dmreview.com/article_sub.cfm?articleId=4933
and
http://trochim.human.cornell.edu/tutorial/driebe/tweb3.htm

Sig


sas >> Conducting statistical test on "large" datasets

by Jay Weedon » Fri, 30 Apr 2004 02:36:04 GMT

On 29 Apr 04 07:06:55 GMT, XXXX@XXXXX.COM (Gregor Gorjanc) wrote:



Normality of residuals isn't really an issue for very large data sets,
because of the central limit theorem. As you suggest, an effect of
almost any size will be statistically significant with a huge sample,
which makes inferential statistics less important. You need to decide
whether the observed effect sizes are scientifically meaningful.
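
One way to put a number on that decision is to compute an effect size and
compare it with a threshold fixed in advance. A minimal sketch for a two-group
comparison, assuming a hypothetical dataset mydata with exactly two groups in
grp and a response y (not code from the original post):

* A minimal sketch -- assumes exactly two groups; all names are hypothetical. ;
proc means data=mydata noprint nway;
  class grp;
  var y;
  output out=stats mean=mean std=std n=n;
run;

data effect;
  set stats end=last;
  retain m1 s1 n1;
  if _n_ = 1 then do;
    * first group: remember its summaries;
    m1 = mean; s1 = std; n1 = n;
  end;
  else if last then do;
    * second group: pooled SD and Cohen's d;
    pooled_sd = sqrt(((n1 - 1)*s1**2 + (n - 1)*std**2) / (n1 + n - 2));
    cohen_d   = (mean - m1) / pooled_sd;
    output;
  end;
run;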

JW

sas >> Conducting statistical test on "large" datasets

by jim.groeneveld » Fri, 07 May 2004 17:00:44 GMT

Hi Gregor,

Yes, the larger your sample, the more power you have (more than minimally necessary) and the greater the chance of finding statistically significant results. And these results do represent significant effects, however small they may be in absolute terms. In addition to statistical significance, you have to set, in advance, some minimum effect size that you consider conceptually significant. This is just a matter of your own decision: which effects are really interesting?

If you don't have a large sample but rather the whole population, the usual statistics do not apply. Any effect you observe (e.g. a difference) is really present, but it may be too small to be interesting or meaningful.

You could also take one or more smaller samples from your large one and perform the significance tests on them, finding statistical significance in most instances, but not necessarily always.

So, when testing hypotheses that you have specified in advance, you usually gather just enough (or a little more) data to test them. If you already have a large bunch of data in which you are searching for significant effects in a more or less exploratory way, significance does not have the same value as with explicit hypotheses. Exploratory "significance" can only be an indication of a possibly present effect, which still has to be confirmed by new, independent research with firm hypotheses. You might use one part of the large amount of data (a representative sample) to search for potential effects, explain what you might have found, build a theory on it and deduce firm hypotheses, and then test those on quite another part (a representative sample as well) of the same large amount of data.
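
A minimal sketch of that split, using PROC SURVEYSELECT on a hypothetical
dataset bigdata; OUTALL keeps every record and flags the selected half, so the
two parts can be written to separate exploration and confirmation datasets:

* A minimal sketch -- BIGDATA is a hypothetical dataset name. ;
proc surveyselect data=bigdata out=split samprate=0.5 outall seed=20040507;
run;

data explore confirm;
  set split;
  if selected then output explore;   * exploratory half: search for effects, build hypotheses;
  else output confirm;               * confirmatory half: test the pre-specified hypotheses;
run;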

I hope these thoughts may help you a little.

Regards - Jim.
. . . . . . . . . . . . . . . .

Jim Groeneveld, MSc.
Biostatistician
Science Team
Vitatron B.V.
Meander 1051
6825 MJ Arnhem
Tel: +31/0 26 376 7365
Fax: +31/0 26 376 7305
XXXX@XXXXX.COM
www.vitatron.com

My computer remains home, but I will attend SUGI 2004.

[common disclaimer]


Similar Threads

1. conducting difference of proportions test with weighted

2. conducting difference of proportions test with weighted survey

3. How to conduct t-test using SAS

On NHANES data.

SAS-9.1.3 sp4

Appreciate it.

4. Using SAS with large datasets (linking SAS and Access)

5. Using SAS with large datasets

Hi,

  I normally use SAS on datasets with only 1,000-10,000 subjects/lines of
data.  When I run SAS on such datasets, the calculations usually complete
within seconds, or at most minutes.

  Now I am working on a dataset that has ~500,000 entries.  It is stored
within an MS-Access database.  When I tried to simply read in the relevant
table from the Access db, it took over 10 hours in real time!  Then I tried
a simple PROC MEANS, and that has been running for over two hours, and is
still not complete!

  Is this performance typical with large datasets?  If not, how can one
improve the situation (besides getting a faster computer, more memory,
etc)?  What are your experiences?

  Thanks,

  Howard Alper
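
One thing that usually helps is to pull the Access table into a native SAS
dataset once and run all analyses against that copy, instead of going through
the Access engine every time. A minimal sketch with PROC IMPORT (database path
and table name are hypothetical; requires SAS/ACCESS Interface to PC Files):

* A minimal sketch -- database path and table name are hypothetical. ;
proc import out=work.mytable
            datatable="MyTable"
            dbms=access
            replace;
  database="c:\data\mydb.mdb";
run;

proc means data=work.mytable;   * now runs against the local SAS copy;
run;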

6. Efficiency with large datasets

7. Fwd: Techniques for dealing with large clinical lab datasets

 Hey Bob,

Have you turned on dataset compression?  I have found that datasets with
large numbers of repeating elements are actually accessed more quickly with
compression turned on, in addition to (and because of) the space
savings...  Of course, certain datasets with a high number of distinct values
and very little text data can actually be larger and slower when
compressed...

Regards,
Stephen
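
A minimal sketch of the compression suggestion, plus an index as mentioned in
the quoted question below; dataset and variable names are hypothetical:

* A minimal sketch -- dataset and variable names are hypothetical. ;
data chem_c (compress=yes);   * COMPRESS=BINARY can do better for mostly numeric records;
  set chem;
run;

proc datasets library=work nolist;
  modify chem_c;
  index create lbtestcd;      * index on a hypothetical lab test code variable;
quit;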


  On Wed, Apr 2, 2008 at 1:02 PM, < XXXX@XXXXX.COM > wrote:

> We have some Phase 3 trials, with 1000 or more patients in each trial, and
> durations of a year or more. This means the volume of lab data will be
> (already is) quite large. I have already split the overall lab dataset
> into chemistry, hematology, urinalysis, etc. and added indexing to
> retrieve specific lab tests, but is there anything else that I should be
> doing? Is hashing applicable here?
>
> Thanks,
>
> Bob Abelson
> HGSI
> 240 314 4400 x1374
>  XXXX@XXXXX.COM 
>

8. Techniques for dealing with large clinical lab datasets