Explore the distribution of trio-sharing (K=3) SMOR

14 May 2015

The data set contains the maxSumWtSMOR values calculated for 10M randomly selected all-control trios (i.e., they do not have cv-harboring IBD haplotypes) using K3 SMOR DB table Kirk created. I used R fitdistrplus::descdist to examine parametric distributions most similar to the empirical data. It turns out that, after removing the data points greater than the 0.999 quantile, the rest seems to resemble a beta distribution (see the figure below; R code used to create the figure is behind)


quant999 <- quantile(emp_val,0.999)
emp_val_lt999quant <- emp_val[emp_val<quant999]
descdist(emp_val_lt999quant, discrete=F)

Let’s fit the data to a beta distribution and examine the goodness of fit. Below is my first try


fit_beta <- fitdist(emp_val_lt999quant, "beta")
summary(fit_beta)
plot(fit_beta)

However, using all data points in emp_val_lt999quant leads to an error message


Error in start.arg.default(x, distr = distname) :
  values must be in [0-1] to fit a beta distribution

Taking the advice from the FAQ page of “fitdistrplus”, I tried two bypassing approaches: either to remove impossible values (this means I need to remove even more tail values…Hmm…not sure whether I should do that…But, likely (very likely) K3 SMOR is a mixture of multiple (beta or else) distributions, so partitioning the dataset may not be as crazy as it looks like…) or to shift/scale the data (this option is recommended by the authors of that R package).


## Option 1
fit1<-fitdist(emp_val_lt999quant[emp_val_lt999quant > 0 & emp_val_lt999quant < 1], "beta")
summary(fit1)
plot(fit1)
## Result of Option 1 (see the plot below)
Fitting of the distribution ' beta ' by maximum likelihood
Parameters:
        estimate   Std. Error
shape1 0.5000635 0.0001928784
shape2 2.2352214 0.0011730171
## Option 2
rescaled <- (emp_val_lt999quant - min(emp_val_lt999quant)*1.01)/(max(emp_val_lt999quant)*1.01 - min(emp_val_lt999quant)*1.01)
rescaled <- rescaled[rescaled > 0 & rescaled < 1]
fit2<-fitdist(rescaled, "beta")
summary(fit2)
plot(fit2)
## Result of Option 2
Fitting of the distribution ' beta ' by maximum likelihood
Parameters :
        estimate   Std. Error
shape1 0.4790209 0.0001777215
shape2 7.4899576 0.0042315218
Loglikelihood:  19965803   AIC:  -39931602   BIC:  -39931574
Correlation matrix:
          shape1    shape2
shape1 1.0000000 0.6148664
shape2 0.6148664 1.0000000

Option 1 diagnosis plots Option 2 diagnosis plots It seems that Option 1 is better than Qption 2 with regard to the QQ plot. Actually I’m not satisfied with either of them. Eventually I did the following: (1) do a log transformation of all the maxSumWtSMOR values larger than 0 (exclusive); (2) run descdist again and it seems to be a beta (or a uniform, but the latter does not play out when I tried to fit it); (3) do a min-max normalization of the log values as suggested; (4) fit a beta distribution. Below are the code and the result.


log_emp_val <- log(emp_val[emp_val>0])
descdist(log_emp_val,discrete=F)
#It seems to be a beta distribution
rescaled <- (log_emp_val - min(log_emp_val)*1.01)/(max(log_emp_val)*1.01 - min(log_emp_val)*1.01)
##summary(rescaled), to make sure it is indeed within (0,1)
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.0070  0.4501  0.5612  0.5484  0.6293  0.9971
fit_chr1<-fitdist(rescaled, "beta")
summary(fit_chr1)
Fitting of the distribution ' beta ' by maximum likelihood
Parameters :
       estimate  Std. Error
shape1 12.58162 0.005623613
shape2 10.35302 0.004607655
Loglikelihood:  8549512   AIC:  -17099020   BIC:  -17098992
Correlation matrix:
          shape1    shape2
shape1 1.0000000 0.9569802
shape2 0.9569802 1.0000000

The Cullen and Frey graph

The diagnostic plot (you can see the right tail area is still a little off on the QQ plot, but it looks better; we don't care about the left tail)

I used the said approach to examine the distribution of the natural logarithm of maxSumWtSMOR given K=2, 3, and 4 (single marker SMOR is retrieved from respective K=2/3/4 Database table Kirk has created. Below is some of the fitting I’ve encountered


  • Fit a normal distribution: the data (supposed it's a vector; referred to as x hereafter exchangeably) does not need to be prepared, unless you want a standard normal
  • dist_norm=fitdist(data, "norm")
  • Fit a normal distribution: the data does not need to be prepared, as fitting a normal distribution
  • dist_logis=fitdist(data, "logis")
  • Fit a beta distribution: the data needs to be adjusted if they are not between 0 and 1 (exclusive). You can use the following general code to do the max-min normalization by giving your own new_start and new_end (e.g., 0.001 and 0.999)
  • rescale <- function(x, new_start, new_end){ rescaled.data=(new_end-new_start)*(x-min(x))/(max(x)-min(x))+new_start rescaled.data } dist_beta=fitdist(rescale(data, 0.001, 0.999), "beta")
  • Fit a lognormal or gamma distribution: the data need to be all positive. It means you probably can only fit that part of the data
  • dist_ln <- fitdist(x[x>0], "lnorm") dist_ga <- fitdist(x[x>0], "gamma") </code></pre>

    Skewness and Kurtosis

    The Gullen-Frey graph gives you an idea which theoretical distribution may fit your emipricial data based on two measures of shapes--skewness and kurtosis. Now let's discuss the distribution of loge(maxSumWtSMOR) in various scenarios regarding these two measures.

    Reference