On Some New Hybridized Regression Estimation and Feature Selection Techniques
Abstract:- Conventional regularization techniques like LASSO, SCAD and MCP have been shown to perform poorly in the presence of extremely large or ultra-high dimensional covariates. This has created the need for, and led to the development of and reliance on, filtering techniques such as screening. Screening techniques (such as SIS, DC-SIS and DC-RoSIS) have been shown to reduce the computational complexity of selecting important covariates from ultrahigh dimensional candidates. To this end, there have been various attempts to hybridize the conventional regularization techniques. In this paper, we combine regularization techniques (LASSO and SCAD) with a screening technique (DC-RoSIS) to form new hybrid methods with a view to achieving better dimension reduction and variable selection simultaneously. Extensive simulation results and real life data applications show that the proposed methods perform better than the conventional methods.

Keywords:- Regularization Techniques, Screening Technique, LASSO DC-RoSIS, SCAD DC-RoSIS.

I. INTRODUCTION

When the number of predictors rivals or exceeds 𝑛 (the number of observations), we often seek, for the sake of interpretation, a smaller set of variables. Hence, we want our fitting procedure to make only a subset of the coefficients large and keep the others small or even zero. These are the shortcomings of high-dimensionality in the regression setting: the traditional method (OLS) tends to overfit the model, and the method becomes unusable when the coefficient estimates are no longer unique and their variance becomes infinite.

To deal with such problems, coefficient shrinkage (regularization) is employed to shrink the estimated coefficients towards zero relative to the least squares estimates. Depending on the type of shrinkage performed, these procedures are capable of reducing the variance and can also perform variable selection. Some of these procedures, such as the least absolute shrinkage and selection operator (LASSO), the SCAD (smoothly clipped absolute deviation) (Fan and Li, 2001) [2] and the MCP (minimax concave penalty) (Zhang, 2010) [3], enable variable selection such that only the important predictor variables stay in the model (Szymczack et al., 2009) [1].
II. METHODOLOGY
This section presents the methodology employed in this paper with a focus on the traditional linear regression techniques.
Linear Regression
Consider the multiple linear regression model, where 𝑌 denotes the response variable (also called the dependent variable) and
𝑋1 , 𝑋2 …, 𝑋𝑝 , denote the explanatory variables (also called predictors, features or independent variables). The relationship
between 𝑌 and 𝑋1 , 𝑋2 …, 𝑋𝑝 can be expressed as
𝑌 = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝 + 𝜀 (1)
The parameters 𝛽0 , 𝛽1 , … , 𝛽𝑝 are called regression coefficients and 𝜀 is the random error term.
Given a data set {(𝑦𝑖 , 𝑥𝑖1 , 𝑥𝑖2 , … , 𝑥𝑖𝑝)}, 𝑖 = 1, … , 𝑛, of 𝑛 statistical units, each statistical unit can be expressed as

𝑦𝑖 = 𝛽0 + 𝛽1 𝑥𝑖1 + 𝛽2 𝑥𝑖2 + ⋯ + 𝛽𝑝 𝑥𝑖𝑝 + 𝜀𝑖 , 𝑖 = 1, 2, … , 𝑛, (2)

where 𝑦𝑖 is the 𝑖th response observation and 𝛽0 , 𝛽1 , … , 𝛽𝑝 are the unknown parameters. In matrix form, the model can be written compactly as

𝑌 = 𝑋𝛽 + 𝜀 (3)

The model rests on the following assumptions.
Linearity:
The relationship between the explanatory variables and the response variable is linear. This restriction applies to the parameters (not to the explanatory variables), since the explanatory variables are regarded as fixed values.
Independence:
There are two types of independence: the error terms are independent of one another, and the error terms are independent of the explanatory variables.
Normality:
The error terms follow a normal distribution,

𝜀𝑖 ~ 𝑁(0, 𝜎²), 𝑖 = 1, … , 𝑛,

so that the covariance matrix of the error vector is the diagonal matrix

𝑉𝑎𝑟(𝜀) = diag(𝜎², 𝜎², … , 𝜎²) = 𝜎²𝐼𝑛 .
Equal Variance:
Error terms are assumed to have equal variances.
The ordinary Least Squares (OLS) is the traditional technique used to estimate the parameters of the multiple linear
regression model. The OLS estimator, which minimizes the residual sum of squares (𝑌 − 𝑋𝛽)ᵀ(𝑌 − 𝑋𝛽), is given as

𝛽̂𝑂𝐿𝑆 = (𝑋ᵀ𝑋)⁻¹𝑋ᵀ𝑌.
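As an illustration of the estimator above, the following R sketch (ours, not part of the original text) computes 𝛽̂ = (𝑋ᵀ𝑋)⁻¹𝑋ᵀ𝑌 directly on simulated data and checks it against R's lm(); all data-generating values are arbitrary.

# Illustrative sketch: the OLS estimator computed directly and via lm()
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))   # design matrix with an intercept column
beta <- c(2, 1, -0.5, 3)
y <- as.vector(X %*% beta + rnorm(n))
beta_ols <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'Y
beta_lm  <- coef(lm(y ~ X - 1))             # same estimate via lm(); intercept already in X
cbind(beta_ols, beta_lm)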
Penalization Methods
We consider a linear regression model with 𝑛 observations on a dependent variable 𝑌 and 𝑝 predictors. Penalized regression approaches have been used both in cases where 𝑝 < 𝑛 and in cases where 𝑝 ≥ 𝑛. In general, Penalized Least Squares (PLS) aims to minimize the residual sum of squares

(𝑌 − 𝑋𝛽)ᵀ(𝑌 − 𝑋𝛽)

subject to 𝑃𝑒𝑛(𝛽) ≤ 𝑡, where 𝑃𝑒𝑛(𝛽) (the specific penalty) is a function of 𝛽 = (𝛽0, 𝛽1, … , 𝛽𝑝)ᵀ and 𝑡 is a tuning parameter. This constrained optimization problem can be solved with the equivalent Lagrangian formulation, which minimizes

𝑃𝐿𝑆 = 𝑂𝐿𝑆 + 𝑃𝑒𝑛𝑎𝑙𝑡𝑦 = (𝑌 − 𝑋𝛽)ᵀ(𝑌 − 𝑋𝛽) + 𝜆𝑃𝑒𝑛(𝛽), (5)

where 𝜆 is a tuning parameter that controls the strength of shrinkage. For example, when 𝜆 = 0, no penalty is applied and we obtain the ordinary least squares solution. When 𝜆 gets larger, more weight is given to the penalty term. Desirable properties of penalization include variable selection and the grouping effect.
LASSO Penalty
The Least Absolute Shrinkage and Selection Operator (LASSO) regression method was introduced by Tibshirani (1996) as
an estimation and variable selection method. It is also called L1 penalized regression. The LASSO is a penalized least squares
procedure that minimizes RSS subject to the non-differentiable constraint expressed in terms of the L1 norm of the coefficients.
The penalty function is given by

𝑃𝑒𝑛(𝛽) = ∑_{𝑗=1}^{𝑝} |𝛽𝑗|,

so that the LASSO estimator minimizes (𝑌 − 𝑋𝛽)ᵀ(𝑌 − 𝑋𝛽) + 𝜆 ∑_{𝑗=1}^{𝑝} |𝛽𝑗|. When 𝜆 = 0,

𝑀𝑆𝐸(𝛽̂𝐿𝐴𝑆𝑆𝑂) = 𝑀𝑆𝐸(𝛽̂𝑂𝐿𝑆),

and when 𝜆 → ∞,

𝑀𝑆𝐸(𝛽̂𝐿𝐴𝑆𝑆𝑂) = 𝑡𝑟𝑎𝑐𝑒(𝑉𝑎𝑟(𝛽̂𝐿𝐴𝑆𝑆𝑂)) + 𝐵𝑖𝑎𝑠ᵀ(𝛽̂𝐿𝐴𝑆𝑆𝑂)𝐵𝑖𝑎𝑠(𝛽̂𝐿𝐴𝑆𝑆𝑂), with 𝑡𝑟𝑎𝑐𝑒(𝑉𝑎𝑟(𝛽̂𝐿𝐴𝑆𝑆𝑂)) → 0.

Since 𝐵𝑖𝑎𝑠ᵀ(𝛽̂𝐿𝐴𝑆𝑆𝑂)𝐵𝑖𝑎𝑠(𝛽̂𝐿𝐴𝑆𝑆𝑂) and 𝑡𝑟𝑎𝑐𝑒(𝑉𝑎𝑟(𝛽̂𝐿𝐴𝑆𝑆𝑂)) move in opposite directions as the tuning parameter 𝜆 increases, we can choose an optimal value of 𝜆 that minimizes 𝑀𝑆𝐸(𝛽̂𝐿𝐴𝑆𝑆𝑂).
SCAD Penalty
The smoothly clipped absolute deviation (SCAD) penalty of Fan and Li (2001) is commonly defined through its derivative,

𝑝′𝜆(|𝛽|) = 𝜆{𝐼(|𝛽| ≤ 𝜆) + (𝑎𝜆 − |𝛽|)₊ / ((𝑎 − 1)𝜆) 𝐼(|𝛽| > 𝜆)},

where 𝐼(∙) is the indicator function and a = 3.7 is suggested by Fan and Li (2001).
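For concreteness, a hedged R sketch of fitting the LASSO and SCAD penalized regressions is given below; it assumes the CRAN packages glmnet and ncvreg are available, selects 𝜆 by 10-fold cross-validation, and uses simulated data whose dimensions are arbitrary.

# Hedged sketch: LASSO via glmnet and SCAD via ncvreg, lambda chosen by cross-validation
library(glmnet)    # LASSO (alpha = 1)
library(ncvreg)    # SCAD and MCP penalties
set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))
y <- drop(X %*% beta) + rnorm(n)
cv_lasso   <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
beta_lasso <- coef(cv_lasso, s = "lambda.min")                 # sparse LASSO coefficients
cv_scad    <- cv.ncvreg(X, y, penalty = "SCAD", gamma = 3.7, nfolds = 10)
beta_scad  <- coef(cv_scad)                                    # SCAD coefficients at the CV-optimal lambda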
Penalized M-Estimation
It is common for the response variable in a regression problem to contain outliers. The OLS procedure and penalized
methods discussed earlier do not perform adequately when there are outliers in the response data. One robust approach that
handles the problem of outliers is M-Estimation. The letter M indicates that M estimation is an estimation of the maximum
likelihood type. The M-estimation principle is to minimize the residual function
𝛽̂𝑀 = argmin_𝛽 ∑_{𝑖=1}^{𝑛} 𝜌( (𝑦𝑖 − 𝛽0 − ∑_{𝑗=1}^{𝑝} 𝛽𝑗 𝑥𝑖𝑗) / 𝜎 ). (10)
If the 𝜌 function can be differentiated, the M-estimator is said to be a 𝜓-type. Otherwise, the M-estimator is said to be a 𝜌-
type. Note that the OLS estimator is a special case of the M-estimator.
Common 𝜌 functions are Tukey's bisquare, Andrews' and Huber's functions. Tukey's bisquare 𝜌 function is given as

𝜌(𝑟𝑖) = (𝑐²/6){1 − [1 − (𝑟𝑖/𝑐)²]³}, if |𝑟𝑖| ≤ 𝑐,
𝜌(𝑟𝑖) = 𝑐²/6, if |𝑟𝑖| > 𝑐,

where 𝑐 is a constant. Andrews' 𝜌 function has the related form 𝜌(𝑟𝑖) = 𝑎{1 − cos(𝑟𝑖/𝑎)} for |𝑟𝑖| ≤ 𝑎𝜋, and is constant beyond 𝑎𝜋.
In practice, the M-estimation problem with Tukey's bisquare function is solved by an iteratively reweighted least squares (IRLS) algorithm.
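A minimal IRLS sketch is shown below; it assumes the usual Tukey bisquare weight function w(r) = {1 − (r/c)²}² for |r| ≤ c with c = 4.685 and a MAD scale estimate, and is not necessarily the exact algorithm used in the paper. In practice MASS::rlm(..., psi = psi.bisquare) provides a tested implementation.

# Sketch: M-estimation by iteratively reweighted least squares with Tukey bisquare weights
tukey_weight <- function(r, c = 4.685) ifelse(abs(r) <= c, (1 - (r / c)^2)^2, 0)
m_estimate <- function(X, y, maxit = 50, tol = 1e-8) {
  X1 <- cbind(1, X)                       # add an intercept column
  beta <- qr.solve(X1, y)                 # OLS starting values
  for (it in seq_len(maxit)) {
    r <- drop(y - X1 %*% beta)
    s <- mad(r)                           # robust scale estimate
    w <- tukey_weight(r / s)
    beta_new <- qr.solve(X1 * sqrt(w), sqrt(w) * y)   # weighted least squares step
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  drop(beta)
}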
While the M-estimation technique may be robust against outliers, it does not cater for other problems associated with regression such as high-dimensionality and multicollinearity (Freue et al., 2019). In order to solve the problem of high-dimensionality or multicollinearity, a penalized M-Estimation procedure may be used.
Freue et al. (2019) introduced efficient algorithms for penalized M-Estimators using the LASSO and Elastic-Net penalties. The pense R package contains an implementation of penalized M-Estimation with the LASSO and Elastic-Net penalties.
Distance Correlation and Robust Screening (DC-RoSIS)
Szekely et al. (2007) introduced the distance covariance as a measure of dependence. For univariate 𝑋 and 𝑌, the squared distance covariance can be written as

𝑑𝑐𝑜𝑣²(𝑋, 𝑌) = 𝑆1 + 𝑆2 − 2𝑆3,

where 𝑆1 = 𝐸(|𝑋 − 𝑋̃||𝑌 − 𝑌̃|), 𝑆2 = 𝐸(|𝑋 − 𝑋̃|)𝐸(|𝑌 − 𝑌̃|), 𝑆3 = 𝐸{𝐸(|𝑋 − 𝑋̃| ∣ 𝑋)𝐸(|𝑌 − 𝑌̃| ∣ 𝑌)}, and (𝑋̃, 𝑌̃) is an independent copy of (𝑋, 𝑌). The distance correlation between 𝑋 and 𝑌 is

𝑑𝑐𝑜𝑟𝑟(𝑋, 𝑌) = 𝑑𝑐𝑜𝑣(𝑋, 𝑌) / √(𝑑𝑐𝑜𝑣(𝑋, 𝑋) 𝑑𝑐𝑜𝑣(𝑌, 𝑌)). (13)
Szekely et al. (2007) pointed out that 𝑑𝑐𝑜𝑟𝑟(𝑋, 𝑌) = 0 if and only if 𝑋 and 𝑌 are independent, and that 𝑑𝑐𝑜𝑟𝑟(𝑋, 𝑌) is strictly increasing in the absolute value of the Pearson correlation between 𝑋 and 𝑌. Motivated by these properties, Li et al. (2012) proposed a sure independence screening procedure that ranks all predictors by their distance correlations with the response variable, termed DC-SIS, and proved its sure screening property for ultrahigh-dimensional data.
Following Zhong et al. (2016), let 𝑋𝑘 denote the 𝑘th predictor with 𝑘 = 1, . . . , 𝑝𝑛. This work quantifies the importance of 𝑋𝑘 through its distance correlation with the marginal distribution function of 𝑌, denoted by 𝐹(𝑌). That is,

𝜔𝑘 = 𝑑𝑐𝑜𝑟𝑟{𝑋𝑘 , 𝐹(𝑌)}.

The distance correlation has several advantages compared with existing measurements: 𝑑𝑐𝑜𝑟𝑟{𝑋𝑘 , 𝐹(𝑌)} = 0 if and only if 𝑋𝑘 and 𝑌 are independent, and following Li et al. (2012), the screening procedure is model-free and hence applicable in both dense and sparse situations; since 𝐹(𝑌) is a bounded function for all types of 𝑌, the procedure can be expected to perform reliably when the response is heavy-tailed and when extreme values are present in the response; if one suspects that the covariates also contain some extreme values, then one can use 𝜔𝑘ᵇ = 𝑑𝑐𝑜𝑟𝑟{𝐹𝑘(𝑋𝑘), 𝐹(𝑌)} to rank the importance of the 𝑋𝑘 , where 𝐹𝑘(𝑥) = 𝐸{𝟏(𝑋𝑘 ≤ 𝑥)}.
Zhong et al (2016) showed how to implement the marginal utility in the screening procedure as follows. Let {(𝑿𝒊 , 𝑌𝑖 ), 𝑖 =
1,··· , 𝑛} be a random sample from the population (𝑿, 𝑌). The distance covariance between 𝑋𝑘 and 𝐹(𝑌 ) is first estimated through
the moment estimation method,
namely

𝑑𝑐𝑜𝑣̂²(𝑋𝑘 , 𝐹(𝑌)) = 𝑆̂𝑘,1 + 𝑆̂𝑘,2 − 2𝑆̂𝑘,3 ,

where

𝑆̂𝑘,1 = (1/𝑛²) ∑_{𝑖=1}^{𝑛} ∑_{𝑗=1}^{𝑛} |𝑋𝑖𝑘 − 𝑋𝑗𝑘||𝐹𝑛(𝑌𝑖) − 𝐹𝑛(𝑌𝑗)|,

𝑆̂𝑘,2 = (1/𝑛²) ∑_{𝑖=1}^{𝑛} ∑_{𝑗=1}^{𝑛} |𝑋𝑖𝑘 − 𝑋𝑗𝑘| × (1/𝑛²) ∑_{𝑖=1}^{𝑛} ∑_{𝑗=1}^{𝑛} |𝐹𝑛(𝑌𝑖) − 𝐹𝑛(𝑌𝑗)|,

and

𝑆̂𝑘,3 = (1/𝑛³) ∑_{𝑖=1}^{𝑛} ∑_{𝑗=1}^{𝑛} ∑_{𝑙=1}^{𝑛} |𝑋𝑖𝑘 − 𝑋𝑙𝑘||𝐹𝑛(𝑌𝑖) − 𝐹𝑛(𝑌𝑗)|

are the corresponding estimators of 𝑆𝑘,1 , 𝑆𝑘,2 , 𝑆𝑘,3 , and 𝐹𝑛(𝑦) = 𝑛⁻¹ ∑_{𝑖=1}^{𝑛} 𝟏(𝑌𝑖 ≤ 𝑦). We estimate 𝜔𝑘 with

𝜔̂𝑘 = 𝑑𝑐𝑜𝑟𝑟̂{𝑋𝑘 , 𝐹(𝑌)} = 𝑑𝑐𝑜𝑣̂(𝑋𝑘 , 𝐹(𝑌)) / √(𝑑𝑐𝑜𝑣̂(𝑋𝑘 , 𝑋𝑘) 𝑑𝑐𝑜𝑣̂(𝐹(𝑌), 𝐹(𝑌))).
The independence screening procedure then retains the covariates whose 𝜔̂𝑘 values are larger than a user-specified threshold. Let 𝐴̂ = {𝑘 ∶ 𝜔̂𝑘 ≥ 𝑐𝑛^{−𝜅}, for 1 ≤ 𝑘 ≤ 𝑝𝑛} for some pre-specified constants 𝑐 > 0 and 0 ≤ 𝜅 < 1/2. The constants 𝑐 and 𝜅 control the signal strength (see Zhong et al., 2016). Zhong et al. (2016) referred to this approach as the distance correlation based robust independence screening procedure (DC-RoSIS).

If the covariates are also suspected to contain extreme values, the rank-transformed covariates can be used instead:

𝜔̂𝑘ᵇ = 𝑑𝑐𝑜𝑟𝑟̂{𝐹(𝑋𝑘), 𝐹(𝑌)} = 𝑑𝑐𝑜𝑣̂(𝐹(𝑋𝑘), 𝐹(𝑌)) / √(𝑑𝑐𝑜𝑣̂(𝐹(𝑋𝑘), 𝐹(𝑋𝑘)) 𝑑𝑐𝑜𝑣̂(𝐹(𝑌), 𝐹(𝑌))),
where 𝑆̂𝑘,1ᵇ and 𝑆̂𝑘,2ᵇ are defined as 𝑆̂𝑘,1 and 𝑆̂𝑘,2 with 𝑋𝑖𝑘 and 𝑋𝑗𝑘 replaced by 𝐹𝑛(𝑋𝑖𝑘) and 𝐹𝑛(𝑋𝑗𝑘), and

𝑆̂𝑘,3ᵇ = (1/𝑛³) ∑_{𝑖=1}^{𝑛} ∑_{𝑗=1}^{𝑛} ∑_{𝑙=1}^{𝑛} |𝐹𝑛(𝑋𝑖𝑘) − 𝐹𝑛(𝑋𝑙𝑘)||𝐹𝑛(𝑌𝑖) − 𝐹𝑛(𝑌𝑗)|.

The use of 𝜔̂𝑘ᵇ may produce better results if the covariates also contain some extreme values.
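The screening step can be sketched in R as follows; dcor() from the energy package computes the distance correlation, 𝐹𝑛 is obtained with ecdf(), and the screening size d = ⌊n/log(n)⌋ is a common choice in the screening literature rather than a value prescribed here.

# Hedged sketch of DC-RoSIS: rank predictors by dcorr{X_k, F_n(Y)} and keep the top d
library(energy)                                        # provides dcor()
dc_rosis <- function(X, y, d = floor(nrow(X) / log(nrow(X)))) {
  Fy <- ecdf(y)(y)                                     # F_n(Y_i), bounded in [0, 1]
  omega <- apply(X, 2, function(xk) dcor(xk, Fy))      # omega_k = dcorr{X_k, F_n(Y)}
  keep <- sort(order(omega, decreasing = TRUE)[seq_len(d)])
  list(omega = omega, keep = keep)
}
# For the rank-based variant omega_k^b, replace xk by ecdf(xk)(xk) before calling dcor().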
Theorem 1. Under condition (C1) that there exist positive constants 𝑡0 and 𝐶 such that max_{1≤𝑘≤𝑝𝑛} 𝐸{exp(𝑡|𝑋𝑘|)} ≤ 𝐶 < ∞ for 0 < 𝑡 ≤ 𝑡0 , for any 0 < 𝛾 < 1/2 − 𝜅 there exist positive constants 𝑐1 and 𝑐2 such that

𝑃𝑟(max_{1≤𝑘≤𝑝𝑛} |𝜔̂𝑘 − 𝜔𝑘| ≥ 𝑐𝑛^{−𝜅}) ≤ 𝑂(𝑝𝑛[exp{−𝑐1𝑛^{1−2(𝜅+𝛾)}} + 𝑛 exp(−𝑐2𝑛^{𝛾})]).
We remark here that to derive the consistency of the estimated marginal utility, we do not need any moment condition on the response. To prove the sure screening property, we make use of a further assumption (C6): the marginal utility satisfies min_{𝑘∈𝐴} 𝜔𝑘 ≥ 2𝑐𝑛^{−𝜅} for some constants 𝑐 > 0 and 0 ≤ 𝜅 < 1/2. Condition (C6) allows the minimal signal of the active covariates to converge to zero as the sample size diverges, while it requires that the minimum signal of the active covariates not be too small.
Theorem 2 (Sure Screening Property). Under (C6) and the conditions in Theorem 1, it follows that 𝑃𝑟(𝐴 ⊆ 𝐴̂) ≥ 1 − 𝑂(𝑠𝑛[exp{−𝑐1𝑛^{1−2(𝜅+𝛾)}} + 𝑛 exp(−𝑐2𝑛^{𝛾})]), where 𝑠𝑛 is the cardinality of 𝐴. Thus, 𝑃𝑟(𝐴 ⊆ 𝐴̂) → 1 as 𝑛 → ∞.
The LASSO-DCRoSIS Penalized Regression
Let 𝑋𝐴 denote the matrix of covariates retained by the DC-RoSIS screening step and let 𝛽𝐴 = (𝛽𝐴0 , 𝛽𝐴1 , 𝛽𝐴2 , … , 𝛽𝐴𝑝)ᵀ denote the regression coefficients associated with 𝑋𝐴 . Then, the LASSO-DCRoSIS estimator is defined by applying the LASSO penalty to the screened covariates,

𝛽̂LASSO−DCRoSIS = argmin_{𝛽𝐴} (𝑌 − 𝑋𝐴𝛽𝐴)ᵀ(𝑌 − 𝑋𝐴𝛽𝐴) + 𝜆 ∑_{𝑖=1}^{𝑝} |𝛽𝐴𝑖|. (20)

The minimization problem given by (20) can be solved by a number of algorithms, including coordinate descent (Fu, 1998), proximal methods (Beck and Teboulle, 2009) and quadratic solvers (Grandvalet et al., 2017).

The SCAD-DCRoSIS Penalized Regression
Given that the earlier definitions of 𝑋, 𝑋𝐴 and 𝛽𝐴 remain unchanged, the SCAD-DCRoSIS estimator 𝛽̂SCAD−DCRoSIS is given as

𝛽̂SCAD−DCRoSIS = argmin_{𝛽𝐴} (𝑌 − 𝑋𝐴𝛽𝐴)ᵀ(𝑌 − 𝑋𝐴𝛽𝐴) + ∑_{𝑖=1}^{𝑝} p𝜆(𝛽𝐴𝑖), (22)

where

p𝜆(𝛽𝐴𝑖) = 𝜆|𝛽𝐴𝑖| 𝐼(0 ≤ |𝛽𝐴𝑖| ≤ 𝜆) + {𝑎𝜆|𝛽𝐴𝑖| − (𝛽𝐴𝑖² + 𝜆²)/2}/(𝑎 − 1) 𝐼(𝜆 < |𝛽𝐴𝑖| ≤ 𝑎𝜆) + {(𝑎 + 1)𝜆²/2} 𝐼(|𝛽𝐴𝑖| > 𝑎𝜆),

for some 𝑎 > 2, 𝜆 > 0, and 𝐼(∙) the indicator function. The minimization problem in (22) can be solved using coordinate descent algorithms.
The minimization problem in (24) can be solved by a weighted LASSO least squares technique proposed by Freue et al (2019).
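Putting the pieces together, the following hedged R sketch shows the proposed two-stage idea: screen with DC-RoSIS, then apply the LASSO or SCAD penalized fit to the retained covariates 𝑋𝐴 . It reuses the dc_rosis(), glmnet and ncvreg sketches above; a LASSO-M-DCRoSIS version would replace the second stage with a penalized M-estimator such as the one provided by the pense package.

# Hedged sketch of the hybrid estimators: DC-RoSIS screening followed by a penalized fit
lasso_dcrosis <- function(X, y) {
  A <- dc_rosis(X, y)$keep                                   # screened index set A-hat
  fit <- cv.glmnet(X[, A, drop = FALSE], y, alpha = 1)
  list(active = A, coef = coef(fit, s = "lambda.min"))
}
scad_dcrosis <- function(X, y) {
  A <- dc_rosis(X, y)$keep
  fit <- cv.ncvreg(X[, A, drop = FALSE], y, penalty = "SCAD", gamma = 3.7)
  list(active = A, coef = coef(fit))
}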
IV. ANALYSIS AND RESULTS

This section presents a detailed description of the proposed LASSO-DCRoSIS, LASSO-M-DCRoSIS and SCAD-DCRoSIS. The section also shows the results of the evaluation of the proposed hybrid methods against themselves and other classical methods under different sample size settings and outlier severity. It is worthy to note that all implementations of the methods, simulations and computations were carried out using R (R Core Team, 2019), while tables and plots are used to present the results.

Simulation Design
The performances of the LASSO-DCRoSIS, LASSO-M-DCRoSIS and SCAD-DCRoSIS for variable selection and estimation are evaluated via simulation at various sample sizes and levels of contamination by outliers. Each simulated data set consists of a training set for fitting the model, a validation set for selecting the tuning parameters, and a test set on which the test errors are computed for evaluation of performance. The notation ·/·/· is used to represent the number of observations in the training, validation and test set, respectively.

Case 1
The true underlying regression model from which we simulate data is given by

𝑌 = 𝑋ᵀ𝛽∗ + 𝜎∗𝜖, 𝜖 ~ 𝑁(0, 1).

In this case, the simulated data sets consist of 𝑛/10𝑛/100 observations and 200 predictors, and we set 𝛽 = (5, … , 5, 0, … , 0), where the first 20 coefficients equal 5 and the remaining 180 equal 0, 𝑛 = 100, 𝜎 = 12 and 𝜌(𝑖, 𝑗) = 0.5^{|𝑖−𝑗|} for all 𝑖, 𝑗.

Case 2
In this case, only a linear model is considered and is given as

𝑌𝑖 = 𝛽1𝑋𝑖1 + 𝛽2𝑋𝑖2 + 𝛽7𝑋𝑖7 + 𝜀𝑖 , 𝑖 = 1, 2, … , 𝑛.

𝑋 = (𝑋1 , 𝑋2 , … , 𝑋𝑝)ᵀ was generated from 𝒩(0, Σ), where Σ = (𝜎𝑖𝑗)_{𝑝×𝑝} with 𝜎𝑖𝑗 = 0.5^{|𝑖−𝑗|}. Here, 𝑝 was set to 1000 and 𝑛 = 50, 100 and 200. It should be noted that out of the 1000 generated covariates, only three (𝑋1 , 𝑋2 and 𝑋7) are useful in the model. Hence, 𝛽 was set such that 𝛽 = (3, 1.5, 0, 0, 0, 0, 2, 0, … , 0)ᵀ.

Case 3
In this case, the simulated data sets consist of 𝑛/10𝑛/200 observations and 1000 predictors, and we set 𝛽 = (0, … , 0, 2, … , 2, 0, … , 0, 2, … , 2), with blocks of length 485, 15, 485 and 15 respectively, 𝑛 ∈ {50, 100}, 𝜎 = 2 and 𝜌(𝑖, 𝑗) = 0.5^{|𝑖−𝑗|} for all 𝑖, 𝑗. In this case there are 1000 sparse grouped predictors with only 30 being relevant.

Case 4
In this case, the simulated data sets consist of 𝑛/10𝑛/200 observations and 1000 predictors, and 𝛽 is set so that the 15 important predictors fall into three strongly correlated groups (see the discussion of Table 7 below).
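As an illustration, a hedged R sketch of the Case 2 design is given below; the error standard deviation is set to 1 here because the text does not restate it, and mvrnorm() from MASS is used to draw from 𝒩(0, Σ).

# Hedged sketch of the Case 2 data-generating design
library(MASS)                                            # for mvrnorm()
gen_case2 <- function(n, p = 1000, rho = 0.5, sigma = 1) {
  Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))   # sigma_ij = 0.5^|i-j|
  X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  beta <- numeric(p); beta[c(1, 2, 7)] <- c(3, 1.5, 2)   # only X1, X2 and X7 are active
  y <- drop(X %*% beta) + sigma * rnorm(n)
  list(X = X, y = y, beta = beta)
}
dat <- gen_case2(n = 100)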
Table 1 Simulation Results for Case 1 at 𝑛 = 50, 100, 150, 200, with no Outliers, based on 100 Replications
Method 𝑺 𝑺𝑬 𝑪 𝑰𝑪 𝑴𝑺𝑬𝜷 AE 𝑴𝑺𝑬𝒀
𝒏 = 𝟓𝟎
LASSO-DCRoSIS 26 6 16 10 207.348 30.567 213.200
SCAD-DCRoSIS 19 1 13 6 348.516 26.994 270.275
LASSO-M-DCRoSIS 23 3 16 8 166.630 18.358 129.842
LASSO 28 8 20 8 57.500 21.075 65.991
SCAD 17 3 10 7 547.868 29.072 463.314
𝒏 = 𝟏𝟎𝟎
LASSO-DCRoSIS 41 21 20 21 3.108 3.680 6.856
SCAD-DCRoSIS 20 0 20 0 2.050 3.271 5.810
LASSO-M-DCRoSIS 31 11 20 11 1.116 3.566 5.018
LASSO 41 21 20 21 3.183 3.571 7.029
SCAD 20 0 20 0 1.638 1.790 5.288
𝒏 = 𝟏𝟓𝟎
LASSO-DCRoSIS 42 22 20 22 1.658 2.637 5.609
SCAD-DCRoSIS 20 0 20 0 0.998 0.770 4.802
LASSO-M-DCRoSIS 31 11 20 11 0.623 1.890 4.496
LASSO 43 23 20 23 1.802 2.726 5.697
SCAD 20 0 20 0 0.957 0.499 4.753
𝒏 = 𝟐𝟎𝟎
LASSO-DCRoSIS 36 16 20 16 1.145 1.033 4.830
SCAD-DCRoSIS 20 0 20 0 0.751 0.514 4.445
LASSO-M-DCRoSIS 31 11 20 11 0.466 1.428 4.396
LASSO 45 25 20 25 1.186 2.197 5.155
SCAD 20 0 20 0 0.741 0.439 4.447
Simulation results when there are no outliers in the response variable for Case 1 are given in Table 1. The table contains medians of 𝑆, 𝑆𝐸, 𝐶, 𝐼𝐶, 𝑀𝑆𝐸𝑌 , 𝐴𝐸 and 𝑀𝑆𝐸𝛽 over 100 replications at sample sizes 50, 100, 150 and 200. The true size of the model for this case is 20. In terms of variable selection, SCAD and SCAD-DCRoSIS correctly select the important variables and correctly leave out the unimportant ones. However, SCAD-DCRoSIS outperforms SCAD in terms of estimation and prediction at sample size 50. Also, LASSO tends to select larger models compared to the proposed LASSO-DCRoSIS and LASSO-M-DCRoSIS. Similar behaviour can be observed at sample sizes 150 and 200.
Table 2 Simulation Results for case 1 at 𝑛 = 50, 100, 150, 200, with 10% Outliers in 𝑌, based on 100 Replications
Method 𝑺 𝑺𝑬 𝑪 𝑰𝑪 𝑴𝑺𝑬𝜷 AE 𝑴𝑺𝑬𝒀
𝒏 = 𝟓𝟎
LASSO-DCRoSIS 29 9 17 13 228.839 35.226 270.052
ENET-DCRoSIS 28 8 15 13 225.676 30.699 271.707
SCAD-DCRoSIS 21 1 12 9 471.181 32.602 433.153
MCP-DCRoSIS 19 1 12 8 476.287 32.941 438.463
LASSO-M-DCRoSIS 24 4 16 8 156.260 14.845 149.378
ENET-M-DCRoSIS 24 4 16 8 154.718 14.976 149.378
LASSO 44 24 18 27 224.143 30.550 270.027
ENET 47 27 19 28 143.432 28.087 200.692
SCAD 29 9 14 15 573.576 27.475 469.810
MCP 25 5 14 12 598.667 25.885 471.252
𝒏 = 𝟏𝟎𝟎
LASSO-DCRoSIS 37 17 20 14 52.831 10.445 75.917
SCAD-DCRoSIS 27 7 20 9 70.970 8.346 84.054
LASSO-M-DCRoSIS 26 6 19 7 31.108 4.283 52.914
LASSO 42 22 20 22 33.792 11.560 77.480
SCAD 41 21 20 21 93.324 10.848 101.754
𝒏 = 𝟏𝟓𝟎
LASSO-DCRoSIS 33 13 20 14 10.855 6.426 53.967
SCAD-DCRoSIS 23 3 20 3 14.289 4.305 50.453
LASSO-M-DCRoSIS 27 7 20 7 0.502 1.526 45.011
LASSO 43 23 20 23 11.621 7.129 54.537
SCAD 47 27 20 27 38.879 8.527 63.101
𝒏 = 𝟐𝟎𝟎
LASSO-DCRoSIS 35 15 20 15 6.025 4.435 49.315
SCAD-DCRoSIS 20 0 20 0 7.503 2.627 46.542
LASSO-M-DCRoSIS 29 9 20 9 0.363 1.323 44.757
LASSO 42 22 20 22 6.611 5.178 50.391
SCAD 31 11 20 11 10.934 5.009 49.401
Simulation results for Case 1 with outliers introduced into the response are given in Table 2. SCAD-DCRoSIS outperforms SCAD in terms of estimation and prediction; SCAD appears to be strongly affected by the presence of outliers. At sample sizes 150 and 200, LASSO-M-DCRoSIS significantly outperforms the others, showing that it is superior when outliers are present.

Case 2
The simulation results for Case 2 are presented in this section. The results are based on 100 replications and the evaluation criteria are 𝑆, 𝑆𝐸, 𝐶, 𝐼𝐶, 𝑀𝑆𝐸𝑌 , 𝑀𝑆𝐸𝛽 and 𝐴𝐸.

Simulation results when there are no outliers in the response variable for Case 2 are given in Table 3. The true size of this model is 3. At sample size 50, LASSO-M-DCRoSIS outperforms the rest in terms of prediction and estimation accuracy, but SCAD-DCRoSIS has the best performance in terms of variable selection. At sample sizes 100, 150 and 200, SCAD-DCRoSIS has the best performance in terms of variable selection, estimation and prediction. In this setting, all methods correctly select the important variables into the model; however, larger models are selected by LASSO and SCAD.
Table 3 Simulation Results for Case 2 at 𝑛 = 50, 100, 150, 200, with no Outliers, based on 100 Replications
Method 𝑺 𝑺𝑬 𝑪 𝑰𝑪 𝑴𝑺𝑬𝜷 AE 𝑴𝑺𝑬𝒀
𝒏 = 𝟓𝟎
LASSO-DCRoSIS 13 10 3 10 1.797 3.296 6.049
SCAD-DCRoSIS 9 6 3 6 1.799 2.304 5.485
LASSO-M-DCRoSIS 7 4 3 4 0.462 1.629 4.524
LASSO 21 18 3 18 2.333 3.691 6.452
SCAD 17 14 3 14 2.481 2.603 5.737
𝒏 = 𝟏𝟎𝟎
LASSO-DCRoSIS 14 11 3 11 0.867 2.045 4.816
SCAD-DCRoSIS 8 5 3 5 0.301 0.909 4.209
LASSO-M-DCRoSIS 9 6 3 6 0.251 1.066 4.212
LASSO 19 16 3 16 0.871 2.167 4.862
SCAD 19 16 3 16 0.466 1.297 4.408
𝒏 = 𝟏𝟓𝟎
LASSO-DCRoSIS 12 9 3 9 0.439 1.493 4.375
Table 4 Simulation Results for Case 2 at 𝑛 = 50, 100, 150, 200, with 10% Outliers in 𝑌, based on 100 Replications
Method 𝑺 𝑺𝑬 𝑪 𝑰𝑪 𝑴𝑺𝑬𝜷 AE 𝑴𝑺𝑬𝒀
𝒏 = 𝟓𝟎
LASSO-DCRoSIS 6 3 1 5 12.092 7.055 58.209
SCAD-DCRoSIS 14 11 1 13 39.305 15.409 76.091
LASSO-M-DCRoSIS 6 3 3 3 0.312 1.605 45.207
LASSO 10 7 1 9 11.663 7.303 57.176
SCAD 26 23 2 24 57.192 15.742 92.875
𝒏 = 𝟏𝟎𝟎
LASSO-DCRoSIS 10 7 2 8 7.132 5.809 51.648
SCAD-DCRoSIS 29 26 2 27 33.056 15.089 71.621
LASSO-M-DCRoSIS 8 5 3 5 0.101 0.787 44.190
LASSO 14 11 2 12 7.512 6.147 51.868
SCAD 47 44 2 45 54.290 17.512 91.307
𝒏 = 𝟏𝟓𝟎
LASSO-DCRoSIS 12 9 3 9 5.067 4.861 49.362
SCAD-DCRoSIS 30 27 3 27 17.286 11.647 58.306
LASSO-M-DCRoSIS 9 6 3 6 0.064 0.632 43.472
LASSO 16 13 3 13 4.951 5.122 49.192
SCAD 64 61 2 61 46.938 17.123 80.462
𝒏 = 𝟐𝟎𝟎
LASSO-DCRoSIS 13 10 3 10 2.048 3.208 46.450
SCAD-DCRoSIS 33 30 3 30 6.480 6.796 47.835
LASSO-M-DCRoSIS 9 6 3 6 0.049 0.477 44.033
LASSO 20 17 3 17 2.325 3.537 46.323
SCAD 79 76 3 76 11.007 9.217 53.441
Table 4 presents simulation results for Case 2 with 10% outliers introduced into the response variable. Across all sample sizes, LASSO-M-DCRoSIS outperformed the rest in terms of variable selection, prediction and estimation accuracy, while SCAD produced the worst performance, indicating that it does not do well in the presence of outliers. In this setting also, SCAD always selects larger models, while the proposed methods always select more parsimonious models compared to existing methods.
Case 3
Table 5 Simulation Results for Case 3 at 𝑛 = 50, 100, 150, 200, with no Outliers, based on 100 Replications
Method 𝑺 𝑺𝑬 𝑪 𝑰𝑪 𝑴𝑺𝑬𝜷 AE 𝑴𝑺𝑬𝒀
𝒏 = 𝟓𝟎
LASSO-DCRoSIS 22 8 9 12 118.178 51.043 219.490
SCAD-DCRoSIS 14 16 7 8 162.370 49.502 249.613
LASSO-M-DCRoSIS 19 11 11 8 110.334 34.909 151.395
LASSO 35 5 24 11 56.869 20.174 61.112
SCAD 18 12 7 9 125.117 53.103 249.119
𝒏 = 𝟏𝟎𝟎
LASSO-DCRoSIS 43 13 22 21 57.359 25.915 66.021
SCAD-DCRoSIS 29 1 17 12 101.739 23.268 91.621
LASSO-M-DCRoSIS 36 6 23 13 44.165 13.223 38.862
LASSO 76 46 30 46 13.545 17.941 16.593
SCAD 34 4 15 19 145.221 36.416 125.094
𝒏 = 𝟏𝟓𝟎
LASSO-DCRoSIS 52 22 28 24 16.553 11.280 16.756
SCAD-DCRoSIS 37 7 27 10 18.981 6.790 14.961
LASSO-M-DCRoSIS 44 14 28 16 13.315 6.187 12.927
Simulation results when there are no outliers in the response variable for Case 3 are given in Table 5. The true size of this model is 30, but the values of the coefficients are relatively small and the importance of the corresponding predictors may be harder to detect. At sample sizes 50, 100 and 150, the LASSO outperforms the rest in terms of prediction, estimation accuracy and selection of important variables. However, at sample size 200, SCAD followed by SCAD-DCRoSIS has the best performance in terms of variable selection, estimation and prediction. In this setting, all the methods except LASSO correctly select the important variables into the model at small sample sizes. This is an indication that the LASSO is quite conservative in terms of variable selection.

Table 6 presents simulation results for Case 3 with 10% outliers introduced into the response variable. Across all sample sizes, LASSO-M-DCRoSIS outperforms the rest in terms of prediction and estimation accuracy, while SCAD produced the worst performance.
Table 6 Simulation Results for Case 3 at 𝑛 = 50, 100, 150, 200, with 10% Outliers in 𝑌, based on 100 Replications
Method 𝑺 𝑺𝑬 𝑪 𝑰𝑪 𝑴𝑺𝑬𝜷 AE 𝑴𝑺𝑬𝒀
𝒏 = 𝟓𝟎
LASSO-DCRoSIS 17 13 7 10 125.476 56.116 287.257
SCAD-DCRoSIS 18 12 6 13 269.223 65.477 390.465
LASSO-M-DCRoSIS 19 11 9 9 109.214 35.836 196.114
LASSO 30 0 13 17 113.895 53.630 214.896
SCAD 36 6 0 36 829.179 93.528 1045.335
𝒏 = 𝟏𝟎𝟎
LASSO-DCRoSIS 40 10 19 21 79.008 34.611 147.307
SCAD-DCRoSIS 38 8 16 22 176.564 37.161 176.564
LASSO-M-DCRoSIS 36 6 22 14 45.706 12.422 80.393
LASSO 83 53 23 60 77.679 33.938 132.499
SCAD 62 32 2 60 706.323 100.389 875.419
𝒏 = 𝟏𝟓𝟎
LASSO-DCRoSIS 52 22 27 25 41.602 22.631 84.648
SCAD-DCRoSIS 47 17 22 25 79.892 21.809 102.168
LASSO-M-DCRoSIS 44 14 27 17 17.283 6.794 55.770
LASSO 79 49 28 51 41.121 22.821 89.129
SCAD 80 50 7 73 498.179 80.622 576.196
𝒏 = 𝟐𝟎𝟎
LASSO-DCRoSIS 57 27 29 28 17.282 13.351 60.556
SCAD-DCRoSIS 50 20 29 22 25.882 9.305 59.041
LASSO-M-DCRoSIS 49 19 29 19 5.486 3.556 46.806
LASSO 84 54 30 54 15.058 13.895 61.126
SCAD 85 55 21 63 96.376 26.282 103.093
Case 4
Table 7 Simulation Results for Case 4 at 𝑛 = 50, 100, 150, 200, with no Outliers, based on 100 Replications
Method 𝑺 𝑺𝑬 𝑪 𝑰𝑪 𝑴𝑺𝑬𝜷 AE 𝑴𝑺𝑬𝒀
𝒏 = 𝟓𝟎
LASSO-DCRoSIS 10 5 5 6 404.660 6.619 5.062
SCAD-DCRoSIS 3 12 3 0 531.851 6.855 4.635
LASSO-M-DCRoSIS 9 6 5 3 373.798 7.040 4.395
LASSO 28 13 4 23 393.286 23.812 7.249
SCAD 3 12 3 0 538.689 71.909 4.378
𝒏 = 𝟏𝟎𝟎
LASSO-DCRoSIS 13 2 5 8 369.792 7.686 4.587
SCAD-DCRoSIS 3 12 3 0 538.511 7.292 4.318
LASSO-M-DCRoSIS 11 4 6 5 334.217 6.225 4.292
LASSO 24 9 5 18 379.969 6.6767 5.134
SCAD 3 12 3 0 539.963 71.878 4.371
𝒏 = 𝟏𝟓𝟎
LASSO-DCRoSIS 14 1 6 8 343.050 7.792 4.505
Simulation results when there are no outliers in the response variable for case 4 are given in Table 7. The true size of this
model here is 15 and the important predictors are divided into three groups such that predictors within each group are strongly
correlated. All the methods perform similarly with respect to prediction. However, LASSO, LASSO-DCRoSIS, LASSO-M-
DCRoSIS, SCAD, and SCAD-DCRoSIS tend to select one of the important variables in each group with none having the ability to
do group selection.
Table 8 Simulation Results for Case 4 at 𝑛 = 50, 100, 150, 200, with 10% Outliers in 𝑌, based on 100 Replications
Method 𝑺 𝑺𝑬 𝑪 𝑰𝑪 𝑴𝑺𝑬𝜷 AE 𝑴𝑺𝑬𝒀
𝒏 = 𝟓𝟎
LASSO-DCRoSIS 9 6 3 6 434.349 15.012 58.394
SCAD-DCRoSIS 3 12 3 0 510.214 12.619 45.612
LASSO-M-DCRoSIS 8 7 5 3 392.046 7.673 44.489
LASSO 24 9 3 21 76.333 15.370 368.988
SCAD 3 12 3 0 537.854 73.312 43.118
𝒏 = 𝟏𝟎𝟎
LASSO-DCRoSIS 11 4 4 7 416.201 12.853 51.732
SCAD-DCRoSIS 3 12 3 0 537.374 12.021 42.660
LASSO-M-DCRoSIS 12 3 6 6 323.962 5.098 43.871
LASSO 22 7 4 18 412.615 10.085 55.300
SCAD 3 12 3 0 543.058 72.047 41.198
𝒏 = 𝟏𝟓𝟎
LASSO-DCRoSIS 12 3 4 8 407.867 8.031 49.090
SCAD-DCRoSIS 3 12 3 0 530.967 10.459 41.530
LASSO-M-DCRoSIS 13 2 7 6 277.646 4.997 44.103
LASSO 22 7 4 19 410.733 11.271 50.458
SCAD 3 12 3 0 537.219 71.810 40.594
𝒏 = 𝟐𝟎𝟎
LASSO-DCRoSIS 12 3 4 8 432.346 9.929 46.403
SCAD-DCRoSIS 3 12 3 0 537.457 12.747 41.320
LASSO-M-DCRoSIS 14 1 7 7 252.355 4.848 43.932
LASSO 23 8 4 19 413.570 14.299 46.930
SCAD 3 12 3 0 543.027 71.980 40.471
Table 8 presents simulation results for Case 4 with 10% outliers introduced into the response variable. Across all sample sizes, SCAD has the worst performance on all criteria and, just as when there were no outliers, LASSO, LASSO-DCRoSIS, LASSO-M-DCRoSIS, SCAD and SCAD-DCRoSIS select one of the important variables in each group.

Application to Real Life Datasets
In this section, the application of the proposed methods (LASSO-DCRoSIS, LASSO-M-DCRoSIS, and SCAD-DCRoSIS) to a real life dataset is considered. The dataset is the gene expression data from the microarray experiments on 120 mammalian eye tissue samples by Scheetz et al. (2006). The dataset consists of 200 predictors representing 200 gene probes from 120 rats. The response is the expression level of the TRIM32 gene.

The boxplot and the histogram of 𝑌 (the TRIM32 gene) are displayed in Figures 1 and 2. Both indicate that the response distribution may be heavy-tailed and that the data contain outliers.

The training dataset was used for model fitting and selection of tuning parameters by 10-fold cross validation. The performance of the methods is then compared based on their prediction mean squared error (MSEy) on the test dataset and the number of non-zero coefficients. The process of data splitting, model fitting and computation of MSEy was repeated 100 times. The results for both datasets are summarized in Table 6.
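The evaluation loop described above can be sketched in R as follows; the 80/20 split fraction and the object names scheetz_x and scheetz_y (standing for the 120 × 200 probe matrix and the TRIM32 expression vector) are placeholders of this sketch, not values taken from the paper. It reuses the hybrid fitting sketches given earlier.

# Hedged sketch of the repeated split-and-evaluate procedure for the real data
eval_split <- function(X, y, fit_fun, train_frac = 0.8) {
  n <- nrow(X)
  tr <- sample(seq_len(n), size = floor(train_frac * n))
  fit <- fit_fun(X[tr, , drop = FALSE], y[tr])             # e.g. lasso_dcrosis or scad_dcrosis
  Xte <- cbind(1, X[-tr, fit$active, drop = FALSE])        # intercept plus screened covariates
  pred <- drop(Xte %*% as.numeric(fit$coef))
  mean((y[-tr] - pred)^2)                                  # prediction MSE_y on the test part
}
# mse_lasso_dcrosis <- replicate(100, eval_split(scheetz_x, scheetz_y, lasso_dcrosis))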
V. CONCLUSION
REFERENCES