INTRODUCTION
Dimensionality reduction is essential to improve the accuracy, efficiency and
scalability of a classification process in pattern recognition tasks and exploratory
data analysis. In order to realize an effective dimensionality reduction, feature
ranking is quite often required. There are two categories of feature ranking
strategies: the wrapper strategy and the filter strategy (Wang et al., 1999;
Guyon and Elisseeff, 2003; Song et al., 2005). In the wrapper strategy, feature
ranking associated with a particular classifier is done by evaluating
classification accuracy; a huge computational burden is the major drawback of
this strategy. The filter strategy ranks features by evaluating some criterion,
which may be classifier-parameter-based (Basak et al., 1998; Wang et al., 1999;
Andonie and Cataron, 2005) or classifier-free (Dash et al., 1997; Morita et
al., 2003; Torkkola, 2003; Dy and Bradley, 2004; Biesiada et al., 2005). The
latter is our concern in this study.
In general, most of the classifier-free criteria (Morita et al., 2003;
Torkkola, 2003; Dy and Bradley, 2004; Biesiada et al., 2005) are based on
first-order or second-order statistics computed from the empirical
distributions. However, criteria based on first-order statistics are sensitive
to noise in datasets, whereas those based on second-order statistics are
sensitive to data transformation. In order to circumvent these drawbacks and
enhance robustness to noise and data transformation, mutual information (MI),
as a higher-order statistic, has been introduced into classifier-free criteria.
However, this poses great difficulties, as it requires prior knowledge of the
underlying probability density functions of datasets and the numerical
integration of these functions, which leads to a high computational complexity.
Moreover, MI-based criteria often fail in higher dimensions (Kononenko, 1994;
Torkkola, 2003; Andonie and Cataron, 2005). In the latest advance of MI-based
criteria, Torkkola (2003) proposed a nonparametric MI criterion based on
Parzen-window density estimation (Bishop, 1995) and applied it to feature
transformation. Similar studies are also described by Principe et al. (2000),
Guyon and Elisseeff (2003) and Huang and Chow (2003).
As pointed out, this study deals with feature ranking (as the first step of
feature extraction). In this study, a novel feature ranking approach is
proposed. The core of the proposed approach can be stated briefly as follows.
First, a weight, initialized to one, is assigned to every feature; the weight
of each feature reflects its importance in the data space. Thus, with the help
of the concept of feature transform (Torkkola, 2003), two data spaces are
generated: one associated with all the original features and the other
associated with all the weighted features. Second, we use the Parzen window
density estimator to derive the underlying probability density functions of the
datasets in these two data spaces. In order to keep more information in the
datasets, we adopt a more precise Parzen window density estimator generated by
first using the fuzzy C-means algorithm (FCM) (Dunn, 1973; Bezdek, 1981) to
partition the data and then computing the variances of the corresponding
clusters (i.e., classes). FCM realizes clustering by minimizing an objective
function which induces the corresponding fuzzy membership functions; it was
first developed by Dunn (1973), then improved by Bezdek (1981) and has been
frequently used in pattern recognition, data mining and so on. Third, we use
the ISE (Integrated Squared Error) criterion to measure the global error
between the density estimates of these two data spaces. ISE is a criterion
robust to the presence of noise and outliers in datasets (Girolami and Chao,
2003; Wang et al., 2008). Finally, by applying the gradient descent method to
ISE, we derive iterative learning rules for the weights; the final weights are
obtained accordingly.
The proposed approach seems closely related to the study of Torkkola (2003).
However, its distinctive characteristics should be emphasized. First, the
proposed approach is classifier-free, i.e., no prior knowledge about the class
labels in datasets is assumed, whereas Torkkola's approach is oriented to
datasets where the class labels of all the data are known. This indicates the
major discrepancy between the proposed approach and his study. Moreover, the MI
criterion in his study is only related to the low-dimensional weighted data
space, whereas the ISE criterion in the proposed approach deals with the global
error measure between the original data space and the weighted data space of
the same dimension. In addition, from the pure viewpoint of computational
complexity, ISE may carry less computational burden than the MI criterion; we
will explain the reason later. Second, the proposed approach adopts FCM to
partition the dataset to be analysed and then calculates the variance of every
cluster obtained. Thus, the underlying probability density function computed by
the modified Parzen window density estimator makes use of more information in
datasets.
THE PARZEN WINDOW DENSITY ESTIMATOR AND ISE CRITERION
The modified Parzen window density estimator in the proposed approach:
The traditional Parzen window density estimator is particularly attractive
when no prior information is available to guide the choice of the precise
form of the density of the dataset to be analysed. A probability density
estimate can be obtained from the finite d-dimensional data points
x_{1}, x_{2},…,x_{N} ∈ X^{d} by employing the following Parzen window
density estimator:

p̂(x) = (1/N) Σ_{i=1}^{N} (1/V_{N}) φ((x − x_{i})/h_{N})    (1)

where, x_{i} ∈ X^{d}, N is the number of data points in the dataset, φ(·)
is a kernel window function, h_{N} is the window's width and V_{N} is the
window's volume.
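As an illustration, Eq. 1 with an isotropic Gaussian kernel can be realized in a few lines of NumPy (a minimal sketch of the standard estimator; the function and variable names are ours):

```python
import numpy as np

def parzen_density(x, data, h):
    """Parzen-window density estimate at a query point x (Eq. 1),
    using an isotropic Gaussian kernel of width h.

    x    : (d,) query point
    data : (N, d) sample matrix
    """
    N, d = data.shape
    sq = np.sum((data - x) ** 2, axis=1)        # squared distances to x
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)  # Gaussian normalization
    return float(np.mean(np.exp(-sq / (2.0 * h ** 2)) / norm))
```

The estimate is large at points well inside the data cloud and decays toward zero far outside it.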
In this study, φ(·) in Eq. 1 is taken as the following Gaussian kernel function:

φ(x) = (2π)^{−d/2}|Σ'|^{−1/2} exp(−(1/2)x^{T}Σ'^{−1}x)    (2)

where, Σ' is the covariance matrix and h is the window's width. Let Σ = h^{2}Σ'. Since, with V_{N} = h^{d},

(1/h^{d}) φ((x − x_{i})/h) = G(x − x_{i}, Σ),  G(x, Σ) = (2π)^{−d/2}|Σ|^{−1/2} exp(−(1/2)x^{T}Σ^{−1}x)    (3)

we have:

p̂(x) = (1/N) Σ_{i=1}^{N} G(x − x_{i}, Σ)    (4)
In most cases, in order to ease the computational complexity, a unified
variance is adopted, i.e., Σ = σ^{2}I, where I is the identity matrix and
σ^{2} is the variance of the dataset. However, when the dataset contains
several classes whose compactness differs, which implies that the variance
of every class differs from the others, this assumption poses a serious
issue: the obtained density estimate loses much useful information
contained in the dataset.
In particular, due to high dimensionality, the data points in a dataset are
sparsely distributed in a high-dimensional data space, so that the
compactness of the projected data points may differ from one dimension to
another. Therefore, we should compute Σ in Eq. 4 based on the variance of
every dimension.
Next, let us state the modified Parzen window density estimator used
in the proposed approach.
Given the dataset D_{X} = {x_{1},…,x_{N}},
x_{i} = (x_{i1}, x_{i2},…,x_{id})^{T} ∈ X^{d}, i = 1,2,…,N, suppose there
are N_{c} classes in D_{X} and J_{c} data points in the cth class. Then we
can estimate the probability P(c) of the cth class as:

P(c) = J_{c}/N    (5)

We employ Eq. 4 to estimate the conditional probability density function
of the cth class as:

p̂(x|c) = (1/J_{c}) Σ_{i=1}^{J_{c}} G(x − x_{ci}, Σ_{c})    (6)

where, x_{ci} is the ith data point in the cth class and Σ_{c} denotes the
diagonal covariance matrix of the cth class, which is defined as:

Σ_{c} = h^{2}*diag(σ_{c1}^{2}, σ_{c2}^{2},…,σ_{cd}^{2})    (7)

where, h is the window's width, * denotes the multiplication operator and
σ_{ck}^{2} is the variance of all the data points in the cth class along
the kth dimension, which is approximately calculated using:

x̄_{ck} = (1/J_{c}) Σ_{i=1}^{J_{c}} x_{cik}    (8)

σ_{ck}^{2} = (1/J_{c}) Σ_{i=1}^{J_{c}} (x_{cik} − x̄_{ck})^{2}    (9)

Thus, the density function p(x) for the dataset D_{X} can be estimated as:

p̂(x) = Σ_{c=1}^{N_{c}} P(c) p̂(x|c)    (10)
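A minimal sketch of this estimator, assuming a hard partition of the data into clusters is already available (e.g., from FCM by maximum membership); the code and names are ours, not the paper's implementation:

```python
import numpy as np

def modified_parzen(x, clusters, h):
    """Modified Parzen-window estimate p(x) = sum_c P(c) p(x|c),
    where each class c uses its own diagonal covariance
    Sigma_c = h^2 * diag(per-dimension variances of class c).

    clusters : list of (J_c, d) arrays, one per class/cluster.
    """
    N = sum(len(pts) for pts in clusters)
    p = 0.0
    for pts in clusters:
        J_c, d = pts.shape
        var = pts.var(axis=0) * h ** 2 + 1e-12     # Sigma_c diagonal
        diff = pts - x
        expo = np.exp(-0.5 * np.sum(diff ** 2 / var, axis=1))
        norm = np.sqrt((2.0 * np.pi) ** d * np.prod(var))
        p += (J_c / N) * float(np.mean(expo / norm))   # P(c) * p(x|c)
    return p
```

Because each class keeps its own per-dimension variances, clusters of different compactness contribute kernels of different widths, which is exactly the information a unified variance would discard.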
ISE criterion in the proposed approach: In order to evaluate the
importance of every feature, we utilize the following weighting strategy,
i.e., imposing a linear dimension-keeping transformation
y = w*x = (w_{1}x_{1}, w_{2}x_{2},…,w_{d}x_{d})^{T} on x, where
w = (w_{1},w_{2},…,w_{d})^{T}, w_{i} > 0, i = 1,2,…,d. The bigger w_{i}
is, the more important the corresponding ith feature. With this linear
transformation, the original dataset D_{X} becomes a transformed
dataset D_{Y}.
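Concretely, the transformation is an elementwise product and scales the variance along the kth dimension by w_{k}^{2}, a property the density estimate for D_{Y} relies on. A quick numerical check (with a hypothetical weight vector):

```python
import numpy as np

# hypothetical weight vector w and dataset D_X
w = np.array([1.0, 0.5, 2.0])
X = np.random.default_rng(0).normal(size=(100, 3))

Y = X * w  # y = w*x = (w_1 x_1, ..., w_d x_d)^T, applied row-wise

# the variance along the kth transformed dimension is w_k^2 times
# the variance along the kth original dimension
print(np.allclose(Y.var(axis=0), w ** 2 * X.var(axis=0)))  # True
```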
For this transformed dataset D_{Y}, we can estimate its probability
density function q(x) using Eq. 10 as:

q̂(x) = Σ_{c=1}^{N_{c}} P(c) (1/J_{c}) Σ_{i=1}^{J_{c}} G(x − y_{ci}, Σ̃_{c})    (11)

where, Σ̃_{c} takes a diagonal form for ease of the computational
complexity and is defined as:

Σ̃_{c} = h^{2}*diag(σ̃_{c1}^{2}, σ̃_{c2}^{2},…,σ̃_{cd}^{2})    (12)

where, h is the window's width and σ̃_{ck}^{2} is the variance of all the
data points in the cth class along the kth transformed dimension, which
can be estimated as:

σ̃_{ck}^{2} = w_{k}^{2} σ_{ck}^{2}    (13)
From the intuitive viewpoint, a good w should make q(x) for the transformed
dataset D_{Y} approximate p(x) for the original dataset D_{X} as well as
possible. Therefore, we require a criterion to evaluate w. The ISE criterion
(Girolami and Chao, 2003; Wang et al., 2008) has been investigated as an error
criterion that is less influenced by the presence of noise and outliers in the
dataset. ISE is a measure of the global accuracy of a density estimate, which
converges to the mean squared error asymptotically. For the above two density
estimates, the w providing the minimum of ISE is found by minimizing:

∫q̂(x)^{2}dx − 2∫p̂(x)q̂(x)dx

where, the term ∫p̂(x)^{2}dx has been dropped from
ISE = ∫(p̂(x) − q̂(x))^{2}dx since it does not depend on the weight vector w.
We should keep in mind that when we attempt to use the ISE criterion for
unsupervised feature ranking, the MI criterion used in Torkkola's work on
supervised feature extraction (Torkkola, 2003) actually provides an alternative
possible choice. However, in terms of the framework of Torkkola's (2003) study,
an additional term beyond those required by ISE still needs to be computed in
the MI criterion. Therefore, using the ISE criterion here may further reduce
the computational burden, compared with Torkkola (2003). In other words, this
computational merit motivates us to use the ISE criterion instead of the MI
criterion in this study.
Let us now derive ∫q̂(x)^{2}dx and ∫p̂(x)q̂(x)dx (Dash et al., 1997;
Torkkola, 2003). Both integrals can be evaluated in closed form through the
Gaussian convolution identity:

∫G(x − a, Σ_{1})G(x − b, Σ_{2})dx = G(a − b, Σ_{1} + Σ_{2})

Applying this identity term by term to the mixtures p̂ and q̂, each integral
reduces to a double sum of Gaussian functions evaluated at the pairwise
differences of the (weighted) data points.
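As a numerical sanity check, the w-dependent part of ISE can be evaluated directly with this identity. The sketch below (ours, not the paper's code) collapses the class structure to a single cluster with one Gaussian component per data point and diagonal covariances:

```python
import numpy as np

def gauss_overlap(a, B, var_sum):
    """Overlap integral of G(x - a, V1) with G(x - b, V2), which equals
    G(a - b, V1 + V2); evaluated for one point a against each row b of B.
    var_sum holds the diagonal of V1 + V2."""
    d = a.shape[-1]
    diff = a - B
    expo = np.exp(-0.5 * np.sum(diff ** 2 / var_sum, axis=-1))
    return expo / np.sqrt((2.0 * np.pi) ** d * np.prod(var_sum))

def ise_cost(w, X, h):
    """w-dependent ISE part: int q^2 - 2 int p*q (constant int p^2 dropped)."""
    Y = X * w                                 # weighted data space
    vx = X.var(axis=0) * h ** 2 + 1e-12       # diagonal kernel covariances
    vy = Y.var(axis=0) * h ** 2 + 1e-12
    N = len(X)
    qq = pq = 0.0
    for i in range(N):
        qq += gauss_overlap(Y[i], Y, vy + vy).sum()   # q-q cross terms
        pq += gauss_overlap(X[i], Y, vx + vy).sum()   # p-q cross terms
    return (qq - 2.0 * pq) / N ** 2
```

Since ISE vanishes when q̂ = p̂, this cost is smallest at w = (1,…,1) and grows as w moves away from it.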
THE DETAILS OF THE PROPOSED APPROACH
Since the weight vector w reflects the importance of every feature in
the dataset, the corresponding feature ranking can be achieved using
the obtained w. The present goal is to find the w that minimizes:

ISE(w) = ∫q̂(x)^{2}dx − 2∫p̂(x)q̂(x)dx    (19)

where:

∫q̂(x)^{2}dx = Σ_{c=1}^{N_{c}} Σ_{c'=1}^{N_{c}} (P(c)P(c')/(J_{c}J_{c'})) Σ_{i=1}^{J_{c}} Σ_{j=1}^{J_{c'}} G(y_{ci} − y_{c'j}, Σ̃_{c} + Σ̃_{c'})    (20)

∫p̂(x)q̂(x)dx = Σ_{c=1}^{N_{c}} Σ_{c'=1}^{N_{c}} (P(c)P(c')/(J_{c}J_{c'})) Σ_{i=1}^{J_{c}} Σ_{j=1}^{J_{c'}} G(x_{ci} − y_{c'j}, Σ_{c} + Σ̃_{c'})    (21)
The gradient descent method is adopted in the proposed approach. First,
we need the derivatives of ISE(w). Since only the weighted data points
and the transformed covariances depend on w, differentiating Eq. 19-21
term by term immediately yields ∂ISE(w)/∂w_{k} (Eq. 22; the expansion is
lengthy but straightforward). Thus, we further have the following
learning rule for w:

w_{k}(t+1) = w_{k}(t) − α ∂ISE(w)/∂w_{k}    (23)

where, α is the learning rate, which is taken as 0.1 in our study.
Before giving the complete description of the proposed approach, we must
discuss a particular implementation detail. As seen above, when weighting
all the features, we employ the constraint:

Σ_{i=1}^{d} w_{i} = 1

Such a constraint poses a subtle issue: minimizing ISE(w) under it yields
identical weights w_{i} (i = 1,2,…,d), i.e.,
w_{1} = w_{2} = … = w_{d} = 1/d. There are two strategies to circumvent
this difficulty. If we have prior knowledge about the least important
feature (e.g., the ith feature), the first strategy is to force
w_{i} = 0 during the whole run of the proposed approach. If we do not
have such prior knowledge, the second strategy applies: we first take an
arbitrary feature (e.g., the ith feature), let w_{i} = 0 during the whole
run and run the proposed approach; we then choose the least important
feature found (e.g., the jth feature), let w_{j} = 0 during the whole run
and run the proposed approach again. In this way, we can easily obtain
the final feature ranking for the dataset to be analysed. Note that,
since we are interested in the important features rather than the least
important ones, this arrangement based on the second strategy has almost
no impact on the final feature ranking.
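The second strategy can be outlined in code; here `run_ranker` is a hypothetical callable standing for one complete run of the proposed approach with a given feature's weight clamped to zero (we assume the least important feature is selected among the non-clamped ones):

```python
import numpy as np

def rank_without_prior(run_ranker, d, first=0):
    """Second strategy: clamp an arbitrary feature's weight to zero,
    run the ranker, then clamp the least important feature actually
    found and rerun to obtain the final ranking."""
    w = run_ranker(frozen=first)          # first run, arbitrary clamp
    masked = w.astype(float).copy()
    masked[first] = np.inf                # ignore the arbitrary clamp
    least = int(np.argmin(masked))        # least important learned feature
    w = run_ranker(frozen=least)          # second, definitive run
    return np.argsort(-w, kind="stable")  # descending importance
```

For instance, if a run learns the weights (0.5, 0.1, 0.9) for the non-clamped features, the procedure clamps feature 1 on the second run and returns the descending order 2, 0, 1.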
For the sake of space, assume the first strategy is taken; the proposed
approach is stated as follows:
Step 1: Given the dataset D, initialize w = (w_{1},w_{2},…,w_{d})^{T} and
let the w_{i} corresponding to the least important feature be zero; t = 0
Step 2: If all the data points in D have been classified with their class
labels, skip this step. Otherwise, determine the number of classes and use
the fuzzy clustering algorithm FCM (Dunn, 1973; Bezdek, 1981) to partition
D so that the modified Parzen window density estimator in Eq. 10 can be
employed
Step 3: Compute ISE using Eq. 19
Step 4: Compute w_{k}(t+1) using Eq. 23
Step 5: t ← t+1
Step 6: If ISE reaches its minimum or t reaches the predefined maximum, go
to Step 7; otherwise go to Step 3
Step 7: Output the feature ranking according to the obtained w
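Steps 1-7 can be sketched compactly as follows. This is our own single-cluster miniature: Step 2's FCM partition is omitted and a finite-difference gradient stands in for the analytic derivative in Eq. 23, so it illustrates the control flow rather than reproducing the paper's implementation:

```python
import numpy as np

def _cross(A, B, var_sum):
    # sum over all pairs of the Gaussian overlap G(a_i - b_j, diag(var_sum))
    d = A.shape[1]
    diff = A[:, None, :] - B[None, :, :]
    expo = np.exp(-0.5 * np.sum(diff ** 2 / var_sum, axis=-1))
    return expo.sum() / np.sqrt((2.0 * np.pi) ** d * np.prod(var_sum))

def ise(w, X, h):
    # int q^2 - 2 int p*q; the w-independent int p^2 term is dropped (Eq. 19)
    Y = X * w
    vx = X.var(axis=0) * h ** 2 + 1e-3   # floor keeps clamped dims finite
    vy = Y.var(axis=0) * h ** 2 + 1e-3
    N = len(X)
    return (_cross(Y, Y, 2 * vy) - 2.0 * _cross(X, Y, vx + vy)) / N ** 2

def rank_features(X, least, h=0.5, alpha=0.1, t_max=20, eps=1e-4):
    d = X.shape[1]
    w = np.ones(d)              # Step 1: initialize weights to one
    w[least] = 0.0              #         clamp the least important feature
    for _ in range(t_max):      # Steps 3-6
        base = ise(w, X, h)     # Step 3: compute ISE
        grad = np.zeros(d)
        for k in range(d):      # Step 4: finite-difference stand-in for Eq. 23
            if k == least:
                continue
            wp = w.copy()
            wp[k] += eps
            grad[k] = (ise(wp, X, h) - base) / eps
        w = np.clip(w - alpha * grad, 0.0, None)   # keep w_i >= 0
        w[least] = 0.0
    return np.argsort(-w, kind="stable")   # Step 7: descending importance
```

The clamped feature necessarily ends up ranked last; the remaining features are ordered by their learned weights.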
EXPERIMENTAL RESULTS
We present experimental results for eleven benchmark datasets to validate the
effectiveness of the proposed approach. It should be emphasized that we do not
claim that the experimental results of the present approach are superior to
current feature-ranking approaches. Rather, they show its power as a
classifier-free approach by (1) comparing it with another classifier-free
approach based on the traditional Parzen window density estimator and (2)
experimentally showing it to be comparable to supervised ranking approaches.
Three groups of experiments were carried out with MATLAB 6.0 on a Pentium IV
computer with a 2.66 GHz CPU and 512 MB memory. The first group of experiments
deals with eleven datasets, where FCM was first used to partition these
datasets and the proposed approach was then executed to obtain the
corresponding feature ranking results. We verified the obtained ranking results
using the k-fold cross-validation test (Richard, 2000) and the LIBSVM tool
(Chang and Lin, 2001) and did a comparative study with other approaches. In the
k-fold cross-validation test, a dataset is divided into k subsets. Each time,
one of the k subsets is used as the test set and the other k-1 subsets are put
together to form a training set. Then the average accuracy or error across all
trials is computed. In the second group of experiments, three datasets are
involved, where the datasets Pipeline and Landsat contain test data and the
dataset WBC (Wisconsin Diagnostic Breast Cancer) does not. For the former, the
classification accuracy is used as the performance index to evaluate the
proposed approach. For the latter, the cross-validation test is adopted. In the
third group of experiments, we present a comparative result for the dataset
Vowel with the latest advance (Andonie and Cataron, 2005).
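The k-fold protocol just described can be sketched as follows; a nearest-centroid classifier stands in for LIBSVM so the sketch stays dependency-free:

```python
import numpy as np

def kfold_accuracy(X, y, k=5, seed=0):
    """k-fold cross-validation: each fold serves once as the test set
    while the other k-1 folds train the classifier; per-fold accuracies
    are averaged.  A nearest-centroid classifier is used as a simple
    stand-in for LIBSVM."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        labels = np.unique(y[train])
        cents = np.array([X[train][y[train] == c].mean(axis=0)
                          for c in labels])
        dists = np.linalg.norm(X[test][:, None, :] - cents[None, :, :],
                               axis=-1)
        pred = labels[np.argmin(dists, axis=1)]
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs))
```

In the experiments below, the classifier is trained on subsets of the top-ranked features, so the curve of accuracy versus number of retained features reflects the quality of a ranking.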
Group A: This group of experiments deals with eleven UCI (Blake and Merz, 1998)
datasets, as shown in Table 1. For the datasets Iris, NewThyroid,
PimaDiabetes, BreastCancer, Wine, Ionosphere and Sonar, Table 2 shows the
feature ranks obtained using the proposed approach, the SUD approach (Dash et
al., 1997) and the RELIEF approach (Kira and Rendell, 1992; Kononenko, 1994;
Dash et al., 1997). Table 3 demonstrates the feature ranks obtained for the
dataset PimaDiabetes using the above three approaches and the K-means-based
approach (Girolami and Chao, 2003). The feature ranks obtained for the dataset
Wine using the above four approaches and the FSM approach (Law et al., 2002)
are reported in Table 4. Table 5 demonstrates the feature ranks for the dataset
Sonar using the proposed approach and the RELIEF approach. Please note that, in
these tables, the results of SUD and RELIEF are drawn from Dash et al. (1997),
those of the K-means-based approach are from Girolami and Chao (2003) and those
of FSM are from Law et al. (2002).
Here, let us briefly introduce the basic ideas of these approaches. SUD is
based on the observation that removing an irrelevant feature from the feature
set may not change the underlying concept of the data, whereas removing a
relevant one does.
Table 1: 
Eleven UCI datasets 

Table 2: 
Feature ranks using three approaches 

Table 3: 
Feature ranks using four approaches for PimaDiabetes 

Table 4: 
Feature ranks using five approaches for Wine 

Table 5: 
Feature ranks using the proposed approach and RELIEFF
for Sonar 

Each time, a relatively unimportant feature is removed from the feature set by
using the entropy index (Dash et al., 1997). The adopted RELIEFF is a
modification of the original RELIEF (Kononenko, 1994; Dash et al., 1997), a key
idea of which is to estimate the quality of features according to how well
their values distinguish between data points that are near to each other. The
original RELIEF can deal with nominal and numerical features; however, it
cannot deal with incomplete data and is limited to two-class problems. Its
extended version, RELIEFF, solves these problems. The feature ranking approach
based on K-means clustering involves the classification capabilities of feature
vectors and correlation analysis between pairs of features; it ranks the
features by using the correlation index between two features (Girolami and
Chao, 2003). The FSM approach is based on a feature saliency measure obtained
by an EM algorithm; however, it assumes that the features are conditionally
independent given the components. It ranks the features in descending order of
saliency (Law et al., 2002).

Fig. 1: 
Crossvalidation for the dataset NewThyroid 
As is well known, the most important features in the Iris dataset are the
third and fourth. Like SUD and RELIEFF, the proposed approach yielded
this same ranking result. Figure 1-6 demonstrate the 5-fold
cross-validation accuracies for the other datasets NewThyroid,
PimaDiabetes, BreastCancer, Wine, Ionosphere and Sonar, which clearly
indicate that, in most cases, the proposed approach is competitive with
the other approaches.

Fig. 2: 
Crossvalidation for the dataset PimaDiabetes 

Fig. 3: 
Crossvalidation for the dataset BreastCancer 

Fig. 4: 
Crossvalidation for the dataset Wine 
Figure 7a-d shows the visualization results for the two most important
features of the dataset Wine using these approaches. It is clear that
the proposed approach had better visualization capability than the other
approaches (Fig. 7d).
Of course, since the combinational computation of Gaussian functions is
involved in the proposed approach, its computational time is still
comparatively high. For example, it requires 4.438 sec for the Iris
dataset and 4.406 sec for the NewThyroid dataset, but 39.453 sec for the
PimaDiabetes dataset, 38.922 sec for the BreastCancer dataset and
34.141 sec for the Ionosphere dataset.

Fig. 5: 
Crossvalidation for the dataset Ionosphere 

Fig. 6: 
Crossvalidation for the dataset Sonar 
This actually poses a noteworthy issue, i.e., we should manage to
further reduce its computational burden in the near future.
Group B: This group of experiments deals with three datasets, Pipeline,
Landsat and WBC, as shown in Table 1. All the data in these three datasets have
been classified with their class labels. Table 6 demonstrates the feature ranks
for these three datasets using the proposed approach. For the datasets Pipeline
and Landsat, we applied LIBSVM (Chang and Lin, 2001) to their training data
according to the obtained feature ranks and then obtained the classification
accuracies for their test data (Fig. 8, 9). Figure 10 demonstrates the 5-fold
cross-validation accuracies for the dataset WBC. In Fig. 10 and Table 6, a
comparison for this dataset between the modified Parzen window density
estimator and the traditional one is also included. In particular, two very
different feature rankings are generated by the traditional and the modified
Parzen window density estimators (Table 6), which shows the inappropriateness
of the traditional Parzen window density estimator. We can see the
effectiveness of the proposed approach from Fig. 10 and Table 6: with the
proposed approach, when only a small number of the top-ranked features are
taken, comparatively high classification/cross-validation accuracies are
obtained for these three datasets.


Fig. 7: 
Visualization results using the five approaches for
the two most important features of the dataset Wine, (a) RELIEFF
approach, (b) SUD approach and Kmeansbased approach, (c) FSM approach
and (d) The proposed approach 

Fig. 8: 
Classification for the dataset Pipeline using LIBSVM 

Fig. 9: 
Classification for the dataset Landsat using LIBSVM 

Fig. 10: 
Crossvalidation for the dataset WBC using LIBSVM 
For example, when the top 2-5 ranked features of the dataset Pipeline obtained
by the proposed approach are presented to LIBSVM, the obtained classification
accuracies for the test data are 93.4, 94.9, 96.3 and 97.1%, respectively.
Group C: Only the dataset Vowel is involved here, to exhibit the comparative
study between the proposed approach and the feature ranking approaches RLVQ,
GRLVQ, SRNG, ERLVQ and ESRNG from the latest literature (Andonie and Cataron,
2005). As is well known, standard Learning Vector Quantization (LVQ) does not
discriminate between more or less informative features: their influence on the
distance function is equal. On the contrary, Relevance LVQ (RLVQ) (Bojer et
al., 2001; Andonie and Cataron, 2005) holds a changeable relevance value for
every feature and employs a weighted distance function for classification. An
iterative heuristic training process is used to tune the weight values for a
specific problem: the influence of features which frequently contribute to
misclassifications is reduced, while the influence of very reliable features is
increased. Generalized RLVQ (GRLVQ) modifies RLVQ by using an adaptive metric
and leads to a more powerful classifier with little extra cost compared with
RLVQ. The Supervised Relevance Neural Gas (SRNG) approach combines the
neural-gas (NG) algorithm (Hammer et al., 2005) and GRLVQ; the idea is to
incorporate the neighborhood cooperation of NG into GRLVQ to speed up
convergence and make initialization less crucial. The Energy SRNG (ESRNG)
approach uses the maximization of the informational energy (IE) as a criterion
for computing the relevancies of input features; this adaptive relevance
determination is used in combination with the SRNG model. The Energy
Generalized Relevance LVQ (EGRLVQ) approach uses the estimation of the IE as a
maximization criterion for computing the feature relevance. Both ESRNG and
EGRLVQ measure the informational energy using Onicescu's informational energy
(Cataron and Andonie, 2004).
Table 6: 
Feature ranks using the proposed approach for Pipeline,
Landsat, WBC 

Table 7: 
Feature ranks using seven approaches for Vowel 

Table 8: 
Classification accuracies using LIBSVM for test data
of Vowel (%) 

Table 7 demonstrates the experimental results using both the above approaches
and the proposed approach. The results of all methods other than the proposed
approach are drawn directly from Andonie and Cataron (2005) into Table 7. Based
on the obtained feature ranks, LIBSVM is first applied to the training data of
this dataset and then to its test data. As seen in Table 8, when the top 2-5
ranked features are taken, the classification performance obtained using the
proposed approach is comparable to ESRNG and slightly below ERLVQ. For the
other cases, except the one- and six-feature cases, the proposed approach is
generally comparable to the other approaches. However, please keep in mind that
the proposed approach attains these comparable results while being
classifier-free, whereas the other approaches here are supervised. What is
more, since all approaches yield very low classification accuracies (much less
than 50%) in the one-feature case, it is unnecessary to analyze this case. For
the six-feature case in Table 8, although the proposed approach obtains the
lowest classification accuracy, it is still slightly above 50% (i.e., 51.3%).
In summary, the proposed approach is effective in most cases for the dataset
Vowel.
CONCLUSIONS
This study describes a novel feature ranking approach which may be
attractive for pattern recognition tasks on high-dimensional datasets.
The proposed approach provides three major contributions. First, the
proposed Parzen window density estimator takes the non-uniform
distribution contained in datasets into account, so it can make use of
more information in the datasets than the conventional Parzen window
density estimator. Second, instead of the commonly used MI-based
criterion, the ISE criterion is adopted in our study as an alternative.
The present experimental results demonstrate that the proposed approach
based on the ISE criterion has performance at least comparable to current
feature ranking approaches. Third, the proposed approach requires no
prior knowledge of the datasets to be analyzed and can be easily realized
by using the corresponding gradient descent method.
Although the proposed approach is based on the ISE criterion rather than
the MI criterion, it still has a comparatively high computational burden
due to the combinational computation of Gaussian functions. We will study
how to further reduce its computational time, possibly using random
sampling methods, in the near future. Further research may include
theoretical analysis of the proposed approach and the exploration of more
practical applications.
ACKNOWLEDGMENTS
This research is supported by the 2007 National 863 project (Grant No.
2007AA1Z158), two 2007 grants (No. 60773206, 60704047) from the National
Science Foundation of China, the New-Century Outstanding Young Scholar
Grant of the Ministry of Education of China (NCET-04-0496), the National
KeySoft Lab. at Nanjing University, the Key Lab. of Computer Information
Technologies at JiangSu Province, the 2004 key project of the Ministry of
Education of China and the National Key Lab. of Pattern Recognition at the
Institute of Automation, CAS, China.