
Information Technology Journal

Year: 2012 | Volume: 11 | Issue: 3 | Page No.: 319-323
DOI: 10.3923/itj.2012.319.323
Kernel Sparse Feature Selection Based on Semantics in Text Classification
Zhantao Deng, Guyu Hu, Zhisong Pan and Yanyan Zhang

Abstract: Sparse representation, which originates from compressed sensing theory in signal processing, has attracted increasing interest in the computer vision research community. In this paper, we present a novel non-parametric feature selection method based on sparse representation for text classification. In order to address the problems of polysemy and synonymy in the Vector Space Model (VSM), we construct a semantic structure to represent each document with PLSA. Motivated by the fact that the kernel trick can capture the nonlinear similarity of features, which may reduce the feature quantization error, we propose Empirical Kernel Sparse Representation (EKSR). We apply EKSR to reconstruct the weight vectors between samples and then design an evaluating mechanism, the Kernel Sparsity Score (KSS), to select an excellent feature subset. Owing to the natural discriminative power of EKSR, KSS can find "good" features which preserve the original structure with less information loss. Experimental results on both English and Chinese datasets demonstrate the effectiveness of the proposed method.


How to cite this article
Zhantao Deng, Guyu Hu, Zhisong Pan and Yanyan Zhang, 2012. Kernel Sparse Feature Selection Based on Semantics in Text Classification. Information Technology Journal, 11: 319-323.

Keywords: semantic structure, kernel sparse representation, feature selection, text classification

INTRODUCTION

With the rapid development of Internet applications, a huge number of content-rich news pages and forum posts are released every day. However, such network data are usually provided in high-dimensional form, which brings on the so-called "curse of dimensionality". Feature selection is an effective approach to this problem, because it preserves the original structure and selects an excellent feature subset according to an evaluating mechanism. Researchers have developed a variety of feature selection methods such as Document Frequency (DF), Information Gain (IG), Mutual Information (MI), Expected Cross Entropy (ECE), the Chi-square statistic (CHI) and Term Strength (TS).

Studies (Wright et al., 2009) have shown that no single feature selection method achieves excellent classification performance on all datasets. Therefore, we focus on designing a generalized feature selection method that selects features from a global perspective and makes the best use of the available information to reflect the inner structure of the data. In this paper, motivated by the recent development of Sparse Representation (SR) (Huang and Aviyente, 2007), which has succeeded in statistics and pattern recognition (Mairal et al., 2008a, b), we propose a simple feature selection method called the kernel sparsity score based on semantics (Se-KSS). Specifically, a semantic structure of documents is first constructed based on Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999). Secondly, an "adjacent" weight matrix of the dataset is constructed with a modified sparse representation framework, Empirical Kernel Sparse Representation (EKSR), so that the low-dimensional embedding of the data best preserves this weight matrix. Finally, an evaluating mechanism, the Kernel Sparsity Score (KSS), is designed to select an excellent feature subset. Although no supervised information is needed, KSS tends to find the "good" features which preserve the original structure.

DOCUMENT REPRESENTATION

Probabilistic Latent Semantic Analysis (PLSA) simulates the occurrence of words in documents by a probabilistic model and describes the "document-semantic-word" relationship. Suppose we are given a collection of text documents D = {d_1, ..., d_M} with terms from a vocabulary W = {w_1, ..., w_N}, and let z ∈ Z = {z_1, ..., z_K} be an unobserved class variable. Then a joint probability model over D × W is defined by the mixture:

P(d, w) = P(d) Σ_{z ∈ Z} P(w|z) P(z|d)        (1)

According to the principle of maximum likelihood estimation, the objective function of PLSA is the log-likelihood function:

L = Σ_{d ∈ D} Σ_{w ∈ W} n(d, w) log P(d, w)        (2)

where n(d, w) denotes the number of times term w occurs in document d.

The standard procedure for maximum likelihood estimation in latent variable models is the Expectation Maximization (EM) algorithm (Hofmann, 2001). The resulting K-dimensional vector of topic probabilities P(z|d) is then taken as the representation of each document (Sivic et al., 2005).
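For concreteness, a minimal NumPy sketch of the standard EM updates for PLSA is given below; the iteration count, random initialization and variable names are illustrative assumptions, not settings taken from the experiments in this paper.

import numpy as np

def plsa_em(counts, K, n_iter=50, seed=0):
    # counts: (M documents x N terms) term-frequency matrix n(d, w)
    rng = np.random.default_rng(seed)
    M, N = counts.shape
    p_z_d = rng.random((M, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # P(z|d)
    p_w_z = rng.random((K, N)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w|z)
    for _ in range(n_iter):
        # E-step: posterior P(z|d, w) proportional to P(z|d) P(w|z)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]             # shape (M, K, N)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts
        expected = counts[:, None, :] * post                      # n(d, w) * P(z|d, w)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z   # P(z|d) gives the K-dimensional document representation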

KERNEL SPARSE FEATURE SELECTION

Motivated by the kernel trick, the nonlinear data are mapped into a kernel space in which the nonlinear similarity of the features can be captured and reconstructed by sparse representation.

Empirical kernel sparse representation: Sparse representation was initially proposed as an extension of traditional signal representation (Mallat and Zhang, 1993). Suppose we are given a signal x ∈ R^m and a matrix X = [x_1, x_2, ..., x_n] ∈ R^{m×n} containing the elements of an over-complete dictionary (Murray and Kreutz-Delgado, 2007) in its columns; the goal of SR is to represent x using as few entries of X as possible. In many practical problems the signal x is noisy, so an error tolerance ε is used to handle this. This can be formally expressed as follows:

min_s ||s||_1   subject to   ||x - Xs||_2 ≤ ε        (3)

In general, this ℓ1 minimization problem can be solved with the standard l1-magic software package.
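The l1-magic package itself is not reproduced here; as a hedged substitute, the Lagrangian (LASSO) relaxation of Eq. 3 can be solved with scikit-learn, where the regularization weight alpha plays a role analogous to the error tolerance ε and its value is an assumption.

import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(x, X, alpha=0.01):
    # Represent signal x (length m) with as few columns of the dictionary X (m x n) as possible
    lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
    lasso.fit(X, x)
    return lasso.coef_   # sparse coefficient vector s of length n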

In order to solve nonlinear problems, we map the input data into a feature space by the kernel trick. Traditionally, the mapping is represented implicitly by specifying a kernel function as the inner product between each pair of samples in the feature space (Shawe-Taylor and Cristianini, 2004). Recently, the Kernel Sparse Representation (KSR) method (Gao et al., 2010) has shown its usefulness in face recognition, but the optimization of its sparse matrix is difficult. Conversely, the mapping in this paper is given in an explicit form, as described in Xiong et al. (2005). If the rank of the kernel matrix K is r, it can be decomposed as:

K = QΛQ^T        (4)

where Λ is a diagonal matrix consisting of the r positive eigenvalues of K and Q consists of the corresponding orthonormal eigenvectors. Thus, the explicit mapping, also called the Empirical Kernel Mapping (EKM) in Xiong et al. (2005), is given as Φ_e: χ → F^r:

Φ_e(x) = Λ^{-1/2} Q^T [k(x, x_1), k(x, x_2), ..., k(x, x_N)]^T        (5)

where k(·,·) is the kernel function and x_1, ..., x_N are the training samples.

Let B = KQΛ^{-1/2}; then the dot product matrix of the mapped samples {Φ_e(x_i)}_{i=1}^N generated by EKM can be calculated as:

BB^T = KQΛ^{-1/2} Λ^{-1/2} Q^T K = KQΛ^{-1}Q^T K = K        (6)

This is exactly equal to the kernel matrix in the Implicit Kernel Mapping (IKM), so the mapped samples generated by EKM and IKM have the same geometrical structure. Xiong et al. (2005) showed that, compared with IKM, EKM is easier to access and makes it easier to study the adaptability of a kernel to the input space. This is why we select EKM here.
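A minimal sketch of the EKM computation in Eq. 4-6 follows, assuming an RBF kernel; the kernel choice, its parameter gamma and the eigenvalue cut-off are illustrative assumptions.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def empirical_kernel_map(X, gamma=1.0, tol=1e-10):
    # X: (N samples x d features); returns B whose i-th row is Phi_e(x_i)
    K = rbf_kernel(X, X, gamma=gamma)            # N x N kernel matrix
    eigvals, eigvecs = np.linalg.eigh(K)         # K = Q Lambda Q^T
    keep = eigvals > tol                         # keep the r positive eigenvalues
    Lam, Q = eigvals[keep], eigvecs[:, keep]
    B = K @ Q @ np.diag(1.0 / np.sqrt(Lam))      # B = K Q Lambda^{-1/2}, so B B^T = K
    return B, Q, Lam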

In this study, the samples are explicitly mapped into the empirical feature space F^r, which can be regarded as a new view of the original input space corresponding to the kernel. We then substitute the mapped features and basis into the formulation of sparse representation and arrive at Empirical Kernel Sparse Representation (EKSR); the optimization problem is expressed as follows:

min_{s_i} ||s_i||_1   subject to   ||Φ_e(x_i) - Σ_{j≠i} s_ij Φ_e(x_j)||_2 ≤ ε        (7)

Equation 7 is convex and its ℓ1 minimization problem can also be solved with the standard l1-magic software. We can then calculate the weight matrix S = [s_1, s_2, ..., s_m]^T based on Eq. 7.
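The sketch below builds the sparse reconstructive weight matrix S of Eq. 7 by coding each mapped sample over the remaining mapped samples; the LASSO relaxation again stands in for the l1-magic solver and alpha is an assumed parameter.

import numpy as np
from sklearn.linear_model import Lasso

def eksr_weight_matrix(B, alpha=0.01):
    # B: (m x r) matrix whose rows are the empirically mapped samples Phi_e(x_i)
    m = B.shape[0]
    S = np.zeros((m, m))
    for i in range(m):
        others = np.delete(np.arange(m), i)
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
        lasso.fit(B[others].T, B[i])   # dictionary: the other mapped samples as columns
        S[i, others] = lasso.coef_     # the diagonal entry S[i, i] stays 0
    return S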

Kernel sparsity score: Although the sparse reconstructive weight matrix uses no label information, it can reflect intrinsic geometric properties of the data and contains natural discriminating information (Qiao et al., 2010).

Let X = [x_1, x_2, ..., x_m] ∈ R^{n×m} be the data matrix containing all the training samples in its columns, where x_ij denotes the jth feature of the ith sample. We expect to reconstruct each sample x_i by EKSR using as few samples as possible. A sparse reconstructive weight vector s_i = [s_i1, ..., s_i(i-1), 0, s_i(i+1), ..., s_im]^T for each x_i can be calculated through Eq. 7, where the element s_ij (j ≠ i) denotes the contribution of x_j to the reconstruction of x_i. That is:

Φ_e(x_i) ≈ s_i1 Φ_e(x_1) + ... + s_i(i-1) Φ_e(x_{i-1}) + s_i(i+1) Φ_e(x_{i+1}) + ... + s_im Φ_e(x_m)        (8)

Algorithm: Kernel sparsity score based on semantics

Fig. 1: Se-KSS algorithm

After computing the weight vector s_i for each x_i, i = 1, 2, ..., m, we can define the sparse reconstructive weight matrix as follows:

S = [s_1, s_2, ..., s_m]^T        (9)

Extending this to the entire dataset, we construct the evaluating mechanism (Dan, 2010) from the reconstruction error as follows:

KS_Score(j) = Σ_{i=1}^{m} ( x_ij - Σ_{k=1}^{m} s_ik x_kj )^2        (10)

The smaller the value of KS_Score, the stronger the ability of the corresponding feature to preserve the sparse structure. We select an excellent feature subset according to the value of KS_Score; we call this the Kernel Sparsity Score (KSS) algorithm. Based on the above discussion, the proposed algorithm is summarized in Fig. 1.
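A sketch of the scoring and selection step is given below. It assumes the plain reconstruction-error form of Eq. 10 and stores samples in the rows of the data matrix; both choices are illustrative assumptions.

import numpy as np

def kernel_sparsity_score(X, S):
    # X: (m samples x n features) data matrix; S: (m x m) EKSR weight matrix
    residual = X - S @ X                # feature-wise reconstruction error for every sample
    return (residual ** 2).sum(axis=0)  # one KS_Score value per feature; smaller is better

def select_features(X, S, n_selected):
    scores = kernel_sparsity_score(X, S)
    return np.argsort(scores)[:n_selected]   # indices of the selected features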

EXPERIMENT AND RESULT

As described above, DF, IG, CHI and TDE (Song et al., 2010) have been successfully applied to text classification. In what follows, we compare the performance of the proposed algorithm with these four popular algorithms on two datasets.

The datasets: Reuters is a subset of Reuters-21578 which has 6 different classes and contains 6223 documents for training and 2428 documents for testing. Sogou, provided by Sogou Lab for Chinese text classification, is the second dataset; we select a subset which contains 5000 documents from 5 classes, with 1000 samples per class. For preprocessing, we use the ICTCLAS word segmenter provided by the Chinese Academy of Sciences (CAS).

Experimental results:

Based on the k-NN classifier: To verify the effect of the number of topics, we evaluate the performance of the proposed method with several numbers of topics in a series of experiments using the k-NN classifier, where we set the neighborhood size k = 30. In these experiments, multi-class classification is carried out in a one-against-rest framework. The performance for different numbers of PLSA topics is shown in Fig. 2.
To verify the effectiveness of the proposed method, we evaluate all the mentioned methods with two classifiers, KNN and SVM, where the number of topics for Se-KSS is set to the optimal value found in Fig. 2. The experimental results are presented in Table 1. For SVM, we simply use the linear kernel.
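As an illustration of this evaluation protocol, the sketch below classifies the selected features with k-NN (k = 30) and a linear SVM and reports a macro-averaged F1-measure; the macro averaging and the pre-computed train/test matrices are assumptions of this sketch.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def evaluate(X_train, y_train, X_test, y_test):
    # X_train/X_test hold only the features selected by Se-KSS (or a baseline method)
    results = {}
    for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=30)),
                      ("SVM", LinearSVC())]:
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        results[name] = f1_score(y_test, pred, average="macro")  # average F1 over classes
    return results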

Fig. 2(a-b): Performance of the k-NN classifier for different numbers of PLSA topics

Table 1: The average F1-measure of different classifiers based on different feature selection methods

From the above, we can draw the following conclusions: (1) Se-KSS achieves excellent classification performance on both datasets. This shows that reconstructing the sparse weight matrix of the data can improve the F1-measure. From Fig. 2, we also see that the performance of Se-KSS depends on the number of topics, so an appropriate number of topics should be selected for each dataset. (2) DF is simple to perform, yet it generally performs as well as IG and CHI; to improve efficiency, it can therefore be used in their place. Because TDE uses information entropy to describe the term distribution, its performance exceeds that of DF, IG and CHI. (3) The classifier also affects the F1-measure significantly. However, the proposed Se-KSS generally achieves better performance than the other methods with either KNN or SVM.

CONCLUSION

We introduced a non-parametric feature selection method based on Empirical Kernel Sparse Representation (EKSR) for text classification, which obtains the sparse weight matrix in a high-dimensional feature space produced by an empirical kernel mapping. Excellent features, which capture the spatial information among documents, are then acquired automatically. The experimental results show that the proposed algorithm outperforms the compared methods on the two datasets used here without any label information.

However, the PLSA model has no way to assign a probability to an unseen document and is prone to overfitting. We may consider using Latent Dirichlet Allocation (LDA) (Blei et al., 2003), which is a well-defined generative probabilistic model and generalizes easily to new documents. Secondly, the use of label information should be considered; one way is to reconstruct the weight matrix using the tree-structured group Lasso (Liu and Ye, 2010). In future work, we will try to overcome these limitations to further improve the performance.
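As a hedged illustration of this direction, document-topic representations can be obtained with scikit-learn's LatentDirichletAllocation as a drop-in alternative to PLSA; the vectorizer settings and the number of topics below are illustrative assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_doc_topics(raw_documents, n_topics=50):
    counts = CountVectorizer().fit_transform(raw_documents)   # term-document count matrix
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)   # rows: documents, columns: topic proportions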

ACKNOWLEDGMENTS

This study was partly supported by the National Natural Science Foundation of China under Grant No. 60603029.

REFERENCES

  • Wright, J., A.Y. Yang, A. Ganesh, S.S. Sastry and Y. Ma, 2009. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., 31: 210-227.
    CrossRef    Direct Link    


  • Huang, K. and S. Aviyente, 2007. Sparse Representation for Signal Classification. In: Advances in Neural Information Processing Systems, Scholkopf, B., J. Platt and T. Hofmann (Eds.). MIT Press, Cambridge, MA, pp: 609-616


  • Mairal, J., F. Bach, J. Ponce, G. Sapiro and A. Zisserman, 2008. Discriminative learned dictionaries for local image analysis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 23-28, 2008, Anchorage, AK., USA., pp: 1-8.


  • Mairal, J., F. Bach, J. Ponce, G. Sapiro and A. Zisserman, 2008. Supervised dictionary learning. Report No. RR-6652, National Institute for Research in Computer, Paris, France, pp: 15. http://www.di.ens.fr/willow/pdfs/RR-6652.pdf.


  • Hofmann, T., 1999. Probabilistic latent semantic indexing. Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 15-19, 1999, Berkeley, CA., USA., pp: 50-57.


  • Hofmann, T., 2001. Unsupervised learning by probabilistic latent semantic analysis. Machine Learn., 42: 177-196.
    CrossRef    


  • Sivic, J., B.C. Russell, A.A. Efros, A. Zisserman and W.T. Freeman, 2005. Discovering objects and their location in images. IEEE Int. Conf. Comput. Vision, 1: 370-377.
    CrossRef    


  • Mallat, S.G. and Z. Zhang, 1993. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process., 41: 3397-3415.
    CrossRef    


  • Murray, J.F. and K. Kreutz-Delgado, 2007. Visual recognition and inference using dynamic overcomplete sparse learning. Neural Comput., 19: 2301-2352.
    Direct Link    


  • Shawe-Taylor, J. and N. Cristianini, 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, England, ISBN: 0521813972, pp: 462.


  • Gao, S., I.W.H. Tsang and L.T. Chia, 2010. Kernel sparse representation for image classification and face recognition. Proceedings of the 11th European Conference on Computer Vision: Part IV, September 5-11, 2010, Heraklion, Crete, Greece, pp: 1-14.


  • Xiong, H., M.N.S. Swamy and M.O. Ahmad, 2005. Optimizing the kernel in the empirical feature space. IEEE Trans. Neural Networks, 16: 460-474.
    CrossRef    


  • Qiao, L., S. Chen and X. Tan, 2010. Sparsity preserving projections with applications to face recognition. Pattern Recogn., 43: 331-341.
    CrossRef    


  • Dan, S., 2010. Research on feature selection algorithms based on pairwise constraints and sparse representation. Nanjing University of Aeronautics and Astronautics, Nanjing, China.


  • Song, J., M. Xu and C. Fan, 2010. A text feature selection method using TFIDF based on entropy. Proceedings of the 9th International FLINS Conference, August 2-4, 2010, Emei, Chengdu, China, pp: 962-967.


  • Blei, D.M., A.Y. Ng and M.I. Jordan, 2003. Latent dirichlet allocation. J. Mach. Learn. Res., 3: 993-1022.
    Direct Link    


  • Liu, J. and J. Ye, 2010. Moreau-Yosida regularization for grouped tree structure learning. Proceedings of the 24th Annual Conference on Neural Information Processing Systems, December 6-9, 2010, Hyatt Regency, Vancouver, BC., Canada, pp: 1-9.
