Research Article
An Enhanced Classification and Prediction of Neoplasm using Neural Network
College of Computer Sciences and Information Technologies, King Faisal University, Al-Ahsa, Saudi Arabia
Voguet et al. (2009) noted that patient age and positive margins can be considered predictive factors for residual tumor. Two expert centers conducted a breast cancer study involving 294 patients. A comparative analysis was carried out between residual and non-residual tumors. Residual tumor was identified in 202 patients, with the clinical risk factors taken into account. They concluded that positive margins on the operative specimen, together with younger age, are risk factors for this type of tumor. Malignant neoplasms spread to adjacent areas and increase mortality in women. The female breast is composed of three major components: milk-producing glands, fatty tissue and blood vessels. There is also a collection of immune structures that are small, bean shaped and carry fluid. Malignant masses can harm these sensitive protective structures and spread further to other regions.
The application of Computer Aided Diagnostic (CAD) tools and methods offers a promising means of detecting malignant carcinomas (Voguet et al., 2009). Typically, CAD systems convert the laser signals into digital ones for processing by a microprocessor. Useful comparisons can be made between mammograms obtained conventionally and those obtained with CAD.
Closer attention can be paid to specific regions (the borders of carcinoma cells). An interesting study of 84 cases of phyllodes tumors (Foxcroft et al., 2007) revealed 71 as benign, 5 as malignant and 8 as borderline. Ultrasound and mammography were described as non-specific. The study showed that accuracy is better in small carcinomas because larger ones need more sampling. It was further noted that fibro-adenomas had to be reviewed using ultrasound. Early detection improves the chances of recovery and underlines the need to develop improved methodologies for detection and diagnosis (Akhtar et al., 2008). Several different techniques are employed for the detection of malignant masses. Among these, mammography is considered a suitable approach for carcinoma detection. There is also a need to improve the techniques, both X-ray and mammography itself, by enhancing mammogram images.
Ultrasound is considered another suitable imaging approach for detecting masses that can remain hidden in conventional mammography. A sonogram is the picture produced by recording the echoes of high-frequency sound waves. Some tiny calcium deposits may remain hidden from ultrasound, so this methodology may not be recommended for early or routine screening. Another enhanced approach is digital mammography, in which the images are manipulated on computer systems instead of film. There is a trade-off between digital and conventional mammography. Digitization can aid understanding but does not guarantee the best solution.
A population-based study of 356 patients concluded that tumor location is not an independent prognostic factor for survival. In contrast, critical factors such as patient age, tumor size, central location, number of positive lymph nodes and histological type are associated with a major risk of death. The experiments showed that location does not influence survival (Jayasinghe and Boyages, 2009).
Hassouna et al. (2006) reported another interesting case study of 106 patients with a mean age of 39.5 years and a mean tumor size of 83 mm. The study revealed the comparative difference in prognosis and treatment for phyllodes tumors. Overall, 82 patients were treated by conservative surgery and 24 by radical surgery. During the observation period, 8 patients developed metastases. Simple mastectomy was recommended for malignant phyllodes tumors and wide excision for benign and borderline ones.
Mohamad et al. (2009a) proposed a two-stage gene selection method that selects a smaller subset of useful genes. This smaller subset is produced automatically by filters that perform a pre-selection. Further optimization was achieved through a multi-objective integrated method. They tested the method on three different gene expression datasets. The approach was applicable to micro-array gene expression datasets, which can estimate the degree of difference between cancerous and normal sets. Mohamad et al. (2009b) highlighted the problems in micro-array data analysis and in the comparison between cancerous and normal cells. The proposed approach overcame these hurdles (removal of noisy data, irrelevant gene data, etc.) in order to select a small subset of the dataset for cancer classification. The approach is essentially a cyclic hybrid technique, and the authors used five real datasets to test it. Mohamad et al. (2010) proposed a three-stage method for selecting informative genes to classify cancer. The problem addressed is the selection of a set of disease genes from a large volume of micro-array genetic data. The proposed method comprises initial selection by a filter technique, an integrated optimization method and frequency analysis of gene appearance in diverse near-optimal gene subsets. This contribution helps the classifier to classify patterns accurately.
A critical review of the literature indicated that information on the classification and prediction of neoplasm using neural networks is inadequate. Therefore, the main objective of the present investigation was to propose an enhanced approach for classification and prediction of neoplasm using a neural network.
THE PROPOSED ENHANCED APPROACH
Neoplasm classification and prediction were carried out with a neural network, preceded by data segmentation and purification and by the selection of the best parameters for a feasible network. The proposed approach comprised the following tasks.
• Purification and segmentation
• Development of feasible network architecture
• Neoplasm classification
• Neoplasm prediction
• Flexibility
Developing a network from a pool of choices within the allowable feasible range is very important for approaching optimum results. Even an optimal network is of little value without good clustering, cleansing and scalability.
PURIFICATION AND SEGMENTATION
This stage of the analysis comprises the following steps.
• Data gathering
• Purification/cleanliness
• Textual to numeric conversion
• Data integration
• Segmentation/clustering
The neoplasm datasets contain clinical data described by the following parameters, listed in Table 1: Diagnosis, Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave Points, Symmetry, Fractal Dimension, Single Epithelial Radius, Single Epithelial Texture, Single Epithelial Perimeter, Single Epithelial Area, Single Epithelial Smoothness, Single Epithelial Compactness, Worst Radius, Worst Texture, Worst Perimeter, Worst Area, Worst Smoothness, Worst Compactness, Worst Concavity, Worst Concave Points, Worst Symmetry and Worst Fractal Dimension.
Table 2, on the other hand, shows a snapshot of the final form of the data after the necessary adjustments. This translation made it possible to train the network more effectively and to reach an optimal solution more efficiently. It also reduced the overhead involved in automatic dataset translation by the central process. As a result of this translation, complex architectures were avoided in favor of simpler combinations with fewer layers and neurons.
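The purification and textual to numeric conversion steps can be illustrated with a minimal sketch. The field name `Diagnosis`, the codes `M`/`B`, the mapping to 1.0/0.0 and the policy of discarding incomplete records are assumptions made for illustration; the study does not state its exact encoding.

```python
def purify(records, label_key="Diagnosis", label_map=None):
    """Drop incomplete records and convert every field to a number.

    `records` is a list of dicts. The label codes and the drop-if-missing
    policy are illustrative assumptions, not the study's documented procedure.
    """
    label_map = label_map or {"M": 1.0, "B": 0.0}
    cleaned = []
    for rec in records:
        # Purification: skip records with missing or empty values.
        if any(v is None or str(v).strip() == "" for v in rec.values()):
            continue
        label = str(rec[label_key]).strip().upper()
        if label not in label_map:
            continue
        # Textual to numeric conversion of the diagnosis label.
        row = {label_key: label_map[label]}
        try:
            for key, value in rec.items():
                if key != label_key:
                    row[key] = float(value)
        except ValueError:
            # A non-numeric measurement is treated as a corrupt record.
            continue
        cleaned.append(row)
    return cleaned
```

Feeding the network purely numeric records in this way avoids a separate translation step during training.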
The segmentation phase divided the dataset into random clusters for training and testing the selected combinations. Figure 1 describes the segmentation of the datasets into random subsets for neoplasm training and testing. The dataset was fragmented into four chunks of unequal length. Each neoplasm segment contained four clusters and each cluster contained four sub-clusters. One larger cluster was paired with a smaller cluster for training, while another larger and smaller pair was used for testing. This combination of sub-systems underpinned the solution obtained with the study approach.
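As a rough sketch of this segmentation, the following code shuffles record indices, cuts them into four chunks of unequal length and pairs a larger chunk with a smaller one for training and for testing. The fractions, the seed and the function names are hypothetical; the study does not report the actual chunk sizes.

```python
import random


def segment(n_records, fractions=(0.4, 0.1, 0.3, 0.2), seed=0):
    """Randomly split record indices into chunks of unequal length.

    The fractions are illustrative assumptions only.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    chunks, start = [], 0
    for i, frac in enumerate(fractions):
        # The last chunk takes whatever remains so no record is lost.
        if i == len(fractions) - 1:
            end = n_records
        else:
            end = start + int(round(frac * n_records))
        chunks.append(indices[start:end])
        start = end
    return chunks


def train_test_pairs(chunks):
    """Pair one larger chunk with one smaller chunk for each role."""
    ordered = sorted(chunks, key=len, reverse=True)
    training = ordered[0] + ordered[3]   # largest with smallest
    testing = ordered[1] + ordered[2]    # the two middle-sized chunks
    return training, testing
```

Fixing the seed keeps the random split reproducible, which matters when several architectures are compared on the same clusters.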
DEVELOPMENT OF FEASIBLE NETWORK ARCHITECTURE
Several combinations were generated to build, test and assess cluster suitability for this approach.
Fig. 1: Data clusters of neoplasm for training and testing
Table 1: Neoplasm dataset
Table 2: Purified dataset
Since the neural network requires training on sample datasets, various combinations were tried by varying the number of hidden layers and the number of neurons in each layer. Only those architectures that gave the most significant root mean square error (RMSE) values were listed. The selected architectures are described in Table 3.
Table 3 describes 8 cases, each with a random number of sigmoid layers and a different number of neurons in each layer. The root mean square error increased as the number of hidden layers and the number of neurons on those layers increased, and decreased when the number of hidden layers was reduced and the number of neurons was increased.
Table 3: Random combination of layers with neurons against MSE
Based on this, case 7 was chosen as the architecture for the specified clusters of datasets.
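The selection step can be sketched as follows, assuming each candidate architecture has already been trained and its outputs on a validation cluster recorded. The candidate labels and the helper names are hypothetical, and the lowest-RMSE rule is an assumed reading of how case 7 was selected.

```python
import math


def rmse(targets, outputs):
    """Root mean square error between target and network output values."""
    if not targets or len(targets) != len(outputs):
        raise ValueError("targets and outputs must be non-empty and equal in length")
    squared = sum((t - o) ** 2 for t, o in zip(targets, outputs))
    return math.sqrt(squared / len(targets))


def select_architecture(candidates, targets):
    """Return the candidate with the lowest RMSE, plus all scores.

    `candidates` maps an architecture label (hypothetical) to the outputs
    that architecture produced for the same validation records.
    """
    scored = {name: rmse(targets, outputs) for name, outputs in candidates.items()}
    best = min(scored, key=scored.get)
    return best, scored
```

Keeping the full score table alongside the winner makes it easy to tabulate all cases in the manner of Table 3.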
NEOPLASM CLASSIFICATION
With a compact and efficient network selected, training and cross validation of the clusters were performed. For instance, Fig. 2 shows that the training confusion matrix gave 96% correct diagnosis for malignant cases and 99.45% for benign cases. The main diagonal (upper left to lower right) gives the values for benign and malignant cases, respectively, while the anti-diagonal (upper right to lower left) gives the corresponding error values when training on combinations of larger with smaller clusters.
Figure 3 describes the cross validation confusion matrix. Malignant cases in the cluster datasets were identified at 100% and benign cases at 97.64%. The off-diagonal errors were about 2% for benign clusters and 0% for malignant chunks. However, a slight variation was noticed between the training and cross validation values for the chosen architecture. This might be due to the small difference in cluster size between the network training and classification phases.
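The percentages read from a confusion matrix can be computed as in the sketch below, where each row is normalized by the number of actual cases in that class. The class labels and the function name are illustrative assumptions, and the code does not reproduce the study's own figures.

```python
def confusion_percentages(actual, predicted, classes=("benign", "malignant")):
    """Row-normalized confusion matrix, expressed in percent.

    result[a][p] is the share of cases whose actual class is `a`
    that the network assigned to class `p`.
    """
    counts = {a: {p: 0 for p in classes} for a in classes}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    percentages = {}
    for a in classes:
        total = sum(counts[a].values())
        percentages[a] = {
            p: (100.0 * counts[a][p] / total if total else 0.0) for p in classes
        }
    return percentages
```

The diagonal entries correspond to correctly identified cases and the off-diagonal entries to the error percentages.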
Figure 4 shows how the training and cross validation errors evolve. There was a slight difference between Training (T) and Cross Validation (CV) at the start of classification, and both curves behaved in the same fashion after about 200 epochs. The small difference between the curves indicates good classification of the clusters and a suitable combination of sigmoid layers and neurons for T and CV.
Fig. 2: Training confusion matrix
Fig. 3: Cross validation confusion matrix
Fig. 4: Mean square error in T and CV
On the other hand, the root mean square error is below 0.1 (an average value below 0.5 is considered suitable), which indicates very good classification over the trained and tested clusters.
NEOPLASM PREDICTION
In contrast to classification, predictions were made over 20% of the larger and smaller data chunks and the result was generalized to the rest of the clusters, since there was only a slight variation (normally less than 0.5%) in cross validation between the entire domain and the selected regions. Mean Square Error (MSE) values for training and cross validation during prediction are presented in Fig. 5. The average MSE is below 0.03 for training and below 0.05 for CV. These results were very favorable for prediction. The idea was generalized to cover not only similar clusters in the same domain but also diverse clusters in other domains (larger and smaller).
Figure 6 shows the training and cross validation curves, which behaved similarly after about 170 epochs and differed slightly between 60 and 150 epochs. Both curves started side by side at 0.5 and declined continuously, ending close together along the x-axis after 200 epochs. The MSE values were low, as desired for the prediction of neoplasm in clinical clustered datasets.
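One way to quantify the point at which the two curves start to behave alike is to find the first epoch after which the gap between training and cross validation error stays within a tolerance. The tolerance value and the function name are assumptions for illustration; the study reads this point from the plotted curves.

```python
def convergence_epoch(train_errors, cv_errors, tolerance=0.01):
    """First epoch from which |T - CV| stays within `tolerance`.

    Returns None when the curves are still apart at the final epoch.
    The tolerance is an illustrative assumption.
    """
    n_epochs = min(len(train_errors), len(cv_errors))
    last_gap = -1
    for epoch in range(n_epochs):
        if abs(train_errors[epoch] - cv_errors[epoch]) > tolerance:
            last_gap = epoch
    converged = last_gap + 1
    return converged if converged < n_epochs else None
```

Applied to per-epoch error logs, this gives a reproducible counterpart to a visual estimate such as "after about 170 epochs".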
FLEXIBILITY
This enhanced approach for neoplasm classification and prediction is highly flexible and can be scaled to any problem size. Figure 7 presents the enhanced approach with the flexibility feature that accommodates different problem sizes. The first rectangle, DS, represents the actual neoplasm dataset with its length and complexity. The last small rectangles show segments containing larger and smaller clusters. The processing elements define neoplasm classification and prediction in terms of the parameters established for this approach (cost factors, confusion matrices, training, testing and the percentage of malignant and benign identification with cross validation).
Fig. 5: MSE values for prediction
Fig. 6: Active cost of criterion for T and CV for prediction
Fig. 7: Flexibility for classification and prediction
This observation is based on:
• Complexity of the problem
• Efficient segmentation of larger datasets into smaller segments
• Distribution of variant clusters to segments
• Selection of the best combination of architecture
• Availability of a wide choice of sigmoid layers and number of perceptrons
• Any assignment between processing elements and clusters
• Cross validation with multi-view analysis of the criterion
• Assembly of chunks when needed (if required by the nature of the problem being addressed)
Neoplasm classification and prediction using dataset segmentation into larger and smaller combinations of clusters were demonstrated through discussion of the cost factor, the confusion matrix percentages and the percentage of data used for training and cross validation. The study chose to divide larger sets into segments of equal length with unequal numbers of dataset clusters. The different combinations helped to visualize neoplasm classification and prediction through graphical presentation of cost factors and confusion matrices. This small-scale testing and validation can be extended to any number of datasets with diverse numbers of segments and clusters (the scalability factor of this enhanced mechanism). Table 4 shows the cost factor values for various combinations of clusters for training and testing datasets. With larger cross validation and training datasets, a good percentage of malignant and benign neoplasm categorization could be expected in terms of the cost factor shown graphically. This percentage fell as the number of clusters for T and CV was reduced. All cases gave feasible RMSE values, because the optimal network architecture was selected carefully from a pool of larger and smaller clusters of data. Similar work was carried out by Chen et al. (2002), who diagnosed breast tumors using a combination of a neural network and the wavelet transform. The segmentation algorithm takes advantage of features such as variance contrast, distribution distortion and autocorrelation contrast. The sonograms were analyzed using multilayer perceptrons.
Table 4: Diversity in cases for cost factor against combinations of clusters
These network approaches were tested and evaluated using 242 cases, with k-fold cross validation for performance evaluation. Also, Meinel (2005) developed a computer-aided diagnostic system for breast MRI lesion classification. The findings of this investigation also agree with those of Wang et al. (2010), who described a tumor classification method using a probabilistic neural network classifier with neighborhood rough set based gene reduction. It is also important to address the problems posed by high dimensionality and small sample size in datasets. An iterative search algorithm was used to select an initial set, and the minimum gene subset was found by refining this initial set. An ensemble classifier was constructed using a majority voting strategy. The cross validation of results was performed on a single biomedical experiment.
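The majority voting strategy mentioned above can be illustrated with a short generic sketch. It is not the implementation of Wang et al. (2010); the function name and the tie-breaking behavior are assumptions.

```python
from collections import Counter


def majority_vote(predictions_per_classifier):
    """Combine several classifiers' label lists by majority vote.

    Each inner list holds one classifier's predictions for the same
    samples, in the same order. On a tie, the label that was counted
    first wins (an arbitrary, illustrative choice).
    """
    n_samples = len(predictions_per_classifier[0])
    combined = []
    for i in range(n_samples):
        votes = Counter(preds[i] for preds in predictions_per_classifier)
        combined.append(votes.most_common(1)[0][0])
    return combined
```

Using an odd number of classifiers avoids ties in a two-class problem such as benign versus malignant.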
Table 5 presents different combinations of larger and smaller clusters with the corresponding output axon values. Cluster 1, with data range (training and testing) 10, 10, is more closely packed. The RMS value starts at 0.9 and fluctuates between 0.9 and 0.2, with most of the fluctuation between 0.2 and 0.6. It represents an average error value for the output axon. Case 2 shows the same phenomenon between about 0.7 and 0.1.
Table 5: Cluster range with output axon demonstration
Cases 3, 4 and 5 are similar, with slight variations. Case 6 resembles cases 1 and 2. In general, an average number of clusters (combining both larger and smaller) for training and cross validation provided a better solution for both classification and prediction of neoplasm.
Computer aided tools and technologies play an important role in the multi-dimensional analysis of important biological data. This investigation addressed the challenging problem of classifying and predicting neoplasm by means of data segmentation and clustering. The results highlighted this as an enhanced and scalable approach with high-level conceptual granularity. Instead of analyzing the data as a whole, the clinical data were segmented into variant clusters housed in equal sized chunks. Each segment was analyzed in terms of its degree of prediction, with the best neural combination selected on the basis of cross validation, root mean square error, active cost and confusion matrices. The enhanced approach obtained 100% for benign and 99.5% for malignant cases, with a least feasible RMS value of 0.017. The proposed enhanced approach was elaborated with textual, experimental and graphical details, with the added feature of scalability to any problem size and complexity.