Research Article
The Evolution of Reusable Programs Using Genetic Algorithm
Department of Computer Science, Information Technology College, University of Bahrain, Bahrain
One of the central challenges of computer science is to get a computer to solve a problem without explicitly programming it. Genetic Programming (GP), (Koza, 1992; Weijie et al., 2011) is a domain-independent problem-solving approach in which computer programs are evolved to solve, or approximately solve, problems. It is an application of Genetic Algorithm (GA) (Mosavi, 2011; Mahi and Izabatene, 2011), which is a search algorithm based on the mechanics of natural selection and natural genetic and based on the Darwinian principle of reproduction and survival of the fittest and analogs of naturally occurring genetic operations such as crossover and mutation. GA was suggested by Holland in the seventies (Holland, 1975). Over the last twenty years, it has been used to solve a wide range of search, optimization and machine learning problems (Goldberg, 1989).
GP is adaptive algorithm, where the structure under adaption is a population of chromosomes represent the candidate programs (solutions).
There are different representation methods used to represent the chromosomes, the most common on the Lisp expressions (Koza, 1992, 1995, 1999) uses LISP expressions to represent the population programs. However, there is another representation method, which is proposed by Paterson and Livesey (1996, 1997), They introduced a new method for program representation. He suggested a different approach using Backus Naur Form (BNF) definition which is a notation for expressing the grammar of a language in the form of production rules. He used a fixed length chromosome which encodes production rules, where the genotype (genetic search space element i.e., chromosome) is distinct from the phenotype (solutions space element i.e., program). This approach is called Genetic Algorithm for Developing Software by Peterson (GADS). The GADS (Al-Bastaki and Awad, 2010) genotype is a list of integers which, when input by a suitable generator, causes that generator to output the program that is the corresponding phenotype. The mapping from genotype to phenotype is called ontogenic mapping.
In this study, we use the representation method of GADS in order to represent the programs. The main objective is to investigate effectiveness of applying different architecture altering operations introduced by Koza (1999) with GADS. Thus, by using the proposed approach, more complex structured programs can be evolved efficiently comparable with traditional GADS and GP.
The symbolic regression problem has been chosen as case study here. GP has been used in wide area of applications (Pugazhenthi and Rajagopalan, 2007), one of them is to solve the symbolic regression problem. Symbolic Regression can be viewed as the process of shaping an equation from a given set of points.
Many efforts have been made to use genetic algorithms to solve symbolic regression problems. One of the problems that plagues most of the efforts is finding a way to efficiently mutate and cross-breed symbolic expressions so that the resulting expressions have a valid mathematical syntax. One approach to this problem is to perform a mutation, check the result and then try a different random mutation until a syntactically valid expression is generated. Obviously, this can be a time consuming process for complex expressions. A second approach is to limit what type of mutations can be performed-for example, only exchanging complete sub-expressions. The problem with this approach is that if limited mutations are used, the evolution process is hindered and it may take a large number of generations to find a solution, or it may be completely unable to find the optimal solution (Gene expression programming, 2008). In this study constrained GADS, which is inspired from the concept of strongly-typed GP (Haynes et al., 1995), with automatically defined functions (ADF) is presented.
GA AND GP
One of the component methodologies of computational intelligence is evolutionary computation. There are number of evolutionary computation techniques, such as GA, Genetic Programming (GP) (Kumarci et al., 2010), Cultured Algorithms and Differential Evolution algorithms. Regardless of the technique used, evolutionary computation applications follow a similar procedure:
• | Initialize the population |
• | Evaluate each individual in the population |
• | Select individuals |
• | Produce a new population by applying a number of operations on selected individuals |
• | loop to step b until some condition is met |
GA is a general, probabilistic and adaptive search algorithm. GA is a stochastic search algorithm that is based on the Darwinian principle of survival of the fittest. It works on population of individuals (chromosomes) that represent the candidate solutions.
Gas have been applied to solve complex problems (Christy and Thambidurai, 2006; Chettih et al., 2008) which has been used in a large number of scientific and engineering problems, such as optimization, automatic programming and machine learning.
GP is an application of GA in which the chromosomes are the candidate programs. It used to solve problems by generating the program that can be used to solve this problem. Thus, with GP the structure under adaption is more complex.
GADS WITH ADF
To work with GP, we have to determine the function library and representation scheme, in addition to the fitness function.
Table 1: | The syntax rules |
The representation method used here is variable length chromosomes, where the genes are integer numbers represent the number of the syntax rules. The syntax rules (BNF) used is presented in Table 1 and the function and terminal sets, F and T are:
Automatically Defined Functions (ADF) has been introduced by Koza (1995), where GP will automatically and dynamically evolve a combined structure containing ADF and a calling program capable of calling the ADF. In this work ADF is a technique used with GADS.
When an ADF is encountered in a genotype, a random number is generated which represents the number of function parameters, then the body of the function is constructed. Therefore, the phenotypes consists of: the root (ADFi, where i is the identification number of the function which is incremented whenever a new ADF is introduced), the list of parameters (the number of these parameters is generated randomly) and function definition.
In this Table 1, n represent an integer number, X is a variable, P is the list of parameters and n Call P represents calling the function ADFn with the parameter list P, in which P should be a list of constants (integers). Also, n of rule (9) represents the number of parameters (number of variables in the function definition).
In order to generate a well formed expression, constrained GADS is used. Thus, The syntax of the programs should be preserved during the initial population generation and by the genetic operation used to modify the population. Therefore, the generation of a gene in the chromosomes is simply based on some constraints (according the syntax rules defined in Table 1, such that: if a1, a2, .,aL is the genotype, the selection of gene aL+1 is not randomly, instead, it is dependent on the gene aL. Therefore, each gene has a number of allowed genes to appear after it.
The process of generating an ADF in a genotype of the initial population can be performed as follows:
• | Generate a random number n which the number of function parameters |
• | Define the body of the function recursively, where the primitives that composed the function is either a function, or an integer number in the range 1....n |
We need to mention here that wherever an ADFi is defined in a chromosome, it is replaced by the function i call P, i.e., rule number 4 and the function definition is stored in a separate array. Furthermore, it is not allowed to include the gene (4) in the chromosome unless an ADF is found in this chromosome.
GENETIC OPERATIONS
In this study, different genetic operations (Wasan, 2008) can be used to modify the populations. The operations used are as follows:
• | The crossover operator: It must be implemented so that two chromosomes (genotypes) that are syntactically correct to produce two offsprings that are also syntactically correct. The following are the steps needed to perform the modified brood crossover operator: |
Step 1: | Select two parents from the population |
Step 2: | Select a gene randomly from the first parent and select a gene randomly from the second parent |
Step 3: | Test that genes for the syntactic constraint |
Step 4: | If it is correct then exchange the genes, otherwise, select another gene from the second parent until the correct gene is found |
Step 5: | Steps 2-5 is repeated NB times to generate 2*NB offspring |
Step 6: | Evaluate each of the children for fitness. Select the best two, they are considered as the children of the parents. The remaining of the offspring are discarded |
• | The mutation operator: It involves the selection of a gene randomly from a genotype and then generate a gene randomly to replace the selected gene. Check the left and right neighbors of that gene, if it satisfies the syntactic rules, then replace it, otherwise, select another gene |
In this method, the genotypes have a variable length. Thus, lengths of genotypes in the population are selected randomly and the max. length must be specified beforehand by the user and depends on the problem
• | The function deletion operator: It used to delete an ADF. A random ADF is selected randomly |
• | Altering the AFD definition be changing the number of parameters: This can be performed as follows: when an ADF is selected for this operation, a random number is generated to be the new p of the ADF. Then, change the body of the ADF by replacing the integer numbers that represent the parameters by new numbers generated randomly |
EXPERIMENTAL WORK
The proposed modified GADS has been implemented to solve the symbolic regression problem using C++ programming language. Each chromosome has been implemented as a structure of the following fields: one-dimensional array of integers (chromosome), ADF definitions (if any), chromosome length and chromosome fitness value. Where the fitness function used is:
(1) |
where, actualO is the actual output of the chromosome x and desiredO is the desired output.
Table 2: | Results of different experiments |
The genetic parameters used are: population size = 100, crossover probability = 1, mutation and other operators are of probability = 0.05.
For example, the expression to be evolved is:
X4+X3+X2+X |
Using the syntax rules of Table 1, after 22 generations the following genotype has been obtained:
17150217020215180202170202 |
The corresponding phenotype is:
(x+(x*x))*((x%x)+(x*x)) |
In another run, after two generations, the following genotype has been obtained:
171502170202191503170202 |
The corresponding phenotype is:
(x+(x*x))*((1) call 1)) |
where, ADF1 (1+(x*x)).
Table 2 presents the results of different experiments listed bellow and these results are the average number of generations needed to find the correct solution (program) which are obtained by executing the proposed method 100 times:
• | Ex1: | Without ADF |
• | Ex2: | With ADF and the genetic operations used are crossover and mutation only |
• | Ex3: | With ADF and the genetic operations used are crossover, mutation and ADF deletion |
• | Ex4: | With ADF and the genetic operations used are crossover, mutation and changing the number of parameters |
It is clear that GADS with ADF gives the best results in term of number of generations and GADS without ADF is inefficient especially for complex problems and expressions.
In this study, GADS has been used in which the structure under adaption is a population of strings, while in GP, the structure is a population of programs (LISP expressions). Thus, GADS uses the GA engine and works on simpler structure. Hence, we expect an improvement in the efficiency in terms of time and storage space, in addition to simplify the implementation of genetic operations such as crossover and mutation.
The main objective of this study is to introduce the concept of reusability and ADF to GADS which can improve the efficiency especially for complex problems. The function reusability has been introduced by using ADF with the call function, in addition to different altering architecture operators. The symbolic regression problem is considered here and we have observed that the number of generations needed to find the correct solution is minimized comparable with GADS without ADF. For future work, other genetic operations and applications will be studied.