INTRODUCTION
Modern robots require ever more sensing, decision-making and technology^{[1]}. One of the most difficult and most demanded capabilities today is certainly robot navigation in generally unstructured environments^{[2]}.
It is in this context that we set out to make the robot learn to follow. The first step is to make the robot learn to follow a straight-line trajectory towards an objective: this following behaviour constitutes one of the modules of mobile-robot navigation within a behaviour-based architecture^{[3]}.
Robot architecture: A fairly standard architecture was used^{[4]}: the robot considered is circular and carries three sensors, one at the front and one on each side. The chosen sensor orientations are at 45° on either side of the frontal axis of the robot (Fig. 1).

Fig. 1: 
Structure and position of the sensors 
The robot must move along a straight-line trajectory, from a starting point “D” to an arbitrary objective point “A”; it must therefore learn to follow this trajectory. Using its three sensors, the robot measures the distance “l” to a possible obstacle, its own orientation θ and the angle θ′ towards the objective (Fig. 2).
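The angle towards the objective and the heading error can be computed directly from the robot's pose and the goal coordinates. The sketch below is a minimal illustration; the function name and the 2-D pose representation are our own assumptions, not taken from the paper:

```python
import math

def goal_bearing(x, y, theta, gx, gy):
    """Return the bearing theta' from the robot at (x, y) with heading
    theta (radians) to the goal at (gx, gy), the wrapped heading error,
    and the distance to the goal."""
    dx, dy = gx - x, gy - y
    theta_goal = math.atan2(dy, dx)                # absolute bearing of the goal
    # wrap the heading error into (-pi, pi]
    err = (theta_goal - theta + math.pi) % (2 * math.pi) - math.pi
    return theta_goal, err, math.hypot(dx, dy)
```

With the goal straight ahead the error is zero, so only a deviation from the line produces a corrective term.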
Navigation: Our aim is first to let the robot orient itself directly towards the objective point, and then to make it learn to move along this trajectory while holding its straight line.
Fuzzy logic is known to give good results in this kind of navigation problem, where no analytical model of the environment is available. Note, however, that as soon as the environment becomes complex, two problems emerge:
• the difficulty of constructing the rule base
• the refinement of this rule base
To remedy these problems, we propose a model combining fuzzy logic with reinforcement learning. The main concepts of reinforcement learning are the agent and the environment^{[5-7]}. The agent has a number of possible actions, which it performs in the environment, modelled as a set of states. In some states the agent receives a signal from the environment called the reward. The task of reinforcement learning is to find the action that maximizes the discounted reward, a quantity called the Q-value.
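As a purely illustrative aside, the agent/state/reward/Q-value loop can be sketched on a toy one-dimensional grid world; all states, actions, rewards and constants below are invented for illustration and are unrelated to the robot task:

```python
import random

# Toy 1-D world: states 0..4, goal at state 4; actions: -1 (left), +1 (right).
# Q[s][a] estimates the discounted reward of taking action a in state s.
Q = {s: {-1: 0.0, +1: 0.0} for s in range(5)}
beta, gamma, eps = 0.5, 0.9, 0.2   # learning rate, discount, exploration rate

random.seed(0)
for _ in range(200):                          # learning episodes
    s = 0
    while s != 4:
        if random.random() < eps:             # explore
            a = random.choice([-1, +1])
        else:                                 # exploit the current estimate
            a = max(Q[s], key=Q[s].get)
        s2 = min(max(s + a, 0), 4)
        r = 1.0 if s2 == 4 else 0.0           # reward only at the goal
        Q[s][a] += beta * (r + gamma * max(Q[s2].values()) - Q[s][a])
        s = s2

# After learning, moving right should dominate in every non-goal state.
policy = {s: max(Q[s], key=Q[s].get) for s in range(4)}
```

The inner update is exactly the "discounted reward" bookkeeping described above: each Q-value is pulled towards the immediate reward plus the discounted best value of the next state.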
The procedure comprises two stages:
• an exploration phase
• an exploitation phase
The reasoning process: The best-known and most popular reinforcement learning method is Q-learning, in which an agent successively updates a quantity Q_{i}(x,a) representing the quality of the selected action "a" in state "x". In this framework, we use a variant of Q-learning combined with fuzzy logic, in order to exploit its properties as well as possible, such as the formulation of human knowledge as fuzzy rules and the handling of imprecise and vague data. Its principle is to propose several conclusions for each rule and to associate with each potential conclusion a quality function^{[8,9]}:
R_{i}: If x_{1} is A_{1}^{i} and … and x_{n} is A_{n}^{i} Then
y is u[i,1] with q[i,1] = 0
or y is u[i,2] with q[i,2] = 0
…
or y is u[i,N] with q[i,N] = 0
where (u[i,j])_{j=1}^{N} are the potential conclusions, whose qualities are initialized arbitrarily.
The inferred output is the truth-value-weighted average of the conclusions selected for the fired rules:
y(x) = Σ_{i} α_{i}(x)·u[i,i*] / Σ_{i} α_{i}(x)
where i* denotes the conclusion selected for rule i. The quality of this action is computed in the same way:
Q(x) = Σ_{i} α_{i}(x)·q[i,i*] / Σ_{i} α_{i}(x)
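A minimal sketch of this inference, assuming the per-rule selected conclusions u[i,i*] are crisp values and using the truth values α_{i}(x) as weights (the function name and data layout are illustrative assumptions):

```python
def infer(rules, alphas):
    """Blend the selected conclusions of the fired rules.

    rules  : list of (u, q) pairs, the conclusion u[i, i*] currently
             selected for rule i and its quality q[i, i*].
    alphas : truth values alpha_i(x) of the rules for the input x.
    Returns the inferred action y(x) and its quality Q(x), both as
    truth-value-weighted averages over the rules.
    """
    w = sum(alphas)
    if w == 0:
        return 0.0, 0.0                 # no rule fired
    y = sum(a * u for a, (u, _) in zip(alphas, rules)) / w
    Q = sum(a * qi for a, (_, qi) in zip(alphas, rules)) / w
    return y, Q
```

For example, two rules firing equally with opposite conclusions cancel out, while their qualities simply average.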
Our approach is similar in principle to the one just described, except that the set of suggested actions is not made of crisp values but of fuzzy subsets. In practice we may indeed face cases where the actions to choose from are not given as known numerical values but as linguistic terms, such as choosing between turning slightly and turning halfway.
Hence the rules we use take the following form:
R_{i}: If x_{1} is A_{1}^{i} and … and x_{n} is A_{n}^{i} Then
y is B[i,1] with q[i,1] = 0
or y is B[i,2] with q[i,2] = 0
…
or y is B[i,N] with q[i,N] = 0
where B[i,j] is the fuzzy subset associated with rule i and conclusion j.
Fuzzification of inputs and outputs: In the present case we consider fuzzy linguistic rules with two inputs: Δθ, the difference between the robot's course (its orientation) and the direction of the objective (Δθ = θ − θ′), and "d", the distance to an obstacle. The output is the orientation α that the robot must take.
The fuzzy subsets and their fuzzification are therefore as shown in Fig. 3.

Fig. 3: 
Fuzzification of inputs and outputs 
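Since the exact breakpoints of Fig. 3 cannot be reproduced here, the sketch below assumes triangular membership functions for the Δθ input; the ranges are illustrative guesses, not the paper's values:

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Assumed shapes for the heading-error input (degrees); the real
# breakpoints are those shown in Fig. 3.
def mu_N(dtheta): return tri(dtheta, -90.0, -45.0, 0.0)   # Negative
def mu_Z(dtheta): return tri(dtheta, -45.0, 0.0, 45.0)    # Zero
def mu_P(dtheta): return tri(dtheta, 0.0, 45.0, 90.0)     # Positive
```

Any Δθ between the two outer feet then belongs to at most two adjacent subsets with degrees summing to 1, which is what lets neighbouring rules fire together.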
Construction of the rule base: The rules of the simulator have the form:
R_{1}: If Δθ is N and d is N Then
α is N with q_{1N}
or α is Z with q_{1Z}
or α is P with q_{1P}
In total, we have nine rules:
R_{i}: If Δθ is A and d is B Then
α is N with q_{iN}
or α is Z with q_{iZ}
or α is P with q_{iP}
where A and B are fuzzy subsets whose linguistic terms can be N (negative), Z (zero) or P (positive).
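The nine-rule base, with its three candidate conclusions per rule and qualities initialized to zero, can be sketched as a simple table; the data layout is an illustrative assumption:

```python
from itertools import product

TERMS = ("N", "Z", "P")

# One rule per (Delta-theta term, d term) pair: nine rules in total,
# each carrying the three candidate outputs with qualities q_iN, q_iZ,
# q_iP initialized to 0.
rule_base = {
    (dtheta_term, d_term): {"N": 0.0, "Z": 0.0, "P": 0.0}
    for dtheta_term, d_term in product(TERMS, TERMS)
}
```

Learning then only ever rewrites the quality values; the linguistic structure of the rules is fixed from the start.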
Exploration phase: The first step of reinforcement learning is exploration, which consists in progressively identifying the best actions. We propose the block diagram of Fig. 4 and follow the algorithm below:
1. Initialize the qualities q_{iA} to 0
2. Repeat for a given number of periods "n":
3. Compute the degree of membership of each input to the various fuzzy subsets: μ_{Aj}^{i}(x_{j}) for j = 1 to n and i = 1 to N
4. Compute the truth value of each rule, for i = 1 to N: α_{i}(x) = min_{j}(μ_{Aj}^{i}(x_{j})) for j = 1 to n
5. Choose an action by the pseudo-stochastic method, summarized as:
• the action with the best quality q_{iA} is selected with probability P
• otherwise, an action is selected at random among all the other actions possible in the given state
6. Compute the contribution of each rule to the chosen action: μ(α) = min(α_{i}(x), μ_{B}^{i}(α))
7. Aggregate the rules: μ(α) = max_{i}(μ_{B}^{i}(α))
8. Defuzzify the output variable (centre of gravity): α = ∫ α·μ(α) dα / ∫ μ(α) dα
9. Compute the new orientation of the robot: θ = θ + α
10. Move the robot and compute the variation Δθ = θ − θ′
11. Compute the reinforcement: r = d_an − d_ac
where d_ac is the current distance to the objective, after the displacement of the robot, and d_an is the previous distance to the objective, i.e. before the displacement.

Fig. 4: 
Corridor-following module 
12. Update the qualities of the rules that contributed to the variation of the angle α^{[10,11]}:
q[i,i*] ← q[i,i*] + β·α_{i}(x)·(r + γ·Q*(x′) − Q(x))
where β is the learning rate, γ the discount factor, α_{i} the truth value of rule i and Q*(x′) the best attainable quality in the new state x′
13. Stop the learning process when "n" is reached or when d_ac becomes small
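The thirteen steps above can be sketched end-to-end as follows. This is a simplified, illustrative reconstruction: it uses crisp candidate steering angles instead of fuzzy conclusion subsets, considers only the heading-error input (the obstacle-distance input is omitted), assumes triangular memberships and a unit-step 2-D kinematic robot, and all names and ranges are our own choices:

```python
import math
import random

ACTIONS = (-45.0, 0.0, 45.0)      # candidate steering corrections (degrees)

def tri(x, a, b, c):
    """Triangular membership with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

# Assumed fuzzification of the heading error (degrees), one term per rule.
MU = {"N": lambda e: tri(e, -90.0, -45.0, 0.0),
      "Z": lambda e: tri(e, -45.0, 0.0, 45.0),
      "P": lambda e: tri(e, 0.0, 45.0, 90.0)}

q = {(t, a): 0.0 for t in MU for a in ACTIONS}     # step 1: qualities = 0
beta, gamma, P = 0.9, 0.9, 0.9                     # paper's coefficients
x, y, th = 0.0, 0.0, 60.0                          # pose (units, degrees)
gx, gy = 10.0, 0.0                                 # objective point
random.seed(1)

def heading_error(x, y, th):
    bearing = math.degrees(math.atan2(gy - y, gx - x))
    return (bearing - th + 180.0) % 360.0 - 180.0  # wrapped to (-180, 180]

def alphas_of(err):
    return {t: MU[t](err) for t in MU}             # step 3: memberships

def best_q(alphas):                                # Q*(x'): best quality ahead
    w = sum(alphas.values()) or 1.0
    return sum(al * max(q[(t, a)] for a in ACTIONS)
               for t, al in alphas.items()) / w

for _ in range(1000):                              # step 2: at most n moves
    err = heading_error(x, y, th)
    alphas = alphas_of(err)                        # steps 3-4 (single input)
    chosen = {}                                    # step 5: pseudo-stochastic
    for t in MU:
        if random.random() < P:
            chosen[t] = max(ACTIONS, key=lambda a: q[(t, a)])
        else:
            chosen[t] = random.choice(ACTIONS)
    w = sum(alphas.values()) or 1.0
    alpha_out = sum(al * chosen[t] for t, al in alphas.items()) / w  # 6-8
    Qx = sum(al * q[(t, chosen[t])] for t, al in alphas.items()) / w
    th += alpha_out                                # step 9: new orientation
    d_an = math.hypot(gx - x, gy - y)
    x += math.cos(math.radians(th))                # step 10: unit displacement
    y += math.sin(math.radians(th))
    d_ac = math.hypot(gx - x, gy - y)
    r = d_an - d_ac                                # step 11: reinforcement
    dq = r + gamma * best_q(alphas_of(heading_error(x, y, th))) - Qx
    for t, al in alphas.items():                   # step 12: update qualities
        q[(t, chosen[t])] += beta * al * dq / w
    if d_ac < 0.5:                                 # step 13: stop near goal
        break
```

The reinforcement is positive whenever the move brings the robot closer to the objective, which is what pushes the quality of the corrective conclusions upward over the episodes.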
Exploitation phase: The optimal policy is obtained by choosing, in each state, the action that maximizes the quality function:
u*(x) = arg max_{u} Q(x,u)
This policy is called "greedy". However, at the beginning of learning the values Q(x,u) are not yet significant and the greedy policy is not applicable.
In our case, the robot is placed at its starting point "D" and, for each displacement, follows these steps:
• direct the robot towards its arrival point
• repeat:
• compute the distance "d" to possible obstacles
• move by choosing the best-quality action using the fuzzy controller
• until the goal is reached
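The greedy choice used during exploitation can be sketched as a per-rule arg-max blended by the rules' truth values; the quality table q[(term, action)] and all names here are illustrative assumptions:

```python
def greedy_action(q, alphas, actions):
    """Exploitation: for each fired rule keep its best-quality conclusion,
    then blend those conclusions with the rules' truth values (no random
    exploration, i.e. the "greedy" policy)."""
    w = sum(alphas.values())
    if w == 0:
        return 0.0                      # no rule fired: keep course
    return sum(al * max(actions, key=lambda a: q[(t, a)])
               for t, al in alphas.items()) / w
```

Unlike the exploration phase, no action is ever picked at random here, so the trajectory becomes repeatable once learning has converged.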
RESULTS AND DISCUSSION
In the present work, the following coefficients were chosen:
probability P = 0.9, γ = 0.9 and β = 0.9
number of passes n = 1000
The resulting rule base is given in Fig. 5.

Fig. 6: 
Exploration phase (1) / exploitation phase (2) 
From Fig. 6, it can be seen that the trajectory (1) followed by the robot during the exploration phase differs from the trajectory (2) obtained during the exploitation phase (after learning).
We also notice from the table above that only linguistic rules 7 and 8 contributed in this example; hence, only the quality coefficients of these two rules changed.
Another lesson we can draw from this table is that:
Rule 7: If (θ − θ′) is N (negative) and d is P Then α is P (positive), which is favoured and has the best quality (0.27).
Rule 8: If (θ − θ′) is Z (zero) and d is P Then α is Z (zero), which is favoured and has the best quality (0.63).
These two results are entirely logical: if the deviation tends towards the negative, a positive action must be applied to compensate for it, and if it is null, the zero action must be maintained, i.e. the robot keeps the same angle without deviating. Note also that 72 iterations were needed to obtain these results.
This study allowed us to implement a fuzzy approach to reinforcement learning. It gave good results, in particular for following a rectilinear trajectory. The work is, however, far from complete.
Further work is needed in the following directions:
• realization of a robot able to learn to avoid obstacles
• realization of a robot able to arbitrate between conflicting actions
• integration of vague concepts and of reinforcement learning at all levels of the robot architecture, together with some additional variants
• realization of a simulator making it possible to code and simulate the behaviour of a robot in an a priori unknown environment