INTRODUCTION
Overcurrent relays have long been used as a practical tool for distribution
and transmission line protection and, for many years, their coordination
problem has been studied by power system researchers and engineers.
The objective function of the problem, the total operating time of the
relays, is minimized to obtain a high-speed protective system. In
addition, a set of coordination constraints must be satisfied so that the
protection system remains reliable and selective. Several methods have
been utilized in the existing literature to solve this problem. For many
years, classical optimization techniques based on mathematical formulations,
such as linear programming (Urdaneta et al., 1996, 2000), nonlinear
programming (Sadeh, 2005; Keil and Jäger, 2008) and mixed-integer
linear programming (Zeineldin et al., 2004), have been applied.
The nonlinear methods try to find the Time Multiplier Setting (TMS) and
Current Setting (I_{set}) simultaneously. However, these approaches
rely on iterative search, which is often unable to reach the globally
optimal settings of the relays. To reduce the complexity of the
nonlinear problem, the problem is first transformed into a linear
program by determining the I_{set} of the relays with heuristic
techniques. The TMS values of the relays are then calculated by solving
the resulting linear optimization problem. To solve the nonlinear
problem directly, it is beneficial to use stochastic optimization approaches
such as genetic and evolutionary algorithms (Razavi et al., 2008;
So and Li, 2000) and particle swarm optimization (Mansour et al.,
2007; Zeineldin et al., 2006), which can overcome the drawbacks
of the traditional methods.
After the restructuring of the vertically integrated electricity industry
into a competitive power market, relay coordination has become a more
critical problem. In this new environment, establishing a competitive
energy market and providing open access to the transmission network have
confronted the efficient operation of the transmission network with new
issues and challenges, which increases the complexity and sensitivity of
the overcurrent relay coordination problem. Because the power system
configuration changes considerably in the power market, the optimal
settings of the overcurrent relays obtained for one configuration may
not be suitable for new configurations. In order to have a selective,
reliable and high-speed protection system under different power system
situations, it is necessary to use an intelligent approach with more
capability than the classical and stochastic optimization methods
mentioned above.
In this study, the overcurrent relay coordination problem is modeled
as a Q-learning based cooperative multi-agent system. Each relay tries to
find the desirable setting in each situation of the power system based on
the Q-Learning (QL) algorithm, which is a kind of model-free Reinforcement
Learning (RL). In fact, each relay learns from past experiences, obtained
by taking actions in repeated games, to minimize its operating time
for all possible situations, while having no knowledge of the actions of
the other relays or of the occurrence of new power system configurations.
As a result, each overcurrent relay can act as an autonomous, adaptive,
intelligent agent in a Multi-Agent System (MAS). Moreover, by combining
the cooperation concept of the MAS with the QL algorithm, the intelligent
adaptive agents attempt to cooperate with each other to minimize the overall
operating time while satisfying their coordination constraints. Given
the ability of RL methods to solve problems with finite discrete variables,
the structural constraints of the overcurrent relays are respected easily.
REINFORCEMENT LEARNING
The reinforcement learning problem is the problem of an agent learning
from interaction with its environment so as to achieve its goal. In fact,
RL is learning a policy, that is, what to do or, in other words, how to
map each situation to an action so as to maximize the long-run reward.
Trial and error and delayed reward are the most important characteristics
of RL. In order to find the best policy, the agent must first obtain new
experiences through trial and error. In each state of the environment, it
evaluates how good the chosen action is by considering the immediate
reward and the value of the new state. The value of a state is the
long-term reward expected to be acquired in the future starting from that
state (Sutton and Barto, 1998).
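As a small numerical illustration of this definition, the sketch below computes the discounted return whose expectation defines the state value; the reward sequence and discount factor are illustrative values, not taken from the study.

```python
# Discounted return G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
# The value of a state is the expectation of this quantity over trajectories.
def discounted_return(rewards, gamma):
    g = 0.0
    # Accumulate backward: G_t = r_{t+1} + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]                 # hypothetical r_1, r_2, r_3
print(discounted_return(rewards, 0.9))    # 1 + 0.9*0 + 0.81*2 = 2.62
```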
Assume that an agent interacts with its environment at a sequence of
discrete time steps, t = 0, 1, 2, ..., as shown in Fig.
1. Also, assume that S = {s_{1}, s_{2},..., s_{n}}
is the finite set of possible states of the environment and A = {a_{1},
a_{2},..., a_{m}} is the finite set of admissible actions
which the agent can take. At each time step t, the agent senses the current
state s_{t} = s ∈ S of its environment and accordingly selects an action
a_{t} = a ∈ A. As a result of its action, the state of the environment
changes to the new state s_{t+1} = s' ∈ S with a transition
probability P^{a}_{ss'} and the agent receives an immediate reward r_{t+1}.
However, reinforcement learning is established on the important assumption
that the interaction of the agent and the dynamic environment satisfies
the Markov property, so that the reinforcement learning task is a Markov
Decision Process (MDP). In this condition, the value of a state s under
policy π, denoted by V^{π}(s) and given in Eq. 1, is the expected return
when starting in state s and following policy π afterward:

V^{π}(s) = E_{π}[ Σ_{k=0}^{∞} γ^{k} r_{t+k+1} | s_{t} = s ]        (1)

Fig. 1: The agent-environment interaction in RL
where, γ is the discount factor and r is the reward function. Considering
the Bellman equation in dynamic programming, the above equation can be
written as:

V^{π}(s) = Σ_{a} π(s,a) Σ_{s'} P^{a}_{ss'} [ R^{a}_{ss'} + γ V^{π}(s') ]        (2)
in which π(s,a) is the probability of taking action a in state s
and R^{a}_{ss'} is the expected value of the reward r_{t+1}.
Now, the optimal value and the best action in state s are calculated by
taking the maximum of Eq. 2 over the action space A(s), which yields the
Bellman optimality equation:

V*(s) = max_{a∈A(s)} Σ_{s'} P^{a}_{ss'} [ R^{a}_{ss'} + γ V*(s') ]        (3)
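For concreteness, the Bellman optimality backup can be iterated to convergence on a toy MDP (value iteration); the two-state transition probabilities, rewards and discount factor below are made up purely for illustration.

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
# on a tiny 2-state, 2-action MDP with made-up numbers.
P = {  # P[s][a] = list of (next_state, probability)
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}
R = {  # R[s][a] = expected immediate reward
    0: {0: 0.0, 1: 1.0},
    1: {0: 0.0, 1: 2.0},
}
gamma = 0.5
V = {0: 0.0, 1: 0.0}
for _ in range(100):  # the backup is a contraction, so V converges
    V = {s: max(sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a])
                for a in P[s])
         for s in P}
print(V)  # fixed point: V(1) = 2 + 0.5*V(1) = 4, V(0) = 1 + 0.5*V(1) = 3
```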
In the following, it is explained how the RL algorithm is used to find
the desirable settings of the relays in power system protection. In order
to reach this objective and to solve the relay coordination problem, we
apply the cooperation concept in an RL-based MAS.
RL-BASED COOPERATIVE MAS APPLIED TO RELAY COORDINATION
The relay coordination problem is modeled as a cooperative MAS in which
each agent intelligently finds its desirable setting using an RL algorithm.
As a result, the overcurrent relay coordination turns into an RL problem
which can be solved using Dynamic Programming (DP) or Temporal-Difference
(TD) methods.
Modeling the state of the environment and the agent's action: Each relay
in the coordination problem is an intelligent agent which interacts with
the power system as its dynamic environment. The power system configuration
is the state of the environment. The environment can pass through several
states due to the outage or entrance of generators and transmission lines
and also due to load variation. In this condition, it is necessary to
adapt the relay settings to the different situations and to update the
coordination policy for each one.

Fig. 2: The definition of the action space
However, as described earlier, RL is a capable method for improving the
strategy selected in the previous state of a dynamic environment for the
new one by evaluating the new state. Therefore, using the RL method,
an adaptive coordination of the relays is possible, which increases the
reliability and security of the power system protection.
In the RL-based MAS, each agent selects its setting, i.e., TMS and I_{set},
as an action from the action space. Considering the relay structure, the
range of TMS is limited between TMS_{min} and TMS_{max}. Also,
the I^{s}_{set,min} and I^{s}_{set,max}
of the relays for each state s can be obtained from the power flow
and short circuit calculations. Then, the two-dimensional action space
is built for each relay using its discrete parameters, as shown in Fig.
2. Obviously, the I^{s}_{set,min} and I^{s}_{set,max}
of each relay can differ from those of the other relays.
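Such a discrete two-dimensional action space can be enumerated as sketched below; the bounds and step sizes follow the case study later in the paper (TMS: 0.1 to 1 in steps of 0.025, I_set: 0.25 to 2.5 in steps of 0.125), and per-state I_set limits would further restrict the grid.

```python
# Build the two-dimensional discrete action space of one relay agent:
# every (TMS, I_set) pair on the grid is a candidate action.
def frange(lo, hi, step):
    n = round((hi - lo) / step) + 1
    return [lo + k * step for k in range(n)]

tms_values = frange(0.1, 1.0, 0.025)     # 37 discrete TMS values
iset_values = frange(0.25, 2.5, 0.125)   # 19 discrete I_set values
actions = [(tms, iset) for tms in tms_values for iset in iset_values]
print(len(actions))  # 37 * 19 = 703 candidate settings
```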
Defining the agent's punishment function: Here, the cooperation concept
of the agents is used to solve the overcurrent relay coordination problem.
Generally, in the literature on overcurrent relay coordination, the
settings of the relays are adjusted to minimize their total operating time
(Eq. 4) such that the physical constraints in Eq. 5, the assumed operation
curve of the relay and the coordination constraints of the backup-primary
pairs in Eq. 6 are satisfied. In order to apply the RL-based cooperative
MAS to the coordination problem, the reward function is replaced by a
punishment function and the max operator in Eq. 3 is replaced by a min
operator. The mathematical formulation of the relay coordination problem
is presented as follows:

min Σ_{i=1}^{Nr} t_{i}        (4)

TMS_{min} ≤ TMS_{i} ≤ TMS_{max},  I^{s}_{set,min} ≤ I_{set,i} ≤ I^{s}_{set,max}        (5)

t_{bi}(I^{f}_{ij}) − t_{pj}(I^{f}_{j}) ≥ CTI, for each backup-primary pair (i, j)        (6)

where, Nr is the number of relays, I^{f}_{ij} is the maximum current
flowing through the backup overcurrent relay i when the fault happens
close to, and in front of, the primary relay j and I^{f}_{j} is the
corresponding current through the primary relay. Thus, in order to
minimize the total operating time of all relays using the MAS, the
intelligent agents must support and cooperate with each other.
Nevertheless, the MAS can be designed such that the agents either compete
or cooperate with each other; this strongly depends on how the punishment
function of the agents is defined.
To reach this goal, the punishment function of each intelligent agent is
built from two additive components. The first part is the total operating
time of all relays, the same as Eq. 4, and the second part is included
to respect the coordination constraints in Eq. 6. To keep the coordination
of the backup-primary pairs, a penalty function is defined as the second
part of the punishment function, and the primary and backup relays of an
agent are called its primary and backup agents. Suppose the sets P^{i}
and B^{i} consist of all primary and backup agents of the ith intelligent
agent, respectively. The penalty function of the ith intelligent agent is
the sum of the two-argument functions given in Eq. 7 and 8, each of which
is derived from the coordination constraint of the ith agent with one of
its primary or backup agents. In Eq. 7, j belongs to P^{i} and in Eq. 8,
j belongs to B^{i}. Using these two equations, the penalty function of
the ith intelligent agent, as the second term of its punishment function,
is written as:

Pen_{i} = Σ_{j∈P^{i}} f_{P}(t_{i}, t_{j}) + Σ_{j∈B^{i}} f_{B}(t_{i}, t_{j})        (9)

where f_{P} and f_{B} denote the two-argument functions of Eq. 7 and 8,
respectively.
Now, the punishment function of the ith intelligent agent is constructed
as follows:

Pun_{i} = Σ_{k=1}^{Nr} t_{k} + Pen_{i}        (10)

where Pen_{i} is the penalty function of Eq. 9.
As a result, each intelligent agent takes an action in each state of the
power system to minimize its cumulative punishment in the long run.
Considering the defined punishment function, each agent not only tries to
minimize its own operating time, but selects an action that minimizes the
total operating time while satisfying its coordination constraints. Since
it has no knowledge of the other intelligent agents' actions, the operating
times of the other agents are added to its own operating time to guide
the agents toward cooperation. Consequently, instead of solving the relay
coordination problem using one agent with a 2*Nr dimensional action space,
the studied problem is modeled as a distributed system using the RL-based
MAS. With the proposed model, the dimension of the action space is reduced
and, therefore, the search for the optimal action can be performed
efficiently, although the suboptimality issue that arises in the MAS is
unavoidable.
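The cooperative punishment described above can be sketched as follows. The quadratic penalty form and the weight `beta` are assumptions made here for illustration (the paper's Eq. 7-10 define the actual two-argument penalty functions), and the operating times and backup-primary pair list are hypothetical.

```python
# Punishment of agent i: total operating time of all relays (cooperative
# term, as in Eq. 4) plus a penalty for each violated backup-primary
# constraint that involves agent i. Penalty form and beta are assumptions.
def punishment(i, times, pairs, cti=0.3, beta=100.0):
    # times[k]: operating time of relay k; pairs: list of (backup, primary)
    total = sum(times.values())
    penalty = 0.0
    for b, p in pairs:
        if i in (b, p):                        # constraints involving agent i
            violation = cti - (times[b] - times[p])
            if violation > 0:                  # t_backup - t_primary < CTI
                penalty += beta * violation ** 2
    return total + penalty

times = {0: 0.9, 1: 0.5, 2: 0.3}   # hypothetical operating times (sec)
pairs = [(0, 1), (1, 2)]           # hypothetical (backup, primary) pairs
print(punishment(1, times, pairs))  # pair (1, 2) violates CTI by 0.1 sec
```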
To find the optimal setting of each agent in state s, the optimal value
function must be computed. If the model of the environment, i.e.,
P^{a}_{ss'} and R^{a}_{ss'}, is known, DP is a suitable technique for
evaluating the value function. P^{a}_{ss'} can be calculated using
historical data, but the second component, R^{a}_{ss'}, is not fully known
for every agent, because the punishment function of each agent depends not
only on its own action but is also affected by the other agents' actions,
which are unknown from the viewpoint of the agent. In this condition, the
QL algorithm (TD(0)), which is a kind of reinforcement learning, can be
used to solve the mentioned problem. The QL algorithm does not need any
model of the environment and develops the optimal action by estimating
the value function.
OVERCURRENT RELAY COORDINATION BASED ON QL ALGORITHM
To find the optimal solution of Markov decision problems without prior
knowledge of the environment, Watkins's QL algorithm, a kind of
reinforcement learning algorithm (Sutton and Barto, 1998), can be used.
The QL algorithm is a very effective model-free algorithm that is
insensitive to exploration when learning from delayed reinforcement
(Kaelbling et al., 1996). Thus, it is a suitable method for modeling the
overcurrent relay coordination problem as a QL-based cooperative MAS:
each agent is able to select the best action by cooperating with the
other agents although it has no knowledge of the model of the power
system environment or of the other agents' actions.
QL algorithm: In the QL algorithm, the value function for each
admissible pair (s, a) is defined as a Q-value. Considering the QL
algorithm, each agent attempts to discover the optimal policy
π*(s) ∈ A that maximizes the Q-value of each state over the long run,
or equivalently the total reward received in the repeated games, using
the Bellman optimality Eq. 11:

Q*(s, a) = Σ_{s'} P^{a}_{ss'} [ R^{a}_{ss'} + γ max_{a'} Q*(s', a') ]        (11)
Without learning a model, the QL algorithm is able to determine the
optimal policy for every admissible (s, a) by estimating its Q-value
online using the zero-order temporal difference TD(0) method. To start
the learning process, the Q-value lookup table is initialized randomly
or using the agent's prior knowledge. In each iteration of the game,
after taking action a_{t}, the only information available to the QL
algorithm is s_{t}, a_{t}, s_{t+1} and r_{t+1}. The updating rule for
the state-action pair (s_{t}, a_{t}) is given by:

Q(s_{t}, a_{t}) ← Q(s_{t}, a_{t}) + α ΔQ_{t}        (13)

ΔQ_{t} = r_{t+1} + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_{t}, a_{t})        (14)

where, α is the agent's learning rate, which can be interpreted as how
much the estimated Q-values are updated by new data. To apply the QL
algorithm to the coordination problem, the reward function must be
substituted by the punishment function. Also, in Eq. 14, the min operator
is used instead of the max operator.
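A minimal sketch of this update with the min operator follows; the Q-table contents, configuration names, punishment value and parameters are illustrative, not from the case study.

```python
# One Q-learning update with the punishment convention used here: the min
# over next-state actions replaces the usual max in the TD target.
def q_update(Q, s, a, punish, s_next, alpha, gamma):
    best_next = min(Q[s_next].values())      # minimize cumulative punishment
    Q[s][a] += alpha * (punish + gamma * best_next - Q[s][a])
    return Q[s][a]

# Illustrative two-configuration, two-action Q-table.
Q = {"cfg1": {"a1": 2.0, "a2": 5.0}, "cfg2": {"a1": 1.0, "a2": 3.0}}
new_q = q_update(Q, "cfg1", "a2", punish=4.0, s_next="cfg2",
                 alpha=0.5, gamma=0.9)
print(new_q)  # 5.0 + 0.5*(4.0 + 0.9*1.0 - 5.0) = 4.95
```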
QL-based relay coordination: Here, it is shown how the QL-based
cooperative MAS is used to solve the relay coordination problem. The
interactions of all agents with the environment in the MAS are shown in
Fig. 3. As in a self-play problem, all agents in the MAS sense the state
of the environment, evaluate their situations and then take their desired
actions simultaneously.

Fig. 3: The schematic diagram of the cooperative MAS

Fig. 4: The flowchart of QL-based overcurrent relay coordination
To find the desirable overcurrent relay coordination, each agent
participates in a repeated game in which the behavior of all agents is
driven by the QL algorithm. Figure 4 illustrates how an intelligent agent
interacts with the power system environment to adjust its setting.
According to this flowchart, the learning process of each agent to
determine its setting proceeds as follows:
• State identification: The state of the environment for the
current step is the power system configuration of the previous step.
It is obtained using a Monte Carlo simulation which is run based on
the calculated occurrence probabilities of all scenarios of the power
system configuration.
• Action selection: After obtaining the current state, the agent
uses its Q-value lookup table, which stores the Q-values of every
state-action pair. Action selection in the QL algorithm is done by
choosing the action with the minimum Q-value in the current state.
To trade off between exploitation and exploration, the agent can
utilize the ε-greedy strategy: it selects the action with the minimum
Q-value with high probability (1-ε) and an arbitrary action from all
admissible actions with small probability ε, independent of the
Q-values.
• Q-value updating: At the end of each step, the agent is notified
of the new configuration (s_{t+1}) in order to take the next decision
(a_{t+1}). Also, the power flow and short circuit calculations are
performed for the current configuration (s_{t}). Then, given the
settings of the agents (a_{t}), the punishment function is calculated
according to the described cooperative philosophy using Eq. 7-10,
and the agent updates its Q-value according to Eq. 13 and 14 with the
min operator in place of the max operator.
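The ε-greedy step above can be sketched as below, using an argmin because lower Q-values (punishments) are preferred here; the Q-row contents and the random seed are illustrative.

```python
# ε-greedy selection over one Q-table row (state fixed): exploit the
# minimum-Q action with probability 1-ε, otherwise explore uniformly.
import random

def epsilon_greedy(q_row, epsilon, rng):
    if rng.random() < epsilon:               # explore: any admissible action
        return rng.choice(list(q_row))
    return min(q_row, key=q_row.get)         # exploit: minimum-Q action

rng = random.Random(0)
# Illustrative Q-row: keys are (TMS, I_set) settings, values are Q-values.
q_row = {(0.1, 0.5): 1.9, (0.125, 0.5): 1.6, (0.1, 0.625): 2.2}
picks = [epsilon_greedy(q_row, 0.1, rng) for _ in range(1000)]
greedy_share = picks.count((0.125, 0.5)) / len(picks)
print(greedy_share)  # close to 1 - ε, plus a small share from exploration
```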
RESULTS AND DISCUSSION
Here, the QL-based cooperative MAS is applied to solve the coordination
problem of the overcurrent relays.

Fig. 5: The studied distribution system in scenario 1

Fig. 6: The studied distribution system in scenario 2
In order to study the performance of the proposed method, a typical
distribution system consisting of three overcurrent relays is considered
(Fig. 5). The generator and the parallel transformers are the Thevenin
equivalent calculated from bus A. When the power system topology changes,
the Thevenin equivalent can vary and, as a result, the Short Circuit
Capacity (SCC) can take different values. This changes the short circuit
currents in the distribution system and, consequently, the relay settings
must be reset. In this simulation, for example, two scenarios are possible
for the Thevenin equivalent, as shown in Fig. 5 and 6.
The operation characteristic of the overcurrent relays is assumed to be
the IEC standard inverse curve given in Eq. 15:

t = TMS × 0.14 / ((I/I_{set})^{0.02} − 1)        (15)
where, t is the operating time of the relay and I is the current through
the relay. Also, I_{set} and TMS take discrete values between 0.25 and
2.5 with a step of 0.125 and between 0.1 and 1 with a step of 0.025,
respectively. Thus, the action space of each intelligent agent, as an
overcurrent relay, consists of 19x37 settings. The fault currents required
for the coordination problem are presented in Fig. 5 and 6 for both
network topologies. The CTI is assumed to be 0.3 sec. Also, in each
scenario there are two primary-backup pairs, i.e., (B, A) and (C, B).
It should be noted that the I_{set} of each relay must be greater than
the load current and smaller than the minimum current flowing through the
relay when a fault occurs at the far bus of its primary relays. As a
result, some of the 19x37 actions may be inadmissible and are discarded.
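This admissibility filtering over the 19x37 grid can be sketched as follows; the load current and minimum fault current used below are illustrative values, not data from the case study.

```python
# Keep only the actions whose I_set exceeds the load current and stays
# below the minimum fault current seen through the relay.
def admissible(actions, i_load, i_fault_min):
    return [(tms, iset) for tms, iset in actions
            if i_load < iset < i_fault_min]

# The full 37 x 19 grid of (TMS, I_set) settings from the case study.
actions = [(tms, iset)
           for tms in [0.1 + 0.025 * k for k in range(37)]
           for iset in [0.25 + 0.125 * k for k in range(19)]]
kept = admissible(actions, i_load=0.5, i_fault_min=2.0)
print(len(kept))  # 11 admissible I_set values remain, times 37 TMS values
```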
Table 1: Settings of overcurrent relays using the proposed method

Table 2: Settings of overcurrent relays using nonlinear programming

The admissible actions, i.e., those satisfying the mentioned constraint,
are identified, and the intelligent agent is only able to select among
them. The QL-based cooperative MAS is then implemented based on the
proposed flowchart of the coordination problem in Fig. 4. The obtained
settings of the overcurrent relays are shown in Table 1. The total
operating times of the relays according to the obtained settings in
scenarios 1 and 2 are 1.7609 and 1.6122 sec, respectively.
It should be noted that, in the studied system with two possible
scenarios, the occurrence of state s_{t+1} is independent of the state
s_{t} and the action a_{t}. Thus, the parameter γ can be chosen as zero
without loss of optimality. Moreover, in this situation, the punishment
function in state s_{t} depends only on a_{t}. Consequently, the
occurrence probabilities of the two scenarios cannot affect the settings
of the relays.
Finally, the results obtained using the proposed method are compared with
those of a classical optimization approach, namely the Sequential
Quadratic Programming (SQP) method. The optimization toolbox of MATLAB®
is used to solve the nonlinear program. The optimal values of the time
multiplier settings and the current settings of the relays are shown in
Table 2. The current settings of the relays are exactly the same in both
methods, but the time multiplier settings obtained using nonlinear
programming are lower than those obtained using the proposed method.
However, it is very important to note that, in the proposed method, the
change of topology is taken into account adaptively and the settings are
modified automatically after the system topology changes.
CONCLUSION
In this research, a reinforcement learning-based cooperative multi-agent
system is constructed and applied to solve the overcurrent relay
coordination problem for the first time in the literature. By combining
the reinforcement learning method with the cooperation concept in the
multi-agent system, each overcurrent relay, as an intelligent agent,
adjusts its setting such that it improves the overall operating time
while respecting the coordination constraints. Considering the ability of
reinforcement learning to select the best strategy according to the state
of the environment, the designed tool can adapt the relay coordination to
a new power system configuration. Consequently, an adaptive coordination
is achieved which improves and reinforces the security and reliability of
the power system protection. Also, due to the increasing uncertainty and
considerable change of power system configuration in the competitive
electricity market, the importance of such a method for providing a
flexible operation of the power system is greater than ever.