Abstract: This study presents the novel RBPF for mobile robot SLAM using stereovision to extract landmark information. The particle filter is combined with Gaussian Mixture Unscented Particle Filters (GMUPF) to extending the path posterior by sampling new poses that integrate the current observation that drastically reduces the uncertainty about the robot pose. The landmark position estimation and update is also implemented through GMUPF which a single update step from moving and sensing can be done and the change to the map certainty can be done in constant time. Furthermore, the number of resampling steps is determined adaptively, which seriously reduces the particle depletion problem. The 3D natural point landmarks are structured with matching Scale Invariant Feature Transform (SIFT) feature pairs. The matching for multi-dimension SIFT features is implemented with a KD-tree which introduce the Mahalanobis distance instead of the Euclidean distance for matching features in the time cost of O (log2N). Experiment results on real robot in our indoor environment show the advantages of our methods over previous approaches.
INTRODUCTION
A key prerequisite for a truly autonomous robot is that it can simultaneous localizes itself and accurately map its surroundings. The problem of achieving this is one of the most active areas in mobile robotics research, which is known as Simultaneous Localization and Mapping (SLAM). Most of the present research on SLAM is based on Extended Kalman filter (EKF) (Leonard et al., 2002), whose primary advantage is that it estimates the fully posterior over landmark maps and robot poses in an online fashion. However, the use of the EKF in practice has two well-known drawbacks. The first one is that linearization can produce highly unstable filters if the assumptions of local linearity is violated. The second one is that the derivation of the Jacobian matrices is nontrivial in most applications and often leads to significant implementation difficulties.
Recently, particle filter have been at the core of solutions to higher dimensional robot problems such as SLAM, which, when phrased as a state estimation problem. RBPF (Murphy and Russell, 2001) is proposed as an effective way of representing alternative hypotheses on robot paths and associated maps. More recently, RBPF is used widely to building map (Stachniss et al., 2005). Daily and Parnichkun (2005) describe the application of Fast SLAM using a trinocular stereo camera. Se et al. (2002) demonstrate the use of SIFT point features as landmarks for the SLAM problem using a trinocular Stereo vision. Gil et al. (2006) describes an approach that builds three dimensional maps using visual landmarks extracted from stereo images of an unmodified environment and solves the SLAM problem.
In this study, we present an investigation into the use of stereo vision (Elinas et al., 2006) for SLAM in indoor environment with 3D feature landmarks, which are structured from the SIFT feature matching pairs. These 2D SIFT features are used to structure 3D landmarks because they are invariant to image scale, rotation and translation as well as partially invariant to illumination changes and affine or 3D projection and their description is implemented with multi-dimensional vector (Lowe, 2004). This combination can result in many highly distinctive landmarks from environment, which simplifies the data association problem to only distinguishing unique landmarks. We present a fast and efficient algorithm which introduce the Mahalanobis distance instead of the Euclidean distance for matching features in a KD-Tree in the time cost of O(log2N). Following, our approach applies RBPF to estimate a posterior of the path of the robot, where each particle has associated with an entire map, in which each landmark is estimated and updated by a novel algorithm to recursively update the posterior density of SLAM called the Gaussian Mixture Unscented Particle Filters (GMUPF) (Wendel et al., 2005). The filter has equal or batter estimation performance when compared to standard particle filters and the Unscented Particle Filter (UPF) (Julier et al., 2000), at a largely reduced computational cost. The GMUPF combines an Importance Sampling (IS) based measurement update step with an Unscented Kalman Filter (UKF) based Gaussian sum filter for time update and proposal density generation.
NOVEL RBPF FOR SLAM
Consider the case of a mobile robot moving through an unknown environment consisted
of a set of landmarks. The landmark n is denoted θn. The robot
moves according to a known probabilistic motion model
The SLAM problem is to recover the posterior distribution p(st, θ1,...,θM, zt, ut, nt), where, M is the number of landmarks observed so far and the notation st denotes sl..., st (and similarly for other variables). Murphy and Russel (2001) provide an implementation of RBPF for SLAM:
(1) |
This can be done efficiently, since the factorization decouples the SLAM problem into a path estimation problem and individual conditional landmark location problems and the p(θn,st,zt,nt) can be computed analytically once st and zt are known and the amount of computation needed for each incremental update stays constant, regardless of the path length. Each map is constructed given zt and the trajectory st represented by the corresponding particle. Each particle is of the form , Where (i) indicates the index of the particle; st,(i) is its path estimate, μ(i)m,t and '(i)m,t are the mean and variance of the Gaussian representing the m-th landmark location.
In our methods, the i-th new pose st(i) is drawn from
the posterior
(2) |
Where:
G | = | The number of mixing components |
α(g) | = | The mixing weights |
is a normal distribution with mean vector and positive define covariance matrix pt(g),(i).
GMUPF Algorithm
By using the UT transformation (Miaohai et al., 2006), we follow GMUPF algorithm to extend the path st,(i) by sampling the new poses st(i) from the posterior .
Calculate the Sigma Points
(3) |
Using Motion Model to Predict
At time t-1, assume the posterior state density is approximated by the G-component
GMM:
(4) |
and the process and observation noise densities are approximated by the following I and J component GMMs, respectively
(5) |
(6) |
Where:
Vt-1 | = | The process noise |
Ot-1 | = | The observation or measurement noise |
For j = 1, ,J, set . For j = 1, ,J, set and for g = 1, ,G, set .
For
Generate the Proposal Distribution
The predictive state density is now approximated:
(7) |
The proposal distribution is approximated by the GMM:
(8) |
Measurement Update
Draw N samples
(9) |
Normalize the weights:
Use a weighted EM (WEM) algorithm to fit a G-component GMM to the set of weighted
particles
(10) |
Inference
The conditional mean state estimate and the corresponding error covariance
(11) |
IMPLEMENTATION DETAILS OF USING STEREO VISION
SIFT Feature Extraction Based Stereo Vision
Let ILt and IRt denote the right
and left gray scale images captured with a stereo camera at a time t. we compute
image points of interest from both images by selecting maximal points in the
scale space pyramid of a Difference of Gaussians. We extract natural landmarks
that correspond to points in the 3-dimensional space. Each landmark is a vector
l = {PG, CG, α.,s,f}, such that PG = {XG,YG,ZG}
is a 3-dimensional position vector in the maps global coordinate frame,
CG is the 3x3 covariance matrix for PG and α,s,f
describe an invariant feature based on the Scale Invariant Feature Transform
(SIFT). Parameter α is the orientation of the feature, s is its scale and
f is the 128-dimensional key vector which represents the histogram of local
edge orientations. Each point is accompanied by its SIFT descriptor and then
matched across images. The matching procedure is constrained by the well-known
epipolar geometry of the stereo rig and the Euclidean distance of their SIFT
keys. In addition, a comparison between SIFT descriptors is used to avoid false
correspondence. As a result, at a time t we obtain a set of B observations
denoted:
(12) |
Where, is the image coordinates of the point and .
KD-Tree Based Feature Matching
Here we describes KD-tree algorithm for determining the matched SIFT feature
pairs of left and right images captured along the robots path by a stereo
vision system. We use each associated SIFT descriptor to improve the data associated
process. Each SIFT descriptor is a 128-long vector computed from the image gradient
at a local neighbourhood of the interest point. In the approaches (Miaohai et
al., 2006), data association is based on the squared Euclidean distance
between descriptors. However, when the same point is viewed from significantly
different viewpoints, the difference in the descriptor is remarkable. In the
presence of similar looking landmarks, this fact causes a significant number
of false correspondences. In order to reduce the false correspondence problem,
we propose a different method to present the distance function. We consider
different views of the same visual landmark as different patterns belonging
to Ci. Whenever a landmark is found, it is tracked along p frames
and its descriptors d1,d2,...,dp are stored.
Then, for each landmark Ci, we compute a mean value
(13) |
Where, pi is a diagonal covariance matrix associated with the class
Ci with mean value
(14) |
Where, λ is constant and 0 < λ < 1, if this inequality is satisfied, then the matching is successful and simultaneously eliminates the false matching. Figure 1 shows an example of SIFT feature matching for a pair image from the labor with stereo vision and we obtain 76 matched pairs which the matching accurate rate is higher than 90%.
Fig. 1: | The SIFT feature matches based on KD-tree and the matching pairs are represented by red rectangle. The mismatching pairs are represented by blue +. The matched key-points are 76. The corrected are 68 |
Observation Model Based Stereo Vision
In the stereovision case, we perform a linear search of the keys in the
left image for the best match to each key in the right, subject to epipolar
constraints and determine its 3D position and covariance according to the well-known
stereo equations:
(15) |
Where, f is the focal length of the camera, B is the based-line of the stereo vision, d is the disparity between SIFT keys in the left and right images and [u,v] is the pixel position of the key in the right camera.
An observation is defined as a set of k correspondences between landmarks in
the map and the current view. Let Ft = {f1,...,fk}
be the k SIFT feature key-points observed at time t, in which there are n key-points
matching with the 3D landmarks in the map database:
Where, zlt represented the observation Fl = {fl,...,fn} and ztv represents the observation Fv = {fn+1,...,fn+m}, represents the likelihood of the observation zlt given the matching relation nlt and represents the likelihood of the observation zvt given the matching relation nvt, these two likelihood can be calculated separately as follows:
Where, represents the likelihood of the observation being fj when robot at pose st(1) observing the landmark Lfj, represents the likelihood of the observation being fj when robot at pose st(i) observing the landmark Vfj. A pose of the camera, st, defines a transformation [R,T]st from the camera to the global coordinate frame. Let the 3D coordinates of the landmark Lfj be (X(j), Y(j), Z(j)), using the Mahalanobis distance metric, we can obtain as follows:
(16) |
Where:
J | = | The Jacobian matrix of the observation equation |
Gfj | = | The covariance of Lfj |
The maximum observation innovation T1 is constant (in our case, 3.0), which is selected so as to prevent outlier observations from significantly affecting the observation likelihood.
While the feature Vfj has no 3D spatial information, is only calculated according to epipolar constrain:
Where, Ifj is the image coordinate of the feature Vfj, Hfj is the epipolar line on the image plane corresponding to Vfj at time t and Hj is the epipolar line on the image plane corresponding to the feature fj at time t-1, dist(•) is the function of the distance between point and line.
After calculating the observation model , which can be used to evaluate the i-th particle weight wt(i) and wt(i) is taken as follows:
(17) |
Motion Model
An essential component to the implementation of RBPF is the specification
of the robots motion model. Let It and Tt-1 represent
the pairs of stereo images taken with the robots camera at two consecutive
intervals with the robot moving between the two. For each pair of images we
detect points of interest, compute SIFT descriptors for them and perform stereo
matching, resulting in 2 sets of landmarks Lt-1 and Lt.
We compute the camera motion using the Levenberg-Marquardt (LM) algorithm minimizing
the re-projection error of the 3D coordinates of the landmarks.
Let
such as to minimize the vector of error measurement ε for a known stereo vision calibration matrix K, ε is defined as:
(18) |
Given an initial estimate for the parameters, we use LM algorithm to solve for χ that minimizes ε in three steps:
• | According to the vector of residuals ε(χ), compute its matrix of derivatives J with respect to the components of χ. |
(19) |
• | Compute the update rule: |
(20) |
• | Set |
Where, the value of λ controls the distance traveled along the gradient direction and controls the convergence of the solution by switching between pure gradient descent and Newtons method.
RESULTS AND DISCUSSION
The experiments are performed on a Pioneer 3-DX Mobile Robot incorporating an 800 MHz intel Pentium processor as shown in Fig. 2A. Motor control is performed on the on-board computer, while a 2.4GHz PC connected to the robot by a wireless link provides the main processing power for stereo vision processing and the SLAM software. A Point Grey Bumblebee Stereo-vision mounted at the front of the robot is used for detecting the landmarks. The test environment is a robot laboratory with limited space in Fig. 2B.
One of the most difficult issues in a SLAM implementation is the data association problem, where the hypothesis of assigning an observation to an existing map location is tested. In our implementation the KD-tree Based SIFT and Mahalanobis distance solves this problem. For illustrating the advantages of our methods over previous approaches, we implement SLAM with our novel RBPF and previous method. This experiment is described as follows.
Firstly, the robot is set at the distance of 4 m from the lab door and the robot orientation is parallel with the door, at the same time, there are some obstacles in the front of the robot. While the robot is moving ahead, the stereo pair image frames are captured and processed, building the map of the indoor environment.
In Fig. 3, the experiment results. In the map, S represents the start point of robot path and E represents the end point of the robot path, the red points represent the robot path particle, the 2D view of 3D landmarks in the map is represented with blue points. As shown in Fig. 3A, if we increase the number of particles, the performance of conventional RBPF will be improved largely, however, the storage requirement and calculation burden is severely aggravated, owning to each particle associated with a view of the map. Figure 3B shows the built map with the novel RBPF, which adopts 50 particles and 100 particles. The performance of the novel RBPF changes a little with increasing the number of particle, so we can build precise map only with few particles.
Another experiment was carried out in our single lab room, where the compact map is built with our novel RBPF method and 186 pair stereo image frames of size 320x240 are captured.
Fig. 2: | (A) Pioneer mobile robot and (B) experimental environment |
Fig. 3: | Experiment results of map building based (A) conventional RBPF: (i) 100 particles, (ii) 500 particles; (B) novel RBPF: (iii) 50 particles, (iv) 100 particles |
Fig. 4: | Part of an image from the stereo camera (A) and the corresponding disparity image (B) |
Fig. 5: | The Comparison of the three paths, red line present the measured path of robot, green line present the odometers path, the blue show the modifying path by Novel RBPF |
The real environment of our room is shown in Fig. 4A, this is a typical image of our test environment where the pathways are very narrow with little texture throughout the cameras field of view. Fig. 4B is the disparity image generated by the stereo correspondence algorithm with lighter shades indicating objects closer to camera.
In order to measure the accuracy of our novel RBPF, we have measured ground truth pose data for different locations. The starting location is node A and the last location is node I, the others are labeled nodes B, C, D, E, F, G, H. We have labeled these locations on the learned map shown in the Fig. 5. The comparison of the three paths that comprise odometer path (green dashed line) and modified path by the novel RBPF (blue solid line) and the real measures path (red solid line) showed in the Fig. 5. From the Fig. 5, we can see that the odometer data is less deviation from the real measure data in the short term, however, with the robot moving, the deviation become larger. But for the modified data by novel RBPF, the deviation changes a little with the robot moving.
CONCLUSION
This article described a novel algorithm for SLAM problem using stereo vision. Like many previously published SLAM algorithms, our method calculates posterior probability distributions over 3D SIFT featured maps and robot locations. It does so recursively based on a key property of the SLAM problem: The conditional independence of feature estimates given the robot path. This conditional independence gives rise to a factored representation of the posterior using a combination of particle filters for estimating the robot path and the GMUPF for estimating the map. Furthermore, the number of resampling steps is determined adaptively, which seriously reduces the particle depletion problem. In order to reduce the false correspondence in the data association the Mahalanobis distance is introduced in the KD-tree based feature matching instead of the Euclidean distance. We have shown that we can construct accurate metric maps of 3D point landmarks from dense correlation-based stereo without loop closure. Experiment results on real robot in our indoor environment show that the advantages of our methods over previous approaches.