Much of the research conducted on the Drosophilidae involves species
identification and distribution. Field sampling for an inventory of flies
results in a collection of thousands of specimens. Identifying these samples
one at a time using the traditional method, the dichotomous key, can become
time-consuming. For this reason, a much more efficient method of identification
is needed.
Many studies have used computers to automate species identification.
The systems developed so far are based on the concept of the dichotomous
key (Abdulrahaman et al., 2010; Kirchoff et al., 2008; Rocker et al., 2007;
Smith-Akin et al., 2006; Norton, 2005). The results, however, indicate that
automating the dichotomous key may have certain upper limits.
With the dichotomous key, the process of identification must be followed
sequentially (Rocker et al., 2007). If the user is unsure of the pair of
characters shown, he is still forced to choose one of them. Choosing the wrong
option can lead down an erroneous species-specific path. It has also been
pointed out that automation of the dichotomous key works well for small groups
of taxa but is unlikely to be very effective with larger groups
(Kirchoff et al., 2008). Moreover, existing implementations of the automated
dichotomous key do not have a way of quantifying correctness.
In this study, a different technique is presented and used for automation. The
System of Identification for Drosophilidae (SIrD) is a tool that eases the
identification of Drosophilidae species, using binary codes instead of the
dichotomous key to represent the presence or absence of character states.
Encoding character states in binary format, in effect, produces a linear binary
string. This binary linearization allows SIrD to calculate, in a straightforward
manner, the degree of similarity between the sample specimen and the species in
the database and then express it as a percentage.
MATERIALS AND METHODS
Several other benefits can be derived from the use of this principle. Binary
linearization allows character states to become independent of each other when
establishing presence or absence, speeding up the identification of specimens.
It also allows greater tolerance to identification errors. If the user makes
a mistake in establishing the presence or absence of a character, or makes only
a partial input, SIrD simply takes this into account and calculates the degree
of similarity accordingly. It displays all possible matches in descending order
of degree of similarity, thus avoiding the all-or-nothing approach of the
dichotomous key. Furthermore, the linear approach to identification makes the
character ID easily extensible: new characters can be added at the end of the
binary string whenever new discoveries are made.
Another important aspect of SIrD is that it is created using open-source web
development software. The computational algorithms are programmed in PHP, the
species data are stored and retrieved using MySQL and the overall user interface
is handled by Drupal. These choices make SIrD compatible with many web
browsers. A prototype has been deployed on the Internet
and can be found at http://drosophila.site88.net.
Binary coding and the degree of similarity: The binary code involves
only two values, true and false (or yes and no), represented in numeric form
as 1 and 0, respectively. The binary code 1
is assigned to a character state if it is present in the sample being analyzed
and the binary code 0
if that character state is absent. These binary codes, when lined up, form the
character ID that uniquely describes each Drosophilidae species. An identification
process based on these binary codes is formulated to determine a possible match
between the sample being analyzed and the species in the database.
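As an illustration of how a character ID is formed, the sketch below lines up binary codes from a list of present character states. It is written in Python rather than the PHP used by SIrD, and the function name `encode_character_id`, the 10-state card and the scored positions are all hypothetical.

```python
def encode_character_id(present_positions, total_states):
    """Build a binary string with '1' at each present (1-based) position."""
    present = set(present_positions)
    for pos in present:
        if not 1 <= pos <= total_states:
            raise ValueError(f"position {pos} is outside 1..{total_states}")
    # Every position not marked present is coded '0' (absent).
    return "".join("1" if i in present else "0"
                   for i in range(1, total_states + 1))


# Hypothetical specimen scored on a shortened, 10-state card.
sample_id = encode_character_id([1, 4, 5, 9], total_states=10)
print(sample_id)  # 1001100010
```

The ordinal position of each code is what carries the meaning, so the string must always be built over the full set of states in the same order.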
A crucial factor in establishing a match between two objects is to quantify
how similar they are. The technique used in this study is derived from the
mathematical model published by Jaccard (1901), a professor
of botany and plant physiology at ETH Zurich. The Jaccard index
of similarity, also known as the Jaccard similarity coefficient (Coefficient
de communauté), is a statistic used for comparing the similarity and
diversity of two sample sets A and B:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
The degree of similarity, as defined in this study, is the proportion of the
actual number of similarities to the possible number of similarities, expressed
as a percentage:
$$\text{Degree of similarity} = \frac{\text{actual number of similarities}}{\text{possible number of similarities}} \times 100\%$$
With binary coding, the existence of a particular character can simply be established
by assigning it a 1 (present) or a 0 (absent). Determining
similarity, then, becomes straightforward, as shown in the algorithm below:
1. Determine the ordinal positions of the present (code 1)
characters in the binary code of each object
2. Count the number of ordinal positions at which the present (code 1)
characters of the two objects match. This is the actual number of similarities
(numerator)
3. Count the number of present (code 1) characters in the binary
code of each object
4. Determine which of the two counts is larger. This becomes the possible number
of similarities (denominator)
5. Divide the actual number by the possible number to get the degree of similarity
In dealing with proportions, care must be taken in choosing the denominator.
With a poorly chosen denominator, the resulting quotient, although not
mathematically erroneous, may not be logical (e.g., a degree
of similarity exceeding 100%).
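The five steps of the algorithm can be sketched as follows. This is a minimal Python illustration, not the PHP code used in SIrD; the function names and the two 10-state strings are hypothetical. Returning 0 when neither string has a present character is an assumption, since the text does not cover that case. The classical Jaccard index is included only for comparison.

```python
def present_positions(code):
    """Return the set of 1-based ordinal positions coded '1' (step 1)."""
    return {i for i, bit in enumerate(code, start=1) if bit == "1"}


def degree_of_similarity(code_a, code_b):
    """Percentage similarity following the five steps listed above."""
    a, b = present_positions(code_a), present_positions(code_b)
    actual = len(a & b)                 # step 2: matching present positions
    possible = max(len(a), len(b))      # steps 3-4: larger count of 1s
    if possible == 0:
        return 0.0                      # assumption: no present characters
    return 100.0 * actual / possible    # step 5, expressed in percentage


def jaccard_index(code_a, code_b):
    """Classical Jaccard coefficient |A ∩ B| / |A ∪ B|, for comparison."""
    a, b = present_positions(code_a), present_positions(code_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


sample = "1001100010"    # present at positions 1, 4, 5, 9
species = "1001000011"   # present at positions 1, 4, 9, 10
print(degree_of_similarity(sample, species))  # 75.0
print(jaccard_index(sample, species))         # 0.6
```

In this example three positions match and the larger count of present characters is four, giving 75%. The Jaccard index divides by the size of the union (five positions) instead, which is why the two measures differ.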
Binary code score card: A score card (Appendix Table 1)
has been established to facilitate the formulation of the binary codes necessary
to uniquely describe a species. The character states used in the score card
are based on the significant identifying characters of the species (Markow
and O'Grady, 2006; Wheeler, 1952; Harrison, 1950). The significant identifying
characters are examined thoroughly to ensure that no character state is
repeated, avoiding that particular pitfall of the dichotomous key. The score
card (in printable and spreadsheet formats) and a diagram pointing out the
locations of the characters are made available for download and reference with
the software (http://drosophila.site88.net/?q=downloads).
The primary purpose of the score card is to give users a guide to the character
states needed for the identification of specimens through SIrD. For convenience,
the taxonomic characters are classified into several bodily categories,
where each character state is listed accordingly. These are the general body
characteristics (13 states), head (59 states), thorax (63 states), legs (74
states), abdomen (66 states), genitalia and anal (91 states) and wings (30 states).
There are a total of 396 character states in the score card, all of which can
simply be answered by the binary codes 1 or 0
(yes or no).
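The category sizes above imply a fixed range of ordinal positions for each body part, which can be sketched as a lookup table. The ranges below are derived by accumulating the state counts in the order listed and agree with the numbering in Appendix Table 1. The dictionary keys and the helper `category_slice` are hypothetical names used only for illustration.

```python
# 1-based inclusive position ranges, derived from the state counts above.
CATEGORY_RANGES = {
    "general": (1, 13),
    "head": (14, 72),
    "thorax": (73, 135),
    "legs": (136, 209),
    "abdomen": (210, 275),
    "genitalia_and_anal": (276, 366),
    "wings": (367, 396),
}


def category_slice(character_id, category):
    """Return the portion of a 396-state character ID for one category."""
    start, end = CATEGORY_RANGES[category]
    return character_id[start - 1:end]


# Number of states per category, recomputed from the ranges.
counts = {name: end - start + 1 for name, (start, end) in CATEGORY_RANGES.items()}
print(counts["head"], sum(counts.values()))  # 59 396
```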
When examining specimens for identification, there are situations where the
user is unsure of the character state observed. If the dichotomous key were
used, the user would be forced to choose one state and engage in trial and error.
The dichotomous key may be answered incorrectly, leading to an erroneous
identification. With the score card, however, the user can simply place a
question mark (?) on that character state to tell SIrD to ignore it during
comparison and still allow the software to compute an accurate degree of
similarity. This effectively eliminates the problem of dependency. The user's
decision in answering a character state does not affect the decision he makes
on any other character state.
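One way the question mark could be handled is sketched below. How SIrD treats an ignored position internally is not described here, so this Python sketch assumes that a position marked ? is dropped from both strings before the present characters are counted. The function name is hypothetical.

```python
def similarity_ignoring_unknowns(sample, species):
    """Degree of similarity (%) that skips positions marked '?' in the sample."""
    if len(sample) != len(species):
        raise ValueError("sample and species strings must have equal length")
    # Keep only the positions the user actually answered (assumed behaviour).
    answered = [(s, t) for s, t in zip(sample, species) if s != "?"]
    matches = sum(1 for s, t in answered if s == "1" and t == "1")
    sample_ones = sum(1 for s, _ in answered if s == "1")
    species_ones = sum(1 for _, t in answered if t == "1")
    possible = max(sample_ones, species_ones)
    return 100.0 * matches / possible if possible else 0.0


# The uncertain second state is skipped instead of being guessed.
print(similarity_ignoring_unknowns("1?01", "1101"))            # 100.0
# Guessing '0' for the same state lowers the score.
print(round(similarity_ignoring_unknowns("1001", "1101"), 1))  # 66.7
```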
Another purpose of the score card is to enable the formulation of binary strings
for adding (or editing) established species in the database. In contrast with
identification, adding or editing database entries requires stricter compliance
with binary coding. Only the binary codes 1 and 0
are allowed and all other symbols are considered illegal. Also, no character
state may be skipped during formulation, in order to maintain the ordinal
positions of the states relative to each other in the binary sequence.
These strict measures are necessary to ensure the correctness of the resulting
character ID for the database. Once the binary coding is completed, each species
listed in the software will have its own unique binary string, which becomes
the basis for comparison by SIrD when computing the degree of similarity.
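The two rules for database entries (only the codes 1 and 0, and no skipped states) can be checked with a few lines of code. The following Python sketch is illustrative only. The function name and the wording of the messages are hypothetical, and the length check assumes that "no skipped state" means exactly one code for each of the 396 states.

```python
TOTAL_STATES = 396  # number of character states in the score card


def validate_character_id(code, total_states=TOTAL_STATES):
    """Return a list of problems; an empty list means the string is acceptable."""
    problems = []
    # Rule 1: any symbol other than '0' or '1' is illegal for database entries.
    illegal = sorted(set(code) - {"0", "1"})
    if illegal:
        problems.append(f"illegal symbols: {illegal}")
    # Rule 2: every state must be answered so ordinal positions are preserved.
    if len(code) != total_states:
        problems.append(f"expected {total_states} codes, got {len(code)}")
    return problems


print(validate_character_id("0" * 396))               # []
print(validate_character_id("01?", total_states=3))   # ["illegal symbols: ['?']"]
```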
RESULTS AND DISCUSSION
SIrD software: SIrD is developed using several open-source programming
tools, namely, PHP, MySQL and Drupal. These tools were specifically chosen for
their practical merits. PHP is a widely used open-source general-purpose scripting
language that is especially suited for web development (http://www.php.net).
MySQL is a popular open-source database system used by many organizations to
save time and money in powering business-critical systems and packaged software
(Oracle Corp., 2013). Drupal is an open-source content management platform powering
millions of websites and applications (http://drupal.org).
A usage guide (Fig. 1) for the software is made available
for download and reference (http://drosophila.site88.net/?q=downloads).
SIrD is designed with ease of use in mind: its user interface is simple
and avoids clutter, making it pleasant to navigate. The process of using
SIrD is straightforward and can be summed up in only a few steps, as shown:
1. Download the binary code score card and the species diagram
2. Use the printable score card and species diagram as a visual guide
3. Use the spreadsheet score card as the answer sheet
4. Highlight and copy only the binary code portion of the specific
row being answered and paste it into the textbox on the landing page of SIrD
5. There may be white spaces in between the binary codes, but SIrD will simply
ignore them upon submission
6. Click on the search button to show the results page
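Codes copied from a spreadsheet row typically arrive separated by tabs or spaces. A minimal Python sketch of the white-space handling described in the steps above is given here. The function name is hypothetical and SIrD's own PHP routine may differ.

```python
def normalize_input(pasted):
    """Remove all whitespace (spaces, tabs, newlines) from pasted codes."""
    # str.split() with no argument splits on any run of whitespace.
    return "".join(pasted.split())


print(normalize_input("1 0\t0 1\n1 0"))  # 100110
```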
SIrD automatically displays, in descending order of degree of similarity (expressed
as a percentage), the species that are possible matches to the sample the user is
analyzing. Each species listed is presented with its binary identifier grouped
in blocks of 10 characters for easy counting. The result also highlights the ordinal
positions of the character states where the sample matches that of a particular
species, for review and verification.
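The results page described above combines three operations: scoring each species, sorting by score and grouping the identifier in blocks of 10. A compact Python sketch of these operations follows. The species names and 12-state strings are placeholders, not records from the SIrD database, and all function names are hypothetical.

```python
def group_in_tens(code):
    """Insert a space after every 10 characters for easier counting."""
    return " ".join(code[i:i + 10] for i in range(0, len(code), 10))


def similarity(sample, species):
    """Return (degree of similarity in %, sorted matching positions)."""
    a = {i for i, bit in enumerate(sample, start=1) if bit == "1"}
    b = {i for i, bit in enumerate(species, start=1) if bit == "1"}
    possible = max(len(a), len(b))
    score = 100.0 * len(a & b) / possible if possible else 0.0
    return score, sorted(a & b)


def rank_species(sample, database):
    """List every species in descending order of degree of similarity."""
    rows = []
    for name, code in database.items():
        score, matched = similarity(sample, code)
        rows.append((score, name, group_in_tens(code), matched))
    return sorted(rows, key=lambda row: row[0], reverse=True)


# Hypothetical 12-state entries; names are placeholders.
database = {
    "species C": "010000000001",
    "species B": "100100001100",
    "species A": "100110001000",
}
for score, name, grouped, matched in rank_species("100110001000", database):
    print(f"{score:5.1f}%  {name}  {grouped}  matches at {matched}")
```

Every species stays in the list, including those with a low score, which reflects the ranking approach described in the text rather than elimination.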
SIrD also allows users to diagnose specific portions of the specimen (e.g.,
thorax only, head and abdomen simultaneously, etc.). The input binary string
for identification is then much shorter, but still yields the possible matches. For
this to work, the user simply has to replace the binary codes with special symbols,
at the appropriate positions, to effectively ignore some characters, as shown in
Tables 1 and 2.
Fig. 1: User guide for the System of Identification for Drosophilidae (SIrD),
a step-by-step guide to using the software
Table 1 lists the special symbols to be used when
one wishes to ignore one or more character states. Table 2
illustrates sample placements and combinations of these special symbols. The
symbols are used to replace the codes for the character state(s) that the user
wishes to ignore.
Characters can also be ignored in bulk, by category. Each major
body part (the category) is assigned a specific alphabetic symbol to act as a
placeholder if that particular body part needs to be ignored. The
categories being ignored depend on where the symbols are positioned, as
shown in Tables 3 and 4.
Table 1: Special symbols for replacing a certain number of characters
Table 2: Positioning of binary string and special symbols
Table 3: Alphabetic symbols indicating the particular body part to ignore
Table 4: Positioning of binary string and alphabetic symbols
Table 3 presents the alphabetic symbols representing each
body part. When using these symbols, the user must keep in mind that the symbols
should always be placed in order, i.e., the alphabetic symbol A should always
be the first symbol in the string, as shown in Table 4.
It is important to note that these special and alphabetic symbols are valid
only when used during identification. These symbols are not valid when used
during adding or updating a character ID in the database, in order to preserve
the binary format needed to conveniently and uniquely describe the species.
Aside from being an identification system for the Drosophilidae species, SIrD
is also a capable information storage and retrieval system. When the user clicks
on a species name, SIrD opens a page dedicated solely to that particular species.
The species-dedicated page displays the scientific name, its designated binary
string, the list of character states present and sample images of the species.
It also displays information gathered from the published dichotomous keys and
other articles describing the phenotypic characters of the species. These sources
of information are listed in the References box on each of the species pages.
Many of the existing computer-based identification systems utilize the concept
of the couplet system of the dichotomous key. Although computerization allows
better handling of data, it still fundamentally presents the same problem:
the decision a user makes in a couplet is affected by the decision he
made in the previous couplet, because the couplets are dependent on each other.
The computer-based applications for identification developed by Rocker
et al. (2007) and Dallwitz et al. (2000),
for example, clearly demonstrate such dependence.
The dependencies introduced by the dichotomous key concept imply that there
is a unique path for each species. The user may need to be an expert in analysing
the specimen and be experienced in using the dichotomous key to get the correct
identification. This may not be easily accomplished by a novice user. By comparison,
the linear binary system of classification implemented in SIrD eliminates the
problem of dependency. The user's
decision in picking a particular character state, whether it is correct or not,
does not affect the decision he makes when choosing other character states.
This problem of dependency has also been addressed in MOSCHweb, an interactive
key to the Palearctic Tachinidae (Cerretti et al., 2012). MOSCHweb is a more
flexible identification software, allowing the
user to ignore character states that are difficult to interpret or are inapplicable
due to damage. However, the process of identification in MOSCHweb is such that
when the user selects a character state, it discards all other taxa that do
not share that state, eventually narrowing down to a single taxon.
With SIrD, on the other hand, the linear nature of its binary string enables
it to calculate the degree of similarity, allowing the software to display any
other possible matches. Along with the information on degree of similarity,
the ordinal positions of the character states in which they match are displayed
to allow for review and verification.
The use of binary codes in SIrD also allows for updating when new discoveries
are made. In Van der Linde and Houle's (2008) discovery
of new species of Drosophilidae fruit flies in the Philippines, the new species
could not be classified because of the presence of new characters. With the use of
SIrD, new species can be added into the system without having to overhaul or
reconstruct the arrangement of characters in the binary strings. New character
states can be conveniently added at the end of the binary string without affecting
the sequence of binary codes already entered into the database. The binary coding
system is extensible enough to easily accommodate new additions.
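Appending new character states at the end of the string can be sketched as below. This Python illustration assumes that species already in the database are coded 0 (absent) for each newly added state, which the text does not state explicitly. The function names and the 6-state example entries are hypothetical.

```python
def extend_database(database, new_state_count):
    """Pad every stored character ID with '0' for each newly added state."""
    # Assumption: existing species lack the new states, so they are coded '0'.
    return {name: code + "0" * new_state_count for name, code in database.items()}


def add_species(database, name, code):
    """Add a species whose character ID matches the current string length."""
    expected = len(next(iter(database.values()))) if database else len(code)
    if len(code) != expected or set(code) - {"0", "1"}:
        raise ValueError("character ID must be binary and match the current length")
    database[name] = code
    return database


# Hypothetical 6-state database extended by 2 new character states.
db = {"species A": "101000", "species B": "010011"}
db = extend_database(db, 2)
add_species(db, "new species", "10000011")
print(db["species A"], db["new species"])  # 10100000 10000011
```

Because the first six positions of every string are untouched, comparisons on the original states give the same matches as before the extension.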
The System of Identification for Drosophilidae offers an efficient way of identification
as well as ease of updating for new discoveries. Its binary-based identification
system allows straightforward computation of the degree of similarity, which
currently does not exist in other identification systems, as well as verification
of results. The high tolerance of the software to errors allows even novice users
to retrieve possible matches. The time and resources consumed in doing research
can, therefore, be reduced significantly.
Appendix Table 1: Score card for the software System of Identification for Drosophilidae
General body characteristics: 1-4: General size, 5-11: Color, 12-13: Pigmentation
Head characters: 14-17: Face and fronto orbital region, 18-21: Arista, 22: Antennae,
23-27: Palps, 28-34: Gena, 35-37: Ocelli, 38-44: Eyes, 45-47: Vittae, 48-61: Carina,
62-65: Oral bristle, 66-67: Anterior reclinate, 68-70: Proclinate, 71-72: Postocellar setae
Thorax: 73-79: Acrostichal setulae, 80-102: Mesonotum, 103-108: Pleura,
109-115: Thorax general color and pattern, 116-118: Notum, 119-122: Scutellum,
123-124: Katepisternal setae, 125: Prescutellar setulae, 126-130: Sterno index,
131-132: Scutellar setae, 133-134: Position of reclinate, 135: Distinct spots
Legs: 136-142: General color and characteristics, 143-148: Femur, 149-154: Tibia,
155-166: Tarsals, 167-169: Foretarsus, 170-171: Distal tarsals, 172: Posterior paramere,
173-174: Proximal sex comb, 175-196: Basitarsus, 197-198: Distal sex comb,
199-209: Sex comb and teeth
Abdominal characters: 210-215: General characteristics of the abdomen,
216-223: Abdominal tergites, 224-225: Ventral margin, 226: Spiracles, 227-230: Spiracles,
231-232: Sternites, 233: 4th sternite, 234-235: 5th sternite, 236-239: 6th sternite,
240-242: 7th sternite, 243-246: Tergites, 247-255: Abdominal banding, 256: Abdominal shape,
257-261: Apical banding, 262-268: Lateral areas of the abdomen,
269-275: Markings on the abdomen
Genitalia and anal: 276: Primary claspers, 277-298: Secondary claspers, 299-308: Surstylus,
309-311: Prensisetae, 312-313: Genital arch, 314-324: Anal plate, 325-328: Hypandrium,
329-332: Testes, 333-334: Cercus, 335-347: Epandrium, 348-352: Aedeagus,
353-354: Paraphyses, 355-359: Penis, 360-363: Phallus, 364-366: Paramere
Wings: 367-371: General characteristics of the wings, 372-376: Infuscations,
377-384: Costal cell setulae, 385-386: Longitudinal veins, 387-388: Basal cells,
389: Costal lappet, 390-396