The power of computers has enabled the field of biology to make significant leaps in quantitative methods. Fresh ideas from other disciplines in science and technology offer insight into how old techniques can be evolved into new ones for the benefit of biologists worldwide. In this study, a different approach to Drosophila species identification is presented that makes use of binary coding and computer automation. The developed software, a System of Identification for Drosophilidae (SIrD), uses binary strings to describe species characteristics and uses them to calculate the degree of similarity between a sample specimen and the species stored in the database. This method of identification allows for greater tolerance to errors and flexibility in adding new species characters. The linear nature of the binary string enables the software to eliminate the common problems encountered in the couplet system of the dichotomous key, such as species-specific paths, low tolerance to errors and rigidity in adding new characters. SIrD has a user-friendly interface and is created using open-source web development software, making it Internet-ready for easy access on the World Wide Web.
Much of the research conducted on species of Drosophilidae involves species identification and distribution. Field sampling for an inventory of flies results in a collection of thousands of specimens. Identifying these samples one at a time using the traditional method, the dichotomous key, can become time-consuming. For this reason, a more efficient method of identification is necessary.
Many studies have used computers to automate species identification. The systems developed so far are based on the concept of the dichotomous key (Abdulrahaman et al., 2010; Kirchoff et al., 2008; Rocker et al., 2007; Smith-Akin et al., 2006; Norton, 2005). The results, however, indicate that automating the dichotomous key may have inherent limits.
With the dichotomous key, the process of identification must be followed sequentially (Rocker et al., 2007). If the user is unsure of the pair of characters shown, he is still forced to choose one of them. Choosing the wrong option can result in an erroneous species-specific path. It has also been pointed out that automation of the dichotomous key works well for small groups of taxa but is unlikely to be very effective with larger groups (Kirchoff et al., 2008). Existing implementations of the automated dichotomous key have no way of quantifying correctness. In this study, a different technique is presented and used for automation. The System of Identification for Drosophilidae (SIrD) is a tool that eases the identification of Drosophilidae species, using binary codes instead of the dichotomous key to represent the presence or absence of character states. Encoding character states in binary format, in effect, produces a linear binary string. This binary linearization allows SIrD to calculate, in a straightforward manner, the degree of similarity between the sample specimen and the species in the database and then express it as a percentage.
MATERIALS AND METHODS
Several other benefits can be derived from the use of this principle. Binary linearization allows character states to become independent of each other when establishing presence or absence, speeding up the identification of specimens. It also allows greater tolerance to identification errors. If the user makes a mistake in establishing the presence or absence of a character, or provides only partial input, SIrD simply takes this into account and calculates the degree of similarity accordingly. It displays all possible matches in descending order of degree of similarity, thus avoiding the all-or-nothing approach of the dichotomous key. Furthermore, the linear approach to identification makes the character ID easily extensible, allowing new characters to be added at the end of the binary string as new discoveries are made.
Another important aspect of SIrD is that it is created using open-source web development software. The computational algorithms are programmed in PHP, the species data are stored and retrieved using MySQL and the overall user interface is handled by Drupal. These tools make SIrD compatible with many web browsers. A prototype has been deployed on the Internet and can be found at http://drosophila.site88.net.
Binary coding and the degree of similarity: The binary code involves only two values, true and false (or yes and no) represented in numeric form as 1 and 0, respectively. The binary code 1 is assigned to a character state if it is present in the sample being analyzed and binary code 0 if such character state is absent. These binary codes, when lined up, form the character ID that uniquely describes each Drosophilidae species. An identification process based on these binary codes is formulated to determine a possible match between the sample being analyzed and the species in the database.
A crucial factor in establishing a match between two objects is to quantify how similar they are. The technique used in this study is derived from the mathematical model published by Jaccard (1901), a professor of botany and plant physiology at ETH Zurich. The Jaccard index of similarity, also known as the Jaccard similarity coefficient (coefficient de communauté), is a statistic used for comparing the similarity and diversity of sample sets:
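In its standard form, for two sample sets, the Jaccard index is the size of the intersection divided by the size of the union. The symbols $J$, $A$ and $B$ are notation assumed here for illustration:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
```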
The degree of similarity, as defined in this study, is the proportion of the actual number of similarities to the possible number of similarities, expressed in percentage, as shown in the mathematical model below:
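A form of this model consistent with the algorithm steps described in this section can be written as follows. The notation is an assumption made for illustration: $A$ is the set of ordinal positions of present (code 1) characters in the sample, $B$ is the corresponding set for a species in the database and $S$ is the degree of similarity.

```latex
S(A, B) = \frac{\text{actual number of similarities}}{\text{possible number of similarities}} \times 100\%
        = \frac{|A \cap B|}{\max(|A|, |B|)} \times 100\%
```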
With binary coding, the existence of a particular character can simply be established by assigning it a 1 (present) or a 0 (absent). Determining similarity, then, becomes straightforward, as shown in the algorithm below:
- Determine the ordinal position of the present (code 1) characters in the binary code of each object
- Count the number of ordinal positions for which present (code 1) characters of each object match. This is the actual number of similarities (numerator)
- Count the number of present (code 1) characters in the binary code of each object
- Determine which is the larger of the two. This becomes the possible number of similarities (denominator)
- Divide the actual number by the possible number to get the degree of similarity
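The steps above can be sketched in a few lines of code. This is a minimal illustration in Python and not the PHP code used by SIrD; the function name `degree_of_similarity`, the example strings and the handling of two all-absent strings are assumptions.

```python
def degree_of_similarity(sample: str, species: str) -> float:
    """Degree of similarity, in percent, between two equal-length binary strings."""
    if len(sample) != len(species):
        raise ValueError("binary strings must have the same length")

    # Step 1: ordinal positions (1-based) of the present (code 1) characters.
    sample_present = {pos for pos, code in enumerate(sample, start=1) if code == "1"}
    species_present = {pos for pos, code in enumerate(species, start=1) if code == "1"}

    # Step 2: positions present in both objects (the numerator).
    actual = len(sample_present & species_present)

    # Steps 3 and 4: the larger count of present characters (the denominator).
    possible = max(len(sample_present), len(species_present))
    if possible == 0:
        return 0.0  # assumption: two all-absent strings score 0 instead of dividing by zero

    # Step 5: divide and express as a percentage.
    return 100.0 * actual / possible


# Hypothetical 10-character strings: 3 shared positions out of a possible 4.
print(degree_of_similarity("1011001000", "1010001100"))  # 75.0
```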
It is important to keep in mind that, in dealing with proportions, care must be taken in choosing the denominator. A poorly chosen denominator can yield a quotient that is arithmetically valid but not logical (e.g., a degree of similarity exceeding 100%).
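To illustrate how strongly the result depends on the denominator, the sketch below computes the same numerator against three candidate denominators. The function names and example strings are hypothetical; "larger" is the choice described in this study, "union" corresponds to the Jaccard index and "smaller" is shown only for contrast. The sketch assumes each string has at least one present character.

```python
def present_positions(binary: str) -> set[int]:
    """Ordinal positions (1-based) of the present (code 1) characters."""
    return {pos for pos, code in enumerate(binary, start=1) if code == "1"}


def similarity_with(denominator: str, sample: str, species: str) -> float:
    """Percentage of matching present characters for a chosen denominator."""
    a, b = present_positions(sample), present_positions(species)
    matches = len(a & b)
    possible = {
        "larger": max(len(a), len(b)),   # denominator used in this study
        "smaller": min(len(a), len(b)),  # shown for contrast only
        "union": len(a | b),             # Jaccard index denominator
    }[denominator]
    return 100.0 * matches / possible


sample, species = "1011000000", "1010001100"
for choice in ("larger", "smaller", "union"):
    print(choice, round(similarity_with(choice, sample, species), 2))
```

With these strings, two positions match; the sample has three present characters, the species has four and their union has five, so the three denominators give noticeably different percentages for the same pair.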
Binary code score card: A score card (Appendix Table 1) has been established to facilitate the formulation of the binary codes necessary to uniquely describe a species. The character states used in the score card are based on the significant identifying characters of the species (Markow and O'Grady, 2006; Wheeler, 1952; Harrison, 1950). The significant identifying characters were examined thoroughly to ensure that no character state is repeated, avoiding that particular pitfall of the dichotomous key. The score card (in printable and spreadsheet formats) and a diagram pointing out the locations of the characters are made available for download and reference with the software (http://drosophila.site88.net/?q=downloads).
The primary purpose of the score card is to provide users with a guide to the character states needed for the identification of specimens through SIrD. For convenience, the taxonomic characters are classified into several body categories, with each character state listed under its category. These are the general body characteristics (13 states), head (59 states), thorax (63 states), legs (74 states), abdomen (66 states), genitalia and anal (91 states) and wings (30 states). There are a total of 396 character states in the score card, all of which can be answered with binary code 1 or 0 (yes or no).
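The category sizes listed above determine which ordinal positions of the binary string belong to each body category. The following sketch derives those ranges from the counts, assuming the categories appear in the binary string in the order given; the names `CATEGORY_COUNTS` and `category_ranges` are hypothetical.

```python
from itertools import accumulate

# Number of character states per category, in the order listed in the text.
CATEGORY_COUNTS = [
    ("general body characteristics", 13),
    ("head", 59),
    ("thorax", 63),
    ("legs", 74),
    ("abdomen", 66),
    ("genitalia and anal", 91),
    ("wings", 30),
]


def category_ranges(counts):
    """Map each category to its (first, last) ordinal position, 1-based and inclusive."""
    ends = accumulate(n for _, n in counts)  # running totals give each category's last position
    return {name: (end - n + 1, end) for (name, n), end in zip(counts, ends)}


RANGES = category_ranges(CATEGORY_COUNTS)
print(RANGES["head"])   # (14, 72)
print(RANGES["wings"])  # (367, 396)
```

The last position of the final category equals 396, which agrees with the total number of character states in the score card.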
When examining specimens for identification, there are situations where the user is unsure of the character state observed. If the dichotomous key were used, the user would be forced to choose one state and engage in trial-and-error. The dichotomous key may be answered incorrectly, leading to erroneous identification (Osborne, 1963).
With the score card, however, the user can simply place a question mark (?) on that character state to tell SIrD to ignore it during comparison and still allow the software to compute an accurate degree of similarity. This effectively eliminates the problem of dependency. The user's decision in answering a character state does not affect the decision he makes on any other character state.
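One way to realize this behavior is sketched below. It rests on an assumption not spelled out in the text: a position marked with a question mark is dropped from both the sample and the species before the present characters are counted. The function name and example strings are hypothetical.

```python
def similarity_ignoring_unknowns(sample: str, species: str, unknown: str = "?") -> float:
    """Degree of similarity (percent), skipping positions the user marked as unknown."""
    if len(sample) != len(species):
        raise ValueError("strings must have the same length")

    # Assumption: an unknown position is removed from both strings before counting.
    kept = [(s, d) for s, d in zip(sample, species) if s != unknown]

    actual = sum(1 for s, d in kept if s == "1" and d == "1")
    possible = max(
        sum(1 for s, _ in kept if s == "1"),
        sum(1 for _, d in kept if d == "1"),
    )
    return 100.0 * actual / possible if possible else 0.0


species = "1011001000"
print(similarity_ignoring_unknowns("101?001000", species))  # 100.0 (position 4 ignored)
print(similarity_ignoring_unknowns("1010001000", species))  # 75.0 (position 4 guessed as absent)
```

In this example, marking the uncertain character as unknown keeps a wrong guess from lowering the score, while the remaining character states are compared as usual.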
Another purpose of the score card is to enable the formulation of binary strings for adding (or editing) established species in the database. In contrast with identification, adding or editing database entries requires stricter compliance with binary coding. Only binary codes 1 and 0 are allowed and all other symbols are considered illegal. Also, no character state may be skipped during formulation, so that the ordinal positions of the character states relative to each other are maintained in the binary sequence.
These strict measures are necessary to ensure the correctness of the resulting character ID for the database. Once the binary coding is completed, each species listed in the software has its own unique binary string, which becomes the basis for comparison by SIrD when computing the degree of similarity.
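These rules for database entries can be expressed as a simple validation check. The sketch below is illustrative only; the function name is hypothetical and the expected length of 396 is taken from the number of character states in the score card.

```python
CHARACTER_ID_LENGTH = 396  # total character states in the score card


def validate_character_id(character_id: str, expected_length: int = CHARACTER_ID_LENGTH) -> str:
    """Return the character ID unchanged if it is a legal database entry, else raise ValueError."""
    # Only binary codes 1 and 0 are allowed; every other symbol is illegal.
    illegal = sorted(set(character_id) - {"0", "1"})
    if illegal:
        raise ValueError(f"illegal symbols in character ID: {illegal}")

    # No character state may be skipped, so the length must match exactly.
    if len(character_id) != expected_length:
        raise ValueError(
            f"expected {expected_length} character states, got {len(character_id)}"
        )
    return character_id
```

A string containing a question mark or another special symbol is rejected here, even though the same symbol is acceptable during identification.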
RESULTS AND DISCUSSION
SIrD software: SIrD is developed using several open-source programming tools, namely, PHP, MySQL and Drupal. These tools were specifically chosen for their practical merits. PHP is a widely-used open source general-purpose scripting language that is especially suited for web development (http://www.php.net). MySQL is a popular open source database system used by many organizations to save time and money in powering business-critical systems and packaged software (Oracle Corp. 2013). Drupal is an open source content management platform powering millions of websites and applications (http://drupal.org).
A usage guide (Fig. 1) for the software is made available for download and reference (http://drosophila.site88.net/?q=downloads). SIrD is designed with ease of use in mind, with a simple, uncluttered user interface that is pleasant to navigate. The process of using SIrD is straightforward and can be summed up in only a few steps, as shown:
- Download the binary code score card and the species diagram
- Use the printable score card and species diagram as visual aids
- Use the spreadsheet score card as the answer sheet
- Highlight and copy only the binary code portion of the specific row being answered and paste it into the textbox on the landing page of SIrD
- There may be white spaces in between the binary codes, but SIrD will simply ignore them upon submission
- Click on the search button to show the results page
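The whitespace handling mentioned in the steps above can be sketched as a small normalization step applied to the pasted text before comparison. The function name `normalize_input` is hypothetical and the example input is made up.

```python
def normalize_input(pasted: str) -> str:
    """Remove spaces, tabs and line breaks from a pasted binary string."""
    # str.split() with no arguments splits on any run of whitespace.
    return "".join(pasted.split())


print(normalize_input("1011 0010\t10\n"))  # 1011001010
```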
SIrD automatically displays, in descending order of degree of similarity (expressed as a percentage), the species that are possible matches to the sample the user is analyzing. Each species listed is presented with its binary identifier arranged in groups of 10 characters for easy counting. The result also highlights the ordinal positions of the character states where the sample matches a particular species, for review and verification.
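A results listing of this kind can be sketched as follows. The code is a simplified illustration in Python, not the SIrD implementation; the species names and 12-character strings are placeholders and all function names are hypothetical.

```python
def degree_of_similarity(sample: str, species: str) -> float:
    """Matching present characters over the larger count of present characters, in percent."""
    actual = sum(1 for s, d in zip(sample, species) if s == d == "1")
    possible = max(sample.count("1"), species.count("1"))
    return 100.0 * actual / possible if possible else 0.0


def group_by_ten(binary: str) -> str:
    """Split a binary identifier into groups of 10 characters for easy counting."""
    return " ".join(binary[i:i + 10] for i in range(0, len(binary), 10))


def matching_positions(sample: str, species: str) -> list[int]:
    """Ordinal positions (1-based) where both strings have a present character."""
    return [pos for pos, (s, d) in enumerate(zip(sample, species), start=1) if s == d == "1"]


def rank_matches(sample: str, database: dict[str, str]) -> list[tuple[str, float]]:
    """Species sorted in descending order of degree of similarity."""
    scored = [(name, degree_of_similarity(sample, cid)) for name, cid in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Placeholder database entries, not real species data.
database = {
    "species A": "101000110000",
    "species B": "101100100000",
    "species C": "000011000011",
}
sample = "101100100000"
for name, score in rank_matches(sample, database):
    print(f"{score:6.2f}%  {name}  {group_by_ten(database[name])}  "
          f"matches at {matching_positions(sample, database[name])}")
```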
SIrD also allows users to diagnose specific portions of the specimen (e.g., thorax only, or head and abdomen together). The input binary string for identification is then much shorter but still yields the possible matches. For this to work, the user simply replaces the binary codes with special symbols, at the appropriate positions, to ignore some characters, as shown in Tables 1 and 2.
Fig. 1: User guide for the System of Identification for Drosophilidae (SIrD), a step-by-step guide to using the software
Table 1 lists the special symbols that are used when one wishes to ignore one or more character states. Table 2 illustrates sample placements and combinations of these special symbols. The symbols replace the codes for the character state(s) that the user wishes to ignore.
Characters can also be ignored in bulk, by category. Each major body part (the category) is assigned a specific alphabetic symbol that acts as a placeholder if that particular body part needs to be ignored. The categories being ignored depend on where the symbols are positioned, as shown in Tables 3 and 4.
Table 1: Special symbols for replacing a certain number of characters
Table 2: Positioning of binary string and special symbols
Table 3: Alphabetic symbols indicating the particular body part to be ignored
Table 4: Positioning of binary string and alphabetic symbols
Table 3 presents the alphabetic symbols representing each body part. When using these symbols, the user must keep in mind that the symbols should always be placed in order, i.e., the alphabetic symbol A should always be the first symbol in the string, as shown in Table 4.
It is important to note that these special and alphabetic symbols are valid only when used during identification. These symbols are not valid when used during adding or updating a character ID in the database, in order to preserve the binary format needed to conveniently and uniquely describe the species.
Aside from being an identification system for the Drosophilidae species, SIrD is also a capable information storage and retrieval system. When the user clicks on a species name, the software opens another page dedicated solely to that particular species.
The species-dedicated page displays the scientific name, its designated binary string, the list of character states present and sample images of the species. It also displays information gathered from the published dichotomous keys and other articles describing the phenotypic characters of the species. These sources of information are listed in the References box on each species-dedicated page.
Many of the existing computer-based identification systems utilize the couplet system of the dichotomous key. Although computerization allows better handling of data, it still fundamentally presents the same problem: the decision a user makes in a couplet is affected by the decision he made in the previous couplet, because the couplets are dependent on each other. The computer-based applications for identification developed by Rocker et al. (2007) and Dallwitz et al. (2000), for example, clearly demonstrate such dependence.
The dependencies introduced by the dichotomous key concept imply that there is a unique path for each species. The user may need to be an expert in analysing the specimen and be experienced in using the dichotomous key to get the correct identification. This may not be easily accomplished by a novice user. By comparison, the linear binary system of classification implemented in SIrD eliminates the problem of dependency. The user's decision in picking a particular character state, whether it is correct or not, does not affect the decision he makes when choosing other character states.
This problem of dependency has also been addressed in MOSCHweb, an interactive key of the Palearctic Tachinidae (Cerretti et al., 2012). MOSCHweb is a more flexible identification software, allowing the user to ignore character states that are difficult to interpret or are inapplicable due to damage. However, the identification process of MOSCHweb is such that when the user selects a character state, it discards all taxa that do not share that state, eventually narrowing down to a single taxon.
With SIrD, on the other hand, the linear nature of its binary string enables it to calculate the degree of similarity, allowing the software to display any other possible matches. Along with the information on degree of similarity, the ordinal positions of the character states in which they match are displayed to allow for review and verification.
The use of binary codes in SIrD also allows for updating when new discoveries are made. In the discovery of new species of Drosophilidae fruit flies in the Philippines reported by Van der Linde and Houle (2008), the species could not be classified because of the presence of new characters. With the use of SIrD, new species can be added into the system without having to overhaul or reconstruct the arrangement of characters in the binary strings. New character states can be conveniently added at the end of the binary string without affecting the sequence of binary codes already entered into the database. The binary coding system is extensible enough to easily accommodate new additions.
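The extension described above can be sketched as appending codes to the end of every stored string. The sketch makes one assumption not stated in the text: existing species are coded 0 (absent) for each newly added character state until they are re-examined. The function name and the short example strings are hypothetical.

```python
def extend_character_ids(database: dict[str, str], new_states: int) -> dict[str, str]:
    """Append new character states to the end of every stored binary string."""
    # Assumption: existing species are coded 0 (absent) for the new states.
    return {name: cid + "0" * new_states for name, cid in database.items()}


# Placeholder entries with 6 character states each, not real species data.
database = {"species A": "101001", "species B": "110010"}

# Two new character states are introduced for a newly described species.
database = extend_character_ids(database, new_states=2)
database["species C"] = "10000111"

print(database["species A"])  # 10100100
```

Because only present (code 1) characters enter the calculation of the degree of similarity, the trailing zeros leave the comparisons among the existing entries unchanged, and the ordinal positions of the original character states are preserved.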
The System of Identification for Drosophilidae offers an efficient way of identification as well as ease of updating for new discoveries. Its binary-based identification system allows straightforward computation of the degree of similarity, which is not currently available in other identification systems, as well as verification of results. The high tolerance to errors allows even novice users to retrieve possible matches. The time and resources consumed in doing research can, therefore, be reduced significantly.
Appendix Table 1: Score card for the System of Identification for Drosophilidae (SIrD) software
General body characteristics: 1-4: General size, 5-11: Color, 12-13: Pigmentation
Head characters: 14-17: Face and fronto-orbital region, 18-21: Arista, 22: Antennae, 23-27: Palps, 28-34: Gena, 35-37: Ocelli, 38-44: Eyes, 45-47: Vittae, 48-61: Carina, 62-65: Oral bristle, 66-67: Anterior reclinate, 68-70: Proclinate, 71-72: Postocellar setae
Thorax: 73-79: Acrostichal setulae, 80-102: Mesonotum, 103-108: Pleura, 109-115: Thorax general color and pattern, 116-118: Notum, 119-122: Scutellum, 123-124: Katepisternal setae, 125: Prescutellar setulae, 126-130: Sterno index, 131-132: Scutellar setae, 133-134: Position of reclinate, 135: Distinct spots
Legs: 136-142: General color and characteristics, 143-148: Femur, 149-154: Tibia, 155-166: Tarsals, 167-169: Foretarsus, 170-171: Distal tarsals, 172: Posterior paramere, 173-174: Proximal sex comb, 175-196: Basitarsus, 197-198: Distal sex comb, 199-209: Sex comb and teeth
Abdominal characters: 210-215: General characteristics of the abdomen, 216-223: Abdominal tergites, 224-225: Ventral margin, 226: Spiracles, 227-230: Spiracles, 231-232: Sternites, 233: 4th sternite, 234-235: 5th sternite, 236-239: 6th sternite, 240-242: 7th sternite, 243-246: Tergites, 247-255: Abdominal banding, 256: Abdominal shape, 257-261: Apical banding, 262-268: Lateral areas of the abdomen, 269-275: Markings on the abdomen
Genitalia and anal: 276: Primary claspers, 277-298: Secondary claspers, 299-308: Surstylus, 309-311: Prensisetae, 312-313: Genital arch, 314-324: Anal plate, 325-328: Hypandrium, 329-332: Testes, 333-334: Cercus, 335-347: Epandrium, 348-352: Aedeagus, 353-354: Paraphyses, 355-359: Penis, 360-363: Phallus, 364-366: Paramere
Wings: 367-371: General characteristics of the wings, 372-376: Infuscations, 377-384: Costal cell setulae, 385-386: Longitudinal veins, 387-388: Basal cells, 389: Costal lappet, 390-396
- Abdulrahaman, A.A., I.B. Asaju, M.O. Arigbede and F.A. Oladele, 2010. Computerized system for identification of some savanna tree species in Nigeria. J. Horticul. For., 2: 112-116.
- Harrison, R.A., 1950. New Zealand Drosophilidae (Diptera): I-introduction and descriptions of domestic species of the genus Drosophila Fallén. Trans. Royal Soc. New Zealand, 79: 505-517.
- Jaccard, P., 1901. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bull. de la Société Vaudoise des Sci. Naturelles, 37: 241-272.
- Markow, T.A. and P.M. O'Grady, 2006. Drosophila: A Guide to Species Identification and Use. Elsevier Inc., New York, Pages: 247.
- Osborne, D.V., 1963. Some aspects of the theory of dichotomous keys. New Phytologist, 62: 144-160.
- Rocker, J., C.M. Yauch, S. Yenduri, L.A. Perkins and F. Zand, 2007. Paper-based dichotomous key to computer-based application for biological identification. J. Comput. Sci. Coll., 22: 30-38.
- Smith-Akin, K.A., S. McLane, T.M. Craig and T.R. Johnson, 2006. Application of cognitive engineering principles to the redesign of a dichotomous identification key for parasitology. Proceedings of the American Medical Informatics Association Annual Symposium, November 11-15, 2006, Washington, DC., pp: 739-743.
- Van der Linde, K. and D. Houle, 2008. A supertree analysis and literature review of the genus Drosophila and closely related genera (Diptera, Drosophilidae). Insect Syst. Evol., 39: 241-266.
- Wheeler, M.R., 1952. A key to the genera of Drosophilidae of the Pacific Islands (Diptera). Proc. Hawaiian Entomol. Soc., 14: 421-423.