Authors in the Driver's Seat: Fast, Consistent, Computable Phenotype Data and Ontology Production

Modern evolutionary research requires organism character data from multiple branches of the Tree of Life, but the majority of such data are in a “legacy” format that computers cannot use. Highly trained postdoctoral scientists are employed to convert them manually. By one estimate, one year’s worth of English biomedical journal publications alone would take a full-time postdoc over 30 years to convert. This manual process is fundamentally problematic for three reasons. First, it does not stop the continued publication of character descriptions in the legacy format. Second, it takes time away from highly trained scientists, preventing them from making new and more important discoveries. Third, it does not address the fundamental causes of the large differences (~40%) in the data converted by different scientists, which make the resulting data less useful. This project will investigate a transformative approach that solves the problem at its root by enabling authors of biodiversity works to produce computable data directly at the time of publication, eliminating the need for post-publication data conversion. Scientists will be able to directly use the published data in the new format to address grand conservation, environmental, or medical challenges facing humankind today.

This research has great potential to solve the phenotype data production problem at its root. It recognizes that authors make (implicit) ontological commitments when writing a descriptive work and that they are best qualified to interpret the work. Instead of being limited to any “standardized” terminology, authors will be provided with a low-barrier, efficient, open-collaboration platform to describe characters as accurately as they wish using an ontology built collectively with other authors working on similar taxa. This process makes authors more aware of term usage in the community and helps the community (including the younger generation of scientists) converge on best practices. As a result, all character terms included in the descriptive work will be documented and defined in the taxon-specific ontology, and the descriptions will consequently be semantically clear and ready for software to harvest computable character data. The software platform to be developed by this project will expose disagreements among authors and usage-based term status in the ontology, and will notify authors of updates and conflicts via a mobile-friendly app. Taking into account user motivation and attribution factors, the project will design, implement, and evaluate the approach from both human social behavior and software usability perspectives, employing recent results from open online collaboration research and cutting-edge usability test techniques (e.g., measuring user stress levels via eye-tracking and pupil dilation). The approach is generalizable to other domains and, if successful, will fundamentally change the way phenotypic characters are documented and used in biological research.

Project duration: July 2017 - June 2020. Progress will be documented on this site as we carry out the project.

Contact hongcui@email.arizona.edu for more information or to share ideas.