Investigators: Greg Hamerly, PhD, and Pablo Rivas, PhD
Summary
CRADLE is a software project to detect leukocoria, a key symptom of ocular disease, using machine learning techniques. We have developed a prediction model based on a limited training set, but building higher-quality predictors requires more data.
Gathering large amounts of data by hand is difficult when the events of interest are rare. In our case, leukocoria appears in only a small fraction of recreational photographs. Further, requiring metadata that shows the photos of interest come from individuals with confirmed disease makes ground truth data even harder to obtain.
Data augmentation and synthesis are potential solutions for sparse and rare data. Data augmentation creates additional data by modifying ground truth data to produce new versions of it. Data synthesis creates new data from generators, without requiring ground truth data.
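To make the distinction concrete, the following is a minimal Python sketch, not the CRADLE implementation. It assumes a toy grayscale image stored as a list of pixel rows; the function names (flip_horizontal, shift_brightness, augment, synthesize) and the bright-disc generator are hypothetical and chosen only for illustration. Augmentation takes an existing ground truth image as input, while synthesis takes none.

```python
import random
from typing import List

Image = List[List[int]]  # toy grayscale image, pixel values in 0..255


def flip_horizontal(img: Image) -> Image:
    """Mirror each row; the label of the source image is assumed unchanged."""
    return [list(reversed(row)) for row in img]


def shift_brightness(img: Image, delta: int) -> Image:
    """Add delta to every pixel, clipping the result to the range 0..255."""
    return [[max(0, min(255, p + delta)) for p in row] for row in img]


def augment(img: Image, rng: random.Random) -> List[Image]:
    """Augmentation: derive new versions from one ground truth image."""
    delta = rng.randint(-30, 30)
    return [flip_horizontal(img), shift_brightness(img, delta)]


def synthesize(size: int, rng: random.Random) -> Image:
    """Synthesis: a toy generator (hypothetical) that draws a bright disc
    on a dark background without reading any ground truth image.
    Assumes size >= 2."""
    cx = cy = size // 2
    radius = rng.randint(1, size // 2)
    level = rng.randint(150, 255)
    return [
        [level if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0
         for x in range(size)]
        for y in range(size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    photo = [[0, 100], [200, 250]]      # stands in for a ground truth image
    print(len(augment(photo, rng)))     # two augmented versions of photo
    print(len(synthesize(8, rng)))      # one 8x8 synthetic image
```

A real pipeline would use an image library and a learned generator; the point here is only that augmented items have a ground truth parent and synthetic items do not.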
There are potential ethical issues related to these approaches. From the input side, we need to be able to track the provenance of data to ensure it is used properly. People who have donated their data for use must know that it is being handled and used according to their understanding, and that there are limits to its use and distribution.
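One way provenance tracking could look in code is sketched below. This is an illustrative assumption rather than a design from the project: the ProvenanceRecord fields, the derive and is_permitted helpers, and the use labels such as "training" are all hypothetical. The sketch shows a derived (for example, augmented) item inheriting its donor and its permitted uses from the item it was created from.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record attached to every data item."""
    item_id: str
    donor_id: Optional[str]          # None for purely synthetic items
    permitted_uses: FrozenSet[str]   # uses the donor agreed to
    parent_id: Optional[str] = None  # item this one was derived from
    transform: Optional[str] = None  # how it was derived, if at all


def derive(parent: ProvenanceRecord, item_id: str, transform: str) -> ProvenanceRecord:
    """Create the record for a derived item; consent limits carry over."""
    return ProvenanceRecord(
        item_id=item_id,
        donor_id=parent.donor_id,
        permitted_uses=parent.permitted_uses,
        parent_id=parent.item_id,
        transform=transform,
    )


def is_permitted(record: ProvenanceRecord, use: str) -> bool:
    """Check a proposed use against what the donor agreed to."""
    return use in record.permitted_uses


if __name__ == "__main__":
    original = ProvenanceRecord("photo-001", "donor-A", frozenset({"training"}))
    flipped = derive(original, "photo-001-flip", "flip_horizontal")
    print(is_permitted(flipped, "training"))        # True
    print(is_permitted(flipped, "redistribution"))  # False
```

Carrying the permitted uses forward by default is a conservative choice: a derived item can never be used more broadly than the donated item it came from.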
From the output side, we need to measure how effective and reliable generated data is for learning to make predictions. We also want to report to the end user any limitations we discover in the model's suitability that stem from the way it was trained.
Machine learning lacks established guidelines for these problems. We aim to investigate and address these issues by studying and adapting best practices from biomedical ethics. In particular, we ask which of its grounding principles, such as autonomy, nonmaleficence, beneficence, and justice, are applicable, and how.