A new study led by Australia's national science agency, CSIRO, has found that 95.5 per cent of current entries in GISAID, the world's largest novel coronavirus genome database, do not contain relevant patient information, a critical piece of the puzzle for understanding the virus and how it is evolving.
The researchers have used this finding to develop a standardised data collection template that can be implemented on repositories like GISAID. The template does not identify the patient, and it makes it easier for clinical teams treating patients to share more of their knowledge.
This enables the scientific community to access important information, including symptoms, vaccine status and travel history, and in doing so build a more complete picture of the impact of COVID-19 on each patient.
SARS-CoV-2, the virus that causes COVID-19, is one of the most sequenced viruses in history, with over 200,000 sequences on GISAID as of 16 November 2020.
The last 100,000 sequences of the virus were uploaded in the past two months, a global record.
The study, a collaboration with GISAID and other academic partners, proposes a standardised data collection method to help scientists and clinicians around the world gather and share vital information in the fight against COVID-19.
CSIRO researcher and senior author of the paper Dr S.S. Vasan, who is also Honorary Professor at the University of York, UK, said it is critical to collect the 'patient journey' in as much detail as possible to understand the impact of virus evolution on the disease and its consequences.
"We urgently need de-identified patient data associated with these virus genome sequences in order to decipher whether disease outcomes are due to a mutation, or multiple mutations, in the virus or host factors such as age, gender and co-morbidities."
S.S. Vasan, Honorary Professor, University of York
"It's very likely this information is known to the clinical teams who treated the patient but does not make its way to public repositories such as GISAID, due to the number of steps involved."
Recognising this need for clinical data, GISAID has made 'patient status' a compulsory field for uploading virus sequences since 27 April 2020.
However, the study showed that a lack of digital infrastructure for collecting clinical information has hampered progress.
It also identified a standardised vocabulary and a mechanism for linking with health systems as key requirements for capturing the necessary information.
Lead author and CSIRO researcher Dr Denis Bauer, who is also Honorary Associate Professor at Macquarie University, Sydney, said with the adoption of the study's proposed data collection template, future sequences shared through the GISAID initiative could contain more meaningful de-identified patient information.
"We have identified steps in the clinical health data acquisition cycle and workflows that likely have the biggest impact in the data-driven understanding of this virus," Dr Bauer said.
"Following the 'Fast Healthcare Interoperability Resources' implementation guide, we have introduced an ontology-based standard questionnaire consistent with the World Health Organization's recommendations."
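To give a rough sense of what a standardised, de-identified record of this kind might look like, the following minimal Python sketch builds a FHIR-style QuestionnaireResponse as a plain dictionary. It is an illustration only and not the study's actual template: the function name `build_response`, the `linkId` values, the list of allowed patient statuses and the placeholder sequence identifier are all assumptions made for this example.

```python
# Illustrative sketch only: field names, linkIds and the status vocabulary
# below are hypothetical and are NOT taken from the study's template.

# Hypothetical controlled vocabulary for the 'patient status' field.
ALLOWED_PATIENT_STATUS = {"hospitalised", "released", "live", "deceased", "unknown"}


def build_response(sequence_id, patient_status, symptoms, vaccinated, travel_history):
    """Return a FHIR-style QuestionnaireResponse as a plain dict.

    The record is keyed to a sequence identifier only; it carries no name,
    date of birth or other direct patient identifier.
    """
    if patient_status not in ALLOWED_PATIENT_STATUS:
        raise ValueError(f"unknown patient status: {patient_status!r}")
    return {
        "resourceType": "QuestionnaireResponse",
        "status": "completed",
        "identifier": {"value": sequence_id},
        "item": [
            {"linkId": "patient-status", "answer": [{"valueString": patient_status}]},
            {"linkId": "symptoms", "answer": [{"valueString": s} for s in symptoms]},
            {"linkId": "vaccinated", "answer": [{"valueBoolean": vaccinated}]},
            {"linkId": "travel-history", "answer": [{"valueString": t} for t in travel_history]},
        ],
    }


# Example use with a placeholder identifier (not a real accession number).
record = build_response(
    sequence_id="EXAMPLE-0001",
    patient_status="hospitalised",
    symptoms=["fever", "cough"],
    vaccinated=False,
    travel_history=["none reported"],
)
```

Restricting free-text fields such as patient status to an agreed vocabulary is one simple way to make records from different hospitals comparable, which is the kind of consistency the standardised questionnaire aims for.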
Barwon Health's Director of Infectious Diseases Professor Eugene Athan welcomed the new data collection template.
"Barwon Health is leading a study on the long-term biological, physiological and psychological effects of COVID-19, in partnership with CSIRO and Deakin University, and we intend to implement this mechanism for our data collection and reporting," Prof Athan said.
"Having a simplified and standardised approach to sharing relevant patient information alongside genome sequences will enable critical research into COVID-19 and comparisons between different studies and population sets.
"I encourage clinicians and scientists around the world to share, wherever possible, de-identified patient information and clinical outcomes using this template to support ongoing research efforts."