Something that most scientists agree on is that data is good, and more data is better. Over the last few years, it has become apparent that one of the biggest barriers to progress in genomic research is the inaccessibility of previously generated data. Many journals now insist that for a scientific manuscript to be published, the data it describes should be made publicly available. This is an important cultural shift, and it shows that journals recognise the value of making data available. However, it is often side-stepped by researchers who either make the data ‘available on request’ (good luck with that one) or release it under ‘managed access’. Managed access essentially means that to access the data, you must submit a lengthy application to a data access committee, which decides on a case-by-case basis whether or not to grant access. This creates a lot of work, both for the researchers who need to fill out these applications and for those who have to review them, and it adds huge delays to research.
The importance of open data in genomics was recognised most notably by the scientists of the Human Genome Project, who completed the first human genome sequence back in 2003. The project was an enormous undertaking, taking 13 years to complete and costing nearly £2 billion ($2.7 billion) in total. The scientists involved made a decision which really kick-started genetic research and changed the landscape of science forever: they made the data completely free and openly available to anyone who wanted to use it. This provided a blueprint for all the genetic research that has happened since, and massively accelerated the field, making it what it is today. It is perhaps surprising that since then, researchers have been somewhat reluctant to make genetic data openly available.
One project which is trying to encourage a change in the culture of science towards open data is the Personal Genome Project (PGP). The principle of the project is to invite willing volunteers to make their genomic data publicly available, once they have completed a test which ensures that they understand the potential risks of taking part. There are several branches of the PGP running globally, and I am delighted to be part of the PGP-UK team, founded by Stephan Beck in 2013.
A few months ago, we released a paper describing the pilot phase of the Personal Genome Project UK. It summarises the project, the recruitment of our pilot participants (our resident ‘citizen scientists’), and the analyses of the multi-omics data generated from these individuals. It also covers the first genetic and epigenetic reports given to participants, and the development of our app, GenoME, which allows you to explore the genomes and epigenomes of four of our participant ambassadors (the app is freely available on the iPad App Store now). Finally, the paper covers our first genome donations: we were the first project in the world to develop and use genome donation, which allows individuals who had their genome privately sequenced elsewhere to turn that closed-access data into open-access data for the benefit of research.
This week, we have released another paper on bioRxiv, which gives a more in-depth interrogation of the multi-omics dataset we have produced. The paper describes the genetic (whole genome sequencing), epigenetic (whole genome bisulphite sequencing and methylation array) and transcriptomic (RNA-seq) data for this pilot group. One of the issues we identified when releasing these data was that there is no single platform on which you can release open-access multi-omics datasets. To solve this problem, as well as releasing the data on the standard separate platforms (ENA, EVA and ArrayExpress), we established collaborations with the cloud platform providers Seven Bridges Genomics and Lifebit, who now host all the multi-omics data in one place. As data analysis can also be performed on these platforms, this removes the need to download all the data to perform analyses locally (which would take over 10 days with a typical connection!).
The paper also describes the extensive quality checks that the team performed to ensure the data is of top quality. An important part of this was verifying that each dataset really does correspond to the right participant, something which surprisingly isn’t done very often in multi-omics projects! To do this, we extracted information about a set of SNPs from each of the other ‘omics’ datasets (whole genome bisulphite sequencing, array data and RNA-seq) and correlated it with the same SNP loci extracted from the whole genome sequencing. This allowed us to verify across all data types that the samples were annotated correctly, and also to ensure that the data quality was consistently high across the dataset.
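To make the idea of this identity check a little more concrete, here is a minimal sketch in Python. It is not the pipeline used in the paper. It assumes that genotypes have already been called from each data type and encoded as alternate-allele counts (0, 1 or 2), that Pearson correlation is a reasonable similarity measure, and that a sample is ‘correctly annotated’ if its best-matching whole genome sequence carries the same participant label. The function names, SNP IDs and toy genotypes are all made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # no variation, so the correlation is undefined
    return cov / sqrt(var_x * var_y)


def genotype_correlation(reference, query):
    """Correlate two genotype dicts using only SNPs called in both."""
    shared = [snp for snp in reference
              if reference[snp] is not None and query.get(snp) is not None]
    return pearson([reference[s] for s in shared],
                   [query[s] for s in shared])


def best_matches(wgs, assay):
    """Map each assay sample to the WGS sample it correlates with most."""
    matches = {}
    for sample_id, genotypes in assay.items():
        scores = {ref_id: genotype_correlation(ref, genotypes)
                  for ref_id, ref in wgs.items()}
        # Drop undefined (NaN) correlations before picking the maximum.
        valid = {k: v for k, v in scores.items() if v == v}
        matches[sample_id] = max(valid, key=valid.get) if valid else None
    return matches


def mislabelled(wgs, assay):
    """Return assay samples whose best WGS match has a different label."""
    return sorted(s for s, best in best_matches(wgs, assay).items()
                  if best != s)


# Hypothetical toy data: alternate-allele counts per SNP, None = no call.
wgs = {
    "P1": {"rs1": 0, "rs2": 1, "rs3": 2, "rs4": 0, "rs5": 1},
    "P2": {"rs1": 2, "rs2": 0, "rs3": 0, "rs4": 1, "rs5": 2},
}
rnaseq = {
    "P1": {"rs1": 0, "rs2": 1, "rs3": 2, "rs4": None, "rs5": 1},
    "P2": {"rs1": 2, "rs2": 0, "rs3": 1, "rs4": 1, "rs5": 2},
}

print(mislabelled(wgs, rnaseq))
```

In a real dataset you would use many more SNPs, and you would probably also inspect the correlation values themselves rather than only the best match, since a low correlation with the correctly labelled sample can point to a data quality problem rather than a sample swap.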
If you would like to learn more about the project and the multi-omics dataset we generated, here are the links to the two manuscripts:
Personal Genome Project UK (PGP-UK): a research and citizen science hybrid project in support of personalised medicine (BMC Medical Genomics, November 2018): https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-018-0423-1
The Personal Genome Project-UK: an open access resource of human multi-omics data (bioRxiv, March 2019): https://www.biorxiv.org/content/10.1101/566711v1