Big data handled with new BC|TILING module
We are proud to announce a new module for our BC|GENOME system for massively parallel computation. It removes the bottlenecks involved in handling massive amounts of data.
The BC|TILING module uses a highly compressed data format suitable for parallel computation and can be added to the existing platform to drastically increase data handling efficiency and reduce storage space requirements. The module stores genetic variant data derived from imputation and sequencing. The name “tiling” refers to the way the genotype data matrix is indexed and stored as compressed “tiles” of fixed size, enabling scalable, distributed data storage and analysis.
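The tiling idea can be sketched as follows. This is an illustration only: the tile size, zlib compression and in-memory layout below are invented for the example and are not BC|TILING’s actual on-disk format.

```python
import zlib
import numpy as np

TILE_SUBJECTS = 1024   # hypothetical tile height (subjects per tile)
TILE_MARKERS = 1024    # hypothetical tile width (markers per tile)

def tile_key(subject_idx, marker_idx):
    """Map a (subject, marker) cell to the fixed-size tile that holds it."""
    return (subject_idx // TILE_SUBJECTS, marker_idx // TILE_MARKERS)

def compress_tile(genotypes):
    """Compress one tile of the genotype matrix (values 0/1/2) to bytes."""
    return zlib.compress(genotypes.astype(np.uint8).tobytes())

def decompress_tile(blob, shape):
    """Restore a tile from its compressed byte representation."""
    return np.frombuffer(zlib.decompress(blob), dtype=np.uint8).reshape(shape)

# Store a toy 2048 x 2048 genotype matrix as a dictionary of compressed tiles.
rng = np.random.default_rng(0)
matrix = rng.integers(0, 3, size=(2048, 2048), dtype=np.uint8)
tiles = {}
for i in range(0, 2048, TILE_SUBJECTS):
    for j in range(0, 2048, TILE_MARKERS):
        tiles[(i // TILE_SUBJECTS, j // TILE_MARKERS)] = compress_tile(
            matrix[i:i + TILE_SUBJECTS, j:j + TILE_MARKERS])

# Retrieving one genotype touches only a single small tile, not the whole file.
ti, tj = tile_key(1500, 700)
tile = decompress_tile(tiles[(ti, tj)], (TILE_SUBJECTS, TILE_MARKERS))
assert tile[1500 % TILE_SUBJECTS, 700 % TILE_MARKERS] == matrix[1500, 700]
```

Because each tile is an independent unit, tiles can be spread across a distributed file system and decompressed in parallel on separate nodes, which is what makes the scheme attractive for cluster computation.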
BC|TILING has been designed to work in the most extensive studies, with a target scale of 1,000,000 subjects x 100,000,000 markers, but it has no upper or lower limits. The module increases data handling efficiency and reduces required storage space even for smaller datasets, and requires no investment in new hardware.
With this new system, handling big data becomes faster than ever before. The ideal hardware is an HPC cluster with a large distributed file system that can serve as permanent storage for genotype data. BC|TILING also allows you to start with smaller storage and computation capacity and bring new capacity online as data volumes grow – performance scales linearly as new computation units are added.
BC|TILING removes bottlenecks from data import and retrieval, pre-processing, and the creation of analysis input. Pre-processing data for distributed analysis is now lightweight and fast. Data for selected markers or subjects can be retrieved and exported quickly in both marker-major formats (e.g. VCF, Oxford .gen, PLINK binary) and subject-major formats (e.g. PLINK ped, Merlin, Linkage, MaCH).
This new module was developed to increase productivity in data analysis and to keep analysis workflows scalable as data volumes grow. We are committed to helping our customers achieve faster results, and we want to make their lives easier by eliminating bottlenecks and manual data management.
Table 1: BC|TILING storage space savings. The most relevant comparison is between SQL storage (BC|GENE), compressed data storage using BCD (also used in BC|SNPmax), and tiled storage.
|Subjects||Variants/subject||Subjects x variants||SQL||BC|GENOME (BCD)||BC|TILING (dosage)||BC|TILING (VCF*)|
|10,000||100 M||1,000 B||88,000 GB||3,500 GB||375 GB||127 GB|
|100,000||100 M||10,000 B||880,000 GB||35,000 GB||3,750 GB||1,269 GB|
|1,000,000||100 M||100,000 B||8,800,000 GB||350,000 GB||37,500 GB||12,690 GB|
*This figure is for a VCF file with genotypes only. Storing additional information increases disk consumption.
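As a back-of-the-envelope check of Table 1 (using the 10,000-subject row and, for round numbers, 1 GB = 10⁹ bytes), the implied storage cost per genotype works out roughly to 88 bytes for SQL, 3.5 bytes for BCD, and well under a byte for the tiled formats:

```python
# Bytes per stored genotype implied by Table 1 (10,000 subjects x 100 M
# variants = 1e12 genotypes); 1 GB = 1e9 bytes here for simplicity.
genotypes = 10_000 * 100_000_000          # 1e12 matrix cells
storage_gb = {"SQL": 88_000, "BCD": 3_500,
              "tiling dosage": 375, "tiling VCF": 127}
for fmt, gb in storage_gb.items():
    print(f"{fmt}: {gb * 1e9 / genotypes:.3f} bytes/genotype")
```

The tiled dosage figure of 0.375 bytes corresponds to about 3 bits per genotype, which is consistent with a compact dosage encoding plus compression.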
In addition to substantial storage savings, BC|TILING enables a highly efficient parallel data analysis process. In practice this means that a larger number of calculation nodes can be used for data analysis, decreasing processing time. The following table illustrates the performance improvements for a small dataset that can be processed using all methods. For larger datasets, the performance benefit of BC|TILING becomes even more substantial.
Table 2: BC|TILING performance (PLINK association analysis using a single server with 4 CPU cores and a large external calculation resource). Time includes segmentation, file format conversion, PLINK analysis and result collection.
|Dataset size (genotypes/variants)||SQL||BC|GENOME||BC|TILING|
|2.2 B||2 hours 45 min||7 min||1 min 30 sec|
|22 B||N/A||N/A||8 min 50 sec|
|222 B||N/A||N/A||21 min 30 sec|
No meaning – no gain: Interpretation is the key
In September this year, Genomics England introduced PanelApp, which helps experts comb through the 100,000 Genomes Project’s data for rare diseases. This is a crowdsourcing project where the whole scientific community is invited to take part in interpreting and reviewing variants. The aim is to produce diagnostic-quality gene panels. PanelApp is a very ambitious project, I would say, but definitely the way to go, as no other formal interpretation systems yet exist anywhere.
There have been other attempts at collective interpretation, with various levels of expected contribution and in many different formats. One is the Cafe Variome portal, brought to light in the Gen2Phen project some years back. The idea here is to collect variants and phenotypic evidence from various contributing resources, with an easy API for submissions, and to make remote repositories available for discovery. It is a neat solution, and it keeps growing its community of collaborators, partnership networks, and other experts to add more value to the content.
The LOVD project provides local tools for locus-specific projects that allow experts to choose what information they want to expose outside the project, whilst keeping the data in-house. A hub connects these distributed LOVD instances and allows discovery platforms (like Cafe Variome) to tap into the data the LOVDs expose. The toolbox also doubles as an internal information centre.
In both of these concepts the value is in the expert interpretation, not in the data sharing itself. The caveat is that using them requires some effort (you need to install the tool, or reformat your data to fit the communication layer), and this is a deterrent. Things need to be made very, very easy for us to use, unless we have the luxury of an IT team at our command. The other deterrent, unfortunately, is the ultimate sharing of the data itself.
Crowdsourcing interpretation work at Genomics England is a wonderful idea. The data is already there, and people are invited to show their expertise. Giving your expert opinion is rather easy and serves a large community, so the requirements for ease of use, motivation, and working on ‘somebody else’s data’ are all met. I believe PanelApp has the potential makings of a true discovery platform.
Still, the problems persist with isolated data collections, and there is no easy solution for those. Maybe only the big, centralised efforts have a chance of becoming truly meaningful and of propelling medical advancement. Or maybe someone, somewhere, will come up with another kind of idea and manage to connect the scattered dots into an interpretation app that tells us what is going on with the genes.
It is not necessary to own or control massive server pits and cellars of machinery to be at the top of the genomic data food chain. Possessing the information is one thing – getting value out of it is a completely different story. Companies like Congenica and Omicia (recent additions to the Genomics England family) are increasingly interesting for their ability to provide fast and automated services. But even they depend on the larger scheme of building a genomic consensus out of the complex space of evidence and knowledge still hidden on experts’ hard drives.
The skill of following what your bioinformaticians are up to
I very recently read an interesting review in Nature’s BoneKEy Reports about GWAS and NGS analyses (Alonso et al. 2015). As a non-bioinformatician working for a bioinformatics company, I have learned by doing and listening. Still, there are many aspects and fine points that I probably fail to grasp in all but the most simplistic GWAS workflows. Therefore, a review addressing the standardisation of the GWAS workflow and QC of data was very welcome to me.
If you are in a managerial position supervising bioinformatics professionals but only know the ‘lingo’ at the level of an enthusiastic amateur, a paper like this should be very useful to you. This one takes the slant of comparing GWAS with NGS-based association analysis, which I find very topical, and walks you through every single step from initial QC to visualisation options.
Once you have walked through the recommended standard workflow in the paper, you will begin to understand why your people are hyping new ways to speed up PCA, and why they can’t agree on MAF thresholds (and what that even means).
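For the enthusiastic amateur: the minor allele frequency (MAF) of a marker is simply the frequency of the rarer of its two alleles, and markers below a chosen MAF threshold are typically filtered out before association analysis. A minimal sketch, with an illustrative 1% cut-off and made-up marker names:

```python
def minor_allele_frequency(genotypes):
    """Genotypes coded as 0/1/2 copies of the alternate allele (diploid).
    Returns the frequency of the rarer allele at this marker."""
    alt = sum(genotypes)            # alternate-allele count
    total = 2 * len(genotypes)      # two alleles per subject
    freq = alt / total
    return min(freq, 1 - freq)

# Filter out markers below a 1% MAF threshold (a common, but debated, cut-off).
markers = {"rs1": [0, 1, 2, 1, 0],      # MAF 0.4 -> kept
           "rs2": [0, 0, 0, 0, 0]}      # monomorphic, MAF 0 -> dropped
kept = {name: g for name, g in markers.items()
        if minor_allele_frequency(g) >= 0.01}
```

The debates the paper alludes to are precisely about where that cut-off should sit: too strict and you throw away rare variants of interest, too lenient and poorly genotyped markers flood the results.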
There is a clear lack of standards and recommendations for QC and association analysis of NGS data, which I initially found quite surprising, as the technology was adopted very rapidly by people dealing with GWAS a wee while ago. However, the improvements in sequencing methods have made QC of NGS data a rapidly moving target for standardisation. A simple lack of understanding of how the human genome works has caused major debates about how variants should be interpreted, which directly affects how one should control for anomalies prior to association analysis. And so on.
It is also important to realise the fundamental differences between SNP arrays and NGS – mainly the vast volume of the latter swamping signals still detectable by the former. However, NGS-based association potentially has so much more power and sensitivity than the SNP approach that, despite these issues, the community should and will persevere: NGS is here to stay, and will one day be the de facto data source for association analyses.
Nerea Alonso, Gavin Lucas & Pirro Hysi: Big data challenges in bone research: genome-wide association studies and next-generation sequencing. BoneKEy Reports (2015) 4, Article number 635.
What we still don’t know after GWAS
I had the pleasure of attending a lecture by Professor Struan Grant, a Scotsman on tenure in Philadelphia. His topic was almost the same as the title of this blog entry, and as good talks do, it left me thinking.
After a period of awesome adventures in the world of gigantic data sets and massive computer racks, it has become clear that crunching mega-GWAS signals has left scientists largely scratching their heads. For example, type 2 diabetes – the rapidly rising number one health threat – is eating away healthcare resources worldwide, yet numerous and extensive GWAS projects have revealed only a handful of signals in the human genome. We have some candidate genes, yes, but are they even near the real culprits? What is the actual genetic, and the upstream cellular, mechanism that sets type 2 diabetes off?
Struan had his own thoughts about this, and he emphatically announced himself as a wet-lab enthusiast who is now ‘ex-GWAS’. For a man who started his type 2 diabetes journey in the offices of deCODE, this is a major change in ideology (mind you, even then his successes came from linkage peaks rather than SNPs). In his opinion, there are a few issues with massive studies focusing on GWAS signals.
First of all, are clinicians sure they have patients with the targeted disease? It turns out there is a thing called type 1.5 diabetes, or LADA to friends. This is a latent form of type 1 diabetes that hits younger, lean and otherwise healthy adults, rather than people with the typical onset age of 45 years or more. Struan thinks that the type 2 diabetes cases in many studies are in fact ‘contaminated’ by LADA cases, bringing in signals from genes belonging to a totally different kind of disease. LADA patients could make up as much as 10% of these cohorts. Moreover, as the disease is often treated as type 2, the medication ends up destroying the patient’s own insulin production – a mistreatment easily avoided with an antibody test.
This is just an example. When you study a common disease with complex genetics, you probably have a very complex phenotype for the disease as well. If a signal pops up in a GWAS dataset pointing to a mystery pathway and generally not making much sense, one should check the specificity of the diagnostic criteria. With massive GWAS cohorts the chances are that possible ‘contaminants’ become part of the screening panel and send many a PhD student on a wild goose chase.
The second problem with focusing on GWAS has been the two-dimensional thinking associated with it. We all remember from biology class that the double helix of DNA comes as mighty long and complexly folded chromosomes. Homing in on the causal SNP might still leave you far away from the actual point of effect in the genome, a.k.a. your gene of interest. Once a project becomes saturated with possible SNPs to follow, the nearest gene upstream may not be the target. Even if the SNP sits dead bang in the middle of an exon, being as non-synonymous as anything, one still cannot know whether it is the culprit, or whether it is part of a complex DNA-binding regulatory element with a buddy or two, suppressing something 20 Mb away.
Bashing GWAS as a technology like this is totally unfair and unbalanced of me, but it is sometimes fun to point out the obvious. Of course everyone in the field knows these things and is prepared to answer the trick questions from sly reviewers. However, one important consequence of all this is a change in funding principles. The sheer number of genotyped patients in a GWAS project no longer impresses public purses like the NIH. Clinical research is not yet completely saturated with SNPs for common ailments, but it is getting there. Maybe someone should have a look at the diagnostic criteria in some of them and give an educated opinion on their specificity. Many researchers also believe that, for the most-studied common diseases, it is time to get back into the wet lab and give the bioinformaticians new, more three-dimensional challenges to pursue within the chromosomal folds.
(At the time of writing, Struan’s recent paper had not yet appeared. Here it is for those who want to read the story: Q. Xia, S. Deliard, C.X. Yuan, M.E. Johnson and S.F.A. Grant: Characterization of the transcriptional machinery bound across the widely presumed type 2 diabetes causal variant, rs7903146, within TCF7L2. European Journal of Human Genetics, epub ahead of print, March 2014.)
Mediterranean Genes in Casablanca
There is kerfuffle in the breakfast room. Conquering a table is challenging, getting a cup of coffee is another battle. After that everything’s good, and I’m very happy with my breakfast. My attention is drawn to the magnificent pile of Moroccan cakes, sweets, and pastries in the middle of the large room.
The MEDIGENE general assembly meeting is about to start in Casablanca, Morocco. MEDIGENE is an EU-funded FP7 consortium that started work in 2012. It is already a mature project; everyone knows each other and the momentum is good. Everyone appears to speak French – sauf pour moi (except for me). FP7 projects are well known for over-elaborate acronyms, but this one is easy: Mediterranean Genes. The consortium sets out to investigate the epidemiology of metabolic syndrome (MetS), obesity, and diabetes in the migrant populations around the Mediterranean. It is a massive undertaking, and the number of associated interest groups is high.
Everyone is getting fatter these days. Lifestyle changes leading to increased calorie intake and decreased activity are playing havoc with the health of the global population. The numbers are frightening, and diabetes is estimated to be the number one money hole in most economies by 2020. MetS is an indicative collection of risk factors that multiplies a patient’s chances of diabetes and cardiovascular complications. These factors include a large waistline, bad lipid and cholesterol values, high blood pressure, and high fasting blood sugar (MetS has many definitions; this is just one of them).
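As a sketch of how such a definition operates in practice, here is a toy classifier loosely following the common ‘any 3 of 5’ harmonised rule. The thresholds below are illustrative only; real definitions use sex- and ethnicity-specific cut-offs, which is part of what makes MEDIGENE’s questions interesting.

```python
def metabolic_syndrome(waist_cm, triglycerides_mmol, hdl_mmol,
                       systolic_bp, fasting_glucose_mmol):
    """Count MetS risk factors using illustrative 'any 3 of 5' thresholds.
    Cut-offs here are rough examples, not a clinical definition."""
    risk_factors = [
        waist_cm >= 94,               # central obesity
        triglycerides_mmol >= 1.7,    # raised triglycerides
        hdl_mmol < 1.0,               # reduced HDL cholesterol
        systolic_bp >= 130,           # raised blood pressure
        fasting_glucose_mmol >= 5.6,  # raised fasting glucose
    ]
    return sum(risk_factors) >= 3

# Three of five factors present -> classified as MetS.
print(metabolic_syndrome(102, 2.1, 0.9, 128, 5.2))  # True
```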
The central hypothesis driving MEDIGENE is that different ethnic groups and populations express different sets of these risk factors with differing intensities, and that the genetics behind them varies significantly between genders and from one ethnic group to another. Lifestyle and dietary habits may differ between immigrants and people back home. Are some populations more prone to MetS due to dietary changes, and are there groups where genes, not diet, play the major role? How should local doctors take into account the ethnic background of a patient suffering from MetS?
These are very complex and difficult questions. We need to analyse the dietary and demographic information carefully alongside the genetic makeup of the different populations, comparing these with individuals with health problems. The ancestry of people needs to be tracked down, and we need to know where the different European and North African ethnic groups originated, hundreds to thousands of years ago.
In other words: heaps of fun. The project has already revealed some candidate genes and markers, and our Russian bioinformaticians are crunching away with their principal component analyses, producing some impressive population graphs. The family structures of North African populations make for very interesting Y-chromosome and mitochondrial maps. We even have anthropologists comparing ancient Roman DNA with the genetic makeup of today’s populations, and making educated guesses about the lifestyle of these people from the old bones.
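The principal component analysis behind those population graphs can be sketched roughly as follows. This is a generic genotype-PCA recipe on simulated data, not our actual pipeline: genotypes are coded 0/1/2, each marker is centred and scaled, and the subject coordinates come from an SVD.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """PCA of a subjects x markers genotype matrix (0/1/2 coding).
    Markers are centred and scaled before SVD, as in standard GWAS PCA."""
    g = genotypes.astype(float)
    g -= g.mean(axis=0)                      # centre each marker
    sd = g.std(axis=0)
    g[:, sd > 0] /= sd[sd > 0]               # scale polymorphic markers only
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_components] * s[:n_components]  # subject coordinates

# Two simulated "populations" with different allele frequencies
# separate cleanly along the first principal component.
rng = np.random.default_rng(1)
pop_a = rng.binomial(2, 0.1, size=(50, 200))
pop_b = rng.binomial(2, 0.9, size=(50, 200))
coords = genotype_pca(np.vstack([pop_a, pop_b]))
```

Plotting the first two columns of `coords` gives exactly the kind of population scatter graph described above, with each cluster corresponding to one ancestry group.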
The motivation to pin down population-specific risk factors for diabetes, stroke, and heart disease is very high, and the tedious, bureaucratic sample collection continues. Sharing samples internationally requires specific ethical rules imposed by the coordinating institution in France, as coordinator Professor Florin Grigorescu explains. This causes time-consuming bureaucracy among the partners and has already taken two years out of the four available. Despite the tough going, we know that in the end this collection will be a tremendously valuable resource showing the spectrum of immigrant populations within Europe.
The meeting in Casablanca has presented everyone with an impressive list of accomplishments, new ideas and people, work ready for publication, new tools for creating international data collections and, all in all, a relaxed spirit and confidence that the project will be a huge success. It is stimulating to see such an international bunch of exceedingly clever people chatting away, putting their heads together and solving problems – like which dietary database is best, and where to have dinner tonight.
Which brings me back to the pile of Moroccan delicacies in the centre of the table; all this talk of MetS has put me off them and I’m trying to lose some weight.
Tags: EU FP7