The skill of following what your bioinformaticians are up to

March 12, 2015 By Anni Ahonen-Bishopp, Director of R&D

I very recently read an interesting review in BoneKEy Reports about GWAS and NGS analyses (Alonso et al. 2015 [1]). As a non-bioinformatician working for a bioinformatics company, I have learned by doing and listening. Still, there are many aspects and fine points that I probably fail to grasp in all but the most simplistic GWAS workflows. Therefore, a review addressing the standardisation of the GWAS workflow and the QC of data was very welcome to me.

If you are in a managerial position supervising bioinformatics professionals, but only know the ‘lingo’ at the level of an enthusiastic amateur, a paper like this should be very useful to you. This one takes the slant of comparing GWAS with NGS-based association analysis, which I find very topical, and walks you through every single step from initial QC to visualisation options.

Once you have walked through the recommended standard workflow in the paper, you will begin to understand why your people are hyped about new ways to speed up PCA, and why they can’t agree on MAF thresholds (and what that even means).
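For fellow non-bioinformaticians, here is a minimal sketch of what a MAF (minor allele frequency) threshold does. It is not taken from the review: the function names, the toy genotypes and the cut-off values are all made up for illustration, and genotypes are assumed to be coded as the number of alternate alleles (0, 1 or 2) per person.

```python
def minor_allele_frequency(genotypes):
    """Return the frequency of the rarer allele at one variant.

    genotypes: alternate-allele counts (0, 1 or 2) per individual,
    with None marking a missing call. This coding is an assumption
    made for the sketch.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    total_alleles = 2 * len(called)  # two alleles per diploid individual
    alt_alleles = sum(called)
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_alleles, total_alleles - alt_alleles) / total_alleles


def filter_by_maf(variants, threshold):
    """Keep only the variants whose MAF is at or above the threshold."""
    return {
        name: genotypes
        for name, genotypes in variants.items()
        if minor_allele_frequency(genotypes) >= threshold
    }


# Toy data: ten individuals, three hypothetical variants.
variants = {
    "snp_common": [0, 1, 2, 1, 0, 1, 0, 2, 1, 0],  # 8 of 20 alleles
    "snp_rare": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],    # 1 of 20 alleles
    "snp_missing": [None, 0, 1, 0, None, 0, 0, 0, 0, 0],
}

strict = filter_by_maf(variants, threshold=0.10)
lenient = filter_by_maf(variants, threshold=0.05)
print(sorted(strict))
print(sorted(lenient))
```

The same data set loses or keeps variants depending on the chosen cut-off, which is one reason the choice of threshold is argued over: a strict threshold removes rare variants, while a lenient one keeps variants with very few observations behind them.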

There is a clear lack of standards and recommendations for QC and association analysis of NGS data, which I initially found quite surprising, as the technology was very rapidly adopted by people dealing with GWAS a wee while ago. However, the improvements in sequencing methods have definitely made QC of NGS data a rapidly moving target for standardisation. A simple lack of understanding of how the human genome works has caused major debates about how the variants should be interpreted, which directly affects how one should control for anomalies prior to association analysis. And so on.

It is also important to realise the fundamental differences between SNP data and NGS data – mainly the vast volume of the latter swamping any reasonable signals that are still detectable with the former. However, NGS in association analysis has potentially so much more power and sensitivity than the SNP approach that, despite these issues, the community should and will persevere with the idea that NGS is here to stay, and will one future day be the de facto data source for association analyses.

[1] Nerea Alonso, Gavin Lucas & Pirro Hysi. Big data challenges in bone research: genome-wide association studies and next-generation sequencing. BoneKEy Reports 4, Article number: 635 (2015)


What we still don’t know after GWAS

January 20, 2015 By Anni Ahonen-Bishopp, Director of R&D

I had the pleasure of attending a lecture by Professor Struan Grant, a Scotsman on tenure in Philadelphia. His topic was almost the same as the title of this blog entry, and as good talks do, it left me thinking.

After a period of awesome adventures in the world of gigantic data sets and massive computer racks it has become clear that the crunching of mega GWAS signals has left scientists largely scratching their heads. For example, the rapidly rising number one health threat – type 2 diabetes – is eating away at healthcare resources worldwide, yet numerous and extensive GWAS projects have revealed only a handful of signals in the human genome. We have some candidate genes, yes, but are they even near the real culprits? What is the actual genetic mechanism, and the upstream cellular mechanism, causing type 2 diabetes to kick off?

Struan had his own thoughts about it, and he emphatically announced himself as a wet lab enthusiast who is now ‘ex-GWAS’. For a man who started his type 2 diabetes journey in the offices of deCODE, this is a major change in ideology (mind you, even then he had his success with linkage peaks, rather than SNPs). In his opinion, there are a few issues with massive studies focusing on GWAS signals.

First of all, are clinicians sure they have patients with the targeted disease? It turns out there’s a thing called type 1.5 diabetes, or LADA to its friends. This is the latent form of type 1 diabetes, hitting younger, lean and otherwise healthy adults, rather than people with the typical onset age of 45 years or more. Struan thinks that the type 2 diabetes cases in many studies are in fact ‘contaminated’ by LADA cases, bringing in signals from genes belonging to a totally different kind of disease. The proportion of LADA patients could be as high as 10% in these cohorts. Moreover, as this disease is often treated as type 2, the medication ends up destroying the patient’s own insulin production – a mistreatment easily avoided with an antibody test.

This is just an example. When you study a common disease with complex genomics, you probably have a very complex phenotype for the disease as well. If a signal pops up in a GWAS dataset pointing to a mystery pathway, and generally not making too much sense, one should check the specificity of the diagnostic criteria. With massive GWAS cohorts the chances are that possible ‘contaminants’ become part of the screening panel, and send many a PhD student on a wild goose chase.
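To see how a ‘contaminant’ can create a signal out of nothing, here is a back-of-the-envelope sketch. It is my own illustration, not something from Struan’s talk: the allele frequencies are invented, only the 10% share of misdiagnosed cases echoes the figure mentioned above, and the function names are hypothetical.

```python
def observed_case_frequency(freq_true_cases, freq_contaminant, contaminant_fraction):
    """Allele frequency seen in a case group that is a mixture of two diseases."""
    return ((1 - contaminant_fraction) * freq_true_cases
            + contaminant_fraction * freq_contaminant)


def allelic_odds_ratio(case_freq, control_freq):
    """Odds of carrying the allele in cases relative to controls."""
    case_odds = case_freq / (1 - case_freq)
    control_odds = control_freq / (1 - control_freq)
    return case_odds / control_odds


# Invented numbers: the variant has no effect at all on the studied
# disease, so true cases and controls share the same allele frequency.
control_freq = 0.20
true_case_freq = 0.20
# The same variant is assumed to be enriched in the misdiagnosed group.
contaminant_freq = 0.35
contaminant_fraction = 0.10

case_freq = observed_case_frequency(true_case_freq, contaminant_freq, contaminant_fraction)
odds_ratio = allelic_odds_ratio(case_freq, control_freq)
print(f"observed case allele frequency: {case_freq:.3f}")
print(f"apparent odds ratio: {odds_ratio:.2f}")
```

With these made-up inputs the apparent odds ratio comes out at about 1.10 even though the variant does nothing for the disease under study. In a cohort of tens of thousands, an effect of that size can look like a genuine hit.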

The second problem with focusing on GWAS has been the two-dimensional thinking associated with it. We all remember from biology classes that the double helix of DNA comes as mighty long and complexly folded chromosomes. Homing in on the causal SNP might still take you far away from the actual point of effect in the genome, aka your gene of interest. After a project becomes saturated with possible SNPs to follow, the nearest gene upstream may not be the target. Even if the SNP sits bang in the middle of an exon, being as non-synonymous as anything, one still cannot know if that’s the culprit, or if it is part of a complex DNA-binding regulatory element with a buddy or two suppressing something 20 Mb away.

Bashing GWAS as a technology like this is totally unfair and unbalanced of me, but it is sometimes fun to point out the obvious. Of course everyone in the field knows these things and is prepared to answer the trick questions from sly reviewers. However, one important factor that follows from all this is the change in funding principles. The sheer number of genotyped patients in a GWAS project no longer impresses public purses like the NIH. Clinical research is not yet completely saturated with SNPs for common ailments, but it is getting there. Maybe someone should have a look at the diagnostic criteria in some of these studies, and give an educated opinion about their specificity. Many researchers also believe that when it comes to the most studied common diseases, it is time to get back in the wet lab, and give the bioinformaticians new, more three-dimensional challenges to pursue within the chromosomal folds.

(At the time of writing Struan’s recent paper had not yet appeared. Here it is for those who want to read the story: Q. Xia, S. Deliard, C.X. Yuan, M.E. Johnson and S.F.A. Grant: Characterization of the transcriptional machinery bound across the widely presumed type 2 diabetes causal variant, rs7903146, within TCF7L2. European Journal of Human Genetics Epub ahead of print, March 2014.)

