Big data handled with new BC|TILING module

December 4, 2015 By Anni Ahonen-Bishopp, Director of R&D

We are proud to announce a new module for our BC|GENOME system that enables massively parallel computation and removes bottlenecks from handling massive amounts of data.

The BC|TILING module uses a highly compressed data format suitable for parallel computation and can be added to the existing platform to drastically increase data handling efficiency and reduce storage space requirements. The module stores genetic variant data derived from imputation and sequencing. The name “tiling” indicates the way the genotype data matrix is indexed and stored as compressed “tiles” of fixed size, enabling scalable, distributed data storage and analysis.
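To make the idea concrete, here is a minimal sketch of fixed-size tiling of a genotype matrix. The tile dimensions, the one-byte-per-genotype encoding, and the use of zlib are assumptions chosen for illustration; BC|TILING's actual on-disk format is not described in detail here.

```python
import zlib

TILE_SUBJECTS = 2   # subjects (rows) per tile; assumed, tiny for the demo
TILE_MARKERS = 3    # markers (columns) per tile; assumed

def tile_key(subject_idx, marker_idx):
    """Map a (subject, marker) cell to the coordinates of the tile holding it."""
    return (subject_idx // TILE_SUBJECTS, marker_idx // TILE_MARKERS)

def tile_store(genotypes):
    """Split a genotype matrix (a list of rows) into compressed fixed-size tiles."""
    store = {}
    n_subjects, n_markers = len(genotypes), len(genotypes[0])
    for r in range(0, n_subjects, TILE_SUBJECTS):
        for c in range(0, n_markers, TILE_MARKERS):
            tile = [row[c:c + TILE_MARKERS] for row in genotypes[r:r + TILE_SUBJECTS]]
            payload = bytes(g for row in tile for g in row)  # 1 byte per genotype
            store[(r // TILE_SUBJECTS, c // TILE_MARKERS)] = zlib.compress(payload)
    return store
```

Because each tile is indexed by its grid coordinates, a query for a range of subjects or markers only needs to decompress the tiles that overlap the requested range, which is what makes the scheme amenable to distributed storage and analysis.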

BC|TILING has been designed to work in the most extensive studies, with a target of 1,000,000 subjects x 100,000,000 markers, but without any upper or lower limits. The module increases data handling efficiency and reduces required storage space even with smaller amounts of data, and does not require any investment in new hardware.

With this new system, handling of big data becomes faster than ever before. The ideal hardware is an HPC cluster with a large distributed file system that can be used as permanent storage for genotype data. BC|TILING also allows starting with smaller storage and calculation capacity and utilizing new capacity as data amounts grow; performance scales linearly as new computation units are added.

BC|TILING removes bottlenecks from data import, retrieval, pre-processing, and the creation of analysis input. Pre-processing data for distributed analysis is now lightweight and fast. Data for selected markers or subjects can be retrieved and exported quickly in both marker-major formats (e.g. VCF, Oxford .gen, PLINK binary) and subject-major formats (e.g. PLINK .ped, Merlin, Linkage, MaCH).
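The distinction between the two export orientations is simply whether the genotype matrix is emitted one marker at a time or one subject at a time. A toy sketch (function names are illustrative, not BC|TILING's actual API):

```python
# Toy genotype matrix: rows are subjects, columns are markers.
genotypes = [[0, 1, 2],
             [1, 1, 0]]

def export_marker_major(g):
    """One record per marker, as in VCF, Oxford .gen, or PLINK binary."""
    n_markers = len(g[0])
    return [[row[m] for row in g] for m in range(n_markers)]

def export_subject_major(g):
    """One record per subject, as in PLINK .ped, Merlin, Linkage, or MaCH."""
    return [list(row) for row in g]
```

Supporting both orientations efficiently is the hard part at scale: naively transposing a subjects x markers matrix of billions of cells is expensive, whereas a tiled layout lets either orientation be assembled from the relevant tiles.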

This new module was developed to increase productivity in data analysis and to ensure that data analysis workflows scale with ever-growing data amounts. We are committed to making sure that our customers achieve faster results, and we want to make their lives easier by eliminating bottlenecks and manual data management.

Table 1: BC|TILING storage space savings. The most relevant comparison is between SQL storage (BC|GENE), compressed data storage using BCD (also used in BC|SNPmax), and tiled storage.

Subjects    Variants/subject  Subjects x SNPs  SQL           BC|GENOME (BCD)  BC|TILING (dosage)  BC|TILING (VCF*)
10,000      100 M             1,000 B          88,000 GB     3,500 GB         375 GB              127 GB
100,000     100 M             10,000 B         880,000 GB    35,000 GB        3,750 GB            1,269 GB
1,000,000   100 M             100,000 B        8,800,000 GB  350,000 GB       37,500 GB           12,690 GB

*This figure is for a VCF file containing genotypes only; storing additional information increases disk consumption.

In addition to substantial storage savings, BC|TILING facilitates a highly efficient parallel data analysis process. In practice this means that a larger number of calculation nodes can be used for data analysis, decreasing processing time. The following table illustrates the performance gains for a small dataset that can be processed with all methods; for larger datasets, the performance benefit of BC|TILING becomes even more substantial.

Table 2: BC|TILING performance (PLINK association analysis using a single server with 4 CPU cores and a large external calculation resource). Times include segmentation, file format conversion, PLINK analysis, and result collection.

Dataset size (genotypes)  SQL          BC|GENOME  BC|TILING
2.2 B                     2 h 45 min   7 min      1 min 30 sec
22 B                      N/A          N/A        8 min 50 sec
222 B                     N/A          N/A        21 min 30 sec

The skill of following what your bioinformaticians are up to

March 12, 2015 By Anni Ahonen-Bishopp, Director of R&D

I very recently read an interesting review in BoneKEy Reports about GWAS and NGS analyses (Alonso et al 2015 [1]). As a non-bioinformatician working for a bioinformatics company, I have learned by doing and listening. Still, there are many aspects and fine points that I probably fail to grasp in all but the most simplistic GWAS workflows. Therefore, a review addressing the standardisation of the GWAS workflow and QC of data was very welcome to me.

If you are in a managerial position supervising bioinformatics professionals, but only know the ‘lingo’ at a level of an enthusiastic amateur, a paper like this should be very useful to you. This one takes the slant of comparing GWAS with NGS-based association analysis, which I find very topical, and walks you through every single step from initial QC to visualisation options.

Once you have walked through the recommended standard workflow in the paper, you will begin to understand why your people are hyped about new ways to speed up PCA, and why they can't agree on MAF thresholds (and what that even means).

There is a clear lack of standards and recommendations for QC and association analysis of NGS data, which I initially found quite surprising, as the technology was adopted very rapidly by people dealing with GWAS a wee while ago. However, the improvements in sequencing methods have definitely made QC of NGS data a rapidly moving target for standardisation. A simple lack of understanding of how the human genome works has caused major debates about how variants should be interpreted, which directly affects how one should control for anomalies prior to association analysis. And so on.

It is also important to realise the fundamental differences between SNP arrays and NGS: mainly the vast volume of the latter, which can swamp reasonable signals that are still detectable with the former. However, NGS-based association potentially offers so much more power and sensitivity than the SNP approach that, despite these issues, the community should and will persevere: NGS is here to stay, and will one day be the de facto data source for association analyses.

[1] Alonso N, Lucas G & Hysi P. Big data challenges in bone research: genome-wide association studies and next-generation sequencing. BoneKEy Reports (2015) 4, Article number 635.

