Part 1: Nature Protocol: Data Quality Control

This part of the protocol details the steps for data quality assessment and control that are typically carried out during case-control association studies. The steps described involve the identification and removal of DNA samples and markers that introduce bias. These steps are critical to the success of a case-control study and are necessary before statistically testing for association. We describe how to use PLINK, a tool for handling SNP data, to assess the failure rate per individual and per SNP and the degree of relatedness between individuals. We will also use the R statistical computing and programming environment to process the results files. These platforms were selected because they are user-friendly, widely used and computationally efficient.

Part 2: Basic Statistical Analysis

The second part of this protocol describes how to perform basic statistical analysis in a population-based genetic association case-control study. The steps described involve (i) appropriate selection of measures of association and relevance of disease models; (ii) appropriate selection of tests of association; (iii) visualization and interpretation of results; and (iv) consideration of appropriate methods to control for multiple testing. We describe how to use PLINK and R, popular tools for handling single-nucleotide polymorphism (SNP) data, to carry out tests of association and to visualize and interpret results. This protocol assumes that data quality assessment and control have been performed as described in Part 1, so that samples and markers with the potential to introduce bias to the study have been identified and removed. Study design, marker selection and quality control of case-control studies have also been discussed in earlier protocols.

About this tutorial

This tutorial was adapted from the pipelines published in these two Nature Protocols publications:

Software Requirements

Installing R

R is open-source and free for all. There are multiple ways to install R.

R IDE (Integrated Development Environment)

  • The default R console that comes with the R project.
  • RStudio, one of the most widely used Integrated Development Environments (IDEs) for R.
  • Revolution Analytics, a commercial company whose products are built on R. It was recently bought by Microsoft, which intends to support the software for the foreseeable future.

PDF Viewer

Make sure that your computer can open PDF documents, for example with Adobe Acrobat.

Tutorial Dataset

We will be working with 2 datasets:

  1. baby-set
  2. Framingham-set


The baby-set is small and is meant for the Quick-Start demonstration of how to use plink.exe and R. The baby-set consists of 2 files that represent the markers for the gene STMND1:

  1. STMND1-gene.ped:
    • Table Rows: 2,215 individuals
    • Table Columns: 6 meta-data columns + 6 genotypes
  2. The accompanying map file:
    • Table Rows: 6 markers


The full GWAS dataset from the Framingham Project

We will be working with 3 files: (see Figure 1)

  1. Framingham.ped:
    • Table Rows: 2,215 individuals
    • Table Columns: 6 meta-data columns + 497,243 genotypes
  2. The accompanying map file:
    • Table Rows: 497,243 markers
  3. Framingham_Asthma.cov:
    • Covariates for the dataset: 10 covariates

File Format

plink.exe uses two specialized file formats to represent the genotypes and markers for all individuals being studied.

The PED file

We will be using the ped file format to store genotype data, which originated some years ago in the LINKAGE package. ped files are text files containing one line per genotyped sample, with fields separated by “white space” (TAB characters or SPACEs). The first six fields contain:

  1. Pedigree or family identifier, unique to the family of which this subject is a member,
  2. Further identifier, unique (within the family) to each family member,
  3. The member identifier of the father of the subject if the father is also present in the data, otherwise an arbitrary code (usually 0),
  4. Similarly, an identifier for the mother of the subject,
  5. The sex of the subject (1= Male, 2= Female), and
  6. A binary trait indicator (1= Absent, 2= Present). PLINK uses this field as the default phenotype for analysis.
  7. Column 7 onward: the 497,243 genotypes/markers (2 columns per genotype; see Figure 1)

Missing values in the sex and trait fields (fields 5 and 6) are usually coded as zero.
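
To make the ped layout concrete, here is a minimal Python sketch that splits one ped line into its six leading fields and its genotype pairs. It is an illustration only and not part of the PLINK or R workflow; the names `PedRecord` and `parse_ped_line` and the sample line are hypothetical.

```python
from typing import NamedTuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: str          # "1" = male, "2" = female, "0" = missing
    phenotype: str    # "1" = absent, "2" = present, "0" = missing
    genotypes: list   # one (allele1, allele2) tuple per marker


def parse_ped_line(line: str) -> PedRecord:
    """Split one whitespace-delimited ped line into its fields."""
    fields = line.split()  # splits on TABs and SPACEs alike
    if len(fields) < 6:
        raise ValueError("a ped line needs at least six leading fields")
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("genotype columns must come in pairs")
    # Pair consecutive allele columns: two columns per marker.
    genotypes = list(zip(alleles[0::2], alleles[1::2]))
    return PedRecord(*fields[:6], genotypes)


# Hypothetical sample line with two markers (not taken from the datasets).
record = parse_ped_line("FAM1 IND1 0 0 1 2 A G C C")
print(record.sex, record.phenotype, record.genotypes)
```

Because each marker occupies two columns, a ped line for 497,243 markers holds 6 + 2 × 497,243 = 994,492 whitespace-separated columns.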

The MAP file

The map file contains chromosome location information for each marker (SNP) being interrogated (see Figure 1).
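
As a rough illustration, the Python sketch below reads one map line. It assumes the standard four-column PLINK map layout (chromosome, marker identifier, genetic distance, base-pair position), which this tutorial does not spell out. The names `MapRecord` and `parse_map_line` and the sample line are hypothetical.

```python
from typing import NamedTuple


class MapRecord(NamedTuple):
    chromosome: str
    marker_id: str
    genetic_distance: float  # in morgans; commonly 0 when unknown
    position_bp: int         # base-pair position on the chromosome


def parse_map_line(line: str) -> MapRecord:
    """Split one whitespace-delimited map line into four typed fields."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError("expected four columns in a map line")
    chrom, marker, dist, pos = fields
    return MapRecord(chrom, marker, float(dist), int(pos))


# Hypothetical sample line (not taken from the tutorial datasets).
marker = parse_map_line("1 rs_example 0 12345")
print(marker)
```

The map file has one row per marker, so its row count should equal the number of genotype pairs on each line of the matching ped file.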

The cov file

The cov file contains additional phenotype information for the disease state being tested (see Figure 1).
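
A covariate file can be read along the same lines. The Python sketch below assumes the usual PLINK covariate layout: a header row beginning with FID and IID, followed by one row per individual. That layout is an assumption about this dataset, and the function name `read_covariates` and the covariate names are hypothetical.

```python
import io


def read_covariates(handle):
    """Return {(family_id, individual_id): {covariate_name: value}}."""
    header = handle.readline().split()
    if header[:2] != ["FID", "IID"]:
        raise ValueError("expected a header starting with FID and IID")
    names = header[2:]
    table = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(header):
            raise ValueError("row length does not match header")
        # Values are kept as strings; convert them as the analysis requires.
        table[(fields[0], fields[1])] = dict(zip(names, fields[2:]))
    return table


# Hypothetical two-covariate example; the tutorial file has 10 covariates.
sample = io.StringIO("FID IID COV1 COV2\nFAM1 IND1 34 1\nFAM1 IND2 61 2\n")
covariates = read_covariates(sample)
print(covariates[("FAM1", "IND2")]["COV1"])
```

Keying each row by the family and individual identifiers mirrors the first two fields of the ped file, which is how covariate rows are matched to genotyped samples.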

File Format Summary Diagram