Lesson Plan

Data Files

There are only 3 VCF files for this analysis module:


These are mini-VCF files containing a compilation of 100 genes for each case.

Sample Description


Genetic Variation

In this analysis, we are only interested in small-scale sequence variation (<1 kbp). There are 2 major types of small-scale variation: substitutions and indels.


A substitution is a point mutation, or single-base modification, that changes a single nucleotide. This type of variant is commonly referred to as a single-nucleotide polymorphism (SNP).


Indel is short for insertion or deletion of bases in the DNA. A microindel is an indel that results in a net change of 1 to 50 nucleotides. An indel in a coding region often causes a frameshift (whenever the net change in length is not a multiple of 3), which can have severe biological consequences.
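The two variant types can be told apart by comparing the length of the reference sequence with the length of the observed sequence. The sketch below is only an illustration: the function names `classify_variant` and `causes_frameshift` are hypothetical, and the frameshift check assumes the variant lies entirely inside a coding region.

```python
def classify_variant(ref: str, alt: str) -> str:
    """Classify one reference/alternate allele pair by comparing lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "substitution"   # single base replaced by another base
    if len(ref) < len(alt):
        return "insertion"      # bases gained relative to the reference
    if len(ref) > len(alt):
        return "deletion"       # bases lost relative to the reference
    return "other"              # same length but longer than one base


def causes_frameshift(ref: str, alt: str) -> bool:
    """Return True when the net length change is not a multiple of 3.

    Assumption: the variant falls inside a coding region.
    """
    return (len(alt) - len(ref)) % 3 != 0


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("A", "ATG"), ("ATGC", "A")]:
        print(ref, alt, classify_variant(ref, alt), causes_frameshift(ref, alt))
```

Note that a 3-base deletion such as ATGC to A is still an indel, but it removes exactly one codon and so does not shift the reading frame.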

Variation Detection using GATK

One of the most widely used variant detection tools is GATK (the Genome Analysis Toolkit), from the Broad Institute.


James’ Analysis Workflow


Capturing Genetic Variation in VCF file

I found these 2 resources for an introduction to the VCF file:

VCF stands for Variant Call Format. It is a standardized text file format for representing the SNPs, indels, and structural variations found when sequencing your samples.

Understanding the VCF file

At First Glance

VCF is a plain-text file. Opening it in any text editor (avoid Microsoft Word, since it may alter the content unintentionally) reveals that a VCF file consists of 2 general regions, meta-data and variant-data:

  • meta-data: This segment contains information used to explain the variant-data contents; each meta-data line starts with ##
  • variant-data: These are the actual variant records; the segment begins with a header row with the following columns:
    • #CHROM: chromosome number
    • POS: starting position (always forward strand)
    • ID: reference information if available, such as the dbSNP info
    • REF: the base in the reference genome (the 2 bald white guys)
    • ALT: the alternative base found in this sample
    • QUAL: The Phred-scaled probability that a REF/ALT polymorphism exists at this site given the sequencing data. Because the Phred scale is -10 * log10(1-p), where p is the probability that the polymorphism exists, a value of 10 indicates a 1 in 10 chance of error, while a value of 100 indicates a 1 in 10^10 chance (see the GATK FAQ article for a detailed explanation). These values can grow very large when a large amount of data is used for variant calling, so QUAL is often not a very useful property for evaluating the quality of a variant call.
    • FILTER: This field contains the name(s) of any filter(s) that the variant fails to pass, or the value PASS if the variant passed all filters. If the FILTER value is ., then no filtering has been applied to the records. It is extremely important to apply appropriate filters before using a variant callset in downstream analysis. See the GATK documentation on filtering variants for more information on this topic.
    • INFO: Various site-level annotations
    • FORMAT: how the genotype and other sample-level information is represented (more on this later)
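The two regions and the header columns described above can be separated with a few lines of Python. This is a minimal sketch and not a full VCF parser: the helper names `read_vcf` and `qual_to_error_probability`, and every value in the embedded example text (positions, QUAL scores, the SAMPLE1 column), are made up for illustration. It assumes an uncompressed, tab-delimited file.

```python
import io

# Made-up miniature VCF content, for illustration only.
VCF_TEXT = (
    "##fileformat=VCFv4.2\n"
    "##source=example\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n"
    "1\t12345\t.\tA\tG\t50\tPASS\tDP=20\tGT\t0/1\n"
    "1\t12400\t.\tCT\tC\t10\t.\tDP=8\tGT\t1/1\n"
)


def read_vcf(handle):
    """Split VCF text into meta-data lines, header columns, and records."""
    meta, header, records = [], None, []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):      # meta-data region
            meta.append(line)
        elif line.startswith("#"):     # header row of the variant-data region
            header = line.lstrip("#").split("\t")
        else:                          # one variant record per line
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records


def qual_to_error_probability(qual: float) -> float:
    """Invert the Phred scale: QUAL = -10 * log10(probability of error)."""
    return 10 ** (-qual / 10)


if __name__ == "__main__":
    meta, header, records = read_vcf(io.StringIO(VCF_TEXT))
    print(len(meta), "meta-data lines;", "columns:", header)
    for rec in records:
        p_err = qual_to_error_probability(float(rec["QUAL"]))
        print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["FILTER"], p_err)
```

To try this on a real file, replace `io.StringIO(VCF_TEXT)` with an opened file handle. For anything beyond a quick look, a dedicated VCF library is a safer choice, since this sketch ignores details such as multiple ALT alleles and a missing QUAL value (.).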