Relationship Inference in KING

KING Tutorial: Relationship Inference

KING is a toolset for exploring genotype data from a genome-wide association study (GWAS) or a sequencing project. KING can be used to check family relationships and flag pedigree errors by estimating kinship coefficients for all pairs of individuals. Unrelated pairs can be well separated from close relatives (up to 3rd-degree), and the kinship coefficient estimates for close relatives are highly accurate. Other applications of KING, such as the identification of population substructure, are not described in detail in this tutorial.

Family relationship inference in KING is very fast (minutes to examine thousands of individuals, in contrast to days or weeks using other software, an over 100- to 1000-fold saving in computational time) and robust to the presence of population structure. The sample size (number of individuals) can be as small as 2 or larger than 100,000; standard relationship inference software cannot handle either scenario. Genotype data do not need to be cleaned (or pruned) at the SNP level (i.e., no SNPs need to be removed) prior to KING relationship inference. However, sample-level QC, such as removing samples with a low call rate, is still necessary, since a single poorly genotyped sample can cause a cluster of inflated relationships.

GENERAL INPUT FILES

The input files include a data file (-d), a pedigree file (-p) and a map file (-m) in MERLIN format, or alternatively a binary format file (-b). The command line will look like this:

  prompt> king -d ex.dat.gz -p ex.ped.gz -m ex.map.gz --kinship
  prompt> king -b ex.bgeno.gz --kinship --related
  prompt> king -b ex.bed --kinship --related

The commands above specify the estimation of all pairwise kinship coefficients within and between families.

KING supports input files in MERLIN format. In addition to recognizing zipped files (file name ends with .gz) and multiple files as in MERLIN, e.g.,

  prompt> king -d ex1.dat.gz,ex2.dat.gz -p ex1.ped.gz,ex2.ped.gz -m ex1.map.gz,ex2.map.gz --kinship

KING also supports a binary format, either the KING binary format (unique to KING) or the better-known PLINK binary format. A binary format compresses genotype data by using two bits to represent a genotype. Examples are:

  prompt> king -b ex.bgeno --kinship
  prompt> king -b ex.bed --kinship
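
To make the two-bit idea concrete, here is a minimal Python sketch that packs four genotypes into each byte. The code table (0, 1 or 2 copies of an allele, plus a missing code) and the bit order are illustrative assumptions, not the actual on-disk layout of the KING or PLINK binary formats.

```python
# Illustrative two-bit genotype packing (hypothetical encoding, not KING's or PLINK's layout).
MISSING = 3  # assumed code for a missing genotype; 0, 1, 2 = copies of the counted allele

def pack_genotypes(genotypes):
    """Pack a sequence of genotype codes (0..3) into bytes, four genotypes per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError(f"genotype code out of range: {g}")
        # Each genotype occupies two bits; position within the byte is i % 4.
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)

def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from the packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]

genotypes = [0, 1, 2, MISSING, 2, 2, 0, 1, 1]
packed = pack_genotypes(genotypes)
print(len(genotypes), "genotypes stored in", len(packed), "bytes")
assert unpack_genotypes(packed, len(genotypes)) == genotypes
```

Compared with a text format that spends several characters per genotype, a layout like this needs a quarter of a byte per genotype, which is consistent with the savings in loading time and disk space described in this section.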

The binary format offers substantial computational savings. With the binary format, the time to load a typical GWAS dataset usually drops from > 30 minutes to a few seconds, and only a fraction of the computer memory and disk space is needed. Binary format data can be generated with the commands:

  prompt> king -d ex.dat -p ex.ped -m ex.map --binary
  prompt> king -d ex.dat -p ex.ped -m ex.map --plink
  prompt> plink --file mydata --make-bed

It is highly recommended to generate a binary format dataset first, before trying the different options implemented in KING. Loading MERLIN format data typically takes much longer than performing the pairwise relationship inference. An example dataset in the KING binary format can be downloaded at this link: ex.bgeno [6.7MB], and this dataset is used throughout the tutorial. The KING binary, PLINK binary and MERLIN formats can be converted to each other in KING:

  prompt> king -b ex.bgeno --merlin
  prompt> king -b ex.bgeno --plink --prefix ex
  prompt> king -b ex.bed --binary --prefix ex

KING relationship inference has been tested and works well with sequencing data. This may give KING a substantial advantage over alternative methods in which rare variants need to be excluded from the inference procedure. A VCF file of sequencing data can be converted into PLINK binary format with a simple shell script such as VCFtobed.bsh, or with VCFtools:

  prompt> vcftools --vcf example.vcf --plink-tped
  prompt> plink --tfile out --make-bed --out ex

RELATIONSHIP INFERENCE

Relationships are checked between each pair of individuals. Two algorithms are available for relationship inference. One algorithm assumes a homogeneous population (through parameter --homo), and the other allows for population structure (through parameter --kinship). Examples are:

  prompt> king -b ex.bed --kinship
  prompt> king -b ex.bed --kinship --related --degree 2
  prompt> king -b ex.bed --kinship --ibs
  prompt> king -b ex.bed --homo

The robust algorithm (--kinship, the default) is highly recommended. In each relationship inference, the output is separate for relationships within families and relationships between families. Note that an unrelated individual is treated as a family of size one. If the dataset consists only of individuals reported as unrelated, all results are saved in the between-family output.

--kinship produces a subset of the results produced by the --kinship --ibs analysis. In addition to the robust kinship estimate, the --kinship --ibs analysis provides summary statistics such as the counts of IBS0, IBS1 and IBS2, the average IBS, and the standard error of the IBS estimate. The option --ibs by itself only summarizes the IBS statistics without calculating the kinship coefficients. Parameter --related --degree 2 specifies that only related pairs (up to the 2nd degree in this case) between families are included in the output. Specifically, all pairs across families with a kinship coefficient less than 0.0884 are excluded from the output.
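
The 0.0884 cutoff used for --degree 2 coincides numerically with 2^-3.5, the geometric mean of the expected kinship of 2nd-degree (1/8) and 3rd-degree (1/16) relatives. The Python sketch below computes that value. Extending the same formula to other degrees is an assumption based on the kinship ranges listed in this tutorial, and the function name is hypothetical.

```python
def kinship_cutoff(degree):
    """Lower kinship bound for relatives of the given degree (0 = duplicate/MZ twin).

    Assumption: the bound is the geometric mean of the expected kinship of
    degree-d relatives, 2**-(d+1), and of degree-(d+1) relatives, 2**-(d+2).
    """
    return 2.0 ** -(degree + 1.5)

for degree in (0, 1, 2, 3):
    print(degree, round(kinship_cutoff(degree), 4))
```

Rounded to three significant digits, the four values agree with the range boundaries 0.354, 0.177, 0.0884 and 0.0442 quoted in this tutorial.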

The --related option is highly recommended when dealing with large datasets (e.g., with sample size > 10,000). Besides saving substantial disk space, it dramatically reduces the computational time, thanks to a computationally efficient algorithm implemented in versions 1.3 and later. For example, when only 1st- or 2nd-degree relative pairs (through parameter --degree 2) are included in the output (through --related), the computation time can be more than 10 times shorter. Note that inference accuracy is not sacrificed: the results for the close relatives of interest are the same as those obtained without the --related option. This speed-up should be attractive for many applications. When the sample size is very large, say > 100,000, "king --related" is probably the only way to complete the analysis in a reasonable amount of time (say, a couple of days using a single CPU).
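
The number of pairs grows quadratically with the sample size, which is why restricting the output matters for large datasets. A quick calculation in Python, assuming one output row per pair of individuals when all pairs are reported:

```python
def n_pairs(n_individuals):
    """Number of distinct pairs among n individuals: n * (n - 1) / 2."""
    return n_individuals * (n_individuals - 1) // 2

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} individuals -> {n_pairs(n):>13,} pairs")
```

For 100,000 individuals this is roughly 5 billion pairs, almost all of which are unrelated in a typical study.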

--unrelated is a handy option to extract a list of unrelated individuals. E.g.,

  prompt> king -b ex.bed --unrelated --degree 2

estimates relatedness in the data first, followed by extracting a list of individuals that contains no pairs of individuals with a 1st- or 2nd-degree relationship. This option is available in version 1.4 and later. The detailed algorithm is described in this reference: Manichaikul et al. 2012 [PDF]
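
To illustrate what such a list looks like, the Python sketch below applies a simple greedy heuristic to a set of related pairs: it repeatedly drops the individual with the most remaining relatives until no related pair is left. This is not the algorithm KING implements (see the reference above for that). The function name is hypothetical, and the input pairs are those with a kinship estimate of at least 0.0884 in the example outputs shown later in this tutorial.

```python
from collections import defaultdict

def greedy_unrelated(individuals, related_pairs):
    """Return a subset of individuals containing no related pair (greedy heuristic)."""
    relatives = defaultdict(set)
    for a, b in related_pairs:
        relatives[a].add(b)
        relatives[b].add(a)
    kept = set(individuals)
    while kept:
        # Count, for each kept individual, the relatives that are still kept.
        degree = {i: len(relatives[i] & kept) for i in kept}
        worst = max(sorted(kept), key=degree.get)
        if degree[worst] == 0:
            break  # no related pair remains
        kept.remove(worst)
    return sorted(kept)

# IDs are written as FID_ID.
individuals = ["28_1", "28_2", "28_3", "117_1", "117_2", "117_3"]
related_pairs = [("28_1", "28_2"), ("28_1", "28_3"), ("28_3", "117_2"),
                 ("28_3", "117_1"), ("117_1", "117_2"), ("117_1", "117_3")]
unrelated = greedy_unrelated(individuals, related_pairs)
print(unrelated)
```

A greedy choice like this does not guarantee the largest possible unrelated set; it only guarantees that no listed pair survives together.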

PAIRWISE RELATIONSHIP WITHIN FAMILIES

The output for within-family relationship checking using --kinship (saved in file king.kin) will look like this:

FID     ID1     ID2     N_SNP   Z0      Phi     HetHet  IBS0    Kinship Error
28      1       2       2359853 0.000   0.2500  0.162   0.0008  0.2459  0
28      1       3       2351257 0.000   0.2500  0.161   0.0008  0.2466  0
28      2       3       2368538 1.000   0.0000  0.120   0.0634  -0.0108 0
117     1       2       2354279 0.000   0.2500  0.163   0.0006  0.2477  0
117     1       3       2358957 0.000   0.2500  0.164   0.0006  0.2490  0
117     2       3       2348875 1.000   0.0000  0.122   0.0616  -0.0017 0
1344    1       12      2372286 0.000   0.2500  0.149   0.0003  0.2480  0
1344    1       13      2370435 0.000   0.2500  0.148   0.0003  0.2465  0
1344    12      13      2374888 1.000   0.0000  0.117   0.0582  0.0003  0

Each row provides information for one pair of individuals. The columns are

FID: Family ID for the pair
ID1: Individual ID for the first individual of the pair
ID2: Individual ID for the second individual of the pair
N_SNP: The number of SNPs that do not have missing genotypes in either individual
Z0: Pr(IBD=0) as specified by the provided pedigree data
Phi: Kinship coefficient as specified by the provided pedigree data
HetHet: Proportion of SNPs with double heterozygotes (e.g., AG and AG)
IBS0: Proportion of SNPs with zero IBS (identical-by-state) (e.g., AA and GG)
Kinship: Estimated kinship coefficient from the SNP data
Error: Indicates a difference between the estimated and specified kinship coefficients (1 for error, 0.5 for warning)
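
As one way to use these columns, the Python sketch below reads rows in the king.kin layout shown above and checks that pairs reported as parent-offspring (Z0 = 0 and Phi = 0.25 in an outbred pedigree) have a near-zero IBS0 proportion and a kinship estimate in the 1st-degree range. The IBS0 cutoff of 0.005 is a hypothetical value chosen for illustration, not a KING default, and this check is not how KING derives its Error column. The rows are copied from the example output.

```python
KIN_TEXT = """\
FID     ID1     ID2     N_SNP   Z0      Phi     HetHet  IBS0    Kinship Error
28      1       2       2359853 0.000   0.2500  0.162   0.0008  0.2459  0
28      1       3       2351257 0.000   0.2500  0.161   0.0008  0.2466  0
28      2       3       2368538 1.000   0.0000  0.120   0.0634  -0.0108 0
"""

IBS0_CUTOFF = 0.005  # hypothetical cutoff for illustration, not a KING default

def read_table(text):
    """Parse whitespace-delimited KING output into a list of dicts (values kept as strings)."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]

def check_parent_offspring(rows, ibs0_cutoff=IBS0_CUTOFF):
    """Yield (FID, ID1, ID2, ok) for pairs the pedigree reports as parent-offspring."""
    for r in rows:
        reported_po = float(r["Z0"]) == 0.0 and float(r["Phi"]) == 0.25
        if reported_po:
            ok = (float(r["IBS0"]) < ibs0_cutoff
                  and 0.177 <= float(r["Kinship"]) <= 0.354)
            yield r["FID"], r["ID1"], r["ID2"], ok

for fid, id1, id2, ok in check_parent_offspring(read_table(KIN_TEXT)):
    print(fid, id1, id2, "consistent" if ok else "CHECK")
```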

The default kinship coefficient estimation uses SNP data from only the pair of individuals in question, and the inference is robust to population structure. A negative kinship coefficient estimate indicates an unrelated pair. Negative estimates are not set to zero because a strongly negative value may indicate population structure between the two individuals. Close relatives can be inferred fairly reliably from the estimated kinship coefficients using the following simple algorithm: an estimated kinship coefficient in the range >0.354, [0.177, 0.354], [0.0884, 0.177] and [0.0442, 0.0884] corresponds to a duplicate/MZ twin, 1st-degree, 2nd-degree, and 3rd-degree relationship, respectively. Inference for more distant relationships is more challenging. A plot of the estimated kinship coefficient against the proportion of zero IBS-sharing is highly recommended. In the absence of population structure, relationship inference can also be carried out using an alternative algorithm through parameter "--homo".
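
The simple algorithm above can be sketched in a few lines of Python. The cutoffs are the ones listed in the text. The function name, the label for pairs below 0.0442, and the handling of values that fall exactly on a shared boundary are illustrative choices.

```python
def infer_relationship(kinship):
    """Map an estimated kinship coefficient to a relationship category.

    Values exactly on a boundary shared by two ranges are assigned to the
    closer relationship (an arbitrary choice made for this sketch).
    """
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship >= 0.177:
        return "1st-degree"
    if kinship >= 0.0884:
        return "2nd-degree"
    if kinship >= 0.0442:
        return "3rd-degree"
    return "unrelated or more distant"

# Kinship estimates taken from the example outputs in this tutorial.
for phi in (0.2459, 0.1356, -0.0108, -0.2295):
    print(f"{phi:8.4f}  {infer_relationship(phi)}")
```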

Here is an example of the relationship inference using the HapMap GWAS data: PDF and its R code

PAIRWISE RELATIONSHIP ACROSS FAMILIES (OR UNRELATED INDIVIDUALS)

The output for between-family relationship checking (saved in file king.kin0) will look like this:

FID1    ID1     FID2    ID2     N_SNP   HetHet  IBS0    Kinship
28      3       117     1       2360618 0.143   0.0267  0.1356
28      3       117     2       2352628 0.161   0.0009  0.2441
28      3       117     3       2354540 0.120   0.0624  -0.0119
28      3       1344    1       2361807 0.093   0.1095  -0.2295
28      3       1344    12      2367180 0.094   0.1091  -0.2225
28      3       1344    13      2364816 0.093   0.1082  -0.2224
117     1       1344    1       2362787 0.094   0.1093  -0.2312
117     1       1344    12      2368467 0.095   0.1088  -0.2230
117     1       1344    13      2365036 0.094   0.1084  -0.2253
117     2       1344    1       2354855 0.094   0.1084  -0.2281
117     2       1344    12      2361351 0.095   0.1078  -0.2206
117     2       1344    13      2357936 0.095   0.1067  -0.2190
117     3       1344    1       2357771 0.094   0.1102  -0.2348
117     3       1344    12      2364365 0.095   0.1086  -0.2232
117     3       1344    13      2361061 0.094   0.1096  -0.2301

This analysis shows the "unrelated" families 28 and 117 are actually connected through an unreported parent-offspring pair (28_3, 117_2).
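
A between-family file can be screened for such pairs with a short script. The Python sketch below parses a few rows copied from the output above and keeps the pairs whose kinship estimate is at least 0.0884, the lower bound of the 2nd-degree range. The function name and the FID_ID labelling are illustrative choices.

```python
KIN0_TEXT = """\
FID1    ID1     FID2    ID2     N_SNP   HetHet  IBS0    Kinship
28      3       117     1       2360618 0.143   0.0267  0.1356
28      3       117     2       2352628 0.161   0.0009  0.2441
28      3       117     3       2354540 0.120   0.0624  -0.0119
28      3       1344    1       2361807 0.093   0.1095  -0.2295
117     1       1344    1       2362787 0.094   0.1093  -0.2312
"""

MIN_KINSHIP = 0.0884  # lower bound of the 2nd-degree range

def related_pairs(text, min_kinship=MIN_KINSHIP):
    """Return (FID1_ID1, FID2_ID2, kinship) for rows at or above min_kinship."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    hits = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        kinship = float(row["Kinship"])
        if kinship >= min_kinship:
            first = f'{row["FID1"]}_{row["ID1"]}'
            second = f'{row["FID2"]}_{row["ID2"]}'
            hits.append((first, second, kinship))
    return hits

for a, b, k in related_pairs(KIN0_TEXT):
    print(a, b, k)
```

On these rows the screen keeps the pair (28_3, 117_2), whose estimate lies in the 1st-degree range with IBS0 near zero, and also the pair (28_3, 117_1), whose estimate of 0.1356 lies in the 2nd-degree range.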

Here is an example of relationship inference across families using the HapMap GWAS data: PDF and its R code

POPULATION STRUCTURE INFERENCE

To identify population substructure, parameter --individual, --mds, or --pca can be specified:

  prompt> king -b ex.bed --individual
  prompt> king -b ex.bed --mds
  prompt> king -b ex.bed --pca 5

The --individual option of KING provides the mean and variance estimates of allele frequencies for each individual, --mds specifies the multidimensional scaling (MDS) analysis, and --pca 5 specifies the principal component analysis (PCA). The --mds option is highly recommended (over --pca). More details are here.

OTHER PARAMETERS

The following parameters can also be specified:

--errorrate specifies the error rate cutoff: the error (IBS=0) rate between any parent-offspring pair should be less than this cutoff.

--homo estimates kinship and IBD0 assuming all samples are from a homogeneous population, similar to most other software.

--minMAF specifies the minimum minor allele frequency to select SNPs for relationship inference. It only applies to --homo. Default value is 0.01.

--showIBD allows --homo analysis to show IBD1 and IBD2 in the output.

--prefix specifies the file name to store the output statistics data for relationship inference. "king" is used as default.

--binary rewrites data in KING binary format.

--merlin rewrites data in MERLIN format.

--plink rewrites data in PLINK binary format.

REFERENCE

Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873.

======================================
Last updated: May 2012 by Wei-Min Chen
