VCF data format
GADMA can load VCF file with SNP’s and build SFS data from it for demographic inference. VCF file should be set together with popmap file that describing how samples map to populations:
# param_file
# Set VCF and popmap file
Input data : path_to_vcf_file, path_to_popmap_file
...
VCF file
Tip
It is better to set options Sequence length
and Mutation rate
if they are known.
Then units of parameters of the demographic history will be in physical units.
Note
Sometimes it is better to convert a VCF (.vcf) file into a SFS (.sfs) file. To do so use easySFS.
There are several conditions of successful reading of VCF file. SNP’s will be taken to SFS data if:
VCF file contains diploid samples;
FILTER column is equal to PASS or .;
REF and ALT columns are nucleotides (one letter from [A, T, G, C]);
When positions are repeated the last SNP will be taken.
Ancestral allele should be put to INFO column as AA= field.
Popmap file
Popmap file keep information about how samples map the populations. It is a plain-text with two tab-separated columns: sample name and corresponding population name. For example:
sample_1 Pop1
sample_2 Pop1
sample_3 Pop2
sample_4 Pop2
sample_5 Pop2
Note
Popmap file can miss some samples from VCF file or have some extra - all of them will be ignored.
Examples
In the following example of VCF file only one line will pass the restrictions of GADMA reading. SNP on line 3 will be filtered out as FILTER column is equal to q10 (should be PASS or .). SNP’s on lines 4, 5 and 6 has REF and/or ALT column with not-nucleotides. And only the first SNP is okay.
1#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
220 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
320 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
420 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
520 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
620 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Another example of VCF file with ancestral allele information, five SNP’s will be successfully read (lines 2, 4, 7, 8, 9):
1#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT tsk_0 tsk_1 tsk_2 tsk_3 tsk_4 tsk_5
21 316 . C G . PASS AA=C GT 1|0 0|0 0|1 1|0 0|1 0|1
31 1099 . C T . NOTPASS AA=C GT 1|0 0|0 0|1 1|0 0|1 0|1
41 1338 . C T . PASS AA=C GT 0|0 0|0 0|0 0|0 1|0 1|0
51 2276 . 0 1 . PASS AA=T GT 1|0 0|0 0|1 1|0 0|1 0|1
61 2889 . 0 1 . NOTPASS AA=0 GT 0|1 1|1 1|0 0|1 1|0 1|0
71 3107 . G T . PASS AA=G GT 1|0 0|0 0|1 1|0 0|1 0|1
81 3535 . C T . PASS AA=C GT 0|0 0|0 0|0 0|0 1|0 1|0
91 3738 . C T . PASS AA=C GT 1|0 0|0 0|0 0|0 0|0 0|0
101 3803 . A C . NOTPASS AA=A GT 0|0 0|0 0|0 0|0 0|1 0|0
And corresponding popmap file:
tsk_0 YRI
tsk_1 YRI
tsk_2 CEU
tsk_3 CEU
tsk_4 CEU
tsk_5 CHB