The GenomeDiff file format describes mutational differences between a reference DNA sequence and a sample. It may also include evidence from computational analysis or experiments that supports mutations.
An example of a portion of a GenomeDiff file:
#=GENOME_DIFF 1.0
DEL 61 11 NC_001416 139 1
INS 62 12 NC_001416 14266 G
SNP 63 13 NC_001416 20661 G
INS 64 14 NC_001416 20835 C
SNP 65 15 NC_001416 21714 A
DEL 60 33,1 NC_001416 21738 5996
SNP 66 35 NC_001416 31016 C
...
MC 9 NC_001416 1 2 0 0 left_inside_cov=0 left_outside_cov=NA right_inside_cov=0 right_outside_cov=169
RA 11 NC_001416 139 0 G . frequency=1 new_cov=34/40 quality=309.0 ref_cov=0/0 tot_cov=34/40
JC 2 NC_001416 5491 1 NC_001416 30255 1 0 alignment_overlap=4 coverage_minus=8 coverage_plus=0 flanking_left=35 flanking_right=35 key=NC_001416__5491__1__NC_001416__30251__1__4____35__35__0__0 max_left=30 max_left_minus=30 max_left_plus=0 max_min_left=0 max_min_left_minus=0 max_min_left_plus=0 max_min_right=11 max_min_right_minus=11 max_min_right_plus=0 max_right=11 max_right_minus=11 max_right_plus=0 min_overlap_score=44 pos_hash_score=7 reject=NJ,COV side_1_annotate_key=gene side_1_overlap=4 side_1_redundant=0 side_2_annotate_key=gene side_2_overlap=0 side_2_redundant=0 total_non_overlap_reads=8 total_reads=8
JC 3 NC_001416 13180 1 NC_001416 13218 1 0 alignment_overlap=4 coverage_minus=1 coverage_plus=0 flanking_left=35 flanking_right=35 key=NC_001416__13180__1__NC_001416__13214__1__4____35__35__0__0 max_left=17 max_left_minus=17 max_left_plus=0 max_min_left=0 max_min_left_minus=0 max_min_left_plus=0 max_min_right=14 max_min_right_minus=14 max_min_right_plus=0 max_right=14 max_right_minus=14 max_right_plus=0 min_overlap_score=14 pos_hash_score=1 reject=NJ,COV side_1_annotate_key=gene side_1_overlap=4 side_1_redundant=0 side_2_annotate_key=gene side_2_overlap=0 side_2_redundant=0 total_non_overlap_reads=1 total_reads=1
RA 12 NC_001416 14266 1 . G frequency=1 new_cov=44/31 quality=186.3 ref_cov=0/0 tot_cov=44/31
JC 5 NC_001416 14869 -1 NC_001416 15609 -1 0 alignment_overlap=7 coverage_minus=1 coverage_plus=0 flanking_left=35 flanking_right=35 key=NC_001416__14869__0__NC_001416__15616__0__7____35__35__0__0 max_left=21 max_left_minus=21 max_left_plus=0 max_min_left=0 max_min_left_minus=0 max_min_left_plus=0 max_min_right=7 max_min_right_minus=7 max_min_right_plus=0 max_right=7 max_right_minus=7 max_right_plus=0 min_overlap_score=7 pos_hash_score=1 reject=NJ,COV side_1_annotate_key=gene side_1_overlap=7 side_1_redundant=0 side_2_annotate_key=gene side_2_overlap=0 side_2_redundant=0 total_non_overlap_reads=1 total_reads=1
The first line of the file must define that this is a GenomeDiff file and the version of the file specification used:
#=GENOME_DIFF 1.0
Lines beginning with #=<name> <value> are interpreted as metadata. (Thus, the first line is assigning a metadata item named GENOME_DIFF a value of 1.0.) Names cannot include whitespace characters. Values may include whitespace characters. Lines with the same name are concatenated with single spaces added between them.
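The metadata rules above can be sketched in a few lines of Python. This is a minimal illustration, not the parser used by any particular tool: the function name `parse_metadata` and the `NOTE` metadata item are made up for the example, and only the `#=<name> <value>` syntax and the concatenation rule come from the text.

```python
def parse_metadata(lines):
    """Collect '#=<name> <value>' lines, joining repeated names with single spaces."""
    metadata = {}
    for line in lines:
        if not line.startswith("#="):
            continue  # not a metadata line
        # Names cannot contain whitespace, so the first whitespace run ends the name.
        parts = line[2:].rstrip("\n").split(None, 1)
        if not parts:
            continue
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if name in metadata:
            # Lines with the same name are concatenated with a single space.
            metadata[name] = metadata[name] + " " + value
        else:
            metadata[name] = value
    return metadata


# NOTE is a hypothetical metadata name used only to show concatenation.
example_lines = [
    "#=GENOME_DIFF 1.0",
    "#=NOTE first part",
    "#=NOTE second part",
    "DEL\t61\t11\tNC_001416\t139\t1",
]
print(parse_metadata(example_lines))
```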
Lines beginning with whitespace and # are comments. Comments may not occur at the end of a data line.
Data lines describe either a mutation or evidence from an analysis that can potentially support a mutational event. Data fields are tab-delimited. Each line begins with several fields containing information common to all types, continues with a fixed number of type-specific fields, and ends with an arbitrary number of name=value pairs that store optional information.
type <string>
type of the entry on this line.
id or evidence-id <uint32>
For evidence and validation lines, the id of this item. For mutation lines, the ids of all evidence or validation items that support this mutation. May be set to ‘.’ if a line was manually edited.
parent-ids <uint32>
ids of evidence that support this mutation. May be set to ‘.’ or left blank.
mutation types are 3 letters: SNP, SUB, DEL, INS, MOB, AMP, CON, INV.
evidence types are 2 letters: RA, MC, JC, UN.
validation types are 4 letters: TSEQ, PFLP, RFLP, PFGE, PHYL, CURA.
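As a rough sketch of how the common fields might be read, the Python below splits one tab-delimited data line into its type, id, parent ids, type-specific fields, and optional name=value pairs. The helper names are hypothetical. Two assumptions go beyond the text: a blank parent-ids field is written as an empty tab-delimited field, and any trailing field containing `=` is treated as a name=value pair.

```python
def classify_type(type_code):
    """Map a type code to its category using the letter-count convention."""
    categories = {3: "mutation", 2: "evidence", 4: "validation"}
    return categories.get(len(type_code), "unknown")


def split_data_line(line):
    """Split one data line into common fields, type-specific fields, and options."""
    fields = line.rstrip("\n").split("\t")
    type_code, item_id, parent_field = fields[0], fields[1], fields[2]
    # '.' or a blank field means no parent ids; otherwise ids are comma-separated.
    parent_ids = [] if parent_field in ("", ".") else parent_field.split(",")
    rest = fields[3:]
    # Assumption: trailing fields that contain '=' are the optional pairs.
    options = {}
    while rest and "=" in rest[-1]:
        name, value = rest.pop().split("=", 1)
        options[name] = value
    return {
        "type": type_code,
        "category": classify_type(type_code),
        "id": item_id,
        "parent_ids": parent_ids,
        "specific": rest,
        "options": options,
    }


# Lines taken from the example file above, with tabs written out explicitly.
deletion = split_data_line("DEL\t60\t33,1\tNC_001416\t21738\t5996")
read_alignment = split_data_line(
    "RA\t11\t\tNC_001416\t139\t0\tG\t.\tfrequency=1\tnew_cov=34/40"
)
print(deletion["category"], deletion["parent_ids"])
print(read_alignment["category"], read_alignment["options"])
```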
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
position <uint32>
position in reference sequence fragment.
new_seq <char>
new base at position
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
position <uint32>
position of the first replaced nucleotide in reference sequence fragment.
size <uint32>
number of bases after the specified reference position to replace with new_seq
new_seq <string>
new bases at position
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
position <uint32>
position in reference sequence fragment.
size <uint32>
number of bases deleted in reference, including reference position.
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
position <uint32>
position in reference sequence fragment, after which the INS is placed.
new_seq <string>
new bases inserted after the specified reference position.
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
position <uint32>
position in reference sequence fragment.
repeat_name <string>
name of the mobile element. Should correspond to an annotated repeat_region in the reference.
strand <1/-1>
strand of mobile element insertion.
duplication_size <uint32>
number of bases duplicated during insertion, beginning with the specified reference position.
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
position <uint32>
position in reference sequence fragment.
size <uint32>
number of bases duplicated starting with the specified reference position.
new_copy_number <uint32>
new number of copies of specified bases.
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
position <uint32>
position in reference sequence fragment that was the target of gene conversion from another genomic location.
size <uint32>
number of bases to replace in the reference genome beginning at the specified position.
region <sequence:start-end>
Region in the reference genome to use as a replacement.
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
position <uint32>
position in reference sequence fragment.
size <uint32>
number of bases in inverted region beginning at the specified reference position.
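To make the position and size conventions concrete, here is a small Python sketch that applies a single mutation to a reference string. It rests on several assumptions that the text does not state outright: the unlabeled field blocks above follow the order of the mutation type list (SNP, SUB, DEL, INS, MOB, AMP, CON, INV), positions are 1-based, and an inversion replaces the region with its reverse complement. MOB and CON are left out because they need the repeat or donor sequence. The function names are illustrative only.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def apply_mutation(ref, mut_type, position, **fields):
    """Return a new sequence with one mutation applied (position assumed 1-based)."""
    i = position - 1  # 0-based index of the reference position
    if mut_type == "SNP":
        # Replace the single base at the position.
        return ref[:i] + fields["new_seq"] + ref[i + 1:]
    if mut_type == "SUB":
        # Replace `size` bases, starting at the position, with new_seq.
        return ref[:i] + fields["new_seq"] + ref[i + fields["size"]:]
    if mut_type == "DEL":
        # Delete `size` bases, including the reference position.
        return ref[:i] + ref[i + fields["size"]:]
    if mut_type == "INS":
        # Insert new_seq after the reference position.
        return ref[:i + 1] + fields["new_seq"] + ref[i + 1:]
    if mut_type == "AMP":
        # Repeat the unit so that new_copy_number copies are present in total.
        unit = ref[i:i + fields["size"]]
        return ref[:i] + unit * fields["new_copy_number"] + ref[i + fields["size"]:]
    if mut_type == "INV":
        end = i + fields["size"]
        return ref[:i] + reverse_complement(ref[i:end]) + ref[end:]
    raise ValueError("unsupported mutation type: " + mut_type)


reference = "ACGTACGT"
print(apply_mutation(reference, "DEL", 1, size=2))
print(apply_mutation(reference, "INS", 4, new_seq="GG"))
```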
These attributes control how molecular events in a GenomeDiff are counted for summary purposes.
between=<element_name>
This mutation occurs between copies of this element. For example, a deletion caused by recombination between two copies of a mobile element.
mediated=<element_name>
This mutation was mediated by insertion of a new copy of this element and recombination with an existing copy, such that the number of this element did not net increase in the resulting genome.
adjacent=<element_name>
This mutation is adjacent to a copy of this element.
with=<mutation_id>
This mutation should be counted as a single molecular event with the other specified mutation. For example, active mobile elements may lose and gain a few bases at their margins, and when this occurs the most parsimonious explanation is one round of recombination or excision and re-insertion.
These attributes control how mutations are applied when building a new reference genome from the original reference genome and a GenomeDiff and when building phylogenetic trees between multiple samples. They are not generated automatically by breseq.
before=<mutation_id> or after=<mutation_id>
Apply this mutation before or after another mutation. For example, did a base substitution occur after a region was duplicated, so that it is present in only one copy, or did it occur before the duplication, so that it alters both copies? Did a base substitution happen before a deletion, hiding a mutation that should be included in any phylogenetic inference? When neither of these attributes is present, mutations will be applied in the order in which they appear in the file.
within=<mutation_id>, within_position=<mutation_id>, within_copy=<mutation_id>
This mutation happens inside of a different mutation. These options can specify, for example, that a base substitution happens in the second copy of a duplicated region. within and within_position must both be provided if one is supplied. If within_copy is not provided (because it is unknown), the mutation will be placed arbitrarily in the first copy. Note that the actual position of this mutation is still used for annotating its effects.
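One possible reading of the before= and after= attributes is an ordering problem: keep file order by default, and move a mutation only as far as its constraints require. The Python sketch below implements that reading with a topological sort that always picks the earliest ready mutation in the file. It is an illustration under that assumption, not a description of how breseq orders mutations, and `order_mutations` is a hypothetical name.

```python
import heapq


def order_mutations(mutations):
    """Order mutation ids by file position while honoring before=/after=.

    `mutations` is a list of (mutation_id, options) tuples in file order.
    """
    index = {mid: n for n, (mid, _) in enumerate(mutations)}
    successors = {mid: set() for mid in index}
    pending = {mid: 0 for mid in index}  # number of unmet predecessors

    def add_edge(first, second):
        if second not in successors[first]:
            successors[first].add(second)
            pending[second] += 1

    for mid, options in mutations:
        if "before" in options:
            add_edge(mid, options["before"])
        if "after" in options:
            add_edge(options["after"], mid)

    # Among mutations whose constraints are met, always take the earliest in the file.
    ready = [index[mid] for mid in index if pending[mid] == 0]
    heapq.heapify(ready)
    ordered = []
    while ready:
        mid = mutations[heapq.heappop(ready)][0]
        ordered.append(mid)
        for nxt in successors[mid]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                heapq.heappush(ready, index[nxt])
    if len(ordered) != len(mutations):
        raise ValueError("before/after constraints form a cycle")
    return ordered


# Hypothetical ids: mutation 2 must be applied after mutation 3.
print(order_mutations([("1", {}), ("2", {"after": "3"}), ("3", {})]))
```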
Line specification:
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
position <uint32>
position in reference sequence fragment.
insert_position <uint32>
number of bases inserted after the reference position to get to this base. A value of zero refers to the reference base itself. A value of 5 means that this evidence is for the fifth newly inserted column after the reference position.
ref_base <char>
base in the reference genome.
new_base <char>
new base supported by read alignment evidence.
Line specification:
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
start <uint32>
start position in reference sequence fragment.
end <uint32>
end position in reference sequence of region.
start_range <uint32>
number of bases to offset after the start position to define the upper limit of the range where the start of a deletion could be.
end_range <uint32>
number of bases to offset before the end position to define the lower limit of the range where the end of a deletion could be.
Essentially, this is evidence of missing coverage between two positions that lie in the ranges [start, start+start_range] and [end-end_range, end].
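The two boundary ranges can be computed directly from the four numeric fields. The short Python sketch below does this for the MC line in the example file (start=1, end=2, start_range=0, end_range=0); the function name is made up for illustration.

```python
def mc_boundary_ranges(start, end, start_range, end_range):
    """Return the (low, high) intervals for the start and end of missing coverage."""
    start_interval = (start, start + start_range)
    end_interval = (end - end_range, end)
    return start_interval, end_interval


# Values from the example line: MC 9 NC_001416 1 2 0 0
print(mc_boundary_ranges(1, 2, 0, 0))
```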
side_1_seq_id <string>
id of reference sequence fragment containing side 1 of the junction.
side_1_position <uint32>
position of side 1 at the junction boundary.
side_1_strand <1/-1>
direction that side 1 continues matching the reference sequence.
side_2_seq_id <string>
id of reference sequence fragment containing side 2 of the junction.
side_2_position <uint32>
position of side 2 at the junction boundary.
side_2_strand <1/-1>
direction that side 2 continues matching the reference sequence.
overlap <uint32>
Number of bases that the two sides of the new junction have in common.
Line specification:
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
start <uint32>
start position in reference sequence of region.
end <uint32>
end position in reference sequence of region.
These items indicate that mutations have been validated by further, targeted experiments.
An expert has examined the data output from a prediction program and determined that this mutation is a true positive.
Line specification:
expert <string>
Name or initials of the person who predicted the mutation.
An expert has examined the raw read data and determined that this predicted mutation is a false positive.
Line specification:
expert <string>
Name or initials of the person who predicted the mutation.
This validation was transferred from validation in another, related genome.
Line specification:
gd <string>
Name of the genome_diff file containing the evidence.
Line specification:
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
primer1_start <uint32>
position in reference sequence of the 5’ end of primer 1.
primer1_end <uint32>
position in reference sequence of the 3’ end of primer 1.
primer2_start <uint32>
position in reference sequence of the 5’ end of primer 2.
primer2_end <uint32>
position in reference sequence of the 3’ end of primer 2.
For primer 1, start < end. For primer 2, end < start.
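The orientation rule for the primer coordinates is easy to check mechanically. The Python sketch below is one way to do it; the function name and the coordinates in the usage lines are hypothetical.

```python
def primers_are_oriented(primer1_start, primer1_end, primer2_start, primer2_end):
    """Check that primer 1 has start < end and primer 2 has end < start."""
    return primer1_start < primer1_end and primer2_end < primer2_start


# Made-up coordinates: primer 1 on the forward strand, primer 2 on the reverse strand.
print(primers_are_oriented(100, 120, 500, 480))
# Primer 2 given in forward orientation violates the rule.
print(primers_are_oriented(100, 120, 480, 500))
```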
Line specification:
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
primer1_start <uint32>
position in reference sequence of the 5’ end of primer 1.
primer1_end <uint32>
position in reference sequence of the 3’ end of primer 1.
primer2_start <uint32>
position in reference sequence of the 5’ end of primer 2.
primer2_end <uint32>
position in reference sequence of the 3’ end of primer 2.
For primer 1, start < end. For primer 2, end < start.
Line specification:
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
primer1_start <uint32>
position in reference sequence of the 5’ end of primer 1.
primer1_end <uint32>
position in reference sequence of the 3’ end of primer 1.
primer2_start <uint32>
position in reference sequence of the 5’ end of primer 2.
primer2_end <uint32>
position in reference sequence of the 3’ end of primer 2.
enzyme <string>
Restriction enzyme used to distinguish reference from mutated allele.
For primer 1, start < end. For primer 2, end < start.
Changes in fragment sizes of genomic DNA digested with restriction enzymes and separated by pulsed-field gel electrophoresis.
Line specification:
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
restriction enzyme <string>
Restriction enzyme used to digest genomic DNA and observe fragments.
Generic container for a note about a mutation prediction
Line specification:
note <string>
Free text note.
Artificially mask a section of DNA as “N”s. This is useful for creating fake reference sequences, particularly for targeted sequencing approaches.
Line specification:
seq_id <string>
id of reference sequence fragment containing mutation, evidence, or validation.
position <uint32>
position in reference sequence fragment.
size <uint32>
number of bases masked to “N” in reference, including reference position.