As described on the UCSC Genome Browser website (see link below), the BED format is a concise and flexible way to represent genomic features and annotations. The BED format description supports up to 12 columns, but only the first 3 are required by the UCSC browser, the Galaxy browser and BEDTools. BEDTools allows one to use the “BED12” format (that is, all 12 fields listed below). However, only intersectBed, coverageBed, genomeCoverageBed, and bamToBed will obey the BED12 “blocks” when computing overlaps, etc., and only when the “-split” option is used. All other tools do not use the last six columns for any comparisons. Instead, they use the entire span (start to end) of the BED12 entry to perform any relevant feature comparisons. The last six columns are still reported in the output of all comparisons.
The file description below is modified from: http://genome.ucsc.edu/FAQ/FAQformat#format1.
- Any string can be used. For example, “chr1”, “III”, “myChrom”, “contig1112.23”.
- This column is required.
- The first base in a chromosome is numbered 0.
- The start position in each BED feature is therefore interpreted to be 1 greater than the start position listed in the feature. For example, start=9, end=20 is interpreted to span bases 10 through 20, inclusive.
- This column is required.
- The end position in each BED feature is one-based. See example above.
- This column is required.
- Any string can be used. For example, “LINE”, “Exon3”, “HWIEAS_0001:3:1:0:266#0/1”, or “my_Feature”.
- This column is optional.
- Any string can be used. For example, 7.31E-05 (p-value), 0.33456 (mean enrichment value), “up”, “down”, etc.
- This column is optional.
- This column is optional.
- Allowed yet ignored by BEDTools.
- Allowed yet ignored by BEDTools.
- Allowed yet ignored by BEDTools.
- Allowed yet ignored by BEDTools.
- Allowed yet ignored by BEDTools.
- Allowed yet ignored by BEDTools.
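The coordinate convention described in the list above (zero-based start, one-based end) can be illustrated with a minimal Python sketch. This is not BEDTools code; the names `BedInterval`, `parse_bed3`, `one_based_span` and `feature_length` are hypothetical helpers introduced only for this example, and only the three required columns are parsed.

```python
from typing import NamedTuple, Tuple


class BedInterval(NamedTuple):
    """The three required BED columns (hypothetical helper type)."""
    chrom: str
    start: int  # zero-based, exactly as written in the BED file
    end: int    # one-based, exactly as written in the BED file


def parse_bed3(line: str) -> BedInterval:
    """Parse the first three tab-delimited columns of one BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least 3 tab-delimited columns")
    return BedInterval(fields[0], int(fields[1]), int(fields[2]))


def one_based_span(iv: BedInterval) -> Tuple[int, int]:
    """Return the first and last base covered, both one-based and inclusive."""
    return iv.start + 1, iv.end


def feature_length(iv: BedInterval) -> int:
    """Number of bases covered; no +1 is needed with this convention."""
    return iv.end - iv.start


iv = parse_bed3("chr1\t9\t20")
print(one_based_span(iv))   # (10, 20)
print(feature_length(iv))   # 11
```

With start=9 and end=20, the sketch reports bases 10 through 20, matching the example in the list above.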
BEDTools requires that all BED input files (and input received from stdin) are tab-delimited. The following types of BED files are supported by BEDTools:
We have defined a new file format (BEDPE) in order to concisely describe disjoint genome features, such as structural variations or paired-end sequence alignments. We chose to define a new format because the existing “blocked” BED format (a.k.a. BED12) does not allow inter-chromosomal feature definitions. In addition, BED12 only has one strand field, which is insufficient for paired-end sequence alignments, especially when studying structural variation.
The BEDPE format is described below. The description is modified from: http://genome.ucsc.edu/FAQ/FAQformat#format1.
- Any string can be used. For example, “chr1”, “III”, “myChrom”, “contig1112.23”.
- This column is required.
- Use “.” for unknown.
- The first base in a chromosome is numbered 0.
- As with BED format, the start position in each BEDPE feature is therefore interpreted to be 1 greater than the start position listed in the feature. This column is required.
- Use -1 for unknown.
- The end position in each BEDPE feature is one-based.
- This column is required.
- Use -1 for unknown.
- Any string can be used. For example, “chr1”, “III”, “myChrom”, “contig1112.23”.
- This column is required.
- Use “.” for unknown.
- The first base in a chromosome is numbered 0.
- As with BED format, the start position in each BEDPE feature is therefore interpreted to be 1 greater than the start position listed in the feature. This column is required.
- Use -1 for unknown.
- The end position in each BEDPE feature is one-based.
- This column is required.
- Use -1 for unknown.
- Any string can be used. For example, “LINE”, “Exon3”, “HWIEAS_0001:3:1:0:266#0/1”, or “my_Feature”.
- This column is optional.
- Any string can be used. For example, 7.31E-05 (p-value), 0.33456 (mean enrichment value), “up”, “down”, etc.
- This column is optional.
- This column is optional.
- Use “.” for unknown.
- This column is optional.
- Use “.” for unknown.
- These additional columns are optional.
Entries from a typical BEDPE file:
chr1 100 200 chr5 5000 5100 bedpe_example1 30 + -
chr9 1000 5000 chr9 3000 3800 bedpe_example2 100 + -
Entries from a BEDPE file with two custom fields added to each record:
chr1 10 20 chr5 50 60 a1 30 + - 0 1
chr9 30 40 chr9 80 90 a2 100 + - 2 1
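As a rough illustration of the BEDPE layout described above, the following Python sketch parses the six required columns and maps the "unknown" placeholders (“.” for a chromosome, -1 for a position) to `None`. It is a sketch under stated assumptions, not BEDTools code: the field names (`chrom1`, `start1`, and so on) and the helper names are chosen for this example, and all optional and custom columns are kept as unparsed strings.

```python
from typing import List, NamedTuple, Optional


class BedpeRecord(NamedTuple):
    """Six required BEDPE columns plus unparsed extras (hypothetical type)."""
    chrom1: Optional[str]
    start1: Optional[int]
    end1: Optional[int]
    chrom2: Optional[str]
    start2: Optional[int]
    end2: Optional[int]
    extras: List[str]  # optional and custom columns, left as strings


def _chrom(value: str) -> Optional[str]:
    # "." marks an unknown chromosome.
    return None if value == "." else value


def _pos(value: str) -> Optional[int]:
    # -1 marks an unknown start or end position.
    number = int(value)
    return None if number == -1 else number


def parse_bedpe(line: str) -> BedpeRecord:
    """Parse one tab-delimited BEDPE line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 6:
        raise ValueError("expected at least 6 tab-delimited columns")
    return BedpeRecord(_chrom(f[0]), _pos(f[1]), _pos(f[2]),
                       _chrom(f[3]), _pos(f[4]), _pos(f[5]), f[6:])


def is_interchromosomal(rec: BedpeRecord) -> bool:
    """True only when both chromosomes are known and differ."""
    return (rec.chrom1 is not None and rec.chrom2 is not None
            and rec.chrom1 != rec.chrom2)


rec = parse_bedpe("chr1\t100\t200\tchr5\t5000\t5100\tbedpe_example1\t30\t+\t-")
print(is_interchromosomal(rec))  # True
```

The first example entry above joins chr1 and chr5, which is the kind of inter-chromosomal feature that BED12 cannot express.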
The GFF format is described on the Sanger Institute’s website (http://www.sanger.ac.uk/resources/software/gff/spec.html). The GFF description below is modified from the definition at this URL. All nine columns in the GFF format description are required by BEDTools.
- Any string can be used. For example, “chr1”, “III”, “myChrom”, “contig1112.23”.
- This column is required.
- This column is required.
- Any string can be used. For example, “exon”, etc.
- This column is required.
- This column is required.
- BEDTools accounts for the fact that GFF uses one-based positions, whereas BED uses a zero-based start position.
- This column is required.
- This column is required.
- This column is required.
- This column is required.
- This column is required.
Entries from an example GFF file:
seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201 . - 2 Sequence "dJ102G20.C1.1"
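The difference between GFF and BED coordinates noted above comes down to a single subtraction on the start position. The Python sketch below shows that arithmetic only; it does not reproduce how BEDTools handles GFF input internally. The function names are hypothetical, and the sketch assumes the start and end positions are the fourth and fifth tab-delimited GFF columns.

```python
from typing import Tuple


def gff_to_bed_coords(gff_start: int, gff_end: int) -> Tuple[int, int]:
    """Convert a one-based, inclusive GFF span to BED start/end values."""
    if gff_start < 1 or gff_end < gff_start:
        raise ValueError("invalid GFF coordinates")
    # Only the start shifts: the BED end is one-based, like the GFF end.
    return gff_start - 1, gff_end


def gff_line_to_bed3(line: str) -> str:
    """Build a three-column BED line from one tab-delimited GFF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        raise ValueError("expected 9 tab-delimited GFF columns")
    start, end = gff_to_bed_coords(int(fields[3]), int(fields[4]))
    return "\t".join([fields[0], str(start), str(end)])


gff = 'dJ102G20\tGD_mRNA\tcoding_exon\t7105\t7201\t.\t-\t2\tSequence "dJ102G20.C1.1"'
print(gff_line_to_bed3(gff))  # dJ102G20, 7104, 7201 (tab-separated)
```

The feature length is unchanged by the conversion: 7201 - 7105 + 1 in GFF terms equals 7201 - 7104 in BED terms.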
Some of the BEDTools (e.g., genomeCoverageBed, complementBed, slopBed) need to know the size of each chromosome for the organism on which your BED files are based. When using the UCSC Genome Browser, Ensembl, or Galaxy, you typically indicate which species/genome build you are working with. The way you do this for BEDTools is to create a “genome” file, which simply lists the names of the chromosomes (or scaffolds, etc.) and their sizes (in base pairs).
Genome files must be tab-delimited and are structured as follows (this is an example for C. elegans):
chrI 15072421
chrII 15279323
...
chrX 17718854
chrM 13794
BEDTools includes pre-defined genome files for human and mouse in the /genomes directory included in the BEDTools distribution.
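To make the role of a genome file concrete, here is a minimal Python sketch that reads the two tab-delimited columns into a dictionary and uses the sizes to keep an interval within its chromosome. This only illustrates why chromosome sizes are needed; it is not how BEDTools is implemented, and `parse_genome_file` and `clamp_interval` are hypothetical names.

```python
from typing import Dict, Iterable, Tuple


def parse_genome_file(lines: Iterable[str]) -> Dict[str, int]:
    """Map chromosome name to size (in base pairs) from tab-delimited lines."""
    sizes: Dict[str, int] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("expected 2 tab-delimited columns: name, size")
        sizes[fields[0]] = int(fields[1])
    return sizes


def clamp_interval(start: int, end: int, chrom: str,
                   sizes: Dict[str, int]) -> Tuple[int, int]:
    """Clip BED start/end values to the range 0..chromosome size."""
    return max(0, start), min(end, sizes[chrom])


sizes = parse_genome_file(["chrI\t15072421\n", "chrM\t13794\n"])
print(clamp_interval(-50, 14000, "chrM", sizes))  # (0, 13794)
```

To read a genome file from disk, an open file handle can be passed directly, since iterating over it yields lines.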
The SAM / BAM format is a powerful and widely used format for storing sequence alignment data (see http://samtools.sourceforge.net/ for more details). It has quickly become the standard format to which most DNA sequence alignment programs write their output. Currently, the following BEDTools support input in BAM format: intersectBed, windowBed, coverageBed, genomeCoverageBed, pairToBed, bamToBed. Support for the BAM format in BEDTools allows one to (to name a few): compare sequence alignments to annotations, refine alignment datasets, screen for potential mutations and compute aligned sequence coverage.
The details of how these tools work with BAM files are addressed in Section 5 of this manual.
The Variant Call Format (VCF) was conceived as part of the 1000 Genomes Project as a standardized means to report genetic variation calls from SNP, INDEL and structural variant detection programs (see http://www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcf4.0 for details). BEDTools now supports the latest version of this format (i.e., Version 4.0). As a result, BEDTools can be used to compare genetic variation calls with other genomic features.