HTSeq is a Python package that provides infrastructure to process data from high-throughput sequencing assays.
Download links and installation instructions can be found here.
The Tour shows you how to get started. It explains how to install HTSeq, and then demonstrates typical analysis steps with explicit examples. Read this first, and then see the Reference for details.
A detailed use case: TSS plots
This chapter illustrates typical usage patterns for HTSeq by working through three different solutions to the same programming task in detail.
Reference documentation
The various classes of HTSeq are described here.
Sequences and FASTA/FASTQ files
Sequences and reads (i.e., sequences with base-call quality information) are represented by the classes Sequence and SequenceWithQualities. The classes FastaReader and FastqReader parse FASTA and FASTQ files.
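To make the idea of a "sequence with base-call qualities" concrete, here is a minimal pure-Python sketch of reading FASTQ records. It does not use HTSeq and is not how HTSeq is implemented; the names `Read` and `parse_fastq` are hypothetical, and Phred+33 quality encoding is assumed.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical stand-in for a sequence with base-call qualities."""
    name: str
    seq: str
    qual: List[int]  # Phred scores, one per base


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[Read]:
    """Yield one Read per four-line FASTQ record (Phred+33 assumed by default)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        try:
            seq = next(it).rstrip("\n")
            next(it)  # '+' separator line, content ignored
            qual_str = next(it).rstrip("\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if len(qual_str) != len(seq):
            raise ValueError("sequence and quality lengths differ")
        # Decode each quality character into an integer Phred score.
        yield Read(header[1:], seq, [ord(c) - offset for c in qual_str])


# Usage: parse a single in-memory record.
records = list(parse_fastq(["@read1\n", "ACGT\n", "+\n", "II5!\n"]))
```

Keeping the qualities as a list of integers alongside the sequence string mirrors the distinction the text draws between a plain sequence and a read.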
Genomic intervals and genomic arrays
The classes GenomicInterval and GenomicPosition represent intervals and positions in a genome. The class GenomicArray is an all-purpose container with easy access via a genomic interval or position, and GenomicArrayOfSets is a special case that is useful for dealing with genomic features (such as genes, exons, etc.).
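The following pure-Python sketch illustrates what "a container accessed via a genomic interval" can mean in practice. It is a toy under stated assumptions, not HTSeq's data structure: `Interval` and `CoverageArray` are hypothetical names, and intervals are taken to be zero-based and half-open.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    """Hypothetical stand-in for a genomic interval: zero-based, half-open."""
    chrom: str
    start: int
    end: int


class CoverageArray:
    """Toy per-position integer store, keyed by chromosome and position."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[int, int]] = defaultdict(dict)

    def add(self, iv: Interval, value: int = 1) -> None:
        """Add `value` at every position covered by `iv`."""
        chrom = self._data[iv.chrom]
        for pos in range(iv.start, iv.end):
            chrom[pos] = chrom.get(pos, 0) + value

    def values(self, iv: Interval) -> List[int]:
        """Return the stored value for each position in `iv` (0 if unset)."""
        chrom = self._data.get(iv.chrom, {})
        return [chrom.get(pos, 0) for pos in range(iv.start, iv.end)]


# Usage: two overlapping "reads" add coverage; then query a sub-interval.
cov = CoverageArray()
cov.add(Interval("chr1", 100, 105))
cov.add(Interval("chr1", 103, 108))
profile = cov.values(Interval("chr1", 101, 107))
```

Storing one dictionary entry per position is simple but memory-hungry; a real container would more plausibly store runs of equal values.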
The classes described here are used to process the output of short-read aligners in various formats (e.g., SAM). They represent both the output files and the individual alignments, i.e., reads with their alignment information.
The classes GenomicFeature and GFF_Reader help to deal with genomic annotation data.
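As an illustration of what dealing with annotation data involves, the sketch below parses one tab-separated annotation line with GTF-style attributes into a small record. It is independent of HTSeq; `Feature` and `parse_gtf_line` are hypothetical names. It assumes nine columns with one-based, inclusive coordinates, and converts them to zero-based, half-open ones.

```python
from typing import Dict, NamedTuple


class Feature(NamedTuple):
    """Hypothetical stand-in for an annotated genomic feature."""
    chrom: str
    kind: str
    start: int   # zero-based, inclusive
    end: int     # zero-based, exclusive
    strand: str
    attrs: Dict[str, str]


def parse_gtf_line(line: str) -> Feature:
    """Parse one nine-column, tab-separated line with GTF-style attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    chrom, _source, kind, start, end, _score, strand, _frame, attr_str = fields
    attrs: Dict[str, str] = {}
    # Naive split: assumes no ';' occurs inside a quoted attribute value.
    for item in attr_str.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    # Convert one-based inclusive coordinates to zero-based half-open.
    return Feature(chrom, kind, int(start) - 1, int(end), strand, attrs)


# Usage: a single made-up exon line.
feature = parse_gtf_line(
    'chr1\tdemo\texon\t101\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n'
)
```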
Scripts
The following scripts can be used without any Python knowledge.
Quality Assessment with htseq-qa
Given a FASTQ or SAM file, this script produces a PDF file with plots depicting the base calls and base-call qualities by position in the read. This is useful to assess the technical quality of a sequencing run.
Counting reads in features with htseq-count
Given a SAM file with alignments and a GFF file with genomic features, this script counts how many reads map to each feature.
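The counting task can be sketched in a few lines of plain Python. This is a deliberate simplification and not the script's actual algorithm: reads and features are reduced to plain tuples, strand is ignored, and the handling of reads that overlap no feature or several features (along with the label names `__no_feature` and `__ambiguous`) is an assumption made for this example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Feature = Tuple[str, str, int, int]  # (feature name, chrom, start, end)
Read = Tuple[str, int, int]          # (chrom, start, end), half-open


def count_reads(reads: Iterable[Read], features: List[Feature]) -> Counter:
    """Count reads per feature; a read counts only if it hits exactly one."""
    counts = Counter({name: 0 for name, _chrom, _start, _end in features})
    for chrom, start, end in reads:
        # Names of all features that overlap this read on the same chromosome.
        hits = {
            name
            for name, f_chrom, f_start, f_end in features
            if f_chrom == chrom and start < f_end and f_start < end
        }
        if not hits:
            counts["__no_feature"] += 1
        elif len(hits) > 1:
            counts["__ambiguous"] += 1
        else:
            counts[hits.pop()] += 1
    return counts


# Usage with made-up data: two overlapping genes and four reads.
features = [("geneA", "chr1", 100, 200), ("geneB", "chr1", 150, 300)]
reads = [("chr1", 110, 140), ("chr1", 160, 190), ("chr1", 250, 280), ("chr2", 10, 40)]
counts = count_reads(reads, features)
```

The linear scan over all features for every read is only workable for tiny inputs; an interval-indexed container is the natural replacement at realistic scale.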
Appendices
HTSeq is developed by Simon Anders at EMBL Heidelberg (Genome Biology Unit). Please do not hesitate to contact me (anders at embl dot de) if you have any comments or questions.
HTSeq is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
The full text of the GNU General Public License, version 3, can be found here: http://www.gnu.org/licenses/gpl-3.0-standalone.html