

4.10 BppSeqMan: Bio++ Sequence Manipulation

The Bio++ Sequence Manipulator converts between various file formats and can also perform various operations on sequences. It uses the common options for setting the alphabet, loading the sequences (see Sequences) and writing the resulting data set (see WritingSequences). The “Generic” alphabet can be used if only file format conversion is to be performed, but the correct alphabet must be specified for more advanced manipulations, such as in silico molecular biology.

BppSeqMan can perform any number of elementary operations, in any order, provided that the output of operation n is compatible with the input of operation n+1, and that operation 1 is compatible with the input data.
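The compatibility rule can be pictured with a small Python sketch. This is only an illustrative model, not BppSeqMan's actual code: the table OPERATIONS and the function check_chain are hypothetical names, and the accepted alphabets are taken from the [[alphabet = DNA or RNA]] annotations used in the operation list of this section.

```python
# Hypothetical model of the chaining rule; not part of Bio++.
NUCLEIC = {"DNA", "RNA"}

def _other(alphabet):
    """Return the other nucleic alphabet (DNA <-> RNA)."""
    return "RNA" if alphabet == "DNA" else "DNA"

# For each operation: the alphabets it accepts, and a function that
# maps the input alphabet to the output alphabet.
OPERATIONS = {
    "Complement": (NUCLEIC, lambda a: a),
    "Transcript": (NUCLEIC, _other),
    "Switch": (NUCLEIC, _other),
    "Translate": (NUCLEIC, lambda a: "Proteins"),
}

def check_chain(input_alphabet, operations):
    """Return the final alphabet, or raise ValueError at the first
    operation whose input is incompatible with the previous output."""
    alphabet = input_alphabet
    for step, name in enumerate(operations, start=1):
        accepted, output_of = OPERATIONS[name]
        if alphabet not in accepted:
            raise ValueError(
                f"operation {step} ({name}) cannot take {alphabet} input")
        alphabet = output_of(alphabet)
    return alphabet

print(check_chain("DNA", ["Transcript", "Translate"]))
```

In this model, a chain such as Translate followed by Complement is rejected, because Complement expects a nucleic alphabet and Translate outputs proteins.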

Specific options:

sequence.manip = {list<string>}
The list, in appropriate order, of elementary operations to perform. See below for a list of these operations.
Complement [[alphabet = DNA or RNA]]
Convert to the complementary sequence, keeping the original alphabet.
Transcript [[alphabet = DNA or RNA]]
Convert to the complementary sequence, switching the type of alphabet (DNA<->RNA).
Switch [[alphabet = DNA or RNA]]
Change the alphabet type (DNA<->RNA).
Translate(code = {genetic code}) [[alphabet = DNA or RNA]]
Convert to proteins. You have to specify a genetic code, see specific options. code: The genetic code to use for the translation, one of
Invert
Invert the sequence 5' <-> 3' or N <-> C
RemoveGaps
Remove all gaps in sequences (i.e., 'unalign').
GapToUnknown
Change gaps to fully unresolved characters, N for nucleotides and X for proteins.
UnknownToGap
Change (partially) unresolved characters to gaps.
RemoveStops
Remove all stop codons in sequences. If sequences are aligned, stop codons will be replaced by gaps.
RemoveColumnsWithStops
Remove all sites with at least one stop codon.
GetCDS
Remove the first stop codon and everything after in codon sequences.
CoerceToAlignment
Try to convert a set of sequences to an alignment. This will fail if the sequences do not all have the same length. This step is required before using the 'ResolveDotted' or 'KeepComplete' commands.
ResolveDotted(alphabet={RNA|DNA|Proteins}) [[Aligned sequences]]
Convert a human-readable alignment to a machine-readable alignment. If used, this manipulation must come first, and the data must be loaded with the Generic alphabet. alphabet: The alphabet to use in order to resolve a dotted alignment.
KeepComplete(maxGapAllowed={int>0} or {float[0,100]}+%) [[Aligned sequences]]
Keep only complete sites, i.e., sites without any gap. Sites with unresolved characters are not removed. It is also possible to set a maximum number or proportion of gaps, see specific options. maxGapAllowed: The maximum number of gaps (integer) or proportion of gaps (percentage, followed by %) allowed.
GetCodonPosition(position={1|2|3})
Retrieve the given positions from codon sequences (aligned or not).
FilterFromTree(tree.file={path}, tree.format={chars})
Get a subset of sequences based on a tree file. The order of sequences in the output will reflect the tree structure. All sequences that do not have a corresponding leaf in the tree, based on the sequence name, will be removed. This method can therefore be used for subsetting a list of sequences and/or rearranging them in a more convenient order.
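To make the difference between Complement, Transcript, Switch and Invert concrete, here is a minimal Python sketch of what each description above implies for a plain nucleotide string. It is not Bio++ code: the function names are hypothetical, and the sketch assumes upper-case, fully resolved characters (gaps and other symbols are left untouched).

```python
# Illustrative only: hypothetical helpers mirroring the descriptions above.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")

def complement(seq, alphabet):
    """Complementary sequence, same alphabet (Complement)."""
    table = DNA_COMPLEMENT if alphabet == "DNA" else RNA_COMPLEMENT
    return seq.translate(table)

def switch(seq, alphabet):
    """Change the alphabet type only (Switch); returns (seq, new_alphabet)."""
    if alphabet == "DNA":
        return seq.replace("T", "U"), "RNA"
    return seq.replace("U", "T"), "DNA"

def transcript(seq, alphabet):
    """Complementary sequence in the other alphabet (Transcript)."""
    return switch(complement(seq, alphabet), alphabet)

def invert(seq):
    """Reverse the sequence, 5' <-> 3' or N <-> C (Invert)."""
    return seq[::-1]

print(complement("ATGC", "DNA"))   # TACG
print(switch("ATGC", "DNA"))       # ('AUGC', 'RNA')
print(transcript("ATGC", "DNA"))   # ('UACG', 'RNA')
print(invert("ATGC"))              # CGTA
```

Note that none of these operations reverses and complements in a single step; under the descriptions above, a reverse complement would be obtained by chaining Complement and Invert.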

Examples of use:

•Just change file format:
          sequence.manip=

•Change DNA to RNA:
          sequence.manip=Switch

•Unalign sequences, perform transcription and translate to proteins:
          sequence.manip=RemoveGaps,Transcript,Translate

•Change all unresolved characters to gaps and keep only positions with fewer than 5 gaps:
          sequence.manip=UnknownToGap,KeepComplete(maxGapAllowed=5)

•Keep only positions with less than 30% of gaps, and change the remaining gaps to unresolved characters:
          sequence.manip=KeepComplete(maxGapAllowed=30%),GapToUnknown
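
The two forms of maxGapAllowed used in the last two examples (an absolute count, or a percentage followed by %) can be sketched in Python as follows. This is a hypothetical re-implementation for illustration, not the BppSeqMan source. In particular, it assumes that a site is kept when its number of gaps does not exceed the threshold; whether the bound is inclusive or strict in BppSeqMan itself is not stated here, so treat that comparison as an assumption.

```python
# Illustrative sketch; function names and boundary handling are assumptions.
def parse_max_gap_allowed(value, n_sequences):
    """Turn '5' or '30%' into a maximum number of gaps per site."""
    if value.endswith("%"):
        return float(value[:-1]) / 100.0 * n_sequences
    return int(value)

def keep_complete(alignment, max_gap_allowed="0"):
    """Keep only the sites (columns) whose gap count is within the limit.

    alignment: list of equal-length strings, '-' marks a gap.
    """
    limit = parse_max_gap_allowed(max_gap_allowed, len(alignment))
    n_sites = len(alignment[0])
    kept = [i for i in range(n_sites)
            if sum(seq[i] == "-" for seq in alignment) <= limit]  # assumed inclusive
    return ["".join(seq[i] for i in kept) for seq in alignment]

alignment = ["AC-T",
             "A--T",
             "ACGT",
             "A-GT"]
print(keep_complete(alignment))          # only gap-free sites
print(keep_complete(alignment, "50%"))   # up to 2 gaps out of 4 sequences
```

Unresolved characters such as N or X are not counted as gaps in this sketch, which matches the statement that KeepComplete does not remove sites with unresolved characters.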