Sketches
========

For sequences to be compared with :code:`mash`, they must first be `sketched`,
which creates vastly reduced representations of them. This will happen
automatically if :code:`mash dist` is given raw sequences. However, if multiple
comparisons will be performed, it is more efficient to create sketches with
:code:`mash sketch` first and provide them to :code:`mash dist` in place of the
raw sequences. Sketching parameters can be provided to either tool via
command line options.

Reduced representations with MinHash tables
-------------------------------------------
Sketches are used by the `MinHash` algorithm to allow fast distance estimations
with low storage and memory requirements. To make a sketch, each k-mer in a
sequence is `hashed`, which creates a pseudo-random identifier. By sorting these
identifiers (`hashes`), a small subset from the top of the sorted list can
represent the entire sequence (these are `min-hashes`). The more similar another
sequence is, the more min-hashes it is likely to share.

k-mer size
''''''''''
As in any k-mer based method, larger k-mers will provide more specificity, while
smaller k-mers will provide more sensitivity. Larger genomes will also require
larger k-mers to avoid k-mers that are shared by chance. K-mer size is
specified with :code:`-k`, and sketch files must have the same k-mer size to be
compared with :code:`mash dist`. When :code:`mash sketch` is run, it
automatically assesses the specified k-mer size against the sizes of input
genomes by estimating the probability of a random match as:

.. math::
  p = \frac 1 {\frac {\left(\overline\Sigma\right)^k} g + 1}
  
...where :math:`g` is the genome size and :math:`\Sigma` is the alphabet (ACGT
by default). If this probability exceeds a threshold (specified by
:code:`-w`; 0.01 by default) for any input genomes, a warning will be given
with the minimum k-mer size needed to get within the threshold.

For large collections of sketches, memory and storage may also be a
consideration when choosing a k-mer size. Mash will use 32-bit hashes, rather
than 64-bit, if they can encompass the full k-mer space for the alphabet in use.
This will (roughly) halve the size of the size of the sketch file on disk and
the memory it uses when loaded for :code:`mash dist`. The criterion for using a
32-bit hash is:

.. math::
   \left({\overline\Sigma}\right)^k \leq 2^{32}

...which becomes :math:`k \leq 16` for nucleotides (the default) and
:math:`k \leq 7` for amino acids.

sketch size
'''''''''''
Sketch size corresponds to the number of (non-redundant) min-hashes that are
kept. Larger sketches will better represent the sequence, but at the cost of
larger sketch files and longer comparison times. The error bound of a distance
estimation for a given sketch size :math:`s` is formulated as:

.. math::
  \sqrt{\frac{1}{s}}

Sketch size is specified with :code:`-s`. Sketches of different sizes can be
compared with :code:`mash dist`, although the comparison will be restricted to
the smaller of the two sizes.

Strand-independence with canonical k-mers
-----------------------------------------
By default, :code:`mash` will ignore strandedness when sketching by using
canonical k-mers, as done in `Jellyfish`_. This works by using the reverse
complement of a k-mer if it comes before the original k-mer alphabetically.
It also means k-mers that do not contain only nucleotides (A, C, G, T, and their
lowercases) must be ignored. To use every k-mer as it appears, :code:`-n`
(noncanonical) can be specified when sketching.

Cleaning up read sets with Bloom filtering
------------------------------------------

Since MinHash is a k-mer based method, removing unique k-mers greatly improves
results for read sets, since unique k-mers are likely to represent sequencing
error. :code:`mash` provides an efficient way to filter without prior k-mer
counting by using a Bloom filter. This method can underfilter, but it will
never overfilter (non-unique k-mers are guaranteed to be kept), and it requires
significantly less time and memory than true k-mer counting. The filter can be
enabled with :code:`-u` when sketching (in :code:`mash sketch` or :code:`mash
dist`). The amount of underfiltering can be managed with the parameters of the
Bloom filter (:code:`-g`, :code:`-e`, and :code:`-m`).

Working with sketch files
-------------------------

The sketch or sketches stored in a sketch file, and their parameters, can be 
inspected with :code:`mash info`. If sketch files have matching k-mer sizes,
their sketches can be combined into a single file with :code:`mash paste`. This
allows simple pairwise comparisons with :code:`mash dist`, and allows sketching
of multiple files to be parallelized.

.. _Jellyfish: http://www.cbcb.umd.edu/software/jellyfish/