Conrado Martinez: Sublinear Random Sampling and Applications
- https://www.cs.upc.edu/en/events/conrado-martinez-sublinear-random-sampling-and-applications
- 2026-01-09T12:00:00+01:00
- 2026-01-09T13:00:00+01:00
Jan 09, 2026 from 12:00 PM to 01:00 PM (Europe/Madrid / UTC+01:00)
Consider the following scenario: we have a large database of gene sequences, hundreds of thousands or even millions (e.g., in GenBank). To enable fast similarity search, clustering, phylogenetic reconstruction, etc., it is usual to consider the set of distinct k-mers of each sequence in the database (typical values of k are around 20). Instead of using the full set of k-mers of each sequence, however, we use a sketch (a random sample): computing the similarity (Jaccard, cosine, ...) of the sketches yields accurate estimates of the true similarities.
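To make the sketching idea concrete, here is a minimal sketch (not the mash/sourmash implementations; the hash function, the tiny parameter k = 4, and the toy sequences are illustrative choices) of a bottom-k sketch and the standard Jaccard estimator over the k smallest hashes of the union:

```python
import hashlib

def kmers(seq, k=20):
    """Set of distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def h(x):
    """64-bit hash of a string (illustrative choice of hash function)."""
    return int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(), "big")

def bottom_k_sketch(items, k=4):
    """Bottom-k sketch: keep the k smallest hash values."""
    return sorted(map(h, items))[:k]

def jaccard_estimate(sketch_a, sketch_b, k=4):
    """Estimate Jaccard similarity from two bottom-k sketches:
    among the k smallest hashes of the union of the sketches,
    count the fraction that appear in both."""
    union_k = sorted(set(sketch_a) | set(sketch_b))[:k]
    inter = set(sketch_a) & set(sketch_b)
    return sum(1 for v in union_k if v in inter) / len(union_k)
```

Because the sketch size k is fixed regardless of how many distinct k-mers a sequence has, the work per comparison is constant, which is exactly the appeal of these fixed-size schemes.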
One problem we then face is producing random samples for each gene sequence efficiently. Practical constraints require, among other things, that the algorithms can be parallelized and that the size n of the set from which we draw the samples is not known in advance.
Algorithms such as MinHash (mash) or bottom-k are simple and efficient, and produce random samples of a given fixed size. Because the size of the sketches is fixed in advance, these schemes have limited accuracy and adapt poorly when the database contains very different types of gene sequences (small/medium/large sequences, low/high variability, ...). Recent proposals, such as FracMinHash (sourmash), are also fast, simple and efficient, but produce random samples of linear size w.r.t. the size n of the population. For FracMinHash, similarity estimates are very accurate, but at the expense of bigger sketches and, hence, larger processing times too.
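The core of the FracMinHash rule can be sketched in a few lines (a simplified illustration, not the sourmash implementation; the hash function and scaling constant are illustrative). Every element whose hash falls below a fixed fraction of the hash range is kept, so the expected sketch size is frac * n, i.e., linear in the population size n:

```python
import hashlib

MAX64 = 2**64 - 1  # top of the 64-bit hash range

def h(x):
    """64-bit hash of a string (illustrative choice of hash function)."""
    return int.from_bytes(hashlib.blake2b(x.encode(), digest_size=8).digest(), "big")

def fracminhash(items, frac=0.1):
    """Keep every element whose hash falls below frac * MAX64.
    Since hashes are (approximately) uniform on [0, MAX64], each
    element survives with probability frac, so the expected sketch
    size is frac * n: linear in the population size n."""
    threshold = frac * MAX64
    return {x for x in items if h(x) < threshold}
```

Because the survival decision for each element depends only on its own hash, the same element is kept or dropped consistently across sequences and across threads, which is what makes the similarity estimates so well-behaved; the price is the linear sketch size the abstract mentions.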
In the talk I will describe my recent work on two sublinear random sampling algorithms. The first is Affirmative Sampling (Lumbroso and M., 2022), which is the first of its kind, as far as we know. Affirmative Sampling is simple and efficient, and its standard variant produces samples of (expected) size k ln(n/k) + O(k), for a fixed parameter k. Another variant generates samples of expected size Theta(n^alpha), for a given value alpha, 0 < alpha < 1. Unfortunately, parallelization of Affirmative Sampling will render different samples depending on how the database is "split" among parallel threads, which is not desirable in practice.
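A rough one-pass sketch of the admission idea behind the standard variant (my simplified reading of the scheme, not the authors' code; the hash function, the admission rule as written, and the parameter k = 8 below are assumptions for illustration). An element is admitted when its hash exceeds the k-th largest hash in the current sample, and admitted elements are never evicted; the i-th distinct element is then admitted with probability about k/i, which sums to the k ln(n/k) + O(k) expected size:

```python
import hashlib

def h(x):
    """64-bit hash of a string (illustrative choice of hash function)."""
    return int.from_bytes(hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")

def affirmative_sample(stream, k=8):
    """Simplified one-pass illustration of an Affirmative-Sampling-style
    admission rule: admit an element if the sample still has fewer than k
    elements, or if its hash beats the k-th largest hash in the sample;
    never evict. Duplicates in the stream are ignored."""
    sample = {}  # element -> hash value
    for z in stream:
        if z in sample:
            continue
        hz = h(z)
        if len(sample) < k or hz > sorted(sample.values(), reverse=True)[k - 1]:
            sample[z] = hz
    return list(sample)
```

Note that whether a later element is admitted depends on the hashes seen before it, so splitting the stream across threads changes which elements clear the bar; this history dependence is exactly the parallelization drawback mentioned above.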
In more recent work, we introduce MaxGeomHash (Hera, Koslicki and M., submitted), which admits "deterministic" parallelization and has better concentration around the expected size of the samples. The algorithm is also very simple, and efficient in practice. The standard variant (MGH) produces samples of expected size k log_2(n/k) + k + O(1), with variance 1.215... k + o(1); for the variant alpha-MGH, 0 < alpha < 1, the expected sample size is f(alpha) n^alpha + o(n^alpha), for an explicitly computable f(alpha), and the variance is n^alpha + o(n^alpha). MGH achieves accuracy close to that of FracMinHash, while needing only a fraction of the computational resources required by FracMinHash.
We will end by briefly describing some applications and implications of these algorithms, beyond the initial motivating example from computational biology.