Topics (not comprehensive):
- Suffix Tree Construction
- Naive
- LCP Array construction and usage
- Ukkonen’s Algorithm (using the tricks)
- What are suffix links? how are they constructed? their role?
- Edge label compressions, ski/count
- time complexity
- BWT string: construction, reversal, and indexing algorithms
- Randomized algos (randomized quick sort and quick select)
- Benefit of the discussed randomized algorithms
- P, NP, NP-heard problems
- NP complete problems, proof, reduction, how to perform reduction
Suffix Tree
What it does: A data structure for string matching, a form of trie.
How it does it: Represents all suffixes of a string S in a compressed form.
- every internal node has at least 2 children
Append the ’$’ so that no suffix is the prefix of another suffix
Naive
Construction:
- Create n suffixes - O(n^2)
- Insert suffix i into tree - O(n^2):
- Find longest prefix between i and the existing tree, and break the edge at the last matched character
- Label nodes with paths to maintain structure, and adjust pointers
- Repeat for all j characters of substring
Search (for pattern P in text T):
- Follow a path down from root of suffix tree of T based on characters in P
Time complexity: O(|P| + m) (m=number of occurrences of P in S) - Assuming a finite alphabet
Ukkonen’s Algorithm
A O(n) construction algorithm for suffix trees
- With no optimizations, the algorithm is O(n^3).
- O(n) loop for each suffix
- O(n) traversal worst case for each string
- O(n) updates to update leaf ends
xx- With optimizations: - Skip links - Link internal nodes of suffix tree to other nodes for compression
- Skip/count trick - Skip to next node rather than next character when traversing the suffix tree
- Suffix link + skip-counts makes algo O(n^2)
- Edge-label compression - Use pointer to a global end to automatically update leaves (makes updating leaves O(1) rather than O(n))
Advanced Suffix Structures
Suffix Array
What it does: Helps with finding occurrences of a pattern in a text
- Can do Binary Search on the suffix array
- BS based pattern matching time:
- |P|=m, |S|=n, k is number of occurances of P
How it does it: A lexicographically sorted array of all n suffixes
- BS based pattern matching time:
Construction
- Can be done in
with naive algo (n strings to sort, O(n) comparison) - Can be constructed from suffix tree in linear time (lexicographic order DFS algorithm)
Longest Common Prefix Array (LCP)
What it does: Holds the length of the common prefix between i and (i-1) suffixes in the suffix array
Construction
BWT
Ranking
The rank array represents the lexicographical order of every character, as they occur in the BWT string.
BWT Algorithm
What it does: The algorithm creates a reversible string that can be used for data compression, that rearranges that order of characters of an input string
- This is the whole “rearranging so that compression becomes more efficient” idea
How it does it: It orders circular shifts of string T’
Construction
- Construct string T’=T+’$’
- Create all circular shifts of string T’, and sort them in lexicographic order for a BWT matrix
- Extract the last column of the BWT matrix to obtain the BWT string
Naive time: O(n^2) + O(n^2logn) + O(n) = O(n^2logn)
Constructing BWT using SA is O(n)
Reconstruction
- Start with ’$’ as character c
- Find position of character c in the BWT string
- Find the rank associated with index i (rank[i])
- Go to the index associated with the rank, and follow that ‘linked list’. Append the characters as you go
- Reverse your string
FM Index
What it is
- An index that combines BWT with a few small auxiliary data structures
- Has:
- BWT string, SA array, rank array, more?
What it does
- BWT string, SA array, rank array, more?
- Allows searching for P in BWM
- ?
Constructing BWT using SA is O(n)
Reversing BWT String is O(n)
Searching for a pattern P is O(m)
Randomized Algorithms
Randomized vs. Deterministic
- Minimimum cut problem (minimum number of cuts to create two disconnected graphs)
- Deterministic: Edmond-Karp Algorithm
- Randomized: Karger’s Algorithm (Monte Carlo) Random edge selection
Monte Carlo and Las Vegas
| Monte Carlo | Las Vegas | |
|---|---|---|
| Running time | Fixed (Deterministic) | Variable (Non-deterministic) |
| Correctness | Random / Probabilistic | Guaranteed |
| Examples | Random Forest / SGD | Quicksort |
Solving for randomized algorithm running times
Time complexity of a randomized algorithm:
Understand derivations for:
- Randomized Quicksort
- Randomized Selection
NP Completeness
P: Polynomial Time - Can be solved, but also checked in polynomial time.
- Sorting, searching, matching
NP: Nondeterministic Polynomial Time - Can be checked in polynomial time, unknown if can also be solved in polynomial time. - SAT, Hamilton Cycle, Subset Sum
NP-hard: Problems that are as hard as the hardest problems in NP. Verification not guaranteed in polynomial time. - Max-cut problem
NP-complete: A problem that is both NP (verifieable in polynomial time) and NP-hard (as hard as the hardest problem in NP).
Understand derivations for:
- SAT ∝ 0/1 Knapsack
- 3-CNF-SAT ≤p K-Clique
- K-Clique ≤p Independent Set (IS)
Polynomial Reduction
Formal Proof Steps
- Prove that L
NP (show it can be verified in polynomial time) - Prove that L
NP-hard by selecting a known NP-complete problem L’ - Perform a reduction?