Wednesday, September 6, 2017

KMP string search algorithm

https://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm
In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
Preprocessing Overview (geeksforgeeks.org):
  • KMP algorithm does preproceses pat[] and constructs an auxiliary lps[] of size m (same as size of pattern) which is used to skip characters while matching.
  • name lps indicates longest proper prefix which is also suffix.. A proper prefix is prefix with whole string not allowed. For example, prefixes of “ABC” are “”, “A”, “AB” and “ABC”. Proper prefixes are “”, “A” and “AB”. Suffixes of the string are “”, “C”, “BC” and “ABC”.
  • For each sub-pattern pat[0..i] where i = 0 to m-1, lps[i] stores length of the maximum matching proper prefix which is also a suffix of the sub-pattern pat[0..i].
       lps[i] = the longest proper prefix of pat[0..i] 
                  which is also a suffix of pat[0..i]. 
The pattern is:
1. Initialization: longest prefix suffix len = 0, i =1, lps[0] = 0
2. Assume the current index is i; and lps[0,...,i-1] are known.
    the length of the previous longest prefix suffix is 
    lps[i-1] len. That means pat[0,...,len-1] = pat[i-len,...,i-1].
3. If pat[len] == pat[i], then lps[i] = lps[i-1]+1 = len +1. 
    Continue with len = len+1 and i = i+1 and do step 3 again.
4. If then pat[len] != pat[i] 
    4.0 if len != 0. 
         4.01 Let len' = lps[len-1]
              that means pat[0,...,len'-1] = pat[len-len',...,len-1] = pat[i-len',...,i-1].
          To find lps[i], we need to check if pat[len'] matches pat[i]. 
          4.02 if: pat[len'] == pat[i]
                  then pattern in the green boxes matches each other: 
                  pat[0,...,len'] = pat[i-len',...,i]and we get 
                  lps[i] = len' + 1 = lps[len-1] +1.
          4.03 else: pat[len'] != pat[i]
                  repeat step 4.01 and 4.02 as: 
                  assume len'' = lps[len'-1], check if pat[len''] matches pat[i].
           Note that steps 4.01, 4.02 and 4.03 is just a repeat of steps 2, 3, 4.
     4.1 if len == 0.
           lps[i] = 0, i = i+1 and do step 3 and 4 again.
The code to get lps[] 
// Fills lps[] for given patttern pat[0..M-1] void computeLPSArray(char *pat, int M, int *lps) { // length of the previous longest prefix suffix int len = 0; lps[0] = 0; // lps[0] is always 0 // the loop calculates lps[i] for i = 1 to M-1 int i = 1; while (i < M) { if (pat[i] == pat[len]) { len++; lps[i] = len; i++; } else // (pat[i] != pat[len]) { // This is tricky. Consider the example. // AAACAAAA and i = 7. The idea is similar // to search step. if (len != 0) { len = lps[len-1]; // Also, note that we do not increment // i here } else // if (len == 0) { lps[i] = 0; i++; } } } }

No comments:

Post a Comment