The problem of String Matching Given a string S, the problem of string matching deals with finding whether a pattern p occurs in S and if p does occur.

Презентация:



Advertisements
Похожие презентации
SPLAY TREE The basic idea of the splay tree is that every time a node is accessed, it is pushed to the root by a series of tree rotations. This series.
Advertisements

HPC Pipelining Parallelism is achieved by starting to execute one instruction before the previous one is finished. The simplest kind overlaps the execution.
Sequences Sequences are patterns. Each pattern or number in a sequence is called a term. The number at the start is called the first term. The term-to-term.
Combination. In mathematics a combination is a way of selecting several things out of a larger group, where (unlike permutations) order does not matter.
Conditional Statements. Program control statements modify the order of statement execution. Statements in a C program normally executes from top to bottom,
Lesson 2. How to say hello & goodbye ?. When we first meet someone whether it is a person we know or someone we are meeting for the first time, we will.
Section 2.1: Use Inductive Reasoning Conjecture: A conjecture is an unproven statement that is based on observations; an educated guess. Inductive Reasoning:
Take one minute to prepare a talk on the following subject. Take notes if you like and remember to include reasons and examples. You should then speak.
Benford Benford's law, also called the first-digit law, states that in lists of numbers from many (but not all) real-life sources of data, the leading.
Describe a movie which made a strong impression on you. You should say: which movie it was – the name what the movie was about who the main stars were.
What to expect? How to prepare? What to do? How to win and find a good job? BUSINESS ENGLISH COURSE NOVA KAKHOVKA GUMNASUIM 2012.
Why do we learn English at schools. (by Kurdina Ekaterina) Learning a new language often begins at a young age and, at some schools, is continued throughout.
DRAFTING and DIMENSIONING 98. A properly dimensioned drawing of a part is very important to the manufacturing outcome. With CATIA, it can be a very simple.
Computers are a necessary part of modern life. Computers play an important role in the lives of most of us today, whether we realize it or not. Some people,
In America Unemployment Why does the government collect statistics on the unemployed? Why do we care?
1 Another useful model is autoregressive model. Frequently, we find that the values of a series of financial data at particular points in time are highly.
AVL-Trees COMP171 Fall AVL Trees / Slide 2 Balanced binary tree * The disadvantage of a binary search tree is that its height can be as large as.
Convolutional Codes Mohammad Hanaysheh Mahdi Barhoush.
In mathematics, the notion of permutation is used with several slightly different meanings, all related to the act of permuting (rearranging) objects.
Multiples Michael Marchenko. Definition In mathematics, a multiple is the product of any quantity and an integer. in other words, for the quantities a.
Транксрипт:

The problem of String Matching Given a string S, the problem of string matching deals with finding whether a pattern p occurs in S and if p does occur then returning position in S where p occurs.

…. a O(mn) approach One of the most obvious approach towards the string matching problem would be to compare the first element of the pattern to be searched p, with the first element of the string S in which to locate p. If the first element of p matches the first element of S, compare the second element of p with second element of S. If match found proceed likewise until entire p is found. If a mismatch is found at any position, shift p one position to the right and repeat comparison beginning from first element of p.

How does the O(mn) approach work Below is an illustration of how the previously described O(mn) approach works. String S abcabaabcabac abaa Pattern p

Step 1:compare p[1] with S[1] S abcabaabcabac pabaa Step 2: compare p[2] with S[2] Sabcabaabcabac pabaa

Step 3: compare p[3] with S[3] S abaa p Mismatch occurs here.. Since mismatch is detected, shift p one position to the left and perform steps analogous to those from step 1 to step 3. At position where mismatch is detected, shift p one position to the right and repeat matching procedure.abcabaabcabac

S abcabaabcabac abaa p Finally, a match would be found after shifting p three times to the right side. Drawbacks of this approach: if m is the length of pattern p and n the length of string S, the matching time is of the order O(mn). This is a certainly a very slow running algorithm. What makes this approach so slow is the fact that elements of S with which comparisons had been performed earlier are involved again and again in comparisons in some future iterations. For example: when mismatch is detected for the first time in comparison of p[3] with S[3], pattern p would be moved one position to the right and matching procedure would resume from here. Here the first comparison that would take place would be between p[0]=a and S[1]=b. It should be noted here that S[1]=b had been previously involved in a comparison in step 2. this is a repetitive use of S[1] in another comparison. It is these repetitive comparisons that lead to the runtime of O(mn).

The Knuth-Morris-Pratt Algorithm Knuth, Morris and Pratt proposed a linear time algorithm for the string matching problem. A matching time of O(n) is achieved by avoiding comparisons with elements of S that have previously been involved in comparison with some element of the pattern p to be matched. i.e., backtracking on the string S never occurs

Components of KMP algorithm The prefix function, Π The prefix function,Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern p. In other words, this enables avoiding backtracking on the string S. The KMP Matcher With string S, pattern p and prefix function Π as inputs, finds the occurrence of p in S and returns the number of shifts of p after which occurrence is found.

The prefix function, Π Following pseudocode computes the prefix fucnction, Π: Compute-Prefix-Function (p) 1 m length[p] //p pattern to be matched 2 Π[1] 0 3 k 0 4 for q 2 to m 5 do while k > 0 and p[k+1] != p[q] 6 do k Π[k] 7 If p[k+1] = p[q] 8 then k k +1 9 Π[q] k 10 return Π

Example: compute Π for the pattern p below: p ababaca q pababaca Π00 Initially: m = length[p] = 7 Π[1] = 0 k = 0 Step 1: q = 2, k=0 Π[2] = 0 Step 2: q = 3, k = 0, Π[3] = 1 Step 3: q = 4, k = 1 Π[4] = 2 q pababaca Π001 q pababacA Π0012

Step 4: q = 5, k =2 Π[5] = 3 Step 5: q = 6, k = 3 Π[6] = 1 Step 6: q = 7, k = 1 Π[7] = 1 After iterating 6 times, the prefix function computation is complete: q pababaca Π00123 q pababaca Π q pababaca Π q pabAbaca Π

The KMP Matcher The KMP Matcher, with pattern p, string S and prefix function Π as input, finds a match of p in S. Following pseudocode computes the matching component of KMP algorithm: KMP-Matcher(S,p) 1 n length[S] 2 m length[p] 3 Π Compute-Prefix-Function(p) 4 q 0 //number of characters matched 5 for i 1 to n //scan S from left to right 6 do while q > 0 and p[q+1] != S[i] 7 do q Π[q] //next character does not match 8 if p[q+1] = S[i] 9 then q q + 1 //next character matches 10 if q = m //is all of p matched? 11 then print Pattern occurs with shift i – m 12 q Π[ q] // look for the next match Note: KMP finds every occurrence of a p in S. That is why KMP does not terminate in step 12, rather it searches remainder of S for any more occurrences of p.

Illustration: given a String S and pattern p as follows: S bacbabababacaca ababaca p Let us execute the KMP algorithm to find whether p occurs in S. For p the prefix function, Π was computed previously and is as follows:q pabAbaca Π

bacbabababacaab bacbabababacaab ababaca ababaca Initially: n = size of S = 15; m = size of p = 7 Step 1: i = 1, q = 0 comparing p[1] with S[1] S p P[1] does not match with S[1]. p will be shifted one position to the right. S p Step 2: i = 2, q = 0 comparing p[1] with S[2] P[1] matches S[2]. Since there is a match, p is not shifted.

Step 3: i = 3, q = 1 bacbabababacaab ababaca Comparing p[2] with S[3] Sbacbabababacaab bacbabababacaab ababaca ababaca p S p S p p[2] does not match with S[3] Backtracking on p, comparing p[1] and S[3] Step 4: i = 4, q = 0 comparing p[1] with S[4]p[1] does not match with S[4] Step 5: i = 5, q = 0 comparing p[1] with S[5] p[1] matches with S[5]

bacbabababacaab bacbabababacaab bacbabababacaab ababaca ababaca ababaca Step 6: i = 6, q = 1 S p Comparing p[2] with S[6]p[2] matches with S[6] S p Step 7: i = 7, q = 2 Comparing p[3] with S[7]p[3] matches with S[7] Step 8: i = 8, q = 3 Comparing p[4] with S[8]p[4] matches with S[8] S p

Step 9: i = 9, q = 4 Comparing p[5] with S[9] Comparing p[6] with S[10] Comparing p[5] with S[11] Step 10: i = 10, q = 5 Step 11: i = 11, q = 4 S S S p p p bacbabababacaab bacbabababacaab bacbabababacaab ababaca ababaca ababaca p[6] doesnt match with S[10] Backtracking on p, comparing p[4] with S[10] because after mismatch q = Π[5] = 3 p[5] matches with S[9] p[5] matches with S[11]

bacbabababacaab bacbabababacaab ababaca ababaca Step 12: i = 12, q = 5 Comparing p[6] with S[12] Comparing p[7] with S[13] S S p p Step 13: i = 13, q = 6 p[6] matches with S[12] p[7] matches with S[13] Pattern p has been found to completely occur in string S. The total number of shifts that took place for the match to be found are: i – m = 13 – 7 = 6 shifts.

Running - time analysis Compute-Prefix-Function (Π) 1 m length[p] //p pattern to be matched 2 Π[1] 0 3 k 0 4 for q 2 to m 5 do while k > 0 and p[k+1] != p[q] 6 do k Π[k] 7 If p[k+1] = p[q] 8 then k k +1 9 Π[q] k 10 return Π In the above pseudocode for computing the prefix function, the for loop from step 4 to step 10 runs m times. Step 1 to step 3 take constant time. Hence the running time of compute prefix function is Θ(m). KMP Matcher 1 n length[S] 2 m length[p] 3 Π Compute-Prefix-Function(p) 4 q 0 5 for i 1 to n 6 do while q > 0 and p[q+1] != S[i] 7 do q Π[q] 8 if p[q+1] = S[i] 9 then q q if q = m 11 then print Pattern occurs with shift i – m 12 q Π[ q] The for loop beginning in step 5 runs n times, i.e., as long as the length of the string S. Since step 1 to step 4 take constant time, the running time is dominated by this for loop. Thus running time of matching function is Θ(n).