Author |
Topic: Re: substring search (Read 9294 times) |
|
sameer_r
Guest
|
Hi everyone, I didn't see any message posted regarding the substring search problem. Pardon me if a thread already exists for it. I was thinking about a solution to the problem and came up with the following algorithm (only time efficient) Say A is the string to be searched in the string B. for e.g. A = "xyxz" B = "xyzxyxz" Sort A and Sort B keeping track of the original indices of each character. A = "xxyz" with original indices 1,3,2,4 B = "xxxyyzz" with original indices 1,4,6,2,5,3,7 Allocate an integer array (counter) equal to the size of B. Now for every character in B, do a binary search to locate the corresponding character in A (there could be more than one occurence of the same character). Say for the first x in B, we locate the first x in A with original index 1. (since its sorted the other x's would be easy to find by just increasing the index until we came to a character > x) Now if this character in B were to be a part of the substring A, then the beginning of the string would have been (original index of this character in B - original index of this character in A + 1). In the case of the first x in A: 1-1+1 or in the case of second x in A : 1-3+1<0 (hence cannot be) This means that if the first x in B were a part of the substring A, then it could be the first character of A and not the 3rd. Now if this x is the first character of substring A, then 1 should be the index from which the substring starts in B and hence in Count array we increment Count[1] by 1. Finally when we are done looking at all the elements of B, then we scan through the Count array and all those offsets that have the value equal to the length of substring A, are in fact the beginning locations of the substring A in B. Thanks, Sameer.
|
|
IP Logged |
|
|
|
Papa Homer
Guest
|
I am feeling a bit slow right now, so could you provide some more details for you solution? Specifically, the runtime complexity for each step as well as an explaination of why it actually works. Thanks.
|
|
IP Logged |
|
|
|
sameer_r
Guest
|
The algorithm has the following steps: sort A and B sorting A theta(m) sorting B theta(n) assuming that we know in advance the range of characters that the strings can contain. go thru each element of B (n) search that element in A theta(lg(n)) go thru the matching element in A (at worst m) (this part I am not sure of. looks like an average case analysis like in quick sort is needed) finally go thru the count array and find the offsets where your substring starts in B. theta(n) so i guess the complexity in worst case comes out to be theta(nm), though I think it should be better than that.
|
|
IP Logged |
|
|
|
Papa Homer
Guest
|
Yeah, that's the complexity of a brute force method: for every character in B, see if the next m characters make up A.
|
|
IP Logged |
|
|
|
sameer_r
Guest
|
For Brute force method, the run-time is always theta(nm), but for this algo, the worst case is theta(nm). It could perform better. Even for quick-sort the worst case run-time is theta(n^2), but average case is theta(nlgn), which is pretty good, in fact the theoritically the best possible for comparison sort. I still hope (fingers crossed) that this algorithm is better than brute force.
|
|
IP Logged |
|
|
|
Papa Homer
Guest
|
That's not entire correct. For the brute force method the best case is m (if A appears in the beginning of B). However, the worst and AVERAGE cases are still mn. So, the question is, what is the average performance of your algorithm on randomk data.
|
|
IP Logged |
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: substring search
« Reply #6 on: Apr 26th, 2003, 5:25am » |
Quote Modify
|
there is a sublinear search algorithm for this problem.. You need a little preprocessing (linear time) and a few simple tricks.. Of course without my book I can't really describe it adequatly..
|
|
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: substring search
« Reply #7 on: Apr 27th, 2003, 1:37pm » |
Quote Modify
|
I'll just give an example for now.. As I said, there are a few tricks you need to do it efficiently, but basicly here's how it works (after preprocessing A in linear time/space) Suppose we start with two strings A = abbac B = abcaccbcabcbabbaccbabcccabababbabbcc start comparing from the back of A abcaccbcabcbabbaccbabcccabababbabbcc abbac xxx since it doesn't match, and ac doesn't occur in A again, you can shift A past the first 5 characters Compare at that point again abcaccbcabcbabbaccbabcccabababbabbcc abbac xxx x it doesn't match, but b does occur in A, so shift A to match to the first b and compare from there again abcaccbcabcbabbaccbabcccabababbabbcc abbac xxx x x same again as before abcaccbcabcbabbaccbabcccabababbabbcc abbac !xx x x x and again, compare, you can skip the substring (one character in this case) you allready know is there abcaccbcabcbabbaccbabcccabababbabbcc abbac xxx x xXxXXX done best case when there is a match m; best case when there isn't n/m; on avarage < n, worst case n+m (with preprocessing)
|
« Last Edit: Apr 27th, 2003, 1:46pm by towr » |
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
akasina9
Newbie
x = 0x2B | ~ 0x2B. x is the question
Gender:
Posts: 9
|
|
Re: substring search
« Reply #8 on: Nov 17th, 2012, 6:30am » |
Quote Modify
|
on Apr 27th, 2003, 1:37pm, towr wrote:I'll just give an example for now.. As I said, there are a few tricks you need to do it efficiently, but basicly here's how it works (after preprocessing A in linear time/space) Suppose we start with two strings A = abbac B = abcaccbcabcbabbaccbabcccabababbabbcc start comparing from the back of A abcaccbcabcbabbaccbabcccabababbabbcc abbac xxx since it doesn't match, and ac doesn't occur in A again, you can shift A past the first 5 characters Compare at that point again abcaccbcabcbabbaccbabcccabababbabbcc abbac xxx x it doesn't match, but b does occur in A, so shift A to match to the first b and compare from there again abcaccbcabcbabbaccbabcccabababbabbcc abbac xxx x x same again as before abcaccbcabcbabbaccbabcccabababbabbcc abbac !xx x x x and again, compare, you can skip the substring (one character in this case) you allready know is there abcaccbcabcbabbaccbabcccabababbabbcc abbac xxx x xXxXXX done best case when there is a match m; best case when there isn't n/m; on avarage < n, worst case n+m (with preprocessing) |
| I think towr was referring to the Boyer Moore algorithm.http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm .
|
|
IP Logged |
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: substring search
« Reply #9 on: Nov 17th, 2012, 7:35am » |
Quote Modify
|
Probably, yeah. God, that was long ago... Did we even have wikipedia back then?
|
|
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
TenaliRaman
Uberpuzzler
I am no special. I am only passionately curious.
Gender:
Posts: 1001
|
|
Re: substring search
« Reply #10 on: Nov 23rd, 2012, 11:13am » |
Quote Modify
|
We did have wikipedia back then, but it wasn't popular as it is now nor was it as rich in information as it is today. The oldest wikipedia article on Boyer Moore that wayback machine has is around 2007. So, I guess its reasonable enough to say that Boyer Moore article wasn't around when you wrote your answer
|
|
IP Logged |
Self discovery comes when a man measures himself against an obstacle - Antoine de Saint Exupery
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: substring search
« Reply #12 on: Dec 13th, 2012, 11:19am » |
Quote Modify
|
No, I don't think it's that one. I can't really be bothered to get into it, but Boyer Moore seems more like it.
|
|
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
|