wu :: forums - Re: substring search

wu :: forums « wu :: forums - Re: substring search » Welcome, Guest. Please Login or Register. Nov 21^st, 2024, 1:54am
RIDDLES SITE WRITE MATH! Home Help Search Members Login Register

   wu :: forums
   riddles
   cs (Moderators: SMQ, Grimbal, william wu, ThudnBlunder, Icarus, Eigenray, towr)
   Re: substring search

« Previous topic | Next topic »

Pages: 1

Notify of replies

Send Topic

Author

Topic: Re: substring search (Read 9294 times)

sameer_r
Guest

Re: substring search
« on: Apr 25^th, 2003, 2:31pm »

Quote

Modify

Remove

Hi everyone,

I didn't see any message posted regarding the substring search problem. Pardon me if a thread already exists for it.

I was thinking about a solution to the problem and came up with the following algorithm (only time efficient)

Say A is the string to be searched in the string B.
for e.g.
A = "xyxz"
B = "xyzxyxz"

Sort A and Sort B keeping track of the original indices of each character.
A = "xxyz" with original indices 1,3,2,4
B = "xxxyyzz" with original indices 1,4,6,2,5,3,7

Allocate an integer array (counter) equal to the size of B.

Now for every character in B, do a binary search to locate the corresponding character in A (there could be more than one occurence of the same character).

Say for the first x in B, we locate the first x in A with original index 1. (since its sorted the other x's would be easy to find by just increasing the index until we came to a character > x)

Now if this character in B were to be a part of the substring A, then the beginning of the string would have been (original index of this character in B - original index of this character in A + 1).

In the case of the first x in A: 1-1+1
or in the case of second x in A : 1-3+1<0 (hence cannot be)

This means that if the first x in B were a part of the substring A, then it could be the first character of A and not the 3rd.

Now if this x is the first character of substring A, then 1 should be the index from which the substring starts in B and hence in Count array we increment Count[1] by 1.

Finally when we are done looking at all the elements of B, then we scan through the Count array and all those offsets that have the value equal to the length of substring A, are in fact the beginning locations of the substring A in B.

Thanks,
Sameer.

IP Logged

Papa Homer
Guest

Re: substring search
« Reply #1 on: Apr 25^th, 2003, 3:06pm »

Quote

Modify

Remove

I am feeling a bit slow right now, so could you provide some more details for you solution? Specifically, the runtime complexity for each step as well as an explaination of why it actually works.

Thanks.

IP Logged

sameer_r
Guest

Re: substring search
« Reply #2 on: Apr 25^th, 2003, 3:13pm »

Quote

Modify

Remove

The algorithm has the following steps:

sort A and B
sorting A theta(m)
sorting B theta(n)
assuming that we know in advance the range of characters that the strings can contain.

go thru each element of B (n)
search that element in A theta(lg(n))
go thru the matching element in A (at worst m) (this part I am not sure of. looks like an average case analysis like in quick sort is needed)

finally go thru the count array and find the offsets where your substring starts in B. theta(n)

so i guess the complexity in worst case comes out to be theta(nm), though I think it should be better than that.

IP Logged

Papa Homer
Guest

Re: substring search
« Reply #3 on: Apr 25^th, 2003, 3:34pm »

Quote

Modify

Remove

Yeah, that's the complexity of a brute force method: for every character in B, see if the next m characters make up A.

IP Logged

sameer_r
Guest

Re: substring search
« Reply #4 on: Apr 25^th, 2003, 4:06pm »

Quote

Modify

Remove

For Brute force method, the run-time is always theta(nm), but for this algo, the worst case is theta(nm). It could perform better. Even for quick-sort the worst case run-time is theta(n^2), but average case is theta(nlgn), which is pretty good, in fact the theoritically the best possible for comparison sort.

I still hope (fingers crossed) that this algorithm is better than brute force.

IP Logged

Papa Homer
Guest

Re: substring search
« Reply #5 on: Apr 25^th, 2003, 4:18pm »

Quote

Modify

Remove

That's not entire correct. For the brute force method the best case is m (if A appears in the beginning of B). However, the worst and AVERAGE cases are still mn. So, the question is, what is the average performance of your algorithm on randomk data.

IP Logged

towr
wu::riddles Moderator
Uberpuzzler

Some people are average, some are just mean.

Gender: male

Posts: 13730

Re: substring search
« Reply #6 on: Apr 26^th, 2003, 5:25am »

Quote

Modify

there is a sublinear search algorithm for this problem..
You need a little preprocessing (linear time) and a few simple tricks..
Of course without my book I can't really describe it adequatly..

IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB

towr
wu::riddles Moderator
Uberpuzzler

Some people are average, some are just mean.

Gender: male

Posts: 13730

Re: substring search
« Reply #7 on: Apr 27^th, 2003, 1:37pm »

Quote

Modify

I'll just give an example for now.. As I said, there are a few tricks you need to do it efficiently, but basicly here's how it works (after preprocessing A in linear time/space)

Suppose we start with two strings

 
 A = abbac 
 B = abcaccbcabcbabbaccbabcccabababbabbcc

start comparing from the back of A

 
 abcaccbcabcbabbaccbabcccabababbabbcc 
 abbac 
   xxx

since it doesn't match, and ac doesn't occur in A again, you can shift A past the first 5 characters

Compare at that point again

 
 abcaccbcabcbabbaccbabcccabababbabbcc 
      abbac 
   xxx    x

it doesn't match, but b does occur in A, so shift A to match to the first b and compare from there again

 
 abcaccbcabcbabbaccbabcccabababbabbcc 
        abbac 
   xxx    x x

same again as before

 
 abcaccbcabcbabbaccbabcccabababbabbcc 
          abbac 
   !xx    x x x

and again, compare, you can skip the substring (one character in this case) you allready know is there

 
 abcaccbcabcbabbaccbabcccabababbabbcc 
             abbac 
   xxx    x xXxXXX

done

best case when there is a match m; best case when there isn't n/m; on avarage < n, worst case n+m (with preprocessing)

« Last Edit: Apr 27^th, 2003, 1:46pm by towr »

IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB

akasina9
Newbie

x = 0x2B | ~ 0x2B. x is the question

Gender: male

Posts: 9

Re: substring search
« Reply #8 on: Nov 17^th, 2012, 6:30am »

Quote

Modify

on Apr 27^th, 2003, 1:37pm, towr wrote:

 
 A = abbac 
 B = abcaccbcabcbabbaccbabcccabababbabbcc

start comparing from the back of A

 
 abcaccbcabcbabbaccbabcccabababbabbcc 
 abbac 
   xxx

since it doesn't match, and ac doesn't occur in A again, you can shift A past the first 5 characters

Compare at that point again

 
 abcaccbcabcbabbaccbabcccabababbabbcc 
      abbac 
   xxx    x

it doesn't match, but b does occur in A, so shift A to match to the first b and compare from there again

 
 abcaccbcabcbabbaccbabcccabababbabbcc 
        abbac 
   xxx    x x

same again as before

 
 abcaccbcabcbabbaccbabcccabababbabbcc 
          abbac 
   !xx    x x x

and again, compare, you can skip the substring (one character in this case) you allready know is there

 
 abcaccbcabcbabbaccbabcccabababbabbcc 
             abbac 
   xxx    x xXxXXX

done

best case when there is a match m; best case when there isn't n/m; on avarage < n, worst case n+m (with preprocessing)

I think towr was referring to the Boyer Moore algorithm.http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm .

IP Logged

towr
wu::riddles Moderator
Uberpuzzler

Some people are average, some are just mean.

Gender: male

Posts: 13730

Re: substring search
« Reply #9 on: Nov 17^th, 2012, 7:35am »

Quote

Modify

Probably, yeah.
God, that was long ago... Did we even have wikipedia back then?

IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB

TenaliRaman
Uberpuzzler

I am no special. I am only passionately curious.

Gender: male

Posts: 1001

Re: substring search
« Reply #10 on: Nov 23^rd, 2012, 11:13am »

Quote

Modify

We did have wikipedia back then, but it wasn't popular as it is now nor was it as rich in information as it is today. The oldest wikipedia article on Boyer Moore that wayback machine has is around 2007. So, I guess its reasonable enough to say that Boyer Moore article wasn't around when you wrote your answer Cheesy