||||
Title: links Post by jhngalt on Mar 5th, 2005, 4:22pm How would you design a system which spiders the links from a webpage - specifically, how would you check if a link has already been hit? thanks |
||||
Title: Re: links Post by puzzlecracker on Mar 5th, 2005, 7:46pm Put each link into some sort of hash table; a collision means the link has already been hit. And put all the links you encounter into a priority queue - remove one at a time and follow it. |
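A minimal sketch of that hash-table check in Java: HashSet.add returns false when the element is already present, which is exactly the collision that signals a link has been hit before.

import java.util.HashSet;
import java.util.Set;

public class SeenSetDemo {
    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        System.out.println(seen.add("http://example.com/a")); // true: new link
        System.out.println(seen.add("http://example.com/a")); // false: already hit
    }
}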
||||
Title: Re: links Post by jhngalt on Mar 5th, 2005, 8:28pm on 03/05/05 at 19:46:46, puzzlecracker wrote:
hi.. why do you put all links you encounter into a priority queue? thanks, |
||||
Title: Re: links Post by puzzlecracker on Mar 5th, 2005, 9:15pm A queue would also be sufficient - it lets you visit all webpages in order. For example: page z has links a, b, c, d; page a has links e, f, g, d; and so on. The queue would initially be filled with a b c d; then you remove 'a' and it becomes b c d e f g; then you go to 'b', and so on. Also, verify every single jump against the hash to avoid cycles and repeats. |
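A minimal sketch of this traversal in Java. fetchLinks is a hypothetical helper that would download a page and return the URLs it links to:

import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class Spider {
    public static void crawl(String seed) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();       // e.g. remove 'a'
            for (String link : fetchLinks(url)) { // ...and consider e, f, g, d
                if (seen.add(link)) {             // hash check on every jump
                    frontier.add(link);           // d is already seen, so only e, f, g enter
                }
            }
        }
    }

    static List<String> fetchLinks(String url) {
        return Collections.emptyList(); // placeholder: real code would fetch and parse the page
    }
}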
||||
Title: Re: links Post by jhngalt on Mar 6th, 2005, 6:00am on 03/05/05 at 21:15:58, puzzlecracker wrote:
You said to verify against the hash on every single jump to avoid cycles and repeats - meaning when you get a link, you hash it and see if it has already been hit. What hash function do you think we could use? I read somewhere that you could use the checksum of the link, but I'm not sure what that is - any ideas? Also, some webpages have several different URLs pointing to them; how could you tell you have already been there but under a different link? thanks, |
||||
Title: Re: links Post by jhngalt on Mar 16th, 2005, 8:09am on 03/06/05 at 06:00:08, jhngalt wrote:
nobody has any ideas? |
||||
Title: Re: links Post by towr on Mar 17th, 2005, 1:00am Well, for the latter part: visiting the link should help you establish whether different URLs point to the same website. (Just resolve what IP and path, or what content, they point to.) |
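A rough sketch of that resolution step in Java, using the standard URI and InetAddress classes. (Caveat: a single IP can host many sites, so comparing fetched content is the surer test.)

import java.net.InetAddress;
import java.net.URI;

public class UrlCanonicalizer {
    // Reduce a URL to "ip + normalized path" so that different hostnames
    // or spellings pointing at the same server and path compare equal.
    public static String canonicalKey(String url) throws Exception {
        URI uri = new URI(url).normalize(); // collapses "." and ".." segments
        String ip = InetAddress.getByName(uri.getHost()).getHostAddress();
        String path = uri.getPath() == null || uri.getPath().isEmpty() ? "/" : uri.getPath();
        return ip + path;
    }
}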
||||
Title: Re: links Post by TenaliRaman on Mar 21st, 2005, 10:52pm The checksum mentioned could be as simple as adding up the letters in the URL. Lucky for us, the URLs are all lowercase letters, and even luckier, the addition of two letters doesn't give us a third letter. (I think!) If the above is not correct (though I think it should be), we may use an MD5 checksum. -- AI |
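A minimal sketch of the MD5 option in Java, using the standard MessageDigest class; the 128-bit digest makes accidental collisions between different URLs astronomically unlikely:

import java.math.BigInteger;
import java.security.MessageDigest;

public class Md5Key {
    public static String md5(String url) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(url.getBytes("UTF-8")); // 16 bytes
        return new BigInteger(1, digest).toString(16);    // as a hex string
    }
}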
||||
Title: Re: links Post by towr on Mar 22nd, 2005, 4:32am on 03/21/05 at 22:52:00, TenaliRaman wrote:
Essentially that means any anagrams will be confused, among many other things (the hash for "ac" would also be the same as for "bb"). And I just don't think you would want to confuse URLs like "www.dicktraceysucks.com" with "www.traceysucksdick.com".. :P |
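A quick check of that collision claim (anagrams necessarily produce the same letter sum):

public class SumCollision {
    static int sum(String s) {
        int total = 0;
        for (char c : s.toCharArray()) total += c;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum("ac") == sum("bb")); // true: 97 + 99 == 98 + 98
        System.out.println(sum("www.dicktraceysucks.com") == sum("www.traceysucksdick.com")); // true
    }
}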
||||
Title: Re: links Post by Grimbal on Jun 14th, 2013, 7:54am The problem with a simple sum is that you never get values beyond a few thousand, so if you have millions of URLs you get many collisions. Java String's hash function is basically this:

static int hash(String s) {
    char[] val = s.toCharArray();   // the string's chars
    int hash = 0;
    for (int i = 0; i < val.length; i++) {
        hash = hash * 31 + val[i];
    }
    return hash;
}
|
||||
Title: Re: links Post by TenaliRaman on Jun 16th, 2013, 4:47pm Yeah, the addition is a bad idea ;D |
||||
Title: Re: links Post by gitanas on Jan 27th, 2016, 6:33am You could store the links in a database. Moreover, add an index on that column and lookups will be very fast. |
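A minimal sketch of the database approach over JDBC, assuming an SQLite driver on the classpath; the table and column names are made up for the example. A unique index doubles as the "already hit" check, since inserting a duplicate URL fails:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class UrlStore {
    public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:sqlite:crawler.db")) {
            try (Statement st = db.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT)");
                st.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_url ON visited (url)");
            }
            try (PreparedStatement ins = db.prepareStatement("INSERT INTO visited (url) VALUES (?)")) {
                ins.setString(1, "http://example.com/a");
                ins.executeUpdate(); // throws SQLException if the URL was already stored
            }
        }
    }
}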
||||