|
||||
Title: Sampling Random Subsets From A Stream Post by william wu on May 18th, 2007, 1:52am A stream of records is starting to come your way. The records are abstract pieces of data, each of equal size. You don't know in advance when the stream will end, but you do know that it will end eventually, where "eventually" means a long time later. When the stream finishes, your task is to produce a subset of N records, where the subset is chosen uniformly at random from all possible N-subsets of records that have been seen. How would you do this? Note 1: One way to do this is to simply store every record in memory, and then draw a random N-subset from all the records. However, this is very expensive, and may be infeasible considering how large the records might be, and how you don't know when the stream will end. With this understanding in mind, the challenge is to design something better. Note 2: This is similar to an existing problem on the main site. Maybe this problem has shown up already elsewhere too . Also, this is a real problem that arises in some practical situations. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by towr on May 18th, 2007, 3:12am Hmm, we know the N=1 case. It should be possible to generalize Obviously we start with taking the first N records. After that we need some way to replace older ones with newer ones at same probability. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Eigenray on May 18th, 2007, 4:53pm There's really only one possible strategy that involves keeping only N records: we must take the K'th record, K>N, with probability [hide]C(K-1,N-1)/C(K,N) = N/K (any given record gets replaced with probability 1/K)[/hide], because it might be the last one. Then we just need to show that this works. [hideb]By induction, suppose that, at time K-1, all C(K-1,N) N-subsets of [K-1]={1,...,K-1} are equally likely. We show that any N-subset of S of [K] is equally likely at time K. Case 1: K is in S. There are (K-1)-(N-1) N-subsets of [K-1] that contain S\{K}, each of which turns into S with probability 1/K. So the probability that we get S at time K is 1/C(K-1,N)*(K-N)*1/K = 1/C(K,N). Case 2: K is not in S. Then the only way to get S at time K is to have S at time K-1, with probability 1/C(K-1,N), and to stay at S, with probabilty (1-N/K), and again 1/C(K-1,N)*(1-N/K) = 1/C(K,N).[/hideb] |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Barukh on Jun 13th, 2007, 10:12am Consider the following variation: You do know the number of records in the stream (say, N), but you cannot use any random access data structure (let say, the only available container is a linked list). What is the most efficient way to produce a random subset of size k? |
||||
Title: Re: Sampling Random Subsets From A Stream Post by towr on Jun 13th, 2007, 11:24am on 06/13/07 at 10:12:09, Barukh wrote:
|
||||
Title: Re: Sampling Random Subsets From A Stream Post by Barukh on Jun 13th, 2007, 10:28pm on 06/13/07 at 11:24:05, towr wrote:
How would you produce this subset so that every possible subset of size k is equally likely? ??? |
||||
Title: Re: Sampling Random Subsets From A Stream Post by towr on Jun 14th, 2007, 2:01am on 06/13/07 at 22:28:30, Barukh wrote:
This would be O(K2), with K being the size of the index-subset. It can probably be improved to O(K log K) if you use a tree structure. But since the eventual retrieving of the subset of objects takes a lot more time, there isn't much of a point to improving the runtime. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Grimbal on Jun 14th, 2007, 3:24am on 06/13/07 at 22:28:30, Barukh wrote:
Maybe I don't understand the question, but if all subsets must be equally likely, just take each item with probability 1/2. If the size of the subset is specified, Eigenray gave a perfectly workable solution. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Barukh on Jun 14th, 2007, 4:03am on 06/14/07 at 03:24:47, Grimbal wrote:
Oops... I forgot to specify that the subset should be of a specified size (like in the original problem). Quote:
Eigenray's solution uses random access data structure. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by towr on Jun 14th, 2007, 4:05am on 06/14/07 at 03:24:47, Grimbal wrote:
Of course, on average you'd have to go through nearly the whole stream anyway. And that last wee bit should hurt performance. I suppose you'll save on generating random numbers and such.. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by towr on Jun 14th, 2007, 4:06am on 06/14/07 at 04:03:38, Barukh wrote:
For small k it wouldn't be too bad to just go through it to index it, O(k2). But you can always use a tree structure instead. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Barukh on Jun 14th, 2007, 5:36am on 06/14/07 at 04:06:12, towr wrote:
Right. But you can do better! |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Aryabhatta on Jun 14th, 2007, 7:30am Just thinking aloud. I haven't done any calculations or tried to prove correctness. What if have a pipeline of k (=required size of subset) "One number Choosers" (ONC) where a "One number chooser" corresponds to the case the subset size 1. A number is sent down the pipeline if it is rejected (or replaced) by the ONC before it, in its selection algorithm. Each ONC sees its own version of the stream, dependent on the choices of the ONC before it in the pipeline. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by spirit on Jun 15th, 2007, 2:28am We can use reservior sampling algorithm,that is use for randomly selects the n records for N set of records where N>>n. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Barukh on Jun 15th, 2007, 3:04am Welcome to the forum, spirit! :D on 06/15/07 at 02:28:57, spirit wrote:
Interesting idea, but actually you don't need any reserviors in this case. Hint: [hide]The solution I have in mind doesn't use any additional memory[/hide]. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Grimbal on Jun 15th, 2007, 7:17am [hide]Combinatorics! If you have N items and you want a set of size k, there are C(N,k) possible subsets of size k. C(N-1,k-1) include the first element, C(N-1,k) exclude it. So, choose the first element with probability C(N-1,k-1)/C(N,k) which happens to be k/N. For the rest, choose a random set of k or k-1 items among the N-1 items. [/hide] |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Barukh on Jun 15th, 2007, 10:37pm Well done, Grimbal! :D So, when N is known and only sequential accesses are allowed, there is an algorithm taking O(N) time and O(1) space. Let's continue: This time, N is unknown, and still only sequential accesses are allowed. How efficient an algorithm you can devise? |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Grimbal on Jun 16th, 2007, 2:14am O(N) time and O(N) memory? ::) (too lazy to search for the moment...) |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Barukh on Jun 16th, 2007, 4:33am on 06/16/07 at 02:14:06, Grimbal wrote:
Hmm... I should've been more specific. The O-notation is not appropriate here. Of course, it's always possible to do a first pass to determine N, and then do it again by the previous algorithm. This gives 2N operations and no extra memory. The real question is: can we do better than 2N? Quote:
Don't believe you can be lazy. ;D |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Grimbal on Jun 18th, 2007, 1:27am on 06/16/07 at 04:33:24, Barukh wrote:
The input being a stream, I thought you can not rewind it and access it a second time. That is why I counted O(N) space to save the data first. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Barukh on Jun 18th, 2007, 1:39am on 06/18/07 at 01:27:11, Grimbal wrote:
You can do better. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by andyv on Jun 18th, 2007, 8:17am const int maxn = 100; .... int n; int sol[maxn]; int size, k; cin >> k; size = 0; while (!cin.eof ()) { size++; if (size < n) sol[size - 1] = k; } |
||||
Title: Re: Sampling Random Subsets From A Stream Post by mad on Mar 3rd, 2008, 8:07pm on 06/14/07 at 02:01:49, towr wrote:
Initially the probability for each index is 1/N. Then, we select an element from [1 .. N-1] - each index has a probability 1/(N-1) of being included... Could you explain how this produces a random sample. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by towr on Mar 4th, 2008, 1:11am on 03/03/08 at 20:07:19, mad wrote:
Suppose we already picked M elements, and they were jointly picked with probability M/N Then the probability of the N-M elements being left is (N-M)/N, and we then pick one of these with probability 1/(N-M), so the probability it would be picked from the start is (N-M)/N*1/(N-M)= 1/N You can easily see that it holds for M=0, so the induction is sound, and therefore each of the K elements will be picked with probability 1/N |
||||
Title: Re: Sampling Random Subsets From A Stream Post by kirru on Jul 13th, 2009, 3:25am initially select first k numbers in to a set RANDOM for next element (say i th), replace one of the element in a RANDOM with that element with probability i-k/i. to replace an element select an element from RANDOM with probability 1/k. |
||||
Title: Re: Sampling Random Subsets From A Stream Post by Grimbal on Jul 13th, 2009, 4:54am You can do both with a single random draw. - draw a random number r from 0..(i-1) - if r<k, replace item r with the new item |
||||
Title: Re: Sampling Random Subsets From A Stream Post by kirru on Mar 31st, 2010, 11:55pm @towr could you tell me where i got wrong ?? |
||||
Title: Re: Sampling Random Subsets From A Stream Post by towr on Apr 1st, 2010, 2:13am on 03/31/10 at 23:55:38, kirru wrote:
|
||||
Title: Re: Sampling Random Subsets From A Stream Post by kirru on Apr 1st, 2010, 5:37am thanks towr |
||||
Powered by YaBB 1 Gold - SP 1.4! Forum software copyright © 2000-2004 Yet another Bulletin Board |