wu :: forums (http://www.ocf.berkeley.edu/~wwu/cgi-bin/yabb/YaBB.cgi)
riddles >> cs >> Sampling Random Subsets From A Stream
(Message started by: william wu on May 18th, 2007, 1:52am)

Title: Sampling Random Subsets From A Stream
Post by william wu on May 18th, 2007, 1:52am
A stream of records is starting to come your way. The records are abstract pieces of data, each of equal size. You don't know in advance when the stream will end, but you do know that it will end eventually, where "eventually" means a long time later.

When the stream finishes, your task is to produce a subset of N records, where the subset is chosen uniformly at random from all possible N-subsets of records that have been seen. How would you do this?

Note 1: One way to do this is to simply store every record in memory, and then draw a random N-subset from all the records. However, this is very expensive, and may be infeasible considering how large the records might be, and how you don't know when the stream will end. With this understanding in mind, the challenge is to design something better.

Note 2: This is similar to an existing problem on the main site. Maybe this problem has shown up already elsewhere too . Also, this is a real problem that arises in some practical situations.

Title: Re: Sampling Random Subsets From A Stream
Post by towr on May 18th, 2007, 3:12am
Hmm, we know the N=1 case. It should be possible to generalize
Obviously we start with taking the first N records. After that we need some way to replace older ones with newer ones at same probability.

Title: Re: Sampling Random Subsets From A Stream
Post by Eigenray on May 18th, 2007, 4:53pm
There's really only one possible strategy that involves keeping only N records: we must take the K'th record, K>N, with probability [hide]C(K-1,N-1)/C(K,N) = N/K (any given record gets replaced with probability 1/K)[/hide], because it might be the last one.  Then we just need to show that this works.

[hideb]By induction, suppose that, at time K-1, all C(K-1,N) N-subsets of [K-1]={1,...,K-1} are equally likely.  We show that any N-subset of S of [K] is equally likely at time K.

Case 1: K is in S.  There are (K-1)-(N-1) N-subsets of [K-1] that contain S\{K}, each of which turns into S with probability 1/K.  So the probability that we get S at time K is 1/C(K-1,N)*(K-N)*1/K = 1/C(K,N).

Case 2: K is not in S.  Then the only way to get S at time K is to have S at time K-1, with probability 1/C(K-1,N), and to stay at S, with probabilty (1-N/K), and again 1/C(K-1,N)*(1-N/K) = 1/C(K,N).[/hideb]

Title: Re: Sampling Random Subsets From A Stream
Post by Barukh on Jun 13th, 2007, 10:12am
Consider the  following variation:

You do know the number of records in the stream (say, N), but you cannot use any random access data structure  (let say, the  only available container is a linked list).

What is the most efficient way to produce a random subset of size k?

Title: Re: Sampling Random Subsets From A Stream
Post by towr on Jun 13th, 2007, 11:24am

on 06/13/07 at 10:12:09, Barukh wrote:
What is the most efficient way to produce a random sub[s]et?
Produce a subset of possible indices, then walk through the list and retrieve them?

Title: Re: Sampling Random Subsets From A Stream
Post by Barukh on Jun 13th, 2007, 10:28pm

on 06/13/07 at 11:24:05, towr wrote:
Produce a subset of possible indices, then walk through the list and retrieve them?

How would you produce this subset so that every possible subset of size k is equally likely?

???

Title: Re: Sampling Random Subsets From A Stream
Post by towr on Jun 14th, 2007, 2:01am

on 06/13/07 at 22:28:30, Barukh wrote:
How would you produce this subset so that every possible subset is equally likely?
An easy way is to first pick a random number in [1..N], than one in [1..N-1], if the latter is greater or equal to the first add one to it. Then pick one from [1..N-2] if it's greater/equal to the smallest add one, if it's then greater/equal to the other add one again. Etc.
This would be O(K2), with K being the size of the index-subset. It can probably be improved to O(K log K) if you use a tree structure. But since the eventual retrieving of the subset of objects takes a lot more time, there isn't much of a point to improving the runtime.

Title: Re: Sampling Random Subsets From A Stream
Post by Grimbal on Jun 14th, 2007, 3:24am

on 06/13/07 at 22:28:30, Barukh wrote:
How would you produce this subset so that every possible subset is equally likely?

Maybe I don't understand the question, but if all subsets must be equally likely, just take each item with probability 1/2.  If the size of the subset is specified, Eigenray gave a perfectly workable solution.

Title: Re: Sampling Random Subsets From A Stream
Post by Barukh on Jun 14th, 2007, 4:03am

on 06/14/07 at 03:24:47, Grimbal wrote:
Maybe I don't understand the question, but if all subsets must be equally likely, just take each item with probability 1/2.

Oops... I forgot to specify that the subset should be of a specified size (like in the original problem).


Quote:
If the size of the subset is specified, Eigenray gave a perfectly workable solution.

Eigenray's solution uses random access data structure.

Title: Re: Sampling Random Subsets From A Stream
Post by towr on Jun 14th, 2007, 4:05am

on 06/14/07 at 03:24:47, Grimbal wrote:
If the size of the subset is specified, Eigenray gave a perfectly workable solution.
Well, the downside is that with Eigenray's solution, you'll have to go through the whole stream before you're done.
Of course, on average you'd have to go through nearly the whole stream anyway. And that last wee bit should hurt performance.
I suppose you'll save on generating random numbers and such..

Title: Re: Sampling Random Subsets From A Stream
Post by towr on Jun 14th, 2007, 4:06am

on 06/14/07 at 04:03:38, Barukh wrote:
Eigenray's solution uses random access data structure.
Only for the subset you already have, not to the stream.

For small k it wouldn't be too bad to just go through it to index it, O(k2). But you can always use a tree structure instead.

Title: Re: Sampling Random Subsets From A Stream
Post by Barukh on Jun 14th, 2007, 5:36am

on 06/14/07 at 04:06:12, towr wrote:
For small k it wouldn't be too bad to just go through it to index it, O(k2). But you can always use a tree structure instead.

Right. But you can do better!

Title: Re: Sampling Random Subsets From A Stream
Post by Aryabhatta on Jun 14th, 2007, 7:30am
Just thinking aloud. I haven't done any calculations or tried to prove correctness.

What if have a pipeline of k (=required size of subset) "One number Choosers" (ONC) where a "One number chooser" corresponds to the case the subset size 1.

A number is sent down the pipeline if it is rejected (or replaced) by the ONC before it, in its selection algorithm. Each ONC sees its own version of the stream, dependent on the choices of the ONC before it in the pipeline.

Title: Re: Sampling Random Subsets From A Stream
Post by spirit on Jun 15th, 2007, 2:28am
We can use  reservior sampling algorithm,that is use for  randomly selects the n records for  N set of records where N>>n.

Title: Re: Sampling Random Subsets From A Stream
Post by Barukh on Jun 15th, 2007, 3:04am
Welcome to the forum, spirit!  :D


on 06/15/07 at 02:28:57, spirit wrote:
We can use  reservior sampling algorithm,that is use for  randomly selects the n records for  N set of records where N>>n.

Interesting idea, but actually you don't need any reserviors in this case.

Hint: [hide]The solution I have in mind doesn't use any additional memory[/hide].

Title: Re: Sampling Random Subsets From A Stream
Post by Grimbal on Jun 15th, 2007, 7:17am
[hide]Combinatorics!

If you have N items and you want a set of size k, there are C(N,k) possible subsets of size k.  C(N-1,k-1) include the first element, C(N-1,k) exclude it.  So, choose the first element with probability C(N-1,k-1)/C(N,k) which happens to be k/N.  For the rest, choose a random set  of k or k-1 items among the N-1 items.
[/hide]

Title: Re: Sampling Random Subsets From A Stream
Post by Barukh on Jun 15th, 2007, 10:37pm
Well done, Grimbal!  :D

So, when N is known and only sequential accesses are allowed, there is an algorithm taking O(N) time and O(1) space.

Let's continue:

This time, N is unknown, and still only sequential accesses are allowed. How efficient an algorithm you can devise?

Title: Re: Sampling Random Subsets From A Stream
Post by Grimbal on Jun 16th, 2007, 2:14am
O(N) time and O(N) memory? ::)

(too lazy to search for the moment...)

Title: Re: Sampling Random Subsets From A Stream
Post by Barukh on Jun 16th, 2007, 4:33am

on 06/16/07 at 02:14:06, Grimbal wrote:
O(N) time and O(N) memory? ::)

Hmm... I should've been more specific. The O-notation is not appropriate here.

Of course, it's always possible to do a first pass to determine N, and then do it again by the previous algorithm. This gives 2N operations and no extra memory.

The real question is: can we do better than 2N?


Quote:
(too lazy to search for the moment...)

Don't believe you can be lazy.  ;D

Title: Re: Sampling Random Subsets From A Stream
Post by Grimbal on Jun 18th, 2007, 1:27am

on 06/16/07 at 04:33:24, Barukh wrote:
Of course, it's always possible to do a first pass to determine N, and then do it again by the previous algorithm. This gives 2N operations and no extra memory.

The input being a stream, I thought you can not rewind it and access it a second time.  That is why I counted O(N) space to save the data first.

Title: Re: Sampling Random Subsets From A Stream
Post by Barukh on Jun 18th, 2007, 1:39am

on 06/18/07 at 01:27:11, Grimbal wrote:
That is why I counted O(N) space to save the data first.

You can do better.

Title: Re: Sampling Random Subsets From A Stream
Post by andyv on Jun 18th, 2007, 8:17am
const int maxn = 100;
....
int n;
int sol[maxn];

int size, k;
cin >> k; size = 0;
while (!cin.eof ())
{
  size++;
  if (size < n) sol[size - 1] = k;




}

Title: Re: Sampling Random Subsets From A Stream
Post by mad on Mar 3rd, 2008, 8:07pm

on 06/14/07 at 02:01:49, towr wrote:
An easy way is to first pick a random number in [1..N], than one in [1..N-1], if the latter is greater or equal to the first add one to it. Then pick one from [1..N-2] if it's greater/equal to the smallest add one, if it's then greater/equal to the other add one again. Etc.
This would be O(K2), with K being the size of the index-subset. It can probably be improved to O(K log K) if you use a tree structure. But since the eventual retrieving of the subset of objects takes a lot more time, there isn't much of a point to improving the runtime.


Initially the probability for each index is 1/N. Then,
we select an element from [1 .. N-1] - each index has a probability 1/(N-1) of being included... Could you explain how this produces a random sample.

Title: Re: Sampling Random Subsets From A Stream
Post by towr on Mar 4th, 2008, 1:11am

on 03/03/08 at 20:07:19, mad wrote:
Initially the probability for each index is 1/N. Then,
we select an element from [1 .. N-1] - each index has a probability 1/(N-1) of being included... Could you explain how this produces a random sample.
Each element has a probability of 1/N of being among the k elements of the subset

Suppose we already picked M elements, and they were jointly picked with probability M/N
Then the probability of the N-M elements being left is (N-M)/N, and we then pick one of these with probability 1/(N-M), so the probability it would be picked from the start is (N-M)/N*1/(N-M)= 1/N
You can easily see that it holds for M=0, so the induction is sound, and therefore each of the K elements will be picked with probability 1/N

Title: Re: Sampling Random Subsets From A Stream
Post by kirru on Jul 13th, 2009, 3:25am
initially select first k numbers in to a set RANDOM

for next element (say i th), replace one of the element in a RANDOM with that element with probability i-k/i.

to replace an element select an element from RANDOM with probability 1/k.  

Title: Re: Sampling Random Subsets From A Stream
Post by Grimbal on Jul 13th, 2009, 4:54am
You can do both with a single random draw.
- draw a random number r from 0..(i-1)
- if r<k, replace item r with the new item

Title: Re: Sampling Random Subsets From A Stream
Post by kirru on Mar 31st, 2010, 11:55pm
@towr  

could you tell me where i got wrong ??


Title: Re: Sampling Random Subsets From A Stream
Post by towr on Apr 1st, 2010, 2:13am

on 03/31/10 at 23:55:38, kirru wrote:
@towr  

could you tell me where i got wrong ??
Well, for one thing, if you replace an element in your set with probability (i-k)/i, then you're increasingly likely to replace it as i grows; the probability the last element is in your set goes to 1.

Title: Re: Sampling Random Subsets From A Stream
Post by kirru on Apr 1st, 2010, 5:37am
thanks towr



Powered by YaBB 1 Gold - SP 1.4!
Forum software copyright © 2000-2004 Yet another Bulletin Board