wu :: forums
« wu :: forums - Sampling Random Subsets From A Stream »

Welcome, Guest. Please Login or Register.
Nov 28th, 2024, 9:37am

RIDDLES SITE WRITE MATH! Home Home Help Help Search Search Members Members Login Login Register Register
   wu :: forums
   riddles
   cs
(Moderators: william wu, towr, ThudnBlunder, Grimbal, Icarus, Eigenray, SMQ)
   Sampling Random Subsets From A Stream
« Previous topic | Next topic »
Pages: 1 2  Reply Reply Notify of replies Notify of replies Send Topic Send Topic Print Print
   Author  Topic: Sampling Random Subsets From A Stream  (Read 4943 times)
william wu
wu::riddles Administrator
*****





   
WWW

Gender: male
Posts: 1291
Sampling Random Subsets From A Stream  
« on: May 18th, 2007, 1:52am »
Quote Quote Modify Modify

A stream of records is starting to come your way. The records are abstract pieces of data, each of equal size. You don't know in advance when the stream will end, but you do know that it will end eventually, where "eventually" means a long time later.
 
When the stream finishes, your task is to produce a subset of N records, where the subset is chosen uniformly at random from all possible N-subsets of records that have been seen. How would you do this?
 
Note 1: One way to do this is to simply store every record in memory, and then draw a random N-subset from all the records. However, this is very expensive, and may be infeasible considering how large the records might be, and how you don't know when the stream will end. With this understanding in mind, the challenge is to design something better.
 
Note 2: This is similar to an existing problem on the main site. Maybe this problem has shown up already elsewhere too . Also, this is a real problem that arises in some practical situations.
IP Logged


[ wu ] : http://wuriddles.com / http://forums.wuriddles.com
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: Sampling Random Subsets From A Stream  
« Reply #1 on: May 18th, 2007, 3:12am »
Quote Quote Modify Modify

Hmm, we know the N=1 case. It should be possible to generalize
Obviously we start with taking the first N records. After that we need some way to replace older ones with newer ones at same probability.
IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
Eigenray
wu::riddles Moderator
Uberpuzzler
*****






   


Gender: male
Posts: 1948
Re: Sampling Random Subsets From A Stream  
« Reply #2 on: May 18th, 2007, 4:53pm »
Quote Quote Modify Modify

There's really only one possible strategy that involves keeping only N records: we must take the K'th record, K>N, with probability C(K-1,N-1)/C(K,N) = N/K (any given record gets replaced with probability 1/K), because it might be the last one.  Then we just need to show that this works.
 
hidden:
By induction, suppose that, at time K-1, all C(K-1,N) N-subsets of [K-1]={1,...,K-1} are equally likely.  We show that any N-subset of S of [K] is equally likely at time K.
 
Case 1: K is in S.  There are (K-1)-(N-1) N-subsets of [K-1] that contain S\{K}, each of which turns into S with probability 1/K.  So the probability that we get S at time K is 1/C(K-1,N)*(K-N)*1/K = 1/C(K,N).
 
Case 2: K is not in S.  Then the only way to get S at time K is to have S at time K-1, with probability 1/C(K-1,N), and to stay at S, with probabilty (1-N/K), and again 1/C(K-1,N)*(1-N/K) = 1/C(K,N).
« Last Edit: May 18th, 2007, 5:45pm by Eigenray » IP Logged
Barukh
Uberpuzzler
*****






   


Gender: male
Posts: 2276
Re: Sampling Random Subsets From A Stream  
« Reply #3 on: Jun 13th, 2007, 10:12am »
Quote Quote Modify Modify

Consider the  following variation:
 
You do know the number of records in the stream (say, N), but you cannot use any random access data structure  (let say, the  only available container is a linked list).
 
What is the most efficient way to produce a random subset of size k?
« Last Edit: Jun 14th, 2007, 4:04am by Barukh » IP Logged
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: Sampling Random Subsets From A Stream  
« Reply #4 on: Jun 13th, 2007, 11:24am »
Quote Quote Modify Modify

on Jun 13th, 2007, 10:12am, Barukh wrote:
What is the most efficient way to produce a random sub[s]et?
Produce a subset of possible indices, then walk through the list and retrieve them?
IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
Barukh
Uberpuzzler
*****






   


Gender: male
Posts: 2276
Re: Sampling Random Subsets From A Stream  
« Reply #5 on: Jun 13th, 2007, 10:28pm »
Quote Quote Modify Modify

on Jun 13th, 2007, 11:24am, towr wrote:
Produce a subset of possible indices, then walk through the list and retrieve them?

How would you produce this subset so that every possible subset of size k is equally likely?
 
 Huh
« Last Edit: Jun 14th, 2007, 4:03am by Barukh » IP Logged
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: Sampling Random Subsets From A Stream  
« Reply #6 on: Jun 14th, 2007, 2:01am »
Quote Quote Modify Modify

on Jun 13th, 2007, 10:28pm, Barukh wrote:
How would you produce this subset so that every possible subset is equally likely?
An easy way is to first pick a random number in [1..N], than one in [1..N-1], if the latter is greater or equal to the first add one to it. Then pick one from [1..N-2] if it's greater/equal to the smallest add one, if it's then greater/equal to the other add one again. Etc.
This would be O(K2), with K being the size of the index-subset. It can probably be improved to O(K log K) if you use a tree structure. But since the eventual retrieving of the subset of objects takes a lot more time, there isn't much of a point to improving the runtime.
IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
Grimbal
wu::riddles Moderator
Uberpuzzler
*****






   


Gender: male
Posts: 7527
Re: Sampling Random Subsets From A Stream  
« Reply #7 on: Jun 14th, 2007, 3:24am »
Quote Quote Modify Modify

on Jun 13th, 2007, 10:28pm, Barukh wrote:
How would you produce this subset so that every possible subset is equally likely?

Maybe I don't understand the question, but if all subsets must be equally likely, just take each item with probability 1/2.  If the size of the subset is specified, Eigenray gave a perfectly workable solution.
IP Logged
Barukh
Uberpuzzler
*****






   


Gender: male
Posts: 2276
Re: Sampling Random Subsets From A Stream  
« Reply #8 on: Jun 14th, 2007, 4:03am »
Quote Quote Modify Modify

on Jun 14th, 2007, 3:24am, Grimbal wrote:
Maybe I don't understand the question, but if all subsets must be equally likely, just take each item with probability 1/2.

Oops... I forgot to specify that the subset should be of a specified size (like in the original problem).
 
Quote:
If the size of the subset is specified, Eigenray gave a perfectly workable solution.

Eigenray's solution uses random access data structure.
IP Logged
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: Sampling Random Subsets From A Stream  
« Reply #9 on: Jun 14th, 2007, 4:05am »
Quote Quote Modify Modify

on Jun 14th, 2007, 3:24am, Grimbal wrote:
If the size of the subset is specified, Eigenray gave a perfectly workable solution.
Well, the downside is that with Eigenray's solution, you'll have to go through the whole stream before you're done.
Of course, on average you'd have to go through nearly the whole stream anyway. And that last wee bit should hurt performance.
I suppose you'll save on generating random numbers and such..
« Last Edit: Jun 14th, 2007, 4:08am by towr » IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: Sampling Random Subsets From A Stream  
« Reply #10 on: Jun 14th, 2007, 4:06am »
Quote Quote Modify Modify

on Jun 14th, 2007, 4:03am, Barukh wrote:
Eigenray's solution uses random access data structure.
Only for the subset you already have, not to the stream.
 
For small k it wouldn't be too bad to just go through it to index it, O(k2). But you can always use a tree structure instead.
« Last Edit: Jun 14th, 2007, 4:08am by towr » IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
Barukh
Uberpuzzler
*****






   


Gender: male
Posts: 2276
Re: Sampling Random Subsets From A Stream  
« Reply #11 on: Jun 14th, 2007, 5:36am »
Quote Quote Modify Modify

on Jun 14th, 2007, 4:06am, towr wrote:
For small k it wouldn't be too bad to just go through it to index it, O(k2). But you can always use a tree structure instead.

Right. But you can do better!
IP Logged
Aryabhatta
Uberpuzzler
*****






   


Gender: male
Posts: 1321
Re: Sampling Random Subsets From A Stream  
« Reply #12 on: Jun 14th, 2007, 7:30am »
Quote Quote Modify Modify

Just thinking aloud. I haven't done any calculations or tried to prove correctness.
 
What if have a pipeline of k (=required size of subset) "One number Choosers" (ONC) where a "One number chooser" corresponds to the case the subset size 1.
 
A number is sent down the pipeline if it is rejected (or replaced) by the ONC before it, in its selection algorithm. Each ONC sees its own version of the stream, dependent on the choices of the ONC before it in the pipeline.
« Last Edit: Jun 14th, 2007, 7:33am by Aryabhatta » IP Logged
spirit
Newbie
*





   


Posts: 11
Re: Sampling Random Subsets From A Stream  
« Reply #13 on: Jun 15th, 2007, 2:28am »
Quote Quote Modify Modify

We can use  reservior sampling algorithm,that is use for  randomly selects the n records for  N set of records where N>>n.
IP Logged
Barukh
Uberpuzzler
*****






   


Gender: male
Posts: 2276
Re: Sampling Random Subsets From A Stream  
« Reply #14 on: Jun 15th, 2007, 3:04am »
Quote Quote Modify Modify

Welcome to the forum, spirit!  Cheesy
 
on Jun 15th, 2007, 2:28am, spirit wrote:
We can use  reservior sampling algorithm,that is use for  randomly selects the n records for  N set of records where N>>n.

Interesting idea, but actually you don't need any reserviors in this case.
 
Hint: The solution I have in mind doesn't use any additional memory.
IP Logged
Grimbal
wu::riddles Moderator
Uberpuzzler
*****






   


Gender: male
Posts: 7527
Re: Sampling Random Subsets From A Stream  
« Reply #15 on: Jun 15th, 2007, 7:17am »
Quote Quote Modify Modify

Combinatorics!
 
If you have N items and you want a set of size k, there are C(N,k) possible subsets of size k.  C(N-1,k-1) include the first element, C(N-1,k) exclude it.  So, choose the first element with probability C(N-1,k-1)/C(N,k) which happens to be k/N.  For the rest, choose a random set  of k or k-1 items among the N-1 items.
IP Logged
Barukh
Uberpuzzler
*****






   


Gender: male
Posts: 2276
Re: Sampling Random Subsets From A Stream  
« Reply #16 on: Jun 15th, 2007, 10:37pm »
Quote Quote Modify Modify

Well done, Grimbal!  Cheesy
 
So, when N is known and only sequential accesses are allowed, there is an algorithm taking O(N) time and O(1) space.
 
Let's continue:
 
This time, N is unknown, and still only sequential accesses are allowed. How efficient an algorithm you can devise?
IP Logged
Grimbal
wu::riddles Moderator
Uberpuzzler
*****






   


Gender: male
Posts: 7527
Re: Sampling Random Subsets From A Stream  
« Reply #17 on: Jun 16th, 2007, 2:14am »
Quote Quote Modify Modify

O(N) time and O(N) memory? Roll Eyes
 
(too lazy to search for the moment...)
IP Logged
Barukh
Uberpuzzler
*****






   


Gender: male
Posts: 2276
Re: Sampling Random Subsets From A Stream  
« Reply #18 on: Jun 16th, 2007, 4:33am »
Quote Quote Modify Modify

on Jun 16th, 2007, 2:14am, Grimbal wrote:
O(N) time and O(N) memory? Roll Eyes

Hmm... I should've been more specific. The O-notation is not appropriate here.
 
Of course, it's always possible to do a first pass to determine N, and then do it again by the previous algorithm. This gives 2N operations and no extra memory.
 
The real question is: can we do better than 2N?
 
Quote:
(too lazy to search for the moment...)

Don't believe you can be lazy.  Grin
« Last Edit: Jun 16th, 2007, 5:24am by Barukh » IP Logged
Grimbal
wu::riddles Moderator
Uberpuzzler
*****






   


Gender: male
Posts: 7527
Re: Sampling Random Subsets From A Stream  
« Reply #19 on: Jun 18th, 2007, 1:27am »
Quote Quote Modify Modify

on Jun 16th, 2007, 4:33am, Barukh wrote:
Of course, it's always possible to do a first pass to determine N, and then do it again by the previous algorithm. This gives 2N operations and no extra memory.

The input being a stream, I thought you can not rewind it and access it a second time.  That is why I counted O(N) space to save the data first.
IP Logged
Barukh
Uberpuzzler
*****






   


Gender: male
Posts: 2276
Re: Sampling Random Subsets From A Stream  
« Reply #20 on: Jun 18th, 2007, 1:39am »
Quote Quote Modify Modify

on Jun 18th, 2007, 1:27am, Grimbal wrote:
That is why I counted O(N) space to save the data first.

You can do better.
IP Logged
andyv
Newbie
*





   


Posts: 14
Re: Sampling Random Subsets From A Stream  
« Reply #21 on: Jun 18th, 2007, 8:17am »
Quote Quote Modify Modify

const int maxn = 100;
....
int n;
int sol[maxn];
 
int size, k;
cin >> k; size = 0;
while (!cin.eof ())  
{
   size++;
   if (size < n) sol[size - 1] = k;
 
 
 
 
}
IP Logged
mad
Junior Member
**





   


Posts: 118
Re: Sampling Random Subsets From A Stream  
« Reply #22 on: Mar 3rd, 2008, 8:07pm »
Quote Quote Modify Modify

on Jun 14th, 2007, 2:01am, towr wrote:

An easy way is to first pick a random number in [1..N], than one in [1..N-1], if the latter is greater or equal to the first add one to it. Then pick one from [1..N-2] if it's greater/equal to the smallest add one, if it's then greater/equal to the other add one again. Etc.
This would be O(K2), with K being the size of the index-subset. It can probably be improved to O(K log K) if you use a tree structure. But since the eventual retrieving of the subset of objects takes a lot more time, there isn't much of a point to improving the runtime.

 
Initially the probability for each index is 1/N. Then,
we select an element from [1 .. N-1] - each index has a probability 1/(N-1) of being included... Could you explain how this produces a random sample.
IP Logged
towr
wu::riddles Moderator
Uberpuzzler
*****



Some people are average, some are just mean.

   


Gender: male
Posts: 13730
Re: Sampling Random Subsets From A Stream  
« Reply #23 on: Mar 4th, 2008, 1:11am »
Quote Quote Modify Modify

on Mar 3rd, 2008, 8:07pm, mad wrote:
Initially the probability for each index is 1/N. Then,
we select an element from [1 .. N-1] - each index has a probability 1/(N-1) of being included... Could you explain how this produces a random sample.
Each element has a probability of 1/N of being among the k elements of the subset
 
Suppose we already picked M elements, and they were jointly picked with probability M/N
Then the probability of the N-M elements being left is (N-M)/N, and we then pick one of these with probability 1/(N-M), so the probability it would be picked from the start is (N-M)/N*1/(N-M)= 1/N
You can easily see that it holds for M=0, so the induction is sound, and therefore each of the K elements will be picked with probability 1/N
IP Logged

Wikipedia, Google, Mathworld, Integer sequence DB
kirru
Newbie
*





   


Posts: 34
Re: Sampling Random Subsets From A Stream  
« Reply #24 on: Jul 13th, 2009, 3:25am »
Quote Quote Modify Modify

initially select first k numbers in to a set RANDOM
 
for next element (say i th), replace one of the element in a RANDOM with that element with probability i-k/i.
 
to replace an element select an element from RANDOM with probability 1/k.
IP Logged
Pages: 1 2  Reply Reply Notify of replies Notify of replies Send Topic Send Topic Print Print

« Previous topic | Next topic »

Powered by YaBB 1 Gold - SP 1.4!
Forum software copyright © 2000-2004 Yet another Bulletin Board