Author |
Topic: Sampling Random Subsets From A Stream (Read 4943 times) |
|
william wu
wu::riddles Administrator
Gender:
Posts: 1291
|
|
Sampling Random Subsets From A Stream
« on: May 18th, 2007, 1:52am » |
Quote Modify
|
A stream of records is starting to come your way. The records are abstract pieces of data, each of equal size. You don't know in advance when the stream will end, but you do know that it will end eventually, where "eventually" means a long time later. When the stream finishes, your task is to produce a subset of N records, where the subset is chosen uniformly at random from all possible N-subsets of records that have been seen. How would you do this? Note 1: One way to do this is to simply store every record in memory, and then draw a random N-subset from all the records. However, this is very expensive, and may be infeasible considering how large the records might be, and how you don't know when the stream will end. With this understanding in mind, the challenge is to design something better. Note 2: This is similar to an existing problem on the main site. Maybe this problem has shown up already elsewhere too . Also, this is a real problem that arises in some practical situations.
|
|
IP Logged |
[ wu ] : http://wuriddles.com / http://forums.wuriddles.com
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: Sampling Random Subsets From A Stream
« Reply #1 on: May 18th, 2007, 3:12am » |
Quote Modify
|
Hmm, we know the N=1 case. It should be possible to generalize Obviously we start with taking the first N records. After that we need some way to replace older ones with newer ones at same probability.
|
|
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
Eigenray
wu::riddles Moderator Uberpuzzler
Gender:
Posts: 1948
|
|
Re: Sampling Random Subsets From A Stream
« Reply #2 on: May 18th, 2007, 4:53pm » |
Quote Modify
|
There's really only one possible strategy that involves keeping only N records: we must take the K'th record, K>N, with probability C(K-1,N-1)/C(K,N) = N/K (any given record gets replaced with probability 1/K), because it might be the last one. Then we just need to show that this works. hidden: | By induction, suppose that, at time K-1, all C(K-1,N) N-subsets of [K-1]={1,...,K-1} are equally likely. We show that any N-subset of S of [K] is equally likely at time K. Case 1: K is in S. There are (K-1)-(N-1) N-subsets of [K-1] that contain S\{K}, each of which turns into S with probability 1/K. So the probability that we get S at time K is 1/C(K-1,N)*(K-N)*1/K = 1/C(K,N). Case 2: K is not in S. Then the only way to get S at time K is to have S at time K-1, with probability 1/C(K-1,N), and to stay at S, with probabilty (1-N/K), and again 1/C(K-1,N)*(1-N/K) = 1/C(K,N). |
|
« Last Edit: May 18th, 2007, 5:45pm by Eigenray » |
IP Logged |
|
|
|
Barukh
Uberpuzzler
Gender:
Posts: 2276
|
|
Re: Sampling Random Subsets From A Stream
« Reply #3 on: Jun 13th, 2007, 10:12am » |
Quote Modify
|
Consider the following variation: You do know the number of records in the stream (say, N), but you cannot use any random access data structure (let say, the only available container is a linked list). What is the most efficient way to produce a random subset of size k?
|
« Last Edit: Jun 14th, 2007, 4:04am by Barukh » |
IP Logged |
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: Sampling Random Subsets From A Stream
« Reply #4 on: Jun 13th, 2007, 11:24am » |
Quote Modify
|
on Jun 13th, 2007, 10:12am, Barukh wrote:What is the most efficient way to produce a random sub[s]et? |
| Produce a subset of possible indices, then walk through the list and retrieve them?
|
|
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
Barukh
Uberpuzzler
Gender:
Posts: 2276
|
|
Re: Sampling Random Subsets From A Stream
« Reply #5 on: Jun 13th, 2007, 10:28pm » |
Quote Modify
|
on Jun 13th, 2007, 11:24am, towr wrote:Produce a subset of possible indices, then walk through the list and retrieve them? |
| How would you produce this subset so that every possible subset of size k is equally likely?
|
« Last Edit: Jun 14th, 2007, 4:03am by Barukh » |
IP Logged |
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: Sampling Random Subsets From A Stream
« Reply #6 on: Jun 14th, 2007, 2:01am » |
Quote Modify
|
on Jun 13th, 2007, 10:28pm, Barukh wrote:How would you produce this subset so that every possible subset is equally likely? |
| An easy way is to first pick a random number in [1..N], than one in [1..N-1], if the latter is greater or equal to the first add one to it. Then pick one from [1..N-2] if it's greater/equal to the smallest add one, if it's then greater/equal to the other add one again. Etc. This would be O(K2), with K being the size of the index-subset. It can probably be improved to O(K log K) if you use a tree structure. But since the eventual retrieving of the subset of objects takes a lot more time, there isn't much of a point to improving the runtime.
|
|
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
Grimbal
wu::riddles Moderator Uberpuzzler
Gender:
Posts: 7527
|
|
Re: Sampling Random Subsets From A Stream
« Reply #7 on: Jun 14th, 2007, 3:24am » |
Quote Modify
|
on Jun 13th, 2007, 10:28pm, Barukh wrote:How would you produce this subset so that every possible subset is equally likely? |
| Maybe I don't understand the question, but if all subsets must be equally likely, just take each item with probability 1/2. If the size of the subset is specified, Eigenray gave a perfectly workable solution.
|
|
IP Logged |
|
|
|
Barukh
Uberpuzzler
Gender:
Posts: 2276
|
|
Re: Sampling Random Subsets From A Stream
« Reply #8 on: Jun 14th, 2007, 4:03am » |
Quote Modify
|
on Jun 14th, 2007, 3:24am, Grimbal wrote:Maybe I don't understand the question, but if all subsets must be equally likely, just take each item with probability 1/2. |
| Oops... I forgot to specify that the subset should be of a specified size (like in the original problem). Quote:If the size of the subset is specified, Eigenray gave a perfectly workable solution. |
| Eigenray's solution uses random access data structure.
|
|
IP Logged |
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: Sampling Random Subsets From A Stream
« Reply #9 on: Jun 14th, 2007, 4:05am » |
Quote Modify
|
on Jun 14th, 2007, 3:24am, Grimbal wrote:If the size of the subset is specified, Eigenray gave a perfectly workable solution. |
| Well, the downside is that with Eigenray's solution, you'll have to go through the whole stream before you're done. Of course, on average you'd have to go through nearly the whole stream anyway. And that last wee bit should hurt performance. I suppose you'll save on generating random numbers and such..
|
« Last Edit: Jun 14th, 2007, 4:08am by towr » |
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: Sampling Random Subsets From A Stream
« Reply #10 on: Jun 14th, 2007, 4:06am » |
Quote Modify
|
on Jun 14th, 2007, 4:03am, Barukh wrote:Eigenray's solution uses random access data structure. |
| Only for the subset you already have, not to the stream. For small k it wouldn't be too bad to just go through it to index it, O(k2). But you can always use a tree structure instead.
|
« Last Edit: Jun 14th, 2007, 4:08am by towr » |
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
Barukh
Uberpuzzler
Gender:
Posts: 2276
|
|
Re: Sampling Random Subsets From A Stream
« Reply #11 on: Jun 14th, 2007, 5:36am » |
Quote Modify
|
on Jun 14th, 2007, 4:06am, towr wrote:For small k it wouldn't be too bad to just go through it to index it, O(k2). But you can always use a tree structure instead. |
| Right. But you can do better!
|
|
IP Logged |
|
|
|
Aryabhatta
Uberpuzzler
Gender:
Posts: 1321
|
|
Re: Sampling Random Subsets From A Stream
« Reply #12 on: Jun 14th, 2007, 7:30am » |
Quote Modify
|
Just thinking aloud. I haven't done any calculations or tried to prove correctness. What if have a pipeline of k (=required size of subset) "One number Choosers" (ONC) where a "One number chooser" corresponds to the case the subset size 1. A number is sent down the pipeline if it is rejected (or replaced) by the ONC before it, in its selection algorithm. Each ONC sees its own version of the stream, dependent on the choices of the ONC before it in the pipeline.
|
« Last Edit: Jun 14th, 2007, 7:33am by Aryabhatta » |
IP Logged |
|
|
|
spirit
Newbie
Posts: 11
|
|
Re: Sampling Random Subsets From A Stream
« Reply #13 on: Jun 15th, 2007, 2:28am » |
Quote Modify
|
We can use reservior sampling algorithm,that is use for randomly selects the n records for N set of records where N>>n.
|
|
IP Logged |
|
|
|
Barukh
Uberpuzzler
Gender:
Posts: 2276
|
|
Re: Sampling Random Subsets From A Stream
« Reply #14 on: Jun 15th, 2007, 3:04am » |
Quote Modify
|
Welcome to the forum, spirit! on Jun 15th, 2007, 2:28am, spirit wrote:We can use reservior sampling algorithm,that is use for randomly selects the n records for N set of records where N>>n. |
| Interesting idea, but actually you don't need any reserviors in this case. Hint: The solution I have in mind doesn't use any additional memory.
|
|
IP Logged |
|
|
|
Grimbal
wu::riddles Moderator Uberpuzzler
Gender:
Posts: 7527
|
|
Re: Sampling Random Subsets From A Stream
« Reply #15 on: Jun 15th, 2007, 7:17am » |
Quote Modify
|
Combinatorics! If you have N items and you want a set of size k, there are C(N,k) possible subsets of size k. C(N-1,k-1) include the first element, C(N-1,k) exclude it. So, choose the first element with probability C(N-1,k-1)/C(N,k) which happens to be k/N. For the rest, choose a random set of k or k-1 items among the N-1 items.
|
|
IP Logged |
|
|
|
Barukh
Uberpuzzler
Gender:
Posts: 2276
|
|
Re: Sampling Random Subsets From A Stream
« Reply #16 on: Jun 15th, 2007, 10:37pm » |
Quote Modify
|
Well done, Grimbal! So, when N is known and only sequential accesses are allowed, there is an algorithm taking O(N) time and O(1) space. Let's continue: This time, N is unknown, and still only sequential accesses are allowed. How efficient an algorithm you can devise?
|
|
IP Logged |
|
|
|
Grimbal
wu::riddles Moderator Uberpuzzler
Gender:
Posts: 7527
|
|
Re: Sampling Random Subsets From A Stream
« Reply #17 on: Jun 16th, 2007, 2:14am » |
Quote Modify
|
O(N) time and O(N) memory? (too lazy to search for the moment...)
|
|
IP Logged |
|
|
|
Barukh
Uberpuzzler
Gender:
Posts: 2276
|
|
Re: Sampling Random Subsets From A Stream
« Reply #18 on: Jun 16th, 2007, 4:33am » |
Quote Modify
|
on Jun 16th, 2007, 2:14am, Grimbal wrote:O(N) time and O(N) memory? |
| Hmm... I should've been more specific. The O-notation is not appropriate here. Of course, it's always possible to do a first pass to determine N, and then do it again by the previous algorithm. This gives 2N operations and no extra memory. The real question is: can we do better than 2N? Quote:(too lazy to search for the moment...) |
| Don't believe you can be lazy.
|
« Last Edit: Jun 16th, 2007, 5:24am by Barukh » |
IP Logged |
|
|
|
Grimbal
wu::riddles Moderator Uberpuzzler
Gender:
Posts: 7527
|
|
Re: Sampling Random Subsets From A Stream
« Reply #19 on: Jun 18th, 2007, 1:27am » |
Quote Modify
|
on Jun 16th, 2007, 4:33am, Barukh wrote:Of course, it's always possible to do a first pass to determine N, and then do it again by the previous algorithm. This gives 2N operations and no extra memory. |
| The input being a stream, I thought you can not rewind it and access it a second time. That is why I counted O(N) space to save the data first.
|
|
IP Logged |
|
|
|
Barukh
Uberpuzzler
Gender:
Posts: 2276
|
|
Re: Sampling Random Subsets From A Stream
« Reply #20 on: Jun 18th, 2007, 1:39am » |
Quote Modify
|
on Jun 18th, 2007, 1:27am, Grimbal wrote:That is why I counted O(N) space to save the data first. |
| You can do better.
|
|
IP Logged |
|
|
|
andyv
Newbie
Posts: 14
|
|
Re: Sampling Random Subsets From A Stream
« Reply #21 on: Jun 18th, 2007, 8:17am » |
Quote Modify
|
const int maxn = 100; .... int n; int sol[maxn]; int size, k; cin >> k; size = 0; while (!cin.eof ()) { size++; if (size < n) sol[size - 1] = k; }
|
|
IP Logged |
|
|
|
mad
Junior Member
Posts: 118
|
|
Re: Sampling Random Subsets From A Stream
« Reply #22 on: Mar 3rd, 2008, 8:07pm » |
Quote Modify
|
on Jun 14th, 2007, 2:01am, towr wrote: An easy way is to first pick a random number in [1..N], than one in [1..N-1], if the latter is greater or equal to the first add one to it. Then pick one from [1..N-2] if it's greater/equal to the smallest add one, if it's then greater/equal to the other add one again. Etc. This would be O(K2), with K being the size of the index-subset. It can probably be improved to O(K log K) if you use a tree structure. But since the eventual retrieving of the subset of objects takes a lot more time, there isn't much of a point to improving the runtime. |
| Initially the probability for each index is 1/N. Then, we select an element from [1 .. N-1] - each index has a probability 1/(N-1) of being included... Could you explain how this produces a random sample.
|
|
IP Logged |
|
|
|
towr
wu::riddles Moderator Uberpuzzler
Some people are average, some are just mean.
Gender:
Posts: 13730
|
|
Re: Sampling Random Subsets From A Stream
« Reply #23 on: Mar 4th, 2008, 1:11am » |
Quote Modify
|
on Mar 3rd, 2008, 8:07pm, mad wrote:Initially the probability for each index is 1/N. Then, we select an element from [1 .. N-1] - each index has a probability 1/(N-1) of being included... Could you explain how this produces a random sample. |
| Each element has a probability of 1/N of being among the k elements of the subset Suppose we already picked M elements, and they were jointly picked with probability M/N Then the probability of the N-M elements being left is (N-M)/N, and we then pick one of these with probability 1/(N-M), so the probability it would be picked from the start is (N-M)/N*1/(N-M)= 1/N You can easily see that it holds for M=0, so the induction is sound, and therefore each of the K elements will be picked with probability 1/N
|
|
IP Logged |
Wikipedia, Google, Mathworld, Integer sequence DB
|
|
|
kirru
Newbie
Posts: 34
|
|
Re: Sampling Random Subsets From A Stream
« Reply #24 on: Jul 13th, 2009, 3:25am » |
Quote Modify
|
initially select first k numbers in to a set RANDOM for next element (say i th), replace one of the element in a RANDOM with that element with probability i-k/i. to replace an element select an element from RANDOM with probability 1/k.
|
|
IP Logged |
|
|
|
|