# Reservoir Sampling

Wikipedia:

Reservoir sampling is a family of randomized algorithms for randomly choosing `k` samples from a list of `n` items, where `n` is either a very large or unknown number. Typically `n` is large enough that the list doesn’t fit into main memory.

O(n) time solution:

1. Create an array `reservoir[0..k-1]` and copy the first `k` items of `stream[]` into it.
2. Consider the remaining items one by one, from the `(k+1)`th item to the `n`th item (indexes `k` to `n-1`):
    1. Generate a random number `j` from `0` to `i` inclusive, where `i` is the index of the current item in `stream[]`.
    2. If `j` is in the range `0` to `k-1`, replace `reservoir[j]` with `stream[i]`.

Code

``````java
// An efficient Java program to randomly
// select k items from a stream of items
import java.util.Arrays;
import java.util.Random;

public class ReservoirSampling
{
    // A function to randomly select k items from stream[0..n-1].
    static void selectKItems(int[] stream, int n, int k)
    {
        // The reservoir cannot hold more items than the stream provides.
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("k must be between 0 and n");
        }

        int i; // index for elements in stream[]

        // reservoir[] is the output array. Initialize it with
        // the first k elements from stream[]
        int[] reservoir = new int[k];
        for (i = 0; i < k; i++) {
            reservoir[i] = stream[i];
        }

        Random r = new Random();

        // Iterate from the (k+1)th element to the nth element
        for (; i < n; i++)
        {
            // Pick a random index from 0 to i (inclusive).
            int j = r.nextInt(i + 1);

            // If the randomly picked index is smaller than k,
            // then replace the element present at that index
            // with the new element from the stream
            if (j < k) {
                reservoir[j] = stream[i];
            }
        }

        System.out.println("Following are k randomly selected items");
        System.out.println(Arrays.toString(reservoir));
    }

    // Driver program to test the above method
    public static void main(String[] args) {
        int[] stream = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
        int n = stream.length;
        int k = 5;
        selectKItems(stream, n, k);
    }
}
// This code is contributed by Sumit Ghosh
``````
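
The program above takes the whole array and its length, while the definition at the top allows `n` to be unknown. As a minimal sketch of the same two steps over an `Iterator`, the following may help; the class `StreamReservoir` and its `sample` method are names made up for illustration, and the `int` counter assumes the stream has fewer than `Integer.MAX_VALUE` items.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the original program.
final class StreamReservoir {
    // Returns up to k items chosen uniformly at random from the iterator,
    // without knowing in advance how many items it will yield.
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0; // number of items consumed so far
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // Step 1: the first k items go straight into the reservoir.
                reservoir.add(item);
            } else {
                // Step 2: draw j from 0..seen-1 (the current index is seen-1)
                // and overwrite slot j only when j < k.
                int j = rng.nextInt(seen);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }
}
```

If the stream yields fewer than `k` items, this sketch simply returns all of them rather than throwing.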

#### How does it work?

To prove: the probability that any item `stream[i]`, where `0 <= i < n`, ends up in the final `reservoir[]` is `k/n`.

##### Case 1: For the last n-k stream items, i.e., for stream[i] where k <= i < n

For `stream[n-1]`:

``````
The probability that the last item is in the final reservoir

= The probability that one of the first k indexes is picked for the last item

= k/n (the probability of picking one of k indexes out of n possible values)
``````

For `stream[n-2]`:

``````
The probability that the second-to-last item is in the final reservoir[]

= [Probability that one of the first k indexes is picked in the iteration for stream[n-2]] x
  [Probability that the index picked in the iteration for stream[n-1] is not the same as the index picked for stream[n-2]]

= [k/(n-1)] x [(n-1)/n] = k/n.
``````
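
The two worked cases generalize to any index `i` with `k <= i < n`. The item enters the reservoir with probability `k/(i+1)`, and each later item `stream[m]` overwrites one particular slot with probability `1/(m+1)`, so the item survives that iteration with probability `m/(m+1)`. Written out, with `m` as an index introduced here only for illustration, the product telescopes:

```latex
\[
P\bigl(\text{stream}[i] \in \text{reservoir}\bigr)
  = \frac{k}{i+1} \cdot \prod_{m=i+1}^{n-1} \frac{m}{m+1}
  = \frac{k}{i+1} \cdot \frac{i+1}{n}
  = \frac{k}{n}
\]
```

Setting `i = n-1` (empty product) and `i = n-2` recovers the two cases above.
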

##### Case 2: For the first k stream items, i.e., for stream[i] where 0 <= i < k

The first `k` items are initially copied to `reservoir[]` and may be removed later, in the iterations for `stream[k]` to `stream[n-1]`.

``````
The probability that an item from stream[0..k-1] is in the final array

= Probability that the item's slot is not picked when items stream[k], stream[k+1], …, stream[n-1] are considered

= [k/(k+1)] x [(k+1)/(k+2)] x [(k+2)/(k+3)] x … x [(n-1)/n] = k/n
``````
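
The `k/n` claim can also be checked empirically. The sketch below repeats the procedure many times over the items `0..n-1` and reports how often each item ends up in the reservoir; `InclusionFrequency` and `estimate` are hypothetical names, and the trial count is an arbitrary choice.

```java
import java.util.Random;

// Hypothetical checker, not part of the original program.
final class InclusionFrequency {
    // Runs the reservoir procedure `trials` times over items 0..n-1 and
    // returns, per item, the fraction of runs in which it was kept.
    // Assumes 0 <= k <= n and trials > 0.
    static double[] estimate(int n, int k, int trials, Random rng) {
        int[] hits = new int[n];
        int[] reservoir = new int[k];
        for (int t = 0; t < trials; t++) {
            // Fill the reservoir with the first k items.
            for (int i = 0; i < k; i++) {
                reservoir[i] = i;
            }
            // Process the remaining items exactly as in selectKItems.
            for (int i = k; i < n; i++) {
                int j = rng.nextInt(i + 1);
                if (j < k) {
                    reservoir[j] = i;
                }
            }
            // Record which items survived this run.
            for (int item : reservoir) {
                hits[item]++;
            }
        }
        double[] freq = new double[n];
        for (int i = 0; i < n; i++) {
            freq[i] = (double) hits[i] / trials;
        }
        return freq;
    }

    public static void main(String[] args) {
        double[] freq = estimate(12, 5, 200_000, new Random());
        for (int i = 0; i < freq.length; i++) {
            System.out.printf("item %d: %.4f (expected %.4f)%n", i, freq[i], 5.0 / 12);
        }
    }
}
```

With `n = 12` and `k = 5`, every printed frequency should land close to `5/12`, for early and late items alike.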

## Interview Questions

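### Interview question: pick one line from a file with equal probability

#### Problem description

Amazon: A file has so many lines that it cannot be loaded into memory all at once. How do you pick one of its lines uniformly at random?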
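
One common answer is reservoir sampling with `k = 1`: keep a single candidate line and replace it with the `i`th line with probability `1/i`. A minimal sketch follows, assuming the file is exposed as a `java.io.Reader`; `RandomLine` and `pick` are made-up names.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.Iterator;
import java.util.Random;

// Hypothetical helper illustrating reservoir sampling with k = 1.
final class RandomLine {
    // Returns one line chosen uniformly at random, or null if there are no lines.
    // Only the current line and the current candidate are held in memory.
    static String pick(Reader input, Random rng) {
        // BufferedReader.lines() reads lazily and wraps I/O failures
        // in UncheckedIOException.
        Iterator<String> lines = new BufferedReader(input).lines().iterator();
        String chosen = null;
        int seen = 0; // number of lines read so far
        while (lines.hasNext()) {
            String line = lines.next();
            seen++;
            // The seen-th line replaces the candidate with probability 1/seen.
            if (rng.nextInt(seen) == 0) {
                chosen = line;
            }
        }
        return chosen;
    }
}
```

For a real file, something like `RandomLine.pick(new java.io.FileReader(path), new Random())` would be the typical call, where `path` is whatever file name applies.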

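#### So what does the online algorithm look like?

1. If the query is not Chinese, skip it.
2. If the buffer is not full, add the query to the buffer directly.
3. If the buffer is full, suppose `M` Chinese queries have appeared so far. Use a random function to decide, with probability `N / M` (`N` being the buffer size), whether this query is kept.
    1. If it is not selected, skip it and continue with the next query.
    2. If it is selected, use a random function to pick one query in the buffer, each with probability `1 / N`, discard it, and put the current query in its place.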
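
The three steps can be sketched as a small class. Everything named here is hypothetical (`QueryBuffer`, `offer`, `snapshot`, `isChinese`), and two assumptions are made: a query counts as Chinese when it contains at least one Han character, and `M` includes the query currently being processed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of the buffer-based online algorithm.
final class QueryBuffer {
    private final int capacity;        // N, the buffer size
    private final List<String> buffer;
    private final Random rng;
    private long seen = 0;             // M, Chinese queries seen so far

    QueryBuffer(int capacity, Random rng) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
        this.buffer = new ArrayList<>(capacity);
        this.rng = rng;
    }

    // Assumption: "Chinese" means the query contains at least one Han character.
    static boolean isChinese(String query) {
        return query.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    void offer(String query) {
        if (!isChinese(query)) {
            return;                    // step 1: skip non-Chinese queries
        }
        seen++;
        if (buffer.size() < capacity) {
            buffer.add(query);         // step 2: buffer not full yet
            return;
        }
        // Step 3: keep the query with probability N / M.
        if (rng.nextDouble() * seen < capacity) {
            // Step 3.2: evict a slot chosen with probability 1 / N.
            buffer.set(rng.nextInt(capacity), query);
        }
        // Step 3.1: otherwise the query is dropped.
    }

    // Returns a copy of the current sample.
    List<String> snapshot() {
        return new ArrayList<>(buffer);
    }
}
```

Each slot is overwritten with probability `(N / M) x (1 / N) = 1 / M`, which matches drawing one index from `0` to `M-1` and replacing only when it is below `N`, as `selectKItems` does.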
