储层采样


难度级别 中等
经常问 亚马逊 Facebook
排列

储层采样是从n个项的给定列表中随机选择k个储层项的技术,其中n非常大。
例如,Google,YouTube等中的搜索列表。

储层采样

朴素的水库采样方法

建一个水库 排列 的大小为k,则从给定列表中随机选择项目。 如果所选项目在存储库中不存在,则添加它,否则继续下一个项目。

  1. 创建一个大小为k的容器数组。
  2. 将curr初始化为0,curr表示要填充的油藏阵列的curr位置。 当curr小于k时,重复第3步和第4步。
  3. 生成一个介于0到n之间的随机数(不包括在内)。 令生成的随机数为rand。
  4. 如果stream [rand]处的元素已经存在于容器数组中,请重复步骤3。否则将tank [curr]设置为stream [rand]并递增curr。
  5. 打印容器数组的元素。

JAVA代码的水库采样

public class ReservoirSampling {
    private static void selectReservoir(int[] stream, int k) {
        int n = stream.length;
        int reservoir[] = new int[k];
        // Index where the current item has to be placed
        int curr = 0;

        while (curr < k) {
            // Randomly generate an index between 0 to n
            int i = (int) (Math.random() * n) % n;
            boolean found = false;
            // Check if it is already present
            for (int j = 0; j < curr; j++) {
                if (reservoir[j] == stream[i]) {
                    found = true;
                    break;
                }
            }
            // If not present add it to the reservoir list
            if (!found) {
                reservoir[curr] = stream[i];
                curr++;
            }
        }

        // Print the randomly selected items
        for (int i = 0; i < k; i++) {
            System.out.print(reservoir[i] + " ");
        }
        System.out.println();
    }

    public static void main(String[] args) {
        // Example
        int stream[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
        int k = 5;
        selectReservoir(stream, k);
    }
}

储层采样的C ++代码

#include <bits/stdc++.h>
using namespace std;

void selectReservoir(int stream[], int k, int n) {
    int reservoir[k];
    // Index where the current item has to be placed
    int curr = 0;
    
    srand(time(0));
    
    while (curr < k) {
        // Randomly generate an index between 0 to n
        int i = rand() % n;
        bool found = false;
        // Check if it is already present
        for (int j = 0; j < curr; j++) {
            if (reservoir[j] == stream[i]) {
                found = true;
                break;
            }
        }
        // If not present add it to the reservoir list
        if (!found) {
            reservoir[curr] = stream[i];
            curr++;
        }
    }
    
    // Print the randomly selected items
    for (int i = 0; i < k; i++) {
        cout<<reservoir[i]<<" ";
    }
    cout<<endl;
}

int main() {
    // Example
    int stream[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    int k = 5;
    int n = sizeof(stream) / sizeof(stream[0]);
    selectReservoir(stream, k, n);
    return 0;
}
2 12 9 3 5

输出可能会有所不同,因为代码会随机选择储层。

复杂度分析

时间复杂度= 好的2
空间复杂度= O(1) 
其中k是存储库数组中存在的元素数。

储层采样的最佳方法

对于给定的流,可以有效地解决上述问题,因为

  1. 创建一个大小为k的容器数组,并将流的前k个项目复制到该数组中。
  2. 从索引(k +1)到n遍历流。
  3. 对于流中的第i个元素,生成一个随机数,例如j,如果该数在0到(k – 1)之间,则用stream [i]替换库[j]。
  4. 库数组包含从流中选择的k个随机数,然后打印它们。

JAVA代码的水库采样

public class ReservoirSampling {
    private static void selectReservoir(int[] stream, int k) {
        int n = stream.length;
        int reservoir[] = new int[k];

        // Copy first k items to the reservoir array
        for (int i = 0; i < k; i++) {
            reservoir[i] = stream[i];
        }

        // Traverse the remaining stream
        for (int i = k; i < n; i++) {
            // Generate a random number between 0 to n
            int randomNumber = (int) (Math.random() * n) % n;
            // If this number lies between 0 to k, replace reservoir[randomNumber] with stream of i
            if (randomNumber >= 0 && randomNumber < k) {
                reservoir[randomNumber] = stream[i];
            }
        }

        // Print the reservoir array
        for (int i = 0; i < k; i++) {
            System.out.print(reservoir[i] + " ");
        }
        System.out.println();
    }

    public static void main(String[] args) {
        int stream[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
        int k = 5;
        selectReservoir(stream, k);
    }
}

储层采样的C ++代码

#include <bits/stdc++.h>
using namespace std;

void selectReservoir(int stream[], int k, int n) {
    int reservoir[k];
    
    // Copy first k items to the reservoir array
    for (int i = 0; i < k; i++) {
        reservoir[i] = stream[i];
    }
    
    srand(time(0));
    
    // Traverse the remaining stream
    for (int i = k; i < n; i++) {
        // Generate a random number between 0 to n
        int randomNumber = rand() % n;
        
        // If this number lies between 0 to k, replace reservoir[randomNumber] with stream of i
        if (randomNumber >= 0 && randomNumber < k) {
            reservoir[randomNumber] = stream[i];
        }
    }
    
    // Print the randomly selected items
    for (int i = 0; i < k; i++) {
        cout<<reservoir[i]<<" ";
    }
    cout<<endl;
}

int main() {
    // Example
    int stream[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
    int k = 5;
    int n = sizeof(stream) / sizeof(stream[0]);
    selectReservoir(stream, k, n);
    return 0;
}
11 2 8 4 6

输出可能会有所不同,因为代码会随机选择储层。

复杂度分析

时间复杂度= O(N) 
空间复杂度= O(1)
其中,n是流中元素的总数,k是容器数组中存在的元素数。

參考資料