Simple Random Sampling produces samples of a specific size, where each item has the same probability of being chosen. DataFu has scalable implementations of this that will generate samples of exactly the right size with very high probability (at least 99.99%).
Sampling: simple random sample with/without replacement, weighted sample, sample by keys
Hashing: SHA and MD5
Link Analysis: PageRank
Assorted Macros: deduplication of tables, human-readable diffs and more
More Tips and Tricks

There are also Javadocs available for all UDFs in the library. We continue to add UDFs to the library.
Sampling: Simple random sampling with or without replacement, weighted sampling.
Link Analysis: Run PageRank on a graph represented by a bag of nodes and edges.
More: Other useful methods like Assert and Coalesce.

If you'd like to read more details about these functions, check out the Guide.
It takes a bag of n items and a sampling probability p as inputs, and outputs a bag containing a simple random sample of size exactly ceil(p*n), with probability at least 99.99%.
It takes a sampling probability p as input and outputs a simple random sample of size exactly ceil(p*n) with probability at least 99.99%, where n is the size of the population.
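To make the size guarantee concrete, here is a minimal in-memory sketch in Python of drawing a simple random sample of size ceil(p*n) without replacement. It is not DataFu's scalable implementation, which works on distributed data. The function name simple_random_sample, the seed parameter, and the small tolerance used to guard against floating-point rounding are all assumptions made for illustration.

```python
import math
import random


def simple_random_sample(items, p, seed=None):
    """Return a simple random sample of size ceil(p * n), without replacement.

    Hypothetical helper for illustration only; it holds the whole
    population in memory, unlike a scalable implementation.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    population = list(items)
    n = len(population)
    # Subtract a tiny tolerance so that products such as 0.1 * 30, which
    # evaluate to slightly more than 3 in floating point, do not round up to 4.
    k = min(n, max(0, math.ceil(p * n - 1e-9)))
    # random.Random.sample gives every subset of size k the same probability.
    rng = random.Random(seed)
    return rng.sample(population, k)
```

For example, simple_random_sample(range(10), 0.25) returns 3 distinct items, since ceil(0.25 * 10) = 3. Because the population fits in memory here, the sample size is always exact, whereas the scalable version described above achieves the exact size only with very high probability.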
Sampling by key works by hashing the key, deriving a double value from the hash, and testing that value against a supplied probability. The double value derived from a key is uniformly distributed between 0 and 1.
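The hash-then-compare idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DataFu's code: the choice of MD5, the optional salt, the use of the top 53 bits of the digest, the "keep when the value is below p" comparison, and the function names key_to_unit_interval and sample_by_key are all hypothetical.

```python
import hashlib


def key_to_unit_interval(key, salt=""):
    """Map a string key to a double in [0, 1) via a hash (illustrative only)."""
    digest = hashlib.md5((salt + key).encode("utf-8")).digest()
    # Take the first 8 bytes as an integer and keep the top 53 bits, the
    # largest width a double represents exactly, so the result is below 1.0.
    value = int.from_bytes(digest[:8], "big") >> 11
    return value / 2**53


def sample_by_key(key, p, salt=""):
    """Keep the key if its derived value falls below the probability p."""
    return key_to_unit_interval(key, salt) < p
```

Because the decision depends only on the key (and the salt), every record sharing a key is either kept or dropped together, and the same key gets the same decision on every run. Changing the salt in this sketch would select a different subset of keys.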