apache.org
https://datafu.apache.org/docs/datafu/guide/sampli…
Sampling - Guide - Apache DataFu Pig
Simple Random Sampling produces samples of a specific size, where each item has the same probability of being chosen. DataFu has scalable implementations of this that will generate samples of exactly the right size with very high probability (at least 99.99%).
apache.org
https://datafu.apache.org/docs/datafu/guide.html
Guide - Apache DataFu Pig
Sampling: simple random sample with/without replacement, weighted sample, sample by keys. Hashing: SHA and MD5. Link Analysis: PageRank. Assorted Macros: deduplication of tables, human-readable diffs and more. More Tips and Tricks. There are also Javadocs available for all UDFs in the library. We continue to add UDFs to the library.
apache.org
https://datafu.apache.org/docs/datafu/getting-star…
Apache DataFu Pig - Getting Started
Sampling: simple random sampling with or without replacement, weighted sampling. Link Analysis: run PageRank on a graph represented by a bag of nodes and edges. More: other useful methods like Assert and Coalesce. If you'd like to read more details about these functions, check out the Guide.
apache.org
https://datafu.apache.org/docs/datafu/1.6.1/datafu…
SimpleRandomSample (datafu-pig 1.6.1 API)
It takes a bag of n items and a sampling probability p as the inputs, and outputs a simple random sample of size exactly ceil(p*n) in a bag, with probability at least 99.99%.
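To make the size rule concrete, here is a minimal in-memory sketch in Python of drawing a simple random sample of size ceil(p*n) without replacement. It is not DataFu's implementation: the function name `simple_random_sample` and the `seed` parameter are hypothetical, and because the whole population fits in memory the size is always exact, whereas the scalable UDF described above reaches the exact size only with high probability.

```python
import math
import random


def simple_random_sample(items, p, seed=None):
    """Return ceil(p * n) items drawn uniformly without replacement.

    Illustrative only: `items` must be an in-memory list, and `seed`
    is a hypothetical parameter added here for reproducibility.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    rng = random.Random(seed)
    k = math.ceil(p * len(items))  # target sample size
    # random.Random.sample picks k distinct positions, so every
    # subset of size k is equally likely.
    return rng.sample(items, k)
```

For example, with a population of 10 items and p = 0.25 the sketch returns ceil(2.5) = 3 items.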
apache.org
https://datafu.apache.org/docs/datafu/1.1.0/datafu…
SimpleRandomSample (DataFu 1.1.0)
It takes a sampling probability p as input and outputs a simple random sample of size exactly ceil(p*n) with probability at least 99.99%, where n is the size of the population.
apache.org
https://datafu.apache.org/docs/datafu/1.2.0/datafu…
datafu.pig.sampling (DataFu 1.2.0)
Sampling UDFs, including weighted sample, reservoir sampling, sampling by key, etc.
apache.org
https://datafu.apache.org/docs/datafu/1.6.0/datafu…
SampleByKey (datafu-pig 1.6.0 API)
The method of sampling is to convert the key to a hash, derive a double value from this, and then test this against a supplied probability. The double value derived from a key is uniformly distributed between 0 and 1.
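The hash-then-threshold idea can be sketched in a few lines of Python. This is an illustration of the technique, not the SampleByKey source: the choice of MD5, the `salt` parameter, the use of the top 53 bits of the digest, and the names `key_to_unit_interval` and `sample_by_key` are all assumptions made for this example.

```python
import hashlib


def key_to_unit_interval(key, salt=""):
    """Map a string key to a deterministic double in [0, 1).

    Assumption: MD5 and the 53-bit truncation are illustrative
    choices, not necessarily what the library uses.
    """
    digest = hashlib.md5((salt + key).encode("utf-8")).digest()
    as_int = int.from_bytes(digest[:8], "big")  # first 64 bits
    # Keep the top 53 bits so the quotient is an exact double below 1.0.
    return (as_int >> 11) / float(1 << 53)


def sample_by_key(records, key_fn, p, salt=""):
    """Keep every record whose key hashes below the probability p."""
    return [r for r in records
            if key_to_unit_interval(key_fn(r), salt) < p]
```

Because the decision depends only on the key, all records sharing a key are kept or dropped together, and rerunning with the same salt selects the same keys.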
apache.org
https://datafu.apache.org/docs/datafu/1.5.0/overvi…
Class Hierarchy (datafu-pig 1.5.0 API)
datafu.pig.sampling.ReservoirSample (implements org.apache.pig.Algebraic), datafu.pig.sampling.WeightedReservoirSample, datafu.pig.sessions.SessionCount, datafu.pig.sessions.Sessionize, datafu.pig.stats.StreamingQuantile, datafu.pig.stats.StreamingMedian, datafu.pig ...
apache.org
https://datafu.apache.org/docs/datafu/1.6.1/datafu…
ReservoirSample (datafu-pig 1.6.1 API)
java.lang.Object > org.apache.pig.EvalFunc<T> > org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag> > datafu.pig.sampling.ReservoirSample. All Implemented Interfaces: org.apache.pig.Accumulator<org.apache.pig.data.DataBag>, org.apache.pig.Algebraic
apache.org
https://datafu.apache.org/docs/datafu/1.2.0/index-…
Index (DataFu 1.2.0)
datafu.pig.sampling - package datafu.pig.sampling: Sampling UDFs, including weighted sample, reservoir sampling, sampling by key, etc. datafu.pig.sessions - package datafu.pig.sessions