
Spark sample operator, differences between common Spark operators

Date: 2023-05-06 17:37:58  Views: 250388  Author: 4018

sample(withReplacement, fraction, seed) 
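Randomly samples the RDD using the given random seed. fraction is the expected size of the sample as a fraction of the RDD's size, not an exact count. withReplacement specifies whether a sampled element is put back: true samples with replacement (an element may appear more than once), false samples without replacement. seed sets the seed of the random number generator, so fixing it makes the sampling repeatable.
For example, rdd.sample(true, 0.5, 3) samples about 50% of the RDD with replacement, using random seed 3. The seed only initializes the random number generator; it is not a starting element and not a choice among the values 1, 2, 3. sample is mainly used to inspect the distribution of a large dataset.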

Source code:

/**
 * Return a sampled subset of this RDD.
 *
 * @param withReplacement can elements be sampled multiple times (replaced when sampled out)
 * @param fraction expected size of the sample as a fraction of this RDD's size
 *  without replacement: probability that each element is chosen; fraction must be [0, 1]
 *  with replacement: expected number of times each element is chosen; fraction must be >= 0
 * @param seed seed for the random number generator
 */
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = withScope {
  require(fraction >= 0.0, "Negative fraction value: " + fraction)
  if (withReplacement) {
    new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
  } else {
    new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
  }
}
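The two branches in the source can be made concrete with a minimal pure-Scala sketch of what each sampler does conceptually. This is not Spark's implementation: the object name SamplingSketch, the helper names, the use of scala.util.Random, and Knuth's method for drawing Poisson counts are all assumptions made for illustration. Spark applies its samplers per partition, so the sketch will not reproduce the exact elements Spark returns for a given seed.

```scala
import scala.util.Random

// Illustrative sketch only: NOT Spark's BernoulliSampler/PoissonSampler code.
object SamplingSketch {

  // Without replacement: keep each element independently with probability `fraction`.
  def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0 && fraction <= 1.0, s"fraction must be in [0, 1]: $fraction")
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  // Draw a Poisson(mean) count with Knuth's multiplication method (fine for small means).
  private def poissonCount(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // With replacement: emit each element k times, where k is drawn from Poisson(fraction).
  def poissonSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0, s"Negative fraction value: $fraction")
    val rng = new Random(seed)
    data.flatMap(x => Seq.fill(poissonCount(fraction, rng))(x))
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 10
    // Sample sizes vary with the seed; only their expected values are fixed.
    println(bernoulliSample(data, 0.2, seed = 3L))
    println(poissonSample(data, 0.4, seed = 2L))
  }
}
```

In both cases the number of returned elements is itself random. Only its expected value is determined by fraction, and reusing a seed reproduces the same result.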

Example code:

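scala> val rdd = sc.parallelize(1 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:24

scala> rdd.collect()
res11: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> var sample1 = rdd.sample(true, 0.4, 2).collect
sample1: Array[Int] = Array(1, 2, 2, 7, 7, 8, 9)

scala> var sample2 = rdd.sample(false, 0.2, 3).collect
sample2: Array[Int] = Array(1, 9)

Why does sample1 contain 7 elements rather than 4? With replacement, fraction is the expected number of times each element is chosen, not an exact proportion, so the sample size is random: its expected value is 10 × 0.4 = 4, but any single run can return more or fewer elements. The same holds without replacement, where each element is kept independently with probability 0.2. Getting 2 elements in sample2 matches the expectation but is not guaranteed.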
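To put a rough number on how surprising a sample of 7 elements is, the sketch below computes the distribution of the sample size for the with-replacement call. It rests on a simplifying assumption: each of the 10 elements is copied an independent Poisson(0.4) number of times, so the total size follows Poisson(10 × 0.4) = Poisson(4). The object and function names are hypothetical and nothing here calls Spark.

```scala
// Sketch under a simplifying assumption: each element's copy count is an independent
// Poisson(fraction) draw, so the total sample size is Poisson(n * fraction).
object SampleSizeSketch {

  // P(X = k) for X ~ Poisson(mean), built up iteratively to avoid large factorials.
  def poissonPmf(mean: Double, k: Int): Double =
    (1 to k).foldLeft(math.exp(-mean))((acc, i) => acc * mean / i)

  def main(args: Array[String]): Unit = {
    val n = 10          // number of elements in the example RDD
    val fraction = 0.4  // fraction passed to sample(true, 0.4, 2)
    val mean = n * fraction
    println(f"expected sample size: $mean%.1f")
    (0 to 10).foreach { k =>
      println(f"P(size = $k%2d) = ${poissonPmf(mean, k)}%.4f")
    }
  }
}
```

Under this model a size of exactly 4 has a probability of only about 0.20, and a size of 7 about 0.06, so a result like the one in the session is uncommon but not remarkable.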

 
