家用过滤器,高效过滤器

在高缓存直通并发情况下，缓存通常用于处理大多数请求(如memcached、ehcache和redis )，以防止请求直接命中数据库。

但是，对于一些恶意要求，传统的缓存机制很难应对。例如，有以下情况。

接口参数为String id，业务处理为获取此参数，SQL : select * fromuserwhereid=？ limit 1；通常的用户要求参数为25，业务处理为

select * fromuserwhereid=25 limit 1；成功返回查询的用户表id为25的信息，并将辅助数据放入缓存中，就像redis一样。 key=25，value=user.toString )和查询id为25的请求可以直接返回redis中的数据，而无需重复查询数据库。

但是，请求参数为-1时，业务处理为

select * from user where id=-1 limit 1；数据库中通常不存在id-1数据，返回值为空，因此缓存中也不存在数据。大量这些请求会命中数据库，导致性能问题，并影响整个系统。缓存成为摆设，无法处理重复的请求。将此情况转换为缓存穿透

解决方案缓存直通有各种各样的方法。例如，创建无效请求的黑名单，并且如果查询参数在黑名单上，则可以直接监听；在前一级放置有效数据集合，记录所有密钥，根据集合进行判断；等等。但是，这些方案有很大的弊端

需要无限大的空间来保存不存在的信息，不合理；集合大小与数据库大小呈正相关，并且必须遵循数据库才能实时更新，以避免严重消耗。 “光晕过滤器”(Bloom Filter )轮廓是一个长二进制向量和一系列随机映射函数

用于检测集合中是否存在元素

空间效率和查询时间远远优于一般算法

有一定的错误识别率，不能删除要素

算法介绍使用一系列随机映射函数将元素映射到二进制向量上，每个函数可以设置为1。用16位向量举个例子，可以得到以下内容

1010101000101000

检测判别元素时

使用相同的映射函数完成映射，得到

0100101010000000

按上一步获得的向量和位进行或运算

1010100100010010010100000---------------110101010101000中得到的向量不等于原始向量，因此可以得出该元素不等于原始向量

模糊滤波器的误判别按照上述步骤生成二进制向量

检测某些不存在的元素

得到二进制向量

0010100010100000

上一步生成的二进制向量和按位或运算

1010100100001010010010000---------------- 10101001000中得到的二进制向量与原始向量相等，因此请参见

注：光晕滤波器的错误识别率与二进制向量的长度、存储要素的个数有关。

JAVA使用guava.jar封装了布隆过滤器，并在pom文件中引入了依赖软件包

dependencygroupidcom.Google.guava/groupidartifactidguava/artifactidversion 23.0 /版本/dependency表示布隆过滤器privatestaticbloomfilterintegerbloomfilter=bloom filter.create (funnels.integer funnel (，total，fpp

BloomFilter.create提供三个参数：元素类型、最大存储数和错误判断率。

guava封装基于这三个参数生成二进制向量数组LockFreeBitArray。

内存越大，误判率越低，该数组越大，理论上运算也越慢。稍后测试性能。

源代码如下：

/* * createsa { @ linkbloomfilter } withtheexpectednumberofinsertionsand * expectedfalsepositiveprobability.* * pnotethathatoviove withsignificantlymoreelementsthanspecified，* its saturati在will result中

on, and a sharp deterioration of its false positive probability. * * <p>The constructed {@code BloomFilter} will be serializable if the provided * {@code Funnel<T>} is. * * <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of * ensuring proper serialization and deserialization, which is important since {@link #equals} * also relies on object identity of funnels. * * @param funnel the funnel of T's that the constructed {@code BloomFilter} will use * @param expectedInsertions the number of expected insertions to the constructed * {@code BloomFilter}; must be positive * @param fpp the desired false positive probability (must be positive and less than 1.0) * @return a {@code BloomFilter} * @since 19.0 */ public static <T> BloomFilter<T> create( Funnel<? super T> funnel, long expectedInsertions, double fpp) { return create(funnel, expectedInsertions, fpp, BloomFilterStrategies.MURMUR128_MITZ_64); } @VisibleForTesting static <T> BloomFilter<T> create( Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) { checkNotNull(funnel); checkArgument( expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions); checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp); checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp); checkNotNull(strategy); if (expectedInsertions == 0) { expectedInsertions = 1; } /* * TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size * is proportional to -log(p), but there is not much of a point after all, e.g. * optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares! */ long numBits = optimalNumOfBits(expectedInsertions, fpp); int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits); try { return new BloomFilter<T>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy); } catch (IllegalArgumentException e) { throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e); }

LockFreeBitArray是BloomFilter的静态内部类，通过他的构造器可以发现，实际存储的数据类型为AtomicLongArray

看到这里的包路径，我们就更可以安心了，线程安全。

元素检测

调用封装好的api，BloomFilter.mightContain

读注释发现，返回true表示可能含有此数据，返回false表示一定不包含此数据！
上述为使用布隆过滤器的思想。

布隆过滤器的使用与性能表现 package com.example.demo.test;import com.google.common.hash.BloomFilter;import com.google.common.hash.Funnels;import java.util.ArrayList;import java.util.List;public class BloomTest { private static int total = 10000000; private static double fpp = 0.0000001; private static BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), total, fpp); public static void main(String[] args) { System.out.println("=======init========"); Long initTime = System.currentTimeMillis(); for (int i = 0; i < total; i++) { bloomFilter.put(i); } Long startTime = System.currentTimeMillis(); System.out.println("init time:"+(startTime - initTime)+"ms"); System.out.println("=======start========"); // 存放误判的数据 List<Integer> list = new ArrayList<>(1000); for (int i = total+1000; i < total + total ; i++) { if (bloomFilter.mightContain(i)) { list.add(i); } } System.out.println("误判数量:" + list.size()); System.out.println("=======done========"); Long endTime = System.currentTimeMillis(); System.out.println("use time:"+(endTime - startTime)+"ms"); }}

声明一千万的数据量，同时进行一千万的数据检测，误判率为0.0000001

执行结果为

=init==
init time:16554ms
=start==
误判数量:2
=done==
use time:2378ms

可以看出在初始化时，较为耗时，但做一千万数据量与另一个一千万数据量的比对时，仅耗时2.3s，且只有2个误判数量。
对应到实际业务中就是，数据库存在一千万的数据量，恶意请求为一千万个不同的错误id，仅有2个id会打到数据库上，基本可以完成对恶意请求的拦截。
若想要更好的结果，可以继续优化误判别率，或是针对这不多的漏网之鱼定向捕杀。