Efficient K-means clustering with faiss

https://www.aiuai.cn/aifarm1662.html
For faiss installation errors, see: https://github.com/facebookresearch/faiss/issues/821

1. K-means clustering

import faiss
import pickle
import numpy as np
import time

x = np.random.random((100000, 2048)).astype('float32')
ncentroids = 10
niter = 500
verbose = True
d = x.shape[1]

start_time = time.time()

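'''
d: vector dimension
ncentroids: number of cluster centers
niter: number of iterations
verbose: whether to print progress for each iteration
gpu: whether to use GPUs (True for all GPUs, or an int for that many GPUs)
'''
# Pick ONE of the constructors below; each assignment would replace the previous one.
# CPU
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
# GPU, using all available GPUs (requires a GPU build of faiss)
# kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=True)
# GPU, using 3 GPUs
# kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose, gpu=3)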

kmeans.train(x)
train_time = time.time()
print(train_time - start_time)

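cluster_centers = kmeans.centroids  # cluster centers after training
obj = kmeans.obj  # objective function; for k-means, the total squared error
iteration_stats = kmeans.iteration_stats  # statistics collected during clustering

# Predict
D, I = kmeans.index.search(x, 1)
# I holds, for each row vector of x, the index of its nearest cluster centroid;
# D holds the corresponding squared L2 distances.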

search_time = time.time()
print(search_time - train_time)

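# Reverse lookup: build a flat (brute-force) L2 index over the data
# and find the 15 points of x nearest to each centroid
index = faiss.IndexFlatL2(d)
index.add(x)
D, I = index.search(kmeans.centroids, 15)
print(D)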
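
To make the train and search steps above more concrete, here is a minimal NumPy sketch of plain Lloyd-style k-means. It is not how faiss implements clustering internally; the helper names search_nearest and kmeans_numpy, the random initialization, and the toy two-blob data are all assumptions made for illustration. search_nearest plays the role of kmeans.index.search(x, 1), returning squared L2 distances and nearest-centroid indices.

```python
import numpy as np

def search_nearest(x, centroids):
    """Return (D, I): squared L2 distance to, and index of, the nearest centroid."""
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, computed for all pairs at once
    d2 = ((x ** 2).sum(1, keepdims=True)
          - 2.0 * x @ centroids.T
          + (centroids ** 2).sum(1))
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative values from rounding
    I = d2.argmin(1)
    D = d2[np.arange(len(x)), I]
    return D, I

def kmeans_numpy(x, k, niter=20, seed=0):
    """Hypothetical helper: Lloyd iterations with random data points as initial centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    obj = []  # total squared error at each iteration
    for _ in range(niter):
        D, I = search_nearest(x, centroids)   # assignment step
        obj.append(float(D.sum()))
        for j in range(k):                    # update step
            members = x[I == j]
            if len(members) > 0:              # leave empty clusters unchanged
                centroids[j] = members.mean(0)
    return centroids, obj

# Toy data: two well-separated 2-D blobs
rng = np.random.default_rng(0)
x_demo = np.concatenate([rng.normal(0.0, 0.1, (50, 2)),
                         rng.normal(5.0, 0.1, (50, 2))])
centroids, obj = kmeans_numpy(x_demo, k=2)
D, I = search_nearest(x_demo, centroids)
```

The objective recorded in obj never increases from one iteration to the next, which is the same quantity the faiss example exposes through kmeans.obj.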

2. PCA

For example, to reduce 40-dimensional vectors to 10 dimensions:

# randomly generated training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply_py(mt)
#print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))
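
The decreasing column magnitudes can be reproduced with a small NumPy sketch of textbook PCA (eigendecomposition of the covariance matrix). This is only an illustration of the idea, not the faiss implementation; pca_fit and pca_apply are hypothetical helper names.

```python
import numpy as np

def pca_fit(x, d_out):
    """Return the data mean and the top d_out principal directions (as columns)."""
    mean = x.mean(0)
    cov = np.cov(x - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d_out]    # keep the largest d_out
    return mean, eigvecs[:, order]

def pca_apply(x, mean, components):
    """Center the data and project it onto the principal directions."""
    return (x - mean) @ components

rng = np.random.default_rng(0)
mt = rng.random((1000, 40)).astype('float32')
mean, components = pca_fit(mt, 10)
tr = pca_apply(mt, mean, components)

# Sum of squares per output column; proportional to the kept eigenvalues,
# so it decreases from the first column to the last
energy = (tr ** 2).sum(0)
print(energy)
```

Because the columns are ordered by eigenvalue, the first output dimension carries the most variance and each later one carries less.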

3. Product quantization (PQ)

For example:

d = 32  # data dimension
cs = 4 # code size (bytes)

# randomly generated training set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# cs sub-quantizers of 8 bits each, so every vector is encoded in cs bytes
pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
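
As a rough illustration of what train, compute_codes and decode do conceptually, the following NumPy sketch splits each vector into M sub-vectors, learns a small k-means codebook per sub-space, and stores one centroid index per sub-vector. It is a simplified sketch, not the faiss implementation: the helper names (pq_train, pq_encode, pq_decode), the tiny sizes, and the unpacked one-byte-per-index code layout are assumptions chosen so the example runs quickly.

```python
import numpy as np

def _nearest(x, centroids):
    """Index of the nearest centroid (squared L2) for each row of x."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1)

def _kmeans(x, k, niter, rng):
    """Small Lloyd-style k-means used to learn one codebook."""
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(niter):
        assign = _nearest(x, centroids)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(0)
    return centroids

def pq_train(xt, M, nbits, niter=10, seed=0):
    """Learn M codebooks of 2**nbits centroids each; result shape (M, ksub, dsub)."""
    rng = np.random.default_rng(seed)
    dsub = xt.shape[1] // M
    return np.stack([_kmeans(xt[:, m * dsub:(m + 1) * dsub], 2 ** nbits, niter, rng)
                     for m in range(M)])

def pq_encode(x, codebooks):
    """Encode each vector as M centroid indices (one uint8 per sub-vector)."""
    M, _, dsub = codebooks.shape
    codes = np.empty((len(x), M), dtype=np.uint8)
    for m in range(M):
        codes[:, m] = _nearest(x[:, m * dsub:(m + 1) * dsub], codebooks[m])
    return codes

def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the selected centroids."""
    return np.concatenate([codebooks[m][codes[:, m]]
                           for m in range(codebooks.shape[0])], axis=1)

rng = np.random.default_rng(0)
xt_demo = rng.random((2000, 8))          # training set
x_demo = rng.random((500, 8))            # vectors to encode
codebooks = pq_train(xt_demo, M=4, nbits=4)
codes_demo = pq_encode(x_demo, codebooks)
x2_demo = pq_decode(codes_demo, codebooks)
rel_err = ((x_demo - x2_demo) ** 2).sum() / (x_demo ** 2).sum()
```

With 8 bits per sub-quantizer, as in the faiss example, each index fits exactly one byte, which is why the code size there equals the number of sub-quantizers.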

Scalar quantizer:
d = 32  # data dimension

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2)**2).sum() / (x ** 2).sum()
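
The idea behind 8-bit scalar quantization can be sketched in NumPy as follows: learn a per-dimension range on the training set, then map each value to one of 256 evenly spaced levels. This is a simplified uniform scheme for illustration; the exact rounding and range handling in faiss may differ, and sq_train, sq_encode and sq_decode are hypothetical helper names.

```python
import numpy as np

def sq_train(xt):
    """Learn the per-dimension minimum and range from the training set."""
    vmin = xt.min(0)
    vdiff = xt.max(0) - vmin
    vdiff = np.where(vdiff > 0, vdiff, 1.0)   # avoid division by zero on constant dims
    return vmin, vdiff

def sq_encode(x, vmin, vdiff):
    """Map each value to an integer level in [0, 255]; out-of-range values are clipped."""
    scaled = (x - vmin) / vdiff
    return np.clip(np.rint(scaled * 255.0), 0, 255).astype(np.uint8)

def sq_decode(codes, vmin, vdiff):
    """Map integer levels back to approximate floating-point values."""
    return vmin + codes.astype(np.float32) / 255.0 * vdiff

rng = np.random.default_rng(0)
xt_demo = rng.random((10000, 32)).astype('float32')   # training set
x_demo = rng.random((20000, 32)).astype('float32')    # vectors to encode
vmin, vdiff = sq_train(xt_demo)
codes_demo = sq_encode(x_demo, vmin, vdiff)
x2_demo = sq_decode(codes_demo, vmin, vdiff)
rel_err = ((x_demo - x2_demo) ** 2).sum() / (x_demo ** 2).sum()
```

Each dimension costs one byte here, so a 32-dimensional float32 vector shrinks from 128 bytes to 32 bytes at the price of a small reconstruction error.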


Author: NSX
Published: September 6, 2021