Sklearn Paired Cosine Distance Issue
Posted on 07/09/2023 in posts machine-learning
Here I explore an issue where sklearn's paired_cosine_distances
function returns erroneous values when one of the vectors has zero norm.
import numpy as np
from sklearn.metrics.pairwise import paired_cosine_distances
paired_cosine_distances(
np.array([[1, 1], [0, 1], [0, 0], [1, 0]]),
np.array([[1, 1], [0, 1], [0, 1], [0, 1]])
)
# Outputs: array([0. , 0. , 0.5, 1. ])
# dot products
(np.array([[1, 1], [0, 1], [0, 0], [1, 0]]) * np.array([[1, 1], [0, 1], [0, 1], [0, 1]])).sum(axis=-1)
# Outputs: array([2, 1, 0, 0])
The dot product between [0, 0] and [0, 1] is zero, so the cosine similarity should also be zero (or, arguably, undefined). However, sklearn's paired cosine distance returns 0.5 for this pair, which is not meaningful. The reason is that sklearn computes the cosine distance as 1 - cos(theta) = 0.5 * |A - B|^2 for unit-normed vectors A and B (the normalization step leaves a zero vector as-is). When one of the vectors has zero norm and the other is a unit vector, the |A - B|^2 term equals 1, giving a distance of 0.5.
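A minimal sketch of that formula, reproducing sklearn's output above without calling sklearn. The helper names (`safe_normalize`, `paired_cosine_dist_norm_form`) are my own; the zero-row handling is assumed to mirror sklearn's normalize, which skips rows with zero norm.

```python
import numpy as np

def paired_cosine_dist_norm_form(X, Y):
    def safe_normalize(M):
        # Divide each row by its norm; leave zero-norm rows as all zeros
        # (assumed to match sklearn's normalize behavior).
        norms = np.linalg.norm(M, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        return M / norms

    A = safe_normalize(X.astype(float))
    B = safe_normalize(Y.astype(float))
    # 1 - cos(theta) = 0.5 * |A - B|^2 for unit-norm A and B
    return 0.5 * ((A - B) ** 2).sum(axis=1)

X = np.array([[1, 1], [0, 1], [0, 0], [1, 0]])
Y = np.array([[1, 1], [0, 1], [0, 1], [0, 1]])
print(paired_cosine_dist_norm_form(X, Y))
# -> [0.  0.  0.5 1. ], matching the sklearn output above
```

The zero-norm row survives normalization as [0, 0], so its squared distance to the unit vector [0, 1] is 1, and half of that is the puzzling 0.5.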
However, if we take the dot-product definition of cosine distance, 1 - A.B / (|A|*|B|), this value should either be undefined (a 0/0 division) or 1.0, if we let A.B = 0 take precedence.
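A sketch of that dot-product definition, with A.B = 0 taking precedence over the 0/0 division (the function name is hypothetical, not a sklearn API):

```python
import numpy as np

def paired_cosine_dist_dot_form(X, Y):
    # 1 - A.B / (|A| * |B|); when either norm is zero, the similarity
    # is treated as 0 (A.B = 0 takes precedence), so the distance is 1.0.
    X, Y = X.astype(float), Y.astype(float)
    dots = (X * Y).sum(axis=1)
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    sims = np.divide(dots, norms, out=np.zeros_like(dots), where=norms != 0)
    return 1.0 - sims

X = np.array([[1, 1], [0, 1], [0, 0], [1, 0]])
Y = np.array([[1, 1], [0, 1], [0, 1], [0, 1]])
print(paired_cosine_dist_dot_form(X, Y))
# -> [0. 0. 1. 1.]: the zero-norm pair gets 1.0 instead of sklearn's 0.5
```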
This can silently produce incorrect results when mining massive datasets with pairwise cosine distance, an important metric in ML for nearest-neighbor search or as a feature for downstream models.
Keep this in mind, or filter out zero-norm vectors, when working with cosine distances in sklearn.
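One way to guard against this is a thin wrapper that flags zero-norm pairs as NaN so they cannot silently pass as "half-similar". This is a sketch under my own naming (`paired_cosine_distances_checked` is not part of sklearn):

```python
import numpy as np
from sklearn.metrics.pairwise import paired_cosine_distances

def paired_cosine_distances_checked(X, Y):
    # Compute sklearn's paired cosine distance, then mark any pair where
    # either vector has zero norm as NaN instead of the misleading 0.5.
    dists = paired_cosine_distances(X, Y)
    zero_norm = (np.linalg.norm(X, axis=1) == 0) | (np.linalg.norm(Y, axis=1) == 0)
    dists[zero_norm] = np.nan
    return dists

X = np.array([[1, 1], [0, 1], [0, 0], [1, 0]])
Y = np.array([[1, 1], [0, 1], [0, 1], [0, 1]])
print(paired_cosine_distances_checked(X, Y))
# -> [ 0.  0. nan  1.]
```

NaN propagates loudly through downstream aggregations, which makes degenerate rows easy to spot; alternatively, drop the flagged rows before computing distances.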