The SQL DISTANCE function in Polypheny compares two arrays using a specified metric. It is the basis for k-nearest-neighbour (kNN) search, where entries are ranked by their distance to a given vector.
Function Syntax
The general syntax for the DISTANCE function is as follows:
DISTANCE(<target array>, <array to compare with>, <metric> [, <weights>])
- <target array>: The array from your table that you want to compare with another array.
- <array to compare with>: The array you are comparing the target array with.
- <metric>: The metric used to calculate the distance. This can be one of the following: 'L1', 'L2', 'L2 squared', 'Cosine', 'ChiSquared'.
- <weights> (optional): An array of weights for weighted distance calculations.
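To make the metric names concrete, here is a minimal Python sketch of what each one conventionally computes for two equal-length arrays. It is not Polypheny's implementation: the helper name `distance` is hypothetical, and the formulas for 'Cosine' (one minus cosine similarity) and 'ChiSquared' (sum of squared differences divided by the component sums) are the common textbook definitions, assumed here rather than confirmed for Polypheny.

```python
import math


def distance(a, b, metric):
    """Illustrative (assumed) definitions of the metrics listed above."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    pairs = list(zip(a, b))

    if metric == "L1":
        # Manhattan distance: sum of absolute differences.
        return sum(abs(x - y) for x, y in pairs)
    if metric == "L2 squared":
        # Squared Euclidean distance: no square root taken.
        return sum((x - y) ** 2 for x, y in pairs)
    if metric == "L2":
        # Euclidean distance.
        return math.sqrt(sum((x - y) ** 2 for x, y in pairs))
    if metric == "Cosine":
        # Assumed form: 1 - cosine similarity.
        dot = sum(x * y for x, y in pairs)
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0 or norm_b == 0:
            raise ValueError("cosine distance is undefined for a zero vector")
        return 1.0 - dot / (norm_a * norm_b)
    if metric == "ChiSquared":
        # Assumed form: sum((x - y)^2 / (x + y)), skipping zero denominators.
        return sum((x - y) ** 2 / (x + y) for x, y in pairs if x + y != 0)
    raise ValueError(f"unknown metric: {metric!r}")
```

Because 'L2 squared' skips the square root, it produces the same ordering as 'L2', which is usually all a nearest-neighbour ranking needs.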
Examples
Suppose you have a table ProductVectors with a column ProductFeatures that stores arrays representing product features. You can use the DISTANCE function to find the products closest to a specific feature vector:
SELECT id, DISTANCE(ProductFeatures, ARRAY[...], 'L2') as dist
FROM ProductVectors
ORDER BY dist ASC
LIMIT 5;
This query returns the IDs and distances of the five products whose feature vectors have the smallest L2 distance to the given array.
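The logic of that query (compute a distance per row, sort ascending, keep the first few) can be sketched in plain Python as a brute-force nearest-neighbour search. The function names and the sample rows are hypothetical and only mirror the shape of the query; they say nothing about how Polypheny evaluates it internally.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(rows, query, k=5):
    """Return the k (id, dist) pairs with the smallest L2 distance to query.

    rows is an iterable of (id, feature_vector) tuples, standing in for the
    id and ProductFeatures columns of the example table.
    """
    scored = [(row_id, l2(vec, query)) for row_id, vec in rows]
    scored.sort(key=lambda pair: pair[1])  # ORDER BY dist ASC
    return scored[:k]                      # LIMIT k


# Made-up sample data for illustration only.
rows = [
    (1, [0.0, 0.0]),
    (2, [1.0, 1.0]),
    (3, [5.0, 5.0]),
    (4, [0.5, 0.0]),
]

print(nearest(rows, [0.0, 0.0], k=2))  # [(1, 0.0), (4, 0.5)]
```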
To use the function with weights, you could write:
SELECT id, DISTANCE(ProductFeatures, ARRAY[...], 'L2', ARRAY[...]) as dist
FROM ProductVectors
ORDER BY dist ASC
LIMIT 5;
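The text above does not specify how the weights enter the calculation. One common convention, assumed here purely for illustration, scales each squared difference by its weight before summing. The helper `weighted_l2` is hypothetical and may not match what Polypheny computes.

```python
import math


def weighted_l2(a, b, weights):
    """Weighted Euclidean distance, assuming sqrt(sum(w_i * (a_i - b_i)^2))."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("arrays and weights must have the same length")
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


# A zero weight removes a dimension from the comparison.
print(weighted_l2([0, 0], [3, 4], [1, 0]))  # 3.0
# Unit weights reduce to the ordinary L2 distance.
print(weighted_l2([0, 0], [3, 4], [1, 1]))  # 5.0
```

Under this convention, larger weights make differences in the corresponding feature count more heavily toward the final distance.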
Utility
The DISTANCE function is particularly useful for scenarios where you need to identify similarities or differences between data points in multi-dimensional space. These include:
- Recommendation Systems: You can use the DISTANCE function to recommend similar products or services to users based on their past behaviors or preferences.
- Clustering: The function can be useful in clustering analysis where you want to group similar data points together.
- Anomaly Detection: The DISTANCE function can help identify outliers in your dataset, which deviate significantly from the other data points.
Remember that the right choice of distance metric depends on the nature of your data and the specific requirements of your use case.