Reference-based Coordinate Assignment

Reference-based Coordinate Assignment (RCA) is a dimensionality reduction method. Importantly, this implies we already have some representation of the data available, which we want to reduce.

The main idea could be conceptually represented as:

$x_{d} = \mathrm{distance}(x_{n}; a_{n}, b_{n}, \dots, d_{n})$

On the right-hand side, the distance from $x_{n}$ to each of the $n$-dimensional reference vectors $a_{n}, \dots, d_{n}$ is calculated. Each of these reference vectors is a centroid calculated from a cluster of embeddings.

The result, on the left-hand side, is a reduced $d$-dimensional sample vector $x_{d}$, with one coordinate per reference vector.
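As a minimal sketch of this mapping, the snippet below builds centroids from three small clusters and turns one $n$-dimensional vector into $d$ distances. The helper names `centroid` and `rca_transform`, the toy data, and the choice of Euclidean distance are all assumptions made for illustration; the text above does not fix a particular metric.

```python
import math

def centroid(cluster):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(cluster[0])
    return [sum(v[i] for v in cluster) / len(cluster) for i in range(n)]

def rca_transform(x, references):
    """Map an n-dimensional vector to d coordinates, one distance per reference.

    Euclidean distance is an assumption here, not something the source specifies.
    """
    return [math.dist(x, r) for r in references]

# Toy data (made up): three clusters of 4-dimensional embeddings.
clusters = [
    [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]],
    [[0.0, 4.0, 0.0, 0.0], [0.0, 4.0, 2.0, 0.0]],
    [[0.0, 0.0, 0.0, 3.0], [0.0, 0.0, 0.0, 5.0]],
]

# One centroid per cluster gives d = 3 reference vectors.
references = [centroid(c) for c in clusters]

x_n = [1.0, 0.0, 0.0, 0.0]             # original sample, n = 4
x_d = rca_transform(x_n, references)   # reduced sample, d = 3
```

Here `x_n` coincides with the first centroid, so the first coordinate of `x_d` is zero, while the other two coordinates record how far the sample lies from the remaining clusters.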

An important point is raised in the paper:

To avoid degeneracies arising from equidistant configurations, the reference points must span the structure of the dataset.

In other words, the reference vectors (centroids) should be able to reach any point in the original dataset, so that the reduced representation does not lose expressivity.
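One way to read this requirement is through a small two-dimensional example, sketched below under the assumption of Euclidean distance. The helper `rca_transform` and all points are hypothetical. When both reference points lie on the same line, two samples mirrored across that line get identical coordinates; adding a reference off the line separates them.

```python
import math

def rca_transform(x, references):
    """One Euclidean distance per reference vector (metric is an assumption)."""
    return [math.dist(x, r) for r in references]

# Two distinct samples, mirror images of each other across the x-axis.
p = [1.0, 1.0]
q = [1.0, -1.0]

# Both references lie on the x-axis, so they do not span the plane.
collinear_refs = [[0.0, 0.0], [2.0, 0.0]]

# Adding a reference off the x-axis lets the set span the plane.
spanning_refs = collinear_refs + [[0.0, 2.0]]

p_low, q_low = rca_transform(p, collinear_refs), rca_transform(q, collinear_refs)
p_full, q_full = rca_transform(p, spanning_refs), rca_transform(q, spanning_refs)

# Degenerate case: p and q collapse onto the same reduced vector.
degenerate = all(math.isclose(a, b) for a, b in zip(p_low, q_low))

# With the extra reference, the third coordinate tells them apart.
separated = not math.isclose(p_full[2], q_full[2])
```

With the collinear references, `p` and `q` are indistinguishable after the reduction, which is the kind of lost expressivity the quoted sentence warns about.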

The method has useful properties:

Compared to a large vector, the reduced vector is cheaper to train a model on (and can still be accurate),
Compared to other dimensionality reduction methods, it is easier to extend,
With some care in choosing the reference vectors, the dimensions may be interpretable.
