Atom Vectors - Atom2Vec

A popular representation of atoms as vectors appeared in 2018: Atom2Vec.

They take compounds from a database, and build a matrix like the one below:¹

| | $(2)Sb_{3}$ | $(2)Se_{3}$ | $(2)Te_{3}$ | $(3)Bi_{2}$ | $(3)O_{2}$ | $(3)S_{2}$ |
|---|---|---|---|---|---|---|
| $Bi$ | 1 | 1 | 1 | 0 | 1 | 0 |
| $Sb$ | 0 | 1 | 1 | 1 | 0 | 1 |
| ... | 1 | 0 | 1 | 0 | 1 | 1 |

Let's describe how the matrix is built, using the compound $Bi_{2}Sb_{3}$ as an example:

When $Bi$ is the target, it generates the environment $(2)Sb_{3}$, placed in the first column. Here (2) is the stoichiometry of the element $Bi$ in the compound.
When $Sb$ is the target, it generates the environment $(3)Bi_{2}$, placed in the fourth column. Here (3) is the stoichiometry of the element $Sb$ in the compound.

Since a particular atom binds to only a very small fraction of all groups, each row is very sparse (a high fraction of zeros). The same holds for the columns.
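The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the helper names (`environment`, `build_matrix`), the dictionary format for compounds, and the three-compound toy database are all assumptions made for the example. Entries are counts rather than 0/1, so a compound listed twice contributes twice.

```python
from collections import Counter


def environment(compound, target):
    """Label the environment seen by `target` inside `compound`.

    `compound` maps element symbol -> stoichiometric count,
    e.g. {"Bi": 2, "Sb": 3}. The label is the target's own count
    in parentheses followed by the remaining elements: "(2)Sb3".
    """
    rest = "".join(f"{el}{n}" for el, n in sorted(compound.items()) if el != target)
    return f"({compound[target]}){rest}"


def build_matrix(compounds):
    """Return (atoms, environments, matrix) with one row per atom."""
    counts = Counter()
    for compound in compounds:
        for target in compound:
            counts[(target, environment(compound, target))] += 1
    atoms = sorted({a for a, _ in counts})
    envs = sorted({e for _, e in counts})
    # A Counter returns 0 for pairs that never occurred.
    matrix = [[counts[(a, e)] for e in envs] for a in atoms]
    return atoms, envs, matrix


# Hypothetical toy database (not the one used in the paper).
compounds = [{"Bi": 2, "Sb": 3}, {"Bi": 2, "Se": 3}, {"Sb": 2, "Se": 3}]
atoms, envs, X = build_matrix(compounds)
for atom, row in zip(atoms, X):
    print(atom, row)
```

Even in this tiny case every row has as many zeros as ones. With a real database the number of environments grows much faster than the number of atoms, which is where the sparsity comes from.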

SVD Method

A normalised matrix $X_{u}$ is obtained by normalising each row vector independently. Using the Euclidean norm (2-norm) allows for an intuitive similarity metric:

$\mathrm{dist}(u_{1}, u_{2}) = 1 - u_{1} \cdot u_{2} = 1 - \mathrm{similarity}$
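As a small illustration of the normalisation and the distance, here is a pure-Python sketch. The function names and the 3×4 toy matrix are assumptions for the example. With unit-length rows the dot product is the cosine similarity, so the distance is 0 for identical rows and 1 for rows that share no environment.

```python
import math


def normalise(row):
    """Scale a row to unit Euclidean length (all-zero rows are left as-is)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)


def dist(u1, u2):
    """1 - dot product; for unit vectors this is 1 - cosine similarity."""
    return 1.0 - sum(a * b for a, b in zip(u1, u2))


# Hypothetical toy matrix: rows are atoms, columns are environments.
X = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]
X_u = [normalise(row) for row in X]

print(dist(X_u[0], X_u[1]))  # rows share one of two environments
print(dist(X_u[0], X_u[2]))  # rows share no environment
```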

In their best-performing model, they compute $SVD(X_{u}) = U D V^{T}$, keep the $d$ largest singular values, and compute $F = U' D'$, where $D'$ is the slice of $D$ holding the $d$ largest singular values and $U'$ holds the corresponding columns of $U$. The rows of $F$ are the $d$-dimensional atom vectors.
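A minimal NumPy sketch of this step follows. The function name `atom_vectors` and the toy matrix are assumptions; it relies on `np.linalg.svd` returning the singular values in descending order, so the first $d$ columns of $U$ are the ones to keep.

```python
import numpy as np


def atom_vectors(X, d):
    """Row-normalise X, then return F = U' D' from a truncated SVD."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_u = X / np.where(norms == 0, 1.0, norms)  # avoid dividing by zero
    U, s, Vt = np.linalg.svd(X_u, full_matrices=False)
    # Scaling the first d columns of U by the d largest singular values
    # is the same as multiplying U' by the diagonal matrix D'.
    return U[:, :d] * s[:d]


# Hypothetical toy matrix: rows are atoms, columns are environments.
X = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]
F = atom_vectors(X, d=2)
print(F.shape)  # one d-dimensional vector per atom
```

For a large, sparse matrix a sparse solver would be the more practical choice. The dense call here is only meant to show the algebra.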

Note

The strategy has a certain beauty to it: the rows of $F$ retain (approximately, because of the truncation) the inner-product similarity of the rows of $X_{u}$, but are denser. The columns, however, no longer have an explicit meaning.
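One way to see why the inner products survive, using only the definitions above and the fact that $V$ has orthonormal columns (a sketch, not taken from the paper):

$X_{u} X_{u}^{T} = U D V^{T} V D^{T} U^{T} = U D D^{T} U^{T} \approx U' D' D'^{T} U'^{T} = F F^{T}$

Entry $(i, j)$ of $X_{u} X_{u}^{T}$ is the inner product of the normalised rows of atoms $i$ and $j$, and entry $(i, j)$ of $F F^{T}$ is the inner product of their new vectors. The approximation is exact when $d$ equals the rank of $X_{u}$. Otherwise the error comes only from the discarded, smaller singular values.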

Findings

Similar atoms have similar vectors.
By increasing the distance threshold in stages, the vectors can be clustered hierarchically, from the leaf nodes (atoms) to progressively larger groups.
- At some level, the groups match the periodic-table groups. (I don't know how the grouping is made unambiguous.)
- At a very large distance, all atoms merge into a single group. The resulting tree is called a dendrogram.
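The staged thresholding can be sketched as follows. The source does not say which linkage criterion the paper uses, so this example assumes single linkage: two atoms end up in the same group if a chain of pairwise distances at or below the threshold connects them. The vectors are made-up unit vectors for illustration, not values from the paper.

```python
def clusters_at(vectors, threshold):
    """Group items connected by pairwise distances <= threshold (single linkage)."""
    names = list(vectors)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = 1.0 - sum(x * y for x, y in zip(vectors[a], vectors[b]))
            if d <= threshold:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Made-up 2-d unit vectors, chosen only to illustrate the merging.
vectors = {
    "Na": [1.0, 0.0], "K": [0.96, 0.28],
    "Cl": [0.0, 1.0], "Br": [0.28, 0.96],
}
for t in (0.01, 0.1, 0.5):
    print(t, clusters_at(vectors, t))
```

Recording which groups merge at which threshold, from small to large, gives the tree structure of the dendrogram.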

Image (modified) from the original paper, under CC-BY-SA 4.0. The atoms are rotated to make the image fit.

Looking at the variation of some dimensions in the vectors, we can assign meaning to some of them.

Benches

They then compared their vectors against "empirical features" (a vector of group, period, ..., padded to match their $d$) on the task of predicting the DFT-computed formation energies of $\approx 10^{4}$ elpasolite crystals ($ABC_{2}D_{6}$).

Each solid was represented as a concatenation of its atom vectors, which was then fed to a hidden layer. (They also evaluate other tasks.)
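A rough sketch of what such a model's forward pass could look like in NumPy is shown below. Everything beyond "concatenate the atom vectors and pass them through one hidden layer" is an assumption: one vector per site (A, B, C, D), the ReLU activation, the hidden size, and the random, untrained weights are placeholders, not the paper's configuration.

```python
import numpy as np


def featurise(sites, atom_vecs):
    """Concatenate the atom vectors of the given sites into one input vector."""
    return np.concatenate([atom_vecs[s] for s in sites])


def predict(x, W1, b1, w2, b2):
    """One hidden layer (ReLU, an assumption) followed by a linear output."""
    h = np.maximum(0.0, W1 @ x + b1)
    return float(w2 @ h + b2)


d, hidden = 3, 8  # hypothetical sizes
rng = np.random.default_rng(0)

# Placeholder atom vectors for the four sites of an ABC2D6 crystal.
atom_vecs = {site: rng.normal(size=d) for site in "ABCD"}
x = featurise(["A", "B", "C", "D"], atom_vecs)

# Random, untrained weights: the output is meaningless, only the shapes matter.
W1, b1 = rng.normal(size=(hidden, 4 * d)), np.zeros(hidden)
w2, b2 = rng.normal(size=hidden), 0.0
y = predict(x, W1, b1, w2, b2)
print(x.shape, y)
```

Swapping `atom_vecs` between the learned vectors and the padded empirical features, while keeping the rest fixed, is the kind of comparison the benchmark describes.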

The paper ends with an interesting insight:

Structural information has to be taken into account to accurately model how atoms are bound together to form either environment or compound, where the recent development on recursive and graph-based neural networks might help.

¹ It would be a binary matrix, but the database contains some compounds multiple times, and those duplicates are kept (for some strange reason).
