Atom Vectors - Atom2Vec
A popular representation of atoms as vectors appeared in 2018: Atom2Vec.
The authors take compounds from a database and build a matrix like the one below [1]:
|     | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
|     | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| ... | 1 | 0 | 1 | 0 | 0 | 1 | 1 |
Let's describe how the matrix is built, using one compound as an example:
- When one element is the target, it generates an environment, placed in the first column. Here (2) is the stoichiometry of that element in the compound.
- When the other element is the target, it generates an environment, placed in the fourth column. Here (3) is the stoichiometry of that element in the compound.
Since a particular atom binds to a very small fraction of all groups, each row is very sparse (high fraction of zeros). The same is valid for columns.
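The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy compounds, and the `(count)rest` label format for environments are all hypothetical choices, not the paper's code. Repeated compounds are counted rather than collapsed, so entries can exceed 1.

```python
from collections import Counter


def atom_environment_pairs(compound):
    """Yield one (atom, environment) pair per element of a compound.

    `compound` maps element symbols to stoichiometric counts.
    """
    for target, count in compound.items():
        rest = "".join(
            f"{el}{n}" for el, n in sorted(compound.items()) if el != target
        )
        # Assumed label format: the target's own count, then the remaining atoms.
        yield target, f"({count}){rest}"


def atom_environment_matrix(compounds):
    """Return (atoms, environments, matrix); matrix[i][j] counts occurrences."""
    counts = Counter(
        pair for c in compounds for pair in atom_environment_pairs(c)
    )
    atoms = sorted({atom for atom, _ in counts})
    envs = sorted({env for _, env in counts})
    matrix = [[counts[(atom, env)] for env in envs] for atom in atoms]
    return atoms, envs, matrix


# Toy data, chosen only for illustration.
compounds = [{"Bi": 2, "Se": 3}, {"Sb": 2, "Se": 3}, {"Bi": 2, "Te": 3}]
atoms, envs, matrix = atom_environment_matrix(compounds)
```

Each row of `matrix` is one atom and each column one environment, so atoms that appear in the same environments end up with overlapping rows.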
SVD Method
A normalised matrix is obtained by normalising each row vector independently. Using the Euclidean norm (2-norm) allows for an intuitive similarity metric: the inner product of two unit-length rows is the cosine of the angle between them.
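In their best-performing model, they compute the singular value decomposition (SVD) of the normalised matrix, keep the rows of D (the diagonal matrix of singular values) with the largest singular values, and multiply the corresponding columns of the left singular vectors by that slice of D to obtain the new atom vectors.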
Note
The strategy has a certain beauty to it: the new f-vectors retain the inner-product similarity but are denser. The columns, however, no longer have an explicit meaning.
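A minimal NumPy sketch of the normalisation and truncation steps may help. The function name `svd_atom_vectors` and the parameter `d` (the number of singular values kept) are names introduced here for illustration, and the toy matrix reuses the three rows shown earlier. It is a sketch of the idea, not the authors' implementation.

```python
import numpy as np


def svd_atom_vectors(X, d):
    """Row-normalise X, then keep the d largest singular directions."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0.0, 1.0, norms)  # all-zero rows stay zero
    # Singular values come back sorted in descending order.
    U, s, _ = np.linalg.svd(Xn, full_matrices=False)
    F = U[:, :d] * s[:d]  # scale each kept column of U by its singular value
    return Xn, F


X = [[1, 1, 1, 0, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1],
     [1, 0, 1, 0, 0, 1, 1]]
Xn, F = svd_atom_vectors(X, d=3)
```

When every singular value is kept, `F @ F.T` equals `Xn @ Xn.T`, so the pairwise inner products are preserved exactly. With a smaller `d` they are only approximated, which is the price of the shorter, denser vectors.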
Findings
- Similar atoms have similar vectors,
- By increasing the distance threshold in stages, the vectors can be clustered hierarchically, from the leaf nodes (atoms) downwards (groups).
- At some level, groups match the periodic table groups. (I don't know how the grouping is made unambiguous).
- At a very large distance, all atoms merge into a single group. The result is called a dendrogram.
Image (modified) from the original paper, under CC-BY-SA 4.0. The atoms are rotated to make the image fit.
- Looking at the variation of some dimensions in the vectors, we can assign meaning to some of them.
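The staged clustering described in the findings can be illustrated with a small pure-Python sketch. It assumes single linkage (two groups merge as soon as any pair of their members is within the threshold) and Euclidean distance. The notes do not say which linkage or distance the paper uses, so treat both as assumptions. The 2-D vectors are made up.

```python
import math


def clusters_at(vectors, threshold):
    """Group items linked by chains of distances <= threshold (single linkage)."""
    names = list(vectors)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative, shortening the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if math.dist(vectors[a], vectors[b]) <= threshold:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Made-up 2-D vectors, for illustration only.
toy = {"Li": (0.0, 0.0), "Na": (0.1, 0.0), "F": (1.0, 1.0), "Cl": (1.0, 1.1)}
levels = [clusters_at(toy, t) for t in (0.05, 0.2, 2.0)]
```

Reading `levels` from first to last reproduces the stages of a dendrogram: isolated atoms, then small groups, then a single group containing everything.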
Benches
Then they compared their vectors against "empirical features" (a vector of group, period, ..., padded to match the length of the learned vectors) on the task of predicting the DFT-computed formation energies of elpasolite crystals.
Each solid was represented as a concatenation of atom vectors, which was fed to a hidden layer. (They also tackle other tasks.)
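As a rough picture of that setup, the sketch below concatenates one vector per site and passes the result through a single hidden layer. Everything specific here is an assumption for illustration: the four site labels, the vector length, the hidden width, the ReLU activation, and the random untrained weights. The notes only state that atom vectors are concatenated and fed to a hidden layer.

```python
import numpy as np


def predict_energy(atom_vecs, elements, W1, b1, w2, b2):
    """Concatenate the atom vectors of `elements` and apply one hidden layer."""
    x = np.concatenate([atom_vecs[el] for el in elements])
    hidden = np.maximum(0.0, W1 @ x + b1)  # ReLU activation (assumed)
    return float(w2 @ hidden + b2)


rng = np.random.default_rng(0)
d, n_sites, n_hidden = 4, 4, 8  # hypothetical sizes
# Placeholder site labels standing in for real elements.
atom_vecs = {el: rng.normal(size=d) for el in ("A", "B", "C", "D")}
W1 = rng.normal(size=(n_hidden, d * n_sites))
b1 = np.zeros(n_hidden)
w2 = rng.normal(size=n_hidden)
b2 = 0.0

energy = predict_energy(atom_vecs, ("A", "B", "C", "D"), W1, b1, w2, b2)
```

Because the input is a concatenation, the order of the sites matters: swapping two elements changes the input vector, so the network can distinguish which atom occupies which site.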
The paper ends with an interesting insight:
Structural information has to be taken into account to accurately model how atoms are bound together to form either environment or compound, where the recent development on recursive and graph-based neural networks might help.
[1] It would be a binary matrix, but the database contains some compounds multiple times, and those duplicates are kept (for some strange reason).