rwest commented May 20, 2014

This is a copy of @bbuesser's pull request #217, but pulling from a GreenGroup branch (GreenGroup/new-style-adjacency-list), so that we can all add commits (and indeed pull requests) to this pull request before merging, without Beat needing to merge them all. It should stay in sync with (and will only work with) the GreenGroup/new-style-adjacency-list branch of the RMG-database project. The rest of this comment is from Beat's original pull request:

Do not merge yet

These are the necessary changes to RMG-Py and its unit tests to handle the new adjacency list format. At the bottom I have copied the explanation of the new adjacency list format that I used for the RMG database pull request #37. Before that, I am adding some comments on how RMG-Py handles the new adjacency list as of this pull request; further improvements are expected.

I would like to mention the following functions.

toAdjacencyList(): always prints the complete adjacency list with U, L and (currently) the calculated E. It accepts a new argument "printMultiplicity=False/True", which controls whether the multiplicity is printed as part of the adjacency list.

fromAdjacencyList(): always requires the U label. It can read the L label; if L is not defined, it assumes L0 for molecules and None (not defined and not compared) for groups. E is currently read but not used for anything: the formal charges are calculated from the U label and the number of bonds on each atom. Reading the E label was easy to implement, but I have not yet had any test cases that show how important the E label is, so for this pull request it seemed safer to satisfy U and the bonds and create a neutral species. E is therefore currently overwritten on the assumption that the species is neutral. fromAdjacencyList() accepts the wildcards Ux, Lx and Ex, each representing any possible number.

fromRDKitMol(): all functions based on fromRDKitMol(), such as fromSMILES(), assume that the maximum multiplicity is intended, i.e. multiplicity = 2*spin + 1 = number of unpaired electrons + 1.
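The maximum-multiplicity assumption can be written out as a minimal sketch. The helper name `max_multiplicity` is hypothetical and only illustrates the arithmetic; it is not a function in RMG-Py.

```python
def max_multiplicity(unpaired_per_atom):
    """Return the high-spin multiplicity 2S + 1.

    `unpaired_per_atom` holds the U value of every atom. Assuming all
    unpaired spins are parallel, each contributes 1/2 to the total spin S,
    so 2S equals the total number of unpaired electrons.
    """
    n_unpaired = sum(unpaired_per_atom)
    return n_unpaired + 1


# [CH2]: carbon carries two unpaired electrons, hydrogens none.
methylene = max_multiplicity([2, 0, 0])
# HCO: one unpaired electron on carbon.
formyl = max_multiplicity([1, 0, 0])
```

Under this assumption `[CH2]` is read as a triplet and HCO as a doublet; a singlet biradical has to be requested explicitly.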

open topics for future development:

  • currently multiplicity is stored in both the Species class and the Molecule class, for reasons that lie in the RMG development history. In future it might be better to decide on one place to store it, preferably the Species class. Furthermore, the Conformer class already has a multiplicity label, used to calculate accurate thermochemistry; this might as well be combined with the species multiplicity.
  • currently the TransitionState class does not have a multiplicity label, although it should have one too.

Explanation of new adjacency list style:

This is the first version of the RMG database with the new adjacency list format and multiplicity as a species/molecule property. Kinetics libraries store the multiplicity as part of the adjacency list, whereas everywhere else it is a separate argument. I think multiplicity should not be part of the adjacency list because it does not depend on the list's details: there can be many adjacency lists for the same species (resonance isomers), all necessarily having the same multiplicity. As soon as we continue with our efforts to create a separate structure library for kinetics rules, this difference of storing multiplicity in the adjacency list for kinetics will fall away.

The new adjacency list format is (e.g. nitromethane, CH3NO2):

1 C U0 L0 E0 {2,S} {3,S} {4,S} {5,S}
2 N U0 L0 E+1 {1,S} {6,D} {7,S}
3 H U0 L0 E0 {1,S}
4 H U0 L0 E0 {1,S}
5 H U0 L0 E0 {1,S}
6 O U0 L2 E0 {2,D}
7 O U0 L3 E-1 {2,S}

where

U: the flag for unpaired electrons (formerly radicals). There are two reasons to abandon the R for radicals. First we are using R to represent unspecified groups in the elements column. Second it would be confusing in a future publication to use "radical" at the same time for the species with unpaired electrons and the unpaired electron itself.

L: the flag for the number of lone electron pairs. The reason against P as flag was the possible future introduction of phosphorus that would bring P as an element.

E: the flag for formal charges. The sum of all E is equal to the total charge of the species. Currently only neutral species are reactive in RMG, therefore sum(E)=0 is required for reactivity. E has been chosen as the capital letter of e, representing an electronic charge. E+1 means one electron fewer, E-2 means two additional electrons on that atom. C was not used as the flag for formal charge because it is used to represent carbon.
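To make the relation between the three flags concrete, here is a minimal sketch that parses the nitromethane lines above and recomputes each formal charge as (valence electrons) − U − 2·L − (sum of bond orders). The regular expressions, the helper names and the small valence table are assumptions made for this illustration; this is not the parser in RMG-Py.

```python
import re

# Assumed lookup tables, limited to what the example needs.
VALENCE_ELECTRONS = {"H": 1, "C": 4, "N": 5, "O": 6}
BOND_ORDER = {"S": 1, "D": 2, "T": 3}

ATOM_LINE = re.compile(
    r"^\s*(?P<index>\d+)\s+(?P<element>[A-Za-z]+)"
    r"\s+U(?P<u>\d+)\s+L(?P<l>\d+)\s+E(?P<e>[+-]?\d+)(?P<bonds>.*)$"
)
BOND = re.compile(r"\{(\d+),([SDT])\}")


def parse_atom_line(line):
    """Parse one 'index element U L E {bond} ...' line into a dict."""
    match = ATOM_LINE.match(line)
    if match is None:
        raise ValueError("not a U/L/E atom line: %r" % line)
    return {
        "index": int(match.group("index")),
        "element": match.group("element"),
        "u": int(match.group("u")),
        "l": int(match.group("l")),
        "e": int(match.group("e")),
        "bonds": [(int(i), o) for i, o in BOND.findall(match.group("bonds"))],
    }


def formal_charge(atom):
    """Valence electrons minus unpaired, lone-pair and bonding electrons."""
    bond_sum = sum(BOND_ORDER[order] for _, order in atom["bonds"])
    return VALENCE_ELECTRONS[atom["element"]] - atom["u"] - 2 * atom["l"] - bond_sum


NITROMETHANE = """
1 C U0 L0 E0 {2,S} {3,S} {4,S} {5,S}
2 N U0 L0 E+1 {1,S} {6,D} {7,S}
3 H U0 L0 E0 {1,S}
4 H U0 L0 E0 {1,S}
5 H U0 L0 E0 {1,S}
6 O U0 L2 E0 {2,D}
7 O U0 L3 E-1 {2,S}
"""

atoms = [parse_atom_line(line) for line in NITROMETHANE.strip().splitlines()]
total_charge = sum(atom["e"] for atom in atoms)
```

For this example the recomputed charges agree with the written E flags on every atom, and the E flags sum to zero, which is the condition for a neutral species.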

RMG no longer accepts 2T, 2S or any other such combination to represent multiplicity.

Adjacency lists in kinetics libraries or rules look like the following:

HCO
multiplicity 2
1 C U1 L0 E0 {2,D} {3,S}
2 O U0 L2 E0 {1,D}
3 H U0 L0 E0 {1,S}

As always, the first line can contain a label. The second line contains the keyword "multiplicity" followed by a space and a number representing the multiplicity of that species.
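The layout just described (optional label, then a multiplicity line, then atom lines) can be split apart with a few lines of Python. This is a sketch under the assumption that a label never starts with a digit and is never the word "multiplicity"; `split_adjacency_block` is a hypothetical name, not an RMG-Py function.

```python
def split_adjacency_block(text):
    """Return (label, multiplicity, atom_lines) for one adjacency list block."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    label = None
    multiplicity = None
    # A first line that is neither the multiplicity line nor an atom line
    # (atom lines start with an integer index) is treated as the label.
    if lines and not lines[0].startswith("multiplicity") \
            and not lines[0].split()[0].isdigit():
        label = lines.pop(0)
    if lines and lines[0].startswith("multiplicity"):
        multiplicity = int(lines.pop(0).split()[1])
    return label, multiplicity, lines


HCO = """
HCO
multiplicity 2
1 C U1 L0 E0 {2,D} {3,S}
2 O U0 L2 E0 {1,D}
3 H U0 L0 E0 {1,S}
"""

label, multiplicity, atom_lines = split_adjacency_block(HCO)
```

Both header lines are optional in this sketch, so a bare list of atom lines returns `None` for the label and the multiplicity.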

For groups the multiplicity label is always a separate argument (thermo and kinetics) and is defined as a list containing all accepted multiplicities where that group will be applicable.

Groups only require the U flag for unpaired electrons; L is optional and will be compared if defined, while E is not read at the moment.

Member Author


I think officially the SMILES string for this format would be [C+]#[O-]?
See e.g. this quote in a book, and compare the cactus resolver response for [C+]#[O-] with its response for C#O (hydroxymethylene) or [C]=O (which it gets wrong and thinks is formaldehyde!).
Cactus also interprets [C]#[O] as carbon monoxide. Wikipedia says it should be [C+]#[O-].

rwest commented Jun 5, 2014

I've been thinking about the labels...
They always have a number immediately after them, like L2 or E0, so they won't get confused with element names like C for carbon or P for phosphorus, which would never appear at this position in the adjacency list. I wonder, therefore, whether it is worth using E to represent Charge and L to represent Pairs, or whether we should stick with C and P?
Thoughts?

rwest commented Jun 19, 2014

We should probably cherry-pick those "saving the (entire) database" commits onto master, as they should be independent of the adjacency list syntax. I think we are tantalizingly close to done.

rwest commented Jun 27, 2014

I tried to write some documentation of the new adjacency list, but the documentation arrangement in general makes it hard to find the bit I wrote; I'm not sure it's accessible anywhere, unless you search for it? Maybe that should be fixed on master first.

bbuesser added 24 commits July 25, 2014 14:22
- remove multiplicity from GroupAtom
- add multiplicity to Group
- update __gainRadical() and __loseRadical()
- update equivalent() and isSpecificCaseOf()
- remove multiplicity from Atom
- add multiplicity to Molecule
rwest and others added 28 commits July 25, 2014 14:22
There wasn't much overlap so I made them separate tests.
These regular expressions should be able to detect the old
and the intermediate style adjacency lists.
…tron.

Previously if multiplicity was 1, it wouldn't print it, even if it was a
singlet biradical. Now, if there's a radical OR it's non-unity, it gets printed.
There was a whole block for if atom.isHydrogen() and another for
if it's not, but they only differed by the value of two constants
(the maximum number of lone pairs and the charge)
Better to remove blank lines from the end before checking
the adjacency list style from the last line.
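As an illustration of checking the style from the last non-blank line, here is a minimal sketch. The two regular expressions are hypothetical patterns written for this example (they are not the ones added in these commits) and they only cover simple molecule lines, not group definitions with element or radical sets.

```python
import re

# Old style (assumed shape): index, optional *label, element, then a radical
# token such as 0, 1, 2S or 2T, e.g. '1 C 2S {2,S}'.
OLD_STYLE = re.compile(
    r"^\s*\d+\s+(\*\d*\s+)?\S+\s+\d+[STQ]?(\s+\{.*\})?\s*$"
)
# Intermediate style: unpaired electrons written as U<n>, e.g. 'U0 L0 E0'.
INTERMEDIATE_STYLE = re.compile(r"^\s*\d+\s+(\*\d*\s+)?\S+\s+U\w+(\s|$)")


def detect_style(adjlist):
    """Classify an adjacency list by inspecting its last non-blank line."""
    lines = adjlist.splitlines()
    # Remove trailing blank lines first, otherwise the "last line" is empty
    # and matches neither pattern.
    while lines and not lines[-1].strip():
        lines.pop()
    if not lines:
        raise ValueError("empty adjacency list")
    last = lines[-1]
    if INTERMEDIATE_STYLE.match(last):
        return "intermediate"
    if OLD_STYLE.match(last):
        return "old"
    return "unknown"


old_result = detect_style("1 C 2S {2,S}\n2 H 0 {1,S}\n\n\n")
new_result = detect_style(
    "1 C U1 L0 E0 {2,D} {3,S}\n2 O U0 L2 E0 {1,D}\n3 H U0 L0 E0 {1,S}\n  \n"
)
```

Without the trailing-blank-line removal, both example inputs would be classified from an empty last line and come back as "unknown".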
It now must be a list of integers, like 'multiplicity [0,1,2]',
or an integer like 'multiplicity 3'. The assertion tells you.
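A sketch of such a check, accepting either form and failing with an explanatory assertion otherwise. The function name and the decision to always return a list are assumptions for this illustration, not the code in the commit.

```python
import ast


def parse_multiplicity_line(line):
    """Parse 'multiplicity 3' or 'multiplicity [1,2,3]' into a list of ints."""
    keyword, _, value = line.strip().partition(" ")
    assert keyword == "multiplicity", "line must start with 'multiplicity'"
    # literal_eval only evaluates Python literals, so arbitrary code in the
    # file cannot be executed.
    parsed = ast.literal_eval(value.strip())
    if isinstance(parsed, int) and not isinstance(parsed, bool):
        return [parsed]
    assert isinstance(parsed, list) and all(
        isinstance(m, int) and not isinstance(m, bool) for m in parsed
    ), (
        "multiplicity must be an integer or a list of integers, "
        "like 'multiplicity 3' or 'multiplicity [1,2,3]'"
    )
    return parsed


single = parse_multiplicity_line("multiplicity 3")
several = parse_multiplicity_line("multiplicity [0,1,2]")
```

Returning a one-element list for the integer form lets callers treat species and groups uniformly; whether that is desirable depends on how the value is consumed.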
I think it was probably meant to, but we should double check.
It looks to me like this wild card is not (yet) used anywhere
in the database anyway.

I also made it so we use the wildcard when writing adjacency lists
too, if either the unpaired electrons or pairs have [0,1,2,3,4]
they are written as 'x'.
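The writing rule can be sketched as follows. The `U`/`L` prefixes follow the format described in this pull request, but the helper name and the bracketed notation used for partial lists (for example `U[0,1]`) are assumptions made for this example only.

```python
WILDCARD_VALUES = [0, 1, 2, 3, 4]


def format_flag(prefix, values):
    """Format the allowed values of one flag for an adjacency list line."""
    values = sorted(set(values))
    if values == WILDCARD_VALUES:
        # Every supported value is allowed: write the wildcard.
        return prefix + "x"
    if len(values) == 1:
        return prefix + str(values[0])
    # Assumed notation for a partial list of allowed values.
    return prefix + "[" + ",".join(str(v) for v in values) + "]"


wildcard = format_flag("U", [0, 1, 2, 3, 4])
single_value = format_flag("L", [2])
```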
Eclipse was complaining about the indentation.
This was called atomElectronStates, presumably because it used to
represent the multiplicity too (eg. 2T, 2S, etc.) but now it is
just the unpaired electrons, it makes more sense to call it that.
When creating a molecule via the SMILES shortcut, you can now
specify the multiplicity, and furthermore the __repr__ will
get back what you started with:

In [1]: from rmgpy.molecule import Molecule

In [2]: print Molecule(SMILES="[CH2]").toAdjacencyList()
multiplicity 3
1 C U2 L0 E0  {2,S} {3,S}
2 H U0 L0 E0  {1,S}
3 H U0 L0 E0  {1,S}

In [3]: print Molecule(SMILES="[CH2]", multiplicity=1).toAdjacencyList()
multiplicity 1
1 C U2 L0 E0  {2,S} {3,S}
2 H U0 L0 E0  {1,S}
3 H U0 L0 E0  {1,S}

In [4]: Molecule().fromAdjacencyList(""" 1 C 2  """)
Out[4]: Molecule(SMILES="[CH2]")

In [5]: Molecule().fromAdjacencyList(""" 1 C 2S  """)
Out[5]: Molecule(SMILES="[CH2]", multiplicity=1)
Previously, a '2' represented '{2S,2T}' which would in
turn be parsed as [2,2] for the allowed number of electrons.
We now store sorted(set(radicalElectrons)) so that
'{2S,2T,1,3Q}' will become [1,2,3] instead of [2,2,1,3]
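A minimal sketch of that normalisation, assuming each old-style token is a count optionally followed by a spin letter; `parse_radical_set` is a hypothetical helper name, not the function changed in the commit.

```python
import re


def parse_radical_set(token):
    """Reduce an old-style radical token to sorted, unique electron counts."""
    counts = []
    for item in token.strip().strip("{}").split(","):
        match = re.match(r"\s*(\d+)", item)
        if match is None:
            raise ValueError("unrecognised radical token: %r" % item)
        # Keep only the electron count; the S/T/Q spin letter is dropped.
        counts.append(int(match.group(1)))
    return sorted(set(counts))


mixed = parse_radical_set("{2S,2T,1,3Q}")
doubled = parse_radical_set("{2S,2T}")
```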

Also changed the line wrap to make it easier to read.
'multiplicity [1, 2, 3]' is now 'multiplicity [1,2,3]'
This counts pairs, not electrons, so I find this name clearer than
lonePairElectrons
…ned (wildcard).

Previously, these were all ending up as -1. Now they are set to None,
which is what the new fromAdjacencyList does.
… GroupAtoms

isSpecificCaseOf() etc. are ignoring lone pairs and charges. Made a few notes as to where this should be fixed.
Now using:
-'u' for unpaired electrons
-'p' for paired electrons
-'c' for charge

Sample adjlist for HO2:

'''
multiplicity 2
1 O u0 p2 c0  {2,S} {3,S}
2 O u1 p2 c0  {1,S}
3 H u0 p0 c0  {1,S}
'''
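For lists already written with the intermediate U/L/E flags, the renaming can be sketched as a plain token substitution. This assumes the numeric values carry over unchanged (the HO2 sample suggests so, since an oxygen with two lone pairs is written p2); the regular expression and function name are hypothetical, not a conversion tool from this pull request.

```python
import re

FLAG_MAP = {"U": "u", "L": "p", "E": "c"}
# A flag token is one of U/L/E followed by a signed integer or the wildcard x,
# standing alone as a whitespace-delimited word (e.g. U0, L2, E+1, Ux).
FLAG = re.compile(r"\b([ULE])([+-]?\d+|x)(?=\s|$)")


def to_lowercase_flags(line):
    """Rewrite U/L/E flags on one atom line as u/p/c."""
    return FLAG.sub(lambda m: FLAG_MAP[m.group(1)] + m.group(2), line)


converted = to_lowercase_flags("2 N U0 L0 E+1 {1,S} {6,D} {7,S}")
```

Element symbols and bond definitions are left untouched because they are not followed directly by a number or `x` in that position.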
…y List syntax.

This should help distinguish
 * wildcard lists from bond definitions
 * old adjacency lists from new adjacency lists
 * dicts from lists
connie added a commit that referenced this pull request Jul 25, 2014
@connie connie merged commit 37be0ac into master Jul 25, 2014