New style adjacency list #218
examples/rmg/methylformate/input.py
I think officially the SMILES string for this format would be [C-]#[O+]?
See e.g. this quote in a book, and compare the Cactus resolver's response for [C-]#[O+] with its response for C#O (hydroxymethylene) or [C]=O (which it gets wrong and thinks is formaldehyde!).
Cactus also interprets [C]#[O] as carbon monoxide. Wikipedia says it should be [C-]#[O+].
I've been thinking about the labels...
We should probably cherry-pick those "saving the (entire) database" commits onto master, as they should be independent of the adjacency list syntax. I think we are tantalizingly close to done.
I tried to write some documentation of the new adjacency list, but the documentation arrangement in general makes it hard to find the bit I wrote; I'm not sure it's accessible anywhere, unless you search for it. Maybe that should be fixed on master first.
- remove multiplicity from GroupAtom
- add multiplicity to Group
- update __gainRadical() and __loseRadical()
- update equivalent() and isSpecificCaseOf()
saveSpeciesDictionary()
- remove multiplicity from Atom
- add multiplicity to Molecule
- instead of triplet assumption
There wasn't much overlap so I made them separate tests.
These regular expressions should be able to detect the old and the intermediate style adjacency lists.
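As a rough illustration of style detection by regular expression, the sketch below classifies an adjacency list from its last non-blank line. The patterns and the function name `guess_adjlist_style` are hypothetical and are not the expressions used in RMG-Py; they only cover the three atom-line shapes that appear in this conversation (`1 C 2T`, `1 C U2 L0 E0` and `1 O u0 p2 c0`).

```python
import re

# Hypothetical patterns for illustration only (not taken from RMG-Py).
# Each one matches: atom index, optional "*n" label, element, first electron column.
OLD = re.compile(r"^\s*\d+\s+(\*\d*\s+)?\S+\s+\d[STQ]?(\s|$)")
INTERMEDIATE = re.compile(r"^\s*\d+\s+(\*\d*\s+)?\S+\s+U(\d|x)\b")
NEW = re.compile(r"^\s*\d+\s+(\*\d*\s+)?\S+\s+u(\d|x|\[)")


def guess_adjlist_style(adjlist):
    """Return 'old', 'intermediate', 'new' or 'unknown' for an adjacency list."""
    # Drop blank lines first so that trailing whitespace cannot hide the last atom line.
    lines = [line for line in adjlist.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty adjacency list")
    last = lines[-1]
    for name, pattern in (("new", NEW), ("intermediate", INTERMEDIATE), ("old", OLD)):
        if pattern.match(last):
            return name
    return "unknown"
```

Checking only the last line avoids having to skip an optional label or `multiplicity` line at the top.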
…tron. Previously if multiplicity was 1, it wouldn't print it, even if it was a singlet biradical. Now, if there's a radical OR it's non-unity, it gets printed.
Two nested "if"s became an "if and"
There was a whole block for if atom.isHydrogen() and another for if it's not, but they only differed by the value of two constants (the maximum number of lone pairs and the charge)
Better to remove blank lines from the end before checking the adjacency list style from the last line.
It now must be a list of integers, like 'multiplicity [0,1,2]', or an integer like 'multiplicity 3'. The assertion tells you.
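A minimal sketch of the accepted forms, assuming a stand-alone helper (the name `parse_multiplicity` is hypothetical, and this version raises `ValueError` rather than using the assertion mentioned above):

```python
import ast


def parse_multiplicity(line):
    """Parse 'multiplicity 3' or 'multiplicity [0,1,2]' into an int or a list of ints."""
    keyword, _, value = line.strip().partition(" ")
    if keyword != "multiplicity":
        raise ValueError("expected a line starting with 'multiplicity'")
    try:
        parsed = ast.literal_eval(value.strip())
    except (ValueError, SyntaxError):
        raise ValueError("multiplicity must be an integer or a list of integers")
    # type(...) is int (rather than isinstance) so that booleans are rejected.
    if type(parsed) is int:
        return parsed
    if isinstance(parsed, list) and parsed and all(type(m) is int for m in parsed):
        return parsed
    raise ValueError("multiplicity must be an integer or a list of integers")
```

Old-style tokens such as `2S` are rejected because they are not valid Python literals.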
I think it was probably meant to, but we should double check. It looks to me like this wild card is not (yet) used anywhere in the database anyway. I also made it so we use the wildcard when writing adjacency lists too, if either the unpaired electrons or pairs have [0,1,2,3,4] they are written as 'x'.
Eclipse was complaining about the indentation.
This was called atomElectronStates, presumably because it used to represent the multiplicity too (e.g. 2T, 2S, etc.), but now that it is just the unpaired electrons, it makes more sense to call it that.
When creating a molecule via the SMILES shortcut, you can now
specify the multiplicity, and furthermore the __repr__ will
get back what you started with:
In [1]: from rmgpy.molecule import Molecule
In [2]: print Molecule(SMILES="[CH2]").toAdjacencyList()
multiplicity 3
1 C U2 L0 E0 {2,S} {3,S}
2 H U0 L0 E0 {1,S}
3 H U0 L0 E0 {1,S}
In [3]: print Molecule(SMILES="[CH2]", multiplicity=1).toAdjacencyList()
multiplicity 1
1 C U2 L0 E0 {2,S} {3,S}
2 H U0 L0 E0 {1,S}
3 H U0 L0 E0 {1,S}
In [4]: Molecule().fromAdjacencyList(""" 1 C 2 """)
Out[4]: Molecule(SMILES="[CH2]")
In [5]: Molecule().fromAdjacencyList(""" 1 C 2S """)
Out[5]: Molecule(SMILES="[CH2]", multiplicity=1)
Previously, a '2' represented '{2S,2T}' which would in
turn be parsed as [2,2] for the allowed number of electrons.
We now store sorted(set(radicalElectrons)) so that
'{2S,2T,1,3Q}' will become [1,2,3] instead of [2,2,1,3]
Also changed the line wrap to make it easier to read.
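The `sorted(set(...))` normalisation can be sketched as follows. The helper name `normalise_radical_electrons` and the token handling are assumptions made for illustration, not the RMG-Py parser:

```python
def normalise_radical_electrons(token):
    """Turn an old-style radical token into a sorted list of unique electron counts."""
    # Accept either a single entry ('2', '2S') or a braced list ('{2S,2T,1,3Q}').
    entries = token.strip().strip("{}").split(",")
    # Drop the spin-state suffix (S, T or Q); only the electron count is kept.
    counts = [int(entry.strip().rstrip("STQ")) for entry in entries]
    return sorted(set(counts))


print(normalise_radical_electrons("{2S,2T}"))       # [2]
print(normalise_radical_electrons("{2S,2T,1,3Q}"))  # [1, 2, 3]
```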
'multiplicity [1, 2, 3]' is now 'multiplicity [1,2,3]'
This counts pairs, not electrons, so I find this name clearer than lonePairElectrons
…ned (wildcard). Previously, these were all ending up as -1. Now they are set to None, which is what the new fromAdjacencyList does.
…dex, range, len, lookup etc.
… GroupAtoms isSpecificCaseOf etc. is ignoring lone pairs and charges. Made a few notes as to where this should be fixed.
Now using:
-'u' for unpaired electrons
-'p' for paired electrons
-'c' for charge
Sample adjlist for HO2:
'''
multiplicity 2
1 O u0 p2 c0 {2,S} {3,S}
2 O u1 p2 c0 {1,S}
3 H u0 p0 c0 {1,S}
'''
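To show how the `u`/`p`/`c` columns map onto data, here is a small stand-alone reader for lists shaped like the HO2 sample. It is a sketch only: `parse_adjlist` is a hypothetical name, and labels (`*1`), wildcards and group syntax are deliberately not handled.

```python
import re

BOND = re.compile(r"\{(\d+),(\w+)\}")  # e.g. {2,S}


def parse_adjlist(text):
    """Return (multiplicity, atoms) for a simple new-style molecule adjacency list."""
    multiplicity = None
    atoms = {}
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("multiplicity"):
            multiplicity = int(line.split()[1])
            continue
        index, element, u, p, c = line.split()[:5]
        atoms[int(index)] = {
            "element": element,
            "u": int(u[1:]),  # unpaired electrons
            "p": int(p[1:]),  # lone pairs
            "c": int(c[1:]),  # formal charge
            "bonds": {int(i): order for i, order in BOND.findall(line)},
        }
    return multiplicity, atoms


HO2 = """
multiplicity 2
1 O u0 p2 c0 {2,S} {3,S}
2 O u1 p2 c0 {1,S}
3 H u0 p0 c0 {1,S}
"""
multiplicity, atoms = parse_adjlist(HO2)
print(multiplicity, atoms[2]["u"], atoms[1]["bonds"])
```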
…y List syntax. This should help distinguish
* wildcard lists from bond definitions
* old adjacency lists from new adjacency lists
* dicts from lists
This is a copy of @bbuesser's pull request #217, but pulling from a GreenGroup branch (GreenGroup/new-style-adjacency-list), so that we can all add commits (and indeed pull requests) to this pull request before merging, without Beat needing to merge them all. It should stay in sync with (and will only work with) the GreenGroup/new-style-adjacency-list branch on the RMG-database project. The rest of this comment is from Beat's original pull request:
Do not merge yet
These are the necessary changes to RMG-Py and its unit tests to handle the new adjacency list format. At the bottom I have copied the same explanation of the new adjacency list format that I used for the RMG database pull request #37. Before that, I add some comments on how RMG-Py handles the new adjacency list as of this pull request; further improvements are expected.
I would like to mention the following functions.
toAdjacencyList(): always prints the complete adjacency list with U, L and (currently) calculated E. It accepts a new argument, printMultiplicity (True or False), which determines whether the multiplicity is printed as part of the adjacency list.
fromAdjacencyList(): always requires the U label. It can read the L label; if L is not defined, it assumes L0 for molecules and None (not defined and not to be compared) for groups. E is currently read but not used for anything. The formal charges are currently calculated from the U label and the number of bonds for each atom. Reading the E label was easy to implement, but because I do not yet have test cases that show how important the E label is, I thought it safer for this pull request to satisfy U and the bonds and create a neutral species. Therefore E is currently overwritten assuming a neutral species. fromAdjacencyList() accepts the wildcards Ux, Lx and Ex, representing any possible number.
fromRDKitMol(): all functions based on fromRDKitMol(), such as those reading from SMILES, assume that the maximum multiplicity is intended, i.e. multiplicity = 2*spin + 1 = number of unpaired electrons + 1.
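The maximum-multiplicity rule above amounts to the following arithmetic. The function name `max_multiplicity` is hypothetical and only illustrates the formula:

```python
def max_multiplicity(unpaired_electrons):
    """High-spin multiplicity from a list of per-atom unpaired electron counts."""
    total_unpaired = sum(unpaired_electrons)
    # All unpaired electrons are assumed parallel: S = total_unpaired / 2,
    # so 2S + 1 = total_unpaired + 1.
    return total_unpaired + 1


print(max_multiplicity([2, 0, 0]))  # [CH2]: 3 (triplet)
print(max_multiplicity([0, 1, 0]))  # HO2: 2 (doublet)
```

A singlet [CH2] therefore has to be requested explicitly, since it is not the default under this assumption.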
open topics for future development:
Explanation of new adjacency list style:
This is the first version of the RMG database with the new adjacency list format and multiplicity as a species/molecule property. Kinetics libraries store the multiplicity as part of the adjacency list, whereas everywhere else it is a separate argument. I think multiplicity should not be part of the adjacency list because it does not depend on its details, e.g. there can be many adjacency lists for the same species (resonance isomers), all necessarily having the same multiplicity. As soon as we continue with our efforts to have a separate structure library for kinetics rules, this difference of storing multiplicity in the adjacency list for kinetics will fall away.
The new adjacency list format is (e.g. nitromethane, CH3NO2):
where
U: the flag for unpaired electrons (formerly radicals). There are two reasons to abandon the R for radicals. First, we are using R to represent unspecified groups in the elements column. Second, it would be confusing in a future publication to use "radical" both for the species with unpaired electrons and for the unpaired electron itself.
L: the flag for the number of lone electron pairs. P was rejected as a flag because a possible future introduction of phosphorus would bring P in as an element.
E: the flag for formal charges. The sum of all E is equal to the total charge of the species. Currently only neutral species are reactive in RMG, therefore sum(E)=0 is required for reactivity. E has been chosen as the capital letter of e, representing an electronic charge. E+1 means one electron less, E-2 means two additional electrons on that atom. C was not used as the flag for formal charge because it is used to represent carbon.
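The relation between E and the other columns can be sketched with the standard formal-charge formula (valence electrons minus unpaired electrons, minus two per lone pair, minus the sum of bond orders). The helper below and the nitromethane numbers are an illustration based on a textbook Lewis structure, not values copied from this pull request or from RMG-Py:

```python
VALENCE_ELECTRONS = {"H": 1, "C": 4, "N": 5, "O": 6}


def formal_charge(element, unpaired, lone_pairs, bond_orders):
    """Formal charge of one atom from its U, L and integer bond orders."""
    return VALENCE_ELECTRONS[element] - unpaired - 2 * lone_pairs - sum(bond_orders)


# Nitromethane, CH3NO2, as (element, U, L, bond orders); assumed Lewis structure.
nitromethane = [
    ("C", 0, 0, [1, 1, 1, 1]),
    ("N", 0, 0, [1, 2, 1]),
    ("O", 0, 2, [2]),
    ("O", 0, 3, [1]),
    ("H", 0, 0, [1]),
    ("H", 0, 0, [1]),
    ("H", 0, 0, [1]),
]
charges = [formal_charge(*atom) for atom in nitromethane]
print(charges)       # [0, 1, 0, -1, 0, 0, 0]
print(sum(charges))  # 0, i.e. a neutral species
```

Individual atoms carry E+1 and E-1 here, yet the sum is zero, so the species satisfies the sum(E)=0 condition.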
There are no more 2T and 2S or any other combination accepted by RMG to represent multiplicity.
Adjacency lists in kinetics libraries or rules look like the following:
As always, the first line can contain a label. The second line contains the keyword "multiplicity" followed by a space and a number representing the multiplicity of that species.
For groups the multiplicity label is always a separate argument (thermo and kinetics) and is defined as a list containing all accepted multiplicities where that group will be applicable.
Groups only require the U flag for unpaired electrons; L is optional and will be compared if defined, while E is not read at the moment.