Commit 966391a

Documentation, SSS, exact and similarity search
2 parents 3729816 + df69e2d commit 966391a

56 files changed

Lines changed: 3496 additions & 2486 deletions

README.md

Lines changed: 123 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -3,8 +3,16 @@
33

44
***
55
## Abstract
6-
> The project will be focused on development of extension for neo4j graph database for querying knowledge graphs storing molecular and chemical information. That would be implemented on top of neo4j-java-driver.
6+
> Chemical and pharmaceutical R&D produce large amounts of data of completely different natures, such as chemical structures, recipe and process data, formulation data, and data from various application tests. Taken together, these data rarely follow a schema. Consequently, relational data models and databases are frequently at a disadvantage when mapping these data appropriately. Chemical data frequently lead to rather abstract data models, which are difficult to develop, align, and maintain together with the domain experts. Upon retrieval, computationally expensive joins of undetermined depth may cause issues.
77
8+
> Graph data models promise advantages here:
9+
> - they can easily be understood by non-IT experts from the research domains
10+
> - due to their plasticity, they can easily be extended and refactored
11+
> - graph databases such as neo4j are made for coping with arbitrary path lengths
12+
13+
> Chemical data models usually require a database that can handle chemical structures, so that structure-based queries can be used either to identify records or as filtering criteria.
14+
15+
> The project focuses on developing an extension for the neo4j graph database for querying knowledge graphs that store molecular and chemical information.
816
> Task is to enable identification of entry points into the graph via exact/substructure/similarity searches (UC1). UC2 is closely related to UC1, but here the intention is to use chemical structures as limiting conditions in graph traversals originating from different entry points. Both use cases rely on the same integration of RDkit and Neo4j and will only differ in their CYPHER statements.
917
1018
__Mentors:__
@@ -27,12 +35,76 @@ mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
2735
-Dpackaging=jar
2836
```
2937
2) Generate .jar file with all dependencies with `mvn package`
38+
3) Put generated .jar file into `plugins/` folder of your neo4j instance and start the server
39+
4) Run `CALL dbms.procedures()`; the `org.rdkit.*` procedures should appear in the output
3040

3141
## Extension functionality
3242

43+
### User scenario:
44+
45+
#### Feeding the data into database
46+
47+
##### way A:
48+
1) Plugin not present
49+
2) Feed Neo4j DB
50+
3) then `CALL org.rdkit.update(['Chemical', 'Structure'])` & `CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])`
51+
52+
> That triggers computation of additional properties (fp, etc.) and fp index creation
53+
> Automated computation of properties is enabled only after the `update` procedure has been called
54+
55+
##### way B:
56+
1) Plugin present
57+
2) Feed Neo4j DB
58+
3) then `CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])`
59+
60+
> Automated computation of additional properties (fp, etc.) and triggered index
61+
> Fp index automatically updated when new :Structure:Chemical records arrive
62+
63+
##### way C (the most suitable)
64+
1) Plugin present
65+
2) `CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])`
66+
3) Then feed Knime
67+
68+
> Automated computation of additional properties (fp, etc.) and index
69+
> Empty Neo4j instance is prepared in advance
70+
> Whenever new :Structure:Chemical entries arrive, property calculation and the fp index update are performed automatically
71+
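The three ways above differ only in the order of the same few steps. A minimal Python sketch of that ordering follows. The two procedure calls are copied from the text; the names `setup_steps`, `run_setup`, `run` and `feed` are hypothetical helpers for illustration and are not part of the plugin.

```python
from typing import Callable, List

# Procedure calls copied verbatim from the user scenario above.
UPDATE = "CALL org.rdkit.update(['Chemical', 'Structure'])"
CREATE_INDEX = "CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])"
FEED = "FEED"  # placeholder for the user's own data import step


def setup_steps(way: str) -> List[str]:
    """Return the ordered steps for way 'A', 'B' or 'C'."""
    steps = {
        "A": [FEED, UPDATE, CREATE_INDEX],  # data loaded while the plugin was absent
        "B": [FEED, CREATE_INDEX],          # plugin present during the import
        "C": [CREATE_INDEX, FEED],          # index prepared on an empty instance
    }
    return steps[way.upper()]


def run_setup(way: str, run: Callable[[str], object], feed: Callable[[], None]) -> None:
    """Execute the steps in order; `run` sends one Cypher statement to the database."""
    for step in setup_steps(way):
        if step == FEED:
            feed()
        else:
            run(step)
```

With a real database, `run` could be something like the `session.run` method of the Neo4j Python driver, and `feed` whatever import job loads the `:Structure:Chemical` nodes.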
72+
#### Execution of exact search
73+
_It is possible to check index existence with `CALL db.indexes`_
74+
75+
0) Exact search performance improves significantly if the `createIndex` procedure was called beforehand (it creates a property index).
76+
1) `CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')`
77+
2) `CALL org.rdkit.search.exact.mol(['Chemical', 'Structure'], '<mdlmol>')` (refer to tests for examples)
78+
79+
#### Execution of substructure search
80+
81+
1) Check with `CALL db.indexes` that the fulltext index `fp_index` exists (it should have been created by the `createIndex` procedure)
82+
2) `CALL org.rdkit.search.substructure.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')`
83+
3) `CALL org.rdkit.search.substructure.mol(['Chemical', 'Structure'], '<mol value>')`
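The text says the fulltext index on `fp` is required for substructure search but does not describe how it is used. As a rough illustration only, the Python sketch below shows the general fingerprint-screening idea that is commonly used for substructure search: a candidate can contain the query only if every bit set in the query fingerprint is also set in the candidate fingerprint. Representing fingerprints as sets of on-bit positions, and the names `may_contain` and `screen`, are assumptions made for this example and are not the plugin's API.

```python
from typing import Dict, List, Set


def may_contain(candidate_bits: Set[int], query_bits: Set[int]) -> bool:
    """True if the candidate cannot be ruled out by its fingerprint.

    All on-bits of the query must also be on in the candidate.
    """
    return query_bits <= candidate_bits


def screen(query_bits: Set[int], candidates: Dict[str, Set[int]]) -> List[str]:
    """Return the keys of candidates that survive the fingerprint screen."""
    return [key for key, bits in candidates.items() if may_contain(bits, query_bits)]
```

Screening of this kind only removes candidates that cannot match. The survivors would still need a real substructure match (for example with RDKit) before being returned.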
84+
85+
#### Execution of similarity search (currently slow)
86+
87+
1) `CALL org.rdkit.fingerprint.create(['Chemical', 'Structure'], 'torsion_fp', 'torsion')` - new property `torsion_fp` is created
88+
2) `CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'torsion', 'torsion_fp', 0.4)`
89+
3) `CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'pattern', 'fp', 0.7)`
90+
91+
#### Usage of `org.rdkit.search.substructure.is` function in complex queries
92+
93+
```cypher
94+
CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(C)(C)OC(=O)N1CCC(COc2ccc(OCc3ccccc3)cc2)CC1') YIELD luri
95+
MATCH (finalProduct:Entity{luri:luri})
96+
CALL apoc.path.expand(finalProduct, "<HAS_PRODUCT,>HAS_INGREDIENT", ">Reaction", 0, 4) yield path
97+
WITH nodes(path)[-1] as reaction, path, (length(path)+1)/2 as depths
98+
MATCH (reaction)-[:HAS_INGREDIENT]->(c:Compound) where org.rdkit.search.substructure.is(c, 'CC(C)C(O)=O')
99+
RETURN path
100+
```
101+
102+
---
33103
### Node labels: [`Chemical`, `Structure`] - strict rule (!)
34104

35105
* __Whenever a new node added with labels__, an `rdkit` event handler is applied and new node properties are constructed from `mdlmol` property.
106+
Those are also reserved property names
107+
36108
1) `canonical_smiles`
37109
2) `inchi`
38110
3) `formula`
@@ -41,30 +113,73 @@ mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
41113
6) `fp_ones` - count of positive bits
42114
7) `mdlmol`
43115

116+
Additional reserved property names:
117+
118+
- `smiles`
119+
44120
* If the graph was populated with nodes before the extension was loaded, the following procedure can be applied:
45121
`CALL org.rdkit.update(['Chemical', 'Structure'])` - which iterates through nodes with specified labels and creates properties described before.
46122

47123
* In order to speed up an exact search, create an index on top of `canonical_smiles` property
48124

49-
### User-defined procedures
125+
### User-defined procedures & functions
50126

51127
1) `CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')`
52128
2) `CALL org.rdkit.search.exact.mol(['Chemical', 'Structure'], '<mdlmol block>')`
53129
* RDKit provides functionality to use `exact search` on top of `smiles` and `mdlmol blocks`, returns a node which satisfies `canonical smiles`
54130
3) `CALL org.rdkit.update(['Chemical', 'Structure'])`
55-
* Update procedure (manual properties initialization from `mdlmol` property)
131+
* Update procedure (manual properties initialization from `mdlmol` property)
132+
* _The current implementation is single-threaded and may take a long time (>3 minutes) on a large database_
56133
4) `CALL org.rdkit.search.createIndex(['Chemical', 'Structure'])`
57134
* Create fulltext index (called `rdkitIndex`) on property `fp`, which is required for substructure search
58135
* Create index for `:Chemical(canonical_smiles)` property
59136
5) `CALL org.rdkit.search.deleteIndex()`
60137
* Delete fulltext index (called `rdkitIndex`) on property `fp`, which is required for substructure search
61138
* Delete index for `:Chemical(canonical_smiles)` property
62139
6) `CALL org.rdkit.search.substructure.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')`
63-
* Subscture search based on smiles substructure, correct smiles is expected
140+
* SSS (substructure search) based on a smiles substructure
141+
7) `CALL org.rdkit.search.substructure.mol(['Chemical', 'Structure'], '<mol value>')`
142+
* SSS based on mdlmol block substructure
143+
8) `CALL org.rdkit.fingerprint.create(['Chemical', 'Structure'], 'morgan_fp', 'morgan')`
144+
* Create a new property called `morgan_fp` with fingerprint type `morgan` on all nodes
145+
* Supporting properties `morgan_fp_type` and `morgan_fp_ones` are also added
146+
* Creates fulltext index on this property
147+
* Node is skipped if it's not possible to convert its smiles with this fingerprint type
148+
* It is __not allowed__ to use a property name equal to one of the predefined (reserved) names
149+
9) `CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'pattern', 'fp', 0.7)`
150+
* Call similarity search with the following parameters:
151+
- Node labels: `['Chemical', 'Structure']`
152+
- Smiles: `'CC(=O)Nc1nnc(S(N)(=O)=O)s1'`
153+
- Fingerprint type: `'pattern'`
154+
- Property name: `'fp'`
155+
- Threshold: `0.7`
156+
* The smiles value is converted into the specified _fingerprint type_ (if possible) and compared with nodes which have the _property_ (`'fp'` in this case)
157+
* Threshold is a lower bound for the score value
158+
* _The current implementation is single-threaded and may take a long time (>3 minutes) on a large database_
159+
10) User-defined function `org.rdkit.search.substructure.is(<node object>, '<smiles_string>')`
160+
* Returns a boolean: whether the specified `node` object contains the substructure given by `smiles_string`.
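To make the threshold parameter of the similarity search in item 9 concrete, here is a small Python sketch. The text only says that the threshold is a lower bound for the score; it does not name the similarity metric, so the Tanimoto coefficient used here is an assumption. Fingerprints are modelled as sets of on-bit positions, and `tanimoto` and `similarity_search` are hypothetical names, not plugin procedures.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def similarity_search(
    query: Set[int], candidates: Dict[str, Set[int]], threshold: float
) -> List[Tuple[str, float]]:
    """Return (key, score) pairs with score >= threshold, best score first."""
    hits = []
    for key, bits in candidates.items():
        lo, hi = sorted((len(query), len(bits)))
        # Bound from the bit counts alone: the score cannot exceed lo / hi,
        # so such candidates can be skipped without comparing fingerprints.
        if hi == 0 or lo / hi < threshold:
            continue
        score = tanimoto(query, bits)
        if score >= threshold:
            hits.append((key, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

The bit-count check illustrates one possible use of a stored count such as `fp_ones`: candidates whose count differs too much from the query's can be rejected cheaply. Whether the plugin does this is not stated in the text.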
161+
162+
---
163+
164+
# Results overview
165+
166+
## What was achieved
167+
168+
1) Implementation of exact search (100%)
169+
2) Implementation of substructure search (90%, several minor bugs)
170+
3) Implementation of condition based graph traversal - usage of function calls in complex queries (100%)
171+
4) Implementation of similarity search (70%, major performance issues)
172+
5) Coverage with unit tests (80%, not all invalid arguments for procedures are tested)
173+
174+
## What remains to be done
64175

176+
<!-- 0) Query features in substructure search (blocking of position in molecule from further substitution; using atom lists on certain positions in molecule) -->
177+
1) Speed up batch tasks by using several threads (currently blocked by an unresolved issue at the native level)
178+
2) Speed up the `similarity search` procedures
179+
3) Solve minor bugs (todos) like unclosed `query` object during SSS
65180

66-
## Useful links:
67-
- https://github.com/neo4j/neo4j
68-
- https://github.com/neo4j-contrib/neo4j-lucene5-index
69-
- https://github.com/rdkit/org.rdkit.lucene
181+
## What problems were encountered
70182

183+
1) Compatibility of native libraries for win64 (beginning of the development)
184+
2) Lazy stream evaluation and an unresolved issue with the `query` object during SSS
185+
3) Parallelization of stream evaluations

pom.xml

Lines changed: 30 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -6,8 +6,8 @@
66

77
<groupId>org.neo4j.rdkit</groupId>
88
<artifactId>rdkit-index</artifactId>
9-
<version>0.0.3</version>
10-
<name>RDKit-Neo4j</name>
9+
<version>0.0.7</version>
10+
<name>RDKit-Neo4j plugin</name>
1111
<packaging>jar</packaging>
1212

1313

@@ -21,6 +21,9 @@
2121
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
2222
<maven.compiler.source>1.8</maven.compiler.source>
2323
<maven.compiler.target>1.8</maven.compiler.target>
24+
<license.licenseName>evgerher_license</license.licenseName>
25+
<license.licenseResolver>${project.baseUri}src/main/license</license.licenseResolver> <!-- todo: does not work -->
26+
<!--<license.descriptionTemplate>${project.baseUri}src/main/resources/license/license-header.txt</license.descriptionTemplate> &lt;!&ndash; todo: does not work &ndash;&gt;-->
2427
</properties>
2528

2629
<dependencyManagement>
@@ -158,6 +161,31 @@
158161
</execution>
159162
</executions>
160163
</plugin>
164+
<plugin>
165+
<groupId>org.codehaus.mojo</groupId>
166+
<artifactId>license-maven-plugin</artifactId>
167+
<version>1.20</version>
168+
<configuration>
169+
<organizationName>RDKit</organizationName>
170+
<inceptionYear>2019</inceptionYear>
171+
<!--<includes>-->
172+
<!--<include>*.java</include>-->
173+
<!--</includes>-->
174+
<excludes>
175+
<exclude>*.txt</exclude>
176+
<exclude>*.properties</exclude>
177+
</excludes>
178+
</configuration>
179+
<executions>
180+
<execution>
181+
<id>first</id>
182+
<goals>
183+
<goal>update-file-header</goal>
184+
</goals>
185+
<phase>process-sources</phase>
186+
</execution>
187+
</executions>
188+
</plugin>
161189
</plugins>
162190
</build>
163191
</project>

src/main/java/org/neo4j/kernel/api/impl/fulltext/analyzer/providers/RDKit.java

Lines changed: 0 additions & 28 deletions
This file was deleted.
