Commit 8c6f717

Report evaluation
1 parent a2b28a8 commit 8c6f717

2 files changed

Lines changed: 50 additions & 21 deletions

README.md

Lines changed: 48 additions & 19 deletions
@@ -3,8 +3,14 @@

 ***
 ## Abstract
-> The project will be focused on development of extension for neo4j graph database for querying knowledge graphs storing molecular and chemical information. That would be implemented on top of neo4j-java-driver.
-
+> Chemical and pharmaceutical R&D produce large amounts of data of completely different kinds, such as chemical structures, recipe and process data, formulation data, and data from various application tests. Taken together, these data rarely follow a schema. Consequently, relational data models and databases are frequently at a disadvantage when mapping these data appropriately. Chemical data in particular frequently lead to rather abstract data models, which are difficult to develop, align, and maintain together with the domain experts. Upon retrieval, computationally expensive joins of non-predetermined depth may cause issues.
+> Graph data models promise advantages here:
+> • they can easily be understood by non-IT experts from the research domains
+> • due to their plasticity, they can easily be extended and refactored
+> • graph databases such as neo4j are made for coping with arbitrary path lengths
+> Chemical data models usually require a database that can handle chemical structures, so that they can be used in structure-based queries either to identify records or as filtering criteria.
+
+> The project focuses on developing an extension for the neo4j graph database for querying knowledge graphs that store molecular and chemical information.
 > Task is to enable identification of entry points into the graph via exact/substructure/similarity searches (UC1). UC2 is closely related to UC1, but here the intention is to use chemical structures as limiting conditions in graph traversals originating from different entry points. Both use cases rely on the same integration of RDkit and Neo4j and will only differ in their CYPHER statements.

 __Mentors:__
@@ -27,6 +33,8 @@ mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
 -Dpackaging=jar
 ```
 2) Generate .jar file with all dependencies with `mvn package`
+3) Put the generated .jar file into the `plugins/` folder of your neo4j instance and start the server
+4) Execute `CALL dbms.procedures()`; the `org.rdkit.*` procedures should be listed

 ## Extension functionality

@@ -47,36 +55,36 @@ mvn org.apache.maven.plugins:maven-install-plugin:2.3.1:install-file \
 2) Feed Neo4j DB
 3) then `CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])`

-> Automated computation of additional properties (fp, etc.) and triggered index
-> Fp index automatically updated when new :Structure:Chemical records arrive
+> Automated computation of additional properties (fp, etc.) and triggered index
+> Fp index automatically updated when new :Structure:Chemical records arrive

 ##### way C (the most suitable)
 1) Plugin present
 2) `CALL org.rdkit.search.createIndex(['Structure', 'Chemical'])`
 3) Then feed Knime

-> Automated computation of additional properties (fp, etc.) and index
-> Empty Neo4j instance is prepared in advance
-> Whenever a new :Structure:Chemical entries comes, property calculation and fp index update are automatically conducted
+> Automated computation of additional properties (fp, etc.) and index
+> Empty Neo4j instance is prepared in advance
+> Whenever a new :Structure:Chemical entry arrives, property calculation and fp index update are automatically conducted

 #### Execution of exact search
 _It is possible to check index existence with `CALL db.indexes`_

-0) It would strongly affect performance of exact search if `createIndex` procedure was called earlier (it creates a property index).
-1) `CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')`
-2) `CALL org.rdkit.search.exact.mol(['Chemical', 'Structure'], '<mdlmol>')` (refer to tests for examples)
+0) Exact search performs considerably better if the `createIndex` procedure was called beforehand (it creates a property index).
+1) `CALL org.rdkit.search.exact.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')`
+2) `CALL org.rdkit.search.exact.mol(['Chemical', 'Structure'], '<mdlmol>')` (refer to tests for examples)

 #### Execution of substructure search

-1) Make sure the fulltext index exists with `CALL db.indexes`, `fp_index` must exist. (It should be created with `createIndex` procedure)
-2) `CALL org.rdkit.search.substructure.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')`
-3) `CALL org.rdkit.search.substructure.mol(['Chemical', 'Structure'], '<mol value>')`
+1) Make sure the fulltext index `fp_index` exists by running `CALL db.indexes` (it is created by the `createIndex` procedure)
+2) `CALL org.rdkit.search.substructure.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1')`
+3) `CALL org.rdkit.search.substructure.mol(['Chemical', 'Structure'], '<mol value>')`

 #### Execution of similarity search (currently slow)

-1) `CALL org.rdkit.fingerprint.create(['Chemical, 'Structure'], 'torsion_fp', 'torsion')` - new property `morgan_fp` is created
-2) `CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'torsion', 'torsion_fp', 0.4)`
-3) `CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'pattern', 'fp', 0.7)`
+1) `CALL org.rdkit.fingerprint.create(['Chemical', 'Structure'], 'torsion_fp', 'torsion')` - new property `torsion_fp` is created
+2) `CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'torsion', 'torsion_fp', 0.4)`
+3) `CALL org.rdkit.fingerprint.search.smiles(['Chemical', 'Structure'], 'CC(=O)Nc1nnc(S(N)(=O)=O)s1', 'pattern', 'fp', 0.7)`
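
The last argument of the `fingerprint.search.smiles` calls above (0.4, 0.7) reads as a similarity threshold. The README does not state which metric the plugin applies, so the sketch below assumes a Tanimoto coefficient over fingerprint bits, which is a common choice for this kind of search. The class `TanimotoSketch`, the helper `bits`, and the bit positions are hypothetical and not part of the extension.

```java
import java.util.BitSet;

final class TanimotoSketch {

    /** Tanimoto coefficient: shared set bits divided by the bits set in either fingerprint. */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet intersection = (BitSet) a.clone();
        intersection.and(b);
        BitSet union = (BitSet) a.clone();
        union.or(b);
        int unionBits = union.cardinality();
        // Two empty fingerprints share nothing; treat them as dissimilar.
        return unionBits == 0 ? 0.0 : (double) intersection.cardinality() / unionBits;
    }

    /** Builds a toy fingerprint from explicit bit positions (illustration only). */
    static BitSet bits(int... positions) {
        BitSet set = new BitSet();
        for (int position : positions) {
            set.set(position);
        }
        return set;
    }

    public static void main(String[] args) {
        BitSet query = bits(1, 3, 5, 7);
        BitSet candidate = bits(1, 3, 5, 9);
        double score = tanimoto(query, candidate); // 3 shared bits / 5 bits in total
        System.out.println("score = " + score);
        System.out.println("passes 0.4: " + (score >= 0.4)); // true
        System.out.println("passes 0.7: " + (score >= 0.7)); // false
    }
}
```

Under this assumption, a lower threshold such as 0.4 admits more distant candidates, and a higher one such as 0.7 keeps only close matches.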

 #### Usage of `org.rdkit.search.substructure.is` function in complex queries

@@ -149,6 +157,27 @@ Additional reserved property names:
 10) User-defined function `org.rdkit.search.substructure.is(<node object>, '<smiles_string>')`
 * Return boolean answer: does specified `node` object have substructure match provided by `smiles_string`.

-## Useful links:
-- https://github.com/neo4j/neo4j
-- https://github.com/rdkit/org.rdkit.lucene
+---
+
+# Results overview
+
+## What was achieved
+
+1) Implementation of exact search (100%)
+2) Implementation of substructure search (90%, several minor bugs)
+3) Implementation of condition-based graph traversal - usage of function calls in complex queries (100%)
+4) Implementation of similarity search (70%, major performance issues)
+5) Coverage with unit tests (80%, not all invalid arguments for procedures are tested)
+
+## What remains to be done
+
+<!-- 0) Query features in substructure search (blocking of position in molecule from further substitution; using atom lists on certain positions in molecule) -->
+1) Speed up batch tasks by utilizing several threads (currently waiting for an issue at the native level to be resolved)
+2) Speed up the `similarity search` procedures
+3) Solve minor bugs (todos) such as the unclosed `query` object during SSS
+
+## What problems were encountered
+
+1) Compatibility of native libraries for win64 (beginning of the development)
+2) Lazy stream evaluation and the unresolved issue with the `query` object during SSS
+3) Parallelization of stream evaluations
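
The unclosed `query` object and lazy stream evaluation listed above are related: a lazily evaluated stream only touches the query while results are being consumed, so the query cannot be released at the point where the stream is built. One possible approach, sketched here under assumptions, is to register the cleanup with `Stream.onClose` and let the consumer close the stream. `QueryHandle` and `search` are hypothetical stand-ins and not the plugin's classes, and the sketch does not show how Neo4j itself consumes procedure result streams.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

final class LazyStreamCloseSketch {

    /** Hypothetical stand-in for a native query object that must be released after use. */
    static final class QueryHandle implements AutoCloseable {
        final AtomicBoolean closed = new AtomicBoolean(false);

        boolean matches(String candidate) {
            if (closed.get()) {
                throw new IllegalStateException("query already closed");
            }
            return candidate.contains("N"); // toy match rule, not a real substructure test
        }

        @Override
        public void close() {
            closed.set(true);
        }
    }

    /** Returns a lazy stream; the query is released only when the stream itself is closed. */
    static Stream<String> search(List<String> candidates, QueryHandle query) {
        return candidates.stream()
                .filter(query::matches)
                .onClose(query::close);
    }

    public static void main(String[] args) {
        QueryHandle query = new QueryHandle();
        List<String> hits = new ArrayList<>();
        // try-with-resources closes the stream, which runs the onClose handler.
        try (Stream<String> results = search(Arrays.asList("CCO", "CCN", "NC=O"), query)) {
            results.forEach(hits::add);
        }
        System.out.println(hits + " closed=" + query.closed.get());
    }
}
```

Closing the query eagerly inside `search` would instead make `matches` fail as soon as the stream is consumed, which is the kind of ordering problem lazy evaluation introduces.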

pom.xml

Lines changed: 2 additions & 2 deletions
@@ -161,10 +161,10 @@
         </execution>
     </executions>
 </plugin>
-<plugin> <!-- TODO: requires maven >=3.5.4 -->
+<plugin>
     <groupId>org.codehaus.mojo</groupId>
     <artifactId>license-maven-plugin</artifactId>
-    <version>2.0.0</version>
+    <version>1.20</version>
     <configuration>
         <organizationName>RDKit</organizationName>
         <inceptionYear>2019</inceptionYear>
