Commit fece005

Added new gensim node embedder and refactored similarity to support different backends (#91)
* Added gensim embedder and new sim backends
* Updated notebooks and docs
* Bugfixes
* Updated readme
1 parent bcbf453 commit fece005

43 files changed

Lines changed: 142527 additions & 2758 deletions

README.rst

Lines changed: 4 additions & 1 deletion
@@ -28,6 +28,7 @@ Using the built-in :code:`PGFrame` data structure (currently, `pandas <https://p
 - `graph-tool <https://graph-tool.skewed.de/>`_ (for the analytics API)
 - `Neo4j <https://neo4j.com/>`_ (for the analytics and representation learning API);
 - `StellarGraph <https://stellargraph.readthedocs.io/en/stable/>`_ (for the representation learning API).
+- `gensim <https://radimrehurek.com/gensim/>`_ (for the representation learning API).
 
 This repository originated from the Blue Brain effort on building a COVID-19-related knowledge graph from the `CORD-19 <https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge>`_ dataset and analysing the generated graph to perform a literature review of the role of glucose metabolism deregulations in the progression of COVID-19. For more details on how the knowledge graph is built, explored and analysed, see `COVID-19 co-occurrence graph generation and analysis <https://github.com/BlueBrain/BlueGraph/tree/master/cord19kg#readme>`__.
 
@@ -156,7 +157,9 @@ To get familiar with the ideas behind the co-occurrence analysis and the graph a
 - `Literature exploration (PGFrames + in-memory analytics tutorial) <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Literature%20exploration%20(PGFrames%20%2B%20in-memory%20analytics%20tutorial).ipynb>`_ illustrates how to use BlueGraph's analytics API for in-memory graph backends based on the :code:`NetworkX` and the :code:`graph-tool` libraries.
 - `NASA keywords (PGFrames + Neo4j analytics tutorial) <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/NASA%20keywords%20(PGFrames%20%2B%20Neo4j%20analytics%20tutorial).ipynb>`_ illustrates how to use the Neo4j-based analytics API for persistent property graphs.
 
-`Embedding and downstream tasks tutorial <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Embedding%20and%20downstream%20tasks%20tutorial.ipynb>`_ starts from the co-occurrence graph generation example and guides the user through the graph representation learning and all its downstream tasks including node similarity queries, node classification, edge prediction and embedding pipeline building.
+`Embedding and downstream tasks tutorial <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Embedding%20and%20downstream%20tasks%20tutorial.ipynb>`_ starts from the co-occurrence graph generation example and guides the user through the graph representation learning and all its downstream tasks including node similarity queries, node classification and edge prediction.
+
+`Create and run embedding pipelines <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Create%20and%20run%20embedding%20pipelines.ipynb>`_ illustrates how embedding pipelines can be built and executed using BlueGraph.
 
 Finally, `Create and push embedding pipeline into Nexus.ipynb <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Create%20and%20push%20embedding%20pipeline%20into%20Nexus.ipynb>`_ illustrates how embedding pipelines can be created and pushed to `Nexus <https://bluebrainnexus.io/>`_ and
 `Embedding service API <https://github.com/BlueBrain/BlueGraph/blob/master/services/embedder/examples/notebooks/Embedding%20service%20API.ipynb>`_ shows how the embedding service that retrieves the embedding pipelines from Nexus can be used.

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+# BlueGraph: unifying Python framework for graph analytics and co-occurrence analysis.
+
+# Copyright 2020-2021 Blue Brain Project / EPFL
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from .embed.embedders import GensimNodeEmbedder

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# BlueGraph: unifying Python framework for graph analytics and co-occurrence analysis.
+
+# Copyright 2020-2021 Blue Brain Project / EPFL
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
+# BlueGraph: unifying Python framework for graph analytics and co-occurrence analysis.
+
+# Copyright 2020-2021 Blue Brain Project / EPFL
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from collections import namedtuple
+import warnings
+import pandas as pd
+
+from gensim.models.poincare import PoincareModel
+
+from bluegraph.core.embed.embedders import GraphElementEmbedder
+from bluegraph.backends.params import (GENSIM_PARAMS,
+                                       DEFAULT_GENSIM_PARAMS)
+
+
+GensimGraph = namedtuple('GensimGraph', 'graph graph_configs')
+
+
+class GensimNodeEmbedder(GraphElementEmbedder):
+
+    _transductive_models = [
+        "poincare",
+        "word2vec"
+    ]
+
+    def __init__(self, model_name, directed=True, include_type=False,
+                 feature_props=None, feature_vector_prop=None,
+                 edge_weight=None, **model_params):
+        if directed is False and model_name == "poincare":
+            raise GraphElementEmbedder.FittingException(
+                "Poincare embedding can be performed only on directed graphs: "
+                "undirected graph was provided")
+        super().__init__(
+            model_name=model_name, directed=directed,
+            include_type=include_type,
+            feature_props=feature_props,
+            feature_vector_prop=feature_vector_prop,
+            edge_weight=edge_weight, **model_params)
+
+    @staticmethod
+    def _generate_graph(pgframe, graph_configs):
+        """Generate backend-specific graph object."""
+        return GensimGraph(pgframe, graph_configs)
+
+    def _dispatch_model_params(self, **kwargs):
+        """Dispatch training parameters."""
+        params = {}
+        for k, v in kwargs.items():
+            if k not in GENSIM_PARAMS[self.model_name]:
+                warnings.warn(
+                    f"GensimNodeEmbedder's model '{self.model_name}' "
+                    f"does not support the training parameter '{k}', "
+                    "the parameter will be ignored",
+                    GraphElementEmbedder.FittingWarning)
+            else:
+                params[k] = v
+
+        for k, v in DEFAULT_GENSIM_PARAMS.items():
+            if k not in params:
+                params[k] = v
+        return params
+
+    def _fit_transductive_embedder(self, train_graph):
+        """Fit transductive embedder (no model, just embeddings)."""
+
+        model_params = {**self.params}
+        del model_params["epochs"]
+
+        if self.model_name == "poincare":
+            model = PoincareModel(
+                train_graph.graph.edges(), **model_params)
+
+        model.train(epochs=self.params["epochs"])
+
+        embedding = pd.DataFrame(
+            [
+                (n, model.kv.get_vector(n))
+                for n in train_graph.graph.nodes()
+            ],
+            columns=["@id", "embedding"]
+        ).set_index("@id")
+        return embedding
+
+    def _fit_inductive_embedder(self, train_graph):
+        """Fit inductive embedder (predictive model and embeddings)."""
+        raise NotImplementedError(
+            "Inductive models are not implemented for gensim-based "
+            "node embedders")
+
+    def _predict_embeddings(self, graph, nodes=None):
+        """Predict embeddings for the given nodes (inductive models only)."""
+        raise NotImplementedError(
+            "Inductive models are not implemented for gensim-based "
+            "node embedders")
+
+    @staticmethod
+    def _save_predictive_model(model, path):
+        pass
+
+    @staticmethod
+    def _load_predictive_model(path):
+        pass
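
The parameter dispatch in `_dispatch_model_params` can be read in isolation as a filter-then-default step. Below is a minimal, self-contained sketch of that logic, not the BlueGraph implementation: `ALLOWED_PARAMS`, `DEFAULT_PARAMS` and the free function `dispatch_model_params` are hypothetical stand-ins for `GENSIM_PARAMS`, `DEFAULT_GENSIM_PARAMS` and the method, and a plain `UserWarning` replaces `GraphElementEmbedder.FittingWarning`.

```python
import warnings

# Hypothetical stand-ins for GENSIM_PARAMS and DEFAULT_GENSIM_PARAMS
ALLOWED_PARAMS = {"poincare": ["epochs", "size", "alpha", "negative"]}
DEFAULT_PARAMS = {"size": 64, "epochs": 50}


def dispatch_model_params(model_name, **kwargs):
    """Keep supported parameters, warn about the rest, then fill in defaults."""
    params = {}
    for key, value in kwargs.items():
        if key not in ALLOWED_PARAMS[model_name]:
            # Unsupported parameters are dropped with a warning, not an error
            warnings.warn(
                f"model '{model_name}' does not support the training "
                f"parameter '{key}', the parameter will be ignored",
                UserWarning)
        else:
            params[key] = value
    # Defaults only fill the gaps; user-provided values take precedence
    for key, value in DEFAULT_PARAMS.items():
        params.setdefault(key, value)
    return params


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = dispatch_model_params("poincare", size=32, window=5)
```

In this sketch the lookup `ALLOWED_PARAMS[model_name]` assumes the model name has an entry in the table; a name without one raises `KeyError` before any warning is issued.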

bluegraph/backends/neo4j/analyse/paths.py

Lines changed: 1 addition & 1 deletion
@@ -104,7 +104,7 @@ def _compute_yen_shortest_paths(graph, source, target, n,
             graph._generate_st_match_query(source, target) +
             Neo4jPathFinder._generate_path_search_call(
                 graph, source, target,
-                "gds.beta.shortestPath.yens.stream",
+                "gds.shortestPath.yens.stream",
                 distance, exclude_edge,
                 extra_params={"k": n}) +
             "YIELD nodeIds\n"

bluegraph/backends/neo4j/embed/embedders.py

Lines changed: 10 additions & 3 deletions
@@ -48,11 +48,16 @@ class Neo4jNodeEmbedder(GraphElementEmbedder):
     @staticmethod
     def _generate_graph(pgframe=None, uri=None, username=None,
                         password=None, driver=None,
-                        node_label=None, edge_label=None):
+                        node_label=None, edge_label=None,
+                        graph_configs=None):
         """Generate backend-specific graph object."""
+        if graph_configs is None:
+            graph_configs = {"directed": True}
+
         return pgframe_to_neo4j(
             pgframe=pgframe, uri=uri, username=username, password=password,
-            driver=driver, node_label=node_label, edge_label=edge_label)
+            driver=driver, node_label=node_label, edge_label=edge_label,
+            directed=graph_configs["directed"])
 
     def _dispatch_model_params(self, **kwargs):
         """Dispatch training parameters."""
@@ -223,7 +228,9 @@ def fit_model(self, pgframe=None, uri=None, username=None, password=None,
             train_graph = self._generate_graph(
                 pgframe=pgframe, uri=uri, username=username,
                 password=password, driver=driver,
-                node_label=node_label, edge_label=edge_label)
+                node_label=node_label, edge_label=edge_label,
+                graph_configs=self.graph_configs)
+            # self.graph_configs
         else:
             train_graph = graph_view
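
The `graph_configs=None` default followed by an explicit `is None` check is the usual way to give a dict-valued argument a default without sharing one mutable object across calls. A minimal sketch of the pattern, assuming a hypothetical helper `resolve_graph_configs` that is not part of BlueGraph:

```python
def resolve_graph_configs(graph_configs=None):
    """Return the given config dict, defaulting to a directed graph."""
    if graph_configs is None:
        # A fresh dict is built on every call, so callers never share state
        graph_configs = {"directed": True}
    return graph_configs


default_configs = resolve_graph_configs()
undirected_configs = resolve_graph_configs({"directed": False})
```

With this shape, callers that never pass `graph_configs` keep the previous directed behaviour, while an embedder can forward its own configuration explicitly.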

bluegraph/backends/neo4j/io.py

Lines changed: 7 additions & 6 deletions
@@ -162,12 +162,12 @@ def pgframe_to_neo4j(pgframe=None, uri=None, username=None, password=None,
         node_label_repr = f":{node_label}" if node_label else ""
 
         query = (
-        f"""
-        WITH [{", ".join(node_repr)}] AS batch
-        UNWIND batch as individual
-        CREATE (n{node_label_repr})
-        SET n += individual
-        """)
+            f"""
+            WITH [{", ".join(node_repr)}] AS batch
+            UNWIND batch as individual
+            CREATE (n{node_label_repr})
+            SET n += individual
+            """)
         execute(driver, query)
 
         # Add node types to the Neo4j node labels
@@ -189,6 +189,7 @@ def pgframe_to_neo4j(pgframe=None, uri=None, username=None, password=None,
         edge_labels = [edge_label]
 
     for edge_label in edge_labels:
+
         # Select edges of a given type, if applicable
         edges = pgframe.edges(
             raw_frame=True,

bluegraph/backends/params.py

Lines changed: 24 additions & 0 deletions
@@ -84,3 +84,27 @@
     "clusters_q": 1,
     "num_powers": 10
 }
+
+
+GENSIM_PARAMS = {
+    "poincare": [
+        "epochs",
+        "size",
+        "alpha",
+        "negative",
+        "workers",
+        "epsilon",
+        "regularization_coeff",
+        "burn_in",
+        "burn_in_alpha",
+        "init_range",
+        "dtype",
+        "seed"
+    ]
+}
+
+
+DEFAULT_GENSIM_PARAMS = {
+    "size": 64,
+    "epochs": 50
+}

bluegraph/core/embed/embedders.py

Lines changed: 8 additions & 4 deletions
@@ -59,7 +59,7 @@ def _inductive_models(self):
 
     @staticmethod
     @abstractmethod
-    def _generate_graph(self, pgframe):
+    def _generate_graph(pgframe, graph_configs):
         """Generate backend-specific graph object."""
         pass
 
@@ -167,7 +167,7 @@ def fit_model(self, pgframe):
             if not isinstance(embeddings, pd.DataFrame):
                 embeddings = pd.DataFrame(
                     {"embedding": embeddings.tolist()},
-                    index=train_graph.nodes())
+                    index=pgframe.nodes())
         elif self.model_name in self._inductive_models:
             self._embedding_model = self._fit_inductive_embedder(train_graph)
             embeddings = self._predict_embeddings(train_graph)
@@ -234,8 +234,12 @@ def load(path):
 
         with open(os.path.join(path, "emb.pkl"), "rb") as f:
             embedder = pickle.load(f)
-        embedder._embedding_model = embedder._load_predictive_model(
-            os.path.join(path, "model"))
+
+        embedder._embedding_model = None
+        if os.path.isfile(os.path.join(path, "model")):
+            embedder._embedding_model = embedder._load_predictive_model(
+                os.path.join(path, "model"))
+
         if decompressed:
             shutil.rmtree(path)
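
The change to `load` makes the predictive model optional: transductive embedders have no model file, so the loader is called only when one exists. A rough sketch of that guard, assuming hypothetical helpers `load_optional_model` and `pickle_loader` that are not BlueGraph functions:

```python
import os
import pickle
import tempfile


def load_optional_model(path, loader):
    """Return loader(model_path) if the model file exists, otherwise None."""
    model_path = os.path.join(path, "model")
    if os.path.isfile(model_path):
        return loader(model_path)
    return None


def pickle_loader(model_path):
    # Hypothetical loader used only for this demonstration
    with open(model_path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    # No model file yet: the guard falls through to None
    missing = load_optional_model(tmp, pickle_loader)
    with open(os.path.join(tmp, "model"), "wb") as f:
        pickle.dump({"weights": [1, 2, 3]}, f)
    present = load_optional_model(tmp, pickle_loader)
```

One property of this check worth keeping in mind: `os.path.isfile` returns `False` for directories, so a model saved as a directory rather than a single file would not be picked up by it.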

bluegraph/core/io.py

Lines changed: 4 additions & 1 deletion
@@ -954,6 +954,8 @@ def edge_types(self, flatten=False):
         """Return a list of edges types."""
         if flatten:
             types = _aggregate_values(self._edges["@type"])
+            if isinstance(types, str):
+                types = [types]
         else:
             types = []
             for el in self._edges["@type"]:
@@ -1112,9 +1114,10 @@ def get_edge_typing(self):
 def aggregate_properties(frame, func, into="aggregation_result"):
     if "@type" in frame.columns:
         df = frame.drop("@type", axis=1)
+        aggregated = df.aggregate(func, axis=1).values.tolist()
         frame = pd.DataFrame(
             {
-                into: df.aggregate(func, axis=1),
+                into: aggregated,
                 "@type": frame["@type"]
             },
             index=frame.index)
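
The `isinstance(types, str)` guard in `edge_types` covers the case where aggregation collapses to a single type name: a bare string is itself iterable, so code that loops over the result would otherwise walk its characters. A small sketch of the same normalization, assuming a hypothetical helper `as_type_list` and an illustrative type name:

```python
def as_type_list(types):
    """Wrap a single type name in a list; convert other collections to lists."""
    if isinstance(types, str):
        # Without this branch, list("co_occurs") would split into characters
        return [types]
    return list(types)


single = as_type_list("co_occurs")
several = as_type_list({"co_occurs"})
```

Both calls yield a one-element list, so downstream code can iterate over edge types without checking whether it received a scalar.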
