ToG-3/data/multimodal_test_samples/documents.json at 5ce2b31cfaf2a2648bb5e24a36a4cecf48401664 · DataArcTech/ToG-3 · GitHub

[
    {
        "id_": "08e7cb38-322c-43b0-9f39-4d730527e92a",
        "embedding": null,
        "metadata": {
            "header": "Leveraging knowledge graphs to power LangChain Applications",
            "source": "2023-10-18_Using-a-Knowledge-Graph-to-implement-a-DevOps-RAG-application-b6ba24831b16.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "RAG applications are all the rage at the moment. Everyone is building their company documentation chatbot or similar. Mostly, they all have in common that their source of knowledge is unstructured text, which gets chunked and embedded in one way or another. However, not all information arrives as unstructured text.\nSay, for example, you wanted to create a chatbot that could answer questions about your microservice architecture, ongoing tasks, and more. Tasks are mostly defined as unstructured text, so there wouldn\u2019t be anything different from the usual RAG workflow there. However, how could you prepare information about your microservices architecture so the chatbot can retrieve up-to-date information? One option would be to create daily snapshots of the architecture and transform them into text that the LLM would understand. However, what if there is a better approach? Meet knowledge graphs, which can store both structured and unstructured information in a single database.\nNodes and relationships are used to describe data in a knowledge graph. Typically, nodes are used to represent entities or concepts like people, organizations, and locations. In the microservice graph example, nodes describe people, teams, microservices, and tasks. On the other hand, relationships are used to define connections between these entities, like dependencies between microservices or task owners.\nBoth nodes and relationships can have property values stored as key-value pairs.\nThe microservice nodes have two node properties describing their name and technology. On the other hand, task nodes are more complex. They have the name, status, description, as well as embedding properties. By storing text embedding values as node properties, you can perform a vector similarity search of task descriptions identical to if you had the tasks stored in a vector database. Therefore, knowledge graphs allow you to store and retrieve both structured and unstructured information to power your RAG applications.\nIn this blog post, I\u2019ll walk you through a scenario of implementing a knowledge graph based RAG application with LangChain to support your DevOps team. The code is available on GitHub.\nGitHub\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "RAG applications are all the rage at the moment. Everyone is building their company documentation chatbot or similar. Mostly, they all have in common that their source of knowledge is unstructured text, which gets chunked and embedded in one way or another. However, not all information arrives as unstructured text.\nSay, for example, you wanted to create a chatbot that could answer questions about your microservice architecture, ongoing tasks, and more. Tasks are mostly defined as unstructured text, so there wouldn\u2019t be anything different from the usual RAG workflow there. However, how could you prepare information about your microservices architecture so the chatbot can retrieve up-to-date information? One option would be to create daily snapshots of the architecture and transform them into text that the LLM would understand. However, what if there is a better approach? Meet knowledge graphs, which can store both structured and unstructured information in a single database.\nNodes and relationships are used to describe data in a knowledge graph. Typically, nodes are used to represent entities or concepts like people, organizations, and locations. In the microservice graph example, nodes describe people, teams, microservices, and tasks. On the other hand, relationships are used to define connections between these entities, like dependencies between microservices or task owners.\nBoth nodes and relationships can have property values stored as key-value pairs.\nThe microservice nodes have two node properties describing their name and technology. On the other hand, task nodes are more complex. They have the name, status, description, as well as embedding properties. By storing text embedding values as node properties, you can perform a vector similarity search of task descriptions identical to if you had the tasks stored in a vector database. Therefore, knowledge graphs allow you to store and retrieve both structured and unstructured information to power your RAG applications.\nIn this blog post, I\u2019ll walk you through a scenario of implementing a knowledge graph based RAG application with LangChain to support your DevOps team. The code is available on GitHub.\nGitHub\n"
    },
    {
        "id_": "71f3812f-4bba-48ce-8926-2f20dfd3863e",
        "embedding": null,
        "metadata": {
            "header": "Neo4j Environment Setup",
            "source": "2023-10-18_Using-a-Knowledge-Graph-to-implement-a-DevOps-RAG-application-b6ba24831b16.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "You need to set up a Neo4j 5.11 or greater to follow along with the examples in this blog post. The easiest way is to start a free instance on Neo4j Aura, which offers cloud instances of Neo4j database. Alternatively, you can also set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance.\nNeo4j Aura\nNeo4j Desktop\n```from langchain.graphs import Neo4jGraphurl = \"neo4j+s://databases.neo4j.io\"username =\"neo4j\"password = \"\"graph = Neo4jGraph(    url=url,     username=username,     password=password)```\nfrom langchain.graphs import Neo4jGraphurl = \"neo4j+s://databases.neo4j.io\"username =\"neo4j\"password = \"\"graph = Neo4jGraph(    url=url,     username=username,     password=password)\nfrom\nimport\n\"neo4j+s://databases.neo4j.io\"\n\"neo4j\"\n\"\"\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "You need to set up a Neo4j 5.11 or greater to follow along with the examples in this blog post. The easiest way is to start a free instance on Neo4j Aura, which offers cloud instances of Neo4j database. Alternatively, you can also set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance.\nNeo4j Aura\nNeo4j Desktop\n```from langchain.graphs import Neo4jGraphurl = \"neo4j+s://databases.neo4j.io\"username =\"neo4j\"password = \"\"graph = Neo4jGraph(    url=url,     username=username,     password=password)```\nfrom langchain.graphs import Neo4jGraphurl = \"neo4j+s://databases.neo4j.io\"username =\"neo4j\"password = \"\"graph = Neo4jGraph(    url=url,     username=username,     password=password)\nfrom\nimport\n\"neo4j+s://databases.neo4j.io\"\n\"neo4j\"\n\"\"\n"
    },
    {
        "id_": "7de6d0e0-25cf-4415-b8c0-65b98880ec5c",
        "embedding": null,
        "metadata": {
            "header": "Dataset",
            "source": "2023-10-18_Using-a-Knowledge-Graph-to-implement-a-DevOps-RAG-application-b6ba24831b16.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "Knowledge graphs are excellent at connecting information from multiple data sources. You could fetch information from cloud services, task management tools, and more when developing a DevOps RAG application.\nSince this kind of microservice and task information is not public, I had to create a synthetic dataset. I employed ChatGPT to help me. It\u2019s a small dataset with only 100 nodes, but enough for this tutorial. The following code will import the sample graph into Neo4j.\n```import requestsurl = \"https://gist.githubusercontent.com/tomasonjo/08dc8ba0e19d592c4c3cde40dd6abcc3/raw/da8882249af3e819a80debf3160ebbb3513ee962/microservices.json\"import_query = requests.get(url).json()['query']graph.query(    import_query)```\nimport requestsurl = \"https://gist.githubusercontent.com/tomasonjo/08dc8ba0e19d592c4c3cde40dd6abcc3/raw/da8882249af3e819a80debf3160ebbb3513ee962/microservices.json\"import_query = requests.get(url).json()['query']graph.query(    import_query)\nimport\nrequests\nurl\n=\n\"https://gist.githubusercontent.com/tomasonjo/08dc8ba0e19d592c4c3cde40dd6abcc3/raw/da8882249af3e819a80debf3160ebbb3513ee962/microservices.json\"\n'query'\nIf you inspect the graph in Neo4j Browser, you should get a similar visualization.\nBlue nodes describe microservices. These microservices may have dependencies on one another, implying that the functioning or the outcome of one might be reliant on another\u2019s operation. On the other hand, the brown nodes represent tasks that are directly linked to these microservices. Besides showing how things are set up and their linked tasks, our graph also shows which teams are in charge of what.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "Knowledge graphs are excellent at connecting information from multiple data sources. You could fetch information from cloud services, task management tools, and more when developing a DevOps RAG application.\nSince this kind of microservice and task information is not public, I had to create a synthetic dataset. I employed ChatGPT to help me. It\u2019s a small dataset with only 100 nodes, but enough for this tutorial. The following code will import the sample graph into Neo4j.\n```import requestsurl = \"https://gist.githubusercontent.com/tomasonjo/08dc8ba0e19d592c4c3cde40dd6abcc3/raw/da8882249af3e819a80debf3160ebbb3513ee962/microservices.json\"import_query = requests.get(url).json()['query']graph.query(    import_query)```\nimport requestsurl = \"https://gist.githubusercontent.com/tomasonjo/08dc8ba0e19d592c4c3cde40dd6abcc3/raw/da8882249af3e819a80debf3160ebbb3513ee962/microservices.json\"import_query = requests.get(url).json()['query']graph.query(    import_query)\nimport\nrequests\nurl\n=\n\"https://gist.githubusercontent.com/tomasonjo/08dc8ba0e19d592c4c3cde40dd6abcc3/raw/da8882249af3e819a80debf3160ebbb3513ee962/microservices.json\"\n'query'\nIf you inspect the graph in Neo4j Browser, you should get a similar visualization.\nBlue nodes describe microservices. These microservices may have dependencies on one another, implying that the functioning or the outcome of one might be reliant on another\u2019s operation. On the other hand, the brown nodes represent tasks that are directly linked to these microservices. Besides showing how things are set up and their linked tasks, our graph also shows which teams are in charge of what.\n"
    },
    {
        "id_": "7a9ac15c-5353-4512-9924-a4b8d257cd62",
        "embedding": null,
        "metadata": {
            "header": "Neo4j Vector\u00a0index",
            "source": "2023-10-18_Using-a-Knowledge-Graph-to-implement-a-DevOps-RAG-application-b6ba24831b16.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "We will begin by implementing a vector index search for finding relevant tasks by their name and description. If you are unfamiliar with vector similarity search, let me give you a quick refresher. The key idea is to calculate the text embedding values for each task based on their description and name. Then, at query time, find the most similar tasks to the user input using a similarity metric like a cosine distance.\nThe retrieved information from the vector index can then be used as context to the LLM so it can generate accurate and up-to-date answers.\nThe tasks are already in our knowledge graph. However, we need to calculate the embedding values and create the vector index. This can be achieved with the from_existing_graph method.\n```import osfrom langchain.vectorstores.neo4j_vector import Neo4jVectorfrom langchain.embeddings.openai import OpenAIEmbeddingsos.environ['OPENAI_API_KEY'] = \"OPENAI_API_KEY\"vector_index = Neo4jVector.from_existing_graph(    OpenAIEmbeddings(),    url=url,    username=username,    password=password,    index_name='tasks',    node_label=\"Task\",    text_node_properties=['name', 'description', 'status'],    embedding_node_property='embedding',)```\nimport osfrom langchain.vectorstores.neo4j_vector import Neo4jVectorfrom langchain.embeddings.openai import OpenAIEmbeddingsos.environ['OPENAI_API_KEY'] = \"OPENAI_API_KEY\"vector_index = Neo4jVector.from_existing_graph(    OpenAIEmbeddings(),    url=url,    username=username,    password=password,    index_name='tasks',    node_label=\"Task\",    text_node_properties=['name', 'description', 'status'],    embedding_node_property='embedding',)\nimport\nfrom\nimport\nfrom\nimport\n'OPENAI_API_KEY'\n\"OPENAI_API_KEY\"\n'tasks'\n\"Task\"\n'name'\n'description'\n'status'\n'embedding'\nIn this example, we used the following graph-specific parameters for the from_existing_graph method.\nNow that the vector index has been initiated, we can use it as any other vector index in 
LangChain.\n```response = vector_index.similarity_search(    \"How will RecommendationService be updated?\")print(response[0].page_content)# name: BugFix# description: Add a new feature to RecommendationService to provide ...# status: In Progress```\nresponse = vector_index.similarity_search(    \"How will RecommendationService be updated?\")print(response[0].page_content)# name: BugFix# description: Add a new feature to RecommendationService to provide ...# status: In Progress\n\"How will RecommendationService be updated?\"\nprint\n0\n# name: BugFix\n# description: Add a new feature to RecommendationService to provide ...\n# status: In Progress\nYou can observe that we construct a response of a map or dictionary-like string with defined properties in the text_node_properties parameter.\nNow we can easily create a chatbot response by wrapping the vector index into a RetrievalQA module.\n```from langchain.chains import RetrievalQAfrom langchain.chat_models import ChatOpenAIvector_qa = RetrievalQA.from_chain_type(    llm=ChatOpenAI(),    chain_type=\"stuff\",    retriever=vector_index.as_retriever())vector_qa.run(    \"How will recommendation service be updated?\")# The RecommendationService is currently being updated to include a new feature # that will provide more personalized and accurate product recommendations to # users. This update involves leveraging user behavior and preference data to # enhance the recommendation algorithm. The status of this update is currently# in progress.```\nfrom langchain.chains import RetrievalQAfrom langchain.chat_models import ChatOpenAIvector_qa = RetrievalQA.from_chain_type(    llm=ChatOpenAI(),    chain_type=\"stuff\",    retriever=vector_index.as_retriever())vector_qa.run(    \"How will recommendation service be updated?\")# The RecommendationService is currently being updated to include a new feature # that will provide more personalized and accurate product recommendations to # users. 
This update involves leveraging user behavior and preference data to # enhance the recommendation algorithm. The status of this update is currently# in progress.\nfrom\nimport\nfrom\nimport\n\"stuff\"\n\"How will recommendation service be updated?\"\n# The RecommendationService is currently being updated to include a new feature\n# that will provide more personalized and accurate product recommendations to\n# users. This update involves leveraging user behavior and preference data to\n# enhance the recommendation algorithm. The status of this update is currently\n# in progress.\nOne limitation of vector indexes, in general, is that they don\u2019t provide the ability to aggregate information like you would with a structured query language like Cypher. Take, for example, the following example:\n```vector_qa.run(    \"How many open tickets there are?\")# There are 4 open tickets.```\nvector_qa.run(    \"How many open tickets there are?\")# There are 4 open tickets.\n\"How many open tickets there are?\"\n# There are 4 open tickets.\nThe response seems valid, and the LLM uses assertive language, making you believe the result is correct. However, the problem is that the response directly correlates to the number of retrieved documents from the vector index, which is four by default. What actually happens is that the vector index retrieves four open tickets, and the LLM unquestioningly believes that those are all the open tickets. However, the truth is different, and we can validate it using a Cypher statement.\n```graph.query(    \"MATCH (t:Task {status:'Open'}) RETURN count(*)\")# [{'count(*)': 5}]```\ngraph.query(    \"MATCH (t:Task {status:'Open'}) RETURN count(*)\")# [{'count(*)': 5}]\n\"MATCH (t:Task {status:'Open'}) RETURN count(*)\"\n# [{'count(*)': 5}]\nThere are five open tasks in our toy graph. 
While vector similarity search is excellent for sifting through relevant information in unstructured text, it lacks the capability to analyze and aggregate structured information. Using Neo4j, this problem can be easily solved by employing Cypher, which is a structured query language for graph databases.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "We will begin by implementing a vector index search for finding relevant tasks by their name and description. If you are unfamiliar with vector similarity search, let me give you a quick refresher. The key idea is to calculate the text embedding values for each task based on their description and name. Then, at query time, find the most similar tasks to the user input using a similarity metric like a cosine distance.\nThe retrieved information from the vector index can then be used as context to the LLM so it can generate accurate and up-to-date answers.\nThe tasks are already in our knowledge graph. However, we need to calculate the embedding values and create the vector index. This can be achieved with the from_existing_graph method.\n```import osfrom langchain.vectorstores.neo4j_vector import Neo4jVectorfrom langchain.embeddings.openai import OpenAIEmbeddingsos.environ['OPENAI_API_KEY'] = \"OPENAI_API_KEY\"vector_index = Neo4jVector.from_existing_graph(    OpenAIEmbeddings(),    url=url,    username=username,    password=password,    index_name='tasks',    node_label=\"Task\",    text_node_properties=['name', 'description', 'status'],    embedding_node_property='embedding',)```\nimport osfrom langchain.vectorstores.neo4j_vector import Neo4jVectorfrom langchain.embeddings.openai import OpenAIEmbeddingsos.environ['OPENAI_API_KEY'] = \"OPENAI_API_KEY\"vector_index = Neo4jVector.from_existing_graph(    OpenAIEmbeddings(),    url=url,    username=username,    password=password,    index_name='tasks',    node_label=\"Task\",    text_node_properties=['name', 'description', 'status'],    embedding_node_property='embedding',)\nimport\nfrom\nimport\nfrom\nimport\n'OPENAI_API_KEY'\n\"OPENAI_API_KEY\"\n'tasks'\n\"Task\"\n'name'\n'description'\n'status'\n'embedding'\nIn this example, we used the following graph-specific parameters for the from_existing_graph method.\nNow that the vector index has been initiated, we can use it as any other vector index in 
LangChain.\n```response = vector_index.similarity_search(    \"How will RecommendationService be updated?\")print(response[0].page_content)# name: BugFix# description: Add a new feature to RecommendationService to provide ...# status: In Progress```\nresponse = vector_index.similarity_search(    \"How will RecommendationService be updated?\")print(response[0].page_content)# name: BugFix# description: Add a new feature to RecommendationService to provide ...# status: In Progress\n\"How will RecommendationService be updated?\"\nprint\n0\n# name: BugFix\n# description: Add a new feature to RecommendationService to provide ...\n# status: In Progress\nYou can observe that we construct a response of a map or dictionary-like string with defined properties in the text_node_properties parameter.\nNow we can easily create a chatbot response by wrapping the vector index into a RetrievalQA module.\n```from langchain.chains import RetrievalQAfrom langchain.chat_models import ChatOpenAIvector_qa = RetrievalQA.from_chain_type(    llm=ChatOpenAI(),    chain_type=\"stuff\",    retriever=vector_index.as_retriever())vector_qa.run(    \"How will recommendation service be updated?\")# The RecommendationService is currently being updated to include a new feature # that will provide more personalized and accurate product recommendations to # users. This update involves leveraging user behavior and preference data to # enhance the recommendation algorithm. The status of this update is currently# in progress.```\nfrom langchain.chains import RetrievalQAfrom langchain.chat_models import ChatOpenAIvector_qa = RetrievalQA.from_chain_type(    llm=ChatOpenAI(),    chain_type=\"stuff\",    retriever=vector_index.as_retriever())vector_qa.run(    \"How will recommendation service be updated?\")# The RecommendationService is currently being updated to include a new feature # that will provide more personalized and accurate product recommendations to # users. 
This update involves leveraging user behavior and preference data to # enhance the recommendation algorithm. The status of this update is currently# in progress.\nfrom\nimport\nfrom\nimport\n\"stuff\"\n\"How will recommendation service be updated?\"\n# The RecommendationService is currently being updated to include a new feature\n# that will provide more personalized and accurate product recommendations to\n# users. This update involves leveraging user behavior and preference data to\n# enhance the recommendation algorithm. The status of this update is currently\n# in progress.\nOne limitation of vector indexes, in general, is that they don\u2019t provide the ability to aggregate information like you would with a structured query language like Cypher. Take, for example, the following example:\n```vector_qa.run(    \"How many open tickets there are?\")# There are 4 open tickets.```\nvector_qa.run(    \"How many open tickets there are?\")# There are 4 open tickets.\n\"How many open tickets there are?\"\n# There are 4 open tickets.\nThe response seems valid, and the LLM uses assertive language, making you believe the result is correct. However, the problem is that the response directly correlates to the number of retrieved documents from the vector index, which is four by default. What actually happens is that the vector index retrieves four open tickets, and the LLM unquestioningly believes that those are all the open tickets. However, the truth is different, and we can validate it using a Cypher statement.\n```graph.query(    \"MATCH (t:Task {status:'Open'}) RETURN count(*)\")# [{'count(*)': 5}]```\ngraph.query(    \"MATCH (t:Task {status:'Open'}) RETURN count(*)\")# [{'count(*)': 5}]\n\"MATCH (t:Task {status:'Open'}) RETURN count(*)\"\n# [{'count(*)': 5}]\nThere are five open tasks in our toy graph. 
While vector similarity search is excellent for sifting through relevant information in unstructured text, it lacks the capability to analyze and aggregate structured information. Using Neo4j, this problem can be easily solved by employing Cypher, which is a structured query language for graph databases.\n"
    },
    {
        "id_": "5e8ec5c8-68c1-48d7-ad71-ab11342556f1",
        "embedding": null,
        "metadata": {
            "header": "Graph Cypher\u00a0search",
            "source": "2023-10-18_Using-a-Knowledge-Graph-to-implement-a-DevOps-RAG-application-b6ba24831b16.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "Cypher is a structured query language designed to interact with graph databases and provides a visual way of matching patterns and relationships. It relies on the following ascii-art type of syntax:\n```(:Person {name:\"Tomaz\"})-[:LIVES_IN]->(:Country {name:\"Slovenia\"})```\n(:Person {name:\"Tomaz\"})-[:LIVES_IN]->(:Country {name:\"Slovenia\"})\n\"Tomaz\"\n[:LIVES_IN]\n\"Slovenia\"\nThis pattern describes a node with a label Person and the name property Tomaz that has a LIVES_IN relationship to the Country node of Slovenia.\nThe neat thing about LangChain is that it provides a GraphCypherQAChain, which generates the Cypher queries for you, so you don\u2019t have to learn Cypher syntax in order to retrieve information from a graph database like Neo4j.\nGraphCypherQAChain\nThe following code will refresh the graph schema and instantiate the Cypher chain.\n```from langchain.chains import GraphCypherQAChaingraph.refresh_schema()cypher_chain = GraphCypherQAChain.from_llm(    cypher_llm = ChatOpenAI(temperature=0, model_name='gpt-4'),    qa_llm = ChatOpenAI(temperature=0), graph=graph, verbose=True,)```\nfrom langchain.chains import GraphCypherQAChaingraph.refresh_schema()cypher_chain = GraphCypherQAChain.from_llm(    cypher_llm = ChatOpenAI(temperature=0, model_name='gpt-4'),    qa_llm = ChatOpenAI(temperature=0), graph=graph, verbose=True,)\nfrom\nimport\n0\n'gpt-4'\n0\nTrue\nGenerating valid Cypher statements is a complex task. Therefore, it is recommended to use state-of-the-art LLMs like gpt-4 to generate Cypher statements, while generating answers using the database context can be left to gpt-3.5-turbo.\nNow, you can ask the same question about how many tickets are open.\n```cypher_chain.run(    \"How many open tickets there are?\")```\ncypher_chain.run(    \"How many open tickets there are?\")\n\"How many open tickets there are?\"\nResult is the following\nYou can also ask the chain to aggregate the data using various grouping keys, like the following example.\n```cypher_chain.run(    \"Which team has the most open tasks?\")```\ncypher_chain.run(    \"Which team has the most open tasks?\")\n\"Which team has the most open tasks?\"\nResult is the following\nYou might say these aggregations are not graph-based operations, and you will be correct. We can, of course, perform more graph-based operations like traversing the dependency graph of microservices.\n```cypher_chain.run(    \"Which services depend on Database directly?\")```\ncypher_chain.run(    \"Which services depend on Database directly?\")\n\"Which services depend on Database directly?\"\nResult is the following\nOf course, you can also ask the chain to produce variable-length path traversals by asking questions like:\nvariable-length path traversals\n```cypher_chain.run(    \"Which services depend on Database indirectly?\")```\ncypher_chain.run(    \"Which services depend on Database indirectly?\")\n\"Which services depend on Database indirectly?\"\nResult is the following\nSome of the mentioned services are the same as in the directly dependent question. The reason is the structure of the dependency graph, not an invalid Cypher statement.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "Cypher is a structured query language designed to interact with graph databases and provides a visual way of matching patterns and relationships. It relies on the following ascii-art type of syntax:\n```(:Person {name:\"Tomaz\"})-[:LIVES_IN]->(:Country {name:\"Slovenia\"})```\n(:Person {name:\"Tomaz\"})-[:LIVES_IN]->(:Country {name:\"Slovenia\"})\n\"Tomaz\"\n[:LIVES_IN]\n\"Slovenia\"\nThis pattern describes a node with a label Person and the name property Tomaz that has a LIVES_IN relationship to the Country node of Slovenia.\nThe neat thing about LangChain is that it provides a GraphCypherQAChain, which generates the Cypher queries for you, so you don\u2019t have to learn Cypher syntax in order to retrieve information from a graph database like Neo4j.\nGraphCypherQAChain\nThe following code will refresh the graph schema and instantiate the Cypher chain.\n```from langchain.chains import GraphCypherQAChaingraph.refresh_schema()cypher_chain = GraphCypherQAChain.from_llm(    cypher_llm = ChatOpenAI(temperature=0, model_name='gpt-4'),    qa_llm = ChatOpenAI(temperature=0), graph=graph, verbose=True,)```\nfrom langchain.chains import GraphCypherQAChaingraph.refresh_schema()cypher_chain = GraphCypherQAChain.from_llm(    cypher_llm = ChatOpenAI(temperature=0, model_name='gpt-4'),    qa_llm = ChatOpenAI(temperature=0), graph=graph, verbose=True,)\nfrom\nimport\n0\n'gpt-4'\n0\nTrue\nGenerating valid Cypher statements is a complex task. Therefore, it is recommended to use state-of-the-art LLMs like gpt-4 to generate Cypher statements, while generating answers using the database context can be left to gpt-3.5-turbo.\nNow, you can ask the same question about how many tickets are open.\n```cypher_chain.run(    \"How many open tickets there are?\")```\ncypher_chain.run(    \"How many open tickets there are?\")\n\"How many open tickets there are?\"\nResult is the following\nYou can also ask the chain to aggregate the data using various grouping keys, like the following example.\n```cypher_chain.run(    \"Which team has the most open tasks?\")```\ncypher_chain.run(    \"Which team has the most open tasks?\")\n\"Which team has the most open tasks?\"\nResult is the following\nYou might say these aggregations are not graph-based operations, and you will be correct. We can, of course, perform more graph-based operations like traversing the dependency graph of microservices.\n```cypher_chain.run(    \"Which services depend on Database directly?\")```\ncypher_chain.run(    \"Which services depend on Database directly?\")\n\"Which services depend on Database directly?\"\nResult is the following\nOf course, you can also ask the chain to produce variable-length path traversals by asking questions like:\nvariable-length path traversals\n```cypher_chain.run(    \"Which services depend on Database indirectly?\")```\ncypher_chain.run(    \"Which services depend on Database indirectly?\")\n\"Which services depend on Database indirectly?\"\nResult is the following\nSome of the mentioned services are the same as in the directly dependent question. The reason is the structure of the dependency graph, not an invalid Cypher statement.\n"
    },
    {
        "id_": "1e6dc67b-3f35-4d0a-8311-c8d334cb932b",
        "embedding": null,
        "metadata": {
            "header": "Knowledge graph\u00a0agent",
            "source": "2023-10-18_Using-a-Knowledge-Graph-to-implement-a-DevOps-RAG-application-b6ba24831b16.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "Since we have implemented separate tools for the structured and unstructured parts of the knowledge graph, we can add an agent that can use these two tools to explore the knowledge graph.\n```\nfrom langchain.agents import initialize_agent, Tool\nfrom langchain.agents import AgentType\n\ntools = [\n    Tool(\n        name=\"Tasks\",\n        func=vector_qa.run,\n        description=\"\"\"Useful when you need to answer questions about descriptions of tasks.\n        Not useful for counting the number of tasks.\n        Use full question as input.\n        \"\"\",\n    ),\n    Tool(\n        name=\"Graph\",\n        func=cypher_chain.run,\n        description=\"\"\"Useful when you need to answer questions about microservices,\n        their dependencies or assigned people. Also useful for any sort of\n        aggregation like counting the number of tasks, etc.\n        Use full question as input.\n        \"\"\",\n    ),\n]\n\nmrkl = initialize_agent(\n    tools,\n    ChatOpenAI(temperature=0, model_name='gpt-4'),\n    agent=AgentType.OPENAI_FUNCTIONS, verbose=True\n)\n```\nLet\u2019s try out how well the agent works.\n```\nresponse = mrkl.run(\"Which team is assigned to maintain PaymentService?\")\nprint(response)\n```\nResult is the following\nLet\u2019s now try to invoke the Tasks tool.\n```\nresponse = mrkl.run(\"Which tasks have optimization in their description?\")\nprint(response)\n```\nResult is the following\nOne thing is certain: I have to work on my agent prompt engineering skills. There is definitely room for improvement in the tool descriptions. Additionally, you can also customize the agent prompt.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "Since we have implemented separate tools for the structured and unstructured parts of the knowledge graph, we can add an agent that can use these two tools to explore the knowledge graph.\n```\nfrom langchain.agents import initialize_agent, Tool\nfrom langchain.agents import AgentType\n\ntools = [\n    Tool(\n        name=\"Tasks\",\n        func=vector_qa.run,\n        description=\"\"\"Useful when you need to answer questions about descriptions of tasks.\n        Not useful for counting the number of tasks.\n        Use full question as input.\n        \"\"\",\n    ),\n    Tool(\n        name=\"Graph\",\n        func=cypher_chain.run,\n        description=\"\"\"Useful when you need to answer questions about microservices,\n        their dependencies or assigned people. Also useful for any sort of\n        aggregation like counting the number of tasks, etc.\n        Use full question as input.\n        \"\"\",\n    ),\n]\n\nmrkl = initialize_agent(\n    tools,\n    ChatOpenAI(temperature=0, model_name='gpt-4'),\n    agent=AgentType.OPENAI_FUNCTIONS, verbose=True\n)\n```\nLet\u2019s try out how well the agent works.\n```\nresponse = mrkl.run(\"Which team is assigned to maintain PaymentService?\")\nprint(response)\n```\nResult is the following\nLet\u2019s now try to invoke the Tasks tool.\n```\nresponse = mrkl.run(\"Which tasks have optimization in their description?\")\nprint(response)\n```\nResult is the following\nOne thing is certain: I have to work on my agent prompt engineering skills. There is definitely room for improvement in the tool descriptions. Additionally, you can also customize the agent prompt.\n"
    },
    {
        "id_": "321e9abb-6b20-4fb6-8459-9e39fc203edd",
        "embedding": null,
        "metadata": {
            "header": "Conclusion",
            "source": "2023-10-18_Using-a-Knowledge-Graph-to-implement-a-DevOps-RAG-application-b6ba24831b16.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "Knowledge graphs are an excellent fit when you require structured and unstructured data to power your RAG applications. With the approach shown in this blog post, you can avoid polyglot architectures, where you must maintain and sync multiple types of databases. Learn more about graph-based search in LangChain here.\nThe code is available on GitHub.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "Knowledge graphs are an excellent fit when you require structured and unstructured data to power your RAG applications. With the approach shown in this blog post, you can avoid polyglot architectures, where you must maintain and sync multiple types of databases. Learn more about graph-based search in LangChain here.\nThe code is available on GitHub.\n"
    },
    {
        "id_": "2d42d6c0-8e93-4af4-83b3-981220bdec62",
        "embedding": null,
        "metadata": {
            "header": "Seamlessly implement information extraction pipeline with LangChain and\u00a0Neo4j",
            "source": "2023-10-20_Constructing-knowledge-graphs-from-text-using-OpenAI-functions-096a6d010c17.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "Extracting structured information from unstructured data like text has been around for some time and is nothing new. However, LLMs brought a significant shift to the field of information extraction. Where you once needed a team of machine learning experts to curate datasets and train custom models, you now only need access to an LLM. The barrier to entry has dropped significantly, making what was reserved for domain experts just a couple of years ago accessible even to non-technical people.\nThe image depicts the transformation of unstructured text into structured information. This process, labeled as the information extraction pipeline, results in a graph representation of information. The nodes represent key entities, while the connecting lines denote the relationships between these entities. Knowledge graphs are useful for multi-hop question-answering, real-time analytics, or when you want to combine structured and unstructured data in a single database.\nWhile extracting structured information from text has been made more accessible due to LLMs, it is by no means a solved problem. In this blog post, we will use OpenAI functions in combination with LangChain to construct a knowledge graph from a sample Wikipedia page. Along the way, we will discuss best practices as well as some limitations of current LLMs.\nTL;DR: The code is available on GitHub.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "Extracting structured information from unstructured data like text has been around for some time and is nothing new. However, LLMs brought a significant shift to the field of information extraction. Where you once needed a team of machine learning experts to curate datasets and train custom models, you now only need access to an LLM. The barrier to entry has dropped significantly, making what was reserved for domain experts just a couple of years ago accessible even to non-technical people.\nThe image depicts the transformation of unstructured text into structured information. This process, labeled as the information extraction pipeline, results in a graph representation of information. The nodes represent key entities, while the connecting lines denote the relationships between these entities. Knowledge graphs are useful for multi-hop question-answering, real-time analytics, or when you want to combine structured and unstructured data in a single database.\nWhile extracting structured information from text has been made more accessible due to LLMs, it is by no means a solved problem. In this blog post, we will use OpenAI functions in combination with LangChain to construct a knowledge graph from a sample Wikipedia page. Along the way, we will discuss best practices as well as some limitations of current LLMs.\nTL;DR: The code is available on GitHub.\n"
    },
    {
        "id_": "f8fa27af-cf6e-45b9-bba3-9ee40fc6327b",
        "embedding": null,
        "metadata": {
            "header": "Neo4j Environment setup",
            "source": "2023-10-20_Constructing-knowledge-graphs-from-text-using-OpenAI-functions-096a6d010c17.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "You need to set up a Neo4j instance to follow along with the examples in this blog post. The easiest way is to start a free instance on Neo4j Aura, which offers cloud instances of the Neo4j database. Alternatively, you can set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance.\nThe following code will instantiate a LangChain wrapper to connect to the Neo4j database.\n```\nfrom langchain.graphs import Neo4jGraph\n\nurl = \"neo4j+s://databases.neo4j.io\"\nusername = \"neo4j\"\npassword = \"\"\n\ngraph = Neo4jGraph(\n    url=url,\n    username=username,\n    password=password\n)\n```\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "You need to set up a Neo4j instance to follow along with the examples in this blog post. The easiest way is to start a free instance on Neo4j Aura, which offers cloud instances of the Neo4j database. Alternatively, you can set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance.\nThe following code will instantiate a LangChain wrapper to connect to the Neo4j database.\n```\nfrom langchain.graphs import Neo4jGraph\n\nurl = \"neo4j+s://databases.neo4j.io\"\nusername = \"neo4j\"\npassword = \"\"\n\ngraph = Neo4jGraph(\n    url=url,\n    username=username,\n    password=password\n)\n```\n"
    },
    {
        "id_": "4b37e805-5fe0-4ba7-bfca-a8d908c2e913",
        "embedding": null,
        "metadata": {
            "header": "Information extraction pipeline",
            "source": "2023-10-20_Constructing-knowledge-graphs-from-text-using-OpenAI-functions-096a6d010c17.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "A typical information extraction pipeline contains the following steps.\nIn the first step, we run the input text through a coreference resolution model. Coreference resolution is the task of finding all expressions that refer to a specific entity. Simply put, it links all the pronouns to the referred entity. In the named entity recognition part of the pipeline, we try to extract all the mentioned entities. The above example contains three entities: Tomaz, Blog, and Diagram. The next step is the entity disambiguation step, an essential but often overlooked part of an information extraction pipeline. Entity disambiguation is the process of accurately identifying and distinguishing between entities with similar names or references to ensure the correct entity is recognized in a given context. In the last step, the model tries to identify various relationships between entities. For example, it could locate the LIKES relationship between the Tomaz and Blog entities.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "A typical information extraction pipeline contains the following steps.\nIn the first step, we run the input text through a coreference resolution model. Coreference resolution is the task of finding all expressions that refer to a specific entity. Simply put, it links all the pronouns to the referred entity. In the named entity recognition part of the pipeline, we try to extract all the mentioned entities. The above example contains three entities: Tomaz, Blog, and Diagram. The next step is the entity disambiguation step, an essential but often overlooked part of an information extraction pipeline. Entity disambiguation is the process of accurately identifying and distinguishing between entities with similar names or references to ensure the correct entity is recognized in a given context. In the last step, the model tries to identify various relationships between entities. For example, it could locate the LIKES relationship between the Tomaz and Blog entities.\n"
    },
    {
        "id_": "c7f9c806-32d6-4796-b5ad-6a02542ca4b0",
        "embedding": null,
        "metadata": {
            "header": "Extracting structured information with OpenAI functions",
            "source": "2023-10-20_Constructing-knowledge-graphs-from-text-using-OpenAI-functions-096a6d010c17.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "OpenAI functions are a great fit to extract structured information from natural language. The idea behind OpenAI functions is to have an LLM output a predefined JSON object with populated values. The predefined JSON object can be used as input to other functions in so-called RAG applications, or it can be used to extract predefined structured information from text.\nIn LangChain, you can pass a Pydantic class as a description of the desired JSON object of the OpenAI functions feature. Therefore, we will start by defining the desired structure of information we want to extract from text. LangChain already has definitions of nodes and relationships as Pydantic classes that we can reuse.\n```\nclass Node(Serializable):\n    \"\"\"Represents a node in a graph with associated properties.\n\n    Attributes:\n        id (Union[str, int]): A unique identifier for the node.\n        type (str): The type or label of the node, default is \"Node\".\n        properties (dict): Additional properties and metadata associated with the node.\n    \"\"\"\n\n    id: Union[str, int]\n    type: str = \"Node\"\n    properties: dict = Field(default_factory=dict)\n\n\nclass Relationship(Serializable):\n    \"\"\"Represents a directed relationship between two nodes in a graph.\n\n    Attributes:\n        source (Node): The source node of the relationship.\n        target (Node): The target node of the relationship.\n        type (str): The type of the relationship.\n        properties (dict): Additional properties associated with the relationship.\n    \"\"\"\n\n    source: Node\n    target: Node\n    type: str\n    properties: dict = Field(default_factory=dict)\n```\nUnfortunately, it turns out that OpenAI functions don\u2019t currently support a dictionary object as a value. Therefore, we have to overwrite the properties definition to adhere to the limitations of the functions\u2019 endpoint.\n```\nfrom langchain.graphs.graph_document import (\n    Node as BaseNode,\n    Relationship as BaseRelationship\n)\nfrom typing import List, Dict, Any, Optional\nfrom langchain.pydantic_v1 import Field, BaseModel\n\nclass Property(BaseModel):\n  \"\"\"A single property consisting of key and value\"\"\"\n  key: str = Field(..., description=\"key\")\n  value: str = Field(..., description=\"value\")\n\nclass Node(BaseNode):\n    properties: Optional[List[Property]] = Field(\n        None, description=\"List of node properties\")\n\nclass Relationship(BaseRelationship):\n    properties: Optional[List[Property]] = Field(\n        None, description=\"List of relationship properties\"\n    )\n```\nHere, we have overwritten the properties value to be a list of Property classes instead of a dictionary to overcome the limitations of the API. Because you can only pass a single object to the API, we combine the nodes and relationships into a single class called KnowledgeGraph.\n```\nclass KnowledgeGraph(BaseModel):\n    \"\"\"Generate a knowledge graph with entities and relationships.\"\"\"\n    nodes: List[Node] = Field(\n        ..., description=\"List of nodes in the knowledge graph\")\n    rels: List[Relationship] = Field(\n        ..., description=\"List of relationships in the knowledge graph\"\n    )\n```\nThe only thing left is to do a bit of prompt engineering and we are good to go. How I usually go about prompt engineering is the following:\nI specifically chose the markdown format as I have seen somewhere that OpenAI models respond better to markdown syntax in prompts, and it seems to be at least plausible from my experience.\nIterating over prompt engineering, I came up with the following system prompt for an information extraction pipeline.\n```\nllm = ChatOpenAI(model=\"gpt-3.5-turbo-16k\", temperature=0)\n\ndef get_extraction_chain(\n    allowed_nodes: Optional[List[str]] = None,\n    allowed_rels: Optional[List[str]] = None\n    ):\n    prompt = ChatPromptTemplate.from_messages(\n    [(\n      \"system\",\n      f\"\"\"# Knowledge Graph Instructions for GPT-4\n## 1. Overview\nYou are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.\n- **Nodes** represent entities and concepts. They're akin to Wikipedia nodes.\n- The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.\n## 2. Labeling Nodes\n- **Consistency**: Ensure you use basic or elementary types for node labels.\n  - For example, when you identify an entity representing a person, always label it as **\"person\"**. Avoid using more specific terms like \"mathematician\" or \"scientist\".\n- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.\n{'- **Allowed Node Labels:**' + \", \".join(allowed_nodes) if allowed_nodes else \"\"}\n{'- **Allowed Relationship Types**:' + \", \".join(allowed_rels) if allowed_rels else \"\"}\n## 3. Handling Numerical Data and Dates\n- Numerical data, like age or other related information, should be incorporated as attributes or properties of the respective nodes.\n- **No Separate Nodes for Dates/Numbers**: Do not create separate nodes for dates or numerical values. Always attach them as attributes or properties of nodes.\n- **Property Format**: Properties must be in a key-value format.\n- **Quotation Marks**: Never use escaped single or double quotes within property values.\n- **Naming Convention**: Use camelCase for property keys, e.g., `birthDate`.\n## 4. Coreference Resolution\n- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.\nIf an entity, such as \"John Doe\", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., \"Joe\", \"he\"), always use the most complete identifier for that entity throughout the knowledge graph. In this example, use \"John Doe\" as the entity ID.\nRemember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial.\n## 5. Strict Compliance\nAdhere to the rules strictly. Non-compliance will result in termination.\"\"\"),\n        (\"human\", \"Use the given format to extract information from the following input: {input}\"),\n        (\"human\", \"Tip: Make sure to answer in the correct format\"),\n    ])\n    return create_structured_output_chain(KnowledgeGraph, llm, prompt, verbose=False)\n```\nYou can see that we are using the 16k version of the GPT-3.5 model. The main reason is that the OpenAI function output is a structured JSON object, and structured JSON syntax adds a lot of token overhead to the result. Essentially, you are paying for the convenience of structured output in increased token space.\nBesides the general instructions, I have also added the option to limit which node or relationship types should be extracted from text. You\u2019ll see through examples why this might come in handy.\nWe have the Neo4j connection and LLM prompt ready, which means we can define the information extraction pipeline as a single function.\n```\ndef extract_and_store_graph(\n    document: Document,\n    nodes:Optional[List[str]] = None,\n    rels:Optional[List[str]]=None) -> None:\n    # Extract graph data using OpenAI functions\n    extract_chain = get_extraction_chain(nodes, rels)\n    data = extract_chain.run(document.page_content)\n    # Construct a graph document\n    graph_document = GraphDocument(\n      nodes = [map_to_base_node(node) for node in data.nodes],\n      relationships = [map_to_base_relationship(rel) for rel in data.rels],\n      source = document\n    )\n    # Store information into a graph\n    graph.add_graph_documents([graph_document])\n```\ndef extract_and_store_graph(    document: Document,    nodes:Optional[List[str]] = None,    rels:Optional[List[str]]=None) -> None:    # Extract graph data using OpenAI functions    extract_chain = get_extraction_chain(nodes, 
rels)    data = extract_chain.run(document.page_content)    # Construct a graph document    graph_document = GraphDocument(      nodes = [map_to_base_node(node) for node in data.nodes],      relationships = [map_to_base_relationship(rel) for rel in data.rels],      source = document    )    # Store information into a graph    graph.add_graph_documents([graph_document])\ndef\nextract_and_store_graph\ndocument: Document,    nodes:Optional[List[str]] = None,    rels:Optional[List[str]]=None\nOptional\nList\nstr\nNone\nOptional\nList\nstr\nNone\nNone\n# Extract graph data using OpenAI functions\n# Construct a graph document\nfor\nin\nfor\nin\n# Store information into a graph\nThe function takes in a LangChain document as well as optional nodes and relationship parameters, which are used to limit the types of objects we want the LLM to identify and extract. A month or so ago, we added the add_graph_documents method the Neo4j graph object, which we can utilize here to seamlessly import the graph.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "OpenAI functions are a great fit to extract structured information from natural language. The idea behind OpenAI functions is to have an LLM output a predefined JSON object with populated values. The predefined JSON object can be used as input to other functions in so-called RAG applications, or it can be used to extract predefined structured information from text.\nOpenAI functions\nIn LangChain, you can pass a Pydantic class as description of the desired JSON object of the OpenAI functions feature. Therefore, we will start by defining the desired structure of information we want to extract from text. LangChain already has definitions of nodes and relationship as Pydantic classes that we can reuse.\npass a Pydantic class as description\ndefinitions of nodes and relationship as Pydantic classes that we can reuse\n```class Node(Serializable):    \"\"\"Represents a node in a graph with associated properties.    Attributes:        id (Union[str, int]): A unique identifier for the node.        type (str): The type or label of the node, default is \"Node\".        properties (dict): Additional properties and metadata associated with the node.    \"\"\"    id: Union[str, int]    type: str = \"Node\"    properties: dict = Field(default_factory=dict)class Relationship(Serializable):    \"\"\"Represents a directed relationship between two nodes in a graph.    Attributes:        source (Node): The source node of the relationship.        target (Node): The target node of the relationship.        type (str): The type of the relationship.        properties (dict): Additional properties associated with the relationship.    \"\"\"    source: Node    target: Node    type: str    properties: dict = Field(default_factory=dict)```\nclass Node(Serializable):    \"\"\"Represents a node in a graph with associated properties.    Attributes:        id (Union[str, int]): A unique identifier for the node.        
type (str): The type or label of the node, default is \"Node\".        properties (dict): Additional properties and metadata associated with the node.    \"\"\"    id: Union[str, int]    type: str = \"Node\"    properties: dict = Field(default_factory=dict)class Relationship(Serializable):    \"\"\"Represents a directed relationship between two nodes in a graph.    Attributes:        source (Node): The source node of the relationship.        target (Node): The target node of the relationship.        type (str): The type of the relationship.        properties (dict): Additional properties associated with the relationship.    \"\"\"    source: Node    target: Node    type: str    properties: dict = Field(default_factory=dict)\nclass\nNode\nSerializable\n\"\"\"Represents a node in a graph with associated properties.    Attributes:        id (Union[str, int]): A unique identifier for the node.        type (str): The type or label of the node, default is \"Node\".        properties (dict): Additional properties and metadata associated with the node.    \"\"\"\nid\nUnion\nstr\nint\ntype\nstr\n\"Node\"\ndict\ndict\nclass\nRelationship\nSerializable\n\"\"\"Represents a directed relationship between two nodes in a graph.    Attributes:        source (Node): The source node of the relationship.        target (Node): The target node of the relationship.        type (str): The type of the relationship.        properties (dict): Additional properties associated with the relationship.    \"\"\"\ntype\nstr\ndict\ndict\nUnfortunately, it turns out that OpenAI functions don\u2019t currently support a dictionary object as a value. 
Therefore, we have to overwrite the properties definition to adhere to the limitations of the functions\u2019 endpoint.\n```from langchain.graphs.graph_document import (    Node as BaseNode,    Relationship as BaseRelationship)from typing import List, Dict, Any, Optionalfrom langchain.pydantic_v1 import Field, BaseModelclass Property(BaseModel):  \"\"\"A single property consisting of key and value\"\"\"  key: str = Field(..., description=\"key\")  value: str = Field(..., description=\"value\")class Node(BaseNode):    properties: Optional[List[Property]] = Field(        None, description=\"List of node properties\")class Relationship(BaseRelationship):    properties: Optional[List[Property]] = Field(        None, description=\"List of relationship properties\"    )```\nfrom langchain.graphs.graph_document import (    Node as BaseNode,    Relationship as BaseRelationship)from typing import List, Dict, Any, Optionalfrom langchain.pydantic_v1 import Field, BaseModelclass Property(BaseModel):  \"\"\"A single property consisting of key and value\"\"\"  key: str = Field(..., description=\"key\")  value: str = Field(..., description=\"value\")class Node(BaseNode):    properties: Optional[List[Property]] = Field(        None, description=\"List of node properties\")class Relationship(BaseRelationship):    properties: Optional[List[Property]] = Field(        None, description=\"List of relationship properties\"    )\nfrom\nimport\nas\nas\nfrom\nimport\nList\nDict\nAny\nOptional\nfrom\nimport\nclass\nProperty\nBaseModel\n\"\"\"A single property consisting of key and value\"\"\"\nstr\n\"key\"\nstr\n\"value\"\nclass\nNode\nBaseNode\nOptional\nList\nNone\n\"List of node properties\"\nclass\nRelationship\nBaseRelationship\nOptional\nList\nNone\n\"List of relationship properties\"\nHere, we have overwritten the properties value to be a list of Property classes instead of a dictionary to overcome the limitations of the API. 
Because you can only pass a single object to the API, we can to combine the nodes and relationships in a single class called KnowledgeGraph.\n```class KnowledgeGraph(BaseModel):    \"\"\"Generate a knowledge graph with entities and relationships.\"\"\"    nodes: List[Node] = Field(        ..., description=\"List of nodes in the knowledge graph\")    rels: List[Relationship] = Field(        ..., description=\"List of relationships in the knowledge graph\"    )```\nclass KnowledgeGraph(BaseModel):    \"\"\"Generate a knowledge graph with entities and relationships.\"\"\"    nodes: List[Node] = Field(        ..., description=\"List of nodes in the knowledge graph\")    rels: List[Relationship] = Field(        ..., description=\"List of relationships in the knowledge graph\"    )\nclass\nKnowledgeGraph\nBaseModel\n\"\"\"Generate a knowledge graph with entities and relationships.\"\"\"\nList\n\"List of nodes in the knowledge graph\"\nList\n\"List of relationships in the knowledge graph\"\nThe only thing left is to do a bit of prompt engineering and we are good to go. How I usually go about prompt engineering is the following:\nI specifically chose the markdown format as I have seen somewhere that OpenAI models respond better to markdown syntax in prompts, and it seems to be at least plausible from my experience.\nIterating over prompt engineering, I came up with the following system prompt for an information extraction pipeline.\n```llm = ChatOpenAI(model=\"gpt-3.5-turbo-16k\", temperature=0)def get_extraction_chain(    allowed_nodes: Optional[List[str]] = None,    allowed_rels: Optional[List[str]] = None    ):    prompt = ChatPromptTemplate.from_messages(    [(      \"system\",      f\"\"\"# Knowledge Graph Instructions for GPT-4## 1. OverviewYou are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.- **Nodes** represent entities and concepts. 
They're akin to Wikipedia nodes.- The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.## 2. Labeling Nodes- **Consistency**: Ensure you use basic or elementary types for node labels.  - For example, when you identify an entity representing a person, always label it as **\"person\"**. Avoid using more specific terms like \"mathematician\" or \"scientist\".- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.{'- **Allowed Node Labels:**' + \", \".join(allowed_nodes) if allowed_nodes else \"\"}{'- **Allowed Relationship Types**:' + \", \".join(allowed_rels) if allowed_rels else \"\"}## 3. Handling Numerical Data and Dates- Numerical data, like age or other related information, should be incorporated as attributes or properties of the respective nodes.- **No Separate Nodes for Dates/Numbers**: Do not create separate nodes for dates or numerical values. Always attach them as attributes or properties of nodes.- **Property Format**: Properties must be in a key-value format.- **Quotation Marks**: Never use escaped single or double quotes within property values.- **Naming Convention**: Use camelCase for property keys, e.g., `birthDate`.## 4. Coreference Resolution- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.If an entity, such as \"John Doe\", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., \"Joe\", \"he\"), always use the most complete identifier for that entity throughout the knowledge graph. In this example, use \"John Doe\" as the entity ID.  Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial. ## 5. Strict ComplianceAdhere to the rules strictly. 
Non-compliance will result in termination.\"\"\"),        (\"human\", \"Use the given format to extract information from the following input: {input}\"),        (\"human\", \"Tip: Make sure to answer in the correct format\"),    ])    return create_structured_output_chain(KnowledgeGraph, llm, prompt, verbose=False)```\nllm = ChatOpenAI(model=\"gpt-3.5-turbo-16k\", temperature=0)def get_extraction_chain(    allowed_nodes: Optional[List[str]] = None,    allowed_rels: Optional[List[str]] = None    ):    prompt = ChatPromptTemplate.from_messages(    [(      \"system\",      f\"\"\"# Knowledge Graph Instructions for GPT-4## 1. OverviewYou are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.- **Nodes** represent entities and concepts. They're akin to Wikipedia nodes.- The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.## 2. Labeling Nodes- **Consistency**: Ensure you use basic or elementary types for node labels.  - For example, when you identify an entity representing a person, always label it as **\"person\"**. Avoid using more specific terms like \"mathematician\" or \"scientist\".- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.{'- **Allowed Node Labels:**' + \", \".join(allowed_nodes) if allowed_nodes else \"\"}{'- **Allowed Relationship Types**:' + \", \".join(allowed_rels) if allowed_rels else \"\"}## 3. Handling Numerical Data and Dates- Numerical data, like age or other related information, should be incorporated as attributes or properties of the respective nodes.- **No Separate Nodes for Dates/Numbers**: Do not create separate nodes for dates or numerical values. 
Always attach them as attributes or properties of nodes.- **Property Format**: Properties must be in a key-value format.- **Quotation Marks**: Never use escaped single or double quotes within property values.- **Naming Convention**: Use camelCase for property keys, e.g., `birthDate`.## 4. Coreference Resolution- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.If an entity, such as \"John Doe\", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., \"Joe\", \"he\"), always use the most complete identifier for that entity throughout the knowledge graph. In this example, use \"John Doe\" as the entity ID.  Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial. ## 5. Strict ComplianceAdhere to the rules strictly. Non-compliance will result in termination.\"\"\"),        (\"human\", \"Use the given format to extract information from the following input: {input}\"),        (\"human\", \"Tip: Make sure to answer in the correct format\"),    ])    return create_structured_output_chain(KnowledgeGraph, llm, prompt, verbose=False)\n\"gpt-3.5-turbo-16k\"\n0\ndef\nget_extraction_chain\nallowed_nodes: Optional[List[str]] = None,    allowed_rels: Optional[List[str]] = None\nOptional\nList\nstr\nNone\nOptional\nList\nstr\nNone\n\"system\"\nf\"\"\"# Knowledge Graph Instructions for GPT-4## 1. OverviewYou are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.- **Nodes** represent entities and concepts. They're akin to Wikipedia nodes.- The aim is to achieve simplicity and clarity in the knowledge graph, making it accessible for a vast audience.## 2. Labeling Nodes- **Consistency**: Ensure you use basic or elementary types for node labels.  - For example, when you identify an entity representing a person, always label it as **\"person\"**. 
Avoid using more specific terms like \"mathematician\" or \"scientist\".- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.{'- **Allowed Node Labels:**' + \", \".join(allowed_nodes) if allowed_nodes else \"\"}{'- **Allowed Relationship Types**:' + \", \".join(allowed_rels) if allowed_rels else \"\"}## 3. Handling Numerical Data and Dates- Numerical data, like age or other related information, should be incorporated as attributes or properties of the respective nodes.- **No Separate Nodes for Dates/Numbers**: Do not create separate nodes for dates or numerical values. Always attach them as attributes or properties of nodes.- **Property Format**: Properties must be in a key-value format.- **Quotation Marks**: Never use escaped single or double quotes within property values.- **Naming Convention**: Use camelCase for property keys, e.g., `birthDate`.## 4. Coreference Resolution- **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.If an entity, such as \"John Doe\", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., \"Joe\", \"he\"), always use the most complete identifier for that entity throughout the knowledge graph. In this example, use \"John Doe\" as the entity ID.  Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial. ## 5. Strict ComplianceAdhere to the rules strictly. 
Non-compliance will result in termination.\"\"\"\n{'- **Allowed Node Labels:**' + \", \".join(allowed_nodes) if allowed_nodes else \"\"}\n'- **Allowed Node Labels:**'\n\", \"\nif\nelse\n\"\"\n{'- **Allowed Relationship Types**:' + \", \".join(allowed_rels) if allowed_rels else \"\"}\n'- **Allowed Relationship Types**:'\n\", \"\nif\nelse\n\"\"\n\"human\"\n\"Use the given format to extract information from the following input: {input}\"\n\"human\"\n\"Tip: Make sure to answer in the correct format\"\nreturn\nFalse\nYou can see that we are using the 16k version of the GPT-3.5 model. The main reason is that the OpenAI function output is a structured JSON object, and structured JSON syntax adds a lot of token overhead to the result. Essentially, you are paying for the convenience of structured output in increased token space.\nBesides the general instructions, I have also added the option to limit which node or relationship types should be extracted from text. You\u2019ll see through examples why this might come in handy.\nWe have the Neo4j connection and LLM prompt ready, which means we can define the information extraction pipeline as a single function.\n```def extract_and_store_graph(    document: Document,    nodes:Optional[List[str]] = None,    rels:Optional[List[str]]=None) -> None:    # Extract graph data using OpenAI functions    extract_chain = get_extraction_chain(nodes, rels)    data = extract_chain.run(document.page_content)    # Construct a graph document    graph_document = GraphDocument(      nodes = [map_to_base_node(node) for node in data.nodes],      relationships = [map_to_base_relationship(rel) for rel in data.rels],      source = document    )    # Store information into a graph    graph.add_graph_documents([graph_document])```\ndef extract_and_store_graph(    document: Document,    nodes:Optional[List[str]] = None,    rels:Optional[List[str]]=None) -> None:    # Extract graph data using OpenAI functions    extract_chain = get_extraction_chain(nodes, 
rels)    data = extract_chain.run(document.page_content)    # Construct a graph document    graph_document = GraphDocument(      nodes = [map_to_base_node(node) for node in data.nodes],      relationships = [map_to_base_relationship(rel) for rel in data.rels],      source = document    )    # Store information into a graph    graph.add_graph_documents([graph_document])\ndef\nextract_and_store_graph\ndocument: Document,    nodes:Optional[List[str]] = None,    rels:Optional[List[str]]=None\nOptional\nList\nstr\nNone\nOptional\nList\nstr\nNone\nNone\n# Extract graph data using OpenAI functions\n# Construct a graph document\nfor\nin\nfor\nin\n# Store information into a graph\nThe function takes in a LangChain document as well as optional nodes and relationship parameters, which are used to limit the types of objects we want the LLM to identify and extract. A month or so ago, we added the add_graph_documents method the Neo4j graph object, which we can utilize here to seamlessly import the graph.\n"
    },
    {
        "id_": "7b42a91b-63a9-41c8-82f1-d68747a543b5",
        "embedding": null,
        "metadata": {
            "header": "Evaluation",
            "source": "2023-10-20_Constructing-knowledge-graphs-from-text-using-OpenAI-functions-096a6d010c17.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "We will extract information from the Walt Disney Wikipedia page and construct a knowledge graph to test the pipeline. Here, we will utilize the Wikipedia loader and text chunking modules provided by LangChain.\n```from langchain.document_loaders import WikipediaLoaderfrom langchain.text_splitter import TokenTextSplitter# Read the wikipedia articleraw_documents = WikipediaLoader(query=\"Walt Disney\").load()# Define chunking strategytext_splitter = TokenTextSplitter(chunk_size=2048, chunk_overlap=24)# Only take the first the raw_documentsdocuments = text_splitter.split_documents(raw_documents[:3])```\nfrom langchain.document_loaders import WikipediaLoaderfrom langchain.text_splitter import TokenTextSplitter# Read the wikipedia articleraw_documents = WikipediaLoader(query=\"Walt Disney\").load()# Define chunking strategytext_splitter = TokenTextSplitter(chunk_size=2048, chunk_overlap=24)# Only take the first the raw_documentsdocuments = text_splitter.split_documents(raw_documents[:3])\nfrom\nimport\nfrom\nimport\n# Read the wikipedia article\n\"Walt Disney\"\n# Define chunking strategy\n2048\n24\n# Only take the first the raw_documents\n3\nYou might have noticed that we use a relatively large chunk_size value. The reason is that we want to provide as much context as possible around a single sentence in order for the coreference resolution part to work as best as possible. 
Remember, the coreference step will only work if the entity and its reference appear in the same chunk; otherwise, the LLM doesn\u2019t have enough information to link the two.\nNow we can go ahead and run the documents through the information extraction pipeline.\n```from tqdm import tqdmfor i, d in tqdm(enumerate(documents), total=len(documents)):    extract_and_store_graph(d)```\nfrom tqdm import tqdmfor i, d in tqdm(enumerate(documents), total=len(documents)):    extract_and_store_graph(d)\nfrom\nimport\nfor\nin\nenumerate\nlen\nThe process takes around 5 minutes, which is relatively slow. Therefore, you would probably want parallel API calls in production to deal with this problem and achieve some sort of scalability.\nLet\u2019s first look at the types of nodes and relationships the LLM identified.\nSince the graph schema is not provided, the LLM decides on the fly what types of node labels and relationship types it will use. For example, we can observe that there are Company and Organization node labels. Those two things are probably semantically similar or identical, so we would want to have only a single node label representing the two. This problem is more obvious with relationship types. For example, we have CO-FOUNDER and COFOUNDEROF relationships as well as DEVELOPER and DEVELOPEDBY.\nFor any more serious project, you should define the node labels and relationship types the LLM should extract. 
Luckily, we have added the option to limit the types in the prompt by passing additional parameters.\n```# Specify which node labels should be extracted by the LLMallowed_nodes = [\"Person\", \"Company\", \"Location\", \"Event\", \"Movie\", \"Service\", \"Award\"]for i, d in tqdm(enumerate(documents), total=len(documents)):    extract_and_store_graph(d, allowed_nodes)```\n# Specify which node labels should be extracted by the LLMallowed_nodes = [\"Person\", \"Company\", \"Location\", \"Event\", \"Movie\", \"Service\", \"Award\"]for i, d in tqdm(enumerate(documents), total=len(documents)):    extract_and_store_graph(d, allowed_nodes)\n# Specify which node labels should be extracted by the LLM\n\"Person\"\n\"Company\"\n\"Location\"\n\"Event\"\n\"Movie\"\n\"Service\"\n\"Award\"\nfor\nin\nenumerate\nlen\nIn this example, I have only limited the node labels, but you can easily limit the relationship types by passing another parameter to the extract_and_store_graph function.\nThe visualization of the extracted subgraph has the following structure.\nThe graph turned out better than expected (after five iterations\u00a0:) ). I couldn\u2019t catch the whole graph nicely in the visualization, but you can explore it on your own in Neo4j Browser other tools.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "We will extract information from the Walt Disney Wikipedia page and construct a knowledge graph to test the pipeline. Here, we will utilize the Wikipedia loader and text chunking modules provided by LangChain.\n```from langchain.document_loaders import WikipediaLoaderfrom langchain.text_splitter import TokenTextSplitter# Read the wikipedia articleraw_documents = WikipediaLoader(query=\"Walt Disney\").load()# Define chunking strategytext_splitter = TokenTextSplitter(chunk_size=2048, chunk_overlap=24)# Only take the first the raw_documentsdocuments = text_splitter.split_documents(raw_documents[:3])```\nfrom langchain.document_loaders import WikipediaLoaderfrom langchain.text_splitter import TokenTextSplitter# Read the wikipedia articleraw_documents = WikipediaLoader(query=\"Walt Disney\").load()# Define chunking strategytext_splitter = TokenTextSplitter(chunk_size=2048, chunk_overlap=24)# Only take the first the raw_documentsdocuments = text_splitter.split_documents(raw_documents[:3])\nfrom\nimport\nfrom\nimport\n# Read the wikipedia article\n\"Walt Disney\"\n# Define chunking strategy\n2048\n24\n# Only take the first the raw_documents\n3\nYou might have noticed that we use a relatively large chunk_size value. The reason is that we want to provide as much context as possible around a single sentence in order for the coreference resolution part to work as best as possible. 
Remember, the coreference step will only work if the entity and its reference appear in the same chunk; otherwise, the LLM doesn\u2019t have enough information to link the two.\nNow we can go ahead and run the documents through the information extraction pipeline.\n```from tqdm import tqdmfor i, d in tqdm(enumerate(documents), total=len(documents)):    extract_and_store_graph(d)```\nfrom tqdm import tqdmfor i, d in tqdm(enumerate(documents), total=len(documents)):    extract_and_store_graph(d)\nfrom\nimport\nfor\nin\nenumerate\nlen\nThe process takes around 5 minutes, which is relatively slow. Therefore, you would probably want parallel API calls in production to deal with this problem and achieve some sort of scalability.\nLet\u2019s first look at the types of nodes and relationships the LLM identified.\nSince the graph schema is not provided, the LLM decides on the fly what types of node labels and relationship types it will use. For example, we can observe that there are Company and Organization node labels. Those two things are probably semantically similar or identical, so we would want to have only a single node label representing the two. This problem is more obvious with relationship types. For example, we have CO-FOUNDER and COFOUNDEROF relationships as well as DEVELOPER and DEVELOPEDBY.\nFor any more serious project, you should define the node labels and relationship types the LLM should extract. 
Luckily, we have added the option to limit the types in the prompt by passing additional parameters.\n```# Specify which node labels should be extracted by the LLMallowed_nodes = [\"Person\", \"Company\", \"Location\", \"Event\", \"Movie\", \"Service\", \"Award\"]for i, d in tqdm(enumerate(documents), total=len(documents)):    extract_and_store_graph(d, allowed_nodes)```\n# Specify which node labels should be extracted by the LLMallowed_nodes = [\"Person\", \"Company\", \"Location\", \"Event\", \"Movie\", \"Service\", \"Award\"]for i, d in tqdm(enumerate(documents), total=len(documents)):    extract_and_store_graph(d, allowed_nodes)\n# Specify which node labels should be extracted by the LLM\n\"Person\"\n\"Company\"\n\"Location\"\n\"Event\"\n\"Movie\"\n\"Service\"\n\"Award\"\nfor\nin\nenumerate\nlen\nIn this example, I have only limited the node labels, but you can easily limit the relationship types by passing another parameter to the extract_and_store_graph function.\nThe visualization of the extracted subgraph has the following structure.\nThe graph turned out better than expected (after five iterations\u00a0:) ). I couldn\u2019t catch the whole graph nicely in the visualization, but you can explore it on your own in Neo4j Browser other tools.\n"
    },
    {
        "id_": "64f0692b-9013-416b-bd34-e384a4c68068",
        "embedding": null,
        "metadata": {
            "header": "Entity disambiguation",
            "source": "2023-10-20_Constructing-knowledge-graphs-from-text-using-OpenAI-functions-096a6d010c17.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "One thing I should mention is that we partly skipped the entity disambiguation step. We used a large chunk size and added a specific instruction for coreference resolution and entity disambiguation in the system prompt. However, since each chunk is processed separately, there is no way to ensure consistency of entities between different text chunks. For example, you could end up with two nodes representing the same person.\nIn this example, Walt Disney and Walter Elias Disney refer to the same real-world person. The entity disambiguation problem is nothing new, and various solutions have been proposed to solve it:\nentity linking\nentity disambiguation\nsecond pass through an LLM\ngraph-based approaches\nWhich solution you should use depends on your domain and use case. However, keep in mind that the entity disambiguation step should not be overlooked, as it can have a significant impact on the accuracy and effectiveness of your RAG applications.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "One thing I should mention is that we partly skipped the entity disambiguation step. We used a large chunk size and added a specific instruction for coreference resolution and entity disambiguation in the system prompt. However, since each chunk is processed separately, there is no way to ensure consistency of entities between different text chunks. For example, you could end up with two nodes representing the same person.\nIn this example, Walt Disney and Walter Elias Disney refer to the same real-world person. The entity disambiguation problem is nothing new, and various solutions have been proposed to solve it:\nentity linking\nentity disambiguation\nsecond pass through an LLM\ngraph-based approaches\nWhich solution you should use depends on your domain and use case. However, keep in mind that the entity disambiguation step should not be overlooked, as it can have a significant impact on the accuracy and effectiveness of your RAG applications.\n"
    },
    {
        "id_": "1e036532-9417-4d93-b6bb-59ee04594065",
        "embedding": null,
        "metadata": {
            "header": "RAG Application",
            "source": "2023-10-20_Constructing-knowledge-graphs-from-text-using-OpenAI-functions-096a6d010c17.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "The last thing we will do is show how you can browse information in a knowledge graph by constructing Cypher statements. Cypher is a structured query language used to work with graph databases, similar to how SQL is used for relational databases. LangChain has a GraphCypherQAChain that reads the schema of the graph and constructs appropriate Cypher statements based on the user input.\n```\n# Query the knowledge graph in a RAG application\nfrom langchain.chains import GraphCypherQAChain\n\ngraph.refresh_schema()\n\ncypher_chain = GraphCypherQAChain.from_llm(\n    graph=graph,\n    cypher_llm=ChatOpenAI(temperature=0, model=\"gpt-4\"),\n    qa_llm=ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo\"),\n    validate_cypher=True,  # Validate relationship directions\n    verbose=True,\n)\ncypher_chain.run(\"When was Walter Elias Disney born?\")\n```\nWhich results in the following:\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "The last thing we will do is show how you can browse information in a knowledge graph by constructing Cypher statements. Cypher is a structured query language used to work with graph databases, similar to how SQL is used for relational databases. LangChain has a GraphCypherQAChain that reads the schema of the graph and constructs appropriate Cypher statements based on the user input.\n```\n# Query the knowledge graph in a RAG application\nfrom langchain.chains import GraphCypherQAChain\n\ngraph.refresh_schema()\n\ncypher_chain = GraphCypherQAChain.from_llm(\n    graph=graph,\n    cypher_llm=ChatOpenAI(temperature=0, model=\"gpt-4\"),\n    qa_llm=ChatOpenAI(temperature=0, model=\"gpt-3.5-turbo\"),\n    validate_cypher=True,  # Validate relationship directions\n    verbose=True,\n)\ncypher_chain.run(\"When was Walter Elias Disney born?\")\n```\nWhich results in the following:\n"
    },
    {
        "id_": "11515392-14ee-41ce-8b71-48c7813f074a",
        "embedding": null,
        "metadata": {
            "header": "Summary",
            "source": "2023-10-20_Constructing-knowledge-graphs-from-text-using-OpenAI-functions-096a6d010c17.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "Knowledge graphs are a great fit when you need a combination of structured and unstructured data to power your RAG applications. In this blog post, you have learned how to construct a knowledge graph in Neo4j from arbitrary text using OpenAI functions. OpenAI functions provide the convenience of neatly structured outputs, making them an ideal fit for extracting structured information. To have a great experience constructing graphs with LLMs, define the graph schema in as much detail as possible and add an entity disambiguation step after the extraction.\nIf you are eager to learn more about building AI applications with graphs, join us at NODES, the online, 24h conference organized by Neo4j on October 26th, 2023.\nThe code is available on GitHub.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "Knowledge graphs are a great fit when you need a combination of structured and unstructured data to power your RAG applications. In this blog post, you have learned how to construct a knowledge graph in Neo4j from arbitrary text using OpenAI functions. OpenAI functions provide the convenience of neatly structured outputs, making them an ideal fit for extracting structured information. To have a great experience constructing graphs with LLMs, define the graph schema in as much detail as possible and add an entity disambiguation step after the extraction.\nIf you are eager to learn more about building AI applications with graphs, join us at NODES, the online, 24h conference organized by Neo4j on October 26th, 2023.\nThe code is available on GitHub.\n"
    },
    {
        "id_": "52408c1c-7c65-43e4-82c9-74b9d7224c8a",
        "embedding": null,
        "metadata": {
            "header": "Develop RAG applications and don\u2019t share your private data with\u00a0anyone!",
            "source": "2023-10-30_How-to-implement-Weaviate-RAG-applications-with-Local-LLMs-and-Embedding-models-24a9128eaf84.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "In the spirit of Hacktoberfest, I decided to write a blog post using a vector database for a change. The main reason is that, in the spirit of open-source love, I have to give something back to Philip Vollet in exchange for all the significant exposure he has provided me over many years.\nPhilip works at Weaviate, which is a vector database, and vector similarity search is prevalent in retrieval-augmented applications nowadays. As you might imagine, we will be using Weaviate to power our RAG application. In addition, we\u2019ll be using a local LLM and a local embedding model, making it safe and convenient to work with private and confidential information that mustn\u2019t leave your premises.\nThey say that knowledge is power, and the Huberman Labs podcast is one of the finer sources of scientific discussion and science-based tools to enhance your life. In this blog post, we will use LangChain to fetch podcast captions from YouTube, embed and store them in Weaviate, and then use a local LLM to build a RAG application.\nThe code is available on GitHub.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "In the spirit of Hacktoberfest, I decided to write a blog post using a vector database for a change. The main reason is that, in the spirit of open-source love, I have to give something back to Philip Vollet in exchange for all the significant exposure he has provided me over many years.\nPhilip works at Weaviate, which is a vector database, and vector similarity search is prevalent in retrieval-augmented applications nowadays. As you might imagine, we will be using Weaviate to power our RAG application. In addition, we\u2019ll be using a local LLM and a local embedding model, making it safe and convenient to work with private and confidential information that mustn\u2019t leave your premises.\nThey say that knowledge is power, and the Huberman Labs podcast is one of the finer sources of scientific discussion and science-based tools to enhance your life. In this blog post, we will use LangChain to fetch podcast captions from YouTube, embed and store them in Weaviate, and then use a local LLM to build a RAG application.\nThe code is available on GitHub.\n"
    },
    {
        "id_": "ec9050d5-9280-46fc-99b3-7d92ae90daf6",
        "embedding": null,
        "metadata": {
            "header": "Weaviate cloud\u00a0services",
            "source": "2023-10-30_How-to-implement-Weaviate-RAG-applications-with-Local-LLMs-and-Embedding-models-24a9128eaf84.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "To follow the examples in this blog post, you first need to register with WCS. Once you are registered, you can create a new Weaviate Cluster by clicking the \u201cCreate cluster\u201d button. For this tutorial, we will be using the free trial plan, which will provide you with a sandbox for 14 days.\nFor the next steps, you will need the following two pieces of information to access your cluster: the cluster URL and the API key.\n```\nimport weaviate\n\nWEAVIATE_URL = \"WEAVIATE_CLUSTER_URL\"\nWEAVIATE_API_KEY = \"WEAVIATE_API_KEY\"\n\nclient = weaviate.Client(\n    url=WEAVIATE_URL,\n    auth_client_secret=weaviate.AuthApiKey(WEAVIATE_API_KEY),\n)\n```\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "To follow the examples in this blog post, you first need to register with WCS. Once you are registered, you can create a new Weaviate Cluster by clicking the \u201cCreate cluster\u201d button. For this tutorial, we will be using the free trial plan, which will provide you with a sandbox for 14 days.\nFor the next steps, you will need the following two pieces of information to access your cluster: the cluster URL and the API key.\n```\nimport weaviate\n\nWEAVIATE_URL = \"WEAVIATE_CLUSTER_URL\"\nWEAVIATE_API_KEY = \"WEAVIATE_API_KEY\"\n\nclient = weaviate.Client(\n    url=WEAVIATE_URL,\n    auth_client_secret=weaviate.AuthApiKey(WEAVIATE_API_KEY),\n)\n```\n"
    },
    {
        "id_": "aabffe34-fb75-4ac1-9c26-48c1ba111cab",
        "embedding": null,
        "metadata": {
            "header": "Local embedding and LLM\u00a0models",
            "source": "2023-10-30_How-to-implement-Weaviate-RAG-applications-with-Local-LLMs-and-Embedding-models-24a9128eaf84.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "I am most familiar with the LangChain LLM framework, so we will be using it to ingest documents as well as retrieve them. We will be using the sentence-transformers/all-mpnet-base-v2 embedding model and the zephyr-7b-alpha LLM. Both of these models are open source and available on HuggingFace. The implementation code for these two models in LangChain was kindly borrowed from the following repository:\nGitHub - aigeek0x0/zephyr-7b-alpha-langchain-chatbot: Chat with PDF using Zephyr 7B Alpha, Langchain, ChromaDB, and Gradio with Free Google Colab (github.com)\nIf you are using a Google Colab environment, make sure to use a GPU runtime.\nWe will begin by defining the embedding model, which can be easily retrieved from HuggingFace using the following code:\n```\nfrom langchain.embeddings import HuggingFaceEmbeddings\n\n# specify embedding model (using huggingface sentence transformer)\nembedding_model_name = \"sentence-transformers/all-mpnet-base-v2\"\nmodel_kwargs = {\"device\": \"cuda\"}\nembeddings = HuggingFaceEmbeddings(\n    model_name=embedding_model_name,\n    model_kwargs=model_kwargs,\n)\n```\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "I am most familiar with the LangChain LLM framework, so we will be using it to ingest documents as well as retrieve them. We will be using the sentence-transformers/all-mpnet-base-v2 embedding model and the zephyr-7b-alpha LLM. Both of these models are open source and available on HuggingFace. The implementation code for these two models in LangChain was kindly borrowed from the following repository:\nGitHub - aigeek0x0/zephyr-7b-alpha-langchain-chatbot: Chat with PDF using Zephyr 7B Alpha, Langchain, ChromaDB, and Gradio with Free Google Colab (github.com)\nIf you are using a Google Colab environment, make sure to use a GPU runtime.\nWe will begin by defining the embedding model, which can be easily retrieved from HuggingFace using the following code:\n```\nfrom langchain.embeddings import HuggingFaceEmbeddings\n\n# specify embedding model (using huggingface sentence transformer)\nembedding_model_name = \"sentence-transformers/all-mpnet-base-v2\"\nmodel_kwargs = {\"device\": \"cuda\"}\nembeddings = HuggingFaceEmbeddings(\n    model_name=embedding_model_name,\n    model_kwargs=model_kwargs,\n)\n```\n"
    },
    {
        "id_": "cc4c0b00-42b2-40c3-ae66-54bd405484c8",
        "embedding": null,
        "metadata": {
            "header": "Ingest HubermanLabs podcasts into\u00a0Weaviate",
            "source": "2023-10-30_How-to-implement-Weaviate-RAG-applications-with-Local-LLMs-and-Embedding-models-24a9128eaf84.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "I have learned that each channel on YouTube has an RSS feed that can be used to fetch links to the latest 10 videos. Since the RSS feed returns XML, we need a simple Python script to extract the links.\n```\nimport requests\nimport xml.etree.ElementTree as ET\n\nURL = \"https://www.youtube.com/feeds/videos.xml?channel_id=UC2D2CMWXMOVWx7giW1n3LIg\"\n\nresponse = requests.get(URL)\nxml_data = response.content\n\n# Parse the XML data\nroot = ET.fromstring(xml_data)\n\n# Define the namespace\nnamespaces = {\n    \"atom\": \"http://www.w3.org/2005/Atom\",\n    \"media\": \"http://search.yahoo.com/mrss/\",\n}\n\n# Extract YouTube links; [1:] drops the first match, which is the feed-level channel link\nyoutube_links = [\n    link.get(\"href\")\n    for link in root.findall(\".//atom:link[@rel='alternate']\", namespaces)\n][1:]\n```\nNow that we have the links to the videos at hand, we can use the YoutubeLoader from LangChain to retrieve the captions. Next, as with most RAG ingestion pipelines, we have to chunk the text into smaller pieces before ingestion. We can use the text splitter functionality that is built into LangChain.\n```\nfrom langchain.document_loaders import YoutubeLoader\nfrom langchain.text_splitter import TokenTextSplitter\nfrom langchain.vectorstores import Weaviate\n\nall_docs = []\nfor link in youtube_links:\n    # Retrieve captions\n    loader = YoutubeLoader.from_youtube_url(link)\n    docs = loader.load()\n    all_docs.extend(docs)\n\n# Split documents\ntext_splitter = TokenTextSplitter(chunk_size=128, chunk_overlap=0)\nsplit_docs = text_splitter.split_documents(all_docs)\n\n# Ingest the documents into Weaviate\nvector_db = Weaviate.from_documents(\n    split_docs, embeddings, client=client, by_text=False\n)\n```\nYou can test the vector retriever using the following code:\n```\nprint(\n    vector_db.similarity_search(\n        \"Which are tools to bolster your mental health?\", k=3\n    )\n)\n```\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "I have learned that each channel on YouTube has an RSS feed that can be used to fetch links to the latest 10 videos. Since the RSS feed returns XML, we need a simple Python script to extract the links.\n```\nimport requests\nimport xml.etree.ElementTree as ET\n\nURL = \"https://www.youtube.com/feeds/videos.xml?channel_id=UC2D2CMWXMOVWx7giW1n3LIg\"\n\nresponse = requests.get(URL)\nxml_data = response.content\n\n# Parse the XML data\nroot = ET.fromstring(xml_data)\n\n# Define the namespace\nnamespaces = {\n    \"atom\": \"http://www.w3.org/2005/Atom\",\n    \"media\": \"http://search.yahoo.com/mrss/\",\n}\n\n# Extract YouTube links; [1:] drops the first match, which is the feed-level channel link\nyoutube_links = [\n    link.get(\"href\")\n    for link in root.findall(\".//atom:link[@rel='alternate']\", namespaces)\n][1:]\n```\nNow that we have the links to the videos at hand, we can use the YoutubeLoader from LangChain to retrieve the captions. Next, as with most RAG ingestion pipelines, we have to chunk the text into smaller pieces before ingestion. We can use the text splitter functionality that is built into LangChain.\n```\nfrom langchain.document_loaders import YoutubeLoader\nfrom langchain.text_splitter import TokenTextSplitter\nfrom langchain.vectorstores import Weaviate\n\nall_docs = []\nfor link in youtube_links:\n    # Retrieve captions\n    loader = YoutubeLoader.from_youtube_url(link)\n    docs = loader.load()\n    all_docs.extend(docs)\n\n# Split documents\ntext_splitter = TokenTextSplitter(chunk_size=128, chunk_overlap=0)\nsplit_docs = text_splitter.split_documents(all_docs)\n\n# Ingest the documents into Weaviate\nvector_db = Weaviate.from_documents(\n    split_docs, embeddings, client=client, by_text=False\n)\n```\nYou can test the vector retriever using the following code:\n```\nprint(\n    vector_db.similarity_search(\n        \"Which are tools to bolster your mental health?\", k=3\n    )\n)\n```\n"
    },
    {
        "id_": "97b1315c-f1c9-4b69-b0d4-7e3e823015c7",
        "embedding": null,
        "metadata": {
            "header": "Setting up a local\u00a0LLM",
            "source": "2023-10-30_How-to-implement-Weaviate-RAG-applications-with-Local-LLMs-and-Embedding-models-24a9128eaf84.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "This part of the code was completely copied from the example provided by the AI Geek. It loads the zephyr-7b-alpha-sharded model and its tokenizer from HuggingFace and loads it as a LangChain LLM module.\ncopied from the example provided by the AI Geek\n```# specify model huggingface mode namemodel_name = \"anakin87/zephyr-7b-alpha-sharded\"# function for loading 4-bit quantized modeldef load_quantized_model(model_name: str):    \"\"\"    :param model_name: Name or path of the model to be loaded.    :return: Loaded quantized model.    \"\"\"    bnb_config = BitsAndBytesConfig(        load_in_4bit=True,        bnb_4bit_use_double_quant=True,        bnb_4bit_quant_type=\"nf4\",        bnb_4bit_compute_dtype=torch.bfloat16,    )    model = AutoModelForCausalLM.from_pretrained(        model_name,        load_in_4bit=True,        torch_dtype=torch.bfloat16,        quantization_config=bnb_config,    )    return model# function for initializing tokenizerdef initialize_tokenizer(model_name: str):    \"\"\"    Initialize the tokenizer with the specified model_name.    :param model_name: Name or path of the model for tokenizer initialization.    :return: Initialized tokenizer.    
\"\"\"    tokenizer = AutoTokenizer.from_pretrained(model_name, return_token_type_ids=False)    tokenizer.bos_token_id = 1  # Set beginning of sentence token id    return tokenizer# initialize tokenizertokenizer = initialize_tokenizer(model_name)# load modelmodel = load_quantized_model(model_name)# specify stop token idsstop_token_ids = [0]# build huggingface pipeline for using zephyr-7b-alphapipeline = pipeline(    \"text-generation\",    model=model,    tokenizer=tokenizer,    use_cache=True,    device_map=\"auto\",    max_length=2048,    do_sample=True,    top_k=5,    num_return_sequences=1,    eos_token_id=tokenizer.eos_token_id,    pad_token_id=tokenizer.eos_token_id,)# specify the llmllm = HuggingFacePipeline(pipeline=pipeline)```\n# specify model huggingface mode namemodel_name = \"anakin87/zephyr-7b-alpha-sharded\"# function for loading 4-bit quantized modeldef load_quantized_model(model_name: str):    \"\"\"    :param model_name: Name or path of the model to be loaded.    :return: Loaded quantized model.    \"\"\"    bnb_config = BitsAndBytesConfig(        load_in_4bit=True,        bnb_4bit_use_double_quant=True,        bnb_4bit_quant_type=\"nf4\",        bnb_4bit_compute_dtype=torch.bfloat16,    )    model = AutoModelForCausalLM.from_pretrained(        model_name,        load_in_4bit=True,        torch_dtype=torch.bfloat16,        quantization_config=bnb_config,    )    return model# function for initializing tokenizerdef initialize_tokenizer(model_name: str):    \"\"\"    Initialize the tokenizer with the specified model_name.    :param model_name: Name or path of the model for tokenizer initialization.    :return: Initialized tokenizer.    
\"\"\"    tokenizer = AutoTokenizer.from_pretrained(model_name, return_token_type_ids=False)    tokenizer.bos_token_id = 1  # Set beginning of sentence token id    return tokenizer# initialize tokenizertokenizer = initialize_tokenizer(model_name)# load modelmodel = load_quantized_model(model_name)# specify stop token idsstop_token_ids = [0]# build huggingface pipeline for using zephyr-7b-alphapipeline = pipeline(    \"text-generation\",    model=model,    tokenizer=tokenizer,    use_cache=True,    device_map=\"auto\",    max_length=2048,    do_sample=True,    top_k=5,    num_return_sequences=1,    eos_token_id=tokenizer.eos_token_id,    pad_token_id=tokenizer.eos_token_id,)# specify the llmllm = HuggingFacePipeline(pipeline=pipeline)\n# specify model huggingface mode name\n\"anakin87/zephyr-7b-alpha-sharded\"\n# function for loading 4-bit quantized model\ndef\nload_quantized_model\nmodel_name: str\nstr\n\"\"\"    :param model_name: Name or path of the model to be loaded.    :return: Loaded quantized model.    \"\"\"\nTrue\nTrue\n\"nf4\"\nTrue\nreturn\n# function for initializing tokenizer\ndef\ninitialize_tokenizer\nmodel_name: str\nstr\n\"\"\"    Initialize the tokenizer with the specified model_name.    :param model_name: Name or path of the model for tokenizer initialization.    :return: Initialized tokenizer.    \"\"\"\nFalse\n1\n# Set beginning of sentence token id\nreturn\n# initialize tokenizer\n# load model\n# specify stop token ids\n0\n# build huggingface pipeline for using zephyr-7b-alpha\n\"text-generation\"\nTrue\n\"auto\"\n2048\nTrue\n5\n1\n# specify the llm\nI haven\u2019t played around yet, but you could probably reuse this code to load other LLMs from HuggingFace.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "This part of the code was completely copied from the example provided by the AI Geek. It loads the zephyr-7b-alpha-sharded model and its tokenizer from HuggingFace and wraps them as a LangChain LLM module.\n```\n# specify the Hugging Face model name\nmodel_name = \"anakin87/zephyr-7b-alpha-sharded\"\n\n# function for loading 4-bit quantized model\ndef load_quantized_model(model_name: str):\n    \"\"\"\n    :param model_name: Name or path of the model to be loaded.\n    :return: Loaded quantized model.\n    \"\"\"\n    bnb_config = BitsAndBytesConfig(\n        load_in_4bit=True,\n        bnb_4bit_use_double_quant=True,\n        bnb_4bit_quant_type=\"nf4\",\n        bnb_4bit_compute_dtype=torch.bfloat16,\n    )\n    model = AutoModelForCausalLM.from_pretrained(\n        model_name,\n        load_in_4bit=True,\n        torch_dtype=torch.bfloat16,\n        quantization_config=bnb_config,\n    )\n    return model\n\n# function for initializing tokenizer\ndef initialize_tokenizer(model_name: str):\n    \"\"\"\n    Initialize the tokenizer with the specified model_name.\n    :param model_name: Name or path of the model for tokenizer initialization.\n    :return: Initialized tokenizer.\n    \"\"\"\n    tokenizer = AutoTokenizer.from_pretrained(model_name, return_token_type_ids=False)\n    tokenizer.bos_token_id = 1  # Set beginning of sentence token id\n    return tokenizer\n\n# initialize tokenizer\ntokenizer = initialize_tokenizer(model_name)\n# load model\nmodel = load_quantized_model(model_name)\n# specify stop token ids\nstop_token_ids = [0]\n\n# build huggingface pipeline for using zephyr-7b-alpha\npipeline = pipeline(\n    \"text-generation\",\n    model=model,\n    tokenizer=tokenizer,\n    use_cache=True,\n    device_map=\"auto\",\n    max_length=2048,\n    do_sample=True,\n    top_k=5,\n    num_return_sequences=1,\n    eos_token_id=tokenizer.eos_token_id,\n    pad_token_id=tokenizer.eos_token_id,\n)\n\n# specify the llm\nllm = HuggingFacePipeline(pipeline=pipeline)\n```\nI haven\u2019t played around yet, but you could probably reuse this code to load other LLMs from HuggingFace.\n"
    },
    {
        "id_": "f809a52c-1331-491f-9abc-317103cd675d",
        "embedding": null,
        "metadata": {
            "header": "Building a conversation chain",
            "source": "2023-10-30_How-to-implement-Weaviate-RAG-applications-with-Local-LLMs-and-Embedding-models-24a9128eaf84.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "Now that we have our vector retrieval and the LLM ready, we can implement a retrieval-augmented chatbot in only a couple of lines of code.\n```\nqa_chain = RetrievalQA.from_chain_type(\n    llm=llm, chain_type=\"stuff\", retriever=vector_db.as_retriever()\n)\n```\nLet\u2019s now test how well it works:\n```\nresponse = qa_chain.run(\n    \"How does one increase their mental health?\"\n)\nprint(response)\n```\nLet\u2019s try another one:\n```\nresponse = qa_chain.run(\"How to increase your willpower?\")\nprint(response)\n```\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "Now that we have our vector retrieval and the LLM ready, we can implement a retrieval-augmented chatbot in only a couple of lines of code.\n```\nqa_chain = RetrievalQA.from_chain_type(\n    llm=llm, chain_type=\"stuff\", retriever=vector_db.as_retriever()\n)\n```\nLet\u2019s now test how well it works:\n```\nresponse = qa_chain.run(\n    \"How does one increase their mental health?\"\n)\nprint(response)\n```\nLet\u2019s try another one:\n```\nresponse = qa_chain.run(\"How to increase your willpower?\")\nprint(response)\n```\n"
    },
    {
        "id_": "e6780b6a-731b-4f21-9610-7193ca57ec9c",
        "embedding": null,
        "metadata": {
            "header": "Summary",
            "source": "2023-10-30_How-to-implement-Weaviate-RAG-applications-with-Local-LLMs-and-Embedding-models-24a9128eaf84.html"
        },
        "excluded_embed_metadata_keys": [],
        "excluded_llm_metadata_keys": [],
        "relationships": {},
        "metadata_template": "{key}: {value}",
        "metadata_separator": "\n",
        "text_resource": {
            "embeddings": null,
            "text": "Only a couple of months ago, most of us didn\u2019t realize that we would be able to run LLMs on our laptops or free-tier Google Colab so soon. Many RAG applications deal with private and confidential data that can\u2019t be shared with third-party LLM providers. In those cases, using local embedding models and LLMs as described in this blog post is the ideal solution.\nAs always, the code is available on GitHub.\n",
            "path": null,
            "url": null,
            "mimetype": null
        },
        "image_resource": null,
        "audio_resource": null,
        "video_resource": null,
        "text_template": "{metadata_str}\n\n{content}",
        "class_name": "Document",
        "text": "Only a couple of months ago, most of us didn\u2019t realize that we would be able to run LLMs on our laptops or free-tier Google Colab so soon. Many RAG applications deal with private and confidential data that can\u2019t be shared with third-party LLM providers. In those cases, using local embedding models and LLMs as described in this blog post is the ideal solution.\nAs always, the code is available on GitHub.\n"
    }
]