"""
One-time script to generate sample PDF documents for the RAG chatbot demo.
Run: python generate_samples.py
Requires: reportlab (already in requirements.txt)
"""
import os

from reportlab.lib.pagesizes import LETTER
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.enums import TA_LEFT, TA_CENTER

# Write the PDFs into a "sample_docs" folder next to this script.
OUTPUT_DIR = os.path.join(os.path.dirname(__file__), "sample_docs")
os.makedirs(OUTPUT_DIR, exist_ok=True)


def _doc(filename: str, title: str, sections: list[tuple[str, str]]) -> None:
    """Build a PDF with a title and a list of (heading, body) sections."""
    path = os.path.join(OUTPUT_DIR, filename)
    doc = SimpleDocTemplate(
        path,
        pagesize=LETTER,
        rightMargin=inch,
        leftMargin=inch,
        topMargin=inch,
        bottomMargin=inch,
    )
    styles = getSampleStyleSheet()
    title_style = ParagraphStyle(
        "Title",
        parent=styles["Title"],
        fontSize=20,
        spaceAfter=20,
        alignment=TA_CENTER,
    )
    heading_style = ParagraphStyle(
        "Heading",
        parent=styles["Heading2"],
        fontSize=13,
        spaceAfter=6,
        spaceBefore=14,
        textColor="#1a1a2e",
    )
    body_style = ParagraphStyle(
        "Body",
        parent=styles["Normal"],
        fontSize=11,
        leading=16,
        spaceAfter=8,
        alignment=TA_LEFT,
    )
    # The story is the ordered list of flowables that make up the document.
    story = [
        Paragraph(title, title_style),
        Spacer(1, 0.2 * inch),
    ]
    for heading, body in sections:
        story.append(Paragraph(heading, heading_style))
        story.append(Paragraph(body, body_style))
        story.append(Spacer(1, 0.1 * inch))
    doc.build(story)
    print(f"Created: {path}")
# ── ai_overview.pdf ──────────────────────────────────────────────────────────
_doc(
    "ai_overview.pdf",
    "Artificial Intelligence: An Overview",
    [
        (
            "What is Artificial Intelligence?",
            "Artificial Intelligence (AI) is the simulation of human intelligence processes by "
            "computer systems. These processes include learning (acquiring information and rules "
            "for using it), reasoning (using the rules to reach approximate or definite "
            "conclusions), and self-correction. AI enables machines to perform tasks that "
            "traditionally required human cognitive abilities such as visual perception, speech "
            "recognition, decision-making, and language translation.",
        ),
        (
            "History of AI: 1950s to Present",
            "The concept of AI was formally introduced by Alan Turing in 1950 with his "
            "landmark paper 'Computing Machinery and Intelligence,' where he proposed the "
            "Turing Test. The term 'Artificial Intelligence' was coined by John McCarthy for "
            "the 1956 Dartmouth Conference. The field experienced cycles of optimism and "
            "'AI winters' (periods of reduced funding and interest) through the 1970s and "
            "1980s, a decade that also saw expert systems reach commercial use. In 1997, "
            "IBM's Deep Blue defeated world chess champion Garry Kasparov. The 2010s saw a "
            "renaissance driven by deep learning, massive datasets, and GPU computing power, "
            "culminating in breakthroughs like AlphaGo (2016) and large language models such "
            "as GPT (2018–present).",
        ),
        (
            "Types of AI: Narrow, General, and Super",
            "Narrow AI (Weak AI) is designed to perform a specific task — such as facial "
            "recognition, spam filtering, or playing chess — and cannot generalise beyond its "
            "training domain. This is the only type of AI that exists today. "
            "Artificial General Intelligence (AGI) refers to a hypothetical system that can "
            "perform any intellectual task a human can, with the ability to transfer knowledge "
            "across domains. AGI remains an open research challenge. "
            "Artificial Superintelligence (ASI) describes a theoretical AI that surpasses "
            "human cognitive performance in every domain. ASI is purely speculative and raises "
            "significant ethical and existential questions among researchers.",
        ),
        (
            "Real-World Applications of AI",
            "AI applications span virtually every industry. In healthcare, AI assists in "
            "early disease detection, drug discovery, and personalised treatment plans. In "
            "finance, AI powers algorithmic trading, fraud detection, and credit scoring. "
            "Autonomous vehicles use AI for perception and decision-making. Natural Language "
            "Processing (NLP) enables virtual assistants like Siri, Alexa, and ChatGPT. "
            "Recommendation engines at Netflix, Spotify, and Amazon leverage AI to personalise "
            "content. In manufacturing, AI-powered robots perform precision assembly and "
            "quality control. By some industry estimates, the global AI market was valued at "
            "around $200 billion in 2023 and is projected to exceed $1 trillion by 2030.",
        ),
    ],
)
# ── ml_concepts.pdf ──────────────────────────────────────────────────────────
_doc(
    "ml_concepts.pdf",
    "Machine Learning: Core Concepts",
    [
        (
            "What is Machine Learning?",
            "Machine Learning (ML) is a subset of AI that enables systems to learn and "
            "improve from experience without being explicitly programmed. Instead of writing "
            "rules by hand, a practitioner feeds data into an algorithm, which finds patterns "
            "and builds a model that can make predictions or decisions on new, unseen data. "
            "The key insight is that the system improves automatically as it is exposed to "
            "more data over time.",
        ),
        (
            "Supervised, Unsupervised, and Reinforcement Learning",
            "Supervised Learning trains a model on labelled examples — pairs of inputs and "
            "correct outputs — so it can predict the label for new inputs. Classic tasks "
            "include email spam detection (classification) and house price prediction "
            "(regression). "
            "Unsupervised Learning finds hidden structure in unlabelled data. Clustering "
            "algorithms group similar data points, while dimensionality reduction techniques "
            "like PCA compress data while retaining important features. "
            "Reinforcement Learning trains an agent to make sequential decisions by rewarding "
            "desirable actions and penalising undesirable ones. It has achieved superhuman "
            "performance in games like Go and Atari, and is used in robotics and "
            "recommendation systems.",
        ),
        (
            "Common Algorithms: Linear Regression, Decision Trees, Neural Networks",
            "Linear Regression is one of the simplest ML models: it fits a straight line "
            "through data points to predict a continuous output. Despite its simplicity, it "
            "remains effective for many real-world problems and is highly interpretable. "
            "Decision Trees partition the feature space into regions by asking a yes/no "
            "question at each node. They are easy to visualise and interpret. "
            "Ensemble methods such as Random Forests and Gradient Boosting combine hundreds "
            "of trees for higher accuracy. "
            "Neural Networks are composed of layers of interconnected nodes (neurons) that "
            "transform input data into progressively more abstract representations. Deep "
            "Neural Networks with many layers excel at image recognition, speech synthesis, "
            "and natural language understanding.",
        ),
        (
            "How Models Are Trained and Evaluated",
            "Training involves feeding labelled data to an algorithm and using an "
            "optimisation procedure — typically gradient descent — to minimise a loss function "
            "that measures prediction error. The dataset is split into training, validation, "
            "and test sets to detect overfitting. "
            "Key evaluation metrics include accuracy, precision, recall, F1-score "
            "(for classification), and mean squared error or R² (for regression). "
            "Cross-validation techniques such as k-fold CV provide robust performance "
            "estimates. Hyperparameter tuning — adjusting learning rate, depth, regularisation "
            "strength, etc. — is performed on the validation set. A model generalises well "
            "when it performs consistently on both validation and held-out test data.",
        ),
    ],
)
print("\nSample PDFs generated successfully in ./sample_docs/")