A database is an organized collection of inter-related data that models some aspect of the real world. Database systems are crucial for managing data in various applications including banking, healthcare, airlines, and e-commerce.
A Database Management System (DBMS) is the software that manages a database. A general-purpose DBMS is designed to allow the definition, creation, querying, updating, and administration of databases.
Simple text-file-based databases face significant limitations. Consider a social media analytics application using separate text files (Users.txt, Posts.txt, Interactions.txt):
When the same information (e.g., "Timothée Chalamet, Paris") is duplicated across multiple entries, it leads to:
- Inconsistencies when data is updated in one place but not others
- Wasted storage space
- Maintenance difficulties
File-based systems require scanning entire files for data retrieval or updates, resulting in:
- O(n) complexity for most operations
- No optimization for common queries
- Poor performance as data grows
Without structured access methods, even simple queries require:
- Full file scans
- Manual parsing of text data
- No indexing capabilities
Multiple users trying to modify the same data simultaneously can lead to:
- Data races and corruption
- Lost updates
- Inconsistent state
Flat files lack mechanisms to ensure data persistence during system crashes:
- No transaction logs
- No recovery mechanisms
- Risk of partial writes
Flat files have inefficient memory management:
- No structured way to manage data between DRAM (fast, volatile) and disk (slow, persistent)
- Entire files may need to be loaded into memory
- No caching strategies
Poor usability requiring custom code for even simple queries:
- No declarative query language
- Every operation requires imperative programming
- High development overhead
Introduced by Ted Codd (1970), relational databases simplify data management by organizing data as tables. Key benefits include:
By using primary keys (e.g., UserID) and foreign keys, duplication is minimized:
- Single source of truth for each entity
- References instead of copies
- Normalized data structures
Benefits from structured data and optimized access paths:
- Index Database: B+ trees, hash tables for fast lookups
- Query optimization
- Efficient join algorithms
Achieved via Concurrency Control mechanisms:
- Locking protocols
- Multi-version concurrency control (MVCC)
- Isolation levels
Guaranteed by Atomic Transactions with ACID properties:
- All or Nothing: Operations either complete fully or not at all
- Reversion: Ability to rollback on failure
- Write-ahead logging (WAL)
Differentiates between storage tiers:
- Cached Pages in DRAM: Faster access but not durable
- Disk Storage: Slower access but durable
- Buffer pool management
- Transaction logs for durability
Utilizes declarative languages like SQL:
- Specify what you want, not how to get it
- Declarative vs. imperative approach
- High-level abstraction
- Reduced development time
A data model is a collection of concepts for describing the data in a database.
Data is represented using tables in the relational model:
- Tables: Consist of columns (attributes) and rows (entries)
- Degree: The number of columns in a table
- Cardinality: The number of rows in a table
- Schema: Defines the structure (table name, column names, data types)
- Instance: The current state of data in the table
Example Table Structure:
| Column Name | Data Type | Constraints |
|---|---|---|
| UserID | INTEGER | PRIMARY KEY |
| Name | VARCHAR(100) | NOT NULL |
| VARCHAR(255) | UNIQUE | |
| Address | VARCHAR(500) |
Constraints are rules that restrict the data values permitted in a system:
- Primary Key Constraint: Uniquely identifies each row
- Foreign Key Constraint: Maintains referential integrity between tables
- Unique Constraint: Ensures no duplicate values in a column
- Not Null Constraint: Requires a value in a column
- Check Constraint: Validates data against a condition
Operations are used to retrieve and change data. Based on relational algebra and mathematical set theory:
| Operator | Symbol | Description | SQL Equivalent |
|---|---|---|---|
| Projection | π | Select specific columns | SELECT columns |
| Selection | σ | Filter rows by condition | WHERE clause |
| Cartesian Product | × | Combine rows from tables | CROSS JOIN |
| Union | ∪ | Combine results | UNION |
| Difference | − | Rows in first but not second | EXCEPT |
| Intersection | ∩ | Common rows | INTERSECT |
| Join | ⋈ | Combine related rows | JOIN |
- SELECT (Projection): Select specific columns from a table
- WHERE (Selection): Filter rows based on conditions
- GROUP BY (Grouping): Group rows with same values
- Aggregate Functions: SUM, COUNT, AVG, MIN, MAX
- JOIN: Link rows from different tables
Example:
SELECT u.Name, COUNT(p.PostID) as PostCount
FROM Users u
JOIN Posts p ON u.UserID = p.UserID
WHERE u.Address = 'Paris'
GROUP BY u.Name
HAVING COUNT(p.PostID) > 5;Integrity refers to how accurately the database reflects reality:
- Data matches real-world state
- Constraints enforced
- Valid relationships maintained
Poor Integrity Example: Database shows a user's age as 200 years old.
Consistency relates to the absence of internal conflicts within a database:
- No contradictory values
- Referential integrity maintained
- Constraints satisfied
Inconsistency Example: A Post record references a UserID that doesn't exist in the Users table.
To handle changes in entity information over time, databases use surrogate keys:
Problem: Natural identifiers (like email or name) can change.
Solution: System-generated unique identifiers (surrogate keys) that:
- Never change
- Have no business meaning
- Provide stable references
- Simplify relationships
Example:
Without Surrogate:
Users(Email, Name, Address)
Email is primary key but can change
With Surrogate:
Users(UserID, Email, Name, Address)
UserID is primary key and never changes
The ANSI/SPARC Three-Level Architecture separates data representation into three schema levels, enabling efficient data access and independence.
The conceptual schema describes general and time-invariant structures of reality.
Characteristics:
- Does not involve data representation details
- No physical organization concerns
- Focus on data meaning and relationships
- Platform-independent
Contents:
- Entity definitions
- Relationships between entities
- Integrity constraints
- Business rules
Example:
CREATE TABLE Users (
UserID INTEGER PRIMARY KEY,
Name VARCHAR(100) NOT NULL,
Email VARCHAR(255) UNIQUE,
Address VARCHAR(500)
);The external schema represents a subset of information derived from the conceptual schema, designed for specific user group needs.
Characteristics:
- User-specific views
- Simplified or restricted data access
- Security through data hiding
- Customized presentation
Implementation: Views serve as windows into the database rather than physically existing as separate tables.
Example:
CREATE VIEW ActiveUsers AS
SELECT UserID, Name, Email
FROM Users
WHERE LastLoginDate > CURRENT_DATE - INTERVAL '30 days';The internal schema describes the physical representation of data specified in the conceptual schema.
Primary Technique: Indexing
- Optimizes query performance
- Improves update efficiency
- Not directly accessed by applications
- Can be modified without affecting applications
Storage Considerations:
- File organization (heap, sorted, hashed)
- Access paths (B+ trees, hash indexes)
- Compression techniques
- Partitioning strategies
The ability to modify the internal schema without affecting external schema-based applications:
- Change storage structures
- Add or remove indexes
- Modify file organization
- Applications remain unaffected
The ability to change the conceptual schema without changing applications that run on the external schemata:
- Add new tables or columns
- Modify relationships
- Change constraints
- Views can insulate applications from changes
The ANSI/SPARC DBMS Framework (proposed in 1975) defines the components of a three-level architecture.
The schema compiler handles schema definitions from different administrators:
| Administrator | Responsibility | Schema Type |
|---|---|---|
| Enterprise Administrator | Defines overall data structure | Conceptual Schema |
| Application System Administrator | Defines user views | External Schema |
| Database Administrator | Defines storage details | Internal Schema |
Process:
- Schema definitions are submitted
- Checked for syntax correctness
- Validated against existing schemas
- Stored in the metadatabase (data dictionary)
The query transformer translates user queries through schema levels:
Query Flow:
- User submits query at external schema level
- External → Conceptual: Resolve view definitions
- Conceptual → Internal: Determine access paths and physical operations
- Internal → Storage: Generate operating system calls
Optimization Steps:
- Query parsing and validation
- Logical optimization (relational algebra)
- Physical optimization (access method selection)
- Cost-based optimization
Translates internal schema operations into operating system calls to retrieve data from storage:
Operations:
- File system interactions
- Buffer management
- Disk I/O scheduling
- Cache management
The process is reversed when preparing a response:
- Data retrieved from storage
- Storage → Internal: Deserialize data structures
- Internal → Conceptual: Apply data transformations
- Conceptual → External: Filter and format for specific view
- Return to user in requested format
In production systems, this process is highly optimized for efficiency while maintaining the same functional behavior.
Course Materials:
- CS 6400: Database Systems Concepts and Design - Georgia Tech OMSCS
- CS 6422: Database System Implementation - Georgia Tech OMSCS

