Skip to content

vitorbetmann/cs50_sql_final_project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Design Document

By Vítor Betmann Calaça

Video overview: https://youtu.be/93zT3HXG7qA

Scope

Purpose:

  • The INTERNSHIP DATABASE is inspired by an internship course from the Cinema & Media Arts department that I took while at university. The course aimed at helping students find internships in Canada (which may be in person, remote, or hybrid) and grades them based on their performance during the internship.
  • The database manages and stores data on which students obtained internships, at which companies, who supervised them, how the students performed, and each student’s feedback regarding the internship experience.

The scope of this database includes:

  • Students, Companies, and Supervisors, each containing basic identifying information.
  • Internships, containing the student, supervisor, student position, and the start and end dates of the internship.
  • Student Feedback and Supervisor Feedback, containing scores on how the student performed (for course grading) and the student’s opinion of the internship, respectively.

Beyond the scope of this database are:

  • The actual grading of students.
  • Any non-essential student information for this course, such as tuition and what other courses they have enrolled in.
  • Any legal information or documentation related to legal processes regarding the university–company relationship.

Functional Requirements

Supported by the database:

  • CRUD operations for students, supervisors, and companies.
  • Tracking a student’s internship history with data on supervisor, company, position, and start and end dates.
  • Providing a way for supervisors to grade students’ performances during the internship, as well as a way for the student to grade their own internship experience.

This database is intended to be used as a historical record of internship experiences. Using it to infer “present time” information is unreliable. Beyond its scope are:

  • Using it to search for current open internships.
  • Inferring any present day information (such as if a company is still operating or if a supervisor still works there).

Representation

Entities are captured in SQLite tables with the following schema.

Entities

The database includes the following entities:

The students table includes:

  • id, which specifies the unique ID for the student as an INTEGER. This column thus has the PRIMARY KEY constraint applied.
  • student_number, which specifies the student’s unique identification number for the purposes of anything related to the university as TEXT, to avoid issues with student numbers that start with 0 (as all student numbers must be 9 digits long, and a CHECK constraint ensures that). Due to it being associated with a single student, it has the UNIQUE constraint.
  • first_name, which specifies the student's first name as TEXT, as it is appropriate for string fields (like a name).
  • last_name, which specifies the student's last name as TEXT, for the same reason as first_name.
  • email, which specifies the student's university email as TEXT, for the same reason as first_name. As with the student_number, each student has their own email, therefore, this field has the UNIQUE constraint applied.

All fields are required and thus NOT NULL.

The companies table includes:

  • id, which specifies the unique ID for the company as an INTEGER. This column thus has the PRIMARY KEY constraint applied.
  • name, which specifies the official name of the company as TEXT, as it is appropriate for string fields (like a name). To avoid duplicates, the UNIQUE constraint is applied.
  • industry, which specifies the industry in which the company operates as TEXT due to the same reason as name (although a MySQL SET would work better for companies like Sony, which act in film and animation, but in this database it is assumed each company only spans a single industry). It has a CHECK constraint (as it's meant for Cinema & Media Arts students) to ensure the company is under the film and tv, radio, animation, or journalism industries. Although this list is not comprehensive, it should limit students' options to companies related to their degree.
  • city, which specifies the city at which the company is located as TEXT due to the same reason as name.
  • province, which specifies the province at which the company is located as TEXT due to the same reason as name (although a MySQL ENUM would work better). This field has a CHECK constraint for all the two-letter codes of Canada's provinces.
  • blacklisted, which specifies whether a company has been blacklisted or not (i.e. students should avoid it) as NUMERIC. This field works as a soft delete, as when a company is blacklisted (i.e. when a statement tries to delete it from the database), CURRENT_DATE fills the column. From this, it can be inferred that if this column is not NULL, then this company has been blacklisted, and the date in the field states when that happened.

Except for blacklisted, all fields are required and thus NOT NULL.

The supervisors table includes:

  • id, which specifies the unique ID for the supervisor as an INTEGER. This column thus has the PRIMARY KEY constraint applied.
  • first_name, which specifies the supervisor's first name as TEXT, as it is appropriate for string fields (like a name).
  • last_name, which specifies the supervisor's last name as TEXT due to the same reason as first_name.
  • company_id, which is the ID of the company at which the supervisor works as an INTEGER. This column thus has the FOREIGN KEY constraint applied, referencing the id column in the companies table to ensure data integrity.
  • position, which specifies the job position the supervisor holds at the company they work at as TEXT, due to the same reason as first_name.
  • phone_number, which specifies the personal or work phone number for quickly reaching the supervisor as TEXT as the priority here is readability, although having phone_number stored as an INTEGER would also work under other circumstances. To avoid duplicates, this field has the UNIQUE constraint applied.
  • email, which specifies the work email of the supervisor as TEXT, for the same reason as first_name. As with the phone_number, each company employee has their own email, and therefore the UNIQUE constraint is applied.

All fields are required and thus NOT NULL.

The internships table includes:

  • id, which specifies the unique ID for the internship as an INTEGER. This column thus has the PRIMARY KEY constraint applied.
  • student_id, which is the ID of the student interning at a certain company as an INTEGER. This column thus has the FOREIGN KEY constraint applied, referencing the id column in the students table to ensure data integrity.
  • supervisor_id, which is the ID of the supervisor supervising the student as an INTEGER. This column thus has the FOREIGN KEY constraint applied, referencing the id column in the supervisors table to ensure data integrity.
  • student_position, which specifies the job position the student holds at the company they intern at as TEXT, as it is appropriate for string fields (like a job position).
  • work_type, which specifies with a CHECK constraint whether the internship is in person, remote, or hybrid as TEXT due to the same reason as student_position.
  • start_date, which specifies when the internship started as NUMERIC, as that is a good type affinity to represent dates. It defaults to CURRENT_DATE.
  • end_date, which specifies when the internship ended as NUMERIC due to the same reason as start_date. A CHECK constraint ensures end_date is either NULL or greater than start_date.

Except for end_date, all fields are required and thus NOT NULL.

The supervisor_feedback table includes:

  • internship_id, which specifies the unique ID for the supervisor feedback as an INTEGER. This column thus has the PRIMARY KEY constraint applied, and also the FOREIGN KEY constraint applied, referencing the id column in the internships table to ensure data integrity and save on memory.
  • initiative_score, which is the initiative score the supervisor would give the student.
  • quality_of_work_score, which is the quality of work score the supervisor would give the student.
  • productivity_score, which is the productivity score the supervisor would give the student.
  • reliability_score, which is the reliability score the supervisor would give the student.
  • comment, which contains the comments of the supervisor regarding the student's performance as TEXT, as it is appropriate for long strings or paragraphs. It is not required, and therefore it can be NULL.

All [something]_score columns are required, therefore NOT NULL, and have a CHECK constraint to be between 0 and 5.

The student_feedback table includes:

  • internship_id, which specifies the unique ID for the student feedback as an INTEGER. This column thus has the PRIMARY KEY constraint applied, and also has the FOREIGN KEY constraint applied, referencing the id column in the internships table to ensure data integrity and save on memory.
  • internship_score, which is the score the student would give the internship experience. It is required, therefore NOT NULL, and has a CHECK to be BETWEEN 0 and 5.
  • comment, which contains the comments of the student regarding the internship experience as TEXT, as it is appropriate for long strings or paragraphs. It is not required, therefore it can be NULL.

Relationships

Below, find the ER diagram alongside the explanation of each relationship.

ER Diagram

  • Each student can acquire 0 or more internships (although, if they get no internship, it is fair to assume they will fail the course).
  • Each internship should be associated with exactly 1 student and 1 supervisor.
  • Each supervisor can supervise 1 or more internships simultaneously.
  • A supervisor can work for 1 company, but each company can have 1 or more supervisors. This database does not take into account the possibility of a supervisor quitting and leaving a company with 0 supervisors, and it also assumes that, in order for a company to be inserted, the company must have at least 1 supervisor.
  • Each student and supervisor should write 1 feedback entry for each internship they are related to. This design decision prevents bloating the database with multiple comments for the same internship.

Optimizations

INDEXes

  • While student_number was automatically indexed by SQLite as a COVERING INDEX, idx_supervisor_email is a COVERING INDEX created to speed up suprevisor searches. This database expects most searches will be made with the former two identifiers in mind over people's names, as people with the same name would not produce accurate query results.
  • idx_blacklist is an index of all blacklisted companies. It has been created instead of a list of whitelisted companies because it is expected that there will be significantly less blacklisted companies over whitelisted ones (e.g., 5% of all companies are blacklisted). When querying for all whitelisted companies, SQLite will still have to scan all of the companies table, but for quick glances at which companies are blacklisted, this index speeds up that search and does not require as much memory as an index for whitelisted companies would.
  • Many common queries rely on the student number (or on finding the student ID through the student number), which relates students to a given supervisor through the internships table. Examples include finding the supervisor feedback for a given student or finding the contact information of a student’s supervisor. supervisor_id and student_id (from the internships table) are also used in many JOIN clauses, such as joins to find the top five companies most likely to hire remotely or to find all students working in person. This is enough to justify the creation of the idx_student_id, idx_supervisor_id, idx_work_type, and idx_start_date indexes (from the internships table) to speed up queries. Checks with EXPLAIN QUERY PLAN support this decision.
  • The idx_internship_id_supervisor_feedback index (from the supervisor_feedback table) was also created to speed up queries related to a supervisor's evaluation of a student's internship. As a result, these will also speed up the professor's work when grading.

VIEWs

  • companies_whitelist and companies_blacklist provide quick ways to filter whitelisted and blacklisted companies when using SELECT statements.
  • companies_by_hire_amount and companies_by_score allow for quick overviews of “companies to maintain a good relationship with,” which, for example, could indicate the top companies to contact for a partnership in the following year.

TRIGGERs

  • blacklist_company performs a soft delete on a company. Instead of deleting a company’s entry from the database, it fills the blacklisted column with the CURRENT_DATE and prevents the company from appearing in the companies_whitelist VIEW.

Limitations

  • This database assumes all companies are located in Canada, although adding columns to support other countries or more detailed address information (such as a ZIP code) should be easy.
  • This database does not support any type of documentation as proof of what a student did (such as a PDF document proving the student was hired by a certain company). That's not in the database's scope, though.
  • As the database is targeted toward Cinema & Media Arts students, it is easy to keep track of the industry each company belongs to (film and tv, radio, journalism, and animation so far). However, if this list grows larger, creating a separate table to hold all industry types would be ideal. As is, the design is not very scalable without harming clarity.

About

Final Project for the Harvard course: CS50's Introduction to Databases with SQL

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors