Merged
102 commits
b96b0f9
fix: output specific validation errors
Mesh-ach May 27, 2025
ba278bc
fix: output specific validation errors
Mesh-ach May 27, 2025
72e616e
fix: output specific validation errors
Mesh-ach May 27, 2025
9372315
fix: output specific validation errors
Mesh-ach May 27, 2025
4914a19
fix: output specific validation errors
Mesh-ach May 27, 2025
2eb982c
Merge pull request #61 from datakind/Validation-Errors
Mesh-ach May 27, 2025
605ab31
adjusted val cols
Mesh-ach Jun 1, 2025
160ad26
adjusted val cols
Mesh-ach Jun 1, 2025
819bd3d
adjusted val cols
Mesh-ach Jun 1, 2025
2e51034
adjusted val cols
Mesh-ach Jun 1, 2025
fe8651d
adjusted val cols
Mesh-ach Jun 2, 2025
fbe634f
adjusted val cols
Mesh-ach Jun 2, 2025
6f91176
adjusted val cols
Mesh-ach Jun 2, 2025
2b9a7b2
adjusted val cols
Mesh-ach Jun 2, 2025
1c02e56
adjusted val cols
Mesh-ach Jun 2, 2025
6eb51fc
adjusted val cols
Mesh-ach Jun 2, 2025
889e325
adjusted val cols
Mesh-ach Jun 2, 2025
3cc281e
adjusted val cols
Mesh-ach Jun 2, 2025
0a4f735
adjusted val cols
Mesh-ach Jun 2, 2025
5605b7c
adjusted val cols
Mesh-ach Jun 2, 2025
62f0939
adjusted val cols
Mesh-ach Jun 2, 2025
3f485c5
adjusted val cols
Mesh-ach Jun 2, 2025
95edc00
Merge pull request #62 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
7f6db13
adjusted val cols
Mesh-ach Jun 2, 2025
ce6825d
adjusted val cols
Mesh-ach Jun 2, 2025
c52ca86
adjusted val cols
Mesh-ach Jun 2, 2025
3338d99
adjusted val cols
Mesh-ach Jun 2, 2025
2a78cc3
adjusted val cols
Mesh-ach Jun 2, 2025
92093bc
adjusted val cols
Mesh-ach Jun 2, 2025
488efd5
adjusted val cols
Mesh-ach Jun 2, 2025
2839a12
adjusted val cols
Mesh-ach Jun 2, 2025
5cb6453
adjusted val cols
Mesh-ach Jun 2, 2025
66ad6d4
adjusted val cols
Mesh-ach Jun 2, 2025
20a53f3
adjusted val cols
Mesh-ach Jun 2, 2025
55483f3
Merge pull request #63 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
a7e49b5
adjusted val cols
Mesh-ach Jun 2, 2025
b093d0c
adjusted val cols
Mesh-ach Jun 2, 2025
1605f0c
Merge pull request #64 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
ca7dfa4
adjusted val cols
Mesh-ach Jun 2, 2025
f960c85
adjusted val cols
Mesh-ach Jun 2, 2025
46beb4b
Merge pull request #65 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
266c8dd
adjusted val cols
Mesh-ach Jun 2, 2025
8157cde
Merge pull request #66 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
0dc904b
adjusted val cols
Mesh-ach Jun 2, 2025
cd2bea1
Merge pull request #67 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
e4a0b92
adjusted val cols
Mesh-ach Jun 2, 2025
c54fb5b
Merge pull request #68 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
7c13614
adjusted val cols
Mesh-ach Jun 2, 2025
9d35d68
Merge pull request #69 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
4745af0
adjusted val cols
Mesh-ach Jun 2, 2025
845589e
Merge pull request #70 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
497352b
adjusted val cols
Mesh-ach Jun 2, 2025
20e884e
Merge pull request #71 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
f2ab256
adjusted val cols
Mesh-ach Jun 2, 2025
835e800
Merge pull request #72 from datakind/develop
Mesh-ach Jun 2, 2025
3691156
Merge pull request #73 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
9f5d128
adjusted val cols
Mesh-ach Jun 2, 2025
2b0c35b
Merge pull request #74 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
b96d56c
adjusted val cols
Mesh-ach Jun 2, 2025
6d0ac6d
adjusted val cols
Mesh-ach Jun 2, 2025
4b53cf7
Merge pull request #75 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
cd99bf9
adjusted val cols
Mesh-ach Jun 2, 2025
ece92a8
adjusted val cols
Mesh-ach Jun 2, 2025
bed6487
Merge pull request #76 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
26ca174
adjusted val cols
Mesh-ach Jun 2, 2025
3566c13
Merge pull request #77 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
38f036f
adjusted val cols
Mesh-ach Jun 2, 2025
5dcc4ab
Merge pull request #78 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
1a542b8
adjusted validation files
Mesh-ach Jun 2, 2025
f0e890d
adjusted validation files
Mesh-ach Jun 2, 2025
b51b4a8
Merge pull request #79 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
a427bd5
adjusted validation files
Mesh-ach Jun 2, 2025
3e3f4c0
Merge pull request #80 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
2411433
adjusted validation files
Mesh-ach Jun 2, 2025
ab46919
adjusted validation files
Mesh-ach Jun 2, 2025
a9cda3f
Merge pull request #81 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
a8d1e6c
adjusted validation files
Mesh-ach Jun 2, 2025
9333002
Merge pull request #82 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
5720698
adjusted validation files
Mesh-ach Jun 2, 2025
7b1d5e0
adjusted validation files
Mesh-ach Jun 2, 2025
b66ec69
Merge pull request #83 from datakind/Validation-Errors
Mesh-ach Jun 2, 2025
ca40008
adjusted validation files
Mesh-ach Jun 3, 2025
facda49
adjusted validation files
Mesh-ach Jun 3, 2025
44c0bc8
adjusted validation files
Mesh-ach Jun 3, 2025
5a4a7b2
Merge pull request #84 from datakind/Validation-Errors
Mesh-ach Jun 3, 2025
47015f2
adjusted validation files
Mesh-ach Jun 3, 2025
3601164
Merge pull request #85 from datakind/Validation-Errors
Mesh-ach Jun 3, 2025
c21885c
adjusted validation files
Mesh-ach Jun 3, 2025
b959fd8
adjusted validation files
Mesh-ach Jun 3, 2025
99d7f6c
adjusted validation files
Mesh-ach Jun 3, 2025
1070b7d
Merge pull request #86 from datakind/Validation-Errors
Mesh-ach Jun 3, 2025
1b08bfe
adjusted validation files
Mesh-ach Jun 3, 2025
1592486
adjusted validation files
Mesh-ach Jun 3, 2025
4918670
adjusted validation files
Mesh-ach Jun 3, 2025
e3ef82a
Merge pull request #87 from datakind/Validation-Errors
Mesh-ach Jun 3, 2025
b03265e
adjusted validation files
Mesh-ach Jun 3, 2025
71b605e
adjusted validation files
Mesh-ach Jun 3, 2025
f1bbfc1
Merge pull request #88 from datakind/Validation-Errors
Mesh-ach Jun 3, 2025
449a3ac
fix: base and pdp schema
Mesh-ach Jun 3, 2025
8009c52
Merge pull request #89 from datakind/Validation-Errors
Mesh-ach Jun 3, 2025
6323623
fix: base and pdp schema
Mesh-ach Jun 3, 2025
d85c4ec
Merge pull request #90 from datakind/Validation-Errors
Mesh-ach Jun 3, 2025
4 changes: 3 additions & 1 deletion pyproject.toml
@@ -27,7 +27,9 @@ dependencies = [
"pandas",
"six",
"types-six",
"fuzzywuzzy"
"fuzzywuzzy",
"databricks-sql-connector",
"pandera~=0.13"
]

[project.urls]
33 changes: 21 additions & 12 deletions src/webapp/gcsutil.py
@@ -8,9 +8,13 @@

from .config import gcs_vars, databricks_vars
from .validation import validate_file_reader
from .utilities import (
SchemaType,
)
from typing import Any, List
import logging

# Set the logging
logging.basicConfig(format="%(asctime)s [%(levelname)s]: %(message)s")
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

SIGNED_URL_EXPIRY_MIN = 30

@@ -34,7 +38,7 @@ def rename_file(
# There is also an `if_source_generation_match` parameter, which is not used in this example.
destination_generation_match_precondition = 0

blob_copy = source_bucket.copy_blob(
source_bucket.copy_blob(
source_blob,
new_file_name,
if_generation_match=destination_generation_match_precondition,
@@ -55,7 +59,7 @@ def credentials(self):
self._credentials, self._project_id = google.auth.default()
return self._credentials

def generate_upload_signed_url(self, bucket_name: str, file_name: str) -> str:
def generate_upload_signed_url(self, bucket_name: str, file_name: str) -> Any:
"""Generates a v4 signed URL for uploading a blob using HTTP PUT."""
r = requests.Request()
self.credentials().refresh(r)
@@ -88,7 +92,7 @@ def generate_upload_signed_url(self, bucket_name: str, file_name: str) -> str:

return url

def generate_download_signed_url(self, bucket_name: str, blob_name: str) -> str:
def generate_download_signed_url(self, bucket_name: str, blob_name: str) -> Any:
"""Generates a v4 signed URL for downloading a blob using HTTP GET."""
r = requests.Request()
self.credentials().refresh(r)
@@ -172,7 +176,7 @@ def create_bucket(self, bucket_name: str) -> None:
new_bucket.set_iam_policy(policy)

def list_blobs_in_folder(
self, bucket_name: str, prefix: str, delimiter=None
self, bucket_name: str, prefix: str, delimiter: Any = None
) -> list[str]:
"""Lists all the blobs in the bucket that begin with the prefix.

@@ -218,7 +222,7 @@ def list_blobs_in_folder(

def download_file(
self, bucket_name: str, file_name: str, destination_file_name: str
):
) -> Any:
"""Downloads a blob from the bucket."""

# The path to which the file should be downloaded
@@ -264,17 +268,21 @@ def delete_file(self, bucket_name: str, file_name: str):
blob.delete()

def validate_file(
self, bucket_name: str, file_name: str, allowed_schemas: set[SchemaType]
) -> set[SchemaType]:
self, bucket_name: str, file_name: str, allowed_schemas: list[str]
) -> List[str]:
"""Validate that a file is one of the allowed schemas."""
client = storage.Client()
bucket = client.bucket(bucket_name)
blob = bucket.blob(f"unvalidated/{file_name}")
new_blob_name = f"validated/{file_name}"
schems = set()
schems: List[str] = []
try:
with blob.open("r") as file:
schems = validate_file_reader(file, allowed_schemas)
schemas = validate_file_reader(file, allowed_schemas)
schems = [str(s) for s in schemas.get("schemas", [])]
logging.debug(
f"If you see this file validation was successful {schems}"
)
except Exception as e:
blob.delete()
raise e
@@ -283,6 +291,7 @@ def validate_file(
raise ValueError(new_blob_name + ": File already exists.")
bucket.copy_blob(blob, bucket, new_blob_name)
blob.delete()
logging.debug("If you see this file validation was complete")
return schems

def get_file_contents(self, bucket_name: str, file_name: str):
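
Review note on the logging added in this file: the module sets `logger` to DEBUG, but the new calls use `logging.debug(...)`, which goes through the root logger. `logging.basicConfig` without a `level` argument leaves the root logger at WARNING, so unless the root level is lowered somewhere not visible in this diff, those debug messages would likely be dropped. A minimal sketch of the difference follows; the logger name and the in-memory stream are illustrative assumptions, not part of the PR.

```python
import io
import logging

# Capture output in memory so the behaviour can be inspected (illustrative only).
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(levelname)s] %(name)s: %(message)s"))

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.WARNING)  # the root logger's default level

# Hypothetical module logger, configured the same way as in the diff.
logger = logging.getLogger("webapp.gcsutil.sketch")
logger.setLevel(logging.DEBUG)

logging.debug("sent via the root logger")   # filtered out: root level is WARNING
logger.debug("sent via the module logger")  # kept: logger level is DEBUG, handled by root's handler

captured = stream.getvalue()
print(captured)
```

If the intent is to see these messages in DEBUG runs, calling `logger.debug(...)` on the module-level logger would be the more predictable choice.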
128 changes: 93 additions & 35 deletions src/webapp/routers/data.py
@@ -3,12 +3,15 @@
import uuid
from datetime import datetime, date

from typing import Annotated, Any, Dict
from typing import Annotated, Any, Dict, List
from pydantic import BaseModel
from fastapi import APIRouter, Depends, HTTPException, status, Response
from sqlalchemy import and_, or_
from sqlalchemy.orm import Session
from sqlalchemy.future import select
import os
import logging
from sqlalchemy.exc import IntegrityError

from ..utilities import (
has_access_to_inst_or_err,
@@ -20,7 +23,6 @@
get_current_active_user,
DataSource,
get_external_bucket_name,
SchemaType,
decode_url_piece,
)

@@ -29,13 +31,17 @@
local_session,
BatchTable,
FileTable,
InstTable,
)

from ..gcsdbutils import update_db_from_bucket

from ..gcsutil import StorageControl

# Set the logging
logging.basicConfig(format="%(asctime)s [%(levelname)s]: %(message)s")
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

router = APIRouter(
prefix="/institutions",
tags=["data"],
@@ -91,7 +97,7 @@ class DataInfo(BaseModel):
name: str
data_id: str
# The batch(es) that this data is present in.
batch_ids: set[str] = {}
batch_ids: set[str] = set()
inst_id: str
# Size to the nearest MB.
# size_mb: int
@@ -123,7 +129,7 @@ class ValidationResult(BaseModel):
# Must be unique within an institution to avoid confusion.
name: str
inst_id: str
file_types: set[SchemaType]
file_types: List[str]
source: str


@@ -838,6 +844,33 @@ def download_url_inst_file(
)


def infer_models_from_filename(file_path: str, institution_id: str) -> List[str]:
name = os.path.basename(file_path).lower()

inferred = set()
if "course" in name:
inferred.add("COURSE")
if "student" in name:
inferred.add("STUDENT")
if institution_id == "pdp":
inferred.add("SEMESTER")
if "semester" in name:
inferred.add("SEMESTER")
if "cohort" in name:
inferred.add("STUDENT")
inferred.add("SEMESTER")

if not inferred:
logging.error(
ValueError(
f"Could not infer model(s) from file name: {name}. File names should describe the kind of data they contain, e.g. course, cohort"
)
)
inferred.add("UNKNOWN")

return sorted(inferred)


def validation_helper(
source_str: str,
inst_id: str,
@@ -854,51 +887,76 @@ def validation_helper(
detail="File name can't contain '/'.",
)
local_session.set(sql_session)
inst_query_result = (
local_session.get()
.execute(select(InstTable).where(InstTable.id == str_to_uuid(inst_id)))
.all()
)
if len(inst_query_result) == 0:
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail="Institution not found.",
)
if len(inst_query_result) > 1:
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail="Institution duplicates found.",
)
allowed_schemas = set()
if inst_query_result[0][0].schemas:
allowed_schemas = set(inst_query_result[0][0].schemas)

inferred_schemas = set()
allowed_schemas = None
if not allowed_schemas:
allowed_schemas = infer_models_from_filename(file_name, "pdp")

inferred_schemas: list[str] = []

try:
inferred_schemas = storage_control.validate_file(
get_external_bucket_name(inst_id), file_name, allowed_schemas
get_external_bucket_name(inst_id),
file_name,
allowed_schemas,
)
logging.debug(
f"!!!!!!!!!!Inferred Schemas was successful {list(inferred_schemas)}"
)
except Exception as e:
logging.debug(f"!!!!!!!!!!Inferred Schemas FAILED {e}")
raise HTTPException(
status_code=status.HTTP_400_BAD_REQUEST,
detail="File type is not valid and/or not accepted by this institution: "
+ str(e),
) from e
new_file_record = FileTable(
name=file_name,
inst_id=str_to_uuid(inst_id),
uploader=str_to_uuid(current_user.user_id),
source=source_str,
sst_generated=False,
schemas=list(inferred_schemas),
valid=True,

existing_file = (
local_session.get()
.query(FileTable)
.filter_by(
name=file_name,
inst_id=str_to_uuid(inst_id),
)
.first()
)
local_session.get().add(new_file_record)

if existing_file:
logging.info(f"File '{file_name}' already exists for institution {inst_id}.")
db_status = f"File '{file_name}' already exists for institution {inst_id}."
else:
try:
new_file_record = FileTable(
name=file_name,
inst_id=str_to_uuid(inst_id),
uploader=str_to_uuid(current_user.user_id),
source=source_str,
sst_generated=False,
schemas=list(inferred_schemas),
valid=True,
)
local_session.get().add(new_file_record)
local_session.get().flush()
logging.info(f"File record inserted for '{file_name}'")
db_status = f"File record inserted for '{file_name}'"
except IntegrityError as e:
local_session.get().rollback()
logging.warning(f"IntegrityError: {e}")
db_status = "Already exists"
except Exception as e:
local_session.get().rollback()
logging.error(f"Unexpected DB error: {e}")
raise HTTPException(
status_code=500,
detail=f"Unexpected database error while inserting file record: {e}",
)

return {
"name": file_name,
"inst_id": inst_id,
"file_types": inferred_schemas,
"file_types": list(inferred_schemas),
"source": source_str,
"status": db_status,
}
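
For reviewers who want to try the filename-based inference in isolation, here is a simplified sketch of the idea behind `infer_models_from_filename`. It is not the PR's function: the keyword table and the name `infer_models` are assumptions for illustration, and the `institution_id == "pdp"` branch is left out because its nesting cannot be recovered from this view of the diff.

```python
import os

# Hypothetical keyword table: substring of the file name -> model names it implies.
KEYWORD_TO_MODELS = {
    "course": {"COURSE"},
    "student": {"STUDENT"},
    "semester": {"SEMESTER"},
    "cohort": {"STUDENT", "SEMESTER"},
}


def infer_models(file_path: str) -> list[str]:
    """Guess model names from a file name; fall back to ["UNKNOWN"]."""
    name = os.path.basename(file_path).lower()
    inferred: set[str] = set()
    for keyword, models in KEYWORD_TO_MODELS.items():
        if keyword in name:
            inferred |= models
    # Sorting gives a stable order, which keeps API responses and tests deterministic.
    return sorted(inferred) if inferred else ["UNKNOWN"]


print(infer_models("uploads/AY2024_Course_Data.csv"))
print(infer_models("cohort_fall.csv"))
print(infer_models("data.csv"))
```

Returning a sorted list rather than a set matches the PR's move from `set[SchemaType]` to `List[str]`, where element order becomes part of the response.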


16 changes: 8 additions & 8 deletions src/webapp/routers/data_test.py
@@ -557,7 +557,7 @@ def test_update_batch(client: TestClient):

def test_validate_success_batch(client: TestClient):
"""Test PATCH /institutions/<uuid>/batch."""
MOCK_STORAGE.validate_file.return_value = {SchemaType.UNKNOWN}
MOCK_STORAGE.validate_file.return_value = ["UNKNOWN"]

# Use validate for manual upload
response_upload = client.post(
@@ -608,28 +608,28 @@ def test_validate_success_batch(client: TestClient):

def test_validate_failure_batch(client: TestClient):
"""Test PATCH /institutions/<uuid>/batch."""
MOCK_STORAGE.validate_file.return_value = {SchemaType.PDP_COHORT}
MOCK_STORAGE.validate_file.return_value = ["COURSE"]
# Authorized.
# Use validate upload
response_upload = client.post(
"/institutions/"
+ uuid_to_str(USER_VALID_INST_UUID)
+ "/input/validate-upload/file_name.csv",
+ "/input/validate-upload/file_name_course.csv",
)
assert response_upload.status_code == 200
assert response_upload.json()["name"] == "file_name.csv"
assert response_upload.json()["file_types"] == ["PDP_COHORT"]
assert response_upload.json()["name"] == "file_name_course.csv"
assert response_upload.json()["file_types"] == ["COURSE"]
assert response_upload.json()["inst_id"] == uuid_to_str(USER_VALID_INST_UUID)
assert response_upload.json()["source"] == "MANUAL_UPLOAD"

# Use validate sftp
response_sftp = client.post(
"/institutions/"
+ uuid_to_str(USER_VALID_INST_UUID)
+ "/input/validate-upload/file_name.csv",
+ "/input/validate-upload/file_name_course.csv",
)
assert response_sftp.status_code == 200
assert response_sftp.json()["name"] == "file_name.csv"
assert response_sftp.json()["file_types"] == ["PDP_COHORT"]
assert response_sftp.json()["name"] == "file_name_course.csv"
assert response_sftp.json()["file_types"] == ["COURSE"]
assert response_sftp.json()["inst_id"] == uuid_to_str(USER_VALID_INST_UUID)
assert response_sftp.json()["source"] == "MANUAL_UPLOAD"
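
The test above posts the same file name twice and expects 200 both times, which appears to rely on the duplicate handling added to `validation_helper` (report an existing record instead of failing). The sketch below shows that insert-or-report pattern in isolation. It swaps SQLAlchemy for the standard-library `sqlite3` module so it runs without extra dependencies; the table name, columns, and `insert_file_record` helper are hypothetical.

```python
import sqlite3


def insert_file_record(conn: sqlite3.Connection, name: str, inst_id: str) -> str:
    """Insert a (name, inst_id) row, or report that it already exists."""
    try:
        # The connection context manager commits on success and rolls back on error.
        with conn:
            conn.execute(
                "INSERT INTO file_record (name, inst_id) VALUES (?, ?)",
                (name, inst_id),
            )
        return f"File record inserted for '{name}'"
    except sqlite3.IntegrityError:
        # The unique constraint fired: treat it as "already there", not as a failure.
        return "Already exists"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_record (name TEXT, inst_id TEXT, UNIQUE (name, inst_id))"
)

first = insert_file_record(conn, "file_name_course.csv", "inst-1")
second = insert_file_record(conn, "file_name_course.csv", "inst-1")
row_count = conn.execute("SELECT COUNT(*) FROM file_record").fetchone()[0]
print(first, "|", second, "|", row_count)
```

Relying on the database constraint, rather than only a prior lookup, also covers the case where two uploads race each other between the existence check and the insert.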
2 changes: 1 addition & 1 deletion src/webapp/test_files/financial_sst_pdp.csv
@@ -1,4 +1,4 @@
Student ID,Institution ID,Academic Year,Dependency Status,Housing Status,Cost of Attendance,EFC,Total Institutional Grants,Total State Grants,Total Federal Grants,Unmet Need,Net Price,Applied Aid
Student ID,Institution ID,Academic Year,Dependency Status,Housing Status,Cost of Attendance,EFC,Total Institutional Grants,Total State Grants,Pell Status First Year,Unmet Need,Net Price,Applied Aid
999999,99999999,2019-20,Unknown,Off-campus,3505,0,0,0,774,2731,2731,N
999998,99999999,2019-20,Independent,Off-campus,4210,0,0,0,3097,1113,1113,Y
999997,99999999,2019-20,Dependent,On-campus housing,19938,1768,0,2566,4445,11159,12927,Y