
Commit e63d1ac

Authored by ObadaS, nicomy (Obada Haddad), Didayolo and ihsaan-ullah; committed.
Optimise download many (#2001)
* download client side and bulk download
* working bulk download
* cleaned
* handling of faulty submissions
* changed to POST instead of GET and added failed.txt to the bundle
* added a limit on the number of files to download simultaneously
* it should work
* cleaned comments
* Flake8 fixes
* added option to bulk download scores files and predictions files
* cleaned code
* exclude unfinished submissions inside download pred and scores
* Rebase optimise_download_many (#2159)
  * removed and from docs
  * replaced DEFAULT_FROM_EMAIL by SERVER_EMAIL
  * remove docs/ since we do not need it anymore and it's unmaintained
  * Django Admin Interface upgrades (#2090)
    * update django admin interface to make it easier to manage spam
    * add option to export email and username of organizers as a CSV file or JSON
    * add option to export email and username of queue owners as a CSV file or JSON
    * use raw_id_fields to make django admin pages load much faster
    * make file size human readable
    * re-arranged the fieldsets of some admin pages to be easier to navigate
    * add small description about size (bytes or GB); limit text length displayed in list displays
    * add custom text in list filter
    * make some filters clearer: remove useless repetitions
    * add export option for users
    * fix typo
    * Co-authored-by: Obada Haddad <obada.haddad@lisn.fr>
  * `Compute Worker` Use docker-py instead of the current subprocess way of doing things (for podman and docker) (#2065)
    * Using docker-py and podman-py instead of the current subprocess way of doing things
    * Removed useless line
    * Started work on Podman; re-added log stream to codabench instance
    * Corrected typo in variable name; small changes for image pulls for podman
    * remove the podman package and use only docker API calls for both podman and docker containers
    * add better progress bar when downloading and some debug logger output
    * add github workflow
    * workflow fixes
    * format with ruff; add cleanup
    * format with ruff; fix websockets connection infinite waiting
    * change Dockerfile.compute_worker to remove docker; make GPU selection compatible with podman and docker
    * fix a typo
    * ruff formatting; update documentation
    * delete useless workflow
    * update documentation, remove useless files, add docker in pyproject for pytests
    * try to fix pytest errors
    * fix typo
    * remove unused folder for podman
    * add more logs for the compute worker; update documentation
    * fix poetry lock after rebase
    * rebase + remove failing tests. We can test this in e2e tests
    * fix flake8 errors
    * fix poetry lock after rebase
    * remove poetry at the end to fix vulnerabilities and make the image lighter
    * remove docker and rich from main pyproject.toml
    * fix missing f
    * add timeout for websocket connections
    * add more information about which container engine we are using and whether the GPU will be used
    * tentative fix for send_detailed_results websocket connection attempt
    * make error reporting better; handle websocket connection failures better
    * fix hostname inside celery worker to use container hostname
    * fix compute worker startup in compute worker service since we use entrypoint in Dockerfile.compute_worker
    * add workflow to build and push compute worker image automatically on file changes for every branch
    * change Dockerfile.compute_worker and test workflow
    * fix missing checkout in new workflows
    * fix missing needs: to make the push job wait for the build to finish first
    * further fixes in workflow
    * rebase branch and update documentation about the new docker image workflow
    * update documentation to be clearer
    * fix typo
    * update compute worker image name
    * raise timeout to make it less likely to fail when the instance is slow to load
    * fix compute worker name in github workflow
    * Co-authored-by: Obada Haddad <obada.haddad@lisn.fr>
  * version bump
  * fix playwright test failing because it's not looking far enough for the leaderboard ID
  * remove selenium traces; make celery do all the work instead of django
  * re-enabled django acting as celery for some tasks
  * fix typo
  * added submission id and filename fields to submission csv
  * fixed date selection (February related?)
  * Remove useless files (#2138)
    * remove useless files and update documentation
    * Remove reference to reset_db.sh from README.md
    * Co-authored-by: Obada Haddad <obada.haddad@lisn.fr>
    * Co-authored-by: didayolo <adrien.pavao@gmail.com>
  * Django to 4.2.0 (#1959)
    * django_to_3.2.25
    * django_to_4.0.0
    * django to 4.2.0 - close but still failing tests
    * pdb in test_competitions
    * STATICFILES_STORAGE is old pattern
    * some migrations for changes to default field behaviors
    * flake adjustments
    * remove a pdb for test
    * channels to 4.2.0
    * merged develop into django4 branch
    * Added Daphne middleware + whitenoise for static files; fixed CSRF errors
    * CSRF fixed
    * CSRF fixes
    * Fixed potential botocore problem with use_ssl state
    * rebase and modified migration
    * remove daphne and selenium
    * flake8 fixes
    * makemigrations results after rebase
    * Cleanup
    * Update packages
    * Co-authored-by: Obada Haddad <obada.haddad@lisn.fr>
    * Co-authored-by: didayolo <adrien.pavao@gmail.com>
  * General - Added new files for Governance, Privacy and About (#2094)
    * Added new files for Governance, Privacy and About. Updated links in docs to point to these
    * Fix links
    * Reference to Codabench
    * Co-authored-by: didayolo <adrien.pavao@gmail.com>
  * Update mkdocs.yml
  * Delete documentation/docs/Organizers/Benchmark_Creation/Cancer-Benchmarks.md
  * More flexible server status page
  * Fix behavior
  * Co-authored-by: Ihsan Ullah <ihsan2131@gmail.com>
  * Co-authored-by: Obada Haddad-Soussac <11889208+ObadaS@users.noreply.github.com>
  * Co-authored-by: Obada Haddad <obada.haddad@lisn.fr>
  * Co-authored-by: acletournel <acl@lri.fr>
  * Co-authored-by: Benjamin Bearce <bbearce@gmail.com>
* Flake8 fixes
* Fix bulk download

Co-authored-by: Nicolas HOMBERG <nicolas.homberg@univ-grenoble-alpes.fr>
Co-authored-by: Obada Haddad <obada.haddad@lisn.fr>
Co-authored-by: Adrien Pavão <adrien.pavao@u-psud.fr>
Co-authored-by: Ihsan Ullah <ihsan2131@gmail.com>
Co-authored-by: acletournel <acl@lri.fr>
Co-authored-by: Benjamin Bearce <bbearce@gmail.com>
Co-authored-by: didayolo <adrien.pavao@gmail.com>
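The commit message above mentions "added a limit on the number of files to download simultaneously". A minimal sketch of that pattern (hypothetical names and cap, not the actual Codabench client code) uses a semaphore to bound how many downloads run at once; the real fetch is stubbed out:

```python
import asyncio

MAX_CONCURRENT_DOWNLOADS = 5  # hypothetical cap, not the value used by Codabench

async def download_one(sem, url, results):
    # The semaphore lets at most MAX_CONCURRENT_DOWNLOADS bodies run concurrently
    async with sem:
        await asyncio.sleep(0)  # placeholder for the real HTTP fetch
        results.append(url)

async def download_all(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT_DOWNLOADS)
    results = []
    await asyncio.gather(*(download_one(sem, u, results) for u in urls))
    return results

urls = [f"https://example.org/sub/{i}.zip" for i in range(8)]
downloaded = asyncio.run(download_all(urls))
print(len(downloaded))  # 8
```

The same idea applies in the browser client, where it would be expressed with a bounded pool of pending promises rather than an asyncio semaphore.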
1 parent 2e7c2f1 commit e63d1ac

File tree

5 files changed: +244 −116 lines changed

src/apps/api/views/submissions.py

Lines changed: 60 additions & 21 deletions
@@ -14,11 +14,10 @@
 from rest_framework.viewsets import ModelViewSet
 from rest_framework_csv import renderers
 from django.core.files.base import ContentFile
-from django.http import StreamingHttpResponse

 from profiles.models import Organization, Membership
 from tasks.models import Task
-from api.serializers.submissions import SubmissionCreationSerializer, SubmissionSerializer, SubmissionFilesSerializer
+from api.serializers.submissions import SubmissionCreationSerializer, SubmissionSerializer, SubmissionFilesSerializer, SubmissionDetailSerializer
 from competitions.models import Submission, SubmissionDetails, Phase, CompetitionParticipant
 from leaderboards.strategies import put_on_leaderboard_by_submission_rule
 from leaderboards.models import SubmissionScore, Column, Leaderboard
@@ -220,6 +219,27 @@ def destroy(self, request, *args, **kwargs):
         self.perform_destroy(submission)
         return Response(status=status.HTTP_204_NO_CONTENT)

+    def check_submission_permissions(self, request, submissions):
+        # Check permissions
+        if not request.user.is_authenticated:
+            raise PermissionDenied("You must be logged in to download submissions")
+        # Allow admins
+        if request.user.is_superuser or request.user.is_staff:
+            allowed = True
+        else:
+            # Build one Q object for "owner OR organizer"
+            organiser_q = (
+                Q(phase__competition__created_by=request.user) |
+                Q(phase__competition__collaborators=request.user)
+            )
+            # Submissions that violate the rule
+            disallowed = submissions.exclude(Q(owner=request.user) | organiser_q)
+            allowed = not disallowed.exists()
+        if not allowed:
+            raise PermissionDenied(
+                "You do not have permission to download one or more of the requested submissions"
+            )
+
     @action(detail=True, methods=('DELETE',))
     def soft_delete(self, request, pk):
         submission = self.get_object()
@@ -384,26 +404,28 @@ def re_run_many_submissions(self, request):
         submission.re_run()
         return Response({})

-    @action(detail=False, methods=['get'])
+    # TODO: the three download-many functions should be bundled into a generic one, taking a function like "get_prediction_result" as a parameter instead of repeating the same code three times
+    @action(detail=False, methods=('POST',))
     def download_many(self, request):
-        """
-        Download a ZIP containing several submissions.
-        """
-        pks = request.query_params.get('pks')
-        if pks:
-            pks = json.loads(pks)  # Convert JSON string to list
-        else:
-            return Response({"error": "`pks` query parameter is required"}, status=400)
+        pks = request.data.get('pks')
+        if not pks:
+            return Response({"error": "`pks` field is required"}, status=400)
+
+        # pks is already parsed as a list if JSON was sent properly
+        if not isinstance(pks, list):
+            return Response({"error": "`pks` must be a list"}, status=400)

         # Get submissions
         submissions = Submission.objects.filter(pk__in=pks).select_related(
             "owner",
-            "phase__competition",
-            "phase__competition__created_by",
-        ).prefetch_related("phase__competition__collaborators")
-        if submissions.count() != len(pks):
+            "phase",
+            "data"
+        )
+
+        if len(list(submissions)) != len(pks):
             return Response({"error": "One or more submission IDs are invalid"}, status=404)

+        # Nicolas Homberg: should we create a function for this?
         # Check permissions
         if not request.user.is_authenticated:
             raise PermissionDenied("You must be logged in to download submissions")
@@ -424,12 +446,29 @@ def download_many(self, request):
                 "You do not have permission to download one or more of the requested submissions"
             )

-        # Download
-        from competitions.tasks import stream_batch_download
-        in_memory_zip = stream_batch_download(pks)
-        response = StreamingHttpResponse(in_memory_zip, content_type='application/zip')
-        response['Content-Disposition'] = 'attachment; filename="bulk_submissions.zip"'
-        return response
+        files = []
+
+        for sub in submissions:
+            file_path = sub.data.data_file.name.split('/')[-1]
+            short_name = f"{sub.id}_{sub.owner}_PhaseId{sub.phase.id}_{sub.data.created_when.strftime('%Y-%m-%d:%M-%S')}_{file_path}"
+            # url = sub.data.data_file.url
+            url = SubmissionDetailSerializer(sub.data, context=self.get_serializer_context()).data['data_file']
+            # url = SubmissionFilesSerializer(sub, context=self.get_serializer_context()).data['data_file']
+            files.append({"name": short_name, "url": url})
+
+        return Response(files)
+
+        for sub in submissions:
+            if sub.status not in [Submission.FINISHED]:  # Submission.FAILED, Submission.CANCELLED
+                continue
+            file_path = sub.data.data_file.name.split('/')[-1]
+            complete_name = f"res_{sub.id}_{sub.owner}_PhaseId{sub.phase.id}_{sub.data.created_when.strftime('%Y-%m-%d:%M-%S')}_{file_path}"
+            result_url = SubmissionDetailSerializer(sub.data, context=self.get_serializer_context()).get_scoring_result(sub)
+            # detailed results are already in the results zip file, but for very large detailed results it could be helpful to remove them
+            # detailed_result_url = serializer.get_scoring_result(sub)
+            files.append({"name": complete_name, "url": result_url})
+
+        return Response(files)

     @action(detail=True, methods=('GET',))
     def get_details(self, request, pk):
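The `check_submission_permissions` helper added in this diff encodes an "owner, competition organizer, or staff" rule over a queryset. A plain-Python sketch of the same rule, with stub objects standing in for Django models and the Q-object exclusion (all names here are illustrative, not part of the codebase):

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    is_staff: bool = False

@dataclass
class Submission:
    owner: User
    # Stand-in for competition created_by + collaborators
    organizers: list = field(default_factory=list)

def may_download(user, submissions):
    # Staff and superusers bypass the check entirely
    if user.is_staff:
        return True
    # Mirror of submissions.exclude(owner | organiser_q): collect
    # every submission the user neither owns nor organizes
    disallowed = [s for s in submissions
                  if s.owner is not user and user not in s.organizers]
    return not disallowed

alice, bob, admin = User("alice"), User("bob"), User("admin", is_staff=True)
subs = [Submission(owner=alice), Submission(owner=bob, organizers=[alice])]
print(may_download(alice, subs))  # True: owns one, organizes the other
print(may_download(bob, subs))    # False: no rights on alice's submission
print(may_download(admin, subs))  # True: staff bypass
```

The real helper does this in one database query via `exclude(...).exists()` rather than iterating in Python, which is why the Q objects matter there.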

src/apps/competitions/tasks.py

Lines changed: 0 additions & 47 deletions
@@ -19,10 +19,6 @@
 from django.utils.timezone import now
 from rest_framework.exceptions import ValidationError

-from urllib.request import urlopen
-from contextlib import closing
-from urllib.error import ContentTooShortError
-
 from celery_config import app
 from competitions.models import Submission, CompetitionCreationTaskStatus, SubmissionDetails, Competition, \
     CompetitionDump, Phase
@@ -317,49 +313,6 @@ def send_child_id(submission, child_id):
     })


-def retrieve_data(url, data=None):
-    with closing(urlopen(url, data)) as fp:
-        headers = fp.info()
-
-        bs = 1024 * 8
-        size = -1
-        read = 0
-        if "content-length" in headers:
-            size = int(headers["Content-Length"])
-
-        while True:
-            block = fp.read(bs)
-            if not block:
-                break
-            read += len(block)
-            yield(block)
-
-        if size >= 0 and read < size:
-            raise ContentTooShortError(
-                "retrieval incomplete: got only %i out of %i bytes"
-                % (read, size))
-
-
-def zip_generator(submission_pks):
-    in_memory_zip = BytesIO()
-    with zipfile.ZipFile(in_memory_zip, 'w', zipfile.ZIP_DEFLATED) as zip_file:
-        for submission_id in submission_pks:
-            submission = Submission.objects.get(id=submission_id)
-            short_name = "ID_" + str(submission_id) + '_' + submission.data.data_file.name.split('/')[-1]
-            url = make_url_sassy(path=submission.data.data_file.name)
-            for block in retrieve_data(url):
-                zip_file.writestr(short_name, block)
-
-    in_memory_zip.seek(0)
-
-    return in_memory_zip
-
-
-@app.task(queue='site-worker', soft_time_limit=60 * 60)
-def stream_batch_download(submission_pks):
-    return zip_generator(submission_pks)
-
-
 @app.task(queue='site-worker', soft_time_limit=60)
 def _run_submission(submission_pk, task_pks=None, is_scoring=False):
     """This function is wrapped so that when we run tests we can run this function not
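The removed `zip_generator` assembled the whole bundle server-side into an in-memory zip before the client could download anything; the commit replaces that with per-file client-side downloads. For reference, a standalone sketch of the old approach using only the standard library, with whole payloads standing in for the streamed `retrieve_data` blocks (names here are illustrative):

```python
import zipfile
from io import BytesIO

def zip_bundle(named_blobs):
    """Build an in-memory zip from (archive_name, bytes) pairs."""
    in_memory_zip = BytesIO()
    with zipfile.ZipFile(in_memory_zip, 'w', zipfile.ZIP_DEFLATED) as zf:
        for name, blob in named_blobs:
            # One writestr per complete payload, one archive member per file
            zf.writestr(name, blob)
    in_memory_zip.seek(0)
    return in_memory_zip

bundle = zip_bundle([
    ("ID_1_pred.csv", b"a,b\n1,2\n"),
    ("ID_2_pred.csv", b"a,b\n3,4\n"),
])
print(zipfile.ZipFile(bundle).namelist())  # ['ID_1_pred.csv', 'ID_2_pred.csv']
```

Note that the deleted code called `zip_file.writestr(short_name, block)` once per streamed block, producing one archive member per block under the same name; the sketch sidesteps that by writing each payload whole. Beyond that bug, buffering every submission in one `BytesIO` is exactly the memory cost the client-side rewrite avoids.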

src/static/js/ours/client.js

Lines changed: 6 additions & 24 deletions
@@ -128,31 +128,13 @@ CODALAB.api = {
         return CODALAB.api.request('GET', `${URLS.API}submissions/${id}/get_detail_result/`)
     },
     download_many_submissions: function (pks) {
-        console.log('Request bulk');
-        const params = new URLSearchParams({ pks: JSON.stringify(pks) });
-        const url = `${URLS.API}submissions/download_many/?${params}`;
-        return fetch(url, {
-            method: 'GET',
-            headers: {
-                'Content-Type': 'application/json'
-            }
-        }).then(response => {
-            if (!response.ok) {
-                throw new Error('Network response was not ok ' + response.statusText);
-            }
-            return response.blob();
-        }).then(blob => {
-            const link = document.createElement('a');
-            link.href = window.URL.createObjectURL(blob);
-            link.download = 'bulk_submissions.zip';
-            document.body.appendChild(link);
-            link.click();
-            document.body.removeChild(link);
-        }).catch(error => {
-            console.error('Error downloading submissions:', error);
-        });
+        return CODALAB.api.request(
+            'POST',
+            URLS.API + "submissions/download_many/",
+            { pks: pks }  // body is JSON by convention
+        );
     },

     /*---------------------------------------------------------------------
                                Leaderboards
     ---------------------------------------------------------------------*/
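With this change the client no longer receives a zip blob: `download_many` answers the POST with a JSON list of `{name, url}` records, and the browser fetches each URL itself. The server builds each name with the f-string `f"{sub.id}_{sub.owner}_PhaseId{sub.phase.id}_{...strftime('%Y-%m-%d:%M-%S')}_{file_path}"`; a sketch of that naming with illustrative values (note the format string contains no `%H`, so the hour is dropped, exactly as in the committed code):

```python
from datetime import datetime

def submission_archive_name(sub_id, owner, phase_id, created_when, data_file_name):
    # Mirrors the f-string in download_many
    file_part = data_file_name.split('/')[-1]   # basename of the stored file
    stamp = created_when.strftime('%Y-%m-%d:%M-%S')  # no %H: hour is omitted
    return f"{sub_id}_{owner}_PhaseId{phase_id}_{stamp}_{file_part}"

name = submission_archive_name(
    42, "alice", 7, datetime(2024, 5, 1, 13, 30, 15), "uploads/pred.zip")
print(name)  # 42_alice_PhaseId7_2024-05-01:30-15_pred.zip
```

Whether the missing hour in the timestamp format is intentional is not stated in the commit; the sketch simply reproduces it.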

0 commit comments