
__get_all() returns too many results #67

@FolkertRA

Bug Description
__get_all() is a function used internally to retrieve multiple pages of results (more than 50). It is used by e.g. get_baselines_versioneditems() and other functions (probably also get_abstractitems()).

This function first retrieves one page, and if that page does not contain all the results, it starts a ThreadPool to retrieve the remaining pages. However, the ThreadPool's index range starts at 0, so it retrieves the first page again and adds it to the results a second time.

The resulting behavior is that the first set of 50 items (or fewer, if allowed_results_per_page is set lower) is duplicated in the results.
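
To see the off-by-one concretely, here is a minimal, standalone sketch of the index arithmetic (the numbers are illustrative, not taken from the library): with 120 total results and 50 per page, the buggy range re-requests start index 0, while the corrected range begins one page later.

```python
allowed_results_per_page = 50
total_results = 120
first_page_start = 0  # startIndex reported in the first page's pageInfo

# Buggy: the ThreadPool range starts at 0, so start index 0 is requested a second time
buggy_starts = list(range(0, total_results, allowed_results_per_page))
print(buggy_starts)  # [0, 50, 100] -> the page at index 0 was already fetched

# Fixed: skip past the already-fetched first page
fixed_starts = list(range(first_page_start + allowed_results_per_page,
                          total_results, allowed_results_per_page))
print(fixed_starts)  # [50, 100]
```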

To Reproduce
Steps to reproduce the behavior:

  1. Call get_baselines_versioneditems() on a baseline containing more than 50 items.
  2. Count the results and observe that there are 50 more than expected (see the reproduction sketch below).
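
A hedged reproduction sketch; the import path, client construction, baseline id, and the 'id' field on each returned item are assumptions to adapt to your own instance:

```python
from py_jama_rest_client.client import JamaClient  # assumed import path for this project

client = JamaClient("https://yourcompany.jamacloud.com",
                    credentials=("username", "password"))  # placeholder credentials

baseline_id = 1234  # a baseline known to contain more than 50 versioned items
items = client.get_baselines_versioneditems(baseline_id)

unique_ids = {item['id'] for item in items}  # 'id' field assumed on each item
print(len(items), len(unique_ids))
# With the bug present, len(items) is 50 larger than len(unique_ids):
# the first page of results appears twice.
```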

Expected behavior
Return exactly the number of items requested.

Fix:
Adjust the start of the range for idx in the ThreadPool so that it begins allowed_results_per_page after the start index of the first page. Example code below.

```python
# requires: from multiprocessing.pool import ThreadPool

def __get_all(self, resource, params=None, allowed_results_per_page=__allowed_results_per_page, **kwargs):
    """This method will get all of the resources specified by the resource parameter; if an id or some other
    parameter is required for the resource, include it in the params parameter.
    It uses a ThreadPool to speed up execution time when getting a large amount of data - GET requests are
    easily parallelized.
    Returns a single JSON array with all of the retrieved items."""
    if allowed_results_per_page < 1 or allowed_results_per_page > 50:
        raise ValueError("Allowed results per page must be between 1 and 50")

    start_index = 0
    data = []

    # get the first page of data
    page_response = self.__get_page(resource, start_index, params=params,
                                    allowed_results_per_page=allowed_results_per_page, **kwargs)
    page_json = page_response.json()
    total_results = page_json['meta']['pageInfo']['totalResults']
    start_index = page_json['meta']['pageInfo']['startIndex']
    data.extend(page_json.get('data'))

    # if we got less data back than the totalResults field of the first page reported,
    # get all remaining pages using a ThreadPool
    if len(data) < total_results:
        with ThreadPool(15) as pool:
            # set up the args for each __get_page call so each page's start index
            # increases by the number of allowed results per page
            # BUGFIX: the range must start at start_index + allowed_results_per_page, not 0,
            # because the first page has already been fetched above
            get_page_args = [[resource, idx, params, allowed_results_per_page]
                             for idx in range(start_index + allowed_results_per_page,
                                              total_results, allowed_results_per_page)]
            pages_data = pool.starmap(self.__get_page, get_page_args)

            # when the thread pool finishes, assemble all the data sequentially
            for page in pages_data:
                pg_json = page.json()
                data.extend(pg_json.get('data'))

    return data
```
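
To check the corrected pagination logic without a live Jama instance, here is a standalone, self-contained distillation with a stubbed page fetcher; everything below is illustrative and not the project's code:

```python
from multiprocessing.pool import ThreadPool

TOTAL = 120
PER_PAGE = 50

def get_page(start_index, per_page=PER_PAGE):
    """Stub fetcher: returns the slice [start_index, start_index + per_page) of a fake dataset."""
    return {
        'meta': {'pageInfo': {'totalResults': TOTAL, 'startIndex': start_index}},
        'data': list(range(start_index, min(start_index + per_page, TOTAL))),
    }

# first page, as in __get_all
first = get_page(0)
data = list(first['data'])
start_index = first['meta']['pageInfo']['startIndex']
total_results = first['meta']['pageInfo']['totalResults']

# remaining pages, starting one page past the already-fetched first page
if len(data) < total_results:
    with ThreadPool(15) as pool:
        starts = range(start_index + PER_PAGE, total_results, PER_PAGE)
        for page in pool.map(get_page, starts):
            data.extend(page['data'])

assert len(data) == total_results == len(set(data))  # no duplicates, no gaps
print(len(data))  # 120
```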
