
__get_all() returns too many results #67

@FolkertRA

Bug Description
__get_all() is a function used internally to retrieve multiple pages of results (more than 50). It is used by e.g. get_baselines_versioneditems() and other functions (probably also get_abstractitems()).

This function first retrieves one page, and if that page does not contain all the results, it starts a ThreadPool to retrieve the remaining pages. However, the ThreadPool's index range starts at 0, so it retrieves the first page again and adds it to the results a second time.

The resulting behavior is that the first set of 50 items (or fewer, if allowed_results_per_page is set lower) is duplicated in the results.
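
To see the off-by-one concretely, here is a minimal, standalone sketch of the index arithmetic (the numbers are illustrative, not taken from the library): with 120 total results and 50 per page, the buggy range re-requests start index 0, while the corrected range begins one page later.

```python
allowed_results_per_page = 50
total_results = 120
first_page_start = 0  # startIndex reported in the first page's pageInfo

# Buggy: the ThreadPool range starts at 0, so start index 0 is requested a second time
buggy_starts = list(range(0, total_results, allowed_results_per_page))
print(buggy_starts)  # [0, 50, 100] -> the page at index 0 was already fetched

# Fixed: skip past the already-fetched first page
fixed_starts = list(range(first_page_start + allowed_results_per_page,
                          total_results, allowed_results_per_page))
print(fixed_starts)  # [50, 100]
```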

To Reproduce
Steps to reproduce the behavior:

  1. Call get_baselines_versioneditems() on a baseline containing more than 50 items.
  2. Count the results and observe that there are 50 more than expected (see the reproduction sketch below).
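
A hedged reproduction sketch; the import path, client construction, baseline id, and the 'id' field on each returned item are assumptions to adapt to your own instance:

```python
from py_jama_rest_client.client import JamaClient  # assumed import path for this project

client = JamaClient("https://yourcompany.jamacloud.com",
                    credentials=("username", "password"))  # placeholder credentials

baseline_id = 1234  # a baseline known to contain more than 50 versioned items
items = client.get_baselines_versioneditems(baseline_id)

unique_ids = {item['id'] for item in items}  # 'id' field assumed on each item
print(len(items), len(unique_ids))
# With the bug present, len(items) is 50 larger than len(unique_ids):
# the first page of results appears twice.
```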

Expected behavior
Return exactly the number of items requested.

Fix:
Adjust the start of the range for idx in the ThreadPool so that it begins allowed_results_per_page after the start index of the first page. Example code below.

```python
# requires: from multiprocessing.pool import ThreadPool

def __get_all(self, resource, params=None, allowed_results_per_page=__allowed_results_per_page, **kwargs):
    """This method will get all of the resources specified by the resource parameter; if an id or some other
    parameter is required for the resource, include it in the params parameter.
    It uses a ThreadPool to speed up execution time when getting a large amount of data - GET requests are
    easily parallelized.
    Returns a single JSON array with all of the retrieved items."""
    if allowed_results_per_page < 1 or allowed_results_per_page > 50:
        raise ValueError("Allowed results per page must be between 1 and 50")

    start_index = 0
    data = []

    # get the first page of data
    page_response = self.__get_page(resource, start_index, params=params,
                                    allowed_results_per_page=allowed_results_per_page, **kwargs)
    page_json = page_response.json()
    total_results = page_json['meta']['pageInfo']['totalResults']
    start_index = page_json['meta']['pageInfo']['startIndex']
    data.extend(page_json.get('data'))

    # if we got less data back than the totalResults field of the first page reported,
    # get all remaining pages using a ThreadPool
    if len(data) < total_results:
        with ThreadPool(15) as pool:
            # set up the args for each __get_page call so each page's start index
            # increases by the number of allowed results per page
            # BUGFIX: the range must start at start_index + allowed_results_per_page, not 0,
            # because the first page has already been fetched above
            get_page_args = [[resource, idx, params, allowed_results_per_page]
                             for idx in range(start_index + allowed_results_per_page,
                                              total_results, allowed_results_per_page)]
            pages_data = pool.starmap(self.__get_page, get_page_args)

            # when the thread pool finishes, assemble all the data sequentially
            for page in pages_data:
                pg_json = page.json()
                data.extend(pg_json.get('data'))

    return data
```
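
To check the corrected pagination logic without a live Jama instance, here is a standalone, self-contained distillation with a stubbed page fetcher; everything below is illustrative and not the project's code:

```python
from multiprocessing.pool import ThreadPool

TOTAL = 120
PER_PAGE = 50

def get_page(start_index, per_page=PER_PAGE):
    """Stub fetcher: returns the slice [start_index, start_index + per_page) of a fake dataset."""
    return {
        'meta': {'pageInfo': {'totalResults': TOTAL, 'startIndex': start_index}},
        'data': list(range(start_index, min(start_index + per_page, TOTAL))),
    }

# first page, as in __get_all
first = get_page(0)
data = list(first['data'])
start_index = first['meta']['pageInfo']['startIndex']
total_results = first['meta']['pageInfo']['totalResults']

# remaining pages, starting one page past the already-fetched first page
if len(data) < total_results:
    with ThreadPool(15) as pool:
        starts = range(start_index + PER_PAGE, total_results, PER_PAGE)
        for page in pool.map(get_page, starts):
            data.extend(page['data'])

assert len(data) == total_results == len(set(data))  # no duplicates, no gaps
print(len(data))  # 120
```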
