
Karton reanalysis API is slow #650

Open
phretor opened this issue Aug 4, 2022 · 3 comments
Labels
type:improvement Small improvement of existing feature zone:integrations Tasks related with plugins and integrations

Comments

@phretor

phretor commented Aug 4, 2022

I'm not sure whether this is an issue with the API server or the MWDB client. I'm using the following code to re-analyze all samples matching a query:

# Imports assumed by this snippet (not shown in the original post):
from typing import Any, Dict, Iterator, Optional

from loguru import logger
from mwdblib import MWDB, MWDBFile, MWDBObject
from retry import retry
from tqdm import tqdm

# retry_opts is not defined in the issue; placeholder values for illustration.
retry_opts: Dict[str, Any] = {"tries": 3, "logger": logger}


@retry(**retry_opts)
def get_count(mwdb: MWDB, q: str) -> int:
    logger.info("Counting files matching '{}'", q)
    return mwdb.count_files(q)


@retry(**retry_opts)
def fetch_files(mwdb: MWDB, q: str) -> Iterator[MWDBFile]:
    return mwdb.search_files(q)


@retry(**retry_opts)
def reanalyze(obj: MWDBObject) -> None:
    obj.reanalyze()


def _reanalyze(obj: MWDBObject) -> Optional[str]:
    try:
        reanalyze(obj)
        return obj.sha256
    except Exception:
        logger.opt(exception=True).error(
            "{} max retries limit exceeded. Skipping.", obj.sha256
        )
        return None


@retry(logger=logger)
def do_work(q: str, n_procs: int):
    mwdb = MWDB()
    total: Optional[int] = None
    files: Optional[Iterator[MWDBFile]] = None

    try:
        total = get_count(mwdb, q)
    except Exception:
        logger.opt(exception=True).error(
            "[get_count] Max number of retries (3). Quitting"
        )
        raise Exception("Error fetching the number of files")

    if total == 0:
        return 0

    logger.info("Found {} files matching the query", total)

    try:
        files = fetch_files(mwdb, q)
    except Exception:
        logger.opt(exception=True).error(
            "[fetch_files] Max number of retries (3). Quitting"
        )
        raise Exception("Error fetching files")

    with tqdm(total=total) as bar:
        for obj in files:
            sha = _reanalyze(obj)
            if sha is not None:  # bar.write(None) would raise; skip failed files
                bar.write(sha)
            bar.update()

    return 0


def main():
    do_work('NOT tag:"foo"', 1)

Each iteration takes 5-10 seconds, which is a lot. The MWDB API is deployed with default options, using the recommended Docker Compose file, so it's one Nginx frontend and 4 uWSGI backends. The machine is otherwise idle and not under load.

I'm trying to understand whether the bottleneck is the iteration over the files iterator or the way I submit files. I see that each iteration ultimately performs a self.api.get(object_type.URL_TYPE, params=params), so that may be the bottleneck. But why is it so slow?

I guess there are no bulk methods in the API, right?
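Lacking a bulk endpoint, one possible client-side workaround is to overlap the blocking requests with a thread pool, assuming the server tolerates concurrent requests. This is only a sketch: `reanalyze_stub` is a hypothetical stand-in, and in real use you would pass `obj.reanalyze` calls instead.

```python
# Hypothetical workaround (not an MWDB feature): since each obj.reanalyze()
# blocks on a single slow POST, issuing several requests from a thread pool
# can hide the per-request latency.
import time
from concurrent.futures import ThreadPoolExecutor


def reanalyze_stub(sha256: str) -> str:
    time.sleep(0.05)  # stand-in for one blocking POST round-trip
    return sha256


hashes = [f"sha-{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    done = list(pool.map(reanalyze_stub, hashes))
elapsed = time.perf_counter() - start

# 8 serial calls would take at least 0.4 s; 4 workers finish in ~0.1 s.
print(len(done), elapsed < 0.4)
```

Note this only hides latency on the client side; if the server itself is the bottleneck, parallel requests may simply queue up there.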

@phretor
Author

phretor commented Aug 4, 2022

After some investigation: if I comment out the obj.reanalyze() call, I can confirm that the iteration itself doesn't take much time.

The bottleneck seems to be here:

    def reanalyze(
        self, arguments: Optional[Dict[str, Any]] = None
    ) -> "MWDBKartonAnalysis":
        """
        Submits new Karton analysis for given object.

        Requires MWDB Core >= 2.3.0.

        :param arguments: |
            Optional, additional arguments for analysis.
            Reserved for future functionality.

        .. versionadded:: 4.0.0
        """
        from .karton import MWDBKartonAnalysis

        arguments = {"arguments": arguments or {}}
        analysis = self.api.post(
            "object/{id}/karton".format(**self.data), json=arguments
        )
        self._expire("analyses")
        return MWDBKartonAnalysis(self.api, analysis)

That is, the POST request is blocking.
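A quick way to verify this is to time each round-trip in isolation. This is a sketch, not mwdblib code: `reanalyze` below is a stub, and in practice you would time `obj.reanalyze()` itself.

```python
# Hypothetical check to confirm where the time goes: wrap the blocking
# call and measure its wall-clock duration.
import time


def timed(call):
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


def reanalyze():  # stub for the blocking POST round-trip
    time.sleep(0.02)
    return "analysis-id"


result, seconds = timed(reanalyze)
print(result, seconds >= 0.02)
```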

@psrok1
Member

psrok1 commented Aug 19, 2022

I think the bottleneck is on the API side: gathering metadata about the created analysis (https://github.com/CERT-Polska/mwdb-core/blob/master/mwdb/resources/karton.py#L130), including status, last_update and processing_in (https://github.com/CERT-Polska/mwdb-core/blob/master/mwdb/model/karton.py#L61). And here lies the huge weakness of the current model: we need to iterate over all tasks currently processing in Karton (get_karton_state) to collect the metadata about the task tree. That's why a massive reanalysis gets slower and slower as it proceeds.

That problem is already referenced in another issue in Karton itself: CERT-Polska/karton#178
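The growth pattern described above can be illustrated with a toy model (hypothetical, purely to show the asymptotics; the real cost sits in Karton's get_karton_state): if every status check scans the full list of in-flight tasks, submitting k analyses costs O(k²) scan work overall.

```python
# Toy model only, not the real get_karton_state: each status check does
# work proportional to the number of tasks currently in flight.
def submit_and_check_status(in_flight, task_id):
    in_flight.append(task_id)
    return len(in_flight)  # cost of one full scan of the task queue


in_flight = []
scan_costs = [submit_and_check_status(in_flight, i) for i in range(100)]
print(sum(scan_costs))  # 1 + 2 + ... + 100 = 5050, quadratic in k
```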

So there are two possible solutions:

  • speed up the analysis (task tree) status inspection in Karton
  • stop returning the analysis status in the reanalysis endpoint response and just return 200 OK if the reanalysis was spawned correctly

@psrok1 psrok1 added the zone:integrations Tasks related with plugins and integrations label Aug 30, 2022
@psrok1 psrok1 added the type:improvement Small improvement of existing feature label Jan 23, 2023
@psrok1
Member

psrok1 commented Mar 16, 2023

We're actually going to speed up analysis status inspection soon: CERT-Polska/karton#207

@psrok1 psrok1 changed the title MWDB API slow or client-side issue? Karton reanalysis API is slow Mar 16, 2023