Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test which need to copy data files fail in CI #201

Open
laraPPr opened this issue Nov 4, 2024 · 9 comments · May be fixed by #205
Open

Test which need to copy data files fail in CI #201

laraPPr opened this issue Nov 4, 2024 · 9 comments · May be fixed by #205

Comments

@laraPPr
Copy link
Collaborator

laraPPr commented Nov 4, 2024

The CI creates a temporary directory with a copy of the EESSI test-suite from where the jobs get submitted. In the job the tests than tries to find the input scripts for CP2K in that temporary directory where it cannot find it.
The default is set to RFM_CHECK_SEARCH_PATH=${TEMPDIR}/test-suite/eessi/testsuite/tests/ in the CI.

Error:

cp: cannot stat '/tmp/rfm.2SxhMpngRk/test-suite/eessi/testsuite/tests/apps/cp2k/input/QS/H2O-32.inp': No such file or directory

Example script generated by ReFrame on Ugent tier2 system

#!/bin/bash
#SBATCH --job-name="rfm_EESSI_CP2K_613b66ce"
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --time=2:0:0
#SBATCH --export=None
#SBATCH --clusters=skitty
#SBATCH --mem=10120M
module swap cluster/skitty
module load vsc-mympirun
module load CP2K/2023.1-foss-2023a
export OMP_NUM_THREADS=1
cp /tmp/rfm.2SxhMpngRk/test-suite/eessi/testsuite/tests/apps/cp2k/input/QS/H2O-32.inp ./
mympirun --hybrid 2 cp2k.popt -i H2O-32.inp

Output:

 *******************************************************************************
 *   ___                                                                       *
 *  /   \                                                                      *
 * [ABORT]   The specified OLD file <H2O-32.inp> cannot be opened. It does not *
 *  \___/                       exist. Data directory path:                    *
 *    |        /apps/gent/RHEL9/zen4-ib/software/CP2K/2023.1-foss-2023a/data   *
 *  O/|                                                                        *
 * /| |                                                                        *
 * / \                                                   common/cp_files.F:396 *
 *******************************************************************************
@laraPPr
Copy link
Collaborator Author

laraPPr commented Nov 4, 2024

it also happens with the LAMMPS rhodo case

LAMMPS (2 Aug 2023 - Update 2)
  using 1 OpenMP thread(s) per MPI task
ERROR: Cannot open file data.rhodo: No such file or directory (src/read_data.cpp:367)
Last command: read_data       data.rhodo
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[50886,1],35]
  Exit code:    1
--------------------------------------------------------------------------

@laraPPr laraPPr changed the title CP2K tests failing in CI Test which need to copy data files fail in CI Nov 4, 2024
@casparvl
Copy link
Collaborator

casparvl commented Nov 5, 2024

I'd consider this a bug in the CP2K test. Tests should not assume that the original prefix that defines the test are still available. That's what the staging dir is for: that one is guaranteed to be available, and that's where the input files should come from.

Why does it even need to do this copy:

cp /tmp/rfm.2SxhMpngRk/test-suite/eessi/testsuite/tests/apps/cp2k/input/QS/H2O-32.inp ./

that input dir should be in the staging dir, which is the current workingdir of the test? Can't we just make it do:

mympirun --hybrid 2 cp2k.popt -i input/QS/H2O-32.inp

Without the copy?

@casparvl
Copy link
Collaborator

casparvl commented Nov 5, 2024

Yes, I'm pretty sure now: this line should be different, and just have the full (relative) path to the input file, e.g. input/QS/H2O-21.inp. And then we remove the prerun_cmds

@laraPPr
Copy link
Collaborator Author

laraPPr commented Nov 5, 2024

Yes I think the CP2K test should do something similar than the LAMMPS test which uses ReFrame to add the file to the staging dir. but their it also is failing so maybe than it is a ReFrame bug because it seems to only do the copy to the stage dir in the job when using readonly_files.

@casparvl
Copy link
Collaborator

casparvl commented Nov 5, 2024

Hmm, does the LAMMPS test work in the CI run? I'd assume that gives the same issue, since read_only files are symlinked instead of copied. So you'd then have a symlink to /tmp on a batch node that points to a non-existant location. https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.readonly_files Thus I'd expect this to also be broken.

I guess the use of read_only makes sense if the files are very big (i.e. where the copy-time would be unreasonably long). But the implicit assumption then is that the directory in which the test suite is installed or cloned is available on the batch nodes too. If we want to make that assumption, we need to alter how we run the CI pipelines and make sure that checkout is on a shared filesystem. One can assume the homedir to be on a shared FS, and just checkout there, but making assumptions could bite us too. I'm not sure... Also, it feels a bit weird to check this out in the homedir since it's really temporary data: you only want to keep it to run the CI run, then throw it away again.

Another option is to use the resourcesdir from here. But that'd require additional configuration. I'd consider that the way to go for really large files - files that are too large to copy, but also to even host in our test suite repo. We should think about how we do that though. One can do it as part of the main repo through git lfs (in which case the read_only would be the right solution, not the resourcesdir), or package it separately alltogether in a separate repo, that can be cloned and pointed to with the resourcesdir. The latter has the disadvantage that you may need to keep checkouts of both repos in sync (if you update a test, you probably also need to update the content of your resourcesdir).

Anyway, for this test, and for LAMMPS, I would suggest we not use the read_only stuff. It's small enough to be staged, no problem. And here, we omit the explicit copy altogether and just pick it up from the stagedir.

@laraPPr
Copy link
Collaborator Author

laraPPr commented Nov 5, 2024

Yes indeed the rhodo test of lammps does not work. I'm ok with not using read-only for LAMMPS and than also fixing CP2K but I started using readonly-files because we said originally that we shouldn't copy that file so much.

One of the suggestions I can remember from when we had that discussion was to use a cvmfs repo for the test data. So than we could point resourcesdir to the cvmfs repo.

It is maybe better to put something in place before it starts getting out of hand with the copying of data files.

@casparvl
Copy link
Collaborator

casparvl commented Nov 6, 2024

I've been thinking about this a bit. What concerns me here is the fact that constructs like hosting the data files in a CVMFS repo, or using the read-only construct make it harder for end-users to run the test suite. It's already not super easy with the requirement of setting up the ReFrame config file. Using the read-only construct puts in the implicit requirement that the test suite be run from a shared directory. That is something that is bound to bite some users. Yes, we can document it, but who reads that front-to-back?

Honestly, saving that one copy of 6MB (the size of the LAMMPS rhodo data file) isn't worth it for me. Even on a somewhat slugish filesystem, it'd only take 0.1s for each partition for which it needs to be staged. That's completely negligible compared to the runtime, scheduling overhead, etc of those tests. If you're really concerned about the copies, I'd suggest looking into test fixtures to see if we can somehow use those to make sure the input file only gets staged once, instead of once per test instance.

I'm not 100% sure how to get this done. I think we'd have to define the fixture in a separate directory, so as to avoid the source files being staged with the main tests. Then, the fixture test itself needs to only stage the data. I guess we could make it a RunOnlyRegressionTest where the run command does nothing (we can check how RunOnlyRegressionTest implements the no-op for the compile stage, and do the same for the run stage), and the sanity check just checks for the presence of the data file in the staging dir. We then call the fixture from the real LAMMPS test like so:

stage_input = fixture('stage_input_file', sope=<session>)

Note that even in case of many test sizes (node counts / core counts), it doesn't change the fact that the copy-overhead is tiny. So I'm not sure if this is worth the complexity. But it does reduce disk useage for the stage dirs. Right now, on my system, I already get 52 test instances for LAMMPS (2 partitions, 2 modules, 13 sizes), so that'd be 300 MB. Not the end of the world, but it could have been 6 MB :)

For really large test inputs: I like the idea of hosting that in CVMFS and pointing the resourcesdir there. But note that this does still suffer from a versioning issue, i.e. if you update a test in the test suite, you might have to update an existing source file. We can avoid that by using versioned prefixes (something like /cvmfs/test-resources.eessi.io/v0.4.0 for version 0.4.0 of the test suite), and benefit from the deduplication that CVMFS offers. Again, I wouldn't want that running the test suite becomes unnecessarily difficult though. So, it'd require programming some logic. We could probably benefit from the eessi_mixin class to take that logic and execute it. It should do something like:

  • Check for the necessary input file in /cvmfs/test-resources.eessi.io/<version>
  • If that doesn't exist, check for a resourcesdir in the ReFrame config. This would offer a fallback option for people who don't have CVMFS deployed (and/or don't have it deployed on the batch nodes) and want to test on local modules.
  • If that doesn't exist, print a warning message that this test needs big input t(and where the user is expected to be able to get that), and suggesting that one downloads that and configures the resourcesdir. Then, skip the test.

@casparvl
Copy link
Collaborator

casparvl commented Nov 6, 2024

I'd try something like:

A/large_input
A/stage_testfiles.py
B/actual_test.py

with A/stage_testfiles.py

class stage_testfiles(rfm.CompileOnlyRegressionTest):
    def compile(self):
    '''no op'''

and B/actual_test.py:

class actual_test(rfm.RunOnlyRegressionTest):
    stage_files = fixture(stage_testfiles, scope='session')
    executable_opts = ['-i %s' % self.stage_files.stagedir]

I've asked on ReFrame Slack if they think this would work, or if there is a better way to do this.

@casparvl
Copy link
Collaborator

casparvl commented Nov 6, 2024

Response on Slack from Vasileios:

"""
Yes, that would almost work and it’s what fixtures were made for. To make it work, a couple of suggestions:

  1. No need to make the stage_testfiles a compile-only test and override the compile method. This will give you a warning (or maybe error) as compile() is a final method, unless you pass the special=True metaclass argument. But you don’t need that. You can define it as run-only with executable = "true".
  2. For the actual_test you can’t access the fixture in the class body. You can only access after setup stage, so usually fixture access happens in a post-setup or pre-compile/pre-run hook.

"""

So, this could be a way to go.

@casparvl casparvl linked a pull request Nov 14, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants