
Add snapshot testing to CLI & set up AWS mock #13672

Open
wants to merge 32 commits into
base: main

Conversation

blaginin
Contributor

@blaginin blaginin commented Dec 5, 2024

Which issue does this PR close?

Related to #13456 (comment)

Rationale for this change

We currently don't test whether our external integrations actually work; as a result, #13576 happened.

What changes are included in this PR?

Integration tests for S3.

Are there any user-facing changes?

No.

@blaginin blaginin changed the title from "Add snap to CLI & set up AWS mock" to "Add snapshot testing to CLI & set up AWS mock" on Dec 6, 2024
datafusion-cli/tests/integration_setup.bash (outdated, resolved)
datafusion-cli/tests/snapshots/[email protected] (outdated, resolved)
datafusion-cli/tests/cli_integration.rs (outdated, resolved)
@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Dec 10, 2024
@blaginin
Contributor Author

This PR is ready-ish for review, but I want to merge #13576 first so that this one has a smaller diff.

# Conflicts:
#	datafusion-cli/Cargo.lock
Comment on lines 183 to 189
- name: Setup Minio - S3-compatible storage
  working-directory: datafusion-cli
  run:
    echo "MINIO_CONTAINER=$(docker run -d -p 9000:9000 -e MINIO_ROOT_USER=TEST-DataFusionLogin -e MINIO_ROOT_PASSWORD=TEST-DataFusionPassword quay.io/minio/minio server /data)" >> $GITHUB_ENV
- name: Run tests (excluding doctests, but with integration tests)
  working-directory: datafusion-cli
  run: cargo test --lib --tests --bins --all-features
Contributor Author

@alamb you suggested copying the object_store approach for testing.

In object_store, they use Localstack for S3 simulation. It works fine for testing, but the problem is that it doesn't actually validate the credentials.

In another part of object_store, Minio is used, and it does validate credentials. So, I think we should switch to using Minio for testing here.

# Conflicts:
#	datafusion/common/src/config.rs
#	datafusion/core/tests/config_from_env.rs
#	datafusion/sqllogictest/test_files/information_schema.slt
#	docs/source/user-guide/configs.md
@github-actions github-actions bot removed documentation Improvements or additions to documentation sql SQL Planner core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Dec 20, 2024
@blaginin blaginin marked this pull request as ready for review December 20, 2024 18:04
@blaginin blaginin requested a review from findepi December 20, 2024 18:06
Contributor

@alamb alamb left a comment

Thank you @blaginin -- I think this is a great idea ❤️

I left some comments -- let me know what you think.

I have two concerns with this PR:

  1. The number of dependencies that are added for tests that have to be run manually (that is, everyone who checks out datafusion, including CI, will have to compile a bunch of new crates but likely won't run these tests).
  2. The test is not run automatically in CI, so someone has to remember to run it. I predict this means it will slowly bit rot (that is, not get run, break, and no one will notice).


glob!("sql/*.sql", |path| {
    let input = fs::read_to_string(path).unwrap();
    assert_cmd_snapshot!(cli().pass_stdin(input))
Contributor

FWIW we have had really nice luck in influxdb_iox using insta. Instead of external snapshot files, we used inline snapshots, so the expected results sit inline with the test code.

It looks like this:

   fn test_union_not_nested() {
        let plan = Arc::new(UnionExec::new(vec![other_node()]));
        let opt = NestedUnion;
        insta::assert_yaml_snapshot!(
            OptimizationTest::new(plan, opt),
            @r#"
        input:
          - " UnionExec"
          - "   EmptyExec"
        output:
          Ok:
            - " UnionExec"
            - "   EmptyExec"
        "#
        );
    }

Is that possible with assert_cmd_snapshot?

Contributor Author

It is absolutely possible! My idea was to make it closer to slt and separate the code from the data. Setting up the suite is quite hard (we need to create a new user, upload a test file, etc.), but with the current approach, we only need to do it once rather than for every single test.

However, I'm happy to make changes if you'd prefer it to be inline?
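
For illustration, the "separate the code from the data" idea can be sketched with only the standard library. This is not how insta stores or compares snapshots; `render` and `mismatched_snapshots` are hypothetical names, and the `.snap` files here are assumed to hold plain expected output. One loop covers every input file, so setup is written once.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for running the CLI on one input file.
fn render(sql: &str) -> String {
    format!("echo: {}", sql.trim())
}

/// Returns the `.sql` inputs whose rendered output differs from the
/// sibling `.snap` file (assumed layout: `a.sql` next to `a.snap`).
fn mismatched_snapshots(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut failures = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.extension().and_then(|e| e.to_str()) != Some("sql") {
            continue;
        }
        let actual = render(&fs::read_to_string(&path)?);
        // A missing snapshot reads as empty, so it counts as a mismatch.
        let expected = fs::read_to_string(path.with_extension("snap")).unwrap_or_default();
        if actual != expected {
            failures.push(path);
        }
    }
    failures.sort();
    Ok(failures)
}
```

Adding a new case then means adding a `.sql` file and its expected output, with no change to the test code.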

Contributor

I personally prefer it to be inline, but perhaps we can do that as a follow-on PR.

(The way we do this in influxdb_iox is that we have a test function that returns a string, and we then compare the string with assert_yaml_snapshot. That way the setup code isn't replicated and we can still have the tests inline.)
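
A minimal sketch of that pattern, using only the standard library so it stands alone: a hypothetical helper `run_query` holds the shared setup and returns the rendered output as a `String`, and each test compares that string with an expected value written next to the call. The helper and its canned output are assumptions for illustration; with insta, the `assert_eq!` would be replaced by an inline snapshot assertion.

```rust
/// Hypothetical stand-in for the shared setup: a real helper would run the
/// CLI against the test bucket; this one fakes the result so the sketch
/// runs on its own.
fn run_query(sql: &str) -> String {
    let rows: Vec<&str> = match sql.trim() {
        "select 1" => vec!["1"],
        _ => vec![],
    };
    let mut out = String::from("rows:\n");
    for row in rows {
        out.push_str("  - ");
        out.push_str(row);
        out.push('\n');
    }
    out
}

#[test]
fn select_one() {
    // The expected output sits next to the call, like an inline snapshot.
    assert_eq!(run_query("select 1"), "rows:\n  - 1\n");
}

#[test]
fn unknown_query_has_no_rows() {
    assert_eq!(run_query("select 2"), "rows:\n");
}
```

Because the setup lives in one function, adding a test costs one call and one expected string.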


```shell
cargo install cargo-insta
cargo insta review
```
Contributor

We use this in influxdb_iox and I find it super useful.

#[case("nd-json")]
#[case("automatic")]
#[test]
fn test_cli_format<'a>(#[case] format: &'a str) {
Contributor

I was wondering why we need all the new aws-sdk-s3 dependencies, and it looks like they are needed in order to programmatically set up the bucket.

I think minio also supports just serving the contents of directories as s3 files

Did you consider configuring minio directly rather than using the AWS S3 SDK just to create a bucket?

I know it doesn't seem like a big deal, but this PR adds many dependencies (that is why Cargo.lock is so big) and we try to keep the dependency chain down as much as possible (it is already large).

I realize it is only for datafusion-cli dev, but that dev directly impacts maintenance.

Contributor Author

> I think minio also supports just serving the contents of directories as s3 files

I believe this feature got removed: minio/minio#15496 (comment)

However, I did something similar and removed the AWS SDK. Would this work?

@blaginin
Contributor Author

blaginin commented Dec 21, 2024

> The test is not run automatically in CI and thus someone has to remember to run it. I predict this means it will slowly bit rot (aka not get run and break and no one will notice)

Thank you for checking!! I actually think it does run in CI (hence so many commits in this PR to make it work 😀)

https://github.com/apache/datafusion/actions/runs/12435578988/job/34722247404?pr=13672

[Screenshot: Firefox 2024-12-21 09 59 34]

Contributor

@alamb alamb left a comment

Thank you @blaginin

I think this is a nice improvement in testing.

Can you please try to remove the need for the tens of new dependencies that come in with aws-sdk-s3?

Otherwise, from my perspective, this PR is ready to go.

# Conflicts:
#	.github/workflows/rust.yml
@blaginin
Contributor Author

blaginin commented Jan 9, 2025

Sorry, I was on holiday. I will remove aws-sdk-s3 now; that's a very fair point.

Labels
development-process Related to development process of DataFusion
3 participants