This repository has been archived by the owner on Jul 22, 2024. It is now read-only.
Hello @v-olmedo,
Thanks for trying out the code pattern. This pattern was initially targeted at developers, and the target platform was a laptop. My thinking was that data generated with a larger scale factor may be too large for a laptop running Spark. That's why I didn't expose the scale factor. Here is the line in the code that currently hard-codes it to 1G:
"2") gen_data $TPCDS_ROOT_DIR '1G' ;;
You can change it to increase the scale factor. Please make sure to move the data to HDFS if you want parallelism in processing. You may also want to partition the data. I have touched on this very briefly in the doc.
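As a rough illustration of exposing the scale factor rather than editing the hard-coded value, the sketch below reads it from an environment variable and falls back to 1G. The variable name `SCALE_FACTOR`, the wrapper `run_option`, and the stub `gen_data` are assumptions made for this example; the pattern's real `gen_data` is defined in its own script and may accept different values.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: SCALE_FACTOR and run_option are illustrative names,
# not part of the code pattern. The default matches the hard-coded '1G'.
TPCDS_ROOT_DIR="${TPCDS_ROOT_DIR:-/tmp/tpcds}"
SCALE_FACTOR="${SCALE_FACTOR:-1G}"

# Stand-in for the pattern's real gen_data function; it only echoes its arguments.
gen_data() {
  echo "gen_data root=$1 scale=$2"
}

# Mirrors the menu branch quoted above, with the literal replaced by the variable.
run_option() {
  case "$1" in
    "2") gen_data "$TPCDS_ROOT_DIR" "$SCALE_FACTOR" ;;
    *)   echo "unsupported option: $1" >&2; return 1 ;;
  esac
}
```

With this shape, exporting `SCALE_FACTOR` before launching the script would select a different size without touching the code, and leaving it unset keeps the current laptop-friendly default.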
Hello @dilipbiswal,
You stated: "Please make sure to move the data to HDFS". Does that mean that dsdgen can't generate the tables in a parallel, distributed manner across a cluster that doesn't use HDFS? Also, for the query execution with dsqgen, I don't seem to get any distributed processing!
I do not see any way to do that.