This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

Cannot complete the test run #32

Open
HarryLiUS opened this issue Jul 13, 2018 · 11 comments

Comments

@HarryLiUS

Hello,

I followed the instructions to do a local test run. The first three steps completed successfully. At step 4, the table creation took a little over 10 minutes, which was longer than I expected, but it completed. Here is the output:

==============================================
TPC-DS On Spark Menu

SETUP
(1) Create spark tables
RUN
(2) Run a subset of TPC-DS queries
(3) Run All (99) TPC-DS Queries
CLEANUP
(4) Cleanup
(Q) Quit

Please enter your choice followed by [ENTER]: 1

INFO: Creating tables. Will take a few minutes ...
INFO: Progress : [########################################] 100%
INFO: Spark tables created successfully..
Press any key to continue

After the tables were created successfully, I tried to run query 1, and here is what I got:

==============================================
TPC-DS On Spark Menu

SETUP
(1) Create spark tables
RUN
(2) Run a subset of TPC-DS queries
(3) Run All (99) TPC-DS Queries
CLEANUP
(4) Cleanup
(Q) Quit

Please enter your choice followed by [ENTER]: 2

Enter a comma separated list of queries to run (ex: 1, 2), followed by [ENTER]:
1
INFO: Checking pre-reqs for running TPC-DS queries. May take a few seconds..
ERROR: The rowcounts for TPC-DS tables are not correct. Please make sure option 1
is run before continuing with currently selected option
Press any key to continue

I repeated this and it did not help.

Checking rowcounts.rrn, it is all 0s.

And here is the output from spark-shell in step 3:

scala> spark.conf
res0: org.apache.spark.sql.RuntimeConfig = org.apache.spark.sql.RuntimeConfig@505bc480
scala> spark.conf.get("spark.sql.catalogImplementation")
res1: String = hive

Thank you for the help,
Harry
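
Regarding rowcounts.rrn being all 0s: one quick way to see the mismatch directly is to diff the generated file against the expected one. A minimal sketch, assuming rowcounts.rrn sits under the work directory and rowcounts.expected ships with the repo (both paths are assumptions, not confirmed in this thread):

# hedged sketch: compare generated vs. expected rowcounts
# adjust both paths to your TPCDS_WORK_DIR and repo layout
diff "$TPCDS_WORK_DIR/rowcounts.rrn" "$TPCDS_ROOT_DIR/src/rowcounts.expected"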

@stevemar

@dilipbiswal can you give this a look when you have a minute ^

@HarryLiUS
Author

Hello @stevemar @dilipbiswal,
Do you have any update?

Also, I have questions about the tpcdsenv.sh variables. For the error above, I used the defaults, except for pointing the root directory to my TPC-DS installation directory. Here is the tpcdsenv.sh:

harry.li@perf84:/usr/local/harry/tpcds/spark-tpc-ds-performance-test$ cat bin/tpcdsenv.sh
#!/bin/bash
#
# tpcdsenv.sh - UNIX Environment Setup
#

#######################################################################
# This is a mandatory parameter. Please provide the location of
# spark installation.
#######################################################################
export SPARK_HOME=/usr/local/harry/spark

#######################################################################
# Script environment parameters. When they are not set the script
# defaults to paths relative from the script directory.
#######################################################################

export TPCDS_ROOT_DIR=/usr/local/harry/tpcds/spark-tpc-ds-performance-test
export TPCDS_LOG_DIR=
export TPCDS_DBNAME=
export TPCDS_WORK_DIR=
harry.li@perf84:/usr/local/harry/tpcds/spark-tpc-ds-performance-test$

My questions are:

  1. Are these settings good for testing with ./bin/tpcdsspark.sh?
  2. If I need to move my database from local disk to HDFS, what changes are needed? I tried changing the settings as follows, and it does not work.

export TPCDS_ROOT_DIR=/usr/local/harry/tpcds/spark-tpc-ds-performance-test
export TPCDS_LOG_DIR=hdfs:///TPC-DS/logDir
export TPCDS_DBNAME=hdfs:///TPC-DS/dbDir
export TPCDS_WORK_DIR=hdfs:///TPC-DS/workDir

Please advise and thanks in advance.
Harry
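
On the HDFS question, one possible direction (an assumption on my part, not something confirmed in this thread) is that the table location is governed by Spark's spark.sql.warehouse.dir rather than by the script's log/work directories, which presumably need to stay on the local filesystem for ordinary shell commands to work. A minimal sketch, assuming the tables are (re)created from a spark-shell session:

# hedged sketch: point Spark's managed-table warehouse at HDFS
# (the hdfs:///TPC-DS/warehouse path is only an example)
$SPARK_HOME/bin/spark-shell --conf spark.sql.warehouse.dir=hdfs:///TPC-DS/warehouse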

@cruizen

cruizen commented Dec 20, 2018

@HarryLiUS Can you run step 4 (cleanup) to clean all data and start from scratch? I think you may have run dsdgen to generate data at a different scale factor.

@mosayyebzadeh

@HarryLiUS Were you able to solve the problem? I am facing the same issue.

@ViRaL95

ViRaL95 commented Jun 6, 2020

Has anyone resolved this?

@mosayyebzadeh

I could not make it work with Spark 3.0.0. But after switching to Spark 2.4.5 the problem went away.

@fbaig

fbaig commented May 13, 2021

It works without any modifications with Spark 2.4.5 and Spark 2.4.7, but it requires some changes to run with Spark 3.0.1. Actually, the fix does not even relate to Spark itself. There is a check that compares the row counts from the generated data against the expected results, and it fails because it compares file contents verbatim: newer versions of Spark emit new warnings that get added to the beginning of the generated file and thus break the comparison with the expected result.
Here are the steps to make it work with Spark 3.0.1 (see the sketch after this list):

  • In bin/tpcdsspark.sh, in the function check_createtables()
  • Before the file comparison check, i.e. if cmp -s "$file1" "$file2"
  • If you are on Mac: sed -i '' '/^W/d' $file1
  • If you are on Linux: sed -i '/^W/d' $file1
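
A minimal sketch of that edit, assuming check_createtables() compares the two files with cmp (anything beyond $file1/$file2 is illustrative, not the repo's exact code):

# hedged sketch: strip lines starting with "W" (the new warnings)
# from the generated rowcounts file before comparing it
sed -i '/^W/d' "$file1"        # on macOS: sed -i '' '/^W/d' "$file1"
if cmp -s "$file1" "$file2"; then
  echo "INFO: rowcounts match"
else
  echo "ERROR: rowcounts differ"
fi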

@ChenZuzhi

This error occurs when the file rowcounts.rrn and the file rowcounts.expected are not exactly the same.
For me, it turned out that rowcounts.rrn is derived from the log rowcounts.out, and thus contains some unexpected warning lines.
The rowcounts.rrn then looks like:

WARNING: An illegal reflective access operation has occurred...
Setting default log level to "WARN".
6
11718
144067

And the rowcounts.expected looks like:

6
11718
144067

This causes the error in check_createtables().

So here's my solution:
Open the log rowcounts.rrn under the work directory and note the words that occur in lines that should not be in rowcounts.expected; in my case, the words were 'WARNING' and 'Setting'.
Then edit tpcdsspark.sh at line 99, adding | grep -v "WARNING" and | grep -v "Setting"; this filters the unexpected log lines out of rowcounts.out and derives a clean rowcounts.rrn (a sketch follows below).
I then ran "Cleanup" and then "Create spark tables" in tpcdsspark.sh, and everything worked fine for me.
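
A sketch of what that filtered line might look like (only the grep filters come from the comment above; the surrounding command is an assumption about what line 99 does):

# hedged sketch: derive rowcounts.rrn from rowcounts.out while
# dropping the warning lines that newer Spark versions prepend
grep -v "WARNING" rowcounts.out | grep -v "Setting" > rowcounts.rrn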

@ltgoter

ltgoter commented Aug 11, 2022

(quoting @ChenZuzhi's solution above)

For Spark 3.3.0, it needs more filters. I made it work by adding the following filters:

| grep -v "WARNING" | grep -v "Setting" | grep -v "Spark"
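
A variant that avoids chasing new warning prefixes release by release (an assumption on my part, not something proposed in this thread) would be to keep only the purely numeric lines:

# hedged sketch: keep only lines that consist entirely of digits
grep -E '^[0-9]+$' rowcounts.out > rowcounts.rrn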

@shuaiwuyue

Hi @HarryLiUS, have you solved this problem? I checked my rowcounts.rrn and it is also all 0s.

@theosib-amazon

(quoting @ltgoter's filter above)

This is what fixed it for me.
