A package for running SQL queries, configured for the Analytical Platform. It is a reticulated wrapper around pydbtools, which uses AWS Wrangler's Athena module but adds extra functionality (such as Jinja templating and creating temporary tables) and alters some configuration to our specification.
Alternatively, you probably want to use Rdbtools, which has the advantages of supporting dbplyr and being R-native, so there's no messing with reticulate and Python, which can cause endless problems.
Run the following commands in the R console.
# Set up the project to use renv, if not already done
renv::init()
# Tell renv that Python will be used
renv::use_python()
# Install the reticulate library to interface with Python
renv::install("reticulate")
# Install the Python library pydbtools
reticulate::py_install("pydbtools")
# Install dbtools
renv::install("moj-analytical-services/dbtools")
There is a vignette with more details but the following describes the basics of the package.
library(dbtools)
df <- read_sql_query("SELECT * from a_database.table LIMIT 10")
response <- dbtools::start_query_execution_and_wait(
"CREATE DATABASE IF NOT EXISTS my_test_database"
)
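The same function can run other statements against the new database. As a rough sketch (the table names my_test_database.table_copy and a_database.table below are only illustrative), you could create a table there with a CTAS query and then read it back:

# Create a table in the new database from an existing one (CTAS query);
# the database and table names here are illustrative
response <- dbtools::start_query_execution_and_wait("
  CREATE TABLE my_test_database.table_copy AS
  SELECT * FROM a_database.table LIMIT 100
")

# Read the new table back into a dataframe
df <- dbtools::read_sql_query("SELECT * FROM my_test_database.table_copy")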
dbtools::create_temp_table(
"SELECT a_col, count(*) as n FROM a_database.table GROUP BY a_col",
table_name="temp_table_1"
)
df <- dbtools::read_sql_query("SELECT * FROM __temp__.temp_table_1 WHERE n < 10")
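Temporary tables can also be layered. The sketch below (the filter condition and second table name are illustrative, and it assumes, as in pydbtools, that __temp__ can be referenced when creating further temporary tables) builds a second temporary table from the first and reads it back:

# Build a second temporary table on top of the first
dbtools::create_temp_table(
  "SELECT * FROM __temp__.temp_table_1 WHERE n >= 10",
  table_name="temp_table_2"
)
df <- dbtools::read_sql_query("SELECT * FROM __temp__.temp_table_2")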
dbtools::delete_partitions_and_data(
database='my_database',
table='my_table',
expression='year = 2020 or year = 2021'
)
dbtools::delete_table_and_data(database='my_database', table='my_table')
dbtools::delete_database('my_database')
# These can be used for temporary databases and tables.
dbtools::delete_table_and_data(database='__temp__', table='my_temp_table')
sql_template <- "SELECT * FROM {{ db_name }}.{{ table }}"
sql <- dbtools::render_sql_template(sql_template, list(db_name = db_name, table = "department"))
df <- dbtools::read_sql_query(sql)
cat("SELECT * FROM {{ db_name }}.{{ table_name }}", file="tempfile.sql")
sql <- dbtools::get_sql_from_file("tempfile.sql", jinja_args = list(db_name = db_name, table_name = "department"))
df <- dbtools::read_sql_query(sql)
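Since these helpers just return SQL strings, a template can be rendered repeatedly, for example to read several tables from the same database into a named list of dataframes (the table names below are illustrative):

# Render the sql_template defined above once per table and collect the results
tables <- c("department", "employees")
dfs <- lapply(tables, function(t) {
  sql <- dbtools::render_sql_template(sql_template, list(db_name = db_name, table = t))
  dbtools::read_sql_query(sql)
})
names(dfs) <- tables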
- No longer dependent on s3tools
- Wraps pydbtools functions
- Fixes prompts to install miniconda - now automatically uses the main Analytical Platform Conda Python, based on sys path
- Fixed issue where credentials would not refresh
- Is now dependent on the pydbtools package
- SQL queries like SHOW COLUMNS FROM db.table now work for read_sql and return a df
- Updated the version in the DESCRIPTION file to the correct version
- Removed input parameters bucket and output_folder from the read_sql and get_athena_query_response functions. A new section of the README named 'Under The Hood' explains why.
- Note: the package now requires the group policy StandardDatabaseAccess to be attached to the role that needs to use this package
- Added function read_sql, which reads an SQL query directly into an R dataframe. See the R documentation (i.e. ?read_sql)
- Input parameter out_path in function get_athena_query_response has been replaced by two input parameters: bucket and output_folder. E.g. if your out_path="s3://my-bucket/__temp__" then the new input params are bucket=my-bucket and output_folder=__temp__. Note that output_folder defaults to the value __athena_temp__; it is recommended that you leave this unchanged.
- timeout is now an input parameter to get_athena_query_response; if not set, there is no timeout for the Athena query
- get_athena_query_response will now print out the athena_client response if the Athena query fails