This repository has been archived by the owner on Sep 3, 2022. It is now read-only.

`datalab` to `google.datalab` Migration Guide

Yasser Elsayed edited this page Apr 13, 2017 · 3 revisions

[Under construction]

As Datalab is moving out of BETA, the API surface has been reworked and polished into a new namespace package google.datalab. This page documents the notable differences between this new namespace and the old datalab, which is planned to be phased out.

While this transition is under way, the old datalab namespace is still included in the Datalab environment for backward compatibility, so existing notebooks run without modification. However, migrating to the new namespace is strongly encouraged, since ongoing support and new features go to google.datalab.

BigQuery Changes

Datalab is moving to the new BigQuery Standard SQL, which is compliant with the SQL 2011 standard. BigQuery Legacy SQL is no longer supported. Please refer to the BigQuery migration guide here for help converting your queries to the new standard.

  • The magics %sql and %bigquery have been removed, their functionality merged into the new %bq magic, which allows you to execute queries, as well as declare Query objects, UDFs, and External Data Sources. See %bq -h for more details.

  • The command structure has been changed for all magics to "%magic resources action". For example, you can list tables by running %bq tables list.

  • Query.extract(), Query.extract_async(), and Query.results() are now all part of Query.execute(). A QueryOutput class has been added that can specify the type of output when executing a query. For example, to execute a query and extract the results into a file, you can do:

    query.execute(QueryOutput.file())

  • Query.sample() and sampling_query() have been replaced by a Sampling class that can specify sampling method when executing a query. Here's an example:

    # use random sampling to get a 2% sample of the query
    query.execute(sampling=Sampling.random(percent=2))

  • Query.to_dataframe() and Query.to_file() can both be done using the QueryOutput object described above. They still also exist on the Table object, which results from query execution.

  • Schema.from_dataframe() is now part of Schema.from_data(), which can recognize the type of the data passed in.

  • Table.to_query() and View.to_query() have both been replaced with static methods on the Query class, particularly Query.from_table() and Query.from_view().

  • View.execute(), View.execute_async(), View.results(), and View.sample() have been removed in favor of building a Query object out of the View object using the static method mentioned above (Query.from_view()), then calling the corresponding methods on the Query object.

  • All SQL parsing has been removed from Datalab. Instead, a SQL query's dependencies (subqueries, UDFs, etc.) are concatenated and sent to the BigQuery service API. This means that arbitrary variable substitution is no longer supported, and the only parameterization allowed is what BigQuery itself offers. You can read more on it here. Datalab offers an easy way to parameterize queries by adding a query_params dictionary parameter.
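
Since parameterization is now delegated to BigQuery, it may help to see what BigQuery named parameters look like. The sketch below builds the queryParameters list in the shape used by the BigQuery REST API for scalar values. The helper name build_query_parameters, the table name, and the column names are hypothetical, and whether Datalab's query_params accepts exactly this structure is an assumption; check %bq -h and the API docs in your environment.

```python
def build_query_parameters(params):
    """Convert {name: (bigquery_type, value)} into a BigQuery queryParameters list.

    Handles scalar values only; arrays and structs need extra fields.
    """
    query_parameters = []
    for name, (type_name, value) in sorted(params.items()):
        query_parameters.append({
            "name": name,
            "parameterType": {"type": type_name},
            # The REST API carries scalar parameter values as strings.
            "parameterValue": {"value": str(value)},
        })
    return query_parameters


# Standard SQL refers to named parameters as @name (hypothetical table and columns).
sql = "SELECT name FROM `my_dataset.my_table` WHERE year = @year AND state = @state"
parameters = build_query_parameters({
    "year": ("INT64", 2017),
    "state": ("STRING", "WA"),
})
```

Unlike the old variable substitution, the values never get spliced into the SQL text; BigQuery binds them on the server side.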

Storage Changes

  • The magic %storage has been renamed to %gcs.

  • Item is now called Object to properly reflect Google Cloud Storage naming.

  • Object now has download(), upload(), read_stream(), and write_stream() methods.

Other API Changes

  • Context has moved to the top-level google.datalab package, and all project_id and credentials arguments now default to the global settings under Context.default(), unless a Context object is passed in. For example, to set the BigQuery billing tier config globally, you can do: Context.default().set_config({'bigquery_billing_tier': 2}).

  • Project and Projects modules have been removed, their functionality now being part of utils.
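
To find the places in an existing notebook that need attention, the renames on this page can be turned into a simple scan. The following is a minimal sketch, not part of Datalab: the function name and the regular expressions are assumptions about what old code typically looks like, and they are heuristics that can produce false positives. It only reports lines; it does not rewrite them, because the new magics use a different command structure.

```python
import re

# (pattern, hint) pairs. The hints come from this guide; the patterns are
# an assumption about how the old names usually appear in notebook cells.
DEPRECATED_PATTERNS = [
    (re.compile(r"^\s*%%?sql\b"), "use the %bq magic"),
    (re.compile(r"^\s*%%?bigquery\b"), "use the %bq magic"),
    (re.compile(r"^\s*%%?storage\b"), "use the %gcs magic"),
    (re.compile(r"^\s*(?:import|from)\s+datalab\b"), "import from google.datalab"),
    (re.compile(r"\.to_query\("), "use Query.from_table() or Query.from_view()"),
]


def find_deprecated_usages(source):
    """Return (line_number, stripped_line, hint) for each line matching an old pattern."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        for pattern, hint in DEPRECATED_PATTERNS:
            if pattern.search(line):
                findings.append((number, line.strip(), hint))
    return findings


# Example: scan the concatenated source of a notebook's cells.
cell_source = "\n".join([
    "import datalab.bigquery as bq",
    "%%sql --module q",
    "SELECT 1",
    "import google.datalab.bigquery as bq2",
    "%storage list",
    "q = table.to_query()",
])
for number, line, hint in find_deprecated_usages(cell_source):
    print("line %d: %s  ->  %s" % (number, line, hint))
```

Lines that already use google.datalab or %bq are not flagged, so the scan can be rerun as the migration progresses.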