Docminorfixes1021 #1281

8 changes: 5 additions & 3 deletions docs/howto/load_data_from_external_data_stores.md
@@ -3,7 +3,9 @@

SnappyData comes bundled with the libraries to access HDFS (Apache compatible). You can load your data using SQL or DataFrame API.

## Example - Loading data from CSV file using SQL
## Example - Loading Data from CSV File using SQL

The following example demonstrates how you can load data from the CSV file, in a local file system, by using SQL:

```pre
// Create an external table based on CSV file
@@ -14,7 +16,7 @@ CREATE TABLE CUSTOMER using column options() as (select * from CUSTOMER_STAGING_
```

!!! Tip
Similarly, you can create an external table for all data sources and use SQL "insert into" query to load data. For more information on creating external tables refer to, [CREATE EXTERNAL TABLE](../reference/sql_reference/create-external-table/)
Similarly, you can create an external table for all data sources and use SQL "insert into" query to load data. For more information on creating external tables refer to, [CREATE EXTERNAL TABLE](../reference/sql_reference/create-external-table/).


## Example - Loading CSV Files from HDFS using API
@@ -73,7 +75,7 @@ val df = session.createDataFrame(rdd, ds.schema)
df.write.format("column").saveAsTable("columnTable")
```

## Importing Data using JDBC from a relational DB
## Importing Data using JDBC from Relational DB

!!! Note
Before you begin, you must install the corresponding JDBC driver. To do so, copy the JDBC driver jar file in **/jars** directory located in the home directory and then restart the cluster.
37 changes: 20 additions & 17 deletions docs/howto/load_data_into_snappydata_tables.md
@@ -3,32 +3,31 @@

SnappyData relies on the Spark SQL Data Sources API to parallelly load data from a wide variety of sources. By integrating the loading mechanism with the Query engine (Catalyst optimizer) it is often possible to push down filters and projections all the way to the data source minimizing data transfer. Here is the list of important features:

**Support for many Sources** </br>There is built-in support for many data sources as well as data formats. Data can be accessed from S3, file system, HDFS, Hive, RDB, etc. And the loaders have built-in support to handle CSV, Parquet, ORC, Avro, JSON, Java/Scala Objects, etc as the data formats.
* **Support for many Sources** </br>There is built-in support for many data sources as well as data formats. Data can be accessed from S3, file system, HDFS, Hive, RDB, etc. Moreover, loaders have built-in support to handle CSV, Parquet, ORC, Avro, JSON, Java/Scala Objects, etc. as the data formats.
* **Access virtually any modern data store**</br> Virtually all major data providers have a native Spark connector that complies with the Data Sources API. For example, you can load data from any RDB like Amazon Redshift, Cassandra, Redis, Elastic Search, Neo4J, etc. While these connectors are not built-in, you can easily deploy these connectors as dependencies into a SnappyData cluster. All the connectors are typically registered in spark-packages.org.
* **Avoid Schema wrangling** </br>Spark supports schema inference. Which means, all you need to do is point to the external source in your 'create table' DDL (or Spark SQL API) and schema definition is learned by reading in the data. There is no need to define each column and type explicitly. This is extremely useful when dealing with disparate, complex and wide data sets.
* **Read nested, sparse data sets**</br> When data is accessed from a source, the schema inference occurs by not just reading a header but often by reading the entire data set. For instance, when reading JSON files, the structure could change from document to document. The inference engine builds up the schema as it reads each record and keeps unioning them to create a unified schema. This approach allows developers to become very productive with disparate data sets.
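
The "keeps unioning them" behavior described in the last item can be illustrated with a small, self-contained sketch. This is not Spark's inference code: the `Schema` alias, the `union` and `infer` helpers, and the rule of falling back to `string` on a type conflict are all assumptions made for illustration only.

```scala
// Hypothetical, simplified model: a "schema" is a map from field name to type name.
// This is an illustration of the idea, not Spark's actual inference engine.
object SchemaUnionSketch {
  type Schema = Map[String, String]

  // Merge two schemas: keep every field seen so far. On a type conflict,
  // fall back to "string" (an assumption made for this sketch).
  def union(a: Schema, b: Schema): Schema =
    b.foldLeft(a) { case (acc, (name, tpe)) =>
      acc.get(name) match {
        case Some(existing) if existing != tpe => acc.updated(name, "string")
        case _                                 => acc.updated(name, tpe)
      }
    }

  // Build the unified schema by folding over the schema of each record.
  def infer(records: Seq[Schema]): Schema =
    records.foldLeft(Map.empty[String, String])(union)
}
```

Fields that appear only in some records are kept in the unified schema, which is why sparse documents can be read without declaring every column up front.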

**Access virtually any modern data store**</br> Virtually all major data providers have a native Spark connector that complies with the Data Sources API. For e.g. you can load data from any RDB like Amazon Redshift, Cassandra, Redis, Elastic Search, Neo4J, etc. While these connectors are not built-in, you can easily deploy these connectors as dependencies into a SnappyData cluster. All the connectors are typically registered in spark-packages.org

**Avoid Schema wrangling** </br>Spark supports schema inference. Which means, all you need to do is point to the external source in your 'create table' DDL (or Spark SQL API) and schema definition is learned by reading in the data. There is no need to explicitly define each column and type. This is extremely useful when dealing with disparate, complex and wide data sets.

**Read nested, sparse data sets**</br> When data is accessed from a source, the schema inference occurs by not just reading a header but often by reading the entire data set. For instance, when reading JSON files the structure could change from document to document. The inference engine builds up the schema as it reads each record and keeps unioning them to create a unified schema. This approach allows developers to become very productive with disparate data sets.

**Load using Spark API or SQL** </br> You can use SQL to point to any data source or use the native Spark Scala/Java API to load.
For instance, you can first [create an external table](../reference/sql_reference/create-external-table.md).
## Loading Data using Spark API or SQL
You can use SQL to point to any data source or use the native Spark Scala/Java API to load. For instance, you can first [create an external table](../reference/sql_reference/create-external-table.md).

```pre
CREATE EXTERNAL TABLE <tablename> USING <any-data-source-supported> OPTIONS <options>
```

Next, use it in any SQL query or DDL. For example,


```pre
CREATE EXTERNAL TABLE STAGING_CUSTOMER USING parquet OPTIONS(path 'quickstart/src/main/resources/customerparquet')

CREATE TABLE CUSTOMER USING column OPTIONS(buckets '8') AS ( SELECT * FROM STAGING_CUSTOMER)

```

**Example - Load from CSV**
## Example - Loading Data from CSV

You can either explicitly define the schema or infer the schema and the column data types. To infer the column names, we need the CSV header to specify the names. In this example we don't have the names, so we explicitly define the schema.
You can either explicitly define the schema or infer the schema and the column data types. To infer the column names, we need the CSV header to specify the names. In this example we do not have the names, so we explicitly define the schema.

```pre
// Get a SnappySession in a local cluster
@@ -56,7 +55,7 @@ snSession.sql("CREATE TABLE CUSTOMER ( " +
"USING COLUMN OPTIONS (PARTITION_BY 'C_CUSTKEY')")
```

**Load data in the CUSTOMER table from a CSV file by using Data Sources API**
**Load Data in the CUSTOMER Table from a CSV File by using Data Sources API**

```pre
val tableSchema = snSession.table("CUSTOMER").schema
@@ -66,16 +65,16 @@ customerDF.write.insertInto("CUSTOMER")

The [Spark SQL programming guide](https://spark.apache.org/docs/2.1.1/sql-programming-guide.html#data-sources) provides a full description of the Data Sources API.

**Example - Load from Parquet files**
## Example - Loading Data from Parquet Files

```pre
val customerDF = snSession.read.parquet(s"$dataDir/customer_parquet")
customerDF.write.insertInto("CUSTOMER")
```

**Inferring schema from data file**
**Inferring Schema from Data File**

A schema for the table can be inferred from the data file. Data is first introspected to learn the schema (column names and types) without requring this input from the user. The example below illustrates reading a parquet data source and creates a new columnar table in SnappyData. The schema is automatically defined when the Parquet data files are read.
A schema for the table can be inferred from the data file. Data is first introspected to learn the schema (column names and types) without requiring this input from the user. The example below illustrates reading a parquet data source and creates a new columnar table in SnappyData. The schema is automatically defined when the Parquet data files are read.

```pre
val customerDF = snSession.read.parquet(s"quickstart/src/main/resources/customerparquet")
@@ -100,11 +99,15 @@ customer_csv_DF.write.format("column").mode("append").options(props1).saveAsTabl

The source code to load the data from a CSV/Parquet files is in [CreateColumnTable.scala](https://github.com/SnappyDataInc/snappydata/blob/master/examples/src/main/scala/org/apache/spark/examples/snappydata/CreateColumnTable.scala).

**Example - reading JSON documents**
## Example - Reading JSON Documents
As mentioned before, when dealing with JSON you have two challenges: (1) the data can be highly nested, and (2) the structure of the documents can keep changing.

Here is a simple example that loads multiple JSON records that show dealing with schema changes across documents - [WorkingWithJson.scala](https://github.com/SnappyDataInc/snappydata/blob/master/examples/src/main/scala/org/apache/spark/examples/snappydata/WorkingWithJson.scala)
Here is a simple example that loads multiple JSON records that show dealing with schema changes across documents: [WorkingWithJson.scala](https://github.com/SnappyDataInc/snappydata/blob/master/examples/src/main/scala/org/apache/spark/examples/snappydata/WorkingWithJson.scala)

!!! Note

When loading data from sources like CSV or Parquet the files would need to be accessible from all the cluster members in SnappyData. Make sure it is NFS mounted or made accessible through the Cloud solution (shared storage like S3).

## Troubleshooting Tip
When reading or writing CSV/Parquet to and from S3, the `ConnectionPoolTimeoutException` error may be reported. To avoid this error, in the Spark context, set the value of the `fs.s3a.connection.maximum` property to a number greater than the possible number of partitions. </br>
For example, `snc.sparkContext.hadoopConfiguration.set("fs.s3a.connection.maximum", "1000")`
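
A slightly fuller sketch of the same setting is shown below, using a plain `SparkSession` rather than the `snc` context from the tip. The object name, the application name, the value `1000`, and taking the S3 path from the first command-line argument are assumptions for illustration. The property is set before the first S3 read here on the assumption that the file system client picks it up when it is first created.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: raise the S3A connection pool limit before reading from S3.
object S3ConnectionPoolSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3a-pool-sketch").getOrCreate()

    // Assumption: 1000 is larger than the number of partitions read or written.
    spark.sparkContext.hadoopConfiguration.set("fs.s3a.connection.maximum", "1000")

    // args(0) is expected to be an s3a:// path to Parquet data (supplied by the caller).
    val df = spark.read.parquet(args(0))

    // Print the partition count so it can be compared against the configured maximum.
    println(s"partitions: ${df.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

If the printed partition count approaches the configured maximum, a larger value may be needed.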
2 changes: 1 addition & 1 deletion docs/programming_guide/tables_in_snappydata.md
@@ -31,7 +31,7 @@ CREATE TABLE [IF NOT EXISTS] table_name
)
[AS select_statement];

DROP TABLE [IF EXISTS] table_name
DROP TABLE [IF EXISTS] table_name;
```

Refer to the [Best Practices](../best_practices/design_schema.md) section for more information on partitioning and colocating data and [CREATE TABLE](../reference/sql_reference/create-table.md) for information on creating a row/column table.</br>
4 changes: 2 additions & 2 deletions docs/reference/command_line_utilities/modify_disk_store.md
@@ -16,6 +16,8 @@ Snappy>create region --name=regionName --type=PARTITION_PERSISTENT_OVERFLOW

**For non-secured cluster**

## Description

The following table describes the options used for `snappy modify-disk-store`:

| Items | Description |
@@ -27,8 +29,6 @@ The following table describes the options used for `snappy modify-disk-store`:
!!! Note
The name of the disk store, the directories its files are stored in, and the region to target are all required arguments.

## Description

## Examples

**Secured cluster**