Bulk CSV loading through map reduce

Phoenix provides support for loading CSV data into a new or existing Phoenix table using Hadoop MapReduce. This gives approximately 2x faster performance compared to the existing CSV loader that uses psql.

#### Sample input CSV data:

```
col1_value1, col2_value1, col3_value1
col1_value2, col2_value2, col3_value2
```

#### Compatible Phoenix schema to hold the above CSV data:

```sql
CREATE TABLE Test.Phoenix (row_key bigint not null, m.row_value1 varchar(50), m.row_value2 varchar(50) CONSTRAINT pk PRIMARY KEY (row_key))
```

The resulting table layout:

| Column     | Column family | Type        |
|------------|---------------|-------------|
| row_key    | (primary key) | bigint      |
| row_value1 | m             | varchar(50) |
| row_value2 | m             | varchar(50) |
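
The statement above is also what the loader expects in the file passed via the -sql option described below. A minimal preparation sketch, assuming the file names data.csv and ~/Documents/createTable.sql used in the example later on this page:

```bash
# Write the Phoenix DDL shown above into the file that will be passed to the
# loader via -sql (the location ~/Documents/createTable.sql is illustrative).
cat > ~/Documents/createTable.sql <<'EOF'
CREATE TABLE Test.Phoenix (row_key bigint not null, m.row_value1 varchar(50), m.row_value2 varchar(50) CONSTRAINT pk PRIMARY KEY (row_key))
EOF

# Stage the CSV in HDFS so the job can read it; the relative path matches
# the "-i data.csv" argument used in the example further down.
$HADOOP_HOME/bin/hadoop fs -put data.csv data.csv
```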

#### How to run?

1- Please make sure that the Hadoop cluster is working correctly and that you are able to run a job, e.g. http://wiki.apache.org/hadoop/WordCount.
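
For instance, a quick smoke test of the cluster might look like the following (the examples jar name and the input/output paths are illustrative and vary by Hadoop release):

```bash
# Put a small text file into HDFS and run the bundled WordCount example.
$HADOOP_HOME/bin/hadoop fs -put README.txt wordcount-input
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount wordcount-input wordcount-output

# Inspect the output to confirm the job ran end to end.
$HADOOP_HOME/bin/hadoop fs -cat wordcount-output/part-* | head
```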

2- Verify that $HADOOP_HOME is set:

```bash
export HADOOP_HOME=/opt/hadoop/hadoop-{version}
echo $HADOOP_HOME
/opt/hadoop/hadoop-{version}
```

3- Run the bulk loader job using the script bin/csv-bulk-loader.sh as below:

```bash
bin/csv-bulk-loader.sh <input-options>
```

##### Input options

Below is the list of valid input options for the bulk loader:

```
-i        CSV data file path in HDFS (mandatory)
-s        Phoenix schema name (mandatory if the table was not created in the default Phoenix schema)
-t        Phoenix table name (mandatory)
-sql      Phoenix create table SQL file path (mandatory)
-zk       ZooKeeper IP:<port> (mandatory)
-o        Output directory path in HDFS (optional)
-idx      Phoenix index table name (optional; index support is yet to be added)
-error    Ignore errors while reading rows from the CSV? (1 - YES | 0 - NO, defaults to 1) (optional)
-help     Print all options (optional)
```

For example:

```bash
bin/csv-bulk-loader.sh -i data.csv -s Test -t Phoenix -sql ~/Documents/createTable.sql -zk localhost:2181
```

This creates the Phoenix table "Test.Phoenix" as specified in createTable.sql and then loads the CSV data from the file "data.csv" into the table.
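
After the job completes, you can sanity-check the load with an ad-hoc query. A minimal sketch, assuming the sqlline client script shipped with Phoenix (script name and location may differ by release):

```bash
# Open a Phoenix SQL prompt against the same ZooKeeper quorum used above.
bin/sqlline.sh localhost:2181

# ...then at the prompt, for example:
#   SELECT COUNT(*) FROM Test.Phoenix;
```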

P.S.
  1. The current bulk loader does not yet support the migration of index-related data. If you have created your Phoenix table with an index, please use the existing psql-based CSV loader instead.
  2. If you want to further tune MapReduce performance, refer to the optimization parameters in the file "config/csv-bulk-load-config.properties". If you modify them, rebuild the Phoenix jar and re-run the job as described above (see the sketch below).
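
As a rough sketch of that tuning cycle (the Maven build command is an assumption, and the exact keys present in config/csv-bulk-load-config.properties depend on your Phoenix version):

```bash
# Adjust the MapReduce tuning properties bundled with Phoenix.
vi config/csv-bulk-load-config.properties

# Rebuild the Phoenix jar so the job picks up the new settings
# (assumes a Maven-based checkout of the Phoenix source).
mvn package -DskipTests

# Re-run the loader as before.
bin/csv-bulk-loader.sh -i data.csv -s Test -t Phoenix -sql ~/Documents/createTable.sql -zk localhost:2181
```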