
Bulk CSV loading through MapReduce


Phoenix v2.1 supports loading CSV data into a new or existing Phoenix table using Hadoop MapReduce. This is approximately 2x faster than the existing psql CSV loader.

#### Sample input CSV data

    col1_value1, col2_value1, col3_value1
    col1_value2, col2_value2, col3_value2

#### Compatible Phoenix schema to hold the above CSV data

    CREATE TABLE Test.Phoenix (
        row_key bigint not null,
        m.row_value1 varchar(50),
        m.row_value2 varchar(50)
        CONSTRAINT pk PRIMARY KEY (row_key)
    )

|  | Column-Family (m) | Column-Family (m) |
| --- | --- | --- |
| row_key (bigint - PRIMARY KEY) | row_value1 (varchar(50)) | row_value2 (varchar(50)) |

#### How to run

1- Make sure your Hadoop cluster is working correctly and that you can run a MapReduce job on it, for example one of the example jobs that ships with Hadoop (see the sketch below).
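
A quick way to confirm this is to submit one of the example jobs bundled with Hadoop. This is only a sketch; the examples jar name and location vary by Hadoop version and distribution, so adjust the path accordingly:

    # Submit a tiny example MapReduce job (pi estimation) as a cluster sanity check.
    # The examples jar name below is an assumption; it differs between Hadoop versions.
    hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 2 10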

2- Verify that $HADOOP_HOME is set:

    export HADOOP_HOME=/opt/hadoop/hadoop-{version}
    echo $HADOOP_HOME
    # expected output: /opt/hadoop/hadoop-{version}

3- Run the bulk loader job using the script bin/csv-bulk-loader.sh, as shown below:

    ./csv-bulk-loader.sh <option value>

| Option | Value |
| --- | --- |
| -i | CSV data file path in HDFS (mandatory) |
| -s | Phoenix schema name (mandatory if not default) |
| -t | Phoenix table name (mandatory) |
| -sql | Phoenix create table SQL file path (mandatory) |
| -zk | ZooKeeper IP:port (mandatory) |
| -o | Output directory path in HDFS (optional) |
| -idx | Phoenix index table name (optional, not yet supported) |
| -error | Ignore errors while reading rows from the CSV? (1 = YES, 0 = NO; default 1) (optional) |
| -help | Print all options (optional) |
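
Note that -i refers to a location in HDFS, so the CSV file has to be copied into HDFS before the loader can read it. A minimal sketch, where the destination path /user/$USER/data.csv is just an assumed example:

    # Copy the local CSV into HDFS; the destination path is only an example.
    hadoop fs -put data.csv /user/$USER/data.csv
    # Confirm it arrived.
    hadoop fs -ls /user/$USER/data.csv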

Example:

    ./csv-bulk-loader.sh -i data.csv -s Test -t Phoenix -sql ~/Documents/createTable.sql -zk localhost:2181

This creates the Phoenix table "Test.Phoenix" as specified in createTable.sql and then loads the CSV data from "data.csv" into it.
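
For reference, the file passed via -sql is expected to contain the CREATE TABLE statement from the schema section above. A minimal sketch that writes it to the path used in the example command (the path itself is only the one from that example):

    # Write the CREATE TABLE statement from the schema section into the
    # SQL file referenced by -sql (path taken from the example above).
    cat > ~/Documents/createTable.sql <<'EOF'
    CREATE TABLE Test.Phoenix (
        row_key bigint not null,
        m.row_value1 varchar(50),
        m.row_value2 varchar(50)
        CONSTRAINT pk PRIMARY KEY (row_key)
    )
    EOF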

P.S.
  1. The current bulk loader does not yet support migrating index-related data, so if you created your Phoenix table with an index, please use the psql CSV loader instead.
  2. To further tune MapReduce performance, see the optimization parameters in the file "src/main/config/csv-bulk-load-config.properties". If you modify this list, rebuild the Phoenix jar and re-run the job as described above (a sketch of that cycle follows below).
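
The edit, rebuild, and re-run cycle from the note above might look like the following. This is only a sketch: the Maven invocation is an assumption about the build setup, so substitute whatever build command your Phoenix checkout uses.

    # 1. Adjust the MapReduce tuning parameters (file path from the note above).
    vi src/main/config/csv-bulk-load-config.properties

    # 2. Rebuild the Phoenix jar. "mvn package" is an assumption about the build
    #    tool; use whatever build command your checkout provides.
    mvn package -DskipTests

    # 3. Re-run the bulk load job exactly as before.
    ./csv-bulk-loader.sh -i data.csv -s Test -t Phoenix -sql ~/Documents/createTable.sql -zk localhost:2181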