Bulk CSV loading through map reduce
Phoenix supports loading CSV data into a new or existing Phoenix table using Hadoop Map-Reduce. This is approximately 2x faster than the existing CSV loader that uses psql.
####Sample input CSV data:
col1_value1, col2_value1, col3_value1
col1_value2, col2_value2, col3_value2
####Compatible Phoenix schema to hold the above CSV data:
CREATE TABLE Test.Phoenix (row_key bigint not null, m.row_value1 varchar(50), m.row_value2 varchar(50) CONSTRAINT pk PRIMARY KEY (row_key))
| Column | Type | Column family |
|---|---|---|
| row_key | bigint (PRIMARY KEY) | none (row key) |
| row_value1 | varchar(50) | m |
| row_value2 | varchar(50) | m |
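The loader's `-sql` option (listed under Input-options) takes the path of a file holding this statement. A minimal sketch of preparing that file, assuming a POSIX shell; the function name `write_create_table_sql` is hypothetical and not part of Phoenix, and the statement is copied unchanged from the schema above:

```shell
#!/bin/sh
# Hypothetical helper (not part of Phoenix): writes the CREATE TABLE
# statement shown above into the file given as the first argument.
write_create_table_sql() {
  # Quoted heredoc delimiter so the shell does not expand anything in the SQL.
  cat > "$1" <<'EOF'
CREATE TABLE Test.Phoenix (row_key bigint not null, m.row_value1 varchar(50), m.row_value2 varchar(50) CONSTRAINT pk PRIMARY KEY (row_key))
EOF
}
```

For example, `write_create_table_sql ~/Documents/createTable.sql` would produce the file used in the sample command further down.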
####How to run?
1- Make sure that the Hadoop cluster is working correctly and that you can run a sample job such as WordCount: http://wiki.apache.org/hadoop/WordCount.
2- Verify that $HADOOP_HOME is set.
export HADOOP_HOME=/opt/hadoop/hadoop-{version}
echo $HADOOP_HOME
/opt/hadoop/hadoop-{version}
3- Run the bulk loader job using the script bin/csv-bulk-loader.sh as below:
bin/csv-bulk-loader.sh <input-options>
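The environment check in step 2 can be scripted so that a missing or wrong `$HADOOP_HOME` is caught before the job is launched. This is a minimal sketch, assuming a POSIX shell; the function name `check_hadoop_home` is hypothetical and not part of Phoenix or Hadoop:

```shell
#!/bin/sh
# Hypothetical helper (not part of Phoenix): verifies that HADOOP_HOME
# is set and points to an existing directory.
check_hadoop_home() {
  if [ -z "${HADOOP_HOME:-}" ]; then
    echo "HADOOP_HOME is not set" >&2
    return 1
  fi
  if [ ! -d "$HADOOP_HOME" ]; then
    echo "HADOOP_HOME is not a directory: $HADOOP_HOME" >&2
    return 1
  fi
  # Same information as 'echo $HADOOP_HOME' in step 2.
  echo "HADOOP_HOME=$HADOOP_HOME"
}
```

Calling `check_hadoop_home || exit 1` before step 3 stops early with a message on stderr instead of failing inside the loader script.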
#####Input-options
The bulk loader accepts the following options:
-i CSV data file path in hdfs (mandatory)
-s Phoenix schema name (mandatory if the table is not created in the default Phoenix schema)
-t Phoenix table name (mandatory)
-sql Phoenix create table sql path (mandatory)
-zk Zookeeper IP:<port> (mandatory)
-o Output directory path in hdfs (optional)
-idx Phoenix index table name (optional, index support is yet to be added)
-error Ignore errors while reading rows from the CSV? (1 = yes, 0 = no; defaults to 1) (optional)
-help Print all options (optional)
For example:
bin/csv-bulk-loader.sh -i data.csv -s Test -t Phoenix -sql ~/Documents/createTable.sql -zk localhost:2181
This creates the Phoenix table "Test.Phoenix" as specified in createTable.sql and then loads the CSV data from the file "data.csv" into the table.
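Because four of the options are mandatory, it can help to validate them before submitting the job. The sketch below assumes a POSIX shell; `build_loader_cmd` is a hypothetical wrapper, not part of Phoenix. It uses only the flags listed above and prints the command line rather than running it, so the result can be inspected first. Paths containing spaces are not handled.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Phoenix): checks the mandatory options
# and prints the resulting csv-bulk-loader.sh command line.
# Usage: build_loader_cmd <csv-path> <table> <create-sql-path> <zk-host:port> [schema]
build_loader_cmd() {
  if [ "$#" -lt 4 ]; then
    echo "usage: build_loader_cmd <csv> <table> <sql> <zk> [schema]" >&2
    return 1
  fi
  # Reject empty values for the mandatory options (-i, -t, -sql, -zk).
  for arg in "$1" "$2" "$3" "$4"; do
    if [ -z "$arg" ]; then
      echo "mandatory option is empty" >&2
      return 1
    fi
  done
  cmd="bin/csv-bulk-loader.sh -i $1"
  # -s is only added when a schema name is supplied.
  if [ -n "${5:-}" ]; then
    cmd="$cmd -s $5"
  fi
  echo "$cmd -t $2 -sql $3 -zk $4"
}
```

For instance, `build_loader_cmd data.csv Phoenix createTable.sql localhost:2181 Test` prints a command equivalent to the example above, with `-s Test` included.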
- The current bulk loader does not yet support loading index-related data. If your Phoenix table was created with an index, use the existing psql-based CSV loader instead.
- To further tune Map-Reduce performance, see the optimization parameters in the file "config/csv-bulk-load-config.properties". If you modify this file, rebuild the Phoenix jar and re-run the job as described above.