# Bulk CSV loading through map reduce
Phoenix v2.1 provides support for loading CSV data into a new or existing Phoenix table using Hadoop MapReduce. This gives approximately 2x faster performance compared to the existing psql CSV loader.
#### Sample input CSV data:

    col1_value1, col2_value1, col3_value1
    col1_value2, col2_value2, col3_value2
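Note that the loader's `-i` option (described below) expects the CSV file to already be in HDFS. A minimal sketch of preparing and uploading such a file, with made-up values and an illustrative destination path:

```sh
# Create a small CSV file locally. The values are invented; the first column
# must parse as a bigint to match the row_key column of the schema below.
cat > data.csv <<'EOF'
1,hello,world
2,foo,bar
EOF

# Copy it into the current user's HDFS home directory so the bulk loader
# can read it (the -i option takes an HDFS path).
hadoop fs -put data.csv data.csv
```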
#### Compatible Phoenix schema to hold the above CSV data:

    CREATE TABLE Test.Phoenix (row_key bigint not null, m.row_value1 varchar(50), m.row_value2 varchar(50) CONSTRAINT pk PRIMARY KEY (row_key))
|  | Column-Family (m) | Column-Family (m) |
|---|---|---|
| row_key (bigint, PRIMARY KEY) | row_value1 (varchar(50)) | row_value2 (varchar(50)) |
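The loader's `-sql` option (described below) takes the path of a file containing this CREATE TABLE statement. A minimal sketch that writes the statement above to `~/Documents/createTable.sql`, the path used in the example run further down (the location itself is just an assumption):

```sh
# Save the DDL so it can be passed to csv-bulk-loader.sh via -sql.
mkdir -p ~/Documents
cat > ~/Documents/createTable.sql <<'EOF'
CREATE TABLE Test.Phoenix (row_key bigint not null, m.row_value1 varchar(50), m.row_value2 varchar(50) CONSTRAINT pk PRIMARY KEY (row_key))
EOF
```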
#### How to run?
1- Make sure that your Hadoop cluster is working correctly and that you are able to run a MapReduce job on it.
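One quick sanity check is to run one of the example jobs that ships with Hadoop. The jar name and location vary between Hadoop releases, so the path below is only an assumption based on a Hadoop 1.x layout:

```sh
# Smoke-test the MapReduce cluster with the bundled "pi" example
# (adjust the jar path and name to match your Hadoop version).
hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 2 10
```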
2- Verify that $HADOOP_HOME is set.

    export HADOOP_HOME=/opt/hadoop/hadoop-{version}
    echo $HADOOP_HOME
    /opt/hadoop/hadoop-{version}
3- Run the bulk loader job using the script bin/csv-bulk-loader.sh as below:

    ./csv-bulk-loader.sh <option value>

| Option | Description |
|--------|-------------|
| `-i` | CSV data file path in HDFS (mandatory) |
| `-s` | Phoenix schema name (mandatory if not default) |
| `-t` | Phoenix table name (mandatory) |
| `-sql` | Phoenix create table SQL file path (mandatory) |
| `-zk` | Zookeeper `IP:<port>` (mandatory) |
| `-o` | Output directory path in HDFS (optional) |
| `-idx` | Phoenix index table name (optional, not yet supported) |
| `-error` | Ignore errors while reading rows from the CSV? (1 = yes, 0 = no; default 1) (optional) |
| `-help` | Print all options (optional) |
Example:

    ./csv-bulk-loader.sh -i data.csv -s Test -t Phoenix -sql ~/Documents/createTable.sql -zk localhost:2181

This creates the Phoenix table "Test.Phoenix" as specified in createTable.sql and then loads the CSV data from the file "data.csv" into that table.
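To spot-check the result, you can count the loaded rows from the Phoenix command line. The sketch below assumes your Phoenix distribution ships bin/sqlline.sh and that Zookeeper is reachable at localhost:

```sh
# Open a SQL prompt against the cluster (adjust the Zookeeper quorum as needed).
./sqlline.sh localhost

# Then, at the sqlline prompt:
#   SELECT COUNT(*) FROM Test.Phoenix;
```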
- The current bulk loader does not yet support the migration of index-related data. If you have created your Phoenix table with an index, please use the psql CSV loader instead.
- If you want to further optimize MapReduce performance, refer to the tuning parameters in the file "src/main/config/csv-bulk-load-config.properties". If you modify them, rebuild the Phoenix jar and re-run the job as described above.
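A minimal sketch of that tune/rebuild/re-run cycle, assuming the Phoenix source tree is built with Maven and leaving the actual property values to the file's own documentation:

```sh
# Adjust the MapReduce tuning parameters shipped with Phoenix.
vi src/main/config/csv-bulk-load-config.properties

# Rebuild the Phoenix jar so the new settings are picked up (Maven build assumed).
mvn package -DskipTests

# Re-run the loader as described above.
./bin/csv-bulk-loader.sh -i data.csv -s Test -t Phoenix -sql ~/Documents/createTable.sql -zk localhost:2181
```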