- Read each dataset in an H5 file into the master node's memory.
- Use sc.parallelize() to distribute each dataset to the worker nodes.
- This consumes a huge amount of memory on the master node, so it will not work when there are too many or too-large H5 files.
- See basemods_spark_runner_memory.py and the sketch below.
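A minimal sketch of this approach, assuming h5py is installed on the master node and the H5 file exposes a single dataset named "data" (both the file path and the dataset name are placeholders, not part of the project):

```python
import h5py
from pyspark import SparkContext

sc = SparkContext(appName="h5_in_master_memory")

# The whole dataset is materialized in the master node's memory here,
# which is exactly why this approach fails for many/large H5 files.
with h5py.File("/path/on/master/example.h5", "r") as f:  # placeholder path
    records = f["data"][:].tolist()                       # placeholder dataset name

# Ship the in-memory records to the worker nodes as an RDD.
rdd = sc.parallelize(records, numSlices=64)
print(rdd.count())
```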
Build an NFS in the cluster, so that each worker node can read data from the same directory and only one copy of the data is needed.
For now we don't know how to set up an NFS, and this solution may not be suitable for all situations; a sketch of how workers could read from a shared directory is shown below.
Hint: if you do some tests, please log the time, the results of your tests, and how you did it.
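If an NFS were available, the read pattern might look like the following sketch. The mount point /mnt/nfs, the file layout, and the dataset name "data" are all assumptions, and h5py would have to be installed on every worker:

```python
import glob
import h5py
from pyspark import SparkContext

sc = SparkContext(appName="h5_from_nfs")

# Only the file names are shipped to the workers; each worker opens its
# files directly from the shared (assumed) NFS mount.
h5_paths = glob.glob("/mnt/nfs/h5_data/*.h5")

def read_records(path):
    # Runs on a worker: read one H5 file from the shared directory.
    with h5py.File(path, "r") as f:
        for row in f["data"][:]:        # placeholder dataset name
            yield (path, row.tolist())

rdd = sc.parallelize(h5_paths, numSlices=max(1, len(h5_paths))).flatMap(read_records)
print(rdd.count())
```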
Copy all the data to the same directory on each worker node.
This would make the job a lot easier, but it is not an elegant solution: it costs a great deal of hard-disk space on every node. A sketch of the copy step is shown below.
Hint: if you do some tests, please log the time, the results of your tests, and how you did it.
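One way to do the copying is a small driver-side script run before the Spark job. The worker host names, the directories, and the use of rsync over SSH are all assumptions made for illustration:

```python
import subprocess

WORKERS = ["worker1", "worker2", "worker3"]   # hypothetical host names
SRC_DIR = "/data/h5_files/"                   # directory on the master
DST_DIR = "/data/h5_files/"                   # same path on every worker

for host in WORKERS:
    # rsync only transfers files that changed, which helps on repeated runs,
    # but every node still ends up storing a full copy of the data.
    subprocess.check_call(
        ["rsync", "-a", SRC_DIR, "{}:{}".format(host, DST_DIR)]
    )
```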
Transform the H5 files to a Hadoop-friendly file format (TextFile, SequenceFile, HadoopInputFormat) first, then save the transformed files to HDFS/AWS S3, etc. The transformation can be done on a single machine or on a Spark cluster.
- The transformation may cost a lot of time; we need to find an efficient way to do it.
- Hadoop/Spark has no interface to read data from a file (SequenceFile or HadoopInputFormat) when the data structure of the file is a composite type rather than a simple type (String, Int, Double, etc.).
- It is feasible to use the Python interface to write simple pair data (such as an RDD of (String, Int)) to a SequenceFile or another HadoopInputFormat and to reread the data from the file, as sketched below.
- But we haven't found a way to transform an H5 file to a Hadoop file format and then reread it from the local file system or HDFS, because the Writable interface cannot handle Tuple types.
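As a reference for the case that does work, here is a sketch of the simple-pair round trip mentioned above. The output path and the key names are placeholders; replacing the Int values with tuples or arrays would hit the Writable limitation described in the last point:

```python
from pyspark import SparkContext

sc = SparkContext(appName="sequencefile_sketch")

# Simple (String, Int) pairs map onto Text/IntWritable, so the round trip works.
pairs = sc.parallelize([("read_%d" % i, i) for i in range(1000)])
pairs.saveAsSequenceFile("hdfs:///tmp/simple_pairs_seq")   # placeholder path

# Reread the SequenceFile from HDFS as an RDD of (key, value) pairs.
reread = sc.sequenceFile("hdfs:///tmp/simple_pairs_seq")
print(reread.take(5))
```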