- Read each dataset in an H5 file into the master node's memory.
- Use sc.parallelize() to distribute each dataset to the worker nodes.
- This consumes a huge amount of memory on the master node, so it will not work when there are too many or too-large H5 files.
- See basemods_spark_runner_memory.py and the sketch below.
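A minimal sketch of this approach, assuming h5py is installed on the master node and the H5 file exposes a single dataset named "data" (both the file path and the dataset name are placeholders, not part of the project):

```python
import h5py
from pyspark import SparkContext

sc = SparkContext(appName="h5_in_master_memory")

# The whole dataset is materialized in the master node's memory here,
# which is exactly why this approach fails for many/large H5 files.
with h5py.File("/path/on/master/example.h5", "r") as f:  # placeholder path
    records = f["data"][:].tolist()                       # placeholder dataset name

# Ship the in-memory records to the worker nodes as an RDD.
rdd = sc.parallelize(records, numSlices=64)
print(rdd.count())
```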
Build an NFS in the cluster, so that each worker node can read data from the same directory and only one copy of the data is needed.
For now we don't know how to set up an NFS, and this solution may not be suitable for all situations; a sketch of how workers could read from a shared directory is shown below.
Hint: if you do some tests, please log the time, the results of your tests, and how you did it.
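If an NFS were available, the read pattern might look like the following sketch. The mount point /mnt/nfs, the file layout, and the dataset name "data" are all assumptions, and h5py would have to be installed on every worker:

```python
import glob
import h5py
from pyspark import SparkContext

sc = SparkContext(appName="h5_from_nfs")

# Only the file names are shipped to the workers; each worker opens its
# files directly from the shared (assumed) NFS mount.
h5_paths = glob.glob("/mnt/nfs/h5_data/*.h5")

def read_records(path):
    # Runs on a worker: read one H5 file from the shared directory.
    with h5py.File(path, "r") as f:
        for row in f["data"][:]:        # placeholder dataset name
            yield (path, row.tolist())

rdd = sc.parallelize(h5_paths, numSlices=max(1, len(h5_paths))).flatMap(read_records)
print(rdd.count())
```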
Copy all the data to the same directory on each worker node.
This would make the job a lot easier, but it is not an elegant solution: it costs a great deal of hard-disk space on every node. A sketch of the copy step is shown below.
Hint: if you do some tests, please log the time, the results of your tests, and how you did it.
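One way to do the copying is a small driver-side script run before the Spark job. The worker host names, the directories, and the use of rsync over SSH are all assumptions made for illustration:

```python
import subprocess

WORKERS = ["worker1", "worker2", "worker3"]   # hypothetical host names
SRC_DIR = "/data/h5_files/"                   # directory on the master
DST_DIR = "/data/h5_files/"                   # same path on every worker

for host in WORKERS:
    # rsync only transfers files that changed, which helps on repeated runs,
    # but every node still ends up storing a full copy of the data.
    subprocess.check_call(
        ["rsync", "-a", SRC_DIR, "{}:{}".format(host, DST_DIR)]
    )
```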
Transform the H5 files to a Hadoop-friendly file format (TextFile, SequenceFile, HadoopInputFormat) first, then save the transformed files to HDFS/AWS S3, etc. The transformation can be done on a single machine or on a Spark cluster.
- The transformation may cost a lot of time; we need to find an efficient way to do it.
- Hadoop/Spark has no interface to read data from a file (SequenceFile or HadoopInputFormat) when the data structure of the file is a composite type rather than a simple type (String, Int, Double, etc.).
- It is feasible to use the Python interface to write simple pair data (such as an RDD of (String, Int)) to a SequenceFile or another HadoopInputFormat and to reread the data from the file, as sketched below.
- But we haven't found a way to transform an H5 file to a Hadoop file format and then reread it from the local file system or HDFS, because the Writable interface cannot handle Tuple types.
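As a reference for the case that does work, here is a sketch of the simple-pair round trip mentioned above. The output path and the key names are placeholders; replacing the Int values with tuples or arrays would hit the Writable limitation described in the last point:

```python
from pyspark import SparkContext

sc = SparkContext(appName="sequencefile_sketch")

# Simple (String, Int) pairs map onto Text/IntWritable, so the round trip works.
pairs = sc.parallelize([("read_%d" % i, i) for i in range(1000)])
pairs.saveAsSequenceFile("hdfs:///tmp/simple_pairs_seq")   # placeholder path

# Reread the SequenceFile from HDFS as an RDD of (key, value) pairs.
reread = sc.sequenceFile("hdfs:///tmp/simple_pairs_seq")
print(reread.take(5))
```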