Check out the examples in the databricks_notebooks_examples branch.
Scaladoc: https://procter-gamble-tech.github.io/octopufs/#com.pg.bigdata.octopufs.package
OctopuFS is a Scala/Spark toolkit for managing cloud storage, especially ADLS Gen2, directly from Databricks. It provides several capabilities, each with a built-in retry mechanism that repeats unsuccessful operations up to 5 times:
com.pg.bigdata.octopufs.fs.DistributedExecution
OctopuFS distributes the copy operation across Spark tasks and copies data 3x faster than a Spark read/write operation while using less CPU.
Many operations on ADLS consist of HTTP requests only, so they do not require significant hardware resources and can be run on a single machine. Operations on tens of thousands of files/folders take approximately 1 minute. These operations include:
com.pg.bigdata.octopufs.fs.LocalExecution
com.pg.bigdata.octopufs.acl.AclManager
com.pg.bigdata.octopufs.fs.getSize
com.pg.bigdata.octopufs.Promotor
OctopuFS applies the functions above to the Hive metadata layer (i.e. tables and partitions) to enable operations that are currently not available for tables that do not use the Databricks Delta format abstraction.
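The retry behaviour described in the overview (repeat an unsuccessful operation up to 5 times) can be pictured with a minimal sketch. This is not the OctopuFS implementation: `RetrySketch` and `retry` are hypothetical names introduced only for illustration, and the sketch retries immediately without any back-off.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical helper, NOT part of the OctopuFS API.
object RetrySketch {
  // Evaluates `op` until it succeeds or `maxAttempts` attempts have been made.
  // Returns the first Success, or the last Failure if every attempt fails.
  @tailrec
  def retry[T](maxAttempts: Int)(op: => T): Try[T] =
    Try(op) match {
      case s @ Success(_)                => s
      case Failure(_) if maxAttempts > 1 => retry(maxAttempts - 1)(op)
      case f @ Failure(_)                => f
    }
}
```

With this helper, `RetrySketch.retry(5) { someOperation() }` would run `someOperation` (a placeholder for any call that may fail transiently) at most five times and hand back a `Try` holding either the result or the last error.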
RDD API security setup
For copy operations only, it is recommended to turn off or tune Spark speculation: spark.conf.set("spark.speculation","false")
Most methods require an implicit parameter:
- SparkSession – for distributed copy
implicit val s = spark
- Configuration – for local, multithreaded operation
implicit val c = spark.sparkContext.hadoopConfiguration
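How an implicit value declared this way reaches a method can be shown with a small, Spark-free sketch. Everything in it is an assumption made for illustration: `StorageConf` stands in for a context object such as `SparkSession` or the Hadoop `Configuration`, and `describePath` is a hypothetical function, not an OctopuFS method.

```scala
object ImplicitSketch {
  // Hypothetical stand-in for a context object (e.g. a Hadoop Configuration),
  // used only so the sketch runs without Spark on the classpath.
  final case class StorageConf(account: String)

  // Hypothetical function: the second parameter list is implicit,
  // so callers do not have to pass the configuration explicitly.
  def describePath(path: String)(implicit conf: StorageConf): String =
    s"${conf.account}:$path"

  def demo(): String = {
    // Plays the same role as `implicit val c = spark.sparkContext.hadoopConfiguration`.
    implicit val c: StorageConf = StorageConf("myaccount")
    // No second argument: the compiler supplies `c` from the enclosing scope.
    describePath("/data/table")
  }
}
```

Once the implicit value is in scope, every call that needs it picks it up automatically, which is why declaring it once at the top of a notebook is usually enough.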
Clone and compile the repository to get the latest version, or download the jar from an artifact repository. Once you have the jar, upload it to your Spark cluster and run it from a Scala notebook or from your own jar.
Please remember to set up credentials as mentioned above.
If you find any issue with the package, do not hesitate to open an issue on GitHub. Please be as specific as possible about the error and the context/environment you were using when the issue occurred.
Jacek Tokar @ Procter&Gamble