[[TOC]]
Soon to a repo near you :)

- Adapted to the latest Apache Spark versions from 3.3.x
- Added `StreamingTrigger.AvailableNow`
- Built with Spark 3.3.0 and tested against Spark 3.3.0 to 3.5.1
- Cross-compiled for Scala 2.12 and 2.13
- Tested against all major available Apache Spark 3.x versions
- Code reformatting
- Building with JDK 17, targeting Java 8
- Added test Java options to handle JDK 17
- Built with Spark 3.2.x
- Removed the `spark-utils-io-pureconfig` module
- Refactored `TypesafeConfigBuilder`, which now has two implementations: `SimpleTypesafeConfigBuilder` and `FuzzyTypesafeConfigBuilder`
- Small improvements to `SharedSparkSession`
- Documentation updates
- `TypesafeConfigBuilder.getApplicationConfiguration` requires an application configuration file name parameter
- `TypesafeConfigBuilder.getApplicationConfiguration` no longer requires an implicit `SparkContext`
- `SparkApp.main` refactored
- `DataSource` exposes `reader` in addition to `read`
- Added `SparkSessionOps.streamingSource`
- `DataSink` and `DataAwareSink` expose `writer` in addition to `write`
- Documentation improvements
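The `reader`/`writer` additions can be pictured with a small sketch. All names below (`Frame`, `Reader`, `DataSourceSketch`, `CsvSourceSketch`) are illustrative stand-ins, not the library's actual API: exposing the configured reader lets callers adjust options before loading, while `read` stays available as the one-shot path.

```scala
import scala.util.Try

// Illustrative stand-ins only; these are not the actual spark-utils types.
final case class Frame(description: String)

final case class Reader(options: Map[String, String] = Map.empty) {
  def option(key: String, value: String): Reader = copy(options + (key -> value))
  def load(path: String): Frame = Frame(s"$path loaded with $options")
}

trait DataSourceSketch {
  // New: expose the configured reader so callers can tweak it further...
  def reader: Reader
  // ...while the one-shot read remains available for the common case.
  def read(path: String): Try[Frame] = Try(reader.load(path))
}

object CsvSourceSketch extends DataSourceSketch {
  val reader: Reader = Reader(Map("header" -> "true"))
}
```

With the reader exposed, a caller can write something like `CsvSourceSketch.reader.option("sep", ";").load("data.csv")` when the defaults are not enough.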
Major Library Redesign

- The project was split into different configuration modules:
  - `spark-utils-io-pureconfig` for the new PureConfig implementation
  - `spark-utils-io-configz` for the legacy ConfigZ implementation
- It is best to import either one of the following:
  - `"org.tupol" %% "spark-utils-io-configz" % sparkUtilsVersion`
  - `"org.tupol" %% "spark-utils-io-pureconfig" % sparkUtilsVersion`

  instead of `"org.tupol" %% "spark-utils" % sparkUtilsVersion`
- `kafka.bootstrap.servers` was renamed to `kafkaBootstrapServers` in Kafka sources and sinks configuration
- `bucketColumns` was renamed to `columns` in file data sinks
- `partition.files` was renamed to `partition.number` in sinks configuration
- `SourceConfiguration.extract` is no longer used; use `SourceConfigurator.extract` instead
- `FileSourceConfiguration.extract` is no longer used; use `FileSourceConfigurator.extract` instead
- `GenericSinkConfiguration.optionalSaveMode` was renamed to `GenericSinkConfiguration.mode`
- `TypesafeConfigBuilder.applicationConfiguration()` was renamed to `getApplicationConfiguration()` and was made public, so it can be overridden; `args` is no longer an implicit parameter. This impacts `SparkApp` and `SparkFun`
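As a before/after sketch of the configuration key renames (the surrounding `input`/`output` keys and values are made up for illustration):

```hocon
# Before the redesign (illustrative keys and values):
input.kafka.bootstrap.servers = "host:9092"
output {
  bucketColumns = ["id"]
  partition.files = 2
}

# After the redesign:
input.kafkaBootstrapServers = "host:9092"
output {
  columns = ["id"]
  partition.number = 2
}
```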
- Fixed the `core` dependency on `scala-utils`; now using `scala-utils-core`
- Refactored the `core` / `implicits` package to make the implicits a little more explicit
- Small dependencies and documentation improvements
- The documentation needs to be further reviewed
- The project is split into two modules: `spark-utils-core` and `spark-utils-io`
- The project moved to Apache Spark 3.0.1, which is a popular choice for Databricks Cluster users
- The project is only compiled on Scala 2.12
- There is a major redesign of core components, mainly returning `Try[_]` for better exception handling
- Dependencies updates
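The `Try[_]` redesign can be sketched in plain Scala. The types below (`Frame`, `CsvSourceSketch`) are illustrative stand-ins, not the actual library API: returning `Try` pushes the decision of how to handle a failure to the caller instead of throwing.

```scala
import scala.util.{Failure, Success, Try}

// Illustrative stand-in for a DataFrame-producing component.
final case class Frame(rows: Seq[String])

object CsvSourceSketch {
  // Instead of throwing, read wraps the outcome in Try[_].
  def read(path: String): Try[Frame] = Try {
    if (path.isEmpty) throw new IllegalArgumentException("empty path")
    Frame(Seq(s"rows from $path"))
  }
}

// Callers can fail fast with .get, provide a fallback, or pattern match:
def readOrEmpty(path: String): Frame = CsvSourceSketch.read(path) match {
  case Success(frame) => frame
  case Failure(_)     => Frame(Nil)
}
```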
- The project compiles with both Scala `2.11.12` and `2.12.12`
- Updated Apache Spark to `2.4.6`
- Updated the `spark-xml` library to `0.10.0`
- Removed the `com.databricks:spark-avro` dependency, as Avro support is now built into Apache Spark
- Removed the shadow `org.apache.spark.Logging` class, which is replaced by the `org.tupol.spark.Logging` knock-off
- Added `SparkFun`, a convenience wrapper around `SparkApp` that makes the code even more concise
- Added `FormatType.Custom`, so any format type is accepted; not every random format type will work, of course, but other formats like `delta` can now be configured and used
- Added `GenericSourceConfiguration` (replacing the old private `BasicConfiguration`) and `GenericDataSource`
- Added `GenericSinkConfiguration`, `GenericDataSink` and `GenericDataAwareSink`
- Removed the short `"avro"` format, as it will be included in Spark 2.4
- Added format validation to `FileSinkConfiguration`
- Added the `generic-data-source.md` and `generic-data-sink.md` docs
- Added the `StreamingConfiguration` marker trait
- Added `GenericStreamDataSource`, `FileStreamDataSource` and `KafkaStreamDataSource`
- Added `GenericStreamDataSink`, `FileStreamDataSink` and `KafkaStreamDataSink`
- Added `FormatAwareStreamingSourceConfiguration` and `FormatAwareStreamingSinkConfiguration`
- Extracted `TypesafeConfigBuilder`
- API changes: added a new type parameter to `DataSink` that describes the type of the output
- Improved unit test coverage
- Added support for bucketing in data sinks
- Improved the community resources
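The new output type parameter on the sink can be sketched as follows, again with illustrative stand-ins rather than the real spark-utils signatures: the second parameter lets each sink describe what its `write` returns.

```scala
// Illustrative stand-ins; not the actual spark-utils API.
final case class Frame(rows: Seq[String])
final case class StreamHandle(name: String)

// The extra type parameter Out describes the type of the output of write.
trait DataSinkSketch[In, Out] {
  def write(data: In): Out
}

// A batch-style sink can return the written data itself...
object BatchSink extends DataSinkSketch[Frame, Frame] {
  def write(data: Frame): Frame = data
}

// ...while a streaming-style sink can return a handle to the running query.
object StreamingSink extends DataSinkSketch[Frame, StreamHandle] {
  def write(data: Frame): StreamHandle =
    StreamHandle(s"query-over-${data.rows.size}-rows")
}
```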
- Added configuration variable substitution support
- Split `SparkRunnable` into `SparkRunnable` and `SparkApp`
- Changed the `SparkRunnable` API; `run()` now returns `Result` instead of `Try[Result]`
- Changed the `SparkApp` API; `buildConfig()` was renamed to `createContext()` and now returns `Context` instead of `Try[Context]`
- Changed the `DataSource` API; `read()` now returns `DataFrame` instead of `Try[DataFrame]`
- Changed the `DataSink` API; `write()` now returns `DataFrame` instead of `Try[DataFrame]`
- Small documentation improvements
- Added the `DataSource` and `DataSink` IO frameworks
- Added the `FileDataSource` and `FileDataSink` IO frameworks
- Added the `JdbcDataSource` and `JdbcDataSink` IO frameworks
- Moved all useful implicit conversions into `org.tupol.spark.implicits`
- Added testing utilities under `org.tupol.spark.testing`