Spark 2.0.1-1611 API Changes

This topic describes the public API changes that occurred between Apache Spark 1.6.1 and Spark 2.0.1.

Removed Methods

The following items have been removed from Apache Spark 2.0.1:
  • Bagel (the Spark implementation of Google Pregel)
  • Most of the deprecated methods from Spark 1.x, including:
| Category | Subcategory | Instead of this removed API... | Use... |
| --- | --- | --- | --- |
| GraphX | | mapReduceTriplets | aggregateMessages |
| | | runSVDPlusPlus | run |
| | | GraphKryoRegistrator | |
| SQL | DataType | DataType.fromCaseClassString | DataType.fromJson |
| | DecimalType | DecimalType() | DecimalType(precision, scale) to provide precision explicitly |
| | | DecimalType(Option[PrecisionInfo]) | DecimalType(precision, scale) |
| | | PrecisionInfo | DecimalType(precision, scale) |
| | | precisionInfo | precision and scale |
| | | Unlimited | (No longer supported) |
| | Column | Column.in() | isin() |
| | DataFrame | toSchemaRDD | toDF |
| | | createJDBCTable | write.jdbc() |
| | | saveAsParquetFile | write.parquet() |
| | | saveAsTable | write.saveAsTable() |
| | | save | write.save() |
| | | insertInto | write.mode(SaveMode.Append).saveAsTable() |
| | DataFrameReader | DataFrameReader.load(path) | option("path", path).load() |
| | Functions | cumeDist | cume_dist |
| | | denseRank | dense_rank |
| | | percentRank | percent_rank |
| | | rowNumber | row_number |
| | | inputFileName | input_file_name |
| | | isNaN | isnan |
| | | sparkPartitionId | spark_partition_id |
| | | callUDF | udf |
| Core | SparkContext | Constructors no longer take the preferredNodeLocationData param | |
| | | tachyonFolderName | externalBlockStoreFolderName |
| | | initLocalProperties, clearFiles, clearJars | (No longer needed) |
| | | runJob method no longer takes the allowLocal param | |
| | | defaultMinSplits | defaultMinPartitions |
| | | [Double, Int, Long, Float]AccumulatorParam | implicit objects from AccumulatorParam |
| | | rddTo[Pair, Async, Sequence, Ordered]RDDFunctions | implicit functions from RDD |
| | | [double, numeric]RDDToDoubleRDDFunctions | implicit functions from RDD |
| | | intToIntWritable, longToLongWritable, floatToFloatWritable, doubleToDoubleWritable, boolToBoolWritable, bytesToBytesWritable, stringToText | implicit functions from WritableFactory |
| | | [int, long, double, float, boolean, bytes, string, writable]WritableConverter | implicit functions from WritableConverter |
| | TaskContext | runningLocally | isRunningLocally |
| | | addOnCompleteCallback | addTaskCompletionListener |
| | | attemptId | attemptNumber |
| | JavaRDDLike | splits | partitions |
| | | toArray | collect |
| | JavaSparkContext | defaultMinSplits | defaultMinPartitions |
| | | clearJars, clearFiles | (No longer needed) |
| | PairRDDFunctions | PairRDDFunctions.reduceByKeyToDriver | reduceByKeyLocally |
| | RDD | mapPartitionsWithContext | TaskContext.get |
| | | mapPartitionsWithSplit | mapPartitionsWithIndex |
| | | mapWith | mapPartitionsWithIndex |
| | | flatMapWith | mapPartitionsWithIndex and flatMap |
| | | foreachWith | mapPartitionsWithIndex and foreach |
| | | filterWith | mapPartitionsWithIndex and filter |
| | | toArray | collect |
| | TaskInfo | TaskInfo.attempt | TaskInfo.attemptNumber |
| | Guava Optional | Guava Optional | org.apache.spark.api.java.Optional |
| | Vector | Vector, VectorSuite | |
| Configuration options and params | | --name | |
| | | --driver-memory | spark.driver.memory |
| | | --driver-cores | spark.driver.cores |
| | | --executor-memory | spark.executor.memory |
| | | --executor-cores | spark.executor.cores |
| | | --queue | spark.yarn.queue |
| | | --files | spark.yarn.dist.files |
| | | --archives | spark.yarn.dist.archives |
| | | --addJars | spark.yarn.dist.jars |
| | | --py-files | spark.submit.pyFiles |
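
For example, several of the DataFrame and RDD migrations in the table above translate as follows in a minimal spark-shell style Scala sketch; the file paths, table name, and column values are placeholders:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("api-migration-sketch").getOrCreate()
val df = spark.read.parquet("/tmp/input.parquet")   // placeholder path

// saveAsParquetFile -> write.parquet()
df.write.parquet("/tmp/output.parquet")

// insertInto -> write.mode(SaveMode.Append).saveAsTable()
df.write.mode(SaveMode.Append).saveAsTable("events")

// Column.in() -> isin()
val filtered = df.filter(col("status").isin("open", "pending"))

// mapPartitionsWithContext -> TaskContext.get inside mapPartitionsWithIndex
val tagged = df.rdd.mapPartitionsWithIndex { (idx, rows) =>
  val attempt = TaskContext.get().attemptNumber()
  rows.map(row => (idx, attempt, row))
}
```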
Note also the following additional removals and changes:
  • Methods on the Python DataFrame that returned an RDD have been moved to dataframe.rdd. For example, df.map is now df.rdd.map.
  • Some streaming connectors (Twitter, Akka, MQTT, and ZeroMQ) have been removed.
  • org.apache.spark.shuffle.hash.HashShuffleManager no longer exists. SortShuffleManager has been the default since Spark 1.2.
  • DataFrame is no longer a separate class; it is now a type alias for Dataset[Row], as illustrated in the sketch below.
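
A minimal spark-shell style Scala sketch of the DataFrame change (assuming an existing SparkSession named spark; the JSON path is a placeholder):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row}

val df: DataFrame = spark.read.json("/tmp/people.json")   // placeholder path

// DataFrame is a type alias for Dataset[Row] in Spark 2.0,
// so a DataFrame can be used wherever a Dataset[Row] is expected.
val ds: Dataset[Row] = df
```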

Behavior Changes

Spark 2.0.1 implements the following behavior changes:

  • Spark 2.0.1 uses Scala 2.11 instead of 2.10.
  • Floating-point literals in SQL are now parsed as decimal type instead of double type.
  • The Kryo version is now 3.0.
  • Jersey version is now 2.
  • The Java RDD flatMap and mapPartitions methods now require functions that return a Java Iterator instead of an Iterable.
  • The Java RDD countByKey and countApproxDistinctByKey methods now return Map<K, Long> instead of Map<K, Object>.
  • When writing Parquet files, the summary files are no longer written (set parquet.enable.summary-metadata to true to re-enable).
  • Many changes were made in MLlib; follow the Apache Spark migration guide.
  • SparkContext.emptyRDD now returns RDD instead of EmptyRDD.
  • The Spark Standalone Master no longer serves the job history.
  • org.apache.spark.api.java.JavaPairRDD methods were changed:
    • countByKey and countApproxDistinctByKey now return java.lang.Long instead of scala.Long.
    • sampleByKey and sampleByKeyExact now return java.lang.Double instead of scala.Double.
  • The old application history format, which created a folder for each application, has been removed.
  • org.apache.spark.Logging is now private. Use slf4j directly instead, as in the sketch below.
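
A minimal Scala sketch of obtaining an slf4j logger directly (the class name is hypothetical):

```scala
import org.slf4j.{Logger, LoggerFactory}

// org.apache.spark.Logging is no longer public API; use slf4j directly.
class IngestJob {
  private val log: Logger = LoggerFactory.getLogger(classOf[IngestJob])

  def run(): Unit = {
    log.info("Starting ingest job")
  }
}
```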

Other Deprecated Items

  • Java 7 is now deprecated.
  • Python 2.6 is now deprecated.
  • TaskContext.isRunningLocally now always returns false, because local execution has been removed.
  • The yarn-client and yarn-cluster master values are deprecated. Use --master yarn with --deploy-mode client or --deploy-mode cluster instead.
  • Instead of HiveContext, use SparkSession.builder.enableHiveSupport (see the sketch after this list).
  • Instead of SQLContext, use SparkSession.builder.
  • Some methods related to Accumulators, ShuffleWriteMetrics, SparkILoop, Dataset, and SQLContext are now deprecated. You will see warnings in your application logs if you use them.
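
A minimal Scala sketch of the SparkSession entry point that replaces SQLContext and HiveContext (the application name is a placeholder; omit enableHiveSupport if Hive support is not needed):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("migration-example")   // placeholder application name
  .enableHiveSupport()            // optional; only when Hive support is required
  .getOrCreate()

// SparkSession exposes the query functionality that previously lived on SQLContext/HiveContext.
val df = spark.sql("SELECT 1 AS id")
df.show()
```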