Packages

p

org.apache.spark.sql

execution

package execution

The physical execution component of Spark SQL. Note that this is a private package. All classes in catalyst are considered an internal API to Spark SQL and are subject to change between minor releases.

Linear Supertypes
AnyRef, Any
Ordering
  1. Alphabetic
  2. By Inheritance
Inherited
  1. execution
  2. AnyRef
  3. Any
  1. Hide All
  2. Show All
Visibility
  1. Public
  2. All

Type Members

  1. class AggregatingAccumulator extends AccumulatorV2[InternalRow, InternalRow]

    Accumulator that computes a global aggregate.

  2. case class AppendColumnsExec(func: (Any) ⇒ Any, deserializer: Expression, serializer: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    Applies the given function to each input row, appending the encoded result at the end of the row.

  3. case class AppendColumnsWithObjectExec(func: (Any) ⇒ Any, inputSerializer: Seq[NamedExpression], newColumnsSerializer: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with ObjectConsumerExec with Product with Serializable

    An optimized version of AppendColumnsExec, that can be executed on deserialized object directly.

  4. case class ApplyColumnarRulesAndInsertTransitions(conf: SQLConf, columnarRules: Seq[ColumnarRule]) extends Rule[SparkPlan] with Product with Serializable

    Apply any user defined ColumnarRules and find the correct place to insert transitions to/from columnar formatted data.

  5. trait BaseLimitExec extends SparkPlan with LimitExec with CodegenSupport

    Helper trait which defines methods that are shared by both LocalLimitExec and GlobalLimitExec.

  6. abstract class BaseSubqueryExec extends SparkPlan

    Parent class for different types of subquery plans

  7. trait BinaryExecNode extends SparkPlan
  8. trait BlockingOperatorWithCodegen extends SparkPlan with CodegenSupport

    A special kind of operators which support whole stage codegen.

    A special kind of operators which support whole stage codegen. Blocking means these operators will consume all the inputs first, before producing output. Typical blocking operators are sort and aggregate.

  9. abstract class BufferedRowIterator extends AnyRef
  10. class CacheManager extends Logging

    Provides support in a SQLContext for caching query results and automatically using these cached results when subsequent queries are executed.

    Provides support in a SQLContext for caching query results and automatically using these cached results when subsequent queries are executed. Data is cached using byte buffers stored in an InMemoryRelation. This relation is automatically substituted query plans that return the sameResult as the originally cached query.

    Internal to Spark SQL.

  11. case class CachedData(plan: LogicalPlan, cachedRepresentation: InMemoryRelation) extends Product with Serializable

    Holds a cached logical plan and its data

  12. case class CoGroupExec(func: (Any, Iterator[Any], Iterator[Any]) ⇒ TraversableOnce[Any], keyDeserializer: Expression, leftDeserializer: Expression, rightDeserializer: Expression, leftGroup: Seq[Attribute], rightGroup: Seq[Attribute], leftAttr: Seq[Attribute], rightAttr: Seq[Attribute], outputObjAttr: Attribute, left: SparkPlan, right: SparkPlan) extends SparkPlan with BinaryExecNode with ObjectProducerExec with Product with Serializable

    Co-groups the data from left and right children, and calls the function with each group and 2 iterators containing all elements in the group from left and right side.

    Co-groups the data from left and right children, and calls the function with each group and 2 iterators containing all elements in the group from left and right side. The result of this function is flattened before being output.

  13. class CoGroupedIterator extends Iterator[(InternalRow, Iterator[InternalRow], Iterator[InternalRow])]

    Iterates over GroupedIterators and returns the cogrouped data, i.e.

    Iterates over GroupedIterators and returns the cogrouped data, i.e. each record is a grouping key with its associated values from all GroupedIterators. Note: we assume the output of each GroupedIterator is ordered by the grouping key.

  14. case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    Physical plan for returning a new RDD that has exactly numPartitions partitions.

    Physical plan for returning a new RDD that has exactly numPartitions partitions. Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.

    However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you see ShuffleExchange. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

  15. class CoalescedPartitioner extends Partitioner

    A Partitioner that might group together one or more partitions from the parent.

  16. trait CodegenSupport extends SparkPlan

    An interface for those physical operators that support codegen.

  17. case class CollapseCodegenStages(conf: SQLConf, codegenStageCounter: AtomicInteger = new AtomicInteger(0)) extends Rule[SparkPlan] with Product with Serializable

    Find the chained plans that support codegen, collapse them together as WholeStageCodegen.

    Find the chained plans that support codegen, collapse them together as WholeStageCodegen.

    The codegenStageCounter generates ID for codegen stages within a query plan. It does not affect equality, nor does it participate in destructuring pattern matching of WholeStageCodegenExec.

    This ID is used to help differentiate between codegen stages. It is included as a part of the explain output for physical plans, e.g.

    Physical Plan

    *(5) SortMergeJoin [x#3L], [y#9L], Inner :- *(2) Sort [x#3L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(x#3L, 200) : +- *(1) Project [(id#0L % 2) AS x#3L] : +- *(1) Filter isnotnull((id#0L % 2)) : +- *(1) Range (0, 5, step=1, splits=8) +- *(4) Sort [y#9L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(y#9L, 200) +- *(3) Project [(id#6L % 2) AS y#9L] +- *(3) Filter isnotnull((id#6L % 2)) +- *(3) Range (0, 5, step=1, splits=8)

    where the ID makes it obvious that not all adjacent codegen'd plan operators are of the same codegen stage.

    The codegen stage ID is also optionally included in the name of the generated classes as a suffix, so that it's easier to associate a generated class back to the physical operator. This is controlled by SQLConf: spark.sql.codegen.useIdInClassName

    The ID is also included in various log messages.

    Within a query, a codegen stage in a plan starts counting from 1, in "insertion order". WholeStageCodegenExec operators are inserted into a plan in depth-first post-order. See CollapseCodegenStages.insertWholeStageCodegen for the definition of insertion order.

    0 is reserved as a special ID value to indicate a temporary WholeStageCodegenExec object is created, e.g. for special fallback handling when an existing WholeStageCodegenExec failed to generate/compile code.

  18. case class CollectLimitExec(limit: Int, child: SparkPlan) extends SparkPlan with LimitExec with Product with Serializable

    Take the first limit elements and collect them to a single partition.

    Take the first limit elements and collect them to a single partition.

    This operator will be used when a logical Limit operation is the final operator in an logical plan, which happens when the user is collecting results back to the driver.

  19. case class CollectMetricsExec(name: String, metricExpressions: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    Collect arbitrary (named) metrics from a SparkPlan.

  20. class ColumnarRule extends AnyRef

    Holds a user defined rule that can be used to inject columnar implementations of various operators in the plan.

    Holds a user defined rule that can be used to inject columnar implementations of various operators in the plan. The preColumnarTransitions Rule can be used to replace SparkPlan instances with versions that support a columnar implementation. After this Spark will insert any transitions necessary. This includes transitions from row to columnar RowToColumnarExec and from columnar to row ColumnarToRowExec. At this point the postColumnarTransitions Rule is called to allow replacing any of the implementations of the transitions or doing cleanup of the plan, like inserting stages to build larger batches for more efficient processing, or stages that transition the data to/from an accelerator's memory.

  21. case class ColumnarToRowExec(child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable

    Provides a common executor to translate an RDD of ColumnarBatch into an RDD of InternalRow.

    Provides a common executor to translate an RDD of ColumnarBatch into an RDD of InternalRow. This is inserted whenever such a transition is determined to be needed.

    The implementation is based off of similar implementations in org.apache.spark.sql.execution.python.ArrowEvalPythonExec and MapPartitionsInRWithArrowExec. Eventually this should replace those implementations.

  22. trait DataSourceScanExec extends SparkPlan with LeafExecNode
  23. case class DeserializeToObjectExec(deserializer: Expression, outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with UnaryExecNode with ObjectProducerExec with CodegenSupport with Product with Serializable

    Takes the input row from child and turns it into object using the given deserializer expression.

    Takes the input row from child and turns it into object using the given deserializer expression. The output of this operator is a single-field safe row containing the deserialized object.

  24. abstract class ExecSubqueryExpression extends PlanExpression[BaseSubqueryExec]

    The base class for subquery that is used in SparkPlan.

  25. case class ExpandExec(projections: Seq[Seq[Expression]], output: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable

    Apply all of the GroupExpressions to every input row, hence we will get multiple output rows for an input row.

    Apply all of the GroupExpressions to every input row, hence we will get multiple output rows for an input row.

    projections

    The group of expressions, all of the group expressions should output the same schema specified bye the parameter output

    output

    The output Schema

    child

    Child operator

  26. sealed trait ExplainMode extends AnyRef
  27. case class ExternalRDD[T](outputObjAttr: Attribute, rdd: RDD[T])(session: SparkSession) extends LeafNode with ObjectProducer with MultiInstanceRelation with Product with Serializable

    Logical plan node for scanning data from an RDD.

  28. case class ExternalRDDScanExec[T](outputObjAttr: Attribute, rdd: RDD[T]) extends SparkPlan with LeafExecNode with ObjectProducerExec with Product with Serializable

    Physical plan node for scanning data from an RDD.

  29. trait FileRelation extends AnyRef

    An interface for relations that are backed by files.

    An interface for relations that are backed by files. When a class implements this interface, the list of paths that it returns will be returned to a user who calls inputPaths on any DataFrame that queries this relation.

  30. case class FileSourceScanExec(relation: HadoopFsRelation, output: Seq[Attribute], requiredSchema: StructType, partitionFilters: Seq[Expression], optionalBucketSet: Option[BitSet], dataFilters: Seq[Expression], tableIdentifier: Option[TableIdentifier]) extends SparkPlan with DataSourceScanExec with Product with Serializable

    Physical plan node for scanning data from HadoopFsRelations.

    Physical plan node for scanning data from HadoopFsRelations.

    relation

    The file-based relation to scan.

    output

    Output attributes of the scan, including data attributes and partition attributes.

    requiredSchema

    Required schema of the underlying relation, excluding partition columns.

    partitionFilters

    Predicates to use for partition pruning.

    optionalBucketSet

    Bucket ids for bucket pruning

    dataFilters

    Filters on non-partition columns.

    tableIdentifier

    identifier for the table in the metastore.

  31. case class FilterExec(condition: Expression, child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with PredicateHelper with Product with Serializable

    Physical plan for Filter.

  32. case class FlatMapGroupsInRExec(func: Array[Byte], packageNames: Array[Byte], broadcastVars: Array[Broadcast[AnyRef]], inputSchema: StructType, outputSchema: StructType, keyDeserializer: Expression, valueDeserializer: Expression, groupingAttributes: Seq[Attribute], dataAttributes: Seq[Attribute], outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with UnaryExecNode with ObjectProducerExec with Product with Serializable

    Groups the input rows together and calls the R function with each group and an iterator containing all elements in the group.

    Groups the input rows together and calls the R function with each group and an iterator containing all elements in the group. The result of this function is flattened before being output.

  33. case class FlatMapGroupsInRWithArrowExec(func: Array[Byte], packageNames: Array[Byte], broadcastVars: Array[Broadcast[AnyRef]], inputSchema: StructType, output: Seq[Attribute], keyDeserializer: Expression, groupingAttributes: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    Similar with FlatMapGroupsInRExec but serializes and deserializes input/output in Arrow format.

    Similar with FlatMapGroupsInRExec but serializes and deserializes input/output in Arrow format. This is also somewhat similar with org.apache.spark.sql.execution.python.FlatMapGroupsInPandasExec.

  34. case class GenerateExec(generator: Generator, requiredChildOutput: Seq[Attribute], outer: Boolean, generatorOutput: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable

    Applies a Generator to a stream of input rows, combining the output of each into a new stream of rows.

    Applies a Generator to a stream of input rows, combining the output of each into a new stream of rows. This operation is similar to a flatMap in functional programming with one important additional feature, which allows the input rows to be joined with their output.

    This operator supports whole stage code generation for generators that do not implement terminate().

    generator

    the generator expression

    requiredChildOutput

    required attributes from child's output

    outer

    when true, each input row will be output at least once, even if the output of the given generator is empty.

    generatorOutput

    the qualified output attributes of the generator of this node, which constructed in analysis phase, and we can not change it, as the parent node bound with it already.

  35. case class GlobalLimitExec(limit: Int, child: SparkPlan) extends SparkPlan with BaseLimitExec with Product with Serializable

    Take the first limit elements of the child's single output partition.

  36. class GroupedIterator extends Iterator[(InternalRow, Iterator[InternalRow])]

    Iterates over a presorted set of rows, chunking it up by the grouping expression.

    Iterates over a presorted set of rows, chunking it up by the grouping expression. Each call to next will return a pair containing the current group and an iterator that will return all the elements of that group. Iterators for each group are lazily constructed by extracting rows from the input iterator. As such, full groups are never materialized by this class.

    Example input:

    Input: [a, 1], [b, 2], [b, 3]
    Grouping: x#1
    InputSchema: x#1, y#2

    Result:

    First call to next():  ([a], Iterator([a, 1])
    Second call to next(): ([b], Iterator([b, 2], [b, 3])

    Note, the class does not handle the case of an empty input for simplicity of implementation. Use the factory to construct a new instance.

  37. case class InSubqueryExec(child: Expression, plan: BaseSubqueryExec, exprId: ExprId, resultBroadcast: Broadcast[Array[Any]] = null) extends ExecSubqueryExpression with Product with Serializable

    The physical node of in-subquery.

    The physical node of in-subquery. This is for Dynamic Partition Pruning only, as in-subquery coming from the original query will always be converted to joins.

  38. case class InputAdapter(child: SparkPlan) extends SparkPlan with UnaryExecNode with InputRDDCodegen with Product with Serializable

    InputAdapter is used to hide a SparkPlan from a subtree that supports codegen.

    InputAdapter is used to hide a SparkPlan from a subtree that supports codegen.

    This is the leaf node of a tree with WholeStageCodegen that is used to generate code that consumes an RDD iterator of InternalRow.

  39. trait InputRDDCodegen extends SparkPlan with CodegenSupport

    Leaf codegen node reading from a single RDD.

  40. trait LeafExecNode extends SparkPlan
  41. trait LimitExec extends SparkPlan with UnaryExecNode

    The operator takes limited number of elements from its child operator.

  42. case class LocalLimitExec(limit: Int, child: SparkPlan) extends SparkPlan with BaseLimitExec with Product with Serializable

    Take the first limit elements of each child partition, but do not collect or shuffle them.

  43. case class LocalTableScanExec(output: Seq[Attribute], rows: Seq[InternalRow]) extends SparkPlan with LeafExecNode with InputRDDCodegen with Product with Serializable

    Physical plan node for scanning data from a local collection.

    Physical plan node for scanning data from a local collection.

    Seq may not be serializable and ideally we should not send rows and unsafeRows to the executors. Thus marking them as transient.

  44. case class LogicalRDD(output: Seq[Attribute], rdd: RDD[InternalRow], outputPartitioning: Partitioning = UnknownPartitioning(0), outputOrdering: Seq[SortOrder] = Nil, isStreaming: Boolean = false)(session: SparkSession) extends LeafNode with MultiInstanceRelation with Product with Serializable

    Logical plan node for scanning data from an RDD of InternalRow.

  45. case class MapElementsExec(func: AnyRef, outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with ObjectConsumerExec with ObjectProducerExec with CodegenSupport with Product with Serializable

    Applies the given function to each input object.

    Applies the given function to each input object. The output of its child must be a single-field row containing the input object.

    This operator is kind of a safe version of ProjectExec, as its output is custom object, we need to use safe row to contain it.

  46. case class MapGroupsExec(func: (Any, Iterator[Any]) ⇒ TraversableOnce[Any], keyDeserializer: Expression, valueDeserializer: Expression, groupingAttributes: Seq[Attribute], dataAttributes: Seq[Attribute], outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with UnaryExecNode with ObjectProducerExec with Product with Serializable

    Groups the input rows together and calls the function with each group and an iterator containing all elements in the group.

    Groups the input rows together and calls the function with each group and an iterator containing all elements in the group. The result of this function is flattened before being output.

  47. case class MapPartitionsExec(func: (Iterator[Any]) ⇒ Iterator[Any], outputObjAttr: Attribute, child: SparkPlan) extends SparkPlan with ObjectConsumerExec with ObjectProducerExec with Product with Serializable

    Applies the given function to input object iterator.

    Applies the given function to input object iterator. The output of its child must be a single-field row containing the input object.

  48. case class MapPartitionsInRWithArrowExec(func: Array[Byte], packageNames: Array[Byte], broadcastVars: Array[Broadcast[AnyRef]], inputSchema: StructType, output: Seq[Attribute], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    Similar with MapPartitionsExec and org.apache.spark.sql.execution.r.MapPartitionsRWrapper but serializes and deserializes input/output in Arrow format.

    Similar with MapPartitionsExec and org.apache.spark.sql.execution.r.MapPartitionsRWrapper but serializes and deserializes input/output in Arrow format.

    This is somewhat similar with org.apache.spark.sql.execution.python.ArrowEvalPythonExec

  49. trait ObjectConsumerExec extends SparkPlan with UnaryExecNode

    Physical version of ObjectConsumer.

  50. trait ObjectProducerExec extends SparkPlan

    Physical version of ObjectProducer.

  51. case class OptimizeMetadataOnlyQuery(catalog: SessionCatalog) extends Rule[LogicalPlan] with Product with Serializable

    This rule optimizes the execution of queries that can be answered by looking only at partition-level metadata.

    This rule optimizes the execution of queries that can be answered by looking only at partition-level metadata. This applies when all the columns scanned are partition columns, and the query has an aggregate operator that satisfies the following conditions: 1. aggregate expression is partition columns. e.g. SELECT col FROM tbl GROUP BY col. 2. aggregate function on partition columns with DISTINCT. e.g. SELECT col1, count(DISTINCT col2) FROM tbl GROUP BY col1. 3. aggregate function on partition columns which have same result w or w/o DISTINCT keyword. e.g. SELECT col1, Max(col2) FROM tbl GROUP BY col1.

  52. case class PlanLater(plan: LogicalPlan) extends SparkPlan with LeafExecNode with Product with Serializable
  53. case class PlanSubqueries(sparkSession: SparkSession) extends Rule[SparkPlan] with Product with Serializable

    Plans subqueries that are present in the given SparkPlan.

  54. case class ProjectExec(projectList: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable

    Physical plan for Project.

  55. class QueryExecution extends AnyRef

    The primary workflow for executing relational queries using Spark.

    The primary workflow for executing relational queries using Spark. Designed to allow easy access to the intermediate phases of query execution for developers.

    While this is not a public class, we should avoid changing the function names for the sake of changing them, because a lot of developers use the feature for debugging.

  56. class QueryExecutionException extends Exception
  57. case class RDDScanExec(output: Seq[Attribute], rdd: RDD[InternalRow], name: String, outputPartitioning: Partitioning = UnknownPartitioning(0), outputOrdering: Seq[SortOrder] = Nil) extends SparkPlan with LeafExecNode with InputRDDCodegen with Product with Serializable

    Physical plan node for scanning data from an RDD of InternalRow.

  58. case class RangeExec(range: Range) extends SparkPlan with LeafExecNode with CodegenSupport with Product with Serializable

    Physical plan for range (generating a range of 64 bit numbers).

  59. final class RecordBinaryComparator extends RecordComparator
  60. case class ReuseSubquery(conf: SQLConf) extends Rule[SparkPlan] with Product with Serializable

    Find out duplicated subqueries in the spark plan, then use the same subquery result for all the references.

  61. case class ReusedSubqueryExec(child: BaseSubqueryExec) extends BaseSubqueryExec with LeafExecNode with Product with Serializable

    A wrapper for reused BaseSubqueryExec.

  62. case class RowDataSourceScanExec(fullOutput: Seq[Attribute], requiredColumnsIndex: Seq[Int], filters: Set[Filter], handledFilters: Set[Filter], rdd: RDD[InternalRow], relation: BaseRelation, tableIdentifier: Option[TableIdentifier]) extends SparkPlan with DataSourceScanExec with InputRDDCodegen with Product with Serializable

    Physical plan node for scanning data from a relation.

  63. case class RowToColumnarExec(child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    Provides a common executor to translate an RDD of InternalRow into an RDD of ColumnarBatch.

    Provides a common executor to translate an RDD of InternalRow into an RDD of ColumnarBatch. This is inserted whenever such a transition is determined to be needed.

    This is similar to some of the code in ArrowConverters.scala and org.apache.spark.sql.execution.arrow.ArrowWriter. That code is more specialized to convert InternalRow to Arrow formatted data, but in the future if we make OffHeapColumnVector internally Arrow formatted we may be able to replace much of that code.

    This is also similar to org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate() and org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.toBatch() toBatch is only ever called from tests and can probably be removed, but populate is used by both Orc and Parquet to initialize partition and missing columns. There is some chance that we could replace populate with RowToColumnConverter, but the performance requirements are different and it would only be to reduce code.

  64. class SQLExecutionRDD extends RDD[InternalRow]

    It is just a wrapper over sqlRDD, which sets and makes effective all the configs from the captured SQLConf.

    It is just a wrapper over sqlRDD, which sets and makes effective all the configs from the captured SQLConf. Please notice that this means we may miss configurations set after the creation of this RDD and before its execution.

  65. case class SampleExec(lowerBound: Double, upperBound: Double, withReplacement: Boolean, seed: Long, child: SparkPlan) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable

    Physical plan for sampling the dataset.

    Physical plan for sampling the dataset.

    lowerBound

    Lower-bound of the sampling probability (usually 0.0)

    upperBound

    Upper-bound of the sampling probability. The expected fraction sampled will be ub - lb.

    withReplacement

    Whether to sample with replacement.

    seed

    the random seed

    child

    the SparkPlan

  66. case class ScalarSubquery(plan: BaseSubqueryExec, exprId: ExprId) extends ExecSubqueryExpression with Product with Serializable

    A subquery that will return only one row and one column.

    A subquery that will return only one row and one column.

    This is the physical copy of ScalarSubquery to be used inside SparkPlan.

  67. case class SerializeFromObjectExec(serializer: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with ObjectConsumerExec with CodegenSupport with Product with Serializable

    Takes the input object from child and turns in into unsafe row using the given serializer expression.

    Takes the input object from child and turns in into unsafe row using the given serializer expression. The output of its child must be a single-field row containing the input object.

  68. class ShuffledRowRDD extends RDD[InternalRow]

    This is a specialized version of org.apache.spark.rdd.ShuffledRDD that is optimized for shuffling rows instead of Java key-value pairs.

    This is a specialized version of org.apache.spark.rdd.ShuffledRDD that is optimized for shuffling rows instead of Java key-value pairs. Note that something like this should eventually be implemented in Spark core, but that is blocked by some more general refactorings to shuffle interfaces / internals.

    This RDD takes a ShuffleDependency (dependency), and an optional array of partition start indices as input arguments (specifiedPartitionStartIndices).

    The dependency has the parent RDD of this RDD, which represents the dataset before shuffle (i.e. map output). Elements of this RDD are (partitionId, Row) pairs. Partition ids should be in the range [0, numPartitions - 1]. dependency.partitioner is the original partitioner used to partition map output, and dependency.partitioner.numPartitions is the number of pre-shuffle partitions (i.e. the number of partitions of the map output).

    When specifiedPartitionStartIndices is defined, specifiedPartitionStartIndices.length will be the number of post-shuffle partitions. For this case, the ith post-shuffle partition includes specifiedPartitionStartIndices[i] to specifiedPartitionStartIndices[i+1] - 1 (inclusive).

    When specifiedPartitionStartIndices is not defined, there will be dependency.partitioner.numPartitions post-shuffle partitions. For this case, a post-shuffle partition is created for every pre-shuffle partition.

  69. case class SortExec(sortOrder: Seq[SortOrder], global: Boolean, child: SparkPlan, testSpillFrequency: Int = 0) extends SparkPlan with UnaryExecNode with BlockingOperatorWithCodegen with Product with Serializable

    Performs (external) sorting.

    Performs (external) sorting.

    global

    when true performs a global sort of all partitions by shuffling the data first if necessary.

    testSpillFrequency

    Method for configuring periodic spilling in unit tests. If set, will spill every frequency records.

  70. class SparkOptimizer extends Optimizer
  71. abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable

    The base class for physical operators.

    The base class for physical operators.

    The naming convention is that physical operators end with "Exec" suffix, e.g. ProjectExec.

  72. class SparkPlanInfo extends AnyRef

    :: DeveloperApi :: Stores information about a SQL SparkPlan.

    :: DeveloperApi :: Stores information about a SQL SparkPlan.

    Annotations
    @DeveloperApi()
  73. class SparkPlanner extends SparkStrategies
  74. class SparkSqlAstBuilder extends AstBuilder

    Builder that converts an ANTLR ParseTree into a LogicalPlan/Expression/TableIdentifier.

  75. class SparkSqlParser extends AbstractSqlParser

    Concrete parser for Spark SQL statements.

  76. abstract class SparkStrategies extends QueryPlanner[SparkPlan]
  77. abstract class SparkStrategy extends GenericStrategy[SparkPlan]

    Converts a logical plan into zero or more SparkPlans.

    Converts a logical plan into zero or more SparkPlans. This API is exposed for experimenting with the query planner and is not designed to be stable across spark releases. Developers writing libraries should instead consider using the stable APIs provided in org.apache.spark.sql.sources

  78. case class SubqueryBroadcastExec(name: String, index: Int, buildKeys: Seq[Expression], child: SparkPlan) extends BaseSubqueryExec with UnaryExecNode with Product with Serializable

    Physical plan for a custom subquery that collects and transforms the broadcast key values.

    Physical plan for a custom subquery that collects and transforms the broadcast key values. This subquery retrieves the partition key from the broadcast results based on the type of HashedRelation returned. If the key is packed inside a Long, we extract it through bitwise operations, otherwise we return it from the appropriate index of the UnsafeRow.

    index

    the index of the join key in the list of keys from the build side

    buildKeys

    the join keys from the build side of the join used

    child

    the BroadcastExchange from the build side of the join

  79. case class SubqueryExec(name: String, child: SparkPlan) extends BaseSubqueryExec with UnaryExecNode with Product with Serializable

    Physical plan for a subquery.

  80. case class TakeOrderedAndProjectExec(limit: Int, sortOrder: Seq[SortOrder], projectList: Seq[NamedExpression], child: SparkPlan) extends SparkPlan with UnaryExecNode with Product with Serializable

    Take the first limit elements as defined by the sortOrder, and do projection if needed.

    Take the first limit elements as defined by the sortOrder, and do projection if needed. This is logically equivalent to having a Limit operator after a SortExec operator, or having a ProjectExec operator between them. This could have been named TopK, but Spark's top operator does the opposite in ordering so we name it TakeOrdered to avoid confusion.

  81. trait UnaryExecNode extends SparkPlan
  82. case class UnionExec(children: Seq[SparkPlan]) extends SparkPlan with Product with Serializable

    Physical plan for unioning two plans, without a distinct.

    Physical plan for unioning two plans, without a distinct. This is UNION ALL in SQL.

    If we change how this is implemented physically, we'd need to update org.apache.spark.sql.catalyst.plans.logical.Union.maxRowsPerPartition.

  83. final class UnsafeExternalRowSorter extends AnyRef
  84. final class UnsafeFixedWidthAggregationMap extends AnyRef
  85. final class UnsafeKVExternalSorter extends AnyRef
  86. class UnsafeRowSerializer extends Serializer with Serializable

    Serializer for serializing UnsafeRows during shuffle.

    Serializer for serializing UnsafeRows during shuffle. Since UnsafeRows are already stored as bytes, this serializer simply copies those bytes to the underlying output stream. When deserializing a stream of rows, instances of this serializer mutate and return a single UnsafeRow instance that is backed by an on-heap byte array.

    Note that this serializer implements only the Serializer methods that are used during shuffle, so certain SerializerInstance methods will throw UnsupportedOperationException.

  87. case class WholeStageCodegenExec(child: SparkPlan)(codegenStageId: Int) extends SparkPlan with UnaryExecNode with CodegenSupport with Product with Serializable

    WholeStageCodegen compiles a subtree of plans that support codegen together into single Java function.

    WholeStageCodegen compiles a subtree of plans that support codegen together into single Java function.

    Here is the call graph of to generate Java source (plan A supports codegen, but plan B does not):

    WholeStageCodegen Plan A FakeInput Plan B

    -> execute() | doExecute() ---------> inputRDDs() -------> inputRDDs() ------> execute() | +-----------------> produce() | doProduce() -------> produce() | doProduce() | doConsume() <--------- consume() | doConsume() <-------- consume()

    SparkPlan A should override doProduce() and doConsume().

    doCodeGen() will create a CodeGenContext, which will hold a list of variables for input, used to generated code for BoundReference.

Value Members

  1. object AggregatingAccumulator extends Serializable
  2. object BaseLimitExec extends Serializable
  3. object CoalesceExec extends Serializable
  4. object CodegenMode extends ExplainMode with Product with Serializable

    Codegen mode means that when printing explain for a DataFrame, if generated codes are available, a physical plan and the generated codes are expected to be printed to the console.

  5. object CollectMetricsExec extends Serializable
  6. object CostMode extends ExplainMode with Product with Serializable

    Cost mode means that when printing explain for a DataFrame, if plan node statistics are available, a logical plan and the statistics are expected to be printed to the console.

  7. object ExecSubqueryExpression
  8. object ExplainMode
  9. object ExplainUtils
  10. object ExtendedMode extends ExplainMode with Product with Serializable

    Extended mode means that when printing explain for a DataFrame, both logical and physical plans are expected to be printed to the console.

  11. object ExternalRDD extends Serializable
  12. object FormattedMode extends ExplainMode with Product with Serializable

    Formatted mode means that when printing explain for a DataFrame, explain output is expected to be split into two sections: a physical plan outline and node details.

  13. object GroupedIterator
  14. object HiveResult

    Runs a query returning the result in Hive compatible form.

  15. object MapGroupsExec extends Serializable
  16. object ObjectOperator

    Helper functions for physical operators that work with user defined objects.

  17. object PartitionedFileUtil
  18. object QueryExecution
  19. object SQLExecution
  20. object SimpleMode extends ExplainMode with Product with Serializable

    Simple mode means that when printing explain for a DataFrame, only a physical plan is expected to be printed to the console.

  21. object SortPrefixUtils
  22. object SparkPlan extends Serializable
  23. object SubqueryBroadcastExec extends Serializable
  24. object SubqueryExec extends Serializable
  25. object UnaryExecNode extends Serializable
  26. object WholeStageCodegenExec extends Serializable

Inherited from AnyRef

Inherited from Any

Ungrouped