pyspark.sql.DataFrame
class pyspark.sql.DataFrame(jdf, sql_ctx)
A distributed collection of data grouped into named columns.

New in version 1.3.0.

Changed in version 3.4.0: Supports Spark Connect.

Notes

A DataFrame should only be created as described above. It should not be created directly via the constructor.

Examples

A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession:

>>> people = spark.createDataFrame([
...     {"deptId": 1, "age": 40, "name": "Hyukjin Kwon", "gender": "M", "salary": 50},
...     {"deptId": 1, "age": 50, "name": "Takuya Ueshin", "gender": "M", "salary": 100},
...     {"deptId": 2, "age": 60, "name": "Xinrong Meng", "gender": "F", "salary": 150},
...     {"deptId": 3, "age": 20, "name": "Haejoon Lee", "gender": "M", "salary": 200}
... ])

Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in DataFrame and Column.

To select a column from the DataFrame, use the apply method:

>>> age_col = people.age

A more concrete example:

>>> # To create DataFrame using SparkSession
... department = spark.createDataFrame([
...     {"id": 1, "name": "PySpark"},
...     {"id": 2, "name": "ML"},
...     {"id": 3, "name": "Spark SQL"}
... ])

>>> people.filter(people.age > 30).join(
...     department, people.deptId == department.id).groupBy(
...     department.name, "gender").agg(
...     {"salary": "avg", "age": "max"}).sort("max(age)").show()
+-------+------+-----------+--------+
|   name|gender|avg(salary)|max(age)|
+-------+------+-----------+--------+
|PySpark|     M|       75.0|      50|
|     ML|     F|      150.0|      60|
+-------+------+-----------+--------+
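Because every transformation returns a new DataFrame, DSL calls chain freely. A minimal illustrative sketch reusing the people DataFrame above (the derived bonus column and the 0.1 rate are invented for illustration, not part of the original example):

>>> from pyspark.sql import functions as F
>>> (people
...     .where(F.col("age") > 30)                    # where() is an alias for filter()
...     .withColumn("bonus", F.col("salary") * 0.1)  # add a derived column
...     .select("name", "deptId", "salary", "bonus")
...     .show())
+-------------+------+------+-----+
|         name|deptId|salary|bonus|
+-------------+------+------+-----+
| Hyukjin Kwon|     1|    50|  5.0|
|Takuya Ueshin|     1|   100| 10.0|
| Xinrong Meng|     2|   150| 15.0|
+-------------+------+------+-----+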
Methods

- agg(*exprs): Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
- alias(alias): Returns a new DataFrame with an alias set.
- approxQuantile(col, probabilities, relativeError): Calculates the approximate quantiles of numerical columns of a DataFrame.
- cache(): Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER); see the caching sketch below.
- checkpoint([eager]): Returns a checkpointed version of this DataFrame.
- coalesce(numPartitions): Returns a new DataFrame that has exactly numPartitions partitions.
- colRegex(colName): Selects column based on the column name specified as a regex and returns it as Column.
- collect(): Returns all the records in the DataFrame as a list of Row.
- corr(col1, col2[, method]): Calculates the correlation of two columns of a DataFrame as a double value.
- count(): Returns the number of rows in this DataFrame.
- cov(col1, col2): Calculates the sample covariance for the given columns, specified by their names, as a double value.
- createGlobalTempView(name): Creates a global temporary view with this DataFrame.
- createOrReplaceGlobalTempView(name): Creates or replaces a global temporary view using the given name.
- createOrReplaceTempView(name): Creates or replaces a local temporary view with this DataFrame.
- createTempView(name): Creates a local temporary view with this DataFrame.
- crossJoin(other): Returns the cartesian product with another DataFrame.
- crosstab(col1, col2): Computes a pair-wise frequency table of the given columns.
- cube(*cols): Create a multi-dimensional cube for the current DataFrame using the specified columns, allowing aggregations to be performed on them.
- describe(*cols): Computes basic statistics for numeric and string columns.
- distinct(): Returns a new DataFrame containing the distinct rows in this DataFrame.
- drop(*cols): Returns a new DataFrame without specified columns.
- dropDuplicates([subset]): Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.
- dropDuplicatesWithinWatermark([subset]): Return a new DataFrame with duplicate rows removed, optionally only considering certain columns, within the watermark.
- drop_duplicates([subset]): drop_duplicates() is an alias for dropDuplicates().
- dropna([how, thresh, subset]): Returns a new DataFrame omitting rows with null or NaN values.
- exceptAll(other): Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates (sketched below).
- explain([extended, mode]): Prints the (logical and physical) plans to the console for debugging purposes.
- fillna(value[, subset]): Returns a new DataFrame in which null values are replaced with the given value.
- filter(condition): Filters rows using the given condition.
- first(): Returns the first row as a Row.
- foreach(f): Applies the f function to all rows of this DataFrame.
- foreachPartition(f): Applies the f function to each partition of this DataFrame.
- freqItems(cols[, support]): Finding frequent items for columns, possibly with false positives.
- groupBy(*cols): Groups the DataFrame by the specified columns so that aggregation can be performed on them.
- groupby(*cols): groupby() is an alias for groupBy().
- groupingSets(groupingSets, *cols): Create multi-dimensional aggregation for the current DataFrame using the specified grouping sets, so we can run aggregation on them.
- head([n]): Returns the first n rows.
- hint(name, *parameters): Specifies some hint on the current DataFrame.
- inputFiles(): Returns a best-effort snapshot of the files that compose this DataFrame.
- intersect(other): Return a new DataFrame containing rows only in both this DataFrame and another DataFrame.
- intersectAll(other): Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates.
- isEmpty(): Checks if the DataFrame is empty and returns a boolean value.
- isLocal(): Returns True if the collect() and take() methods can be run locally (without any Spark executors).
- join(other[, on, how]): Joins with another DataFrame, using the given join expression.
- limit(num): Limits the result count to the number specified.
- localCheckpoint([eager]): Returns a locally checkpointed version of this DataFrame.
- mapInArrow(func, schema[, barrier, profile]): Maps an iterator of batches in the current DataFrame using a Python native function that takes and returns pyarrow.RecordBatch objects, and returns the result as a DataFrame.
- mapInPandas(func, schema[, barrier, profile]): Maps an iterator of batches in the current DataFrame using a Python native function that takes and returns pandas DataFrames, and returns the result as a DataFrame (sketched below).
- melt(ids, values, variableColumnName, ...): Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.
- mergeInto(table, condition): Merges a set of updates, insertions, and deletions based on a source table into a target table.
- observe(observation, *exprs): Define (named) metrics to observe on the DataFrame.
- offset(num): Returns a new DataFrame by skipping the first n rows.
- orderBy(*cols, **kwargs): Returns a new DataFrame sorted by the specified column(s).
- pandas_api([index_col]): Converts the existing DataFrame into a pandas-on-Spark DataFrame.
- persist([storageLevel]): Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed (see the caching sketch below).
- printSchema([level]): Prints out the schema in the tree format.
- randomSplit(weights[, seed]): Randomly splits this DataFrame with the provided weights.
- registerTempTable(name): Registers this DataFrame as a temporary table using the given name.
- repartition(numPartitions, *cols): Returns a new DataFrame partitioned by the given partitioning expressions.
- repartitionByRange(numPartitions, *cols): Returns a new DataFrame partitioned by the given partitioning expressions.
- replace(to_replace[, value, subset]): Returns a new DataFrame replacing a value with another value.
- rollup(*cols): Create a multi-dimensional rollup for the current DataFrame using the specified columns, allowing for aggregation on them.
- sameSemantics(other): Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results.
- sample([withReplacement, fraction, seed]): Returns a sampled subset of this DataFrame.
- sampleBy(col, fractions[, seed]): Returns a stratified sample without replacement based on the fraction given on each stratum.
- select(*cols): Projects a set of expressions and returns a new DataFrame.
- selectExpr(*expr): Projects a set of SQL expressions and returns a new DataFrame.
- semanticHash(): Returns a hash code of the logical query plan against this DataFrame.
- show([n, truncate, vertical]): Prints the first n rows of the DataFrame to the console.
- sort(*cols, **kwargs): Returns a new DataFrame sorted by the specified column(s).
- sortWithinPartitions(*cols, **kwargs): Returns a new DataFrame with each partition sorted by the specified column(s).
- subtract(other): Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.
- summary(*statistics): Computes specified statistics for numeric and string columns.
- tail(num): Returns the last num rows as a list of Row.
- take(num): Returns the first num rows as a list of Row.
- to(schema): Returns a new DataFrame where each row is reconciled to match the specified schema.
- toArrow(): Returns the contents of this DataFrame as a PyArrow pyarrow.Table.
- toDF(*cols): Returns a new DataFrame with the new specified column names.
- toJSON([use_unicode]): Converts a DataFrame into an RDD of strings.
- toLocalIterator([prefetchPartitions]): Returns an iterator that contains all of the rows in this DataFrame.
- toPandas(): Returns the contents of this DataFrame as a Pandas pandas.DataFrame.
- transform(func, *args, **kwargs): Returns a new DataFrame; concise syntax for chaining custom transformations (sketched below).
- transpose([indexColumn]): Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.
- union(other): Return a new DataFrame containing the union of rows in this and another DataFrame (sketched below).
- unionAll(other): Return a new DataFrame containing the union of rows in this and another DataFrame.
- unionByName(other[, allowMissingColumns]): Returns a new DataFrame containing the union of rows in this and another DataFrame, matching columns by name.
- unpersist([blocking]): Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk.
- unpivot(ids, values, variableColumnName, ...): Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set (sketched below).
- where(condition): where() is an alias for filter().
- withColumn(colName, col): Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
- withColumnRenamed(existing, new): Returns a new DataFrame by renaming an existing column.
- withColumns(*colsMap): Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names.
- withColumnsRenamed(colsMap): Returns a new DataFrame by renaming multiple columns.
- withMetadata(columnName, metadata): Returns a new DataFrame by updating an existing column with metadata.
- withWatermark(eventTime, delayThreshold): Defines an event time watermark for this DataFrame.
- writeTo(table): Create a write configuration builder for v2 sources.

Attributes

- columns: Retrieves the names of all columns in the DataFrame as a list.
- dtypes: Returns all column names and their data types as a list.
- executionInfo: Returns an ExecutionInfo object after the query was executed.
- isStreaming: Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.
- na: Returns a DataFrameNaFunctions for handling missing values.
- rdd: Returns the content as a pyspark.RDD of Row.
- schema: Returns the schema of this DataFrame as a pyspark.sql.types.StructType.
- sparkSession: Returns the Spark session that created this DataFrame.
- stat: Returns a DataFrameStatFunctions for statistic functions.
- storageLevel: Get the DataFrame's current storage level.
- write: Interface for saving the content of the non-streaming DataFrame out into external storage.
- writeStream: Interface for saving the content of the streaming DataFrame out into external storage.
- is_cached
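The entries above are terse by design, so a few hedged usage sketches follow; the DataFrames, column names, and constants in them are invented for illustration and are not part of the API reference. First, the union/except/distinct family on small throwaway data (df1 and df2 are hypothetical):

>>> df1 = spark.createDataFrame([(1, "a"), (2, "b"), (2, "b")], ["id", "v"])
>>> df2 = spark.createDataFrame([(2, "b"), (3, "c")], ["id", "v"])
>>> df1.union(df2).count()        # positional union keeps duplicates
5
>>> df1.exceptAll(df2).count()    # difference that preserves duplicates
2
>>> df1.subtract(df2).collect()   # set difference de-duplicates the result
[Row(id=1, v='a')]
>>> df1.dropDuplicates(["id"]).count()
2

unionByName(df2) behaves like union but matches columns by name rather than position, which is safer when the two schemas list their columns in different orders.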
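cache() and persist() are lazy: they only mark the plan, and the blocks are materialized by the first action. A sketch reusing people from the examples above (the DISK_ONLY choice is purely illustrative):

>>> from pyspark import StorageLevel
>>> cached = people.cache()        # default level: MEMORY_AND_DISK_DESER
>>> cached.count()                 # first action materializes the cache
4
>>> cached.storageLevel.useMemory  # inspect via the storageLevel attribute
True
>>> _ = cached.unpersist()         # drop the cached blocks
>>> _ = people.persist(StorageLevel.DISK_ONLY)  # or pick a level explicitly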
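mapInPandas streams batches through plain Python: func receives an iterator of pandas.DataFrame and must yield pandas DataFrames whose columns match the declared output schema. A sketch against people (the net column and the 0.7 factor are hypothetical):

>>> def net_salary(batches):
...     # one pandas.DataFrame per Arrow batch; yield transformed batches
...     for pdf in batches:
...         yield pdf[["name", "salary"]].assign(net=pdf["salary"] * 0.7)
...
>>> people.mapInPandas(net_salary, schema="name string, salary long, net double").show()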
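unpivot (alias melt) turns selected value columns into (variable, value) rows; the value columns are cast to their least common type. A sketch on a throwaway wide DataFrame:

>>> wide = spark.createDataFrame(
...     [(1, 11, 1.1), (2, 12, 1.2)],
...     ["id", "int_col", "double_col"])
>>> wide.unpivot(
...     ids="id", values=["int_col", "double_col"],
...     variableColumnName="var", valueColumnName="val").show()
+---+----------+----+
| id|       var| val|
+---+----------+----+
|  1|   int_col|11.0|
|  1|double_col| 1.1|
|  2|   int_col|12.0|
|  2|double_col| 1.2|
+---+----------+----+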
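transform exists for readable chaining: it calls func(self, *args, **kwargs) and expects a DataFrame back, so custom steps compose like the built-in methods. A sketch (with_bonus is a hypothetical helper, not part of the API):

>>> def with_bonus(df, rate):
...     # any function that takes and returns a DataFrame fits transform()
...     return df.withColumn("bonus", df.salary * rate)
...
>>> people.transform(with_bonus, 0.1).select("name", "bonus").show()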