ChiSquareTest
- class pyspark.ml.stat.ChiSquareTest
- Conduct Pearson’s independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared statistic is computed. All label and feature values must be categorical.
- The null hypothesis is that the occurrence of the outcomes is statistically independent.
- New in version 2.2.0.
- Methods
- test(dataset, featuresCol, labelCol[, flatten]) - Perform a Pearson’s independence test using dataset.
- Methods Documentation
- static test(dataset, featuresCol, labelCol, flatten=False)
- Perform a Pearson’s independence test using dataset.
- New in version 2.2.0.
- Changed in version 3.1.0: Added optional flatten argument.
- Parameters
- dataset : pyspark.sql.DataFrame
- DataFrame of categorical labels and categorical features. Real-valued features will be treated as categorical for each distinct value.
- featuresCol : str
- Name of features column in dataset, of type Vector (VectorUDT).
- labelCol : str
- Name of label column in dataset, of any numerical type.
- flatten : bool, optional
- If True, flattens the returned DataFrame to one row per feature. Defaults to False.
 
- Returns
- pyspark.sql.DataFrame
- DataFrame containing the test result for every feature against the label.
- If flatten is True, this DataFrame will contain one row per feature with the following fields:
  - featureIndex: int
  - pValue: float
  - degreesOfFreedom: int
  - statistic: float

- If flatten is False, this DataFrame will contain a single Row with the following fields:
  - pValues: Vector
  - degreesOfFreedom: Array[int]
  - statistics: Vector

- Each of these fields has one value per feature.
 
- Examples

>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.stat import ChiSquareTest
>>> dataset = [[0, Vectors.dense([0, 0, 1])],
...            [0, Vectors.dense([1, 0, 1])],
...            [1, Vectors.dense([2, 1, 1])],
...            [1, Vectors.dense([3, 1, 1])]]
>>> dataset = spark.createDataFrame(dataset, ["label", "features"])
>>> chiSqResult = ChiSquareTest.test(dataset, 'features', 'label')
>>> chiSqResult.select("degreesOfFreedom").collect()[0]
Row(degreesOfFreedom=[3, 1, 0])
>>> chiSqResult = ChiSquareTest.test(dataset, 'features', 'label', True)
>>> row = chiSqResult.orderBy("featureIndex").collect()
>>> row[0].statistic
4.0
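
The degrees of freedom and the statistic in the example can be checked by hand. The following is a minimal pure-Python sketch of the textbook Pearson computation on a contingency table built from (feature, label) pairs. It is not Spark's implementation, and the helper name `pearson_chi_square` is hypothetical. It is applied to the three feature columns of the example dataset.

```python
from collections import Counter


def pearson_chi_square(features, labels):
    """Return (statistic, degrees_of_freedom) for one categorical feature.

    Hypothetical helper for illustration; not part of pyspark.
    """
    n = len(labels)
    observed = Counter(zip(features, labels))  # contingency-table cell counts
    feature_totals = Counter(features)         # row totals
    label_totals = Counter(labels)             # column totals

    statistic = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            # Count expected in this cell if feature and label were independent.
            expected = f_total * l_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected

    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


# Labels and per-feature columns taken from the example dataset.
labels = [0, 0, 1, 1]
columns = [[0, 1, 2, 3], [0, 0, 1, 1], [1, 1, 1, 1]]
results = [pearson_chi_square(col, labels) for col in columns]
print(results)  # [(4.0, 3), (4.0, 1), (0.0, 0)]
```

The degrees of freedom 3, 1 and 0 match `Row(degreesOfFreedom=[3, 1, 0])`, and the first statistic matches `row[0].statistic`. The third feature takes a single value, so its table has one row and zero degrees of freedom.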