Built-in Expectations in Great Expectations
Great Expectations is a Python tool for data testing, documentation, and profiling. Here is a figure from the documentation describing its purpose:
Great Expectations makes it easy to include data testing in your ML pipeline, when dealing with tabular data. Data testing is similar to software testing. You launch a test suite to check various assumptions on a given dataset: column names, types, min-max values, distributions, proportion of missing values, categories, string content… These are called Expectations. A set of declared Expectations is called an Expectation Suite. Once a Suite has been created, you can use it to validate any new/modified data. If ever it fails, you can adapt the pipeline depending on the type of data failure. The library also allows the automatic creation of data quality reports and documentation.
Data checking makes it easier to identify the source of an error in an ML model, and it also helps identify drift or other problems in the input data. More generally, it helps build trust in the data.
Great Expectations has many components and many features. In this post, we are going to focus on a single subject: Expectations, which are a central element of the library. We are going to list built-in Expectations, and apply them to an example dataset: the ubiquitous Titanic dataset. We are NOT going to deal with other subjects such as Expectation Suites, Datasources, Checkpoints, Stores, Data Contexts, CLI, deployment, metrics…
Expectations
As it is written in the documentation, “It all starts with Expectations. An Expectation is how we communicate the way data should appear.” From the API reference:
An Expectation is a statement describing a verifiable property of data. Like assertions in traditional python unit tests, Expectations provide a flexible, declarative language for describing expected behavior. Unlike traditional unit tests, Great Expectations applies Expectations to data instead of code.
Note that it is also possible to create custom Expectations. Here we are going to look at the built-in Expectations for the Pandas backend. As explained in the documentation: “Because Great Expectations can run against different platforms, not all Expectations have been implemented for all platforms.” The different backends are Pandas, SQL and Spark.
Imports
import string
from datetime import datetime
import pandas as pd
import numpy as np
import great_expectations as ge
Versions:
- Python 3.9.7
- great_expectations 0.13.31
- pandas 1.3.2
Load the data into a dataframe
We load the data with Great Expectations:
df = ge.read_csv(
"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
)
Here is a small description of each feature:
- Survived - Survival (0 = No; 1 = Yes)
- Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- Name - Name
- Sex - Sex (“male” or “female”)
- Age - Age
- SibSp - Number of Siblings/Spouses Aboard
- Parch - Number of Parents/Children Aboard
- Ticket - Ticket Number
- Fare - Passenger Fare
- Cabin - Cabin
- Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
df.shape
(891, 12)
df.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
df.nunique()
PassengerId 891
Survived 2
Pclass 3
Name 891
Sex 2
Age 88
SibSp 7
Parch 7
Ticket 681
Fare 248
Cabin 147
Embarked 3
dtype: int64
df.dtypes
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
Testing an Expectation on the dataset
An Expectation is exposed as a DataFrame method. Let’s try the expect_column_to_exist Expectation:
expect = df.expect_column_to_exist(column="PassengerId")
expect
{
"exception_info": {
"raised_exception": false,
"exception_traceback": null,
"exception_message": null
},
"success": true,
"result": {},
"meta": {}
}
It returns a dict-like validation report. The level of verbosity can be changed when creating reports. But in the present post, we are only interested in the “success” key:
expect.success
True
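To make the shape of such a report concrete, here is a minimal plain-Python sketch of a check that returns a dict with a "success" key. The function name check_column_exists and its arguments are hypothetical, chosen for illustration only; this is not how Great Expectations is implemented.

```python
def check_column_exists(columns, column):
    """Return a small report dict, loosely mirroring the keys shown above."""
    return {
        "success": column in columns,  # the only key used in this post
        "result": {},
        "meta": {},
    }


columns = ["PassengerId", "Survived", "Pclass"]
report = check_column_exists(columns, "PassengerId")
missing = check_column_exists(columns, "toto")
print(report["success"], missing["success"])
```

A failed check is reported through the "success" value rather than by raising an exception, which is why the rest of this post wraps each call in an assert.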
So let’s start with the available Expectations. Note that some Expectations have not yet been migrated to the v3 (Batch Request) API, so we did not include them in this post. Also, we did not investigate distributional Expectations, which are a bit more complicated to use…
Expectations : Table shape
columns = df.columns
for column in columns:
    expect = df.expect_column_to_exist(column=column)
    assert expect.success
expect = df.expect_table_columns_to_match_ordered_list(column_list=columns)
assert expect.success
expect = df.expect_table_columns_to_match_ordered_list(column_list=columns[::-1])
assert not expect.success
expect = df.expect_table_columns_to_match_set(column_set=columns)
assert expect.success
expect = df.expect_table_columns_to_match_set(column_set=columns[:-1])
expect.result["details"]
{'mismatched': {'unexpected': ['Embarked']}}
When the exact_match boolean parameter is set to False, the Expectation fails if a listed column is missing from the dataset, but not if the dataset contains columns that are not listed:
expect = df.expect_table_columns_to_match_set(
column_set=["PassengerId", "toto"], exact_match=False
)
assert not expect.success
expect = df.expect_table_columns_to_match_set(column_set=[], exact_match=False)
assert expect.success
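The effect of exact_match can be pictured with plain set operations. The sketch below is an illustration under assumptions: the helper columns_match_set is hypothetical and only mimics the behaviour described above, not the library's code.

```python
def columns_match_set(actual, expected, exact_match=True):
    """Compare column names as sets (hypothetical helper)."""
    actual, expected = set(actual), set(expected)
    missing = expected - actual      # expected but absent from the data
    unexpected = actual - expected   # present in the data but not expected
    if exact_match:
        return not missing and not unexpected
    # With exact_match=False, only missing columns cause a failure.
    return not missing


actual = ["PassengerId", "Survived", "Embarked"]
strict_subset = columns_match_set(actual, ["PassengerId"], exact_match=True)
loose_subset = columns_match_set(actual, ["PassengerId"], exact_match=False)
loose_unknown = columns_match_set(actual, ["PassengerId", "toto"], exact_match=False)
loose_empty = columns_match_set(actual, [], exact_match=False)
```

Here strict_subset is False because two columns are unexpected, while loose_subset and loose_empty are True and loose_unknown is False.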
expect = df.expect_table_row_count_to_be_between(min_value=700, max_value=1000)
assert expect.success
One can skip either of the arguments min_value or max_value. The default value is None:
expect = df.expect_table_row_count_to_be_between(max_value=1000)
assert expect.success
expect = df.expect_table_row_count_to_equal(value=891)
assert expect.success
Expectations : Missing values, unique values, and types
All the Expectations in this category support the mostly argument, used when only a fraction of the column values is supposed to match the expectation.
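As a rough mental model of mostly, here is a plain-Python sketch. It assumes that null values are ignored and that the check passes when the share of passing values is at least mostly; the helper names passes_mostly and in_expected_ports are hypothetical, and the library may differ in details (for instance for the null-related Expectations).

```python
def passes_mostly(values, predicate, mostly=1.0):
    """True when the share of non-null values satisfying predicate is >= mostly."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return True  # assumption: nothing to check counts as a pass
    ratio = sum(1 for v in non_null if predicate(v)) / len(non_null)
    return ratio >= mostly


def in_expected_ports(value):
    return value in {"C", "S"}


# Hypothetical sample: 10 non-null values, one of which ("Q") is unexpected.
ports = ["S", "C", "S", "Q", "S", None, "S", "S", "S", "S", "S"]
strict = passes_mostly(ports, in_expected_ports)
relaxed = passes_mostly(ports, in_expected_ports, mostly=0.9)
```

With this sample, strict is False (a single unexpected value is enough to fail) and relaxed is True (9 of the 10 non-null values pass).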
expect = df.expect_column_values_to_be_unique(column="PassengerId")
assert expect.success
expect = df.expect_column_values_to_be_unique(column="Ticket", mostly=0.6)
assert expect.success
for column in ["PassengerId", "Survived"]:
    expect = df.expect_column_values_to_not_be_null(column=column)
    assert expect.success
expect = df.expect_column_values_to_not_be_null(column="Age", mostly=0.8)
assert expect.success
expect = df.expect_column_values_to_not_be_null(column="Age", mostly=0.9)
assert not expect.success
tmp_col = "NullCol"
df[tmp_col] = np.nan
expect = df.expect_column_values_to_be_null(column="NullCol")
assert expect.success
df.drop(tmp_col, axis=1, inplace=True)
expect = df.expect_column_values_to_be_null(column="Cabin", mostly=0.5)
assert expect.success
For the expected types (type_ parameter), we use regular Python types. With the Pandas backend, we could also use NumPy dtypes. From the documentation:
Valid types are defined by the current backend implementation and are dynamically loaded. For example, valid types for PandasDataset include any numpy dtype values (such as ‘int64’) or native python types (such as ‘int’), whereas valid types for a SqlAlchemyDataset include types named by the current driver such as ‘INTEGER’ in most SQL dialects and ‘TEXT’ in dialects such as postgresql. Valid types for SparkDFDataset include ‘StringType’, ‘BooleanType’ and other pyspark-defined type names.
types = {
"PassengerId": "int",
"Survived": "int",
"Pclass": "int",
"Name": "str",
"Sex": "str",
"Age": "float",
"SibSp": "int",
"Parch": "int",
"Ticket": "str",
"Fare": "float",
"Cabin": "str",
"Embarked": "str",
}
for column, type_ in types.items():
    expect = df.expect_column_values_to_be_of_type(column=column, type_=type_)
    assert expect.success
We could also use the mostly argument in cases where most of the values are of the same type:
tmp_col = "mixedTypes"
df[tmp_col] = 1
df.loc[df[:3].index, tmp_col] = "a"
expect = df.expect_column_values_to_be_of_type(column=tmp_col, type_="int", mostly=0.9)
assert expect.success
expect = df.expect_column_values_to_be_in_type_list(
column=tmp_col, type_list=["int", "str"]
)
assert expect.success
Let’s insert a float item in the temporary column and use the mostly argument:
df.loc[df[:1].index, tmp_col] = np.pi
expect = df.expect_column_values_to_be_in_type_list(
column=tmp_col, type_list=["int", "str"]
)
assert not expect.success
expect = df.expect_column_values_to_be_in_type_list(
column=tmp_col, type_list=["int", "str"], mostly=0.99
)
assert expect.success
df.drop(tmp_col, axis=1, inplace=True)
Expectations : Sets and ranges
All the Expectations in this category support the mostly argument.
expect = df.expect_column_values_to_be_in_set(column="Survived", value_set=[0, 1])
assert expect.success
expect = df.expect_column_values_to_be_in_set(column="Pclass", value_set=[1, 2, 3])
assert expect.success
expect = df.expect_column_values_to_be_in_set(
column="Sex",
value_set=["female", "male"],
)
assert expect.success
expect = df.expect_column_values_to_be_in_set(
column="Embarked",
value_set=["C", "Q", "S"],
)
assert expect.success
expect = df.expect_column_values_to_be_in_set(
column="Embarked", value_set=["C", "S"], mostly=0.9
)
assert expect.success
Another possible argument is the boolean parse_strings_as_datetimes.
date_col = "datetimeCol"
date = "1912-04-15"
df[date_col] = date
df[date_col] = pd.to_datetime(df[date_col])
expect = df.expect_column_values_to_be_in_set(
column=date_col,
value_set=["1912-04-14", date, 2],
parse_strings_as_datetimes=False,
)
assert not expect.success
expect = df.expect_column_values_to_be_in_set(
column=date_col,
value_set=["1912-04-14", date, 2],
parse_strings_as_datetimes=True,
)
assert expect.success
alphabet_string = string.ascii_uppercase
value_set = [l for l in alphabet_string if l not in ["C", "Q", "S"]]
expect = df.expect_column_values_to_not_be_in_set(
column="Embarked",
value_set=value_set,
)
assert expect.success
With mostly:
expect = df.expect_column_values_to_not_be_in_set(
column="Embarked", value_set=["Q", "X", "Y", "Z"], mostly=0.9
)
assert expect.success
expect = df.expect_column_values_to_be_between(column="Age", min_value=0, max_value=120)
assert expect.success
One can also use strict_min and strict_max to exclude the corresponding bound from the accepted range (both default to False, meaning that the bounds are included).
df.Parch.max()
6
expect = df.expect_column_values_to_be_between(
column="Parch", min_value=0, max_value=6, strict_max=True
)
assert not expect.success
expect = df.expect_column_values_to_be_between(column="Parch", min_value=0, max_value=6)
assert expect.success
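The difference between inclusive and strict bounds can be sketched in a few lines of plain Python. The helper is_between is hypothetical and only mirrors the argument names used above; it is an illustration, not the library's implementation.

```python
def is_between(value, min_value=None, max_value=None,
               strict_min=False, strict_max=False):
    """Check one value against optional bounds, inclusive unless strict_* is set."""
    if min_value is not None:
        if value < min_value or (strict_min and value == min_value):
            return False
    if max_value is not None:
        if value > max_value or (strict_max and value == max_value):
            return False
    return True


parch = [0, 1, 2, 6]  # hypothetical sample whose maximum equals the upper bound
inclusive = all(is_between(v, 0, 6) for v in parch)
exclusive = all(is_between(v, 0, 6, strict_max=True) for v in parch)
```

Here inclusive is True and exclusive is False, because the value 6 sits exactly on the upper bound.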
Other possible arguments are:
- parse_strings_as_datetimes (boolean or None) – If True, parse min_value, max_value, and all non-null column values to datetimes before making comparisons.
- output_strftime_format (str or None) – A valid strftime format for datetime output. Only used if parse_strings_as_datetimes=True.
- mostly
Let’s try an example with parse_strings_as_datetimes and output_strftime_format:
df[date_col] = "1912/04/15"
expect = df.expect_column_values_to_be_between(
column=date_col,
min_value="1912-04-14",
max_value="1912-04-16",
parse_strings_as_datetimes=True,
output_strftime_format="%Y/%m/%d",
)
assert expect.success
df.sort_values(by="PassengerId", inplace=True, ascending=True)
expect = df.expect_column_values_to_be_increasing(column="PassengerId")
assert expect.success
Other possible arguments are:
- strictly (boolean or None) – If True, values must be strictly greater than previous values.
- parse_strings_as_datetimes (boolean or None) – If True, parse all non-null column values to datetimes before making comparisons.
- mostly
df.sort_values(by="PassengerId", inplace=True, ascending=True)
expect = df.expect_column_values_to_be_increasing(column="PassengerId", strictly=True)
assert expect.success
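What strictly changes is easiest to see on a sequence with a repeated value. The following plain-Python sketch uses a hypothetical helper is_increasing that compares each value with its predecessor; it is meant as an illustration of the idea only.

```python
def is_increasing(values, strictly=False):
    """Compare each value with the previous one (hypothetical helper)."""
    pairs = list(zip(values, values[1:]))
    if strictly:
        return all(previous < current for previous, current in pairs)
    return all(previous <= current for previous, current in pairs)


with_ties = [1, 2, 2, 3]
loose = is_increasing(with_ties)                  # ties are tolerated
strict = is_increasing(with_ties, strictly=True)  # the repeated 2 fails
```

A column of unique identifiers sorted in ascending order, such as PassengerId, passes in both modes.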
In order to use parse_strings_as_datetimes, we create a column with increasing dates as strings:
df[date_col] = pd.date_range(
start=datetime(1912, 1, 1), periods=df.shape[0], freq="D"
).astype(str)
expect = df.expect_column_values_to_be_increasing(
column=date_col, parse_strings_as_datetimes=True, strictly=True
)
assert expect.success
df.drop(date_col, axis=1, inplace=True)
df.sort_values(by="PassengerId", inplace=True, ascending=False)
expect = df.expect_column_values_to_be_decreasing(column="PassengerId")
assert expect.success
Of course, expect_column_values_to_be_decreasing takes the same arguments as expect_column_values_to_be_increasing.
Expectations : String matching
All the Expectations in this category support the mostly argument.
expect = df.expect_column_value_lengths_to_be_between(
column="Name", min_value=10, max_value=200
)
assert expect.success
expect = df.expect_column_value_lengths_to_be_between(
column="Name", min_value=10, max_value=50, mostly=0.9
)
assert expect.success
expect = df.expect_column_value_lengths_to_equal(column="Embarked", value=1)
assert expect.success
expect = df.expect_column_value_lengths_to_equal(column="Cabin", value=3, mostly=0.3)
assert expect.success
We are going to check the expect_column_values_to_match_regex Expectation on the Cabin feature. Here are some example values:
'B57 B59 B63 B66', 'C7', 'E34', 'C32', 'B18', 'C124', 'C91', 'E40', 'T'
expect = df.expect_column_values_to_match_regex(
column="Cabin",
regex="(?:[A-Z]\\d{0,3}\\s?)+", # one or more groups of 1 uppercase letter followed by 0 to 3 digits and an optional whitespace character
)
assert expect.success
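When writing such patterns, it matters whether the regex has to cover the whole value or only part of it. The sketch below uses only Python's re module on a few of the Cabin values listed above, plus one hypothetical malformed value; whether a given backend anchors the pattern is worth checking in the documentation, so treat this as an illustration of the difference rather than a description of the library.

```python
import re

pattern = re.compile(r"(?:[A-Z]\d{0,3}\s?)+")

cabins = ["B57 B59 B63 B66", "C7", "E34", "C124", "T"]
odd = "cabin C7"  # hypothetical malformed value, not taken from the dataset

# Every sample value is matched in full by the pattern.
full = [bool(pattern.fullmatch(cabin)) for cabin in cabins]

# A partial search accepts the malformed value, a full match rejects it.
search_odd = bool(pattern.search(odd))
full_odd = bool(pattern.fullmatch(odd))
```

If a partial match is not acceptable, adding explicit ^ and $ anchors to the pattern is a simple way to make the intent unambiguous.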
expect = df.expect_column_values_to_match_regex_list(
column="Cabin", regex_list=["[A-Z]\\d{0,3}", "(?:[A-Z]\\d{0,3}\\s?){2,100}"]
)
assert expect.success
expect = df.expect_column_values_to_not_match_regex_list(
column="Cabin", regex_list=["[a-z]\\d{1,3}", "[a-z]\\d{1,3}[a-z]"]
)
assert expect.success
Expectations : Datetime and JSON parsing
All the Expectations in this category support the mostly argument.
df["date_example"] = "1912/04/15"
expect = df.expect_column_values_to_match_strftime_format(
column="date_example", strftime_format="%Y/%m/%d"
)
assert expect.success
df.drop("date_example", axis=1, inplace=True)
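A strftime format check on a single value can be sketched with the standard library alone. The helper matches_strftime is hypothetical; it relies on datetime.strptime raising ValueError when the string does not follow the format, and is not meant to reproduce the library's implementation.

```python
from datetime import datetime


def matches_strftime(value, strftime_format):
    """True if value can be parsed with the given format (hypothetical helper)."""
    try:
        datetime.strptime(value, strftime_format)
    except ValueError:
        return False
    return True


ok = matches_strftime("1912/04/15", "%Y/%m/%d")
bad = matches_strftime("1912-04-15", "%Y/%m/%d")  # dashes instead of slashes
```

Here ok is True and bad is False.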
df["date_example"] = "2012-01-19 17:21:00"
expect = df.expect_column_values_to_be_dateutil_parseable(column="date_example")
assert expect.success
df.drop("date_example", axis=1, inplace=True)
df[
"json_example"
] = """{
"menu": {
"id": "file",
"value": "File",
"popup": {
"menuitem": [
{ "value": "New", "onclick": "CreateNewDoc()" },
{ "value": "Open", "onclick": "OpenDoc()" },
{ "value": "Close", "onclick": "CloseDoc()" }
]
}
}
}"""
expect = df.expect_column_values_to_be_json_parseable(column="json_example")
assert expect.success
df.drop("json_example", axis=1, inplace=True)
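For a single value, "JSON parseable" can be pictured as "json.loads does not raise". The helper is_json_parseable below is a hypothetical, standard-library-only sketch of that idea.

```python
import json


def is_json_parseable(value):
    """True if value is a string that json.loads accepts (hypothetical helper)."""
    try:
        json.loads(value)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return False
    return True


valid = is_json_parseable('{"menu": {"id": "file", "value": "File"}}')
invalid = is_json_parseable("{menu: file}")  # keys and values are not quoted
not_a_string = is_json_parseable(None)
```

Only the first value passes; the other two are rejected.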
df["latitude"] = 180.0 * (np.random.rand(len(df)) - 0.5)
df["longitude"] = 360.0 * (np.random.rand(len(df)) - 0.5)
df["json_example"] = (
"""{"latitude": """
+ df.latitude.astype(str)
+ """, "longitude": """
+ df.longitude.astype(str)
+ "}"
)
json_schema = {
"title": "Longitude and Latitude",
"description": "A geographical coordinate on a planet (most commonly Earth).",
"required": ["latitude", "longitude"],
"type": "object",
"properties": {
"latitude": {"type": "number", "minimum": -90, "maximum": 90},
"longitude": {"type": "number", "minimum": -180, "maximum": 180},
},
}
expect = df.expect_column_values_to_match_json_schema(
column="json_example", json_schema=json_schema
)
assert expect.success
df.drop(["latitude", "longitude", "json_example"], axis=1, inplace=True)
Expectations : Aggregate functions
expect_column_distinct_values_to_be_in_set is not really different from expect_column_values_to_be_in_set. Note that it also supports the parse_strings_as_datetimes argument.
expect = df.expect_column_distinct_values_to_be_in_set(
column="Survived", value_set=[0, 1]
)
assert expect.success
expect = df.expect_column_distinct_values_to_contain_set(
column="Embarked",
value_set=["C", "Q"],
)
assert expect.success
expect = df.expect_column_distinct_values_to_equal_set(
column="Embarked",
value_set=["C", "Q", "S"],
)
assert expect.success
expect = df.expect_column_mean_to_be_between(column="Age", min_value=20, max_value=40)
assert expect.success
df.Age.median()
28.0
expect = df.expect_column_median_to_be_between(column="Age", min_value=20, max_value=40)
assert expect.success
expect = df.expect_column_quantile_values_to_be_between(
column="Age",
quantile_ranges={
"quantiles": [0.25, 0.5, 0.75],
"value_ranges": [[15, 25], [23, 33], [35, 43]],
},
)
assert expect.success
expect = df.expect_column_mean_to_be_between(
column="Age", min_value=0.0, max_value=100.0
)
assert expect.success
expect = df.expect_column_stdev_to_be_between(
column="Age", min_value=10.0, max_value=20.0
)
assert expect.success
expect = df.expect_column_unique_value_count_to_be_between(
column="Pclass", min_value=1, max_value=3
)
assert expect.success
expect_column_proportion_of_unique_values_to_be_between, from the API reference:
Expect the proportion of unique values to be between a minimum value and a maximum value.
For example, in a column containing [1, 2, 2, 3, 3, 3, 4, 4, 4, 4], there are 4 unique values and 10 total values for a proportion of 0.4.
expect = df.expect_column_proportion_of_unique_values_to_be_between(
column="Embarked", min_value=0.0, max_value=0.01
)
assert expect.success
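The proportion quoted from the API reference can be reproduced in plain Python. The helper unique_proportion is hypothetical, and ignoring None values is an assumption made for this sketch.

```python
def unique_proportion(values):
    """Number of distinct non-null values divided by the number of non-null values."""
    non_null = [v for v in values if v is not None]
    return len(set(non_null)) / len(non_null)


values = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
proportion = unique_proportion(values)
print(proportion)  # 0.4
```

A column with few categories and many rows therefore has a proportion close to 0, while a column of unique identifiers has a proportion of 1.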
expect = df.expect_column_most_common_value_to_be_in_set(
column="Embarked", value_set=("S", "C")
)
assert expect.success
expect = df.expect_column_sum_to_be_between(
column="Survived", min_value=300, max_value=400
)
assert expect.success
expect = df.expect_column_min_to_be_between(column="Age", min_value=0, max_value=10)
assert expect.success
expect = df.expect_column_max_to_be_between(column="Age", min_value=40, max_value=100)
assert expect.success
For the expect_column_kl_divergence_to_be_less_than Expectation, we used some coefficients found on the GitHub repository:
expect = df.expect_column_kl_divergence_to_be_less_than(
column="Pclass",
partition_object={
"values": ["*", 1, 2, 3],
"weights": [
0.0007616146230007616,
0.24523990860624523,
0.2124904798172125,
0.5415079969535415,
],
},
threshold=0.6,
)
assert expect.success
expect_column_kl_divergence_to_be_less_than is a kind of distributional Expectation, which expects a partition_object argument.
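To give an intuition of the quantity being thresholded, here is a plain-Python sketch of the Kullback-Leibler divergence between observed category frequencies and the expected weights of a partition. The helper kl_divergence and the sample values are hypothetical, and the library may treat missing categories or smoothing differently.

```python
import math
from collections import Counter


def kl_divergence(observed, expected):
    """KL(observed || expected) for two dicts mapping category -> weight."""
    total = 0.0
    for category, p in observed.items():
        if p == 0:
            continue  # a zero observed weight contributes nothing
        q = expected.get(category, 0.0)
        if q == 0:
            return math.inf  # observed category that was not expected at all
        total += p * math.log(p / q)
    return total


# Expected weights taken from the partition_object used above.
expected = {
    "*": 0.0007616146230007616,
    1: 0.24523990860624523,
    2: 0.2124904798172125,
    3: 0.5415079969535415,
}

sample = [3, 3, 1, 2, 3, 1, 3, 2, 3, 3]  # hypothetical Pclass values
counts = Counter(sample)
observed = {category: n / len(sample) for category, n in counts.items()}
divergence = kl_divergence(observed, expected)
```

The divergence is 0 when both distributions are identical and grows as they drift apart, so the Expectation amounts to checking that divergence stays below the chosen threshold.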
Conditional Expectations
Finally, we are going to have a look at conditional Expectations with the row_condition argument. As explained in the documentation:
Sometimes one may hold an Expectation not for a dataset in its entirety but only for a particular subset. Alternatively, what one expects of some variable may depend on the value of another. One may, for example, expect a column that holds the country of origin to not be null only for people of foreign descent.
expect = df.expect_column_values_to_be_in_set(
column="Sex",
value_set=["male", "female"],
row_condition='Embarked=="S"',
condition_parser="pandas",
)
assert expect.success
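Conceptually, a conditional Expectation first selects the rows that satisfy the condition and then applies the check to that subset only. The sketch below illustrates this in plain Python; the helper check_in_set is hypothetical, and the condition is a Python callable here rather than a query string, purely to keep the example self-contained.

```python
def check_in_set(rows, column, value_set, row_condition=None):
    """Apply an 'in set' check to the rows selected by row_condition, if any."""
    subset = [row for row in rows if row_condition is None or row_condition(row)]
    return all(row[column] in value_set for row in subset)


rows = [
    {"Sex": "male", "Embarked": "S"},
    {"Sex": "female", "Embarked": "C"},
    {"Sex": "unknown", "Embarked": "Q"},  # hypothetical bad value outside the subset
]

whole = check_in_set(rows, "Sex", {"male", "female"})
southampton = check_in_set(
    rows, "Sex", {"male", "female"}, row_condition=lambda row: row["Embarked"] == "S"
)
```

Here whole is False because of the third row, while southampton is True since the bad value lies outside the selected subset.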
Some of the Expectations do not take the row_condition argument:
- expect_column_to_exist
- expect_table_columns_to_match_ordered_list
- expect_table_column_count_to_be_between
- expect_table_column_count_to_equal