Validation

execution.utils.validation

Data Validation Utilities

This module provides data validation utilities: schema validation, data type checking, field validation, and result tracking. It is used throughout the ETL pipeline to ensure data quality and consistency.

ValidationResult dataclass

Container for validation results and metadata.

Tracks validation status, errors, warnings, and specific issues found during data validation processes.

Attributes:

Name             Type                 Description
is_valid         bool                 Boolean indicating if validation passed
errors           List[str]            List of error messages found during validation
warnings         List[str]            List of warning messages
missing_fields   Optional[List[str]]  Fields that were expected but not found
type_mismatches  Optional[List[str]]  Fields with incorrect data types
field_name       Optional[str]        Specific field being validated (optional)

is_valid: bool instance-attribute

errors: List[str] instance-attribute

warnings: List[str] instance-attribute

missing_fields: Optional[List[str]] = None class-attribute instance-attribute

type_mismatches: Optional[List[str]] = None class-attribute instance-attribute

field_name: Optional[str] = None class-attribute instance-attribute

__init__(is_valid: bool, errors: List[str], warnings: List[str], missing_fields: Optional[List[str]] = None, type_mismatches: Optional[List[str]] = None, field_name: Optional[str] = None) -> None

__post_init__() -> None

add_error(error: str) -> None

add_warning(warning: str) -> None

get(key: str, default: Any = None) -> Any

__getitem__(key: str) -> Any
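
The signatures above give the shape of the class but not its behaviour. The following is a minimal sketch of how such a result container might work, not the module's actual implementation. The class name `ValidationResultSketch` is hypothetical, and every method body is an assumption: in particular, that `add_error` marks the result invalid, that `__post_init__` normalises the optional lists, and that `get` and `__getitem__` offer dict-style access to the attributes.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ValidationResultSketch:
    """Hypothetical stand-in mirroring the documented ValidationResult fields."""

    is_valid: bool
    errors: List[str]
    warnings: List[str]
    missing_fields: Optional[List[str]] = None
    type_mismatches: Optional[List[str]] = None
    field_name: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumption: replace None with empty lists so callers can iterate safely.
        if self.missing_fields is None:
            self.missing_fields = []
        if self.type_mismatches is None:
            self.type_mismatches = []

    def add_error(self, error: str) -> None:
        # Assumption: recording an error also marks the result as invalid.
        self.errors.append(error)
        self.is_valid = False

    def add_warning(self, warning: str) -> None:
        # Assumption: warnings are recorded without changing is_valid.
        self.warnings.append(warning)

    def get(self, key: str, default: Any = None) -> Any:
        # Assumption: dict-style lookup backed by attribute access.
        return getattr(self, key, default)

    def __getitem__(self, key: str) -> Any:
        # Assumption: unknown keys raise KeyError, as a dict would.
        try:
            return getattr(self, key)
        except AttributeError:
            raise KeyError(key) from None


result = ValidationResultSketch(is_valid=True, errors=[], warnings=[])
result.add_warning("column 'age' contains null values")
result.add_error("missing required field: customer_id")
```

Under these assumptions, `result.is_valid` is `False` after the call to `add_error`, and `result["errors"]` returns the same list as `result.errors`.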

SchemaValidator

check_nullable = check_nullable instance-attribute

__init__(check_nullable: bool = False)

validate(data: Any) -> ValidationResult

get_schema() -> Any

validate_schema(df: DataFrame, expected_schema: StructType) -> ValidationResult

validate_curated_schema(df: DataFrame, expected_schema: StructType) -> ValidationResult

validate_transformed_schema(df: DataFrame, expected_schema: StructType) -> ValidationResult

get_schema_diff(schema1: StructType, schema2: StructType) -> Dict[str, Any]
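
The documentation does not describe the keys of the dictionary returned by `get_schema_diff`. As an illustration only, a schema comparison of this kind could be sketched as below. To stay runnable without Spark, the sketch takes plain mappings of field name to type name instead of `StructType` objects; the function name `schema_diff` and the result keys are hypothetical.

```python
from typing import Any, Dict, Mapping


def schema_diff(schema1: Mapping[str, str], schema2: Mapping[str, str]) -> Dict[str, Any]:
    """Compare two schemas given as {field_name: type_name} mappings.

    With PySpark, such a mapping could be built from a StructType with
    {f.name: f.dataType.simpleString() for f in schema.fields}.
    """
    only_in_1 = sorted(set(schema1) - set(schema2))
    only_in_2 = sorted(set(schema2) - set(schema1))
    # Fields present in both schemas whose type names differ.
    type_changes = {
        name: (schema1[name], schema2[name])
        for name in sorted(set(schema1) & set(schema2))
        if schema1[name] != schema2[name]
    }
    return {
        "only_in_schema1": only_in_1,
        "only_in_schema2": only_in_2,
        "type_changes": type_changes,
        "identical": not (only_in_1 or only_in_2 or type_changes),
    }


expected = {"id": "bigint", "name": "string", "created_at": "timestamp"}
actual = {"id": "int", "name": "string", "email": "string"}
diff = schema_diff(expected, actual)
```

Here `diff` reports `created_at` as present only in the first schema, `email` as present only in the second, and `id` as a type change from `bigint` to `int`.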

select_expected_fields(df: DataFrame, expected_schema: StructType) -> DataFrame

validate_required_fields(df: DataFrame, required_fields: List[str]) -> Dict[str, Any]

validate_schema_strict(df: DataFrame, expected_schema: StructType, schema_name: str) -> None

validate_required_fields_strict(df: DataFrame, required_fields: List[str]) -> None
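
The `-> None` return type of the `_strict` methods suggests that they signal failure by raising rather than by returning a report, although the documentation does not say so. The sketch below shows that pattern under this assumption. The function names, the report keys, and the choice of `ValueError` are hypothetical, and a plain list of column names stands in for `df.columns`.

```python
from typing import Any, Dict, List


def check_required_fields(columns: List[str], required_fields: List[str]) -> Dict[str, Any]:
    """Non-strict check: report missing fields instead of raising."""
    present = set(columns)
    missing = [name for name in required_fields if name not in present]
    return {"is_valid": not missing, "missing_fields": missing}


def check_required_fields_strict(columns: List[str], required_fields: List[str]) -> None:
    """Strict check: raise if any required field is missing (assumed behaviour)."""
    report = check_required_fields(columns, required_fields)
    if not report["is_valid"]:
        raise ValueError(f"Missing required fields: {report['missing_fields']}")


columns = ["customer_id", "email"]  # stand-in for df.columns
report = check_required_fields(columns, ["customer_id", "signup_date"])
```

A non-strict check suits steps that collect every problem before deciding what to do; a strict check suits steps where the pipeline should stop at the first failed precondition.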

get_schema_summary(schema: StructType) -> Dict[str, Any]