Data Processing

Clean, transform, and prepare data for analysis.

Purpose

Use this module for feature engineering, missing data handling, entity matching, and data quality checks.

Key functions

  • AddDateNumberColumns — Add year, month, quarter, week, and day columns from dates

  • AddLeadingZeros — Add leading zeros to numeric columns

  • AddRowCountColumn — Add row numbers within groups

  • AddTPeriodColumn — Create time period columns for time series analysis

  • AddTukeyOutlierColumn — Add an outlier flag column using Tukey’s method

  • CleanTextColumns — Remove leading/trailing spaces from text columns

  • ConductAnomalyDetection — Detect anomalies using a z-score method

  • ConductEntityMatching — Match records between datasets using fuzzy-matching algorithms

  • ConvertOddsToProbability — Convert odds to probabilities

  • CountMissingDataByGroup — Count missing values grouped by categories

  • CreateBinnedColumn — Bin continuous variables into discrete categories

  • CreateDataOverview — Summarize a dataset and visualize its missing data

  • CreateRandomSampleGroups — Create random sample groups for validation

  • CreateRareCategoryColumn — Identify and flag rare categories

  • CreateStratifiedRandomSampleGroups — Create random sample groups stratified by a category column

  • ImputeMissingValuesUsingNearestNeighbors — Impute missing values using KNN

  • VerifyGranularity — Check dataset granularity based on key columns
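
Several of the functions above rest on short, well-known formulas. The sketch below illustrates three of them in plain Python: Tukey's fences, a z-score rule, and the odds-to-probability conversion. It is not this module's implementation. The helper names (tukey_outlier_flags, zscore_anomaly_flags, odds_to_probability), their parameters, and the default thresholds are hypothetical, and the odds are assumed to be odds in favor (3 means 3 to 1).

```python
import statistics


def tukey_outlier_flags(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


def zscore_anomaly_flags(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    if stdev == 0:
        # All values identical: nothing can be an anomaly.
        return [False] * len(values)
    return [abs((v - mean) / stdev) > threshold for v in values]


def odds_to_probability(odds):
    """Convert odds in favor (assumed convention) to a probability."""
    return odds / (1 + odds)


if __name__ == "__main__":
    readings = [10, 12, 11, 13, 12, 11, 100]
    print(tukey_outlier_flags(readings))
    print(zscore_anomaly_flags(readings, threshold=2.0))
    print(odds_to_probability(3))
```

With a sample standard deviation, no value in a sample of size n can have an absolute z-score above (n - 1) / sqrt(n), so a threshold of 3 never fires on very small samples. That is one reason to prefer Tukey's fences when only a handful of observations are available.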

Common use cases

  • Data cleaning and feature engineering

  • Missing data handling

  • Data quality assessment

  • Sampling and validation splits

  • Entity resolution (fuzzy matching)
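
To make the sampling and data quality use cases concrete, here is a minimal plain-Python sketch of stratified group assignment and a granularity check. It does not reproduce this module's API. The names assign_stratified_groups and is_unique_on_keys, the list-of-dicts row format, and the seed parameter are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def assign_stratified_groups(rows, strata_key, n_groups=2, seed=0):
    """Assign each row a group number, balancing groups within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for index, row in enumerate(rows):
        by_stratum[row[strata_key]].append(index)

    groups = [None] * len(rows)
    for indices in by_stratum.values():
        # Shuffle within the stratum, then deal rows out round-robin.
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            groups[index] = position % n_groups
    return groups


def is_unique_on_keys(rows, key_columns):
    """Return True if no two rows share the same values in key_columns."""
    seen = set()
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key in seen:
            return False
        seen.add(key)
    return True


if __name__ == "__main__":
    customers = [{"id": i, "segment": s} for i, s in enumerate("aaaabb")]
    print(assign_stratified_groups(customers, "segment", n_groups=2, seed=42))
    print(is_unique_on_keys(customers, ["id"]))
    print(is_unique_on_keys(customers, ["segment"]))
```

Dealing shuffled rows out round-robin keeps group sizes within one row of each other inside every stratum, so each group preserves the category mix of the full dataset as closely as the counts allow.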
