User Guide

Creating Functions

qualipy.reflect.function.function(allowed_arguments: Optional[List[str]] = None, return_format: type = Union[float, str], arguments: Optional[Dict[str, Any]] = None, fail: bool = False, display_name: Optional[str] = None, description: Optional[str] = None, input_format: type = float, custom_value_return_format: Optional[type] = None) Callable[source]

Define a function that can be applied to a qualipy dataset

Use this decorator to specify a qualipy function and describe how it should behave when executed. Any function this decorator is applied to must abide by three rules:

  1. The first argument must be data - this is the data object you pass to Qualipy.

  2. The second argument must be the column - this is the name of the column the function is being applied to.

  3. Any additional arguments must correspond to allowed_arguments - the names must match exactly.

Parameters
  • allowed_arguments – An optional list that specifies what arguments can be passed to the function at runtime

  • return_format – Used for rendering purposes in the report. Can be either float, int, str, dict, or bool

  • fail – If this function returns a boolean, should the process halt when it returns False?

  • display_name – This is how the function will be displayed on a report. If not given, it will take the name of the function itself

  • description – If given, this will be displayed when hovering over the function name in a report

Returns

Any value that corresponds to the appropriate return_format

Example 1 - A simple function with no additional arguments:

import qualipy as qpy
@qpy.function(return_format=float)
def mean(data, column):
    return data[column].mean()

Per the rules, data represents the data passed through, in this case a pandas DataFrame, and column is the string name of the column, used to access the column from the DataFrame.

Additionally, the function mean returns a float value, which is consistent with the return_format set in the decorator call.

Example 2 - A simple function with additional arguments:

@qpy.function(return_format=int, allowed_arguments=["standard_deviations"])
def std_over_limit(data, column, standard_deviations):
    # count the rows that fall more than `standard_deviations` standard
    # deviations away from the column mean
    mean = data[column].mean()
    std = data[column].std()
    data = data[
        (data[column] < (mean - standard_deviations * std))
        | (data[column] > (mean + standard_deviations * std))
    ]
    return data.shape[0]

Example 3 - A function when running SQL as backend:

import sqlalchemy as sa

@qpy.function(return_format=float)
def mean(data, column):
    return data.engine.execute(
        sa.select([sa.func.avg(sa.column(column))]).select_from(data._table)
    ).scalar()
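
The reporting-oriented arguments (fail, display_name, description) can be combined in a single decorator call. The following is only a rough sketch, not one of the library's own examples; the check itself is purely illustrative:

import qualipy as qpy

@qpy.function(
    return_format=bool,
    fail=True,                          # halt the run if this check returns False
    display_name="No negative prices",  # how the rule appears on the report
    description="Flags batches where the column contains negative values",
)
def no_negative_values(data, column):
    # returns True only when every value in the column is non-negative
    return bool((data[column] >= 0).all())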

Creating a mapping

qualipy.reflect.column.column(column_name: Optional[Union[str, List[str]]] = None, column_type=None, force_type: bool = False, overwrite_type: bool = False, null: bool = True, force_null: bool = False, unique: bool = False, is_category: bool = False, is_date: bool = False, split_on: Optional[str] = None, column_stage_collection_name: Optional[str] = None, functions: Optional[List[Union[Callable, Dict]]] = None, extra_functions: Optional[Dict[str, Dict]] = None)[source]

This allows us to map to a column of a data object.

This is one of the essential components of Qualipy. Using column allows us to map to a specific column of whatever data object we are reflecting, and specify what that column should look like - as well as apply any aggregate functions we’ve defined.

Note - You must explicitly add the column to the Project object in order for it to run.

Parameters
  • column_name – The name of the column in the data object - Generally either the column name in the pandas or SQL table.

  • column_type – Useful if you want to enforce types in a pandas DataFrame. See the Data types section below for more information.

  • force_type – If column_type is used, should the type be enforced? Setting this to True means that the entire process will halt if the right type is not present.

  • overwrite_type – This is useful if the aggregate function requires a specific datatype for it to be computed.

  • null – Can the column contain missing values

  • force_null – If null is set to False, should the process fail when missing values are present?

  • unique – Should uniqueness in the column be enforced.

  • is_category – Denoting a column as a category has several consequences - including automatically collecting counts for each category.

  • functions – A list of previously defined functions.

  • extra_functions – If this mapping is used for multiple columns but you want a function to be applied to only one of the columns, use this. See Example 3 below for more information.

Returns

A column object that can be added to a Project. See Project for more details.

Example 1 - Reflect a pandas column with one function:

price = qpy.column(column_name="price", column_type=FloatType(), functions=[mean])

Here, price is the name of the pandas column. We want the column to be of float type, and we’re collecting the mean of the price.

Example 2 - Reflect a column, and call a function with arguments:

price = qpy.column(
    column_name="price",
    column_type=FloatType(),
    functions=[{"function": std_over_limit, "parameters": {"standard_deviations": 3}}],
)

Example 3 - Reflect multiple columns, and call a function on just one of them:

num_columns = qpy.column(
    column_name=["price", "some_other_column"],
    column_type=FloatType(),
    functions=[mean],
    extra_functions={
        "price": [
            {"function": std_over_limit, "parameters": {"standard_deviations": 3}},
        ],
    },
)

In this scenario, mean will be applied to both price and some_other_column, but std_over_limit will only be applied to price.
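
The enforcement-oriented parameters (null, force_null, unique, is_category) can be combined in the same way. Below is a rough sketch, not one of the library's own examples; the column name and settings are illustrative, and ObjectType is one of the pandas data types listed in the Data types section:

symbol = qpy.column(
    column_name="symbol",
    column_type=ObjectType(),
    null=False,        # the column should not contain missing values
    force_null=True,   # halt the run if missing values are present anyway
    is_category=True,  # automatically collect counts for each distinct value
)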

Project

class qualipy.project.Project(project_name: str, config_dir: str, re_init: bool = False)[source]

The project class points to a specific configuration, and holds all mappings.

It also includes many useful utility functions for managing projects.

__init__(project_name: str, config_dir: str, re_init: bool = False)[source]
Parameters
  • project_name – The name of the project. This will be important for referencing in report generation later. The project_name cannot be changed, as it is used internally when storing data

  • config_dir – A path to the configuration directory, as created using the CLI command qualipy generate-config. See the config section for more information

add_column(column: Column, name: Optional[str] = None, column_stage_collection_name: Optional[str] = None) None[source]

Add a mapping to this project

This is the method to use when adding a column mapping to the project. Once added, it will automatically be executed when running the pipeline.

Parameters
  • column – The column object. It can be created either through the function method or the class method.

  • name

    This is useful when you don’t want to run all mappings at once. Often, you’ll do analysis on different subsets of the same dataset. Use name to reference it later on and only execute it for a specific subset.

    This name is also essential if you want to analyze the same column, but in a different subset of the data.

Returns

None

Example 1 - Instantiate a project:

import qualipy as qpy

project = qpy.Project(project_name='stocks', config_dir='/tmp/.config')

Example 2 - Instantiate a project and add a column to it:

import qualipy as qpy

project = qpy.Project(project_name='stocks', config_dir='/tmp/.config')
# using the price column defined above
project.add_column(column=price, name='price_analysis')

Supported DataSet Types

Currently, there are three different dataset types supported: Pandas, Spark, and SQL

Pandas

class qualipy.backends.pandas_backend.dataset.PandasData(data: DataFrame)[source]

PandasData must be instantiated when tracking pandas data

__init__(data: DataFrame)[source]
Parameters

data – The pandas dataset that we want to track

set_stratify_rule(column: str, values: Optional[List[str]] = None) None[source]

Use this when you want to run all functions on separate stratifications

Currently, only equality-based stratification is possible. In the future, comparison-based stratifications will be available.

Parameters
  • column – The name of the column you want to stratify on.

  • values – If you only want to include a subset of values within column, specify them here

Returns

None

Example 1 - Setting symbol as a stratification:

from qualipy.backends.pandas_backend.dataset import PandasData

stocks = PandasData(stocks)
stocks.set_stratify_rule("symbol")

Example 2 - Setting symbol as a stratification and specifying the subset of stocks to analyze:

from qualipy.backends.pandas_backend.dataset import PandasData

stocks = PandasData(stocks)
stocks.set_stratify_rule("symbol", values=['IBM', 'AAPL'])

SQL

class qualipy.backends.sql_backend.dataset.SQLData(engine: Optional[Engine] = None, table_name: Optional[str] = None, schema: Optional[str] = None, conn_string: Optional[str] = None, custom_select_sql: Optional[str] = None, create_temp: bool = False, backend='sql')[source]

This is used when tracking a relational table

__init__(engine: Optional[Engine] = None, table_name: Optional[str] = None, schema: Optional[str] = None, conn_string: Optional[str] = None, custom_select_sql: Optional[str] = None, create_temp: bool = False, backend='sql')[source]
Parameters
  • engine – A sqlalchemy engine to the database containing the table we want to track

  • table_name – The name of the table we want to track

  • schema – The schema the table is in

  • conn_string – If engine is None, you can instead pass the SQLAlchemy database connection string

  • custom_select_sql – Must be valid SQL for whatever DB you are using. This will instantiate a temporary table that Qualipy will run against. This is useful if you don’t need the entire table, or need to run any joins before running Qualipy. However, it is often better to just create a view of what you need.

set_custom_where(custom_where: str)[source]

Set this when you want a function to run on a subset of the table

Parameters

custom_where – The where clause of a SQL statement. This can then be used in a function. See Example 2 below for more information

Example 1 - Instantiating a table:

import sqlalchemy as sa
from qualipy.backends.sql_backend.dataset import SQLData

engine = sa.create_engine('sqlite://')
data = SQLData(engine=engine, table_name='my_table')

Example 2 - Instantiating a table and setting a custom where clause:

import sqlalchemy as sa
from qualipy.backends.sql_backend.dataset import SQLData

engine = sa.create_engine('sqlite://')
data = SQLData(engine=engine, table_name='my_table')
data.set_custom_where("my_col = 'setosa'")
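
The custom_select_sql argument can be used instead of pointing at a full table. A rough sketch (the connection string and query here are illustrative, not from the original guide):

from qualipy.backends.sql_backend.dataset import SQLData

data = SQLData(
    conn_string="sqlite:///stocks.db",  # illustrative connection string
    custom_select_sql="SELECT symbol, price FROM stocks WHERE price IS NOT NULL",
    create_temp=True,  # materialize the query as a temporary table for Qualipy to run against
)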

Qualipy

class qualipy.run.Qualipy(project: Project, backend: str = 'pandas', time_of_run: Optional[datetime] = None, batch_name: Optional[str] = None, overwrite_arguments: Optional[dict] = None)[source]

This is the main entry point to Qualipy, and the object that will actually execute on your data.

__init__(project: Project, backend: str = 'pandas', time_of_run: Optional[datetime] = None, batch_name: Optional[str] = None, overwrite_arguments: Optional[dict] = None)[source]
Parameters
  • project – Your defined qualipy.Project

  • backend – Can be either “pandas”, “sql”, or “spark” depending on what kind of data you are tracking

  • time_of_run – If None, this will be the current datetime. Note, this is very important for analysis, as time_of_run is essentially your x_axis in all time series analysis. Being able to set it to a specific date can be useful when generating retrospective statistics.

  • batch_name – Useful for comparing specific time points by name during analysis. By default it will take the time_of_run as batch_name
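
As a rough sketch of instantiating the runner for the stocks project defined earlier (the time_of_run and batch_name values here are purely illustrative):

import datetime

from qualipy.run import Qualipy

# 'project' is the stocks Project created in the Project examples above
qualipy_run = Qualipy(
    project=project,
    backend="pandas",
    time_of_run=datetime.datetime(2021, 1, 4),  # backdate the run for retrospective analysis
    batch_name="2021-01-04",                    # label used when comparing batches later
)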

set_dataset(df, columns: Optional[List[str]] = None, run_name: Optional[str] = None) None[source]

This specifies the exact subset of data you want to run on.

Use this method when you don’t have all of the data (a live process) and want to only run on one batch of data.

Parameters
  • df – Can be either PandasData, SQLData, or SparkData

  • columns – If you don’t want to run all mappings on this specific subset of data, you can specify just the columns you want to run. Note - this corresponds to the name argument when adding a column to a project

  • run_name – If you’re running metrics from a project on many different subsets or iterations of the data, you might want to give each specific subset a name. This is especially necessary when running aggregates on a column where the column name itself stays the same, but the meaning changes based on the subset. By default, this will take the value of ‘0’

Returns

None
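
Continuing the sketch above, a live batch would be wrapped in the appropriate dataset class and handed to set_dataset (batch_df and the run name are illustrative):

from qualipy.backends.pandas_backend.dataset import PandasData

# batch_df is assumed to be the pandas DataFrame for the current batch
qualipy_run.set_dataset(PandasData(batch_df), run_name="daily_batch")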

set_chunked_dataset(df, columns: Optional[List[str]] = None, run_name: Optional[str] = None, time_freq: str = '1D', time_column=None)[source]

This specifies the exact subset of data you want to run on.

Use this method when you already have all data available, and want to retrospectively analyze all historical data as if it were a live process. Note - There’s nothing stopping you from first running this on the available data and then running on a batch-per-batch basis afterwards using regular set_dataset.

Parameters
  • df – Can be either PandasData, SQLData, or SparkData

  • columns – If you don’t want to run all mappings on this specific subset of data, you can specify just the columns you want to run. Note - this corresponds to the name argument when adding a column to a project

  • run_name – If you’re running metrics from a project on many different subsets or iterations of the data, you might want to give each specific subset a name. This is especially necessary when running aggregates on a column where the column name itself stays the same, but the meaning changes based on the subset. By default, this will take the value of ‘0’

  • time_freq – A pandas-style time series frequency alias. See the pandas documentation for the available aliases: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases

  • time_column – The time series column qualipy should use to chunk the data

Returns

None
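
When the full history is already available, the same runner can replay it in time-based chunks. A rough sketch, assuming the DataFrame has a timestamp column (the column name and frequency are illustrative):

# history_df is assumed to hold all historical data
qualipy_run.set_chunked_dataset(
    PandasData(history_df),
    time_freq="1D",              # chunk the data into one-day batches
    time_column="created_date",  # the timestamp column qualipy should chunk on
)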

Data types

There are several data types one can check for, depending on the backend. For pandas, these include

  • DateTimeType

  • FloatType - will match against float16-128

  • IntType - will match against int0-64

  • NumericTypeType - will match with any numeric subtype

  • ObjectType

  • BoolType

For the SQL and Spark backends, these are generally less important, as types are usually enforced by the framework itself, reducing the need for type checking.
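
As a brief sketch of how these types tie back into column enforcement on the pandas backend (the column name is illustrative, and mean is the function defined in the Creating Functions section):

count = qpy.column(
    column_name="count",
    column_type=IntType(),  # expect an integer dtype
    force_type=True,        # halt the run if the column does not have an integer type
    functions=[mean],
)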