Automatic Constraint Generation, Data Verification & Anomaly Detection¶
The TDDA library provides support for constraint generation, verification and anomaly detection for datasets, including .csv files and Pandas DataFrames.
The module includes:
- The tdda command-line tool, for discovering constraints in data and for verifying data against those constraints, using the TDDA JSON file format (.tdda files).
- A Python constraints library containing classes that implement constraint discovery and validation, for use from within other Python programs.
- Python implementations of constraint discovery, verification and anomaly detection for a number of data sources:
  - .csv files
  - Pandas and R DataFrames saved as .feather files
  - PostgreSQL database tables (postgres:)
  - MySQL database tables (mysql:)
  - SQLite database tables (sqlite:)
  - MongoDB document collections (mongodb:)
Note
To use databases, Pandas, .feather files etc., you may need to install extra optional packages. See Optional Installations for using Databases, Feather Files, Pandas.
The tdda Command-line Tool¶
The tdda command-line utility provides a tool for discovering constraints in data and saving them as a .tdda file in the TDDA JSON file format, and for verifying data against constraints stored in a .tdda file.
It also provides some other functionality to help with using the tool. The following command forms are supported:
- tdda discover — perform constraint discovery.
- tdda verify — verify data against constraints.
- tdda detect — detect anomalies in data by checking constraints.
- tdda examples — generate example data and code.
- tdda help — show help on how to use the tool.
- tdda test — run the TDDA library’s internal tests.
See Examples for more detail on the code and data examples that are included as part of the tdda package.
See Tests for more detail on the tdda package’s own tests, used to check that the package is installed and configured correctly.
tdda discover¶
The tdda discover command generates constraints for data, and saves the generated constraints as a TDDA JSON file format file (.tdda).
Usage:
tdda discover [FLAGS] input [constraints.tdda]
input is one of:
- a .csv file
- a -, meaning that a .csv file should be read from standard input
- a feather file containing a DataFrame, with extension .feather
- a database table
constraints.tdda, if provided, specifies the name of a file to which the generated constraints will be written. If no constraints output file is provided, or if - is used, the constraints are written to standard output (stdout).
Optional flags are:
- -r or --rex, to include regular expression generation
- -R or --norex, to exclude regular expression generation
See Constraints for CSV Files and Pandas DataFrames for details of how a .csv file is read.
See Constraints for Databases for details of how database tables are accessed.
tdda verify
¶
The tdda verify
command is used to validate data from various sources,
against constraints from a
TDDA JSON file format constraints file.
Usage:
tdda verify [FLAGS] input [constraints.tdda]
input is one of:
- a .csv file
- a -, meaning that a .csv file will be read from standard input
- a feather file containing a DataFrame, with extension .feather
- a database table
constraints.tdda, if provided, is a JSON .tdda file containing constraints.
If no constraints file is provided and the input is a .csv or a .feather file, a constraints file with the same path as the input file, but with a .tdda extension, will be used.
For database tables, the constraints file parameter is mandatory.
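The default-constraints-file rule above can be sketched in plain Python. This is a hypothetical helper written for illustration, not part of the tdda API:

```python
import os

def default_constraints_path(input_path):
    """Derive the default .tdda constraints path for a .csv or .feather
    input, as described above; database tables have no default, so the
    constraints file must be given explicitly (illustrative sketch)."""
    stem, ext = os.path.splitext(input_path)
    if ext.lower() in ('.csv', '.feather'):
        return stem + '.tdda'
    return None

print(default_constraints_path('data/accounts.csv'))  # data/accounts.tdda
```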
Optional flags are:
- -a, --all — report all fields, even if there are no failures
- -f, --fields — report only fields with failures
- -7, --ascii — report in ASCII form, without using special characters
- --epsilon E — use this value of epsilon for fuzziness in comparing numeric values
- --type_checking strict|sloppy — by default, type checking is sloppy, meaning that when checking type constraints, all numeric types are considered to be equivalent; with strict typing, int is considered different from real
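The effect of epsilon fuzziness can be illustrated with a small sketch. Here epsilon is treated as a fractional tolerance on the constraint bound; that interpretation is an assumption for illustration, and tdda's exact rule may differ.

```python
def fuzzy_le(value, upper, epsilon):
    # Hypothetical illustration of epsilon fuzziness: accept values
    # that exceed a numeric max constraint by no more than
    # epsilon * |upper|. Not tdda's actual implementation.
    return value <= upper + epsilon * abs(upper)

print(fuzzy_le(10.05, 10, 0.01))  # True: within 1% of the bound
print(fuzzy_le(10.2, 10, 0.01))   # False: beyond the tolerance
```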
See Constraints for CSV Files and Pandas DataFrames for details of how a .csv file is read.
See Constraints for Databases for details of how database tables are accessed.
tdda detect¶
The tdda detect command is used to detect anomalies in data, by checking it against constraints from a TDDA JSON file format constraints file.
Usage:
tdda detect [FLAGS] input constraints.tdda output
input is one of:
- a .csv file
- a -, meaning that a .csv file will be read from standard input
- a feather file containing a DataFrame, with extension .feather
- a database table
constraints.tdda is a JSON .tdda file containing constraints.
output is one of:
- a .csv file to be created, containing the failing records
- a -, meaning that the .csv file of failing records will be written to standard output
- a feather file with extension .feather, to be created containing a DataFrame of the failing records
If no constraints file is provided and the input is a .csv or feather file, a constraints file with the same path as the input file, but with a .tdda extension, will be used.
Optional flags are:
- -a, --all — report all fields, even if there are no failures
- -f, --fields — report only fields with failures
- -7, --ascii — report in ASCII form, without using special characters
- --epsilon E — use this value of epsilon for fuzziness in comparing numeric values
- --type_checking strict|sloppy — by default, type checking is sloppy, meaning that when checking type constraints, all numeric types are considered to be equivalent; with strict typing, int is considered different from real
- --write-all — include passing records in the output
- --per-constraint — write one column per failing constraint, as well as the n_failures total column for each row
- --output-fields FIELD1 FIELD2 ... — specify original columns to write out; if used with no field names, all original columns will be included
- --index — include a row-number index in the output file; the row number is automatically included if no output fields are specified; rows are usually numbered from 1, unless the (feather) input file already has an index
If no records fail any of the constraints, then no output file is created (and if the output file already exists, it is deleted).
See Constraints for CSV Files and Pandas DataFrames for details of how a .csv file is read.
See Constraints for Databases for details of how database tables are accessed.
Constraints for CSV Files and Pandas DataFrames¶
If a .csv file is used with the tdda command-line tool, it will be processed by the standard Pandas .csv file reader with the following settings:
- index_col is None
- infer_datetime_format is True
- quotechar is "
- quoting is csv.QUOTE_MINIMAL
- escapechar is \ (backslash)
- na_values are the empty string, "NaN", and "NULL"
- keep_default_na is False
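These settings correspond to parameters of the Pandas CSV reader. The standard-library sketch below approximates just the quoting, escaping and null-value behaviour (keep_default_na being False means only those three strings count as null); it is illustrative, not the tdda implementation.

```python
import csv
import io

# Only these strings are treated as null, mirroring na_values above
# with keep_default_na False (illustrative sketch).
NA_VALUES = {'', 'NaN', 'NULL'}

def read_csv_rows(text):
    """Read CSV text with quotechar '"', minimal quoting and backslash
    escapes, returning one dict per row with nulls mapped to None."""
    reader = csv.reader(io.StringIO(text), quotechar='"',
                        quoting=csv.QUOTE_MINIMAL, escapechar='\\')
    header = next(reader)
    return [{h: (None if v in NA_VALUES else v)
             for h, v in zip(header, row)} for row in reader]

rows = read_csv_rows('name,score\nalice,10\nbob,NULL\n')
print(rows)  # [{'name': 'alice', 'score': '10'}, {'name': 'bob', 'score': None}]
```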
Constraints for Databases¶
When a database table is used with any tdda command, the table name (including an optional schema) can be preceded by a DBTYPE chosen from postgres, mysql, sqlite or mongodb:
DBTYPE:[schema.]tablename
The following examples will use the file .tdda_db_conn_postgres from your home directory (see Database Connection Files), providing all of the default parameters for the database connection:
tdda discover postgres:mytable
tdda discover postgres:myschema.mytable
For MongoDB, document collections are used instead of database tables, and a document can be referred to at any level in the collection structure. Only scalar properties are used for constraint discovery and verification (and any deeper nested structure is ignored). For example:
tdda discover mongodb:mydocument
tdda discover mongodb:subcollection.mysubdocument
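The DBTYPE:[schema.]tablename form can be illustrated with a small parser sketch that splits at the first colon and the last dot. This is a hypothetical helper for illustration, not part of the tdda package:

```python
def parse_table_spec(spec):
    """Split a DBTYPE:[schema.]tablename spec into (dbtype, schema,
    table). Illustrative sketch only; tdda's own parsing may differ."""
    dbtype, _, rest = spec.partition(':')
    if not rest:                      # no DBTYPE: prefix given
        dbtype, rest = None, spec
    schema, _, table = rest.rpartition('.')
    return dbtype, (schema or None), table

print(parse_table_spec('postgres:myschema.mytable'))
# ('postgres', 'myschema', 'mytable')
print(parse_table_spec('sqlite:mytable'))
# ('sqlite', None, 'mytable')
```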
Parameters can also be provided using the following flags (which override the values in the .tdda_db_conn_DBTYPE file, if provided):
- --conn FILE — database connection file (see Database Connection Files)
- --dbtype DBTYPE — type of database
- --db DATABASE — name of database to connect to
- --host HOSTNAME — name of server to connect to
- --port PORTNUMBER — IP port number to connect to
- --user USERNAME — username to connect as
- --password PASSWORD — password to authenticate with
If --conn is provided, then none of the other options are required, and the database connection details are read from the specified file.
If the database type is specified (with the --dbtype option, or by prefixing the table name, such as postgres:mytable), then a default connection file .tdda_db_conn_DBTYPE (in your home directory) is used, if present (where DBTYPE is the name of the kind of database server).
To use constraints for databases, you must have an appropriate DB-API (PEP-0249) driver library installed within your Python environment.
These are:
- For PostgreSQL: pygresql or PyGreSQL
- For MySQL: MySQL-python, mysqlclient or mysql-connector-python
- For SQLite: sqlite3
- For MongoDB: pymongo
Database Connection Files¶
To use a database source, you can either specify the database type using the --dbtype DBTYPE option, or you can prefix the table name with an appropriate DBTYPE: (one of the supported kinds of database server, such as postgres).
You can provide default values for all of the other database options in a database connection file .tdda_db_conn_DBTYPE, in your home directory.
Any database-related options passed in on the command line will override the default settings from the connection file.
A .tdda_db_conn_DBTYPE file is a JSON file of the form:
{
"dbtype": DBTYPE,
"db": DATABASE,
"host": HOSTNAME,
"port": PORTNUMBER,
"user": USERNAME,
"password": PASSWORD,
"schema": SCHEMA,
}
Some additional notes:
- All the entries are optional.
- If a password is provided, then care should be taken to ensure that the file has appropriate filesystem permissions, so that it cannot be read by other users.
- If a schema is provided, then it will be used as the default schema when constraints are discovered or verified on a table name with no schema specified.
- For MySQL (in a .tdda_db_conn_mysql file), the schema parameter must be specified, as there is no built-in default for it to use.
- For Microsoft Windows, the connector file should have the same name as for Unix, beginning with a dot, even though this form of filename is not otherwise commonly used on Windows.
TDDA JSON file format¶
A .tdda file is a JSON file containing a single JSON object of the form:
{
"fields": {
field-name: field-constraints,
...
}
}
Each field-constraints item is a JSON object containing a property for each included constraint:
{
"type": one of int, real, bool, string or date
"min": minimum allowed value,
"max": maximum allowed value,
"min_length": minimum allowed string length (for string fields),
"max_length": maximum allowed string length (for string fields),
"max_nulls": maximum number of null values allowed,
"sign": one of positive, negative, non-positive, non-negative,
"no_duplicates": true if the field values must be unique,
"values": list of distinct allowed values,
"rex": list of regular expressions, to cover all cases
}
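As an illustration of how such a field-constraints object might be checked, here is a minimal pure-Python sketch covering a few of the constraint kinds. The real checking logic lives in the tdda constraints library; this is only a toy model:

```python
def verify_field(values, constraints):
    """Return {constraint_name: passed} for a list of field values,
    where None represents a null (illustrative sketch only)."""
    nonnull = [v for v in values if v is not None]
    results = {}
    for name, expected in constraints.items():
        if name == 'min':
            results[name] = all(v >= expected for v in nonnull)
        elif name == 'max':
            results[name] = all(v <= expected for v in nonnull)
        elif name == 'max_nulls':
            results[name] = (len(values) - len(nonnull)) <= expected
        elif name == 'no_duplicates':
            results[name] = (not expected
                             or len(set(nonnull)) == len(nonnull))
        elif name == 'values':
            results[name] = all(v in expected for v in nonnull)
    return results

print(verify_field([1, 2, None, 9], {'min': 0, 'max': 5, 'max_nulls': 1}))
# {'min': True, 'max': False, 'max_nulls': True}
```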
Constraints Examples¶
The tdda.constraints module includes a set of examples.
To copy these constraints examples, run the command:
tdda examples constraints [directory]
If directory is not supplied, constraints_examples will be used.
Alternatively, you can copy all examples using the following command:
tdda examples
which will create a number of separate subdirectories.