Advanced Monitor Options¶
In Creating New Monitors we outlined how to create a very simple monitor that produces a simple plot. In this section we will dive into options available for creating a more complex Monitor.
Note
There are no further options for creating DataModels, and so the previous section should be referenced for the creation of new DataModels.
Running the monitoring steps manually¶
If there is a need to run the monitoring steps manually, BaseMonitor includes the following methods that can be called independently:
- initialize_data: retrieve the data defined by the data model, set the data attribute, and create hover text.
- run_analysis: execute the track method, find outliers if defined, and set notifications.
- plot: create the plotly figure (creates the html output).
- write_figure: save the plotly figure to an html file.
- notify: send notifications.
Note
If the monitoring steps are run individually, they must be executed in logical order.
For example, initialize_data must be executed first, followed by run_analysis if the intent is to only execute the analysis portion of the monitor.
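The ordering requirement can be illustrated with a small stand-in class. This is only a sketch: SketchMonitor is hypothetical and is not monitorframe's BaseMonitor. It borrows just the initialize_data and run_analysis method names from the list above, and the error it raises is an assumption made for illustration (how BaseMonitor itself behaves when steps are called out of order is not stated here).

```python
class SketchMonitor:
    """Hypothetical stand-in for illustration; NOT monitorframe's BaseMonitor."""

    def __init__(self):
        self.data = None
        self.results = None

    def initialize_data(self):
        # Stand-in for retrieving the data defined by a data model
        self.data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}

    def run_analysis(self):
        # The analysis depends on data, so the data step has to come first
        if self.data is None:
            raise RuntimeError('initialize_data must be called before run_analysis')
        self.results = sum(self.data['col1']) / len(self.data['col1'])


monitor = SketchMonitor()
monitor.initialize_data()  # step 1: data
monitor.run_analysis()     # step 2: analysis, which relies on step 1
```

Calling run_analysis on a fresh SketchMonitor without initialize_data raises the RuntimeError, which is the kind of failure the note above is meant to prevent.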
Notifications¶
The monitorframe framework provides support for email notifications upon execution of a monitor.
There are two steps for activating email notifications:
1. Define notification_settings in the new monitor class. Required keys:
   - active: turn the notifications on or off
   - username: user that’s used for sending the messages
   - recipients: additional users that should be notified of results
2. Define the message that the monitor should send in set_notification.
For example:
class MyMonitor(BaseMonitor):
    data_model = MyNewModel

    notification_settings = dict(
        active=True,
        username='user',
        recipients=['other@stsci.edu', 'other2@stsci.edu']  # recipients can also be a single string if there's only one
    )

    plottype = 'line'
    x = 'col1'
    y = 'col2'

    def track(self):
        """Measure the mean of the first column"""
        return self.data.col1.mean()  # Remember that data is a pandas DataFrame!

    def set_notification(self):
        return f'The mean of col1 is {self.results}!'  # The return value of track is stored in the results attribute!
Note
set_notification should return a string.
Databases¶
monitorframe provides support for, and an interface to, two SQLite databases through the peewee ORM. One of these databases stores monitor data, while the other stores monitoring analysis results.
Each of these databases is created automatically when configured, and the tables for those databases are also created automatically when the ingestion methods are called (ingest for a DataModel and store_results for a Monitor).
The configuration of these databases is discussed in Overview.
DataModel Database¶
The main difference between the DataModel database and the Monitor results database is that while the Monitor results database is pre-defined (although broadly), the DataModel database is not. The table is constructed based on the input data, which means that each DataModel can have completely different sets of columns.
This type of implementation does have a drawback though: due to the dynamic nature of how tables are defined, it’s possible to have unintended consequences in how the data is ingested. In particular, it’s possible to have duplicate entries.
To protect against this issue, it’s recommended that DataModels are defined with a primary_key attribute.
This will prevent duplicate entries from being added to the database (an example of this is included in the Creating New Monitors section).
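The effect of a primary key on duplicate ingestion can be sketched with Python's built-in sqlite3 module. This is an illustration of the underlying SQLite behavior only, not of monitorframe's peewee-based ingestion: the table names are made up, and how monitorframe responds to a rejected duplicate is not covered here.

```python
import sqlite3

# In-memory database; table names are hypothetical and used only for this sketch
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE with_key (col1 INTEGER PRIMARY KEY, col2 INTEGER)')
conn.execute('CREATE TABLE without_key (col1 INTEGER, col2 INTEGER)')

rows = [(1, 4), (2, 5), (3, 6)]

# First ingestion succeeds for both tables
conn.executemany('INSERT INTO with_key VALUES (?, ?)', rows)
conn.executemany('INSERT INTO without_key VALUES (?, ?)', rows)

# Second ingestion of the same rows: silently duplicated without a key ...
conn.executemany('INSERT INTO without_key VALUES (?, ?)', rows)

# ... but rejected by the database when a primary key is defined
duplicate_rejected = False
try:
    conn.executemany('INSERT INTO with_key VALUES (?, ?)', rows)
except sqlite3.IntegrityError:
    duplicate_rejected = True

n_without = conn.execute('SELECT COUNT(*) FROM without_key').fetchone()[0]
n_with = conn.execute('SELECT COUNT(*) FROM with_key').fetchone()[0]
```

After running this, the table without a key holds every row twice, while the keyed table still holds each row once.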
Once the DataModel’s database and table exist, the DataModel’s model attribute can be utilized.
The model attribute is a peewee.Model object that represents the DataModel’s table and can be used to query the data stored there.
Users can take advantage of the DataModel’s model attribute when implementing a Monitor’s get_data method.
For examples of querying and filtering, see peewee’s Querying section.
Monitor Results Database¶
Each Monitor that is defined will automatically create a database table based on the name of the monitor if the store_results method is called (with the default method):
class, MyMonitor -> results database table name, "MyMonitor"
The results table is defined with two columns:
- Datetime
- Result
The Datetime column corresponds to the date and time that the monitor was executed.
Each monitor that is derived from BaseMonitor will have a date attribute that is set when an instance of the monitor is created.
date is a Python datetime object, and will be stored in ISO 8601 format (“isoformat”).
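As a quick illustration of that format, the standard library's datetime.isoformat produces strings like the one used in the query example later in this section. Whether monitorframe calls isoformat directly or relies on the database field to do the conversion is not stated here, so treat this as a sketch of the format rather than of the implementation.

```python
from datetime import datetime

# Timestamp chosen to match the query string used later in this section
executed = datetime(2019, 4, 23, 14, 7, 3, 500365)

# ISO 8601 text representation of the timestamp
stored = executed.isoformat()
print(stored)  # 2019-04-23T14:07:03.500365

# The string converts back to an equal datetime object (Python 3.7+)
restored = datetime.fromisoformat(stored)
```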
The Result column is a JSON field.
A JSON field is used to standardize the tables for each monitor while allowing for flexibility in what exactly each monitor stores in the table.
The only caveat is that whatever results users want to store must be compatible with Python’s json encoder and decoder, which perform the following translations:
JSON | Python |
---|---|
object | dict |
array | list |
string | str |
number (int) | int |
number (real) | float |
true | True |
false | False |
null | None |
This means that whatever is intended to be stored should be composed of those Python data structures.
There is some support for this with pandas.
Both Series and DataFrame objects have a to_json method for automatically translating those data structures to JSON-friendly structures.
For more information on the to_json method, see the pandas documentation, and for more on Python’s JSON encoder and decoder, see the json module documentation.
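The translation table above can be checked with the standard library's json module. The dictionary below is a made-up result used only for illustration; it is not the output of any particular monitor.

```python
import json

# Hypothetical result built only from the Python types in the table above
results = {
    'mean': 2.0,          # float -> number (real)
    'count': 3,           # int   -> number (int)
    'values': [1, 2, 3],  # list  -> array
    'label': 'col1',      # str   -> string
    'flagged': True,      # True  -> true
    'comment': None,      # None  -> null
}

encoded = json.dumps(results)
decoded = json.loads(encoded)
assert decoded == results  # everything survives the round trip unchanged

# A tuple is encoded as an array and comes back as a list
tuple_round_trip = json.loads(json.dumps((1, 2)))

# Objects outside the table, such as a set, are rejected by the encoder
try:
    json.dumps({'values': {1, 2, 3}})
    set_rejected = False
except TypeError:
    set_rejected = True
```

Converting such objects to lists or dictionaries before storing them avoids the TypeError.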
The columns of the DataModel database, by contrast, are defined based on the data returned by the get_new_data method.
Storing and accessing results¶
BaseMonitor does provide a “default” attempt at storing the results, but for more complicated results (or just for more custom storage), a format_results method must be implemented.
Building off of the previous MyMonitor example:
def format_results(self):
    # Create a custom result with json-friendly python data structures
    results = {
        'my result 1': self.data.col1.to_json(),  # store the whole column if you want!
        'my result mean': self.results  # MyMonitor's track method returns the mean of col1
    }

    return results
The new entry will be created on execution, and if format_results has been implemented, that resulting object will be used.
To query the Monitor’s table for a specific result, results_table and the table’s column definitions (which are used in querying) are available as attributes:
monitor = MyMonitor()
query_results = monitor.results_table # Returns all results as a peewee ModelSelect object
# Further querying
more_specific = query_results.where(monitor.datetime_col == '2019-04-23T14:07:03.500365')
# Format rows as a list of dictionaries
list(more_specific.dicts())
Note
If a Monitor has been defined, but has not been executed (specifically the store_results method), the database table for that monitor will not exist yet.
In this case, the results_table property will be None.
For information on how to perform queries, see peewee’s documentation.
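To make the shape of the results table more concrete, the sketch below builds a comparable table with Python's built-in sqlite3 module and selects one entry by its timestamp. It is not monitorframe's or peewee's API: the Datetime and Result column names and the MyMonitor table name come from the description above, while the column types and the stored values are assumptions made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
# Table named after the monitor class; TEXT column types are an assumption
conn.execute('CREATE TABLE MyMonitor (Datetime TEXT, Result TEXT)')

# Made-up entries: an ISO-formatted timestamp plus a JSON-encoded result
entries = [
    ('2019-04-22T09:00:00.000000', {'my result mean': 1.5}),
    ('2019-04-23T14:07:03.500365', {'my result mean': 2.0}),
]
for stamp, result in entries:
    conn.execute('INSERT INTO MyMonitor VALUES (?, ?)', (stamp, json.dumps(result)))

# Select a single execution by its timestamp, then decode the JSON result
row = conn.execute(
    'SELECT Datetime, Result FROM MyMonitor WHERE Datetime = ?',
    ('2019-04-23T14:07:03.500365',),
).fetchone()
selected = {'Datetime': row[0], 'Result': json.loads(row[1])}
```

The selected dictionary plays the same role as one element of list(more_specific.dicts()) in the peewee example above.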
Customizing Plotting¶
BaseMonitor provides some basic plotting functionality that produces plotly interactive plots.
There are some additional options that can be set for controlling this basic plotting.
Setting a specific output file name or destination¶
By default, the resulting figure of a monitor derived from BaseMonitor will be given a name that is a combination of the monitor’s class name and the date that the monitor instance was created, and will be placed in the current working directory.
To change the path of the output file, assign output to a directory:
class MyMonitor(BaseMonitor):
    data_model = MyNewModel
    ...
    output = '/new/path/to/file/'  # For setting the path, but not the filename
To change the name of the file, assign output to a full path:
class MyMonitor(BaseMonitor):
    data_model = MyNewModel
    ...
    output = '/new/path/to/file/new_file_name.html'  # For setting both the path and the filename
Adding a third dimension to the output¶
The basic plotting functionality of BaseMonitor restricts the dimensionality to a maximum of three dimensions (it is basic, after all).
The third dimension is a color dimension, which supports an array of the same shape as x and y.
To specify a color dimension for the data, simply set the z attribute.
The third dimension can also be used to create an image plot.
Adding additional information to the hover labels¶
If additional information should be displayed on hover for each data point, that information should be included in the data retrieved by the data model.
For example, if the simple line plot created in Creating New Monitors also needed to include a “name” for each data point, get_new_data would need to be modified like so:
class MyNewModel(BaseDataModel):

    def get_new_data(self):
        return {
            'col1': [1, 2, 3],
            'col2': [4, 5, 6],
            'names': ['first', 'second', 'third']
        }
In the definition of the monitor, the new “names” column would need to be identified as a label:
class MyMonitor(BaseMonitor):
    data_model = MyNewModel
    labels = ['names']  # List of column names in data that should be used in hover labels

    plottype = 'line'
    x = 'col1'
    y = 'col2'

    def track(self):
        """Measure the mean of the first column"""
        return self.data.col1.mean()  # Remember that data is a pandas DataFrame!
This will add each “name” to the corresponding point in the hover labels in the plotly figure.
More complex plotting¶
For more complex plotting, plot should be overridden with whatever is needed, but plotly is still required.
When a new instance of a monitor is created, a plotly figure is created automatically.
Note
If subplots are needed, the subplots and subplot_layout attributes need to be defined in the monitor class.
This is because the plotly figure object is different for subplots.
To set the monitor to use a subplots figure:
class MyMonitor(BaseMonitor):
    data_model = MyNewModel
    ...
    subplots = True
    subplot_layout = (2, 2)  # 2x2 grid of plots
The plot method should add whatever traces (plotly’s term) and layouts are necessary to the monitor’s figure attribute:
def plot(self):
    ...  # Lots of complicated plotting stuff that results in a "plot" trace object and a new layout object

    self.figure.add_trace(plot)
    self.figure['layout'].update(layout)
If users want to integrate existing matplotlib plots without having to rewrite the entire plot, plotly’s mpl_to_plotly function can be used:
import plotly.tools as tls
new_plotly = tls.mpl_to_plotly(existing_mpl_figure)
This figure could then be assigned to the figure attribute on the monitor:
def plot(self):
    self.figure = new_plotly
Once plotting is all done, the figure can be written to an html file (with the default or specified path and/or name) with the write_figure method:
monitor.write_figure()
Finding Outliers¶
If part of the monitor is to locate outliers, then the find_outliers method must be implemented.
This method should return a mask array that can be used with the data attribute of the Monitor if the user intends to use the basic plotting functionality, but otherwise can return whatever is needed.
Outliers will be accessible via the outliers attribute of the monitor.
When using the basic plotting functionality, outliers will automatically be plotted in red, but for more advanced plotting that requires that the plot method be overridden, the user will have to determine how to visualize any outliers.
For example, if we add a find_outliers implementation to MyMonitor:
def find_outliers(self):
    return self.data.col1 > 1  # Returns a pandas Series mask
After the analysis has been run, you can access the outlying data with:
monitor = MyMonitor()
monitor.monitor()
outliers = monitor.data[monitor.outliers]