Quickstart
This section describes a minimal example. First, create a daskperiment.Experiment instance. This instance controls an experiment: a chain of functions that outputs a value, and a collection of input variables.
Note
Unrelated logs are omitted in the following examples.
>>> import numpy as np
>>> import daskperiment
>>> ex = daskperiment.Experiment(id='quickstart_pj')
...
>>> ex
Experiment(id: quickstart_pj, trial_id: 0, backend: LocalBackend('daskperiment_cache/quickstart_pj'))
Then, use the Experiment.parameter method to define parameters (input variables for the experiment). The actual value of each parameter can be changed in every trial.
>>> a = ex.parameter('a')
>>> b = ex.parameter('b')
>>> a
Parameter(a: Undefined)
Next, define each experiment step (function) by decorating it with the Experiment instance (@ex).
Note that the function that outputs the final result (usually the objective value to be minimized or maximized) must be decorated with Experiment.result. The chain of these functions is expressed as a Dask.Delayed instance.
>>> @ex
... def prepare_data(a, b):
...     return a + b
>>> @ex.result
... def calculate_score(s):
...     return 10 / s
>>> d = prepare_data(a, b)
>>> s = calculate_score(d)
>>> s
Delayed('calculate_score-ebe2d261-8903-45e1-b224-72b4c886e4c5')
You can also visualize the computation graph via the .visualize method.
>>> s.visualize()
Use the Experiment.set_parameters method to set parameters for a trial. After setting parameters, the Parameter variables and the experiment result become computable.
Parameters should preferably be scalars (or other lightweight values) because they are stored in the history (for example, passing a filename as a parameter is preferred to passing a DataFrame).
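The idea behind this recommendation can be sketched in plain Python (this is the pattern only, not daskperiment API): the parameter is a lightweight filename, and the step itself loads the heavy data.

```python
import pandas as pd

# Sketch of the recommended pattern: the parameter is a short string,
# so only the filename (not the DataFrame) ends up in the trial history.
def prepare_data(csv_path):
    df = pd.read_csv(csv_path)  # the heavy object is created inside the step
    return df["value"].sum()

# Build a tiny CSV so the sketch is self-contained.
pd.DataFrame({"value": [1, 2, 3]}).to_csv("tiny.csv", index=False)
print(prepare_data("tiny.csv"))
```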
>>> ex.set_parameters(a=1, b=2)
...
>>> s.compute()
... [INFO] Started Experiment (trial id=1)
...
... [INFO] Finished Experiment (trial id=1)
...
3.3333333333333335
You can update any parameter for the next trial. Every trial can be distinguished by its trial id.
>>> ex.set_parameters(b=3)
>>> s.compute()
...
... [INFO] Started Experiment (trial id=2)
...
... [INFO] Finished Experiment (trial id=2)
...
2.5
After some trials, you can retrieve parameter values by specifying a trial id.
>>> ex.get_parameters(trial_id=1)
{'a': 1, 'b': 2}
>>> ex.get_parameters(trial_id=2)
{'a': 1, 'b': 3}
Experiment.get_history returns a DataFrame which stores the history of trial parameters and their results. You can select desirable trials using basic pandas operations.
>>> ex.get_history()
a b Result Result Type Success Finished \
1 1 2 3.333333 <class 'float'> True 2019-02-03 XX:XX:XX.XXXXXX
2 1 3 2.500000 <class 'float'> True 2019-02-03 XX:XX:XX.XXXXXX
Process Time Description
1 00:00:00.014183 NaN
2 00:00:00.012354 NaN
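Because the history is an ordinary pandas DataFrame, standard selection works on it. A minimal sketch, using a hand-built frame with the column names shown above:

```python
import pandas as pd

# Hand-built frame mimicking the history above (column names assumed).
hist = pd.DataFrame(
    {"a": [1, 1], "b": [2, 3],
     "Result": [3.333333, 2.5], "Success": [True, True]},
    index=[1, 2],  # the index holds trial ids
)

# Basic pandas boolean indexing: trials whose Result exceeds 3.
best = hist[hist["Result"] > 3]
print(best.index.tolist())  # trial ids matching the condition
```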
When an error occurs during a trial, the Experiment instance stores the log as a failed trial. The “Description” column contains the error details.
>>> ex.set_parameters(a=1, b=-1)
>>> s.compute()
...
ZeroDivisionError: division by zero
>>> ex.get_history()
a b Result Result Type Success Finished \
1 1 2 3.333333 <class 'float'> True 2019-02-03 XX:XX:XX.XXXXXX
2 1 3 2.500000 <class 'float'> True 2019-02-03 XX:XX:XX.XXXXXX
3 1 -1 NaN None False 2019-02-03 XX:XX:XX.XXXXXX
Process Time Description
1 00:00:00.014183 NaN
2 00:00:00.012354 NaN
3 00:00:00.015954 ZeroDivisionError(division by zero)
Handling Intermediate Result
The next example shows how to retrieve an intermediate result of the chain.
The only difference is the Experiment.persist decorator. It makes the Experiment instance keep the decorated function’s intermediate result. After the definition, rebuild the same workflow using the persisted function.
Note that an intermediate result is saved as a pickle file named after its function name, which must be unique within the experiment.
>>> @ex.persist
... def prepare_data(a, b):
...     return a + b
>>> d = prepare_data(a, b)
>>> s = calculate_score(d)
... [WARNING] Code context has been changed: prepare_data
... [WARNING] @@ -1,3 +1,3 @@
... [WARNING] -@ex
... [WARNING] +@ex.persist
... [WARNING] def prepare_data(a, b):
... [WARNING] return a + b
...
Note
If you execute the code above, daskperiment outputs some “WARNING” logs indicating that the code context has been changed. This is because daskperiment automatically tracks code contexts to guarantee reproducibility.
Let’s perform some trials.
>>> ex.set_parameters(a=1, b=2)
>>> s.compute()
...
... [INFO] Finished Experiment (trial id=4)
...
3.3333333333333335
>>> ex.set_parameters(a=3, b=2)
>>> s.compute()
...
... [INFO] Finished Experiment (trial id=5)
...
2.0
You can retrieve intermediate results via the Experiment.get_persisted method by specifying the function name and trial id.
>>> ex.get_persisted('prepare_data', trial_id=4)
...
3
>>> ex.get_persisted('prepare_data', trial_id=5)
...
5
Monitoring Metrics
You may need to monitor the transition of some metrics during each trial. In each experiment function, you can call Experiment.save_metric to save a metric with its key (name) and epoch.
>>> @ex.result
... def calculate_score(s):
...     for i in range(100):
...         ex.save_metric('dummy_score', epoch=i, value=100 - np.random.random() * i)
...     return 10 / s
>>> d = prepare_data(a, b)
>>> s = calculate_score(d)
...
>>> ex.set_parameters(a=1, b=2)
>>> s.compute()
...
... [INFO] Finished Experiment (trial id=6)
...
3.3333333333333335
After a trial, you can load a saved metric using Experiment.load_metric, specifying its name and trial id. As it returns metrics as a DataFrame, you can easily investigate them.
>>> dummy_score = ex.load_metric('dummy_score', trial_id=6)
>>> dummy_score.head()
Trial ID 6
Epoch
0 100.000000
1 99.925724
2 99.616405
3 98.527259
4 97.086730
Perform another trial.
>>> ex.set_parameters(a=3, b=4)
>>> s.compute()
...
... [INFO] Finished Experiment (trial id=7)
...
1.4285714285714286
To compare metrics between trials, pass multiple trial ids to Experiment.load_metric.
>>> ex.load_metric('dummy_score', trial_id=[6, 7]).head()
Trial ID 6 7
Epoch
0 100.000000 100.000000
1 99.925724 99.497605
2 99.616405 99.459706
3 98.527259 98.027079
4 97.086730 99.517617
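Since load_metric returns a plain DataFrame (one column per trial id, indexed by epoch), comparing trials reduces to ordinary pandas operations. A sketch with made-up numbers shaped like the output above:

```python
import pandas as pd

# Made-up metric values shaped like load_metric's output:
# one column per trial id, indexed by epoch.
scores = pd.DataFrame(
    {6: [100.0, 99.9, 99.6], 7: [100.0, 99.5, 99.4]},
    index=pd.Index([0, 1, 2], name="Epoch"),
)

# Per-trial summary statistics, e.g. the mean over epochs.
print(scores.mean())
# scores.plot() would draw one line per trial for visual comparison.
```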
Check Code Context
During the trials, daskperiment tracks the code contexts decorated with the Experiment decorators.
To check the tracked code contexts, use Experiment.get_code, specifying a trial id (if no trial id is specified, it returns the current code).
>>> ex.get_code()
@ex.persist
def prepare_data(a, b):
    return a + b

@ex.result
def calculate_score(s):
    for i in range(100):
        ex.save_metric('dummy_score', epoch=i, value=100 - np.random.random() * i)
    return 10 / s
>>> ex.get_code(trial_id=1)
@ex
def prepare_data(a, b):
    return a + b

@ex.result
def calculate_score(s):
    return 10 / s
Each code context is also saved as a text file per trial id, so it is easy to handle with diff tools and Git.
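Since the code contexts are plain strings, they diff cleanly with standard tools. For instance, comparing two (hypothetical) code contexts with Python's standard library:

```python
import difflib

# Hypothetical code contexts, as ex.get_code(trial_id=...) might return them.
code_trial_1 = "@ex\ndef prepare_data(a, b):\n    return a + b\n"
code_trial_4 = "@ex.persist\ndef prepare_data(a, b):\n    return a + b\n"

# A unified diff shows exactly how the step changed between trials.
diff = difflib.unified_diff(
    code_trial_1.splitlines(), code_trial_4.splitlines(),
    fromfile="trial 1", tofile="trial 4", lineterm="",
)
print("\n".join(diff))
```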
Function Purity And Handling Randomness
To make the experiment reproducible, every experiment step should be a “pure” function (one that always returns the same output when given the same inputs). In other words, the function should have neither internal state nor randomness.
daskperiment checks whether each experiment step is pure. It internally stores a hash of the inputs and output, and issues a warning if the output changes even though the inputs are unchanged.
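The idea behind this check can be sketched as follows (a simplification; daskperiment's actual hashing mechanism may differ):

```python
import hashlib
import random

def output_hash(func, *args):
    # Hash the repr of the output; repeated identical hashes for the
    # same inputs indicate the step behaves like a pure function.
    return hashlib.sha256(repr(func(*args)).encode()).hexdigest()

def pure_step(a, b):
    return a + b

def impure_step(a, b):
    return a + b + random.random()  # hidden randomness breaks purity

print(output_hash(pure_step, 1, 2) == output_hash(pure_step, 1, 2))      # True
print(output_hash(impure_step, 1, 2) == output_hash(impure_step, 1, 2))  # almost certainly False
```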
To illustrate this, add randomness to the example code.
>>> @ex.result
... def calculate_score(s):
...     for i in range(100):
...         ex.save_metric('dummy_score', epoch=i, value=100 - np.random.random() * i)
...     return 10 / s + np.random.random()
>>> d = prepare_data(a, b)
>>> s = calculate_score(d)
Because of the code change, it outputs different results even though its inputs (parameters) are the same. In this case, daskperiment issues the warning.
>>> s.compute()
...
... [INFO] Random seed is not provided, initialized with generated seed: 1336143935
...
... [WARNING] Experiment step result is changed with the same input: (step: calculate_score, args: (7,), kwargs: {})
... [INFO] Finished Experiment (trial id=8)
2.1481070929378823
This function outputs a different result in every trial because of the randomness. To make the function reproducible, a random seed should be provided.
To do this, pass the seed argument to the compute method. Note that this trial still issues the warning because its result differs from the previous result (which used no seed).
>>> s.compute(seed=1)
...
... [INFO] Random seed is initialized with given seed: 1
...
... [WARNING] Experiment step result is changed with the same input: (step: calculate_score, args: (7,), kwargs: {})
... [INFO] Finished Experiment (trial id=9)
1.7552163303435249
Another trial with the same seed doesn’t issue the warning, because the result is unchanged.
>>> s.compute(seed=1)
...
... [INFO] Random seed is initialized with given seed: 1
...
... [INFO] Finished Experiment (trial id=10)
1.7552163303435249
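The effect of the seed argument can be reproduced in plain NumPy (a sketch of the underlying idea, not daskperiment internals): seeding the random number generator before each run makes the random contribution deterministic.

```python
import numpy as np

def noisy_score(s):
    # Same shape as the example step: a deterministic part plus randomness.
    return 10 / s + np.random.random()

np.random.seed(1)
first = noisy_score(3)
np.random.seed(1)   # re-seeding resets the random sequence
second = noisy_score(3)
print(first == second)  # the same seed yields the same result
```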
Save Experiment Status
daskperiment automatically saves its internal state when an experiment result is computed (when .compute is called). Also, an Experiment instance recovers its previous state when it is instantiated.
The following example instantiates an Experiment instance using the same id as above. Thus, the created Experiment recovers its previous trial history.
>>> ex_new = daskperiment.Experiment(id='quickstart_pj')
Calling .get_history returns information about previous trials.
>>> ex_new.get_history()
...
Also, the Experiment instance automatically detects environment changes from its previous trial. The following is a sample log shown when a package update is detected (pandas 0.23.4 -> 0.24.0).
... [INFO] Loaded Experiment(id: quickstart_pj, trial_id: 14) from path=daskperiment_cache/quickstart_pj/quickstart_pj.pkl
... [WARNING] Installed Python packages have been changed
... [WARNING] @@ -142 +142 @@
... [WARNING] -pandas 0.23.4 (/Users/sinhrks/anaconda/lib/python3.6/site-packages)
... [WARNING] +pandas 0.24.0 (/Users/sinhrks/anaconda/lib/python3.6/site-packages)