TREANT
groot.treant
Created by Gabriele Tolomei on 2019-01-23.
Code adapted for comparison with GROOT from src/parallel_robust_forest.py at https://github.com/gtolomei/treant
Also see: https://arxiv.org/abs/1907.01197
Attacker
Class Attacker represents an attacker.
__init__(self, rules, budget)
special
Class constructor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
rules | obj: | | required |
budget | float | total budget of the attacker (per instance). | required |
attack(self, x, feature_id, cost)
This function retrieves the list of attacks on a given instance, for a given feature, subject to a given cost.
attack_dataset(self, X, attacks_filename=None)
This function is responsible for attacking the whole input dataset. It either loads all the attacks from the attack file provided as input or it computes all the attacks from scratch.
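Below is a minimal usage sketch, not taken from the source: the pre-/post-condition dictionary formats, the toy data, and the attacks file name are assumptions for illustration; the Attacker, AttackerRule, attack and attack_dataset signatures are as documented on this page.

```python
import numpy as np
from groot.treant import Attacker, AttackerRule

# Hypothetical rule: if feature 0 lies in [0, 50], the attacker may push it up
# by 10 at cost 1 (the dict formats below are assumptions, see AttackerRule).
rules = [
    AttackerRule(
        pre_conditions={0: (0.0, 50.0)},
        post_condition={0: 10.0},
        cost=1.0,
        is_numerical=True,
    )
]

# Attacker with a per-instance budget of 5.
attacker = Attacker(rules=rules, budget=5.0)

X = np.array([[25.0, 1.0], [75.0, 0.0]])

# Attacks on feature 0 of a single instance, given the cost spent so far.
attacks = attacker.attack(X[0], feature_id=0, cost=0.0)

# Attack the whole dataset; if a file name is given, attacks are loaded from /
# cached to that file instead of being recomputed ("attacks.pkl" is hypothetical).
all_attacks = attacker.attack_dataset(X, attacks_filename="attacks.pkl")
```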
AttackerRule
Class AttackerRule represents a rule of attack.
__init__(self, pre_conditions, post_condition, cost, is_numerical=True)
special
Class constructor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pre_conditions | dict | set of pre-conditions which must be met in order for this rule to be applied. | required |
post_condition | dict | post-condition indicating the outcome of this rule once applied. | required |
cost | float | cost of rule application. | required |
is_numerical | boolean | flag to indicate whether the attack specified by this rule operates on a numerical (perturbation) or a categorical (assignment) feature. | True |
apply(self, x)
Application of the rule to the input instance x.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | numpy.array | 1-dimensional array representing an instance. | required |
Returns:
Type | Description |
---|---|
x_prime (numpy.array) | A (deep) copy of x, modified according to the post-condition of this rule. |
get_cost(self)
Return the cost of this rule.
get_target_feature(self)
Return the feature (id) targeted by this rule.
is_applicable(self, x)
Returns whether the rule can be applied to the input instance x or not.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | numpy.array | 1-dimensional array representing an instance. | required |
numerical_idx | list | binary array which indicates whether a feature is numerical or not; numerical_idx[i] = 1 iff feature id i is numerical, 0 otherwise. | required |
Returns:
Type | Description |
---|---|
bool | True iff this rule is applicable to x (i.e., if x satisfies ALL the pre-conditions of this rule). |
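A sketch of building and applying a single rule; as above, the concrete pre-/post-condition dictionaries are illustrative assumptions, while the method calls follow the signatures documented here.

```python
import numpy as np
from groot.treant import AttackerRule

# Hypothetical numerical rule on feature 2: if its value lies in [0, 10],
# the attacker may perturb it by +1.5 at a cost of 0.5.
rule = AttackerRule(
    pre_conditions={2: (0.0, 10.0)},
    post_condition={2: 1.5},
    cost=0.5,
    is_numerical=True,
)

x = np.array([3.0, 7.0, 4.2])

rule.get_target_feature()    # feature id targeted by the rule
rule.get_cost()              # 0.5

if rule.is_applicable(x):    # x satisfies ALL pre-conditions
    x_prime = rule.apply(x)  # deep copy of x, modified per the post-condition
```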
Constraint
Class Constraint represents a constraint.
__init__(self, x, y, cost, ineq, bound)
special
Class constructor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | int | current instance. | required |
y | int/float | label associated with this instance. | required |
cost | float | cost associated with this instance (so far). | required |
ineq | int | flag to encode the direction of the inequality represented by this constraint; 0 = 'less than', 1 = 'greater than or equal to'. | required |
bound | float | constraint value on the loss function. | required |
encode_for_optimizer(self, direction)
Encode this constraint according to the format used by the optimizer.
propagate_left(self, attacker, feature_id, feature_value, is_numerical)
Propagate the constraint to the left.
propagate_right(self, attacker, feature_id, feature_value, is_numerical)
Propagate the constraint to the right.
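A construction-only sketch based on the documented parameters; the concrete values and the treatment of x as a feature vector are assumptions, and the propagation/encoding calls are only indicated in comments because their argument values are not specified here.

```python
import numpy as np
from groot.treant import Constraint

x = np.array([3.0, 7.0, 4.2])  # instance the constraint refers to (assumed format)

# ineq=1 encodes 'greater than or equal to' (0 would be 'less than');
# bound is the constraint value on the loss function, cost the attack cost so far.
constraint = Constraint(x=x, y=1, cost=2.0, ineq=1, bound=0.5)

# During tree growth the constraint can be pushed down a candidate split, e.g.
#   constraint.propagate_left(attacker, feature_id, feature_value, is_numerical)
#   constraint.propagate_right(attacker, feature_id, feature_value, is_numerical)
# and turned into the optimizer's format via
#   constraint.encode_for_optimizer(direction)
```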
Node
Class Node represents a node of a decision tree.
__init__(self, node_id, values, n_values, left=None, right=None, best_split_feature_id=None, best_split_feature_value=None)
special
Class constructor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
node_id | int | node identifier. | required |
values | int | number of instances. | required |
n_values | int | maximum number of unique y values. | required |
left | obj: | left child node. Defaults to None. | None |
right | obj: | right child node. Defaults to None. | None |
best_split_feature_id | int | index of the feature associated with the best split of this Node. Defaults to None. | None |
best_split_feature_value | float | value of the feature associated with the best split of this Node. Defaults to None. | None |
get_node_prediction(self)
Get the prediction as being computed at this node.
is_leaf(self)
Returns True iff the current node is a leaf (i.e., if it has neither a left nor a right child).
RobustDecisionTree (BaseEstimator, ClassifierMixin)
This class implements a single Robust Decision Tree. Inspired by the sklearn API, it is a subclass of sklearn.base.BaseEstimator and exposes two main methods:
- fit(X, y)
- predict(X)
The former is used at training time for learning a single decision tree; the latter is used at inference (testing) time for computing predictions using the learned tree.
__init__(self, tree_id=0, attacker=&lt;groot.treant.Attacker object&gt;, split_optimizer=&lt;groot.treant.SplitOptimizer object&gt;, max_depth=8, min_instances_per_node=20, max_samples=1.0, max_features=1.0, replace_samples=False, replace_features=False, feature_blacklist={}, affine=True, seed=0)
special
Class constructor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_id | int | tree identifier. | 0 |
attacker | obj: | | &lt;groot.treant.Attacker object&gt; |
split_optimizer | obj: | | &lt;groot.treant.SplitOptimizer object&gt; |
max_depth | int | maximum depth of the tree to be generated (default = 8). | 8 |
min_instances_per_node | int | minimum number of instances per node (default = 20). | 20 |
max_samples | float | proportion of instances sampled without replacement (default = 1.0, i.e., 100%). | 1.0 |
max_features | float | proportion of features sampled without replacement (default = 1.0, i.e., 100%). | 1.0 |
feature_blacklist | dict | dictionary of features excluded during tree growth (default = {}, i.e., empty). | {} |
replace_samples | bool | whether the random sampling of instances should be with replacement or not (default = False). | False |
replace_features | bool | whether the random sampling of features should be with replacement or not (default = False). | False |
seed | int | integer seed used by randomized processes. | 0 |
fit(self, X, y=None, numerical_idx=None)
This function is the public API entry point for client code to train a single Robust Decision Tree. It stores both the input data (X) and the labels/targets (y) in the internals of the tree and delegates to the private self.__fit method, whose result is a reference to the root node of the trained tree.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | numpy.array | 2-dimensional array of shape (n_samples, n_features). | required |
y | numpy.array | 1-dimensional array of values of shape (n_samples, ). | None |
predict(self, X, y=None)
This function is the public API entry point for client code to obtain predictions from an already trained tree. If this tree hasn't been trained yet, predictions cannot be made; otherwise, for each instance in X, the tree is traversed until a leaf node is reached: the prediction stored at that leaf node is returned to the caller.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | numpy.array | 2-dimensional array of shape (n_test_samples, n_features) containing the samples whose predictions we want. | required |
Returns:
Type | Description |
---|---|
predictions (numpy.array) | 1-dimensional array of shape (n_test_samples, ). |
predict_proba(self, X, y=None)
This function is the public API entry point for client code to obtain probability estimates from an already trained tree. If this tree hasn't been trained yet, predictions cannot be made; otherwise, for each instance in X, the tree is traversed until a leaf node is reached: the probability scores stored at that leaf node are returned to the caller.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | numpy.array | 2-dimensional array of shape (n_test_samples, n_features) containing the samples whose predictions we want. | required |
Returns:
Type | Description |
---|---|
probs (numpy.array) | 2-dimensional array of shape (n_test_samples, 2) containing probability scores for both class 0 (1st column) and class 1 (2nd column). |
save(self, filename)
This function is used to persist this RobustDecisionTree object to file on disk using dill.
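A minimal end-to-end sketch, assuming toy data and an attacker with no rules (i.e., effectively no adversary); the numerical_idx format (one binary flag per feature) follows the description given for optimize_gain below, and the file name is hypothetical.

```python
import numpy as np
from groot.treant import Attacker, RobustDecisionTree, SplitOptimizer

# Toy binary-classification data (illustrative only).
X = np.array([[0.0, 1.0], [0.2, 0.9], [0.8, 0.1], [1.0, 0.0]])
y = np.array([0, 0, 1, 1])

# An attacker with no rules and zero budget; a real threat model would pass
# AttackerRule objects and a positive budget, as documented above.
attacker = Attacker(rules=[], budget=0.0)

tree = RobustDecisionTree(
    tree_id=0,
    attacker=attacker,
    split_optimizer=SplitOptimizer(),
    max_depth=4,
    min_instances_per_node=2,
    seed=0,
)

# Every feature is flagged as numerical.
tree.fit(X, y, numerical_idx=np.array([1, 1]))

predictions = tree.predict(X)   # shape (n_test_samples, )
probs = tree.predict_proba(X)   # shape (n_test_samples, 2)

tree.save("robust_tree.dill")   # persisted with dill; file name is hypothetical
```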
SplitOptimizer
Class used for determining the best splitting strategy, according to a specific splitting function. The class comes with a few splitting functions already implemented. In particular, those are as follows:
- __gini_impurity (classification);
- __entropy (classification);
- __logloss (classification);
- __mse (regression);
- __sse (regression);
- __mae (regression).
Of course this class can be instantiated with custom, user-defined splitting functions.
__init__(self, split_function_name=None, icml2019=False)
special
Class constructor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
split_function | func | The function used as the splitting criterion. Defaults to None; if None, it falls back to the internally implemented __gini_impurity. | required |
The constructor's fallback logic is as follows:

```python
if split_function is None:
    self.split_function = SplitOptimizer._SplitOptimizer__sse
    self.split_function_name = "SSE"
else:
    self.split_function = split_function
    if split_function_name is None:
        split_function_name = split_function.__name__
    self.split_function_name = split_function_name
```
evaluate_split(self, y_true, y_pred)
This function is a meta-function which delegates to the actual splitting function, passing along the input arguments.
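A small sketch of calling the optimizer's criterion directly through evaluate_split; whether y_pred should contain class labels or scores depends on the chosen splitting function, so the vectors below are purely illustrative.

```python
import numpy as np
from groot.treant import SplitOptimizer

optimizer = SplitOptimizer()  # falls back to the internally implemented default criterion

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.2, 0.9, 0.4])

# Delegates to the configured splitting function with these arguments.
score = optimizer.evaluate_split(y_true, y_pred)
```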
optimize_gain(self, X, y, rows, numerical_idx, feature_blacklist, n_sample_features, replace_features, attacker, costs, constraints, current_score, current_prediction_score)
This function is responsible for finding the split which optimizes the gain (according to the splitting function) among all the possible splits.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | numpy.array | 2-dimensional array of shape (n_samples, n_features) representing the feature matrix. | required |
y | numpy.array | 1-dimensional array of shape (n_samples, ) representing class labels (classification) or target values (regression). | required |
rows | numpy.array | 1-dimensional array containing the indices of a subset of n_samples (i.e., a subset of the rows of X and y). | required |
numerical_idx | list | binary array which indicates whether a feature is numerical or not; numerical_idx[i] = 1 iff feature id i is numerical, 0 otherwise. | required |
feature_blacklist | set | set of (integer) indices corresponding to blacklisted features. | required |
n_sample_features | int | number of features to be randomly sampled at each node split. | required |
attacker | obj: | | required |
costs | dict | cost associated with each instance (indexed by rows). | required |
constraints | list | list of Constraint objects. | required |
current_score | float | score before any split is made; it is compared against the score of the best split found. Whenever current_score is greater than the score computed after splitting, there is a gain. | required |
Returns:
Type | Description |
---|---|
best_gain (float) | The highest gain obtained after all the possible splits have been tested (may be 0, in which case the split is not worth making). |
best_split_left_id (numpy.array) | 1-dimensional array containing the indices of rows going to the left branch. |
best_split_right_id (numpy.array) | 1-dimensional array containing the indices of rows going to the right branch. |
best_split_feature_id (int) | index of the feature which led to the best split. |
best_split_feature_value (int/float) | value of the feature which led to the best split. |
next_best_split_feature_value (int/float) | next-observed value of the feature which led to the best split. |
constraints_left (numpy.array) | array of constraints if propagated to the left. |
constraints_right (numpy.array) | array of constraints if propagated to the right. |
costs_left (numpy.array) | array of costs if propagated to the left. |
costs_right (numpy.array) | array of costs if propagated to the right. |