
TREANT

groot.treant

Created by Gabriele Tolomei on 2019-01-23.

Code adapted for comparison with GROOT from src/parallel_robust_forest.py at https://github.com/gtolomei/treant

Also see: https://arxiv.org/abs/1907.01197

Attacker

Class Attacker represents an attacker.

__init__(self, rules, budget) special

Class constructor.

Parameters:

  • rules (:obj:`AttackerRule`): set of AttackerRule objects. Required.
  • budget (float): total budget of the attacker (per instance). Required.

attack(self, x, feature_id, cost)

This function retrieves the list of attacks against a given instance, on a given feature, subject to a given cost.

attack_dataset(self, X, attacks_filename=None)

This function is responsible for attacking the whole input dataset. It either loads all the attacks from the attack file provided as input or it computes all the attacks from scratch.
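The budget/cost interplay can be made concrete with a small standalone sketch (illustrative code, not the library's implementation; `enumerate_attacks` and its signature are assumptions made for the example): a numerical perturbation rule with a fixed cost is applied repeatedly to one feature value until the per-instance budget is exhausted.

```python
def enumerate_attacks(value, step, cost, budget):
    """Enumerate the attacked variants of one numerical feature value.

    A single perturbation rule adds `step` at `cost` per application;
    applications stop once the next one would exceed `budget`.
    Returns (attacked_value, total_cost_spent) pairs, including the
    unmodified value at zero cost.
    """
    attacks = [(value, 0.0)]
    spent = cost
    while spent <= budget:
        value += step
        attacks.append((value, spent))
        spent += cost
    return attacks

# With cost 4 per application and budget 10, the rule fires at most twice.
variants = enumerate_attacks(1.0, step=0.5, cost=4.0, budget=10.0)
# variants -> [(1.0, 0.0), (1.5, 4.0), (2.0, 8.0)]
```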

AttackerRule

Class AttackerRule represents a rule of attack.

__init__(self, pre_conditions, post_condition, cost, is_numerical=True) special

Class constructor.

Parameters:

  • pre_conditions (dict): set of pre-conditions which must be met in order for this rule to be applied. Required.
  • post_condition (dict): post-condition indicating the outcome of this rule once applied. Required.
  • cost (float): cost of rule application. Required.
  • is_numerical (bool): flag indicating whether the attack specified by this rule operates on a numerical (perturbation) or a categorical (assignment) feature. Defaults to True.

apply(self, x)

Application of the rule to the input instance x.

Parameters:

  • x (numpy.array): 1-dimensional array representing an instance. Required.

Returns:

  • x_prime (numpy.array): a (deep) copy of x, modified according to the post-condition of this rule.

get_cost(self)

Return the cost of this rule.

get_target_feature(self)

Return the feature (id) targeted by this rule.

is_applicable(self, x)

Returns whether the rule can be applied to the input instance x or not.

Parameters:

  • x (numpy.array): 1-dimensional array representing an instance. Required.
  • numerical_idx (list): binary array indicating whether each feature is numerical; numerical_idx[i] = 1 iff feature id i is numerical, 0 otherwise. Required.

Returns:

  • True iff this rule is applicable to x (i.e., if x satisfies ALL the pre-conditions of this rule).
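To illustrate the contract above (all pre-conditions must hold, apply returns a deep copy, cost is fixed per application), here is a minimal stand-in for a numerical rule. The class name and its interval-style pre-condition are assumptions made for the example, not the library's actual data format.

```python
import numpy as np

class ToyAttackerRule:
    """Illustrative numerical rule: if feature `fid` lies in [low, high),
    the attacker may add `delta` to it, paying `cost`."""

    def __init__(self, fid, low, high, delta, cost):
        self.fid, self.low, self.high = fid, low, high
        self.delta, self.cost = delta, cost

    def is_applicable(self, x):
        # ALL pre-conditions must hold; here there is a single interval check.
        return self.low <= x[self.fid] < self.high

    def apply(self, x):
        # Return a copy of x perturbed on the target feature; x is untouched.
        x_prime = np.copy(x)
        x_prime[self.fid] += self.delta
        return x_prime

    def get_cost(self):
        return self.cost

    def get_target_feature(self):
        return self.fid

rule = ToyAttackerRule(fid=0, low=0.0, high=1.0, delta=0.3, cost=2.0)
x = np.array([0.5, 7.0])
x_prime = rule.apply(x)  # x_prime[0] == 0.8, while x[0] stays 0.5
```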

Constraint

Class Constraint represents a constraint.

__init__(self, x, y, cost, ineq, bound) special

Class constructor.

Parameters:

  • x (int): current instance. Required.
  • y (int/float): label associated with this instance. Required.
  • cost (float): cost associated with this instance (so far). Required.
  • ineq (int): flag encoding the direction of the inequality represented by this constraint; 0 = 'less than', 1 = 'greater than or equal to'. Required.
  • bound (float): constraint value on the loss function. Required.

encode_for_optimizer(self, direction)

Encode this constraint according to the format used by the optimizer.

propagate_left(self, attacker, feature_id, feature_value, is_numerical)

Propagate the constraint to the left.

propagate_right(self, attacker, feature_id, feature_value, is_numerical)

Propagate the constraint to the right.

Node

Class Node represents a node of a decision tree.

__init__(self, node_id, values, n_values, left=None, right=None, best_split_feature_id=None, best_split_feature_value=None) special

Class constructor.

Parameters:

  • node_id (int): node identifier. Required.
  • values (int): number of instances. Required.
  • n_values (int): maximum number of unique y values. Required.
  • left (:obj:`Node`, optional): left child node. Defaults to None.
  • right (:obj:`Node`, optional): right child node. Defaults to None.
  • best_split_feature_id (int): index of the feature associated with the best split of this Node. Defaults to None.
  • best_split_feature_value (float): value of the feature associated with the best split of this Node. Defaults to None.

get_node_prediction(self)

Get the prediction as being computed at this node.

is_leaf(self)

Returns True iff the current node is a leaf (i.e., if it has neither a left nor a right child).
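A minimal sketch of these two node queries (illustrative only; that `values` stores per-class instance counts is an assumption made for the example):

```python
class ToyNode:
    def __init__(self, node_id, values, left=None, right=None):
        self.node_id = node_id
        self.values = values  # assumed: per-class instance counts at this node
        self.left = left
        self.right = right

    def is_leaf(self):
        # A node is a leaf iff it has neither a left nor a right child.
        return self.left is None and self.right is None

    def get_node_prediction(self):
        # Majority class among the instances that reached this node.
        return max(range(len(self.values)), key=lambda c: self.values[c])

leaf = ToyNode(node_id=1, values=[3, 7])
root = ToyNode(node_id=0, values=[10, 10], left=leaf, right=ToyNode(2, [7, 3]))
# root is internal, leaf predicts class 1 (7 of its 10 instances)
```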

RobustDecisionTree (BaseEstimator, ClassifierMixin)

This class implements a single Robust Decision Tree. Inspired by the sklearn API, it is a subclass of the sklearn.base.BaseEstimator class and exposes two main methods:

  • fit(X, y)
  • predict(X)

The former is used at training time for learning a single decision tree; the latter is used at inference (testing) time for computing predictions using the learned tree.

__init__(self, tree_id=0, attacker=Attacker(), split_optimizer=SplitOptimizer(), max_depth=8, min_instances_per_node=20, max_samples=1.0, max_features=1.0, replace_samples=False, replace_features=False, feature_blacklist={}, affine=True, seed=0) special

Class constructor.

Parameters:

  • tree_id (int): tree identifier. Defaults to 0.
  • attacker (:obj:`Attacker`): the attacker under which this tree must grow. Defaults to an empty attacker.
  • split_optimizer (:obj:`SplitOptimizer`): the optimizer used by this tree. Defaults to SSE.
  • max_depth (int): maximum depth of the tree to be generated. Defaults to 8.
  • min_instances_per_node (int): minimum number of instances per node. Defaults to 20.
  • max_samples (float): proportion of instances sampled without replacement. Defaults to 1.0, i.e., 100%.
  • max_features (float): proportion of features sampled without replacement. Defaults to 1.0, i.e., 100%.
  • feature_blacklist (dict): dictionary of features excluded during tree growth. Defaults to {}, i.e., empty.
  • replace_samples (bool): whether the random sampling of instances should be done with replacement or not. Defaults to False.
  • replace_features (bool): whether the random sampling of features should be done with replacement or not. Defaults to False.
  • seed (int): integer seed used by randomized processes. Defaults to 0.

fit(self, X, y=None, numerical_idx=None)

This function is the public API's entry point for client code to start training a single Robust Decision Tree. It saves both the input data (X) and labels/targets (y) in the internals of the tree and delegates to the private self.__fit method; the result is a reference to the root node of the trained tree.

Parameters:

  • X (numpy.array): 2-dimensional array of shape (n_samples, n_features). Required.
  • y (numpy.array): 1-dimensional array of shape (n_samples,). Defaults to None.

predict(self, X, y=None)

This function is the public API's entry point for client code to obtain predictions from an already trained tree. If this tree hasn't been trained yet, predictions cannot be made; otherwise, for each instance in X, the tree is traversed until a leaf node is met: the prediction stored at that leaf node is the one returned to the caller.

Parameters:

  • X (numpy.array): 2-dimensional array of shape (n_test_samples, n_features) containing the samples whose predictions we want. Required.

Returns:

  • predictions (numpy.array): 1-dimensional array of shape (n_test_samples,).

predict_proba(self, X, y=None)

This function is the public API's entry point for client code to obtain class-probability scores from an already trained tree. If this tree hasn't been trained yet, predictions cannot be made; otherwise, for each instance in X, the tree is traversed until a leaf node is met: the probability scores stored at that leaf node are the ones returned to the caller.

Parameters:

  • X (numpy.array): 2-dimensional array of shape (n_test_samples, n_features) containing the samples whose predictions we want. Required.

Returns:

  • probs (numpy.array): 2-dimensional array of shape (n_test_samples, 2) containing probability scores both for class 0 (1st column) and class 1 (2nd column).
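The traversal described above, and the relationship between predict and predict_proba, can be sketched in a self-contained way; this is an illustrative reimplementation, not the library's code, assuming each leaf stores the two class-probability scores and internal nodes route on a feature/threshold pair:

```python
import numpy as np

def traverse(node, x):
    """Walk from `node` down to a leaf, going left when the instance's
    value for the split feature is <= the split threshold."""
    while node["left"] is not None:          # internal node
        fid, thr = node["feature_id"], node["threshold"]
        node = node["left"] if x[fid] <= thr else node["right"]
    return node["probs"]                     # (P(class 0), P(class 1))

def predict_proba(tree, X):
    return np.array([traverse(tree, x) for x in X])

def predict(tree, X):
    # predict() is predict_proba() followed by an argmax over the two columns.
    return predict_proba(tree, X).argmax(axis=1)

# A depth-1 toy tree splitting on feature 0 at threshold 0.5.
leaf = lambda p0, p1: {"left": None, "right": None, "probs": (p0, p1)}
tree = {"feature_id": 0, "threshold": 0.5,
        "left": leaf(0.9, 0.1), "right": leaf(0.2, 0.8)}

X = np.array([[0.3, 1.0], [0.7, 1.0]])
# predict(tree, X) -> array([0, 1])
```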

save(self, filename)

This function persists this RobustDecisionTree object to a file on disk using dill.

SplitOptimizer

Class used for determining the best splitting strategy, according to a specific splitting function. The class comes with a few splitting functions already implemented. In particular, those are as follows:

  • __gini_impurity (classification);
  • __entropy (classification);
  • __logloss (classification);
  • __mse (regression);
  • __sse (regression);
  • __mae (regression).

This class can also be instantiated with custom, user-defined splitting functions.
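Two of the built-in criteria above can be written as short standalone functions; these are illustrative versions, not the library's private implementations:

```python
from collections import Counter

def gini_impurity(y):
    """Gini impurity of a set of class labels: 1 - sum over classes of p_c^2."""
    n = len(y)
    return 1.0 - sum((count / n) ** 2 for count in Counter(y).values())

def sse(y):
    """Sum of squared errors around the mean (regression criterion)."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

g = gini_impurity([0, 0, 1, 1])   # maximally mixed two-class set -> 0.5
s = sse([1.0, 2.0, 3.0])          # deviations -1, 0, +1 -> 2.0
```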

__init__(self, split_function_name=None, icml2019=False) special

Class constructor.

Parameters:

  • split_function (func): the function used as splitting criterion. Defaults to None; if so, it falls back to the internally implemented __sse.

    if split_function is None:
        self.split_function = SplitOptimizer._SplitOptimizer__sse
        self.split_function_name = "SSE"
    else:
        self.split_function = split_function
        if split_function_name is None:
            split_function_name = split_function.__name__
        self.split_function_name = split_function_name

evaluate_split(self, y_true, y_pred)

This function is a meta-function which dispatches to the actual splitting function, forwarding the input arguments.

optimize_gain(self, X, y, rows, numerical_idx, feature_blacklist, n_sample_features, replace_features, attacker, costs, constraints, current_score, current_prediction_score)

This function is responsible for finding the split which optimizes the gain (according to the splitting function) among all the possible splits.

Parameters:

  • X (numpy.array): 2-dimensional array of shape (n_samples, n_features) representing the feature matrix. Required.
  • y (numpy.array): 1-dimensional array of shape (n_samples,) representing class labels (classification) or target values (regression). Required.
  • rows (numpy.array): 1-dimensional array containing the indices of a subset of n_samples (i.e., a subset of the rows of X and y). Required.
  • numerical_idx (list): binary array indicating whether each feature is numerical; numerical_idx[i] = 1 iff feature id i is numerical, 0 otherwise. Required.
  • feature_blacklist (set): set of (integer) indices corresponding to blacklisted features. Required.
  • n_sample_features (int): number of features to be randomly sampled at each node split. Required.
  • attacker (:obj:`Attacker`): attacker. Required.
  • costs (dict): cost associated with each instance (indexed by rows). Required.
  • constraints (list): list of Constraint objects. Required.
  • current_score (float): the score before any split is made, to be compared with the best split found; whenever current_score is greater than the score computed after splitting, there is a gain. Required.

Returns:

  • best_gain (float): the highest gain obtained after all the possible splits have been tested (may be 0, in which case the split will not be worth making).
  • best_split_left_id (numpy.array): 1-dimensional array containing the indices of rows going to the left branch.
  • best_split_right_id (numpy.array): 1-dimensional array containing the indices of rows going to the right branch.
  • best_split_feature_id (int): index of the feature which led to the best split.
  • best_split_feature_value (int/float): value of the feature which led to the best split.
  • next_best_split_feature_value (int/float): next-observed value of the feature which led to the best split.
  • constraints_left (numpy.array): array of constraints if propagated to the left.
  • constraints_right (numpy.array): array of constraints if propagated to the right.
  • costs_left (numpy.array): array of costs if propagated to the left.
  • costs_right (numpy.array): array of costs if propagated to the right.
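The gain bookkeeping amounts to comparing the parent's score with the summed scores of the two child partitions. A minimal sketch using SSE as the splitting function (illustrative, not the library's code; `split_gain` is a name invented for the example):

```python
def sse(y):
    # Sum of squared errors around the mean.
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

def split_gain(y, left_rows, right_rows):
    """Gain of a candidate split: the parent's score minus the sum of the
    children's scores. A positive gain means the split is worth making."""
    current_score = sse(y)
    after = sse([y[i] for i in left_rows]) + sse([y[i] for i in right_rows])
    return current_score - after

y = [0.0, 0.0, 1.0, 1.0]
gain = split_gain(y, left_rows=[0, 1], right_rows=[2, 3])  # a perfect split
# gain -> 1.0 (parent SSE is 1.0, both children have SSE 0)
```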