pysr3.lme.problems module

Representations of datasets that are compatible with skmixed’s models

class pysr3.lme.problems.LMEProblem(fixed_features: List[ndarray], random_features: List[ndarray], obs_vars: int | float | ndarray, group_labels: ndarray, intercept_label: str, column_labels: List[Tuple[int, int]], order_of_objects: ndarray, answers=None, fe_columns=None, re_columns=None, fe_regularization_weights=None, re_regularization_weights=None)

Bases: Problem

Helper class which implements Linear Mixed-Effects models’ abstractions over a given dataset.

It can also generate random problems with specific characteristics.

Constructor for the LMEProblem class. It is meant for the library’s internal use. To create an LMEProblem, use the static methods from_x_y() and from_dataframe() described below.

static from_dataframe(data: DataFrame, fixed_effects: List[str], random_effects: List[str], groups: str, variance: str, target: str, not_regularized_fe: List[str], not_regularized_re: List[str])

Creates LMEProblem from Pandas dataframe

Parameters:

data (pd.DataFrame) – Dataframe that contains all relevant data
fixed_effects (List[str]) – List of column names that should be included as fixed effects
random_effects (List[str]) – List of column names that should be included as random effects
groups (str) – Name of the column that contains group labels
variance (str) – Name of the column that contains observation variances
target (str) – Name of the column that contains the target variable
not_regularized_fe (List[str]) – List of fixed effects whose coefficients in the model are not penalized by the sparsity-promoting regularizer. This does NOT guarantee that these features are included in the final model, but it significantly increases the chances.
not_regularized_re (List[str]) – List of random effects whose coefficients in the model are not penalized by the sparsity-promoting regularizer. This does NOT guarantee that these features are included in the final model, but it significantly increases the chances.

Returns:

LMEProblem
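To make the roles of these arguments concrete, the sketch below maps column names to role labels of the kind that from_x_y() accepts in columns_labels. The helper labels_from_roles and the column names are hypothetical and not part of pysr3; whether from_dataframe() performs this mapping internally is an assumption.

```python
from typing import List


def labels_from_roles(columns: List[str], fixed_effects: List[str],
                      random_effects: List[str], groups: str,
                      variance: str, target: str) -> List[str]:
    """Hypothetical helper: assign one role label to every non-target column."""
    labels = []
    for name in columns:
        if name == target:
            continue  # the target is handled separately, as y
        if name == groups:
            labels.append("group")
        elif name == variance:
            labels.append("variance")
        elif name in fixed_effects and name in random_effects:
            labels.append("fixed+random")
        elif name in fixed_effects:
            labels.append("fixed")
        elif name in random_effects:
            labels.append("random")
        else:
            raise ValueError(f"column {name!r} has no role")
    return labels


# Made-up column names, for illustration only.
columns = ["age", "dose", "clinic", "se2", "response"]
labels = labels_from_roles(columns, fixed_effects=["age", "dose"],
                           random_effects=["dose"], groups="clinic",
                           variance="se2", target="response")
print(labels)
```

A column listed in both fixed_effects and random_effects ends up with the “fixed+random” role in this sketch.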

static from_x_y(x: ndarray, y: ndarray | None = None, columns: List[str] | None = None, columns_labels: List[str] | None = None, fit_fixed_intercept: bool = False, fit_random_intercept: bool = False, must_include_fe: List[str] | None = None, must_include_re: List[str] | None = None, **kwargs)

Transforms matrices x (data) and y (answers) into an instance of LMEProblem

Parameters:

x (array-like, shape = [m,n]) – Data.
y (array-like, shape = [m]) – Answers.
columns (List[str]) – List of column names
columns_labels (List[str]) –

List of column labels. There must be only one column of group labels and only one column of answer STDs.
- “fixed” : fixed effect
- “random” : random effect
- “fixed+random” : both fixed and random
- “group” : group labels
- “variance” : answers’ standard deviations
- “intercept” : intercept column (whether the intercept is fixed or random is controlled by “fit_fixed_intercept” and “fit_random_intercept”, respectively)
fit_fixed_intercept (bool, default = False) – Whether to add an intercept as a fixed feature.
fit_random_intercept (bool, default = False) – Whether to add an intercept as a random feature.
must_include_fe (List[str]) – List of fixed effects for which any effect of sparsity-promoting regularizers should be disabled. NB: this does not guarantee that the feature is included in the final model.
must_include_re (List[str]) – Same for random effects.
kwargs – Currently unused; reserved for future use.

Returns:

problem (LMEProblem) – an instance of LMEProblem built on the given data.
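The constraints on columns_labels can be checked before a problem is built. The following is a minimal sketch of such a pre-flight check; check_columns_labels is a hypothetical helper, not a pysr3 function, and it reads “only one column” as “exactly one”, which is an assumption.

```python
from collections import Counter
from typing import List

ALLOWED = {"fixed", "random", "fixed+random", "group", "variance", "intercept"}


def check_columns_labels(columns_labels: List[str]) -> None:
    """Hypothetical pre-flight check; not part of pysr3."""
    unknown = set(columns_labels) - ALLOWED
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(columns_labels)
    # Assumption: exactly one "group" and one "variance" column are expected.
    for unique in ("group", "variance"):
        if counts[unique] != 1:
            raise ValueError(
                f"expected exactly one {unique!r} column, got {counts[unique]}")


columns_labels = ["fixed", "fixed+random", "group", "variance"]
check_columns_labels(columns_labels)  # passes silently

try:
    check_columns_labels(["fixed", "group", "group", "variance"])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True  # two "group" columns are rejected
```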

static generate(groups_sizes: List[int | None] | None = None, features_labels: List[str] | None = None, fit_fixed_intercept: bool = False, fit_random_intercept: bool = False, features_covariance_matrix: ndarray | None = None, obs_var: int | float | Sized | None = 0.1, beta: ndarray | None = None, gamma: ndarray | None = None, true_random_effects: ndarray | None = None, as_x_y=False, return_true_model_coefficients: bool = True, seed: int | None = None, generator_params: dict | None = None, chance_missing: float = 0.0, chance_outlier: float = 0.0, outlier_multiplier: float = 5.0, distribution='normal')

Generates a random mixed-effects problem with given parameters.

The model is:

Y_i = X_i*β + Z_i*u_i + 𝜺_i,

where

u_i ~ 𝒩(0, diag(𝛄)),

𝜺_i ~ 𝒩(0, diag(variance))
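The model above can be simulated directly for a single group. The sketch below uses only the Python standard library and is not the code that generate() runs; the function name simulate_group and the example numbers are made up for illustration. Note that 𝛄 and variance hold variances, so their square roots are passed as standard deviations.

```python
import random
from typing import List, Tuple


def simulate_group(X: List[List[float]], Z: List[List[float]],
                   beta: List[float], gamma: List[float],
                   variance: List[float],
                   rng: random.Random) -> Tuple[List[float], List[float]]:
    """Draw y for one group from Y_i = X_i*beta + Z_i*u_i + eps_i."""
    # u_i ~ N(0, diag(gamma)): one independent draw per random effect.
    u = [rng.gauss(0.0, g ** 0.5) for g in gamma]
    y = []
    for x_row, z_row, var in zip(X, Z, variance):
        fixed_part = sum(x * b for x, b in zip(x_row, beta))
        random_part = sum(z * ui for z, ui in zip(z_row, u))
        eps = rng.gauss(0.0, var ** 0.5)  # eps ~ N(0, variance)
        y.append(fixed_part + random_part + eps)
    return y, u


rng = random.Random(0)
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]  # intercept column + one feature
Z = [[1.0], [1.0], [1.0]]                  # random intercept only
y, u = simulate_group(X, Z, beta=[2.0, 1.0], gamma=[0.3],
                      variance=[0.1, 0.1, 0.1], rng=rng)
```

With 𝛄 and variance set to zero, the draw collapses to the deterministic part X_i*β, which is a convenient sanity check.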

Parameters:

groups_sizes (List, Optional) – List of group sizes. If None then generates it from U[1, 1000]^k where k ~ U[1, 10]
features_labels (List, Optional) – List of feature labels that define the role of each feature in the problem: “fixed” – fixed only, “random” – random only, “fixed+random” – both. Does NOT include the intercept (it is handled by the fit_fixed_intercept and fit_random_intercept parameters). If None then generates a random list from U[1, 4]^k where k ~ U[1, 10]
fit_fixed_intercept (bool, default is False) – If True then the model adds an intercept to the set of fixed features. The intercept should not be in the features_covariance_matrix or features_labels.
fit_random_intercept (bool, default is False) – True if the intercept is a random parameter as well. The intercept should never be in the features_covariance_matrix or features_labels.
features_covariance_matrix (np.ndarray, Optional, Symmetric and PSD) – Covariance matrix of the features from features labels (columns from the dataset to be generated). If None then defaults to the identity matrix, in which case all features are independent. Should be the size of len(features_labels).
obs_var (float or np.ndarray) –

Variances of measurement errors. Can be:
- float : all errors for all groups have the same variance.
- np.array of length equal to the number of groups : each group has its own variance of the measurement errors, and it is the same for all objects within a group.
- np.array of length equal to the number of objects in all groups cumulatively : every object has its own variance.
Raises ValueError if obs_var has any other length.
beta (np.ndarray) – True vector of fixed effects. Its length should equal the number of fixed features in features_labels plus one (intercept). If None then it is generated randomly from U[0, 1]^k where k is the number of fixed features plus intercept.
gamma (np.ndarray) – True vector of variances of random effects (the 𝛄 in the model above). Its length should equal the number of random features in features_labels, plus one if fit_random_intercept is True. If None then it is generated randomly from U[0, 1]^k where k is the number of random effects plus (maybe) intercept.
true_random_effects (np.ndarray) – True random effects. Should be of a shape=(m, k) where m is the length of gamma, k is the number of groups. If None then generated according to the model: u_i ~ 𝒩(0, diag(𝛄)).
as_x_y (bool, default is False) – If True, returns the data in the form of tuple of matrices (X, y). Otherwise returns an instance of the respective class.
return_true_model_coefficients (bool, default is True) – If True, the second return argument is a dict with true model coefficients: beta, gamma, random effects and true values of measurements errors, otherwise returns None.
seed (int, default is None) – If given, initializes the global Numpy random generator with this seed.
generator_params (dict) – Dictionary with the parameters of the problem generator, like min-max bounds for the number of groups and objects. If None then the default one is used (see at the beginning of this file).
distribution (str) – which distribution is used for generating features: “normal” or “uniform”
chance_outlier (float, from 0 to 1) – chance that a selected value in data matrix is an outlier. If so, it gets multiplied by outlier_multiplier
outlier_multiplier (float) – magnitude of the outliers
chance_missing (float, from 0 to 1) – chance that a selected value is going to be missing from the dataset, in which case it’s set to 0.
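The three accepted forms of obs_var described above can be pictured as an expansion to one variance per object. The helper expand_obs_var below is hypothetical and only mirrors the documented rules; it is not the code used inside generate().

```python
from typing import List, Sequence, Union


def expand_obs_var(obs_var: Union[int, float, Sequence[float]],
                   groups_sizes: List[int]) -> List[float]:
    """Hypothetical helper: return one variance per object."""
    n_groups, n_objects = len(groups_sizes), sum(groups_sizes)
    if isinstance(obs_var, (int, float)):
        # A scalar: the same variance for every object in every group.
        return [float(obs_var)] * n_objects
    if len(obs_var) == n_groups:
        # One variance per group, repeated for every object in that group.
        return [float(v) for v, size in zip(obs_var, groups_sizes)
                for _ in range(size)]
    if len(obs_var) == n_objects:
        # One variance per object: used as given.
        return [float(v) for v in obs_var]
    raise ValueError(
        f"obs_var has length {len(obs_var)}; expected a scalar, "
        f"{n_groups} (groups) or {n_objects} (objects)")


groups_sizes = [2, 1]
shared = expand_obs_var(0.1, groups_sizes)
per_group = expand_obs_var([0.1, 0.4], groups_sizes)
per_object = expand_obs_var([1, 2, 3], groups_sizes)
```

When every group has exactly one object the per-group and per-object forms coincide, so the order of the two length checks does not matter.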

Returns:

problem (LMEProblem) – Generated problem
true_parameters (dict, optional) –

True parameters for the generated problem:
- “beta” : true beta,
- “gamma” : true gamma,
- “per_group_coefficients” : true per group coefficients (b such that y = Xb, where X is from to_x_y())
- “active_categorical_set” : set of categorical features which were used for true latent group division
- “true_group_labels” : labels from true latent group division
- “random_effects” : true random effects
- “errors” : true errors
- “true_rmse” : loss value when true beta, gamma and random effects are used.
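The chance_missing, chance_outlier and outlier_multiplier parameters of generate() describe a per-value corruption of the data matrix. The sketch below is one possible reading of those descriptions; the function corrupt is hypothetical, and the order of the two checks and the independence of the draws are assumptions, not documented behavior.

```python
import random
from typing import List


def corrupt(matrix: List[List[float]], chance_missing: float,
            chance_outlier: float, outlier_multiplier: float,
            rng: random.Random) -> List[List[float]]:
    """Hypothetical sketch of per-value missingness and outliers."""
    out = []
    for row in matrix:
        new_row = []
        for value in row:
            if rng.random() < chance_missing:
                new_row.append(0.0)  # a missing value is set to 0
            elif rng.random() < chance_outlier:
                new_row.append(value * outlier_multiplier)  # outlier
            else:
                new_row.append(value)  # left untouched
        out.append(new_row)
    return out


data = [[1.0, 2.0], [3.0, 4.0]]
rng = random.Random(0)
clean = corrupt(data, 0.0, 0.0, 5.0, rng)        # nothing changes
all_missing = corrupt(data, 1.0, 0.0, 5.0, rng)  # every value set to 0
all_outliers = corrupt(data, 0.0, 1.0, 5.0, rng) # every value multiplied by 5
```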

to_dataframe()

to_x_y() → Tuple[ndarray, ndarray, List]

Transforms the problem to the (X, y) form.

The first row of X contains the feature labels.

Returns:

X (np.ndarray) – Features as a matrix
y (np.ndarray) – Answers as a vector

class pysr3.lme.problems.LMEStratifiedShuffleSplit(columns_labels: List[str], random_state=42, test_size=0.25, n_splits=3)

Bases: object

Class that generates shuffle splits of the dataset that are stratified by groups

Creates LMEStratifiedShuffleSplit

Parameters:

columns_labels (List[str]) –

List of column labels. There must be only one column of group labels and only one column of answer STDs.
- “fixed” : fixed effect
- “random” : random effect
- “fixed+random” : both fixed and random
- “group” : group labels
- “variance” : answers’ standard deviations
- “intercept” : intercept column (whether the intercept is fixed or random is controlled by “fit_fixed_intercept” and “fit_random_intercept”, respectively)
random_state (int) – Random seed for the generator
test_size (float, between 0 and 1) – fraction of the dataset for the test part of splits
n_splits (int) – number of splits

get_n_splits(x, y, groups)

split(x=None, y=None, groups=None)

Generates splits

Parameters:

x (ndarray, (n, p)) – data matrix
y (ndarray (n, )) – target variable

Returns:

Iterable over tuples (X_train, y_train, X_test, y_test) that are stratified by the group
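The idea of a shuffle split stratified by group can be sketched with the standard library alone: shuffle the row indices of each group separately and send roughly test_size of each group to the test part. This is an illustration of the concept, not the implementation behind LMEStratifiedShuffleSplit; the function name and the rounding rule are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_stratified_split(groups: Sequence, test_size: float = 0.25,
                           seed: int = 42) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) with ~test_size of every group in test."""
    rng = random.Random(seed)
    by_group: Dict[object, List[int]] = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)  # collect row indices per group
    train_idx: List[int] = []
    test_idx: List[int] = []
    for members in by_group.values():
        shuffled = members[:]  # copy so the index lists stay intact
        rng.shuffle(shuffled)
        n_test = round(test_size * len(shuffled))
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)


groups = ["a"] * 8 + ["b"] * 4
train_idx, test_idx = group_stratified_split(groups, test_size=0.25, seed=42)
```

Each group contributes the same fraction of its rows to the test part, so group proportions are preserved in both parts of the split.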

class pysr3.lme.problems.Problem(**kwargs)

Bases: object

Template class for various representations of datasets.

Initializes the class

Parameters:

kwargs – anything needed

from_x_y(x, y, **kwargs)

Creates Problem from matrices X and Y

Parameters:

x (ndarray, (n, p)) – data matrix
y (ndarray (n, )) – target variable
kwargs – anything needed

Returns:

Problem

to_x_y(**kwargs)

Converts its internal representation into the (X, y) dataset

Parameters:

kwargs – anything needed

Returns:

Matrices X (n, p) and y (n, )
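The template contract amounts to a round trip between an (X, y) pair and an internal representation. The class below is a hypothetical stand-in that mimics the two methods with plain lists; it does not inherit from pysr3’s Problem and is only meant to show the shape of the interface.

```python
from typing import List, Tuple


class TableProblem:
    """Hypothetical stand-in for a Problem subclass; not part of pysr3."""

    def __init__(self, rows: List[List[float]], answers: List[float]):
        self.rows = rows
        self.answers = answers

    @classmethod
    def from_x_y(cls, x, y, **kwargs) -> "TableProblem":
        # x has n rows of p features, y has n answers.
        if len(x) != len(y):
            raise ValueError("x and y must have the same number of rows")
        return cls([list(row) for row in x], list(y))

    def to_x_y(self, **kwargs) -> Tuple[List[List[float]], List[float]]:
        # Hand back the stored data in the same (X, y) form.
        return self.rows, self.answers


x = [[1.0, 2.0], [3.0, 4.0]]
y = [0.5, 1.5]
x_back, y_back = TableProblem.from_x_y(x, y).to_x_y()
```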

pysr3.lme.problems.get_per_group_coefficients(beta, random_effects, labels)

Derives per group coefficients from the vectors of fixed and per-cluster random effects.

Parameters:

beta (ndarray, shape=(n,), n is the number of fixed effects.) – Vector of fixed effects.
random_effects (ndarray or list, shape=(m, k), m groups, k random effects.) – Array of random effects.
labels (List[str]) –

List of column labels. There must be only one column of group labels and only one column of answer STDs.
- “fixed” : fixed effect
- “random” : random effect
- “fixed+random” : both fixed and random
- “group” : group labels
- “variance” : answers’ standard deviations
- “intercept” : intercept column (whether the intercept is fixed or random is controlled by “fit_fixed_intercept” and “fit_random_intercept”, respectively)

Returns:

per_group_coefficients (ndarray, shape=(m, t)) – Array of cluster coefficients: m clusters times t coefficients.
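One way to picture per-group coefficients is to walk over the feature labels and add, for each group, the fixed effect and the group’s random effect wherever a feature carries them. The sketch below follows that reading with plain lists; it is an assumption about how the quantities combine, it ignores “intercept” labels, and it is not the code of get_per_group_coefficients().

```python
from typing import List


def per_group_coefficients(beta: List[float],
                           random_effects: List[List[float]],
                           labels: List[str]) -> List[List[float]]:
    """Illustrative only: one row per group, one column per feature."""
    result = []
    for u in random_effects:  # u: random effects of one group
        row, fe_idx, re_idx = [], 0, 0
        for label in labels:
            if label not in ("fixed", "random", "fixed+random"):
                continue  # "group" and "variance" columns are not features
            coef = 0.0
            if "fixed" in label:
                coef += beta[fe_idx]  # shared across all groups
                fe_idx += 1
            if "random" in label:
                coef += u[re_idx]     # group-specific deviation
                re_idx += 1
            row.append(coef)
        result.append(row)
    return result


labels = ["fixed", "fixed+random", "random", "group", "variance"]
beta = [1.0, 2.0]                         # two fixed effects
random_effects = [[0.5, -1.0], [0.0, 3.0]]  # two groups, two random effects
coefficients = per_group_coefficients(beta, random_effects, labels)
print(coefficients)
```

In this reading a “fixed+random” feature gets the sum of its fixed effect and the group’s random effect, a “fixed” feature gets the same coefficient in every group, and a “random” feature gets only the group-specific part.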

pysr3.lme.problems.random_effects_to_matrix(random_effects)

Stacks a list of tuples (group: random effects) into an array

Parameters:

random_effects (List[Tuple[Any, ndarray]]) – List of random effects in the format [(group1, effects1), (group2, effects2), …]

Returns:

ndarray of random effects stacked vertically
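The stacking can be illustrated without NumPy: drop the group keys and keep the effect vectors as rows, in the order given. The function stack_random_effects is a hypothetical stand-in that returns a list of lists; with NumPy the same step would typically be a single np.vstack call over the effect vectors.

```python
from typing import Any, List, Sequence, Tuple


def stack_random_effects(
        random_effects: List[Tuple[Any, Sequence[float]]]) -> List[List[float]]:
    """Stdlib stand-in: drop group keys and stack effect vectors as rows."""
    rows = [list(effects) for _, effects in random_effects]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all groups must have the same number of random effects")
    return rows


effects = [("group1", [0.1, 0.2]), ("group2", [0.3, 0.4])]
matrix = stack_random_effects(effects)
print(matrix)
```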