Utilities API Reference
This document provides detailed API documentation for utility classes and functions in Torch-RecHub.
Data Processing Tools (data.py)
Dataset Classes
TorchDataset
- Introduction: Basic implementation of PyTorch dataset for handling features and labels.
- Parameters:
  - x(dict): Feature dictionary; keys are feature names, values are feature data
  - y(array): Label data
PredictDataset
- Introduction: Dataset class for prediction phase, containing only feature data.
- Parameters:
x(dict): Feature dictionary, keys are feature names, values are feature data
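To make the two dataset interfaces concrete, here is a minimal pure-Python sketch of how such classes typically wrap a feature dictionary (the real classes subclass `torch.utils.data.Dataset` and return tensors; the `*Sketch` names are illustrative, not the library's):

```python
class TorchDatasetSketch:
    """Stand-in for TorchDataset: wraps a feature dict plus labels."""

    def __init__(self, x, y):
        self.x = x  # dict: feature name -> sequence of per-sample values
        self.y = y  # sequence of labels, aligned with the feature values

    def __getitem__(self, index):
        # One sample: a per-feature dict plus its label
        return {name: values[index] for name, values in self.x.items()}, self.y[index]

    def __len__(self):
        return len(self.y)


class PredictDatasetSketch:
    """Stand-in for PredictDataset: features only, no labels."""

    def __init__(self, x):
        self.x = x

    def __getitem__(self, index):
        return {name: values[index] for name, values in self.x.items()}

    def __len__(self):
        # All feature columns have the same length; measure any one of them
        return len(next(iter(self.x.values())))
```

The key design point both share: features stay in a dict keyed by feature name, so a model can look up each input column by name at batch time.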
MatchDataGenerator
- Introduction: Data generator for recall tasks, used to generate training and testing data loaders.
- Main Methods:
generate_dataloader(x_test_user, x_all_item, batch_size, num_workers=8): Generate training, testing, and item data loaders
- Parameters:
  - x_test_user(dict): Test user features
  - x_all_item(dict): All item features
  - batch_size(int): Batch size
  - num_workers(int): Number of worker processes for data loading
DataGenerator
- Introduction: General data generator supporting dataset splitting and loading.
- Main Methods:
generate_dataloader(x_val=None, y_val=None, x_test=None, y_test=None, split_ratio=None, batch_size=16, num_workers=0): Generate training, validation, and test data loaders
- Parameters:
  - x_val, y_val: Validation set features and labels
  - x_test, y_test: Test set features and labels
  - split_ratio(list): Split ratios for train, validation, and test sets
  - batch_size(int): Batch size
  - num_workers(int): Number of worker processes for data loading
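The `split_ratio` behavior can be sketched as follows. This is a simplified stand-in for the index-splitting step only (the real `generate_dataloader` also wraps each split in a PyTorch `DataLoader`; the function name here is illustrative):

```python
def split_by_ratio(n_samples, split_ratio):
    """Split sample indices into train/val/test by a ratio list like [0.7, 0.1, 0.2].

    Any remainder from integer truncation falls into the test split.
    """
    train_end = int(n_samples * split_ratio[0])
    val_end = train_end + int(n_samples * split_ratio[1])
    train_idx = list(range(0, train_end))
    val_idx = list(range(train_end, val_end))
    test_idx = list(range(val_end, n_samples))
    return train_idx, val_idx, test_idx
```

With 10 samples and `[0.7, 0.1, 0.2]`, this yields splits of sizes 7, 1, and 2.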
Utility Functions
get_auto_embedding_dim
- Introduction: Automatically calculate embedding vector dimension based on number of categories.
- Parameters:
  - num_classes(int): Number of categories
- Returns:
  - int: Embedding vector dimension, computed as floor(6 * num_classes^(1/4))
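A direct sketch of the documented formula (the library's exact rounding may differ slightly; the `_sketch` name is illustrative):

```python
import math


def get_auto_embedding_dim_sketch(num_classes):
    """Heuristic embedding size: floor(6 * num_classes ** (1/4))."""
    return int(math.floor(6 * num_classes ** 0.25))
```

For example, a categorical feature with 10,000 distinct values gets a 60-dimensional embedding, while one with 16 values gets only 12; the fourth root keeps dimensions from growing linearly with vocabulary size.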
get_loss_func
- Introduction: Get loss function.
- Parameters:
  - task_type(str): Task type, "classification" or "regression"
- Returns:
  - torch.nn.Module: Corresponding loss function
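The dispatch logic can be sketched as below. The library returns torch loss modules (binary cross-entropy for classification, mean squared error for regression); here pure-Python equivalents stand in for them so the mapping is self-contained, and the `_sketch` suffix marks the names as illustrative:

```python
import math


def bce_loss(y_true, y_pred):
    """Binary cross-entropy averaged over samples (stand-in for torch.nn.BCELoss)."""
    eps = 1e-12  # guard against log(0)
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(y_true, y_pred)) / len(y_true)


def mse_loss(y_true, y_pred):
    """Mean squared error (stand-in for torch.nn.MSELoss)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def get_loss_func_sketch(task_type="classification"):
    """Map a task_type string to its loss function."""
    if task_type == "classification":
        return bce_loss
    if task_type == "regression":
        return mse_loss
    raise ValueError(f"task_type must be 'classification' or 'regression', got {task_type!r}")
```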
get_metric_func
- Introduction: Get evaluation metric function.
- Parameters:
  - task_type(str): Task type, "classification" or "regression"
- Returns:
  - function: Corresponding evaluation metric function
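A sketch of the same dispatch pattern for metrics, assuming the conventional pairing of AUC for classification and MSE for regression (the library likely delegates to standard implementations such as scikit-learn's; the pure-Python stand-ins and `_sketch` name here are illustrative):

```python
def auc_score(y_true, y_pred):
    """AUC via the Mann-Whitney rank formulation: the fraction of
    positive/negative pairs the model orders correctly (ties count 0.5)."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mse_metric(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def get_metric_func_sketch(task_type="classification"):
    """Map a task_type string to its evaluation metric."""
    return auc_score if task_type == "classification" else mse_metric
```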
generate_seq_feature
- Introduction: Generate sequence features and negative samples.
- Parameters:
  - data(pd.DataFrame): Raw data
  - user_col(str): User ID column name
  - item_col(str): Item ID column name
  - time_col(str): Timestamp column name
  - item_attribute_cols(list): Item attribute columns for sequence feature generation
  - min_item(int): Minimum number of items per user
  - shuffle(bool): Whether to shuffle data
  - max_len(int): Maximum sequence length
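The core of sequence-feature generation is a per-user sliding window over time-ordered interactions. Here is a stdlib-only sketch of that step, operating on (user, item, timestamp) tuples instead of a DataFrame; the real `generate_seq_feature` additionally pads sequences, draws negative samples, and handles `item_attribute_cols`:

```python
def build_seq_samples(records, min_item=2, max_len=5):
    """Emit (user, history, target) samples from (user, item, timestamp) records.

    Each user's interactions are sorted by time; every item after the first
    becomes a prediction target with up to max_len preceding items as history.
    Users with fewer than min_item interactions are dropped.
    """
    by_user = {}
    for user, item, ts in records:
        by_user.setdefault(user, []).append((ts, item))

    samples = []
    for user, events in by_user.items():
        items = [item for _, item in sorted(events)]  # chronological order
        if len(items) < min_item:
            continue  # too little history to form a training sample
        for i in range(1, len(items)):
            history = items[max(0, i - max_len):i]  # sliding window
            samples.append((user, history, items[i]))
    return samples
```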
Recall Tools (match.py)
Data Processing Functions
gen_model_input
- Introduction: Merge user and item features, process sequence features.
- Parameters:
  - df(pd.DataFrame): Data with history sequence features
  - user_profile(pd.DataFrame): User feature data
  - user_col(str): User column name
  - item_profile(pd.DataFrame): Item feature data
  - item_col(str): Item column name
  - seq_max_len(int): Maximum sequence length
  - padding(str): Padding method, 'pre' or 'post'
  - truncating(str): Truncating method, 'pre' or 'post'
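The `padding`/`truncating` semantics follow the usual Keras-style convention: 'pre' acts on the front of the sequence, 'post' on the back. A minimal sketch of that behavior (the function name and `pad_value=0` default are assumptions for illustration):

```python
def pad_sequence(seq, seq_max_len, padding="post", truncating="post", pad_value=0):
    """Cut or pad one sequence to exactly seq_max_len elements."""
    if len(seq) > seq_max_len:
        # 'pre' keeps the most recent items; 'post' keeps the earliest
        seq = seq[-seq_max_len:] if truncating == "pre" else seq[:seq_max_len]
    pad = [pad_value] * (seq_max_len - len(seq))
    return pad + list(seq) if padding == "pre" else list(seq) + pad
```

For recall models, `truncating='pre'` is the common choice because it preserves the most recent part of a user's history.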
negative_sample
- Introduction: Negative sampling method for recall models.
- Parameters:
  - items_cnt_order(dict): Item count dictionary, sorted by count in descending order
  - ratio(int): Negative sample ratio
  - method_id(int): Sampling method ID
    - 0: Random sampling
    - 1: Word2Vec-style popularity sampling
    - 2: Log popularity sampling
    - 3: Tencent RALM sampling
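The four strategies differ only in how sampling weights are derived from item counts. The sketch below shows plausible weightings for each method id; the exact exponents and formulas in the library may differ (the word2vec 0.75 power and the RALM rank-based weight are standard forms from the respective papers, and the `_sketch` name is illustrative):

```python
import math
import random


def negative_sample_sketch(items_cnt_order, ratio, method_id=0, seed=0):
    """Draw `ratio` negative items from items_cnt_order (item -> count,
    assumed sorted by count descending, as documented above)."""
    rng = random.Random(seed)
    items = list(items_cnt_order)

    if method_id == 0:
        # 0: uniform random sampling
        weights = [1.0] * len(items)
    elif method_id == 1:
        # 1: word2vec-style popularity sampling, weight = count ** 0.75
        weights = [items_cnt_order[i] ** 0.75 for i in items]
    elif method_id == 2:
        # 2: log popularity sampling, weight = log(count + 1)
        weights = [math.log(items_cnt_order[i] + 1) for i in items]
    elif method_id == 3:
        # 3: RALM-style rank-based weight = log(k + 2) - log(k + 1),
        # where k is the item's popularity rank (0 = most popular)
        weights = [math.log(k + 2) - math.log(k + 1) for k in range(len(items))]
    else:
        raise ValueError(f"unknown method_id {method_id}")

    return rng.choices(items, weights=weights, k=ratio)
```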
Vector Retrieval Classes
Annoy
- Introduction: Vector recall tool based on Annoy.
- Parameters:
  - metric(str): Distance metric method
  - n_trees(int): Number of trees
  - search_k(int): Search parameter
- Main Methods:
  - fit(X): Build index
  - query(v, n): Query nearest neighbors
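Both retrieval wrappers expose the same fit/query contract: `fit` ingests the item embedding matrix, `query` returns the ids of the nearest stored vectors. A brute-force stand-in shows that contract without the Annoy or Milvus dependency (exact L2 search in place of approximate tree search; the class name is illustrative):

```python
class BruteForceIndex:
    """Exact-search stand-in for the Annoy/Milvus fit/query interface."""

    def __init__(self, metric="euclidean"):
        self.metric = metric
        self.vectors = []

    def fit(self, X):
        """Store the item vectors ('build the index')."""
        self.vectors = [list(v) for v in X]

    def query(self, v, n):
        """Return (indices, distances) of the n nearest stored vectors."""
        dists = [(sum((a - b) ** 2 for a, b in zip(v, row)) ** 0.5, i)
                 for i, row in enumerate(self.vectors)]
        dists.sort()
        return [i for _, i in dists[:n]], [d for d, _ in dists[:n]]
```

The real wrappers trade this exactness for speed: Annoy's `n_trees`/`search_k` and Milvus's index parameters control the accuracy/latency balance of approximate search.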
Milvus
- Introduction: Vector recall tool based on Milvus.
- Parameters:
  - dim(int): Vector dimension
  - host(str): Milvus server address
  - port(str): Milvus server port
- Main Methods:
  - fit(X): Build index
  - query(v, n): Query nearest neighbors
Multi-task Learning Tools (mtl.py)
Utility Functions
shared_task_layers
- Introduction: Get shared layer and task-specific layer parameters in multi-task models.
- Parameters:
  - model(torch.nn.Module): Multi-task model; supports MMOE, SharedBottom, PLE, and AITM
- Returns:
  - list: Shared layer parameter list
  - list: Task-specific layer parameter list
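Conceptually the function partitions a model's parameters into "shared across tasks" and "task-specific". A sketch of one way to express that split, partitioning by parameter-name prefix; the real implementation inspects the concrete model classes (MMOE, SharedBottom, PLE, AITM) rather than names, and the prefixes below are assumptions for illustration:

```python
def split_params_by_name(named_params, shared_prefixes=("embedding", "bottom")):
    """Partition (name, parameter) pairs into shared vs. task-specific lists."""
    shared, task = [], []
    for name, param in named_params:
        # Parameters under a shared module prefix are updated by all tasks;
        # everything else belongs to a single task's tower.
        (shared if name.startswith(shared_prefixes) else task).append(param)
    return shared, task
```

The split matters because optimizers like MetaBalance (below) apply gradient balancing only to the shared parameters.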
Optimizer Classes
MetaBalance
- Introduction: MetaBalance optimizer for balancing gradients in multi-task learning.
- Parameters:
  - parameters(list): Model parameters
  - relax_factor(float): Relaxation factor for gradient scaling, default 0.7
  - beta(float): Moving average coefficient, default 0.9
- Main Methods:
  - step(losses): Execute optimization step and update parameters
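The heart of MetaBalance is rescaling each auxiliary task's gradient toward the magnitude of the target task's gradient, using moving averages of gradient norms, with `relax_factor` interpolating between the rescaled and raw gradient. A scalar sketch of one such update, on plain floats rather than tensors (a simplification of the published algorithm, not the library's code; the function name is illustrative):

```python
def metabalance_scale(grads, moving_norms, relax_factor=0.7, beta=0.9):
    """One MetaBalance-style update for a single shared parameter.

    grads: per-task gradients (task 0 is the target task).
    moving_norms: running averages of each task's gradient magnitude.
    Returns (scaled gradients, updated moving norms).
    """
    # Update the moving average of each task's gradient magnitude
    new_norms = [beta * m + (1 - beta) * abs(g) for m, g in zip(moving_norms, grads)]
    target = new_norms[0]  # magnitude of the target task's gradient

    scaled = [grads[0]]  # the target task's gradient is left untouched
    for g, m in zip(grads[1:], new_norms[1:]):
        if m == 0:
            scaled.append(g)
            continue
        balanced = g * (target / m)  # match the target task's magnitude
        # relax_factor = 1 fully balances; 0 leaves the raw gradient
        scaled.append(relax_factor * balanced + (1 - relax_factor) * g)
    return scaled, new_norms
```

With `relax_factor=0.7`, an auxiliary gradient ten times larger than the target's is pulled most of the way down toward the target magnitude, preventing any one task from dominating the shared layers.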
Gradient Processing Functions
gradnorm
- Introduction: Implements the GradNorm algorithm, which dynamically adjusts per-task loss weights in multi-task learning.
- Parameters:
  - loss_list(list): List of task losses
  - loss_weight(list): List of task weights
  - share_layer(torch.nn.Parameter): Shared layer parameters
  - initial_task_loss(list): List of initial task losses
  - alpha(float): GradNorm algorithm hyperparameter
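GradNorm's core computation: from current and initial losses it derives each task's relative inverse training rate, then sets a target gradient norm of mean(G) * r_i^alpha that the learnable task weights are trained toward. A scalar sketch of that target computation, taking precomputed gradient norms instead of calling autograd (the real function differentiates w_i * L_i with respect to `share_layer`; the function name and `alpha` default are assumptions):

```python
def gradnorm_targets(loss_list, initial_task_loss, grad_norms, alpha=0.12):
    """Compute GradNorm's per-task target gradient norms and the balancing loss.

    grad_norms stands in for ||grad of (w_i * L_i) w.r.t. the shared layer||.
    """
    # Relative inverse training rate: tasks that have improved less get r_i > 1
    ratios = [l / l0 for l, l0 in zip(loss_list, initial_task_loss)]
    mean_ratio = sum(ratios) / len(ratios)
    inv_rates = [r / mean_ratio for r in ratios]

    # Target each task's gradient norm at mean(G) * r_i ** alpha;
    # alpha controls how strongly slow tasks are boosted
    mean_norm = sum(grad_norms) / len(grad_norms)
    targets = [mean_norm * r ** alpha for r in inv_rates]

    # The task weights are updated to minimize this L1 gap
    gradnorm_loss = sum(abs(g - t) for g, t in zip(grad_norms, targets))
    return targets, gradnorm_loss
```

When all tasks train at the same rate, every target equals the mean gradient norm and the balancing loss is zero, so the weights stop moving.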