Utilities API Reference
This document provides detailed API documentation for utility classes and functions in Torch-RecHub.
Data Processing Tools (data.py)
Dataset Classes
TorchDataset
- Introduction: Basic implementation of PyTorch dataset for handling features and labels.
- Parameters:
  - `x` (dict): Feature dictionary; keys are feature names, values are feature data
  - `y` (array): Label data
PredictDataset
- Introduction: Dataset class for prediction phase, containing only feature data.
- Parameters:
  - `x` (dict): Feature dictionary; keys are feature names, values are feature data
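Below is a minimal usage sketch for the two dataset classes. The import path `torch_rechub.utils.data` and the toy feature names are assumptions for illustration; only the constructor arguments come from the descriptions above.

```python
import numpy as np
from torch_rechub.utils.data import TorchDataset, PredictDataset  # assumed import path

# Feature dict: one array per feature name, aligned row by row.
x = {
    "user_id": np.array([1, 2, 3]),
    "item_id": np.array([10, 20, 30]),
}
y = np.array([1, 0, 1])  # labels, e.g. clicks

train_set = TorchDataset(x, y)   # features + labels, for training/evaluation
pred_set = PredictDataset(x)     # features only, for the prediction phase
print(len(train_set), train_set[0])
```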
MatchDataGenerator
- Introduction: Data generator for recall tasks, used to generate training and testing data loaders.
- Main Methods:
  - `generate_dataloader(x_test_user, x_all_item, batch_size, num_workers=8)`: Generate training, testing, and item data loaders
- Parameters:
  - `x_test_user` (dict): Test user features
  - `x_all_item` (dict): All item features
  - `batch_size` (int): Batch size
  - `num_workers` (int): Number of worker processes for data loading
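A hedged sketch of how `MatchDataGenerator` is typically driven. The constructor arguments `x` and `y`, the import path, and the toy features are assumptions; only `generate_dataloader` is documented above.

```python
import numpy as np
from torch_rechub.utils.data import MatchDataGenerator  # assumed import path

x_train = {"user_id": np.array([1, 2, 3]), "item_id": np.array([10, 20, 30])}
y_train = np.array([1, 1, 1])                 # positives for the recall task
x_test_user = {"user_id": np.array([1, 2])}   # test user features
x_all_item = {"item_id": np.arange(1, 101)}   # features of every candidate item

dg = MatchDataGenerator(x=x_train, y=y_train)  # constructor args are an assumption
train_dl, test_dl, item_dl = dg.generate_dataloader(
    x_test_user, x_all_item, batch_size=256, num_workers=0)
```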
DataGenerator
- Introduction: General data generator supporting dataset splitting and loading.
- Main Methods:
  - `generate_dataloader(x_val=None, y_val=None, x_test=None, y_test=None, split_ratio=None, batch_size=16, num_workers=0)`: Generate training, validation, and test data loaders
- Parameters:
  - `x_val`, `y_val`: Validation set features and labels
  - `x_test`, `y_test`: Test set features and labels
  - `split_ratio` (list): Split ratios for the train, validation, and test sets
  - `batch_size` (int): Batch size
  - `num_workers` (int): Number of worker processes for data loading
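A usage sketch, assuming the import path `torch_rechub.utils.data` and a constructor that takes the full feature dict and labels; the three-way `split_ratio` follows the parameter description above.

```python
import numpy as np
from torch_rechub.utils.data import DataGenerator  # assumed import path

x = {"user_id": np.arange(1000), "item_id": np.arange(1000)}
y = np.random.randint(0, 2, size=1000)

dg = DataGenerator(x, y)  # constructor args (x, y) are an assumption
# Either pass x_val/y_val and x_test/y_test explicitly, or let split_ratio
# carve the held-out sets out of the data.
train_dl, val_dl, test_dl = dg.generate_dataloader(
    split_ratio=[0.7, 0.1, 0.2], batch_size=16, num_workers=0)
```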
Utility Functions
get_auto_embedding_dim
- Introduction: Automatically calculate embedding vector dimension based on number of categories.
- Parameters:
  - `num_classes` (int): Number of categories
- Returns:
  - int: Embedding vector dimension, computed as `floor(6 * num_classes^(1/4))`
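The rule of thumb can be checked by hand; the sketch below assumes the function is importable from `torch_rechub.utils.data`.

```python
from torch_rechub.utils.data import get_auto_embedding_dim  # assumed import path

# By the formula above: 10000 categories -> floor(6 * 10000**0.25) = floor(6 * 10) = 60
print(get_auto_embedding_dim(10000))
```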
get_loss_func
- Introduction: Get loss function.
- Parameters:
  - `task_type` (str): Task type, "classification" or "regression"
- Returns:
  - torch.nn.Module: Corresponding loss function
get_metric_func
- Introduction: Get evaluation metric function.
- Parameters:
  - `task_type` (str): Task type, "classification" or "regression"
- Returns:
  - function: Corresponding evaluation metric function
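A hedged sketch covering both helpers, assuming they are importable from `torch_rechub.utils.data`; the concrete objects returned (e.g. a BCE-style loss and an AUC-style metric for classification) and the metric's argument order are assumptions.

```python
import torch
from torch_rechub.utils.data import get_loss_func, get_metric_func  # assumed import path

loss_fn = get_loss_func(task_type="classification")
metric_fn = get_metric_func(task_type="classification")

y_true = torch.tensor([1.0, 0.0, 1.0])
y_pred = torch.tensor([0.8, 0.3, 0.6])
print(loss_fn(y_pred, y_true))    # loss modules follow the (input, target) convention
print(metric_fn(y_true, y_pred))  # argument order (y_true, y_pred) is an assumption
```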
generate_seq_feature
- Introduction: Generate sequence features and negative samples.
- Parameters:
  - `data` (pd.DataFrame): Raw data
  - `user_col` (str): User ID column name
  - `item_col` (str): Item ID column name
  - `time_col` (str): Timestamp column name
  - `item_attribute_cols` (list): Item attribute columns for sequence feature generation
  - `min_item` (int): Minimum number of items per user
  - `shuffle` (bool): Whether to shuffle the data
  - `max_len` (int): Maximum sequence length
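A toy call sketch, assuming the import path `torch_rechub.utils.data`; the column names, the chosen argument values, and the shape of the return value are illustrative assumptions.

```python
import pandas as pd
from torch_rechub.utils.data import generate_seq_feature  # assumed import path

# Minimal interaction log: one row per (user, item, timestamp).
data = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 2],
    "item_id": [10, 11, 12, 20, 21, 22],
    "cate_id": [1, 1, 2, 2, 3, 3],
    "time":    [1, 2, 3, 1, 2, 3],
})

seq_data = generate_seq_feature(
    data,
    user_col="user_id",
    item_col="item_id",
    time_col="time",
    item_attribute_cols=["cate_id"],  # also build a history sequence of cate_id
    min_item=0,
    shuffle=True,
    max_len=20,
)
```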
Recall Tools (match.py)
Data Processing Functions
gen_model_input
- Introduction: Merge user and item features, process sequence features.
- Parameters:
  - `df` (pd.DataFrame): Data with history sequence features
  - `user_profile` (pd.DataFrame): User feature data
  - `user_col` (str): User column name
  - `item_profile` (pd.DataFrame): Item feature data
  - `item_col` (str): Item column name
  - `seq_max_len` (int): Maximum sequence length
  - `padding` (str): Padding method, 'pre' or 'post'
  - `truncating` (str): Truncating method, 'pre' or 'post'
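A hedged sketch, assuming the import path `torch_rechub.utils.match` and positional arguments in the order listed above; the toy frames and the `hist_item_id` column name are illustrative assumptions.

```python
import pandas as pd
from torch_rechub.utils.match import gen_model_input  # assumed import path

df = pd.DataFrame({
    "user_id": [1, 2],
    "item_id": [10, 20],
    "hist_item_id": [[10, 11], [20]],   # history sequence feature (column name assumed)
})
user_profile = pd.DataFrame({"user_id": [1, 2], "age": [23, 35]})
item_profile = pd.DataFrame({"item_id": [10, 20], "cate_id": [1, 2]})

# Merges user/item side features onto df and pads/truncates sequence columns.
x = gen_model_input(df, user_profile, "user_id", item_profile, "item_id",
                    seq_max_len=20, padding="post", truncating="post")
```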
negative_sample
- Introduction: Negative sampling method for recall models.
- Parameters:
  - `items_cnt_order` (dict): Item count dictionary, sorted by count in descending order
  - `ratio` (int): Negative sample ratio
  - `method_id` (int): Sampling method ID
    - 0: Random sampling
    - 1: Word2Vec-style popularity sampling
    - 2: Log-popularity sampling
    - 3: Tencent RALM sampling
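A call sketch, assuming the import path `torch_rechub.utils.match`; the item keys are placeholders.

```python
from torch_rechub.utils.match import negative_sample  # assumed import path

# Item counts, sorted by count in descending order as required above.
items_cnt_order = {"item_3": 120, "item_7": 80, "item_1": 15}

# Draw negatives at a 3:1 ratio with Word2Vec-style popularity sampling (method_id=1).
neg_items = negative_sample(items_cnt_order, ratio=3, method_id=1)
print(neg_items)
```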
Vector Retrieval Classes
Annoy
- Introduction: Vector recall tool based on Annoy.
- Parameters:
  - `metric` (str): Distance metric
  - `n_trees` (int): Number of trees
  - `search_k` (int): Search parameter
- Main Methods:
  - `fit(X)`: Build the index
  - `query(v, n)`: Query the n nearest neighbors of v
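A retrieval sketch, assuming the import path `torch_rechub.utils.match`; the metric name, `search_k=-1`, and the exact return layout of `query` are assumptions.

```python
import numpy as np
from torch_rechub.utils.match import Annoy  # assumed import path

item_embeddings = np.random.rand(1000, 64).astype("float32")
user_embedding = np.random.rand(64).astype("float32")

ann = Annoy(metric="angular", n_trees=10, search_k=-1)
ann.fit(item_embeddings)                # build the index over all item vectors
result = ann.query(user_embedding, 10)  # ids (and possibly distances) of the top-10 items
```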
Milvus
- Introduction: Vector recall tool based on Milvus.
- Parameters:
  - `dim` (int): Vector dimension
  - `host` (str): Milvus server address
  - `port` (str): Milvus server port
- Main Methods:
  - `fit(X)`: Build the index
  - `query(v, n)`: Query the n nearest neighbors of v
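The Milvus wrapper follows the same fit/query pattern but needs a reachable Milvus server; the host, port, and import path below are placeholder assumptions.

```python
import numpy as np
from torch_rechub.utils.match import Milvus  # assumed import path

item_embeddings = np.random.rand(1000, 64).astype("float32")
user_embedding = np.random.rand(64).astype("float32")

milvus = Milvus(dim=64, host="127.0.0.1", port="19530")  # placeholder server address
milvus.fit(item_embeddings)
result = milvus.query(user_embedding, 10)
```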
Multi-task Learning Tools (mtl.py)
Utility Functions
shared_task_layers
- Introduction: Get shared layer and task-specific layer parameters in multi-task models.
- Parameters:
  - `model` (torch.nn.Module): Multi-task model; MMOE, SharedBottom, PLE, and AITM are supported
- Returns:
  - list: Shared-layer parameters
  - list: Task-specific-layer parameters
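A sketch of the typical use: split the parameters so that shared and task-specific layers can get separate optimizers. The import path and the optimizer choice are assumptions.

```python
import torch
from torch_rechub.utils.mtl import shared_task_layers  # assumed import path

def build_optimizers(model: torch.nn.Module, lr: float = 1e-3):
    """Give the shared and task-specific layers of a supported multi-task model
    (MMOE / SharedBottom / PLE / AITM) their own optimizers."""
    share_params, task_params = shared_task_layers(model)
    share_opt = torch.optim.Adam(share_params, lr=lr)
    task_opt = torch.optim.Adam(task_params, lr=lr)
    return share_opt, task_opt
```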
Optimizer Classes
MetaBalance
- Introduction: MetaBalance optimizer for balancing gradients in multi-task learning.
- Parameters:
  - `parameters` (list): Model parameters
  - `relax_factor` (float): Relaxation factor for gradient scaling, default 0.7
  - `beta` (float): Moving average coefficient, default 0.9
- Main Methods:
  - `step(losses)`: Execute an optimization step and update the parameters
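A hedged training-step sketch pairing MetaBalance (shared layers) with a plain optimizer (task-specific layers); the exact ordering used by torch-rechub's own trainer may differ, and the import paths are assumptions.

```python
import torch
from torch_rechub.utils.mtl import MetaBalance, shared_task_layers  # assumed imports

def build_mtl_optimizers(model):
    """Create the optimizers once: MetaBalance for the shared layers,
    a plain Adam for the task-specific layers."""
    share_params, task_params = shared_task_layers(model)
    return (MetaBalance(share_params, relax_factor=0.7, beta=0.9),
            torch.optim.Adam(task_params, lr=1e-3))

def metabalance_step(share_opt, task_opt, losses):
    """One step; losses is the list of per-task loss tensors from a forward pass."""
    share_opt.zero_grad()
    task_opt.zero_grad()
    share_opt.step(losses)   # balance the task gradients on the shared layers
    sum(losses).backward()   # regular gradients for the task-specific layers
    task_opt.step()
```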
Gradient Processing Functions
gradnorm
- Introduction: Implement GradNorm algorithm for dynamically adjusting task weights in multi-task learning.
- Parameters:
  - `loss_list` (list): List of task losses
  - `loss_weight` (list): List of task weights
  - `share_layer` (torch.nn.Parameter): Shared-layer parameters
  - `initial_task_loss` (list): List of initial task losses
  - `alpha` (float): GradNorm hyperparameter
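A toy call sketch with one shared layer and two task heads, assuming the import path `torch_rechub.utils.mtl`; the exact types expected for `loss_weight` and `initial_task_loss` (tensors vs. floats) are assumptions.

```python
import torch
from torch_rechub.utils.mtl import gradnorm  # assumed import path

# Toy multi-task setup: one shared linear layer feeding two task heads.
shared = torch.nn.Linear(8, 4)
heads = [torch.nn.Linear(4, 1) for _ in range(2)]
x, y = torch.randn(32, 8), torch.randn(32, 1)

h = shared(x)
loss_list = [torch.nn.functional.mse_loss(head(h), y) for head in heads]
loss_weight = [torch.tensor(1.0, requires_grad=True) for _ in loss_list]  # trainable task weights
initial_task_loss = [l.item() for l in loss_list]  # losses recorded at the first training step

# alpha controls how aggressively slower-training tasks are up-weighted.
gradnorm(loss_list, loss_weight, shared.weight, initial_task_loss, alpha=0.16)
```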