
[QST] Custom dataset with interaction only #741

Closed
dcy0577 opened this issue Aug 22, 2023 · 4 comments

Comments

@dcy0577

dcy0577 commented Aug 22, 2023

❓ Questions & Help

Hi, my dataset contains only the user-item interactions and timestamps. I noticed that the data used in the session-based examples all contain additional information as features, such as category. Can I use the same logic as in the example code to preprocess my data, without adding any additional feature columns? Can the model accept such a data format?

user_id:token item_id:token timestamp:float
0 0 1681314649
0 0 1681314664
0 0 1681314674
0 0 1681314688
0 1 1681322022
0 1 1681322023
0 1 1681322024
0 1 1681322026
0 1 1681322027
0 1 1681322029
0 1 1681322030
0 1 1681322032
0 1 1681322033
0 1 1681322034
...

@rnyak
Contributor

rnyak commented Aug 28, 2023

@dcy0577 You don't need extra features. You can group your data by user_id, create sequential features, and use the item_id-list column as the only input to the model.

Another option is to create some temporal features, since you already have timestamp data. We showcase some ways of creating temporal features in the examples, but those are just examples; you can be creative and create your own.
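Something like this, as a rough sketch (the file name and the temporal feature are illustrative, and the schema-tagging steps from the example notebooks are omitted):

import numpy as np
import nvtabular as nvt

SESSIONS_MAX_LENGTH = 20

# Encode item ids as contiguous integers
item_id = ["item_id"] >> nvt.ops.Categorify()

# Optional: one example of a temporal feature derived from the raw
# unix-seconds timestamp (a rough day-of-week bucket, cyclically encoded)
weekday_sin = (
    ["timestamp"]
    >> nvt.ops.LambdaOp(lambda col: np.sin(2 * np.pi * ((col // 86400) % 7) / 7))
    >> nvt.ops.Rename(name="weekday_sin")
)

# Group interactions per user, sorted by timestamp, producing
# item_id-list (the only required model input) and item_id-count
groupby_features = (
    item_id + weekday_sin + ["user_id", "timestamp"]
    >> nvt.ops.Groupby(
        groupby_cols=["user_id"],
        sort_cols=["timestamp"],
        aggs={"item_id": ["list", "count"], "weekday_sin": ["list"]},
        name_sep="-",
    )
)

# Keep only the last SESSIONS_MAX_LENGTH items of each sequence
truncated = (
    groupby_features["item_id-list", "weekday_sin-list"]
    >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH)
)

workflow = nvt.Workflow(truncated + groupby_features["item_id-count"])
workflow.fit_transform(nvt.Dataset("interactions.parquet")).to_parquet("processed/")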

@NamartaVij

@rnyak If I want to extract both the long-term and short-term interests of a user from their interactions, can I still follow a similar approach, while considering two time windows to define "long-term" and "short-term"?

  1. Group your data by user_id to create sequences of user-item interactions for each user.

  2. Order these sequences by timestamp to maintain chronological order.

  3. Define a time threshold that separates long-term and short-term interactions, for example 7 days for long-term and 5 days for short-term.

  4. Split the sequences into two parts based on the time threshold: one for long-term interactions and one for short-term interactions.

I followed this procedure for my above-mentioned dataset, where we don't have sessions: https://nvidia-merlin.github.io/Merlin/v0.7.1/examples/getting-started-movielens/01-Download-Convert.html

My question is: where do I specify this threshold value for long-term and short-term when I am trying to extract users' long-term and short-term interests separately using XLNet?
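For concreteness, here is a rough pandas sketch of step 4 as I understand it (the window lengths and the per-user reference point, the user's last interaction, are illustrative choices):

import pandas as pd

df = pd.read_parquet("interactions.parquet")
df = df.sort_values(["user_id", "timestamp"])

DAY = 24 * 60 * 60          # timestamps are unix seconds
SHORT_DAYS, LONG_DAYS = 5, 7

# Reference point per user: their most recent interaction
last_ts = df.groupby("user_id")["timestamp"].transform("max")

short_term = df[df["timestamp"] >= last_ts - SHORT_DAYS * DAY]
long_term = df[df["timestamp"] >= last_ts - LONG_DAYS * DAY]

# Each split would then go through the same groupby preprocessing,
# yielding separate long-term and short-term sequences per user.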

@dcy0577
Author

dcy0577 commented Sep 15, 2023

@rnyak thanks for the answer. Could you please elaborate more on SESSIONS_MAX_LENGTH? In the data preprocessing part I see:

# Truncate sequence features to the last 20 interacted items
SESSIONS_MAX_LENGTH = 20
groupby_features_truncated = groupby_features_list >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH)

Is the slicing a must? If I understand correctly, the max length should be the maximum value that appears in item_id-count?

Also, in the model configuration part, there are several sequence-length settings:

max_sequence_length, d_model = 20, 320
# Define input module to process tabular input-features and to prepare masked inputs
input_module = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=max_sequence_length,
    continuous_projection=64,
    aggregation="concat",
    d_output=d_model,
    masking="mlm",
)

# Define the config of the XLNet Transformer architecture
transformer_config = tr.XLNetConfig.build(
    d_model=d_model, n_head=8, n_layer=2, total_seq_length=20
)

training_args = tr.trainer.T4RecTrainingArguments(
    output_dir="./tmp",
    max_sequence_length=20,
    data_loader_engine="merlin",
    num_train_epochs=200,
    dataloader_drop_last=False,
    per_device_train_batch_size=BATCH_SIZE_TRAIN,
    per_device_eval_batch_size=BATCH_SIZE_VALID,
    learning_rate=0.0005,
    fp16=True,
    report_to=[],
    logging_steps=20,
)

Do the max_sequence_length and total_seq_length here need to be consistent with SESSIONS_MAX_LENGTH?

@rnyak
Contributor

rnyak commented Sep 19, 2023

Yes, we expect them to be consistent for the data loader and for the input block.
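
In practice, a minimal way to keep them in sync is to derive all three from the single constant used in preprocessing (a sketch reusing the variable names from the snippets above):

SESSIONS_MAX_LENGTH = 20
d_model = 320

input_module = tr.TabularSequenceFeatures.from_schema(
    schema,  # schema produced by the NVTabular workflow
    max_sequence_length=SESSIONS_MAX_LENGTH,
    continuous_projection=64,
    aggregation="concat",
    d_output=d_model,
    masking="mlm",
)

transformer_config = tr.XLNetConfig.build(
    d_model=d_model, n_head=8, n_layer=2, total_seq_length=SESSIONS_MAX_LENGTH
)

training_args = tr.trainer.T4RecTrainingArguments(
    output_dir="./tmp",
    max_sequence_length=SESSIONS_MAX_LENGTH,
    data_loader_engine="merlin",
)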

dcy0577 closed this as completed Nov 25, 2023