
[QST] Custom dataset with interaction only #741

Closed
dcy0577 opened this issue Aug 22, 2023 · 4 comments

Comments

@dcy0577

dcy0577 commented Aug 22, 2023

❓ Questions & Help

Hi, my dataset contains only the user-item interactions and timestamps. I noticed that the data used in the session-based examples all contain additional information as features, such as category. Can I use the same logic as in the example code to preprocess my data, without adding any additional feature columns? Can the model accept such a data format?

user_id:token item_id:token timestamp:float
0 0 1681314649
0 0 1681314664
0 0 1681314674
0 0 1681314688
0 1 1681322022
0 1 1681322023
0 1 1681322024
0 1 1681322026
0 1 1681322027
0 1 1681322029
0 1 1681322030
0 1 1681322032
0 1 1681322033
0 1 1681322034
...

@rnyak
Contributor

rnyak commented Aug 28, 2023

@dcy0577 You don't need extra features. You can group your data by user_id, create sequential features, and use the item_id-list column as the only input to the model.

Another option is to create some temporal features, since you already have timestamp data. We showcase some ways of creating temporal features in the examples, but those are just examples; you can be creative and create your own.
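Something like this, as a rough sketch (the file name and the temporal feature are illustrative, and the schema-tagging steps from the example notebooks are omitted):

import numpy as np
import nvtabular as nvt

SESSIONS_MAX_LENGTH = 20

# Encode item ids as contiguous integers
item_id = ["item_id"] >> nvt.ops.Categorify()

# Optional: one example of a temporal feature derived from the raw
# unix-seconds timestamp (a rough day-of-week bucket, cyclically encoded)
weekday_sin = (
    ["timestamp"]
    >> nvt.ops.LambdaOp(lambda col: np.sin(2 * np.pi * ((col // 86400) % 7) / 7))
    >> nvt.ops.Rename(name="weekday_sin")
)

# Group interactions per user, sorted by timestamp, producing
# item_id-list (the only required model input) and item_id-count
groupby_features = (
    item_id + weekday_sin + ["user_id", "timestamp"]
    >> nvt.ops.Groupby(
        groupby_cols=["user_id"],
        sort_cols=["timestamp"],
        aggs={"item_id": ["list", "count"], "weekday_sin": ["list"]},
        name_sep="-",
    )
)

# Keep only the last SESSIONS_MAX_LENGTH items of each sequence
truncated = (
    groupby_features["item_id-list", "weekday_sin-list"]
    >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH)
)

workflow = nvt.Workflow(truncated + groupby_features["item_id-count"])
workflow.fit_transform(nvt.Dataset("interactions.parquet")).to_parquet("processed/")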

@NamartaVij

@rnyak If I want to extract both the long-term and short-term interests of a user from their interactions, can I still follow a similar approach, while considering two time windows to define "long-term" and "short-term"?

  1. Group your data by user_id to create sequences of user-item interactions for each user.

  2. Order these sequences by timestamp to maintain chronological order.

  3. Define a time threshold that separates long-term and short-term interactions, for example 7 days for long-term and 5 days for short-term.

  4. Split the sequences into two parts based on the time threshold: one for long-term interactions and one for short-term interactions.

I followed this procedure for my above-mentioned dataset, where we don't have sessions: https://nvidia-merlin.github.io/Merlin/v0.7.1/examples/getting-started-movielens/01-Download-Convert.html

My question is: where do I specify this threshold value for long-term and short-term when I am trying to extract users' long-term and short-term interests separately using XLNet?
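For concreteness, here is a rough pandas sketch of step 4 as I understand it (the window lengths and the per-user reference point, the user's last interaction, are illustrative choices):

import pandas as pd

df = pd.read_parquet("interactions.parquet")
df = df.sort_values(["user_id", "timestamp"])

DAY = 24 * 60 * 60          # timestamps are unix seconds
SHORT_DAYS, LONG_DAYS = 5, 7

# Reference point per user: their most recent interaction
last_ts = df.groupby("user_id")["timestamp"].transform("max")

short_term = df[df["timestamp"] >= last_ts - SHORT_DAYS * DAY]
long_term = df[df["timestamp"] >= last_ts - LONG_DAYS * DAY]

# Each split would then go through the same groupby preprocessing,
# yielding separate long-term and short-term sequences per user.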

@dcy0577
Author

dcy0577 commented Sep 15, 2023

@rnyak thanks for the answer. Could you please elaborate more on SESSIONS_MAX_LENGTH? In the data preprocessing part I see:

# Truncate sequence features to the last 20 interacted items
SESSIONS_MAX_LENGTH = 20
groupby_features_truncated = groupby_features_list >> nvt.ops.ListSlice(-SESSIONS_MAX_LENGTH)

Is the slicing a must? If I understand correctly, the max length should be the maximum value that appears in item_id-count?

Also, in the model configuration part, there are several sequence-length settings:

max_sequence_length, d_model = 20, 320
# Define input module to process tabular input-features and to prepare masked inputs
input_module = tr.TabularSequenceFeatures.from_schema(
    schema,
    max_sequence_length=max_sequence_length,
    continuous_projection=64,
    aggregation="concat",
    d_output=d_model,
    masking="mlm",
)

# Define the config of the XLNet Transformer architecture
transformer_config = tr.XLNetConfig.build(
    d_model=d_model, n_head=8, n_layer=2, total_seq_length=20
)

training_args = tr.trainer.T4RecTrainingArguments(
    output_dir="./tmp",
    max_sequence_length=20,
    data_loader_engine="merlin",
    num_train_epochs=200,
    dataloader_drop_last=False,
    per_device_train_batch_size=BATCH_SIZE_TRAIN,
    per_device_eval_batch_size=BATCH_SIZE_VALID,
    learning_rate=0.0005,
    fp16=True,
    report_to=[],
    logging_steps=20,
)

Do the max_sequence_length and total_seq_length here need to be consistent with SESSIONS_MAX_LENGTH?

@rnyak
Contributor

rnyak commented Sep 19, 2023

Yes, we expect them to be consistent for the data loader and for the input block.
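
In practice, a minimal way to keep them in sync is to derive all three from the single constant used in preprocessing (a sketch reusing the variable names from the snippets above):

SESSIONS_MAX_LENGTH = 20
d_model = 320

input_module = tr.TabularSequenceFeatures.from_schema(
    schema,  # schema produced by the NVTabular workflow
    max_sequence_length=SESSIONS_MAX_LENGTH,
    continuous_projection=64,
    aggregation="concat",
    d_output=d_model,
    masking="mlm",
)

transformer_config = tr.XLNetConfig.build(
    d_model=d_model, n_head=8, n_layer=2, total_seq_length=SESSIONS_MAX_LENGTH
)

training_args = tr.trainer.T4RecTrainingArguments(
    output_dir="./tmp",
    max_sequence_length=SESSIONS_MAX_LENGTH,
    data_loader_engine="merlin",
)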

dcy0577 closed this as completed Nov 25, 2023