
[Dataset] Clarification on Dataset processing #767

Open
BowenYao18 opened this issue Sep 26, 2024 · 1 comment
@BowenYao18

cites_edge = add_self_loops(remove_self_loops(paper_paper_edges)[0])[0]
self.edge_dict = {
    ('paper', 'cites', 'paper'): (torch.cat([cites_edge[1, :], cites_edge[0, :]]),
                                  torch.cat([cites_edge[0, :], cites_edge[1, :]])),

Let me use an example.

  1. Assume we have an edge file like this:
[0, 1, 2]  # cites_edge[0, :]
[1, 2, 3]  # cites_edge[1, :]
  2. Then, we first do

add_self_loops(remove_self_loops(paper_paper_edges)[0])[0]

, which gives us this:

[0, 1, 2, 0, 1, 2, 3]  # cites_edge[0, :]
[1, 2, 3, 0, 1, 2, 3]  # cites_edge[1, :]
  3. Then, we have its reverse edges:
[1, 2, 3, 0, 1, 2, 3]  # cites_edge[1, :]
[0, 1, 2, 0, 1, 2, 3]  # cites_edge[0, :]
  4. If we follow this code

(torch.cat([cites_edge[1, :], cites_edge[0, :]]), torch.cat([cites_edge[0, :], cites_edge[1, :]])

, we should have this:

[1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3]  # cites_edge[1, :] + cites_edge[0, :]
[0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 0, 1, 2, 3]  # cites_edge[0, :] + cites_edge[1, :]
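The four steps above can be sketched in plain Python. Note that `remove_self_loops` and `add_self_loops` below are hypothetical stand-ins for the torch_geometric utilities of the same names, operating on a pair of `[src, dst]` lists instead of tensors, and the node count of 4 is assumed from the example:

```python
def remove_self_loops(edges):
    # Drop edges whose source equals destination.
    src, dst = edges
    kept = [(s, d) for s, d in zip(src, dst) if s != d]
    return [list(x) for x in zip(*kept)] if kept else [[], []]

def add_self_loops(edges, num_nodes):
    # Append one self loop (n -> n) for every node id.
    src, dst = edges
    loops = list(range(num_nodes))
    return [src + loops, dst + loops]

# Step 1: the example edge file.
paper_paper_edges = [[0, 1, 2], [1, 2, 3]]

# Step 2: strip any existing self loops, then add one per node.
cites_edge = add_self_loops(remove_self_loops(paper_paper_edges), 4)
print(cites_edge[0])  # [0, 1, 2, 0, 1, 2, 3]
print(cites_edge[1])  # [1, 2, 3, 0, 1, 2, 3]

# Steps 3-4: append the reversed edges, matching the tuple order in
# (torch.cat([cites_edge[1, :], cites_edge[0, :]]),
#  torch.cat([cites_edge[0, :], cites_edge[1, :]])).
row0 = cites_edge[1] + cites_edge[0]
row1 = cites_edge[0] + cites_edge[1]
print(row0)  # [1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3]
print(row1)  # [0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 0, 1, 2, 3]
```

Which row is treated as source versus destination depends on the loader's convention; the sketch only reproduces the concatenation order in question.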

Instead of the result below. Since we must exactly follow MLPerf, we cannot concatenate the other way around, i.e. (torch.cat([cites_edge[0, :], cites_edge[1, :]]), torch.cat([cites_edge[1, :], cites_edge[0, :]])), which would instead give:

[0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 0, 1, 2, 3]  # cites_edge[0, :] + cites_edge[1, :]
[1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3]  # cites_edge[1, :] + cites_edge[0, :]

Am I understanding this correctly? Does the order matter here? Thank you!

@Elnifio
Contributor

Elnifio commented Sep 26, 2024

In GNN training, we only care that the graph topology is the same, which means you should have exactly the same number of edges (a->b for every combination of a, b) as the reference implementation graph.

However, the order does not matter here: the edges will be rearranged during the COO -> CSC conversion anyway, so you can also use cites_edge[0, :] + cites_edge[1, :] for the source and cites_edge[1, :] + cites_edge[0, :] for the destination.
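A small sketch of why the concatenation order washes out: a minimal COO -> CSC-style conversion amounts to sorting the (src, dst) pairs by destination, and both concatenation orders contain the same multiset of edges, so they sort to the same result. This `to_csc_pairs` helper is hypothetical, not the actual conversion code:

```python
# Edge lists after self loops were added (from the example above).
cites_edge = [[0, 1, 2, 0, 1, 2, 3],   # src
              [1, 2, 3, 0, 1, 2, 3]]   # dst

# The two concatenation orders under discussion.
order_a = list(zip(cites_edge[1] + cites_edge[0],
                   cites_edge[0] + cites_edge[1]))
order_b = list(zip(cites_edge[0] + cites_edge[1],
                   cites_edge[1] + cites_edge[0]))

def to_csc_pairs(edges):
    # Sort (src, dst) pairs by destination, then source: the edge
    # ordering a CSC layout induces.
    return sorted(edges, key=lambda e: (e[1], e[0]))

# Both orders describe the same graph once sorted.
assert to_csc_pairs(order_a) == to_csc_pairs(order_b)
```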
