
[Dataset] Clarification on Dataset processing #767

Open
BowenYao18 opened this issue Sep 26, 2024 · 1 comment
@BowenYao18

cites_edge = add_self_loops(remove_self_loops(paper_paper_edges)[0])[0]
self.edge_dict = {
    ('paper', 'cites', 'paper'): (torch.cat([cites_edge[1, :], cites_edge[0, :]]),
                                  torch.cat([cites_edge[0, :], cites_edge[1, :]])),

Let me use an example.

  1. Assume we have an edge file like this:
[0, 1, 2]  # cites_edge[0, :]
[1, 2, 3]  # cites_edge[1, :]
  2. Then, we first do

add_self_loops(remove_self_loops(paper_paper_edges)[0])[0]

, which gives us this:

[0, 1, 2, 0, 1, 2, 3]  # cites_edge[0, :]
[1, 2, 3, 0, 1, 2, 3]  # cites_edge[1, :]
  3. Then, we have its reverse edges:
[1, 2, 3, 0, 1, 2, 3]  # cites_edge[1, :]
[0, 1, 2, 0, 1, 2, 3]  # cites_edge[0, :]
  4. If we follow this code

(torch.cat([cites_edge[1, :], cites_edge[0, :]]), torch.cat([cites_edge[0, :], cites_edge[1, :]])

, we should have this:

[1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3]  # cites_edge[1, :] + cites_edge[0, :]
[0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 0, 1, 2, 3]  # cites_edge[0, :] + cites_edge[1, :]
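The four steps above can be sketched in plain Python. Note that `remove_self_loops` and `add_self_loops` below are hypothetical stand-ins for the torch_geometric utilities of the same names, operating on a pair of `[src, dst]` lists instead of tensors, and the node count of 4 is assumed from the example:

```python
def remove_self_loops(edges):
    # Drop edges whose source equals destination.
    src, dst = edges
    kept = [(s, d) for s, d in zip(src, dst) if s != d]
    return [list(x) for x in zip(*kept)] if kept else [[], []]

def add_self_loops(edges, num_nodes):
    # Append one self loop (n -> n) for every node id.
    src, dst = edges
    loops = list(range(num_nodes))
    return [src + loops, dst + loops]

# Step 1: the example edge file.
paper_paper_edges = [[0, 1, 2], [1, 2, 3]]

# Step 2: strip any existing self loops, then add one per node.
cites_edge = add_self_loops(remove_self_loops(paper_paper_edges), 4)
print(cites_edge[0])  # [0, 1, 2, 0, 1, 2, 3]
print(cites_edge[1])  # [1, 2, 3, 0, 1, 2, 3]

# Steps 3-4: append the reversed edges, matching the tuple order in
# (torch.cat([cites_edge[1, :], cites_edge[0, :]]),
#  torch.cat([cites_edge[0, :], cites_edge[1, :]])).
row0 = cites_edge[1] + cites_edge[0]
row1 = cites_edge[0] + cites_edge[1]
print(row0)  # [1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3]
print(row1)  # [0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 0, 1, 2, 3]
```

Which row is treated as source versus destination depends on the loader's convention; the sketch only reproduces the concatenation order in question.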

Instead of the result below. Since we must exactly follow MLPerf, we cannot concatenate the other way around, i.e. (torch.cat([cites_edge[0, :], cites_edge[1, :]]), torch.cat([cites_edge[1, :], cites_edge[0, :]])), which would instead give:

[0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 0, 1, 2, 3]  # cites_edge[0, :] + cites_edge[1, :]
[1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3]  # cites_edge[1, :] + cites_edge[0, :]

Am I understanding this correctly? Does the order matter here? Thank you!

@Elnifio
Contributor

Elnifio commented Sep 26, 2024

In GNN training, we only care that the graph topology is the same, which means you should have exactly the same number of edges (a->b for every combination of a, b) as the reference implementation graph.

However, the order does not matter here: the edges will be rearranged during the COO -> CSC conversion anyway, so you can also use cites_edge[0, :] + cites_edge[1, :] for the source and cites_edge[1, :] + cites_edge[0, :] for the destination.
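A small sketch of why the concatenation order washes out: a minimal COO -> CSC-style conversion amounts to sorting the (src, dst) pairs by destination, and both concatenation orders contain the same multiset of edges, so they sort to the same result. This `to_csc_pairs` helper is hypothetical, not the actual conversion code:

```python
# Edge lists after self loops were added (from the example above).
cites_edge = [[0, 1, 2, 0, 1, 2, 3],   # src
              [1, 2, 3, 0, 1, 2, 3]]   # dst

# The two concatenation orders under discussion.
order_a = list(zip(cites_edge[1] + cites_edge[0],
                   cites_edge[0] + cites_edge[1]))
order_b = list(zip(cites_edge[0] + cites_edge[1],
                   cites_edge[1] + cites_edge[0]))

def to_csc_pairs(edges):
    # Sort (src, dst) pairs by destination, then source: the edge
    # ordering a CSC layout induces.
    return sorted(edges, key=lambda e: (e[1], e[0]))

# Both orders describe the same graph once sorted.
assert to_csc_pairs(order_a) == to_csc_pairs(order_b)
```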
