Adding support for BigBird in GraphGPS #10598

Open
Rahul1epic wants to merge 9 commits into pyg-team:master from Rahul1epic:bigbird

Conversation

@Rahul1epic

This PR implements BigBird attention in accordance with the paper https://arxiv.org/abs/2007.14062 and adds support for it in GraphGPS.

This PR closes issue #8320

Details

  • Adds BigBird sparse attention to torch_geometric/nn/attention in bigbird.py.
  • Adds a corresponding test in test/nn/attention (the test uses a sequence length of 4096).
  • Adds support for BigBird in gps_conv.py.
  • Adds BigBird to the gps_conv tests.

Note: Since sparse attention depends on the block_size and the number of random blocks, BigBird may fail if the sequence length is too short. In such cases, BigBird falls back to regular dense attention; this currently happens in the gps_conv tests. Further, since BigBird is designed for long sequences, I have set the default block_size to 64 and the default number of random blocks to 3.
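For illustration, a minimal usage sketch might look like the following (assuming the attn_type='bigbird' value and attn_kwargs keys this PR proposes; names may differ in the final API):

import torch
from torch_geometric.nn import GINConv, GPSConv

# Hypothetical usage sketch: select BigBird attention in GPSConv via
# attn_type='bigbird' (proposed in this PR); block_size and
# num_rand_blocks are forwarded to the attention module via attn_kwargs.
channels = 64
mlp = torch.nn.Sequential(
    torch.nn.Linear(channels, channels),
    torch.nn.ReLU(),
    torch.nn.Linear(channels, channels),
)
conv = GPSConv(channels, GINConv(mlp), heads=4, attn_type='bigbird',
               attn_kwargs={'block_size': 64, 'num_rand_blocks': 3})

x = torch.randn(4096, channels)  # long sequence, so the sparse path is used
edge_index = torch.randint(0, 4096, (2, 8192))
out = conv(x, edge_index)  # shape: [4096, channels]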

Contributor

@xnuohz left a comment


Can you compare BigBird vs. Performer (or another attention type) on a dataset (like ZINC), to see whether the metrics improve?

Comment thread: test/nn/conv/test_gps_conv.py (Outdated)
Comment on lines +22 to +27
# BigBird attention depends on the block size.
# The default value of 64 is too large for this test,
# so we use a block size of 1 with a single random block.
if conv.attn_type == 'bigbird':
    conv.attn = BigBirdAttention(channels=16, n_heads=4, block_size=1,
                                 num_rand_blocks=1)
Contributor


Can we move this under GPSConv? When should the model use a large or small block_size?

Author

Rahul1epic commented Feb 9, 2026


I set the default block size to 64 assuming the input sequence length is very large (e.g., 1024 or 4096). A sequence length smaller than the block size would mean there are no blocks at all, which I assume is incorrect.
I added assertions for this, as done here.

Further, suppose, for example, that the sequence length of the transformer input is 4, block_size is 1, and I also want to insert 1 random block (as in the case above). This means there are 4 blocks, say B0, B1, B2, B3.
Note that random blocks cannot be allocated to a block's local or global blocks. Take block B1, for example (B0 and B3 are the global blocks, which attend to all other blocks anyway):

  • its local window blocks are B0, B1 (itself), and B2
  • the global blocks are B0 and B3

Thus, choosing 1 random block is not possible, so the random block list becomes empty, which throws an error.
To handle this consistently, to allow random blocks to be inserted, and to ensure random blocks have both incoming and outgoing edges (not just an outgoing edge) in the blocked attention adjacency matrix, I fall back to dense attention if the sequence length is less than or equal to (2 * num_rand_blocks + 5) * block_size, as shown here.
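In code, the fallback rule described above amounts to a single check (an illustrative sketch; the helper name is hypothetical, not the PR's code):

def use_sparse_attention(seq_len: int, block_size: int,
                         num_rand_blocks: int) -> bool:
    # Use the sparse path only when there are enough blocks to place the
    # random blocks outside the 3-block local window and the 2 global
    # blocks, with room for both incoming and outgoing random edges.
    return seq_len > (2 * num_rand_blocks + 5) * block_size

# block_size=1, num_rand_blocks=1 gives a threshold of 7, so a sequence
# of length 4 falls back to dense attention:
assert not use_sparse_attention(4, block_size=1, num_rand_blocks=1)
assert use_sparse_attention(4096, block_size=64, num_rand_blocks=3)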

I am still quite unsure about these assertions and fallbacks; maybe there's a better way.
Also, is the above explanation right in the context of BigBird sparse attention, or am I missing something?

Author


Hi,
I just made a few changes to make it more concise.

I simplified the handling of low sequence lengths to a single constraint: it automatically falls back to dense attention if the sequence length is less than or equal to (2 * num_rand_blocks + 5) * block_size.

Further, I added padding for when the sequence length is not divisible by the block size :)
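The padding step is conceptually just rounding the sequence length up to the next block multiple, along these lines (an illustrative sketch, not the PR's exact code):

import torch
import torch.nn.functional as F

def pad_to_block_multiple(x: torch.Tensor, block_size: int) -> torch.Tensor:
    # Pad the sequence dimension of a (batch, seq_len, dim) tensor up to
    # the next multiple of block_size; padded positions are later masked.
    pad = (-x.size(1)) % block_size
    return F.pad(x, (0, 0, 0, pad)) if pad else x

x = torch.randn(2, 100, 16)
assert pad_to_block_multiple(x, block_size=64).size(1) == 128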

I removed the above block from test_gps_conv.py and the tests still pass (it now falls back to dense attention, though, since the sequence length doesn't satisfy the above constraint).

@Rahul1epic
Author

Can you compare BigBird vs. Performer (or another attention type) on a dataset (like ZINC), to see whether the metrics improve?

Sure, will try to do this.

BigBird strictly falls back to dense attention if seq_length <= (5 + 2 * num_rand_blocks) * block_size.
Add appropriate padding to the input if the sequence length is not divisible by the block size.
@Rahul1epic
Author

Hi @xnuohz,
I was trying to use the ZINC dataset, but I cannot see the effect of BigBird because the number of nodes in each graph is too small, causing the sequence-length constraint to fail and the layer to fall back to dense attention.

Instead, I created random input graphs with varying numbers of nodes (1024, 2048, and so on) and added random edges for the graphs as well. The nodes have an embedding dimension of 1024.
I then measured the time a forward pass through a single GPSConv layer takes for each attention type (a rough sketch of the setup is shown below) and obtained the following graph:

[image: forward-pass time per attention type vs. number of nodes]

Is this expected?
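For reference, the timing setup was roughly along these lines (an assumed reconstruction, not the exact script; 'bigbird' is the attn_type this PR proposes):

import time
import torch
from torch_geometric.nn import GINConv, GPSConv

def time_forward(attn_type: str, num_nodes: int, channels: int = 1024) -> float:
    # Time a single forward pass through one GPSConv layer on a random
    # graph with `num_nodes` nodes and 4x as many random edges.
    mlp = torch.nn.Sequential(torch.nn.Linear(channels, channels))
    conv = GPSConv(channels, GINConv(mlp), heads=4, attn_type=attn_type)
    x = torch.randn(num_nodes, channels)
    edge_index = torch.randint(0, num_nodes, (2, 4 * num_nodes))
    start = time.perf_counter()
    with torch.no_grad():
        conv(x, edge_index)
    return time.perf_counter() - start

for num_nodes in (1024, 2048, 4096):
    for attn in ('multihead', 'performer', 'bigbird'):
        print(f'{attn}, n={num_nodes}: {time_forward(attn, num_nodes):.3f}s')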

@xnuohz
Contributor

xnuohz commented Feb 11, 2026

@Rahul1epic can you get results like this?
[image]

Previously, the band mask was not being modified in place while unsqueezing its first dimension. This has been fixed.
@Rahul1epic
Author

Rahul1epic commented Feb 13, 2026

Hi @xnuohz,
I followed the training example at https://github.com/pyg-team/pytorch_geometric/blob/master/examples/graph_gps.py for multihead, Performer, and BigBird on the ZINC subset, training for 50 epochs in each case. The best test MAEs at the end of the 50 epochs were as follows:
Multihead: 0.2006
Performer: 0.1922
BigBird: 0.2251

Also, since each graph in the ZINC dataset has only around 20-40 nodes, BigBird always fell back to the original dense attention, as the sequence-length constraint was never satisfied for any sample.

Are the results reasonable?
