Adding support for BigBird in GraphGPS #10598

Open
Rahul1epic wants to merge 9 commits into pyg-team:master from Rahul1epic:bigbird

Conversation

@Rahul1epic

This PR implements BigBird attention in accordance with the paper https://arxiv.org/abs/2007.14062 and adds support for it in GraphGPS.

This PR closes issue #8320

Details

  • Adds BigBird sparse attention to torch_geometric/nn/attention in bigbird.py.
  • Adds a corresponding test in test/nn/attention (the test uses a sequence length of 4096).
  • Adds support for BigBird in gps_conv.py.
  • Adds BigBird to the gps_conv tests.

Note: Since sparse attention depends on the block_size and the number of random blocks, BigBird may fail if the sequence length is too short. In such cases, BigBird falls back to regular dense attention; this currently happens in the gps_conv tests. Further, since BigBird is designed for long sequences, I have set the default block_size to 64 and the default number of random blocks to 3.
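For illustration, a minimal usage sketch might look like the following (assuming the attn_type='bigbird' value and attn_kwargs keys this PR proposes; names may differ in the final API):

import torch
from torch_geometric.nn import GINConv, GPSConv

# Hypothetical usage sketch: select BigBird attention in GPSConv via
# attn_type='bigbird' (proposed in this PR); block_size and
# num_rand_blocks are forwarded to the attention module via attn_kwargs.
channels = 64
mlp = torch.nn.Sequential(
    torch.nn.Linear(channels, channels),
    torch.nn.ReLU(),
    torch.nn.Linear(channels, channels),
)
conv = GPSConv(channels, GINConv(mlp), heads=4, attn_type='bigbird',
               attn_kwargs={'block_size': 64, 'num_rand_blocks': 3})

x = torch.randn(4096, channels)  # long sequence, so the sparse path is used
edge_index = torch.randint(0, 4096, (2, 8192))
out = conv(x, edge_index)  # shape: [4096, channels]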

Contributor

@xnuohz left a comment


Can you compare BigBird vs. Performer (or another attention type) on a dataset (like ZINC), to see whether the metrics improve?

Comment thread: test/nn/conv/test_gps_conv.py (Outdated)
Comment on lines +22 to +27
# BigBird attention depends on the block size.
# The default value of 64 is too large for this test,
# so we use a block size of 1 with a single random block.
if conv.attn_type == 'bigbird':
    conv.attn = BigBirdAttention(channels=16, n_heads=4, block_size=1,
                                 num_rand_blocks=1)
Contributor


Can we move this under GPSConv? When should the model use a large or small block_size?

Author

Rahul1epic commented Feb 9, 2026


I set the default block size to 64 assuming the input sequence length is very large (e.g., 1024 or 4096). A sequence length smaller than the block size would mean there are no blocks at all, which I assume is incorrect.
I added assertions for this, as done here.

Further, suppose, for example, that the sequence length of the transformer input is 4, block_size is 1, and I also want to insert 1 random block (as in the case above). This means there are 4 blocks, say B0, B1, B2, B3.
Note that random blocks cannot be allocated to a block's local or global blocks. Take block B1, for example (B0 and B3 are the global blocks, which attend to all other blocks anyway):

  • its local window blocks are B0, B1 (itself), and B2
  • the global blocks are B0 and B3

Thus, choosing 1 random block is not possible, so the random block list becomes empty, which throws an error.
To handle this consistently, to allow random blocks to be inserted, and to ensure random blocks have both incoming and outgoing edges (not just an outgoing edge) in the blocked attention adjacency matrix, I fall back to dense attention if the sequence length is less than or equal to (2 * num_rand_blocks + 5) * block_size, as shown here.
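In code, the fallback rule described above amounts to a single check (an illustrative sketch; the helper name is hypothetical, not the PR's code):

def use_sparse_attention(seq_len: int, block_size: int,
                         num_rand_blocks: int) -> bool:
    # Use the sparse path only when there are enough blocks to place the
    # random blocks outside the 3-block local window and the 2 global
    # blocks, with room for both incoming and outgoing random edges.
    return seq_len > (2 * num_rand_blocks + 5) * block_size

# block_size=1, num_rand_blocks=1 gives a threshold of 7, so a sequence
# of length 4 falls back to dense attention:
assert not use_sparse_attention(4, block_size=1, num_rand_blocks=1)
assert use_sparse_attention(4096, block_size=64, num_rand_blocks=3)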

I am still quite unsure about these assertions and fallbacks; maybe there's a better way.
Also, is the above explanation right in the context of BigBird sparse attention, or am I missing something?

Author


Hi,
I just made a few changes to make it more concise.

I simplified the handling of low sequence lengths to a single constraint: it automatically falls back to dense attention if the sequence length is less than or equal to (2 * num_rand_blocks + 5) * block_size.

Further, I added padding for when the sequence length is not divisible by the block size :)
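The padding step is conceptually just rounding the sequence length up to the next block multiple, along these lines (an illustrative sketch, not the PR's exact code):

import torch
import torch.nn.functional as F

def pad_to_block_multiple(x: torch.Tensor, block_size: int) -> torch.Tensor:
    # Pad the sequence dimension of a (batch, seq_len, dim) tensor up to
    # the next multiple of block_size; padded positions are later masked.
    pad = (-x.size(1)) % block_size
    return F.pad(x, (0, 0, 0, pad)) if pad else x

x = torch.randn(2, 100, 16)
assert pad_to_block_multiple(x, block_size=64).size(1) == 128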

I removed the above block from test_gps_conv.py and the tests still pass (it now falls back to dense attention, though, since the sequence length doesn't satisfy the above constraint).

@Rahul1epic
Author

Can you compare BigBird vs. Performer (or another attention type) on a dataset (like ZINC), to see whether the metrics improve?

Sure, will try to do this.

BigBird strictly falls back to dense attention if seq_length <= (5 + 2 * num_rand_blocks) * block_size.
Add appropriate padding to the input if the sequence length is not divisible by the block size.
@Rahul1epic
Author

Hi @xnuohz,
I was trying to use the ZINC dataset, but I cannot see the effect of BigBird because the number of nodes in each graph is too small, causing the sequence-length constraint to fail and the layer to fall back to dense attention.

Instead, I created random input graphs with varying numbers of nodes (1024, 2048, and so on) and added random edges for the graphs as well. The nodes have an embedding dimension of 1024.
I then measured the time a forward pass through a single GPSConv layer takes for each attention type (a rough sketch of the setup is shown below) and obtained the following graph:

[image: forward-pass time per attention type vs. number of nodes]

Is this expected?
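For reference, the timing setup was roughly along these lines (an assumed reconstruction, not the exact script; 'bigbird' is the attn_type this PR proposes):

import time
import torch
from torch_geometric.nn import GINConv, GPSConv

def time_forward(attn_type: str, num_nodes: int, channels: int = 1024) -> float:
    # Time a single forward pass through one GPSConv layer on a random
    # graph with `num_nodes` nodes and 4x as many random edges.
    mlp = torch.nn.Sequential(torch.nn.Linear(channels, channels))
    conv = GPSConv(channels, GINConv(mlp), heads=4, attn_type=attn_type)
    x = torch.randn(num_nodes, channels)
    edge_index = torch.randint(0, num_nodes, (2, 4 * num_nodes))
    start = time.perf_counter()
    with torch.no_grad():
        conv(x, edge_index)
    return time.perf_counter() - start

for num_nodes in (1024, 2048, 4096):
    for attn in ('multihead', 'performer', 'bigbird'):
        print(f'{attn}, n={num_nodes}: {time_forward(attn, num_nodes):.3f}s')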

@xnuohz
Contributor

xnuohz commented Feb 11, 2026

@Rahul1epic can you get results like this?
[image]

Previously, the band mask was not being modified in place while unsqueezing its first dimension. This has been fixed.
@Rahul1epic
Author

Rahul1epic commented Feb 13, 2026

Hi @xnuohz,
I followed the training example at https://github.com/pyg-team/pytorch_geometric/blob/master/examples/graph_gps.py for multihead, Performer, and BigBird on the ZINC subset, training for 50 epochs in each case. The best test MAEs at the end of the 50 epochs were as follows:
Multihead: 0.2006
Performer: 0.1922
BigBird: 0.2251

Also, since each graph in the ZINC dataset has only around 20-40 nodes, BigBird always fell back to the original dense attention, as the sequence-length constraint was never satisfied for any sample.

Are the results reasonable?
