The Timing of LYC STAT Handlers

Written by Ron Nelson and ISSOtm

Raster effects are probably the greatest assets that retro game consoles have. The fact that the PPU generates the image right as it is displayed allows many special effects to be created by modifying the rendering parameters while the image is being drawn. Here is an example:

However, unlike some consoles like the SNES, the Game Boy contains no hardware dedicated to raster effects, so the task falls squarely on the CPU. This causes raster FX code to interact with the rest of the program in complex ways, particularly when it comes to accessing VRAM.

In this article, we will explore different techniques for handling raster effects, and discuss their pros and cons with the help of some diagrams.

PRIOR KNOWLEDGE ASSUMED

This article is not a friendly introduction to programming raster effects, and assumes you are already comfortable with Game Boy programming. To learn more about how to achieve neat raster effects like the above, check out DeadCScroll first, which the above GIF is actually from!

Additionally, since the operations discussed here are extremely timing-sensitive, discussions will revolve around assembly instructions. You can learn how to program for the Game Boy in assembly in GB ASM Tutorial.

TERMINOLOGY

We'll reference a few terms throughout this tutorial; here are brief explanations of them:

SoC: System-on-a-Chip, a single chip that includes most (or all!) components of a system. The Game Boy's functionality is almost entirely contained within a single chip, confusingly labelled "DMG-CPU" or similar. (Contrast this with, for example, the SNES, where there is one chip for the CPU, two for the PPU, and many more.)
CPU: Central Processing Unit, the part of the SoC that executes code and configures everything else.
PPU: Pixel Processing Unit, the part of the SoC that is responsible for generating pixels and sending them to the LCD.
Rasterization: the process of turning... something (for example, a collection of textured polygons; or, on the GB, tiles and tilemaps) into an array of pixels. "Raster" is sort of a contraction of that term.
Scanline: a row of pixels; it's called a "scan"-line because the lines get drawn one by one, pixel by pixel, as if the PPU was "scanning" along the screen.
Register: in general, a small piece of memory, usually linked to some hardware component.
PPU mode: The PPU can be in one of four modes at a given time, depending on what it's doing. Please refer to Pan Docs to learn what each mode corresponds to and how they are scheduled—they interact very tightly with raster effects.
Interrupt: an event that gets generated. Typically, this causes a "handler" to be called, which is a special routine dedicated to reacting to a given interrupt.
"Main thread": any code that is executed outside of interrupt handlers.

Introduction

The easiest way to implement raster effects is to use the LYC register with the STAT interrupt.

Here is what the Pan Docs have to say about this register's simple function:

FF45 - LYC (LY Compare) (R/W)
The Game Boy permanently compares the value of the LYC and LY registers. When both values are identical, the “LYC=LY” flag in the STAT register is set, and (if enabled) a STAT interrupt is requested.

So then, the outline for setting up a raster effect is as follows:

Register an interrupt by setting LYC to the desired scanline
When that scanline begins, the STAT interrupt handler will automatically be called
Perform your chosen effect by modifying PPU registers
Exit the handler with reti
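As a rough illustration, the setup half of this outline might look like the sketch below. It assumes the register and flag names from hardware.inc (rLYC, rSTAT, rIE, STATF_LYC, IEF_STAT; older versions of that file call the last one IEF_LCDC), and it assumes the handler is exported under the name LYC, as in the examples later in this article. The SetUpLYC name is made up for illustration.

```asm
; Sketch only: constant names are assumed to come from hardware.inc.
INCLUDE "hardware.inc"

SECTION "STAT interrupt vector", ROM0[$0048]
    ; The CPU jumps to $0048 when it services a STAT interrupt.
    jp LYC ; `LYC` is the handler, assumed to be defined elsewhere

SECTION "LYC setup sketch", ROM0
; Hypothetical helper.
; @param a: the scanline on which the STAT interrupt should be requested
SetUpLYC::
    ldh [rLYC], a       ; choose the scanline to compare LY against
    ld a, STATF_LYC
    ldh [rSTAT], a      ; make LYC=LY the only STAT interrupt source
    ldh a, [rIE]
    or a, IEF_STAT
    ldh [rIE], a        ; let STAT interrupts through
    ret                 ; the caller still has to run `ei` at some point
```

Calling it with `ld a, 128 - 1` followed by `call SetUpLYC` would request the interrupt on scanline 127.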

ALTERNATIVES

There are other ways to perform raster FX, such as busy-waiting in the "main thread", but as this article's title suggests, we won't discuss them here.

A major pro of LYC-interrupt-based raster effects is that they can be made self-contained, and thus largely independent of whatever the "main thread" is doing. This, in turn, simplifies the mental complexity of the code (decoupling), copes better with lag frames, and more.

Many of the points brought forth later, particularly regarding cycle counting, are still relevant with these alternatives, so this is still worth reading!

These four steps sound simple enough on their own, but there are numerous caveats we will discuss. Strap in!

Most raster effects are implemented by modifying registers between scanlines. Thus, you will want to write the register either during Mode 2 (of the same scanline), or Mode 0 (of the previous one)—anything but Mode 3, really.
Unfortunately, LY=LYC interrupts are requested at the beginning of a scanline, so during the very short Mode 2, leaving too little time to perform any but the most basic of effects.
Writing to the register during HBlank instead implies triggering the interrupt on the scanline above the effect, as well as idling for most of the scanline. So, if I wanted to enable sprites on scanline 16, I'd write 15 to LYC.
Mode 3's length is variable, so syncing to HBlank is difficult and time-consuming.
The interrupt handler's execution may be delayed by a few cycles, which makes it difficult to reliably sync to the PPU.
If the "main thread" is itself trying to sync with the PPU (typically by polling STAT in a loop), our interrupt may throw off its timing.

Sounds good? Then let's get started!

Timing

First, let's look at the timing of the rendering itself, courtesy of the Pan Docs:

Here are some key points:

A "dot" is one period of the PPU's 4 MiHz clock, i.e. roughly 0.24 µs.
A "cycle" is the main unit of time in the CPU, which is equal to 4 dots, or roughly 0.95 µs. (The Game Boy Color CPU can enter a "double-speed" mode which halves the length of cycles, but not of dots. For the sake of simplicity, we won't consider the differences it involves here.)
Each scanline takes exactly 456 dots, or 114 cycles.
Mode 2 also takes a constant amount of time (20 cycles).
HBlank's length varies wildly, and will often be nearly as long as or longer than the drawing phase.
HBlank and OAM scan are mostly interchangeable, as long as you're not writing to OAM.
The worst-case HBlank's length is not a multiple of 4 dots, so we will round down to 21 cycles.
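The conversions above can be double-checked at assembly time. The following is only a sketch: the constant names are made up, and the dot counts (80 for Mode 2, 87 for the shortest HBlank) are the figures given by Pan Docs.

```asm
; Sketch only: these constant names are not part of hardware.inc.
DEF DOTS_PER_CYCLE      EQU 4
DEF DOTS_PER_SCANLINE   EQU 456
DEF DOTS_OAM_SCAN       EQU 80  ; Mode 2, constant
DEF DOTS_MIN_HBLANK     EQU 87  ; Mode 0, worst case

DEF CYCLES_PER_SCANLINE EQU DOTS_PER_SCANLINE / DOTS_PER_CYCLE
DEF CYCLES_OAM_SCAN     EQU DOTS_OAM_SCAN / DOTS_PER_CYCLE
DEF CYCLES_MIN_HBLANK   EQU DOTS_MIN_HBLANK / DOTS_PER_CYCLE ; integer division rounds down

; The build fails if any of the figures quoted in the list above is off.
ASSERT CYCLES_PER_SCANLINE == 114
ASSERT CYCLES_OAM_SCAN == 20
ASSERT CYCLES_MIN_HBLANK == 21
```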

Let's consider a simple STAT handler, which disables OBJs if called at line 128, and enables them otherwise:

LYC::
    push af
    ldh a, [rLY]
    cp 128 - 1
    jr z, .disableSprites

    ; enable sprites
    ldh a, [rLCDC]
    or a, LCDCF_OBJON
    ldh [rLCDC], a
    pop af
    reti

.disableSprites
    ldh a, [rLCDC]
    and a, ~LCDCF_OBJON
    ldh [rLCDC], a
    pop af
    reti

Tips

This is not an especially well-written STAT handler, but writing a good one is outside the scope of this tutorial. If that's what you're looking for, check out DeadCScroll by Blitter Object. It triggers the STAT interrupt on HBlanks rather than LYC, but the fundamentals are the same.

Note that, for simplicity's sake, DeadCScroll does not consider the problems described further below, so be wary of combining that tutorial's STAT handler unmodified with STAT-based VRAM accesses in the main thread.

Let's assume that the interrupt fires at, say, scanline 42. Equipped with the GB instruction table (see its legend at the bottom), we can plot how many cycles each operation takes, in relation with the PPU's mode:

Legend

PPU Mode
2	OAM scan
3	Drawing
0	HBlank

CPU operation
	Interrupt dispatch
	Write to LCDC
	Return from interrupt

Scanline cycle	Instruction
0
1
2
3
4
5	`push af`
6
7
8
9	`ldh a, [rLY]`
10
11
12	`cp 128 - 1`
13	`cp 128 - 1`
14	`jr z, .disableSprites`
15	`jr z, .disableSprites`
16	`ldh a, [rLCDC]`
17
18
19	`or a, LCDCF_OBJON`
20	`or a, LCDCF_OBJON`
21	`ldh [rLCDC], a`
22
23
24	`pop af`
25
26
27	`reti`
28
29
30

The first 5 cycles do not have an instruction: indeed, calling an interrupt handler is not instantaneous, and the CPU is temporarily busy pushing the program counter (PC) to the stack, disabling interrupts, etc. Then, the actual interrupt handler begins execution.

We can immediately spot a problem: the cycle during which LCDC is written to falls in the middle of rendering! (With only a handful of exceptions, instructions that access memory do so on their very last cycle.) This is usually undesirable, and could lead to graphical glitches like an OBJ being partially cut off until we write to LCDC.

Another problem, less obvious but oh so painful, is how the interrupt handler might interact with the "main thread"'s operation.

The VRAM access race condition

Accessing VRAM is not possible during Mode 3. Thus, when we want to access VRAM, precautions must be taken; the most common is to use the following loop:

.waitVRAM
	ldh a, [rSTAT]
	and STATF_BUSY ; 2
	jr nz, .waitVRAM

This loop checks whether [STAT] & 2 is zero, and exits when it is. Looking at documentation for STAT, we can see that the lowest 2 bits report the PPU's mode, and that [STAT] & 2 is zero for Mode 0 and Mode 1, but not Mode 2 or Mode 3. So, essentially, this loop waits for Mode 0 or Mode 1, both of which are safe for writing to VRAM. But it can't be that simple.

Legend

PPU Mode
2	OAM scan
3	Drawing
0	HBlank

CPU operation
	Read from STAT
	VRAM accesses

Scanline cycle	Instruction
111	`ldh a, [rSTAT]`
112
113
0	`and STATF_BUSY`
1	`and STATF_BUSY`
2	`jr nz, .waitVRAM`
3	`jr nz, .waitVRAM`
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20	(...)
21	(...)

Pictured above is the "worst case" for this loop. As you can see, on the cycle that STAT is read, the PPU is still in Mode 0; however, checking for it takes a few cycles, during which we enter Mode 2!

Now, thankfully, Mode 2 is also safe for accessing VRAM—but only 16 cycles of it remain. This is why this loop is said to guarantee 16 "VRAM-safe" cycles: any access performed 17 cycles or more after it would break in this worst case.
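To make that budget concrete, here is a minimal sketch of a routine that spends 10 of the 16 guaranteed cycles. The routine name and register usage are assumptions made for this example; the cycle counts in the comments are running totals counted from the end of the loop.

```asm
; Sketch only: constant names are assumed to come from hardware.inc.
INCLUDE "hardware.inc"

SECTION "VRAM copy sketch", ROM0
; Hypothetical routine: copies two bytes to VRAM after a single wait loop.
; @param de: source address (not in VRAM)
; @param hl: destination address in VRAM
CopyTwoBytesToVRAM:
.waitVRAM:
    ldh a, [rSTAT]
    and STATF_BUSY
    jr nz, .waitVRAM
    ld a, [de]      ;  2
    ld [hl+], a     ;  4 (VRAM write)
    inc de          ;  6
    ld a, [de]      ;  8
    ld [hl+], a     ; 10 (VRAM write, still inside the 16 safe cycles)
    ret
```

A third byte copied the same way would finish at cycle 16, which is the very edge of the guarantee.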

Now, what would happen if our interrupt was requested in the middle of this?

Legend

PPU Mode
2	OAM scan
3	Drawing
0	HBlank

CPU operation
	Read from STAT
	Interrupt dispatch
	Return from interrupt
	VRAM accesses

Scanline cycle	Instruction
111	`ldh a, [rSTAT]`
112
113
0
1
2
3
4
5	`push af`
6
7
8
9	`ldh a, [rLY]`
10
11
12	`cp 128 - 1`
13	`cp 128 - 1`
14	`jr z, .disableSprites`
15	`jr z, .disableSprites`
16	`ldh a, [rLCDC]`
17
18
19	`or a, LCDCF_OBJON`
20	`or a, LCDCF_OBJON`
21	`ldh [rLCDC], a`
22
23
24	`pop af`
25
26
27	`reti`
28
29
30
31	`and STATF_BUSY`
32	`and STATF_BUSY`
33	`jr nz, .waitVRAM`
34	`jr nz, .waitVRAM`
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50

Oh no! The main thread is now trying to access VRAM right in the middle of Mode 3! This could lead to all sorts of visual bugs.

A solution?

The solution is not too complicated, at least on paper. We should be able to use the same STAT-checking loop (or at least, a variation of it) inside of the handler. It works in the main thread, so it should work here as well, right?

Remember that many STAT handlers will be much more complicated than the simple example above, so let's draw a diagram with an imaginary handler that would take significantly more time:

Legend

PPU Mode
2	OAM scan
3	Drawing
0	HBlank

CPU operation
	Read from STAT
	Interrupt dispatch
	Return from interrupt
	VRAM accesses

Scanline cycle	Instruction
111	`ldh a, [rSTAT]`
112
113
0
1
2
3
4
5	(...)
...
96
97	`ldh a, [rSTAT]`
98
99
100	`and STATF_BUSY`
101	`and STATF_BUSY`
102	`jr nz, .handlerWait`
103	`jr nz, .handlerWait`
104	`(Write to LCDC)`
105
106
107	`pop hl`
108
109
110	`pop af`
111
112
113	`reti`
0
1
2
3	`and STATF_BUSY`
4	`and STATF_BUSY`
5	`jr nz, .waitVRAM`
6	`jr nz, .waitVRAM`
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22

Tips

All the instructions between the "Interrupt dispatch" and "Return from interrupt" blocks are the interrupt handler; the rest is in the "main thread".

The STAT loop does fix the register being written to during Mode 3; however, once again, the 16 cycles that the "main thread" expects to be VRAM-safe overlap with Mode 3. The problem here is that the write, pop and reti all take some of those cycles, and the "main thread" is using the value it read from STAT during the previous scanline, which is now stale.

Possible fixes

Using what we have learned so far, we can boil down the problem to three factors:

Our handler can trigger in the middle of this sequence of events
Our handler preserves the stale value read from STAT earlier
Our handler returns during a time where accessing VRAM is unsafe

It would be enough to get rid of any of these, so let's enumerate our options.

Dealing with it

It's entirely possible to accept the loss of some of those cycles. This amounts to assuming fewer than the usual 16 cycles after such loops. For example, putting a STAT-polling loop just before the last pop af and reti would have these two eat up 7 cycles, so we are down to 9.

This will quickly become impractical, requiring syncing to the LCD much more often in the main thread.

Handler timing

A simple way to prevent those pesky handlers from throwing off our timing is to disable them, with the di instruction. Unfortunately, it can't quite be so simple, as using di for this brings its own share of problems.

The most important one is that disabling the handlers like this delays their execution! STAT handlers designed to write to hardware regs during HBlank may start doing so during rendering instead; timer interrupts won't trigger as regularly now; and so on.

Using di is valid in some cases, but typically not when STAT interrupts are involved, due to their fairly strict timing requirements.
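For reference, the di approach might look like the following sketch in the "main thread". The routine name and register usage are assumptions, and the unconditional ei assumes that interrupts were enabled when the routine was called.

```asm
; Sketch only: constant names are assumed to come from hardware.inc.
INCLUDE "hardware.inc"

SECTION "di-protected VRAM write sketch", ROM0
; Hypothetical routine.
; @param b:  byte to write
; @param hl: destination address in VRAM
WriteVRAMByteNoInterrupts:
    di                  ; from here on, no handler can run...
.waitVRAM:
    ldh a, [rSTAT]
    and STATF_BUSY
    jr nz, .waitVRAM
    ld a, b             ; ...so the 16 VRAM-safe cycles all belong to us
    ld [hl], a
    ei                  ; anything requested in the meantime is only serviced after this
    ret
```

The wait loop can spin for most of a scanline, and that is how late a STAT handler may end up running.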

An oddly common alternative is to perform all VRAM updates in the VBlank handler. (It is especially common in early GB games, likely as a carry-over from the NES, where the lack of HBlanks essentially mandates such a setup anyway.) While this can work, such as for Metroid II, it adds significant complexity from having to keep deferring graphical updates.

Stale `STAT` read

There is not much that can be done about this one. The interrupt handler must preserve registers, and ...

TOCTTOU

Return timing

This is the solution that the rest of this article will explore, as we will see that it makes the least painful compromises for most use cases.

So, the real solution is to fully exit before the end of HBlank. There are two ways to do this. One is to wait for the Drawing phase before waiting for HBlank. This effectively catches the very start of HBlank, leaving plenty of time to exit. Here's how the earlier example might look using this method:

LYC::
    push af
    push hl
    ldh a, [rLY]
    cp 128 - 1
    jr z, .disableSprites

    ; enable sprites
    ldh a, [rLCDC]
    or a, LCDCF_OBJON
    jr .finish

.disableSprites
    ldh a, [rLCDC]
    and a, ~LCDCF_OBJON

.finish
    ld hl, rSTAT
.waitNotBlank
    bit STATB_BUSY, [hl]
    jr z, .waitNotBlank
.waitBlank
    bit STATB_BUSY, [hl]
    jr nz, .waitBlank

    ldh [rLCDC], a
    pop hl
    pop af
    reti

See how this method never interferes with VRAM accesses in the main thread, even with the worst possible timing and the shortest of HBlanks:

Legend

PPU Mode
2	OAM scan
3	Drawing
0	HBlank

CPU operation
	Interrupt dispatch
	STAT is tested
	Return from interrupt
	VRAM accesses

Scanline cycle	Instruction
111	`ldh a, [rSTAT]`
112
113
0
1
2
3
4
5	`push af`
6
7
8
9	`push hl`
10
11
12
13	`ldh a, [rLY]`
14
15
16	`cp 128 - 1`
17	`cp 128 - 1`
18	`jr z, .disableSprites`
19	`jr z, .disableSprites`
20	`ldh a, [rLCDC]`
21
22
23	`or a, LCDCF_OBJON`
24	`or a, LCDCF_OBJON`
25	`jr .finish`
26
27
28	`ld hl, rSTAT`
29
30
31	`bit STATB_BUSY, [hl]`
32
33
34	`jr z, .waitNotBlank`
35	`jr z, .waitNotBlank`
36	`bit STATB_BUSY, [hl]`
37
38
39	`jr nz, .waitBlank`
40
41
42	(...)
...
83
84	`bit STATB_BUSY, [hl]`
85
86
87	`jr nz, .waitBlank`
88
89
90	`bit STATB_BUSY, [hl]`
91
92
93	`jr nz, .waitBlank`
94
95
96	`bit STATB_BUSY, [hl]`
97
98
99	`jr nz, .waitBlank`
100	`jr nz, .waitBlank`
101	`ldh [rLCDC], a`
102
103
104	`pop hl`
105
106
107	`pop af`
108
109
110	`reti`
111
112
113
0	`and STATF_BUSY`
1	`and STATF_BUSY`
2	`jr nz, .waitVRAM`
3	`jr nz, .waitVRAM`
4	(...)
...
19

Phew! This just barely works. There are only two cycles to spare! If there were multiple registers that needed updating, you might run into trouble. These really short HBlanks are the worst-case scenario that you always fear. However, in practice, HBlanks are normally much longer, often even longer than the drawing phase. Using this method, that can actually have unfortunate consequences:

Legend

PPU Mode
2	OAM scan
3	Drawing
0	HBlank

CPU operation
	Interrupt dispatch
	STAT is tested
	Return from interrupt
	VRAM accesses

Scanline cycle	Instruction
111	`ldh a, [rSTAT]`
112
113
0
1
2
3
4
5	`push af`
6
7
8
9	`push hl`
10
11
12
13	(...)
...
58
59	`ld hl, rSTAT`
60
61
62	`bit STATB_BUSY, [hl]`
63
64
65	`jr z, .waitNotBlank`
66
67
68	(...)
...
1
2	`bit STATB_BUSY, [hl]`
3
4
5	`jr z, .waitNotBlank`
6	`jr z, .waitNotBlank`
7	`bit STATB_BUSY, [hl]`
8
9
10	`jr nz, .waitBlank`
11
12
13	(...)
...
60
61	`bit STATB_BUSY, [hl]`
62
63
64	`jr nz, .waitBlank`
65	`jr nz, .waitBlank`
66	`ldh [rLCDC], a`
67
68
69	`pop hl`
70
71
72	`pop af`
73
74
75	`reti`
76
77
78
79	`and STATF_BUSY`
80	`and STATF_BUSY`
81	`jr nz, .waitVRAM`
82	`jr nz, .waitVRAM`
83	(...)
...
98

This time, when all the processing was done, there was still plenty of time left in the scanline to safely exit. However, since HBlank was so long, the routine missed the check for the drawing window and wasted an entire scanline waiting for that Drawing -> HBlank transition before it exited. Not only does this waste precious CPU time, but it also limits how often raster FX can be used throughout the frame. This method still works fine though, and can be an easy approach if you use Raster FX sparingly.

I'm a bit of a perfectionist, so I usually like to strive for the absolute best method. In a perfect world, we would precisely know whether we have enough HBlank left to safely exit. There actually is a way to do that though! You just need to count exactly how long your routine takes, and make sure it always exits during HBlank. This comes with some caveats though. Most routines, if they haven't been specifically designed for this method, will take a variable amount of time. The main things you need to avoid are if statements and loops. Specifically, if statements of this form are problematic:

    ; test a condition here...

    jr nc, .skip ; skip the next part unless Carry is set

    ; do something here, only if the previous operation set Carry

.skip
    ; continue on with the program.

The problem here is that the code following this pattern may be run after a variable number of cycles have passed. If you need to use an if statement, always make it an if/else statement so that you can waste cycles in the else portion and take the same number of cycles.
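Here is a minimal sketch of such a balanced if/else. The label names and the `inc b` placeholder are made up for illustration; what matters is that both paths take 6 cycles from the `jr nc` up to `.done`.

```asm
SECTION "Constant-time branch sketch", ROM0
ConstantTimeBranch:
    ; test a condition here...

    jr nc, .noCarry ; 2 cycles if not taken, 3 cycles if taken

    ; Carry was set: do the real work
    inc b           ; 1 (placeholder for the actual operation)
    jr .done        ; 3, so this path takes 2 + 1 + 3 = 6 cycles

.noCarry:
    ; Carry was clear: burn as many cycles as the other path used
    nop             ; 1
    nop             ; 1
    nop             ; 1, so this path takes 3 + 1 + 1 + 1 = 6 cycles

.done:
    ; either way, exactly 6 cycles have passed since the `jr nc`
    ret
```

Note that the taken and non-taken costs of the conditional jump differ by one cycle, so the padding is not simply "as many cycles as the if body".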

So now that you're ready to count the cycles of your handler, how long do you need to make the routine? Let's look at some more diagrams to figure this out!

Legend

PPU Mode
2	OAM scan
3	Drawing
0	HBlank

CPU operation
	STAT read
	Interrupt dispatch
	Return from interrupt
	VRAM accesses

Scanline cycle	Instruction
111	`ldh a, [rSTAT]`
112
113
0
1
2
3
4
5	(...)
...
109
110	`reti`
111
112
113
0	`and STATF_BUSY`
1	`and STATF_BUSY`
2	`jr nz, .waitVRAM`
3	`jr nz, .waitVRAM`
4	(...)
...
19

Wow! That's a lot of cycles! Here, the routine takes exactly one scanline to complete, so the main thread does its writes at the same moment on the next scanline, with no idea what happened! If you count up all the cyan cycles, you'll see that there are 105 of them, and 109 if you count the reti. This extra time makes it possible to write to two or three registers safely, rather than just one. If you don't need all that time, you can make it shorter as well:

Legend

PPU Mode
2	OAM scan
3	Drawing
0	HBlank

CPU operation
	STAT read
	Interrupt dispatch
	Return from interrupt
	VRAM accesses

Scanline cycle	Instruction
107	`ldh a, [rSTAT]`
108
109
110	`and STATF_BUSY`
111	`and STATF_BUSY`
112	`jr nz, .waitVRAM`
113	`jr nz, .waitVRAM`
0
1
2
3
4
5	(...)
...
88
89	`reti`
90
91
92
93	(...)
...
108

This time, I put the `and` and `jr` before the interrupt, so that when the main thread resumes, it's all ready to start writing to VRAM. This interrupt routine is 88 cycles long, including the `reti` (cycles 5 through 92 in the diagram). This won't often prove especially useful though, because you never take any time during HBlank to actually do any register writes. However, you could use this if your routine has a case where it realizes that nothing actually needs to be written, and you can exit earlier.

From those two diagrams, you'll see that the 22 cycles of worst-case HBlank are the time you can use to write to any PPU registers, pop your registers back, and then exit with reti. These 22 cycles are cycle 88 through cycle 109, inclusive.

What if I told you that you could actually have your handler take only 87 cycles? Well, you can!

Legend

PPU Mode
2	OAM scan
3	Drawing
0	HBlank

CPU operation
	STAT read
	Interrupt dispatch
	Return from interrupt
	VRAM accesses

Scanline cycle	Instruction
107	`ldh a, [rSTAT]`
108
109
110	`and STATF_BUSY`
111	`and STATF_BUSY`
112	`jr nz, .waitVRAM`
113	`jr nz, .waitVRAM`
0
1
2
3
4
5	(...)
...
87
88	`reti`
89
90
91
92	(...)
...
107

This seems bad, since the first cycle of the red bar, where the main thread may try to access VRAM, is potentially during the Drawing phase! This is also fine though. All instructions that access memory, whether through an immediate address or using a register pair as a pointer, take multiple cycles to complete. That's because the first cycle of every instruction is used to fetch the operation code itself. The memory access that the instruction performs is always in the 2nd, 3rd or 4th cycle of the instruction. In this situation, the 2nd cycle of the VRAM-accessible time is in HBlank, so this won't actually cause any problems.

But Wait!

The interrupt latency I showed earlier doesn't actually tell the full story. Before it even starts to service the interrupt, the system waits for the current instruction to finish. This is how that might look with the longest allowable routine:

Legend

PPU Mode
2	OAM scan
3	Drawing
0	HBlank

CPU operation
	STAT read
	Interrupt dispatch
	Return from interrupt
	VRAM accesses

Scanline cycle	Instruction
106	`ldh a, [rSTAT]`
107
108
109	`and STATF_BUSY`
110	`and STATF_BUSY`
111	`jr nz, .waitVRAM`
112	`jr nz, .waitVRAM`
113	`call SomeFunc`
0
1
2
3
4
5
6
7
8
9
10	(...)
...
0
1	`reti`
2
3
4
5	(...)
...
14

Here, the first green block shows the system waiting 5 cycles for a call instruction to finish. call is the longest instruction at 6 cycles, so if the interrupt is requested just after it begins, the system will wait 5 cycles for it to complete. This seems bad, since the routine exited after the end of HBlank. However, this is actually fine! Those waiting cycles were not wasted; they were still 5 cycles of work that the main thread got done. So in the end, the main thread still gets its 16 cycles of VRAM-accessible time.

Pros and Cons

Thus far, I have presented two very different methods for making safe LYC handlers, and each has its pros and cons.

Double-Busy-Loop

Pros

does not require all code to be constant-time
does not require tedious cycle-counting
may exit very early if the routine finishes quickly

Cons

does not provide enough HBlank time to safely write multiple registers
if the routine takes too long, it may miss mode 3 and waste an entire scanline before exiting

Cycle-counting

Pros

leaves more time for more complex logic in the routine
allows enough time during blanking to write to up to three registers
never takes longer than one scanline

Cons

requires all code to be constant-time
requires tedious cycle-counting
always takes close to an entire scanline, even if HBlank starts much sooner

This suggests that the double-busy-loop method is good for extremely simple LYC routines that only need to write to one register, or routines that for some reason cannot be cycle-counted. If you need more time for calculations and more time to write to those registers, you can cycle-count your routine.

But what if you could combine both these methods? Enter the Hybrid Cycle-Counted Handler™, a technique I came up with while writing this document.

Combining Approaches

The goal of this method is to combine the maximum HBlank time that cycle-counting delivers, while still exiting early when HBlank is longer. Here is an example. If you've read DeadCScroll, you'll recognise this as that tutorial's STAT Handler, modified to start at Mode 2 rather than HBlank, and be safe towards VRAM accesses in the main thread.

    push af ; 4
    push hl ; 8

    ; obtain the pointer to the data pair
    ldh a, [rLY] ; 11
    inc a ; 12
    add a, a ; 13 ; double the offset since each line uses 2 bytes
    ld l, a ; 14
    ldh a, [hDrawBuffer] ; 17
    adc 0 ; 19
    ld h, a ; 20 ; hl now points to somewhere in the draw buffer

    call UnconditionalRet ; just waste 31 cycles while we wait for HBlank to maybe start
    call UnconditionalRet
    call UnconditionalRet
    nop ; 51

    ; now start trying to look for HBlank to exit early

    ldh a, [rSTAT]
    and STATF_BUSY
    jr z, .setAndExit ; 58

    ldh a, [rSTAT]
    and STATF_BUSY
    jr z, .setAndExit ; 65

    ldh a, [rSTAT]
    and STATF_BUSY
    jr z, .setAndExit ; 72

    ldh a, [rSTAT]
    and STATF_BUSY
    jr z, .setAndExit ; 79

    nop ; waste 4 more cycles since there isn't time for another check
    nop
    nop
    nop ; 83

.setAndExit
    ; set the scroll registers
    ld a,[hl+] ; 85
    ldh [rSCY],a ; 88
    ld a,[hl+] ; 90
    ldh [rSCX],a ; 93

    pop hl ; 96
    pop af ; 99
    reti ; 103

Once the handler finishes its logic, it burns cycles until it reaches the window when HBlank might start. With a 5-cycle offset due to a call, and the longest possible HBlank, the earliest HBlank might start is cycle 54, so that's the first attempt to read STAT. It keeps checking STAT until it knows that, even in the worst-case scenario, HBlank will have started. Then, it uses that time to write the scroll registers and exit. This way, it can still exit early, as long as the HBlank length permits. This routine takes 103 cycles in the worst-case scenario (pop takes 3 cycles, not 4), but may take as few as 79 if HBlank comes sooner.
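The handler above calls UnconditionalRet without showing it. Judging by its name and by the cycle counts in the comments, it is presumably nothing more than a lone ret; the definition below is that assumption spelled out, not code taken from the original handler.

```asm
SECTION "Delay routine sketch", ROM0
; Assumed definition: calling this does nothing except burn cycles.
; `call` takes 6 cycles and `ret` takes 4, so each call costs 10 cycles.
UnconditionalRet:
    ret
```

Three such calls plus one nop add up to the 31 wasted cycles mentioned in the handler's comment.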

The reason that the double-busy-loop method requires checking for Mode 3 but this method does not is that the double-busy-loop method is not cycle-counted, so you might be at the very end of HBlank which is problematic. Since this method is cycle-counted, you know that if HBlank has begun, you are at or near the start of it.

If we make a similar list of pros and cons for this method, this is what it might look like:

Hybrid cycle-counting

Pros

may exit very early if HBlank is longer
allows enough time during blanking to write to up to three registers
never takes longer than one scanline

Cons

requires all code to be constant-time
requires tedious cycle-counting

This method can work well in many circumstances, and is especially suited to frequent effects that modify multiple registers and need to exit quickly to avoid taking too much CPU time. This method can even work reasonably well when used on every scanline through the Mode 2 interrupt.
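If you want to try the every-scanline variant, the STAT source has to be switched from LYC=LY to Mode 2. A minimal sketch follows, assuming hardware.inc's STATF_MODE10 name for that bit; the routine name is made up for illustration.

```asm
; Sketch only: constant names are assumed to come from hardware.inc.
INCLUDE "hardware.inc"

SECTION "Mode 2 interrupt setup sketch", ROM0
; Hypothetical helper.
EnableMode2Interrupt:
    ld a, STATF_MODE10  ; request a STAT interrupt whenever OAM scan (Mode 2) begins
    ldh [rSTAT], a      ; this replaces any other STAT source, including LYC=LY
    ret
```

The STAT interrupt still has to be enabled in IE, exactly as with the LYC source.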

All three of these methods can generate great-looking effects, but I think the third one is an especially attractive option.

Congrats! You made it to the end of the tutorial! I bet you're tired of reading it, and I'm tired of writing it too. So thanks for reading, see you next time!