Performance¶
ScrollKit runs the same code in two places that could not be more different. On your desktop the simulator renders in pygame at whatever speed your laptop can manage, which is to say instantly. On the actual hardware, an Adafruit MatrixPortal S3, that same code runs as interpreted CircuitPython on an ESP32-S3: roughly 100 times slower, with a few megabytes of RAM and no second thread to hide behind. The trap is obvious in hindsight and catches everyone anyway. Your app looks gorgeous in the simulator, you flash it, and the scrolling text moves like a slideshow.
That gap is the price of writing in Python. Python instead of C++ is a deliberate trade: you give up raw speed and get back code one person can write, read, and run unchanged on a desktop. The library spends most of its design budget buying that speed back where it counts. This page is the cost model behind those decisions, measured on real hardware, plus the one war story that explains why the rules exist.
These numbers are measured, not guessed
Every figure on this page comes from a microbenchmark suite run on a real
MatrixPortal S3 (CircuitPython 9.1.0). scrollkit.dev.performance_guide()
returns the live table; device_benchmarks.json and
matrixportal_s3_baseline.json hold the raw captures. They can't drift from
the prose, because they aren't prose. Other supported boards carry their own
profile; an uncalibrated board uses a labeled estimate until a baseline is
captured (see Adding New Hardware).
The frame budget¶
The device targets 20 frames per second. That is your whole allowance: 50 milliseconds per frame, and not a millisecond more if you want the motion to stay smooth.
Some of that is already spent before your code does anything. Every single frame
pays for one display.refresh(), which pushes the framebuffer out to the panel.
At the default bit_depth=4 that costs about 4.5 ms. It is a floor you cannot get
under, a tax on simply being powered on. Whatever your app actually does (move
text, paint a transition, fly a swarm of birds across the screen) has to fit in
what's left.
The strict feasibility gate enforces this for you. Run an app through
run_headless(app, strict=True) and any effect that busts the 50 ms budget raises
FeasibilityError before it ever reaches a board. More on that at the end.
Where the time goes¶
The entire performance story is one comparison: a call that runs in C versus a loop that runs in interpreted Python. The spread is not subtle.
Writing a single pixel, three ways:
| Operation | Cost per pixel | |
|---|---|---|
bitmap[x, y] = 1 (interpreted) |
~7,000 ns | the trap |
bitmaptools.blit (C) |
~620 ns | ~11× faster |
bitmap.fill (C) |
~4.4 ns | ~1,600× faster |
Read that bottom row again. Filling a region with a C call is about 1,600 times
faster per pixel than setting pixels one at a time in a Python loop. Clearing the
whole 64×32 panel (2,048 pixels) is one bitmap.fill at roughly 9 microseconds.
The same clear written as a nested Python loop is about 14 milliseconds, which is
most of your frame gone before you've drawn anything.
Refresh cost depends on color depth, and the cliff is steep:
display.refresh() |
Frame time | FPS ceiling |
|---|---|---|
bit_depth ≤ 4 |
~4.5 ms | ~220 |
bit_depth 6 |
~13.7 ms | ~73 |
Two extra bits of color triple the refresh cost. That is why the library defaults
to bit_depth=4 and tells you to leave it there unless you genuinely need smooth
gradients.
Compute is the quiet budget killer. CircuitPython manages roughly 500,000 simple Python operations per second, so a thousand-operation calculation costs about 1.5 ms straight out of your frame. There is no background thread to hide it in; the runtime is cooperative, so any math you do is math the display loop is not doing. A few hundred lines of arithmetic per frame and you are competing with the rendering you are trying to feed.
Allocation carries a tax of its own. A Bitmap costs ~120 µs to create, a TileGrid ~48 µs, a small bytearray ~74 µs, and every object you allocate is an object the garbage collector will later walk, a pause that runs from 0.3 to 0.9 ms depending on how much you've churned. Do any of that once per frame and you've signed up for a recurring bill.
RAM, for once, is rarely the thing that bites. The ESP32-S3's PSRAM leaves roughly 1.5 MB free to your app, which swallows the web server and the data updates without complaint. On this board, time is the scarce resource, not memory.
How the library works around it¶
Every rule below is just one of those numbers turned into a habit.
- Reuse a
Label; change.textonly when the value changes. Rebuilding a glyph bitmap means setting its pixels one at a time, the 7,000-ns-per-pixel path, and it is the single largest per-frame cost in a busy app. To scroll, move the label's.xand leave.textalone.UnifiedDisplaykeeps a label pool and does this for you, so don't allocate your own Label every frame. - Never push pixels in a Python loop. Use
bitmap.fillfor solid regions andbitmaptools.blitto copy. The transitions all paint through one preallocatedOverlayMaskwith bounded, bulkfill_regioncalls, never a 2,048-pixel loop. - Keep
bit_depth=4. It is the default, and it refreshes three times faster thanbit_depth=6. - Don't allocate per frame. Build your Bitmaps, TileGrids, and Groups once, then mutate them. Allocation costs tens of microseconds each and feeds the GC pauses you'll pay for later.
- Chunk heavy compute across frames. A long calculation belongs in pieces, spread over several frames, the same way a slow HTTP fetch (synchronous on this device) should render a "loading" frame before it blocks the loop.
None of this is exotic. It is the same advice you'd give for any tight render loop, made non-negotiable by a chip that punishes the lazy version by a factor of a thousand.
Case study: rewriting the swarm¶
The rules above sound abstract until something violates all of them at once. That something was the swarm.
The first swarm effect was a boids flock: 150 birds obeying the classic three rules (separation, alignment, cohesion), assembling text out of captured pixels as they flew over it. It was beautiful, and it only ever ran on the desktop, off precomputed paths, because computing it live would have buried the device. The cost was structural. Each frame ran three separate O(n²) neighbor passes, one per flocking rule, and then an O(birds × pixels) scan to decide which pixels each bird had captured. With 150 birds and a few hundred target pixels, that is tens of thousands of Python operations per frame before a single pixel is drawn. On a chip that does 500,000 operations a second, you can do that arithmetic yourself.
So it was rewritten from the algorithm up (commit a152267). The three neighbor
passes became one combined pass that computes all three rules together, using
squared distances so the radius checks never pay for a sqrt. The per-frame capture
scan became O(1): each bird pulls its target from a pre-shuffled queue and lights it
on arrival, no searching. And the drawing dropped its per-pixel loops for bulk
bitmap.fill. Same effect, same flocking, completely different cost.
The payoff, measured on an S3 with refresh included:
| Birds | Frame time | Verdict |
|---|---|---|
| 14 | ~25 ms | the safe default |
| 20 | ~34 ms | fine |
| 28 | ~48 ms | right at the 20 fps limit |
| 100 | ~600 ms | about 1.6 fps |
Cost still grows with the square of the bird count (the neighbor pass is irreducibly pairwise), which is why the on-device default is 14 birds and the ceiling sits around 28. The desktop simulator has no such limit, so crank it up for a screenshot. The lesson is the one this whole page keeps making: the right algorithm plus a C bulk call turned an effect that took 0.6 seconds a frame into one that takes 25 milliseconds. Nothing about the birds changed. Only the cost did.
Catch it before you flash¶
The point of all this is that you don't have to find out on hardware. The simulator models the device's speed, so you can fail fast on your laptop.
from scrollkit.dev import run_headless, validate
result = run_headless(app, frames=120, strict=True) # raises FeasibilityError if over budget
print(result.as_text()) # estimated hardware FPS + warnings
report = validate(app) # structured issues, each with a fix
print(report.ok)
strict=True turns the 50 ms budget into a hard wall: a sustained over-budget run,
a catastrophic single frame, or a RAM breach all raise FeasibilityError. Without
it, the RunResult still reports an estimated hardware FPS and any stutter
warnings, so you see the cliff coming even when you don't enforce it.
And when a number won't land, feel it. Build the display with throttle=True (or
set SCROLLKIT_HW_THROTTLE=1) and the simulator window crawls at the modeled
hardware speed, complete with console nags when a frame would stutter. Watching the
thing limp is more persuasive than any table on this page.
from scrollkit.display.simulator import SimulatorDisplay
SimulatorDisplay(width=64, height=32, throttle=True) # implies hardware timing
See also¶
- Theatrical Transitions lists the per-transition pixel and frame budgets: the same cost model applied effect by effect.
- Characterful Scrolling covers the fixed-point, no-allocation scroller internals.
- The Simulator documents hardware-timing mode and how the device model is calibrated.
- Adding New Hardware covers running ScrollKit on other boards and capturing a per-board performance baseline.
AGENTS.mdat the repo root is the condensed version of this page for an AI agent building apps, with the same measured numbers.