Multi-PSTI architecture: how we guarantee 99.99% uptime on PIX — and what happens when a PSTI goes down
Our PIX infrastructure connects to 3 PSTIs simultaneously with automatic failover in < 1s. Circuit breakers, store-and-forward queues, health scoring, and the engineering decisions behind a payment system that hasn't dropped a single PIX transaction in 18 months.

The single point of failure problem
Most payment processors connect to a single PSTI (Prestador de Serviços de Tecnologia da Informação). The PSTI is the bridge between your institution and BACEN's SPI (Sistema de Pagamentos Instantâneos) — the central infrastructure that actually settles PIX transactions.
If your PSTI goes down, your PIX goes down. Every outbound payment fails. Every inbound payment bounces. Your customers see "payment failed" and move on to your competitor. For a payment processor handling thousands of transactions per minute, a 30-minute PSTI outage means tens of thousands of failed payments, hundreds of angry customers, and a compliance incident that BACEN will ask you about.
And PSTIs do go down. Regularly.
In 2024 alone, we observed 47 PSTI degradation events across the major providers — ranging from 2-minute blips to 4-hour full outages. Some were planned maintenance windows (communicated 48 hours in advance, during low-traffic hours). Others were unplanned — infrastructure failures, DDoS attacks, certificate expirations, SPI connectivity issues.
The question isn't whether your PSTI will have an outage. It's what happens to your payments when it does.
Why most institutions are vulnerable
The single-PSTI architecture
The typical PIX integration looks like this:
Your system → PSTI → SPI (BACEN)
One connection. One provider. One point of failure.
When the PSTI is healthy, this works fine. Latency is low (typically 200-500ms for a complete PIX cycle). The integration is simple — one API to learn, one SLA to monitor, one contract to manage.
But "works fine when healthy" isn't an architecture — it's a prayer.
The manual failover approach
Some institutions have a backup PSTI on paper. When the primary goes down, someone (usually an on-call engineer at 3 AM) manually switches traffic to the backup. The process typically looks like:
1. Alert fires: primary PSTI is unresponsive (5-10 minutes to detect)
2. On-call engineer wakes up, reads the alert, SSHs into the server (5-15 minutes)
3. Engineer changes the PSTI endpoint configuration (2-5 minutes)
4. Traffic starts flowing to backup PSTI (immediate)
5. 30-60 minutes later, primary recovers, engineer switches back
Total downtime: 15-30 minutes minimum. During peak hours (lunch, salary days, Friday evenings), that's catastrophic. A marketplace processing R$ 500K/hour in PIX just lost R$ 125K-250K in GMV — plus the trust damage.
The "hot standby" approach
More sophisticated institutions keep a secondary PSTI connection warm — technically connected but not processing traffic. When the primary fails, a health check detects the failure and an automated script switches traffic.
Better, but still problematic:
- Detection latency: Health checks typically run every 30-60 seconds. A PSTI that fails between checks isn't detected until the next check.
- Cold path: The standby connection hasn't processed real traffic. DNS caches may be stale. Connection pools may have timed out. TLS sessions may need renegotiation. The first few transactions on the standby path often fail.
- No traffic splitting: All traffic moves at once. If the standby PSTI can't handle your full volume, you've just created a cascading failure.
Our multi-PSTI approach: active-active-standby
Revenu connects to 3 PSTIs simultaneously:
- JD (Jdcloud) — Primary. Handles ~70% of traffic during normal operation.
- C&M (Celcoin & Matera) — Secondary. Handles ~30% of traffic during normal operation.
- Matera (SAF) — Warm standby. Processes a small percentage of traffic continuously to keep the path warm.
This isn't failover. It's active load distribution with dynamic rebalancing.
Why active-active matters
By routing real traffic through multiple PSTIs simultaneously, we solve the cold-path problem. Every PSTI connection is exercised continuously. Connection pools are warm. DNS caches are fresh. TLS sessions are active. When we need to shift traffic, the receiving PSTI is already processing — we just send it more.
Dynamic traffic distribution
Traffic distribution is controlled by a health-weighted router. Each PSTI has a health score from 0 to 100, calculated from:
- Latency (p50, p95, p99): How fast is the PSTI responding?
- Error rate: What percentage of requests are failing?
- Timeout rate: What percentage of requests are timing out?
- SPI connectivity: Is the PSTI successfully reaching BACEN's SPI?
- DICT availability: Can the PSTI resolve PIX keys via DICT?
The router updates health scores every 5 seconds and adjusts traffic proportionally. A PSTI with a health score of 90 gets twice the traffic of one scoring 45. A PSTI scoring below 10 gets zero traffic.
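The weighting rule can be sketched in a few lines of Python. The score-proportional split and the cutoff of 10 come from the description above; the function names and the weighted random selection are illustrative, not our actual router:

```python
import random

def route_weights(scores: dict[str, float]) -> dict[str, float]:
    """Convert per-PSTI health scores (0-100) into traffic weights.

    PSTIs scoring below 10 receive no traffic; the rest receive
    traffic proportional to their score.
    """
    eligible = {name: s for name, s in scores.items() if s >= 10}
    total = sum(eligible.values())
    if total == 0:
        return {name: 0.0 for name in scores}
    return {name: eligible.get(name, 0.0) / total for name in scores}

def pick_psti(scores: dict[str, float]) -> str:
    """Weighted random selection of a PSTI for the next payment."""
    weights = route_weights(scores)
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names])[0]
```

With scores of 90, 45, and 8, the first PSTI gets two-thirds of the traffic, the second gets one-third, and the third gets nothing — exactly the proportional behavior described above.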
Circuit breakers: the first line of defense
Each PSTI connection is wrapped in a circuit breaker pattern with three states:
CLOSED (normal operation)
All requests flow through. The circuit breaker monitors error rates in a sliding 30-second window.
OPEN (PSTI is down)
If the error rate exceeds 50% in the 30-second window, or if 5 consecutive requests fail, the circuit opens. All traffic is immediately redirected to healthy PSTIs. No requests are sent to the failed PSTI.
Opening the circuit takes < 500ms from failure detection to traffic redirection. Your customers don't notice. They might see one failed transaction (the one that triggered the circuit opening), but the next one succeeds on a different PSTI.
HALF-OPEN (recovery testing)
After 30 seconds, the circuit breaker enters half-open state. It sends a single probe request to the failed PSTI. If the probe succeeds, the circuit closes and traffic gradually returns. If it fails, the circuit stays open for another 30 seconds.
This prevents the thundering herd problem — if 1,000 transactions were queued and we sent them all to a recovering PSTI simultaneously, we'd likely crash it again.
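A minimal sketch of this state machine, in Python. The thresholds (50% error rate, 5 consecutive failures, 30-second window and cool-off, single probe) are the ones described above; the minimum-volume guard of 10 requests before the error-rate rule kicks in is an assumption we add to avoid opening on a single failure:

```python
import time

class CircuitBreaker:
    """Sketch of a three-state circuit breaker (CLOSED / OPEN / HALF_OPEN)."""

    def __init__(self, clock=time.monotonic):
        self.state = "CLOSED"
        self.clock = clock
        self.events = []              # (timestamp, ok) pairs inside the 30s window
        self.consecutive_failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN" and self.clock() - self.opened_at >= 30:
            self.state = "HALF_OPEN"  # permit exactly one probe request
            return True
        return False                  # OPEN, or HALF_OPEN with a probe in flight

    def record(self, ok: bool) -> None:
        if self.state == "HALF_OPEN":
            if ok:                    # probe succeeded: close and reset
                self.state, self.events, self.consecutive_failures = "CLOSED", [], 0
            else:                     # probe failed: stay open another 30s
                self.state, self.opened_at = "OPEN", self.clock()
            return
        now = self.clock()
        self.events.append((now, ok))
        self.events = [(t, k) for t, k in self.events if now - t <= 30]
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        failures = sum(1 for _, k in self.events if not k)
        error_rate = failures / len(self.events)
        if self.consecutive_failures >= 5 or (len(self.events) >= 10 and error_rate > 0.5):
            self.state, self.opened_at = "OPEN", now
```

The `clock` parameter exists so the breaker can be driven with a fake clock in tests — the same trick we use to exercise the 30-second transitions without waiting 30 seconds.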
Circuit breaker cascade protection
What if 2 PSTIs fail simultaneously? The third PSTI suddenly receives 100% of traffic. Can it handle the load?
We solve this with admission control. When only one PSTI is healthy, the router activates rate limiting to protect the remaining PSTI from overload. Excess transactions are queued in the store-and-forward system (more on this below) and processed when capacity is available.
In practice, we've never had all 3 PSTIs fail simultaneously. But we've tested it. The system degrades gracefully — critical payments (salary, judicial orders) get priority, while lower-priority payments are queued.
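The admission-control gate is essentially a token bucket in front of the last healthy PSTI: admit up to a sustained rate, queue everything else. A sketch, with illustrative rate and burst values (the real limits are sized to each PSTI's contracted capacity):

```python
import time

class TokenBucket:
    """Admission control for the last healthy PSTI: admitted requests go
    through, rejected ones are enqueued in store-and-forward instead."""

    def __init__(self, rate_per_sec: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def try_admit(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # caller enqueues the payment instead of sending it
```

Priority is layered on top: critical payments (salary, judicial orders) draw from a separate, larger bucket, so they are admitted first when capacity is scarce.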
Store-and-forward: zero transaction loss
When a PIX payment can't be delivered to any PSTI (all circuits open, or transient network failure), it enters the store-and-forward queue.
How it works
1. Store: The payment request is persisted to a durable queue (backed by our event store — the same Event Sourcing infrastructure that powers the ledger). The payment status becomes QUEUED.
2. Monitor: A background process watches PSTI health scores every 5 seconds.
3. Forward: When a PSTI recovers (circuit closes), queued payments are forwarded in FIFO order, respecting rate limits to avoid overwhelming the recovering PSTI.
4. Confirm: Each forwarded payment is confirmed via the standard pacs.002 flow. If confirmation fails, the payment re-enters the queue.
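The forward step (3 and 4 above) reduces to a small drain loop. In this sketch, `send`, `healthy`, and `admit` are illustrative stand-ins for the real PSTI client, the circuit-breaker check, and the rate limiter:

```python
from collections import deque

def forward_queued(queue: deque, send, healthy, admit) -> int:
    """Drain the store-and-forward queue in FIFO order.

    `send(payment)` returns True when the pacs.002 confirmation succeeds;
    `healthy()` reports whether any PSTI circuit is closed; `admit()` is
    the rate limiter protecting a recovering PSTI.
    """
    forwarded = 0
    while queue and healthy() and admit():
        payment = queue.popleft()
        if send(payment):
            forwarded += 1
        else:
            queue.append(payment)   # failed confirmation re-enters the queue
            break                   # back off rather than spin on failures
    return forwarded
```

The loop checks health and admission before every dequeue, so a PSTI that degrades again mid-drain simply pauses the drain instead of burning queued payments against a dying connection.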
Durability guarantees
The store-and-forward queue is append-only and crash-safe. If the Revenu process crashes while processing the queue, every queued payment is recovered on restart. No transaction is ever lost — even in the worst-case scenario of simultaneous PSTI failure + Revenu process crash.
Timeout and escalation
Payments in the queue have a configurable TTL (default: 5 minutes for standard PIX, 30 seconds for PIX Saque/Troco). If the TTL expires:
1. The payment status becomes TIMEOUT
2. The buyer is notified that the payment couldn't be processed
3. An incident is created in our monitoring system
4. No funds are debited (the ledger entry is reversed)
In 18 months of production, we've had exactly zero payments reach TTL expiration. The multi-PSTI architecture ensures that at least one path is always available within the TTL window.
Health monitoring: seeing problems before they happen
Synthetic transactions
Every 60 seconds, we send a synthetic pacs.008 through each PSTI — a real PIX payment of R$ 0.01 between two Revenu test accounts. This validates the complete payment path end-to-end: DICT key resolution, pacs.008 submission through the PSTI to SPI, settlement, and the pacs.002 confirmation flowing back.
If a synthetic transaction fails, we know the PSTI is degraded before real customer traffic is affected.
Latency anomaly detection
We track latency percentiles (p50, p95, p99) per PSTI in 1-minute windows. If p95 latency increases by more than 2x compared to the 24-hour rolling average, an anomaly alert fires and the health score is reduced — even if no errors have occurred yet.
This catches degradations that don't manifest as errors. A PSTI might still be responding, but at 3x normal latency — which means customer-facing payments take 3 seconds instead of 1 second. The anomaly detection reduces traffic to that PSTI before customers start complaining.
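The detection rule itself is deliberately simple — compare the current p95 to the 24-hour baseline. A sketch (the fixed score penalty of 30 points is illustrative; the real reduction is proportional to how far latency has drifted):

```python
def latency_anomaly(p95_now_ms: float, p95_24h_avg_ms: float) -> bool:
    """Flag a PSTI whose current p95 latency exceeds 2x its 24-hour
    rolling average, per the rule above."""
    return p95_now_ms > 2 * p95_24h_avg_ms

def degrade_score(score: float, anomaly: bool, penalty: float = 30.0) -> float:
    """Reduce the health score on anomaly — even with a zero error rate."""
    return max(0.0, score - penalty) if anomaly else score
```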
SPI status integration
BACEN publishes SPI operational status. We consume this feed and factor it into PSTI health scores. If SPI itself is degraded (which affects all PSTIs equally), we activate store-and-forward proactively rather than waiting for errors.
DICT integration: PIX key resolution resilience
DICT (Diretório de Identificadores de Contas Transacionais) is BACEN's centralized PIX key directory. When a buyer initiates a PIX payment using a key (CPF, email, phone, EVP), the key must be resolved to an account via DICT.
DICT resolution also depends on the PSTI connection. If a PSTI fails, DICT lookups through that PSTI also fail.
Our DICT caching layer
We maintain a local DICT cache with a TTL of 24 hours for positive lookups (key found) and 5 minutes for negative lookups (key not found). When a key is resolved, the result is cached locally.
If a PSTI fails during DICT resolution, we:
1. Check the local cache first
2. If cache miss, route the DICT query through an alternative PSTI
3. If all PSTIs are down for DICT, return the cached result (if within TTL) or fail with a clear error
This means DICT outages don't cascade into payment failures for keys that have been seen recently.
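The cache layer is small — the interesting part is the asymmetric TTLs, which are the ones stated above (24 hours positive, 5 minutes negative). The class and method names are illustrative:

```python
import time

class DictCache:
    """Local DICT cache: long TTL for keys that resolved, short TTL for
    keys that didn't, so a deleted-then-recreated key isn't hidden long."""

    POSITIVE_TTL = 24 * 3600   # key found: cache for 24 hours
    NEGATIVE_TTL = 5 * 60      # key not found: cache for 5 minutes

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}      # key -> (account_or_None, expires_at)

    def put(self, key: str, account) -> None:
        ttl = self.POSITIVE_TTL if account is not None else self.NEGATIVE_TTL
        self.entries[key] = (account, self.clock() + ttl)

    def get(self, key: str):
        """Return (hit, account); an expired or absent entry is (False, None)."""
        entry = self.entries.get(key)
        if entry is None or self.clock() > entry[1]:
            return (False, None)
        return (True, entry[0])
```

On a PSTI failure, the lookup order from the list above becomes: `cache.get(key)`, then a DICT query via an alternative PSTI (whose result is `put` back), then the cached value as a last resort.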
MED 2.0 and multi-PSTI: fraud chain blocking across providers
PIX's MED 2.0 (Mecanismo Especial de Devolução) requires chain blocking of funds across up to 5 levels of dispersion when fraud is detected. In a multi-PSTI architecture, this creates a coordination challenge.
A fraud notification might arrive via PSTI A, but the funds to be blocked might need to be traced through transactions processed via PSTI B. Revenu's ledger provides a unified view — regardless of which PSTI processed the original transaction, the ledger knows where the funds went.
MED 2.0 chain blocking in our architecture:
1. Fraud notification received (via any PSTI) → domain event emitted
2. Fund tracing — Ledger traces the complete fund flow across all transactions (independent of PSTI)
3. Block execution — Blocking commands are sent through whichever PSTI is healthiest
4. Confirmation — Block confirmations are received and reconciled against the trace
The multi-PSTI architecture actually improves MED 2.0 resilience — if the PSTI that received the fraud notification goes down, blocking commands can still be sent through alternative PSTIs.
PIX Saque and PIX Troco: latency-sensitive operations
PIX Saque (cash withdrawal at merchants) and PIX Troco (cash back with purchase) have stricter latency requirements than standard PIX. The customer is standing at a checkout counter waiting for the transaction to complete. A 5-second delay is unacceptable.
For these transaction types, our multi-PSTI router uses a fastest-response strategy instead of health-weighted distribution:
1. The payment request is sent to the 2 healthiest PSTIs simultaneously
2. The first successful response is used
3. The slower response is discarded (or if both succeed, the duplicate is handled idempotently)
This hedging strategy reduces p99 latency from ~800ms (single PSTI) to ~400ms (multi-PSTI with hedging) for PIX Saque/Troco operations.
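The hedging pattern itself fits in one function. In this sketch the PSTI clients are plain callables, and duplicate responses are assumed to be de-duplicated downstream by the PIX end-to-end ID — both simplifications of the real system:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def hedged_send(payment, clients, timeout=2.0):
    """Fire the payment at the N healthiest PSTIs at once and return the
    first successful response; failures fall through to the slower PSTI."""
    pool = ThreadPoolExecutor(max_workers=len(clients))
    futures = {pool.submit(client, payment) for client in clients}
    try:
        while futures:
            done, futures = wait(futures, timeout=timeout, return_when=FIRST_COMPLETED)
            if not done:
                break                      # overall hedging timeout
            for fut in done:
                try:
                    return fut.result()    # first success wins
                except Exception:
                    continue               # that PSTI failed; wait on the rest
        raise TimeoutError("no PSTI responded within the hedging timeout")
    finally:
        pool.shutdown(wait=False)          # let the slower request finish quietly
```

Because the pool is shut down without waiting, the losing request completes in the background — its response is simply discarded, as step 3 above describes.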
Observability: knowing what's happening in real-time
Dashboards
Each PSTI has a real-time dashboard showing:
- Current health score
- Traffic distribution percentage
- Latency percentiles (p50, p95, p99)
- Error rate
- Circuit breaker state
- Queue depth (store-and-forward)
- Synthetic transaction success rate
Alerting
Multi-tiered alerting based on severity:
- INFO: PSTI latency anomaly detected, health score reduced
- WARNING: Circuit breaker opened on one PSTI, traffic redistributed
- CRITICAL: Two PSTIs down simultaneously, store-and-forward activated
- EMERGENCY: All PSTIs down, all traffic queued (never triggered in production)
Incident correlation
When a PSTI incident occurs, our system automatically correlates:
- Which transactions were affected
- Which transactions were automatically rerouted
- Which transactions entered the store-and-forward queue
- Customer impact (how many end-users experienced any delay)
This incident report is generated automatically and can be sent to BACEN compliance within minutes.
The numbers from production
After 18 months of multi-PSTI architecture in production:
- 99.997% uptime — measured as "percentage of PIX transactions processed successfully within 2 seconds"
- 0 transactions lost — every payment was either processed or explicitly failed with a clear error
- 47 PSTI degradation events handled automatically — zero required manual intervention
- < 500ms failover time — from circuit open to traffic redirected
- < 400ms p99 latency for PIX Saque/Troco (hedged routing)
- 3 simultaneous PSTI connections — active-active-standby
- R$ 0.01 synthetic transactions every 60 seconds per PSTI
- 24-hour DICT cache — zero DICT-related payment failures during PSTI outages
Why this matters beyond uptime
The multi-PSTI architecture isn't just about availability. It's about trust.
When a marketplace processes R$ 100 million/month in PIX, a 30-minute outage isn't a technical incident — it's a business crisis. Sellers lose sales. Buyers lose trust. The platform loses credibility.
When a bank processes salary payments for 50,000 employees via PIX on the 5th of every month, downtime isn't an option. Those payments must go through — on time, every time.
We built multi-PSTI because we believe payment infrastructure should be invisible. You shouldn't know it's there. It should just work. Every time.
And for 18 months, it has.