Troubleshooting

Common failure modes and recovery procedures for the Virtufin API Gateway.

See also: Development guide → Troubleshooting for service-startup, port-conflict, and protoc-generation issues. Deployment guide → Troubleshooting for Kubernetes pod, Dapr sidecar, and image-pull issues.

Operation-level failures

These failures occur after the service is running, typically surfacing as gRPC status codes or HTTP 5xx responses.

`UNAVAILABLE` from a backend service call

Symptom: Invoke or ListMethods returns StatusCode.Unavailable for a specific backend service.

Likely causes:
- The backend service (WorkManager, WebSocketManager) is not running, or is listening on a different port
- The services.json config has the wrong host:port for the backend
- The Dapr sidecar is not running alongside the backend (no daprd container in docker ps, no dapr_runtime_* log lines)
- The backend is in another namespace or pod and DNS resolution is failing

Diagnose:

# 1. Verify the backend is listening
grpcurl -plaintext <backend-host>:<grpc-port> list

# 2. Check Dapr sidecar logs (Kubernetes; for a local Docker setup use: docker logs --tail 50 <daprd-container>)
kubectl logs <backend-pod> -c daprd --tail=50

# 3. Check the API Gateway logs for the call stack
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-api | grep -i "<service-name>"

# 4. Verify the services.json config matches
jq '.services[] | select(.name=="<service-name>")' config/services.json
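
The first and last causes in the list above (nothing listening on the port, DNS not resolving) can be told apart with a plain TCP probe before reaching for grpcurl. The following is a minimal sketch; the probe function and its return labels are hypothetical helpers, not part of the gateway or of Dapr.

```python
import socket


def probe(host, port, timeout_s=2.0):
    """Classify a TCP connection attempt to host:port (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "ok"                # something is listening on that port
    except socket.gaierror:
        return "dns-failure"           # host name did not resolve
    except ConnectionRefusedError:
        return "refused"               # host reachable, nothing listening
    except (socket.timeout, TimeoutError):
        return "timeout"               # no answer within timeout_s
    except OSError as exc:
        return f"error: {exc}"         # any other socket-level failure
```

A result of "ok" only shows that the port accepts connections; it does not prove that the expected gRPC service is behind it, so the grpcurl check in step 1 is still needed.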

`INVALID_ARGUMENT` on `Subscribe`

Symptom: The Subscribe RPC returns immediately with StatusCode.InvalidArgument and the message "At least one filter must be set".

Likely cause: The request has empty services, topics, and eventTypes arrays. The filter validation requires at least one non-empty filter.

Fix: Set at least one of services, topics, or eventTypes to a non-empty value.
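
The rule can be pictured with a short Python sketch. The SubscribeRequest class and the has_filter helper are hypothetical stand-ins rather than the gateway's actual types; only the three filter names and the error message come from the text above. Whether a list containing only empty strings counts as "set" is not stated, so the sketch treats any non-empty list as a filter.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubscribeRequest:
    # Hypothetical model of the request; field names follow the filters above.
    services: List[str] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)
    event_types: List[str] = field(default_factory=list)


def has_filter(req: SubscribeRequest) -> bool:
    """Return True when at least one of the three filter lists is non-empty."""
    return bool(req.services or req.topics or req.event_types)


def validate(req: SubscribeRequest) -> None:
    """Raise with the message quoted above when no filter is set."""
    if not has_filter(req):
        raise ValueError("At least one filter must be set")
```

For example, SubscribeRequest() fails validation, while SubscribeRequest(topics=["orders"]) passes (the topic name is illustrative).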

`RESOURCE_EXHAUSTED` on `SaveState`

Symptom: SaveState returns StatusCode.ResourceExhausted with a message about state-store capacity.

Likely cause: The Dapr state store (Redis/Valkey) is full. The default maxmemory-policy in redis.conf is noeviction, so writes fail when memory is exhausted.

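Fix: Increase the Redis instance size, or set maxmemory-policy allkeys-lru so that Redis evicts the least recently used keys. Eviction silently discards stored state, so choose it only when the evicted state can be rebuilt.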
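
The trade-off between the two policies can be illustrated with a toy model. This is a sketch only: ToyStore is hypothetical, it caps the number of keys rather than bytes of memory, and it uses exact LRU ordering, whereas Redis approximates LRU by sampling.

```python
from collections import OrderedDict


class ToyStore:
    """Toy capped key-value store; a simplified model, not Redis itself."""

    def __init__(self, max_keys, policy):
        self.max_keys = max_keys
        self.policy = policy          # "noeviction" or "allkeys-lru"
        self.data = OrderedDict()     # ordered from least to most recently used

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)    # mark as most recently used
        return self.data.get(key)

    def set(self, key, value):
        if key not in self.data and len(self.data) >= self.max_keys:
            if self.policy == "noeviction":
                raise MemoryError("store full: write rejected")
            self.data.popitem(last=False)  # drop the least recently used key
        self.data[key] = value
        self.data.move_to_end(key)
```

With noeviction a write to a full store fails, which corresponds to the ResourceExhausted symptom. With allkeys-lru the write succeeds, but a key that has not been touched recently disappears without any error.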

`DEADLINE_EXCEEDED` on `Invoke`

Symptom: Invoke returns StatusCode.DeadlineExceeded after the configured timeout.

Likely cause: The backend method itself is slow (not the network). Invoke has a 30-second default timeout inherited from the gRPC client. The underlying backend may be doing expensive work (e.g., loading a large Python worker code module).

Fix: Increase the per-RPC timeout (configurable via Grpc:DefaultTimeoutSeconds in appsettings.json) or optimize the backend operation.
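
The colon in Grpc:DefaultTimeoutSeconds is the usual .NET configuration separator for nested keys. Assuming the gateway reads appsettings.json through the standard .NET configuration system, the sketch below shows the nested JSON shape that key corresponds to. The to_appsettings helper is hypothetical and the value 60 is only an illustration.

```python
import json


def to_appsettings(key_path, value):
    """Expand a colon-separated configuration key into nested dictionaries."""
    node = value
    for part in reversed(key_path.split(":")):
        node = {part: node}
    return node


# Prints the fragment to merge into appsettings.json (60 is an example value).
print(json.dumps(to_appsettings("Grpc:DefaultTimeoutSeconds", 60), indent=2))
```

The result is a "Grpc" object containing a "DefaultTimeoutSeconds" property. Under the same assumption, the setting could also be supplied as the environment variable Grpc__DefaultTimeoutSeconds, since .NET maps a double underscore to the colon separator.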

State store failures

State changes not persisting

Symptom: SaveState returns success but the next GetState returns the old value, or GetAllState returns no values.

Likely cause: The state-store key is not in the registered-keys set. GetAllState only returns values for keys that have been registered via RegisterKeys or previously used via SaveState/GetState. Keys created by external tools (e.g., redis-cli SET) are not in the tracking table.

Fix:

# List registered keys for a service
curl http://localhost:5001/v1/state/<service-name>

# Manually register a key
curl -X POST http://localhost:5001/v1/state/register-keys \
  -H "Content-Type: application/json" \
  -d '{"service":"<service-name>","keys":["<key-name>"]}'
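
The behavior described under "Likely cause" can be modeled in a few lines. This is a sketch of the rule as stated there, not the gateway's implementation: ToyStateApi and its method names are hypothetical, and per-service scoping is left out for brevity.

```python
class ToyStateApi:
    """Toy model of the registered-keys rule (hypothetical, single service)."""

    def __init__(self):
        self.backend = {}         # stands in for the Redis/Valkey state store
        self.registered = set()   # tracking table of known keys

    def save_state(self, key, value):
        self.backend[key] = value
        self.registered.add(key)  # using a key registers it

    def get_state(self, key):
        self.registered.add(key)  # reading a key registers it as well
        return self.backend.get(key)

    def register_keys(self, keys):
        self.registered.update(keys)

    def get_all_state(self):
        # Only registered keys are returned, even if the store holds more.
        return {k: self.backend[k] for k in self.registered if k in self.backend}
```

In this model a value written straight into backend (the equivalent of redis-cli SET) is missing from get_all_state() until register_keys or get_state has been called for that key.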

Pub/Sub failures

Subscriptions stop receiving events

Symptom: A client subscribed via the Pubsub.Subscribe RPC stops receiving events after a few minutes.

Likely cause: The subscription is being reaped by the SubscriptionHealthSweeperHostedService. The sweeper runs every 5 minutes and removes subscriptions that have not sent any traffic in the last 5 minutes. Empty-Data heartbeats (added in D10) should keep a subscription alive, so the most common cause is a network glitch that makes the sweeper mark a healthy subscription as dead.

Diagnose:

# Check the sweeper logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-api | grep -iE "sweeper|heartbeat"

# Check the subscription count
curl http://localhost:5001/health

Fix: Restart the client subscription; the next sweep will recreate the health record.
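
The timing of the sweeper rule can be sketched as follows. Only the two 5-minute values come from the description above; the function names, the data layout, and the choice of an inclusive comparison at exactly 5 minutes are assumptions made for illustration.

```python
SWEEP_INTERVAL_S = 300   # the sweeper runs every 5 minutes
IDLE_LIMIT_S = 300       # subscriptions idle for 5 minutes are removed


def sweep(last_seen, now):
    """Return the ids the sweeper would remove at time `now` (seconds)."""
    return sorted(sid for sid, seen in last_seen.items()
                  if now - seen >= IDLE_LIMIT_S)


def removal_time(last_traffic_at, first_sweep_at):
    """Return the time of the first sweep that removes an idle subscription."""
    t = first_sweep_at
    while t - last_traffic_at < IDLE_LIMIT_S:
        t += SWEEP_INTERVAL_S
    return t
```

Under these assumptions a subscription whose last traffic (or heartbeat) arrived at second 1 survives the sweep at second 300 and is removed at second 600, so a subscription that has gone quiet disappears somewhere between 5 and just under 10 minutes later. A heartbeat that arrives in time resets the clock: sweep({"a": 280}, 300) removes nothing.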

Recovery procedures

Restart the gateway after a state-store outage

If the Dapr state store is unavailable for an extended period, the gateway may have cached stale state. Restart all gateway replicas to force a fresh ServicesConfigurationLoader.LoadAsync and rebuild the gRPC channel pool:

kubectl rollout restart deployment/virtufin-api -n virtufin
kubectl rollout status deployment/virtufin-api -n virtufin

Reset all in-flight subscriptions

If subscriptions are wedged (heartbeat lag, broken state), force a clean state:

# Restart all gateway replicas (closes all subscriptions)
kubectl rollout restart deployment/virtufin-api -n virtufin

# Wait for old replicas to terminate
kubectl rollout status deployment/virtufin-api -n virtufin

# Clients must re-subscribe

Reporting issues

If none of the above resolves your issue, gather the following and file a Gitea issue:

- API Gateway version (LIBRARY_VERSION from the build info endpoint)
- services.json config (redact any sensitive values)
- Full error message and gRPC status code
- Relevant log lines from the gateway and the affected backend service