Troubleshooting
Common failure modes and recovery procedures for the Virtufin API Gateway.
See also: Development guide → Troubleshooting for service-startup, port-conflict, and protoc-generation issues. Deployment guide → Troubleshooting for Kubernetes pod, Dapr sidecar, and image-pull issues.
Operation-level failures
These failures occur after the service is running, typically surfacing as gRPC status codes or HTTP 5xx responses.
UNAVAILABLE from a backend service call
Symptom: Invoke or ListMethods returns StatusCode.Unavailable for a
specific backend service.
Likely causes:
- The backend service (WorkManager, WebSocketManager) is not running, or is on
the wrong port
- The services.json config has the wrong host:port for the backend
- The Dapr sidecar is not running alongside the backend (no daprd container
in docker ps, no dapr_runtime_* log lines)
- The backend is in another namespace/pod and DNS resolution is failing
Diagnose:
# 1. Verify the backend is listening
grpcurl -plaintext <backend-host>:<grpc-port> list
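# 2. Check Dapr sidecar logs (on Kubernetes; add -n <namespace> if needed)
kubectl logs <backend-pod> -c daprd --tail=50
# With plain Docker, the sidecar is its own container:
docker logs --tail 50 <daprd-container>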
# 3. Check the API Gateway logs for the call stack
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-api | grep -i "<service-name>"
# 4. Verify the services.json entry matches the backend's actual host:port
jq '.services[] | select(.name=="<service-name>")' config/services.json
INVALID_ARGUMENT on Subscribe
Symptom: The Subscribe RPC returns immediately with
StatusCode.InvalidArgument and the message "At least one filter must be set".
Likely cause: The request has empty services, topics, and eventTypes
arrays. The filter validation requires at least one non-empty filter.
Fix: Set at least one of services, topics, or eventTypes to a
non-empty value.
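A client-side pre-check can catch this before the RPC is sent. The sketch below assumes the request is available as JSON using the field names above and that jq is installed; has_filter is a hypothetical helper name, not part of the gateway.

```shell
# has_filter: read a JSON Subscribe request on stdin.
# Exit 0 if at least one of services, topics, or eventTypes is a non-empty
# array; exit 1 if all three are empty or missing.
# Hypothetical helper for illustration only.
has_filter() {
  jq -e '[.services, .topics, .eventTypes] | map(. // [] | length) | add > 0' >/dev/null
}
```

For example, piping `{"topics":["<topic-name>"]}` into has_filter exits 0, while a request whose three arrays are all empty exits 1, mirroring the gateway's rejection.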
RESOURCE_EXHAUSTED on SaveState
Symptom: SaveState returns StatusCode.ResourceExhausted with a message
about state-store capacity.
Likely cause: The Dapr state store (Redis/Valkey) is full. The
default maxmemory-policy in redis.conf is noeviction, so writes fail when
memory is exhausted.
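Fix: Increase the memory available to the Redis instance, or set
maxmemory-policy allkeys-lru to evict the least recently used keys. Eviction
deletes stored state without warning, so prefer adding capacity when the state
must survive.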
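Before changing anything, it can help to confirm that memory pressure is really the cause. The sketch below assumes direct redis-cli access to the state store; redis_mem_pct is a hypothetical helper that reads the used_memory and maxmemory fields from INFO memory output, and REDIS_HOST is a placeholder.

```shell
# redis_mem_pct: read `INFO memory` output on stdin and print used memory as
# a whole-number percentage of maxmemory, or "unlimited" when maxmemory is 0
# (no limit configured). Hypothetical helper for illustration only.
redis_mem_pct() {
  awk -F: '
    $1 == "used_memory" { used = $2 + 0 }
    $1 == "maxmemory"   { max = $2 + 0 }
    END {
      if (max == 0) { print "unlimited"; exit }
      printf "%d\n", used * 100 / max
    }'
}

# Usage against a live instance:
#   redis-cli -h "$REDIS_HOST" INFO memory | redis_mem_pct
#   redis-cli -h "$REDIS_HOST" CONFIG GET maxmemory-policy
#   redis-cli -h "$REDIS_HOST" CONFIG SET maxmemory-policy allkeys-lru
```

A value close to 100 is consistent with the failure described here. A policy changed with CONFIG SET lasts only until the instance restarts unless it is also written to redis.conf.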
DEADLINE_EXCEEDED on Invoke
Symptom: Invoke returns StatusCode.DeadlineExceeded after the configured timeout.
Likely cause: The backend method itself is slow (not the network).
Invoke has a 30-second default timeout inherited from the gRPC client. The
underlying backend may be doing expensive work (e.g., loading a large Python
worker code module).
Fix: Increase the per-RPC timeout (configurable via
Grpc:DefaultTimeoutSeconds in appsettings.json) or optimize the backend
operation.
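One way to check that the time is spent in the backend rather than on the gateway path is to call the backend directly with a generous deadline and measure how long it takes. This is a minimal sketch: timed is a hypothetical helper, and BACKEND_ADDR and FULL_METHOD are placeholders for values not given in this guide.

```shell
# timed: run a command, report wall-clock seconds on stderr, and return the
# command's own exit code. Hypothetical helper for illustration only.
timed() {
  local start rc
  start=$(date +%s)
  "$@"
  rc=$?
  echo "elapsed=$(( $(date +%s) - start ))s exit=$rc" >&2
  return "$rc"
}

# Usage: call the backend directly, allowing up to 120 seconds.
#   timed grpcurl -plaintext -max-time 120 "$BACKEND_ADDR" "$FULL_METHOD"
```

If the direct call also takes close to or longer than the gateway timeout, the backend operation is the bottleneck and raising Grpc:DefaultTimeoutSeconds only hides it. If the direct call is fast, the delay is more likely somewhere between the gateway and the backend.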
State store failures
State changes not persisting
Symptom: SaveState returns success but the next GetState returns
the old value, or GetAllState returns no values.
Likely cause: The state-store key is not in the registered-keys set.
GetAllState only returns values for keys that have been registered via
RegisterKeys or previously used via SaveState/GetState. Keys created by
external tools (e.g., redis-cli SET) are not in the tracking table.
Fix:
# List registered keys for a service
curl http://localhost:5001/v1/state/<service-name>
# Manually register a key
curl -X POST http://localhost:5001/v1/state/register-keys \
  -H "Content-Type: application/json" \
  -d '{"service":"<service-name>","keys":["<key-name>"]}'
Pub/Sub failures
Subscriptions stop receiving events
Symptom: A client subscribed via the Pubsub.Subscribe RPC stops receiving
events after a few minutes.
Likely cause: The subscription is being reaped by
SubscriptionHealthSweeperHostedService. The sweeper runs every 5 minutes and
removes subscriptions that have carried no traffic for 5 minutes. Empty-Data
heartbeats (added in D10) should keep a subscription alive, so the most common
cause is a network glitch that interrupts the heartbeats long enough for the
sweeper to mark a healthy subscription as dead.
Diagnose:
# Check the sweeper logs
kubectl logs -n virtufin -l app.kubernetes.io/name=virtufin-api | grep -iE "sweeper|heartbeat"
# Check the subscription count
curl http://localhost:5001/health
Fix: Restart the client subscription; the next sweep will recreate the health record.
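Clients that need to survive a reaped subscription can re-subscribe automatically instead of waiting for a manual restart. The following is a sketch under stated assumptions: backoff_delay and resubscribe_loop are hypothetical helpers, and the command passed to resubscribe_loop stands in for whatever client process holds the Subscribe stream open and exits when the stream ends.

```shell
# backoff_delay ATTEMPT [BASE] [CAP]: print the wait in seconds before retry
# number ATTEMPT (0-based), doubling from BASE and never exceeding CAP.
backoff_delay() {
  local attempt base cap delay i
  attempt=$1
  base=${2:-1}
  cap=${3:-60}
  delay=$base
  i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$(( delay * 2 ))
    i=$(( i + 1 ))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay=$cap
  fi
  echo "$delay"
}

# resubscribe_loop CMD [ARGS...]: run the subscribing command, and whenever
# it exits, wait with exponential backoff and start it again. Runs forever.
resubscribe_loop() {
  local attempt
  attempt=0
  while true; do
    "$@"
    echo "subscription ended, retrying (attempt $attempt)" >&2
    sleep "$(backoff_delay "$attempt")"
    attempt=$(( attempt + 1 ))
  done
}
```

With the defaults, the waits grow as 1, 2, 4, 8 seconds and so on up to 60 seconds, which keeps a client from hammering the gateway while it is restarting.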
Recovery procedures
Restart the gateway after a state-store outage
If the Dapr state store is unavailable for an extended period, the gateway may
have cached stale state. Restart all gateway replicas to force a fresh
ServicesConfigurationLoader.LoadAsync and rebuild the gRPC channel pool:
kubectl rollout restart deployment/virtufin-api -n virtufin
kubectl rollout status deployment/virtufin-api -n virtufin
Reset all in-flight subscriptions
If subscriptions are wedged (heartbeat lag, broken state), force a clean state:
# Restart all gateway replicas (closes all subscriptions)
kubectl rollout restart deployment/virtufin-api -n virtufin
# Wait for old replicas to terminate
kubectl rollout status deployment/virtufin-api -n virtufin
# Clients must re-subscribe
Reporting issues
If none of the above resolves your issue, gather the following and file a Gitea issue:
- API Gateway version (LIBRARY_VERSION from the build info endpoint)
- services.json config (redact any sensitive values)
- Full error message and gRPC status code
- Relevant log lines from the gateway and the affected backend service