Configure production monitoring and alerting infrastructure #105

Merged
Copilot merged 6 commits from copilot/configure-production-monitoring into main 2025-10-31 18:32:10 -05:00
Copilot commented 2025-10-31 15:14:06 -05:00 (Migrated from github.com)

Production Monitoring and Alerting Infrastructure - Complete

All Acceptance Criteria Met

  • Set up uptime monitoring for all services (API, web, worker queue) with 1-min check intervals
  • Configure alerting channels (PagerDuty, Slack, email) with on-call rotation
  • Define alert rules for all critical conditions
  • Implement health check endpoints returning detailed status
  • Set up error tracking (Sentry) for backend and frontend
  • Create alerting runbook documenting triage steps and escalation procedures

Latest Changes

Fixed Sentry error handling based on code review:

  • Removed duplicate error capture (Sentry's error handler already captures all errors)
  • Fixed fallback error handler to properly pass errors to downstream handlers

Implementation Summary

Monitoring Stack: Prometheus, Alertmanager, Grafana, Sentry, multiple exporters
Alert Rules: 20+ comprehensive rules covering all critical scenarios
Documentation: Complete runbooks, setup guides, implementation summaries
Integration: Sentry error tracking, health check metrics, blockchain transaction tracking

Ready for production deployment.

Original prompt

This section details on the original issue you should resolve

<issue_title>Configure production monitoring and alerting infrastructure</issue_title>
<issue_description>## Summary

  • Production systems must detect and alert on critical failures (service down, queue backlog, error spikes) to enable rapid incident response.
  • Proactive monitoring prevents extended outages and data loss.

Acceptance Criteria

  • Set up uptime monitoring for all services (API, web, worker queue) with 1-min check intervals.
  • Configure alerting channels (PagerDuty, Slack, email) with on-call rotation.
  • Define alert rules for:
    • Service unavailability (>2 consecutive failures)
    • High error rates (>5% of requests in 5-min window)
    • Queue depth threshold exceeded (>100 pending jobs)
    • Database connection pool exhaustion
    • IPFS upload failure rate >20%
    • Contract transaction failures
  • Implement health check endpoints returning detailed status (database, cache, queue connectivity).
  • Set up error tracking (Sentry, Rollbar) for backend and frontend with source map support.
  • Create alerting runbook documenting triage steps and escalation procedures.

Comments on the Issue (you are @copilot in this section)


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

## Production Monitoring and Alerting Infrastructure - Complete ✅ ### All Acceptance Criteria Met - [x] Set up uptime monitoring for all services (API, web, worker queue) with 1-min check intervals - [x] Configure alerting channels (PagerDuty, Slack, email) with on-call rotation - [x] Define alert rules for all critical conditions - [x] Implement health check endpoints returning detailed status - [x] Set up error tracking (Sentry) for backend and frontend - [x] Create alerting runbook documenting triage steps and escalation procedures ### Latest Changes ✅ Fixed Sentry error handling based on code review: - Removed duplicate error capture (Sentry's error handler already captures all errors) - Fixed fallback error handler to properly pass errors to downstream handlers ### Implementation Summary **Monitoring Stack**: Prometheus, Alertmanager, Grafana, Sentry, multiple exporters **Alert Rules**: 20+ comprehensive rules covering all critical scenarios **Documentation**: Complete runbooks, setup guides, implementation summaries **Integration**: Sentry error tracking, health check metrics, blockchain transaction tracking Ready for production deployment. <!-- START COPILOT CODING AGENT SUFFIX --> <details> <summary>Original prompt</summary> > > ---- > > *This section details on the original issue you should resolve* > > <issue_title>Configure production monitoring and alerting infrastructure</issue_title> > <issue_description>## Summary > - Production systems must detect and alert on critical failures (service down, queue backlog, error spikes) to enable rapid incident response. > - Proactive monitoring prevents extended outages and data loss. > > ## Acceptance Criteria > - [ ] Set up uptime monitoring for all services (API, web, worker queue) with 1-min check intervals. > - [ ] Configure alerting channels (PagerDuty, Slack, email) with on-call rotation. > - [ ] Define alert rules for: > - Service unavailability (>2 consecutive failures) > - High error rates (>5% of requests in 5-min window) > - Queue depth threshold exceeded (>100 pending jobs) > - Database connection pool exhaustion > - IPFS upload failure rate >20% > - Contract transaction failures > - [ ] Implement health check endpoints returning detailed status (database, cache, queue connectivity). > - [ ] Set up error tracking (Sentry, Rollbar) for backend and frontend with source map support. > - [ ] Create alerting runbook documenting triage steps and escalation procedures. > > ## Related > - Links to roadmap subculture-collective/internet-id#10 (Ops bucket) > - Depends on subculture-collective/internet-id#13 (observability) > - Critical for production operations</issue_description> > > ## Comments on the Issue (you are @copilot in this section) > > <comments> > </comments> > </details> - Fixes subculture-collective/internet-id#24 <!-- START COPILOT CODING AGENT TIPS --> --- 💬 We'd love your input! Share your thoughts on Copilot coding agent in our [2 minute survey](https://gh.io/copilot-coding-agent-survey).
onnwee (Migrated from github.com) reviewed 2025-10-31 15:14:06 -05:00
PatrickFanella (Migrated from github.com) reviewed 2025-10-31 15:14:06 -05:00
copilot-pull-request-reviewer[bot] (Migrated from github.com) reviewed 2025-10-31 17:15:10 -05:00
copilot-pull-request-reviewer[bot] (Migrated from github.com) left a comment

Pull Request Overview

This PR implements a comprehensive production monitoring and alerting infrastructure for the Internet-ID project. It adds Sentry error tracking, Prometheus metrics collection, Alertmanager for alert routing, and Grafana for visualization, along with extensive documentation including an alerting runbook and setup guides.

Key Changes

  • Added Sentry service for error tracking and performance monitoring with automatic exception capture, request tracing, and sensitive data filtering
  • Extended metrics service with blockchain transaction, health check, and queue depth metrics
  • Enhanced health check endpoints to report detailed service status (database, cache, blockchain) and export metrics
  • Created complete monitoring stack configuration with Prometheus, Alertmanager, Grafana, and various exporters

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
scripts/services/sentry.service.ts New Sentry service for error tracking with middleware handlers
scripts/services/metrics.service.ts Added blockchain, health check, and queue metrics
scripts/routes/health.routes.ts Enhanced health checks with metrics integration
scripts/app.ts Integrated Sentry middleware and global error handler
package.json Added @sentry/node and @sentry/profiling-node dependencies
ops/monitoring/* Prometheus, Alertmanager, and Blackbox exporter configurations
docker-compose.monitoring.yml Complete monitoring stack deployment configuration
docs/ops/* Comprehensive monitoring setup and alerting runbook documentation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

## Pull Request Overview This PR implements a comprehensive production monitoring and alerting infrastructure for the Internet-ID project. It adds Sentry error tracking, Prometheus metrics collection, Alertmanager for alert routing, and Grafana for visualization, along with extensive documentation including an alerting runbook and setup guides. ### Key Changes - Added Sentry service for error tracking and performance monitoring with automatic exception capture, request tracing, and sensitive data filtering - Extended metrics service with blockchain transaction, health check, and queue depth metrics - Enhanced health check endpoints to report detailed service status (database, cache, blockchain) and export metrics - Created complete monitoring stack configuration with Prometheus, Alertmanager, Grafana, and various exporters ### Reviewed Changes Copilot reviewed 15 out of 16 changed files in this pull request and generated 2 comments. <details> <summary>Show a summary per file</summary> | File | Description | | ---- | ----------- | | scripts/services/sentry.service.ts | New Sentry service for error tracking with middleware handlers | | scripts/services/metrics.service.ts | Added blockchain, health check, and queue metrics | | scripts/routes/health.routes.ts | Enhanced health checks with metrics integration | | scripts/app.ts | Integrated Sentry middleware and global error handler | | package.json | Added @sentry/node and @sentry/profiling-node dependencies | | ops/monitoring/* | Prometheus, Alertmanager, and Blackbox exporter configurations | | docker-compose.monitoring.yml | Complete monitoring stack deployment configuration | | docs/ops/* | Comprehensive monitoring setup and alerting runbook documentation | </details> --- 💡 <a href="/subculture-collective/internet-id/new/main/.github/instructions?filename=*.instructions.md" class="Link--inTextBlock" target="_blank" rel="noopener noreferrer">Add Copilot custom instructions</a> for smarter, more guided reviews. <a href="https://docs.github.com/en/copilot/customizing-copilot/adding-repository-custom-instructions-for-github-copilot" class="Link--inTextBlock" target="_blank" rel="noopener noreferrer">Learn how to get started</a>.
copilot-pull-request-reviewer[bot] (Migrated from github.com) commented 2025-10-31 17:15:09 -05:00

The global error handler is placed after the Sentry error handler, which means errors will be captured twice by Sentry - once by Sentry's error handler (line 108) and again manually (line 119). Consider removing the manual sentryService.captureException call since Sentry's error handler already captures all errors that pass through it.

The global error handler is placed after the Sentry error handler, which means errors will be captured twice by Sentry - once by Sentry's error handler (line 108) and again manually (line 119). Consider removing the manual `sentryService.captureException` call since Sentry's error handler already captures all errors that pass through it.
@@ -0,0 +1,277 @@
import * as Sentry from "@sentry/node";
copilot-pull-request-reviewer[bot] (Migrated from github.com) commented 2025-10-31 17:15:09 -05:00

The fallback error handler when Sentry is not initialized should call next(_err) to pass the error to the next error handler, not next(). Without passing the error, the error will be swallowed and downstream error handlers won't receive it. This also applies to the return type which expects a 4-parameter error handler signature.

      return ((_err, _req, _res, next) => next(_err)) as ReturnType<typeof Sentry.Handlers.errorHandler>;
The fallback error handler when Sentry is not initialized should call `next(_err)` to pass the error to the next error handler, not `next()`. Without passing the error, the error will be swallowed and downstream error handlers won't receive it. This also applies to the return type which expects a 4-parameter error handler signature. ```suggestion return ((_err, _req, _res, next) => next(_err)) as ReturnType<typeof Sentry.Handlers.errorHandler>; ```
onnwee commented 2025-10-31 17:20:39 -05:00 (Migrated from github.com)

@copilot apply changes based on the comments in this thread

@copilot apply changes based on the comments in [this thread](https://github.com/subculture-collective/internet-id/pull/105#pullrequestreview-3406355852)
copilot-swe-agent[bot] (Migrated from github.com) reviewed 2025-10-31 17:23:20 -05:00
@@ -0,0 +1,277 @@
import * as Sentry from "@sentry/node";
copilot-swe-agent[bot] (Migrated from github.com) commented 2025-10-31 17:23:20 -05:00

Fixed in commit 7161ed1. Updated fallback error handler to call next(_err) to properly pass errors to downstream handlers.

Fixed in commit 7161ed1. Updated fallback error handler to call `next(_err)` to properly pass errors to downstream handlers.
copilot-swe-agent[bot] (Migrated from github.com) reviewed 2025-10-31 17:23:20 -05:00
copilot-swe-agent[bot] (Migrated from github.com) commented 2025-10-31 17:23:20 -05:00

Fixed in commit 7161ed1. Removed the duplicate sentryService.captureException call since Sentry's error handler already captures all errors.

Fixed in commit 7161ed1. Removed the duplicate `sentryService.captureException` call since Sentry's error handler already captures all errors.
Copilot commented 2025-10-31 17:23:20 -05:00 (Migrated from github.com)

@copilot apply changes based on the comments in this thread

Changes applied in commit 7161ed1:

  • Removed duplicate Sentry error capture in global error handler
  • Fixed fallback error handler to pass error with next(_err)
> @copilot apply changes based on the comments in [this thread](https://github.com/subculture-collective/internet-id/pull/105#pullrequestreview-3406355852) Changes applied in commit 7161ed1: - Removed duplicate Sentry error capture in global error handler - Fixed fallback error handler to pass error with `next(_err)`
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: subculture-collective/internet-id#105