Observability and Monitoring
This document describes the observability stack for Internet-ID, including structured logging, metrics collection, and monitoring setup.
Overview
Internet-ID implements a comprehensive observability baseline to support incident response, performance monitoring, and system health tracking:
- Structured Logging: JSON-formatted logs with correlation IDs using Pino
- Metrics Export: Prometheus-compatible metrics using prom-client
- Health Checks: Detailed service health endpoints
- Request Tracing: Automatic correlation ID generation for request tracking
Quick Start
Local Development
- Start the API server: npm run start:api
- Access observability endpoints:
  - Health check: http://localhost:3001/api/health
  - Prometheus metrics: http://localhost:3001/api/metrics
  - Metrics (JSON): http://localhost:3001/api/metrics/json
- View logs: logs are automatically printed to stdout with pretty formatting in development mode.
Structured Logging
Overview
The logging service uses Pino, a high-performance JSON logger for Node.js. All logs include:
- Timestamp: ISO 8601 format
- Log level: trace, debug, info, warn, error, fatal
- Service name: internet-id-api
- Environment: development, production, etc.
- Correlation ID: Unique ID per request for tracing
- Context: Additional structured data
Configuration
Configure logging via environment variables in .env:
# Log level (trace, debug, info, warn, error, fatal)
# Default: info
LOG_LEVEL=info
# Application environment
NODE_ENV=production
Log Levels
- trace: Very verbose debugging (e.g., function entry/exit)
- debug: Detailed debugging information
- info: General informational messages (default)
- warn: Warning messages that don't prevent operation
- error: Error messages for handled exceptions
- fatal: Critical errors that cause service termination
Usage in Code
import { logger } from "./services/logger.service";
// Simple log message
logger.info("User registered successfully");
// Log with context
logger.info("File uploaded", {
userId: "123",
filename: "video.mp4",
size: 1024000,
});
// Log errors
try {
// ... some operation
} catch (error) {
logger.error("Failed to process file", error, {
userId: "123",
operation: "upload",
});
}
// Create child logger with persistent context
const childLogger = logger.child({
module: "verification",
userId: "123",
});
childLogger.info("Starting verification");
Request Correlation
Every HTTP request automatically gets a correlation ID that appears in all logs for that request:
{
"level": "info",
"time": "2025-10-31T03:17:28.870Z",
"correlationId": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
"msg": "Incoming request",
"method": "POST",
"url": "/api/register",
"userAgent": "Mozilla/5.0...",
"ip": "192.168.1.1"
}
Access the correlation ID in request handlers:
app.post("/api/example", (req, res) => {
const correlationId = req.correlationId;
req.log.info("Processing request"); // Uses request-specific logger
// ...
});
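The middleware that assigns these IDs is conceptually simple. A minimal sketch, assuming an Express app whose Request type has been augmented with correlationId and log properties (the actual logger.service wiring may differ):
import { randomUUID } from "crypto";
import { NextFunction, Request, Response } from "express";
import { logger } from "./services/logger.service";
app.use((req: Request, res: Response, next: NextFunction) => {
  // Reuse an inbound ID if an upstream proxy already set one; otherwise generate one.
  const correlationId =
    (req.headers["x-correlation-id"] as string) ?? randomUUID();
  req.correlationId = correlationId;
  // A child logger stamps the correlation ID on every log line for this request.
  req.log = logger.child({ correlationId });
  res.setHeader("x-correlation-id", correlationId);
  next();
});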
Sensitive Data Redaction
The logger automatically redacts sensitive fields from logs:
- *.password
- *.secret
- *.token
- *.apiKey
- *.privateKey
- req.headers.authorization
- req.headers['x-api-key']
These fields are completely removed from log output.
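Pino supports this via its built-in redact option; a minimal sketch of the configuration, using the path list above:
import pino from "pino";
const logger = pino({
  redact: {
    paths: [
      "*.password",
      "*.secret",
      "*.token",
      "*.apiKey",
      "*.privateKey",
      "req.headers.authorization",
      "req.headers['x-api-key']",
    ],
    // remove: true drops the fields entirely instead of replacing them with "[Redacted]".
    remove: true,
  },
});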
Metrics
Overview
Metrics are exposed in Prometheus format at /api/metrics for scraping by monitoring systems. The service tracks:
- HTTP request latency and counts
- Active connections
- Cache performance (hits/misses)
- Verification outcomes
- IPFS upload performance
- Database query performance
Available Metrics
HTTP Metrics
# Request duration histogram (seconds)
http_request_duration_seconds{method="POST",route="/api/register",status_code="200"}
# Request count
http_requests_total{method="POST",route="/api/register",status_code="200"}
# Active connections
active_connections
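These HTTP metrics can be recorded with a small prom-client middleware; a minimal sketch (metric and label names match the ones above, though the service's actual implementation may differ):
import client from "prom-client";
import { NextFunction, Request, Response } from "express";
const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5], // seconds, per Prometheus convention
});
app.use((req: Request, res: Response, next: NextFunction) => {
  // startTimer() returns a function that observes the elapsed seconds when called.
  const end = httpRequestDuration.startTimer();
  res.on("finish", () => {
    end({
      method: req.method,
      route: req.route?.path ?? req.path,
      status_code: String(res.statusCode),
    });
  });
  next();
});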
Application Metrics
# Verification outcomes
verification_total{outcome="success",platform="youtube"}
verification_duration_seconds{outcome="success",platform="youtube"}
# IPFS uploads
ipfs_uploads_total{provider="pinata",status="success"}
ipfs_upload_duration_seconds{provider="pinata"}
# Cache performance
cache_hits_total{cache_type="redis"}
cache_misses_total{cache_type="redis"}
# Database queries
db_query_duration_seconds{operation="findMany",table="Content"}
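Custom application metrics like these are ordinary prom-client counters and histograms. For example, the verification counter might be defined and recorded as follows (a sketch; the real definitions live in the metrics service):
import client from "prom-client";
const verificationTotal = new client.Counter({
  name: "verification_total",
  help: "Verification attempts by outcome and platform",
  labelNames: ["outcome", "platform"],
});
// Record one successful YouTube verification.
verificationTotal.inc({ outcome: "success", platform: "youtube" });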
Default Metrics
Node.js process metrics are automatically collected:
- process_cpu_user_seconds_total
- process_cpu_system_seconds_total
- process_resident_memory_bytes
- process_heap_bytes
- nodejs_eventloop_lag_seconds
- nodejs_gc_duration_seconds
- And more...
Accessing Metrics
Prometheus format (for scraping):
curl http://localhost:3001/api/metrics
JSON format (for debugging):
curl http://localhost:3001/api/metrics/json
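Both endpoints are thin wrappers around the prom-client registry; a minimal sketch:
import client from "prom-client";
// Enables the default Node.js process metrics listed above.
client.collectDefaultMetrics();
app.get("/api/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});
app.get("/api/metrics/json", async (_req, res) => {
  res.json(await client.register.getMetricsAsJSON());
});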
Prometheus Configuration
To scrape metrics with Prometheus, add this job to your prometheus.yml:
scrape_configs:
- job_name: "internet-id-api"
scrape_interval: 15s
static_configs:
- targets: ["localhost:3001"]
metrics_path: "/api/metrics"
For production deployments with multiple instances, use service discovery:
scrape_configs:
- job_name: "internet-id-api"
scrape_interval: 15s
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: internet-id-api
Health Checks
Endpoint
GET /api/health
Returns detailed health status of all service components:
{
"status": "ok",
"timestamp": "2025-10-31T03:17:28.870Z",
"uptime": 3600.5,
"services": {
"database": {
"status": "healthy"
},
"cache": {
"status": "healthy",
"enabled": true
},
"blockchain": {
"status": "healthy",
"blockNumber": 12345678
}
}
}
Status Codes
- 200 OK: All services healthy
- 503 Service Unavailable: One or more services unhealthy or degraded
Service Status Values
- healthy: Service operating normally
- degraded: Service operational but with issues (e.g., cache unavailable)
- unhealthy: Service not operational
- disabled: Service intentionally disabled
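A handler of this shape probes each dependency and degrades the HTTP status accordingly. A minimal sketch, where checkDatabase, checkCache, and checkBlockchain are hypothetical stand-ins for the real service checks:
app.get("/api/health", async (_req, res) => {
  // Each check resolves to e.g. { status: "healthy" } or { status: "unhealthy", error: "..." }.
  const [database, cache, blockchain] = await Promise.all([
    checkDatabase(),
    checkCache(),
    checkBlockchain(),
  ]);
  const services = { database, cache, blockchain };
  const allHealthy = Object.values(services).every(
    (s) => s.status === "healthy"
  );
  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? "ok" : "degraded",
    timestamp: new Date().toISOString(),
    uptime: process.uptime(),
    services,
  });
});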
Using Health Checks
Kubernetes liveness probe:
livenessProbe:
httpGet:
path: /api/health
port: 3001
initialDelaySeconds: 30
periodSeconds: 10
Docker healthcheck:
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s \
CMD curl -f http://localhost:3001/api/health || exit 1
Log Shipping
Production Log Destinations
For production deployments, ship logs to a centralized logging service. Configuration examples:
Logtail (BetterStack)
# .env
LOGTAIL_SOURCE_TOKEN=your_logtail_source_token
To integrate Logtail, install the transport:
npm install @logtail/pino
Then update logger.service.ts to add the Logtail transport when the token is present.
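A minimal sketch of that wiring, assuming @logtail/pino accepts a sourceToken option (check the package's documentation for the exact shape):
import pino from "pino";
const logger = pino(
  process.env.LOGTAIL_SOURCE_TOKEN
    ? {
        transport: {
          target: "@logtail/pino",
          options: { sourceToken: process.env.LOGTAIL_SOURCE_TOKEN },
        },
      }
    : {} // fall back to plain stdout logging when no token is configured
);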
Datadog
# .env
DATADOG_API_KEY=your_datadog_api_key
DATADOG_APP_KEY=your_datadog_app_key
DATADOG_SITE=datadoghq.com # or datadoghq.eu for EU
To integrate Datadog, install the transport:
npm install pino-datadog
ELK Stack (Elasticsearch)
# .env
ELASTICSEARCH_URL=https://your-elasticsearch-host:9200
ELASTICSEARCH_USERNAME=your_username
ELASTICSEARCH_PASSWORD=your_password
ELASTICSEARCH_INDEX=internet-id-logs
To integrate Elasticsearch, use Filebeat or Logstash to collect logs from stdout/files.
File-based Logging
For file-based logging with rotation:
npm install pino-roll
Or use OS-level log rotation with rsyslog/logrotate.
Docker/Kubernetes Logging
When running in containers, simply log to stdout (default). Container orchestration platforms automatically collect logs:
Docker Compose:
services:
api:
image: internet-id-api
logging:
driver: "json-file"
options:
max-size: "10m"
max-file: "3"
Kubernetes: Logs are automatically collected by the cluster logging system (Fluentd, Fluent Bit, etc.).
Monitoring Dashboards
Prometheus + Grafana
- Set up Prometheus to scrape metrics (see the configuration above)
- Install Grafana and add Prometheus as a data source
- Import a dashboard template, or create a dashboard with these panels:
Request Rate & Latency:
# Request rate
rate(http_requests_total[5m])
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
rate(http_requests_total{status_code=~"5.."}[5m])
Application Metrics:
# Cache hit rate
rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))
# Verification success rate
rate(verification_total{outcome="success"}[5m]) / rate(verification_total[5m])
# Active connections
active_connections
System Metrics:
# CPU usage
rate(process_cpu_user_seconds_total[5m])
# Memory usage
process_resident_memory_bytes
# Event loop lag
rate(nodejs_eventloop_lag_seconds[5m])
Example Grafana Dashboard JSON
See ops/monitoring/grafana-dashboard.json (to be created) for a complete dashboard template.
Alerting
Prometheus Alerting Rules
Example alert rules for prometheus/alerts.yml:
groups:
- name: internet_id_api
interval: 30s
rules:
# High error rate
- alert: HighErrorRate
expr: |
rate(http_requests_total{status_code=~"5.."}[5m]) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }}"
# Service unavailable
- alert: ServiceDown
expr: up{job="internet-id-api"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Internet-ID API is down"
# High latency
- alert: HighLatency
expr: |
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) > 5
for: 10m
labels:
severity: warning
annotations:
summary: "High API latency"
description: "P95 latency is {{ $value }}s"
# Low cache hit rate
- alert: LowCacheHitRate
expr: |
rate(cache_hits_total[5m]) /
(rate(cache_hits_total[5m]) + rate(cache_misses_total[5m])) < 0.5
for: 15m
labels:
severity: info
annotations:
summary: "Cache hit rate is low"
description: "Hit rate: {{ $value | humanizePercentage }}"
Best Practices
Logging Best Practices
- Use structured logging: Always log with context objects, not string concatenation:
  // Good
  logger.info("User registered", { userId, email });
  // Bad
  logger.info(`User ${userId} registered with email ${email}`);
- Choose appropriate log levels: Don't log everything at the info level
- Include correlation IDs: Use the request logger (req.log) to maintain correlation
- Don't log sensitive data: Even with redaction, be careful with PII and secrets
- Add context, not just messages: Logs should be queryable and filterable
Metrics Best Practices
- Use labels wisely: Don't use unbounded values (like user IDs) as labels
- Keep cardinality low: Limit the number of unique label combinations
- Prefer histograms over summaries: Histograms are aggregatable across instances
- Use seconds for durations: This is the Prometheus convention
- Name metrics clearly, following Prometheus naming conventions:
  - _total suffix for counters
  - _seconds suffix for durations
  - _bytes suffix for sizes
Monitoring Best Practices
- Monitor the golden signals: Latency, Traffic, Errors, Saturation (Google SRE)
- Set meaningful alerts: Avoid alert fatigue with actionable alerts only
- Document your alerts: Include runbooks for each alert
- Test your alerts: Verify alerts fire under expected conditions
- Monitor business metrics: Track verification rates, registrations, etc.
Troubleshooting
Logs not appearing
Check log level:
echo $LOG_LEVEL # Should be info or lower
Check NODE_ENV:
echo $NODE_ENV # Pretty logs only in development
Enable debug logging temporarily:
LOG_LEVEL=debug npm run start:api
Metrics not available
Verify endpoint responds:
curl http://localhost:3001/api/metrics
Check Prometheus scrape status: Visit http://localhost:9090/targets in Prometheus UI
View metrics in JSON for debugging:
curl http://localhost:3001/api/metrics/json | jq
High memory usage
Check for metrics cardinality explosion:
# Count unique metric series
curl -s http://localhost:3001/api/metrics | grep -c '^[a-z]'
If this number is very high (>10,000), you may have too many label combinations.
Performance impact
Logging: Pino is extremely fast (minimal overhead)
- Use async logging in production for even better performance (see the sketch below)
- Avoid logging in tight loops
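Async logging is a one-line change with Pino's buffered destination; a minimal sketch:
import pino from "pino";
// sync: false buffers writes and flushes them off the hot path.
const logger = pino(pino.destination({ sync: false }));
// Flush any buffered log lines before the process exits.
process.on("beforeExit", () => logger.flush());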
Metrics: Minimal overhead for most metrics
- Histograms are more expensive than counters/gauges
- Keep label cardinality low