Centralized Log Aggregation & Analysis
This document describes the centralized logging infrastructure for Discord SpyWatcher using the Grafana Loki stack.
Overview
Discord SpyWatcher implements a comprehensive log aggregation system that collects, stores, and analyzes logs from all services in a centralized location.
Stack Components:
- Grafana Loki - Log aggregation and storage system
- Promtail - Log collection and shipping agent
- Grafana - Visualization and search UI
- Winston - Structured JSON logging library
Architecture
┌─────────────────────────────────────────────────────────────┐
│ Application Services │
├─────────────┬─────────────┬──────────┬────────┬────────────┤
│ Backend │ Frontend │ Postgres │ Redis │ PgBouncer │
│ (Winston) │ (Console) │ (Logs) │ (Logs) │ (Logs) │
└──────┬──────┴──────┬──────┴────┬─────┴───┬────┴──────┬─────┘
       │             │           │         │           │
       └─────────────┴───────────┴─────────┴───────────┘
                               │
                               ▼
                       ┌───────────────┐
                       │   Promtail    │ ◄── Log Collection Agent
                       │ (Log Shipper) │
                       └───────┬───────┘
                               │
                               ▼
                       ┌───────────────┐
                       │     Loki      │ ◄── Log Aggregation & Storage
                       │  (Log Store)  │
                       └───────┬───────┘
                               │
                               ▼
                       ┌───────────────┐
                       │    Grafana    │ ◄── Visualization & Search UI
                       │  (Dashboard)  │
                       └───────────────┘
Features
✅ Log Collection
- Backend logs - Application, security, and error logs in JSON format
- Security logs - Authentication, authorization, and security events
- Database logs - PostgreSQL query and connection logs
- Redis logs - Cache operations and connection logs
- PgBouncer logs - Connection pool metrics and activity
- Nginx logs - HTTP access and error logs (production)
- Container logs - Docker container stdout/stderr
✅ Structured Logging
- JSON format for easy parsing and filtering
- Request ID correlation for tracing
- Log levels: error, warn, info, debug
- Automatic metadata enrichment (service, job, level)
✅ Retention Policies
- 30-day retention - Automatic deletion of logs older than 30 days
- Compression - Automatic log compression to save storage
- Configurable - Easy to adjust retention period based on requirements
✅ Search & Filtering
- LogQL - Powerful query language for log searching
- Grafana UI - User-friendly interface for log exploration
- Filters - Filter by service, level, time range, and custom fields
- Live tail - Real-time log streaming
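The structured JSON format described above can be sketched without any dependencies. This is only an illustration of the line shape Promtail receives; the real backend uses Winston, and field names beyond level and message are assumptions:

```typescript
// Minimal sketch of a structured JSON log line (illustrative; the real
// backend uses Winston, which emits similar output via its JSON format).
type LogLevel = 'error' | 'warn' | 'info' | 'debug';

function logLine(level: LogLevel, message: string, meta: Record<string, unknown> = {}): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(), // parsed by Promtail's timestamp stage
    level,                               // promoted to a Loki label
    service: 'backend',                  // automatic metadata enrichment
    message,
    ...meta,                             // e.g. requestId for correlation
  });
}

console.log(logLine('info', 'User logged in', { userId: '42', requestId: 'abc123' }));
```

Because every line is one JSON object, Promtail can extract `level` and `requestId` as labels without fragile regex parsing.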
Quick Start
Starting the Logging Stack
Development:
docker-compose -f docker-compose.dev.yml up -d loki promtail grafana
Production:
docker-compose -f docker-compose.prod.yml up -d loki promtail grafana
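For reference, the three services are typically wired together in Compose roughly as follows. This is a sketch only: image tags, volume names, and mount paths are assumptions, not the project's exact definitions, which live in docker-compose.dev.yml / docker-compose.prod.yml:

```yaml
# Sketch of the logging services (paths and names are illustrative)
services:
  loki:
    image: grafana/loki:latest
    command: -config.file=/etc/loki/local-config.yaml
    ports: ['3100:3100']
    volumes:
      - ./loki/loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
  promtail:
    image: grafana/promtail:latest
    command: -config.file=/etc/promtail/config.yml
    volumes:
      - ./promtail/promtail-config.yml:/etc/promtail/config.yml
      - /var/run/docker.sock:/var/run/docker.sock # for container log discovery
  grafana:
    image: grafana/grafana:latest
    ports: ['3000:3000']
```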
Accessing Grafana
- Open your browser to http://localhost:3000
- Log in with the default credentials:
  - Username: admin
  - Password: admin (change on first login)
- Navigate to Explore or Dashboards > Spywatcher - Log Aggregation
Changing Admin Credentials
Set environment variables:
GRAFANA_ADMIN_USER=your_username
GRAFANA_ADMIN_PASSWORD=your_secure_password
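Grafana itself reads credentials from its standard GF_SECURITY_ADMIN_USER / GF_SECURITY_ADMIN_PASSWORD variables, so the compose service presumably maps the project variables through. A sketch (the pass-through wiring is an assumption; verify against the actual compose file):

```yaml
# Sketch: forwarding project variables to Grafana's standard env vars
services:
  grafana:
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
```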
Configuration
Loki Configuration
Location: loki/loki-config.yml
Key settings:
- retention_period: 720h - Keep logs for 30 days
- ingestion_rate_mb: 15 - Max ingestion rate (15 MB/s)
- max_entries_limit_per_query: 5000 - Max entries per query
Promtail Configuration
Location: promtail/promtail-config.yml
Log sources configured:
- Backend application logs (/logs/backend/*.log)
- Security logs (/logs/backend/security.log)
- PostgreSQL logs (/var/log/postgresql/*.log)
- Docker container logs (via Docker socket)
Pipeline stages:
- JSON parsing for structured logs
- Label extraction (level, service, action, etc.)
- Timestamp parsing
- Output formatting
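A scrape config implementing these four stages looks roughly like this. The extracted field names and the RFC3339 timestamp format are assumptions based on the backend's JSON output; the authoritative version is promtail/promtail-config.yml:

```yaml
scrape_configs:
  - job_name: backend
    static_configs:
      - targets: [localhost]
        labels:
          job: backend
          __path__: /logs/backend/*.log
    pipeline_stages:
      - json: # parse each line as JSON
          expressions:
            level: level
            timestamp: timestamp
            message: message
      - labels: # promote parsed fields to Loki labels
          level:
      - timestamp: # index by the log's own timestamp, not ingest time
          source: timestamp
          format: RFC3339
      - output: # ship only the message as the log line body
          source: message
```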
Grafana Configuration
Location: grafana/provisioning/
Datasources:
- Loki (default) - http://loki:3100
- Prometheus - http://backend:3001/metrics
Dashboards:
- Spywatcher - Log Aggregation - Main logging dashboard
Usage
Searching Logs
Basic Search
{job="backend"}
Filter by Level
{job="backend", level="error"}
Search in Message
{job="backend"} |= "error"
Security Logs
{job="security"} | json | action="LOGIN_ATTEMPT"
Time Range
Use Grafana's time picker to select a specific time range (e.g., last 1 hour, last 24 hours, custom range).
Common Queries
All errors in the last hour:
{job=~"backend|security"} | json | level="error"
Failed login attempts:
{job="security"} | json | action="LOGIN_ATTEMPT" | result="FAILURE"
Slow database queries:
{job="backend"} | json | message=~".*query.*" | duration > 1000
Rate limiting events:
{job="security"} | json | action="RATE_LIMIT_VIOLATION"
Request by request ID:
{job="backend"} | json | requestId="abc123"
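The same filters also work inside LogQL's metric functions, which is how panels like the log-volume chart are built. For example, errors per job counted over one-minute windows:

```logql
sum by (job) (count_over_time({job=~"backend|security"} | json | level="error" [1m]))
```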
Live Tailing
- Go to Explore in Grafana
- Select Loki datasource
- Enter your LogQL query
- Click Live button in the top right
This will stream logs in real-time as they arrive.
Dashboard
The pre-configured dashboard includes:
- Log Volume by Level - Time series chart showing log volume by level
- Log Counts by Level - Statistics showing error, warn, and info counts
- Application Logs - Main log viewer with filtering
- Security Logs - Dedicated security event viewer
- Error Logs - Quick view of all error logs
Template Variables:
- $job - Filter by job (backend, security, postgres, etc.)
- $level - Filter by log level (error, warn, info, debug)
- $search - Free-text search filter
Structured Logging Best Practices
Application Code
Use Winston logger with structured fields:
import logger from './middleware/winstonLogger';

// Basic logging
logger.info('User logged in', { userId: user.id });

// With request ID
import { logWithRequestId } from './middleware/winstonLogger';
logWithRequestId('info', 'Processing request', req.id, {
  userId: user.id,
  action: 'fetch_data',
});

// Error logging
logger.error('Database connection failed', {
  error: err.message,
  stack: err.stack,
});
Log Levels
- error - Application errors, exceptions, failures
- warn - Warning conditions, degraded performance
- info - Important business events, state changes
- debug - Detailed diagnostic information
Security Events
Use the security logger for security-related events:
import { logSecurityEvent, SecurityActions } from './utils/securityLogger';
await logSecurityEvent({
  userId: user.discordId,
  action: SecurityActions.LOGIN_SUCCESS,
  result: 'SUCCESS',
  ipAddress: req.ip,
  userAgent: req.get('user-agent'),
  requestId: req.id,
});
Retention Policies
Current Settings
- Retention Period: 30 days (720 hours)
- Compaction Interval: 10 minutes
- Retention Delete Delay: 2 hours
- Reject Old Samples: 7 days
Adjusting Retention
Edit loki/loki-config.yml:
limits_config:
  retention_period: 720h # Change this value (e.g., 1440h for 60 days)

table_manager:
  retention_period: 720h # Keep the same as above

compactor:
  retention_enabled: true
Then restart Loki:
docker-compose restart loki
Performance Tuning
Ingestion Limits
Adjust in loki/loki-config.yml:
limits_config:
  ingestion_rate_mb: 15 # MB/s per tenant
  ingestion_burst_size_mb: 20 # Burst size
  per_stream_rate_limit: 3MB # Per stream rate
  per_stream_rate_limit_burst: 15MB # Per stream burst
Query Performance
limits_config:
  max_entries_limit_per_query: 5000 # Max entries returned
  max_streams_per_user: 10000 # Max streams per user
Cache Configuration
query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100 # Increase for better performance
Alerting
Setting up Alerts
- Create alert rules in loki/alert-rules.yml:
groups:
  - name: spywatcher-alerts
    interval: 1m
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate({job="backend", level="error"}[5m])) > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'High error rate detected'
          description: 'Error rate is {{ $value }} errors/sec'
- Configure the Alertmanager URL in loki/loki-config.yml:
ruler:
  alertmanager_url: http://alertmanager:9093
Troubleshooting
Logs not appearing in Grafana
- Check Promtail is running:
  docker ps | grep promtail
  docker logs spywatcher-promtail-dev
- Check Loki is accepting logs:
  curl http://localhost:3100/ready
- Verify log files exist:
  docker exec spywatcher-backend-dev ls -la /app/logs
- Check Promtail configuration:
  docker exec spywatcher-promtail-dev cat /etc/promtail/config.yml
Loki storage issues
Check disk usage:
du -sh /var/lib/docker/volumes/discord-spywatcher_loki-data/
Delete logs for a query and time range (this is Loki's log-deletion API, which must be POSTed and requires deletion support to be enabled; quote the URL so the shell does not interpret & or the braces):
docker exec spywatcher-loki-dev wget -qO- --post-data='' 'http://localhost:3100/loki/api/v1/delete?query={job="backend"}&start=2024-01-01T00:00:00Z&end=2024-01-02T00:00:00Z'
Performance issues
- Reduce retention period - Lower retention in loki-config.yml
- Increase resources - Adjust memory limits in docker-compose.prod.yml
- Reduce log volume - Raise LOG_LEVEL to 'warn' or 'error'
- Add sampling - Implement log sampling in application code
Monitoring the Logging Stack
Loki Metrics
Available at: http://localhost:3100/metrics
Key metrics:
- loki_ingester_chunks_created_total - Chunks created
- loki_ingester_bytes_received_total - Bytes ingested
- loki_request_duration_seconds - Query performance
Promtail Metrics
Available at: http://localhost:9080/metrics
Key metrics:
- promtail_sent_entries_total - Entries sent to Loki
- promtail_dropped_entries_total - Dropped entries
- promtail_read_bytes_total - Bytes read from logs
Grafana Health
Available at: http://localhost:3000/api/health
Integration with Other Tools
Prometheus Integration
Loki integrates seamlessly with Prometheus for correlated metrics and logs:
- Configure Prometheus datasource in Grafana
- Use derived fields to link logs to traces
- Create unified dashboards with both metrics and logs
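Derived fields are configured on the Loki datasource in grafana/provisioning/. A sketch with a hypothetical trace link (the matcherRegex and the link target below are assumptions for illustration, not the project's actual provisioning):

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    isDefault: true
    jsonData:
      derivedFields:
        # Turn each requestId in a log line into a clickable link;
        # ${__value.raw} is replaced with the captured value.
        - name: requestId
          matcherRegex: '"requestId":"([^"]+)"'
          url: 'https://tracing.example.com/trace/${__value.raw}'
```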
Sentry Integration
Logs can reference Sentry issues:
logger.error('Unhandled exception', {
  sentryEventId: sentryEventId,
  error: err.message,
});
Search in Loki:
{job="backend"} | json | sentryEventId="abc123"
Security Considerations
Access Control
- Change default Grafana password - Set GRAFANA_ADMIN_PASSWORD
- Enable HTTPS - Configure SSL/TLS for Grafana
- Network isolation - Keep Loki and Promtail on a private network
- Authentication - Enable OAuth or LDAP authentication in Grafana
- Enable Loki authentication - For production, set auth_enabled: true in loki/loki-config.yml and configure authentication methods
Note: Loki authentication is disabled by default for development/testing. For production deployments, enable authentication to prevent unauthorized access to log data. See Loki authentication documentation.
Log Sanitization
Winston logger automatically sanitizes sensitive data:
- Passwords
- Tokens (access, refresh, API keys)
- OAuth scopes
- Email addresses
See: backend/src/utils/securityLogger.ts
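The authoritative redaction logic lives in that file. As an illustration only, sanitization amounts to replacing sensitive values before they reach any transport; the key list below is an assumption, not the project's actual list:

```typescript
// Illustrative redaction sketch -- the real list and logic live in
// backend/src/utils/securityLogger.ts; these key names are assumptions.
const SENSITIVE_KEYS = new Set([
  'password', 'token', 'accessToken', 'refreshToken', 'apiKey', 'email', 'scope',
]);

function sanitize(meta: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(meta)) {
    // Replace the value of any sensitive key; pass everything else through
    clean[key] = SENSITIVE_KEYS.has(key) ? '[REDACTED]' : value;
  }
  return clean;
}

console.log(sanitize({ userId: '42', password: 'hunter2', email: 'a@b.c' }));
```

Here userId survives while password and email come out as '[REDACTED]', so log lines stay useful for debugging without leaking credentials or PII.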
Compliance
- GDPR - Logs containing PII are automatically sanitized
- Data Retention - 30-day retention complies with most regulations
- Audit Trail - Security logs provide compliance audit trail
Resources
Documentation
Example Queries
Community
Comparison with ELK Stack
| Feature | Loki Stack | ELK Stack |
|---|---|---|
| Storage | Index labels, not full text | Full text indexing |
| Resource Usage | Low (300-500MB) | High (2-4GB+) |
| Query Language | LogQL (Prometheus-like) | Lucene/KQL |
| Setup Complexity | Simple (3 containers) | Complex (5+ containers) |
| Cost | Free, open source | Free, but resource intensive |
| Scalability | Good for small-medium | Better for enterprise |
| Integration | Native Prometheus/Grafana | Elasticsearch ecosystem |
| Best For | Cloud-native, Kubernetes | Large enterprises, full-text search |
Conclusion
The centralized logging system provides comprehensive log aggregation and analysis capabilities for Discord SpyWatcher. With proper configuration and usage, it enables:
- Faster debugging - Correlate logs across services
- Better monitoring - Real-time visibility into system behavior
- Improved security - Track security events and detect anomalies
- Compliance - Audit trail and data retention policies
- Performance optimization - Identify bottlenecks and slow queries
For questions or issues, refer to the troubleshooting section or consult the official documentation.