
SoundHash - Video Clip Matching System

A sophisticated system for matching audio clips from videos across social media platforms using audio fingerprinting and PostgreSQL.

📚 View Full Documentation - Comprehensive guides, API reference, and architecture details

Quick Start - Get running in <15 minutes
Architecture Overview - System design and data flow
Social Media Bots - Twitter and Reddit bot setup
Troubleshooting - Solutions to common problems
Database Backups - Backup and restore procedures
Usage - Command-line options and examples
Security - Credential management

Project Status

📋 Roadmap: Issue #34 | 🗂️ Project Board: @onnwee's soundhash | 🏁 Milestones: View all

Features

🎵 Audio fingerprinting using spectral analysis
🗄️ PostgreSQL database for scalable storage
🤖 Social media bot integration (Twitter, Reddit)
📺 YouTube channel ingestion
🔍 Real-time clip matching
📊 Beautiful colored logging with progress tracking
🚀 REST API with JWT authentication (API Docs)
📝 Interactive API documentation (Swagger/ReDoc)
🔐 API key support for machine-to-machine access
⚡ Rate limiting and CORS support
🛡️ Enterprise Security - Multi-tier rate limiting, WAF, DDoS protection (Security Docs)
🔒 Threat detection (SQL injection, XSS, brute force)
📋 Compliance ready (SOC 2, ISO 27001, HIPAA)
✉️ Email notification system with marketing automation (Email Docs)
📧 Transactional, product, and marketing emails
🎨 Customizable templates with A/B testing
📈 Email analytics with open/click tracking

Architecture Overview

SoundHash processes videos through a multi-stage pipeline:

┌─────────────────────────────────────────────────────────────────────┐
│                         SoundHash Pipeline                           │
└─────────────────────────────────────────────────────────────────────┘

1. INGESTION (channel_ingester.py)
   ├─ YouTube API → Fetch channel videos
   ├─ Create ProcessingJob entries (idempotent)
   └─ Store metadata in PostgreSQL
          ↓
2. VIDEO PROCESSING (video_processor.py)
   ├─ yt-dlp → Download best audio stream
   ├─ ffmpeg → Convert to mono WAV @ 16kHz
   └─ Segment into 90-second chunks
          ↓
3. FINGERPRINTING (audio_fingerprinting.py)
   ├─ STFT → Spectral analysis
   ├─ Peak detection → Extract features
   ├─ Normalize → Compact vector + MD5 hash
   └─ Store in PostgreSQL (audio_fingerprints table)
          ↓
4. MATCHING (future)
   ├─ Query clip → Extract fingerprint
   ├─ Compare → Correlation + Euclidean similarity
   └─ Return matched videos with confidence scores
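The matching stage is marked as future work, so the following is only a sketch of how a correlation score and a Euclidean similarity could be blended into one confidence value. The function names, the equal weighting, and the `1 / (1 + distance)` mapping are assumptions made for illustration, not SoundHash's implementation.

```python
import math
from typing import Sequence


def pearson_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and the same length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def euclidean_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Map Euclidean distance into (0, 1]; identical vectors score 1.0."""
    return 1.0 / (1.0 + math.dist(a, b))


def combined_score(a: Sequence[float], b: Sequence[float], weight: float = 0.5) -> float:
    """Weighted blend of both measures; negative correlation is clamped to 0."""
    corr = max(0.0, pearson_correlation(a, b))
    return weight * corr + (1.0 - weight) * euclidean_similarity(a, b)


query = [0.1, 0.5, 0.9, 0.3]
print(combined_score(query, query))
```

Identical vectors score at the top of the range, and the score falls as the vectors diverge. A real matcher would also need a threshold for deciding when a score counts as a match.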

Key Components

Ingestion: src/ingestion/channel_ingester.py - Async orchestration, idempotent job creation
Video I/O: src/core/video_processor.py - yt-dlp + ffmpeg pipeline with cookie/proxy support
Fingerprints: src/core/audio_fingerprinting.py - STFT, spectral peaks → compact vector + MD5
Database: src/database/{connection,models,repositories}.py - SQLAlchemy engine/schema/DAOs
YouTube API: src/api/youtube_service.py - OAuth flow, channel/video metadata
Config/Logging: config/{settings.py,logging_config.py} - Centralized configuration
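Idempotent job creation means that running ingestion twice for the same video does not create two jobs. A minimal sketch of that idea follows, using an in-memory SQLite database in place of PostgreSQL so that it runs without any setup. The column names and the `create_job_if_missing` helper are hypothetical, not taken from the SoundHash models.

```python
import sqlite3


def create_job_if_missing(conn: sqlite3.Connection, job_type: str, target_id: str) -> bool:
    """Insert a pending job unless one already exists; return True if a row was created."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO processing_jobs (job_type, target_id, status) "
        "VALUES (?, ?, 'pending')",
        (job_type, target_id),
    )
    conn.commit()
    return cursor.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE processing_jobs (
        id INTEGER PRIMARY KEY,
        job_type TEXT NOT NULL,
        target_id TEXT NOT NULL,
        status TEXT NOT NULL,
        UNIQUE (job_type, target_id)  -- the constraint that makes inserts idempotent
    )
    """
)

first = create_job_if_missing(conn, "video_process", "video-123")
second = create_job_if_missing(conn, "video_process", "video-123")
print(first, second)
```

The unique constraint does the real work here. In PostgreSQL the equivalent statement is `INSERT ... ON CONFLICT DO NOTHING`.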

Database Schema

channels: YouTube channel metadata
videos: Video information and processing status
audio_fingerprints: Spectral fingerprint data for audio segments (vector + hash)
match_results: Query results and similarity scores
processing_jobs: Background job queue with status tracking
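The `audio_fingerprints` table stores a vector plus a hash. One way such a pair could be derived from a feature vector is sketched below. The rounding precision, the byte packing, and the function names are assumptions, and the real extraction in `src/core/audio_fingerprinting.py` may differ.

```python
import hashlib
import math
import struct


def normalize(features):
    """Scale a feature vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in features))
    if norm == 0:
        return list(features)
    return [x / norm for x in features]


def fingerprint_hash(features, decimals: int = 4) -> str:
    """MD5 hex digest of the normalized vector, rounded so float noise does not change it."""
    # Adding 0.0 turns a rounded -0.0 into 0.0 so both pack to the same bytes.
    rounded = [round(x, decimals) + 0.0 for x in normalize(features)]
    packed = struct.pack(f"<{len(rounded)}d", *rounded)
    return hashlib.md5(packed).hexdigest()


quiet = fingerprint_hash([1.0, 2.0, 3.0])
loud = fingerprint_hash([2.0, 4.0, 6.0])
print(quiet == loud)
```

Because the vector is normalized first, the same features at a different overall level hash identically. An exact hash like this can only find identical vectors, so similar but non-identical clips still need a vector comparison.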

REST API

SoundHash provides a comprehensive REST API for programmatic access to all features. The API supports JWT authentication and API keys, with interactive documentation available at /docs.

Quick Start

Start the API server:
```
python scripts/start_api.py
```
Access interactive documentation:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

Register a user:

curl -X POST http://localhost:8000/api/v1/auth/register \
  -H "Content-Type: application/json" \
  -d '{"username": "user", "email": "user@example.com", "password": "SecurePass123!"}'

Login and get access token:

curl -X POST http://localhost:8000/api/v1/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "user", "password": "SecurePass123!"}'
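The same login call can be made from Python with only the standard library. This sketch builds the request without sending it, so it runs even when no server is up. The `build_json_post` helper is hypothetical, and the shape of the response body is not documented in this README, so it is left unparsed.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/api/v1"


def build_json_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given API path."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


login_request = build_json_post(
    "/auth/login", {"username": "user", "password": "SecurePass123!"}
)
print(login_request.get_method(), login_request.full_url)

# To send it (requires the API server to be running):
# with urllib.request.urlopen(login_request) as response:
#     body = json.load(response)
```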

API Endpoints

Authentication (/api/v1/auth) - User registration, login, API keys
Videos (/api/v1/videos) - Upload, list, process videos
Matches (/api/v1/matches) - Find audio matches, search
Channels (/api/v1/channels) - Channel management
Fingerprints (/api/v1/fingerprints) - Fingerprint data
Admin (/api/v1/admin) - System stats, job management

See API Documentation for complete details, examples, and code snippets.

Configuration

Set these environment variables in .env:

API_HOST=0.0.0.0
API_PORT=8000
API_SECRET_KEY=your-secret-key-here  # Generate with: openssl rand -hex 32
API_ACCESS_TOKEN_EXPIRE_MINUTES=30
API_CORS_ORIGINS=http://localhost:3000,http://localhost:8000
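As an illustration of how these variables might be consumed, the sketch below parses them into typed values and generates a secret key from Python. The `load_api_settings` helper is hypothetical (the project's own loader lives in `config/settings.py`), and the fallbacks are the example values shown above, not confirmed defaults.

```python
import os
import secrets


def load_api_settings(env=None) -> dict:
    """Read the API_* variables, falling back to the example values from this section."""
    env = os.environ if env is None else env
    origins = env.get("API_CORS_ORIGINS", "http://localhost:3000,http://localhost:8000")
    return {
        "host": env.get("API_HOST", "0.0.0.0"),
        "port": int(env.get("API_PORT", "8000")),
        "token_expire_minutes": int(env.get("API_ACCESS_TOKEN_EXPIRE_MINUTES", "30")),
        # Comma-separated list; surrounding whitespace and empty entries are dropped.
        "cors_origins": [o.strip() for o in origins.split(",") if o.strip()],
    }


# Python equivalent of `openssl rand -hex 32`: 32 random bytes as 64 hex characters.
generated_key = secrets.token_hex(32)

settings = load_api_settings(
    {"API_PORT": "9000", "API_CORS_ORIGINS": "http://a.test, http://b.test"}
)
print(settings["port"], settings["cors_origins"])
```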

Social Media Bots

SoundHash includes bots for Twitter and Reddit that help users identify audio clips from videos.

Twitter Bot ✅

Status: Fully functional

The Twitter bot listens for mentions, processes video URLs, and replies with matching clips from the database.

Features:

Automatic mention monitoring
Video URL extraction and processing
Match result replies with timestamps and links
Standalone match summary tweets
Rate limiting with retry logic
Robust error handling

Quick Setup:

# Add credentials to .env
TWITTER_BEARER_TOKEN=your_token
TWITTER_CONSUMER_KEY=your_key
TWITTER_CONSUMER_SECRET=your_secret
TWITTER_ACCESS_TOKEN=your_access_token
TWITTER_ACCESS_TOKEN_SECRET=your_access_secret

# Test the bot
python scripts/test_twitter_bot.py

# Run the bot
python -m src.bots.twitter_bot

Reddit Bot 🚧

Status: Work in progress (stub implementation)

The Reddit bot will monitor specified subreddits for video clip identification requests.

Planned features:

Subreddit monitoring
Comment/post processing
Match result replies
Rate limiting

Documentation: See docs/BOTS.md for complete setup instructions and API reference.

Quick Start (🎯 Target: <15 minutes)

Choose your preferred setup method:

Option A: Docker (Recommended - Fastest Setup)

Prerequisites: Docker and Docker Compose installed

⏱️ Estimated time: 5-10 minutes

# 1. Clone repository
git clone <repository-url> soundhash
cd soundhash

# 2. Configure environment
cp .env.example .env
# Edit .env with your settings (DATABASE_URL will be overridden for Docker)

# 3. Start services (PostgreSQL + App)
make up
# Or: docker compose up -d

# 4. Initialize database
make setup-db
# Or: docker compose exec app python scripts/setup_database.py

# 5. Setup YouTube API (interactive OAuth flow)
docker compose exec app python scripts/setup_youtube_api.py

# 6. Test with limited videos
make ingest
# Or: docker compose exec app python scripts/ingest_channels.py --dry-run --max-videos 5

# 7. View logs
make logs-app
# Or: docker compose logs -f app

🔧 Makefile Commands:

make up - Start all services
make down - Stop all services
make logs - View all logs
make logs-app - View app logs
make shell - Open shell in app container
make setup-db - Initialize database
make test - Run tests
make help - Show all available commands

✅ Advantages:

No manual PostgreSQL or ffmpeg installation
Isolated environment
Production-like setup
Easy cleanup with make clean or docker compose down -v
Makefile for common operations

Option B: Local Development

Prerequisites: Python 3.12+, PostgreSQL 12+, ffmpeg

⏱️ Estimated time: 10-15 minutes

# 1. Clone and setup virtual environment
git clone <repository-url> soundhash
cd soundhash
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# 2. Install dependencies (quote the extras so shells such as zsh do not expand the brackets)
pip install -e .                    # Core dependencies
# or for development:
pip install -e ".[dev]"             # With dev tools
# or for documentation:
pip install -e ".[docs]"            # With docs tools
# or all together:
pip install -e ".[dev,docs]"        # With dev and docs tools

# 3. Install system dependencies
# Ubuntu/Debian:
sudo apt update && sudo apt install postgresql postgresql-contrib ffmpeg
# macOS:
brew install postgresql ffmpeg

# 4. Setup PostgreSQL
createdb soundhash
# Optional: Create dedicated user
# psql -c "CREATE USER soundhash_user WITH PASSWORD 'your_password';"
# psql -c "GRANT ALL PRIVILEGES ON DATABASE soundhash TO soundhash_user;"

# 5. Configure environment
cp .env.example .env
# Edit .env with your database credentials and settings

# 6. Initialize database
python scripts/setup_database.py

# 7. Setup YouTube API (required)
python scripts/setup_youtube_api.py

# 8. Test with limited videos
python scripts/ingest_channels.py --dry-run --max-videos 5 --log-level DEBUG

✅ Advantages:

Direct access to Python environment for debugging
Faster iteration during development
Full control over dependencies

Comparison: Docker vs Local

| Feature | Docker 🐳 | Local 💻 |
| --- | --- | --- |
| Setup time | 5-10 min | 10-15 min |
| Prerequisites | Docker only | Python, PostgreSQL, ffmpeg |
| Isolation | ✅ Full | ❌ System-wide |
| Production parity | ✅ High | ⚠️ Varies |
| Debugging | ⚠️ Via logs/exec | ✅ Direct |
| Cleanup | ✅ `docker compose down -v` | ⚠️ Manual |
| Best for | Quick start, CI/CD | Active development |

What Happens After Setup?

After completing either setup method:

Database Initialized: Tables created (channels, videos, audio_fingerprints, match_results, processing_jobs)
YouTube API Ready: OAuth token stored in token.json for API access
System Ready: Can now ingest channels and process videos

Next Steps:

# 1. Configure target channels in .env
TARGET_CHANNELS=UCo_QGM_tJZOkOCIFi2ik5kA,UCDz8WxTg4R7FUTSz7GW2cYA

# 2. Ingest and process channels (start small!)
python scripts/ingest_channels.py --max-videos 10

# 3. Monitor progress in logs
tail -f logs/soundhash.log

# 4. Query database to see results
psql soundhash -c "SELECT COUNT(*) FROM audio_fingerprints;"

Docker Configuration Details

Environment Variables

When running with Docker Compose, the .env file is automatically loaded. Key variables for Docker setup:

# Database (automatically configured for containers)
DATABASE_HOST=db                  # Service name in docker-compose.yml
DATABASE_PORT=5432                # Internal container port
DATABASE_NAME=soundhash
DATABASE_USER=soundhash_user
DATABASE_PASSWORD=soundhash_password123

# External database access (from host machine): use this value instead of 5432
# DATABASE_PORT=5435              # Host port mapped to container

# OAuth server (for YouTube API setup)
AUTH_SERVER_PORT=8001            # Host port for OAuth callbacks

Docker Volumes

Docker Compose mounts several directories for data persistence and development:

./logs → /app/logs - Application logs persist on host
./temp → /app/temp - Temporary audio files persist on host
./cache → /app/cache - yt-dlp HTTP cache for faster re-downloads
./src → /app/src - Source code (read-only, for hot-reload in dev)
./scripts → /app/scripts - Scripts (read-only)
postgres_data - Named volume for PostgreSQL data (managed by Docker)

Credentials (optional mounts):

./credentials.json → /app/credentials.json - YouTube OAuth credentials
./token.json → /app/token.json - OAuth refresh token
./cookies.txt → /app/cookies.txt - Browser cookies for yt-dlp

Caching Configuration: SoundHash uses caching to reduce redundant work and bandwidth usage:

yt-dlp HTTP Cache: Speeds up re-downloading the same videos
- Location: ./cache/yt-dlp (configurable via YT_DLP_CACHE_DIR)
- Enable/disable: ENABLE_YT_DLP_CACHE=true/false (default: true)
Fingerprint Reuse: Skips re-fingerprinting when parameters haven't changed
- Automatically checks if fingerprints exist with matching sample_rate, n_fft, and hop_length
- Invalidates cache if any fingerprinting parameters change in config
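The reuse rule described above amounts to an equality check on the three parameters. A minimal sketch follows, in which `FingerprintParams` and `can_reuse` are hypothetical names. The 16 kHz sample rate comes from the pipeline description, while the `n_fft` and `hop_length` values are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FingerprintParams:
    sample_rate: int
    n_fft: int
    hop_length: int


def can_reuse(stored: Optional[FingerprintParams], current: FingerprintParams) -> bool:
    """Reuse stored fingerprints only when every parameter matches the current config."""
    return stored is not None and stored == current


current = FingerprintParams(sample_rate=16000, n_fft=2048, hop_length=512)
print(can_reuse(FingerprintParams(16000, 2048, 512), current))
print(can_reuse(FingerprintParams(22050, 2048, 512), current))
```

Changing any one of the three values makes the comparison fail, which is what triggers re-fingerprinting.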

To clear caches:

# Clear yt-dlp cache
rm -rf ./cache/yt-dlp

# Force re-fingerprinting (stored fingerprints in the database are regenerated):
# update a fingerprinting parameter in .env (e.g., change FINGERPRINT_SAMPLE_RATE)

Common Docker Operations

# View running containers
make ps

# Access app container shell
make shell

# Access database shell
make shell-db

# Rebuild after dependency changes
make build
make up

# Clean restart (removes volumes - WARNING: destroys data!)
make clean
make up
make setup-db

# Run one-off commands
docker compose exec app python scripts/ingest_channels.py --help
docker compose exec app python -c "from src.database.connection import db_manager; print('OK')"

# View resource usage
docker compose stats

Production Deployment

For production use, combine the base docker-compose.yml with docker-compose.prod.yml:

# Start in production mode
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# View logs
docker compose -f docker-compose.yml -f docker-compose.prod.yml logs -f

# Stop services
docker compose -f docker-compose.yml -f docker-compose.prod.yml down

Production configuration includes:

Automatic restart policies
Resource limits (CPU and memory)
Log rotation
No source code mounts (the code is baked into the image)
Ingestion script runs by default

⚠️ Security Note: See the Security and Secrets Management section below for important information about handling credentials safely.

Troubleshooting Common Issues

🚫 YouTube Download Failures / Rate Limiting

Symptoms: Downloads fail with "HTTP Error 429", "HTTP Error 403", "Video unavailable", or frequent timeouts

Solutions (in order of effectiveness):

Use browser cookies (Recommended):

# Option 1: Export cookies to file
# Use browser extension "Get cookies.txt LOCALLY" (Firefox/Chrome)
# Save to cookies.txt in project root
YT_COOKIES_FILE=./cookies.txt

# Option 2: Extract from browser directly (easier)
YT_COOKIES_FROM_BROWSER=firefox
# Or with specific profile:
YT_COOKIES_FROM_BROWSER=chrome:Profile 1
# Or set the browser and the profile with separate variables:
YT_COOKIES_FROM_BROWSER=chrome
YT_BROWSER_PROFILE=Profile 1

Configure proxy:

# Single proxy
USE_PROXY=true
PROXY_URL=http://proxy.example.com:8080

# Or rotating proxy list (comma-separated)
USE_PROXY=true
PROXY_LIST=http://proxy1.example.com:8080,http://proxy2.example.com:8080

Change player client (if videos appear restricted):

YT_PLAYER_CLIENT=android  # or ios, web_safari, tv, web_embedded

Reduce concurrent downloads:

MAX_CONCURRENT_DOWNLOADS=1  # Default is 3

Update yt-dlp (fixes many issues):
```
pip install --upgrade yt-dlp
```
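For the rotating proxy option, a comma-separated `PROXY_LIST` value could be consumed round-robin as sketched here. The helper names are hypothetical, and round-robin is an assumption, since SoundHash may pick proxies differently (for example at random).

```python
import itertools


def parse_proxy_list(raw: str) -> list:
    """Split a comma-separated PROXY_LIST value, dropping blanks and whitespace."""
    return [p.strip() for p in raw.split(",") if p.strip()]


def proxy_cycle(raw: str):
    """Return an endless round-robin iterator over the configured proxies."""
    proxies = parse_proxy_list(raw)
    if not proxies:
        raise ValueError("PROXY_LIST is empty")
    return itertools.cycle(proxies)


rotation = proxy_cycle("http://proxy1.example.com:8080,http://proxy2.example.com:8080")
picked = [next(rotation) for _ in range(3)]
print(picked)
```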

Understanding Error Messages:

The system now provides specific remediation advice for common errors:

HTTP 403 Forbidden: Video may be geo-restricted, age-restricted, or YouTube detected automation
- ✅ Use authenticated cookies (YT_COOKIES_FILE or YT_COOKIES_FROM_BROWSER)
- ✅ Configure proxy to change apparent location
- ✅ Try different player client (YT_PLAYER_CLIENT=android)

HTTP 429 Too Many Requests: YouTube rate limit exceeded
- ✅ Reduce MAX_CONCURRENT_DOWNLOADS
- ✅ Use authenticated cookies to get higher quota
- ✅ Configure proxy rotation (PROXY_LIST)
- ⏱️ System auto-retries with exponential backoff

HTTP 410 Gone: Video has been removed or is no longer available
- ⚠️ This is permanent - video cannot be retrieved
- System will skip without retrying

Bot Detection: "Sign in to confirm you're not a bot"
- ✅ Set YT_COOKIES_FILE or YT_COOKIES_FROM_BROWSER
- ✅ Update yt-dlp: pip install --upgrade yt-dlp
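The retry behavior mentioned for HTTP 429 can be pictured with a generic exponential backoff loop. This is a sketch under assumptions: `RetryableError`, the base delay, the cap, and the 10% jitter are illustrative choices, not the values SoundHash uses.

```python
import random
import time


class RetryableError(Exception):
    """Stand-in for a transient failure such as HTTP 429."""


def retry(operation, max_attempts: int = 5, base: float = 1.0, cap: float = 60.0,
          sleep=time.sleep):
    """Call operation until it succeeds, doubling the wait after each retryable failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(cap, base * 2 ** attempt)
            # A little random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


attempts = {"count": 0}


def flaky_download():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RetryableError("HTTP 429")
    return "ok"


waits = []
result = retry(flaky_download, sleep=waits.append)  # record waits instead of sleeping
print(result, len(waits))
```

Only `RetryableError` is retried. Any other exception propagates immediately, which matches the idea that a permanent failure such as HTTP 410 should be skipped rather than retried.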

🎵 ffmpeg Issues

Symptoms: "ffmpeg not found", audio conversion fails, or "codec not supported"

Solutions:

Verify ffmpeg installation:

ffmpeg -version
# Should show version 4.0+ with libopus, libvorbis

Install/reinstall ffmpeg:

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows - Download from https://ffmpeg.org/download.html
# Add to PATH environment variable

Check PATH (if installed but not found):

which ffmpeg  # Linux/macOS
where ffmpeg  # Windows

Docker users: ffmpeg is included in the Docker image, no action needed

🗄️ Database Connection Issues

Symptoms: "Connection refused", "Authentication failed", "Database does not exist"

Solutions:

Check PostgreSQL is running:

# Linux
sudo systemctl status postgresql
sudo systemctl start postgresql

# macOS
brew services list
brew services start postgresql

# Docker
docker compose ps

Verify connection string in .env:

# Format: postgresql://user:password@host:port/dbname
DATABASE_URL=postgresql://soundhash_user:password@localhost:5432/soundhash

# Or use individual vars:
DATABASE_HOST=localhost  # Use 'postgres' in Docker
DATABASE_PORT=5432       # Use 5435 for Docker host access
DATABASE_NAME=soundhash
DATABASE_USER=soundhash_user
DATABASE_PASSWORD=your_password

Test connection manually:

psql -h localhost -U soundhash_user -d soundhash

Create database if missing:

createdb soundhash
# Or: psql -c "CREATE DATABASE soundhash;"

Docker networking:
- From host: use localhost:5435 (as configured in docker-compose.yml)
- From app container: use postgres:5432 (service name as host)
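If the individual variables are used, the connection string in the `postgresql://user:password@host:port/dbname` format can be assembled as sketched below. The `build_database_url` helper is hypothetical. Percent-encoding the credentials matters when a password contains characters such as `@`, `:` or `/`.

```python
from urllib.parse import quote


def build_database_url(user: str, password: str, host: str, port: int, name: str) -> str:
    """Assemble a PostgreSQL URL, percent-encoding the user name and password."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}:{port}/{name}"


url = build_database_url("soundhash_user", "p@ss:word/1", "localhost", 5432, "soundhash")
print(url)
```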

💾 Database Backups and Restore

Backup Strategy: Regular backups ensure data safety for fingerprints and matches. Use the provided scripts for automated backup and restore operations.

Creating Backups

Basic backup (local filesystem):

# Create a backup with default settings
python scripts/backup_database.py

# Create a backup with custom name
python scripts/backup_database.py --name daily_backup

# Clean up old backups only
python scripts/backup_database.py --cleanup-only

S3 backup (optional, requires boto3):

# Install boto3 if using S3
pip install boto3

# Configure S3 settings in .env
BACKUP_S3_ENABLED=true
BACKUP_S3_BUCKET=my-soundhash-backups
BACKUP_S3_PREFIX=soundhash-backups/

# Create backup and upload to S3
python scripts/backup_database.py --s3

Automated backups with cron:

# Add to crontab (crontab -e):
# Daily backup at 2 AM with S3 upload
0 2 * * * cd /path/to/soundhash && python scripts/backup_database.py --s3 >> /var/log/soundhash-backup.log 2>&1

# Weekly backup on Sunday at 3 AM (local only)
0 3 * * 0 cd /path/to/soundhash && python scripts/backup_database.py --name weekly >> /var/log/soundhash-backup.log 2>&1

Restoring from Backups

List available backups:

# List local backups
python scripts/restore_database.py --list

# List S3 backups
python scripts/restore_database.py --list --from-s3

Restore from backup:

# Restore from latest backup (local)
python scripts/restore_database.py --latest

# Restore from specific backup file
python scripts/restore_database.py --file soundhash_backup_20240101_120000.dump

# Restore from S3
python scripts/restore_database.py --latest --from-s3

# WARNING: Clean mode drops all existing objects first
python scripts/restore_database.py --latest --clean

Partial restore options:

# Restore only data (preserve existing schema)
python scripts/restore_database.py --latest --data-only

# Restore only schema (preserve existing data)
python scripts/restore_database.py --latest --schema-only

Backup Configuration

Configure backup settings in .env:

# Local backup directory
BACKUP_DIR=./backups

# Retention policy (days)
BACKUP_RETENTION_DAYS=30

# S3 configuration (optional)
BACKUP_S3_ENABLED=false
BACKUP_S3_BUCKET=my-soundhash-backups
BACKUP_S3_PREFIX=soundhash-backups/
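A retention policy like `BACKUP_RETENTION_DAYS` typically comes down to deleting files older than a cutoff. The sketch below shows that logic under assumptions: the helper names are hypothetical, the `.dump` extension is taken from the restore example, and `scripts/backup_database.py` may select files differently.

```python
import time
from pathlib import Path


def find_expired_backups(backup_dir, retention_days: int, now=None) -> list:
    """Return *.dump files whose modification time is older than the retention window."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # 86400 seconds per day
    return sorted(p for p in Path(backup_dir).glob("*.dump") if p.stat().st_mtime < cutoff)


def cleanup(backup_dir, retention_days: int = 30, now=None) -> int:
    """Delete expired backups and return how many files were removed."""
    expired = find_expired_backups(backup_dir, retention_days, now)
    for path in expired:
        path.unlink()
    return len(expired)
```

Calling `cleanup("./backups", 30)` would remove dump files last modified more than 30 days ago and leave everything else untouched.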

Testing Backup/Restore

Verify backup integrity by restoring to a fresh database:

# 1. Create a test database
# If you have PostgreSQL client tools installed locally:
createdb soundhash_test
# Or, if you are using Docker Compose for PostgreSQL:
docker compose exec postgres createdb -U soundhash soundhash_test

# 2. Create a backup from production
python scripts/backup_database.py --name test_restore

# 3. Restore to test database (override DATABASE_NAME inline, or temporarily edit .env to point to the test DB)
DATABASE_NAME=soundhash_test python scripts/restore_database.py --latest --clean

# 4. Verify data integrity
psql soundhash_test -c "SELECT COUNT(*) FROM videos;"
psql soundhash_test -c "SELECT COUNT(*) FROM audio_fingerprints;"

# 5. Clean up test database
# If you have PostgreSQL client tools installed locally:
dropdb soundhash_test
# Or, if your database is running in Docker Compose:
docker compose exec postgres dropdb -U soundhash soundhash_test

Backup best practices:

✅ Test restore process regularly
✅ Store backups in multiple locations (local + S3)
✅ Set up automated daily/weekly backups via cron
✅ Monitor backup logs for failures
✅ Keep backups for at least 30 days
✅ Document backup/restore procedures for your team

📦 Import/Dependency Errors

Symptoms: "ModuleNotFoundError", "No module named 'X'"

Solutions:

Ensure virtual environment is activated:

source .venv/bin/activate  # Linux/macOS
.venv\Scripts\activate     # Windows

Reinstall dependencies:
```
pip install -e ".[dev]" --upgrade
```
Check Python version (requires 3.12+):
```
python --version
```

💾 Disk Space Issues

Symptoms: "No space left on device", temp directory fills up

Solutions:

Enable automatic cleanup in .env:
```
CLEANUP_SEGMENTS_AFTER_PROCESSING=true
```

Manually clean temp directory:

rm -rf ./temp/*  # or your TEMP_DIR path

Monitor disk usage:

df -h .  # Check available space
du -sh temp/  # Check temp directory size

Process fewer videos at once:

python scripts/ingest_channels.py --max-videos 10

🔐 YouTube API Authentication Issues

Symptoms: "Invalid credentials", "Quota exceeded", "Unauthorized", "Token expired"

Solutions:

Follow OAuth setup guide:
- Create OAuth credentials in Google Cloud Console (type: Desktop Application)
- Add redirect URIs: http://localhost:8080/, http://localhost:8000/, http://localhost/
- Download credentials.json and place it in the project root
- Run python scripts/setup_youtube_api.py to generate token.json
- For detailed steps, see YOUTUBE_OAUTH_SETUP.md

Token refresh (automatic):
- The system automatically refreshes expired tokens using the refresh token
- This happens transparently on each API call
- No user action needed for normal token expiration

Regenerate token if corrupted or refresh fails:

rm token.json
python scripts/setup_youtube_api.py

Check credentials.json is valid JSON from Google Cloud Console:
- Must be OAuth 2.0 Client ID for Desktop Application
- Must include redirect URIs configured

Verify API is enabled in Google Cloud Console:
- YouTube Data API v3 must be enabled for your project
- Check at: APIs & Services → Library → YouTube Data API v3

Check quota limits:
- Default: 10,000 units/day
- 1 video = ~7 units, 1 channel = ~3 units
- Monitor at: https://console.cloud.google.com/apis/dashboard

OAuth Consent Screen Configuration:
- Add your Google account as a test user (if app is in testing mode)
- Or publish the app (requires verification for production)
- Ensure scope https://www.googleapis.com/auth/youtube.readonly is configured
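Using the quota figures listed above (10,000 units per day, about 7 units per video and about 3 per channel), a rough daily budget can be estimated before starting a large ingestion. The function name is hypothetical and the per-item costs are the approximations quoted in this README.

```python
DAILY_QUOTA_UNITS = 10_000
UNITS_PER_VIDEO = 7     # approximate, per the figures above
UNITS_PER_CHANNEL = 3   # approximate, per the figures above


def max_videos_per_day(channels: int, quota: int = DAILY_QUOTA_UNITS) -> int:
    """Estimate how many videos fit in the daily quota after paying for channel lookups."""
    remaining = quota - channels * UNITS_PER_CHANNEL
    return max(0, remaining // UNITS_PER_VIDEO)


budget = max_videos_per_day(channels=2)
print(budget)  # (10000 - 2 * 3) // 7 = 1427
```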

Understanding Token Files:

credentials.json: OAuth 2.0 client credentials from Google Cloud Console (static)
token.json: Access and refresh tokens (generated after OAuth flow, auto-refreshed)
Both files are automatically excluded from git via .gitignore
Keep both files secure and never commit them to version control

🐛 General Debugging Tips

Enable debug logging:

python scripts/ingest_channels.py --log-level DEBUG

Start with dry run:

python scripts/ingest_channels.py --dry-run --max-videos 5

Test individual components:

from src.core.video_processor import VideoProcessor
from src.core.audio_fingerprinting import AudioFingerprinter

processor = VideoProcessor()
audio_file = processor.download_video_audio("https://youtube.com/watch?v=...")

fingerprinter = AudioFingerprinter()
fp = fingerprinter.extract_fingerprint(audio_file)
print(f"Confidence: {fp['confidence_score']}")

Check logs directory: ./logs/ contains detailed error traces

Verify environment variables:

python -c "from config.settings import Config; print(Config.DATABASE_URL)"

Usage

Command Line Options

The ingestion script supports various options for flexible processing:

# Basic ingestion (unlimited videos per channel)
python scripts/ingest_channels.py

# Process specific channels with all their videos
python scripts/ingest_channels.py --channels "UCo_QGM_tJZOkOCIFi2ik5kA,UCDz8WxTg4R7FUTSz7GW2cYA"

# Limit videos per channel (useful for testing)
python scripts/ingest_channels.py --max-videos 10

# Dry run (no actual processing, shows what would be ingested)
python scripts/ingest_channels.py --dry-run

# Set log level
python scripts/ingest_channels.py --log-level DEBUG

# Disable colored output
python scripts/ingest_channels.py --no-colors

⚠️ Important: By default, the system will fetch ALL videos from each channel. For channels with thousands of videos, this can take a very long time and generate a lot of processing jobs. Use --max-videos to limit the number if you want to test or process only recent content.
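The flags above map naturally onto `argparse`. The sketch below only illustrates how such options behave when parsed. It is not the actual parser in `scripts/ingest_channels.py`, and the `INFO` default and the list of log levels are assumptions.

```python
import argparse


def split_channels(value: str) -> list:
    """Turn a comma-separated --channels value into a list of channel IDs."""
    return [c.strip() for c in value.split(",") if c.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Channel ingestion options (illustrative)")
    parser.add_argument("--channels", type=split_channels, default=None,
                        help="comma-separated channel IDs")
    parser.add_argument("--max-videos", type=int, default=None,
                        help="limit per channel; omit for no limit")
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would be ingested without processing")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    parser.add_argument("--no-colors", action="store_true",
                        help="disable colored output")
    return parser


args = build_parser().parse_args(
    ["--channels", "UCo_QGM_tJZOkOCIFi2ik5kA,UCDz8WxTg4R7FUTSz7GW2cYA",
     "--max-videos", "10", "--dry-run"]
)
print(args.channels, args.max_videos, args.dry_run)
```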

Bot Deployment

Configure API keys in .env
Run Twitter bot: python -m src.bots.twitter_bot
Run Reddit bot: python -m src.bots.reddit_bot (work in progress; currently a stub)

Manual Testing

from src.core.audio_fingerprinting import AudioFingerprinter
from src.core.video_processor import VideoProcessor

processor = VideoProcessor()
fingerprinter = AudioFingerprinter()

# Process a video
audio_file = processor.download_video_audio("https://youtube.com/watch?v=...")
fingerprint = fingerprinter.extract_fingerprint(audio_file)

Detailed Architecture

For architecture overview and flow diagram, see the Architecture Overview section above.

Directory Structure

src/core/ - Core audio processing and fingerprinting
src/database/ - Database models and operations
src/bots/ - Social media bot implementations
src/ingestion/ - Channel data ingestion system
src/api/ - External API integrations (YouTube)
src/auth/ - Authentication and OAuth flows
config/ - Configuration management and logging
scripts/ - Utility scripts for setup and maintenance
tests/ - Test suite

Additional Documentation

ARCHITECTURE.md - Detailed project structure
INSTALL.md - Comprehensive installation guide with Docker and manual options
YOUTUBE_OAUTH_SETUP.md - YouTube API setup guide
AUTH_SETUP.md - Twitter & Reddit authentication
SECURITY.md - Security best practices and secrets management

Security and Secrets Management

📖 Quick Reference: See SECURITY.md for a condensed checklist.

Overview

SoundHash requires various API credentials and tokens to function. It's critical to handle these securely to prevent unauthorized access to your accounts and services.

Protected Files

The following files contain sensitive information and are automatically excluded from version control via .gitignore:

.env - Environment variables including API keys, database passwords, and tokens
credentials.json - Google OAuth 2.0 client credentials for YouTube API
token.json - OAuth refresh tokens (generated after authentication)
cookies.txt - Browser cookies for yt-dlp (if used)

Safe Credential Handling

Local Development

Use .env for configuration:

cp .env.example .env
# Edit .env with your actual credentials

YouTube OAuth Setup:
- Download credentials.json from Google Cloud Console
- Place it in the project root (it's automatically ignored by git)
- Run python scripts/setup_youtube_api.py to generate token.json
- Both files remain local only - never commit them

Verify .gitignore protection:

git status --ignored
# Your secret files should appear in the ignored list
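
The same check can be scripted, for example as a pre-commit guard. The sketch below is an assumption-laden simplification: missing_ignore_entries is a hypothetical helper that only looks for exact-match lines in .gitignore and does not interpret glob patterns such as *.json.

```python
REQUIRED_IGNORES = (".env", "credentials.json", "token.json", "cookies.txt")


def missing_ignore_entries(gitignore_text: str, required=REQUIRED_IGNORES) -> list[str]:
    """Return the required names that have no exact-match line in .gitignore.

    Simplification: only literal lines (optionally with a leading "/") count;
    wildcard patterns are not evaluated.
    """
    entries = set()
    for line in gitignore_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        entries.add(line.lstrip("/"))
    return [name for name in required if name not in entries]


if __name__ == "__main__":
    with open(".gitignore", encoding="utf-8") as handle:
        missing = missing_ignore_entries(handle.read())
    print("Not ignored:", ", ".join(missing) if missing else "none")
```

For an authoritative answer that honors every gitignore rule, git check-ignore -v <path> reports which pattern, if any, matches a given file.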

GitHub Actions / CI

For running workflows that need credentials:

Use GitHub Secrets (Settings → Secrets and variables → Actions):
- DATABASE_URL - PostgreSQL connection string
- YOUTUBE_API_KEY - For YouTube Data API (if using API key method)
- TWITTER_* - Twitter API credentials
- REDDIT_* - Reddit API credentials

Reference in workflows:

env:
  DATABASE_URL: ${{ secrets.DATABASE_URL }}
  YOUTUBE_API_KEY: ${{ secrets.YOUTUBE_API_KEY }}

Production Deployment

For production deployments, use proper secrets management:

Docker: Use Docker secrets or environment variables
Cloud Platforms: Use AWS Secrets Manager, Google Secret Manager, or Azure Key Vault
Kubernetes: Use Kubernetes Secrets with proper RBAC

Secret Scanning

This repository uses Gitleaks in CI to automatically scan for accidentally committed secrets:

Automatic scanning on every push and pull request
CI fails if secrets are detected
Custom rules for YouTube API keys, Twitter tokens, Reddit credentials, etc.
Weekly scheduled scans to catch issues early

If the CI fails due to detected secrets:

Rotate the compromised credential immediately
Remove the secret from all commits (use git filter-repo or BFG Repo-Cleaner; the Git project now recommends git filter-repo over the older git filter-branch)
Update your local .env with the new credential
Never commit the secret again

Best Practices

✅ DO:

Use .env for all secrets and credentials
Add sensitive files to .gitignore before creating them
Use GitHub Secrets for CI/CD credentials
Rotate credentials regularly
Use least-privilege access principles
Review .gitignore before committing new files

❌ DON'T:

Commit .env, credentials.json, token.json, or cookies.txt
Store secrets in code, comments, or documentation
Share credentials via email, chat, or unsecured channels
Use production credentials in development
Hardcode API keys or passwords in Python files

Credential Rotation

If you suspect a credential has been compromised:

Immediately revoke the credential at the service provider
Generate a new credential
Update your .env and/or GitHub Secrets
Audit access logs for unauthorized usage
Notify team members if applicable

Additional Resources

Performance Tips & Best Practices

Development Workflow

Start Small: Always test with --max-videos 5 and --dry-run first
Use Debug Logging: Add --log-level DEBUG when troubleshooting
Monitor Resources: Watch disk space (du -sh temp/) and database size
Enable Cleanup: Set CLEANUP_SEGMENTS_AFTER_PROCESSING=true to save disk space

Production Considerations

Rate Limiting:
- Use cookies from an authenticated session
- Configure proxy rotation for high-volume processing
- Respect YouTube API quotas (10,000 units/day default)

Resource Management:
- Set appropriate MAX_CONCURRENT_CHANNELS (1-3 recommended) to limit parallel channel ingestion
- Set appropriate MAX_CONCURRENT_DOWNLOADS (1-3 recommended) for video processing
- Configure CHANNEL_RETRY_DELAY and CHANNEL_MAX_RETRIES for failure handling
- Monitor PostgreSQL performance with EXPLAIN ANALYZE
- Consider connection pooling for multiple workers

Reliability:
- Enable automatic cleanup to prevent disk exhaustion
- Use Docker for a consistent deployment environment
- Implement monitoring and alerting for job failures

Security:
- Never commit .env, credentials.json, token.json, or cookies.txt
- Use GitHub Secrets for CI/CD credentials
- Rotate API keys regularly
- Use least-privilege database users
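
To make the concurrency and retry settings concrete, the sketch below shows one common way to combine a concurrency limit with exponential backoff using asyncio. It is an illustration under assumptions, not the project's ingestion code: the function names are hypothetical, and the parameters max_concurrent, base_delay, and max_retries are assumed to play the roles of MAX_CONCURRENT_CHANNELS, CHANNEL_RETRY_DELAY, and CHANNEL_MAX_RETRIES.

```python
import asyncio


async def ingest_with_retry(channel_id, ingest, max_retries=3, base_delay=1.0):
    """Await ingest(channel_id); on failure wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries + 1):
        try:
            return await ingest(channel_id)
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            await asyncio.sleep(base_delay * (2 ** attempt))


async def ingest_all(channel_ids, ingest, max_concurrent=2, **retry_kwargs):
    """Ingest every channel with at most max_concurrent running at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(channel_id):
        async with semaphore:
            return await ingest_with_retry(channel_id, ingest, **retry_kwargs)

    # return_exceptions=True keeps one failed channel from cancelling the rest.
    return await asyncio.gather(
        *(bounded(cid) for cid in channel_ids), return_exceptions=True
    )
```

In this sketch a channel keeps its semaphore slot while it waits to retry, so a struggling channel slows overall throughput instead of adding load; releasing the slot during the backoff sleep is an equally valid design.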

Common Gotchas

⚠️ Unlimited ingestion is expensive: Without --max-videos, the system fetches ALL videos from each channel (potentially thousands). This can take hours and consume significant resources.

⚠️ Cookie authentication: yt-dlp works better with authenticated cookies. Use YT_COOKIES_FROM_BROWSER=firefox for automatic extraction.

⚠️ Database driver: The system auto-selects the psycopg driver. No manual installation needed.

⚠️ Temp directory bloat: Without cleanup enabled, audio segments accumulate. Enable CLEANUP_SEGMENTS_AFTER_PROCESSING or manually clean ./temp/ periodically.
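
For the manual clean-up case, a small script can delete stale files from ./temp/. The helper below is a hypothetical sketch (remove_stale_files is not part of the repository) that removes regular files older than a given age and reports how many bytes it freed; the one-day threshold in the usage block is an arbitrary example.

```python
import time
from pathlib import Path


def remove_stale_files(root, max_age_seconds, now=None):
    """Delete regular files under root older than max_age_seconds; return bytes freed."""
    now = time.time() if now is None else now
    freed = 0
    # Materialize the listing first so deletions do not disturb the traversal.
    for path in list(Path(root).rglob("*")):
        if not path.is_file():
            continue
        info = path.stat()
        if now - info.st_mtime > max_age_seconds:
            freed += info.st_size
            path.unlink()
    return freed


if __name__ == "__main__":
    freed_bytes = remove_stale_files("./temp", max_age_seconds=24 * 3600)
    print(f"Freed {freed_bytes / 1_000_000:.1f} MB")
```

Run it only while no ingestion is in progress, since it cannot tell whether an old file is still in use.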

Useful Commands

# Check database size
psql soundhash -c "SELECT pg_size_pretty(pg_database_size('soundhash'));"

# List processing job statuses
psql soundhash -c "SELECT status, COUNT(*) FROM processing_jobs GROUP BY status;"

# Find failed jobs
psql soundhash -c "SELECT * FROM processing_jobs WHERE status = 'failed' LIMIT 10;"

# Clean up old jobs (careful!)
psql soundhash -c "DELETE FROM processing_jobs WHERE status = 'completed' AND updated_at < NOW() - INTERVAL '7 days';"

# Monitor real-time logs
tail -f logs/soundhash.log

# Test a single video manually
python -c "from src.core.video_processor import VideoProcessor; print(VideoProcessor().download_video_audio('https://youtube.com/watch?v=...'))"

FAQ

Q: How long does it take to process one video?
A: Depends on video length and your hardware. Typically 30-60 seconds per video (download + segmentation + fingerprinting).

Q: Can I process multiple channels simultaneously?
A: Yes, the system processes channels concurrently with bounded concurrency. Control the limit with MAX_CONCURRENT_CHANNELS in .env (default: 2). Each channel ingestion includes retry logic with exponential backoff for resilience.

Q: What happens if ingestion is interrupted?
A: The system is idempotent - rerunning will skip already-created jobs. Use --only-process to process existing jobs without re-ingesting.

Q: How do I reset everything?
A: Run fresh_start.sh or manually: dropdb soundhash && createdb soundhash && python scripts/setup_database.py

Q: Can I use API key instead of OAuth?
A: OAuth is required for channel listing. An API key alone provides only limited functionality.

Q: How much disk space do I need?
A: Varies by usage. Estimate ~50MB per video (audio + segments). Enable cleanup to reduce footprint.

Q: Does this work with private/unlisted videos?
A: Only if your authenticated cookies have access to those videos.
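
The idempotency mentioned in the interruption answer above can be pictured with a small sketch. It assumes jobs are keyed by video ID and uses an in-memory set to stand in for the processing_jobs table; the real system presumably enforces this with a database lookup or unique constraint, and create_jobs is a hypothetical name.

```python
def create_jobs(video_ids, existing_job_ids):
    """Add a job for each video not already tracked; return the newly created IDs."""
    created = []
    for video_id in video_ids:
        if video_id in existing_job_ids:
            continue  # job already exists from an earlier (possibly interrupted) run
        existing_job_ids.add(video_id)
        created.append(video_id)
    return created


jobs = set()
first_run = create_jobs(["vid1", "vid2"], jobs)           # creates both jobs
second_run = create_jobs(["vid1", "vid2", "vid3"], jobs)  # only vid3 is new
```

Because a rerun only adds what is missing, restarting after an interruption does not duplicate work.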

Production Deployment

SoundHash includes production-ready Kubernetes configurations for enterprise deployment.

Deployment Options

1. Kubernetes (Recommended for Production)

# Deploy to production using Helm
helm install soundhash ./helm/soundhash \
  --namespace soundhash-production \
  --values ./helm/soundhash/values-production.yaml

Features:

Zero-downtime rolling updates
Horizontal Pod Autoscaler (3-20 replicas)
TLS/SSL with Let's Encrypt
PgBouncer connection pooling
Redis caching
Multi-AZ deployment
Health checks and readiness probes

📚 Documentation:

2. Infrastructure as Code (Terraform)

Provision AWS infrastructure (EKS, RDS, ElastiCache, S3, EFS):

cd terraform
terraform init
terraform plan
terraform apply

Includes:

EKS cluster with managed node groups
Multi-AZ RDS PostgreSQL
ElastiCache Redis
S3 object storage
EFS for shared storage
Security groups and IAM roles

3. Automated Deployment Script

# Deploy to production
./scripts/deploy.sh production v1.0.0

# Deploy to staging
./scripts/deploy.sh staging latest

CI/CD Pipelines

GitHub Actions workflows for automated deployment:

Staging: Auto-deploys on push to main branch
Production: Deploys on release publication
Docker Build: Tests and validates Docker images on PRs

Quick Start - Production Deployment

Prerequisites:
- Kubernetes cluster (v1.27+)
- kubectl and Helm installed
- Docker registry access

Create Secrets:

kubectl create secret generic soundhash-secrets \
  --namespace=soundhash-production \
  --from-literal=database-url="postgresql://user:pass@host:5432/soundhash" \
  --from-literal=api-secret-key="your-secret-key"

Deploy with Helm:

helm install soundhash ./helm/soundhash \
  --namespace soundhash-production \
  --values ./helm/soundhash/values-production.yaml

Verify Deployment:

kubectl get pods -n soundhash-production
kubectl get svc -n soundhash-production

Configuration Files

k8s/                    # Raw Kubernetes manifests
├── deployment.yaml     # API deployment
├── service.yaml        # Load balancer service
├── ingress.yaml        # TLS/SSL ingress
├── hpa.yaml           # Horizontal Pod Autoscaler
├── pgbouncer.yaml     # Database connection pooling
└── redis.yaml         # Redis StatefulSet

helm/soundhash/        # Helm charts
├── Chart.yaml
├── values.yaml        # Default values
├── values-staging.yaml
├── values-production.yaml
└── templates/         # Kubernetes templates

terraform/             # Infrastructure as Code
├── main.tf           # AWS provider config
├── eks.tf            # EKS cluster
├── rds.tf            # PostgreSQL database
├── elasticache.tf    # Redis cache
├── s3.tf             # Object storage
└── vpc.tf            # Network configuration

Monitoring & Observability

Prometheus: Metrics collection (pods annotated for scraping)
Grafana: Visualization dashboards
ELK/Loki: Log aggregation
CloudWatch: AWS infrastructure monitoring

Security Features

Production-Grade Security (see Security Documentation)

Application Security:

✅ Multi-tier rate limiting (per-IP, per-user, per-endpoint)
✅ Automated threat detection (SQL injection, XSS, path traversal)
✅ IP allowlist/blocklist with CIDR support
✅ API key management with rotation and expiration
✅ Request signature verification (HMAC-SHA256)
✅ Security headers (CSP, HSTS, X-Frame-Options)
✅ Security audit logging (SOC 2, ISO 27001 ready)
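
As an illustration of request signature verification with HMAC-SHA256, the sketch below signs the method, path, timestamp, and body with a shared secret and rejects stale or tampered requests. The message layout, the 300-second skew window, and the function names are assumptions made for this example; SoundHash's actual signing scheme may differ.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, timestamp: str, body: bytes) -> str:
    """Return the hex HMAC-SHA256 over method, path, timestamp, and body."""
    message = b"\n".join(
        [method.upper().encode(), path.encode(), timestamp.encode(), body]
    )
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret, method, path, timestamp, body, signature, now, max_skew=300):
    """Check freshness first, then compare signatures in constant time."""
    try:
        sent_at = int(timestamp)
    except ValueError:
        return False
    if abs(now - sent_at) > max_skew:
        return False  # too old or too far in the future: possible replay
    expected = sign_request(secret, method, path, timestamp, body)
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Using hmac.compare_digest instead of == avoids leaking, through timing differences, how many leading characters of a forged signature were correct.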

Infrastructure Security:

Non-root container users
Security contexts and dropped capabilities
Secrets management via Kubernetes Secrets
TLS/SSL for all external traffic
Network policies for pod isolation
RBAC for access control
Image vulnerability scanning

DDoS Protection & WAF:

Cloudflare or AWS Shield integration
OWASP Top 10 protection
See DDoS Protection Guide

Scaling

Manual:

kubectl scale deployment/soundhash-api -n soundhash-production --replicas=10

Automatic:

HPA scales based on CPU (70%) and memory (80%) utilization
Scales from 3 to 20 replicas
Scale-up is immediate; scale-down waits for a 5-minute stabilization window
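
The replica count the HPA aims for follows the rule documented by Kubernetes: for each metric, desired = ceil(current replicas × current utilization / target utilization); the largest proposal wins and is then clamped to the configured bounds. The sketch below applies that rule to the targets listed above. It deliberately ignores the controller's tolerance band, stabilization windows, and pod readiness handling, and the function name is hypothetical.

```python
import math

TARGETS = {"cpu": 70, "memory": 80}  # target utilization in percent, from the list above


def desired_replicas(current, utilization, targets=TARGETS, min_replicas=3, max_replicas=20):
    """Apply the HPA scaling rule per metric and clamp the result to the bounds."""
    proposals = [
        math.ceil(current * utilization[name] / target)
        for name, target in targets.items()
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


# 3 replicas at 90% CPU and 40% memory: ceil(3 * 90 / 70) = 4 and ceil(3 * 40 / 80) = 2
print(desired_replicas(3, {"cpu": 90, "memory": 40}))  # prints 4
```

Because the largest proposal wins, a single hot metric is enough to trigger a scale-up even when the other is well under its target.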

Estimated Costs

Production (AWS EKS):

EKS Cluster: $73/month
EC2 Nodes (3x t3.xlarge): $450/month
RDS (db.r6g.xlarge Multi-AZ): $730/month
ElastiCache (cache.r6g.large): $340/month
EFS/S3/Data Transfer: ~$100/month
Total: ~$1,700/month
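
The total is the rounded sum of the line items above, which the following snippet reproduces. The figures are copied from the list; the dictionary name is just for illustration.

```python
PRODUCTION_MONTHLY_USD = {
    "EKS Cluster": 73,
    "EC2 Nodes (3x t3.xlarge)": 450,
    "RDS (db.r6g.xlarge Multi-AZ)": 730,
    "ElastiCache (cache.r6g.large)": 340,
    "EFS/S3/Data Transfer": 100,  # listed as approximate
}

total = sum(PRODUCTION_MONTHLY_USD.values())
print(f"Total: ${total:,}/month")  # prints "Total: $1,693/month", rounded to ~$1,700 above
```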

Development (optimized):

Smaller instances: ~$220/month

For detailed deployment instructions, see the deployment documentation.

