Initial commit: Epstein Files Database project structure

- PostgreSQL schema for documents, entities, relationships, cross-refs
- Neo4j schema for graph relationships
- TypeScript extraction pipeline (OCR, NER, deduplication)
- Go API server (Fiber) with full REST endpoints
- React + Tailwind frontend with network visualization
- Pattern finder agent for connection discovery
- Docker compose for databases (Postgres, Neo4j, Typesense)
- Cross-reference matching for PPP loans, FEC, federal grants
2026-02-02 14:54:00 -06:00
commit f30c25e79f
33 changed files with 4353 additions and 0 deletions

.gitignore vendored Normal file

@@ -0,0 +1,66 @@
# Data sources - too large for git
DataSources/
# Build outputs
dist/
build/
.next/
out/
# Dependencies
node_modules/
vendor/
# Environment
.env
.env.local
.env.*.local
# Go
*.exe
*.dll
*.so
*.dylib
bin/
# Python
__pycache__/
*.py[cod]
*$py.class
.venv/
venv/
env/
*.egg-info/
# IDE
.idea/
.vscode/
*.swp
*.swo
*~
# OS
.DS_Store
Thumbs.db
# Database files (large, generated)
*.db
*.sqlite
*.sqlite3
# Logs
*.log
logs/
# Temporary files
tmp/
temp/
.cache/
# Generated data (can be recreated)
data/processed/
data/embeddings/
data/exports/
# Keep config examples
!*.example

LICENSE Normal file

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2026 Subcult
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md Normal file

@@ -0,0 +1,217 @@
# Epstein Files Database
A searchable database and network analysis tool for the DOJ Epstein Files release. Built to make public records accessible, cross-referenced, and analyzable.
## What This Does
1. **Entity Extraction** — Extracts names, organizations, locations, and dates from 4,055 DOJ documents
2. **Relationship Mapping** — Builds a graph of connections based on document co-occurrence
3. **Layer Classification** — Classifies entities by degree of separation from Jeffrey Epstein
4. **Cross-Reference Engine** — Fuzzy-matches entities against:
- PPP loan data (SBA)
- FEC campaign contributions
- Federal grant recipients
5. **Pattern Detection Agent** — AI agent specialized in finding non-obvious connections
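The fuzzy matching in step 4 can be sketched with a small trigram-similarity function, similar in spirit to the Postgres `pg_trgm` `similarity()` the server-side queries rely on. This is illustrative only; the normalization and padding here are simplified:

```typescript
// Trigram-based fuzzy name matching (simplified sketch of pg_trgm-style scoring).
function trigrams(s: string): Set<string> {
  // Lowercase, strip punctuation, pad so leading/trailing characters form trigrams.
  const norm = `  ${s.toLowerCase().replace(/[^a-z0-9 ]/g, '')} `;
  const grams = new Set<string>();
  for (let i = 0; i <= norm.length - 3; i++) grams.add(norm.slice(i, i + 3));
  return grams;
}

function similarity(a: string, b: string): number {
  // Jaccard similarity over trigram sets: shared / union.
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  for (const g of ta) if (tb.has(g)) shared++;
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}
```

Because punctuation is stripped before comparison, variants like `"L Brands Inc"` and `"L Brands, Inc."` score as exact matches, which is the behavior a cross-reference matcher wants.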
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ Frontend (React + Tailwind) │
│ • Search Interface • Network Visualization • Document Viewer │
└─────────────────────────┬───────────────────────────────────────┘
┌─────────────────────────▼───────────────────────────────────────┐
│ API Server (Go) │
│ • REST Endpoints • Full-text Search • Graph Queries │
└─────────────────────────┬───────────────────────────────────────┘
┌─────────────────────────▼───────────────────────────────────────┐
│ Data Layer │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│ │ PostgreSQL │ │ Neo4j │ │ Typesense/Meilisearch │ │
│ │ Entities │ │ Graph │ │ Full-text Search │ │
│ │ Documents │ │ Relations │ │ │ │
│ │ Cross-refs │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────▼───────────────────────────────────────┐
│ Extraction Pipeline (TypeScript) │
│ • OCR Processing • NER Extraction • Relationship Inference │
└─────────────────────────────────────────────────────────────────┘
```
## Tech Stack
| Component | Technology | Rationale |
|-----------|------------|-----------|
| Frontend | React + Tailwind + Vite | Fast, modern, type-safe |
| API | Go (Fiber/Echo) | Performance for graph queries |
| Primary DB | PostgreSQL | Structured data, JSONB, full-text |
| Graph DB | Neo4j | Relationship traversal at scale |
| Search | Typesense | Fast fuzzy search, typo-tolerant |
| Extraction | TypeScript + LLM | Entity extraction, deduplication |
| Pattern Agent | OpenClaw sub-agent | AI-driven connection discovery |
## Data Sources
### Primary: DOJ Epstein Files
- **4,055 documents** (EFTA00000001 through EFTA00008528)
- **1.77M lines** of OCR text
- **157GB** raw data (PDFs, images, scans)
- Source: https://www.justice.gov/epstein
### Cross-Reference Datasets
- **PPP Loans**: SBA FOIA data (https://data.sba.gov/dataset/ppp-foia)
- **FEC Contributions**: Federal Election Commission (https://www.fec.gov/data/)
- **Federal Grants**: USASpending.gov (https://www.usaspending.gov/download_center/custom_award_data)
## Layer Classification
| Layer | Definition | Example |
|-------|------------|---------|
| **L0** | Jeffrey Epstein himself | — |
| **L1** | Direct associates (named in documents with Epstein) | Ghislaine Maxwell |
| **L2** | One degree removed (connected to L1 but not directly to Epstein) | — |
| **L3** | Two degrees removed | — |
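Layer classification is essentially a breadth-first search outward from the L0 node over the co-occurrence graph. A minimal in-memory sketch (the adjacency map is a stand-in; the real pipeline would run this traversal against Neo4j):

```typescript
// BFS layer assignment: each entity's layer is its shortest-path distance
// from the root (L0) node in the co-occurrence graph.
function classifyLayers(adj: Map<string, string[]>, root: string): Map<string, number> {
  const layer = new Map<string, number>([[root, 0]]);
  const queue: string[] = [root];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const neighbor of adj.get(current) ?? []) {
      if (!layer.has(neighbor)) {
        layer.set(neighbor, layer.get(current)! + 1);
        queue.push(neighbor);
      }
    }
  }
  return layer; // entities absent from the result are unconnected to the root
}
```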
## Getting Started
### Prerequisites
- Docker & Docker Compose
- Node.js 20+
- Go 1.21+
- PostgreSQL 16+ (or use Docker)
- Neo4j 5+ (or use Docker)
### Quick Start
```bash
# Clone the repo
git clone https://github.com/subculture-collective/epstein-db.git
cd epstein-db
# Start databases
docker-compose up -d
# Install dependencies
npm install
cd api && go mod download && cd ..
# Run extraction pipeline (requires OpenAI-compatible API)
cp .env.example .env
# Edit .env with your API keys
npm run extract
# Start the API server
(cd api && go run ./cmd/server) &
# Start the frontend
npm run dev
```
## Project Structure
```
epstein-db/
├── api/ # Go API server
│ ├── cmd/ # Entry points
│ ├── internal/ # Internal packages
│ │ ├── handlers/ # HTTP handlers
│ │ ├── db/ # Database access
│ │ ├── graph/ # Neo4j operations
│ │ └── search/ # Typesense operations
│ └── pkg/ # Public packages
├── extraction/ # TypeScript extraction pipeline
│ ├── src/
│ │ ├── ocr/ # OCR processing
│ │ ├── ner/ # Named Entity Recognition
│ │ ├── dedup/ # Entity deduplication
│ │ └── cross-ref/ # Cross-reference matching
│ └── scripts/ # Pipeline scripts
├── frontend/ # React frontend
│ ├── src/
│ │ ├── components/ # UI components
│ │ ├── pages/ # Route pages
│ │ ├── hooks/ # Custom hooks
│ │ └── api/ # API client
│ └── public/
├── agents/ # AI agents
│ └── pattern-finder/ # Connection discovery agent
├── data/ # Data directory (gitignored)
│ ├── raw/ # Symlink to DataSources
│ ├── processed/ # Extracted entities/relations
│ ├── crossref/ # PPP, FEC, grants data
│ └── exports/ # Generated exports
├── docker-compose.yml # Database services
├── schema/ # Database schemas
│ ├── postgres/ # SQL migrations
│ └── neo4j/ # Cypher constraints
└── docs/ # Documentation
├── ARCHITECTURE.md
├── DATA_MODEL.md
└── CONTRIBUTING.md
```
## Roadmap
### Phase 1: Foundation ✅
- [x] Repository setup
- [x] Database schema design
- [x] Docker compose for databases
- [x] Basic extraction pipeline
### Phase 2: Entity Extraction
- [ ] OCR text ingestion
- [ ] Named Entity Recognition (NER)
- [ ] Entity deduplication (LLM-assisted)
- [ ] Document-entity relationships
### Phase 3: Graph Construction
- [ ] Neo4j schema
- [ ] Co-occurrence relationship building
- [ ] Layer classification algorithm
- [ ] Graph API endpoints
### Phase 4: Cross-Reference
- [ ] PPP loan data ingestion
- [ ] FEC contribution data ingestion
- [ ] Federal grants data ingestion
- [ ] Fuzzy matching engine
### Phase 5: Frontend
- [ ] Search interface
- [ ] Network visualization (D3/Force-Graph)
- [ ] Document viewer
- [ ] Entity detail pages
### Phase 6: Pattern Agent
- [ ] Agent architecture design
- [ ] Connection hypothesis generation
- [ ] Validation pipeline
- [ ] Report generation
## Contributing
This is an open research project. Contributions welcome:
- Entity extraction improvements
- Fuzzy matching algorithms
- UI/UX improvements
- Additional cross-reference datasets
- Pattern detection strategies
## License
MIT License. The code is open source. The documents are public records.
## Disclaimer
This is an independent research project. We make no representations about the completeness or accuracy of the analysis. This tool surfaces connections — it does not assert guilt, criminality, or wrongdoing.


@@ -0,0 +1,113 @@
# Pattern Finder Agent
An AI agent specialized in discovering non-obvious connections, patterns, and relationships within the Epstein Files database.
## Purpose
While the extraction pipeline identifies explicit entities and relationships, the Pattern Finder looks for:
1. **Indirect Connections** — Entities that appear in similar contexts but are never directly linked
2. **Temporal Patterns** — Activities that cluster around specific dates or events
3. **Financial Flows** — Money movement patterns across entities
4. **Network Anomalies** — Unusually dense or sparse connection patterns
5. **Cross-Reference Insights** — What PPP/FEC/Grants matches reveal about entities
## How It Works
The agent runs periodically (or on-demand) and:
1. **Samples the Graph** — Pulls subgraphs around high-degree or interesting entities
2. **Generates Hypotheses** — Uses LLM to identify potential patterns
3. **Validates Hypotheses** — Checks evidence in the actual documents
4. **Reports Findings** — Stores validated patterns with evidence chains
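Condensed, these four steps form a sample → hypothesize → validate → report loop. A sketch with placeholder functions standing in for the modules described above:

```typescript
// Skeleton of the agent loop. Each parameter is a stand-in for one of the
// four modules: graph sampling, LLM hypothesis generation, evidence
// validation, and report storage.
async function runPatternFinder(
  sampleGraph: () => Promise<unknown>,
  generate: (graph: unknown) => Promise<string[]>,
  validate: (hypothesis: string) => Promise<boolean>,
  report: (hypothesis: string) => void,
): Promise<void> {
  const graph = await sampleGraph();
  for (const hypothesis of await generate(graph)) {
    // Only hypotheses that survive document-level validation are reported.
    if (await validate(hypothesis)) report(hypothesis);
  }
}
```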
## Agent Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ Pattern Finder Agent │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. Sampling Module │
│ • Random walk from high-degree nodes │
│ • Temporal window sampling │
│ • Cross-reference focused sampling │
│ │
│ 2. Hypothesis Generator (LLM) │
│ • Pattern recognition prompts │
│ • Anomaly detection prompts │
│ • Connection inference prompts │
│ │
│ 3. Evidence Validator │
│ • Document retrieval │
│ • Citation extraction │
│ • Confidence scoring │
│ │
│ 4. Report Generator │
│ • Pattern summary │
│ • Evidence chain │
│ • Visualization data │
│ │
└─────────────────────────────────────────────────────────────────┘
```
## Pattern Types
### Financial Patterns
- Money flows between entities
- Unusual transaction timing
- Shell company connections
- Donation clustering
### Travel Patterns
- Co-location events
- Flight log correlations
- Property connections
- Event attendance
### Organizational Patterns
- Board memberships
- Foundation connections
- Employment relationships
- Legal representation
### Temporal Patterns
- Activity clustering around dates
- Gaps in documentation
- Correlated timelines
## Usage
```bash
# Run a pattern discovery session
npm run agent:pattern-finder
# Focus on specific entity
npm run agent:pattern-finder -- --entity "Ghislaine Maxwell"
# Focus on date range
npm run agent:pattern-finder -- --from "2005-01-01" --to "2010-12-31"
# Focus on pattern type
npm run agent:pattern-finder -- --type financial
```
## Output
Patterns are stored in the `pattern_findings` table with:
- Title and description
- Involved entities
- Evidence (documents, relationships)
- Confidence score
- Status (hypothesis, validated, rejected)
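Downstream consumers might rank stored findings like this. The `PatternFinding` shape mirrors the fields listed above and is purely illustrative:

```typescript
// Filter out rejected findings, drop low-confidence ones, and rank the rest.
interface PatternFinding {
  title: string;
  confidence: number; // 0–1
  status: 'hypothesis' | 'validated' | 'rejected';
}

function rankFindings(findings: PatternFinding[], minConfidence = 0.5): PatternFinding[] {
  return findings
    .filter(f => f.status !== 'rejected' && f.confidence >= minConfidence)
    .sort((a, b) => b.confidence - a.confidence);
}
```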
## Integration with OpenClaw
This agent can be spawned as a sub-agent from OpenClaw:
```typescript
sessions_spawn({
task: "Analyze the network around Les Wexner for financial patterns",
label: "pattern-finder-wexner",
})
```


@@ -0,0 +1,315 @@
/**
* Pattern Finder Agent
*
* Discovers non-obvious connections and patterns in the Epstein Files database.
*/
import Anthropic from '@anthropic-ai/sdk';
import pg from 'pg';
const { Pool } = pg;
// ============================================================================
// Configuration
// ============================================================================
const config = {
DATABASE_URL: process.env.DATABASE_URL || 'postgresql://epstein:epstein_dev@localhost:5432/epstein',
ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY || '',
LLM_MODEL: process.env.LLM_MODEL || 'claude-sonnet-4-20250514',
};
const pool = new Pool({ connectionString: config.DATABASE_URL });
const anthropic = new Anthropic({ apiKey: config.ANTHROPIC_API_KEY });
// ============================================================================
// Types
// ============================================================================
interface Entity {
id: number;
canonicalName: string;
entityType: string;
layer: number;
documentCount: number;
connectionCount: number;
pppMatches: any[];
fecMatches: any[];
grantsMatches: any[];
}
interface Connection {
entity1: string;
entity2: string;
sharedDocs: number;
documentIds: string[];
}
interface PatternHypothesis {
title: string;
description: string;
patternType: string;
entityNames: string[];
evidence: string[];
confidence: number;
}
// ============================================================================
// Sampling Functions
// ============================================================================
async function getHighDegreeEntities(limit: number = 50): Promise<Entity[]> {
const result = await pool.query(`
SELECT
id, canonical_name, entity_type, layer,
document_count, connection_count,
ppp_matches, fec_matches, grants_matches
FROM entities
WHERE entity_type IN ('person', 'organization')
ORDER BY connection_count DESC
LIMIT $1
`, [limit]);
return result.rows.map(row => ({
id: row.id,
canonicalName: row.canonical_name,
entityType: row.entity_type,
layer: row.layer || 0,
documentCount: row.document_count || 0,
connectionCount: row.connection_count || 0,
pppMatches: row.ppp_matches || [],
fecMatches: row.fec_matches || [],
grantsMatches: row.grants_matches || [],
}));
}
async function getEntityConnections(entityId: number, limit: number = 100): Promise<Connection[]> {
const result = await pool.query(`
SELECT
e1.canonical_name AS entity1,
e2.canonical_name AS entity2,
COUNT(DISTINCT d.id) AS shared_docs,
array_agg(DISTINCT d.doc_id) AS document_ids
FROM document_entities de1
JOIN document_entities de2 ON de1.document_id = de2.document_id AND de1.entity_id != de2.entity_id
JOIN entities e1 ON de1.entity_id = e1.id
JOIN entities e2 ON de2.entity_id = e2.id
JOIN documents d ON de1.document_id = d.id
WHERE de1.entity_id = $1
GROUP BY e1.canonical_name, e2.canonical_name
ORDER BY shared_docs DESC
LIMIT $2
`, [entityId, limit]);
return result.rows.map(row => ({
entity1: row.entity1,
entity2: row.entity2,
sharedDocs: parseInt(row.shared_docs),
documentIds: row.document_ids,
}));
}
async function getEntitiesWithCrossRefMatches(): Promise<Entity[]> {
const result = await pool.query(`
SELECT
id, canonical_name, entity_type, layer,
document_count, connection_count,
ppp_matches, fec_matches, grants_matches
FROM entities
WHERE
(ppp_matches IS NOT NULL AND jsonb_array_length(ppp_matches) > 0)
OR (fec_matches IS NOT NULL AND jsonb_array_length(fec_matches) > 0)
OR (grants_matches IS NOT NULL AND jsonb_array_length(grants_matches) > 0)
ORDER BY connection_count DESC
LIMIT 100
`);
return result.rows.map(row => ({
id: row.id,
canonicalName: row.canonical_name,
entityType: row.entity_type,
layer: row.layer || 0,
documentCount: row.document_count || 0,
connectionCount: row.connection_count || 0,
pppMatches: row.ppp_matches || [],
fecMatches: row.fec_matches || [],
grantsMatches: row.grants_matches || [],
}));
}
// ============================================================================
// Pattern Detection
// ============================================================================
const PATTERN_SYSTEM_PROMPT = `You are an investigative analyst specializing in network analysis and pattern detection. You're analyzing data from the Jeffrey Epstein case documents.
Your task is to identify non-obvious patterns, connections, and anomalies that might warrant further investigation.
Focus on:
1. Financial patterns (money flows, unusual transactions, timing)
2. Organizational patterns (shared board memberships, foundations, legal representation)
3. Temporal patterns (activities clustering around dates, gaps in documentation)
4. Network anomalies (unusually dense connections, unexpected bridges between groups)
5. Cross-reference insights (what PPP loans, FEC contributions, or federal grants reveal)
Be specific and cite evidence. Generate hypotheses that can be validated with document review.
IMPORTANT: You are surfacing patterns for investigation, not asserting guilt or wrongdoing.`;
async function generatePatternHypotheses(
entities: Entity[],
connections: Connection[]
): Promise<PatternHypothesis[]> {
const entitySummaries = entities.map(e => ({
name: e.canonicalName,
type: e.entityType,
layer: e.layer,
docs: e.documentCount,
connections: e.connectionCount,
hasPPP: e.pppMatches.length > 0,
hasFEC: e.fecMatches.length > 0,
hasGrants: e.grantsMatches.length > 0,
}));
const connectionSummaries = connections.slice(0, 50).map(c => ({
pair: `${c.entity1} ↔ ${c.entity2}`,
sharedDocs: c.sharedDocs,
}));
const prompt = `Analyze this network data and identify potential patterns worth investigating.
ENTITIES (${entities.length} total, showing key attributes):
${JSON.stringify(entitySummaries, null, 2)}
TOP CONNECTIONS:
${JSON.stringify(connectionSummaries, null, 2)}
Generate 3-5 pattern hypotheses. For each, provide:
1. A specific, descriptive title
2. What the pattern suggests
3. Which entities are involved
4. What evidence supports this hypothesis
5. Confidence level (0-1)
Return JSON array:
[
{
"title": "Pattern Title",
"description": "What this pattern suggests and why it's notable",
"patternType": "financial|organizational|temporal|network|crossref",
"entityNames": ["Entity1", "Entity2"],
"evidence": ["Evidence point 1", "Evidence point 2"],
"confidence": 0.7
}
]
Return ONLY valid JSON.`;
const response = await anthropic.messages.create({
model: config.LLM_MODEL,
max_tokens: 4096,
system: PATTERN_SYSTEM_PROMPT,
messages: [{ role: 'user', content: prompt }],
});
const content = response.content[0];
if (content.type !== 'text') {
throw new Error('Unexpected response type');
}
const jsonMatch = content.text.match(/\[[\s\S]*\]/);
if (!jsonMatch) {
console.error('No JSON found:', content.text);
return [];
}
return JSON.parse(jsonMatch[0]);
}
// ============================================================================
// Save Patterns
// ============================================================================
async function savePattern(pattern: PatternHypothesis): Promise<number> {
// Get entity IDs
const entityResult = await pool.query(`
SELECT id FROM entities WHERE canonical_name = ANY($1)
`, [pattern.entityNames]);
const entityIds = entityResult.rows.map(r => r.id);
const result = await pool.query(`
INSERT INTO pattern_findings
(title, description, pattern_type, entity_ids, evidence, confidence, status)
VALUES ($1, $2, $3, $4, $5, $6, 'hypothesis')
RETURNING id
`, [
pattern.title,
pattern.description,
pattern.patternType,
entityIds,
JSON.stringify({
entityNames: pattern.entityNames,
evidencePoints: pattern.evidence,
}),
pattern.confidence,
]);
return result.rows[0].id;
}
// ============================================================================
// Main
// ============================================================================
async function main() {
console.log('🔎 Pattern Finder Agent starting...\n');
// Get high-degree entities
console.log('📊 Sampling high-degree entities...');
const highDegree = await getHighDegreeEntities(50);
console.log(` Found ${highDegree.length} high-degree entities`);
// Get entities with cross-reference matches
console.log('📊 Sampling entities with cross-reference matches...');
const crossRef = await getEntitiesWithCrossRefMatches();
console.log(` Found ${crossRef.length} entities with PPP/FEC/Grants matches`);
// Get connections for top entities
console.log('📊 Sampling connections...');
const allConnections: Connection[] = [];
for (const entity of highDegree.slice(0, 10)) {
const connections = await getEntityConnections(entity.id, 50);
allConnections.push(...connections);
}
console.log(` Found ${allConnections.length} connections`);
// Combine entities (deduplicate)
const allEntities = [...highDegree, ...crossRef];
const uniqueEntities = Array.from(
new Map(allEntities.map(e => [e.id, e])).values()
);
// Generate pattern hypotheses
console.log('\n🧠 Generating pattern hypotheses...');
const patterns = await generatePatternHypotheses(uniqueEntities, allConnections);
console.log(` Generated ${patterns.length} hypotheses`);
// Save patterns
console.log('\n💾 Saving patterns to database...');
for (const pattern of patterns) {
const id = await savePattern(pattern);
console.log(` ✓ Saved: ${pattern.title} (ID: ${id})`);
}
console.log('\n✅ Pattern Finder complete!');
console.log(` Patterns discovered: ${patterns.length}`);
await pool.end();
}
main().catch((error) => {
console.error('Fatal error:', error);
process.exit(1);
});

api/cmd/server/main.go Normal file

@@ -0,0 +1,105 @@
package main
import (
"context"
"log"
"os"
"os/signal"
"syscall"
"github.com/gofiber/fiber/v2"
"github.com/gofiber/fiber/v2/middleware/cors"
"github.com/gofiber/fiber/v2/middleware/logger"
"github.com/gofiber/fiber/v2/middleware/recover"
"github.com/joho/godotenv"
"github.com/subculture-collective/epstein-db/api/internal/db"
"github.com/subculture-collective/epstein-db/api/internal/handlers"
)
func main() {
// Load .env file
if err := godotenv.Load(); err != nil {
log.Println("No .env file found, using environment variables")
}
// Initialize database connection
if err := db.Initialize(context.Background()); err != nil {
log.Fatalf("Failed to initialize database: %v", err)
}
defer db.Close()
// Create Fiber app
app := fiber.New(fiber.Config{
AppName: "Epstein Files API",
})
// Middleware
app.Use(recover.New())
app.Use(logger.New())
app.Use(cors.New(cors.Config{
AllowOrigins: "*",
AllowMethods: "GET,POST,PUT,DELETE,OPTIONS",
AllowHeaders: "Origin, Content-Type, Accept, Authorization",
}))
// Routes
api := app.Group("/api")
// Stats
api.Get("/stats", handlers.GetStats)
// Entities
api.Get("/entities", handlers.SearchEntities)
api.Get("/entities/:id", handlers.GetEntity)
api.Get("/entities/:id/connections", handlers.GetEntityConnections)
api.Get("/entities/:id/documents", handlers.GetEntityDocuments)
// Documents
api.Get("/documents", handlers.ListDocuments)
api.Get("/documents/:id", handlers.GetDocument)
api.Get("/documents/:id/text", handlers.GetDocumentText)
api.Get("/documents/:id/entities", handlers.GetDocumentEntities)
// Graph/Network
api.Get("/network", handlers.GetNetwork)
api.Get("/network/layers", handlers.GetNetworkByLayer)
// Cross-references
api.Get("/crossref/ppp", handlers.SearchPPP)
api.Get("/crossref/fec", handlers.SearchFEC)
api.Get("/crossref/grants", handlers.SearchGrants)
// Patterns
api.Get("/patterns", handlers.ListPatterns)
api.Get("/patterns/:id", handlers.GetPattern)
// Search
api.Get("/search", handlers.FullTextSearch)
// Health check
app.Get("/health", func(c *fiber.Ctx) error {
return c.JSON(fiber.Map{"status": "ok"})
})
// Get port from environment
port := os.Getenv("PORT")
if port == "" {
port = "3001"
}
// Graceful shutdown
go func() {
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
<-sigChan
log.Println("Shutting down...")
app.Shutdown()
}()
// Start server
log.Printf("Starting server on port %s", port)
if err := app.Listen(":" + port); err != nil {
log.Fatalf("Server error: %v", err)
}
}

api/go.mod Normal file

@@ -0,0 +1,31 @@
module github.com/subculture-collective/epstein-db/api
go 1.21
require (
github.com/gofiber/fiber/v2 v2.52.4
github.com/jackc/pgx/v5 v5.5.5
github.com/neo4j/neo4j-go-driver/v5 v5.19.0
github.com/typesense/typesense-go v1.1.0
github.com/joho/godotenv v1.5.1
)
require (
github.com/andybalholm/brotli v1.1.0 // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/jackc/pgpassfile v1.0.0 // indirect
github.com/jackc/pgservicefile v0.0.0-20231201235250-de7065d80cb9 // indirect
github.com/jackc/puddle/v2 v2.2.1 // indirect
github.com/klauspost/compress v1.17.8 // indirect
github.com/mattn/go-colorable v0.1.13 // indirect
github.com/mattn/go-isatty v0.0.20 // indirect
github.com/mattn/go-runewidth v0.0.15 // indirect
github.com/rivo/uniseg v0.4.7 // indirect
github.com/valyala/bytebufferpool v1.0.0 // indirect
github.com/valyala/fasthttp v1.52.0 // indirect
github.com/valyala/tcplisten v1.0.0 // indirect
golang.org/x/crypto v0.22.0 // indirect
golang.org/x/sync v0.7.0 // indirect
golang.org/x/sys v0.19.0 // indirect
golang.org/x/text v0.14.0 // indirect
)

api/internal/db/db.go Normal file

@@ -0,0 +1,35 @@
package db
import (
"context"
"os"
"github.com/jackc/pgx/v5/pgxpool"
)
var pool *pgxpool.Pool
func Initialize(ctx context.Context) error {
connString := os.Getenv("DATABASE_URL")
if connString == "" {
connString = "postgresql://epstein:epstein_dev@localhost:5432/epstein"
}
var err error
pool, err = pgxpool.New(ctx, connString)
if err != nil {
return err
}
return pool.Ping(ctx)
}
func Close() {
if pool != nil {
pool.Close()
}
}
func Pool() *pgxpool.Pool {
return pool
}


@@ -0,0 +1,202 @@
package handlers
import (
"context"
"strconv"
"github.com/gofiber/fiber/v2"
"github.com/subculture-collective/epstein-db/api/internal/db"
)
// SearchPPP searches PPP loan data
func SearchPPP(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
query := c.Query("q", "")
limitStr := c.Query("limit", "50")
limit, _ := strconv.Atoi(limitStr)
if limit > 200 {
limit = 200
}
rows, err := pool.Query(ctx, `
-- fuzzy matching below requires the pg_trgm extension (similarity(), % operator)
SELECT id, borrower_name, borrower_city, borrower_state,
loan_amount, forgiveness_amount, lender, date_approved,
similarity(borrower_name, $1) AS score
FROM ppp_loans
WHERE $1 = '' OR borrower_name % $1 OR borrower_name ILIKE '%' || $1 || '%'
ORDER BY
CASE WHEN $1 != '' THEN similarity(borrower_name, $1) ELSE 0 END DESC,
loan_amount DESC NULLS LAST
LIMIT $2
`, query, limit)
if err != nil {
return c.Status(500).JSON(fiber.Map{"error": err.Error()})
}
defer rows.Close()
var results []fiber.Map
for rows.Next() {
var id int
var name string
var city, state, lender *string
var loanAmount, forgivenessAmount *float64
var dateApproved *string
var score float64
if err := rows.Scan(&id, &name, &city, &state, &loanAmount,
&forgivenessAmount, &lender, &dateApproved, &score); err != nil {
continue
}
results = append(results, fiber.Map{
"id": id,
"borrowerName": name,
"borrowerCity": city,
"borrowerState": state,
"loanAmount": loanAmount,
"forgivenessAmount": forgivenessAmount,
"lender": lender,
"dateApproved": dateApproved,
"matchScore": score,
})
}
return c.JSON(fiber.Map{
"results": results,
"count": len(results),
})
}
// SearchFEC searches FEC contribution data
func SearchFEC(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
query := c.Query("q", "")
candidate := c.Query("candidate", "")
limitStr := c.Query("limit", "50")
limit, _ := strconv.Atoi(limitStr)
if limit > 200 {
limit = 200
}
rows, err := pool.Query(ctx, `
SELECT id, contributor_name, contributor_city, contributor_state,
contributor_employer, contributor_occupation,
candidate_name, committee_name, amount, contribution_date,
similarity(contributor_name, $1) AS score
FROM fec_contributions
WHERE ($1 = '' OR contributor_name % $1 OR contributor_name ILIKE '%' || $1 || '%')
AND ($2 = '' OR candidate_name ILIKE '%' || $2 || '%')
ORDER BY
CASE WHEN $1 != '' THEN similarity(contributor_name, $1) ELSE 0 END DESC,
amount DESC NULLS LAST
LIMIT $3
`, query, candidate, limit)
if err != nil {
return c.Status(500).JSON(fiber.Map{"error": err.Error()})
}
defer rows.Close()
var results []fiber.Map
for rows.Next() {
var id int
var name string
var city, state, employer, occupation, candidateName, committeeName *string
var amount *float64
var contributionDate *string
var score float64
if err := rows.Scan(&id, &name, &city, &state, &employer, &occupation,
&candidateName, &committeeName, &amount, &contributionDate, &score); err != nil {
continue
}
results = append(results, fiber.Map{
"id": id,
"contributorName": name,
"contributorCity": city,
"contributorState": state,
"employer": employer,
"occupation": occupation,
"candidateName": candidateName,
"committeeName": committeeName,
"amount": amount,
"contributionDate": contributionDate,
"matchScore": score,
})
}
return c.JSON(fiber.Map{
"results": results,
"count": len(results),
})
}
// SearchGrants searches federal grants data
func SearchGrants(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
query := c.Query("q", "")
agency := c.Query("agency", "")
limitStr := c.Query("limit", "50")
limit, _ := strconv.Atoi(limitStr)
if limit > 200 {
limit = 200
}
rows, err := pool.Query(ctx, `
SELECT id, recipient_name, recipient_city, recipient_state,
awarding_agency, funding_agency, award_amount, award_date,
description, cfda_title,
similarity(recipient_name, $1) AS score
FROM federal_grants
WHERE ($1 = '' OR recipient_name % $1 OR recipient_name ILIKE '%' || $1 || '%')
AND ($2 = '' OR awarding_agency ILIKE '%' || $2 || '%')
ORDER BY
CASE WHEN $1 != '' THEN similarity(recipient_name, $1) ELSE 0 END DESC,
award_amount DESC NULLS LAST
LIMIT $3
`, query, agency, limit)
if err != nil {
return c.Status(500).JSON(fiber.Map{"error": err.Error()})
}
defer rows.Close()
var results []fiber.Map
for rows.Next() {
var id int
var name string
var city, state, awardingAgency, fundingAgency *string
var awardAmount *float64
var awardDate, description, cfdaTitle *string
var score float64
if err := rows.Scan(&id, &name, &city, &state, &awardingAgency, &fundingAgency,
&awardAmount, &awardDate, &description, &cfdaTitle, &score); err != nil {
continue
}
results = append(results, fiber.Map{
"id": id,
"recipientName": name,
"recipientCity": city,
"recipientState": state,
"awardingAgency": awardingAgency,
"fundingAgency": fundingAgency,
"awardAmount": awardAmount,
"awardDate": awardDate,
"description": description,
"cfdaTitle": cfdaTitle,
"matchScore": score,
})
}
return c.JSON(fiber.Map{
"results": results,
"count": len(results),
})
}


@@ -0,0 +1,238 @@
package handlers
import (
"context"
"strconv"
"github.com/gofiber/fiber/v2"
"github.com/subculture-collective/epstein-db/api/internal/db"
)
// ListDocuments returns a paginated list of documents
func ListDocuments(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
limitStr := c.Query("limit", "50")
limit, _ := strconv.Atoi(limitStr)
if limit > 200 {
limit = 200
}
offsetStr := c.Query("offset", "0")
offset, _ := strconv.Atoi(offsetStr)
docType := c.Query("type", "")
dataset := c.Query("dataset", "")
rows, err := pool.Query(ctx, `
SELECT id, doc_id, dataset_id, document_type, summary, date_earliest, date_latest
FROM documents
WHERE ($1 = '' OR document_type = $1)
AND ($2 = '' OR dataset_id = NULLIF($2, '')::int)
ORDER BY doc_id
LIMIT $3 OFFSET $4
`, docType, dataset, limit, offset)
if err != nil {
return c.Status(500).JSON(fiber.Map{"error": err.Error()})
}
defer rows.Close()
var documents []fiber.Map
for rows.Next() {
var id, datasetID int
var docID string
var docType, summary *string
var dateEarliest, dateLatest *string
if err := rows.Scan(&id, &docID, &datasetID, &docType, &summary, &dateEarliest, &dateLatest); err != nil {
continue
}
documents = append(documents, fiber.Map{
"id": id,
"docId": docID,
"datasetId": datasetID,
"documentType": docType,
"summary": summary,
"dateEarliest": dateEarliest,
"dateLatest": dateLatest,
})
}
return c.JSON(fiber.Map{
"documents": documents,
"count": len(documents),
"offset": offset,
"limit": limit,
})
}
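Each list handler repeats the same query-string clamp inline. The pattern can be factored into a small helper; a minimal sketch (the `clampLimit` name is hypothetical, not part of the codebase):

```go
package main

import (
	"fmt"
	"strconv"
)

// clampLimit normalizes a user-supplied limit: non-numeric or non-positive
// input falls back to def, and anything above max is capped.
func clampLimit(raw string, def, max int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return def
	}
	if n > max {
		return max
	}
	return n
}

func main() {
	fmt.Println(clampLimit("abc", 50, 200)) // 50
	fmt.Println(clampLimit("999", 50, 200)) // 200
	fmt.Println(clampLimit("25", 50, 200))  // 25
}
```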
// GetDocument returns a single document by ID
func GetDocument(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
id, err := strconv.Atoi(c.Params("id"))
if err != nil {
return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
}
var doc struct {
ID int `json:"id"`
DocID string `json:"docId"`
DatasetID int `json:"datasetId"`
DocumentType *string `json:"documentType"`
Summary *string `json:"summary"`
DetailedSummary *string `json:"detailedSummary"`
DateEarliest *string `json:"dateEarliest"`
DateLatest *string `json:"dateLatest"`
ContentTags []byte `json:"contentTags"`
PageCount *int `json:"pageCount"`
}
err = pool.QueryRow(ctx, `
SELECT id, doc_id, dataset_id, document_type, summary, detailed_summary,
date_earliest::text, date_latest::text, content_tags, page_count
FROM documents WHERE id = $1
`, id).Scan(
&doc.ID, &doc.DocID, &doc.DatasetID, &doc.DocumentType,
&doc.Summary, &doc.DetailedSummary, &doc.DateEarliest,
&doc.DateLatest, &doc.ContentTags, &doc.PageCount,
)
if err != nil {
return c.Status(404).JSON(fiber.Map{"error": "document not found"})
}
return c.JSON(doc)
}
// GetDocumentText returns the full text of a document
func GetDocumentText(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
id, err := strconv.Atoi(c.Params("id"))
if err != nil {
return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
}
var text *string
err = pool.QueryRow(ctx, "SELECT full_text FROM documents WHERE id = $1", id).Scan(&text)
if err != nil {
return c.Status(404).JSON(fiber.Map{"error": "document not found"})
}
return c.JSON(fiber.Map{
"id": id,
"text": text,
})
}
// GetDocumentEntities returns entities mentioned in a document
func GetDocumentEntities(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
id, err := strconv.Atoi(c.Params("id"))
if err != nil {
return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
}
rows, err := pool.Query(ctx, `
SELECT e.id, e.canonical_name, e.entity_type, e.layer, de.mention_count
FROM entities e
JOIN document_entities de ON e.id = de.entity_id
WHERE de.document_id = $1
ORDER BY de.mention_count DESC
`, id)
if err != nil {
return c.Status(500).JSON(fiber.Map{"error": err.Error()})
}
defer rows.Close()
var entities []fiber.Map
for rows.Next() {
var entityID int
var name, etype string
var layer *int
var mentions int
if err := rows.Scan(&entityID, &name, &etype, &layer, &mentions); err != nil {
continue
}
entities = append(entities, fiber.Map{
"id": entityID,
"canonicalName": name,
"entityType": etype,
"layer": layer,
"mentionCount": mentions,
})
}
return c.JSON(fiber.Map{
"entities": entities,
"count": len(entities),
})
}
// FullTextSearch searches document text
func FullTextSearch(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
query := c.Query("q", "")
if query == "" {
return c.Status(400).JSON(fiber.Map{"error": "query required"})
}
limitStr := c.Query("limit", "20")
limit, _ := strconv.Atoi(limitStr)
if limit <= 0 {
limit = 20
}
if limit > 100 {
limit = 100
}
rows, err := pool.Query(ctx, `
SELECT id, doc_id, document_type, summary,
ts_rank(to_tsvector('english', full_text), plainto_tsquery('english', $1)) AS rank,
ts_headline('english', full_text, plainto_tsquery('english', $1),
'MaxWords=50, MinWords=20, StartSel=<mark>, StopSel=</mark>') AS snippet
FROM documents
WHERE to_tsvector('english', full_text) @@ plainto_tsquery('english', $1)
ORDER BY rank DESC
LIMIT $2
`, query, limit)
if err != nil {
return c.Status(500).JSON(fiber.Map{"error": err.Error()})
}
defer rows.Close()
var results []fiber.Map
for rows.Next() {
var id int
var docID string
var docType, summary, snippet *string
var rank float64
if err := rows.Scan(&id, &docID, &docType, &summary, &rank, &snippet); err != nil {
continue
}
results = append(results, fiber.Map{
"id": id,
"docId": docID,
"documentType": docType,
"summary": summary,
"rank": rank,
"snippet": snippet,
})
}
return c.JSON(fiber.Map{
"results": results,
"count": len(results),
"query": query,
})
}


@@ -0,0 +1,250 @@
package handlers
import (
"context"
"strconv"
"github.com/gofiber/fiber/v2"
"github.com/subculture-collective/epstein-db/api/internal/db"
)
// GetStats returns database statistics
func GetStats(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
var stats struct {
Documents int64 `json:"documents"`
Entities int64 `json:"entities"`
Triples int64 `json:"triples"`
PPPLoans int64 `json:"pppLoans"`
FECRecords int64 `json:"fecRecords"`
Grants int64 `json:"grants"`
Patterns int64 `json:"patterns"`
}
pool.QueryRow(ctx, "SELECT COUNT(*) FROM documents").Scan(&stats.Documents)
pool.QueryRow(ctx, "SELECT COUNT(*) FROM entities").Scan(&stats.Entities)
pool.QueryRow(ctx, "SELECT COUNT(*) FROM triples").Scan(&stats.Triples)
pool.QueryRow(ctx, "SELECT COUNT(*) FROM ppp_loans").Scan(&stats.PPPLoans)
pool.QueryRow(ctx, "SELECT COUNT(*) FROM fec_contributions").Scan(&stats.FECRecords)
pool.QueryRow(ctx, "SELECT COUNT(*) FROM federal_grants").Scan(&stats.Grants)
pool.QueryRow(ctx, "SELECT COUNT(*) FROM pattern_findings").Scan(&stats.Patterns)
return c.JSON(stats)
}
// SearchEntities searches for entities by name
func SearchEntities(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
query := c.Query("q", "")
limitStr := c.Query("limit", "20")
limit, _ := strconv.Atoi(limitStr)
if limit <= 0 {
limit = 20
}
if limit > 100 {
limit = 100
}
entityType := c.Query("type", "")
layer := c.Query("layer", "")
sqlQuery := `
SELECT id, canonical_name, entity_type, layer, document_count, connection_count
FROM entities
WHERE ($1 = '' OR canonical_name ILIKE '%' || $1 || '%' OR canonical_name % $1)
AND ($2 = '' OR entity_type = NULLIF($2, '')::entity_type)
AND ($3 = '' OR layer = NULLIF($3, '')::int)
ORDER BY
CASE WHEN $1 != '' THEN similarity(canonical_name, $1) ELSE 0 END DESC,
document_count DESC
LIMIT $4
`
rows, err := pool.Query(ctx, sqlQuery, query, entityType, layer, limit)
if err != nil {
return c.Status(500).JSON(fiber.Map{"error": err.Error()})
}
defer rows.Close()
var entities []fiber.Map
for rows.Next() {
var id int
var name, etype string
var layerVal, docCount, connCount *int
if err := rows.Scan(&id, &name, &etype, &layerVal, &docCount, &connCount); err != nil {
continue
}
entities = append(entities, fiber.Map{
"id": id,
"canonicalName": name,
"entityType": etype,
"layer": layerVal,
"documentCount": docCount,
"connectionCount": connCount,
})
}
return c.JSON(fiber.Map{
"entities": entities,
"count": len(entities),
})
}
// GetEntity returns a single entity by ID
func GetEntity(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
id, err := strconv.Atoi(c.Params("id"))
if err != nil {
return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
}
var entity struct {
ID int `json:"id"`
CanonicalName string `json:"canonicalName"`
EntityType string `json:"entityType"`
Layer *int `json:"layer"`
Description *string `json:"description"`
DocumentCount *int `json:"documentCount"`
ConnectionCount *int `json:"connectionCount"`
Aliases []byte `json:"aliases"`
PPPMatches []byte `json:"pppMatches"`
FECMatches []byte `json:"fecMatches"`
GrantsMatches []byte `json:"grantsMatches"`
}
err = pool.QueryRow(ctx, `
SELECT id, canonical_name, entity_type, layer, description,
document_count, connection_count, aliases,
ppp_matches, fec_matches, grants_matches
FROM entities WHERE id = $1
`, id).Scan(
&entity.ID, &entity.CanonicalName, &entity.EntityType,
&entity.Layer, &entity.Description, &entity.DocumentCount,
&entity.ConnectionCount, &entity.Aliases,
&entity.PPPMatches, &entity.FECMatches, &entity.GrantsMatches,
)
if err != nil {
return c.Status(404).JSON(fiber.Map{"error": "entity not found"})
}
return c.JSON(entity)
}
// GetEntityConnections returns entities connected to a given entity
func GetEntityConnections(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
id, err := strconv.Atoi(c.Params("id"))
if err != nil {
return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
}
limitStr := c.Query("limit", "50")
limit, _ := strconv.Atoi(limitStr)
if limit <= 0 {
limit = 50
}
if limit > 200 {
limit = 200
}
rows, err := pool.Query(ctx, `
SELECT
e2.id, e2.canonical_name, e2.entity_type, e2.layer,
COUNT(DISTINCT d.id) AS shared_docs
FROM document_entities de1
JOIN document_entities de2 ON de1.document_id = de2.document_id AND de1.entity_id != de2.entity_id
JOIN entities e2 ON de2.entity_id = e2.id
JOIN documents d ON de1.document_id = d.id
WHERE de1.entity_id = $1
GROUP BY e2.id, e2.canonical_name, e2.entity_type, e2.layer
ORDER BY shared_docs DESC
LIMIT $2
`, id, limit)
if err != nil {
return c.Status(500).JSON(fiber.Map{"error": err.Error()})
}
defer rows.Close()
var connections []fiber.Map
for rows.Next() {
var connID int
var name, etype string
var layerVal *int
var sharedDocs int
if err := rows.Scan(&connID, &name, &etype, &layerVal, &sharedDocs); err != nil {
continue
}
connections = append(connections, fiber.Map{
"id": connID,
"canonicalName": name,
"entityType": etype,
"layer": layerVal,
"sharedDocs": sharedDocs,
})
}
return c.JSON(fiber.Map{
"connections": connections,
"count": len(connections),
})
}
// GetEntityDocuments returns documents mentioning an entity
func GetEntityDocuments(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
id, err := strconv.Atoi(c.Params("id"))
if err != nil {
return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
}
limitStr := c.Query("limit", "50")
limit, _ := strconv.Atoi(limitStr)
if limit <= 0 {
limit = 50
}
if limit > 200 {
limit = 200
}
rows, err := pool.Query(ctx, `
SELECT d.id, d.doc_id, d.document_type, d.summary, d.date_earliest::text, d.date_latest::text
FROM documents d
JOIN document_entities de ON d.id = de.document_id
WHERE de.entity_id = $1
ORDER BY d.date_earliest DESC NULLS LAST
LIMIT $2
`, id, limit)
if err != nil {
return c.Status(500).JSON(fiber.Map{"error": err.Error()})
}
defer rows.Close()
var documents []fiber.Map
for rows.Next() {
var docID int
var docIdStr string
var docType, summary *string
var dateEarliest, dateLatest *string
if err := rows.Scan(&docID, &docIdStr, &docType, &summary, &dateEarliest, &dateLatest); err != nil {
continue
}
documents = append(documents, fiber.Map{
"id": docID,
"docId": docIdStr,
"documentType": docType,
"summary": summary,
"dateEarliest": dateEarliest,
"dateLatest": dateLatest,
})
}
return c.JSON(fiber.Map{
"documents": documents,
"count": len(documents),
})
}


@@ -0,0 +1,282 @@
package handlers
import (
"context"
"strconv"
"github.com/gofiber/fiber/v2"
"github.com/subculture-collective/epstein-db/api/internal/db"
)
// GetNetwork returns the relationship network for visualization
func GetNetwork(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
limitStr := c.Query("limit", "1000")
limit, _ := strconv.Atoi(limitStr)
if limit <= 0 {
limit = 1000
}
if limit > 10000 {
limit = 10000
}
minConnections := c.Query("minConnections", "2")
minConn, _ := strconv.Atoi(minConnections)
// Get nodes (entities with sufficient connections)
nodeRows, err := pool.Query(ctx, `
SELECT id, canonical_name, entity_type, layer, document_count, connection_count
FROM entities
WHERE entity_type IN ('person', 'organization')
AND connection_count >= $1
ORDER BY connection_count DESC
LIMIT $2
`, minConn, limit)
if err != nil {
return c.Status(500).JSON(fiber.Map{"error": err.Error()})
}
defer nodeRows.Close()
var nodes []fiber.Map
nodeIDs := make(map[int]bool)
for nodeRows.Next() {
var id int
var name, etype string
var layer, docCount, connCount *int
if err := nodeRows.Scan(&id, &name, &etype, &layer, &docCount, &connCount); err != nil {
continue
}
nodeIDs[id] = true
nodes = append(nodes, fiber.Map{
"id": id,
"canonicalName": name,
"entityType": etype,
"layer": layer,
"documentCount": docCount,
"connectionCount": connCount,
})
}
// Get edges (co-occurrence relationships)
edgeRows, err := pool.Query(ctx, `
SELECT
de1.entity_id AS source,
de2.entity_id AS target,
COUNT(DISTINCT de1.document_id) AS weight
FROM document_entities de1
JOIN document_entities de2 ON de1.document_id = de2.document_id
AND de1.entity_id < de2.entity_id
JOIN entities e1 ON de1.entity_id = e1.id
JOIN entities e2 ON de2.entity_id = e2.id
WHERE e1.entity_type IN ('person', 'organization')
AND e2.entity_type IN ('person', 'organization')
AND e1.connection_count >= $1
AND e2.connection_count >= $1
GROUP BY de1.entity_id, de2.entity_id
HAVING COUNT(DISTINCT de1.document_id) >= 2
ORDER BY weight DESC
LIMIT $2
`, minConn, limit*3)
if err != nil {
return c.Status(500).JSON(fiber.Map{"error": err.Error()})
}
defer edgeRows.Close()
var edges []fiber.Map
for edgeRows.Next() {
var source, target, weight int
if err := edgeRows.Scan(&source, &target, &weight); err != nil {
continue
}
// Only include edges where both nodes are in our node set
if nodeIDs[source] && nodeIDs[target] {
edges = append(edges, fiber.Map{
"source": source,
"target": target,
"weight": weight,
})
}
}
return c.JSON(fiber.Map{
"nodes": nodes,
"edges": edges,
"stats": fiber.Map{
"nodeCount": len(nodes),
"edgeCount": len(edges),
},
})
}
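The edge query can return endpoints that fell outside the node LIMIT, so GetNetwork only keeps edges whose endpoints are both in the selected node set. That filtering step in isolation (types here are illustrative, not the handler's actual ones):

```go
package main

import "fmt"

type edge struct{ Source, Target, Weight int }

// filterEdges keeps only edges whose endpoints are both present in nodeIDs,
// mirroring the post-query filter in GetNetwork.
func filterEdges(nodeIDs map[int]bool, edges []edge) []edge {
	var kept []edge
	for _, e := range edges {
		if nodeIDs[e.Source] && nodeIDs[e.Target] {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	nodes := map[int]bool{1: true, 2: true, 3: true}
	edges := []edge{{1, 2, 5}, {2, 4, 3}, {1, 3, 2}}
	// edge {2,4} is dropped because 4 is not in the node set
	fmt.Println(len(filterEdges(nodes, edges))) // 2
}
```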
// GetNetworkByLayer returns entities organized by layer
func GetNetworkByLayer(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
var layers []fiber.Map
for layer := 0; layer <= 3; layer++ {
rows, err := pool.Query(ctx, `
SELECT id, canonical_name, entity_type, document_count, connection_count
FROM entities
WHERE layer = $1 AND entity_type IN ('person', 'organization')
ORDER BY connection_count DESC
LIMIT 100
`, layer)
if err != nil {
continue
}
var entities []fiber.Map
for rows.Next() {
var id int
var name, etype string
var docCount, connCount *int
if err := rows.Scan(&id, &name, &etype, &docCount, &connCount); err != nil {
continue
}
entities = append(entities, fiber.Map{
"id": id,
"canonicalName": name,
"entityType": etype,
"documentCount": docCount,
"connectionCount": connCount,
})
}
rows.Close()
layers = append(layers, fiber.Map{
"layer": layer,
"entities": entities,
"count": len(entities),
})
}
return c.JSON(fiber.Map{
"layers": layers,
})
}
// ListPatterns returns discovered patterns
func ListPatterns(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
status := c.Query("status", "")
patternType := c.Query("type", "")
rows, err := pool.Query(ctx, `
SELECT id, title, description, pattern_type, confidence, status, discovered_at::text
FROM pattern_findings
WHERE ($1 = '' OR status = $1)
AND ($2 = '' OR pattern_type = $2)
ORDER BY discovered_at DESC
LIMIT 100
`, status, patternType)
if err != nil {
return c.Status(500).JSON(fiber.Map{"error": err.Error()})
}
defer rows.Close()
var patterns []fiber.Map
for rows.Next() {
var id int
var title, description, ptype, status string
var confidence *float64
var discoveredAt string
if err := rows.Scan(&id, &title, &description, &ptype, &confidence, &status, &discoveredAt); err != nil {
continue
}
patterns = append(patterns, fiber.Map{
"id": id,
"title": title,
"description": description,
"patternType": ptype,
"confidence": confidence,
"status": status,
"discoveredAt": discoveredAt,
})
}
return c.JSON(fiber.Map{
"patterns": patterns,
"count": len(patterns),
})
}
// GetPattern returns a single pattern with full details
func GetPattern(c *fiber.Ctx) error {
ctx := context.Background()
pool := db.Pool()
id, err := strconv.Atoi(c.Params("id"))
if err != nil {
return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
}
var pattern struct {
ID int `json:"id"`
Title string `json:"title"`
Description string `json:"description"`
PatternType string `json:"patternType"`
EntityIDs []int `json:"entityIds"`
Evidence []byte `json:"evidence"`
Confidence *float64 `json:"confidence"`
Status string `json:"status"`
Notes *string `json:"notes"`
DiscoveredAt string `json:"discoveredAt"`
DiscoveredBy string `json:"discoveredBy"`
}
err = pool.QueryRow(ctx, `
SELECT id, title, description, pattern_type, entity_ids, evidence,
confidence, status, notes, discovered_at::text, discovered_by
FROM pattern_findings WHERE id = $1
`, id).Scan(
&pattern.ID, &pattern.Title, &pattern.Description, &pattern.PatternType,
&pattern.EntityIDs, &pattern.Evidence, &pattern.Confidence,
&pattern.Status, &pattern.Notes, &pattern.DiscoveredAt, &pattern.DiscoveredBy,
)
if err != nil {
return c.Status(404).JSON(fiber.Map{"error": "pattern not found"})
}
// Get entity details
entityRows, err := pool.Query(ctx, `
SELECT id, canonical_name, entity_type, layer
FROM entities WHERE id = ANY($1)
`, pattern.EntityIDs)
if err == nil {
var entities []fiber.Map
for entityRows.Next() {
var eid int
var name, etype string
var layer *int
if err := entityRows.Scan(&eid, &name, &etype, &layer); err != nil {
continue
}
entities = append(entities, fiber.Map{
"id": eid,
"canonicalName": name,
"entityType": etype,
"layer": layer,
})
}
entityRows.Close()
return c.JSON(fiber.Map{
"pattern": pattern,
"entities": entities,
})
}
return c.JSON(pattern)
}

docker-compose.yml Normal file

@@ -0,0 +1,64 @@
services:
postgres:
image: postgres:16-alpine
container_name: epstein-db-postgres
restart: unless-stopped
environment:
POSTGRES_USER: epstein
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-epstein_dev}
POSTGRES_DB: epstein
ports:
- "5432:5432"
volumes:
- postgres_data:/var/lib/postgresql/data
- ./schema/postgres:/docker-entrypoint-initdb.d:ro
healthcheck:
test: ["CMD-SHELL", "pg_isready -U epstein -d epstein"]
interval: 10s
timeout: 5s
retries: 5
neo4j:
image: neo4j:5-community
container_name: epstein-db-neo4j
restart: unless-stopped
environment:
NEO4J_AUTH: neo4j/${NEO4J_PASSWORD:-neo4j_dev}
NEO4J_PLUGINS: '["apoc"]'
NEO4J_server_memory_heap_initial__size: 512m
NEO4J_server_memory_heap_max__size: 2G
ports:
- "7474:7474" # HTTP
- "7687:7687" # Bolt
volumes:
- neo4j_data:/data
- neo4j_logs:/logs
healthcheck:
test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://localhost:7474 || exit 1"]
interval: 10s
timeout: 10s
retries: 5
typesense:
image: typesense/typesense:27.1
container_name: epstein-db-typesense
restart: unless-stopped
environment:
TYPESENSE_DATA_DIR: /data
TYPESENSE_API_KEY: ${TYPESENSE_API_KEY:-typesense_dev}
TYPESENSE_ENABLE_CORS: "true"
ports:
- "8108:8108"
volumes:
- typesense_data:/data
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8108/health"]
interval: 10s
timeout: 5s
retries: 5
volumes:
postgres_data:
neo4j_data:
neo4j_logs:
typesense_data:

extraction/package.json Normal file

@@ -0,0 +1,39 @@
{
"name": "@epstein-db/extraction",
"version": "1.0.0",
"description": "Entity extraction pipeline for Epstein Files Database",
"type": "module",
"scripts": {
"build": "tsc",
"dev": "tsx watch src/index.ts",
"extract:documents": "tsx src/scripts/extract-documents.ts",
"extract:entities": "tsx src/scripts/extract-entities.ts",
"deduplicate": "tsx src/scripts/deduplicate.ts",
"load:crossref": "tsx src/scripts/load-crossref.ts",
"match:crossref": "tsx src/scripts/match-crossref.ts",
"calculate:layers": "tsx src/scripts/calculate-layers.ts",
"sync:neo4j": "tsx src/scripts/sync-neo4j.ts",
"pipeline": "npm run extract:documents && npm run extract:entities && npm run deduplicate && npm run calculate:layers && npm run sync:neo4j",
"typecheck": "tsc --noEmit"
},
"dependencies": {
"@anthropic-ai/sdk": "^0.24.0",
"@neondatabase/serverless": "^0.9.0",
"better-sqlite3": "^11.0.0",
"dotenv": "^16.4.5",
"drizzle-orm": "^0.30.0",
"neo4j-driver": "^5.19.0",
"openai": "^4.47.0",
"p-limit": "^5.0.0",
"pg": "^8.11.5",
"typesense": "^1.8.2",
"zod": "^3.23.0"
},
"devDependencies": {
"@types/better-sqlite3": "^7.6.10",
"@types/node": "^20.12.0",
"@types/pg": "^8.11.5",
"tsx": "^4.9.0",
"typescript": "^5.4.0"
}
}

extraction/src/config.ts Normal file

@@ -0,0 +1,33 @@
import { z } from 'zod';
import dotenv from 'dotenv';
dotenv.config();
const configSchema = z.object({
// Database
DATABASE_URL: z.string().default('postgresql://epstein:epstein_dev@localhost:5432/epstein'),
NEO4J_URI: z.string().default('bolt://localhost:7687'),
NEO4J_USER: z.string().default('neo4j'),
NEO4J_PASSWORD: z.string().default('neo4j_dev'),
TYPESENSE_HOST: z.string().default('localhost'),
TYPESENSE_PORT: z.coerce.number().default(8108),
TYPESENSE_API_KEY: z.string().default('typesense_dev'),
// LLM
OPENAI_API_KEY: z.string().optional(),
OPENAI_BASE_URL: z.string().optional(),
ANTHROPIC_API_KEY: z.string().optional(),
LLM_MODEL: z.string().default('claude-sonnet-4-20250514'),
// Extraction
DATA_DIR: z.string().default('../DataSources'),
BATCH_SIZE: z.coerce.number().default(10),
MAX_WORKERS: z.coerce.number().default(5),
// Rate limiting
REQUESTS_PER_MINUTE: z.coerce.number().default(50),
});
export type Config = z.infer<typeof configSchema>;
export const config = configSchema.parse(process.env);

extraction/src/db.ts Normal file

@@ -0,0 +1,248 @@
import pg from 'pg';
import { config } from './config.js';
const { Pool } = pg;
export const pool = new Pool({
connectionString: config.DATABASE_URL,
});
// Helper for transactions
export async function withTransaction<T>(
fn: (client: pg.PoolClient) => Promise<T>
): Promise<T> {
const client = await pool.connect();
try {
await client.query('BEGIN');
const result = await fn(client);
await client.query('COMMIT');
return result;
} catch (error) {
await client.query('ROLLBACK');
throw error;
} finally {
client.release();
}
}
// Document operations
export async function insertDocument(doc: {
docId: string;
datasetId: number;
filePath?: string;
fullText?: string;
pageCount?: number;
}): Promise<number> {
const result = await pool.query(
`INSERT INTO documents (doc_id, dataset_id, file_path, full_text, page_count)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (doc_id) DO UPDATE SET
full_text = COALESCE(EXCLUDED.full_text, documents.full_text),
updated_at = NOW()
RETURNING id`,
[doc.docId, doc.datasetId, doc.filePath, doc.fullText, doc.pageCount]
);
return result.rows[0].id;
}
export async function updateDocumentAnalysis(
docId: string,
analysis: {
summary: string;
detailedSummary: string;
documentType: string;
dateEarliest?: Date;
dateLatest?: Date;
contentTags: string[];
}
): Promise<void> {
await pool.query(
`UPDATE documents SET
summary = $2,
detailed_summary = $3,
document_type = $4,
date_earliest = $5,
date_latest = $6,
content_tags = $7,
analysis_status = 'complete',
analyzed_at = NOW(),
updated_at = NOW()
WHERE doc_id = $1`,
[
docId,
analysis.summary,
analysis.detailedSummary,
analysis.documentType,
analysis.dateEarliest,
analysis.dateLatest,
JSON.stringify(analysis.contentTags),
]
);
}
export async function getDocumentsPendingAnalysis(
limit: number = 100
): Promise<Array<{ id: number; docId: string; fullText: string }>> {
const result = await pool.query(
`SELECT id, doc_id, full_text FROM documents
WHERE analysis_status = 'pending' AND full_text IS NOT NULL
LIMIT $1`,
[limit]
);
return result.rows.map((row) => ({
id: row.id,
docId: row.doc_id,
fullText: row.full_text,
}));
}
// Entity operations
export async function upsertEntity(entity: {
canonicalName: string;
entityType: string;
aliases?: string[];
description?: string;
}): Promise<number> {
const result = await pool.query(
`INSERT INTO entities (canonical_name, entity_type, aliases, description)
VALUES ($1, $2::entity_type, $3, $4)
ON CONFLICT (canonical_name, entity_type) DO UPDATE SET
aliases = COALESCE(
entities.aliases || EXCLUDED.aliases,
entities.aliases,
EXCLUDED.aliases
),
updated_at = NOW()
RETURNING id`,
[
entity.canonicalName,
entity.entityType,
JSON.stringify(entity.aliases || []),
entity.description,
]
);
return result.rows[0].id;
}
export async function linkEntityToDocument(
entityId: number,
documentId: number,
mentionCount: number = 1,
contextSnippet?: string
): Promise<void> {
await pool.query(
`INSERT INTO document_entities (document_id, entity_id, mention_count, context_snippet)
VALUES ($1, $2, $3, $4)
ON CONFLICT (document_id, entity_id) DO UPDATE SET
mention_count = document_entities.mention_count + EXCLUDED.mention_count`,
[documentId, entityId, mentionCount, contextSnippet]
);
}
export async function insertTriple(triple: {
documentId: number;
subjectId: number;
predicate: string;
objectId: number;
locationId?: number;
timestamp?: Date;
explicitTopic?: string;
implicitTopic?: string;
tags?: string[];
sequenceOrder: number;
}): Promise<number> {
const result = await pool.query(
`INSERT INTO triples
(document_id, subject_id, predicate, object_id, location_id, timestamp, explicit_topic, implicit_topic, tags, sequence_order)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
RETURNING id`,
[
triple.documentId,
triple.subjectId,
triple.predicate,
triple.objectId,
triple.locationId,
triple.timestamp,
triple.explicitTopic,
triple.implicitTopic,
JSON.stringify(triple.tags || []),
triple.sequenceOrder,
]
);
return result.rows[0].id;
}
// Layer calculation
export async function calculateEntityLayers(): Promise<void> {
// Set Layer 0: Epstein himself, so he is never swept into Layer 2 below
await pool.query(`
UPDATE entities SET layer = 0, updated_at = NOW()
WHERE canonical_name = 'Jeffrey Epstein' AND entity_type = 'person'
`);
// Set Layer 1: entities that share documents with Epstein
await pool.query(`
WITH epstein AS (
SELECT id FROM entities WHERE canonical_name = 'Jeffrey Epstein' AND entity_type = 'person'
),
epstein_docs AS (
SELECT DISTINCT document_id FROM document_entities WHERE entity_id = (SELECT id FROM epstein)
),
layer1_entities AS (
SELECT DISTINCT entity_id FROM document_entities
WHERE document_id IN (SELECT document_id FROM epstein_docs)
AND entity_id != (SELECT id FROM epstein)
)
UPDATE entities SET layer = 1, updated_at = NOW()
WHERE id IN (SELECT entity_id FROM layer1_entities) AND layer IS NULL
`);
// Set Layer 2: entities that share documents with Layer 1 (but not with Epstein directly)
await pool.query(`
WITH layer1 AS (
SELECT id FROM entities WHERE layer = 1
),
layer1_docs AS (
SELECT DISTINCT document_id FROM document_entities WHERE entity_id IN (SELECT id FROM layer1)
),
layer2_candidates AS (
SELECT DISTINCT entity_id FROM document_entities
WHERE document_id IN (SELECT document_id FROM layer1_docs)
)
UPDATE entities SET layer = 2, updated_at = NOW()
WHERE id IN (SELECT entity_id FROM layer2_candidates) AND layer IS NULL
`);
// Set Layer 3: remaining entities
await pool.query(`
UPDATE entities SET layer = 3, updated_at = NOW() WHERE layer IS NULL
`);
}
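The SQL above is effectively a breadth-first expansion over document co-occurrence: layer 1 shares a document with the root, layer 2 shares a document with layer 1, everything else falls to layer 3. An in-memory sketch of the same logic, written in Go purely for illustration (the data shapes are hypothetical):

```go
package main

import "fmt"

// assignLayers mirrors calculateEntityLayers: entities co-occurring in a
// document with the root get layer 1, entities co-occurring with layer 1
// get layer 2; entities absent from the map default to layer 3 on read.
func assignLayers(docEntities map[string][]int, root int) map[int]int {
	layers := map[int]int{root: 0}
	assign := func(frontier map[int]bool, layer int) map[int]bool {
		next := map[int]bool{}
		for _, ents := range docEntities {
			hit := false
			for _, e := range ents {
				if frontier[e] {
					hit = true
					break
				}
			}
			if !hit {
				continue
			}
			for _, e := range ents {
				if _, seen := layers[e]; !seen {
					layers[e] = layer
					next[e] = true
				}
			}
		}
		return next
	}
	frontier := map[int]bool{root: true}
	frontier = assign(frontier, 1)
	assign(frontier, 2)
	return layers
}

func main() {
	docs := map[string][]int{
		"doc-a": {1, 2}, // root (1) co-occurs with 2 -> layer 1
		"doc-b": {2, 3}, // 3 co-occurs with layer-1 entity 2 -> layer 2
		"doc-c": {4},    // unconnected -> absent, i.e. layer 3
	}
	layers := assignLayers(docs, 1)
	fmt.Println(layers[2], layers[3]) // 1 2
}
```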
// Search
export async function searchEntities(
query: string,
limit: number = 20
): Promise<
Array<{
id: number;
canonicalName: string;
entityType: string;
layer: number;
documentCount: number;
}>
> {
const result = await pool.query(
`SELECT id, canonical_name, entity_type, layer, document_count
FROM entities
WHERE canonical_name ILIKE $1 OR canonical_name % $2
ORDER BY similarity(canonical_name, $2) DESC, document_count DESC
LIMIT $3`,
[`%${query}%`, query, limit]
);
return result.rows.map((row) => ({
id: row.id,
canonicalName: row.canonical_name,
entityType: row.entity_type,
layer: row.layer,
documentCount: row.document_count,
}));
}
export async function close(): Promise<void> {
await pool.end();
}


@@ -0,0 +1,208 @@
import { z } from 'zod';
import Anthropic from '@anthropic-ai/sdk';
import { config } from '../config.js';
// Initialize Anthropic client
const anthropic = new Anthropic({
apiKey: config.ANTHROPIC_API_KEY,
});
// ============================================================================
// SCHEMAS
// ============================================================================
export const EntitySchema = z.object({
name: z.string(),
type: z.enum(['person', 'organization', 'location', 'date', 'reference', 'financial']),
context: z.string().optional(),
});
export const TripleSchema = z.object({
subject: z.string(),
subjectType: z.enum(['person', 'organization', 'location']),
predicate: z.string(),
object: z.string(),
objectType: z.enum(['person', 'organization', 'location', 'date', 'reference', 'financial']),
location: z.string().optional(),
timestamp: z.string().optional(),
explicitTopic: z.string().optional(),
implicitTopic: z.string().optional(),
tags: z.array(z.string()).optional(),
});
export const DocumentAnalysisSchema = z.object({
summary: z.string(),
detailedSummary: z.string(),
documentType: z.string(),
dateEarliest: z.string().nullable(),
dateLatest: z.string().nullable(),
contentTags: z.array(z.string()),
entities: z.array(EntitySchema),
triples: z.array(TripleSchema),
});
export type Entity = z.infer<typeof EntitySchema>;
export type Triple = z.infer<typeof TripleSchema>;
export type DocumentAnalysis = z.infer<typeof DocumentAnalysisSchema>;
// ============================================================================
// EXTRACTION PROMPTS
// ============================================================================
const EXTRACTION_SYSTEM_PROMPT = `You are an expert document analyst specializing in legal documents, financial records, and correspondence. Your task is to extract structured information from documents related to the Jeffrey Epstein case.
Extract the following:
1. **Entities**: All people, organizations, locations, dates, document references, and financial amounts mentioned.
2. **Relationships (Triples)**: Subject-Predicate-Object relationships between entities.
3. **Document Analysis**: Summary, type classification, date range, and content tags.
Be thorough but precise. If information is unclear or partially redacted, note what you can determine. Focus on factual extraction, not interpretation.
IMPORTANT:
- Normalize names where possible (e.g., "J. Epstein" → "Jeffrey Epstein" if context confirms)
- Include context snippets for important entities
- Extract temporal information when available
- Tag relationships with relevant categories (legal, financial, travel, social, etc.)`;
const EXTRACTION_USER_PROMPT = (text: string) => `Analyze this document and extract structured information.
<document>
${text}
</document>
Respond with a JSON object matching this schema:
{
"summary": "One sentence summary of the document",
"detailedSummary": "A paragraph explaining the document's content and significance",
"documentType": "Type of document (e.g., deposition, email, financial record, flight log, etc.)",
"dateEarliest": "YYYY-MM-DD or null if no dates",
"dateLatest": "YYYY-MM-DD or null if no dates",
"contentTags": ["tag1", "tag2", ...],
"entities": [
{"name": "Full Name", "type": "person|organization|location|date|reference|financial", "context": "brief context"}
],
"triples": [
{
"subject": "Entity Name",
"subjectType": "person|organization|location",
"predicate": "action/relationship verb",
"object": "Entity Name",
"objectType": "person|organization|location|date|reference|financial",
"location": "where (optional)",
"timestamp": "YYYY-MM-DD (optional)",
"explicitTopic": "stated subject matter (optional)",
"implicitTopic": "inferred subject matter (optional)",
"tags": ["legal", "financial", "travel", etc.]
}
]
}
Return ONLY valid JSON, no markdown or explanation.`;
// ============================================================================
// EXTRACTION FUNCTION
// ============================================================================
export async function extractFromDocument(
docId: string,
text: string
): Promise<DocumentAnalysis> {
// Truncate very long documents
const maxChars = 100000;
const truncatedText = text.length > maxChars
? text.slice(0, maxChars) + '\n\n[TRUNCATED - document continues...]'
: text;
const response = await anthropic.messages.create({
model: config.LLM_MODEL,
max_tokens: 8192,
system: EXTRACTION_SYSTEM_PROMPT,
messages: [
{
role: 'user',
content: EXTRACTION_USER_PROMPT(truncatedText),
},
],
});
// Extract text content
const content = response.content[0];
if (content.type !== 'text') {
throw new Error(`Unexpected response type: ${content.type}`);
}
// Parse JSON
let parsed: unknown;
try {
// Try to extract JSON from the response (sometimes wrapped in markdown)
const jsonMatch = content.text.match(/\{[\s\S]*\}/);
if (!jsonMatch) {
throw new Error('No JSON found in response');
}
parsed = JSON.parse(jsonMatch[0]);
} catch (error) {
console.error(`Failed to parse JSON for ${docId}:`, content.text.slice(0, 500));
throw new Error(`JSON parse error: ${error}`);
}
// Validate against schema
const result = DocumentAnalysisSchema.parse(parsed);
return result;
}
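The response cleanup above leans on a greedy `/\{[\s\S]*\}/` match: everything from the first `{` to the last `}`, which tolerates markdown fences around the payload. The same idea sketched in Go for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// extractJSON pulls the outermost {...} span from an LLM response,
// mirroring the greedy regex used in extractFromDocument: first '{'
// through last '}', so surrounding fences and prose are ignored.
func extractJSON(s string) (string, bool) {
	start := strings.Index(s, "{")
	end := strings.LastIndex(s, "}")
	if start == -1 || end == -1 || end < start {
		return "", false
	}
	return s[start : end+1], true
}

func main() {
	raw := "Here is the result:\n```json\n{\"summary\": \"ok\"}\n```"
	out, ok := extractJSON(raw)
	fmt.Println(ok, out) // true {"summary": "ok"}
}
```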
// ============================================================================
// DEDUPLICATION
// ============================================================================
const DEDUP_SYSTEM_PROMPT = `You are an expert at identifying when different name variations refer to the same entity. Given a list of entity names, group them by the actual entity they refer to.
Consider:
- Name variations (J. Smith, John Smith, John Q. Smith)
- Nicknames and aliases
- Organizational name variations (LLC vs Inc)
- Typos and OCR errors
Be conservative - only merge entities when you're confident they're the same.`;
const DEDUP_USER_PROMPT = (entities: string[]) => `Group these entity names by the actual entity they refer to. Return a JSON object where keys are canonical names and values are arrays of aliases.
Entities:
${entities.map((e) => `- ${e}`).join('\n')}
Return JSON like:
{
"Jeffrey Epstein": ["J. Epstein", "Epstein", "Jeffrey E. Epstein"],
"Ghislaine Maxwell": ["G. Maxwell", "Maxwell"]
}
Return ONLY valid JSON.`;
export async function deduplicateEntities(
entities: string[]
): Promise<Record<string, string[]>> {
const response = await anthropic.messages.create({
model: config.LLM_MODEL,
max_tokens: 4096,
system: DEDUP_SYSTEM_PROMPT,
messages: [
{
role: 'user',
content: DEDUP_USER_PROMPT(entities),
},
],
});
const content = response.content[0];
if (content.type !== 'text') {
throw new Error(`Unexpected response type: ${content.type}`);
}
const jsonMatch = content.text.match(/\{[\s\S]*\}/);
if (!jsonMatch) {
throw new Error('No JSON found in dedup response');
}
return JSON.parse(jsonMatch[0]);
}
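One way the alias map returned by `deduplicateEntities` might be consumed downstream is as an alias-to-canonical lookup for normalizing extracted names. A minimal sketch (the `buildAliasLookup` helper is illustrative, not part of this file):

```typescript
// Invert { canonical: [aliases...] } into a case-insensitive
// alias -> canonical lookup table.
function buildAliasLookup(groups: Record<string, string[]>): Map<string, string> {
  const lookup = new Map<string, string>();
  for (const [canonical, aliases] of Object.entries(groups)) {
    // The canonical name maps to itself.
    lookup.set(canonical.toLowerCase(), canonical);
    for (const alias of aliases) {
      lookup.set(alias.toLowerCase(), canonical);
    }
  }
  return lookup;
}

const lookup = buildAliasLookup({
  'Jeffrey Epstein': ['J. Epstein', 'Epstein'],
});
```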

View File

@@ -0,0 +1,135 @@
/**
* Document Extraction Script
*
* Reads OCR text from the data sources and loads it into PostgreSQL.
* This is the first step in the pipeline.
*/
import fs from 'fs';
import path from 'path';
import readline from 'readline';
import { config } from '../config.js';
import { insertDocument, close } from '../db.js';
// Path to the combined text file
const DATA_DIR = path.resolve(config.DATA_DIR);
const COMBINED_TEXT_PATH = path.join(DATA_DIR, 'combined-all-epstein-files/COMBINED_ALL_EPSTEIN_FILES_djvu.txt');
// Document ID pattern: EFTA00000001
const DOC_ID_PATTERN = /^EFTA\d{8}$/;
interface DocumentChunk {
docId: string;
lines: string[];
}
async function* readDocuments(): AsyncGenerator<DocumentChunk> {
const fileStream = fs.createReadStream(COMBINED_TEXT_PATH);
const rl = readline.createInterface({
input: fileStream,
crlfDelay: Infinity,
});
let currentDoc: DocumentChunk | null = null;
for await (const line of rl) {
const trimmed = line.trim();
// Check if this is a new document ID
if (DOC_ID_PATTERN.test(trimmed)) {
// If we have a previous document, yield it
if (currentDoc && currentDoc.lines.length > 0) {
yield currentDoc;
}
// Start a new document
currentDoc = {
docId: trimmed,
lines: [],
};
} else if (currentDoc) {
// Add line to current document
if (trimmed.length > 0) {
currentDoc.lines.push(line);
}
}
}
// Yield the last document
if (currentDoc && currentDoc.lines.length > 0) {
yield currentDoc;
}
}
function getDatasetId(docId: string): number {
// Extract the numeric portion
const num = parseInt(docId.replace('EFTA', ''), 10);
// Map to dataset based on the metadata:
// DataSet 1: EFTA00000001-00003158
// DataSet 2: EFTA00003159-00003857
// DataSet 3: EFTA00003858-00005586
// DataSet 4: EFTA00005705-00008320
// DataSet 5: EFTA00008409-00008528
if (num <= 3158) return 1;
if (num <= 3857) return 2;
if (num <= 5586) return 3;
if (num <= 8320) return 4; // IDs in the 5587-5704 gap between datasets also land here
return 5;
}
async function main() {
console.log('📄 Starting document extraction...');
console.log(`Reading from: ${COMBINED_TEXT_PATH}`);
// Check if file exists
if (!fs.existsSync(COMBINED_TEXT_PATH)) {
console.error(`❌ File not found: ${COMBINED_TEXT_PATH}`);
console.error('Make sure the DataSources directory is properly set up.');
process.exit(1);
}
let count = 0;
let errors = 0;
const seenDocs = new Set<string>();
for await (const doc of readDocuments()) {
// Skip duplicate doc IDs (the OCR sometimes repeats)
if (seenDocs.has(doc.docId)) {
continue;
}
seenDocs.add(doc.docId);
try {
const fullText = doc.lines.join('\n');
const datasetId = getDatasetId(doc.docId);
await insertDocument({
docId: doc.docId,
datasetId,
fullText,
pageCount: 1, // We'll update this later with actual page counts
});
count++;
if (count % 100 === 0) {
console.log(` ✓ Processed ${count} documents...`);
}
} catch (error) {
console.error(`❌ Error processing ${doc.docId}:`, error);
errors++;
}
}
console.log(`\n✅ Document extraction complete!`);
console.log(` Total documents: ${count}`);
console.log(` Errors: ${errors}`);
await close();
}
main().catch((error) => {
console.error('Fatal error:', error);
process.exit(1);
});
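The streaming generator's splitting rule can be hard to eyeball. The same logic, as a minimal in-memory sketch (illustrative only; the script itself streams line-by-line with `readline`): a line matching the document ID pattern starts a new document, and other non-empty lines accumulate into the current one.

```typescript
const DOC_ID = /^EFTA\d{8}$/;

// In-memory equivalent of readDocuments(): split a flat line array
// into { docId, lines } chunks at each document-ID marker.
function splitDocuments(lines: string[]): Array<{ docId: string; lines: string[] }> {
  const docs: Array<{ docId: string; lines: string[] }> = [];
  let current: { docId: string; lines: string[] } | null = null;
  for (const line of lines) {
    const trimmed = line.trim();
    if (DOC_ID.test(trimmed)) {
      // A new ID closes out the previous (non-empty) document.
      if (current && current.lines.length > 0) docs.push(current);
      current = { docId: trimmed, lines: [] };
    } else if (current && trimmed.length > 0) {
      current.lines.push(line);
    }
  }
  if (current && current.lines.length > 0) docs.push(current);
  return docs;
}
```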

View File

@@ -0,0 +1,198 @@
/**
* Entity Extraction Script
*
* Processes documents through the LLM to extract entities and relationships.
* Uses rate limiting and batching for efficiency.
*/
import pLimit from 'p-limit';
import { config } from '../config.js';
import {
getDocumentsPendingAnalysis,
updateDocumentAnalysis,
upsertEntity,
linkEntityToDocument,
insertTriple,
pool,
close,
} from '../db.js';
import { extractFromDocument, type Entity, type Triple } from '../ner/extractor.js';
// Rate limiter
const limit = pLimit(config.MAX_WORKERS);
// Track progress
let processed = 0;
let errors = 0;
let totalEntities = 0;
let totalTriples = 0;
async function processDocument(doc: {
id: number;
docId: string;
fullText: string;
}): Promise<void> {
try {
console.log(` 📝 Processing ${doc.docId}...`);
// Mark as processing
await pool.query(
`UPDATE documents SET analysis_status = 'processing' WHERE id = $1`,
[doc.id]
);
// Extract entities and relationships
const analysis = await extractFromDocument(doc.docId, doc.fullText);
// Parse dates
const dateEarliest = analysis.dateEarliest
? new Date(analysis.dateEarliest)
: undefined;
const dateLatest = analysis.dateLatest
? new Date(analysis.dateLatest)
: undefined;
// Update document analysis
await updateDocumentAnalysis(doc.docId, {
summary: analysis.summary,
detailedSummary: analysis.detailedSummary,
documentType: analysis.documentType,
dateEarliest,
dateLatest,
contentTags: analysis.contentTags,
});
// Insert entities and get their IDs
const entityIdMap = new Map<string, number>();
for (const entity of analysis.entities) {
const entityId = await upsertEntity({
canonicalName: entity.name,
entityType: entity.type,
});
entityIdMap.set(entity.name.toLowerCase(), entityId);
// Link entity to document
await linkEntityToDocument(entityId, doc.id, 1, entity.context);
}
totalEntities += analysis.entities.length;
// Insert triples
for (let i = 0; i < analysis.triples.length; i++) {
const triple = analysis.triples[i];
// Get or create subject entity
let subjectId = entityIdMap.get(triple.subject.toLowerCase());
if (!subjectId) {
subjectId = await upsertEntity({
canonicalName: triple.subject,
entityType: triple.subjectType,
});
entityIdMap.set(triple.subject.toLowerCase(), subjectId);
}
// Get or create object entity
let objectId = entityIdMap.get(triple.object.toLowerCase());
if (!objectId) {
objectId = await upsertEntity({
canonicalName: triple.object,
entityType: triple.objectType,
});
entityIdMap.set(triple.object.toLowerCase(), objectId);
}
// Get location entity if present
let locationId: number | undefined;
if (triple.location) {
locationId = entityIdMap.get(triple.location.toLowerCase());
if (!locationId) {
locationId = await upsertEntity({
canonicalName: triple.location,
entityType: 'location',
});
entityIdMap.set(triple.location.toLowerCase(), locationId);
}
}
// Parse timestamp
const timestamp = triple.timestamp ? new Date(triple.timestamp) : undefined;
// Insert triple
await insertTriple({
documentId: doc.id,
subjectId,
predicate: triple.predicate,
objectId,
locationId,
timestamp,
explicitTopic: triple.explicitTopic,
implicitTopic: triple.implicitTopic,
tags: triple.tags,
sequenceOrder: i,
});
}
totalTriples += analysis.triples.length;
processed++;
console.log(
`${doc.docId}: ${analysis.entities.length} entities, ${analysis.triples.length} triples`
);
} catch (error) {
errors++;
console.error(`${doc.docId}: ${error}`);
// Mark as failed
await pool.query(
`UPDATE documents SET
analysis_status = 'failed',
error_message = $2,
updated_at = NOW()
WHERE id = $1`,
[doc.id, String(error)]
);
}
}
async function main() {
console.log('🔍 Starting entity extraction...');
console.log(` Model: ${config.LLM_MODEL}`);
console.log(` Workers: ${config.MAX_WORKERS}`);
console.log(` Batch size: ${config.BATCH_SIZE}\n`);
let hasMore = true;
while (hasMore) {
// Get batch of pending documents
const documents = await getDocumentsPendingAnalysis(config.BATCH_SIZE);
if (documents.length === 0) {
hasMore = false;
break;
}
console.log(`\n📦 Processing batch of ${documents.length} documents...`);
// Process in parallel with rate limiting
await Promise.all(
documents.map((doc) => limit(() => processDocument(doc)))
);
// Brief pause between batches
await new Promise((resolve) => setTimeout(resolve, 1000));
}
console.log(`\n✅ Entity extraction complete!`);
console.log(` Documents processed: ${processed}`);
console.log(` Entities extracted: ${totalEntities}`);
console.log(` Triples extracted: ${totalTriples}`);
console.log(` Errors: ${errors}`);
await close();
}
main().catch((error) => {
console.error('Fatal error:', error);
process.exit(1);
});
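The batch loop funnels every `processDocument` call through `pLimit(config.MAX_WORKERS)`, so at most that many LLM requests are in flight at once. For readers unfamiliar with the pattern, here is a minimal sketch of what such a limiter does (illustrative only; the script uses the p-limit package, not this code):

```typescript
// Returns a wrapper that allows at most `max` wrapped promises
// to run concurrently; excess callers wait in a FIFO queue.
function makeLimiter(max: number) {
  let active = 0;
  const queue: Array<() => void> = [];
  const release = () => {
    active--;
    queue.shift()?.(); // wake the next waiter, if any
  };
  return async function limit<T>(fn: () => Promise<T>): Promise<T> {
    if (active >= max) {
      await new Promise<void>((resolve) => queue.push(resolve));
    }
    active++;
    try {
      return await fn();
    } finally {
      release();
    }
  };
}
```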

View File

@@ -0,0 +1,236 @@
/**
* Cross-Reference Matching Script
*
* Matches extracted entities against PPP loans, FEC contributions, and federal grants.
 * Uses pg_trgm fuzzy matching with a configurable similarity threshold.
*/
import { pool, close } from '../db.js';
// Similarity threshold for matches (0-1)
const MATCH_THRESHOLD = 0.7;
interface Match {
entityId: number;
entityName: string;
source: 'ppp' | 'fec' | 'grants';
sourceId: number;
sourceName: string;
score: number;
}
async function findPPPMatches(): Promise<Match[]> {
console.log('🔍 Matching entities against PPP loans...');
const result = await pool.query(`
SELECT
e.id AS entity_id,
e.canonical_name AS entity_name,
p.id AS source_id,
p.borrower_name AS source_name,
similarity(e.canonical_name, p.borrower_name) AS score
FROM entities e
CROSS JOIN LATERAL (
SELECT id, borrower_name
FROM ppp_loans
WHERE
borrower_name % e.canonical_name
AND similarity(borrower_name, e.canonical_name) >= $1
ORDER BY similarity(borrower_name, e.canonical_name) DESC
LIMIT 5
) p
WHERE e.entity_type IN ('person', 'organization')
`, [MATCH_THRESHOLD]);
return result.rows.map((row) => ({
entityId: row.entity_id,
entityName: row.entity_name,
source: 'ppp' as const,
sourceId: row.source_id,
sourceName: row.source_name,
score: row.score,
}));
}
async function findFECMatches(): Promise<Match[]> {
console.log('🔍 Matching entities against FEC contributions...');
const result = await pool.query(`
SELECT
e.id AS entity_id,
e.canonical_name AS entity_name,
f.id AS source_id,
f.contributor_name AS source_name,
similarity(e.canonical_name, f.contributor_name) AS score
FROM entities e
CROSS JOIN LATERAL (
SELECT id, contributor_name
FROM fec_contributions
WHERE
contributor_name % e.canonical_name
AND similarity(contributor_name, e.canonical_name) >= $1
ORDER BY similarity(contributor_name, e.canonical_name) DESC
LIMIT 5
) f
WHERE e.entity_type = 'person'
`, [MATCH_THRESHOLD]);
return result.rows.map((row) => ({
entityId: row.entity_id,
entityName: row.entity_name,
source: 'fec' as const,
sourceId: row.source_id,
sourceName: row.source_name,
score: row.score,
}));
}
async function findGrantsMatches(): Promise<Match[]> {
console.log('🔍 Matching entities against federal grants...');
const result = await pool.query(`
SELECT
e.id AS entity_id,
e.canonical_name AS entity_name,
g.id AS source_id,
g.recipient_name AS source_name,
similarity(e.canonical_name, g.recipient_name) AS score
FROM entities e
CROSS JOIN LATERAL (
SELECT id, recipient_name
FROM federal_grants
WHERE
recipient_name % e.canonical_name
AND similarity(recipient_name, e.canonical_name) >= $1
ORDER BY similarity(recipient_name, e.canonical_name) DESC
LIMIT 5
) g
WHERE e.entity_type IN ('person', 'organization')
`, [MATCH_THRESHOLD]);
return result.rows.map((row) => ({
entityId: row.entity_id,
entityName: row.entity_name,
source: 'grants' as const,
sourceId: row.source_id,
sourceName: row.source_name,
score: row.score,
}));
}
async function saveMatches(matches: Match[]): Promise<void> {
  if (matches.length === 0) return;
  // Build a parameterized multi-row insert instead of interpolating values
  // directly into the SQL string.
  const params: unknown[] = [];
  const rows = matches.map((m, i) => {
    const base = i * 4;
    params.push(m.entityId, m.source, m.sourceId, m.score);
    return `($${base + 1}, $${base + 2}, $${base + 3}, $${base + 4}, 'fuzzy')`;
  });
  await pool.query(`
    INSERT INTO entity_crossref_matches (entity_id, source, source_id, match_score, match_method)
    VALUES ${rows.join(',\n')}
    ON CONFLICT DO NOTHING
  `, params);
}
async function updateEntityCrossRefSummary(): Promise<void> {
console.log('📊 Updating entity cross-reference summaries...');
// Update PPP matches
await pool.query(`
UPDATE entities e
SET ppp_matches = (
SELECT jsonb_agg(jsonb_build_object(
'id', p.id,
'borrower', p.borrower_name,
'amount', p.loan_amount,
'score', m.match_score
))
FROM entity_crossref_matches m
JOIN ppp_loans p ON m.source_id = p.id
WHERE m.entity_id = e.id AND m.source = 'ppp' AND NOT m.false_positive
)
WHERE EXISTS (
SELECT 1 FROM entity_crossref_matches m
WHERE m.entity_id = e.id AND m.source = 'ppp'
)
`);
// Update FEC matches
await pool.query(`
UPDATE entities e
SET fec_matches = (
SELECT jsonb_agg(jsonb_build_object(
'id', f.id,
'contributor', f.contributor_name,
'candidate', f.candidate_name,
'amount', f.amount,
'score', m.match_score
))
FROM entity_crossref_matches m
JOIN fec_contributions f ON m.source_id = f.id
WHERE m.entity_id = e.id AND m.source = 'fec' AND NOT m.false_positive
)
WHERE EXISTS (
SELECT 1 FROM entity_crossref_matches m
WHERE m.entity_id = e.id AND m.source = 'fec'
)
`);
// Update grants matches
await pool.query(`
UPDATE entities e
SET grants_matches = (
SELECT jsonb_agg(jsonb_build_object(
'id', g.id,
'recipient', g.recipient_name,
'agency', g.awarding_agency,
'amount', g.award_amount,
'score', m.match_score
))
FROM entity_crossref_matches m
JOIN federal_grants g ON m.source_id = g.id
WHERE m.entity_id = e.id AND m.source = 'grants' AND NOT m.false_positive
)
WHERE EXISTS (
SELECT 1 FROM entity_crossref_matches m
WHERE m.entity_id = e.id AND m.source = 'grants'
)
`);
}
async function main() {
console.log('🔗 Starting cross-reference matching...\n');
// Find all matches
const pppMatches = await findPPPMatches();
console.log(` Found ${pppMatches.length} PPP matches`);
const fecMatches = await findFECMatches();
console.log(` Found ${fecMatches.length} FEC matches`);
const grantsMatches = await findGrantsMatches();
console.log(` Found ${grantsMatches.length} grants matches`);
// Save matches
console.log('\n💾 Saving matches to database...');
await saveMatches(pppMatches);
await saveMatches(fecMatches);
await saveMatches(grantsMatches);
// Update entity summaries
await updateEntityCrossRefSummary();
const totalMatches = pppMatches.length + fecMatches.length + grantsMatches.length;
console.log(`\n✅ Cross-reference matching complete!`);
console.log(` Total matches: ${totalMatches}`);
console.log(` PPP: ${pppMatches.length}`);
console.log(` FEC: ${fecMatches.length}`);
console.log(` Grants: ${grantsMatches.length}`);
await close();
}
main().catch((error) => {
console.error('Fatal error:', error);
process.exit(1);
});
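The queries above lean on Postgres's pg_trgm extension: `%` is its similarity operator and `similarity()` its score function. As a rough illustration of the underlying idea (a sketch, not pg_trgm's exact algorithm, whose padding and normalization differ), a trigram-overlap score can be computed like this:

```typescript
// Collect the 3-character windows of a padded, lowercased string.
function trigrams(s: string): Set<string> {
  const padded = `  ${s.toLowerCase()} `;
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= padded.length; i++) {
    grams.add(padded.slice(i, i + 3));
  }
  return grams;
}

// Jaccard overlap of the two trigram sets: shared / union, in [0, 1].
function trigramSimilarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  for (const g of ta) if (tb.has(g)) shared++;
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}
```

This is why `MATCH_THRESHOLD = 0.7` is fairly strict: small spelling variants still share most trigrams, while unrelated names share almost none.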

20
extraction/tsconfig.json Normal file
View File

@@ -0,0 +1,20 @@
{
"compilerOptions": {
"target": "ES2022",
"module": "NodeNext",
"moduleResolution": "NodeNext",
"lib": ["ES2022"],
"outDir": "./dist",
"rootDir": "./src",
"strict": true,
"esModuleInterop": true,
"skipLibCheck": true,
"forceConsistentCasingInFileNames": true,
"resolveJsonModule": true,
"declaration": true,
"declarationMap": true,
"sourceMap": true
},
"include": ["src/**/*"],
"exclude": ["node_modules", "dist"]
}

14
frontend/index.html Normal file
View File

@@ -0,0 +1,14 @@
<!DOCTYPE html>
<html lang="en" class="dark">
<head>
<meta charset="UTF-8" />
<link rel="icon" type="image/svg+xml" href="/favicon.svg" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="description" content="Searchable database and network analysis tool for the DOJ Epstein Files release" />
<title>Epstein Files Database</title>
</head>
<body class="bg-background text-white antialiased">
<div id="root"></div>
<script type="module" src="/src/main.tsx"></script>
</body>
</html>

38
frontend/package.json Normal file
View File

@@ -0,0 +1,38 @@
{
"name": "@epstein-db/frontend",
"version": "1.0.0",
"private": true,
"type": "module",
"scripts": {
"dev": "vite",
"build": "tsc && vite build",
"preview": "vite preview",
"lint": "eslint . --ext ts,tsx --report-unused-disable-directives --max-warnings 0"
},
"dependencies": {
"@tanstack/react-query": "^5.32.0",
"clsx": "^2.1.0",
"d3": "^7.9.0",
"lucide-react": "^0.372.0",
"react": "^18.2.0",
"react-dom": "^18.2.0",
"react-force-graph-2d": "^1.25.5",
"react-router-dom": "^6.22.0",
"tailwind-merge": "^2.2.2"
},
"devDependencies": {
"@types/d3": "^7.4.3",
"@types/node": "^20.12.0",
"@types/react": "^18.2.79",
"@types/react-dom": "^18.2.25",
"@vitejs/plugin-react": "^4.2.0",
"autoprefixer": "^10.4.19",
"eslint": "^8.57.0",
"eslint-plugin-react-hooks": "^4.6.0",
"eslint-plugin-react-refresh": "^0.4.6",
"postcss": "^8.4.38",
"tailwindcss": "^3.4.3",
"typescript": "^5.4.0",
"vite": "^5.2.0"
}
}

29
frontend/src/App.tsx Normal file
View File

@@ -0,0 +1,29 @@
import { Routes, Route } from 'react-router-dom'
import { Layout } from './components/Layout'
import { HomePage } from './pages/HomePage'
import { NetworkPage } from './pages/NetworkPage'
import { EntitiesPage } from './pages/EntitiesPage'
import { EntityDetailPage } from './pages/EntityDetailPage'
import { DocumentsPage } from './pages/DocumentsPage'
import { DocumentDetailPage } from './pages/DocumentDetailPage'
import { SearchPage } from './pages/SearchPage'
import { PatternsPage } from './pages/PatternsPage'
import { CrossRefPage } from './pages/CrossRefPage'
export default function App() {
return (
<Layout>
<Routes>
<Route path="/" element={<HomePage />} />
<Route path="/network" element={<NetworkPage />} />
<Route path="/entities" element={<EntitiesPage />} />
<Route path="/entities/:id" element={<EntityDetailPage />} />
<Route path="/documents" element={<DocumentsPage />} />
<Route path="/documents/:id" element={<DocumentDetailPage />} />
<Route path="/search" element={<SearchPage />} />
<Route path="/patterns" element={<PatternsPage />} />
<Route path="/crossref" element={<CrossRefPage />} />
</Routes>
</Layout>
)
}

277
frontend/src/api/index.ts Normal file
View File

@@ -0,0 +1,277 @@
const API_BASE = '/api'
export interface Stats {
documents: number
entities: number
triples: number
pppLoans: number
fecRecords: number
grants: number
patterns: number
}
export interface Entity {
id: number
canonicalName: string
entityType: string
layer: number | null
description?: string
documentCount: number
connectionCount: number
aliases?: string[]
pppMatches?: any[]
fecMatches?: any[]
grantsMatches?: any[]
}
export interface Document {
id: number
docId: string
datasetId: number
documentType?: string
summary?: string
detailedSummary?: string
dateEarliest?: string
dateLatest?: string
contentTags?: string[]
pageCount?: number
}
export interface Connection {
id: number
canonicalName: string
entityType: string
layer: number | null
sharedDocs: number
}
export interface NetworkData {
nodes: Array<{
id: number
canonicalName: string
entityType: string
layer: number | null
documentCount: number
connectionCount: number
}>
edges: Array<{
source: number
target: number
weight: number
}>
stats: {
nodeCount: number
edgeCount: number
}
}
export interface Pattern {
id: number
title: string
description: string
patternType: string
confidence: number | null
status: string
discoveredAt: string
}
export interface SearchResult {
id: number
docId: string
documentType?: string
summary?: string
rank: number
snippet?: string
}
// Stats
export async function getStats(): Promise<Stats> {
const res = await fetch(`${API_BASE}/stats`)
if (!res.ok) throw new Error('Failed to fetch stats')
return res.json()
}
// Entities
export async function searchEntities(params: {
q?: string
type?: string
layer?: string
limit?: number
}): Promise<{ entities: Entity[]; count: number }> {
const searchParams = new URLSearchParams()
if (params.q) searchParams.set('q', params.q)
if (params.type) searchParams.set('type', params.type)
if (params.layer) searchParams.set('layer', params.layer)
if (params.limit) searchParams.set('limit', params.limit.toString())
const res = await fetch(`${API_BASE}/entities?${searchParams}`)
if (!res.ok) throw new Error('Failed to search entities')
return res.json()
}
export async function getEntity(id: number): Promise<Entity> {
const res = await fetch(`${API_BASE}/entities/${id}`)
if (!res.ok) throw new Error('Failed to fetch entity')
return res.json()
}
export async function getEntityConnections(
id: number,
limit?: number
): Promise<{ connections: Connection[]; count: number }> {
const params = limit ? `?limit=${limit}` : ''
const res = await fetch(`${API_BASE}/entities/${id}/connections${params}`)
if (!res.ok) throw new Error('Failed to fetch connections')
return res.json()
}
export async function getEntityDocuments(
id: number,
limit?: number
): Promise<{ documents: Document[]; count: number }> {
const params = limit ? `?limit=${limit}` : ''
const res = await fetch(`${API_BASE}/entities/${id}/documents${params}`)
if (!res.ok) throw new Error('Failed to fetch documents')
return res.json()
}
// Documents
export async function listDocuments(params: {
type?: string
dataset?: string
limit?: number
offset?: number
}): Promise<{ documents: Document[]; count: number; offset: number; limit: number }> {
const searchParams = new URLSearchParams()
if (params.type) searchParams.set('type', params.type)
if (params.dataset) searchParams.set('dataset', params.dataset)
if (params.limit) searchParams.set('limit', params.limit.toString())
if (params.offset) searchParams.set('offset', params.offset.toString())
const res = await fetch(`${API_BASE}/documents?${searchParams}`)
if (!res.ok) throw new Error('Failed to list documents')
return res.json()
}
export async function getDocument(id: number): Promise<Document> {
const res = await fetch(`${API_BASE}/documents/${id}`)
if (!res.ok) throw new Error('Failed to fetch document')
return res.json()
}
export async function getDocumentText(id: number): Promise<{ id: number; text: string }> {
const res = await fetch(`${API_BASE}/documents/${id}/text`)
if (!res.ok) throw new Error('Failed to fetch document text')
return res.json()
}
export async function getDocumentEntities(
id: number
): Promise<{ entities: Array<Entity & { mentionCount: number }>; count: number }> {
const res = await fetch(`${API_BASE}/documents/${id}/entities`)
if (!res.ok) throw new Error('Failed to fetch document entities')
return res.json()
}
// Network
export async function getNetwork(params?: {
limit?: number
minConnections?: number
}): Promise<NetworkData> {
const searchParams = new URLSearchParams()
if (params?.limit) searchParams.set('limit', params.limit.toString())
if (params?.minConnections) searchParams.set('minConnections', params.minConnections.toString())
const res = await fetch(`${API_BASE}/network?${searchParams}`)
if (!res.ok) throw new Error('Failed to fetch network')
return res.json()
}
export async function getNetworkByLayer(): Promise<{
layers: Array<{
layer: number
entities: Entity[]
count: number
}>
}> {
const res = await fetch(`${API_BASE}/network/layers`)
if (!res.ok) throw new Error('Failed to fetch network layers')
return res.json()
}
// Patterns
export async function listPatterns(params?: {
status?: string
type?: string
}): Promise<{ patterns: Pattern[]; count: number }> {
const searchParams = new URLSearchParams()
if (params?.status) searchParams.set('status', params.status)
if (params?.type) searchParams.set('type', params.type)
const res = await fetch(`${API_BASE}/patterns?${searchParams}`)
if (!res.ok) throw new Error('Failed to list patterns')
return res.json()
}
export async function getPattern(id: number): Promise<{
pattern: Pattern & { entityIds: number[]; evidence: any; notes?: string }
entities: Entity[]
}> {
const res = await fetch(`${API_BASE}/patterns/${id}`)
if (!res.ok) throw new Error('Failed to fetch pattern')
return res.json()
}
// Search
export async function fullTextSearch(
query: string,
limit?: number
): Promise<{ results: SearchResult[]; count: number; query: string }> {
const params = new URLSearchParams({ q: query })
if (limit) params.set('limit', limit.toString())
const res = await fetch(`${API_BASE}/search?${params}`)
if (!res.ok) throw new Error('Failed to search')
return res.json()
}
// Cross-reference
export async function searchPPP(
query: string,
limit?: number
): Promise<{ results: any[]; count: number }> {
const params = new URLSearchParams({ q: query })
if (limit) params.set('limit', limit.toString())
const res = await fetch(`${API_BASE}/crossref/ppp?${params}`)
if (!res.ok) throw new Error('Failed to search PPP')
return res.json()
}
export async function searchFEC(
query: string,
candidate?: string,
limit?: number
): Promise<{ results: any[]; count: number }> {
const params = new URLSearchParams({ q: query })
if (candidate) params.set('candidate', candidate)
if (limit) params.set('limit', limit.toString())
const res = await fetch(`${API_BASE}/crossref/fec?${params}`)
if (!res.ok) throw new Error('Failed to search FEC')
return res.json()
}
export async function searchGrants(
query: string,
agency?: string,
limit?: number
): Promise<{ results: any[]; count: number }> {
const params = new URLSearchParams({ q: query })
if (agency) params.set('agency', agency)
if (limit) params.set('limit', limit.toString())
const res = await fetch(`${API_BASE}/crossref/grants?${params}`)
if (!res.ok) throw new Error('Failed to search grants')
return res.json()
}
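Each endpoint above hand-builds a `URLSearchParams` from optional fields. A small helper capturing the repeated pattern (hypothetical; `buildQuery` is not part of this file) would look like:

```typescript
// Build a query string from only the defined parameters;
// returns '' when nothing is set.
function buildQuery(params: Record<string, string | number | undefined>): string {
  const sp = new URLSearchParams();
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined) sp.set(key, String(value));
  }
  const qs = sp.toString();
  return qs ? `?${qs}` : '';
}
```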

View File

@@ -0,0 +1,80 @@
import { ReactNode } from 'react'
import { Link, useLocation } from 'react-router-dom'
import {
Search,
Network,
Users,
FileText,
Lightbulb,
Link2,
Home
} from 'lucide-react'
import { clsx } from 'clsx'
interface LayoutProps {
children: ReactNode
}
const navItems = [
{ path: '/', icon: Home, label: 'Home' },
{ path: '/network', icon: Network, label: 'Network' },
{ path: '/entities', icon: Users, label: 'Entities' },
{ path: '/documents', icon: FileText, label: 'Documents' },
{ path: '/search', icon: Search, label: 'Search' },
{ path: '/patterns', icon: Lightbulb, label: 'Patterns' },
{ path: '/crossref', icon: Link2, label: 'Cross-Ref' },
]
export function Layout({ children }: LayoutProps) {
const location = useLocation()
return (
<div className="min-h-screen flex">
{/* Sidebar */}
<nav className="w-64 bg-surface border-r border-border flex flex-col">
{/* Logo */}
<div className="p-4 border-b border-border">
<Link to="/" className="flex items-center gap-2">
<div className="w-8 h-8 bg-red-600 rounded-lg flex items-center justify-center">
<span className="text-white font-bold text-sm">EF</span>
</div>
<div>
<h1 className="font-semibold text-white">Epstein Files</h1>
<p className="text-xs text-gray-500">Database</p>
</div>
</Link>
</div>
{/* Navigation */}
<div className="flex-1 p-2">
{navItems.map(({ path, icon: Icon, label }) => (
<Link
key={path}
to={path}
className={clsx(
'flex items-center gap-3 px-3 py-2 rounded-lg mb-1 transition-colors',
location.pathname === path
? 'bg-blue-600/20 text-blue-400'
: 'text-gray-400 hover:bg-surface-hover hover:text-gray-200'
)}
>
<Icon size={18} />
<span>{label}</span>
</Link>
))}
</div>
{/* Footer */}
<div className="p-4 border-t border-border text-xs text-gray-500">
<p>4,055 documents</p>
<p className="mt-1">DOJ Release Dec 2025</p>
</div>
</nav>
{/* Main Content */}
<main className="flex-1 overflow-auto">
{children}
</main>
</div>
)
}

88
frontend/src/index.css Normal file
View File

@@ -0,0 +1,88 @@
@tailwind base;
@tailwind components;
@tailwind utilities;
@layer base {
body {
@apply bg-background text-gray-100;
}
/* Custom scrollbar */
::-webkit-scrollbar {
width: 8px;
height: 8px;
}
::-webkit-scrollbar-track {
@apply bg-surface;
}
::-webkit-scrollbar-thumb {
@apply bg-border rounded-full;
}
::-webkit-scrollbar-thumb:hover {
@apply bg-gray-600;
}
}
@layer components {
.card {
@apply bg-surface border border-border rounded-lg;
}
.btn {
@apply px-4 py-2 rounded-lg font-medium transition-colors;
}
.btn-primary {
@apply bg-blue-600 hover:bg-blue-700 text-white;
}
.btn-secondary {
@apply bg-surface border border-border hover:bg-surface-hover text-gray-200;
}
.input {
@apply bg-surface border border-border rounded-lg px-4 py-2 text-white placeholder-gray-500 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:border-transparent;
}
/* Layer badges */
.layer-badge {
@apply inline-flex items-center px-2 py-0.5 rounded-full text-xs font-medium;
}
.layer-0 {
@apply bg-red-500/20 text-red-400 border border-red-500/30;
}
.layer-1 {
@apply bg-orange-500/20 text-orange-400 border border-orange-500/30;
}
.layer-2 {
@apply bg-yellow-500/20 text-yellow-400 border border-yellow-500/30;
}
.layer-3 {
@apply bg-green-500/20 text-green-400 border border-green-500/30;
}
/* Entity type badges */
.entity-person {
@apply bg-blue-500/20 text-blue-400;
}
.entity-organization {
@apply bg-purple-500/20 text-purple-400;
}
.entity-location {
@apply bg-teal-500/20 text-teal-400;
}
}
/* Force graph styling */
.force-graph-container {
background: #0a0a0a;
}

25
frontend/src/main.tsx Normal file
View File

@@ -0,0 +1,25 @@
import React from 'react'
import ReactDOM from 'react-dom/client'
import { QueryClient, QueryClientProvider } from '@tanstack/react-query'
import { BrowserRouter } from 'react-router-dom'
import App from './App'
import './index.css'
const queryClient = new QueryClient({
defaultOptions: {
queries: {
staleTime: 1000 * 60 * 5, // 5 minutes
retry: 1,
},
},
})
ReactDOM.createRoot(document.getElementById('root')!).render(
<React.StrictMode>
<QueryClientProvider client={queryClient}>
<BrowserRouter>
<App />
</BrowserRouter>
</QueryClientProvider>
</React.StrictMode>,
)

View File

@@ -0,0 +1,187 @@
import { useQuery } from '@tanstack/react-query'
import { Link } from 'react-router-dom'
import { getStats, getNetworkByLayer } from '@/api'
import { Users, FileText, Network, Lightbulb, DollarSign, Vote, Building } from 'lucide-react'
export function HomePage() {
const { data: stats, isLoading: statsLoading } = useQuery({
queryKey: ['stats'],
queryFn: getStats,
})
const { data: layersData, isLoading: layersLoading } = useQuery({
queryKey: ['network-layers'],
queryFn: getNetworkByLayer,
})
return (
<div className="p-6">
{/* Header */}
<div className="mb-8">
<h1 className="text-3xl font-bold text-white mb-2">Epstein Files Database</h1>
<p className="text-gray-400">
Searchable database and network analysis tool for the DOJ Epstein Files release
</p>
</div>
{/* Stats Grid */}
<div className="grid grid-cols-2 md:grid-cols-4 gap-4 mb-8">
<StatCard
icon={FileText}
label="Documents"
value={stats?.documents ?? 0}
loading={statsLoading}
/>
<StatCard
icon={Users}
label="Entities"
value={stats?.entities ?? 0}
loading={statsLoading}
/>
<StatCard
icon={Network}
label="Relationships"
value={stats?.triples ?? 0}
loading={statsLoading}
/>
<StatCard
icon={Lightbulb}
label="Patterns"
value={stats?.patterns ?? 0}
loading={statsLoading}
/>
</div>
{/* Cross-Reference Stats */}
<div className="card p-4 mb-8">
<h2 className="text-lg font-semibold mb-4">Cross-Reference Data</h2>
<div className="grid grid-cols-3 gap-4">
<div className="flex items-center gap-3">
<div className="p-2 bg-green-500/20 rounded-lg">
<DollarSign className="text-green-400" size={20} />
</div>
<div>
<p className="text-sm text-gray-400">PPP Loans</p>
<p className="font-semibold">{stats?.pppLoans?.toLocaleString() ?? '—'}</p>
</div>
</div>
<div className="flex items-center gap-3">
<div className="p-2 bg-blue-500/20 rounded-lg">
<Vote className="text-blue-400" size={20} />
</div>
<div>
<p className="text-sm text-gray-400">FEC Records</p>
<p className="font-semibold">{stats?.fecRecords?.toLocaleString() ?? '—'}</p>
</div>
</div>
<div className="flex items-center gap-3">
<div className="p-2 bg-purple-500/20 rounded-lg">
<Building className="text-purple-400" size={20} />
</div>
<div>
<p className="text-sm text-gray-400">Federal Grants</p>
<p className="font-semibold">{stats?.grants?.toLocaleString() ?? '—'}</p>
</div>
</div>
</div>
</div>
{/* Layer Overview */}
<div className="card p-4 mb-8">
<h2 className="text-lg font-semibold mb-4">Network Layers</h2>
<div className="space-y-4">
{[0, 1, 2, 3].map((layer) => {
const layerData = layersData?.layers?.find((l) => l.layer === layer)
return (
<div key={layer} className="flex items-center gap-4">
<span className={`layer-badge layer-${layer}`}>L{layer}</span>
<div className="flex-1">
<div className="flex justify-between mb-1">
<span className="text-sm text-gray-300">
{layer === 0 && 'Jeffrey Epstein'}
{layer === 1 && 'Direct Associates'}
{layer === 2 && 'One Degree Removed'}
{layer === 3 && 'Two Degrees Removed'}
</span>
<span className="text-sm text-gray-500">
{layerData?.count ?? 0} entities
</span>
</div>
<div className="h-2 bg-surface rounded-full overflow-hidden">
<div
className={`h-full ${
layer === 0 ? 'bg-red-500' :
layer === 1 ? 'bg-orange-500' :
layer === 2 ? 'bg-yellow-500' :
'bg-green-500'
}`}
style={{
width: `${Math.min(100, (layerData?.count ?? 0) / 10)}%`
}}
/>
</div>
</div>
</div>
)
})}
</div>
</div>
{/* Quick Actions */}
<div className="grid grid-cols-2 md:grid-cols-3 gap-4">
<Link to="/network" className="card p-4 hover:bg-surface-hover transition-colors">
<Network className="text-blue-400 mb-2" size={24} />
<h3 className="font-semibold mb-1">Explore Network</h3>
<p className="text-sm text-gray-400">Interactive visualization of entity connections</p>
</Link>
<Link to="/search" className="card p-4 hover:bg-surface-hover transition-colors">
<FileText className="text-green-400 mb-2" size={24} />
<h3 className="font-semibold mb-1">Search Documents</h3>
<p className="text-sm text-gray-400">Full-text search across all documents</p>
</Link>
<Link to="/patterns" className="card p-4 hover:bg-surface-hover transition-colors">
<Lightbulb className="text-yellow-400 mb-2" size={24} />
<h3 className="font-semibold mb-1">View Patterns</h3>
<p className="text-sm text-gray-400">AI-discovered connections and insights</p>
</Link>
</div>
{/* Disclaimer */}
<div className="mt-8 p-4 bg-yellow-500/10 border border-yellow-500/30 rounded-lg">
<h3 className="font-semibold text-yellow-400 mb-1">Disclaimer</h3>
<p className="text-sm text-gray-300">
This is an independent research tool. It surfaces connections from public documents;
it does not assert guilt, criminality, or wrongdoing. Always verify claims against primary sources.
</p>
</div>
</div>
)
}
function StatCard({
icon: Icon,
label,
value,
loading,
}: {
icon: (props: { className?: string; size?: number }) => JSX.Element
label: string
value: number
loading: boolean
}) {
return (
<div className="card p-4">
<div className="flex items-center gap-3">
<div className="p-2 bg-surface-hover rounded-lg">
<Icon className="text-gray-400" size={20} />
</div>
<div>
<p className="text-sm text-gray-400">{label}</p>
<p className="text-xl font-semibold">
{loading ? '—' : value.toLocaleString()}
</p>
</div>
</div>
</div>
)
}
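`HomePage` consumes `getStats` and `getNetworkByLayer` from `@/api`, whose definitions are outside this diff. A hypothetical sketch of the response shapes the component assumes, plus the layer-bar width rule extracted as a pure function (these names are illustrative assumptions, not the actual `@/api` exports):

```typescript
// Assumed shape of the getStats response, inferred from the fields
// HomePage reads (stats?.documents, stats?.pppLoans, etc.).
interface Stats {
  documents: number
  entities: number
  triples: number
  patterns: number
  pppLoans?: number
  fecRecords?: number
  grants?: number
}

// Assumed shape of one element of layersData.layers.
interface LayerBucket {
  layer: number
  count: number
}

// Mirrors the inline width calculation in the layer bars:
// one entity = 0.1% of the bar, capped at 100% (1000+ entities fill it).
function layerBarWidth(count: number): number {
  return Math.min(100, count / 10)
}
```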


@@ -0,0 +1,30 @@
/** @type {import('tailwindcss').Config} */
export default {
content: [
"./index.html",
"./src/**/*.{js,ts,jsx,tsx}",
],
theme: {
extend: {
colors: {
// Dark theme optimized for document analysis
background: '#0a0a0a',
surface: '#141414',
'surface-hover': '#1a1a1a',
border: '#262626',
// Layer colors
'layer-0': '#ef4444', // Epstein - red
'layer-1': '#f97316', // Direct - orange
'layer-2': '#eab308', // One removed - yellow
'layer-3': '#22c55e', // Two removed - green
// Entity type colors
'entity-person': '#3b82f6',
'entity-org': '#8b5cf6',
'entity-location': '#14b8a6',
},
},
},
plugins: [],
}

frontend/vite.config.ts

@@ -0,0 +1,21 @@
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'
import path from 'path'
export default defineConfig({
plugins: [react()],
resolve: {
alias: {
'@': path.resolve(__dirname, './src'),
},
},
server: {
port: 3000,
proxy: {
'/api': {
target: 'http://localhost:3001',
changeOrigin: true,
},
},
},
})


@@ -0,0 +1,105 @@
// Neo4j Cypher constraints and initial setup
// Run these after Neo4j starts
// ============================================================================
// CONSTRAINTS
// ============================================================================
// Entity uniqueness
CREATE CONSTRAINT entity_unique IF NOT EXISTS
FOR (e:Entity) REQUIRE (e.canonicalName, e.type) IS UNIQUE;
// Document uniqueness
CREATE CONSTRAINT document_unique IF NOT EXISTS
FOR (d:Document) REQUIRE d.docId IS UNIQUE;
// ============================================================================
// INDEXES
// ============================================================================
// Entity indexes
CREATE INDEX entity_name IF NOT EXISTS FOR (e:Entity) ON (e.canonicalName);
CREATE INDEX entity_type IF NOT EXISTS FOR (e:Entity) ON (e.type);
CREATE INDEX entity_layer IF NOT EXISTS FOR (e:Entity) ON (e.layer);
// Full-text search index on entity names
CREATE FULLTEXT INDEX entity_search IF NOT EXISTS FOR (e:Entity) ON EACH [e.canonicalName, e.aliases];
// Document indexes
CREATE INDEX document_docid IF NOT EXISTS FOR (d:Document) ON (d.docId);
CREATE INDEX document_type IF NOT EXISTS FOR (d:Document) ON (d.documentType);
// ============================================================================
// ENTITY TYPES (Labels)
// ============================================================================
// We use labels for entity types:
// - :Person
// - :Organization
// - :Location
// - :Entity (base label, all entities have this)
// ============================================================================
// RELATIONSHIP TYPES
// ============================================================================
// - MENTIONED_IN: Entity -> Document (entity appears in document)
// - CONNECTED_TO: Entity -> Entity (co-occurrence relationship)
// - HAS_RELATIONSHIP: Entity -> Entity with action property (from triples)
// - CROSSREF_MATCH: Entity -> CrossRefRecord (PPP, FEC, Grants)
// ============================================================================
// INITIAL DATA
// ============================================================================
// Create Jeffrey Epstein as the root node
MERGE (e:Entity:Person {canonicalName: 'Jeffrey Epstein', type: 'person'})
SET e.layer = 0,
e.description = 'American financier and convicted sex offender',
e.aliases = ['Jeffrey E. Epstein', 'J. Epstein', 'Epstein', 'JE'],
e.createdAt = datetime();
// ============================================================================
// HELPER PROCEDURES
// ============================================================================
// Calculate layer for an entity based on shortest path to Epstein
// Usage: CALL calculateLayer($entityName) YIELD layer
// This requires the APOC plugin to be installed
// CALL apoc.custom.asProcedure(
// 'calculateLayer',
// '
// MATCH (epstein:Entity {canonicalName: "Jeffrey Epstein"})
// MATCH (target:Entity {canonicalName: $entityName})
// MATCH path = shortestPath((epstein)-[:CONNECTED_TO*]-(target))
// RETURN length(path) AS layer
// ',
// 'read',
// [['layer', 'INTEGER']],
// [['entityName', 'STRING']]
// );
// ============================================================================
// EXAMPLE QUERIES
// ============================================================================
// Find all Layer 1 entities (direct connections to Epstein)
// MATCH (epstein:Entity {canonicalName: 'Jeffrey Epstein'})-[:CONNECTED_TO]-(layer1:Entity)
// RETURN layer1.canonicalName, layer1.type;
// Find shared connections between two entities
// MATCH (a:Entity {canonicalName: $name1})-[:CONNECTED_TO]-(shared:Entity)-[:CONNECTED_TO]-(b:Entity {canonicalName: $name2})
// RETURN shared.canonicalName, shared.type;
// Find documents where two entities appear together
// MATCH (a:Entity {canonicalName: $name1})-[:MENTIONED_IN]->(d:Document)<-[:MENTIONED_IN]-(b:Entity {canonicalName: $name2})
// RETURN d.docId, d.summary;
// Get entity's network up to N hops
// MATCH path = (e:Entity {canonicalName: $name})-[:CONNECTED_TO*1..3]-(connected:Entity)
// RETURN path;
// Find money flows (entities connected through financial documents)
// MATCH (a:Entity)-[:MENTIONED_IN]->(d:Document {documentType: 'financial'})<-[:MENTIONED_IN]-(b:Entity)
// WHERE a <> b
// RETURN a.canonicalName, b.canonicalName, count(d) AS sharedFinancialDocs
// ORDER BY sharedFinancialDocs DESC;
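The commented-out `calculateLayer` helper above defines an entity's layer as the length of its shortest `CONNECTED_TO` path back to the root node. The same idea, sketched as a plain breadth-first search over an in-memory adjacency list (the `Graph` shape here is an illustrative assumption, not part of the schema):

```typescript
// Adjacency list keyed by canonicalName. CONNECTED_TO is undirected,
// so both directions are expected to be present.
type Graph = Map<string, string[]>

// BFS layer assignment: layer = shortest-path distance from the root.
// Entities unreachable from the root are absent from the result.
function calculateLayers(graph: Graph, root = 'Jeffrey Epstein'): Map<string, number> {
  const layers = new Map<string, number>([[root, 0]])
  const queue = [root]
  while (queue.length > 0) {
    const current = queue.shift()!
    const layer = layers.get(current)!
    for (const neighbor of graph.get(current) ?? []) {
      if (!layers.has(neighbor)) {
        layers.set(neighbor, layer + 1)
        queue.push(neighbor)
      }
    }
  }
  return layers
}
```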


@@ -0,0 +1,403 @@
-- Epstein Files Database Schema
-- PostgreSQL 16+
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- Fuzzy text matching
CREATE EXTENSION IF NOT EXISTS btree_gin; -- GIN indexes for JSONB
CREATE EXTENSION IF NOT EXISTS unaccent; -- Accent-insensitive search
-- ============================================================================
-- DOCUMENTS
-- ============================================================================
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
doc_id TEXT UNIQUE NOT NULL, -- EFTA00000001
dataset_id INTEGER NOT NULL, -- Which dataset (1-5)
file_path TEXT, -- Original file path
-- Content
full_text TEXT, -- OCR text
page_count INTEGER,
-- AI Analysis
summary TEXT, -- One sentence summary
detailed_summary TEXT, -- Paragraph summary
document_type TEXT, -- Deposition, email, financial record, etc.
-- Temporal
date_earliest DATE, -- Earliest date mentioned
date_latest DATE, -- Latest date mentioned
-- Metadata
content_tags JSONB DEFAULT '[]', -- AI-extracted tags
analysis_status TEXT DEFAULT 'pending', -- pending, processing, complete, failed
error_message TEXT,
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
analyzed_at TIMESTAMPTZ
);
CREATE INDEX idx_documents_doc_id ON documents(doc_id);
CREATE INDEX idx_documents_dataset ON documents(dataset_id);
CREATE INDEX idx_documents_type ON documents(document_type);
CREATE INDEX idx_documents_status ON documents(analysis_status);
CREATE INDEX idx_documents_dates ON documents(date_earliest, date_latest);
CREATE INDEX idx_documents_fulltext ON documents USING gin(to_tsvector('english', full_text));
CREATE INDEX idx_documents_tags ON documents USING gin(content_tags);
-- ============================================================================
-- ENTITIES
-- ============================================================================
-- Entity types enum
CREATE TYPE entity_type AS ENUM (
'person',
'organization',
'location',
'date',
'reference', -- Document references, case numbers, etc.
'financial', -- Dollar amounts, account numbers
'unknown'
);
CREATE TABLE entities (
id SERIAL PRIMARY KEY,
canonical_name TEXT NOT NULL, -- Deduplicated canonical form
entity_type entity_type NOT NULL,
-- Classification
layer INTEGER, -- 0=Epstein, 1=direct, 2=one removed, 3=two removed
-- Metadata
aliases JSONB DEFAULT '[]', -- Alternative spellings/names
attributes JSONB DEFAULT '{}', -- Type-specific attributes
description TEXT, -- AI-generated description
-- Cross-reference matches
ppp_matches JSONB DEFAULT '[]', -- Matched PPP loan records
fec_matches JSONB DEFAULT '[]', -- Matched FEC contributions
grants_matches JSONB DEFAULT '[]', -- Matched federal grants
-- Stats
document_count INTEGER DEFAULT 0, -- Number of documents mentioning entity
connection_count INTEGER DEFAULT 0, -- Number of connections to other entities
-- Timestamps
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(canonical_name, entity_type)
);
CREATE INDEX idx_entities_name ON entities(canonical_name);
CREATE INDEX idx_entities_name_trgm ON entities USING gin(canonical_name gin_trgm_ops);
CREATE INDEX idx_entities_type ON entities(entity_type);
CREATE INDEX idx_entities_layer ON entities(layer);
CREATE INDEX idx_entities_aliases ON entities USING gin(aliases);
-- ============================================================================
-- ENTITY ALIASES
-- ============================================================================
CREATE TABLE entity_aliases (
id SERIAL PRIMARY KEY,
original_name TEXT NOT NULL,
entity_id INTEGER NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
confidence REAL DEFAULT 1.0, -- Confidence of alias match
source TEXT DEFAULT 'extraction', -- extraction, llm_dedup, manual
reasoning TEXT, -- Why this was matched
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_aliases_original ON entity_aliases(original_name);
CREATE INDEX idx_aliases_original_trgm ON entity_aliases USING gin(original_name gin_trgm_ops);
CREATE INDEX idx_aliases_entity ON entity_aliases(entity_id);
-- ============================================================================
-- DOCUMENT-ENTITY RELATIONSHIPS
-- ============================================================================
CREATE TABLE document_entities (
id SERIAL PRIMARY KEY,
document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
entity_id INTEGER NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
-- Context
mention_count INTEGER DEFAULT 1, -- How many times mentioned
first_mention INTEGER, -- Character offset of first mention
context_snippet TEXT, -- Surrounding text
-- Metadata
extraction_confidence REAL DEFAULT 1.0,
UNIQUE(document_id, entity_id)
);
CREATE INDEX idx_doc_entities_doc ON document_entities(document_id);
CREATE INDEX idx_doc_entities_entity ON document_entities(entity_id);
-- ============================================================================
-- RDF TRIPLES (Relationships)
-- ============================================================================
CREATE TABLE triples (
id SERIAL PRIMARY KEY,
document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
-- Subject-Predicate-Object
subject_id INTEGER NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
predicate TEXT NOT NULL, -- Action/verb
object_id INTEGER NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
-- Context
location_id INTEGER REFERENCES entities(id) ON DELETE SET NULL,
timestamp DATE,
-- Metadata
explicit_topic TEXT, -- Stated subject matter
implicit_topic TEXT, -- Inferred subject matter
tags JSONB DEFAULT '[]',
confidence REAL DEFAULT 1.0,
sequence_order INTEGER, -- Order within document
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_triples_document ON triples(document_id);
CREATE INDEX idx_triples_subject ON triples(subject_id);
CREATE INDEX idx_triples_object ON triples(object_id);
CREATE INDEX idx_triples_predicate ON triples(predicate);
CREATE INDEX idx_triples_timestamp ON triples(timestamp);
CREATE INDEX idx_triples_tags ON triples USING gin(tags);
-- ============================================================================
-- CROSS-REFERENCE TABLES
-- ============================================================================
-- PPP Loans
CREATE TABLE ppp_loans (
id SERIAL PRIMARY KEY,
loan_number TEXT UNIQUE,
borrower_name TEXT NOT NULL,
borrower_address TEXT,
borrower_city TEXT,
borrower_state TEXT,
borrower_zip TEXT,
loan_amount NUMERIC(15,2),
loan_status TEXT,
forgiveness_amount NUMERIC(15,2),
lender TEXT,
naics_code TEXT,
business_type TEXT,
jobs_retained INTEGER,
date_approved DATE,
-- Matching metadata
normalized_name TEXT, -- For fuzzy matching
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_ppp_name ON ppp_loans(borrower_name);
CREATE INDEX idx_ppp_name_trgm ON ppp_loans USING gin(borrower_name gin_trgm_ops);
CREATE INDEX idx_ppp_normalized ON ppp_loans USING gin(normalized_name gin_trgm_ops);
-- FEC Contributions
CREATE TABLE fec_contributions (
id SERIAL PRIMARY KEY,
fec_id TEXT,
contributor_name TEXT NOT NULL,
contributor_city TEXT,
contributor_state TEXT,
contributor_zip TEXT,
contributor_employer TEXT,
contributor_occupation TEXT,
committee_id TEXT,
committee_name TEXT,
candidate_id TEXT,
candidate_name TEXT,
amount NUMERIC(12,2),
contribution_date DATE,
contribution_type TEXT,
-- Matching metadata
normalized_name TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_fec_contributor ON fec_contributions(contributor_name);
CREATE INDEX idx_fec_contributor_trgm ON fec_contributions USING gin(contributor_name gin_trgm_ops);
CREATE INDEX idx_fec_normalized ON fec_contributions USING gin(normalized_name gin_trgm_ops);
CREATE INDEX idx_fec_candidate ON fec_contributions(candidate_name);
CREATE INDEX idx_fec_committee ON fec_contributions(committee_name);
-- Federal Grants
CREATE TABLE federal_grants (
id SERIAL PRIMARY KEY,
award_id TEXT,
recipient_name TEXT NOT NULL,
recipient_city TEXT,
recipient_state TEXT,
recipient_zip TEXT,
awarding_agency TEXT,
funding_agency TEXT,
award_amount NUMERIC(15,2),
award_date DATE,
description TEXT,
cfda_number TEXT,
cfda_title TEXT,
-- Matching metadata
normalized_name TEXT,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX idx_grants_recipient ON federal_grants(recipient_name);
CREATE INDEX idx_grants_recipient_trgm ON federal_grants USING gin(recipient_name gin_trgm_ops);
CREATE INDEX idx_grants_normalized ON federal_grants USING gin(normalized_name gin_trgm_ops);
-- ============================================================================
-- ENTITY CROSS-REFERENCE MATCHES
-- ============================================================================
CREATE TYPE match_source AS ENUM ('ppp', 'fec', 'grants');
CREATE TABLE entity_crossref_matches (
id SERIAL PRIMARY KEY,
entity_id INTEGER NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
source match_source NOT NULL,
source_id INTEGER NOT NULL, -- ID in the source table
-- Match quality
match_score REAL NOT NULL, -- 0-1 similarity score
match_method TEXT, -- exact, fuzzy, soundex, etc.
verified BOOLEAN DEFAULT FALSE, -- Human-verified match
false_positive BOOLEAN DEFAULT FALSE, -- Confirmed not a match
created_at TIMESTAMPTZ DEFAULT NOW(),
verified_at TIMESTAMPTZ,
verified_by TEXT
);
CREATE INDEX idx_crossref_entity ON entity_crossref_matches(entity_id);
CREATE INDEX idx_crossref_source ON entity_crossref_matches(source, source_id);
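`match_score` is a 0-1 similarity value, and the `gin_trgm_ops` indexes above suggest pg_trgm-style fuzzy matching. One plausible way to preview such scores offline is a TypeScript approximation of trigram similarity (intersection over union of 3-grams of space-padded words); this is a sketch that only approximates pg_trgm's exact padding rules, not a faithful reimplementation:

```typescript
// Extract pg_trgm-style trigrams: lowercase, split on whitespace,
// pad each word with two leading spaces and one trailing space.
function trigrams(s: string): Set<string> {
  const grams = new Set<string>()
  for (const word of s.toLowerCase().split(/\s+/).filter(Boolean)) {
    const padded = `  ${word} `
    for (let i = 0; i + 3 <= padded.length; i++) {
      grams.add(padded.slice(i, i + 3))
    }
  }
  return grams
}

// Jaccard similarity over trigram sets, in [0, 1].
function similarity(a: string, b: string): number {
  const ta = trigrams(a)
  const tb = trigrams(b)
  let shared = 0
  for (const g of ta) if (tb.has(g)) shared++
  const union = ta.size + tb.size - shared
  return union === 0 ? 0 : shared / union
}
```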
-- ============================================================================
-- PATTERN FINDINGS
-- ============================================================================
CREATE TABLE pattern_findings (
id SERIAL PRIMARY KEY,
-- The pattern
title TEXT NOT NULL,
description TEXT NOT NULL,
pattern_type TEXT, -- financial_flow, travel_pattern, organizational_link, etc.
-- Involved entities
entity_ids INTEGER[] NOT NULL,
-- Evidence
evidence JSONB NOT NULL, -- Supporting documents, connections, etc.
confidence REAL,
-- Status
status TEXT DEFAULT 'hypothesis', -- hypothesis, validated, rejected
notes TEXT,
-- Timestamps
discovered_at TIMESTAMPTZ DEFAULT NOW(),
discovered_by TEXT DEFAULT 'pattern_agent',
validated_at TIMESTAMPTZ,
validated_by TEXT
);
CREATE INDEX idx_patterns_type ON pattern_findings(pattern_type);
CREATE INDEX idx_patterns_status ON pattern_findings(status);
CREATE INDEX idx_patterns_entities ON pattern_findings USING gin(entity_ids);
-- ============================================================================
-- VIEWS
-- ============================================================================
-- Entity connections view
CREATE VIEW entity_connections AS
SELECT
e1.id AS entity1_id,
e1.canonical_name AS entity1_name,
e1.entity_type AS entity1_type,
e2.id AS entity2_id,
e2.canonical_name AS entity2_name,
e2.entity_type AS entity2_type,
COUNT(DISTINCT d.id) AS shared_documents,
array_agg(DISTINCT d.doc_id) AS document_ids
FROM document_entities de1
JOIN document_entities de2 ON de1.document_id = de2.document_id AND de1.entity_id < de2.entity_id
JOIN entities e1 ON de1.entity_id = e1.id
JOIN entities e2 ON de2.entity_id = e2.id
JOIN documents d ON de1.document_id = d.id
GROUP BY e1.id, e1.canonical_name, e1.entity_type, e2.id, e2.canonical_name, e2.entity_type;
-- ============================================================================
-- FUNCTIONS
-- ============================================================================
-- Normalize name for fuzzy matching
CREATE OR REPLACE FUNCTION normalize_name(name TEXT) RETURNS TEXT AS $$
BEGIN
RETURN lower(
regexp_replace(
regexp_replace(
unaccent(name),
'[^a-zA-Z0-9 ]', '', 'g'
),
'\s+', ' ', 'g'
)
);
END;
$$ LANGUAGE plpgsql IMMUTABLE;
-- Update entity stats
CREATE OR REPLACE FUNCTION update_entity_stats() RETURNS TRIGGER AS $$
DECLARE
affected_entity INTEGER;
BEGIN
-- NEW is not assigned in DELETE triggers (referencing it raises an error),
-- so pick the affected row based on TG_OP instead of COALESCE(NEW, OLD)
IF TG_OP = 'DELETE' THEN
affected_entity := OLD.entity_id;
ELSE
affected_entity := NEW.entity_id;
END IF;
-- Update document count and connection count
UPDATE entities e
SET document_count = (
SELECT COUNT(DISTINCT document_id)
FROM document_entities
WHERE entity_id = e.id
),
connection_count = (
SELECT COUNT(*)
FROM entity_connections
WHERE entity1_id = e.id OR entity2_id = e.id
),
updated_at = NOW()
WHERE e.id = affected_entity;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trigger_update_entity_stats
AFTER INSERT OR UPDATE OR DELETE ON document_entities
FOR EACH ROW EXECUTE FUNCTION update_entity_stats();
-- ============================================================================
-- INITIAL DATA
-- ============================================================================
-- Insert Jeffrey Epstein as Layer 0
INSERT INTO entities (canonical_name, entity_type, layer, description, aliases)
VALUES (
'Jeffrey Epstein',
'person',
0,
'American financier and convicted sex offender',
'["Jeffrey E. Epstein", "J. Epstein", "Epstein", "JE"]'::jsonb
) ON CONFLICT DO NOTHING;
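The `normalized_name` columns are populated from the `normalize_name()` SQL function above. For pre-computing the same key in the TypeScript pipeline before rows are inserted, a rough client-side mirror can be used; note that Unicode NFD decomposition stands in for `unaccent()` (results can differ on some inputs), and this version additionally trims the result:

```typescript
// Approximate TypeScript counterpart of the SQL normalize_name():
// strip accents, lowercase, drop punctuation, collapse whitespace.
function normalizeName(name: string): string {
  return name
    .normalize('NFD')                 // decompose accented characters
    .replace(/[\u0300-\u036f]/g, '')  // strip combining accent marks
    .toLowerCase()
    .replace(/[^a-z0-9 ]/g, '')       // drop punctuation
    .replace(/\s+/g, ' ')             // collapse whitespace
    .trim()                           // extra step vs. the SQL version
}
```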