Initial commit: Epstein Files Database project structure
- PostgreSQL schema for documents, entities, relationships, cross-refs
- Neo4j schema for graph relationships
- TypeScript extraction pipeline (OCR, NER, deduplication)
- Go API server (Fiber) with full REST endpoints
- React + Tailwind frontend with network visualization
- Pattern finder agent for connection discovery
- Docker compose for databases (Postgres, Neo4j, Typesense)
- Cross-reference matching for PPP loans, FEC, federal grants
.gitignore (vendored, new file, 66 lines)
@@ -0,0 +1,66 @@
# Data sources - too large for git
DataSources/

# Build outputs
dist/
build/
.next/
out/

# Dependencies
node_modules/
vendor/

# Environment
.env
.env.local
.env.*.local

# Go
*.exe
*.dll
*.so
*.dylib
bin/

# Python
__pycache__/
*.py[cod]
*$py.class
.venv/
venv/
env/
*.egg-info/

# IDE
.idea/
.vscode/
*.swp
*.swo
*~

# OS
.DS_Store
Thumbs.db

# Database files (large, generated)
*.db
*.sqlite
*.sqlite3

# Logs
*.log
logs/

# Temporary files
tmp/
temp/
.cache/

# Generated data (can be recreated)
data/processed/
data/embeddings/
data/exports/

# Keep config examples
!*.example
LICENSE (new file, 21 lines)
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Subcult

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md (new file, 217 lines)
@@ -0,0 +1,217 @@
# Epstein Files Database

A searchable database and network analysis tool for the DOJ Epstein Files release. Built to make public records accessible, cross-referenced, and analyzable.

## What This Does

1. **Entity Extraction** — Extracts names, organizations, locations, and dates from 4,055 DOJ documents
2. **Relationship Mapping** — Builds a graph of connections based on document co-occurrence
3. **Layer Classification** — Classifies entities by degree of separation from Jeffrey Epstein
4. **Cross-Reference Engine** — Fuzzy-matches entities against:
   - PPP loan data (SBA)
   - FEC campaign contributions
   - Federal grant recipients
5. **Pattern Detection Agent** — AI agent specialized in finding non-obvious connections
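The fuzzy matching in step 4 boils down to comparing name strings by shared character n-grams. A minimal trigram Jaccard similarity in TypeScript sketches the idea; the real matching happens in Postgres via `similarity()`, and the names and threshold here are illustrative:

```typescript
// Sketch only: trigram Jaccard similarity, the same idea that Postgres's
// pg_trgm similarity() implements. Not the production matcher.
function trigrams(s: string): Set<string> {
  const norm = `  ${s.toLowerCase().trim()} `; // pad so short names still produce trigrams
  const grams = new Set<string>();
  for (let i = 0; i <= norm.length - 3; i++) {
    grams.add(norm.slice(i, i + 3));
  }
  return grams;
}

function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  for (const g of ta) if (tb.has(g)) shared++;
  return shared / (ta.size + tb.size - shared); // Jaccard: |A∩B| / |A∪B|
}

// A pair becomes a candidate match when its score clears a tuned threshold.
const score = similarity('Ghislaine Maxwell', 'G. Maxwell');
const isCandidate = score > 0.3; // 0.3 is pg_trgm's default similarity threshold
```

Candidate pairs above the threshold still go to review; trigram similarity over names produces false positives by design.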
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                  Frontend (React + Tailwind)                    │
│  • Search Interface  • Network Visualization  • Document Viewer │
└─────────────────────────┬───────────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────────┐
│                       API Server (Go)                           │
│  • REST Endpoints  • Full-text Search  • Graph Queries          │
└─────────────────────────┬───────────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────────┐
│                         Data Layer                              │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────────┐ │
│  │  PostgreSQL  │  │    Neo4j     │  │  Typesense/Meilisearch │ │
│  │  Entities    │  │  Graph       │  │  Full-text Search      │ │
│  │  Documents   │  │  Relations   │  │                        │ │
│  │  Cross-refs  │  │              │  │                        │ │
│  └──────────────┘  └──────────────┘  └────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────────┐
│              Extraction Pipeline (TypeScript)                   │
│  • OCR Processing  • NER Extraction  • Relationship Inference   │
└─────────────────────────────────────────────────────────────────┘
```
## Tech Stack

| Component | Technology | Rationale |
|-----------|------------|-----------|
| Frontend | React + Tailwind + Vite | Fast, modern, type-safe |
| API | Go (Fiber) | Performance for graph queries |
| Primary DB | PostgreSQL | Structured data, JSONB, full-text |
| Graph DB | Neo4j | Relationship traversal at scale |
| Search | Typesense | Fast fuzzy search, typo-tolerant |
| Extraction | TypeScript + LLM | Entity extraction, deduplication |
| Pattern Agent | OpenClaw sub-agent | AI-driven connection discovery |
## Data Sources

### Primary: DOJ Epstein Files
- **4,055 documents** (EFTA00000001 through EFTA00008528)
- **1.77M lines** of OCR text
- **157GB** raw data (PDFs, images, scans)
- Source: https://www.justice.gov/epstein

### Cross-Reference Datasets
- **PPP Loans**: SBA FOIA data (https://data.sba.gov/dataset/ppp-foia)
- **FEC Contributions**: Federal Election Commission (https://www.fec.gov/data/)
- **Federal Grants**: USASpending.gov (https://www.usaspending.gov/download_center/custom_award_data)
## Layer Classification

| Layer | Definition | Example |
|-------|------------|---------|
| **L0** | Jeffrey Epstein himself | — |
| **L1** | Direct associates (named in documents with Epstein) | Ghislaine Maxwell |
| **L2** | One degree removed (connected to L1 but not directly to Epstein) | — |
| **L3** | Two degrees removed | — |
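The layer classification above is, at bottom, a breadth-first search over the co-occurrence graph: each entity gets the length of its shortest path from the L0 node. A minimal sketch (the Map-based adjacency and sample names are illustrative, not the Neo4j schema):

```typescript
// BFS layer assignment: layer(entity) = shortest co-occurrence distance from root.
// Illustrative in-memory adjacency; the real graph lives in Neo4j.
function classifyLayers(adjacency: Map<string, string[]>, root: string): Map<string, number> {
  const layer = new Map<string, number>([[root, 0]]);
  const queue = [root];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const neighbor of adjacency.get(current) ?? []) {
      if (!layer.has(neighbor)) {
        layer.set(neighbor, layer.get(current)! + 1); // one hop further out
        queue.push(neighbor);
      }
    }
  }
  return layer;
}

const graph = new Map([
  ['Jeffrey Epstein', ['A']],
  ['A', ['Jeffrey Epstein', 'B']],
  ['B', ['A']],
]);
const layers = classifyLayers(graph, 'Jeffrey Epstein'); // A → L1, B → L2
```

Entities unreachable from L0 simply get no layer, which is itself useful signal for review.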
## Getting Started

### Prerequisites
- Docker & Docker Compose
- Node.js 20+
- Go 1.21+
- PostgreSQL 16+ (or use Docker)
- Neo4j 5+ (or use Docker)

### Quick Start

```bash
# Clone the repo
git clone https://github.com/subculture-collective/epstein-db.git
cd epstein-db

# Start databases
docker-compose up -d

# Install dependencies
npm install
cd api && go mod download && cd ..

# Run extraction pipeline (requires OpenAI-compatible API)
cp .env.example .env
# Edit .env with your API keys

npm run extract

# Start the API server (entry point lives under api/cmd/server)
(cd api && go run ./cmd/server) &

# Start the frontend
npm run dev
```
## Project Structure

```
epstein-db/
├── api/                    # Go API server
│   ├── cmd/                # Entry points
│   ├── internal/           # Internal packages
│   │   ├── handlers/       # HTTP handlers
│   │   ├── db/             # Database access
│   │   ├── graph/          # Neo4j operations
│   │   └── search/         # Typesense operations
│   └── pkg/                # Public packages
│
├── extraction/             # TypeScript extraction pipeline
│   ├── src/
│   │   ├── ocr/            # OCR processing
│   │   ├── ner/            # Named Entity Recognition
│   │   ├── dedup/          # Entity deduplication
│   │   └── cross-ref/      # Cross-reference matching
│   └── scripts/            # Pipeline scripts
│
├── frontend/               # React frontend
│   ├── src/
│   │   ├── components/     # UI components
│   │   ├── pages/          # Route pages
│   │   ├── hooks/          # Custom hooks
│   │   └── api/            # API client
│   └── public/
│
├── agents/                 # AI agents
│   └── pattern-finder/     # Connection discovery agent
│
├── data/                   # Data directory (gitignored)
│   ├── raw/                # Symlink to DataSources
│   ├── processed/          # Extracted entities/relations
│   ├── crossref/           # PPP, FEC, grants data
│   └── exports/            # Generated exports
│
├── docker-compose.yml      # Database services
├── schema/                 # Database schemas
│   ├── postgres/           # SQL migrations
│   └── neo4j/              # Cypher constraints
│
└── docs/                   # Documentation
    ├── ARCHITECTURE.md
    ├── DATA_MODEL.md
    └── CONTRIBUTING.md
```
## Roadmap

### Phase 1: Foundation ✅
- [x] Repository setup
- [x] Database schema design
- [x] Docker compose for databases
- [x] Basic extraction pipeline

### Phase 2: Entity Extraction
- [ ] OCR text ingestion
- [ ] Named Entity Recognition (NER)
- [ ] Entity deduplication (LLM-assisted)
- [ ] Document-entity relationships

### Phase 3: Graph Construction
- [ ] Neo4j schema
- [ ] Co-occurrence relationship building
- [ ] Layer classification algorithm
- [ ] Graph API endpoints

### Phase 4: Cross-Reference
- [ ] PPP loan data ingestion
- [ ] FEC contribution data ingestion
- [ ] Federal grants data ingestion
- [ ] Fuzzy matching engine

### Phase 5: Frontend
- [ ] Search interface
- [ ] Network visualization (D3/Force-Graph)
- [ ] Document viewer
- [ ] Entity detail pages

### Phase 6: Pattern Agent
- [ ] Agent architecture design
- [ ] Connection hypothesis generation
- [ ] Validation pipeline
- [ ] Report generation
## Contributing

This is an open research project. Contributions welcome:
- Entity extraction improvements
- Fuzzy matching algorithms
- UI/UX improvements
- Additional cross-reference datasets
- Pattern detection strategies

## License

MIT License. The code is open source. The documents are public records.

## Disclaimer

This is an independent research project. We make no representations about the completeness or accuracy of the analysis. This tool surfaces connections — it does not assert guilt, criminality, or wrongdoing.
agents/pattern-finder/README.md (new file, 113 lines)
@@ -0,0 +1,113 @@
# Pattern Finder Agent

An AI agent specialized in discovering non-obvious connections, patterns, and relationships within the Epstein Files database.

## Purpose

While the extraction pipeline identifies explicit entities and relationships, the Pattern Finder looks for:

1. **Indirect Connections** — Entities that appear in similar contexts but are never directly linked
2. **Temporal Patterns** — Activities that cluster around specific dates or events
3. **Financial Flows** — Money movement patterns across entities
4. **Network Anomalies** — Unusually dense or sparse connection patterns
5. **Cross-Reference Insights** — What PPP/FEC/Grants matches reveal about entities
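The first category — entities that share context but never a document — can be made concrete with a small sketch. The data shapes here are illustrative; the real agent samples this from Postgres/Neo4j:

```typescript
// "Indirect connection": two entities whose document sets never overlap but
// whose co-occurrence neighborhoods do — they appear with the same third
// parties without ever appearing together. Illustrative in-memory shapes.
function indirectlyConnected(
  docs: Map<string, Set<string>>,    // entity -> document IDs it appears in
  cooccur: Map<string, Set<string>>, // entity -> entities it shares docs with
  a: string,
  b: string
): boolean {
  const docsA = docs.get(a) ?? new Set<string>();
  const docsB = docs.get(b) ?? new Set<string>();
  for (const d of docsA) if (docsB.has(d)) return false; // directly linked

  const nA = cooccur.get(a) ?? new Set<string>();
  const nB = cooccur.get(b) ?? new Set<string>();
  for (const n of nA) if (nB.has(n)) return true; // shared intermediary
  return false;
}
```

Pairs flagged this way become hypotheses for the validator, not conclusions.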
## How It Works

The agent runs periodically (or on-demand) and:

1. **Samples the Graph** — Pulls subgraphs around high-degree or interesting entities
2. **Generates Hypotheses** — Uses an LLM to identify potential patterns
3. **Validates Hypotheses** — Checks evidence in the actual documents
4. **Reports Findings** — Stores validated patterns with evidence chains
## Agent Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                     Pattern Finder Agent                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. Sampling Module                                             │
│     • Random walk from high-degree nodes                        │
│     • Temporal window sampling                                  │
│     • Cross-reference focused sampling                          │
│                                                                 │
│  2. Hypothesis Generator (LLM)                                  │
│     • Pattern recognition prompts                               │
│     • Anomaly detection prompts                                 │
│     • Connection inference prompts                              │
│                                                                 │
│  3. Evidence Validator                                          │
│     • Document retrieval                                        │
│     • Citation extraction                                       │
│     • Confidence scoring                                        │
│                                                                 │
│  4. Report Generator                                            │
│     • Pattern summary                                           │
│     • Evidence chain                                            │
│     • Visualization data                                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
## Pattern Types

### Financial Patterns
- Money flows between entities
- Unusual transaction timing
- Shell company connections
- Donation clustering

### Travel Patterns
- Co-location events
- Flight log correlations
- Property connections
- Event attendance

### Organizational Patterns
- Board memberships
- Foundation connections
- Employment relationships
- Legal representation

### Temporal Patterns
- Activity clustering around dates
- Gaps in documentation
- Correlated timelines
## Usage

```bash
# Run a pattern discovery session
npm run agent:pattern-finder

# Focus on a specific entity
npm run agent:pattern-finder -- --entity "Ghislaine Maxwell"

# Focus on a date range
npm run agent:pattern-finder -- --from "2005-01-01" --to "2010-12-31"

# Focus on a pattern type
npm run agent:pattern-finder -- --type financial
```
## Output

Patterns are stored in the `pattern_findings` table with:
- Title and description
- Involved entities
- Evidence (documents, relationships)
- Confidence score
- Status (hypothesis, validated, rejected)
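Since hypotheses come back from an LLM as parsed JSON, it is worth validating rows before they reach `pattern_findings`. A minimal guard, with field names mirroring the list above (an assumption for illustration, not the canonical schema in `schema/postgres`):

```typescript
// Hedged sketch: reject malformed LLM output before it is inserted.
// Field names mirror the README's list; assumed, not the canonical schema.
type PatternStatus = 'hypothesis' | 'validated' | 'rejected';

interface PatternFinding {
  title: string;
  description: string;
  patternType: string;
  entityNames: string[];
  evidence: string[];
  confidence: number;
  status: PatternStatus;
}

const PATTERN_TYPES = new Set(['financial', 'organizational', 'temporal', 'network', 'crossref']);

function isValidFinding(p: PatternFinding): boolean {
  return (
    p.title.length > 0 &&
    PATTERN_TYPES.has(p.patternType) &&
    p.entityNames.length > 0 &&
    p.confidence >= 0 &&
    p.confidence <= 1
  );
}
```

Invalid rows are dropped (or logged for prompt debugging) rather than stored as hypotheses.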
## Integration with OpenClaw

This agent can be spawned as a sub-agent from OpenClaw:

```typescript
sessions_spawn({
  task: "Analyze the network around Les Wexner for financial patterns",
  label: "pattern-finder-wexner",
})
```
agents/pattern-finder/agent.ts (new file, 315 lines)
@@ -0,0 +1,315 @@
/**
 * Pattern Finder Agent
 *
 * Discovers non-obvious connections and patterns in the Epstein Files database.
 */

import Anthropic from '@anthropic-ai/sdk';
import pg from 'pg';

const { Pool } = pg;

// ============================================================================
// Configuration
// ============================================================================

const config = {
  DATABASE_URL: process.env.DATABASE_URL || 'postgresql://epstein:epstein_dev@localhost:5432/epstein',
  ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY || '',
  LLM_MODEL: process.env.LLM_MODEL || 'claude-sonnet-4-20250514',
};

const pool = new Pool({ connectionString: config.DATABASE_URL });
const anthropic = new Anthropic({ apiKey: config.ANTHROPIC_API_KEY });

// ============================================================================
// Types
// ============================================================================

interface Entity {
  id: number;
  canonicalName: string;
  entityType: string;
  layer: number;
  documentCount: number;
  connectionCount: number;
  pppMatches: any[];
  fecMatches: any[];
  grantsMatches: any[];
}

interface Connection {
  entity1: string;
  entity2: string;
  sharedDocs: number;
  documentIds: string[];
}

interface PatternHypothesis {
  title: string;
  description: string;
  patternType: string;
  entityNames: string[];
  evidence: string[];
  confidence: number;
}

// ============================================================================
// Sampling Functions
// ============================================================================

async function getHighDegreeEntities(limit: number = 50): Promise<Entity[]> {
  const result = await pool.query(`
    SELECT
      id, canonical_name, entity_type, layer,
      document_count, connection_count,
      ppp_matches, fec_matches, grants_matches
    FROM entities
    WHERE entity_type IN ('person', 'organization')
    ORDER BY connection_count DESC
    LIMIT $1
  `, [limit]);

  return result.rows.map(row => ({
    id: row.id,
    canonicalName: row.canonical_name,
    entityType: row.entity_type,
    layer: row.layer || 0,
    documentCount: row.document_count || 0,
    connectionCount: row.connection_count || 0,
    pppMatches: row.ppp_matches || [],
    fecMatches: row.fec_matches || [],
    grantsMatches: row.grants_matches || [],
  }));
}

async function getEntityConnections(entityId: number, limit: number = 100): Promise<Connection[]> {
  const result = await pool.query(`
    SELECT
      e1.canonical_name AS entity1,
      e2.canonical_name AS entity2,
      COUNT(DISTINCT d.id) AS shared_docs,
      array_agg(DISTINCT d.doc_id) AS document_ids
    FROM document_entities de1
    JOIN document_entities de2 ON de1.document_id = de2.document_id AND de1.entity_id != de2.entity_id
    JOIN entities e1 ON de1.entity_id = e1.id
    JOIN entities e2 ON de2.entity_id = e2.id
    JOIN documents d ON de1.document_id = d.id
    WHERE de1.entity_id = $1
    GROUP BY e1.canonical_name, e2.canonical_name
    ORDER BY shared_docs DESC
    LIMIT $2
  `, [entityId, limit]);

  return result.rows.map(row => ({
    entity1: row.entity1,
    entity2: row.entity2,
    sharedDocs: parseInt(row.shared_docs, 10), // pg returns bigint counts as strings
    documentIds: row.document_ids,
  }));
}

async function getEntitiesWithCrossRefMatches(): Promise<Entity[]> {
  const result = await pool.query(`
    SELECT
      id, canonical_name, entity_type, layer,
      document_count, connection_count,
      ppp_matches, fec_matches, grants_matches
    FROM entities
    WHERE
      (ppp_matches IS NOT NULL AND jsonb_array_length(ppp_matches) > 0)
      OR (fec_matches IS NOT NULL AND jsonb_array_length(fec_matches) > 0)
      OR (grants_matches IS NOT NULL AND jsonb_array_length(grants_matches) > 0)
    ORDER BY connection_count DESC
    LIMIT 100
  `);

  return result.rows.map(row => ({
    id: row.id,
    canonicalName: row.canonical_name,
    entityType: row.entity_type,
    layer: row.layer || 0,
    documentCount: row.document_count || 0,
    connectionCount: row.connection_count || 0,
    pppMatches: row.ppp_matches || [],
    fecMatches: row.fec_matches || [],
    grantsMatches: row.grants_matches || [],
  }));
}

// ============================================================================
// Pattern Detection
// ============================================================================

const PATTERN_SYSTEM_PROMPT = `You are an investigative analyst specializing in network analysis and pattern detection. You're analyzing data from the Jeffrey Epstein case documents.

Your task is to identify non-obvious patterns, connections, and anomalies that might warrant further investigation.

Focus on:
1. Financial patterns (money flows, unusual transactions, timing)
2. Organizational patterns (shared board memberships, foundations, legal representation)
3. Temporal patterns (activities clustering around dates, gaps in documentation)
4. Network anomalies (unusually dense connections, unexpected bridges between groups)
5. Cross-reference insights (what PPP loans, FEC contributions, or federal grants reveal)

Be specific and cite evidence. Generate hypotheses that can be validated with document review.

IMPORTANT: You are surfacing patterns for investigation, not asserting guilt or wrongdoing.`;

async function generatePatternHypotheses(
  entities: Entity[],
  connections: Connection[]
): Promise<PatternHypothesis[]> {
  const entitySummaries = entities.map(e => ({
    name: e.canonicalName,
    type: e.entityType,
    layer: e.layer,
    docs: e.documentCount,
    connections: e.connectionCount,
    hasPPP: e.pppMatches.length > 0,
    hasFEC: e.fecMatches.length > 0,
    hasGrants: e.grantsMatches.length > 0,
  }));

  const connectionSummaries = connections.slice(0, 50).map(c => ({
    pair: `${c.entity1} ↔ ${c.entity2}`,
    sharedDocs: c.sharedDocs,
  }));

  const prompt = `Analyze this network data and identify potential patterns worth investigating.

ENTITIES (${entities.length} total, showing key attributes):
${JSON.stringify(entitySummaries, null, 2)}

TOP CONNECTIONS:
${JSON.stringify(connectionSummaries, null, 2)}

Generate 3-5 pattern hypotheses. For each, provide:
1. A specific, descriptive title
2. What the pattern suggests
3. Which entities are involved
4. What evidence supports this hypothesis
5. Confidence level (0-1)

Return JSON array:
[
  {
    "title": "Pattern Title",
    "description": "What this pattern suggests and why it's notable",
    "patternType": "financial|organizational|temporal|network|crossref",
    "entityNames": ["Entity1", "Entity2"],
    "evidence": ["Evidence point 1", "Evidence point 2"],
    "confidence": 0.7
  }
]

Return ONLY valid JSON.`;

  const response = await anthropic.messages.create({
    model: config.LLM_MODEL,
    max_tokens: 4096,
    system: PATTERN_SYSTEM_PROMPT,
    messages: [{ role: 'user', content: prompt }],
  });

  const content = response.content[0];
  if (content.type !== 'text') {
    throw new Error('Unexpected response type');
  }

  const jsonMatch = content.text.match(/\[[\s\S]*\]/);
  if (!jsonMatch) {
    console.error('No JSON found:', content.text);
    return [];
  }

  return JSON.parse(jsonMatch[0]);
}

// ============================================================================
// Save Patterns
// ============================================================================

async function savePattern(pattern: PatternHypothesis): Promise<number> {
  // Get entity IDs
  const entityResult = await pool.query(`
    SELECT id FROM entities WHERE canonical_name = ANY($1)
  `, [pattern.entityNames]);

  const entityIds = entityResult.rows.map(r => r.id);

  const result = await pool.query(`
    INSERT INTO pattern_findings
      (title, description, pattern_type, entity_ids, evidence, confidence, status)
    VALUES ($1, $2, $3, $4, $5, $6, 'hypothesis')
    RETURNING id
  `, [
    pattern.title,
    pattern.description,
    pattern.patternType,
    entityIds,
    JSON.stringify({
      entityNames: pattern.entityNames,
      evidencePoints: pattern.evidence,
    }),
    pattern.confidence,
  ]);

  return result.rows[0].id;
}

// ============================================================================
// Main
// ============================================================================

async function main() {
  console.log('🔎 Pattern Finder Agent starting...\n');

  // Get high-degree entities
  console.log('📊 Sampling high-degree entities...');
  const highDegree = await getHighDegreeEntities(50);
  console.log(`   Found ${highDegree.length} high-degree entities`);

  // Get entities with cross-reference matches
  console.log('📊 Sampling entities with cross-reference matches...');
  const crossRef = await getEntitiesWithCrossRefMatches();
  console.log(`   Found ${crossRef.length} entities with PPP/FEC/Grants matches`);

  // Get connections for top entities
  console.log('📊 Sampling connections...');
  const allConnections: Connection[] = [];
  for (const entity of highDegree.slice(0, 10)) {
    const connections = await getEntityConnections(entity.id, 50);
    allConnections.push(...connections);
  }
  console.log(`   Found ${allConnections.length} connections`);

  // Combine entities (deduplicate by id)
  const allEntities = [...highDegree, ...crossRef];
  const uniqueEntities = Array.from(
    new Map(allEntities.map(e => [e.id, e])).values()
  );

  // Generate pattern hypotheses
  console.log('\n🧠 Generating pattern hypotheses...');
  const patterns = await generatePatternHypotheses(uniqueEntities, allConnections);
  console.log(`   Generated ${patterns.length} hypotheses`);

  // Save patterns
  console.log('\n💾 Saving patterns to database...');
  for (const pattern of patterns) {
    const id = await savePattern(pattern);
    console.log(`   ✓ Saved: ${pattern.title} (ID: ${id})`);
  }

  console.log('\n✅ Pattern Finder complete!');
  console.log(`   Patterns discovered: ${patterns.length}`);

  await pool.end();
}

main().catch((error) => {
  console.error('Fatal error:', error);
  process.exit(1);
});
api/cmd/server/main.go (new file, 105 lines)
@@ -0,0 +1,105 @@
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"

	"github.com/gofiber/fiber/v2"
	"github.com/gofiber/fiber/v2/middleware/cors"
	"github.com/gofiber/fiber/v2/middleware/logger"
	"github.com/gofiber/fiber/v2/middleware/recover"
	"github.com/joho/godotenv"

	"github.com/subculture-collective/epstein-db/api/internal/db"
	"github.com/subculture-collective/epstein-db/api/internal/handlers"
)

func main() {
	// Load .env file
	if err := godotenv.Load(); err != nil {
		log.Println("No .env file found, using environment variables")
	}

	// Initialize database connection
	if err := db.Initialize(context.Background()); err != nil {
		log.Fatalf("Failed to initialize database: %v", err)
	}
	defer db.Close()

	// Create Fiber app
	app := fiber.New(fiber.Config{
		AppName: "Epstein Files API",
	})

	// Middleware
	app.Use(recover.New())
	app.Use(logger.New())
	app.Use(cors.New(cors.Config{
		AllowOrigins: "*",
		AllowMethods: "GET,POST,PUT,DELETE,OPTIONS",
		AllowHeaders: "Origin, Content-Type, Accept, Authorization",
	}))

	// Routes
	api := app.Group("/api")

	// Stats
	api.Get("/stats", handlers.GetStats)

	// Entities
	api.Get("/entities", handlers.SearchEntities)
	api.Get("/entities/:id", handlers.GetEntity)
	api.Get("/entities/:id/connections", handlers.GetEntityConnections)
	api.Get("/entities/:id/documents", handlers.GetEntityDocuments)

	// Documents
	api.Get("/documents", handlers.ListDocuments)
	api.Get("/documents/:id", handlers.GetDocument)
	api.Get("/documents/:id/text", handlers.GetDocumentText)
	api.Get("/documents/:id/entities", handlers.GetDocumentEntities)

	// Graph/Network
	api.Get("/network", handlers.GetNetwork)
	api.Get("/network/layers", handlers.GetNetworkByLayer)

	// Cross-references
	api.Get("/crossref/ppp", handlers.SearchPPP)
	api.Get("/crossref/fec", handlers.SearchFEC)
	api.Get("/crossref/grants", handlers.SearchGrants)

	// Patterns
	api.Get("/patterns", handlers.ListPatterns)
	api.Get("/patterns/:id", handlers.GetPattern)

	// Search
	api.Get("/search", handlers.FullTextSearch)

	// Health check
	app.Get("/health", func(c *fiber.Ctx) error {
		return c.JSON(fiber.Map{"status": "ok"})
	})

	// Get port from environment
	port := os.Getenv("PORT")
	if port == "" {
		port = "3001"
	}

	// Graceful shutdown
	go func() {
		sigChan := make(chan os.Signal, 1)
		signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
		<-sigChan
		log.Println("Shutting down...")
		app.Shutdown()
	}()

	// Start server
	log.Printf("Starting server on port %s", port)
	if err := app.Listen(":" + port); err != nil {
		log.Fatalf("Server error: %v", err)
	}
}
api/go.mod (new file, 31 lines)
@@ -0,0 +1,31 @@
module github.com/subculture-collective/epstein-db/api

go 1.21

require (
	github.com/gofiber/fiber/v2 v2.52.4
	github.com/jackc/pgx/v5 v5.5.5
	github.com/joho/godotenv v1.5.1
	github.com/neo4j/neo4j-go-driver/v5 v5.19.0
	github.com/typesense/typesense-go v1.1.0
)

require (
	github.com/andybalholm/brotli v1.1.0 // indirect
	github.com/google/uuid v1.6.0 // indirect
	github.com/jackc/pgpassfile v1.0.0 // indirect
	github.com/jackc/pgservicefile v0.0.0-20231201235250-de7065d80cb9 // indirect
	github.com/jackc/puddle/v2 v2.2.1 // indirect
	github.com/klauspost/compress v1.17.8 // indirect
	github.com/mattn/go-colorable v0.1.13 // indirect
	github.com/mattn/go-isatty v0.0.20 // indirect
	github.com/mattn/go-runewidth v0.0.15 // indirect
	github.com/rivo/uniseg v0.4.7 // indirect
	github.com/valyala/bytebufferpool v1.0.0 // indirect
	github.com/valyala/fasthttp v1.52.0 // indirect
	github.com/valyala/tcplisten v1.0.0 // indirect
	golang.org/x/crypto v0.22.0 // indirect
	golang.org/x/sync v0.7.0 // indirect
	golang.org/x/sys v0.19.0 // indirect
	golang.org/x/text v0.14.0 // indirect
)
api/internal/db/db.go (new file, 35 lines)
@@ -0,0 +1,35 @@
package db

import (
	"context"
	"os"

	"github.com/jackc/pgx/v5/pgxpool"
)

var pool *pgxpool.Pool

func Initialize(ctx context.Context) error {
	connString := os.Getenv("DATABASE_URL")
	if connString == "" {
		connString = "postgresql://epstein:epstein_dev@localhost:5432/epstein"
	}

	var err error
	pool, err = pgxpool.New(ctx, connString)
	if err != nil {
		return err
	}

	return pool.Ping(ctx)
}

func Close() {
	if pool != nil {
		pool.Close()
	}
}

func Pool() *pgxpool.Pool {
	return pool
}
202
api/internal/handlers/crossref.go
Normal file
@@ -0,0 +1,202 @@
package handlers

import (
	"context"
	"strconv"

	"github.com/gofiber/fiber/v2"
	"github.com/subculture-collective/epstein-db/api/internal/db"
)

// SearchPPP searches PPP loan data
func SearchPPP(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	query := c.Query("q", "")
	limitStr := c.Query("limit", "50")
	limit, _ := strconv.Atoi(limitStr)
	if limit > 200 {
		limit = 200
	}

	rows, err := pool.Query(ctx, `
		SELECT id, borrower_name, borrower_city, borrower_state,
		       loan_amount, forgiveness_amount, lender, date_approved,
		       similarity(borrower_name, $1) AS score
		FROM ppp_loans
		WHERE $1 = '' OR borrower_name % $1 OR borrower_name ILIKE '%' || $1 || '%'
		ORDER BY
			CASE WHEN $1 != '' THEN similarity(borrower_name, $1) ELSE 0 END DESC,
			loan_amount DESC NULLS LAST
		LIMIT $2
	`, query, limit)
	if err != nil {
		return c.Status(500).JSON(fiber.Map{"error": err.Error()})
	}
	defer rows.Close()

	var results []fiber.Map
	for rows.Next() {
		var id int
		var name string
		var city, state, lender *string
		var loanAmount, forgivenessAmount *float64
		var dateApproved *string
		var score float64

		if err := rows.Scan(&id, &name, &city, &state, &loanAmount,
			&forgivenessAmount, &lender, &dateApproved, &score); err != nil {
			continue
		}

		results = append(results, fiber.Map{
			"id":                id,
			"borrowerName":      name,
			"borrowerCity":      city,
			"borrowerState":     state,
			"loanAmount":        loanAmount,
			"forgivenessAmount": forgivenessAmount,
			"lender":            lender,
			"dateApproved":      dateApproved,
			"matchScore":        score,
		})
	}

	return c.JSON(fiber.Map{
		"results": results,
		"count":   len(results),
	})
}

// SearchFEC searches FEC contribution data
func SearchFEC(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	query := c.Query("q", "")
	candidate := c.Query("candidate", "")
	limitStr := c.Query("limit", "50")
	limit, _ := strconv.Atoi(limitStr)
	if limit > 200 {
		limit = 200
	}

	rows, err := pool.Query(ctx, `
		SELECT id, contributor_name, contributor_city, contributor_state,
		       contributor_employer, contributor_occupation,
		       candidate_name, committee_name, amount, contribution_date,
		       similarity(contributor_name, $1) AS score
		FROM fec_contributions
		WHERE ($1 = '' OR contributor_name % $1 OR contributor_name ILIKE '%' || $1 || '%')
		  AND ($2 = '' OR candidate_name ILIKE '%' || $2 || '%')
		ORDER BY
			CASE WHEN $1 != '' THEN similarity(contributor_name, $1) ELSE 0 END DESC,
			amount DESC NULLS LAST
		LIMIT $3
	`, query, candidate, limit)
	if err != nil {
		return c.Status(500).JSON(fiber.Map{"error": err.Error()})
	}
	defer rows.Close()

	var results []fiber.Map
	for rows.Next() {
		var id int
		var name string
		var city, state, employer, occupation, candidateName, committeeName *string
		var amount *float64
		var contributionDate *string
		var score float64

		if err := rows.Scan(&id, &name, &city, &state, &employer, &occupation,
			&candidateName, &committeeName, &amount, &contributionDate, &score); err != nil {
			continue
		}

		results = append(results, fiber.Map{
			"id":               id,
			"contributorName":  name,
			"contributorCity":  city,
			"contributorState": state,
			"employer":         employer,
			"occupation":       occupation,
			"candidateName":    candidateName,
			"committeeName":    committeeName,
			"amount":           amount,
			"contributionDate": contributionDate,
			"matchScore":       score,
		})
	}

	return c.JSON(fiber.Map{
		"results": results,
		"count":   len(results),
	})
}

// SearchGrants searches federal grants data
func SearchGrants(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	query := c.Query("q", "")
	agency := c.Query("agency", "")
	limitStr := c.Query("limit", "50")
	limit, _ := strconv.Atoi(limitStr)
	if limit > 200 {
		limit = 200
	}

	rows, err := pool.Query(ctx, `
		SELECT id, recipient_name, recipient_city, recipient_state,
		       awarding_agency, funding_agency, award_amount, award_date,
		       description, cfda_title,
		       similarity(recipient_name, $1) AS score
		FROM federal_grants
		WHERE ($1 = '' OR recipient_name % $1 OR recipient_name ILIKE '%' || $1 || '%')
		  AND ($2 = '' OR awarding_agency ILIKE '%' || $2 || '%')
		ORDER BY
			CASE WHEN $1 != '' THEN similarity(recipient_name, $1) ELSE 0 END DESC,
			award_amount DESC NULLS LAST
		LIMIT $3
	`, query, agency, limit)
	if err != nil {
		return c.Status(500).JSON(fiber.Map{"error": err.Error()})
	}
	defer rows.Close()

	var results []fiber.Map
	for rows.Next() {
		var id int
		var name string
		var city, state, awardingAgency, fundingAgency *string
		var awardAmount *float64
		var awardDate, description, cfdaTitle *string
		var score float64

		if err := rows.Scan(&id, &name, &city, &state, &awardingAgency, &fundingAgency,
			&awardAmount, &awardDate, &description, &cfdaTitle, &score); err != nil {
			continue
		}

		results = append(results, fiber.Map{
			"id":             id,
			"recipientName":  name,
			"recipientCity":  city,
			"recipientState": state,
			"awardingAgency": awardingAgency,
			"fundingAgency":  fundingAgency,
			"awardAmount":    awardAmount,
			"awardDate":      awardDate,
			"description":    description,
			"cfdaTitle":      cfdaTitle,
			"matchScore":     score,
		})
	}

	return c.JSON(fiber.Map{
		"results": results,
		"count":   len(results),
	})
}
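The crossref queries above use `similarity()` and the `%` operator, which come from PostgreSQL's `pg_trgm` extension — the database needs `CREATE EXTENSION IF NOT EXISTS pg_trgm` before these handlers will work. A stdlib-only sketch of the guard and of the fuzzy-match predicate the handlers share (the `EnsureTrgm` and `trigramClause` helper names are hypothetical, not part of the repository):

```go
package main

import (
	"database/sql"
	"fmt"
)

// The crossref queries rely on pg_trgm for similarity() and %;
// enabling the extension is idempotent, so it is safe at startup.
const enableTrgmSQL = `CREATE EXTENSION IF NOT EXISTS pg_trgm`

// EnsureTrgm runs the guard once against an open connection.
func EnsureTrgm(db *sql.DB) error {
	_, err := db.Exec(enableTrgmSQL)
	return err
}

// trigramClause builds the fuzzy-match predicate used by each handler:
// empty query matches all rows, otherwise trigram or substring match.
func trigramClause(col string) string {
	return fmt.Sprintf("($1 = '' OR %s %% $1 OR %s ILIKE '%%' || $1 || '%%')", col, col)
}

func main() {
	fmt.Println(trigramClause("borrower_name"))
	// → ($1 = '' OR borrower_name % $1 OR borrower_name ILIKE '%' || $1 || '%')
}
```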
238
api/internal/handlers/documents.go
Normal file
@@ -0,0 +1,238 @@
package handlers

import (
	"context"
	"strconv"

	"github.com/gofiber/fiber/v2"
	"github.com/subculture-collective/epstein-db/api/internal/db"
)

// ListDocuments returns a paginated list of documents
func ListDocuments(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	limitStr := c.Query("limit", "50")
	limit, _ := strconv.Atoi(limitStr)
	if limit > 200 {
		limit = 200
	}

	offsetStr := c.Query("offset", "0")
	offset, _ := strconv.Atoi(offsetStr)

	docType := c.Query("type", "")
	dataset := c.Query("dataset", "")

	rows, err := pool.Query(ctx, `
		SELECT id, doc_id, dataset_id, document_type, summary, date_earliest, date_latest
		FROM documents
		WHERE ($1 = '' OR document_type = $1)
		  AND ($2 = '' OR dataset_id = $2::int)
		ORDER BY doc_id
		LIMIT $3 OFFSET $4
	`, docType, dataset, limit, offset)
	if err != nil {
		return c.Status(500).JSON(fiber.Map{"error": err.Error()})
	}
	defer rows.Close()

	var documents []fiber.Map
	for rows.Next() {
		var id, datasetID int
		var docID string
		var docType, summary *string
		var dateEarliest, dateLatest *string

		if err := rows.Scan(&id, &docID, &datasetID, &docType, &summary, &dateEarliest, &dateLatest); err != nil {
			continue
		}

		documents = append(documents, fiber.Map{
			"id":           id,
			"docId":        docID,
			"datasetId":    datasetID,
			"documentType": docType,
			"summary":      summary,
			"dateEarliest": dateEarliest,
			"dateLatest":   dateLatest,
		})
	}

	return c.JSON(fiber.Map{
		"documents": documents,
		"count":     len(documents),
		"offset":    offset,
		"limit":     limit,
	})
}

// GetDocument returns a single document by ID
func GetDocument(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	id, err := strconv.Atoi(c.Params("id"))
	if err != nil {
		return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
	}

	var doc struct {
		ID              int     `json:"id"`
		DocID           string  `json:"docId"`
		DatasetID       int     `json:"datasetId"`
		DocumentType    *string `json:"documentType"`
		Summary         *string `json:"summary"`
		DetailedSummary *string `json:"detailedSummary"`
		DateEarliest    *string `json:"dateEarliest"`
		DateLatest      *string `json:"dateLatest"`
		ContentTags     []byte  `json:"contentTags"`
		PageCount       *int    `json:"pageCount"`
	}

	err = pool.QueryRow(ctx, `
		SELECT id, doc_id, dataset_id, document_type, summary, detailed_summary,
		       date_earliest::text, date_latest::text, content_tags, page_count
		FROM documents WHERE id = $1
	`, id).Scan(
		&doc.ID, &doc.DocID, &doc.DatasetID, &doc.DocumentType,
		&doc.Summary, &doc.DetailedSummary, &doc.DateEarliest,
		&doc.DateLatest, &doc.ContentTags, &doc.PageCount,
	)

	if err != nil {
		return c.Status(404).JSON(fiber.Map{"error": "document not found"})
	}

	return c.JSON(doc)
}

// GetDocumentText returns the full text of a document
func GetDocumentText(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	id, err := strconv.Atoi(c.Params("id"))
	if err != nil {
		return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
	}

	var text *string
	err = pool.QueryRow(ctx, "SELECT full_text FROM documents WHERE id = $1", id).Scan(&text)
	if err != nil {
		return c.Status(404).JSON(fiber.Map{"error": "document not found"})
	}

	return c.JSON(fiber.Map{
		"id":   id,
		"text": text,
	})
}

// GetDocumentEntities returns entities mentioned in a document
func GetDocumentEntities(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	id, err := strconv.Atoi(c.Params("id"))
	if err != nil {
		return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
	}

	rows, err := pool.Query(ctx, `
		SELECT e.id, e.canonical_name, e.entity_type, e.layer, de.mention_count
		FROM entities e
		JOIN document_entities de ON e.id = de.entity_id
		WHERE de.document_id = $1
		ORDER BY de.mention_count DESC
	`, id)
	if err != nil {
		return c.Status(500).JSON(fiber.Map{"error": err.Error()})
	}
	defer rows.Close()

	var entities []fiber.Map
	for rows.Next() {
		var entityID int
		var name, etype string
		var layer *int
		var mentions int

		if err := rows.Scan(&entityID, &name, &etype, &layer, &mentions); err != nil {
			continue
		}

		entities = append(entities, fiber.Map{
			"id":            entityID,
			"canonicalName": name,
			"entityType":    etype,
			"layer":         layer,
			"mentionCount":  mentions,
		})
	}

	return c.JSON(fiber.Map{
		"entities": entities,
		"count":    len(entities),
	})
}

// FullTextSearch searches document text
func FullTextSearch(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	query := c.Query("q", "")
	if query == "" {
		return c.Status(400).JSON(fiber.Map{"error": "query required"})
	}

	limitStr := c.Query("limit", "20")
	limit, _ := strconv.Atoi(limitStr)
	if limit > 100 {
		limit = 100
	}

	rows, err := pool.Query(ctx, `
		SELECT id, doc_id, document_type, summary,
		       ts_rank(to_tsvector('english', full_text), plainto_tsquery('english', $1)) AS rank,
		       ts_headline('english', full_text, plainto_tsquery('english', $1),
		           'MaxWords=50, MinWords=20, StartSel=<mark>, StopSel=</mark>') AS snippet
		FROM documents
		WHERE to_tsvector('english', full_text) @@ plainto_tsquery('english', $1)
		ORDER BY rank DESC
		LIMIT $2
	`, query, limit)
	if err != nil {
		return c.Status(500).JSON(fiber.Map{"error": err.Error()})
	}
	defer rows.Close()

	var results []fiber.Map
	for rows.Next() {
		var id int
		var docID string
		var docType, summary, snippet *string
		var rank float64

		if err := rows.Scan(&id, &docID, &docType, &summary, &rank, &snippet); err != nil {
			continue
		}

		results = append(results, fiber.Map{
			"id":           id,
			"docId":        docID,
			"documentType": docType,
			"summary":      summary,
			"rank":         rank,
			"snippet":      snippet,
		})
	}

	return c.JSON(fiber.Map{
		"results": results,
		"count":   len(results),
		"query":   query,
	})
}
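Every handler in this file repeats the same limit-parsing pattern: parse the query string, ignore the error, and cap the value. A small helper could centralize it and also handle the malformed-input case, which the inline version silently turns into a limit of 0. A sketch (the `clampLimit` name is hypothetical, not part of the repository):

```go
package main

import (
	"fmt"
	"strconv"
)

// clampLimit parses a limit query value, falling back to def when the
// value is missing, malformed, or non-positive, and capping at max.
func clampLimit(raw string, def, max int) int {
	n, err := strconv.Atoi(raw)
	if err != nil || n <= 0 {
		return def
	}
	if n > max {
		return max
	}
	return n
}

func main() {
	fmt.Println(clampLimit("50", 20, 200))  // → 50 (valid value passes through)
	fmt.Println(clampLimit("999", 20, 200)) // → 200 (capped at max)
	fmt.Println(clampLimit("abc", 20, 200)) // → 20 (malformed falls back to default)
}
```

Each handler's `limit, _ := strconv.Atoi(limitStr)` block could then shrink to `limit := clampLimit(c.Query("limit", ""), 50, 200)`.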
250
api/internal/handlers/entities.go
Normal file
@@ -0,0 +1,250 @@
package handlers

import (
	"context"
	"strconv"

	"github.com/gofiber/fiber/v2"
	"github.com/subculture-collective/epstein-db/api/internal/db"
)

// GetStats returns database statistics
func GetStats(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	var stats struct {
		Documents  int64 `json:"documents"`
		Entities   int64 `json:"entities"`
		Triples    int64 `json:"triples"`
		PPPLoans   int64 `json:"pppLoans"`
		FECRecords int64 `json:"fecRecords"`
		Grants     int64 `json:"grants"`
		Patterns   int64 `json:"patterns"`
	}

	pool.QueryRow(ctx, "SELECT COUNT(*) FROM documents").Scan(&stats.Documents)
	pool.QueryRow(ctx, "SELECT COUNT(*) FROM entities").Scan(&stats.Entities)
	pool.QueryRow(ctx, "SELECT COUNT(*) FROM triples").Scan(&stats.Triples)
	pool.QueryRow(ctx, "SELECT COUNT(*) FROM ppp_loans").Scan(&stats.PPPLoans)
	pool.QueryRow(ctx, "SELECT COUNT(*) FROM fec_contributions").Scan(&stats.FECRecords)
	pool.QueryRow(ctx, "SELECT COUNT(*) FROM federal_grants").Scan(&stats.Grants)
	pool.QueryRow(ctx, "SELECT COUNT(*) FROM pattern_findings").Scan(&stats.Patterns)

	return c.JSON(stats)
}

// SearchEntities searches for entities by name
func SearchEntities(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	query := c.Query("q", "")
	limitStr := c.Query("limit", "20")
	limit, _ := strconv.Atoi(limitStr)
	if limit > 100 {
		limit = 100
	}

	entityType := c.Query("type", "")
	layer := c.Query("layer", "")

	sqlQuery := `
		SELECT id, canonical_name, entity_type, layer, document_count, connection_count
		FROM entities
		WHERE ($1 = '' OR canonical_name ILIKE '%' || $1 || '%' OR canonical_name % $1)
		  AND ($2 = '' OR entity_type = $2::entity_type)
		  AND ($3 = '' OR layer = $3::int)
		ORDER BY
			CASE WHEN $1 != '' THEN similarity(canonical_name, $1) ELSE 0 END DESC,
			document_count DESC
		LIMIT $4
	`

	rows, err := pool.Query(ctx, sqlQuery, query, entityType, layer, limit)
	if err != nil {
		return c.Status(500).JSON(fiber.Map{"error": err.Error()})
	}
	defer rows.Close()

	var entities []fiber.Map
	for rows.Next() {
		var id int
		var name, etype string
		var layerVal, docCount, connCount *int

		if err := rows.Scan(&id, &name, &etype, &layerVal, &docCount, &connCount); err != nil {
			continue
		}

		entities = append(entities, fiber.Map{
			"id":              id,
			"canonicalName":   name,
			"entityType":      etype,
			"layer":           layerVal,
			"documentCount":   docCount,
			"connectionCount": connCount,
		})
	}

	return c.JSON(fiber.Map{
		"entities": entities,
		"count":    len(entities),
	})
}

// GetEntity returns a single entity by ID
func GetEntity(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	id, err := strconv.Atoi(c.Params("id"))
	if err != nil {
		return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
	}

	var entity struct {
		ID              int      `json:"id"`
		CanonicalName   string   `json:"canonicalName"`
		EntityType      string   `json:"entityType"`
		Layer           *int     `json:"layer"`
		Description     *string  `json:"description"`
		DocumentCount   *int     `json:"documentCount"`
		ConnectionCount *int     `json:"connectionCount"`
		Aliases         []byte   `json:"aliases"`
		PPPMatches      []byte   `json:"pppMatches"`
		FECMatches      []byte   `json:"fecMatches"`
		GrantsMatches   []byte   `json:"grantsMatches"`
	}

	err = pool.QueryRow(ctx, `
		SELECT id, canonical_name, entity_type, layer, description,
		       document_count, connection_count, aliases,
		       ppp_matches, fec_matches, grants_matches
		FROM entities WHERE id = $1
	`, id).Scan(
		&entity.ID, &entity.CanonicalName, &entity.EntityType,
		&entity.Layer, &entity.Description, &entity.DocumentCount,
		&entity.ConnectionCount, &entity.Aliases,
		&entity.PPPMatches, &entity.FECMatches, &entity.GrantsMatches,
	)

	if err != nil {
		return c.Status(404).JSON(fiber.Map{"error": "entity not found"})
	}

	return c.JSON(entity)
}

// GetEntityConnections returns entities connected to a given entity
func GetEntityConnections(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	id, err := strconv.Atoi(c.Params("id"))
	if err != nil {
		return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
	}

	limitStr := c.Query("limit", "50")
	limit, _ := strconv.Atoi(limitStr)
	if limit > 200 {
		limit = 200
	}

	rows, err := pool.Query(ctx, `
		SELECT
			e2.id, e2.canonical_name, e2.entity_type, e2.layer,
			COUNT(DISTINCT d.id) AS shared_docs
		FROM document_entities de1
		JOIN document_entities de2 ON de1.document_id = de2.document_id AND de1.entity_id != de2.entity_id
		JOIN entities e2 ON de2.entity_id = e2.id
		JOIN documents d ON de1.document_id = d.id
		WHERE de1.entity_id = $1
		GROUP BY e2.id, e2.canonical_name, e2.entity_type, e2.layer
		ORDER BY shared_docs DESC
		LIMIT $2
	`, id, limit)
	if err != nil {
		return c.Status(500).JSON(fiber.Map{"error": err.Error()})
	}
	defer rows.Close()

	var connections []fiber.Map
	for rows.Next() {
		var connID int
		var name, etype string
		var layerVal *int
		var sharedDocs int

		if err := rows.Scan(&connID, &name, &etype, &layerVal, &sharedDocs); err != nil {
			continue
		}

		connections = append(connections, fiber.Map{
			"id":            connID,
			"canonicalName": name,
			"entityType":    etype,
			"layer":         layerVal,
			"sharedDocs":    sharedDocs,
		})
	}

	return c.JSON(fiber.Map{
		"connections": connections,
		"count":       len(connections),
	})
}

// GetEntityDocuments returns documents mentioning an entity
func GetEntityDocuments(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	id, err := strconv.Atoi(c.Params("id"))
	if err != nil {
		return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
	}

	limitStr := c.Query("limit", "50")
	limit, _ := strconv.Atoi(limitStr)

	rows, err := pool.Query(ctx, `
		SELECT d.id, d.doc_id, d.document_type, d.summary, d.date_earliest, d.date_latest
		FROM documents d
		JOIN document_entities de ON d.id = de.document_id
		WHERE de.entity_id = $1
		ORDER BY d.date_earliest DESC NULLS LAST
		LIMIT $2
	`, id, limit)
	if err != nil {
		return c.Status(500).JSON(fiber.Map{"error": err.Error()})
	}
	defer rows.Close()

	var documents []fiber.Map
	for rows.Next() {
		var docID int
		var docIdStr string
		var docType, summary *string
		var dateEarliest, dateLatest *string

		if err := rows.Scan(&docID, &docIdStr, &docType, &summary, &dateEarliest, &dateLatest); err != nil {
			continue
		}

		documents = append(documents, fiber.Map{
			"id":           docID,
			"docId":        docIdStr,
			"documentType": docType,
			"summary":      summary,
			"dateEarliest": dateEarliest,
			"dateLatest":   dateLatest,
		})
	}

	return c.JSON(fiber.Map{
		"documents": documents,
		"count":     len(documents),
	})
}
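GetStats runs one `COUNT(*)` per table with the table name repeated in each call. A table-driven loop keeps the list in one place, which makes it harder to forget a table when the schema grows. A stdlib-only sketch of building those queries (the `statsQueries` name is hypothetical, not part of the repository):

```go
package main

import "fmt"

// statsQueries builds the COUNT(*) statement for each stats table,
// mirroring the seven QueryRow calls in GetStats.
func statsQueries() map[string]string {
	tables := []string{
		"documents", "entities", "triples", "ppp_loans",
		"fec_contributions", "federal_grants", "pattern_findings",
	}
	queries := make(map[string]string, len(tables))
	for _, t := range tables {
		// Table names come from this fixed list, never user input,
		// so string concatenation is safe here.
		queries[t] = "SELECT COUNT(*) FROM " + t
	}
	return queries
}

func main() {
	fmt.Println(statsQueries()["documents"]) // → SELECT COUNT(*) FROM documents
}
```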
282
api/internal/handlers/network.go
Normal file
@@ -0,0 +1,282 @@
package handlers

import (
	"context"
	"strconv"

	"github.com/gofiber/fiber/v2"
	"github.com/subculture-collective/epstein-db/api/internal/db"
)

// GetNetwork returns the relationship network for visualization
func GetNetwork(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	limitStr := c.Query("limit", "1000")
	limit, _ := strconv.Atoi(limitStr)
	if limit > 10000 {
		limit = 10000
	}

	minConnections := c.Query("minConnections", "2")
	minConn, _ := strconv.Atoi(minConnections)

	// Get nodes (entities with sufficient connections)
	nodeRows, err := pool.Query(ctx, `
		SELECT id, canonical_name, entity_type, layer, document_count, connection_count
		FROM entities
		WHERE entity_type IN ('person', 'organization')
		  AND connection_count >= $1
		ORDER BY connection_count DESC
		LIMIT $2
	`, minConn, limit)
	if err != nil {
		return c.Status(500).JSON(fiber.Map{"error": err.Error()})
	}
	defer nodeRows.Close()

	var nodes []fiber.Map
	nodeIDs := make(map[int]bool)

	for nodeRows.Next() {
		var id int
		var name, etype string
		var layer, docCount, connCount *int

		if err := nodeRows.Scan(&id, &name, &etype, &layer, &docCount, &connCount); err != nil {
			continue
		}

		nodeIDs[id] = true
		nodes = append(nodes, fiber.Map{
			"id":              id,
			"canonicalName":   name,
			"entityType":      etype,
			"layer":           layer,
			"documentCount":   docCount,
			"connectionCount": connCount,
		})
	}

	// Get edges (co-occurrence relationships)
	edgeRows, err := pool.Query(ctx, `
		SELECT
			de1.entity_id AS source,
			de2.entity_id AS target,
			COUNT(DISTINCT de1.document_id) AS weight
		FROM document_entities de1
		JOIN document_entities de2 ON de1.document_id = de2.document_id
			AND de1.entity_id < de2.entity_id
		JOIN entities e1 ON de1.entity_id = e1.id
		JOIN entities e2 ON de2.entity_id = e2.id
		WHERE e1.entity_type IN ('person', 'organization')
		  AND e2.entity_type IN ('person', 'organization')
		  AND e1.connection_count >= $1
		  AND e2.connection_count >= $1
		GROUP BY de1.entity_id, de2.entity_id
		HAVING COUNT(DISTINCT de1.document_id) >= 2
		ORDER BY weight DESC
		LIMIT $2
	`, minConn, limit*3)
	if err != nil {
		return c.Status(500).JSON(fiber.Map{"error": err.Error()})
	}
	defer edgeRows.Close()

	var edges []fiber.Map
	for edgeRows.Next() {
		var source, target, weight int
		if err := edgeRows.Scan(&source, &target, &weight); err != nil {
			continue
		}

		// Only include edges where both nodes are in our node set
		if nodeIDs[source] && nodeIDs[target] {
			edges = append(edges, fiber.Map{
				"source": source,
				"target": target,
				"weight": weight,
			})
		}
	}

	return c.JSON(fiber.Map{
		"nodes": nodes,
		"edges": edges,
		"stats": fiber.Map{
			"nodeCount": len(nodes),
			"edgeCount": len(edges),
		},
	})
}

// GetNetworkByLayer returns entities organized by layer
func GetNetworkByLayer(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	var layers []fiber.Map

	for layer := 0; layer <= 3; layer++ {
		rows, err := pool.Query(ctx, `
			SELECT id, canonical_name, entity_type, document_count, connection_count
			FROM entities
			WHERE layer = $1 AND entity_type IN ('person', 'organization')
			ORDER BY connection_count DESC
			LIMIT 100
		`, layer)
		if err != nil {
			continue
		}

		var entities []fiber.Map
		for rows.Next() {
			var id int
			var name, etype string
			var docCount, connCount *int

			if err := rows.Scan(&id, &name, &etype, &docCount, &connCount); err != nil {
				continue
			}

			entities = append(entities, fiber.Map{
				"id":              id,
				"canonicalName":   name,
				"entityType":      etype,
				"documentCount":   docCount,
				"connectionCount": connCount,
			})
		}
		rows.Close()

		layers = append(layers, fiber.Map{
			"layer":    layer,
			"entities": entities,
			"count":    len(entities),
		})
	}

	return c.JSON(fiber.Map{
		"layers": layers,
	})
}

// ListPatterns returns discovered patterns
func ListPatterns(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	status := c.Query("status", "")
	patternType := c.Query("type", "")

	rows, err := pool.Query(ctx, `
		SELECT id, title, description, pattern_type, confidence, status, discovered_at
		FROM pattern_findings
		WHERE ($1 = '' OR status = $1)
		  AND ($2 = '' OR pattern_type = $2)
		ORDER BY discovered_at DESC
		LIMIT 100
	`, status, patternType)
	if err != nil {
		return c.Status(500).JSON(fiber.Map{"error": err.Error()})
	}
	defer rows.Close()

	var patterns []fiber.Map
	for rows.Next() {
		var id int
		var title, description, ptype, status string
		var confidence *float64
		var discoveredAt string

		if err := rows.Scan(&id, &title, &description, &ptype, &confidence, &status, &discoveredAt); err != nil {
			continue
		}

		patterns = append(patterns, fiber.Map{
			"id":           id,
			"title":        title,
			"description":  description,
			"patternType":  ptype,
			"confidence":   confidence,
			"status":       status,
			"discoveredAt": discoveredAt,
		})
	}

	return c.JSON(fiber.Map{
		"patterns": patterns,
		"count":    len(patterns),
	})
}

// GetPattern returns a single pattern with full details
func GetPattern(c *fiber.Ctx) error {
	ctx := context.Background()
	pool := db.Pool()

	id, err := strconv.Atoi(c.Params("id"))
	if err != nil {
		return c.Status(400).JSON(fiber.Map{"error": "invalid id"})
	}

	var pattern struct {
		ID           int      `json:"id"`
		Title        string   `json:"title"`
		Description  string   `json:"description"`
		PatternType  string   `json:"patternType"`
		EntityIDs    []int    `json:"entityIds"`
		Evidence     []byte   `json:"evidence"`
		Confidence   *float64 `json:"confidence"`
		Status       string   `json:"status"`
		Notes        *string  `json:"notes"`
		DiscoveredAt string   `json:"discoveredAt"`
		DiscoveredBy string   `json:"discoveredBy"`
	}

	err = pool.QueryRow(ctx, `
		SELECT id, title, description, pattern_type, entity_ids, evidence,
		       confidence, status, notes, discovered_at, discovered_by
		FROM pattern_findings WHERE id = $1
	`, id).Scan(
		&pattern.ID, &pattern.Title, &pattern.Description, &pattern.PatternType,
		&pattern.EntityIDs, &pattern.Evidence, &pattern.Confidence,
		&pattern.Status, &pattern.Notes, &pattern.DiscoveredAt, &pattern.DiscoveredBy,
	)

	if err != nil {
		return c.Status(404).JSON(fiber.Map{"error": "pattern not found"})
	}

	// Get entity details
	entityRows, err := pool.Query(ctx, `
		SELECT id, canonical_name, entity_type, layer
		FROM entities WHERE id = ANY($1)
	`, pattern.EntityIDs)
	if err == nil {
		var entities []fiber.Map
		for entityRows.Next() {
			var eid int
			var name, etype string
			var layer *int
			if err := entityRows.Scan(&eid, &name, &etype, &layer); err != nil {
				continue
			}
			entities = append(entities, fiber.Map{
				"id":            eid,
				"canonicalName": name,
				"entityType":    etype,
				"layer":         layer,
			})
		}
		entityRows.Close()

		return c.JSON(fiber.Map{
			"pattern":  pattern,
			"entities": entities,
		})
	}

	return c.JSON(pattern)
}
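GetNetwork's edge query dedupes undirected pairs with the `de1.entity_id < de2.entity_id` join condition, so each co-occurring pair is counted once regardless of order. The same idea in plain Go, as an in-memory sketch of what that query computes (the `canonical` and `coOccurrenceWeights` names are hypothetical, not part of the repository):

```go
package main

import "fmt"

type edge struct{ source, target int }

// canonical orders an undirected pair the same way the SQL join does
// with de1.entity_id < de2.entity_id, so each pair has one key.
func canonical(a, b int) edge {
	if a < b {
		return edge{a, b}
	}
	return edge{b, a}
}

// coOccurrenceWeights tallies shared-document counts per entity pair,
// mirroring the GROUP BY / COUNT(DISTINCT document_id) in GetNetwork.
func coOccurrenceWeights(docs map[int][]int) map[edge]int {
	weights := make(map[edge]int)
	for _, entities := range docs {
		for i := 0; i < len(entities); i++ {
			for j := i + 1; j < len(entities); j++ {
				weights[canonical(entities[i], entities[j])]++
			}
		}
	}
	return weights
}

func main() {
	// Doc 1 and 2 both mention entities 10 and 20 (order differs);
	// doc 3 mentions 10 and 30.
	docs := map[int][]int{1: {10, 20}, 2: {20, 10}, 3: {10, 30}}
	w := coOccurrenceWeights(docs)
	fmt.Println(w[edge{10, 20}], w[edge{10, 30}]) // → 2 1
}
```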
64
docker-compose.yml
Normal file
@@ -0,0 +1,64 @@
services:
  postgres:
    image: postgres:16-alpine
    container_name: epstein-db-postgres
    restart: unless-stopped
    environment:
      POSTGRES_USER: epstein
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-epstein_dev}
      POSTGRES_DB: epstein
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./schema/postgres:/docker-entrypoint-initdb.d:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U epstein -d epstein"]
      interval: 10s
      timeout: 5s
      retries: 5

  neo4j:
    image: neo4j:5-community
    container_name: epstein-db-neo4j
    restart: unless-stopped
    environment:
      NEO4J_AUTH: neo4j/${NEO4J_PASSWORD:-neo4j_dev}
      NEO4J_PLUGINS: '["apoc"]'
      NEO4J_dbms_memory_heap_initial__size: 512m
      NEO4J_dbms_memory_heap_max__size: 2G
    ports:
      - "7474:7474" # HTTP
      - "7687:7687" # Bolt
    volumes:
      - neo4j_data:/data
      - neo4j_logs:/logs
    healthcheck:
      test: ["CMD", "neo4j", "status"]
      interval: 10s
      timeout: 10s
      retries: 5

  typesense:
    image: typesense/typesense:27.1
    container_name: epstein-db-typesense
    restart: unless-stopped
    environment:
      TYPESENSE_DATA_DIR: /data
      TYPESENSE_API_KEY: ${TYPESENSE_API_KEY:-typesense_dev}
      TYPESENSE_ENABLE_CORS: "true"
    ports:
      - "8108:8108"
    volumes:
      - typesense_data:/data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8108/health"]
      interval: 10s
      timeout: 5s
      retries: 5

volumes:
  postgres_data:
  neo4j_data:
  neo4j_logs:
  typesense_data:
39
extraction/package.json
Normal file
@@ -0,0 +1,39 @@
{
  "name": "@epstein-db/extraction",
  "version": "1.0.0",
  "description": "Entity extraction pipeline for Epstein Files Database",
  "type": "module",
  "scripts": {
    "build": "tsc",
    "dev": "tsx watch src/index.ts",
    "extract:documents": "tsx src/scripts/extract-documents.ts",
    "extract:entities": "tsx src/scripts/extract-entities.ts",
    "deduplicate": "tsx src/scripts/deduplicate.ts",
    "load:crossref": "tsx src/scripts/load-crossref.ts",
    "match:crossref": "tsx src/scripts/match-crossref.ts",
    "calculate:layers": "tsx src/scripts/calculate-layers.ts",
    "sync:neo4j": "tsx src/scripts/sync-neo4j.ts",
    "pipeline": "npm run extract:documents && npm run extract:entities && npm run deduplicate && npm run calculate:layers && npm run sync:neo4j",
    "typecheck": "tsc --noEmit"
  },
  "dependencies": {
    "@anthropic-ai/sdk": "^0.24.0",
    "@neondatabase/serverless": "^0.9.0",
    "better-sqlite3": "^11.0.0",
    "dotenv": "^16.4.5",
    "drizzle-orm": "^0.30.0",
    "neo4j-driver": "^5.19.0",
    "openai": "^4.47.0",
    "p-limit": "^5.0.0",
    "pg": "^8.11.5",
    "typesense": "^1.8.2",
    "zod": "^3.23.0"
  },
  "devDependencies": {
    "@types/better-sqlite3": "^7.6.10",
    "@types/node": "^20.12.0",
    "@types/pg": "^8.11.5",
    "tsx": "^4.9.0",
    "typescript": "^5.4.0"
  }
}
33
extraction/src/config.ts
Normal file
@@ -0,0 +1,33 @@
import { z } from 'zod';
import dotenv from 'dotenv';

dotenv.config();

const configSchema = z.object({
  // Database
  DATABASE_URL: z.string().default('postgresql://epstein:epstein_dev@localhost:5432/epstein'),
  NEO4J_URI: z.string().default('bolt://localhost:7687'),
  NEO4J_USER: z.string().default('neo4j'),
  NEO4J_PASSWORD: z.string().default('neo4j_dev'),
  TYPESENSE_HOST: z.string().default('localhost'),
  TYPESENSE_PORT: z.coerce.number().default(8108),
  TYPESENSE_API_KEY: z.string().default('typesense_dev'),

  // LLM
  OPENAI_API_KEY: z.string().optional(),
  OPENAI_BASE_URL: z.string().optional(),
  ANTHROPIC_API_KEY: z.string().optional(),
  LLM_MODEL: z.string().default('claude-sonnet-4-20250514'),

  // Extraction
  DATA_DIR: z.string().default('../DataSources'),
  BATCH_SIZE: z.coerce.number().default(10),
  MAX_WORKERS: z.coerce.number().default(5),

  // Rate limiting
  REQUESTS_PER_MINUTE: z.coerce.number().default(50),
});

export type Config = z.infer<typeof configSchema>;

export const config = configSchema.parse(process.env);
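Everything in `process.env` is a string, so the `z.coerce.number()` fields above are doing real work: they convert numeric strings and fall back to the default when the variable is unset. A dependency-free sketch of that coerce-with-default behavior (illustrative helper only, not project code):

```typescript
// Mirrors the shape of z.coerce.number().default(fallback):
// unset -> fallback, numeric string -> number, anything else -> error.
function coerceNumber(raw: string | undefined, fallback: number): number {
  if (raw === undefined) return fallback;
  const n = Number(raw);
  if (Number.isNaN(n)) throw new Error(`expected a number, got "${raw}"`);
  return n;
}

// e.g. TYPESENSE_PORT unset -> 8108; TYPESENSE_PORT=9000 -> 9000
const typesensePort = coerceNumber(process.env.TYPESENSE_PORT, 8108);
```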
248
extraction/src/db.ts
Normal file
@@ -0,0 +1,248 @@
import pg from 'pg';
import { config } from './config.js';

const { Pool } = pg;

export const pool = new Pool({
  connectionString: config.DATABASE_URL,
});

// Helper for transactions
export async function withTransaction<T>(
  fn: (client: pg.PoolClient) => Promise<T>
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const result = await fn(client);
    await client.query('COMMIT');
    return result;
  } catch (error) {
    await client.query('ROLLBACK');
    throw error;
  } finally {
    client.release();
  }
}

// Document operations
export async function insertDocument(doc: {
  docId: string;
  datasetId: number;
  filePath?: string;
  fullText?: string;
  pageCount?: number;
}): Promise<number> {
  const result = await pool.query(
    `INSERT INTO documents (doc_id, dataset_id, file_path, full_text, page_count)
     VALUES ($1, $2, $3, $4, $5)
     ON CONFLICT (doc_id) DO UPDATE SET
       full_text = COALESCE(EXCLUDED.full_text, documents.full_text),
       updated_at = NOW()
     RETURNING id`,
    [doc.docId, doc.datasetId, doc.filePath, doc.fullText, doc.pageCount]
  );
  return result.rows[0].id;
}

export async function updateDocumentAnalysis(
  docId: string,
  analysis: {
    summary: string;
    detailedSummary: string;
    documentType: string;
    dateEarliest?: Date;
    dateLatest?: Date;
    contentTags: string[];
  }
): Promise<void> {
  await pool.query(
    `UPDATE documents SET
       summary = $2,
       detailed_summary = $3,
       document_type = $4,
       date_earliest = $5,
       date_latest = $6,
       content_tags = $7,
       analysis_status = 'complete',
       analyzed_at = NOW(),
       updated_at = NOW()
     WHERE doc_id = $1`,
    [
      docId,
      analysis.summary,
      analysis.detailedSummary,
      analysis.documentType,
      analysis.dateEarliest,
      analysis.dateLatest,
      JSON.stringify(analysis.contentTags),
    ]
  );
}

export async function getDocumentsPendingAnalysis(
  limit: number = 100
): Promise<Array<{ id: number; docId: string; fullText: string }>> {
  const result = await pool.query(
    `SELECT id, doc_id, full_text FROM documents
     WHERE analysis_status = 'pending' AND full_text IS NOT NULL
     LIMIT $1`,
    [limit]
  );
  return result.rows.map((row) => ({
    id: row.id,
    docId: row.doc_id,
    fullText: row.full_text,
  }));
}

// Entity operations
export async function upsertEntity(entity: {
  canonicalName: string;
  entityType: string;
  aliases?: string[];
  description?: string;
}): Promise<number> {
  const result = await pool.query(
    `INSERT INTO entities (canonical_name, entity_type, aliases, description)
     VALUES ($1, $2::entity_type, $3, $4)
     ON CONFLICT (canonical_name, entity_type) DO UPDATE SET
       aliases = COALESCE(
         entities.aliases || EXCLUDED.aliases,
         entities.aliases,
         EXCLUDED.aliases
       ),
       updated_at = NOW()
     RETURNING id`,
    [
      entity.canonicalName,
      entity.entityType,
      JSON.stringify(entity.aliases || []),
      entity.description,
    ]
  );
  return result.rows[0].id;
}

export async function linkEntityToDocument(
  entityId: number,
  documentId: number,
  mentionCount: number = 1,
  contextSnippet?: string
): Promise<void> {
  await pool.query(
    `INSERT INTO document_entities (document_id, entity_id, mention_count, context_snippet)
     VALUES ($1, $2, $3, $4)
     ON CONFLICT (document_id, entity_id) DO UPDATE SET
       mention_count = document_entities.mention_count + EXCLUDED.mention_count`,
    [documentId, entityId, mentionCount, contextSnippet]
  );
}

export async function insertTriple(triple: {
  documentId: number;
  subjectId: number;
  predicate: string;
  objectId: number;
  locationId?: number;
  timestamp?: Date;
  explicitTopic?: string;
  implicitTopic?: string;
  tags?: string[];
  sequenceOrder: number;
}): Promise<number> {
  const result = await pool.query(
    `INSERT INTO triples
       (document_id, subject_id, predicate, object_id, location_id, timestamp, explicit_topic, implicit_topic, tags, sequence_order)
     VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10)
     RETURNING id`,
    [
      triple.documentId,
      triple.subjectId,
      triple.predicate,
      triple.objectId,
      triple.locationId,
      triple.timestamp,
      triple.explicitTopic,
      triple.implicitTopic,
      JSON.stringify(triple.tags || []),
      triple.sequenceOrder,
    ]
  );
  return result.rows[0].id;
}

// Layer calculation
export async function calculateEntityLayers(): Promise<void> {
  // Set Layer 1: entities that share documents with Epstein
  await pool.query(`
    WITH epstein AS (
      SELECT id FROM entities WHERE canonical_name = 'Jeffrey Epstein' AND entity_type = 'person'
    ),
    epstein_docs AS (
      SELECT DISTINCT document_id FROM document_entities WHERE entity_id = (SELECT id FROM epstein)
    ),
    layer1_entities AS (
      SELECT DISTINCT entity_id FROM document_entities
      WHERE document_id IN (SELECT document_id FROM epstein_docs)
        AND entity_id != (SELECT id FROM epstein)
    )
    UPDATE entities SET layer = 1, updated_at = NOW()
    WHERE id IN (SELECT entity_id FROM layer1_entities) AND layer IS NULL
  `);

  // Set Layer 2: entities that share documents with Layer 1 (but not with Epstein directly)
  await pool.query(`
    WITH layer1 AS (
      SELECT id FROM entities WHERE layer = 1
    ),
    layer1_docs AS (
      SELECT DISTINCT document_id FROM document_entities WHERE entity_id IN (SELECT id FROM layer1)
    ),
    layer2_candidates AS (
      SELECT DISTINCT entity_id FROM document_entities
      WHERE document_id IN (SELECT document_id FROM layer1_docs)
    )
    UPDATE entities SET layer = 2, updated_at = NOW()
    WHERE id IN (SELECT entity_id FROM layer2_candidates) AND layer IS NULL
  `);

  // Set Layer 3: remaining entities
  await pool.query(`
    UPDATE entities SET layer = 3, updated_at = NOW() WHERE layer IS NULL
  `);
}

// Search
export async function searchEntities(
  query: string,
  limit: number = 20
): Promise<
  Array<{
    id: number;
    canonicalName: string;
    entityType: string;
    layer: number;
    documentCount: number;
  }>
> {
  const result = await pool.query(
    `SELECT id, canonical_name, entity_type, layer, document_count
     FROM entities
     WHERE canonical_name ILIKE $1 OR canonical_name % $2
     ORDER BY similarity(canonical_name, $2) DESC, document_count DESC
     LIMIT $3`,
    [`%${query}%`, query, limit]
  );
  return result.rows.map((row) => ({
    id: row.id,
    canonicalName: row.canonical_name,
    entityType: row.entity_type,
    layer: row.layer,
    documentCount: row.document_count,
  }));
}

export async function close(): Promise<void> {
  await pool.end();
}
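`withTransaction` above wraps BEGIN / COMMIT / ROLLBACK around a callback, releasing the client either way. The same control flow can be sketched without a live Postgres connection (hypothetical `Query` type, for illustration only):

```typescript
type Query = (sql: string) => Promise<void>;

// Same shape as withTransaction: COMMIT when the callback resolves,
// ROLLBACK (and rethrow) when it throws.
async function withTx<T>(query: Query, fn: () => Promise<T>): Promise<T> {
  await query('BEGIN');
  try {
    const result = await fn();
    await query('COMMIT');
    return result;
  } catch (error) {
    await query('ROLLBACK');
    throw error;
  }
}
```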
208
extraction/src/ner/extractor.ts
Normal file
@@ -0,0 +1,208 @@
import { z } from 'zod';
import Anthropic from '@anthropic-ai/sdk';
import { config } from '../config.js';

// Initialize Anthropic client
const anthropic = new Anthropic({
  apiKey: config.ANTHROPIC_API_KEY,
});

// ============================================================================
// SCHEMAS
// ============================================================================

export const EntitySchema = z.object({
  name: z.string(),
  type: z.enum(['person', 'organization', 'location', 'date', 'reference', 'financial']),
  context: z.string().optional(),
});

export const TripleSchema = z.object({
  subject: z.string(),
  subjectType: z.enum(['person', 'organization', 'location']),
  predicate: z.string(),
  object: z.string(),
  objectType: z.enum(['person', 'organization', 'location', 'date', 'reference', 'financial']),
  location: z.string().optional(),
  timestamp: z.string().optional(),
  explicitTopic: z.string().optional(),
  implicitTopic: z.string().optional(),
  tags: z.array(z.string()).optional(),
});

export const DocumentAnalysisSchema = z.object({
  summary: z.string(),
  detailedSummary: z.string(),
  documentType: z.string(),
  dateEarliest: z.string().nullable(),
  dateLatest: z.string().nullable(),
  contentTags: z.array(z.string()),
  entities: z.array(EntitySchema),
  triples: z.array(TripleSchema),
});

export type Entity = z.infer<typeof EntitySchema>;
export type Triple = z.infer<typeof TripleSchema>;
export type DocumentAnalysis = z.infer<typeof DocumentAnalysisSchema>;

// ============================================================================
// EXTRACTION PROMPTS
// ============================================================================

const EXTRACTION_SYSTEM_PROMPT = `You are an expert document analyst specializing in legal documents, financial records, and correspondence. Your task is to extract structured information from documents related to the Jeffrey Epstein case.

Extract the following:

1. **Entities**: All people, organizations, locations, dates, document references, and financial amounts mentioned.
2. **Relationships (Triples)**: Subject-Predicate-Object relationships between entities.
3. **Document Analysis**: Summary, type classification, date range, and content tags.

Be thorough but precise. If information is unclear or partially redacted, note what you can determine. Focus on factual extraction, not interpretation.

IMPORTANT:
- Normalize names where possible (e.g., "J. Epstein" → "Jeffrey Epstein" if context confirms)
- Include context snippets for important entities
- Extract temporal information when available
- Tag relationships with relevant categories (legal, financial, travel, social, etc.)`;

const EXTRACTION_USER_PROMPT = (text: string) => `Analyze this document and extract structured information.

<document>
${text}
</document>

Respond with a JSON object matching this schema:
{
  "summary": "One sentence summary of the document",
  "detailedSummary": "A paragraph explaining the document's content and significance",
  "documentType": "Type of document (e.g., deposition, email, financial record, flight log, etc.)",
  "dateEarliest": "YYYY-MM-DD or null if no dates",
  "dateLatest": "YYYY-MM-DD or null if no dates",
  "contentTags": ["tag1", "tag2", ...],
  "entities": [
    {"name": "Full Name", "type": "person|organization|location|date|reference|financial", "context": "brief context"}
  ],
  "triples": [
    {
      "subject": "Entity Name",
      "subjectType": "person|organization|location",
      "predicate": "action/relationship verb",
      "object": "Entity Name",
      "objectType": "person|organization|location|date|reference|financial",
      "location": "where (optional)",
      "timestamp": "YYYY-MM-DD (optional)",
      "explicitTopic": "stated subject matter (optional)",
      "implicitTopic": "inferred subject matter (optional)",
      "tags": ["legal", "financial", "travel", etc.]
    }
  ]
}

Return ONLY valid JSON, no markdown or explanation.`;

// ============================================================================
// EXTRACTION FUNCTION
// ============================================================================

export async function extractFromDocument(
  docId: string,
  text: string
): Promise<DocumentAnalysis> {
  // Truncate very long documents
  const maxChars = 100000;
  const truncatedText = text.length > maxChars
    ? text.slice(0, maxChars) + '\n\n[TRUNCATED - document continues...]'
    : text;

  const response = await anthropic.messages.create({
    model: config.LLM_MODEL,
    max_tokens: 8192,
    system: EXTRACTION_SYSTEM_PROMPT,
    messages: [
      {
        role: 'user',
        content: EXTRACTION_USER_PROMPT(truncatedText),
      },
    ],
  });

  // Extract text content
  const content = response.content[0];
  if (content.type !== 'text') {
    throw new Error(`Unexpected response type: ${content.type}`);
  }

  // Parse JSON
  let parsed: unknown;
  try {
    // Try to extract JSON from the response (sometimes wrapped in markdown)
    const jsonMatch = content.text.match(/\{[\s\S]*\}/);
    if (!jsonMatch) {
      throw new Error('No JSON found in response');
    }
    parsed = JSON.parse(jsonMatch[0]);
  } catch (error) {
    console.error(`Failed to parse JSON for ${docId}:`, content.text.slice(0, 500));
    throw new Error(`JSON parse error: ${error}`);
  }

  // Validate against schema
  const result = DocumentAnalysisSchema.parse(parsed);

  return result;
}

// ============================================================================
// DEDUPLICATION
// ============================================================================

const DEDUP_SYSTEM_PROMPT = `You are an expert at identifying when different name variations refer to the same entity. Given a list of entity names, group them by the actual entity they refer to.

Consider:
- Name variations (J. Smith, John Smith, John Q. Smith)
- Nicknames and aliases
- Organizational name variations (LLC vs Inc)
- Typos and OCR errors

Be conservative - only merge entities when you're confident they're the same.`;

const DEDUP_USER_PROMPT = (entities: string[]) => `Group these entity names by the actual entity they refer to. Return a JSON object where keys are canonical names and values are arrays of aliases.

Entities:
${entities.map((e) => `- ${e}`).join('\n')}

Return JSON like:
{
  "Jeffrey Epstein": ["J. Epstein", "Epstein", "Jeffrey E. Epstein"],
  "Ghislaine Maxwell": ["G. Maxwell", "Maxwell"]
}

Return ONLY valid JSON.`;

export async function deduplicateEntities(
  entities: string[]
): Promise<Record<string, string[]>> {
  const response = await anthropic.messages.create({
    model: config.LLM_MODEL,
    max_tokens: 4096,
    system: DEDUP_SYSTEM_PROMPT,
    messages: [
      {
        role: 'user',
        content: DEDUP_USER_PROMPT(entities),
      },
    ],
  });

  const content = response.content[0];
  if (content.type !== 'text') {
    throw new Error(`Unexpected response type: ${content.type}`);
  }

  const jsonMatch = content.text.match(/\{[\s\S]*\}/);
  if (!jsonMatch) {
    throw new Error('No JSON found in dedup response');
  }

  return JSON.parse(jsonMatch[0]);
}
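Both extraction paths salvage JSON from the model reply with `match(/\{[\s\S]*\}/)`, which tolerates prose or markdown fences around the object; the greedy match assumes one top-level object per reply. A standalone sketch of that salvage step (illustrative only, not project code):

```typescript
// Grab the outermost {...} span from a possibly wrapped model reply
// and parse it. Throws if no object-like span is present.
function extractJson(reply: string): unknown {
  const jsonMatch = reply.match(/\{[\s\S]*\}/);
  if (!jsonMatch) {
    throw new Error('No JSON found in response');
  }
  return JSON.parse(jsonMatch[0]);
}

// Leading/trailing prose around the object is ignored.
const parsed = extractJson('Here is the result:\n{"summary": "ok"}\nDone.');
```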
135
extraction/src/scripts/extract-documents.ts
Normal file
@@ -0,0 +1,135 @@
/**
 * Document Extraction Script
 *
 * Reads OCR text from the data sources and loads it into PostgreSQL.
 * This is the first step in the pipeline.
 */

import fs from 'fs';
import path from 'path';
import readline from 'readline';
import { config } from '../config.js';
import { insertDocument, close } from '../db.js';

// Path to the combined text file
const DATA_DIR = path.resolve(config.DATA_DIR);
const COMBINED_TEXT_PATH = path.join(DATA_DIR, 'combined-all-epstein-files/COMBINED_ALL_EPSTEIN_FILES_djvu.txt');

// Document ID pattern: EFTA00000001
const DOC_ID_PATTERN = /^EFTA\d{8}$/;

interface DocumentChunk {
  docId: string;
  lines: string[];
}

async function* readDocuments(): AsyncGenerator<DocumentChunk> {
  const fileStream = fs.createReadStream(COMBINED_TEXT_PATH);
  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity,
  });

  let currentDoc: DocumentChunk | null = null;

  for await (const line of rl) {
    const trimmed = line.trim();

    // Check if this is a new document ID
    if (DOC_ID_PATTERN.test(trimmed)) {
      // If we have a previous document, yield it
      if (currentDoc && currentDoc.lines.length > 0) {
        yield currentDoc;
      }

      // Start a new document
      currentDoc = {
        docId: trimmed,
        lines: [],
      };
    } else if (currentDoc) {
      // Add line to current document
      if (trimmed.length > 0) {
        currentDoc.lines.push(line);
      }
    }
  }

  // Yield the last document
  if (currentDoc && currentDoc.lines.length > 0) {
    yield currentDoc;
  }
}

function getDatasetId(docId: string): number {
  // Extract the numeric portion
  const num = parseInt(docId.replace('EFTA', ''), 10);

  // Map to dataset based on the metadata:
  // DataSet 1: EFTA00000001-00003158
  // DataSet 2: EFTA00003159-00003857
  // DataSet 3: EFTA00003858-00005586
  // DataSet 4: EFTA00005705-00008320
  // DataSet 5: EFTA00008409-00008528

  if (num <= 3158) return 1;
  if (num <= 3857) return 2;
  if (num <= 5586) return 3;
  if (num <= 8320) return 4;
  return 5;
}

async function main() {
  console.log('📄 Starting document extraction...');
  console.log(`Reading from: ${COMBINED_TEXT_PATH}`);

  // Check if file exists
  if (!fs.existsSync(COMBINED_TEXT_PATH)) {
    console.error(`❌ File not found: ${COMBINED_TEXT_PATH}`);
    console.error('Make sure the DataSources directory is properly set up.');
    process.exit(1);
  }

  let count = 0;
  let errors = 0;
  const seenDocs = new Set<string>();

  for await (const doc of readDocuments()) {
    // Skip duplicate doc IDs (the OCR sometimes repeats)
    if (seenDocs.has(doc.docId)) {
      continue;
    }
    seenDocs.add(doc.docId);

    try {
      const fullText = doc.lines.join('\n');
      const datasetId = getDatasetId(doc.docId);

      await insertDocument({
        docId: doc.docId,
        datasetId,
        fullText,
        pageCount: 1, // We'll update this later with actual page counts
      });

      count++;
      if (count % 100 === 0) {
        console.log(`  ✓ Processed ${count} documents...`);
      }
    } catch (error) {
      console.error(`❌ Error processing ${doc.docId}:`, error);
      errors++;
    }
  }

  console.log(`\n✅ Document extraction complete!`);
  console.log(`   Total documents: ${count}`);
  console.log(`   Errors: ${errors}`);

  await close();
}

main().catch((error) => {
  console.error('Fatal error:', error);
  process.exit(1);
});
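The generator above splits one large OCR dump on lines that are bare document IDs (`EFTA` plus eight digits), attaching everything until the next ID line to the current document. The same grouping logic over an in-memory array (illustrative sketch mirroring `readDocuments`, not project code):

```typescript
const DOC_ID = /^EFTA\d{8}$/;

// Group raw lines into { docId, lines } chunks: an ID line starts a new
// chunk, non-blank lines accrue to the current one, blank lines are dropped.
function splitDocuments(lines: string[]): Array<{ docId: string; lines: string[] }> {
  const docs: Array<{ docId: string; lines: string[] }> = [];
  let current: { docId: string; lines: string[] } | null = null;
  for (const line of lines) {
    const trimmed = line.trim();
    if (DOC_ID.test(trimmed)) {
      if (current && current.lines.length > 0) docs.push(current);
      current = { docId: trimmed, lines: [] };
    } else if (current && trimmed.length > 0) {
      current.lines.push(line);
    }
  }
  if (current && current.lines.length > 0) docs.push(current);
  return docs;
}
```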
198
extraction/src/scripts/extract-entities.ts
Normal file
@@ -0,0 +1,198 @@
/**
 * Entity Extraction Script
 *
 * Processes documents through the LLM to extract entities and relationships.
 * Uses concurrency limiting and batching for efficiency.
 */

import pLimit from 'p-limit';
import { config } from '../config.js';
import {
  getDocumentsPendingAnalysis,
  updateDocumentAnalysis,
  upsertEntity,
  linkEntityToDocument,
  insertTriple,
  pool,
  close,
} from '../db.js';
import { extractFromDocument, type Entity, type Triple } from '../ner/extractor.js';

// Concurrency limiter: at most MAX_WORKERS documents in flight at once
const limit = pLimit(config.MAX_WORKERS);

// Track progress
let processed = 0;
let errors = 0;
let totalEntities = 0;
let totalTriples = 0;

async function processDocument(doc: {
  id: number;
  docId: string;
  fullText: string;
}): Promise<void> {
  try {
    console.log(`  📝 Processing ${doc.docId}...`);

    // Mark as processing
    await pool.query(
      `UPDATE documents SET analysis_status = 'processing' WHERE id = $1`,
      [doc.id]
    );

    // Extract entities and relationships
    const analysis = await extractFromDocument(doc.docId, doc.fullText);

    // Parse dates
    const dateEarliest = analysis.dateEarliest
      ? new Date(analysis.dateEarliest)
      : undefined;
    const dateLatest = analysis.dateLatest
      ? new Date(analysis.dateLatest)
      : undefined;

    // Update document analysis
    await updateDocumentAnalysis(doc.docId, {
      summary: analysis.summary,
      detailedSummary: analysis.detailedSummary,
      documentType: analysis.documentType,
      dateEarliest,
      dateLatest,
      contentTags: analysis.contentTags,
    });

    // Insert entities and get their IDs
    const entityIdMap = new Map<string, number>();

    for (const entity of analysis.entities) {
      const entityId = await upsertEntity({
        canonicalName: entity.name,
        entityType: entity.type,
      });
      entityIdMap.set(entity.name.toLowerCase(), entityId);

      // Link entity to document
      await linkEntityToDocument(entityId, doc.id, 1, entity.context);
    }

    totalEntities += analysis.entities.length;

    // Insert triples
    for (let i = 0; i < analysis.triples.length; i++) {
      const triple = analysis.triples[i];

      // Get or create subject entity
      let subjectId = entityIdMap.get(triple.subject.toLowerCase());
      if (!subjectId) {
        subjectId = await upsertEntity({
          canonicalName: triple.subject,
          entityType: triple.subjectType,
        });
        entityIdMap.set(triple.subject.toLowerCase(), subjectId);
      }

      // Get or create object entity
      let objectId = entityIdMap.get(triple.object.toLowerCase());
      if (!objectId) {
        objectId = await upsertEntity({
          canonicalName: triple.object,
          entityType: triple.objectType,
        });
        entityIdMap.set(triple.object.toLowerCase(), objectId);
      }

      // Get location entity if present
      let locationId: number | undefined;
      if (triple.location) {
        locationId = entityIdMap.get(triple.location.toLowerCase());
        if (!locationId) {
          locationId = await upsertEntity({
            canonicalName: triple.location,
            entityType: 'location',
          });
          entityIdMap.set(triple.location.toLowerCase(), locationId);
        }
      }

      // Parse timestamp
      const timestamp = triple.timestamp ? new Date(triple.timestamp) : undefined;

      // Insert triple
      await insertTriple({
        documentId: doc.id,
        subjectId,
        predicate: triple.predicate,
        objectId,
        locationId,
        timestamp,
        explicitTopic: triple.explicitTopic,
        implicitTopic: triple.implicitTopic,
        tags: triple.tags,
        sequenceOrder: i,
      });
    }

    totalTriples += analysis.triples.length;
    processed++;

    console.log(
      `  ✓ ${doc.docId}: ${analysis.entities.length} entities, ${analysis.triples.length} triples`
    );
  } catch (error) {
    errors++;
    console.error(`  ❌ ${doc.docId}: ${error}`);

    // Mark as failed
    await pool.query(
      `UPDATE documents SET
         analysis_status = 'failed',
         error_message = $2,
         updated_at = NOW()
       WHERE id = $1`,
      [doc.id, String(error)]
    );
  }
}

async function main() {
  console.log('🔍 Starting entity extraction...');
  console.log(`   Model: ${config.LLM_MODEL}`);
  console.log(`   Workers: ${config.MAX_WORKERS}`);
  console.log(`   Batch size: ${config.BATCH_SIZE}\n`);

  let hasMore = true;

  while (hasMore) {
    // Get batch of pending documents
    const documents = await getDocumentsPendingAnalysis(config.BATCH_SIZE);

    if (documents.length === 0) {
      hasMore = false;
      break;
    }

    console.log(`\n📦 Processing batch of ${documents.length} documents...`);

    // Process in parallel with concurrency limiting
    await Promise.all(
      documents.map((doc) => limit(() => processDocument(doc)))
    );

    // Brief pause between batches
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }

  console.log(`\n✅ Entity extraction complete!`);
  console.log(`   Documents processed: ${processed}`);
  console.log(`   Entities extracted: ${totalEntities}`);
  console.log(`   Triples extracted: ${totalTriples}`);
  console.log(`   Errors: ${errors}`);

  await close();
}

main().catch((error) => {
  console.error('Fatal error:', error);
  process.exit(1);
});
236
extraction/src/scripts/match-crossref.ts
Normal file
@@ -0,0 +1,236 @@
/**
 * Cross-Reference Matching Script
 *
 * Matches extracted entities against PPP loans, FEC contributions, and federal grants.
 * Uses fuzzy matching with configurable thresholds.
 */

import { pool, close } from '../db.js';

// Similarity threshold for matches (0-1)
const MATCH_THRESHOLD = 0.7;

interface Match {
  entityId: number;
  entityName: string;
  source: 'ppp' | 'fec' | 'grants';
  sourceId: number;
  sourceName: string;
  score: number;
}

async function findPPPMatches(): Promise<Match[]> {
  console.log('🔍 Matching entities against PPP loans...');

  // The % operator is pg_trgm's index-assisted prefilter (driven by
  // pg_trgm.similarity_threshold, default 0.3); the explicit
  // similarity() >= $1 check then enforces the stricter MATCH_THRESHOLD.
  const result = await pool.query(`
    SELECT
      e.id AS entity_id,
      e.canonical_name AS entity_name,
      p.id AS source_id,
      p.borrower_name AS source_name,
      similarity(e.canonical_name, p.borrower_name) AS score
    FROM entities e
    CROSS JOIN LATERAL (
      SELECT id, borrower_name
      FROM ppp_loans
      WHERE
        borrower_name % e.canonical_name
        AND similarity(borrower_name, e.canonical_name) >= $1
      ORDER BY similarity(borrower_name, e.canonical_name) DESC
      LIMIT 5
    ) p
    WHERE e.entity_type IN ('person', 'organization')
  `, [MATCH_THRESHOLD]);

  return result.rows.map((row) => ({
    entityId: row.entity_id,
    entityName: row.entity_name,
    source: 'ppp' as const,
    sourceId: row.source_id,
    sourceName: row.source_name,
    score: row.score,
  }));
}

async function findFECMatches(): Promise<Match[]> {
  console.log('🔍 Matching entities against FEC contributions...');

  const result = await pool.query(`
    SELECT
      e.id AS entity_id,
      e.canonical_name AS entity_name,
      f.id AS source_id,
      f.contributor_name AS source_name,
      similarity(e.canonical_name, f.contributor_name) AS score
    FROM entities e
    CROSS JOIN LATERAL (
      SELECT id, contributor_name
      FROM fec_contributions
      WHERE
        contributor_name % e.canonical_name
        AND similarity(contributor_name, e.canonical_name) >= $1
      ORDER BY similarity(contributor_name, e.canonical_name) DESC
      LIMIT 5
    ) f
    WHERE e.entity_type = 'person'
  `, [MATCH_THRESHOLD]);

  return result.rows.map((row) => ({
    entityId: row.entity_id,
    entityName: row.entity_name,
    source: 'fec' as const,
    sourceId: row.source_id,
    sourceName: row.source_name,
    score: row.score,
  }));
}

async function findGrantsMatches(): Promise<Match[]> {
  console.log('🔍 Matching entities against federal grants...');

  const result = await pool.query(`
    SELECT
      e.id AS entity_id,
      e.canonical_name AS entity_name,
      g.id AS source_id,
      g.recipient_name AS source_name,
      similarity(e.canonical_name, g.recipient_name) AS score
    FROM entities e
    CROSS JOIN LATERAL (
      SELECT id, recipient_name
      FROM federal_grants
      WHERE
        recipient_name % e.canonical_name
        AND similarity(recipient_name, e.canonical_name) >= $1
      ORDER BY similarity(recipient_name, e.canonical_name) DESC
      LIMIT 5
    ) g
    WHERE e.entity_type IN ('person', 'organization')
  `, [MATCH_THRESHOLD]);

  return result.rows.map((row) => ({
    entityId: row.entity_id,
    entityName: row.entity_name,
    source: 'grants' as const,
    sourceId: row.source_id,
    sourceName: row.source_name,
    score: row.score,
  }));
}

async function saveMatches(matches: Match[]): Promise<void> {
  if (matches.length === 0) return;

  // Interpolation is safe here: entityId/sourceId/score are numbers read
  // from the database and source is a literal union, never user input.
  const values = matches.map((m) =>
    `(${m.entityId}, '${m.source}', ${m.sourceId}, ${m.score}, 'fuzzy')`
  ).join(',\n');

  await pool.query(`
    INSERT INTO entity_crossref_matches (entity_id, source, source_id, match_score, match_method)
    VALUES ${values}
    ON CONFLICT DO NOTHING
  `);
}

async function updateEntityCrossRefSummary(): Promise<void> {
  console.log('📊 Updating entity cross-reference summaries...');

  // Update PPP matches
  await pool.query(`
    UPDATE entities e
    SET ppp_matches = (
      SELECT jsonb_agg(jsonb_build_object(
        'id', p.id,
        'borrower', p.borrower_name,
        'amount', p.loan_amount,
        'score', m.match_score
      ))
      FROM entity_crossref_matches m
      JOIN ppp_loans p ON m.source_id = p.id
      WHERE m.entity_id = e.id AND m.source = 'ppp' AND NOT m.false_positive
    )
    WHERE EXISTS (
      SELECT 1 FROM entity_crossref_matches m
      WHERE m.entity_id = e.id AND m.source = 'ppp'
    )
  `);

  // Update FEC matches
  await pool.query(`
    UPDATE entities e
    SET fec_matches = (
      SELECT jsonb_agg(jsonb_build_object(
        'id', f.id,
        'contributor', f.contributor_name,
        'candidate', f.candidate_name,
        'amount', f.amount,
        'score', m.match_score
      ))
      FROM entity_crossref_matches m
      JOIN fec_contributions f ON m.source_id = f.id
      WHERE m.entity_id = e.id AND m.source = 'fec' AND NOT m.false_positive
    )
    WHERE EXISTS (
      SELECT 1 FROM entity_crossref_matches m
      WHERE m.entity_id = e.id AND m.source = 'fec'
    )
  `);

  // Update grants matches
  await pool.query(`
    UPDATE entities e
    SET grants_matches = (
      SELECT jsonb_agg(jsonb_build_object(
        'id', g.id,
        'recipient', g.recipient_name,
        'agency', g.awarding_agency,
        'amount', g.award_amount,
        'score', m.match_score
      ))
      FROM entity_crossref_matches m
      JOIN federal_grants g ON m.source_id = g.id
      WHERE m.entity_id = e.id AND m.source = 'grants' AND NOT m.false_positive
    )
    WHERE EXISTS (
      SELECT 1 FROM entity_crossref_matches m
      WHERE m.entity_id = e.id AND m.source = 'grants'
    )
  `);
}

async function main() {
  console.log('🔗 Starting cross-reference matching...\n');

  // Find all matches
  const pppMatches = await findPPPMatches();
  console.log(`  Found ${pppMatches.length} PPP matches`);

  const fecMatches = await findFECMatches();
  console.log(`  Found ${fecMatches.length} FEC matches`);

  const grantsMatches = await findGrantsMatches();
  console.log(`  Found ${grantsMatches.length} grants matches`);

  // Save matches
  console.log('\n💾 Saving matches to database...');
  await saveMatches(pppMatches);
  await saveMatches(fecMatches);
  await saveMatches(grantsMatches);

  // Update entity summaries
  await updateEntityCrossRefSummary();

  const totalMatches = pppMatches.length + fecMatches.length + grantsMatches.length;
  console.log(`\n✅ Cross-reference matching complete!`);
  console.log(`  Total matches: ${totalMatches}`);
  console.log(`  PPP: ${pppMatches.length}`);
  console.log(`  FEC: ${fecMatches.length}`);
  console.log(`  Grants: ${grantsMatches.length}`);

  await close();
}

main().catch((error) => {
  console.error('Fatal error:', error);
  process.exit(1);
});
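The script above leans on PostgreSQL's pg_trgm `similarity()`. As a rough illustration of what that function computes, here is a simplified TypeScript sketch of trigram similarity. This is an approximation, not the exact pg_trgm algorithm: the real extension also tokenizes multi-word strings on non-alphanumerics and pads each word separately.

```typescript
// Simplified sketch of pg_trgm-style trigram similarity (approximation).
function trigrams(s: string): Set<string> {
  // pg_trgm lowercases and pads with two leading spaces and one trailing space
  const padded = `  ${s.toLowerCase()} `;
  const grams = new Set<string>();
  for (let i = 0; i + 3 <= padded.length; i++) {
    grams.add(padded.slice(i, i + 3));
  }
  return grams;
}

function trigramSimilarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  let shared = 0;
  for (const g of ta) if (tb.has(g)) shared++;
  // Jaccard overlap: shared trigrams over total distinct trigrams
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}
```

Identical strings score 1.0 and unrelated strings near 0, which is why a `MATCH_THRESHOLD` of 0.7 tolerates minor spelling variants in borrower or contributor names while rejecting unrelated ones.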
20
extraction/tsconfig.json
Normal file
@@ -0,0 +1,20 @@
{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "lib": ["ES2022"],
    "outDir": "./dist",
    "rootDir": "./src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true,
    "forceConsistentCasingInFileNames": true,
    "resolveJsonModule": true,
    "declaration": true,
    "declarationMap": true,
    "sourceMap": true
  },
  "include": ["src/**/*"],
  "exclude": ["node_modules", "dist"]
}
14
frontend/index.html
Normal file
@@ -0,0 +1,14 @@
<!DOCTYPE html>
<html lang="en" class="dark">
  <head>
    <meta charset="UTF-8" />
    <link rel="icon" type="image/svg+xml" href="/favicon.svg" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta name="description" content="Searchable database and network analysis tool for the DOJ Epstein Files release" />
    <title>Epstein Files Database</title>
  </head>
  <body class="bg-background text-white antialiased">
    <div id="root"></div>
    <script type="module" src="/src/main.tsx"></script>
  </body>
</html>
38
frontend/package.json
Normal file
@@ -0,0 +1,38 @@
{
  "name": "@epstein-db/frontend",
  "version": "1.0.0",
  "private": true,
  "type": "module",
  "scripts": {
    "dev": "vite",
    "build": "tsc && vite build",
    "preview": "vite preview",
    "lint": "eslint . --ext ts,tsx --report-unused-disable-directives --max-warnings 0"
  },
  "dependencies": {
    "@tanstack/react-query": "^5.32.0",
    "clsx": "^2.1.0",
    "d3": "^7.9.0",
    "lucide-react": "^0.372.0",
    "react": "^18.2.0",
    "react-dom": "^18.2.0",
    "react-force-graph-2d": "^1.25.5",
    "react-router-dom": "^6.22.0",
    "tailwind-merge": "^2.2.2"
  },
  "devDependencies": {
    "@types/d3": "^7.4.3",
    "@types/node": "^20.12.0",
    "@types/react": "^18.2.79",
    "@types/react-dom": "^18.2.25",
    "@vitejs/plugin-react": "^4.2.0",
    "autoprefixer": "^10.4.19",
    "eslint": "^8.57.0",
    "eslint-plugin-react-hooks": "^4.6.0",
    "eslint-plugin-react-refresh": "^0.4.6",
    "postcss": "^8.4.38",
    "tailwindcss": "^3.4.3",
    "typescript": "^5.4.0",
    "vite": "^5.2.0"
  }
}
29
frontend/src/App.tsx
Normal file
@@ -0,0 +1,29 @@
import { Routes, Route } from 'react-router-dom'
import { Layout } from './components/Layout'
import { HomePage } from './pages/HomePage'
import { NetworkPage } from './pages/NetworkPage'
import { EntitiesPage } from './pages/EntitiesPage'
import { EntityDetailPage } from './pages/EntityDetailPage'
import { DocumentsPage } from './pages/DocumentsPage'
import { DocumentDetailPage } from './pages/DocumentDetailPage'
import { SearchPage } from './pages/SearchPage'
import { PatternsPage } from './pages/PatternsPage'
import { CrossRefPage } from './pages/CrossRefPage'

export default function App() {
  return (
    <Layout>
      <Routes>
        <Route path="/" element={<HomePage />} />
        <Route path="/network" element={<NetworkPage />} />
        <Route path="/entities" element={<EntitiesPage />} />
        <Route path="/entities/:id" element={<EntityDetailPage />} />
        <Route path="/documents" element={<DocumentsPage />} />
        <Route path="/documents/:id" element={<DocumentDetailPage />} />
        <Route path="/search" element={<SearchPage />} />
        <Route path="/patterns" element={<PatternsPage />} />
        <Route path="/crossref" element={<CrossRefPage />} />
      </Routes>
    </Layout>
  )
}
277
frontend/src/api/index.ts
Normal file
@@ -0,0 +1,277 @@
const API_BASE = '/api'

export interface Stats {
  documents: number
  entities: number
  triples: number
  pppLoans: number
  fecRecords: number
  grants: number
  patterns: number
}

export interface Entity {
  id: number
  canonicalName: string
  entityType: string
  layer: number | null
  description?: string
  documentCount: number
  connectionCount: number
  aliases?: string[]
  pppMatches?: any[]
  fecMatches?: any[]
  grantsMatches?: any[]
}

export interface Document {
  id: number
  docId: string
  datasetId: number
  documentType?: string
  summary?: string
  detailedSummary?: string
  dateEarliest?: string
  dateLatest?: string
  contentTags?: string[]
  pageCount?: number
}

export interface Connection {
  id: number
  canonicalName: string
  entityType: string
  layer: number | null
  sharedDocs: number
}

export interface NetworkData {
  nodes: Array<{
    id: number
    canonicalName: string
    entityType: string
    layer: number | null
    documentCount: number
    connectionCount: number
  }>
  edges: Array<{
    source: number
    target: number
    weight: number
  }>
  stats: {
    nodeCount: number
    edgeCount: number
  }
}

export interface Pattern {
  id: number
  title: string
  description: string
  patternType: string
  confidence: number | null
  status: string
  discoveredAt: string
}

export interface SearchResult {
  id: number
  docId: string
  documentType?: string
  summary?: string
  rank: number
  snippet?: string
}

// Stats
export async function getStats(): Promise<Stats> {
  const res = await fetch(`${API_BASE}/stats`)
  if (!res.ok) throw new Error('Failed to fetch stats')
  return res.json()
}

// Entities
export async function searchEntities(params: {
  q?: string
  type?: string
  layer?: string
  limit?: number
}): Promise<{ entities: Entity[]; count: number }> {
  const searchParams = new URLSearchParams()
  if (params.q) searchParams.set('q', params.q)
  if (params.type) searchParams.set('type', params.type)
  if (params.layer) searchParams.set('layer', params.layer)
  if (params.limit) searchParams.set('limit', params.limit.toString())

  const res = await fetch(`${API_BASE}/entities?${searchParams}`)
  if (!res.ok) throw new Error('Failed to search entities')
  return res.json()
}

export async function getEntity(id: number): Promise<Entity> {
  const res = await fetch(`${API_BASE}/entities/${id}`)
  if (!res.ok) throw new Error('Failed to fetch entity')
  return res.json()
}

export async function getEntityConnections(
  id: number,
  limit?: number
): Promise<{ connections: Connection[]; count: number }> {
  const params = limit ? `?limit=${limit}` : ''
  const res = await fetch(`${API_BASE}/entities/${id}/connections${params}`)
  if (!res.ok) throw new Error('Failed to fetch connections')
  return res.json()
}

export async function getEntityDocuments(
  id: number,
  limit?: number
): Promise<{ documents: Document[]; count: number }> {
  const params = limit ? `?limit=${limit}` : ''
  const res = await fetch(`${API_BASE}/entities/${id}/documents${params}`)
  if (!res.ok) throw new Error('Failed to fetch documents')
  return res.json()
}

// Documents
export async function listDocuments(params: {
  type?: string
  dataset?: string
  limit?: number
  offset?: number
}): Promise<{ documents: Document[]; count: number; offset: number; limit: number }> {
  const searchParams = new URLSearchParams()
  if (params.type) searchParams.set('type', params.type)
  if (params.dataset) searchParams.set('dataset', params.dataset)
  if (params.limit) searchParams.set('limit', params.limit.toString())
  if (params.offset) searchParams.set('offset', params.offset.toString())

  const res = await fetch(`${API_BASE}/documents?${searchParams}`)
  if (!res.ok) throw new Error('Failed to list documents')
  return res.json()
}

export async function getDocument(id: number): Promise<Document> {
  const res = await fetch(`${API_BASE}/documents/${id}`)
  if (!res.ok) throw new Error('Failed to fetch document')
  return res.json()
}

export async function getDocumentText(id: number): Promise<{ id: number; text: string }> {
  const res = await fetch(`${API_BASE}/documents/${id}/text`)
  if (!res.ok) throw new Error('Failed to fetch document text')
  return res.json()
}

export async function getDocumentEntities(
  id: number
): Promise<{ entities: Array<Entity & { mentionCount: number }>; count: number }> {
  const res = await fetch(`${API_BASE}/documents/${id}/entities`)
  if (!res.ok) throw new Error('Failed to fetch document entities')
  return res.json()
}

// Network
export async function getNetwork(params?: {
  limit?: number
  minConnections?: number
}): Promise<NetworkData> {
  const searchParams = new URLSearchParams()
  if (params?.limit) searchParams.set('limit', params.limit.toString())
  if (params?.minConnections) searchParams.set('minConnections', params.minConnections.toString())

  const res = await fetch(`${API_BASE}/network?${searchParams}`)
  if (!res.ok) throw new Error('Failed to fetch network')
  return res.json()
}

export async function getNetworkByLayer(): Promise<{
  layers: Array<{
    layer: number
    entities: Entity[]
    count: number
  }>
}> {
  const res = await fetch(`${API_BASE}/network/layers`)
  if (!res.ok) throw new Error('Failed to fetch network layers')
  return res.json()
}

// Patterns
export async function listPatterns(params?: {
  status?: string
  type?: string
}): Promise<{ patterns: Pattern[]; count: number }> {
  const searchParams = new URLSearchParams()
  if (params?.status) searchParams.set('status', params.status)
  if (params?.type) searchParams.set('type', params.type)

  const res = await fetch(`${API_BASE}/patterns?${searchParams}`)
  if (!res.ok) throw new Error('Failed to list patterns')
  return res.json()
}

export async function getPattern(id: number): Promise<{
  pattern: Pattern & { entityIds: number[]; evidence: any; notes?: string }
  entities: Entity[]
}> {
  const res = await fetch(`${API_BASE}/patterns/${id}`)
  if (!res.ok) throw new Error('Failed to fetch pattern')
  return res.json()
}

// Search
export async function fullTextSearch(
  query: string,
  limit?: number
): Promise<{ results: SearchResult[]; count: number; query: string }> {
  const params = new URLSearchParams({ q: query })
  if (limit) params.set('limit', limit.toString())

  const res = await fetch(`${API_BASE}/search?${params}`)
  if (!res.ok) throw new Error('Failed to search')
  return res.json()
}

// Cross-reference
export async function searchPPP(
  query: string,
  limit?: number
): Promise<{ results: any[]; count: number }> {
  const params = new URLSearchParams({ q: query })
  if (limit) params.set('limit', limit.toString())

  const res = await fetch(`${API_BASE}/crossref/ppp?${params}`)
  if (!res.ok) throw new Error('Failed to search PPP')
  return res.json()
}

export async function searchFEC(
  query: string,
  candidate?: string,
  limit?: number
): Promise<{ results: any[]; count: number }> {
  const params = new URLSearchParams({ q: query })
  if (candidate) params.set('candidate', candidate)
  if (limit) params.set('limit', limit.toString())

  const res = await fetch(`${API_BASE}/crossref/fec?${params}`)
  if (!res.ok) throw new Error('Failed to search FEC')
  return res.json()
}

export async function searchGrants(
  query: string,
  agency?: string,
  limit?: number
): Promise<{ results: any[]; count: number }> {
  const params = new URLSearchParams({ q: query })
  if (agency) params.set('agency', agency)
  if (limit) params.set('limit', limit.toString())

  const res = await fetch(`${API_BASE}/crossref/grants?${params}`)
  if (!res.ok) throw new Error('Failed to search grants')
  return res.json()
}
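The client above repeats the same `URLSearchParams` boilerplate in nearly every function. A hypothetical refactor (not part of the original file) would factor that pattern into one helper that skips undefined and empty values:

```typescript
// Hypothetical helper mirroring the repeated pattern in api/index.ts:
// builds a query string from an options object, skipping unset values.
function buildQuery(params: Record<string, string | number | undefined>): string {
  const searchParams = new URLSearchParams()
  for (const [key, value] of Object.entries(params)) {
    if (value !== undefined && value !== '') {
      searchParams.set(key, String(value))
    }
  }
  return searchParams.toString()
}
```

With it, a function like `searchEntities` could reduce to a single call: `fetch(`${API_BASE}/entities?${buildQuery(params)}`)`.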
80
frontend/src/components/Layout.tsx
Normal file
@@ -0,0 +1,80 @@
import { ReactNode } from 'react'
import { Link, useLocation } from 'react-router-dom'
import {
  Search,
  Network,
  Users,
  FileText,
  Lightbulb,
  Link2,
  Home
} from 'lucide-react'
import { clsx } from 'clsx'

interface LayoutProps {
  children: ReactNode
}

const navItems = [
  { path: '/', icon: Home, label: 'Home' },
  { path: '/network', icon: Network, label: 'Network' },
  { path: '/entities', icon: Users, label: 'Entities' },
  { path: '/documents', icon: FileText, label: 'Documents' },
  { path: '/search', icon: Search, label: 'Search' },
  { path: '/patterns', icon: Lightbulb, label: 'Patterns' },
  { path: '/crossref', icon: Link2, label: 'Cross-Ref' },
]

export function Layout({ children }: LayoutProps) {
  const location = useLocation()

  return (
    <div className="min-h-screen flex">
      {/* Sidebar */}
      <nav className="w-64 bg-surface border-r border-border flex flex-col">
        {/* Logo */}
        <div className="p-4 border-b border-border">
          <Link to="/" className="flex items-center gap-2">
            <div className="w-8 h-8 bg-red-600 rounded-lg flex items-center justify-center">
              <span className="text-white font-bold text-sm">EF</span>
            </div>
            <div>
              <h1 className="font-semibold text-white">Epstein Files</h1>
              <p className="text-xs text-gray-500">Database</p>
            </div>
          </Link>
        </div>

        {/* Navigation */}
        <div className="flex-1 p-2">
          {navItems.map(({ path, icon: Icon, label }) => (
            <Link
              key={path}
              to={path}
              className={clsx(
                'flex items-center gap-3 px-3 py-2 rounded-lg mb-1 transition-colors',
                location.pathname === path
                  ? 'bg-blue-600/20 text-blue-400'
                  : 'text-gray-400 hover:bg-surface-hover hover:text-gray-200'
              )}
            >
              <Icon size={18} />
              <span>{label}</span>
            </Link>
          ))}
        </div>

        {/* Footer */}
        <div className="p-4 border-t border-border text-xs text-gray-500">
          <p>4,055 documents</p>
          <p className="mt-1">DOJ Release Dec 2025</p>
        </div>
      </nav>

      {/* Main Content */}
      <main className="flex-1 overflow-auto">
        {children}
      </main>
    </div>
  )
}
88
frontend/src/index.css
Normal file
@@ -0,0 +1,88 @@
@tailwind base;
@tailwind components;
@tailwind utilities;

@layer base {
  body {
    @apply bg-background text-gray-100;
  }

  /* Custom scrollbar */
  ::-webkit-scrollbar {
    width: 8px;
    height: 8px;
  }

  ::-webkit-scrollbar-track {
    @apply bg-surface;
  }

  ::-webkit-scrollbar-thumb {
    @apply bg-border rounded-full;
  }

  ::-webkit-scrollbar-thumb:hover {
    @apply bg-gray-600;
  }
}

@layer components {
  .card {
    @apply bg-surface border border-border rounded-lg;
  }

  .btn {
    @apply px-4 py-2 rounded-lg font-medium transition-colors;
  }

  .btn-primary {
    @apply bg-blue-600 hover:bg-blue-700 text-white;
  }

  .btn-secondary {
    @apply bg-surface border border-border hover:bg-surface-hover text-gray-200;
  }

  .input {
    @apply bg-surface border border-border rounded-lg px-4 py-2 text-white placeholder-gray-500 focus:outline-none focus:ring-2 focus:ring-blue-500 focus:border-transparent;
  }

  /* Layer badges */
  .layer-badge {
    @apply inline-flex items-center px-2 py-0.5 rounded-full text-xs font-medium;
  }

  .layer-0 {
    @apply bg-red-500/20 text-red-400 border border-red-500/30;
  }

  .layer-1 {
    @apply bg-orange-500/20 text-orange-400 border border-orange-500/30;
  }

  .layer-2 {
    @apply bg-yellow-500/20 text-yellow-400 border border-yellow-500/30;
  }

  .layer-3 {
    @apply bg-green-500/20 text-green-400 border border-green-500/30;
  }

  /* Entity type badges */
  .entity-person {
    @apply bg-blue-500/20 text-blue-400;
  }

  .entity-organization {
    @apply bg-purple-500/20 text-purple-400;
  }

  .entity-location {
    @apply bg-teal-500/20 text-teal-400;
  }
}

/* Force graph styling */
.force-graph-container {
  background: #0a0a0a;
}
25
frontend/src/main.tsx
Normal file
@@ -0,0 +1,25 @@
import React from 'react'
import ReactDOM from 'react-dom/client'
import { QueryClient, QueryClientProvider } from '@tanstack/react-query'
import { BrowserRouter } from 'react-router-dom'
import App from './App'
import './index.css'

const queryClient = new QueryClient({
  defaultOptions: {
    queries: {
      staleTime: 1000 * 60 * 5, // 5 minutes
      retry: 1,
    },
  },
})

ReactDOM.createRoot(document.getElementById('root')!).render(
  <React.StrictMode>
    <QueryClientProvider client={queryClient}>
      <BrowserRouter>
        <App />
      </BrowserRouter>
    </QueryClientProvider>
  </React.StrictMode>,
)
187
frontend/src/pages/HomePage.tsx
Normal file
@@ -0,0 +1,187 @@
import { useQuery } from '@tanstack/react-query'
import { Link } from 'react-router-dom'
import { getStats, getNetworkByLayer } from '@/api'
import { Users, FileText, Network, Lightbulb, DollarSign, Vote, Building } from 'lucide-react'

export function HomePage() {
  const { data: stats, isLoading: statsLoading } = useQuery({
    queryKey: ['stats'],
    queryFn: getStats,
  })

  const { data: layersData, isLoading: layersLoading } = useQuery({
    queryKey: ['network-layers'],
    queryFn: getNetworkByLayer,
  })

  return (
    <div className="p-6">
      {/* Header */}
      <div className="mb-8">
        <h1 className="text-3xl font-bold text-white mb-2">Epstein Files Database</h1>
        <p className="text-gray-400">
          Searchable database and network analysis tool for the DOJ Epstein Files release
        </p>
      </div>

      {/* Stats Grid */}
      <div className="grid grid-cols-2 md:grid-cols-4 gap-4 mb-8">
        <StatCard
          icon={FileText}
          label="Documents"
          value={stats?.documents ?? 0}
          loading={statsLoading}
        />
        <StatCard
          icon={Users}
          label="Entities"
          value={stats?.entities ?? 0}
          loading={statsLoading}
        />
        <StatCard
          icon={Network}
          label="Relationships"
          value={stats?.triples ?? 0}
          loading={statsLoading}
        />
        <StatCard
          icon={Lightbulb}
          label="Patterns"
          value={stats?.patterns ?? 0}
          loading={statsLoading}
        />
      </div>

      {/* Cross-Reference Stats */}
      <div className="card p-4 mb-8">
        <h2 className="text-lg font-semibold mb-4">Cross-Reference Data</h2>
        <div className="grid grid-cols-3 gap-4">
          <div className="flex items-center gap-3">
            <div className="p-2 bg-green-500/20 rounded-lg">
              <DollarSign className="text-green-400" size={20} />
            </div>
            <div>
              <p className="text-sm text-gray-400">PPP Loans</p>
              <p className="font-semibold">{stats?.pppLoans?.toLocaleString() ?? '—'}</p>
            </div>
          </div>
          <div className="flex items-center gap-3">
            <div className="p-2 bg-blue-500/20 rounded-lg">
              <Vote className="text-blue-400" size={20} />
            </div>
            <div>
              <p className="text-sm text-gray-400">FEC Records</p>
              <p className="font-semibold">{stats?.fecRecords?.toLocaleString() ?? '—'}</p>
            </div>
          </div>
          <div className="flex items-center gap-3">
            <div className="p-2 bg-purple-500/20 rounded-lg">
              <Building className="text-purple-400" size={20} />
            </div>
            <div>
              <p className="text-sm text-gray-400">Federal Grants</p>
              <p className="font-semibold">{stats?.grants?.toLocaleString() ?? '—'}</p>
            </div>
          </div>
        </div>
      </div>

      {/* Layer Overview */}
      <div className="card p-4 mb-8">
        <h2 className="text-lg font-semibold mb-4">Network Layers</h2>
        <div className="space-y-4">
          {[0, 1, 2, 3].map((layer) => {
            const layerData = layersData?.layers?.find((l) => l.layer === layer)
            return (
              <div key={layer} className="flex items-center gap-4">
                <span className={`layer-badge layer-${layer}`}>L{layer}</span>
                <div className="flex-1">
                  <div className="flex justify-between mb-1">
                    <span className="text-sm text-gray-300">
                      {layer === 0 && 'Jeffrey Epstein'}
                      {layer === 1 && 'Direct Associates'}
                      {layer === 2 && 'One Degree Removed'}
                      {layer === 3 && 'Two Degrees Removed'}
                    </span>
                    <span className="text-sm text-gray-500">
                      {layerData?.count ?? 0} entities
                    </span>
                  </div>
                  <div className="h-2 bg-surface rounded-full overflow-hidden">
                    <div
                      className={`h-full ${
                        layer === 0 ? 'bg-red-500' :
                        layer === 1 ? 'bg-orange-500' :
                        layer === 2 ? 'bg-yellow-500' :
                        'bg-green-500'
                      }`}
                      style={{
                        width: `${Math.min(100, (layerData?.count ?? 0) / 10)}%`
                      }}
                    />
                  </div>
                </div>
              </div>
            )
          })}
        </div>
      </div>

      {/* Quick Actions */}
      <div className="grid grid-cols-2 md:grid-cols-3 gap-4">
        <Link to="/network" className="card p-4 hover:bg-surface-hover transition-colors">
          <Network className="text-blue-400 mb-2" size={24} />
          <h3 className="font-semibold mb-1">Explore Network</h3>
          <p className="text-sm text-gray-400">Interactive visualization of entity connections</p>
        </Link>
        <Link to="/search" className="card p-4 hover:bg-surface-hover transition-colors">
          <FileText className="text-green-400 mb-2" size={24} />
          <h3 className="font-semibold mb-1">Search Documents</h3>
          <p className="text-sm text-gray-400">Full-text search across all documents</p>
        </Link>
        <Link to="/patterns" className="card p-4 hover:bg-surface-hover transition-colors">
          <Lightbulb className="text-yellow-400 mb-2" size={24} />
          <h3 className="font-semibold mb-1">View Patterns</h3>
          <p className="text-sm text-gray-400">AI-discovered connections and insights</p>
        </Link>
      </div>

      {/* Disclaimer */}
      <div className="mt-8 p-4 bg-yellow-500/10 border border-yellow-500/30 rounded-lg">
        <h3 className="font-semibold text-yellow-400 mb-1">Disclaimer</h3>
        <p className="text-sm text-gray-300">
          This is an independent research tool. It surfaces connections from public documents —
          it does not assert guilt, criminality, or wrongdoing. Always verify claims against primary sources.
        </p>
      </div>
    </div>
  )
}

function StatCard({
  icon: Icon,
  label,
  value,
  loading,
}: {
  icon: any
  label: string
  value: number
  loading: boolean
}) {
  return (
    <div className="card p-4">
      <div className="flex items-center gap-3">
        <div className="p-2 bg-surface-hover rounded-lg">
          <Icon className="text-gray-400" size={20} />
        </div>
        <div>
          <p className="text-sm text-gray-400">{label}</p>
          <p className="text-xl font-semibold">
            {loading ? '—' : value.toLocaleString()}
          </p>
        </div>
      </div>
    </div>
  )
}
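The layer bars in HomePage map an entity count to a width with `Math.min(100, count / 10)`. Extracted as a pure function (a hypothetical refactor; the divisor 10 is the original's implicit assumption that roughly 1,000 entities fill the bar):

```typescript
// Hypothetical extraction of HomePage's inline bar-width math:
// converts an entity count to a CSS width percentage, saturating at 100.
function layerBarWidth(count: number, entitiesPerPercent = 10): number {
  return Math.min(100, count / entitiesPerPercent)
}
```

Keeping this in one place would make it easy to swap in a log scale later if layer sizes turn out to be very uneven.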
30
frontend/tailwind.config.js
Normal file
@@ -0,0 +1,30 @@
|
||||
/** @type {import('tailwindcss').Config} */
export default {
  content: [
    "./index.html",
    "./src/**/*.{js,ts,jsx,tsx}",
  ],
  theme: {
    extend: {
      colors: {
        // Dark theme optimized for document analysis
        background: '#0a0a0a',
        surface: '#141414',
        'surface-hover': '#1a1a1a',
        border: '#262626',

        // Layer colors
        'layer-0': '#ef4444', // Epstein - red
        'layer-1': '#f97316', // Direct - orange
        'layer-2': '#eab308', // One removed - yellow
        'layer-3': '#22c55e', // Two removed - green

        // Entity type colors
        'entity-person': '#3b82f6',
        'entity-org': '#8b5cf6',
        'entity-location': '#14b8a6',
      },
    },
  },
  plugins: [],
}
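The `layer-N` palette above encodes graph distance from the root node. Components that draw the network directly (for example the visualization canvas) can resolve a layer number to its hex value with a small helper (a hypothetical sketch; the function name and the gray fallback for distant layers are assumptions, not part of the config):

```typescript
// Map a graph layer (0 = Epstein, 1 = direct, ...) to the Tailwind palette above.
// Hypothetical helper: layers beyond 3 fall back to a neutral gray.
const LAYER_COLORS: readonly string[] = ['#ef4444', '#f97316', '#eab308', '#22c55e']

export function layerColor(layer: number): string {
  return LAYER_COLORS[layer] ?? '#6b7280' // gray-500 fallback for distant layers
}
```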
21
frontend/vite.config.ts
Normal file
@@ -0,0 +1,21 @@
import { defineConfig } from 'vite'
import react from '@vitejs/plugin-react'
import path from 'path'

export default defineConfig({
  plugins: [react()],
  resolve: {
    alias: {
      '@': path.resolve(__dirname, './src'),
    },
  },
  server: {
    port: 3000,
    proxy: {
      '/api': {
        target: 'http://localhost:3001',
        changeOrigin: true,
      },
    },
  },
})
105
schema/neo4j/constraints.cypher
Normal file
@@ -0,0 +1,105 @@
// Neo4j Cypher constraints and initial setup
// Run these after Neo4j starts

// ============================================================================
// CONSTRAINTS
// ============================================================================

// Entity uniqueness
CREATE CONSTRAINT entity_unique IF NOT EXISTS
FOR (e:Entity) REQUIRE (e.canonicalName, e.type) IS UNIQUE;

// Document uniqueness
CREATE CONSTRAINT document_unique IF NOT EXISTS
FOR (d:Document) REQUIRE d.docId IS UNIQUE;

// ============================================================================
// INDEXES
// ============================================================================

// Entity indexes
CREATE INDEX entity_name IF NOT EXISTS FOR (e:Entity) ON (e.canonicalName);
CREATE INDEX entity_type IF NOT EXISTS FOR (e:Entity) ON (e.type);
CREATE INDEX entity_layer IF NOT EXISTS FOR (e:Entity) ON (e.layer);

// Full-text search index on entity names
CREATE FULLTEXT INDEX entity_search IF NOT EXISTS FOR (e:Entity) ON EACH [e.canonicalName, e.aliases];

// Document indexes
CREATE INDEX document_docid IF NOT EXISTS FOR (d:Document) ON (d.docId);
CREATE INDEX document_type IF NOT EXISTS FOR (d:Document) ON (d.documentType);

// ============================================================================
// ENTITY TYPES (Labels)
// ============================================================================
// We use labels for entity types:
// - :Person
// - :Organization
// - :Location
// - :Entity (base label; all entities have this)

// ============================================================================
// RELATIONSHIP TYPES
// ============================================================================
// - MENTIONED_IN: Entity -> Document (entity appears in document)
// - CONNECTED_TO: Entity -> Entity (co-occurrence relationship)
// - HAS_RELATIONSHIP: Entity -> Entity with action property (from triples)
// - CROSSREF_MATCH: Entity -> CrossRefRecord (PPP, FEC, grants)

// ============================================================================
// INITIAL DATA
// ============================================================================

// Create Jeffrey Epstein as the root node
MERGE (e:Entity:Person {canonicalName: 'Jeffrey Epstein', type: 'person'})
SET e.layer = 0,
    e.description = 'American financier and convicted sex offender',
    e.aliases = ['Jeffrey E. Epstein', 'J. Epstein', 'Epstein', 'JE'],
    e.createdAt = datetime();

// ============================================================================
// HELPER PROCEDURES
// ============================================================================

// Calculate the layer for an entity from its shortest path to Epstein.
// Usage: CALL custom.calculateLayer($entityName) YIELD layer
// Requires the APOC plugin (apoc.custom registers procedures under the
// custom.* namespace).

// CALL apoc.custom.asProcedure(
//   'calculateLayer',
//   '
//     MATCH (epstein:Entity {canonicalName: "Jeffrey Epstein"})
//     MATCH (target:Entity {canonicalName: $entityName})
//     MATCH path = shortestPath((epstein)-[:CONNECTED_TO*]-(target))
//     RETURN length(path) AS layer
//   ',
//   'read',
//   [['layer', 'INTEGER']],
//   [['entityName', 'STRING']]
// );

// ============================================================================
// EXAMPLE QUERIES
// ============================================================================

// Find all Layer 1 entities (direct connections to Epstein)
// MATCH (epstein:Entity {canonicalName: 'Jeffrey Epstein'})-[:CONNECTED_TO]-(layer1:Entity)
// RETURN layer1.canonicalName, layer1.type;

// Find shared connections between two entities
// MATCH (a:Entity {canonicalName: $name1})-[:CONNECTED_TO]-(shared:Entity)-[:CONNECTED_TO]-(b:Entity {canonicalName: $name2})
// RETURN shared.canonicalName, shared.type;

// Find documents where two entities appear together
// MATCH (a:Entity {canonicalName: $name1})-[:MENTIONED_IN]->(d:Document)<-[:MENTIONED_IN]-(b:Entity {canonicalName: $name2})
// RETURN d.docId, d.summary;

// Get an entity's network up to N hops
// MATCH path = (e:Entity {canonicalName: $name})-[:CONNECTED_TO*1..3]-(connected:Entity)
// RETURN path;

// Find money flows (entities connected through financial documents)
// MATCH (a:Entity)-[:MENTIONED_IN]->(d:Document {documentType: 'financial'})<-[:MENTIONED_IN]-(b:Entity)
// WHERE a <> b
// RETURN a.canonicalName, b.canonicalName, count(d) AS sharedFinancialDocs
// ORDER BY sharedFinancialDocs DESC;
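The commented APOC procedure above is just a shortest-path computation over CONNECTED_TO edges. Outside the database, the same layer assignment can be approximated with a plain breadth-first search over an exported adjacency list (a hypothetical sketch for the extraction pipeline; the production path is assumed to be the Cypher version):

```typescript
// Breadth-first search: layer = number of CONNECTED_TO hops from the root entity.
// Returns Infinity for entities with no path to the root.
function computeLayer(
  edges: Array<[string, string]>, // undirected CONNECTED_TO pairs
  root: string,
  target: string,
): number {
  // Build an undirected adjacency list.
  const adj = new Map<string, string[]>()
  for (const [a, b] of edges) {
    if (!adj.has(a)) adj.set(a, [])
    if (!adj.has(b)) adj.set(b, [])
    adj.get(a)!.push(b)
    adj.get(b)!.push(a)
  }
  const dist = new Map<string, number>([[root, 0]])
  const queue = [root]
  while (queue.length > 0) {
    const node = queue.shift()!
    if (node === target) return dist.get(node)!
    for (const next of adj.get(node) ?? []) {
      if (!dist.has(next)) {
        dist.set(next, dist.get(node)! + 1)
        queue.push(next)
      }
    }
  }
  return Infinity
}
```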
403
schema/postgres/001_initial_schema.sql
Normal file
@@ -0,0 +1,403 @@
-- Epstein Files Database Schema
-- PostgreSQL 16+

-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS pg_trgm;    -- Fuzzy text matching
CREATE EXTENSION IF NOT EXISTS btree_gin;  -- GIN indexes for JSONB
CREATE EXTENSION IF NOT EXISTS unaccent;   -- Accent-insensitive search

-- ============================================================================
-- DOCUMENTS
-- ============================================================================

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    doc_id TEXT UNIQUE NOT NULL,            -- e.g. EFTA00000001
    dataset_id INTEGER NOT NULL,            -- Which dataset (1-5)
    file_path TEXT,                         -- Original file path

    -- Content
    full_text TEXT,                         -- OCR text
    page_count INTEGER,

    -- AI analysis
    summary TEXT,                           -- One-sentence summary
    detailed_summary TEXT,                  -- Paragraph summary
    document_type TEXT,                     -- deposition, email, financial record, etc.

    -- Temporal
    date_earliest DATE,                     -- Earliest date mentioned
    date_latest DATE,                       -- Latest date mentioned

    -- Metadata
    content_tags JSONB DEFAULT '[]',        -- AI-extracted tags
    analysis_status TEXT DEFAULT 'pending', -- pending, processing, complete, failed
    error_message TEXT,

    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),
    analyzed_at TIMESTAMPTZ
);

CREATE INDEX idx_documents_doc_id ON documents(doc_id);
CREATE INDEX idx_documents_dataset ON documents(dataset_id);
CREATE INDEX idx_documents_type ON documents(document_type);
CREATE INDEX idx_documents_status ON documents(analysis_status);
CREATE INDEX idx_documents_dates ON documents(date_earliest, date_latest);
CREATE INDEX idx_documents_fulltext ON documents USING gin(to_tsvector('english', full_text));
CREATE INDEX idx_documents_tags ON documents USING gin(content_tags);
-- ============================================================================
-- ENTITIES
-- ============================================================================

-- Entity types enum
CREATE TYPE entity_type AS ENUM (
    'person',
    'organization',
    'location',
    'date',
    'reference',   -- Document references, case numbers, etc.
    'financial',   -- Dollar amounts, account numbers
    'unknown'
);

CREATE TABLE entities (
    id SERIAL PRIMARY KEY,
    canonical_name TEXT NOT NULL,           -- Deduplicated canonical form
    entity_type entity_type NOT NULL,

    -- Classification
    layer INTEGER,                          -- 0=Epstein, 1=direct, 2=one removed, 3=two removed

    -- Metadata
    aliases JSONB DEFAULT '[]',             -- Alternative spellings/names
    attributes JSONB DEFAULT '{}',          -- Type-specific attributes
    description TEXT,                       -- AI-generated description

    -- Cross-reference matches
    ppp_matches JSONB DEFAULT '[]',         -- Matched PPP loan records
    fec_matches JSONB DEFAULT '[]',         -- Matched FEC contributions
    grants_matches JSONB DEFAULT '[]',      -- Matched federal grants

    -- Stats
    document_count INTEGER DEFAULT 0,       -- Number of documents mentioning the entity
    connection_count INTEGER DEFAULT 0,     -- Number of connections to other entities

    -- Timestamps
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW(),

    UNIQUE(canonical_name, entity_type)
);

CREATE INDEX idx_entities_name ON entities(canonical_name);
CREATE INDEX idx_entities_name_trgm ON entities USING gin(canonical_name gin_trgm_ops);
CREATE INDEX idx_entities_type ON entities(entity_type);
CREATE INDEX idx_entities_layer ON entities(layer);
CREATE INDEX idx_entities_aliases ON entities USING gin(aliases);

-- ============================================================================
-- ENTITY ALIASES
-- ============================================================================

CREATE TABLE entity_aliases (
    id SERIAL PRIMARY KEY,
    original_name TEXT NOT NULL,
    entity_id INTEGER NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
    confidence REAL DEFAULT 1.0,            -- Confidence of the alias match
    source TEXT DEFAULT 'extraction',       -- extraction, llm_dedup, manual
    reasoning TEXT,                         -- Why this was matched
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_aliases_original ON entity_aliases(original_name);
CREATE INDEX idx_aliases_original_trgm ON entity_aliases USING gin(original_name gin_trgm_ops);
CREATE INDEX idx_aliases_entity ON entity_aliases(entity_id);
-- ============================================================================
-- DOCUMENT-ENTITY RELATIONSHIPS
-- ============================================================================

CREATE TABLE document_entities (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    entity_id INTEGER NOT NULL REFERENCES entities(id) ON DELETE CASCADE,

    -- Context
    mention_count INTEGER DEFAULT 1,        -- How many times the entity is mentioned
    first_mention INTEGER,                  -- Character offset of first mention
    context_snippet TEXT,                   -- Surrounding text

    -- Metadata
    extraction_confidence REAL DEFAULT 1.0,

    UNIQUE(document_id, entity_id)
);

CREATE INDEX idx_doc_entities_doc ON document_entities(document_id);
CREATE INDEX idx_doc_entities_entity ON document_entities(entity_id);

-- ============================================================================
-- RDF TRIPLES (Relationships)
-- ============================================================================

CREATE TABLE triples (
    id SERIAL PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,

    -- Subject-predicate-object
    subject_id INTEGER NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
    predicate TEXT NOT NULL,                -- Action/verb
    object_id INTEGER NOT NULL REFERENCES entities(id) ON DELETE CASCADE,

    -- Context
    location_id INTEGER REFERENCES entities(id) ON DELETE SET NULL,
    timestamp DATE,

    -- Metadata
    explicit_topic TEXT,                    -- Stated subject matter
    implicit_topic TEXT,                    -- Inferred subject matter
    tags JSONB DEFAULT '[]',
    confidence REAL DEFAULT 1.0,
    sequence_order INTEGER,                 -- Order within the document

    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_triples_document ON triples(document_id);
CREATE INDEX idx_triples_subject ON triples(subject_id);
CREATE INDEX idx_triples_object ON triples(object_id);
CREATE INDEX idx_triples_predicate ON triples(predicate);
CREATE INDEX idx_triples_timestamp ON triples(timestamp);
CREATE INDEX idx_triples_tags ON triples USING gin(tags);
-- ============================================================================
-- CROSS-REFERENCE TABLES
-- ============================================================================

-- PPP loans
CREATE TABLE ppp_loans (
    id SERIAL PRIMARY KEY,
    loan_number TEXT UNIQUE,
    borrower_name TEXT NOT NULL,
    borrower_address TEXT,
    borrower_city TEXT,
    borrower_state TEXT,
    borrower_zip TEXT,
    loan_amount NUMERIC(15,2),
    loan_status TEXT,
    forgiveness_amount NUMERIC(15,2),
    lender TEXT,
    naics_code TEXT,
    business_type TEXT,
    jobs_retained INTEGER,
    date_approved DATE,

    -- Matching metadata
    normalized_name TEXT,                   -- For fuzzy matching

    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_ppp_name ON ppp_loans(borrower_name);
CREATE INDEX idx_ppp_name_trgm ON ppp_loans USING gin(borrower_name gin_trgm_ops);
CREATE INDEX idx_ppp_normalized ON ppp_loans USING gin(normalized_name gin_trgm_ops);

-- FEC contributions
CREATE TABLE fec_contributions (
    id SERIAL PRIMARY KEY,
    fec_id TEXT,
    contributor_name TEXT NOT NULL,
    contributor_city TEXT,
    contributor_state TEXT,
    contributor_zip TEXT,
    contributor_employer TEXT,
    contributor_occupation TEXT,
    committee_id TEXT,
    committee_name TEXT,
    candidate_id TEXT,
    candidate_name TEXT,
    amount NUMERIC(12,2),
    contribution_date DATE,
    contribution_type TEXT,

    -- Matching metadata
    normalized_name TEXT,

    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_fec_contributor ON fec_contributions(contributor_name);
CREATE INDEX idx_fec_contributor_trgm ON fec_contributions USING gin(contributor_name gin_trgm_ops);
CREATE INDEX idx_fec_normalized ON fec_contributions USING gin(normalized_name gin_trgm_ops);
CREATE INDEX idx_fec_candidate ON fec_contributions(candidate_name);
CREATE INDEX idx_fec_committee ON fec_contributions(committee_name);

-- Federal grants
CREATE TABLE federal_grants (
    id SERIAL PRIMARY KEY,
    award_id TEXT,
    recipient_name TEXT NOT NULL,
    recipient_city TEXT,
    recipient_state TEXT,
    recipient_zip TEXT,
    awarding_agency TEXT,
    funding_agency TEXT,
    award_amount NUMERIC(15,2),
    award_date DATE,
    description TEXT,
    cfda_number TEXT,
    cfda_title TEXT,

    -- Matching metadata
    normalized_name TEXT,

    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_grants_recipient ON federal_grants(recipient_name);
CREATE INDEX idx_grants_recipient_trgm ON federal_grants USING gin(recipient_name gin_trgm_ops);
CREATE INDEX idx_grants_normalized ON federal_grants USING gin(normalized_name gin_trgm_ops);
-- ============================================================================
-- ENTITY CROSS-REFERENCE MATCHES
-- ============================================================================

CREATE TYPE match_source AS ENUM ('ppp', 'fec', 'grants');

CREATE TABLE entity_crossref_matches (
    id SERIAL PRIMARY KEY,
    entity_id INTEGER NOT NULL REFERENCES entities(id) ON DELETE CASCADE,
    source match_source NOT NULL,
    source_id INTEGER NOT NULL,             -- ID in the source table

    -- Match quality
    match_score REAL NOT NULL,              -- 0-1 similarity score
    match_method TEXT,                      -- exact, fuzzy, soundex, etc.
    verified BOOLEAN DEFAULT FALSE,         -- Human-verified match
    false_positive BOOLEAN DEFAULT FALSE,   -- Confirmed not a match

    created_at TIMESTAMPTZ DEFAULT NOW(),
    verified_at TIMESTAMPTZ,
    verified_by TEXT
);

CREATE INDEX idx_crossref_entity ON entity_crossref_matches(entity_id);
CREATE INDEX idx_crossref_source ON entity_crossref_matches(source, source_id);

-- ============================================================================
-- PATTERN FINDINGS
-- ============================================================================

CREATE TABLE pattern_findings (
    id SERIAL PRIMARY KEY,

    -- The pattern
    title TEXT NOT NULL,
    description TEXT NOT NULL,
    pattern_type TEXT,                      -- financial_flow, travel_pattern, organizational_link, etc.

    -- Involved entities
    entity_ids INTEGER[] NOT NULL,

    -- Evidence
    evidence JSONB NOT NULL,                -- Supporting documents, connections, etc.
    confidence REAL,

    -- Status
    status TEXT DEFAULT 'hypothesis',       -- hypothesis, validated, rejected
    notes TEXT,

    -- Timestamps
    discovered_at TIMESTAMPTZ DEFAULT NOW(),
    discovered_by TEXT DEFAULT 'pattern_agent',
    validated_at TIMESTAMPTZ,
    validated_by TEXT
);

CREATE INDEX idx_patterns_type ON pattern_findings(pattern_type);
CREATE INDEX idx_patterns_status ON pattern_findings(status);
CREATE INDEX idx_patterns_entities ON pattern_findings USING gin(entity_ids);
-- ============================================================================
-- VIEWS
-- ============================================================================

-- Entity co-occurrence view
CREATE VIEW entity_connections AS
SELECT
    e1.id AS entity1_id,
    e1.canonical_name AS entity1_name,
    e1.entity_type AS entity1_type,
    e2.id AS entity2_id,
    e2.canonical_name AS entity2_name,
    e2.entity_type AS entity2_type,
    COUNT(DISTINCT d.id) AS shared_documents,
    array_agg(DISTINCT d.doc_id) AS document_ids
FROM document_entities de1
JOIN document_entities de2 ON de1.document_id = de2.document_id AND de1.entity_id < de2.entity_id
JOIN entities e1 ON de1.entity_id = e1.id
JOIN entities e2 ON de2.entity_id = e2.id
JOIN documents d ON de1.document_id = d.id
GROUP BY e1.id, e1.canonical_name, e1.entity_type, e2.id, e2.canonical_name, e2.entity_type;

-- ============================================================================
-- FUNCTIONS
-- ============================================================================

-- Normalize a name for fuzzy matching: strip accents and punctuation,
-- collapse whitespace, lowercase.
CREATE OR REPLACE FUNCTION normalize_name(name TEXT) RETURNS TEXT AS $$
BEGIN
    RETURN lower(
        regexp_replace(
            regexp_replace(
                unaccent(name),
                '[^a-zA-Z0-9 ]', '', 'g'
            ),
            '\s+', ' ', 'g'
        )
    );
END;
$$ LANGUAGE plpgsql IMMUTABLE;

-- Keep per-entity stats in sync with document_entities
CREATE OR REPLACE FUNCTION update_entity_stats() RETURNS TRIGGER AS $$
BEGIN
    UPDATE entities e
    SET document_count = (
            SELECT COUNT(DISTINCT document_id)
            FROM document_entities
            WHERE entity_id = e.id
        ),
        connection_count = (
            SELECT COUNT(*)
            FROM entity_connections
            WHERE entity1_id = e.id OR entity2_id = e.id
        ),
        updated_at = NOW()
    WHERE e.id = COALESCE(NEW.entity_id, OLD.entity_id);

    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_update_entity_stats
AFTER INSERT OR UPDATE OR DELETE ON document_entities
FOR EACH ROW EXECUTE FUNCTION update_entity_stats();

-- ============================================================================
-- INITIAL DATA
-- ============================================================================

-- Insert Jeffrey Epstein as Layer 0
INSERT INTO entities (canonical_name, entity_type, layer, description, aliases)
VALUES (
    'Jeffrey Epstein',
    'person',
    0,
    'American financier and convicted sex offender',
    '["Jeffrey E. Epstein", "J. Epstein", "Epstein", "JE"]'::jsonb
) ON CONFLICT DO NOTHING;
||||
Reference in New Issue
Block a user