job-forge/docs/database_design.md

# JobForge MVP - Database Design & Schema

**Version:** 1.0.0 MVP
**Database:** PostgreSQL 16 with pgvector
**Target Audience:** Backend Developers
**Last Updated:** July 2025

---

## 🎯 Database Overview

### Technology Stack
- **Database:** PostgreSQL 16
- **Extensions:** pgvector (for AI embeddings)
- **Security:** Row Level Security (RLS) for multi-tenancy
- **Connection:** AsyncPG with SQLAlchemy 2.0
- **Migrations:** Direct SQL for MVP (Alembic in Phase 2)

### Design Principles
- **User Isolation:** Complete data separation between users
- **Data Integrity:** Foreign key constraints and validation
- **Performance:** Optimized indexes for common queries
- **Security:** RLS policies prevent cross-user data access
- **Scalability:** Schema designed for future SaaS features

---

## 📊 Entity Relationship Diagram

```mermaid
erDiagram
    USERS ||--o{ APPLICATIONS : creates
    USERS ||--o{ USER_RESUMES : owns
    APPLICATIONS ||--o{ DOCUMENTS : contains
    DOCUMENTS ||--o| DOCUMENT_EMBEDDINGS : has_embedding

    USERS {
        uuid id PK
        varchar email UK
        varchar password_hash
        varchar full_name
        timestamp created_at
        timestamp updated_at
    }

    APPLICATIONS {
        uuid id PK
        uuid user_id FK
        varchar name
        varchar company_name
        varchar role_title
        text job_url
        text job_description
        varchar location
        varchar priority_level
        varchar status
        boolean research_completed
        boolean resume_optimized
        boolean cover_letter_generated
        timestamp created_at
        timestamp updated_at
    }

    DOCUMENTS {
        uuid id PK
        uuid application_id FK
        varchar document_type
        text content
        timestamp created_at
        timestamp updated_at
    }

    USER_RESUMES {
        uuid id PK
        uuid user_id FK
        varchar name
        text content
        varchar focus_area
        boolean is_primary
        timestamp created_at
        timestamp updated_at
    }

    DOCUMENT_EMBEDDINGS {
        uuid id PK
        uuid document_id FK
        vector embedding
        timestamp created_at
    }
```

---

## 🗄️ Complete Database Schema

### Database Initialization
```sql
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS vector;

-- Create custom types
CREATE TYPE priority_level_type AS ENUM ('low', 'medium', 'high');
CREATE TYPE application_status_type AS ENUM (
    'draft',
    'research_complete',
    'resume_ready',
    'cover_letter_ready'
);
CREATE TYPE document_type_enum AS ENUM (
    'research_report',
    'optimized_resume',
    'cover_letter'
);
CREATE TYPE focus_area_type AS ENUM (
    'software_development',
    'data_science',
    'management',
    'consulting',
    'other'
);
```

### Core Tables

#### Users Table
```sql
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    full_name VARCHAR(255) NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

    -- Constraints
    CONSTRAINT email_format CHECK (email ~* '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$'),
    CONSTRAINT name_not_empty CHECK (LENGTH(TRIM(full_name)) > 0)
);

-- Indexes
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_created_at ON users(created_at);

-- Row Level Security
ALTER TABLE users ENABLE ROW LEVEL SECURITY;

-- Users can only see their own record
CREATE POLICY users_own_data ON users
    FOR ALL
    USING (id = current_setting('app.current_user_id')::UUID);
```

#### Applications Table
```sql
CREATE TABLE applications (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    name VARCHAR(255) NOT NULL,
    company_name VARCHAR(255) NOT NULL,
    role_title VARCHAR(255) NOT NULL,
    job_url TEXT,
    job_description TEXT NOT NULL,
    location VARCHAR(255),
    priority_level priority_level_type DEFAULT 'medium',
    status application_status_type DEFAULT 'draft',

    -- Phase tracking
    research_completed BOOLEAN DEFAULT FALSE,
    resume_optimized BOOLEAN DEFAULT FALSE,
    cover_letter_generated BOOLEAN DEFAULT FALSE,

    -- Timestamps
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

    -- Constraints
    CONSTRAINT job_description_min_length CHECK (LENGTH(job_description) >= 50),
    CONSTRAINT company_name_not_empty CHECK (LENGTH(TRIM(company_name)) > 0),
    CONSTRAINT role_title_not_empty CHECK (LENGTH(TRIM(role_title)) > 0),
    CONSTRAINT valid_job_url CHECK (
        job_url IS NULL OR
        job_url ~* '^https?://[^\s/$.?#].[^\s]*$'
    ),

    -- Business logic constraints
    CONSTRAINT resume_requires_research CHECK (
        NOT resume_optimized OR research_completed
    ),
    CONSTRAINT cover_letter_requires_resume CHECK (
        NOT cover_letter_generated OR resume_optimized
    )
);

-- Indexes
CREATE INDEX idx_applications_user_id ON applications(user_id);
CREATE INDEX idx_applications_status ON applications(status);
CREATE INDEX idx_applications_priority ON applications(priority_level);
CREATE INDEX idx_applications_created_at ON applications(created_at);
CREATE INDEX idx_applications_company_name ON applications(company_name);

-- Full text search index for job descriptions
CREATE INDEX idx_applications_job_description_fts
ON applications USING gin(to_tsvector('english', job_description));

-- Row Level Security
ALTER TABLE applications ENABLE ROW LEVEL SECURITY;

CREATE POLICY applications_user_access ON applications
    FOR ALL
    USING (user_id = current_setting('app.current_user_id')::UUID);
```

#### Documents Table
```sql
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    application_id UUID NOT NULL REFERENCES applications(id) ON DELETE CASCADE,
    document_type document_type_enum NOT NULL,
    content TEXT NOT NULL,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

    -- Constraints
    CONSTRAINT content_min_length CHECK (LENGTH(content) >= 10),
    CONSTRAINT unique_document_per_application UNIQUE (application_id, document_type)
);

-- Indexes
CREATE INDEX idx_documents_application_id ON documents(application_id);
CREATE INDEX idx_documents_type ON documents(document_type);
CREATE INDEX idx_documents_updated_at ON documents(updated_at);

-- Full text search index for document content
CREATE INDEX idx_documents_content_fts
ON documents USING gin(to_tsvector('english', content));

-- Row Level Security
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;

CREATE POLICY documents_user_access ON documents
    FOR ALL
    USING (
        application_id IN (
            SELECT id FROM applications
            WHERE user_id = current_setting('app.current_user_id')::UUID
        )
    );
```

#### User Resumes Table
```sql
CREATE TABLE user_resumes (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    name VARCHAR(255) NOT NULL,
    content TEXT NOT NULL,
    focus_area focus_area_type DEFAULT 'other',
    is_primary BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

    -- Constraints
    CONSTRAINT resume_name_not_empty CHECK (LENGTH(TRIM(name)) > 0),
    CONSTRAINT resume_content_min_length CHECK (LENGTH(content) >= 100),

    -- Only one primary resume per user
    CONSTRAINT unique_primary_resume UNIQUE (user_id, is_primary)
    DEFERRABLE INITIALLY DEFERRED
);

-- Indexes
CREATE INDEX idx_user_resumes_user_id ON user_resumes(user_id);
CREATE INDEX idx_user_resumes_focus_area ON user_resumes(focus_area);
CREATE INDEX idx_user_resumes_is_primary ON user_resumes(is_primary);

-- Full text search index for resume content
CREATE INDEX idx_user_resumes_content_fts
ON user_resumes USING gin(to_tsvector('english', content));

-- Row Level Security
ALTER TABLE user_resumes ENABLE ROW LEVEL SECURITY;

CREATE POLICY user_resumes_access ON user_resumes
    FOR ALL
    USING (user_id = current_setting('app.current_user_id')::UUID);
```

#### Document Embeddings Table (AI Features)
```sql
CREATE TABLE document_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    embedding vector(1536),  -- OpenAI text-embedding-3-large dimension
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),

    -- Constraints
    CONSTRAINT unique_embedding_per_document UNIQUE (document_id)
);

-- Vector similarity index
CREATE INDEX idx_document_embeddings_vector
ON document_embeddings USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Regular indexes
CREATE INDEX idx_document_embeddings_document_id ON document_embeddings(document_id);

-- Row Level Security
ALTER TABLE document_embeddings ENABLE ROW LEVEL SECURITY;

CREATE POLICY document_embeddings_access ON document_embeddings
    FOR ALL
    USING (
        document_id IN (
            SELECT d.id FROM documents d
            JOIN applications a ON d.application_id = a.id
            WHERE a.user_id = current_setting('app.current_user_id')::UUID
        )
    );
```

---

## 🔒 Security Policies

### Row Level Security Overview
All tables with user data have RLS enabled to ensure complete data isolation:

```sql
-- Function to get current user ID from session
CREATE OR REPLACE FUNCTION get_current_user_id()
RETURNS UUID AS $$
BEGIN
    RETURN current_setting('app.current_user_id')::UUID;
EXCEPTION
    WHEN others THEN
        RETURN NULL;
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;

-- Helper function to check if user owns application
CREATE OR REPLACE FUNCTION user_owns_application(app_id UUID)
RETURNS BOOLEAN AS $$
BEGIN
    RETURN EXISTS (
        SELECT 1 FROM applications
        WHERE id = app_id
        AND user_id = get_current_user_id()
    );
END;
$$ LANGUAGE plpgsql SECURITY DEFINER;
```

### Setting User Context
Backend must set user context for each request:

```python
# In FastAPI dependency
async def set_user_context(user: User = Depends(get_current_user)):
    async with get_db_connection() as conn:
        await conn.execute(
            "SET LOCAL app.current_user_id = %s",
            str(user.id)
        )
    return user
```

---

## 🚀 Database Functions

### Trigger Functions
```sql
-- Update timestamp trigger function
CREATE OR REPLACE FUNCTION update_updated_at_column()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Apply to all tables with updated_at
CREATE TRIGGER update_users_updated_at
    BEFORE UPDATE ON users
    FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();

CREATE TRIGGER update_applications_updated_at
    BEFORE UPDATE ON applications
    FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();

CREATE TRIGGER update_documents_updated_at
    BEFORE UPDATE ON documents
    FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();

CREATE TRIGGER update_user_resumes_updated_at
    BEFORE UPDATE ON user_resumes
    FOR EACH ROW EXECUTE FUNCTION update_updated_at_column();
```

### Business Logic Functions
```sql
-- Generate application name
CREATE OR REPLACE FUNCTION generate_application_name(
    p_company_name VARCHAR,
    p_role_title VARCHAR
) RETURNS VARCHAR AS $$
DECLARE
    clean_company VARCHAR;
    clean_role VARCHAR;
    date_suffix VARCHAR;
BEGIN
    -- Clean and normalize names
    clean_company := LOWER(REGEXP_REPLACE(p_company_name, '[^a-zA-Z0-9]', '_', 'g'));
    clean_role := LOWER(REGEXP_REPLACE(p_role_title, '[^a-zA-Z0-9]', '_', 'g'));
    date_suffix := TO_CHAR(NOW(), 'YYYY_MM_DD');

    RETURN clean_company || '_' || clean_role || '_' || date_suffix;
END;
$$ LANGUAGE plpgsql;

-- Update application phases trigger
CREATE OR REPLACE FUNCTION update_application_phases()
RETURNS TRIGGER AS $$
BEGIN
    -- Auto-update phase completion based on document existence
    IF TG_OP = 'INSERT' OR TG_OP = 'UPDATE' THEN
        UPDATE applications SET
            research_completed = EXISTS (
                SELECT 1 FROM documents
                WHERE application_id = NEW.application_id
                AND document_type = 'research_report'
            ),
            resume_optimized = EXISTS (
                SELECT 1 FROM documents
                WHERE application_id = NEW.application_id
                AND document_type = 'optimized_resume'
            ),
            cover_letter_generated = EXISTS (
                SELECT 1 FROM documents
                WHERE application_id = NEW.application_id
                AND document_type = 'cover_letter'
            )
        WHERE id = NEW.application_id;

        -- Update status based on completion
        UPDATE applications SET
            status = CASE
                WHEN cover_letter_generated THEN 'cover_letter_ready'
                WHEN resume_optimized THEN 'resume_ready'
                WHEN research_completed THEN 'research_complete'
                ELSE 'draft'
            END
        WHERE id = NEW.application_id;
    END IF;

    RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER documents_update_phases
    AFTER INSERT OR UPDATE OR DELETE ON documents
    FOR EACH ROW EXECUTE FUNCTION update_application_phases();
```

---

## 📈 Performance Optimization

### Query Optimization
```sql
-- Most common query patterns with optimized indexes

-- 1. Get user applications (paginated)
-- Index: idx_applications_user_id, idx_applications_created_at
SELECT * FROM applications
WHERE user_id = $1
ORDER BY created_at DESC
LIMIT $2 OFFSET $3;

-- 2. Get application with documents
-- Index: idx_documents_application_id
SELECT a.*, d.document_type, d.content
FROM applications a
LEFT JOIN documents d ON a.id = d.application_id
WHERE a.id = $1 AND a.user_id = $2;

-- 3. Search applications by company/role
-- Index: idx_applications_company_name, full-text search
SELECT * FROM applications
WHERE user_id = $1
AND (
    company_name ILIKE $2
    OR role_title ILIKE $3
    OR to_tsvector('english', job_description) @@ plainto_tsquery('english', $4)
)
ORDER BY created_at DESC;
```

### Connection Pooling
```python
# SQLAlchemy async engine configuration
engine = create_async_engine(
    DATABASE_URL,
    pool_size=20,          # Connection pool size
    max_overflow=30,       # Additional connections beyond pool_size
    pool_pre_ping=True,    # Validate connections before use
    pool_recycle=3600,     # Recycle connections every hour
    echo=False             # Disable SQL logging in production
)
```

---

## 🧪 Test Data Setup

### Development Seed Data
```sql
-- Insert test user (password: "testpass123")
INSERT INTO users (id, email, password_hash, full_name) VALUES (
    '123e4567-e89b-12d3-a456-426614174000',
    'test@example.com',
    '$2b$12$LQv3c1yqBWVHxkd0LHAkCOYz6TtxMQJqhN8/LewgdyN8yF5V4M2kq',
    'Test User'
);

-- Insert test resume
INSERT INTO user_resumes (user_id, name, content, focus_area, is_primary) VALUES (
    '123e4567-e89b-12d3-a456-426614174000',
    'Software Developer Resume',
    '# Test User\n\n## Experience\n\nSoftware Developer at Tech Corp...',
    'software_development',
    true
);

-- Insert test application
INSERT INTO applications (
    user_id, name, company_name, role_title,
    job_description, status, research_completed
) VALUES (
    '123e4567-e89b-12d3-a456-426614174000',
    'google_senior_developer_2025_07_01',
    'Google',
    'Senior Developer',
    'We are seeking an experienced software developer to join our team...',
    'research_complete',
    true
);
```

---

## 🔄 Database Migrations (Future)

### Migration Strategy for Phase 2
When adding Alembic migrations:

```python
# alembic/env.py configuration for RLS
from sqlalchemy import text

def run_migrations_online():
    # Set up RLS context for migrations
    with engine.connect() as connection:
        connection.execute(text("SET row_security = off"))

        context.configure(
            connection=connection,
            target_metadata=target_metadata,
            compare_type=True,
            compare_server_default=True
        )

        with context.begin_transaction():
            context.run_migrations()
```

### Planned Schema Changes
- **Usage tracking tables** for SaaS billing
- **Subscription management** tables
- **Audit log** tables for compliance
- **Performance metrics** tables
- **Additional indexes** based on production usage

---

## 🛠️ Database Maintenance

### Regular Maintenance Tasks
```sql
-- Vacuum and analyze (run weekly)
VACUUM ANALYZE;

-- Update table statistics
ANALYZE applications;
ANALYZE documents;
ANALYZE user_resumes;

-- Check index usage
SELECT schemaname, tablename, indexname, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_tup_read DESC;

-- Monitor vector index performance
SELECT * FROM pg_stat_user_indexes
WHERE indexname LIKE '%vector%';
```

### Backup Strategy
```bash
# Daily backup script
pg_dump -h localhost -U jobforge_user -d jobforge_mvp \
    --clean --if-exists --verbose \
    > backup_$(date +%Y%m%d).sql

# Restore from backup
psql -h localhost -U jobforge_user -d jobforge_mvp < backup_20250701.sql
```

---

## 📊 Monitoring Queries

### Performance Monitoring
```sql
-- Slow queries
SELECT query, mean_time, calls, total_time
FROM pg_stat_statements
WHERE mean_time > 100  -- queries slower than 100ms
ORDER BY mean_time DESC;

-- Table sizes
SELECT
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
    pg_total_relation_size(schemaname||'.'||tablename) as size_bytes
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY size_bytes DESC;

-- Connection counts
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;
```

---

*This database design provides a solid foundation for the MVP while being prepared for future SaaS features. The RLS policies ensure complete user data isolation, and the schema is optimized for the expected query patterns.*