comunidadhll/docs/database-maintenance.md
devRaGonSa 0da8338ba8 Fix
2026-06-05 16:57:25 +02:00

Database Maintenance

Overview

HLL Vietnam keeps database cleanup at the application level.

The current maintenance scope is intentionally narrow:

  • old server_snapshots;
  • old non-critical rcon_admin_log_events;
  • old critical rcon_admin_log_events only after retention and protected-match checks;
  • old non-protected rcon_materialized_matches;
  • dependent rcon_match_player_stats for deleted matches.

The first maintenance pass does not routinely delete:

  • displayed_historical_snapshots;
  • file-based snapshots under backend/data/snapshots/;
  • public-scoreboard historical_* fallback tables;
  • player_event_raw_ledger and its worker metadata;
  • Elo/MMR tables;
  • Comunidad Hispana #03 data reactivation or targets.

Why Application-Level And Not pg_cron

Cleanup is versioned in backend code instead of delegated to pg_cron, host cron, or a separate container because the retention logic depends on product rules:

  • keep the latest 100 closed materialized matches;
  • keep the current month;
  • keep the previous month during the first 7 days of a new month;
  • keep the current week;
  • keep the previous week when weekly fallback may still need it;
  • keep child stats for protected matches;
  • avoid breaking current/live pages that still read recent AdminLog data.

Those rules belong with the application's read and write model, not inside database-only scheduling.
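The retention rules above can be sketched as a single predicate. This is a minimal illustration, not the project's implementation: the function names, the Monday-based UTC week boundary, and the `weekly_fallback_needs_previous_week` flag are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def week_start(moment: datetime) -> datetime:
    """Monday 00:00 of the week containing `moment` (assumed week boundary)."""
    midnight = moment.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=midnight.weekday())


def month_start(moment: datetime) -> datetime:
    """First day of the month containing `moment`, at 00:00."""
    return moment.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def is_protected_match(
    closed_at: datetime,
    recency_rank: int,  # 1 = most recently closed match
    now: datetime,
    recent_matches_keep: int = 100,
    weekly_fallback_needs_previous_week: bool = False,  # hypothetical flag
) -> bool:
    # Rule: keep the latest N closed materialized matches.
    if recency_rank <= recent_matches_keep:
        return True
    # Rule: keep the current month.
    this_month = month_start(now)
    if closed_at >= this_month:
        return True
    # Rule: keep the previous month during the first 7 days of a new month.
    if now.day <= 7 and closed_at >= month_start(this_month - timedelta(days=1)):
        return True
    # Rule: keep the current week.
    this_week = week_start(now)
    if closed_at >= this_week:
        return True
    # Rule: keep the previous week while weekly fallback may still need it.
    if weekly_fallback_needs_previous_week and closed_at >= this_week - timedelta(days=7):
        return True
    return False
```

Because every rule only ever adds protection, the checks can be evaluated in any order; a match is a deletion candidate only when all of them fail.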

Scheduled Cleanup Inside historical-runner

Database maintenance is scheduled inside app.historical_runner.

Behavior:

  • disabled by default;
  • no extra Docker service is added for maintenance;
  • the runner checks whether maintenance is due;
  • when enabled and due, the runner performs the same cleanup as python -m app.database_maintenance cleanup --apply by calling the shared Python function directly;
  • failures are logged and do not crash the historical runner loop;
  • cleanup runs under the same writer-lock coordination used by the historical writer flows.

Relevant structured log events:

  • database-maintenance-scheduler-skipped-disabled
  • database-maintenance-scheduler-skipped-not-due
  • database-maintenance-scheduler-started
  • database-maintenance-scheduler-completed
  • database-maintenance-scheduler-failed
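The scheduling behavior and the log events above could fit together roughly as follows. This is a sketch under assumptions: `scheduler_event` and `run_if_due` are hypothetical names, a runner that has never run maintenance is treated as due, and the real runner may emit these events at different points.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


def scheduler_event(
    enabled: bool,
    last_run_at: Optional[datetime],
    now: datetime,
    interval_seconds: int = 43200,
) -> str:
    """Return the log event name describing what the scheduler would do."""
    if not enabled:
        return "database-maintenance-scheduler-skipped-disabled"
    # Assumption: no recorded previous run means maintenance is due.
    if last_run_at is not None and now - last_run_at < timedelta(seconds=interval_seconds):
        return "database-maintenance-scheduler-skipped-not-due"
    return "database-maintenance-scheduler-started"


def run_if_due(
    enabled: bool,
    last_run_at: Optional[datetime],
    now: datetime,
    cleanup: Callable[[], None],
    interval_seconds: int = 43200,
) -> str:
    """Run `cleanup` when due and return the final event name."""
    event = scheduler_event(enabled, last_run_at, now, interval_seconds)
    if event != "database-maintenance-scheduler-started":
        return event
    try:
        cleanup()
    except Exception:
        # Failures are reported, never re-raised, so the runner loop survives.
        return "database-maintenance-scheduler-failed"
    return "database-maintenance-scheduler-completed"
```

Catching the exception at this boundary is what keeps a failed cleanup from taking down the historical runner loop.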

Environment Variables

Required maintenance-related variables:

HLL_DB_MAINTENANCE_ENABLED=false
HLL_DB_MAINTENANCE_INTERVAL_SECONDS=43200
HLL_RECENT_MATCHES_KEEP=100
HLL_ADMIN_LOG_NONCRITICAL_RETENTION_DAYS=30
HLL_ADMIN_LOG_CRITICAL_RETENTION_DAYS=90
HLL_SERVER_SNAPSHOT_RETENTION_DAYS=14
HLL_DB_MAINTENANCE_BATCH_SIZE=5000

Meaning:

  • HLL_DB_MAINTENANCE_ENABLED Enables scheduled apply mode inside historical-runner.
  • HLL_DB_MAINTENANCE_INTERVAL_SECONDS Scheduler interval in seconds. The default, 43200, means every 12 hours.
  • HLL_RECENT_MATCHES_KEEP Number of latest closed materialized matches that must always be protected.
  • HLL_ADMIN_LOG_NONCRITICAL_RETENTION_DAYS Retention for non-critical AdminLog events such as chat/connect/disconnect.
  • HLL_ADMIN_LOG_CRITICAL_RETENTION_DAYS Retention for critical AdminLog events such as kill, match_start, match_end.
  • HLL_SERVER_SNAPSHOT_RETENTION_DAYS Retention for live server snapshots.
  • HLL_DB_MAINTENANCE_BATCH_SIZE Delete batch size for apply mode.
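As an illustration of how these variables might be read, the sketch below loads them into one settings object using the values listed above as fallbacks. The `MaintenanceSettings` class, the `load_settings` function, and the accepted boolean spellings are assumptions for the example; the backend's actual config loader may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MaintenanceSettings:  # hypothetical container, not the project's class
    enabled: bool
    interval_seconds: int
    recent_matches_keep: int
    admin_log_noncritical_retention_days: int
    admin_log_critical_retention_days: int
    server_snapshot_retention_days: int
    batch_size: int


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as "true"; everything else is false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> MaintenanceSettings:
    """Read maintenance settings, falling back to the documented values."""
    return MaintenanceSettings(
        enabled=_as_bool(env.get("HLL_DB_MAINTENANCE_ENABLED", "false")),
        interval_seconds=int(env.get("HLL_DB_MAINTENANCE_INTERVAL_SECONDS", "43200")),
        recent_matches_keep=int(env.get("HLL_RECENT_MATCHES_KEEP", "100")),
        admin_log_noncritical_retention_days=int(
            env.get("HLL_ADMIN_LOG_NONCRITICAL_RETENTION_DAYS", "30")
        ),
        admin_log_critical_retention_days=int(
            env.get("HLL_ADMIN_LOG_CRITICAL_RETENTION_DAYS", "90")
        ),
        server_snapshot_retention_days=int(
            env.get("HLL_SERVER_SNAPSHOT_RETENTION_DAYS", "14")
        ),
        batch_size=int(env.get("HLL_DB_MAINTENANCE_BATCH_SIZE", "5000")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the process environment.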

Protected Data

The cleanup command protects:

  • latest 100 closed materialized matches by default;
  • current month materialized matches;
  • previous month materialized matches when the current day is 1 through 7;
  • current week materialized matches;
  • previous week materialized matches when weekly fallback may still need them;
  • rcon_match_player_stats belonging to protected matches;
  • current/live AdminLog data required for visible current-match surfaces;
  • displayed_historical_snapshots;
  • file snapshots in backend/data/snapshots/.

If a match timestamp cannot be interpreted safely, that match is skipped and protected instead of deleted.
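A fail-closed check of this kind might look like the sketch below. The helper names are hypothetical, and the sketch assumes ISO 8601 input and treats timestamps without a timezone as unsafe; the real command may accept other formats.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_match_timestamp(raw: object) -> Optional[datetime]:
    """Return a timezone-aware datetime, or None when `raw` is not safe to trust."""
    if isinstance(raw, datetime):
        parsed = raw
    elif isinstance(raw, str) and raw.strip():
        text = raw.strip()
        if text.endswith("Z"):
            # Normalize the UTC suffix for datetime.fromisoformat.
            text = text[:-1] + "+00:00"
        try:
            parsed = datetime.fromisoformat(text)
        except ValueError:
            return None
    else:
        return None
    # Assumption: a timestamp without a timezone is ambiguous, so reject it.
    if parsed.tzinfo is None:
        return None
    return parsed


def may_delete_match(raw_closed_at: object, cutoff: datetime) -> bool:
    """Allow deletion only when the timestamp parses and is older than the cutoff."""
    closed_at = parse_match_timestamp(raw_closed_at)
    if closed_at is None:
        return False  # unparseable: skip and protect
    return closed_at < cutoff
```

The important property is the default: any doubt about the timestamp resolves to keeping the row.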

Deleted Data

Apply mode is currently allowed to delete:

  • server_snapshots older than retention;
  • non-critical rcon_admin_log_events older than retention;
  • critical rcon_admin_log_events older than retention only when they are not required by protected materialized match ranges;
  • non-protected rcon_materialized_matches;
  • dependent rcon_match_player_stats for deleted matches.
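To show why apply mode takes a batch size, here is a sketch of a batched delete loop for old server_snapshots rows. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere; the `delete_in_batches` function and the ISO-string cutoff are assumptions for the example, not the project's code.

```python
import sqlite3


def delete_in_batches(conn: sqlite3.Connection, cutoff_iso: str, batch_size: int = 5000) -> int:
    """Delete server_snapshots rows older than `cutoff_iso`, one batch per transaction."""
    total_deleted = 0
    while True:
        cursor = conn.execute(
            "delete from server_snapshots where rowid in ("
            " select rowid from server_snapshots"
            " where captured_at < ? limit ?)",
            (cutoff_iso, batch_size),
        )
        conn.commit()  # commit each batch so locks are held only briefly
        if cursor.rowcount == 0:
            break
        total_deleted += cursor.rowcount
    return total_deleted


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("create table server_snapshots (server_id text, captured_at text)")
    conn.executemany(
        "insert into server_snapshots values (?, ?)",
        [("demo", "2026-01-01T00:00:00Z")] * 12 + [("demo", "2026-06-19T00:00:00Z")] * 3,
    )
    print(delete_in_batches(conn, "2026-06-06T00:00:00Z", batch_size=5))
```

Smaller batches mean more round trips but shorter transactions, which is the trade-off behind HLL_DB_MAINTENANCE_BATCH_SIZE.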

Current critical AdminLog event types:

  • kill
  • match_start
  • match_end
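The split between critical and non-critical AdminLog events can be sketched as below. The function names and the representation of protected match ranges as inclusive (start, end) pairs are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Tuple

CRITICAL_EVENT_TYPES = frozenset({"kill", "match_start", "match_end"})


def admin_log_cutoff(
    event_type: str,
    now: datetime,
    noncritical_days: int = 30,
    critical_days: int = 90,
) -> datetime:
    """Events older than the returned moment are past retention."""
    days = critical_days if event_type in CRITICAL_EVENT_TYPES else noncritical_days
    return now - timedelta(days=days)


def is_delete_candidate(
    event_type: str,
    event_time: datetime,
    now: datetime,
    protected_ranges: Iterable[Tuple[datetime, datetime]] = (),
) -> bool:
    """True when an AdminLog event may be deleted under the rules above."""
    if event_time >= admin_log_cutoff(event_type, now):
        return False  # still inside its retention window
    if event_type in CRITICAL_EVENT_TYPES:
        # Critical events survive while a protected match range covers them.
        for start, end in protected_ranges:
            if start <= event_time <= end:
                return False
    return True
```

Only critical events consult the protected ranges; non-critical events are governed by retention alone.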

Dry-Run Command

From backend/:

python -m app.database_maintenance cleanup --dry-run

From the repository root with the backend package on PYTHONPATH (PowerShell syntax):

$env:PYTHONPATH='backend'
python -m app.database_maintenance cleanup --dry-run

Inside Docker Compose:

docker compose exec backend python -m app.database_maintenance cleanup --dry-run

Useful dry-run options (shown with PowerShell backtick line continuation):

docker compose exec backend python -m app.database_maintenance cleanup --dry-run `
  --recent-matches-keep 100 `
  --admin-log-noncritical-retention-days 30 `
  --admin-log-critical-retention-days 90 `
  --server-snapshot-retention-days 14 `
  --batch-size 5000

Dry-run is the safe preview path and should be reviewed before any production apply.

Apply Command

Local module execution:

python -m app.database_maintenance cleanup --apply

Docker Compose:

docker compose exec backend python -m app.database_maintenance cleanup --apply

One-off local validation with a fixed time anchor:

python -m app.database_maintenance cleanup --apply --now 2026-06-20T12:00:00Z

Optional maintenance vacuum/analyze:

python -m app.database_maintenance cleanup --apply --vacuum-analyze

Table-Size Audit SQL

select
  schemaname,
  relname as table_name,
  pg_size_pretty(pg_total_relation_size(relid)) as total_size,
  pg_size_pretty(pg_relation_size(relid)) as table_size,
  pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as indexes_size,
  n_live_tup as estimated_rows,
  n_dead_tup as estimated_dead_rows
from pg_stat_user_tables
order by pg_total_relation_size(relid) desc;

Row-Count And Age Audit SQL

AdminLog events by type/date

select
  event_type,
  count(*) as row_count,
  min(event_timestamp) as first_event_timestamp,
  max(event_timestamp) as last_event_timestamp,
  min(server_time) as first_server_time,
  max(server_time) as last_server_time
from rcon_admin_log_events
group by event_type
order by row_count desc, event_type asc;

Materialized matches by server/date

select
  target_key,
  source_basis,
  count(*) as matches,
  min(coalesce(ended_at, started_at)) as first_closed_at,
  max(coalesce(ended_at, started_at)) as last_closed_at
from rcon_materialized_matches
group by target_key, source_basis
order by target_key asc, source_basis asc;

Server snapshots by date

select
  server_id,
  min(captured_at) as first_captured_at,
  max(captured_at) as last_captured_at,
  count(*) as snapshot_rows
from server_snapshots
group by server_id
order by last_captured_at desc;

Displayed snapshots count

select
  snapshot_type,
  metric,
  snapshot_window,
  count(*) as snapshot_rows,
  min(generated_at) as first_generated_at,
  max(generated_at) as last_generated_at
from displayed_historical_snapshots
group by snapshot_type, metric, snapshot_window
order by snapshot_type asc, metric asc, snapshot_window asc;

Logs To Inspect

The cleanup command emits JSON logs. Minimum events to look for:

  • database-maintenance-started
  • database-maintenance-plan
  • database-maintenance-table-skipped
  • database-maintenance-delete-batch
  • database-maintenance-completed
  • database-maintenance-error

Examples:

docker compose logs --tail=200 backend
docker compose logs --tail=200 historical-runner

If scheduled cleanup is enabled:

docker compose logs --tail=200 historical-runner
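To pick the maintenance events out of a Compose log stream, a small filter like the one below may help. It is a sketch under assumptions: the JSON field holding the event name is assumed to be `event` (adjust `event_field` to match the real log shape), and any text before the first `{` on a line is treated as the Compose service prefix.

```python
import json
import sys
from typing import Dict, Iterable, Iterator

MAINTENANCE_EVENTS = frozenset({
    "database-maintenance-started",
    "database-maintenance-plan",
    "database-maintenance-table-skipped",
    "database-maintenance-delete-batch",
    "database-maintenance-completed",
    "database-maintenance-error",
})


def maintenance_records(lines: Iterable[str], event_field: str = "event") -> Iterator[Dict]:
    """Yield parsed JSON records whose event name is a maintenance event."""
    for line in lines:
        start = line.find("{")
        if start < 0:
            continue  # not a JSON log line
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # malformed or truncated line
        if isinstance(record, dict) and record.get(event_field) in MAINTENANCE_EVENTS:
            yield record


if __name__ == "__main__":
    # Reads log lines from standard input and prints matching records.
    for record in maintenance_records(sys.stdin):
        print(json.dumps(record))
```

Saved under a name of your choice, the script can be fed by piping the output of the docker compose logs commands above into it.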

Docker And Portainer Warnings

  • Never use docker compose down -v unless you intentionally want to delete PostgreSQL and mounted volume data.
  • Always review dry-run output before enabling apply in production.
  • Do not manually delete protected match or player-stat rows from PostgreSQL.
  • Keep backups before changing retention settings.
  • Do not add Comunidad Hispana #03 back into RCON targets in this task.
  • Do not add a separate maintenance container, host cron, or pg_cron job for this feature.

For Portainer-style operations the same warnings apply:

  • deleting volumes is destructive;
  • maintenance should run through the application command, not through manual table purges.

Rollback And Restore Considerations

  • Retention changes are destructive when apply mode runs.
  • Keep a PostgreSQL backup before enabling scheduled apply in production.
  • If cleanup removes too much data, recovery is restore-based, not “undo last delete.”
  • Favor dry-run, smaller batch sizes, and reviewed retention values before long-running scheduled apply.

Safe Operator Flow

  1. Audit table size and row ages with the SQL above.
  2. Run dry-run locally or in Compose.
  3. Review protected counts and candidate counts in JSON output.
  4. Enable HLL_DB_MAINTENANCE_ENABLED=true only after dry-run review.
  5. Monitor historical-runner logs for scheduler events and cleanup completion.