# Database Maintenance

## Overview

HLL Vietnam keeps database cleanup at the application level.

The current maintenance scope is intentionally narrow:

- old `server_snapshots`;
- old non-critical `rcon_admin_log_events`;
- old critical `rcon_admin_log_events` only after retention and protected-match checks;
- old non-protected `rcon_materialized_matches`;
- dependent `rcon_match_player_stats` for deleted matches.

The first maintenance pass does not routinely delete:

- `displayed_historical_snapshots`;
- file-based snapshots under `backend/data/snapshots/`;
- public-scoreboard `historical_*` fallback tables;
- `player_event_raw_ledger` and its worker metadata;
- Elo/MMR tables;
- Comunidad Hispana #03 data reactivation or targets.

## Why Application-Level And Not `pg_cron`

Cleanup is versioned in backend code instead of delegated to `pg_cron`, host cron, or a separate container because the retention logic depends on product rules:

- keep the latest 100 closed materialized matches;
- keep the current month;
- keep the previous month during the first 7 days of a new month;
- keep the current week;
- keep the previous week when weekly fallback may still need it;
- keep child stats for protected matches;
- avoid breaking current/live pages that still read recent AdminLog data.

Those rules belong with the application’s read and write model, not inside database-only scheduling.

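The calendar rules above are the kind of logic that is awkward to express in a database scheduler. A minimal sketch of how they could be computed in Python follows. The function names, the use of UTC, the Monday week start, and the `weekly_fallback` flag are assumptions for illustration, not the project's actual implementation.

```python
from datetime import date, datetime, timedelta, timezone


def protected_ranges(now: datetime, weekly_fallback: bool = True) -> list[tuple[date, date]]:
    """Return inclusive (start, end) date ranges whose matches must be kept."""
    today = now.astimezone(timezone.utc).date()  # assumption: rules are evaluated in UTC
    ranges = []

    # Current month: from the 1st up to today.
    month_start = today.replace(day=1)
    ranges.append((month_start, today))

    # Previous month, only during the first 7 days of a new month.
    if today.day <= 7:
        prev_month_end = month_start - timedelta(days=1)
        ranges.append((prev_month_end.replace(day=1), prev_month_end))

    # Current week, assuming weeks start on Monday.
    week_start = today - timedelta(days=today.weekday())
    ranges.append((week_start, today))

    # Previous week, kept while the weekly fallback may still read it.
    if weekly_fallback:
        ranges.append((week_start - timedelta(days=7), week_start - timedelta(days=1)))

    return ranges


def is_protected(closed_on: date, ranges: list[tuple[date, date]]) -> bool:
    """True when the match close date falls inside any protected range."""
    return any(start <= closed_on <= end for start, end in ranges)
```

The "latest 100 closed matches" rule is not covered here because it needs a query against `rcon_materialized_matches` rather than date arithmetic.
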
## Scheduled Cleanup Inside `historical-runner`

Database maintenance is scheduled inside `app.historical_runner`.

Behavior:

- disabled by default;
- no extra Docker service is added for maintenance;
- the runner checks whether maintenance is due;
- when enabled and due, the runner calls the same shared Python function that backs `python -m app.database_maintenance cleanup --apply`;
- failures are logged and do not crash the historical runner loop;
- cleanup runs under the same writer-lock coordination used by the historical writer flows.

Relevant structured log events:

- `database-maintenance-scheduler-skipped-disabled`
- `database-maintenance-scheduler-skipped-not-due`
- `database-maintenance-scheduler-started`
- `database-maintenance-scheduler-completed`
- `database-maintenance-scheduler-failed`

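The scheduling behavior above can be sketched as a single function called once per runner loop iteration. This is an illustration only: `maybe_run_maintenance`, its parameters, and the logger name are hypothetical, writer-lock handling is omitted, and the choice to wait a full interval after a failed run is an assumption.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger("historical_runner")  # hypothetical logger name


def maybe_run_maintenance(
    enabled: bool,
    interval_seconds: float,
    last_run_at: Optional[float],
    run_cleanup: Callable[[], None],
    now: Optional[float] = None,
) -> Optional[float]:
    """Run cleanup when enabled and due; return the updated last-run time."""
    now = time.monotonic() if now is None else now

    if not enabled:
        logger.info("database-maintenance-scheduler-skipped-disabled")
        return last_run_at

    if last_run_at is not None and now - last_run_at < interval_seconds:
        logger.info("database-maintenance-scheduler-skipped-not-due")
        return last_run_at

    logger.info("database-maintenance-scheduler-started")
    try:
        run_cleanup()
    except Exception:
        # A failed cleanup is logged and swallowed so the runner loop keeps going.
        logger.exception("database-maintenance-scheduler-failed")
    else:
        logger.info("database-maintenance-scheduler-completed")

    # Assumption: a failed attempt also resets the timer, so the next try
    # happens one full interval later instead of on every loop iteration.
    return now
```

Passing `run_cleanup` as a callable keeps the due-check testable without a database connection.
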
## Environment Variables

Required maintenance-related variables:

```text
HLL_DB_MAINTENANCE_ENABLED=false
HLL_DB_MAINTENANCE_INTERVAL_SECONDS=43200
HLL_RECENT_MATCHES_KEEP=100
HLL_ADMIN_LOG_NONCRITICAL_RETENTION_DAYS=30
HLL_ADMIN_LOG_CRITICAL_RETENTION_DAYS=90
HLL_SERVER_SNAPSHOT_RETENTION_DAYS=14
HLL_DB_MAINTENANCE_BATCH_SIZE=5000
```

Meaning:

- `HLL_DB_MAINTENANCE_ENABLED`
  Enables scheduled apply mode inside `historical-runner`.
- `HLL_DB_MAINTENANCE_INTERVAL_SECONDS`
  Scheduler interval in seconds. The default `43200` means every 12 hours.
- `HLL_RECENT_MATCHES_KEEP`
  Number of latest closed materialized matches that must always be protected.
- `HLL_ADMIN_LOG_NONCRITICAL_RETENTION_DAYS`
  Retention in days for non-critical AdminLog events such as chat, connect, and disconnect.
- `HLL_ADMIN_LOG_CRITICAL_RETENTION_DAYS`
  Retention in days for critical AdminLog events such as `kill`, `match_start`, and `match_end`.
- `HLL_SERVER_SNAPSHOT_RETENTION_DAYS`
  Retention in days for live server snapshots.
- `HLL_DB_MAINTENANCE_BATCH_SIZE`
  Number of rows deleted per batch in apply mode.

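As a rough illustration of how these variables could be read and validated in one place, the sketch below loads them into a frozen dataclass. `MaintenanceSettings`, `load_settings`, the accepted truthy spellings, and the rejection of non-positive numbers are all assumptions; only the variable names and the values listed above come from this document.

```python
import os
from dataclasses import dataclass
from typing import Mapping

TRUTHY = {"1", "true", "yes", "on"}  # assumption: accepted spellings for "enabled"


@dataclass(frozen=True)
class MaintenanceSettings:
    enabled: bool
    interval_seconds: int
    recent_matches_keep: int
    admin_log_noncritical_retention_days: int
    admin_log_critical_retention_days: int
    server_snapshot_retention_days: int
    batch_size: int


def load_settings(env: Mapping[str, str] = os.environ) -> MaintenanceSettings:
    """Build settings from environment variables, falling back to the listed values."""

    def positive_int(name: str, default: int) -> int:
        value = int(env.get(name, default))
        if value <= 0:
            # Assumption: zero or negative retention is treated as a config error.
            raise ValueError(f"{name} must be a positive integer, got {value}")
        return value

    return MaintenanceSettings(
        enabled=env.get("HLL_DB_MAINTENANCE_ENABLED", "false").strip().lower() in TRUTHY,
        interval_seconds=positive_int("HLL_DB_MAINTENANCE_INTERVAL_SECONDS", 43200),
        recent_matches_keep=positive_int("HLL_RECENT_MATCHES_KEEP", 100),
        admin_log_noncritical_retention_days=positive_int(
            "HLL_ADMIN_LOG_NONCRITICAL_RETENTION_DAYS", 30
        ),
        admin_log_critical_retention_days=positive_int(
            "HLL_ADMIN_LOG_CRITICAL_RETENTION_DAYS", 90
        ),
        server_snapshot_retention_days=positive_int("HLL_SERVER_SNAPSHOT_RETENTION_DAYS", 14),
        batch_size=positive_int("HLL_DB_MAINTENANCE_BATCH_SIZE", 5000),
    )
```

Failing fast on a bad value is a deliberate choice in this sketch: a mistyped retention setting is safer as a startup error than as an unexpectedly aggressive delete.
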
## Protected Data

The cleanup command protects:

- latest 100 closed materialized matches by default;
- current month materialized matches;
- previous month materialized matches when the current day is `1` through `7`;
- current week materialized matches;
- previous week materialized matches when weekly fallback may still need them;
- `rcon_match_player_stats` belonging to protected matches;
- current/live AdminLog data required for visible current-match surfaces;
- `displayed_historical_snapshots`;
- file snapshots in `backend/data/snapshots/`.

If a match timestamp cannot be interpreted safely, that match is skipped and protected instead of deleted.

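The fail-safe rule for unreadable timestamps can be illustrated with a small sketch. The helper names are hypothetical, and treating timezone-naive values as unsafe is an assumption made for this example, not something the document specifies.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_match_timestamp(raw: object) -> Optional[datetime]:
    """Return a UTC datetime, or None when the value cannot be trusted."""
    if isinstance(raw, datetime):
        parsed = raw
    elif isinstance(raw, str) and raw.strip():
        try:
            # Map a trailing "Z" to an explicit offset for older Python versions.
            parsed = datetime.fromisoformat(raw.strip().replace("Z", "+00:00"))
        except ValueError:
            return None
    else:
        return None

    if parsed.tzinfo is None:
        # Assumption: a naive timestamp is ambiguous, so it is not trusted.
        return None
    return parsed.astimezone(timezone.utc)


def may_delete_match(raw_closed_at: object, cutoff: datetime) -> bool:
    """A match is a delete candidate only if its timestamp parses and is old enough."""
    closed_at = parse_match_timestamp(raw_closed_at)
    return closed_at is not None and closed_at < cutoff
```

The key property is that every failure path returns `None`, and `None` always means "keep the row".
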
## Deleted Data

Apply mode is currently allowed to delete:

- `server_snapshots` older than retention;
- non-critical `rcon_admin_log_events` older than retention;
- critical `rcon_admin_log_events` older than retention only when they are not required by protected materialized match ranges;
- non-protected `rcon_materialized_matches`;
- dependent `rcon_match_player_stats` for deleted matches.

Current critical AdminLog event types:

- `kill`
- `match_start`
- `match_end`

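The split between critical and non-critical AdminLog retention can be sketched as follows. The function names and the representation of protected match ranges as `(start, end)` datetime pairs are assumptions for illustration; the event types and default day counts are the ones listed in this document.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

CRITICAL_EVENT_TYPES = frozenset({"kill", "match_start", "match_end"})


def admin_log_cutoff(
    event_type: str,
    now: datetime,
    noncritical_days: int = 30,
    critical_days: int = 90,
) -> datetime:
    """Events strictly older than the returned time are past retention."""
    days = critical_days if event_type in CRITICAL_EVENT_TYPES else noncritical_days
    return now - timedelta(days=days)


def is_delete_candidate(
    event_type: str,
    event_time: datetime,
    now: datetime,
    protected_match_ranges: Iterable[Tuple[datetime, datetime]] = (),
) -> bool:
    """Apply retention first, then the protected-match check for critical events."""
    if event_time >= admin_log_cutoff(event_type, now):
        return False  # still inside retention
    if event_type in CRITICAL_EVENT_TYPES:
        # Critical events survive while a protected match still spans them.
        return not any(start <= event_time <= end for start, end in protected_match_ranges)
    return True
```
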
## Dry-Run Command

From `backend/`:

```powershell
python -m app.database_maintenance cleanup --dry-run
```

From the repository root with the backend package on `PYTHONPATH`:

```powershell
$env:PYTHONPATH='backend'
python -m app.database_maintenance cleanup --dry-run
```

Inside Docker Compose:

```powershell
docker compose exec backend python -m app.database_maintenance cleanup --dry-run
```

Useful dry-run options:

```powershell
docker compose exec backend python -m app.database_maintenance cleanup --dry-run `
  --recent-matches-keep 100 `
  --admin-log-noncritical-retention-days 30 `
  --admin-log-critical-retention-days 90 `
  --server-snapshot-retention-days 14 `
  --batch-size 5000
```

Dry-run is the safe preview path and should be reviewed before any production apply.

## Apply Command

Local module execution:

```powershell
python -m app.database_maintenance cleanup --apply
```

Docker Compose:

```powershell
docker compose exec backend python -m app.database_maintenance cleanup --apply
```

One-off local validation with a fixed time anchor:

```powershell
python -m app.database_maintenance cleanup --apply --now 2026-06-20T12:00:00Z
```

Optional maintenance vacuum/analyze:

```powershell
python -m app.database_maintenance cleanup --apply --vacuum-analyze
```

## Table-Size Audit SQL

```sql
select
  schemaname,
  relname as table_name,
  pg_size_pretty(pg_total_relation_size(relid)) as total_size,
  pg_size_pretty(pg_relation_size(relid)) as table_size,
  pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as indexes_size,
  n_live_tup as estimated_rows,
  n_dead_tup as estimated_dead_rows
from pg_stat_user_tables
order by pg_total_relation_size(relid) desc;
```

## Row-Count And Age Audit SQL

### AdminLog events by type/date

```sql
select
  event_type,
  count(*) as row_count,
  min(event_timestamp) as first_event_timestamp,
  max(event_timestamp) as last_event_timestamp,
  min(server_time) as first_server_time,
  max(server_time) as last_server_time
from rcon_admin_log_events
group by event_type
order by row_count desc, event_type asc;
```

### Materialized matches by server/date

```sql
select
  target_key,
  source_basis,
  count(*) as matches,
  min(coalesce(ended_at, started_at)) as first_closed_at,
  max(coalesce(ended_at, started_at)) as last_closed_at
from rcon_materialized_matches
group by target_key, source_basis
order by target_key asc, source_basis asc;
```

### Server snapshots by date

```sql
select
  server_id,
  min(captured_at) as first_captured_at,
  max(captured_at) as last_captured_at,
  count(*) as snapshot_rows
from server_snapshots
group by server_id
order by last_captured_at desc;
```

### Displayed snapshots count

```sql
select
  snapshot_type,
  metric,
  snapshot_window,
  count(*) as snapshot_rows,
  min(generated_at) as first_generated_at,
  max(generated_at) as last_generated_at
from displayed_historical_snapshots
group by snapshot_type, metric, snapshot_window
order by snapshot_type asc, metric asc, snapshot_window asc;
```

## Logs To Inspect

The cleanup command emits JSON logs. Minimum events to look for:

- `database-maintenance-started`
- `database-maintenance-plan`
- `database-maintenance-table-skipped`
- `database-maintenance-delete-batch`
- `database-maintenance-completed`
- `database-maintenance-error`

Examples:

```powershell
docker compose logs --tail=200 backend
docker compose logs --tail=200 historical-runner
```

If scheduled cleanup is enabled, the scheduler events appear in the `historical-runner` logs:

```powershell
docker compose logs --tail=200 historical-runner
```

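When the log output is long, a small filter can pull out only the maintenance events. The sketch below is an illustration under stated assumptions: it assumes each JSON record carries the event name under an `event` key (the actual field name is not documented here) and that Compose prefixes each line with the service name before the JSON payload.

```python
import json
from typing import Iterable, Iterator

EVENT_PREFIX = "database-maintenance-"


def iter_maintenance_events(lines: Iterable[str], event_key: str = "event") -> Iterator[dict]:
    """Yield parsed JSON records whose event name starts with the maintenance prefix."""
    for line in lines:
        # Skip any "service-1  | " prefix by starting at the first brace.
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if isinstance(record, dict) and str(record.get(event_key, "")).startswith(EVENT_PREFIX):
            yield record
```

Because the prefix match also covers `database-maintenance-scheduler-*`, the same filter works for both manual runs and scheduled runs.
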
## Docker And Portainer Warnings

- Never use `docker compose down -v` unless you intentionally want to delete PostgreSQL and mounted volume data.
- Always review dry-run output before enabling apply in production.
- Do not manually delete protected match or player-stat rows from PostgreSQL.
- Keep backups before changing retention settings.
- Do not add Comunidad Hispana #03 back into RCON targets in this task.
- Do not add a separate maintenance container, host cron, or `pg_cron` job for this feature.

For Portainer-style operations the same warnings apply:

- deleting volumes is destructive;
- maintenance should run through the application command, not through manual table purges.

## Rollback And Restore Considerations

- Retention changes are destructive when apply mode runs.
- Keep a PostgreSQL backup before enabling scheduled apply in production.
- If cleanup removes too much data, recovery is restore-based, not “undo last delete.”
- Favor dry-run, smaller batch sizes, and reviewed retention values before long-running scheduled apply.

## Safe Operator Flow

1. Audit table size and row ages with the SQL above.
2. Run a dry-run locally or in Compose.
3. Review protected counts and candidate counts in the JSON output.
4. Set `HLL_DB_MAINTENANCE_ENABLED=true` only after the dry-run review.
5. Monitor `historical-runner` logs for scheduler events and cleanup completion.