This commit is contained in:
devRaGonSa
2026-06-05 16:57:25 +02:00
commit 0da8338ba8
310 changed files with 45849 additions and 0 deletions

View File

@@ -0,0 +1,307 @@
# Database Maintenance
## Overview
HLL Vietnam keeps database cleanup at the application level.
The current maintenance scope is intentionally narrow:
- old `server_snapshots`;
- old non-critical `rcon_admin_log_events`;
- old critical `rcon_admin_log_events` only after retention and protected-match checks;
- old non-protected `rcon_materialized_matches`;
- dependent `rcon_match_player_stats` for deleted matches.
The first maintenance pass does not routinely delete:
- `displayed_historical_snapshots`;
- file-based snapshots under `backend/data/snapshots/`;
- public-scoreboard `historical_*` fallback tables;
- `player_event_raw_ledger` and its worker metadata;
- Elo/MMR tables;
- Comunidad Hispana #03 data reactivation or targets.
## Why Application-Level And Not `pg_cron`
Cleanup is versioned in backend code instead of delegated to `pg_cron`, host cron, or a separate container because the retention logic depends on product rules:
- keep the latest 100 closed materialized matches;
- keep the current month;
- keep the previous month during the first 7 days of a new month;
- keep the current week;
- keep the previous week when weekly fallback may still need it;
- keep child stats for protected matches;
- avoid breaking current/live pages that still read recent AdminLog data.
Those rules belong with the applications read and write model, not inside database-only scheduling.
## Scheduled Cleanup Inside `historical-runner`
Database maintenance is scheduled inside `app.historical_runner`.
Behavior:
- disabled by default;
- no extra Docker service is added for maintenance;
- the runner checks whether maintenance is due;
- when enabled and due, the runner invokes `python -m app.database_maintenance cleanup --apply` behavior through the shared Python function;
- failures are logged and do not crash the historical runner loop;
- cleanup runs under the same writer-lock coordination used by the historical writer flows.
Relevant structured log events:
- `database-maintenance-scheduler-skipped-disabled`
- `database-maintenance-scheduler-skipped-not-due`
- `database-maintenance-scheduler-started`
- `database-maintenance-scheduler-completed`
- `database-maintenance-scheduler-failed`
## Environment Variables
Required maintenance-related variables:
```text
HLL_DB_MAINTENANCE_ENABLED=false
HLL_DB_MAINTENANCE_INTERVAL_SECONDS=43200
HLL_RECENT_MATCHES_KEEP=100
HLL_ADMIN_LOG_NONCRITICAL_RETENTION_DAYS=30
HLL_ADMIN_LOG_CRITICAL_RETENTION_DAYS=90
HLL_SERVER_SNAPSHOT_RETENTION_DAYS=14
HLL_DB_MAINTENANCE_BATCH_SIZE=5000
```
Meaning:
- `HLL_DB_MAINTENANCE_ENABLED`
Enables scheduled apply mode inside `historical-runner`.
- `HLL_DB_MAINTENANCE_INTERVAL_SECONDS`
Default scheduler interval. `43200` means every 12 hours.
- `HLL_RECENT_MATCHES_KEEP`
Number of latest closed materialized matches that must always be protected.
- `HLL_ADMIN_LOG_NONCRITICAL_RETENTION_DAYS`
Retention for non-critical AdminLog events such as chat/connect/disconnect.
- `HLL_ADMIN_LOG_CRITICAL_RETENTION_DAYS`
Retention for critical AdminLog events such as `kill`, `match_start`, `match_end`.
- `HLL_SERVER_SNAPSHOT_RETENTION_DAYS`
Retention for live server snapshots.
- `HLL_DB_MAINTENANCE_BATCH_SIZE`
Delete batch size for apply mode.
## Protected Data
The cleanup command protects:
- latest 100 closed materialized matches by default;
- current month materialized matches;
- previous month materialized matches when the current day is `1` through `7`;
- current week materialized matches;
- previous week materialized matches when weekly fallback may still need them;
- `rcon_match_player_stats` belonging to protected matches;
- current/live AdminLog data required for visible current-match surfaces;
- `displayed_historical_snapshots`;
- file snapshots in `backend/data/snapshots/`.
If a match timestamp cannot be interpreted safely, that match is skipped and protected instead of deleted.
## Deleted Data
Apply mode is currently allowed to delete:
- `server_snapshots` older than retention;
- non-critical `rcon_admin_log_events` older than retention;
- critical `rcon_admin_log_events` older than retention only when they are not required by protected materialized match ranges;
- non-protected `rcon_materialized_matches`;
- dependent `rcon_match_player_stats` for deleted matches.
Current critical AdminLog event types:
- `kill`
- `match_start`
- `match_end`
## Dry-Run Command
From `backend/`:
```powershell
python -m app.database_maintenance cleanup --dry-run
```
From the repository root with the backend package on `PYTHONPATH`:
```powershell
$env:PYTHONPATH='backend'
python -m app.database_maintenance cleanup --dry-run
```
Inside Docker Compose:
```powershell
docker compose exec backend python -m app.database_maintenance cleanup --dry-run
```
Useful dry-run options:
```powershell
docker compose exec backend python -m app.database_maintenance cleanup --dry-run `
--recent-matches-keep 100 `
--admin-log-noncritical-retention-days 30 `
--admin-log-critical-retention-days 90 `
--server-snapshot-retention-days 14 `
--batch-size 5000
```
Dry-run is the safe preview path and should be reviewed before any production apply.
## Apply Command
Local module execution:
```powershell
python -m app.database_maintenance cleanup --apply
```
Docker Compose:
```powershell
docker compose exec backend python -m app.database_maintenance cleanup --apply
```
One-off local validation with a fixed time anchor:
```powershell
python -m app.database_maintenance cleanup --apply --now 2026-06-20T12:00:00Z
```
Optional maintenance vacuum/analyze:
```powershell
python -m app.database_maintenance cleanup --apply --vacuum-analyze
```
## Table-Size Audit SQL
```sql
select
schemaname,
relname as table_name,
pg_size_pretty(pg_total_relation_size(relid)) as total_size,
pg_size_pretty(pg_relation_size(relid)) as table_size,
pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as indexes_size,
n_live_tup as estimated_rows,
n_dead_tup as estimated_dead_rows
from pg_stat_user_tables
order by pg_total_relation_size(relid) desc;
```
## Row-Count And Age Audit SQL
### AdminLog events by type/date
```sql
select
event_type,
count(*) as row_count,
min(event_timestamp) as first_event_timestamp,
max(event_timestamp) as last_event_timestamp,
min(server_time) as first_server_time,
max(server_time) as last_server_time
from rcon_admin_log_events
group by event_type
order by row_count desc, event_type asc;
```
### Materialized matches by server/date
```sql
select
target_key,
source_basis,
count(*) as matches,
min(coalesce(ended_at, started_at)) as first_closed_at,
max(coalesce(ended_at, started_at)) as last_closed_at
from rcon_materialized_matches
group by target_key, source_basis
order by target_key asc, source_basis asc;
```
### Server snapshots by date
```sql
select
server_id,
min(captured_at) as first_captured_at,
max(captured_at) as last_captured_at,
count(*) as snapshot_rows
from server_snapshots
group by server_id
order by last_captured_at desc;
```
### Displayed snapshots count
```sql
select
snapshot_type,
metric,
snapshot_window,
count(*) as snapshot_rows,
min(generated_at) as first_generated_at,
max(generated_at) as last_generated_at
from displayed_historical_snapshots
group by snapshot_type, metric, snapshot_window
order by snapshot_type asc, metric asc, snapshot_window asc;
```
## Logs To Inspect
The cleanup command emits JSON logs. Minimum events to look for:
- `database-maintenance-started`
- `database-maintenance-plan`
- `database-maintenance-table-skipped`
- `database-maintenance-delete-batch`
- `database-maintenance-completed`
- `database-maintenance-error`
Examples:
```powershell
docker compose logs --tail=200 backend
docker compose logs --tail=200 historical-runner
```
If scheduled cleanup is enabled:
```powershell
docker compose logs --tail=200 historical-runner
```
## Docker And Portainer Warnings
- Never use `docker compose down -v` unless you intentionally want to delete PostgreSQL and mounted volume data.
- Always review dry-run output before enabling apply in production.
- Do not manually delete protected match or player-stat rows from PostgreSQL.
- Keep backups before changing retention settings.
- Do not add Comunidad Hispana #03 back into RCON targets in this task.
- Do not add a separate maintenance container, host cron, or `pg_cron` job for this feature.
For Portainer-style operations the same warning applies:
- deleting volumes is destructive;
- maintenance should run through the application command, not through manual table purges.
## Rollback And Restore Considerations
- Retention changes are destructive when apply mode runs.
- Keep a PostgreSQL backup before enabling scheduled apply in production.
- If cleanup removes too much data, recovery is restore-based, not “undo last delete.”
- Favor dry-run, smaller batch sizes, and reviewed retention values before long-running scheduled apply.
## Safe Operator Flow
1. Audit table size and row ages with the SQL above.
2. Run dry-run locally or in Compose.
3. Review protected counts and candidate counts in JSON output.
4. Enable `HLL_DB_MAINTENANCE_ENABLED=true` only after dry-run review.
5. Monitor `historical-runner` logs for scheduler events and cleanup completion.