Sudden validation errors

Hi all, my ChirpStack installation suddenly stopped working and I am trying to debug the issue. I have some thoughts, but I wanted to pick your brains.

  1. I deployed ChirpStack using the chirpstack/chirpstack:4 image
  2. I configured a bunch of gateways, device profiles and devices
  3. Fast forward a month, and I started to see errors in the UI, more specifically errors like column device_profile.is_relay does not exist and Validator function error error=column device_profile.class_b_timeout does not exist, which was very concerning (a quick way to double-check this is sketched right after this list).
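
For anyone hitting similar errors, a quick generic way to double-check which columns the table actually has (plain Postgres, nothing ChirpStack-specific):

    -- List the columns currently present on public.device_profile,
    -- to confirm that is_relay / class_b_timeout are really gone.
    SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'public'
      AND table_name = 'device_profile'
    ORDER BY ordinal_position;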

Trying to dig further, I realised a few things:

  1. Checking the __diesel_schema_migrations table, I noticed that 2 migrations were executed a few weeks ago (see the query sketched right after this list).
  2. Comparing my back-ups between the pre-migration and post-migration states, I noticed that public.device_profile had indeed been severely crippled.
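
For reference, spotting the recently applied migrations is just a matter of sorting diesel's bookkeeping table by run_on (the full dump is further down):

    -- Show the most recently applied migrations first.
    SELECT version, run_on
    FROM public.__diesel_schema_migrations
    ORDER BY run_on DESC
    LIMIT 5;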

For reference, here is the post-migration schema; my currently deployed ChirpStack version is v4.10.1 (as shown in the ChirpStack UI):

CREATE TABLE public.device_profile (
    id uuid NOT NULL,
    tenant_id uuid NOT NULL,
    created_at timestamp with time zone NOT NULL,
    updated_at timestamp with time zone NOT NULL,
    name character varying(100) NOT NULL,
    region character varying(10) NOT NULL,
    mac_version character varying(10) NOT NULL,
    reg_params_revision character varying(20) NOT NULL,
    adr_algorithm_id character varying(100) NOT NULL,
    payload_codec_runtime character varying(20) NOT NULL,
    uplink_interval integer NOT NULL,
    device_status_req_interval integer NOT NULL,
    supports_otaa boolean NOT NULL,
    supports_class_b boolean NOT NULL,
    supports_class_c boolean NOT NULL,
    tags jsonb NOT NULL,
    payload_codec_script text NOT NULL,
    flush_queue_on_activate boolean NOT NULL,
    description text NOT NULL,
    measurements jsonb NOT NULL,
    auto_detect_measurements boolean NOT NULL,
    region_config_id character varying(100),
    allow_roaming boolean NOT NULL,
    rx1_delay smallint NOT NULL,
    abp_params jsonb,
    class_b_params jsonb,
    class_c_params jsonb,
    relay_params jsonb,
    app_layer_params jsonb NOT NULL
);

I understand it was a bad choice to use the 4 tag (instead of pinning an explicit version), which may have caused a newer version of ChirpStack to be deployed when the infrastructure got restarted at some point.

But I would never have expected this much damage, even if the deployment picked up another version from the v4 branch.

Does anyone have any insights into why this may have happened?

Could be the 4.7 changes, which made some database changes, although I didn't think those affected the device_profile data.

Could also be that one of your database versions is no longer supported; I've seen a lot of talk about old Redis versions failing recently.

Could be the 4.7 changes, which made some database changes, although I didn't think those affected the device_profile data.

That was also my first hunch, but I noticed that even the v4.7 tag has the is_relay field in its migrations.

Could also be that one of your database versions is no longer supported; I've seen a lot of talk about old Redis versions failing recently.

Not sure where Redis is involved in the Postgres schema :thinking:, but I also don’t have full context. My Redis version is 7.4.1 (and Postgres is 17.2.0).


Some other thoughts:

  • Is ChirpStack the only component that runs DB migrations? Could it be that one of the other components (chirpstack-rest-api or chirpstack-gateway-bridge) applied that migration?

  • Also, in case it helps, here is the data from the __diesel_schema_migrations table:

    COPY public.__diesel_schema_migrations (version, run_on) FROM stdin;
    00000000000000	2024-12-07 21:55:24.355487
    20220426153628	2024-12-07 21:55:24.383123
    20220428071028	2024-12-07 21:55:24.38533
    20220511084032	2024-12-07 21:55:24.386856
    20220614130020	2024-12-07 21:55:24.392495
    20221102090533	2024-12-07 21:55:24.394456
    20230103201442	2024-12-07 21:55:24.39641
    20230112130153	2024-12-07 21:55:24.397875
    20230206135050	2024-12-07 21:55:24.399435
    20230213103316	2024-12-07 21:55:24.402537
    20230216091535	2024-12-07 21:55:24.404127
    20230925105457	2024-12-07 21:55:24.410898
    20231019142614	2024-12-07 21:55:24.412543
    20231122120700	2024-12-07 21:55:24.414788
    20240207083424	2024-12-07 21:55:24.416508
    20240326134652	2024-12-07 21:55:24.418439
    20240430103242	2024-12-07 21:55:24.421689
    20240613122655	2024-12-07 21:55:24.423634
    20240916123034	2024-12-07 21:55:24.425978
    20241112135745	2024-12-17 09:05:29.777724
    20250113152218	2025-05-06 10:13:27.227628
    20250121093745	2025-05-06 10:13:27.283628
    \.
    

Right, I thought version was a random primary key seeded by the current date, but I am realizing now that it is the date/timestamp part of the migration file name :thinking:

And indeed, I see that these fields were dropped in 2025-01-13-152218_refactor_device_profile_fields (aka version 20250113152218) :see_no_evil:
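
For my own understanding, the pattern behind such a refactor is roughly the one below. This is only a sketch of the idea, not the actual 2025-01-13-152218 migration, and the jsonb key name is made up:

    -- Illustrative sketch only, NOT the real ChirpStack migration:
    -- fold an old scalar column into a jsonb *_params column, then drop it.
    ALTER TABLE device_profile ADD COLUMN class_b_params jsonb;

    UPDATE device_profile
    SET class_b_params = jsonb_build_object(
        'timeout', class_b_timeout  -- hypothetical key name
    )
    WHERE supports_class_b;

    ALTER TABLE device_profile DROP COLUMN class_b_timeout;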

OK, so I think I answered the issue myself… ChirpStack was probably bumped to v4.12.1 (possibly when it got deployed on another worker node, because that node pulled the latest v4 tag) and then bumped back to my initial v4.10.1 version (when it went back to the original worker node, because that was the last cached v4 tag).

I think the red herring here was the fact that many device_profile fields were refactored in the v4.12.1 release, and I thought the DB was "damaged" :see_no_evil:
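
Assuming the refactor really did move the old per-column settings into the new *_params jsonb columns (that is my reading of it, I haven't diffed the migration itself), a quick look at a few profiles should show the data sitting there rather than being lost:

    -- Spot-check that the settings now live inside the jsonb columns.
    SELECT name, class_b_params, class_c_params, relay_params, abp_params
    FROM public.device_profile
    LIMIT 5;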

(And I guess when the older v4.10.1 version was re-deployed, the "down" migration was not executed, because I assume that has to be run explicitly?)