Today I’ll explain how to solve this warning reported by ceph status.
Maybe this case does not match your error exactly, but the commands I used should give you a path to follow to diagnose and solve it.
The whole process is also documented on the wiki.
Check the general cluster status:
# ceph -s
...
health: HEALTH_WARN
Degraded data redundancy: 2 pgs degraded, 8 pgs undersized
...
# ceph health detail
HEALTH_WARN Degraded data redundancy: 2 pgs degraded, 8 pgs undersized
PG_DEGRADED Degraded data redundancy: 2 pgs degraded, 8 pgs undersized
pg 14.0 is stuck undersized for 510298.054479, current state active+undersized, last acting [5,12]
pg 14.1 is stuck undersized for 510298.091712, current state active+undersized, last acting [18,7]
pg 14.2 is stuck undersized for 510298.007891, current state active+undersized+degraded, last acting [7,18]
pg 14.3 is stuck undersized for 510298.086409, current state active+undersized, last acting [8,5]
pg 14.4 is stuck undersized for 510298.054479, current state active+undersized+degraded, last acting [5,18]
pg 14.5 is stuck undersized for 510298.033776, current state active+undersized, last acting [16,1]
pg 14.6 is stuck undersized for 510298.086409, current state active+undersized, last acting [8,3]
pg 14.7 is stuck undersized for 510298.091649, current state active+undersized, last acting [18,3]
Why is pool 14 the **only one** whose PGs are failing?
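All the complaining PGs share the 14.x prefix, so they all belong to pool 14. To list only that pool's PGs in one shot, ceph pg ls accepts the pool id as its first argument (just a command sketch, output omitted):
# ceph pg ls 14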
Getting details of one PG:
# ceph tell 14.0 query | jq .
{
"state": "active+undersized",
...
"stat_sum": {
"num_bytes": 0,
"num_objects": 0,
"num_object_clones": 0,
"num_object_copies": 0,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 0,
"num_whiteouts": 0,
"num_read": 0,
"num_read_kb": 0,
"num_write": 0,
"num_write_kb": 0,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 0,
"num_bytes_recovered": 0,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0,
"num_omap_bytes": 0,
"num_omap_keys": 0,
"num_objects_repaired": 0
...
],
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2020-08-11 11:50:38.233290",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"pull_from_peer": [],
"pushing": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.max_end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2020-08-11 11:50:37.502984"
}
],
"agent_state": {}
}
The PGs of pool 14 have no data and no activity!
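By the way, instead of scrolling through the whole JSON you can extract just the interesting fields with jq (state, up and acting are all part of the query output; adjust the filter to taste):
# ceph tell 14.0 query | jq '{state: .state, up: .up, acting: .acting}'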
Let’s take an in-depth look at the rest of the PGs:
# ceph pg dump
version 2063560
stamp 2020-08-17 09:46:06.557196
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
...
5.7 3 0 0 0 0 897 0 0 3 3 active+clean 2020-08-17 00:25:41.809330 3605'5 3629:3491 [5,15,14,18] 5 [5,15,14,18] 5 3605'5 2020-08-17 00:25:41.809266 3605'5 2020-08-13 05:37:52.184231 0
14.6 0 0 0 0 0 0 0 0 0 0 active+undersized 2020-08-11 11:50:38.208103 0'0 3628:8 [8,3] 8 [8,3] 8 0'0 2020-08-11 11:50:37.135596 0'0 2020-08-11 11:50:37.135596 0
...
14 0 0 0 0 0 0 0 0 14 14
13 4652 0 0 0 0 259597692 2169033596 4701468 24546 24546
12 1182753 0 0 0 0 80144984316 0 0 98035 98035
11 15 0 0 0 0 0 75562576 256644 21660 21660
10 69520 0 0 0 0 29298471706 0 0 24446 24446
5 5 0 0 0 0 2050 0 0 5 5
6 8 0 0 0 0 0 0 0 2747 2747
7 76 0 0 0 0 14374 11048 60 12097 12097
8 207 0 0 0 0 0 0 0 24532 24532
sum 1257236 0 0 0 0 109703070138 2244607220 4958172 208082 208082
OSD_STAT USED AVAIL USED_RAW TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
19 32 GiB 2.0 TiB 33 GiB 2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18] 17 4
18 28 GiB 2.0 TiB 29 GiB 2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19] 22 6
17 48 GiB 2.0 TiB 49 GiB 2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19] 21 9
16 20 GiB 2.0 TiB 21 GiB 2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18,19] 17 4
15 32 GiB 2.0 TiB 33 GiB 2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,17,18,19] 19 4
14 36 GiB 2.0 TiB 37 GiB 2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19] 20 2
13 27 GiB 2.0 TiB 28 GiB 2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19] 17 3
12 47 GiB 2.0 TiB 48 GiB 2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19] 23 8
11 12 GiB 2.0 TiB 13 GiB 2.0 TiB [0,1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19] 11 5
10 17 GiB 2.0 TiB 18 GiB 2.0 TiB [0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19] 14 2
3 24 GiB 2.0 TiB 25 GiB 2.0 TiB [0,1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 14 5
2 20 GiB 2.0 TiB 21 GiB 2.0 TiB [0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 12 4
1 36 GiB 2.0 TiB 37 GiB 2.0 TiB [0,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 18 4
0 41 GiB 2.0 TiB 42 GiB 2.0 TiB [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 21 2
4 24 GiB 2.0 TiB 25 GiB 2.0 TiB [0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 18 4
5 40 GiB 2.0 TiB 42 GiB 2.0 TiB [0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19] 23 10
6 55 GiB 1.9 TiB 56 GiB 2.0 TiB [0,1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19] 21 4
7 35 GiB 2.0 TiB 36 GiB 2.0 TiB [0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19] 23 5
8 32 GiB 2.0 TiB 33 GiB 2.0 TiB [0,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19] 16 6
9 31 GiB 2.0 TiB 33 GiB 2.0 TiB [0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19] 21 5
sum 636 GiB 39 TiB 659 GiB 40 TiB
* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.
Pool 14 is the only one that **doesn’t have** high availability (3 replicas or more)… why?
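To double-check the configured replica count of every pool, a quick shell loop does the trick (a sketch, nothing pool-14-specific):
# for p in $(ceph osd pool ls); do echo -n "$p: "; ceph osd pool get $p size; done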
# ceph pg dump_stuck inactive
ok
# ceph pg dump_stuck stale
ok
# ceph pg dump_stuck undersized
ok
PG_STAT STATE                      UP     UP_PRIMARY ACTING ACTING_PRIMARY
14.1    active+undersized          [18,7] 18         [18,7] 18
14.0    active+undersized          [5,12] 5          [5,12] 5
14.3    active+undersized          [8,5]  8          [8,5]  8
14.2    active+undersized+degraded [7,18] 7          [7,18] 7
14.6    active+undersized          [8,3]  8          [8,3]  8
14.7    active+undersized          [18,3] 18         [18,3] 18
14.4    active+undersized+degraded [5,18] 5          [5,18] 5
14.5    active+undersized          [16,1] 16         [16,1] 16
# ceph pg force-recovery 14.0
pg 14.0 doesn't require recovery;
# ceph pg force-backfill 14.0
pg 14.0 doesn't require backfilling;
# ceph pg force-recovery 14.4
instructing pg(s) [14.4] on osd.5 to force-recovery;
# ceph pg force-backfill 14.4
instructing pg(s) [14.4] on osd.5 to force-backfill;
# ceph pg force-recovery 14.2
instructing pg(s) [14.2] on osd.7 to force-recovery;
# ceph pg force-backfill 14.2
instructing pg(s) [14.2] on osd.7 to force-backfill;
# ceph pg ls
PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES    OMAP_BYTES* OMAP_KEYS* LOG  STATE                      SINCE VERSION     REPORTED    UP             ACTING         SCRUB_STAMP                DEEP_SCRUB_STAMP
5.0  1       0        0         0       348      0           0          1    active+clean               18h   3605'2      ...
13.7 559     0        0         0       25166390 290987928   629844     3072 active+clean               33h   3629'373240 3629:428639 [16,2,17,1]p16 [16,2,17,1]p16 2020-08-16 00:16:42.372384 2020-08-13 15:45:28.525122
14.0 0       0        0         0       0        0           0          0    active+undersized          5d    0'0         3628:8      [5,12]p5       [5,12]p5       2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
14.1 0       0        0         0       0        0           0          0    active+undersized          5d    0'0         3628:8      [18,7]p18      [18,7]p18      2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
14.2 0       0        0         0       0        0           0          7    active+undersized+degraded 5d    3629'7      3629:21     [7,18]p7       [7,18]p7       2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
14.3 0       0        0         0       0        0           0          0    active+undersized          5d    0'0         3628:8      [8,5]p8        [8,5]p8        2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
14.4 0       0        0         0       0        0           0          7    active+undersized+degraded 5d    3629'7      3629:21     [5,18]p5       [5,18]p5       2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
14.5 0       0        0         0       0        0           0          0    active+undersized          5d    0'0         3628:8      [16,1]p16      [16,1]p16      2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
14.6 0       0        0         0       0        0           0          0    active+undersized          5d    0'0         3628:8      [8,3]p8        [8,3]p8        2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
14.7 0       0        0         0       0        0           0          0    active+undersized          5d    0'0         3628:8      [18,3]p18      [18,3]p18      2020-08-11 11:50:37.135596 2020-08-11 11:50:37.135596
* NOTE: Omap statistics are gathered during deep scrub and may be inaccurate soon afterwards depending on utilisation. See http://docs.ceph.com/docs/master/dev/placement-group/#omap-statistics for further details.
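As an aside, if a forced recovery or backfill turns out to be unnecessary, it can be cancelled again (these cancel commands exist in recent releases, Luminous and later as far as I remember):
# ceph pg cancel-force-recovery 14.4
# ceph pg cancel-force-backfill 14.4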
Thoughts: \\
* The PGs of pool 14 are **empty**!
* The PGs of pool 14 are not being scrubbed!
* The PGs of pool 14 are not backfilling!
\\
**WHY?**
# ceph osd lspools
5 .rgw.root
6 default.rgw.control
7 default.rgw.meta
8 default.rgw.log
10 default.rgw.buckets.data
11 default.rgw.buckets.index
12 cephfs_data-ftp
13 cephfs_metadata-ftp
14 default.rgw.buckets.non-ec
Maybe the PGs of pool 14 are **EMPTY** because pool 14 (default.rgw.buckets.non-ec) is unused!\\
\\
The pool is indeed empty, but there are other empty pools and the warning only comes from this one.
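If you want to confirm that the pool really holds no objects, the standard rados tooling can do it (just a sketch, output omitted):
# rados df | grep non-ec
# rados -p default.rgw.buckets.non-ec ls | head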
# ceph osd pool autoscale-status
POOL                       SIZE TARGET SIZE RATE RAW CAPACITY RATIO  TARGET RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE
default.rgw.buckets.non-ec 0                3.0  40940G       0.0000              1.0  8                 warn
...
default.rgw.log            0                4.0  40940G       0.0000              1.0  8                 on
This pool does not have the PG autoscaler enabled!\\
Turning it on:
# ceph osd pool set default.rgw.buckets.non-ec pg_autoscale_mode on
set pool 14 pg_autoscale_mode to on
# ceph osd pool autoscale-status
POOL                       SIZE TARGET SIZE RATE RAW CAPACITY RATIO  TARGET RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE
default.rgw.buckets.non-ec 0                3.0  40940G       0.0000              1.0  8                 on
...
default.rgw.log            0                4.0  40940G       0.0000              1.0  8                 on
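Optionally, to avoid this with pools created in the future, the autoscaler can be enabled by default cluster-wide (assuming Nautilus or later; this is the osd_pool_default_pg_autoscale_mode option):
# ceph config set global osd_pool_default_pg_autoscale_mode on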
Check:
# ceph -s
...
health: HEALTH_WARN
Degraded data redundancy: 8 pgs undersized
...
# ceph health detail
HEALTH_WARN Degraded data redundancy: 8 pgs undersized
PG_DEGRADED Degraded data redundancy: 8 pgs undersized
pg 14.0 is stuck undersized for 513135.338453, current state active+undersized, last acting [5,12]
pg 14.1 is stuck undersized for 513135.375686, current state active+undersized, last acting [18,7]
pg 14.2 is stuck undersized for 513135.291865, current state active+undersized, last acting [7,18]
pg 14.3 is stuck undersized for 513135.370383, current state active+undersized, last acting [8,5]
pg 14.4 is stuck undersized for 513135.338453, current state active+undersized, last acting [5,18]
pg 14.5 is stuck undersized for 513135.317750, current state active+undersized, last acting [16,1]
pg 14.6 is stuck undersized for 513135.370383, current state active+undersized, last acting [8,3]
pg 14.7 is stuck undersized for 513135.375623, current state active+undersized, last acting [18,3]
Ceph is moving! The degraded PGs are gone, but the cluster still has a WARNING for the undersized ones, so let's look at the placement rule:
# ceph osd pool ls detail | grep "non-ec"
pool 14 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 3630 flags hashpspool stripe_width 0 application rgw
Pool 14 is using the default placement rule (crush_rule 0); let's switch it to CiberterminalRule:
# ceph osd pool set default.rgw.buckets.non-ec crush_rule CiberterminalRule
set pool 14 crush_rule to CiberterminalRule
# ceph osd pool ls detail | grep "non-ec"
pool 14 'default.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 3631 flags hashpspool stripe_width 0 application rgw
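To compare the two rules and see what CiberterminalRule (our own CRUSH rule on this cluster) actually does, the CRUSH rules can be listed and dumped:
# ceph osd crush rule ls
# ceph osd crush rule dump CiberterminalRule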
Check:
# ceph health detail
HEALTH_OK
# ceph -s
cluster:
id: a3a799ce-f1d3-4230-a915-06e988fee767
health: HEALTH_OK
...
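To be completely sure, the PGs of pool 14 should now show three OSDs in their up/acting sets (verification sketch, output omitted):
# ceph pg ls-by-pool default.rgw.buckets.non-ec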
**OUUUUU YEAHHHHHHHHH**