Common Issues
Benchmarking Ceph Storage
You want to benchmark the storage of your Ceph cluster(s)? Here is a short list of tools for benchmarking storage.
Recommended tools:
- General benchmarking/testing of storage (e.g., plain disks and other storage software)
- Ceph-specific benchmarking (an illustrative example follows below)
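For general (non-Ceph-specific) benchmarking, fio is a widely used tool. As an illustrative sketch for the Ceph-specific side (testpool is a placeholder pool name, and this is not necessarily the tool the list above refers to), Ceph ships a built-in object-store benchmark, rados bench:
$ rados bench -p testpool 60 write --no-cleanup   # 60-second write benchmark, keep the objects for the read tests
$ rados bench -p testpool 60 seq                  # sequential read benchmark against the objects written above
$ rados cleanup -p testpool                       # remove the benchmark objects afterwards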
 
CephFS mount issues on Hosts
Make sure the hosts run an active Linux kernel of version 4.17 or higher (ideally 5.x or higher).
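For example, the running kernel version of a host can be checked with:
$ uname -r
5.15.0-91-generic   # example output; anything >= 4.17 satisfies the requirement above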
HEALTH_WARN 1 large omap objects
Issue
HEALTH_WARN 1 large omap objects
# and/or
LARGE_OMAP_OBJECTS 1 large omap objects
Solution
The following command should fix the issue:
radosgw-admin reshard stale-instances rm
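Before removing anything, you can check which pool/object triggered the warning and list the stale bucket-index reshard instances that would be cleaned up (this assumes the large omap object stems from RGW bucket indexes):
$ ceph health detail                           # shows details about the large omap object(s)
$ radosgw-admin reshard stale-instances list   # list stale instances before removing them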
MDSs report oversized cache
Issue
Ceph health status reports, e.g., 1 MDSs report oversized cache.
[root@rook-ceph-tools-86d54cbd8d-6ktjh /]# ceph -s
  cluster:
    id:     67e1ce27-0405-441e-ad73-724c93b7aac4
    health: HEALTH_WARN
            1 MDSs report oversized cache
[...]
Solution
You can try to increase the mds_cache_memory_limit setting [1].
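A minimal sketch, assuming the MDS genuinely needs more cache (the 8 GiB value is only an example): the limit can be raised via the config database.
$ ceph config set mds mds_cache_memory_limit 8589934592   # 8 GiB, given in bytes; example value
When running Ceph via Rook, the MDS pod resource limits may need to be raised accordingly [2].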
Find the Device an OSD Is Using
Issue
You need to find out which disk/device is used by an OSD daemon.
Typical scenarios: smartctl indicates that the disk should be replaced, the disk has already failed, etc.
Solution
Use the various ls* subcommands of ceph device.
$ ceph device --help
device check-health                                                         Check life expectancy of devices
device get-health-metrics <devid> [<sample>]                                Show stored device metrics for the device
device info <devid>                                                         Show information about a device
device light on|off <devid> [ident|fault] [--force]                         Enable or disable the device light. Default type is `ident`
'Usage: device
                                                                             light (on|off) <devid> [ident|fault] [--force]'
device ls                                                                   Show devices
device ls-by-daemon <who>                                                   Show devices associated with a daemon
device ls-by-host <host>                                                    Show devices on a host
device ls-lights                                                            List currently active device indicator lights
device monitoring off                                                       Disable device health monitoring
device monitoring on                                                        Enable device health monitoring
device predict-life-expectancy <devid>                                      Predict life expectancy with local predictor
device query-daemon-health-metrics <who>                                    Get device health metrics for a given daemon
device rm-life-expectancy <devid>                                           Clear predicted device life expectancy
device scrape-daemon-health-metrics <who>                                   Scrape and store device health metrics for a given daemon
device scrape-health-metrics [<devid>]                                      Scrape and store device health metrics
device set-life-expectancy <devid> <from> [<to>]                            Set predicted device life expectancy
The ceph device subcommands allow you to do even more, e.g., turn on a disk's locator light in the server chassis.
Enabling the light helps datacenter staff locate the disk easily and avoid replacing the wrong one.
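For example (the device ID is a placeholder taken from the listings below), the identification light can be toggled like this:
$ ceph device light on SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123 ident
$ ceph device light off SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123 ident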
Locate Disk of OSD by OSD daemon ID (e.g., OSD 13):
$ ceph device ls-by-daemon osd.13
DEVICE                                     HOST:DEV                                           EXPECTED FAILURE
SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123  HOSTNAME:nvme1n1
Show all disks by host (hostname):
$ ceph device ls-by-host HOSTNAME
DEVICE                                     DEV      DAEMONS  EXPECTED FAILURE
SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123  nvme1n1  osd.5
SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123  nvme0n1  osd.2
SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123  nvme2n1  osd.8
SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123  nvme3n1  osd.13
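Another way to cross-check (standard Ceph tooling, not shown in the listings above) is the OSD metadata, whose JSON output includes the device and host names:
$ ceph osd metadata 13   # OSD id 13 as an example; look for the "devices" and "hostname" fields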
CephOSDSlowOps Alerts
Issue
TODO
Things to Try
- Ensure the disks you are using are healthy
- Check the SMART values (see the example below). A bad disk can lock up an application (such as a Ceph OSD) or, worse, the whole server.
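For example (device path and device ID are placeholders), SMART data can be inspected directly with smartctl or through Ceph's own device health tracking:
$ smartctl -a /dev/nvme1n1
$ ceph device get-health-metrics SAMSUNG_MZVL2512HCJQ-00B00_S1234567890123
$ ceph health detail   # also shows which OSDs currently report slow ops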
 
 
If this page did not yield a solution, check out the Rook Ceph Common Issues doc as well.
Footnotes
1. Report/source for the information regarding this issue: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-December/037633.html
2. Rook Ceph Docs v1.7 - Ceph Filesystem CRD - MDS Resources Configuration Settings