On Wed, Aug 9, 2017 at 12:00 PM, Linda Knippers <linda.knippers(a)hpe.com> wrote:
On 08/09/2017 02:30 PM, Dan Williams wrote:
> On Wed, Aug 9, 2017 at 7:19 AM, Jeff Moyer <jmoyer(a)redhat.com> wrote:
>> Yasunori Goto <y-goto(a)jp.fujitsu.com> writes:
>>>>> Another approach could be to integrate NVDIMM event
>>>>> monitoring into some other utility, like the rasdaemon. I'm
>>>>> your thoughts.
>>>> Though I'm not sure which (existing or new) utility is appropriate
>>>> I prefer this way. So, I'll think about it.
>>> I investigated the notification/monitoring feature for over-threshold
>>> events with my co-worker. Here is our current understanding.
>>> a) rasdaemon
>>> It is a good tool for machine check errors, and if a machine check occurs on
>>> NVDIMM, I suppose it will work not only for traditional RAM but also for NVDIMM.
>>> But, it may not fit the purpose of notification/monitoring threshold
>>> b) smartmontools (https://www.smartmontools.org/)
>>> This tool may fit the purpose of notification/monitoring of health of
>>> However, it may be a bit troublesome for the following reasons.
>>> - smartd seems to check the SMART values of each device with
>>> ioctl() periodically (in other words, "polling").
>>> Probably, other devices do not have a
>>> notification interface like "ndctl_dimm_get_health_eventfd()
>>> and poll()/select()".
>>> - smartmontools supports many OSs (Windows, darwin, xxxBSDs, os2(!)).
>>> I'm not sure other OSs have similar notification interface like
>>> So, it may need to poll, as it does for other devices.
>>> c) udev
>>> Udev can launch any program if a udev rule is created.
>>> However, there is currently no uevent for the over-threshold event.
>>> In addition, I'm not sure that udev fits this type of event
>>> d) make a new tiny daemon in ndctl tree
>>> This may be the simpler way.
>>> It can use ndctl_dimm_get_health_eventfd() and poll()/select().
>>> But, ndctl may be included in kernel source,
>>> and I don't know whether kernel includes other daemon tools or not.
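To make option d) concrete, here is a minimal sketch of the wait loop such a daemon could be built around. It assumes the descriptor comes from ndctl_dimm_get_health_eventfd(); the helper names wait_for_event() and consume_event() are hypothetical, and which poll mask the health descriptor actually raises is left to the caller rather than asserted here.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: block until `fd` reports a condition or
 * `timeout_ms` elapses. Returns 1 if poll() reported anything on the fd
 * (including error/hangup), 0 on timeout, -1 on error.
 * Note: a retry after EINTR restarts the full timeout.
 */
static int wait_for_event(int fd, short events, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = events };
    int rc;

    assert(fd >= 0);
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    return rc > 0 ? 1 : rc;
}

/*
 * Hypothetical helper: consume pending bytes so the next wait blocks
 * again. This suits pipe-like descriptors; other descriptor types may
 * need a different way to re-arm.
 */
static ssize_t consume_event(int fd)
{
    char buf[64];

    return read(fd, buf, sizeof(buf));
}
```

A daemon would call wait_for_event() in a loop with the DIMM's health descriptor and then query the health state through libndctl before logging.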
>> e) acpid
> Except acpid is ACPI specific, and the event sources that libnvdimm
> generates are generic. For example, we may be getting an Open Firmware
> libnvdimm bus in the next merge window.
Can you say more about that? It seems that the notifications we're worried
about here and the interface for getting information about the notification
are both ACPI-specific.
Capturing the raw acpi events is not that interesting because we'll
immediately want to turn around and ask what those mean to Linux
kernel objects, so might as well monitor those objects directly.
We haven't talked much about what a daemon would do once it gets a
notification from whatever the source is. That might help us determine
the right tool. Is it just logging?
Yes, logging, and maybe a simple framework to call external helper
applications when a given event fires, or fires too many times within
a certain threshold.
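
As a rough illustration of the "fires too many times" part, the check could be a small sliding-window counter like the sketch below. Everything here is hypothetical (struct event_window, MAX_EVENTS, the function names); it only shows the bookkeeping, not how events arrive or how a helper is launched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define MAX_EVENTS 16 /* hypothetical cap on remembered timestamps */

struct event_window {
    time_t stamps[MAX_EVENTS]; /* ring buffer of recent event times */
    size_t head;               /* next slot to overwrite */
    size_t count;              /* number of valid entries */
    size_t limit;              /* trigger when this many events ... */
    time_t window;             /* ... fall within this many seconds */
};

static void event_window_init(struct event_window *w, size_t limit,
                              time_t window)
{
    assert(limit > 0 && limit <= MAX_EVENTS);
    w->head = 0;
    w->count = 0;
    w->limit = limit;
    w->window = window;
}

/* Record an event at time `now`; return true if the helper should run. */
static bool event_window_record(struct event_window *w, time_t now)
{
    size_t recent = 0;

    /* Store the new timestamp, overwriting the oldest one when full. */
    w->stamps[w->head] = now;
    w->head = (w->head + 1) % MAX_EVENTS;
    if (w->count < MAX_EVENTS)
        w->count++;

    /* Count remembered events that are still inside the window. */
    for (size_t i = 0; i < w->count; i++)
        if (now - w->stamps[i] < w->window)
            recent++;

    return recent >= w->limit;
}
```

With limit set to 1 this degenerates to "run the helper on every event", so the same structure would cover both cases.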