Hi Dan,
How about having the BIOS report a new type for kmem in the e820 table?
e.g.
#define E820_PMEM 7
#define E820_KMEM 8
Then pmem and kmem are reported separately, and we can easily hot-add kmem
to the memory subsystem without disturbing the existing code (e.g. pmem,
nvdimm, dax, ...).
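For example, the kernel could walk the e820 table at boot and hot-add the
E820_KMEM ranges itself. A rough, untested sketch (the node-0 default is
just a placeholder for illustration):

        #include <linux/init.h>
        #include <linux/memory_hotplug.h>       /* add_memory() */
        #include <asm/e820/api.h>               /* e820_table */

        /* Walk the firmware e820 table and hot-add every E820_KMEM
         * range as ordinary system RAM. */
        static int __init e820_add_kmem_ranges(void)
        {
                int i;

                for (i = 0; i < e820_table->nr_entries; i++) {
                        struct e820_entry *entry = &e820_table->entries[i];

                        if (entry->type != E820_KMEM)   /* type 8 above */
                                continue;

                        /* The NUMA node should really come from SRAT;
                         * node 0 is just a placeholder here. */
                        add_memory(0, entry->addr, entry->size);
                }
                return 0;
        }
        late_initcall(e820_add_kmem_ranges);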
I don't know whether Intel will change some hardware features in the
future for pmem that is used as volatile memory. Perhaps it could be
faster than pmem and cheaper, but volatile, with no need to care about
atomicity, consistency, or the L2/L3 caches...
Another question: why call it kmem? What does the "k" stand for?
Thanks,
Xishi Qiu
On 2018/10/23 09:11, Dan Williams wrote:
On Mon, Oct 22, 2018 at 6:05 PM Dan Williams
<dan.j.williams@intel.com> wrote:
>
> On Mon, Oct 22, 2018 at 1:18 PM Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
>>
>> Persistent memory is cool. But, currently, you have to rewrite
>> your applications to use it. Wouldn't it be cool if you could
>> just have it show up in your system like normal RAM and get to
>> it like a slow blob of memory? Well... have I got the patch
>> series for you!
>>
>> This series adds a new "driver" to which pmem devices can be
>> attached. Once attached, the memory "owned" by the device is
>> hot-added to the kernel and managed like any other memory. On
>> systems with an HMAT (a new ACPI table), each socket (roughly)
>> will have a separate NUMA node for its persistent memory so
>> this newly-added memory can be selected by its unique NUMA
>> node.
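>>
>> For illustration, picking this memory from userspace could then be as
>> simple as binding an allocation to that node (rough, untested sketch;
>> node 1 and the use of libnuma are only assumptions for the example):
>>
>>      #include <numa.h>       /* libnuma, link with -lnuma */
>>      #include <stdio.h>
>>      #include <string.h>
>>
>>      int main(void)
>>      {
>>              /* Assume the hot-added pmem showed up as NUMA node 1. */
>>              int pmem_node = 1;
>>              size_t sz = 1UL << 20;
>>              void *buf;
>>
>>              if (numa_available() < 0)
>>                      return 1;
>>
>>              /* 1MB backed only by the (slower) pmem node. */
>>              buf = numa_alloc_onnode(sz, pmem_node);
>>              if (!buf)
>>                      return 1;
>>
>>              memset(buf, 0, sz);
>>              printf("allocated %zu bytes on node %d\n", sz, pmem_node);
>>              numa_free(buf, sz);
>>              return 0;
>>      }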
>>
>> This is highly RFC, and I really want the feedback from the
>> nvdimm/pmem folks about whether this is a viable long-term
>> perversion of their code and device model. It's insufficiently
>> documented and probably not bisectable either.
>>
>> Todo:
>> 1. The device re-binding hacks are ham-fisted at best. We
>> need a better way of doing this, especially so the kmem
>> driver does not get in the way of normal pmem devices.
>> 2. When the device has no proper node, we default it to
>> NUMA node 0. Is that OK?
>> 3. We muck with the 'struct resource' code quite a bit. It
>> definitely needs a once-over from folks more familiar
>> with it than I.
>> 4. Is there a better way to do this than starting with a
>> copy of pmem.c?
>
> So I don't think we want to do patch 2, 3, or 5. Just jump to patch 7
> and remove all the devm_memremap_pages() infrastructure and dax_region
> infrastructure.
>
> The driver should be a dead-simple turnaround that calls add_memory()
> for the passed-in range. The hard part is, as you say, arranging for
> the kmem driver to not stand in the way of typical range / device
> claims by the dax_pmem device.
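>
> Roughly, the whole probe path could be as small as this (untested
> sketch; the dev_dax field names are guesses at the device-dax
> internals, not the real structure layout):
>
>       /* Untested sketch of a minimal dax_kmem probe: take the
>        * device-dax range and feed it straight to memory hotplug. */
>       static int dev_dax_kmem_probe(struct device *dev)
>       {
>               struct dev_dax *dev_dax = to_dev_dax(dev);
>               struct resource *res = &dev_dax->region->res;
>               int numa_node = dev_dax->target_node;
>
>               /* Open question: what to do without a proper node. */
>               if (numa_node < 0)
>                       numa_node = 0;
>
>               /* One-way hotplug: handing the range to the page
>                * allocator is not easily undone. */
>               return add_memory(numa_node, res->start,
>                                 resource_size(res));
>       }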
>
> To me this looks like teaching the nvdimm-bus and this dax_kmem driver
> to require explicit matching based on 'id'. The attachment scheme
> would look like this:
>
> modprobe dax_kmem
> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/new_id
> echo dax0.0 > /sys/bus/nd/drivers/dax_pmem/unbind
> echo dax0.0 > /sys/bus/nd/drivers/dax_kmem/bind
>
> At step 1 the dax_kmem driver will match no devices and stay out of
> the way of dax_pmem. It learns about devices it cares about by being
> explicitly told about them. Then unbind from the typical dax_pmem
> driver and attach to dax_kmem to perform the one-way hotplug.
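>
> The driver-side bookkeeping for that could be a simple whitelist of
> device names fed in through 'new_id' (untested sketch, modelled on
> the PCI dynamic-ID idea; all names here are hypothetical):
>
>       /* Devices dax_kmem has been explicitly told to claim. */
>       struct kmem_id {
>               struct list_head list;
>               char name[32];                  /* e.g. "dax0.0" */
>       };
>
>       static LIST_HEAD(kmem_ids);
>       static DEFINE_MUTEX(kmem_ids_lock);
>
>       static ssize_t new_id_store(struct device_driver *drv,
>                                   const char *buf, size_t count)
>       {
>               struct kmem_id *id = kzalloc(sizeof(*id), GFP_KERNEL);
>
>               if (!id)
>                       return -ENOMEM;
>               if (sscanf(buf, "%31s", id->name) != 1) {
>                       kfree(id);
>                       return -EINVAL;
>               }
>               mutex_lock(&kmem_ids_lock);
>               list_add(&id->list, &kmem_ids);
>               mutex_unlock(&kmem_ids_lock);
>               return count;
>       }
>       static DRIVER_ATTR_WO(new_id);
>
>       /* Consulted at match time so dax_kmem never claims a device
>        * it was not explicitly told about. */
>       static bool kmem_match(struct device *dev)
>       {
>               struct kmem_id *id;
>               bool found = false;
>
>               mutex_lock(&kmem_ids_lock);
>               list_for_each_entry(id, &kmem_ids, list) {
>                       if (sysfs_streq(dev_name(dev), id->name)) {
>                               found = true;
>                               break;
>                       }
>               }
>               mutex_unlock(&kmem_ids_lock);
>               return found;
>       }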
>
> I expect udev can automate this by setting up a rule to watch for
> device-dax instances by UUID and call a script to do the detach /
> reattach dance.
The next question is how to support this for ranges that don't
originate from the pmem sub-system. I expect we want dax_kmem to
register a generic platform device representing the range and have a
generic platform driver that turns around and does the add_memory().
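
Something along these lines, perhaps (untested sketch, all names
hypothetical):

        /* Whoever owns a usable memory range (dax_kmem, or any other
         * source) registers a plain platform device for it... */
        static struct platform_device *register_kmem_range(u64 start,
                                                           u64 size)
        {
                struct resource res = DEFINE_RES_MEM(start, size);

                return platform_device_register_simple("kmem_range",
                                                PLATFORM_DEVID_AUTO,
                                                &res, 1);
        }

        /* ...and a generic platform driver turns any such device
         * into ordinary system RAM via add_memory(). */
        static int kmem_range_probe(struct platform_device *pdev)
        {
                struct resource *res;

                res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
                if (!res)
                        return -ENXIO;

                return add_memory(dev_to_node(&pdev->dev), res->start,
                                  resource_size(res));
        }

        static struct platform_driver kmem_range_driver = {
                .probe = kmem_range_probe,
                .driver = {
                        .name = "kmem_range",
                },
        };
        module_platform_driver(kmem_range_driver);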