On Mon, Apr 8, 2019 at 1:13 AM Brice Goglin <Brice.Goglin(a)inria.fr> wrote:
On 08/04/2019 06:26, Dan Williams wrote:
> On Thu, Apr 4, 2019 at 12:48 PM Brice Goglin <Brice.Goglin(a)inria.fr> wrote:
>> Hello
>>
>> I am trying to understand the locality of the DAX devices with
>> respect to processors with SubNUMA clustering enabled. The machine
>> I am using has 6 proximity domains: #0-3 are the SNCs of both
>> processors, #4-5 are prox domains for each socket set of NVDIMMs.
>>
>> SLIT says the topology looks like this, which seems OK to me:
>>
>> Package 0 ---------- Package 1
>> NVregion0 NVregion1
>> | | | |
>> SNC 0 SNC 1 SNC 2 SNC 3
>> node0 node1 node2 node3
>>
>> However each DAX "numa_node" attribute contains a single node ID,
>> which leads to this topology instead:
>>
>> Package 0 ---------- Package 1
>> | | | |
>> SNC 0 SNC 1 SNC 2 SNC 3
>> node0 node1 node2 node3
>> | |
>> dax0.0 dax1.0
>>
>> It looks like this is caused by acpi_map_pxm_to_online_node()
>> only returning the first closest node found in the SLIT.
>> However, even if we change it to return multiple local nodes,
>> the DAX "numa_node" attribute cannot expose multiple nodes.
>> Should we rather expose Keith's HMAT attributes for DAX devices?
> If I understand the suggestion correctly you're referring to the
> "target_node" or the unique node number that gets assigned when the
> memory is transitioned online. I struggle to see the incremental
> benefit relative to what we lose with compatibility of the
> "traditional" numa node interpretation for a device that indicates
> which cpus are close to the given device. I think the bulk of the
> problem is solved with the next suggestion below.
Hello Dan,
Not sure why you're talking about "target_node" here. That attribute is
correct:
$ cat /sys/bus/dax/devices/dax0.0/target_node
4
My issue is with "numa_node", which fails to return enough information here:
$ cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/dax0.0/numa_node
0
(instead of 0 and 1, but I don't want to change the semantics of that
file; see below)
>
>> Maybe there's even a way to share them between DAX devices
>> and Dave's KMEM hotplugged NUMA nodes?
> In this instance, where the expectation is that the NVDIMM range is
> equidistant from both SNC nodes on a package, I would teach numactl
> tool and other tooling to return a list of local nodes rather than the
> single attribute. Effectively an operation like "numactl --preferred
> block:pmem0" would return a node-mask that includes nodes 0 and 1.
Teaching these tools is exactly the problem I want to solve here (I was
talking about dax0.0 rather than pmem0, but it doesn't matter much).
There are usually two ways to find the locality of a device from
userspace:
* Reading a "local_cpus" sysfs attribute. This works well for finding
local CPUs, but doesn't always work for finding local memory when some
CPUs are offline: if all CPUs of the local node are offline, you lose
the information that the local memory is close to your device (Intel
people from "mOS" heavily rely on this).
* Reading a "numa_node" sysfs attribute, but it points to a single node.
Keith's HMAT patches are effectively a third way that has neither of
these issues: you just read "access0/initiators/node*":
* If you want local CPUs, you read the "cpumap" of the initiator nodes.
* If you want the list of "close" memory nodes, you already have the
list of initiator nodes, or their targets.
It would work very well for describing the topology of my machine once
I hotplug node4 and node5 using Dave's "kmem" driver: I get node0 and
node1 in node4/access0/initiators/
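As a sketch of what a consumer would do with those attributes (the
helper name is mine, the paths assume a kernel with Keith's HMAT sysfs
patches, and the node directory is a parameter only so the helper can be
pointed at any tree):

```shell
# List each access0 initiator of a target memory node together with its
# "cpumap". On a real system $1 would be something like
# /sys/devices/system/node/node4.
initiators_of() {
  for i in "$1"/access0/initiators/node*; do
    [ -d "$i" ] || continue
    printf '%s %s\n' "$(basename "$i")" "$(cat "$i/cpumap")"
  done
}
```

On my machine I would expect `initiators_of /sys/devices/system/node/node4`
to list node0 and node1 with their cpumaps.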
Yes, I agree with all of the above, but I think we need a way to fix
this independent of the HMAT data being present. The SLIT already
tells the kernel enough to let tooling figure out equidistant "local"
nodes. While the "numa_node" attribute will remain a singleton, the
tooling needs to handle this case and can't assume the HMAT data will
be present.
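As an illustration of what such tooling could do with nothing but the
SLIT rows the kernel already exports in
/sys/devices/system/node/nodeX/distance (the helper name and the example
distances below are made up):

```shell
# Given one SLIT row as printed by /sys/devices/system/node/nodeX/distance
# ($1) and that node's own index ($2), print the indices of the other
# nodes at the minimum distance, i.e. the equidistant "local" set.
closest_nodes() {
  echo "$1" | awk -v self="$2" '{
    min = -1
    for (i = 1; i <= NF; i++)
      if (i - 1 != self && (min < 0 || $i < min)) min = $i
    out = ""
    for (i = 1; i <= NF; i++)
      if (i - 1 != self && $i == min) out = out (out ? " " : "") (i - 1)
    print out
  }'
}

# With a made-up row for a pmem node in prox domain 4, equidistant from
# SNC nodes 0 and 1:
closest_nodes "17 17 28 28 10 28" 4
```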
I know HMAT attributes don't appear in hotplugged node sysfs directories
yet, but it would also be nice to have a way to get that information for
dax devices before hotplug, since the dax device and hotplugged nodes
are the same thing.
In a crazy world, maybe we could have something like this:
* before hotplug with kmem driver, unregistered nodes appear in a
special directory such as
/sys/devices/system/node/unregistered_hmat/nodeX together with their
HMAT attributes. If I want to find the locality of a DAX device, I read
its target_node, and go to the corresponding unregistered_hmat/nodeX and
read cpumap, initiators, etc.
* at hotplug, the node is moved out of unregistered_hmat/
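As a pure sketch of that flow (nothing below exists today:
"unregistered_hmat" is the hypothetical directory, the helper name is
mine, and the sysfs root is a parameter only so the sketch can be
exercised against a fake tree):

```shell
# Hypothetical: resolve a dax device's local cpumaps *before* hotplug by
# following its target_node into the proposed unregistered_hmat/ tree.
# $1 is the sysfs root (normally /sys), $2 a dax device name (dax0.0).
dax_local_cpus() {
  tn=$(cat "$1/bus/dax/devices/$2/target_node")
  for i in "$1"/devices/system/node/unregistered_hmat/node"$tn"/access0/initiators/node*; do
    [ -d "$i" ] && cat "$i/cpumap"
  done
}
```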
Some sort of offline target_node data makes sense, but seems secondary
to teaching tools to supplement the 'numa_node' attribute.