On 08/04/2019 06:26, Dan Williams wrote:
On Thu, Apr 4, 2019 at 12:48 PM Brice Goglin
<Brice.Goglin(a)inria.fr> wrote:
> Hello
>
> I am trying to understand the locality of the DAX devices with
> respect to processors with SubNUMA clustering enabled. The machine
> I am using has 6 proximity domains: #0-3 are the SNCs of both
> processors, #4-5 are prox domains for each socket's set of NVDIMMs.
>
> SLIT says the topology looks like this, which seems OK to me:
>
> Package 0 ---------- Package 1
> NVregion0 NVregion1
> | | | |
> SNC 0 SNC 1 SNC 2 SNC 3
> node0 node1 node2 node3
>
> However each DAX "numa_node" attribute contains a single node ID,
> which leads to this topology instead:
>
> Package 0 ---------- Package 1
> | | | |
> SNC 0 SNC 1 SNC 2 SNC 3
> node0 node1 node2 node3
> | |
> dax0.0 dax1.0
>
> It looks like this is caused by acpi_map_pxm_to_online_node()
> only returning the first closest node found in the SLIT.
> However, even if we change it to return multiple local nodes,
> the DAX "numa_node" attribute cannot expose multiple nodes.
> Should we rather expose Keith's HMAT attributes for DAX devices?
If I understand the suggestion correctly you're referring to the
"target_node" or the unique node number that gets assigned when the
memory is transitioned online. I struggle to see the incremental
benefit relative to what we lose with compatibility of the
"traditional" numa node interpretation for a device that indicates
which cpus are close to the given device. I think the bulk of the
problem is solved with the next suggestion below.
Hello Dan,
Not sure why you're talking about "target_node" here. That attribute is
correct:
$ cat /sys/bus/dax/devices/dax0.0/target_node
4
My issue is with "numa_node", which fails to return enough information here:
$ cat
/sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/region0/dax0.0/numa_node
0
(instead of 0+1 but I don't want to change the semantics of that file,
see below)
> Maybe there's even a way to share them between DAX devices
> and Dave's KMEM hotplugged NUMA nodes?
In this instance, where the expectation is that the NVDIMM range is
equidistant from both SNC nodes on a package, I would teach numactl
tool and other tooling to return a list of local nodes rather than the
single attribute. Effectively an operation like "numactl --preferred
block:pmem0" would return a node-mask that includes nodes 0 and 1.
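For illustration, the node-mask computation such a tool would do is simple once it has the list of local nodes; here is a sketch (nodes_to_mask is a made-up helper name, not an existing numactl interface):

```shell
# Hypothetical helper: build the node-mask a tool like
# "numactl --preferred block:pmem0" could hand back, given the
# list of local node IDs. Each node N sets bit N of the mask.
nodes_to_mask() {
    local mask=0 n
    for n in "$@"; do
        mask=$(( mask | (1 << n) ))
    done
    printf '0x%x\n' "$mask"
}

nodes_to_mask 0 1    # nodes 0 and 1 -> 0x3
```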
Teaching these tools is exactly the problem I want to solve here (I was
talking about dax0.0 rather than pmem0, but it doesn't matter much). There are
usually two ways to find the locality of a device from userspace:
* Reading a "local_cpus" sysfs attribute. Works well for finding local
CPUs. Doesn't always work for finding local memory when some CPUs are
offline: if all CPUs of the local node are offline, you lose the
information about the local memory being close to your device (Intel
people from "mOS" heavily rely on this).
* Reading a "numa_node" sysfs attribute, but it points to a single node.
Keith's HMAT patches are effectively a third way that has neither of these
issues: you just read "access0/initiators/node*":
* If you want local CPUs, you read the "cpumap" of the initiators nodes.
* If you want the list of "close" memory nodes, you have the list of
initiator "nodes", or their targets.
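To make the two lookups concrete, here is a sketch run against a mocked sysfs tree (the directory layout mirrors /sys/devices/system/node/nodeX/access0/initiators/ from Keith's patches, but the tree under $SYSFS and the cpumap values are invented for the example):

```shell
# Build a fake sysfs tree: node4 (memory target) has node0 and node1
# as access0 initiators; each initiator node exposes a cpumap.
SYSFS=$(mktemp -d)
mkdir -p "$SYSFS/node0" "$SYSFS/node1" \
         "$SYSFS/node4/access0/initiators/node0" \
         "$SYSFS/node4/access0/initiators/node1"
echo 00ff > "$SYSFS/node0/cpumap"
echo ff00 > "$SYSFS/node1/cpumap"

# Local CPUs of target node $1: read the cpumap of every initiator.
local_cpus_of_target() {
    local d
    for d in "$SYSFS/node$1/access0/initiators"/node*; do
        cat "$SYSFS/$(basename "$d")/cpumap"
    done
}

local_cpus_of_target 4    # prints 00ff then ff00
```

(In the real sysfs the initiators entries are symlinks to the node directories, so the extra basename indirection would not even be needed.)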
It would work very well for describing the topology of my machine once I
hotplug node4 and node5 using Dave's "kmem" driver: I get node0 and
node1 in node4/access0/initiators/
I know HMAT attributes don't appear in hotplugged node sysfs directories
yet, but it would also be nice to have a way to get that information for
dax devices before hotplug, since the dax device and hotplugged nodes
are the same thing.
In a crazy world, maybe we could have something like this:
* before hotplug with kmem driver, unregistered nodes appear in a
special directory such as
/sys/devices/system/node/unregistered_hmat/nodeX together with their
HMAT attributes. If I want to find the locality of a DAX device, I read
its target_node, and go to the corresponding unregistered_hmat/nodeX and
read cpumap, initiators, etc.
* at hotplug, the node is moved out of unregistered_hmat/
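The lookup that proposal would enable could look like this sketch (everything is mocked under $ROOT, and "unregistered_hmat" is the hypothetical directory from the proposal above, not an existing ABI):

```shell
# Fake tree: dax0.0 targets node4, which is not hotplugged yet and
# therefore only exists under the proposed unregistered_hmat/ dir.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/bus/dax/devices/dax0.0" \
         "$ROOT/devices/system/node/unregistered_hmat/node4"
echo 4 > "$ROOT/bus/dax/devices/dax0.0/target_node"
echo 00ff > "$ROOT/devices/system/node/unregistered_hmat/node4/cpumap"

# Resolve a DAX device's local cpumap: read target_node, use the
# regular node directory if registered, else fall back to the
# proposed unregistered_hmat/ location.
dax_local_cpumap() {
    local t
    t=$(cat "$ROOT/bus/dax/devices/$1/target_node")
    if [ -d "$ROOT/devices/system/node/node$t" ]; then
        cat "$ROOT/devices/system/node/node$t/cpumap"
    else
        cat "$ROOT/devices/system/node/unregistered_hmat/node$t/cpumap"
    fi
}

dax_local_cpumap dax0.0    # node4 not hotplugged yet -> 00ff
```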
> If we had one SPA range per SNC, would it still be possible
> to interleave NVDIMMs of both SNC to create a single region
> for each socket?
I don't follow the question: if the SPA range is split, you want the
SLIT to lie and say it isn't?
Sorry, these questions about NFIT were not related to my specific config
but rather to understand what configs are possible. Elliott's answer and
yours clarified things, thanks.
Brice