[ add Keith and Dave for their thoughts ]
On Wed, Apr 17, 2019 at 2:46 PM Brice Goglin <Brice.Goglin@inria.fr> wrote:
On 17/04/2019 at 23:35, Dan Williams wrote:
> On Tue, Apr 16, 2019 at 8:31 AM Brice Goglin <Brice.Goglin@inria.fr> wrote:
>>
>> On 08/04/2019 at 21:55, Brice Goglin wrote:
>>
>> On 08/04/2019 at 16:56, Dan Williams wrote:
>>
>> Yes, I agree with all of the above, but I think we need a way to fix
>> this independent of the HMAT data being present. The SLIT already
>> tells the kernel enough to let tooling figure out equidistant "local"
>> nodes. While the numa_node attribute will remain a singleton, the
>> tooling needs to handle this case and can't assume the HMAT data will
>> be present.
>>
>> So you want to export the part of SLIT that is currently hidden from
>> userspace because the corresponding nodes aren't registered?
>>
>> With the patch below, I get 17 17 28 28 in dax0.0/node_distance which
>> means it's close to node0 and node1.
>>
>> The code is pretty much a duplicate of read_node_distance() in
>> drivers/base/node.c. Not sure it's worth factoring out such small functions?
>>
>> The name "node_distance" (instead of "distance" for NUMA
nodes) is also
>> subject to discussion.
>>
>> Here's a better patch that exports the existing routine for showing
>> node distances, and reuses it in dax/bus.c and nvdimm/pfn_devs.c:
>>
>> # cat /sys/class/block/pmem1/device/node_distance
>> 28 28 17 17
>> # cat /sys/bus/dax/devices/dax0.0/node_distance
>> 17 17 28 28
>>
>> By the way, it also handles the case where the nd_region has no
>> valid target_node (idea stolen from kmem.c).
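(The patch itself isn't quoted in this mail. As a rough, hypothetical
sketch of the shape such a show routine could take on the dax side;
the target_node fallback and field names are assumptions based on the
description above, and the distance loop is inlined here rather than
calling the exported helper, since the helper's exact signature isn't
shown:)

static ssize_t node_distance_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct dev_dax *dev_dax = to_dev_dax(dev);
	int nid = dev_dax->target_node;
	int i, len = 0;

	/* fall back to the device's numa_node when firmware provided no
	 * valid target node, in the style of kmem.c */
	if (nid < 0)
		nid = dev_to_node(dev);

	for_each_online_node(i)
		len += sprintf(buf + len, "%s%d", i ? " " : "",
			       node_distance(nid, i));
	len += sprintf(buf + len, "\n");
	return len;
}
static DEVICE_ATTR_RO(node_distance);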
>>
>> Are there other places where it'd be useful to export that attribute?
>>
>> Ideally we could just export it in the region sysfs directory,
>> but I can't find backlinks going from daxX.Y or pmemZ to that
>> region directory :/
> I understand where you're trying to go, but this is too dax-device
> specific. What about a storage-controller in the topology that is
> equidistant from multiple cpu nodes? I'd rather solve this from the
> tooling perspective to look up cpu nodes that are equidistant to the
> device's "numa_node".
> I don't see how you're going to look up those equidistant nodes. In the
> above case, pmem1's numa_node is 2. Where do you want tools to find the
> information that pmem1 is actually close to node2 AND node3?
Yeah, I was indeed confusing proximity-domain and numa-node in my
thought process of what information userspace tools have readily
available, but I think a generic solution is still salvageable.
> That information is hidden in the SLIT node5<->node2 and node5<->node3
> entries, but these are not exposed to userspace tools since node5 isn't
> registered.
I think the root problem is that the kernel allocates numa-nodes in
arch-specific code at the beginning of time, and the proximity-domain
information is not kept readily available because the expectation is
that the Linux numa node is sufficient.
Your node_distance attribute proposal solves this, but I find SLIT
data to be a bit magical and poorly specified, especially across
architectures.
What about just exporting the proximity domain information via an
opaque firmware-implementation-specific 'node_handle' attribute? Then
the node_handle could be used to answer questions like which numa-nodes
this handle is local to beyond what the 'numa_node' attribute
indicates, what the effective target-node for this node-handle is, and
to interrogate the next level of detail beyond what
CONFIG_HMEM_REPORTING allows.
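To make that concrete, the node_handle attribute might be no more than
echoing the proximity domain the platform reported for the device, and
leaving the SLIT/HMAT interpretation to firmware-aware tooling. A
hypothetical sketch only (nothing here exists today; the 'pxm' field in
particular is invented for illustration):

static ssize_t node_handle_show(struct device *dev,
		struct device_attribute *attr, char *buf)
{
	struct dev_dax *dev_dax = to_dev_dax(dev);

	/* e.g. the ACPI proximity domain recorded at enumeration time */
	return sprintf(buf, "%d\n", dev_dax->region->pxm);
}
static DEVICE_ATTR_RO(node_handle);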