Re: [PATCH 1/4] nvdimm: Add wrapper for IOCTL pass thru.
by Dan Williams
On Tue, Nov 10, 2015 at 12:53 PM, Linda Knippers <linda.knippers(a)hpe.com> wrote:
> On 11/10/2015 3:27 PM, Dan Williams wrote:
>>
>> On Tue, Nov 10, 2015 at 11:49 AM, Jerry Hoemann <jerry.hoemann(a)hpe.com>
>> wrote:
>>>
>>> On Tue, Nov 10, 2015 at 12:51:59PM -0500, Jeff Moyer wrote:
>>>>
>>>> Jerry Hoemann <jerry.hoemann(a)hpe.com> writes:
>>>>
>>>>> Add IOCTL type 'P' to denote NVDIMM_TYPE_PASSTHRU.
>>>>
>>>>
>>>> Can't you just make passthrough a separate command? If you actually add
>>>
>>>
>>> There are multiple conflicting NVDIMM _DSMs running around; they
>>> are "device specific". So, we should plan in general and not just
>>> for the example DSM that Intel added support for. These DSMs have
>>> overlapping and incompatible function ids.
>>>
>>> The Intel example is an example, not a standard. They are free to
>>> change it at will. So, we can't be certain there won't be a
>>> conflict some time in the future if we try to use their number space.
>>>
>>> I'm trying to create a generic pass thru that any vendor can use.
>>> Putting this in the Intel function number space doesn't make a lot
>>> of sense to me.
>>
>>
>> It isn't the "Intel" function number space. The fact that they
>> currently align is just a happy accident.
>
>
> It's not really a happy accident. Your commit message says it
> was derived from the Intel spec 'for convenience', which I think
> is convenient for anything that implements that spec.
Right, and now it's no longer convenient to keep things one to one.
> We've discussed ways of supporting different command sets with you
> and determined that this pass-through mechanism was a good approach
> because it allows multiple different command sets to be supported in
> a generic way. Blending the two flavors (generic pass-through and
> explicit function definitions) is confusing to me.
>
>> The kernel is free to break
>> the 1:1 ioctl number to DSM function number relationship, and I think
>> it would make the implementation cleaner in this case.
>
>
> To me it's less clean and even for your own example spec, less
> convenient if Intel ever updates that spec.
If that spec is ever updated any new commands will be implemented with
this new generic envelope as the marshaling mechanism. I'd also look
to convert the existing commands into this new envelope and deprecate
the existing per-DSM-function number approach. Finally, I don't look
at this as purely "passthru" as the kernel will want to crack open the
input payload for commands that it cares about with kernel relevant
side effects, like namespace label updates.
5 years, 2 months
[PATCH 0/4] autodetect ndctl api features
by Dan Williams
The v4.3 kernel implements an address range scrub (ARS) command
interface that was missing from v4.2. The libndctl implementation was
updated accordingly, but v4.2 kernels are of course not exporting the
new capabilities leading to compile breakage for new ndctl on old
kernels.
Fix this by teaching libndctl to detect when it is being built against
older kernel headers. Note that newer kernels remain backwards
compatible with old libndctl builds; this is just teaching newer libndctl
to degrade gracefully when kernel features are missing.
---
Dan Williams (4):
ndctl: export ndctl_bus_get_ctx
ndctl: fix return value truncation for ndctl_cmd_ars_get_record_{addr, len}
ndctl: move forward declarations of ndctl_[un]bind and to_module
ndctl: autodetect ars support in ndctl.h
Makefile.am | 4 +
configure.ac | 36 ++++++-
lib/libndctl-ars.c | 207 +++++++++++++++++++++++++++++++++++++++
lib/libndctl-private.h | 65 +++++++++++-
lib/libndctl.c | 256 +-----------------------------------------------
lib/libndctl.sym | 1
lib/ndctl/libndctl.h | 54 ++++++++++
lib/test-libndctl.c | 23 ++++
8 files changed, 385 insertions(+), 261 deletions(-)
create mode 100644 lib/libndctl-ars.c
[PATCH 0/3] ndctl: default to local ndctl.h
by Dan Williams
Use the new --enable-local configuration option in rpm builds by default
and squash a couple compiler warnings.
---
Dan Williams (3):
ndctl: fix return value truncation for ndctl_cmd_ars_get_record_{addr, len}
ndctl: default to local ndctl.h for rpm builds
ndctl: fix test-libndctl warning
Re: [PATCH 4/4] ndctl: autodetect ars support in ndctl.h
by Jeff Moyer
"Elliott, Robert (Persistent Memory)" <elliott(a)hpe.com> writes:
>> It might make more sense to copy the header file from the kernel
>> sources
>> into the ndctl sources. Then, always enable the functionality at
>> build time and auto-detect at runtime.
>
> The kernel at the time of the build could be older (not just newer)
> than the ndctl package.
Hi, Rob,
Sorry, but I don't know what point you're trying to make. Could you
elaborate?
Cheers,
Jeff
[GIT PULL] libnvdimm for 4.4
by Williams, Dan J
Hi Linus, please pull from...
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-for-4.4
...to receive the libnvdimm update for 4.4. Outside of the new ACPI-NFIT hot-add support this pull request is more notable for what it does not contain than what it does. There were a handful of development topics this cycle, dax get_user_pages, dax fsync, and raw block dax, that need more iteration and will wait for 4.5.
The patches to make devm and the pmem driver NUMA aware have been in
-next for several weeks. The hot-add support has not, but is contained
to the NFIT driver and is passing unit tests. The coredump support is
straightforward and was looked over by Jeff. All of it has received a
0day build success notification across 107 configs.
The following changes since commit 7379047d5585187d1288486d4627873170d0005a:
Linux 4.3-rc6 (2015-10-18 16:08:42 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm tags/libnvdimm-for-4.4
for you to fetch changes up to ab27a8d04b32b6ee8c30c14c4afd1058e8addc82:
coredump: add DAX filtering for FDPIC ELF coredumps (2015-11-09 13:29:54 -0500)
----------------------------------------------------------------
libnvdimm for 4.4:
1/ Add support for the ACPI 6.0 NFIT hot add mechanism to process
updates of the NFIT at runtime.
2/ Teach the coredump implementation how to filter out DAX mappings.
3/ Introduce NUMA hints for allocations made by the pmem driver, and as
a side effect all devm allocations now hint their NUMA node by
default.
----------------------------------------------------------------
Dan Williams (8):
x86, mm: quiet arch_add_memory()
pmem: kill memremap_pmem()
devm_memunmap: use devres_release()
devm_memremap: convert to return ERR_PTR
devm: make allocations numa aware by default
devm_memremap_pages: use numa_mem_id
pmem, memremap: convert to numa aware allocations
Merge branch 'for-4.4/hotplug' into libnvdimm-for-next
Ross Zwisler (2):
coredump: add DAX filtering for ELF coredumps
coredump: add DAX filtering for FDPIC ELF coredumps
Vishal Verma (2):
nfit: in acpi_nfit_init, break on a 0-length table
acpi: nfit: Add support for hot-add
Documentation/filesystems/proc.txt | 22 +--
arch/x86/mm/init.c | 4 +-
arch/x86/mm/init_64.c | 4 +-
drivers/acpi/nfit.c | 298 ++++++++++++++++++++++++++++++-------
drivers/acpi/nfit.h | 2 +
drivers/base/devres.c | 19 +--
drivers/nvdimm/pmem.c | 28 ++--
fs/binfmt_elf.c | 10 ++
fs/binfmt_elf_fdpic.c | 15 ++
include/linux/device.h | 16 +-
include/linux/pmem.h | 26 +---
include/linux/sched.h | 4 +-
kernel/memremap.c | 16 +-
tools/testing/nvdimm/test/nfit.c | 164 +++++++++++++++++++-
14 files changed, 494 insertions(+), 134 deletions(-)
Re: [PATCH v3 2/2] acpi: nfit: Add support for hot-add
by Dan Williams
On Sun, Nov 8, 2015 at 4:49 PM, Rafael J. Wysocki <rjw(a)rjwysocki.net> wrote:
> On Saturday, November 07, 2015 10:57:17 AM Dan Williams wrote:
>> Rafael, awaiting your ack...
>
> Well, this seems to be entirely NFIT-related.
>
> Is there anything in particular you want me to look at?
>
This might be more relevant for a follow-on patch, but I wonder if the
core should be guaranteeing that the ->notify() callback occurs under
device_lock(). In the case of NFIT it seems possible for the notify
event to race ->add() or ->remove(), but maybe I missed some other
guarantee?
[PATCH v3 0/2] Hotplug support for libnvdimm
by Vishal Verma
This series adds support for hotplug of NVDIMMs. Upon hotplug, the ACPI
core calls the .notify callback we register. From this, we evaluate the
_FIT method which returns an updated NFIT. This is scanned for any new
tables, and any new regions found from it are registered and made
available for use.
The series is tested with nfit_test (tools/testing/nvdimm) only, which
means the parts of getting a notification from the acpi core, and calling
_FIT are untested.
Changes from v2->v3:
- in acpi_nfit_init, splice off the old contents to a "prev" list and
only check for duplicates when "prev" is not empty (Dan)
- in acpi_nfit_init, error out if tables are found to be deleted
- locking changes: Use device_lock for .add and .notify. Check if
dev->driver is valid during notify to protect against a prior
removal (Dan)
- Change IS_ERR_OR_NULL to IS_ERR for acpi_nfit_desc_init (Dan)
- nfit_test: for the hot-plug DIMM, add a flush hint table too for
completeness
Changes from v1->v2:
- If a 0-length header is found in the nfit (patch 1), also spew a
warning (Jeff)
- Don't make a new acpi_evaluate_fit helper - open code a call to
acpi_evaluate_object in nfit.c (Dan/Rafael)
- Remove a warning for duplicate DCRs (Toshi)
- Add an init_lock to protect the notify handler from racing with an
'add' or 'remove' (Dan)
- The only NVDIMM in a system *could* potentially come from a hotplug,
esp in the virtualization case. Refactor how acpi_nfit_desc is
initialized to account for this. For the same reason, don't fail when
a valid NFIT is not found at driver load time. A by-product of this
change is that we need to initialize lists and mutexes manually in
nfit test. (Dan)
- Remove acpi_nfit_merge (added in v1) as it is now essentially
the same as acpi_nfit_init
- Reword the commit message for patch 2/2 to say 'hot add' instead of
hotplug, making it clearer that hot removal support is not being added
Vishal Verma (2):
nfit: in acpi_nfit_init, break on a 0-length table
acpi: nfit: Add support for hot-add
drivers/acpi/nfit.c | 307 +++++++++++++++++++++++++++++++--------
drivers/acpi/nfit.h | 2 +
tools/testing/nvdimm/test/nfit.c | 164 ++++++++++++++++++++-
3 files changed, 413 insertions(+), 60 deletions(-)
--
2.4.3
Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
by Dan Williams
On Sat, Nov 7, 2015 at 12:38 AM, Thomas Gleixner <tglx(a)linutronix.de> wrote:
> On Sat, 7 Nov 2015, Dan Williams wrote:
>> On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner <tglx(a)linutronix.de> wrote:
>> > On Fri, 6 Nov 2015, H. Peter Anvin wrote:
>> >> On 11/06/15 15:17, Dan Williams wrote:
>> >> >>
>> >> >> Is it really required to do that on all cpus?
>> >> >
>> >> > I believe it is, but I'll double check.
>> >> >
>> >>
>> >> It's required on all CPUs on which the DAX memory may have been dirtied.
>> >> This is similar to the way we flush TLBs.
>> >
>> > Right. And that's exactly the problem: "may have been dirtied"
>> >
>> > If DAX is used on 50% of the CPUs and the other 50% are humming away
>> > happily in user space or run low latency RT tasks w/o ever touching
>> > it, then having an unconditional flush on ALL CPUs is just wrong
>> > because you penalize the uninvolved cores with a completely pointless
>> > SMP function call and drain their caches.
>> >
>>
>> It's not wrong and pointless, it's all we have available outside of
>> having the kernel remember every virtual address that might have been
>> touched since the last fsync and sit in a loop flushing those virtual
>> address cache line by cache line.
>>
>> There is a crossover point where wbinvd is better than a clwb loop
>> that needs to be determined.
>
> This is a totally different issue and I'm well aware that there is a
> tradeoff between wbinvd() and a clwb loop. wbinvd() might be more
> efficient performance wise above some number of cache lines, but then
> again it's draining all unrelated stuff as well, which can result in
> an even larger performance hit.
>
> Now what really concerns me more is that you just unconditionally
> flush on all CPUs whether they were involved in that DAX stuff or not.
>
> Assume that DAX using application on CPU 0-3 and some other unrelated
> workload on CPU4-7. That flush will
>
> - Interrupt CPU4-7 for no reason (whether you use clwb or wbinvd)
>
> - Drain the cache for CPU4-7 for no reason if done with wbinvd()
>
> - Render Cache Allocation useless if done with wbinvd()
>
> And we are not talking about a few micro seconds here. Assume that
> CPU4-7 have cache allocated and it's mostly dirty. We've measured the
> wbinvd() impact on RT, back then when the graphic folks used it as a
> big hammer. The maximum latency spike was way above one millisecond.
>
> We have similar issues with TLB flushing, but there we
>
> - are tracking where it was used and never flush on innocent cpus
>
> - one can design his application in a way that it uses different
> processes so cross CPU flushing does not happen
>
> I know that this is not an easy problem to solve, but you should be
> aware that various application scenarios are going to be massively
> unhappy about that.
>
Thanks for that explanation. Peter had alluded to it at KS, but I
indeed did not know that it was as horrible as milliseconds of
latency, hmm...
One other mitigation that follows on with Dave's plan of per-inode DAX
control, is to also track when an inode has a writable DAX mmap
established. With that we could have a REQ_DAX flag to augment
REQ_FLUSH to potentially reduce committing violence on the cache. In
an earlier thread I also recall an idea to have an mmap flag that an
app can use to say "yes, I'm doing a writable DAX mapping, but I'm
taking care of the cache myself". We could track innocent cpus, but
I'm thinking that would be a core change to write-protect pages when a
thread migrates? In general I feel there's a limit for how much
hardware workaround is reasonable to do in the core kernel vs waiting
for the platform to offer better options...
Sorry if I'm being a bit punchy, but I'm still feeling like I need to
defend the notion that DAX may just need to be turned off in some
situations.
Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
by Dan Williams
On Fri, Nov 6, 2015 at 10:50 PM, Thomas Gleixner <tglx(a)linutronix.de> wrote:
> On Fri, 6 Nov 2015, H. Peter Anvin wrote:
>> On 11/06/15 15:17, Dan Williams wrote:
>> >>
>> >> Is it really required to do that on all cpus?
>> >
>> > I believe it is, but I'll double check.
>> >
>>
>> It's required on all CPUs on which the DAX memory may have been dirtied.
>> This is similar to the way we flush TLBs.
>
> Right. And that's exactly the problem: "may have been dirtied"
>
> If DAX is used on 50% of the CPUs and the other 50% are humming away
> happily in user space or run low latency RT tasks w/o ever touching
> it, then having an unconditional flush on ALL CPUs is just wrong
> because you penalize the uninvolved cores with a completely pointless
> SMP function call and drain their caches.
>
It's not wrong and pointless, it's all we have available outside of
having the kernel remember every virtual address that might have been
touched since the last fsync and sit in a loop flushing those virtual
address cache line by cache line.
There is a crossover point where wbinvd is better than a clwb loop
that needs to be determined.
Re: [PATCH 0/2] "big hammer" for DAX msync/fsync correctness
by Dan Williams
On Fri, Nov 6, 2015 at 9:35 AM, Thomas Gleixner <tglx(a)linutronix.de> wrote:
> On Fri, 6 Nov 2015, Dan Williams wrote:
>> On Fri, Nov 6, 2015 at 12:06 AM, Thomas Gleixner <tglx(a)linutronix.de> wrote:
>> > Just for the record. Such a flush mechanism with
>> >
>> > on_each_cpu()
>> > wbinvd()
>> > ...
>> >
>> > will make that stuff completely unusable on Real-Time systems. We've
>> > been there with the big hammer approach of the intel graphics
>> > driver.
>>
>> Noted. This means RT systems either need to disable DAX or avoid
>> fsync. Yes, this is a wart, but not an unexpected one in a first
>> generation persistent memory platform.
>
> And it's not just only RT. The folks who are aiming for 100%
> undisturbed user space (NOHZ_FULL) will be massively unhappy about
> that as well.
>
> Is it really required to do that on all cpus?
>
I believe it is, but I'll double check.
I assume the folks that want undisturbed userspace are ok with the
mitigation to modify their application to flush by individual cache
lines if they want to use DAX without fsync. At least until the
platform can provide a cheaper fsync implementation.
The option to drive cache flushing from the radix is at least
interruptible, but it might be long running depending on how much
virtual address space is dirty. Altogether, the options in the
current generation are:
1/ wbinvd driven: quick flush O(size of cache), but long interrupt-off latency
2/ radix driven: long flush O(size of dirty range), but at least preempt-able
3/ DAX without calling fsync: userspace takes direct responsibility
for cache management of DAX mappings
4/ DAX disabled: fsync is the standard page cache writeback latency
We could potentially argue about 1 vs 2 ad nauseam, but I wonder if
there is room to punt it to a configuration option or make it
dynamic? My stance is do 1 with the hope of riding options 3 and 4
until the platform happens to provide a better alternative.