On Fri, Apr 29, 2016 at 4:36 PM, Linda Knippers <linda.knippers(a)hpe.com> wrote:
I tested Vishal's original ARS patches and they worked correctly
on my test
system, which doesn't support ARS. By worked correctly, I mean that they
properly looked at the status of the Query ARS Capabilities function and saw
that it indicated that the function is not supported. The original code
skipped the rest of the ARS processing.
Why doesn't the firmware just not export the ARS function in the
supported functions list?
We should still fix this bug, but the whole "implement a function to
say this function is not implemented" seems a bit of a waste.
The code I tested has been re-worked a number of times and now it no
longer
looks at the status of the Query function. It assumes that if the DSM worked
at all, it must be ok and happily uses fields that is shouldn't be looking at.
If you look here:
> static int acpi_nfit_query_poison(struct acpi_nfit_desc *acpi_desc,
> struct nfit_spa *nfit_spa)
> {
> struct acpi_nfit_system_address *spa = nfit_spa->spa;
> int rc;
>
> if (!nfit_spa->max_ars) {
On a platform that doesn't support ARS, max_ars will always be zero
so you'll keep executing this code when it seems like you're trying
to avoid that.
> struct nd_cmd_ars_cap ars_cap;
>
> memset(&ars_cap, 0, sizeof(ars_cap));
> rc = ars_get_cap(acpi_desc, &ars_cap, nfit_spa);
> if (rc < 0)
> return rc;
The call succeeds so this return isn't taken, but then the code assumes
everything is good. It should check ars_cap.status so see if the function
is supported or if there was an error and return something appropriate.
In previous version of the code, that's what acpi_nfit_find_poison() did.
Instead, this function continues, using data that's not right and making
more calls that also aren't supported.
Hmm, we should be bailing on any non-zero status, why is that getting skipped?
/* Command failed */
if (ars_cap->status & 0xffff)
return -EIO;
> nfit_spa->max_ars = ars_cap.max_ars_out;
> nfit_spa->clear_err_unit = ars_cap.clear_err_unit;
Since we don't bail out, the above values are zero but not actually checked
by subsequent users.
> /* check that the supported scrub types match the spa type */
> if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE &&
> ((ars_cap.status >> 16) & ND_ARS_VOLATILE)
== 0)
> return -ENOTTY;
> else if (nfit_spa_type(spa) == NFIT_SPA_PM &&
> ((ars_cap.status >> 16) &
ND_ARS_PERSISTENT) == 0)
> return -ENOTTY;
> }
>
> if (ars_status_alloc(acpi_desc, nfit_spa->max_ars))
> return -ENOMEM;
>
> rc = ars_get_status(acpi_desc);
> if (rc < 0 && rc != -ENOSPC)
> return rc;
>
> if (ars_status_process_records(acpi_desc->nvdimm_bus,
> acpi_desc->ars_status))
> return -ENOMEM;
>
> return 0;
> }
I don't know if you want to change how this function works or change how
ars_get_cap works but something needs to change. You actually do check the
status in xlat_status() if the call was initiated from user space but you don't
check when it's initiated from within the kernel.
Hmm, why do you say that? xlat_status() only gets skipped if the
output buffer is underrun and in that case the underrun should trigger
-ENXIO to be returned.
For reference, below is the dmesg output from my system running the
latest
upstream kernel.
Can you add some debug tracing to xlat_status()? Something is not lining up.