On 4/29/2016 8:12 PM, Dan Williams wrote:
On Fri, Apr 29, 2016 at 4:36 PM, Linda Knippers
<linda.knippers(a)hpe.com> wrote:
> I tested Vishal's original ARS patches and they worked correctly on my test
> system, which doesn't support ARS. By worked correctly, I mean that they
> properly looked at the status of the Query ARS Capabilities function and saw
> that it indicated that the function is not supported. The original code
> skipped the rest of the ARS processing.
Why doesn't the firmware just not export the ARS function in the
supported functions list?
It's a root device function. In theory a platform might support ARS for some
SPA ranges but not for others so even if the function is supported in general,
it might not be supported for all ranges. An SPA range might also support ARS
but not the ability to clear an error. I think you have to check all these
things.
We should still fix this bug, but the whole "implement a
function to
say this function is not implemented" seems a bit of a waste.
Even if all the functions are supported, you still have to check the status
because something could fail.
> The code I tested has been re-worked a number of times and now it no longer
> looks at the status of the Query function. It assumes that if the DSM worked
> at all, it must be ok and happily uses fields that is shouldn't be looking at.
>
> If you look here:
>
>> static int acpi_nfit_query_poison(struct acpi_nfit_desc *acpi_desc,
>> struct nfit_spa *nfit_spa)
>> {
>> struct acpi_nfit_system_address *spa = nfit_spa->spa;
>> int rc;
>>
>> if (!nfit_spa->max_ars) {
>
> On a platform that doesn't support ARS, max_ars will always be zero
> so you'll keep executing this code when it seems like you're trying
> to avoid that.
>
>> struct nd_cmd_ars_cap ars_cap;
>>
>> memset(&ars_cap, 0, sizeof(ars_cap));
>> rc = ars_get_cap(acpi_desc, &ars_cap, nfit_spa);
>> if (rc < 0)
>> return rc;
>
> The call succeeds so this return isn't taken, but then the code assumes
> everything is good. It should check ars_cap.status so see if the function
> is supported or if there was an error and return something appropriate.
> In previous version of the code, that's what acpi_nfit_find_poison() did.
> Instead, this function continues, using data that's not right and making
> more calls that also aren't supported.
Hmm, we should be bailing on any non-zero status, why is that getting skipped?
/* Command failed */
if (ars_cap->status & 0xffff)
return -EIO;
>
>> nfit_spa->max_ars = ars_cap.max_ars_out;
>> nfit_spa->clear_err_unit = ars_cap.clear_err_unit;
>
> Since we don't bail out, the above values are zero but not actually checked
> by subsequent users.
>
>> /* check that the supported scrub types match the spa type */
>> if (nfit_spa_type(spa) == NFIT_SPA_VOLATILE &&
>> ((ars_cap.status >> 16) &
ND_ARS_VOLATILE) == 0)
>> return -ENOTTY;
>> else if (nfit_spa_type(spa) == NFIT_SPA_PM &&
>> ((ars_cap.status >> 16) &
ND_ARS_PERSISTENT) == 0)
>> return -ENOTTY;
>> }
>>
>> if (ars_status_alloc(acpi_desc, nfit_spa->max_ars))
>> return -ENOMEM;
>>
>> rc = ars_get_status(acpi_desc);
>> if (rc < 0 && rc != -ENOSPC)
>> return rc;
>>
>> if (ars_status_process_records(acpi_desc->nvdimm_bus,
>> acpi_desc->ars_status))
>> return -ENOMEM;
>>
>> return 0;
>> }
>
> I don't know if you want to change how this function works or change how
> ars_get_cap works but something needs to change. You actually do check the
> status in xlat_status() if the call was initiated from user space but you don't
> check when it's initiated from within the kernel.
Hmm, why do you say that? xlat_status() only gets skipped if the
output buffer is underrun and in that case the underrun should trigger
-ENXIO to be returned.
> For reference, below is the dmesg output from my system running the latest
> upstream kernel.
Can you add some debug tracing to xlat_status()? Something is not lining up.
xlat_status() is not called from acpi_nfit_query_poison(), which is called at
boot time. Not sure if we're getting there from acpi_nfit_scrub() or
acpi_nfit_async_scrub().
-- ljk