Hi Oleg,
Sorry for the late reply. Thanks for clarifying the replica lock details.
For the second case, even if some clients had held read locks instead of
write locks on the data, the MDS would have granted the read locks to the
agent clients but would have set the non-primary replicas to the 'Stale'
state. So I guess that would also have invalidated the client caches
holding the non-primary replicas.
Regards,
Akhilesh Gadde.
On Wed, Apr 8, 2015 at 12:42 AM, Drokin, Oleg <oleg.drokin(a)intel.com> wrote:
The thing is, when we start our write and break the replicated layout into
just the primary one, we are the first writer, so by definition no clients
could have any dirty data in their cache to flush.
So we only need to make the clients flush their read caches, and even then
only those clients that cache a non-primary replica, since the primary
replica remains valid (incoming writes will take care of those read locks).
Regarding the agents that would re-sync the replicas: in order for them
to read the primary replica content, they would need to obtain the
corresponding read locks first, and that would invalidate all the write
locks and cause all clients with dirty cache to flush it. So no additional
steps are needed here, I imagine; it's no different from any other sort of
read.
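
A minimal sketch of the reasoning above, in Python rather than actual Lustre
code. It assumes whole-replica read/write locks (no extents) and ignores the
case of a client conflicting with its own lock; CachedLock, the helper
names, the replica names r0..r2 and the client names are all hypothetical
and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedLock:
    client: str   # hypothetical client name
    replica: str  # hypothetical replica name
    mode: str     # "read" or "write"


def conflicts(held_mode, requested_mode):
    # Two locks are compatible only when both are read locks.
    return "write" in (held_mode, requested_mode)


def revoke_on_first_write(locks, primary):
    # The file was read-only until now, so nobody holds dirty data.
    # Only caches of the non-primary replicas have to be dropped.
    return [lk for lk in locks if lk.replica != primary]


def revoke_on_request(locks, replica, requested_mode):
    # Ordinary lock conflict handling on a single replica: the writer's
    # own request clears read locks on the primary, and an agent's read
    # request forces holders of write locks to flush.
    return [lk for lk in locks
            if lk.replica == replica and conflicts(lk.mode, requested_mode)]


read_only_locks = [
    CachedLock("c1", "r0", "read"),
    CachedLock("c2", "r1", "read"),
    CachedLock("c3", "r2", "read"),
]
stale_cache = revoke_on_first_write(read_only_locks, primary="r0")
writer_conflicts = revoke_on_request(read_only_locks, "r0", "write")

dirty_locks = [CachedLock("c4", "r0", "write")]
agent_conflicts = revoke_on_request(dirty_locks, "r0", "read")
```

Under these assumptions stale_cache holds the locks of c2 and c3,
writer_conflicts holds the read lock of c1 on the primary, and
agent_conflicts holds the write lock of c4, which is the ordinary
read-vs-write conflict path and needs no replication-specific step.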
On Apr 8, 2015, at 12:07 AM, Akhilesh Gadde wrote:
> Hi Oleg,
>
> Thanks again for the clarification.
> i. Yes, I agree with you that the diagram wrongly shows two
> INTENT_WRITE replies. My guess is that the 2nd reply in the diagram is
> the correct one.
> ii. One point that you made - "In case of replication it's not necessary
> to flush primary replica locks because the content does not really change,
> I imagine (as opposed to a restriping where all objects are moved)."
> --> I think that the MDS needs to ask the OSTs to flush the locks held by
> the clients since, as you mentioned, that would cause the dirty buffers
> in the client cache to be written to the OSTs. The reason I think this
> should be the case is that the client that modifies the file may be
> different from the client that is now trying to sync the data between the
> primary replica and the non-primary replicas.
>
>
------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Some relevant information from the file replication design document:
> --> "When the first write comes, and the replicated file is in READ_ONLY
> state, it will increase the layout generation and pick the primary replica,
> then change the file state to WRITE_PENDING and mark non-primary replicas
> STALE. The writing client will then update primary replica’s layout
> generation, and then notify the MDT to set the replicated file state to
> WRITABLE. After the replicated file is re-synchronized, the file state will
> go back to READ_ONLY again."
>
>
>
> --> i. The client sends an RPC called MDT_INTENT_WRITE to the MDT before
> it writes to a replicated file.
> ii. When the MDT receives the MDT_INTENT_WRITE RPC request, and if it
> turns out that the layout has to be changed, it will update the layout
> synchronously.
> iii. The MDT would specify the primary replica to the client, and the
> client’s corresponding OSC would communicate with the primary replica and
> ask it to update the layout info.
> iv. After the primary replica responds, the client would send a message to
> the MDT saying the layout info has been updated, and the MDT sets the file
> to writable.
> v. The client would only write to the primary replica copy of the file.
> vi. For synchronization between primary and non-primary replicas, some
> dedicated clients, named agent clients, could be used to pick files for
> which at least the quiescent time has elapsed since the last write and
> synchronize all the replicas. The client that is re-synchronizing the
> replicated file is not required to be the same client that wrote that file.
>
------------------------------------------------------------------------------------------------------------------------------------------------
>
>
> Regards,
> Akhilesh Gadde.
>
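
The state transitions quoted from the design document above can be sketched
as a small state machine. This is only an illustration in Python, not MDT
code: the ReplicatedFile class, its method names, and the policy of picking
the first replica as primary are assumptions made here; only the state names
and the order of the transitions come from the quoted text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class FileState(Enum):
    READ_ONLY = "READ_ONLY"
    WRITE_PENDING = "WRITE_PENDING"
    WRITABLE = "WRITABLE"


@dataclass
class ReplicatedFile:
    replicas: List[str]                      # hypothetical replica names
    state: FileState = FileState.READ_ONLY
    layout_gen: int = 0
    primary: Optional[str] = None
    stale: Set[str] = field(default_factory=set)

    def intent_write(self) -> Optional[str]:
        """MDT side of MDT_INTENT_WRITE; returns the primary replica."""
        if self.state is FileState.READ_ONLY:
            self.layout_gen += 1
            # Assumption: the real selection policy is not described here.
            self.primary = self.replicas[0]
            self.stale = set(self.replicas) - {self.primary}
            self.state = FileState.WRITE_PENDING
        return self.primary

    def primary_layout_updated(self) -> None:
        """The client reports that the primary replica took the new layout."""
        if self.state is FileState.WRITE_PENDING:
            self.state = FileState.WRITABLE

    def resync_done(self) -> None:
        """An agent client finished re-synchronizing all replicas."""
        self.stale.clear()
        self.state = FileState.READ_ONLY


f = ReplicatedFile(["r0", "r1", "r2"])
primary = f.intent_write()        # first write: READ_ONLY -> WRITE_PENDING
f.primary_layout_updated()        # WRITE_PENDING -> WRITABLE
f.intent_write()                  # later writes do not bump the generation
```

After these calls the file is WRITABLE with r1 and r2 marked stale and the
layout generation increased exactly once; calling f.resync_done() would
clear the stale set and return the file to READ_ONLY.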