Hi Oleg,

Sorry for the late reply, and thanks for clarifying the replica lock details.
For the second case, even if some clients held read locks rather than write locks on the data, the MDS would have granted the read locks to the agent clients but would also have set the non-primary replicas to the 'Stale' state. So I guess that would also have invalidated the client caches holding the non-primary replicas.
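
To check my understanding of the state transitions, here is a rough Python sketch of the sequence quoted from the design document below. It is only a toy model: the names (ReplicatedFile, intent_write, primary_updated, resync_done) are made up for illustration, picking the first replica as primary is an assumption, and clearing the stale flags on resync is my guess at what "re-synchronized" implies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FileState(Enum):
    READ_ONLY = "READ_ONLY"
    WRITE_PENDING = "WRITE_PENDING"
    WRITABLE = "WRITABLE"


@dataclass
class Replica:
    name: str
    layout_gen: int = 0
    stale: bool = False


@dataclass
class ReplicatedFile:
    replicas: List[Replica]
    state: FileState = FileState.READ_ONLY
    layout_gen: int = 0
    primary: Optional[Replica] = None

    def intent_write(self) -> Replica:
        """MDT side: handle the first write to a READ_ONLY file."""
        if self.state is FileState.READ_ONLY:
            self.layout_gen += 1
            # Assumption: the real primary selection policy is not modelled.
            self.primary = self.replicas[0]
            for r in self.replicas:
                r.stale = r is not self.primary
            self.state = FileState.WRITE_PENDING
        return self.primary

    def primary_updated(self) -> None:
        """Writing client has pushed the new layout generation to the primary."""
        if self.state is not FileState.WRITE_PENDING:
            raise RuntimeError("file is not in WRITE_PENDING state")
        self.primary.layout_gen = self.layout_gen
        self.state = FileState.WRITABLE

    def resync_done(self) -> None:
        """Agent client finished copying the primary to the other replicas."""
        for r in self.replicas:
            r.layout_gen = self.layout_gen
            r.stale = False  # assumption: resync clears the stale mark
        self.state = FileState.READ_ONLY
```

In this model the non-primary replicas are already marked stale by the time intent_write() returns, which is the point where I would expect the client caches of those replicas to be invalidated.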

Regards,
Akhilesh Gadde.

On Wed, Apr 8, 2015 at 12:42 AM, Drokin, Oleg <oleg.drokin@intel.com> wrote:
The thing is, when we start our write and break the replicated layout into just the primary one, we are the first writer, so by definition no clients could have any dirty data in their cache to flush.
So we only need to cause the clients to flush their read caches, and then only those that cache a non-primary replica, since the primary replica remains valid (incoming writes will take care of those read locks).

Regarding the agents that would re-sync the replicas: in order for them to read the primary replica content, they would need to obtain the corresponding read locks first, and that would invalidate all the write locks and cause all clients with dirty cache to flush it. So no additional steps are needed here, I imagine; it's no different from any other sort of read.
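
As a toy illustration of that last point (not the real lock manager code), here is a minimal Python sketch. It assumes whole-object locks with just two modes, "read" and "write", and the class and attribute names are invented for the example.

```python
class ToyLockServer:
    """Whole-object read/write locks for a single object (no extents)."""

    def __init__(self):
        self.granted = {}   # client name -> "read" or "write"
        self.flushed = []   # clients asked to write back dirty data

    def enqueue(self, client, mode):
        """Grant `mode` to `client`, revoking any conflicting locks first."""
        for holder, held in list(self.granted.items()):
            if holder == client:
                continue
            # Read locks are compatible with each other; anything involving
            # a write lock is a conflict.
            if mode == "write" or held == "write":
                if held == "write":
                    # A write-lock holder may have dirty pages, so it has
                    # to flush before giving the lock up.
                    self.flushed.append(holder)
                del self.granted[holder]
        self.granted[client] = mode


server = ToyLockServer()
server.enqueue("writer", "write")   # client that modified the file
server.enqueue("agent", "read")     # agent client starting a re-sync
```

After the second call the writer's lock is gone and "writer" is in server.flushed, so the agent reads data that has already been written back, with no replication-specific step involved.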

On Apr 8, 2015, at 12:07 AM, Akhilesh Gadde wrote:

> Hi Oleg,
>
> Thanks again for the clarification.
> i. Yes. I agree with you that the diagram wrongly shows two INTENT_WRITE replies. My guess is that the 2nd reply in the diagram is the correct one.
> ii. One point that you made - "In case of replication it's not necessary to flush primary replica locks because the content does not really change, I imagine (as opposed to a restriping where all objects are moved)."
> --> I think that the MDS needs to ask the OSTs to flush the locks on the clients since, as you mentioned, that would cause the dirty buffers in the client cache to be written to the OSTs. The reason I think this should be the case is that the client that modifies the file may be different from the client that is now trying to sync the data between the primary replica and the non-primary replicas.
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
> Some relevant information from the file replication design document:
> --> "When the first write comes, and the replicated file is in READ_ONLY state, it will increase the layout generation and pick the primary replica, then change the file state to WRITE_PENDING and mark non-primary replicas STALE. The writing client will then update primary replica’s layout generation, and then notify the MDT to set the replicated file state to WRITABLE. After the replicated file is re-synchronized, the file state will go back to READ_ONLY again."
>
>
>
> --> i. The client sends an RPC called MDT_INTENT_WRITE to the MDT before it writes replicated files.
> ii. When the MDT receives the MDT_INTENT_WRITE RPC request, and if it turns out that layout has to be changed, it will update the layout synchronously.
> iii. MDT would specify the primary replica to the client and the client’s corresponding OSC would communicate with primary replica and ask it to update the layout info.
> iv. After primary replica responds, the client would send a message to MDT saying the layout info has been updated and MDT sets the file to writable.
> v. The client would only write to the primary replica copy of the file.
> vi. For synchronization between primary and non-primary replicas, some dedicated clients, named agent clients, could be used to pick files that have elapsed at least quiescent time since the last write and synchronize all the replicas. The client that is re-synchronizing the replicated file is not required to be the same client that wrote that file.
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
>
> Regards,
> Akhilesh Gadde.
>