Hi Oleg,
Thanks again for the clarification.
i. Yes. I agree with you that the diagram wrongly shows two INTENT_WRITE replies. My guess is that the second reply in the diagram is the correct one.
ii. One point that you made - "In case of replication it's not necessary to flush primary replica locks because the content does not really change, I imagine (as opposed to a restriping where all objects are moved)."
--> I think the MDS does need to ask the OSTs to flush the locks held by the clients since, as you mentioned, that causes the dirty buffers in the client cache to be written out to the OSTs. My reasoning is that the client that modified the file may be different from the client that is now trying to sync the data between the primary replica and the non-primary replicas. Without the flush, the syncing client could read data from the primary replica that does not yet include the writer's cached changes.
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Some relevant information from the file replication design document:
--> "When the first write comes, and the replicated file is in READ_ONLY state, it will increase the layout generation and pick the primary replica, then change the file state to WRITE_PENDING and mark non-primary replicas STALE. The writing client will then update primary replica’s layout generation, and then notify the MDT to set the replicated file state to WRITABLE. After the replicated file is re-synchronized, the file state will go back to READ_ONLY again."
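To make sure I am reading the quoted paragraph correctly, here is a minimal sketch of the state transitions in Python. This is not Lustre code. Only the state names come from the design document; the class, the method names and the assumption that re-synchronization starts from the WRITABLE state are mine.

```python
from enum import Enum


class FileState(Enum):
    READ_ONLY = "READ_ONLY"
    WRITE_PENDING = "WRITE_PENDING"
    WRITABLE = "WRITABLE"


class ReplicatedFile:
    """Toy model of a replicated file; every name here is hypothetical."""

    def __init__(self, replicas):
        self.state = FileState.READ_ONLY
        self.layout_gen = 0
        self.primary = None
        # replica name -> True if that replica is currently marked STALE
        self.stale = {name: False for name in replicas}

    def first_write(self, primary):
        """First write in READ_ONLY: bump the layout generation, pick the
        primary replica and mark every other replica STALE."""
        if self.state is not FileState.READ_ONLY:
            return  # later writes do not repeat the transition
        if primary not in self.stale:
            raise ValueError("unknown replica: " + primary)
        self.layout_gen += 1
        self.primary = primary
        for name in self.stale:
            self.stale[name] = name != primary
        self.state = FileState.WRITE_PENDING

    def primary_updated(self):
        """The writing client has updated the primary replica's layout
        generation and notified the MDT."""
        if self.state is not FileState.WRITE_PENDING:
            raise RuntimeError("file is not in WRITE_PENDING")
        self.state = FileState.WRITABLE

    def resynced(self):
        """All replicas match the primary again (assumed to start from WRITABLE)."""
        if self.state is not FileState.WRITABLE:
            raise RuntimeError("file is not in WRITABLE")
        for name in self.stale:
            self.stale[name] = False
        self.state = FileState.READ_ONLY
```

With three replicas r0, r1 and r2, calling first_write("r0") leaves r1 and r2 marked STALE and the file in WRITE_PENDING; primary_updated() moves it to WRITABLE, and resynced() clears the STALE marks and returns it to READ_ONLY.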
--> i. The client sends an RPC called MDT_INTENT_WRITE to the MDT before it writes replicated files.
ii. When the MDT receives the MDT_INTENT_WRITE RPC request, and it turns out that the layout has to be changed, it will update the layout synchronously.
iii. The MDT would specify the primary replica to the client, and the client's corresponding OSC would communicate with the primary replica and ask it to update the layout info.
iv. After the primary replica responds, the client would send a message to the MDT saying the layout info has been updated, and the MDT sets the file to writable.
v. The client would only write to the primary replica copy of the file.
vi. For synchronization between primary and non-primary replicas, some dedicated clients, named agent clients, could be used to pick files for which at least the quiescent time has elapsed since the last write and synchronize all of their replicas. The client that is re-synchronizing the replicated file is not required to be the same client that wrote that file.
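As an illustration of how I understand step vi, an agent client could select its candidates roughly as follows. This is only a sketch in Python under my own assumptions: the FileRecord fields and the pick_resync_candidates function are hypothetical names, not anything from the design document or from Lustre.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str
    last_write: float   # time of the last write, in seconds
    needs_resync: bool  # True if any non-primary replica is STALE


def pick_resync_candidates(records, now, quiescent_secs):
    """Return the paths of files that need a resync and have seen no
    write for at least quiescent_secs seconds."""
    return [
        r.path
        for r in records
        if r.needs_resync and now - r.last_write >= quiescent_secs
    ]


records = [
    FileRecord("/a", last_write=50.0, needs_resync=True),   # quiet for 50 s
    FileRecord("/b", last_write=90.0, needs_resync=True),   # quiet for 10 s
    FileRecord("/c", last_write=10.0, needs_resync=False),  # already in sync
]
candidates = pick_resync_candidates(records, now=100.0, quiescent_secs=30.0)
```

Here only /a is selected: /b was written too recently and /c has no stale replicas. Nothing in this selection depends on which client did the writing, which is why the question of flushing the writer's dirty buffers matters to me.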
------------------------------------------------------------------------------------------------------------------------------------------------
Regards,
Akhilesh Gadde.