Hi Oleg,
Thanks again for the clarification.
i. Yes, I agree with you that the diagram wrongly shows two INTENT_WRITE
replies. My guess is that the second reply in the diagram is the correct
one.
ii. One point that you made - "In case of replication it's not necessary to
flush primary replica locks because the content does not really change, I
imagine (as opposed to a restriping where all objects are moved)."
--> I think that the MDS does need to ask the OSTs to flush the locks held
by the clients since, as you mentioned, that causes the dirty buffers in
the client cache to be written to the OSTs. The reason I think so is that
the client that modified the file may be different from the client that is
now trying to sync the data between the primary replica and the non-primary
replicas. Without the flush, the syncing client could copy data from the
primary replica that is still missing the writer's cached changes.
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Some relevant information from the file replication design document:
--> "When the first write comes, and the replicated file is in READ_ONLY
state, it will increase the layout generation and pick the primary replica,
then change the file state to WRITE_PENDING and mark non-primary replicas
STALE. The writing client will then update primary replica’s layout
generation, and then notify the MDT to set the replicated file state to
WRITABLE. After the replicated file is re-synchronized, the file state will
go back to READ_ONLY again."
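
The state transitions in the passage above can be sketched as a small state
machine. This is only an illustration in Python, not Lustre code: the names
ReplicatedFile, intent_write, primary_updated and resync_done are
hypothetical, picking the first replica as the primary is an arbitrary
assumption, and clearing the STALE marks on resync is my reading of
"re-synchronized" rather than something the document states.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FileState(Enum):
    READ_ONLY = "READ_ONLY"
    WRITE_PENDING = "WRITE_PENDING"
    WRITABLE = "WRITABLE"


@dataclass
class Replica:
    name: str
    stale: bool = False
    layout_gen: int = 0


@dataclass
class ReplicatedFile:
    """Hypothetical model of the per-file state described in the design doc."""
    replicas: List[Replica]
    state: FileState = FileState.READ_ONLY
    layout_gen: int = 0
    primary: Optional[Replica] = None

    def intent_write(self) -> Replica:
        """MDT side: handle the first write to a READ_ONLY file."""
        if self.state is FileState.READ_ONLY:
            self.layout_gen += 1
            # Assumption: the primary selection policy is not given in the
            # document, so the first replica is used here.
            self.primary = self.replicas[0]
            for replica in self.replicas:
                replica.stale = replica is not self.primary
            self.state = FileState.WRITE_PENDING
        return self.primary

    def primary_updated(self) -> None:
        """Client updated the primary's layout generation and notified the MDT."""
        if self.state is not FileState.WRITE_PENDING:
            raise RuntimeError("file is not in WRITE_PENDING state")
        self.primary.layout_gen = self.layout_gen
        self.state = FileState.WRITABLE

    def resync_done(self) -> None:
        """All replicas were re-synchronized from the primary."""
        if self.state is not FileState.WRITABLE:
            raise RuntimeError("file is not in WRITABLE state")
        for replica in self.replicas:
            replica.stale = False  # assumption: resync clears the STALE mark
        self.state = FileState.READ_ONLY
```

Calling intent_write(), primary_updated() and resync_done() in that order
walks a file through READ_ONLY, WRITE_PENDING, WRITABLE and back to
READ_ONLY.
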
--> i. The client sends an RPC called MDT_INTENT_WRITE to the MDT before it
writes to a replicated file.
ii. When the MDT receives the MDT_INTENT_WRITE RPC request and it turns out
that the layout has to be changed, it updates the layout synchronously.
iii. The MDT specifies the primary replica to the client, and the client’s
corresponding OSC communicates with the primary replica and asks it to
update the layout info.
iv. After the primary replica responds, the client sends a message to the
MDT saying that the layout info has been updated, and the MDT sets the file
state to WRITABLE.
v. The client writes only to the primary replica copy of the file.
vi. For synchronization between the primary and non-primary replicas, some
dedicated clients, named agent clients, could be used to pick files for
which at least the quiescent time has elapsed since the last write and
synchronize all their replicas. The client that re-synchronizes a
replicated file is not required to be the same client that wrote that file.
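
As a rough sketch of step vi, an agent client could select resync
candidates by comparing each file's last write time against the quiescent
time. The function name pick_files_to_resync, the dictionary of last write
times and the use of seconds are all assumptions made for illustration; the
design document does not specify how agents track this.

```python
import time
from typing import Dict, List, Optional


def pick_files_to_resync(last_write: Dict[str, float],
                         quiescent_secs: float,
                         now: Optional[float] = None) -> List[str]:
    """Return paths whose last write is at least quiescent_secs old.

    last_write maps a file path to the timestamp (in seconds) of its most
    recent write. This bookkeeping is hypothetical.
    """
    if now is None:
        now = time.time()
    # Sorted only so that the result order is deterministic.
    return sorted(path for path, written_at in last_write.items()
                  if now - written_at >= quiescent_secs)
```

Any client can run this selection, which matches the point that the
re-synchronizing client need not be the one that wrote the file.
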
------------------------------------------------------------------------------------------------------------------------------------------------
Regards, Akhilesh Gadde.