On Wed, Apr 8, 2015 at 7:25 PM <
hpdd-discuss-request@lists.01.org> wrote:
Send HPDD-discuss mailing list submissions to
hpdd-discuss@lists.01.org
To subscribe or unsubscribe via the World Wide Web, visit
https://lists.01.org/mailman/listinfo/hpdd-discuss
or, via email, send a message with subject or body 'help' to
hpdd-discuss-request@lists.01.org
You can reach the person managing the list at
hpdd-discuss-owner@lists.01.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of HPDD-discuss digest..."
Today's Topics:
1. Slow write performance help (Kumar, Amit)
2. Re: Slow write performance help
(Mohr Jr, Richard Frank (Rick Mohr))
3. Re: How to read the I/O in flight statistic in
obdfilter.*.brw_stats (Dilger, Andreas)
4. [PATCH] staging: lustre: llite: remove obsolete conditional
code (Andreas Dilger)
5. Re: Lustre file read/write operations (Akhilesh Gadde)
----------------------------------------------------------------------
Message: 1
Date: Wed, 8 Apr 2015 21:05:17 +0000
From: "Kumar, Amit" <ahkumar@mail.smu.edu>
To: "hpdd-discuss@lists.01.org" <hpdd-discuss@lists.01.org>
Subject: [HPDD-discuss] Slow write performance help
Message-ID:
<BB6BA2C397CCB140A2475E542B2164471E79F958@SXMB1PG.SYSTEMS.SMU.EDU>
Content-Type: text/plain; charset="us-ascii"
Dear All,
We had a power outage and recovered perfectly fine, except that 2 of the OSS servers mounting the OSTs over IB from the DDN storage seem to be dead slow. Read is perfectly fine; I get pretty good read performance, about 1200 MB/s. But write is more like 4 MB/s, whereas the OSTs on the other OSSs are doing perfectly fine at about 350 MB/s.
There are no hardware errors on the OSS servers, storage controllers, etc. The storage controllers connecting the two problematic OSSs also serve two other OSSs, and their performance is perfectly fine.
Any help or direction on debugging this would be very helpful. I am running out of ideas on what could cause this. Could it be that it takes a while to recover since the file system crashed?
Thank you,
Amit
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.01.org/pipermail/hpdd-discuss/attachments/20150408/480538db/attachment-0001.html>
------------------------------
Message: 2
Date: Wed, 8 Apr 2015 21:18:49 +0000
From: "Mohr Jr, Richard Frank (Rick Mohr)" <rmohr@utk.edu>
To: "Kumar, Amit" <ahkumar@mail.smu.edu>
Cc: "hpdd-discuss@lists.01.org" <hpdd-discuss@lists.01.org>
Subject: Re: [HPDD-discuss] Slow write performance help
Message-ID: <7643EB45-AE2B-482C-9C09-DC19C44E9DC2@utk.edu>
Content-Type: text/plain; charset="utf-8"
> On Apr 8, 2015, at 5:05 PM, Kumar, Amit <ahkumar@mail.smu.edu> wrote:
> We had a power outage and recovered perfectly fine, except that 2 of the OSS servers mounting the OSTs over IB from the DDN storage seem to be dead slow. Read is perfectly fine; I get pretty good read performance, about 1200 MB/s. But write is more like 4 MB/s, whereas the OSTs on the other OSSs are doing perfectly fine at about 350 MB/s.
>
> There are no hardware errors on the OSS servers, storage controllers, etc. The storage controllers connecting the two problematic OSSs also serve two other OSSs, and their performance is perfectly fine.
>
> Any help or direction on debugging this would be very helpful. I am running out of ideas on what could cause this. Could it be that it takes a while to recover since the file system crashed?
Are all OSTs on those two OSS servers slow? Have you looked at the IB counters to see if there are any errors?
Another thing you could try would be to look at the performance counters on the DDN controllers to see if there is anything out of the ordinary like unusually long write latencies or IO sizes that are smaller than you are expecting.
Have you tried restarting the servers and/or the DDN controllers to see if that clears anything up?
--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
------------------------------
Message: 3
Date: Wed, 8 Apr 2015 23:12:55 +0000
From: "Dilger, Andreas" <andreas.dilger@intel.com>
To: Michael Kluge <michael.kluge@tu-dresden.de>,
"<hpdd-discuss@lists.01.org>" <hpdd-discuss@ml01.01.org>
Subject: Re: [HPDD-discuss] How to read the I/O in flight statistic in
obdfilter.*.brw_stats
Message-ID: <D14B120E.EA13E%andreas.dilger@intel.com>
Content-Type: text/plain; charset="iso-8859-1"
On 2015/04/08, 6:43 AM, "Michael Kluge" <michael.kluge@tu-dresden.de>
wrote:
>Hi all,
>
>can anyone please explain how this individual statistic is
>calculated/updated? From what I understand it shows how many read or
>write requests have been in flight at certain points in time in the
>past. What I don't know is: when is this statistic updated. If I had to
>guess I would say it must be any of these:
>1) whenever a new I/O request is issued, it puts the number of open
>requests in the statistic field
It's this one.
Cheers, Andreas
>2) there is a service thread that reads the I/O queue size at a fixed
>interval
>Any help is appreciated!
>
>
>Regards, Michael
>
>--
>Dr.-Ing. Michael Kluge
>
>Technische Universität Dresden
>Center for Information Services and
>High Performance Computing (ZIH)
>D-01062 Dresden
>Germany
>
>Contact:
>Willersbau, Room A 208
>Phone: (+49) 351 463-34217
>Fax: (+49) 351 463-37773
>e-mail: michael.kluge@tu-dresden.de
>WWW: http://www.tu-dresden.de/zih
>
>
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division
------------------------------
Message: 4
Date: Wed, 8 Apr 2015 17:24:02 -0600
From: Andreas Dilger <andreas.dilger@intel.com>
To: Greg KH <gregkh@linuxfoundation.org>
Cc: devel@driverdev.osuosl.org, hpdd-discuss@lists.01.org,
linux-kernel@vger.kernel.org
Subject: [HPDD-discuss] [PATCH] staging: lustre: llite: remove
obsolete conditional code
Message-ID:
<1428535442-11366-1-git-send-email-andreas.dilger@intel.com>
Remove conditional flock/aops code that was only for out-of-tree
vendor kernels but is not relevant for in-kernel code.
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
---
drivers/staging/lustre/lustre/llite/llite_internal.h | 4 ----
drivers/staging/lustre/lustre/llite/llite_lib.c | 8 --------
drivers/staging/lustre/lustre/llite/rw26.c | 20 --------------------
3 files changed, 32 deletions(-)
diff --git a/drivers/staging/lustre/lustre/llite/llite_internal.h b/drivers/staging/lustre/lustre/llite/llite_internal.h
index 37306e0..1d4b2eb 100644
--- a/drivers/staging/lustre/lustre/llite/llite_internal.h
+++ b/drivers/staging/lustre/lustre/llite/llite_internal.h
@@ -727,11 +727,7 @@ int ll_readahead(const struct lu_env *env, struct cl_io *io,
struct ll_readahead_state *ras, struct address_space *mapping,
struct cl_page_list *queue, int flags);
-#ifndef MS_HAS_NEW_AOPS
extern const struct address_space_operations ll_aops;
-#else
-extern const struct address_space_operations_ext ll_aops;
-#endif
/* llite/file.c */
extern struct file_operations ll_file_operations;
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index a3367bf..8327ad6 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -228,14 +228,6 @@ static int client_common_fill_super(struct super_block *sb, char *md, char *dt,
if (sbi->ll_flags & LL_SBI_USER_XATTR)
data->ocd_connect_flags |= OBD_CONNECT_XATTR;
-#ifdef HAVE_MS_FLOCK_LOCK
- /* force vfs to use lustre handler for flock() calls - bug 10743 */
- sb->s_flags |= MS_FLOCK_LOCK;
-#endif
-#ifdef MS_HAS_NEW_AOPS
- sb->s_flags |= MS_HAS_NEW_AOPS;
-#endif
-
if (sbi->ll_flags & LL_SBI_FLOCK)
sbi->ll_fop = &ll_file_operations_flock;
else if (sbi->ll_flags & LL_SBI_LOCALFLOCK)
diff --git a/drivers/staging/lustre/lustre/llite/rw26.c b/drivers/staging/lustre/lustre/llite/rw26.c
index 2f21304..5d8bdfd 100644
--- a/drivers/staging/lustre/lustre/llite/rw26.c
+++ b/drivers/staging/lustre/lustre/llite/rw26.c
@@ -517,7 +517,6 @@ static int ll_migratepage(struct address_space *mapping,
}
#endif
-#ifndef MS_HAS_NEW_AOPS
const struct address_space_operations ll_aops = {
.readpage = ll_readpage,
.direct_IO = ll_direct_IO_26,
@@ -532,22 +531,3 @@ const struct address_space_operations ll_aops = {
.migratepage = ll_migratepage,
#endif
};
-#else
-const struct address_space_operations_ext ll_aops = {
- .orig_aops.readpage = ll_readpage,
-/* .orig_aops.readpages = ll_readpages, */
- .orig_aops.direct_IO = ll_direct_IO_26,
- .orig_aops.writepage = ll_writepage,
- .orig_aops.writepages = ll_writepages,
- .orig_aops.set_page_dirty = ll_set_page_dirty,
- .orig_aops.prepare_write = ll_prepare_write,
- .orig_aops.commit_write = ll_commit_write,
- .orig_aops.invalidatepage = ll_invalidatepage,
- .orig_aops.releasepage = ll_releasepage,
-#ifdef CONFIG_MIGRATION
- .orig_aops.migratepage = ll_migratepage,
-#endif
- .write_begin = ll_write_begin,
- .write_end = ll_write_end
-};
-#endif
--
1.9.3
------------------------------
Message: 5
Date: Wed, 8 Apr 2015 22:25:30 -0400
From: Akhilesh Gadde <akhilesh.gadde@stonybrook.edu>
To: "Drokin, Oleg" <oleg.drokin@intel.com>
Cc: "<hpdd-discuss@lists.01.org>" <hpdd-discuss@ml01.01.org>
Subject: Re: [HPDD-discuss] Lustre file read/write operations
Message-ID:
<CANgPofs7t8sFExCnH691TVj8f824UacHZ+f3Jw=kxa+b71uJng@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi Oleg,
Sorry for the late reply. Thanks for clarifying the replica lock details.
For the second case, even if some clients had held read locks instead of
write locks for the data, the MDS would have granted the read locks on the
agent clients but would have set the non-primary replicas to the 'Stale'
state. So I guess that would also have invalidated the client caches
holding the non-primary replicas.
Regards,
Akhilesh Gadde.
On Wed, Apr 8, 2015 at 12:42 AM, Drokin, Oleg <oleg.drokin@intel.com> wrote:
> The thing is, when we start our write and break the replicated layout into
> just the primary one, we are the first writer ,so by definition, no clients
> could have any dirty data in their cache to flush.
> So we only need to cause the clients to flush their read caches and then
> only for those that cache the non-primary replica since primary replica
> remains valid (incoming writes will take care of those read locks).
>
> Regarding the agents that would re-sync the replicas - in order for them
> to read the primary replica content, they would need to obtain the
> corresponding read locks first, and that would invalidate all the write
> locks and cause all clients with dirty cache to flush their caches, so no
> additional steps are needed here, I imagine; it's no different from any
> other sort of read.
>
> On Apr 8, 2015, at 12:07 AM, Akhilesh Gadde wrote:
>
> > Hi Oleg,
> >
> > Thanks again for the clarification.
> > i. Yes, I agree with you that the diagram wrongly shows two
> INTENT_WRITE replies. My guess is that the 2nd reply in the diagram is
> the correct one.
> > ii. One point that you made - "In case of replication it's not necessary
> to flush primary replica locks because the content does not really change,
> I imagine (as opposed to a restriping where all objects are moved)."
> > --> I think that the MDS needs to ask the OSTs to flush locks on the
> clients since, as you mentioned, that would lead to the write of dirty
> buffers from the client cache to the OSTs. The reason I think this should
> be the case is that the client that modifies the file may be different from
> the client that is now trying to sync the data between primary replica and
> non-primary replicas.
> >
> >
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > Some relevant information from the file replication design document:
> > --> "When the first write comes, and the replicated file is in READ_ONLY
> state, it will increase the layout generation and pick the primary replica,
> then change the file state to WRITE_PENDING and mark non-primary replicas
> STALE. The writing client will then update primary replica's layout
> generation, and then notify the MDT to set the replicated file state to
> WRITABLE. After the replicated file is re-synchronized, the file state will
> go back to READ_ONLY again."
> >
> >
> >
> > --> i. The client sends an RPC called MDT_INTENT_WRITE to the MDT before
> it writes replicated files.
> > ii. When the MDT receives the MDT_INTENT_WRITE RPC request, and if it
> turns out that the layout has to be changed, it will update the layout
> synchronously.
> > iii. MDT would specify the primary replica to the client and the
> client's corresponding OSC would communicate with primary replica and ask
> it to update the layout info.
> > iv. After primary replica responds, the client would send a message to
> MDT saying the layout info has been updated and MDT sets the file to
> writable.
> > v. The client would only write to the primary replica copy of the file.
> > vi. For synchronization between primary and non-primary replicas, some
> dedicated clients,
> > named agent clients, could be used to pick files for which at least the
> quiescent time has elapsed since the last write and synchronize all the replicas.
> The client that is re-synchronizing the replicated file is not required to
> be the same client that wrote that file.
> >
> ------------------------------------------------------------------------------------------------------------------------------------------------
> >
> >
> > Regards,
> > Akhilesh Gadde.
> >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.01.org/pipermail/hpdd-discuss/attachments/20150408/de65e631/attachment.html>
------------------------------
Subject: Digest Footer
_______________________________________________
HPDD-discuss mailing list
HPDD-discuss@lists.01.org
https://lists.01.org/mailman/listinfo/hpdd-discuss
------------------------------
End of HPDD-discuss Digest, Vol 29, Issue 11
********************************************