On 2013/05/29 3:17 PM, "gary.k.sweely(a)census.gov"
<gary.k.sweely(a)census.gov> wrote:
Our backend storage is all virtualized using IBM SVC clusters.
That is probably not going to work very well for performance, especially
if OSTs are
going to share the same underlying disks.
We can have the SVC cluster maintain a mirror or create a snapshot for
the backup. Assuming the snapshot requires a short lock-up of MDS write
services, creating a mirror and a snapshot of that mirror would maintain
accessibility of the MDS. The mirror/snapshot could be presented to the
backup system for DR or major-file-corruption backups/restores of the
MDS. The MDS space is small enough that we can probably accept the cost
of mirroring it.
We can't do the same for the OSTs; there is too much data to mirror. I'm
thinking the OSTs would be backed up by several Lustre client backup
media servers. The clients would back up different subdirectories
concurrently and then restore them concurrently to meet our RTO.
It seems like that should be handled OK, but I may be misunderstanding
how Lustre works, and I clearly don't know the internals well enough to
know whether this might cause it to choke.
That is exactly what Lustre is designed to do - parallel IO.
So the question becomes: I have a major failure, and I restore the MDS from backup.
You only need to restore the MDT from backup if the MDT had a critical
failure.
In every other case, you restore specific files from your file-level
backup.
It is possible to use e.g. "lfs find" to find files located on a specific
failed
OST if there is a critical OST failure. In that case, you would unlink
the failed
files located on that OST, and then restore them from backup.
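As a sketch of that "lfs find, unlink, restore" sequence: every value here is an assumption (the mount point, the failed OST's UUID, and the restore tool, which is a placeholder), and the commands are echoed rather than executed so the plan can be reviewed before running it for real on a Lustre client.

```shell
#!/bin/sh
# Dry-run sketch of recovering files lost with a failed OST.
plan_ost_recovery() {
    mnt=$1          # Lustre mount point on the client
    ost_uuid=$2     # failed OST's UUID, as reported by "lfs df" / "lctl dl"
    list=/tmp/lost_files.txt

    # 1. List every file with at least one stripe on the failed OST.
    echo "lfs find $mnt --ost $ost_uuid -print > $list"
    # 2. Unlink them -- their objects on the failed OST are gone.
    echo "xargs -a $list rm -f"
    # 3. Re-create the same paths from the file-level backup
    #    (restore_from_backup is a placeholder for your backup tool).
    echo "restore_from_backup --file-list $list"
}

plan_ost_recovery /mnt/lustre lustre-OST0002_UUID
```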
I rebuild the OST server OS from an image via our Satellite service,
then restore the files from the most recent client backup. How well does
the MDS handle all the files being recreated by several clients, when
the restored MDS holds a previous image of metadata that no longer
exists?
I think you are confused. If you are restoring a file-level backup then
Lustre doesn't care what you are writing (i.e. restoring), it is just a
file write.
Cheers, Andreas
Does it just overwrite all of its tables? Or does it start choking
because of the streaming writes from multiple concurrent restore sessions?
---------------------------------------------------------------
Truth 23. Your common sense is not always someone else's common sense.
Don't assume that just because it's obvious to you, it will be obvious to
others.
---------------------------------------------------------------
Gary K Sweely, 301-763-5532, Cell 301-651-2481
SAN and Storage Systems Manager
US Census, Bowie Computer Center
Paper Mail to:
US Census, CSVD-BCC
Washington DC 20233
Office Physical or Delivery Address
17101 Melford Blvd
Bowie MD, 20715
-----"Cowe, Malcolm J" <malcolm.j.cowe(a)intel.com> wrote: -----
To: "gary.k.sweely(a)census.gov" <gary.k.sweely(a)census.gov>,
"hpdd-discuss(a)lists.01.org" <hpdd-discuss(a)lists.01.org>
From: "Cowe, Malcolm J" <malcolm.j.cowe(a)intel.com>
Date: 05/29/2013 04:51PM
Cc: "Dilger, Andreas" <andreas.dilger(a)intel.com>, Chris Churchey
<churchey(a)theatsgroup.com>, "james.m.lessard(a)census.gov"
<james.m.lessard(a)census.gov>, "Holtz, JohnX" <johnx.holtz(a)intel.com>,
Prasad Surampudi <prasad.surampudi(a)theatsgroup.com>,
"raymond.illian(a)census.gov" <raymond.illian(a)census.gov>,
"anthony.t.li(a)census.gov" <anthony.t.li(a)census.gov>
Subject: RE: [HPDD-discuss] Anyone Backing up a Large LUSTRE file
systems, any issues
Gary,
The MDT backup is there to provide a fast time to restore of the MDS
should that service be irrevocably compromised. It would not play a part
in normal file restore operations
of e.g. individual files or directories (nor would it provide any
benefit if the whole file system was lost completely). Backing up the MDT
is a shortcut for the eventuality of losing just the metadata. In your
scenario, how do you intend to mirror the MDTs?
You can multi-home the Lustre servers and attach backup hosts exclusively
to that network. You will need to do this for the MGS, MDS and OSS
servers. If your backup system
supports it, attaching the backup servers to the tape library through an
FC SAN is way less overhead than Ethernet.
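A minimal sketch of the multi-homed LNet setup Malcolm describes, assuming two TCP networks on interfaces named eth0/eth1 (names and addresses are placeholders to adapt):

```
# /etc/modprobe.d/lustre.conf on each server (MGS, MDS, OSS):
# expose the node on two LNet networks, one per interface --
# tcp0 for compute clients, tcp1 reserved for the backup hosts.
options lnet networks="tcp0(eth0),tcp1(eth1)"

# A backup host on the tcp1 network then mounts via that NID, e.g.:
#   mount -t lustre 10.0.1.1@tcp1:/fsname /mnt/lustre
```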
Ethernet bonding can deliver benefits but is very implementation
dependent (switches also play a part here), and I'm not well versed in
it. The use cases for Linux are pretty widely documented though.
Malcolm.
--
Malcolm Cowe, Systems Engineer
Intel High Performance Data Division
malcolm.j.cowe(a)intel.com
+61 408 573 001
From: gary.k.sweely(a)census.gov [mailto:gary.k.sweely@census.gov]
Sent: Thursday, May 30, 2013 1:49 AM
To: Cowe, Malcolm J; hpdd-discuss(a)lists.01.org
Cc: Dilger, Andreas; Chris Churchey; james.m.lessard(a)census.gov;
Holtz, JohnX; Prasad Surampudi; raymond.illian(a)census.gov;
anthony.t.li(a)census.gov
Subject: RE: [HPDD-discuss] Anyone Backing up a Large LUSTRE file
systems, any issues
Thanks, Lots of helpful information.
Our Lustre service requirement doesn't carry enough urgency (clout) to
justify a full mirror of the Lustre FS for HA and DR purposes. So we
will be falling back to a lower level of backup/DR functionality,
focused on DR restoration within 5 days and the ability to rapidly
restore a file from the previous day's backup.
If I create an image backup of the MDSs, get a client-based file backup
of the OSTs over a longer window or different time period, and then had
to do a full restore, will the OSTs be so far out of sync with the MDTs
that the file system would be unreadable, or is there a process to
resynchronize the MDT data with the actual files on the OSTs?
Sounds like I should be focused on:
* Getting a solid backup of the MDS servers and at least the MDT
volumes. The configuration design would be to mirror the MDS server(s),
snapshot the mirror, and run the backup against the snapshot.
This provides rapid recovery of the MDS using the mirror, and DR using
the backup, with no downtime for the MDS/MDTs. Because the MDS/MDTs are
small, totalling less than 1TB, I can probably afford to mirror them.
If I don't want to tie up the disk space of a mirror, then I can just
snapshot the MDS and run the backup against the snapshot, but this
requires a short period of locked access to the MDTs while the snapshot
is being generated.
* Getting basic file backups from within the Lustre file system for
individual file restores. I need to set up a few more Lustre clients to
act as backup media servers that can concurrently back up directories
and files of the Lustre file systems. The quantity of "media" servers
depends on how many directories they can back up within the backup
window and how much load they create on the OSSs and MDS while running
the backup.
Because we have jobs running round the clock, the backups will be
competing for IO on the OSSs.
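The concurrent per-directory pattern can be sketched in miniature as follows; the paths are assumptions, and on real hardware each background tar stream would instead run on its own backup media client.

```shell
#!/bin/sh
# Sketch: concurrent per-directory file-level backups.
# backup_all <fs_root> <dest>: one background tar per top-level
# directory, standing in for one "media server" worth of work each.
backup_all() {
    fs_root=$1     # e.g. the Lustre mount point
    dest=$2        # e.g. the backup staging area
    for dir in "$fs_root"/*/; do
        name=$(basename "$dir")
        tar -czf "$dest/$name.tar.gz" -C "$fs_root" "$name" &
    done
    wait    # block until every concurrent stream has finished
}
```

Restores work the same way in reverse: one concurrent extraction stream per directory, which is how the RTO target above would be met.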
Our OSS Ethernet fabric is currently limited to 10GbE. This leads to
more research questions about what kind of load the backup will generate
against the OSSs, to determine how it will impact regular workload
activities.
Can an OST be presented over two different IP addresses on a single OSS,
such that we could push the backups/restores over their own interface
and IP address?
An alternative would be bonding multiple 10GbE links together.
Has anyone tested whether 2 bonded 10GbE interfaces perform equally to 2
separate fully loaded 10GbE interfaces?
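For reference, a RHEL-style 802.3ad (LACP) bond sketch; the device names, address, and hash policy are assumptions to adapt:

```
# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=10.0.0.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
# layer3+4 hashing spreads distinct TCP flows across member links
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (likewise for eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

One caveat: a single TCP stream still rides only one member link, so two bonded 10GbE ports only approach two separate fully loaded links when many concurrent flows hash across both, which a multi-client, multi-stream backup pattern tends to provide.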
Which also leads to a question about CPU load. Does Lustre distribute IO
load well across multiple CPUs?
The Ethernet traffic for bonded 10GbE is likely to use the full capacity
of a couple of CPUs, so we would need to see load distributed across
multiple CPU cores, and hopefully have affinity of the Ethernet IO to
the core servicing the file activity.
We will also need to take into consideration the backup network load and
the number of Lustre backup clients that are running, to help define how
many OSSs are needed and how many OSTs of what sizes would be on each
OSS.
Additional research digging: is a file-service backup process likely to
thrash the Lustre caching tables?
We are going to be running a test configuration with a few OSSs, a few
OSTs, and a couple of client servers, so I guess we will see what load
our backup system creates and what performance design issues arise.
This is going to turn into an interesting exercise; not meant to be a
full benchmark, but instead a feasibility analysis to help determine
Lustre design requirements for our go-to environment.
-----"Cowe, Malcolm J" <malcolm.j.cowe(a)intel.com> wrote: -----
To: "Dilger, Andreas" <andreas.dilger(a)intel.com>
From: "Cowe, Malcolm J" <malcolm.j.cowe(a)intel.com>
Date: 05/29/2013 06:30AM
Cc: "gary.k.sweely(a)census.gov" <gary.k.sweely(a)census.gov>,
"hpdd-discuss(a)lists.01.org" <hpdd-discuss(a)lists.01.org>,
Prasad Surampudi <prasad.surampudi(a)theatsgroup.com>, "Holtz, JohnX"
<johnx.holtz(a)intel.com>, "raymond.illian(a)census.gov"
<raymond.illian(a)census.gov>, Chris Churchey <churchey(a)theatsgroup.com>,
"james.m.lessard(a)census.gov" <james.m.lessard(a)census.gov>
Subject: RE: [HPDD-discuss] Anyone Backing up a Large LUSTRE file
systems, any issues
Agreed. In addition to a file based backup strategy, capturing the MDTs
with a device level backup can protect against catastrophic loss of the
MDT. In this situation, restoring the MDT from
backup and running consistency checks on the file system will be far
quicker than recreating the FS and implementing a full restore.
A lot depends on the criticality of the service being supported and the
SLA for operational availability of the platform. A strategy that is
built on device level backup of the MDTs along with
file-based backup of the whole file system should provide sound
coverage. In addition, replication, properly implemented, represents the
fastest time to recovery in the context of a DR plan and can be useful in
quickly rectifying mistakes in production systems
as well. The cost is reduced overall capacity, plus the additional
processes required to fail over and fail back.
Regards,
Malcolm.
From: Dilger, Andreas
Sent: Wednesday, May 29, 2013 7:20 PM
To: Cowe, Malcolm J
Cc: gary.k.sweely(a)census.gov;
hpdd-discuss(a)lists.01.org; Prasad Surampudi; Holtz, JohnX;
raymond.illian(a)census.gov; Chris Churchey;
james.m.lessard(a)census.gov
Subject: Re: [HPDD-discuss] Anyone Backing up a Large LUSTRE file
systems, any issues
I'd still also recommend a device level backup (using "dd", preferably of
a snapshot) for the MDT filesystem. This is absolutely critical
information, and backup/restore using "dd" is much more efficient than
file-level backups, and not unreasonable given the
relatively small size of the MDT compared to the total filesystem size.
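A minimal stand-in for that device-level workflow, using an ordinary file in place of the MDT block device (or its snapshot) so the mechanics can be exercised anywhere; on a real MDS you would point it at the snapshot volume while it exists, and bs=4M is a generic choice, not a tuned value.

```shell
#!/bin/sh
# Sketch of a "dd of a snapshot" MDT image backup and restore.
backup_mdt() {
    device=$1    # MDT device, or better, its snapshot volume
    image=$2     # destination image file on backup storage
    dd if="$device" of="$image" bs=4M conv=fsync 2>/dev/null
}

restore_mdt() {
    image=$1     # previously captured image file
    device=$2    # replacement MDT device
    dd if="$image" of="$device" bs=4M conv=fsync 2>/dev/null
}
```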
Cheers, Andreas
On 2013-05-28, at 17:27, "Cowe, Malcolm J" <malcolm.j.cowe(a)intel.com>
wrote:
Hi Gary,
I would recommend a file-based backup strategy where the backup
processes run on Lustre clients that are connected to the backup
infrastructure. In fact, this is the only realistic way to provide
targeted restores of files/directories. We quite often see data
management or mover nodes in HPC architectures: servers on the boundary
of the cluster that can interface with external data systems such as
tape libraries, either over a network or fibre channel. By managing the
backups like this, there is no need to interface directly with the OSTs
or MDTs, and most if not all backup applications will work perfectly
well on a data management Lustre client.
One might also want to consider an online duplicate of the most critical
data by syncing to a separate Lustre FS, since restore time from a tape
vault can be considerable for a large volume of data. Several strategies
exist, depending on requirements and the applications in use.
Regards,
Malcolm.
--
Malcolm Cowe, Systems Engineer
Intel High Performance Data Division
malcolm.j.cowe(a)intel.com
+61 408 573 001
From:hpdd-discuss-bounces@lists.01.org
[mailto:hpdd-discuss-bounces@lists.01.org]
On Behalf Of gary.k.sweely(a)census.gov
Sent: Wednesday, May 29, 2013 1:16 AM
To: hpdd-discuss(a)lists.01.org
Cc: Prasad Surampudi; Chris Churchey; Holtz, JohnX;
raymond.illian(a)census.gov <mailto:raymond.illian@census.gov>;
james.m.lessard(a)census.gov
Subject: [HPDD-discuss] Anyone Backing up a Large LUSTRE file systems,
any issues
Has anyone identified issues with backing up and restoring a large
Lustre file system?
We want to be able to back up the file system and restore both
individual files and the full file system.
Has anyone identified specific issues with backup and restore of the
Lustre file system?
Backup needs to run while users are accessing and writing files to the
file system.
Backup concern:
1. How does it handle backup of data spread across multiple OSTs/OSSs
yet maintain consistency of the file segments?
2. Will the backup system require a backup media service pulling data
over Ethernet, or can the OSSs do direct backup and restore of the EXT4
file systems for full-system backups/restores while maintaining
consistency of the files spread across OSTs?
3. Is there a specific backup product used to solve some of the file
consistency issues?
We would be using a large tape drive library cluster that can stripe the
backup across multiple tape drives to improve backup media performance.
This would most likely mean having several systems running backups
concurrently to multiple tape-drive stripe sets. I expect we would need
to break the Lustre file systems into several backup segments running
concurrently, which would also mean several independent restores to
restore the whole system. But one major requirement is being able to
restore a single file or directory when needed.
Backup windows would be 8-14 hours.
RTO of single file would need to be under 1 hour.
RTO of full file system would be 4 days.
RPO is one day's worth of project data, 1 week's worth of source data.
We are considering a Lustre environment as follows:
30TB-50TB source data, potentially will grow out to about 200TB.
100TB to 500TB Project workspace.
30TB of user Scratch space (does not need to be backed up).
Initial total capacity 170TB growing to max size of 1PB.
Most likely initially using 2TB OSTs, across 11+ OSSs. We may use larger
OSTs if no issues are found in service/supportability/throughput.
We were thinking of breaking the total space into separate file systems
to allow using multiple MDSs/MDTs to improve MDS performance, which
would also facilitate easier full-file-system Lustre backups/restores.
But this means losing the flexibility of having one large file system.
OSTs using EXT4 or XFS file systems.
About 25 dedicated client servers with 20 to 40 CPU cores and 200GB-1TB
RAM running scheduled batch compute jobs; grows as loads dictate.
Potentially add about 10-100 VMware Virtual client compute servers
running batch jobs. (4 or 8 cores with 8 to 32GB ram).
About 2-5 interactive user nodes, nodes added as load needs dictate.
Cheers, Andreas
--
Andreas Dilger
Lustre Software Architect
Intel High Performance Data Division