[PATCH] dm,dax: Make sure dm_dax_flush() is called if device supports it
by Vivek Goyal
Right now, dm_dax_flush() is not being called, and I think that is because
DAXDEV_WRITE_CACHE is not set on the dm dax device.
If the underlying dax device supports a write cache, set DAXDEV_WRITE_CACHE on
the dm dax device. This will cause dm_dax_flush() to be called.
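For context: the generic dax_flush() path is gated on DAXDEV_WRITE_CACHE and
silently does nothing when the flag is clear. A rough sketch of that gate,
assuming it looks approximately like the drivers/dax/super.c code (the driver
forwarding is elided):
void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
		size_t size)
{
	/* no write cache advertised -> nothing to flush */
	if (!test_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags))
		return;
	/* ... otherwise forward to the device's flush operation ... */
}
So a dm dax_device that never sets the flag never reaches dm_dax_flush(),
which is what the dm_table_set_restrictions() hunk below addresses.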
Signed-off-by: Vivek Goyal <vgoyal(a)redhat.com>
---
drivers/dax/super.c | 5 +++++
drivers/md/dm-table.c | 33 +++++++++++++++++++++++++++++++++
include/linux/dax.h | 1 +
3 files changed, 39 insertions(+)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ce9e563e6e1d..5c5e7b9f6831 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -278,6 +278,11 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc)
}
EXPORT_SYMBOL_GPL(dax_write_cache);
+bool dax_write_cache_enabled(struct dax_device *dax_dev)
+{
+ return test_bit(DAXDEV_WRITE_CACHE, &dax_dev->flags);
+}
+
bool dax_alive(struct dax_device *dax_dev)
{
lockdep_assert_held(&dax_srcu);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index a39bcd9b982a..3be0ab2a71c8 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -20,6 +20,7 @@
#include <linux/atomic.h>
#include <linux/blk-mq.h>
#include <linux/mount.h>
+#include <linux/dax.h>
#define DM_MSG_PREFIX "table"
@@ -1630,6 +1631,35 @@ static bool dm_table_supports_flush(struct dm_table *t, unsigned long flush)
return false;
}
+static int device_dax_flush_capable(struct dm_target *ti, struct dm_dev *dev,
+ sector_t start, sector_t len, void *data)
+{
+ struct dax_device *dax_dev = dev->dax_dev;
+
+ if (!dax_dev)
+ return false;
+
+ if (dax_write_cache_enabled(dax_dev))
+ return true;
+ return false;
+}
+
+static int dm_table_supports_dax_flush(struct dm_table *t)
+{
+ struct dm_target *ti;
+ unsigned i;
+
+ for (i = 0; i < dm_table_get_num_targets(t); i++) {
+ ti = dm_table_get_target(t, i);
+
+ if (ti->type->iterate_devices &&
+ ti->type->iterate_devices(ti, device_dax_flush_capable, NULL))
+ return true;
+ }
+
+ return false;
+}
+
static int device_is_nonrot(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
@@ -1785,6 +1815,9 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
}
blk_queue_write_cache(q, wc, fua);
+ if (dm_table_supports_dax_flush(t))
+ dax_write_cache(t->md->dax_dev, true);
+
/* Ensure that all underlying devices are non-rotational. */
if (dm_table_all_devices_attribute(t, device_is_nonrot))
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 794811875732..df97b7af7e2c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -87,6 +87,7 @@ size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
size_t size);
void dax_write_cache(struct dax_device *dax_dev, bool wc);
+bool dax_write_cache_enabled(struct dax_device *dax_dev);
/*
* We use lowest available bit in exceptional entry for locking, one bit for
--
2.13.3
Moving ndctl development into the kernel tree?
by Dan Williams
Hi Linus,
Would you be open to the ndctl [1] project moving its development into
the kernel tree? The main reasons why I ask are:
* Unit test development can touch both the kernel-side emulated nvdimm
infrastructure in tools/testing/nvdimm/ and the corresponding tests in
tools/ndctl/test/ in the same commit or patch series.
* Like perf, ndctl borrows the sub-command architecture and option
parsing from git. So, this code could be refactored into something
shared / generic, i.e. the bits in tools/perf/util/ (a rough sketch of
that dispatch pattern is below).
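To illustrate the git-style sub-command dispatch being referred to, here is a
hypothetical, self-contained sketch (illustrative names only, not the actual
ndctl or perf sources):
#include <stdio.h>
#include <string.h>

struct cmd_struct {
	const char *name;
	int (*fn)(int argc, char **argv);
};

/* hypothetical sub-commands, for illustration only */
static int cmd_list(int argc, char **argv)   { return 0; }
static int cmd_enable(int argc, char **argv) { return 0; }

static const struct cmd_struct commands[] = {
	{ "list",   cmd_list },
	{ "enable", cmd_enable },
};

int main(int argc, char **argv)
{
	size_t i;

	if (argc < 2) {
		fprintf(stderr, "usage: tool <command> [<args>]\n");
		return 1;
	}
	for (i = 0; i < sizeof(commands) / sizeof(commands[0]); i++)
		if (!strcmp(argv[1], commands[i].name))
			return commands[i].fn(argc - 1, argv + 1);
	fprintf(stderr, "unknown command: %s\n", argv[1]);
	return 1;
}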
We continue to see updates in the ACPI and UEFI specification for
nvdimm details and one of the capabilities added in ACPI 6.2 that
needs new test development is error injection. I'm also expecting to
merge patches from Oliver this cycle expanding nvdimm support to Open
Firmware / powerpc platforms.
The ndctl project includes GPLv2 utilities (ndctl and daxctl) as well
as LGPLv2.1 libraries (libndctl and libdaxctl).
The coupling of the tests to new libnvdimm sub-system capabilities
and the architecture-specific nature of some nvdimm enabling lead me
to believe ndctl would enjoy some synergies living in the same
repository as the kernel.
[1]: https://github.com/pmem/ndctl
Re: KVM "fake DAX" flushing interface - discussion
by Dan Williams
On Fri, Jul 21, 2017 at 8:58 AM, Stefan Hajnoczi <stefanha(a)redhat.com> wrote:
> On Fri, Jul 21, 2017 at 09:29:15AM -0400, Pankaj Gupta wrote:
>>
>> > > A] Problems to solve:
>> > > ------------------
>> > >
>> > > 1] We are considering two approaches for 'fake DAX flushing interface'.
>> > >
>> > > 1.1] fake dax with NVDIMM flush hints & KVM async page fault
>> > >
>> > > - Existing interface.
>> > >
>> > > - The approach to use flush hint address is already nacked upstream.
>> > >
>> > > - Flush hint is not a queued interface for flushing. Applications might
>> > > avoid using it.
>> >
>> > This doesn't contradict the last point about async operation and vcpu
>> > control. KVM async page faults turn the Address Flush Hints write into
>> > an async operation so the guest can get other work done while waiting
>> > for completion.
>> >
>> > >
>> > > - Flush hint address traps from guest to host and do an entire fsync
>> > > on backing file which itself is costly.
>> > >
>> > > - Can be used to flush specific pages on host backing disk. We can
>> > > send data(pages information) equal to cache-line size(limitation)
>> > > and tell host to sync corresponding pages instead of entire disk
>> > > sync.
>> >
>> > Are you sure? Your previous point says only the entire device can be
>> > synced. The NVDIMM Address Flush Hints interface does not involve
>> > address range information.
>>
>> Just syncing the entire block device should be simple but costly. Using the
>> flush hint address to write data which contains a list of the dirty pages to
>> flush requires more thought. This calls the mmio write callback on the Qemu side.
>> As per Intel (ACPI spec 6.1, Table 5-135) there is a limit on the max length of
>> data the guest can write, and it is equal to the cache line size.
>>
>> >
>> > >
>> > > - This will be an asynchronous operation and vCPU control is returned
>> > > quickly.
>> > >
>> > >
>> > > 1.2] Using additional para virt device in addition to pmem device(fake dax
>> > > with device flush)
>> >
>> > Perhaps this can be exposed via ACPI as part of the NVDIMM standards
>> > instead of a separate KVM-only paravirt device.
>>
>> Same reason as above. If we decide on sending a list of dirty pages, there is
>> a limit on the max amount of data that can be sent to the host using the flush
>> hint address.
>
> I understand now: you are proposing to change the semantics of the
> Address Flush Hints interface. You want the value written to have
> meaning (the address range that needs to be flushed).
>
> Today the spec says:
>
> The content of the data is not relevant to the functioning of the
> flush hint mechanism.
>
> Maybe the NVDIMM folks can comment on this idea.
I think it's unworkable to use the flush hints as a guest-to-host
fsync mechanism. That mechanism was designed to flush small memory
controller buffers, not large swaths of dirty memory. What about
running the guests in a writethrough cache mode to avoid needing dirty
cache management altogether? Either way I think you need to use
device-dax on the host, or one of the two work-in-progress filesystem
mechanisms (synchronous-faults or S_IOMAP_FROZEN) to avoid needing any
metadata coordination between guests and the host.
[PATCH v5 0/5] DAX common 4k zero page
by Ross Zwisler
Changes since v4:
- Added static __vm_insert_mixed() to mm/memory.c that holds the common
code for both vm_insert_mixed() and vm_insert_mixed_mkwrite() so we
don't have duplicate code and we don't have to pass boolean flags
around. (Dan & Jan)
- Added a comment for the PFN sanity checking done in the mkwrite case of
insert_pfn().
- Added Jan's reviewed-by tags.
This series has passed a full xfstests run on both XFS and ext4.
---
When servicing mmap() reads from file holes the current DAX code allocates
a page cache page of all zeroes and places the struct page pointer in the
mapping->page_tree radix tree. This has three major drawbacks:
1) It consumes memory unnecessarily. For every 4k page that is read via a
DAX mmap() over a hole, we allocate a new page cache page. This means that
if you read 1GiB worth of pages, you end up using 1GiB of zeroed memory.
2) It is slower than using a common zero page because each page fault has
more work to do. Instead of just inserting a common zero page we have to
allocate a page cache page, zero it, and then insert it.
3) The fact that we had to check for both DAX exceptional entries and for
page cache pages in the radix tree made the DAX code more complex.
This series solves these issues by following the lead of the DAX PMD code
and using a common 4k zero page instead. This reduces memory usage and
decreases latencies for some workloads, and it simplifies the DAX code,
removing over 100 lines in total.
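For anyone who wants the gist without reading the patches: the read-fault
path over a hole ends up doing roughly the following (a simplified sketch
with made-up naming, not the actual fs/dax.c code; vm_insert_mixed() and
my_zero_pfn() are existing kernel helpers, and the later write-fault upgrade
relies on the vm_insert_mixed_mkwrite() helper added in patch 1):
#include <linux/mm.h>
#include <linux/pfn_t.h>

static int dax_load_hole_sketch(struct vm_fault *vmf)
{
	unsigned long vaddr = vmf->address;
	/* every hole read maps the same global zero page, read-only */
	pfn_t zero_pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));

	if (vm_insert_mixed(vmf->vma, vaddr, zero_pfn))
		return VM_FAULT_SIGBUS;
	return VM_FAULT_NOPAGE;
}
A later write fault to the same address allocates real storage and uses
vm_insert_mixed_mkwrite() to replace the zero-page mapping in place, so no
zeroed page cache page ever has to be allocated for the hole.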
Ross Zwisler (5):
mm: add vm_insert_mixed_mkwrite()
dax: relocate some dax functions
dax: use common 4k zero page for dax mmap reads
dax: remove DAX code from page_cache_tree_insert()
dax: move all DAX radix tree defs to fs/dax.c
Documentation/filesystems/dax.txt | 5 +-
fs/dax.c | 345 ++++++++++++++++----------------------
fs/ext2/file.c | 25 +--
fs/ext4/file.c | 32 +---
fs/xfs/xfs_file.c | 2 +-
include/linux/dax.h | 45 -----
include/linux/mm.h | 2 +
include/trace/events/fs_dax.h | 2 -
mm/filemap.c | 13 +-
mm/memory.c | 50 +++++-
10 files changed, 196 insertions(+), 325 deletions(-)
--
2.9.4
[PATCH v4 0/5] DAX common 4k zero page
by Ross Zwisler
Changes since v3:
- Rebased onto the current linux/master which is based on v4.13-rc1.
- Instead of adding vm_insert_mkwrite_mixed() and duplicating code from
vm_insert_mixed(), instead just add a 'mkwrite' parameter to
vm_insert_mixed() and update all call sites. (Vivek)
- Added a sanity check to the mkwrite case of insert_pfn() to be sure the
pfn for the pte we are about to make writable matches the pfn for our
fault. (Jan)
- Fixed up some changelog wording for clarity. (Jan)
---
When servicing mmap() reads from file holes the current DAX code allocates
a page cache page of all zeroes and places the struct page pointer in the
mapping->page_tree radix tree. This has three major drawbacks:
1) It consumes memory unnecessarily. For every 4k page that is read via a
DAX mmap() over a hole, we allocate a new page cache page. This means that
if you read 1GiB worth of pages, you end up using 1GiB of zeroed memory.
2) It is slower than using a common zero page because each page fault has
more work to do. Instead of just inserting a common zero page we have to
allocate a page cache page, zero it, and then insert it.
3) The fact that we had to check for both DAX exceptional entries and for
page cache pages in the radix tree made the DAX code more complex.
This series solves these issues by following the lead of the DAX PMD code
and using a common 4k zero page instead. This reduces memory usage and
decreases latencies for some workloads, and it simplifies the DAX code,
removing over 100 lines in total.
This series has passed my targeted testing and a full xfstests run on both
XFS and ext4.
Ross Zwisler (5):
mm: add mkwrite param to vm_insert_mixed()
dax: relocate some dax functions
dax: use common 4k zero page for dax mmap reads
dax: remove DAX code from page_cache_tree_insert()
dax: move all DAX radix tree defs to fs/dax.c
Documentation/filesystems/dax.txt | 5 +-
drivers/dax/device.c | 2 +-
drivers/gpu/drm/exynos/exynos_drm_gem.c | 3 +-
drivers/gpu/drm/gma500/framebuffer.c | 2 +-
drivers/gpu/drm/msm/msm_gem.c | 3 +-
drivers/gpu/drm/omapdrm/omap_gem.c | 6 +-
drivers/gpu/drm/ttm/ttm_bo_vm.c | 2 +-
fs/dax.c | 342 +++++++++++++-------------------
fs/ext2/file.c | 25 +--
fs/ext4/file.c | 32 +--
fs/xfs/xfs_file.c | 2 +-
include/linux/dax.h | 45 -----
include/linux/mm.h | 2 +-
include/trace/events/fs_dax.h | 2 -
mm/filemap.c | 13 +-
mm/memory.c | 27 ++-
16 files changed, 181 insertions(+), 332 deletions(-)
--
2.9.4
[PATCH v3 0/5] DAX common 4k zero page
by Ross Zwisler
When servicing mmap() reads from file holes the current DAX code allocates
a page cache page of all zeroes and places the struct page pointer in the
mapping->page_tree radix tree. This has three major drawbacks:
1) It consumes memory unnecessarily. For every 4k page that is read via a
DAX mmap() over a hole, we allocate a new page cache page. This means that
if you read 1GiB worth of pages, you end up using 1GiB of zeroed memory.
2) It is slower than using a common zero page because each page fault has
more work to do. Instead of just inserting a common zero page we have to
allocate a page cache page, zero it, and then insert it.
3) The fact that we had to check for both DAX exceptional entries and for
page cache pages in the radix tree made the DAX code more complex.
This series solves these issues by following the lead of the DAX PMD code
and using a common 4k zero page instead. This reduces memory usage and
decreases latencies for some workloads, and it simplifies the DAX code,
removing over 100 lines in total.
Andrew, I'm still hoping to get this merged for v4.13 if possible. I have
addressed all of Jan's feedback, but he is on vacation for the next few
weeks so he may not be able to give me Reviewed-by tags. I think this
series is relatively low risk with clear benefits, and I think we should be
able to address any issues that come up during the v4.13 RC series.
This series has passed my targeted testing and a full xfstests run on both
XFS and ext4.
---
Changes since v2:
- If we call insert_pfn() with 'mkwrite' for an entry that already exists,
don't overwrite the pte with a brand new one. Just add the appropriate
flags. (Jan)
- Keep put_locked_mapping_entry() as a simple wrapper for
dax_unlock_mapping_entry() so it has naming parity with
get_unlocked_mapping_entry(). (Jan)
- Remove DAX special casing in page_cache_tree_insert(), move
now-private definitions from dax.h to dax.c. (Jan)
Ross Zwisler (5):
mm: add vm_insert_mixed_mkwrite()
dax: relocate some dax functions
dax: use common 4k zero page for dax mmap reads
dax: remove DAX code from page_cache_tree_insert()
dax: move all DAX radix tree defs to fs/dax.c
Documentation/filesystems/dax.txt | 5 +-
fs/dax.c | 345 ++++++++++++++++----------------------
fs/ext2/file.c | 25 +--
fs/ext4/file.c | 32 +---
fs/xfs/xfs_file.c | 2 +-
include/linux/dax.h | 45 -----
include/linux/mm.h | 2 +
include/trace/events/fs_dax.h | 2 -
mm/filemap.c | 13 +-
mm/memory.c | 57 ++++++-
10 files changed, 205 insertions(+), 323 deletions(-)
--
2.9.4
[PATCH -mm -v3 05/12] block, THP: Make block_device_operations.rw_page support THP
by Huang, Ying
From: Huang Ying <ying.huang(a)intel.com>
The .rw_page in struct block_device_operations is used by the swap
subsystem to read/write the page contents from/into the corresponding
swap slot in the swap device. To support the THP (Transparent Huge
Page) swap optimization, the .rw_page is enhanced to support
reading/writing a THP if possible.
Signed-off-by: "Huang, Ying" <ying.huang(a)intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler(a)intel.com> [for brd.c, zram_drv.c, pmem.c]
Cc: Johannes Weiner <hannes(a)cmpxchg.org>
Cc: Minchan Kim <minchan(a)kernel.org>
Cc: Dan Williams <dan.j.williams(a)intel.com>
Cc: Vishal L Verma <vishal.l.verma(a)intel.com>
Cc: Jens Axboe <axboe(a)kernel.dk>
Cc: linux-nvdimm(a)lists.01.org
---
drivers/block/brd.c | 6 +++++-
drivers/block/zram/zram_drv.c | 2 ++
drivers/nvdimm/btt.c | 4 +++-
drivers/nvdimm/pmem.c | 41 ++++++++++++++++++++++++++++++-----------
4 files changed, 40 insertions(+), 13 deletions(-)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 104b71c0490d..5d9ed0616413 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -326,7 +326,11 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
struct page *page, bool is_write)
{
struct brd_device *brd = bdev->bd_disk->private_data;
- int err = brd_do_bvec(brd, page, PAGE_SIZE, 0, is_write, sector);
+ int err;
+
+ if (PageTransHuge(page))
+ return -ENOTSUPP;
+ err = brd_do_bvec(brd, page, PAGE_SIZE, 0, is_write, sector);
page_endio(page, is_write, err);
return err;
}
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 856d5dc02451..e2a305b41cd4 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -927,6 +927,8 @@ static int zram_rw_page(struct block_device *bdev, sector_t sector,
struct zram *zram;
struct bio_vec bv;
+ if (PageTransHuge(page))
+ return -ENOTSUPP;
zram = bdev->bd_disk->private_data;
if (!valid_io_request(zram, sector, PAGE_SIZE)) {
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 14323faf8bd9..60491641a8d6 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1241,8 +1241,10 @@ static int btt_rw_page(struct block_device *bdev, sector_t sector,
{
struct btt *btt = bdev->bd_disk->private_data;
int rc;
+ unsigned int len;
- rc = btt_do_bvec(btt, NULL, page, PAGE_SIZE, 0, is_write, sector);
+ len = hpage_nr_pages(page) * PAGE_SIZE;
+ rc = btt_do_bvec(btt, NULL, page, len, 0, is_write, sector);
if (rc == 0)
page_endio(page, is_write, 0);
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index f7099adaabc0..e9aa453da50c 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -80,22 +80,40 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
static void write_pmem(void *pmem_addr, struct page *page,
unsigned int off, unsigned int len)
{
- void *mem = kmap_atomic(page);
-
- memcpy_flushcache(pmem_addr, mem + off, len);
- kunmap_atomic(mem);
+ unsigned int chunk;
+ void *mem;
+
+ while (len) {
+ mem = kmap_atomic(page);
+ chunk = min_t(unsigned int, len, PAGE_SIZE);
+ memcpy_flushcache(pmem_addr, mem + off, chunk);
+ kunmap_atomic(mem);
+ len -= chunk;
+ off = 0;
+ page++;
+ pmem_addr += PAGE_SIZE;
+ }
}
static blk_status_t read_pmem(struct page *page, unsigned int off,
void *pmem_addr, unsigned int len)
{
+ unsigned int chunk;
int rc;
- void *mem = kmap_atomic(page);
-
- rc = memcpy_mcsafe(mem + off, pmem_addr, len);
- kunmap_atomic(mem);
- if (rc)
- return BLK_STS_IOERR;
+ void *mem;
+
+ while (len) {
+ mem = kmap_atomic(page);
+ chunk = min_t(unsigned int, len, PAGE_SIZE);
+ rc = memcpy_mcsafe(mem + off, pmem_addr, chunk);
+ kunmap_atomic(mem);
+ if (rc)
+ return BLK_STS_IOERR;
+ len -= chunk;
+ off = 0;
+ page++;
+ pmem_addr += PAGE_SIZE;
+ }
return BLK_STS_OK;
}
@@ -188,7 +206,8 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector,
struct pmem_device *pmem = bdev->bd_queue->queuedata;
blk_status_t rc;
- rc = pmem_do_bvec(pmem, page, PAGE_SIZE, 0, is_write, sector);
+ rc = pmem_do_bvec(pmem, page, hpage_nr_pages(page) * PAGE_SIZE,
+ 0, is_write, sector);
/*
* The ->rw_page interface is subtle and tricky. The core
--
2.13.2
[PATCH -mm -v2 00/12] mm, THP, swap: Delay splitting THP after swapped out
by Huang, Ying
From: Huang Ying <ying.huang(a)intel.com>
Hi, Andrew, could you help me to check whether the overall design is
reasonable?
Hi, Johannes and Minchan, Thanks a lot for your review to the first
step of the THP swap optimization! Could you help me to review the
second step in this patchset?
Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset? Especially [01/12], [02/12], [03/12],
[04/12], [11/12], and [12/12].
Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset? Especially [01/12], [03/12], [07/12], [08/12], [09/12],
[11/12].
Hi, Johannes, Michal, could you help me to review the cgroup part of
the patchset? Especially [08/12], [09/12], and [10/12].
And for all, Any comment is welcome!
The THP swap writing support patch [06/12] needs to be rebased on the
multipage bvec patchset, which hasn't been merged yet, so the [06/12]
in this patchset is just a test patch and will be rewritten later.
The whole patchset depends on the multipage bvec patchset too.
This is the second step of the THP (Transparent Huge Page) swap
optimization. In the first step, splitting the huge page is delayed
from almost the beginning of swapping out to after the swap space for
the THP has been allocated and the THP has been added to the swap
cache. In the second step, the splitting is delayed further, to after
the swap out has finished. The plan is to delay splitting the THP
step by step and finally avoid splitting the THP for swap out at all,
swapping the THP out and in as a whole.
In this patchset, more of the operations for anonymous THP reclaim,
such as TLB flushing, writing the THP to the swap device, and removing
the THP from the swap cache, are batched, so the performance of
anonymous THP swap out is improved.
This patchset is based on the 6/16 head of mmotm/master.
During the development, the following scenarios/code paths have been
checked,
- swap out/in
- swap off
- write protect page fault
- madvise_free
- process exit
- split huge page
Please let me know if I missed something.
With the patchset, the swap out throughput improves by 42% (from about
5.81GB/s to about 8.25GB/s) in the vm-scalability swap-w-seq test case
with 16 processes. At the same time, the IPI count (which reflects TLB
flushing) is reduced by about 78.9%. The test is done on a Xeon E5 v3
system. The swap device used is a RAM simulated PMEM (persistent
memory) device.
To test the sequential swapping out, the test case creates 8
processes, which sequentially allocate and write to the anonymous
pages until the RAM and part of the swap device is used up.
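For reference, the workload is essentially a sequential anonymous-write
streamer; a stand-alone approximation (my own sketch, not the actual
vm-scalability swap-w-seq source) would look like:
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	/* size chosen larger than RAM so the kernel must swap THPs out */
	size_t sz = 64UL << 30;
	size_t off;
	char *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;
	for (off = 0; off < sz; off += 4096)
		buf[off] = 1;	/* touch each page sequentially */
	return 0;
}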
Below is the part of the cover letter for the first step patchset of
THP swap optimization which applies to all steps.
----------------------------------------------------------------->
Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
doing page swap out, even on a high-end server machine, because storage
device performance has improved faster than that of a single logical
CPU, and it seems the trend will not change in the near future. On the
other hand, THP is becoming more and more popular because of increased
memory sizes. So it becomes necessary to optimize THP swap
performance.
The advantages of the THP swap support include:
- Batch the swap operations for the THP to reduce TLB flushing and
lock acquiring/releasing, including allocating/freeing the swap
space, adding/deleting to/from the swap cache, and writing/reading
the swap space, etc. This will help improve the performance of the
THP swap.
- The THP swap space read/write will be 2M sequential IO. It is
particularly helpful for the swap read, which are usually 4k random
IO. This will improve the performance of the THP swap too.
- It will help with memory fragmentation, especially when THP is
heavily used by applications. The 2M contiguous pages will be
freed up after the THP is swapped out.
- It will improve THP utilization on systems with swap turned on,
because khugepaged is quite slow at collapsing normal pages back
into a THP. Once a THP is split during swap out, it takes quite a
long time for the normal pages to collapse back into a THP after
being swapped in. High THP utilization also helps the efficiency of
page based memory management.
There are some concerns regarding THP swap in, mainly because possible
enlarged read/write IO size (for swap in/out) may put more overhead on
the storage device. To deal with that, the THP swap in should be
turned on only when necessary. For example, it can be selected via
"always/never/madvise" logic, to be turned on globally, turned off
globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
Best Regards,
Huang, Ying