kernel/kernel-rt/centos/patches/0035-xfs-drop-submit-side-trans-alloc-for-append-ioends.patch
Zhixiong Chi 5ab365dd41 xfs: fix ioend batching log reservation deadlock
Problem:
We received a report of a workload that causes an xfs task to be blocked
for more than 120 seconds on log reservation via iomap_ioend completion
batching.

 kernel: err [5636141.631454] INFO: task xfs-conv/dm-4:1788 blocked for
                                    more than 122 seconds.

 kernel: info [267022.728862] Workqueue: xfs-conv/dm-4 xfs_end_io [xfs]
 kernel: info [267022.728864] Call Trace:
 kernel: info [267022.728870] __schedule+0x340/0x810
 kernel: info [267022.728876] schedule+0x51/0xc0
 kernel: info [267022.728913] xlog_grant_head_wait+0xc7/0x200 [xfs]
 kernel: info [267022.728950] xlog_grant_head_check+0xd0/0x110 [xfs]
 kernel: info [267022.728985] xfs_log_reserve+0xc3/0x1e0 [xfs]
 kernel: info [267022.729023] xfs_trans_reserve+0x156/0x1b0 [xfs]
 kernel: info [267022.729184] xfs_trans_alloc+0xc6/0x190 [xfs]
 kernel: info [267022.729317] xfs_iomap_write_unwritten+0xaa/0x2c0 [xfs]
 kernel: info [267022.729333] ? stop_one_cpu+0x71/0xa0
 kernel: info [267022.729347] ? set_cpus_allowed_ptr+0x10/0x10
 kernel: info [267022.729396] xfs_end_ioend+0xc4/0x100 [xfs]
 kernel: info [267022.729444] ? xfs_setfilesize_ioend+0x60/0x60 [xfs]
 kernel: info [267022.729491] xfs_end_io+0xb9/0xe0 [xfs]
 kernel: info [267022.729505] process_one_work+0x1a1/0x370
 kernel: info [267022.729516] rescuer_thread+0x207/0x350
 kernel: info [267022.729528] ? worker_thread+0x370/0x370
 kernel: info [267022.729537] kthread+0x12e/0x150
 kernel: info [267022.729548] ? __kthread_cancel_work+0x40/0x40
 kernel: info [267022.729559] ret_from_fork+0x1f/0x30

After that, the ssh connection to the controller hangs. Pressing Ctrl+C
drops into a shell that displays the prompt '-sh-4.2$'.

Solution:
Remove the preallocated transaction from xfs append ioends to avoid
the ioend completion batching log reservation deadlock.
Append ioend completions are still processed via the workqueue, but the
wq task now allocates the transaction itself, as it does for other
ioend types.
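
The deadlock and the fix can be illustrated with a toy user-space model.
This is a sketch, not kernel code: toy_log, toy_ioend, toy_submit,
toy_complete_batch and toy_run are hypothetical names, the log is reduced
to a counter of free reservation units, and a failed reservation stands in
for the point where the real worker would sleep in xlog_grant_head_wait.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types; none of these names exist in the kernel. */
enum toy_ioend_type { TOY_UNWRITTEN, TOY_APPEND };

struct toy_ioend {
	enum toy_ioend_type type;
	int held_units;		/* log units reserved at submit time */
};

struct toy_log {
	int free_units;		/* remaining log reservation space */
};

/* Try to reserve log space; false marks where the real task would block. */
static bool toy_log_reserve(struct toy_log *log, int units)
{
	if (log->free_units < units)
		return false;
	log->free_units -= units;
	return true;
}

/* Return a reservation to the log, as a transaction commit would. */
static void toy_log_release(struct toy_log *log, int units)
{
	assert(units > 0);
	log->free_units += units;
}

/* Submit side: the old behavior preallocates a unit for append ioends. */
static bool toy_submit(struct toy_log *log, struct toy_ioend *io,
		       bool prealloc_append)
{
	io->held_units = 0;
	if (prealloc_append && io->type == TOY_APPEND) {
		if (!toy_log_reserve(log, 1))
			return false;
		io->held_units = 1;
	}
	return true;
}

/*
 * Completion side: process the batch in list order. Each ioend needs one
 * unit; stop at the first ioend that cannot get one.
 */
static size_t toy_complete_batch(struct toy_log *log,
				 struct toy_ioend *batch, size_t n)
{
	size_t done = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_ioend *io = &batch[i];

		if (io->held_units == 0) {
			if (!toy_log_reserve(log, 1))
				break;	/* worker would sleep here */
			io->held_units = 1;
		}
		toy_log_release(log, io->held_units);
		io->held_units = 0;
		done++;
	}
	return done;
}

/* Batch [unwritten, append] against a one-unit log; returns completions. */
size_t toy_run(bool prealloc_append)
{
	struct toy_log log = { .free_units = 1 };
	struct toy_ioend batch[2] = { { TOY_UNWRITTEN, 0 }, { TOY_APPEND, 0 } };

	for (size_t i = 0; i < 2; i++)
		(void)toy_submit(&log, &batch[i], prealloc_append);
	return toy_complete_batch(&log, batch, 2);
}
```

In this model toy_run(true) completes 0 ioends, because the append ioend at
the tail holds the only unit that the unwritten ioend ahead of it needs,
while toy_run(false) completes both.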

Backport the four patches from upstream
(git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git) for
debian-based StarlingX.
Only 0034-xfs-use-current-journal_info-for-detecting-transacti.patch,
which is applied to centos-based StarlingX, comes from the stable tree
(git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git,
linux-5.10.y branch). The kernel for debian-based StarlingX has already
been upgraded to v5.10.152, which includes this fix, so the patch is
applied only to the centos-based one.

TestPlan:
Pass: Run the bonnie++ test on an xfs filesystem; it completes without a
      kernel panic or any xfs anomalies in the kernel logs.
      $mkfs.xfs /dev/sdc1
      $mount /dev/sdc1 ~/xfstests
      $sudo bonnie++ -u root:root -d ~/xfstests
Debian:
Pass: build-pkgs -c -a
Pass: build-image
Pass: boot successfully with std/rt.
CentOS:
Pass: build-pkgs
Pass: build-iso
Pass: boot successfully with std/rt.

Closes-Bug: 1996269

Signed-off-by: Zhixiong Chi <zhixiong.chi@windriver.com>
Change-Id: I1e5b85111b2b54cd249c116724b952042f9d781f
2022-11-17 10:40:05 -05:00

From 8182ec00803085354761bbadf0287cad7eac0e2f Mon Sep 17 00:00:00 2001
From: Brian Foster <bfoster@redhat.com>
Date: Fri, 9 Apr 2021 10:27:43 -0700
Subject: [PATCH 2/5] xfs: drop submit side trans alloc for append ioends

Per-inode ioend completion batching has a log reservation deadlock
vector between preallocated append transactions and transactions
that are acquired at completion time for other purposes (i.e.,
unwritten extent conversion or COW fork remaps). For example, if the
ioend completion workqueue task executes on a batch of ioends that
are sorted such that an append ioend sits at the tail, it's possible
for the outstanding append transaction reservation to block
allocation of transactions required to process preceding ioends in
the list.

Append ioend completion is historically the common path for on-disk
inode size updates. While file extending writes may have completed
sometime earlier, the on-disk inode size is only updated after
successful writeback completion. These transactions are preallocated
serially from writeback context to mitigate concurrency and
associated log reservation pressure across completions processed by
multi-threaded workqueue tasks.

However, now that delalloc blocks unconditionally map to unwritten
extents at physical block allocation time, size updates via append
ioends are relatively rare. This means that inode size updates most
commonly occur as part of the preexisting completion time
transaction to convert unwritten extents. As a result, there is no
longer a strong need to preallocate size update transactions.

Remove the preallocation of inode size update transactions to avoid
the ioend completion processing log reservation deadlock. Instead,
continue to send all potential size extending ioends to workqueue
context for completion and allocate the transaction from that
context. This ensures that no outstanding log reservation is owned
by the ioend completion worker task when it begins to process
ioends.

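The routing rule described above can be sketched in user space. The types
and names below (sketch_ioend, sketch_is_append, sketch_needs_workqueue,
SKETCH_F_SHARED, ondisk_size) are hypothetical stand-ins for the kernel
structures, and the append test assumes that xfs_ioend_is_append compares
io_offset + io_size against the on-disk inode size; only the tail of that
helper is visible in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's ioend type and flag values. */
enum sketch_io_type { SKETCH_MAPPED, SKETCH_UNWRITTEN };
#define SKETCH_F_SHARED 0x1u

struct sketch_ioend {
	enum sketch_io_type io_type;
	unsigned int io_flags;
	int64_t io_offset;
	int64_t io_size;
	int64_t ondisk_size;	/* stands in for the on-disk inode size */
};

/* Assumed append test: the write ends beyond the on-disk size. */
bool sketch_is_append(const struct sketch_ioend *io)
{
	assert(io->io_size >= 0);
	return io->io_offset + io->io_size > io->ondisk_size;
}

/*
 * Post-patch routing: append, unwritten and shared (COW) ioends all go
 * to the workqueue, where any transaction they need is allocated.
 */
bool sketch_needs_workqueue(const struct sketch_ioend *io)
{
	return sketch_is_append(io) ||
		io->io_type == SKETCH_UNWRITTEN ||
		(io->io_flags & SKETCH_F_SHARED);
}
```

Under these assumptions, an ordinary overwrite that stays within the
on-disk size is the only case that completes without the workqueue.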
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[commit 7cd3099f4925d7c15887d1940ebd65acd66100f5 upstream
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git]
Signed-off-by: Zhixiong Chi <zhixiong.chi@windriver.com>
---
fs/xfs/xfs_aops.c | 45 +++------------------------------------------
1 file changed, 3 insertions(+), 42 deletions(-)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index b4186d666..60943b28f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -39,33 +39,6 @@ static inline bool xfs_ioend_is_append(struct iomap_ioend *ioend)
 		XFS_I(ioend->io_inode)->i_d.di_size;
 }
 
-STATIC int
-xfs_setfilesize_trans_alloc(
-	struct iomap_ioend	*ioend)
-{
-	struct xfs_mount	*mp = XFS_I(ioend->io_inode)->i_mount;
-	struct xfs_trans	*tp;
-	int			error;
-
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_fsyncts, 0, 0, 0, &tp);
-	if (error)
-		return error;
-
-	ioend->io_private = tp;
-
-	/*
-	 * We may pass freeze protection with a transaction. So tell lockdep
-	 * we released it.
-	 */
-	__sb_writers_release(ioend->io_inode->i_sb, SB_FREEZE_FS);
-	/*
-	 * We hand off the transaction to the completion thread now, so
-	 * clear the flag here.
-	 */
-	xfs_trans_clear_context(tp);
-	return 0;
-}
-
 /*
  * Update on-disk file size now that data has been written to disk.
  */
@@ -182,12 +155,10 @@ xfs_end_ioend(
 		error = xfs_reflink_end_cow(ip, offset, size);
 	else if (ioend->io_type == IOMAP_UNWRITTEN)
 		error = xfs_iomap_write_unwritten(ip, offset, size, false);
-	else
-		ASSERT(!xfs_ioend_is_append(ioend) || ioend->io_private);
 
+	if (!error && xfs_ioend_is_append(ioend))
+		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
 done:
-	if (ioend->io_private)
-		error = xfs_setfilesize_ioend(ioend, error);
 	iomap_finish_ioends(ioend, error);
 	memalloc_nofs_restore(nofs_flag);
 }
@@ -237,7 +208,7 @@ xfs_end_io(
 
 static inline bool xfs_ioend_needs_workqueue(struct iomap_ioend *ioend)
 {
-	return ioend->io_private ||
+	return xfs_ioend_is_append(ioend) ||
 		ioend->io_type == IOMAP_UNWRITTEN ||
 		(ioend->io_flags & IOMAP_F_SHARED);
 }
@@ -250,8 +221,6 @@ xfs_end_bio(
 	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
 	unsigned long		flags;
 
-	ASSERT(xfs_ioend_needs_workqueue(ioend));
-
 	spin_lock_irqsave(&ip->i_ioend_lock, flags);
 	if (list_empty(&ip->i_ioend_list))
 		WARN_ON_ONCE(!queue_work(ip->i_mount->m_unwritten_workqueue,
@@ -501,14 +470,6 @@ xfs_prepare_ioend(
 				ioend->io_offset, ioend->io_size);
 	}
 
-	/* Reserve log space if we might write beyond the on-disk inode size. */
-	if (!status &&
-	    ((ioend->io_flags & IOMAP_F_SHARED) ||
-	     ioend->io_type != IOMAP_UNWRITTEN) &&
-	    xfs_ioend_is_append(ioend) &&
-	    !ioend->io_private)
-		status = xfs_setfilesize_trans_alloc(ioend);
-
 	memalloc_nofs_restore(nofs_flag);
 
 	if (xfs_ioend_needs_workqueue(ioend))
--
2.34.1