Discussion:
Default offset to 1MB boundaries for improved SSD (and RAID virtual disk) partition alignment
Tom Smyth
2021-04-20 16:46:52 UTC
Hello,

I was just installing today's snapshot, and the default offset on
amd64 is 64 (as it has been for as long as I can remember).
Is it worthwhile updating the defaults so that the OpenBSD partition
layout is optimal for SSDs or other virtualized RAID environments
with 1MB chunks?

Is there a downside to moving the default offset to 2048, i.e. a 1MB
offset on 512-byte-sector disks?
We have been running a 2048-sector starting offset for our OpenBSD
installs for about 3-4 years now and we have not come across any
issues.
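
For what it's worth, the arithmetic is easy to check; a minimal
sketch (purely illustrative, and the 1MB chunk size is an assumption
about the underlying device, not something OpenBSD knows about):

/*
 * Illustrative only: check that the proposed 2048-sector default
 * offset lands a partition exactly on a 1 MiB device-chunk boundary
 * (assumes 512-byte sectors and 1 MiB chunks).
 */
#include <stdio.h>

int
main(void)
{
	const unsigned long long sector = 512;		/* bytes */
	const unsigned long long start_lba = 2048;	/* proposed default */
	const unsigned long long chunk = 1024 * 1024;	/* assumed 1 MiB */
	unsigned long long off = start_lba * sector;

	printf("start = %llu bytes, 1MiB-aligned: %s\n",
	    off, off % chunk == 0 ? "yes" : "no");
	return 0;
}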

It is unlikely that this will be changed for the 6.9 release, but it
might be worth revisiting, as it would make for more straightforwardly
aligned partitions on OpenBSD installs.

My experience is more with x86/amd64 than with other platforms.

Kindest Regards,

Tom Smyth
--
Kindest regards,
Tom Smyth.
Christian Weisgerber
2021-04-20 21:40:08 UTC
Post by Tom Smyth
I was just installing today's snapshot, and the default offset on
amd64 is 64 (as it has been for as long as I can remember).
It was changed from 63 in 2010.
Post by Tom Smyth
Is it worthwhile updating the defaults so that the OpenBSD partition
layout is optimal for SSDs or other virtualized RAID environments
with 1MB chunks?
What are you trying to optimize with this? FFS2 file systems reserve
64 kB at the start of a partition, and after that it's filesystem
blocks, which are 16/32/64 kB, depending on the size of the filesystem.
I can barely see an argument for aligning large partitions at 128
sectors, but what purpose would larger multiples serve?
Post by Tom Smyth
Is there a downside to moving the default offset to 2048?
Not really. It wastes a bit of space, but that is rather insignificant
for today's disk sizes.
--
Christian "naddy" Weisgerber ***@mips.inka.de
Tom Smyth
2021-04-21 07:20:10 UTC
Hi Christian,

If you have a 1MB file, or a database that needs to read 1MB of
data, and the partitions are not aligned, then your underlying
storage system needs to load two chunks, or write two chunks, for
that 1MB of data.

So in the *worst* case you would double the workload on the storage
hardware (SSD or hardware RAID with large chunks) for each
transaction; when writing to SSDs while unaligned, you could in the
*worst* case double the write/wear rate.
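
A tiny sketch of that worst case (illustrative only; the 1MB chunk
size and the sample offsets are assumptions, not anything OpenBSD
does):

/*
 * Illustrative only: count how many 1 MiB device chunks a single
 * I/O touches.  Aligned, a 1 MiB write touches one chunk; shifted
 * by the current 64-sector (32 KiB) default offset, it touches two.
 */
#include <stdio.h>

#define CHUNK	(1024ULL * 1024)	/* assumed device chunk: 1 MiB */

static unsigned long long
chunks_touched(unsigned long long off, unsigned long long len)
{
	return (off + len - 1) / CHUNK - off / CHUNK + 1;
}

int
main(void)
{
	printf("aligned 1MiB write touches %llu chunk(s)\n",
	    chunks_touched(0, CHUNK));
	printf("misaligned 1MiB write touches %llu chunk(s)\n",
	    chunks_touched(64 * 512, CHUNK));
	return 0;
}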

The improvement would be smaller for reading and writing small
files (as fewer of them would span two chunks).

The following paper explains it (better than I do):
https://www.vmware.com/pdf/esx3_partition_align.pdf

If the cost is 1-8MB at the start of the disk (assuming partitions
are sized so that they don't lose the offset of 2048 sectors), I
think it is worth pursuing. (Again, I only have experience on
amd64/i386 hardware.)

Thanks
Tom Smyth
Post by Christian Weisgerber
Post by Tom Smyth
I was just installing today's snapshot, and the default offset on
amd64 is 64 (as it has been for as long as I can remember).
It was changed from 63 in 2010.
Post by Tom Smyth
Is it worthwhile updating the defaults so that the OpenBSD partition
layout is optimal for SSDs or other virtualized RAID environments
with 1MB chunks?
What are you trying to optimize with this? FFS2 file systems reserve
64 kB at the start of a partition, and after that it's filesystem
blocks, which are 16/32/64 kB, depending on the size of the filesystem.
I can barely see an argument for aligning large partitions at 128
sectors, but what purpose would larger multiples serve?
Post by Tom Smyth
Is there a downside to moving the default offset to 2048?
Not really. It wastes a bit of space, but that is rather insignificant
for today's disk sizes.
--
--
Kindest regards,
Tom Smyth.
Otto Moerbeek
2021-04-21 07:48:57 UTC
Post by Tom Smyth
Hi Christian,
If you have a 1MB file, or a database that needs to read 1MB of
data, and the partitions are not aligned, then your underlying
storage system needs to load two chunks, or write two chunks, for
that 1MB of data.
So in the *worst* case you would double the workload on the storage
hardware (SSD or hardware RAID with large chunks) for each
transaction; when writing to SSDs while unaligned, you could in the
*worst* case double the write/wear rate.
The improvement would be smaller for reading and writing small
files (as fewer of them would span two chunks).
The following paper explains it (better than I do):
https://www.vmware.com/pdf/esx3_partition_align.pdf
If the cost is 1-8MB at the start of the disk (assuming partitions
are sized so that they don't lose the offset of 2048 sectors), I
think it is worth pursuing. (Again, I only have experience on
amd64/i386 hardware.)
Doing a quick scan through the PDF I only see talk about 64k boundaries.

FFS(2) will split any partition into multiple cylinder groups. Each
cylinder group starts with a superblock copy, inode tables and other
metadata before the data blocks of that cylinder group. Having the
start of a partition on a 1MB boundary does not get you those data
blocks at a specific boundary. So I think your reasoning does not
apply to FFS(2).

It might make sense to move the start to offset 128 for big
partitions, so you align with the 64k boundary mentioned in the
PDF; the block size is already 64k (for big partitions).
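
To put numbers on that, a minimal sketch (illustrative only,
assuming 512-byte sectors; the list of offsets is arbitrary):

/*
 * Illustrative only: which default offsets put the partition start,
 * and hence the 64k filesystem blocks of a big partition, on a
 * 64 KiB boundary (assumes 512-byte sectors).
 */
#include <stdio.h>

int
main(void)
{
	const unsigned long long sector = 512, fsblock = 64 * 1024;
	const unsigned long long offs[] = { 63, 64, 128, 2048 };
	size_t i;

	for (i = 0; i < sizeof(offs) / sizeof(offs[0]); i++) {
		unsigned long long bytes = offs[i] * sector;
		printf("offset %4llu sectors = %7llu bytes, "
		    "64KiB-aligned: %s\n", offs[i], bytes,
		    bytes % fsblock == 0 ? "yes" : "no");
	}
	return 0;
}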

-Otto
Post by Tom Smyth
Thanks
Tom Smyth
Post by Christian Weisgerber
Post by Tom Smyth
I was just installing today's snapshot, and the default offset on
amd64 is 64 (as it has been for as long as I can remember).
It was changed from 63 in 2010.
Post by Tom Smyth
Is it worthwhile updating the defaults so that the OpenBSD partition
layout is optimal for SSDs or other virtualized RAID environments
with 1MB chunks?
What are you trying to optimize with this? FFS2 file systems reserve
64 kB at the start of a partition, and after that it's filesystem
blocks, which are 16/32/64 kB, depending on the size of the filesystem.
I can barely see an argument for aligning large partitions at 128
sectors, but what purpose would larger multiples serve?
Post by Tom Smyth
Is there a downside to moving the default offset to 2048?
Not really. It wastes a bit of space, but that is rather insignificant
for today's disk sizes.
--
--
Kindest regards,
Tom Smyth.
Tom Smyth
2021-04-21 08:56:59 UTC
Hello Otto, Christian,

I was relying on that paper for its pictures of the alignment issue.

VMFS (the VMware file system), since version 5, has had allocation
units of 1MB each:

https://kb.vmware.com/s/article/2137120

My understanding is that SSDs have a similar 1MB allocation unit
setup, and that aligning your file system to 1MB would improve
performance:


|OpenBSD Filesystem ---------------| FFS filesystem
|VMDK Virtual Disk file for Guest -| OpenBSD-Guest-Disk0.vmdk
|vmware datastore -----------------| 1MB allocation unit
|Logical Storage Device / RAID ----|
|SSD or DISK storage --------------| 1MB allocation unit (on some SSDs)

Figure 2 of the following paper shows this:
https://www.usenix.org/legacy/event/usenix09/tech/full_papers/rajimwale/rajimwale.pdf
As your writes start to cross another underlying block boundary,
you see a degradation in performance. The largest impact is on a
misaligned write of 1MB spanning two blocks; it repeats as you
increase the number of MB in a transaction, but the percentage
overhead shrinks for each additional 1MB in the transaction.
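
The declining overhead is easy to see numerically; a minimal sketch
(illustrative only, assuming 1MB chunks and a misaligned start):

/*
 * Illustrative only: a misaligned transfer of n chunks touches n+1
 * device chunks, so the relative overhead is 1/n and shrinks as the
 * transaction grows (assumes 1 MiB chunks, misaligned start).
 */
#include <stdio.h>

int
main(void)
{
	int n;

	for (n = 1; n <= 8; n++)
		printf("%d MiB misaligned: %d chunks touched, "
		    "overhead %.1f%%\n", n, n + 1, 100.0 / n);
	return 0;
}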

If there is no downside to allocating/offsetting filesystems on 1MB
boundaries, can we do that by default, to reduce wear on SSDs and
improve performance in virtualized environments with large
allocation units on whatever storage subsystem they are running?

Thanks for your time

Tom Smyth
Post by Otto Moerbeek
Post by Tom Smyth
Hi Christian,
If you have a 1MB file, or a database that needs to read 1MB of
data, and the partitions are not aligned, then your underlying
storage system needs to load two chunks, or write two chunks, for
that 1MB of data.
So in the *worst* case you would double the workload on the storage
hardware (SSD or hardware RAID with large chunks) for each
transaction; when writing to SSDs while unaligned, you could in the
*worst* case double the write/wear rate.
The improvement would be smaller for reading and writing small
files (as fewer of them would span two chunks).
The following paper explains it (better than I do):
https://www.vmware.com/pdf/esx3_partition_align.pdf
If the cost is 1-8MB at the start of the disk (assuming partitions
are sized so that they don't lose the offset of 2048 sectors), I
think it is worth pursuing. (Again, I only have experience on
amd64/i386 hardware.)
Doing a quick scan through the PDF I only see talk about 64k boundaries.
FFS(2) will split any partition into multiple cylinder groups. Each
cylinder group starts with a superblock copy, inode tables and other
metadata before the data blocks of that cylinder group. Having the
start of a partition on a 1MB boundary does not get you those data
blocks at a specific boundary. So I think your reasoning does not
apply to FFS(2).
It might make sense to move the start to offset 128 for big
partitions, so you align with the 64k boundary mentioned in the
PDF; the block size is already 64k (for big partitions).
-Otto
Post by Tom Smyth
Thanks
Tom Smyth
Post by Christian Weisgerber
Post by Tom Smyth
I was just installing today's snapshot, and the default offset on
amd64 is 64 (as it has been for as long as I can remember).
It was changed from 63 in 2010.
Post by Tom Smyth
Is it worthwhile updating the defaults so that the OpenBSD partition
layout is optimal for SSDs or other virtualized RAID environments
with 1MB chunks?
What are you trying to optimize with this? FFS2 file systems reserve
64 kB at the start of a partition, and after that it's filesystem
blocks, which are 16/32/64 kB, depending on the size of the filesystem.
I can barely see an argument for aligning large partitions at 128
sectors, but what purpose would larger multiples serve?
Post by Tom Smyth
Is there a downside to moving the default offset to 2048?
Not really. It wastes a bit of space, but that is rather insignificant
for today's disk sizes.
--
--
Kindest regards,
Tom Smyth.
--
Kindest regards,
Tom Smyth.
Otto Moerbeek
2021-04-21 13:14:07 UTC
Post by Tom Smyth
Hello Otto, Christian,
I was relying on that paper for its pictures of the alignment issue.
VMFS (the VMware file system), since version 5, has had allocation
units of 1MB each:
https://kb.vmware.com/s/article/2137120
My understanding is that SSDs have a similar 1MB allocation unit
setup, and that aligning your file system to 1MB would improve
performance:
|OpenBSD Filesystem ---------------| FFS filesystem
|VMDK Virtual Disk file for Guest -| OpenBSD-Guest-Disk0.vmdk
|vmware datastore -----------------| 1MB allocation unit
|Logical Storage Device / RAID ----|
|SSD or DISK storage --------------| 1MB allocation unit (on some SSDs)
Figure 2 of the following paper shows this:
https://www.usenix.org/legacy/event/usenix09/tech/full_papers/rajimwale/rajimwale.pdf
As your writes start to cross another underlying block boundary,
you see a degradation in performance. The largest impact is on a
misaligned write of 1MB spanning two blocks;
The maximum unit OpenBSD writes in one go is 64k, so the issue is
not that relevant: only 1 in 16 blocks would potentially cross a
boundary.
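
To make the 1-in-16 figure concrete, a minimal sketch (illustrative
only; it assumes 64k writes on 64k boundaries, a partition starting
at the current 64-sector default, and 1MB device chunks):

/*
 * Illustrative only: 64 KiB writes on 64 KiB boundaries inside a
 * partition starting at 64 sectors (32 KiB); count how many of the
 * 16 positions per 1 MiB device chunk straddle a chunk boundary.
 */
#include <stdio.h>

int
main(void)
{
	const unsigned long long chunk = 1024 * 1024, io = 64 * 1024;
	const unsigned long long part_start = 64 * 512;	/* 32 KiB */
	unsigned crossing = 0, total = 0;
	unsigned long long blk, off;

	for (blk = 0; blk < chunk / io; blk++) {
		off = part_start + blk * io;
		total++;
		if (off / chunk != (off + io - 1) / chunk)
			crossing++;
	}
	printf("%u of %u writes cross a chunk boundary\n",
	    crossing, total);
	return 0;
}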

You are free to set up your disks in a way that suits you, but in
general I don't think we should enforce 1MB alignment of partition
start and/or size just because *some* setups *might* get a benefit.

-Otto
Post by Tom Smyth
it repeats as you increase the number of MB in a transaction, but
the percentage overhead shrinks for each additional 1MB in the
transaction.
If there is no downside to allocating/offsetting filesystems on 1MB
boundaries, can we do that by default, to reduce wear on SSDs and
improve performance in virtualized environments with large
allocation units on whatever storage subsystem they are running?
Thanks for your time
Tom Smyth
Post by Otto Moerbeek
Post by Tom Smyth
Hi Christian,
If you have a 1MB file, or a database that needs to read 1MB of
data, and the partitions are not aligned, then your underlying
storage system needs to load two chunks, or write two chunks, for
that 1MB of data.
So in the *worst* case you would double the workload on the storage
hardware (SSD or hardware RAID with large chunks) for each
transaction; when writing to SSDs while unaligned, you could in the
*worst* case double the write/wear rate.
The improvement would be smaller for reading and writing small
files (as fewer of them would span two chunks).
The following paper explains it (better than I do):
https://www.vmware.com/pdf/esx3_partition_align.pdf
If the cost is 1-8MB at the start of the disk (assuming partitions
are sized so that they don't lose the offset of 2048 sectors), I
think it is worth pursuing. (Again, I only have experience on
amd64/i386 hardware.)
Doing a quick scan through the PDF I only see talk about 64k boundaries.
FFS(2) will split any partition into multiple cylinder groups. Each
cylinder group starts with a superblock copy, inode tables and other
metadata before the data blocks of that cylinder group. Having the
start of a partition on a 1MB boundary does not get you those data
blocks at a specific boundary. So I think your reasoning does not
apply to FFS(2).
It might make sense to move the start to offset 128 for big
partitions, so you align with the 64k boundary mentioned in the
PDF; the block size is already 64k (for big partitions).
-Otto
Post by Tom Smyth
Thanks
Tom Smyth
Post by Christian Weisgerber
Post by Tom Smyth
I was just installing today's snapshot, and the default offset on
amd64 is 64 (as it has been for as long as I can remember).
It was changed from 63 in 2010.
Post by Tom Smyth
Is it worthwhile updating the defaults so that the OpenBSD partition
layout is optimal for SSDs or other virtualized RAID environments
with 1MB chunks?
What are you trying to optimize with this? FFS2 file systems reserve
64 kB at the start of a partition, and after that it's filesystem
blocks, which are 16/32/64 kB, depending on the size of the filesystem.
I can barely see an argument for aligning large partitions at 128
sectors, but what purpose would larger multiples serve?
Post by Tom Smyth
Is there a downside to moving the default offset to 2048?
Not really. It wastes a bit of space, but that is rather insignificant
for today's disk sizes.
--
--
Kindest regards,
Tom Smyth.
--
Kindest regards,
Tom Smyth.
Christian Weisgerber
2021-04-21 14:21:54 UTC
Post by Tom Smyth
If you have a 1MB file, or a database that needs to read 1MB of
data, and the partitions are not aligned, then your underlying
storage system needs to load two chunks, or write two chunks, for
that 1MB of data.
You seem to assume that FFS2 would align a 1MB file on a 1MB border
within the filesystem. That is not the case. That 1MB file will be
aligned on a blocksize border (16/32/64 kB, depending on filesystem
size). Aligning the partition on n*blocksize has no effect on this.
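
A minimal sketch of why (illustrative only; the block numbers are
arbitrary examples of what an allocator might pick):

/*
 * Illustrative only: even with the partition start on a 1 MiB
 * boundary, a file begins wherever the allocator puts it, i.e. at
 * some multiple of the block size (e.g. 64k on a big filesystem),
 * which is usually not a multiple of 1 MiB.
 */
#include <stdio.h>

int
main(void)
{
	const unsigned long long mib = 1024 * 1024, bs = 64 * 1024;
	const unsigned long long part_start = 2048 * 512; /* 1 MiB aligned */
	unsigned long long k;

	/* arbitrary example block numbers chosen by the allocator */
	for (k = 13; k <= 16; k++)
		printf("file at block %llu: device offset %% 1MiB = %llu\n",
		    k, (part_start + k * bs) % mib);
	return 0;
}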
--
Christian "naddy" Weisgerber ***@mips.inka.de
Tom Smyth
2021-04-21 16:35:10 UTC
Christian, Otto, Thanks for your feedback on this one....

I'll research it further, but NTFS has 4K, 8K, 32K and 64K
allocation units on the filesystem, and for Microsoft Windows
running Exchange or database workloads they were recommending
alignment of the NTFS partitions on the 1MB offset as well.

From Otto's explanation (thanks) that 1 in 16 blocks would
potentially cross a boundary of the storage subsystem, 6.25% of
reads (or writes) could result in a double read (or double write).

Of course, the write issue is a bigger problem for the SSDs.

I can configure the partitions how I want, for now anyway.

I'll do a little digging on FFS and FFS2 and see how the filesystem
database (or table) is structured...

Thanks for the feedback, it is very helpful to me.

All the best,

Tom Smyth
Post by Christian Weisgerber
Post by Tom Smyth
If you have a 1MB file, or a database that needs to read 1MB of
data, and the partitions are not aligned, then your underlying
storage system needs to load two chunks, or write two chunks, for
that 1MB of data.
You seem to assume that FFS2 would align a 1MB file on a 1MB border
within the filesystem. That is not the case. That 1MB file will be
aligned on a blocksize border (16/32/64 kB, depending on filesystem
size). Aligning the partition on n*blocksize has no effect on this.
--
--
Kindest regards,
Tom Smyth.
Kent Watsen
2021-04-21 17:52:24 UTC
I’m running OpenBSD on top of bhyve using virtual disks allocated out of ZFS pools. While not the same setup, some concepts carry over…

I have two types of pools:

1) an "expensive" pool for fast random IO:
- this pool is made up of stripes of SSD-based vdevs.
- ZFS is configured to use a 16K recordsize for this pool.
- good for small files (guest OS, DBs, web/mail/dns files, etc.)
- When ZFS is told to use the SSD, it starts the partition
on sector 256 (not the default sector 34) to ensure good
SSD NAND alignment.

2) a less-expensive pool for large sequential IO:
- this pool is a single RAIDZ2-based vdev using spinning rust.
- ZFS is configured to use a 1M recordsize for this pool.
- good for large files (movies, high-res images, backups, etc.)

Virtual disks are exposed to the OpenBSD guests from both pools. The guest’s root-disk is always allocated from pool #1. Typically, a second application-specific disk is also allocated from pool #1 (e.g., /var/www/sites on a web server, /home on a mail server, etc.). Only in special circumstances (e.g., a media server) is a disk allocated from pool #2.

This arrangement steps around needing to read/write 1M blocks for each small file access, and also the possibility that a guest accessing a given block will span more than a single physical block.

Can VMware virtual disks be configured similarly?

K.
Post by Tom Smyth
Christian, Otto, Thanks for your feedback on this one....
I'll research it further, but NTFS has 4K, 8K, 32K and 64K
allocation units on the filesystem, and for Microsoft Windows
running Exchange or database workloads they were recommending
alignment of the NTFS partitions on the 1MB offset as well.
From Otto's explanation (thanks) that 1 in 16 blocks would
potentially cross a boundary of the storage subsystem, 6.25% of
reads (or writes) could result in a double read (or double write).
Of course, the write issue is a bigger problem for the SSDs.
I can configure the partitions how I want, for now anyway.
I'll do a little digging on FFS and FFS2 and see how the filesystem
database (or table) is structured...
Thanks for the feedback, it is very helpful to me.
All the best,
Tom Smyth
Post by Christian Weisgerber
Post by Tom Smyth
If you have a 1MB file, or a database that needs to read 1MB of
data, and the partitions are not aligned, then your underlying
storage system needs to load two chunks, or write two chunks, for
that 1MB of data.
You seem to assume that FFS2 would align a 1MB file on a 1MB border
within the filesystem. That is not the case. That 1MB file will be
aligned on a blocksize border (16/32/64 kB, depending on filesystem
size). Aligning the partition on n*blocksize has no effect on this.
--
--
Kindest regards,
Tom Smyth.
Stuart Henderson
2021-04-21 20:20:34 UTC
Post by Kent Watsen
- When ZFS is told to use the SSD, it starts the partition
on sector 256 (not the default sector 34) to ensure good
SSD NAND alignment.
The OS doesn't get all that close to the NAND layer with typical
computer-component SSD drives; there is a layer in between doing
translation/wear-levelling (and in some cases compression):
black-box proprietary code with presumably a fair bit of deep
magic involved. (Some OSes do have more direct access to certain
types of flash devices that need OS control of wear-levelling;
OpenBSD doesn't, and FFS is probably not the right filesystem for
this anyway.)

There are different block sizes involved too: one is the size in
which writes can be done; the other is for erases, which is
typically much larger.

If someone wants this badly enough, the starting point is to show
some figures for a situation that it improves: benchmarks for speed
improvements. Maybe there's something in SSD SMART stats that will
give clues as to whether it reduces write amplification. (Then it
needs repeating on different hardware; even different firmware
versions in an SSD could change how it behaves, let alone
differences between the various controller manufacturers.)

I've written disklabel/fdisk diffs for this before, but I couldn't
figure out whether they actually helped anything.
