Safe Writing to Files in IoT and Industrial Systems

February 11, 2021

Author(s)

Michael Roeschter
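Michael is a pre-sales architect at Azul. He has written software in Turbo Pascal, Fortran, Cobol, Visual Basic, C++ and C before settling on Java in 1998 for good. Working on business integration, business process and lately IoT problems, Michael's perspective is less on software architecture but more on practical…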

Especially on IoT devices, file corruption on shutdown is a common concern. This article discusses how to write to disk safely in Java, combining disk sync, shutdown hooks, and atomic renaming of files.

Files On Disk Can Still Easily Become Corrupted

To optimize performance, file systems write to disk asynchronously. This can leave files corrupted when a hard system shutdown occurs through a power cut or a crash.

As this has become a rare experience for desktop and server users, it comes as a surprise to many developers working on IoT devices and industrial computers that hard power cuts are a common operational scenario and storage is far less robust than expected.

General Problem

Operating systems cache writes to storage devices in RAM buffers. Additionally, spinning disks and SSDs have their own RAM buffers for write optimization. To clear buffers and guarantee safe physical storage, file systems provide the sync() command, which cascades through the hierarchy of storage devices.

It should be no surprise that the sync() call is expensive, even more so on embedded systems, where I/O is often slower for cost reasons and even writes to SSD are heavily buffered and delayed to reduce wear. Because calling sync() after every write operation performs too poorly in most scenarios, the developer needs to make the correct design choices.

Useful Design Patterns

File flush() and sync()
Shutdown hooks and SIGTERM versus SIGKILL
Atomic renaming of files

File Flush() and Sync()

FileOutputStream.flush() and equivalent calls do not guarantee a write to disk. They only ensure that whatever buffer Java holds internally for optimization is flushed to the underlying storage layer, that is, handed over to the operating system. While calling flush() is strongly recommended before closing streams and files, it does not address our safe storage problem.

FileDescriptor.sync() asks the operating system to write the file's buffered data through to the storage device (on POSIX systems this is typically an fsync() on the file's descriptor) and returns only once the device reports completion. Depending on the file system, this can force out far more than your own writes: some journaling file systems flush buffered data belonging to other files and programs along with it.

The performance impact can be large, so calls to sync() should be infrequent. If the system has multiple file systems or storage devices, the call only affects the one that holds the file.
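One way to keep sync() calls infrequent is to batch them. The following is a minimal sketch, not a definitive implementation: the class name BatchedSyncLog, the recordsPerSync parameter and the syncCount() counter are hypothetical names introduced for illustration. It appends records through a buffer and only flushes and syncs after every N records, so at most N records are at risk on a power cut.

```java
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: an append-only log that syncs once per batch of records.
// Not thread-safe; callers are assumed to use it from a single thread.
class BatchedSyncLog implements Closeable {
    private final FileOutputStream out;
    private final BufferedOutputStream bout;
    private final int recordsPerSync;
    private int unsynced = 0;   // records written since the last sync
    private int syncCount = 0;  // number of sync() calls made so far

    BatchedSyncLog(String path, int recordsPerSync) throws IOException {
        this.out = new FileOutputStream(path, true); // true = append mode
        this.bout = new BufferedOutputStream(out);
        this.recordsPerSync = recordsPerSync;
    }

    void append(String record) throws IOException {
        bout.write((record + "\n").getBytes(StandardCharsets.UTF_8));
        if (++unsynced >= recordsPerSync) {
            sync();
        }
    }

    void sync() throws IOException {
        bout.flush();        // Java-side buffer -> operating system
        out.getFD().sync();  // operating system -> storage device
        unsynced = 0;
        syncCount++;
    }

    int syncCount() {
        return syncCount;
    }

    @Override
    public void close() throws IOException {
        try {
            if (unsynced > 0) {
                sync();      // do not lose the last partial batch
            }
        } finally {
            bout.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Example file name and batch size are arbitrary.
        try (BatchedSyncLog log = new BatchedSyncLog("sensor.log", 100)) {
            for (int i = 0; i < 250; i++) {
                log.append("reading " + i);
            }
        }
    }
}
```

The batch size is the trade-off knob: a larger value means fewer expensive sync() calls but more records lost if power fails between two syncs.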

The flush() and sync() calls are often mixed up, but both are required to safely write a file to disk:

FileOutputStream out = new FileOutputStream(filename); // open the file
BufferedOutputStream bout = new BufferedOutputStream(out);
bout.write(data); // data is a placeholder for your bytes; repeat in a loop, writes are optimized through buffering
bout.flush(); // pushes our Java-side buffers to the OS
out.getFD().sync(); // returns once the OS and disk buffers are written out; in the worst case this takes from hundreds of ms on a server to seconds on an embedded device

Note that we skipped the file close(). We can safely write a portion of a file (e.g. a log) and expect the file system to “repair” our open file after a crash. But skipping the sync() or flush() will potentially leave the data corrupted even if we closed the file nicely.
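For a file that is written in one go, the same sequence can be wrapped in a small helper. This is a sketch under stated assumptions: the class DurableWrite and the method writeDurably are hypothetical names, and the file name in main is only an example. It uses try-with-resources so the streams are closed after the data has been flushed and synced.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class DurableWrite {

    // Hypothetical helper: returns only after flush() and sync() have completed.
    static void writeDurably(String path, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(path);
             BufferedOutputStream bout = new BufferedOutputStream(out)) {
            bout.write(data);     // lands in the Java-side buffer
            bout.flush();         // Java-side buffer -> operating system
            out.getFD().sync();   // operating system -> storage device
        } // both streams are closed here, after the data is already synced
    }

    public static void main(String[] args) throws IOException {
        writeDurably("durable-example.txt",
                "hello\n".getBytes(StandardCharsets.UTF_8));
    }
}
```

If sync() fails it throws a SyncFailedException, which is an IOException, so the caller learns that the data may not have reached the device.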

Shutdown Hooks and SIGTERM vs. SIGKILL

Our application may be asked to shut down before it naturally reaches a sync() point. As it is impractical to add shutdown-related logic to every piece of I/O code, we can use shutdown hooks, which run on regular(!) JVM shutdown and allow us to clean up and flush/sync databases and files.

Runtime.getRuntime().addShutdownHook()

Note that shutdown hooks run during a “soft kill” but do NOT run during a “hard kill” of the JVM.

The shutdown hook is triggered by:

System.exit()
“kill -SIGTERM” which is the default for kill

A Hard shutdown, skipping the shutdown hook, is triggered by:

Runtime.getRuntime().halt()
“kill -9” or “kill -SIGKILL”
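A shutdown hook that flushes and syncs an open file could look like the sketch below. The class ShutdownSync, the helper newSyncHook and the file name app.log are hypothetical and only serve as illustration. The hook and the writer synchronize on the same stream object so the hook does not flush in the middle of a write.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class ShutdownSync {

    // Hypothetical helper: builds the thread that the JVM runs on soft shutdown.
    static Thread newSyncHook(BufferedOutputStream bout, FileOutputStream out) {
        return new Thread(() -> {
            try {
                synchronized (bout) {      // same lock as the writer in main
                    bout.flush();          // Java-side buffer -> operating system
                    out.getFD().sync();    // operating system -> storage device
                }
            } catch (IOException e) {
                System.err.println("Sync during shutdown failed: " + e);
            }
        }, "sync-on-shutdown");
    }

    public static void main(String[] args) throws IOException {
        FileOutputStream out = new FileOutputStream("app.log", true);
        BufferedOutputStream bout = new BufferedOutputStream(out);

        // Runs on System.exit(), normal termination and SIGTERM,
        // but not on Runtime.halt(), SIGKILL or a power cut.
        Runtime.getRuntime().addShutdownHook(newSyncHook(bout, out));

        synchronized (bout) {
            bout.write("started\n".getBytes(StandardCharsets.UTF_8));
        }
        // ... normal application work would continue here ...
    }
}
```

Hooks should finish quickly, because the operating system or a service manager may follow an ignored SIGTERM with a SIGKILL after a timeout.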

Atomic Renaming of Files

Even with the above design patterns, when something goes wrong (the device powers off or a network connection breaks), files may still end up corrupted or incomplete. How can we perform an atomic file operation that guarantees the reading application always sees a correct file?

You may expect modern file systems to have transactions or locking mechanisms for this purpose, but few do, so it is best not to rely on them. Renaming a file is, for all practical purposes, atomic on all file systems, local or remote, and offers an age-old and portable way out of the dilemma.

Create and write a temporary file “abcd_temp”
(Optionally) Delete “config.dat_old”
(Optionally) rename “config.dat” to “config.dat_old”
Rename “abcd_temp” to “config.dat”

The above works for local file systems as well as network protocols like NFS, SMB, FTP, SCP...
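On a local file system, the rename steps above can be sketched with java.nio.file.Files.move. The class AtomicReplace and the method replaceAtomically are hypothetical names, and deriving the temporary and backup names from the target name is an assumption made for this example. The temporary file is created next to the target because a rename is only atomic within one file system, and it is synced before the rename so the new name never points at unwritten data.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class AtomicReplace {

    // Hypothetical helper following the steps in the list above.
    static void replaceAtomically(Path target, byte[] data) throws IOException {
        Path temp = target.resolveSibling(target.getFileName() + "_temp");
        Path old = target.resolveSibling(target.getFileName() + "_old");

        // Step 1: write the complete new content to the temporary file.
        // FileOutputStream is unbuffered, so only sync() is needed here.
        try (FileOutputStream out = new FileOutputStream(temp.toFile())) {
            out.write(data);
            out.getFD().sync();
        }

        // Steps 2 and 3 (optional): keep the previous version as "*_old",
        // replacing any older backup in the same call.
        if (Files.exists(target)) {
            Files.move(target, old, StandardCopyOption.REPLACE_EXISTING);
        }

        // Step 4: rename the finished file into place.
        Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        replaceAtomically(Paths.get("config.dat"),
                "mode=run\n".getBytes(StandardCharsets.UTF_8));
    }
}
```

Between the two renames there is a short window in which “config.dat” does not exist, which is why the reading application needs the fallback to “config.dat_old”.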

Notes

When the reading application does not find a “config.dat”, it can recover the previous “config.dat_old”.
This also works when the reading application is polling for files as it can never open a half written file by accident.
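The recovery described in the first note might look like this on the reading side. It is a minimal sketch: ConfigReader and readConfig are hypothetical names, and the “_old” suffix follows the naming used in the steps above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

class ConfigReader {

    // Hypothetical helper: reads the current file, or the previous
    // version if the current one does not exist.
    static byte[] readConfig(Path target) throws IOException {
        try {
            return Files.readAllBytes(target);
        } catch (NoSuchFileException e) {
            // The writer was interrupted between its two renames,
            // so fall back to the last complete version.
            Path old = target.resolveSibling(target.getFileName() + "_old");
            return Files.readAllBytes(old);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] config = readConfig(Paths.get("config.dat"));
        System.out.print(new String(config, StandardCharsets.UTF_8));
    }
}
```

If neither file exists, the second read throws NoSuchFileException as well, which the caller can treat as "no configuration has ever been written".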

