YAML is a human-readable data serialization language that’s commonly used for config files. The name is a recursive acronym for “YAML Ain’t Markup Language,” and the language was first released in 2001. You can compare YAML to JSON or XML, as all of them are text-based structured formats.
YAML files are often used to configure applications, application servers, or clusters. It is a very common format in Spring Boot applications and, of course, for configuring Kubernetes. However, like JSON and XML, YAML can also be used to serialize and deserialize data.
Although YAML looks like an excellent alternative to XML and JSON, many people aren’t big fans of the structure. Since the language is line-based and uses indentation to represent structure and nesting, indentation often causes problems when parsing complex data structures. A single missing (or extra) space in a complex, data-heavy structure can cause parsing to fail or silently change the nesting. This leads to unexpected problems, and finding the offending line in a large YAML file is difficult.
Most importantly, loading YAML yourself in a Java application with an outdated version of SnakeYAML might get you into trouble.
Billion laughs attack
One feature of YAML is that you can create anchors. You can reuse these anchors in different places so you do not have to repeat yourself. In the simplified example below, I create two variables: var1 and var2. By using anchors, var2 has the same value as var1.
var1: &anchor value
var2: *anchor
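To make the anchor mechanism concrete, here is a minimal Java sketch of how an alias ends up with the anchored value. The class AnchorSketch and its resolve method are hypothetical names I made up for illustration. The sketch only understands flat "key: value" lines with scalar values, so it is not a YAML parser and not how SnakeYAML works internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only: resolves anchors (&name) and
// aliases (*name) in flat "key: value" lines. Real YAML parsing is far richer.
public class AnchorSketch {

    static Map<String, String> resolve(String yaml) {
        Map<String, String> anchors = new LinkedHashMap<>(); // anchor name -> value
        Map<String, String> result = new LinkedHashMap<>();  // key -> resolved value

        for (String line : yaml.split("\n")) {
            String[] parts = line.split(":", 2);
            if (parts.length < 2) {
                continue; // skip blank lines and anything that is not "key: value"
            }
            String key = parts[0].trim();
            String value = parts[1].trim();

            if (value.startsWith("&")) {
                // "&name value" defines an anchor and assigns the value to the key
                String[] anchored = value.substring(1).split("\\s+", 2);
                String anchoredValue = anchored.length > 1 ? anchored[1] : "";
                anchors.put(anchored[0], anchoredValue);
                result.put(key, anchoredValue);
            } else if (value.startsWith("*")) {
                // "*name" is an alias that reuses the value stored under the anchor
                String name = value.substring(1);
                if (!anchors.containsKey(name)) {
                    throw new IllegalArgumentException("Unknown anchor: " + name);
                }
                result.put(key, anchors.get(name));
            } else {
                result.put(key, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> data = resolve("var1: &anchor value\nvar2: *anchor");
        System.out.println(data); // {var1=value, var2=value}
    }
}
```

Both keys resolve to the same value, which is the "do not repeat yourself" benefit of anchors. It is also the property the attack in the next part abuses.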
Let’s take this to the extreme and create the famous billion laughs attack for YAML. By applying this concept in a nested way, I can actually make a billion laughs.
lol1: &lol1 ["lol","lol","lol","lol","lol","lol","lol","lol","lol"] lol2: &lol2 [*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1] lol3: &lol3 [*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2] lol4: &lol4 [*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3,*lol3] lol5: &lol5 [*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4,*lol4] lol6: &lol6 [*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5,*lol5] lol7: &lol7 [*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6,*lol6] lol8: &lol8 [*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7,*lol7] lol9: &lol9 [*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8,*lol8] lolz: &lolz [*lol9]
As you can see, lol1 is a list of 9 strings "lol". The variable lol2 is a list of 9 times lol1. By repeating this principle several times, we end up with lolz = 9^9 (roughly 387 million) times "lol". With ten entries per list, as in the classic version of the attack, that would be 10^9. Better said, a billion laughs.
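The growth is easy to check with a few lines of Java. This sketch (the class and method names are mine, purely for illustration) computes how many "lol" strings are reachable at each level for a given number of entries per list. Nine entries matches the listing above, and ten matches the classic billion laughs.

```java
public class LaughCounter {

    // Number of "lol" strings reachable after `levels` levels of nesting
    // when every list holds `fanOut` entries: fanOut^levels.
    static long leaves(int fanOut, int levels) {
        long count = 1;
        for (int i = 0; i < levels; i++) {
            count = Math.multiplyExact(count, (long) fanOut); // throws on overflow
        }
        return count;
    }

    public static void main(String[] args) {
        for (int level = 1; level <= 9; level++) {
            System.out.println("lol" + level
                    + ": " + leaves(9, level) + " strings with 9 entries per list, "
                    + leaves(10, level) + " with 10");
        }
    }
}
```

Each extra level adds only one short line to the YAML document but multiplies the logical size of the data by the number of entries per list.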
With anchors, you can create a YAML bomb! The tremendous number of (nested) objects that such a YAML input creates can exhaust the available memory.
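One way to think about defending against such input is to cap how many aliases a document may use before it is processed any further. The sketch below is a naive, stand-alone illustration of that idea. AliasGuard, countAliases, and requireAtMost are hypothetical names, not part of SnakeYAML, and the regular expression can miscount (for example when a quoted string contains something that looks like an alias).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-check for illustration: counts alias tokens in raw YAML text
// and rejects documents that use more than an allowed number of them.
public class AliasGuard {

    // An alias token such as *lol1, preceded by whitespace, '[' or ','.
    private static final Pattern ALIAS =
            Pattern.compile("(?<=[\\s\\[,])\\*[A-Za-z0-9_-]+");

    static int countAliases(String yaml) {
        Matcher matcher = ALIAS.matcher(yaml);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    static void requireAtMost(String yaml, int maxAliases) {
        int found = countAliases(yaml);
        if (found > maxAliases) {
            throw new IllegalArgumentException(
                    "Too many aliases: " + found + " (limit " + maxAliases + ")");
        }
    }

    public static void main(String[] args) {
        String harmless = "var1: &anchor value\nvar2: *anchor";
        requireAtMost(harmless, 10); // passes, only one alias

        String suspicious = "lol1: &lol1 [\"lol\"]\n"
                + "lol2: &lol2 [*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1,*lol1]\n"
                + "lol3: &lol3 [*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2,*lol2]";
        try {
            requireAtMost(suspicious, 10);
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

A text-level check like this is only a rough filter. In a real application, a limit enforced by the YAML library itself while it parses is the more reliable place for this kind of protection.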
Please read the full article for a breakdown of how this attack works with actual Java examples and, more importantly, how to prevent this problem in your Java application.