| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "Using xmldiff in Python unit tests" |
| 4 | +categories: template |
| 5 | +author: Jan Černý |
| 6 | +author_url: https://github.com/jan-cerny |
| 7 | +--- |
| 8 | + |
| 9 | +Recently, we have decided to improve the test coverage of the [ComplianceAsCode](https://github.com/ComplianceAsCode/content) build system by adding more unit tests for our Python modules. |
| 10 | + |
| 11 | +Specifically, we have focused on testing code that works with XML. |
| 12 | +We have been creating tests for methods that generate XML elements, build XML trees, or transform one XML tree into another. |
| 13 | + |
| 14 | +At first sight, testing these types of methods looks easy. |
| 15 | +We created some fixtures and then wrote test cases with asserts that counted the generated elements and attributes and checked their values against the expected ones. |
| 16 | +An example of this is below: |
| 17 | + |
| 18 | +```python |
| 19 | +def test_group_to_xml_element(group_selinux): |
| 20 | +    group_el = group_selinux.to_xml_element() |
| 21 | +    assert group_el is not None |
| 22 | +    assert group_el.tag == {% raw %} "{%s}Group" % XCCDF12_NS {% endraw %} |
| 23 | +    assert len(group_el.attrib) == 1 |
| 24 | +    assert group_el.get("id") == "xccdf_org.ssgproject.content_group_selinux" |
| 25 | +    assert group_el.text is None |
| 26 | +    # ... snip ... |
| 27 | +``` |
| 28 | + |
| 29 | +This is quite easy, and most people would be fine with a test case like this. |
| 30 | +The advantage of this approach is that every requirement on the tested method has its own assert, so when a test starts to fail, it is immediately obvious what is broken. |
| 31 | +However, we didn't quite like it. |
| 32 | +The expected XML structure generated by the tested method (`to_xml_element()` in the example above) isn't clear from the test code. |
| 33 | +The test can get quite long, and it is laborious to write all the asserts for methods that generate big XML trees with many child elements. |
| 34 | +So we started to look for ways to improve the tests. |
| 35 | + |
| 36 | +## Get familiar with xmldiff |
| 37 | + |
| 38 | +We have discovered the [xmldiff](https://xmldiff.readthedocs.io/en/stable/) project. |
| 39 | + |
| 40 | +It's a Python package that can be installed with `pip`: |
| 41 | + |
| 42 | +```bash |
| 43 | +$ sudo pip3 install xmldiff |
| 44 | +``` |
| 45 | + |
| 46 | +It can be used both as a command line tool and a Python module. |
| 47 | + |
| 48 | +Assuming that you have 2 XML files, `file1.xml` and `file2.xml`, run the following command: |
| 49 | + |
| 50 | +```bash |
| 51 | +$ xmldiff file1.xml file2.xml |
| 52 | + |
| 53 | +[update-attribute, /ns0:Rule/ns0:platform[1], idref, "virtual"] |
| 54 | +[update-text, /ns0:Rule/ns0:ident[1], "777777"] |
| 55 | +``` |
| 56 | + |
| 57 | +The `xmldiff` command prints a list of actions. |
| 58 | +This list is a so-called "edit script": it contains all the changes needed to transform the first XML document into the second one. |
| 59 | +In the example above, there are two differences between the two XML files. |
| 60 | +The first is that the `idref` attribute of the element described by the XPath expression `/ns0:Rule/ns0:platform[1]` is changed to `virtual`. |
| 61 | +The second is that the text of the element described by the XPath expression `/ns0:Rule/ns0:ident[1]` is changed to `777777`. |
| 62 | + |
| 63 | +In a Python script, you can call xmldiff this way: |
| 64 | + |
| 65 | +```python |
| 66 | +import xmldiff.main |
| 67 | +diff = xmldiff.main.diff_files("file1.xml", "file2.xml") |
| 68 | +print(diff) |
| 69 | +``` |
| 70 | + |
| 71 | +The `xmldiff` library seemed very easy to use, so we decided to use it in our unit tests. |
| 72 | +The [xmldiff documentation](https://xmldiff.readthedocs.io/en/stable/) is a good starting point. |
| 73 | + |
| 74 | +However, we ran into a few caveats, which we describe below. |
| 75 | + |
| 76 | +## Passing XML trees to the library |
| 77 | + |
| 78 | +Our methods usually return `xml.etree.ElementTree` elements, so we first used the `xmldiff.main.diff_trees()` function to compare them. |
| 79 | +We put the expected output into a file in our test data directory, parsed that file, and provided the parsed tree to the test as a fixture. |
| 80 | + |
| 81 | +The problem was that `xmldiff` takes `lxml` trees and not the `xml.etree` objects that we use, so we had to convert both of them to `lxml`. |
| 82 | + |
| 83 | +This worked quite well. |
| 84 | +Any difference between the actual and the expected output made the test fail. |
| 85 | +Our previous test then looked like this: |
| 86 | + |
| 87 | +```python |
| 88 | +def test_group_to_xml_element(group_selinux, group_selinux_xml): |
| 89 | +    group_el = group_selinux.to_xml_element() |
| 90 | +    group_tree = lxml.etree.fromstring(ET.tostring(group_el)) |
| 91 | +    diff = xmldiff.main.diff_trees(group_tree, group_selinux_xml) |
| 92 | +    assert diff == [] |
| 93 | +``` |
| 94 | + |
| 95 | +## Handling white space |
| 96 | + |
| 97 | +However, when we reviewed our code, we didn't like the saved XML test data: the files were ugly, with no formatting. |
| 98 | +So we pretty-printed them with `xmllint`. |
| 99 | +But then the tests started to fail. |
| 100 | + |
| 101 | +We found that `xmldiff` is very sensitive: it reported a bunch of differences just because we had added newlines and whitespace here and there. |
| 102 | +We wondered how to convince `xmldiff` to ignore the whitespace. |
| 103 | +We didn't want to run the `xmllint` command as a subprocess in our tests. |
| 104 | +We tried to use [formatters](https://xmldiff.readthedocs.io/en/stable/api.html#using-formatters) but with no luck; `xmldiff` was still sensitive to whitespace. |
| 105 | +We were mainly concerned that unformatted test data would be difficult to review and that the whitespace sensitivity would make it cumbersome to maintain. |
| 106 | +By accident, we discovered that this behavior doesn't happen with the `xmldiff.main.diff_files()` function. |
| 107 | +That function isn't sensitive to whitespace or formatting of the XML files, so we can save them in a pretty format. |
| 108 | +So we reworked our tests: each test first saves the output of the tested method to a temporary file and then calls `xmldiff.main.diff_files()` to compare this temporary file with our static file in the test data. |
| 109 | +The test function code is very simple and the test data can look pretty. |
| 110 | +Moreover, we don't need to import `lxml`. |
| 111 | + |
| 112 | +```python |
| 113 | +def test_group_to_xml_element(group_selinux): |
| 114 | +    group_el = group_selinux.to_xml_element() |
| 115 | +    with temporary_filename() as real: |
| 116 | +        ET.ElementTree(group_el).write(real) |
| 117 | +        expected = os.path.join(DATADIR, "selinux.xml") |
| 118 | +        diff = xmldiff.main.diff_files(real, expected) |
| 119 | +        assert diff == [] |
| 120 | +``` |
| 121 | + |
| 122 | +Note: `temporary_filename` is a context manager that gives us a temporary file name. |
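The post does not show how `temporary_filename` is implemented. A minimal sketch of such a context manager, assuming it only needs to hand out a path and clean up afterwards, could look like the following. The `suffix` parameter is an assumption added for illustration; the real helper in the project may differ.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_filename(suffix=".xml"):
    """Yield the path of a fresh temporary file and delete it afterwards."""
    # mkstemp creates the file and returns an open OS-level descriptor.
    fd, path = tempfile.mkstemp(suffix=suffix)
    # Close the descriptor right away; callers reopen the file by name.
    os.close(fd)
    try:
        yield path
    finally:
        # Remove the file even if the body of the with block raised.
        if os.path.exists(path):
            os.remove(path)
```

Because the file is removed when the `with` block exits, the comparison has to happen inside the block.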
| 123 | + |
| 124 | +## Working with namespaces |
| 125 | + |
| 126 | +One of our methods transforms a given XML tree into a different XML tree that differs in a couple of attributes and values, while the rest of the tree stays the same. |
| 127 | +So we compared the input of this method with its output using `xmldiff` and got the diff in the form of an edit script. |
| 128 | +Then we had to figure out how to assert that this edit script is the expected one. |
| 129 | +In other words, how to verify that `xmldiff` has given the expected diff. |
| 130 | +We found that the items in the diff are Python `namedtuple`s, so we can easily create our own `namedtuple`s in the code and then check whether they're present in the diff. |
| 131 | + |
| 132 | +These tuples describe the affected element using XPath. However, the XPath uses namespace prefixes. |
| 133 | +We were afraid that these prefixes could easily change. |
| 134 | +Using a prefix without any mapping is not the way one normally works with namespaces. |
| 135 | +But we found no way to provide a correct XPath with the namespaces, and the documentation doesn't mention how to do that. |
| 136 | +So we created a workaround: we access the namespace map of the "new" `lxml` tree, create a reverse mapping, and save the actual prefix to a variable. |
| 137 | + |
| 138 | +In the following example, we test that the two `lxml` trees differ in exactly one thing, which is the value of the `id` attribute on the `definition` element, where the `definition` element belongs to the `"http://oval.mitre.org/XMLSchema/oval-definitions-5"` namespace. |
| 139 | + |
| 140 | +```python |
| 141 | +def test_foo(old, new): |
| 142 | +    # create an inverted namespace map from new.nsmap |
| 143 | +    # the inverted map maps namespace URIs to prefixes |
| 144 | +    inverted_new_nsmap = {v: k for k, v in new.nsmap.items()} |
| 145 | +    # take the actually used prefix of the namespace |
| 146 | +    prefix = inverted_new_nsmap["http://oval.mitre.org/XMLSchema/oval-definitions-5"] |
| 147 | +    # perform the diff |
| 148 | +    diff = set(xmldiff.main.diff_trees(old, new)) |
| 149 | +    # craft the expected value, use the prefix variable in the XPath expression |
| 150 | +    action1 = xmldiff.actions.UpdateAttrib( |
| 151 | +        node=f'/{prefix}:oval_definitions/{prefix}:definitions/{prefix}:definition[1]', |
| 152 | +        name='id', |
| 153 | +        value='oval:ssg-kerberos_disable_no_keytab:def:1') |
| 154 | +    # assert that the expected value is in the diff |
| 155 | +    assert action1 in diff |
| 156 | +    # assert that no other value than the expected value is in the diff |
| 157 | +    diff.remove(action1) |
| 158 | +    assert diff == set() |
| 159 | +``` |
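When more than one action is expected, the membership check followed by removal can be expressed as a set comparison that also reports what went wrong. The sketch below is only an illustration: `unexpected_and_missing` is a hypothetical helper, the `ns0` prefix is assumed, and the `UpdateAttrib` defined here is a stand-in `namedtuple` with the same fields (`node`, `name`, `value`) as the action used in the example above, so that the snippet runs without `xmldiff` installed.

```python
from collections import namedtuple

# Stand-in for the xmldiff action, with the fields used in the example above.
UpdateAttrib = namedtuple("UpdateAttrib", "node name value")


def unexpected_and_missing(diff, expected):
    """Return (unexpected, missing) actions as two sets."""
    actual = set(diff)
    wanted = set(expected)
    return actual - wanted, wanted - actual


prefix = "ns0"  # assumed prefix, for illustration only
expected = [UpdateAttrib(
    node=f"/{prefix}:oval_definitions/{prefix}:definitions/{prefix}:definition[1]",
    name="id",
    value="oval:ssg-kerberos_disable_no_keytab:def:1")]
diff = list(expected)  # pretend this list came from diff_trees()

unexpected, missing = unexpected_and_missing(diff, expected)
assert unexpected == set()
assert missing == set()
```

Since `namedtuple`s compare by value, an action built in the test is equal to an action with the same fields in the diff, and a failing assert shows which actions were unexpected or missing.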
| 160 | + |
| 161 | +## Conditional imports |
| 162 | + |
| 163 | +Another problem that we faced is that we wanted to run the `xmldiff` tests in our upstream and downstream CI. |
| 164 | +Unfortunately, we discovered that the library isn't available as an RPM package, neither in Fedora nor in RHEL. |
| 165 | +It's available only from PyPI. |
| 166 | +That means we can't execute the tests in some of our test environments. |
| 167 | +But we still wanted to run the tests in the environments where `xmldiff` is available without disabling all the unit tests on the other systems. Fortunately, `pytest` has a very elegant function, `importorskip()`, that skips the test case when a module isn't available and still runs the other test cases. |
| 168 | + |
| 169 | +We call this function in every test function where we use `xmldiff`: |
| 170 | + |
| 171 | +```python |
| 172 | +def test_foo(): |
| 173 | +    # … |
| 174 | +    xmldiff_main = pytest.importorskip("xmldiff.main") |
| 175 | +    diff = xmldiff_main.diff_files(real_file_path, expected_file_path) |
| 176 | +    # … |
| 177 | +``` |
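One possible way to avoid repeating the `importorskip()` call in every test is to wrap it in a `pytest` fixture. This is a sketch of an alternative, not what the project does; the fixture name `xmldiff_main` and the test below are hypothetical.

```python
import pytest


@pytest.fixture
def xmldiff_main():
    # Skips the requesting test when xmldiff is not installed.
    return pytest.importorskip("xmldiff.main")


def test_identical_files(xmldiff_main, tmp_path):
    # tmp_path is pytest's built-in per-test temporary directory.
    content = "<root><item>1</item></root>"
    left = tmp_path / "left.xml"
    right = tmp_path / "right.xml"
    left.write_text(content)
    right.write_text(content)
    # Two identical files should produce an empty edit script.
    assert xmldiff_main.diff_files(str(left), str(right)) == []
```

Only tests that request the fixture are skipped when the module is missing; the remaining tests in the same file still run.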
| 178 | + |
| 179 | +## Conclusion |
| 180 | + |
| 181 | +The `xmldiff` library is a very useful tool for comparing XML documents and writing unit tests for Python code that works with XML. |
| 182 | +We have successfully introduced multiple unit tests that leverage `xmldiff` in our project. |
| 183 | +If you are curious about the full code, take a look at, for example, [test_build_yaml](https://github.com/ComplianceAsCode/content/blob/master/tests/unit/ssg-module/test_build_yaml.py). |
| 184 | + |
| 185 | +However, for wider adoption in our project, we will need to get the `xmldiff` package into Fedora and other Linux distributions. |