Commit 070cc87

Merge pull request #45 from jan-cerny/xmldiff
Add "Using xmldiff in Python unit tests"
2 parents 9062318 + f3a19da commit 070cc87

1 file changed

Lines changed: 185 additions & 0 deletions

@@ -0,0 +1,185 @@
---
layout: post
title: "Using xmldiff in Python unit tests"
categories: template
author: Jan Černý
author_url: https://github.com/jan-cerny
---

Recently, we decided to improve the test coverage of the [ComplianceAsCode](https://github.com/ComplianceAsCode/content) build system by adding more unit tests for our Python modules.

Specifically, we focused on testing code that works with XML.
We have been creating tests for methods that generate XML elements, generate XML trees, or transform one XML tree into another.

At first sight, testing these types of methods looks easy.
We created some fixtures and then wrote test cases with asserts that count the generated elements and attributes and check the expected values.
An example of this is below:

```python
def test_group_to_xml_element(group_selinux):
    group_el = group_selinux.to_xml_element()
    assert group_el is not None
    assert group_el.tag == {% raw %} "{%s}Group" % XCCDF12_NS {% endraw %}
    assert len(group_el.attrib) == 1
    assert group_el.get("id") == "xccdf_org.ssgproject.content_group_selinux"
    assert group_el.text is None
    ... snip ...
```

This is quite easy, and most people would be fine with a test case like this.
The advantage of this approach is that every requirement on the tested method has its own assert, so when a test starts to fail, it is immediately obvious what is broken.
However, we didn't quite like it.
The expected XML structure generated by the tested method (`to_xml_element()` in the example above) isn't clear from the code.
The test can be quite long, and it is laborious to write all the asserts for methods that generate big XML trees with many child elements.
So we started to look for ways to improve the tests.

## Get familiar with xmldiff

We discovered the [xmldiff](https://xmldiff.readthedocs.io/en/stable/) project.

It's a Python package that can be installed with `pip`:

```bash
$ sudo pip3 install xmldiff
```

It can be used both as a command-line tool and as a Python module.

Assuming that you have two XML files, `file1.xml` and `file2.xml`, run the following command:

```bash
$ xmldiff file1.xml file2.xml

[update-attribute, /ns0:Rule/ns0:platform[1], idref, "virtual"]
[update-text, /ns0:Rule/ns0:ident[1], "777777"]
```

The `xmldiff` command returns a list of actions.
This list of actions is a so-called "edit script" and contains all the changes needed to transform the first XML document into the second one.
In the example above, we can see that there are two differences between the two XML files.
The first is that the attribute `idref` on the element described by the XPath expression `/ns0:Rule/ns0:platform[1]` is changed to `virtual`.
The second is that the text of the element described by the XPath expression `/ns0:Rule/ns0:ident[1]` is changed to `777777`.

In a Python script, you can call xmldiff this way:

```python
import xmldiff.main

diff = xmldiff.main.diff_files("file1.xml", "file2.xml")
print(diff)
```

`xmldiff` seemed very easy to use, so we decided to use it in our unit tests.
The [xmldiff documentation](https://xmldiff.readthedocs.io/en/stable/) is a good starting point.

However, we encountered a few small caveats, which we describe below.

## Passing XML trees to the library

Our methods usually return `xml.etree.ElementTree` instances, so we first used the `xmldiff.main.diff_trees()` function to compare them.
We put the expected output in a file in our test data directory, and a fixture parsed that file and provided the parsed tree to the test.

The problem was that xmldiff takes `lxml` trees and not the `xml.etree` trees that we use, so we had to convert both of them to `lxml`.

This works quite well.
Any difference between the actual and the expected output makes the test fail.
Our previous test then looked like this:

```python
import xml.etree.ElementTree as ET
import lxml.etree
import xmldiff.main

def test_group_to_xml_element(group_selinux, group_selinux_xml):
    group_el = group_selinux.to_xml_element()
    # xmldiff works on lxml trees, so re-parse the xml.etree element with lxml
    group_tree = lxml.etree.fromstring(ET.tostring(group_el))
    diff = xmldiff.main.diff_trees(group_tree, group_selinux_xml)
    assert diff == []
```

## Handling whitespace

However, when we reviewed our code, we didn't like the saved XML test data: the files were ugly, with no formatting.
So we pretty-printed them with `xmllint`, and the XML files looked nice.
But then the tests started to fail.

We found that `xmldiff` is very sensitive and reported a bunch of differences just because we had added newlines and whitespace here and there.
We wondered how to convince `xmldiff` to ignore the whitespace.
We didn't want to run the `xmllint` command as a subprocess in our tests.
We tried to use [formatters](https://xmldiff.readthedocs.io/en/stable/api.html#using-formatters), but with no luck: `xmldiff` was still sensitive to whitespace.
We were mainly concerned that the data in the stored form would be difficult to review and that the whitespace sensitivity would make them cumbersome to maintain.
By accident, we discovered that this behavior doesn't occur with the `xmldiff.main.diff_files()` function.
That function isn't sensitive to whitespace or formatting of the XML files, so we can save them in a pretty format.
So we reworked our tests: each test first saves the output of the tested method to a temporary file and then calls `xmldiff.main.diff_files()` to compare this temporary file with our static file in the test data.
The test function code is very simple, and the test data can look pretty.
Moreover, we don't need to import `lxml`.

```python
import os
import xml.etree.ElementTree as ET
import xmldiff.main

# DATADIR and temporary_filename are defined elsewhere in the test suite
def test_group_to_xml_element(group_selinux):
    group_el = group_selinux.to_xml_element()
    with temporary_filename() as real:
        ET.ElementTree(group_el).write(real)
        expected = os.path.join(DATADIR, "selinux.xml")
        diff = xmldiff.main.diff_files(real, expected)
        assert diff == []
```

Note: `temporary_filename` is a context manager that gives us a temporary file name.
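
The helper itself is not shown in this post. A minimal sketch of such a context manager, using only the standard library, could look like the following; the `suffix` parameter and the cleanup behavior are assumptions, not necessarily how our helper is implemented.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_filename(suffix=".xml"):
    """Yield the path of a fresh temporary file and delete it afterwards."""
    # mkstemp creates the file and returns an open descriptor plus its path.
    fd, path = tempfile.mkstemp(suffix=suffix)
    # Only the name is needed; the caller opens the file by path.
    os.close(fd)
    try:
        yield path
    finally:
        # Clean up even if the test body raised an exception.
        if os.path.exists(path):
            os.remove(path)
```

With a helper like this, the temporary file is removed even when an assertion inside the `with` block fails.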

## Working with namespaces

One of our methods transforms a given XML tree into a different XML tree that differs in a couple of attributes and values, while the rest of the tree stays the same.
So we compared the input of this method with its output using `xmldiff`, and we got the diff in the form of an edit script.
Then we had to figure out how to write an assert that this edit script is the expected one.
In other words, we had to verify that `xmldiff` has given the expected diff.
We found that the items in the diff are Python `namedtuple`s and that we can easily create our own `namedtuple`s in the code and then check whether they're present in the diff.

These tuples describe the element using XPath. However, the XPath uses the namespace prefix.
We were afraid that this prefix could easily change.
Using the prefix without any mapping is not the way one normally works with namespaces.
But there was no way to provide a correct XPath with the namespaces, and the documentation doesn't mention how to do that.
So we created a workaround that accesses the namespace map of the "new" `lxml` tree, creates a reverse mapping, and saves the actual prefix to a variable.

In the following example, we test that the two `lxml` trees differ in exactly one thing, namely the value of the `id` attribute on the `definition` element, where the `definition` element belongs to the `"http://oval.mitre.org/XMLSchema/oval-definitions-5"` namespace.

```python
import xmldiff.actions
import xmldiff.main


def test_foo(old, new):
    # create an inverted namespace map from new.nsmap
    # the inverted map maps namespace URIs to prefixes
    inverted_new_nsmap = {v: k for k, v in new.nsmap.items()}
    # take the actually used prefix of the namespace
    prefix = inverted_new_nsmap["http://oval.mitre.org/XMLSchema/oval-definitions-5"]
    # perform the diff
    diff = set(xmldiff.main.diff_trees(old, new))
    # craft the expected value, use the prefix variable in the XPath expression
    action1 = xmldiff.actions.UpdateAttrib(
        node=f'/{prefix}:oval_definitions/{prefix}:definitions/{prefix}:definition[1]',
        name='id',
        value='oval:ssg-kerberos_disable_no_keytab:def:1')
    # assert that the expected value is in the diff
    assert action1 in diff
    # assert that no other value than the expected value is in the diff
    diff.remove(action1)
    assert diff == set()
```
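
The reverse lookup can be factored into a small helper when several tests need it. The sketch below uses a plain dictionary as a stand-in for `new.nsmap`; the helper names `prefix_for()` and `definition_xpath()` and the sample prefixes are hypothetical.

```python
OVAL_NS = "http://oval.mitre.org/XMLSchema/oval-definitions-5"


def prefix_for(nsmap, uri):
    """Return the prefix that nsmap assigns to the namespace URI."""
    inverted = {ns_uri: prefix for prefix, ns_uri in nsmap.items()}
    return inverted[uri]


def definition_xpath(prefix, index=1):
    """Build the XPath of the index-th definition using the given prefix."""
    steps = ["oval_definitions", "definitions", f"definition[{index}]"]
    return "".join(f"/{prefix}:{step}" for step in steps)


# A plain dict standing in for new.nsmap; the prefixes are made up.
sample_nsmap = {
    "ns0": OVAL_NS,
    "ns1": "http://oval.mitre.org/XMLSchema/oval-common-5",
}
prefix = prefix_for(sample_nsmap, OVAL_NS)
print(definition_xpath(prefix))
```

A namespace that is missing from the map raises `KeyError`, so the test fails loudly instead of silently comparing against a wrong XPath.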

## Conditional imports

Another problem that we faced is that we wanted to run the `xmldiff`-based tests in our upstream and downstream CI.
Unfortunately, we discovered that the library isn't available as an RPM, neither in Fedora nor in RHEL.
It's available only on PyPI.
That means we can't execute the tests in some of our test environments.
But we still wanted to run the tests in the environments where `xmldiff` is available and, at the same time, not disable all the unit tests on the other systems.
Fortunately, `pytest` has a very elegant function, `importorskip()`, that skips the test case when a module isn't available and still runs the other test cases.

We have used this function in every test function where we use `xmldiff`:

```python
import pytest


def test_foo():
    # skip this test case when xmldiff isn't installed
    xmldiff_main = pytest.importorskip("xmldiff.main")
    diff = xmldiff_main.diff_files(real_file_path, expected_file_path)
```
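
The same idea can be expressed without `pytest`. The following sketch is not what our tests use; it assumes a `unittest`-based test suite and checks for the package with `importlib.util.find_spec()`. The class and test names are hypothetical.

```python
import importlib.util
import unittest

# True only when the xmldiff package can be imported in this environment.
HAVE_XMLDIFF = importlib.util.find_spec("xmldiff") is not None


class XmlOutputTest(unittest.TestCase):
    @unittest.skipUnless(HAVE_XMLDIFF, "xmldiff is not installed")
    def test_identical_documents(self):
        # Import lazily so the module loads even without xmldiff.
        import xmldiff.main

        diff = xmldiff.main.diff_texts("<a><b/></a>", "<a><b/></a>")
        self.assertEqual(diff, [])

    def test_that_does_not_need_xmldiff(self):
        # Tests without the dependency keep running everywhere.
        self.assertEqual("<a/>".count("a"), 1)
```

In both variants, the test that needs `xmldiff` is reported as skipped rather than failed, and the remaining tests in the module still run.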

## Conclusion

The `xmldiff` library is a very useful tool for comparing XML documents and writing unit tests for Python code that works with XML.
We have successfully introduced multiple unit tests that leverage `xmldiff` in our project.
If you are curious about the full code, take a look, for example, at [test_build_yaml](https://github.com/ComplianceAsCode/content/blob/master/tests/unit/ssg-module/test_build_yaml.py).

However, for wider adoption in our project, we will need to make the `xmldiff` package available in Fedora and other Linux distributions.
