A Supplement to Chapter 4
A.1 Definitions 1: Graphs, nodes, links
We recall that a graph is formally defined as a pair $G=(N,L)$, where $N$ is a set whose elements are the nodes (also called vertices or points) and $L$ is a set of links (also called edges), which are ordered pairs of distinct nodes. $L$ is a subset of the set of all possible links between nodes, where in our case a node cannot be linked to itself: $L \subseteq \{(x,y) \mid (x,y) \in N^2 \land x \neq y\}$. For a link $(x,y)$, $x$ and $y$ are called the endpoints of the link. Although each link is written as an ordered pair, links in our approach are treated as undirected; a link represents an occurrence of $x$ in $y$. As a shorthand we write the link between $x$ and $y$ as $l_{xy}$. For example, the link $l_{xy}$ can represent the occurrence of a category ‘date of birth’ (node $x$) in the document group ‘Eurodac’ (node $y$). All categories present in the document group have corresponding nodes and links in the graph. In fact, we can treat each data model (i.e., document group) as a separate graph. The overall graph is then in effect the combination of the different data models. With each data model represented as a separate graph $G_i$, the combined graph is the disjoint union of these graphs: $G = \bigcup_{i \in I} G_i$.
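As a minimal sketch of this definition in Python, a graph can be held as a pair of sets, and the combined graph can be built by tagging every node with the name of its data model so that the node sets stay disjoint. The helper names `make_graph` and `disjoint_union` and the second document group `ExampleGroup` are assumptions made for illustration and are not part of the original model.

```python
def make_graph(nodes, links):
    """Return the pair (N, L), rejecting self-links and unknown endpoints."""
    N = frozenset(nodes)
    L = frozenset(links)
    for x, y in L:
        if x == y or x not in N or y not in N:
            raise ValueError(f"invalid link: {(x, y)!r}")
    return N, L


def disjoint_union(graphs):
    """Combine graphs given as {model_name: (N, L)} into one graph.

    Each node is tagged with its model name, so equal labels from
    different data models remain distinct nodes.
    """
    N, L = set(), set()
    for model, (nodes, links) in graphs.items():
        N |= {(model, n) for n in nodes}
        L |= {((model, x), (model, y)) for x, y in links}
    return frozenset(N), frozenset(L)


# The link l_xy: category 'date of birth' (x) occurs in document group 'Eurodac' (y).
eurodac = make_graph({"date of birth", "Eurodac"}, {("date of birth", "Eurodac")})
# 'ExampleGroup' is a hypothetical second data model.
example = make_graph({"date of birth", "ExampleGroup"}, {("date of birth", "ExampleGroup")})

combined_nodes, combined_links = disjoint_union({"eurodac": eurodac, "example": example})
```

In this sketch the combined graph has four nodes and two links, because the label ‘date of birth’ is kept once per data model.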
A.2 Definitions 2: Attributes
In our graph model, nodes are objects composed of attributes that hold the metadata of the node. An attribute $a$ of a node $n$ is written $n.a$. The most important metadata kept for a node are $n.\mathit{name}$ and $n.\mathit{type}$, where $\mathit{name}$ is the natural-language label of the node. The attribute $\mathit{type}$ can take only a limited set of values: $\mathit{type} \in \{\mathit{category}, \mathit{categoryValue}, \mathit{codeGroup}, \mathit{document}, \mathit{documentGroup}\}$.
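One possible way to express nodes with attributes in Python is a small data class whose `type` field is restricted to an enumeration. The class names `Node` and `NodeType` are assumptions of this sketch; only the attribute names and the five permitted values come from the definition above.

```python
from dataclasses import dataclass
from enum import Enum


class NodeType(Enum):
    """The permitted values of the attribute n.type."""
    CATEGORY = "category"
    CATEGORY_VALUE = "categoryValue"
    CODE_GROUP = "codeGroup"
    DOCUMENT = "document"
    DOCUMENT_GROUP = "documentGroup"


@dataclass(frozen=True)
class Node:
    name: str        # natural-language label, n.name
    type: NodeType   # n.type

    def __post_init__(self):
        # Dataclasses do not enforce annotations at runtime, so check explicitly.
        if not isinstance(self.type, NodeType):
            raise TypeError(f"type must be a NodeType, got {self.type!r}")


x = Node(name="date of birth", type=NodeType("category"))
y = Node(name="Eurodac", type=NodeType.DOCUMENT_GROUP)
```

Because the class is frozen, nodes are hashable and can be used directly as members of the sets $N$ and $L$.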
A.3 Definitions 3: Graph drawing
A drawing of a graph $G=(N,L)$ is a collection of points in a two-dimensional space. Each point $p_i = (x_i, y_i)$ is the position of the node $n_i$ in the layout. Whenever there exists a link $(n_i,n_j) \in L$, a line is drawn between the points $p_i$ and $p_j$. The task of the layout algorithm is to find a positioning of the points such that specific criteria are optimally met. Examples of commonly used criteria are: nodes should not overlap, neighbouring nodes should be grouped together, and the number of crossing links should be minimised. Each algorithm and set of criteria has its own benefits and drawbacks.
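To make the notions of a layout and a layout criterion concrete, the following sketch places nodes on a unit circle and counts crossing links. The circular layout is chosen here only because it is simple; it is not claimed to be the layout algorithm used in Chapter 4, and all function names are illustrative.

```python
import math


def circular_layout(nodes):
    """Map each node to a point (x, y) on the unit circle, in sorted order."""
    ordered = sorted(nodes)
    k = len(ordered)
    return {
        n: (math.cos(2 * math.pi * i / k), math.sin(2 * math.pi * i / k))
        for i, n in enumerate(ordered)
    }


def _orient(p, q, r):
    """Sign of the turn p -> q -> r (positive means counter-clockwise)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(a, b, c, d):
    """True if segments ab and cd intersect at an interior point of both."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def count_crossings(pos, links):
    """Count pairs of links whose drawn lines cross."""
    links = list(links)
    count = 0
    for i in range(len(links)):
        for j in range(i + 1, len(links)):
            (u, v), (w, z) = links[i], links[j]
            if len({u, v, w, z}) < 4:
                continue  # links sharing an endpoint are not counted as crossing
            if segments_cross(pos[u], pos[v], pos[w], pos[z]):
                count += 1
    return count


pos = circular_layout({"a", "b", "c", "d"})
crossing = count_crossings(pos, {("a", "c"), ("b", "d")})      # the two diagonals
non_crossing = count_crossings(pos, {("a", "b"), ("c", "d")})  # two opposite sides
```

A layout algorithm that minimises crossings would search over positions `pos` to reduce the value returned by `count_crossings`.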
A.4 Definitions 4: Degree & neighbourhood
For a node $n_i$, the degree is defined as the number of links the node has: $\deg(n_i) = |\{n_j : l_{ij} \in L\}|$. The set of linked nodes is called the neighbourhood of a node. The neighbourhood $H_i$ of a node $n_i$ is defined as $H_i = \{n_j : l_{ij} \in L \lor l_{ji} \in L\}$.
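Both definitions translate directly into set comprehensions. The sketch below assumes that links are matched in either order, as in the neighbourhood definition, so that the degree equals the size of the neighbourhood; the category ‘fingerprint’ is a hypothetical example node.

```python
def neighbourhood(node, links):
    """H_i: all nodes joined to `node` by a link, in either order."""
    return ({y for x, y in links if x == node}
            | {x for x, y in links if y == node})


def degree(node, links):
    """deg(n_i): number of links of `node` (assumes at most one link per node pair)."""
    return len(neighbourhood(node, links))


links = {
    ("date of birth", "Eurodac"),
    ("fingerprint", "Eurodac"),  # hypothetical category
}

eurodac_neighbours = neighbourhood("Eurodac", links)
eurodac_degree = degree("Eurodac", links)
```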
A.5 Definitions 5: Betweenness centrality
The betweenness centrality of a node $n$ is defined as $bc(n) = \sum_{s \neq n \neq t} \frac{\sigma_{st}(n)}{\sigma_{st}}$, where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(n)$ is the number of those paths that pass through $n$. A path is a sequence of nodes in which each pair of consecutive nodes is linked. A shortest path between two nodes $s$ and $t$ is a path that traverses the smallest number of nodes. The equation for betweenness centrality takes into account that there may be several shortest paths from $s$ to $t$, with only some of them passing through $n$.
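The definition can be evaluated directly with breadth-first search, as in the sketch below. It relies on the fact that a shortest path from $s$ to $t$ passes through $n$ exactly when $d(s,n) + d(n,t) = d(s,t)$, in which case $\sigma_{st}(n) = \sigma_{sn}\,\sigma_{nt}$. Two assumptions are made here: the sum runs over ordered pairs $(s,t)$, so each undirected pair is counted twice, and pairs with no connecting path are skipped. This is a straightforward cubic-time illustration, not an optimised algorithm.

```python
from collections import deque


def bfs_counts(adj, source):
    """Return (dist, sigma): shortest-path lengths and shortest-path counts from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:            # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u lies on a shortest path to v
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(nodes, links):
    """bc(n) for every node, summing sigma_st(n) / sigma_st over ordered pairs (s, t)."""
    adj = {n: set() for n in nodes}
    for x, y in links:                   # links are undirected
        adj[x].add(y)
        adj[y].add(x)
    info = {n: bfs_counts(adj, n) for n in nodes}
    bc = {n: 0.0 for n in nodes}
    for s in nodes:
        dist_s, sigma_s = info[s]
        for t in nodes:
            if t == s or t not in dist_s:
                continue                 # same node, or t unreachable from s
            for n in nodes:
                if n == s or n == t or n not in dist_s:
                    continue
                dist_n, sigma_n = info[n]
                if t in dist_n and dist_s[n] + dist_n[t] == dist_s[t]:
                    bc[n] += sigma_s[n] * sigma_n[t] / sigma_s[t]
    return bc


# A path a - b - c: every shortest path between a and c passes through b.
path_bc = betweenness({"a", "b", "c"}, {("a", "b"), ("b", "c")})
# A 4-cycle a - b - c - d - a: a and c are joined by two shortest paths, one through b.
cycle_bc = betweenness({"a", "b", "c", "d"},
                       {("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")})
```

In the path example `b` obtains the value 2.0 (one for each ordering of the pair), and in the cycle each node obtains 1.0, since it carries half of the shortest paths between its two neighbours in each direction.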
A.6 Definitions 6: Presence
The presence of all categories in a document group node $n_x$ is the set of all category nodes linked to it: $\mathit{Categories}(x) = \{n_y \in N : (l_{xy} \in L \lor l_{yx} \in L) \land n_y.\mathit{type} = \mathit{category}\}$. The presence of a category $n_x$ in document groups is the set of nodes of the type document group for which there exists a link between this category and the document group. It is formally defined as $\mathit{Presence}(x, \mathit{documentGroup}) = \{n_y \in N : l_{xy} \in L \land n_y.\mathit{type} = \mathit{documentGroup}\}$.
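Both sets can be computed by filtering the neighbours of a node on their type, as the following sketch shows. It assumes that a category shared by several document groups is a single node, and it matches links in either order for both functions. The document group `ExampleGroup` and the category ‘fingerprint’ are hypothetical.

```python
node_type = {
    "Eurodac": "documentGroup",
    "ExampleGroup": "documentGroup",   # hypothetical document group
    "date of birth": "category",
    "fingerprint": "category",         # hypothetical category
}

links = {
    ("date of birth", "Eurodac"),
    ("fingerprint", "Eurodac"),
    ("date of birth", "ExampleGroup"),
}


def neighbours(x, links):
    """All nodes linked to x, in either order."""
    return {b for a, b in links if a == x} | {a for a, b in links if b == x}


def categories(doc_group, node_type, links):
    """Categories(x): category nodes linked to the document group x."""
    return {y for y in neighbours(doc_group, links) if node_type[y] == "category"}


def presence(category, node_type, links):
    """Presence(x, documentGroup): document groups linked to the category x."""
    return {y for y in neighbours(category, links) if node_type[y] == "documentGroup"}


eurodac_categories = categories("Eurodac", node_type, links)
dob_presence = presence("date of birth", node_type, links)
```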
A.7 Definitions 7: Intersection and difference
The absence of categories between $\mathit{docGroup}_1$ and $\mathit{docGroup}_2$ is the set of categories present in the second document group minus the set of categories present in the first. In our notation: $\mathit{Absence}(\mathit{docGroup}_1, \mathit{docGroup}_2) = \mathit{Categories}(\mathit{docGroup}_2) \setminus \mathit{Categories}(\mathit{docGroup}_1)$. The categories that are common to those same two document groups are determined using the intersection of the sets of categories that are present in each: $\mathit{CommonCodes}(\mathit{docGroup}_1, \mathit{docGroup}_2) = \mathit{Categories}(\mathit{docGroup}_1) \cap \mathit{Categories}(\mathit{docGroup}_2)$. This operation is not limited to two sets. The intersection of $n$ sets can be written as $\bigcap_{i=1}^{n} \mathit{Categories}(\mathit{docGroup}_i)$.
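With the category sets available, absence and common categories reduce to the built-in set difference and intersection. The sketch below takes the category sets as plain Python sets; the set contents and the third group are hypothetical examples.

```python
def absence(categories_1, categories_2):
    """Absence(docGroup1, docGroup2): categories in the second group but not the first."""
    return categories_2 - categories_1


def common_codes(*category_sets):
    """Intersection of the category sets of any number of document groups."""
    if not category_sets:
        return set()
    return set.intersection(*category_sets)


# Hypothetical category sets for three document groups.
group_1 = {"date of birth", "fingerprint"}
group_2 = {"date of birth", "nationality"}
group_3 = {"date of birth", "nationality", "sex"}

missing_in_1 = absence(group_1, group_2)
shared_by_two = common_codes(group_1, group_2)
shared_by_all = common_codes(group_1, group_2, group_3)
```

Note that absence is not symmetric: swapping the arguments gives the categories present in the first group but missing from the second.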