sdcTools
diff --git a/by-sa.png
17.2 KB b/by-sa.png
diff --git a/chapters/0-Introduction.tex
Lines changed: 33 additions & 0 deletions b/chapters/0-Introduction.tex
@@ -0,0 +1,33 @@
+\chapter{Introduction} \label{sec:intro}
+Users of statistical data are often interested in spatial distribution patterns. A policy-maker may be interested in the distribution of income over the neighborhoods of a city, a health care professional may want to know where infections occur, and a social worker may want to focus on locations that are at high risk for social problems. While these are all examples of valid and relevant needs, they are at odds with confidentiality. When a location is too detailed, e.g.\ an address, or when its neighborhood has few inhabitants, e.g.\ one isolated household, the information displayed at that location is highly disclosive. Thus, effective disclosure control methods are needed to protect publications that involve spatial information. As with any statistical disclosure control (SDC) method, applying it to protect published data will result in a loss of utility. The ambition is to strike a balance between maintaining utility and assuring a low disclosure risk.
+
+Spatial data can be present in statistical output in many ways. Commonly used output formats are
+\begin{itemize}
+    \item Microdata files that contain spatial information; ranging from coarse classifications like NUTS to detailed coordinates of where the individual units reside;
+    \item Tables where one of the spanning variables is location-related; ranging from administrative regions to squared grids;
+    \item Cartographic maps where the distribution of a variable is plotted; ranging from choropleth maps to density plots.
+\end{itemize}
+
+The current guidelines describe some measures to specify the disclosure risks involved with publications including a spatial dimension, statistical disclosure control methods to reduce the risk of disclosure while maintaining (some of) the utility, and ways to assess the amount of utility / information lost in the process.
+
+These guidelines mainly focus on the situation of publishing either tabular data or cartographic maps. They include methods that can be used prior to tabulation or prior to plotting as well as methods that can be used as a means to plot spatial patterns where the statistical disclosure control is intrinsically part of the plotting procedure.\bigskip
+
+We use the term `geo-reference' generically to refer to an attribute that encodes a geographic dimension. Examples include a city identifier, a postal code, a geographic grid cell, or another small area. The attribute `references' (or points to) a real-world area or location. The reference is imperfect, since complex geography must be simplified in order to represent it as a piece of data. The geometric form that represents a land plot or a part of a town is a mere approximation of the real thing. Several such approximations are common in official statistics, among them polygon vector shapes and geographic grids. For an introduction to data types and the geographies they describe see, for instance, \cite{Haining2010} or \cite{AudricEtAl2018}.
+
+\emph{Geo-referenced data}\footnote{
+    Closely related words are `geo-localized' data or `spatial' data. The first usually more specifically refers to point-type data (GPS or other geographic coordinates), while the latter is more generic and may refer to point, line, area, or field type data.} 
+describes any data that includes a geo-reference -- usually together with other non-spatial attributes traditionally collected by National Statistical Institutes. A 2-dimensional table that shows counts of households by (1) type of energy used for heating and (2) Local Administrative Unit (LAU) is an example of geo-referenced data. A choropleth map visualizing the share of people at risk of poverty within 500\,m $\times$ 500\,m grid squares is another. Yet another is a detailed hypercube, containing small aggregates for a whole range of household-level variables, together with a very fine grid reference -- such data may underlie interactive applications, where users can navigate through space and query attributes of interest for selected sub-spaces.
+In many countries, frameworks exist that bring together statistical and geographic information, assuring their compatibility and mutual high quality, see e.g.\ \cite{HaldorsonMostrom2019} and \cite{VanHalderenEtAl2016}.
+
+\subsubsection{Why special guidelines for geo-referenced data?}
+There are at least two reasons for providing guidelines targeted specifically at geo-referenced data. The first reason is related to the identifying character of geo-referenced variables. In general, any variable that relates to the location of a unit can be used as a quasi-identifier. That is, using the location-related variable together with a limited number of other variables, it becomes quite easy to identify individual units. In the case of microdata output, it is obvious that coordinates of dwellings or of company buildings directly identify individual units. In the case of aggregated data represented in tables, combining the location data with some other variables will quickly reveal individual units. Finally, presenting output as a plot on a cartographic map makes it easy to locate and hence identify individual units as well.
+
+A second reason for targeted guidelines is that even though many SDC methods are available for publishing data either as microdata or as aggregated data, these methods in general do not take into account the spatial character of the variables. The current guidelines will present methods that \textit{do} take into account the spatial character. Moreover, some methods can be applied to produce cartographic maps without having to protect the underlying data before plotting: they provide means to directly plot variables on a map using the unprotected underlying data.\bigskip
+
+These guidelines extend and specify the general advice on statistical disclosure control methodology collected in \citet{HundepoolEtAl2024}. Readers who are new to SDC or who seek further information on related methods and concepts may wish to also refer to the more fundamental guidance there.
+Moreover, additional information on general statistical disclosure control methodology targeted at census publication can be found in the \emph{Guidelines for SDC Methods for Census and Demographics Data} \citep{HundepoolEtAl2024}. \bigskip
+
+This document is structured as follows: We begin in Chapter \ref{sec:ident} with some preliminary concepts and considerations concerning the problem of SDC for geo-referenced data. Specific aspects of disclosure risk for such data are introduced in Chapter \ref{sec:risk}, followed by measures of information loss from SDC in Chapter \ref{sec:util}. In Chapter \ref{sec:methods} we suggest a range of methods to accomplish a reduction of disclosure risk, with a special focus on aggregates in geographic grid cells. Finally, several of the concepts and metrics introduced will be applied in an illustrative case study in Chapter \ref{sec:cs_ckm}.\bigskip
+
+Related materials, like the code files for Chapter \ref{sec:cs_ckm}, are supplied in an online repository:
+\url{https://github.com/sdcTools/GeoSpatialGuidelinesRelated}.