
Automated Privacy Analysis

How we built an engine that browses websites like a real browser, systematically capturing cookies, external connections, and GDPR violations.

Juliamarie Curto

If you want to check a website for privacy compliance, you'll typically reach for one of the many online scanners. Enter a URL, wait a few seconds, get a list of cookies. The problem: that list is almost always incomplete.

The reason lies in how modern websites work. A static HTTP request — which is what most scanners do — only sees what the server delivers directly. But the majority of cookies and external connections are only created through JavaScript execution in the browser. Google Analytics doesn't set its cookies via HTTP headers; it does so through a JavaScript file that first needs to be loaded, parsed, and executed. A static scanner sees none of this.
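The gap can be illustrated with a simplified diff: given the request log of a static single-request scan and the log of a fully rendered visit, the third-party hosts that appear only in the rendered log are exactly what a static scanner misses. The helper and the example request lists below are illustrative, not our engine's actual code.

```javascript
// Hypothetical helper: which third-party hosts appear in a captured
// request log. Subdomains of the scanned site count as first-party.
function thirdPartyHosts(requestUrls, siteHost) {
  const hosts = new Set();
  for (const u of requestUrls) {
    const host = new URL(u).hostname;
    if (host !== siteHost && !host.endsWith("." + siteHost)) hosts.add(host);
  }
  return [...hosts].sort();
}

// A static scan sees only the document itself ...
const staticScan = ["https://example.org/"];
// ... while a rendered visit also captures requests triggered by JavaScript,
// e.g. the Analytics loader and the collect beacon it fires afterwards.
const renderedScan = [
  "https://example.org/",
  "https://www.googletagmanager.com/gtag/js?id=G-XXXX",
  "https://region1.google-analytics.com/g/collect?v=2",
];

thirdPartyHosts(staticScan, "example.org");   // -> []
thirdPartyHosts(renderedScan, "example.org"); // -> the two Google hosts
```

The difference between the two results is the tracking surface that only becomes visible once JavaScript actually runs.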

We wanted to find out what actually happens when someone visits a website. Not on one website. On thousands.

The problem with the homepage

Most analysis tools only check the homepage. That sounds reasonable, but it misses a crucial point: many websites only load privacy-relevant services on subpages. A contact form embeds Google reCAPTCHA. A directions page loads Google Maps. An embedded video on a subpage activates YouTube tracking.

In our analysis of German parliament members' websites, this was particularly evident in the case of Josef Rief: only after analyzing over 200 subpages did we find Facebook integrations that forwarded personal data. A homepage-only scan would have classified his website as unremarkable.

Our engine therefore navigates across configurable page depths. It identifies internal links, follows them, and records for each individual subpage which cookies are newly set and which external connections are established. In some analyses, the engine visits over a thousand subpages of a single website.
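At its core, the crawl is a breadth-first traversal with a depth limit. The sketch below uses assumed names; fetchLinks stands in for the real browser step that loads a page, records its cookies and external connections, and extracts its internal links.

```javascript
// Simplified crawl loop: visit pages breadth-first up to `maxDepth`
// levels below the start page, never visiting a URL twice.
async function crawl(startUrl, maxDepth, fetchLinks) {
  const visited = new Set([startUrl]);
  let frontier = [startUrl];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next = [];
    for (const url of frontier) {
      for (const link of await fetchLinks(url)) {
        if (!visited.has(link)) {
          visited.add(link);
          next.push(link); // in the real engine: record per-page cookies and requests here
        }
      }
    }
    frontier = next;
  }
  return [...visited];
}
```

With the depth as a configuration knob, the same loop covers a quick homepage-plus-one-level scan and the deep runs that visit over a thousand subpages.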

Two browsers, different results

While developing the engine, we formed a hypothesis: different browser engines set cookies differently. To verify this, we ran two engines in parallel on the same websites: Chromium-based Puppeteer, and Playwright as a multi-engine solution.

The hypothesis was confirmed. Systematic comparative tests across several hundred websites showed that certain tracking mechanisms are browser-specific. What gets set in Chromium is not necessarily identical in another engine. Existing tools like Cookiebot or OneTrust use only a single browser engine and therefore cannot detect these differences.

The naming problem

A Google Analytics cookie isn't simply called _ga. It's called _ga_A1B2C3D4E5, where the suffix is unique to each website. Matomo generates _pk_id.7.a4c3, Hotjar sets _hjSessionUser_3218564. The same cookie type appears under a different name on every website.

For a single website, this is irrelevant. For a systematic analysis of thousands of websites, it's a fundamental problem. Without normalization, every cookie would be unique. You couldn't say "this cookie is Google Analytics" — you'd just have a list of individual strings.

Through systematic analysis of thousands of cookie names, we identified patterns that make it possible to separate the dynamic tracking ID component from the semantic cookie name. For the most common analytics and marketing services, we developed parsing rules and refined them iteratively:

  • _ga_XXXXXXXXXX -> _ga_* (Google Analytics)
  • _pk_id.X.XXXX -> _pk_id.* (Matomo)
  • _hjSessionUser_XXXXXX -> _hjSessionUser_* (Hotjar)
  • mp_XXXXXXXXXX_mixpanel -> mp_*_mixpanel (Mixpanel)
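Expressed as code, such rules are essentially anchored regular expressions mapping concrete names to canonical types. This is an illustrative subset, not the full rule set:

```javascript
// A few of the parsing rules as regex rewrites. Each pattern separates the
// dynamic tracking-ID component from the semantic cookie name.
const COOKIE_RULES = [
  [/^_ga_[A-Z0-9]+$/, "_ga_*"],                 // Google Analytics 4
  [/^_pk_id\.\d+\.[0-9a-f]+$/, "_pk_id.*"],     // Matomo
  [/^_hjSessionUser_\d+$/, "_hjSessionUser_*"], // Hotjar
  [/^mp_[0-9a-f]+_mixpanel$/, "mp_*_mixpanel"], // Mixpanel
];

function normalizeCookieName(name) {
  for (const [pattern, canonical] of COOKIE_RULES) {
    if (pattern.test(name)) return canonical;
  }
  return name; // unknown names pass through unchanged
}
```

Anchoring the patterns (^...$) matters: without it, a rule like the Google Analytics one would also swallow unrelated names that merely contain the prefix.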

The result: from over 10,000 individual cookie instances, we were able to extract the actual cookie types. When a new website is scanned and a known cookie type is recognized, a classification can be made immediately — regardless of the individual configuration of the website in question.

What HTTP headers reveal

In addition to cookies and JavaScript behavior, the engine systematically captures HTTP response headers. These often reveal more about a website's technological infrastructure than its operators are aware of.

The Server header reveals whether Apache, nginx, or another web server is running. X-Powered-By discloses the backend technology. CF-Cache-Status indicates that Cloudflare is in use. X-Akamai-Cache points to Akamai as a CDN. Via headers expose reverse proxies.

This technology detection provides context for the privacy assessment. When a website uses Cloudflare as a CDN, all requests are routed through Cloudflare servers — a circumstance that can be relevant for GDPR assessment but is not captured by pure cookie scanners.
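A minimal sketch of this header-based detection, using the headers named above (the rules are simplified and the output format is ours, not the engine's):

```javascript
// Map response headers to the technologies they reveal. Header names are
// matched lowercase, as Node's HTTP APIs normalize them.
const HEADER_SIGNALS = [
  ["server",          (v) => `Web server: ${v}`],
  ["x-powered-by",    (v) => `Backend: ${v}`],
  ["cf-cache-status", () => "CDN: Cloudflare"],
  ["x-akamai-cache",  () => "CDN: Akamai"],
  ["via",             (v) => `Proxy: ${v}`],
];

function detectTechnologies(headers) {
  const found = [];
  for (const [name, describe] of HEADER_SIGNALS) {
    const value = headers[name];
    if (value !== undefined) found.push(describe(value));
  }
  return found;
}
```

Feeding in the headers of a Cloudflare-fronted nginx site would yield both the web server and the CDN, which is exactly the infrastructure context the privacy assessment needs.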

The knowledge database

Detecting cookies and external connections is only half the work. The other half: understanding what they mean.

We built a knowledge database in which over 8,000 cookie types and more than 10,000 external URLs are systematically recorded and classified. Each entry contains the purpose (session management, analytics, marketing, personalization), the provider and their data processing practices, the storage duration, and the GDPR assessment: is this cookie permissible without consent (Art. 6(1)(f) — legitimate interest) or is consent required (Art. 6(1)(a))?

The classification is not trivial. Many cookies are poorly documented or not documented at all. Their function had to be determined through reverse engineering, traffic analysis, and systematic comparisons. Existing public cookie databases like Cookiepedia contain descriptions but offer no systematic GDPR classification distinguishing between "permissible without consent" and "consent required."

The database links three entities: websites, cookies, and external URLs. A cookie is linked to the external services through which it is set. External URLs are mapped to the technologies and scripts through which they are embedded. This creates a three-dimensional picture: which cookie originates from which service, embedded through which CDN — and how is each of these layers to be assessed under data protection law?
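A classification lookup against such a knowledge base might look like the following sketch. The schema, field names, and entries are assumptions for illustration, not the real database:

```javascript
// Normalized cookie types map to purpose, provider, and GDPR legal basis,
// so a cookie found on a newly scanned site can be classified immediately.
const COOKIE_DB = {
  "_ga_*": {
    provider: "Google",
    purpose: "analytics",
    legalBasis: "Art. 6(1)(a) - consent required",
  },
  "PHPSESSID": {
    provider: "first party",
    purpose: "session management",
    legalBasis: "Art. 6(1)(f) - legitimate interest",
  },
};

function classifyCookie(normalizedName) {
  return COOKIE_DB[normalizedName] ?? { purpose: "unknown", legalBasis: "unclassified" };
}
```

The lookup key is the normalized type, which is why the name normalization described earlier is a precondition for the database working at all.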

Scaling

A single website scan takes varying amounts of time depending on scope. Small websites are captured in minutes; complex sites with many subpages keep the engine busy for hours. To analyze thousands of websites within a reasonable timeframe, a priority-based queuing system was necessary.

Up to six websites are processed in parallel. Jobs are persistently stored and automatically restarted on failure. Hanging browser processes — an unavoidable problem with automated browser control — are terminated after twelve hours and marked as failed. Added to this are timeout handling for unresponsive servers, detection of HTTPS misconfigurations, resolution of redirect chains, and handling of Cloudflare Turnstile CAPTCHAs, where an automated scan inevitably fails.
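The queuing logic can be sketched as a small worker pool: a fixed number of workers drain a priority-ordered job list and re-enqueue failed jobs up to a retry limit. Persistence and the twelve-hour kill timeout are omitted here, and all names are ours:

```javascript
// Minimal priority queue with a worker pool and retry-on-failure.
// Each job is { id, priority, run } where run() returns a Promise.
async function runQueue(jobs, { concurrency = 6, maxRetries = 1 } = {}) {
  const pending = [...jobs].sort((a, b) => b.priority - a.priority);
  const results = [];

  async function worker() {
    while (pending.length > 0) {
      const job = pending.shift();
      try {
        results.push({ id: job.id, value: await job.run() });
      } catch (err) {
        job.attempts = (job.attempts || 0) + 1;
        if (job.attempts <= maxRetries) {
          pending.push(job); // restart failed job
        } else {
          results.push({ id: job.id, error: String(err) });
        }
      }
    }
  }

  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```

Because JavaScript is single-threaded between awaits, shifting from the shared pending array is safe here; in the real system, the equivalent coordination across persisted jobs is where the race conditions mentioned below had to be avoided.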

The technical challenge lay less in the individual components than in their interplay: avoiding race conditions during parallel write operations, cleanly terminating hanging browser processes, distinguishing network errors from actual website problems.

In total, we have analyzed over 12,000 websites with this system so far.

What we found

The results of our analyses to date are documented in separate articles.

In our analysis of 1,536 websites of Viennese doctors, 49% of websites set cookies on the very first page load without consent. The Google Analytics cookie _ga was found on 187 websites without prior consent. 655 websites loaded Google Fonts from external servers. For medical websites, this is particularly sensitive: merely visiting a specialist's website can allow inferences about one's health condition — and when Google Analytics runs without consent, that information goes straight to Google.

In our investigation of all German parliament members' websites, 513 of 709 members (72%) violated the GDPR — the very regulation they themselves were responsible for enacting into national law. Only 196 designed their websites in compliance with data protection rules.

The recurring pattern across all analyses: Google services dominate. Google Fonts, Google Analytics, Google Maps, Google Tag Manager, Google reCAPTCHA. In the Vienna doctors' analysis, Google services occupied the top seven positions among external resources most frequently loaded without consent.

Structural causes

The causes rarely lie in malicious intent. Web agencies implement default configurations that integrate Google Analytics without considering the data protection consequences. Website builders like Wix set their own cookies without the operator having any influence — as the German Federal Commissioner for Data Protection noted in his 2022 activity report: new offerings are often initially problematic from a data protection perspective because they are frequently created using pre-built template systems that inherently use unnecessary cookies and integrate external services.[1]

WordPress plugins load external resources that the operator never consciously integrated. And cookie banners are displayed but are often technically ineffective — the cookies are set regardless of the user's decision. None of this can be captured with a single manual scan. It only becomes visible through systematic, automated analysis at scale.

Footnotes

  1. "It should be noted that new offerings are often initially problematic from a data protection perspective because they are frequently created using pre-built template systems that not infrequently use unnecessary cookies and integrate external services." — Prof. Ulrich Kelber: Activity Report for Data Protection and Freedom of Information 2022