Data Lineage Tracking
Purpose
Map and document the complete lifecycle of personal data through organizational systems, from initial collection point through all transformations, transfers, and storage locations to final deletion or anonymization.
Prerequisites
- Data inventory or asset register identifying systems processing personal data
- Network architecture documentation showing system interconnections
- Access to system metadata, ETL configurations, and API documentation
- Records of Processing Activities (RoPA) under GDPR Article 30
Workflow
Step 1: Define Lineage Scope
Determine scope boundaries for the lineage mapping exercise:
- Data categories: Identify which personal data categories to trace (identity data, contact data, financial data, behavioral data, special category data under Art. 9)
- System boundaries: Define which systems are in-scope (production databases, data warehouses, analytics platforms, third-party SaaS, backup systems)
- Temporal scope: Determine whether to map current-state lineage only or include historical data flows
- Legal basis mapping: Link each processing activity in the lineage to its GDPR Article 6 lawful basis
Step 2: Identify Data Sources (Collection Points)
Document every point where personal data enters the organization:
- Direct collection: Web forms, mobile apps, point-of-sale terminals, customer service interactions, paper forms digitized via scanning
- Indirect collection: Third-party data providers, publicly available sources, data brokers, partner organizations
- Derived data: Data generated through processing (risk scores, customer segments, behavioral profiles)
- Inferred data: Data inferred from other data points (creditworthiness, health predictions, preferences)
For each source, record:
- Source identifier and type
- Data categories collected (referencing Art. 30(1)(c) categories)
- Legal basis under Art. 6(1) and, if applicable, Art. 9(2)
- Information provided to data subjects per Art. 13 (data collected from the data subject) or Art. 14 (data obtained from another source)
- Volume and frequency of collection
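The source record described above could be captured in a small structure with a basic consistency check. This is a minimal sketch, not a prescribed schema: the `CollectionPoint` class, its field names, the `validate` function, and the `SPECIAL_CATEGORIES` labels are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative subset of Art. 9(1) special categories (assumed labels).
SPECIAL_CATEGORIES = {"health", "biometric", "genetic", "religious_belief"}


@dataclass
class CollectionPoint:
    source_id: str
    source_type: str               # "direct" or "indirect" (assumed labels)
    data_categories: List[str]
    art6_basis: str                # e.g. "6(1)(b)"
    art9_condition: Optional[str]  # e.g. "9(2)(a)", or None
    notice_article: str            # "13" or "14"
    records_per_day: int


def validate(cp: CollectionPoint) -> List[str]:
    """Return a list of problems found in one collection-point record."""
    problems = []
    if not cp.art6_basis:
        problems.append("missing Art. 6(1) lawful basis")
    if SPECIAL_CATEGORIES & set(cp.data_categories) and not cp.art9_condition:
        problems.append("special category data without Art. 9(2) condition")
    # Art. 13 applies to data collected from the data subject,
    # Art. 14 to data obtained from another source.
    expected = {"direct": "13", "indirect": "14"}.get(cp.source_type)
    if expected and cp.notice_article != expected:
        problems.append(f"expected Art. {expected} notice")
    return problems


signup = CollectionPoint(
    "web-signup", "direct", ["contact", "health"], "6(1)(b)", None, "13", 500
)
print(validate(signup))
```

A record that passes `validate` is not thereby compliant; the check only catches missing fields that the list above says must be recorded.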
Step 3: Map Data Transformations
Document every transformation applied to personal data:
- ETL processes: Extract-Transform-Load pipelines moving data between systems
- Aggregation: Grouping individual records into summary statistics
- Pseudonymization: Replacing identifiers with tokens per Art. 4(5) and Recital 26
- Anonymization: Irreversible de-identification per WP29 Opinion 05/2014
- Enrichment: Combining data from multiple sources to create enriched profiles
- Format conversion: Changing data formats (CSV to JSON, database migration)
For each transformation, record:
- Input data categories and source system
- Transformation logic description
- Output data categories and destination system
- Whether transformation changes the identifiability of data subjects
- Retention period at destination per Art. 5(1)(e) storage limitation
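As one illustration of a transformation and its record, the sketch below pseudonymizes an identifier with a keyed hash (HMAC-SHA256) and documents the step. Keyed hashing is only one possible pseudonymization technique and is an assumption here, as are the `pseudonymize` function, the `TransformationRecord` fields, and the demo key. The output remains personal data, because whoever holds the key can recompute the token for a known identifier.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import List


def pseudonymize(identifier: str, key: bytes) -> str:
    """Return a keyed HMAC-SHA256 token for the identifier."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


@dataclass
class TransformationRecord:
    name: str
    source_system: str
    destination_system: str
    input_categories: List[str]
    output_categories: List[str]
    changes_identifiability: bool
    retention_days: int


# Demo key only; a real key would be kept separately from the data.
key = b"demo-key-not-for-production"
token = pseudonymize("alice@example.com", key)

record = TransformationRecord(
    name="email_pseudonymization",
    source_system="crm",
    destination_system="analytics_warehouse",
    input_categories=["contact"],
    output_categories=["pseudonymized_contact"],
    changes_identifiability=True,
    retention_days=365,
)
print(len(token), record.changes_identifiability)
```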
Step 4: Document Data Flows and Transfers
Map all movements of personal data between systems and parties:
- Internal flows: Between departments, systems, databases within the organization
- Processor transfers: To data processors under Art. 28 agreements
- Third-country transfers: Cross-border transfers requiring Art. 44-49 safeguards
- Third-party disclosures: To independent controllers (regulators, partners, law enforcement)
For each flow, record:
- Source and destination system/entity
- Transfer mechanism (API, file transfer, database replication, manual export)
- Legal mechanism for international transfers (adequacy decision under Art. 45; SCCs or BCRs under Art. 46-47; Art. 49 derogation)
- Encryption in transit and at rest
- Frequency and volume
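Once flows are recorded with the attributes above, simple checks can surface gaps. The following is a rough sketch under assumed names: `DataFlow`, its fields, and `review_flows` are hypothetical, and whether a destination counts as a third country is taken as a pre-computed flag rather than derived from a country list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataFlow:
    source: str
    destination: str
    mechanism: str                # e.g. "api", "sftp", "replication", "manual"
    third_country: bool           # destination is outside the EEA
    transfer_tool: Optional[str]  # e.g. "adequacy", "scc", "bcr", or None
    encrypted_in_transit: bool


def review_flows(flows: List[DataFlow]) -> List[str]:
    """List flows that lack a transfer mechanism or transport encryption."""
    findings = []
    for f in flows:
        label = f"{f.source} -> {f.destination}"
        if f.third_country and f.transfer_tool is None:
            findings.append(f"{label}: no Art. 44-49 transfer mechanism recorded")
        if not f.encrypted_in_transit:
            findings.append(f"{label}: not encrypted in transit")
    return findings


flows = [
    DataFlow("crm", "warehouse", "replication", False, None, True),
    DataFlow("crm", "us_email_vendor", "api", True, None, True),
    DataFlow("hris", "payroll_export", "manual", False, None, False),
]
for finding in review_flows(flows):
    print(finding)
```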
Step 5: Map Data Storage and Retention
Document where personal data resides at each stage:
- Primary storage: Production databases, CRM systems, HRIS
- Secondary storage: Data warehouses, analytics databases, reporting systems
- Archival storage: Long-term archives, cold storage, compliance archives
- Backup storage: Disaster recovery systems, backup tapes, cloud backup
- Temporary storage: Caches, message queues, log files, session storage
For each storage location, record:
- Storage technology and location (on-premises, cloud region)
- Retention period and legal basis for retention
- Access controls and encryption
- Deletion or anonymization mechanism at end of retention period
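The retention attributes recorded for each storage location can drive a simple overdue check. This sketch assumes each location is a dictionary with hypothetical keys `name`, `oldest_record`, and `retention_days`; real systems would obtain the oldest record date from the store itself.

```python
from datetime import date, timedelta
from typing import Dict, List


def deletion_due(oldest_record: date, retention_days: int) -> date:
    """Date by which the oldest record should be deleted or anonymized."""
    return oldest_record + timedelta(days=retention_days)


def overdue_locations(locations: List[Dict], today: date) -> List[str]:
    """Names of storage locations holding data past its retention period."""
    return [
        loc["name"]
        for loc in locations
        if deletion_due(loc["oldest_record"], loc["retention_days"]) < today
    ]


locations = [
    {"name": "crm_db", "oldest_record": date(2020, 1, 1), "retention_days": 365},
    {"name": "session_cache", "oldest_record": date(2023, 12, 1), "retention_days": 90},
]
print(overdue_locations(locations, today=date(2024, 1, 1)))
```

Backup and archival locations often need the same check, since data deleted from primary storage can persist there beyond the stated retention period.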
Step 6: Implement Automated Lineage Discovery
Deploy tooling to automate lineage tracking:
- Database-level lineage: Query log analysis, column-level lineage from SQL parsing
- Application-level lineage: API call tracing, service mesh observability
- Pipeline-level lineage: ETL tool metadata (Apache Airflow lineage backend, dbt documentation)
- Infrastructure-level lineage: Network flow logs, data lake audit trails
Use the scripts/process.py helper to parse system metadata and generate lineage graphs.
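To make the idea of a lineage graph concrete, the sketch below builds a directed graph from source-to-destination edges and walks it to find every system downstream of a given one. It does not reflect the interface of scripts/process.py, which is not described here; `build_graph`, `downstream`, and the system names are assumptions for illustration.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an adjacency map from (source, destination) pairs."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
    return graph


def downstream(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """Return all systems reachable from start (breadth-first, cycle-safe)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


edges = [
    ("web_form", "crm"),
    ("crm", "warehouse"),
    ("warehouse", "analytics"),
    ("crm", "backup"),
]
graph = build_graph(edges)
print(sorted(downstream(graph, "crm")))  # ['analytics', 'backup', 'warehouse']
```

In practice the edge list would come from the automated sources listed above (query logs, pipeline metadata, flow logs) rather than being written by hand.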
Step 7: Integrate with RoPA and Compliance
Link lineage data to GDPR compliance documentation:
- Art. 30 RoPA: Each lineage path should map to a processing activity in the RoPA
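- DPIA triggers: Flag lineage paths involving Art. 35(3) processing (systematic and extensive profiling with legal or similarly significant effects, large-scale processing of special categories, large-scale systematic monitoring of publicly accessible areas)
- Data subject rights: Use lineage to locate all data when responding to data subject requests (Art. 15 access, Art. 17 erasure, Art. 20 portability)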
- Breach impact scoping: Use lineage to determine affected data subjects and categories during incident response per Art. 33(3)
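The RoPA link and DPIA flagging described above might be sketched as follows. The boolean attributes are a deliberate simplification of Art. 35(3) and are assumptions, as are the `LineagePath` class and the function names; deciding whether a DPIA is actually required remains a legal assessment, and this only produces candidates for review.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineagePath:
    path_id: str
    ropa_activity: str  # ID of the matching RoPA entry, "" if unmapped
    large_scale: bool
    special_category: bool
    profiling_with_significant_effects: bool
    monitors_public_area: bool


def dpia_triggers(p: LineagePath) -> List[str]:
    """Return the Art. 35(3) limbs this path may fall under (simplified)."""
    triggers = []
    if p.profiling_with_significant_effects:
        triggers.append("35(3)(a)")
    if p.large_scale and p.special_category:
        triggers.append("35(3)(b)")
    if p.large_scale and p.monitors_public_area:
        triggers.append("35(3)(c)")
    return triggers


def unmapped_paths(paths: List[LineagePath]) -> List[str]:
    """Lineage paths with no corresponding RoPA processing activity."""
    return [p.path_id for p in paths if not p.ropa_activity]


paths = [
    LineagePath("signup-to-crm", "PA-001", False, False, False, False),
    LineagePath("claims-to-scoring", "PA-014", True, True, True, False),
    LineagePath("logs-to-lake", "", True, False, False, False),
]
for p in paths:
    print(p.path_id, dpia_triggers(p))
print(unmapped_paths(paths))
```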
Step 8: Maintain and Validate
Establish ongoing lineage maintenance:
- Change management: Update lineage when new systems, data flows, or processing activities are introduced
- Periodic validation: Quarterly review to verify lineage accuracy against actual system behavior
- Stakeholder review: Annual sign-off from data owners, system architects, and DPO
- Completeness check: Cross-reference lineage against data inventory and RoPA to identify gaps
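The completeness check above reduces to set comparisons between the lineage, the data inventory, and the RoPA. A minimal sketch, assuming each is available as a collection of identifiers; the function name, argument names, and sample identifiers are hypothetical.

```python
from typing import Dict, Iterable, List


def find_gaps(
    inventory_systems: Iterable[str],
    lineage_systems: Iterable[str],
    ropa_activities: Iterable[str],
    lineage_activities: Iterable[str],
) -> Dict[str, List[str]]:
    """Compare lineage against the data inventory and the RoPA."""
    inventory, lineage = set(inventory_systems), set(lineage_systems)
    ropa, mapped = set(ropa_activities), set(lineage_activities)
    return {
        "systems_missing_from_lineage": sorted(inventory - lineage),
        "systems_missing_from_inventory": sorted(lineage - inventory),
        "ropa_activities_without_lineage": sorted(ropa - mapped),
        "lineage_activities_not_in_ropa": sorted(mapped - ropa),
    }


gaps = find_gaps(
    inventory_systems=["crm", "warehouse", "hris"],
    lineage_systems=["crm", "warehouse", "marketing_saas"],
    ropa_activities=["PA-001", "PA-002"],
    lineage_activities=["PA-001"],
)
for name, items in gaps.items():
    print(name, items)
```

Running such a comparison as part of the quarterly validation gives reviewers a concrete list of discrepancies to resolve.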
Verification