In part 1 of this blog, we explained why data privacy was much more than a compliance project, spanning across customer relationship management and data management disciplines. We also introduced the five pillars that companies need to establish to get data ready for GDPR:
All the above goals require organizations to implement a coordinated approach to data management. In the pages below, we outline the pillars businesses need to embrace to establish best practices for data management that will enable them to comply with GDPR.
The first pillar is focused on best practices for taking control of the personal data an organization processes and/or stores within the enterprise. A fast and efficient way to achieve this is to bring all data into a data lake on a converged platform to enable more holistic data processing. Doing so allows companies to not only collect all data that requires attention, but also connect it to a hub where it can be discovered, harmonized, cleansed, protected, governed, and shared safely.
Creating this data lake is a pragmatic milestone for GDPR compliance, especially for protecting and sharing personal data. Once done, organizations can achieve a broader view beyond the data lake by reaching upstream and looking at data sources before they fill the lake, such as the CRM, marketing, and digital systems. This second step would enable them to get an end-to-end view of their information supply chain and ensure data governance, quality, and stewardship at the point of origin, and tackle the most challenging dimensions of GDPR such as the right to be forgotten.
The MapR Data Platform can become that underlying data lake for all personal data. The platform can ingest data from a variety of sources, ranging from mainframes to private and public clouds to edge devices, and can store metadata from files, tables, and streams in the same cluster. It also supports in-place analytics and machine learning. This helps organizations bring all relevant data into one place, giving them a 360-degree view while also deriving business value from the data.
In this approach, it’s critical to capture each bit of personal data together with the information related to consent across any data source and then reconcile the two into a 360° view of each customer’s identity (Figure 1). The challenge is that businesses typically know their customers – or employees – in several different contexts. For example, an airline may hold profile information about the same customer from a Twitter account, in its reservation system, and in its frequent flyer program. It might also track customers’ locations to help them find their way in the airport.
So how can organizations achieve this 360-degree view of a customer’s identity? Talend’s Big Data Platform can help not only with data integration but also with data reconciliation by embedding native data quality components to match disparate data sources, helping businesses understand that John Smith is the same person as firstname.lastname@example.org or @JohnSmith, for example.
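To make the reconciliation idea concrete, here is a minimal Python sketch of matching identifiers that arrive in different shapes (a name, an email address, a social handle). It is not Talend's matching logic: the `normalize` and `same_person` helpers, the 0.85 threshold, and the use of the standard library's `difflib` are all assumptions chosen for illustration. Production matching would typically combine several attributes and more robust algorithms.

```python
import re
from difflib import SequenceMatcher


def normalize(identifier: str) -> str:
    """Reduce a name, email address, or social handle to a comparable key."""
    value = identifier.strip().lower()
    value = value.lstrip("@")            # drop a leading handle marker
    value = value.split("@", 1)[0]       # keep only the local part of an email
    return re.sub(r"[^a-z0-9]", "", value)  # remove spaces, dots, punctuation


def same_person(a: str, b: str, threshold: float = 0.85) -> bool:
    """Hypothetical rule: treat two identifiers as one person when their
    normalized forms are sufficiently similar."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


if __name__ == "__main__":
    print(same_person("John Smith", "john.smith@example.org"))
    print(same_person("John Smith", "@JohnSmith"))
    print(same_person("John Smith", "Jane Doe"))
```

A single fuzzy key like this will produce false matches on common names, which is one reason matched records are usually routed to a data steward for confirmation before they are merged.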
Figure 1: Talend Data Fabric combines data quality, data stewardship, and data integration into a unified platform that helps companies collect, standardize, reconcile, certify, protect and propagate Personally Identifiable Information (PII) data.
Talend Master Data Management (MDM) can also be leveraged on top of the data lake, not only to reconcile the data around a common master data record, but also to enable governance and stewardship for data protection and to safely propagate the data across the required systems. In the context of GDPR, MDM also has particular relevance for managing opt-ins. In GDPR, opt-ins need to be applied across multiple applications. So, businesses need to consider them across a range of areas such as email campaigns and personalized website next-best-offers, as well as other applications such as billing or customer service. All these elements are likely to require different applications to process them – and so MDM will help reconcile, protect, and create an audit trail of personal data in one place (Figure 2) – and then apply it across the different applications.
Figure 2: Talend provides record level lineage with undo/redo capabilities, thereby providing an audit trail for opt-ins and any other data that relates to a data subject.
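The idea of keeping opt-ins in one place with an audit trail can be sketched as an append-only ledger. The Python below is an illustration only and does not reflect how Talend MDM stores consent: `ConsentEvent`, `ConsentLedger`, and the purpose names are hypothetical. The point it shows is that the current opt-in state is derived from the history, so every change remains traceable.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class ConsentEvent:
    """One immutable opt-in or opt-out decision for a given purpose."""
    subject_id: str
    purpose: str          # e.g. "email_campaign", "web_personalization"
    granted: bool
    recorded_at: datetime


@dataclass
class ConsentLedger:
    """Append-only record of consent decisions (hypothetical structure)."""
    events: List[ConsentEvent] = field(default_factory=list)

    def record(self, subject_id: str, purpose: str, granted: bool) -> ConsentEvent:
        event = ConsentEvent(subject_id, purpose, granted,
                             datetime.now(timezone.utc))
        self.events.append(event)  # never update in place: history is the audit trail
        return event

    def is_allowed(self, subject_id: str, purpose: str) -> bool:
        """The latest decision wins; no recorded decision means no consent."""
        for event in reversed(self.events):
            if event.subject_id == subject_id and event.purpose == purpose:
                return event.granted
        return False

    def history(self, subject_id: str) -> List[ConsentEvent]:
        """Full audit trail for one data subject, oldest first."""
        return [e for e in self.events if e.subject_id == subject_id]


if __name__ == "__main__":
    ledger = ConsentLedger()
    ledger.record("c1", "email_campaign", True)
    ledger.record("c1", "email_campaign", False)  # customer later opts out
    print(ledger.is_allowed("c1", "email_campaign"))
    print(len(ledger.history("c1")))
```

Each downstream application (email, website, billing) would query the same ledger for its own purpose rather than keeping a private copy of the opt-in flag.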
The second pillar, data classification and lineage, involves helping businesses define and categorize the data which needs to be accessed, pinpointing where it is located across the system and gauging how that information is related to other relevant information across the system.
MapR Volumes offers an entity for data management to address this pillar. Volumes can be created on a user basis, department basis, project basis, or as required by the business. Volumes can be mirrored for replication purposes between clusters within the same data center, across data centers, or even across on-premises and the cloud. Volumes can also be used to enforce disk usage limits, set replication levels, and establish ownership and accountability.
Talend Data Fabric tightly integrates with those environments to provide data lineage for data flows, highlighting where personal data originates and where it ends up. Additionally, Talend Metadata Manager can draw the information supply chain across any system and beyond the data lake (figure 3). This type of metadata management, in turn, enables almost anyone in the organization to know where the data is by using a business glossary (Figure 4) and also reveals the relevant files or databases within which that data is stored, thereby effectively establishing data lineage.
Figure 3: Talend can automatically harvest data to create your personal inventory and provides an end-to-end view of your entire information chain: you know where your data comes from and where it ends up.
Figure 4: Talend Business Glossary helps you to create your reference, document, and classify your critical data elements with direct links to the datasets you should refer to.
MapR XD offers a wide variety of data protection features: encryption mechanisms, authentication options, and authorization controls using access control lists, as well as a separate model of authorization in the form of access control expressions (ACEs) unique to MapR. In addition, customers can devise granular security policy controls on a chosen volume of personal data. For instance, an administrator could restrict access to EU citizens’ personal data to the finance function for accounting purposes and not let other functions in the company access it, hence avoiding unauthorized use of personal data. MapR XD also offers volume-level snapshots and mirroring for data recovery and protection. Using mirroring, users can replicate volumes for load balancing or disaster recovery purposes. Similarly, a snapshot of a volume at a specific point in time is useful for rollback to a known data set. Volume management and mirroring also offer multi-tenancy and segregation while ensuring data protection.
Data Protection is also about data anonymization. Using semantic discovery capabilities, organizations can automatically discover sensitive data such as credit card numbers or emails within newly loaded data sources. This is an important capability because it alerts organizations about potential data privacy issues – effectively driving them to certain data sources that may require attention for GDPR compliance. They can then ask themselves the question – do I really need to expose this sensitive data in this context?
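As a rough illustration of what automatic discovery of sensitive values involves, the sketch below scans free text for email addresses and for digit runs that pass the Luhn checksum used by payment card numbers. It is a simplified stand-in, not Talend's semantic discovery: the regular expressions and the `find_sensitive` helper are assumptions, and real discovery tools cover many more data types and formats.

```python
import re

# Deliberately simple patterns; both are assumptions for this sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text: str):
    """Return (kind, value) pairs for likely emails and card numbers."""
    findings = [("email", m.group()) for m in EMAIL_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if luhn_valid(digits):  # filters out order numbers, phone numbers, etc.
            findings.append(("credit_card", digits))
    return findings


if __name__ == "__main__":
    sample = "Contact jane@example.org, card 4111 1111 1111 1111, order 1234567890123"
    for kind, value in find_sensitive(sample):
        print(kind, value)
```

The checksum step matters: without it, any long number in a newly loaded source would be flagged, and the alerts would quickly lose their value.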
Applying those techniques can also bring processing and storage of personal data outside the scope of the GDPR. For example, sensitive data could be accessible in a CRM system but masked when used for analytics or development and testing. Related to this is the concept of data shuffling, a type of data masking in which the values in a column are randomly shuffled across records so that they can no longer be tied to an individual, while the set of values itself is preserved. In this way, privacy is preserved, but analytics and data testing can still take place using the original data values. Data masking and shuffling are part of Talend Data Quality, which provides a single set of tools to build data quality controls across Talend’s entire integration platform (Figure 5). It generates native code to run data quality controls and data anonymization at the right place, and at the right time, regardless of whether that data is on-premises or in the cloud, at rest or in motion.
Figure 5: Talend provides data masking and shuffling capabilities for batch and real-time data streams, as well as for any audience including business users, through self-service tools.
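The difference between masking and shuffling described above can be shown in a few lines of Python. This is a generic sketch rather than Talend Data Quality code; `shuffle_column`, `mask_email`, and the sample field names are hypothetical. Shuffling keeps the set of values in a column intact but breaks the link to each record, whereas masking replaces the value itself.

```python
import random


def shuffle_column(rows, column, seed=None):
    """Return new rows with the values of one column randomly permuted.

    The column keeps the same values overall (so aggregates are unchanged),
    but a value no longer sits on the record it came from.
    """
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def mask_email(email: str) -> str:
    """Hide the local part of an email address but keep the domain."""
    local, _, domain = email.partition("@")
    return "*" * len(local) + "@" + domain


if __name__ == "__main__":
    customers = [
        {"name": "A", "salary": 100},
        {"name": "B", "salary": 200},
        {"name": "C", "salary": 300},
    ]
    shuffled = shuffle_column(customers, "salary", seed=7)
    print(sum(r["salary"] for r in shuffled))  # total is unchanged by shuffling
    print(mask_email("john.smith@example.org"))
```

Note that shuffling a small or highly distinctive column offers limited protection, since unusual values may still point to a specific person. Whether a given technique is sufficient depends on the data set.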
Pillar four, self-service curation and certification, supports the delegation of authority from data experts like data protection officers or data stewards to business users. Think about a sales engineer who might be best positioned to ensure the contact data related to their account is up to date, or a campaign manager who becomes accountable for checking and proving that a consent mechanism has been put in place by the partners with whom they work when enriching the marketing database with new contact data. To ensure that everyone in the organization has the autonomy to manage their data usage in a compliant manner, businesses need to provide straightforward self-service apps like Talend Data Preparation and Talend Data Stewardship (Figure 6) to various departments.
Figure 6: Talend allows organizations to delegate accountability for personal data to potentially anyone in the organization through self-service data preparation and stewardship tools.
In addition, MapR XD offers the flexibility of file-level and volume-level read/write access to the entire underlying data lake, allowing the granular updates mentioned above whenever required.
The final pillar is about managing data location and data transfers, and allowing data subjects to retain control over their data.
In this respect, MapR Volumes allows portability of personal data within the data lake even when it is spread across multiple silos, hybrid multi-cloud environments, or edge devices. The volume structure defines how the data is distributed across the repository. For instance, companies can easily move EU citizen data to repositories in the EU by constructing volumes accordingly and consolidating all pertinent personal data into them. This becomes a key underpinning for the GDPR data lake as it enables customers to manage the location of the personal data.
Data Subject Access Rights (DSAR) enable data subjects to easily access their personal data, require its rectification or erasure, or reclaim it in a portable format so that they can share it with other suppliers. To facilitate this capability, businesses can implement a data access and download tool with data integration software and maintain a list of all the customers who have requested that action be taken regarding their personal data.
To comply with such mandates, companies can run a job using Talend Data Integration that creates a comma-separated values (CSV) file for each customer and automatically sends an email to each one to secure approval (Figure 7). Alternatively, they can use an open API – in other words, the business could open a privacy control center on its website and encourage any customer wanting to know what data the business has about them to access it. This is an approach that could be supported by the use of Talend Data Services, which can expose real-time data services through a standard, well-documented, and easy-to-consume API, such as REST.
Figure 7: Complying with the right to data portability with Talend Data Integration.
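The per-customer CSV export mentioned above can be approximated in plain Python to show what such a job produces. This is a sketch under assumptions, not a Talend Data Integration job: the `export_subject_csv` function, the `subject_id` key, and the record layout are hypothetical, and the email step is omitted.

```python
import csv
import io


def export_subject_csv(records, subject_id):
    """Collect every record held about one data subject and render it as CSV.

    `records` is assumed to be a list of dicts, each carrying a "subject_id"
    key. Records gathered from different systems may have different fields;
    missing fields are left empty in the output.
    """
    rows = [r for r in records if r["subject_id"] == subject_id]
    if not rows:
        return ""
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    held = [
        {"subject_id": "c1", "source": "crm", "email": "jane@example.org"},
        {"subject_id": "c1", "source": "loyalty", "tier": "gold"},
        {"subject_id": "c2", "source": "crm", "email": "joe@example.org"},
    ]
    print(export_subject_csv(held, "c1"))
```

The same function could sit behind a REST endpoint in a privacy control center, returning the file on request instead of sending it in a batch.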
As GDPR approaches, companies are becoming increasingly concerned that it could threaten their business. However, instead of ‘stressing about it’, they should view this as an opportunity to create a system of trust for engaging with their customers and employees, together with a platform to personalize and maximize the value of every interaction. Building such an engagement model requires a modern and converged data platform that can scale with cloud and big data requirements. At Talend, we see the data lake as an excellent path to creating a 360-degree view of each customer or employee. The MapR-Talend data lake can be used not only to document, categorize, and map the data, but also to track and trace the changes applied to it and to deliver data services to data subjects in line with their rights (the rights of access, rectification, and data portability, and the right to be forgotten).
Organizations can build on this platform to implement the five pillars of data management best practice and put in place the latest data management capabilities to deliver everything from data capture, integration, classification, and lineage, to data anonymization, self-service curation, and data portability. By doing so, they will put themselves in a position to more effectively manage the changes brought by GDPR and further develop a best-practice approach to data management that will help drive their success today and in the future.