This paper focuses on the core technical capability building that takes first priority in the 0-1 stage, and examines pragmatic methods and tool selection considerations in three major areas: data inventory and assetization, data quality baseline establishment, and the data security foundation.
For enterprises in the initial stage of data management (stage 0-1), the core challenge is to transform data resources that are scattered, of uneven quality, and exposed to security risks into trusted, usable, and controllable data assets. Achieving “visibility”, “controllability” and “availability” of data is the core goal of this stage, and it depends heavily on building and implementing key technical capabilities.
1. Data inventory and assetization
Data inventory is the first step toward understanding the data landscape and establishing awareness of data assets; the goal is to form a panoramic view of the enterprise’s data.
1.1 Metadata management
Metadata management is the core support of data inventory.
1.1.1 Selection of lightweight metadata management tools
Open source solution: Apache Atlas, as a mature choice in the Hadoop ecosystem, has as its core advantage native integration with components such as Hive, HBase, and Kafka. Its working mechanism is to automatically extract technical metadata (tables, fields, partitions, data types, data formats, etc.) and some operational metadata from source systems (such as the Hive Metastore) through pre-built or customized metadata collectors (Hook/Bridge) and store them in its internal JanusGraph graph database or HBase. The provided RESTful API and Web UI support metadata querying, browsing, and basic management. For small deployments in stage 0-1, you can choose its lightweight mode (such as using embedded HBase/Solr) to quickly build the basic framework.
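As a rough illustration of how metadata collected by Atlas can then be consumed programmatically, the sketch below queries the Atlas v2 REST basic-search API for Hive tables. The host, port, credentials, and entity type name are assumptions and should be adapted to your own deployment.

```python
import requests

# Assumed Atlas endpoint and credentials -- adjust for your deployment.
ATLAS_URL = "http://atlas-host:21000/api/atlas/v2/search/basic"
AUTH = ("admin", "admin")

def list_hive_tables(limit=25):
    """Query Atlas basic search for hive_table entities and print their names (sketch only)."""
    resp = requests.get(
        ATLAS_URL,
        params={"typeName": "hive_table", "limit": limit},
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    for entity in resp.json().get("entities", []):
        attrs = entity.get("attributes", {})
        print(attrs.get("qualifiedName"), "-", attrs.get("description", ""))

if __name__ == "__main__":
    list_hive_tables()
```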
Self-built lightweight platform: when an open source solution cannot fully meet specific needs, or more flexibility and control are required, consider building your own. The technology stack typically includes:
- Backend storage: choose MySQL, PostgreSQL, or another relational database to design the metadata storage model. The core tables should cover: data source information table, data table/entity table, field/attribute table, business terminology table, data lineage table, user/permission table, etc.
- Metadata collection: develop ingestion scripts or small services using JDBC/ODBC, API calls, file parsing (e.g., parsing DDL statements), etc., to pull technical metadata from the source systems (databases, file systems, APIs, etc.) on a scheduled or triggered basis (see the sketch after this list). Incremental collection mechanisms need to be considered.
- Front-end display: front-end frameworks such as Vue.js or React are used to build the management interface, supporting adding, deleting, modifying, searching, and visualizing metadata. At its core, it’s about providing a clear, easy-to-use data asset browsing experience.
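A minimal sketch of the collection step, under the assumption that the source system is PostgreSQL (read via information_schema) and the metadata store is a simple SQLite database; the DSN and the meta_column table are illustrative, not a prescribed schema.

```python
import sqlite3
import psycopg2  # driver for the assumed PostgreSQL source; swap for your source's driver

SOURCE_DSN = "host=src-db dbname=orders user=meta_reader password=changeme"  # assumed

def collect_columns(source_dsn: str) -> list[tuple]:
    """Pull technical metadata (schema, table, column, type, nullability) from the source."""
    with psycopg2.connect(source_dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT table_schema, table_name, column_name, data_type, is_nullable
            FROM information_schema.columns
            WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
            """
        )
        return cur.fetchall()

def store_columns(rows: list[tuple], meta_db: str = "metadata.db") -> None:
    """Write the collected metadata into a simple relational metadata store."""
    with sqlite3.connect(meta_db) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS meta_column (
                   table_schema TEXT, table_name TEXT, column_name TEXT,
                   data_type TEXT, is_nullable TEXT,
                   collected_at TEXT DEFAULT CURRENT_TIMESTAMP)"""
        )
        conn.executemany(
            "INSERT INTO meta_column (table_schema, table_name, column_name, data_type, is_nullable) "
            "VALUES (?, ?, ?, ?, ?)",
            rows,
        )

if __name__ == "__main__":
    store_columns(collect_columns(SOURCE_DSN))
```

A scheduled job (cron or the ETL scheduler) can rerun this script periodically; incremental collection would compare against the last collected snapshot rather than reloading everything.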
1.1.2 Definition of the core metamodel
Building a clear and consistent metadata model is the foundation for effective management, including:
Business metadata:
Core Elements: business term name, standardized definition, business domain/process, responsible person (business owner), and related terms (synonyms, parent-child relationships, etc.).
Implementation practice: product managers lead cross-functional (business, technical) workshops to define key business concepts (e.g., “active orders”, “active users”). The definition results need to be stored in a structured manner (database tables) and strongly mapped to technical metadata such as table fields. This significantly reduces communication ambiguity and ensures that the technology accurately reflects business intent.
Technical metadata:
Core Elements: physical storage location (library, instance, cluster), data object name (table, view, topic), data structure (field name, data type, length, constraints), data storage format (Parquet, ORC, JSON, etc.), partition information, ETL job information (script path, scheduling period), data lineage (upstream sources, downstream consumers).
Collection and management: collected from sources such as database system tables, ETL tool logs, message queue configurations, and file system properties through automated tools such as Atlas or through scripts. A reasonable storage model (such as a star/snowflake model) is needed to associate entities such as tables, fields, and jobs.
Management metadata:
Core Elements: data owner (technical owner), creator, created/updated time, access rights information, data lifecycle status (active, archived, expired), data classification and grading labels, change history (who modified what, when, and why).
Value: clarify management responsibilities, support audit traceability, and ensure the standardization of data management processes. A change-logging mechanism, such as database triggers plus log tables, is crucial (a minimal sketch follows).
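One possible way to implement the change-logging mechanism mentioned above: a log table filled by an update trigger. The sketch below uses SQLite syntax for self-containedness; table and column names are illustrative, and MySQL/PostgreSQL trigger syntax differs slightly (and would also let you record the session user).

```python
import sqlite3

# Illustrative schema: a metadata table plus a change-log table populated by a trigger.
DDL = """
CREATE TABLE IF NOT EXISTS meta_table (
    id INTEGER PRIMARY KEY, table_name TEXT, owner TEXT, lifecycle_status TEXT);
CREATE TABLE IF NOT EXISTS meta_change_log (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    table_id INTEGER, changed_field TEXT, old_value TEXT, new_value TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP);
CREATE TRIGGER IF NOT EXISTS trg_meta_owner_change
AFTER UPDATE OF owner ON meta_table
BEGIN
    INSERT INTO meta_change_log (table_id, changed_field, old_value, new_value)
    VALUES (OLD.id, 'owner', OLD.owner, NEW.owner);
END;
"""

with sqlite3.connect("metadata.db") as conn:
    conn.executescript(DDL)
    conn.execute("INSERT OR IGNORE INTO meta_table (id, table_name, owner) VALUES (1, 'dw.orders', 'alice')")
    conn.execute("UPDATE meta_table SET owner = 'bob' WHERE id = 1")
    for row in conn.execute("SELECT changed_field, old_value, new_value, changed_at FROM meta_change_log"):
        print(row)  # e.g. ('owner', 'alice', 'bob', '2024-...')
```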
1.2 Data asset catalog
Building a data asset catalog for users (especially business users) on top of metadata is the direct vehicle for making data “visible and usable”.
1.2.1 Drive deep collaboration between business and technology to build a catalog
Global data source discovery and mapping:
- Product managers need to work together with business units to sort out core business processes (such as order to payment, lead to customer), identify key data entities generated and consumed in the process and their source systems (such as customer tables in CRM, transaction tables in order systems, behavioral data in log servers).
- The technical team is responsible for exploring the physical deployment, storage method (database type, tablespace), access interface (JDBC, API, file path), data scale, and update frequency of these source systems.
- The output should be a data source distribution map covering the main business domain (physical + logical view), and the location and flow direction of key data should be clarified.
Accurate capture and alignment of business semantics:
- The business team is responsible for explaining the specific meaning of key data entities and fields in the business context, calculation rules (e.g., whether “GMV” includes shipping, refunds), and business rule constraints (e.g., “customer level” determination logic).
- The technical team is responsible for translating these business semantics into comments in technical metadata, associating them to business glossary items, and ensuring that technical implementations (such as field names, calculation logic) match them.
- Product managers need to design standardized semantic description templates (fields), establish feedback and arbitration mechanisms (such as regular review meetings), and resolve disputes where business and technical understanding diverge.
Initial construction and visualization of data lineage:
- Start with the most important core business reports or indicators, trace back to the original data sources on which their calculations depend, and sort out the intermediate processing steps (ETL jobs, SQL scripts, and compute engine tasks).
- Use tools such as Atlas’ built-in lineage, Graphviz plotting, or the open source editions of dedicated data lineage tools such as Marquez to visualize lineage relationships and clearly show the flow path and transformation process of data from the source systems to the consumer side (a minimal sketch follows this list).
- Lineage requires continuous updating and maintenance as the business and systems evolve.
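To make the visualization step concrete, here is a minimal sketch using the Python graphviz package (the Graphviz binaries must be installed); the node names are invented examples of a source-to-report path, not a real lineage.

```python
from graphviz import Digraph

# Invented example: lineage from source tables through an ETL job to a report.
lineage = Digraph("order_gmv_lineage", format="png")
lineage.attr(rankdir="LR")  # draw the flow left-to-right

for node in ["crm.customer", "oms.order", "etl.daily_order_agg", "dw.fact_order", "bi.gmv_report"]:
    lineage.node(node)

edges = [
    ("crm.customer", "etl.daily_order_agg"),
    ("oms.order", "etl.daily_order_agg"),
    ("etl.daily_order_agg", "dw.fact_order"),
    ("dw.fact_order", "bi.gmv_report"),
]
for src, dst in edges:
    lineage.edge(src, dst)

lineage.render("order_gmv_lineage", cleanup=True)  # writes order_gmv_lineage.png
```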
1.2.2 Design a user-oriented asset catalog experience
Intuitive directory structure and navigation:
- Adopt an organization method that combines hierarchy (e.g., business domain -> data subject domain -> data entities/tables) with labels (business labels, and technical tags such as “basic data” and “derivative indicators”).
- The interface is designed around user habits: clear tree navigation, breadcrumb paths, a favorites function, and recent visit history. Place frequently accessed data assets in prominent positions.
Efficient Search and Discovery Capabilities:
- Support full-text search based on keywords (table names, field names, business terms, and description text), with intelligent suggestions and auto-complete to improve efficiency.
- Provide multi-dimensional combination filtering: quickly narrow down the search scope by business domain, data source system, data owner, classification and grading labels, update time range, etc. The filter criteria should be intuitive and easy to use, and the results should be dynamically refreshed.
Rich and useful data detail page:
- The detail page of a specific data asset should aggregate all its relevant metadata: business description (associated business terms), technical details (field list and types, sample data preview), management information (owner, update time), data lineage map, associated data quality report (e.g., latest inspection results), and links to usage examples/best practices.
- Organize information in a card-style or tabbed layout that is clear and easy to read. Provide convenient features for exporting metadata (such as CSV), sharing links, subscribing to change notifications, and more. Clearly display the quality assessment status of data (such as pass/warning/fail indicators) to enhance user trust.
2. Data quality baseline establishment
Without quality assurance, the value of data is greatly reduced and even risky. Stage 0-1 needs to establish basic quality management capabilities.
2.1 Key data identification
Resources are limited, so governance must prioritize the data that has the greatest impact on business objectives.
Methodology: Product managers organize business units to identify key business entities (e.g., “customers”, “products”, “orders”, “transactions”) and their key attributes (e.g., customer “contact”, order “amount”, transaction “status”) based on current core business goals (e.g., increasing marketing conversion rates, reducing risk losses, and meeting compliance reporting requirements).
Evaluation Dimensions: The matrix analysis method is used to evaluate from two dimensions:
- Business value dimension: the potential impact of errors or gaps in this data on business decisions, process efficiency, customer experience, revenue and cost, and compliance risk.
- Data complexity dimension: the number of source systems involved, the complexity of processing and transformation, and the difficulty of governance (such as whether sensitive data is involved, or the difficulty of cross-departmental coordination).
Output: Form a priority list of key data entities and attributes to guide resource investment.
2.2 Rule Definition and Measurement
Quality rules are the yardsticks against which data is measured; they need to be defined together with the business side and translated into executable inspection logic.
2.2.1 Define core data quality rules with business parties
Completeness:
Rule Definition: specify which fields are required in which business scenarios. For example, “mobile phone number” is required when customers register, and “product ID” and “quantity” are required when creating an order.
Technical implementation considerations: set real-time verification on the data entry/acquisition interface; perform null value checks on batch imported data in the ETL process. For data that may be delayed for process reasons, define acceptable delay windows (SLIs) and default-fill/backfill policies. Establish monitoring alarms for missing data.
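A minimal sketch of a batch completeness (null value) check using pandas; the required fields and the alert threshold are assumptions to be replaced by your own rules.

```python
import pandas as pd

REQUIRED_FIELDS = ["customer_id", "mobile_phone"]   # assumed required-at-registration fields
ALERT_THRESHOLD = 0.98                               # assumed business-approved completeness target

def completeness_report(df: pd.DataFrame) -> dict:
    """Return the completeness rate per required field: 1 - null_count / total."""
    total = len(df)
    return {f: 1 - df[f].isna().sum() / total for f in REQUIRED_FIELDS}

if __name__ == "__main__":
    batch = pd.DataFrame({
        "customer_id": [1, 2, 3, 4],
        "mobile_phone": ["13800000000", None, "13900000000", None],
    })
    for field, rate in completeness_report(batch).items():
        status = "OK" if rate >= ALERT_THRESHOLD else "ALERT"
        print(f"{field}: {rate:.0%} complete [{status}]")
```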
Accuracy:
Rule Definition: whether the data truly and correctly reflects reality. For example, whether the “customer age” is within a reasonable range (0-120), whether the “product price” is consistent with the pricing system, and whether the “address” is valid.
Technical implementation considerations: define valid value ranges, enumeration lists, and formatting rules (regex checks) for fields. Write validation scripts or use a tool’s rule engine for checking. For key data (such as amounts and identity information), third-party authoritative data sources (such as ID card verification services and credit reporting interfaces) can be introduced for cross-verification. Establish user feedback channels (such as a “Report Error” button on the data detail page) and a quick remediation process.
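A small sketch of rule-based accuracy checks (valid range plus format regex) in plain Python; the age range and phone-number pattern are illustrative rules, not a standard.

```python
import re

PHONE_PATTERN = re.compile(r"^1\d{10}$")   # illustrative mainland-China mobile format

def check_accuracy(record: dict) -> list[str]:
    """Return a list of accuracy-rule violations for one record."""
    violations = []
    age = record.get("customer_age")
    if age is None or not (0 <= age <= 120):          # valid range rule
        violations.append("customer_age out of range 0-120")
    phone = record.get("mobile_phone", "")
    if not PHONE_PATTERN.match(phone):                # format rule (regex)
        violations.append("mobile_phone format invalid")
    return violations

print(check_accuracy({"customer_age": 130, "mobile_phone": "1380000000"}))
# -> ['customer_age out of range 0-120', 'mobile_phone format invalid']
```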
Consistency:
Rule Definition: the same data should be consistent across systems or records. For example, for the same “customer ID”, the “customer name” should be identical in the CRM system and the order system; the “inventory” of the same product in different channels should be synchronized within a reasonable time difference.
Technical implementation considerations: establish a unified view of core master data (customers, products, suppliers), i.e., an early MDM prototype. Establish standards and timeliness requirements for cross-system data synchronization. Develop matching scripts or tools that trigger cross-system consistency checks periodically or after critical operations such as master data updates. Detect and alert on inconsistencies in real time or near-real time.
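A minimal sketch of a cross-system consistency check that joins two extracts on customer_id and flags mismatched names; the source data and field names are invented.

```python
import pandas as pd

# Invented extracts from two systems, keyed by customer_id.
crm = pd.DataFrame({"customer_id": [1, 2, 3], "customer_name": ["Alice", "Bob", "Carol"]})
oms = pd.DataFrame({"customer_id": [1, 2, 3], "customer_name": ["Alice", "Robert", "Carol"]})

def consistency_check(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Join on the shared key and return rows where the attribute disagrees."""
    merged = left.merge(right, on="customer_id", suffixes=("_crm", "_oms"))
    return merged[merged["customer_name_crm"] != merged["customer_name_oms"]]

mismatches = consistency_check(crm, oms)
rate = 1 - len(mismatches) / len(crm)
print(f"consistency rate: {rate:.0%}")   # 67% in this toy example
print(mismatches)                        # customer_id 2: 'Bob' vs 'Robert'
```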
Timeliness:
Rule Definition: whether the time delay from data generation to availability or update meets business needs. For example, real-time risk control requires transaction data with second-level latency, while monthly reports may tolerate T+1 data.
Technical implementation considerations: define SLAs (service level agreements) for each data source and dataset, including the expected update frequency (real-time, near real-time, hourly, daily) and maximum latency tolerance. Data streams with high timeliness requirements are collected and transmitted in real time via message queues (Kafka, Pulsar). Develop a clear ETL scheduling plan for low-frequency data. Monitor processing delays in each link of the data pipeline.
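A small sketch of a freshness check against an SLA, assuming each dataset exposes a last-updated timestamp; the dataset names and SLA values are examples only.

```python
from datetime import datetime, timedelta, timezone

# Assumed per-dataset SLAs: maximum tolerated lag between now and the last update.
SLA = {
    "dw.fact_order": timedelta(hours=1),
    "dw.monthly_report": timedelta(days=1),
}

def check_freshness(dataset, last_updated, now=None):
    """Return (lag, within_sla) for one dataset."""
    now = now or datetime.now(timezone.utc)
    lag = now - last_updated
    return lag, lag <= SLA[dataset]

lag, ok = check_freshness("dw.fact_order", datetime.now(timezone.utc) - timedelta(minutes=90))
print(f"lag={lag}, within SLA: {ok}")   # 90 minutes > 1 hour -> False
```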
2.2.2 Design actionable metrics and monitoring boards
Metric design:
Quantify the rules. For example:
- Completeness: completeness rate = (1 - null records / total records) * 100%, or missing value ratio = (null records / total records) * 100%
- Accuracy: accuracy rate = (1 - error records / total records) * 100%, or error rate = (error records / total records) * 100% (error records must be clearly defined, e.g., records failing rule verification or confirmed by manual review)
- Consistency: consistent record ratio = (consistent records / total matched records) * 100% (within a specific matching scenario)
- Timeliness: data freshness = current time - data timestamp (report the maximum, the average, and the proportion exceeding the SLA threshold), or latency distribution statistics.
Key Points: indicators need to be measurable and computable. Complex problems (such as “address accuracy”) can be broken down into multiple sub-rules (format validity, administrative division existence, street existence) with corresponding sub-indicators. Set clear, business-approved health thresholds for each metric.
Monitoring dashboard design:
Build a data quality monitoring dashboard using a BI tool (Tableau, Power BI, Superset, Grafana).
Core content:
- Displays the current values and trend charts of core quality indicators (completeness rate, accuracy, etc.) by data entity/key attributes.
- Use red/yellow/green to visually identify the status of the indicator (normal, warning, abnormal).
- Show details of recent quality inspection results (specific number of records and examples of rule violations).
- Integrate an alerting function that automatically triggers notifications (email, DingTalk, WeCom) when an indicator crosses its threshold.
User Experience: allow filtering and viewing by business domain, data source, data owner, and other dimensions. Provide drill-down analysis capabilities. Generate comprehensive data quality reports regularly for management decision-making.
2.3 Basic inspection and rectification
Implement rules and indicators, and establish a mechanism for problem discovery, notification, and rectification to form a closed loop of quality.
Problem discovery: regularly run inspection scripts/tools/tasks to scan target data and identify problem records that violate quality rules.
Problem Recording and Notification:
- Log the issue details (rule violations, data sources/tables/fields involved, issue record primary key/sample, severity level, discovery time) to the problem ledger (database table or ticket system).
- Automatically notify the data owner (technical owner) and the relevant business owner. Notification messages should clearly describe the issue, potential business impact, and desired timeframe for resolution.
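A minimal sketch of the recording and notification steps above: append one issue to a problem-ledger table and push a text alert to a team webhook. The DingTalk-style robot URL and message format are assumptions; adapt them to whatever notification channel and ticket system you actually use.

```python
import sqlite3
import requests

WEBHOOK_URL = "https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"  # assumed channel

def record_issue(rule: str, table: str, sample_key: str, severity: str, db: str = "quality.db") -> None:
    """Append one quality issue to the problem ledger."""
    with sqlite3.connect(db) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS quality_issue (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   rule TEXT, target_table TEXT, sample_key TEXT, severity TEXT,
                   status TEXT DEFAULT 'open',
                   found_at TEXT DEFAULT CURRENT_TIMESTAMP)"""
        )
        conn.execute(
            "INSERT INTO quality_issue (rule, target_table, sample_key, severity) VALUES (?, ?, ?, ?)",
            (rule, table, sample_key, severity),
        )

def notify_owner(rule: str, table: str, severity: str) -> None:
    """Push a text alert to the team webhook (DingTalk-style robot message format assumed)."""
    payload = {"msgtype": "text",
               "text": {"content": f"[Data quality {severity}] {table} violated rule: {rule}. Please handle within the agreed timeframe."}}
    requests.post(WEBHOOK_URL, json=payload, timeout=5)

record_issue("mobile_phone completeness < 98%", "crm.customer", "customer_id=42", "warning")
notify_owner("mobile_phone completeness < 98%", "crm.customer", "warning")
```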
Problem analysis and rectification:
- The responsible person analyzes the root cause of the problem (source entry error? ETL logic flaw? Interface abnormality? Data delay?).
- Formulate and implement rectification plans (correcting source data, fixing ETL code, optimizing interface logic, supplementing missing data, etc.).
Verification and Closed Loop:
- After rectification, trigger or wait for the next QA run.
- Verify that the issue is resolved and update the issue ledger status to Fixed.
- Regularly review high-frequency or serious quality issues to promote process optimization or system improvement to prevent recurrence.
3. Data security foundation
While releasing the value of data, it is necessary to build a solid security line of defense to meet compliance requirements.
3.1 Data classification and grading
Clarifying the sensitivity and importance of data is fundamental to implementing differentiated protection.
3.1.1 Promote the development of standards that comply with regulations and business needs
Product managers need to work with legal, compliance, security, and core business departments to jointly formulate data classification and grading standards for enterprises.
Basis: National laws and regulations (Cybersecurity Law, Data Security Law, Personal Information Protection Law), industry regulatory requirements (special regulations in finance, medical and other industries), and internal risk management strategies.
Standard content:
- Classification: divide data into categories according to its nature (e.g., personal information, financial information, trade secrets, operational data, public information).
- Grading: on the basis of classification, grade data by the degree of potential harm to national security, public interests, enterprise operations, and personal rights and interests if the data were leaked, tampered with, destroyed, or used illegally (common levels include public, internal, sensitive, and confidential).
- Clarify definitions, scope, characteristics, and typical examples for each level. For example, “sensitive level” can be defined as: personal privacy information (ID number, mobile phone number, home address, biometric information), important customer information, undisclosed financial data, core business analysis models, etc., whose leakage may cause great damage or financial loss to individuals or the enterprise.
Output: Form a formal and approved document of “Enterprise Data Classification and Grading Specification”.
3.1.2 Apply the standard to specific data assets
- Organize business departments and technical teams to classify and grade the core data assets (tables and fields) that have been inventoried, according to the Specification.
- Classification and grading results (labels) are entered into the metadata management system/asset directory as key management metadata.
- These labels are the core basis for subsequent security policies such as access control, encryption, masking, and auditing (see the sketch below).
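To illustrate how classification/grading labels stored as metadata can drive downstream policy, the sketch below keeps a small field-level label map and masks fields graded “sensitive” or above before display. The labels, field names, and masking rule are illustrative assumptions, not a prescribed standard.

```python
# Illustrative field-level classification/grading labels kept as management metadata.
FIELD_GRADE = {
    "customer.name": "internal",
    "customer.mobile_phone": "sensitive",
    "customer.id_number": "confidential",
}

def mask(value: str) -> str:
    """Keep head and tail of the value, hide the middle."""
    return value if len(value) <= 4 else value[:3] + "*" * (len(value) - 7) + value[-4:]

def render_field(field: str, value: str, viewer_clearance: str = "internal") -> str:
    """Return the value as the viewer may see it, based on the field's grade."""
    grade = FIELD_GRADE.get(field, "internal")
    if grade in ("sensitive", "confidential") and viewer_clearance == "internal":
        return mask(value)
    return value

print(render_field("customer.mobile_phone", "13812345678"))   # 138****5678
print(render_field("customer.name", "Alice"))                 # Alice
```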
3.2 Basic access control
The primary task of stage 0-1 is to prevent unauthorized access and data breaches.
3.2.1 Implement the principle of least privilege
Core Concept: Users/apps can only have the minimum data access necessary to complete their work tasks, no more.
Product Manager Role: Collaborate with security teams, data owners (business parties), and technical teams.
1) Sort out the core responsibilities of different job roles (such as sales representatives, customer service personnel, data analysts, financial personnel, DevOps) and the data scope (which business domains/entities/tables) and operation types (read, write, delete, modify) required for work.
2) Define roles based on this and assign permissions to roles down to the table level (or even key field level). For example:
- Sales representative role: read-only access to basic customer information and sales opportunities.
- Data analyst role: read-only access to sales detail wide tables and product dimension tables, and no access to raw log tables containing sensitive information.
3) Assign users to the appropriate roles instead of granting permissions to them directly. Implement least privilege through the user-role-permission mapping.
3.2.2 Establish a basic user role and permission management framework
Technical implementation:
- Leverage existing identity authentication and access management (IAM) systems (such as LDAP/AD, Okta, or Alibaba Cloud RAM) as the source of user identities.
- Build a role-based access control (RBAC) model at the data platform layer, using the database’s own permission system, Apache Ranger/Sentry in the Hadoop ecosystem, data catalog tools, or self-built middleware.
- Core elements: users, roles, permission sets, user-role associations, and role-permission associations (a minimal sketch follows this list).
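A minimal sketch of the RBAC core elements listed above (users, roles, table-level permissions, and the two association maps) with a simple check function; the role names, user names, and tables are invented for illustration.

```python
# Invented roles mapped to table-level permissions (permission = (table, action)).
ROLE_PERMISSIONS = {
    "sales_rep":    {("crm.customer_basic", "read"), ("crm.opportunity", "read")},
    "data_analyst": {("dw.sales_detail", "read"), ("dw.dim_product", "read")},
}

# User-to-role assignment instead of granting permissions to users directly.
USER_ROLES = {
    "alice": {"sales_rep"},
    "bob": {"data_analyst"},
}

def is_allowed(user: str, table: str, action: str) -> bool:
    """Least-privilege check: allowed only if some assigned role grants (table, action)."""
    return any((table, action) in ROLE_PERMISSIONS.get(role, set())
               for role in USER_ROLES.get(user, set()))

print(is_allowed("bob", "dw.sales_detail", "read"))        # True
print(is_allowed("bob", "raw.app_log_sensitive", "read"))  # False: not granted by any assigned role
```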
Process guarantee:
- Establish a standardized permission request process (e.g., through the ticket system), clarify the reason for the request, the required data scope, the type of operation, the applicant, and the approver (data owner + security/superior).
- Establish a regular permission review mechanism to ensure that permissions are adjusted or revoked in a timely manner after personnel change positions.
- Record detailed permission grant and change logs to meet audit requirements.
4. The key role of the product manager in the 0-1 stage
In the technical capability building of data governance stage 0-1, product managers are the bridge and driving force connecting business needs and technical implementation. Their core value is reflected in:
4.1 Technology selection and evaluation
When evaluating technical solutions such as metadata management tools, data quality tools, and security components, PMs need to have a deep understanding of current business pain points (such as “data not found”, “data cannot be trusted”, “data leakage risk”) and business development expectations for the next 1-2 years.
The evaluation dimensions are not limited to the feature list:
- Business Fit: do the tool’s workflow, metamodel extensibility, and user interface match the usage habits and cognition of business personnel? Does it effectively support the defined core business terms and processes?
- Data Scale and Complexity Adaptation: can the tool’s architecture support the current data scale, and does it have a reasonable expansion path? How well does it integrate with the existing technology stack (databases, big data platforms)?
- Total Cost of Ownership (TCO): in addition to procurement/licensing fees, evaluate deployment costs, O&M complexity, learning curves, and custom development investment. For open source solutions, evaluate community activity and commercial support options.
- Evolution Capabilities: can the solution smoothly support subsequent evolution to more advanced capabilities such as automated data lineage, real-time quality monitoring, and fine-grained dynamic masking? Avoid short-term choices that create the burden of future replacement.
Output: a technology selection recommendation report based on multi-dimensional, objective evaluation.
4.2 Promote cross-team collaboration and data standard formulation
Data governance is essentially a cross-departmental collaborative project. PMs need to proactively break down departmental walls (business, technology, legal, compliance, security, risk).
Core Areas of Collaboration:
- Data standard formulation: lead or deeply participate in formulating core standards such as business terminology, data classification and grading standards, data quality rule definitions, and master data definitions. Ensure that standards meet regulatory requirements and can be understood and implemented by the business. Balance the demands of all parties and drive consensus.
- Clarifying data responsibility: promote the establishment of a clear data owner (business owner and technical owner) system, and clarify the responsibilities of all parties for data definition, quality, security, and use.
- Process alignment: ensure that data governance processes (such as metadata maintenance, quality issue handling, and permission application processes) are effectively connected with existing business and IT processes.
4.3 Define data quality KPIs and link them to business value
Investments in data governance need to prove their ROI. PMs need to translate abstract data quality metrics into business language and perceived value.
Method:
- Direct Linkage: for example, quantitatively correlate the improvement of “customer contact information accuracy” with the improvement of “marketing campaign reach” and “customer service satisfaction”. Link the improvement of order data integrity to financial settlement efficiency and the reduction of manual reconciliation costs.
- Risk Avoidance: quantify how data quality improvements reduce business risks caused by data errors (e.g., losses from poor decisions, compliance fines, customer churn risk).
- Value Delivery: regularly report to business and management on the specific business benefits (e.g., cost savings, efficiency improvements, revenue growth, risk reduction) brought about by data quality improvements, and continue to seek support and resource investment.
4.4 Coordinate the implementation of security compliance requirements
In an increasingly stringent regulatory environment, PMs need to take on the responsibility of coordinating the implementation of compliance requirements:
- Requirements Understanding and Transformation: deeply understand the data security and compliance requirements put forward by the legal, compliance, and security departments (such as GDPR, CCPA, and the DSAR and anonymization requirements in China’s Personal Information Protection Law), and translate them into specific data management functional requirements (such as classification and grading label management, access control policies, audit logs, and data masking rules).
- Scheme Design and Coordination: participate in the design of technical and management solutions that meet compliance requirements (e.g., enabling sensitive data labeling and masked previews in the data catalog, designing permission models that satisfy the “minimum necessary” principle, planning audit log scope and storage), and coordinate the technical teams for implementation.
- Compliance Verification: assist in organizing compliance checks or audits, providing the necessary process descriptions and evidence (e.g., permission approval records, data classification and grading lists, data quality monitoring reports).
4.5 Safeguard the tool platform user experience
The end users of data governance tools (especially metadata platforms, data asset catalogs) are a wide range of business and technical people. A poor user experience will greatly hinder the promotion and value of the tool.
PMs need to be deeply involved in:
- User Research: understand the core demands and usage scenarios of users in different roles (business analysts, data engineers, data scientists, managers).
- Interface and interaction design: pay attention to the platform’s ease of use, intuitiveness, and clarity of information presentation. Review UI/UX design drafts to ensure that navigation is reasonable, search is efficient, detail page information is well organized, and the operation flow is smooth.
- Value guidance: design beginner guides, help documentation, and best practice cases to reduce the learning cost and guide users to discover the tool’s value (such as “How do I quickly find the data I need?” or “How do I understand this indicator’s lineage?”).
- Feedback Closed Loop: establish user feedback channels, continuously collect pain points and improvement suggestions, and drive iterative optimization of the product. The goal is to make users “willing to use it, like using it, and unable to do without it”.