Designing Effective Data Security Schemas | by Renae Kang | Jan, 2024

Cybersecurity tools generate vast amounts of logging data, each tool with its own data definitions and field names. Meanwhile, enterprises often collect logs from many sources, such as network devices, servers, applications, and user activity. These logs can be massive, reaching petabytes or even exabytes, and they often contain a wide variety of fields with varying definitions and data types. In fact, Adobe collects tens of terabytes of security tooling and related log data every day.
That’s why data correlation is essential to finding the usable nuggets of information that help us make the best security control decisions possible. In short, it helps us pinpoint the actionable “needles in our logging haystack.” Though analyzing individual datasets is relatively straightforward, correlating data across multiple logs can pose a significant challenge because of the heterogeneity of field names and data formats across different logs.
Field normalization is a crucial step in enabling effective correlation. As such, one of the most important steps toward identifying misconfigurations, attack paths, and vulnerabilities across systems is to develop a standardized security data schema. In this blog post, I will explain the importance of field normalization and share tangible methods to address this data schema problem effectively.
Normalizing Fields for Better Analysis
Field normalization is a widely accepted method used to homogenize the name, definition, and data type of a particular field across different data sets. While investigating security incidents, it’s important to analyze data from multiple logs to quickly identify the root causes of attacks.
The benefits of field normalization include improved data integration, more reliable correlation across datasets, fewer errors during analysis, and more efficient data sharing among security teams.
Designing Base Classes
After conducting comprehensive research on Adobe’s in-house datasets and industry-standard security stack tools, our team developed an extensible modular framework for defining variables within the data schema.
In our approach, we thoroughly researched several security data sources and identified common security challenges that enterprises face, such as cloud misconfigurations, container security issues, network attack patterns, and identity misuse. Once we pinpointed the key focus areas, we designed base classes, logical constructs tailored to these specific needs. These base classes serve as the building blocks, each evolving to encompass multiple objects, variables, or column names. The flexibility of this approach allows logs to be transformed seamlessly for various data lakes and SIEM solutions. In the following table, we present a selection of sample base classes:
Here is a logical illustration of what a base class might look like:
{
  "source": {
    "number": "07",
    "organizationName": "adobe",
    "domain": "example.com",
    "ip": "120.105.139.131",
    "user": {
      "fullName": "ABC XYZ",
      "id": "00uxfurz4oTUMQKJPJUXPCFDRTT"
    },
    "customField": "Custom Value",
    "customField1": "Custom Value 1",
    "custom_Field2": [
      "Value3",
      "Value4"
    ]
  }
}
As you can see above, it is possible to define custom variables within a specific class. We can also add metadata, such as data type, timestamp, log source, and data storage location, to enable data comprehension and aid in further analysis.
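To make this concrete, here is a minimal sketch of how such metadata could be attached to normalized records. The table name, column names, and metadata values below are hypothetical, not Adobe's actual schema:

-- Hypothetical sketch: expose metadata alongside the normalized base-class fields
SELECT
    source_ip,
    source_user_id,
    event_time,                                                     -- normalized timestamp
    'ipv4'                                AS source_ip_data_type,   -- data type metadata
    'aws_vpc_flow_logs'                   AS log_source,            -- originating log source
    's3://security-lake/network/2023/12/' AS storage_location       -- where the data is stored
FROM normalized_network_events;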
Establishing Field Nomenclature
Field nomenclature is vital in designing a security schema. In essence, field nomenclature refers to the naming conventions used for fields in a database or data lake. Consistent and meaningful field names are essential for effective correlation and analysis.
Here are some field nomenclature best practices that we follow at Adobe:
Field Mapping to Data Sources
In the table below, we’ve outlined a sample field-level mapping across various datasets, all sharing a common definition.
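As a rough illustration of this kind of mapping (the firewall field names below are hypothetical, and the tables are not Adobe's actual datasets), a normalization query can alias each dataset's native columns to the shared base fields:

-- Map each dataset's native field names to the shared base fields
-- AWS VPC flow logs
SELECT srcaddr  AS source_ip,
       srcport  AS source_port,
       action   AS event_outcome
FROM raw_vpc_flow_logs
UNION ALL
-- Firewall logs (illustrative field names)
SELECT src_ip   AS source_ip,
       src_port AS source_port,
       verdict  AS event_outcome
FROM raw_firewall_logs;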
Schematizing and Flattening Data
In this approach, the field normalization process consists of two main steps: Schematization and Flattening.
Schematization maps raw log fields onto the standardized schema, while flattening unnests nested structures into flat columns. Together, they normalize the data, ensuring consistency in field names and formats. This prepared data enables deep analysis, uncovering insights and patterns related to security incidents or anomalies.
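Here is a minimal sketch of both steps, assuming an engine that supports dot notation for nested fields (such as Spark SQL) and hypothetical table and column names:

-- Schematize raw events into base-class field names and flatten the nested
-- "source" object from the base class into top-level columns
SELECT
    event_time,
    source.ip            AS source_ip,             -- nested field flattened into a column
    source.user.id       AS source_user_id,
    source.user.fullName AS source_user_full_name,
    outcome              AS event_outcome          -- renamed to the normalized field
FROM raw_security_events;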
Table Structure in the Data Lake
Once the schema is ready, you can divide the tables by event aggregation; this division should be data-source-agnostic. All tables share common base fields, along with any table-specific fields.
Example: network logs from different sources, such as VPC flow logs, Azure logs, and Google Cloud Platform logs, can all reside in a common table. And because their fields are mapped to the base fields, we can write complex correlations across them.
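As a sketch of what such a table might contain (the exact column set below is illustrative, not Adobe's actual table definition):

-- Hypothetical data-source-agnostic network table:
-- common base fields first, then table-specific fields
CREATE TABLE network_master_table (
    event_time     TIMESTAMP,  -- common base field
    source_ip      VARCHAR,    -- common base field
    source_port    VARCHAR,    -- kept as VARCHAR to match the query example below
    source_user_id VARCHAR,    -- common base field from the "source.user" object
    destination_ip VARCHAR,
    event_outcome  VARCHAR,    -- e.g., ALLOWED / BLOCKED
    log_source     VARCHAR,    -- which product or cloud produced the event
    bytes_sent     BIGINT      -- table-specific field
);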
Writing Correlations on the Data Lake
Once you've loaded data into the tables, you can query them using your data lake's query syntax. From the user's perspective, they simply query network events; internally, all possible network sources are schematized into a single master table.
The following example query identifies network events, regardless of cloud or datacenter origin, focusing specifically on SSH events that were BLOCKED.
SELECT * FROM network_master_table
WHERE event_time BETWEEN '2023-12-01' AND '2023-12-10'
AND source_ip = '10.0.0.10'
AND source_port = '22'
AND event_outcome = 'BLOCKED';
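Because every table shares the same base fields, correlations across base classes become straightforward joins. As a hypothetical sketch (assuming an identity_master_table and the column names shown, which may differ from Adobe's actual tables), a query correlating blocked network events with identity events for the same user might look like this:

-- Correlate blocked network events with identity events for the same normalized user
SELECT n.event_time,
       n.source_ip,
       n.event_outcome,
       i.event_type AS identity_event_type
FROM network_master_table n
JOIN identity_master_table i
  ON n.source_user_id = i.user_id
WHERE n.event_time BETWEEN '2023-12-01' AND '2023-12-10'
  AND n.event_outcome = 'BLOCKED';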
Conclusion
Field normalization is crucial in enhancing security analysis by addressing the diversity in field names and data formats commonly found within security logs. By improving data integration, enhancing correlation, and reducing errors, field normalization facilitates more efficient data sharing among security teams. In turn, utilizing base classes, employing techniques such as schematization and flattening, and transforming raw logs all help enable more effective security analytics.