Designing Effective Data Security Schemas | by Renae Kang | Jan, 2024

Cybersecurity tools generate vast amounts of logging data, each tool with its own data definitions and field names. Meanwhile, enterprises often collect logs from many sources, such as network devices, servers, applications, and user activity. These logs can be massive, reaching petabytes or even exabytes, and they often contain a wide variety of fields with varying definitions and data types. In fact, Adobe collects tens of terabytes of security tooling and related log data every day.
That’s why data correlation is essential to finding the usable nuggets of information that help us make the best security control decisions possible. In short, it helps us pinpoint the actionable “needles in our logging haystack.” Though analyzing individual datasets is relatively straightforward, correlating data across multiple logs can pose a significant challenge because of the heterogeneity of field names and data formats across different logs.
Field normalization is a crucial step in enabling effective correlation. As such, one of the most important steps toward identifying misconfigurations, attack paths, and vulnerabilities across systems is to develop a standardized security data schema. In this blog post, I will explain the importance of field normalization and share tangible methods to address this data schema problem effectively.
Normalizing Fields for Better Analysis
Field normalization is a widely accepted method used to homogenize the name, definition, and data type of a particular field across different data sets. While investigating security incidents, it’s important to analyze data from multiple logs to quickly identify the root causes of attacks.
The benefits of field normalization include improved data integration, more reliable correlation across datasets, fewer errors during analysis, and more efficient data sharing among security teams.
Designing Base Classes
After conducting comprehensive research on Adobe’s in-house datasets and industry-standard security stack tools, our team developed an extensible modular framework for defining variables within the data schema.
In our approach, we thoroughly researched several security data sources and identified common security challenges that enterprises face, such as cloud misconfigurations, container security issues, network attack patterns, and identity misuse. Once we pinpointed the key focus areas, we designed base classes, logical constructs tailored to these specific needs. These base classes serve as the building blocks, each evolving to encompass multiple objects, variables, or column names. The flexibility of this approach allows logs to be transformed seamlessly for various data lakes and SIEM solutions. In the following table, we present a selection of sample base classes:
Here is a logical illustration of what a base class might look like:
{
  "source": {
    "number": "07",
    "organizationName": "adobe",
    "domain": "example.com",
    "ip": "120.105.139.131",
    "user": {
      "fullName": "ABC XYZ",
      "id": "00uxfurz4oTUMQKJPJUXPCFDRTT"
    },
    "customField": "Custom Value",
    "customField1": "Custom Value 1",
    "custom_Field2": [
      "Value3",
      "Value4"
    ]
  }
}
As you can see above, it is possible to define custom variables within a specific class. We can also add metadata, such as data type, timestamp, log source, and data storage location, to enable data comprehension and aid in further analysis.
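To make this concrete, here is a minimal sketch of how such metadata could be attached to normalized records. The table name, column names, and metadata values below are hypothetical, not Adobe's actual schema:

-- Hypothetical sketch: expose metadata alongside the normalized base-class fields
SELECT
    source_ip,
    source_user_id,
    event_time,                                                     -- normalized timestamp
    'ipv4'                                AS source_ip_data_type,   -- data type metadata
    'aws_vpc_flow_logs'                   AS log_source,            -- originating log source
    's3://security-lake/network/2023/12/' AS storage_location       -- where the data is stored
FROM normalized_network_events;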
Establishing Field Nomenclature
Field nomenclature is vital in designing a security schema. In essence, field nomenclature refers to the naming conventions used for fields in a database or data lake. Consistent and meaningful field names are essential for effective correlation and analysis.
Here are some field nomenclature best practices that we follow at Adobe:
Field Mapping to Data Sources
In the table below, we’ve outlined a sample field-level mapping across various datasets, all sharing a common definition.
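As a rough illustration of this kind of mapping (the firewall field names below are hypothetical, and the tables are not Adobe's actual datasets), a normalization query can alias each dataset's native columns to the shared base fields:

-- Map each dataset's native field names to the shared base fields
-- AWS VPC flow logs
SELECT srcaddr  AS source_ip,
       srcport  AS source_port,
       action   AS event_outcome
FROM raw_vpc_flow_logs
UNION ALL
-- Firewall logs (illustrative field names)
SELECT src_ip   AS source_ip,
       src_port AS source_port,
       verdict  AS event_outcome
FROM raw_firewall_logs;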
Schematizing and Flattening Data
In this approach, the field normalization process consists of two main steps: Schematization and Flattening.
Schematization maps raw log fields onto the standardized schema, while flattening unnests nested structures into flat columns. Together, they normalize the data, ensuring consistency in field names and formats. This prepared data enables deep analysis, uncovering insights and patterns related to security incidents or anomalies.
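Here is a minimal sketch of both steps, assuming an engine that supports dot notation for nested fields (such as Spark SQL) and hypothetical table and column names:

-- Schematize raw events into base-class field names and flatten the nested
-- "source" object from the base class into top-level columns
SELECT
    event_time,
    source.ip            AS source_ip,             -- nested field flattened into a column
    source.user.id       AS source_user_id,
    source.user.fullName AS source_user_full_name,
    outcome              AS event_outcome          -- renamed to the normalized field
FROM raw_security_events;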
Table Structure in the Data Lake
Once the schema is ready, you can divide the tables by event aggregation; this division should be data-source-agnostic. All tables share common base fields, along with any table-specific fields.
Example: network logs from different sources, such as VPC flow logs, Azure logs, and Google Cloud Platform logs, can all reside in a common table. And because their fields are mapped to the base fields, we can write complex correlations across them.
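As a sketch of what such a table might contain (the exact column set below is illustrative, not Adobe's actual table definition):

-- Hypothetical data-source-agnostic network table:
-- common base fields first, then table-specific fields
CREATE TABLE network_master_table (
    event_time     TIMESTAMP,  -- common base field
    source_ip      VARCHAR,    -- common base field
    source_port    VARCHAR,    -- kept as VARCHAR to match the query example below
    source_user_id VARCHAR,    -- common base field from the "source.user" object
    destination_ip VARCHAR,
    event_outcome  VARCHAR,    -- e.g., ALLOWED / BLOCKED
    log_source     VARCHAR,    -- which product or cloud produced the event
    bytes_sent     BIGINT      -- table-specific field
);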
Writing Correlations on the Data Lake
Once you've loaded data into the tables, you can query them using your data lake's query syntax. From the user's perspective, they simply query network events; internally, all possible network sources are schematized into a single master table.
The following example query identifies network events, regardless of cloud or datacenter origin, focusing specifically on SSH events that were BLOCKED.
SELECT * FROM network_master_table
WHERE event_time BETWEEN '2023-12-01' AND '2023-12-10'
AND source_ip = '10.0.0.10'
AND source_port = '22'
AND event_outcome = 'BLOCKED';
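Because every table shares the same base fields, correlations across base classes become straightforward joins. As a hypothetical sketch (assuming an identity_master_table and the column names shown, which may differ from Adobe's actual tables), a query correlating blocked network events with identity events for the same user might look like this:

-- Correlate blocked network events with identity events for the same normalized user
SELECT n.event_time,
       n.source_ip,
       n.event_outcome,
       i.event_type AS identity_event_type
FROM network_master_table n
JOIN identity_master_table i
  ON n.source_user_id = i.user_id
WHERE n.event_time BETWEEN '2023-12-01' AND '2023-12-10'
  AND n.event_outcome = 'BLOCKED';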
Conclusion
Field normalization is crucial in enhancing security analysis by addressing the diversity in field names and data formats commonly found within security logs. By improving data integration, enhancing correlation, and reducing errors, field normalization facilitates more efficient data sharing among security teams. In turn, utilizing base classes, employing techniques such as schematization and flattening, and transforming raw logs all help enable more effective security analytics.