Guidance or examples on Classification #260

Open
robinho81 opened this issue Dec 17, 2024 · 6 comments

@robinho81

First of all excellent tool!

In some of the examples I have begun to build I see a lot of DS06 - Data Leak threats.

I see from the source code that the method hasDataLeaks() compares the classification of data in the dataflow sink and source to the classification of the data that is being transmitted, which seems to make sense.

Is there any guidance / examples / OWASP documentation on how one should model the classification in data and data flows?

I could contribute with an example when I get this figured out.

@raphaelahrens
Contributor

I really like this question. At first I thought the answer was obvious, but when I looked through my references I could not find a good short explanation of how to add classification labels to data or how to use these labels in threat modeling.
So I went looking for some material.

I found NIST IR 8496 Chapter 3 which describes the idea of classification of data.
This document even goes a step beyond the typical Public, Secret and Top Secret labels and recommends more complex labels like "Personally Identifiable Information" or GDPR.

What I always like to recommend is Chapter 8, Mandatory Access Control, by Fred B. Schneider. It describes the basic idea of multilevel security (MLS) and how the ordering of labels can be used for access control.
It uses a lot of math, but the basic idea is to allow access to a piece of data only if you can show that the data's label is less than or equal to the label of the person or system that wants to access the data.
If two labels are not ordered relative to each other, for example the labels GDPR and private key, then access is denied.
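
To make the ordering idea concrete, here is a minimal sketch, assuming a label is a numeric level plus a set of compartments. The `Label` class and the `may_read` helper are hypothetical names for illustration; they are not taken from Schneider's chapter or from pytm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """Hypothetical label: a numeric level plus a set of compartments."""
    level: int
    compartments: frozenset = frozenset()

    def dominates(self, other: "Label") -> bool:
        # self >= other only if the level is at least as high AND
        # self holds every compartment that other carries.
        return (self.level >= other.level
                and self.compartments >= other.compartments)


def may_read(subject: Label, data: Label) -> bool:
    # Access is allowed only when the data label <= the subject label.
    return subject.dominates(data)


gdpr = Label(2, frozenset({"GDPR"}))
private_key = Label(2, frozenset({"PRIVATE_KEY"}))
support_agent = Label(2, frozenset({"GDPR"}))

print(may_read(support_agent, gdpr))         # True
print(may_read(support_agent, private_key))  # False: the labels are not comparable
```

With compartments the labels form only a partial order, so two labels at the same level can still be incomparable, and an incomparable pair results in denied access.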

I looked into what is written on data classification and threat modeling and really could not find any good resources.
In "Threat Modeling: A Practical Guide for Development Teams" (page 110), @izar and M.J. Coles mention a label, a type and security requirements for the data, but they don't go into detail.
PASTA talks about data classification in stage one, but it does not go into how to obtain the classification.
At least not as far as I have looked into PASTA. I could never use it for work, so there may be guidance that I missed.
"Threat Modeling: Designing for Security" mentions an example of authorization requirements and briefly talks about classification, but says nothing on how to classify.

But sadly I found nothing about how labels can be used in threat modelling.

What is currently in pytm follows the simple MLS structure first. Classification is totally ordered and it uses the classical military style labels.

pytm/pytm/pytm.py

Lines 322 to 328 in 2a37d1e

class Classification(OrderedEnum):
    UNKNOWN = 0
    PUBLIC = 1
    RESTRICTED = 2
    SENSITIVE = 3
    SECRET = 4
    TOP_SECRET = 5

The rule in hasDataLeaks() assumes this total ordering as well.

pytm/pytm/pytm.py

Lines 1933 to 1939 in 2a37d1e

def hasDataLeaks(self):
    return any(
        d.classification > self.source.maxClassification
        or d.classification > self.sink.maxClassification
        or d.classification > self.maxClassification
        for d in self.data
    )
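
To see what this comparison does in practice, here is a self-contained sketch that mirrors the rule without importing pytm. The `Element`, `Data` and `Flow` classes and their snake_case attributes are simplified stand-ins invented for this example, and `IntEnum` is used in place of pytm's `OrderedEnum`.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Classification(IntEnum):
    # Stand-in for pytm's OrderedEnum-based Classification.
    UNKNOWN = 0
    PUBLIC = 1
    RESTRICTED = 2
    SENSITIVE = 3
    SECRET = 4
    TOP_SECRET = 5


@dataclass
class Element:
    name: str
    max_classification: Classification = Classification.UNKNOWN


@dataclass
class Data:
    name: str
    classification: Classification = Classification.UNKNOWN


@dataclass
class Flow:
    source: Element
    sink: Element
    max_classification: Classification
    data: list = field(default_factory=list)

    def has_data_leaks(self) -> bool:
        # Same comparison as the quoted hasDataLeaks rule.
        return any(
            d.classification > self.source.max_classification
            or d.classification > self.sink.max_classification
            or d.classification > self.max_classification
            for d in self.data
        )


web = Element("Web server", Classification.RESTRICTED)
db = Element("Database", Classification.SECRET)
password = Data("Password hash", Classification.SECRET)

flow = Flow(web, db, Classification.SECRET, [password])
print(flow.has_data_leaks())  # True: the web server is only labeled RESTRICTED

web.max_classification = Classification.SECRET
print(flow.has_data_leaks())  # False
```

In this sketch a leak is reported as soon as the source, the sink or the flow itself has a lower label than any data item it carries, so labeling the data without also labeling the elements still produces findings.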

I usually suggest that new threat modellers follow a similarly simple approach. First, pick a simple labeling, such as a number from 1 to 5 for how important a piece of data is.
Here, important can mean availability, integrity and/or confidentiality.
For small threat models, which of these matters most usually depends on the context.
The next step is to write down the data types that go over the different dataflows and that are held in the datastores, and to label them.
For the datastores, note that a database might hold much more data than what goes through the dataflows.
What I found important is that the model includes some form of classification for all the data, since with the classification it is easier to justify countermeasures.
With the data labels in place, it is then possible to set labels for the other elements in the model and to apply the principle of least privilege.
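
The steps above can be sketched as follows. The data names, element names and numbers are made up for illustration, and the rule "an element needs at least the highest label it handles" is one simple reading of least privilege, not something pytm computes for you.

```python
from collections import defaultdict

# Step 1: label each data type with an importance from 1 (low) to 5 (high).
data_labels = {
    "session token": 4,
    "order history": 3,
    "product catalog": 1,
    "payment details": 5,
}

# Step 2: write down which data travels over which dataflow (source, sink).
dataflows = {
    ("browser", "web server"): ["session token"],
    ("web server", "database"): ["order history"],
    ("web server", "cache"): ["product catalog"],
}

# A datastore may hold more than what the modeled dataflows carry.
stored = {"database": ["order history", "payment details"]}

# Step 3: the minimum label an element needs is the highest label it handles.
required = defaultdict(int)
for (source, sink), items in dataflows.items():
    for element in (source, sink):
        required[element] = max([required[element]] + [data_labels[i] for i in items])
for store, items in stored.items():
    required[store] = max([required[store]] + [data_labels[i] for i in items])

for element, label in sorted(required.items()):
    print(element, label)
```

Any element whose label in the model is higher than the value derived here is a candidate for tightening, and any element with a lower label is a place where a leak finding would be expected.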

But I think it would be good to have better guidance on data classification with regard to threat modelling.

While writing this I was wondering if the current implementation in pytm might be improved.

  1. I think Unknown should be the default, but having an Unknown classification of a type of data is a sign of an unfinished threat model.
  2. I wonder how difficult it would be to have a custom labeling instead of the default. Maybe even with a custom compare function.
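
As a rough idea of what the second point could look like, here is a sketch of a leak check that takes the labels and the compare function as parameters. Nothing here exists in pytm today: `make_ordered_compare`, `has_data_leaks` and the example company labels are all hypothetical.

```python
from typing import Callable, Iterable

# A compare function answers: may data labeled `data` be handled by an
# element or flow whose maximum label is `limit`?
Compare = Callable[[str, str], bool]


def make_ordered_compare(order: list) -> Compare:
    """Build a compare function from a custom, totally ordered label list."""
    rank = {label: i for i, label in enumerate(order)}
    return lambda data, limit: rank[data] <= rank[limit]


def has_data_leaks(data_labels: Iterable, limits: Iterable, allowed: Compare) -> bool:
    # A leak exists if any data label is not allowed under any of the limits
    # (for example the labels of the source, the sink and the flow).
    limits = list(limits)
    return any(not allowed(d, limit) for d in data_labels for limit in limits)


company_labels = ["public", "internal", "confidential", "strictly-confidential"]
allowed = make_ordered_compare(company_labels)

print(has_data_leaks(["confidential"], ["internal", "confidential"], allowed))  # True
print(has_data_leaks(["internal"], ["internal", "confidential"], allowed))      # False
```

Because the check only depends on the `allowed` callable, the same function would also work with a partially ordered labeling, where the callable returns False for labels that cannot be compared.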

Not sure how helpful my reply is to your question, but it made me think and I had to write this down.

@raphaelahrens
Contributor

Adding @nineinchnick to ask: do you remember why you chose this classification?

@nineinchnick
Collaborator

I remember I did some research, just to get some ideas about the level names, but I don't have any specific reference. It's very likely I got some ideas from Wikipedia, which should not be used as a reference anyway. I think they were supposed to be practical - offering a good enough range so it's easy to classify components of existing systems.

UNKNOWN is a bit controversial. By default, components should be classified with the highest possible value, but this would generate too many false positives, so having UNKNOWN as the default makes the whole classification opt-in. It should be more explicit: classification should be enabled for the whole model, and then we should not allow unknowns. Every component would need to be classified explicitly, or be treated as TOP_SECRET by default.
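
A sketch of that stricter behaviour, assuming a model-wide switch. The `resolve` helper and its `strict` and `default_to_highest` parameters are hypothetical and do not exist in pytm; the enum is a shortened stand-in.

```python
from enum import IntEnum


class Classification(IntEnum):
    # Shortened stand-in for pytm's Classification.
    UNKNOWN = 0
    PUBLIC = 1
    TOP_SECRET = 5


def resolve(labels: dict, strict: bool, default_to_highest: bool = False) -> dict:
    """Return the labels to use for the model (hypothetical helper)."""
    if not strict:
        # Opt-in behaviour: UNKNOWN is tolerated and left as is.
        return dict(labels)
    unknown = sorted(n for n, c in labels.items() if c is Classification.UNKNOWN)
    if unknown and not default_to_highest:
        # Strict mode: every component must be classified explicitly.
        raise ValueError("unclassified components: " + ", ".join(unknown))
    # Alternative strict mode: treat unclassified components as TOP_SECRET.
    return {
        n: (Classification.TOP_SECRET if c is Classification.UNKNOWN else c)
        for n, c in labels.items()
    }


model = {"web": Classification.PUBLIC, "db": Classification.UNKNOWN}
print(resolve(model, strict=True, default_to_highest=True)["db"].name)  # TOP_SECRET
```

Calling `resolve(model, strict=True)` without the fallback raises a `ValueError` that names `db`, which matches the idea that an unknown classification marks an unfinished model.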

I'd always recommend that pytm should be extensible. This is how I was using it, working with a private extension (not a fork), and contributing more valuable overrides upstream.

@nineinchnick
Collaborator

Also, an important note - I've been using a custom threats database. There's no strong relationship between the default database provided by pytm and the properties used in building the model. IMO, all threats should be very well-defined in a particular domain, there's no generic database that would apply everywhere.

@robinho81
Author

Thanks a lot for the detailed responses. I managed to start classifying the data involved in my data flows, and this led to a lot of the "data leaks" being resolved, but not all, which I think was quite a good result. Also, I see that classifying the data involved in our application is actually a very important first step in threat modelling.

Thanks for bringing my attention to the threat database. When I started out I had not considered the need to customise this for my domain.

Is there still merit in new users creating a simple threat model with the default database? I suppose that this [using the default threat database] means the reporting is a bit verbose?

@nineinchnick
Collaborator

Building the model itself is a good exercise to mark trust boundaries. We reviewed the model to identify the possible threats and then made sure they were captured in the threat database, with correct rules. Then we either saw the same threats applied elsewhere or, if there were false positives, we had to make the conditions stricter or, in the extreme case, add more properties in pytm to help differentiate.
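
As an illustration of tightening a condition, here is a small standalone sketch. The component dictionaries, the property names and the `matches` helper are invented for this example and do not reflect the format of pytm's threat database.

```python
# Each component is described by a few properties (hypothetical names).
components = [
    {"name": "public site", "stores_credentials": False, "is_encrypted": False},
    {"name": "user db", "stores_credentials": True, "is_encrypted": False},
    {"name": "vault", "stores_credentials": True, "is_encrypted": True},
]


def matches(condition, items):
    """Return the names of all components the threat condition applies to."""
    return [c["name"] for c in items if condition(c)]


def loose(c):
    # First attempt: any unencrypted component is flagged.
    return not c["is_encrypted"]


def strict(c):
    # Stricter rule: an extra property separates real hits from noise.
    return c["stores_credentials"] and not c["is_encrypted"]


print(matches(loose, components))   # ['public site', 'user db']
print(matches(strict, components))  # ['user db']
```

Here the `stores_credentials` property plays the role of the extra property that helps differentiate, and the stricter condition drops the false positive on the public site.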
