[WIP] refactor DataAccess to DataSource #1967
base: dev
Conversation
@kshalot Opened for early feedback. Currently I did a rough refactor to control access to project files through direct access (files on the server) and the research environment (GCS). I wanted to see if I am on the right track or totally off course :D Please review the commits serially, otherwise it might be confusing to review.
Force-pushed from a7a7f42 to 4f8686c (compare)
A few initial comments
```python
projects_with_research_environment = DataSource.objects.filter(
    access_mechanism=DataSource.AccessMechanism.RESEARCH_ENVIRONMENT,
    project=self).values_list('project_id', flat=True)
```
I don't think this is the way we want to go - let's not make any assumptions regarding which data source is expected. Right now we still operate under the assumption that being authorized to view a project's files means that you can access all of its data sources. Therefore, the `accessible_by` and `is_authorized` methods should simply return the list of accessible projects (or `True` in the case of `is_authorized`). We can then use the `DataSource` relation to choose only the relevant datasets.

So research environments would get the list of all accessible projects using a generic method and simply filter out the irrelevant ones. We could also create more methods, e.g.

```python
def can_access_directly(): ...
def can_create_research_environment(): ...
```

This will be even more apparent if we decide to refactor the authorization to, for example, the access grant model (see #1927 (comment)).
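A minimal, plain-Python sketch of the split described above (the mechanism strings and the shape of `PublishedProject` are illustrative stand-ins, not the actual Django models):

```python
# Sketch: keep is_authorized generic, layer mechanism-specific helpers on top.
# The mechanism strings and this stand-in class are assumptions for illustration.

class PublishedProject:
    def __init__(self, mechanisms):
        # Set of access-mechanism strings derived from related DataSource rows
        self.mechanisms = set(mechanisms)

    def is_authorized(self, user):
        # Generic authorization only: event access, DUA, credentialing, etc.
        # (placeholder: assume the generic checks pass)
        return True

    def can_access_directly(self, user):
        # Direct file access is just "authorized AND a direct source exists"
        return self.is_authorized(user) and 'direct' in self.mechanisms

    def can_create_research_environment(self, user):
        return (self.is_authorized(user)
                and 'research-environment' in self.mechanisms)
```

The point of the split is that the generic check never inspects sources; only the thin wrappers filter on the `DataSource` relation.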
Which case is this covering? Why do we need to check whether a source exists to check whether a project is accessible by a user? This just makes this method less generic, since it only makes sense in case of HDN.
I think my previous point still stands - we should only check access mechanisms here (event, DUA, credentialing etc.) and not sources.
Looking at the `can_create_research_environment` part of the code, this looks like a leftover from some old logic.
Yes, I agree: this method should be generic, and we should create separate methods that use `accessible_by` and `is_authorized` to check access via a specific data access mechanism.

I see your point; the `accessible_by` method should only give us the projects that the user has access to. This method should not care what access mechanisms are set for the project. Also, I appreciate your question; it is super helpful to think about the design, and I will remember to ask myself the same while working on other functions.
Thank you @kshalot, super appreciate the feedback. Let me get on it right away.
@amitupreti I left a few comments. The base looks good, but what I think would be very helpful here is to see how this new `DataSource` would play into the existing codebase, i.e. trying to use this model to control whether a project's files are visible etc. (instead of the way it is done now, which uses the `allow_file_downloads` boolean).

My comments might be a bit all over the place - if you encounter any trouble planning the next steps here then let me know. We can jump on a call and figure this out together. Whatever works for you!
```python
if not settings.ENABLE_CLOUD_RESEARCH_ENVIRONMENTS:
    self.fields['access_mechanism'].choices = [
        choice for choice in self.fields['access_mechanism'].choices
        if choice[0] != 'research-environment']
```
I'm wondering whether we want to care about this being enabled or not... It makes sense and it would be my first instinct to do so, but right now it seems that research environments are singled out as something that can be present or not.
The HDN deployment is not integrated with any data source other than local/research environment, so in that case it makes no sense to show anything else. Research environments are the only ones that have an explicit setting to disable them, but still.
I guess it doesn't hurt to have this, it just got me thinking!
I can see what you mean. It makes sense for an HDN-like deployment; we might just want to have the local/research environment and disable other stuff.

From what I understand, we might want to allow deployments to have any combination of access mechanisms, so what if we have a configurable env var like `ACCESS_MECHANISMS = 'research-environment,direct,aws-s3'` and use that?

Or maybe, since this is something only a project editor or console admin can do, we might be okay with showing them all the options and trusting them to use the right ones.
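The configurable-list idea could be sketched like this; the `ACCESS_MECHANISMS` name and its comma-separated format come from the suggestion above and are not anything that exists in the codebase:

```python
# Hypothetical ACCESS_MECHANISMS setting (comma-separated), as proposed above.
# Everything here is illustrative, including the choice values.
ACCESS_MECHANISMS = 'research-environment,direct,aws-s3'

def allowed_mechanism_choices(all_choices, setting=ACCESS_MECHANISMS):
    """Keep only the form choices enabled for this deployment."""
    enabled = {m.strip() for m in setting.split(',') if m.strip()}
    return [choice for choice in all_choices if choice[0] in enabled]

form_choices = [('direct', 'Direct'),
                ('aws-s3', 'AWS S3'),
                ('google-bigquery', 'Google BigQuery')]
```

This would generalize the `ENABLE_CLOUD_RESEARCH_ENVIRONMENTS` special case: instead of one mechanism having an on/off switch, every deployment declares its own allowed set.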
```python
project = models.ForeignKey('project.PublishedProject',
    related_name='data_sources', db_index=True, on_delete=models.CASCADE)
files_available = models.BooleanField(default=False)
```
I'm going to go back to this to give it some more thought - the `files_available` boolean is dubious now, seeing it in practice. It's a bit confusing what it's supposed to achieve unless you were part of the previous conversations 🤔

I think it's going to be clearer once we see it actually used in practice, i.e. once we replace the current `DataAccess` with `DataSource`. Originally the idea of this boolean was to denote the source that gives us access to the file browser. I think at this point we could not care about this and assume only the `direct` source gives us direct file download access. We could easily go back and make it more generic in the future.

The point is that in theory the GCS source could also give us access to files, since we are using buckets in HDN to store the projects, with the exact same UX as PhysioNet has for direct upload. But right now the mechanisms are tightly coupled with their specific use cases, so maybe we could just not care about it for now. This would mean we wouldn't initially need the `files_available` bool.
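Dropping `files_available` under the "only `direct` gives file access" assumption could look like this (plain stand-ins, not the real models; the `'direct'` string is an assumed choice value):

```python
# Stand-in for the DataSource model; only access_mechanism matters here.
class FakeDataSource:
    def __init__(self, access_mechanism):
        self.access_mechanism = access_mechanism

def has_file_download_access(data_sources):
    """Show the file browser/downloads iff a direct source exists,
    replacing the files_available boolean discussed above."""
    return any(s.access_mechanism == 'direct' for s in data_sources)
```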
Thank you. This is super helpful.
Force-pushed from 4f8686c to 17c3791 (compare)
Hi @kshalot, I abstracted the creation of the data source on project publish into a separate class, which is called after the project is published. Regarding the …
@amitupreti I left a few comments. I think we are generally going in the right direction.

Correct me if I'm wrong though - this is still not integrated with the project page, right? I.e. it does not control whether the files tab is shown, whether the S3 message is shown, etc.?

I think you should at least integrate the direct access with the file viewer so we can see it in action. This is the most important part, because the file viewer is not part of the current `DataAccess` model (the one which you are refactoring into `DataSource`).
```python
    Controls all access to project data.
    """
    class DataLocation(models.TextChoices):
        DIRECT = 'DI', 'Direct'
```
Is there any advantage to using abbreviations like `DI` or `AS3` over `DIRECT`, `AWS_S3`, `AWS_OPEN_DATA` etc.? I guess it will save a teeny-tiny amount of space, but it makes things a bit incomprehensible when looking at it from the data's perspective.
```python
if self.data_location == self.DataLocation.GOOGLE_BIGQUERY:
    if self.access_mechanism != self.AccessMechanism.GOOGLE_GROUP_EMAIL:
        raise ValidationError('Google BigQuery data sources must use the Google Group Email access mechanism.')
    if not self.email:
        raise ValidationError('Google BigQuery data sources must have an email address.')
elif self.data_location == self.DataLocation.GOOGLE_CLOUD_STORAGE:
    if self.access_mechanism != self.AccessMechanism.GOOGLE_GROUP_EMAIL:
        raise ValidationError('Google Cloud Storage data sources must use the Google Group Email access '
                              'mechanism.')
    if not self.uri:
        raise ValidationError('Google Cloud Storage data sources must have a URI.')
elif self.data_location == self.DataLocation.AWS_OPEN_DATA:
    if self.access_mechanism != self.AccessMechanism.S3:
        raise ValidationError('AWS Open Data data sources must use the S3 access mechanism.')
    if not self.uri:
        raise ValidationError('AWS Open Data data sources must have a URI.')
elif self.data_location == self.DataLocation.AWS_S3:
    if self.access_mechanism != self.AccessMechanism.S3:
        raise ValidationError('AWS S3 data sources must use the S3 access mechanism.')
    if not self.uri:
        raise ValidationError('AWS S3 data sources must have a URI.')
elif self.data_location == self.DataLocation.DIRECT:
    if self.email:
        raise ValidationError('Direct data sources must not have an email address.')
    if self.uri:
        raise ValidationError('Direct data sources must not have a URI.')
else:
    raise ValidationError('Invalid data location.')
```
This is very hard to follow. You could approach it differently - define exactly what is required for each data source and have a single piece of logic that validates it:

```python
required_fields = {
    self.DataLocation.DIRECT: [],
    self.DataLocation.AWS_S3: ["uri"],
    ...
}
forbidden_fields = {
    self.DataLocation.DIRECT: ["uri", "email"],
    ...
}

for required_field in required_fields[self.data_location]:
    if not getattr(self, required_field):  # Or something like that
        raise ValidationError("...")

for forbidden_field in forbidden_fields[self.data_location]:
    if getattr(self, forbidden_field):  # Or something like that
        raise ValidationError("...")
```
Same goes for data access:

```python
data_location_access_mechanisms = {
    self.DataLocation.DIRECT: None,
    self.DataLocation.AWS_S3: self.AccessMechanism.AWS_S3,
    ...
}
```
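Putting the two sketches together, a self-contained version of that table-driven validation might look like the following. This uses plain strings instead of Django `TextChoices` and a stand-in `ValidationError`; the per-location tables cover only a few assumed locations for illustration:

```python
class ValidationError(Exception):
    pass

class Source:
    """Minimal stand-in for the DataSource model."""
    def __init__(self, data_location, access_mechanism=None, uri=None, email=None):
        self.data_location = data_location
        self.access_mechanism = access_mechanism
        self.uri = uri
        self.email = email

# Assumed choice strings; the real model would key these on TextChoices members.
REQUIRED_FIELDS = {
    'direct': [],
    'aws-s3': ['uri'],
    'google-bigquery': ['email'],
}
FORBIDDEN_FIELDS = {
    'direct': ['uri', 'email'],
    'aws-s3': [],
    'google-bigquery': [],
}
EXPECTED_MECHANISM = {
    'direct': None,  # no mechanism constraint for direct sources
    'aws-s3': 's3',
    'google-bigquery': 'google-group-email',
}

def validate(source):
    """Single piece of logic replacing the long if/elif chain."""
    if source.data_location not in REQUIRED_FIELDS:
        raise ValidationError('Invalid data location.')
    expected = EXPECTED_MECHANISM[source.data_location]
    if expected is not None and source.access_mechanism != expected:
        raise ValidationError(
            f'{source.data_location} sources must use the {expected} mechanism.')
    for field in REQUIRED_FIELDS[source.data_location]:
        if not getattr(source, field):
            raise ValidationError(f'{source.data_location} sources must have a {field}.')
    for field in FORBIDDEN_FIELDS[source.data_location]:
        if getattr(source, field):
            raise ValidationError(f'{source.data_location} sources must not have a {field}.')
```

Adding a new location then means adding one row to each table rather than another `elif` branch.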
physionet-django/project/utility.py (outdated)
```python
def __init__(self, **kwargs):
    self.data_location = kwargs.get('data_location', None)
    self.files_available = kwargs.get('files_available', None)
    self.email = kwargs.get('email', None)
    self.uri = kwargs.get('uri', None)
    self.access_mechanism = kwargs.get('access_mechanism', None)
```
Hmm, why did you opt for keyword arguments here? Why not:

```python
def __init__(self, data_location=None, files_available=None, email=None, uri=None, access_mechanism=None):
    self.data_location = data_location
    self.files_available = files_available
    self.email = email
    self.uri = uri
    self.access_mechanism = access_mechanism
```
physionet-django/project/utility.py (outdated)

```python
        uri=self.uri,
    )

@staticmethod
```
I know `self` is not needed here, but this is called on an instance everywhere (see your uses of this method - you first instantiate the class). Maybe `@staticmethod` is not necessary here.
physionet-django/project/utility.py (outdated)

```python
    files_available=True,
    data_location=DataSource.DataLocation.DIRECT,
)
elif (settings.DEFAULT_PROJECT_ACCESS_MECHANISM == DataSource.DataLocation.RESEARCH_ENVIRONMENT
```
That's too many conditions IMO; it should be extracted somewhere. It's also not clear why the condition is there in the first place, i.e. why we need all 3 settings to be set to a certain value.

If the settings are that coupled, then maybe we should opt for a different setup. Have one environment variable, e.g. `RESEARCH_ENVIRONMENT_DEPLOYMENT`, that will set all three (`DEFAULT_PROJECT_ACCESS_MECHANISM`, `DEFAULT_PROJECT_DATA_LOCATION` and `STORAGE_TYPE`) if set to `True`, and if set to `False` it will set the two from the condition above.
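As a settings-module sketch of that single-switch idea (only the three setting names come from the discussion; every value assigned in each branch is an illustrative guess, not the project's actual choice strings):

```python
import os

# One hypothetical switch driving the three coupled settings, instead of
# configuring them independently. Branch values are illustrative guesses.
RESEARCH_ENVIRONMENT_DEPLOYMENT = (
    os.environ.get('RESEARCH_ENVIRONMENT_DEPLOYMENT', 'False') == 'True')

if RESEARCH_ENVIRONMENT_DEPLOYMENT:
    DEFAULT_PROJECT_ACCESS_MECHANISM = 'research-environment'
    DEFAULT_PROJECT_DATA_LOCATION = 'google-cloud-storage'
    STORAGE_TYPE = 'GCP'
else:
    DEFAULT_PROJECT_ACCESS_MECHANISM = 'direct'
    DEFAULT_PROJECT_DATA_LOCATION = 'direct'
    STORAGE_TYPE = 'LOCAL'
```

This keeps the invariant ("all three settings agree") in one place instead of re-checking it with a multi-clause condition at every call site.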
I think I had added that part of the integration, i.e. using DataSource to control access in the project. It's kind of chaotic right now. I will work on it and update the code. Thanks again.
Here we are creating the DataSource model, which will work exactly like DataAccess but will let us manage all project-related access through this model. This will make it easier in the future to manage access, as everything will be centralized here. Similarly, as discussed in MIT-LCP#1927, we can use this model to work with `AccessGrant`.

Context: The goal here is to refactor so that access is controlled through DataSource without changing the existing logic. The existing logic is: if the project has the research environment enabled, the user should not be able to view/download files (they can only create a research environment and play with the data there). In this commit, we divided the access into two sections (first for RE, then the rest). The outcome should be the same as before: if RE is enabled, show `published_project_denied_downloads.html`, else show the normal file access stuff.

Replaces allow_file_downloads with can_view_files (this controls whether a user has download access to the project by using the DataSource). The idea is that all the direct download stuff will be inside ``` {% if can_view_files %} ``` and the other access methods like GCP and AWS will be outside this block. So the logic will be: 1. If the project is an RE project, show the RE warning message. 2. Else: 2.1 If the project has other access outside RE and direct download, show that. 2.2 If the project has direct download access, then show the direct download stuff. To make it easier to understand the changes, I am breaking down the commits into very small chunks. Note that 2.1 and 2.2 don't have to be either/or (currently a project can be directly accessible and also be accessible via GCS, AWS, etc.).

The p and h5 text should appear for both direct download and the other GCP/AWS access mechanisms, so bringing them outside the can_view_files if block.

Move the other access method (GCP) outside of the can_view_files if block. The logic before this commit was: if a project does not have an access policy but has project.gcp, show the GCP stuff. In this commit, we have the same logic.

Clean up an unnecessary/empty else statement.

Move the data_access stuff outside the if can_view_files block. The current logic was: if a user can download the project files, check data_access and show the access method details for the ones that exist. So it should be safe to bring this block outside the if can_view_files block. Note that, currently, for GCP and AWS we are still using the old data_access methods. We will refactor those in a future commit (the current series of commits refactors the direct download while maintaining the same logic).

When a project is published, we should always create a DataSource object based on the project storage type, because with DataSource all access will go through it; otherwise the files won't be accessible. This commit creates the DataSource automatically when a project is published. The type of DataSource created depends on the settings. For PhysioNet, the default data storage is direct; for HDN it is GCP (defined by STORAGE_TYPE) and the access mechanism is Research Environment. For auto-creation of the DataSource, I added 2 new environment variables to control the location and access_mechanism of the DataSource. Note: DataSourceCreator doesn't belong in access.py; I had to keep it there for a short time, until we refactor the authorization into a separate app, to avoid a circular import.
Force-pushed from b655aca to 3a3752a (compare)
Context:
As the first step to solving #1927, we want to refactor the current code to centralize all the project data access mechanisms through DataSource. The goal of this PR is to refactor the current project to use DataSource without changing the access logic.

List of expected changes:
1. Create the DataSource model
2. Refactor code to use DataSource and control all access mechanisms through DataSource
   2.1 Direct access (files stored on the server)
   2.2 Research environment access (all other access disabled; users can only create the research environment)
   2.3 GCP, AWS (WIP)
3. Create a migration script to migrate the access location from DataAccess to DataSource for all existing projects (pending), for:
   a. direct access
   b. RE
   c. AWS, GCP
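For item 3, the value mapping such a migration would need might be sketched as below; every concrete string here is an assumption based on this PR's discussion, not the actual DataAccess schema:

```python
# Assumed mapping from old DataAccess platform values to new DataSource
# (data_location, access_mechanism) pairs. All strings are illustrative.
PLATFORM_TO_SOURCE = {
    'aws-open-data': ('aws-open-data', 's3'),
    'aws-s3': ('aws-s3', 's3'),
    'google-bigquery': ('google-bigquery', 'google-group-email'),
    'google-cloud-storage': ('google-cloud-storage', 'google-group-email'),
}

def data_access_to_source_kwargs(platform, location):
    """Build the kwargs for one new DataSource row from one old DataAccess row."""
    data_location, access_mechanism = PLATFORM_TO_SOURCE[platform]
    return {
        'data_location': data_location,
        'access_mechanism': access_mechanism,
        'uri': location,
    }
```

A Django data migration could then loop over the old rows and call `get_or_create` with these kwargs, with direct/RE projects handled from the project storage type as discussed in the commits above.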