-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Commit Networks #263
Add Commit Networks #263
Conversation
Currently the part in the showcase is still missing, I will look into where and what to add there. |
Currently the networks contain very little information, because most information typically is
Because in this case the commits are the vertices of the network, that information is not portayed anywhere - meaning the date, author information and commit-id have to be looked up seperately from the network using the commit hash. Should I include more vertex attributes that also contain this information or some other functionality to more easily retrieve this data? |
In the network construction for the commit cochange network, respect.temporal.order is currently always set to 'TRUE'. This matches the construction of the artifact cochange network, that's why I did the same. Should that be the case? |
I am currently removing columns from the 'commit.net.data' in 'get.commit.network.cochange' because they contain commit information that should not be on the edges in the commit network. The function 'construct.edge.list.from.key.value.list' returns columns that are not wanted/needed for the commit network construction. The question is, should I remove the lines as I currently do, or should I add a boolean parameter to the function, similar to the existing 'artifact.edges' parameter? |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #263 +/- ##
==========================================
+ Coverage 80.24% 80.89% +0.65%
==========================================
Files 16 16
Lines 4905 5036 +131
==========================================
+ Hits 3936 4074 +138
+ Misses 969 962 -7 ☔ View full report in Codecov by Sentry. |
@Leo-Send Could you please rebase your branch onto the current |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall thank you @Leo-Send !
I have a couple of points that need addressing but its nothing too bad.
'get.commit.network' will delegate calls to corresponding methods, depending on 'commit.relation' config parameter in NetworkConf Signed-off-by: Leo Sendelbach <[email protected]>
functions 'get.commit.network.cochange' and 'get.commit.network.commit.interaction' are called in 'get.commit.network'. Also add 'group.commits.by.data.column', a helper function used in constructing the cochange commit network. Signed-off-by: Leo Sendelbach <[email protected]>
Also add first test for commit-interaction based commit network and fixed a minoir error in network creation Signed-off-by: Leo Sendelbach <[email protected]>
Initializing vertex kind to 'TYPE.COMMIT' in the correct position Signed-off-by: Leo Sendelbach <[email protected]>
Tests for each artifact type, parameterized for directed attribute Signed-off-by: Leo Sendelbach <s8lesendqstud.uni-saarland.de>
Commit Network now also built when calling function 'get.networks'. Signed-off-by: Leo Sendelbach <[email protected]>
show how to construct commit network in showcase. Also fixed bug that resulted in showcase crashing. Signed-off-by: Leo Sendelbach <[email protected]>
In this process, also refactor 'construct.edge.list.from.key.value.list' method. Some more comments might be necessary. Signed-off-by: Leo Sendelbach <[email protected]>
New function allows adding vertex attributes from commit data to commit network vertices Signed-off-by: Leo Sendelbach <[email protected]>
'add.vertex.attribute.commit.network' is now used in showcase. Also minor changes to documentation and performance improvement in cochange commit network creation. Signed-off-by: Leo Sendelbach <[email protected]>
attribute 'date' added to cochange commit network edges, attribute artifact.type added to all networks based on commit interactions Signed-off-by: Leo Sendelbach <[email protected]>
I have Identified the issue with the showcase: since we manually add the column 'artifact.type' to dataframes when constructing commit interaction based networks, it breaks when the existing data is empty. The empty data was not an issue for the showcase before; there were just warnings but since we do not do anything with the networks anyway, everything worked. Now, I see two ways to fix this problem:
Which would you prefer @hechtlC ? |
I don't know what @hechtlC prefers, but here a few thoughts from my side on your two suggestions:
I don't fully understand what this column is for. Is it just used for network construction or is it also available to the end user? If it is available for the end user, I would like to make sure that the expected columns are there, independent of empty data or not. If it is just used internally for network construction and does not have any effect on the resulting network, then a check for empty data before adding a column might make sense (but I am not sure).
Regarding adding commit interaction data: I am not sure. Is it just an issue of the non-existent file, or of the non-existent data (as it would also be in an empty file)? Missing data should not lead to failing network construction. So, even when adding more test data would solve the problem, I don't think that this should be considered as a fix for the underlying problem. |
I know what you mean and will add the column to the empty data frames. I will then change the already existing part that is used for setting the correct value to fill this new column. On another note:
This is the error message. It occurs in line 230 in my code, in the lapply (see below) there - even after uncommenting all changes I made to the showcase. Manually checking the type of the cf.data, it says it doesn't know the type because of infinite recursion... Any Ideas what I can do there?
EDIT: |
For some reason, adding the line
breaks the splitting, meaning that the content of 'cf.data' after
is total gibberish. This seems to be a problem in the splitting - I believe commit interaction data should be excluded from any splitting as it is not annotated with the relevant data for splitting anyway. |
I agree with @Leo-Send: commit-interaction data should not be split. Therefore, commit-interaction data should be considered as additional data sources that are ignored by splitting and added to each range data object as a whole, such as pasta data or author data. According to the following line, commit-interaction data are correctly categorized as additional data sources: Lines 1895 to 1902 in 74ebe0b
So, there might be any other location where commit-interaction data are considered to be in the wrong category... Maybe there is a typo somewhere (using the plural version instead of the singular, or something similar leads to be not recognized correctly...) Is this something that has worked correctly before your rebase today? Then we might find the problem in the previous PR(s) that have been merged on If you could trace down the problem until where splitting breaks, we could have a closer look at that particular part of the code. Our expert for splitting is @MaLoefUDS, maybe he also has an idea on what's going on there? |
Added linebreaks, fixed spelling, removed cbind Signed-off-by: Leo Sendelbach <[email protected]>
I am even more confused now - checking older commits I found that adding this line to the showcase has broken the splitting since commit interactions were added. But I added the commit network using commit interactions to the showcase a month ago - and I am very sure that I tried it and the CI also worked (apart from the old R version that is now removed). Since I rebased and force-pushed, I do not now how to access the CI run reports from old commits, so I cannot confirm it currently. EDIT: EDIT 2: |
As already said yesterday: Commit-interaction data are already in the correct category that is excluded from splitting, as the following snippet shows: Lines 1895 to 1902 in 74ebe0b
So, either there is some typo somewhere in the used spelling variants of "commit interaction(s)", or there is something else going on that is totally weird. Again, if you can provide the exact line at which it breaks, we can have a look why this line is executed at all. |
Okay so i found some weird stuff going on. As I am not an expert on how splitting works, please correct me if I am misunderstanding something. For this bug to occur in the showcase, you need to copy the 'commit.ineractions.yaml' from the testing/test_proximity/proximity folder to the testing/test_feature/feature and subsequent 002-v2-v3 folder. If you want to replicate my 'testing without commit interactions in showcase', comment out line 78 and change the value in line 69 to "cochange".
EDIT: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR @Leo-Send. I have reviewed your changes (except for the tests and showcase) and found a couple of minor inconsistencies or cases in which I am not sure about whether the current implementation is consistent or not. Please see my detailed comments below. Please don't see every comment as a need to change something - in some cases I just would like you to check similar cases in the existing implementation before deciding on whether we should change something or not.
In addition, could you please rebase the last two commits?
- The second sentence of the commit message of "Add 'artifact.type' to commit interaction data" contains the word "until" twice; and I also don't understand the sentence. Could you please rephrase that?
- Fix crash in showcase and add update tests: I guess this commit will not be present in the final PR any more, but I would like to mention it though: First, the "add update tests" part sounds broken. Second, the commit message does neither describe what's going on there, nor what the problem is that is fixed by this commit. "Crash in showcase" could be everything, but the problem is not the showcase, but something else in coronet that causes the showcase to crash ;) But I guess this will change anyway after we have figured out the actual problem there.
In general, it should not make a difference whether you use a RangeData or a ProjectData object. But maybe there is some inconsistency regarding caching additional data sources. I suggest to check how the different additional data sources are handled? Maybe there is a different behavior because commit-interaction data are linked to the commit data, which is not the case for author data, for example. Maybe the connection between commit data and commit-interaction data leads to this infinite loop? Maybe you could also check what's different to how PaStA data are handled (but, if I am not mistaken, the relationship between PaStA and commit data is the other way round, which might not trigger the problem...) Unfortunately, I have no further ideas. |
Hey, sorry for the late response. I have some nice insight on the issue regarding the splitting in
First of all, this does have nothing to do with the selected data source or if commit interactions are present (as Leo already mentioned). Interestingly this broken application has been in The reason for the break is that for some data sources, some of their elements are before the predefined bins by which we split. When we split by given bins, we use the However, the |
Good catch @MaLoefUDS! Thanks for looking into this! I agree with you that splitting should not break when there are elements that are not part of a bin. I guess that this can happen in two different ways: Prior to the first bin, and after the last bin. Do we already handle the case "after the last bin" correctly? If so, we could try to handle "prior to the first bin" similarly. If we don't handle either of them up until now, I suggest to open a new issue for that and discuss potential solutions there, not to mess up this PR more than we currently already do... @Leo-Send: Independent of the splitting bug that @MaLoefUDS has identified, the infinite recursion seems to be a different problem that also needs to be fixed... Is there a way to fix this problem independent of the splitting bug, or do the two problems depend on each other? That is, does the infinite recursion occur depending on the splitting bug, or does it occur independently? |
I found the reason for the infinite recursion and it does not seem to be directly related to what @MaLoefUDS found, although I believe it both originates from the problem that we do not verify the bins when splitting. This infinite recursion is supposed to be stopped by checks if the data source is already cached - the method set.commits first sets the field for storing commits and then calls update.commit.interactions, thus does not get calledin update.commit.interactions again, as commits are cached. This does not, however, work if the commit data is empty. This specific call to splitting that is in the showcase results in one RangeData to have empty commit data, thus leading to this recursion. I see two ways to handle this:
|
Regarding fixing this problem of broken splitting which may or may not be the sole cause of Leo's problems, I have two possible fixes in mind. But first of: Currently we do not have any problems with data elements that lie beyond the last bin. The bins we provide are always just the first date bound. The next following bin is then the last date bound, while also being the first date of its own new bin. In the current implementation, the last bin is then holds all elements from this point on until infinity. Obviously, we cannot simply include all elements before the first bin into the first bin as well, as this would lead to unintuitive behavior (i.e., what would it mean to provide only one bin when splitting?). Here are my ideas for fixes:
I don't have any preference on this, since I am not sure which idiomatic values of coronet are at stake 😅 |
Now it is getting confusing here. We discuss two different problems and both of you, @Leo-Send and @MaLoefUDS, have suggested two potential solutions for one of the two problems... Let's start with the infinite recursion spotted by @Leo-Send:
Option 1. is not a valid solution. There always can be time ranges without any commit and we still want to keep this range. Option 2. sounds promising. When choosing option 2, we also need to check whether there are other cases which lead to a similar problem. Either we are able to find a common solution that handles all these cases, or we need to handle each of them separately. Maybe you can think about that until our meeting, and then let's discuss the potential options in our meeting. Regarding the problems occurring during splitting @MaLoefUDS: We need to figure out in which situations the elements before the first bin are actually needed. This is the key question to this problem, and I don't have an answer to this question. If there are no such situations, we could potentially remove them. Otherwise we need to find a way to keep them. But first we need to figure out whether there is any use case in which the elements before the first bin are relevant. As this problem seems to be independent of @Leo-Send's problem (as the infinite recursions can occur on any empty commit data), let's continue the discussion on splitting in a separate issue. |
I've just transferred the comments that discuss the splitting problem into a new issue #267. Please continue discussing the splitting problem there. Let's stick to the changes of this PR and the discussion of the infinite recursion here. |
Networks based on commit interaction data now correctly have an edge attribute called 'artifact.type'. Value of column 'artifact.type' in commit interaction data is 'CommitInteraction' until potentially overwritten in artifact network construction Signed-off-by: Leo Sendelbach <[email protected]>
Add check for calling function in the beginning of 'update.commit.interactions'. Also contains minor fixes to adress PR comments and updates tests to reflect changes made in previous commit. Signed-off-by: Leo Sendelbach <[email protected]>
Include this PR's changelog in the NEWS.md Add constant for commit interaction artifact type Move check for avoiding infinite recursion to the correct position and add commentary Signed-off-by: Leo Sendelbach <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a look at README, NEWS, and the most recent changes @Leo-Send.
Please see my comments below - there is one line in which we should make use of the new constant in the code; and there are some missing items in the README.
Everything else are just styling issues (missing spaces etc.).
Minor changes in response to reviews. Also added a use for constant `ARTIFACT.COMMIT.INTERACTION` that was previously overlooked. Signed-off-by: Leo Sendelbach <[email protected]>
Prerequisites
showcase.R
with respect to my changes.dev
.Description
Add a new type of network to coronet: Commit Network
The commit network uses commits as vertices. The edges are based on commit interactions or cochange.
Changelog
Added
Changed7Improved
construct.edge.list.from.key.value.list
to be more readable(PR Add Commit Networks #263, 05c3bc0)Fixed
artifact.type
(PR Add Commit Networks #263, 849123a)