Text in FileObject.size_in_bytes Field Causes Error #278

Open
brlogan opened this issue Nov 11, 2015 · 5 comments

Comments


brlogan commented Nov 11, 2015

I have come across many STIX documents that include units in the FileObject.size_in_bytes field (despite its name). For example, I have a bunch of STIX documents with size_in_bytes values like "123 bytes". Because python-cybox tries to convert this value to a long, I get an "invalid literal for long() with base 10" error. From a TAXII/STIX perspective, it's annoying to have an entire STIX package fail because of this.

I'm not sure what the "right" answer to this problem is, but even something as simple as stripping out "bytes" or "B" would be helpful. If you wanted to get fancy, you could also do conversions for things like "KB", "MB", etc.
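A minimal sketch of the suggestion above, written as a standalone helper rather than as part of python-cybox. The name `parse_size_in_bytes`, the unit table, and the choice of 1024 as the multiplier for "KB" are all assumptions made for illustration. This sketch only accepts decimal digits, so hex values such as "0x1F" are rejected rather than converted.

```python
import re

# Hypothetical unit table; whether "KB" means 1000 or 1024 bytes is an assumption.
_UNIT_MULTIPLIERS = {
    "": 1,
    "b": 1,
    "byte": 1,
    "bytes": 1,
    "kb": 1024,
    "mb": 1024 ** 2,
    "gb": 1024 ** 3,
}

# Decimal digits, optional whitespace, then an optional alphabetic unit.
_SIZE_RE = re.compile(r"^\s*(\d+)\s*([A-Za-z]*)\s*$")


def parse_size_in_bytes(text):
    """Return the size in bytes as an int, or raise ValueError."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError("unrecognized size value: %r" % (text,))
    number, unit = match.groups()
    try:
        multiplier = _UNIT_MULTIPLIERS[unit.lower()]
    except KeyError:
        raise ValueError("unrecognized size unit: %r" % (unit,))
    return int(number) * multiplier
```

With this helper, "123 bytes" and "123B" both yield 123, "2 KB" yields 2048, and anything it does not recognize still raises a ValueError.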

Author

brlogan commented May 20, 2016

This issue was brought up again earlier this month on the cit-users list. While more granular exception handling would help us not lose the entire package, even just handling "bytes" in the field would be very helpful. Any thoughts on this?

Matthew Hall writes:

...when the code comes across a CybOX FileObj w/ a bogus Size_In_Bytes, the exception disrupts parsing the entire STIX Package not just the corrupted / invalid entity:

<FileObj:Size_In_Bytes condition="Equals">380058 bytes</FileObj:Size_In_Bytes>

ValueError: invalid literal for long() with base 10: '380058 bytes'
File ".../venv/lib/python2.7/site-packages/cybox/common/properties.py", line 514, in _parse_value
return long(value, 0)

How can I perform a best-effort parse with python-stix in order to operate as properly as possible in such situations?
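The idea of more granular exception handling can be sketched in isolation. This is not python-stix or python-cybox code, and neither library is assumed to expose a hook like this; `parse_all_best_effort` is a hypothetical helper that parses each value independently and records failures instead of letting one bad value abort the whole batch. It uses `int(value, 0)`, the Python 3 counterpart of the `long(value, 0)` call in the traceback.

```python
def parse_all_best_effort(raw_values, parse=lambda v: int(v, 0)):
    """Parse each value independently; collect failures instead of aborting."""
    parsed, failures = [], []
    for raw in raw_values:
        try:
            parsed.append(parse(raw))
        except ValueError as exc:
            # Keep the offending input and the error message for later review.
            failures.append((raw, str(exc)))
    return parsed, failures


parsed, failures = parse_all_best_effort(["380058", "0x10", "380058 bytes"])
# parsed holds the two valid sizes; failures holds the "380058 bytes" entry.
```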

Contributor

gtback commented May 23, 2016

We could certainly strip out all non-digit characters. This would break for things like "KB", but content like that wouldn't have worked in the first place. I'm a bit hesitant to do anything more than that. The current approach handles hex ("0x") numbers correctly, whereas hex values would break with this naive approach. Side note: I think octal numbers would be OK, since a leading "0" (except for "0x") causes Python to interpret it as octal, even if it's not "0o".
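A small sketch of what stripping non-digit characters does to the inputs discussed here. `strip_non_digits` is a hypothetical helper, not part of python-cybox, and `int(value, 0)` stands in for Python 2's `long(value, 0)`.

```python
import re


def strip_non_digits(text):
    """Naive cleanup: keep only the decimal digits."""
    return re.sub(r"[^0-9]", "", text)


# The common case works.
assert int(strip_non_digits("380058 bytes"), 0) == 380058

# A unit suffix is silently dropped, so "12 KB" is read as 12 bytes.
assert int(strip_non_digits("12 KB"), 0) == 12

# Hex input is corrupted: "0x1F" (31) loses its "x" and "F".
assert strip_non_digits("0x1F") == "01"
```

One caveat on the octal side note: it describes Python 2. In Python 3, `int("010", 0)` raises a ValueError, because a leading zero is only accepted with an explicit "0o" prefix.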

Author

brlogan commented May 24, 2016

I have never seen hex or octal values in this field, but "bytes" seems to be pretty common. I don't think just stripping non-digit characters would be a good choice. I'd prefer to handle/convert for a few common cases (bytes, KB, MB, etc.) and continue triggering an exception for ambiguous values.

Contributor

gtback commented May 25, 2016

Thanks, @brlogan. One issue is that it is easiest to implement it for all UnsignedLongObjectPropertyType properties at the same time. A lot of the other fields are places where I would legitimately expect a "0x" prefix. I can see three basic solutions:

  • Implement custom logic for the size_in_bytes field.
  • If the value ends in " bytes" (including the leading space), remove the last 6 characters.
  • Strip all alphabetic characters (upper and lower) from the beginning and end of the string. Octal and hex values should be unaffected, since they start with 0, but it would handle all kinds of suffixes.

I'm leaning towards the third, but it could be overkill. Thoughts?
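The second option could look roughly like the following. `strip_bytes_suffix` is a hypothetical name, and the sketch makes two assumptions beyond the description above: it matches the suffix case-insensitively, and it uses the length of the suffix instead of a fixed character count so that "380058bytes" and "380058 bytes" are both handled.

```python
def strip_bytes_suffix(text):
    """Drop a trailing "bytes" unit, leaving any other value untouched."""
    stripped = text.strip()
    if stripped.lower().endswith("bytes"):
        stripped = stripped[: -len("bytes")].rstrip()
    return stripped


# Values with the suffix parse as plain integers.
assert int(strip_bytes_suffix("380058 bytes"), 0) == 380058

# Hex values pass through unchanged and still parse with base 0.
assert int(strip_bytes_suffix("0x1F"), 0) == 31
```

Other suffixes such as "KB" are left in place, so they continue to raise a ValueError when converted.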

Author

brlogan commented Jun 1, 2016

I'm not sure that the third option is the better way to go. If we strip something like "MB" or "KB" from the end of the string and just use the numeric value as if it were bytes, then we are working with incorrect data. An error may be better in that case. Further, if you strip letters from the end, you may change the hex value.
If someone is up to it, implementing some custom logic would be really nice, but with the frequency I've come across "bytes" in that field, I'd be happy with just the middle option.
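Both concerns can be reproduced with a short sketch of the third option. `strip_letters` is a hypothetical helper that removes ASCII letters from both ends of the string, and `int(value, 0)` stands in for Python 2's `long(value, 0)`.

```python
import string


def strip_letters(text):
    """Option three: strip ASCII letters from both ends of the string."""
    return text.strip(string.ascii_letters)


# The "bytes" suffix is handled; int() ignores the leftover trailing space.
assert int(strip_letters("380058 bytes"), 0) == 380058

# "12 MB" is accepted as 12 bytes, which is incorrect data.
assert int(strip_letters("12 MB"), 0) == 12

# Hex values ending in letter digits are changed: 0x1F (31) becomes 0x1 (1).
assert int(strip_letters("0x1F"), 0) == 1

# A hex value made only of letter digits collapses to "0".
assert strip_letters("0xFF") == "0"
```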

@gtback gtback self-assigned this Jun 8, 2016
@gtback gtback removed their assignment Apr 24, 2020

2 participants