This is a Java library that parses and analyzes the useragent string and extracts as many relevant attributes as possible.
More background about this useragent parser can be found in the blog post I wrote about it: https://techlab.bol.com/making-sense-user-agent-string/
The resulting output fields can be classified into several categories:
- The Device: The hardware that was used.
- The Operating System: The base software that runs on the hardware.
- The Layout Engine: The underlying core that converts the 'HTML' into a visual/interactive page.
- The Agent: The actual "Browser" that was used.
- Extra fields: In some cases additional fields describe the agent. Among others, these include fields specific to the Facebook and Kobo apps, and fields that describe deliberate useragent manipulation (Anonymization, Hackers, etc.).
Note that not all fields are always available, so for any specific field you will generally find null values and "Unknown" as well.
As few lookup tables as possible are included; the system really tries to analyze the useragent and extract values from it. The aim of this approach is a system that can classify as much traffic as possible yet requires as little maintenance as possible, because all versions and in many places also the names of the used components are extracted without knowing them beforehand.
As an example the useragent of my phone:
Mozilla/5.0 (Linux; Android 7.0; Nexus 6 Build/NBD90Z) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.124 Mobile Safari/537.36
is converted into this set of fields:
Field name | Value |
---|---|
DeviceClass | 'Phone' |
DeviceName | 'Google Nexus 6' |
DeviceBrand | 'Google' |
OperatingSystemClass | 'Mobile' |
OperatingSystemName | 'Android' |
OperatingSystemVersion | '7.0' |
OperatingSystemNameVersion | 'Android 7.0' |
OperatingSystemVersionBuild | 'NBD90Z' |
LayoutEngineClass | 'Browser' |
LayoutEngineName | 'Blink' |
LayoutEngineVersion | '53.0' |
LayoutEngineVersionMajor | '53' |
LayoutEngineNameVersion | 'Blink 53.0' |
LayoutEngineNameVersionMajor | 'Blink 53' |
AgentClass | 'Browser' |
AgentName | 'Chrome' |
AgentVersion | '53.0.2785.124' |
AgentVersionMajor | '53' |
AgentNameVersion | 'Chrome 53.0.2785.124' |
AgentNameVersionMajor | 'Chrome 53' |
You can try it online with your own browser here: https://analyze-useragent.appspot.com/.
NOTES
- This runs under a "Free quota" on Google AppEngine. If this quota is exceeded then it will simply become unavailable for that day.
- After a while of inactivity the instance is terminated so the first page may take 15-30 seconds to load.
- If you really like this then run it on your local systems. It's much faster that way.
This system is based on analyzing the useragent string and looking for the patterns in it as they have been defined by parties like Google, Microsoft, Samsung and many others. These have been augmented with observations of how developers apparently do things. There are really no (ok, very limited) lookup tables that define if a certain device name is a Phone or a Tablet. This makes this system very maintainable because there is no need to have a list of all possible devices.
As a consequence if a useragent does not follow these patterns the analysis will yield the 'wrong' answer. Take for example these two (both were found exactly as shown here in the logs of a live website):
Mozilla/5.0 (Linux; Android 5.1; SAMSUNG-T805s Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.94 Mobile Safari/537.36
Mozilla/5.0 (Linux; Android 4.4.2; SAMSUNG-T805S Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.89 Safari/537.36
The difference between "Mobile Safari" and "Safari" has been defined for Google Chrome as the difference between "Phone" and "Tablet" (see https://developer.chrome.com/multidevice/user-agent ).
And as you can see in this example: we sometimes get it wrong. The impact in this case is however very limited: of the 445 visitors I found using this device only 2 were classified wrong; all others were correct.
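To illustrate why these two useragents end up in different classes, here is a minimal sketch that applies only the Chrome convention mentioned above. The class and method names are made up for this example, and this is not how the library itself is implemented (it uses a parser and matchers, not string checks).

```java
final class ChromeDeviceClassSketch {

    /**
     * Applies only the Chrome-on-Android convention:
     * "Mobile Safari" means Phone, plain "Safari" means Tablet.
     */
    static String classify(String userAgent) {
        if (!userAgent.contains("Android") || !userAgent.contains("Chrome/")) {
            return "Unknown"; // The convention says nothing about other useragents.
        }
        return userAgent.contains("Mobile Safari/") ? "Phone" : "Tablet";
    }

    public static void main(String[] args) {
        String first = "Mozilla/5.0 (Linux; Android 5.1; SAMSUNG-T805s Build/KOT49H) "
            + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.94 Mobile Safari/537.36";
        String second = "Mozilla/5.0 (Linux; Android 4.4.2; SAMSUNG-T805S Build/KOT49H) "
            + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.89 Safari/537.36";

        System.out.println(classify(first));  // Phone
        System.out.println(classify(second)); // Tablet
    }
}
```

The same device name yields two different answers, purely because one of the useragents does not follow the pattern.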
A second example is when the Samsung Browser is installed on a non-Samsung device (in this example a Google Nexus 6):
Mozilla/5.0 (Linux; Android 7.0; SAMSUNG Nexus 6 Build/NBD92G) AppleWebKit/537.36 (KHTML, like Gecko) SamsungBrowser/5.4 Chrome/51.0.2704.106 Mobile Safari/537.36
As you can see this browser assumes it is only installed on Samsung devices, so it 'forces' the word Samsung in there. In this case you will see this being reported as a "Samsung Nexus 6", which is obviously wrong.
On my i7 system I see a speed ranging from 500 to 4000 useragents per second (depending on the length and ambiguities in the useragent). On average the speed is around 2000 per second or ~0.5ms each. An LRU cache is in place that does over 1M per second if they are in the cache.
Please note that the current system takes approximately 256 MiB of RAM just for the engine (without any caching!!).
Output from the benchmark (using this code) on an Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz:
Benchmark | Mode | Cnt | Score | Error | Units |
---|---|---|---|---|---|
AnalyzerBenchmarks.android6Chrome46 | avgt | 10 | 0.561 | ± 0.011 | ms/op |
AnalyzerBenchmarks.androidPhone | avgt | 10 | 0.726 | ± 0.010 | ms/op |
AnalyzerBenchmarks.googleAdsBot | avgt | 10 | 0.120 | ± 0.002 | ms/op |
AnalyzerBenchmarks.googleAdsBotMobile | avgt | 10 | 0.378 | ± 0.001 | ms/op |
AnalyzerBenchmarks.googleBotMobileAndroid | avgt | 10 | 0.616 | ± 0.006 | ms/op |
AnalyzerBenchmarks.googlebot | avgt | 10 | 0.197 | ± 0.007 | ms/op |
AnalyzerBenchmarks.hackerSQL | avgt | 10 | 0.093 | ± 0.004 | ms/op |
AnalyzerBenchmarks.hackerShellShock | avgt | 10 | 0.069 | ± 0.003 | ms/op |
AnalyzerBenchmarks.iPad | avgt | 10 | 0.339 | ± 0.003 | ms/op |
AnalyzerBenchmarks.iPhone | avgt | 10 | 0.343 | ± 0.003 | ms/op |
AnalyzerBenchmarks.iPhoneFacebookApp | avgt | 10 | 0.717 | ± 0.004 | ms/op |
AnalyzerBenchmarks.win10Chrome51 | avgt | 10 | 0.290 | ± 0.010 | ms/op |
AnalyzerBenchmarks.win10Edge13 | avgt | 10 | 0.328 | ± 0.003 | ms/op |
AnalyzerBenchmarks.win10IE11 | avgt | 10 | 0.334 | ± 0.006 | ms/op |
AnalyzerBenchmarks.win7ie11 | avgt | 10 | 0.329 | ± 0.006 | ms/op |
In the canonical use case of analysing clickstream data you will see a <1ms hit per visitor (or better: per new non-cached useragent); for all the other clicks the values are retrieved from the cache at a speed of < 1 microsecond (i.e. close to 0).
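The effect of such a cache can be sketched with a small LRU map in front of a slow parse function. This is only an illustration built on `java.util.LinkedHashMap`; the class name and the `slowParser` function are assumptions made for this example and say nothing about how the cache inside the library is actually implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class CachingParserSketch {
    private final Function<String, String> slowParser;
    private final Map<String, String> cache;
    private int misses = 0;

    CachingParserSketch(final int maxEntries, Function<String, String> slowParser) {
        this.slowParser = slowParser;
        // accessOrder=true keeps the least recently used entry first,
        // so removeEldestEntry evicts exactly that one when the cache is full.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached result if present, otherwise parses and remembers it. */
    synchronized String parse(String userAgent) {
        String result = cache.get(userAgent);
        if (result == null) {
            misses++; // Only new (non-cached) useragents pay the parsing cost.
            result = slowParser.apply(userAgent);
            cache.put(userAgent, result);
        }
        return result;
    }

    synchronized int misses() {
        return misses;
    }
}
```

With clickstream data most clicks of a visitor carry the same useragent, so only the first click reaches the slow path.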
In addition to the UDFs for many Apache tools (Pig, Hive, Flink, Beam, ... see below) this analyzer can also be used directly in Java based applications.
First add the library as a dependency to your application. This has been published to Maven Central so that should work in almost any environment.
<dependency>
  <groupId>nl.basjes.parse.useragent</groupId>
  <artifactId>yauaa</artifactId>
  <version>5.0</version>
</dependency>
and in your application using it is as simple as this:
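import nl.basjes.parse.useragent.UserAgent;
import nl.basjes.parse.useragent.UserAgentAnalyzer;

// Build the analyzer once and reuse it; construction is expensive.
UserAgentAnalyzer uaa = UserAgentAnalyzer
    .newBuilder()
    .hideMatcherLoadStats()
    .withCache(25000)
    .build();

// Parse a single useragent and print every field that is available.
UserAgent agent = uaa.parse("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11");
for (String fieldName : agent.getAvailableFieldNamesSorted()) {
    System.out.println(fieldName + " = " + agent.getValue(fieldName));
}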
Please instantiate a new UserAgentAnalyzer as few times as possible because the initialization step for a full UserAgentAnalyzer (i.e. all fields) usually takes something in the range of 2-5 seconds. If you need multiple instances of the UserAgentAnalyzer then you MUST create a new Builder instance for each of those.
Note that not all fields are available after every parse. So be prepared to receive a 'null' if you extract a specific name.
IMPORTANT: This library is single threaded! Because the code is not reentrant the main method has been synchronized on the instance. So in a multi-threaded situation you should create a separate instance per thread or accept the speed limitation. Note that you should really instantiate it only once per thread (and use a ThreadPool or something similar) because starting a new instance takes several seconds.
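One way to get "one instance per thread, created only once" is a `ThreadLocal`. The sketch below uses a made-up `StandInAnalyzer` class instead of the real analyzer so that it runs without the library; in a real application the supplier would build a `UserAgentAnalyzer` with its own new Builder.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class PerThreadAnalyzerSketch {

    /** Hypothetical stand-in for an analyzer that is slow to build and not thread safe. */
    static final class StandInAnalyzer {
        static final AtomicInteger CREATED = new AtomicInteger();

        StandInAnalyzer() {
            CREATED.incrementAndGet(); // The real initialization would take seconds here.
        }

        String parse(String userAgent) {
            return userAgent.contains("Mobile") ? "Phone" : "Desktop";
        }
    }

    // Each thread lazily builds its own instance on first use and keeps reusing it.
    private static final ThreadLocal<StandInAnalyzer> ANALYZER =
        ThreadLocal.withInitial(StandInAnalyzer::new);

    static String parse(String userAgent) {
        return ANALYZER.get().parse(userAgent);
    }
}
```

Combined with a fixed-size thread pool this keeps the number of instances (and thus the startup cost and memory use) bounded by the number of worker threads.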
Eclipse users: Be aware that there is a bug in Eclipse which will show you errors in perfectly valid Java code: https://bugs.eclipse.org/bugs/show_bug.cgi?id=527475 The errors you see are related to the inheritance model used by the Builders in this project and the fact that Eclipse does not interpret it correctly.
In some scenarios you only want a specific field and all others are unwanted. This can be achieved by creating the analyzer in Java like this:
UserAgentAnalyzer uaa;
uaa = UserAgentAnalyzer
    .newBuilder()
    .withField("DeviceClass")
    .withField("AgentNameVersionMajor")
    .build();
One important effect is that this speeds up the system because it drops any rules that do not help in getting the desired fields. The above example showed a reduction in parse time of approximately 40% (i.e. times dropped from ~1ms to ~0.6ms).
The system relies heavily on HashMaps to quickly find the rules that need to be fired.
Some fields only require a handful of rules where others need a lot of them. The fields that have been requested therefore determine how many rules are kept in the system, and thus how much memory is used to store them. The table below gives an idea of the relative memory impact of the rules needed for a specific field.
This table was constructed by running all testcases against the engine while requesting only 1 field.
After forcing a GC in the JVM we then retrieved the memory footprint.
Because there are no rules for the field __SyntaxError__ we assumed that to be the baseline against which we determine the relative memory usage.
Because most rules determine several fields there is a lot of overlap in the rules used. If you keep all rules, version 5.0 uses about 232 MiB of memory for all rules. The table shows that the most expensive rules are those related to finding the DeviceName and DeviceBrand, because both need to determine the brand of the device at hand.
Field | Relative Memory usage |
---|---|
DeviceClass | 81.9 MiB |
DeviceName | 203.2 MiB |
DeviceBrand | 181.2 MiB |
DeviceCpu | 3.1 MiB |
DeviceCpuBits | 9.5 MiB |
DeviceFirmwareVersion | 4.1 MiB |
DeviceVersion | 14.3 MiB |
OperatingSystemClass | 56.9 MiB |
OperatingSystemName | 56.0 MiB |
OperatingSystemVersion | 55.9 MiB |
OperatingSystemNameVersion | 56.2 MiB |
OperatingSystemVersionBuild | 2.2 MiB |
LayoutEngineClass | 8.9 MiB |
LayoutEngineName | 8.9 MiB |
LayoutEngineVersion | 8.9 MiB |
LayoutEngineVersionMajor | 8.9 MiB |
LayoutEngineNameVersion | 9.0 MiB |
LayoutEngineNameVersionMajor | 9.0 MiB |
LayoutEngineBuild | 1.5 MiB |
AgentClass | 17.1 MiB |
AgentName | 16.7 MiB |
AgentVersion | 16.7 MiB |
AgentVersionMajor | 16.7 MiB |
AgentNameVersion | 16.8 MiB |
AgentNameVersionMajor | 16.8 MiB |
AgentBuild | 0.5 MiB |
AgentLanguage | 0.3 MiB |
AgentLanguageCode | 0.3 MiB |
AgentInformationEmail | 4.0 MiB |
AgentInformationUrl | 5.9 MiB |
AgentSecurity | 0.3 MiB |
AgentUuid | 0.3 MiB |
FacebookCarrier | 0.2 MiB |
FacebookDeviceClass | 0.5 MiB |
FacebookDeviceName | 0.5 MiB |
FacebookDeviceVersion | 0.5 MiB |
FacebookFBOP | 0.2 MiB |
FacebookFBSS | 0.5 MiB |
FacebookOperatingSystemName | 0.5 MiB |
FacebookOperatingSystemVersion | 0.5 MiB |
Anonymized | 1.2 MiB |
HackerAttackVector | 0.2 MiB |
HackerToolkit | 0.2 MiB |
KoboAffiliate | 0.0 MiB |
KoboPlatformId | 0.0 MiB |
IECompatibilityVersion | 1.6 MiB |
IECompatibilityVersionMajor | 1.6 MiB |
IECompatibilityNameVersion | 1.6 MiB |
IECompatibilityNameVersionMajor | 1.6 MiB |
Carrier | 0.2 MiB |
GSAInstallationID | 0.0 MiB |
WebviewAppName | 1.9 MiB |
WebviewAppNameVersionMajor | 1.9 MiB |
WebviewAppVersion | 1.9 MiB |
WebviewAppVersionMajor | 1.9 MiB |
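The measurement procedure described above the table (force a GC, read the footprint, subtract a baseline) can be approximated with the standard `Runtime` API. This is a rough sketch and not the code that produced the numbers above; the class name is made up and the byte array merely stands in for the loaded rules. Keep in mind that `System.gc()` is only a request, so results vary between runs and JVMs.

```java
final class MemoryFootprintSketch {

    /** Requests a GC and returns the heap currently in use, in bytes. */
    static long usedHeapBytes() {
        Runtime runtime = Runtime.getRuntime();
        System.gc(); // Only a hint; the JVM is free to ignore it.
        return runtime.totalMemory() - runtime.freeMemory();
    }

    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long baseline = usedHeapBytes();

        // Stand-in for "build the analyzer with a single field".
        byte[] rules = new byte[32 * 1024 * 1024];

        long withRules = usedHeapBytes();
        System.out.printf("Relative memory usage: %.1f MiB%n", toMiB(withRules - baseline));

        // Reference the array after measuring so it cannot be collected earlier.
        System.out.println("Stand-in size in bytes: " + rules.length);
    }
}
```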
Several external computation systems support the concept of a User Defined Function (UDF). A UDF is simply a way of making functionality (in this case the analysis of useragents) available in such a system.
For several systems (tools used within bol.com, where I work) I have written such a UDF; all of them are part of this project.
- Apache Pig
- Apache Flink
- Apache Beam
- Apache Hive
- Commandline tool
- Apache Drill (originally written by Charles S. Givre)
- Apache Nifi
DeviceClass value | Meaning |
---|---|
Desktop | The device is assessed as a Desktop/Laptop class device |
Anonymized | In some cases the useragent has been altered by anonymization software |
Unknown | We really don't know; these are usually useragents that look normal yet contain almost no information about the device |
Mobile | A device that is mobile yet we do not know if it is an eReader/Tablet/Phone or Watch |
Tablet | A mobile device with a rather large screen (commonly > 7") |
Phone | A mobile device with a small screen (commonly < 7") |
Watch | A mobile device with a tiny screen (commonly < 2"). Normally these are an additional screen for a phone/tablet type device. |
Virtual Reality | A mobile device with VR capabilities |
eReader | Similar to a Tablet yet in most cases with an eInk screen |
Set-top box | A connected device that allows interacting via a TV sized screen |
TV | Similar to Set-top box yet here this is built into the TV |
Game Console | 'Fixed' game systems like the PlayStation and XBox |
Handheld Game Console | 'Mobile' game systems like the 3DS |
Robot | Robots that visit the site |
Robot Mobile | Robots that visit the site indicating they want to be seen as a Mobile visitor |
Spy | Robots that visit the site pretending they are robots like Google, but they are not |
Hacker | Scripting was detected in the useragent string; also used as a fallback in really broken situations |
OperatingSystemClass value | Meaning |
---|---|
Desktop | The type of OS you would run on a Desktop or Laptop |
Mobile | The type of OS you would run on a Phone, Tablet or Watch |
Cloud | Looks like a thing that runs in a cloud environment |
Embedded | Apparently embedded into something like a TV |
Game Console | A game console like PS4, Xbox |
Hacker | A hacker, so it can really be anything. |
Anonymized | It was explicitly hidden |
Unknown | We don't know |
LayoutEngineClass value | Meaning |
---|---|
Browser | A regular browser |
Mobile App | A mobile app which probably includes a regular web browser |
Hacker | A hacker, so it can really be anything. |
Robot | A robot spidering the site |
Unknown | We don't know |
AgentClass value | Meaning |
---|---|
Browser | A regular browser |
Browser Webview | A regular browser being used as part of a mobile app |
Mobile App | A mobile app |
Robot | A robot that wants to be treated as a desktop device |
Robot Mobile | A robot that wants to be treated as a mobile device |
Cloud Application | Something running in a cloud (but not a regular robot) |
Email Client | This is an email application that did the request |
Special | Something special we cannot fully classify |
Testclient | A website testing tool |
Hacker | A hacker, so it can really be anything. |
Unknown | We don't know |
AgentSecurity value | Meaning |
---|---|
Weak security | Indicated to use deliberately weakened encryption (usually due to export restrictions or local laws). |
Strong security | Indicated to use strong (normal) encryption. |
Unknown | It was not specified (very common) |
Hacker | A hacker, so it can really be anything. |
Parsing useragents is considered by many to be a ridiculously hard problem. The main problems are:
- Although there seems to be a specification, many do not follow it.
- Useragents LIE: they claim to be their competing predecessor with an extra flag.
The pattern the 'normal' browser builders are following is that they all LIE about the ancestor they are trying to improve upon.
The reason this system (historically) works is that a lot of website builders do a very simple check to see if they can use a specific feature.
if (useragent.contains("Chrome")) {
    // Use the Chrome feature we need.
}
Some may improve on this and actually check the (major) version that follows.
A good example of this is the Edge browser:
Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10136
It says it:
- is Mozilla/5.0
- uses AppleWebKit/537.36
- for "compatibility" the AppleWebKit lies about being "KHTML" and being similar to "Gecko" are also copied
- is Chrome 42
- is Safari 537
- is Edge 12
So any website looking for the word it triggers upon will find it and enable the right features.
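To make this concrete, the sketch below runs the same kind of naive `contains` check against the Edge useragent shown above for a handful of keywords. The class name and the keyword list are chosen just for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class NaiveSniffingSketch {

    private static final String[] KEYWORDS = {
        "Mozilla", "AppleWebKit", "KHTML", "Gecko", "Chrome", "Safari", "Edge", "Firefox"
    };

    /** Returns every keyword a naive contains() check would find in the useragent. */
    static List<String> claimedProducts(String userAgent) {
        List<String> found = new ArrayList<>();
        for (String keyword : KEYWORDS) {
            if (userAgent.contains(keyword)) {
                found.add(keyword);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String edge = "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/42.0.2311.135 Safari/537.36 Edge/12.10136";
        // Prints [Mozilla, AppleWebKit, KHTML, Gecko, Chrome, Safari, Edge]
        System.out.println(claimedProducts(edge));
    }
}
```

Every keyword except "Firefox" is found, which is exactly why a single `contains` check cannot tell you which browser you are really dealing with.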
Most implementations for analysing useragents that I have seen are based on lists of regular expressions. These are executed in a specific order to find the first one that matches.
With this approach the order in which things occur determines whether the patterns match.
Regular expressions are notoriously hard to write and debug and (unless you make them really complex) the order in which parts of the pattern occur is fixed.
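A minimal sketch of such a "first matching expression wins" list shows how sensitive it is to ordering. The rules here are deliberately simplistic and invented for this example; they are not taken from any existing useragent library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstMatchWinsSketch {

    /** Builds an ordered rule list: product name -> pattern capturing its major version. */
    static Map<String, Pattern> rules(String... productNames) {
        Map<String, Pattern> rules = new LinkedHashMap<>(); // Keeps insertion order.
        for (String name : productNames) {
            rules.put(name, Pattern.compile(Pattern.quote(name) + "/(\\d+)"));
        }
        return rules;
    }

    /** Tries the rules in order and returns the result of the first one that matches. */
    static String firstMatch(Map<String, Pattern> orderedRules, String userAgent) {
        for (Map.Entry<String, Pattern> rule : orderedRules.entrySet()) {
            Matcher matcher = rule.getValue().matcher(userAgent);
            if (matcher.find()) {
                return rule.getKey() + " " + matcher.group(1);
            }
        }
        return "Unknown";
    }

    public static void main(String[] args) {
        String edge = "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/42.0.2311.135 Safari/537.36 Edge/12.10136";
        System.out.println(firstMatch(rules("Chrome", "Edge"), edge)); // Chrome 42
        System.out.println(firstMatch(rules("Edge", "Chrome"), edge)); // Edge 12
    }
}
```

The same useragent gives two different answers depending only on the order of the list, so every new browser means carefully re-ordering (or complicating) the expressions.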
I wanted to see if a completely different approach would work: can we actually parse these things into a tree and work from there?
The parser (ANTLR4 based) will be able to parse a lot of the agents but not all. Tests have shown that it will parse >99% of all useragents on a large website which is more than 99.99% of the traffic.
The ones it is not able to parse are those that have been set manually to an invalid value, so if that happens we assume you are a hacker. In all other cases we have matchers that are triggered if a specific value is found by the parser. Such a matcher then reports that it has found a match for a certain attribute with a certain confidence level (0-10000). In the end, for each attribute the match with the highest confidence 'wins'.
The main concept of this useragent parser is that we have two things:
- A Parser (ANTLR4) that converts the useragent into a nice tree that we can walk along.
- A collection of matchers.
  - A matcher triggers if a set of patterns is present in the tree.
  - Each pattern is detected by a "matcher action" that triggers and can fill a single attribute. If a matcher triggers, a set of attributes is set, each with a value and a confidence level.
  - All results from all triggered matchers (and actions) are combined and for each individual attribute the value with the highest confidence wins.
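The "highest confidence wins" step can be sketched as a small map that only keeps the best candidate per attribute. The class, the tie-breaking choice (the first reported value is kept on equal confidence) and the confidence numbers used in the example are all assumptions for illustration, not the library's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

final class ConfidenceMergeSketch {

    /** A value proposed by a matcher, together with its confidence. */
    private static final class Candidate {
        final String value;
        final int confidence;

        Candidate(String value, int confidence) {
            this.value = value;
            this.confidence = confidence;
        }
    }

    private final Map<String, Candidate> best = new HashMap<>();

    /** Called by every triggered matcher; only a strictly higher confidence replaces a value. */
    void report(String attribute, String value, int confidence) {
        Candidate current = best.get(attribute);
        if (current == null || confidence > current.confidence) {
            best.put(attribute, new Candidate(value, confidence));
        }
    }

    /** Returns the winning value, or null if no matcher reported this attribute. */
    String get(String attribute) {
        Candidate winner = best.get(attribute);
        return winner == null ? null : winner.value;
    }

    public static void main(String[] args) {
        ConfidenceMergeSketch result = new ConfidenceMergeSketch();
        result.report("AgentName", "Chrome", 1000); // Hypothetical generic rule.
        result.report("AgentName", "Edge", 2000);   // Hypothetical more specific rule.
        System.out.println(result.get("AgentName"));  // Edge
        System.out.println(result.get("DeviceName")); // null
    }
}
```

Because the outcome only depends on the confidence, the order in which the matchers fire does not matter (apart from ties), which is the key difference from an ordered list of regular expressions.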
As a performance optimization we walk along the parsed tree once and fire everything we find into a precomputed hashmap that points to all the applicable matcher actions. As a consequence:
- the matching is relatively fast even though the number of matchers already runs into the few hundreds.
- the startup is "slow"
- the memory footprint is pretty big due to the number of matchers, the size of the hashmap and the cache of the parsed useragents.
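The trade-off can be sketched as follows: a map from "thing found in the tree" to the interested actions is built once at startup, and during analysis each node costs a single hash lookup. In this simplified sketch a flat list of tokens stands in for the parsed tree and plain strings stand in for the matcher actions; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ActionLookupSketch {

    // Built once at startup: token -> names of the actions interested in it.
    private final Map<String, List<String>> actionsByToken = new HashMap<>();

    /** Startup phase: register an action for a token (slow, uses memory). */
    void register(String token, String actionName) {
        actionsByToken.computeIfAbsent(token, key -> new ArrayList<>()).add(actionName);
    }

    /** Analysis phase: one pass over the tokens, one hash lookup per token. */
    List<String> fire(List<String> tokens) {
        List<String> fired = new ArrayList<>();
        for (String token : tokens) {
            fired.addAll(actionsByToken.getOrDefault(token, Collections.<String>emptyList()));
        }
        return fired;
    }

    public static void main(String[] args) {
        ActionLookupSketch lookup = new ActionLookupSketch();
        lookup.register("Android", "setOperatingSystemName");
        lookup.register("Android", "setDeviceClass");
        lookup.register("Chrome", "setAgentName");

        // Prints [setOperatingSystemName, setDeviceClass, setAgentName]
        System.out.println(lookup.fire(java.util.Arrays.asList("Mozilla", "Android", "Chrome")));
    }
}
```

The cost of a lookup does not grow with the number of registered actions, which is why adding more matchers mainly affects startup time and memory rather than analysis speed.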
A much more in-depth explanation can be found in the documentation on how to create new rules.
If this project has business value for you then don't hesitate to support me with a small donation.
Yet Another UserAgent Analyzer
Copyright (C) 2013-2018 Niels Basjes
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.