This is an Apache Drill UDF for parsing User Agent Strings. This function was originally created by Charles S. Givre and was imported into the main Yauaa project to ensure users would have a prebuilt and up to date version available.
I have copied/implemented the functions
parse_user_agent ( <useragent> )
parse_user_agent_field ( <useragent> , <desired fieldname> )
Able to do basic unit test during build for the parse_user_agent_field function. NOT yet able to do automated testing for the parse_user_agent function.
When I run it in a local drill the things I try work for the most (see below). I have not yet tested this in a distributed setup.
So please try it out and report any issues and/or improvements you see.
You can get the prebuilt UDF from maven central. This is the maven dependency definition where it can be downloaded.
<dependency>
<groupId>nl.basjes.parse.useragent</groupId>
<artifactId>yauaa-drill</artifactId>
<version>5.0</version>
</dependency>
Simply install the normal build tools for a Java project (i.e. maven and jdk 8+) and then simply do:
mvn clean package
The jar file for Apache Drill can be found as
./udfs/drill/function/target/yauaa-drill-<version>.jar
To install this function put the jar file in the /jars/3rdparty directory.
cp ./udfs/drill/target/yauaa-drill-<version>.jar <drill-path>/jars/3rdparty
Make sure you replace <drill-path>
with your actual path to your drill installation.
NOTE: You do not need the yauaa-drill-<version>-sources.jar
to run in drill.
Working queries with a direct value
SELECT parse_user_agent_field('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36', 'AgentNameVersion') from (values(1));
+----------------------+
| EXPR$0 |
+----------------------+
| Chrome 48.0.2564.82 |
+----------------------+
This one doesn't work
SELECT parse_user_agent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36') from (values(1));
Error: PLAN ERROR: Failure while materializing expression in constant expression evaluator [PARSE_USER_AGENT('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36')]. Errors:
Error in expression at index -1. Error: Only ProjectRecordBatch could have complex writer function. You are using complex writer function parse_user_agent in a non-project operation!. Full expression: --UNKNOWN EXPRESSION--.
Assume the file /tmp/testcase.tsv
which contains 2 lines:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.82 Safari/537.36
Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Working queries using the input file
SELECT parse_user_agent_field(columns[0], 'AgentNameVersion') AS AgentNameVersion from dfs.`/tmp/testcase.tsv`;
+----------------------+
| AgentNameVersion |
+----------------------+
| Chrome 48.0.2564.82 |
| Googlebot 2.1 |
+----------------------+
and
SELECT parse_user_agent(columns[0]) AS AgentNameVersion from dfs.`/tmp/testcase.tsv`;
+------------------+
| AgentNameVersion |
+------------------+
| {
"DeviceClass": "Desktop",
"DeviceName": "Linux Desktop",
"DeviceBrand": "Unknown",
"DeviceCpu": "Intel x86_64",
"DeviceCpuBits": "64",
"OperatingSystemClass": "Desktop",
"OperatingSystemName": "Linux",
"OperatingSystemVersion": "Intel x86_64",
"OperatingSystemNameVersion": "Linux Intel x86_64",
"LayoutEngineClass": "Browser",
"LayoutEngineName": "Blink",
"LayoutEngineVersion": "48.0",
"LayoutEngineVersionMajor": "48",
"LayoutEngineNameVersion": "Blink 48.0",
"LayoutEngineNameVersionMajor": "Blink 48",
"AgentClass": "Browser",
"AgentName": "Chrome",
"AgentVersion": "48.0.2564.82",
"AgentVersionMajor": "48",
"AgentNameVersion": "Chrome 48.0.2564.82",
"AgentNameVersionMajor": "Chrome 48" } |
| {
"DeviceClass": "Robot Mobile",
"DeviceName": "Google",
"DeviceBrand": "Google",
"OperatingSystemClass": "Cloud",
"OperatingSystemName": "Google",
"OperatingSystemVersion": "Google",
"OperatingSystemNameVersion": "Google",
"LayoutEngineClass": "Robot",
"LayoutEngineName": "Googlebot",
"LayoutEngineVersion": "Googlebot",
"LayoutEngineVersionMajor": "Googlebot",
"LayoutEngineNameVersion": "Googlebot",
"LayoutEngineNameVersionMajor": "Googlebot",
"AgentClass": "Robot Mobile",
"AgentName": "Googlebot",
"AgentVersion": "2.1",
"AgentVersionMajor": "2",
"AgentNameVersion": "Googlebot 2.1",
"AgentNameVersionMajor": "Googlebot 2",
"AgentInformationUrl": "http://www.google.com/bot.html" } |
+------------------+
The function returns a Drill map, so you can access any of the fields using Drill's table.map.key notation. For example, the query below illustrates how to extract a field from this map and summarize it:
SELECT uadata.ua.AgentNameVersion AS Browser,
COUNT( * ) AS BrowserCount
FROM (
SELECT parse_user_agent( columns[0] ) AS ua
FROM dfs.`/tmp/testcase.tsv`
) AS uadata
GROUP BY uadata.ua.AgentNameVersion
ORDER BY BrowserCount DESC;
+----------------------+---------------+
| Browser | BrowserCount |
+----------------------+---------------+
| Chrome 48.0.2564.82 | 1 |
| Googlebot 2.1 | 1 |
+----------------------+---------------+
Yet Another UserAgent Analyzer
Copyright (C) 2013-2018 Niels Basjes
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.