URLCleaner is a microservice component that removes redirects and affiliate markers from a URL.
To run the microservice, the following must be set up:
- MongoDB
  MongoDB is used to keep track of blacklisted shops (in the collection named 'blacklist').
- Idealo bridge
  URLCleaner uses the idealo bridge to resolve a shopID to the shop's root URL.
The database used is named 'data'.
The following settings must be provided:
- URLCLEANER_PORT - Port used by URLCleaner
- API_URL - URL of the idealo bridge
- ACCESS_TOKEN_URI - URI for retrieving an access token from the idealo bridge
- CLIENT_ID - Client ID for retrieving an access token from the idealo bridge
- CLIENT_SECRET - Client secret for retrieving an access token from the idealo bridge
- MONGO_IP - IP address of MongoDB
- MONGO_PORT - Port that MongoDB is using
- MONGO_URLCLEANER_USER - Username used to access MongoDB
- MONGO_PASSWORD - Password used to access MongoDB
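As an illustration of how the MongoDB settings fit together, here is a minimal Python sketch that builds a standard MongoDB connection string from them. It assumes the settings are supplied as environment variables and that 'data' is placed in the path of the connection string; the helper name `build_mongo_uri` is hypothetical and not part of URLCleaner.

```python
import os
from urllib.parse import quote_plus


def build_mongo_uri(env=os.environ):
    """Build a MongoDB connection string from the MONGO_* settings.

    Assumption: the settings are available as environment variables
    (or any mapping passed in as `env`).
    """
    # User name and password must be percent-encoded inside a connection string.
    user = quote_plus(env["MONGO_URLCLEANER_USER"])
    password = quote_plus(env["MONGO_PASSWORD"])
    host = env["MONGO_IP"]
    port = env["MONGO_PORT"]
    # Assumption: 'data' is the database named in this README.
    return f"mongodb://{user}:{password}@{host}:{port}/data"
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching the real environment.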
URLCleaner uses two different strategies to clean URLs:
- Redirect clean
  The component removes the part of the given URL that precedes the corresponding shop root URL.
- Affiliate marker clean
  The component removes affiliate markers (for example UTM parameters) from the given URL.
Both strategies are applied one after the other, and the result is returned as the output.
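The two strategies can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual URLCleaner implementation: the function names, the marker list `AFFILIATE_PARAMS`, and the choice to also search a percent-decoded copy of the URL are all hypothetical.

```python
from urllib.parse import parse_qsl, unquote, urlencode, urlsplit, urlunsplit

# Hypothetical marker list; the list used by the real service is not shown here.
AFFILIATE_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}


def strip_redirect(dirty_url, shop_root_url):
    """Redirect clean: drop everything that precedes the shop root URL."""
    # Redirect targets are often percent-encoded, so try the raw URL first
    # and then a decoded copy (an assumption of this sketch).
    for candidate in (dirty_url, unquote(dirty_url)):
        index = candidate.find(shop_root_url)
        if index >= 0:
            return candidate[index:]
    return dirty_url  # root URL not found: leave the URL untouched


def strip_affiliate_markers(url, markers=AFFILIATE_PARAMS):
    """Affiliate marker clean: remove known marker parameters from the query."""
    parts = urlsplit(url)
    kept = [(name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name.lower() not in markers]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def clean_url(dirty_url, shop_root_url):
    """Apply both strategies one after the other."""
    return strip_affiliate_markers(strip_redirect(dirty_url, shop_root_url))
```

For example, calling `clean_url` with a tracker link whose `target` parameter contains an encoded shop URL and the root URL `https://shop.example` would keep only the shop URL and drop its `utm_source` parameter.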
- The list of affiliate markers can be extended.
- The list of shop-specific affiliate markers can be collected automatically. We can remove each parameter in turn and fetch the page using the resulting URL. If the page is identical to the one fetched using dirtyUrl, the removed parameter is an affiliate marker for this shop; otherwise it is not.
- The component should compare the web pages fetched using dirtyUrl and the corresponding cleanUrl. If they are identical, the dirtyUrl was cleaned successfully; otherwise it was not. It is enough to check one URL per shop this way.
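The last two ideas could look roughly like the Python sketch below. It is a sketch under assumptions rather than a planned implementation: the function names are hypothetical, and page retrieval is passed in as a `fetch` callable so that the comparison logic can be tried without network access. In practice, pages with dynamic content may never be byte-identical, so a real comparison would probably need to be more tolerant than plain equality.

```python
from typing import Callable, List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def find_affiliate_markers(dirty_url: str, fetch: Callable[[str], str]) -> List[str]:
    """Return the query parameters whose removal leaves the fetched page unchanged."""
    parts = urlsplit(dirty_url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    reference_page = fetch(dirty_url)
    markers = []
    for i, (name, _value) in enumerate(params):
        # Rebuild the URL with exactly one parameter removed.
        remaining = params[:i] + params[i + 1:]
        candidate = urlunsplit(parts._replace(query=urlencode(remaining)))
        if fetch(candidate) == reference_page:
            markers.append(name)
    return markers


def is_cleaned_successfully(dirty_url: str, clean_url: str,
                            fetch: Callable[[str], str]) -> bool:
    """A URL counts as cleaned if both versions lead to the same page."""
    return fetch(dirty_url) == fetch(clean_url)
```

Any function that takes a URL and returns the page body as a string can serve as `fetch`, for example a thin wrapper around an HTTP client.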