The output of WaybackURLKeyMaker and other canonicalizers based on BasicURLCanonicalizer has changed for URLs that contain percent-encoded sequences that are not valid UTF-8. For example, when a URL contains "%C3%23" it is now normalised to "%c3%23", whereas previous releases produced "%25c3%23". This change brings webarchive-commons more in line with pywb, surt (Python), warcio.js and RFC 3986. While CDX file compatibility with these newer tools should improve, note that CDX files generated by the new release that contain such URLs may not work correctly with existing versions of OpenWayback that use the older webarchive-commons. #102
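To make the "%C3%23" example concrete, here is a minimal Java sketch. It is not the canonicalizer's actual code: the class `PercentEscapeSketch` and both helper methods are hypothetical names used only for illustration. The first helper shows why the byte pair C3 23 is not valid UTF-8 (0xC3 starts a two-byte sequence, but 0x23 is not a continuation byte). The second shows only the shape of the new output for such escapes: the escapes are kept and their hex digits lowercased, rather than the leading "%" being re-escaped as "%25".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of webarchive-commons.
final class PercentEscapeSketch {
    private static final Pattern ESCAPE = Pattern.compile("%[0-9A-Fa-f]{2}");

    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Lowercases the hex digits of each percent escape; all other text is copied as-is. */
    static String lowercaseEscapes(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            out.append(m.group().toLowerCase(Locale.ROOT));
            last = m.end();
        }
        out.append(s, last, s.length());
        return out.toString();
    }
}
```

With these helpers, `isValidUtf8(new byte[] {(byte) 0xC3, 0x23})` is false, and `lowercaseEscapes("%C3%23")` yields "%c3%23", matching the new form quoted above. The real canonicalizers perform additional steps that this sketch does not attempt to reproduce.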
- WAT: Duplicated payload metadata values for "Actual-Content-Length" and "Trailing-Slop-Length" #103
- ObjectPlusFilesOutputStream.hardlinkOrCopy now uses Files.createLink() instead of executing ln. This prevents the potential for security vulnerabilities from command line option injection and improves portability.
- fastutil removed
- dsiutils removed
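The hardlinkOrCopy change listed above can be pictured with a short Java sketch. This is an assumption-laden illustration, not the code in ObjectPlusFilesOutputStream: the class name `HardlinkSketch`, the method signature, and the copy fallback policy are all hypothetical. It relies only on the standard `java.nio.file.Files.createLink(link, existing)` call, which asks the file system for a hard link directly, so no file name is ever passed through a command line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration class; not part of webarchive-commons.
final class HardlinkSketch {
    /**
     * Tries to create a hard link at 'link' pointing to 'existing'.
     * Falls back to a plain copy if the file system cannot create the link
     * (assumed policy for this sketch).
     */
    static void hardlinkOrCopy(Path existing, Path link) throws IOException {
        try {
            Files.createLink(link, existing);
        } catch (IOException | UnsupportedOperationException e) {
            // For example: hard links unsupported, or link and target on different volumes.
            Files.copy(existing, link, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

Because the paths are passed as `Path` objects to the file system API, a file name beginning with "-" cannot be misread as an option, which is the injection risk that arises when an external ln process is spawned.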
The following classes and enum members have been marked deprecated as a step towards removal of the dependency on Apache Commons HttpClient 3.1.
- org.archive.httpclient.HttpRecorderGetMethod
- org.archive.httpclient.HttpRecorderMethod
- org.archive.httpclient.HttpRecorderPostMethod
- org.archive.httpclient.SingleHttpConnectionManager
- org.archive.httpclient.ThreadLocalHttpConnectionManager
- org.archive.util.binsearch.impl.http.ApacheHttp31SLR
- org.archive.util.binsearch.impl.http.ApacheHttp31SLRFactory
- org.archive.util.binsearch.impl.http.HTTPSeekableLineReaderFactory.HttpLibs.APACHE_31
- MetaData is now multivalued to support repeated WARC and HTTP headers. #98
- commons-io 2.18.0
- commons-lang 2.6
- guava 33.3.1-jre
- hadoop 3.4.1
- htmlparser 2.1
- httpcore 4.4.16
- json 20240303
- junit 4.13.2
- Fixed URLParser and WaybackURLKeyMaker failing on URLs with IPv6 address hostnames #100
- WAT extractor: do not fail on missing WARC-Filename in warcinfo record
- ExtractingParseObserver: extract rel, hreflang and type attributes
- ExtractingParseObserver: extract links from onClick attributes
- commons-collections 3.2.2
- commons-io 2.7
- dsiutils 2.2.8
- guava 33.3.0-jre
- hadoop 3.4.0 (now optional)
- pig 0.17.0
- org.json 20231013
- joda-time (was unused)
- Use commons-collections v3.2.2 to avoid v3.2.1 vulnerability
- Extract property attributes of HTML meta elements
- Do not add value of preceding HTTP header field if there is no value
- Fix WAT records corresponding to response records of Wget generated WARCs
- Improve HTML link extraction
- Move unit tests over from heritrix3 to webarchive-commons
- Strip empty port via URLParser
- Use CharsetDetector to guess encoding of HTML documents
- Fix loss of the last header when the header block ends with LF LF
- Make regular expression to extract URLs from CSS more restrictive
- Remove invalid constant PROFILE_REVISIT_URI_AGNOSTIC_IDENTICAL_DIGEST
- Allow the canonicalizer to strip session id params even if they are the first params in the query string
- Store origin-code of ARC file header
- Flush output etc before tallying stats to fix sizeOnDisk calculation
- Get rid of broken, seemingly unnecessary escapeWhitespace() step of uri fixup
- Handle empty String argument in CharsetDetector.trimAttrValue
- WAT extractor: adding information in WAT's warcinfo
- WAT extractor: missing WARC format version
- WAT extractor: envelope structure does not conform to the WAT specification
- WAT extractor: WARC-Date in all records should be the WAT record generation date
- WAT extractor: WARC-Filename in the WAT warcinfo record should be the WAT filename itself
- WAT extractor: Entity-Trailing-Slop-Bytes should be called Entity-Trailing-Slop-Length
- Escape redirect URLs in RealCDXExtractorOutput
- Tests fail on Windows
- Test fails on Java 8
- RecordingOutputStream can affect tcp packets sent in an undesirable way