1.3.0

URL Canonicalization Changed

The output of WaybackURLKeyMaker and other canonicalizers based on BasicURLCanonicalizer has changed for URLs containing percent-encoded sequences that are not valid UTF-8. For example, a URL containing "%C3%23" is now normalised to "%c3%23", whereas previous releases produced "%25c3%23". This change brings webarchive-commons more in line with pywb, surt (Python), warcio.js and RFC 3986. While CDX file compatibility with these newer tools should improve, CDX files generated by the new release that contain such URLs may not work correctly with existing versions of OpenWayback that use the older webarchive-commons. #102
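The difference can be illustrated with a small standalone sketch. It is not the BasicURLCanonicalizer implementation; the class and method names are hypothetical, it handles a single URL component rather than a whole URL, and the choice of lowercase hex digits and of the RFC 3986 unreserved set is an assumption made to match the example above. The idea is to decode percent escapes to raw bytes without requiring them to be valid UTF-8, then re-encode byte by byte:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PercentBytesSketch {

    // Value of an ASCII hex digit, or -1 if the byte is not a hex digit.
    static int hex(byte b) {
        if (b >= '0' && b <= '9') return b - '0';
        if (b >= 'a' && b <= 'f') return b - 'a' + 10;
        if (b >= 'A' && b <= 'F') return b - 'A' + 10;
        return -1;
    }

    // Decode every well-formed %XX escape into one raw byte. The bytes are
    // never interpreted as UTF-8, so "%C3%23" becomes {0xC3, 0x23}.
    static byte[] decodeToBytes(String input) {
        byte[] in = input.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = hex(in[i + 1]);
                int lo = hex(in[i + 2]);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2;
                    continue;
                }
            }
            out.write(in[i]);
        }
        return out.toByteArray();
    }

    // Re-encode: keep RFC 3986 unreserved characters, escape every other
    // byte as %xx (lowercase hex is an assumption of this sketch).
    static String encodeBytes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xff;
            boolean unreserved = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02x", v));
            }
        }
        return sb.toString();
    }

    static String normalise(String component) {
        return encodeBytes(decodeToBytes(component));
    }

    public static void main(String[] args) {
        System.out.println(normalise("%C3%23")); // prints "%c3%23"
    }
}
```

Because the stray 0xC3 byte is carried through as a byte, its "%" is never escaped a second time, which is what would produce a "%25c3%23" style result.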

Bug fixes

WAT: Fixed duplicated payload metadata values for "Actual-Content-Length" and "Trailing-Slop-Length" #103
ObjectPlusFilesOutputStream.hardlinkOrCopy now uses Files.createLink() instead of executing ln. This prevents the potential for security vulnerabilities from command line option injection and improves portability.
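A minimal sketch of the link-then-copy pattern follows. It is not the actual ObjectPlusFilesOutputStream code; the class name, method signature and boolean return value are assumptions for illustration. Only Files.createLink() and Files.copy() are standard java.nio.file calls:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class HardlinkOrCopySketch {

    // Try to create a hard link at 'destination' pointing to 'source'.
    // If linking fails (for example because the file system does not support
    // hard links or the paths are on different volumes), copy the file instead.
    // Returns true if a hard link was created, false if the file was copied.
    static boolean hardlinkOrCopy(Path source, Path destination) throws IOException {
        try {
            // Note the argument order: the new link comes first, then the existing file.
            Files.createLink(destination, source);
            return true;
        } catch (IOException | UnsupportedOperationException | SecurityException e) {
            Files.copy(source, destination);
            return false;
        }
    }
}
```

Since no external process is started, a file name beginning with "-" cannot be misread as a command line option, and the same code path works on platforms that have no ln command.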

Dependency upgrades

fastutil removed
dsiutils removed

Deprecations

The following classes and enum members have been marked deprecated as a step towards removal of the dependency on Apache Commons HttpClient 3.1.

org.archive.httpclient.HttpRecorderGetMethod
org.archive.httpclient.HttpRecorderMethod
org.archive.httpclient.HttpRecorderPostMethod
org.archive.httpclient.SingleHttpConnectionManager
org.archive.httpclient.ThreadLocalHttpConnectionManager
org.archive.util.binsearch.impl.http.ApacheHttp31SLR
org.archive.util.binsearch.impl.http.ApacheHttp31SLRFactory
org.archive.util.binsearch.impl.http.HTTPSeekableLineReaderFactory.HttpLibs.APACHE_31

1.2.0

New features

MetaData is now multivalued to support repeated WARC and HTTP headers. #98

Dependency upgrades

commons-io 2.18.0
commons-lang 2.6
guava 33.3.1-jre
hadoop 3.4.1
htmlparser 2.1
httpcore 4.4.16
json 20240303
junit 4.13.2

1.1.11

Bug fixes

Fixed URLParser and WaybackURLKeyMaker failing on URLs with IPv6 address hostnames #100
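For background, IPv6 literals in URLs are written in square brackets and contain colons themselves, so host and port cannot be separated by splitting at the first colon. The sketch below is not taken from URLParser and does not claim to reproduce the original bug; both helper methods are hypothetical and only contrast a naive split with a bracket-aware one:

```java
class Ipv6HostSketch {

    // Naive approach: everything before the first ':' is treated as the host.
    // This breaks on IPv6 literals such as "[2001:db8::1]:8080".
    static String naiveHost(String authority) {
        int colon = authority.indexOf(':');
        return colon < 0 ? authority : authority.substring(0, colon);
    }

    // Bracket-aware approach: if the authority starts with '[', the host runs
    // up to and including the matching ']'. Otherwise split at the last ':'.
    static String bracketAwareHost(String authority) {
        if (authority.startsWith("[")) {
            int close = authority.indexOf(']');
            if (close > 0) {
                return authority.substring(0, close + 1);
            }
        }
        int colon = authority.lastIndexOf(':');
        return colon < 0 ? authority : authority.substring(0, colon);
    }

    public static void main(String[] args) {
        String authority = "[2001:db8::1]:8080";
        System.out.println(naiveHost(authority));        // prints "[2001"
        System.out.println(bracketAwareHost(authority)); // prints "[2001:db8::1]"
    }
}
```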

1.1.10

Bug fixes

WAT extractor: do not fail on missing WARC-Filename in warcinfo record
ExtractingParseObserver: extract rel, hreflang and type attributes
ExtractingParseObserver: extract links from onClick attributes

Dependency upgrades

commons-collections 3.2.2
commons-io 2.7
dsiutils 2.2.8
guava 33.3.0-jre
hadoop 3.4.0 (now optional)
pig 0.17.0
org.json 20231013

Dependency removals

joda-time (was unused)

1.1.9

Use commons-collections v3.2.2 to avoid v3.2.1 vulnerability
Extract property attributes of HTML meta elements
Do not add value of preceding HTTP header field if there is no value
Fix WAT records corresponding to response records of Wget generated WARCs

1.1.8

Improve HTML link extraction
Move unit tests over from heritrix3 to webarchive-commons
Strip empty port via URLParser
Use CharsetDetector to guess encoding of HTML documents
Fix last header being lost when the headers are terminated by LF LF
Make regular expression to extract URLs from CSS more restrictive
Remove invalid constant PROFILE_REVISIT_URI_AGNOSTIC_IDENTICAL_DIGEST

1.1.7

Make canonicalizer be able to strip session id params even if they are the first params in the query string
Store origin-code of ARC file header
Flush output etc before tallying stats to fix sizeOnDisk calculation
Get rid of broken, seemingly unnecessary escapeWhitespace() step of uri fixup

1.1.6

Handle empty String argument in CharsetDetector.trimAttrValue
WAT extractor: adding information in WAT's warcinfo
WAT extractor: missing WARC format version
WAT extractor: envelope structure does not conform to the WAT specification
WAT extractor: WARC-Date in all records should be the WAT record generation date
WAT extractor: WARC-Filename in the WAT warcinfo record should be the WAT filename itself
WAT extractor: Entity-Trailing-Slop-Bytes should be called Entity-Trailing-Slop-Length

1.1.5

Escape redirect URLs in RealCDXExtractorOutput
Tests fail on Windows
Test fails on Java 8
RecordingOutputStream can affect tcp packets sent in an undesirable way

1.1.4

All dates should be independent of locale settings
Resolved fastutil conflict in dependencies

1.1.3

Synchronised with IA fork
Updated to more recent Guava APIs
Fixed handling of uncompressed ARC files #13 and #14
Avoid pulling in the logback dependency IA#13

1.1.2

Fixed support for reading uncompressed WARCs, along with some unit testing.

1.1.1

Renamed from commons-webarchive to webarchive-commons
Cope with malformed GZip extra fields as produced by wget 1.14
Switch to httpcomponents, and add IA deployment information.
