forked from Yelp/mrjob
CHANGES.txt
v0.6.0, 2017-03-?? -- ???
* removed deprecated INPUT, OUTPUT attributes from JarStep
* removed deprecated filesystem methods:
* path_exists()
* path_join()
* SSHFilesystem.ssh_slave_hosts()
* removed deprecated mrjob command aliases:
* create-job-flow
* terminate-idle-job-flows
* terminate-job-flow
* removed deprecated MRJob/MRJobLauncher methods:
* all_option_groups()
* is_mapper_or_reducer()
* mr()
* removed option groups (deprecated) from MRJobLauncher
* removed deprecated functions:
* mrjob.parse:
* is_windows_path()
* parse_key_value_list()
* parse_port_range_list()
* mrjob.util:
* args_for_opt_dest_subset()
* bash_wrap()
* buffer_iterator_to_line_iterator()
* bunzip2_stream() (now in mrjob.cat)
* gunzip_stream() (now in mrjob.cat)
* populate_option_groups_with_options()
* scrape_options_and_index_by_dest()
* scrape_options_into_new_groups()
* removed deprecated options:
* bootstrap_cmds
* bootstrap_files
* bootstrap_scripts
* hadoop_home
* hadoop_streaming_jar_on_emr
* num_ec2_instances
* python_archives
* setup_cmds
* setup_scripts
* strict_protocols
* removed deprecated option aliases:
* ami_version
* aws_availability_zone
* aws_region
* base_tmp_dir
* check_emr_status_every
* ec2_core_instance_bid_price
* ec2_core_instance_type
* ec2_instance_type
* ec2_master_instance_bid_price
* ec2_master_instance_type
* ec2_slave_instance_type
* ec2_task_instance_bid_price
* ec2_task_instance_type
* emr_job_flow_id
* emr_job_flow_pool_name
* emr_tags
* hdfs_scratch_dir
* num_ec2_core_instances
* num_ec2_task_instances
* pool_emr_job_flows
* s3_log_uri
* s3_scratch_uri
* s3_tmp_dir
* s3_sync_wait_time
* s3_upload_part_size
* ssh_tunnel_to_job_tracker
* removed deprecated runner methods:
* get_job_name()
* EMRJobRunner:
* get_ami_version()
* get_emr_job_flow_id()
* make_persistent_job_flow()
* removed deprecated passthrough to runner.fs
* removed deprecated switches:
* --partitioner
* removed deprecated JOB_FLOW and *SCRATCH cleanup types
v0.5.8, 2017-02-01 -- upload_dirs, pre-filters
* automatically tarball and upload directories with --dir, setup hooks (#23)
* specify path for inter-step output with --step-output-dir (#263)
* jobs:
* better --help printout
* deprecated option groups in MRJobs
* deprecated MRJob.get_all_option_groups()
* overriding *_pre_filter() methods in MRJob works again (#1521)
* all step types accept jobconf (#1447)
* quieted warning about SORT_VALUES on Hadoop 2 (#1286)
* all runners:
* wrap tasks that require pipes with sh_bin, not bash (#1330)
* local runner:
* allows non-zero exit status from pre-filters (#1524)
* pre-filters can now handle compressed input (#1061)
* EMR runner:
* fetch logs from task nodes as well as core nodes (#1400)
* use ListInstances rather than dfsadmin to get node list (#1345)
* moved mrjob.util.bunzip2_stream() to mrjob.cat
* moved mrjob.util.gunzip_stream() to mrjob.cat
* deprecated:
* mrjob.util.args_for_opt_dest_subset()
* mrjob.util.bash_wrap()
* mrjob.util.populate_option_groups_with_options()
* mrjob.util.scrape_options_and_index_by_dest()
* mrjob.util.tar_and_gz()
* SSHFilesystem.ssh_slave_hosts()
v0.5.7, 2016-12-19 -- Spark
* EMR and Hadoop runners:
* full support for Spark (#1320)
* includes spark() method in MRJob and SparkStep/SparkScriptStep
* can use environment variables and ~ in hadoop_streaming_jar option
* EMR runner:
* default AMI version is now 4.8.2 (#1486)
* default instance type is m1.large when running Spark jobs (#1465)
* added debug logging for matching available pooled clusters (#1449)
* defaults to cheapest instance type that will work (#1369)
* master bootstrap script always created when pooling
* no longer crashes when trying to use missing ssh binary (#1474)
* pooled clusters may have 1000 steps (#1463)
* failed jobs no longer reported as 100% complete (#793)
* All runners:
* py_files option for Spark and streaming steps (#1375)
* bootstrap mrjob with a .zip rather than a tarball
* options refactor, added missing command-line switches (#1439)
* mrjob terminate-idle-clusters works with all step types (#1363)
* log interpretation
* dropped unnecessary container-to-attempt-ID mapping (#1487)
* more efficient search for task log errors (#1450)
* cleaner error messages when bootstrapped mrjob won't compile
* JarSteps
* now support libjars, jobconf (#1481)
* JarStep.{INPUT,OUTPUT} are deprecated (use mrjob.step.{INPUT,OUTPUT})
* is_uri() now only matches URIs containing "://" (#1455)
* works in Anaconda3 Jupyter Notebook (#1441)
* deprecated mrjob.parse.is_windows_path()
* deprecated mrjob.parse.parse_key_value_list()
* deprecated mrjob.parse.parse_port_range_list()
* deprecated mrjob.util.scrape_options_into_new_groups()
* deprecated non-strict protocols (#1452)
* deprecated python_archives (#1056)
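The is_uri() change in this release (#1455) can be illustrated with a minimal sketch. This is not mrjob's source; `looks_like_uri` is a hypothetical name, and the check is reduced to the rule stated in the entry: a string counts as a URI only if it contains "://". One practical effect is that a Windows path such as C:\data\input.txt is no longer mistaken for a URI with scheme "c".

```python
from urllib.parse import urlparse


def looks_like_uri(path):
    """Return True only for strings that have a scheme followed by '://'.

    Hypothetical helper illustrating the rule from the changelog entry;
    not the actual mrjob.parse.is_uri() implementation.
    """
    return '://' in path and bool(urlparse(path).scheme)


if __name__ == '__main__':
    for candidate in ('s3://bucket/key', 'hdfs:///tmp/out',
                      r'C:\data\input.txt', '/usr/local/file.txt'):
        print(candidate, looks_like_uri(candidate))
```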
v0.5.6, 2016-09-12 -- dataproc crash fix
* Dataproc runner:
* fix Hadoop version crash on unknown image version (#1428)
* EMR and Hadoop runners:
* prioritize task errors as probable cause of failure (#1429)
* ignore Java stack trace in task stderr logs (#1430)
v0.5.5, 2016-09-05 -- missing ami_version option
* EMR runner:
* deprecate, don't remove ami_version option in v0.5.4 (#1421)
* update memory/CPU stats for EC2 instances for pooling (#1414)
* pooling treats application names as case-insensitive (#1417)
v0.5.4, 2016-08-26 -- pooling auto-recovery
* jobs:
* pass_through_option(), for existing command-line options (#1075)
* MRJob.options.runner now defaults to None, not 'inline' or 'local'
* runners:
* all:
* names of uploaded files now never start with . or _ (#1200)
* Hadoop:
* log parsing:
* handles more log4j patterns (#1405)
* gracefully handles IOError from exists() (#1355)
* fixed crash bug in Hadoop FS on Python 3 (#1396)
* EMR:
* pooling auto-recovers from joining a cluster that self-terminated (#708)
* log fetching uses sudo on 4.3.0+ AMIs (#1244)
* fixed broken --ssh-bind-ports switch (#1402)
* idle termination script now only runs on master node (#1398)
* ssh tunnel connects to internal IP of resource manager (#1397)
* AWS credentials no longer logged in verbose mode (#1353)
* many option names are now more generic (#1247)
* ami_version -> image_version
* accidentally removed ami_version option entirely (fixed in v0.5.5)
* aws_availability_zone -> zone
* aws_region -> region
* check_emr_status_every -> check_cluster_every
* ec2_core_instance_bid_price -> core_instance_bid_price
* ec2_core_instance_type -> core_instance_type
* ec2_instance_type -> instance_type
* ec2_master_instance_bid_price -> master_instance_bid_price
* ec2_master_instance_type -> master_instance_type
* ec2_task_instance_bid_price -> task_instance_bid_price
* ec2_task_instance_type -> task_instance_type
* emr_tags -> tags
* num_ec2_core_instances -> num_core_instances
* num_ec2_task_instances -> num_task_instances
* s3_log_uri -> cloud_log_dir
* s3_sync_wait_time -> cloud_fs_sync_secs
* s3_tmp_dir -> cloud_tmp_dir
* s3_upload_part_size -> cloud_upload_part_size
* num_ec2_instances is deprecated (use num_core_instances)
* ec2_slave_instance_type is deprecated (use core_instance_type)
* hadoop_streaming_jar_on_emr is deprecated (#1405)
* hadoop_streaming_jar handles this instead with file:// URIs
* bootstrap_python does nothing on AMI 4.6.0+, as not needed (#1358)
* mrjob audit-emr-usage should show fewer/no API throttling warnings (#1091)
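The option renames listed for v0.5.4 are mechanical, so an old set of options can be translated with a lookup table. The sketch below assumes the options are held in a plain dict (for example, one runner section loaded from mrjob.conf). `RENAMED_OPTS` is copied from the list above; `upgrade_opts` is a hypothetical helper and not part of mrjob.

```python
# Old name -> new name, copied from the v0.5.4 rename list.
RENAMED_OPTS = {
    'ami_version': 'image_version',
    'aws_availability_zone': 'zone',
    'aws_region': 'region',
    'check_emr_status_every': 'check_cluster_every',
    'ec2_core_instance_bid_price': 'core_instance_bid_price',
    'ec2_core_instance_type': 'core_instance_type',
    'ec2_instance_type': 'instance_type',
    'ec2_master_instance_bid_price': 'master_instance_bid_price',
    'ec2_master_instance_type': 'master_instance_type',
    'ec2_task_instance_bid_price': 'task_instance_bid_price',
    'ec2_task_instance_type': 'task_instance_type',
    'emr_tags': 'tags',
    'num_ec2_core_instances': 'num_core_instances',
    'num_ec2_task_instances': 'num_task_instances',
    's3_log_uri': 'cloud_log_dir',
    's3_sync_wait_time': 'cloud_fs_sync_secs',
    's3_tmp_dir': 'cloud_tmp_dir',
    's3_upload_part_size': 'cloud_upload_part_size',
}


def upgrade_opts(opts):
    """Return a copy of opts with old option names replaced by new ones.

    If both the old and the new name are present, the new name wins and
    the value stored under the old name is dropped.
    """
    upgraded = {}
    for name, value in opts.items():
        new_name = RENAMED_OPTS.get(name, name)
        if new_name != name and new_name in opts:
            continue  # an explicit new-style setting takes precedence
        upgraded[new_name] = value
    return upgraded
```

num_ec2_instances and ec2_slave_instance_type are left out of the table on purpose: the entry marks them as deprecated in favor of other options rather than renamed, so they probably need a manual decision.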
v0.5.3, 2016-07-15 -- libjars
* jobs:
* LIBJARS and libjars method (#1341)
* runners:
* all:
* .cpython-3*.pyc files no longer included when bootstrapping mrjob
* local:
* *PATH envvars combined with local separator (#1321)
* Hadoop and EMR:
* libjars option (#198)
* fixes to ordering of generic and JAR-specific options (#1331, #1332)
* Hadoop:
* more default log dirs (#1339)
* hadoop_tmp_dir handles ~ and envvars (#1322) (broken in v0.5.0)
* EMR:
* determine cause of failure of bootstrap scripts (#370)
* master bootstrap script now redirects stdout to stderr
* emr_configurations option (#1276)
* subnet option (#1323)
* SSH tunnel opened as soon as cluster is ready (#1115)
* SSH tunnel leaves stdin alone (#1161)
* combine_lists() treats dicts as values, not sequences
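The combine_lists() behavior described here (dicts as single values, plus the scalar handling added in v0.4.6) might look roughly like the following. This is a hypothetical re-implementation named `combine_lists_sketch`, not mrjob's code; skipping None as "not set" is an assumption of this sketch.

```python
def combine_lists_sketch(*seqs):
    """Concatenate sequences, treating strings, bytes, dicts, and
    non-iterable scalars as single values rather than sequences.

    None is skipped (assumed here to mean "option not set").
    """
    result = []
    for seq in seqs:
        if seq is None:
            continue
        if isinstance(seq, (str, bytes, dict)):
            # iterable, but meant as one value
            result.append(seq)
            continue
        try:
            result.extend(seq)
        except TypeError:
            # not iterable at all (e.g. an int)
            result.append(seq)
    return result
```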
v0.5.2, 2016-05-23 -- initial Cloud Dataproc support
* basic support for Google Cloud Dataproc (#1243)
* lacks log interpretation, JarStep support
* on EMR, wait for steps to complete in correct order (#1316)
* correctly handle ~ in include path in mrjob.conf (#1308)
* new emr_applications option (#1293)
* fix running deprecated tools with python -m (#1312)
* fix ssh tunneling to 2.x AMIs on EMR in VPCs (#1311)
v0.5.1, 2016-04-29 -- post-release bugfixes
* strict_protocols in mrjob.conf is no longer ignored (#1302)
* check_input_paths in mrjob.conf is no longer ignored
* partitioner() is no longer ignored, fixing SORT_VALUES (#1294)
* --partitioner switch is deprecated
* improved probable cause of error from pre-YARN logs (#1288)
* ssh_bind_ports now defaults to (x)range, not list (#1284)
* mrjob terminate-idle-clusters handles debugging jar from boto 2.40.0 (#1306)
v0.5.0, 2016-03-28 -- the future is in the past
* supports Python 3 (#989)
* requires boto 2.35.0 or newer (#980)
* removed many workarounds for S3 and EMR (#980), IAM (#1062)
* jobs:
* is_mapper_or_reducer() is now is_task() (#1072)
* mr() no longer takes positional arguments (#814)
* removed jar() (use mrjob.step.JarStep)
* removed testing methods parse_counters() and parse_output()
* protocols:
* protocols are strict by default (#724)
* JSON protocols use ujson when available, then simplejson (#1002, #1266)
* can explicitly choose Standard, Simple or Ultra JSON protocol
* raw protocols handle bytes or unicode depending on Python version
* can explicitly choose Text or Bytes protocol
* mrjob.step:
* JarStep only takes "args" and "main_class" keyword args
* removed MRJobStep (use MRStep)
* runners:
* All runners:
* totally revamped log handling (#1123)
* runner status/log messages are less noisy (#1044)
* don't bootstrap mrjob if interpreter is set (#1041)
* fs methods path_exists() and path_join() are now exists() and join()
* deprecation warning: use runner.fs explicitly (#1146)
* changes to cleanup options:
* removed IS_SUCCESSFUL (use ALL)
* LOCAL_SCRATCH is now LOCAL_TMP (#318)
* new HADOOP_TMP option handles HDFS cleanup (#1261)
* REMOTE_SCRATCH is now CLOUD_TMP (#1261)
* base_tmp_dir option is now local_tmp_dir (#318)
* non-inline runners raise StepFailedException on step failure (#1219)
* steps_python_bin defaults to current python interpreter (#1038)
* _job_name is now _job_key (#982)
* EMR:
* default AWS region is us-west-2 (#1025)
* default instance type is m1.medium (#992)
* visible_to_all_users defaults to true (#1016)
* matches your minor version of Python 2 on 3.x and 4.x AMIs (#1265)
* 4.x AMIs are supported (#1105)
* added --release-label switch (--ami-version 4.x.y also works)
* can fetch counters and probable cause of failure on 3.x and 4.x AMIs
* SSH tunnel now works on 3.x and 4.x AMIs (#1013)
* ssh_tunnel_to_job_tracker option is now ssh_tunnel
* correctly fetch step logs by step ID (#1117)
* bootstrap_python option
* s3_scratch_uri option is now s3_tmp_dir (#318)
* aws_region is no longer inferred from s3_tmp_dir
* create/select temp bucket in same region as EMR jobs (#687)
* added iam_endpoint option (#1067)
* removed s3_conn args from methods in EMRJobRunner and S3Filesystem
* S3 Filesystem:
* connect to each S3 bucket on appropriate endpoint (#1028)
* fall back to default if we can't get bucket location (#1170)
* removed special treatment of _$folder$ keys
* removed deprecated S3Filesystem method get_s3_folder_keys()
* recurse "subdirectories" even if uri lacks trailing / (#1183)
* removed iam_job_flow_role option (use iam_instance_profile)
* custom hadoop_streaming_jar gets properly uploaded
* job cleanup temporarily disabled (#1241)
* pooling respects key pair (#1230)
* idle cluster self-termination respects non-streaming jobs (#1145)
* deprecated "latest" AMI version not passed through to EMR (#1269)
* emr_job_flow_id option is now cluster_id (#1082)
* emr_job_flow_pool_name is now pool_name (#1082)
* pool_emr_job_flows is now pool_clusters (#1082)
* Hadoop
* works out-of-the-box on most Hadoop setups (#1160)
* works out-of-the-box inside EMR (2.x, 3.x, and 4.x AMIs)
* counters are parsed from Hadoop binary stderr in YARN (#1153)
* can find logs and probable cause of failure in YARN (#1195)
* will search in <output dir>/_logs, to support Cloudera (#565)
* HDFS Filesystem:
* use fs -ls -R and fs -rm -R in YARN (#1152)
* mkdir() now uses -p on YARN (#991)
* fs.du() now works on YARN (#1155)
* fs.du() now returns 0 for nonexistent files instead of erroring
* fs.rm() now uses -skipTrash
* dropped support for Hadoop prior to 0.20.203 (#1208)
* added hadoop_log_dirs option
* hdfs_scratch_dir option is now hadoop_tmp_dir (#318)
* hadoop_home is deprecated
* uses -D and correct property name when step has no reduces (#1213)
* Inline/Local
* runner.fs raises IOError if passed URIs (#1185)
* version-agnostic by default (#735)
* removed ignored hadoop_extra_args and hadoop_streaming_jar opts (#1275)
* inline runner uses multiple splits by default (#1276)
* removed mrjob.compat.get_jobconf_value() (use jobconf_from_env())
* removed mrjob.compat methods to support Hadoop prior to 0.20.203:
* supports_combiners_in_hadoop_streaming()
* supports_new_distributed_cache_options()
* uses_generic_jobconf()
* removed mrjob.conf.combine_cmd_lists()
* removed fetch-logs tool (#1127)
* mrjob subcommands use "cluster" rather than "job-flow" (#1082)
* create-job-flow is now create-cluster
* terminate-idle-job-flows is now terminate-idle-clusters
* terminate-job-flow is now terminate-cluster
* Python-version-specific mrjob-x and mrjob-x.y commands (#1104)
* use followlinks=True with os.walk()
* all internal constants/functions/methods explicitly start with _ (#681)
* mrjob.util:
* file_ext() takes filename, not path
* random_identifier() moved here from mrjob.aws
* buffer_iterator_to_line_iterator() is now to_lines()
* to_lines() no longer appends a newline to data (#819)
* removed extract_dir_for_tar()
* gunzip_stream() now yields chunks, not lines
* removed hash_object()
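Because gunzip_stream() now yields chunks rather than lines, and to_lines() no longer appends a newline, turning a chunk stream into lines can be sketched as below. `chunks_to_lines` is a hypothetical stand-in written for illustration; it is not the mrjob.util source.

```python
def chunks_to_lines(chunks):
    """Yield lines (bytes, newline included) from an iterable of byte chunks.

    A final line with no trailing newline is yielded as-is; nothing is
    appended to it.
    """
    leftover = b''
    for chunk in chunks:
        data = leftover + chunk
        lines = data.split(b'\n')
        # last piece is an incomplete line (b'' if data ended with b'\n')
        leftover = lines.pop()
        for line in lines:
            yield line + b'\n'
    if leftover:
        yield leftover
```

Chunk boundaries need not line up with line boundaries: a line split across two chunks is reassembled before it is yielded.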
v0.4.6, 2015-11-09 -- config files
* PyYAML>=3.08 is required
* !clear tag in conf files (#1162)
* combine_lists() and combine_path_lists() can handle scalars (#1172)
* include: paths in conf files are relative to real path of conf file (#1166)
* mrjob.conf.combine_cmd_lists() is deprecated (#1168)
* EMR runner: pool_wait_minutes can now be loaded from mrjob.conf (#1070)
* support for wheel packaging format (#1140)
v0.4.5, 2015-07-28 -- DescribeJobFlows begone
* boto>=2.6.0 is required (used to be 2.2.0)
* runners:
* EMR:
* moved off deprecated DescribeJobFlows API (#876)
* time-to-end-of-hour now uses creation time, not "start" time
* aws_security_token for temporary credentials (#1003)
* Use AWS managed policies when creating IAM objects (#1026)
* Fall back to default role/instance profile when no IAM access (#1008)
* added emr_tags option (#1058)
* added get_ami_version() method
* hadoop_version option no longer has any effect (#1017)
* Hadoop:
* --hadoop-home switch now works (#1037)
* EMR tools:
* added switches for AWS connection options etc. (#1087)
* mrboss is available from command line tool: mrjob boss [args]
* terminate_idle_job_flows:
* less prone to race condition (#910)
* prints results to stdout in dry_run mode (#1102)
* job flows stuck in STARTING state no longer considered idle
* report_long_jobs reports job flows stuck in STARTING state
* collect_emr_stats and job_flow_pool are deprecated
* more efficient decoding of bz2 files
* mrjob.retry.RetryWrapper raises exception when out of tries (#1093)
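The "time-to-end-of-hour now uses creation time" entry in this release can be made concrete with a small calculation. The sketch assumes hours are counted in whole hours from the moment the job flow was created; `time_to_end_of_hour` is a hypothetical helper, not the function mrjob uses.

```python
from datetime import datetime, timedelta


def time_to_end_of_hour(created, now):
    """Return the time left until the current full hour since `created` ends.

    Exactly on an hour boundary, a full hour is reported as remaining.
    """
    hour = timedelta(hours=1)
    elapsed = now - created
    return hour - (elapsed % hour)


if __name__ == '__main__':
    created = datetime(2015, 7, 28, 12, 10)
    now = datetime(2015, 7, 28, 14, 0)
    print(time_to_end_of_hour(created, now))
```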
v0.4.4, 2015-04-21 -- EMRgency!
* runners:
* EMR:
* Create IAM objects as needed (unbreaks mrjob for new accounts) (#999)
* --iam-job-flow-role renamed to --iam-instance-profile (#1001)
* new --iam-service-role option (#1005)
v0.4.3, 2015-04-08 -- SO many bugfixes
* jobs:
* MRStep's constructor treats kwarg=None same as not setting it (#970)
* parse_counters() and parse_output() are deprecated (#829)
* self.mr is deprecated in favor of MRStep (#815)
* runners:
* All runners:
* You can now set strict_protocols from mrjob.conf (#726)
* new --no-strict-protocols command-line option
* streaming output from closed runner shows a warning (#853)
* EMR:
* --check-input-paths and --no-check-input-paths options (#864)
* skip (very slow) validation of s3 buckets if boto < 2.25.0 (#865)
* Fix for max_hours_idle bug that was terminating job flows early (#932)
* --emr-api-param allows users to pass additional parameters to boto's
EMR API (#879)
* unset parameters with --no-emr-api-param
* bootstrap_python_packages (deprecated) now works on 3.x EMR AMIs (#863)
* Use TERMINATE_CLUSTER instead of deprecated TERMINATE_JOB_FLOW (#974)
* updated EC2 instance type data for pooling (#995)
* Hadoop:
* exclude hadoop source jars when looking for streaming jar (#861)
* Fixed mkdir_on_hdfs for Hadoop version 2.x (#923)
* Fixed hadoop_bin on Windows (#843)
* Local
* bootstrap mrjob by default (#984)
* Inline
* fix for add_file_option() (#851)
* cd to job's working directory before instantiating mrjob class (#988)
* Use pytest to run tests (#898)
* collect-emr-active-stats subcommand (#947)
* Using xtrace flag to get more output during bootstrap (#943)
* Fixed log printouts for command line tools (#901)
* Fix to avoid interpreting windows paths as URIs (#880)
* Better error message when ssh keyfile is missing (#858)
* Update EMR tool ISO8601 parsing to be consistent with EMR runner (#869)
* Dropped support for Python 2.5 (#713)
* Dropped support for the 1.x EMR AMI series, which uses Python 2.5
v0.4.2, 2013-11-27 -- that's one small step for a JAR
* jobs:
* can interpolate input and output path(s) into arguments of JarSteps,
so they can be part of multi-step jobs (#773)
* see mrjob/examples/mr_jar_step_example.py
* JarStep now takes keyword arguments only (#769)
* removed useless "name" field; "step_args" is now just "args"
* MRJobStep (usually accessed via MRJob.mr()) is now MRStep
* runners:
* All runners:
* --setup is now fully functional (#206)
* --python-archive, --setup-cmd, and --setup-script are deprecated
* --bootstrap option works and uses sh (#206)
* --bootstrap-cmd, --bootstrap-file, --bootstrap-python-package,
--bootstrap-script are deprecated
* setup commands can no longer corrupt a task's input and output (#803)
* sh_bin is now "sh -e" by default so setup fails fast (#810)
* default is "/bin/sh -e" on EMR
* EMR:
* JarSteps work again (#763)
* auto-uploads jars for JarSteps (#772)
* JARs on the EMR instances can be accessed with file:/// URIs
* ssh_cat() no longer raises an error when catting a file
containing an error (#807)
* Fixed SignatureDoesNotMatchError that happens with boto 2.10.0+
with Python prior to 2.7.5 (#778)
* Hadoop:
* now handles JarSteps too (#770)
* Fix to mrjob.parse.urlparse() that was breaking Python 2.5
* mrjob.util.buffer_iterator_to_line_iterator() is now more efficient
and uses a bounded amount of memory
* bz2 decompression no longer discards data (#817)
v0.4.1, 2013-09-16 -- secondary sort and self-terminating job flows
* jobs:
* SORT_VALUES: Secondary sort by value (#240)
* see mrjob/examples/
* can now override jobconf() again (#656)
* renamed mrjob.compat.get_jobconf_value() to jobconf_from_env()
* examples:
* bash_wrap/ (mapper/reducer_cmd() example)
* mr_most_used_word.py (two step job)
* mr_next_word_stats.py (SORT_VALUES example)
* runners:
* All runners:
* single setup option works but is not yet documented (#206)
* setup now uses sh rather than python internally
* EMR runner:
* max_hours_idle: self-terminating idle job flows (#628)
* mins_to_end_of_hour option gives finer control over self-termination.
* Can reuse pooled job flows where previous job failed (#633)
* Throws IOError if output path already exists (#634)
* Gracefully handles SSL cert issues (#621, #706)
* Automatically infers EMR/S3 endpoints from region (#658)
* ls() supports s3n:// scheme (#672)
* Fixed log parsing crash on JarSteps (#645)
* visible_to_all_users works with boto <2.8.0 (#701)
* must use --interpreter with non-Python scripts (#683)
* cat() can decompress gzipped data (#601)
* Hadoop runner:
* check_input_paths: can disable input path checking (#583)
* cat() can decompress gzipped data (#601)
* Inline/Local runners:
* Fixed counter parsing for multi-step jobs in inline mode
* Supports per-step jobconf (#616)
* Documentation revamp
* mrjob.parse.urlparse() works consistently across Python versions (#686)
* deprecated:
* many constants in mrjob.emr replaced with functions in mrjob.aws
* removed deprecated features:
* old conf locations (~/.mrjob and in PYTHONPATH) (#747)
* built-in protocols must be instances (#488)
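SORT_VALUES (#240) means that a reducer sees the values for each key in sorted order. The effect can be simulated locally with the standard library, as in the sketch below. This only imitates the outcome; it does not show how Hadoop or mrjob performs the sort, and on a real cluster the order likely follows the serialized form of the values rather than Python's comparison. `shuffle_with_sorted_values` is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_with_sorted_values(pairs):
    """Group (key, value) pairs by key, with each key's values sorted."""
    ordered = sorted(pairs)  # tuples sort by key first, then by value
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


if __name__ == '__main__':
    mapper_output = [('b', 2), ('a', 3), ('a', 1), ('b', 1)]
    for key, values in shuffle_with_sorted_values(mapper_output):
        print(key, values)
```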
v0.4.0, 2013-04-30 -- Slouching toward nirvana
* Changes:
* 'mrjob' command (#225)
* Changed default runner from 'local' to 'inline' (#423)
* Local runner no longer adds working directory to PYTHONPATH of
subprocesses; use inline runner instead (#424)
* Requires boto 2.2.0 or later
* Filesystem functionality moved out of MRJobRunner into 'fs' objects
but forwarded from runners for backward compatibility
* Changed exception hierarchy of mrjob.ssh (which is private but
important)
* Inline and local runners now inherit from the SimMRJobRunner class and thus share most
of their implementation
* Internal data structure for representing a step is much richer, allowing
many cool future features (#479)
* mrjob detects Hadoop version from EMR based on API responses instead of
what's in the config (#611)
* New features:
* Support for non-Hadoop Streaming jar steps (#499)
* Support for arbitrary commands as Hadoop Streaming
mappers/combiners/reducers
* mapper_pre_filter, combiner_pre_filter, and reducer_pre_filter allow
running of a UNIX command in front of tasks to filter input outside of
the interpreter
* Hadoop runner uses PTY to print output from the Hadoop sub process to the
console (#580)
* mrjob knows how to terminate the job on cleanup (Ctrl+C closes the job).
(#353)
* Allow use of multiple -c flags on the command line (#420)
* Bug fixes:
* Silenced some incorrect warnings about ignored options in 'inline' runner
* terminate_idle_job_flows uses the default configuration to terminate idle jobs (#559)
* Removed deprecated functionality:
* --hadoop-*-format
* --*-protocol switches
* MRJob.DEFAULT_*_PROTOCOL
* MRJob.get_default_opts()
* MRJob.protocols()
* PROTOCOL_DICT
* IF_SUCCESSFUL
* DEFAULT_CLEANUP
* S3Filesystem.get_s3_folder_keys()
v0.3.5, 2012-08-21 -- The Last Ride of v0.3.x[?]
* EMR:
* --pool-wait-minutes option lets you wait up to X minutes before creating a
job flow (#455)
* Job flow ID included in error messages on failure (#452)
* JOB and JOB_FLOW cleanup options (#485, #455)
* EMR and Hadoop:
* Compatibility fixes related to deprecated options and Hadoop's bizarre
non-sequential version numbers (#489, #534)
* Other:
* Warn when *_PROTOCOL is not a class (#490)
* Bug fixes:
* Unicode strings can be used when specifying interpreters (#431)
* --enable-emr-logging no longer causes the wrong counters/logs to be parsed
(#446)
* TMP_DIR inserted into 'sort' environment variables (#477)
* Setting hadoop_home in mrjob.conf works again
* Gzipped input files work when specified with relative paths (#494)
* Passthrough options are not re-ordered when sent to Hadoop Streaming
(#509)
v0.3.4.1, 2012-06-12 -- The test suite doesn't catch everything...
* Local mode doesn't try to send multiple mappers to the same output file
when using multiple compressed files as input
v0.3.4, 2012-06-11 -- We are friendly people.
* Experimental support for IronPython in the local and inline runners
* set_status() and increment_counter() will encode messages/names of type
'unicode' as UTF-8 when writing to Hadoop Streaming
* EMR and Hadoop counter parsing is more correct
* mrjob.tools.emr.fetch_logs fetches logs from S3 when asked instead of
incorrectly refusing to do so
* jobconf values can be booleans in mrjob.conf as well as 'true' and 'false'
strings
* hadoop_version can be a float in mrjob.conf, but a warning is printed to the
console
* Command line help is split across several --help-* commands
* Local runner sorts output consistently
v0.3.3.2, 2012-04-10 -- It's a race [condition]!
* Option parsing no longer dies when -- is used as an argument (#435)
* Fixed race condition where two jobs can join same job flow thinking it is
idle, delaying one of the jobs (#438)
* Better error message when a config file contains no data for the current
runner (#433)
v0.3.3.1, 2012-04-02 -- Hothothothothothothotfix
* Fixed S3 locking mechanism parsing of last modified time to work around an
inconsistency in the EMR API
v0.3.3, 2012-03-29 -- Bug...bug...bug...bug...bug...FEATURE!
* EMR:
* Error detection code follows symlinks in Hadoop logs (#396)
* terminate_idle_job_flows locks job flows before terminating them (#391)
* terminate_idle_job_flows -qq silences all output (#380)
* Other fixes:
* mr_tower_of_powers test no longer requires Testify (#395)
* Various runner du() implementations no longer broken (#393, #394)
* Hadoop counter parser regex handles long lines better (#388)
* Hadoop counter parser regex is more correct (#305)
* Better error when trying to parse YAML without PyYAML (#348)
v0.3.2, 2012-02-22 -- AMI versions, spot instances, and more
* Docs:
* 'Testing with mrjob' section in docs (includes #321)
* MRJobRunner.counters() included in docs (#321)
* terminate_idle_job_flows is spelled correctly in docs (#339)
* Running jobs:
* local mode:
* Allow non-string jobconf values again (this changed in v0.3.0)
* Don't split *.gz files (#333)
* emr mode:
* Spot instance support via ec2_*_instance_bid_price and renamed instance
type/number options (#219)
* ami_version option to allow switching between EMR AMIs (#306)
* 'Error while reading from input file' displays correct file (#358)
* python_bin used for bootstrap_python_packages instead of just 'python'
(#355)
* Pooling works with bootstrap_mrjob=False (#347)
* Pooling makes sure a job flow has space for the new job before joining
it (#324)
* EMR tools:
* create_job_flow no longer tries to use an option that does not exist
(#349)
* report_long_jobs tool alerts on jobs that have run for more than X hours
(#345)
* mrboss no longer spells stderr 'stsderr'
* terminate_idle_job_flows counts jobs with pending (but not running)
steps as idle (#365)
* terminate_idle_job_flows can terminate job flows near the end of a
billable hour (#319)
* audit_usage breaks down job flows by pool (#239)
* Various tools (e.g. audit_usage) get list of job flows correctly (#346)
v0.3.1, 2011-12-20 -- Nooooo there were bugs!
* Instance-type command-line arguments always override mrjob.conf (Issue #311)
* Fixed crash in mrjob.tools.emr.audit_usage (Issue #315)
* Tests now use unittest; python setup.py test now works (Issue #292)
v0.3.0, 2011-12-07 -- Worth the wait
* Configuration:
  * Saner mrjob.conf locations (Issue #97):
    * ~/.mrjob is deprecated in favor of ~/.mrjob.conf
    * searching in PYTHONPATH is deprecated
    * MRJOB_CONF environment variable for custom paths
* Defining Jobs (MRJob):
  * Combiner support (Issue #74)
  * *_init() and *_final() methods for mappers, combiners, and reducers
    (Issue #124)
  * mapper/combiner/reducer methods no longer need to contain a yield
    statement if they emit no data
  * Protocols:
    * Protocols can be anything with read() and write() methods, and are
      instances by default (Issue #229)
    * Set protocols with the *_PROTOCOL attributes or by re-defining the
      *_protocol() methods
    * Built-in protocol classes cache the encoded and decoded value of the
      last key for faster decoding during reducing (Issue #230)
    * --*protocol switches and aliases are deprecated (Issue #106)
  * Set Hadoop formats with HADOOP_*_FORMAT attributes or the hadoop_*_format()
    methods (Issue #241)
    * --hadoop-*-format switches are deprecated
    * Hadoop formats can no longer be set from mrjob.conf
  * Set jobconf with JOBCONF attribute or the jobconf() method (in addition
    to --jobconf)
  * Set Hadoop partitioner class with --partitioner, PARTITIONER, or
    partitioner() (Issue #6)
  * Custom option parsing (Issue #172)
  * Use mrjob.compat.get_jobconf_value() to get jobconf values from environment
* Running jobs:
  * All modes:
    * All runners are Hadoop-version aware and use the correct jobconf and
      combiner invocation styles (Issue #111)
    * All types of URIs can be passed through to Hadoop (Issue #53)
    * Speed up steps with no mapper by using cat (Issue #5)
    * Stream compressed files with cat() method (Issue #17)
    * hadoop_bin, python_bin, and ssh_bin can now all take switches (Issue #96)
    * job_name_prefix option is gone (was deprecated)
    * Better cleanup (Issue #10):
      * Separate cleanup_on_failure option
      * More granular cleanup options
    * Cleaner handling of passthrough options (Issue #32)
  * emr mode:
    * job flow pooling (Issue #26)
    * vastly improved log fetching via SSH (Issue #2)
    * New tool: mrjob.tools.emr.fetch_logs
    * default Hadoop version on EMR is 0.20 (was 0.18)
    * ec2_instance_type option now only sets instance type for slave nodes
      when there are multiple EC2 instances (Issue #66)
    * New tool: mrjob.tools.emr.mrboss for running commands on all nodes and
      saving output locally
  * inline mode:
    * Supports cmdenv (Issue #136)
    * Passthrough options can now affect steps list (Issue #301)
  * local mode:
    * Runs 2 mappers and 2 reducers in parallel by default (Issue #228)
    * Preliminary Hadoop simulation for some jobconf variables (Issue #86)
* Misc:
  * boto 2.0+ is now required (Issue #92)
  * Removed debian packaging (should be handled separately)
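
A minimal sketch of the v0.3.0 protocol idea noted above: any object with
read() and write() methods can serve as a protocol, and caching the last key
avoids re-decoding it for every value during reducing. CachingJSONProtocol is
a hypothetical name, not a class shipped with mrjob; the tab-separated JSON
line format and the signatures read(line) -> (key, value) and
write(key, value) -> line are assumptions made for illustration:

```python
import json


class CachingJSONProtocol(object):
    """Hypothetical protocol: JSON key, a tab, then JSON value."""

    def __init__(self):
        # Remember the most recently decoded key. Reducer input is sorted
        # by key, so consecutive lines usually share the same encoded key.
        self._last_key_encoded = None
        self._last_key_decoded = None

    def read(self, line):
        """Decode one line into a (key, value) pair."""
        raw_key, raw_value = line.split('\t', 1)
        if raw_key != self._last_key_encoded:
            # Cache miss: decode the key and remember both forms.
            self._last_key_encoded = raw_key
            self._last_key_decoded = json.loads(raw_key)
        return self._last_key_decoded, json.loads(raw_value)

    def write(self, key, value):
        """Encode a (key, value) pair as a single line."""
        return '%s\t%s' % (json.dumps(key), json.dumps(value))
```

In a job, an instance of such a class would presumably be supplied through
one of the *_PROTOCOL attributes or *_protocol() methods mentioned above.
Because the cached key object is returned as-is on repeated reads, a caller
should treat it as read-only.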
v0.2.8, 2011-09-07 -- Bugfixes and betas
* Fix log parsing crash dealing with timeout errors
* Make mr_travelling_salesman.py work with simplejson
* Add emr_additional_info option, to support EMR beta features
* Remove debian packaging (should be handled separately)
* Fix crash when creating tmp bucket for job in us-east-1
v0.2.7, 2011-07-12 -- Hooray for interns!
* All runner options can be set from the command line (Issue #121)
  * Including for mrjob.tools.emr.create_job_flow (Issue #142)
* New EMR options:
  * availability_zone (Issue #72)
  * bootstrap_actions (Issue #69)
  * enable_emr_debugging (Issue #133)
* Read counters from EMR log files (Issue #134)
* Clean old files out of S3 with mrjob.tools.emr.s3_tmpwatch (Issue #9)
* EMR parses and reports job failure due to steps timing out (Issue #15)
* EMR bootstrap files are no longer made public on S3 (Issue #70)
* mrjob.tools.emr.terminate_idle_job_flows handles custom hadoop streaming
  jars correctly (Issue #116)
* LocalMRJobRunner separates out counters by step (Issue #28)
* bootstrap_python_packages works regardless of tarball name (Issue #49)
* mrjob always creates temp buckets in the correct AWS region (Issue #64)
* Catch abuse of __main__ in jobs (Issue #78)
* Added mr_travelling_salesman example
v0.2.6, 2011-05-24 -- Hadoop 0.20 in EMR, inline runner, and more
* Choose the Hadoop version to run on EMR with --hadoop-version (Issue #71).
  * Default is still 0.18, but will change to 0.20 in mrjob v0.3.0.
* New inline runner, for testing locally with a debugger
* New --strict-protocols option, to catch unencodable data (Issue #76)
* Added steps_python_bin option (for use with virtualenv)
* mrjob no longer chokes when asked to run on an EMR job flow running
  Hadoop 0.20 (Issue #110)
* mrjob no longer chokes on job flows with no LogUri (Issue #112)
v0.2.5, 2011-04-29 -- Hadoop input and output formats
* Added hadoop_input_format and hadoop_output_format options
* You can now specify a custom Hadoop streaming jar (hadoop_streaming_jar)
* extra args to hadoop now come before -mapper/-reducer on EMR, so
  that e.g. -libjar will work (worked in hadoop mode since v0.2.2)
* hadoop mode now supports s3n:// URIs (Issue #53)
v0.2.4, 2011-03-09 -- fix bootstrapping mrjob
* Fix bootstrapping of mrjob in hadoop and local mode (Issue #89)
* SSH tunnels try to use the same port for the same job flow (Issue #67)
* Added mr_postfix_bounce and mr_pegasos_svm to examples.
* Retry on spurious 505s from EMR API
v0.2.3, 2011-02-24 -- boto compatibility
* Fix incompatibility with boto 2.0b4 (Issue #91)
v0.2.2, 2011-02-15 -- GET/POST EMR issue
* Use POST requests for most EMR queries (EMR was choking on large GETs)
* find_probable_cause_of_failure() ignores transient errors (Issue #31)
* --hadoop-arg now actually works (Issue #79)
  * on Hadoop, extra args are added first, so you can set e.g. -libjar
* S3 buckets may now have . in their names
* MRJob scripts now respect --quiet (Issue #84)
* added --no-output option for MRJob scripts (Issue #81)
* added --python-bin option (Issue #54)
v0.2.1, 2010-11-17 -- laststatechangereason bugfix
* Don't assume EMR sets laststatechangereason
v0.2.0, 2010-11-15 -- Many bugfixes, Windows support
* New Features/Changes:
  * EMRJobRunner now prints % of mappers and reducers completed when you
    enable the SSH tunnel.
  * Added mr_page_rank example
  * Added mrjob.tools.emr.audit_usage script (Issue #21)
  * You can specify alternate job owners with the "owner" option. Useful for
    auditing usage. (Issue #59)
  * The job_name_prefix option has been renamed to label (the old name still
    works but is deprecated)
  * bootstrap_cmds and bootstrap_scripts no longer automatically invoke sudo
* Bugs Fixed/Cleanup:
  * bootstrap files no longer get uploaded to S3 twice (Issue #8)
  * When using add_file_option(), show_steps() can now see the local version
    of the file (Issue #45)
  * Now works on Windows (Issue #46)
  * No longer requires external jar, tar, or zip binaries (Issue #47)
  * mrjob-* scratch bucket is only created as needed (Issue #50)
  * Can now specify us-east-1 region explicitly (Issue #58)
  * mrjob.tools.emr.terminate_idle_job_flows leaves Hive jobs alone (Issue #60)
v0.1.0, 2010-10-28 -- Same code, better version. It's official!
v0.1.0-pre3, 2010-10-27 -- Pre-release to run Yelp code against
* Added debian packaging
* mrjob bootstrapping can now deal with symlinks in site-packages/mrjob
* MRJobRunner.stream_output() can now be called multiple times
v0.1.0-pre2, 2010-10-25 -- Second pre-release after testing
* Fixed small bugs that broke Python 2.5.1 and Python 2.7
* Fixed reading mrjob.conf without yaml installed
* Fix tests to work with modern simplejson and pipes.quote()
* Auto-create temp bucket on S3 if we don't have one (Issue #16)
* Auto-infer AWS region from bucket (Issue #7)
* --steps now passes in all extra args (e.g. --protocol) (Issue #4)
* Better docs
v0.1.0-pre1, 2010-10-21 -- Initial pre-release. YMMV!