Packman Build Service PMBS

The diffs of some files are truncated because they were too large.

Changes of Revision 9

x265.changes Changed

@@ -1,4 +1,62 @@
 -------------------------------------------------------------------
+Tue Apr 28 20:08:06 UTC 2015 - aloisio@gmx.com
+
+- soname bumped to 51
+- Update to stable version 1.6
+  Performance changes:
+  * heavy improvements for AVX2 capable platforms
+    (Haswell and later Intel CPUs) and work efficiency
+    improvements for multiple-socket machines.
+  
+  API changes:
+  * --threads N replaced by --pools N,N and --lookahead-slices N
+  * --[no-]rdoq-level N - finer control over RDOQ effort
+  * --min-cu-size N - trade-off compression for performance
+  * --max-tu-size N - trade-off compression for performance
+  * --[no-]temporal-layers - code unreferenced B frames in temporal
+    layer 1
+  * --[no-]cip aliases added for --[no-]constrained-intra
+  * Added support for new color transfer functions "smpte-st-2084"
+    and "smpte-st-428"
+  * --limit-refs N was added, but not yet implemented
+  * Deprecated x265_setup_primitives() was removed from the public
+    API and is no longer exported from DLLs
+  
+  Threading changes:
+  * The x265 thread pool has been made NUMA aware.
+  * The --threads  parameter, which used to specify a global
+    pool size, has been replaced with a --pools parameter which
+    allows you to specify a pool size per NUMA node (aka CPU socket
+    or package). The default is still to allocate one pool worker
+    thread per logical core on the machine, but with --pools one
+    can isolate those threads to a given socket.
+  * Other than socket isolation, the biggest visible change in the
+    NUMA aware thread pools is the increase in work efficiency.
+    The total utilization will generally decrease but the performance
+    will increase since worker threads spend less time context
+    switching.  Also, the threading of the lookahead was made more
+    work-efficient. Each lookahead job is a much larger piece of work.
+    Before (1.5):
+    disable thread pool: --threads 1
+    default thread pool: --threads 0
+    restrict to 4 threads: --threads 4
+    After (1.6):
+    disable thread pools: --pools 0
+    default thread pools: --pools *
+    restrict to 4 threads: --pools 4
+    restrict to 4 threads on socket 1: --pools -,4
+    restrict to all threads on socket 0: --pools +,-
+  
+  Multi-lib interface:
+  * In order to support runtime selection of a libx265
+    shared library, we have introduced an x265_api structure
+    and an x265_api_get() function. Applications which use
+    this interface to acquire the libx265 functional interface
+    will be able to use shim libraries to bind a particular build
+    of libx265 at run time. See the API documentation for full
+    details.
+
+-------------------------------------------------------------------
 Sun Feb 22 09:07:11 UTC 2015 - aloisio@gmx.com
 
 - soname bump
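
The Before/After list in the 1.6 changelog entry above amounts to a small mapping from old --threads values to --pools strings. The following is a minimal Python sketch of that mapping for anyone migrating wrapper scripts. `threads_to_pools` is a hypothetical helper name, not part of x265, and it covers only the three global cases listed in the changelog; the per-socket forms have no --threads equivalent.

```python
def threads_to_pools(threads: int) -> str:
    """Translate an x265 1.5 --threads value into a 1.6 --pools string.

    Hypothetical helper based only on the Before/After list in the
    changelog entry; it is not part of x265 itself.
    """
    if threads < 0:
        raise ValueError("--threads was never negative")
    if threads == 0:
        # 1.5 default thread pool -> 1.6 default thread pools
        return "*"
    if threads == 1:
        # 1.5 "disable thread pool" -> 1.6 "disable thread pools"
        return "0"
    # 1.5 "restrict to N threads" -> 1.6 "--pools N".
    # Assumption: on a multi-socket machine a single value describes
    # the first NUMA node only, as the cli.rst text suggests.
    return str(threads)
```

For example, `threads_to_pools(4)` yields `"4"`, matching the changelog line "restrict to 4 threads: --pools 4".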

 
​

x265.spec Changed

 
@@ -1,10 +1,10 @@
 # based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/
 
 Name:           x265
-%define soname  43
+%define soname  51
 %define libname lib%{name}
 %define libsoname %{libname}-%{soname}
-Version:        1.5
+Version:        1.6
 Release:        0
 License:        GPL-2.0+
 Summary:        A free h265/HEVC encoder - encoder binary
@@ -45,7 +45,7 @@
 %prep
 %setup -q -n "%{name}_%{version}/build/linux"
 cd ../..
-%patch0 -p1
+%patch0
 cd -
 %define FAKE_BUILDDATE %(LC_ALL=C date -u -r %{_sourcedir}/%{name}.changes '+%%b %%e %%Y')
 sed -i -e "s/0.0/%{soname}.0/g" ../../source/cmake/version.cmake
​

arm.patch Changed

@@ -1,7 +1,6 @@
-diff -urN a/source/CMakeLists.txt b/source/CMakeLists.txt
---- a/source/CMakeLists.txt	2015-02-10 14:15:13.000000000 -0700
-+++ b/source/CMakeLists.txt	2015-02-12 06:25:01.334927114 -0700
-@@ -46,10 +46,18 @@
+--- source/CMakeLists.txt.orig	2015-04-28 21:43:18.585528552 +0200
++++ source/CMakeLists.txt	2015-04-28 21:47:14.995334232 +0200
+@@ -50,10 +50,18 @@
          set(X64 1)
          add_definitions(-DX86_64=1)
      endif()
@@ -23,8 +22,8 @@
  else()
      message(STATUS "CMAKE_SYSTEM_PROCESSOR value `${CMAKE_SYSTEM_PROCESSOR}` is unknown")
      message(STATUS "Please add this value near ${CMAKE_CURRENT_LIST_FILE}:${CMAKE_CURRENT_LIST_LINE}")
-@@ -133,8 +141,8 @@
-     if(X86 AND NOT X64)
+@@ -155,8 +163,8 @@
+     elseif(X86 AND NOT X64)
          add_definitions(-march=i686)
      endif()
 -    if(ARM)
@@ -32,11 +31,10 @@
 +    if(ARMV7)
 +        add_definitions(-fPIC)
      endif()
-     check_cxx_compiler_flag(-Wno-narrowing CC_HAS_NO_NARROWING) 
-     check_cxx_compiler_flag(-Wno-array-bounds CC_HAS_NO_ARRAY_BOUNDS) 
-diff -urN a/source/common/cpu.cpp b/source/common/cpu.cpp
---- a/source/common/cpu.cpp	2015-02-10 14:15:13.000000000 -0700
-+++ b/source/common/cpu.cpp	2015-02-12 06:25:01.334927114 -0700
+     if(FPROFILE_GENERATE)
+         if(INTEL_CXX)
+--- source/common/cpu.cpp.orig	2015-04-28 21:47:44.634923269 +0200
++++ source/common/cpu.cpp	2015-04-28 21:49:50.305468867 +0200
 @@ -37,7 +37,7 @@
  #include <machine/cpu.h>
  #endif

 
​

baselibs.conf Changed

 
@@ -1,1 +1,1 @@
-libx265-43
+libx265-51
​

x265_1.5.tar.gz/.hg_archival.txt -> x265_1.6.tar.gz/.hg_archival.txt Changed

 
@@ -1,4 +1,4 @@
 repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
-node: 9f0324125f53a12f766f6ed6f98f16e2f42337f4
+node: cbeb7d8a4880e4020c4545dd8e498432c3c6cad3
 branch: stable
-tag: 1.5
+tag: 1.6
​

x265_1.5.tar.gz/.hgtags -> x265_1.6.tar.gz/.hgtags Changed

 
@@ -13,3 +13,4 @@
 d6257335c5370ee54317a0426a12c1f0724b18b9 1.2
 c1e4fc0162c14fdb84f5c3bd404fb28cfe10a17f 1.3
 5e604833c5aa605d0b6efbe5234492b5e7d8ac61 1.4
+9f0324125f53a12f766f6ed6f98f16e2f42337f4 1.5
​

x265_1.5.tar.gz/doc/reST/api.rst -> x265_1.6.tar.gz/doc/reST/api.rst Changed

@@ -72,11 +72,13 @@
 	process. All of the encoders must use the same maximum CTU size
 	because many global variables are configured based on this size.
 	Encoder allocation will fail if a mis-matched CTU size is attempted.
+	If no encoders are open, **x265_cleanup()** can be called to reset
+	the configured CTU size so a new size can be used.
 
 An encoder is allocated by calling **x265_encoder_open()**::
 
 	/* x265_encoder_open:
-	*      create a new encoder handler, all parameters from x265_param are copied */
+	 *      create a new encoder handler, all parameters from x265_param are copied */
 	x265_encoder* x265_encoder_open(x265_param *);
 
 The returned pointer is then passed to all of the functions pertaining
@@ -337,10 +339,44 @@
 	void x265_encoder_close(x265_encoder *);
 
 When the application has completed all encodes, it should call
-**x265_cleanup()** to free process global resources like the thread pool;
-particularly if a memory-leak detection tool is being used::
+**x265_cleanup()** to free process global resources, particularly if a memory-leak
+detection tool is being used. **x265_cleanup()** also resets the saved
+CTU size so it will be possible to create a new encoder with a different
+CTU size::
 
-	/***
-	 * Release library static allocations
-	 */
+	/* x265_cleanup:
+	 *     release library static allocations, reset configured CTU size */
 	void x265_cleanup(void);
+
+
+Multi-library Interface
+=======================
+
+If your application might want to make a runtime selection among
+a number of libx265 libraries (perhaps 8bpp and 16bpp), then you will
+want to use the multi-library interface.
+
+Instead of directly using all of the **x265_** methods documented
+above, you query an x265_api structure from your libx265 and then use
+the function pointers within that structure of the same name, but
+without the **x265_** prefix. So **x265_param_default()** becomes
+**api->param_default()**. The key method is x265_api_get()::
+
+    /* x265_api_get:
+     *   Retrieve the programming interface for a linked x265 library.
+     *   May return NULL if no library is available that supports the
+     *   requested bit depth. If bitDepth is 0, the function is guaranteed
+     *   to return a non-NULL x265_api pointer from the system default
+     *   libx265 */
+    const x265_api* x265_api_get(int bitDepth);
+
+The general idea is to request the API for the bitDepth you would prefer
+the encoder to use (8 or 10), and if that returns NULL you request the
+API for bitDepth=0, which returns the system default libx265.
+
+Note that using this multi-library API in your application is only the
+first step. Next your application must dynamically link to libx265 and
+then you must build and install a multi-lib configuration of libx265,
+which includes 8bpp and 16bpp builds of libx265 and a shim library which
+forwards x265_api_get() calls to the appropriate library using dynamic
+loading and binding.
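
The fallback described in the new "Multi-library Interface" text (ask for the preferred bit depth, then ask for bitDepth=0 if that returns NULL) can be sketched in a language-neutral way. The Python below is an illustration only: `get_api` is a stand-in callable playing the role of x265_api_get(), `None` plays the role of NULL, and `select_api` is a hypothetical name. A real application would make these calls in C against x265.h.

```python
def select_api(get_api, preferred_depth):
    """Return the API object for preferred_depth, else the default one.

    get_api is a stand-in for x265_api_get(): it takes a bit depth and
    returns an API object, or None when no matching library exists.
    """
    api = get_api(preferred_depth)
    if api is None:
        # Per the documentation, bitDepth 0 returns the system
        # default libx265 and is guaranteed to be non-NULL.
        api = get_api(0)
    return api
```

With a stub such as `{0: "default", 8: "default"}.get` as `get_api`, a request for depth 10 falls back to the default build, while a request for depth 8 is served directly.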

 
​

x265_1.5.tar.gz/doc/reST/cli.rst -> x265_1.6.tar.gz/doc/reST/cli.rst Changed

@@ -171,19 +171,54 @@
 	Over-allocation of frame threads will not improve performance, it
 	will generally just increase memory use.
 
-.. option:: --threads <integer>
+	**Values:** any value between 8 and 16. Default is 0, auto-detect
 
-	Number of threads to allocate for the worker thread pool  This pool
-	is used for WPP and for distributed analysis and motion search:
-	:option:`--wpp` :option:`--pmode` and :option:`--pme` respectively.
+.. option:: --pools <string>, --numa-pools <string>
 
-	If :option:`--threads` 1 is specified, then no thread pool is
-	created. When no thread pool is created, all the thread pool
-	features are implicitly disabled. If all the pool features are
-	disabled by the user, then the pool is implicitly disabled.
+	Comma separated list of threads per NUMA node. If "none", then no worker
+	pools are created and only frame parallelism is possible. If NULL or ""
+	(default) x265 will use all available threads on each NUMA node::
 
-	Default 0, one thread is allocated per detected hardware thread
-	(logical CPU cores)
+	'+'  is a special value indicating all cores detected on the node
+	'*'  is a special value indicating all cores detected on the node and all remaining nodes
+	'-'  is a special value indicating no cores on the node, same as '0'
+
+	example strings for a 4-node system::
+
+	""        - default, unspecified, all numa nodes are used for thread pools
+	"*"       - same as default
+	"none"    - no thread pools are created, only frame parallelism possible
+	"-"       - same as "none"
+	"10"      - allocate one pool, using up to 10 cores on node 0
+	"-,+"     - allocate one pool, using all cores on node 1
+	"+,-,+"   - allocate two pools, using all cores on nodes 0 and 2
+	"+,-,+,-" - allocate two pools, using all cores on nodes 0 and 2
+	"-,*"     - allocate three pools, using all cores on nodes 1, 2 and 3
+	"8,8,8,8" - allocate four pools with up to 8 threads in each pool
+
+	The total number of threads will be determined by the number of threads
+	assigned to all nodes. The worker threads will each be given affinity for
+	their node, they will not be allowed to migrate between nodes, but they
+	will be allowed to move between CPU cores within their node.
+
+	If the three pool features: :option:`--wpp` :option:`--pmode` and
+	:option:`--pme` are all disabled, then :option:`--pools` is ignored
+	and no thread pools are created.
+
+	If "none" is specified, then all three of the thread pool features are
+	implicitly disabled.
+
+	Multiple thread pools will be allocated for any NUMA node with more than
+	64 logical CPU cores. But any given thread pool will always use at most
+	one NUMA node.
+
+	Frame encoders are distributed between the available thread pools,
+	and the encoder will never generate more thread pools than
+	:option:`--frame-threads`.  The pools are used for WPP and for
+	distributed analysis and motion search.
+
+	Default "", one thread is allocated per detected hardware thread
+	(logical CPU cores) and one thread pool per NUMA node.
 
 .. option:: --wpp, --no-wpp
 
@@ -409,7 +444,30 @@
 	If :option:`--level-idc` has been specified, the option adds the
 	intention to support the High tier of that level. If your specified
 	level does not support a High tier, a warning is issued and this
-	modifier flag is ignored.
+	modifier flag is ignored. If :option:`--level-idc` has been specified,
+	but not --high-tier, then the encoder will attempt to encode at the 
+	specified level, main tier first, turning on high tier only if 
+	necessary and available at that level.
+
+.. option:: --ref <1..16>
+
+	Max number of L0 references to be allowed. This number has a linear
+	multiplier effect on the amount of work performed in motion search,
+	but will generally have a beneficial effect on compression and
+	distortion.
+	
+	Note that x265 allows up to 16 L0 references but the HEVC
+	specification only allows a maximum of 8 total reference frames. So
+	if you have B frames enabled only 7 L0 refs are valid and if you
+	have :option:`--b-pyramid` enabled (which is enabled by default in
+	all presets), then only 6 L0 refs are the maximum allowed by the
+	HEVC specification.  If x265 detects that the total reference count
+	is greater than 8, it will issue a warning that the resulting stream
+	is non-compliant and it signals the stream as profile NONE and level
+	NONE but still allows the encode to continue.  Compliant HEVC
+	decoders may refuse to decode such streams.
+	
+	Default 3
 
 .. note::
 	:option:`--profile`, :option:`--level-idc`, and
@@ -444,7 +502,7 @@
 	+-------+---------------------------------------------------------------+
 	| 3     | RDO mode and split decisions, chroma residual used for sa8d   |
 	+-------+---------------------------------------------------------------+
-	| 4     | Adds RDO Quant                                                |
+	| 4     | Currently same as 3                                           |
 	+-------+---------------------------------------------------------------+
 	| 5     | Adds RDO prediction decisions                                 |
 	+-------+---------------------------------------------------------------+
@@ -465,6 +523,23 @@
 	and less frame parallelism as well. Because of this the faster
 	presets use a CU size of 32. Default: 64
 
+.. option:: --min-cu-size <64|32|16|8>
+
+	Minimum CU size (width and height). By using 16 or 32 the encoder
+	will not analyze the cost of CUs below that minimum threshold,
+	saving considerable amounts of compute with a predictable increase
+	in bitrate. This setting has a large effect on performance on the
+	faster presets.
+
+	Default: 8 (minimum 8x8 CU for HEVC, best compression efficiency)
+
+.. note::
+
+	All encoders within a single process must use the same settings for
+	the CU size range. :option:`--ctu` and :option:`--min-cu-size` must
+	be consistent for all of them since the encoder configures several
+	key global data structures based on this range.
+
 .. option:: --rect, --no-rect
 
 	Enable analysis of rectangular motion partitions Nx2N and 2NxN
@@ -494,14 +569,6 @@
 	Measure full CU size (2Nx2N) merge candidates first; if no residual
 	is found the analysis is short circuited. Default disabled
 
-.. option:: --fast-cbf, --no-fast-cbf
-
-	Short circuit analysis if a prediction is found that does not set
-	the coded block flag (aka: no residual was encoded).  It prevents
-	the encoder from perhaps finding other predictions that also have no
-	residual but require less signaling bits or have less distortion.
-	Only applicable for RD levels 5 and 6. Default disabled
-
 .. option:: --fast-intra, --no-fast-intra
 
 	Perform an initial scan of every fifth intra angular mode, then
@@ -526,14 +593,6 @@
 	Only effective at RD levels 3 and above, which perform RDO mode
 	decisions.
 
-.. option:: --tskip, --no-tskip
-
-	Enable evaluation of transform skip (bypass DCT but still use
-	quantization) coding for 4x4 TU coded blocks.
-
-	Only effective at RD levels 3 and above, which perform RDO mode
-	decisions. Default disabled
-
 .. option:: --tskip-fast, --no-tskip-fast
 
 	Only evaluate transform skip for NxN intra predictions (4x4 blocks).
@@ -567,6 +626,30 @@
 Options which affect the transform unit quad-tree, sometimes referred to
 as the residual quad-tree (RQT).
 
+.. option:: --rdoq-level <0|1|2>, --no-rdoq-level
+
+	Specify the amount of rate-distortion analysis to use within
+	quantization::
+
+	At level 0 rate-distortion cost is not considered in quant
+	
+	At level 1 rate-distortion cost is used to find optimal rounding
+	values for each level (and allows psy-rdoq to be effective). It
+	trades-off the signaling cost of the coefficient vs its post-inverse
+	quant distortion from the pre-quant coefficient. When
+	:option:`--psy-rdoq` is enabled, this formula is biased in favor of
+	more energy in the residual (larger coefficient absolute levels)
+	
+	At level 2 rate-distortion cost is used to make decimate decisions
+	on each 4x4 coding group, including the cost of signaling the group
+	within the group bitmap. If the total distortion of not signaling
+	the entire coding group is less than the rate cost, the block is
+	decimated. Next, it applies rate-distortion cost analysis to the
+	last non-zero coefficient, which can result in many (or all) of the
+	coding groups being decimated. Psy-rdoq is less effective at
+	preserving energy when RDOQ is at level 2, since it only has
+	influence over the level distortion costs.
+
 .. option:: --tu-intra-depth <1..4>
 
 	The transform unit (residual) quad-tree begins with the same depth
@@ -593,9 +676,76 @@
 	partitions, in which case a TU split is implied and thus the
 	residual quad-tree begins one layer below the CU quad-tree.
 
+.. option:: --nr-intra <integer>, --nr-inter <integer>
+
+	Noise reduction - an adaptive deadzone applied after DCT
+	(subtracting from DCT coefficients), before quantization.  It does
+	no pixel-level filtering, doesn't cross DCT block boundaries, has no

 
+   influence over the level distortion costs.
+
 .. option:: --tu-intra-depth <1..4>
 
    The transform unit (residual) quad-tree begins with the same depth
@@ -593,9 +676,76 @@
    partitions, in which case a TU split is implied and thus the
    residual quad-tree begins one layer below the CU quad-tree.
 
+.. option:: --nr-intra <integer>, --nr-inter <integer>
+
+   Noise reduction - an adaptive deadzone applied after DCT
+   (subtracting from DCT coefficients), before quantization.  It does
+   no pixel-level filtering, doesn't cross DCT block boundaries, has no
​

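The note added to cli.rst in the hunks above says that all encoders within one process must agree on :option:`--ctu` and :option:`--min-cu-size`. The following is a minimal sketch of a pre-flight check an application might run before creating several encoders. The function names and the dict-based settings are hypothetical and are not part of the x265 API; the defaults (64 and 8) and the allowed minimum sizes are taken from the documentation text in this diff.

```python
# Hypothetical pre-flight check; not part of the x265 API.
DEFAULT_CTU = 64                       # documented default for --ctu
DEFAULT_MIN_CU_SIZE = 8                # documented default for --min-cu-size
VALID_MIN_CU_SIZES = (64, 32, 16, 8)   # values listed for --min-cu-size


def cu_size_range(settings):
    """Return (ctu, min_cu_size) for one encoder's settings dict."""
    ctu = settings.get("ctu", DEFAULT_CTU)
    min_cu = settings.get("min-cu-size", DEFAULT_MIN_CU_SIZE)
    if min_cu not in VALID_MIN_CU_SIZES:
        raise ValueError(f"unsupported min-cu-size: {min_cu}")
    # Assumption: a minimum CU larger than the CTU is treated as invalid here.
    if min_cu > ctu:
        raise ValueError(f"min-cu-size {min_cu} exceeds ctu {ctu}")
    return ctu, min_cu


def same_cu_size_range(all_settings):
    """True if every encoder in the process would use one CU size range."""
    ranges = {cu_size_range(s) for s in all_settings}
    return len(ranges) <= 1
```

Collecting the (ctu, min-cu-size) pairs into a set keeps the check independent of how many encoders are planned; any second distinct pair means the configurations conflict.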
x265_1.5.tar.gz/doc/reST/presets.rst -> x265_1.6.tar.gz/doc/reST/presets.rst Changed

@@ -24,19 +24,21 @@
 +==============+===========+===========+==========+========+======+========+======+========+==========+=========+
 | ctu          |   32      |    32     |   32     |  64    |  64  |   64   |  64  |  64    |   64     |   64    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| bframes      |    4      |     4     |    4     |   4    |  4   |    4   |  4   |   8    |    8     |    8    |
+| min-cu-size  |   16      |     8     |    8     |   8    |   8  |    8   |   8  |   8    |    8     |    8    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| b-adapt      |    0      |     0     |    0     |   0    |  2   |    2   |  2   |   2    |    2     |    2    |
+| bframes      |    3      |     3     |    4     |   4    |  4   |    4   |  4   |   8    |    8     |    8    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| rc-lookahead |   10      |    10     |   15     |  15    |  15  |   20   |  25  |   30   |   40     |   60    |
+| b-adapt      |    0      |     0     |    0     |   0    |  0   |    2   |  2   |   2    |    2     |    2    |
++--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
+| rc-lookahead |    5      |    10     |   15     |  15    |  15  |   20   |  25  |   30   |   40     |   60    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | scenecut     |    0      |    40     |   40     |  40    |  40  |   40   |  40  |   40   |   40     |   40    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| refs         |    1      |     1     |    1     |   1    |  3   |    3   |  3   |   3    |    5     |    5    |
+| refs         |    1      |     1     |    1     |   1    |  2   |    3   |  3   |   3    |    5     |    5    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | me           |   dia     |   hex     |   hex    |  hex   | hex  |   hex  | star |  star  |   star   |   star  |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| merange      |   25      |    44     |   57     |  57    |  57  |   57   | 57   |  57    |   57     |   92    |
+| merange      |   57      |    57     |   57     |  57    |  57  |   57   | 57   |  57    |   57     |   92    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | subme        |    0      |     1     |    1     |   2    |  2   |    2   |  3   |   3    |    4     |    5    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
@@ -60,12 +62,14 @@
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | weightb      |    0      |     0     |    0     |   0    |  0   |    0   |  0   |   1    |    1     |    1    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| aq-mode      |    0      |     0     |    2     |   2    |  2   |    2   |  2   |   2    |    2     |    2    |
+| aq-mode      |    0      |     0     |    1     |   1    |  1   |    1   |  1   |   1    |    1     |    1    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | cuTree       |    0      |     0     |    0     |   0    |  1   |    1   |  1   |   1    |    1     |    1    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | rdLevel      |    2      |     2     |    2     |   2    |  2   |    3   |  4   |   6    |    6     |    6    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
+| rdoq-level   |    0      |     0     |    0     |   0    |  0   |    0   |  2   |   2    |    2     |    2    |
++--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | tu-intra     |    1      |     1     |    1     |   1    |  1   |    1   |  1   |   2    |    3     |    4    |
 +--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
 | tu-inter     |    1      |     1     |    1     |   1    |  1   |    1   |  1   |   2    |    3     |    4    |
@@ -114,17 +118,12 @@
 modes which preserve high frequency noise:
 
     * :option:`--psy-rd` 0.5
+    * :option:`--rdoq-level` 1
     * :option:`--psy-rdoq` 30
 
-.. Note::
-
-    --psy-rdoq is only effective when RDOQuant is enabled, which is at
-    RD levels 4, 5, and 6 (presets slow and below).
-
 It lowers the strength of adaptive quantization, so residual energy can
 be more evenly distributed across the (noisy) picture:
 
-    * :option:`--aq-mode` 1
     * :option:`--aq-strength` 0.3
 
 And it similarly tunes rate control to prevent the slice QP from

 
​

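The presets.rst hunks above are easier to read per preset than per row. The sketch below copies the old (1.5) and new (1.6) values from the removed and added table rows and reports what changed for a given preset. The preset names and their column order are an assumption, since the table header lies outside the hunk; the helper name is hypothetical. The newly added rows (min-cu-size, rdoq-level) have no 1.5 counterpart and are left out.

```python
# Preset names in column order (assumed; the table header is outside the hunk).
PRESETS = ["ultrafast", "superfast", "veryfast", "faster", "fast",
           "medium", "slow", "slower", "veryslow", "placebo"]

# (1.5 row, 1.6 row), copied from the -/+ lines of the presets.rst hunks.
CHANGED_ROWS = {
    "bframes": ([4, 4, 4, 4, 4, 4, 4, 8, 8, 8],
                [3, 3, 4, 4, 4, 4, 4, 8, 8, 8]),
    "b-adapt": ([0, 0, 0, 0, 2, 2, 2, 2, 2, 2],
                [0, 0, 0, 0, 0, 2, 2, 2, 2, 2]),
    "rc-lookahead": ([10, 10, 15, 15, 15, 20, 25, 30, 40, 60],
                     [5, 10, 15, 15, 15, 20, 25, 30, 40, 60]),
    "refs": ([1, 1, 1, 1, 3, 3, 3, 3, 5, 5],
             [1, 1, 1, 1, 2, 3, 3, 3, 5, 5]),
    "merange": ([25, 44, 57, 57, 57, 57, 57, 57, 57, 92],
                [57, 57, 57, 57, 57, 57, 57, 57, 57, 92]),
    "aq-mode": ([0, 0, 2, 2, 2, 2, 2, 2, 2, 2],
                [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]),
}


def changes_for(preset):
    """Map each changed setting to (old, new) for one preset column."""
    col = PRESETS.index(preset)
    return {name: (old[col], new[col])
            for name, (old, new) in CHANGED_ROWS.items()
            if old[col] != new[col]}
```

For example, `changes_for("ultrafast")` reports the bframes, rc-lookahead and merange changes, while `changes_for("placebo")` reports only the aq-mode change.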
x265_1.5.tar.gz/doc/reST/threading.rst -> x265_1.6.tar.gz/doc/reST/threading.rst Changed

@@ -2,41 +2,34 @@
 Threading
 *********
 
-Thread Pool
-===========
+Thread Pools
+============
 
-x265 creates a pool of worker threads and shares this thread pool
-with all encoders within the same process (it is process global, aka a
-singleton).  The number of threads within the thread pool is determined
-by the encoder which first allocates the pool, which by definition is
-the first encoder created within each process.
+x265 creates one or more thread pools per encoder, one pool per NUMA
+node (typically a CPU socket). :option:`--pools` specifies the number of
+pools and the number of threads per pool the encoder will allocate. By
+default x265 allocates one thread per (hyperthreaded) CPU core on each
+NUMA node.
 
-:option:`--threads` specifies the number of threads the encoder will
-try to allocate for its thread pool.  If the thread pool was already
-allocated this parameter is ignored.  By default x265 allocates one
-thread per (hyperthreaded) CPU core in your system.
+If you are running multiple encoders on a system with multiple NUMA
+nodes, it is recommended to isolate each of them to a single node in
+order to avoid the NUMA overhead of remote memory access.
 
-Work distribution is job based.  Idle worker threads ask their parent
-pool object for jobs to perform.  When no jobs are available, idle
-worker threads block and consume no CPU cycles.
+Work distribution is job based. Idle worker threads scan the job
+providers assigned to their thread pool for jobs to perform. When no
+jobs are available, the idle worker threads block and consume no CPU
+cycles.
 
 Objects which desire to distribute work to worker threads are known as
-job providers (and they derive from the JobProvider class).  When job
-providers have work they enqueue themselves into the pool's provider
-list (and dequeue themselves when they no longer have work).  The thread
+job providers (and they derive from the JobProvider class).  The thread
 pool has a method to **poke** awake a blocked idle thread, and job
 providers are recommended to call this method when they make new jobs
 available.
 
 Worker jobs are not allowed to block except when absolutely necessary
-for data locking. If a job becomes blocked, the worker thread is
-expected to drop that job and go back to the pool and find more work.
-
-.. note::
-
-	x265_cleanup() frees the process-global thread pool, allowing
-	it to be reallocated if necessary, but only if no encoders are
-	allocated at the time it is called.
+for data locking. If a job becomes blocked, the work function is
+expected to drop that job so the worker thread may go back to the pool
+and find more work.
 
 Wavefront Parallel Processing
 =============================
@@ -82,24 +75,35 @@
 thread count to be higher than if WPP was enabled.  The exact formulas
 are described in the next section.
 
+Bonded Task Groups
+==================
+
+If a worker thread job has work which can be performed in parallel by
+many threads, it may allocate a bonded task group and enlist the help of
+other idle worker threads in the same pool. Those threads will cooperate
+to complete the work of the bonded task group and then return to their
+idle states. The larger and more uniform those tasks are, the better the
+bonded task group will perform.
+
 Parallel Mode Analysis
-======================
+~~~~~~~~~~~~~~~~~~~~~~
 
 When :option:`--pmode` is enabled, each CU (at all depths from 64x64 to
-8x8) will distribute its analysis work to the thread pool. Each analysis
-job will measure the cost of one prediction for the CU: merge, skip,
-intra, inter (2Nx2N, Nx2N, 2NxN, and AMP). At slower presets, the amount
-of increased parallelism is often enough to be able to reduce frame
-parallelism while achieving the same overall CPU utilization. Reducing
-frame threads is often beneficial to ABR and VBV rate control.
+8x8) will distribute its analysis work to the thread pool via a bonded
+task group. Each analysis job will measure the cost of one prediction
+for the CU: merge, skip, intra, inter (2Nx2N, Nx2N, 2NxN, and AMP). At
+slower presets, the amount of increased parallelism is often enough to
+be able to reduce frame parallelism while achieving the same overall CPU
+utilization. Reducing frame threads is often beneficial to ABR and VBV
+rate control.
 
 Parallel Motion Estimation
-==========================
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 When :option:`--pme` is enabled all of the analysis functions which
 perform motion searches to reference frames will distribute those motion
-searches as jobs for worker threads (if more than two motion searches
-are required).
+searches as jobs for worker threads via a bonded task group (if more
+than two motion searches are required).
 
 Frame Threading
 ===============
@@ -125,16 +129,21 @@
 for motion reference must be processed by the loop filters and the loop
 filters cannot run until a full row has been encoded, and it must run a
 full row behind the encode process so that the pixels below the row
-being filtered are available. When you add up all the row lags each
-frame ends up being 3 CTU rows behind its reference frames (the
-equivalent of 12 macroblock rows for x264)
+being filtered are available. On top of this, HEVC has two loop filters:
+deblocking and SAO, which must be run in series with a row lag between
+them. When you add up all the row lags each frame ends up being 3 CTU
+rows behind its reference frames (the equivalent of 12 macroblock rows
+for x264). And keep in mind the wave-front progression pattern; by the
+time the reference frame finishes the third row of CTUs, nearly half of
+the CTUs in the frame may be compressed (depending on the display aspect
+ratio).
 
 The third extenuating circumstance is that when a frame being encoded
 becomes blocked by a reference frame row being available, that frame's
 wave-front becomes completely stalled and when the row becomes available
 again it can take quite some time for the wave to be restarted, if it
-ever does. This makes WPP many times less effective when frame
-parallelism is in use.
+ever does. This makes WPP less effective when frame parallelism is in
+use.
 
 :option:`--merange` can have a negative impact on frame parallelism. If
 the range is too large, more rows of CTU lag must be added to ensure
@@ -213,13 +222,13 @@
 
 The lookahead module of x265 (the lowres pre-encode which determines
 scene cuts and slice types) uses the thread pool to distribute the
-lowres cost analysis to worker threads. It follows the same wave-front
-pattern as the main encoder except it works in reverse-scan order.
+lowres cost analysis to worker threads. It will use bonded task groups
+to perform batches of frame cost estimates, and it may optionally use
+bonded task groups to measure single frame cost estimates using slices.
 
-The function slicetypeDecide() itself may also be performed by a worker
-thread if your system has enough CPU cores to make this a beneficial
-trade-off, else it runs within the context of the thread which calls the
-x265_encoder_encode().
+The function slicetypeDecide() itself is also performed by a worker
+thread if your encoder has a thread pool, else it runs within the
+context of the thread which calls the x265_encoder_encode().
 
 SAO
 ===

 
​

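The "Bonded Task Groups" section added to threading.rst describes one job enlisting other threads to finish a batch of parallel tasks. A rough sketch of that idea follows. It is not x265's implementation: the class and method names are hypothetical, and it starts fresh helper threads rather than borrowing idle workers from an existing pool, which is a simplification.

```python
import threading


class BondedTaskGroup:
    """Illustrative only: threads cooperate on one fixed batch of tasks."""

    def __init__(self, tasks):
        self._tasks = list(tasks)           # callables taking no arguments
        self._next = 0                      # index of the next unclaimed task
        self._lock = threading.Lock()
        self.results = [None] * len(self._tasks)

    def _claim(self):
        """Atomically claim the next task index, or None when none remain."""
        with self._lock:
            if self._next >= len(self._tasks):
                return None
            index = self._next
            self._next += 1
            return index

    def _work(self):
        """Run tasks until the group is exhausted."""
        while True:
            index = self._claim()
            if index is None:
                return
            self.results[index] = self._tasks[index]()

    def run(self, helpers):
        """Enlist `helpers` extra threads; the caller also processes tasks."""
        threads = [threading.Thread(target=self._work) for _ in range(helpers)]
        for t in threads:
            t.start()
        self._work()                        # the originating thread takes part
        for t in threads:
            t.join()                        # the bond ends once all tasks finish
        return self.results
```

Because each thread claims one task at a time, a slow task can leave the others waiting at the end, which matches the documentation's remark that larger and more uniform tasks perform better.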
x265_1.6.tar.gz/readme.rst Added

 
@@ -0,0 +1,14 @@
+=================
+x265 HEVC Encoder
+=================
+
+| **Read:** | Online `documentation <http://x265.readthedocs.org/en/default/>`_ | Developer `wiki <http://bitbucket.org/multicoreware/x265/wiki/>`_
+| **Download:** | `releases <http://bitbucket.org/multicoreware/x265/downloads/>`_ 
+| **Interact:** | #x265 on freenode.irc.net | `x265-devel@videolan.org <http://mailman.videolan.org/listinfo/x265-devel>`_ | `Report an issue <https://bitbucket.org/multicoreware/x265/issues?status=new&status=open>`_
+
+`x265 <https://www.videolan.org/developers/x265.html>`_ is an open
+source HEVC encoder. See the developer wiki for instructions for
+downloading and building the source.
+
+x265 is free to use under the `GNU GPL <http://www.gnu.org/licenses/gpl-2.0.html>`_ 
+and is also available under a commercial `license <http://x265.org>`_ 
​

x265_1.5.tar.gz/source/CMakeLists.txt -> x265_1.6.tar.gz/source/CMakeLists.txt Changed

@@ -12,6 +12,9 @@
 if(POLICY CMP0042)
     cmake_policy(SET CMP0042 NEW) # MACOSX_RPATH
 endif()
+if(POLICY CMP0054)
+    cmake_policy(SET CMP0054 OLD) # Only interpret if() arguments as variables or keywords when unquoted
+endif()
 
 project (x265)
 cmake_minimum_required (VERSION 2.8.8) # OBJECT libraries require 2.8.8
@@ -20,8 +23,14 @@
 include(CheckSymbolExists)
 include(CheckCXXCompilerFlag)
 
+option(FPROFILE_GENERATE "Compile executable to generate usage data" OFF)
+option(FPROFILE_USE "Compile executable using generated usage data" OFF)
+option(NATIVE_BUILD "Target the build CPU" OFF)
+option(STATIC_LINK_CRT "Statically link C runtime for release builds" OFF)
+mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
+
 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 43)
+set(X265_BUILD 51)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
@@ -29,11 +38,6 @@
 
 SET(CMAKE_MODULE_PATH "${PROJECT_SOURCE_DIR}/cmake" "${CMAKE_MODULE_PATH}")
 
-option(CHECKED_BUILD "Enable run-time sanity checks (debugging)" OFF)
-if(CHECKED_BUILD)
-    add_definitions(-DCHECKED_BUILD=1)
-endif()
-
 # System architecture detection
 string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" SYSPROC)
 set(X86_ALIASES x86 i386 i686 x86_64 amd64)
@@ -61,6 +65,19 @@
     if(LIBRT)
         list(APPEND PLATFORM_LIBS rt)
     endif()
+    find_package(Numa)
+    if(NUMA_FOUND)
+        list(APPEND CMAKE_REQUIRED_LIBRARIES ${NUMA_LIBRARY})
+        check_symbol_exists(numa_node_of_cpu numa.h NUMA_V2)
+        if(NUMA_V2)
+            add_definitions(-DHAVE_LIBNUMA)
+            message(STATUS "libnuma found, building with support for NUMA nodes")
+            list(APPEND PLATFORM_LIBS ${NUMA_LIBRARY})
+            link_directories(${NUMA_LIBRARY_DIR})
+            include_directories(${NUMA_INCLUDE_DIR})
+        endif()
+    endif()
+    mark_as_advanced(LIBRT NUMA_FOUND)
 endif(UNIX)
 
 if(X64 AND NOT WIN32)
@@ -77,13 +94,13 @@
   add_definitions(-DMACOS)
 endif()
 
-if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Clang")
+if(${CMAKE_CXX_COMPILER_ID} STREQUAL "Clang")
     set(CLANG 1)
 endif()
-if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Intel")
+if(${CMAKE_CXX_COMPILER_ID} STREQUAL "Intel")
     set(INTEL_CXX 1)
 endif()
-if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")
+if(${CMAKE_CXX_COMPILER_ID} STREQUAL "GNU")
     set(GCC 1)
 endif()
 
@@ -92,13 +109,12 @@
     set(MSVC 1)
 endif()
 if(MSVC)
-    option(STATIC_LINK_CRT "Statically link C runtime for release builds" OFF)
-    if (STATIC_LINK_CRT)
+    if(STATIC_LINK_CRT)
         set(CompilerFlags CMAKE_CXX_FLAGS_RELEASE CMAKE_C_FLAGS_RELEASE)
         foreach(CompilerFlag ${CompilerFlags})
             string(REPLACE "/MD" "/MT" ${CompilerFlag} "${${CompilerFlag}}")
         endforeach()
-    endif (STATIC_LINK_CRT)
+    endif(STATIC_LINK_CRT)
     add_definitions(/W4)  # Full warnings
     add_definitions(/Ob2) # always inline
     add_definitions(/MP)  # multithreaded build
@@ -130,12 +146,56 @@
     if(ENABLE_PIC)
          add_definitions(-fPIC)
     endif(ENABLE_PIC)
-    if(X86 AND NOT X64)
+    if(NATIVE_BUILD)
+        if(INTEL_CXX)
+            add_definitions(-xhost)
+        else()
+            add_definitions(-march=native)
+        endif()
+    elseif(X86 AND NOT X64)
         add_definitions(-march=i686)
     endif()
     if(ARM)
         add_definitions(-march=armv6 -mfloat-abi=hard -mfpu=vfp)
     endif()
+    if(FPROFILE_GENERATE)
+        if(INTEL_CXX)
+            add_definitions(-prof-gen -prof-dir="${CMAKE_CURRENT_BINARY_DIR}")
+            list(APPEND LINKER_OPTIONS "-prof-gen")
+        else()
+            check_cxx_compiler_flag(-fprofile-generate CC_HAS_PROFILE_GENERATE)
+            if(CC_HAS_PROFILE_GENERATE)
+                add_definitions(-fprofile-generate)
+                list(APPEND LINKER_OPTIONS "-fprofile-generate")
+            endif(CC_HAS_PROFILE_GENERATE)
+        endif(INTEL_CXX)
+    endif(FPROFILE_GENERATE)
+    if(FPROFILE_USE)
+        if(INTEL_CXX)
+            add_definitions(-prof-use -prof-dir="${CMAKE_CURRENT_BINARY_DIR}")
+            list(APPEND LINKER_OPTIONS "-prof-use")
+        else()
+            check_cxx_compiler_flag(-fprofile-use CC_HAS_PROFILE_USE)
+            check_cxx_compiler_flag(-fprofile-correction CC_HAS_PROFILE_CORRECTION)
+            check_cxx_compiler_flag(-Wno-error=coverage-mismatch CC_HAS_COVMISMATCH)
+            if(CC_HAS_PROFILE_USE)
+                add_definitions(-fprofile-use)
+                list(APPEND LINKER_OPTIONS "-fprofile-use")
+            endif(CC_HAS_PROFILE_USE)
+            if(CC_HAS_PROFILE_CORRECTION)
+                # auto-correct corrupted counters (happens a lot with x265)
+                add_definitions(-fprofile-correction)
+            endif(CC_HAS_PROFILE_CORRECTION)
+            if(CC_HAS_COVMISMATCH)
+                # ignore coverage mismatches (also happens a lot)
+                add_definitions(-Wno-error=coverage-mismatch)
+            endif(CC_HAS_COVMISMATCH)
+        endif(INTEL_CXX)
+    endif(FPROFILE_USE)
+    if(STATIC_LINK_CRT)
+        add_definitions(-static)
+        list(APPEND LINKER_OPTIONS "-static")
+    endif(STATIC_LINK_CRT)
     check_cxx_compiler_flag(-Wno-narrowing CC_HAS_NO_NARROWING) 
     check_cxx_compiler_flag(-Wno-array-bounds CC_HAS_NO_ARRAY_BOUNDS) 
     if (CC_HAS_NO_ARRAY_BOUNDS)
@@ -154,6 +214,35 @@
     if(CC_HAS_FNO_EXCEPTIONS_FLAG)
         add_definitions(-fno-exceptions)
     endif()
+    set(FSANITIZE "" CACHE STRING "-fsanitize options for GCC/clang")
+    if(FSANITIZE)
+        add_definitions(-fsanitize=${FSANITIZE})
+        # clang and gcc need the sanitize options to be passed at link
+        # time so the appropriate ASAN/TSAN runtime libraries can be
+        # linked.
+        list(APPEND LINKER_OPTIONS "-fsanitize=${FSANITIZE}")
+    endif()
+    option(ENABLE_AGGRESSIVE_CHECKS "Enable stack protection and -ftrapv" OFF)
+    if(ENABLE_AGGRESSIVE_CHECKS)
+        # use with care, -ftrapv can cause testbench SIGILL exceptions
+        # since it is testing corner cases of signed integer math
+        add_definitions(-DUSING_FTRAPV=1)
+        check_cxx_compiler_flag(-fsanitize=undefined-trap CC_HAS_CATCH_UNDEFINED) # clang
+        check_cxx_compiler_flag(-ftrapv CC_HAS_FTRAPV)                            # gcc
+        check_cxx_compiler_flag(-fstack-protector-all CC_HAS_STACK_PROTECT)       # gcc
+        if(CC_HAS_FTRAPV)
+            add_definitions(-ftrapv)
+        endif()
+        if(CC_HAS_CATCH_UNDEFINED)
+            add_definitions(-fsanitize=undefined-trap -fsanitize-undefined-trap-on-error)
+        endif()
+        if(CC_HAS_STACK_PROTECT)
+            add_definitions(-fstack-protector-all)
+            if(MINGW)
+                list(APPEND PLATFORM_LIBS ssp)
+            endif()
+        endif()
+    endif(ENABLE_AGGRESSIVE_CHECKS)
     execute_process(COMMAND ${CMAKE_CXX_COMPILER} -dumpversion OUTPUT_VARIABLE CC_VERSION)
 endif(GCC)
 
@@ -168,6 +257,11 @@
     endif()
 endif()
 
+option(CHECKED_BUILD "Enable run-time sanity checks (debugging)" OFF)
+if(CHECKED_BUILD)
+    add_definitions(-DCHECKED_BUILD=1)
+endif()
+
 # Build options
 set(LIB_INSTALL_DIR lib CACHE STRING "Install location of libraries")
 set(BIN_INSTALL_DIR bin CACHE STRING "Install location of executables")
@@ -179,6 +273,7 @@
     # can disable this if(X64) check if you desparately need a 32bit
     # build with 10bit/12bit support, but this violates the "shrink wrap

 
​

x265_1.6.tar.gz/source/cmake/FindNuma.cmake Added

@@ -0,0 +1,43 @@
+# Module for locating libnuma
+#
+# Read-only variables:
+#   NUMA_FOUND
+#     Indicates that the library has been found.
+#
+#   NUMA_INCLUDE_DIR
+#     Points to the libnuma include directory.
+#
+#   NUMA_LIBRARY_DIR
+#     Points to the directory that contains the libraries.
+#     The content of this variable can be passed to link_directories.
+#
+#   NUMA_LIBRARY
+#     Points to the libnuma that can be passed to target_link_libararies.
+#
+# Copyright (c) 2015 Steve Borho
+
+include(FindPackageHandleStandardArgs)
+
+find_path(NUMA_ROOT_DIR
+  NAMES include/numa.h
+  PATHS ENV NUMA_ROOT
+  DOC "NUMA root directory")
+
+find_path(NUMA_INCLUDE_DIR
+  NAMES numa.h
+  HINTS ${NUMA_ROOT_DIR}
+  PATH_SUFFIXES include
+  DOC "NUMA include directory")
+
+find_library(NUMA_LIBRARY
+  NAMES numa
+  HINTS ${NUMA_ROOT_DIR}
+  DOC "NUMA library")
+
+if (NUMA_LIBRARY)
+    get_filename_component(NUMA_LIBRARY_DIR ${NUMA_LIBRARY} PATH)
+endif()
+
+mark_as_advanced(NUMA_INCLUDE_DIR NUMA_LIBRARY_DIR NUMA_LIBRARY)
+
+find_package_handle_standard_args(NUMA REQUIRED_VARS NUMA_ROOT_DIR NUMA_INCLUDE_DIR NUMA_LIBRARY)

 
​

x265_1.5.tar.gz/source/cmake/version.cmake -> x265_1.6.tar.gz/source/cmake/version.cmake Changed

@@ -10,9 +10,9 @@
 set(X265_LATEST_TAG "0.0")
 set(X265_TAG_DISTANCE "0")
 
-if(EXISTS ${CMAKE_SOURCE_DIR}/../.hg_archival.txt)
+if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/../.hg_archival.txt)
     # read the lines of the archive summary file to extract the version
-    file(READ ${CMAKE_SOURCE_DIR}/../.hg_archival.txt archive)
+    file(READ ${CMAKE_CURRENT_SOURCE_DIR}/../.hg_archival.txt archive)
     STRING(REGEX REPLACE "\n" ";" archive "${archive}")
     foreach(f ${archive})
         string(FIND "${f}" ": " pos)
@@ -29,7 +29,7 @@
         string(SUBSTRING "${hg_node}" 0 16 hg_id)
         set(X265_VERSION "${hg_latesttag}+${hg_latesttagdistance}-${hg_id}")
     endif()
-elseif(HG_EXECUTABLE AND EXISTS ${CMAKE_SOURCE_DIR}/../.hg)
+elseif(HG_EXECUTABLE AND EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/../.hg)
     if(EXISTS "${HG_EXECUTABLE}.bat")
         # mercurial source installs on Windows require .bat extension
         set(HG_EXECUTABLE "${HG_EXECUTABLE}.bat")
@@ -38,14 +38,14 @@
 
     execute_process(COMMAND
         ${HG_EXECUTABLE} log -r. --template "{latesttag}"
-        WORKING_DIRECTORY ${PROJECT_SOURCE_DIR}
+        WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
         OUTPUT_VARIABLE X265_LATEST_TAG
         ERROR_QUIET
         OUTPUT_STRIP_TRAILING_WHITESPACE
         )
     execute_process(COMMAND
         ${HG_EXECUTABLE} log -r. --template "{latesttagdistance}"
-        WORKING_DIRECTORY ${PROJECT_SOURCE_DIR}
+        WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
         OUTPUT_VARIABLE X265_TAG_DISTANCE
         ERROR_QUIET
         OUTPUT_STRIP_TRAILING_WHITESPACE
@@ -53,7 +53,7 @@
     execute_process(
         COMMAND
         ${HG_EXECUTABLE} log -r. --template "{node|short}"
-        WORKING_DIRECTORY ${PROJECT_SOURCE_DIR}
+        WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
         OUTPUT_VARIABLE HG_REVISION_ID
         ERROR_QUIET
         OUTPUT_STRIP_TRAILING_WHITESPACE
@@ -67,11 +67,11 @@
     else()
         set(X265_VERSION "${X265_LATEST_TAG}+${X265_TAG_DISTANCE}-${HG_REVISION_ID}")
     endif()
-elseif(GIT_EXECUTABLE AND EXISTS ${CMAKE_SOURCE_DIR}/../.git)
+elseif(GIT_EXECUTABLE AND EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/../.git)
     execute_process(
         COMMAND
         ${GIT_EXECUTABLE} describe --tags --abbrev=0
-        WORKING_DIRECTORY ${PROJECT_SOURCE_DIR}
+        WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
         OUTPUT_VARIABLE X265_LATEST_TAG
         ERROR_QUIET
         OUTPUT_STRIP_TRAILING_WHITESPACE
@@ -80,7 +80,7 @@
     execute_process(
         COMMAND
         ${GIT_EXECUTABLE} describe --tags
-        WORKING_DIRECTORY ${PROJECT_SOURCE_DIR}
+        WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
         OUTPUT_VARIABLE X265_VERSION
         ERROR_QUIET
         OUTPUT_STRIP_TRAILING_WHITESPACE

 
​

x265_1.5.tar.gz/source/common/CMakeLists.txt -> x265_1.6.tar.gz/source/common/CMakeLists.txt Changed

 
@@ -1,7 +1,7 @@
 # vim: syntax=cmake
 
 if(ENABLE_ASSEMBLY)
-    set_source_files_properties(primitives.cpp PROPERTIES COMPILE_FLAGS -DENABLE_ASSEMBLY=1)
+    set_source_files_properties(threading.cpp primitives.cpp PROPERTIES COMPILE_FLAGS -DENABLE_ASSEMBLY=1)
 
     set(SSE3  vec/dct-sse3.cpp)
     set(SSSE3 vec/dct-ssse3.cpp)
@@ -48,7 +48,7 @@
     if(HIGH_BIT_DEPTH)
         set(A_SRCS ${A_SRCS} sad16-a.asm intrapred16.asm ipfilter16.asm)
     else()
-        set(A_SRCS ${A_SRCS} sad-a.asm intrapred8.asm ipfilter8.asm loopfilter.asm)
+        set(A_SRCS ${A_SRCS} sad-a.asm intrapred8.asm intrapred8_allangs.asm ipfilter8.asm loopfilter.asm)
     endif()
 
     if(NOT X64)
​

x265_1.5.tar.gz/source/common/bitstream.cpp -> x265_1.6.tar.gz/source/common/bitstream.cpp Changed

@@ -27,7 +27,7 @@
         uint8_t *temp = X265_MALLOC(uint8_t, m_byteAlloc * 2);
         if (temp)
         {
-            ::memcpy(temp, m_fifo, m_byteOccupancy);
+            memcpy(temp, m_fifo, m_byteOccupancy);
             X265_FREE(m_fifo);
             m_fifo = temp;
             m_byteAlloc *= 2;
@@ -44,7 +44,7 @@
 void Bitstream::write(uint32_t val, uint32_t numBits)
 {
     X265_CHECK(numBits <= 32, "numBits out of range\n");
-    X265_CHECK(numBits == 32 || ((val & (~0 << numBits)) == 0), "numBits & val out of range\n");
+    X265_CHECK(numBits == 32 || ((val & (~0u << numBits)) == 0), "numBits & val out of range\n");
 
     uint32_t totalPartialBits = m_partialByteBits + numBits;
     uint32_t nextPartialBits = totalPartialBits & 7;
@@ -55,7 +55,11 @@
     {
         /* topword aligns m_partialByte with the msb of val */
         uint32_t topword = (numBits - nextPartialBits) & ~7;
+#if USING_FTRAPV
+        uint32_t write_bits = (topword < 32 ? m_partialByte << topword : 0) | (val >> nextPartialBits);
+#else
         uint32_t write_bits = (m_partialByte << topword) | (val >> nextPartialBits);
+#endif
 
         switch (writeBytes)
         {
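
Note on the two shift changes in bitstream.cpp above: `~0` is a signed `int` holding -1, and left-shifting a negative signed value was undefined behaviour in the C++ standards of the time, whereas `~0u` is shifted as an unsigned value. Likewise, shifting a 32-bit value by 32 is undefined, which is what the `topword < 32` guard in the `USING_FTRAPV` branch avoids. The sketch below illustrates both points in isolation; `fitsInBits` and `guardedShiftLeft` are hypothetical helper names and are not part of x265.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not x265 code): true when val fits in numBits bits.
// Mirrors the condition tested by the X265_CHECK in Bitstream::write.
static bool fitsInBits(uint32_t val, uint32_t numBits)
{
    assert(numBits <= 32);
    if (numBits == 32)
        return true;                     // never shift a 32-bit value by 32
    uint32_t highMask = ~0u << numBits;  // unsigned shift, well defined
    return (val & highMask) == 0;
}

// Hypothetical helper (not x265 code): same shape as the guarded shift
// used when USING_FTRAPV is defined.
static uint32_t guardedShiftLeft(uint32_t value, uint32_t count)
{
    return count < 32 ? value << count : 0;
}
```

For example, `fitsInBits(7, 3)` holds while `fitsInBits(8, 3)` does not, and `guardedShiftLeft(1, 32)` yields 0 instead of invoking an out-of-range shift.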

 
​

x265_1.5.tar.gz/source/common/common.cpp -> x265_1.6.tar.gz/source/common/common.cpp Changed

 
@@ -33,6 +33,10 @@
 #include <sys/time.h>
 #endif
 
+#if CHECKED_BUILD || _DEBUG
+int g_checkFailures;
+#endif
+
 int64_t x265_mdate(void)
 {
 #if _WIN32
​

x265_1.5.tar.gz/source/common/common.h -> x265_1.6.tar.gz/source/common/common.h Changed

@@ -74,13 +74,6 @@
 #define ALIGN_VAR_16(T, var) T var __attribute__((aligned(16)))
 #define ALIGN_VAR_32(T, var) T var __attribute__((aligned(32)))
 
-#if X265_ARCH_X86 && !defined(X86_64)
-extern "C" intptr_t x265_stack_align(void (*func)(), ...);
-#define x265_stack_align(func, ...) x265_stack_align((void (*)())func, __VA_ARGS__)
-#else
-#define x265_stack_align(func, ...) func(__VA_ARGS__)
-#endif
-
 #if defined(__MINGW32__)
 #define fseeko fseeko64
 #endif
@@ -90,7 +83,6 @@
 #define ALIGN_VAR_8(T, var)  __declspec(align(8)) T var
 #define ALIGN_VAR_16(T, var) __declspec(align(16)) T var
 #define ALIGN_VAR_32(T, var) __declspec(align(32)) T var
-#define x265_stack_align(func, ...) func(__VA_ARGS__)
 #define fseeko _fseeki64
 
 #endif // if defined(__GNUC__)
@@ -106,19 +98,20 @@
 #if _DEBUG && defined(_MSC_VER)
 #define DEBUG_BREAK() __debugbreak()
 #elif __APPLE_CC__
-#define DEBUG_BREAK() __builtin_trap();
+#define DEBUG_BREAK() __builtin_trap()
 #else
-#define DEBUG_BREAK()
+#define DEBUG_BREAK() abort()
 #endif
 
 /* If compiled with CHECKED_BUILD perform run-time checks and log any that
  * fail, both to stderr and to a file */
 #if CHECKED_BUILD || _DEBUG
+extern int g_checkFailures;
 #define X265_CHECK(expr, ...) if (!(expr)) { \
     x265_log(NULL, X265_LOG_ERROR, __VA_ARGS__); \
-    DEBUG_BREAK(); \
     FILE *fp = fopen("x265_check_failures.txt", "a"); \
     if (fp) { fprintf(fp, "%s:%d\n", __FILE__, __LINE__); fprintf(fp, __VA_ARGS__); fclose(fp); } \
+    g_checkFailures++; DEBUG_BREAK(); \
 }
 #if _MSC_VER
 #pragma warning(disable: 4127) // some checks have constant conditions
@@ -257,7 +250,7 @@
 #define UNIT_SIZE               (1 << LOG2_UNIT_SIZE)       // unit size of CU partition
 
 #define MAX_NUM_PARTITIONS      256
-#define NUM_CU_PARTITIONS       (1U << (g_maxFullDepth << 1))
+#define NUM_4x4_PARTITIONS      (1U << (g_unitSizeDepth << 1)) // number of 4x4 units in max CU size
 
 #define MIN_PU_SIZE             4
 #define MIN_TU_SIZE             4
@@ -376,6 +369,7 @@
     int32_t*    ref;
     uint8_t*    depth;
     uint8_t*    modes;
+    uint32_t*   bestMergeCand;
 };
 
 /* Stores intra analysis data for a single frame. This struct needs better packing */
@@ -384,6 +378,7 @@
     uint8_t*  depth;
     uint8_t*  modes;
     char*     partSizes;
+    uint8_t*  chromaModes;
 };
 
 enum TextType
@@ -430,6 +425,8 @@
 void     x265_free(void *ptr);
 char*    x265_slurp_file(const char *filename);
 
+void     x265_setup_primitives(x265_param* param, int cpu); /* primitives.cpp */
+
 #include "constants.h"
 
 #endif // ifndef X265_COMMON_H
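
Note on the `X265_CHECK` change in common.h above: the fallback `DEBUG_BREAK()` now calls `abort()`, and the macro invokes it only after the failure has been appended to the log file and `g_checkFailures` has been incremented. One plausible reading is that the failure must be recorded before the process can stop. The sketch below shows that "record first, stop last" ordering under stated assumptions: `SKETCH_CHECK`, `SKETCH_BREAK`, `g_failures` and `clampToByte` are hypothetical names, and the break is a no-op here so the example can keep running.

```cpp
#include <cstdio>

static int g_failures = 0;   // hypothetical stand-in for g_checkFailures

// Stand-in for DEBUG_BREAK(); a checked build would trap or abort here.
#define SKETCH_BREAK() ((void)0)

// Hypothetical macro modelled on the reordered X265_CHECK:
// log the failure, count it, and only then break.
#define SKETCH_CHECK(expr, msg) do { if (!(expr)) { \
    std::fprintf(stderr, "%s:%d %s\n", __FILE__, __LINE__, msg); \
    g_failures++; SKETCH_BREAK(); } } while (0)

// Small function that uses the check and then clamps its input.
static int clampToByte(int v)
{
    SKETCH_CHECK(v >= 0 && v <= 255, "value out of byte range");
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}
```

Calling `clampToByte(300)` logs one line to stderr, bumps `g_failures` to 1 and returns 255; a call with an in-range value leaves the counter untouched.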

 
​

x265_1.5.tar.gz/source/common/constants.cpp -> x265_1.6.tar.gz/source/common/constants.cpp Changed

 
@@ -119,9 +119,10 @@
     65535
 };
 
+int      g_ctuSizeConfigured = 0;
 uint32_t g_maxLog2CUSize = MAX_LOG2_CU_SIZE;
 uint32_t g_maxCUSize     = MAX_CU_SIZE;
-uint32_t g_maxFullDepth  = NUM_FULL_DEPTH - 1;
+uint32_t g_unitSizeDepth = NUM_CU_DEPTH;
 uint32_t g_maxCUDepth    = NUM_CU_DEPTH - 1;
 uint32_t g_zscanToRaster[MAX_NUM_PARTITIONS] = { 0, };
 uint32_t g_rasterToZscan[MAX_NUM_PARTITIONS] = { 0, };
​

x265_1.5.tar.gz/source/common/constants.h -> x265_1.6.tar.gz/source/common/constants.h Changed

 
@@ -29,6 +29,8 @@
 namespace x265 {
 // private namespace
 
+extern int g_ctuSizeConfigured;
+
 void initZscanToRaster(uint32_t maxFullDepth, uint32_t depth, uint32_t startVal, uint32_t*& curIdx);
 void initRasterToZscan(uint32_t maxFullDepth);
 
@@ -55,7 +57,7 @@
 extern uint32_t g_maxLog2CUSize;
 extern uint32_t g_maxCUSize;
 extern uint32_t g_maxCUDepth;
-extern uint32_t g_maxFullDepth;
+extern uint32_t g_unitSizeDepth; // Depth at which 4x4 unit occurs from max CU size
 
 extern const int16_t g_t4[4][4];
 extern const int16_t g_t8[8][8];
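
Note on the `g_maxFullDepth` to `g_unitSizeDepth` rename above: per the new comment, the value is the depth at which a 4x4 unit is reached from the maximum CU size, and `NUM_4x4_PARTITIONS` is `1U << (g_unitSizeDepth << 1)`. The sketch below works through that arithmetic. The helper names are hypothetical, and the assumption that the depth equals log2 of the CTU size minus 2 (4x4 units) is an inference from the comment and not taken from the x265 sources.

```cpp
#include <cstdint>

// Hypothetical helpers (not x265 code).

// Assumed relation: depth from the CTU down to a 4x4 unit.
constexpr uint32_t unitSizeDepth(uint32_t log2CtuSize)
{
    return log2CtuSize - 2;   // 2 == log2(4), the unit edge length
}

// Same expression as NUM_4x4_PARTITIONS, with the depth passed in.
constexpr uint32_t num4x4Partitions(uint32_t depthTo4x4)
{
    return 1u << (depthTo4x4 << 1);
}

// Same expression as m_numPartitions in the cudata.cpp hunk:
// each CU depth level divides the unit count by four.
constexpr uint32_t partitionsAtDepth(uint32_t depthTo4x4, uint32_t cuDepth)
{
    return num4x4Partitions(depthTo4x4) >> (cuDepth * 2);
}

// A 64x64 CTU holds 16x16 = 256 units of 4x4, matching MAX_NUM_PARTITIONS.
static_assert(num4x4Partitions(unitSizeDepth(6)) == 256, "64x64 CTU");
static_assert(partitionsAtDepth(unitSizeDepth(6), 1) == 64, "32x32 CU");
```

Under the same assumption, a 32x32 CTU gives a depth of 3 and 64 units, and `1 << depth` gives the number of units along one CTU edge, as `s_numPartInCUSize` does in cudata.cpp.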
​

x265_1.5.tar.gz/source/common/cudata.cpp -> x265_1.6.tar.gz/source/common/cudata.cpp Changed

@@ -38,7 +38,7 @@
 void bcast1(uint8_t* dst, uint8_t val)  { dst[0] = val; }
 
 void copy4(uint8_t* dst, uint8_t* src)  { ((uint32_t*)dst)[0] = ((uint32_t*)src)[0]; }
-void bcast4(uint8_t* dst, uint8_t val)  { ((uint32_t*)dst)[0] = 0x01010101 * val; }
+void bcast4(uint8_t* dst, uint8_t val)  { ((uint32_t*)dst)[0] = 0x01010101u * val; }
 
 void copy16(uint8_t* dst, uint8_t* src) { ((uint64_t*)dst)[0] = ((uint64_t*)src)[0]; ((uint64_t*)dst)[1] = ((uint64_t*)src)[1]; }
 void bcast16(uint8_t* dst, uint8_t val) { uint64_t bval = 0x0101010101010101ULL * val; ((uint64_t*)dst)[0] = bval; ((uint64_t*)dst)[1] = bval; }
@@ -159,11 +159,11 @@
     m_chromaFormat  = csp;
     m_hChromaShift  = CHROMA_H_SHIFT(csp);
     m_vChromaShift  = CHROMA_V_SHIFT(csp);
-    m_numPartitions = NUM_CU_PARTITIONS >> (depth * 2);
+    m_numPartitions = NUM_4x4_PARTITIONS >> (depth * 2);
 
     if (!s_partSet[0])
     {
-        s_numPartInCUSize = 1 << g_maxFullDepth;
+        s_numPartInCUSize = 1 << g_unitSizeDepth;
         switch (g_maxLog2CUSize)
         {
         case 6:
@@ -272,7 +272,7 @@
     m_cuPelX        = (cuAddr % m_slice->m_sps->numCuInWidth) << g_maxLog2CUSize;
     m_cuPelY        = (cuAddr / m_slice->m_sps->numCuInWidth) << g_maxLog2CUSize;
     m_absIdxInCTU   = 0;
-    m_numPartitions = NUM_CU_PARTITIONS;
+    m_numPartitions = NUM_4x4_PARTITIONS;
 
     /* sequential memsets */
     m_partSet((uint8_t*)m_qp, (uint8_t)qp);
@@ -300,12 +300,12 @@
 // initialize Sub partition
 void CUData::initSubCU(const CUData& ctu, const CUGeom& cuGeom)
 {
-    m_absIdxInCTU   = cuGeom.encodeIdx;
+    m_absIdxInCTU   = cuGeom.absPartIdx;
     m_encData       = ctu.m_encData;
     m_slice         = ctu.m_slice;
     m_cuAddr        = ctu.m_cuAddr;
-    m_cuPelX        = ctu.m_cuPelX + g_zscanToPelX[cuGeom.encodeIdx];
-    m_cuPelY        = ctu.m_cuPelY + g_zscanToPelY[cuGeom.encodeIdx];
+    m_cuPelX        = ctu.m_cuPelX + g_zscanToPelX[cuGeom.absPartIdx];
+    m_cuPelY        = ctu.m_cuPelY + g_zscanToPelY[cuGeom.absPartIdx];
     m_cuLeft        = ctu.m_cuLeft;
     m_cuAbove       = ctu.m_cuAbove;
     m_cuAboveLeft   = ctu.m_cuAboveLeft;
@@ -392,7 +392,7 @@
     m_cuAbove      = cu.m_cuAbove;
     m_cuAboveLeft  = cu.m_cuAboveLeft;
     m_cuAboveRight = cu.m_cuAboveRight;
-    m_absIdxInCTU  = cuGeom.encodeIdx;
+    m_absIdxInCTU  = cuGeom.absPartIdx;
     m_numPartitions = cuGeom.numPartitions;
     memcpy(m_qp, cu.m_qp, BytesPerPartition * m_numPartitions);
     memcpy(m_mv[0],  cu.m_mv[0],  m_numPartitions * sizeof(MV));
@@ -462,9 +462,9 @@
     m_encData       = ctu.m_encData;
     m_slice         = ctu.m_slice;
     m_cuAddr        = ctu.m_cuAddr;
-    m_cuPelX        = ctu.m_cuPelX + g_zscanToPelX[cuGeom.encodeIdx];
-    m_cuPelY        = ctu.m_cuPelY + g_zscanToPelY[cuGeom.encodeIdx];
-    m_absIdxInCTU   = cuGeom.encodeIdx;
+    m_cuPelX        = ctu.m_cuPelX + g_zscanToPelX[cuGeom.absPartIdx];
+    m_cuPelY        = ctu.m_cuPelY + g_zscanToPelY[cuGeom.absPartIdx];
+    m_absIdxInCTU   = cuGeom.absPartIdx;
     m_numPartitions = cuGeom.numPartitions;
 
     /* copy out all prediction info for this part */
@@ -559,7 +559,7 @@
         return this;
     }
 
-    aPartUnitIdx = g_rasterToZscan[absPartIdx + NUM_CU_PARTITIONS - s_numPartInCUSize];
+    aPartUnitIdx = g_rasterToZscan[absPartIdx + NUM_4x4_PARTITIONS - s_numPartInCUSize];
     return m_cuAbove;
 }
 
@@ -581,7 +581,7 @@
                 return this;
             }
         }
-        alPartUnitIdx = g_rasterToZscan[absPartIdx + NUM_CU_PARTITIONS - s_numPartInCUSize - 1];
+        alPartUnitIdx = g_rasterToZscan[absPartIdx + NUM_4x4_PARTITIONS - s_numPartInCUSize - 1];
         return m_cuAbove;
     }
 
@@ -591,7 +591,7 @@
         return m_cuLeft;
     }
 
-    alPartUnitIdx = g_rasterToZscan[NUM_CU_PARTITIONS - 1];
+    alPartUnitIdx = g_rasterToZscan[NUM_4x4_PARTITIONS - 1];
     return m_cuAboveLeft;
 }
 
@@ -620,14 +620,14 @@
             }
             return NULL;
         }
-        arPartUnitIdx = g_rasterToZscan[absPartIdxRT + NUM_CU_PARTITIONS - s_numPartInCUSize + 1];
+        arPartUnitIdx = g_rasterToZscan[absPartIdxRT + NUM_4x4_PARTITIONS - s_numPartInCUSize + 1];
         return m_cuAbove;
     }
 
     if (!isZeroRow(absPartIdxRT, s_numPartInCUSize))
         return NULL;
 
-    arPartUnitIdx = g_rasterToZscan[NUM_CU_PARTITIONS - s_numPartInCUSize];
+    arPartUnitIdx = g_rasterToZscan[NUM_4x4_PARTITIONS - s_numPartInCUSize];
     return m_cuAboveRight;
 }
 
@@ -720,21 +720,21 @@
             }
             return NULL;
         }
-        arPartUnitIdx = g_rasterToZscan[absPartIdxRT + NUM_CU_PARTITIONS - s_numPartInCUSize + partUnitOffset];
+        arPartUnitIdx = g_rasterToZscan[absPartIdxRT + NUM_4x4_PARTITIONS - s_numPartInCUSize + partUnitOffset];
         return m_cuAbove;
     }
 
     if (!isZeroRow(absPartIdxRT, s_numPartInCUSize))
         return NULL;
 
-    arPartUnitIdx = g_rasterToZscan[NUM_CU_PARTITIONS - s_numPartInCUSize + partUnitOffset - 1];
+    arPartUnitIdx = g_rasterToZscan[NUM_4x4_PARTITIONS - s_numPartInCUSize + partUnitOffset - 1];
     return m_cuAboveRight;
 }
 
 /* Get left QpMinCu */
 const CUData* CUData::getQpMinCuLeft(uint32_t& lPartUnitIdx, uint32_t curAbsIdxInCTU) const
 {
-    uint32_t absZorderQpMinCUIdx = curAbsIdxInCTU & (0xFF << (g_maxFullDepth - m_slice->m_pps->maxCuDQPDepth) * 2);
+    uint32_t absZorderQpMinCUIdx = curAbsIdxInCTU & (0xFF << (g_unitSizeDepth - m_slice->m_pps->maxCuDQPDepth) * 2);
     uint32_t absRorderQpMinCUIdx = g_zscanToRaster[absZorderQpMinCUIdx];
 
     // check for left CTU boundary
@@ -751,7 +751,7 @@
 /* Get above QpMinCu */
 const CUData* CUData::getQpMinCuAbove(uint32_t& aPartUnitIdx, uint32_t curAbsIdxInCTU) const
 {
-    uint32_t absZorderQpMinCUIdx = curAbsIdxInCTU & (0xFF << (g_maxFullDepth - m_slice->m_pps->maxCuDQPDepth) * 2);
+    uint32_t absZorderQpMinCUIdx = curAbsIdxInCTU & (0xFF << (g_unitSizeDepth - m_slice->m_pps->maxCuDQPDepth) * 2);
     uint32_t absRorderQpMinCUIdx = g_zscanToRaster[absZorderQpMinCUIdx];
 
     // check for top CTU boundary
@@ -790,7 +790,7 @@
 
 int8_t CUData::getLastCodedQP(uint32_t absPartIdx) const
 {
-    uint32_t quPartIdxMask = 0xFF << (g_maxFullDepth - m_slice->m_pps->maxCuDQPDepth) * 2;
+    uint32_t quPartIdxMask = 0xFF << (g_unitSizeDepth - m_slice->m_pps->maxCuDQPDepth) * 2;
     int lastValidPartIdx = getLastValidPartIdx(absPartIdx & quPartIdxMask);
 
     if (lastValidPartIdx >= 0)
@@ -800,7 +800,7 @@
         if (m_absIdxInCTU)
             return m_encData->getPicCTU(m_cuAddr)->getLastCodedQP(m_absIdxInCTU);
         else if (m_cuAddr > 0 && !(m_slice->m_pps->bEntropyCodingSyncEnabled && !(m_cuAddr % m_slice->m_sps->numCuInWidth)))
-            return m_encData->getPicCTU(m_cuAddr - 1)->getLastCodedQP(NUM_CU_PARTITIONS);
+            return m_encData->getPicCTU(m_cuAddr - 1)->getLastCodedQP(NUM_4x4_PARTITIONS);
         else
             return (int8_t)m_slice->m_sliceQp;
     }
@@ -932,7 +932,7 @@
 
 bool CUData::setQPSubCUs(int8_t qp, uint32_t absPartIdx, uint32_t depth)
 {
-    uint32_t curPartNumb = NUM_CU_PARTITIONS >> (depth << 1);
+    uint32_t curPartNumb = NUM_4x4_PARTITIONS >> (depth << 1);
     uint32_t curPartNumQ = curPartNumb >> 2;
 
     if (m_cuDepth[absPartIdx] > depth)
@@ -1375,8 +1375,8 @@
     return true;
 }
 
-/* Construct list of merging candidates */
-uint32_t CUData::getInterMergeCandidates(uint32_t absPartIdx, uint32_t puIdx, MVField(*mvFieldNeighbours)[2], uint8_t* interDirNeighbours) const
+/* Construct list of merging candidates, returns count */
+uint32_t CUData::getInterMergeCandidates(uint32_t absPartIdx, uint32_t puIdx, MVField(*candMvField)[2], uint8_t* candDir) const
 {
     uint32_t absPartAddr = m_absIdxInCTU + absPartIdx;
     const bool isInterB = m_slice->isInterB();
@@ -1385,10 +1385,10 @@
 
     for (uint32_t i = 0; i < maxNumMergeCand; ++i)
     {
-        mvFieldNeighbours[i][0].mv = 0;
-        mvFieldNeighbours[i][1].mv = 0;
-        mvFieldNeighbours[i][0].refIdx = REF_NOT_VALID;
-        mvFieldNeighbours[i][1].refIdx = REF_NOT_VALID;
+        candMvField[i][0].mv = 0;
+        candMvField[i][1].mv = 0;
+        candMvField[i][0].refIdx = REF_NOT_VALID;
+        candMvField[i][1].refIdx = REF_NOT_VALID;
     }
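
Note on the bcast4 change in the first hunk of this file: the literal gained a `u` suffix. A plausible reason (the changelog does not state it) is that `0x01010101 * val` is evaluated in signed int, and for val >= 128 the product exceeds INT_MAX on platforms with 32-bit int, which is undefined behaviour. The sketch below illustrates the unsigned form. The names splat4 and bcast4Sketch are hypothetical, and the memcpy store is a substitution for the pointer cast used in the original.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Replicate one byte into all four bytes of a 32-bit word.
// The 'u' suffix makes the multiplication unsigned, so it wraps modulo 2^32
// instead of overflowing a signed int when val >= 128.
inline uint32_t splat4(uint8_t val)
{
    return 0x01010101u * val;
}

// Hypothetical variant of bcast4 that stores through memcpy, which avoids
// relying on dst being suitably aligned for a uint32_t access.
inline void bcast4Sketch(uint8_t* dst, uint8_t val)
{
    assert(dst != nullptr);
    const uint32_t word = splat4(val);
    std::memcpy(dst, &word, sizeof(word));
}
```

Because all four bytes of the word are identical, the result does not depend on byte order.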

 

x265_1.5.tar.gz/source/common/cudata.h -> x265_1.6.tar.gz/source/common/cudata.h Changed

@@ -64,7 +64,8 @@
     MD_ABOVE,       // MVP of above block
     MD_ABOVE_RIGHT, // MVP of above right block
     MD_BELOW_LEFT,  // MVP of below left block
-    MD_ABOVE_LEFT   // MVP of above left block
+    MD_ABOVE_LEFT,  // MVP of above left block
+    MD_COLLOCATED   // MVP of temporal neighbour
 };
 
 struct CUGeom
@@ -82,7 +83,7 @@
 
     uint32_t log2CUSize;    // Log of the CU size.
     uint32_t childOffset;   // offset of the first child CU from current CU
-    uint32_t encodeIdx;     // Encoding index of this CU in terms of 4x4 blocks.
+    uint32_t absPartIdx;    // Part index of this CU in terms of 4x4 blocks.
     uint32_t numPartitions; // Number of 4x4 blocks in the CU
     uint32_t depth;         // depth of this CU relative from CTU
     uint32_t flags;         // CU flags.
@@ -94,6 +95,26 @@
     int refIdx;
 };
 
+// Structure that keeps the neighbour's MV information.
+struct InterNeighbourMV
+{
+    // Neighbour MV. The index represents the list.
+    MV mv[2];
+
+    // Collocated right bottom CU addr.
+    uint32_t cuAddr[2];
+
+    // For spatial prediction, this field contains the reference index
+    // in each list (-1 if not available).
+    //
+    // For temporal prediction, the first value is used for the 
+    // prediction with list 0. The second value is used for the prediction 
+    // with list 1. For each value, the first four bits are the reference index 
+    // associated to the PMV, and the fifth bit is the list associated to the PMV.
+    // if both reference indices are -1, then unifiedRef is also -1
+    union { int16_t refIdx[2]; int32_t unifiedRef; };
+};
+
 typedef void(*cucopy_t)(uint8_t* dst, uint8_t* src); // dst and src are aligned to MIN(size, 32)
 typedef void(*cubcast_t)(uint8_t* dst, uint8_t val); // dst is aligned to MIN(size, 32)
 
@@ -122,9 +143,9 @@
     uint32_t      m_cuPelY;           // CU position within the picture, in pixels (Y)
     uint32_t      m_numPartitions;    // maximum number of 4x4 partitions within this CU
 
-    int           m_chromaFormat;
-    int           m_hChromaShift;
-    int           m_vChromaShift;
+    uint32_t      m_chromaFormat;
+    uint32_t      m_hChromaShift;
+    uint32_t      m_vChromaShift;
 
     /* Per-part data, stored contiguously */
     int8_t*       m_qp;               // array of QP values
@@ -158,7 +179,7 @@
     CUData();
 
     void     initialize(const CUDataMemPool& dataPool, uint32_t depth, int csp, int instance);
-    static void calcCTUGeoms(uint32_t ctuWidth, uint32_t ctuHeight, uint32_t maxCUSize, CUGeom cuDataArray[CUGeom::MAX_GEOMS]);
+    static void calcCTUGeoms(uint32_t ctuWidth, uint32_t ctuHeight, uint32_t maxCUSize, uint32_t minCUSize, CUGeom cuDataArray[CUGeom::MAX_GEOMS]);
 
     void     initCTU(const Frame& frame, uint32_t cuAddr, int qp);
     void     initSubCU(const CUData& ctu, const CUGeom& cuGeom);
@@ -195,9 +216,10 @@
     uint8_t  getCbf(uint32_t absPartIdx, TextType ttype, uint32_t tuDepth) const { return (m_cbf[ttype][absPartIdx] >> tuDepth) & 0x1; }
     uint8_t  getQtRootCbf(uint32_t absPartIdx) const                             { return m_cbf[0][absPartIdx] || m_cbf[1][absPartIdx] || m_cbf[2][absPartIdx]; }
     int8_t   getRefQP(uint32_t currAbsIdxInCTU) const;
-    uint32_t getInterMergeCandidates(uint32_t absPartIdx, uint32_t puIdx, MVField (*mvFieldNeighbours)[2], uint8_t* interDirNeighbours) const;
+    uint32_t getInterMergeCandidates(uint32_t absPartIdx, uint32_t puIdx, MVField (*candMvField)[2], uint8_t* candDir) const;
     void     clipMv(MV& outMV) const;
-    int      fillMvpCand(uint32_t puIdx, uint32_t absPartIdx, int picList, int refIdx, MV* amvpCand, MV* mvc) const;
+    int      getPMV(InterNeighbourMV *neighbours, uint32_t reference_list, uint32_t refIdx, MV* amvpCand, MV* pmv) const;
+    void     getNeighbourMV(uint32_t puIdx, uint32_t absPartIdx, InterNeighbourMV* neighbours) const;
     void     getIntraTUQtDepthRange(uint32_t tuDepthRange[2], uint32_t absPartIdx) const;
     void     getInterTUQtDepthRange(uint32_t tuDepthRange[2], uint32_t absPartIdx) const;
 
@@ -213,10 +235,9 @@
     void     getAllowedChromaDir(uint32_t absPartIdx, uint32_t* modeList) const;
     int      getIntraDirLumaPredictor(uint32_t absPartIdx, uint32_t* intraDirPred) const;
 
-    uint32_t getSCUAddr() const                  { return (m_cuAddr << g_maxFullDepth * 2) + m_absIdxInCTU; }
+    uint32_t getSCUAddr() const                  { return (m_cuAddr << g_unitSizeDepth * 2) + m_absIdxInCTU; }
     uint32_t getCtxSplitFlag(uint32_t absPartIdx, uint32_t depth) const;
     uint32_t getCtxSkipFlag(uint32_t absPartIdx) const;
-    ScanType getCoefScanIdx(uint32_t absPartIdx, uint32_t log2TrSize, bool bIsLuma, bool bIsIntra) const;
     void     getTUEntropyCodingParameters(TUEntropyCodingParameters &result, uint32_t absPartIdx, uint32_t log2TrSize, bool bIsLuma) const;
 
     const CUData* getPULeft(uint32_t& lPartUnitIdx, uint32_t curPartUnitIdx) const;
@@ -241,15 +262,18 @@
 
     bool hasEqualMotion(uint32_t absPartIdx, const CUData& candCU, uint32_t candAbsPartIdx) const;
 
-    bool isDiffMER(int xN, int yN, int xP, int yP) const;
+    /* Check whether the current PU and a spatial neighboring PU are in same merge region */
+    bool isDiffMER(int xN, int yN, int xP, int yP) const { return ((xN >> 2) != (xP >> 2)) || ((yN >> 2) != (yP >> 2)); }
 
     // add possible motion vector predictor candidates
-    bool addMVPCand(MV& mvp, int picList, int refIdx, uint32_t absPartIdx, MVP_DIR dir) const;
-    bool addMVPCandOrder(MV& mvp, int picList, int refIdx, uint32_t absPartIdx, MVP_DIR dir) const;
+    bool getDirectPMV(MV& pmv, InterNeighbourMV *neighbours, uint32_t picList, uint32_t refIdx) const;
+    bool getIndirectPMV(MV& outMV, InterNeighbourMV *neighbours, uint32_t reference_list, uint32_t refIdx) const;
+    void getInterNeighbourMV(InterNeighbourMV *neighbour, uint32_t partUnitIdx, MVP_DIR dir) const;
 
     bool getColMVP(MV& outMV, int& outRefIdx, int picList, int cuAddr, int absPartIdx) const;
+    bool getCollocatedMV(int cuAddr, int partUnitIdx, InterNeighbourMV *neighbour) const;
 
-    void scaleMvByPOCDist(MV& outMV, const MV& inMV, int curPOC, int curRefPOC, int colPOC, int colRefPOC) const;
+    MV scaleMvByPOCDist(const MV& inMV, int curPOC, int curRefPOC, int colPOC, int colRefPOC) const;
 
     void     deriveLeftRightTopIdx(uint32_t puIdx, uint32_t& partIdxLT, uint32_t& partIdxRT) const;
 
@@ -278,7 +302,7 @@
 
     bool create(uint32_t depth, uint32_t csp, uint32_t numInstances)
     {
-        uint32_t numPartition = NUM_CU_PARTITIONS >> (depth * 2);
+        uint32_t numPartition = NUM_4x4_PARTITIONS >> (depth * 2);
         uint32_t cuSize = g_maxCUSize >> depth;
         uint32_t sizeL = cuSize * cuSize;
         uint32_t sizeC = sizeL >> (CHROMA_H_SHIFT(csp) + CHROMA_V_SHIFT(csp));
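
Note on the InterNeighbourMV struct added above: its comment says that when both reference indices are -1, unifiedRef is also -1. The sketch below shows why that holds for a pair of int16_t values viewed as one int32_t. It is an illustration only: the function names are hypothetical, and it uses memcpy instead of the anonymous union from the diff so that the example stays within strictly defined C++ behaviour.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// View two 16-bit reference indices as a single 32-bit value, as the
// union { int16_t refIdx[2]; int32_t unifiedRef; } in the diff does.
inline int32_t unifiedRefOf(int16_t refIdx0, int16_t refIdx1)
{
    const int16_t refIdx[2] = { refIdx0, refIdx1 };
    int32_t unified;
    static_assert(sizeof(unified) == sizeof(refIdx), "size mismatch");
    std::memcpy(&unified, refIdx, sizeof(unified));
    return unified;
}

// Both lists unavailable <=> every bit is set <=> the 32-bit view is -1.
// This lets a caller test both lists with a single comparison.
inline bool bothListsUnavailable(int16_t refIdx0, int16_t refIdx1)
{
    return unifiedRefOf(refIdx0, refIdx1) == -1;
}
```

The all-ones pattern is the same regardless of byte order, so the single comparison works on both little-endian and big-endian targets.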

 

x265_1.5.tar.gz/source/common/dct.cpp -> x265_1.6.tar.gz/source/common/dct.cpp Changed

@@ -709,14 +709,12 @@
 
     return numSig;
 }
-
-int  count_nonzero_c(const int16_t* quantCoeff, int numCoeff)
+template<int trSize>
+int  count_nonzero_c(const int16_t* quantCoeff)
 {
     X265_CHECK(((intptr_t)quantCoeff & 15) == 0, "quant buffer not aligned\n");
-    X265_CHECK(numCoeff > 0 && (numCoeff & 15) == 0, "numCoeff invalid %d\n", numCoeff);
-
     int count = 0;
-
+    int numCoeff = trSize * trSize;
     for (int i = 0; i < numCoeff; i++)
     {
         count += quantCoeff[i] != 0;
@@ -754,6 +752,39 @@
     }
 }
 
+int findPosLast_c(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig)
+{
+    memset(coeffNum, 0, MLS_GRP_NUM * sizeof(*coeffNum));
+    memset(coeffFlag, 0, MLS_GRP_NUM * sizeof(*coeffFlag));
+    memset(coeffSign, 0, MLS_GRP_NUM * sizeof(*coeffSign));
+
+    int scanPosLast = 0;
+    do
+    {
+        const uint32_t cgIdx = (uint32_t)scanPosLast >> MLS_CG_SIZE;
+
+        const uint32_t posLast = scan[scanPosLast++];
+
+        const int curCoeff = coeff[posLast];
+        const uint32_t isNZCoeff = (curCoeff != 0);
+        // get L1 sig map
+        // NOTE: the new algorithm is complicated, so I keep reference code here
+        //uint32_t posy   = posLast >> log2TrSize;
+        //uint32_t posx   = posLast - (posy << log2TrSize);
+        //uint32_t blkIdx0 = ((posy >> MLS_CG_LOG2_SIZE) << codingParameters.log2TrSizeCG) + (posx >> MLS_CG_LOG2_SIZE);
+        //const uint32_t blkIdx = ((posLast >> (2 * MLS_CG_LOG2_SIZE)) & ~maskPosXY) + ((posLast >> MLS_CG_LOG2_SIZE) & maskPosXY);
+        //sigCoeffGroupFlag64 |= ((uint64_t)isNZCoeff << blkIdx);
+        numSig -= isNZCoeff;
+
+        // TODO: optimize by instruction BTS
+        coeffSign[cgIdx] += (uint16_t)(((uint32_t)curCoeff >> 31) << coeffNum[cgIdx]);
+        coeffFlag[cgIdx] = (coeffFlag[cgIdx] << 1) + (uint16_t)isNZCoeff;
+        coeffNum[cgIdx] += (uint8_t)isNZCoeff;
+    }
+    while (numSig > 0);
+    return scanPosLast - 1;
+}
+
 }  // closing - anonymous file-static namespace
 
 namespace x265 {
@@ -775,12 +806,17 @@
     p.cu[BLOCK_8x8].idct   = idct8_c;
     p.cu[BLOCK_16x16].idct = idct16_c;
     p.cu[BLOCK_32x32].idct = idct32_c;
-    p.count_nonzero = count_nonzero_c;
     p.denoiseDct = denoiseDct_c;
+    p.cu[BLOCK_4x4].count_nonzero = count_nonzero_c<4>;
+    p.cu[BLOCK_8x8].count_nonzero = count_nonzero_c<8>;
+    p.cu[BLOCK_16x16].count_nonzero = count_nonzero_c<16>;
+    p.cu[BLOCK_32x32].count_nonzero = count_nonzero_c<32>;
 
     p.cu[BLOCK_4x4].copy_cnt   = copy_count<4>;
     p.cu[BLOCK_8x8].copy_cnt   = copy_count<8>;
     p.cu[BLOCK_16x16].copy_cnt = copy_count<16>;
     p.cu[BLOCK_32x32].copy_cnt = copy_count<32>;
+
+    p.findPosLast = findPosLast_c;
 }
 }
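
Note on findPosLast_c added above: the following standalone sketch restates the loop so its outputs can be checked on a small input. It is not the x265 source. The constants kCgLog2Coeffs (4, i.e. 16 coefficients per coefficient group) and kMaxGroups (64) stand in for MLS_CG_SIZE and MLS_GRP_NUM and are assumptions, as is the use of int16_t for coeff_t.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

const int kCgLog2Coeffs = 4; // assumed value of MLS_CG_SIZE
const int kMaxGroups = 64;   // assumed value of MLS_GRP_NUM
typedef int16_t coeff_sketch_t; // assumed width of coeff_t

// Walk the scan order until numSig nonzero coefficients have been seen.
// Per coefficient group it records:
//   coeffFlag: one bit per visited position (1 = nonzero), first visited position in the highest used bit
//   coeffSign: one bit per nonzero coefficient (1 = negative), first nonzero in bit 0
//   coeffNum:  count of nonzero coefficients
// Returns the scan position of the last nonzero coefficient.
inline int findPosLastSketch(const uint16_t* scan, const coeff_sketch_t* coeff,
                             uint16_t* coeffSign, uint16_t* coeffFlag,
                             uint8_t* coeffNum, int numSig)
{
    assert(numSig > 0); // the do/while below always visits at least one position
    std::memset(coeffNum, 0, kMaxGroups * sizeof(*coeffNum));
    std::memset(coeffFlag, 0, kMaxGroups * sizeof(*coeffFlag));
    std::memset(coeffSign, 0, kMaxGroups * sizeof(*coeffSign));

    int scanPos = 0;
    do
    {
        const uint32_t cgIdx = (uint32_t)scanPos >> kCgLog2Coeffs;
        const uint32_t pos = scan[scanPos++];
        const int cur = coeff[pos];
        const uint32_t isNonZero = (cur != 0);
        numSig -= (int)isNonZero;

        // Sign bit of cur, placed at the index of this nonzero coefficient.
        coeffSign[cgIdx] = (uint16_t)(coeffSign[cgIdx] + ((((uint32_t)cur) >> 31) << coeffNum[cgIdx]));
        coeffFlag[cgIdx] = (uint16_t)((coeffFlag[cgIdx] << 1) + isNonZero);
        coeffNum[cgIdx] = (uint8_t)(coeffNum[cgIdx] + isNonZero);
    }
    while (numSig > 0);
    return scanPos - 1;
}
```

For an identity scan over the coefficients {3, 0, -2, 0, 0, 1, 0, ...} with numSig = 3, this sketch stops at scan position 5 and leaves coeffFlag[0] = 41 (binary 101001), coeffSign[0] = 2 and coeffNum[0] = 3.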

 

x265_1.5.tar.gz/source/common/deblock.cpp -> x265_1.6.tar.gz/source/common/deblock.cpp Changed

 
@@ -70,7 +70,7 @@
  * param Edge the direction of the edge in block boundary (horizonta/vertical), which is added newly */
 void Deblock::deblockCU(const CUData* cu, const CUGeom& cuGeom, const int32_t dir, uint8_t blockStrength[])
 {
-    uint32_t absPartIdx = cuGeom.encodeIdx;
+    uint32_t absPartIdx = cuGeom.absPartIdx;
     uint32_t depth = cuGeom.depth;
     if (cu->m_predMode[absPartIdx] == MODE_NONE)
         return;
@@ -358,7 +358,7 @@
         int16_t m5  = (int16_t)src[offset];
         int16_t m2  = (int16_t)src[-offset * 2];
 
-        int32_t delta = x265_clip3(-tc, tc, ((((m4 - m3) << 2) + m2 - m5 + 4) >> 3));
+        int32_t delta = x265_clip3(-tc, tc, ((((m4 - m3) * 4) + m2 - m5 + 4) >> 3));
         src[-offset] = x265_clip(m3 + (delta & maskP));
         src[0] = x265_clip(m4 - (delta & maskQ));
     }
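The second deblock.cpp hunk replaces `<< 2` with `* 4` in the delta computation. The diff does not state the reason. One plausible motivation, assumed here, is that `m4 - m3` can be negative, and left-shifting a negative signed value is undefined behavior before C++20, while the multiplication is well defined. A minimal sketch, where clip3 and edgeDelta are hypothetical stand-ins rather than the x265 functions:

```cpp
#include <stdint.h>

// Hypothetical stand-in for x265_clip3: clamp v to the range [lo, hi].
inline int32_t clip3(int32_t lo, int32_t hi, int32_t v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

// Same arithmetic as the "+" line of the hunk: the difference is scaled by
// multiplication, so a negative (m4 - m3) never reaches a left shift.
inline int32_t edgeDelta(int16_t m2, int16_t m3, int16_t m4, int16_t m5, int32_t tc)
{
    return clip3(-tc, tc, (((m4 - m3) * 4) + m2 - m5 + 4) >> 3);
}
```

For in-range inputs the multiplication gives the value the shift was intended to produce, so the filter output should be unchanged.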

x265_1.5.tar.gz/source/common/framedata.h -> x265_1.6.tar.gz/source/common/framedata.h Changed

 
@@ -32,6 +32,7 @@
 // private namespace
 
 class PicYuv;
+class JobProvider;
 
 /* Per-frame data that is used during encodes and referenced while the picture
  * is available for reference. A FrameData instance is attached to a Frame as it
@@ -52,6 +53,7 @@
     PicYuv*        m_reconPic;
     bool           m_bHasReferences;   /* used during DPB/RPS updates */
     int            m_frameEncoderID;   /* the ID of the FrameEncoder encoding this frame */
+    JobProvider*   m_jobProvider;
 
     CUDataMemPool  m_cuMemPool;
     CUData*        m_picCTU;

x265_1.5.tar.gz/source/common/intrapred.cpp -> x265_1.6.tar.gz/source/common/intrapred.cpp Changed

@@ -27,6 +27,29 @@
 using namespace x265;
 
 namespace {
+
+template<int tuSize>
+void intraFilter(const pixel* samples, pixel* filtered) /* 1:2:1 filtering of left and top reference samples */
+{
+    const int tuSize2 = tuSize << 1;
+
+    pixel topLeft = samples[0], topLast = samples[tuSize2], leftLast = samples[tuSize2 + tuSize2];
+
+    // filtering top
+    for (int i = 1; i < tuSize2; i++)
+        filtered[i] = ((samples[i] << 1) + samples[i - 1] + samples[i + 1] + 2) >> 2;
+    filtered[tuSize2] = topLast;
+    
+    // filtering top-left
+    filtered[0] = ((topLeft << 1) + samples[1] + samples[tuSize2 + 1] + 2) >> 2;
+
+    // filtering left
+    filtered[tuSize2 + 1] = ((samples[tuSize2 + 1] << 1) + topLeft + samples[tuSize2 + 2] + 2) >> 2;
+    for (int i = tuSize2 + 2; i < tuSize2 + tuSize2; i++)
+        filtered[i] = ((samples[i] << 1) + samples[i - 1] + samples[i + 1] + 2) >> 2;
+    filtered[tuSize2 + tuSize2] = leftLast;
+}
+
 void dcPredFilter(const pixel* above, const pixel* left, pixel* dst, intptr_t dststride, int size)
 {
     // boundary pixels processing
@@ -216,6 +239,11 @@
 
 void setupIntraPrimitives_c(EncoderPrimitives& p)
 {
+    p.cu[BLOCK_4x4].intra_filter = intraFilter<4>;
+    p.cu[BLOCK_8x8].intra_filter = intraFilter<8>;
+    p.cu[BLOCK_16x16].intra_filter = intraFilter<16>;
+    p.cu[BLOCK_32x32].intra_filter = intraFilter<32>;
+
     p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = planar_pred_c<2>;
     p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = planar_pred_c<3>;
     p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = planar_pred_c<4>;
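The intraFilter template added above applies a 1:2:1 smoothing filter to the reference samples. As read from the hunk, the buffer layout is: index 0 is the top-left sample, indices 1 to 2N are the top row, and indices 2N+1 to 4N are the left column. The following sketch restates it in self-contained form so the arithmetic can be checked by hand. It assumes 8-bit pixels, and the name intraFilter121 is hypothetical.

```cpp
#include <stdint.h>

typedef uint8_t pixel; // assumption: 8-bit build

template<int tuSize>
void intraFilter121(const pixel* samples, pixel* filtered)
{
    const int tuSize2 = tuSize << 1;
    const pixel topLeft = samples[0];

    // top row: interior samples are smoothed, the last one is copied
    for (int i = 1; i < tuSize2; i++)
        filtered[i] = (pixel)(((samples[i] << 1) + samples[i - 1] + samples[i + 1] + 2) >> 2);
    filtered[tuSize2] = samples[tuSize2];

    // corner: its neighbours are the first top and the first left sample
    filtered[0] = (pixel)(((topLeft << 1) + samples[1] + samples[tuSize2 + 1] + 2) >> 2);

    // left column: the first sample uses the corner as its upper neighbour
    filtered[tuSize2 + 1] = (pixel)(((samples[tuSize2 + 1] << 1) + topLeft + samples[tuSize2 + 2] + 2) >> 2);
    for (int i = tuSize2 + 2; i < tuSize2 + tuSize2; i++)
        filtered[i] = (pixel)(((samples[i] << 1) + samples[i - 1] + samples[i + 1] + 2) >> 2);
    filtered[tuSize2 + tuSize2] = samples[tuSize2 + tuSize2];
}
```

A constant input stays constant, since (2v + v + v + 2) >> 2 equals v. A linear ramp along the top row is also preserved, which is a quick sanity check for the rounding term.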

 

x265_1.5.tar.gz/source/common/ipfilter.cpp -> x265_1.6.tar.gz/source/common/ipfilter.cpp Changed

@@ -34,8 +34,27 @@
 #endif
 
 namespace {
+template<int dstStride, int width, int height>
+void pixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst)
+{
+    int shift = IF_INTERNAL_PREC - X265_DEPTH;
+    int row, col;
+
+    for (row = 0; row < height; row++)
+    {
+        for (col = 0; col < width; col++)
+        {
+            int16_t val = src[col] << shift;
+            dst[col] = val - (int16_t)IF_INTERNAL_OFFS;
+        }
+
+        src += srcStride;
+        dst += dstStride;
+    }
+}
+
 template<int dstStride>
-void filterConvertPelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height)
+void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height)
 {
     int shift = IF_INTERNAL_PREC - X265_DEPTH;
     int row, col;
@@ -65,8 +84,8 @@
         }
 
 #else
-        ::memset(txt - marginX, txt[0], marginX);
-        ::memset(txt + width, txt[width - 1], marginX);
+        memset(txt - marginX, txt[0], marginX);
+        memset(txt + width, txt[width - 1], marginX);
 #endif
 
         txt += stride;
@@ -378,7 +397,8 @@
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vpp = interp_vert_pp_c<4, W, H>;  \
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
-    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>;
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>; 
 
 #define CHROMA_422(W, H) \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
@@ -386,7 +406,8 @@
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vpp = interp_vert_pp_c<4, W, H>;  \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
-    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>;
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>; 
 
 #define CHROMA_444(W, H) \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
@@ -394,7 +415,8 @@
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vpp = interp_vert_pp_c<4, W, H>;  \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
-    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>;
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>; 
 
 #define LUMA(W, H) \
     p.pu[LUMA_ ## W ## x ## H].luma_hpp     = interp_horiz_pp_c<8, W, H>; \
@@ -403,7 +425,8 @@
     p.pu[LUMA_ ## W ## x ## H].luma_vps     = interp_vert_ps_c<8, W, H>;  \
     p.pu[LUMA_ ## W ## x ## H].luma_vsp     = interp_vert_sp_c<8, W, H>;  \
     p.pu[LUMA_ ## W ## x ## H].luma_vss     = interp_vert_ss_c<8, W, H>;  \
-    p.pu[LUMA_ ## W ## x ## H].luma_hvpp    = interp_hv_pp_c<8, W, H>;
+    p.pu[LUMA_ ## W ## x ## H].luma_hvpp    = interp_hv_pp_c<8, W, H>; \
+    p.pu[LUMA_ ## W ## x ## H].filter_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>
 
 void setupFilterPrimitives_c(EncoderPrimitives& p)
 {
@@ -507,11 +530,11 @@
     CHROMA_444(48, 64);
     CHROMA_444(64, 16);
     CHROMA_444(16, 64);
-    p.luma_p2s = filterConvertPelToShort_c<MAX_CU_SIZE>;
+    p.luma_p2s = filterPixelToShort_c<MAX_CU_SIZE>;
 
-    p.chroma[X265_CSP_I444].p2s = filterConvertPelToShort_c<MAX_CU_SIZE>;
-    p.chroma[X265_CSP_I420].p2s = filterConvertPelToShort_c<MAX_CU_SIZE / 2>;
-    p.chroma[X265_CSP_I422].p2s = filterConvertPelToShort_c<MAX_CU_SIZE / 2>;
+    p.chroma[X265_CSP_I444].p2s = filterPixelToShort_c<MAX_CU_SIZE>;
+    p.chroma[X265_CSP_I420].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
+    p.chroma[X265_CSP_I422].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
 
     p.extendRowBorder = extendCURowColBorder;
 }
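The new pixelToShort_c primitive takes the block width, height and destination stride as template arguments and maps each pixel to the 16-bit intermediate format. The sketch below works through the arithmetic. The constant values are not shown in this diff, so they are assumptions: 8-bit pixels, an internal precision of 14 bits, and an offset of 1 << 13. The names kDepth, kInternalPrec, kInternalOffs and pixelToShort are hypothetical.

```cpp
#include <stdint.h>

typedef uint8_t pixel;                               // assumption: 8-bit build
const int kDepth = 8;                                // assumed stand-in for X265_DEPTH
const int kInternalPrec = 14;                        // assumed stand-in for IF_INTERNAL_PREC
const int kInternalOffs = 1 << (kInternalPrec - 1);  // assumed stand-in for IF_INTERNAL_OFFS

template<int dstStride, int width, int height>
void pixelToShort(const pixel* src, intptr_t srcStride, int16_t* dst)
{
    const int shift = kInternalPrec - kDepth;

    for (int row = 0; row < height; row++)
    {
        // scale up to the internal precision, then centre around zero
        for (int col = 0; col < width; col++)
            dst[col] = (int16_t)((src[col] << shift) - kInternalOffs);

        src += srcStride;
        dst += dstStride;
    }
}
```

Under these assumptions a pixel value of 128 maps to 0, 0 maps to -8192 and 255 maps to 8128, so the result always fits in int16_t.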

 

x265_1.5.tar.gz/source/common/lowres.cpp -> x265_1.6.tar.gz/source/common/lowres.cpp Changed

@@ -56,12 +56,11 @@
     CHECKED_MALLOC(propagateCost, uint16_t, cuCount);
 
     /* allocate lowres buffers */
-    for (int i = 0; i < 4; i++)
-    {
-        CHECKED_MALLOC(buffer[i], pixel, planesize);
-        /* initialize the whole buffer to prevent valgrind warnings on right edge */
-        memset(buffer[i], 0, sizeof(pixel) * planesize);
-    }
+    CHECKED_MALLOC_ZERO(buffer[0], pixel, 4 * planesize);
+
+    buffer[1] = buffer[0] + planesize;
+    buffer[2] = buffer[1] + planesize;
+    buffer[3] = buffer[2] + planesize;
 
     lowresPlane[0] = buffer[0] + padoffset;
     lowresPlane[1] = buffer[1] + padoffset;
@@ -96,9 +95,7 @@
 
 void Lowres::destroy()
 {
-    for (int i = 0; i < 4; i++)
-        X265_FREE(buffer[i]);
-
+    X265_FREE(buffer[0]);
     X265_FREE(intraCost);
     X265_FREE(intraMode);
 
@@ -126,13 +123,11 @@
 }
 
 // (re) initialize lowres state
-void Lowres::init(PicYuv *origPic, int poc, int type)
+void Lowres::init(PicYuv *origPic, int poc)
 {
-    bIntraCalculated = false;
     bLastMiniGopBFrame = false;
     bScenecut = true;  // could be a scene-cut, until ruled out by flash detection
     bKeyframe = false; // Not a keyframe unless identified by lookahead
-    sliceType = type;
     frameNum = poc;
     leadingBframes = 0;
     indB = 0;
@@ -158,8 +153,8 @@
 
     /* downscale and generate 4 hpel planes for lookahead */
     primitives.frameInitLowres(origPic->m_picOrg[0],
-                                      lowresPlane[0], lowresPlane[1], lowresPlane[2], lowresPlane[3],
-                                      origPic->m_stride, lumaStride, width, lines);
+                               lowresPlane[0], lowresPlane[1], lowresPlane[2], lowresPlane[3],
+                               origPic->m_stride, lumaStride, width, lines);
 
     /* extend hpel planes for motion search */
     extendPicBorder(lowresPlane[0], lumaStride, width, lines, origPic->m_lumaMarginX, origPic->m_lumaMarginY);
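The first lowres.cpp hunk replaces four separate plane allocations with one zeroed allocation that is sliced into four planes, and destroy() then frees only buffer[0]. A minimal sketch of that ownership pattern follows. It uses std::calloc as a stand-in for CHECKED_MALLOC_ZERO, whose definition is not shown here, and the PlaneSet type is hypothetical.

```cpp
#include <cstdlib>
#include <stdint.h>

typedef uint8_t pixel; // assumption: 8-bit pixels

struct PlaneSet
{
    pixel* buffer[4];

    PlaneSet() { buffer[0] = buffer[1] = buffer[2] = buffer[3] = nullptr; }

    // One zero-initialized block; planes 1..3 are offsets into it.
    bool create(std::size_t planesize)
    {
        buffer[0] = static_cast<pixel*>(std::calloc(4 * planesize, sizeof(pixel)));
        if (!buffer[0])
            return false;
        for (int i = 1; i < 4; i++)
            buffer[i] = buffer[i - 1] + planesize;
        return true;
    }

    // Only buffer[0] owns memory, so it is the only pointer passed to free().
    void destroy()
    {
        std::free(buffer[0]);
        buffer[0] = buffer[1] = buffer[2] = buffer[3] = nullptr;
    }
};
```

The design trades four allocator calls for one, at the cost of requiring that buffer[1] through buffer[3] are never freed individually.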

 

x265_1.5.tar.gz/source/common/lowres.h -> x265_1.6.tar.gz/source/common/lowres.h Changed

 
@@ -114,7 +114,6 @@
     int    lines;            // height of lowres frame in pixel lines
     int    leadingBframes;   // number of leading B frames for P or I
 
-    bool   bIntraCalculated;
     bool   bScenecut;        // Set to false if the frame cannot possibly be part of a real scenecut.
     bool   bKeyframe;
     bool   bLastMiniGopBFrame;
@@ -151,7 +150,7 @@
 
     bool create(PicYuv *origPic, int _bframes, bool bAqEnabled);
     void destroy();
-    void init(PicYuv *origPic, int poc, int sliceType);
+    void init(PicYuv *origPic, int poc);
 };
 }
 

x265_1.5.tar.gz/source/common/mv.h -> x265_1.6.tar.gz/source/common/mv.h Changed

 
@@ -56,12 +56,17 @@
 
     MV& operator >>=(int i)                    { x >>= i; y >>= i; return *this; }
 
+#if USING_FTRAPV
+    /* avoid signed left-shifts when -ftrapv is enabled */
+    MV& operator <<=(int i)                    { x *= (1 << i); y *= (1 << i); return *this; }
+    MV operator <<(int i) const                { return MV(x * (1 << i), y * (1 << i)); }
+#else
     MV& operator <<=(int i)                    { x <<= i; y <<= i; return *this; }
+    MV operator <<(int i) const                { return MV(x << i, y << i); }
+#endif
 
     MV operator >>(int i) const                { return MV(x >> i, y >> i); }
 
-    MV operator <<(int i) const                { return MV(x << i, y << i); }
-
     MV operator *(int16_t i) const             { return MV(x * i, y * i); }
 
     MV operator -(const MV& other) const       { return MV(x - other.x, y - other.y); }
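The mv.h hunk adds a USING_FTRAPV variant of the shift operators that multiplies by a power of two instead of left-shifting, per the comment "avoid signed left-shifts when -ftrapv is enabled". The sketch below shows the multiplication form on a small vector type. Vec2 is a hypothetical stand-in for MV, and 16-bit components are an assumption.

```cpp
#include <stdint.h>

struct Vec2
{
    int16_t x, y;

    Vec2(int16_t _x, int16_t _y) : x(_x), y(_y) {}

    // x * (1 << i) only ever shifts the positive constant 1, so a negative
    // component is never the left operand of a shift.
    Vec2 operator <<(int i) const
    {
        return Vec2((int16_t)(x * (1 << i)), (int16_t)(y * (1 << i)));
    }

    Vec2& operator <<=(int i)
    {
        x = (int16_t)(x * (1 << i));
        y = (int16_t)(y * (1 << i));
        return *this;
    }
};
```

Before C++20, left-shifting a negative signed value is undefined behavior, so the multiplication form is the portable way to express the same scaling for in-range values.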

x265_1.5.tar.gz/source/common/param.cpp -> x265_1.6.tar.gz/source/common/param.cpp Changed


 
@@ -52,9 +52,7 @@
  */
 
 #undef strtok_r
-char* strtok_r(char *      str,
-               const char *delim,
-               char **     nextp)
+char* strtok_r(char* str, const char* delim, char** nextp)
 {
     if (!str)
         str = *nextp;
@@ -87,20 +85,19 @@
 }
 
 extern "C"
-void x265_param_free(x265_param *p)
+void x265_param_free(x265_param* p)
 {
     return x265_free(p);
 }
 
 extern "C"
-void x265_param_default(x265_param *param)
+void x265_param_default(x265_param* param)
 {
     memset(param, 0, sizeof(x265_param));
 
     /* Applying default values to all elements in the param structure */
     param->cpuid = x265::cpu_detect();
     param->bEnableWavefront = 1;
-    param->poolNumThreads = 0;
     param->frameNumThreads = 0;
 
     param->logLevel = X265_LOG_INFO;
@@ -127,8 +124,10 @@
 
     /* CU definitions */
     param->maxCUSize = 64;
+    param->minCUSize = 8;
     param->tuQTMaxInterDepth = 1;
     param->tuQTMaxIntraDepth = 1;
+    param->maxTUSize = 32;
 
     /* Coding Structure */
     param->keyframeMin = 0;
@@ -139,6 +138,7 @@
     param->bFrameAdaptive = X265_B_ADAPT_TRELLIS;
     param->bBPyramid = 1;
     param->scenecutThreshold = 40; /* Magic number pulled in from x264 */
+    param->lookaheadSlices = 0;
 
     /* Intra Coding Tools */
     param->bEnableConstrainedIntra = 0;
@@ -153,10 +153,10 @@
     param->bEnableWeightedPred = 1;
     param->bEnableWeightedBiPred = 0;
     param->bEnableEarlySkip = 0;
-    param->bEnableCbfFastMode = 0;
     param->bEnableAMP = 0;
     param->bEnableRectInter = 0;
     param->rdLevel = 3;
+    param->rdoqLevel = 0;
     param->bEnableSignHiding = 1;
     param->bEnableTransformSkip = 0;
     param->bEnableTSkipFast = 0;
@@ -175,12 +175,13 @@
     param->crQpOffset = 0;
     param->rdPenalty = 0;
     param->psyRd = 0.3;
-    param->psyRdoq = 1.0;
+    param->psyRdoq = 0.0;
     param->analysisMode = 0;
     param->analysisFileName = NULL;
     param->bIntraInBFrames = 0;
     param->bLossless = 0;
     param->bCULossless = 0;
+    param->bEnableTemporalSubLayers = 0;
 
     /* Rate control options */
     param->rc.vbvMaxBitrate = 0;
@@ -232,7 +233,7 @@
 }
 
 extern "C"
-int x265_param_default_preset(x265_param *param, const char *preset, const char *tune)
+int x265_param_default_preset(x265_param* param, const char* preset, const char* tune)
 {
     x265_param_default(param);
 
@@ -245,10 +246,11 @@
 
         if (!strcmp(preset, "ultrafast"))
         {
-            param->lookaheadDepth = 10;
+            param->lookaheadDepth = 5;
             param->scenecutThreshold = 0; // disable lookahead
             param->maxCUSize = 32;
-            param->searchRange = 25;
+            param->minCUSize = 16;
+            param->bframes = 3;
             param->bFrameAdaptive = 0;
             param->subpelRefine = 0;
             param->searchMethod = X265_DIA_SEARCH;
@@ -267,7 +269,7 @@
         {
             param->lookaheadDepth = 10;
             param->maxCUSize = 32;
-            param->searchRange = 44;
+            param->bframes = 3;
             param->bFrameAdaptive = 0;
             param->subpelRefine = 1;
             param->bEnableEarlySkip = 1;
@@ -319,6 +321,8 @@
             param->bEnableRectInter = 1;
             param->lookaheadDepth = 25;
             param->rdLevel = 4;
+            param->rdoqLevel = 2;
+            param->psyRdoq = 1.0;
             param->subpelRefine = 3;
             param->maxNumMergeCand = 3;
             param->searchMethod = X265_STAR_SEARCH;
@@ -333,6 +337,8 @@
             param->tuQTMaxInterDepth = 2;
             param->tuQTMaxIntraDepth = 2;
             param->rdLevel = 6;
+            param->rdoqLevel = 2;
+            param->psyRdoq = 1.0;
             param->subpelRefine = 3;
             param->maxNumMergeCand = 3;
             param->searchMethod = X265_STAR_SEARCH;
@@ -348,6 +354,8 @@
             param->tuQTMaxInterDepth = 3;
             param->tuQTMaxIntraDepth = 3;
             param->rdLevel = 6;
+            param->rdoqLevel = 2;
+            param->psyRdoq = 1.0;
             param->subpelRefine = 4;
             param->maxNumMergeCand = 4;
             param->searchMethod = X265_STAR_SEARCH;
@@ -365,6 +373,8 @@
             param->tuQTMaxInterDepth = 4;
             param->tuQTMaxIntraDepth = 4;
             param->rdLevel = 6;
+            param->rdoqLevel = 2;
+            param->psyRdoq = 1.0;
             param->subpelRefine = 5;
             param->maxNumMergeCand = 5;
             param->searchMethod = X265_STAR_SEARCH;
@@ -415,11 +425,11 @@
             param->deblockingFilterBetaOffset = -2;
             param->deblockingFilterTCOffset = -2;
             param->bIntraInBFrames = 0;
+            param->rdoqLevel = 1;
             param->psyRdoq = 30;
             param->psyRd = 0.5;
             param->rc.ipFactor = 1.1;
             param->rc.pbFactor = 1.1;
-            param->rc.aqMode = X265_AQ_VARIANCE;
             param->rc.aqStrength = 0.3;
             param->rc.qCompress = 0.8;
         }
@@ -430,7 +440,7 @@
     return 0;
 }
 
-static int x265_atobool(const char *str, bool& bError)
+static int x265_atobool(const char* str, bool& bError)
 {
     if (!strcmp(str, "1") ||
         !strcmp(str, "true") ||
@@ -444,7 +454,7 @@
     return 0;
 }
 
-static double x265_atof(const char *str, bool& bError)
+static double x265_atof(const char* str, bool& bError)
 {
     char *end;
     double v = strtod(str, &end);
@@ -454,7 +464,7 @@
     return v;
 }
 
-static int parseName(const char *arg, const char * const * names, bool& bError)
+static int parseName(const char* arg, const char* const* names, bool& bError)
 {
     for (int i = 0; names[i]; i++)
         if (!strcmp(arg, names[i]))
@@ -471,7 +481,7 @@
 #define atobool(str) (bNameWasBool = true, x265_atobool(str, bError))
 
 extern "C"
-int x265_param_parse(x265_param *p, const char *name, const char *value)
+int x265_param_parse(x265_param* p, const char* name, const char* value)
 {
     bool bError = false;
     bool bNameWasBool = false;
@@ -543,7 +553,6 @@
             }
         }

x265_1.5.tar.gz/source/common/picyuv.cpp -> x265_1.6.tar.gz/source/common/picyuv.cpp Changed

@@ -84,7 +84,7 @@
  * allocated by the same encoder. */
 bool PicYuv::createOffsets(const SPS& sps)
 {
-    uint32_t numPartitions = 1 << (g_maxFullDepth * 2);
+    uint32_t numPartitions = 1 << (g_unitSizeDepth * 2);
     CHECKED_MALLOC(m_cuOffsetY, intptr_t, sps.numCuInWidth * sps.numCuInHeight);
     CHECKED_MALLOC(m_cuOffsetC, intptr_t, sps.numCuInWidth * sps.numCuInHeight);
     for (uint32_t cuRow = 0; cuRow < sps.numCuInHeight; cuRow++)
@@ -176,9 +176,7 @@
         for (int r = 0; r < height; r++)
         {
             for (int c = 0; c < width; c++)
-            {
                 yPixel[c] = (pixel)yChar[c];
-            }
 
             yPixel += m_stride;
             yChar += pic.stride[0] / sizeof(*yChar);
@@ -229,9 +227,7 @@
         for (int r = 0; r < height; r++)
         {
             for (int x = 0; x < padx; x++)
-            {
                 Y[width + x] = Y[width - 1];
-            }
 
             Y += m_stride;
         }
@@ -257,9 +253,7 @@
         pixel *V = m_picOrg[2] + ((height >> m_vChromaShift) - 1) * m_strideC;
 
         for (int i = 1; i <= pady; i++)
-        {
             memcpy(Y + i * m_stride, Y, (width + padx) * sizeof(pixel));
-        }
 
         for (int j = 1; j <= pady >> m_vChromaShift; j++)
         {

 

x265_1.5.tar.gz/source/common/pixel.cpp -> x265_1.6.tar.gz/source/common/pixel.cpp Changed

@@ -428,7 +428,7 @@
 void cpy2Dto1D_shl(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift)
 {
     X265_CHECK(((intptr_t)dst & 15) == 0, "dst alignment error\n");
-    X265_CHECK((((intptr_t)src | srcStride) & 15) == 0 || size == 4, "src alignment error\n");
+    X265_CHECK((((intptr_t)src | (srcStride * sizeof(*src))) & 15) == 0 || size == 4, "src alignment error\n");
     X265_CHECK(shift >= 0, "invalid shift\n");
 
     for (int i = 0; i < size; i++)
@@ -445,7 +445,7 @@
 void cpy2Dto1D_shr(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift)
 {
     X265_CHECK(((intptr_t)dst & 15) == 0, "dst alignment error\n");
-    X265_CHECK((((intptr_t)src | srcStride) & 15) == 0 || size == 4, "src alignment error\n");
+    X265_CHECK((((intptr_t)src | (srcStride * sizeof(*src))) & 15) == 0 || size == 4, "src alignment error\n");
     X265_CHECK(shift > 0, "invalid shift\n");
 
     int16_t round = 1 << (shift - 1);
@@ -462,7 +462,7 @@
 template<int size>
 void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)
 {
-    X265_CHECK((((intptr_t)dst | dstStride) & 15) == 0 || size == 4, "dst alignment error\n");
+    X265_CHECK((((intptr_t)dst | (dstStride * sizeof(*dst))) & 15) == 0 || size == 4, "dst alignment error\n");
     X265_CHECK(((intptr_t)src & 15) == 0, "src alignment error\n");
     X265_CHECK(shift >= 0, "invalid shift\n");
 
@@ -479,7 +479,7 @@
 template<int size>
 void cpy1Dto2D_shr(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift)
 {
-    X265_CHECK((((intptr_t)dst | dstStride) & 15) == 0 || size == 4, "dst alignment error\n");
+    X265_CHECK((((intptr_t)dst | (dstStride * sizeof(*dst))) & 15) == 0 || size == 4, "dst alignment error\n");
     X265_CHECK(((intptr_t)src & 15) == 0, "src alignment error\n");
     X265_CHECK(shift > 0, "invalid shift\n");
 
@@ -522,12 +522,10 @@
 
 #if CHECKED_BUILD || _DEBUG
     const int correction = (IF_INTERNAL_PREC - X265_DEPTH);
-#endif
-
     X265_CHECK(!((w0 << 6) > 32767), "w0 using more than 16 bits, asm output will mismatch\n");
     X265_CHECK(!(round > 32767), "round using more than 16 bits, asm output will mismatch\n");
     X265_CHECK((shift >= correction), "shift must be include factor correction, please update ASM ABI\n");
-    X265_CHECK(!(round & ((1 << correction) - 1)), "round must be include factor correction, please update ASM ABI\n");
+#endif
 
     for (y = 0; y <= height - 1; y++)
     {
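
The four `cpy*` hunks above change the alignment assertions so that the stride, which is counted in `int16_t` elements, is converted to bytes before it is OR-ed with the pointer and masked with 15. A minimal standalone sketch of that check follows. `isAligned16` is a hypothetical helper name used only for illustration; it is not part of x265.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an x265 function): returns true when both the
// base pointer and the row pitch in bytes are multiples of 16.
// `strideElems` is counted in int16_t elements, as in the cpy* primitives.
static bool isAligned16(const int16_t* ptr, intptr_t strideElems)
{
    intptr_t strideBytes = strideElems * (intptr_t)sizeof(*ptr);
    return (((intptr_t)ptr | strideBytes) & 15) == 0;
}
```

With the old form of the check, a stride of 8 elements (16 bytes) would have been flagged as misaligned because `8 & 15` is nonzero, even though each row starts on a 16-byte boundary.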

 

x265_1.5.tar.gz/source/common/predict.cpp -> x265_1.6.tar.gz/source/common/predict.cpp Changed

@@ -34,11 +34,23 @@
 #pragma warning(disable: 4127) // conditional expression is constant
 #endif
 
+PredictionUnit::PredictionUnit(const CUData& cu, const CUGeom& cuGeom, int puIdx)
+{
+    /* address of CTU */
+    ctuAddr = cu.m_cuAddr;
+
+    /* offset of CU */
+    cuAbsPartIdx = cuGeom.absPartIdx;
+
+    /* offset and dimensions of PU */
+    cu.getPartIndexAndSize(puIdx, puAbsPartIdx, width, height);
+}
+
 namespace
 {
 inline pixel weightBidir(int w0, int16_t P0, int w1, int16_t P1, int round, int shift, int offset)
 {
-    return x265_clip((w0 * (P0 + IF_INTERNAL_OFFS) + w1 * (P1 + IF_INTERNAL_OFFS) + round + (offset << (shift - 1))) >> shift);
+    return x265_clip((w0 * (P0 + IF_INTERNAL_OFFS) + w1 * (P1 + IF_INTERNAL_OFFS) + round + (offset * (1 << (shift - 1)))) >> shift);
 }
 }
 
@@ -67,82 +79,24 @@
     return false;
 }
 
-void Predict::predIntraLumaAng(uint32_t dirMode, pixel* dst, intptr_t stride, uint32_t log2TrSize)
-{
-    int sizeIdx = log2TrSize - 2;
-    int tuSize = 1 << log2TrSize;
-    int filter = !!(g_intraFilterFlags[dirMode] & tuSize);
-    X265_CHECK(sizeIdx >= 0 && sizeIdx < 4, "intra block size is out of range\n");
-
-    bool bFilter = log2TrSize <= 4;
-    primitives.cu[sizeIdx].intra_pred[dirMode](dst, stride, intraNeighbourBuf[filter], dirMode, bFilter);
-}
-
-void Predict::predIntraChromaAng(uint32_t dirMode, pixel* dst, intptr_t stride, uint32_t log2TrSizeC, int chFmt)
-{
-    int tuSize = 1 << log2TrSizeC;
-    int tuSize2 = tuSize << 1;
-
-    pixel* srcBuf = intraNeighbourBuf[0];
-
-    if (chFmt == X265_CSP_I444 && (g_intraFilterFlags[dirMode] & tuSize))
-    {
-        pixel* fltBuf = intraNeighbourBuf[1];
-        pixel topLeft = srcBuf[0], topLast = srcBuf[tuSize2], leftLast = srcBuf[tuSize2 + tuSize2];
-
-        // filtering top
-        for (int i = 1; i < tuSize2; i++)
-            fltBuf[i] = ((srcBuf[i] << 1) + srcBuf[i - 1] + srcBuf[i + 1] + 2) >> 2;
-        fltBuf[tuSize2] = topLast;
-
-        // filtering top-left
-        fltBuf[0] = ((srcBuf[0] << 1) + srcBuf[1] + srcBuf[tuSize2 + 1] + 2) >> 2;
-
-        //filtering left
-        fltBuf[tuSize2 + 1] = ((srcBuf[tuSize2 + 1] << 1) + topLeft + srcBuf[tuSize2 + 2] + 2) >> 2;
-        for (int i = tuSize2 + 2; i < tuSize2 + tuSize2; i++)
-            fltBuf[i] = ((srcBuf[i] << 1) + srcBuf[i - 1] + srcBuf[i + 1] + 2) >> 2;
-        fltBuf[tuSize2 + tuSize2] = leftLast;
-
-        srcBuf = intraNeighbourBuf[1];
-    }
-
-    int sizeIdx = log2TrSizeC - 2;
-    X265_CHECK(sizeIdx >= 0 && sizeIdx < 4, "intra block size is out of range\n");
-    primitives.cu[sizeIdx].intra_pred[dirMode](dst, stride, srcBuf, dirMode, 0);
-}
-
-void Predict::initMotionCompensation(const CUData& cu, const CUGeom& cuGeom, int partIdx)
+void Predict::motionCompensation(const CUData& cu, const PredictionUnit& pu, Yuv& predYuv, bool bLuma, bool bChroma)
 {
-    m_predSlice = cu.m_slice;
-    cu.getPartIndexAndSize(partIdx, m_puAbsPartIdx, m_puWidth, m_puHeight);
-    m_ctuAddr = cu.m_cuAddr;
-    m_cuAbsPartIdx = cuGeom.encodeIdx;
-}
-
-void Predict::prepMotionCompensation(const CUData& cu, const CUGeom& cuGeom, int partIdx)
-{
-    initMotionCompensation(cu, cuGeom, partIdx);
-
-    m_refIdx0      = cu.m_refIdx[0][m_puAbsPartIdx];
-    m_clippedMv[0] = cu.m_mv[0][m_puAbsPartIdx];
-    m_refIdx1      = cu.m_refIdx[1][m_puAbsPartIdx];
-    m_clippedMv[1] = cu.m_mv[1][m_puAbsPartIdx];
-    cu.clipMv(m_clippedMv[0]);
-    cu.clipMv(m_clippedMv[1]);
-}
+    int refIdx0 = cu.m_refIdx[0][pu.puAbsPartIdx];
+    int refIdx1 = cu.m_refIdx[1][pu.puAbsPartIdx];
 
-void Predict::motionCompensation(Yuv& predYuv, bool bLuma, bool bChroma)
-{
-    if (m_predSlice->isInterP())
+    if (cu.m_slice->isInterP())
     {
         /* P Slice */
         WeightValues wv0[3];
-        X265_CHECK(m_refIdx0 >= 0, "invalid P refidx\n");
-        X265_CHECK(m_refIdx0 < m_predSlice->m_numRefIdx[0], "P refidx out of range\n");
-        const WeightParam *wp0 = m_predSlice->m_weightPredTable[0][m_refIdx0];
 
-        if (m_predSlice->m_pps->bUseWeightPred && wp0->bPresentFlag)
+        X265_CHECK(refIdx0 >= 0, "invalid P refidx\n");
+        X265_CHECK(refIdx0 < cu.m_slice->m_numRefIdx[0], "P refidx out of range\n");
+        const WeightParam *wp0 = cu.m_slice->m_weightPredTable[0][refIdx0];
+
+        MV mv0 = cu.m_mv[0][pu.puAbsPartIdx];
+        cu.clipMv(mv0);
+
+        if (cu.m_slice->m_pps->bUseWeightPred && wp0->bPresentFlag)
         {
             for (int plane = 0; plane < 3; plane++)
             {
@@ -155,18 +109,18 @@
             ShortYuv& shortYuv = m_predShortYuv[0];
 
             if (bLuma)
-                predInterLumaShort(shortYuv, *m_predSlice->m_refPicList[0][m_refIdx0]->m_reconPic, m_clippedMv[0]);
+                predInterLumaShort(pu, shortYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
             if (bChroma)
-                predInterChromaShort(shortYuv, *m_predSlice->m_refPicList[0][m_refIdx0]->m_reconPic, m_clippedMv[0]);
+                predInterChromaShort(pu, shortYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
 
-            addWeightUni(predYuv, shortYuv, wv0, bLuma, bChroma);
+            addWeightUni(pu, predYuv, shortYuv, wv0, bLuma, bChroma);
         }
         else
         {
             if (bLuma)
-                predInterLumaPixel(predYuv, *m_predSlice->m_refPicList[0][m_refIdx0]->m_reconPic, m_clippedMv[0]);
+                predInterLumaPixel(pu, predYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
             if (bChroma)
-                predInterChromaPixel(predYuv, *m_predSlice->m_refPicList[0][m_refIdx0]->m_reconPic, m_clippedMv[0]);
+                predInterChromaPixel(pu, predYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
         }
     }
     else
@@ -176,10 +130,13 @@
         WeightValues wv0[3], wv1[3];
         const WeightParam *pwp0, *pwp1;
 
-        if (m_predSlice->m_pps->bUseWeightedBiPred)
+        X265_CHECK(refIdx0 < cu.m_slice->m_numRefIdx[0], "bidir refidx0 out of range\n");
+        X265_CHECK(refIdx1 < cu.m_slice->m_numRefIdx[1], "bidir refidx1 out of range\n");
+
+        if (cu.m_slice->m_pps->bUseWeightedBiPred)
         {
-            pwp0 = m_refIdx0 >= 0 ? m_predSlice->m_weightPredTable[0][m_refIdx0] : NULL;
-            pwp1 = m_refIdx1 >= 0 ? m_predSlice->m_weightPredTable[1][m_refIdx1] : NULL;
+            pwp0 = refIdx0 >= 0 ? cu.m_slice->m_weightPredTable[0][refIdx0] : NULL;
+            pwp1 = refIdx1 >= 0 ? cu.m_slice->m_weightPredTable[1][refIdx1] : NULL;
 
             if (pwp0 && pwp1 && (pwp0->bPresentFlag || pwp1->bPresentFlag))
             {
@@ -200,7 +157,7 @@
             else
             {
                 /* uniprediction weighting, always outputs to wv0 */
-                const WeightParam* pwp = (m_refIdx0 >= 0) ? pwp0 : pwp1;
+                const WeightParam* pwp = (refIdx0 >= 0) ? pwp0 : pwp1;
                 for (int plane = 0; plane < 3; plane++)
                 {
                     wv0[plane].w = pwp[plane].inputWeight;
@@ -213,89 +170,92 @@
         else
             pwp0 = pwp1 = NULL;
 
-        if (m_refIdx0 >= 0 && m_refIdx1 >= 0)
+        if (refIdx0 >= 0 && refIdx1 >= 0)
         {
-            /* Biprediction */
-            X265_CHECK(m_refIdx0 < m_predSlice->m_numRefIdx[0], "bidir refidx0 out of range\n");
-            X265_CHECK(m_refIdx1 < m_predSlice->m_numRefIdx[1], "bidir refidx1 out of range\n");
+            MV mv0 = cu.m_mv[0][pu.puAbsPartIdx];
+            MV mv1 = cu.m_mv[1][pu.puAbsPartIdx];
+            cu.clipMv(mv0);
+            cu.clipMv(mv1);
 
             if (bLuma)
             {
-                predInterLumaShort(m_predShortYuv[0], *m_predSlice->m_refPicList[0][m_refIdx0]->m_reconPic, m_clippedMv[0]);
-                predInterLumaShort(m_predShortYuv[1], *m_predSlice->m_refPicList[1][m_refIdx1]->m_reconPic, m_clippedMv[1]);
+                predInterLumaShort(pu, m_predShortYuv[0], *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
+                predInterLumaShort(pu, m_predShortYuv[1], *cu.m_slice->m_refPicList[1][refIdx1]->m_reconPic, mv1);
             }
             if (bChroma)
             {
-                predInterChromaShort(m_predShortYuv[0], *m_predSlice->m_refPicList[0][m_refIdx0]->m_reconPic, m_clippedMv[0]);
-                predInterChromaShort(m_predShortYuv[1], *m_predSlice->m_refPicList[1][m_refIdx1]->m_reconPic, m_clippedMv[1]);
+                predInterChromaShort(pu, m_predShortYuv[0], *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
+                predInterChromaShort(pu, m_predShortYuv[1], *cu.m_slice->m_refPicList[1][refIdx1]->m_reconPic, mv1);
             }
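
The `weightBidir` hunk at the top of this file replaces `offset << (shift - 1)` with `offset * (1 << (shift - 1))`. A likely motivation (an assumption; the diff does not state it) is that left-shifting a negative signed integer is undefined behavior in C++ before C++20, whereas the multiplication is well defined as long as it does not overflow. The sketch below isolates that one term. `scaleOffset` is a hypothetical name, not an x265 function.

```cpp
// Hypothetical helper: scales a possibly negative weighted-prediction
// offset by 2^(shift - 1) without left-shifting a negative value.
// Assumes shift >= 1 and that the product fits in an int.
static int scaleOffset(int offset, int shift)
{
    return offset * (1 << (shift - 1));
}
```

For non-negative offsets the two forms give the same value, so the change should only matter for negative offsets.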

 

x265_1.5.tar.gz/source/common/predict.h -> x265_1.6.tar.gz/source/common/predict.h Changed

@@ -36,6 +36,17 @@
 class Slice;
 struct CUGeom;
 
+struct PredictionUnit
+{
+    uint32_t     ctuAddr;      // raster index of current CTU within its picture
+    uint32_t     cuAbsPartIdx; // z-order offset of current CU within its CTU
+    uint32_t     puAbsPartIdx; // z-order offset of current PU with its CU
+    int          width;
+    int          height;
+
+    PredictionUnit(const CUData& cu, const CUGeom& cuGeom, int puIdx);
+};
+
 class Predict
 {
 public:
@@ -56,7 +67,7 @@
         int      leftUnits;
         int      unitWidth;
         int      unitHeight;
-        int      tuSize;
+        int      log2TrSize;
         bool     bNeighborFlags[4 * MAX_NUM_SPU_W + 1];
     };
 
@@ -65,38 +76,34 @@
 
     // Unfiltered/filtered neighbours of the current partition.
     pixel     intraNeighbourBuf[2][258];
+
     /* Slice information */
-    const Slice* m_predSlice;
     int       m_csp;
     int       m_hChromaShift;
     int       m_vChromaShift;
 
-    /* cached CU information for prediction */
-    uint32_t  m_ctuAddr;      // raster index of current CTU within its picture
-    uint32_t  m_cuAbsPartIdx; // z-order index of current CU within its CTU
-    uint32_t  m_puAbsPartIdx; // z-order index of current PU with its CU
-    int       m_puWidth;
-    int       m_puHeight;
-    int       m_refIdx0;
-    int       m_refIdx1;
-
-    /* TODO: Need to investigate clipping while writing into the TComDataCU fields itself */
-    MV        m_clippedMv[2];
-
     Predict();
     ~Predict();
 
     bool allocBuffers(int csp);
 
     // motion compensation functions
-    void predInterLumaPixel(Yuv& dstYuv, const PicYuv& refPic, const MV& mv) const;
-    void predInterChromaPixel(Yuv& dstYuv, const PicYuv& refPic, const MV& mv) const;
+    void predInterLumaPixel(const PredictionUnit& pu, Yuv& dstYuv, const PicYuv& refPic, const MV& mv) const;
+    void predInterChromaPixel(const PredictionUnit& pu, Yuv& dstYuv, const PicYuv& refPic, const MV& mv) const;
 
-    void predInterLumaShort(ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const;
-    void predInterChromaShort(ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const;
+    void predInterLumaShort(const PredictionUnit& pu, ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const;
+    void predInterChromaShort(const PredictionUnit& pu, ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const;
 
-    void addWeightBi(Yuv& predYuv, const ShortYuv& srcYuv0, const ShortYuv& srcYuv1, const WeightValues wp0[3], const WeightValues wp1[3], bool bLuma, bool bChroma) const;
-    void addWeightUni(Yuv& predYuv, const ShortYuv& srcYuv, const WeightValues wp[3], bool bLuma, bool bChroma) const;
+    void addWeightBi(const PredictionUnit& pu, Yuv& predYuv, const ShortYuv& srcYuv0, const ShortYuv& srcYuv1, const WeightValues wp0[3], const WeightValues wp1[3], bool bLuma, bool bChroma) const;
+    void addWeightUni(const PredictionUnit& pu, Yuv& predYuv, const ShortYuv& srcYuv, const WeightValues wp[3], bool bLuma, bool bChroma) const;
+
+    void motionCompensation(const CUData& cu, const PredictionUnit& pu, Yuv& predYuv, bool bLuma, bool bChroma);
+
+    /* Angular Intra */
+    void predIntraLumaAng(uint32_t dirMode, pixel* pred, intptr_t stride, uint32_t log2TrSize);
+    void predIntraChromaAng(uint32_t dirMode, pixel* pred, intptr_t stride, uint32_t log2TrSizeC);
+    void initAdiPattern(const CUData& cu, const CUGeom& cuGeom, uint32_t puAbsPartIdx, const IntraNeighbors& intraNeighbors, int dirMode);
+    void initAdiPatternChroma(const CUData& cu, const CUGeom& cuGeom, uint32_t puAbsPartIdx, const IntraNeighbors& intraNeighbors, uint32_t chromaId);
 
     /* Intra prediction helper functions */
     static void initIntraNeighbors(const CUData& cu, uint32_t absPartIdx, uint32_t tuDepth, bool isLuma, IntraNeighbors *IntraNeighbors);
@@ -111,19 +118,6 @@
     static int  isAboveRightAvailable(const CUData& cu, uint32_t partIdxRT, bool* bValidFlags, uint32_t numUnits);
     template<bool cip>
     static int  isBelowLeftAvailable(const CUData& cu, uint32_t partIdxLB, bool* bValidFlags, uint32_t numUnits);
-
-public:
-
-    /* prepMotionCompensation needs to be called to prepare MC with CU-relevant data */
-    void initMotionCompensation(const CUData& cu, const CUGeom& cuGeom, int partIdx);
-    void prepMotionCompensation(const CUData& cu, const CUGeom& cuGeom, int partIdx);
-    void motionCompensation(Yuv& predYuv, bool bLuma, bool bChroma);
-
-    /* Angular Intra */
-    void predIntraLumaAng(uint32_t dirMode, pixel* pred, intptr_t stride, uint32_t log2TrSize);
-    void predIntraChromaAng(uint32_t dirMode, pixel* pred, intptr_t stride, uint32_t log2TrSizeC, int chFmt);
-    void initAdiPattern(const CUData& cu, const CUGeom& cuGeom, uint32_t absPartIdx, const IntraNeighbors& intraNeighbors, int dirMode);
-    void initAdiPatternChroma(const CUData& cu, const CUGeom& cuGeom, uint32_t absPartIdx, const IntraNeighbors& intraNeighbors, uint32_t chromaId);
 };
 }

 

x265_1.5.tar.gz/source/common/primitives.cpp -> x265_1.6.tar.gz/source/common/primitives.cpp Changed

 
@@ -98,6 +98,7 @@
         p.chroma[X265_CSP_I444].pu[i].copy_pp = p.pu[i].copy_pp;
         p.chroma[X265_CSP_I444].pu[i].addAvg  = p.pu[i].addAvg;
         p.chroma[X265_CSP_I444].pu[i].satd    = p.pu[i].satd;
+        p.chroma[X265_CSP_I444].pu[i].chroma_p2s = p.pu[i].filter_p2s;
     }
 
     for (int i = 0; i < NUM_CU_SIZES; i++)
@@ -190,7 +191,6 @@
 
 /* cpuid >= 0 - force CPU type
  * cpuid < 0  - auto-detect if uninitialized */
-extern "C"
 void x265_setup_primitives(x265_param *param, int cpuid)
 {
     if (cpuid < 0)
@@ -257,7 +257,7 @@
 extern "C" {
 int x265_cpu_cpuid_test(void) { return 0; }
 void x265_cpu_emms(void) {}
-void x265_cpu_cpuid(uint32_t, uint32_t *, uint32_t *, uint32_t *, uint32_t *) {}
+void x265_cpu_cpuid(uint32_t, uint32_t *eax, uint32_t *, uint32_t *, uint32_t *) { *eax = 0; }
 void x265_cpu_xgetbv(uint32_t, uint32_t *, uint32_t *) {}
 }
 #endif
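
Note (illustrative, not part of the upstream diff): the last hunk above makes the no-assembly x265_cpu_cpuid stub store 0 through its eax pointer instead of leaving the output untouched. One plausible reason, stated here as an assumption, is that a caller reading eax after the call would otherwise read an indeterminate value. The sketch below uses hypothetical names (cpuidStub, maxCpuidLeaf) to show the pattern; it is not the x265 CPU detection code.

```cpp
#include <cstdint>

// Hypothetical stand-in for a build without assembly support: report
// "no CPUID leaves available" by writing 0 to the first output.
void cpuidStub(uint32_t /*leaf*/, uint32_t* eax, uint32_t*, uint32_t*, uint32_t*)
{
    *eax = 0;
}

// Hypothetical caller: declares the outputs without initializing them and
// relies on the callee to fill in eax.
uint32_t maxCpuidLeaf()
{
    uint32_t eax, ebx, ecx, edx;
    cpuidStub(0, &eax, &ebx, &ecx, &edx);
    return eax; // well defined only because the stub wrote to *eax
}
```

With an empty stub body, the return value of maxCpuidLeaf() would be indeterminate; with the store in place it is always 0.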
​

x265_1.5.tar.gz/source/common/primitives.h -> x265_1.6.tar.gz/source/common/primitives.h Changed

@@ -119,6 +119,7 @@
 
 typedef void (*intra_pred_t)(pixel* dst, intptr_t dstStride, const pixel *srcPix, int dirMode, int bFilter);
 typedef void (*intra_allangs_t)(pixel *dst, pixel *refPix, pixel *filtPix, int bLuma);
+typedef void (*intra_filter_t)(const pixel* references, pixel* filtered);
 
 typedef void (*cpy2Dto1D_shl_t)(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
 typedef void (*cpy2Dto1D_shr_t)(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
@@ -136,8 +137,7 @@
 typedef uint32_t (*nquant_t)(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff);
 typedef void (*dequant_scaling_t)(const int16_t* src, const int32_t* dequantCoef, int16_t* dst, int num, int mcqp_miper, int shift);
 typedef void (*dequant_normal_t)(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift);
-typedef int  (*count_nonzero_t)(const int16_t* quantCoeff, int numCoeff);
-
+typedef int(*count_nonzero_t)(const int16_t* quantCoeff);
 typedef void (*weightp_pp_t)(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
 typedef void (*weightp_sp_t)(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
 typedef void (*scale_t)(pixel* dst, const pixel* src, intptr_t stride);
@@ -155,7 +155,8 @@
 typedef void (*filter_sp_t) (const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
 typedef void (*filter_ss_t) (const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
 typedef void (*filter_hv_pp_t) (const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
+typedef void (*filter_p2s_wxh_t)(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
+typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst);
 
 typedef void (*copy_pp_t)(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); // dst is aligned
 typedef void (*copy_sp_t)(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
@@ -178,6 +179,8 @@
 
 typedef void (*cutree_propagate_cost) (int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len);
 
+typedef int (*findPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig);
+
 /* Function pointers to optimized encoder primitives. Each pointer can reference
  * either an assembly routine, a SIMD intrinsic primitive, or a C function */
 struct EncoderPrimitives
@@ -207,6 +210,7 @@
         addAvg_t       addAvg;      // bidir motion compensation, uses 16bit values
 
         copy_pp_t      copy_pp;
+        filter_p2s_t   filter_p2s;
     }
     pu[NUM_PU_SIZES];
 
@@ -225,7 +229,7 @@
         pixel_add_ps_t  add_ps;
         blockfill_s_t   blockfill_s;   // block fill, for DC transforms
         copy_cnt_t      copy_cnt;      // copy coeff while counting non-zero
-
+        count_nonzero_t count_nonzero;
         cpy2Dto1D_shl_t cpy2Dto1D_shl;
         cpy2Dto1D_shr_t cpy2Dto1D_shr;
         cpy1Dto2D_shl_t cpy1Dto2D_shl;
@@ -246,6 +250,7 @@
 
         transpose_t     transpose;     // transpose pixel block; for use with intra all-angs
         intra_allangs_t intra_pred_allangs;
+        intra_filter_t  intra_filter;
         intra_pred_t    intra_pred[NUM_INTRA_MODE];
     }
     cu[NUM_CU_SIZES];
@@ -260,9 +265,7 @@
     nquant_t              nquant;
     dequant_scaling_t     dequant_scaling;
     dequant_normal_t      dequant_normal;
-    count_nonzero_t       count_nonzero;
     denoiseDct_t          denoiseDct;
-
     scale_t               scale1D_128to64;
     scale_t               scale2D_64to32;
 
@@ -286,7 +289,9 @@
     weightp_sp_t          weight_sp;
     weightp_pp_t          weight_pp;
 
-    filter_p2s_t          luma_p2s;
+    filter_p2s_wxh_t      luma_p2s;
+
+    findPosLast_t         findPosLast;
 
     /* There is one set of chroma primitives per color space. An encoder will
      * have just a single color space and thus it will only ever use one entry
@@ -311,6 +316,8 @@
             filter_hps_t filter_hps;
             addAvg_t     addAvg;
             copy_pp_t    copy_pp;
+            filter_p2s_t chroma_p2s;
+
         }
         pu[NUM_PU_SIZES];
 
@@ -329,7 +336,7 @@
         }
         cu[NUM_CU_SIZES];
 
-        filter_p2s_t p2s; // takes width/height as arguments
+        filter_p2s_wxh_t p2s; // takes width/height as arguments
     }
     chroma[X265_CSP_COUNT];
 };
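
Note (illustrative, not part of the upstream diff): the hunks above move count_nonzero from a single global primitive that takes numCoeff into the per-size cu[] table, so each entry can know its block size at compile time. A minimal sketch of that pattern follows. The template count_nonzero_c, the table countNonzeroTable, and the wrapper countNonzero are hypothetical names chosen for this example and are not claimed to match the x265 sources.

```cpp
#include <cassert>
#include <cstdint>

// Same shape as the new typedef in the diff: no coefficient-count argument.
typedef int (*count_nonzero_t)(const int16_t* quantCoeff);

// Illustrative C reference: the block width is a template parameter, so the
// loop bound is a compile-time constant the compiler can unroll or vectorize.
template<int trSize>
int count_nonzero_c(const int16_t* quantCoeff)
{
    int count = 0;
    for (int i = 0; i < trSize * trSize; i++)
        count += quantCoeff[i] != 0;
    return count;
}

// Hypothetical table indexed by log2TrSize - 2 (4x4, 8x8, 16x16, 32x32).
static const count_nonzero_t countNonzeroTable[4] =
{
    count_nonzero_c<4>, count_nonzero_c<8>, count_nonzero_c<16>, count_nonzero_c<32>
};

// Hypothetical wrapper mirroring the call shape used in quant.cpp:
// primitives.cu[log2TrSize - 2].count_nonzero(coeff)
int countNonzero(const int16_t* coeff, int log2TrSize)
{
    assert(log2TrSize >= 2 && log2TrSize <= 5);
    return countNonzeroTable[log2TrSize - 2](coeff);
}
```

Calling countNonzero(coeff, 3) on an 8x8 block scans exactly 64 coefficients, with the size coming from the table slot rather than from a runtime argument.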

 

x265_1.5.tar.gz/source/common/quant.cpp -> x265_1.6.tar.gz/source/common/quant.cpp Changed

@@ -50,7 +50,7 @@
     return y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y)
 }
 
-inline int getICRate(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, uint32_t absGoRice, uint32_t c1c2Idx)
+inline int getICRate(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, const uint32_t absGoRice, const uint32_t maxVlc, uint32_t c1c2Idx)
 {
     X265_CHECK(c1c2Idx <= 3, "c1c2Idx check failure\n");
     X265_CHECK(absGoRice <= 4, "absGoRice check failure\n");
@@ -72,7 +72,6 @@
     else
     {
         uint32_t symbol = diffLevel;
-        const uint32_t maxVlc = g_goRiceRange[absGoRice];
         bool expGolomb = (symbol > maxVlc);
 
         if (expGolomb)
@@ -105,6 +104,41 @@
     return rate;
 }
 
+#if CHECKED_BUILD || _DEBUG
+inline int getICRateNegDiff(uint32_t absLevel, const int* greaterOneBits, const int* levelAbsBits)
+{
+    X265_CHECK(absLevel <= 2, "absLevel check failure\n");
+
+    int rate;
+    if (absLevel == 0)
+        rate = 0;
+    else if (absLevel == 2)
+        rate = greaterOneBits[1] + levelAbsBits[0];
+    else
+        rate = greaterOneBits[0];
+    return rate;
+}
+#endif
+
+inline int getICRateLessVlc(uint32_t absLevel, int32_t diffLevel, const uint32_t absGoRice)
+{
+    X265_CHECK(absGoRice <= 4, "absGoRice check failure\n");
+    if (!absLevel)
+    {
+        X265_CHECK(diffLevel < 0, "diffLevel check failure\n");
+        return 0;
+    }
+    int rate;
+
+    uint32_t symbol = diffLevel;
+    uint32_t prefLen = (symbol >> absGoRice) + 1;
+    uint32_t numBins = fastMin(prefLen + absGoRice, 8 /* g_goRicePrefixLen[absGoRice] + absGoRice */);
+
+    rate = numBins << 15;
+
+    return rate;
+}
+
 /* Calculates the cost for specific absolute transform level */
 inline uint32_t getICRateCost(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, uint32_t absGoRice, uint32_t c1c2Idx)
 {
@@ -160,12 +194,12 @@
     m_nr           = NULL;
 }
 
-bool Quant::init(bool useRDOQ, double psyScale, const ScalingList& scalingList, Entropy& entropy)
+bool Quant::init(int rdoqLevel, double psyScale, const ScalingList& scalingList, Entropy& entropy)
 {
     m_entropyCoder = &entropy;
-    m_useRDOQ = useRDOQ;
+    m_rdoqLevel    = rdoqLevel;
     m_psyRdoqScale = (int64_t)(psyScale * 256.0);
-    m_scalingList = &scalingList;
+    m_scalingList  = &scalingList;
     m_resiDctCoeff = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE * 2);
     m_fencDctCoeff = m_resiDctCoeff + (MAX_TR_SIZE * MAX_TR_SIZE);
     m_fencShortBuf = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE);
@@ -382,13 +416,13 @@
         }
     }
 
-    if (m_useRDOQ)
+    if (m_rdoqLevel)
         return rdoQuant(cu, coeff, log2TrSize, ttype, absPartIdx, usePsy);
     else
     {
         int deltaU[32 * 32];
 
-        int scalingListType = ttype + (isLuma ? 3 : 0);
+        int scalingListType = (cu.isIntra(absPartIdx) ? 0 : 3) + ttype;
         int rem = m_qpParam[ttype].rem;
         int per = m_qpParam[ttype].per;
         const int32_t* quantCoeff = m_scalingList->m_quantCoef[log2TrSize - 2][scalingListType][rem];
@@ -454,9 +488,7 @@
     else
     {
         int useDST = !sizeIdx && ttype == TEXT_LUMA && bIntra;
-
-        X265_CHECK((int)numSig == primitives.count_nonzero(coeff, 1 << (log2TrSize * 2)), "numSig differ\n");
-
+        X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(coeff), "numSig differ\n");
         // DC only
         if (numSig == 1 && coeff[0] != 0 && !useDST)
         {
@@ -493,13 +525,10 @@
     const int32_t* qCoef = m_scalingList->m_quantCoef[log2TrSize - 2][scalingListType][rem];
 
     int numCoeff = 1 << (log2TrSize * 2);
-
     uint32_t numSig = primitives.nquant(m_resiDctCoeff, qCoef, dstCoeff, qbits, add, numCoeff);
-
-    X265_CHECK((int)numSig == primitives.count_nonzero(dstCoeff, 1 << (log2TrSize * 2)), "numSig differ\n");
+    X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(dstCoeff), "numSig differ\n");
     if (!numSig)
         return 0;
-
     uint32_t trSize = 1 << log2TrSize;
     int64_t lambda2 = m_qpParam[ttype].lambda2;
     int64_t psyScale = (m_psyRdoqScale * m_qpParam[ttype].lambda);
@@ -674,9 +703,43 @@
                 /* record costs for sign-hiding performed at the end */
                 if (level)
                 {
-                    int rateNow = getICRate(level, level - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx);
-                    rateIncUp[blkPos] = getICRate(level + 1, level + 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) - rateNow;
-                    rateIncDown[blkPos] = getICRate(level - 1, level - 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) - rateNow;
+                    const int32_t diff0 = level - 1 - baseLevel;
+                    const int32_t diff2 = level + 1 - baseLevel;
+                    const int32_t maxVlc = g_goRiceRange[goRiceParam];
+                    int rate0, rate1, rate2;
+
+                    if (diff0 < -2)  // prob (92.9, 86.5, 74.5)%
+                    {
+                        // NOTE: Min: L - 1 - {1,2,1,3} < -2 ==> L < {0,1,0,2}
+                        //            additional L > 0, so I got (L > 0 && L < 2) ==> L = 1
+                        X265_CHECK(level == 1, "absLevel check failure\n");
+
+                        const int rateEqual2 = greaterOneBits[1] + levelAbsBits[0];;
+                        const int rateNotEqual2 = greaterOneBits[0];
+
+                        rate0 = 0;
+                        rate2 = rateEqual2;
+                        rate1 = rateNotEqual2;
+
+                        X265_CHECK(rate1 == getICRateNegDiff(level + 0, greaterOneBits, levelAbsBits), "rate1 check failure!\n");
+                        X265_CHECK(rate2 == getICRateNegDiff(level + 1, greaterOneBits, levelAbsBits), "rate1 check failure!\n");
+                        X265_CHECK(rate0 == getICRateNegDiff(level - 1, greaterOneBits, levelAbsBits), "rate1 check failure!\n");
+                    }
+                    else if (diff0 >= 0 && diff2 <= maxVlc)     // prob except from above path (98.6, 97.9, 96.9)%
+                    {
+                        // NOTE: no c1c2 correct rate since all of rate include this factor
+                        rate1 = getICRateLessVlc(level + 0, diff0 + 1, goRiceParam);
+                        rate2 = getICRateLessVlc(level + 1, diff0 + 2, goRiceParam);
+                        rate0 = getICRateLessVlc(level - 1, diff0 + 0, goRiceParam);
+                    }
+                    else
+                    {
+                        rate1 = getICRate(level + 0, diff0 + 1, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Idx);
+                        rate2 = getICRate(level + 1, diff0 + 2, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Idx);
+                        rate0 = getICRate(level - 1, diff0 + 0, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Idx);
+                    }
+                    rateIncUp[blkPos] = rate2 - rate1;
+                    rateIncDown[blkPos] = rate0 - rate1;
                 }
                 else
                 {
@@ -762,7 +825,7 @@
             costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[sigCtx][1]);
             totalRdCost += costCoeffGroupSig[cgScanPos];  /* add the cost of 1 bit in significant CG bitmap */
 
-            if (costZeroCG < totalRdCost)
+            if (costZeroCG < totalRdCost && m_rdoqLevel > 1)
             {
                 sigCoeffGroupFlag64 &= ~cgBlkPosMask;
                 totalRdCost = costZeroCG;
@@ -870,7 +933,7 @@
                     bestLastIdx = scanPos + 1;
                     bestCost = costAsLast;
                 }
-                if (dstCoeff[blkPos] > 1)
+                if (dstCoeff[blkPos] > 1 || m_rdoqLevel == 1)
                 {
                     foundLast = true;
                     break;
@@ -1037,7 +1100,8 @@
 
     const uint32_t trSizeCG = 1 << log2TrSizeCG;
     X265_CHECK(trSizeCG <= 8, "transform CG is too large\n");
-    const uint32_t sigPos = (uint32_t)(sigCoeffGroupFlag64 >> (1 + (cgPosY << log2TrSizeCG) + cgPosX));
+    const uint32_t shift = (cgPosY << log2TrSizeCG) + cgPosX + 1;
+    const uint32_t sigPos = (uint32_t)(shift >= 64 ? 0 : sigCoeffGroupFlag64 >> shift);
     const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1);
     const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2;
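
Note (illustrative, not part of the upstream diff): the final hunk above guards the right shift of sigCoeffGroupFlag64. In C++, shifting a 64-bit value by 64 or more is undefined behaviour, and the shift count (cgPosY << log2TrSizeCG) + cgPosX + 1 reaches 64 for the last group of an 8x8 grid of coefficient groups. The sketch below isolates that computation. The function name sigPosFor is hypothetical; the expression follows the added lines.

```cpp
#include <cstdint>

// Hypothetical helper wrapping the guarded shift from the diff. Returns the
// low 32 bits of the flag word shifted so that bit 0 is the group to the
// right of (cgPosX, cgPosY), or 0 when that position is past the last bit.
inline uint32_t sigPosFor(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX,
                          uint32_t cgPosY, uint32_t log2TrSizeCG)
{
    const uint32_t shift = (cgPosY << log2TrSizeCG) + cgPosX + 1;
    // A shift count of 64 or more is undefined for a 64-bit operand,
    // so that case is handled explicitly.
    return (uint32_t)(shift >= 64 ? 0 : sigCoeffGroupFlag64 >> shift);
}
```

For log2TrSizeCG = 3 and cgPosX = cgPosY = 7 the shift count is 56 + 7 + 1 = 64, which is the case the unguarded 1.5 expression left undefined.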

 
@@ -50,7 +50,7 @@
     return y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y)
 }
 
-inline int getICRate(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, uint32_t absGoRice, uint32_t c1c2Idx)
+inline int getICRate(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, const uint32_t absGoRice, const uint32_t maxVlc, uint32_t c1c2Idx)
 {
     X265_CHECK(c1c2Idx <= 3, "c1c2Idx check failure\n");
     X265_CHECK(absGoRice <= 4, "absGoRice check failure\n");
@@ -72,7 +72,6 @@
     else
     {
         uint32_t symbol = diffLevel;
-        const uint32_t maxVlc = g_goRiceRange[absGoRice];
         bool expGolomb = (symbol > maxVlc);
 
         if (expGolomb)
@@ -105,6 +104,41 @@
     return rate;
 }
 
+#if CHECKED_BUILD || _DEBUG
+inline int getICRateNegDiff(uint32_t absLevel, const int* greaterOneBits, const int* levelAbsBits)
+{
+    X265_CHECK(absLevel <= 2, "absLevel check failure\n");
+
+    int rate;
+    if (absLevel == 0)
+        rate = 0;
+    else if (absLevel == 2)
+        rate = greaterOneBits[1] + levelAbsBits[0];
+    else
+        rate = greaterOneBits[0];
+    return rate;
+}
+#endif
+
+inline int getICRateLessVlc(uint32_t absLevel, int32_t diffLevel, const uint32_t absGoRice)
+{
+    X265_CHECK(absGoRice <= 4, "absGoRice check failure\n");
+    if (!absLevel)
+    {
+        X265_CHECK(diffLevel < 0, "diffLevel check failure\n");
+        return 0;
+    }
+    int rate;
+
+    uint32_t symbol = diffLevel;
+    uint32_t prefLen = (symbol >> absGoRice) + 1;
+    uint32_t numBins = fastMin(prefLen + absGoRice, 8 /* g_goRicePrefixLen[absGoRice] + absGoRice */);
+
+    rate = numBins << 15;
+
+    return rate;
+}
+
 /* Calculates the cost for specific absolute transform level */
 inline uint32_t getICRateCost(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, uint32_t absGoRice, uint32_t c1c2Idx)
 {
@@ -160,12 +194,12 @@
     m_nr           = NULL;
 }
 
-bool Quant::init(bool useRDOQ, double psyScale, const ScalingList& scalingList, Entropy& entropy)
+bool Quant::init(int rdoqLevel, double psyScale, const ScalingList& scalingList, Entropy& entropy)
 {
     m_entropyCoder = &entropy;
-    m_useRDOQ = useRDOQ;
+    m_rdoqLevel    = rdoqLevel;
     m_psyRdoqScale = (int64_t)(psyScale * 256.0);
-    m_scalingList = &scalingList;
+    m_scalingList  = &scalingList;
     m_resiDctCoeff = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE * 2);
     m_fencDctCoeff = m_resiDctCoeff + (MAX_TR_SIZE * MAX_TR_SIZE);
     m_fencShortBuf = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE);
@@ -382,13 +416,13 @@
         }
     }
 
-    if (m_useRDOQ)
+    if (m_rdoqLevel)
         return rdoQuant(cu, coeff, log2TrSize, ttype, absPartIdx, usePsy);
     else
     {
         int deltaU[32 * 32];
 
-        int scalingListType = ttype + (isLuma ? 3 : 0);
+        int scalingListType = (cu.isIntra(absPartIdx) ? 0 : 3) + ttype;
         int rem = m_qpParam[ttype].rem;
         int per = m_qpParam[ttype].per;
         const int32_t* quantCoeff = m_scalingList->m_quantCoef[log2TrSize - 2][scalingListType][rem];
@@ -454,9 +488,7 @@
     else
     {
         int useDST = !sizeIdx && ttype == TEXT_LUMA && bIntra;
-
-        X265_CHECK((int)numSig == primitives.count_nonzero(coeff, 1 << (log2TrSize * 2)), "numSig differ\n");
-
+        X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(coeff), "numSig differ\n");
         // DC only
         if (numSig == 1 && coeff[0] != 0 && !useDST)
         {
@@ -493,13 +525,10 @@
     const int32_t* qCoef = m_scalingList->m_quantCoef[log2TrSize - 2][scalingListType][rem];
 
     int numCoeff = 1 << (log2TrSize * 2);
-
     uint32_t numSig = primitives.nquant(m_resiDctCoeff, qCoef, dstCoeff, qbits, add, numCoeff);
-
-    X265_CHECK((int)numSig == primitives.count_nonzero(dstCoeff, 1 << (log2TrSize * 2)), "numSig differ\n");
+    X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(dstCoeff), "numSig differ\n");
     if (!numSig)
         return 0;
-
     uint32_t trSize = 1 << log2TrSize;
     int64_t lambda2 = m_qpParam[ttype].lambda2;
     int64_t psyScale = (m_psyRdoqScale * m_qpParam[ttype].lambda);
@@ -674,9 +703,43 @@
                 /* record costs for sign-hiding performed at the end */
                 if (level)
                 {
-                    int rateNow = getICRate(level, level - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx);
-                    rateIncUp[blkPos] = getICRate(level + 1, level + 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) - rateNow;
-                    rateIncDown[blkPos] = getICRate(level - 1, level - 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) - rateNow;
+                    const int32_t diff0 = level - 1 - baseLevel;
+                    const int32_t diff2 = level + 1 - baseLevel;
+                    const int32_t maxVlc = g_goRiceRange[goRiceParam];
+                    int rate0, rate1, rate2;
+
+                    if (diff0 < -2)  // prob (92.9, 86.5, 74.5)%
+                    {
+                        // NOTE: Min: L - 1 - {1,2,1,3} < -2 ==> L < {0,1,0,2}
+                        //            additional L > 0, so I got (L > 0 && L < 2) ==> L = 1
+                        X265_CHECK(level == 1, "absLevel check failure\n");
+
+                        const int rateEqual2 = greaterOneBits[1] + levelAbsBits[0];;
+                        const int rateNotEqual2 = greaterOneBits[0];
+
+                        rate0 = 0;
+                        rate2 = rateEqual2;
+                        rate1 = rateNotEqual2;
+
+                        X265_CHECK(rate1 == getICRateNegDiff(level + 0, greaterOneBits, levelAbsBits), "rate1 check failure!\n");
+                        X265_CHECK(rate2 == getICRateNegDiff(level + 1, greaterOneBits, levelAbsBits), "rate1 check failure!\n");
+                        X265_CHECK(rate0 == getICRateNegDiff(level - 1, greaterOneBits, levelAbsBits), "rate1 check failure!\n");
+                    }
+                    else if (diff0 >= 0 && diff2 <= maxVlc)     // prob except from above path (98.6, 97.9, 96.9)%
+                    {
+                        // NOTE: no c1c2 correct rate since all of rate include this factor
+                        rate1 = getICRateLessVlc(level + 0, diff0 + 1, goRiceParam);
+                        rate2 = getICRateLessVlc(level + 1, diff0 + 2, goRiceParam);
+                        rate0 = getICRateLessVlc(level - 1, diff0 + 0, goRiceParam);
+                    }
+                    else
+                    {
+                        rate1 = getICRate(level + 0, diff0 + 1, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Idx);
+                        rate2 = getICRate(level + 1, diff0 + 2, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Idx);
+                        rate0 = getICRate(level - 1, diff0 + 0, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Idx);
+                    }
+                    rateIncUp[blkPos] = rate2 - rate1;
+                    rateIncDown[blkPos] = rate0 - rate1;
                 }
                 else
                 {
@@ -762,7 +825,7 @@
             costCoeffGroupSig[cgScanPos] = SIGCOST(estBitsSbac.significantCoeffGroupBits[sigCtx][1]);
             totalRdCost += costCoeffGroupSig[cgScanPos];  /* add the cost of 1 bit in significant CG bitmap */
 
-            if (costZeroCG < totalRdCost)
+            if (costZeroCG < totalRdCost && m_rdoqLevel > 1)
             {
                 sigCoeffGroupFlag64 &= ~cgBlkPosMask;
                 totalRdCost = costZeroCG;
@@ -870,7 +933,7 @@
                     bestLastIdx = scanPos + 1;
                     bestCost = costAsLast;
                 }
-                if (dstCoeff[blkPos] > 1)
+                if (dstCoeff[blkPos] > 1 || m_rdoqLevel == 1)
                 {
                     foundLast = true;
                     break;
@@ -1037,7 +1100,8 @@
 
     const uint32_t trSizeCG = 1 << log2TrSizeCG;
     X265_CHECK(trSizeCG <= 8, "transform CG is too large\n");
-    const uint32_t sigPos = (uint32_t)(sigCoeffGroupFlag64 >> (1 + (cgPosY << log2TrSizeCG) + cgPosX));
+    const uint32_t shift = (cgPosY << log2TrSizeCG) + cgPosX + 1;
+    const uint32_t sigPos = (uint32_t)(shift >= 64 ? 0 : sigCoeffGroupFlag64 >> shift);
     const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1);
     const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2;
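
Note on the shift guard above: in C++, shifting a 64-bit value by 64 or more bits is undefined behaviour. With an 8x8 grid of coefficient groups (log2TrSizeCG = 3), the last group (cgPosX = cgPosY = 7) gives (7 << 3) + 7 + 1 = 64, which is exactly the case the new `shift >= 64` test catches. A minimal standalone sketch of the same idea follows; `safeShiftRight` is a hypothetical helper name used only for illustration and is not part of x265.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from x265): a right shift that is defined for any
// shift count. Shifting a uint64_t by >= 64 bits is undefined behaviour in
// C++, so that case is mapped to 0 explicitly.
static inline uint64_t safeShiftRight(uint64_t value, uint32_t shift)
{
    return shift >= 64 ? 0 : value >> shift;
}
```

Returning 0 for the out-of-range case treats "no bits left" the same way as an all-zero neighbour mask, which appears to be the intent of the patched line.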
 

x265_1.5.tar.gz/source/common/quant.h -> x265_1.6.tar.gz/source/common/quant.h Changed

 
@@ -81,7 +81,7 @@
 
     QpParam            m_qpParam[3];
 
-    bool               m_useRDOQ;
+    int                m_rdoqLevel;
     int64_t            m_psyRdoqScale;
     int16_t*           m_resiDctCoeff;
     int16_t*           m_fencDctCoeff;
@@ -99,7 +99,7 @@
     ~Quant();
 
     /* one-time setup */
-    bool init(bool useRDOQ, double psyScale, const ScalingList& scalingList, Entropy& entropy);
+    bool init(int rdoqLevel, double psyScale, const ScalingList& scalingList, Entropy& entropy);
     bool allocNoiseReduction(const x265_param& param);
 
     /* CU setup */

x265_1.5.tar.gz/source/common/scalinglist.cpp -> x265_1.6.tar.gz/source/common/scalinglist.cpp Changed

 
@@ -222,7 +222,7 @@
 
 void ScalingList::processDefaultMarix(int sizeId, int listId)
 {
-    ::memcpy(m_scalingListCoef[sizeId][listId], getScalingListDefaultAddress(sizeId, listId), sizeof(int) * X265_MIN(MAX_MATRIX_COEF_NUM, s_numCoefPerSize[sizeId]));
+    memcpy(m_scalingListCoef[sizeId][listId], getScalingListDefaultAddress(sizeId, listId), sizeof(int) * X265_MIN(MAX_MATRIX_COEF_NUM, s_numCoefPerSize[sizeId]));
     m_scalingListDC[sizeId][listId] = SCALING_LIST_DC;
 }
 

x265_1.5.tar.gz/source/common/shortyuv.cpp -> x265_1.6.tar.gz/source/common/shortyuv.cpp Changed

 
@@ -66,9 +66,9 @@
 
 void ShortYuv::clear()
 {
-    ::memset(m_buf[0], 0, (m_size  * m_size) *  sizeof(int16_t));
-    ::memset(m_buf[1], 0, (m_csize * m_csize) * sizeof(int16_t));
-    ::memset(m_buf[2], 0, (m_csize * m_csize) * sizeof(int16_t));
+    memset(m_buf[0], 0, (m_size  * m_size) *  sizeof(int16_t));
+    memset(m_buf[1], 0, (m_csize * m_csize) * sizeof(int16_t));
+    memset(m_buf[2], 0, (m_csize * m_csize) * sizeof(int16_t));
 }
 
 void ShortYuv::subtract(const Yuv& srcYuv0, const Yuv& srcYuv1, uint32_t log2Size)

x265_1.5.tar.gz/source/common/slice.cpp -> x265_1.6.tar.gz/source/common/slice.cpp Changed

@@ -33,7 +33,7 @@
 {
     if (m_sliceType == I_SLICE)
     {
-        ::memset(m_refPicList, 0, sizeof(m_refPicList));
+        memset(m_refPicList, 0, sizeof(m_refPicList));
         m_numRefIdx[1] = m_numRefIdx[0] = 0;
         return;
     }
@@ -112,7 +112,7 @@
     if (m_sliceType != B_SLICE)
     {
         m_numRefIdx[1] = 0;
-        ::memset(m_refPicList[1], 0, sizeof(m_refPicList[1]));
+        memset(m_refPicList[1], 0, sizeof(m_refPicList[1]));
     }
     else
     {
@@ -183,8 +183,8 @@
 uint32_t Slice::realEndAddress(uint32_t endCUAddr) const
 {
     // Calculate end address
-    uint32_t internalAddress = (endCUAddr - 1) % NUM_CU_PARTITIONS;
-    uint32_t externalAddress = (endCUAddr - 1) / NUM_CU_PARTITIONS;
+    uint32_t internalAddress = (endCUAddr - 1) % NUM_4x4_PARTITIONS;
+    uint32_t externalAddress = (endCUAddr - 1) / NUM_4x4_PARTITIONS;
     uint32_t xmax = m_sps->picWidthInLumaSamples - (externalAddress % m_sps->numCuInWidth) * g_maxCUSize;
     uint32_t ymax = m_sps->picHeightInLumaSamples - (externalAddress / m_sps->numCuInWidth) * g_maxCUSize;
 
@@ -192,13 +192,13 @@
         internalAddress--;
 
     internalAddress++;
-    if (internalAddress == NUM_CU_PARTITIONS)
+    if (internalAddress == NUM_4x4_PARTITIONS)
     {
         internalAddress = 0;
         externalAddress++;
     }
 
-    return externalAddress * NUM_CU_PARTITIONS + internalAddress;
+    return externalAddress * NUM_4x4_PARTITIONS + internalAddress;
 }
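
Note on the address arithmetic above: `realEndAddress()` splits a flat address into an external part (which CTU) and an internal part (which 4x4 partition inside that CTU), then recombines them after a possible carry. The sketch below shows only that split-and-carry step, not the picture-boundary scan in the middle of the function. The constant value 256 (a 64x64 CTU holding 16 * 16 blocks of 4x4) and all names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Assumed value for illustration only: a 64x64 CTU holds 16 * 16 = 256
// partitions of 4x4 samples. The real NUM_4x4_PARTITIONS may differ.
static const uint32_t kNum4x4Partitions = 256;

struct SplitAddress
{
    uint32_t external; // index of the CTU within the picture
    uint32_t internal; // index of the 4x4 partition within that CTU
};

static SplitAddress splitAddress(uint32_t addr)
{
    return { addr / kNum4x4Partitions, addr % kNum4x4Partitions };
}

// Advance by one partition, carrying into the next CTU when the internal
// index wraps, then recombine into a flat address.
static uint32_t nextAddress(uint32_t addr)
{
    SplitAddress s = splitAddress(addr);
    s.internal++;
    if (s.internal == kNum4x4Partitions)
    {
        s.internal = 0;
        s.external++;
    }
    return s.external * kNum4x4Partitions + s.internal;
}
```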

 

x265_1.5.tar.gz/source/common/slice.h -> x265_1.6.tar.gz/source/common/slice.h Changed

@@ -55,9 +55,9 @@
         , numberOfNegativePictures(0)
         , numberOfPositivePictures(0)
     {
-        ::memset(deltaPOC, 0, sizeof(deltaPOC));
-        ::memset(poc, 0, sizeof(poc));
-        ::memset(bUsed, 0, sizeof(bUsed));
+        memset(deltaPOC, 0, sizeof(deltaPOC));
+        memset(poc, 0, sizeof(poc));
+        memset(bUsed, 0, sizeof(bUsed));
     }
 
     void sortDeltaPOC();
@@ -149,8 +149,10 @@
 
 struct VPS
 {
+    uint32_t         maxTempSubLayers;
     uint32_t         numReorderPics;
     uint32_t         maxDecPicBuffering;
+    uint32_t         maxLatencyIncrease;
     HRDInfo          hrdParameters;
     ProfileTierLevel ptl;
 };
@@ -228,9 +230,10 @@
     bool     bUseAMP; // use param
     uint32_t maxAMPDepth;
 
+    uint32_t maxTempSubLayers;   // max number of Temporal Sub layers
     uint32_t maxDecPicBuffering; // these are dups of VPS values
+    uint32_t maxLatencyIncrease;
     int      numReorderPics;
-    int      maxLatencyIncrease;
 
     bool     bUseStrongIntraSmoothing; // use param
     bool     bTemporalMVPEnabled;
@@ -285,6 +288,14 @@
     }
 };
 
+#define SET_WEIGHT(w, b, s, d, o) \
+    { \
+        (w).inputWeight = (s); \
+        (w).log2WeightDenom = (d); \
+        (w).inputOffset = (o); \
+        (w).bPresentFlag = (b); \
+    }
+
 class Slice
 {
 public:
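
Note on the new SET_WEIGHT macro: its argument order is (target, present flag, scale, log2 denominator, offset), which differs from the order in which the fields are assigned. A small usage sketch follows. `DemoWeight` is a hypothetical stand-in that contains only the four fields the macro touches, with field types that are guesses; the macro body is copied from the hunk above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the real weight struct; field types are assumed.
struct DemoWeight
{
    uint32_t log2WeightDenom;
    int      inputWeight;
    int      inputOffset;
    bool     bPresentFlag;
};

// Same body as the macro added in the diff.
#define SET_WEIGHT(w, b, s, d, o) \
    { \
        (w).inputWeight = (s); \
        (w).log2WeightDenom = (d); \
        (w).inputOffset = (o); \
        (w).bPresentFlag = (b); \
    }

// Builds a weight equal to scale 64 / 2^6 = 1 with zero offset.
static DemoWeight makeIdentityWeight()
{
    DemoWeight w;
    // Order: target, present flag, scale, log2 denominator, offset.
    SET_WEIGHT(w, false, 64, 6, 0);
    return w;
}
```

Because the macro expands to a bare brace block, `if (cond) SET_WEIGHT(...); else ...` would not compile: the trailing semicolon ends the `if` before the `else`. Callers need to wrap such uses in their own braces.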

 

x265_1.5.tar.gz/source/common/threading.cpp -> x265_1.6.tar.gz/source/common/threading.cpp Changed

 
@@ -26,6 +26,13 @@
 namespace x265 {
 // x265 private namespace
 
+#if X265_ARCH_X86 && !defined(X86_64) && ENABLE_ASSEMBLY && defined(__GNUC__)
+extern "C" intptr_t x265_stack_align(void (*func)(), ...);
+#define x265_stack_align(func, ...) x265_stack_align((void (*)())func, __VA_ARGS__)
+#else
+#define x265_stack_align(func, ...) func(__VA_ARGS__)
+#endif
+
 /* C shim for forced stack alignment */
 static void stackAlignMain(Thread *instance)
 {

x265_1.5.tar.gz/source/common/threading.h -> x265_1.6.tar.gz/source/common/threading.h Changed

@@ -42,32 +42,32 @@
 #include <sys/sysctl.h>
 #endif
 
-#ifdef __GNUC__                         /* GCCs builtin atomics */
+#ifdef __GNUC__               /* GCCs builtin atomics */
 
 #include <sys/time.h>
 #include <unistd.h>
 
-#define CLZ(id, x)                          id = (unsigned long)__builtin_clz(x) ^ 31
-#define CTZ(id, x)                          id = (unsigned long)__builtin_ctz(x)
-#define ATOMIC_OR(ptr, mask)                __sync_fetch_and_or(ptr, mask)
-#define ATOMIC_AND(ptr, mask)               __sync_fetch_and_and(ptr, mask)
-#define ATOMIC_INC(ptr)                     __sync_add_and_fetch((volatile int32_t*)ptr, 1)
-#define ATOMIC_DEC(ptr)                     __sync_add_and_fetch((volatile int32_t*)ptr, -1)
-#define ATOMIC_ADD(ptr, value)              __sync_add_and_fetch((volatile int32_t*)ptr, value)
-#define GIVE_UP_TIME()                      usleep(0)
+#define CLZ(id, x)            id = (unsigned long)__builtin_clz(x) ^ 31
+#define CTZ(id, x)            id = (unsigned long)__builtin_ctz(x)
+#define ATOMIC_OR(ptr, mask)  __sync_fetch_and_or(ptr, mask)
+#define ATOMIC_AND(ptr, mask) __sync_fetch_and_and(ptr, mask)
+#define ATOMIC_INC(ptr)       __sync_add_and_fetch((volatile int32_t*)ptr, 1)
+#define ATOMIC_DEC(ptr)       __sync_add_and_fetch((volatile int32_t*)ptr, -1)
+#define ATOMIC_ADD(ptr, val)  __sync_fetch_and_add((volatile int32_t*)ptr, val)
+#define GIVE_UP_TIME()        usleep(0)
 
-#elif defined(_MSC_VER)                 /* Windows atomic intrinsics */
+#elif defined(_MSC_VER)       /* Windows atomic intrinsics */
 
 #include <intrin.h>
 
-#define CLZ(id, x)                          _BitScanReverse(&id, x)
-#define CTZ(id, x)                          _BitScanForward(&id, x)
-#define ATOMIC_INC(ptr)                     InterlockedIncrement((volatile LONG*)ptr)
-#define ATOMIC_DEC(ptr)                     InterlockedDecrement((volatile LONG*)ptr)
-#define ATOMIC_ADD(ptr, value)              InterlockedExchangeAdd((volatile LONG*)ptr, value)
-#define ATOMIC_OR(ptr, mask)                _InterlockedOr((volatile LONG*)ptr, (LONG)mask)
-#define ATOMIC_AND(ptr, mask)               _InterlockedAnd((volatile LONG*)ptr, (LONG)mask)
-#define GIVE_UP_TIME()                      Sleep(0)
+#define CLZ(id, x)            _BitScanReverse(&id, x)
+#define CTZ(id, x)            _BitScanForward(&id, x)
+#define ATOMIC_INC(ptr)       InterlockedIncrement((volatile LONG*)ptr)
+#define ATOMIC_DEC(ptr)       InterlockedDecrement((volatile LONG*)ptr)
+#define ATOMIC_ADD(ptr, val)  InterlockedExchangeAdd((volatile LONG*)ptr, val)
+#define ATOMIC_OR(ptr, mask)  _InterlockedOr((volatile LONG*)ptr, (LONG)mask)
+#define ATOMIC_AND(ptr, mask) _InterlockedAnd((volatile LONG*)ptr, (LONG)mask)
+#define GIVE_UP_TIME()        Sleep(0)
 
 #endif // ifdef __GNUC__
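
Note on the ATOMIC_ADD change above: the GCC branch moves from `__sync_add_and_fetch` (returns the new value) to `__sync_fetch_and_add` (returns the old value). That matches `InterlockedExchangeAdd` on the MSVC branch, which also returns the value held before the addition. The two conventions can be sketched portably with `std::atomic`; the helper names below are hypothetical and only illustrate the difference in return value.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Fetch-then-add: returns the value held BEFORE the addition
// (the convention of __sync_fetch_and_add and InterlockedExchangeAdd).
static int32_t fetchThenAdd(std::atomic<int32_t>& counter, int32_t val)
{
    return counter.fetch_add(val);
}

// Add-then-fetch: returns the value AFTER the addition
// (the convention of __sync_add_and_fetch).
static int32_t addThenFetch(std::atomic<int32_t>& counter, int32_t val)
{
    return counter.fetch_add(val) + val;
}
```

Any caller that uses the result of ATOMIC_ADD therefore sees a different number after this change on GCC builds, so call sites have to agree with the fetch-then-add convention.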
 
@@ -128,8 +128,8 @@
 
     bool timedWait(uint32_t milliseconds)
     {
-        /* returns true if event was signaled */
-        return WaitForSingleObject(this->handle, milliseconds) == WAIT_OBJECT_0;
+        /* returns true if the wait timed out */
+        return WaitForSingleObject(this->handle, milliseconds) == WAIT_TIMEOUT;
     }
 
     void trigger()
@@ -263,10 +263,8 @@
 
         /* blocking wait on conditional variable, mutex is atomically released
          * while blocked. When condition is signaled, mutex is re-acquired */
-        while (m_counter == 0)
-        {
+        while (!m_counter)
             pthread_cond_wait(&m_cond, &m_mutex);
-        }
 
         m_counter--;
         pthread_mutex_unlock(&m_mutex);
@@ -277,7 +275,7 @@
         bool bTimedOut = false;
 
         pthread_mutex_lock(&m_mutex);
-        if (m_counter == 0)
+        if (!m_counter)
         {
             struct timeval tv;
             struct timespec ts;
@@ -297,7 +295,10 @@
             bTimedOut = pthread_cond_timedwait(&m_cond, &m_mutex, &ts) == ETIMEDOUT;
         }
         if (m_counter > 0)
+        {
             m_counter--;
+            bTimedOut = false;
+        }
         pthread_mutex_unlock(&m_mutex);
         return bTimedOut;
     }
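
Note on the timedWait() hunks above: both platforms now use the convention "true means the wait timed out". The pthread version additionally clears `bTimedOut` when a pending trigger is consumed, so a trigger that arrives right at the deadline is not reported as a timeout. A portable sketch of the same contract, using the C++ standard library instead of pthreads or Win32, is shown below. `DemoEvent` is a hypothetical class, not the x265 `Event`.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical counting event that mirrors the "true == timed out" contract.
class DemoEvent
{
public:
    void trigger()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_counter++;
        m_cond.notify_one();
    }

    // Returns true if the wait timed out, false if a trigger was consumed.
    bool timedWait(uint32_t milliseconds)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        // wait_for with a predicate returns the predicate's final value, so
        // a trigger that is pending at the deadline still counts as success.
        bool signalled = m_cond.wait_for(lock,
                                         std::chrono::milliseconds(milliseconds),
                                         [this] { return m_counter > 0; });
        if (!signalled)
            return true;
        m_counter--;
        return false;
    }

private:
    std::mutex              m_mutex;
    std::condition_variable m_cond;
    uint32_t                m_counter = 0;
};
```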
@@ -408,6 +409,23 @@
     Lock &inst;
 };
 
+// Utility class which adds elapsed time of the scope of the object into the
+// accumulator provided to the constructor
+struct ScopedElapsedTime
+{
+    ScopedElapsedTime(int64_t& accum) : accumlatedTime(accum) { startTime = x265_mdate(); }
+
+    ~ScopedElapsedTime() { accumlatedTime += x265_mdate() - startTime; }
+
+protected:
+
+    int64_t  startTime;
+    int64_t& accumlatedTime;
+
+    // do not allow assignments
+    ScopedElapsedTime &operator =(const ScopedElapsedTime &);
+};
+
 //< Simplistic portable thread class.  Shutdown signalling left to derived class
 class Thread
 {
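
Note on ScopedElapsedTime above: it is a scope-bound (RAII) timer that adds the lifetime of the object to a caller-owned accumulator. The sketch below shows the same pattern with `std::chrono`. `nowMicros()` is a hypothetical stand-in for `x265_mdate()`, and the microsecond unit is an assumption made for this example.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for x265_mdate(): a monotonic timestamp in microseconds.
static int64_t nowMicros()
{
    using namespace std::chrono;
    return duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count();
}

// Adds the time between construction and destruction to 'accum'.
struct DemoScopedTimer
{
    explicit DemoScopedTimer(int64_t& accum) : m_start(nowMicros()), m_accum(accum) {}
    ~DemoScopedTimer() { m_accum += nowMicros() - m_start; }

    // Copying would double-count the interval, so it is disabled.
    DemoScopedTimer(const DemoScopedTimer&) = delete;
    DemoScopedTimer& operator=(const DemoScopedTimer&) = delete;

private:
    int64_t  m_start;
    int64_t& m_accum;
};

// Sums 1..n and adds the time spent doing so to 'total'.
static int64_t timedSum(int n, int64_t& total)
{
    DemoScopedTimer timer(total); // starts timing here, stops at scope exit
    int64_t sum = 0;
    for (int i = 1; i <= n; i++)
        sum += i;
    return sum;
}
```

The accumulator is updated on every exit path from the scope, including early returns, which is the main reason to prefer this over manual start/stop calls.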

 

x265_1.5.tar.gz/source/common/threadpool.cpp -> x265_1.6.tar.gz/source/common/threadpool.cpp Changed

@@ -27,115 +27,65 @@
 
 #include <new>
 
-#if MACOS
-#include <sys/param.h>
-#include <sys/sysctl.h>
-#endif
-
-namespace x265 {
-// x265 private namespace
-
-class ThreadPoolImpl;
+#if X86_64
 
-class PoolThread : public Thread
-{
-private:
+#ifdef __GNUC__
 
-    ThreadPoolImpl &m_pool;
+#define SLEEPBITMAP_CTZ(id, x)     id = (unsigned long)__builtin_ctzll(x)
+#define SLEEPBITMAP_OR(ptr, mask)  __sync_fetch_and_or(ptr, mask)
+#define SLEEPBITMAP_AND(ptr, mask) __sync_fetch_and_and(ptr, mask)
 
-    PoolThread& operator =(const PoolThread&);
+#elif defined(_MSC_VER)
 
-    int            m_id;
+#define SLEEPBITMAP_CTZ(id, x)     _BitScanForward64(&id, x)
+#define SLEEPBITMAP_OR(ptr, mask)  InterlockedOr64((volatile LONG64*)ptr, (LONG)mask)
+#define SLEEPBITMAP_AND(ptr, mask) InterlockedAnd64((volatile LONG64*)ptr, (LONG)mask)
 
-    bool           m_dirty;
+#endif // ifdef __GNUC__
 
-    bool           m_exited;
-
-    Event          m_wakeEvent;
-
-public:
-
-    PoolThread(ThreadPoolImpl& pool, int id)
-        : m_pool(pool)
-        , m_id(id)
-        , m_dirty(false)
-        , m_exited(false)
-    {
-    }
-
-    bool isDirty() const  { return m_dirty; }
-
-    void markDirty()      { m_dirty = true; }
+#else
 
-    bool isExited() const { return m_exited; }
+/* use 32-bit primitives defined in threading.h */
+#define SLEEPBITMAP_CTZ CTZ
+#define SLEEPBITMAP_OR  ATOMIC_OR
+#define SLEEPBITMAP_AND ATOMIC_AND
 
-    void poke()           { m_wakeEvent.trigger(); }
+#endif
 
-    virtual ~PoolThread() {}
+#if MACOS
+#include <sys/param.h>
+#include <sys/sysctl.h>
+#endif
+#if HAVE_LIBNUMA
+#include <numa.h>
+#endif
 
-    void threadMain();
-};
+namespace x265 {
+// x265 private namespace
 
-class ThreadPoolImpl : public ThreadPool
+class WorkerThread : public Thread
 {
 private:
 
-    bool         m_ok;
-    int          m_referenceCount;
-    int          m_numThreads;
-    int          m_numSleepMapWords;
-    PoolThread  *m_threads;
-    volatile uint32_t *m_sleepMap;
+    ThreadPool&  m_pool;
+    int          m_id;
+    Event        m_wakeEvent;
 
-    /* Lock for write access to the provider lists.  Threads are
-     * always allowed to read m_firstProvider and follow the
-     * linked list.  Providers must zero their m_nextProvider
-     * pointers before removing themselves from this list */
-    Lock         m_writeLock;
+    WorkerThread& operator =(const WorkerThread&);
 
 public:
 
-    static ThreadPoolImpl *s_instance;
-    static Lock s_createLock;
-
-    JobProvider *m_firstProvider;
-    JobProvider *m_lastProvider;
-
-public:
-
-    ThreadPoolImpl(int numthreads);
-
-    virtual ~ThreadPoolImpl();
-
-    ThreadPoolImpl *AddReference()
-    {
-        m_referenceCount++;
-
-        return this;
-    }
-
-    void markThreadAsleep(int id);
-
-    void waitForAllIdle();
-
-    int getThreadCount() const { return m_numThreads; }
-
-    bool IsValid() const       { return m_ok; }
-
-    void release();
+    JobProvider*     m_curJobProvider;
+    BondedTaskGroup* m_bondMaster;
 
-    void Stop();
+    WorkerThread(ThreadPool& pool, int id) : m_pool(pool), m_id(id) {}
+    virtual ~WorkerThread() {}
 
-    void enqueueJobProvider(JobProvider &);
-
-    void dequeueJobProvider(JobProvider &);
-
-    void FlushProviderList();
-
-    void pokeIdleThread();
+    void threadMain();
+    void awaken()           { m_wakeEvent.trigger(); }
 };
 
-void PoolThread::threadMain()
+void WorkerThread::threadMain()
 {
     THREAD_NAME("Worker", m_id);
 
@@ -145,286 +95,361 @@
     __attribute__((unused)) int val = nice(10);
 #endif
 
-    while (m_pool.IsValid())
+    m_pool.setCurrentThreadAffinity();
+
+    sleepbitmap_t idBit = (sleepbitmap_t)1 << m_id;
+    m_curJobProvider = m_pool.m_jpTable[0];
+    m_bondMaster = NULL;
+
+    SLEEPBITMAP_OR(&m_curJobProvider->m_ownerBitmap, idBit);
+    SLEEPBITMAP_OR(&m_pool.m_sleepBitmap, idBit);
+    m_wakeEvent.wait();
+
+    while (m_pool.m_isActive)
     {
-        /* Walk list of job providers, looking for work */
-        JobProvider *cur = m_pool.m_firstProvider;
-        while (cur)
+        if (m_bondMaster)
         {
-            // FindJob() may perform actual work and return true.  If
-            // it does we restart the job search
-            if (cur->findJob(m_id) == true)
-                break;
-
-            cur = cur->m_nextProvider;
+            m_bondMaster->processTasks(m_id);
+            m_bondMaster->m_exitedPeerCount.incr();
+            m_bondMaster = NULL;
         }
 
-        // this thread has reached the end of the provider list
-        m_dirty = false;
-
-        if (cur == NULL)
+        do
         {
-            m_pool.markThreadAsleep(m_id);
-            m_wakeEvent.wait();
+            /* do pending work for current job provider */
+            m_curJobProvider->findJob(m_id);
+
+            /* if the current job provider still wants help, only switch to a
+             * higher priority provider (lower slice type). Else take the first
+             * available job provider with the highest priority */
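
Note on the sleep bitmap used above: each worker sets its own bit in `m_sleepBitmap` before it blocks on its wake event, so a waker can claim a sleeping worker by atomically clearing one bit. The following is a rough sketch of that claim step under stated assumptions. It uses `std::atomic` rather than the SLEEPBITMAP_* macros, all names are hypothetical, and it is not the x265 implementation of `tryAcquireSleepingThread()`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

typedef uint64_t bitmap_t; // one bit per worker, as with sleepbitmap_t on 64-bit builds

// Index of the lowest set bit; 'x' must be non-zero.
static int lowestSetBit(bitmap_t x)
{
    int id = 0;
    while (!(x & 1))
    {
        x >>= 1;
        id++;
    }
    return id;
}

// Try to claim one sleeping worker whose bit is set in both 'sleepers' and
// 'candidates'. Returns the worker id, or -1 if none could be claimed.
static int tryAcquire(std::atomic<bitmap_t>& sleepers, bitmap_t candidates)
{
    bitmap_t masked = sleepers.load() & candidates;
    while (masked)
    {
        int id = lowestSetBit(masked);
        bitmap_t bit = (bitmap_t)1 << id;
        // fetch_and returns the previous value: if the bit was still set,
        // this caller is the one that cleared it and now owns the worker.
        if (sleepers.fetch_and(~bit) & bit)
            return id;
        // Another thread claimed it first; re-read and try the next bit.
        masked = sleepers.load() & candidates;
    }
    return -1;
}
```

Passing a per-provider owner bitmap as `candidates` first and an all-ones mask second would give the "prefer a recently used worker" behaviour that the JobProvider comments describe.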

 

x265_1.5.tar.gz/source/common/threadpool.h -> x265_1.6.tar.gz/source/common/threadpool.h Changed

@@ -25,85 +25,148 @@
 #define X265_THREADPOOL_H
 
 #include "common.h"
+#include "threading.h"
 
 namespace x265 {
 // x265 private namespace
 
 class ThreadPool;
+class WorkerThread;
+class BondedTaskGroup;
 
-int getCpuCount();
+#if X86_64
+typedef uint64_t sleepbitmap_t;
+#else
+typedef uint32_t sleepbitmap_t;
+#endif
 
-// Any class that wants to distribute work to the thread pool must
-// derive from JobProvider and implement FindJob().
+static const sleepbitmap_t ALL_POOL_THREADS = (sleepbitmap_t)-1;
+enum { MAX_POOL_THREADS = sizeof(sleepbitmap_t) * 8 };
+enum { INVALID_SLICE_PRIORITY = 10 }; // a value larger than any X265_TYPE_* macro
+
+// Frame level job providers. FrameEncoder and Lookahead derive from
+// this class and implement findJob()
 class JobProvider
 {
-protected:
-
-    ThreadPool   *m_pool;
-
-    JobProvider  *m_nextProvider;
-    JobProvider  *m_prevProvider;
-
 public:
 
-    JobProvider(ThreadPool *p) : m_pool(p), m_nextProvider(0), m_prevProvider(0) {}
+    ThreadPool*   m_pool;
+    sleepbitmap_t m_ownerBitmap;
+    int           m_jpId;
+    int           m_sliceType;
+    bool          m_helpWanted;
+    bool          m_isFrameEncoder; /* rather ugly hack, but nothing better presents itself */
+
+    JobProvider()
+        : m_pool(NULL)
+        , m_ownerBitmap(0)
+        , m_jpId(-1)
+        , m_sliceType(INVALID_SLICE_PRIORITY)
+        , m_helpWanted(false)
+        , m_isFrameEncoder(false)
+    {}
 
     virtual ~JobProvider() {}
 
-    void setThreadPool(ThreadPool *p) { m_pool = p; }
-
-    // Register this job provider with the thread pool, jobs are available
-    void enqueue();
-
-    // Remove this job provider from the thread pool, all jobs complete
-    void dequeue();
-
-    // Worker threads will call this method to find a job.  Must return true if
-    // work was completed.  False if no work was available.
-    virtual bool findJob(int threadId) = 0;
-
-    // All derived objects that call Enqueue *MUST* call flush before allowing
-    // their object to be destroyed, otherwise you will see random crashes involving
-    // partially freed vtables and you will be unhappy
-    void flush();
+    // Worker threads will call this method to perform work
+    virtual void findJob(int workerThreadId) = 0;
 
-    friend class ThreadPoolImpl;
-    friend class PoolThread;
+    // Will awaken one idle thread, preferring a thread which most recently
+    // performed work for this provider.
+    void tryWakeOne();
 };
 
-// Abstract interface to ThreadPool.  Each encoder instance should call
-// AllocThreadPool() to get a handle to the singleton object and then make
-// it available to their job provider structures (wave-front frame encoders,
-// etc).
 class ThreadPool
 {
-protected:
-
-    // Destructor is inaccessable, force the use of reference counted Release()
-    ~ThreadPool() {}
-
-    virtual void enqueueJobProvider(JobProvider &) = 0;
+public:
 
-    virtual void dequeueJobProvider(JobProvider &) = 0;
+    sleepbitmap_t m_sleepBitmap;
+    int           m_numProviders;
+    int           m_numWorkers;
+    int           m_numaNode;
+    bool          m_isActive;
 
-public:
+    JobProvider** m_jpTable;
+    WorkerThread* m_workers;
 
-    // When numthreads == 0, a default thread count is used. A request may grow
-    // an existing pool but it will never shrink.
-    static ThreadPool *allocThreadPool(int numthreads = 0);
+    ThreadPool();
+    ~ThreadPool();
 
-    static ThreadPool *getThreadPool();
+    bool create(int numThreads, int maxProviders, int node);
+    bool start();
+    void stop();
+    void setCurrentThreadAffinity();
+    int  tryAcquireSleepingThread(sleepbitmap_t firstTryBitmap, sleepbitmap_t secondTryBitmap);
+    int  tryBondPeers(int maxPeers, sleepbitmap_t peerBitmap, BondedTaskGroup& master);
 
-    virtual void pokeIdleThread() = 0;
+    static ThreadPool* allocThreadPools(x265_param* p, int& numPools);
 
-    // The pool is reference counted so all calls to AllocThreadPool() should be
-    // followed by a call to Release()
-    virtual void release() = 0;
+    static int  getCpuCount();
+    static int  getNumaNodeCount();
+    static void setThreadNodeAffinity(int node);
+};
 
-    virtual int  getThreadCount() const = 0;
+/* Any worker thread may enlist the help of idle worker threads from the same
+ * job provider. They must derive from this class and implement the
+ * processTasks() method.  To use, an instance must be instantiated by a worker
+ * thread (referred to as the master thread) and then tryBondPeers() must be
+ * called. If it returns non-zero then some number of slave worker threads are
+ * already in the process of calling your processTasks() function. The master
+ * thread should participate and call processTasks() itself. When
+ * waitForExit() returns, all bonded peer threads are guaranteed to have
+ * exited processTasks(). Since the thread count is small, it uses explicit
+ * locking instead of atomic counters and bitmasks */
+class BondedTaskGroup
+{
+public:
 
-    friend class JobProvider;
+    Lock              m_lock;
+    ThreadSafeInteger m_exitedPeerCount;
+    int               m_bondedPeerCount;
+    int               m_jobTotal;
+    int               m_jobAcquired;
+
+    BondedTaskGroup()  { m_bondedPeerCount = m_jobTotal = m_jobAcquired = 0; }
+
+    /* Do not allow the instance to be destroyed before all bonded peers have
+     * exited processTasks() */
+    ~BondedTaskGroup() { waitForExit(); }
+
+    /* Try to enlist the help of idle worker threads on most recently associated
+     * with the given job provider and "bond" them to work on your tasks. Up to
+     * maxPeers worker threads will call your processTasks() method. */
+    int tryBondPeers(JobProvider& jp, int maxPeers)
+    {
+        int count = jp.m_pool->tryBondPeers(maxPeers, jp.m_ownerBitmap, *this);
+        m_bondedPeerCount += count;
+        return count;
+    }
+
+    /* Try to enlist the help of any idle worker threads and "bond" them to work
+     * on your tasks. Up to maxPeers worker threads will call your
+     * processTasks() method. */
+    int tryBondPeers(ThreadPool& pool, int maxPeers)
+    {
+        int count = pool.tryBondPeers(maxPeers, ALL_POOL_THREADS, *this);
+        m_bondedPeerCount += count;
+        return count;
+    }
+
+    /* Returns when all bonded peers have exited processTasks(). It does *NOT*
+     * ensure all tasks are completed (but this is generally implied). */
+    void waitForExit()
+    {
+        int exited = m_exitedPeerCount.get();
+        while (m_bondedPeerCount != exited)
+            exited = m_exitedPeerCount.waitForChange(exited);
+    }
+
+    /* Derived classes must define this method. The worker thread ID may be
+     * used to index into thread local data, or ignored.  The ID will be between
+     * 0 and jp.m_numWorkers - 1 */
+    virtual void processTasks(int workerThreadId) = 0;
 };
+
 } // end namespace x265
 
 #endif // ifndef X265_THREADPOOL_H
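
The BondedTaskGroup comment in the hunk above describes a master thread that enlists peers, takes part in the work itself, and then waits until every peer has left processTasks(). The following is only a rough standalone sketch of that pattern, not x265 code: the name BondedGroupSketch, the squared-index "job", and the use of freshly spawned std::thread objects in place of sleeping pool workers are all assumptions made for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the bonded-group pattern; not the x265 class.
struct BondedGroupSketch
{
    std::mutex               lock;
    std::condition_variable  exitCond;
    int                      bondedPeerCount = 0;
    int                      exitedPeerCount = 0;
    int                      jobTotal = 0;
    int                      jobAcquired = 0;
    std::vector<int>         results;
    std::vector<std::thread> peers;

    // Do not let the instance die while peers may still be inside processTasks()
    ~BondedGroupSketch() { waitForExit(); }

    // Every participant claims job indices under the lock until none remain
    void processTasks(int workerThreadId)
    {
        (void)workerThreadId; // could index per-thread scratch data
        for (;;)
        {
            int job;
            {
                std::lock_guard<std::mutex> guard(lock);
                if (jobAcquired >= jobTotal)
                    return;
                job = jobAcquired++;
            }
            results[job] = job * job; // each index is written by exactly one thread
        }
    }

    // Assumption: peers are spawned here; a real pool would wake idle workers
    int tryBondPeers(int maxPeers)
    {
        for (int i = 0; i < maxPeers; i++)
            peers.emplace_back([this, i] {
                processTasks(i + 1);
                std::lock_guard<std::mutex> guard(lock);
                exitedPeerCount++;
                exitCond.notify_all();
            });
        bondedPeerCount += maxPeers;
        return maxPeers;
    }

    // Returns once every bonded peer has left processTasks()
    void waitForExit()
    {
        {
            std::unique_lock<std::mutex> guard(lock);
            exitCond.wait(guard, [this] { return exitedPeerCount == bondedPeerCount; });
        }
        for (std::thread& t : peers)
            t.join();
        peers.clear();
    }
};

// The master bonds peers, participates itself, then waits for the peers
std::vector<int> runBondedSquares(int jobTotal, int maxPeers)
{
    BondedGroupSketch group;
    group.jobTotal = jobTotal;
    group.results.assign(jobTotal, -1);
    group.tryBondPeers(maxPeers);
    group.processTasks(0);
    group.waitForExit();
    return group.results;
}
```

As in the header's comment, waitForExit() in this sketch only promises that the peers have exited. All jobs happen to be complete at that point because the master's own processTasks() call returns only after the shared job counter is exhausted.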

 

x265_1.5.tar.gz/source/common/wavefront.cpp -> x265_1.6.tar.gz/source/common/wavefront.cpp Changed

@@ -54,13 +54,13 @@
 void WaveFront::clearEnabledRowMask()
 {
     memset((void*)m_externalDependencyBitmap, 0, sizeof(uint32_t) * m_numWords);
+    memset((void*)m_internalDependencyBitmap, 0, sizeof(uint32_t) * m_numWords);
 }
 
 void WaveFront::enqueueRow(int row)
 {
     uint32_t bit = 1 << (row & 31);
     ATOMIC_OR(&m_internalDependencyBitmap[row >> 5], bit);
-    if (m_pool) m_pool->pokeIdleThread();
 }
 
 void WaveFront::enableRow(int row)
@@ -80,11 +80,11 @@
     return !!(ATOMIC_AND(&m_internalDependencyBitmap[row >> 5], ~bit) & bit);
 }
 
-bool WaveFront::findJob(int threadId)
+void WaveFront::findJob(int threadId)
 {
     unsigned long id;
 
-    // thread safe
+    /* Loop over each word until all available rows are finished */
     for (int w = 0; w < m_numWords; w++)
     {
         uint32_t oldval = m_internalDependencyBitmap[w] & m_externalDependencyBitmap[w];
@@ -97,15 +97,14 @@
             {
                 /* we cleared the bit, we get to process the row */
                 processRow(w * 32 + id, threadId);
-                return true;
+                m_helpWanted = true;
+                return; /* check for a higher priority task */
             }
 
-            // some other thread cleared the bit, try another bit
             oldval = m_internalDependencyBitmap[w] & m_externalDependencyBitmap[w];
         }
     }
 
-    // made it through the bitmap without finding any enqueued rows
-    return false;
+    m_helpWanted = false;
 }
 }
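
The findJob() loop above claims a row by clearing its bit in the queued bitmap and checking whether the bit was still set beforehand. A minimal standalone sketch of that claim step is shown below. It assumes the atomic AND yields the previous value (which is what the `& bit` test in the diff suggests) and uses std::atomic in place of the ATOMIC_* macros; the names claimLowestReadyRow, enqueueRowSketch, and lowestSetBit are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Index of the lowest set bit; the caller guarantees val != 0
static int lowestSetBit(uint32_t val)
{
    int id = 0;
    while (!(val & 1u))
    {
        val >>= 1;
        id++;
    }
    return id;
}

// Mark a row as queued (hypothetical counterpart of enqueueRow)
void enqueueRowSketch(std::atomic<uint32_t>* queued, int row)
{
    queued[row >> 5].fetch_or(1u << (row & 31));
}

// Try to claim the lowest row that is both queued and enabled.
// Returns the row number, or -1 when no such row exists.
int claimLowestReadyRow(std::atomic<uint32_t>* queued,
                        const std::atomic<uint32_t>* enabled,
                        int numWords)
{
    for (int w = 0; w < numWords; w++)
    {
        uint32_t ready = queued[w].load() & enabled[w].load();
        while (ready)
        {
            int id = lowestSetBit(ready);
            uint32_t bit = 1u << id;

            // fetch_and returns the old word: if the bit was set, this
            // thread cleared it and therefore owns the row
            if (queued[w].fetch_and(~bit) & bit)
                return w * 32 + id;

            // another thread won the race; re-read and try the next bit
            ready = queued[w].load() & enabled[w].load();
        }
    }
    return -1;
}
```

Because only one thread can observe the bit as set in the value returned by the atomic AND, at most one thread processes any given row even when several scan the same word at once.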

 

x265_1.5.tar.gz/source/common/wavefront.h -> x265_1.6.tar.gz/source/common/wavefront.h Changed

 
@@ -53,10 +53,9 @@
 
 public:
 
-    WaveFront(ThreadPool *pool)
-        : JobProvider(pool)
-        , m_internalDependencyBitmap(0)
-        , m_externalDependencyBitmap(0)
+    WaveFront()
+        : m_internalDependencyBitmap(NULL)
+        , m_externalDependencyBitmap(NULL)
     {}
 
     virtual ~WaveFront();
@@ -86,8 +85,8 @@
 
     // WaveFront's implementation of JobProvider::findJob. Consults
     // m_queuedBitmap and calls ProcessRow(row) for lowest numbered queued row
-    // or returns false
-    bool findJob(int threadId);
+    // processes available rows and returns when no work remains
+    void findJob(int threadId);
 
     // Start or resume encode processing of this row, must be implemented by
     // derived classes.

x265_1.5.tar.gz/source/common/x86/asm-primitives.cpp -> x265_1.6.tar.gz/source/common/x86/asm-primitives.cpp Changed

@@ -44,6 +44,11 @@
     p.cu[BLOCK_16x16].prim = fncdef x265_ ## fname ## _16x16_ ## cpu; \
     p.cu[BLOCK_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu; \
     p.cu[BLOCK_64x64].prim = fncdef x265_ ## fname ## _64x64_ ## cpu
+#define ALL_LUMA_CU_TYPED_S(prim, fncdef, fname, cpu) \
+    p.cu[BLOCK_8x8].prim   = fncdef x265_ ## fname ## 8_ ## cpu; \
+    p.cu[BLOCK_16x16].prim = fncdef x265_ ## fname ## 16_ ## cpu; \
+    p.cu[BLOCK_32x32].prim = fncdef x265_ ## fname ## 32_ ## cpu; \
+    p.cu[BLOCK_64x64].prim = fncdef x265_ ## fname ## 64_ ## cpu
 #define ALL_LUMA_TU_TYPED(prim, fncdef, fname, cpu) \
     p.cu[BLOCK_4x4].prim   = fncdef x265_ ## fname ## _4x4_ ## cpu; \
     p.cu[BLOCK_8x8].prim   = fncdef x265_ ## fname ## _8x8_ ## cpu; \
@@ -61,6 +66,7 @@
     p.cu[BLOCK_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu; \
     p.cu[BLOCK_64x64].prim = fncdef x265_ ## fname ## _64x64_ ## cpu;
 #define ALL_LUMA_CU(prim, fname, cpu)      ALL_LUMA_CU_TYPED(prim, , fname, cpu)
+#define ALL_LUMA_CU_S(prim, fname, cpu)    ALL_LUMA_CU_TYPED_S(prim, , fname, cpu)
 #define ALL_LUMA_TU(prim, fname, cpu)      ALL_LUMA_TU_TYPED(prim, , fname, cpu)
 #define ALL_LUMA_BLOCKS(prim, fname, cpu)  ALL_LUMA_BLOCKS_TYPED(prim, , fname, cpu)
 #define ALL_LUMA_TU_S(prim, fname, cpu)    ALL_LUMA_TU_TYPED_S(prim, , fname, cpu)
@@ -179,7 +185,6 @@
     p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].prim  = fncdef x265_ ## fname ## _8x32_ ## cpu
 #define ALL_CHROMA_420_4x4_PU(prim, fname, cpu) ALL_CHROMA_420_4x4_PU_TYPED(prim, , fname, cpu)
 
-
 #define ALL_CHROMA_422_CU_TYPED(prim, fncdef, fname, cpu) \
     p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].prim   = fncdef x265_ ## fname ## _4x8_ ## cpu; \
     p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].prim  = fncdef x265_ ## fname ## _8x16_ ## cpu; \
@@ -791,6 +796,10 @@
 
 void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) // 16bpp
 {
+#if !defined(X86_64)
+#error "Unsupported build configuration (32bit x86 and HIGH_BIT_DEPTH), you must configure ENABLE_ASSEMBLY=OFF"
+#endif
+
     if (cpuMask & X265_CPU_SSE2)
     {
         /* We do not differentiate CPUs which support MMX and not SSE2. We only check
@@ -863,6 +872,16 @@
         ALL_LUMA_TU_S(calcresidual, getResidual, sse2);
         ALL_LUMA_TU_S(transpose, transpose, sse2);
 
+        p.cu[BLOCK_4x4].intra_pred[DC_IDX] = x265_intra_pred_dc4_sse2;
+        p.cu[BLOCK_8x8].intra_pred[DC_IDX] = x265_intra_pred_dc8_sse2;
+        p.cu[BLOCK_16x16].intra_pred[DC_IDX] = x265_intra_pred_dc16_sse2;
+        p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_sse2;
+
+        p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2;
+        p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = x265_intra_pred_planar8_sse2;
+        p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_sse2;
+        p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_sse2;
+
         p.cu[BLOCK_4x4].sse_ss = x265_pixel_ssd_ss_4x4_mmx2;
         ALL_LUMA_CU(sse_ss, pixel_ssd_ss, sse2);
 
@@ -872,10 +891,10 @@
         p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sse_pp = (pixelcmp_t)x265_pixel_ssd_ss_32x64_sse2;
 
         p.cu[BLOCK_4x4].dct = x265_dct4_sse2;
+        p.cu[BLOCK_8x8].dct = x265_dct8_sse2;
         p.cu[BLOCK_4x4].idct = x265_idct4_sse2;
-#if X86_64
         p.cu[BLOCK_8x8].idct = x265_idct8_sse2;
-#endif
+
         p.idst4x4 = x265_idst4_sse2;
 
         LUMA_VSS_FILTERS(sse2);
@@ -894,7 +913,10 @@
 
         p.dst4x4 = x265_dst4_ssse3;
         p.cu[BLOCK_8x8].idct = x265_idct8_ssse3;
-        p.count_nonzero = x265_count_nonzero_ssse3;
+        p.cu[BLOCK_4x4].count_nonzero = x265_count_nonzero_4x4_ssse3;
+        p.cu[BLOCK_8x8].count_nonzero = x265_count_nonzero_8x8_ssse3;
+        p.cu[BLOCK_16x16].count_nonzero = x265_count_nonzero_16x16_ssse3;
+        p.cu[BLOCK_32x32].count_nonzero = x265_count_nonzero_32x32_ssse3;
         p.frameInitLowres = x265_frame_init_lowres_core_ssse3;
     }
     if (cpuMask & X265_CPU_SSE4)
@@ -931,19 +953,30 @@
         p.cu[BLOCK_4x4].psy_cost_pp = x265_psyCost_pp_4x4_sse4;
         p.cu[BLOCK_4x4].psy_cost_ss = x265_psyCost_ss_4x4_sse4;
 
-#if X86_64
+        // TODO: check POPCNT flag!
+        ALL_LUMA_TU_S(copy_cnt, copy_cnt_, sse4);
         ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4);
         ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4);
-#endif
     }
     if (cpuMask & X265_CPU_AVX)
     {
         // p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_avx; fails tests
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].satd = x265_pixel_satd_16x24_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].satd = x265_pixel_satd_32x48_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].satd = x265_pixel_satd_24x64_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].satd = x265_pixel_satd_8x64_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd = x265_pixel_satd_8x12_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = x265_pixel_satd_12x32_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd = x265_pixel_satd_4x32_avx;
+
         ALL_LUMA_PU(satd, pixel_satd, avx);
         ASSIGN_SA8D(avx);
         LUMA_VAR(avx);
         p.ssim_4x4x2_core = x265_pixel_ssim_4x4x2_core_avx;
         p.ssim_end_4 = x265_pixel_ssim_end4_avx;
+
+        // copy_pp primitives
+        // 16 x N
         p.pu[LUMA_64x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x64_avx;
         p.pu[LUMA_16x4].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x4_avx;
         p.pu[LUMA_16x8].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x8_avx;
@@ -963,11 +996,82 @@
         p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x16_avx;
         p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x24_avx;
         p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x32_avx;
+
+        // 24 X N
+        p.pu[LUMA_24x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_24x32_avx;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_24x32_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_24x64_avx;
+
+        // 32 x N
+        p.pu[LUMA_32x8].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x8_avx;
+        p.pu[LUMA_32x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x16_avx;
+        p.pu[LUMA_32x24].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x24_avx;
+        p.pu[LUMA_32x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x32_avx;
+        p.pu[LUMA_32x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x64_avx;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x8_avx;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x16_avx;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x24_avx;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x16_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x48_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x64_avx;
+
+        // 48 X 64
+        p.pu[LUMA_48x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_48x64_avx;
+
+        // copy_ss primitives
+        // 16 X N
+        p.cu[BLOCK_16x16].copy_ss = x265_blockcopy_ss_16x16_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_ss = x265_blockcopy_ss_16x16_avx;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_ss = x265_blockcopy_ss_16x32_avx;
+
+        // 32 X N
+        p.cu[BLOCK_32x32].copy_ss = x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ss = x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ss = x265_blockcopy_ss_32x64_avx;
+
+        // 64 X N
+        p.cu[BLOCK_64x64].copy_ss = x265_blockcopy_ss_64x64_avx;
+
+        // copy_ps primitives
+        // 16 X N
+        p.cu[BLOCK_16x16].copy_ps = (copy_ps_t)x265_blockcopy_ss_16x16_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_ps = (copy_ps_t)x265_blockcopy_ss_16x16_avx;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_ps = (copy_ps_t)x265_blockcopy_ss_16x32_avx;
+
+        // 32 X N
+        p.cu[BLOCK_32x32].copy_ps = (copy_ps_t)x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ps = (copy_ps_t)x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ps = (copy_ps_t)x265_blockcopy_ss_32x64_avx;
+
+        // 64 X N
+        p.cu[BLOCK_64x64].copy_ps = (copy_ps_t)x265_blockcopy_ss_64x64_avx;
+
+        // copy_sp primitives
+        // 16 X N
+        p.cu[BLOCK_16x16].copy_sp = (copy_sp_t)x265_blockcopy_ss_16x16_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_sp = (copy_sp_t)x265_blockcopy_ss_16x16_avx;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_sp = (copy_sp_t)x265_blockcopy_ss_16x32_avx;
+
+        // 32 X N
+        p.cu[BLOCK_32x32].copy_sp = (copy_sp_t)x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_sp = (copy_sp_t)x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_sp = (copy_sp_t)x265_blockcopy_ss_32x64_avx;
+
+        // 64 X N
+        p.cu[BLOCK_64x64].copy_sp = (copy_sp_t)x265_blockcopy_ss_64x64_avx;
+
         p.frameInitLowres = x265_frame_init_lowres_core_avx;
+
+        p.pu[LUMA_64x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x16_avx;
+        p.pu[LUMA_64x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x32_avx;
+        p.pu[LUMA_64x48].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x48_avx;
+        p.pu[LUMA_64x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x64_avx;
     }
     if (cpuMask & X265_CPU_XOP)
     {
-        p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_xop;
+        //p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_xop; this one is broken
         ALL_LUMA_PU(satd, pixel_satd, xop);
         ASSIGN_SA8D(xop);
         LUMA_VAR(xop);
@@ -975,36 +1079,48 @@
     }
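
The ALL_LUMA_CU_TYPED_S macro added in this diff builds kernel names such as x265_dct8_sse2 by token pasting, and setupAssemblyPrimitives() lets each later CPU-feature block overwrite the pointers set by earlier ones. The sketch below illustrates both ideas in isolation. Everything in it (the demo_ functions, DEMO_CU_S, DEMO_CPU_SSE2, PrimTable) is a hypothetical stand-in, not the x265 primitive table.

```cpp
#include <cassert>

typedef int (*dct_t)(int);

enum { BLOCK_8x8, BLOCK_16x16, NUM_BLOCKS };
enum { DEMO_CPU_SSE2 = 1 };

struct CuPrims   { dct_t dct; };
struct PrimTable { CuPrims cu[NUM_BLOCKS]; };

// Hypothetical kernels whose names fuse the block size to the function name
static int demo_dct8_c(int x)     { return x + 8; }
static int demo_dct16_c(int x)    { return x + 16; }
static int demo_dct8_sse2(int x)  { return x + 800; }
static int demo_dct16_sse2(int x) { return x + 1600; }

// Pastes demo_ + fname + <size>_ + cpu into one identifier per block size,
// e.g. DEMO_CU_S(dct, dct, sse2) assigns demo_dct8_sse2 and demo_dct16_sse2
#define DEMO_CU_S(prim, fname, cpu) \
    p.cu[BLOCK_8x8].prim   = demo_ ## fname ## 8_ ## cpu; \
    p.cu[BLOCK_16x16].prim = demo_ ## fname ## 16_ ## cpu

// Start from the plain C kernels, then let each supported CPU feature
// overwrite the entries it provides a faster version for
void setupDemoPrimitives(PrimTable& p, int cpuMask)
{
    DEMO_CU_S(dct, dct, c);
    if (cpuMask & DEMO_CPU_SSE2)
    {
        DEMO_CU_S(dct, dct, sse2);
    }
}
```

The ordering matters in this style of dispatch: whichever block runs last for a given table slot wins, which is why the diff's SSSE3 block can replace the idct pointer already set in the SSE2 block.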

 
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x64_avx;
+
+        // 48 X 64
+        p.pu[LUMA_48x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_48x64_avx;
+
+        // copy_ss primitives
+        // 16 X N
+        p.cu[BLOCK_16x16].copy_ss = x265_blockcopy_ss_16x16_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_ss = x265_blockcopy_ss_16x16_avx;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_ss = x265_blockcopy_ss_16x32_avx;
+
+        // 32 X N
+        p.cu[BLOCK_32x32].copy_ss = x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ss = x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ss = x265_blockcopy_ss_32x64_avx;
+
+        // 64 X N
+        p.cu[BLOCK_64x64].copy_ss = x265_blockcopy_ss_64x64_avx;
+
+        // copy_ps primitives
+        // 16 X N
+        p.cu[BLOCK_16x16].copy_ps = (copy_ps_t)x265_blockcopy_ss_16x16_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_ps = (copy_ps_t)x265_blockcopy_ss_16x16_avx;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_ps = (copy_ps_t)x265_blockcopy_ss_16x32_avx;
+
+        // 32 X N
+        p.cu[BLOCK_32x32].copy_ps = (copy_ps_t)x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ps = (copy_ps_t)x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ps = (copy_ps_t)x265_blockcopy_ss_32x64_avx;
+
+        // 64 X N
+        p.cu[BLOCK_64x64].copy_ps = (copy_ps_t)x265_blockcopy_ss_64x64_avx;
+
+        // copy_sp primitives
+        // 16 X N
+        p.cu[BLOCK_16x16].copy_sp = (copy_sp_t)x265_blockcopy_ss_16x16_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_sp = (copy_sp_t)x265_blockcopy_ss_16x16_avx;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_sp = (copy_sp_t)x265_blockcopy_ss_16x32_avx;
+
+        // 32 X N
+        p.cu[BLOCK_32x32].copy_sp = (copy_sp_t)x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_sp = (copy_sp_t)x265_blockcopy_ss_32x32_avx;
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_sp = (copy_sp_t)x265_blockcopy_ss_32x64_avx;
+
+        // 64 X N
+        p.cu[BLOCK_64x64].copy_sp = (copy_sp_t)x265_blockcopy_ss_64x64_avx;
+
         p.frameInitLowres = x265_frame_init_lowres_core_avx;
+
+        p.pu[LUMA_64x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x16_avx;
+        p.pu[LUMA_64x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x32_avx;
+        p.pu[LUMA_64x48].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x48_avx;
+        p.pu[LUMA_64x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x64_avx;
     }
     if (cpuMask & X265_CPU_XOP)
     {
-        p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_xop;
+        //p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_xop; this one is broken
         ALL_LUMA_PU(satd, pixel_satd, xop);
         ASSIGN_SA8D(xop);
         LUMA_VAR(xop);
@@ -975,36 +1079,48 @@
     }
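
Note on the hunks above: they fill a table of function pointers (`p.cu[...]`, `p.pu[...]`, `p.chroma[...]`) according to the bits set in `cpuMask`, with later, more capable instruction sets overriding earlier assignments. The following is a minimal sketch of that dispatch pattern in plain C, not the x265 implementation. The names `Primitives`, `setup_primitives`, `CPU_AVX`, `fast_calls` and the `copy_ss_*` helpers are hypothetical stand-ins introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bit; the real X265_CPU_* values are not shown in this diff. */
enum { CPU_AVX = 1 << 1 };
enum { BLOCK_16x16, BLOCK_32x32, NUM_BLOCKS };

typedef void (*copy_ss_t)(int16_t* dst, intptr_t dstStride,
                          const int16_t* src, intptr_t srcStride);

typedef struct { copy_ss_t copy_ss[NUM_BLOCKS]; } Primitives;

static int fast_calls = 0; /* counts calls to the "accelerated" stand-in */

/* Portable reference copy of a size x size block of 16-bit samples. */
static void copy_ss(int16_t* dst, intptr_t dstStride,
                    const int16_t* src, intptr_t srcStride, int size)
{
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            dst[y * dstStride + x] = src[y * srcStride + x];
}

static void copy_ss_16_c(int16_t* d, intptr_t ds, const int16_t* s, intptr_t ss)
{
    copy_ss(d, ds, s, ss, 16);
}

static void copy_ss_32_c(int16_t* d, intptr_t ds, const int16_t* s, intptr_t ss)
{
    copy_ss(d, ds, s, ss, 32);
}

/* Stand-in for an assembly routine; it only counts calls and reuses the C copy. */
static void copy_ss_16_fast(int16_t* d, intptr_t ds, const int16_t* s, intptr_t ss)
{
    fast_calls++;
    copy_ss(d, ds, s, ss, 16);
}

/* Install C defaults first, then override entries the CPU can accelerate. */
static void setup_primitives(Primitives* p, int cpuMask)
{
    assert(p != NULL);
    p->copy_ss[BLOCK_16x16] = copy_ss_16_c;
    p->copy_ss[BLOCK_32x32] = copy_ss_32_c;
    if (cpuMask & CPU_AVX)
        p->copy_ss[BLOCK_16x16] = copy_ss_16_fast;
}
```

Callers always go through the table (for example `p.copy_ss[BLOCK_16x16](dst, 16, src, 16)`), so an entry that has no accelerated version simply keeps its C fallback.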
​

x265_1.5.tar.gz/source/common/x86/blockcopy8.asm -> x265_1.6.tar.gz/source/common/x86/blockcopy8.asm Changed

 
@@ -47,15 +47,15 @@
 cglobal blockcopy_pp_2x4, 4, 7, 0
     mov    r4w,    [r2]
     mov    r5w,    [r2 + r3]
-    lea    r2,     [r2 + r3 * 2]
-    mov    r6w,    [r2]
+    mov    r6w,    [r2 + 2 * r3]
+    lea    r3,     [r3 + 2 * r3]
     mov    r3w,    [r2 + r3]
 
-    mov    [r0],         r4w
-    mov    [r0 + r1],    r5w
-    lea    r0,           [r0 + 2 * r1]
-    mov    [r0],         r6w
-    mov    [r0 + r1],    r3w
+    mov    [r0],          r4w
+    mov    [r0 + r1],     r5w
+    mov    [r0 + 2 * r1], r6w
+    lea    r1,            [r1 + 2 * r1]
+    mov    [r0 + r1],     r3w
 RET
 
 ;-----------------------------------------------------------------------------
@@ -63,37 +63,29 @@
 ;-----------------------------------------------------------------------------
 INIT_XMM sse2
 cglobal blockcopy_pp_2x8, 4, 7, 0
-    mov     r4w,     [r2]
-    mov     r5w,     [r2 + r3]
-    mov     r6w,     [r2 + 2 * r3]
+    lea     r5,      [3 * r1]
+    lea     r6,      [3 * r3]
 
-    mov     [r0],            r4w
-    mov     [r0 + r1],       r5w
-    mov     [r0 + 2 * r1],   r6w
-
-    lea     r0,             [r0 + 2 * r1]
-    lea     r2,             [r2 + 2 * r3]
-
-    mov     r4w,             [r2 + r3]
-    mov     r5w,             [r2 + 2 * r3]
-
-    mov     [r0 + r1],       r4w
-    mov     [r0 + 2 * r1],   r5w
-
-    lea     r0,              [r0 + 2 * r1]
-    lea     r2,              [r2 + 2 * r3]
-
-    mov     r4w,             [r2 + r3]
-    mov     r5w,             [r2 + 2 * r3]
-
-    mov     [r0 + r1],       r4w
-    mov     [r0 + 2 * r1],   r5w
-
-    lea     r0,              [r0 + 2 * r1]
-    lea     r2,              [r2 + 2 * r3]
-
-    mov     r4w,             [r2 + r3]
-    mov     [r0 + r1],       r4w
+    mov     r4w,           [r2]
+    mov     [r0],          r4w
+    mov     r4w,           [r2 + r3]
+    mov     [r0 + r1],     r4w
+    mov     r4w,           [r2 + 2 * r3]
+    mov     [r0 + 2 * r1], r4w
+    mov     r4w,           [r2 + r6]
+    mov     [r0 + r5],     r4w
+
+    lea     r2,            [r2 + 4 * r3]
+    mov     r4w,           [r2]
+    lea     r0,            [r0 + 4 * r1]
+    mov     [r0],          r4w
+
+    mov     r4w,           [r2 + r3]
+    mov     [r0 + r1],     r4w
+    mov     r4w,           [r2 + 2 * r3]
+    mov     [r0 + 2 * r1], r4w
+    mov     r4w,           [r2 + r6]
+    mov     [r0 + r5],     r4w
     RET
 
 ;-----------------------------------------------------------------------------
@@ -101,16 +93,30 @@
 ;-----------------------------------------------------------------------------
 INIT_XMM sse2
 cglobal blockcopy_pp_2x16, 4, 7, 0
-    mov     r6d,    16/2
-.loop:
-    mov     r4w,    [r2]
-    mov     r5w,    [r2 + r3]
-    dec     r6d
-    lea     r2,     [r2 + r3 * 2]
-    mov     [r0],       r4w
-    mov     [r0 + r1],  r5w
-    lea     r0,     [r0 + r1 * 2]
-    jnz     .loop
+    lea     r5,      [3 * r1]
+    lea     r6,      [3 * r3]
+
+    mov     r4w,           [r2]
+    mov     [r0],          r4w
+    mov     r4w,           [r2 + r3]
+    mov     [r0 + r1],     r4w
+    mov     r4w,           [r2 + 2 * r3]
+    mov     [r0 + 2 * r1], r4w
+    mov     r4w,           [r2 + r6]
+    mov     [r0 + r5],     r4w
+
+%rep 3
+    lea     r2,            [r2 + 4 * r3]
+    mov     r4w,           [r2]
+    lea     r0,            [r0 + 4 * r1]
+    mov     [r0],          r4w
+    mov     r4w,           [r2 + r3]
+    mov     [r0 + r1],     r4w
+    mov     r4w,           [r2 + 2 * r3]
+    mov     [r0 + 2 * r1], r4w
+    mov     r4w,           [r2 + r6]
+    mov     [r0 + r5],     r4w
+%endrep
     RET
 
 
@@ -145,115 +151,130 @@
     RET
 
 ;-----------------------------------------------------------------------------
+; void blockcopy_pp_4x8(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)
+;-----------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal blockcopy_pp_4x8, 4, 6, 4
+
+    lea     r4,    [3 * r1]
+    lea     r5,    [3 * r3]
+
+    movd     m0,     [r2]
+    movd     m1,     [r2 + r3]
+    movd     m2,     [r2 + 2 * r3]
+    movd     m3,     [r2 + r5]
+
+    movd     [r0],          m0
+    movd     [r0 + r1],     m1
+    movd     [r0 + 2 * r1], m2
+    movd     [r0 + r4],     m3
+
+    lea      r2,     [r2 + 4 * r3]
+    movd     m0,     [r2]
+    movd     m1,     [r2 + r3]
+    movd     m2,     [r2 + 2 * r3]
+    movd     m3,     [r2 + r5]
+
+    lea      r0,            [r0 + 4 * r1]
+    movd     [r0],          m0
+    movd     [r0 + r1],     m1
+    movd     [r0 + 2 * r1], m2
+    movd     [r0 + r4],     m3
+    RET
+
+;-----------------------------------------------------------------------------
 ; void blockcopy_pp_%1x%2(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride)
 ;-----------------------------------------------------------------------------
 %macro BLOCKCOPY_PP_W4_H8 2
 INIT_XMM sse2
-cglobal blockcopy_pp_%1x%2, 4, 5, 4
+cglobal blockcopy_pp_%1x%2, 4, 7, 4
     mov    r4d,    %2/8
+    lea    r5,     [3 * r1]
+    lea    r6,     [3 * r3]
+
 .loop:
     movd     m0,     [r2]
     movd     m1,     [r2 + r3]
-    lea      r2,     [r2 + 2 * r3]
-    movd     m2,     [r2]
-    movd     m3,     [r2 + r3]
+    movd     m2,     [r2 + 2 * r3]
+    movd     m3,     [r2 + r6]
 
-    movd     [r0],                m0
-    movd     [r0 + r1],           m1
-    lea      r0,                  [r0 + 2 * r1]
-    movd     [r0],                m2
-    movd     [r0 + r1],           m3
+    movd     [r0],          m0
+    movd     [r0 + r1],     m1
+    movd     [r0 + 2 * r1], m2
+    movd     [r0 + r5],     m3
 
-    lea       r0,     [r0 + 2 * r1]
-    lea       r2,     [r2 + 2 * r3]
+    lea      r2,     [r2 + 4 * r3]
     movd     m0,     [r2]
     movd     m1,     [r2 + r3]
-    lea      r2,     [r2 + 2 * r3]
-    movd     m2,     [r2]
-    movd     m3,     [r2 + r3]
+    movd     m2,     [r2 + 2 * r3]
+    movd     m3,     [r2 + r6]
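
Note on the hunks above: the rewritten copy routines compute `3 * stride` once (the `lea ..., [3 * r1]` and `lea ..., [3 * r3]` lines) so that four rows can be addressed from a single base pointer before it advances by `4 * stride`. A rough C equivalent of that access pattern is sketched below. The function name `blockcopy_pp_rows4`, the explicit `width`/`height` parameters, and the 8-bit `pixel` typedef are assumptions made for illustration; the assembly versions are specialised per block size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t pixel; /* assumption: 8-bit build */

/* Copy a width x height block, four rows per iteration.
 * height must be a multiple of 4. */
static void blockcopy_pp_rows4(pixel* dst, intptr_t dstStride,
                               const pixel* src, intptr_t srcStride,
                               int width, int height)
{
    /* Precomputed once, like the lea ..., [3 * stride] instructions. */
    const intptr_t dst3 = 3 * dstStride;
    const intptr_t src3 = 3 * srcStride;

    assert(height % 4 == 0);
    for (int y = 0; y < height; y += 4)
    {
        memcpy(dst,                 src,                 (size_t)width);
        memcpy(dst + dstStride,     src + srcStride,     (size_t)width);
        memcpy(dst + 2 * dstStride, src + 2 * srcStride, (size_t)width);
        memcpy(dst + dst3,          src + src3,          (size_t)width);

        /* One pointer update per four rows instead of one per two rows. */
        dst += 4 * dstStride;
        src += 4 * srcStride;
    }
}
```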
 
​

x265_1.5.tar.gz/source/common/x86/blockcopy8.h -> x265_1.6.tar.gz/source/common/x86/blockcopy8.h Changed

 
@@ -48,6 +48,12 @@
 void x265_cpy1Dto2D_shr_8_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift);
 void x265_cpy1Dto2D_shr_16_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift);
 void x265_cpy1Dto2D_shr_32_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift);
+void x265_cpy2Dto1D_shl_8_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
+void x265_cpy2Dto1D_shl_16_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
+void x265_cpy2Dto1D_shl_32_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
+void x265_cpy2Dto1D_shr_8_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
+void x265_cpy2Dto1D_shr_16_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
+void x265_cpy2Dto1D_shr_32_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
 uint32_t x265_copy_cnt_4_sse4(int16_t* dst, const int16_t* src, intptr_t srcStride);
 uint32_t x265_copy_cnt_8_sse4(int16_t* dst, const int16_t* src, intptr_t srcStride);
 uint32_t x265_copy_cnt_16_sse4(int16_t* dst, const int16_t* src, intptr_t srcStride);
@@ -198,6 +204,15 @@
 void x265_blockcopy_ss_64x32_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
 void x265_blockcopy_ss_64x48_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
 void x265_blockcopy_ss_64x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_32x8_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_32x16_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_32x24_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_32x32_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_32x48_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_32x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_48x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_24x32_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
+void x265_blockcopy_ss_24x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
 
 void x265_blockcopy_pp_32x8_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
 void x265_blockcopy_pp_32x16_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
@@ -205,9 +220,36 @@
 void x265_blockcopy_pp_32x32_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
 void x265_blockcopy_pp_32x48_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
 void x265_blockcopy_pp_32x64_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_pp_64x16_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_pp_64x32_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_pp_64x48_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_pp_64x64_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_pp_48x64_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb);
 
 void x265_blockfill_s_16x16_avx2(int16_t* dst, intptr_t dstride, int16_t val);
 void x265_blockfill_s_32x32_avx2(int16_t* dst, intptr_t dstride, int16_t val);
+// copy_sp primitives
+// 16 x N
+void x265_blockcopy_sp_16x16_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb);
+void x265_blockcopy_sp_16x32_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb);
+
+// 32 x N
+void x265_blockcopy_sp_32x32_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb);
+void x265_blockcopy_sp_32x64_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb);
+
+// 64 x N
+void x265_blockcopy_sp_64x64_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb);
+// copy_ps primitives
+// 16 x N
+void x265_blockcopy_ps_16x16_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_ps_16x32_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+
+// 32 x N
+void x265_blockcopy_ps_32x32_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+void x265_blockcopy_ps_32x64_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb);
+
+// 64 x N
+void x265_blockcopy_ps_64x64_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb);
 
 #undef BLOCKCOPY_COMMON
 #undef BLOCKCOPY_SS_PP
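
Note on the declarations above: the new `x265_cpy2Dto1D_shl_*_avx2` prototypes take a destination, a source, a source stride and a shift. Going by the name and signature alone, a plausible reading is "read a square block from a strided 2D buffer, shift each coefficient left, and write it out contiguously". The C sketch below implements that reading; it is an assumption, not the x265 reference code, and the explicit `size` parameter is hypothetical (the real functions encode the size in their names).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed semantics: dst is a contiguous size*size array, src is a 2D block
 * whose rows are srcStride elements apart. */
static void cpy2Dto1D_shl(int16_t* dst, const int16_t* src,
                          intptr_t srcStride, int shift, int size)
{
    assert(shift >= 0 && shift < 15);
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
        {
            /* Multiply instead of << so that negative inputs stay well defined. */
            dst[x] = (int16_t)(src[x] * (1 << shift));
        }
        src += srcStride; /* next row of the strided source */
        dst += size;      /* destination rows are packed back to back */
    }
}
```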
​

x265_1.5.tar.gz/source/common/x86/const-a.asm -> x265_1.6.tar.gz/source/common/x86/const-a.asm Changed

 
@@ -6,7 +6,7 @@
 ;* Authors: Loren Merritt <lorenm@u.washington.edu>
 ;*          Fiona Glaser <fiona@x264.com>
 ;*          Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>
-;*
+;*          Praveen Kumar Tiwari <praveen@multicorewareinc.com>
 ;* This program is free software; you can redistribute it and/or modify
 ;* it under the terms of the GNU General Public License as published by
 ;* the Free Software Foundation; either version 2 of the License, or
@@ -37,11 +37,14 @@
 const pw_32,       times 16 dw 32
 const pw_128,      times 16 dw 128
 const pw_256,      times 16 dw 256
+const pw_257,      times 16 dw 257
 const pw_512,      times 16 dw 512
 const pw_1023,     times 8  dw 1023
+ALIGN 32
 const pw_1024,     times 16 dw 1024
 const pw_4096,     times 16 dw 4096
 const pw_00ff,     times 16 dw 0x00ff
+ALIGN 32
 const pw_pixel_max,times 16 dw ((1 << BIT_DEPTH)-1)
 const deinterleave_shufd, dd 0,4,1,5,2,6,3,7
 const pb_unpackbd1, times 2 db 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
@@ -50,16 +53,16 @@
 const pb_unpackwq2, db 4,5,4,5,4,5,4,5,6,7,6,7,6,7,6,7
 const pw_swap,      times 2 db 6,7,4,5,2,3,0,1
 
-const pb_2,        times 16 db 2
-const pb_4,        times 16 db 4
-const pb_16,       times 16 db 16
-const pb_64,       times 16 db 64
+const pb_2,        times 32 db 2
+const pb_4,        times 32 db 4
+const pb_16,       times 32 db 16
+const pb_64,       times 32 db 64
 const pb_01,       times  8 db 0,1
 const pb_0,        times 16 db 0
 const pb_a1,       times 16 db 0xa1
 const pb_3,        times 16 db 3
-const pb_8,        times 16 db 8
-const pb_32,       times 16 db 32
+const pb_8,        times 32 db 8
+const pb_32,       times 32 db 32
 const pb_128,      times 16 db 128
 const pb_shuf8x8c, db 0,0,0,0,2,2,2,2,4,4,4,4,6,6,6,6
 
@@ -72,7 +75,7 @@
 const pw_256,      times 8 dw 256
 const pw_32_0,     times 4 dw 32,
                    times 4 dw 0
-const pw_2000,     times 8 dw 0x2000
+const pw_2000,     times 16 dw 0x2000
 const pw_8000,     times 8 dw 0x8000
 const pw_3fff,     times 8 dw 0x3fff
 const pw_ppppmmmm, dw 1,1,1,1,-1,-1,-1,-1
@@ -80,7 +83,7 @@
 const pw_pmpmpmpm, dw 1,-1,1,-1,1,-1,1,-1
 const pw_pmmpzzzz, dw 1,-1,-1,1,0,0,0,0
 const pd_1,        times 8 dd 1
-const pd_2,        times 4 dd 2
+const pd_2,        times 8 dd 2
 const pd_4,        times 4 dd 4
 const pd_8,        times 4 dd 8
 const pd_16,       times 4 dd 16
​

x265_1.5.tar.gz/source/common/x86/dct8.asm -> x265_1.6.tar.gz/source/common/x86/dct8.asm Changed

@@ -748,6 +748,368 @@
     movhps      [r1 + r2], m1
     RET
 
+;-------------------------------------------------------
+; void dct8(const int16_t* src, int16_t* dst, intptr_t srcStride)
+;-------------------------------------------------------
+INIT_XMM sse2
+cglobal dct8, 3,6,8,0-16*mmsize
+    ;------------------------
+    ; Stack Mapping(dword)
+    ;------------------------
+    ; Row0[0-3] Row1[0-3]
+    ; ...
+    ; Row6[0-3] Row7[0-3]
+    ; Row0[0-3] Row7[0-3]
+    ; ...
+    ; Row6[4-7] Row7[4-7]
+    ;------------------------
+%if BIT_DEPTH == 10
+  %define       DCT_SHIFT1 4
+  %define       DCT_ADD1 [pd_8]
+%elif BIT_DEPTH == 8
+  %define       DCT_SHIFT1 2
+  %define       DCT_ADD1 [pd_2]
+%else
+  %error Unsupported BIT_DEPTH!
+%endif
+%define         DCT_ADD2 [pd_256]
+%define         DCT_SHIFT2 9
+
+    add         r2, r2
+    lea         r3, [r2 * 3]
+    mov         r5, rsp
+%assign x 0
+%rep 2
+    movu        m0, [r0]
+    movu        m1, [r0 + r2]
+    movu        m2, [r0 + r2 * 2]
+    movu        m3, [r0 + r3]
+
+    punpcklwd   m4, m0, m1
+    punpckhwd   m0, m1
+    punpcklwd   m5, m2, m3
+    punpckhwd   m2, m3
+    punpckldq   m1, m4, m5          ; m1 = [1 0]
+    punpckhdq   m4, m5              ; m4 = [3 2]
+    punpckldq   m3, m0, m2
+    punpckhdq   m0, m2
+    pshufd      m2, m3, 0x4E        ; m2 = [4 5]
+    pshufd      m0, m0, 0x4E        ; m0 = [6 7]
+
+    paddw       m3, m1, m0
+    psubw       m1, m0              ; m1 = [d1 d0]
+    paddw       m0, m4, m2
+    psubw       m4, m2              ; m4 = [d3 d2]
+    punpcklqdq  m2, m3, m0          ; m2 = [s2 s0]
+    punpckhqdq  m3, m0
+    pshufd      m3, m3, 0x4E        ; m3 = [s1 s3]
+
+    punpcklwd   m0, m1, m4          ; m0 = [d2/d0]
+    punpckhwd   m1, m4              ; m1 = [d3/d1]
+    punpckldq   m4, m0, m1          ; m4 = [d3 d1 d2 d0]
+    punpckhdq   m0, m1              ; m0 = [d3 d1 d2 d0]
+
+    ; odd
+    lea         r4, [tab_dct8_1]
+    pmaddwd     m1, m4, [r4 + 0*16]
+    pmaddwd     m5, m0, [r4 + 0*16]
+    pshufd      m1, m1, 0xD8
+    pshufd      m5, m5, 0xD8
+    mova        m7, m1
+    punpckhqdq  m7, m5
+    punpcklqdq  m1, m5
+    paddd       m1, m7
+    paddd       m1, DCT_ADD1
+    psrad       m1, DCT_SHIFT1
+  %if x == 1
+    pshufd      m1, m1, 0x1B
+  %endif
+    mova        [r5 + 1*2*mmsize], m1 ; Row 1
+
+    pmaddwd     m1, m4, [r4 + 1*16]
+    pmaddwd     m5, m0, [r4 + 1*16]
+    pshufd      m1, m1, 0xD8
+    pshufd      m5, m5, 0xD8
+    mova        m7, m1
+    punpckhqdq  m7, m5
+    punpcklqdq  m1, m5
+    paddd       m1, m7
+    paddd       m1, DCT_ADD1
+    psrad       m1, DCT_SHIFT1
+  %if x == 1
+    pshufd      m1, m1, 0x1B
+  %endif
+    mova        [r5 + 3*2*mmsize], m1 ; Row 3
+
+    pmaddwd     m1, m4, [r4 + 2*16]
+    pmaddwd     m5, m0, [r4 + 2*16]
+    pshufd      m1, m1, 0xD8
+    pshufd      m5, m5, 0xD8
+    mova        m7, m1
+    punpckhqdq  m7, m5
+    punpcklqdq  m1, m5
+    paddd       m1, m7
+    paddd       m1, DCT_ADD1
+    psrad       m1, DCT_SHIFT1
+  %if x == 1
+    pshufd      m1, m1, 0x1B
+  %endif
+    mova        [r5 + 5*2*mmsize], m1 ; Row 5
+
+    pmaddwd     m4, [r4 + 3*16]
+    pmaddwd     m0, [r4 + 3*16]
+    pshufd      m4, m4, 0xD8
+    pshufd      m0, m0, 0xD8
+    mova        m7, m4
+    punpckhqdq  m7, m0
+    punpcklqdq  m4, m0
+    paddd       m4, m7
+    paddd       m4, DCT_ADD1
+    psrad       m4, DCT_SHIFT1
+  %if x == 1
+    pshufd      m4, m4, 0x1B
+  %endif
+    mova        [r5 + 7*2*mmsize], m4; Row 7
+
+    ; even
+    lea         r4, [tab_dct4]
+    paddw       m0, m2, m3          ; m0 = [EE1 EE0]
+    pshufd      m0, m0, 0xD8
+    pshuflw     m0, m0, 0xD8
+    pshufhw     m0, m0, 0xD8
+    psubw       m2, m3              ; m2 = [EO1 EO0]
+    pmullw      m2, [pw_ppppmmmm]
+    pshufd      m2, m2, 0xD8
+    pshuflw     m2, m2, 0xD8
+    pshufhw     m2, m2, 0xD8
+    pmaddwd     m3, m0, [r4 + 0*16]
+    paddd       m3, DCT_ADD1
+    psrad       m3, DCT_SHIFT1
+  %if x == 1
+    pshufd      m3, m3, 0x1B
+  %endif
+    mova        [r5 + 0*2*mmsize], m3 ; Row 0
+    pmaddwd     m0, [r4 + 2*16]
+    paddd       m0, DCT_ADD1
+    psrad       m0, DCT_SHIFT1
+  %if x == 1
+    pshufd      m0, m0, 0x1B
+  %endif
+    mova        [r5 + 4*2*mmsize], m0 ; Row 4
+    pmaddwd     m3, m2, [r4 + 1*16]
+    paddd       m3, DCT_ADD1
+    psrad       m3, DCT_SHIFT1
+  %if x == 1
+    pshufd      m3, m3, 0x1B
+  %endif
+    mova        [r5 + 2*2*mmsize], m3 ; Row 2
+    pmaddwd     m2, [r4 + 3*16]
+    paddd       m2, DCT_ADD1
+    psrad       m2, DCT_SHIFT1
+  %if x == 1
+    pshufd      m2, m2, 0x1B
+  %endif
+    mova        [r5 + 6*2*mmsize], m2 ; Row 6
+
+  %if x != 1
+    lea         r0, [r0 + r2 * 4]
+    add         r5, mmsize
+  %endif
+%assign x x+1
+%endrep
+
+    mov         r0, rsp                 ; r0 = pointer to Low Part
+    lea         r4, [tab_dct8_2]
+
+%assign x 0
+%rep 4
+    mova        m0, [r0 + 0*2*mmsize]     ; [3 2 1 0]
+    mova        m1, [r0 + 1*2*mmsize]
+    paddd       m2, m0, [r0 + (0*2+1)*mmsize]
+    pshufd      m2, m2, 0x9C            ; m2 = [s2 s1 s3 s0]
+    paddd       m3, m1, [r0 + (1*2+1)*mmsize]
+    pshufd      m3, m3, 0x9C            ; m3 = ^^
+    psubd       m0, [r0 + (0*2+1)*mmsize]     ; m0 = [d3 d2 d1 d0]
+    psubd       m1, [r0 + (1*2+1)*mmsize]     ; m1 = ^^
+
+    ; even
+    pshufd      m4, m2, 0xD8
+    pshufd      m3, m3, 0xD8
+    mova        m7, m4
+    punpckhqdq  m7, m3
+    punpcklqdq  m4, m3
+    mova        m2, m4
+    paddd       m4, m7                  ; m4 = [EE1 EE0 EE1 EE0]
+    psubd       m2, m7                  ; m2 = [EO1 EO0 EO1 EO0]
+
+    pslld       m4, 6                   ; m4 = [64*EE1 64*EE0]
+    mova        m5, m2
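
Note on the dct8 hunk above (shown truncated): the rounding constants and shifts it defines (DCT_ADD1/DCT_SHIFT1 of 2/2 for 8-bit and 8/4 for 10-bit, DCT_ADD2/DCT_SHIFT2 of 256/9) match the usual HEVC 8-point forward transform, where each pass adds half of the divisor before an arithmetic right shift. The following is a minimal scalar sketch of one 1-D pass, assuming the standard HEVC coefficient set (64, 83, 36, 89, 75, 50, 18). The name dct8_1d is hypothetical and is not part of x265; the SSE2 routine works on packed, transposed data and is not reproduced here.

```c
#include <stdint.h>

/* Hypothetical scalar reference for one 8-point forward DCT pass.
 * shift = 2 with 8-bit input (or 4 with 10-bit) models the first pass,
 * shift = 9 models the second pass; add is always half of 1 << shift.
 * Right-shifting a negative int32_t is assumed to be arithmetic,
 * as it is for psrad in the assembly. */
void dct8_1d(const int32_t in[8], int32_t out[8], int shift)
{
    const int32_t add = 1 << (shift - 1);
    int32_t s[4], d[4];

    /* butterfly: sums feed the even outputs, differences the odd ones */
    for (int k = 0; k < 4; k++) {
        s[k] = in[k] + in[7 - k];
        d[k] = in[k] - in[7 - k];
    }
    int32_t ee0 = s[0] + s[3], eo0 = s[0] - s[3];
    int32_t ee1 = s[1] + s[2], eo1 = s[1] - s[2];

    /* even rows (0, 4, 2, 6) */
    out[0] = (64 * ee0 + 64 * ee1 + add) >> shift;
    out[4] = (64 * ee0 - 64 * ee1 + add) >> shift;
    out[2] = (83 * eo0 + 36 * eo1 + add) >> shift;
    out[6] = (36 * eo0 - 83 * eo1 + add) >> shift;

    /* odd rows (1, 3, 5, 7) */
    out[1] = (89 * d[0] + 75 * d[1] + 50 * d[2] + 18 * d[3] + add) >> shift;
    out[3] = (75 * d[0] - 18 * d[1] - 89 * d[2] - 50 * d[3] + add) >> shift;
    out[5] = (50 * d[0] - 89 * d[1] + 18 * d[2] + 75 * d[3] + add) >> shift;
    out[7] = (18 * d[0] - 50 * d[1] + 75 * d[2] - 89 * d[3] + add) >> shift;
}
```

A full 8x8 transform would run this pass over one dimension with the first-pass shift and then over the other dimension with shift 9, which is roughly what the two %rep loops in the hunk do with vector registers.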

 

x265_1.5.tar.gz/source/common/x86/dct8.h -> x265_1.6.tar.gz/source/common/x86/dct8.h Changed

 
@@ -24,6 +24,7 @@
 #ifndef X265_DCT8_H
 #define X265_DCT8_H
 void x265_dct4_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride);
+void x265_dct8_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dst4_ssse3(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dct8_sse4(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dct4_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);

x265_1.5.tar.gz/source/common/x86/intrapred.h -> x265_1.6.tar.gz/source/common/x86/intrapred.h Changed

@@ -4,7 +4,7 @@
  * Copyright (C) 2003-2013 x264 project
  *
  * Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>
- *
+ *          Praveen Kumar Tiwari <praveen@multicorewareinc.com>
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
@@ -26,11 +26,19 @@
 #ifndef X265_INTRAPRED_H
 #define X265_INTRAPRED_H
 
-void x265_intra_pred_dc4_sse4 (pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc4_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc8_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc16_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc32_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+void x265_intra_pred_dc4_sse4(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
 void x265_intra_pred_dc8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 void x265_intra_pred_dc16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 void x265_intra_pred_dc32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 
+void x265_intra_pred_planar4_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
+void x265_intra_pred_planar8_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
+void x265_intra_pred_planar16_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
+void x265_intra_pred_planar32_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar4_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
@@ -39,6 +47,15 @@
 #define DECL_ANG(bsize, mode, cpu) \
     void x265_intra_pred_ang ## bsize ## _ ## mode ## _ ## cpu(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 
+DECL_ANG(4, 2, sse2);
+DECL_ANG(4, 3, sse2);
+DECL_ANG(4, 4, sse2);
+DECL_ANG(4, 5, sse2);
+DECL_ANG(4, 6, sse2);
+DECL_ANG(4, 7, sse2);
+DECL_ANG(4, 8, sse2);
+DECL_ANG(4, 9, sse2);
+
 DECL_ANG(4, 2, ssse3);
 DECL_ANG(4, 3, sse4);
 DECL_ANG(4, 4, sse4);
@@ -157,6 +174,44 @@
 DECL_ANG(32, 33, sse4);
 
 #undef DECL_ANG
+void x265_intra_pred_ang8_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_34_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_2_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_26_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
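
Note on the intrapred.h hunks above: DECL_ANG relies on the preprocessor's ## operator to paste block size, mode and CPU suffix into one function name, so DECL_ANG(4, 2, sse2) declares a function whose name ends in ang4_2_sse2. The sketch below illustrates that expansion under stated assumptions: every identifier is prefixed demo_ and is hypothetical, pixel is assumed to be a 16-bit type, and the function body is a placeholder rather than a real angular predictor. The macro here also leaves out the trailing semicolon that the header's version carries, so that the call site supplies it.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint16_t pixel;   /* assumption: 16-bit samples */

/* Same token-pasting pattern as the header's DECL_ANG, with a demo_ prefix. */
#define DEMO_DECL_ANG(bsize, mode, cpu) \
    void demo_intra_pred_ang ## bsize ## _ ## mode ## _ ## cpu( \
        pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter)

/* Expands to a prototype for demo_intra_pred_ang4_2_sse2(...). */
DEMO_DECL_ANG(4, 2, sse2);

/* Placeholder body: fills the 4x4 block with the first reference sample. */
void demo_intra_pred_ang4_2_sse2(pixel* dst, intptr_t dstStride,
                                 const pixel* srcPix, int dirMode, int bFilter)
{
    (void)dirMode;
    (void)bFilter;
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            dst[y * dstStride + x] = srcPix[0];
}

/* All generated functions share one signature, so a single pointer type fits. */
typedef void (*demo_intra_pred_fn)(pixel*, intptr_t, const pixel*, int, int);
```

Because every variant has the same signature, a caller could store whichever variant it selects in a demo_intra_pred_fn and invoke it without knowing the instruction set behind it.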

 

x265_1.5.tar.gz/source/common/x86/intrapred16.asm -> x265_1.6.tar.gz/source/common/x86/intrapred16.asm Changed

@@ -65,6 +65,10 @@
 pw_planar16_1:        dw 15, 15, 15, 15, 15, 15, 15, 15
 pd_planar32_1:        dd 31, 31, 31, 31
 
+pw_planar32_1:        dw 31, 31, 31, 31, 31, 31, 31, 31
+pw_planar32_L:        dw 31, 30, 29, 28, 27, 26, 25, 24
+pw_planar32_H:        dw 23, 22, 21, 20, 19, 18, 17, 16
+
 const planar32_table
 %assign x 31
 %rep 8
@@ -82,15 +86,19 @@
 SECTION .text
 
 cextern pw_1
+cextern pw_2
 cextern pw_4
 cextern pw_8
 cextern pw_16
+cextern pw_32
 cextern pw_1023
 cextern pd_16
 cextern pd_32
 cextern pw_4096
 cextern multiL
 cextern multiH
+cextern multiH2
+cextern multiH3
 cextern multi_2Row
 cextern pw_swap
 cextern pb_unpackwq1
@@ -99,6 +107,592 @@
 ;-----------------------------------------------------------------------------------
 ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* above, int, int filter)
 ;-----------------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal intra_pred_dc4, 5,6,2
+    movh        m0,             [r2 + 18]          ; sumAbove
+    movh        m1,             [r2 + 2]           ; sumLeft
+
+    paddw       m0,             m1
+    pshuflw     m1,             m0, 0x4E
+    paddw       m0,             m1
+    pshuflw     m1,             m0, 0xB1
+    paddw       m0,             m1
+
+    test        r4d,            r4d
+
+    paddw       m0,             [pw_4]
+    psraw       m0,             3
+
+    ; store DC 4x4
+    movh        [r0],           m0
+    movh        [r0 + r1 * 2],  m0
+    movh        [r0 + r1 * 4],  m0
+    lea         r5,             [r0 + r1 * 4]
+    movh        [r5 + r1 * 2],  m0
+
+    ; do DC filter
+    jz          .end
+    movh        m1,             m0
+    psllw       m1,             1
+    paddw       m1,             [pw_2]
+    movd        r3d,            m1
+    paddw       m0,             m1
+    ; filter top
+    movh        m1,             [r2 + 2]
+    paddw       m1,             m0
+    psraw       m1,             2
+    movh        [r0],           m1             ; overwrite top-left pixel, we will update it later
+
+    ; filter top-left
+    movzx       r3d,            r3w
+    movzx       r4d, word       [r2 + 18]
+    add         r3d,            r4d
+    movzx       r4d, word       [r2 + 2]
+    add         r4d,            r3d
+    shr         r4d,            2
+    mov         [r0],           r4w
+
+    ; filter left
+    movu        m1,             [r2 + 20]
+    paddw       m1,             m0
+    psraw       m1,             2
+    movd        r3d,            m1
+    mov         [r0 + r1 * 2],  r3w
+    shr         r3d,            16
+    mov         [r0 + r1 * 4],  r3w
+    pextrw      r3d,            m1, 2
+    mov         [r5 + r1 * 2],  r3w
+.end:
+    RET
+
+;-----------------------------------------------------------------------------------
+; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* above, int, int filter)
+;-----------------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal intra_pred_dc8, 5, 8, 2
+    movu            m0,            [r2 + 34]
+    movu            m1,            [r2 + 2]
+
+    paddw           m0,            m1
+    movhlps         m1,            m0
+    paddw           m0,            m1
+    pshufd          m1,            m0, 1
+    paddw           m0,            m1
+    pmaddwd         m0,            [pw_1]
+
+    paddw           m0,            [pw_8]
+    psraw           m0,            4              ; sum = sum / 16
+    pshuflw         m0,            m0, 0
+    pshufd          m0,            m0, 0          ; m0 = word [dc_val ...]
+
+    test            r4d,           r4d
+
+    ; store DC 8x8
+    lea             r6,            [r1 + r1 * 4]
+    lea             r6,            [r6 + r1]
+    lea             r5,            [r6 + r1 * 4]
+    lea             r7,            [r6 + r1 * 8]
+    movu            [r0],          m0
+    movu            [r0 + r1 * 2], m0
+    movu            [r0 + r1 * 4], m0
+    movu            [r0 + r6],     m0
+    movu            [r0 + r1 * 8], m0
+    movu            [r0 + r5],     m0
+    movu            [r0 + r6 * 2], m0
+    movu            [r0 + r7],     m0
+
+    ; Do DC Filter
+    jz              .end
+    mova            m1,            [pw_2]
+    pmullw          m1,            m0
+    paddw           m1,            [pw_2]
+    movd            r4d,           m1             ; r4d = DC * 2 + 2
+    paddw           m1,            m0             ; m1 = DC * 3 + 2
+    pshuflw         m1,            m1, 0
+    pshufd          m1,            m1, 0          ; m1 = pixDCx3
+
+    ; filter top
+    movu            m0,            [r2 + 2]
+    paddw           m0,            m1
+    psraw           m0,            2
+    movu            [r0],          m0
+
+    ; filter top-left
+    movzx           r4d,           r4w
+    movzx           r3d, word      [r2 + 34]
+    add             r4d,           r3d
+    movzx           r3d, word      [r2 + 2]
+    add             r3d,           r4d
+    shr             r3d,           2
+    mov             [r0],          r3w
+
+    ; filter left
+    movu            m0,            [r2 + 36]
+    paddw           m0,            m1
+    psraw           m0,            2
+    movh            r3,            m0
+    mov             [r0 + r1 * 2], r3w
+    shr             r3,            16
+    mov             [r0 + r1 * 4], r3w
+    shr             r3,            16
+    mov             [r0 + r6],     r3w
+    shr             r3,            16
+    mov             [r0 + r1 * 8], r3w
+    pshufd          m0,            m0, 0x6E
+    movh            r3,            m0
+    mov             [r0 + r5],     r3w
+    shr             r3,            16
+    mov             [r0 + r6 * 2], r3w
+    shr             r3,            16
+    mov             [r0 + r7],     r3w
+.end:
+    RET
+
+;-------------------------------------------------------------------------------------------------------
+; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* left, pixel* above, int dirMode, int filter)
+;-------------------------------------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal intra_pred_dc16, 5, 10, 4
+    lea             r3,                  [r2 + 66]
+    add             r1,                  r1
+    movu            m0,                  [r3]
+    movu            m1,                  [r3 + 16]
+    movu            m2,                  [r2 + 2]
+    movu            m3,                  [r2 + 18]
+
+    paddw           m0,                  m1
+    paddw           m2,                  m3
+    paddw           m0,                  m2
+    movhlps         m1,                  m0
+    paddw           m0,                  m1
+    pshuflw         m1,                  m0, 0x6E
+    paddw           m0,                  m1
+    pmaddwd         m0,                  [pw_1]
+
+    paddw           m0,                  [pw_16]
+    psraw           m0,                  5
+    movd            r5d,                 m0
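
For readers following the SSE2 DC routines above, the following is a minimal scalar sketch in C of what they appear to compute. It is not taken from x265. The function name `intra_pred_dc_ref`, the `n` parameter, and the neighbour layout (corner at index 0, row above at 1..2N, left column at 2N+1..4N) are assumptions inferred from the byte offsets in the assembly (`[r2 + 2]` and `[r2 + 34]` for the 8x8 case with 16-bit pixels).

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t pixel; /* assumption: high-bit-depth build, 16-bit samples */

/* Scalar sketch of DC intra prediction for an n x n block.
 * Assumed neighbour layout (inferred, not confirmed by the diff):
 *   src[0]              top-left corner (not used here)
 *   src[1 .. 2n]        row above the block
 *   src[2n+1 .. 4n]     column to the left of the block
 * dstStride is counted in pixels, not bytes. */
static void intra_pred_dc_ref(pixel *dst, intptr_t dstStride,
                              const pixel *src, int n, int filter)
{
    assert(n > 0 && (n & (n - 1)) == 0); /* block size must be a power of two */

    const pixel *above = src + 1;
    const pixel *left  = src + 1 + 2 * n;

    /* Average of the n samples above and the n samples to the left,
     * rounded: (sum + n) / (2n). For n = 8 this is (sum + 8) >> 4. */
    unsigned sum = 0;
    for (int i = 0; i < n; i++)
        sum += above[i] + left[i];
    pixel dc = (pixel)((sum + (unsigned)n) / (2u * (unsigned)n));

    /* Store the flat DC block. */
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            dst[y * dstStride + x] = dc;

    if (!filter)
        return;

    /* Edge smoothing: the corner mixes both neighbours with 2*dc,
     * the rest of the first row and first column mix one neighbour with 3*dc. */
    dst[0] = (pixel)((above[0] + left[0] + 2u * dc + 2u) >> 2);
    for (int x = 1; x < n; x++)
        dst[x] = (pixel)((above[x] + 3u * dc + 2u) >> 2);
    for (int y = 1; y < n; y++)
        dst[y * dstStride] = (pixel)((left[y] + 3u * dc + 2u) >> 2);
}
```

Calling it with `n = 8` and `filter = 1` mirrors the store-then-filter order of `intra_pred_dc8`: the whole block is written first, then the top row, the corner, and the left column are overwritten.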

 

x265_1.5.tar.gz/source/common/x86/intrapred8.asm -> x265_1.6.tar.gz/source/common/x86/intrapred8.asm Changed

@@ -2,6 +2,7 @@
 ;* Copyright (C) 2013 x265 project
 ;*
 ;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>
+;*          Praveen Kumar Tiwari <praveen@multicorewareinc.com>
 ;*
 ;* This program is free software; you can redistribute it and/or modify
 ;* it under the terms of the GNU General Public License as published by
@@ -26,11 +27,15 @@
 
 SECTION_RODATA 32
 
+intra_pred_shuff_0_8:    times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
+
 pb_0_8        times 8 db  0,  8
 pb_unpackbw1  times 2 db  1,  8,  2,  8,  3,  8,  4,  8
 pb_swap8:     times 2 db  7,  6,  5,  4,  3,  2,  1,  0
 c_trans_4x4           db  0,  4,  8, 12,  1,  5,  9, 13,  2,  6, 10, 14,  3,  7, 11, 15
-tab_Si:               db  0,  1,  2,  3,  4,  5,  6,  7,  0,  1,  2,  3,  4,  5,  6,  7
+const tab_S1,         db 15, 14, 12, 11, 10,  9,  7,  6,  5,  4,  2,  1,  0,  0,  0,  0
+const tab_S2,         db 0, 1, 3, 5, 7, 9, 11, 13, 0, 0, 0, 0, 0, 0, 0, 0
+const tab_Si,         db  0,  1,  2,  3,  4,  5,  6,  7,  0,  1,  2,  3,  4,  5,  6,  7
 pb_fact0:             db  0,  2,  4,  6,  8, 10, 12, 14,  0,  0,  0,  0,  0,  0,  0,  0
 c_mode32_12_0:        db  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 13,  7,  0
 c_mode32_13_0:        db  3,  6, 10, 13,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0
@@ -43,7 +48,6 @@
 c_mode32_18_0:        db 15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0
 c_shuf8_0:            db  0,  1,  1,  2,  2,  3,  3,  4,  4,  5,  5,  6,  6,  7,  7,  8
 c_deinterval8:        db  0,  8,  1,  9,  2, 10,  3, 11,  4, 12,  5, 13,  6, 14,  7, 15
-tab_S1:               db 15, 14, 12, 11, 10,  9,  7,  6,  5,  4,  2,  1,  0,  0,  0,  0
 pb_unpackbq:          db  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,  1
 c_mode16_12:    db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 6
 c_mode16_13:    db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 7, 4
@@ -52,8 +56,327 @@
 c_mode16_16:          db  8,  6,  5,  3,  2,  0, 15, 14, 12, 11,  9,  8,  6,  5,  3,  2
 c_mode16_17:          db  4,  2,  1,  0, 15, 14, 12, 11, 10,  9,  7,  6,  5,  4,  2,  1
 c_mode16_18:    db 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
-tab_S2:         db 0, 1, 3, 5, 7, 9, 11, 13, 0, 0, 0, 0, 0, 0, 0, 0
 
+ALIGN 32
+trans8_shuf:          dd 0, 4, 1, 5, 2, 6, 3, 7
+c_ang8_src1_9_2_10:   db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9
+c_ang8_26_20:         db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang8_src3_11_4_12:  db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11
+c_ang8_14_8:          db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+c_ang8_src5_13_5_13:  db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12
+c_ang8_2_28:          db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+c_ang8_src6_14_7_15:  db 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14
+c_ang8_22_16:         db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+c_ang8_21_10       :  db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+c_ang8_src2_10_3_11:  db 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10
+c_ang8_31_20:         db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang8_src4_12_4_12:  times 2 db 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11
+c_ang8_9_30:          db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+c_ang8_src5_13_6_14:  db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13
+c_ang8_19_8:          db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+
+c_ang8_17_2:          db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+c_ang8_19_4:          db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang8_21_6:          db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+c_ang8_23_8:          db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8,
+c_ang8_src4_12_5_13:  db 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12
+
+c_ang8_13_26:         db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+c_ang8_7_20:          db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang8_1_14:          db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+c_ang8_27_8:          db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+c_ang8_src2_10_2_10:  db 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9
+c_ang8_src3_11_3_11:  db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10
+
+c_ang8_31_8:          db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+c_ang8_13_22:         db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+c_ang8_27_4:          db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang8_9_18:          db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+
+c_ang8_5_10:          db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+c_ang8_15_20:         db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang8_25_30:         db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+c_ang8_3_8:           db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+
+c_ang8_mode_27:       db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+c_ang8_mode_25:       db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                      db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+c_ang8_mode_24:       db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                      db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                      db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+
+ALIGN 32
+c_ang16_mode_25:      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                      db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+ALIGN 32
+c_ang16_mode_28:      db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                      db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                      db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+
+ALIGN 32
+c_ang16_mode_27:      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                      db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                      db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+ALIGN 32
+intra_pred_shuff_0_15: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 15
+
+
+ALIGN 32
+c_ang16_mode_29:     db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9,  14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                     db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
+                     db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
+                     db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
+                     db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                     db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                     db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                     db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                     db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+
+ALIGN 32
+c_ang16_mode_30:      db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                      db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+                      db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                      db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+
+ALIGN 32
+c_ang16_mode_31:      db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19
+                      db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6,  9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8,  7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
+                      db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
+                      db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+ALIGN 32
+c_ang16_mode_32:      db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
+                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                      db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
+                      db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
+                      db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+ALIGN 32
+c_ang16_mode_33:     db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                     db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                     db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                     db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                     db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                     db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                     db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                     db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                     db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                     db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                     db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                     db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                     db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                     db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+ALIGN 32
+c_ang16_mode_24:     db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                     db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
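
The `c_ang8_mode_*` and `c_ang16_mode_*` tables above hold byte pairs that sum to 32. For modes 24, 25 and 28 the leading rows match the HEVC two-tap weights `(32 - frac, frac)` with `frac = ((row + 1) * angle) & 31`. The sketch below regenerates such pairs in C. The helper names are hypothetical, and the angle values (5 for mode 28, -5 for mode 24, -2 for mode 25) come from the HEVC specification rather than from this diff. Some tables (for example `c_ang16_mode_27`) repeat or pad rows, so this reproduces only the basic pattern, not every table byte for byte.

```c
#include <assert.h>
#include <stdint.h>

/* Write `rows` weight pairs (32 - frac, frac) into out[0 .. 2*rows-1].
 * angle is the per-row displacement in 1/32-sample units. */
static void ang_weight_pairs(int angle, int rows, uint8_t *out)
{
    assert(rows > 0);
    for (int y = 0; y < rows; y++) {
        /* Non-negative remainder modulo 32, valid for negative angles too. */
        int frac = ((((y + 1) * angle) % 32) + 32) % 32;
        out[2 * y]     = (uint8_t)(32 - frac);
        out[2 * y + 1] = (uint8_t)frac;
    }
}

/* Standard HEVC two-tap interpolation between neighbouring reference
 * samples a and b using one weight pair (w0 + w1 == 32). */
static uint8_t ang_interp(uint8_t a, uint8_t b, uint8_t w0, uint8_t w1)
{
    return (uint8_t)((w0 * a + w1 * b + 16) >> 5);
}
```

With `angle = 5` and 16 rows this yields 27, 5, 22, 10, 17, 15, ... ending in 16, 16, which is the sequence stored in `c_ang16_mode_28`. How the assembly consumes the pairs is not visible in this excerpt; interleaved byte weights of this shape are typically fed to a multiply-add such as `pmaddubsw`.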

 
@@ -2,6 +2,7 @@
 ;* Copyright (C) 2013 x265 project
 ;*
 ;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>
+;*          Praveen Kumar Tiwari <praveen@multicorewareinc.com>
 ;*
 ;* This program is free software; you can redistribute it and/or modify
 ;* it under the terms of the GNU General Public License as published by
@@ -26,11 +27,15 @@
 
 SECTION_RODATA 32
 
+intra_pred_shuff_0_8:    times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
+
 pb_0_8        times 8 db  0,  8
 pb_unpackbw1  times 2 db  1,  8,  2,  8,  3,  8,  4,  8
 pb_swap8:     times 2 db  7,  6,  5,  4,  3,  2,  1,  0
 c_trans_4x4           db  0,  4,  8, 12,  1,  5,  9, 13,  2,  6, 10, 14,  3,  7, 11, 15
-tab_Si:               db  0,  1,  2,  3,  4,  5,  6,  7,  0,  1,  2,  3,  4,  5,  6,  7
+const tab_S1,         db 15, 14, 12, 11, 10,  9,  7,  6,  5,  4,  2,  1,  0,  0,  0,  0
+const tab_S2,         db 0, 1, 3, 5, 7, 9, 11, 13, 0, 0, 0, 0, 0, 0, 0, 0
+const tab_Si,         db  0,  1,  2,  3,  4,  5,  6,  7,  0,  1,  2,  3,  4,  5,  6,  7
 pb_fact0:             db  0,  2,  4,  6,  8, 10, 12, 14,  0,  0,  0,  0,  0,  0,  0,  0
 c_mode32_12_0:        db  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 13,  7,  0
 c_mode32_13_0:        db  3,  6, 10, 13,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0
@@ -43,7 +48,6 @@
 c_mode32_18_0:        db 15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1,  0
 c_shuf8_0:            db  0,  1,  1,  2,  2,  3,  3,  4,  4,  5,  5,  6,  6,  7,  7,  8
 c_deinterval8:        db  0,  8,  1,  9,  2, 10,  3, 11,  4, 12,  5, 13,  6, 14,  7, 15
-tab_S1:               db 15, 14, 12, 11, 10,  9,  7,  6,  5,  4,  2,  1,  0,  0,  0,  0
 pb_unpackbq:          db  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,  1
 c_mode16_12:    db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 6
 c_mode16_13:    db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 7, 4
@@ -52,8 +56,327 @@
 c_mode16_16:          db  8,  6,  5,  3,  2,  0, 15, 14, 12, 11,  9,  8,  6,  5,  3,  2
 c_mode16_17:          db  4,  2,  1,  0, 15, 14, 12, 11, 10,  9,  7,  6,  5,  4,  2,  1
 c_mode16_18:    db 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
-tab_S2:         db 0, 1, 3, 5, 7, 9, 11, 13, 0, 0, 0, 0, 0, 0, 0, 0
 
+ALIGN 32
+trans8_shuf:          dd 0, 4, 1, 5, 2, 6, 3, 7
+c_ang8_src1_9_2_10:   db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9
+c_ang8_26_20:         db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang8_src3_11_4_12:  db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11
+c_ang8_14_8:          db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+c_ang8_src5_13_5_13:  db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12
+c_ang8_2_28:          db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+c_ang8_src6_14_7_15:  db 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14
+c_ang8_22_16:         db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+c_ang8_21_10       :  db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+c_ang8_src2_10_3_11:  db 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10
+c_ang8_31_20:         db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang8_src4_12_4_12:  times 2 db 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11
+c_ang8_9_30:          db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+c_ang8_src5_13_6_14:  db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13
+c_ang8_19_8:          db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+
+c_ang8_17_2:          db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+c_ang8_19_4:          db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang8_21_6:          db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+c_ang8_23_8:          db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8,
+c_ang8_src4_12_5_13:  db 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12
+
+c_ang8_13_26:         db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+c_ang8_7_20:          db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang8_1_14:          db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+c_ang8_27_8:          db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+c_ang8_src2_10_2_10:  db 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9
+c_ang8_src3_11_3_11:  db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10
+
+c_ang8_31_8:          db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+c_ang8_13_22:         db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+c_ang8_27_4:          db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+c_ang8_9_18:          db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+
+c_ang8_5_10:          db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+c_ang8_15_20:         db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+c_ang8_25_30:         db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+c_ang8_3_8:           db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+
+c_ang8_mode_27:       db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+c_ang8_mode_25:       db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                      db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+c_ang8_mode_24:       db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                      db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                      db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+
+ALIGN 32
+c_ang16_mode_25:      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                      db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+ALIGN 32
+c_ang16_mode_28:      db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                      db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                      db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+
+ALIGN 32
+c_ang16_mode_27:      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                      db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                      db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+ALIGN 32
+intra_pred_shuff_0_15: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 15
+
+
+ALIGN 32
+c_ang16_mode_29:     db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9,  14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                     db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
+                     db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
+                     db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
+                     db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                     db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                     db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                     db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                     db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+
+ALIGN 32
+c_ang16_mode_30:      db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                      db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+                      db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                      db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+
+ALIGN 32
+c_ang16_mode_31:      db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19
+                      db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6,  9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8,  7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
+                      db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
+                      db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+ALIGN 32
+c_ang16_mode_32:      db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
+                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                      db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
+                      db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
+                      db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+ALIGN 32
+c_ang16_mode_33:     db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                     db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                     db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                     db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                     db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                     db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                     db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                     db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                     db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                     db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                     db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                     db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                     db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                     db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+ALIGN 32
+c_ang16_mode_24:     db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                     db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
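
Note on the weight tables above: several of the added c_ang8_mode_* and c_ang16_mode_* tables appear to store, for each prediction row, a byte pair (32 - f, f) repeated across a 16-byte lane, where f is the fractional sample position of that row. The sketch below regenerates those pairs. It assumes that the number in each label is the HEVC intra mode and that f = ((row + 1) * angle) & 31, with the angle values taken from the HEVC intraPredAngle table (mode 27 = 2, mode 25 = -2, mode 24 = -5, mode 28 = 5). The function names are hypothetical and are not part of x265. Tables whose rows are laid out in another order (for example c_ang16_mode_29) are not modelled.

```python
# Hypothetical helpers; a sketch of how the (32 - f, f) weight pairs in the
# simply ordered tables could be derived. Not taken from the x265 sources.

def weight_rows(angle, rows):
    """Return one (32 - f, f) pair per prediction row, f = ((y + 1) * angle) & 31."""
    pairs = []
    for y in range(1, rows + 1):
        f = (y * angle) & 31          # Python's & yields 0..31 for negative angles too
        pairs.append((32 - f, f))
    return pairs


def table_bytes(angle, rows):
    """Flatten the pairs into db-style bytes: each pair repeated 8 times (16 bytes per row)."""
    data = []
    for pair in weight_rows(angle, rows):
        data.extend(pair * 8)
    return data


if __name__ == "__main__":
    # Assumed angle for mode 24 is -5; compare with the c_ang8_mode_24 rows.
    for pair in weight_rows(-5, 8):
        print(pair)
```

Under these assumptions, weight_rows(2, 8) lines up with c_ang8_mode_27, weight_rows(-5, 8) with c_ang8_mode_24, and weight_rows(5, 16) with c_ang16_mode_28.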
​

x265_1.6.tar.gz/source/common/x86/intrapred8_allangs.asm Added

@@ -0,0 +1,23008 @@
+;*****************************************************************************
+;* Copyright (C) 2013 x265 project
+;*
+;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>
+;*          Praveen Tiwari <praveen@multicorewareinc.com>
+;*
+;* This program is free software; you can redistribute it and/or modify
+;* it under the terms of the GNU General Public License as published by
+;* the Free Software Foundation; either version 2 of the License, or
+;* (at your option) any later version.
+;*
+;* This program is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;* GNU General Public License for more details.
+;*
+;* You should have received a copy of the GNU General Public License
+;* along with this program; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
+;*
+;* This program is also available under a commercial proprietary license.
+;* For more information, contact us at license @ x265.com.
+;*****************************************************************************/
+
+%include "x86inc.asm"
+%include "x86util.asm"
+
+SECTION_RODATA 32
+
+SECTION .text
+
+; global constant
+cextern pw_1024
+
+; common constant with intrapred8.asm
+cextern ang_table
+cextern tab_S1
+cextern tab_S2
+cextern tab_Si
+
+
+;-----------------------------------------------------------------------------
+; void all_angs_pred_4x4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma)
+;-----------------------------------------------------------------------------
+INIT_XMM sse4
+cglobal all_angs_pred_4x4, 4, 4, 8
+
+; mode 2
+
+movh      m0,         [r1 + 10]
+movd      [r0],       m0
+
+palignr   m1,         m0,      1
+movd      [r0 + 4],   m1
+
+palignr   m1,         m0,      2
+movd      [r0 + 8],   m1
+
+palignr   m1,         m0,      3
+movd      [r0 + 12],  m1
+
+; mode 3
+
+mova          m2,        [pw_1024]
+
+pslldq        m1,        m0,         1
+pinsrb        m1,        [r1 + 9],   0
+punpcklbw     m1,        m0
+
+lea           r3,        [ang_table]
+
+pmaddubsw     m6,        m1,        [r3 + 26 * 16]
+pmulhrsw      m6,        m2
+packuswb      m6,        m6
+movd          [r0 + 16], m6
+
+palignr       m0,        m1,        2
+
+mova          m7,        [r3 + 20 * 16]
+
+pmaddubsw     m3,        m0,        m7
+pmulhrsw      m3,        m2
+packuswb      m3,        m3
+movd          [r0 + 20], m3
+
+; mode 6 [row 3]
+movd          [r0 + 76], m3
+
+palignr       m3,        m1,       4
+
+pmaddubsw     m4,        m3,        [r3 + 14 * 16]
+pmulhrsw      m4,        m2
+packuswb      m4,        m4
+movd          [r0 + 24], m4
+
+palignr       m4,        m1,        6
+
+pmaddubsw     m4,        [r3 + 8 * 16]
+pmulhrsw      m4,        m2
+packuswb      m4,        m4
+movd          [r0 + 28], m4
+
+; mode 4
+
+pmaddubsw     m5,        m1,        [r3 + 21 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 32], m5
+
+pmaddubsw     m5,        m0,        [r3 + 10 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 36], m5
+
+pmaddubsw     m5,        m0,        [r3 + 31 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 40], m5
+
+pmaddubsw     m4,        m3,        m7
+pmulhrsw      m4,        m2
+packuswb      m4,        m4
+movd          [r0 + 44], m4
+
+; mode 5
+
+pmaddubsw     m5,        m1,        [r3 + 17 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 48], m5
+
+pmaddubsw     m5,        m0,        [r3 + 2 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 52], m5
+
+pmaddubsw     m5,        m0,        [r3 + 19 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 56], m5
+
+pmaddubsw     m4,        m3,        [r3 + 4 * 16]
+pmulhrsw      m4,        m2
+packuswb      m4,        m4
+movd          [r0 + 60], m4
+
+; mode 6
+
+pmaddubsw     m5,        m1,        [r3 + 13 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 64], m5
+
+movd          [r0 + 68], m6
+
+pmaddubsw     m5,        m0,        [r3 + 7 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 72], m5
+
+; mode 7
+
+pmaddubsw     m5,        m1,        [r3 + 9 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 80], m5
+
+pmaddubsw     m5,        m1,        [r3 + 18 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 84], m5
+
+pmaddubsw     m5,        m1,        [r3 + 27 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 88], m5
+
+pmaddubsw     m5,        m0,        [r3 + 4 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 92], m5
+
+; mode 8
+
+pmaddubsw     m5,        m1,        [r3 + 5 * 16]
+pmulhrsw      m5,        m2
+packuswb      m5,        m5
+movd          [r0 + 96], m5
+
+pmaddubsw     m5,         m1,       [r3 + 10 * 16]
+pmulhrsw      m5,         m2
+packuswb      m5,         m5
+movd          [r0 + 100], m5
+
+pmaddubsw     m5,         m1,        [r3 + 15 * 16]
+pmulhrsw      m5,         m2
+packuswb      m5,         m5
+movd          [r0 + 104], m5
+
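
Note on the arithmetic in all_angs_pred_4x4 above: each output row is produced by the sequence pmaddubsw, pmulhrsw, packuswb. The scalar model below is a sketch of one output byte. It assumes that pw_1024 holds 16-bit words of 1024 (as its name suggests) and that row k of ang_table, at [r3 + k * 16], holds the byte pairs (32 - k, k). Neither definition is visible in this part of the diff. The Python function names are hypothetical.

```python
# Scalar sketch of one lane of the pmaddubsw / pmulhrsw / packuswb sequence.
# Assumptions: pw_1024 is a vector of 1024s; ang_table row k is (32 - k, k) pairs.

def pmaddubsw_lane(a, b, w0, w1):
    """Unsigned bytes a, b times signed byte weights, pair-summed with signed 16-bit saturation."""
    s = a * w0 + b * w1
    return max(-32768, min(32767, s))


def pmulhrsw_lane(x, m):
    """Rounded high multiply: (((x * m) >> 14) + 1) >> 1."""
    return ((x * m >> 14) + 1) >> 1


def packuswb_lane(x):
    """Saturate a signed 16-bit value to an unsigned byte."""
    return max(0, min(255, x))


def interp(a, b, frac):
    """Blend neighbouring reference pixels a and b at fractional position frac (0..31)."""
    s = pmaddubsw_lane(a, b, 32 - frac, frac)
    return packuswb_lane(pmulhrsw_lane(s, 1024))


def row_fractions(angle, rows=4):
    """Fraction per row, assuming f = ((y + 1) * angle) & 31."""
    return [((y + 1) * angle) & 31 for y in range(rows)]


if __name__ == "__main__":
    print(row_fractions(26))      # assumed angle for mode 3
    print(interp(100, 200, 26))
```

With a multiplier of 1024, pmulhrsw reduces to (x + 16) >> 5, so each lane computes ((32 - f) * a + f * b + 16) >> 5. The ang_table offsets used for mode 3 (26, 20, 14, 8), mode 4 (21, 10, 31, 20) and mode 5 (17, 2, 19, 4) match row_fractions for angles 26, 21 and 17 respectively.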

 

x265_1.5.tar.gz/source/common/x86/ipfilter16.asm -> x265_1.6.tar.gz/source/common/x86/ipfilter16.asm Changed

@@ -31,6 +31,7 @@
 tab_c_n32768:     times 4 dd -32768
 tab_c_524800:     times 4 dd 524800
 tab_c_n8192:      times 8 dw -8192
+pd_524800:        times 8 dd 524800
 
 tab_Tm16:         db 0, 1, 2, 3, 4,  5,  6, 7, 2, 3, 4,  5, 6, 7, 8, 9
 
@@ -91,9 +92,28 @@
                   times 4 dw -5, 17
                   times 4 dw 58, -10
                   times 4 dw 4, -1
+ALIGN 32
+tab_LumaCoeffVer: times 8 dw 0, 0
+                  times 8 dw 0, 64
+                  times 8 dw 0, 0
+                  times 8 dw 0, 0
+
+                  times 8 dw -1, 4
+                  times 8 dw -10, 58
+                  times 8 dw 17, -5
+                  times 8 dw 1, 0
+
+                  times 8 dw -1, 4
+                  times 8 dw -11, 40
+                  times 8 dw 40, -11
+                  times 8 dw 4, -1
+
+                  times 8 dw 0, 1
+                  times 8 dw -5, 17
+                  times 8 dw 58, -10
+                  times 8 dw 4, -1
 
 SECTION .text
-
 cextern pd_32
 cextern pw_pixel_max
 cextern pd_n32768
@@ -2562,6 +2582,2681 @@
     FILTER_VER_LUMA_PP 64, 16
     FILTER_VER_LUMA_PP 16, 64
 
+%macro FILTER_VER_LUMA_AVX2_4x4 1
+INIT_YMM avx2
+cglobal interp_8tap_vert_%1_4x4, 4, 6, 7
+    mov             r4d, r4m
+    add             r1d, r1d
+    add             r3d, r3d
+    shl             r4d, 7
+
+%ifdef PIC
+    lea             r5, [tab_LumaCoeffVer]
+    add             r5, r4
+%else
+    lea             r5, [tab_LumaCoeffVer + r4]
+%endif
+
+    lea             r4, [r1 * 3]
+    sub             r0, r4
+
+%ifidn %1,pp
+    vbroadcasti128  m6, [pd_32]
+%elifidn %1, sp
+    mova            m6, [pd_524800]
+%else
+    vbroadcasti128  m6, [pd_n32768]
+%endif
+
+    movq            xm0, [r0]
+    movq            xm1, [r0 + r1]
+    punpcklwd       xm0, xm1
+    movq            xm2, [r0 + r1 * 2]
+    punpcklwd       xm1, xm2
+    vinserti128     m0, m0, xm1, 1                  ; m0 = [2 1 1 0]
+    pmaddwd         m0, [r5]
+    movq            xm3, [r0 + r4]
+    punpcklwd       xm2, xm3
+    lea             r0, [r0 + 4 * r1]
+    movq            xm4, [r0]
+    punpcklwd       xm3, xm4
+    vinserti128     m2, m2, xm3, 1                  ; m2 = [4 3 3 2]
+    pmaddwd         m5, m2, [r5 + 1 * mmsize]
+    pmaddwd         m2, [r5]
+    paddd           m0, m5
+    movq            xm3, [r0 + r1]
+    punpcklwd       xm4, xm3
+    movq            xm1, [r0 + r1 * 2]
+    punpcklwd       xm3, xm1
+    vinserti128     m4, m4, xm3, 1                  ; m4 = [6 5 5 4]
+    pmaddwd         m5, m4, [r5 + 2 * mmsize]
+    pmaddwd         m4, [r5 + 1 * mmsize]
+    paddd           m0, m5
+    paddd           m2, m4
+    movq            xm3, [r0 + r4]
+    punpcklwd       xm1, xm3
+    lea             r0, [r0 + 4 * r1]
+    movq            xm4, [r0]
+    punpcklwd       xm3, xm4
+    vinserti128     m1, m1, xm3, 1                  ; m1 = [8 7 7 6]
+    pmaddwd         m5, m1, [r5 + 3 * mmsize]
+    pmaddwd         m1, [r5 + 2 * mmsize]
+    paddd           m0, m5
+    paddd           m2, m1
+    movq            xm3, [r0 + r1]
+    punpcklwd       xm4, xm3
+    movq            xm1, [r0 + 2 * r1]
+    punpcklwd       xm3, xm1
+    vinserti128     m4, m4, xm3, 1                  ; m4 = [A 9 9 8]
+    pmaddwd         m4, [r5 + 3 * mmsize]
+    paddd           m2, m4
+
+%ifidn %1,ss
+    psrad           m0, 6
+    psrad           m2, 6
+%else
+    paddd           m0, m6
+    paddd           m2, m6
+%ifidn %1,pp
+    psrad           m0, 6
+    psrad           m2, 6
+%elifidn %1, sp
+    psrad           m0, 10
+    psrad           m2, 10
+%else
+    psrad           m0, 2
+    psrad           m2, 2
+%endif
+%endif
+
+    packssdw        m0, m2
+    pxor            m1, m1
+%ifidn %1,pp
+    CLIPW           m0, m1, [pw_pixel_max]
+%elifidn %1, sp
+    CLIPW           m0, m1, [pw_pixel_max]
+%endif
+
+    vextracti128    xm2, m0, 1
+    lea             r4, [r3 * 3]
+    movq            [r2], xm0
+    movq            [r2 + r3], xm2
+    movhps          [r2 + r3 * 2], xm0
+    movhps          [r2 + r4], xm2
+    RET
+%endmacro
+
+FILTER_VER_LUMA_AVX2_4x4 pp
+FILTER_VER_LUMA_AVX2_4x4 ps
+FILTER_VER_LUMA_AVX2_4x4 sp
+FILTER_VER_LUMA_AVX2_4x4 ss
+
+%macro FILTER_VER_LUMA_AVX2_8x8 1
+INIT_YMM avx2
+%if ARCH_X86_64 == 1
+cglobal interp_8tap_vert_%1_8x8, 4, 6, 12
+    mov             r4d, r4m
+    add             r1d, r1d
+    add             r3d, r3d
+    shl             r4d, 7
+
+%ifdef PIC
+    lea             r5, [tab_LumaCoeffVer]
+    add             r5, r4
+%else
+    lea             r5, [tab_LumaCoeffVer + r4]
+%endif
+
+    lea             r4, [r1 * 3]
+    sub             r0, r4
+
+%ifidn %1,pp
+    vbroadcasti128  m11, [pd_32]
+%elifidn %1, sp
+    mova            m11, [pd_524800]
+%else
+    vbroadcasti128  m11, [pd_n32768]
+%endif
+
+    movu            xm0, [r0]                       ; m0 = row 0
+    movu            xm1, [r0 + r1]                  ; m1 = row 1
+    punpckhwd       xm2, xm0, xm1
+    punpcklwd       xm0, xm1
+    vinserti128     m0, m0, xm2, 1
+    pmaddwd         m0, [r5]
+    movu            xm2, [r0 + r1 * 2]              ; m2 = row 2
+    punpckhwd       xm3, xm1, xm2
+    punpcklwd       xm1, xm2
+    vinserti128     m1, m1, xm3, 1
+    pmaddwd         m1, [r5]
+    movu            xm3, [r0 + r4]                  ; m3 = row 3
+    punpckhwd       xm4, xm2, xm3
+    punpcklwd       xm2, xm3
+    vinserti128     m2, m2, xm4, 1
+    pmaddwd         m4, m2, [r5 + 1 * mmsize]
+    pmaddwd         m2, [r5]
+    paddd           m0, m4
+    lea             r0, [r0 + r1 * 4]
+    movu            xm4, [r0]                       ; m4 = row 4
+    punpckhwd       xm5, xm3, xm4
+    punpcklwd       xm3, xm4
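
Note on the rounding branches in the 4x4 macro above: the scalar sketch below, in C, shows what one output sample appears to work out to for each of the pp, ps, sp and ss variants. The tap values, offsets (32, 524800, -32768) and shift counts (6, 10, 2) are taken from the diff. The function and type names are hypothetical, the 16-bit saturation done by packssdw is left out, and the pixel maximum is passed in by the caller because pw_pixel_max is defined elsewhere.

```c
#include <stdint.h>

/* 8-tap luma taps per fractional index, matching the coefficient tables. */
static const int16_t luma_taps[4][8] = {
    {  0, 0,   0, 64,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

/* Hypothetical names for the four macro arguments pp / ps / sp / ss. */
typedef enum { KIND_PP, KIND_PS, KIND_SP, KIND_SS } filter_kind;

static int clip(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Arithmetic shift right (like psrad) that avoids relying on the
 * implementation-defined result of >> on negative values in C. */
static int32_t sar(int32_t v, int n)
{
    return v >= 0 ? v >> n : -((-v + (1 << n) - 1) >> n);
}

/* src points at the row being interpolated; rows -3..+4 are read,
 * which mirrors "sub r0, 3 * stride" followed by eight row loads. */
static int filter_vert_sample(const int16_t *src, intptr_t stride,
                              int idx, filter_kind kind, int pixel_max)
{
    int32_t sum = 0;
    for (int t = 0; t < 8; t++)
        sum += luma_taps[idx][t] * src[(t - 3) * stride];

    switch (kind) {
    case KIND_PP: return clip(sar(sum + 32, 6), 0, pixel_max);
    case KIND_SP: return clip(sar(sum + 524800, 10), 0, pixel_max);
    case KIND_PS: return sar(sum - 32768, 2);   /* no clip in the asm */
    default:      return sar(sum, 6);           /* ss: no offset, no clip */
    }
}
```

With a constant column of 100 and a pixel maximum of 1023 (an assumed 10-bit build), pp returns 100 and ps returns -6592; feeding -6592 back through sp returns 100 again, so the ps and sp offsets cancel each other.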

 

x265_1.5.tar.gz/source/common/x86/ipfilter8.asm -> x265_1.6.tar.gz/source/common/x86/ipfilter8.asm Changed

@@ -35,10 +35,20 @@
 const interp4_vpp_shuf, times 2 db 0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15
 
 ALIGN 32
+const interp_vert_shuf, times 2 db 0, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8, 7, 9
+                        times 2 db 4, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 11, 10, 12, 11, 13
+
+ALIGN 32
 const interp4_vpp_shuf1, dd 0, 1, 1, 2, 2, 3, 3, 4
                          dd 2, 3, 3, 4, 4, 5, 5, 6
 
 ALIGN 32
+const pb_8tap_hps_0, times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
+                     times 2 db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10
+                     times 2 db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12
+                     times 2 db 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12,12,13,13,14
+
+ALIGN 32
 tab_Lm:    db 0, 1, 2, 3, 4,  5,  6,  7,  1, 2, 3, 4,  5,  6,  7,  8
            db 2, 3, 4, 5, 6,  7,  8,  9,  3, 4, 5, 6,  7,  8,  9,  10
            db 4, 5, 6, 7, 8,  9,  10, 11, 5, 6, 7, 8,  9,  10, 11, 12
@@ -51,6 +61,8 @@
 
 tab_c_526336:   times 4 dd 8192*64+2048
 
+pd_526336:      times 8 dd 8192*64+2048
+
 tab_ChromaCoeff: db  0, 64,  0,  0
                  db -2, 58, 10, -2
                  db -4, 54, 16, -2
@@ -59,6 +71,30 @@
                  db -4, 28, 46, -6
                  db -2, 16, 54, -4
                  db -2, 10, 58, -2
+ALIGN 32
+tab_ChromaCoeff_V: times 8 db 0, 64
+                   times 8 db 0,  0
+
+                   times 8 db -2, 58
+                   times 8 db 10, -2
+
+                   times 8 db -4, 54
+                   times 8 db 16, -2
+
+                   times 8 db -6, 46
+                   times 8 db 28, -4
+
+                   times 8 db -4, 36
+                   times 8 db 36, -4
+
+                   times 8 db -4, 28
+                   times 8 db 46, -6
+
+                   times 8 db -2, 16
+                   times 8 db 54, -4
+
+                   times 8 db -2, 10
+                   times 8 db 58, -2
 
 tab_ChromaCoeffV: times 4 dw 0, 64
                   times 4 dw 0, 0
@@ -84,6 +120,31 @@
                   times 4 dw -2, 10
                   times 4 dw 58, -2
 
+ALIGN 32
+pw_ChromaCoeffV:  times 8 dw 0, 64
+                  times 8 dw 0, 0
+
+                  times 8 dw -2, 58
+                  times 8 dw 10, -2
+
+                  times 8 dw -4, 54
+                  times 8 dw 16, -2
+
+                  times 8 dw -6, 46 
+                  times 8 dw 28, -4
+
+                  times 8 dw -4, 36
+                  times 8 dw 36, -4
+
+                  times 8 dw -4, 28
+                  times 8 dw 46, -6
+
+                  times 8 dw -2, 16
+                  times 8 dw 54, -4
+
+                  times 8 dw -2, 10
+                  times 8 dw 58, -2
+
 tab_LumaCoeff:   db   0, 0,  0,  64,  0,   0,  0,  0
                  db  -1, 4, -10, 58,  17, -5,  1,  0
                  db  -1, 4, -11, 40,  40, -11, 4, -1
@@ -109,6 +170,47 @@
                 times 4 dw 58, -10
                 times 4 dw 4, -1
 
+ALIGN 32
+pw_LumaCoeffVer: times 8 dw 0, 0
+                 times 8 dw 0, 64
+                 times 8 dw 0, 0
+                 times 8 dw 0, 0
+
+                 times 8 dw -1, 4
+                 times 8 dw -10, 58
+                 times 8 dw 17, -5
+                 times 8 dw 1, 0
+
+                 times 8 dw -1, 4
+                 times 8 dw -11, 40
+                 times 8 dw 40, -11
+                 times 8 dw 4, -1
+
+                 times 8 dw 0, 1
+                 times 8 dw -5, 17
+                 times 8 dw 58, -10
+                 times 8 dw 4, -1
+
+pb_LumaCoeffVer: times 16 db 0, 0
+                 times 16 db 0, 64
+                 times 16 db 0, 0
+                 times 16 db 0, 0
+
+                 times 16 db -1, 4
+                 times 16 db -10, 58
+                 times 16 db 17, -5
+                 times 16 db 1, 0
+
+                 times 16 db -1, 4
+                 times 16 db -11, 40
+                 times 16 db 40, -11
+                 times 16 db 4, -1
+
+                 times 16 db 0, 1
+                 times 16 db -5, 17
+                 times 16 db 58, -10
+                 times 16 db 4, -1
+
 tab_LumaCoeffVer: times 8 db 0, 0
                   times 8 db 0, 64
                   times 8 db 0, 0
@@ -183,6 +285,15 @@
 interp4_horiz_shuf1:    db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
                         db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14
 
+ALIGN 32
+interp4_hpp_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12
+
+ALIGN 32
+interp8_hps_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7
+
+ALIGN 32
+interp4_hps_shuf: times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 8, 9, 10, 11, 9, 10, 11, 12
+
 SECTION .text
 
 cextern pb_128
@@ -913,6 +1024,105 @@
     pextrd          [r2+r0], xm3, 3
     RET
 
+%macro FILTER_HORIZ_LUMA_AVX2_4xN 1
+INIT_YMM avx2
+%if ARCH_X86_64 == 1
+cglobal interp_8tap_horiz_pp_4x%1, 4, 6, 9
+    mov             r4d, r4m
+
+%ifdef PIC
+    lea             r5, [tab_LumaCoeff]
+    vpbroadcastq    m0, [r5 + r4 * 8]
+%else
+    vpbroadcastq    m0, [tab_LumaCoeff + r4 * 8]
+%endif
+
+    mova            m1, [tab_Lm]
+    mova            m2, [pw_1]
+    mova            m7, [interp8_hps_shuf]
+    mova            m8, [pw_512]
+
+    ; register map
+    ; m0 - interpolate coeff
+    ; m1 - shuffle order table
+    ; m2 - constant word 1
+    lea             r4, [r1 * 3]
+    lea             r5, [r3 * 3]
+    sub             r0, 3
+%rep %1 / 8
+    ; Row 0-1
+    vbroadcasti128  m3, [r0]                        ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb          m3, m1
+    pmaddubsw       m3, m0
+    pmaddwd         m3, m2
+    vbroadcasti128  m4, [r0 + r1]                   ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
+    pshufb          m4, m1
+    pmaddubsw       m4, m0
+    pmaddwd         m4, m2
+    phaddd          m3, m4                          ; DWORD [R1D R1C R0D R0C R1B R1A R0B R0A]
+
+    ; Row 2-3
+    vbroadcasti128  m4, [r0 + r1 * 2]               ; [x x x x x A 9 8 7 6 5 4 3 2 1 0]
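
Note on the new coefficient tables: pw_LumaCoeffVer looks like the compact tab_LumaCoeff table with each pair of adjacent taps repeated across a full 32-byte row. The C sketch below rebuilds that layout; the array and function names are hypothetical. Under that reading, one filter occupies 4 rows of 32 bytes, that is 128 bytes, which would be consistent with a filter index scaled by a left shift of 7.

```c
#include <stddef.h>
#include <stdint.h>

/* Compact 8-tap luma coefficients, one row per fractional position. */
static const int8_t compact_taps[4][8] = {
    {  0, 0,   0, 64,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

/* out[f][r][k]: filter f, row r holds taps (2r, 2r+1) repeated 8 times,
 * i.e. the effect of "times 8 dw a, b" in the assembly source. */
static void expand_luma_coeff(int16_t out[4][4][16])
{
    for (int f = 0; f < 4; f++)
        for (int r = 0; r < 4; r++)
            for (int k = 0; k < 16; k++)
                out[f][r][k] = compact_taps[f][2 * r + (k & 1)];
}

/* Byte offset of filter idx inside the expanded word table. */
static size_t filter_byte_offset(int idx)
{
    return (size_t)idx << 7;   /* 4 rows * 16 words * 2 bytes = 128 */
}
```

Replicating the pairs this way lets a single pmaddwd multiply interleaved row pairs by the matching two taps without any shuffle of the coefficients at run time.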

 

x265_1.5.tar.gz/source/common/x86/ipfilter8.h -> x265_1.6.tar.gz/source/common/x86/ipfilter8.h Changed

@@ -576,8 +576,12 @@
 CHROMA_420_FILTERS(_avx2);
 CHROMA_420_SP_FILTERS(_sse2);
 CHROMA_420_SP_FILTERS_SSE4(_sse4);
+CHROMA_420_SP_FILTERS(_avx2);
+CHROMA_420_SP_FILTERS_SSE4(_avx2);
 CHROMA_420_SS_FILTERS(_sse2);
 CHROMA_420_SS_FILTERS_SSE4(_sse4);
+CHROMA_420_SS_FILTERS(_avx2);
+CHROMA_420_SS_FILTERS_SSE4(_avx2);
 
 CHROMA_422_FILTERS(_sse4);
 CHROMA_422_FILTERS(_avx2);
@@ -617,10 +621,31 @@
 LUMA_SP_FILTERS(_sse4);
 LUMA_SS_FILTERS(_sse2);
 LUMA_FILTERS(_avx2);
-
+LUMA_SP_FILTERS(_avx2);
+LUMA_SS_FILTERS(_avx2);
 void x265_interp_8tap_hv_pp_8x8_sse4(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-void x265_luma_p2s_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
-
+void x265_pixelToShort_4x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_4x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_4x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
+void x265_pixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst);
 #undef LUMA_FILTERS
 #undef LUMA_SP_FILTERS
 #undef LUMA_SS_FILTERS
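
Note on the header change: the single x265_luma_p2s_ssse3 entry point with run-time width and height is replaced by one x265_pixelToShort_WxH_ssse3 function per block size. The C sketch below illustrates the idea of baking the size into the function. It is not taken from x265: the names, the 8-bit depth, the 14-bit internal precision with an 8192 offset, and the fixed destination stride of 64 are all assumptions made for illustration.

```c
#include <stdint.h>

typedef uint8_t pixel;   /* assumption: 8-bit build */

enum {
    BIT_DEPTH     = 8,        /* assumed */
    INTERNAL_PREC = 14,       /* assumed internal precision */
    INTERNAL_OFFS = 1 << 13,  /* assumed offset, 8192 */
    DST_STRIDE    = 64        /* hypothetical fixed destination stride */
};

/* Generates one fixed-size converter, so width and height are
 * compile-time constants rather than run-time arguments. */
#define DEFINE_PIXEL_TO_SHORT(W, H)                                        \
static void pixel_to_short_##W##x##H(const pixel *src, intptr_t srcStride, \
                                     int16_t *dst)                         \
{                                                                          \
    for (int y = 0; y < (H); y++, src += srcStride, dst += DST_STRIDE)     \
        for (int x = 0; x < (W); x++)                                      \
            dst[x] = (int16_t)((src[x] << (INTERNAL_PREC - BIT_DEPTH))     \
                               - INTERNAL_OFFS);                           \
}

DEFINE_PIXEL_TO_SHORT(4, 4)
DEFINE_PIXEL_TO_SHORT(8, 8)
```

Fixed sizes let each variant be fully unrolled and drop two arguments from the call, at the cost of one symbol per block size, which is the trade-off the long list of declarations reflects.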

 

x265_1.5.tar.gz/source/common/x86/mc-a.asm -> x265_1.6.tar.gz/source/common/x86/mc-a.asm Changed

@@ -1759,7 +1759,570 @@
 ADDAVG_W16_H4 24
 
 ;-----------------------------------------------------------------------------
+; addAvg avx2 code start
+;-----------------------------------------------------------------------------
+
+INIT_YMM avx2
+cglobal addAvg_8x2, 6,6,4, pSrc0, src0, src1, dst, src0Stride, src1tride, dstStride
+    movu            xm0, [r0]
+    vinserti128     m0, m0, [r0 + 2 * r3], 1
+
+    movu            xm2, [r1]
+    vinserti128     m2, m2, [r1 + 2 * r4], 1
+
+    paddw           m0, m2
+    pmulhrsw        m0, [pw_256]
+    paddw           m0, [pw_128]
+
+    packuswb        m0, m0
+    vextracti128    xm1, m0, 1
+    movq            [r2], xm0
+    movq            [r2 + r5], xm1
+    RET
+
+cglobal addAvg_8x6, 6,6,6, pSrc0, src0, src1, dst, src0Stride, src1tride, dstStride
+    mova            m4, [pw_256]
+    mova            m5, [pw_128]
+    add             r3, r3
+    add             r4, r4
+
+    movu            xm0, [r0]
+    vinserti128     m0, m0, [r0 + r3], 1
+
+    movu            xm2, [r1]
+    vinserti128     m2, m2, [r1 + r4], 1
+
+    paddw           m0, m2
+    pmulhrsw        m0, m4
+    paddw           m0, m5
+
+    packuswb        m0, m0
+    vextracti128    xm1, m0, 1
+    movq            [r2], xm0
+    movq            [r2 + r5], xm1
+
+    lea             r2, [r2 + 2 * r5]
+    lea             r0, [r0 + 2 * r3]
+    lea             r1, [r1 + 2 * r4]
+
+    movu            xm0, [r0]
+    vinserti128     m0, m0, [r0+  r3], 1
+
+    movu            xm2, [r1]
+    vinserti128     m2, m2, [r1 + r4], 1
+
+    paddw           m0, m2
+    pmulhrsw        m0, m4
+    paddw           m0, m5
+
+    packuswb        m0, m0
+    vextracti128    xm1, m0, 1
+    movq            [r2], xm0
+    movq            [r2 + r5], xm1
+
+    lea             r2, [r2 + 2 * r5]
+    lea             r0, [r0 + 2 * r3]
+    lea             r1, [r1 + 2 * r4]
+
+    movu            xm0, [r0]
+    vinserti128     m0, m0, [r0 + r3], 1
+
+    movu            xm2, [r1]
+    vinserti128     m2, m2, [r1 + r4], 1
+
+    paddw           m0, m2
+    pmulhrsw        m0, m4
+    paddw           m0, m5
+
+    packuswb        m0, m0
+    vextracti128    xm1, m0, 1
+    movq            [r2], xm0
+    movq            [r2 + r5], xm1
+    RET
+
+%macro ADDAVG_W8_H4_AVX2 1
+INIT_YMM avx2
+cglobal addAvg_8x%1, 6,7,6, pSrc0, src0, src1, dst, src0Stride, src1tride, dstStride
+    mova            m4, [pw_256]
+    mova            m5, [pw_128]
+    add             r3, r3
+    add             r4, r4
+    mov             r6d, %1/4
+
+.loop:
+    movu            xm0, [r0]
+    vinserti128     m0, m0, [r0 + r3], 1
+
+    movu            xm2, [r1]
+    vinserti128     m2, m2, [r1 + r4], 1
+
+    paddw           m0, m2
+    pmulhrsw        m0, m4
+    paddw           m0, m5
+
+    packuswb        m0, m0
+    vextracti128    xm1, m0, 1
+    movq            [r2], xm0
+    movq            [r2 + r5], xm1
+
+    lea             r2, [r2 + 2 * r5]
+    lea             r0, [r0 + 2 * r3]
+    lea             r1, [r1 + 2 * r4]
+
+    movu            xm0, [r0]
+    vinserti128     m0, m0, [r0 + r3], 1
+
+    movu            m2, [r1]
+    vinserti128     m2, m2, [r1 + r4], 1
+
+    paddw           m0, m2
+    pmulhrsw        m0, m4
+    paddw           m0, m5
+
+    packuswb        m0, m0
+    vextracti128    xm1, m0, 1
+    movq            [r2], xm0
+    movq            [r2 + r5], xm1
+
+    lea             r2, [r2 + 2 * r5]
+    lea             r0, [r0 + 2 * r3]
+    lea             r1, [r1 + 2 * r4]
+
+    dec             r6d
+    jnz             .loop
+    RET
+%endmacro
 
+ADDAVG_W8_H4_AVX2 4
+ADDAVG_W8_H4_AVX2 8
+ADDAVG_W8_H4_AVX2 16
+ADDAVG_W8_H4_AVX2 32
+
+%macro ADDAVG_W12_H4_AVX2 1
+INIT_YMM avx2
+cglobal addAvg_12x%1, 6,7,7, pSrc0, src0, src1, dst, src0Stride, src1tride, dstStride
+    mova            m4, [pw_256]
+    mova            m5, [pw_128]
+    add             r3, r3
+    add             r4, r4
+    mov             r6d, %1/4
+
+.loop:
+    movu            xm0, [r0]
+    movu            xm1, [r1]
+    movq            xm2, [r0 + 16]
+    movq            xm3, [r1 + 16]
+    vinserti128     m0, m0, xm2, 1
+    vinserti128     m1, m1, xm3, 1
+
+    paddw           m0, m1
+    pmulhrsw        m0, m4
+    paddw           m0, m5
+
+    movu            xm1, [r0 + r3]
+    movu            xm2, [r1 + r4]
+    movq            xm3, [r0 + r3 + 16]
+    movq            xm6, [r1 + r3 + 16]
+    vinserti128     m1, m1, xm3, 1
+    vinserti128     m2, m2, xm6, 1
+
+    paddw           m1, m2
+    pmulhrsw        m1, m4
+    paddw           m1, m5
+
+    packuswb        m0, m1
+    vextracti128    xm1, m0, 1
+    movq            [r2], xm0
+    movd            [r2 + 8], xm1
+    vpshufd         m1, m1, 2
+    movhps          [r2 + r5], xm0
+    movd            [r2 + r5 + 8], xm1
+
+    lea             r2, [r2 + 2 * r5]
+    lea             r0, [r0 + 2 * r3]
+    lea             r1, [r1 + 2 * r4]
+
+    movu            xm0, [r0]
+    movu            xm1, [r1]
+    movq            xm2, [r0 + 16]
+    movq            xm3, [r1 + 16]
+    vinserti128     m0, m0, xm2, 1
+    vinserti128     m1, m1, xm3, 1
+
+    paddw           m0, m1
+    pmulhrsw        m0, m4
+    paddw           m0, m5
+
+    movu            xm1, [r0 + r3]
+    movu            xm2, [r1 + r4]
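
Note on the addAvg arithmetic above: the paddw, pmulhrsw by pw_256, paddw pw_128, packuswb sequence can be read as a rounded average of two intermediate samples. The C sketch below models one 16-bit lane and compares it with a direct formula. It assumes pw_256 and pw_128 hold the values their names suggest, an 8-bit output, and inputs whose sum fits in 16 bits; the function names are hypothetical, and the direct formula (offset 64 + 2 * 8192, shift 7) is my reading of the intent rather than something shown in this diff.

```c
#include <stdint.h>

static uint8_t sat_u8(int v)          /* packuswb on one lane */
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Arithmetic shift right without relying on >> of negative values. */
static int32_t sar(int32_t v, int n)
{
    return v >= 0 ? v >> n : -((-v + (1 << n) - 1) >> n);
}

/* pmulhrsw on one lane: ((a * b >> 14) + 1) >> 1 */
static int16_t pmulhrsw(int16_t a, int16_t b)
{
    return (int16_t)sar(sar((int32_t)a * b, 14) + 1, 1);
}

/* Lane-by-lane model of the instruction sequence in the diff. */
static uint8_t add_avg_asm_style(int16_t a, int16_t b)
{
    int16_t s = (int16_t)(a + b);     /* paddw, assumed not to overflow */
    s = pmulhrsw(s, 256);             /* equals (s + 64) >> 7 */
    s = (int16_t)(s + 128);           /* paddw pw_128 */
    return sat_u8(s);
}

/* Direct form: add both offsets first, then shift and clip. */
static uint8_t add_avg_direct(int16_t a, int16_t b)
{
    return sat_u8(sar((int32_t)a + b + 64 + 2 * 8192, 7));
}
```

The two agree because 2 * 8192 is exactly 128 << 7, so the offset can be added after the shift as a plain 128, which keeps every intermediate value within 16 bits.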

 

x265_1.5.tar.gz/source/common/x86/pixel-a.asm -> x265_1.6.tar.gz/source/common/x86/pixel-a.asm Changed

@@ -38,13 +38,15 @@
            times 4 db 1, -1
            times 8 db 1
            times 4 db 1, -1
-hmul_4p:   times 2 db 1, 1, 1, 1, 1, -1, 1, -1
+hmul_4p:   times 4 db 1, 1, 1, 1, 1, -1, 1, -1
 mask_10:   times 4 dw 0, -1
 mask_1100: times 2 dd 0, -1
 hmul_8w:   times 4 dw 1
            times 2 dw 1, -1
+           times 4 dw 1
+           times 2 dw 1, -1
 ALIGN 32
-hmul_w:    dw 1, -1, 1, -1, 1, -1, 1, -1
+hmul_w:    times 2 dw 1, -1, 1, -1, 1, -1, 1, -1
 ALIGN 32
 transd_shuf1: SHUFFLE_MASK_W 0, 8, 2, 10, 4, 12, 6, 14
 transd_shuf2: SHUFFLE_MASK_W 1, 9, 3, 11, 5, 13, 7, 15
@@ -1235,6 +1237,580 @@
     RET
 
 %else
+%if WIN64
+cglobal pixel_satd_16x24, 4,8,14    ;if WIN64 && cpuflag(avx)
+    SATD_START_SSE2 m6, m7
+    mov r6, r0
+    mov r7, r2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    lea r0, [r6 + 8*SIZEOF_PIXEL]
+    lea r2, [r7 + 8*SIZEOF_PIXEL]
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    pxor    m7, m7
+    movhlps m7, m6
+    paddd   m6, m7
+    pshufd  m7, m6, 1
+    paddd   m6, m7
+    movd   eax, m6
+    RET
+%else
+cglobal pixel_satd_16x24, 4,7,8,0-gprsize    ;if !WIN64
+    SATD_START_SSE2 m6, m7
+    mov r6, r0
+    mov [rsp], r2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    lea r0, [r6 + 8*SIZEOF_PIXEL]
+    mov r2, [rsp]
+    add r2, 8*SIZEOF_PIXEL
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    pxor    m7, m7
+    movhlps m7, m6
+    paddd   m6, m7
+    pshufd  m7, m6, 1
+    paddd   m6, m7
+    movd   eax, m6
+    RET
+%endif
+%if WIN64
+cglobal pixel_satd_32x48, 4,8,14    ;if WIN64 && cpuflag(avx)
+    SATD_START_SSE2 m6, m7
+    mov r6, r0
+    mov r7, r2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    lea r0, [r6 + 8*SIZEOF_PIXEL]
+    lea r2, [r7 + 8*SIZEOF_PIXEL]
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    lea r0, [r6 + 16*SIZEOF_PIXEL]
+    lea r2, [r7 + 16*SIZEOF_PIXEL]
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    lea r0, [r6 + 24*SIZEOF_PIXEL]
+    lea r2, [r7 + 24*SIZEOF_PIXEL]
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    pxor    m7, m7
+    movhlps m7, m6
+    paddd   m6, m7
+    pshufd  m7, m6, 1
+    paddd   m6, m7
+    movd   eax, m6
+    RET
+%else
+cglobal pixel_satd_32x48, 4,7,8,0-gprsize    ;if !WIN64
+    SATD_START_SSE2 m6, m7
+    mov r6, r0
+    mov [rsp], r2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    lea r0, [r6 + 8*SIZEOF_PIXEL]
+    mov r2, [rsp]
+    add r2, 8*SIZEOF_PIXEL
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    lea r0, [r6 + 16*SIZEOF_PIXEL]
+    mov r2, [rsp]
+    add r2, 16*SIZEOF_PIXEL
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    lea r0, [r6 + 24*SIZEOF_PIXEL]
+    mov r2, [rsp]
+    add r2, 24*SIZEOF_PIXEL
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    pxor    m7, m7
+    movhlps m7, m6
+    paddd   m6, m7
+    pshufd  m7, m6, 1
+    paddd   m6, m7
+    movd   eax, m6
+    RET
+%endif
+
+%if WIN64
+cglobal pixel_satd_24x64, 4,8,14    ;if WIN64 && cpuflag(avx)
+    SATD_START_SSE2 m6, m7
+    mov r6, r0
+    mov r7, r2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    lea r0, [r6 + 8*SIZEOF_PIXEL]
+    lea r2, [r7 + 8*SIZEOF_PIXEL]
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    lea r0, [r6 + 16*SIZEOF_PIXEL]
+    lea r2, [r7 + 16*SIZEOF_PIXEL]
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
+    pxor    m7, m7
+    movhlps m7, m6
+    paddd   m6, m7
+    pshufd  m7, m6, 1
+    paddd   m6, m7
+    movd   eax, m6
+    RET
+%else
+cglobal pixel_satd_24x64, 4,7,8,0-gprsize    ;if !WIN64
+    SATD_START_SSE2 m6, m7
+    mov r6, r0
+    mov [rsp], r2
+    call pixel_satd_8x8_internal2
+    call pixel_satd_8x8_internal2
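
Note on the pixel_satd_WxH entry points above: each one walks the block in 8-pixel-wide column strips, calls pixel_satd_8x8_internal2 once per 8 rows, and finally folds four 32-bit partial sums into eax. The C sketch below models that traversal and the final fold. It assumes an 8-bit build and assumes that the SATD of an 8x8 region equals the sum of four 4x4 Hadamard SATDs; the body of pixel_satd_8x8_internal2 is not shown in this diff, and every function name here is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint8_t pixel; /* assumption: 8-bit build */

/* SATD of one 4x4 tile: Hadamard-transform the difference,
 * sum the absolute coefficients, then halve. */
static int satd_4x4_ref(const pixel *a, intptr_t sa, const pixel *b, intptr_t sb)
{
    int d[4][4];
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            d[y][x] = a[y * sa + x] - b[y * sb + x];

    for (int y = 0; y < 4; y++) { /* horizontal butterflies */
        int s01 = d[y][0] + d[y][1], d01 = d[y][0] - d[y][1];
        int s23 = d[y][2] + d[y][3], d23 = d[y][2] - d[y][3];
        d[y][0] = s01 + s23; d[y][1] = s01 - s23;
        d[y][2] = d01 + d23; d[y][3] = d01 - d23;
    }

    int sum = 0;
    for (int x = 0; x < 4; x++) { /* vertical butterflies and accumulation */
        int s01 = d[0][x] + d[1][x], d01 = d[0][x] - d[1][x];
        int s23 = d[2][x] + d[3][x], d23 = d[2][x] - d[3][x];
        sum += abs(s01 + s23) + abs(s01 - s23) + abs(d01 + d23) + abs(d01 - d23);
    }
    return sum >> 1;
}

/* Column-strip traversal: the outer loop corresponds to the lea r0/r2 resets,
 * the second loop to the repeated calls within one strip. */
static int satd_tiled_ref(const pixel *a, intptr_t sa, const pixel *b, intptr_t sb,
                          int width, int height)
{
    int total = 0;
    for (int x0 = 0; x0 < width; x0 += 8)
        for (int y0 = 0; y0 < height; y0 += 8)
            for (int ty = 0; ty < 8; ty += 4)
                for (int tx = 0; tx < 8; tx += 4)
                    total += satd_4x4_ref(a + (y0 + ty) * sa + x0 + tx, sa,
                                          b + (y0 + ty) * sb + x0 + tx, sb);
    return total;
}

/* Model of the closing movhlps / paddd / pshufd / paddd / movd sequence. */
static int32_t hsum4_model(const int32_t lane[4])
{
    int32_t lo0 = lane[0] + lane[2]; /* movhlps + paddd */
    int32_t lo1 = lane[1] + lane[3];
    return lo0 + lo1;                /* pshufd imm 1 + paddd, low dword to eax */
}
```

With this layout a 16x24 block takes 2 strips of 3 calls, a 32x48 block 4 strips of 6, and a 24x64 block 3 strips of 8, which matches the call counts in the listings.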

 

x265_1.5.tar.gz/source/common/x86/pixel-util.h -> x265_1.6.tar.gz/source/common/x86/pixel-util.h Changed

@@ -30,6 +30,8 @@
 void x265_getResidual16_sse4(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
 void x265_getResidual32_sse2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
 void x265_getResidual32_sse4(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
+void x265_getResidual16_avx2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
+void x265_getResidual32_avx2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride);
 
 void x265_transpose4_sse2(pixel* dest, const pixel* src, intptr_t stride);
 void x265_transpose8_sse2(pixel* dest, const pixel* src, intptr_t stride);
@@ -48,7 +50,15 @@
 uint32_t x265_nquant_avx2(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff);
 void x265_dequant_normal_sse4(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift);
 void x265_dequant_normal_avx2(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift);
-int x265_count_nonzero_ssse3(const int16_t* quantCoeff, int numCoeff);
+
+int x265_count_nonzero_4x4_ssse3(const int16_t* quantCoeff);
+int x265_count_nonzero_8x8_ssse3(const int16_t* quantCoeff);
+int x265_count_nonzero_16x16_ssse3(const int16_t* quantCoeff);
+int x265_count_nonzero_32x32_ssse3(const int16_t* quantCoeff);
+int x265_count_nonzero_4x4_avx2(const int16_t* quantCoeff);
+int x265_count_nonzero_8x8_avx2(const int16_t* quantCoeff);
+int x265_count_nonzero_16x16_avx2(const int16_t* quantCoeff);
+int x265_count_nonzero_32x32_avx2(const int16_t* quantCoeff);
 
 void x265_weight_pp_sse4(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
 void x265_weight_pp_avx2(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
@@ -67,6 +77,8 @@
 void x265_scale1D_128to64_avx2(pixel*, const pixel*, intptr_t);
 void x265_scale2D_64to32_ssse3(pixel*, const pixel*, intptr_t);
 
+int x265_findPosLast_x64(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig);
+
 #define SETUP_CHROMA_PIXELSUB_PS_FUNC(W, H, cpu) \
     void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t*  dest, intptr_t destride, const pixel* src0, const pixel* src1, intptr_t srcstride0, intptr_t srcstride1); \
     void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t*  scr1, intptr_t srcStride0, intptr_t srcStride1);
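
Note on the count_nonzero change above: the single prototype taking (quantCoeff, numCoeff) is replaced by one prototype per transform size, each taking only the coefficient pointer. A rough C sketch of what that split implies is given below. It assumes each block is stored contiguously with size * size coefficients, and the function names are hypothetical stand-ins, not the library's own C primitives.

```c
#include <assert.h>
#include <stdint.h>

/* Generic form, shaped like the removed 1.5 prototype. */
static int count_nonzero_n(const int16_t *quantCoeff, int numCoeff)
{
    int count = 0;
    for (int i = 0; i < numCoeff; i++)
        count += (quantCoeff[i] != 0);
    return count;
}

/* Fixed-size forms, shaped like the 1.6 prototypes. With the size known at
 * compile time, an optimized version can drop the length argument and
 * fully unroll its loop. */
static int count_nonzero_4x4_ref(const int16_t *q)   { return count_nonzero_n(q, 4 * 4); }
static int count_nonzero_8x8_ref(const int16_t *q)   { return count_nonzero_n(q, 8 * 8); }
static int count_nonzero_16x16_ref(const int16_t *q) { return count_nonzero_n(q, 16 * 16); }
static int count_nonzero_32x32_ref(const int16_t *q) { return count_nonzero_n(q, 32 * 32); }
```

A caller that used to pass numCoeff would, under this scheme, pick the entry point from the transform size instead.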

 

x265_1.5.tar.gz/source/common/x86/pixel-util8.asm -> x265_1.6.tar.gz/source/common/x86/pixel-util8.asm Changed

@@ -3,6 +3,7 @@
 ;*
 ;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>
 ;*          Nabajit Deka <nabajit@multicorewareinc.com>
+;*          Rajesh Paulraj <rajesh@multicorewareinc.com>
 ;*
 ;* This program is free software; you can redistribute it and/or modify
 ;* it under the terms of the GNU General Public License as published by
@@ -63,6 +64,12 @@
 cextern pd_1
 cextern pd_32767
 cextern pd_n32768
+cextern pb_2
+cextern pb_4
+cextern pb_8
+cextern pb_16
+cextern pb_32
+cextern pb_64
 
 ;-----------------------------------------------------------------------------
 ; void getResidual(pixel *fenc, pixel *pred, int16_t *residual, intptr_t stride)
@@ -95,9 +102,9 @@
     punpcklqdq   m0, m1
     punpcklqdq   m2, m3
     psubw        m0, m2
-
     movh        [r2], m0
     movhps      [r2 + r3], m0
+    RET
 %else
 cglobal getResidual4, 4,4,5
     pxor        m0, m0
@@ -130,8 +137,8 @@
     psubw       m1, m3
     movh        [r2], m1
     movhps      [r2 + r3 * 2], m1
-%endif
     RET
+%endif
 
 
 INIT_XMM sse2
@@ -157,6 +164,7 @@
     lea         r2, [r2 + r3 * 2]
 %endif
 %endrep
+    RET
 %else
 cglobal getResidual8, 4,4,5
     pxor        m0, m0
@@ -183,8 +191,9 @@
     lea         r2, [r2 + r3 * 4]
 %endif
 %endrep
-%endif
     RET
+%endif
+
 
 %if HIGH_BIT_DEPTH
 INIT_XMM sse2
@@ -238,10 +247,9 @@
     lea         r0, [r0 + r3 * 2]
     lea         r1, [r1 + r3 * 2]
     lea         r2, [r2 + r3 * 2]
-
     jnz        .loop
+    RET
 %else
-
 INIT_XMM sse4
 cglobal getResidual16, 4,5,8
     mov         r4d, 16/4
@@ -302,11 +310,67 @@
     lea         r0, [r0 + r3 * 2]
     lea         r1, [r1 + r3 * 2]
     lea         r2, [r2 + r3 * 4]
-
     jnz        .loop
+    RET
 %endif
 
+%if HIGH_BIT_DEPTH
+INIT_YMM avx2
+cglobal getResidual16, 4,4,5
+    add         r3, r3
+    pxor        m0, m0
+
+%assign x 0
+%rep 16/2
+    movu        m1, [r0]
+    movu        m2, [r0 + r3]
+    movu        m3, [r1]
+    movu        m4, [r1 + r3]
+
+    psubw       m1, m3
+    psubw       m2, m4
+    movu        [r2], m1
+    movu        [r2 + r3], m2
+%assign x x+1
+%if (x != 8)
+    lea         r0, [r0 + r3 * 2]
+    lea         r1, [r1 + r3 * 2]
+    lea         r2, [r2 + r3 * 2]
+%endif
+%endrep
     RET
+%else
+INIT_YMM avx2
+cglobal getResidual16, 4,5,8
+    lea         r4, [r3 * 2]
+    add         r4d, r3d
+%assign x 0
+%rep 4
+    pmovzxbw    m0, [r0]
+    pmovzxbw    m1, [r0 + r3]
+    pmovzxbw    m2, [r0 + r3 * 2]
+    pmovzxbw    m3, [r0 + r4]
+    pmovzxbw    m4, [r1]
+    pmovzxbw    m5, [r1 + r3]
+    pmovzxbw    m6, [r1 + r3 * 2]
+    pmovzxbw    m7, [r1 + r4]
+    psubw       m0, m4
+    psubw       m1, m5
+    psubw       m2, m6
+    psubw       m3, m7
+    movu        [r2], m0
+    movu        [r2 + r3 * 2], m1
+    movu        [r2 + r3 * 2 * 2], m2
+    movu        [r2 + r4 * 2], m3
+%assign x x+1
+%if (x != 4)
+    lea         r0, [r0 + r3 * 2 * 2]
+    lea         r1, [r1 + r3 * 2 * 2]
+    lea         r2, [r2 + r3 * 4 * 2]
+%endif
+%endrep
+    RET
+%endif
 
 %if HIGH_BIT_DEPTH
 INIT_XMM sse2
@@ -357,9 +421,8 @@
     lea         r0, [r0 + r3 * 2]
     lea         r1, [r1 + r3 * 2]
     lea         r2, [r2 + r3 * 2]
-
     jnz        .loop
-
+    RET
 %else
 INIT_XMM sse4
 cglobal getResidual32, 4,5,7
@@ -415,12 +478,70 @@
     lea         r0, [r0 + r3 * 2]
     lea         r1, [r1 + r3 * 2]
     lea         r2, [r2 + r3 * 4]
-
     jnz        .loop
+    RET
+%endif
+
+
+%if HIGH_BIT_DEPTH
+INIT_YMM avx2
+cglobal getResidual32, 4,4,5
+    add         r3, r3
+    pxor        m0, m0
+
+%assign x 0
+%rep 32
+    movu        m1, [r0]
+    movu        m2, [r0 + 32]
+    movu        m3, [r1]
+    movu        m4, [r1 + 32]
+
+    psubw       m1, m3
+    psubw       m2, m4
+    movu        [r2], m1
+    movu        [r2 + 32], m2
+%assign x x+1
+%if (x != 32)
+    lea         r0, [r0 + r3]
+    lea         r1, [r1 + r3]
+    lea         r2, [r2 + r3]
 %endif
+%endrep
     RET
+%else
+INIT_YMM avx2
+cglobal getResidual32, 4,5,8
+    lea         r4, [r3 * 2]
+%assign x 0
+%rep 16
+    pmovzxbw    m0, [r0]
+    pmovzxbw    m1, [r0 + 16]
+    pmovzxbw    m2, [r0 + r3]
+    pmovzxbw    m3, [r0 + r3 + 16]
+
+    pmovzxbw    m4, [r1]
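
Note on the getResidual kernels above: the comment in the file gives the signature as getResidual(fenc, pred, residual, stride), and the AVX2 versions widen each pixel to 16 bits (pmovzxbw in the 8-bit path) before psubw. A minimal scalar C sketch of the same computation follows. It assumes an 8-bit build and one stride, counted in elements, shared by all three buffers; the function name and the explicit size parameter are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t pixel; /* assumption: 8-bit build */

/* residual = fenc - pred over a size x size block. */
static void get_residual_ref(const pixel *fenc, const pixel *pred,
                             int16_t *residual, intptr_t stride, int size)
{
    for (int y = 0; y < size; y++) {
        for (int x = 0; x < size; x++)
            residual[x] = (int16_t)(fenc[x] - pred[x]); /* widen, then subtract */
        fenc += stride;
        pred += stride;
        residual += stride; /* same element stride; twice the byte stride */
    }
}
```

The byte stride of the int16_t output is twice that of the 8-bit inputs, which is why the 8-bit assembly path scales r3 by 2 when addressing r2.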

 

x265_1.5.tar.gz/source/common/x86/pixel.h -> x265_1.6.tar.gz/source/common/x86/pixel.h Changed

@@ -103,6 +103,13 @@
 DECL_X1(satd, avx)
 DECL_X1(satd, xop)
 DECL_X1(satd, avx2)
+int x265_pixel_satd_16x24_avx(const pixel*, intptr_t, const pixel*, intptr_t);
+int x265_pixel_satd_32x48_avx(const pixel*, intptr_t, const pixel*, intptr_t);
+int x265_pixel_satd_24x64_avx(const pixel*, intptr_t, const pixel*, intptr_t);
+int x265_pixel_satd_8x64_avx(const pixel*, intptr_t, const pixel*, intptr_t);
+int x265_pixel_satd_8x12_avx(const pixel*, intptr_t, const pixel*, intptr_t);
+int x265_pixel_satd_12x32_avx(const pixel*, intptr_t, const pixel*, intptr_t);
+int x265_pixel_satd_4x32_avx(const pixel*, intptr_t, const pixel*, intptr_t);
 int x265_pixel_satd_8x32_sse2(const pixel*, intptr_t, const pixel*, intptr_t);
 int x265_pixel_satd_16x4_sse2(const pixel*, intptr_t, const pixel*, intptr_t);
 int x265_pixel_satd_16x12_sse2(const pixel*, intptr_t, const pixel*, intptr_t);
@@ -170,10 +177,12 @@
 int x265_pixel_ssd_s_8_sse2(const int16_t*, intptr_t);
 int x265_pixel_ssd_s_16_sse2(const int16_t*, intptr_t);
 int x265_pixel_ssd_s_32_sse2(const int16_t*, intptr_t);
+int x265_pixel_ssd_s_16_avx2(const int16_t*, intptr_t);
 int x265_pixel_ssd_s_32_avx2(const int16_t*, intptr_t);
 
 #define ADDAVG(func)  \
-    void x265_ ## func ## _sse4(const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t);
+    void x265_ ## func ## _sse4(const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); \
+    void x265_ ## func ## _avx2(const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t);
 ADDAVG(addAvg_2x4)
 ADDAVG(addAvg_2x8)
 ADDAVG(addAvg_4x2);
@@ -228,6 +237,41 @@
 int x265_psyCost_ss_16x16_sse4(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
 int x265_psyCost_ss_32x32_sse4(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
 int x265_psyCost_ss_64x64_sse4(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
+void x265_pixel_avg_16x4_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_16x8_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_16x12_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_16x16_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_16x32_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_16x64_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_32x64_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_32x32_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_32x24_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_32x16_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_32x8_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_64x64_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_64x48_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_64x32_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+void x265_pixel_avg_64x16_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int);
+
+void x265_pixel_add_ps_16x16_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_add_ps_32x32_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_add_ps_64x64_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
+
+void x265_pixel_sub_ps_16x16_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_sub_ps_32x32_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_sub_ps_64x64_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
+
+int x265_psyCost_pp_4x4_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
+int x265_psyCost_pp_8x8_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
+int x265_psyCost_pp_16x16_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
+int x265_psyCost_pp_32x32_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
+int x265_psyCost_pp_64x64_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
+
+int x265_psyCost_ss_4x4_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
+int x265_psyCost_ss_8x8_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
+int x265_psyCost_ss_16x16_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
+int x265_psyCost_ss_32x32_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
+int x265_psyCost_ss_64x64_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
 
 #undef DECL_PIXELS
 #undef DECL_HEVC_SSD
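
Note on the new AVX2 declarations above: the prototypes give only the signatures. As a rough guide to what the `pixel_avg_WxH` and `pixel_sub_ps_WxH` kernels are expected to compute, here is a minimal scalar sketch in C. It assumes an 8-bit build (`pixel` is `uint8_t`), a rounded average for `pixel_avg`, and a plain difference for `pixel_sub_ps`. The names `pixel_avg_ref` and `pixel_sub_ps_ref` and the explicit `width`/`height` parameters are hypothetical and are not part of x265.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t pixel; /* assumption: 8-bit build (HIGH_BIT_DEPTH == 0) */

/* Hypothetical scalar reference for pixel_avg_WxH: rounded average of two
 * source blocks. The trailing int mirrors the unnamed last parameter of the
 * declarations and is ignored in this sketch. */
static void pixel_avg_ref(pixel* dst, intptr_t dstride,
                          const pixel* src0, intptr_t sstride0,
                          const pixel* src1, intptr_t sstride1,
                          int width, int height, int unused)
{
    (void)unused;
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (pixel)((src0[x] + src1[x] + 1) >> 1);
        dst += dstride;
        src0 += sstride0;
        src1 += sstride1;
    }
}

/* Hypothetical scalar reference for pixel_sub_ps_WxH: 16-bit residual
 * a = b0 - b1, computed per pixel. */
static void pixel_sub_ps_ref(int16_t* a, intptr_t dstride,
                             const pixel* b0, const pixel* b1,
                             intptr_t sstride0, intptr_t sstride1,
                             int width, int height)
{
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            a[x] = (int16_t)(b0[x] - b1[x]);
        a += dstride;
        b0 += sstride0;
        b1 += sstride1;
    }
}
```

A reference of this kind is mainly useful for checking an optimized kernel against known-good output on random blocks.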

 

x265_1.5.tar.gz/source/common/x86/pixeladd8.asm -> x265_1.6.tar.gz/source/common/x86/pixeladd8.asm Changed

@@ -398,6 +398,52 @@
 
     jnz         .loop
     RET
+
+INIT_YMM avx2
+cglobal pixel_add_ps_16x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
+    mov         r6d,        %2/4
+    add         r5,         r5
+.loop:
+
+    pmovzxbw    m0,         [r2]        ; row 0 of src0
+    pmovzxbw    m1,         [r2 + r4]   ; row 1 of src0
+    movu        m2,        [r3]        ; row 0 of src1
+    movu        m3,        [r3 + r5]   ; row 1 of src1
+    paddw       m0,         m2
+    paddw       m1,         m3
+    packuswb    m0,         m1
+
+    lea         r2,         [r2 + r4 * 2]
+    lea         r3,         [r3 + r5 * 2]
+
+    pmovzxbw    m2,         [r2]        ; row 2 of src0
+    pmovzxbw    m3,         [r2 + r4]   ; row 3 of src0
+    movu        m4,        [r3]        ; row 2 of src1
+    movu        m5,        [r3 + r5]   ; row 3 of src1
+    paddw       m2,         m4
+    paddw       m3,         m5
+    packuswb    m2,         m3
+
+    lea         r2,         [r2 + r4 * 2]
+    lea         r3,         [r3 + r5 * 2]
+
+    vpermq      m0, m0, 11011000b
+    movu        [r0],      xm0           ; row 0 of dst
+    vextracti128 xm3, m0, 1
+    movu        [r0 + r1], xm3           ; row 1 of dst
+
+    lea         r0,         [r0 + r1 * 2]
+    vpermq      m2, m2, 11011000b
+    movu        [r0],      xm2           ; row 2 of dst
+    vextracti128 xm3, m2, 1
+    movu         [r0 + r1], xm3          ; row 3 of dst
+
+    lea         r0,         [r0 + r1 * 2]
+
+    dec         r6d
+    jnz         .loop
+
+    RET
 %endif
 %endmacro
 
@@ -523,6 +569,67 @@
 
     jnz         .loop
     RET
+
+INIT_YMM avx2
+cglobal pixel_add_ps_32x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
+    mov         r6d,        %2/4
+    add         r5,         r5
+.loop:
+    pmovzxbw    m0,         [r2]                ; first half of row 0 of src0
+    pmovzxbw    m1,         [r2 + 16]           ; second half of row 0 of src0
+    movu        m2,         [r3]                ; first half of row 0 of src1
+    movu        m3,         [r3 + 32]           ; second half of row 0 of src1
+
+    paddw       m0,         m2
+    paddw       m1,         m3
+    packuswb    m0,         m1
+    vpermq      m0, m0, 11011000b
+    movu        [r0],      m0                   ; row 0 of dst
+
+    pmovzxbw    m0,         [r2 + r4]           ; first half of row 1 of src0
+    pmovzxbw    m1,         [r2 + r4 + 16]      ; second half of row 1 of src0
+    movu        m2,         [r3 + r5]           ; first half of row 1 of src1
+    movu        m3,         [r3 + r5 + 32]      ; second half of row 1 of src1
+
+    paddw       m0,         m2
+    paddw       m1,         m3
+    packuswb    m0,         m1
+    vpermq      m0, m0, 11011000b
+    movu        [r0 + r1],      m0              ; row 1 of dst
+
+    lea         r2,         [r2 + r4 * 2]
+    lea         r3,         [r3 + r5 * 2]
+    lea         r0,         [r0 + r1 * 2]
+
+    pmovzxbw    m0,         [r2]                ; first half of row 2 of src0
+    pmovzxbw    m1,         [r2 + 16]           ; second half of row 2 of src0
+    movu        m2,         [r3]                ; first half of row 2 of src1
+    movu        m3,         [r3 + 32]           ; second half of row 2 of src1
+
+    paddw       m0,         m2
+    paddw       m1,         m3
+    packuswb    m0,         m1
+    vpermq      m0, m0, 11011000b
+    movu        [r0],      m0                   ; row 2 of dst
+
+    pmovzxbw    m0,         [r2 + r4]           ; first half of row 3 of src0
+    pmovzxbw    m1,         [r2 + r4 + 16]      ; second half of row 3 of src0
+    movu        m2,         [r3 + r5]           ; first half of row 3 of src1
+    movu        m3,         [r3 + r5 + 32]      ; second half of row 3 of src1
+
+    paddw       m0,         m2
+    paddw       m1,         m3
+    packuswb    m0,         m1
+    vpermq      m0, m0, 11011000b
+    movu        [r0 + r1],      m0              ; row 3 of dst
+
+    lea         r2,         [r2 + r4 * 2]
+    lea         r3,         [r3 + r5 * 2]
+    lea         r0,         [r0 + r1 * 2]
+
+    dec         r6d
+    jnz         .loop
+    RET
 %endif
 %endmacro
 
@@ -734,6 +841,60 @@
 
     jnz         .loop
     RET
+
+INIT_YMM avx2
+cglobal pixel_add_ps_64x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
+    mov         r6d,        %2/2
+    add         r5,         r5
+.loop:
+    pmovzxbw    m0,         [r2]                ; first 16 of row 0 of src0
+    pmovzxbw    m1,         [r2 + 16]           ; second 16 of row 0 of src0
+    pmovzxbw    m2,         [r2 + 32]           ; third 16 of row 0 of src0
+    pmovzxbw    m3,         [r2 + 48]           ; forth 16 of row 0 of src0
+    movu        m4,         [r3]                ; first 16 of row 0 of src1
+    movu        m5,         [r3 + 32]           ; second 16 of row 0 of src1
+    movu        m6,         [r3 + 64]           ; third 16 of row 0 of src1
+    movu        m7,         [r3 + 96]           ; forth 16 of row 0 of src1
+
+    paddw       m0,         m4
+    paddw       m1,         m5
+    paddw       m2,         m6
+    paddw       m3,         m7
+    packuswb    m0,         m1
+    packuswb    m2,         m3
+    vpermq      m0, m0, 11011000b
+    movu        [r0],      m0                   ; first 32 of row 0 of dst
+    vpermq      m2, m2, 11011000b
+    movu        [r0 + 32],      m2              ; second 32 of row 0 of dst
+
+    pmovzxbw    m0,         [r2 + r4]           ; first 16 of row 1 of src0
+    pmovzxbw    m1,         [r2 + r4 + 16]      ; second 16 of row 1 of src0
+    pmovzxbw    m2,         [r2 + r4 + 32]      ; third 16 of row 1 of src0
+    pmovzxbw    m3,         [r2 + r4 + 48]      ; forth 16 of row 1 of src0
+    movu        m4,         [r3 + r5]           ; first 16 of row 1 of src1
+    movu        m5,         [r3 + r5 + 32]      ; second 16 of row 1 of src1
+    movu        m6,         [r3 + r5 + 64]      ; third 16 of row 1 of src1
+    movu        m7,         [r3 + r5 + 96]      ; forth 16 of row 1 of src1
+
+    paddw       m0,         m4
+    paddw       m1,         m5
+    paddw       m2,         m6
+    paddw       m3,         m7
+    packuswb    m0,         m1
+    packuswb    m2,         m3
+    vpermq      m0, m0, 11011000b
+    movu        [r0 + r1],      m0              ; first 32 of row 1 of dst
+    vpermq      m2, m2, 11011000b
+    movu        [r0 + r1 + 32],      m2         ; second 32 of row 1 of dst
+
+    lea         r2,         [r2 + r4 * 2]
+    lea         r3,         [r3 + r5 * 2]
+    lea         r0,         [r0 + r1 * 2]
+
+    dec         r6d
+    jnz         .loop
+    RET
+
 %endif
 %endmacro
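
Note on the AVX2 `pixel_add_ps` kernels above: each loop iteration zero-extends 8-bit pixels to words (`pmovzxbw`), adds the 16-bit residual (`paddw`), and packs back to bytes with unsigned saturation (`packuswb`). Because `packuswb` works within each 128-bit lane, `vpermq ... 11011000b` then restores the row order. The `add r5, r5` at the top converts the `src1` stride from `int16_t` elements to bytes. A minimal scalar sketch of the same computation, assuming an 8-bit build, follows. The name `pixel_add_ps_ref` and the `width`/`height` parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t pixel; /* assumption: 8-bit build (HIGH_BIT_DEPTH == 0) */

/* Clamp an intermediate sum to the 8-bit pixel range, matching the
 * unsigned saturation performed by packuswb. */
static pixel clip_pixel(int v)
{
    return (pixel)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar reference: dst = clip(src0 + src1), where src0 holds
 * 8-bit pixels and src1 holds 16-bit residuals. Strides are in elements. */
static void pixel_add_ps_ref(pixel* dst, intptr_t dstride,
                             const pixel* src0, const int16_t* src1,
                             intptr_t sstride0, intptr_t sstride1,
                             int width, int height)
{
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = clip_pixel(src0[x] + src1[x]);
        dst += dstride;
        src0 += sstride0;
        src1 += sstride1;
    }
}
```

The sketch agrees with the assembly only while `src0 + src1` fits in a signed 16-bit word, which holds for 8-bit pixels combined with residuals in the usual range.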

 

x265_1.5.tar.gz/source/common/x86/sad-a.asm -> x265_1.6.tar.gz/source/common/x86/sad-a.asm Changed

@@ -3710,3 +3710,749 @@
 SADX34_CACHELINE_FUNC 16, 16, 64, sse2, ssse3, ssse3
 SADX34_CACHELINE_FUNC 16,  8, 64, sse2, ssse3, ssse3
 
+%if HIGH_BIT_DEPTH==0
+INIT_YMM avx2
+cglobal pixel_sad_x3_8x4, 6,6,5
+    xorps           m0, m0
+    xorps           m1, m1
+
+    sub             r2, r1          ; rebase on pointer r1
+    sub             r3, r1
+
+    ; row 0
+    vpbroadcastq   xm2, [r0 + 0 * FENC_STRIDE]
+    movq           xm3, [r1]
+    movhps         xm3, [r1 + r2]
+    movq           xm4, [r1 + r3]
+    psadbw         xm3, xm2
+    psadbw         xm4, xm2
+    paddd          xm0, xm3
+    paddd          xm1, xm4
+    add             r1, r4
+
+    ; row 1
+    vpbroadcastq   xm2, [r0 + 1 * FENC_STRIDE]
+    movq           xm3, [r1]
+    movhps         xm3, [r1 + r2]
+    movq           xm4, [r1 + r3]
+    psadbw         xm3, xm2
+    psadbw         xm4, xm2
+    paddd          xm0, xm3
+    paddd          xm1, xm4
+    add             r1, r4
+
+    ; row 2
+    vpbroadcastq   xm2, [r0 + 2 * FENC_STRIDE]
+    movq           xm3, [r1]
+    movhps         xm3, [r1 + r2]
+    movq           xm4, [r1 + r3]
+    psadbw         xm3, xm2
+    psadbw         xm4, xm2
+    paddd          xm0, xm3
+    paddd          xm1, xm4
+    add             r1, r4
+
+    ; row 3
+    vpbroadcastq   xm2, [r0 + 3 * FENC_STRIDE]
+    movq           xm3, [r1]
+    movhps         xm3, [r1 + r2]
+    movq           xm4, [r1 + r3]
+    psadbw         xm3, xm2
+    psadbw         xm4, xm2
+    paddd          xm0, xm3
+    paddd          xm1, xm4
+
+    pshufd          xm0, xm0, q0020
+    movq            [r5 + 0], xm0
+    movd            [r5 + 8], xm1
+    RET
+
+INIT_YMM avx2
+cglobal pixel_sad_x3_8x8, 6,6,5
+    xorps           m0, m0
+    xorps           m1, m1
+
+    sub             r2, r1          ; rebase on pointer r1
+    sub             r3, r1
+%assign x 0
+%rep 4
+    ; row 0
+    vpbroadcastq   xm2, [r0 + 0 * FENC_STRIDE]
+    movq           xm3, [r1]
+    movhps         xm3, [r1 + r2]
+    movq           xm4, [r1 + r3]
+    psadbw         xm3, xm2
+    psadbw         xm4, xm2
+    paddd          xm0, xm3
+    paddd          xm1, xm4
+    add             r1, r4
+
+    ; row 1
+    vpbroadcastq   xm2, [r0 + 1 * FENC_STRIDE]
+    movq           xm3, [r1]
+    movhps         xm3, [r1 + r2]
+    movq           xm4, [r1 + r3]
+    psadbw         xm3, xm2
+    psadbw         xm4, xm2
+    paddd          xm0, xm3
+    paddd          xm1, xm4
+
+%assign x x+1
+  %if x < 4
+    add             r1, r4
+    add             r0, 2 * FENC_STRIDE
+  %endif
+%endrep
+
+    pshufd          xm0, xm0, q0020
+    movq            [r5 + 0], xm0
+    movd            [r5 + 8], xm1
+    RET
+
+INIT_YMM avx2
+cglobal pixel_sad_x3_8x16, 6,6,5
+    xorps           m0, m0
+    xorps           m1, m1
+
+    sub             r2, r1          ; rebase on pointer r1
+    sub             r3, r1
+%assign x 0
+%rep 8
+    ; row 0
+    vpbroadcastq   xm2, [r0 + 0 * FENC_STRIDE]
+    movq           xm3, [r1]
+    movhps         xm3, [r1 + r2]
+    movq           xm4, [r1 + r3]
+    psadbw         xm3, xm2
+    psadbw         xm4, xm2
+    paddd          xm0, xm3
+    paddd          xm1, xm4
+    add             r1, r4
+
+    ; row 1
+    vpbroadcastq   xm2, [r0 + 1 * FENC_STRIDE]
+    movq           xm3, [r1]
+    movhps         xm3, [r1 + r2]
+    movq           xm4, [r1 + r3]
+    psadbw         xm3, xm2
+    psadbw         xm4, xm2
+    paddd          xm0, xm3
+    paddd          xm1, xm4
+
+%assign x x+1
+  %if x < 8
+    add             r1, r4
+    add             r0, 2 * FENC_STRIDE
+  %endif
+%endrep
+
+    pshufd          xm0, xm0, q0020
+    movq            [r5 + 0], xm0
+    movd            [r5 + 8], xm1
+    RET
+
+INIT_YMM avx2
+cglobal pixel_sad_x4_8x8, 7,7,5
+    xorps           m0, m0
+    xorps           m1, m1
+
+    sub             r2, r1          ; rebase on pointer r1
+    sub             r3, r1
+    sub             r4, r1
+%assign x 0
+%rep 4
+    ; row 0
+    vpbroadcastq   xm2, [r0 + 0 * FENC_STRIDE]
+    movq           xm3, [r1]
+    movhps         xm3, [r1 + r2]
+    movq           xm4, [r1 + r3]
+    movhps         xm4, [r1 + r4]
+    psadbw         xm3, xm2
+    psadbw         xm4, xm2
+    paddd          xm0, xm3
+    paddd          xm1, xm4
+    add             r1, r5
+
+    ; row 1
+    vpbroadcastq   xm2, [r0 + 1 * FENC_STRIDE]
+    movq           xm3, [r1]
+    movhps         xm3, [r1 + r2]
+    movq           xm4, [r1 + r3]
+    movhps         xm4, [r1 + r4]
+    psadbw         xm3, xm2
+    psadbw         xm4, xm2
+    paddd          xm0, xm3
+    paddd          xm1, xm4
+
+%assign x x+1
+  %if x < 4
+    add             r1, r5
+    add             r0, 2 * FENC_STRIDE
+  %endif
+%endrep
+
+    pshufd          xm0, xm0, q0020
+    pshufd          xm1, xm1, q0020
+    movq            [r6 + 0], xm0
+    movq            [r6 + 8], xm1
+    RET
+
+INIT_YMM avx2
+cglobal pixel_sad_32x8, 4,4,6
+    xorps           m0, m0
+    xorps           m5, m5
+
+    movu           m1, [r0]               ; row 0 of pix0
+    movu           m2, [r2]               ; row 0 of pix1
+    movu           m3, [r0 + r1]          ; row 1 of pix0
+    movu           m4, [r2 + r3]          ; row 1 of pix1
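
Note on the `pixel_sad_x3_8xN` kernels in the hunk above: each one compares a single encoder block (`r0`, stepped by `FENC_STRIDE`) against three reference blocks (`r1`, `r2`, `r3`, sharing the stride in `r4`) and writes three 32-bit sums to `r5`. A minimal scalar sketch of that computation follows. The name `sad_x3_ref`, the `width`/`height` parameters, and the value 64 for `FENC_STRIDE` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint8_t pixel;  /* assumption: 8-bit build (HIGH_BIT_DEPTH == 0) */
#define FENC_STRIDE 64  /* assumption: fixed stride of the encoder block */

/* Hypothetical scalar reference: sum of absolute differences between one
 * encoder block and three candidate reference blocks, one sum per candidate. */
static void sad_x3_ref(const pixel* fenc,
                       const pixel* ref0, const pixel* ref1, const pixel* ref2,
                       intptr_t ref_stride, int32_t res[3],
                       int width, int height)
{
    assert(width > 0 && height > 0);
    res[0] = res[1] = res[2] = 0;
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            res[0] += abs(fenc[x] - ref0[x]);
            res[1] += abs(fenc[x] - ref1[x]);
            res[2] += abs(fenc[x] - ref2[x]);
        }
        fenc += FENC_STRIDE;
        ref0 += ref_stride;
        ref1 += ref_stride;
        ref2 += ref_stride;
    }
}
```

In the assembly, the first two candidates share one register (`movq` plus `movhps`), so a single `psadbw` yields both sums. The scalar version keeps the three accumulators separate for clarity.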

 
+    movq           xm3, [r1]
+    movhps         xm3, [r1 + r2]
+    movq           xm4, [r1 + r3]
+    movhps         xm4, [r1 + r4]
+    psadbw         xm3, xm2
+    psadbw         xm4, xm2
+    paddd          xm0, xm3
+    paddd          xm1, xm4
+
+%assign x x+1
+  %if x < 4
+    add             r1, r5
+    add             r0, 2 * FENC_STRIDE
+  %endif
+%endrep
+
+    pshufd          xm0, xm0, q0020
+    pshufd          xm1, xm1, q0020
+    movq            [r6 + 0], xm0
+    movq            [r6 + 8], xm1
+    RET
+
+INIT_YMM avx2
+cglobal pixel_sad_32x8, 4,4,6
+    xorps           m0, m0
+    xorps           m5, m5
+
+    movu           m1, [r0]               ; row 0 of pix0
+    movu           m2, [r2]               ; row 0 of pix1
+    movu           m3, [r0 + r1]          ; row 1 of pix0
+    movu           m4, [r2 + r3]          ; row 1 of pix1
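
Note on the pixel_sad_x3_8xN kernels above: each row broadcasts eight source pixels, loads eight pixels from each of three candidate blocks, and accumulates per-candidate sums of absolute differences with psadbw. A scalar reference can make that easier to check against. The sketch below is not taken from x265; the function name, the parameter order, and the source stride of 64 (standing in for FENC_STRIDE) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Assumption: the source block uses a fixed row stride. 64 is an illustrative
// value standing in for FENC_STRIDE in the assembly.
constexpr std::ptrdiff_t kFencStride = 64;

// Scalar reference (hypothetical helper): SAD of one 8-wide source block
// against three candidate blocks that share a single reference stride.
void sadX3_8xN(const uint8_t* fenc,
               const uint8_t* ref0, const uint8_t* ref1, const uint8_t* ref2,
               std::ptrdiff_t refStride, int rows, int32_t res[3])
{
    assert(rows > 0);
    res[0] = res[1] = res[2] = 0;

    for (int y = 0; y < rows; y++)
    {
        for (int x = 0; x < 8; x++)
        {
            const int src = fenc[x];
            res[0] += std::abs(src - ref0[x]);
            res[1] += std::abs(src - ref1[x]);
            res[2] += std::abs(src - ref2[x]);
        }
        // The source advances by its fixed stride, the candidates by refStride.
        fenc += kFencStride;
        ref0 += refStride;
        ref1 += refStride;
        ref2 += refStride;
    }
}
```

The assembly avoids three separate pointer increments by rebasing the second and third candidate pointers on the first (sub r2, r1 / sub r3, r1), so only r1 has to be advanced per row.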

x265_1.5.tar.gz/source/common/x86/ssd-a.asm -> x265_1.6.tar.gz/source/common/x86/ssd-a.asm Changed

@@ -822,10 +822,10 @@
 
 %if HIGH_BIT_DEPTH == 0
 %macro SSD_LOAD_FULL 5
-    mova      m1, [t0+%1]
-    mova      m2, [t2+%2]
-    mova      m3, [t0+%3]
-    mova      m4, [t2+%4]
+    movu      m1, [t0+%1]
+    movu      m2, [t2+%2]
+    movu      m3, [t0+%3]
+    movu      m4, [t2+%4]
 %if %5==1
     add       t0, t1
     add       t2, t3
@@ -1094,6 +1094,8 @@
 INIT_YMM avx2
 SSD 16, 16
 SSD 16,  8
+SSD 32, 32
+SSD 64, 64
 %assign function_align 16
 %endif ; !HIGH_BIT_DEPTH
 
@@ -2548,6 +2550,35 @@
     movd    eax, m0
     RET
 
+INIT_YMM avx2
+cglobal pixel_ssd_s_16, 2,4,5
+    add     r1, r1
+    lea     r3, [r1 * 3]
+    mov     r2d, 16/4
+    pxor    m0, m0
+.loop:
+    movu    m1, [r0]
+    movu    m2, [r0 + r1]
+    movu    m3, [r0 + 2 * r1]
+    movu    m4, [r0 + r3]
+
+    lea     r0, [r0 + r1 * 4]
+    pmaddwd m1, m1
+    pmaddwd m2, m2
+    pmaddwd m3, m3
+    pmaddwd m4, m4
+    paddd   m1, m2
+    paddd   m3, m4
+    paddd   m1, m3
+    paddd   m0, m1
+
+    dec     r2d
+    jnz    .loop
+
+    ; calculate sum and return
+    HADDD   m0, m1
+    movd    eax, xm0
+    RET
 
 INIT_YMM avx2
 cglobal pixel_ssd_s_32, 2,4,5
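
Note on pixel_ssd_s_16 above: the loop loads four rows of sixteen 16-bit values per iteration, squares and pair-sums them with pmaddwd, and reduces the accumulator horizontally at the end. A plain C++ equivalent might look like the sketch below. The function name and the generic size parameter are illustrative assumptions; the stride is taken to be in elements, as suggested by the add r1, r1 that converts it to bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference (hypothetical helper): sum of squared values over a
// size x size block of 16-bit residuals. 'stride' is counted in elements.
int ssdS(const int16_t* a, std::ptrdiff_t stride, int size)
{
    assert(size > 0);
    int sum = 0;
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
            sum += a[x] * a[x];
        a += stride;
    }
    return sum;
}
```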

 

x265_1.5.tar.gz/source/encoder/analysis.cpp -> x265_1.6.tar.gz/source/encoder/analysis.cpp Changed

@@ -71,9 +71,10 @@
 
 Analysis::Analysis()
 {
-    m_totalNumJobs = m_numAcquiredJobs = m_numCompletedJobs = 0;
     m_reuseIntraDataCTU = NULL;
     m_reuseInterDataCTU = NULL;
+    m_reuseRef = NULL;
+    m_reuseBestMergeCand = NULL;
 }
 
 bool Analysis::create(ThreadLocalData *tld)
@@ -125,6 +126,11 @@
     m_slice = ctu.m_slice;
     m_frame = &frame;
 
+#if _DEBUG || CHECKED_BUILD
+    for (uint32_t i = 0; i <= g_maxCUDepth; i++)
+        for (uint32_t j = 0; j < MAX_PRED_TYPES; j++)
+            m_modeDepth[i].pred[j].invalidate();
+#endif
     invalidateContexts(0);
     m_quant.setQPforQuant(ctu);
     m_rqt[0].cur.load(initialContext);
@@ -139,10 +145,13 @@
         {
             int numPredDir = m_slice->isInterP() ? 1 : 2;
             m_reuseInterDataCTU = (analysis_inter_data *)m_frame->m_analysisData.interData;
-            reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir];
+            m_reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir];
+            m_reuseBestMergeCand = &m_reuseInterDataCTU->bestMergeCand[ctu.m_cuAddr * CUGeom::MAX_GEOMS];
         }
     }
 
+    ProfileCUScope(ctu, totalCTUTime, totalCTUs);
+
     uint32_t zOrder = 0;
     if (m_slice->m_sliceType == I_SLICE)
     {
@@ -153,6 +162,7 @@
             memcpy(&m_reuseIntraDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition);
             memcpy(&m_reuseIntraDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_lumaIntraDir, sizeof(uint8_t) * numPartition);
             memcpy(&m_reuseIntraDataCTU->partSizes[ctu.m_cuAddr * numPartition], bestCU->m_partSize, sizeof(uint8_t) * numPartition);
+            memcpy(&m_reuseIntraDataCTU->chromaModes[ctu.m_cuAddr * numPartition], bestCU->m_chromaIntraDir, sizeof(uint8_t) * numPartition);
         }
     }
     else
@@ -196,14 +206,16 @@
         return;
     else if (md.bestMode->cu.isIntra(0))
     {
+        md.pred[PRED_LOSSLESS].initCosts();
         md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom);
         PartSize size = (PartSize)md.pred[PRED_LOSSLESS].cu.m_partSize[0];
         uint8_t* modes = md.pred[PRED_LOSSLESS].cu.m_lumaIntraDir;
-        checkIntra(md.pred[PRED_LOSSLESS], cuGeom, size, modes);
+        checkIntra(md.pred[PRED_LOSSLESS], cuGeom, size, modes, NULL);
         checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth);
     }
     else
     {
+        md.pred[PRED_LOSSLESS].initCosts();
         md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom);
         md.pred[PRED_LOSSLESS].predYuv.copyFromYuv(md.bestMode->predYuv);
         encodeResAndCalcRdInterCU(md.pred[PRED_LOSSLESS], cuGeom);
@@ -225,15 +237,16 @@
         uint8_t* reuseDepth  = &m_reuseIntraDataCTU->depth[parentCTU.m_cuAddr * parentCTU.m_numPartitions];
         uint8_t* reuseModes  = &m_reuseIntraDataCTU->modes[parentCTU.m_cuAddr * parentCTU.m_numPartitions];
         char* reusePartSizes = &m_reuseIntraDataCTU->partSizes[parentCTU.m_cuAddr * parentCTU.m_numPartitions];
+        uint8_t* reuseChromaModes = &m_reuseIntraDataCTU->chromaModes[parentCTU.m_cuAddr * parentCTU.m_numPartitions];
 
-        if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.encodeIdx)
+        if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.absPartIdx)
         {
             m_quant.setQPforQuant(parentCTU);
 
             PartSize size = (PartSize)reusePartSizes[zOrder];
             Mode& mode = size == SIZE_2Nx2N ? md.pred[PRED_INTRA] : md.pred[PRED_INTRA_NxN];
             mode.cu.initSubCU(parentCTU, cuGeom);
-            checkIntra(mode, cuGeom, size, &reuseModes[zOrder]);
+            checkIntra(mode, cuGeom, size, &reuseModes[zOrder], &reuseChromaModes[zOrder]);
             checkBestMode(mode, depth);
 
             if (m_bTryLossless)
@@ -252,13 +265,13 @@
         m_quant.setQPforQuant(parentCTU);
 
         md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
-        checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL);
+        checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, NULL);
         checkBestMode(md.pred[PRED_INTRA], depth);
 
-        if (depth == g_maxCUDepth)
+        if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3)
         {
             md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom);
-            checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL);
+            checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL);
             checkBestMode(md.pred[PRED_INTRA_NxN], depth);
         }
 
@@ -286,7 +299,7 @@
             const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx);
             if (childGeom.flags & CUGeom::PRESENT)
             {
-                m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.encodeIdx);
+                m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
                 m_rqt[nextDepth].cur.load(*nextContext);
                 compressIntraCU(parentCTU, childGeom, zOrder);
 
@@ -308,203 +321,173 @@
             addSplitFlagCost(*splitPred, cuGeom.depth);
         else
             updateModeCost(*splitPred);
+
+        checkDQPForSplitPred(splitPred->cu, cuGeom);
         checkBestMode(*splitPred, depth);
     }
 
-    checkDQP(md.bestMode->cu, cuGeom);
-
     /* Copy best data to encData CTU and recon */
     md.bestMode->cu.copyToPic(depth);
     if (md.bestMode != &md.pred[PRED_SPLIT])
-        md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, parentCTU.m_cuAddr, cuGeom.encodeIdx);
+        md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, parentCTU.m_cuAddr, cuGeom.absPartIdx);
 }
 
-bool Analysis::findJob(int threadId)
+void Analysis::PMODE::processTasks(int workerThreadId)
 {
-    /* try to acquire a CU mode to analyze */
-    m_pmodeLock.acquire();
-    if (m_totalNumJobs > m_numAcquiredJobs)
-    {
-        int id = m_numAcquiredJobs++;
-        m_pmodeLock.release();
-
-        ProfileScopeEvent(pmode);
-        parallelModeAnalysis(threadId, id);
-
-        m_pmodeLock.acquire();
-        if (++m_numCompletedJobs == m_totalNumJobs)
-            m_modeCompletionEvent.trigger();
-        m_pmodeLock.release();
-        return true;
-    }
-    else
-        m_pmodeLock.release();
-
-    m_meLock.acquire();
-    if (m_totalNumME > m_numAcquiredME)
-    {
-        int id = m_numAcquiredME++;
-        m_meLock.release();
-
-        ProfileScopeEvent(pme);
-        parallelME(threadId, id);
-
-        m_meLock.acquire();
-        if (++m_numCompletedME == m_totalNumME)
-            m_meCompletionEvent.trigger();
-        m_meLock.release();
-        return true;
-    }
-    else
-        m_meLock.release();
-
-    return false;
+#if DETAILED_CU_STATS
+    int fe = master.m_modeDepth[cuGeom.depth].pred[PRED_2Nx2N].cu.m_encData->m_frameEncoderID;
+    master.m_stats[fe].countPModeTasks++;
+    ScopedElapsedTime pmodeTime(master.m_stats[fe].pmodeTime);
+#endif
+    ProfileScopeEvent(pmode);
+    master.processPmode(*this, master.m_tld[workerThreadId].analysis);
 }
 
-void Analysis::parallelME(int threadId, int meId)
+/* process pmode jobs until none remain; may be called by the master thread or by
+ * a bonded peer (slave) thread via pmodeTasks() */
+void Analysis::processPmode(PMODE& pmode, Analysis& slave)
 {
-    Analysis* slave;
-
-    if (threadId == -1)
-        slave = this;
-    else
+    /* acquire a mode task, else exit early */
+    int task;
+    pmode.m_lock.acquire();
+    if (pmode.m_jobTotal > pmode.m_jobAcquired)
     {
-        slave = &m_tld[threadId].analysis;
-        slave->setQP(*m_slice, m_rdCost.m_qp);
-        slave->m_slice = m_slice;
-        slave->m_frame = m_frame;
-
-        slave->m_me.setSourcePU(*m_curInterMode->fencYuv, m_curInterMode->cu.m_cuAddr, m_curGeom->encodeIdx, m_puAbsPartIdx, m_puWidth, m_puHeight);
-        slave->prepMotionCompensation(m_curInterMode->cu, *m_curGeom, m_curPart);
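
Note on the threading change above: both the removed findJob() and the new processPmode() hand out work by comparing a total job count with an acquired count while holding a lock. The sketch below shows that acquisition pattern in isolation. It uses std::mutex instead of x265's own Lock class, and TaskCounter, tryAcquire and drainTasks are hypothetical names, not part of x265.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a task group: a job total, a count of jobs
// already handed out, and a lock protecting both.
struct TaskCounter
{
    std::mutex lock;
    int        jobTotal;
    int        jobAcquired = 0;

    explicit TaskCounter(int total) : jobTotal(total) { assert(total >= 0); }

    // Returns the index of the next unclaimed task, or -1 if none remain.
    int tryAcquire()
    {
        std::lock_guard<std::mutex> guard(lock);
        if (jobTotal > jobAcquired)
            return jobAcquired++;
        return -1;
    }
};

// Process tasks until none remain. Several threads may call this on the same
// TaskCounter; each task index is handed out exactly once.
template <typename Fn>
int drainTasks(TaskCounter& group, Fn process)
{
    int done = 0;
    for (int task = group.tryAcquire(); task >= 0; task = group.tryAcquire())
    {
        process(task);
        done++;
    }
    return done;
}
```

The lock is held only while a task index is claimed, not while the task runs, which matches the acquire/release placement visible in the removed findJob().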

 

x265_1.5.tar.gz/source/encoder/analysis.h -> x265_1.6.tar.gz/source/encoder/analysis.h Changed

@@ -70,30 +70,43 @@
         CUDataMemPool  cuMemPool;
     };
 
+    class PMODE : public BondedTaskGroup
+    {
+    public:
+
+        Analysis&     master;
+        const CUGeom& cuGeom;
+        int           modes[MAX_PRED_TYPES];
+
+        PMODE(Analysis& m, const CUGeom& g) : master(m), cuGeom(g) {}
+
+        void processTasks(int workerThreadId);
+
+    protected:
+
+        PMODE operator=(const PMODE&);
+    };
+
+    void processPmode(PMODE& pmode, Analysis& slave);
+
     ModeDepth m_modeDepth[NUM_CU_DEPTH];
     bool      m_bTryLossless;
     bool      m_bChromaSa8d;
 
-    /* Analysis data for load/save modes, keeps getting incremented as CTU analysis proceeds and data is consumed or read */
-    analysis_intra_data* m_reuseIntraDataCTU;
-    analysis_inter_data* m_reuseInterDataCTU;
-    int32_t* reuseRef;
     Analysis();
+
     bool create(ThreadLocalData* tld);
     void destroy();
+
     Mode& compressCTU(CUData& ctu, Frame& frame, const CUGeom& cuGeom, const Entropy& initialContext);
 
 protected:
 
-    /* mode analysis distribution */
-    int           m_totalNumJobs;
-    volatile int  m_numAcquiredJobs;
-    volatile int  m_numCompletedJobs;
-    Lock          m_pmodeLock;
-    Event         m_modeCompletionEvent;
-    bool findJob(int threadId);
-    void parallelModeAnalysis(int threadId, int jobId);
-    void parallelME(int threadId, int meId);
+    /* Analysis data for load/save modes, keeps getting incremented as CTU analysis proceeds and data is consumed or read */
+    analysis_intra_data* m_reuseIntraDataCTU;
+    analysis_inter_data* m_reuseInterDataCTU;
+    int32_t*             m_reuseRef;
+    uint32_t*            m_reuseBestMergeCand;
 
     /* full analysis for an I-slice CU */
     void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder);
@@ -105,7 +118,7 @@
 
     /* measure merge and skip */
     void checkMerge2Nx2N_rd0_4(Mode& skip, Mode& merge, const CUGeom& cuGeom);
-    void checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGeom);
+    void checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGeom, bool isSkipMode);
 
     /* measure inter options */
     void checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize);
@@ -119,9 +132,6 @@
     /* add the RD cost of coding a split flag (0 or 1) to the given mode */
     void addSplitFlagCost(Mode& mode, uint32_t depth);
 
-    /* update CBF flags and QP values to be internally consistent */
-    void checkDQP(CUData& cu, const CUGeom& cuGeom);
-
     /* work-avoidance heuristics for RD levels < 5 */
     uint32_t topSkipMinDepth(const CUData& parentCTU, const CUGeom& cuGeom);
     bool recursionDepthCheck(const CUData& parentCTU, const CUGeom& cuGeom, const Mode& bestMode);
@@ -129,9 +139,13 @@
     /* generate residual and recon pixels for an entire CTU recursively (RD0) */
     void encodeResidue(const CUData& parentCTU, const CUGeom& cuGeom);
 
+    int calculateQpforCuSize(CUData& ctu, const CUGeom& cuGeom);
+
     /* check whether current mode is the new best */
     inline void checkBestMode(Mode& mode, uint32_t depth)
     {
+        X265_CHECK(mode.ok(), "mode costs are uninitialized\n");
+
         ModeDepth& md = m_modeDepth[depth];
         if (md.bestMode)
         {
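
Note on the new X265_CHECK(mode.ok(), ...) guard above: together with the invalidate() loop added for debug builds and the initCosts() calls in analysis.cpp, it suggests a sentinel scheme in which costs are poisoned up front and must be initialized before any comparison. The real Mode class is not shown in this diff, so the sketch below is only a guess at the idea; ModeCosts, its fields, the sentinel values and pickBest are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical sentinels marking "never initialized".
constexpr uint64_t kInvalidCost = std::numeric_limits<uint64_t>::max();
constexpr uint32_t kInvalidBits = std::numeric_limits<uint32_t>::max();

// Hypothetical stand-in for the cost fields of a prediction mode.
struct ModeCosts
{
    uint64_t rdCost = 0;
    uint32_t totalBits = 0;

    // Poison the costs so that accidental use is detectable.
    void invalidate() { rdCost = kInvalidCost; totalBits = kInvalidBits; }

    // Reset the costs before a mode is evaluated.
    void initCosts() { rdCost = 0; totalBits = 0; }

    // True once the costs no longer hold the poison values.
    bool ok() const { return rdCost != kInvalidCost && totalBits != kInvalidBits; }
};

// Same idea as the guard at the top of checkBestMode(): refuse to compare
// modes whose costs were never set.
const ModeCosts& pickBest(const ModeCosts& a, const ModeCosts& b)
{
    assert(a.ok() && b.ok());
    return b.rdCost < a.rdCost ? b : a;
}
```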

 

x265_1.5.tar.gz/source/encoder/api.cpp -> x265_1.6.tar.gz/source/encoder/api.cpp Changed

@@ -173,6 +173,7 @@
     {
         Encoder *encoder = static_cast<Encoder*>(enc);
 
+        encoder->stop();
         encoder->printSummary();
         encoder->destroy();
         delete encoder;
@@ -183,6 +184,8 @@
 void x265_cleanup(void)
 {
     BitCost::destroy();
+    CUData::s_partSet[0] = NULL; /* allow CUData to adjust to new CTU size */
+    g_ctuSizeConfigured = 0;
 }
 
 extern "C"
@@ -206,7 +209,7 @@
 
         uint32_t numCUsInFrame   = widthInCU * heightInCU;
         pic->analysisData.numCUsInFrame = numCUsInFrame;
-        pic->analysisData.numPartitions = NUM_CU_PARTITIONS;
+        pic->analysisData.numPartitions = NUM_4x4_PARTITIONS;
     }
 }
 
@@ -215,3 +218,36 @@
 {
     return x265_free(p);
 }
+
+static const x265_api libapi =
+{
+    &x265_param_alloc,
+    &x265_param_free,
+    &x265_param_default,
+    &x265_param_parse,
+    &x265_param_apply_profile,
+    &x265_param_default_preset,
+    &x265_picture_alloc,
+    &x265_picture_free,
+    &x265_picture_init,
+    &x265_encoder_open,
+    &x265_encoder_parameters,
+    &x265_encoder_headers,
+    &x265_encoder_encode,
+    &x265_encoder_get_stats,
+    &x265_encoder_log,
+    &x265_encoder_close,
+    &x265_cleanup,
+    x265_version_str,
+    x265_build_info_str,
+    x265_max_bit_depth,
+};
+
+extern "C"
+const x265_api* x265_api_get(int bitDepth)
+{
+    if (bitDepth && bitDepth != X265_DEPTH)
+        return NULL;
+
+    return &libapi;
+}
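
Note on x265_api_get() above: the function returns a pointer to a static table of entry points, or NULL when a specific bit depth is requested that this build does not provide (a depth of 0 means "whatever the build supports"). The sketch below imitates that shape with a two-entry table so the lookup rule can be tried in isolation. DemoApi, demoApiGet, queryDepth and the build depth of 8 are stand-ins invented for this example, not the real x265 declarations.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: this "build" was compiled for 8-bit samples (stand-in for X265_DEPTH).
constexpr int kBuildDepth = 8;

// Hypothetical, much reduced function table.
struct DemoApi
{
    int         (*maxBitDepth)();
    const char* (*versionStr)();
};

static int         demoMaxBitDepth() { return kBuildDepth; }
static const char* demoVersionStr()  { return "demo"; }

static const DemoApi demoTable = { &demoMaxBitDepth, &demoVersionStr };

// Same rule as in the diff: 0 accepts any depth, otherwise the depth must match.
const DemoApi* demoApiGet(int bitDepth)
{
    if (bitDepth && bitDepth != kBuildDepth)
        return nullptr;
    return &demoTable;
}

// Caller side: check for a null table before using any entry.
int queryDepth(int requested)
{
    const DemoApi* api = demoApiGet(requested);
    if (!api)
        return -1;
    assert(api->maxBitDepth);
    return api->maxBitDepth();
}
```

Routing every call through the returned table lets an application pick between builds at run time instead of binding to one set of symbols at link time.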

 

x265_1.5.tar.gz/source/encoder/dpb.cpp -> x265_1.6.tar.gz/source/encoder/dpb.cpp Changed

@@ -104,11 +104,14 @@
 
     if (type == X265_TYPE_B)
     {
-        // change from _R "referenced" to _N "non-referenced" NAL unit type
+        newFrame->m_encData->m_bHasReferences = false;
+
+        // Adjust NAL type for unreferenced B frames (change from _R "referenced"
+        // to _N "non-referenced" NAL unit type)
         switch (slice->m_nalUnitType)
         {
         case NAL_UNIT_CODED_SLICE_TRAIL_R:
-            slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_TRAIL_N;
+            slice->m_nalUnitType = m_bTemporalSublayer ? NAL_UNIT_CODED_SLICE_TSA_N : NAL_UNIT_CODED_SLICE_TRAIL_N;
             break;
         case NAL_UNIT_CODED_SLICE_RADL_R:
             slice->m_nalUnitType = NAL_UNIT_CODED_SLICE_RADL_N;
@@ -120,10 +123,12 @@
             break;
         }
     }
-
-    /* m_bHasReferences starts out as true for non-B pictures, and is set to false
-     * once no more pictures reference it */
-    newFrame->m_encData->m_bHasReferences = IS_REFERENCED(newFrame);
+    else
+    {
+        /* m_bHasReferences starts out as true for non-B pictures, and is set to false
+         * once no more pictures reference it */
+        newFrame->m_encData->m_bHasReferences = true;
+    }
 
     m_picList.pushFront(*newFrame);

 
​

x265_1.5.tar.gz/source/encoder/dpb.h -> x265_1.6.tar.gz/source/encoder/dpb.h Changed

 
@@ -39,10 +39,11 @@
 
     int                m_lastIDR;
     int                m_pocCRA;
-    bool               m_bRefreshPending;
     int                m_maxRefL0;
     int                m_maxRefL1;
     int                m_bOpenGOP;
+    bool               m_bRefreshPending;
+    bool               m_bTemporalSublayer;
     PicList            m_picList;
     PicList            m_freeList;
     FrameData*         m_picSymFreeList;
@@ -56,6 +57,7 @@
         m_maxRefL0 = param->maxNumReferences;
         m_maxRefL1 = param->bBPyramid ? 2 : 1;
         m_bOpenGOP = param->bOpenGOP;
+        m_bTemporalSublayer = !!param->bEnableTemporalSubLayers;
     }
 
     ~DPB();
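
Note on the dpb.cpp and dpb.h hunks above: an unreferenced B frame now has m_bHasReferences cleared up front, and a TRAIL_R slice is demoted to TSA_N instead of TRAIL_N when temporal sub-layers are enabled. The sketch below models only the cases visible in the hunk. NalType, FrameInfo and classify are hypothetical names, and the enum values are placeholders rather than the codes used in the bitstream.

```cpp
#include <cassert>

// Hypothetical, reduced enum; only the types visible in the hunk are listed.
enum NalType { TRAIL_N, TRAIL_R, TSA_N, RADL_N, RADL_R };

struct FrameInfo
{
    NalType nal;
    bool    hasReferences;
};

// isUnreferencedB corresponds to (type == X265_TYPE_B) in the diff.
FrameInfo classify(bool isUnreferencedB, NalType nal, bool temporalSublayer)
{
    FrameInfo out = { nal, true };   // non-B pictures start out referenced

    if (isUnreferencedB)
    {
        out.hasReferences = false;
        switch (nal)
        {
        case TRAIL_R:
            out.nal = temporalSublayer ? TSA_N : TRAIL_N;
            break;
        case RADL_R:
            out.nal = RADL_N;
            break;
        default:
            break;                   // cases outside the shown hunk are not modeled
        }
    }
    return out;
}
```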
​

x265_1.5.tar.gz/source/encoder/encoder.cpp -> x265_1.6.tar.gz/source/encoder/encoder.cpp Changed

@@ -43,7 +43,7 @@
 const char g_sliceTypeToChar[] = {'B', 'P', 'I'};
 }
 
-static const char *summaryCSVHeader =
+static const char* summaryCSVHeader =
     "Command, Date/Time, Elapsed Time, FPS, Bitrate, "
     "Y PSNR, U PSNR, V PSNR, Global PSNR, SSIM, SSIM (dB), "
     "I count, I ave-QP, I kpbs, I-PSNR Y, I-PSNR U, I-PSNR V, I-SSIM (dB), "
@@ -51,7 +51,7 @@
     "B count, B ave-QP, B kpbs, B-PSNR Y, B-PSNR U, B-PSNR V, B-SSIM (dB), "
     "Version\n";
 
-const char* defaultAnalysisFileName = "x265_analysis.dat";
+static const char* defaultAnalysisFileName = "x265_analysis.dat";
 
 using namespace x265;
 
@@ -66,7 +66,6 @@
     m_numLumaWPBiFrames = 0;
     m_numChromaWPBiFrames = 0;
     m_lookahead = NULL;
-    m_frameEncoder = NULL;
     m_rateControl = NULL;
     m_dpb = NULL;
     m_exportedPic = NULL;
@@ -78,9 +77,12 @@
     m_cuOffsetC = NULL;
     m_buOffsetY = NULL;
     m_buOffsetC = NULL;
-    m_threadPool = 0;
-    m_numThreadLocalData = 0;
+    m_threadPool = NULL;
     m_analysisFile = NULL;
+    for (int i = 0; i < X265_MAX_FRAME_THREADS; i++)
+        m_frameEncoder[i] = NULL;
+
+    MotionEstimate::initScales();
 }
 
 void Encoder::create()
@@ -101,21 +103,35 @@
     if (rows == 1 || cols < 3)
         p->bEnableWavefront = 0;
 
-    int poolThreadCount = p->poolNumThreads ? p->poolNumThreads : getCpuCount();
+    bool allowPools = !p->numaPools || strcmp(p->numaPools, "none");
 
     // Trim the thread pool if --wpp, --pme, and --pmode are disabled
     if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation)
-        poolThreadCount = 0;
+        allowPools = false;
 
-    if (poolThreadCount > 1)
+    if (!p->frameNumThreads)
     {
-        m_threadPool = ThreadPool::allocThreadPool(poolThreadCount);
-        poolThreadCount = m_threadPool->getThreadCount();
+        // auto-detect frame threads
+        int cpuCount = ThreadPool::getCpuCount();
+        if (!p->bEnableWavefront)
+            p->frameNumThreads = X265_MIN3(cpuCount, (rows + 1) / 2, X265_MAX_FRAME_THREADS);
+        else if (cpuCount >= 32)
+            p->frameNumThreads = (p->sourceHeight > 2000) ? 8 : 6; // dual-socket 10-core IvyBridge or higher
+        else if (cpuCount >= 16)
+            p->frameNumThreads = 5; // 8 HT cores, or dual socket
+        else if (cpuCount >= 8)
+            p->frameNumThreads = 3; // 4 HT cores
+        else if (cpuCount >= 4)
+            p->frameNumThreads = 2; // Dual or Quad core
+        else
+            p->frameNumThreads = 1;
     }
-    else
-        poolThreadCount = 0;
 
-    if (!poolThreadCount)
+    m_numPools = 0;
+    if (allowPools)
+        m_threadPool = ThreadPool::allocThreadPools(p, m_numPools);
+
+    if (!m_numPools)
     {
         // issue warnings if any of these features were requested
         if (p->bEnableWavefront)
@@ -129,31 +145,40 @@
         p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = 0;
     }
 
-    if (!p->frameNumThreads)
-    {
-        // auto-detect frame threads
-        int cpuCount = getCpuCount();
-        if (!p->bEnableWavefront)
-            p->frameNumThreads = X265_MIN(cpuCount, (rows + 1) / 2);
-        else if (cpuCount >= 32)
-            p->frameNumThreads = (p->sourceHeight > 2000) ? 8 : 6; // dual-socket 10-core IvyBridge or higher
-        else if (cpuCount >= 16)
-            p->frameNumThreads = 5; // 8 HT cores, or dual socket
-        else if (cpuCount >= 8)
-            p->frameNumThreads = 3; // 4 HT cores
-        else if (cpuCount >= 4)
-            p->frameNumThreads = 2; // Dual or Quad core
-        else
-            p->frameNumThreads = 1;
-    }
+    char buf[128];
+    int len = 0;
+    if (p->bEnableWavefront)
+        len += sprintf(buf + len, "wpp(%d rows)", rows);
+    if (p->bDistributeModeAnalysis)
+        len += sprintf(buf + len, "%spmode", len ? "+" : "");
+    if (p->bDistributeMotionEstimation)
+        len += sprintf(buf + len, "%spme ", len ? "+" : "");
+    if (!len)
+        strcpy(buf, "none");
 
-    x265_log(p, X265_LOG_INFO, "WPP streams / frame threads / pool  : %d / %d / %d%s%s\n", 
-             p->bEnableWavefront ? rows : 0, p->frameNumThreads, poolThreadCount,
-             p->bDistributeMotionEstimation ? " / pme" : "", p->bDistributeModeAnalysis ? " / pmode" : "");
+    x265_log(p, X265_LOG_INFO, "frame threads / pool features       : %d / %s\n", p->frameNumThreads, buf);
 
-    m_frameEncoder = new FrameEncoder[m_param->frameNumThreads];
     for (int i = 0; i < m_param->frameNumThreads; i++)
-        m_frameEncoder[i].setThreadPool(m_threadPool);
+        m_frameEncoder[i] = new FrameEncoder;
+
+    if (m_numPools)
+    {
+        for (int i = 0; i < m_param->frameNumThreads; i++)
+        {
+            int pool = i % m_numPools;
+            m_frameEncoder[i]->m_pool = &m_threadPool[pool];
+            m_frameEncoder[i]->m_jpId = m_threadPool[pool].m_numProviders++;
+            m_threadPool[pool].m_jpTable[m_frameEncoder[i]->m_jpId] = m_frameEncoder[i];
+        }
+        for (int i = 0; i < m_numPools; i++)
+            m_threadPool[i].start();
+    }
+    else
+    {
+        /* CU stats and noise-reduction buffers are indexed by jpId, so it cannot be left as -1 */
+        for (int i = 0; i < m_param->frameNumThreads; i++)
+            m_frameEncoder[i]->m_jpId = 0;
+    }
 
     if (!m_scalingList.init())
     {
@@ -168,27 +193,17 @@
         m_aborted = true;
     m_scalingList.setupQuantMatrices();
 
-    /* Allocate thread local data, one for each thread pool worker and
-     * if --no-wpp, one for each frame encoder */
-    m_numThreadLocalData = poolThreadCount;
-    if (!m_param->bEnableWavefront)
-        m_numThreadLocalData += m_param->frameNumThreads;
-    m_threadLocalData = new ThreadLocalData[m_numThreadLocalData];
-    for (int i = 0; i < m_numThreadLocalData; i++)
+    m_lookahead = new Lookahead(m_param, m_threadPool);
+    if (m_numPools)
     {
-        m_threadLocalData[i].analysis.setThreadPool(m_threadPool);
-        m_threadLocalData[i].analysis.initSearch(*m_param, m_scalingList);
-        m_threadLocalData[i].analysis.create(m_threadLocalData);
+        m_lookahead->m_jpId = m_threadPool[0].m_numProviders++;
+        m_threadPool[0].m_jpTable[m_lookahead->m_jpId] = m_lookahead;
     }
 
-    if (!m_param->bEnableWavefront)
-        for (int i = 0; i < m_param->frameNumThreads; i++)
-            m_frameEncoder[i].m_tld = &m_threadLocalData[poolThreadCount + i];
-
-    m_lookahead = new Lookahead(m_param, m_threadPool);
     m_dpb = new DPB(m_param);
-    m_rateControl = new RateControl(m_param);
+    m_rateControl = new RateControl(*m_param);
 
+    initVPS(&m_vps);
     initSPS(&m_sps);
     initPPS(&m_pps);
 
@@ -229,26 +244,29 @@
         }
     }
 
-    if (m_frameEncoder)
+    int numRows = (m_param->sourceHeight + g_maxCUSize - 1) / g_maxCUSize;
+    int numCols = (m_param->sourceWidth  + g_maxCUSize - 1) / g_maxCUSize;
+    for (int i = 0; i < m_param->frameNumThreads; i++)
     {
-        int numRows = (m_param->sourceHeight + g_maxCUSize - 1) / g_maxCUSize;
-        int numCols = (m_param->sourceWidth  + g_maxCUSize - 1) / g_maxCUSize;
-        for (int i = 0; i < m_param->frameNumThreads; i++)
+        if (!m_frameEncoder[i]->init(this, numRows, numCols))
         {
-            if (!m_frameEncoder[i].init(this, numRows, numCols, i))
-            {
-                x265_log(m_param, X265_LOG_ERROR, "Unable to initialize frame encoder, aborting\n");
-                m_aborted = true;
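
Note on the frame-thread auto-detection in the encoder.cpp hunks above: the heuristic now runs before the pools are allocated and, without wavefront, is additionally capped by X265_MAX_FRAME_THREADS. The thresholds can be written as a standalone function, as sketched here. autoFrameThreads is a hypothetical name, and kMaxFrameThreads = 16 is an assumed placeholder for X265_MAX_FRAME_THREADS, whose value is not shown in the diff.

```cpp
#include <algorithm>
#include <cassert>

// Assumed placeholder for X265_MAX_FRAME_THREADS; the diff does not show its value.
const int kMaxFrameThreads = 16;

// rows is the number of CTU rows in the picture, as in Encoder::create().
int autoFrameThreads(int cpuCount, int rows, int sourceHeight, bool wavefront)
{
    assert(cpuCount > 0 && rows > 0);

    if (!wavefront)   // three-way minimum, like X265_MIN3 in the hunk
        return std::min({ cpuCount, (rows + 1) / 2, kMaxFrameThreads });

    if (cpuCount >= 32)
        return sourceHeight > 2000 ? 8 : 6;
    if (cpuCount >= 16)
        return 5;
    if (cpuCount >= 8)
        return 3;
    if (cpuCount >= 4)
        return 2;
    return 1;
}
```

For a 1080p source with 64x64 CTUs there are 17 rows, so with wavefront disabled and 4 logical cores the result is min(4, 9, 16) = 4.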

 
​

x265_1.5.tar.gz/source/encoder/encoder.h -> x265_1.6.tar.gz/source/encoder/encoder.h Changed

@@ -70,7 +70,6 @@
 class Lookahead;
 class RateControl;
 class ThreadPool;
-struct ThreadLocalData;
 
 class Encoder : public x265_encoder
 {
@@ -86,11 +85,12 @@
     int64_t            m_prevReorderedPts[2];
 
     ThreadPool*        m_threadPool;
-    FrameEncoder*      m_frameEncoder;
+    FrameEncoder*      m_frameEncoder[X265_MAX_FRAME_THREADS];
     DPB*               m_dpb;
 
     Frame*             m_exportedPic;
 
+    int                m_numPools;
     int                m_curEncoder;
 
     /* cached PicYuv offset arrays, shared by all instances of
@@ -120,14 +120,12 @@
     PPS                m_pps;
     NALList            m_nalList;
     ScalingList        m_scalingList;      // quantization matrix information
-    int                m_numThreadLocalData;
 
     int                m_lastBPSEI;
     uint32_t           m_numDelayedPic;
 
     x265_param*        m_param;
     RateControl*       m_rateControl;
-    ThreadLocalData*   m_threadLocalData;
     Lookahead*         m_lookahead;
     Window             m_conformanceWindow;
 
@@ -138,6 +136,7 @@
     ~Encoder() {}
 
     void create();
+    void stop();
     void destroy();
 
     int encode(const x265_picture* pic, x265_picture *pic_out);
@@ -154,8 +153,6 @@
 
     char* statsCSVString(EncStats& stat, char* buffer);
 
-    void setThreadPool(ThreadPool* p) { m_threadPool = p; }
-
     void configure(x265_param *param);
 
     void updateVbvPlan(RateControl* rc);
@@ -172,6 +169,7 @@
 
 protected:
 
+    void initVPS(VPS *vps);
     void initSPS(SPS *sps);
     void initPPS(PPS *pps);
 };
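
Note on m_numPools and the m_frameEncoder array: in encoder.cpp the frame encoders are spread over the pools round-robin (i % m_numPools), and each one takes the next job-provider slot of its pool. The lookahead then registers with pool 0. The sketch below reproduces that bookkeeping with hypothetical types (Pool, Assignment, assignProviders). A pool index of -1 is this sketch's own marker for "no pool", in which case the id is 0 as in the diff.

```cpp
#include <cassert>
#include <vector>

struct Pool
{
    int numProviders = 0;   // next free job-provider slot in this pool
};

struct Assignment
{
    int pool;   // index into the pool array, -1 when no pools exist (sketch convention)
    int jpId;   // job-provider id within that pool
};

std::vector<Assignment> assignProviders(int frameThreads, std::vector<Pool>& pools)
{
    assert(frameThreads >= 0);
    std::vector<Assignment> out;

    for (int i = 0; i < frameThreads; i++)
    {
        if (pools.empty())
        {
            out.push_back({ -1, 0 });   // jpId must not stay at -1, per the diff comment
            continue;
        }
        int p = i % static_cast<int>(pools.size());
        out.push_back({ p, pools[p].numProviders++ });
    }
    return out;
}
```

With 5 frame threads and 2 pools, pool 0 ends up with three providers and pool 1 with two, so a lookahead registered afterwards on pool 0 would receive id 3.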

 
​

x265_1.5.tar.gz/source/encoder/entropy.cpp -> x265_1.6.tar.gz/source/encoder/entropy.cpp Changed

@@ -43,6 +43,7 @@
 {
     markValid();
     m_fracBits = 0;
+    m_pad = 0;
     X265_CHECK(sizeof(m_contextState) >= sizeof(m_contextState[0]) * MAX_OFF_CTX_MOD, "context state table is too small\n");
 }
 
@@ -51,17 +52,21 @@
     WRITE_CODE(0,       4, "vps_video_parameter_set_id");
     WRITE_CODE(3,       2, "vps_reserved_three_2bits");
     WRITE_CODE(0,       6, "vps_reserved_zero_6bits");
-    WRITE_CODE(0,       3, "vps_max_sub_layers_minus1");
-    WRITE_FLAG(1,          "vps_temporal_id_nesting_flag");
+    WRITE_CODE(vps.maxTempSubLayers - 1, 3, "vps_max_sub_layers_minus1");
+    WRITE_FLAG(vps.maxTempSubLayers == 1,   "vps_temporal_id_nesting_flag");
     WRITE_CODE(0xffff, 16, "vps_reserved_ffff_16bits");
 
-    codeProfileTier(vps.ptl);
+    codeProfileTier(vps.ptl, vps.maxTempSubLayers);
 
     WRITE_FLAG(true, "vps_sub_layer_ordering_info_present_flag");
-    WRITE_UVLC(vps.maxDecPicBuffering - 1, "vps_max_dec_pic_buffering_minus1[i]");
-    WRITE_UVLC(vps.numReorderPics,         "vps_num_reorder_pics[i]");
 
-    WRITE_UVLC(0,    "vps_max_latency_increase_plus1[i]");
+    for (uint32_t i = 0; i < vps.maxTempSubLayers; i++)
+    {
+        WRITE_UVLC(vps.maxDecPicBuffering - 1, "vps_max_dec_pic_buffering_minus1[i]");
+        WRITE_UVLC(vps.numReorderPics,         "vps_num_reorder_pics[i]");
+        WRITE_UVLC(vps.maxLatencyIncrease + 1, "vps_max_latency_increase_plus1[i]");
+    }
+
     WRITE_CODE(0, 6, "vps_max_nuh_reserved_zero_layer_id");
     WRITE_UVLC(0,    "vps_max_op_sets_minus1");
     WRITE_FLAG(0,    "vps_timing_info_present_flag"); /* we signal timing info in SPS-VUI */
@@ -71,16 +76,16 @@
 void Entropy::codeSPS(const SPS& sps, const ScalingList& scalingList, const ProfileTierLevel& ptl)
 {
     WRITE_CODE(0, 4, "sps_video_parameter_set_id");
-    WRITE_CODE(0, 3, "sps_max_sub_layers_minus1");
-    WRITE_FLAG(1,    "sps_temporal_id_nesting_flag");
+    WRITE_CODE(sps.maxTempSubLayers - 1, 3, "sps_max_sub_layers_minus1");
+    WRITE_FLAG(sps.maxTempSubLayers == 1,   "sps_temporal_id_nesting_flag");
 
-    codeProfileTier(ptl);
+    codeProfileTier(ptl, sps.maxTempSubLayers);
 
     WRITE_UVLC(0, "sps_seq_parameter_set_id");
     WRITE_UVLC(sps.chromaFormatIdc, "chroma_format_idc");
 
     if (sps.chromaFormatIdc == X265_CSP_I444)
-        WRITE_FLAG(0,                        "separate_colour_plane_flag");
+        WRITE_FLAG(0,                       "separate_colour_plane_flag");
 
     WRITE_UVLC(sps.picWidthInLumaSamples,   "pic_width_in_luma_samples");
     WRITE_UVLC(sps.picHeightInLumaSamples,  "pic_height_in_luma_samples");
@@ -101,9 +106,12 @@
     WRITE_UVLC(BITS_FOR_POC - 4, "log2_max_pic_order_cnt_lsb_minus4");
     WRITE_FLAG(true,             "sps_sub_layer_ordering_info_present_flag");
 
-    WRITE_UVLC(sps.maxDecPicBuffering - 1, "sps_max_dec_pic_buffering_minus1[i]");
-    WRITE_UVLC(sps.numReorderPics,         "sps_num_reorder_pics[i]");
-    WRITE_UVLC(sps.maxLatencyIncrease + 1, "sps_max_latency_increase_plus1[i]");
+    for (uint32_t i = 0; i < sps.maxTempSubLayers; i++)
+    {
+        WRITE_UVLC(sps.maxDecPicBuffering - 1, "sps_max_dec_pic_buffering_minus1[i]");
+        WRITE_UVLC(sps.numReorderPics,         "sps_num_reorder_pics[i]");
+        WRITE_UVLC(sps.maxLatencyIncrease + 1, "sps_max_latency_increase_plus1[i]");
+    }
 
     WRITE_UVLC(sps.log2MinCodingBlockSize - 3,    "log2_min_coding_block_size_minus3");
     WRITE_UVLC(sps.log2DiffMaxMinCodingBlockSize, "log2_diff_max_min_coding_block_size");
@@ -129,7 +137,7 @@
     WRITE_FLAG(sps.bUseStrongIntraSmoothing, "sps_strong_intra_smoothing_enable_flag");
 
     WRITE_FLAG(1, "vui_parameters_present_flag");
-    codeVUI(sps.vuiParameters);
+    codeVUI(sps.vuiParameters, sps.maxTempSubLayers);
 
     WRITE_FLAG(0, "sps_extension_flag");
 }
@@ -184,7 +192,7 @@
     WRITE_FLAG(0, "pps_extension_flag");
 }
 
-void Entropy::codeProfileTier(const ProfileTierLevel& ptl)
+void Entropy::codeProfileTier(const ProfileTierLevel& ptl, int maxTempSubLayers)
 {
     WRITE_CODE(0, 2,                "XXX_profile_space[]");
     WRITE_FLAG(ptl.tierFlag,        "XXX_tier_flag[]");
@@ -222,9 +230,17 @@
     }
 
     WRITE_CODE(ptl.levelIdc, 8, "general_level_idc");
+
+    if (maxTempSubLayers > 1)
+    {
+         WRITE_FLAG(0, "sub_layer_profile_present_flag[i]");
+         WRITE_FLAG(0, "sub_layer_level_present_flag[i]");
+         for (int i = maxTempSubLayers - 1; i < 8 ; i++)
+             WRITE_CODE(0, 2, "reserved_zero_2bits");
+    }
 }
 
-void Entropy::codeVUI(const VUI& vui)
+void Entropy::codeVUI(const VUI& vui, int maxSubTLayers)
 {
     WRITE_FLAG(vui.aspectRatioInfoPresentFlag,  "aspect_ratio_info_present_flag");
     if (vui.aspectRatioInfoPresentFlag)
@@ -282,7 +298,7 @@
 
     WRITE_FLAG(vui.hrdParametersPresentFlag,  "vui_hrd_parameters_present_flag");
     if (vui.hrdParametersPresentFlag)
-        codeHrdParameters(vui.hrdParameters);
+        codeHrdParameters(vui.hrdParameters, maxSubTLayers);
 
     WRITE_FLAG(0, "bitstream_restriction_flag");
 }
@@ -329,7 +345,7 @@
     }
 }
 
-void Entropy::codeHrdParameters(const HRDInfo& hrd)
+void Entropy::codeHrdParameters(const HRDInfo& hrd, int maxSubTLayers)
 {
     WRITE_FLAG(1, "nal_hrd_parameters_present_flag");
     WRITE_FLAG(0, "vcl_hrd_parameters_present_flag");
@@ -342,13 +358,16 @@
     WRITE_CODE(hrd.cpbRemovalDelayLength - 1,        5, "au_cpb_removal_delay_length_minus1");
     WRITE_CODE(hrd.dpbOutputDelayLength - 1,         5, "dpb_output_delay_length_minus1");
 
-    WRITE_FLAG(1, "fixed_pic_rate_general_flag");
-    WRITE_UVLC(0, "elemental_duration_in_tc_minus1");
-    WRITE_UVLC(0, "cpb_cnt_minus1");
+    for (int i = 0; i < maxSubTLayers; i++)
+    {
+        WRITE_FLAG(1, "fixed_pic_rate_general_flag");
+        WRITE_UVLC(0, "elemental_duration_in_tc_minus1");
+        WRITE_UVLC(0, "cpb_cnt_minus1");
 
-    WRITE_UVLC(hrd.bitRateValue - 1, "bit_rate_value_minus1");
-    WRITE_UVLC(hrd.cpbSizeValue - 1, "cpb_size_value_minus1");
-    WRITE_FLAG(hrd.cbrFlag, "cbr_flag");
+        WRITE_UVLC(hrd.bitRateValue - 1, "bit_rate_value_minus1");
+        WRITE_UVLC(hrd.cpbSizeValue - 1, "cpb_size_value_minus1");
+        WRITE_FLAG(hrd.cbrFlag, "cbr_flag");
+    }
 }
 
 void Entropy::codeAUD(const Slice& slice)
@@ -521,15 +540,14 @@
 {
     const Slice* slice = ctu.m_slice;
 
-    if (depth <= slice->m_pps->maxCuDQPDepth && slice->m_pps->bUseDQP)
-        bEncodeDQP = true;
-
     int cuSplitFlag = !(cuGeom.flags & CUGeom::LEAF);
     int cuUnsplitFlag = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY);
 
     if (!cuUnsplitFlag)
     {
         uint32_t qNumParts = cuGeom.numPartitions >> 2;
+        if (depth == slice->m_pps->maxCuDQPDepth && slice->m_pps->bUseDQP)
+            bEncodeDQP = true;
         for (uint32_t qIdx = 0; qIdx < 4; ++qIdx, absPartIdx += qNumParts)
         {
             const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + qIdx);
@@ -539,13 +557,14 @@
         return;
     }
 
-    // We need to split, so don't try these modes.
     if (cuSplitFlag) 
         codeSplitFlag(ctu, absPartIdx, depth);
 
     if (depth < ctu.m_cuDepth[absPartIdx] && depth < g_maxCUDepth)
     {
         uint32_t qNumParts = cuGeom.numPartitions >> 2;
+        if (depth == slice->m_pps->maxCuDQPDepth && slice->m_pps->bUseDQP)
+            bEncodeDQP = true;
         for (uint32_t qIdx = 0; qIdx < 4; ++qIdx, absPartIdx += qNumParts)
         {
             const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + qIdx);
@@ -554,6 +573,9 @@
         return;
     }
 
+    if (depth <= slice->m_pps->maxCuDQPDepth && slice->m_pps->bUseDQP)
+        bEncodeDQP = true;
+
     if (slice->m_pps->bTransquantBypassEnabled)
         codeCUTransquantBypassFlag(ctu.m_tqBypass[absPartIdx]);
 
@@ -654,7 +676,7 @@
     {
         // Encode slice finish
         bool bTerminateSlice = false;
-        if (cuAddr + (NUM_CU_PARTITIONS >> (depth << 1)) == realEndAddress)
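
Note on the sub-layer signalling in the entropy.cpp hunks above: the VPS and SPS now derive max_sub_layers_minus1 and the temporal_id_nesting flag from maxTempSubLayers, and codeProfileTier() appends extra bits only when more than one sub-layer is present. The sketch below simply counts what the hunk writes. SubLayerHeader, subLayerHeader and profileTierTrailerBits are hypothetical names, and the arithmetic mirrors the code as shown, which writes one pair of present flags, rather than restating the standard.

```cpp
#include <cassert>

struct SubLayerHeader
{
    int  maxSubLayersMinus1;   // value written with WRITE_CODE(..., 3, ...)
    bool temporalIdNesting;    // value written with WRITE_FLAG(...)
};

SubLayerHeader subLayerHeader(int maxTempSubLayers)
{
    assert(maxTempSubLayers >= 1);
    SubLayerHeader h = { maxTempSubLayers - 1, maxTempSubLayers == 1 };
    return h;
}

// Bits appended after general_level_idc by the new block in codeProfileTier().
int profileTierTrailerBits(int maxTempSubLayers)
{
    int bits = 0;
    if (maxTempSubLayers > 1)
    {
        bits += 1;   // sub_layer_profile_present_flag
        bits += 1;   // sub_layer_level_present_flag
        for (int i = maxTempSubLayers - 1; i < 8; i++)
            bits += 2;   // reserved_zero_2bits
    }
    return bits;
}
```

With a single sub-layer nothing is appended. With two sub-layers the trailer is 2 + 7 * 2 = 16 bits.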

 
+        codeHrdParameters(vui.hrdParameters, maxSubTLayers);
 
     WRITE_FLAG(0, "bitstream_restriction_flag");
 }
@@ -329,7 +345,7 @@
     }
 }
 
-void Entropy::codeHrdParameters(const HRDInfo& hrd)
+void Entropy::codeHrdParameters(const HRDInfo& hrd, int maxSubTLayers)
 {
     WRITE_FLAG(1, "nal_hrd_parameters_present_flag");
     WRITE_FLAG(0, "vcl_hrd_parameters_present_flag");
@@ -342,13 +358,16 @@
     WRITE_CODE(hrd.cpbRemovalDelayLength - 1,        5, "au_cpb_removal_delay_length_minus1");
     WRITE_CODE(hrd.dpbOutputDelayLength - 1,         5, "dpb_output_delay_length_minus1");
 
-    WRITE_FLAG(1, "fixed_pic_rate_general_flag");
-    WRITE_UVLC(0, "elemental_duration_in_tc_minus1");
-    WRITE_UVLC(0, "cpb_cnt_minus1");
+    for (int i = 0; i < maxSubTLayers; i++)
+    {
+        WRITE_FLAG(1, "fixed_pic_rate_general_flag");
+        WRITE_UVLC(0, "elemental_duration_in_tc_minus1");
+        WRITE_UVLC(0, "cpb_cnt_minus1");
 
-    WRITE_UVLC(hrd.bitRateValue - 1, "bit_rate_value_minus1");
-    WRITE_UVLC(hrd.cpbSizeValue - 1, "cpb_size_value_minus1");
-    WRITE_FLAG(hrd.cbrFlag, "cbr_flag");
+        WRITE_UVLC(hrd.bitRateValue - 1, "bit_rate_value_minus1");
+        WRITE_UVLC(hrd.cpbSizeValue - 1, "cpb_size_value_minus1");
+        WRITE_FLAG(hrd.cbrFlag, "cbr_flag");
+    }
 }
 
 void Entropy::codeAUD(const Slice& slice)
@@ -521,15 +540,14 @@
 {
     const Slice* slice = ctu.m_slice;
 
-    if (depth <= slice->m_pps->maxCuDQPDepth && slice->m_pps->bUseDQP)
-        bEncodeDQP = true;
-
     int cuSplitFlag = !(cuGeom.flags & CUGeom::LEAF);
     int cuUnsplitFlag = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY);
 
     if (!cuUnsplitFlag)
     {
         uint32_t qNumParts = cuGeom.numPartitions >> 2;
+        if (depth == slice->m_pps->maxCuDQPDepth && slice->m_pps->bUseDQP)
+            bEncodeDQP = true;
         for (uint32_t qIdx = 0; qIdx < 4; ++qIdx, absPartIdx += qNumParts)
         {
             const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + qIdx);
@@ -539,13 +557,14 @@
         return;
     }
 
-    // We need to split, so don't try these modes.
     if (cuSplitFlag) 
         codeSplitFlag(ctu, absPartIdx, depth);
 
     if (depth < ctu.m_cuDepth[absPartIdx] && depth < g_maxCUDepth)
     {
         uint32_t qNumParts = cuGeom.numPartitions >> 2;
+        if (depth == slice->m_pps->maxCuDQPDepth && slice->m_pps->bUseDQP)
+            bEncodeDQP = true;
         for (uint32_t qIdx = 0; qIdx < 4; ++qIdx, absPartIdx += qNumParts)
         {
             const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + qIdx);
@@ -554,6 +573,9 @@
         return;
     }
 
+    if (depth <= slice->m_pps->maxCuDQPDepth && slice->m_pps->bUseDQP)
+        bEncodeDQP = true;
+
     if (slice->m_pps->bTransquantBypassEnabled)
         codeCUTransquantBypassFlag(ctu.m_tqBypass[absPartIdx]);
 
@@ -654,7 +676,7 @@
     {
         // Encode slice finish
         bool bTerminateSlice = false;
-        if (cuAddr + (NUM_CU_PARTITIONS >> (depth << 1)) == realEndAddress)
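
Note on the entropy.cpp hunks above: codeVPS() and codeSPS() now emit the three ordering-info fields once per temporal sub-layer instead of once in total. The sketch below is not the x265 bitstream writer. It assumes WRITE_UVLC produces the unsigned Exp-Golomb ue(v) code that HEVC uses for these syntax elements, and it returns the bits as a text string so the effect of the loop is easy to see. The names encodeUvlc and encodeSubLayerOrderingInfo are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Unsigned Exp-Golomb ue(v): N zero bits, then the (N + 1)-bit binary
// representation of (value + 1), where N = floor(log2(value + 1)).
std::string encodeUvlc(uint32_t value)
{
    uint64_t codeNum = (uint64_t)value + 1;

    int prefixLen = 0;
    for (uint64_t v = codeNum; v > 1; v >>= 1)
        prefixLen++;

    std::string bits(prefixLen, '0');
    for (int i = prefixLen; i >= 0; i--)
        bits.push_back(((codeNum >> i) & 1) ? '1' : '0');
    return bits;
}

// Mirrors the shape of the loop added in this revision: the same three
// values are written for every sub-layer (hypothetical helper, the real
// code writes into the encoder's bitstream object).
std::string encodeSubLayerOrderingInfo(uint32_t maxTempSubLayers,
                                       uint32_t maxDecPicBuffering,
                                       uint32_t numReorderPics,
                                       uint32_t maxLatencyIncrease)
{
    std::string bits;
    for (uint32_t i = 0; i < maxTempSubLayers; i++)
    {
        bits += encodeUvlc(maxDecPicBuffering - 1); // *_max_dec_pic_buffering_minus1[i]
        bits += encodeUvlc(numReorderPics);         // *_num_reorder_pics[i]
        bits += encodeUvlc(maxLatencyIncrease + 1); // *_max_latency_increase_plus1[i]
    }
    return bits;
}
```

With two sub-layers (the value determineLevel() picks when temporal layers are enabled) the same triple is simply written twice, so a decoder that reads sps_max_sub_layers_minus1 = 1 finds one entry per sub-layer.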

x265_1.5.tar.gz/source/encoder/entropy.h -> x265_1.6.tar.gz/source/encoder/entropy.h Changed

@@ -142,9 +142,9 @@
     void codeVPS(const VPS& vps);
     void codeSPS(const SPS& sps, const ScalingList& scalingList, const ProfileTierLevel& ptl);
     void codePPS(const PPS& pps);
-    void codeVUI(const VUI& vui);
+    void codeVUI(const VUI& vui, int maxSubTLayers);
     void codeAUD(const Slice& slice);
-    void codeHrdParameters(const HRDInfo& hrd);
+    void codeHrdParameters(const HRDInfo& hrd, int maxSubTLayers);
 
     void codeSliceHeader(const Slice& slice, FrameData& encData);
     void codeSliceHeaderWPPEntryPoints(const Slice& slice, const uint32_t *substreamSizes, uint32_t maxOffset);
@@ -230,7 +230,7 @@
     void writeEpExGolomb(uint32_t symbol, uint32_t count);
     void writeCoefRemainExGolomb(uint32_t symbol, const uint32_t absGoRice);
 
-    void codeProfileTier(const ProfileTierLevel& ptl);
+    void codeProfileTier(const ProfileTierLevel& ptl, int maxTempSubLayers);
     void codeScalingList(const ScalingList&);
     void codeScalingList(const ScalingList& scalingList, uint32_t sizeId, uint32_t listId);

 

x265_1.5.tar.gz/source/encoder/frameencoder.cpp -> x265_1.6.tar.gz/source/encoder/frameencoder.cpp Changed

@@ -39,14 +39,13 @@
 void weightAnalyse(Slice& slice, Frame& frame, x265_param& param);
 
 FrameEncoder::FrameEncoder()
-    : WaveFront(NULL)
-    , m_threadActive(true)
 {
     m_prevOutputTime = x265_mdate();
-    m_totalWorkerElapsedTime = 0;
+    m_isFrameEncoder = true;
+    m_threadActive = true;
     m_slicetypeWaitTime = 0;
-    m_frameEncoderID = 0;
     m_activeWorkerCount = 0;
+    m_completionCount = 0;
     m_bAllRowsStop = false;
     m_vbvResetTriggerRow = -1;
     m_outStreams = NULL;
@@ -59,6 +58,7 @@
     m_frame = NULL;
     m_cuGeoms = NULL;
     m_ctuGeomMap = NULL;
+    m_localTldIdx = 0;
     memset(&m_frameStats, 0, sizeof(m_frameStats));
     memset(&m_rce, 0, sizeof(RateControlEntry));
 }
@@ -66,10 +66,22 @@
 void FrameEncoder::destroy()
 {
     if (m_pool)
-        JobProvider::flush();  // ensure no worker threads are using this frame
-
-    m_threadActive = false;
-    m_enable.trigger();
+    {
+        if (!m_jpId)
+        {
+            int numTLD = m_pool->m_numWorkers;
+            if (!m_param->bEnableWavefront)
+                numTLD += m_pool->m_numProviders;
+            for (int i = 0; i < numTLD; i++)
+                m_tld[i].destroy();
+            delete [] m_tld;
+        }
+    }
+    else
+    {
+        m_tld->destroy();
+        delete m_tld;
+    }
 
     delete[] m_rows;
     delete[] m_outStreams;
@@ -85,12 +97,9 @@
         delete m_rce.picTimingSEI;
         delete m_rce.hrdTiming;
     }
-
-    // wait for worker thread to exit
-    stop();
 }
 
-bool FrameEncoder::init(Encoder *top, int numRows, int numCols, int id)
+bool FrameEncoder::init(Encoder *top, int numRows, int numCols)
 {
     m_top = top;
     m_param = top->m_param;
@@ -99,14 +108,14 @@
     m_filterRowDelay = (m_param->bEnableSAO && m_param->bSaoNonDeblocked) ?
                         2 : (m_param->bEnableSAO || m_param->bEnableLoopFilter ? 1 : 0);
     m_filterRowDelayCus = m_filterRowDelay * numCols;
-    m_frameEncoderID = id;
     m_rows = new CTURow[m_numRows];
     bool ok = !!m_numRows;
 
-    int range  = m_param->searchRange; /* fpel search */
-        range += 1;                    /* diamond search range check lag */
-        range += 2;                    /* subpel refine */
-        range += NTAPS_LUMA / 2;       /* subpel filter half-length */
+    /* determine full motion search range */
+    int range  = m_param->searchRange;       /* fpel search */
+    range += !!(m_param->searchMethod < 2);  /* diamond/hex range check lag */
+    range += NTAPS_LUMA / 2;                 /* subpel filter half-length */
+    range += 2 + MotionEstimate::hpelIterationCount(m_param->subpelRefine) / 2; /* subpel refine steps */
     m_refLagRows = 1 + ((range + g_maxCUSize - 1) / g_maxCUSize);
 
     // NOTE: 2 times of numRows because both Encoder and Filter in same queue
@@ -134,7 +143,6 @@
     else
         m_param->noiseReductionIntra = m_param->noiseReductionInter = 0;
 
-    start();
     return ok;
 }
 
@@ -143,6 +151,7 @@
 {
     /* Geoms only vary between CTUs in the presence of picture edges */
     int maxCUSize = m_param->maxCUSize;
+    int minCUSize = m_param->minCUSize;
     int heightRem = m_param->sourceHeight & (maxCUSize - 1);
     int widthRem = m_param->sourceWidth & (maxCUSize - 1);
     int allocGeoms = 1; // body
@@ -157,7 +166,7 @@
         return false;
 
     // body
-    CUData::calcCTUGeoms(maxCUSize, maxCUSize, maxCUSize, m_cuGeoms);
+    CUData::calcCTUGeoms(maxCUSize, maxCUSize, maxCUSize, minCUSize, m_cuGeoms);
     memset(m_ctuGeomMap, 0, sizeof(uint32_t) * m_numRows * m_numCols);
     if (allocGeoms == 1)
         return true;
@@ -166,7 +175,7 @@
     if (widthRem)
     {
         // right
-        CUData::calcCTUGeoms(widthRem, maxCUSize, maxCUSize, m_cuGeoms + countGeoms * CUGeom::MAX_GEOMS);
+        CUData::calcCTUGeoms(widthRem, maxCUSize, maxCUSize, minCUSize, m_cuGeoms + countGeoms * CUGeom::MAX_GEOMS);
         for (uint32_t i = 0; i < m_numRows; i++)
         {
             uint32_t ctuAddr = m_numCols * (i + 1) - 1;
@@ -177,7 +186,7 @@
     if (heightRem)
     {
         // bottom
-        CUData::calcCTUGeoms(maxCUSize, heightRem, maxCUSize, m_cuGeoms + countGeoms * CUGeom::MAX_GEOMS);
+        CUData::calcCTUGeoms(maxCUSize, heightRem, maxCUSize, minCUSize, m_cuGeoms + countGeoms * CUGeom::MAX_GEOMS);
         for (uint32_t i = 0; i < m_numCols; i++)
         {
             uint32_t ctuAddr = m_numCols * (m_numRows - 1) + i;
@@ -188,7 +197,7 @@
         if (widthRem)
         {
             // corner
-            CUData::calcCTUGeoms(widthRem, heightRem, maxCUSize, m_cuGeoms + countGeoms * CUGeom::MAX_GEOMS);
+            CUData::calcCTUGeoms(widthRem, heightRem, maxCUSize, minCUSize, m_cuGeoms + countGeoms * CUGeom::MAX_GEOMS);
 
             uint32_t ctuAddr = m_numCols * m_numRows - 1;
             m_ctuGeomMap[ctuAddr] = countGeoms * CUGeom::MAX_GEOMS;
@@ -204,7 +213,9 @@
 {
     m_slicetypeWaitTime = x265_mdate() - m_prevOutputTime;
     m_frame = curFrame;
-    curFrame->m_encData->m_frameEncoderID = m_frameEncoderID; // Each Frame knows the ID of the FrameEncoder encoding it
+    m_sliceType = curFrame->m_lowres.sliceType;
+    curFrame->m_encData->m_frameEncoderID = m_jpId;
+    curFrame->m_encData->m_jobProvider = this;
     curFrame->m_encData->m_slice->m_mref = m_mref;
 
     if (!m_cuGeoms)
@@ -219,19 +230,66 @@
 
 void FrameEncoder::threadMain()
 {
-    THREAD_NAME("Frame", m_frameEncoderID);
+    THREAD_NAME("Frame", m_jpId);
 
-    // worker thread routine for FrameEncoder
-    do
+    if (m_pool)
     {
-        m_enable.wait(); // Encoder::encode() triggers this event
-        if (m_threadActive)
+        m_pool->setCurrentThreadAffinity();
+
+        /* the first FE on each NUMA node is responsible for allocating thread
+         * local data for all worker threads in that pool. If WPP is disabled, then
+         * each FE also needs a TLD instance */
+        if (!m_jpId)
         {
-            compressFrame();
-            m_done.trigger(); // FrameEncoder::getEncodedPicture() blocks for this event
+            int numTLD = m_pool->m_numWorkers;
+            if (!m_param->bEnableWavefront)
+                numTLD += m_pool->m_numProviders;
+
+            m_tld = new ThreadLocalData[numTLD];
+            for (int i = 0; i < numTLD; i++)
+            {
+                m_tld[i].analysis.initSearch(*m_param, m_top->m_scalingList);
+                m_tld[i].analysis.create(m_tld);
+            }
+
+            for (int i = 0; i < m_pool->m_numProviders; i++)
+            {
+                if (m_pool->m_jpTable[i]->m_isFrameEncoder) /* ugh; over-allocation and other issues here */
+                {
+                    FrameEncoder *peer = dynamic_cast<FrameEncoder*>(m_pool->m_jpTable[i]);
+                    peer->m_tld = m_tld;
+                }
+            }
         }
+
+        if (m_param->bEnableWavefront)
+            m_localTldIdx = -1; // cause exception if used
+        else
+            m_localTldIdx = m_pool->m_numWorkers + m_jpId;
+    }
+    else
+    {
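
Note on the threadMain() hunk above: the first frame encoder in a pool (m_jpId == 0) allocates one shared ThreadLocalData array, and each frame encoder then records which slot it may use itself. The sketch below restates only that indexing arithmetic as read from the diff. PoolShape, tldCount and localTldIndex are hypothetical names, not x265 API.

```cpp
#include <cassert>

// Hypothetical summary of the two pool fields that threadMain() reads.
struct PoolShape
{
    int numWorkers;   // m_pool->m_numWorkers
    int numProviders; // m_pool->m_numProviders
};

// Size of the shared ThreadLocalData array: one slot per worker thread,
// plus one slot per job provider when wavefront (WPP) is disabled.
int tldCount(const PoolShape& pool, bool wavefront)
{
    int count = pool.numWorkers;
    if (!wavefront)
        count += pool.numProviders;
    return count;
}

// Slot a frame encoder uses for its own work. With WPP enabled the frame
// encoder never encodes rows itself, so the diff stores -1 as a poison value.
int localTldIndex(const PoolShape& pool, bool wavefront, int jpId)
{
    return wavefront ? -1 : pool.numWorkers + jpId;
}
```

Since the frame-encoder slots start at numWorkers and are offset by m_jpId, they sit after the worker slots and stay inside the array as long as m_jpId is smaller than the provider count.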

 

x265_1.5.tar.gz/source/encoder/frameencoder.h -> x265_1.6.tar.gz/source/encoder/frameencoder.h Changed

@@ -122,7 +122,7 @@
 
     virtual ~FrameEncoder() {}
 
-    bool init(Encoder *top, int numRows, int numCols, int id);
+    virtual bool init(Encoder *top, int numRows, int numCols);
 
     void destroy();
 
@@ -135,8 +135,12 @@
     Event                    m_enable;
     Event                    m_done;
     Event                    m_completionEvent;
-    bool                     m_threadActive;
-    int                      m_frameEncoderID;
+    int                      m_localTldIdx;
+
+    volatile bool            m_threadActive;
+    volatile bool            m_bAllRowsStop;
+    volatile int             m_completionCount;
+    volatile int             m_vbvResetTriggerRow;
 
     uint32_t                 m_numRows;
     uint32_t                 m_numCols;
@@ -144,9 +148,6 @@
     uint32_t                 m_filterRowDelayCus;
     uint32_t                 m_refLagRows;
 
-    volatile bool            m_bAllRowsStop;
-    volatile int             m_vbvResetTriggerRow;
-
     CTURow*                  m_rows;
     RateControlEntry         m_rce;
     SEIDecodedPictureHash    m_seiReconPictureDigest;
@@ -177,6 +178,9 @@
     int64_t                  m_slicetypeWaitTime;        // total elapsed time waiting for decided frame
     int64_t                  m_totalWorkerElapsedTime;   // total elapsed time spent by worker threads processing CTUs
     int64_t                  m_totalNoWorkerTime;        // total elapsed time without any active worker threads
+#if DETAILED_CU_STATS
+    CUStats                  m_cuStats;
+#endif
 
     Encoder*                 m_top;
     x265_param*              m_param;
@@ -196,6 +200,21 @@
     FrameFilter              m_frameFilter;
     NALList                  m_nalList;
 
+    class WeightAnalysis : public BondedTaskGroup
+    {
+    public:
+
+        FrameEncoder& master;
+
+        WeightAnalysis(FrameEncoder& fe) : master(fe) {}
+
+        void processTasks(int workerThreadId);
+
+    protected:
+
+        WeightAnalysis operator=(const WeightAnalysis&);
+    };
+
 protected:
 
     bool initializeGeoms();
@@ -203,9 +222,6 @@
     /* analyze / compress frame, can be run in parallel within reference constraints */
     void compressFrame();
 
-    /* called by compressFrame to perform wave-front compression analysis */
-    void compressCTURows();
-
     /* called by compressFrame to generate final per-row bitstreams */
     void encodeSlice();
 
@@ -215,8 +231,8 @@
     void noiseReductionUpdate();
 
     /* Called by WaveFront::findJob() */
-    void processRow(int row, int threadId);
-    void processRowEncoder(int row, ThreadLocalData& tld);
+    virtual void processRow(int row, int threadId);
+    virtual void processRowEncoder(int row, ThreadLocalData& tld);
 
     void enqueueRowEncoder(int row) { WaveFront::enqueueRow(row * 2 + 0); }
     void enqueueRowFilter(int row)  { WaveFront::enqueueRow(row * 2 + 1); }

 

x265_1.5.tar.gz/source/encoder/framefilter.cpp -> x265_1.6.tar.gz/source/encoder/framefilter.cpp Changed

@@ -83,6 +83,11 @@
 {
     ProfileScopeEvent(filterCTURow);
 
+#if DETAILED_CU_STATS
+    ScopedElapsedTime filterPerfScope(m_frameEncoder->m_cuStats.loopFilterElapsedTime);
+    m_frameEncoder->m_cuStats.countLoopFilter++;
+#endif
+
     if (!m_param->bEnableLoopFilter && !m_param->bEnableSAO)
     {
         processRowPost(row);
@@ -298,6 +303,9 @@
         updateChecksum(reconPic->m_picOrg[1], m_frameEncoder->m_checksum[1], height, width, stride, row, cuHeight);
         updateChecksum(reconPic->m_picOrg[2], m_frameEncoder->m_checksum[2], height, width, stride, row, cuHeight);
     }
+
+    if (ATOMIC_INC(&m_frameEncoder->m_completionCount) == 2 * (int)m_frameEncoder->m_numRows)
+        m_frameEncoder->m_completionEvent.trigger();
 }
 
 static uint64_t computeSSD(pixel *fenc, pixel *rec, intptr_t stride, uint32_t width, uint32_t height)
@@ -421,7 +429,7 @@
 /* Original YUV restoration for CU in lossless coding */
 static void origCUSampleRestoration(const CUData* cu, const CUGeom& cuGeom, Frame& frame)
 {
-    uint32_t absPartIdx = cuGeom.encodeIdx;
+    uint32_t absPartIdx = cuGeom.absPartIdx;
     if (cu->m_cuDepth[absPartIdx] > cuGeom.depth)
     {
         for (int subPartIdx = 0; subPartIdx < 4; subPartIdx++)
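
Note on the m_completionCount hunk above: the frame is declared complete by whichever thread performs the increment that brings the counter to 2 * m_numRows. Only the increment in framefilter.cpp is visible in this diff; the factor of two suggests a second increment per row elsewhere (presumably in the row encoder), which is an assumption here. The sketch uses std::atomic in place of the ATOMIC_INC macro and a plain counter in place of the Event object. FrameCompletion and triggersAfter are hypothetical names.

```cpp
#include <atomic>
#include <cassert>

struct FrameCompletion
{
    std::atomic<int> completionCount{0};
    int numRows;
    int triggers = 0; // stands in for m_completionEvent.trigger()

    explicit FrameCompletion(int rows) : numRows(rows) {}

    // Called once per finished row task. fetch_add returns the value before
    // the increment, so adding 1 gives the post-increment value that the
    // diff compares against 2 * numRows.
    void taskDone()
    {
        int after = completionCount.fetch_add(1) + 1;
        if (after == 2 * numRows)
            triggers++; // exactly one caller can observe this value
    }
};

// Replays 'tasksDone' completions for a frame of 'numRows' rows and reports
// how many times the completion event would have fired.
int triggersAfter(int numRows, int tasksDone)
{
    FrameCompletion frame(numRows);
    for (int i = 0; i < tasksDone; i++)
        frame.taskDone();
    return frame.triggers;
}
```

Comparing the value returned by the atomic increment, rather than re-reading the counter afterwards, is what guarantees a single trigger even when several threads finish their last tasks at the same time.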

 

x265_1.5.tar.gz/source/encoder/level.cpp -> x265_1.6.tar.gz/source/encoder/level.cpp Changed

@@ -60,6 +60,7 @@
 /* determine minimum decoder level required to decode the described video */
 void determineLevel(const x265_param &param, VPS& vps)
 {
+    vps.maxTempSubLayers = param.bEnableTemporalSubLayers ? 2 : 1;
     if (param.bLossless)
         vps.ptl.profileIdc = Profile::NONE;
     else if (param.internalCsp == X265_CSP_I420)
@@ -154,15 +155,25 @@
             return;
         }
 
-        vps.ptl.levelIdc = levels[i].levelEnum;
-        vps.ptl.minCrForLevel = levels[i].minCompressionRatio;
-        vps.ptl.maxLumaSrForLevel = levels[i].maxLumaSamplesPerSecond;
+#define CHECK_RANGE(value, main, high) (value > main && value <= high)
 
-        if (bitrate > levels[i].maxBitrateMain && bitrate <= levels[i].maxBitrateHigh &&
+        if (CHECK_RANGE(bitrate, levels[i].maxBitrateMain, levels[i].maxBitrateHigh) &&
+            CHECK_RANGE((uint32_t)param.rc.vbvBufferSize, levels[i].maxCpbSizeMain, levels[i].maxCpbSizeHigh) &&
             levels[i].maxBitrateHigh != MAX_UINT)
-            vps.ptl.tierFlag = Level::HIGH;
+        {
+            /* If the user has not enabled high tier, continue looking to see if we can encode at a higher level, main tier */
+            if (!param.bHighTier && (levels[i].levelIdc < param.levelIdc))
+                continue;
+            else
+                vps.ptl.tierFlag = Level::HIGH;
+        }
         else
             vps.ptl.tierFlag = Level::MAIN;
+#undef CHECK_RANGE
+
+        vps.ptl.levelIdc = levels[i].levelEnum;
+        vps.ptl.minCrForLevel = levels[i].minCompressionRatio;
+        vps.ptl.maxLumaSrForLevel = levels[i].maxLumaSamplesPerSecond;
         break;
     }
 
@@ -250,7 +261,7 @@
     }
     if ((uint32_t)param.rc.vbvBufferSize > (highTier ? l.maxCpbSizeHigh : l.maxCpbSizeMain))
     {
-        param.rc.vbvMaxBitrate = highTier ? l.maxCpbSizeHigh : l.maxCpbSizeMain;
+        param.rc.vbvBufferSize = highTier ? l.maxCpbSizeHigh : l.maxCpbSizeMain;
         x265_log(&param, X265_LOG_INFO, "lowering VBV buffer size to %dKb\n", param.rc.vbvBufferSize);
     }
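The tier decision changed in the second hunk above can be looked at in isolation. The following is a minimal sketch, not the x265 implementation: `Tier`, `LevelLimits`, `inRange` and `pickTier` are hypothetical names, the "no high tier" sentinel is assumed to be the maximum 32-bit value, and the `bHighTier` / `continue` branch is left out. It only illustrates that, after this change, high tier is chosen when both the bitrate and the VBV buffer size lie above the main-tier limit and within the high-tier limit.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-ins for the encoder's tier enum and level table entry.
enum class Tier { Main, High };

struct LevelLimits
{
    uint32_t maxBitrateMain;
    uint32_t maxBitrateHigh;   // assumed sentinel: max uint32_t means "no high tier"
    uint32_t maxCpbSizeMain;
    uint32_t maxCpbSizeHigh;
};

// Function form of the CHECK_RANGE macro: true when mainLimit < value <= highLimit.
inline bool inRange(uint32_t value, uint32_t mainLimit, uint32_t highLimit)
{
    return value > mainLimit && value <= highLimit;
}

// High tier only if the level has one and both quantities fall in the high-tier band.
inline Tier pickTier(const LevelLimits& l, uint32_t bitrate, uint32_t vbvBufferSize)
{
    const bool hasHighTier = l.maxBitrateHigh != std::numeric_limits<uint32_t>::max();
    if (hasHighTier &&
        inRange(bitrate, l.maxBitrateMain, l.maxBitrateHigh) &&
        inRange(vbvBufferSize, l.maxCpbSizeMain, l.maxCpbSizeHigh))
        return Tier::High;
    return Tier::Main;
}
```

With made-up limits of 1000 (main) and 4000 (high) for both quantities, a bitrate of 2000 with a buffer of 2000 selects high tier, while the same bitrate with a buffer of 500 stays in main tier. A function is used here instead of a macro so that the arguments are evaluated once and need no extra parentheses.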

 

x265_1.5.tar.gz/source/encoder/motion.cpp -> x265_1.6.tar.gz/source/encoder/motion.cpp Changed

@@ -59,38 +59,6 @@
 int sizeScale[NUM_PU_SIZES];
 #define SAD_THRESH(v) (bcost < (((v >> 4) * sizeScale[partEnum])))
 
-void initScales(void)
-{
-#define SETUP_SCALE(W, H) \
-    sizeScale[LUMA_ ## W ## x ## H] = (H * H) >> 4;
-    SETUP_SCALE(4, 4);
-    SETUP_SCALE(8, 8);
-    SETUP_SCALE(8, 4);
-    SETUP_SCALE(4, 8);
-    SETUP_SCALE(16, 16);
-    SETUP_SCALE(16, 8);
-    SETUP_SCALE(8, 16);
-    SETUP_SCALE(16, 12);
-    SETUP_SCALE(12, 16);
-    SETUP_SCALE(4, 16);
-    SETUP_SCALE(16, 4);
-    SETUP_SCALE(32, 32);
-    SETUP_SCALE(32, 16);
-    SETUP_SCALE(16, 32);
-    SETUP_SCALE(32, 24);
-    SETUP_SCALE(24, 32);
-    SETUP_SCALE(32, 8);
-    SETUP_SCALE(8, 32);
-    SETUP_SCALE(64, 64);
-    SETUP_SCALE(64, 32);
-    SETUP_SCALE(32, 64);
-    SETUP_SCALE(64, 48);
-    SETUP_SCALE(48, 64);
-    SETUP_SCALE(64, 16);
-    SETUP_SCALE(16, 64);
-#undef SETUP_SCALE
-}
-
 /* radius 2 hexagon. repeated entries are to avoid having to compute mod6 every time. */
 const MV hex2[8] = { MV(-1, -2), MV(-2, 0), MV(-1, 2), MV(1, 2), MV(2, 0), MV(1, -2), MV(-1, -2), MV(-2, 0) };
 const uint8_t mod6m1[8] = { 5, 0, 1, 2, 3, 4, 5, 0 };  /* (x-1)%6 */
@@ -136,20 +104,57 @@
     absPartIdx = -1;
     searchMethod = X265_HEX_SEARCH;
     subpelRefine = 2;
+    blockwidth = blockheight = 0;
+    blockOffset = 0;
     bChromaSATD = false;
     chromaSatd = NULL;
 }
 
 void MotionEstimate::init(int method, int refine, int csp)
 {
-    if (!sizeScale[0])
-        initScales();
-
     searchMethod = method;
     subpelRefine = refine;
     fencPUYuv.create(FENC_STRIDE, csp);
 }
 
+void MotionEstimate::initScales(void)
+{
+#define SETUP_SCALE(W, H) \
+    sizeScale[LUMA_ ## W ## x ## H] = (H * H) >> 4;
+    SETUP_SCALE(4, 4);
+    SETUP_SCALE(8, 8);
+    SETUP_SCALE(8, 4);
+    SETUP_SCALE(4, 8);
+    SETUP_SCALE(16, 16);
+    SETUP_SCALE(16, 8);
+    SETUP_SCALE(8, 16);
+    SETUP_SCALE(16, 12);
+    SETUP_SCALE(12, 16);
+    SETUP_SCALE(4, 16);
+    SETUP_SCALE(16, 4);
+    SETUP_SCALE(32, 32);
+    SETUP_SCALE(32, 16);
+    SETUP_SCALE(16, 32);
+    SETUP_SCALE(32, 24);
+    SETUP_SCALE(24, 32);
+    SETUP_SCALE(32, 8);
+    SETUP_SCALE(8, 32);
+    SETUP_SCALE(64, 64);
+    SETUP_SCALE(64, 32);
+    SETUP_SCALE(32, 64);
+    SETUP_SCALE(64, 48);
+    SETUP_SCALE(48, 64);
+    SETUP_SCALE(64, 16);
+    SETUP_SCALE(16, 64);
+#undef SETUP_SCALE
+}
+
+int MotionEstimate::hpelIterationCount(int subme)
+{
+    return workload[subme].hpel_iters +
+           workload[subme].qpel_iters / 2;
+}
+
 MotionEstimate::~MotionEstimate()
 {
     fencPUYuv.destroy();
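The `SETUP_SCALE` macro moved in this hunk computes each table entry as `(H * H) >> 4`. A small sketch of that arithmetic follows; `sizeScaleFor` is a hypothetical helper name, not part of x265, and the width parameter is kept only to mirror the macro's signature.

```cpp
// Mirrors the expression in SETUP_SCALE(W, H): the result depends on the height only.
constexpr int sizeScaleFor(int /*width*/, int height)
{
    return (height * height) >> 4;
}

// A few entries worked out by hand from the formula.
static_assert(sizeScaleFor(4, 4) == 1, "4x4");
static_assert(sizeScaleFor(16, 16) == 16, "16x16");
static_assert(sizeScaleFor(16, 12) == 9, "16x12");
static_assert(sizeScaleFor(64, 64) == 256, "64x64");
```

As written, the expression ignores the width, so under this formula a 16x4 partition gets scale 1 while a 4x16 partition gets scale 16.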

 

x265_1.5.tar.gz/source/encoder/motion.h -> x265_1.6.tar.gz/source/encoder/motion.h Changed

 
@@ -67,6 +67,8 @@
     MotionEstimate();
     ~MotionEstimate();
 
+    static void initScales();
+    static int hpelIterationCount(int subme);
     void init(int method, int refine, int csp);
 
     /* Methods called at slice setup */
​

x265_1.5.tar.gz/source/encoder/nal.cpp -> x265_1.6.tar.gz/source/encoder/nal.cpp Changed

 
@@ -107,7 +107,7 @@
      * nuh_reserved_zero_6bits  6-bits
      * nuh_temporal_id_plus1    3-bits */
     out[bytes++] = (uint8_t)nalUnitType << 1;
-    out[bytes++] = 1;
+    out[bytes++] = 1 + (nalUnitType == NAL_UNIT_CODED_SLICE_TSA_N);
 
     /* 7.4.1 ...
      * Within the NAL unit, the following three-byte sequences shall not occur at
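The one-line change above affects the second byte of the NAL unit header, which carries `nuh_temporal_id_plus1` in its low three bits. The sketch below packs the two header bytes the same way, assuming the 6-bit reserved/layer field is zero as in the diff. `packNalHeader` and `kNalTsaN` are hypothetical names; the value 2 for TSA_N is taken from the HEVC NAL unit type table and should be checked against the encoder's own enum.

```cpp
#include <array>
#include <cstdint>

// Assumed numeric value of NAL_UNIT_CODED_SLICE_TSA_N (HEVC NAL unit type table).
constexpr uint8_t kNalTsaN = 2;

// Packs the two-byte NAL header with the 6-bit reserved/layer field fixed at zero.
// Byte 0: forbidden_zero_bit (0) followed by the 6-bit type, i.e. type << 1.
// Byte 1: nuh_temporal_id_plus1, which is 2 for TSA_N slices and 1 otherwise.
inline std::array<uint8_t, 2> packNalHeader(uint8_t nalUnitType)
{
    const uint8_t temporalIdPlus1 =
        static_cast<uint8_t>(1 + (nalUnitType == kNalTsaN ? 1 : 0));
    return std::array<uint8_t, 2>{{ static_cast<uint8_t>(nalUnitType << 1),
                                    temporalIdPlus1 }};
}
```

Under these assumptions a TSA_N slice is written with temporal ID 1, which matches the changelog entry about coding unreferenced B frames in temporal layer 1; every other NAL type keeps temporal ID 0.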
​

x265_1.5.tar.gz/source/encoder/ratecontrol.cpp -> x265_1.6.tar.gz/source/encoder/ratecontrol.cpp Changed

@@ -145,30 +145,6 @@
 }
 
 }  // end anonymous namespace
-/* Compute variance to derive AC energy of each block */
-static inline uint32_t acEnergyVar(Frame *curFrame, uint64_t sum_ssd, int shift, int i)
-{
-    uint32_t sum = (uint32_t)sum_ssd;
-    uint32_t ssd = (uint32_t)(sum_ssd >> 32);
-
-    curFrame->m_lowres.wp_sum[i] += sum;
-    curFrame->m_lowres.wp_ssd[i] += ssd;
-    return ssd - ((uint64_t)sum * sum >> shift);
-}
-
-/* Find the energy of each block in Y/Cb/Cr plane */
-static inline uint32_t acEnergyPlane(Frame *curFrame, pixel* src, intptr_t srcStride, int bChroma, int colorFormat)
-{
-    if ((colorFormat != X265_CSP_I444) && bChroma)
-    {
-        ALIGN_VAR_8(pixel, pix[8 * 8]);
-        primitives.cu[BLOCK_8x8].copy_pp(pix, 8, src, srcStride);
-        return acEnergyVar(curFrame, primitives.cu[BLOCK_8x8].var(pix, 8), 6, bChroma);
-    }
-    else
-        return acEnergyVar(curFrame, primitives.cu[BLOCK_16x16].var(src, srcStride), 8, bChroma);
-}
-
 /* Returns the zone for the current frame */
 x265_zone* RateControl::getZone()
 {
@@ -181,138 +157,9 @@
     return NULL;
 }
 
-/* Find the total AC energy of each block in all planes */
-uint32_t RateControl::acEnergyCu(Frame* curFrame, uint32_t block_x, uint32_t block_y)
-{
-    intptr_t stride = curFrame->m_fencPic->m_stride;
-    intptr_t cStride = curFrame->m_fencPic->m_strideC;
-    intptr_t blockOffsetLuma = block_x + (block_y * stride);
-    int colorFormat = m_param->internalCsp;
-    int hShift = CHROMA_H_SHIFT(colorFormat);
-    int vShift = CHROMA_V_SHIFT(colorFormat);
-    intptr_t blockOffsetChroma = (block_x >> hShift) + ((block_y >> vShift) * cStride);
-
-    uint32_t var;
-
-    var  = acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[0] + blockOffsetLuma, stride, 0, colorFormat);
-    var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[1] + blockOffsetChroma, cStride, 1, colorFormat);
-    var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[2] + blockOffsetChroma, cStride, 2, colorFormat);
-    x265_emms();
-    return var;
-}
-
-void RateControl::calcAdaptiveQuantFrame(Frame *curFrame)
-{
-    /* Actual adaptive quantization */
-    int maxCol = curFrame->m_fencPic->m_picWidth;
-    int maxRow = curFrame->m_fencPic->m_picHeight;
-
-    for (int y = 0; y < 3; y++)
-    {
-        curFrame->m_lowres.wp_ssd[y] = 0;
-        curFrame->m_lowres.wp_sum[y] = 0;
-    }
-
-    /* Calculate Qp offset for each 16x16 block in the frame */
-    int block_xy = 0;
-    int block_x = 0, block_y = 0;
-    double strength = 0.f;
-    if (m_param->rc.aqMode == X265_AQ_NONE || m_param->rc.aqStrength == 0)
-    {
-        /* Need to init it anyways for CU tree */
-        int cuWidth = ((maxCol / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
-        int cuHeight = ((maxRow / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
-        int cuCount = cuWidth * cuHeight;
-
-        if (m_param->rc.aqMode && m_param->rc.aqStrength == 0)
-        {
-            memset(curFrame->m_lowres.qpCuTreeOffset, 0, cuCount * sizeof(double));
-            memset(curFrame->m_lowres.qpAqOffset, 0, cuCount * sizeof(double));
-            for (int cuxy = 0; cuxy < cuCount; cuxy++)
-                curFrame->m_lowres.invQscaleFactor[cuxy] = 256;
-        }
-
-        /* Need variance data for weighted prediction */
-        if (m_param->bEnableWeightedPred || m_param->bEnableWeightedBiPred)
-        {
-            for (block_y = 0; block_y < maxRow; block_y += 16)
-                for (block_x = 0; block_x < maxCol; block_x += 16)
-                    acEnergyCu(curFrame, block_x, block_y);
-        }
-    }
-    else
-    {
-        block_xy = 0;
-        double avg_adj_pow2 = 0, avg_adj = 0, qp_adj = 0;
-        if (m_param->rc.aqMode == X265_AQ_AUTO_VARIANCE)
-        {
-            double bit_depth_correction = pow(1 << (X265_DEPTH - 8), 0.5);
-            for (block_y = 0; block_y < maxRow; block_y += 16)
-            {
-                for (block_x = 0; block_x < maxCol; block_x += 16)
-                {
-                    uint32_t energy = acEnergyCu(curFrame, block_x, block_y);
-                    qp_adj = pow(energy + 1, 0.1);
-                    curFrame->m_lowres.qpCuTreeOffset[block_xy] = qp_adj;
-                    avg_adj += qp_adj;
-                    avg_adj_pow2 += qp_adj * qp_adj;
-                    block_xy++;
-                }
-            }
-
-            avg_adj /= m_ncu;
-            avg_adj_pow2 /= m_ncu;
-            strength = m_param->rc.aqStrength * avg_adj / bit_depth_correction;
-            avg_adj = avg_adj - 0.5f * (avg_adj_pow2 - (11.f * bit_depth_correction)) / avg_adj;
-        }
-        else
-            strength = m_param->rc.aqStrength * 1.0397f;
-
-        block_xy = 0;
-        for (block_y = 0; block_y < maxRow; block_y += 16)
-        {
-            for (block_x = 0; block_x < maxCol; block_x += 16)
-            {
-                if (m_param->rc.aqMode == X265_AQ_AUTO_VARIANCE)
-                {
-                    qp_adj = curFrame->m_lowres.qpCuTreeOffset[block_xy];
-                    qp_adj = strength * (qp_adj - avg_adj);
-                }
-                else
-                {
-                    uint32_t energy = acEnergyCu(curFrame, block_x, block_y);
-                    qp_adj = strength * (X265_LOG2(X265_MAX(energy, 1)) - (14.427f + 2 * (X265_DEPTH - 8)));
-                }
-                curFrame->m_lowres.qpAqOffset[block_xy] = qp_adj;
-                curFrame->m_lowres.qpCuTreeOffset[block_xy] = qp_adj;
-                curFrame->m_lowres.invQscaleFactor[block_xy] = x265_exp2fix8(qp_adj);
-                block_xy++;
-            }
-        }
-    }
-
-    if (m_param->bEnableWeightedPred || m_param->bEnableWeightedBiPred)
-    {
-        int hShift = CHROMA_H_SHIFT(m_param->internalCsp);
-        int vShift = CHROMA_V_SHIFT(m_param->internalCsp);
-        maxCol = ((maxCol + 8) >> 4) << 4;
-        maxRow = ((maxRow + 8) >> 4) << 4;
-        int width[3]  = { maxCol, maxCol >> hShift, maxCol >> hShift };
-        int height[3] = { maxRow, maxRow >> vShift, maxRow >> vShift };
-
-        for (int i = 0; i < 3; i++)
-        {
-            uint64_t sum, ssd;
-            sum = curFrame->m_lowres.wp_sum[i];
-            ssd = curFrame->m_lowres.wp_ssd[i];
-            curFrame->m_lowres.wp_ssd[i] = ssd - (sum * sum + (width[i] * height[i]) / 2) / (width[i] * height[i]);
-        }
-    }
-}
-
-RateControl::RateControl(x265_param *p)
+RateControl::RateControl(x265_param& p)
 {
-    m_param = p;
+    m_param = &p;
     int lowresCuWidth = ((m_param->sourceWidth / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
     int lowresCuHeight = ((m_param->sourceHeight / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
     m_ncu = lowresCuWidth * lowresCuHeight;
@@ -329,13 +176,11 @@
     m_partialResidualCost = 0;
     m_rateFactorMaxIncrement = 0;
     m_rateFactorMaxDecrement = 0;
-    m_fps = m_param->fpsNum / m_param->fpsDenom;
+    m_fps = (double)m_param->fpsNum / m_param->fpsDenom;
     m_startEndOrder.set(0);
     m_bTerminated = false;
     m_finalFrameCount = 0;
     m_numEntries = 0;
-    m_amortizeFraction = 0.85;
-    m_amortizeFrames = 75;
     if (m_param->rc.rateControlMode == X265_RC_CRF)
     {
         m_param->rc.qp = (int)m_param->rc.rfConstant;
@@ -371,6 +216,7 @@
     m_statFileOut = NULL;
     m_cutreeStatFileOut = m_cutreeStatFileIn = NULL;
     m_rce2Pass = NULL;
+    m_lastBsliceSatdCost = 0;
 
     // vbv initialization
     m_param->rc.vbvBufferSize = x265_clip3(0, 2000000, m_param->rc.vbvBufferSize);
@@ -424,11 +270,6 @@
         x265_log(m_param, X265_LOG_WARNING, "strict CBR set without CBR mode, ignored\n");
         m_param->rc.bStrictCbr = 0;
     }
-    if (m_param->totalFrames <= 2 * m_fps && m_param->rc.bStrictCbr) /* Strict CBR segment encode */
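The `m_fps` change in the hunk above adds a cast so that the frame rate is no longer computed with integer division. A short sketch of the difference, assuming `fpsNum` and `fpsDenom` are unsigned integers (the helper names `fpsTruncated` and `fpsExact` are hypothetical):

```cpp
#include <cstdint>

// Old behaviour: integer division happens first, then the result is widened.
inline double fpsTruncated(uint32_t fpsNum, uint32_t fpsDenom)
{
    return fpsNum / fpsDenom;
}

// New behaviour: the numerator is converted to double before dividing.
inline double fpsExact(uint32_t fpsNum, uint32_t fpsDenom)
{
    return static_cast<double>(fpsNum) / fpsDenom;
}
```

For an NTSC-style rate of 30000/1001, the first form yields 29.0 while the second yields roughly 29.97, so any rate-control quantity derived from `m_fps` would otherwise be slightly off for fractional frame rates.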

 

x265_1.5.tar.gz/source/encoder/ratecontrol.h -> x265_1.6.tar.gz/source/encoder/ratecontrol.h Changed

@@ -34,14 +34,16 @@
 
 class Encoder;
 class Frame;
-struct SPS;
 class SEIBufferingPeriod;
+struct SPS;
 #define BASE_FRAME_DURATION 0.04
 
 /* Arbitrary limitations as a sanity check. */
 #define MAX_FRAME_DURATION 1.00
 #define MIN_FRAME_DURATION 0.01
 
+#define MIN_AMORTIZE_FRAME 10
+#define MIN_AMORTIZE_FRACTION 0.2
 #define CLIP_DURATION(f) x265_clip3(MIN_FRAME_DURATION, MAX_FRAME_DURATION, f)
 
 /* Current frame stats for 2 pass */
@@ -79,46 +81,50 @@
 
 struct RateControlEntry
 {
-    int64_t lastSatd; /* Contains the picture cost of the previous frame, required for resetAbr and VBV */
-    int sliceType;
-    int bframes;
-    int poc;
-    int encodeOrder;
-    int64_t leadingNoBSatd;
-    bool bLastMiniGopBFrame;
-    double blurredComplexity;
-    double qpaRc;
-    double qpAq;
-    double qRceq;
-    double frameSizePlanned;  /* frame Size decided by RateCotrol before encoding the frame */
-    double bufferRate;
-    double movingAvgSum;
-    double   rowCplxrSum;
-    int64_t  rowTotalBits;  /* update cplxrsum and totalbits at the end of 2 rows */
-    double qpNoVbv;
-    double bufferFill;
-    double frameDuration;
-    double clippedDuration;
-    Predictor rowPreds[3][2];
+    Predictor  rowPreds[3][2];
     Predictor* rowPred[2];
-    double frameSizeEstimated;  /* hold frameSize, updated from cu level vbv rc */
-    double frameSizeMaximum;  /* max frame Size according to minCR restrictions and level of the video */
-    bool isActive;
-    SEIPictureTiming *picTimingSEI;
-    HRDTiming        *hrdTiming;
+
+    int64_t lastSatd;      /* Contains the picture cost of the previous frame, required for resetAbr and VBV */
+    int64_t leadingNoBSatd;
+    int64_t rowTotalBits;  /* update cplxrsum and totalbits at the end of 2 rows */
+    double  blurredComplexity;
+    double  qpaRc;
+    double  qpAq;
+    double  qRceq;
+    double  frameSizePlanned;  /* frame Size decided by RateCotrol before encoding the frame */
+    double  bufferRate;
+    double  movingAvgSum;
+    double  rowCplxrSum;
+    double  qpNoVbv;
+    double  bufferFill;
+    double  frameDuration;
+    double  clippedDuration;
+    double  frameSizeEstimated; /* hold frameSize, updated from cu level vbv rc */
+    double  frameSizeMaximum;   /* max frame Size according to minCR restrictions and level of the video */
+    int     sliceType;
+    int     bframes;
+    int     poc;
+    int     encodeOrder;
+    bool    bLastMiniGopBFrame;
+    bool    isActive;
+    double  amortizeFrames;
+    double  amortizeFraction;
     /* Required in 2-pass rate control */
-    double iCuCount;
-    double pCuCount;
-    double skipCuCount;
-    bool keptAsRef;
-    double expectedVbv;
-    double qScale;
-    double newQScale;
-    double newQp;
-    int mvBits;
-    int miscBits;
-    int coeffBits;
     uint64_t expectedBits; /* total expected bits up to the current frame (current one excluded) */
+    double   iCuCount;
+    double   pCuCount;
+    double   skipCuCount;
+    double   expectedVbv;
+    double   qScale;
+    double   newQScale;
+    double   newQp;
+    int      mvBits;
+    int      miscBits;
+    int      coeffBits;
+    bool     keptAsRef;
+
+    SEIPictureTiming *picTimingSEI;
+    HRDTiming        *hrdTiming;
 };
 
 class RateControl
@@ -139,7 +145,7 @@
     bool   m_isAbrReset;
     int    m_lastAbrResetPoc;
 
-    double  m_rateTolerance;
+    double m_rateTolerance;
     double m_frameDuration;     /* current frame duration in seconds */
     double m_bitrate;
     double m_rateFactorConstant;
@@ -154,33 +160,38 @@
     Predictor m_pred[5];
     Predictor m_predBfromP;
 
-    int       m_leadingBframes;
-    int64_t   m_bframeBits;
-    int64_t   m_currentSatd;
-    int       m_qpConstant[3];
-    double    m_ipOffset;
-    double    m_pbOffset;
-
-    int      m_lastNonBPictType;
-    int64_t  m_leadingNoBSatd;
-
-    double   m_cplxrSum;          /* sum of bits*qscale/rceq */
-    double   m_wantedBitsWindow;  /* target bitrate * window */
-    double   m_accumPQp;          /* for determining I-frame quant */
-    double   m_accumPNorm;
-    double   m_lastQScaleFor[3];  /* last qscale for a specific pict type, used for max_diff & ipb factor stuff */
-    double   m_lstep;
-    double   m_shortTermCplxSum;
-    double   m_shortTermCplxCount;
-    double   m_lastRceq;
-    double   m_qCompress;
-    int64_t  m_totalBits;        /* total bits used for already encoded frames (after ammortization) */
-    int      m_framesDone;       /* # of frames passed through RateCotrol already */
-    int64_t  m_encodedBits;      /* bits used for encoded frames (without ammortization) */
-    double   m_fps;
-    int64_t  m_satdCostWindow[50];
-    int      m_sliderPos;
-    int64_t  m_encodedBitsWindow[50];
+    int64_t m_leadingNoBSatd;
+    double  m_ipOffset;
+    double  m_pbOffset;
+    int64_t m_bframeBits;
+    int64_t m_currentSatd;
+    int     m_leadingBframes;
+    int     m_qpConstant[3];
+    int     m_lastNonBPictType;
+    int     m_framesDone;        /* # of frames passed through RateCotrol already */
+
+    double  m_cplxrSum;          /* sum of bits*qscale/rceq */
+    double  m_wantedBitsWindow;  /* target bitrate * window */
+    double  m_accumPQp;          /* for determining I-frame quant */
+    double  m_accumPNorm;
+    double  m_lastQScaleFor[3];  /* last qscale for a specific pict type, used for max_diff & ipb factor stuff */
+    double  m_lstep;
+    double  m_shortTermCplxSum;
+    double  m_shortTermCplxCount;
+    double  m_lastRceq;
+    double  m_qCompress;
+    int64_t m_totalBits;        /* total bits used for already encoded frames (after ammortization) */
+    int64_t m_encodedBits;      /* bits used for encoded frames (without ammortization) */
+    double  m_fps;
+    int64_t m_satdCostWindow[50];
+    int64_t m_encodedBitsWindow[50];
+    int     m_sliderPos;
+
+    /* To detect a pattern of low detailed static frames in single pass ABR using satdcosts */
+    int64_t m_lastBsliceSatdCost;
+    int     m_numBframesInPattern;
+    bool    m_isPatternPresent;
+
     /* a common variable on which rateControlStart, rateControlEnd and rateControUpdateStats waits to
      * sync the calls to these functions. For example
      * -F2:
@@ -194,24 +205,25 @@
      * rceUpdate 12
      * rceEnd    11 */
     ThreadSafeInteger m_startEndOrder;
-    int      m_finalFrameCount;   /* set when encoder begins flushing */
-    bool     m_bTerminated;       /* set true when encoder is closing */
+    int     m_finalFrameCount;   /* set when encoder begins flushing */
+    bool    m_bTerminated;       /* set true when encoder is closing */
 
     /* hrd stuff */
     SEIBufferingPeriod m_bufPeriodSEI;
-    double   m_nominalRemovalTime;
-    double   m_prevCpbFinalAT;
+    double  m_nominalRemovalTime;
+    double  m_prevCpbFinalAT;
 
     /* 2 pass */
-    bool     m_2pass;
-    FILE*    m_statFileOut;

 

x265_1.5.tar.gz/source/encoder/sao.cpp -> x265_1.6.tar.gz/source/encoder/sao.cpp Changed

 
@@ -261,6 +261,8 @@
     int8_t _upBuff1[MAX_CU_SIZE + 2], *upBuff1 = _upBuff1 + 1;
     int8_t _upBufft[MAX_CU_SIZE + 2], *upBufft = _upBufft + 1;
 
+    memset(_upBuff1 + MAX_CU_SIZE, 0, 2 * sizeof(int8_t)); /* avoid valgrind uninit warnings */
+
     {
         const pixel* recR = &rec[ctuWidth - 1];
         for (int i = 0; i < ctuHeight + 1; i++)
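
Note: the added `memset` clears the last two elements of `_upBuff1`, which, per its own comment, avoids valgrind reports about uninitialized reads. A minimal sketch of the same pattern follows. `kMaxSize` is an assumed stand-in for `MAX_CU_SIZE`, and `EdgeBuffer` is a hypothetical type, not x265 code.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for MAX_CU_SIZE; the value is an assumption.
static const int kMaxSize = 64;

struct EdgeBuffer
{
    int8_t  storage[kMaxSize + 2]; // one guard element on each side
    int8_t* data;                  // points one past the leading guard

    EdgeBuffer() : data(storage + 1)
    {
        // Zero the last two elements so a read slightly past the filled
        // region never touches indeterminate bytes.
        std::memset(storage + kMaxSize, 0, 2 * sizeof(int8_t));
    }
};
```

Only the tail is cleared here, mirroring the diff. The rest of the buffer is assumed to be written before it is read.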
​

x265_1.5.tar.gz/source/encoder/search.cpp -> x265_1.6.tar.gz/source/encoder/search.cpp Changed

@@ -30,6 +30,9 @@
 #include "entropy.h"
 #include "rdcost.h"
 
+#include "analysis.h"  // TLD
+#include "framedata.h"
+
 using namespace x265;
 
 #if _MSC_VER
@@ -40,10 +43,9 @@
 
 #define MVP_IDX_BITS 1
 
-ALIGN_VAR_32(const pixel, Search::zeroPixel[MAX_CU_SIZE]) = { 0 };
 ALIGN_VAR_32(const int16_t, Search::zeroShort[MAX_CU_SIZE]) = { 0 };
 
-Search::Search() : JobProvider(NULL)
+Search::Search()
 {
     memset(m_rqt, 0, sizeof(m_rqt));
 
@@ -54,25 +56,30 @@
     }
 
     m_numLayers = 0;
+    m_intraPred = NULL;
+    m_intraPredAngs = NULL;
+    m_fencScaled = NULL;
+    m_fencTransposed = NULL;
+    m_tsCoeff = NULL;
+    m_tsResidual = NULL;
+    m_tsRecon = NULL;
     m_param = NULL;
     m_slice = NULL;
     m_frame = NULL;
-    m_bJobsQueued = false;
-    m_totalNumME = m_numAcquiredME = m_numCompletedME = 0;
 }
 
 bool Search::initSearch(const x265_param& param, ScalingList& scalingList)
 {
     uint32_t maxLog2CUSize = g_log2Size[param.maxCUSize];
     m_param = &param;
-    m_bEnableRDOQ = param.rdLevel >= 4;
+    m_bEnableRDOQ = !!param.rdoqLevel;
     m_bFrameParallel = param.frameNumThreads > 1;
     m_numLayers = g_log2Size[param.maxCUSize] - 2;
 
     m_rdCost.setPsyRdScale(param.psyRd);
     m_me.init(param.searchMethod, param.subpelRefine, param.internalCsp);
 
-    bool ok = m_quant.init(m_bEnableRDOQ, param.psyRdoq, scalingList, m_entropyCoder);
+    bool ok = m_quant.init(param.rdoqLevel, param.psyRdoq, scalingList, m_entropyCoder);
     if (m_param->noiseReductionIntra || m_param->noiseReductionInter)
         ok &= m_quant.allocNoiseReduction(param);
 
@@ -116,6 +123,15 @@
     m_qtTempTransformSkipFlag[1] = m_qtTempTransformSkipFlag[0] + numPartitions;
     m_qtTempTransformSkipFlag[2] = m_qtTempTransformSkipFlag[0] + numPartitions * 2;
 
+    CHECKED_MALLOC(m_intraPred, pixel, (32 * 32) * (33 + 3));
+    m_fencScaled = m_intraPred + 32 * 32;
+    m_fencTransposed = m_fencScaled + 32 * 32;
+    m_intraPredAngs = m_fencTransposed + 32 * 32;
+
+    CHECKED_MALLOC(m_tsCoeff,    coeff_t, MAX_TS_SIZE * MAX_TS_SIZE);
+    CHECKED_MALLOC(m_tsResidual, int16_t, MAX_TS_SIZE * MAX_TS_SIZE);
+    CHECKED_MALLOC(m_tsRecon,    pixel,   MAX_TS_SIZE * MAX_TS_SIZE);
+
     return ok;
 
 fail:
@@ -141,6 +157,10 @@
 
     X265_FREE(m_qtTempCbf[0]);
     X265_FREE(m_qtTempTransformSkipFlag[0]);
+    X265_FREE(m_intraPred);
+    X265_FREE(m_tsCoeff);
+    X265_FREE(m_tsResidual);
+    X265_FREE(m_tsRecon);
 }
 
 void Search::setQP(const Slice& slice, int qp)
@@ -421,7 +441,7 @@
     }
 
     // set reconstruction for next intra prediction blocks if full TU prediction won
-    pixel*   picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.encodeIdx + absPartIdx);
+    pixel*   picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx);
     intptr_t picStride = m_frame->m_reconPic->m_stride;
     primitives.cu[sizeIdx].copy_pp(picReconY, picStride, reconQt, reconQtStride);
 
@@ -477,17 +497,14 @@
     if (m_bEnableRDOQ)
         m_entropyCoder.estBit(m_entropyCoder.m_estBitsSbac, log2TrSize, true);
 
-    ALIGN_VAR_32(coeff_t, tsCoeffY[MAX_TS_SIZE * MAX_TS_SIZE]);
-    ALIGN_VAR_32(pixel,   tsReconY[MAX_TS_SIZE * MAX_TS_SIZE]);
-
     int checkTransformSkip = 1;
     for (int useTSkip = 0; useTSkip <= checkTransformSkip; useTSkip++)
     {
         uint64_t tmpCost;
         uint32_t tmpEnergy = 0;
 
-        coeff_t* coeff = (useTSkip ? tsCoeffY : coeffY);
-        pixel*   tmpRecon = (useTSkip ? tsReconY : reconQt);
+        coeff_t* coeff = (useTSkip ? m_tsCoeff : coeffY);
+        pixel*   tmpRecon = (useTSkip ? m_tsRecon : reconQt);
         uint32_t tmpReconStride = (useTSkip ? MAX_TS_SIZE : reconQtStride);
 
         primitives.cu[sizeIdx].calcresidual(fenc, pred, residual, stride);
@@ -578,8 +595,8 @@
 
     if (bTSkip)
     {
-        memcpy(coeffY, tsCoeffY, sizeof(coeff_t) << (log2TrSize * 2));
-        primitives.cu[sizeIdx].copy_pp(reconQt, reconQtStride, tsReconY, tuSize);
+        memcpy(coeffY, m_tsCoeff, sizeof(coeff_t) << (log2TrSize * 2));
+        primitives.cu[sizeIdx].copy_pp(reconQt, reconQtStride, m_tsRecon, tuSize);
     }
     else if (checkTransformSkip)
     {
@@ -589,7 +606,7 @@
     }
 
     // set reconstruction for next intra prediction blocks
-    pixel*   picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.encodeIdx + absPartIdx);
+    pixel*   picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx);
     intptr_t picStride = m_frame->m_reconPic->m_stride;
     primitives.cu[sizeIdx].copy_pp(picReconY, picStride, reconQt, reconQtStride);
 
@@ -639,7 +656,7 @@
         uint32_t sizeIdx   = log2TrSize - 2;
         primitives.cu[sizeIdx].calcresidual(fenc, pred, residual, stride);
 
-        pixel*   picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.encodeIdx + absPartIdx);
+        pixel*   picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx);
         intptr_t picStride = m_frame->m_reconPic->m_stride;
 
         uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffY, log2TrSize, TEXT_LUMA, absPartIdx, false);
@@ -799,7 +816,7 @@
             coeff_t* coeffC        = m_rqt[qtLayer].coeffRQT[chromaId] + coeffOffsetC;
             pixel*   reconQt       = m_rqt[qtLayer].reconQtYuv.getChromaAddr(chromaId, absPartIdxC);
             uint32_t reconQtStride = m_rqt[qtLayer].reconQtYuv.m_csize;
-            pixel*   picReconC = m_frame->m_reconPic->getChromaAddr(chromaId, cu.m_cuAddr, cuGeom.encodeIdx + absPartIdxC);
+            pixel*   picReconC = m_frame->m_reconPic->getChromaAddr(chromaId, cu.m_cuAddr, cuGeom.absPartIdx + absPartIdxC);
             intptr_t picStride = m_frame->m_reconPic->m_strideC;
 
             uint32_t chromaPredMode = cu.m_chromaIntraDir[absPartIdxC];
@@ -812,7 +829,7 @@
             initAdiPatternChroma(cu, cuGeom, absPartIdxC, intraNeighbors, chromaId);
 
             // get prediction signal
-            predIntraChromaAng(chromaPredMode, pred, stride, log2TrSizeC, m_csp);
+            predIntraChromaAng(chromaPredMode, pred, stride, log2TrSizeC);
             cu.setTransformSkipPartRange(0, ttype, absPartIdxC, tuIterator.absPartIdxStep);
 
             primitives.cu[sizeIdxC].calcresidual(fenc, pred, residual, stride);
@@ -864,9 +881,6 @@
      * condition as it arrived, and to do all bit estimates from the same state. */
     m_entropyCoder.store(m_rqt[fullDepth].rqtRoot);
 
-    ALIGN_VAR_32(coeff_t, tskipCoeffC[MAX_TS_SIZE * MAX_TS_SIZE]);
-    ALIGN_VAR_32(pixel,   tskipReconC[MAX_TS_SIZE * MAX_TS_SIZE]);
-
     uint32_t curPartNum = cuGeom.numPartitions >> tuDepthC * 2;
     const SplitType splitType = (m_csp == X265_CSP_I422) ? VERTICAL_SPLIT : DONT_SPLIT;
 
@@ -903,7 +917,7 @@
                 chromaPredMode = g_chroma422IntraAngleMappingTable[chromaPredMode];
 
             // get prediction signal
-            predIntraChromaAng(chromaPredMode, pred, stride, log2TrSizeC, m_csp);
+            predIntraChromaAng(chromaPredMode, pred, stride, log2TrSizeC);
 
             uint64_t bCost = MAX_INT64;
             uint32_t bDist = 0;
@@ -914,8 +928,8 @@
             int checkTransformSkip = 1;
             for (int useTSkip = 0; useTSkip <= checkTransformSkip; useTSkip++)
             {
-                coeff_t* coeff = (useTSkip ? tskipCoeffC : coeffC);
-                pixel*   recon = (useTSkip ? tskipReconC : reconQt);
+                coeff_t* coeff = (useTSkip ? m_tsCoeff : coeffC);
+                pixel*   recon = (useTSkip ? m_tsRecon : reconQt);
                 uint32_t reconStride = (useTSkip ? MAX_TS_SIZE : reconQtStride);
 
                 primitives.cu[sizeIdxC].calcresidual(fenc, pred, residual, stride);
@@ -972,14 +986,14 @@
 
             if (bTSkip)
             {
-                memcpy(coeffC, tskipCoeffC, sizeof(coeff_t) << (log2TrSizeC * 2));
-                primitives.cu[sizeIdxC].copy_pp(reconQt, reconQtStride, tskipReconC, MAX_TS_SIZE);
+                memcpy(coeffC, m_tsCoeff, sizeof(coeff_t) << (log2TrSizeC * 2));
+                primitives.cu[sizeIdxC].copy_pp(reconQt, reconQtStride, m_tsRecon, MAX_TS_SIZE);
             }
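
Note: the `initSearch()` hunk above allocates one block of `(32 * 32) * (33 + 3)` pixels and carves it into four buffers, which is consistent with the destructor hunk freeing only `m_intraPred`. The sketch below reproduces that layout arithmetic under two assumptions: an 8-bit `pixel`, and a hypothetical `IntraBuffers` holder in place of the real `Search` class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

typedef uint8_t pixel; // assumption: 8-bit build

// Hypothetical holder mirroring the pointer carving in the diff.
struct IntraBuffers
{
    static const size_t kBlock = 32 * 32;

    pixel* intraPred;      // block 0: owns the allocation
    pixel* fencScaled;     // block 1
    pixel* fencTransposed; // block 2
    pixel* intraPredAngs;  // blocks 3..35: 33 angular predictions

    IntraBuffers() : intraPred(NULL), fencScaled(NULL), fencTransposed(NULL), intraPredAngs(NULL) {}

    bool alloc()
    {
        intraPred = static_cast<pixel*>(std::malloc(sizeof(pixel) * kBlock * (33 + 3)));
        if (!intraPred)
            return false;
        fencScaled     = intraPred + kBlock;
        fencTransposed = fencScaled + kBlock;
        intraPredAngs  = fencTransposed + kBlock;
        return true;
    }

    ~IntraBuffers() { std::free(intraPred); } // one free releases all four
};
```

The three single blocks plus 33 angular blocks account for all 36 blocks, so `intraPredAngs` ends exactly at the end of the allocation.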

 

x265_1.5.tar.gz/source/encoder/search.h -> x265_1.6.tar.gz/source/encoder/search.h Changed

@@ -28,6 +28,7 @@
 #include "predict.h"
 #include "quant.h"
 #include "bitcost.h"
+#include "framedata.h"
 #include "yuv.h"
 #include "threadpool.h"
 
@@ -35,6 +36,18 @@
 #include "entropy.h"
 #include "motion.h"
 
+#if DETAILED_CU_STATS
+#define ProfileCUScopeNamed(name, cu, acc, count) \
+    m_stats[cu.m_encData->m_frameEncoderID].count++; \
+    ScopedElapsedTime name(m_stats[cu.m_encData->m_frameEncoderID].acc)
+#define ProfileCUScope(cu, acc, count) ProfileCUScopeNamed(timedScope, cu, acc, count)
+#define ProfileCounter(cu, count) m_stats[cu.m_encData->m_frameEncoderID].count++;
+#else
+#define ProfileCUScopeNamed(name, cu, acc, count)
+#define ProfileCUScope(cu, acc, count)
+#define ProfileCounter(cu, count)
+#endif
+
 namespace x265 {
 // private namespace
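
Note: when `DETAILED_CU_STATS` is set, `ProfileCUScope` increments a counter and declares a `ScopedElapsedTime` bound to an accumulator. Otherwise the macros expand to nothing. The definition of `ScopedElapsedTime` is not part of this hunk, so the sketch below substitutes a hypothetical `ScopedTimer` built on `std::chrono` to show the general RAII idea. `Stats` and `timedWork` are likewise made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in; x265's ScopedElapsedTime may differ.
class ScopedTimer
{
public:
    explicit ScopedTimer(int64_t& accum)
        : m_accum(accum), m_start(std::chrono::steady_clock::now()) {}

    ~ScopedTimer()
    {
        using namespace std::chrono;
        m_accum += duration_cast<microseconds>(steady_clock::now() - m_start).count();
    }

private:
    ScopedTimer(const ScopedTimer&);            // non-copyable
    ScopedTimer& operator=(const ScopedTimer&);

    int64_t& m_accum;
    std::chrono::steady_clock::time_point m_start;
};

struct Stats
{
    int64_t  elapsedUs;
    uint64_t count;
    Stats() : elapsedUs(0), count(0) {}
};

// Counts one call and charges the time spent in this scope to `stats`.
inline int timedWork(Stats& stats)
{
    stats.count++;
    ScopedTimer timer(stats.elapsedUs);
    int sum = 0;
    for (int i = 0; i < 1000; i++)
        sum += i;
    return sum; // timer's destructor runs after the return value is computed
}
```

Because `ProfileCUScopeNamed` expands to two statements, placing it directly under an unbraced `if` would guard only the counter increment. Using it at the top of a braced scope avoids that.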
 
@@ -88,6 +101,10 @@
     MotionData bestME[MAX_INTER_PARTS][2];
     MV         amvpCand[2][MAX_NUM_REF][AMVP_NUM_CANDS];
 
+    // Neighbour MVs of the current partition. 5 spatial candidates and the
+    // temporal candidate.
+    InterNeighbourMV interNeighbours[6];
+
     uint64_t   rdCost;     // sum of partition (psy) RD costs          (sse(fenc, recon) + lambda2 * bits)
     uint64_t   sa8dCost;   // sum of partition sa8d distortion costs   (sa8d(fenc, pred) + lambda * bits)
     uint32_t   sa8dBits;   // signal bits used in sa8dCost calculation
@@ -109,8 +126,35 @@
         coeffBits = 0;
     }
 
+    void invalidate()
+    {
+        /* set costs to invalid data, catch uninitialized re-use */
+        rdCost = UINT64_MAX / 2;
+        sa8dCost = UINT64_MAX / 2;
+        sa8dBits = MAX_UINT / 2;
+        psyEnergy = MAX_UINT / 2;
+        distortion = MAX_UINT / 2;
+        totalBits = MAX_UINT / 2;
+        mvBits = MAX_UINT / 2;
+        coeffBits = MAX_UINT / 2;
+    }
+
+    bool ok() const
+    {
+        return !(rdCost >= UINT64_MAX / 2 ||
+                 sa8dCost >= UINT64_MAX / 2 ||
+                 sa8dBits >= MAX_UINT / 2 ||
+                 psyEnergy >= MAX_UINT / 2 ||
+                 distortion >= MAX_UINT / 2 ||
+                 totalBits >= MAX_UINT / 2 ||
+                 mvBits >= MAX_UINT / 2 ||
+                 coeffBits >= MAX_UINT / 2);
+    }
+
     void addSubCosts(const Mode& subMode)
     {
+        X265_CHECK(subMode.ok(), "sub-mode not initialized");
+
         rdCost += subMode.rdCost;
         sa8dCost += subMode.sa8dCost;
         sa8dBits += subMode.sa8dBits;
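
Note: `invalidate()` parks every cost at half of its type's maximum and `ok()` treats anything at or above that value as unset, so `addSubCosts()` can check its input before summing. The sketch below reduces this to two fields. `MiniMode` is hypothetical, and `UINT32_MAX` from `<cstdint>` is used in place of x265's `MAX_UINT`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical two-field reduction of the Mode cost bookkeeping.
struct MiniMode
{
    uint64_t rdCost;
    uint32_t totalBits;

    void initCosts() { rdCost = 0; totalBits = 0; }

    void invalidate()
    {
        // Sentinel values: large enough to be recognisable, small enough
        // that adding a real cost does not wrap around.
        rdCost    = UINT64_MAX / 2;
        totalBits = UINT32_MAX / 2;
    }

    bool ok() const
    {
        return !(rdCost >= UINT64_MAX / 2 || totalBits >= UINT32_MAX / 2);
    }

    void addSubCosts(const MiniMode& sub)
    {
        assert(sub.ok() && "sub-mode not initialized");
        rdCost    += sub.rdCost;
        totalBits += sub.totalBits;
    }
};
```

A plausible reason for choosing half of the maximum rather than the maximum itself is headroom: a sentinel plus a real cost still compares as invalid instead of wrapping to a small number. The diff itself does not state this.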
@@ -122,16 +166,89 @@
     }
 };
 
+#if DETAILED_CU_STATS
+/* This structure is intended for performance debugging and we make no attempt
+ * to handle dynamic range overflows. Care should be taken to avoid long encodes
+ * if you care about the accuracy of these elapsed times and counters. This
+ * profiling is orthogonal to PPA/VTune and can be enabled independently from
+ * either of them */
+struct CUStats
+{
+    int64_t  intraRDOElapsedTime[NUM_CU_DEPTH]; // elapsed worker time in intra RDO per CU depth
+    int64_t  interRDOElapsedTime[NUM_CU_DEPTH]; // elapsed worker time in inter RDO per CU depth
+    int64_t  intraAnalysisElapsedTime;          // elapsed worker time in intra sa8d analysis
+    int64_t  motionEstimationElapsedTime;       // elapsed worker time in predInterSearch()
+    int64_t  loopFilterElapsedTime;             // elapsed worker time in deblock and SAO and PSNR/SSIM
+    int64_t  pmeTime;                           // elapsed worker time processing ME slave jobs
+    int64_t  pmeBlockTime;                      // elapsed worker time blocked for pme batch completion
+    int64_t  pmodeTime;                         // elapsed worker time processing pmode slave jobs
+    int64_t  pmodeBlockTime;                    // elapsed worker time blocked for pmode batch completion
+    int64_t  weightAnalyzeTime;                 // elapsed worker time analyzing reference weights
+    int64_t  totalCTUTime;                      // elapsed worker time in compressCTU (includes pmode master)
+
+    uint64_t countIntraRDO[NUM_CU_DEPTH];
+    uint64_t countInterRDO[NUM_CU_DEPTH];
+    uint64_t countIntraAnalysis;
+    uint64_t countMotionEstimate;
+    uint64_t countLoopFilter;
+    uint64_t countPMETasks;
+    uint64_t countPMEMasters;
+    uint64_t countPModeTasks;
+    uint64_t countPModeMasters;
+    uint64_t countWeightAnalyze;
+    uint64_t totalCTUs;
+
+    CUStats() { clear(); }
+
+    void clear()
+    {
+        memset(this, 0, sizeof(*this));
+    }
+
+    void accumulate(CUStats& other)
+    {
+        for (uint32_t i = 0; i <= g_maxCUDepth; i++)
+        {
+            intraRDOElapsedTime[i] += other.intraRDOElapsedTime[i];
+            interRDOElapsedTime[i] += other.interRDOElapsedTime[i];
+            countIntraRDO[i] += other.countIntraRDO[i];
+            countInterRDO[i] += other.countInterRDO[i];
+        }
+
+        intraAnalysisElapsedTime += other.intraAnalysisElapsedTime;
+        motionEstimationElapsedTime += other.motionEstimationElapsedTime;
+        loopFilterElapsedTime += other.loopFilterElapsedTime;
+        pmeTime += other.pmeTime;
+        pmeBlockTime += other.pmeBlockTime;
+        pmodeTime += other.pmodeTime;
+        pmodeBlockTime += other.pmodeBlockTime;
+        weightAnalyzeTime += other.weightAnalyzeTime;
+        totalCTUTime += other.totalCTUTime;
+
+        countIntraAnalysis += other.countIntraAnalysis;
+        countMotionEstimate += other.countMotionEstimate;
+        countLoopFilter += other.countLoopFilter;
+        countPMETasks += other.countPMETasks;
+        countPMEMasters += other.countPMEMasters;
+        countPModeTasks += other.countPModeTasks;
+        countPModeMasters += other.countPModeMasters;
+        countWeightAnalyze += other.countWeightAnalyze;
+        totalCTUs += other.totalCTUs;
+
+        other.clear();
+    }
+}; 
+#endif
+
 inline int getTUBits(int idx, int numIdx)
 {
     return idx + (idx < numIdx - 1);
 }
 
-class Search : public JobProvider, public Predict
+class Search : public Predict
 {
 public:
 
-    static const pixel   zeroPixel[MAX_CU_SIZE];
     static const int16_t zeroShort[MAX_CU_SIZE];
 
     MotionEstimate  m_me;
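
Note: `CUStats::accumulate()` in the hunk above adds another instance's totals into this one and then clears the source, so repeated merges do not count the same work twice. The sketch below is a cut-down version with a hypothetical `MiniStats`. `kDepths` is an assumed stand-in for `NUM_CU_DEPTH`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static const int kDepths = 4; // assumption, stands in for NUM_CU_DEPTH

struct MiniStats
{
    int64_t  rdoTime[kDepths];
    int64_t  totalTime;
    uint64_t totalCTUs;

    MiniStats() { clear(); }

    // Only integer members, so zero-filling the object is a valid reset.
    void clear() { std::memset(this, 0, sizeof(*this)); }

    void accumulate(MiniStats& other)
    {
        for (int i = 0; i < kDepths; i++)
            rdoTime[i] += other.rdoTime[i];
        totalTime += other.totalTime;
        totalCTUs += other.totalCTUs;
        other.clear(); // source is drained once merged
    }
};
```

Taking `other` by non-const reference, as the diff does, is what allows the merge to reset the source in the same call.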
@@ -147,11 +264,25 @@
     uint8_t*        m_qtTempCbf[3];
     uint8_t*        m_qtTempTransformSkipFlag[3];
 
+    pixel*          m_fencScaled;     /* 32x32 buffer for down-scaled version of 64x64 CU fenc */
+    pixel*          m_fencTransposed; /* 32x32 buffer for transposed copy of fenc */
+    pixel*          m_intraPred;      /* 32x32 buffer for individual intra predictions */
+    pixel*          m_intraPredAngs;  /* allocation for 33 consecutive (all angular) 32x32 intra predictions */
+
+    coeff_t*        m_tsCoeff;        /* transform skip coeff 32x32 */
+    int16_t*        m_tsResidual;     /* transform skip residual 32x32 */
+    pixel*          m_tsRecon;        /* transform skip reconstructed pixels 32x32 */
+
     bool            m_bFrameParallel;
     bool            m_bEnableRDOQ;
     uint32_t        m_numLayers;
     uint32_t        m_refLagPixels;
 
+#if DETAILED_CU_STATS
+    /* Accumulate CU statistics separately for each frame encoder */
+    CUStats         m_stats[X265_MAX_FRAME_THREADS];
+#endif
+
     Search();
     ~Search();
 
@@ -162,7 +293,7 @@
     void     invalidateContexts(int fromDepth);
 
     // full RD search of intra modes. if sharedModes is not NULL, it directly uses them
-    void     checkIntra(Mode& intraMode, const CUGeom& cuGeom, PartSize partSize, uint8_t* sharedModes);
+    void     checkIntra(Mode& intraMode, const CUGeom& cuGeom, PartSize partSize, uint8_t* sharedModes, uint8_t* sharedChromaModes);
 
     // select best intra mode using only sa8d costs, cannot measure NxN intra

 

x265_1.5.tar.gz/source/encoder/slicetype.cpp -> x265_1.6.tar.gz/source/encoder/slicetype.cpp Changed

@@ -34,11 +34,17 @@
 #include "motion.h"
 #include "ratecontrol.h"
 
-#define NUM_CUS (m_widthInCU > 2 && m_heightInCU > 2 ? (m_widthInCU - 2) * (m_heightInCU - 2) : m_widthInCU * m_heightInCU)
+#if DETAILED_CU_STATS
+#define ProfileLookaheadTime(elapsed, count) ScopedElapsedTime _scope(elapsed); count++
+#else
+#define ProfileLookaheadTime(elapsed, count)
+#endif
 
 using namespace x265;
 
-static inline int16_t median(int16_t a, int16_t b, int16_t c)
+namespace {
+
+inline int16_t median(int16_t a, int16_t b, int16_t c)
 {
     int16_t t = (a - b) & ((a - b) >> 31);
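
The `ProfileLookaheadTime` macro added at the top of this hunk expands to a `ScopedElapsedTime` object plus a counter increment, and to nothing when `DETAILED_CU_STATS` is off. The definition of `ScopedElapsedTime` is not part of this diff, so the following is a sketch of how such a scope timer could work, using `std::chrono` instead of whatever clock x265 uses. `ScopedElapsedSketch`, `ProfileSketch`, `PROFILE_SKETCH`, `LookaheadStatsSketch` and `timedWork` are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical RAII timer: on destruction it adds the lifetime of the
// object, in microseconds, to an accumulator owned by the caller.
class ScopedElapsedSketch
{
public:
    explicit ScopedElapsedSketch(int64_t& accum)
        : m_accum(accum), m_start(std::chrono::steady_clock::now()) {}

    ~ScopedElapsedSketch()
    {
        auto elapsed = std::chrono::steady_clock::now() - m_start;
        m_accum += std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    }

    ScopedElapsedSketch(const ScopedElapsedSketch&) = delete;
    ScopedElapsedSketch& operator=(const ScopedElapsedSketch&) = delete;

private:
    int64_t& m_accum;
    std::chrono::steady_clock::time_point m_start;
};

#define PROFILE_SKETCH 1
#if PROFILE_SKETCH
#define ProfileSketch(elapsed, count) ScopedElapsedSketch scopeTimer(elapsed); count++
#else
#define ProfileSketch(elapsed, count)
#endif

struct LookaheadStatsSketch
{
    int64_t  elapsedUs = 0;
    uint64_t calls = 0;
};

// The timer covers everything from the macro to the end of the function.
inline int timedWork(LookaheadStatsSketch& stats, int n)
{
    assert(n >= 0);
    ProfileSketch(stats.elapsedUs, stats.calls);
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += i;
    return sum;
}
```

As in the diff, the macro expands to two statements, so it should sit on its own line inside a braced scope and not directly after an unbraced `if`.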
 
@@ -49,55 +55,531 @@
     return b;
 }
 
-static inline void median_mv(MV &dst, MV a, MV b, MV c)
+inline void median_mv(MV &dst, MV a, MV b, MV c)
 {
     dst.x = median(a.x, b.x, c.x);
     dst.y = median(a.y, b.y, c.y);
 }
 
+/* Compute variance to derive AC energy of each block */
+inline uint32_t acEnergyVar(Frame *curFrame, uint64_t sum_ssd, int shift, int plane)
+{
+    uint32_t sum = (uint32_t)sum_ssd;
+    uint32_t ssd = (uint32_t)(sum_ssd >> 32);
+
+    curFrame->m_lowres.wp_sum[plane] += sum;
+    curFrame->m_lowres.wp_ssd[plane] += ssd;
+    return ssd - ((uint64_t)sum * sum >> shift);
+}
+
+/* Find the energy of each block in Y/Cb/Cr plane */
+inline uint32_t acEnergyPlane(Frame *curFrame, pixel* src, intptr_t srcStride, int plane, int colorFormat)
+{
+    if ((colorFormat != X265_CSP_I444) && plane)
+    {
+        ALIGN_VAR_8(pixel, pix[8 * 8]);
+        primitives.cu[BLOCK_8x8].copy_pp(pix, 8, src, srcStride);
+        return acEnergyVar(curFrame, primitives.cu[BLOCK_8x8].var(pix, 8), 6, plane);
+    }
+    else
+        return acEnergyVar(curFrame, primitives.cu[BLOCK_16x16].var(src, srcStride), 8, plane);
+}
+
+} // end anonymous namespace
+
+/* Find the total AC energy of each block in all planes */
+uint32_t LookaheadTLD::acEnergyCu(Frame* curFrame, uint32_t blockX, uint32_t blockY, int csp)
+{
+    intptr_t stride = curFrame->m_fencPic->m_stride;
+    intptr_t cStride = curFrame->m_fencPic->m_strideC;
+    intptr_t blockOffsetLuma = blockX + (blockY * stride);
+    int hShift = CHROMA_H_SHIFT(csp);
+    int vShift = CHROMA_V_SHIFT(csp);
+    intptr_t blockOffsetChroma = (blockX >> hShift) + ((blockY >> vShift) * cStride);
+
+    uint32_t var;
+
+    var  = acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[0] + blockOffsetLuma, stride, 0, csp);
+    var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[1] + blockOffsetChroma, cStride, 1, csp);
+    var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[2] + blockOffsetChroma, cStride, 2, csp);
+    x265_emms();
+    return var;
+}
+
+void LookaheadTLD::calcAdaptiveQuantFrame(Frame *curFrame, x265_param* param)
+{
+    /* Actual adaptive quantization */
+    int maxCol = curFrame->m_fencPic->m_picWidth;
+    int maxRow = curFrame->m_fencPic->m_picHeight;
+    int blockWidth = ((param->sourceWidth / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
+    int blockHeight = ((param->sourceHeight / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
+    int blockCount = blockWidth * blockHeight;
+
+    for (int y = 0; y < 3; y++)
+    {
+        curFrame->m_lowres.wp_ssd[y] = 0;
+        curFrame->m_lowres.wp_sum[y] = 0;
+    }
+
+    /* Calculate Qp offset for each 16x16 block in the frame */
+    int blockXY = 0;
+    int blockX = 0, blockY = 0;
+    double strength = 0.f;
+    if (param->rc.aqMode == X265_AQ_NONE || param->rc.aqStrength == 0)
+    {
+        /* Need to init it anyways for CU tree */
+        int cuCount = widthInCU * heightInCU;
+
+        if (param->rc.aqMode && param->rc.aqStrength == 0)
+        {
+            memset(curFrame->m_lowres.qpCuTreeOffset, 0, cuCount * sizeof(double));
+            memset(curFrame->m_lowres.qpAqOffset, 0, cuCount * sizeof(double));
+            for (int cuxy = 0; cuxy < cuCount; cuxy++)
+                curFrame->m_lowres.invQscaleFactor[cuxy] = 256;
+        }
+
+        /* Need variance data for weighted prediction */
+        if (param->bEnableWeightedPred || param->bEnableWeightedBiPred)
+        {
+            for (blockY = 0; blockY < maxRow; blockY += 16)
+                for (blockX = 0; blockX < maxCol; blockX += 16)
+                    acEnergyCu(curFrame, blockX, blockY, param->internalCsp);
+        }
+    }
+    else
+    {
+        blockXY = 0;
+        double avg_adj_pow2 = 0, avg_adj = 0, qp_adj = 0;
+        if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE)
+        {
+            double bit_depth_correction = pow(1 << (X265_DEPTH - 8), 0.5);
+            for (blockY = 0; blockY < maxRow; blockY += 16)
+            {
+                for (blockX = 0; blockX < maxCol; blockX += 16)
+                {
+                    uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp);
+                    qp_adj = pow(energy + 1, 0.1);
+                    curFrame->m_lowres.qpCuTreeOffset[blockXY] = qp_adj;
+                    avg_adj += qp_adj;
+                    avg_adj_pow2 += qp_adj * qp_adj;
+                    blockXY++;
+                }
+            }
+
+            avg_adj /= blockCount;
+            avg_adj_pow2 /= blockCount;
+            strength = param->rc.aqStrength * avg_adj / bit_depth_correction;
+            avg_adj = avg_adj - 0.5f * (avg_adj_pow2 - (11.f * bit_depth_correction)) / avg_adj;
+        }
+        else
+            strength = param->rc.aqStrength * 1.0397f;
+
+        blockXY = 0;
+        for (blockY = 0; blockY < maxRow; blockY += 16)
+        {
+            for (blockX = 0; blockX < maxCol; blockX += 16)
+            {
+                if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE)
+                {
+                    qp_adj = curFrame->m_lowres.qpCuTreeOffset[blockXY];
+                    qp_adj = strength * (qp_adj - avg_adj);
+                }
+                else
+                {
+                    uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp);
+                    qp_adj = strength * (X265_LOG2(X265_MAX(energy, 1)) - (14.427f + 2 * (X265_DEPTH - 8)));
+                }
+                curFrame->m_lowres.qpAqOffset[blockXY] = qp_adj;
+                curFrame->m_lowres.qpCuTreeOffset[blockXY] = qp_adj;
+                curFrame->m_lowres.invQscaleFactor[blockXY] = x265_exp2fix8(qp_adj);
+                blockXY++;
+            }
+        }
+    }
+
+    if (param->bEnableWeightedPred || param->bEnableWeightedBiPred)
+    {
+        int hShift = CHROMA_H_SHIFT(param->internalCsp);
+        int vShift = CHROMA_V_SHIFT(param->internalCsp);
+        maxCol = ((maxCol + 8) >> 4) << 4;
+        maxRow = ((maxRow + 8) >> 4) << 4;
+        int width[3]  = { maxCol, maxCol >> hShift, maxCol >> hShift };
+        int height[3] = { maxRow, maxRow >> vShift, maxRow >> vShift };
+
+        for (int i = 0; i < 3; i++)
+        {
+            uint64_t sum, ssd;
+            sum = curFrame->m_lowres.wp_sum[i];
+            ssd = curFrame->m_lowres.wp_ssd[i];
+            curFrame->m_lowres.wp_ssd[i] = ssd - (sum * sum + (width[i] * height[i]) / 2) / (width[i] * height[i]);
+        }
+    }
+}
+
+void LookaheadTLD::lowresIntraEstimate(Lowres& fenc)
+{
+    ALIGN_VAR_32(pixel, prediction[X265_LOWRES_CU_SIZE * X265_LOWRES_CU_SIZE]);
+    pixel fencIntra[X265_LOWRES_CU_SIZE * X265_LOWRES_CU_SIZE];
+    pixel neighbours[2][X265_LOWRES_CU_SIZE * 4 + 1];
+    pixel* samples = neighbours[0], *filtered = neighbours[1];
+
+    const int lookAheadLambda = (int)x265_lambda_tab[X265_LOOKAHEAD_QP];
+    const int intraPenalty = 5 * lookAheadLambda;
+    const int lowresPenalty = 4; /* fixed CU cost overhead */
+
+    const int cuSize  = X265_LOWRES_CU_SIZE;
+    const int cuSize2 = cuSize << 1;
+    const int sizeIdx = X265_LOWRES_CU_BITS - 2;
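
The arithmetic in `acEnergyVar()` and in the non-auto-variance branch of `calcAdaptiveQuantFrame()` above can be followed with a small standalone example. The sketch assumes 8-bit pixels. `packedSumSsd`, `acEnergy` and `aqOffsetVariance` are hypothetical names, and `packedSumSsd` only imitates the packing that the diff unpacks (pixel sum in the low 32 bits, sum of squares in the high 32 bits); it is not the x265 `var` primitive.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the block "var" primitive.
inline uint64_t packedSumSsd(const uint8_t* src, ptrdiff_t stride, int size)
{
    uint32_t sum = 0, ssd = 0;
    for (int y = 0; y < size; y++)
    {
        for (int x = 0; x < size; x++)
        {
            uint32_t p = src[y * stride + x];
            sum += p;
            ssd += p * p;
        }
    }
    return sum + ((uint64_t)ssd << 32);
}

// Same arithmetic as acEnergyVar() without the per-frame accumulators:
// ssd - sum^2 / N, where N = 1 << shift is the number of pixels in the block
// (shift 8 for 16x16, shift 6 for 8x8).
inline uint32_t acEnergy(uint64_t sumSsd, int shift)
{
    uint32_t sum = (uint32_t)sumSsd;
    uint32_t ssd = (uint32_t)(sumSsd >> 32);
    return ssd - (uint32_t)(((uint64_t)sum * sum) >> shift);
}

// QP offset as computed in the non-auto-variance branch of the diff.
// bitDepth plays the role of the compile-time X265_DEPTH.
inline double aqOffsetVariance(uint32_t energy, double aqStrength, int bitDepth)
{
    double strength = aqStrength * 1.0397;
    double e = energy > 1 ? energy : 1; // same clamp as X265_MAX(energy, 1)
    return strength * (std::log2(e) - (14.427 + 2 * (bitDepth - 8)));
}
```

The value returned by `acEnergy` is the sum of squared deviations from the block mean (N times the variance), so a flat block yields 0. In `aqOffsetVariance`, blocks whose log2 energy lies below the 14.427 pivot get a negative offset and busier blocks get a positive one.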

 

x265_1.5.tar.gz/source/encoder/slicetype.h -> x265_1.6.tar.gz/source/encoder/slicetype.h Changed

@@ -28,141 +28,135 @@
 #include "slice.h"
 #include "motion.h"
 #include "piclist.h"
-#include "wavefront.h"
+#include "threadpool.h"
 
 namespace x265 {
 // private namespace
 
 struct Lowres;
 class Frame;
+class Lookahead;
 
 #define LOWRES_COST_MASK  ((1 << 14) - 1)
 #define LOWRES_COST_SHIFT 14
 
-#define SET_WEIGHT(w, b, s, d, o) \
-    { \
-        (w).inputWeight = (s); \
-        (w).log2WeightDenom = (d); \
-        (w).inputOffset = (o); \
-        (w).bPresentFlag = b; \
-    }
-
-class EstimateRow
+/* Thread local data for lookahead tasks */
+struct LookaheadTLD
 {
-public:
-    x265_param*         m_param;
-    MotionEstimate      m_me;
-    Lock                m_lock;
-
-    volatile uint32_t   m_completed;      // Number of CUs in this row for which cost estimation is completed
-    volatile bool       m_active;
-
-    uint64_t            m_costEst;        // Estimated cost for all CUs in a row
-    uint64_t            m_costEstAq;      // Estimated weight Aq cost for all CUs in a row
-    uint64_t            m_costIntraAq;    // Estimated weighted Aq Intra cost for all CUs in a row
-    int                 m_intraMbs;       // Number of Intra CUs
-    int                 m_costIntra;      // Estimated Intra cost for all CUs in a row
-
-    int                 m_merange;
-    int                 m_lookAheadLambda;
-
-    int                 m_widthInCU;
-    int                 m_heightInCU;
-
-    EstimateRow()
+    MotionEstimate  me;
+    ReferencePlanes weightedRef;
+    pixel*          wbuffer[4];
+    int             widthInCU;
+    int             heightInCU;
+    int             ncu;
+    int             paddedLines;
+
+#if DETAILED_CU_STATS
+    int64_t         batchElapsedTime;
+    int64_t         coopSliceElapsedTime;
+    uint64_t        countBatches;
+    uint64_t        countCoopSlices;
+#endif
+
+    LookaheadTLD()
     {
-        m_me.setQP(X265_LOOKAHEAD_QP);
-        m_me.init(X265_HEX_SEARCH, 1, X265_CSP_I400);
-        m_merange = 16;
-        m_lookAheadLambda = (int)x265_lambda_tab[X265_LOOKAHEAD_QP];
+        me.setQP(X265_LOOKAHEAD_QP);
+        me.init(X265_HEX_SEARCH, 1, X265_CSP_I400);
+        for (int i = 0; i < 4; i++)
+            wbuffer[i] = NULL;
+        widthInCU = heightInCU = ncu = paddedLines = 0;
+
+#if DETAILED_CU_STATS
+        batchElapsedTime = 0;
+        coopSliceElapsedTime = 0;
+        countBatches = 0;
+        countCoopSlices = 0;
+#endif
     }
 
-    void init();
-
-    void estimateCUCost(Lowres * *frames, ReferencePlanes * wfref0, int cux, int cuy, int p0, int p1, int b, bool bDoSearch[2]);
-};
-
-/* CostEstimate manages the cost estimation of a single frame, ie:
- * estimateFrameCost() and everything below it in the call graph */
-class CostEstimate : public WaveFront
-{
-public:
-    CostEstimate(ThreadPool *p);
-    ~CostEstimate();
-    void init(x265_param *, Frame *);
-
-    x265_param      *m_param;
-    EstimateRow     *m_rows;
-    pixel           *m_wbuffer[4];
-    Lowres         **m_curframes;
-
-    ReferencePlanes  m_weightedRef;
-    WeightParam      m_w;
+    void init(int w, int h, int n)
+    {
+        widthInCU = w;
+        heightInCU = h;
+        ncu = n;
+    }
 
-    int              m_paddedLines;     // number of lines in padded frame
-    int              m_widthInCU;       // width of lowres frame in downscale CUs
-    int              m_heightInCU;      // height of lowres frame in downscale CUs
+    ~LookaheadTLD() { X265_FREE(wbuffer[0]); }
 
-    bool             m_bDoSearch[2];
-    volatile bool    m_bFrameCompleted;
-    int              m_curb, m_curp0, m_curp1;
+    void calcAdaptiveQuantFrame(Frame *curFrame, x265_param* param);
+    void lowresIntraEstimate(Lowres& fenc);
 
-    void     processRow(int row, int threadId);
-    int64_t  estimateFrameCost(Lowres **frames, int p0, int p1, int b, bool bIntraPenalty);
+    void weightsAnalyse(Lowres& fenc, Lowres& ref);
 
 protected:
 
-    void     weightsAnalyse(Lowres **frames, int b, int p0);
-    uint32_t weightCostLuma(Lowres **frames, int b, int p0, WeightParam *w);
+    uint32_t acEnergyCu(Frame* curFrame, uint32_t blockX, uint32_t blockY, int csp);
+    uint32_t weightCostLuma(Lowres& fenc, Lowres& ref, WeightParam& wp);
+    bool     allocWeightedRef(Lowres& fenc);
 };
 
 class Lookahead : public JobProvider
 {
 public:
 
+    PicList       m_inputQueue;      // input pictures in order received
+    PicList       m_outputQueue;     // pictures to be encoded, in encode order
+    Lock          m_inputLock;
+    Lock          m_outputLock;
+
+    /* pre-lookahead */
+    Frame*        m_preframes[X265_LOOKAHEAD_MAX];
+    int           m_preTotal, m_preAcquired, m_preCompleted;
+    int           m_fullQueueSize;
+    bool          m_isActive;
+    bool          m_sliceTypeBusy;
+    bool          m_bAdaptiveQuant;
+    bool          m_outputSignalRequired;
+    bool          m_bBatchMotionSearch;
+    bool          m_bBatchFrameCosts;
+    Lock          m_preLookaheadLock;
+    Event         m_outputSignal;
+
+    LookaheadTLD* m_tld;
+    x265_param*   m_param;
+    Lowres*       m_lastNonB;
+    int*          m_scratch;         // temp buffer for cutree propagate
+    
+    int           m_histogram[X265_BFRAME_MAX + 1];
+    int           m_lastKeyframe;
+    int           m_8x8Width;
+    int           m_8x8Height;
+    int           m_8x8Blocks;
+    int           m_numCoopSlices;
+    int           m_numRowsPerSlice;
+    bool          m_filled;
+
     Lookahead(x265_param *param, ThreadPool *pool);
-    ~Lookahead();
-    void init();
-    void destroy();
 
-    CostEstimate     m_est;             // Frame cost estimator
-    PicList          m_inputQueue;      // input pictures in order received
-    PicList          m_outputQueue;     // pictures to be encoded, in encode order
+#if DETAILED_CU_STATS
+    int64_t       m_slicetypeDecideElapsedTime;
+    int64_t       m_preLookaheadElapsedTime;
+    uint64_t      m_countSlicetypeDecide;
+    uint64_t      m_countPreLookahead;
+    void          getWorkerStats(int64_t& batchElapsedTime, uint64_t& batchCount, int64_t& coopSliceElapsedTime, uint64_t& coopSliceCount);
+#endif
 
-    x265_param      *m_param;
-    Lowres          *m_lastNonB;
-    int             *m_scratch;         // temp buffer
+    bool    create();
+    void    destroy();
+    void    stop();
 
-    int              m_widthInCU;       // width of lowres frame in downscale CUs
-    int              m_heightInCU;      // height of lowres frame in downscale CUs
-    int              m_lastKeyframe;
-    int              m_histogram[X265_BFRAME_MAX + 1];
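
The `LOWRES_COST_MASK` and `LOWRES_COST_SHIFT` defines kept near the top of this header describe a 14-bit field. Their use is not visible in this diff; the sketch below assumes that a 16-bit word carries a cost clamped to 14 bits with two flag bits above it. `packCost`, `unpackCost`, `unpackFlags` and the `k...` constants are hypothetical names, and the layout is an assumption, not something this diff confirms.

```cpp
#include <cassert>
#include <cstdint>

const int kCostShift = 14;                    // same value as LOWRES_COST_SHIFT
const int kCostMask  = (1 << kCostShift) - 1; // same value as LOWRES_COST_MASK

// Assumed layout: bits 0..13 hold a clamped cost, bits 14..15 hold flags.
inline uint16_t packCost(int cost, int flags)
{
    assert(cost >= 0 && flags >= 0 && flags <= 3);
    int clamped = cost < kCostMask ? cost : kCostMask;
    return (uint16_t)(clamped | (flags << kCostShift));
}

inline int unpackCost(uint16_t packed)  { return packed & kCostMask; }
inline int unpackFlags(uint16_t packed) { return packed >> kCostShift; }
```

Under this assumption, costs above 16383 saturate at the mask value instead of spilling into the flag bits.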

 
+    Event         m_outputSignal;
+
+    LookaheadTLD* m_tld;
+    x265_param*   m_param;
+    Lowres*       m_lastNonB;
+    int*          m_scratch;         // temp buffer for cutree propagate
+    
+    int           m_histogram[X265_BFRAME_MAX + 1];
+    int           m_lastKeyframe;
+    int           m_8x8Width;
+    int           m_8x8Height;
+    int           m_8x8Blocks;
+    int           m_numCoopSlices;
+    int           m_numRowsPerSlice;
+    bool          m_filled;
+
     Lookahead(x265_param *param, ThreadPool *pool);
-    ~Lookahead();
-    void init();
-    void destroy();
 
-    CostEstimate     m_est;             // Frame cost estimator
-    PicList          m_inputQueue;      // input pictures in order received
-    PicList          m_outputQueue;     // pictures to be encoded, in encode order
+#if DETAILED_CU_STATS
+    int64_t       m_slicetypeDecideElapsedTime;
+    int64_t       m_preLookaheadElapsedTime;
+    uint64_t      m_countSlicetypeDecide;
+    uint64_t      m_countPreLookahead;
+    void          getWorkerStats(int64_t& batchElapsedTime, uint64_t& batchCount, int64_t& coopSliceElapsedTime, uint64_t& coopSliceCount);
+#endif
 
-    x265_param      *m_param;
-    Lowres          *m_lastNonB;
-    int             *m_scratch;         // temp buffer
+    bool    create();
+    void    destroy();
+    void    stop();
 
-    int              m_widthInCU;       // width of lowres frame in downscale CUs
-    int              m_heightInCU;      // height of lowres frame in downscale CUs
-    int              m_lastKeyframe;
-    int              m_histogram[X265_BFRAME_MAX + 1];
​

x265_1.5.tar.gz/source/encoder/weightPrediction.cpp -> x265_1.6.tar.gz/source/encoder/weightPrediction.cpp Changed

@@ -27,8 +27,8 @@
 #include "frame.h"
 #include "picyuv.h"
 #include "lowres.h"
+#include "slice.h"
 #include "mv.h"
-#include "slicetype.h"
 #include "bitstream.h"
 
 using namespace x265;
@@ -58,6 +58,7 @@
 void mcLuma(pixel* mcout, Lowres& ref, const MV * mvs)
 {
     intptr_t stride = ref.lumaStride;
+    const int mvshift = 1 << 2;
     const int cuSize = 8;
     MV mvmin, mvmax;
 
@@ -66,15 +67,15 @@
     for (int y = 0; y < ref.lines; y += cuSize)
     {
         intptr_t pixoff = y * stride;
-        mvmin.y = (int16_t)((-y - 8) << 2);
-        mvmax.y = (int16_t)((ref.lines - y - 1 + 8) << 2);
+        mvmin.y = (int16_t)((-y - 8) * mvshift);
+        mvmax.y = (int16_t)((ref.lines - y - 1 + 8) * mvshift);
 
         for (int x = 0; x < ref.width; x += cuSize, pixoff += cuSize, cu++)
         {
             ALIGN_VAR_16(pixel, buf8x8[8 * 8]);
             intptr_t bstride = 8;
-            mvmin.x = (int16_t)((-x - 8) << 2);
-            mvmax.x = (int16_t)((ref.width - x - 1 + 8) << 2);
+            mvmin.x = (int16_t)((-x - 8) * mvshift);
+            mvmax.x = (int16_t)((ref.width - x - 1 + 8) * mvshift);
 
             /* clip MV to available pixels */
             MV mv = mvs[cu];
@@ -100,6 +101,7 @@
     int csp = cache.csp;
     int bw = 16 >> cache.hshift;
     int bh = 16 >> cache.vshift;
+    const int mvshift = 1 << 2;
     MV mvmin, mvmax;
 
     for (int y = 0; y < height; y += bh)
@@ -109,8 +111,8 @@
          * into the lowres structures */
         int cu = y * cache.lowresWidthInCU;
         intptr_t pixoff = y * stride;
-        mvmin.y = (int16_t)((-y - 8) << 2);
-        mvmax.y = (int16_t)((height - y - 1 + 8) << 2);
+        mvmin.y = (int16_t)((-y - 8) * mvshift);
+        mvmax.y = (int16_t)((height - y - 1 + 8) * mvshift);
 
         for (int x = 0; x < width; x += bw, cu++, pixoff += bw)
         {
@@ -122,8 +124,8 @@
                 mv.y >>= cache.vshift;
 
                 /* clip MV to available pixels */
-                mvmin.x = (int16_t)((-x - 8) << 2);
-                mvmax.x = (int16_t)((width - x - 1 + 8) << 2);
+                mvmin.x = (int16_t)((-x - 8) * mvshift);
+                mvmax.x = (int16_t)((width - x - 1 + 8) * mvshift);
                 mv = mv.clipped(mvmin, mvmax);
 
                 intptr_t fpeloffset = (mv.y >> 2) * stride + (mv.x >> 2);
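
The hunks above replace `<< 2` with multiplication by a named `mvshift` constant when computing the quarter-pel clip bounds. The diff does not state a reason. One plausible motivation is that left-shifting a negative signed value, as in `(-y - 8) << 2`, is undefined behavior in C++ before C++20, whereas multiplication is well defined. The following is a minimal sketch of the same arithmetic; the helper names `mvLowerBound` and `mvUpperBound` are hypothetical and are not part of x265.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not x265 API) mirroring the bound arithmetic in the hunks above.
// Motion vectors are in quarter-pel units, so full-pel offsets are scaled by 4.
static const int mvshift = 1 << 2;

// Smallest MV component allowed at full-pel position `pos`, using the
// 8-pixel margin that appears in the diff.
inline int16_t mvLowerBound(int pos)
{
    return (int16_t)((-pos - 8) * mvshift);
}

// Largest MV component allowed at full-pel position `pos` in a plane of
// `extent` pixels, using the same 8-pixel margin.
inline int16_t mvUpperBound(int pos, int extent)
{
    return (int16_t)((extent - pos - 1 + 8) * mvshift);
}
```

For example, `mvLowerBound(0)` evaluates to -32 without ever shifting a negative operand.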

 

x265_1.5.tar.gz/source/input/y4m.cpp -> x265_1.6.tar.gz/source/input/y4m.cpp Changed

@@ -177,147 +177,118 @@
     int csp = 0;
     int d = 0;
 
-    while (!ifs->eof())
+    while (ifs->good())
     {
         // Skip Y4MPEG string
         int c = ifs->get();
-        while (!ifs->eof() && (c != ' ') && (c != '\n'))
-        {
+        while (ifs->good() && (c != ' ') && (c != '\n'))
             c = ifs->get();
-        }
 
-        while (c == ' ' && !ifs->eof())
+        while (c == ' ' && ifs->good())
         {
             // read parameter identifier
             switch (ifs->get())
             {
             case 'W':
                 width = 0;
-                while (!ifs->eof())
+                while (ifs->good())
                 {
                     c = ifs->get();
 
                     if (c == ' ' || c == '\n')
-                    {
                         break;
-                    }
                     else
-                    {
                         width = width * 10 + (c - '0');
-                    }
                 }
-
                 break;
 
             case 'H':
                 height = 0;
-                while (!ifs->eof())
+                while (ifs->good())
                 {
                     c = ifs->get();
                     if (c == ' ' || c == '\n')
-                    {
                         break;
-                    }
                     else
-                    {
                         height = height * 10 + (c - '0');
-                    }
                 }
-
                 break;
 
             case 'F':
                 rateNum = 0;
                 rateDenom = 0;
-                while (!ifs->eof())
+                while (ifs->good())
                 {
                     c = ifs->get();
                     if (c == '.')
                     {
                         rateDenom = 1;
-                        while (!ifs->eof())
+                        while (ifs->good())
                         {
                             c = ifs->get();
                             if (c == ' ' || c == '\n')
-                            {
                                 break;
-                            }
                             else
                             {
                                 rateNum = rateNum * 10 + (c - '0');
                                 rateDenom = rateDenom * 10;
                             }
                         }
-
                         break;
                     }
                     else if (c == ':')
                     {
-                        while (!ifs->eof())
+                        while (ifs->good())
                         {
                             c = ifs->get();
                             if (c == ' ' || c == '\n')
-                            {
                                 break;
-                            }
                             else
                                 rateDenom = rateDenom * 10 + (c - '0');
                         }
-
                         break;
                     }
                     else
-                    {
                         rateNum = rateNum * 10 + (c - '0');
-                    }
                 }
-
                 break;
 
             case 'A':
                 sarWidth = 0;
                 sarHeight = 0;
-                while (!ifs->eof())
+                while (ifs->good())
                 {
                     c = ifs->get();
                     if (c == ':')
                     {
-                        while (!ifs->eof())
+                        while (ifs->good())
                         {
                             c = ifs->get();
                             if (c == ' ' || c == '\n')
-                            {
                                 break;
-                            }
                             else
                                 sarHeight = sarHeight * 10 + (c - '0');
                         }
-
                         break;
                     }
                     else
-                    {
                         sarWidth = sarWidth * 10 + (c - '0');
-                    }
                 }
-
                 break;
 
             case 'C':
                 csp = 0;
                 d = 0;
-                while (!ifs->eof())
+                while (ifs->good())
                 {
                     c = ifs->get();
 
                     if (c <= '9' && c >= '0')
-                    {
                         csp = csp * 10 + (c - '0');
-                    }
                     else if (c == 'p')
                     {
                         // example: C420p16
-                        while (!ifs->eof())
+                        while (ifs->good())
                         {
                             c = ifs->get();
 
@@ -338,22 +309,19 @@
                 break;
 
             default:
-                while (!ifs->eof())
+                while (ifs->good())
                 {
                     // consume this unsupported configuration word
                     c = ifs->get();
                     if (c == ' ' || c == '\n')
                         break;
                 }
-
                 break;
             }
         }
 
         if (c == '\n')
-        {
             break;
-        }
     }
 
     if (width < MIN_FRAME_WIDTH || width > MAX_FRAME_WIDTH ||
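
The hunk above swaps every `!ifs->eof()` loop test for `ifs->good()`. The practical difference is that `eof()` reports only end-of-file, while `good()` is also false once `failbit` or `badbit` is set, so a loop guarded by `!eof()` can keep running on a stream that failed without reaching its end. The following is a minimal sketch of the digit-reading pattern used in the `W` and `H` cases; `parseDecimal` is a hypothetical helper, not x265 code, and unlike the diff it also re-checks the stream after `get()`.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical helper (not x265 code): reads decimal digits up to ' ' or '\n'.
inline int parseDecimal(std::istream& in)
{
    int value = 0;
    while (in.good())            // false on eofbit, failbit, or badbit
    {
        int c = in.get();
        // Extra check, not present in the diff: stop if get() itself failed,
        // so the EOF sentinel is never folded into the value.
        if (!in.good() || c == ' ' || c == '\n')
            break;
        value = value * 10 + (c - '0');
    }
    return value;
}
```

On a stream whose `failbit` is already set, `eof()` is still false but `good()` is false, so the loop body never runs and the helper returns 0.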

 

x265_1.5.tar.gz/source/output/y4m.cpp -> x265_1.6.tar.gz/source/output/y4m.cpp Changed

@@ -46,9 +46,7 @@
     }
 
     for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
-    {
         frameSize += (uint32_t)((width >> x265_cli_csps[colorSpace].width[i]) * (height >> x265_cli_csps[colorSpace].height[i]));
-    }
 }
 
 Y4MOutput::~Y4MOutput()
@@ -66,14 +64,10 @@
 
 #if HIGH_BIT_DEPTH
     if (pic.bitDepth > 8 && pic.poc == 0)
-    {
         x265_log(NULL, X265_LOG_WARNING, "y4m: down-shifting reconstructed pixels to 8 bits\n");
-    }
 #else
     if (pic.bitDepth > 8 && pic.poc == 0)
-    {
         x265_log(NULL, X265_LOG_WARNING, "y4m: forcing reconstructed pixels to 8 bits\n");
-    }
 #endif
 
     X265_CHECK(pic.colorSpace == colorSpace, "invalid color space\n");
@@ -89,9 +83,7 @@
         for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
         {
             for (int w = 0; w < width >> x265_cli_csps[colorSpace].width[i]; w++)
-            {
                 buf[w] = (char)(src[w] >> shift);
-            }
 
             ofs.write(buf, width >> x265_cli_csps[colorSpace].width[i]);
             src += pic.stride[i] / sizeof(*src);

 

x265_1.5.tar.gz/source/output/yuv.cpp -> x265_1.6.tar.gz/source/output/yuv.cpp Changed

 
@@ -39,9 +39,7 @@
     buf = new char[width];
 
     for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
-    {
         frameSize += (uint32_t)((width >> x265_cli_csps[colorSpace].width[i]) * (height >> x265_cli_csps[colorSpace].height[i]));
-    }
 }
 
 YUVOutput::~YUVOutput()
@@ -69,9 +67,7 @@
             for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
             {
                 for (int w = 0; w < width >> x265_cli_csps[colorSpace].width[i]; w++)
-                {
                     buf[w] = (char)(src[w] >> shift);
-                }
 
                 ofs.write(buf, width >> x265_cli_csps[colorSpace].width[i]);
                 src += pic.stride[i] / sizeof(*src);
​

x265_1.5.tar.gz/source/profile/cpuEvents.h -> x265_1.6.tar.gz/source/profile/cpuEvents.h Changed

 
@@ -5,6 +5,7 @@
 CPU_EVENT(filterCTURow)
 CPU_EVENT(slicetypeDecideEV)
 CPU_EVENT(prelookahead)
-CPU_EVENT(costEstimateRow)
+CPU_EVENT(estCostSingle)
+CPU_EVENT(estCostCoop)
 CPU_EVENT(pmode)
 CPU_EVENT(pme)
​

x265_1.5.tar.gz/source/test/CMakeLists.txt -> x265_1.6.tar.gz/source/test/CMakeLists.txt Changed

 
@@ -23,3 +23,6 @@
     ipfilterharness.cpp ipfilterharness.h
     intrapredharness.cpp intrapredharness.h)
 target_link_libraries(TestBench x265-static ${PLATFORM_LIBS})
+if(LINKER_OPTIONS)
+    set_target_properties(TestBench PROPERTIES LINK_FLAGS ${LINKER_OPTIONS})
+endif()
​

x265_1.5.tar.gz/source/test/ipfilterharness.cpp -> x265_1.6.tar.gz/source/test/ipfilterharness.cpp Changed

@@ -61,7 +61,7 @@
     }
 }
 
-bool IPFilterHarness::check_IPFilter_primitive(filter_p2s_t ref, filter_p2s_t opt, int isChroma, int csp)
+bool IPFilterHarness::check_IPFilter_primitive(filter_p2s_wxh_t ref, filter_p2s_wxh_t opt, int isChroma, int csp)
 {
     intptr_t rand_srcStride;
     int min_size = isChroma ? 2 : 4;
@@ -512,6 +512,46 @@
     return true;
 }
 
+bool IPFilterHarness::check_IPFilterLumaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt)
+{
+    for (int i = 0; i < ITERS; i++)
+    {
+        intptr_t rand_srcStride = rand() % 100;
+        int index = i % TEST_CASES;
+
+        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s);
+
+        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s);
+
+        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(pixel)))
+            return false;
+
+        reportfail();
+    }
+
+    return true;
+}
+
+bool IPFilterHarness::check_IPFilterChromaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt)
+{
+    for (int i = 0; i < ITERS; i++)
+    {
+        intptr_t rand_srcStride = rand() % 100;
+        int index = i % TEST_CASES;
+
+        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s);
+
+        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s);
+
+        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(pixel)))
+            return false;
+
+        reportfail();
+    }
+
+    return true;
+}
+
 bool IPFilterHarness::testCorrectness(const EncoderPrimitives& ref, const EncoderPrimitives& opt)
 {
     if (opt.luma_p2s)
@@ -582,6 +622,14 @@
                 return false;
             }
         }
+        if (opt.pu[value].filter_p2s)
+        {
+            if (!check_IPFilterLumaP2S_primitive(ref.pu[value].filter_p2s, opt.pu[value].filter_p2s))
+            {
+                printf("filter_p2s[%s]", lumaPartStr[value]);
+                return false;
+            }
+        }
     }
 
     for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++)
@@ -644,6 +692,14 @@
                     return false;
                 }
             }
+            if (opt.chroma[csp].pu[value].chroma_p2s)
+            {
+                if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].chroma_p2s, opt.chroma[csp].pu[value].chroma_p2s))
+                {
+                    printf("chroma_p2s[%s]", chromaPartStr[csp][value]);
+                    return false;
+                }
+            }
         }
     }
 
@@ -720,6 +776,13 @@
             REPORT_SPEEDUP(opt.pu[value].luma_hvpp, ref.pu[value].luma_hvpp,
                            pixel_buff + 3 * srcStride, srcStride, IPF_vec_output_p, srcStride, 1, 3);
         }
+
+        if (opt.pu[value].filter_p2s)
+        {
+            printf("filter_p2s [%s]\t", lumaPartStr[value]);
+            REPORT_SPEEDUP(opt.pu[value].filter_p2s, ref.pu[value].filter_p2s,
+                           pixel_buff, srcStride, IPF_vec_output_s);
+        }
     }
 
     for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++)
@@ -773,6 +836,14 @@
                                short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
                                IPF_vec_output_s, dstStride, 1);
             }
+
+            if (opt.chroma[csp].pu[value].chroma_p2s)
+            {
+                printf("chroma_p2s[%s]\t", chromaPartStr[csp][value]);
+                REPORT_SPEEDUP(opt.chroma[csp].pu[value].chroma_p2s, ref.chroma[csp].pu[value].chroma_p2s,
+                               pixel_buff, srcStride,
+                               IPF_vec_output_s);
+            }
         }
     }
 }

 

x265_1.5.tar.gz/source/test/ipfilterharness.h -> x265_1.6.tar.gz/source/test/ipfilterharness.h Changed

 
@@ -50,7 +50,7 @@
     pixel   pixel_test_buff[TEST_CASES][TEST_BUF_SIZE];
     int16_t short_test_buff[TEST_CASES][TEST_BUF_SIZE];
 
-    bool check_IPFilter_primitive(filter_p2s_t ref, filter_p2s_t opt, int isChroma, int csp);
+    bool check_IPFilter_primitive(filter_p2s_wxh_t ref, filter_p2s_wxh_t opt, int isChroma, int csp);
     bool check_IPFilterChroma_primitive(filter_pp_t ref, filter_pp_t opt);
     bool check_IPFilterChroma_ps_primitive(filter_ps_t ref, filter_ps_t opt);
     bool check_IPFilterChroma_hps_primitive(filter_hps_t ref, filter_hps_t opt);
@@ -62,6 +62,8 @@
     bool check_IPFilterLuma_sp_primitive(filter_sp_t ref, filter_sp_t opt);
     bool check_IPFilterLuma_ss_primitive(filter_ss_t ref, filter_ss_t opt);
     bool check_IPFilterLumaHV_primitive(filter_hv_pp_t ref, filter_hv_pp_t opt);
+    bool check_IPFilterLumaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt);
+    bool check_IPFilterChromaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt);
 
 public:
 
​

x265_1.5.tar.gz/source/test/mbdstharness.cpp -> x265_1.6.tar.gz/source/test/mbdstharness.cpp Changed

@@ -209,7 +209,7 @@
 
     for (int i = 0; i < ITERS; i++)
     {
-        int width = (rand() % 4 + 1) * 4;
+        int width = 1 << (rand() % 4 + 2);
         int height = width;
 
         uint32_t optReturnValue = 0;
@@ -278,42 +278,19 @@
 
     return true;
 }
-
 bool MBDstHarness::check_count_nonzero_primitive(count_nonzero_t ref, count_nonzero_t opt)
 {
-    ALIGN_VAR_32(int16_t, qcoeff[32 * 32]);
-
-    for (int i = 0; i < 4; i++)
+    int j = 0;
+    for (int i = 0; i < ITERS; i++)
     {
-        int log2TrSize = i + 2;
-        int num = 1 << (log2TrSize * 2);
-        int mask = num - 1;
-
-        for (int n = 0; n <= num; n++)
-        {
-            memset(qcoeff, 0, num * sizeof(int16_t));
-
-            for (int j = 0; j < n; j++)
-            {
-                int k = rand() & mask;
-                while (qcoeff[k])
-                {
-                    k = (k + 11) & mask;
-                }
-
-                qcoeff[k] = (int16_t)rand() - RAND_MAX / 2;
-            }
-
-            int refval = ref(qcoeff, num);
-            int optval = (int)checked(opt, qcoeff, num);
-
-            if (refval != optval)
-                return false;
-
-            reportfail();
-        }
+        int index = i % TEST_CASES;
+        int opt_cnt = (int)checked(opt, short_test_buff[index] + j);
+        int ref_cnt = ref(short_test_buff[index] + j);
+        if (ref_cnt != opt_cnt)
+            return false;
+        reportfail();
+        j += INCR;
     }
-
     return true;
 }
 
@@ -437,16 +414,17 @@
             return false;
         }
     }
-
-    if (opt.count_nonzero)
+    for (int i = 0; i < NUM_TR_SIZE; i++)
     {
-        if (!check_count_nonzero_primitive(ref.count_nonzero, opt.count_nonzero))
+        if (opt.cu[i].count_nonzero)
         {
-            printf("count_nonzero: Failed!\n");
-            return false;
+            if (!check_count_nonzero_primitive(ref.cu[i].count_nonzero, opt.cu[i].count_nonzero))
+            {
+                printf("count_nonzero[%dx%d] Failed!\n", 4 << i, 4 << i);
+                return false;
+            }
         }
     }
-
     if (opt.dequant_scaling)
     {
         if (!check_dequant_primitive(ref.dequant_scaling, opt.dequant_scaling))
@@ -523,16 +501,14 @@
         printf("nquant\t\t");
         REPORT_SPEEDUP(opt.nquant, ref.nquant, short_test_buff[0], int_test_buff[1], mshortbuf2, 23, 23785, 32 * 32);
     }
-
-    if (opt.count_nonzero)
+    for (int value = 0; value < NUM_TR_SIZE; value++)
     {
-        for (int i = 4; i <= 32; i <<= 1)
+        if (opt.cu[value].count_nonzero)
         {
-            printf("count_nonzero[%dx%d]", i, i);
-            REPORT_SPEEDUP(opt.count_nonzero, ref.count_nonzero, mbuf1, i * i)
+            printf("count_nonzero[%dx%d]", 4 << value, 4 << value);
+            REPORT_SPEEDUP(opt.cu[value].count_nonzero, ref.cu[value].count_nonzero, mbuf1);
         }
     }
-
     if (opt.denoiseDct)
     {
         printf("denoiseDct\t");
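Note on the first hunk above: the old width expression yields multiples of 4, while the new one yields powers of two, which matches the `4 << i` block sizes printed by the per-size `count_nonzero` loops. A minimal sketch of the arithmetic, with `r` standing in for `rand() % 4`; the helper names are hypothetical and not part of x265:

```cpp
#include <vector>

// Width produced by the 1.5 expression, (rand() % 4 + 1) * 4, for r = rand() % 4.
int oldWidth(int r) { return (r + 1) * 4; }

// Width produced by the 1.6 expression, 1 << (rand() % 4 + 2), for the same r.
int newWidth(int r) { return 1 << (r + 2); }

// Block edge length for per-size table index i, as printed via "4 << i".
int blockSize(int i) { return 4 << i; }

// Collect the four possible widths of either expression.
std::vector<int> allWidths(int (*expr)(int))
{
    std::vector<int> out;
    for (int r = 0; r < 4; r++)
        out.push_back(expr(r));
    return out;
}
```

`allWidths(oldWidth)` gives 4, 8, 12, 16 and `allWidths(newWidth)` gives 4, 8, 16, 32, so the new test no longer produces a 12-wide block and now covers 32.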

 

x265_1.5.tar.gz/source/test/pixelharness.cpp -> x265_1.6.tar.gz/source/test/pixelharness.cpp Changed

@@ -1149,6 +1149,71 @@
     return true;
 }
 
+bool PixelHarness::check_findPosLast(findPosLast_t ref, findPosLast_t opt)
+{
+    ALIGN_VAR_16(coeff_t, ref_src[32 * 32 + ITERS * 2]);
+    uint8_t ref_coeffNum[MLS_GRP_NUM], opt_coeffNum[MLS_GRP_NUM];      // value range[0, 16]
+    uint16_t ref_coeffSign[MLS_GRP_NUM], opt_coeffSign[MLS_GRP_NUM];    // bit mask map for non-zero coeff sign
+    uint16_t ref_coeffFlag[MLS_GRP_NUM], opt_coeffFlag[MLS_GRP_NUM];    // bit mask map for non-zero coeff
+
+    int totalCoeffs = 0;
+    for (int i = 0; i < 32 * 32; i++)
+    {
+        ref_src[i] = rand() & SHORT_MAX;
+        totalCoeffs += (ref_src[i] != 0);
+    }
+
+    // extra test area all of 0x1234
+    for (int i = 0; i < ITERS * 2; i++)
+    {
+        ref_src[32 * 32 + i] = 0x1234;
+    }
+    
+
+    memset(ref_coeffNum, 0xCD, sizeof(ref_coeffNum));
+    memset(ref_coeffSign, 0xCD, sizeof(ref_coeffSign));
+    memset(ref_coeffFlag, 0xCD, sizeof(ref_coeffFlag));
+
+    memset(opt_coeffNum, 0xCD, sizeof(opt_coeffNum));
+    memset(opt_coeffSign, 0xCD, sizeof(opt_coeffSign));
+    memset(opt_coeffFlag, 0xCD, sizeof(opt_coeffFlag));
+
+    for (int i = 0; i < ITERS; i++)
+    {
+        int rand_scan_type = rand() % NUM_SCAN_TYPE;
+        int rand_scan_size = rand() % NUM_SCAN_SIZE;
+        int rand_numCoeff = 0;
+
+        for (int j = 0; j < 1 << (2 * (rand_scan_size + 2)); j++)
+            rand_numCoeff += (ref_src[i + j] != 0);
+
+        const uint16_t* const scanTbl = g_scanOrder[rand_scan_type][rand_scan_size];
+
+        int ref_scanPos = ref(scanTbl, ref_src + i, ref_coeffSign, ref_coeffFlag, ref_coeffNum, rand_numCoeff);
+        int opt_scanPos = (int)checked(opt, scanTbl, ref_src + i, opt_coeffSign, opt_coeffFlag, opt_coeffNum, rand_numCoeff);
+
+        if (ref_scanPos != opt_scanPos)
+            return false;
+
+        for (int j = 0; rand_numCoeff; j++)
+        {
+            if (ref_coeffSign[j] != opt_coeffSign[j])
+                return false;
+
+            if (ref_coeffFlag[j] != opt_coeffFlag[j])
+                return false;
+
+            if (ref_coeffNum[j] != opt_coeffNum[j])
+                return false;
+
+            rand_numCoeff -= ref_coeffNum[j];
+        }
+
+        reportfail();
+    }
+
+    return true;
+}
 
 bool PixelHarness::testPU(int part, const EncoderPrimitives& ref, const EncoderPrimitives& opt)
 {
@@ -1299,6 +1364,14 @@
                 return false;
             }
         }
+        if (opt.chroma[i].pu[part].satd)
+        {
+            if (!check_pixelcmp(ref.chroma[i].pu[part].satd, opt.chroma[i].pu[part].satd))
+            {
+                printf("chroma_satd[%s][%s] failed!\n", x265_source_csp_names[i], chromaPartStr[i][part]);
+                return false;
+            }
+        }
         if (part < NUM_CU_SIZES)
         {
             if (opt.chroma[i].cu[part].sub_ps)
@@ -1467,7 +1540,7 @@
             {
                 if (!check_cpy2Dto1D_shl_t(ref.cu[i].cpy2Dto1D_shl, opt.cu[i].cpy2Dto1D_shl))
                 {
-                    printf("cpy2Dto1D_shl failed!\n");
+                    printf("cpy2Dto1D_shl[%dx%d] failed!\n", 4 << i, 4 << i);
                     return false;
                 }
             }
@@ -1645,6 +1718,15 @@
         }
     }
 
+    if (opt.findPosLast)
+    {
+        if (!check_findPosLast(ref.findPosLast, opt.findPosLast))
+        {
+            printf("findPosLast failed!\n");
+            return false;
+        }
+    }
+
     return true;
 }
 
@@ -1688,7 +1770,7 @@
     if (opt.pu[part].copy_pp)
     {
         HEADER("copy_pp[%s]", lumaPartStr[part]);
-        REPORT_SPEEDUP(opt.pu[part].copy_pp, ref.pu[part].copy_pp, pbuf1, 64, pbuf2, 128);
+        REPORT_SPEEDUP(opt.pu[part].copy_pp, ref.pu[part].copy_pp, pbuf1, 64, pbuf2, 64);
     }
 
     if (opt.pu[part].addAvg)
@@ -1723,7 +1805,7 @@
         if (opt.cu[part].copy_ss)
         {
             HEADER("copy_ss[%s]", lumaPartStr[part]);
-            REPORT_SPEEDUP(opt.cu[part].copy_ss, ref.cu[part].copy_ss, sbuf1, 64, sbuf2, 128);
+            REPORT_SPEEDUP(opt.cu[part].copy_ss, ref.cu[part].copy_ss, sbuf1, 128, sbuf2, 128);
         }
         if (opt.cu[part].copy_sp)
         {
@@ -1733,7 +1815,7 @@
         if (opt.cu[part].copy_ps)
         {
             HEADER("copy_ps[%s]", lumaPartStr[part]);
-            REPORT_SPEEDUP(opt.cu[part].copy_ps, ref.cu[part].copy_ps, sbuf1, 64, pbuf1, 128);
+            REPORT_SPEEDUP(opt.cu[part].copy_ps, ref.cu[part].copy_ps, sbuf1, 128, pbuf1, 64);
         }
     }
 
@@ -1749,6 +1831,11 @@
             HEADER("[%s]  addAvg[%s]", x265_source_csp_names[i], chromaPartStr[i][part]);
             REPORT_SPEEDUP(opt.chroma[i].pu[part].addAvg, ref.chroma[i].pu[part].addAvg, sbuf1, sbuf2, pbuf1, STRIDE, STRIDE, STRIDE);
         }
+        if (opt.chroma[i].pu[part].satd)
+        {
+            HEADER("[%s] satd[%s]", x265_source_csp_names[i], chromaPartStr[i][part]);
+            REPORT_SPEEDUP(opt.chroma[i].pu[part].satd, ref.chroma[i].pu[part].satd, pbuf1, STRIDE, fref, STRIDE);
+        }
         if (part < NUM_CU_SIZES)
         {
             if (opt.chroma[i].cu[part].copy_ss)
@@ -1990,4 +2077,13 @@
         HEADER0("propagateCost");
         REPORT_SPEEDUP(opt.propagateCost, ref.propagateCost, ibuf1, ushort_test_buff[0], int_test_buff[0], ushort_test_buff[0], int_test_buff[0], double_test_buff[0], 80);
     }
+
+    if (opt.findPosLast)
+    {
+        HEADER0("findPosLast");
+        coeff_t coefBuf[32 * 32];
+        memset(coefBuf, 0, sizeof(coefBuf));
+        memset(coefBuf + 32 * 31, 1, 32 * sizeof(coeff_t));
+        REPORT_SPEEDUP(opt.findPosLast, ref.findPosLast, g_scanOrder[SCAN_DIAG][NUM_SCAN_SIZE - 1], coefBuf, (uint16_t*)sbuf1, (uint16_t*)sbuf2, (uint8_t*)psbuf1, 32);
+    }
 }
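Note on the last hunk above: `memset` writes bytes, so `memset(coefBuf + 32 * 31, 1, ...)` does not store the value 1 in each coefficient. Assuming `coeff_t` is a 16-bit integer (an assumption; the typedef is not shown in this diff), each element of the last row becomes 0x0101. The row is still non-zero, which appears to be all the benchmark needs. A minimal sketch with hypothetical helper names:

```cpp
#include <cstdint>
#include <cstring>

typedef int16_t coeff_t; // assumption: 16-bit coefficients, typedef not shown in this diff

const int BLOCK = 32;

// Build the benchmark buffer the same way the hunk does: all zero,
// then the last row filled byte-wise with 0x01.
void fillBenchmarkBuffer(coeff_t* coefBuf)
{
    std::memset(coefBuf, 0, BLOCK * BLOCK * sizeof(coeff_t));
    std::memset(coefBuf + BLOCK * (BLOCK - 1), 1, BLOCK * sizeof(coeff_t));
}

// Value of the first element of the last row after the fill.
int lastRowValue()
{
    coeff_t buf[BLOCK * BLOCK];
    fillBenchmarkBuffer(buf);
    return buf[BLOCK * (BLOCK - 1)];
}

// Number of non-zero coefficients in the whole block after the fill.
int countNonZero()
{
    coeff_t buf[BLOCK * BLOCK];
    fillBenchmarkBuffer(buf);
    int n = 0;
    for (int i = 0; i < BLOCK * BLOCK; i++)
        n += (buf[i] != 0);
    return n;
}
```

Under the 16-bit assumption `lastRowValue()` is 257 (0x0101) and `countNonZero()` is 32, which is the `numCoeff` argument passed to `REPORT_SPEEDUP` in the hunk.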

 

x265_1.5.tar.gz/source/test/pixelharness.h -> x265_1.6.tar.gz/source/test/pixelharness.h Changed

 
@@ -104,6 +104,7 @@
     bool check_psyCost_pp(pixelcmp_t ref, pixelcmp_t opt);
     bool check_psyCost_ss(pixelcmp_ss_t ref, pixelcmp_ss_t opt);
     bool check_calSign(sign_t ref, sign_t opt);
+    bool check_findPosLast(findPosLast_t ref, findPosLast_t opt);
 
 public:
 

x265_1.6.tar.gz/source/test/rate-control-tests.txt Added

@@ -0,0 +1,34 @@
+# List of command lines to be run by rate control regression tests, see https://bitbucket.org/sborho/test-harness
+
+# This test is listed first since it currently reproduces bugs
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --pass 1 -F4,--preset medium --bitrate 1000 --pass 2 -F4
+
+# VBV tests, non-deterministic so testing for correctness and bitrate
+# fluctuations - up to 1% bitrate fluctuation is allowed between runs
+RaceHorses_416x240_30_10bit.yuv,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700
+RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --vbv-bufsize 600 --vbv-maxrate 600
+RaceHorses_416x240_30_10bit.yuv,--preset veryslow --bitrate 1100 --vbv-bufsize 1100 --vbv-maxrate 1200
+112_1920x1080_25.yuv,--preset medium --bitrate 1000 --vbv-maxrate 1500 --vbv-bufsize 1500 --aud
+112_1920x1080_25.yuv,--preset medium --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd
+112_1920x1080_25.yuv,--preset medium --bitrate 4000 --vbv-maxrate 12000 --vbv-bufsize 12000 --repeat-headers
+112_1920x1080_25.yuv,--preset superfast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1500 --hrd --strict-cbr
+112_1920x1080_25.yuv,--preset superfast --bitrate 30000 --vbv-maxrate 30000 --vbv-bufsize 30000 --repeat-headers
+112_1920x1080_25.yuv,--preset superfast --bitrate 4000 --vbv-maxrate 6000 --vbv-bufsize 6000 --aud
+112_1920x1080_25.yuv,--preset veryslow --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --repeat-headers
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --vbv-bufsize 3000 --vbv-maxrate 3000 --repeat-headers
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud
+big_buck_bunny_360p24.y4m,--preset medium --crf 1 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
+big_buck_bunny_360p24.y4m,--preset superfast --bitrate 1000 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud --strict-cbr
+big_buck_bunny_360p24.y4m,--preset superfast --bitrate 3000 --vbv-bufsize 9000 --vbv-maxrate 9000 --repeat-headers
+big_buck_bunny_360p24.y4m,--preset superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd
+big_buck_bunny_360p24.y4m,--preset superfast --crf 6 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud
+
+# multi-pass rate control tests
+big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1,--preset slow --bitrate 200 --pass 2
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4
+112_1920x1080_25.yuv,--preset slow --bitrate 1000 --pass 1 -F4,--preset slow --bitrate 1000 --pass 2 -F4
+112_1920x1080_25.yuv,--preset superfast --crf 12 --pass 1,--preset superfast --bitrate 4000 --pass 2 -F4
+RaceHorses_416x240_30_10bit.yuv,--preset veryslow --crf 40 --pass 1, --preset veryslow --bitrate 200 --pass 2 -F4
+RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 600 --pass 2 -F4
+RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --pass 1,--preset medium --bitrate 500 --pass 3 -F4,--preset medium --bitrate 500 --pass 2 -F4
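Note on the file format above: each non-comment line is a comma-separated record, with the input sequence first and one encoder command line per pass after it. The sketch below only illustrates that format and the 1% tolerance mentioned in the comments. It is not the test harness referenced by the URL, all names are hypothetical, and how the harness actually measures bitrate fluctuation is not shown in this diff.

```cpp
#include <cmath>
#include <sstream>
#include <string>
#include <vector>

struct TestCase
{
    std::string sequence;              // input video file name
    std::vector<std::string> commands; // one command line per encoder pass
};

// Strip leading and trailing whitespace.
static std::string trim(const std::string& s)
{
    const char* ws = " \t\r\n";
    size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos)
        return "";
    size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Parse one line; returns false for blank lines, '#' comments and
// records without any command. Assumes no command contains a comma.
bool parseTestLine(const std::string& line, TestCase& out)
{
    std::string t = trim(line);
    if (t.empty() || t[0] == '#')
        return false;

    out = TestCase();
    std::istringstream in(t);
    std::string field;
    bool first = true;
    while (std::getline(in, field, ','))
    {
        if (first)
        {
            out.sequence = trim(field);
            first = false;
        }
        else
            out.commands.push_back(trim(field));
    }
    return !out.commands.empty();
}

// True when cur deviates from ref by at most tol (1% by default).
bool bitrateWithinTolerance(double ref, double cur, double tol = 0.01)
{
    return std::fabs(cur - ref) <= tol * ref;
}
```

Trimming each field matters because one of the records above has a space after the comma that separates the two passes.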

 

x265_1.6.tar.gz/source/test/regression-tests.txt Added

@@ -0,0 +1,127 @@
+# List of command lines to be run by regression tests, see https://bitbucket.org/sborho/test-harness
+
+# the vast majority of the commands are tested for results matching the
+# most recent commit which was known to change outputs. The output
+# bitstream must be bit-exact or the test fails. If no golden outputs
+# are available the bitstream is validated (decoded) and then saved as a
+# new golden output
+
+# Note: --nr-intra, --nr-inter, and --bitrate (ABR) give different
+# outputs for different frame encoder counts. In order for outputs to be
+# consistent across many machines, you must force a certain -FN so it is
+# not auto-detected.
+
+BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190
+BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7
+BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao
+BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3
+BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0
+BasketballDrive_1920x1080_50.y4m,--preset superfast --psy-rd 1 --ctu 16 --no-wpp
+BasketballDrive_1920x1080_50.y4m,--preset ultrafast --signhide --colormatrix bt709
+BasketballDrive_1920x1080_50.y4m,--preset veryfast --tune zerolatency --no-temporal-mvp
+BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode
+Coastguard-4k.y4m,--preset medium --rdoq-level 1 --tune ssim --no-signhide --me umh
+Coastguard-4k.y4m,--preset slow --tune psnr --cbqpoffs -1 --crqpoffs 1
+Coastguard-4k.y4m,--preset superfast --tune grain --overscan=crop
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --aq-mode 0 --sar 2 --range full
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset faster --max-tu-size 4 --min-cu-size 32
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset medium --no-wpp --no-cutree --no-strong-intra-smoothing
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset slow --no-wpp --tune ssim --transfer smpte240m
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset slower --tune ssim --tune fastdecode
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryfast --temporal-layers --tune grain
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither --no-psy-rd
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset ultrafast --weightp --no-wpp --no-open-gop
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp
+DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq
+DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0
+DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4
+FourPeople_1280x720_60.y4m,--preset medium --qp 38 --no-psy-rd
+FourPeople_1280x720_60.y4m,--preset superfast --no-wpp --lookahead-slices 2
+Keiba_832x480_30.y4m,--preset medium --pmode --tune grain
+Keiba_832x480_30.y4m,--preset slower --fast-intra --nr-inter 500 -F4
+Keiba_832x480_30.y4m,--preset superfast --no-fast-intra --nr-intra 1000 -F4
+Kimono1_1920x1080_24_10bit_444.yuv,--preset medium --min-cu-size 32
+Kimono1_1920x1080_24_10bit_444.yuv,--preset superfast --weightb
+KristenAndSara_1280x720_60.y4m,--preset medium --no-cutree --max-tu-size 16
+KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8
+KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16
+KristenAndSara_1280x720_60.y4m,--preset ultrafast --strong-intra-smoothing
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr
+News-4k.y4m,--preset medium --tune ssim --no-sao
+News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0
+OldTownCross_1920x1080_50_10bit_422.yuv,--preset medium --no-weightp
+OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune fastdecode
+OldTownCross_1920x1080_50_10bit_422.yuv,--preset superfast --weightp
+ParkScene_1920x1080_24.y4m,--preset medium --qp 40 --rdpenalty 2 --tu-intra-depth 3
+ParkScene_1920x1080_24.y4m,--preset slower --no-weightp
+ParkScene_1920x1080_24_10bit_444.yuv,--preset superfast --weightp --lookahead-slices 4
+RaceHorses_416x240_30.y4m,--preset medium --tskip-fast --tskip
+RaceHorses_416x240_30.y4m,--preset slower --keyint -1 --rdoq-level 0
+RaceHorses_416x240_30.y4m,--preset superfast --no-cutree
+RaceHorses_416x240_30.y4m,--preset veryslow --tskip-fast --tskip
+RaceHorses_416x240_30_10bit.yuv,--preset fast --lookahead-slices 2 --b-intra
+RaceHorses_416x240_30_10bit.yuv,--preset faster --rdoq-level 0 --dither
+RaceHorses_416x240_30_10bit.yuv,--preset slow --tune grain
+RaceHorses_416x240_30_10bit.yuv,--preset ultrafast --tune psnr
+RaceHorses_416x240_30_10bit.yuv,--preset veryfast --weightb
+RaceHorses_416x240_30_10bit.yuv,--preset placebo
+SteamLocomotiveTrain_2560x1600_60_10bit_crop.yuv,--preset medium --dither
+big_buck_bunny_360p24.y4m,--preset faster --keyint 240 --min-keyint 60 --rc-lookahead 200
+big_buck_bunny_360p24.y4m,--preset medium --keyint 60 --min-keyint 48 --weightb
+big_buck_bunny_360p24.y4m,--preset slow --psy-rdoq 2.0 --rdoq-level 1 --no-b-intra
+big_buck_bunny_360p24.y4m,--preset superfast --psy-rdoq 2.0
+big_buck_bunny_360p24.y4m,--preset ultrafast --deblock=2
+big_buck_bunny_360p24.y4m,--preset veryfast --no-deblock
+city_4cif_60fps.y4m,--preset medium --crf 4 --cu-lossless --sao-non-deblock
+city_4cif_60fps.y4m,--preset superfast --rdpenalty 1 --tu-intra-depth 2
+city_4cif_60fps.y4m,--preset slower --scaling-list default
+city_4cif_60fps.y4m,--preset veryslow --rdpenalty 2 --sao-non-deblock --no-b-intra
+ducks_take_off_420_720p50.y4m,--preset fast --deblock 6 --bframes 16 --rc-lookahead 40
+ducks_take_off_420_720p50.y4m,--preset faster --qp 24 --deblock -6
+ducks_take_off_420_720p50.y4m,--preset medium --tskip --tskip-fast --constrained-intra
+ducks_take_off_420_720p50.y4m,--preset slow --scaling-list default --qp 40
+ducks_take_off_420_720p50.y4m,--preset ultrafast --constrained-intra --rd 1
+ducks_take_off_420_720p50.y4m,--preset veryslow --constrained-intra --bframes 2
+ducks_take_off_444_720p50.y4m,--preset medium --qp 38 --no-scenecut
+ducks_take_off_444_720p50.y4m,--preset superfast --weightp --rd 0
+ducks_take_off_444_720p50.y4m,--preset slower --psy-rd 1 --psy-rdoq 2.0 --rdoq-level 1
+mobile_calendar_422_ntsc.y4m,--preset medium --bitrate 500 -F4
+mobile_calendar_422_ntsc.y4m,--preset slower --tskip --tskip-fast
+mobile_calendar_422_ntsc.y4m,--preset superfast --weightp --rd 0
+mobile_calendar_422_ntsc.y4m,--preset veryslow --tskip
+old_town_cross_444_720p50.y4m,--preset faster --rd 1 --tune zero-latency
+old_town_cross_444_720p50.y4m,--preset medium --keyint -1 --no-weightp --ref 6
+old_town_cross_444_720p50.y4m,--preset slow --rdoq-level 1 --early-skip --ref 7 --no-b-pyramid
+old_town_cross_444_720p50.y4m,--preset slower --crf 4 --cu-lossless
+old_town_cross_444_720p50.y4m,--preset superfast --weightp --min-cu 16
+old_town_cross_444_720p50.y4m,--preset ultrafast --weightp --min-cu 32
+old_town_cross_444_720p50.y4m,--preset veryfast --qp 1 --tune ssim
+parkrun_ter_720p50.y4m,--preset medium --no-open-gop --sao-non-deblock --crf 4 --cu-lossless
+parkrun_ter_720p50.y4m,--preset slower --fast-intra --no-rect --tune grain
+silent_cif_420.y4m,--preset medium --me full --rect --amp
+silent_cif_420.y4m,--preset superfast --weightp --rect
+silent_cif_420.y4m,--preset placebo --ctu 32 --no-sao
+vtc1nw_422_ntsc.y4m,--preset medium --scaling-list default --ctu 16 --ref 5
+vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode
+vtc1nw_422_ntsc.y4m,--preset superfast --weightp --nr-intra 100 -F4
+washdc_422_ntsc.y4m,--preset faster --rdoq-level 1 --max-merge 5
+washdc_422_ntsc.y4m,--preset medium --no-weightp --max-tu-size 4
+washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2
+washdc_422_ntsc.y4m,--preset superfast --psy-rd 1 --tune zerolatency
+washdc_422_ntsc.y4m,--preset ultrafast --weightp --tu-intra-depth 4
+washdc_422_ntsc.y4m,--preset veryfast --tu-inter-depth 4
+washdc_422_ntsc.y4m,--preset veryslow --crf 4 --cu-lossless
+
+# interlace test, even though input YUV is not field separated
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --interlace bff
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset faster --interlace tff
+
+# vim: tw=200
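
Each entry in this test list is a sequence file name, a comma, and the x265 command-line options for that run; lines starting with # are comments. According to the header comment of the regression list, --nr-intra, --nr-inter and --bitrate give different outputs for different frame encoder counts, so those entries force a count with -FN. The sketch below checks that rule on lines taken from the file itself (without the leading + diff marker). The names LintResult, optionName and lintTestLine are hypothetical and are not part of x265 or of the test harness.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Hypothetical result codes for this sketch.
enum LintResult { NOT_AN_ENTRY, ENTRY_OK, MISSING_FRAME_THREADS };

// Option name without any "=value" suffix, e.g. "--preset=fast" -> "--preset".
static std::string optionName(const std::string& token)
{
    return token.substr(0, token.find('='));
}

// Classify one line of a test list ("sequence,options").
LintResult lintTestLine(const std::string& line)
{
    if (line.empty() || line[0] == '#')
        return NOT_AN_ENTRY;                    // blank line or comment
    std::string::size_type comma = line.find(',');
    if (comma == std::string::npos)
        return NOT_AN_ENTRY;                    // no sequence/options separator

    std::istringstream options(line.substr(comma + 1));
    std::string token;
    bool needsFixedCount = false;               // option whose output depends on -F
    bool hasFixedCount = false;                 // -FN, -F N or --frame-threads present
    while (options >> token)
    {
        const std::string name = optionName(token);
        if (name == "--nr-intra" || name == "--nr-inter" || name == "--bitrate")
            needsFixedCount = true;
        else if (name == "--frame-threads" || name == "-F")
            hasFixedCount = true;
        else if (token.size() > 2 && token.compare(0, 2, "-F") == 0 &&
                 std::isdigit(static_cast<unsigned char>(token[2])))
            hasFixedCount = true;               // short form with attached count, e.g. -F4
    }
    return (needsFixedCount && !hasFixedCount) ? MISSING_FRAME_THREADS : ENTRY_OK;
}
```

A checker like this only makes sense for lists whose outputs are compared bit-exactly; it is an illustration of the rule, not a tool shipped with the test harness.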

 

x265_1.6.tar.gz/source/test/smoke-tests.txt Added

@@ -0,0 +1,17 @@
+# List of command lines to be run by smoke tests, see https://bitbucket.org/sborho/test-harness
+
+big_buck_bunny_360p24.y4m,--preset=superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd --aud --repeat-headers
+big_buck_bunny_360p24.y4m,--preset=medium --bitrate 1000 -F4 --cu-lossless --scaling-list default
+big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --cu-stats --pme
+washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1
+washdc_422_ntsc.y4m,--preset=medium --qp 40 --nr-inter 400 -F4
+washdc_422_ntsc.y4m,--preset=veryslow --pmode --tskip --rdoq-level 0
+old_town_cross_444_720p50.y4m,--preset=ultrafast --weightp --keyint -1
+old_town_cross_444_720p50.y4m,--preset=fast --keyint 20 --min-cu-size 16
+old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode
+RaceHorses_416x240_30_10bit.yuv,--preset=veryfast --cu-stats --max-tu-size 8
+RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --constrained-intra --min-keyint 5 --keyint 10
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset=medium --max-tu-size 16
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=veryfast --min-cu 16
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=fast --weightb --interlace bff

 

x265_1.5.tar.gz/source/test/testbench.cpp -> x265_1.6.tar.gz/source/test/testbench.cpp Changed

@@ -174,7 +174,10 @@
     for (int i = 0; test_arch[i].flag; i++)
     {
         if (test_arch[i].flag & cpuid)
+        {
             printf("Testing primitives: %s\n", test_arch[i].name);
+            fflush(stdout);
+        }
         else
             continue;
 
@@ -188,6 +191,7 @@
                 continue;
             if (!harness[h]->testCorrectness(cprim, vecprim))
             {
+                fflush(stdout);
                 fprintf(stderr, "\nx265: intrinsic primitive has failed. Go and fix that Right Now!\n");
                 return -1;
             }
@@ -204,6 +208,7 @@
                 continue;
             if (!harness[h]->testCorrectness(cprim, asmprim))
             {
+                fflush(stdout);
                 fprintf(stderr, "\nx265: asm primitive has failed. Go and fix that Right Now!\n");
                 return -1;
             }
@@ -226,6 +231,7 @@
     memcpy(&primitives, &optprim, sizeof(EncoderPrimitives));
 
     printf("\nTest performance improvement with full optimizations\n");
+    fflush(stdout);
 
     for (size_t h = 0; h < sizeof(harness) / sizeof(TestHarness*); h++)
     {

 

x265_1.5.tar.gz/source/test/testharness.h -> x265_1.6.tar.gz/source/test/testharness.h Changed

 
@@ -158,7 +158,7 @@
                                     m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, \
                                     m_rand, m_rand, m_rand, m_rand, m_rand), /* max_args+6 */ \
         x265_checkasm_call_float((float(*)())func, &m_ok, 0, 0, 0, 0, __VA_ARGS__))
-#define reportfail() if (!m_ok) { fprintf(stderr, "stack clobber check failed at %s:%d", __FILE__, __LINE__); abort(); }
+#define reportfail() if (!m_ok) { fflush(stdout); fprintf(stderr, "stack clobber check failed at %s:%d", __FILE__, __LINE__); abort(); }
 #elif ARCH_X86
 #define checked(func, ...) x265_checkasm_call((intptr_t(*)())func, &m_ok, __VA_ARGS__);
 #define checked_float(func, ...) x265_checkasm_call_float((float(*)())func, &m_ok, __VA_ARGS__);
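
The testbench.cpp and testharness.h hunks add fflush(stdout) before each fprintf(stderr, ...). The diff does not state the reason; presumably it keeps buffered progress output ahead of the failure message when both streams are redirected to the same log. The sketch below reproduces that effect with two streams appending to one scratch file. The function name logOrder, the file name and the message text are illustrative assumptions, not x265 code.

```cpp
#include <cstdio>
#include <string>

// Writes a progress line through a fully buffered stream and an error line
// through an unbuffered stream, both appending to the same scratch file, and
// returns the resulting file contents. Returns an empty string if the scratch
// file cannot be opened.
std::string logOrder(bool flushBeforeError, const char* path = "flush_order_demo.log")
{
    std::remove(path);
    std::FILE* progress = std::fopen(path, "a");   // stands in for redirected stdout
    std::FILE* errors = std::fopen(path, "a");     // stands in for stderr
    if (!progress || !errors)
    {
        if (progress) std::fclose(progress);
        if (errors) std::fclose(errors);
        return std::string();
    }
    std::setvbuf(progress, NULL, _IOFBF, 4096);    // stdout is fully buffered when not a terminal
    std::setvbuf(errors, NULL, _IONBF, 0);         // stderr is typically unbuffered

    std::fputs("Testing primitives: AVX2\n", progress);
    if (flushBeforeError)
        std::fflush(progress);                     // the call added by the diff
    std::fputs("x265: asm primitive has failed.\n", errors);

    std::fclose(progress);                         // flushes anything still buffered
    std::fclose(errors);

    std::string text;
    if (std::FILE* in = std::fopen(path, "r"))
    {
        int c;
        while ((c = std::fgetc(in)) != EOF)
            text += static_cast<char>(c);
        std::fclose(in);
    }
    std::remove(path);
    return text;
}
```

Without the flush, the error line reaches the file first and the progress line only arrives when the buffered stream is closed, which is the ordering problem the added calls appear to address.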

x265_1.5.tar.gz/source/x265.cpp -> x265_1.6.tar.gz/source/x265.cpp Changed

 
@@ -147,6 +147,7 @@
 
     if (!bProgress || !frameNum || (prevUpdateTime && time - prevUpdateTime < UPDATE_INTERVAL))
         return;
+
     int64_t elapsed = time - startTime;
     double fps = elapsed > 0 ? frameNum * 1000000. / elapsed : 0;
     float bitrate = 0.008f * totalbytes * (param->fpsNum / param->fpsDenom) / ((float)frameNum);
@@ -158,9 +159,8 @@
                 eta / 3600, (eta / 60) % 60, eta % 60);
     }
     else
-    {
         sprintf(buf, "x265 %d frames: %.2f fps, %.2f kb/s", frameNum, fps, bitrate);
-    }
+
     fprintf(stderr, "%s  \r", buf + 5);
     SetConsoleTitle(buf);
     fflush(stderr); // needed in windows
@@ -530,7 +530,7 @@
     while (pic_in && !b_ctrl_c)
     {
         pic_orig.poc = inFrameCount;
-        if (cliopt.qpfile && !param->rc.bStatRead)
+        if (cliopt.qpfile)
         {
             if (!cliopt.parseQPFile(pic_orig))
             {

x265_1.5.tar.gz/source/x265.def.in -> x265_1.6.tar.gz/source/x265.def.in Changed

 
@@ -1,6 +1,5 @@
 EXPORTS
 x265_encoder_open_${X265_BUILD}
-x265_setup_primitives
 x265_param_default
 x265_param_default_preset
 x265_param_parse
@@ -20,3 +19,4 @@
 x265_encoder_log
 x265_encoder_close
 x265_cleanup
+x265_api_get_${X265_BUILD}

x265_1.5.tar.gz/source/x265.h -> x265_1.6.tar.gz/source/x265.h Changed


 
@@ -91,19 +91,31 @@
 /* Stores all analysis data for a single frame */
 typedef struct x265_analysis_data
 {
+    void*            interData;
+    void*            intraData;
     uint32_t         frameRecordSize;
-    int32_t          poc;
-    int32_t          sliceType;
+    uint32_t         poc;
+    uint32_t         sliceType;
     uint32_t         numCUsInFrame;
     uint32_t         numPartitions;
-    void*            interData;
-    void*            intraData;
 } x265_analysis_data;
 
 /* Used to pass pictures into the encoder, and to get picture data back out of
  * the encoder.  The input and output semantics are different */
 typedef struct x265_picture
 {
+    /* presentation time stamp: user-specified, returned on output */
+    int64_t pts;
+
+    /* display time stamp: ignored on input, copied from reordered pts. Returned
+     * on output */
+    int64_t dts;
+
+    /* force quantizer for != X265_QP_AUTO */
+    /* The value provided on input is returned with the same picture (POC) on
+     * output */
+    void*   userData;
+
     /* Must be specified on input pictures, the number of planes is determined
      * by the colorSpace value */
     void*   planes[3];
@@ -132,18 +144,8 @@
      * initialize this value to the internal color space */
     int     colorSpace;
 
-    /* presentation time stamp: user-specified, returned on output */
-    int64_t pts;
-
-    /* display time stamp: ignored on input, copied from reordered pts. Returned
-     * on output */
-    int64_t dts;
-
-    /* The value provided on input is returned with the same picture (POC) on
-     * output */
-    void*   userData;
-
-    /* force quantizer for != X265_QP_AUTO */
+    /* Force the slice base QP for this picture within the encoder. Set to 0
+     * to allow the encoder to determine base QP */
     int     forceqp;
 
     /* If param.analysisMode is X265_ANALYSIS_OFF this field is ignored on input
@@ -159,8 +161,6 @@
      * this data structure */
     x265_analysis_data analysisData;
 
-    /* new data members to this structure must be added to the end so that
-     * users of x265_picture_alloc/free() can be assured of future safety */
 } x265_picture;
 
 typedef enum
@@ -229,7 +229,11 @@
 #define X265_B_ADAPT_FAST       1
 #define X265_B_ADAPT_TRELLIS    2
 
+#define X265_REF_LIMIT_DEPTH    1
+#define X265_REF_LIMIT_CU       2
+
 #define X265_BFRAME_MAX         16
+#define X265_MAX_FRAME_THREADS  16
 
 #define X265_TYPE_AUTO          0x0000  /* Let x265 choose the right type */
 #define X265_TYPE_IDR           0x0001
@@ -237,13 +241,14 @@
 #define X265_TYPE_P             0x0003
 #define X265_TYPE_BREF          0x0004  /* Non-disposable B-frame */
 #define X265_TYPE_B             0x0005
+#define IS_X265_TYPE_I(x) ((x) == X265_TYPE_I || (x) == X265_TYPE_IDR)
+#define IS_X265_TYPE_B(x) ((x) == X265_TYPE_B || (x) == X265_TYPE_BREF)
+
 #define X265_QP_AUTO                 0
 
 #define X265_AQ_NONE                 0
 #define X265_AQ_VARIANCE             1
 #define X265_AQ_AUTO_VARIANCE        2
-#define IS_X265_TYPE_I(x) ((x) == X265_TYPE_I || (x) == X265_TYPE_IDR)
-#define IS_X265_TYPE_B(x) ((x) == X265_TYPE_B || (x) == X265_TYPE_BREF)
 
 /* NOTE! For this release only X265_CSP_I420 and X265_CSP_I444 are supported */
 
@@ -308,11 +313,9 @@
     double    elapsedEncodeTime;    /* wall time since encoder was opened */
     double    elapsedVideoTime;     /* encoded picture count / frame rate */
     double    bitrate;              /* accBits / elapsed video time */
+    uint64_t  accBits;              /* total bits output thus far */
     uint32_t  encodedPictureCount;  /* number of output pictures thus far */
     uint32_t  totalWPFrames;        /* number of uni-directional weighted frames used */
-    uint64_t  accBits;              /* total bits output thus far */
-
-    /* new statistic member variables must be added below this line */
 } x265_stats;
 
 /* String values accepted by x265_param_parse() (and CLI) for various parameters */
@@ -322,7 +325,8 @@
 static const char * const x265_fullrange_names[] = { "limited", "full", 0 };
 static const char * const x265_colorprim_names[] = { "", "bt709", "undef", "", "bt470m", "bt470bg", "smpte170m", "smpte240m", "film", "bt2020", 0 };
 static const char * const x265_transfer_names[] = { "", "bt709", "undef", "", "bt470m", "bt470bg", "smpte170m", "smpte240m", "linear", "log100",
-                                                    "log316", "iec61966-2-4", "bt1361e", "iec61966-2-1", "bt2020-10", "bt2020-12", 0 };
+                                                    "log316", "iec61966-2-4", "bt1361e", "iec61966-2-1", "bt2020-10", "bt2020-12",
+                                                    "smpte-st-2084", "smpte-st-428", 0 };
 static const char * const x265_colmatrix_names[] = { "GBR", "bt709", "undef", "", "fcc", "bt470bg", "smpte170m", "smpte240m",
                                                      "YCgCo", "bt2020nc", "bt2020c", 0 };
 static const char * const x265_sar_names[] = { "undef", "1:1", "12:11", "10:11", "16:11", "40:33", "24:11", "20:11",
@@ -334,9 +338,9 @@
  * If zones overlap, whichever comes later in the list takes precedence. */
 typedef struct x265_zone
 {
-    int startFrame, endFrame;   /* range of frame numbers */
-    int bForceQp;               /* whether to use qp vs bitrate factor */
-    int qp;
+    int   startFrame, endFrame; /* range of frame numbers */
+    int   bForceQp;             /* whether to use qp vs bitrate factor */
+    int   qp;
     float bitrateFactor;
 } x265_zone;
     
@@ -348,36 +352,77 @@
  * x265_param as an opaque data structure */
 typedef struct x265_param
 {
-    /*== Encoder Environment ==*/
-
     /* x265_param_default() will auto-detect this cpu capability bitmap.  it is
      * recommended to not change this value unless you know the cpu detection is
      * somehow flawed on your target hardware. The asm function tables are
      * process global, the first encoder configures them for all encoders */
     int       cpuid;
 
+    /*== Parallelism Features ==*/
+
+    /* Number of concurrently encoded frames between 1 and X265_MAX_FRAME_THREADS
+     * or 0 for auto-detection. By default x265 will use a number of frame
+     * threads empirically determined to be optimal for your CPU core count,
+     * between 2 and 6.  Using more than one frame thread causes motion search
+     * in the down direction to be clamped but otherwise encode behavior is
+     * unaffected. With CQP rate control the output bitstream is deterministic
+     * for all values of frameNumThreads greater than 1. All other forms of
+     * rate-control can be negatively impacted by increases to the number of
+     * frame threads because the extra concurrency adds uncertainty to the
+     * bitrate estimations. Frame parallelism is generally limited by the
+     * number of CU rows.
+     *
+     * When thread pools are used, each frame thread is assigned to a single
+     * pool and the frame thread itself is given the node affinity of its pool.
+     * But when no thread pools are used no node affinity is assigned. */
+    int       frameNumThreads;
+
+    /* Comma separated list of threads per NUMA node. If "none", then no worker
+     * pools are created and only frame parallelism is possible. If NULL or ""
+     * (default) x265 will use all available threads on each NUMA node.
+     *
+     * '+'  is a special value indicating all cores detected on the node
+     * '*'  is a special value indicating all cores detected on the node and all
+     *      remaining nodes.
+     * '-'  is a special value indicating no cores on the node, same as '0'
+     *
+     * example strings for a 4-node system:
+     *   ""        - default, unspecified, all numa nodes are used for thread pools
+     *   "*"       - same as default
+     *   "none"    - no thread pools are created, only frame parallelism possible
+     *   "-"       - same as "none"
+     *   "10"      - allocate one pool, using up to 10 cores on node 0
+     *   "-,+"     - allocate one pool, using all cores on node 1
+     *   "+,-,+"   - allocate two pools, using all cores on nodes 0 and 2
+     *   "+,-,+,-" - allocate two pools, using all cores on nodes 0 and 2
+     *   "-,*"     - allocate three pools, using all cores on nodes 1, 2 and 3
+     *   "8,8,8,8" - allocate four pools with up to 8 threads in each pool
+     *
+     * The total number of threads will be determined by the number of threads
+     * assigned to all nodes. The worker threads will each be given affinity for
+     * their node, they will not be allowed to migrate between nodes, but they
+     * will be allowed to move between CPU cores within their node.
+     *
+     * If the three pool features: bEnableWavefront, bDistributeModeAnalysis and
+     * bDistributeMotionEstimation are all disabled, then numaPools is ignored
+     * and no thread pools are created.
+     *
+     * If "none" is specified, then all three of the thread pool features are
+     * implicitly disabled.
+     *
+     * Multiple thread pools will be allocated for any NUMA node with more than
+     * 64 logical CPU cores. But any given thread pool will always use at most
+     * one NUMA node.
+     *
+     * Frame encoders are distributed between the available thread pools, and
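
The numaPools comment above (cut off where this diff view was truncated) describes a small grammar for the pool string. The following is a minimal sketch of how such a string could be turned into a per-node thread count. It is not the x265 implementation: sketch_parse_pools() and sketch_count_pools() are hypothetical names, the per-node core counts are assumed to be supplied by the caller, "up to N" is interpreted as capping N at the node's core count, and the rule about splitting nodes with more than 64 logical cores into several pools is ignored.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, NOT part of the x265 API.
 * Fills threads[i] with the worker thread count requested for NUMA node i.
 * coresPerNode[i] is the detected core count of node i (assumed known). */
static void sketch_parse_pools(const char *pools, const int *coresPerNode,
                               int numNodes, int *threads)
{
    assert(coresPerNode && threads && numNodes >= 0);

    for (int i = 0; i < numNodes; i++)
        threads[i] = 0;

    /* NULL or "" is the default: use every core on every node */
    if (!pools || !*pools)
    {
        for (int i = 0; i < numNodes; i++)
            threads[i] = coresPerNode[i];
        return;
    }
    /* "none": no thread pools at all */
    if (!strcmp(pools, "none"))
        return;

    const char *p = pools;
    int node = 0;
    while (node < numNodes && *p)
    {
        if (*p == '*')
        {
            /* this node and all remaining nodes get all of their cores */
            for (; node < numNodes; node++)
                threads[node] = coresPerNode[node];
            break;
        }
        else if (*p == '+')
            threads[node] = coresPerNode[node];
        else if (*p != '-') /* '-' means no cores, threads[node] is already 0 */
        {
            char *end;
            long n = strtol(p, &end, 10);
            if (end == p)
                break; /* not a number: stop parsing (assumed behavior) */
            if (n < 0)
                n = 0;
            /* assumption: "up to N" is capped at the node's core count */
            threads[node] = n < coresPerNode[node] ? (int)n : coresPerNode[node];
        }
        node++;
        /* advance to the entry after the next comma, if there is one */
        while (*p && *p != ',')
            p++;
        if (*p == ',')
            p++;
    }
}

/* One pool per node that received at least one thread (ignores the
 * more-than-64-cores rule mentioned in the comment above). */
static int sketch_count_pools(const int *threads, int numNodes)
{
    int pools = 0;
    for (int i = 0; i < numNodes; i++)
        if (threads[i] > 0)
            pools++;
    return pools;
}
```

With assumed core counts of 16, 8, 8 and 8 on a 4-node system, "-,*" gives 0, 8, 8, 8 (three pools), "+,-,+" gives 16, 0, 8, 0 (two pools) and "10" gives 10, 0, 0, 0 (one pool), which matches the example strings listed in the comment.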
​

x265_1.5.tar.gz/source/x265cli.h -> x265_1.6.tar.gz/source/x265cli.h Changed

@@ -37,7 +37,8 @@
     { "version",              no_argument, NULL, 'V' },
     { "asm",            required_argument, NULL, 0 },
     { "no-asm",               no_argument, NULL, 0 },
-    { "threads",        required_argument, NULL, 0 },
+    { "pools",          required_argument, NULL, 0 },
+    { "numa-pools",     required_argument, NULL, 0 },
     { "preset",         required_argument, NULL, 'p' },
     { "tune",           required_argument, NULL, 't' },
     { "frame-threads",  required_argument, NULL, 'F' },
@@ -71,6 +72,8 @@
     { "no-wpp",               no_argument, NULL, 0 },
     { "wpp",                  no_argument, NULL, 0 },
     { "ctu",            required_argument, NULL, 's' },
+    { "min-cu-size",    required_argument, NULL, 0 },
+    { "max-tu-size",    required_argument, NULL, 0 },
     { "tu-intra-depth", required_argument, NULL, 0 },
     { "tu-inter-depth", required_argument, NULL, 0 },
     { "me",             required_argument, NULL, 0 },
@@ -96,6 +99,8 @@
     { "no-cu-lossless",       no_argument, NULL, 0 },
     { "no-constrained-intra", no_argument, NULL, 0 },
     { "constrained-intra",    no_argument, NULL, 0 },
+    { "cip",                  no_argument, NULL, 0 },
+    { "no-cip",               no_argument, NULL, 0 },
     { "fast-intra",           no_argument, NULL, 0 },
     { "no-fast-intra",        no_argument, NULL, 0 },
     { "no-open-gop",          no_argument, NULL, 0 },
@@ -105,6 +110,7 @@
     { "scenecut",       required_argument, NULL, 0 },
     { "no-scenecut",          no_argument, NULL, 0 },
     { "rc-lookahead",   required_argument, NULL, 0 },
+    { "lookahead-slices", required_argument, NULL, 0 },
     { "bframes",        required_argument, NULL, 'b' },
     { "bframe-bias",    required_argument, NULL, 0 },
     { "b-adapt",        required_argument, NULL, 0 },
@@ -136,6 +142,8 @@
     { "cbqpoffs",       required_argument, NULL, 0 },
     { "crqpoffs",       required_argument, NULL, 0 },
     { "rd",             required_argument, NULL, 0 },
+    { "rdoq-level",     required_argument, NULL, 0 },
+    { "no-rdoq-level",        no_argument, NULL, 0 },
     { "psy-rd",         required_argument, NULL, 0 },
     { "psy-rdoq",       required_argument, NULL, 0 },
     { "no-psy-rd",            no_argument, NULL, 0 },
@@ -195,6 +203,8 @@
     { "analysis-mode",  required_argument, NULL, 0 },
     { "analysis-file",  required_argument, NULL, 0 },
     { "strict-cbr",           no_argument, NULL, 0 },
+    { "temporal-layers",      no_argument, NULL, 0 },
+    { "no-temporal-layers",   no_argument, NULL, 0 },
     { 0, 0, 0, 0 },
     { 0, 0, 0, 0 },
     { 0, 0, 0, 0 },
@@ -246,10 +256,11 @@
     H0("   --[no-]psnr                   Enable reporting PSNR metric scores. Default %s\n", OPT(param->bEnablePsnr));
     H0("\nProfile, Level, Tier:\n");
     H0("   --profile <string>            Enforce an encode profile: main, main10, mainstillpicture\n");
-    H0("   --level-idc <integer|float>   Force a minumum required decoder level (as '5.0' or '50')\n");
+    H0("   --level-idc <integer|float>   Force a minimum required decoder level (as '5.0' or '50')\n");
     H0("   --[no-]high-tier              If a decoder level is specified, this modifier selects High tier of that level\n");
     H0("\nThreading, performance:\n");
-    H0("   --threads <integer>           Number of threads for thread pool (0: detect CPU core count, default)\n");
+    H0("   --pools <integer,...>         Comma separated thread count per thread pool (pool per NUMA node)\n");
+    H0("                                 '-' implies no threads on node, '+' implies one thread per core on node\n");
     H0("-F/--frame-threads <integer>     Number of concurrently encoded frames. 0: auto-determined by core count\n");
     H0("   --[no-]wpp                    Enable Wavefront Parallel Processing. Default %s\n", OPT(param->bEnableWavefront));
     H0("   --[no-]pmode                  Parallel mode analysis. Default %s\n", OPT(param->bDistributeModeAnalysis));
@@ -262,14 +273,16 @@
     H0("                                 psnr, ssim, grain, zerolatency, fastdecode\n");
     H0("\nQuad-Tree size and depth:\n");
     H0("-s/--ctu <64|32|16>              Maximum CU size (WxH). Default %d\n", param->maxCUSize);
+    H0("   --min-cu-size <64|32|16|8>    Minimum CU size (WxH). Default %d\n", param->minCUSize);
+    H0("   --max-tu-size <32|16|8|4>     Maximum TU size (WxH). Default %d\n", param->maxTUSize);
     H0("   --tu-intra-depth <integer>    Max TU recursive depth for intra CUs. Default %d\n", param->tuQTMaxIntraDepth);
     H0("   --tu-inter-depth <integer>    Max TU recursive depth for inter CUs. Default %d\n", param->tuQTMaxInterDepth);
     H0("\nAnalysis:\n");
-    H0("   --rd <0..6>                   Level of RD in mode decision 0:least....6:full RDO. Default %d\n", param->rdLevel);
+    H0("   --rd <0..6>                   Level of RDO in mode decision 0:least....6:full RDO. Default %d\n", param->rdLevel);
     H0("   --[no-]psy-rd <0..2.0>        Strength of psycho-visual rate distortion optimization, 0 to disable. Default %.1f\n", param->psyRd);
-    H0("   --[no-]psy-rdoq <0..50.0>     Strength of psycho-visual optimization in quantization, 0 to disable. Default %.1f\n", param->psyRdoq);
+    H0("   --[no-]rdoq-level <0|1|2>     Level of RDO in quantization 0:none, 1:levels, 2:levels & coding groups. Default %d\n", param->rdoqLevel);
+    H0("   --[no-]psy-rdoq <0..50.0>     Strength of psycho-visual optimization in RDO quantization, 0 to disable. Default %.1f\n", param->psyRdoq);
     H0("   --[no-]early-skip             Enable early SKIP detection. Default %s\n", OPT(param->bEnableEarlySkip));
-    H1("   --[no-]fast-cbf               Enable early outs based on whether residual is coded. Default %s\n", OPT(param->bEnableCbfFastMode));
     H1("   --[no-]tskip-fast             Enable fast intra transform skipping. Default %s\n", OPT(param->bEnableTSkipFast));
     H1("   --nr-intra <integer>          An integer value in range of 0 to 2000, which denotes strength of noise reduction in intra CUs. Default 0\n");
     H1("   --nr-inter <integer>          An integer value in range of 0 to 2000, which denotes strength of noise reduction in inter CUs. Default 0\n");
@@ -300,6 +313,7 @@
     H0("   --no-scenecut                 Disable adaptive I-frame decision\n");
     H0("   --scenecut <integer>          How aggressively to insert extra I-frames. Default %d\n", param->scenecutThreshold);
     H0("   --rc-lookahead <integer>      Number of frames for frame-type lookahead (determines encoder latency) Default %d\n", param->lookaheadDepth);
+    H1("   --lookahead-slices <0..16>    Number of slices to use per lookahead cost estimate. Default %d\n", param->lookaheadSlices);
     H0("   --bframes <integer>           Maximum number of consecutive b-frames (now it only enables B GOP structure) Default %d\n", param->bframes);
     H1("   --bframe-bias <integer>       Bias towards B frame decisions. Default %d\n", param->bFrameBias);
     H0("   --b-adapt <0..2>              0 - none, 1 - fast, 2 - full (trellis) adaptive B frame scheduling. Default %d\n", param->bFrameAdaptive);
@@ -371,10 +385,11 @@
     H1("                                 smpte240m, GBR, YCgCo, bt2020nc, bt2020c. Default undef\n");
     H1("   --chromaloc <integer>         Specify chroma sample location (0 to 5). Default of %d\n", param->vui.chromaSampleLocTypeTopField);
     H0("\nBitstream options:\n");
+    H0("   --[no-]repeat-headers         Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders));
     H0("   --[no-]info                   Emit SEI identifying encoder and parameters. Default %s\n", OPT(param->bEmitInfoSEI));
-    H0("   --[no-]aud                    Emit access unit delimiters at the start of each access unit. Default %s\n", OPT(param->bEnableAccessUnitDelimiters));
     H0("   --[no-]hrd                    Enable HRD parameters signaling. Default %s\n", OPT(param->bEmitHRDSEI));
-    H0("   --[no-]repeat-headers         Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders));
+    H0("   --[no-]temporal-layers        Enable a temporal sublayer for unreferenced B frames. Default %s\n", OPT(param->bEnableTemporalSubLayers));
+    H0("   --[no-]aud                    Emit access unit delimiters at the start of each access unit. Default %s\n", OPT(param->bEnableAccessUnitDelimiters));
     H1("   --hash <integer>              Decoded Picture Hash SEI 0: disabled, 1: MD5, 2: CRC, 3: Checksum. Default %d\n", param->decodedPictureHashSEI);
     H1("\nReconstructed video options (debugging):\n");
     H1("-r/--recon <filename>            Reconstructed raw image YUV or Y4M output file name\n");
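
The option table in the x265cli.h diff above uses the struct option layout of getopt_long(), where entries with a val of 0 are identified by their table index rather than by a short option letter. The following is a minimal sketch of how a few of the new 1.6 entries could be consumed, assuming the table is passed to getopt_long(). The names sketch_opts, sketch_settings and sketch_parse() are hypothetical, only a subset of the table is reproduced, and treating "pools" and "numa-pools" as aliases is an assumption based on the two entries replacing "threads".

```c
#include <assert.h>
#include <getopt.h>
#include <stdlib.h>
#include <string.h>

/* Subset of the long option table shown in the diff (illustrative only) */
static const struct option sketch_opts[] =
{
    { "pools",              required_argument, NULL, 0 },
    { "numa-pools",         required_argument, NULL, 0 },
    { "frame-threads",      required_argument, NULL, 'F' },
    { "lookahead-slices",   required_argument, NULL, 0 },
    { "temporal-layers",    no_argument,       NULL, 0 },
    { "no-temporal-layers", no_argument,       NULL, 0 },
    { 0, 0, 0, 0 }
};

/* Hypothetical settings holder, not the real x265_param */
typedef struct sketch_settings
{
    const char *pools;
    int frameThreads;
    int lookaheadSlices;
    int temporalLayers;
} sketch_settings;

/* Returns 0 on success, -1 on an unknown option or missing argument */
static int sketch_parse(int argc, char **argv, sketch_settings *s)
{
    assert(s != NULL);
    optind = 1; /* restart scanning at the first argument */
    opterr = 0; /* let the caller report errors */

    for (;;)
    {
        int idx = -1;
        int c = getopt_long(argc, argv, "F:", sketch_opts, &idx);
        if (c == -1)
            break;
        if (c == '?')
            return -1;
        if (c == 'F')
        {
            /* reached through -F or --frame-threads */
            s->frameThreads = atoi(optarg);
            continue;
        }

        /* c == 0: a long-only option, identified by its table index */
        const char *name = sketch_opts[idx].name;
        if (!strcmp(name, "pools") || !strcmp(name, "numa-pools"))
            s->pools = optarg;
        else if (!strcmp(name, "lookahead-slices"))
            s->lookaheadSlices = atoi(optarg);
        else if (!strcmp(name, "temporal-layers"))
            s->temporalLayers = 1;
        else if (!strcmp(name, "no-temporal-layers"))
            s->temporalLayers = 0;
    }
    return 0;
}
```

Pairing each boolean option with an explicit "no-" entry, as the table does for "temporal-layers" and "cip", keeps both spellings visible to getopt_long() so that neither needs special prefix handling in the parser.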

 