Packman Build Service PMBS

Changes of Revision 10

x265.changes Changed

@@ -1,4 +1,45 @@
 -------------------------------------------------------------------
+Fri May 29 09:11:02 UTC 2015 - aloisio@gmx.com
+
+- soname bump to 59
+- Update to version 1.7
+  * large amount of assembly code optimizations
+  * some preliminary support for high dynamic range content
+  * improvements for multi-library support
+  * some new quality features
+    (full documentation at: http://x265.readthedocs.org/en/1.7)
+  * This release simplifies the multi-library support introduced
+    in version 1.6. Any libx265 can now forward API requests to
+    other installed libx265 libraries (by name) so applications
+    like ffmpeg and the x265 CLI can select between 8bit and 10bit
+    encodes at runtime without the need of a shim library or
+    library load path hacks. See --output-depth, and
+    http://x265.readthedocs.org/en/1.7/api.html#multi-library-interface
+  * For quality, x265 now allows you to configure the quantization
+    group size smaller than the CTU size (for finer grained AQ
+    adjustments). See --qg-size.
+  * x265 now supports limited mid-encode reconfigure via a new public
+    method: x265_encoder_reconfig()
+  * For HDR, x265 now supports signaling the SMPTE 2084 color transfer
+    function, the SMPTE 2086 mastering display color primaries, and the
+    content light levels. See --master-display, --max-cll
+  * x265 will no longer emit any non-conformant bitstreams unless
+    --allow-non-conformance is specified.
+  * The x265 CLI now supports a simple encode preview feature. See
+    --recon-y4m-exec.
+  * The AnnexB NAL headers can now be configured off, via x265_param.bAnnexB
+    This is not configurable via the CLI because it is a function of the
+    muxer being used, and the CLI only supports raw output files. See
+    --annexb
+  Misc:
+  * --lossless encodes are now signaled as level 8.5
+  * --profile now has a -P short option
+  * The regression scripts used by x265 are now public, and can be found at:
+    https://bitbucket.org/sborho/test-harness
+  * x265's cmake scripts now support PGO builds, the test-harness can be
+    used to drive the profile-guided build process.
+
+-------------------------------------------------------------------
 Tue Apr 28 20:08:06 UTC 2015 - aloisio@gmx.com
 
 - soname bumped to 51


x265.spec Changed

 
@@ -1,10 +1,10 @@
 # based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/
 
 Name:           x265
-%define soname  51
+%define soname  59
 %define libname lib%{name}
 %define libsoname %{libname}-%{soname}
-Version:        1.6
+Version:        1.7
 Release:        0
 License:        GPL-2.0+
 Summary:        A free h265/HEVC encoder - encoder binary

baselibs.conf Changed

 
@@ -1,1 +1,1 @@
-libx265-51
+libx265-59
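
Note on the soname bump: the macros in x265.spec appear to expand to the same string that baselibs.conf lists, which would explain why both files change together. The Python sketch below imitates only the `%{...}` substitution for the three macros shown in the spec hunk. The `expand` helper is hypothetical and is not how rpm itself evaluates macros.

```python
import re


def expand(template, macros):
    """Replace each %{key} in template with the value stored in macros."""
    return re.sub(r"%\{(\w+)\}", lambda m: macros[m.group(1)], template)


# Values copied from the x265.spec hunk above (revision 10).
macros = {"name": "x265", "soname": "59"}
macros["libname"] = expand("lib%{name}", macros)
macros["libsoname"] = expand("%{libname}-%{soname}", macros)

print(macros["libsoname"])
```

With the previous value of `soname` (51) the same expansion gives the string that the baselibs.conf hunk removes.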

x265_1.6.tar.gz/.hg_archival.txt -> x265_1.7.tar.gz/.hg_archival.txt Changed

 
@@ -1,4 +1,4 @@
 repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
-node: cbeb7d8a4880e4020c4545dd8e498432c3c6cad3
+node: 8425278def1edf0931dc33fc518e1950063e76b0
 branch: stable
-tag: 1.6
+tag: 1.7

x265_1.6.tar.gz/.hgtags -> x265_1.7.tar.gz/.hgtags Changed

 
@@ -14,3 +14,4 @@
 c1e4fc0162c14fdb84f5c3bd404fb28cfe10a17f 1.3
 5e604833c5aa605d0b6efbe5234492b5e7d8ac61 1.4
 9f0324125f53a12f766f6ed6f98f16e2f42337f4 1.5
+cbeb7d8a4880e4020c4545dd8e498432c3c6cad3 1.6

x265_1.6.tar.gz/doc/reST/api.rst -> x265_1.7.tar.gz/doc/reST/api.rst Changed

@@ -171,8 +171,26 @@
 	 *      how x265_encoder_open has changed the parameters.
 	 *      note that the data accessible through pointers in the returned param struct
 	 *      (e.g. filenames) should not be modified by the calling application. */
-	void x265_encoder_parameters(x265_encoder *, x265_param *);                                                                      
-
+	void x265_encoder_parameters(x265_encoder *, x265_param *);
+
+**x265_encoder_reconfig()** may be used to reconfigure encoder parameters mid-encode::
+
+	/* x265_encoder_reconfig:
+	 *       used to modify encoder parameters.
+	 *      various parameters from x265_param are copied.
+	 *      this takes effect immediately, on whichever frame is encoded next;
+	 *      returns 0 on success, negative on parameter validation error.
+	 *
+	 *      not all parameters can be changed; see the actual function for a
+	 *      detailed breakdown.  since not all parameters can be changed, moving
+	 *      from preset to preset may not always fully copy all relevant parameters,
+	 *      but should still work usably in practice. however, more so than for
+	 *      other presets, many of the speed shortcuts used in ultrafast cannot be
+	 *      switched out of; using reconfig to switch between ultrafast and other
+	 *      presets is not recommended without a more fine-grained breakdown of
+	 *      parameters to take this into account. */
+	int x265_encoder_reconfig(x265_encoder *, x265_param *);
+	
 Pictures
 ========
 
@@ -352,7 +370,7 @@
 Multi-library Interface
 =======================
 
-If your application might want to make a runtime selection between among
+If your application might want to make a runtime selection between
 a number of libx265 libraries (perhaps 8bpp and 16bpp), then you will
 want to use the multi-library interface.
 
@@ -370,13 +388,34 @@
      *   libx265 */
     const x265_api* x265_api_get(int bitDepth);
 
-The general idea is to request the API for the bitDepth you would prefer
-the encoder to use (8 or 10), and if that returns NULL you request the
-API for bitDepth=0, which returns the system default libx265.
-
 Note that using this multi-library API in your application is only the
-first step. Next your application must dynamically link to libx265 and
-then you must build and install a multi-lib configuration of libx265,
-which includes 8bpp and 16bpp builds of libx265 and a shim library which
-forwards x265_api_get() calls to the appropriate library using dynamic
-loading and binding.
+first step.
+
+Your application must link to one build of libx265 (statically or 
+dynamically) and this linked version of libx265 will support one 
+bit-depth (8 or 10 bits). 
+
+Your application must now request the API for the bitDepth you would 
+prefer the encoder to use (8 or 10). If the requested bitdepth is zero, 
+or if it matches the bitdepth of the system default libx265 (the 
+currently linked library), then this library will be used for encode.
+If you request a different bit-depth, the linked libx265 will attempt 
+to dynamically bind a shared library with a name appropriate for the 
+requested bit-depth:
+
+    8-bit:  libx265_main.dll
+    10-bit: libx265_main10.dll
+
+    (the shared library extension is obviously platform specific. On
+    Linux it is .so while on Mac it is .dylib)
+
+For example on Windows, one could package together an x265.exe
+statically linked against the 8bpp libx265 together with a
+libx265_main10.dll in the same folder, and this executable would be able
+to encode main and main10 bitstreams.
+
+On Linux, x265 packagers could install 8bpp static and shared libraries
+under the name libx265 (so all applications link against 8bpp libx265)
+and then also install libx265_main10.so (symlinked to its numbered solib).
+Thus applications which use x265_api_get() will be able to generate main
+or main10 bitstreams.
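
As a rough illustration of the lookup described in the new api.rst text, the following Python sketch maps a requested bit depth to the shared library name that the linked libx265 would try to bind. Only the file-name stems and extensions come from the documentation. The function name, the platform labels and the error raised for other depths are assumptions made for this sketch, and it does not call libx265.

```python
# File-name stems as listed in the api.rst text above.
LIBRARY_STEMS = {8: "libx265_main", 10: "libx265_main10"}

# Extensions as listed in the api.rst text; the keys are labels
# chosen for this sketch, not identifiers used by x265.
EXTENSIONS = {"windows": ".dll", "linux": ".so", "mac": ".dylib"}


def library_to_bind(requested_depth, linked_depth, platform):
    """Return the library name to bind, or None if the linked build is used."""
    # A request of zero, or one matching the linked build, needs no binding.
    if requested_depth == 0 or requested_depth == linked_depth:
        return None
    if requested_depth not in LIBRARY_STEMS:
        # Assumption of this sketch: other depths are rejected with an error.
        raise ValueError("unsupported bit depth: %r" % (requested_depth,))
    return LIBRARY_STEMS[requested_depth] + EXTENSIONS[platform]


print(library_to_bind(10, 8, "linux"))
```

This mirrors the Linux packaging case in the text: an application linked against the 8bpp libx265 that asks for 10 bits ends up looking for libx265_main10.so.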

 

x265_1.6.tar.gz/doc/reST/cli.rst -> x265_1.7.tar.gz/doc/reST/cli.rst Changed

@@ -159,6 +159,13 @@
 	handled implicitly.
 
 	One may also directly supply the CPU capability bitmap as an integer.
+	
+	Note that by specifying this option you are overriding x265's CPU
+	detection and it is possible to do this wrong. You can cause encoder
+	crashes by specifying SIMD architectures which are not supported on
+	your CPU.
+
+	Default: auto-detected SIMD architectures
 
 .. option:: --frame-threads, -F <integer>
 
@@ -171,7 +178,7 @@
 	Over-allocation of frame threads will not improve performance, it
 	will generally just increase memory use.
 
-	**Values:** any value between 8 and 16. Default is 0, auto-detect
+	**Values:** any value between 0 and 16. Default is 0, auto-detect
 
 .. option:: --pools <string>, --numa-pools <string>
 
@@ -201,11 +208,11 @@
 	their node, they will not be allowed to migrate between nodes, but they
 	will be allowed to move between CPU cores within their node.
 
-	If the three pool features: :option:`--wpp` :option:`--pmode` and
-	:option:`--pme` are all disabled, then :option:`--pools` is ignored
-	and no thread pools are created.
+	If the four pool features: :option:`--wpp`, :option:`--pmode`,
+	:option:`--pme` and :option:`--lookahead-slices` are all disabled,
+	then :option:`--pools` is ignored and no thread pools are created.
 
-	If "none" is specified, then all three of the thread pool features are
+	If "none" is specified, then all four of the thread pool features are
 	implicitly disabled.
 
 	Multiple thread pools will be allocated for any NUMA node with more than
@@ -217,9 +224,22 @@
 	:option:`--frame-threads`.  The pools are used for WPP and for
 	distributed analysis and motion search.
 
+	On Windows, the native APIs offer sufficient functionality to
+	discover the NUMA topology and enforce the thread affinity that
+	libx265 needs (so long as you have not chosen to target XP or
+	Vista), but on POSIX systems it relies on libnuma for this
+	functionality. If your target POSIX system is single socket, then
+	building without libnuma is a perfectly reasonable option, as it
+	will have no effect on the runtime behavior. On a multiple-socket
+	system, a POSIX build of libx265 without libnuma will be less work
+	efficient. See :ref:`thread pools <pools>` for more detail.
+
 	Default "", one thread is allocated per detected hardware thread
 	(logical CPU cores) and one thread pool per NUMA node.
 
+	Note that the string value will need to be escaped or quoted to
+	protect against shell expansion on many platforms
+
 .. option:: --wpp, --no-wpp
 
 	Enable Wavefront Parallel Processing. The encoder may begin encoding
@@ -399,10 +419,20 @@
 
 	**CLI ONLY**
 
+.. option:: --output-depth, -D 8|10
+
+	Bitdepth of output HEVC bitstream, which is also the internal bit
+	depth of the encoder. If the requested bit depth is not the bit
+	depth of the linked libx265, it will attempt to bind libx265_main
+	for an 8bit encoder, or libx265_main10 for a 10bit encoder, with the
+	same API version as the linked libx265.
+
+	**CLI ONLY**
+
 Profile, Level, Tier
 ====================
 
-.. option:: --profile <string>
+.. option:: --profile, -P <string>
 
 	Enforce the requirements of the specified profile, ensuring the
 	output stream will be decodable by a decoder which supports that
@@ -437,7 +467,7 @@
 	times 10, for example level **5.1** is specified as "5.1" or "51",
 	and level **5.0** is specified as "5.0" or "50".
 
-	Annex A levels: 1, 2, 2.1, 3, 3.1, 4, 4.1, 5, 5.1, 5.2, 6, 6.1, 6.2
+	Annex A levels: 1, 2, 2.1, 3, 3.1, 4, 4.1, 5, 5.1, 5.2, 6, 6.1, 6.2, 8.5
 
 .. option:: --high-tier, --no-high-tier
 
@@ -464,11 +494,22 @@
 	HEVC specification.  If x265 detects that the total reference count
 	is greater than 8, it will issue a warning that the resulting stream
 	is non-compliant and it signals the stream as profile NONE and level
-	NONE but still allows the encode to continue.  Compliant HEVC
+	NONE and will abort the encode unless
+	:option:`--allow-non-conformance` it specified.  Compliant HEVC
 	decoders may refuse to decode such streams.
 	
 	Default 3
 
+.. option:: --allow-non-conformance, --no-allow-non-conformance
+
+	Allow libx265 to generate a bitstream with profile and level NONE.
+	By default it will abort any encode which does not meet strict level
+	compliance. The two most likely causes for non-conformance are
+	:option:`--ctu` being too small, :option:`--ref` being too high,
+	or the bitrate or resolution being out of specification.
+
+	Default: disabled
+
 .. note::
 	:option:`--profile`, :option:`--level-idc`, and
 	:option:`--high-tier` are only intended for use when you are
@@ -476,7 +517,7 @@
 	limitations and must constrain the bitstream within those limits.
 	Specifying a profile or level may lower the encode quality
 	parameters to meet those requirements but it will never raise
-	them.
+	them. It may enable VBV constraints on a CRF encode.
 
 Mode decision / Analysis
 ========================
@@ -1111,6 +1152,14 @@
 
 	**Range of values:** 0.0 to 3.0
 
+.. option:: --qg-size <64|32|16>
+
+	Enable adaptive quantization for sub-CTUs. This parameter specifies 
+	the minimum CU size at which QP can be adjusted, ie. Quantization Group
+	size. Allowed range of values are 64, 32, 16 provided this falls within 
+	the inclusive range [maxCUSize, minCUSize]. Experimental.
+	Default: same as maxCUSize
+
 .. option:: --cutree, --no-cutree
 
 	Enable the use of lookahead's lowres motion vector fields to
@@ -1162,12 +1211,12 @@
 .. option:: --strict-cbr, --no-strict-cbr
 	
 	Enables stricter conditions to control bitrate deviance from the 
-	target bitrate in CBR mode. Bitrate adherence is prioritised
+	target bitrate in ABR mode. Bit rate adherence is prioritised
 	over quality. Rate tolerance is reduced to 50%. Default disabled.
 	
 	This option is for use-cases which require the final average bitrate 
-	to be within very strict limits of the target - preventing overshoots 
-	completely, and achieve bitrates within 5% of target bitrate, 
+	to be within very strict limits of the target; preventing overshoots, 
+	while keeping the bit rate within 5% of the target setting, 
 	especially in short segment encodes. Typically, the encoder stays 
 	conservative, waiting until there is enough feedback in terms of 
 	encoded frames to control QP. strict-cbr allows the encoder to be 
@@ -1209,7 +1258,7 @@
 	lookahead).  Default value is 0.6. Increasing it to 1 will
 	effectively generate CQP
 
-.. option:: --qstep <integer>
+.. option:: --qpstep <integer>
 
 	The maximum single adjustment in QP allowed to rate control. Default
 	4
@@ -1451,9 +1500,48 @@
 	specification for a description of these values. Default undefined
 	(not signaled)
 
+.. option:: --master-display <string>
+
+	SMPTE ST 2086 mastering display color volume SEI info, specified as
+	a string which is parsed when the stream header SEI are emitted. The
+	string format is "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)"
+	where %hu are unsigned 16bit integers and %u are unsigned 32bit
+	integers. The SEI includes X,Y display primaries for RGB channels,
+	white point X,Y and max,min luminance values. (HDR)
+
+	Example for P65D3 1000-nits:
+
+		G(13200,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)
+
+	Note that this string value will need to be escaped or quoted to
+	protect against shell expansion on many platforms. No default.
+
+.. option:: --max-cll <string>
+
+	Maximum content light level and maximum frame average light level as
+	required by the Consumer Electronics Association 861.3 specification.
+
+	Specified as a string which is parsed when the stream header SEI are
+	emitted. The string format is "%hu,%hu" where %hu are unsigned 16bit
+	integers. The first value is the max content light level (or 0 if no
+	maximum is indicated), the second value is the maximum picture
+	average light level (or 0). (HDR)
+
+	Note that this string value will need to be escaped or quoted to
+	protect against shell expansion on many platforms. No default.
+
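
The string formats documented for --master-display and --max-cll can be checked before they are passed to the CLI. The Python sketch below is one way to do that, under stated assumptions: it uses a strict regular expression rather than the scanf-style parsing that the format string suggests, the function names and the returned dictionary keys are invented for illustration, and the `1000,400` value used for --max-cll is a made-up example. `shlex.quote` shows the quoting that the documentation asks for on POSIX shells.

```python
import re
import shlex

U16_MAX = 0xFFFF        # upper bound for the %hu fields
U32_MAX = 0xFFFFFFFF    # upper bound for the %u fields

MASTER_DISPLAY_RE = re.compile(
    r"G\(([0-9]+),([0-9]+)\)B\(([0-9]+),([0-9]+)\)R\(([0-9]+),([0-9]+)\)"
    r"WP\(([0-9]+),([0-9]+)\)L\(([0-9]+),([0-9]+)\)"
)


def parse_master_display(text):
    """Split a --master-display string into named integer pairs."""
    match = MASTER_DISPLAY_RE.fullmatch(text)
    if match is None:
        raise ValueError("not in G(..)B(..)R(..)WP(..)L(..) form")
    v = [int(group) for group in match.groups()]
    # The first eight fields are 16-bit, the two luminance fields are 32-bit.
    if any(x > U16_MAX for x in v[:8]) or any(x > U32_MAX for x in v[8:]):
        raise ValueError("value out of range")
    return {
        "green": (v[0], v[1]),
        "blue": (v[2], v[3]),
        "red": (v[4], v[5]),
        "white_point": (v[6], v[7]),
        "luminance_max_min": (v[8], v[9]),
    }


def parse_max_cll(text):
    """Split a --max-cll string into (max content level, max average level)."""
    match = re.fullmatch(r"([0-9]+),([0-9]+)", text)
    if match is None:
        raise ValueError("not in 'max_cll,max_average' form")
    values = tuple(int(group) for group in match.groups())
    if any(x > U16_MAX for x in values):
        raise ValueError("value out of range")
    return values


# Example string taken from the --master-display documentation above.
example = "G(13200,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)"
print(parse_master_display(example)["luminance_max_min"])
print("--master-display", shlex.quote(example))
print("--max-cll", shlex.quote("1000,400"), parse_max_cll("1000,400"))
```

The parentheses in the master display string are what make quoting necessary in most shells. The comma-separated --max-cll value passes through `shlex.quote` unchanged.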

 

x265_1.6.tar.gz/doc/reST/threading.rst -> x265_1.7.tar.gz/doc/reST/threading.rst Changed

@@ -2,6 +2,8 @@
 Threading
 *********
 
+.. _pools:
+
 Thread Pools
 ============
 
@@ -31,6 +33,18 @@
 expected to drop that job so the worker thread may go back to the pool
 and find more work.
 
+On Windows, the native APIs offer sufficient functionality to discover
+the NUMA topology and enforce the thread affinity that libx265 needs (so
+long as you have not chosen to target XP or Vista), but on POSIX systems
+it relies on libnuma for this functionality. If your target POSIX system
+is single socket, then building without libnuma is a perfectly
+reasonable option, as it will have no effect on the runtime behavior. On
+a multiple-socket system, a POSIX build of libx265 without libnuma will
+be less work efficient, but will still function correctly. You lose the
+work isolation effect that keeps each frame encoder from only using the
+threads of a single socket and so you incur a heavier context switching
+cost.
+
 Wavefront Parallel Processing
 =============================
 
@@ -225,6 +239,7 @@
 lowres cost analysis to worker threads. It will use bonded task groups
 to perform batches of frame cost estimates, and it may optionally use
 bonded task groups to measure single frame cost estimates using slices.
+(see :option:`--lookahead-slices`)
 
 The function slicetypeDecide() itself is also be performed by a worker
 thread if your encoder has a thread pool, else it runs within the

 

x265_1.6.tar.gz/readme.rst -> x265_1.7.tar.gz/readme.rst Changed

 
@@ -3,7 +3,7 @@
 =================
 
 | **Read:** | Online `documentation <http://x265.readthedocs.org/en/default/>`_ | Developer `wiki <http://bitbucket.org/multicoreware/x265/wiki/>`_
-| **Download:** | `releases <http://bitbucket.org/multicoreware/x265/downloads/>`_ 
+| **Download:** | `releases <http://ftp.videolan.org/pub/videolan/x265/>`_ 
 | **Interact:** | #x265 on freenode.irc.net | `x265-devel@videolan.org <http://mailman.videolan.org/listinfo/x265-devel>`_ | `Report an issue <https://bitbucket.org/multicoreware/x265/issues?status=new&status=open>`_
 
 `x265 <https://www.videolan.org/developers/x265.html>`_ is an open
​

x265_1.6.tar.gz/source/CMakeLists.txt -> x265_1.7.tar.gz/source/CMakeLists.txt Changed

@@ -30,7 +30,7 @@
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
 
 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 51)
+set(X265_BUILD 59)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
@@ -65,15 +65,19 @@
     if(LIBRT)
         list(APPEND PLATFORM_LIBS rt)
     endif()
+    find_library(LIBDL dl)
+    if(LIBDL)
+        list(APPEND PLATFORM_LIBS dl)
+    endif()
     find_package(Numa)
     if(NUMA_FOUND)
-        list(APPEND CMAKE_REQUIRED_LIBRARIES ${NUMA_LIBRARY})
+        link_directories(${NUMA_LIBRARY_DIR})
+        list(APPEND CMAKE_REQUIRED_LIBRARIES numa)
         check_symbol_exists(numa_node_of_cpu numa.h NUMA_V2)
         if(NUMA_V2)
             add_definitions(-DHAVE_LIBNUMA)
             message(STATUS "libnuma found, building with support for NUMA nodes")
-            list(APPEND PLATFORM_LIBS ${NUMA_LIBRARY})
-            link_directories(${NUMA_LIBRARY_DIR})
+            list(APPEND PLATFORM_LIBS numa)
             include_directories(${NUMA_INCLUDE_DIR})
         endif()
     endif()
@@ -90,7 +94,7 @@
 if(CMAKE_GENERATOR STREQUAL "Xcode")
   set(XCODE 1)
 endif()
-if (APPLE)
+if(APPLE)
   add_definitions(-DMACOS)
 endif()
 
@@ -196,6 +200,7 @@
         add_definitions(-static)
         list(APPEND LINKER_OPTIONS "-static")
     endif(STATIC_LINK_CRT)
+    check_cxx_compiler_flag(-Wno-strict-overflow CC_HAS_NO_STRICT_OVERFLOW)
     check_cxx_compiler_flag(-Wno-narrowing CC_HAS_NO_NARROWING) 
     check_cxx_compiler_flag(-Wno-array-bounds CC_HAS_NO_ARRAY_BOUNDS) 
     if (CC_HAS_NO_ARRAY_BOUNDS)
@@ -291,7 +296,7 @@
     endif()
 endif(WARNINGS_AS_ERRORS)
 
-if (WIN32)
+if(WIN32)
     # Visual leak detector
     find_package(VLD QUIET)
     if(VLD_FOUND)
@@ -300,12 +305,15 @@
         list(APPEND PLATFORM_LIBS ${VLD_LIBRARIES})
         link_directories(${VLD_LIBRARY_DIRS})
     endif()
-    option(WINXP_SUPPORT "Make binaries compatible with Windows XP" OFF)
+    option(WINXP_SUPPORT "Make binaries compatible with Windows XP and Vista" OFF)
     if(WINXP_SUPPORT)
         # force use of workarounds for CONDITION_VARIABLE and atomic
         # intrinsics introduced after XP
-        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WINXP)
-    endif()
+        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WINXP -D_WIN32_WINNT_WIN7=0x0601)
+    else(WINXP_SUPPORT)
+        # default to targeting Windows 7 for the NUMA APIs
+        add_definitions(-D_WIN32_WINNT=_WIN32_WINNT_WIN7)
+    endif(WINXP_SUPPORT)
 endif()
 
 include(version) # determine X265_VERSION and X265_LATEST_TAG
@@ -462,8 +470,10 @@
 # Main CLI application
 option(ENABLE_CLI "Build standalone CLI application" ON)
 if(ENABLE_CLI)
-    file(GLOB InputFiles input/*.cpp input/*.h)
-    file(GLOB OutputFiles output/*.cpp output/*.h)
+    file(GLOB InputFiles input/input.cpp input/yuv.cpp input/y4m.cpp input/*.h)
+    file(GLOB OutputFiles output/output.cpp output/reconplay.cpp output/*.h
+                          output/yuv.cpp output/y4m.cpp # recon
+                          output/raw.cpp)               # muxers
     file(GLOB FilterFiles filters/*.cpp filters/*.h)
     source_group(input FILES ${InputFiles})
     source_group(output FILES ${OutputFiles})

 
​

x265_1.6.tar.gz/source/common/common.cpp -> x265_1.7.tar.gz/source/common/common.cpp Changed

 
@@ -100,11 +100,14 @@
     return (x265_exp2_lut[i & 63] + 256) << (i >> 6) >> 8;
 }
 
-void x265_log(const x265_param *param, int level, const char *fmt, ...)
+void general_log(const x265_param* param, const char* caller, int level, const char* fmt, ...)
 {
     if (param && level > param->logLevel)
         return;
-    const char *log_level;
+    const int bufferSize = 4096;
+    char buffer[bufferSize];
+    int p = 0;
+    const char* log_level;
     switch (level)
     {
     case X265_LOG_ERROR:
@@ -127,11 +130,13 @@
         break;
     }
 
-    fprintf(stderr, "x265 [%s]: ", log_level);
+    if (caller)
+        p += sprintf(buffer, "%-4s [%s]: ", caller, log_level);
     va_list arg;
     va_start(arg, fmt);
-    vfprintf(stderr, fmt, arg);
+    vsnprintf(buffer + p, bufferSize - p, fmt, arg);
     va_end(arg);
+    fputs(buffer, stderr);
 }
 
 double x265_ssim2dB(double ssim)
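The hunk above changes the logger from two separate writes to stderr into a single formatted buffer that is emitted with one fputs() call. A minimal sketch of that pattern follows. The function name formatLogLine and its string return value are hypothetical, chosen so the result can be inspected, and the prefix is written with snprintf here where the diff uses sprintf.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of x265): assembles the whole log line in one
// buffer, as the new general_log() does, but returns it instead of printing.
std::string formatLogLine(const char* caller, const char* levelName, const char* fmt, ...)
{
    const int bufferSize = 4096;
    char buffer[bufferSize];
    int p = 0;
    buffer[0] = '\0';

    // Optional "caller [level]: " prefix, left-justified to four characters.
    if (caller)
    {
        p = snprintf(buffer, bufferSize, "%-4s [%s]: ", caller, levelName);
        if (p < 0)
            p = 0;
        if (p >= bufferSize)
            p = bufferSize - 1;
    }

    // The message body is appended after the prefix and truncated if needed.
    va_list arg;
    va_start(arg, fmt);
    vsnprintf(buffer + p, bufferSize - p, fmt, arg);
    va_end(arg);

    return std::string(buffer);
}
```

Building the line first and writing it once probably keeps a message from being interleaved with output from other threads, though the diff itself does not state that motivation.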
​

x265_1.6.tar.gz/source/common/common.h -> x265_1.7.tar.gz/source/common/common.h Changed

 
@@ -413,7 +413,8 @@
 
 /* outside x265 namespace, but prefixed. defined in common.cpp */
 int64_t  x265_mdate(void);
-void     x265_log(const x265_param *param, int level, const char *fmt, ...);
+#define  x265_log(param, ...) general_log(param, "x265", __VA_ARGS__)
+void     general_log(const x265_param* param, const char* caller, int level, const char* fmt, ...);
 int      x265_exp2fix8(double x);
 
 double   x265_ssim2dB(double ssim);
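In this hunk x265_log stops being a function and becomes a variadic macro that injects the caller tag "x265" and forwards everything else to general_log(). The sketch below reproduces that shape with hypothetical names (DemoParam, demoGeneralLog, DEMO_LOG and the global g_lastMessage are all illustrative, not x265 identifiers).

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical stand-in for x265_param; only the log level matters here.
struct DemoParam { int logLevel; };

// Records the last accepted message so the macro expansion can be checked.
static std::string g_lastMessage;

// Hypothetical stand-in for general_log(): filters on level, then formats.
void demoGeneralLog(const DemoParam* param, const char* caller, int level, const char* fmt, ...)
{
    if (param && level > param->logLevel)
        return;

    char buffer[256];
    int p = snprintf(buffer, sizeof(buffer), "%s [%d]: ", caller, level);

    va_list arg;
    va_start(arg, fmt);
    vsnprintf(buffer + p, sizeof(buffer) - p, fmt, arg);
    va_end(arg);

    g_lastMessage = buffer;
}

// Same shape as the new x265_log macro: the caller tag is injected, while
// the level, format string and arguments travel through __VA_ARGS__.
#define DEMO_LOG(param, ...) demoGeneralLog(param, "demo", __VA_ARGS__)
```

Existing call sites keep compiling unchanged because they already pass a level and a format string, so __VA_ARGS__ is never empty.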
​

x265_1.6.tar.gz/source/common/constants.cpp -> x265_1.7.tar.gz/source/common/constants.cpp Changed

 
@@ -324,7 +324,7 @@
       4,  12, 20, 28,  5, 13, 21, 29,  6, 14, 22, 30,  7, 15, 23, 31, 36, 44, 52, 60, 37, 45, 53, 61, 38, 46, 54, 62, 39, 47, 55, 63 }
 };
 
-const uint16_t g_scan4x4[NUM_SCAN_TYPE][4 * 4] =
+ALIGN_VAR_16(const uint16_t, g_scan4x4[NUM_SCAN_TYPE][4 * 4]) =
 {
     { 0,  4,  1,  8,  5,  2, 12,  9,  6,  3, 13, 10,  7, 14, 11, 15 },
     { 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
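The hunk wraps the g_scan4x4 declaration in ALIGN_VAR_16, which requests 16-byte alignment for the table. The diff does not say why; a plausible reason, given the assembly work mentioned in the changelog, is to allow aligned SIMD loads. The sketch below shows the same effect using the standard C++11 alignas specifier instead of the project's macro. The name kScanDemo is hypothetical, and the table holds only the two rows visible in the hunk.

```cpp
#include <cassert>
#include <cstdint>

// Two scan orders for a 4x4 block, copied from the rows shown in the hunk.
// alignas(16) is the standard C++11 spelling; x265 uses its own macro.
alignas(16) static const uint16_t kScanDemo[2][16] =
{
    { 0,  4,  1,  8,  5,  2, 12,  9,  6,  3, 13, 10,  7, 14, 11, 15 },
    { 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
};

// Returns true when the pointer sits on a 16-byte boundary.
inline bool isAligned16(const void* ptr)
{
    return (reinterpret_cast<uintptr_t>(ptr) % 16) == 0;
}
```

Each row is 32 bytes long, so aligning the start of the table also aligns every row.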
​

x265_1.6.tar.gz/source/common/contexts.h -> x265_1.7.tar.gz/source/common/contexts.h Changed

 
@@ -106,6 +106,7 @@
 // private namespace
 
 extern const uint32_t g_entropyBits[128];
+extern const uint32_t g_entropyStateBits[128];
 extern const uint8_t g_nextState[128][2];
 
 #define sbacGetMps(S)            ((S) & 1)
​

x265_1.6.tar.gz/source/common/cudata.cpp -> x265_1.7.tar.gz/source/common/cudata.cpp Changed

@@ -298,7 +298,7 @@
 }
 
 // initialize Sub partition
-void CUData::initSubCU(const CUData& ctu, const CUGeom& cuGeom)
+void CUData::initSubCU(const CUData& ctu, const CUGeom& cuGeom, int qp)
 {
     m_absIdxInCTU   = cuGeom.absPartIdx;
     m_encData       = ctu.m_encData;
@@ -312,8 +312,8 @@
     m_cuAboveRight  = ctu.m_cuAboveRight;
     X265_CHECK(m_numPartitions == cuGeom.numPartitions, "initSubCU() size mismatch\n");
 
-    /* sequential memsets */
-    m_partSet((uint8_t*)m_qp, (uint8_t)ctu.m_qp[0]);
+    m_partSet((uint8_t*)m_qp, (uint8_t)qp);
+
     m_partSet(m_log2CUSize,   (uint8_t)cuGeom.log2CUSize);
     m_partSet(m_lumaIntraDir, (uint8_t)DC_IDX);
     m_partSet(m_tqBypass,     (uint8_t)m_encData->m_param->bLossless);
@@ -1830,6 +1830,10 @@
     }
 }
 
+/* Clip motion vector to within slightly padded boundary of picture (the
+ * MV may reference a block that is completely within the padded area).
+ * Note this function is unaware of how much of this picture is actually
+ * available for use (re: frame parallelism) */
 void CUData::clipMv(MV& outMV) const
 {
     const uint32_t mvshift = 2;
@@ -2027,6 +2031,7 @@
         uint32_t blockSize = 1 << log2CUSize;
         uint32_t sbWidth   = 1 << (g_log2Size[maxCUSize] - log2CUSize);
         int32_t lastLevelFlag = log2CUSize == g_log2Size[minCUSize];
+
         for (uint32_t sbY = 0; sbY < sbWidth; sbY++)
         {
             for (uint32_t sbX = 0; sbX < sbWidth; sbX++)

 
​

x265_1.6.tar.gz/source/common/cudata.h -> x265_1.7.tar.gz/source/common/cudata.h Changed

 
@@ -85,8 +85,8 @@
     uint32_t childOffset;   // offset of the first child CU from current CU
     uint32_t absPartIdx;    // Part index of this CU in terms of 4x4 blocks.
     uint32_t numPartitions; // Number of 4x4 blocks in the CU
-    uint32_t depth;         // depth of this CU relative from CTU
     uint32_t flags;         // CU flags.
+    uint32_t depth;         // depth of this CU relative from CTU
 };
 
 struct MVField
@@ -182,7 +182,7 @@
     static void calcCTUGeoms(uint32_t ctuWidth, uint32_t ctuHeight, uint32_t maxCUSize, uint32_t minCUSize, CUGeom cuDataArray[CUGeom::MAX_GEOMS]);
 
     void     initCTU(const Frame& frame, uint32_t cuAddr, int qp);
-    void     initSubCU(const CUData& ctu, const CUGeom& cuGeom);
+    void     initSubCU(const CUData& ctu, const CUGeom& cuGeom, int qp);
     void     initLosslessCU(const CUData& cu, const CUGeom& cuGeom);
 
     void     copyPartFrom(const CUData& cu, const CUGeom& childGeom, uint32_t subPartIdx);
​

x265_1.6.tar.gz/source/common/dct.cpp -> x265_1.7.tar.gz/source/common/dct.cpp Changed

@@ -752,7 +752,7 @@
     }
 }
 
-int findPosLast_c(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig)
+int scanPosLast_c(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* /*scanCG4x4*/, const int /*trSize*/)
 {
     memset(coeffNum, 0, MLS_GRP_NUM * sizeof(*coeffNum));
     memset(coeffFlag, 0, MLS_GRP_NUM * sizeof(*coeffFlag));
@@ -785,6 +785,37 @@
     return scanPosLast - 1;
 }
 
+uint32_t findPosFirstLast_c(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16])
+{
+    int n;
+
+    for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
+    {
+        const uint32_t idx = scanTbl[n];
+        const uint32_t idxY = idx / MLS_CG_SIZE;
+        const uint32_t idxX = idx % MLS_CG_SIZE;
+        if (dstCoeff[idxY * trSize + idxX])
+            break;
+    }
+
+    X265_CHECK(n >= 0, "non-zero coeff scan failuare!\n");
+
+    uint32_t lastNZPosInCG = (uint32_t)n;
+
+    for (n = 0;; n++)
+    {
+        const uint32_t idx = scanTbl[n];
+        const uint32_t idxY = idx / MLS_CG_SIZE;
+        const uint32_t idxX = idx % MLS_CG_SIZE;
+        if (dstCoeff[idxY * trSize + idxX])
+            break;
+    }
+
+    uint32_t firstNZPosInCG = (uint32_t)n;
+
+    return ((lastNZPosInCG << 16) | firstNZPosInCG);
+}
+
 }  // closing - anonymous file-static namespace
 
 namespace x265 {
@@ -817,6 +848,7 @@
     p.cu[BLOCK_16x16].copy_cnt = copy_count<16>;
     p.cu[BLOCK_32x32].copy_cnt = copy_count<32>;
 
-    p.findPosLast = findPosLast_c;
+    p.scanPosLast = scanPosLast_c;
+    p.findPosFirstLast = findPosFirstLast_c;
 }
 }
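The new findPosFirstLast_c primitive returns two scan positions packed into one 32-bit value: the last nonzero position in the upper 16 bits and the first in the lower 16 bits. The sketch below restates that logic in self-contained form and adds two hypothetical unpacking helpers (lastPos, firstPos). The constants 4 and 16 are assumed values standing in for MLS_CG_SIZE and SCAN_SET_SIZE.

```cpp
#include <cassert>
#include <cstdint>

static const int kCgSize = 4;        // assumed value of MLS_CG_SIZE
static const int kScanSetSize = 16;  // assumed value of SCAN_SET_SIZE

// True if the coefficient at scan index idx (within a 4x4 group stored
// with row stride trSize) is nonzero.
static bool nonZeroAt(const int16_t* coeff, intptr_t trSize, uint32_t idx)
{
    return coeff[(idx / kCgSize) * trSize + (idx % kCgSize)] != 0;
}

// Same search as findPosFirstLast_c. Precondition: the group contains at
// least one nonzero coefficient, otherwise the forward scan never stops.
uint32_t findFirstLast(const int16_t* coeff, intptr_t trSize, const uint16_t scanTbl[16])
{
    int last = kScanSetSize - 1;
    while (last >= 0 && !nonZeroAt(coeff, trSize, scanTbl[last]))
        --last;
    assert(last >= 0);

    int first = 0;
    while (!nonZeroAt(coeff, trSize, scanTbl[first]))
        ++first;

    return ((uint32_t)last << 16) | (uint32_t)first;
}

// Hypothetical helpers for splitting the packed result.
inline uint32_t lastPos(uint32_t packed)  { return packed >> 16; }
inline uint32_t firstPos(uint32_t packed) { return packed & 0xFFFF; }
```

Returning both positions in one integer lets an assembly implementation hand back its result in a single register, which is likely why the primitive is shaped this way.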

 
​

x265_1.6.tar.gz/source/common/frame.cpp -> x265_1.7.tar.gz/source/common/frame.cpp Changed

 
@@ -31,18 +31,21 @@
 Frame::Frame()
 {
     m_bChromaExtended = false;
+    m_lowresInit = false;
     m_reconRowCount.set(0);
     m_countRefEncoders = 0;
     m_encData = NULL;
     m_reconPic = NULL;
     m_next = NULL;
     m_prev = NULL;
+    m_param = NULL;
     memset(&m_lowres, 0, sizeof(m_lowres));
 }
 
 bool Frame::create(x265_param *param)
 {
     m_fencPic = new PicYuv;
+    m_param = param;
 
     return m_fencPic->create(param->sourceWidth, param->sourceHeight, param->internalCsp) &&
            m_lowres.create(m_fencPic, param->bframes, !!param->rc.aqMode);
​

x265_1.6.tar.gz/source/common/frame.h -> x265_1.7.tar.gz/source/common/frame.h Changed

 
@@ -56,6 +56,7 @@
     void*                  m_userData;           // user provided pointer passed in with this picture
 
     Lowres                 m_lowres;
+    bool                   m_lowresInit;         // lowres init complete (pre-analysis)
     bool                   m_bChromaExtended;    // orig chroma planes motion extended for weight analysis
 
     /* Frame Parallelism - notification between FrameEncoders of available motion reference rows */
@@ -64,7 +65,7 @@
 
     Frame*                 m_next;               // PicList doubly linked list pointers
     Frame*                 m_prev;
-
+    x265_param*            m_param;              // Points to the latest param set for the frame.
     x265_analysis_data     m_analysisData;
     Frame();
 
​

x265_1.6.tar.gz/source/common/framedata.h -> x265_1.7.tar.gz/source/common/framedata.h Changed

 
@@ -74,6 +74,7 @@
         uint32_t numEncodedCUs; /* ctuAddr of last encoded CTU in row */
         uint32_t encodedBits;   /* sum of 'totalBits' of encoded CTUs */
         uint32_t satdForVbv;    /* sum of lowres (estimated) costs for entire row */
+        uint32_t intraSatdForVbv; /* sum of lowres (estimated) intra costs for entire row */
         uint32_t diagSatd;
         uint32_t diagIntraSatd;
         double   diagQp;
​

x265_1.6.tar.gz/source/common/ipfilter.cpp -> x265_1.7.tar.gz/source/common/ipfilter.cpp Changed

@@ -34,27 +34,8 @@
 #endif
 
 namespace {
-template<int dstStride, int width, int height>
-void pixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst)
-{
-    int shift = IF_INTERNAL_PREC - X265_DEPTH;
-    int row, col;
-
-    for (row = 0; row < height; row++)
-    {
-        for (col = 0; col < width; col++)
-        {
-            int16_t val = src[col] << shift;
-            dst[col] = val - (int16_t)IF_INTERNAL_OFFS;
-        }
-
-        src += srcStride;
-        dst += dstStride;
-    }
-}
-
-template<int dstStride>
-void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height)
+template<int width, int height>
+void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride)
 {
     int shift = IF_INTERNAL_PREC - X265_DEPTH;
     int row, col;
@@ -398,7 +379,7 @@
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
     p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
-    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>; 
+    p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
 
 #define CHROMA_422(W, H) \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
@@ -407,7 +388,7 @@
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
     p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
-    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE / 2, W, H>; 
+    p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
 
 #define CHROMA_444(W, H) \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \
@@ -416,7 +397,7 @@
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>;  \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>;  \
     p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \
-    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].chroma_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>; 
+    p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>;
 
 #define LUMA(W, H) \
     p.pu[LUMA_ ## W ## x ## H].luma_hpp     = interp_horiz_pp_c<8, W, H>; \
@@ -426,7 +407,7 @@
     p.pu[LUMA_ ## W ## x ## H].luma_vsp     = interp_vert_sp_c<8, W, H>;  \
     p.pu[LUMA_ ## W ## x ## H].luma_vss     = interp_vert_ss_c<8, W, H>;  \
     p.pu[LUMA_ ## W ## x ## H].luma_hvpp    = interp_hv_pp_c<8, W, H>; \
-    p.pu[LUMA_ ## W ## x ## H].filter_p2s = pixelToShort_c<MAX_CU_SIZE, W, H>
+    p.pu[LUMA_ ## W ## x ## H].convert_p2s = filterPixelToShort_c<W, H>;
 
 void setupFilterPrimitives_c(EncoderPrimitives& p)
 {
@@ -482,6 +463,7 @@
 
     CHROMA_422(4, 8);
     CHROMA_422(4, 4);
+    CHROMA_422(2, 4);
     CHROMA_422(2, 8);
     CHROMA_422(8,  16);
     CHROMA_422(8,  8);
@@ -530,11 +512,6 @@
     CHROMA_444(48, 64);
     CHROMA_444(64, 16);
     CHROMA_444(16, 64);
-    p.luma_p2s = filterPixelToShort_c<MAX_CU_SIZE>;
-
-    p.chroma[X265_CSP_I444].p2s = filterPixelToShort_c<MAX_CU_SIZE>;
-    p.chroma[X265_CSP_I420].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
-    p.chroma[X265_CSP_I422].p2s = filterPixelToShort_c<MAX_CU_SIZE / 2>;
 
     p.extendRowBorder = extendCURowColBorder;
 }
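These hunks merge the two pixel-to-short helpers into one template whose block size is fixed at compile time while the destination stride becomes a runtime argument. The sketch below follows the loop body of the removed pixelToShort_c, since the new function's body is only partly visible in the diff. The 8-bit pixel type and the values 14 and 8192 are assumptions standing in for X265_DEPTH, IF_INTERNAL_PREC and IF_INTERNAL_OFFS.

```cpp
#include <cassert>
#include <cstdint>

typedef uint8_t pixel;                 // assumption: an 8-bit build
static const int kDepth = 8;           // stands in for X265_DEPTH
static const int kInternalPrec = 14;   // assumed value of IF_INTERNAL_PREC
static const int kInternalOffs = 1 << (kInternalPrec - 1); // assumed IF_INTERNAL_OFFS

// Width and height are template parameters, as in the new signature;
// both strides are passed at run time.
template<int width, int height>
void pixelToShort(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride)
{
    const int shift = kInternalPrec - kDepth;

    for (int row = 0; row < height; row++)
    {
        // Scale each sample up to the internal precision, then recentre it.
        for (int col = 0; col < width; col++)
            dst[col] = (int16_t)((src[col] << shift) - kInternalOffs);

        src += srcStride;
        dst += dstStride;
    }
}
```

With the stride no longer baked into the template, one instantiation per block size can serve luma and every chroma format, which matches the removal of the per-colorspace assignments at the end of setupFilterPrimitives_c.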

 
​

x265_1.6.tar.gz/source/common/loopfilter.cpp -> x265_1.7.tar.gz/source/common/loopfilter.cpp Changed

@@ -42,18 +42,23 @@
         dst[x] = signOf(src1[x] - src2[x]);
 }
 
-void processSaoCUE0(pixel * rec, int8_t * offsetEo, int width, int8_t signLeft)
+void processSaoCUE0(pixel * rec, int8_t * offsetEo, int width, int8_t* signLeft, intptr_t stride)
 {
-    int x;
-    int8_t signRight;
+    int x, y;
+    int8_t signRight, signLeft0;
     int8_t edgeType;
 
-    for (x = 0; x < width; x++)
+    for (y = 0; y < 2; y++)
     {
-        signRight = ((rec[x] - rec[x + 1]) < 0) ? -1 : ((rec[x] - rec[x + 1]) > 0) ? 1 : 0;
-        edgeType = signRight + signLeft + 2;
-        signLeft  = -signRight;
-        rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        signLeft0 = signLeft[y];
+        for (x = 0; x < width; x++)
+        {
+            signRight = ((rec[x] - rec[x + 1]) < 0) ? -1 : ((rec[x] - rec[x + 1]) > 0) ? 1 : 0;
+            edgeType = signRight + signLeft0 + 2;
+            signLeft0 = -signRight;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+        rec += stride;
     }
 }
 
@@ -72,6 +77,25 @@
     }
 }
 
+void processSaoCUE1_2Rows(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width)
+{
+    int x, y;
+    int8_t signDown;
+    int edgeType;
+
+    for (y = 0; y < 2; y++)
+    {
+        for (x = 0; x < width; x++)
+        {
+            signDown = signOf(rec[x] - rec[x + stride]);
+            edgeType = signDown + upBuff1[x] + 2;
+            upBuff1[x] = -signDown;
+            rec[x] = x265_clip(rec[x] + offsetEo[edgeType]);
+        }
+        rec += stride;
+    }
+}
+
 void processSaoCUE2(pixel * rec, int8_t * bufft, int8_t * buff1, int8_t * offsetEo, int width, intptr_t stride)
 {
     int x;
@@ -119,8 +143,11 @@
 {
     p.saoCuOrgE0 = processSaoCUE0;
     p.saoCuOrgE1 = processSaoCUE1;
-    p.saoCuOrgE2 = processSaoCUE2;
-    p.saoCuOrgE3 = processSaoCUE3;
+    p.saoCuOrgE1_2Rows = processSaoCUE1_2Rows;
+    p.saoCuOrgE2[0] = processSaoCUE2;
+    p.saoCuOrgE2[1] = processSaoCUE2;
+    p.saoCuOrgE3[0] = processSaoCUE3;
+    p.saoCuOrgE3[1] = processSaoCUE3;
     p.saoCuOrgB0 = processSaoCUB0;
     p.sign = calSign;
 }
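The reworked processSaoCUE0 handles two rows per call and takes one left-neighbour sign per row instead of a single value. The sketch below mirrors that loop in self-contained form. The function name saoEdge0TwoRows is hypothetical, and the 8-bit pixel type with clipping to 0..255 is an assumption standing in for x265_clip.

```cpp
#include <cassert>
#include <cstdint>

typedef uint8_t pixel; // assumption: 8-bit samples

static inline int signOf(int x) { return (x > 0) - (x < 0); }

// Stand-in for x265_clip on an 8-bit build.
static inline pixel clipPixel(int v)
{
    return (pixel)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Horizontal edge offset over two rows. Each row reads width + 1 samples,
// because the right neighbour of the last pixel is needed.
void saoEdge0TwoRows(pixel* rec, const int8_t* offsetEo, int width,
                     const int8_t* signLeft, intptr_t stride)
{
    for (int y = 0; y < 2; y++)
    {
        int left = signLeft[y]; // sign against the pixel left of this row

        for (int x = 0; x < width; x++)
        {
            // The right neighbour is still unfiltered at this point.
            int right = signOf(rec[x] - rec[x + 1]);
            int edgeType = right + left + 2; // range 0..4
            left = -right;                   // becomes the next pixel's left sign
            rec[x] = clipPixel(rec[x] + offsetEo[edgeType]);
        }
        rec += stride;
    }
}
```

Because each comparison uses the right-hand sample before it is modified, the carried sign always reflects the original pixel values.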

 
​

x265_1.6.tar.gz/source/common/param.cpp -> x265_1.7.tar.gz/source/common/param.cpp Changed

@@ -87,7 +87,7 @@
 extern "C"
 void x265_param_free(x265_param* p)
 {
-    return x265_free(p);
+    x265_free(p);
 }
 
 extern "C"
@@ -117,6 +117,7 @@
     param->levelIdc = 0;
     param->bHighTier = 0;
     param->interlaceMode = 0;
+    param->bAnnexB = 1;
     param->bRepeatHeaders = 0;
     param->bEnableAccessUnitDelimiters = 0;
     param->bEmitHRDSEI = 0;
@@ -209,6 +210,7 @@
     param->rc.zones = NULL;
     param->rc.bEnableSlowFirstPass = 0;
     param->rc.bStrictCbr = 0;
+    param->rc.qgSize = 64; /* Same as maxCUSize */
 
     /* Video Usability Information (VUI) */
     param->vui.aspectRatioIdc = 0;
@@ -263,6 +265,7 @@
             param->rc.aqStrength = 0.0;
             param->rc.aqMode = X265_AQ_NONE;
             param->rc.cuTree = 0;
+            param->rc.qgSize = 32;
             param->bEnableFastIntra = 1;
         }
         else if (!strcmp(preset, "superfast"))
@@ -279,6 +282,7 @@
             param->rc.aqStrength = 0.0;
             param->rc.aqMode = X265_AQ_NONE;
             param->rc.cuTree = 0;
+            param->rc.qgSize = 32;
             param->bEnableSAO = 0;
             param->bEnableFastIntra = 1;
         }
@@ -292,6 +296,7 @@
             param->rdLevel = 2;
             param->maxNumReferences = 1;
             param->rc.cuTree = 0;
+            param->rc.qgSize = 32;
             param->bEnableFastIntra = 1;
         }
         else if (!strcmp(preset, "faster"))
@@ -565,6 +570,7 @@
             p->levelIdc = atoi(value);
     }
     OPT("high-tier") p->bHighTier = atobool(value);
+    OPT("allow-non-conformance") p->bAllowNonConformance = atobool(value);
     OPT2("log-level", "log")
     {
         p->logLevel = atoi(value);
@@ -575,6 +581,7 @@
         }
     }
     OPT("cu-stats") p->bLogCuStats = atobool(value);
+    OPT("annexb") p->bAnnexB = atobool(value);
     OPT("repeat-headers") p->bRepeatHeaders = atobool(value);
     OPT("wpp") p->bEnableWavefront = atobool(value);
     OPT("ctu") p->maxCUSize = (uint32_t)atoi(value);
@@ -843,6 +850,9 @@
     OPT2("pools", "numa-pools") p->numaPools = strdup(value);
     OPT("lambda-file") p->rc.lambdaFileName = strdup(value);
     OPT("analysis-file") p->analysisFileName = strdup(value);
+    OPT("qg-size") p->rc.qgSize = atoi(value);
+    OPT("master-display") p->masteringDisplayColorVolume = strdup(value);
+    OPT("max-cll") p->contentLightLevelInfo = strdup(value);
     else
         return X265_PARAM_BAD_NAME;
 #undef OPT
@@ -1183,7 +1193,7 @@
     uint32_t maxLog2CUSize = (uint32_t)g_log2Size[param->maxCUSize];
     uint32_t minLog2CUSize = (uint32_t)g_log2Size[param->minCUSize];
 
-    if (g_ctuSizeConfigured || ATOMIC_INC(&g_ctuSizeConfigured) > 1)
+    if (ATOMIC_INC(&g_ctuSizeConfigured) > 1)
     {
         if (g_maxCUSize != param->maxCUSize)
         {
@@ -1264,22 +1274,20 @@
     x265_log(param, X265_LOG_INFO, "b-pyramid / weightp / weightb / refs: %d / %d / %d / %d\n",
              param->bBPyramid, param->bEnableWeightedPred, param->bEnableWeightedBiPred, param->maxNumReferences);
 
+    if (param->rc.aqMode)
+        x265_log(param, X265_LOG_INFO, "AQ: mode / str / qg-size / cu-tree  : %d / %0.1f / %d / %d\n", param->rc.aqMode,
+                 param->rc.aqStrength, param->rc.qgSize, param->rc.cuTree);
+
     if (param->bLossless)
         x265_log(param, X265_LOG_INFO, "Rate Control                        : Lossless\n");
     else switch (param->rc.rateControlMode)
     {
     case X265_RC_ABR:
-        x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : ABR-%d kbps / %0.1f / %d\n", param->rc.bitrate,
-                 param->rc.aqStrength, param->rc.cuTree);
-        break;
+        x265_log(param, X265_LOG_INFO, "Rate Control / qCompress            : ABR-%d kbps / %0.2f\n", param->rc.bitrate, param->rc.qCompress); break;
     case X265_RC_CQP:
-        x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : CQP-%d / %0.1f / %d\n", param->rc.qp, param->rc.aqStrength,
-                 param->rc.cuTree);
-        break;
+        x265_log(param, X265_LOG_INFO, "Rate Control                        : CQP-%d\n", param->rc.qp); break;
     case X265_RC_CRF:
-        x265_log(param, X265_LOG_INFO, "Rate Control / AQ-Strength / CUTree : CRF-%0.1f / %0.1f / %d\n", param->rc.rfConstant,
-                 param->rc.aqStrength, param->rc.cuTree);
-        break;
+        x265_log(param, X265_LOG_INFO, "Rate Control / qCompress            : CRF-%0.1f / %0.2f\n", param->rc.rfConstant, param->rc.qCompress); break;
     }
 
     if (param->rc.vbvBufferSize)
@@ -1327,6 +1335,43 @@
     fflush(stderr);
 }
 
+void x265_print_reconfigured_params(x265_param* param, x265_param* reconfiguredParam)
+{
+    if (!param || !reconfiguredParam)
+        return;
+
+    x265_log(param,X265_LOG_INFO, "Reconfigured param options :\n");
+
+    char buf[80] = { 0 };
+    char tmp[40];
+#define TOOLCMP(COND1, COND2, STR, VAL)  if (COND1 != COND2) { sprintf(tmp, STR, VAL); appendtool(param, buf, sizeof(buf), tmp); }
+    TOOLCMP(param->maxNumReferences, reconfiguredParam->maxNumReferences, "ref=%d", reconfiguredParam->maxNumReferences);
+    TOOLCMP(param->maxTUSize, reconfiguredParam->maxTUSize, "max-tu-size=%d", reconfiguredParam->maxTUSize);
+    TOOLCMP(param->searchRange, reconfiguredParam->searchRange, "merange=%d", reconfiguredParam->searchRange);
+    TOOLCMP(param->subpelRefine, reconfiguredParam->subpelRefine, "subme= %d", reconfiguredParam->subpelRefine);
+    TOOLCMP(param->rdLevel, reconfiguredParam->rdLevel, "rd=%d", reconfiguredParam->rdLevel);
+    TOOLCMP(param->psyRd, reconfiguredParam->psyRd, "psy-rd=%.2lf", reconfiguredParam->psyRd);
+    TOOLCMP(param->rdoqLevel, reconfiguredParam->rdoqLevel, "rdoq=%d", reconfiguredParam->rdoqLevel);
+    TOOLCMP(param->psyRdoq, reconfiguredParam->psyRdoq, "psy-rdoq=%.2lf", reconfiguredParam->psyRdoq);
+    TOOLCMP(param->noiseReductionIntra, reconfiguredParam->noiseReductionIntra, "nr-intra=%d", reconfiguredParam->noiseReductionIntra);
+    TOOLCMP(param->noiseReductionInter, reconfiguredParam->noiseReductionInter, "nr-inter=%d", reconfiguredParam->noiseReductionInter);
+    TOOLCMP(param->bEnableTSkipFast, reconfiguredParam->bEnableTSkipFast, "tskip-fast=%d", reconfiguredParam->bEnableTSkipFast);
+    TOOLCMP(param->bEnableSignHiding, reconfiguredParam->bEnableSignHiding, "signhide=%d", reconfiguredParam->bEnableSignHiding);
+    TOOLCMP(param->bEnableFastIntra, reconfiguredParam->bEnableFastIntra, "fast-intra=%d", reconfiguredParam->bEnableFastIntra);
+    if (param->bEnableLoopFilter && (param->deblockingFilterBetaOffset != reconfiguredParam->deblockingFilterBetaOffset 
+        || param->deblockingFilterTCOffset != reconfiguredParam->deblockingFilterTCOffset))
+    {
+        sprintf(tmp, "deblock(tC=%d:B=%d)", param->deblockingFilterTCOffset, param->deblockingFilterBetaOffset);
+        appendtool(param, buf, sizeof(buf), tmp);
+    }
+    else
+        TOOLCMP(param->bEnableLoopFilter,  reconfiguredParam->bEnableLoopFilter, "deblock=%d", reconfiguredParam->bEnableLoopFilter);
+
+    TOOLCMP(param->bEnableTemporalMvp, reconfiguredParam->bEnableTemporalMvp, "tmvp=%d", reconfiguredParam->bEnableTemporalMvp);
+    TOOLCMP(param->bEnableEarlySkip, reconfiguredParam->bEnableEarlySkip, "early-skip=%d", reconfiguredParam->bEnableEarlySkip);
+    x265_log(param, X265_LOG_INFO, "tools:%s\n", buf);
+}
+
 char *x265_param2string(x265_param* p)
 {
     char *buf, *s;
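
Reviewer note: the hunk at -1183 now gates only on the return value of ATOMIC_INC. A minimal sketch of that pattern with std::atomic follows. It assumes ATOMIC_INC yields the post-increment value; the names g_configured, g_maxSize and setGlobalSize are hypothetical and do not appear in x265.

```cpp
#include <atomic>
#include <cstdint>

static std::atomic<int> g_configured(0); // hypothetical counter, plays the role of g_ctuSizeConfigured
static uint32_t g_maxSize = 0;           // hypothetical global, plays the role of g_maxCUSize

// Returns 0 on success, -1 when a later caller requests a conflicting size.
// Like the pattern it illustrates, this does not synchronise the write of
// g_maxSize with concurrent readers; it only decides who is first.
int setGlobalSize(uint32_t requested)
{
    // Pre-increment on std::atomic returns the new value, so only the
    // first caller sees 1 and falls through to the initialisation.
    if (++g_configured > 1)
        return g_maxSize == requested ? 0 : -1;

    g_maxSize = requested;
    return 0;
}
```

Every call increments the counter, so it also records how many users have configured the globals.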

 

x265_1.6.tar.gz/source/common/param.h -> x265_1.7.tar.gz/source/common/param.h Changed

 
@@ -28,6 +28,7 @@
 int   x265_check_params(x265_param *param);
 int   x265_set_globals(x265_param *param);
 void  x265_print_params(x265_param *param);
+void  x265_print_reconfigured_params(x265_param* param, x265_param* reconfiguredParam);
 void  x265_param_apply_fastfirstpass(x265_param *p);
 char* x265_param2string(x265_param *param);
 int   x265_atoi(const char *str, bool& bError);
​

x265_1.6.tar.gz/source/common/picyuv.cpp -> x265_1.7.tar.gz/source/common/picyuv.cpp Changed

 
@@ -175,8 +175,7 @@
 
         for (int r = 0; r < height; r++)
         {
-            for (int c = 0; c < width; c++)
-                yPixel[c] = (pixel)yChar[c];
+            memcpy(yPixel, yChar, width * sizeof(pixel));
 
             yPixel += m_stride;
             yChar += pic.stride[0] / sizeof(*yChar);
@@ -184,11 +183,8 @@
 
         for (int r = 0; r < height >> m_vChromaShift; r++)
         {
-            for (int c = 0; c < width >> m_hChromaShift; c++)
-            {
-                uPixel[c] = (pixel)uChar[c];
-                vPixel[c] = (pixel)vChar[c];
-            }
+            memcpy(uPixel, uChar, (width >> m_hChromaShift) * sizeof(pixel));
+            memcpy(vPixel, vChar, (width >> m_hChromaShift) * sizeof(pixel));
 
             uPixel += m_strideC;
             vPixel += m_strideC;
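
Reviewer note: replacing the per-sample cast loops with memcpy is only equivalent when the input sample and pixel have the same size. A minimal sketch of the row copy under that assumption (8-bit input into an 8-bit build); copyPlane is a hypothetical helper, not an x265 function.

```cpp
#include <cstdint>
#include <cstring>

typedef uint8_t pixel; // assumption: 8-bit build, one byte per sample

// Copies a width x height plane row by row; strides are in samples.
void copyPlane(pixel* dst, intptr_t dstStride,
               const uint8_t* src, intptr_t srcStride,
               int width, int height)
{
    static_assert(sizeof(pixel) == sizeof(uint8_t),
                  "memcpy row copy needs matching sample sizes");
    for (int r = 0; r < height; r++)
    {
        std::memcpy(dst, src, width * sizeof(pixel));
        dst += dstStride;
        src += srcStride;
    }
}
```

Padding between width and stride is left untouched on both sides.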
​

x265_1.6.tar.gz/source/common/pixel.cpp -> x265_1.7.tar.gz/source/common/pixel.cpp Changed

 
@@ -582,7 +582,7 @@
     }
 }
 
-void scale1D_128to64(pixel *dst, const pixel *src, intptr_t /*stride*/)
+void scale1D_128to64(pixel *dst, const pixel *src)
 {
     int x;
     const pixel* src1 = src;
​

x265_1.6.tar.gz/source/common/predict.cpp -> x265_1.7.tar.gz/source/common/predict.cpp Changed

@@ -273,7 +273,7 @@
 void Predict::predInterLumaShort(const PredictionUnit& pu, ShortYuv& dstSYuv, const PicYuv& refPic, const MV& mv) const
 {
     int16_t* dst = dstSYuv.getLumaAddr(pu.puAbsPartIdx);
-    int dstStride = dstSYuv.m_size;
+    intptr_t dstStride = dstSYuv.m_size;
 
     intptr_t srcStride = refPic.m_stride;
     intptr_t srcOffset = (mv.x >> 2) + (mv.y >> 2) * srcStride;
@@ -288,7 +288,7 @@
     X265_CHECK(dstStride == MAX_CU_SIZE, "stride expected to be max cu size\n");
 
     if (!(yFrac | xFrac))
-        primitives.luma_p2s(src, srcStride, dst, pu.width, pu.height);
+        primitives.pu[partEnum].convert_p2s(src, srcStride, dst, dstStride);
     else if (!yFrac)
         primitives.pu[partEnum].luma_hps(src, srcStride, dst, dstStride, xFrac, 0);
     else if (!xFrac)
@@ -375,14 +375,13 @@
     int partEnum = partitionFromSizes(pu.width, pu.height);
     
     uint32_t cxWidth  = pu.width >> m_hChromaShift;
-    uint32_t cxHeight = pu.height >> m_vChromaShift;
 
-    X265_CHECK(((cxWidth | cxHeight) % 2) == 0, "chroma block size expected to be multiple of 2\n");
+    X265_CHECK(((cxWidth | (pu.height >> m_vChromaShift)) % 2) == 0, "chroma block size expected to be multiple of 2\n");
 
     if (!(yFrac | xFrac))
     {
-        primitives.chroma[m_csp].p2s(refCb, refStride, dstCb, cxWidth, cxHeight);
-        primitives.chroma[m_csp].p2s(refCr, refStride, dstCr, cxWidth, cxHeight);
+        primitives.chroma[m_csp].pu[partEnum].p2s(refCb, refStride, dstCb, dstStride);
+        primitives.chroma[m_csp].pu[partEnum].p2s(refCr, refStride, dstCr, dstStride);
     }
     else if (!yFrac)
     {
@@ -817,7 +816,9 @@
             const pixel refSample = *pAdiLineNext;
             // Pad unavailable samples with new value
             int nextOrTop = X265_MIN(next, leftUnits);
+
             // fill left column
+#if HIGH_BIT_DEPTH
             while (curr < nextOrTop)
             {
                 for (int i = 0; i < unitHeight; i++)
@@ -836,6 +837,24 @@
                 adi += unitWidth;
                 curr++;
             }
+#else
+            X265_CHECK(curr <= nextOrTop, "curr must be less than or equal to nextOrTop\n");
+            if (curr < nextOrTop)
+            {
+                const int fillSize = unitHeight * (nextOrTop - curr);
+                memset(adi, refSample, fillSize * sizeof(pixel));
+                curr = nextOrTop;
+                adi += fillSize;
+            }
+
+            if (curr < next)
+            {
+                const int fillSize = unitWidth * (next - curr);
+                memset(adi, refSample, fillSize * sizeof(pixel));
+                curr = next;
+                adi += fillSize;
+            }
+#endif
         }
 
         // pad all other reference samples.
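
Reviewer note: the #else branch uses memset, which writes bytes, so it can only stand in for the loop when a pixel is one byte wide. The HIGH_BIT_DEPTH branch keeps the loop. A minimal sketch of that split; fillRun is a hypothetical helper, and the compile-time size check here takes the place of the preprocessor switch.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>

// Fills count samples with value and returns the pointer past the run,
// mirroring the "adi += fillSize" step in the hunk.
template <typename Pixel>
Pixel* fillRun(Pixel* dst, Pixel value, int count)
{
    if (sizeof(Pixel) == 1)
        std::memset(dst, value, count * sizeof(Pixel)); // byte fill is exact for 1-byte pixels
    else
        std::fill(dst, dst + count, value);             // wider pixels need a per-sample fill
    return dst + count;
}
```

With a 16-bit pixel, memset would replicate only the low byte of the value, which is why the two branches cannot share one code path.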

 

x265_1.6.tar.gz/source/common/primitives.cpp -> x265_1.7.tar.gz/source/common/primitives.cpp Changed

 
@@ -90,7 +90,6 @@
 
     /* alias chroma 4:4:4 from luma primitives (all but chroma filters) */
 
-    p.chroma[X265_CSP_I444].p2s = p.luma_p2s;
     p.chroma[X265_CSP_I444].cu[BLOCK_4x4].sa8d = NULL;
 
     for (int i = 0; i < NUM_PU_SIZES; i++)
@@ -98,7 +97,7 @@
         p.chroma[X265_CSP_I444].pu[i].copy_pp = p.pu[i].copy_pp;
         p.chroma[X265_CSP_I444].pu[i].addAvg  = p.pu[i].addAvg;
         p.chroma[X265_CSP_I444].pu[i].satd    = p.pu[i].satd;
-        p.chroma[X265_CSP_I444].pu[i].chroma_p2s = p.pu[i].filter_p2s;
+        p.chroma[X265_CSP_I444].pu[i].p2s     = p.pu[i].convert_p2s;
     }
 
     for (int i = 0; i < NUM_CU_SIZES; i++)
​

x265_1.6.tar.gz/source/common/primitives.h -> x265_1.7.tar.gz/source/common/primitives.h Changed

@@ -140,7 +140,8 @@
 typedef int(*count_nonzero_t)(const int16_t* quantCoeff);
 typedef void (*weightp_pp_t)(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset);
 typedef void (*weightp_sp_t)(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
-typedef void (*scale_t)(pixel* dst, const pixel* src, intptr_t stride);
+typedef void (*scale1D_t)(pixel* dst, const pixel* src);
+typedef void (*scale2D_t)(pixel* dst, const pixel* src, intptr_t stride);
 typedef void (*downscale_t)(const pixel* src0, pixel* dstf, pixel* dsth, pixel* dstv, pixel* dstc,
                             intptr_t src_stride, intptr_t dst_stride, int width, int height);
 typedef void (*extendCURowBorder_t)(pixel* txt, intptr_t stride, int width, int height, int marginX);
@@ -155,8 +156,7 @@
 typedef void (*filter_sp_t) (const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
 typedef void (*filter_ss_t) (const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
 typedef void (*filter_hv_pp_t) (const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-typedef void (*filter_p2s_wxh_t)(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
-typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst);
+typedef void (*filter_p2s_t)(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
 
 typedef void (*copy_pp_t)(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); // dst is aligned
 typedef void (*copy_sp_t)(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride);
@@ -168,7 +168,7 @@
 typedef void (*pixelavg_pp_t)(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int weight);
 typedef void (*addAvg_t)(const int16_t* src0, const int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride);
 
-typedef void (*saoCuOrgE0_t)(pixel* rec, int8_t* offsetEo, int width, int8_t signLeft);
+typedef void (*saoCuOrgE0_t)(pixel* rec, int8_t* offsetEo, int width, int8_t* signLeft, intptr_t stride);
 typedef void (*saoCuOrgE1_t)(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
 typedef void (*saoCuOrgE2_t)(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
 typedef void (*saoCuOrgE3_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX);
@@ -179,7 +179,8 @@
 
 typedef void (*cutree_propagate_cost) (int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len);
 
-typedef int (*findPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig);
+typedef int (*scanPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
+typedef uint32_t (*findPosFirstLast_t)(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]);
 
 /* Function pointers to optimized encoder primitives. Each pointer can reference
  * either an assembly routine, a SIMD intrinsic primitive, or a C function */
@@ -210,7 +211,7 @@
         addAvg_t       addAvg;      // bidir motion compensation, uses 16bit values
 
         copy_pp_t      copy_pp;
-        filter_p2s_t   filter_p2s;
+        filter_p2s_t   convert_p2s;
     }
     pu[NUM_PU_SIZES];
 
@@ -266,17 +267,26 @@
     dequant_scaling_t     dequant_scaling;
     dequant_normal_t      dequant_normal;
     denoiseDct_t          denoiseDct;
-    scale_t               scale1D_128to64;
-    scale_t               scale2D_64to32;
+    scale1D_t             scale1D_128to64;
+    scale2D_t             scale2D_64to32;
 
     ssim_4x4x2_core_t     ssim_4x4x2_core;
     ssim_end4_t           ssim_end_4;
 
     sign_t                sign;
     saoCuOrgE0_t          saoCuOrgE0;
-    saoCuOrgE1_t          saoCuOrgE1;
-    saoCuOrgE2_t          saoCuOrgE2;
-    saoCuOrgE3_t          saoCuOrgE3;
+
+    /* To avoid the overhead in avx2 optimization in handling width=16, SAO_E0_1 is split
+     * into two parts: saoCuOrgE1, saoCuOrgE1_2Rows */
+    saoCuOrgE1_t          saoCuOrgE1, saoCuOrgE1_2Rows;
+
+    // saoCuOrgE2[0] is used for width<=16 and saoCuOrgE2[1] is used for width > 16.
+    saoCuOrgE2_t          saoCuOrgE2[2];
+
+    /* In avx2 optimization, two rows cannot be handled simultaneously since it requires 
+     * a pixel from the previous row. So, saoCuOrgE3[0] is used for width<=16 and 
+     * saoCuOrgE3[1] is used for width > 16. */
+    saoCuOrgE3_t          saoCuOrgE3[2];
     saoCuOrgB0_t          saoCuOrgB0;
 
     downscale_t           frameInitLowres;
@@ -289,9 +299,9 @@
     weightp_sp_t          weight_sp;
     weightp_pp_t          weight_pp;
 
-    filter_p2s_wxh_t      luma_p2s;
 
-    findPosLast_t         findPosLast;
+    scanPosLast_t         scanPosLast;
+    findPosFirstLast_t    findPosFirstLast;
 
     /* There is one set of chroma primitives per color space. An encoder will
      * have just a single color space and thus it will only ever use one entry
@@ -316,7 +326,7 @@
             filter_hps_t filter_hps;
             addAvg_t     addAvg;
             copy_pp_t    copy_pp;
-            filter_p2s_t chroma_p2s;
+            filter_p2s_t p2s;
 
         }
         pu[NUM_PU_SIZES];
@@ -336,7 +346,6 @@
         }
         cu[NUM_CU_SIZES];
 
-        filter_p2s_wxh_t p2s; // takes width/height as arguments
     }
     chroma[X265_CSP_COUNT];
 };
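
Reviewer note: saoCuOrgE2 and saoCuOrgE3 become two-entry tables, with entry 0 for width <= 16 and entry 1 for width > 16 according to the header comments. How the caller picks the entry is not shown in this diff. The sketch below assumes it indexes with the boolean `width > 16`; all names in it are hypothetical.

```cpp
// Hypothetical kernel type; the real primitives take pixel buffers.
typedef int (*rowKernel_t)(int width);

static int narrowKernel(int) { return 0; } // stands in for the width <= 16 routine
static int wideKernel(int)   { return 1; } // stands in for the width > 16 routine

struct Primitives
{
    rowKernel_t kernel[2]; // [0]: width <= 16, [1]: width > 16
};

void setupPrimitives(Primitives& p)
{
    p.kernel[0] = narrowKernel;
    p.kernel[1] = wideKernel;
}

// A bool converts to 0 or 1, so the comparison can index the table directly.
int run(const Primitives& p, int width)
{
    return p.kernel[width > 16](width);
}
```

The C setup in this release assigns the same function to both slots, so the split only matters once an optimized routine replaces one of them.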

 
+
+    /* To avoid the overhead in avx2 optimization in handling width=16, SAO_E0_1 is split
+     * into two parts: saoCuOrgE1, saoCuOrgE1_2Rows */
+    saoCuOrgE1_t          saoCuOrgE1, saoCuOrgE1_2Rows;
+
+    // saoCuOrgE2[0] is used for width<=16 and saoCuOrgE2[1] is used for width > 16.
+    saoCuOrgE2_t          saoCuOrgE2[2];
+
+    /* In avx2 optimization, two rows cannot be handled simultaneously since it requires 
+     * a pixel from the previous row. So, saoCuOrgE3[0] is used for width<=16 and 
+     * saoCuOrgE3[1] is used for width > 16. */
+    saoCuOrgE3_t          saoCuOrgE3[2];
     saoCuOrgB0_t          saoCuOrgB0;
 
     downscale_t           frameInitLowres;
@@ -289,9 +299,9 @@
     weightp_sp_t          weight_sp;
     weightp_pp_t          weight_pp;
 
-    filter_p2s_wxh_t      luma_p2s;
 
-    findPosLast_t         findPosLast;
+    scanPosLast_t         scanPosLast;
+    findPosFirstLast_t    findPosFirstLast;
 
     /* There is one set of chroma primitives per color space. An encoder will
      * have just a single color space and thus it will only ever use one entry
@@ -316,7 +326,7 @@
             filter_hps_t filter_hps;
             addAvg_t     addAvg;
             copy_pp_t    copy_pp;
-            filter_p2s_t chroma_p2s;
+            filter_p2s_t p2s;
 
         }
         pu[NUM_PU_SIZES];
@@ -336,7 +346,6 @@
         }
         cu[NUM_CU_SIZES];
 
-        filter_p2s_wxh_t p2s; // takes width/height as arguments
     }
     chroma[X265_CSP_COUNT];
 };
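
Note on the saoCuOrgE2[2] / saoCuOrgE3[2] change above: the comments in the hunk say entry [0] serves width <= 16 and entry [1] serves width > 16. The sketch below shows one way such a two-entry function-pointer table can be filled and indexed. It is an illustration only: ToyPrimitives, rowKernel_t, sumNarrow, sumWide and dispatchRowSum are hypothetical names with a simplified signature, and indexing by the comparison result is an assumption, since the call sites are not part of this diff.

```cpp
#include <cassert>
#include <cstdint>

typedef uint8_t pixel;

// Hypothetical, simplified kernel signature (not the real saoCuOrgE2_t).
typedef int (*rowKernel_t)(const pixel* row, int width);

// Plain loop, standing in for the kernel tuned for width <= 16.
static int sumNarrow(const pixel* row, int width)
{
    int sum = 0;
    for (int x = 0; x < width; x++)
        sum += row[x];
    return sum;
}

// Processes 16 pixels per outer step, standing in for the wide kernel.
static int sumWide(const pixel* row, int width)
{
    int sum = 0;
    int x = 0;
    for (; x + 16 <= width; x += 16)
        for (int i = 0; i < 16; i++)
            sum += row[x + i];
    for (; x < width; x++) // leftover pixels
        sum += row[x];
    return sum;
}

struct ToyPrimitives
{
    rowKernel_t rowSum[2]; // [0]: width <= 16, [1]: width > 16
};

static void setupToyPrimitives(ToyPrimitives& p)
{
    p.rowSum[0] = sumNarrow;
    p.rowSum[1] = sumWide;
}

// The comparison yields false/true, which converts to index 0 or 1.
static int dispatchRowSum(const ToyPrimitives& p, const pixel* row, int width)
{
    return p.rowSum[width > 16](row, width);
}
```

Keeping the selection in the table index means the caller needs no branch on the kernel variant, and an optimized build can replace either entry independently.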

x265_1.6.tar.gz/source/common/quant.cpp -> x265_1.7.tar.gz/source/common/quant.cpp Changed

@@ -198,7 +198,8 @@
 {
     m_entropyCoder = &entropy;
     m_rdoqLevel    = rdoqLevel;
-    m_psyRdoqScale = (int64_t)(psyScale * 256.0);
+    m_psyRdoqScale = (int32_t)(psyScale * 256.0);
+    X265_CHECK((psyScale * 256.0) < (double)MAX_INT, "psyScale value too large\n");
     m_scalingList  = &scalingList;
     m_resiDctCoeff = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE * 2);
     m_fencDctCoeff = m_resiDctCoeff + (MAX_TR_SIZE * MAX_TR_SIZE);
@@ -225,16 +226,15 @@
     X265_FREE(m_fencShortBuf);
 }
 
-void Quant::setQPforQuant(const CUData& cu)
+void Quant::setQPforQuant(const CUData& ctu, int qp)
 {
-    m_tqBypass = !!cu.m_tqBypass[0];
+    m_tqBypass = !!ctu.m_tqBypass[0];
     if (m_tqBypass)
         return;
-    m_nr = m_frameNr ? &m_frameNr[cu.m_encData->m_frameEncoderID] : NULL;
-    int qpy = cu.m_qp[0];
-    m_qpParam[TEXT_LUMA].setQpParam(qpy + QP_BD_OFFSET);
-    setChromaQP(qpy + cu.m_slice->m_pps->chromaQpOffset[0], TEXT_CHROMA_U, cu.m_chromaFormat);
-    setChromaQP(qpy + cu.m_slice->m_pps->chromaQpOffset[1], TEXT_CHROMA_V, cu.m_chromaFormat);
+    m_nr = m_frameNr ? &m_frameNr[ctu.m_encData->m_frameEncoderID] : NULL;
+    m_qpParam[TEXT_LUMA].setQpParam(qp + QP_BD_OFFSET);
+    setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[0], TEXT_CHROMA_U, ctu.m_chromaFormat);
+    setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[1], TEXT_CHROMA_V, ctu.m_chromaFormat);
 }
 
 void Quant::setChromaQP(int qpin, TextType ttype, int chFmt)
@@ -515,6 +515,7 @@
 {
     int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */
     int scalingListType = (cu.isIntra(absPartIdx) ? 0 : 3) + ttype;
+    const uint32_t usePsyMask = usePsy ? -1 : 0;
 
     X265_CHECK(scalingListType < 6, "scaling list type out of range\n");
 
@@ -529,9 +530,10 @@
     X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(dstCoeff), "numSig differ\n");
     if (!numSig)
         return 0;
+
     uint32_t trSize = 1 << log2TrSize;
     int64_t lambda2 = m_qpParam[ttype].lambda2;
-    int64_t psyScale = (m_psyRdoqScale * m_qpParam[ttype].lambda);
+    const int64_t psyScale = ((int64_t)m_psyRdoqScale * m_qpParam[ttype].lambda);
 
     /* unquant constants for measuring distortion. Scaling list quant coefficients have a (1 << 4)
      * scale applied that must be removed during unquant. Note that in real dequant there is clipping
@@ -544,7 +546,7 @@
 #define UNQUANT(lvl)    (((lvl) * (unquantScale[blkPos] << per) + unquantRound) >> unquantShift)
 #define SIGCOST(bits)   ((lambda2 * (bits)) >> 8)
 #define RDCOST(d, bits) ((((int64_t)d * d) << scaleBits) + SIGCOST(bits))
-#define PSYVALUE(rec)   ((psyScale * (rec)) >> (16 - scaleBits))
+#define PSYVALUE(rec)   ((psyScale * (rec)) >> (2 * transformShift + 1))
 
     int64_t costCoeff[32 * 32];   /* d*d + lambda * bits */
     int64_t costUncoded[32 * 32]; /* d*d + lambda * 0    */
@@ -557,14 +559,6 @@
     int64_t costCoeffGroupSig[MLS_GRP_NUM]; /* lambda * bits of group coding cost */
     uint64_t sigCoeffGroupFlag64 = 0;
 
-    uint32_t ctxSet      = 0;
-    int    c1            = 1;
-    int    c2            = 0;
-    uint32_t goRiceParam = 0;
-    uint32_t c1Idx       = 0;
-    uint32_t c2Idx       = 0;
-    int cgLastScanPos    = -1;
-    int lastScanPos      = -1;
     const uint32_t cgSize = (1 << MLS_CG_SIZE); /* 4x4 num coef = 16 */
     bool bIsLuma = ttype == TEXT_LUMA;
 
@@ -579,30 +573,231 @@
     TUEntropyCodingParameters codeParams;
     cu.getTUEntropyCodingParameters(codeParams, absPartIdx, log2TrSize, bIsLuma);
     const uint32_t cgNum = 1 << (codeParams.log2TrSizeCG * 2);
+    const uint32_t cgStride = (trSize >> MLS_CG_LOG2_SIZE);
+
+    uint8_t coeffNum[MLS_GRP_NUM];      // value range[0, 16]
+    uint16_t coeffSign[MLS_GRP_NUM];    // bit mask map for non-zero coeff sign
+    uint16_t coeffFlag[MLS_GRP_NUM];    // bit mask map for non-zero coeff
+
+#if CHECKED_BUILD || _DEBUG
+    // clean output buffer, the asm version of scanPosLast Never output anything after latest non-zero coeff group
+    memset(coeffNum, 0, sizeof(coeffNum));
+    memset(coeffSign, 0, sizeof(coeffNum));
+    memset(coeffFlag, 0, sizeof(coeffNum));
+#endif
+    const int lastScanPos = primitives.scanPosLast(codeParams.scan, dstCoeff, coeffSign, coeffFlag, coeffNum, numSig, g_scan4x4[codeParams.scanType], trSize);
+    const int cgLastScanPos = (lastScanPos >> LOG2_SCAN_SET_SIZE);
+
 
     /* TODO: update bit estimates if dirty */
     EstBitsSbac& estBitsSbac = m_entropyCoder->m_estBitsSbac;
 
-    uint32_t scanPos;
-    coeffGroupRDStats cgRdStats;
+    uint32_t scanPos = 0;
+    uint32_t c1 = 1;
+
+    // process trail all zero Coeff Group
+
+    /* coefficients after lastNZ have no distortion signal cost */
+    const int zeroCG = cgNum - 1 - cgLastScanPos;
+    memset(&costCoeff[(cgLastScanPos + 1) << MLS_CG_SIZE], 0, zeroCG * MLS_CG_BLK_SIZE * sizeof(int64_t));
+    memset(&costSig[(cgLastScanPos + 1) << MLS_CG_SIZE], 0, zeroCG * MLS_CG_BLK_SIZE * sizeof(int64_t));
+
+    /* sum zero coeff (uncodec) cost */
+
+    // TODO: does we need these cost?
+    if (usePsyMask)
+    {
+        for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++)
+        {
+            X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n");
+
+            uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
+            uint32_t blkPos      = codeParams.scan[scanPosBase];
+
+            // TODO: we can't SIMD optimize because PSYVALUE need 64-bits multiplication, convert to Double can work faster by FMA
+            for (int y = 0; y < MLS_CG_SIZE; y++)
+            {
+                for (int x = 0; x < MLS_CG_SIZE; x++)
+                {
+                    int signCoef         = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
+                    int predictedCoef    = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/
+
+                    costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
+
+                    /* when no residual coefficient is coded, predicted coef == recon coef */
+                    costUncoded[blkPos + x] -= PSYVALUE(predictedCoef);
+
+                    totalUncodedCost += costUncoded[blkPos + x];
+                    totalRdCost += costUncoded[blkPos + x];
+                }
+                blkPos += trSize;
+            }
+        }
+    }
+    else
+    {
+        // non-psy path
+        for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++)
+        {
+            X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n");
+
+            uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE);
+            uint32_t blkPos      = codeParams.scan[scanPosBase];
+
+            for (int y = 0; y < MLS_CG_SIZE; y++)
+            {
+                for (int x = 0; x < MLS_CG_SIZE; x++)
+                {
+                    int signCoef = m_resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
+                    costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits;
+
+                    totalUncodedCost += costUncoded[blkPos + x];
+                    totalRdCost += costUncoded[blkPos + x];
+                }
+                blkPos += trSize;
+            }
+        }
+    }
+
+    static const uint8_t table_cnt[5][SCAN_SET_SIZE] =
+    {
+        // patternSigCtx = 0
+        {
+            2, 1, 1, 0,
+            1, 1, 0, 0,
+            1, 0, 0, 0,
+            0, 0, 0, 0,
+        },
+        // patternSigCtx = 1
+        {
+            2, 2, 2, 2,
+            1, 1, 1, 1,
+            0, 0, 0, 0,
+            0, 0, 0, 0,
+        },
+        // patternSigCtx = 2
+        {
+            2, 1, 0, 0,
+            2, 1, 0, 0,
+            2, 1, 0, 0,
+            2, 1, 0, 0,
+        },
+        // patternSigCtx = 3
+        {
+            2, 2, 2, 2,
+            2, 2, 2, 2,
+            2, 2, 2, 2,
+            2, 2, 2, 2,
+        },
+        // 4x4
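
Note on the m_psyRdoqScale / lambda narrowing in this file and in quant.h: both values become int32_t, so their product has to be widened before the multiplication. The comments in the diff give roughly 14 bits for the psy scale and up to 20 bits for lambda, which can exceed 32 bits once multiplied. The following is a minimal sketch of that pattern; toFix8 and psyScaleProduct are hypothetical helper names, not functions from x265.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Convert a non-negative value to 8-bit fixed point (FIX8), checking that
// the scaled result still fits in int32_t before narrowing.
static int32_t toFix8(double v)
{
    const double scaled = v * 256.0;
    assert(scaled >= 0.0 && scaled < (double)std::numeric_limits<int32_t>::max());
    return (int32_t)scaled;
}

// Widen one operand first so the multiplication is done in 64 bits.
// Without the cast, int32_t * int32_t would be evaluated in 32 bits
// and could overflow before the result is stored.
static int64_t psyScaleProduct(int32_t psyRdoqScale, int32_t lambda)
{
    return (int64_t)psyRdoqScale * lambda;
}
```

For example, a psy scale of 50.0 becomes 12800 in FIX8, and multiplying it by a 20-bit lambda such as 1 << 20 gives 13421772800, which does not fit in 32 bits.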

 

x265_1.6.tar.gz/source/common/quant.h -> x265_1.7.tar.gz/source/common/quant.h Changed

@@ -41,7 +41,7 @@
     int per;
     int qp;
     int64_t lambda2; /* FIX8 */
-    int64_t lambda;  /* FIX8 */
+    int32_t lambda;  /* FIX8, dynamic range is 18-bits in 8bpp and 20-bits in 16bpp */
 
     QpParam() : qp(MAX_INT) {}
 
@@ -53,7 +53,8 @@
             per = qpScaled / 6;
             qp  = qpScaled;
             lambda2 = (int64_t)(x265_lambda2_tab[qp - QP_BD_OFFSET] * 256. + 0.5);
-            lambda  = (int64_t)(x265_lambda_tab[qp - QP_BD_OFFSET] * 256. + 0.5);
+            lambda  = (int32_t)(x265_lambda_tab[qp - QP_BD_OFFSET] * 256. + 0.5);
+            X265_CHECK((x265_lambda_tab[qp - QP_BD_OFFSET] * 256. + 0.5) < (double)MAX_INT, "x265_lambda_tab[] value too large\n");
         }
     }
 };
@@ -82,7 +83,7 @@
     QpParam            m_qpParam[3];
 
     int                m_rdoqLevel;
-    int64_t            m_psyRdoqScale;
+    int32_t            m_psyRdoqScale;  // dynamic range [0,50] * 256 = 14-bits
     int16_t*           m_resiDctCoeff;
     int16_t*           m_fencDctCoeff;
     int16_t*           m_fencShortBuf;
@@ -103,7 +104,7 @@
     bool allocNoiseReduction(const x265_param& param);
 
     /* CU setup */
-    void setQPforQuant(const CUData& cu);
+    void setQPforQuant(const CUData& ctu, int qp);
 
     uint32_t transformNxN(const CUData& cu, const pixel* fenc, uint32_t fencStride, const int16_t* residual, uint32_t resiStride, coeff_t* coeff,
                           uint32_t log2TrSize, TextType ttype, uint32_t absPartIdx, bool useTransformSkip);
@@ -111,10 +112,39 @@
     void invtransformNxN(int16_t* residual, uint32_t resiStride, const coeff_t* coeff,
                          uint32_t log2TrSize, TextType ttype, bool bIntra, bool useTransformSkip, uint32_t numSig);
 
+    /* Pattern decision for context derivation process of significant_coeff_flag */
+    static uint32_t calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t cgBlkPos, uint32_t trSizeCG)
+    {
+        if (trSizeCG == 1)
+            return 0;
+
+        X265_CHECK(trSizeCG <= 8, "transform CG is too large\n");
+        X265_CHECK(cgBlkPos < 64, "cgBlkPos is too large\n");
+        // NOTE: cgBlkPos+1 may more than 63, it is invalid for shift,
+        //       but in this case, both cgPosX and cgPosY equal to (trSizeCG - 1),
+        //       the sigRight and sigLower will clear value to zero, the final result will be correct
+        const uint32_t sigPos = (uint32_t)(sigCoeffGroupFlag64 >> (cgBlkPos + 1)); // just need lowest 7-bits valid
+
+        // TODO: instruction BT is faster, but _bittest64 still generate instruction 'BT m, r' in VS2012
+        const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1);
+        const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2;
+        return sigRight + sigLower;
+    }
+
+    /* Context derivation process of coeff_abs_significant_flag */
+    static uint32_t getSigCoeffGroupCtxInc(uint64_t cgGroupMask, uint32_t cgPosX, uint32_t cgPosY, uint32_t cgBlkPos, uint32_t trSizeCG)
+    {
+        X265_CHECK(cgBlkPos < 64, "cgBlkPos is too large\n");
+        // NOTE: unsafe shift operator, see NOTE in calcPatternSigCtx
+        const uint32_t sigPos = (uint32_t)(cgGroupMask >> (cgBlkPos + 1)); // just need lowest 8-bits valid
+        const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos;
+        const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1));
+
+        return (sigRight | sigLower) & 1;
+    }
+
     /* static methods shared with entropy.cpp */
-    static uint32_t calcPatternSigCtx(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG);
     static uint32_t getSigCtxInc(uint32_t patternSigCtx, uint32_t log2TrSize, uint32_t trSize, uint32_t blkPos, bool bIsLuma, uint32_t firstSignificanceMapContext);
-    static uint32_t getSigCoeffGroupCtxInc(uint64_t sigCoeffGroupFlag64, uint32_t cgPosX, uint32_t cgPosY, uint32_t log2TrSizeCG);
 
 protected:
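
Note on the inlined calcPatternSigCtx above: it replaces bounds checks with a sign mask. When cgPosX is smaller than trSizeCG - 1 the subtraction is negative, and an arithmetic shift by 31 turns it into an all-ones mask; at the last column or row the mask is zero and the neighbour bit is discarded. The sketch below compares that idea against a branchy reference. The names patternSigCtxRef and patternSigCtxMask are hypothetical, the raster mapping cgBlkPos = y * trSizeCG + x is an assumption, and the shift is guarded explicitly here instead of relying on the masking described in the NOTE comment.

```cpp
#include <cassert>
#include <cstdint>

// Branchy reference: bit 0 = right neighbour group is significant,
// bit 1 = lower neighbour group is significant.
static uint32_t patternSigCtxRef(uint64_t mask, uint32_t x, uint32_t y, uint32_t trSizeCG)
{
    if (trSizeCG == 1)
        return 0;
    uint32_t right = 0, lower = 0;
    if (x + 1 < trSizeCG)
        right = (uint32_t)((mask >> (y * trSizeCG + x + 1)) & 1);
    if (y + 1 < trSizeCG)
        lower = (uint32_t)((mask >> ((y + 1) * trSizeCG + x)) & 1);
    return right + 2 * lower;
}

// Branchless variant using sign masks, for trSizeCG in {1, 2, 4, 8}.
static uint32_t patternSigCtxMask(uint64_t mask, uint32_t x, uint32_t y, uint32_t trSizeCG)
{
    if (trSizeCG == 1)
        return 0;
    const uint32_t cgBlkPos = y * trSizeCG + x; // assumed raster order
    // Guard the shift: shifting a 64-bit value by 64 is undefined.
    const uint32_t sigPos = (cgBlkPos + 1 < 64) ? (uint32_t)(mask >> (cgBlkPos + 1)) : 0;
    // All ones when the position is not in the last column/row, else zero.
    // Relies on arithmetic right shift of negative values, which is what
    // common compilers do (and what C++20 guarantees).
    const uint32_t inX = (uint32_t)((int32_t)(x - (trSizeCG - 1)) >> 31);
    const uint32_t inY = (uint32_t)((int32_t)(y - (trSizeCG - 1)) >> 31);
    const uint32_t sigRight = inX & (sigPos & 1);
    // The lower neighbour sits trSizeCG bits above cgBlkPos, i.e. at bit
    // (trSizeCG - 1) of sigPos; shifting by (trSizeCG - 2) moves it to bit 1.
    const uint32_t sigLower = inY & (sigPos >> (trSizeCG - 2)) & 2;
    return sigRight + sigLower;
}
```

Both functions can be called with any 64-bit group mask and a position inside the trSizeCG x trSizeCG grid; they are intended to return the same value for every such input.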

 

x265_1.6.tar.gz/source/common/slice.h -> x265_1.7.tar.gz/source/common/slice.h Changed

 
@@ -98,6 +98,7 @@
         LEVEL6 = 180,
         LEVEL6_1 = 183,
         LEVEL6_2 = 186,
+        LEVEL8_5 = 255,
     };
 }
 

x265_1.6.tar.gz/source/common/threading.h -> x265_1.7.tar.gz/source/common/threading.h Changed

 
@@ -189,6 +189,14 @@
         LeaveCriticalSection(&m_cs);
     }
 
+    void poke(void)
+    {
+        /* awaken all waiting threads, but make no change */
+        EnterCriticalSection(&m_cs);
+        WakeAllConditionVariable(&m_cv);
+        LeaveCriticalSection(&m_cs);
+    }
+
     void incr()
     {
         EnterCriticalSection(&m_cs);
@@ -370,6 +378,14 @@
         pthread_mutex_unlock(&m_mutex);
     }
 
+    void poke(void)
+    {
+        /* awaken all waiting threads, but make no change */
+        pthread_mutex_lock(&m_mutex);
+        pthread_cond_broadcast(&m_cond);
+        pthread_mutex_unlock(&m_mutex);
+    }
+
     void incr()
     {
         pthread_mutex_lock(&m_mutex);
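
Note on the new poke() methods above: both variants take the lock, wake every waiter, and leave the stored value untouched. This is useful when a waiter's exit condition lives outside the protected value. The sketch below illustrates the idea with the C++11 standard library instead of pthreads or Win32 primitives. PokableCounter and the stop flag are hypothetical, since the hunks do not show the surrounding class or its wait functions.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

class PokableCounter
{
public:
    PokableCounter() : m_val(0) {}

    int get()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_val;
    }

    void incr()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        ++m_val;
        m_cond.notify_all();
    }

    // Wake all waiters without changing the value. Holding the mutex
    // ensures a waiter is either already blocked (and gets notified) or
    // has not yet evaluated its predicate (and will see the new state).
    void poke()
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_cond.notify_all();
    }

    // Block until the value differs from prev or the external stop flag
    // is set; returns the current value either way.
    int waitForChange(int prev, const std::atomic<bool>& stop)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [&] { return m_val != prev || stop.load(); });
        return m_val;
    }

private:
    std::mutex              m_mutex;
    std::condition_variable m_cond;
    int                     m_val;
};
```

A shutdown path would set the stop flag first and then call poke(), so a thread blocked in waitForChange() re-evaluates its predicate and returns even though the counter never changed.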

x265_1.6.tar.gz/source/common/threadpool.cpp -> x265_1.7.tar.gz/source/common/threadpool.cpp Changed

@@ -232,7 +232,7 @@
     int cpuCount = getCpuCount();
     bool bNumaSupport = false;
 
-#if _WIN32_WINNT >= 0x0601
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 
     bNumaSupport = true;
 #elif HAVE_LIBNUMA
     bNumaSupport = numa_available() >= 0;
@@ -241,10 +241,10 @@
 
     for (int i = 0; i < cpuCount; i++)
     {
-#if _WIN32_WINNT >= 0x0601
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 
         UCHAR node;
         if (GetNumaProcessorNode((UCHAR)i, &node))
-            cpusPerNode[X265_MIN(node, MAX_NODE_NUM)]++;
+            cpusPerNode[X265_MIN(node, (UCHAR)MAX_NODE_NUM)]++;
         else
 #elif HAVE_LIBNUMA
         if (bNumaSupport >= 0)
@@ -261,7 +261,7 @@
     /* limit nodes based on param->numaPools */
     if (p->numaPools && *p->numaPools)
     {
-        char *nodeStr = p->numaPools;
+        const char *nodeStr = p->numaPools;
         for (int i = 0; i < numNumaNodes; i++)
         {
             if (!*nodeStr)
@@ -373,7 +373,7 @@
     return true;
 }
 
-void ThreadPool::stop()
+void ThreadPool::stopWorkers()
 {
     if (m_workers)
     {
@@ -408,7 +408,7 @@
 /* static */
 void ThreadPool::setThreadNodeAffinity(int numaNode)
 {
-#if _WIN32_WINNT >= 0x0601
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 
     GROUP_AFFINITY groupAffinity;
     if (GetNumaNodeProcessorMaskEx((USHORT)numaNode, &groupAffinity))
     {
@@ -433,7 +433,7 @@
 /* static */
 int ThreadPool::getNumaNodeCount()
 {
-#if _WIN32_WINNT >= 0x0601
+#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 
     ULONG num = 1;
     if (GetNumaHighestNodeNumber(&num))
         num++;

 

x265_1.6.tar.gz/source/common/threadpool.h -> x265_1.7.tar.gz/source/common/threadpool.h Changed

 
@@ -94,7 +94,7 @@
 
     bool create(int numThreads, int maxProviders, int node);
     bool start();
-    void stop();
+    void stopWorkers();
     void setCurrentThreadAffinity();
     int  tryAcquireSleepingThread(sleepbitmap_t firstTryBitmap, sleepbitmap_t secondTryBitmap);
     int  tryBondPeers(int maxPeers, sleepbitmap_t peerBitmap, BondedTaskGroup& master);
​

x265_1.6.tar.gz/source/common/x86/asm-primitives.cpp -> x265_1.7.tar.gz/source/common/x86/asm-primitives.cpp Changed

@@ -800,6 +800,10 @@
 #error "Unsupported build configuration (32bit x86 and HIGH_BIT_DEPTH), you must configure ENABLE_ASSEMBLY=OFF"
 #endif
 
+#if X86_64
+    p.scanPosLast = x265_scanPosLast_x64;
+#endif
+
     if (cpuMask & X265_CPU_SSE2)
     {
         /* We do not differentiate CPUs which support MMX and not SSE2. We only check
@@ -859,9 +863,6 @@
         PIXEL_AVG_W4(mmx2);
         LUMA_VAR(sse2);
 
-        p.luma_p2s = x265_luma_p2s_sse2;
-        p.chroma[X265_CSP_I420].p2s = x265_chroma_p2s_sse2;
-        p.chroma[X265_CSP_I422].p2s = x265_chroma_p2s_sse2;
 
         ALL_LUMA_TU(blockfill_s, blockfill_s, sse2);
         ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, sse2);
@@ -872,15 +873,41 @@
         ALL_LUMA_TU_S(calcresidual, getResidual, sse2);
         ALL_LUMA_TU_S(transpose, transpose, sse2);
 
-        p.cu[BLOCK_4x4].intra_pred[DC_IDX] = x265_intra_pred_dc4_sse2;
-        p.cu[BLOCK_8x8].intra_pred[DC_IDX] = x265_intra_pred_dc8_sse2;
-        p.cu[BLOCK_16x16].intra_pred[DC_IDX] = x265_intra_pred_dc16_sse2;
-        p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_sse2;
-
-        p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = x265_intra_pred_planar4_sse2;
-        p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = x265_intra_pred_planar8_sse2;
-        p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_sse2;
-        p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_sse2;
+        ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse2);
+        ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2);
+
+        p.cu[BLOCK_4x4].intra_pred[2] = x265_intra_pred_ang4_2_sse2;
+        p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_sse2;
+        p.cu[BLOCK_4x4].intra_pred[4] = x265_intra_pred_ang4_4_sse2;
+        p.cu[BLOCK_4x4].intra_pred[5] = x265_intra_pred_ang4_5_sse2;
+        p.cu[BLOCK_4x4].intra_pred[6] = x265_intra_pred_ang4_6_sse2;
+        p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_sse2;
+        p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_sse2;
+        p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_sse2;
+        p.cu[BLOCK_4x4].intra_pred[10] = x265_intra_pred_ang4_10_sse2;
+        p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_sse2;
+        p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_sse2;
+        p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_sse2;
+        p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_sse2;
+        p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_sse2;
+        p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_sse2;
+        p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_sse2;
+        p.cu[BLOCK_4x4].intra_pred[18] = x265_intra_pred_ang4_18_sse2;
+        p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_17_sse2;
+        p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_16_sse2;
+        p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_15_sse2;
+        p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_14_sse2;
+        p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_13_sse2;
+        p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_12_sse2;
+        p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_11_sse2;
+        p.cu[BLOCK_4x4].intra_pred[26] = x265_intra_pred_ang4_26_sse2;
+        p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_9_sse2;
+        p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_8_sse2;
+        p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_7_sse2;
+        p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_6_sse2;
+        p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_5_sse2;
+        p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_4_sse2;
+        p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_3_sse2;
 
         p.cu[BLOCK_4x4].sse_ss = x265_pixel_ssd_ss_4x4_mmx2;
         ALL_LUMA_CU(sse_ss, pixel_ssd_ss, sse2);
@@ -918,6 +945,74 @@
         p.cu[BLOCK_16x16].count_nonzero = x265_count_nonzero_16x16_ssse3;
         p.cu[BLOCK_32x32].count_nonzero = x265_count_nonzero_32x32_ssse3;
         p.frameInitLowres = x265_frame_init_lowres_core_ssse3;
+
+        p.pu[LUMA_4x4].convert_p2s = x265_filterPixelToShort_4x4_ssse3;
+        p.pu[LUMA_4x8].convert_p2s = x265_filterPixelToShort_4x8_ssse3;
+        p.pu[LUMA_4x16].convert_p2s = x265_filterPixelToShort_4x16_ssse3;
+        p.pu[LUMA_8x4].convert_p2s = x265_filterPixelToShort_8x4_ssse3;
+        p.pu[LUMA_8x8].convert_p2s = x265_filterPixelToShort_8x8_ssse3;
+        p.pu[LUMA_8x16].convert_p2s = x265_filterPixelToShort_8x16_ssse3;
+        p.pu[LUMA_8x32].convert_p2s = x265_filterPixelToShort_8x32_ssse3;
+        p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_ssse3;
+        p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_ssse3;
+        p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_ssse3;
+        p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_ssse3;
+        p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_ssse3;
+        p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_ssse3;
+        p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_ssse3;
+        p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_ssse3;
+        p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_ssse3;
+        p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_ssse3;
+        p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_ssse3;
+        p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_ssse3;
+        p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_ssse3;
+        p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_ssse3;
+        p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_ssse3;
+        p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_ssse3;
+        p.pu[LUMA_12x16].convert_p2s = x265_filterPixelToShort_12x16_ssse3;
+        p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_ssse3;
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = x265_filterPixelToShort_4x4_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = x265_filterPixelToShort_4x8_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = x265_filterPixelToShort_4x16_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = x265_filterPixelToShort_8x4_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = x265_filterPixelToShort_8x8_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = x265_filterPixelToShort_8x16_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = x265_filterPixelToShort_8x32_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = x265_filterPixelToShort_16x32_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = x265_filterPixelToShort_4x4_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = x265_filterPixelToShort_4x8_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = x265_filterPixelToShort_4x16_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = x265_filterPixelToShort_4x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = x265_filterPixelToShort_8x4_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = x265_filterPixelToShort_8x8_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = x265_filterPixelToShort_8x12_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = x265_filterPixelToShort_8x16_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = x265_filterPixelToShort_8x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = x265_filterPixelToShort_8x64_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = x265_filterPixelToShort_12x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_ssse3;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = x265_filterPixelToShort_4x2_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = x265_filterPixelToShort_8x2_ssse3;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = x265_filterPixelToShort_8x6_ssse3;
+        p.findPosFirstLast = x265_findPosFirstLast_ssse3;
     }
     if (cpuMask & X265_CPU_SSE4)
     {
@@ -957,6 +1052,13 @@
         ALL_LUMA_TU_S(copy_cnt, copy_cnt_, sse4);
         ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4);
         ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4);
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = x265_filterPixelToShort_2x4_sse4;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = x265_filterPixelToShort_2x8_sse4;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = x265_filterPixelToShort_6x8_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = x265_filterPixelToShort_2x8_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = x265_filterPixelToShort_2x16_sse4;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = x265_filterPixelToShort_6x16_sse4;
     }
     if (cpuMask & X265_CPU_AVX)
     {
@@ -1079,6 +1181,26 @@
     }
     if (cpuMask & X265_CPU_AVX2)
     {
+        p.pu[LUMA_48x64].satd = x265_pixel_satd_48x64_avx2;
+
+        p.pu[LUMA_64x16].satd = x265_pixel_satd_64x16_avx2;
+        p.pu[LUMA_64x32].satd = x265_pixel_satd_64x32_avx2;
+        p.pu[LUMA_64x48].satd = x265_pixel_satd_64x48_avx2;
+        p.pu[LUMA_64x64].satd = x265_pixel_satd_64x64_avx2;
+
+        p.pu[LUMA_32x8].satd = x265_pixel_satd_32x8_avx2;
+        p.pu[LUMA_32x16].satd = x265_pixel_satd_32x16_avx2;
+        p.pu[LUMA_32x24].satd = x265_pixel_satd_32x24_avx2;
+        p.pu[LUMA_32x32].satd = x265_pixel_satd_32x32_avx2;
+        p.pu[LUMA_32x64].satd = x265_pixel_satd_32x64_avx2;
+
+        p.pu[LUMA_16x4].satd = x265_pixel_satd_16x4_avx2;
+        p.pu[LUMA_16x8].satd = x265_pixel_satd_16x8_avx2;
+        p.pu[LUMA_16x12].satd = x265_pixel_satd_16x12_avx2;
+        p.pu[LUMA_16x16].satd = x265_pixel_satd_16x16_avx2;
+        p.pu[LUMA_16x32].satd = x265_pixel_satd_16x32_avx2;
+        p.pu[LUMA_16x64].satd = x265_pixel_satd_16x64_avx2;
+
         p.cu[BLOCK_32x32].ssd_s = x265_pixel_ssd_s_32_avx2;
         p.cu[BLOCK_16x16].sse_ss = x265_pixel_ssd_ss_16x16_avx2;
 
@@ -1087,6 +1209,7 @@
         p.dequant_normal  = x265_dequant_normal_avx2;
 
         p.scale1D_128to64 = x265_scale1D_128to64_avx2;
+        p.scale2D_64to32 = x265_scale2D_64to32_avx2;
         // p.weight_pp = x265_weight_pp_avx2; fails tests
 
         p.cu[BLOCK_16x16].calcresidual = x265_getResidual16_avx2;
@@ -1119,12 +1242,84 @@
         ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, avx2);
         ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, avx2);
         ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, avx2);
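
Note on the asm-primitives.cpp hunks above: they fill a table of function pointers, and each block guarded by a CPU flag overwrites entries set by an earlier, less capable block. The sketch below shows that pattern in a minimal, self-contained form. All names in it (the CPU_* bits, Primitives, setupPrimitives, the sum_* functions) are hypothetical and are not part of x265; the "optimized" variants simply call the C version so the example stays portable.

```cpp
#include <cstdint>

// Hypothetical capability bits, standing in for the X265_CPU_* flags.
enum : std::uint32_t { CPU_SSE2 = 1u << 0, CPU_SSSE3 = 1u << 1, CPU_AVX2 = 1u << 2 };

using SumFn = int (*)(const std::uint8_t*, int);

// Reference implementation in plain C++.
int sum_c(const std::uint8_t* p, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}

// Placeholders for optimized kernels; a real build would link assembly here.
int sum_sse2(const std::uint8_t* p, int n) { return sum_c(p, n); }
int sum_avx2(const std::uint8_t* p, int n) { return sum_c(p, n); }

struct Primitives
{
    SumFn sum;
};

// Start from the C fallback, then let each supported instruction set
// overwrite the entry. Later checks win, so the most capable kernel that
// the CPU supports ends up in the table.
Primitives setupPrimitives(std::uint32_t cpuMask)
{
    Primitives p{sum_c};
    if (cpuMask & CPU_SSE2)
        p.sum = sum_sse2;
    if (cpuMask & CPU_AVX2)
        p.sum = sum_avx2;
    return p;
}
```

Callers go through the table only (for example, setupPrimitives(mask).sum(buf, n)), so adding a new kernel is a one-line assignment in the matching block, which is what most of the added lines in this diff are.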

 

x265_1.6.tar.gz/source/common/x86/const-a.asm -> x265_1.7.tar.gz/source/common/x86/const-a.asm Changed

@@ -29,81 +29,100 @@
 
 SECTION_RODATA 32
 
-const pb_1,        times 32 db 1
+;; 8-bit constants
 
-const hsub_mul,    times 16 db 1, -1
-const pw_1,        times 16 dw 1
-const pw_16,       times 16 dw 16
-const pw_32,       times 16 dw 32
-const pw_128,      times 16 dw 128
-const pw_256,      times 16 dw 256
-const pw_257,      times 16 dw 257
-const pw_512,      times 16 dw 512
-const pw_1023,     times 8  dw 1023
-ALIGN 32
-const pw_1024,     times 16 dw 1024
-const pw_4096,     times 16 dw 4096
-const pw_00ff,     times 16 dw 0x00ff
-ALIGN 32
-const pw_pixel_max,times 16 dw ((1 << BIT_DEPTH)-1)
-const deinterleave_shufd, dd 0,4,1,5,2,6,3,7
-const pb_unpackbd1, times 2 db 0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3
-const pb_unpackbd2, times 2 db 4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7
-const pb_unpackwq1, db 0,1,0,1,0,1,0,1,2,3,2,3,2,3,2,3
-const pb_unpackwq2, db 4,5,4,5,4,5,4,5,6,7,6,7,6,7,6,7
-const pw_swap,      times 2 db 6,7,4,5,2,3,0,1
+const pb_0,                 times 16 db 0
+const pb_1,                 times 32 db 1
+const pb_2,                 times 32 db 2
+const pb_3,                 times 16 db 3
+const pb_4,                 times 32 db 4
+const pb_8,                 times 32 db 8
+const pb_15,                times 32 db 15
+const pb_16,                times 32 db 16
+const pb_32,                times 32 db 32
+const pb_64,                times 32 db 64
+const pb_128,               times 16 db 128
+const pb_a1,                times 16 db 0xa1
 
-const pb_2,        times 32 db 2
-const pb_4,        times 32 db 4
-const pb_16,       times 32 db 16
-const pb_64,       times 32 db 64
-const pb_01,       times  8 db 0,1
-const pb_0,        times 16 db 0
-const pb_a1,       times 16 db 0xa1
-const pb_3,        times 16 db 3
-const pb_8,        times 32 db 8
-const pb_32,       times 32 db 32
-const pb_128,      times 16 db 128
-const pb_shuf8x8c, db 0,0,0,0,2,2,2,2,4,4,4,4,6,6,6,6
+const pb_01,                times  8 db   0,   1
+const hsub_mul,             times 16 db   1,  -1
+const pw_swap,              times  2 db   6,   7,   4,   5,   2,   3,   0,   1
+const pb_unpackbd1,         times  2 db   0,   0,   0,   0,   1,   1,   1,   1,   2,   2,   2,   2,   3,   3,   3,   3
+const pb_unpackbd2,         times  2 db   4,   4,   4,   4,   5,   5,   5,   5,   6,   6,   6,   6,   7,   7,   7,   7
+const pb_unpackwq1,         times  1 db   0,   1,   0,   1,   0,   1,   0,   1,   2,   3,   2,   3,   2,   3,   2,   3
+const pb_unpackwq2,         times  1 db   4,   5,   4,   5,   4,   5,   4,   5,   6,   7,   6,   7,   6,   7,   6,   7
+const pb_shuf8x8c,          times  1 db   0,   0,   0,   0,   2,   2,   2,   2,   4,   4,   4,   4,   6,   6,   6,   6
+const pb_movemask,          times 16 db 0x00
+                            times 16 db 0xFF
+const pb_0000000000000F0F,  times  2 db 0xff, 0x00
+                            times 12 db 0x00
+const pb_000000000000000F,           db 0xff
+                            times 15 db 0x00
 
-const pw_0_15,     times 2 dw 0, 1, 2, 3, 4, 5, 6, 7
-const pw_2,        times 8 dw 2
-const pw_m2,       times 8 dw -2
-const pw_4,        times 8 dw 4
-const pw_8,        times 8 dw 8
-const pw_64,       times 8 dw 64
-const pw_256,      times 8 dw 256
-const pw_32_0,     times 4 dw 32,
-                   times 4 dw 0
-const pw_2000,     times 16 dw 0x2000
-const pw_8000,     times 8 dw 0x8000
-const pw_3fff,     times 8 dw 0x3fff
-const pw_ppppmmmm, dw 1,1,1,1,-1,-1,-1,-1
-const pw_ppmmppmm, dw 1,1,-1,-1,1,1,-1,-1
-const pw_pmpmpmpm, dw 1,-1,1,-1,1,-1,1,-1
-const pw_pmmpzzzz, dw 1,-1,-1,1,0,0,0,0
-const pd_1,        times 8 dd 1
-const pd_2,        times 8 dd 2
-const pd_4,        times 4 dd 4
-const pd_8,        times 4 dd 8
-const pd_16,       times 4 dd 16
-const pd_32,       times 4 dd 32
-const pd_64,       times 4 dd 64
-const pd_128,      times 4 dd 128
-const pd_256,      times 4 dd 256
-const pd_512,      times 4 dd 512
-const pd_1024,     times 4 dd 1024
-const pd_2048,     times 4 dd 2048
-const pd_ffff,     times 4 dd 0xffff
-const pd_32767,    times 4 dd 32767
-const pd_n32768,   times 4 dd 0xffff8000
-const pw_ff00,     times 8 dw 0xff00
+;; 16-bit constants
 
-const multi_2Row,  dw 1, 2, 3, 4, 1, 2, 3, 4
-const multiL,      dw 1, 2, 3, 4, 5, 6, 7, 8
-const multiH,      dw 9, 10, 11, 12, 13, 14, 15, 16
-const multiH2,     dw 17, 18, 19, 20, 21, 22, 23, 24
-const multiH3,     dw 25, 26, 27, 28, 29, 30, 31, 32
+const pw_1,                 times 16 dw 1
+const pw_2,                 times  8 dw 2
+const pw_m2,                times  8 dw -2
+const pw_4,                 times  8 dw 4
+const pw_8,                 times  8 dw 8
+const pw_16,                times 16 dw 16
+const pw_15,                times 16 dw 15
+const pw_31,                times 16 dw 31
+const pw_32,                times 16 dw 32
+const pw_64,                times  8 dw 64
+const pw_128,               times 16 dw 128
+const pw_256,               times 16 dw 256
+const pw_257,               times 16 dw 257
+const pw_512,               times 16 dw 512
+const pw_1023,              times  8 dw 1023
+const pw_1024,              times 16 dw 1024
+const pw_4096,              times 16 dw 4096
+const pw_00ff,              times 16 dw 0x00ff
+const pw_ff00,              times  8 dw 0xff00
+const pw_2000,              times 16 dw 0x2000
+const pw_8000,              times  8 dw 0x8000
+const pw_3fff,              times  8 dw 0x3fff
+const pw_32_0,              times  4 dw 32,
+                            times  4 dw 0
+const pw_pixel_max,         times 16 dw ((1 << BIT_DEPTH)-1)
+
+const pw_0_15,              times  2 dw   0,   1,   2,   3,   4,   5,   6,   7
+const pw_ppppmmmm,          times  1 dw   1,   1,   1,   1,  -1,  -1,  -1,  -1
+const pw_ppmmppmm,          times  1 dw   1,   1,  -1,  -1,   1,   1,  -1,  -1
+const pw_pmpmpmpm,          times  1 dw   1,  -1,   1,  -1,   1,  -1,   1,  -1
+const pw_pmmpzzzz,          times  1 dw   1,  -1,  -1,   1,   0,   0,   0,   0
+const multi_2Row,           times  1 dw   1,   2,   3,   4,   1,   2,   3,   4
+const multiH,               times  1 dw   9,  10,  11,  12,  13,  14,  15,  16
+const multiH3,              times  1 dw  25,  26,  27,  28,  29,  30,  31,  32
+const multiL,               times  1 dw   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,  15,  16
+const multiH2,              times  1 dw  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,  28,  29,  30,  31,  32
+const pw_planar16_mul,      times  1 dw  15,  14,  13,  12,  11,  10,   9,   8,   7,   6,   5,   4,   3,   2,   1,   0
+const pw_planar32_mul,      times  1 dw  31,  30,  29,  28,  27,  26,  25,  24,  23,  22,  21,  20,  19,  18,  17,  16
+const pw_FFFFFFFFFFFFFFF0,           dw 0x00
+                            times 7  dw 0xff
+
+
+;; 32-bit constants
+
+const pd_1,                 times  8 dd 1
+const pd_2,                 times  8 dd 2
+const pd_4,                 times  4 dd 4
+const pd_8,                 times  4 dd 8
+const pd_16,                times  4 dd 16
+const pd_32,                times  4 dd 32
+const pd_64,                times  4 dd 64
+const pd_128,               times  4 dd 128
+const pd_256,               times  4 dd 256
+const pd_512,               times  4 dd 512
+const pd_1024,              times  4 dd 1024
+const pd_2048,              times  4 dd 2048
+const pd_ffff,              times  4 dd 0xffff
+const pd_32767,             times  4 dd 32767
+const pd_n32768,            times  4 dd 0xffff8000
+
+const trans8_shuf,          times  1 dd   0,   4,   1,   5,   2,   6,   3,   7
+const deinterleave_shufd,   times  1 dd   0,   4,   1,   5,   2,   6,   3,   7
 
 const popcnt_table
 %assign x 0
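
The `trans8_shuf` and `deinterleave_shufd` tables above hold the same dword index pattern: 0, 4, 1, 5, 2, 6, 3, 7. As a rough illustration of what such a table does when it is the index operand of a 256-bit dword permute (`vpermd`, which the AVX2 `dst4` routine in dct8.asm uses with `trans8_shuf`), the C sketch below emulates the permute in scalar code. The function name `permute_dwords` is hypothetical and not part of x265.

```c
#include <stdint.h>

/* Index pattern shared by trans8_shuf and deinterleave_shufd in the diff. */
const int32_t trans8_shuf[8] = { 0, 4, 1, 5, 2, 6, 3, 7 };

/* Scalar stand-in for vpermd: every destination dword i receives the
 * source dword selected by the low three bits of idx[i]. */
void permute_dwords(const int32_t src[8], const int32_t idx[8], int32_t dst[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = src[idx[i] & 7];
}
```

With a source of `a0 a1 a2 a3 b0 b1 b2 b3` (low lane, then high lane), this index pattern yields `a0 b0 a1 b1 a2 b2 a3 b3`, i.e. it interleaves the dwords of the two 128-bit lanes.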

 

x265_1.6.tar.gz/source/common/x86/dct8.asm -> x265_1.7.tar.gz/source/common/x86/dct8.asm Changed

@@ -261,6 +261,11 @@
                 times 2 dw 84, -29, -74, 55
                 times 2 dw 55, -84, 74, -29
 
+pw_dst4_tab:    times 4 dw 29,  55,  74,  84
+                times 4 dw 74,  74,   0, -74
+                times 4 dw 84, -29, -74,  55
+                times 4 dw 55, -84,  74, -29
+
 tab_idst4:      times 4 dw 29, +84
                 times 4 dw +74, +55
                 times 4 dw 55, -29
@@ -270,6 +275,16 @@
                 times 4 dw 84, +55
                 times 4 dw -74, -29
 
+pw_idst4_tab:   times 4 dw  29,  84
+                times 4 dw  55, -29
+                times 4 dw  74,  55
+                times 4 dw  74, -84
+                times 4 dw  74, -74
+                times 4 dw  84,  55
+                times 4 dw  0,   74
+                times 4 dw -74, -29
+pb_idst4_shuf:  times 2 db 0, 1, 8, 9, 2, 3, 10, 11, 4, 5, 12, 13, 6, 7, 14, 15
+
 tab_dct8_1:     times 2 dw 89, 50, 75, 18
                 times 2 dw 75, -89, -18, -50
                 times 2 dw 50, 18, -89, 75
@@ -316,7 +331,7 @@
 cextern pd_1024
 cextern pd_2048
 cextern pw_ppppmmmm
-
+cextern trans8_shuf
 ;------------------------------------------------------
 ;void dct4(const int16_t* src, int16_t* dst, intptr_t srcStride)
 ;------------------------------------------------------
@@ -656,6 +671,59 @@
 
     RET
 
+;------------------------------------------------------------------
+;void dst4(const int16_t* src, int16_t* dst, intptr_t srcStride)
+;------------------------------------------------------------------
+INIT_YMM avx2
+cglobal dst4, 3, 4, 6
+%if BIT_DEPTH == 8
+  %define       DST_SHIFT 1
+  vpbroadcastd  m5, [pd_1]
+%elif BIT_DEPTH == 10
+  %define       DST_SHIFT 3
+  vpbroadcastd  m5, [pd_4]
+%endif
+    mova        m4, [trans8_shuf]
+    add         r2d, r2d
+    lea         r3, [pw_dst4_tab]
+
+    movq        xm0, [r0 + 0 * r2]
+    movhps      xm0, [r0 + 1 * r2]
+    lea         r0, [r0 + 2 * r2]
+    movq        xm1, [r0]
+    movhps      xm1, [r0 + r2]
+
+    vinserti128 m0, m0, xm1, 1          ; m0 = src[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
+
+    pmaddwd     m2, m0, [r3 + 0 * 32]
+    pmaddwd     m1, m0, [r3 + 1 * 32]
+    phaddd      m2, m1
+    paddd       m2, m5
+    psrad       m2, DST_SHIFT
+    pmaddwd     m3, m0, [r3 + 2 * 32]
+    pmaddwd     m1, m0, [r3 + 3 * 32]
+    phaddd      m3, m1
+    paddd       m3, m5
+    psrad       m3, DST_SHIFT
+    packssdw    m2, m3
+    vpermd      m2, m4, m2
+
+    vpbroadcastd m5, [pd_128]
+    pmaddwd     m0, m2, [r3 + 0 * 32]
+    pmaddwd     m1, m2, [r3 + 1 * 32]
+    phaddd      m0, m1
+    paddd       m0, m5
+    psrad       m0, 8
+    pmaddwd     m3, m2, [r3 + 2 * 32]
+    pmaddwd     m2, m2, [r3 + 3 * 32]
+    phaddd      m3, m2
+    paddd       m3, m5
+    psrad       m3, 8
+    packssdw    m0, m3
+    vpermd      m0, m4, m0
+    movu        [r1], m0
+    RET
+
 ;-------------------------------------------------------
 ;void idst4(const int16_t* src, int16_t* dst, intptr_t dstStride)
 ;-------------------------------------------------------
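
For readers who do not want to trace the AVX2 `dst4` routine instruction by instruction, here is a minimal scalar sketch in C of the arithmetic it appears to perform: two 1-D passes with the `pw_dst4_tab` coefficients, each followed by rounding and an arithmetic right shift. The names `dst4_mat`, `sat16`, `dst4_pass` and `dst4_ref` are hypothetical, and the sketch covers only the 8-bit case (round 1, shift 1 in the first pass; round 128, shift 8 in the second). It has not been checked bit for bit against the assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Coefficients taken from pw_dst4_tab in the diff (one row per output). */
const int16_t dst4_mat[4][4] = {
    { 29,  55,  74,  84 },
    { 74,  74,   0, -74 },
    { 84, -29, -74,  55 },
    { 55, -84,  74, -29 },
};

/* Saturate to int16_t, mirroring what packssdw does in the assembly. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* One 1-D pass over the rows of `in`; the result is written transposed:
 * out[4*k + i] = (sum_j M[k][j] * in[4*i + j] + round) >> shift.
 * Assumes >> on a negative int is an arithmetic shift, like psrad. */
void dst4_pass(const int16_t *in, int16_t *out, int shift)
{
    assert(shift > 0);
    const int32_t round = 1 << (shift - 1);
    for (int i = 0; i < 4; i++) {
        for (int k = 0; k < 4; k++) {
            int32_t sum = 0;
            for (int j = 0; j < 4; j++)
                sum += dst4_mat[k][j] * in[4 * i + j];
            out[4 * k + i] = sat16((sum + round) >> shift);
        }
    }
}

/* 8-bit variant. srcStride is counted in int16_t elements here, whereas
 * the assembly doubles it (add r2d, r2d) to obtain a byte offset. */
void dst4_ref(const int16_t *src, int16_t *dst, intptr_t srcStride)
{
    int16_t block[16], tmp[16];
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            block[4 * r + c] = src[r * srcStride + c];
    dst4_pass(block, tmp, 1);   /* first pass: round 1, shift 1 */
    dst4_pass(tmp, dst, 8);     /* second pass: round 128, shift 8 */
}
```

For a 10-bit build the diff uses round 4 and shift 3 in the first pass; the second pass is unchanged.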
@@ -748,6 +816,81 @@
     movhps      [r1 + r2], m1
     RET
 
+;-----------------------------------------------------------------
+;void idst4(const int16_t* src, int16_t* dst, intptr_t dstStride)
+;-----------------------------------------------------------------
+INIT_YMM avx2
+cglobal idst4, 3, 4, 6
+%if BIT_DEPTH == 8
+  vpbroadcastd  m4, [pd_2048]
+  %define       IDCT4_SHIFT 12
+%elif BIT_DEPTH == 10
+  vpbroadcastd  m4, [pd_512]
+  %define       IDCT4_SHIFT 10
+%else
+  %error Unsupported BIT_DEPTH!
+%endif
+    add         r2d, r2d
+    lea         r3, [pw_idst4_tab]
+
+    movu        xm0, [r0 + 0 * 16]
+    movu        xm1, [r0 + 1 * 16]
+
+    punpcklwd   m2, m0, m1
+    punpckhwd   m0, m1
+
+    vinserti128 m2, m2, xm2, 1
+    vinserti128 m0, m0, xm0, 1
+
+    vpbroadcastd m5, [pd_64]
+    pmaddwd     m1, m2, [r3 + 0 * 32]
+    pmaddwd     m3, m0, [r3 + 1 * 32]
+    paddd       m1, m3
+    paddd       m1, m5
+    psrad       m1, 7
+    pmaddwd     m3, m2, [r3 + 2 * 32]
+    pmaddwd     m0, [r3 + 3 * 32]
+    paddd       m3, m0
+    paddd       m3, m5
+    psrad       m3, 7
+
+    packssdw    m0, m1, m3
+    pshufb      m0, [pb_idst4_shuf]
+    vpermq      m1, m0, 11101110b
+
+    punpcklwd   m2, m0, m1
+    punpckhwd   m0, m1
+    punpcklwd   m1, m2, m0
+    punpckhwd   m2, m0
+
+    vpermq      m1, m1, 01000100b
+    vpermq      m2, m2, 01000100b
+
+    pmaddwd     m0, m1, [r3 + 0 * 32]
+    pmaddwd     m3, m2, [r3 + 1 * 32]
+    paddd       m0, m3
+    paddd       m0, m4
+    psrad       m0, IDCT4_SHIFT
+    pmaddwd     m3, m1, [r3 + 2 * 32]
+    pmaddwd     m2, m2, [r3 + 3 * 32]
+    paddd       m3, m2
+    paddd       m3, m4
+    psrad       m3, IDCT4_SHIFT
+
+    packssdw    m0, m3
+    pshufb      m1, m0, [pb_idst4_shuf]
+    vpermq      m0, m1, 11101110b
+
+    punpcklwd   m2, m1, m0
+    movq        [r1 + 0 * r2], xm2
+    movhps      [r1 + 1 * r2], xm2
+
+    punpckhwd   m1, m0
+    movq        [r1 + 2 * r2], xm1
+    lea         r1, [r1 + 2 * r2]
+    movhps      [r1 + r2], xm1
+    RET
+
 ;-------------------------------------------------------
 ; void dct8(const int16_t* src, int16_t* dst, intptr_t srcStride)
 ;-------------------------------------------------------
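
The inverse routine can be read the same way. The pairs in `pw_idst4_tab` are the columns of the forward matrix regrouped for `pmaddwd`, so a scalar view is two passes with the transposed matrix. The C sketch below is a hedged illustration for the 8-bit case only (round 64, shift 7, then round 2048, shift 12, as in the diff). All names in it are hypothetical, and it has not been validated bit for bit against the assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Forward 4x4 DST matrix; the inverse passes use its transpose. */
const int16_t idst4_mat[4][4] = {
    { 29,  55,  74,  84 },
    { 74,  74,   0, -74 },
    { 84, -29, -74,  55 },
    { 55, -84,  74, -29 },
};

/* Saturate to int16_t, mirroring packssdw. */
static int16_t sat16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* One inverse 1-D pass over the columns of `in`, written transposed:
 * out[4*i + k] = (sum_j M[j][k] * in[4*j + i] + round) >> shift.
 * Assumes >> on a negative int is an arithmetic shift, like psrad. */
void idst4_pass(const int16_t *in, int16_t *out, int shift)
{
    assert(shift > 0);
    const int32_t round = 1 << (shift - 1);
    for (int i = 0; i < 4; i++) {
        for (int k = 0; k < 4; k++) {
            int32_t sum = 0;
            for (int j = 0; j < 4; j++)
                sum += idst4_mat[j][k] * in[4 * j + i];
            out[4 * i + k] = sat16((sum + round) >> shift);
        }
    }
}

/* 8-bit variant. src is a packed 4x4 block; dstStride is counted in
 * int16_t elements (the assembly doubles it to get a byte offset). */
void idst4_ref(const int16_t *src, int16_t *dst, intptr_t dstStride)
{
    int16_t tmp[16], blk[16];
    idst4_pass(src, tmp, 7);    /* first pass: round 64, shift 7 */
    idst4_pass(tmp, blk, 12);   /* second pass: round 2048, shift 12 */
    for (int r = 0; r < 4; r++)
        for (int c = 0; c < 4; c++)
            dst[r * dstStride + c] = blk[4 * r + c];
}
```

For a 10-bit build the diff switches the second pass to round 512 and shift 10.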

 

x265_1.6.tar.gz/source/common/x86/dct8.h -> x265_1.7.tar.gz/source/common/x86/dct8.h Changed

@@ -26,6 +26,7 @@
 void x265_dct4_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dct8_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dst4_ssse3(const int16_t* src, int16_t* dst, intptr_t srcStride);
+void x265_dst4_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dct8_sse4(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dct4_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
 void x265_dct8_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
@@ -33,6 +34,7 @@
 void x265_dct32_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
 
 void x265_idst4_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
+void x265_idst4_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride);
 void x265_idct4_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
 void x265_idct4_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride);
 void x265_idct8_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
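
The two added prototypes only declare the AVX2 entry points; this part of the diff does not show how they are selected at run time. One plausible arrangement is a function pointer chosen from detected CPU features, sketched below in C. Everything here is hypothetical: the `transform_fn` typedef, the `CPU_FLAG_*` bits, `select_dst4`, and the stub bodies that stand in for the real assembly routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Same shape as the dst4/idst4 prototypes in dct8.h. */
typedef void (*transform_fn)(const int16_t *src, int16_t *dst, intptr_t stride);

/* Hypothetical feature bits; x265 has its own CPU detection. */
enum { CPU_FLAG_SSSE3 = 1 << 0, CPU_FLAG_AVX2 = 1 << 1 };

/* Stubs standing in for the assembly: each only tags dst[0]. */
void demo_dst4_ssse3(const int16_t *src, int16_t *dst, intptr_t stride)
{
    (void)src; (void)stride;
    dst[0] = 1;
}

void demo_dst4_avx2(const int16_t *src, int16_t *dst, intptr_t stride)
{
    (void)src; (void)stride;
    dst[0] = 2;
}

/* Prefer the widest instruction set available; NULL means the caller
 * should keep its plain C fallback. */
transform_fn select_dst4(int cpu_flags)
{
    if (cpu_flags & CPU_FLAG_AVX2)
        return demo_dst4_avx2;
    if (cpu_flags & CPU_FLAG_SSSE3)
        return demo_dst4_ssse3;
    return NULL;
}
```

Checking the newer extension first means a CPU that reports both flags gets the AVX2 variant.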

 

x265_1.6.tar.gz/source/common/x86/intrapred.h -> x265_1.7.tar.gz/source/common/x86/intrapred.h Changed

@@ -34,6 +34,7 @@
 void x265_intra_pred_dc8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 void x265_intra_pred_dc16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 void x265_intra_pred_dc32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
+void x265_intra_pred_dc32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
 
 void x265_intra_pred_planar4_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar8_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
@@ -43,6 +44,8 @@
 void x265_intra_pred_planar8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 void x265_intra_pred_planar32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
+void x265_intra_pred_planar16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
+void x265_intra_pred_planar32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
 
 #define DECL_ANG(bsize, mode, cpu) \
     void x265_intra_pred_ang ## bsize ## _ ## mode ## _ ## cpu(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
@@ -55,6 +58,16 @@
 DECL_ANG(4, 7, sse2);
 DECL_ANG(4, 8, sse2);
 DECL_ANG(4, 9, sse2);
+DECL_ANG(4, 10, sse2);
+DECL_ANG(4, 11, sse2);
+DECL_ANG(4, 12, sse2);
+DECL_ANG(4, 13, sse2);
+DECL_ANG(4, 14, sse2);
+DECL_ANG(4, 15, sse2);
+DECL_ANG(4, 16, sse2);
+DECL_ANG(4, 17, sse2);
+DECL_ANG(4, 18, sse2);
+DECL_ANG(4, 26, sse2);
 
 DECL_ANG(4, 2, ssse3);
 DECL_ANG(4, 3, sse4);
@@ -174,6 +187,34 @@
 DECL_ANG(32, 33, sse4);
 
 #undef DECL_ANG
+void x265_intra_pred_ang4_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_14_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_15_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_17_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_19_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_20_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang4_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
@@ -192,6 +233,24 @@
 void x265_intra_pred_ang8_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_14_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_15_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_20_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang8_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang16_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang16_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang16_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang16_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
@@ -212,8 +271,17 @@
 void x265_intra_pred_ang32_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang32_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
 void x265_intra_pred_ang32_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_intra_pred_ang32_18_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+void x265_all_angs_pred_4x4_sse2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 void x265_all_angs_pred_32x32_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
+void x265_all_angs_pred_4x4_avx2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
 #endif // ifndef X265_INTRAPRED_H

 

x265_1.6.tar.gz/source/common/x86/intrapred16.asm -> x265_1.7.tar.gz/source/common/x86/intrapred16.asm Changed

@@ -690,6 +690,508 @@
 %endrep
     RET
 
+;-----------------------------------------------------------------------------------------
+; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter)
+;-----------------------------------------------------------------------------------------
+INIT_XMM sse2
+cglobal intra_pred_ang4_2, 3,5,4
+    lea         r4,            [r2 + 4]
+    add         r2,            20
+    cmp         r3m,           byte 34
+    cmove       r2,            r4
+
+    add         r1,            r1
+    movu        m0,            [r2]
+    movh        [r0],          m0
+    psrldq      m0,            2
+    movh        [r0 + r1],     m0
+    psrldq      m0,            2
+    movh        [r0 + r1 * 2], m0
+    lea         r1,            [r1 * 3]
+    psrldq      m0,            2
+    movh        [r0 + r1],     m0
+    RET
+
+cglobal intra_pred_ang4_3, 3,5,8
+    mov         r4d, 2
+    cmp         r3m, byte 33
+    mov         r3d, 18
+    cmove       r3d, r4d
+
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
+
+    mova        m2, m0
+    psrldq      m0, 2
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
+    mova        m3, m0
+    psrldq      m0, 2
+    punpcklwd   m3, m0      ; [6 5 5 4 4 3 3 2]
+    mova        m4, m0
+    psrldq      m0, 2
+    punpcklwd   m4, m0      ; [7 6 6 5 5 4 4 3]
+    mova        m5, m0
+    psrldq      m0, 2
+    punpcklwd   m5, m0      ; [8 7 7 6 6 5 5 4]
+
+
+    lea         r3, [ang_table + 20 * 16]
+    mova        m0, [r3 + 6 * 16]   ; [26]
+    mova        m1, [r3]            ; [20]
+    mova        m6, [r3 - 6 * 16]   ; [14]
+    mova        m7, [r3 - 12 * 16]  ; [ 8]
+    jmp        .do_filter4x4
+
+
+ALIGN 16
+.do_filter4x4:
+    lea     r4, [pd_16]
+    pmaddwd m2, m0
+    paddd   m2, [r4]
+    psrld   m2, 5
+
+    pmaddwd m3, m1
+    paddd   m3, [r4]
+    psrld   m3, 5
+    packssdw m2, m3
+
+    pmaddwd m4, m6
+    paddd   m4, [r4]
+    psrld   m4, 5
+
+    pmaddwd m5, m7
+    paddd   m5, [r4]
+    psrld   m5, 5
+    packssdw m4, m5
+
+    jz         .store
+
+    ; transpose 4x4
+    punpckhwd    m0, m2, m4
+    punpcklwd    m2, m4
+    punpckhwd    m4, m2, m0
+    punpcklwd    m2, m0
+
+.store:
+    add         r1, r1
+    movh        [r0], m2
+    movhps      [r0 + r1], m2
+    movh        [r0 + r1 * 2], m4
+    lea         r1, [r1 * 3]
+    movhps      [r0 + r1], m4
+    RET
+
+cglobal intra_pred_ang4_4, 3,5,8
+    mov         r4d, 2
+    cmp         r3m, byte 32
+    mov         r3d, 18
+    cmove       r3d, r4d
+
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
+    mova        m2, m0
+    psrldq      m0, 2
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
+    mova        m3, m0
+    psrldq      m0, 2
+    punpcklwd   m3, m0      ; [6 5 5 4 4 3 3 2]
+    mova        m4, m3
+    mova        m5, m0
+    psrldq      m0, 2
+    punpcklwd   m5, m0      ; [7 6 6 5 5 4 4 3]
+
+    lea         r3, [ang_table + 18 * 16]
+    mova        m0, [r3 +  3 * 16]  ; [21]
+    mova        m1, [r3 -  8 * 16]  ; [10]
+    mova        m6, [r3 + 13 * 16]  ; [31]
+    mova        m7, [r3 +  2 * 16]  ; [20]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_5, 3,5,8
+    mov         r4d, 2
+    cmp         r3m, byte 31
+    mov         r3d, 18
+    cmove       r3d, r4d
+
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
+    mova        m2, m0
+    psrldq      m0, 2
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
+    mova        m3, m0
+    psrldq      m0, 2
+    punpcklwd   m3, m0      ; [6 5 5 4 4 3 3 2]
+    mova        m4, m3
+    mova        m5, m0
+    psrldq      m0, 2
+    punpcklwd   m5, m0      ; [7 6 6 5 5 4 4 3]
+
+    lea         r3, [ang_table + 10 * 16]
+    mova        m0, [r3 +  7 * 16]  ; [17]
+    mova        m1, [r3 -  8 * 16]  ; [ 2]
+    mova        m6, [r3 +  9 * 16]  ; [19]
+    mova        m7, [r3 -  6 * 16]  ; [ 4]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_6, 3,5,8
+    mov         r4d, 2
+    cmp         r3m, byte 30
+    mov         r3d, 18
+    cmove       r3d, r4d
+
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
+    mova        m2, m0
+    psrldq      m0, 2
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
+    mova        m3, m2
+    mova        m4, m0
+    psrldq      m0, 2
+    punpcklwd   m4, m0      ; [6 5 5 4 4 3 3 2]
+    mova        m5, m4
+
+    lea         r3, [ang_table + 19 * 16]
+    mova        m0, [r3 -  6 * 16]  ; [13]
+    mova        m1, [r3 +  7 * 16]  ; [26]
+    mova        m6, [r3 - 12 * 16]  ; [ 7]
+    mova        m7, [r3 +  1 * 16]  ; [20]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_7, 3,5,8
+    mov         r4d, 2
+    cmp         r3m, byte 29
+    mov         r3d, 18
+    cmove       r3d, r4d
+
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
+    mova        m2, m0
+    psrldq      m0, 2
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]
+    mova        m3, m2
+    mova        m4, m2
+    mova        m5, m0
+    psrldq      m0, 2
+    punpcklwd   m5, m0      ; [6 5 5 4 4 3 3 2]
+
+    lea         r3, [ang_table + 20 * 16]
+    mova        m0, [r3 - 11 * 16]  ; [ 9]
+    mova        m1, [r3 -  2 * 16]  ; [18]
+    mova        m6, [r3 +  7 * 16]  ; [27]
+    mova        m7, [r3 - 16 * 16]  ; [ 4]
+    jmp         mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4)
+
+cglobal intra_pred_ang4_8, 3,5,8
+    mov         r4d, 2
+    cmp         r3m, byte 28
+    mov         r3d, 18
+    cmove       r3d, r4d
+
+    movu        m0, [r2 + r3]   ; [8 7 6 5 4 3 2 1]
+    mova        m2, m0
+    psrldq      m0, 2
+    punpcklwd   m2, m0      ; [5 4 4 3 3 2 2 1]

 

x265_1.6.tar.gz/source/common/x86/intrapred8.asm -> x265_1.7.tar.gz/source/common/x86/intrapred8.asm Changed

@@ -28,6 +28,7 @@
 SECTION_RODATA 32
 
 intra_pred_shuff_0_8:    times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
+intra_pred_shuff_15_0:   times 2 db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
 
 pb_0_8        times 8 db  0,  8
 pb_unpackbw1  times 2 db  1,  8,  2,  8,  3,  8,  4,  8
@@ -58,7 +59,6 @@
 c_mode16_18:    db 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1
 
 ALIGN 32
-trans8_shuf:          dd 0, 4, 1, 5, 2, 6, 3, 7
 c_ang8_src1_9_2_10:   db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9
 c_ang8_26_20:         db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
 c_ang8_src3_11_4_12:  db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11
@@ -124,6 +124,37 @@
                       db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
                       db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 
+ALIGN 32
+c_ang16_mode_11:      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                      db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                      db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                      db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                      db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                      db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+
+ALIGN 32
+c_ang16_mode_12:      db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19
+                      db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                      db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9
+                      db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
+                      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                      db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                      db  8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+
+ALIGN 32
+c_ang16_mode_13:      db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
+                      db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                      db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25
+                      db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
 ALIGN 32
 c_ang16_mode_28:      db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
@@ -135,6 +166,15 @@
                       db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
                       db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
+ALIGN 32
+c_ang16_mode_9:       db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                      db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                      db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 
 ALIGN 32
 c_ang16_mode_27:      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
@@ -150,6 +190,15 @@
 ALIGN 32
 intra_pred_shuff_0_15: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 15
 
+ALIGN 32
+c_ang16_mode_8:       db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23
+                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1
+                      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
 ALIGN 32
 c_ang16_mode_29:     db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9,  14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
@@ -162,6 +211,15 @@
                      db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
+ALIGN 32
+c_ang16_mode_7:      db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                     db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                     db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
+                     db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                     db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                     db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                     db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
+                     db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
 ALIGN 32
 c_ang16_mode_30:      db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
@@ -175,6 +233,17 @@
                       db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
 
+
+ALIGN 32
+c_ang16_mode_6:       db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
+                      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
+                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15
+                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9
+                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                      db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
 ALIGN 32
 c_ang16_mode_31:      db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
                       db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19
@@ -186,6 +255,17 @@
                       db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
                       db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
+
+ALIGN 32
+c_ang16_mode_5:       db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25
+                      db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                      db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
+                      db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                      db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
+                      db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                      db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
 ALIGN 32
 c_ang16_mode_32:      db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21
                       db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31
@@ -200,6 +280,16 @@
                       db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
 
 ALIGN 32
+c_ang16_mode_4:       db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29
+                      db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                      db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7
+                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                      db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17
+                      db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                      db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27
+                      db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+
+ALIGN 32
 c_ang16_mode_33:     db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
                      db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
                      db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
@@ -216,6 +306,16 @@
                      db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 
 ALIGN 32
+c_ang16_mode_3:      db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                     db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4
+                     db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                     db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                     db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                     db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                     db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                     db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
+
+ALIGN 32
 c_ang16_mode_24:     db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
                      db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
                      db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2
@@ -376,6 +476,191 @@
                    db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11
                    db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0
 
+
+ALIGN 32
+c_ang32_mode_33:   db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                   db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                   db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                   db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                   db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                   db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                   db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                   db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                   db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30
+                   db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24
+                   db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18
+                   db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12
+                   db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6
+                   db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26
+                   db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20
+                   db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14
+                   db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8
+                   db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28
+                   db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22
+                   db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16
+                   db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10
+                   db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30

 

x265_1.6.tar.gz/source/common/x86/intrapred8_allangs.asm -> x265_1.7.tar.gz/source/common/x86/intrapred8_allangs.asm Changed

@@ -2,7 +2,7 @@
 ;* Copyright (C) 2013 x265 project
 ;*
 ;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com>
-;*          Praveen Tiwari <praveen@multicorewareinc.com>
+;*          Praveen Tiwari <praveen@multicorewareinc.com>
 ;*
 ;* This program is free software; you can redistribute it and/or modify
 ;* it under the terms of the GNU General Public License as published by
@@ -27,6 +27,64 @@
 
 SECTION_RODATA 32
 
+all_ang4_shuff: db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
+                db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7
+                db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6
+                db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5
+                db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5
+                db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4
+                db 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3
+                db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12
+                db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 4, 0, 0, 9, 9, 10, 10, 11
+                db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11
+                db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 4, 2, 2, 0, 0, 9, 9, 10
+                db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 3, 2, 2, 0, 0, 9, 9, 10
+                db 0, 9, 9, 10, 10, 11, 11, 12, 1, 0, 0, 9, 9, 10, 10, 11, 2, 1, 1, 0, 0, 9, 9, 10, 4, 2, 2, 1, 1, 0, 0, 9
+                db 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0, 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0
+                db 0, 1, 1, 2, 2, 3, 3, 4, 9, 0, 0, 1, 1, 2, 2, 3, 10, 9, 9, 0, 0, 1, 1, 2, 12, 10, 10, 9, 9, 0, 0, 1
+                db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 11, 10, 10, 0, 0, 1, 1, 2
+                db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 12, 10, 10, 0, 0, 1, 1, 2
+                db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3
+                db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 12, 0, 0, 1, 1, 2, 2, 3
+                db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4
+                db 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4
+                db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5
+                db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6
+                db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 2, 3, 3, 4, 4, 5, 5, 6
+                db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7
+                db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7, 4, 5, 5, 6, 6, 7, 7, 8
+                db 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8
+
+all_ang4: db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8
+          db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20
+          db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4
+          db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20
+          db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4
+          db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20
+          db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8
+          db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24
+          db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12
+          db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28
+          db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12
+          db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28
+          db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12
+          db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24
+          db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24
+          db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12
+          db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28
+          db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12
+          db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28
+          db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12
+          db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24
+          db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8
+          db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20
+          db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4
+          db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20
+          db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4
+          db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20
+          db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8
+
+
 SECTION .text
 
 ; global constant
@@ -34,9 +92,14 @@
 
 ; common constant with intrapred8.asm
 cextern ang_table
+cextern pw_ang_table
 cextern tab_S1
 cextern tab_S2
 cextern tab_Si
+cextern pw_16
+cextern pb_000000000000000F
+cextern pb_0000000000000F0F
+cextern pw_FFFFFFFFFFFFFFF0
 
 
 ;-----------------------------------------------------------------------------
@@ -23006,3 +23069,1098 @@
     palignr    m4,              m2,       m1,    14
     movu       [r0 + 2111 * 16],   m4
     RET
+
+
+;-----------------------------------------------------------------------------
+; void all_angs_pred_4x4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma)
+;-----------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal all_angs_pred_4x4, 4, 4, 6
+
+    mova           m5, [pw_1024]
+    lea            r2, [all_ang4]
+    lea            r3, [all_ang4_shuff]
+
+; mode 2
+
+    vbroadcasti128 m0, [r1 + 9]
+    mova           xm1, xm0
+    psrldq         xm1, 1
+    pshufb         xm1, [r3]
+    movu           [r0], xm1
+
+; mode 3
+
+    pshufb         m1, m0, [r3 + 1 * mmsize]
+    pmaddubsw      m1, [r2]
+    pmulhrsw       m1, m5
+
+; mode 4
+
+    pshufb         m2, m0, [r3 + 2 * mmsize]
+    pmaddubsw      m2, [r2 + 1 * mmsize]
+    pmulhrsw       m2, m5
+    packuswb       m1, m2
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (3 - 2) * 16], m1
+
+; mode 5
+
+    pshufb         m1, m0, [r3 + 2 * mmsize]
+    pmaddubsw      m1, [r2 + 2 * mmsize]
+    pmulhrsw       m1, m5
+
+; mode 6
+
+    pshufb         m2, m0, [r3 + 3 * mmsize]
+    pmaddubsw      m2, [r2 + 3 * mmsize]
+    pmulhrsw       m2, m5
+    packuswb       m1, m2
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (5 - 2) * 16], m1
+
+    add            r3, 4 * mmsize
+    add            r2, 4 * mmsize
+
+; mode 7
+
+    pshufb         m1, m0, [r3 + 0 * mmsize]
+    pmaddubsw      m1, [r2 + 0 * mmsize]
+    pmulhrsw       m1, m5
+
+; mode 8
+
+    pshufb         m2, m0, [r3 + 1 * mmsize]
+    pmaddubsw      m2, [r2 + 1 * mmsize]
+    pmulhrsw       m2, m5
+    packuswb       m1, m2
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (7 - 2) * 16], m1
+
+; mode 9
+
+    pshufb         m1, m0, [r3 + 1 * mmsize]
+    pmaddubsw      m1, [r2 + 2 * mmsize]
+    pmulhrsw       m1, m5
+    packuswb       m1, m1
+    vpermq         m1, m1, 11011000b
+    movu           [r0 + (9 - 2) * 16], xm1
+
+; mode 10
+
+    pshufb         xm1, xm0, [r3 + 2 * mmsize]
+    movu           [r0 + (10 - 2) * 16], xm1
+
+    pxor           xm1, xm1
+    movd           xm2, [r1 + 1]
+    pshufd         xm3, xm2, 0
+    punpcklbw      xm3, xm1
+    pinsrb         xm2, [r1], 0
+    pshufb         xm4, xm2, xm1
+    punpcklbw      xm4, xm1
+    psubw          xm3, xm4
+    psraw          xm3, 1
+    pshufb         xm4, xm0, xm1
+    punpcklbw      xm4, xm1
+    paddw          xm3, xm4
+    packuswb       xm3, xm1
+
+    pextrb         [r0 + 128], xm3, 0
+    pextrb         [r0 + 132], xm3, 1
+    pextrb         [r0 + 136], xm3, 2
+    pextrb         [r0 + 140], xm3, 3
+
+; mode 11
+
+    vbroadcasti128 m0, [r1]
+    pshufb         m1, m0, [r3 + 3 * mmsize]
+    pmaddubsw      m1, [r2 + 3 * mmsize]
+    pmulhrsw       m1, m5
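
Note on the arithmetic in the AVX2 routine above: each angular mode blends two neighbouring reference pixels with a weight pair taken from all_ang4 (every pair in that table sums to 32), using pmaddubsw followed by pmulhrsw against pw_1024. Multiplying by 1024 with pmulhrsw is the same as adding 16 and shifting right by 5. The C sketch below is a scalar model of a single output lane, not code from x265. The function names are hypothetical, and pw_1024 is assumed to hold 1024 in every word, as its name suggests.

```c
#include <stdint.h>

/* Scalar model of one pmulhrsw lane: (((a * b) >> 14) + 1) >> 1. */
static int16_t pmulhrsw_lane(int16_t a, int16_t b)
{
    int32_t t = ((int32_t)a * b) >> 14;
    return (int16_t)((t + 1) >> 1);
}

/* One pmaddubsw output lane (p0*w0 + p1*w1) followed by pmulhrsw with 1024.
 * With w0 + w1 == 32 the sum is at most 255 * 32 = 8160, so it fits in 16 bits. */
static uint8_t blend_pair(uint8_t p0, uint8_t p1, uint8_t w0, uint8_t w1)
{
    int16_t sum = (int16_t)(p0 * w0 + p1 * w1);
    return (uint8_t)pmulhrsw_lane(sum, 1024);
}

/* Textbook form of the same two-tap blend: round to nearest, then divide by 32. */
static uint8_t blend_ref(uint8_t p0, uint8_t p1, int frac)
{
    return (uint8_t)(((32 - frac) * p0 + frac * p1 + 16) >> 5);
}

/* Exhaustively compare both forms; returns the number of disagreements. */
static int count_mismatches(void)
{
    int bad = 0;
    for (int p0 = 0; p0 < 256; p0++)
        for (int p1 = 0; p1 < 256; p1++)
            for (int frac = 0; frac < 32; frac++)
                if (blend_pair((uint8_t)p0, (uint8_t)p1,
                               (uint8_t)(32 - frac), (uint8_t)frac) !=
                    blend_ref((uint8_t)p0, (uint8_t)p1, frac))
                    bad++;
    return bad;
}
```

The first row of all_ang4 starts with the pair (6, 26), which in this model corresponds to blend_pair(p0, p1, 6, 26), that is, a fraction of 26/32 toward the second pixel.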

 

x265_1.6.tar.gz/source/common/x86/ipfilter16.asm -> x265_1.7.tar.gz/source/common/x86/ipfilter16.asm Changed

@@ -113,10 +113,13 @@
                   times 8 dw 58, -10
                   times 8 dw 4, -1
 
+const interp8_hps_shuf,     dd 0, 4, 1, 5, 2, 6, 3, 7
+
 SECTION .text
 cextern pd_32
 cextern pw_pixel_max
 cextern pd_n32768
+cextern pw_2000
 
 ;------------------------------------------------------------------------------------------------------------
 ; void interp_8tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -5525,65 +5528,1409 @@
     FILTER_VER_LUMA_SS 64, 16
     FILTER_VER_LUMA_SS 16, 64
 
-;--------------------------------------------------------------------------------------------------
-; void filterConvertPelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int width, int height)
-;--------------------------------------------------------------------------------------------------
-INIT_XMM sse2
-cglobal luma_p2s, 3, 7, 5
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_2xN 1
+INIT_XMM sse4
+cglobal filterPixelToShort_2x%1, 3, 6, 2
+    add        r1d, r1d
+    mov        r3d, r3m
+    add        r3d, r3d
+    lea        r4, [r1 * 3]
+    lea        r5, [r3 * 3]
 
-    add         r1, r1
+    ; load constant
+    mova       m1, [pw_2000]
 
-    ; load width and height
-    mov         r3d, r3m
-    mov         r4d, r4m
+%rep %1/4
+    movd       m0, [r0]
+    movhps     m0, [r0 + r1]
+    psllw      m0, 4
+    psubw      m0, m1
+
+    movd       [r2 + r3 * 0], m0
+    pextrd     [r2 + r3 * 1], m0, 2
+
+    movd       m0, [r0 + r1 * 2]
+    movhps     m0, [r0 + r4]
+    psllw      m0, 4
+    psubw      m0, m1
+
+    movd       [r2 + r3 * 2], m0
+    pextrd     [r2 + r5], m0, 2
+
+    lea        r0, [r0 + r1 * 4]
+    lea        r2, [r2 + r3 * 4]
+%endrep
+    RET
+%endmacro
+P2S_H_2xN 4
+P2S_H_2xN 8
+P2S_H_2xN 16
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_4xN 1
+INIT_XMM ssse3
+cglobal filterPixelToShort_4x%1, 3, 6, 2
+    add        r1d, r1d
+    mov        r3d, r3m
+    add        r3d, r3d
+    lea        r4, [r3 * 3]
+    lea        r5, [r1 * 3]
 
     ; load constant
-    mova        m4, [tab_c_n8192]
+    mova       m1, [pw_2000]
 
-.loopH:
+%rep %1/4
+    movh       m0, [r0]
+    movhps     m0, [r0 + r1]
+    psllw      m0, 4
+    psubw      m0, m1
+    movh       [r2 + r3 * 0], m0
+    movhps     [r2 + r3 * 1], m0
+
+    movh       m0, [r0 + r1 * 2]
+    movhps     m0, [r0 + r5]
+    psllw      m0, 4
+    psubw      m0, m1
+    movh       [r2 + r3 * 2], m0
+    movhps     [r2 + r4], m0
 
-    xor         r5d, r5d
-.loopW:
-    lea         r6, [r0 + r5 * 2]
+    lea        r0, [r0 + r1 * 4]
+    lea        r2, [r2 + r3 * 4]
+%endrep
+    RET
+%endmacro
+P2S_H_4xN 4
+P2S_H_4xN 8
+P2S_H_4xN 16
+P2S_H_4xN 32
 
-    movu        m0, [r6]
-    psllw       m0, 4
-    paddw       m0, m4
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+INIT_XMM ssse3
+cglobal filterPixelToShort_4x2, 3, 4, 1
+    add        r1d, r1d
+    mov        r3d, r3m
+    add        r3d, r3d
 
-    movu        m1, [r6 + r1]
-    psllw       m1, 4
-    paddw       m1, m4
+    movh       m0, [r0]
+    movhps     m0, [r0 + r1]
+    psllw      m0, 4
+    psubw      m0, [pw_2000]
+    movh       [r2 + r3 * 0], m0
+    movhps     [r2 + r3 * 1], m0
 
-    movu        m2, [r6 + r1 * 2]
-    psllw       m2, 4
-    paddw       m2, m4
-
-    lea         r6, [r6 + r1 * 2]
-    movu        m3, [r6 + r1]
-    psllw       m3, 4
-    paddw       m3, m4
+    RET
 
-    add         r5, 8
-    cmp         r5, r3
-    jg          .width4
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
-    movu        [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
-    je          .nextH
-    jmp         .loopW
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+%macro P2S_H_6xN 1
+INIT_XMM sse4
+cglobal filterPixelToShort_6x%1, 3, 7, 3
+    add        r1d, r1d
+    mov        r3d, r3m
+    add        r3d, r3d
+    lea        r4, [r3 * 3]
+    lea        r5, [r1 * 3]
 
-.width4:
-    movh        [r2 + r5 * 2 + FENC_STRIDE * 0 - 16], m0
-    movh        [r2 + r5 * 2 + FENC_STRIDE * 2 - 16], m1
-    movh        [r2 + r5 * 2 + FENC_STRIDE * 4 - 16], m2
-    movh        [r2 + r5 * 2 + FENC_STRIDE * 6 - 16], m3
+    ; load height
+    mov        r6d, %1/4
 
-.nextH:
-    lea         r0, [r0 + r1 * 4]
-    add         r2, FENC_STRIDE * 8
+    ; load constant
+    mova       m2, [pw_2000]
 
-    sub         r4d, 4
-    jnz         .loopH
+.loop
+    movu       m0, [r0]
+    movu       m1, [r0 + r1]
+    psllw      m0, 4
+    psubw      m0, m2
+    psllw      m1, 4
+    psubw      m1, m2
+
+    movh       [r2 + r3 * 0], m0
+    pextrd     [r2 + r3 * 0 + 8], m0, 2
+    movh       [r2 + r3 * 1], m1
+    pextrd     [r2 + r3 * 1 + 8], m1, 2
+
+    movu       m0, [r0 + r1 * 2]
+    movu       m1, [r0 + r5]
+    psllw      m0, 4
+    psubw      m0, m2
+    psllw      m1, 4
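
Note on the new filterPixelToShort kernels above: every variant shown applies the same per-sample operation, a left shift by 4 (psllw) followed by subtracting pw_2000 (psubw), and writes to a caller-supplied dstStride where the removed luma_p2s used FENC_STRIDE. The C sketch below is a scalar model of that operation, not code from x265. It assumes 16-bit pixels with 10-bit content (which would explain the shift of 4) and assumes pw_2000 holds 0x2000 in every word. The function name and the explicit width and height parameters are hypothetical, since the assembly encodes the block size in the function name.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t pixel;  /* assumption: high-bit-depth build, 16-bit storage */

/* Scalar model: dst = (src << 4) - 0x2000 for every sample of the block.
 * Strides are in elements, not bytes (the assembly doubles them itself). */
static void pixel_to_short_ref(const pixel *src, ptrdiff_t srcStride,
                               int16_t *dst, ptrdiff_t dstStride,
                               int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)((src[x] << 4) - 0x2000);

        src += srcStride;
        dst += dstStride;
    }
}
```

Under these assumptions the 10-bit input range 0..1023 maps to -8192..8176, so a mid-grey sample of 512 becomes 0.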

 

x265_1.6.tar.gz/source/common/x86/ipfilter8.asm -> x265_1.7.tar.gz/source/common/x86/ipfilter8.asm Changed

@@ -27,269 +27,269 @@
 %include "x86util.asm"
 
 SECTION_RODATA 32
-tab_Tm:    db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
-           db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10
-           db 8, 9,10,11, 9,10,11,12,10,11,12,13,11,12,13, 14
+const tab_Tm,    db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6
+                 db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10
+                 db 8, 9,10,11, 9,10,11,12,10,11,12,13,11,12,13, 14
 
-ALIGN 32
 const interp4_vpp_shuf, times 2 db 0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15
 
-ALIGN 32
 const interp_vert_shuf, times 2 db 0, 2, 1, 3, 2, 4, 3, 5, 4, 6, 5, 7, 6, 8, 7, 9
                         times 2 db 4, 6, 5, 7, 6, 8, 7, 9, 8, 10, 9, 11, 10, 12, 11, 13
 
-ALIGN 32
 const interp4_vpp_shuf1, dd 0, 1, 1, 2, 2, 3, 3, 4
                          dd 2, 3, 3, 4, 4, 5, 5, 6
 
-ALIGN 32
 const pb_8tap_hps_0, times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8
                      times 2 db 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10
                      times 2 db 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12
                      times 2 db 6, 7, 7, 8, 8, 9, 9,10,10,11,11,12,12,13,13,14
 
-ALIGN 32
-tab_Lm:    db 0, 1, 2, 3, 4,  5,  6,  7,  1, 2, 3, 4,  5,  6,  7,  8
-           db 2, 3, 4, 5, 6,  7,  8,  9,  3, 4, 5, 6,  7,  8,  9,  10
-           db 4, 5, 6, 7, 8,  9,  10, 11, 5, 6, 7, 8,  9,  10, 11, 12
-           db 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14
-
-tab_Vm:    db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-           db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3
-
-tab_Cm:    db 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3
-
-tab_c_526336:   times 4 dd 8192*64+2048
-
-pd_526336:      times 8 dd 8192*64+2048
-
-tab_ChromaCoeff: db  0, 64,  0,  0
-                 db -2, 58, 10, -2
-                 db -4, 54, 16, -2
-                 db -6, 46, 28, -4
-                 db -4, 36, 36, -4
-                 db -4, 28, 46, -6
-                 db -2, 16, 54, -4
-                 db -2, 10, 58, -2
-ALIGN 32
-tab_ChromaCoeff_V: times 8 db 0, 64
-                   times 8 db 0,  0
+const tab_Lm,    db 0, 1, 2, 3, 4,  5,  6,  7,  1, 2, 3, 4,  5,  6,  7,  8
+                 db 2, 3, 4, 5, 6,  7,  8,  9,  3, 4, 5, 6,  7,  8,  9,  10
+                 db 4, 5, 6, 7, 8,  9,  10, 11, 5, 6, 7, 8,  9,  10, 11, 12
+                 db 6, 7, 8, 9, 10, 11, 12, 13, 7, 8, 9, 10, 11, 12, 13, 14
 
-                   times 8 db -2, 58
-                   times 8 db 10, -2
+const tab_Vm,    db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+                 db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3
 
-                   times 8 db -4, 54
-                   times 8 db 16, -2
+const tab_Cm,    db 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3, 0, 2, 1, 3
 
-                   times 8 db -6, 46
-                   times 8 db 28, -4
+const pd_526336, times 8 dd 8192*64+2048
 
-                   times 8 db -4, 36
-                   times 8 db 36, -4
+const tab_ChromaCoeff, db  0, 64,  0,  0
+                       db -2, 58, 10, -2
+                       db -4, 54, 16, -2
+                       db -6, 46, 28, -4
+                       db -4, 36, 36, -4
+                       db -4, 28, 46, -6
+                       db -2, 16, 54, -4
+                       db -2, 10, 58, -2
 
-                   times 8 db -4, 28
-                   times 8 db 46, -6
+const tabw_ChromaCoeff, dw  0, 64,  0,  0
+                        dw -2, 58, 10, -2
+                        dw -4, 54, 16, -2
+                        dw -6, 46, 28, -4
+                        dw -4, 36, 36, -4
+                        dw -4, 28, 46, -6
+                        dw -2, 16, 54, -4
+                        dw -2, 10, 58, -2
 
-                   times 8 db -2, 16
-                   times 8 db 54, -4
+const tab_ChromaCoeff_V, times 8 db 0, 64
+                         times 8 db 0,  0
 
-                   times 8 db -2, 10
-                   times 8 db 58, -2
+                         times 8 db -2, 58
+                         times 8 db 10, -2
 
-tab_ChromaCoeffV: times 4 dw 0, 64
-                  times 4 dw 0, 0
+                         times 8 db -4, 54
+                         times 8 db 16, -2
 
-                  times 4 dw -2, 58
-                  times 4 dw 10, -2
+                         times 8 db -6, 46
+                         times 8 db 28, -4
 
-                  times 4 dw -4, 54
-                  times 4 dw 16, -2
+                         times 8 db -4, 36
+                         times 8 db 36, -4
 
-                  times 4 dw -6, 46 
-                  times 4 dw 28, -4
+                         times 8 db -4, 28
+                         times 8 db 46, -6
 
-                  times 4 dw -4, 36
-                  times 4 dw 36, -4
+                         times 8 db -2, 16
+                         times 8 db 54, -4
 
-                  times 4 dw -4, 28
-                  times 4 dw 46, -6
+                         times 8 db -2, 10
+                         times 8 db 58, -2
 
-                  times 4 dw -2, 16
-                  times 4 dw 54, -4
+const tab_ChromaCoeffV, times 4 dw 0, 64
+                        times 4 dw 0, 0
 
-                  times 4 dw -2, 10
-                  times 4 dw 58, -2
+                        times 4 dw -2, 58
+                        times 4 dw 10, -2
 
-ALIGN 32
-pw_ChromaCoeffV:  times 8 dw 0, 64
-                  times 8 dw 0, 0
+                        times 4 dw -4, 54
+                        times 4 dw 16, -2
 
-                  times 8 dw -2, 58
-                  times 8 dw 10, -2
+                        times 4 dw -6, 46
+                        times 4 dw 28, -4
 
-                  times 8 dw -4, 54
-                  times 8 dw 16, -2
+                        times 4 dw -4, 36
+                        times 4 dw 36, -4
 
-                  times 8 dw -6, 46 
-                  times 8 dw 28, -4
-
-                  times 8 dw -4, 36
-                  times 8 dw 36, -4
-
-                  times 8 dw -4, 28
-                  times 8 dw 46, -6
-
-                  times 8 dw -2, 16
-                  times 8 dw 54, -4
-
-                  times 8 dw -2, 10
-                  times 8 dw 58, -2
-
-tab_LumaCoeff:   db   0, 0,  0,  64,  0,   0,  0,  0
-                 db  -1, 4, -10, 58,  17, -5,  1,  0
-                 db  -1, 4, -11, 40,  40, -11, 4, -1
-                 db   0, 1, -5,  17,  58, -10, 4, -1
-
-tab_LumaCoeffV: times 4 dw 0, 0
-                times 4 dw 0, 64
-                times 4 dw 0, 0
-                times 4 dw 0, 0
-
-                times 4 dw -1, 4
-                times 4 dw -10, 58
-                times 4 dw 17, -5
-                times 4 dw 1, 0
-
-                times 4 dw -1, 4
-                times 4 dw -11, 40
-                times 4 dw 40, -11
-                times 4 dw 4, -1
-
-                times 4 dw 0, 1
-                times 4 dw -5, 17
-                times 4 dw 58, -10
-                times 4 dw 4, -1
+                        times 4 dw -4, 28
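
Reviewer note: `tab_ChromaCoeff` and the new word-sized `tabw_ChromaCoeff` carry the same eight rows of 4-tap coefficients, and every row sums to 64. The sketch below copies the values from the hunk and applies one row to four neighbouring samples. The helper names are hypothetical and the code is only an illustration of how such a table is consumed, not the x265 implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from tab_ChromaCoeff in the hunk above. */
static const int8_t chroma_coeff[8][4] = {
    {  0, 64,  0,  0 }, { -2, 58, 10, -2 }, { -4, 54, 16, -2 }, { -6, 46, 28, -4 },
    { -4, 36, 36, -4 }, { -4, 28, 46, -6 }, { -2, 16, 54, -4 }, { -2, 10, 58, -2 },
};

/* Hypothetical helper: sum of the four taps in one row of the table. */
static int chroma_row_sum(int coeffIdx)
{
    assert(coeffIdx >= 0 && coeffIdx < 8);
    const int8_t *c = chroma_coeff[coeffIdx];
    return c[0] + c[1] + c[2] + c[3];
}

/* Hypothetical helper: unnormalized weighted sum of four neighbouring samples.
 * The result is scaled by 64 relative to the input samples. */
static int chroma_tap4(const uint8_t *p, int coeffIdx)
{
    assert(coeffIdx >= 0 && coeffIdx < 8);
    const int8_t *c = chroma_coeff[coeffIdx];
    return c[0] * p[0] + c[1] * p[1] + c[2] * p[2] + c[3] * p[3];
}
```

Because each row sums to 64, a flat input of value v yields 64 * v, so dividing the weighted sum by 64 returns it to the sample range.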

 

x265_1.6.tar.gz/source/common/x86/ipfilter8.h -> x265_1.7.tar.gz/source/common/x86/ipfilter8.h Changed

@@ -289,16 +289,114 @@
     SETUP_CHROMA_420_HORIZ_FUNC_DEF(64, 16, cpu); \
     SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 64, cpu)
 
-void x265_chroma_p2s_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
-void x265_luma_p2s_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
+void x265_filterPixelToShort_4x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_4x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_4x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_24x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_12x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_48x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x4_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x12_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_16x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x24_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_32x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x48_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_64x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_24x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+void x265_filterPixelToShort_48x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+
+#define SETUP_CHROMA_P2S_FUNC_DEF(W, H, cpu) \
+    void x265_filterPixelToShort_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+
+#define CHROMA_420_P2S_FILTERS_SSSE3(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(4, 2, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 2, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 6, cpu);
+
+#define CHROMA_420_P2S_FILTERS_SSE4(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 4, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(6, 8, cpu);
+
+#define CHROMA_422_P2S_FILTERS_SSSE3(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(4, 32, cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 12, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(12, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu);
+
+#define CHROMA_422_P2S_FILTERS_SSE4(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 16, cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(6, 16, cpu);
+
+#define CHROMA_420_P2S_FILTERS_AVX2(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 4, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 12, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 24, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu);
+
+#define CHROMA_422_P2S_FILTERS_AVX2(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 64, cpu);
 
 CHROMA_420_VERT_FILTERS(_sse2);
 CHROMA_420_HORIZ_FILTERS(_sse4);
 CHROMA_420_VERT_FILTERS_SSE4(_sse4);
+CHROMA_420_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_420_P2S_FILTERS_SSE4(_sse4);
+CHROMA_420_P2S_FILTERS_AVX2(_avx2);
 
 CHROMA_422_VERT_FILTERS(_sse2);
 CHROMA_422_HORIZ_FILTERS(_sse4);
 CHROMA_422_VERT_FILTERS_SSE4(_sse4);
+CHROMA_422_P2S_FILTERS_SSE4(_sse4);
+CHROMA_422_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_422_P2S_FILTERS_AVX2(_avx2);
 
 CHROMA_444_VERT_FILTERS(_sse2);
 CHROMA_444_HORIZ_FILTERS(_sse4);
@@ -572,6 +670,48 @@
     SETUP_CHROMA_SS_FUNC_DEF(64, 16, cpu); \
     SETUP_CHROMA_SS_FUNC_DEF(16, 64, cpu);
 
+#define SETUP_CHROMA_P2S_FUNC_DEF(W, H, cpu) \
+    void x265_filterPixelToShort_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
+
+#define CHROMA_420_P2S_FILTERS_SSE4(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 4, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(4, 2, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(6, 8, cpu); 
+
+#define CHROMA_420_P2S_FILTERS_SSSE3(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 2, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 6, cpu);
+
+#define CHROMA_422_P2S_FILTERS_SSE4(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(2, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(6, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(4, 32, cpu);
+
+#define CHROMA_422_P2S_FILTERS_SSSE3(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 12, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(8, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(12, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu);
+
+#define CHROMA_420_P2S_FILTERS_AVX2(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 8, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 24, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu);
+
+#define CHROMA_422_P2S_FILTERS_AVX2(cpu) \
+    SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu); \
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 64, cpu);
+
 CHROMA_420_FILTERS(_sse4);
 CHROMA_420_FILTERS(_avx2);
 CHROMA_420_SP_FILTERS(_sse2);
@@ -582,19 +722,32 @@
 CHROMA_420_SS_FILTERS_SSE4(_sse4);
 CHROMA_420_SS_FILTERS(_avx2);
 CHROMA_420_SS_FILTERS_SSE4(_avx2);
+CHROMA_420_P2S_FILTERS_SSE4(_sse4);
+CHROMA_420_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_420_P2S_FILTERS_AVX2(_avx2);
 
 CHROMA_422_FILTERS(_sse4);
 CHROMA_422_FILTERS(_avx2);
 CHROMA_422_SP_FILTERS(_sse2);
+CHROMA_422_SP_FILTERS(_avx2);
 CHROMA_422_SP_FILTERS_SSE4(_sse4);
+CHROMA_422_SP_FILTERS_SSE4(_avx2);
 CHROMA_422_SS_FILTERS(_sse2);
+CHROMA_422_SS_FILTERS(_avx2);
 CHROMA_422_SS_FILTERS_SSE4(_sse4);
+CHROMA_422_SS_FILTERS_SSE4(_avx2);
+CHROMA_422_P2S_FILTERS_SSE4(_sse4);
+CHROMA_422_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_422_P2S_FILTERS_AVX2(_avx2);
+void x265_interp_4tap_vert_ss_2x4_avx2(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_sp_2x4_avx2(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
 
 CHROMA_444_FILTERS(_sse4);
 CHROMA_444_SP_FILTERS(_sse4);
 CHROMA_444_SS_FILTERS(_sse2);
-
-void x265_chroma_p2s_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
+CHROMA_444_FILTERS(_avx2);
+CHROMA_444_SP_FILTERS(_avx2);
+CHROMA_444_SS_FILTERS(_avx2);
 
 #undef SETUP_CHROMA_FUNC_DEF
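
Reviewer note: `SETUP_CHROMA_P2S_FUNC_DEF` builds each prototype name by token pasting (`W ## x ## H ## cpu`), so `SETUP_CHROMA_P2S_FUNC_DEF(4, 2, _ssse3)` declares `x265_filterPixelToShort_4x2_ssse3`. The sketch below shows the same pasting pattern in a form that can be compiled and called. Everything in it is hypothetical: the `demo_p2s_` prefix, the `_c` suffix, the 8-bit `pixel` typedef, and the placeholder body, which only widens each pixel and does not reproduce the real filter.

```c
#include <stdint.h>

typedef uint8_t pixel; /* assumption: 8-bit build; x265 defines pixel elsewhere */

/* Same name-building pattern as SETUP_CHROMA_P2S_FUNC_DEF, but with a
 * hypothetical demo_p2s_ prefix and a placeholder body so the generated
 * functions can be called. The body only widens pixels to int16_t. */
#define DEMO_P2S_FUNC(W, H, cpu) \
    static void demo_p2s_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, \
                                                int16_t* dst, intptr_t dstStride) \
    { \
        for (int row = 0; row < H; row++) \
            for (int col = 0; col < W; col++) \
                dst[row * dstStride + col] = (int16_t)src[row * srcStride + col]; \
    }

DEMO_P2S_FUNC(4, 2, _c) /* defines demo_p2s_4x2_c */
DEMO_P2S_FUNC(8, 2, _c) /* defines demo_p2s_8x2_c */
```

In the header itself the macro body already ends with `;`, which is why list entries that lack a `;` before the line-continuation backslash still expand to complete declarations.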

 
+    SETUP_CHROMA_P2S_FUNC_DEF(32, 64, cpu);
+
 CHROMA_420_FILTERS(_sse4);
 CHROMA_420_FILTERS(_avx2);
 CHROMA_420_SP_FILTERS(_sse2);
@@ -582,19 +722,32 @@
 CHROMA_420_SS_FILTERS_SSE4(_sse4);
 CHROMA_420_SS_FILTERS(_avx2);
 CHROMA_420_SS_FILTERS_SSE4(_avx2);
+CHROMA_420_P2S_FILTERS_SSE4(_sse4);
+CHROMA_420_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_420_P2S_FILTERS_AVX2(_avx2);
 
 CHROMA_422_FILTERS(_sse4);
 CHROMA_422_FILTERS(_avx2);
 CHROMA_422_SP_FILTERS(_sse2);
+CHROMA_422_SP_FILTERS(_avx2);
 CHROMA_422_SP_FILTERS_SSE4(_sse4);
+CHROMA_422_SP_FILTERS_SSE4(_avx2);
 CHROMA_422_SS_FILTERS(_sse2);
+CHROMA_422_SS_FILTERS(_avx2);
 CHROMA_422_SS_FILTERS_SSE4(_sse4);
+CHROMA_422_SS_FILTERS_SSE4(_avx2);
+CHROMA_422_P2S_FILTERS_SSE4(_sse4);
+CHROMA_422_P2S_FILTERS_SSSE3(_ssse3);
+CHROMA_422_P2S_FILTERS_AVX2(_avx2);
+void x265_interp_4tap_vert_ss_2x4_avx2(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
+void x265_interp_4tap_vert_sp_2x4_avx2(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
 
 CHROMA_444_FILTERS(_sse4);
 CHROMA_444_SP_FILTERS(_sse4);
 CHROMA_444_SS_FILTERS(_sse2);
-
-void x265_chroma_p2s_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, int width, int height);
+CHROMA_444_FILTERS(_avx2);
+CHROMA_444_SP_FILTERS(_avx2);
+CHROMA_444_SS_FILTERS(_avx2);
 
 #undef SETUP_CHROMA_FUNC_DEF
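
Note on the hunk above: the single `x265_chroma_p2s_ssse3(src, srcStride, dst, width, height)` prototype is removed, and `SETUP_CHROMA_P2S_FUNC_DEF` now pastes the block size into the function name, so each size gets its own `x265_filterPixelToShort_WxH<cpu>` symbol with an explicit `dstStride`. The sketch below reuses that token-pasting pattern to emit a definition instead of a prototype. It is only an illustration: `p2s_ref`, `SKETCH_CHROMA_P2S_FUNC_DEF` and the `sketch_` prefix are hypothetical names, and the 8-bit `pixel` type and the constants (14-bit intermediate precision, offset 8192) are assumptions about the build, not something this diff shows.

```c
#include <stdint.h>

typedef uint8_t pixel;                      /* assumption: 8-bit (non HIGH_BIT_DEPTH) build */

enum {
    IF_INTERNAL_PREC = 14,                  /* assumed intermediate precision */
    X265_DEPTH       = 8,                   /* assumed pixel bit depth */
    IF_INTERNAL_OFFS = 1 << (IF_INTERNAL_PREC - 1)
};

/* Hypothetical scalar reference: widen each pixel to the signed
 * 16-bit intermediate format used by the interpolation filters. */
static void p2s_ref(const pixel* src, intptr_t srcStride,
                    int16_t* dst, intptr_t dstStride,
                    int width, int height)
{
    const int shift = IF_INTERNAL_PREC - X265_DEPTH;

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)((src[x] << shift) - IF_INTERNAL_OFFS);

        src += srcStride;
        dst += dstStride;
    }
}

/* Same W/H/cpu token pasting as the header macro, but producing a
 * definition that forwards to the scalar reference. */
#define SKETCH_CHROMA_P2S_FUNC_DEF(W, H, cpu) \
    void sketch_filterPixelToShort_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, \
                                                          int16_t* dst, intptr_t dstStride) \
    { p2s_ref(src, srcStride, dst, dstStride, W, H); }

/* Expands to sketch_filterPixelToShort_8x2_c(...) */
SKETCH_CHROMA_P2S_FUNC_DEF(8, 2, _c)
```

Because width and height are baked into the symbol name, a caller picks the function per block size (for example through a table indexed by partition) rather than passing the dimensions at run time.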

x265_1.6.tar.gz/source/common/x86/loopfilter.asm -> x265_1.7.tar.gz/source/common/x86/loopfilter.asm Changed

@@ -28,31 +28,39 @@
 %include "x86inc.asm"
 
 SECTION_RODATA 32
-pb_31:      times 16 db 31
-pb_15:      times 16 db 15
+pb_31:      times 32 db 31
+pb_15:      times 32 db 15
+pb_movemask_32:  times 32 db 0x00
+                 times 32 db 0xFF
 
 SECTION .text
 cextern pb_1
 cextern pb_128
 cextern pb_2
 cextern pw_2
+cextern pb_movemask
 
 
 ;============================================================================================================
-; void saoCuOrgE0(pixel * rec, int8_t * offsetEo, int lcuWidth, int8_t signLeft)
+; void saoCuOrgE0(pixel * rec, int8_t * offsetEo, int lcuWidth, int8_t* signLeft, intptr_t stride)
 ;============================================================================================================
 INIT_XMM sse4
-cglobal saoCuOrgE0, 4, 4, 8, rec, offsetEo, lcuWidth, signLeft
+cglobal saoCuOrgE0, 5, 5, 8, rec, offsetEo, lcuWidth, signLeft, stride
 
-    neg         r3                          ; r3 = -signLeft
-    movzx       r3d, r3b
-    movd        m0, r3d
-    mova        m4, [pb_128]                ; m4 = [80]
-    pxor        m5, m5                      ; m5 = 0
-    movu        m6, [r1]                    ; m6 = offsetEo
+    mov         r4d, r4m
+    mova        m4,  [pb_128]                ; m4 = [80]
+    pxor        m5,  m5                      ; m5 = 0
+    movu        m6,  [r1]                    ; m6 = offsetEo
+
+    movzx       r1d, byte [r3]
+    inc         r3
+    neg         r1b
+    movd        m0, r1d
+    lea         r1, [r0 + r4]
+    mov         r4d, r2d
 
 .loop:
-    movu        m7, [r0]                    ; m1 = rec[x]
+    movu        m7, [r0]                    ; m7 = rec[x]
     movu        m2, [r0 + 1]                ; m2 = rec[x+1]
 
     pxor        m1, m7, m4
@@ -69,7 +77,7 @@
     pxor        m0, m0
     palignr     m0, m2, 15
     paddb       m2, m3
-    paddb       m2, [pb_2]                  ; m1 = uiEdgeType
+    paddb       m2, [pb_2]                  ; m2 = uiEdgeType
     pshufb      m3, m6, m2
     pmovzxbw    m2, m7                      ; rec
     punpckhbw   m7, m5
@@ -84,6 +92,97 @@
     add         r0q, 16
     sub         r2d, 16
     jnz        .loop
+
+    movzx       r3d, byte [r3]
+    neg         r3b
+    movd        m0, r3d
+.loopH:
+    movu        m7, [r1]                    ; m7 = rec[x]
+    movu        m2, [r1 + 1]                ; m2 = rec[x+1]
+
+    pxor        m1, m7, m4
+    pxor        m3, m2, m4
+    pcmpgtb     m2, m1, m3
+    pcmpgtb     m3, m1
+    pand        m2, [pb_1]
+    por         m2, m3
+
+    pslldq      m3, m2, 1
+    por         m3, m0
+
+    psignb      m3, m4                      ; m3 = signLeft
+    pxor        m0, m0
+    palignr     m0, m2, 15
+    paddb       m2, m3
+    paddb       m2, [pb_2]                  ; m2 = uiEdgeType
+    pshufb      m3, m6, m2
+    pmovzxbw    m2, m7                      ; rec
+    punpckhbw   m7, m5
+    pmovsxbw    m1, m3                      ; offsetEo
+    punpckhbw   m3, m3
+    psraw       m3, 8
+    paddw       m2, m1
+    paddw       m7, m3
+    packuswb    m2, m7
+    movu        [r1], m2
+
+    add         r1q, 16
+    sub         r4d, 16
+    jnz        .loopH
+    RET
+
+INIT_YMM avx2
+cglobal saoCuOrgE0, 5, 5, 7, rec, offsetEo, lcuWidth, signLeft, stride
+
+    mov                 r4d,        r4m
+    vbroadcasti128      m4,         [pb_128]                   ; m4 = [80]
+    vbroadcasti128      m6,         [r1]                       ; m6 = offsetEo
+    movzx               r1d,        byte [r3]
+    neg                 r1b
+    movd                xm0,        r1d
+    movzx               r1d,        byte [r3 + 1]
+    neg                 r1b
+    movd                xm1,        r1d
+    vinserti128         m0,         m0,        xm1,           1
+
+.loop:
+    movu                xm5,        [r0]                       ; xm5 = rec[x]
+    movu                xm2,        [r0 + 1]                   ; xm2 = rec[x + 1]
+    vinserti128         m5,         m5,        [r0 + r4],     1
+    vinserti128         m2,         m2,        [r0 + r4 + 1], 1
+
+    pxor                m1,         m5,        m4
+    pxor                m3,         m2,        m4
+    pcmpgtb             m2,         m1,        m3
+    pcmpgtb             m3,         m1
+    pand                m2,         [pb_1]
+    por                 m2,         m3
+
+    pslldq              m3,         m2,        1
+    por                 m3,         m0
+
+    psignb              m3,         m4                         ; m3 = signLeft
+    pxor                m0,         m0
+    palignr             m0,         m2,        15
+    paddb               m2,         m3
+    paddb               m2,         [pb_2]                     ; m2 = uiEdgeType
+    pshufb              m3,         m6,        m2
+    pmovzxbw            m2,         xm5                        ; rec
+    vextracti128        xm5,        m5,        1
+    pmovzxbw            m5,         xm5
+    pmovsxbw            m1,         xm3                        ; offsetEo
+    vextracti128        xm3,        m3,        1
+    pmovsxbw            m3,         xm3
+    paddw               m2,         m1
+    paddw               m5,         m3
+    packuswb            m2,         m5
+    vpermq              m2,         m2,        11011000b
+    movu                [r0],       xm2
+    vextracti128        [r0 + r4],  m2,        1
+
+    add                 r0q,        16
+    sub                 r2d,        16
+    jnz                 .loop
     RET
 
 ;==================================================================================================
@@ -94,117 +193,382 @@
     mov         r3d, r3m
     mov         r4d, r4m
     pxor        m0,    m0                      ; m0 = 0
-    movu        m6,    [pb_2]                  ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
+    mova        m6,    [pb_2]                  ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
     mova        m7,    [pb_128]
     shr         r4d,   4
-    .loop
-         movu        m1,    [r0]                    ; m1 = pRec[x]
-         movu        m2,    [r0 + r3]               ; m2 = pRec[x + iStride]
-
-         pxor        m3,    m1,    m7
-         pxor        m4,    m2,    m7
-         pcmpgtb     m2,    m3,    m4
-         pcmpgtb     m4,    m3
-         pand        m2,    [pb_1]
-         por         m2,    m4
-
-         movu        m3,    [r1]                    ; m3 = m_iUpBuff1
-
-         paddb       m3,    m2
-         paddb       m3,    m6
-
-         movu        m4,    [r2]                    ; m4 = m_iOffsetEo
-         pshufb      m5,    m4,    m3
-
-         psubb       m3,    m0,    m2
-         movu        [r1],  m3
-
-         pmovzxbw    m2,    m1
-         punpckhbw   m1,    m0
-         pmovsxbw    m3,    m5
-         punpckhbw   m5,    m5
-         psraw       m5,    8
-
-         paddw       m2,    m3
-         paddw       m1,    m5
-         packuswb    m2,    m1
-         movu        [r0],  m2
-
-         add         r0,    16

 

x265_1.6.tar.gz/source/common/x86/loopfilter.h -> x265_1.7.tar.gz/source/common/x86/loopfilter.h Changed

@@ -25,11 +25,21 @@
 #ifndef X265_LOOPFILTER_H
 #define X265_LOOPFILTER_H
 
-void x265_saoCuOrgE0_sse4(pixel * rec, int8_t * offsetEo, int endX, int8_t signLeft);
+void x265_saoCuOrgE0_sse4(pixel * rec, int8_t * offsetEo, int endX, int8_t* signLeft, intptr_t stride);
+void x265_saoCuOrgE0_avx2(pixel * rec, int8_t * offsetEo, int endX, int8_t* signLeft, intptr_t stride);
 void x265_saoCuOrgE1_sse4(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
+void x265_saoCuOrgE1_avx2(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
+void x265_saoCuOrgE1_2Rows_sse4(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
+void x265_saoCuOrgE1_2Rows_avx2(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
 void x265_saoCuOrgE2_sse4(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
+void x265_saoCuOrgE2_avx2(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
+void x265_saoCuOrgE2_32_avx2(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
 void x265_saoCuOrgE3_sse4(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX);
+void x265_saoCuOrgE3_avx2(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX);
+void x265_saoCuOrgE3_32_avx2(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX);
 void x265_saoCuOrgB0_sse4(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride);
+void x265_saoCuOrgB0_avx2(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride);
 void x265_calSign_sse4(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
+void x265_calSign_avx2(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
 
 #endif // ifndef X265_LOOPFILTER_H
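
Note on the `saoCuOrgE0` signature change above: `signLeft` becomes a pointer and a `stride` argument is added. Going by the assembly in loopfilter.asm (the SSE4 version runs a second `.loopH` pass over `rec + stride`, and the AVX2 version loads both rows into one register), each call now appears to process two rows, with one left-sign byte per row. A scalar sketch of that behaviour is given below. `saoCuOrgE0_ref` and `clip_pixel` are hypothetical names, the 8-bit `pixel` type is an assumption, and the code is inferred from the diff rather than copied from the x265 sources.

```c
#include <stdint.h>

typedef uint8_t pixel;   /* assumption: 8-bit build */

/* Saturate to the 8-bit pixel range, as packuswb does in the assembly. */
static pixel clip_pixel(int v)
{
    return (pixel)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar reference for the two-row horizontal edge offset.
 * Note: like the assembly, this reads rec[width], one pixel past the
 * last one written in each row, so the caller must keep it readable. */
void saoCuOrgE0_ref(pixel* rec, const int8_t* offsetEo, int width,
                    const int8_t* signLeft, intptr_t stride)
{
    for (int y = 0; y < 2; y++)
    {
        int8_t left = signLeft[y];              /* sign against the left neighbour */

        for (int x = 0; x < width; x++)
        {
            int diff = rec[x] - rec[x + 1];
            int8_t right = (int8_t)((diff > 0) - (diff < 0));
            int edgeType = right + left + 2;    /* index 0..4 into offsetEo */

            left = (int8_t)-right;              /* becomes the left sign of x + 1 */
            rec[x] = clip_pixel(rec[x] + offsetEo[edgeType]);
        }

        rec += stride;
    }
}
```

Passing the left signs as an array lets the caller hand over both rows at once, which is what allows the AVX2 path to place row 0 in the low 128-bit lane and row 1 in the high lane.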

 

x265_1.6.tar.gz/source/common/x86/mc-a.asm -> x265_1.7.tar.gz/source/common/x86/mc-a.asm Changed

 
@@ -1895,8 +1895,10 @@
 
 ADDAVG_W8_H4_AVX2 4
 ADDAVG_W8_H4_AVX2 8
+ADDAVG_W8_H4_AVX2 12
 ADDAVG_W8_H4_AVX2 16
 ADDAVG_W8_H4_AVX2 32
+ADDAVG_W8_H4_AVX2 64
 
 %macro ADDAVG_W12_H4_AVX2 1
 INIT_YMM avx2
@@ -1982,6 +1984,7 @@
 %endmacro
 
 ADDAVG_W12_H4_AVX2 16
+ADDAVG_W12_H4_AVX2 32
 
 %macro ADDAVG_W16_H4_AVX2 1
 INIT_YMM avx2
@@ -2044,6 +2047,7 @@
 ADDAVG_W16_H4_AVX2 8
 ADDAVG_W16_H4_AVX2 12
 ADDAVG_W16_H4_AVX2 16
+ADDAVG_W16_H4_AVX2 24
 ADDAVG_W16_H4_AVX2 32
 ADDAVG_W16_H4_AVX2 64
 
@@ -2101,6 +2105,7 @@
 %endmacro
 
 ADDAVG_W24_H2_AVX2 32
+ADDAVG_W24_H2_AVX2 64
 
 %macro ADDAVG_W32_H2_AVX2 1
 INIT_YMM avx2
@@ -2157,6 +2162,7 @@
 ADDAVG_W32_H2_AVX2 16
 ADDAVG_W32_H2_AVX2 24
 ADDAVG_W32_H2_AVX2 32
+ADDAVG_W32_H2_AVX2 48
 ADDAVG_W32_H2_AVX2 64
 
 %macro ADDAVG_W64_H2_AVX2 1
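
Note on the added `ADDAVG_*_AVX2` instantiations above: the new heights (8x12, 8x64, 12x32, 16x24, 24x64, 32x48) match block sizes listed in the `CHROMA_422_P2S_FILTERS_*` macros of ipfilter8.h, so they appear to extend AVX2 coverage to 4:2:2 chroma blocks. For orientation, a scalar sketch of the bi-prediction average that these kernels vectorize follows. `addAvg_ref` is a hypothetical name, and the 8-bit `pixel` type, the 14-bit intermediate precision, and the rounding offset are assumptions about the build, not taken from this diff.

```c
#include <stdint.h>

typedef uint8_t pixel;                      /* assumption: 8-bit build */

enum {
    IF_INTERNAL_PREC = 14,                  /* assumed intermediate precision */
    X265_DEPTH       = 8,                   /* assumed pixel bit depth */
    IF_INTERNAL_OFFS = 1 << (IF_INTERNAL_PREC - 1)
};

/* Hypothetical scalar reference: average two 16-bit intermediate
 * predictions and convert the result back to pixels. */
void addAvg_ref(const int16_t* src0, const int16_t* src1, pixel* dst,
                intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride,
                int width, int height)
{
    const int shift  = IF_INTERNAL_PREC + 1 - X265_DEPTH;           /* 7 for 8-bit */
    const int offset = (1 << (shift - 1)) + 2 * IF_INTERNAL_OFFS;   /* rounding + undo both biases */

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            /* assumes an arithmetic right shift if the sum is negative */
            int v = (src0[x] + src1[x] + offset) >> shift;
            dst[x] = (pixel)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }

        src0 += src0Stride;
        src1 += src1Stride;
        dst  += dstStride;
    }
}
```

Under these assumptions a pixel value p stored as `(p << 6) - 8192` in both inputs averages back to p, and inputs for 100 and 101 round up to 101.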

x265_1.6.tar.gz/source/common/x86/pixel-a.asm -> x265_1.7.tar.gz/source/common/x86/pixel-a.asm Changed

@@ -7078,6 +7078,117 @@
 .end:
     RET
 
+; Input 16bpp, Output 8bpp
+;-------------------------------------------------------------------------------------------------------------------------------------
+;void planecopy_sp(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask)
+;-------------------------------------------------------------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal downShift_16, 6,7,3
+    movd        xm0, r6m        ; m0 = shift
+    add         r1d, r1d
+    dec         r5d
+.loopH:
+    xor         r6, r6
+.loopW:
+    movu        m1, [r0 + r6 * 2 +  0]
+    movu        m2, [r0 + r6 * 2 + 32]
+    vpsrlw      m1, xm0
+    vpsrlw      m2, xm0
+    packuswb    m1, m2
+    vpermq      m1, m1, 11011000b
+    movu        [r2 + r6], m1
+
+    add         r6d, mmsize
+    cmp         r6d, r4d
+    jl          .loopW
+
+    ; move to next row
+    add         r0, r1
+    add         r2, r3
+    dec         r5d
+    jnz         .loopH
+
+; process the last row of every frame separately [to handle a width which is not a multiple of 32]
+    mov         r6d, r4d
+    and         r4d, 31
+    shr         r6d, 5
+
+.loop32:
+    movu        m1, [r0]
+    movu        m2, [r0 + 32]
+    psrlw       m1, xm0
+    psrlw       m2, xm0
+    packuswb    m1, m2
+    vpermq      m1, m1, 11011000b
+    movu        [r2], m1
+
+    add         r0, 2*mmsize
+    add         r2, mmsize
+    dec         r6d
+    jnz         .loop32
+
+    cmp         r4d, 16
+    jl          .process8
+    movu        m1, [r0]
+    psrlw       m1, xm0
+    packuswb    m1, m1
+    vpermq      m1, m1, 10001000b
+    movu        [r2], xm1
+
+    add         r0, mmsize
+    add         r2, 16
+    sub         r4d, 16
+    jz          .end
+
+.process8:
+    cmp         r4d, 8
+    jl          .process4
+    movu        m1, [r0]
+    psrlw       m1, xm0
+    packuswb    m1, m1
+    movq        [r2], xm1
+
+    add         r0, 16
+    add         r2, 8
+    sub         r4d, 8
+    jz          .end
+
+.process4:
+    cmp         r4d, 4
+    jl          .process2
+    movq        xm1,[r0]
+    psrlw       m1, xm0
+    packuswb    m1, m1
+    movd        [r2], xm1
+
+    add         r0, 8
+    add         r2, 4
+    sub         r4d, 4
+    jz          .end
+
+.process2:
+    cmp         r4d, 2
+    jl          .process1
+    movd        xm1, [r0]
+    psrlw       m1, xm0
+    packuswb    m1, m1
+    movd        r6d, xm1
+    mov         [r2], r6w
+
+    add         r0, 4
+    add         r2, 2
+    sub         r4d, 2
+    jz          .end
+
+.process1:
+    movd        xm1, [r0]
+    psrlw       m1, xm0
+    packuswb    m1, m1
+    movd        r3d, xm1
+    mov         [r2], r3b
+.end:
+    RET
+
 ; Input 8bpp, Output 16bpp
 ;---------------------------------------------------------------------------------------------------------------------
 ;void planecopy_cp(uint8_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift)
@@ -10395,3 +10506,1372 @@
     mov             rsp, r5
     RET
 %endif
+
+;;---------------------------------------------------------------
+;; SATD AVX2
+;; int pixel_satd(const pixel*, intptr_t, const pixel*, intptr_t)
+;;---------------------------------------------------------------
+;; r0   - pix0
+;; r1   - pix0Stride
+;; r2   - pix1
+;; r3   - pix1Stride
+
+%if ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0
+INIT_YMM avx2
+cglobal calc_satd_16x8    ; function to compute satd cost for 16 columns, 8 rows
+    pxor                m6, m6
+    vbroadcasti128      m0, [r0]
+    vbroadcasti128      m4, [r2]
+    vbroadcasti128      m1, [r0 + r1]
+    vbroadcasti128      m5, [r2 + r3]
+    pmaddubsw           m4, m7
+    pmaddubsw           m0, m7
+    pmaddubsw           m5, m7
+    pmaddubsw           m1, m7
+    psubw               m0, m4
+    psubw               m1, m5
+    vbroadcasti128      m2, [r0 + r1 * 2]
+    vbroadcasti128      m4, [r2 + r3 * 2]
+    vbroadcasti128      m3, [r0 + r4]
+    vbroadcasti128      m5, [r2 + r5]
+    pmaddubsw           m4, m7
+    pmaddubsw           m2, m7
+    pmaddubsw           m5, m7
+    pmaddubsw           m3, m7
+    psubw               m2, m4
+    psubw               m3, m5
+    lea                 r0, [r0 + r1 * 4]
+    lea                 r2, [r2 + r3 * 4]
+    paddw               m4, m0, m1
+    psubw               m1, m1, m0
+    paddw               m0, m2, m3
+    psubw               m3, m2
+    paddw               m2, m4, m0
+    psubw               m0, m4
+    paddw               m4, m1, m3
+    psubw               m3, m1
+    pabsw               m2, m2
+    pabsw               m0, m0
+    pabsw               m4, m4
+    pabsw               m3, m3
+    pblendw             m1, m2, m0, 10101010b
+    pslld               m0, 16
+    psrld               m2, 16
+    por                 m0, m2
+    pmaxsw              m1, m0
+    paddw               m6, m1
+    pblendw             m2, m4, m3, 10101010b
+    pslld               m3, 16
+    psrld               m4, 16
+    por                 m3, m4
+    pmaxsw              m2, m3
+    paddw               m6, m2
+    vbroadcasti128      m1, [r0]
+    vbroadcasti128      m4, [r2]
+    vbroadcasti128      m2, [r0 + r1]
+    vbroadcasti128      m5, [r2 + r3]
+    pmaddubsw           m4, m7
+    pmaddubsw           m1, m7
+    pmaddubsw           m5, m7
+    pmaddubsw           m2, m7
+    psubw               m1, m4
+    psubw               m2, m5
+    vbroadcasti128      m0, [r0 + r1 * 2]
+    vbroadcasti128      m4, [r2 + r3 * 2]
+    vbroadcasti128      m3, [r0 + r4]
+    vbroadcasti128      m5, [r2 + r5]
+    lea                 r0, [r0 + r1 * 4]
+    lea                 r2, [r2 + r3 * 4]
+    pmaddubsw           m4, m7
+    pmaddubsw           m0, m7
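
Note on `downShift_16`, added in the first hunk of this file: the routine converts 16-bit samples to 8-bit by shifting right and packing with unsigned saturation. It doubles `srcStride` up front (`add r1d, r1d`), so the stride is given in 16-bit elements, and it handles the last row separately with 32/16/8/4/2/1-wide tails so that it never writes past `width`. A scalar sketch of the per-sample result follows. `downShift16_ref` is a hypothetical name, and the `mask` parameter mentioned in the comment header is left out because the assembly shown does not use it.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the 16bpp -> 8bpp plane copy.
 * Strides are in elements of the respective plane. */
void downShift16_ref(const uint16_t* src, intptr_t srcStride,
                     uint8_t* dst, intptr_t dstStride,
                     int width, int height, int shift)
{
    for (int r = 0; r < height; r++)
    {
        for (int c = 0; c < width; c++)
        {
            unsigned v = (unsigned)src[c] >> shift;

            /* For shifted values below 32768 this matches the unsigned
             * saturation performed by packuswb. */
            dst[c] = (uint8_t)(v > 255 ? 255 : v);
        }

        src += srcStride;
        dst += dstStride;
    }
}
```

With 10-bit input and `shift = 2`, a sample of 1023 maps to 255 and 512 maps to 128; anything that still exceeds 8 bits after the shift is clamped rather than wrapped.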

 
+    psubw               m1, m4
+    psubw               m2, m5
+    vbroadcasti128      m0, [r0 + r1 * 2]
+    vbroadcasti128      m4, [r2 + r3 * 2]
+    vbroadcasti128      m3, [r0 + r4]
+    vbroadcasti128      m5, [r2 + r5]
+    lea                 r0, [r0 + r1 * 4]
+    lea                 r2, [r2 + r3 * 4]
+    pmaddubsw           m4, m7
+    pmaddubsw           m0, m7
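
The SATD routine above is cut off by the diff viewer, so the following is only a scalar sketch of the general technique (a Hadamard transform of the pixel differences, followed by a sum of absolute values), not a transcription of calc_satd_16x8. The name satd_4x4_ref is hypothetical, the block size is reduced to 4x4 for brevity, and halving the final sum is an assumed convention.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 4x4 SATD reference for 8-bit pixels.
 * Step 1: 4-point Hadamard transform along each row of the difference block.
 * Step 2: the same transform along each column, accumulating |coefficient|.
 * The sum is halved at the end (assumed convention). */
static int satd_4x4_ref(const uint8_t *pix0, intptr_t stride0,
                        const uint8_t *pix1, intptr_t stride1)
{
    int t[4][4];
    int sum = 0;

    for (int y = 0; y < 4; y++) {
        int a0 = pix0[y * stride0 + 0] - pix1[y * stride1 + 0];
        int a1 = pix0[y * stride0 + 1] - pix1[y * stride1 + 1];
        int a2 = pix0[y * stride0 + 2] - pix1[y * stride1 + 2];
        int a3 = pix0[y * stride0 + 3] - pix1[y * stride1 + 3];
        int s01 = a0 + a1, d01 = a0 - a1;
        int s23 = a2 + a3, d23 = a2 - a3;
        t[y][0] = s01 + s23;
        t[y][1] = s01 - s23;
        t[y][2] = d01 + d23;
        t[y][3] = d01 - d23;
    }

    for (int x = 0; x < 4; x++) {
        int s01 = t[0][x] + t[1][x], d01 = t[0][x] - t[1][x];
        int s23 = t[2][x] + t[3][x], d23 = t[2][x] - t[3][x];
        sum += abs(s01 + s23) + abs(s01 - s23)
             + abs(d01 + d23) + abs(d01 - d23);
    }

    return sum >> 1;
}
```

The pabsw/pmaxsw pairs in the hunk look like the usual shortcut based on the identity |a + b| + |a - b| = 2 * max(|a|, |b|), which folds the last butterfly stage and the halving into a single max. That reading is an inference from the visible instructions, since the rest of the function is not shown.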

x265_1.6.tar.gz/source/common/x86/pixel-util.h -> x265_1.7.tar.gz/source/common/x86/pixel-util.h Changed

@@ -73,15 +73,18 @@
 float x265_pixel_ssim_end4_sse2(int sum0[5][4], int sum1[5][4], int width);
 float x265_pixel_ssim_end4_avx(int sum0[5][4], int sum1[5][4], int width);
 
-void x265_scale1D_128to64_ssse3(pixel*, const pixel*, intptr_t);
-void x265_scale1D_128to64_avx2(pixel*, const pixel*, intptr_t);
+void x265_scale1D_128to64_ssse3(pixel*, const pixel*);
+void x265_scale1D_128to64_avx2(pixel*, const pixel*);
 void x265_scale2D_64to32_ssse3(pixel*, const pixel*, intptr_t);
+void x265_scale2D_64to32_avx2(pixel*, const pixel*, intptr_t);
 
-int x265_findPosLast_x64(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig);
+int x265_scanPosLast_x64(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
+int x265_scanPosLast_avx2_bmi2(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
+uint32_t x265_findPosFirstLast_ssse3(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]);
 
 #define SETUP_CHROMA_PIXELSUB_PS_FUNC(W, H, cpu) \
     void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t*  dest, intptr_t destride, const pixel* src0, const pixel* src1, intptr_t srcstride0, intptr_t srcstride1); \
-    void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t*  scr1, intptr_t srcStride0, intptr_t srcStride1);
+    void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t*  src1, intptr_t srcStride0, intptr_t srcStride1);
 
 #define CHROMA_420_PIXELSUB_DEF(cpu) \
     SETUP_CHROMA_PIXELSUB_PS_FUNC(4, 4, cpu); \
@@ -97,7 +100,7 @@
 
 #define SETUP_LUMA_PIXELSUB_PS_FUNC(W, H, cpu) \
     void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t*  dest, intptr_t destride, const pixel* src0, const pixel* src1, intptr_t srcstride0, intptr_t srcstride1); \
-    void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t*  scr1, intptr_t srcStride0, intptr_t srcStride1);
+    void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t*  src1, intptr_t srcStride0, intptr_t srcStride1);
 
 #define LUMA_PIXELSUB_DEF(cpu) \
     SETUP_LUMA_PIXELSUB_PS_FUNC(8,   8, cpu); \

 

x265_1.6.tar.gz/source/common/x86/pixel-util8.asm -> x265_1.7.tar.gz/source/common/x86/pixel-util8.asm Changed

@@ -40,16 +40,17 @@
 ssim_c1:   times 4 dd 416          ; .01*.01*255*255*64
 ssim_c2:   times 4 dd 235963       ; .03*.03*255*255*64*63
 %endif
-mask_ff:   times 16 db 0xff
-           times 16 db 0
-deinterleave_shuf: db 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15
-deinterleave_word_shuf: db 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15
-hmul_16p:  times 16 db 1
-           times 8 db 1, -1
-hmulw_16p:  times 8 dw 1
-            times 4 dw 1, -1
 
-trans8_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7
+mask_ff:                times 16 db 0xff
+                        times 16 db 0
+deinterleave_shuf:      times  2 db 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15
+deinterleave_word_shuf: times  2 db 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15
+hmul_16p:               times 16 db 1
+                        times  8 db 1, -1
+hmulw_16p:              times  8 dw 1
+                        times  4 dw 1, -1
+
+trans8_shuf:            dd 0, 4, 1, 5, 2, 6, 3, 7
 
 SECTION .text
 
@@ -67,6 +68,7 @@
 cextern pb_2
 cextern pb_4
 cextern pb_8
+cextern pb_15
 cextern pb_16
 cextern pb_32
 cextern pb_64
@@ -616,7 +618,7 @@
 
 %if ARCH_X86_64 == 1
 INIT_YMM avx2
-cglobal quant, 5,5,10
+cglobal quant, 5,6,9
     ; fill qbits
     movd            xm4, r4d            ; m4 = qbits
 
@@ -627,7 +629,7 @@
     ; fill offset
     vpbroadcastd    m5, r5m             ; m5 = add
 
-    vpbroadcastw    m9, [pw_1]          ; m9 = word [1]
+    lea             r5, [pw_1]
 
     mov             r4d, r6m
     shr             r4d, 4
@@ -665,7 +667,7 @@
 
     ; count non-zero coeff
     ; TODO: popcnt is faster, but some CPU can't support
-    pminuw          m2, m9
+    pminuw          m2, [r5]
     paddw           m7, m2
 
     add             r0, mmsize
@@ -1285,9 +1287,8 @@
     mov          r6d, r6m
     shl          r6d, 16
     or           r6d, r5d          ; assuming both (w0<<6) and round are using maximum of 16 bits each.
-    movd         xm0, r6d
-    pshufd       xm0, xm0, 0       ; m0 = [w0<<6, round]
-    vinserti128  m0, m0, xm0, 1    ; document says (pshufd + vinserti128) can be replaced with vpbroadcastd m0, xm0, but having build problem, need to investigate
+
+    vpbroadcastd m0, r6d
 
     movd         xm1, r7m
     vpbroadcastd m2, r8m
@@ -1492,6 +1493,84 @@
     dec         r5d
     jnz         .loopH
     RET
+
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal weight_sp, 6, 9, 7
+    mov             r7d, r7m
+    shl             r7d, 16
+    or              r7d, r6m
+    vpbroadcastd    m0, r7d            ; m0 = times 8 dw w0, round
+    movd            xm1, r8m            ; m1 = [shift]
+    vpbroadcastd    m2, r9m            ; m2 = times 16 dw offset
+    vpbroadcastw    m3, [pw_1]
+    vpbroadcastw    m4, [pw_2000]
+
+    add             r2d, r2d            ; 2 * srcstride
+
+    mov             r7, r0
+    mov             r8, r1
+.loopH:
+    mov             r6d, r4d            ; width
+
+    ; save old src and dst
+    mov             r0, r7              ; src
+    mov             r1, r8              ; dst
+.loopW:
+    movu            m5, [r0]
+    paddw           m5, m4
+
+    punpcklwd       m6,m5, m3
+    pmaddwd         m6, m0
+    psrad           m6, xm1
+    paddd           m6, m2
+
+    punpckhwd       m5, m3
+    pmaddwd         m5, m0
+    psrad           m5, xm1
+    paddd           m5, m2
+
+    packssdw        m6, m5
+    packuswb        m6, m6
+    vpermq          m6, m6, 10001000b
+
+    sub             r6d, 16
+    jl              .width8
+    movu            [r1], xm6
+    je              .nextH
+    add             r0, 32
+    add             r1, 16
+    jmp             .loopW
+
+.width8:
+    add             r6d, 16
+    cmp             r6d, 8
+    jl              .width4
+    movq            [r1], xm6
+    je              .nextH
+    psrldq          m6, 8
+    sub             r6d, 8
+    add             r1, 8
+
+.width4:
+    cmp             r6d, 4
+    jl              .width2
+    movd            [r1], xm6
+    je              .nextH
+    add             r1, 4
+    pshufd          m6, m6, 1
+
+.width2:
+    pextrw          [r1], xm6, 0
+
+.nextH:
+    lea             r7, [r7 + r2]
+    lea             r8, [r8 + r3]
+
+    dec             r5d
+    jnz             .loopH
+    RET
+%endif
 %endif  ; end of (HIGH_BIT_DEPTH == 0)
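
The new weight_sp AVX2 routine can be read against a scalar model. The sketch below is an interpretation of the visible instructions for the 8-bit build: add the 0x2000 constant loaded from pw_2000, multiply by w0 and add round in one pmaddwd, shift right, add offset, then saturate to 0..255 through the two pack instructions. The names weight_sp_ref, clip_u8 and INTERNAL_OFFS are hypothetical, and strides are counted in elements here rather than bytes.

```c
#include <stdint.h>

/* Assumed value of the constant broadcast from pw_2000 (0x2000). */
#define INTERNAL_OFFS 8192

/* Clamp an int to the 8-bit pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar model of weighted prediction from 16-bit
 * intermediates to 8-bit pixels. The right shift of a negative value is
 * assumed to be arithmetic, as on all mainstream compilers. */
static void weight_sp_ref(const int16_t *src, uint8_t *dst,
                          intptr_t srcStride, intptr_t dstStride,
                          int width, int height,
                          int w0, int round, int shift, int offset)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int v = ((w0 * (src[x] + INTERNAL_OFFS) + round) >> shift) + offset;
            dst[x] = clip_u8(v);
        }
        src += srcStride;
        dst += dstStride;
    }
}
```

With w0 = 64, round = 32 and shift = 6 the weighting is an identity, which makes it a convenient sanity check when comparing a vector build against a scalar one.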
     
 
@@ -3944,6 +4023,150 @@
     RET
 %endif
 
+;-----------------------------------------------------------------
+; void scale2D_64to32(pixel *dst, pixel *src, intptr_t stride)
+;-----------------------------------------------------------------
+%if HIGH_BIT_DEPTH
+INIT_YMM avx2
+cglobal scale2D_64to32, 3, 4, 5, dest, src, stride
+    mov         r3d,     32
+    add         r2d,     r2d
+    mova        m4,      [pw_2000]
+
+.loop:
+    movu        m0,      [r1]
+    movu        m1,      [r1 + 1 * mmsize]
+    movu        m2,      [r1 + r2]
+    movu        m3,      [r1 + r2 + 1 * mmsize]
+
+    paddw       m0,      m2
+    paddw       m1,      m3
+    phaddw      m0,      m1
+
+    pmulhrsw    m0,      m4
+    vpermq      m0,      m0, q3120
+    movu        [r0],    m0
+
+    movu        m0,      [r1 + 2 * mmsize]
+    movu        m1,      [r1 + 3 * mmsize]
+    movu        m2,      [r1 + r2 + 2 * mmsize]
+    movu        m3,      [r1 + r2 + 3 * mmsize]
+
+    paddw       m0,      m2
+    paddw       m1,      m3
+    phaddw      m0,      m1
+
+    pmulhrsw    m0,      m4
+    vpermq      m0,      m0, q3120
+    movu        [r0 + mmsize], m0
+
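
The scale2D_64to32 loop above sums two rows (paddw), then adjacent columns (phaddw), and multiplies by 0x2000 with pmulhrsw, which for these value ranges works out to (sum + 2) >> 2. A scalar sketch of that 2x2 rounded average for the high-bit-depth case follows. The name scale2d_64to32_ref is hypothetical, the stride is counted in pixels, and the 32-pixel destination pitch is an assumption inferred from the two consecutive mmsize stores per output row.

```c
#include <stdint.h>

/* Hypothetical scalar model of a 64x64 -> 32x32 downscale.
 * Each output pixel is the rounded average of a 2x2 block of the source.
 * Assumption: dst is a packed 32x32 block (pitch of 32 pixels). */
static void scale2d_64to32_ref(uint16_t *dst, const uint16_t *src,
                               intptr_t stride)
{
    for (int y = 0; y < 64; y += 2) {
        for (int x = 0; x < 64; x += 2) {
            const uint16_t *p = src + y * stride + x;
            int sum = p[0] + p[1] + p[stride] + p[stride + 1];
            dst[(y / 2) * 32 + (x / 2)] = (uint16_t)((sum + 2) >> 2);
        }
    }
}
```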

 

x265_1.6.tar.gz/source/common/x86/pixel.h -> x265_1.7.tar.gz/source/common/x86/pixel.h Changed

@@ -226,6 +226,7 @@
 ADDAVG(addAvg_32x48)
 
 void x265_downShift_16_sse2(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask);
+void x265_downShift_16_avx2(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask);
 void x265_upShift_8_sse4(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift);
 int x265_psyCost_pp_4x4_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
 int x265_psyCost_pp_8x8_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
@@ -256,10 +257,14 @@
 void x265_pixel_add_ps_16x16_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
 void x265_pixel_add_ps_32x32_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
 void x265_pixel_add_ps_64x64_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_add_ps_16x32_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_add_ps_32x64_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1);
 
 void x265_pixel_sub_ps_16x16_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
 void x265_pixel_sub_ps_32x32_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
 void x265_pixel_sub_ps_64x64_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_sub_ps_16x32_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
+void x265_pixel_sub_ps_32x64_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1);
 
 int x265_psyCost_pp_4x4_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
 int x265_psyCost_pp_8x8_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride);
@@ -272,6 +277,7 @@
 int x265_psyCost_ss_16x16_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
 int x265_psyCost_ss_32x32_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
 int x265_psyCost_ss_64x64_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride);
+void x265_weight_sp_avx2(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset);
 
 #undef DECL_PIXELS
 #undef DECL_HEVC_SSD

 

x265_1.6.tar.gz/source/common/x86/pixeladd8.asm -> x265_1.7.tar.gz/source/common/x86/pixeladd8.asm Changed

@@ -398,10 +398,65 @@
 
     jnz         .loop
     RET
+%endif
+%endmacro
+PIXEL_ADD_PS_W16_H4 16, 16
+PIXEL_ADD_PS_W16_H4 16, 32
 
+;-----------------------------------------------------------------------------
+; void pixel_add_ps_16x16(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)
+;-----------------------------------------------------------------------------
+%macro PIXEL_ADD_PS_W16_H4_avx2 1
+%if HIGH_BIT_DEPTH
+%if ARCH_X86_64
 INIT_YMM avx2
-cglobal pixel_add_ps_16x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
-    mov         r6d,        %2/4
+cglobal pixel_add_ps_16x%1, 6, 10, 4, dest, destride, src0, scr1, srcStride0, srcStride1
+    mova    m3,     [pw_pixel_max]
+    pxor    m2,     m2
+    mov     r6d,    %1/4
+    add     r4d,    r4d
+    add     r5d,    r5d
+    add     r1d,    r1d
+    lea     r7,     [r4 * 3]
+    lea     r8,     [r5 * 3]
+    lea     r9,     [r1 * 3]
+
+.loop:
+    movu    m0,     [r2]
+    movu    m1,     [r3]
+    paddw   m0,     m1
+    CLIPW   m0, m2, m3
+    movu    [r0],              m0
+
+    movu    m0,     [r2 + r4]
+    movu    m1,     [r3 + r5]
+    paddw   m0,     m1
+    CLIPW   m0, m2, m3
+    movu    [r0 + r1],         m0
+
+    movu    m0,     [r2 + r4 * 2]
+    movu    m1,     [r3 + r5 * 2]
+    paddw   m0,     m1
+    CLIPW   m0, m2, m3
+    movu    [r0 + r1 * 2],     m0
+
+    movu    m0,     [r2 + r7]
+    movu    m1,     [r3 + r8]
+    paddw   m0,     m1
+    CLIPW   m0, m2, m3
+    movu    [r0 + r9],         m0
+
+    dec     r6d
+    lea     r0,     [r0 + r1 * 4]
+    lea     r2,     [r2 + r4 * 4]
+    lea     r3,     [r3 + r5 * 4]
+    jnz     .loop
+    RET
+%endif
+%else
+INIT_YMM avx2
+cglobal pixel_add_ps_16x%1, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
+    mov         r6d,        %1/4
     add         r5,         r5
 .loop:
 
@@ -447,8 +502,8 @@
 %endif
 %endmacro
 
-PIXEL_ADD_PS_W16_H4 16, 16
-PIXEL_ADD_PS_W16_H4 16, 32
+PIXEL_ADD_PS_W16_H4_avx2 16
+PIXEL_ADD_PS_W16_H4_avx2 32
 
 
 ;-----------------------------------------------------------------------------
@@ -569,11 +624,90 @@
 
     jnz         .loop
     RET
+%endif
+%endmacro
+PIXEL_ADD_PS_W32_H2 32, 32
+PIXEL_ADD_PS_W32_H2 32, 64
 
+;-----------------------------------------------------------------------------
+; void pixel_add_ps_32x32(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1)
+;-----------------------------------------------------------------------------
+%macro PIXEL_ADD_PS_W32_H4_avx2 1
+%if HIGH_BIT_DEPTH
+%if ARCH_X86_64
 INIT_YMM avx2
-cglobal pixel_add_ps_32x%2, 6, 7, 8, dest, destride, src0, scr1, srcStride0, srcStride1
-    mov         r6d,        %2/4
+cglobal pixel_add_ps_32x%1, 6, 10, 6, dest, destride, src0, scr1, srcStride0, srcStride1
+    mova    m5,     [pw_pixel_max]
+    pxor    m4,     m4
+    mov     r6d,    %1/4
+    add     r4d,    r4d
+    add     r5d,    r5d
+    add     r1d,    r1d
+    lea     r7,     [r4 * 3]
+    lea     r8,     [r5 * 3]
+    lea     r9,     [r1 * 3]
+
+.loop:
+    movu    m0,     [r2]
+    movu    m2,     [r2 + 32]
+    movu    m1,     [r3]
+    movu    m3,     [r3 + 32]
+    paddw   m0,     m1
+    paddw   m2,     m3
+    CLIPW2  m0, m2, m4, m5
+
+    movu    [r0],               m0
+    movu    [r0 + 32],          m2
+
+    movu    m0,     [r2 + r4]
+    movu    m2,     [r2 + r4 + 32]
+    movu    m1,     [r3 + r5]
+    movu    m3,     [r3 + r5 + 32]
+    paddw   m0,     m1
+    paddw   m2,     m3
+    CLIPW2  m0, m2, m4, m5
+
+    movu    [r0 + r1],          m0
+    movu    [r0 + r1 + 32],     m2
+
+    movu    m0,     [r2 + r4 * 2]
+    movu    m2,     [r2 + r4 * 2 + 32]
+    movu    m1,     [r3 + r5 * 2]
+    movu    m3,     [r3 + r5 * 2 + 32]
+    paddw   m0,     m1
+    paddw   m2,     m3
+    CLIPW2  m0, m2, m4, m5
+
+    movu    [r0 + r1 * 2],      m0
+    movu    [r0 + r1 * 2 + 32], m2
+
+    movu    m0,     [r2 + r7]
+    movu    m2,     [r2 + r7 + 32]
+    movu    m1,     [r3 + r8]
+    movu    m3,     [r3 + r8 + 32]
+    paddw   m0,     m1
+    paddw   m2,     m3
+    CLIPW2  m0, m2, m4, m5
+
+    movu    [r0 + r9],          m0
+    movu    [r0 + r9 + 32],     m2
+
+    dec     r6d
+    lea     r0,     [r0 + r1 * 4]
+    lea     r2,     [r2 + r4 * 4]
+    lea     r3,     [r3 + r5 * 4]
+    jnz     .loop
+    RET
+%endif
+%else
+%if ARCH_X86_64
+INIT_YMM avx2
+cglobal pixel_add_ps_32x%1, 6, 10, 8, dest, destride, src0, scr1, srcStride0, srcStride1
+    mov         r6d,        %1/4
     add         r5,         r5
+    lea         r7,         [r4 * 3]
+    lea         r8,         [r5 * 3]
+    lea         r9,         [r1 * 3]
 .loop:
     pmovzxbw    m0,         [r2]                ; first half of row 0 of src0
     pmovzxbw    m1,         [r2 + 16]           ; second half of row 0 of src0
@@ -597,44 +731,41 @@
     vpermq      m0, m0, 11011000b
     movu        [r0 + r1],      m0              ; row 1 of dst
 
-    lea         r2,         [r2 + r4 * 2]
-    lea         r3,         [r3 + r5 * 2]
-    lea         r0,         [r0 + r1 * 2]
-
-    pmovzxbw    m0,         [r2]                ; first half of row 2 of src0
-    pmovzxbw    m1,         [r2 + 16]           ; second half of row 2 of src0
-    movu        m2,         [r3]                ; first half of row 2 of src1
-    movu        m3,         [r3 + 32]           ; second half of row 2 of src1
+    pmovzxbw    m0,         [r2 + r4 * 2]       ; first half of row 2 of src0
+    pmovzxbw    m1,         [r2 + r4 * 2 + 16]  ; second half of row 2 of src0
+    movu        m2,         [r3 + r5 * 2]       ; first half of row 2 of src1
+    movu        m3,         [r3 + + r5 * 2 + 32]; second half of row 2 of src1
 
     paddw       m0,         m2
     paddw       m1,         m3
     packuswb    m0,         m1
     vpermq      m0, m0, 11011000b
-    movu        [r0],      m0                   ; row 2 of dst
+    movu        [r0 + r1 * 2],      m0          ; row 2 of dst
 
-    pmovzxbw    m0,         [r2 + r4]           ; first half of row 3 of src0
-    pmovzxbw    m1,         [r2 + r4 + 16]      ; second half of row 3 of src0
-    movu        m2,         [r3 + r5]           ; first half of row 3 of src1
-    movu        m3,         [r3 + r5 + 32]      ; second half of row 3 of src1
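
The high-bit-depth pixel_add_ps paths added in this file add a 16-bit residual to a predicted pixel and clamp the result between zero and pw_pixel_max (the CLIPW and CLIPW2 macros). A scalar sketch of that operation follows. The name pixel_add_ps_ref and the explicit width, height and pixelMax parameters are hypothetical (the assembly bakes the block size into each generated function), and strides are counted in elements.

```c
#include <stdint.h>

/* Hypothetical scalar model of "add residual, then clip" for
 * high-bit-depth pixels stored as uint16_t.
 * pixelMax would be 1023 for a 10-bit build (assumption). */
static void pixel_add_ps_ref(uint16_t *dst, intptr_t dstStride,
                             const uint16_t *src0, const int16_t *src1,
                             intptr_t srcStride0, intptr_t srcStride1,
                             int width, int height, int pixelMax)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int v = src0[x] + src1[x];
            if (v < 0)
                v = 0;
            if (v > pixelMax)
                v = pixelMax;
            dst[x] = (uint16_t)v;
        }
        dst  += dstStride;
        src0 += srcStride0;
        src1 += srcStride1;
    }
}
```

The 8-bit paths in the same file reach the same result differently: they widen the pixels with pmovzxbw, add the residual, and rely on packuswb to saturate to 0..255.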

 

x265_1.6.tar.gz/source/common/x86/sad-a.asm -> x265_1.7.tar.gz/source/common/x86/sad-a.asm Changed

@@ -4004,10 +4004,12 @@
     RET
 
 INIT_YMM avx2
-cglobal pixel_sad_32x24, 4,5,6
+cglobal pixel_sad_32x24, 4,7,6
     xorps           m0, m0
     xorps           m5, m5
     mov             r4d, 6
+    lea             r5, [r1 * 3]
+    lea             r6, [r3 * 3]
 .loop
     movu           m1, [r0]               ; row 0 of pix0
     movu           m2, [r2]               ; row 0 of pix1
@@ -4019,21 +4021,18 @@
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
-
-    movu           m1, [r0]               ; row 2 of pix0
-    movu           m2, [r2]               ; row 2 of pix1
-    movu           m3, [r0 + r1]          ; row 3 of pix0
-    movu           m4, [r2 + r3]          ; row 3 of pix1
+    movu           m1, [r0 + 2 * r1]      ; row 2 of pix0
+    movu           m2, [r2 + 2 * r3]      ; row 2 of pix1
+    movu           m3, [r0 + r5]          ; row 3 of pix0
+    movu           m4, [r2 + r6]          ; row 3 of pix1
 
     psadbw         m1, m2
     psadbw         m3, m4
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
+    lea     r2,     [r2 + 4 * r3]
+    lea     r0,     [r0 + 4 * r1]
 
     dec         r4d
     jnz         .loop
@@ -4307,10 +4306,12 @@
     RET
 
 INIT_YMM avx2
-cglobal pixel_sad_64x48, 4,5,6
+cglobal pixel_sad_64x48, 4,7,6
     xorps           m0, m0
     xorps           m5, m5
-    mov             r4d, 24
+    mov             r4d, 12
+    lea             r5, [r1 * 3]
+    lea             r6, [r3 * 3]
 .loop
     movu           m1, [r0]               ; first 32 of row 0 of pix0
     movu           m2, [r2]               ; first 32 of row 0 of pix1
@@ -4332,8 +4333,28 @@
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
+    movu           m1, [r0 + 2 * r1]      ; first 32 of row 2 of pix0
+    movu           m2, [r2 + 2 * r3]      ; first 32 of row 2 of pix1
+    movu           m3, [r0 + 2 * r1 + 32] ; second 32 of row 2 of pix0
+    movu           m4, [r2 + 2 * r3 + 32] ; second 32 of row 2 of pix1
+
+    psadbw         m1, m2
+    psadbw         m3, m4
+    paddd          m0, m1
+    paddd          m5, m3
+
+    movu           m1, [r0 + r5]          ; first 32 of row 3 of pix0
+    movu           m2, [r2 + r6]          ; first 32 of row 3 of pix1
+    movu           m3, [r0 + 32 + r5]     ; second 32 of row 3 of pix0
+    movu           m4, [r2 + 32 + r6]     ; second 32 of row 3 of pix1
+
+    psadbw         m1, m2
+    psadbw         m3, m4
+    paddd          m0, m1
+    paddd          m5, m3
+
+    lea     r2,     [r2 + 4 * r3]
+    lea     r0,     [r0 + 4 * r1]
 
     dec         r4d
     jnz         .loop
@@ -4347,10 +4368,12 @@
     RET
 
 INIT_YMM avx2
-cglobal pixel_sad_64x64, 4,5,6
+cglobal pixel_sad_64x64, 4,7,6
     xorps           m0, m0
     xorps           m5, m5
     mov             r4d, 8
+    lea             r5, [r1 * 3]
+    lea             r6, [r3 * 3]
 .loop
     movu           m1, [r0]               ; first 32 of row 0 of pix0
     movu           m2, [r2]               ; first 32 of row 0 of pix1
@@ -4372,31 +4395,28 @@
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
-
-    movu           m1, [r0]               ; first 32 of row 2 of pix0
-    movu           m2, [r2]               ; first 32 of row 2 of pix1
-    movu           m3, [r0 + 32]          ; second 32 of row 2 of pix0
-    movu           m4, [r2 + 32]          ; second 32 of row 2 of pix1
+    movu           m1, [r0 + 2 * r1]      ; first 32 of row 2 of pix0
+    movu           m2, [r2 + 2 * r3]      ; first 32 of row 2 of pix1
+    movu           m3, [r0 + 2 * r1 + 32] ; second 32 of row 2 of pix0
+    movu           m4, [r2 + 2 * r3 + 32] ; second 32 of row 2 of pix1
 
     psadbw         m1, m2
     psadbw         m3, m4
     paddd          m0, m1
     paddd          m5, m3
 
-    movu           m1, [r0 + r1]          ; first 32 of row 3 of pix0
-    movu           m2, [r2 + r3]          ; first 32 of row 3 of pix1
-    movu           m3, [r0 + 32 + r1]     ; second 32 of row 3 of pix0
-    movu           m4, [r2 + 32 + r3]     ; second 32 of row 3 of pix1
+    movu           m1, [r0 + r5]          ; first 32 of row 3 of pix0
+    movu           m2, [r2 + r6]          ; first 32 of row 3 of pix1
+    movu           m3, [r0 + 32 + r5]     ; second 32 of row 3 of pix0
+    movu           m4, [r2 + 32 + r6]     ; second 32 of row 3 of pix1
 
     psadbw         m1, m2
     psadbw         m3, m4
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
+    lea     r2,     [r2 + 4 * r3]
+    lea     r0,     [r0 + 4 * r1]
 
     movu           m1, [r0]               ; first 32 of row 4 of pix0
     movu           m2, [r2]               ; first 32 of row 4 of pix1
@@ -4418,31 +4438,28 @@
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
-
-    movu           m1, [r0]               ; first 32 of row 6 of pix0
-    movu           m2, [r2]               ; first 32 of row 6 of pix1
-    movu           m3, [r0 + 32]          ; second 32 of row 6 of pix0
-    movu           m4, [r2 + 32]          ; second 32 of row 6 of pix1
+    movu           m1, [r0 + 2 * r1]      ; first 32 of row 6 of pix0
+    movu           m2, [r2 + 2 * r3]      ; first 32 of row 6 of pix1
+    movu           m3, [r0 + 2 * r1 + 32] ; second 32 of row 6 of pix0
+    movu           m4, [r2 + 2 * r3 + 32] ; second 32 of row 6 of pix1
 
     psadbw         m1, m2
     psadbw         m3, m4
     paddd          m0, m1
     paddd          m5, m3
 
-    movu           m1, [r0 + r1]          ; first 32 of row 7 of pix0
-    movu           m2, [r2 + r3]          ; first 32 of row 7 of pix1
-    movu           m3, [r0 + 32 + r1]     ; second 32 of row 7 of pix0
-    movu           m4, [r2 + 32 + r3]     ; second 32 of row 7 of pix1
+    movu           m1, [r0 + r5]          ; first 32 of row 7 of pix0
+    movu           m2, [r2 + r6]          ; first 32 of row 7 of pix1
+    movu           m3, [r0 + 32 + r5]     ; second 32 of row 7 of pix0
+    movu           m4, [r2 + 32 + r6]     ; second 32 of row 7 of pix1
 
     psadbw         m1, m2
     psadbw         m3, m4
     paddd          m0, m1
     paddd          m5, m3
 
-    lea     r2,     [r2 + 2 * r3]
-    lea     r0,     [r0 + 2 * r1]
+    lea     r2,     [r2 + 4 * r3]
+    lea     r0,     [r0 + 4 * r1]
 
     dec         r4d
     jnz         .loop
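
The sad-a.asm hunks above all apply the same restructuring: precompute `3 * stride` once (`r5 = r1 * 3`, `r6 = r3 * 3`), address four rows from a single base pointer, and advance the pointers once per group of four rows instead of once per pair. A minimal scalar sketch of that addressing pattern follows; `sadRef`, `sadRow` and `sadUnrolled4` are hypothetical names, and the per-row work that `psadbw` does on 32 bytes at a time is written as a plain loop.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Plain reference: one row per iteration.
int sadRef(const std::uint8_t* pix0, std::intptr_t stride0,
           const std::uint8_t* pix1, std::intptr_t stride1,
           int width, int height)
{
    int sum = 0;
    for (int y = 0; y < height; y++, pix0 += stride0, pix1 += stride1)
        for (int x = 0; x < width; x++)
            sum += std::abs(int(pix0[x]) - int(pix1[x]));
    return sum;
}

// Sum of absolute differences over one row.
static int sadRow(const std::uint8_t* a, const std::uint8_t* b, int width)
{
    int sum = 0;
    for (int x = 0; x < width; x++)
        sum += std::abs(int(a[x]) - int(b[x]));
    return sum;
}

// Four rows per iteration, mirroring the addressing in the new assembly.
int sadUnrolled4(const std::uint8_t* pix0, std::intptr_t stride0,
                 const std::uint8_t* pix1, std::intptr_t stride1,
                 int width, int height)
{
    assert(height % 4 == 0);
    const std::intptr_t stride0x3 = stride0 * 3; // role of r5 = r1 * 3
    const std::intptr_t stride1x3 = stride1 * 3; // role of r6 = r3 * 3
    int sum = 0;
    for (int i = height / 4; i > 0; i--)
    {
        sum += sadRow(pix0,               pix1,               width); // row 0
        sum += sadRow(pix0 + stride0,     pix1 + stride1,     width); // row 1
        sum += sadRow(pix0 + 2 * stride0, pix1 + 2 * stride1, width); // row 2
        sum += sadRow(pix0 + stride0x3,   pix1 + stride1x3,   width); // row 3
        pix0 += 4 * stride0; // single pointer update per four rows
        pix1 += 4 * stride1;
    }
    return sum;
}
```

The result is identical to the one-row-per-iteration version; the change only removes the intermediate `lea` updates, at the cost of two extra general-purpose registers (hence `4,5,6` becoming `4,7,6` in the `cglobal` lines).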

 

x265_1.6.tar.gz/source/common/x86/sad16-a.asm -> x265_1.7.tar.gz/source/common/x86/sad16-a.asm Changed

@@ -276,9 +276,8 @@
     ABSW2   m3, m4, m3, m4, m7, m5
     paddw   m1, m2
     paddw   m3, m4
-    paddw   m3, m1
-    pmaddwd m3, [pw_1]
-    paddd   m0, m3
+    paddw   m0, m1
+    paddw   m0, m3
 %else
     movu    m1, [r2]
     movu    m2, [r2+2*r3]
@@ -287,15 +286,45 @@
     ABSW2   m1, m2, m1, m2, m3, m4
     lea     r0, [r0+4*r1]
     lea     r2, [r2+4*r3]
-    paddw   m2, m1
-    pmaddwd m2, [pw_1]
-    paddd   m0, m2
+    paddw   m0, m1
+    paddw   m0, m2
 %endif
 %endmacro
 
-;-----------------------------------------------------------------------------
-; int pixel_sad_NxM( uint16_t *, intptr_t, uint16_t *, intptr_t )
-;-----------------------------------------------------------------------------
+%macro SAD_INC_2ROW_Nx64 1
+%if 2*%1 > mmsize
+    movu    m1, [r2 + 0]
+    movu    m2, [r2 + 16]
+    movu    m3, [r2 + 2 * r3 + 0]
+    movu    m4, [r2 + 2 * r3 + 16]
+    psubw   m1, [r0 + 0]
+    psubw   m2, [r0 + 16]
+    psubw   m3, [r0 + 2 * r1 + 0]
+    psubw   m4, [r0 + 2 * r1 + 16]
+    ABSW2   m1, m2, m1, m2, m5, m6
+    lea     r0, [r0 + 4 * r1]
+    lea     r2, [r2 + 4 * r3]
+    ABSW2   m3, m4, m3, m4, m7, m5
+    paddw   m1, m2
+    paddw   m3, m4
+    paddw   m0, m1
+    paddw   m8, m3
+%else
+    movu    m1, [r2]
+    movu    m2, [r2 + 2 * r3]
+    psubw   m1, [r0]
+    psubw   m2, [r0 + 2 * r1]
+    ABSW2   m1, m2, m1, m2, m3, m4
+    lea     r0, [r0 + 4 * r1]
+    lea     r2, [r2 + 4 * r3]
+    paddw   m0, m1
+    paddw   m8, m2
+%endif
+%endmacro
+
+; ---------------------------------------------------------------------------- -
+; int pixel_sad_NxM(uint16_t *, intptr_t, uint16_t *, intptr_t)
+; ---------------------------------------------------------------------------- -
 %macro SAD 2
 cglobal pixel_sad_%1x%2, 4,5-(%2&4/4),8*(%1/mmsize)
     pxor    m0, m0
@@ -309,8 +338,35 @@
     dec    r4d
     jg .loop
 %endif
+%if %2 == 32
+    HADDUWD m0, m1
+    HADDD   m0, m1
+%else
+    HADDW   m0, m1
+%endif
+    movd    eax, xm0
+    RET
+%endmacro
 
+; ---------------------------------------------------------------------------- -
+; int pixel_sad_Nx64(uint16_t *, intptr_t, uint16_t *, intptr_t)
+; ---------------------------------------------------------------------------- -
+%macro SAD_Nx64 1
+cglobal pixel_sad_%1x64, 4,5-(64&4/4), 9
+    pxor    m0, m0
+    pxor    m8, m8
+    mov     r4d, 64 / 2
+.loop:
+    SAD_INC_2ROW_Nx64 %1
+    dec    r4d
+    jg .loop
+
+    HADDUWD m0, m1
+    HADDUWD m8, m1
     HADDD   m0, m1
+    HADDD   m8, m1
+    paddd   m0, m8
+
     movd    eax, xm0
     RET
 %endmacro
@@ -321,7 +377,7 @@
 SAD  16, 12
 SAD  16, 16
 SAD  16, 32
-SAD  16, 64
+SAD_Nx64  16
 
 INIT_XMM sse2
 SAD  8,  4
@@ -329,6 +385,13 @@
 SAD  8, 16
 SAD  8, 32
 
+INIT_YMM avx2
+SAD  16,  4
+SAD  16,  8
+SAD  16, 12
+SAD  16, 16
+SAD  16, 32
+
 ;------------------------------------------------------------------
 ; int pixel_sad_32xN( uint16_t *, intptr_t, uint16_t *, intptr_t )
 ;------------------------------------------------------------------
@@ -716,7 +779,6 @@
 %endif
     movd     eax, xm0
     RET
-
 ;-----------------------------------------------------------------------------
 ; void pixel_sad_xN_WxH( uint16_t *fenc, uint16_t *pix0, uint16_t *pix1,
 ;                        uint16_t *pix2, intptr_t i_stride, int scores[3] )
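
The sad16-a.asm change stops widening to 32 bits on every iteration (`pmaddwd` + `paddd`) and instead accumulates with `paddw`, widening once after the loop. That is only safe while each 16-bit lane stays within range, which is presumably why the 16x64 case gets its own `SAD_Nx64` macro with a second accumulator (`m8`). The sketch below checks the worst case under two assumptions that are mine, not stated in the diff: 10-bit samples (maximum absolute difference 1023) and 128-bit registers holding eight 16-bit lanes, so a 16x64 block contributes 16 * 64 / 8 = 128 terms per lane. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Add `terms` copies of `maxAbsDiff` into one 16-bit lane, wrapping
// modulo 2^16 the way paddw does.
std::uint16_t accumulateLane16(int terms, std::uint16_t maxAbsDiff)
{
    assert(terms >= 0);
    std::uint16_t lane = 0;
    for (int i = 0; i < terms; i++)
        lane = static_cast<std::uint16_t>(lane + maxAbsDiff);
    return lane;
}

// Same worst case, but split across two 16-bit accumulators (the roles
// of m0 and m8) and widened to 32 bits only once, after the loop.
std::uint32_t worstCaseLaneSplit(int termsPerLane, std::uint16_t maxAbsDiff)
{
    std::uint16_t acc0 = accumulateLane16(termsPerLane / 2, maxAbsDiff);
    std::uint16_t acc1 = accumulateLane16(termsPerLane - termsPerLane / 2, maxAbsDiff);
    return std::uint32_t(acc0) + std::uint32_t(acc1);
}
```

Under these assumptions a single accumulator would need to hold 128 * 1023 = 130944 and wraps, while each half of the split holds 64 * 1023 = 65472, which still fits in an unsigned 16-bit lane.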

 

x265_1.6.tar.gz/source/common/x86/x86inc.asm -> x265_1.7.tar.gz/source/common/x86/x86inc.asm Changed

 
@@ -72,7 +72,7 @@
     %define mangle(x) x
 %endif
 
-%macro SECTION_RODATA 0-1 16
+%macro SECTION_RODATA 0-1 32
     SECTION .rodata align=%1
 %endmacro
 
@@ -715,6 +715,7 @@
     %else
         global %1
     %endif
+    ALIGN 32
     %1: %2
 %endmacro
 
​

x265_1.6.tar.gz/source/encoder/CMakeLists.txt -> x265_1.7.tar.gz/source/encoder/CMakeLists.txt Changed

 
@@ -1,7 +1,11 @@
 # vim: syntax=cmake
 
 if(GCC)
-   add_definitions(-Wno-uninitialized)
+    add_definitions(-Wno-uninitialized)
+    if(CC_HAS_NO_STRICT_OVERFLOW)
+        # GCC 4.9.2 gives warnings we know we can ignore in this file
+        set_source_files_properties(slicetype.cpp PROPERTIES COMPILE_FLAGS -Wno-strict-overflow)
+    endif(CC_HAS_NO_STRICT_OVERFLOW)
 endif()
 if(MSVC)
    add_definitions(/wd4701) # potentially uninitialized local variable 'foo' used
​

x265_1.6.tar.gz/source/encoder/analysis.cpp -> x265_1.7.tar.gz/source/encoder/analysis.cpp Changed

@@ -130,9 +130,12 @@
     for (uint32_t i = 0; i <= g_maxCUDepth; i++)
         for (uint32_t j = 0; j < MAX_PRED_TYPES; j++)
             m_modeDepth[i].pred[j].invalidate();
-#endif
     invalidateContexts(0);
-    m_quant.setQPforQuant(ctu);
+#endif
+
+    int qp = setLambdaFromQP(ctu, m_slice->m_pps->bUseDQP ? calculateQpforCuSize(ctu, cuGeom) : m_slice->m_sliceQp);
+    ctu.setQPSubParts((int8_t)qp, 0, 0);
+
     m_rqt[0].cur.load(initialContext);
     m_modeDepth[0].fencYuv.copyFromPicYuv(*m_frame->m_fencPic, ctu.m_cuAddr, 0);
 
@@ -140,11 +143,11 @@
     if (m_param->analysisMode)
     {
         if (m_slice->m_sliceType == I_SLICE)
-            m_reuseIntraDataCTU = (analysis_intra_data *)m_frame->m_analysisData.intraData;
+            m_reuseIntraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData;
         else
         {
             int numPredDir = m_slice->isInterP() ? 1 : 2;
-            m_reuseInterDataCTU = (analysis_inter_data *)m_frame->m_analysisData.interData;
+            m_reuseInterDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData;
             m_reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir];
             m_reuseBestMergeCand = &m_reuseInterDataCTU->bestMergeCand[ctu.m_cuAddr * CUGeom::MAX_GEOMS];
         }
@@ -155,10 +158,10 @@
     uint32_t zOrder = 0;
     if (m_slice->m_sliceType == I_SLICE)
     {
-        compressIntraCU(ctu, cuGeom, zOrder);
+        compressIntraCU(ctu, cuGeom, zOrder, qp);
         if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_frame->m_analysisData.intraData)
         {
-            CUData *bestCU = &m_modeDepth[0].bestMode->cu;
+            CUData* bestCU = &m_modeDepth[0].bestMode->cu;
             memcpy(&m_reuseIntraDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition);
             memcpy(&m_reuseIntraDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_lumaIntraDir, sizeof(uint8_t) * numPartition);
             memcpy(&m_reuseIntraDataCTU->partSizes[ctu.m_cuAddr * numPartition], bestCU->m_partSize, sizeof(uint8_t) * numPartition);
@@ -173,21 +176,21 @@
             * they are available for intra predictions */
             m_modeDepth[0].fencYuv.copyToPicYuv(*m_frame->m_reconPic, ctu.m_cuAddr, 0);
 
-            compressInterCU_rd0_4(ctu, cuGeom);
+            compressInterCU_rd0_4(ctu, cuGeom, qp);
 
             /* generate residual for entire CTU at once and copy to reconPic */
             encodeResidue(ctu, cuGeom);
         }
         else if (m_param->bDistributeModeAnalysis && m_param->rdLevel >= 2)
-            compressInterCU_dist(ctu, cuGeom);
+            compressInterCU_dist(ctu, cuGeom, qp);
         else if (m_param->rdLevel <= 4)
-            compressInterCU_rd0_4(ctu, cuGeom);
+            compressInterCU_rd0_4(ctu, cuGeom, qp);
         else
         {
-            compressInterCU_rd5_6(ctu, cuGeom, zOrder);
+            compressInterCU_rd5_6(ctu, cuGeom, zOrder, qp);
             if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_frame->m_analysisData.interData)
             {
-                CUData *bestCU = &m_modeDepth[0].bestMode->cu;
+                CUData* bestCU = &m_modeDepth[0].bestMode->cu;
                 memcpy(&m_reuseInterDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition);
                 memcpy(&m_reuseInterDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_predMode, sizeof(uint8_t) * numPartition);
             }
@@ -206,24 +209,28 @@
         return;
     else if (md.bestMode->cu.isIntra(0))
     {
+        m_quant.m_tqBypass = true;
         md.pred[PRED_LOSSLESS].initCosts();
         md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom);
         PartSize size = (PartSize)md.pred[PRED_LOSSLESS].cu.m_partSize[0];
         uint8_t* modes = md.pred[PRED_LOSSLESS].cu.m_lumaIntraDir;
         checkIntra(md.pred[PRED_LOSSLESS], cuGeom, size, modes, NULL);
         checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth);
+        m_quant.m_tqBypass = false;
     }
     else
     {
+        m_quant.m_tqBypass = true;
         md.pred[PRED_LOSSLESS].initCosts();
         md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom);
         md.pred[PRED_LOSSLESS].predYuv.copyFromYuv(md.bestMode->predYuv);
         encodeResAndCalcRdInterCU(md.pred[PRED_LOSSLESS], cuGeom);
         checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth);
+        m_quant.m_tqBypass = false;
     }
 }
 
-void Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t& zOrder)
+void Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t& zOrder, int32_t qp)
 {
     uint32_t depth = cuGeom.depth;
     ModeDepth& md = m_modeDepth[depth];
@@ -241,11 +248,9 @@
 
         if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.absPartIdx)
         {
-            m_quant.setQPforQuant(parentCTU);
-
             PartSize size = (PartSize)reusePartSizes[zOrder];
             Mode& mode = size == SIZE_2Nx2N ? md.pred[PRED_INTRA] : md.pred[PRED_INTRA_NxN];
-            mode.cu.initSubCU(parentCTU, cuGeom);
+            mode.cu.initSubCU(parentCTU, cuGeom, qp);
             checkIntra(mode, cuGeom, size, &reuseModes[zOrder], &reuseChromaModes[zOrder]);
             checkBestMode(mode, depth);
 
@@ -262,15 +267,13 @@
     }
     else if (mightNotSplit)
     {
-        m_quant.setQPforQuant(parentCTU);
-
-        md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom);
+        md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
         checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, NULL);
         checkBestMode(md.pred[PRED_INTRA], depth);
 
         if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3)
         {
-            md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom);
+            md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp);
             checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL);
             checkBestMode(md.pred[PRED_INTRA_NxN], depth);
         }
@@ -287,12 +290,13 @@
         Mode* splitPred = &md.pred[PRED_SPLIT];
         splitPred->initCosts();
         CUData* splitCU = &splitPred->cu;
-        splitCU->initSubCU(parentCTU, cuGeom);
+        splitCU->initSubCU(parentCTU, cuGeom, qp);
 
         uint32_t nextDepth = depth + 1;
         ModeDepth& nd = m_modeDepth[nextDepth];
         invalidateContexts(nextDepth);
         Entropy* nextContext = &m_rqt[depth].cur;
+        int32_t nextQP = qp;
 
         for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
         {
@@ -301,7 +305,11 @@
             {
                 m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
                 m_rqt[nextDepth].cur.load(*nextContext);
-                compressIntraCU(parentCTU, childGeom, zOrder);
+
+                if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
+                    nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
+
+                compressIntraCU(parentCTU, childGeom, zOrder, nextQP);
 
                 // Save best CU and pred data for this sub CU
                 splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);
@@ -322,7 +330,7 @@
         else
             updateModeCost(*splitPred);
 
-        checkDQPForSplitPred(splitPred->cu, cuGeom);
+        checkDQPForSplitPred(*splitPred, cuGeom);
         checkBestMode(*splitPred, depth);
     }
 
@@ -362,24 +370,18 @@
     }
 
     ModeDepth& md = m_modeDepth[pmode.cuGeom.depth];
-    bool bMergeOnly = pmode.cuGeom.log2CUSize == 6;
 
     /* setup slave Analysis */
     if (&slave != this)
     {
         slave.m_slice = m_slice;
         slave.m_frame = m_frame;
-        slave.setQP(*m_slice, m_rdCost.m_qp);
+        slave.m_param = m_param;
+        slave.setLambdaFromQP(md.pred[PRED_2Nx2N].cu, m_rdCost.m_qp);
         slave.invalidateContexts(0);
-
-        if (m_param->rdLevel >= 5)
-        {
-            slave.m_rqt[pmode.cuGeom.depth].cur.load(m_rqt[pmode.cuGeom.depth].cur);
-            slave.m_quant.setQPforQuant(md.pred[PRED_2Nx2N].cu);
-        }
+        slave.m_rqt[pmode.cuGeom.depth].cur.load(m_rqt[pmode.cuGeom.depth].cur);
     }
 
-
     /* perform Mode task, repeat until no more work is available */
     do
     {
@@ -388,8 +390,6 @@
             switch (pmode.modes[task])
             {
             case PRED_INTRA:
-                if (&slave != this)

 

x265_1.6.tar.gz/source/encoder/analysis.h -> x265_1.7.tar.gz/source/encoder/analysis.h Changed

@@ -109,12 +109,12 @@
     uint32_t*            m_reuseBestMergeCand;
 
     /* full analysis for an I-slice CU */
-    void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder);
+    void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp);
 
     /* full analysis for a P or B slice CU */
-    void compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom);
-    void compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom);
-    void compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder);
+    void compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
+    void compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
+    void compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp);
 
     /* measure merge and skip */
     void checkMerge2Nx2N_rd0_4(Mode& skip, Mode& merge, const CUGeom& cuGeom);
@@ -122,7 +122,7 @@
 
     /* measure inter options */
     void checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize);
-    void checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, bool bMergeOnly);
+    void checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize);
 
     void checkBidir2Nx2N(Mode& inter2Nx2N, Mode& bidir2Nx2N, const CUGeom& cuGeom);
 
@@ -139,7 +139,7 @@
     /* generate residual and recon pixels for an entire CTU recursively (RD0) */
     void encodeResidue(const CUData& parentCTU, const CUGeom& cuGeom);
 
-    int calculateQpforCuSize(CUData& ctu, const CUGeom& cuGeom);
+    int calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom);
 
     /* check whether current mode is the new best */
     inline void checkBestMode(Mode& mode, uint32_t depth)
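
The analysis.cpp and analysis.h hunks above thread a `qp` argument through the recursive compress functions. A child CU's QP is recomputed only when delta-QP is enabled and the child depth is within `maxCuDQPDepth`; otherwise the child inherits the parent's value. The following is a minimal sketch of that propagation rule, not the encoder's code: `DqpConfig`, `Visit`, `qpForDepth` and `walk` are hypothetical stand-ins, and the per-depth QP model is invented purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; none of these are x265 types.
struct DqpConfig
{
    bool     useDQP;         // plays the role of pps->bUseDQP
    uint32_t maxCuDQPDepth;  // deepest level at which QP may be recomputed
};

struct Visit
{
    uint32_t depth;
    int32_t  qp;
};

// Invented QP model for the sketch: base QP lowered by one per depth level.
inline int32_t qpForDepth(int32_t baseQp, uint32_t depth)
{
    return baseQp - (int32_t)depth;
}

// Records the QP seen by every CU in a full quadtree down to maxDepth.
void walk(const DqpConfig& cfg, int32_t baseQp, uint32_t depth,
          uint32_t maxDepth, int32_t qp, std::vector<Visit>& out)
{
    assert(depth <= maxDepth);
    out.push_back({depth, qp});
    if (depth == maxDepth)
        return;

    uint32_t nextDepth = depth + 1;
    int32_t nextQP = qp;  // default: inherit the parent's QP

    for (int subPartIdx = 0; subPartIdx < 4; subPartIdx++)
    {
        // Recompute only while still inside the delta-QP depth range.
        if (cfg.useDQP && nextDepth <= cfg.maxCuDQPDepth)
            nextQP = qpForDepth(baseQp, nextDepth);

        walk(cfg, baseQp, nextDepth, maxDepth, nextQP, out);
    }
}
```

With `maxCuDQPDepth` set to 1 and a three-level tree, depth-1 CUs get a recomputed QP and depth-2 CUs reuse their parent's value, which is the behaviour the `nextQP` local in the diff appears to provide.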

 

x265_1.6.tar.gz/source/encoder/api.cpp -> x265_1.7.tar.gz/source/encoder/api.cpp Changed

@@ -39,9 +39,11 @@
     if (!p)
         return NULL;
 
-    x265_param *param = X265_MALLOC(x265_param, 1);
-    if (!param)
-        return NULL;
+    Encoder* encoder = NULL;
+    x265_param* param = x265_param_alloc();
+    x265_param* latestParam = x265_param_alloc();
+    if (!param || !latestParam)
+        goto fail;
 
     memcpy(param, p, sizeof(x265_param));
     x265_log(param, X265_LOG_INFO, "HEVC encoder version %s\n", x265_version_str);
@@ -50,38 +52,44 @@
     x265_setup_primitives(param, param->cpuid);
 
     if (x265_check_params(param))
-        return NULL;
+        goto fail;
 
     if (x265_set_globals(param))
-        return NULL;
+        goto fail;
 
-    Encoder *encoder = new Encoder;
+    encoder = new Encoder;
     if (!param->rc.bEnableSlowFirstPass)
         x265_param_apply_fastfirstpass(param);
 
     // may change params for auto-detect, etc
     encoder->configure(param);
-    
     // may change rate control and CPB params
     if (!enforceLevel(*param, encoder->m_vps))
-    {
-        delete encoder;
-        return NULL;
-    }
+        goto fail;
 
     // will detect and set profile/tier/level in VPS
     determineLevel(*param, encoder->m_vps);
 
-    encoder->create();
-    if (encoder->m_aborted)
+    if (!param->bAllowNonConformance && encoder->m_vps.ptl.profileIdc == Profile::NONE)
     {
-        delete encoder;
-        return NULL;
+        x265_log(param, X265_LOG_INFO, "non-conformant bitstreams not allowed (--allow-non-conformance)\n");
+        goto fail;
     }
 
-    x265_print_params(param);
+    encoder->create();
+    encoder->m_latestParam = latestParam;
+    memcpy(latestParam, param, sizeof(x265_param));
+    if (encoder->m_aborted)
+        goto fail;
 
+    x265_print_params(param);
     return encoder;
+
+fail:
+    delete encoder;
+    x265_param_free(param);
+    x265_param_free(latestParam);
+    return NULL;
 }
 
 extern "C"
@@ -112,6 +120,27 @@
 }
 
 extern "C"
+int x265_encoder_reconfig(x265_encoder* enc, x265_param* param_in)
+{
+    if (!enc || !param_in)
+        return -1;
+
+    x265_param save;
+    Encoder* encoder = static_cast<Encoder*>(enc);
+    memcpy(&save, encoder->m_latestParam, sizeof(x265_param));
+    int ret = encoder->reconfigureParam(encoder->m_latestParam, param_in);
+    if (ret)
+        /* reconfigure failed, recover saved param set */
+        memcpy(encoder->m_latestParam, &save, sizeof(x265_param));
+    else
+    {
+        encoder->m_reconfigured = true;
+        x265_print_reconfigured_params(&save, encoder->m_latestParam);
+    }
+    return ret;
+}
+
+extern "C"
 int x265_encoder_encode(x265_encoder *enc, x265_nal **pp_nal, uint32_t *pi_nal, x265_picture *pic_in, x265_picture *pic_out)
 {
     if (!enc)
@@ -173,19 +202,22 @@
     {
         Encoder *encoder = static_cast<Encoder*>(enc);
 
-        encoder->stop();
+        encoder->stopJobs();
         encoder->printSummary();
         encoder->destroy();
         delete encoder;
+        ATOMIC_DEC(&g_ctuSizeConfigured);
     }
 }
 
 extern "C"
 void x265_cleanup(void)
 {
-    BitCost::destroy();
-    CUData::s_partSet[0] = NULL; /* allow CUData to adjust to new CTU size */
-    g_ctuSizeConfigured = 0;
+    if (!g_ctuSizeConfigured)
+    {
+        BitCost::destroy();
+        CUData::s_partSet[0] = NULL; /* allow CUData to adjust to new CTU size */
+    }
 }
 
 extern "C"
@@ -232,6 +264,7 @@
     &x265_picture_init,
     &x265_encoder_open,
     &x265_encoder_parameters,
+    &x265_encoder_reconfig,
     &x265_encoder_headers,
     &x265_encoder_encode,
     &x265_encoder_get_stats,
@@ -243,11 +276,66 @@
     x265_max_bit_depth,
 };
 
+typedef const x265_api* (*api_get_func)(int bitDepth);
+
+#define xstr(s) str(s)
+#define str(s) #s
+
+#if _WIN32
+#define ext ".dll"
+#elif MACOS
+#include <dlfcn.h>
+#define ext ".dylib"
+#else
+#include <dlfcn.h>
+#define ext ".so"
+#endif
+
 extern "C"
 const x265_api* x265_api_get(int bitDepth)
 {
     if (bitDepth && bitDepth != X265_DEPTH)
-        return NULL;
+    {
+        const char* libname = NULL;
+        const char* method = "x265_api_get_" xstr(X265_BUILD);
+
+        if (bitDepth == 12)
+            libname = "libx265_main12" ext;
+        else if (bitDepth == 10)
+            libname = "libx265_main10" ext;
+        else if (bitDepth == 8)
+            libname = "libx265_main" ext;
+        else
+            return NULL;
+
+        const x265_api* api = NULL;
+
+#if _WIN32
+        HMODULE h = LoadLibraryA(libname);
+        if (h)
+        {
+            api_get_func get = (api_get_func)GetProcAddress(h, method);
+            if (get)
+                api = get(0);
+        }
+#else
+        void* h = dlopen(libname, RTLD_LAZY | RTLD_LOCAL);
+        if (h)
+        {
+            api_get_func get = (api_get_func)dlsym(h, method);
+            if (get)
+                api = get(0);
+        }
+#endif
+
+        if (api && bitDepth != api->max_bit_depth)
+        {
+            x265_log(NULL, X265_LOG_WARNING, "%s does not support requested bitDepth %d\n", libname, bitDepth);
+            return NULL;
+        }
+
+        return api;
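
The `x265_encoder_reconfig()` hunk above snapshots the live parameter set, applies the requested values, and restores the snapshot if validation fails. A minimal sketch of that save/apply/rollback shape follows. `Params`, `checkParams`, `applyRequested` and `reconfigure` are hypothetical names, and the validation limits are made up for the example; the real `x265_param` and `x265_check_params` are far larger.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical parameter block; the real x265_param has many more fields.
struct Params
{
    int rdLevel;
    int maxNumReferences;
};

// Hypothetical validator with invented limits: non-zero means "invalid".
int checkParams(const Params* p)
{
    return (p->rdLevel < 0 || p->rdLevel > 6 || p->maxNumReferences < 1) ? 1 : 0;
}

// Copies the reconfigurable fields, then validates the result.
int applyRequested(Params* live, const Params* requested)
{
    assert(live && requested);
    live->rdLevel = requested->rdLevel;
    live->maxNumReferences = requested->maxNumReferences;
    return checkParams(live);
}

// Returns 0 on success, -1 on bad arguments, non-zero if validation fails.
// On failure the live parameters are left exactly as they were.
int reconfigure(Params* live, const Params* requested, bool* reconfigured)
{
    if (!live || !requested || !reconfigured)
        return -1;

    Params save;
    std::memcpy(&save, live, sizeof(Params));      // snapshot before touching anything

    int ret = applyRequested(live, requested);
    if (ret)
        std::memcpy(live, &save, sizeof(Params));  // roll back to the snapshot
    else
        *reconfigured = true;                      // later frames pick up the new set

    return ret;
}
```

Copying into a stack snapshot keeps the rollback allocation-free, so the failure path cannot itself fail.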

 

x265_1.6.tar.gz/source/encoder/encoder.cpp -> x265_1.7.tar.gz/source/encoder/encoder.cpp Changed

@@ -58,6 +58,7 @@
 Encoder::Encoder()
 {
     m_aborted = false;
+    m_reconfigured = false;
     m_encodedFrameNum = 0;
     m_pocLast = -1;
     m_curEncoder = 0;
@@ -73,6 +74,7 @@
     m_outputCount = 0;
     m_csvfpt = NULL;
     m_param = NULL;
+    m_latestParam = NULL;
     m_cuOffsetY = NULL;
     m_cuOffsetC = NULL;
     m_buOffsetY = NULL;
@@ -106,7 +108,7 @@
     bool allowPools = !p->numaPools || strcmp(p->numaPools, "none");
 
     // Trim the thread pool if --wpp, --pme, and --pmode are disabled
-    if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation)
+    if (!p->bEnableWavefront && !p->bDistributeModeAnalysis && !p->bDistributeMotionEstimation && !p->lookaheadSlices)
         allowPools = false;
 
     if (!p->frameNumThreads)
@@ -140,9 +142,11 @@
             x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --pme disabled\n");
         if (p->bDistributeModeAnalysis)
             x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --pmode disabled\n");
+        if (p->lookaheadSlices)
+            x265_log(p, X265_LOG_WARNING, "No thread pool allocated, --lookahead-slices disabled\n");
 
         // disable all pool features if the thread pool is disabled or unusable.
-        p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = 0;
+        p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = p->lookaheadSlices = 0;
     }
 
     char buf[128];
@@ -159,7 +163,10 @@
     x265_log(p, X265_LOG_INFO, "frame threads / pool features       : %d / %s\n", p->frameNumThreads, buf);
 
     for (int i = 0; i < m_param->frameNumThreads; i++)
+    {
         m_frameEncoder[i] = new FrameEncoder;
+        m_frameEncoder[i]->m_nalList.m_annexB = !!m_param->bAnnexB;
+    }
 
     if (m_numPools)
     {
@@ -287,15 +294,17 @@
     m_aborted |= parseLambdaFile(m_param);
 
     m_encodeStartTime = x265_mdate();
+
+    m_nalList.m_annexB = !!m_param->bAnnexB;
 }
 
-void Encoder::stop()
+void Encoder::stopJobs()
 {
     if (m_rateControl)
         m_rateControl->terminate(); // unblock all blocked RC calls
 
     if (m_lookahead)
-        m_lookahead->stop();
+        m_lookahead->stopJobs();
     
     for (int i = 0; i < m_param->frameNumThreads; i++)
     {
@@ -309,7 +318,7 @@
     }
 
     if (m_threadPool)
-        m_threadPool->stop();
+        m_threadPool->stopWorkers();
 }
 
 void Encoder::destroy()
@@ -358,15 +367,20 @@
 
     if (m_param)
     {
-        free((void*)m_param->rc.lambdaFileName); // allocs by strdup
-        free(m_param->rc.statFileName);
-        free(m_param->analysisFileName);
-        free((void*)m_param->scalingLists);
-        free(m_param->csvfn);
-        free(m_param->numaPools);
+        /* release string arguments that were strdup'd */
+        free((char*)m_param->rc.lambdaFileName);
+        free((char*)m_param->rc.statFileName);
+        free((char*)m_param->analysisFileName);
+        free((char*)m_param->scalingLists);
+        free((char*)m_param->csvfn);
+        free((char*)m_param->numaPools);
+        free((char*)m_param->masteringDisplayColorVolume);
+        free((char*)m_param->contentLightLevelInfo);
 
-        X265_FREE(m_param);
+        x265_param_free(m_param);
     }
+
+    x265_param_free(m_latestParam);
 }
 
 void Encoder::updateVbvPlan(RateControl* rc)
@@ -436,7 +450,8 @@
         if (m_dpb->m_freeList.empty())
         {
             inFrame = new Frame;
-            if (inFrame->create(m_param))
+            x265_param* p = m_reconfigured? m_latestParam : m_param;
+            if (inFrame->create(p))
             {
                 /* the first PicYuv created is asked to generate the CU and block unit offset
                  * arrays which are then shared with all subsequent PicYuv (orig and recon) 
@@ -477,7 +492,10 @@
             }
         }
         else
+        {
             inFrame = m_dpb->m_freeList.popBack();
+            inFrame->m_lowresInit = false;
+        }
 
         /* Copy input picture into a Frame and PicYuv, send to lookahead */
         inFrame->m_fencPic->copyFromPicture(*pic_in, m_sps.conformanceWindow.rightOffset, m_sps.conformanceWindow.bottomOffset);
@@ -486,6 +504,7 @@
         inFrame->m_userData  = pic_in->userData;
         inFrame->m_pts       = pic_in->pts;
         inFrame->m_forceqp   = pic_in->forceqp;
+        inFrame->m_param     = m_reconfigured ? m_latestParam : m_param;
 
         if (m_pocLast == 0)
             m_firstPts = inFrame->m_pts;
@@ -717,6 +736,34 @@
     return ret;
 }
 
+int Encoder::reconfigureParam(x265_param* encParam, x265_param* param)
+{
+    encParam->maxNumReferences = param->maxNumReferences; // never uses more refs than specified in stream headers
+    encParam->bEnableLoopFilter = param->bEnableLoopFilter;
+    encParam->deblockingFilterTCOffset = param->deblockingFilterTCOffset;
+    encParam->deblockingFilterBetaOffset = param->deblockingFilterBetaOffset; 
+    encParam->bEnableFastIntra = param->bEnableFastIntra;
+    encParam->bEnableEarlySkip = param->bEnableEarlySkip;
+    encParam->bEnableTemporalMvp = param->bEnableTemporalMvp;
+    /* Scratch buffer prevents me_range from being increased for esa/tesa
+    if (param->searchMethod < X265_FULL_SEARCH || param->searchMethod < encParam->searchRange)
+        encParam->searchRange = param->searchRange; */
+    encParam->noiseReductionInter = param->noiseReductionInter;
+    encParam->noiseReductionIntra = param->noiseReductionIntra;
+    /* We can't switch out of subme=0 during encoding. */
+    if (encParam->subpelRefine)
+        encParam->subpelRefine = param->subpelRefine;
+    encParam->rdoqLevel = param->rdoqLevel;
+    encParam->rdLevel = param->rdLevel;
+    encParam->bEnableTSkipFast = param->bEnableTSkipFast;
+    encParam->psyRd = param->psyRd;
+    encParam->psyRdoq = param->psyRdoq;
+    encParam->bEnableSignHiding = param->bEnableSignHiding;
+    encParam->bEnableFastIntra = param->bEnableFastIntra;
+    encParam->maxTUSize = param->maxTUSize;
+    return x265_check_params(encParam);
+}
+
 void EncStats::addPsnr(double psnrY, double psnrU, double psnrV)
 {
     m_psnrSumY += psnrY;
@@ -1430,6 +1477,34 @@
     bs.writeByteAlignment();
     list.serialize(NAL_UNIT_PPS, bs);
 
+    if (m_param->masteringDisplayColorVolume)
+    {
+        SEIMasteringDisplayColorVolume mdsei;
+        if (mdsei.parse(m_param->masteringDisplayColorVolume))
+        {
+            bs.resetBits();
+            mdsei.write(bs, m_sps);
+            bs.writeByteAlignment();
+            list.serialize(NAL_UNIT_PREFIX_SEI, bs);
+        }
+        else
+            x265_log(m_param, X265_LOG_WARNING, "unable to parse mastering display color volume info\n");
+    }
+
+    if (m_param->contentLightLevelInfo)
+    {
+        SEIContentLightLevel cllsei;
+        if (cllsei.parse(m_param->contentLightLevelInfo))
+        {
+            bs.resetBits();
+            cllsei.write(bs, m_sps);
+            bs.writeByteAlignment();
+            list.serialize(NAL_UNIT_PREFIX_SEI, bs);
+        }
+        else
+            x265_log(m_param, X265_LOG_WARNING, "unable to parse content light level info\n");
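
Note on the `reconfigureParam()` hunk in the encoder.cpp diff above: the function copies a fixed whitelist of fields from the requested parameters into the active ones, and only updates `subpelRefine` when the encoder was not started with `subpelRefine == 0`. The following is a minimal sketch of that pattern, not the x265 implementation. `MiniParam`, `reconfigure` and the `sourceWidth` field are hypothetical names chosen for illustration.

```cpp
// Hypothetical stand-in for a handful of encoder settings; not x265_param.
struct MiniParam
{
    int maxNumReferences;
    int subpelRefine;
    int rdLevel;
    int sourceWidth; // stands for a stream-level field that must not change mid-encode
};

// Copy only the fields treated as safe to change while encoding.
// Mirrors the guard in the diff: if the active configuration has
// subpelRefine == 0, it stays at 0 regardless of the request.
void reconfigure(MiniParam& active, const MiniParam& requested)
{
    active.maxNumReferences = requested.maxNumReferences;
    active.rdLevel = requested.rdLevel;
    if (active.subpelRefine)
        active.subpelRefine = requested.subpelRefine;
    // sourceWidth is deliberately not copied.
}
```

With this sketch, a request that changes every field still leaves `sourceWidth` untouched, and leaves `subpelRefine` untouched when it started at 0.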

 

x265_1.6.tar.gz/source/encoder/encoder.h -> x265_1.7.tar.gz/source/encoder/encoder.h Changed

 
@@ -125,22 +125,26 @@
     uint32_t           m_numDelayedPic;
 
     x265_param*        m_param;
+    x265_param*        m_latestParam;
     RateControl*       m_rateControl;
     Lookahead*         m_lookahead;
     Window             m_conformanceWindow;
 
     bool               m_bZeroLatency;     // x265_encoder_encode() returns NALs for the input picture, zero lag
     bool               m_aborted;          // fatal error detected
+    bool               m_reconfigured;      // reconfigure of encoder detected
 
     Encoder();
     ~Encoder() {}
 
     void create();
-    void stop();
+    void stopJobs();
     void destroy();
 
     int encode(const x265_picture* pic, x265_picture *pic_out);
 
+    int reconfigureParam(x265_param* encParam, x265_param* param);
+
     void getStreamHeaders(NALList& list, Entropy& sbacCoder, Bitstream& bs);
 
     void fetchStats(x265_stats* stats, size_t statsSizeBytes);
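
Note on the `m_latestParam` / `m_reconfigured` members added above: in the encoder.cpp diff, each incoming frame is tagged with `m_reconfigured ? m_latestParam : m_param`, so a frame keeps the parameter set that was current when it was submitted. A rough sketch of that idea follows. `Settings`, `QueuedFrame` and `MiniEncoder` are hypothetical types and do not correspond to real x265 classes.

```cpp
// Hypothetical settings block and frame record, for illustration only.
struct Settings { int rdLevel; };
struct QueuedFrame { const Settings* param; };

class MiniEncoder
{
public:
    explicit MiniEncoder(int rdLevel)
        : m_param{rdLevel}, m_latestParam{rdLevel}, m_reconfigured(false) {}

    // Record the new settings; frames that are already queued are not touched.
    void reconfig(int rdLevel)
    {
        m_latestParam.rdLevel = rdLevel;
        m_reconfigured = true;
    }

    // Tag each new frame with whichever parameter set is current at submission time.
    QueuedFrame submit()
    {
        return QueuedFrame{ m_reconfigured ? &m_latestParam : &m_param };
    }

private:
    Settings m_param;       // settings the encoder was opened with
    Settings m_latestParam; // settings after the most recent reconfigure
    bool     m_reconfigured;
};
```

One limitation of this simplified sketch: a second `reconfig()` call would also change the settings seen by frames submitted after the first call, since they all point at the same `m_latestParam`.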
​

x265_1.6.tar.gz/source/encoder/entropy.cpp -> x265_1.7.tar.gz/source/encoder/entropy.cpp Changed

@@ -585,7 +585,7 @@
         if (ctu.isSkipped(absPartIdx))
         {
             codeMergeIndex(ctu, absPartIdx);
-            finishCU(ctu, absPartIdx, depth);
+            finishCU(ctu, absPartIdx, depth, bEncodeDQP);
             return;
         }
         codePredMode(ctu.m_predMode[absPartIdx]);
@@ -606,7 +606,7 @@
     codeCoeff(ctu, absPartIdx, bEncodeDQP, tuDepthRange);
 
     // --- write terminating bit ---
-    finishCU(ctu, absPartIdx, depth);
+    finishCU(ctu, absPartIdx, depth, bEncodeDQP);
 }
 
 /* Return bit count of signaling inter mode */
@@ -658,7 +658,7 @@
 }
 
 /* finish encoding a cu and handle end-of-slice conditions */
-void Entropy::finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth)
+void Entropy::finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth, bool bCodeDQP)
 {
     const Slice* slice = ctu.m_slice;
     uint32_t realEndAddress = slice->m_endCUAddr;
@@ -672,6 +672,9 @@
     bool granularityBoundary = (((rpelx & granularityMask) == 0 || (rpelx == slice->m_sps->picWidthInLumaSamples )) &&
                                 ((bpely & granularityMask) == 0 || (bpely == slice->m_sps->picHeightInLumaSamples)));
 
+    if (slice->m_pps->bUseDQP)
+        const_cast<CUData&>(ctu).setQPSubParts(bCodeDQP ? ctu.getRefQP(absPartIdx) : ctu.m_qp[absPartIdx], absPartIdx, depth);
+
     if (granularityBoundary)
     {
         // Encode slice finish
@@ -1141,11 +1144,11 @@
     {
         length = 0;
         codeNumber = (codeNumber >> absGoRice) - COEF_REMAIN_BIN_REDUCTION;
-        if (codeNumber != 0)
         {
             unsigned long idx;
             CLZ(idx, codeNumber + 1);
             length = idx;
+            X265_CHECK((codeNumber != 0) || (length == 0), "length check failure\n");
             codeNumber -= (1 << idx) - 1;
         }
         codeNumber = (codeNumber << absGoRice) + codeRemain;
@@ -1461,7 +1464,7 @@
     //const uint32_t maskPosXY = ((uint32_t)~0 >> (31 - log2TrSize + MLS_CG_LOG2_SIZE)) >> 1;
     X265_CHECK((uint32_t)((1 << (log2TrSize - MLS_CG_LOG2_SIZE)) - 1) == (((uint32_t)~0 >> (31 - log2TrSize + MLS_CG_LOG2_SIZE)) >> 1), "maskPosXY fault\n");
 
-    scanPosLast = primitives.findPosLast(codingParameters.scan, coeff, coeffSign, coeffFlag, coeffNum, numSig);
+    scanPosLast = primitives.scanPosLast(codingParameters.scan, coeff, coeffSign, coeffFlag, coeffNum, numSig, g_scan4x4[codingParameters.scanType], trSize);
     posLast = codingParameters.scan[scanPosLast];
 
     const int lastScanSet = scanPosLast >> MLS_CG_SIZE;
@@ -1515,7 +1518,6 @@
     uint8_t * const baseCoeffGroupCtx = &m_contextState[OFF_SIG_CG_FLAG_CTX + (bIsLuma ? 0 : NUM_SIG_CG_FLAG_CTX)];
     uint8_t * const baseCtx = bIsLuma ? &m_contextState[OFF_SIG_FLAG_CTX] : &m_contextState[OFF_SIG_FLAG_CTX + NUM_SIG_FLAG_CTX_LUMA];
     uint32_t c1 = 1;
-    uint32_t goRiceParam = 0;
     int scanPosSigOff = scanPosLast - (lastScanSet << MLS_CG_SIZE) - 1;
     int absCoeff[1 << MLS_CG_SIZE];
     int numNonZero = 1;
@@ -1529,7 +1531,6 @@
         const uint32_t subCoeffFlag = coeffFlag[subSet];
         uint32_t scanFlagMask = subCoeffFlag;
         int subPosBase = subSet << MLS_CG_SIZE;
-        goRiceParam    = 0;
         
         if (subSet == lastScanSet)
         {
@@ -1548,7 +1549,7 @@
         else
         {
             uint32_t sigCoeffGroup = ((sigCoeffGroupFlag64 & cgBlkPosMask) != 0);
-            uint32_t ctxSig = Quant::getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, codingParameters.log2TrSizeCG);
+            uint32_t ctxSig = Quant::getSigCoeffGroupCtxInc(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE));
             encodeBin(sigCoeffGroup, baseCoeffGroupCtx[ctxSig]);
         }
 
@@ -1556,7 +1557,8 @@
         if (sigCoeffGroupFlag64 & cgBlkPosMask)
         {
             X265_CHECK((log2TrSize != 2) || (log2TrSize == 2 && subSet == 0), "log2TrSize and subSet mistake!\n");
-            const int patternSigCtx = Quant::calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, codingParameters.log2TrSizeCG);
+            const int patternSigCtx = Quant::calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE));
+            const uint32_t posOffset = (bIsLuma && subSet) ? 3 : 0;
 
             static const uint8_t ctxIndMap4x4[16] =
             {
@@ -1566,37 +1568,50 @@
                 7, 7, 8, 8
             };
             // NOTE: [patternSigCtx][posXinSubset][posYinSubset]
-            static const uint8_t table_cnt[4][4][4] =
+            static const uint8_t table_cnt[4][SCAN_SET_SIZE] =
             {
                 // patternSigCtx = 0
                 {
-                    { 2, 1, 1, 0 },
-                    { 1, 1, 0, 0 },
-                    { 1, 0, 0, 0 },
-                    { 0, 0, 0, 0 },
+                    2, 1, 1, 0,
+                    1, 1, 0, 0,
+                    1, 0, 0, 0,
+                    0, 0, 0, 0,
                 },
                 // patternSigCtx = 1
                 {
-                    { 2, 1, 0, 0 },
-                    { 2, 1, 0, 0 },
-                    { 2, 1, 0, 0 },
-                    { 2, 1, 0, 0 },
+                    2, 2, 2, 2,
+                    1, 1, 1, 1,
+                    0, 0, 0, 0,
+                    0, 0, 0, 0,
                 },
                 // patternSigCtx = 2
                 {
-                    { 2, 2, 2, 2 },
-                    { 1, 1, 1, 1 },
-                    { 0, 0, 0, 0 },
-                    { 0, 0, 0, 0 },
+                    2, 1, 0, 0,
+                    2, 1, 0, 0,
+                    2, 1, 0, 0,
+                    2, 1, 0, 0,
                 },
                 // patternSigCtx = 3
                 {
-                    { 2, 2, 2, 2 },
-                    { 2, 2, 2, 2 },
-                    { 2, 2, 2, 2 },
-                    { 2, 2, 2, 2 },
+                    2, 2, 2, 2,
+                    2, 2, 2, 2,
+                    2, 2, 2, 2,
+                    2, 2, 2, 2,
                 }
             };
+
+            const int offset = codingParameters.firstSignificanceMapContext;
+            ALIGN_VAR_32(uint16_t, tmpCoeff[SCAN_SET_SIZE]);
+            // TODO: accelerate by PABSW
+            const uint32_t blkPosBase  = codingParameters.scan[subPosBase];
+            for (int i = 0; i < MLS_CG_SIZE; i++)
+            {
+                tmpCoeff[i * MLS_CG_SIZE + 0] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 0]);
+                tmpCoeff[i * MLS_CG_SIZE + 1] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 1]);
+                tmpCoeff[i * MLS_CG_SIZE + 2] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 2]);
+                tmpCoeff[i * MLS_CG_SIZE + 3] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 3]);
+            }
+
             if (m_bitIf)
             {
                 if (log2TrSize == 2)
@@ -1604,16 +1619,16 @@
                     uint32_t blkPos, sig, ctxSig;
                     for (; scanPosSigOff >= 0; scanPosSigOff--)
                     {
-                        blkPos  = codingParameters.scan[subPosBase + scanPosSigOff];
+                        blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff];
                         sig     = scanFlagMask & 1;
                         scanFlagMask >>= 1;
-                        X265_CHECK((uint32_t)(coeff[blkPos] != 0) == sig, "sign bit mistake\n");
+                        X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n");
                         {
                             ctxSig = ctxIndMap4x4[blkPos];
                             X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
                             encodeBin(sig, baseCtx[ctxSig]);
                         }
-                        absCoeff[numNonZero] = int(abs(coeff[blkPos]));
+                        absCoeff[numNonZero] = tmpCoeff[blkPos];
                         numNonZero += sig;
                     }
                 }
@@ -1621,35 +1636,25 @@
                 {
                     X265_CHECK((log2TrSize > 2), "log2TrSize must be more than 2 in this path!\n");
 
-                    const uint8_t (*tabSigCtx)[4] = table_cnt[(uint32_t)patternSigCtx];
-                    const int offset = codingParameters.firstSignificanceMapContext;
-                    const uint32_t lumaMask = bIsLuma ? ~0 : 0;
-                    static const uint32_t posXY4Mask[] = {0x024, 0x0CC, 0x39C};
-                    const uint32_t posGT4Mask = posXY4Mask[log2TrSize - 3] & lumaMask;
+                    const uint8_t *tabSigCtx = table_cnt[(uint32_t)patternSigCtx];
 
                     uint32_t blkPos, sig, ctxSig;
                     for (; scanPosSigOff >= 0; scanPosSigOff--)
                     {
-                        blkPos  = codingParameters.scan[subPosBase + scanPosSigOff];
-                        X265_CHECK(blkPos || (subPosBase + scanPosSigOff == 0), "blkPos==0 must be at scan[0]\n");
+                        blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff];
                         const uint32_t posZeroMask = (subPosBase + scanPosSigOff) ? ~0 : 0;
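
Note on the `table_cnt` hunk above: the table changes from `[4][4][4]` to `[4][SCAN_SET_SIZE]`, and the rows for `patternSigCtx` 1 and 2 appear to trade places. The values are consistent with a per-pattern transpose, so that a flat index of the form `b * 4 + a` reads the value the old table held at `[a][b]`. The check below copies both tables from the diff and verifies this. The names `oldTable`, `newTable` and `flatTableIsTranspose` are illustrative, and reading `a`/`b` as x/y positions is an assumption based on the old comment `[patternSigCtx][posXinSubset][posYinSubset]`.

```cpp
#include <cstdint>

// Values from the removed lines: [patternSigCtx][a][b].
static const uint8_t oldTable[4][4][4] =
{
    { { 2, 1, 1, 0 }, { 1, 1, 0, 0 }, { 1, 0, 0, 0 }, { 0, 0, 0, 0 } },
    { { 2, 1, 0, 0 }, { 2, 1, 0, 0 }, { 2, 1, 0, 0 }, { 2, 1, 0, 0 } },
    { { 2, 2, 2, 2 }, { 1, 1, 1, 1 }, { 0, 0, 0, 0 }, { 0, 0, 0, 0 } },
    { { 2, 2, 2, 2 }, { 2, 2, 2, 2 }, { 2, 2, 2, 2 }, { 2, 2, 2, 2 } },
};

// Values from the added lines: [patternSigCtx][16].
static const uint8_t newTable[4][16] =
{
    { 2, 1, 1, 0,  1, 1, 0, 0,  1, 0, 0, 0,  0, 0, 0, 0 },
    { 2, 2, 2, 2,  1, 1, 1, 1,  0, 0, 0, 0,  0, 0, 0, 0 },
    { 2, 1, 0, 0,  2, 1, 0, 0,  2, 1, 0, 0,  2, 1, 0, 0 },
    { 2, 2, 2, 2,  2, 2, 2, 2,  2, 2, 2, 2,  2, 2, 2, 2 },
};

// True when newTable[p][b * 4 + a] equals oldTable[p][a][b] for every entry.
bool flatTableIsTranspose()
{
    for (int p = 0; p < 4; p++)
        for (int a = 0; a < 4; a++)
            for (int b = 0; b < 4; b++)
                if (newTable[p][b * 4 + a] != oldTable[p][a][b])
                    return false;
    return true;
}
```

Patterns 0 and 3 are symmetric, which is why only patterns 1 and 2 look different in the diff.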

 

x265_1.6.tar.gz/source/encoder/entropy.h -> x265_1.7.tar.gz/source/encoder/entropy.h Changed

@@ -87,7 +87,7 @@
 struct EstBitsSbac
 {
     int significantCoeffGroupBits[NUM_SIG_CG_FLAG_CTX][2];
-    int significantBits[NUM_SIG_FLAG_CTX][2];
+    int significantBits[2][NUM_SIG_FLAG_CTX];
     int lastBits[2][10];
     int greaterOneBits[NUM_ONE_FLAG_CTX][2];
     int levelAbsBits[NUM_ABS_FLAG_CTX][2];
@@ -179,7 +179,7 @@
     inline void codeQtCbfChroma(uint32_t cbf, uint32_t tuDepth)           { encodeBin(cbf, m_contextState[OFF_QT_CBF_CTX + 2 + tuDepth]); }
     inline void codeQtRootCbf(uint32_t cbf)                               { encodeBin(cbf, m_contextState[OFF_QT_ROOT_CBF_CTX]); }
     inline void codeTransformSkipFlags(uint32_t transformSkip, TextType ttype) { encodeBin(transformSkip, m_contextState[OFF_TRANSFORMSKIP_FLAG_CTX + (ttype ? NUM_TRANSFORMSKIP_FLAG_CTX : 0)]); }
-
+    void codeDeltaQP(const CUData& cu, uint32_t absPartIdx);
     void codeSaoOffset(const SaoCtuParam& ctuParam, int plane);
 
     /* RDO functions */
@@ -221,7 +221,7 @@
     }
 
     void encodeCU(const CUData& ctu, const CUGeom &cuGeom, uint32_t absPartIdx, uint32_t depth, bool& bEncodeDQP);
-    void finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth);
+    void finishCU(const CUData& ctu, uint32_t absPartIdx, uint32_t depth, bool bEncodeDQP);
 
     void writeOut();
 
@@ -242,7 +242,6 @@
 
     void codeSaoMaxUvlc(uint32_t code, uint32_t maxSymbol);
 
-    void codeDeltaQP(const CUData& cu, uint32_t absPartIdx);
     void codeLastSignificantXY(uint32_t posx, uint32_t posy, uint32_t log2TrSize, bool bIsLuma, uint32_t scanIdx);
 
     void encodeTransform(const CUData& cu, uint32_t absPartIdx, uint32_t tuDepth, uint32_t log2TrSize,
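
Note on the `significantBits` change above: swapping the dimensions from `[NUM_SIG_FLAG_CTX][2]` to `[2][NUM_SIG_FLAG_CTX]` changes which entries sit next to each other in memory. In the new layout the costs for one bin value are contiguous across contexts. The diff does not state the motivation, so treating this as a locality or vectorization aid is only a guess. The sketch below shows the index arithmetic, using a hypothetical context count `kNumCtx` rather than the real value of `NUM_SIG_FLAG_CTX`.

```cpp
// Hypothetical context count, for illustration only.
const int kNumCtx = 8;

// Flat offset (in ints) of entry (ctx, bin) for a table declared as [kNumCtx][2].
int oldIndex(int ctx, int bin) { return ctx * 2 + bin; }

// Flat offset (in ints) of entry (ctx, bin) for a table declared as [2][kNumCtx].
int newIndex(int ctx, int bin) { return bin * kNumCtx + ctx; }
```

Stepping from one context to the next for a fixed bin value moves by two ints in the old layout and by one int in the new layout.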

 

x265_1.6.tar.gz/source/encoder/frameencoder.cpp -> x265_1.7.tar.gz/source/encoder/frameencoder.cpp Changed

@@ -213,6 +213,7 @@
 {
     m_slicetypeWaitTime = x265_mdate() - m_prevOutputTime;
     m_frame = curFrame;
+    m_param = curFrame->m_param;
     m_sliceType = curFrame->m_lowres.sliceType;
     curFrame->m_encData->m_frameEncoderID = m_jpId;
     curFrame->m_encData->m_jobProvider = this;
@@ -794,6 +795,7 @@
     uint32_t row = (uint32_t)intRow;
     CTURow& curRow = m_rows[row];
 
+    tld.analysis.m_param = m_param;
     if (m_param->bEnableWavefront)
     {
         ScopedLock self(curRow.lock);
@@ -824,6 +826,13 @@
     const uint32_t lineStartCUAddr = row * numCols;
     bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
 
+    /* These store the count of inter, intra and skip cus within quad tree structure of each CTU */
+    uint32_t qTreeInterCnt[NUM_CU_DEPTH];
+    uint32_t qTreeIntraCnt[NUM_CU_DEPTH];
+    uint32_t qTreeSkipCnt[NUM_CU_DEPTH];
+    for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
+        qTreeIntraCnt[depth] = qTreeInterCnt[depth] = qTreeSkipCnt[depth] = 0;
+
     while (curRow.completed < numCols)
     {
         ProfileScopeEvent(encodeCTU);
@@ -841,24 +850,34 @@
                 curEncData.m_rowStat[row].diagQpScale = x265_qp2qScale(curEncData.m_avgQpRc);
             }
 
+            FrameData::RCStatCU& cuStat = curEncData.m_cuStat[cuAddr];
             if (row >= col && row && m_vbvResetTriggerRow != intRow)
-                curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_cuStat[cuAddr - numCols + 1].baseQp;
+                cuStat.baseQp = curEncData.m_cuStat[cuAddr - numCols + 1].baseQp;
             else
-                curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_rowStat[row].diagQp;
-        }
-        else
-            curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_avgQpRc;
+                cuStat.baseQp = curEncData.m_rowStat[row].diagQp;
+
+            /* TODO: use defines from slicetype.h for lowres block size */
+            uint32_t maxBlockCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16;
+            uint32_t maxBlockRows = (m_frame->m_fencPic->m_picHeight + (16 - 1)) / 16;
+            uint32_t noOfBlocks = g_maxCUSize / 16;
+            uint32_t block_y = (cuAddr / curEncData.m_slice->m_sps->numCuInWidth) * noOfBlocks;
+            uint32_t block_x = (cuAddr * noOfBlocks) - block_y * curEncData.m_slice->m_sps->numCuInWidth;
+            
+            cuStat.vbvCost = 0;
+            cuStat.intraVbvCost = 0;
+            for (uint32_t h = 0; h < noOfBlocks && block_y < maxBlockRows; h++, block_y++)
+            {
+                uint32_t idx = block_x + (block_y * maxBlockCols);
 
-        if (m_param->rc.aqMode || bIsVbv)
-        {
-            int qp = calcQpForCu(cuAddr, curEncData.m_cuStat[cuAddr].baseQp);
-            tld.analysis.setQP(*slice, qp);
-            qp = x265_clip3(QP_MIN, QP_MAX_SPEC, qp);
-            ctu->setQPSubParts((int8_t)qp, 0, 0);
-            curEncData.m_rowStat[row].sumQpAq += qp;
+                for (uint32_t w = 0; w < noOfBlocks && (block_x + w) < maxBlockCols; w++, idx++)
+                {
+                    cuStat.vbvCost += m_frame->m_lowres.lowresCostForRc[idx] & LOWRES_COST_MASK;
+                    cuStat.intraVbvCost += m_frame->m_lowres.intraCost[idx];
+                }
+            }
         }
         else
-            tld.analysis.setQP(*slice, slice->m_sliceQp);
+            curEncData.m_cuStat[cuAddr].baseQp = curEncData.m_avgQpRc;
 
         if (m_param->bEnableWavefront && !col && row)
         {
@@ -886,7 +905,9 @@
         curRow.completed++;
 
         if (m_param->bLogCuStats || m_param->rc.bStatWrite)
-            collectCTUStatistics(*ctu);
+            curEncData.m_rowStat[row].sumQpAq += collectCTUStatistics(*ctu, qTreeInterCnt, qTreeIntraCnt, qTreeSkipCnt);
+        else if (m_param->rc.aqMode)
+            curEncData.m_rowStat[row].sumQpAq += calcCTUQP(*ctu);
 
         // copy no. of intra, inter Cu cnt per row into frame stats for 2 pass
         if (m_param->rc.bStatWrite)
@@ -894,18 +915,17 @@
             curRow.rowStats.mvBits += best.mvBits;
             curRow.rowStats.coeffBits += best.coeffBits;
             curRow.rowStats.miscBits += best.totalBits - (best.mvBits + best.coeffBits);
-            StatisticLog* log = &m_sliceTypeLog[slice->m_sliceType];
 
             for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
             {
                 /* 1 << shift == number of 8x8 blocks at current depth */
                 int shift = 2 * (g_maxCUDepth - depth);
-                curRow.rowStats.iCuCnt += log->qTreeIntraCnt[depth] << shift;
-                curRow.rowStats.pCuCnt += log->qTreeInterCnt[depth] << shift;
-                curRow.rowStats.skipCuCnt += log->qTreeSkipCnt[depth] << shift;
+                curRow.rowStats.iCuCnt += qTreeIntraCnt[depth] << shift;
+                curRow.rowStats.pCuCnt += qTreeInterCnt[depth] << shift;
+                curRow.rowStats.skipCuCnt += qTreeSkipCnt[depth] << shift;
 
                 // clear the row cu data from thread local object
-                log->qTreeIntraCnt[depth] = log->qTreeInterCnt[depth] = log->qTreeSkipCnt[depth] = 0;
+                qTreeIntraCnt[depth] = qTreeInterCnt[depth] = qTreeSkipCnt[depth] = 0;
             }
         }
 
@@ -1075,15 +1095,18 @@
         }
     }
 
+    tld.analysis.m_param = NULL;
     curRow.busy = false;
 
     if (ATOMIC_INC(&m_completionCount) == 2 * (int)m_numRows)
         m_completionEvent.trigger();
 }
 
-void FrameEncoder::collectCTUStatistics(CUData& ctu)
+/* collect statistics about CU coding decisions, return total QP */
+int FrameEncoder::collectCTUStatistics(const CUData& ctu, uint32_t* qtreeInterCnt, uint32_t* qtreeIntraCnt, uint32_t* qtreeSkipCnt)
 {
     StatisticLog* log = &m_sliceTypeLog[ctu.m_slice->m_sliceType];
+    int totQP = 0;
 
     if (ctu.m_slice->m_sliceType == I_SLICE)
     {
@@ -1094,13 +1117,14 @@
 
             log->totalCu++;
             log->cntIntra[depth]++;
-            log->qTreeIntraCnt[depth]++;
+            qtreeIntraCnt[depth]++;
+            totQP += ctu.m_qp[absPartIdx] * (ctu.m_numPartitions >> (depth * 2));
 
             if (ctu.m_predMode[absPartIdx] == MODE_NONE)
             {
                 log->totalCu--;
                 log->cntIntra[depth]--;
-                log->qTreeIntraCnt[depth]--;
+                qtreeIntraCnt[depth]--;
             }
             else if (ctu.m_partSize[absPartIdx] != SIZE_2Nx2N)
             {
@@ -1124,6 +1148,7 @@
 
             log->totalCu++;
             log->cntTotalCu[depth]++;
+            totQP += ctu.m_qp[absPartIdx] * (ctu.m_numPartitions >> (depth * 2));
 
             if (ctu.m_predMode[absPartIdx] == MODE_NONE)
             {
@@ -1134,12 +1159,12 @@
             {
                 log->totalCu--;
                 log->cntSkipCu[depth]++;
-                log->qTreeSkipCnt[depth]++;
+                qtreeSkipCnt[depth]++;
             }
             else if (ctu.isInter(absPartIdx))
             {
                 log->cntInter[depth]++;
-                log->qTreeInterCnt[depth]++;
+                qtreeInterCnt[depth]++;
 
                 if (ctu.m_partSize[absPartIdx] < AMP_ID)
                     log->cuInterDistribution[depth][ctu.m_partSize[absPartIdx]]++;
@@ -1149,12 +1174,13 @@
             else if (ctu.isIntra(absPartIdx))
             {
                 log->cntIntra[depth]++;
-                log->qTreeIntraCnt[depth]++;
+                qtreeIntraCnt[depth]++;
 
                 if (ctu.m_partSize[absPartIdx] != SIZE_2Nx2N)
                 {
                     X265_CHECK(ctu.m_log2CUSize[absPartIdx] == 3 && ctu.m_slice->m_sps->quadtreeTULog2MinSize < 3, "Intra NxN found at improbable depth\n");
                     log->cntIntraNxN++;
+                    log->cntIntra[depth]--;
                     /* TODO: log intra modes at absPartIdx +0 to +3 */
                 }
                 else if (ctu.m_lumaIntraDir[absPartIdx] > 1)
@@ -1164,6 +1190,23 @@
             }
         }
     }
+
+    return totQP;
+}
+
+/* iterate over coded CUs and determine total QP */
+int FrameEncoder::calcCTUQP(const CUData& ctu)
+{
+    int totQP = 0;
+    uint32_t depth = 0, numParts = ctu.m_numPartitions;
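The VBV hunk above maps a CTU address onto the 16-sample block grid and sums the per-block costs that fall inside the picture. The following is a minimal standalone sketch of that index arithmetic, not the encoder's code: `CtuCost`, `sumCtuCost` and the flat `blockCost` vector are hypothetical names introduced for illustration, and the picture dimensions are passed in directly instead of being read from the frame.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type: where the CTU starts on the block grid, plus the summed cost.
struct CtuCost { uint32_t blockX, blockY; uint64_t cost; };

// Sum the block costs covered by one CTU, clamping at the right and bottom picture edges.
// blockCost is assumed to hold one entry per 16x16 block in raster order.
CtuCost sumCtuCost(uint32_t cuAddr, uint32_t picWidth, uint32_t picHeight,
                   uint32_t ctuSize, const std::vector<uint32_t>& blockCost)
{
    const uint32_t kBlock = 16;
    const uint32_t maxBlockCols = (picWidth + kBlock - 1) / kBlock;
    const uint32_t maxBlockRows = (picHeight + kBlock - 1) / kBlock;
    const uint32_t numCuInWidth = (picWidth + ctuSize - 1) / ctuSize;
    const uint32_t blocksPerCtu = ctuSize / kBlock;
    assert(blockCost.size() == static_cast<std::size_t>(maxBlockCols) * maxBlockRows);

    // With cuAddr = row * numCuInWidth + col this yields
    // blockY = row * blocksPerCtu and blockX = col * blocksPerCtu.
    uint32_t blockY = (cuAddr / numCuInWidth) * blocksPerCtu;
    const uint32_t blockX = cuAddr * blocksPerCtu - blockY * numCuInWidth;

    CtuCost out{blockX, blockY, 0};
    for (uint32_t h = 0; h < blocksPerCtu && blockY < maxBlockRows; h++, blockY++)
    {
        uint32_t idx = blockX + blockY * maxBlockCols;
        for (uint32_t w = 0; w < blocksPerCtu && blockX + w < maxBlockCols; w++, idx++)
            out.cost += blockCost[idx];
    }
    return out;
}
```

For a 1280x720 picture with 64x64 CTUs the grid is 80x45 blocks, so a CTU in the last CTU row covers only one block row instead of four. The clamping conditions in both loops handle that case.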

 

x265_1.6.tar.gz/source/encoder/frameencoder.h -> x265_1.7.tar.gz/source/encoder/frameencoder.h Changed

 
@@ -63,11 +63,6 @@
     uint64_t cntTotalCu[4];
     uint64_t totalCu;
 
-    /* These states store the count of inter,intra and skip ctus within quad tree structure of each CU */
-    uint32_t qTreeInterCnt[4];
-    uint32_t qTreeIntraCnt[4];
-    uint32_t qTreeSkipCnt[4];
-
     StatisticLog()
     {
         memset(this, 0, sizeof(StatisticLog));
@@ -226,8 +221,8 @@
     void encodeSlice();
 
     void threadMain();
-    int  calcQpForCu(uint32_t cuAddr, double baseQp);
-    void collectCTUStatistics(CUData& ctu);
+    int  collectCTUStatistics(const CUData& ctu, uint32_t* qtreeInterCnt, uint32_t* qtreeIntraCnt, uint32_t* qtreeSkipCnt);
+    int  calcCTUQP(const CUData& ctu);
     void noiseReductionUpdate();
 
     /* Called by WaveFront::findJob() */

x265_1.6.tar.gz/source/encoder/level.cpp -> x265_1.7.tar.gz/source/encoder/level.cpp Changed

@@ -55,15 +55,14 @@
     { 35651584, 1069547520, 60000,    240000,   60000,  240000,   8, Level::LEVEL6,   "6",   60 },
     { 35651584, 2139095040, 120000,   480000,   120000, 480000,   8, Level::LEVEL6_1, "6.1", 61 },
     { 35651584, 4278190080U, 240000,  800000,   240000, 800000,   6, Level::LEVEL6_2, "6.2", 62 },
+    { MAX_UINT, MAX_UINT, MAX_UINT, MAX_UINT, MAX_UINT, MAX_UINT, 1, Level::LEVEL8_5, "8.5", 85 },
 };
 
 /* determine minimum decoder level required to decode the described video */
 void determineLevel(const x265_param &param, VPS& vps)
 {
     vps.maxTempSubLayers = param.bEnableTemporalSubLayers ? 2 : 1;
-    if (param.bLossless)
-        vps.ptl.profileIdc = Profile::NONE;
-    else if (param.internalCsp == X265_CSP_I420)
+    if (param.internalCsp == X265_CSP_I420)
     {
         if (param.internalBitDepth == 8)
         {
@@ -104,7 +103,15 @@
 
     const size_t NumLevels = sizeof(levels) / sizeof(levels[0]);
     uint32_t i;
-    for (i = 0; i < NumLevels; i++)
+    if (param.bLossless)
+    {
+        i = 13;
+        vps.ptl.minCrForLevel = 1;
+        vps.ptl.maxLumaSrForLevel = MAX_UINT;
+        vps.ptl.levelIdc = Level::LEVEL8_5;
+        vps.ptl.tierFlag = Level::MAIN;
+    }
+    else for (i = 0; i < NumLevels; i++)
     {
         if (lumaSamples > levels[i].maxLumaSamples)
             continue;
@@ -337,31 +344,40 @@
 extern "C"
 int x265_param_apply_profile(x265_param *param, const char *profile)
 {
-    if (!profile)
+    if (!param || !profile)
         return 0;
-    if (!strcmp(profile, "main"))
-    {
-        /* SPSs shall have chroma_format_idc equal to 1 only */
-        param->internalCsp = X265_CSP_I420;
 
 #if HIGH_BIT_DEPTH
-        /* SPSs shall have bit_depth_luma_minus8 equal to 0 only */
-        x265_log(param, X265_LOG_ERROR, "Main profile not supported, compiled for Main10.\n");
+    if (!strcmp(profile, "main") || !strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp") || !strcmp(profile, "main444-8"))
+    {
+        x265_log(param, X265_LOG_ERROR, "%s profile not supported, compiled for Main10.\n", profile);
         return -1;
-#endif
     }
-    else if (!strcmp(profile, "main10"))
+#else
+    if (!strcmp(profile, "main10") || !strcmp(profile, "main422-10") || !strcmp(profile, "main444-10"))
     {
-        /* SPSs shall have chroma_format_idc equal to 1 only */
-        param->internalCsp = X265_CSP_I420;
-
-        /* SPSs shall have bit_depth_luma_minus8 in the range of 0 to 2, inclusive 
-         * this covers all builds of x265, currently */
+        x265_log(param, X265_LOG_ERROR, "%s profile not supported, compiled for Main.\n", profile);
+        return -1;
+    }
+#endif
+    
+    if (!strcmp(profile, "main"))
+    {
+        if (!(param->internalCsp & X265_CSP_I420))
+        {
+            x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n",
+                     profile, x265_source_csp_names[param->internalCsp]);
+            return -1;
+        }
     }
     else if (!strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp"))
     {
-        /* SPSs shall have chroma_format_idc equal to 1 only */
-        param->internalCsp = X265_CSP_I420;
+        if (!(param->internalCsp & X265_CSP_I420))
+        {
+            x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n",
+                     profile, x265_source_csp_names[param->internalCsp]);
+            return -1;
+        }
 
         /* SPSs shall have sps_max_dec_pic_buffering_minus1[ sps_max_sub_layers_minus1 ] equal to 0 only */
         param->maxNumReferences = 1;
@@ -378,25 +394,29 @@
         param->rc.cuTree = 0;
         param->bEnableWeightedPred = 0;
         param->bEnableWeightedBiPred = 0;
-
-#if HIGH_BIT_DEPTH
-        /* SPSs shall have bit_depth_luma_minus8 equal to 0 only */
-        x265_log(param, X265_LOG_ERROR, "Mainstillpicture profile not supported, compiled for Main10.\n");
-        return -1;
-#endif
+    }
+    else if (!strcmp(profile, "main10"))
+    {
+        if (!(param->internalCsp & X265_CSP_I420))
+        {
+            x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n",
+                     profile, x265_source_csp_names[param->internalCsp]);
+            return -1;
+        }
     }
     else if (!strcmp(profile, "main422-10"))
-        param->internalCsp = X265_CSP_I422;
-    else if (!strcmp(profile, "main444-8"))
     {
-        param->internalCsp = X265_CSP_I444;
-#if HIGH_BIT_DEPTH
-        x265_log(param, X265_LOG_ERROR, "Main 4:4:4 8 profile not supported, compiled for Main10.\n");
-        return -1;
-#endif
+        if (!(param->internalCsp & (X265_CSP_I420 | X265_CSP_I422)))
+        {
+            x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n",
+                     profile, x265_source_csp_names[param->internalCsp]);
+            return -1;
+        }
+    }
+    else if (!strcmp(profile, "main444-8") || !strcmp(profile, "main444-10"))
+    {
+        /* any color space allowed */
     }
-    else if (!strcmp(profile, "main444-10"))
-        param->internalCsp = X265_CSP_I444;
     else
     {
         x265_log(param, X265_LOG_ERROR, "unknown profile <%s>\n", profile);
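In this revision x265_param_apply_profile() stops overwriting the caller's color space and instead rejects combinations the profile does not allow. The sketch below restates the acceptance rules from the hunk as a standalone function. It is an illustration only: `Chroma` and `checkProfileChroma` are hypothetical names, the enum does not use x265's X265_CSP_* values, the sketch uses plain equality comparisons where the diff uses a bitwise test, and the build-time bit-depth checks are left out.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical chroma format enum for this sketch (not x265's X265_CSP_* constants).
enum class Chroma { I400, I420, I422, I444 };

// Returns 0 when the named profile accepts the chroma format, -1 otherwise.
int checkProfileChroma(const char* profile, Chroma csp)
{
    if (!profile)
        return 0; // no profile requested, nothing to validate

    // 4:2:0-only profiles
    if (!std::strcmp(profile, "main") || !std::strcmp(profile, "main10") ||
        !std::strcmp(profile, "mainstillpicture") || !std::strcmp(profile, "msp"))
        return csp == Chroma::I420 ? 0 : -1;

    // 4:2:2 profile also admits 4:2:0 content
    if (!std::strcmp(profile, "main422-10"))
        return (csp == Chroma::I420 || csp == Chroma::I422) ? 0 : -1;

    // 4:4:4 profiles: any color space allowed
    if (!std::strcmp(profile, "main444-8") || !std::strcmp(profile, "main444-10"))
        return 0;

    return -1; // unknown profile
}
```

A caller would set the color space first and then apply the profile, treating a negative return as a configuration error instead of expecting the profile to change the color space.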

 

x265_1.6.tar.gz/source/encoder/motion.cpp -> x265_1.7.tar.gz/source/encoder/motion.cpp Changed

@@ -234,9 +234,14 @@
                pix_base + (m1x) + (m1y) * stride, \
                pix_base + (m2x) + (m2y) * stride, \
                stride, costs); \
-        (costs)[0] += mvcost((bmv + MV(m0x, m0y)) << 2); \
-        (costs)[1] += mvcost((bmv + MV(m1x, m1y)) << 2); \
-        (costs)[2] += mvcost((bmv + MV(m2x, m2y)) << 2); \
+        const uint16_t *base_mvx = &m_cost_mvx[(bmv.x + (m0x)) << 2]; \
+        const uint16_t *base_mvy = &m_cost_mvy[(bmv.y + (m0y)) << 2]; \
+        X265_CHECK(mvcost((bmv + MV(m0x, m0y)) << 2) == (base_mvx[((m0x) - (m0x)) << 2] + base_mvy[((m0y) - (m0y)) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((bmv + MV(m1x, m1y)) << 2) == (base_mvx[((m1x) - (m0x)) << 2] + base_mvy[((m1y) - (m0y)) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((bmv + MV(m2x, m2y)) << 2) == (base_mvx[((m2x) - (m0x)) << 2] + base_mvy[((m2y) - (m0y)) << 2]), "mvcost() check failure\n"); \
+        (costs)[0] += (base_mvx[((m0x) - (m0x)) << 2] + base_mvy[((m0y) - (m0y)) << 2]); \
+        (costs)[1] += (base_mvx[((m1x) - (m0x)) << 2] + base_mvy[((m1y) - (m0y)) << 2]); \
+        (costs)[2] += (base_mvx[((m2x) - (m0x)) << 2] + base_mvy[((m2y) - (m0y)) << 2]); \
     }
 
 #define COST_MV_PT_DIST_X4(m0x, m0y, p0, d0, m1x, m1y, p1, d1, m2x, m2y, p2, d2, m3x, m3y, p3, d3) \
@@ -247,10 +252,10 @@
                fref + (m2x) + (m2y) * stride, \
                fref + (m3x) + (m3y) * stride, \
                stride, costs); \
-        costs[0] += mvcost(MV(m0x, m0y) << 2); \
-        costs[1] += mvcost(MV(m1x, m1y) << 2); \
-        costs[2] += mvcost(MV(m2x, m2y) << 2); \
-        costs[3] += mvcost(MV(m3x, m3y) << 2); \
+        (costs)[0] += mvcost(MV(m0x, m0y) << 2); \
+        (costs)[1] += mvcost(MV(m1x, m1y) << 2); \
+        (costs)[2] += mvcost(MV(m2x, m2y) << 2); \
+        (costs)[3] += mvcost(MV(m3x, m3y) << 2); \
         COPY4_IF_LT(bcost, costs[0], bmv, MV(m0x, m0y), bPointNr, p0, bDistance, d0); \
         COPY4_IF_LT(bcost, costs[1], bmv, MV(m1x, m1y), bPointNr, p1, bDistance, d1); \
         COPY4_IF_LT(bcost, costs[2], bmv, MV(m2x, m2y), bPointNr, p2, bDistance, d2); \
@@ -266,10 +271,16 @@
                pix_base + (m2x) + (m2y) * stride, \
                pix_base + (m3x) + (m3y) * stride, \
                stride, costs); \
-        costs[0] += mvcost((omv + MV(m0x, m0y)) << 2); \
-        costs[1] += mvcost((omv + MV(m1x, m1y)) << 2); \
-        costs[2] += mvcost((omv + MV(m2x, m2y)) << 2); \
-        costs[3] += mvcost((omv + MV(m3x, m3y)) << 2); \
+        const uint16_t *base_mvx = &m_cost_mvx[(omv.x << 2)]; \
+        const uint16_t *base_mvy = &m_cost_mvy[(omv.y << 2)]; \
+        X265_CHECK(mvcost((omv + MV(m0x, m0y)) << 2) == (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((omv + MV(m1x, m1y)) << 2) == (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((omv + MV(m2x, m2y)) << 2) == (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((omv + MV(m3x, m3y)) << 2) == (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]), "mvcost() check failure\n"); \
+        costs[0] += (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]); \
+        costs[1] += (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]); \
+        costs[2] += (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]); \
+        costs[3] += (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]); \
         COPY2_IF_LT(bcost, costs[0], bmv, omv + MV(m0x, m0y)); \
         COPY2_IF_LT(bcost, costs[1], bmv, omv + MV(m1x, m1y)); \
         COPY2_IF_LT(bcost, costs[2], bmv, omv + MV(m2x, m2y)); \
@@ -285,10 +296,17 @@
                pix_base + (m2x) + (m2y) * stride, \
                pix_base + (m3x) + (m3y) * stride, \
                stride, costs); \
-        (costs)[0] += mvcost((bmv + MV(m0x, m0y)) << 2); \
-        (costs)[1] += mvcost((bmv + MV(m1x, m1y)) << 2); \
-        (costs)[2] += mvcost((bmv + MV(m2x, m2y)) << 2); \
-        (costs)[3] += mvcost((bmv + MV(m3x, m3y)) << 2); \
+        /* TODO: use restrict keyword in ICL */ \
+        const uint16_t *base_mvx = &m_cost_mvx[(bmv.x << 2)]; \
+        const uint16_t *base_mvy = &m_cost_mvy[(bmv.y << 2)]; \
+        X265_CHECK(mvcost((bmv + MV(m0x, m0y)) << 2) == (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((bmv + MV(m1x, m1y)) << 2) == (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((bmv + MV(m2x, m2y)) << 2) == (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]), "mvcost() check failure\n"); \
+        X265_CHECK(mvcost((bmv + MV(m3x, m3y)) << 2) == (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]), "mvcost() check failure\n"); \
+        (costs)[0] += (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]); \
+        (costs)[1] += (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]); \
+        (costs)[2] += (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]); \
+        (costs)[3] += (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]); \
     }
 
 #define DIA1_ITER(mx, my) \
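
Note on the hunks above: the per-candidate mvcost() calls are replaced by two base pointers into per-component cost tables, which are then indexed by the candidate offset. The following is a minimal sketch of that idea, not the x265 implementation. ComponentCostTable, toyComponentCost, mvCostDirect and costFourOffsets are hypothetical names, and the toy cost model is an assumption made only so the example is self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Toy per-component cost that grows with the distance from a predictor.
// Placeholder only; this is not x265's bit-cost model.
static uint16_t toyComponentCost(int delta)
{
    return (uint16_t)(2 * std::abs(delta) + 1);
}

// Hypothetical cost table for one MV component, indexable by signed
// quarter-pel values in [-range, range].
class ComponentCostTable
{
public:
    ComponentCostTable(int range, int predictor)
        : m_range(range), m_costs(2 * range + 1)
    {
        for (int v = -range; v <= range; v++)
            m_costs[v + range] = toyComponentCost(v - predictor);
    }

    // Pointer to the entry for value v, so ptr(v)[d] is the cost of v + d.
    // The caller must keep v + d inside [-range, range].
    const uint16_t* ptr(int v) const
    {
        assert(v >= -m_range && v <= m_range);
        return m_costs.data() + (v + m_range);
    }

private:
    int m_range;
    std::vector<uint16_t> m_costs;
};

// Direct form: look up both components for one quarter-pel vector.
static int mvCostDirect(const ComponentCostTable& tx, const ComponentCostTable& ty,
                        int qx, int qy)
{
    return tx.ptr(qx)[0] + ty.ptr(qy)[0];
}

// Hoisted form: compute the base pointers once for a full-pel center, then
// index them with each full-pel offset scaled to quarter-pel units.
static void costFourOffsets(const ComponentCostTable& tx, const ComponentCostTable& ty,
                            int cx, int cy, const int dx[4], const int dy[4], int out[4])
{
    const uint16_t* baseX = tx.ptr(cx * 4);
    const uint16_t* baseY = ty.ptr(cy * 4);
    for (int i = 0; i < 4; i++)
        out[i] = baseX[dx[i] * 4] + baseY[dy[i] * 4];
}
```

The sketch multiplies by 4 rather than shifting so that negative coordinates stay well defined in older C++ standards. The X265_CHECK lines in the diff play the role of the equality check between the direct and hoisted forms.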

 

x265_1.6.tar.gz/source/encoder/nal.cpp -> x265_1.7.tar.gz/source/encoder/nal.cpp Changed

@@ -35,6 +35,7 @@
     , m_extraBuffer(NULL)
     , m_extraOccupancy(0)
     , m_extraAllocSize(0)
+    , m_annexB(true)
 {}
 
 void NALList::takeContents(NALList& other)
@@ -90,7 +91,12 @@
     uint8_t *out = m_buffer + m_occupancy;
     uint32_t bytes = 0;
 
-    if (!m_numNal || nalUnitType == NAL_UNIT_VPS || nalUnitType == NAL_UNIT_SPS || nalUnitType == NAL_UNIT_PPS)
+    if (!m_annexB)
+    {
+        /* Will write size later */
+        bytes += 4;
+    }
+    else if (!m_numNal || nalUnitType == NAL_UNIT_VPS || nalUnitType == NAL_UNIT_SPS || nalUnitType == NAL_UNIT_PPS)
     {
         memcpy(out, startCodePrefix, 4);
         bytes += 4;
@@ -144,6 +150,16 @@
      * to 0x03 is appended to the end of the data.  */
     if (!out[bytes - 1])
         out[bytes++] = 0x03;
+
+    if (!m_annexB)
+    {
+        uint32_t dataSize = bytes - 4;
+        out[0] = (uint8_t)(dataSize >> 24);
+        out[1] = (uint8_t)(dataSize >> 16);
+        out[2] = (uint8_t)(dataSize >> 8);
+        out[3] = (uint8_t)dataSize;
+    }
+
     m_occupancy += bytes;
 
     X265_CHECK(m_numNal < (uint32_t)MAX_NAL_UNITS, "NAL count overflow\n");
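
Note on the hunks above: when m_annexB is false, four bytes are reserved ahead of the NAL data and later filled with the big-endian data size instead of a start code. The sketch below illustrates that framing choice under simplifying assumptions: frameNal and readLengthPrefix are hypothetical helpers, the payload is assumed to be already escaped, and a 4-byte start code is always used in the Annex B branch (the diff shows the 4-byte prefix only for the first NAL and for VPS/SPS/PPS).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Frame one already-escaped NAL payload with a 4-byte header:
// either an Annex B start code or a big-endian length prefix.
static std::vector<uint8_t> frameNal(const std::vector<uint8_t>& payload, bool annexB)
{
    std::vector<uint8_t> out(4, 0);          // reserve the 4 header bytes
    out.insert(out.end(), payload.begin(), payload.end());

    if (annexB)
    {
        out[3] = 1;                          // 00 00 00 01
    }
    else
    {
        // The size is known only after the payload is written,
        // so the header is patched afterwards.
        uint32_t dataSize = (uint32_t)(out.size() - 4);
        out[0] = (uint8_t)(dataSize >> 24);
        out[1] = (uint8_t)(dataSize >> 16);
        out[2] = (uint8_t)(dataSize >> 8);
        out[3] = (uint8_t)dataSize;
    }
    return out;
}

// Read back the big-endian length prefix of a length-framed NAL.
static uint32_t readLengthPrefix(const std::vector<uint8_t>& buf)
{
    assert(buf.size() >= 4);
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}
```

A muxer that expects length-prefixed units can walk the buffer by reading four bytes, skipping that many bytes, and repeating, without scanning for start codes.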

 

x265_1.6.tar.gz/source/encoder/nal.h -> x265_1.7.tar.gz/source/encoder/nal.h Changed

 
@@ -48,6 +48,7 @@
     uint8_t*    m_extraBuffer;
     uint32_t    m_extraOccupancy;
     uint32_t    m_extraAllocSize;
+    bool        m_annexB;
 
     NALList();
     ~NALList() { X265_FREE(m_buffer); X265_FREE(m_extraBuffer); }
​

x265_1.6.tar.gz/source/encoder/ratecontrol.cpp -> x265_1.7.tar.gz/source/encoder/ratecontrol.cpp Changed

@@ -300,7 +300,7 @@
         }
     }
 
-    /* qstep - value set as encoder specific */
+    /* qpstep - value set as encoder specific */
     m_lstep = pow(2, m_param->rc.qpStep / 6.0);
 
     for (int i = 0; i < 2; i++)
@@ -370,14 +370,19 @@
     m_accumPQp = (m_param->rc.rateControlMode == X265_RC_CRF ? CRF_INIT_QP : ABR_INIT_QP_MIN) * m_accumPNorm;
 
     /* Frame Predictors and Row predictors used in vbv */
-    for (int i = 0; i < 5; i++)
+    for (int i = 0; i < 4; i++)
     {
-        m_pred[i].coeff = 1.5;
+        m_pred[i].coeff = 1.0;
         m_pred[i].count = 1.0;
         m_pred[i].decay = 0.5;
         m_pred[i].offset = 0.0;
     }
-    m_pred[0].coeff = 1.0;
+    m_pred[0].coeff = m_pred[3].coeff = 0.75;
+    if (m_param->rc.qCompress >= 0.8) // when tuned for grain 
+    {
+        m_pred[1].coeff = 0.75;
+        m_pred[0].coeff = m_pred[3].coeff = 0.50;
+    }
     if (!m_statFileOut && (m_param->rc.bStatWrite || m_param->rc.bStatRead))
     {
         /* If the user hasn't defined the stat filename, use the default value */
@@ -945,6 +950,9 @@
     m_curSlice = curEncData.m_slice;
     m_sliceType = m_curSlice->m_sliceType;
     rce->sliceType = m_sliceType;
+    if (!m_2pass)
+        rce->keptAsRef = IS_REFERENCED(curFrame);
+    m_predType = getPredictorType(curFrame->m_lowres.sliceType, m_sliceType);
     rce->poc = m_curSlice->m_poc;
     if (m_param->rc.bStatRead)
     {
@@ -1074,7 +1082,7 @@
             m_lastQScaleFor[m_sliceType] = x265_qp2qScale(rce->qpaRc);
             if (rce->poc == 0)
                  m_lastQScaleFor[P_SLICE] = m_lastQScaleFor[m_sliceType] * fabs(m_param->rc.ipFactor);
-            rce->frameSizePlanned = predictSize(&m_pred[m_sliceType], m_qp, (double)m_currentSatd);
+            rce->frameSizePlanned = predictSize(&m_pred[m_predType], m_qp, (double)m_currentSatd);
         }
     }
     m_framesDone++;
@@ -1105,6 +1113,14 @@
         m_accumPQp += m_qp;
 }
 
+int RateControl::getPredictorType(int lowresSliceType, int sliceType)
+{
+    /* Use a different predictor for B Ref and B frames for vbv frame size predictions */
+    if (lowresSliceType == X265_TYPE_BREF)
+        return 3;
+    return sliceType;
+}
+
 double RateControl::getDiffLimitedQScale(RateControlEntry *rce, double q)
 {
     // force I/B quants as a function of P quants
@@ -1379,6 +1395,7 @@
             q += m_pbOffset;
 
         double qScale = x265_qp2qScale(q);
+        rce->qpNoVbv = q;
         double lmin = 0, lmax = 0;
         if (m_isVbv)
         {
@@ -1391,16 +1408,15 @@
                     qScale = x265_clip3(lmin, lmax, qScale);
                 q = x265_qScale2qp(qScale);
             }
-            rce->qpNoVbv = q;
             if (!m_2pass)
             {
                 qScale = clipQscale(curFrame, rce, qScale);
                 /* clip qp to permissible range after vbv-lookahead estimation to avoid possible 
                  * mispredictions by initial frame size predictors */
-                if (m_pred[m_sliceType].count == 1)
+                if (m_pred[m_predType].count == 1)
                     qScale = x265_clip3(lmin, lmax, qScale);
                 m_lastQScaleFor[m_sliceType] = qScale;
-                rce->frameSizePlanned = predictSize(&m_pred[m_sliceType], qScale, (double)m_currentSatd);
+                rce->frameSizePlanned = predictSize(&m_pred[m_predType], qScale, (double)m_currentSatd);
             }
             else
                 rce->frameSizePlanned = qScale2bits(rce, qScale);
@@ -1544,7 +1560,7 @@
             q = clipQscale(curFrame, rce, q);
             /*  clip qp to permissible range after vbv-lookahead estimation to avoid possible
              * mispredictions by initial frame size predictors */
-            if (!m_2pass && m_isVbv && m_pred[m_sliceType].count == 1)
+            if (!m_2pass && m_isVbv && m_pred[m_predType].count == 1)
                 q = x265_clip3(lqmin, lqmax, q);
         }
         m_lastQScaleFor[m_sliceType] = q;
@@ -1554,7 +1570,7 @@
         if (m_2pass && m_isVbv)
             rce->frameSizePlanned = qScale2bits(rce, q);
         else
-            rce->frameSizePlanned = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+            rce->frameSizePlanned = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
 
         /* Always use up the whole VBV in this case. */
         if (m_singleFrameVbv)
@@ -1707,7 +1723,7 @@
             {
                 double frameQ[3];
                 double curBits;
-                curBits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+                curBits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
                 double bufferFillCur = m_bufferFill - curBits;
                 double targetFill;
                 double totalDuration = m_frameDuration;
@@ -1726,7 +1742,8 @@
                         bufferFillCur += wantedFrameSize;
                     int64_t satd = curFrame->m_lowres.plannedSatd[j] >> (X265_DEPTH - 8);
                     type = IS_X265_TYPE_I(type) ? I_SLICE : IS_X265_TYPE_B(type) ? B_SLICE : P_SLICE;
-                    curBits = predictSize(&m_pred[type], frameQ[type], (double)satd);
+                    int predType = getPredictorType(curFrame->m_lowres.plannedType[j], type);
+                    curBits = predictSize(&m_pred[predType], frameQ[type], (double)satd);
                     bufferFillCur -= curBits;
                 }
 
@@ -1766,7 +1783,7 @@
             }
             // Now a hard threshold to make sure the frame fits in VBV.
             // This one is mostly for I-frames.
-            double bits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+            double bits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
 
             // For small VBVs, allow the frame to use up the entire VBV.
             double maxFillFactor;
@@ -1783,18 +1800,21 @@
                 bits *= qf;
                 if (bits < m_bufferRate / minFillFactor)
                     q *= bits * minFillFactor / m_bufferRate;
-                bits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+                bits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
             }
 
             q = X265_MAX(q0, q);
         }
 
         /* Apply MinCR restrictions */
-        double pbits = predictSize(&m_pred[m_sliceType], q, (double)m_currentSatd);
+        double pbits = predictSize(&m_pred[m_predType], q, (double)m_currentSatd);
         if (pbits > rce->frameSizeMaximum)
             q *= pbits / rce->frameSizeMaximum;
-
-        if (!m_isCbr || (m_isAbr && m_currentSatd >= rce->movingAvgSum && q <= q0 / 2))
+        /* To detect frames that are more complex in SATD costs compared to prev window, yet 
+         * lookahead vbv reduces its qscale by half its value. Be on safer side and avoid drastic 
+         * qscale reductions for frames high in complexity */
+        bool mispredCheck = rce->movingAvgSum && m_currentSatd >= rce->movingAvgSum && q <= q0 / 2;
+        if (!m_isCbr || (m_isAbr && mispredCheck))
             q = X265_MAX(q0, q);
 
         if (m_rateFactorMaxIncrement)
@@ -1838,18 +1858,26 @@
         if (satdCostForPendingCus  > 0)
         {
             double pred_s = predictSize(rce->rowPred[0], qScale, satdCostForPendingCus);
-            uint32_t refRowSatdCost = 0, refRowBits = 0, intraCost = 0;
+            uint32_t refRowSatdCost = 0, refRowBits = 0, intraCostForPendingCus = 0;
             double refQScale = 0;
 
             if (picType != I_SLICE)
             {
                 FrameData& refEncData = *refFrame->m_encData;
                 uint32_t endCuAddr = maxCols * (row + 1);
-                for (uint32_t cuAddr = curEncData.m_rowStat[row].numEncodedCUs + 1; cuAddr < endCuAddr; cuAddr++)
+                uint32_t startCuAddr = curEncData.m_rowStat[row].numEncodedCUs;
+                if (startCuAddr)
                 {
-                    refRowSatdCost += refEncData.m_cuStat[cuAddr].vbvCost;
-                    refRowBits += refEncData.m_cuStat[cuAddr].totalBits;
-                    intraCost += curEncData.m_cuStat[cuAddr].intraVbvCost;
+                    for (uint32_t cuAddr = startCuAddr + 1 ; cuAddr < endCuAddr; cuAddr++)
+                    {
+                        refRowSatdCost += refEncData.m_cuStat[cuAddr].vbvCost;
+                        refRowBits += refEncData.m_cuStat[cuAddr].totalBits;
+                    }
+                }
+                else
+                {
+                    refRowBits = refEncData.m_rowStat[row].encodedBits;
+                    refRowSatdCost = refEncData.m_rowStat[row].satdForVbv;
                 }
 
                 refRowSatdCost >>= X265_DEPTH - 8;
@@ -1859,7 +1887,7 @@
             if (picType == I_SLICE || qScale >= refQScale)
             {

 

x265_1.6.tar.gz/source/encoder/ratecontrol.h -> x265_1.7.tar.gz/source/encoder/ratecontrol.h Changed

 
@@ -157,10 +157,9 @@
     double m_rateFactorMaxIncrement; /* Don't allow RF above (CRF + this value). */
     double m_rateFactorMaxDecrement; /* don't allow RF below (this value). */
 
-    Predictor m_pred[5];
-    Predictor m_predBfromP;
-
+    Predictor m_pred[4];       /* Slice predictors to preidct bits for each Slice type - I,P,Bref and B */
     int64_t m_leadingNoBSatd;
+    int     m_predType;       /* Type of slice predictors to be used - depends on the slice type */
     double  m_ipOffset;
     double  m_pbOffset;
     int64_t m_bframeBits;
@@ -266,6 +265,7 @@
     double tuneAbrQScaleFromFeedback(double qScale);
     void   accumPQpUpdate();
 
+    int    getPredictorType(int lowresSliceType, int sliceType);
     void   updateVbv(int64_t bits, RateControlEntry* rce);
     void   updatePredictor(Predictor *p, double q, double var, double bits);
     double clipQscale(Frame* pic, RateControlEntry* rce, double q);
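
Note on the rate control changes above: the predictor array shrinks to four entries, and referenced B frames get their own entry (index 3) instead of sharing the B slice predictor. The sketch below combines the selection rule with the new initial coefficients. The enum values are assumptions chosen for the example (B, P, I as 0, 1, 2 follows the psyScaleFix8 comment elsewhere in this diff); LowresType and initPredictors are hypothetical.

```cpp
#include <array>
#include <cassert>

struct Predictor
{
    double coeff, count, decay, offset;
};

// Assumed slice type indices (B, P, I).
enum SliceType { B_SLICE = 0, P_SLICE = 1, I_SLICE = 2 };

// Stand-ins for the lookahead frame types; the values are illustrative.
enum LowresType { TYPE_P = 3, TYPE_BREF = 4, TYPE_B = 5 };

// Referenced B frames use predictor 3; everything else uses its slice type.
static int getPredictorType(int lowresSliceType, int sliceType)
{
    if (lowresSliceType == TYPE_BREF)
        return 3;
    return sliceType;
}

// Initial predictor state as set up in the hunk at -370,14 +370,19.
static std::array<Predictor, 4> initPredictors(double qCompress)
{
    std::array<Predictor, 4> pred;
    for (int i = 0; i < 4; i++)
    {
        pred[i].coeff = 1.0;
        pred[i].count = 1.0;
        pred[i].decay = 0.5;
        pred[i].offset = 0.0;
    }
    pred[0].coeff = pred[3].coeff = 0.75;
    if (qCompress >= 0.8) // when tuned for grain
    {
        pred[1].coeff = 0.75;
        pred[0].coeff = pred[3].coeff = 0.50;
    }
    return pred;
}
```

Keeping the mapping in one function means every predictSize() call site can index the array the same way, which is what the m_sliceType to m_predType substitutions in the diff do.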
​

x265_1.6.tar.gz/source/encoder/rdcost.h -> x265_1.7.tar.gz/source/encoder/rdcost.h Changed

@@ -40,13 +40,15 @@
     uint32_t  m_chromaDistWeight[2];
     uint32_t  m_psyRdBase;
     uint32_t  m_psyRd;
-    int       m_qp;
+    int       m_qp; /* QP used to configure lambda, may be higher than QP_MAX_SPEC but <= QP_MAX_MAX */
 
     void setPsyRdScale(double scale)                { m_psyRdBase = (uint32_t)floor(65536.0 * scale * 0.33); }
 
     void setQP(const Slice& slice, int qp)
     {
+        x265_emms(); /* TODO: if the lambda tables were ints, this would not be necessary */
         m_qp = qp;
+        setLambda(x265_lambda2_tab[qp], x265_lambda_tab[qp]);
 
         /* Scale PSY RD factor by a slice type factor */
         static const uint32_t psyScaleFix8[3] = { 300, 256, 96 }; /* B, P, I */
@@ -60,19 +62,21 @@
         }
 
         int qpCb, qpCr;
-        setLambda(x265_lambda2_tab[qp], x265_lambda_tab[qp]);
         if (slice.m_sps->chromaFormatIdc == X265_CSP_I420)
-            qpCb = x265_clip3(QP_MIN, QP_MAX_MAX, (int)g_chromaScale[qp + slice.m_pps->chromaQpOffset[0]]);
+        {
+            qpCb = (int)g_chromaScale[x265_clip3(QP_MIN, QP_MAX_MAX, qp + slice.m_pps->chromaQpOffset[0])];
+            qpCr = (int)g_chromaScale[x265_clip3(QP_MIN, QP_MAX_MAX, qp + slice.m_pps->chromaQpOffset[1])];
+        }
         else
-            qpCb = X265_MIN(qp + slice.m_pps->chromaQpOffset[0], QP_MAX_SPEC);
+        {
+            qpCb = x265_clip3(QP_MIN, QP_MAX_SPEC, qp + slice.m_pps->chromaQpOffset[0]);
+            qpCr = x265_clip3(QP_MIN, QP_MAX_SPEC, qp + slice.m_pps->chromaQpOffset[1]);
+        }
+
         int chroma_offset_idx = X265_MIN(qp - qpCb + 12, MAX_CHROMA_LAMBDA_OFFSET);
         uint16_t lambdaOffset = m_psyRd ? x265_chroma_lambda2_offset_tab[chroma_offset_idx] : 256;
         m_chromaDistWeight[0] = lambdaOffset;
 
-        if (slice.m_sps->chromaFormatIdc == X265_CSP_I420)
-            qpCr = x265_clip3(QP_MIN, QP_MAX_MAX, (int)g_chromaScale[qp + slice.m_pps->chromaQpOffset[0]]);
-        else
-            qpCr = X265_MIN(qp + slice.m_pps->chromaQpOffset[0], QP_MAX_SPEC);
         chroma_offset_idx = X265_MIN(qp - qpCr + 12, MAX_CHROMA_LAMBDA_OFFSET);
         lambdaOffset = m_psyRd ? x265_chroma_lambda2_offset_tab[chroma_offset_idx] : 256;
         m_chromaDistWeight[1] = lambdaOffset;
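
Note on the setQP() hunk above: the old code clipped the value read from g_chromaScale, so the table index qp + offset itself was unchecked; the new code clips the index first and then reads the table. It also uses chromaQpOffset[1] for Cr, where the removed lines used index 0. A minimal sketch of the clip-then-lookup pattern follows. The bounds and the toy table are assumptions for illustration and are not the real g_chromaScale contents.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical index bounds for the example table.
static const int kQpMin = 0;
static const int kQpMax = 69;

// Toy luma-to-chroma QP mapping: identity below 30, then a slower ramp.
// Placeholder data only.
static std::vector<int> makeToyChromaTable()
{
    std::vector<int> table(kQpMax + 1);
    for (int i = 0; i <= kQpMax; i++)
        table[i] = i < 30 ? i : 29 + (i - 29) / 2;
    return table;
}

static int clampInt(int lo, int hi, int v)
{
    return std::min(hi, std::max(lo, v));
}

// Clamp the index before the lookup so the read is always in bounds.
static int chromaQp(const std::vector<int>& table, int qp, int offset)
{
    int idx = clampInt(kQpMin, kQpMax, qp + offset);
    return table[idx];
}
```

Clamping the looked-up value instead would bound the result but would not prevent an out-of-range read when qp + offset falls outside the table.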

 

x265_1.6.tar.gz/source/encoder/sao.cpp -> x265_1.7.tar.gz/source/encoder/sao.cpp Changed

@@ -258,7 +258,7 @@
     pixel* tmpL;
     pixel* tmpU;
 
-    int8_t _upBuff1[MAX_CU_SIZE + 2], *upBuff1 = _upBuff1 + 1;
+    int8_t _upBuff1[MAX_CU_SIZE + 2], *upBuff1 = _upBuff1 + 1, signLeft1[2];
     int8_t _upBufft[MAX_CU_SIZE + 2], *upBufft = _upBufft + 1;
 
     memset(_upBuff1 + MAX_CU_SIZE, 0, 2 * sizeof(int8_t)); /* avoid valgrind uninit warnings */
@@ -279,7 +279,7 @@
     {
     case SAO_EO_0: // dir: -
     {
-        pixel firstPxl = 0, lastPxl = 0;
+        pixel firstPxl = 0, lastPxl = 0, row1FirstPxl = 0, row1LastPxl = 0;
         startX = !lpelx;
         endX   = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth;
         if (ctuWidth & 15)
@@ -301,25 +301,38 @@
         }
         else
         {
-            for (y = 0; y < ctuHeight; y++)
+            for (y = 0; y < ctuHeight; y += 2)
             {
-                int signLeft = signOf(rec[startX] - tmpL[y]);
+                signLeft1[0] = signOf(rec[startX] - tmpL[y]);
+                signLeft1[1] = signOf(rec[stride + startX] - tmpL[y + 1]);
 
                 if (!lpelx)
+                {
                     firstPxl = rec[0];
+                    row1FirstPxl = rec[stride];
+                }
 
                 if (rpelx == picWidth)
+                {
                     lastPxl = rec[ctuWidth - 1];
+                    row1LastPxl = rec[stride + ctuWidth - 1];
+                }
 
-                primitives.saoCuOrgE0(rec, m_offsetEo, ctuWidth, (int8_t)signLeft);
+                primitives.saoCuOrgE0(rec, m_offsetEo, ctuWidth, signLeft1, stride);
 
                 if (!lpelx)
+                {
                     rec[0] = firstPxl;
+                    rec[stride] = row1FirstPxl;
+                }
 
                 if (rpelx == picWidth)
+                {
                     rec[ctuWidth - 1] = lastPxl;
+                    rec[stride + ctuWidth - 1] = row1LastPxl;
+                }
 
-                rec += stride;
+                rec += 2 * stride;
             }
         }
         break;
@@ -354,11 +367,14 @@
         {
             primitives.sign(upBuff1, rec, tmpU, ctuWidth);
 
-            for (y = startY; y < endY; y++)
+            int diff = (endY - startY) % 2;
+            for (y = startY; y < endY - diff; y += 2)
             {
-                primitives.saoCuOrgE1(rec, upBuff1, m_offsetEo, stride, ctuWidth);
-                rec += stride;
+                primitives.saoCuOrgE1_2Rows(rec, upBuff1, m_offsetEo, stride, ctuWidth);
+                rec += 2 * stride;
             }
+            if (diff & 1)
+                primitives.saoCuOrgE1(rec, upBuff1, m_offsetEo, stride, ctuWidth);
         }
 
         break;
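
Note on the hunk above: rows are now handled two at a time through saoCuOrgE1_2Rows, with the single-row primitive kept for an odd leftover row. The control flow can be sketched as follows. processOneRow and processTwoRows are hypothetical stand-ins for the primitives, and they only add 1 to each pixel so that the result is easy to check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the single-row primitive: adds 1 to every pixel in a row.
inline void processOneRow(int* rec, int width)
{
    for (int x = 0; x < width; x++)
        rec[x] += 1;
}

// Hypothetical stand-in for the two-row primitive: same work on a row and the one below it.
inline void processTwoRows(int* rec, std::ptrdiff_t stride, int width)
{
    processOneRow(rec, width);
    processOneRow(rec + stride, width);
}

// rec points at row startY on entry. Rows [startY, endY) are walked in pairs,
// then a possible odd leftover row is handled on its own.
inline void processRows(int* rec, std::ptrdiff_t stride, int width, int startY, int endY)
{
    assert(startY <= endY);
    int diff = (endY - startY) % 2;   // 1 when the row count is odd
    for (int y = startY; y < endY - diff; y += 2)
    {
        processTwoRows(rec, stride, width);
        rec += 2 * stride;
    }
    if (diff & 1)
        processOneRow(rec, width);    // leftover row
}
```

Each row is visited exactly once whether the count is even or odd, which is the property the diff variable protects.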
@@ -421,23 +437,8 @@
             for (y = startY; y < endY; y++)
             {
                 int8_t iSignDown2 = signOf(rec[stride + startX] - tmpL[y]);
-                pixel firstPxl = rec[0];  // copy first Pxl
-                pixel lastPxl = rec[ctuWidth - 1];
-                int8_t one = upBufft[1];
-                int8_t two = upBufft[endX + 1];
 
-                primitives.saoCuOrgE2(rec, upBufft, upBuff1, m_offsetEo, ctuWidth, stride);
-                if (!lpelx)
-                {
-                    rec[0] = firstPxl;
-                    upBufft[1] = one;
-                }
-
-                if (rpelx == picWidth)
-                {
-                    rec[ctuWidth - 1] = lastPxl;
-                    upBufft[endX + 1] = two;
-                }
+                primitives.saoCuOrgE2[endX > 16](rec + startX, upBufft + startX, upBuff1 + startX, m_offsetEo, endX - startX, stride);
 
                 upBufft[startX] = iSignDown2;
 
@@ -508,7 +509,7 @@
                 upBuff1[x - 1] = -signDown;
                 rec[x] = m_clipTable[rec[x] + m_offsetEo[edgeType]];
 
-                primitives.saoCuOrgE3(rec, upBuff1, m_offsetEo, stride - 1, startX, endX);
+                primitives.saoCuOrgE3[endX > 16](rec, upBuff1, m_offsetEo, stride - 1, startX, endX);
 
                 upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]);
 
@@ -783,13 +784,7 @@
                 rec += stride;
             }
 
-            if (!(ctuWidth & 15))
-                primitives.sign(upBuff1, rec, &rec[- stride], ctuWidth);
-            else
-            {
-                for (x = 0; x < ctuWidth; x++)
-                    upBuff1[x] = signOf(rec[x] - rec[x - stride]);
-            }
+            primitives.sign(upBuff1, rec, &rec[- stride], ctuWidth);
 
             for (y = startY; y < endY; y++)
             {
@@ -832,8 +827,7 @@
                 rec += stride;
             }
 
-            for (x = startX; x < endX; x++)
-                upBuff1[x] = signOf(rec[x] - rec[x - stride - 1]);
+            primitives.sign(&upBuff1[startX], &rec[startX], &rec[startX - stride - 1], (endX - startX));
 
             for (y = startY; y < endY; y++)
             {
@@ -879,8 +873,7 @@
                 rec += stride;
             }
 
-            for (x = startX - 1; x < endX; x++)
-                upBuff1[x] = signOf(rec[x] - rec[x - stride + 1]);
+            primitives.sign(&upBuff1[startX - 1], &rec[startX - 1], &rec[startX - 1 - stride + 1], (endX - startX + 1));
 
             for (y = startY; y < endY; y++)
             {
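
Note on the last three sao.cpp hunks: the open-coded signOf loops are replaced by primitives.sign calls on offset pointers. Passing &upBuff1[startX], &rec[startX] and a count of endX - startX covers the same elements as looping x from startX to endX. A scalar reference of that pattern is sketched below. signOfRef and signRef are illustrative names, and pixel is assumed to be 8-bit here.

```cpp
#include <cassert>
#include <stdint.h>

typedef uint8_t pixel;   // assumed 8-bit build; illustrative only

// Simple stand-in for signOf(): returns -1, 0 or +1.
inline int8_t signOfRef(int v)
{
    return (int8_t)((v > 0) - (v < 0));
}

// Scalar reference for a sign primitive: dst[x] = sign(src1[x] - src2[x]).
inline void signRef(int8_t* dst, const pixel* src1, const pixel* src2, int count)
{
    assert(count >= 0);
    for (int x = 0; x < count; x++)
        dst[x] = signOfRef(src1[x] - src2[x]);
}
```

Because the primitive takes a base pointer and a count, the caller expresses a sub-range by offsetting all three pointers by the same start index.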

 

x265_1.6.tar.gz/source/encoder/search.cpp -> x265_1.7.tar.gz/source/encoder/search.cpp Changed

@@ -163,11 +163,16 @@
     X265_FREE(m_tsRecon);
 }
 
-void Search::setQP(const Slice& slice, int qp)
+int Search::setLambdaFromQP(const CUData& ctu, int qp)
 {
-    x265_emms(); /* TODO: if the lambda tables were ints, this would not be necessary */
+    X265_CHECK(qp >= QP_MIN && qp <= QP_MAX_MAX, "QP used for lambda is out of range\n");
+
     m_me.setQP(qp);
-    m_rdCost.setQP(slice, qp);
+    m_rdCost.setQP(*m_slice, qp);
+
+    int quantQP = x265_clip3(QP_MIN, QP_MAX_SPEC, qp);
+    m_quant.setQPforQuant(ctu, quantQP);
+    return quantQP;
 }
 
 #if CHECKED_BUILD || _DEBUG
@@ -1185,7 +1190,7 @@
         intraMode.psyEnergy = m_rdCost.psyCost(cuGeom.log2CUSize - 2, fencYuv->m_buf[0], fencYuv->m_size, intraMode.reconYuv.m_buf[0], intraMode.reconYuv.m_size);
     }
     updateModeCost(intraMode);
-    checkDQP(cu, cuGeom);
+    checkDQP(intraMode, cuGeom);
 }
 
 /* Note that this function does not save the best intra prediction, it must
@@ -1231,16 +1236,11 @@
 
         pixel nScale[129];
         intraNeighbourBuf[1][0] = intraNeighbourBuf[0][0];
-        primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1, 0);
+        primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1);
 
         // we do not estimate filtering for downscaled samples
-        for (int x = 1; x < 65; x++)
-        {
-            intraNeighbourBuf[0][x] = nScale[x];           // Top pixel
-            intraNeighbourBuf[0][x + 64] = nScale[x + 64]; // Left pixel
-            intraNeighbourBuf[1][x] = nScale[x];           // Top pixel
-            intraNeighbourBuf[1][x + 64] = nScale[x + 64]; // Left pixel
-        }
+        memcpy(&intraNeighbourBuf[0][1], &nScale[1], 2 * 64 * sizeof(pixel));   // Top & Left pixels
+        memcpy(&intraNeighbourBuf[1][1], &nScale[1], 2 * 64 * sizeof(pixel));
 
         scaleTuSize = 32;
         scaleStride = 32;
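
Note on the hunk above: the removed loop wrote indices 1..64 (top) and 65..128 (left) one element at a time. Those two ranges are contiguous, so one copy of 2 * 64 elements starting at index 1 covers the same span. A small check of that equivalence is sketched below, with pixel assumed to be 8-bit and hypothetical function names.

```cpp
#include <cassert>
#include <stdint.h>
#include <string.h>

typedef uint8_t pixel;   // assumed 8-bit build; illustrative only

// Element-wise form, as in the removed loop.
inline void copyNeighboursLoop(pixel* dst, const pixel* nScale)
{
    for (int x = 1; x < 65; x++)
    {
        dst[x] = nScale[x];            // top pixel
        dst[x + 64] = nScale[x + 64];  // left pixel
    }
}

// Block form, as in the added lines: indices 1..128 in one call.
inline void copyNeighboursMemcpy(pixel* dst, const pixel* nScale)
{
    assert(dst != nScale);             // memcpy requires non-overlapping buffers
    memcpy(&dst[1], &nScale[1], 2 * 64 * sizeof(pixel));
}
```

Both forms leave index 0 untouched, which matters because the surrounding code sets that corner element separately.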
@@ -1369,8 +1369,6 @@
     X265_CHECK(cu.m_partSize[0] == SIZE_2Nx2N, "encodeIntraInInter does not expect NxN intra\n");
     X265_CHECK(!m_slice->isIntra(), "encodeIntraInInter does not expect to be used in I slices\n");
 
-    m_quant.setQPforQuant(cu);
-
     uint32_t tuDepthRange[2];
     cu.getIntraTUQtDepthRange(tuDepthRange, 0);
 
@@ -1405,7 +1403,7 @@
 
     m_entropyCoder.store(intraMode.contexts);
     updateModeCost(intraMode);
-    checkDQP(intraMode.cu, cuGeom);
+    checkDQP(intraMode, cuGeom);
 }
 
 uint32_t Search::estIntraPredQT(Mode &intraMode, const CUGeom& cuGeom, const uint32_t depthRange[2], uint8_t* sharedModes)
@@ -1465,16 +1463,10 @@
 
                     pixel nScale[129];
                     intraNeighbourBuf[1][0] = intraNeighbourBuf[0][0];
-                    primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1, 0);
+                    primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1);
 
-                    // TO DO: primitive
-                    for (int x = 1; x < 65; x++)
-                    {
-                        intraNeighbourBuf[0][x] = nScale[x];           // Top pixel
-                        intraNeighbourBuf[0][x + 64] = nScale[x + 64]; // Left pixel
-                        intraNeighbourBuf[1][x] = nScale[x];           // Top pixel
-                        intraNeighbourBuf[1][x + 64] = nScale[x + 64]; // Left pixel
-                    }
+                    memcpy(&intraNeighbourBuf[0][1], &nScale[1], 2 * 64 * sizeof(pixel));
+                    memcpy(&intraNeighbourBuf[1][1], &nScale[1], 2 * 64 * sizeof(pixel));
 
                     scaleTuSize = 32;
                     scaleStride = 32;
@@ -1869,6 +1861,34 @@
     return outCost;
 }
 
+/* Pick between the two AMVP candidates which is the best one to use as
+ * MVP for the motion search, based on SAD cost */
+int Search::selectMVP(const CUData& cu, const PredictionUnit& pu, const MV amvp[AMVP_NUM_CANDS], int list, int ref)
+{
+    if (amvp[0] == amvp[1])
+        return 0;
+
+    Yuv& tmpPredYuv = m_rqt[cu.m_cuDepth[0]].tmpPredYuv;
+    uint32_t costs[AMVP_NUM_CANDS];
+
+    for (int i = 0; i < AMVP_NUM_CANDS; i++)
+    {
+        MV mvCand = amvp[i];
+
+        // NOTE: skip mvCand if Y is > merange and -FN>1
+        if (m_bFrameParallel && (mvCand.y >= (m_param->searchRange + 1) * 4))
+            costs[i] = m_me.COST_MAX;
+        else
+        {
+            cu.clipMv(mvCand);
+            predInterLumaPixel(pu, tmpPredYuv, *m_slice->m_refPicList[list][ref]->m_reconPic, mvCand);
+            costs[i] = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size);
+        }
+    }
+
+    return costs[0] <= costs[1] ? 0 : 1;
+}
+
 void Search::PME::processTasks(int workerThreadId)
 {
 #if DETAILED_CU_STATS
@@ -1899,10 +1919,10 @@
     /* Setup slave Search instance for ME for master's CU */
     if (&slave != this)
     {
-        slave.setQP(*m_slice, m_rdCost.m_qp);
         slave.m_slice = m_slice;
         slave.m_frame = m_frame;
-
+        slave.m_param = m_param;
+        slave.setLambdaFromQP(pme.mode.cu, m_rdCost.m_qp);
         slave.m_me.setSourcePU(*pme.mode.fencYuv, pme.pu.ctuAddr, pme.pu.cuAbsPartIdx, pme.pu.puAbsPartIdx, pme.pu.width, pme.pu.height);
     }
 
@@ -1910,9 +1930,9 @@
     do
     {
         if (meId < m_slice->m_numRefIdx[0])
-            slave.singleMotionEstimation(*this, pme.mode, pme.cuGeom, pme.pu, pme.puIdx, 0, meId);
+            slave.singleMotionEstimation(*this, pme.mode, pme.pu, pme.puIdx, 0, meId);
         else
-            slave.singleMotionEstimation(*this, pme.mode, pme.cuGeom, pme.pu, pme.puIdx, 1, meId - m_slice->m_numRefIdx[0]);
+            slave.singleMotionEstimation(*this, pme.mode, pme.pu, pme.puIdx, 1, meId - m_slice->m_numRefIdx[0]);
 
         meId = -1;
         pme.m_lock.acquire();
@@ -1923,55 +1943,30 @@
     while (meId >= 0);
 }
 
-void Search::singleMotionEstimation(Search& master, Mode& interMode, const CUGeom& cuGeom, const PredictionUnit& pu,
-                                    int part, int list, int ref)
+void Search::singleMotionEstimation(Search& master, Mode& interMode, const PredictionUnit& pu, int part, int list, int ref)
 {
     uint32_t bits = master.m_listSelBits[list] + MVP_IDX_BITS;
     bits += getTUBits(ref, m_slice->m_numRefIdx[list]);
 
-    MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 1];
-    int numMvc = interMode.cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc);
-
-    int mvpIdx = 0;
-    int merange = m_param->searchRange;
     MotionData* bestME = interMode.bestME[part];
 
-    if (interMode.amvpCand[list][ref][0] != interMode.amvpCand[list][ref][1])
-    {
-        uint32_t bestCost = MAX_INT;
-        for (int i = 0; i < AMVP_NUM_CANDS; i++)
-        {
-            MV mvCand = interMode.amvpCand[list][ref][i];
-
-            // NOTE: skip mvCand if Y is > merange and -FN>1
-            if (m_bFrameParallel && (mvCand.y >= (merange + 1) * 4))
-                continue;
-
-            interMode.cu.clipMv(mvCand);
-
-            Yuv& tmpPredYuv = m_rqt[cuGeom.depth].tmpPredYuv;
-            predInterLumaPixel(pu, tmpPredYuv, *m_slice->m_refPicList[list][ref]->m_reconPic, mvCand);
-            uint32_t cost = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size);
+    MV  mvc[(MD_ABOVE_LEFT + 1) * 2 + 1];
+    int numMvc = interMode.cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc);
 
-            if (bestCost > cost)
-            {
-                bestCost = cost;
-                mvpIdx = i;
-            }
-        }
-    }
+    const MV* amvp = interMode.amvpCand[list][ref];
+    int mvpIdx = selectMVP(interMode.cu, pu, amvp, list, ref);
+    MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx];
 
-    MV mvmin, mvmax, outmv, mvp = interMode.amvpCand[list][ref][mvpIdx];
-    setSearchRange(interMode.cu, mvp, merange, mvmin, mvmax);
+    setSearchRange(interMode.cu, mvp, m_param->searchRange, mvmin, mvmax);
 
-    int satdCost = m_me.motionEstimate(&m_slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, merange, outmv);
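
Note on selectMVP(): the AMVP candidate choice is factored out of singleMotionEstimation(). Identical candidates return index 0 immediately, a candidate rejected under frame parallelism is assigned the maximum cost where the removed loop used continue, and a tie goes to index 0. A simplified sketch of that selection logic follows. The SAD computation is replaced by a caller-supplied cost function so the example is self-contained, and MV, kCostMax and selectCandidate are illustrative names, not x265 API.

```cpp
#include <cassert>
#include <functional>
#include <limits>
#include <stdint.h>

struct MV   // illustrative motion vector, quarter-pel units
{
    int x, y;
    bool operator==(const MV& o) const { return x == o.x && y == o.y; }
};

const int kNumCands = 2;
const uint32_t kCostMax = std::numeric_limits<uint32_t>::max();

// Returns 0 or 1: the index of the cheaper candidate.
// maxY models the (searchRange + 1) * 4 limit applied when frame parallelism is on.
inline int selectCandidate(const MV cands[kNumCands], bool frameParallel, int maxY,
                           const std::function<uint32_t(const MV&)>& cost)
{
    if (cands[0] == cands[1])
        return 0;                   // nothing to choose between

    uint32_t costs[kNumCands];
    for (int i = 0; i < kNumCands; i++)
    {
        if (frameParallel && cands[i].y >= maxY)
            costs[i] = kCostMax;    // rejected: cannot beat a candidate with a real cost
        else
            costs[i] = cost(cands[i]);
    }
    return costs[0] <= costs[1] ? 0 : 1;
}
```

Filling a fixed-size cost array keeps the final comparison a single expression and makes the tie-break rule explicit.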

 

x265_1.6.tar.gz/source/encoder/search.h -> x265_1.7.tar.gz/source/encoder/search.h Changed

@@ -287,7 +287,7 @@
     ~Search();
 
     bool     initSearch(const x265_param& param, ScalingList& scalingList);
-    void     setQP(const Slice& slice, int qp);
+    int      setLambdaFromQP(const CUData& ctu, int qp); /* returns real quant QP in valid spec range */
 
     // mark temp RD entropy contexts as uninitialized; useful for finding loads without stores
     void     invalidateContexts(int fromDepth);
@@ -301,7 +301,7 @@
     void     encodeIntraInInter(Mode& intraMode, const CUGeom& cuGeom);
 
     // estimation inter prediction (non-skip)
-    void     predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bMergeOnly, bool bChroma);
+    void     predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC);
 
     // encode residual and compute rd-cost for inter mode
     void     encodeResAndCalcRdInterCU(Mode& interMode, const CUGeom& cuGeom);
@@ -316,8 +316,8 @@
     void     getBestIntraModeChroma(Mode& intraMode, const CUGeom& cuGeom);
 
     /* update CBF flags and QP values to be internally consistent */
-    void checkDQP(CUData& cu, const CUGeom& cuGeom);
-    void checkDQPForSplitPred(CUData& cu, const CUGeom& cuGeom);
+    void checkDQP(Mode& mode, const CUGeom& cuGeom);
+    void checkDQPForSplitPred(Mode& mode, const CUGeom& cuGeom);
 
     class PME : public BondedTaskGroup
     {
@@ -339,7 +339,7 @@
     };
 
     void     processPME(PME& pme, Search& slave);
-    void     singleMotionEstimation(Search& master, Mode& interMode, const CUGeom& cuGeom, const PredictionUnit& pu, int part, int list, int ref);
+    void     singleMotionEstimation(Search& master, Mode& interMode, const PredictionUnit& pu, int part, int list, int ref);
 
 protected:
 
@@ -396,8 +396,9 @@
     };
 
     /* inter/ME helper functions */
-    void     checkBestMVP(MV* amvpCand, MV cMv, MV& mvPred, int& mvpIdx, uint32_t& outBits, uint32_t& outCost) const;
-    void     setSearchRange(const CUData& cu, MV mvp, int merange, MV& mvmin, MV& mvmax) const;
+    int       selectMVP(const CUData& cu, const PredictionUnit& pu, const MV amvp[AMVP_NUM_CANDS], int list, int ref);
+    const MV& checkBestMVP(const MV amvpCand[2], const MV& mv, int& mvpIdx, uint32_t& outBits, uint32_t& outCost) const;
+    void     setSearchRange(const CUData& cu, const MV& mvp, int merange, MV& mvmin, MV& mvmax) const;
     uint32_t mergeEstimation(CUData& cu, const CUGeom& cuGeom, const PredictionUnit& pu, int puIdx, MergeData& m);
     static void getBlkBits(PartSize cuMode, bool bPSlice, int puIdx, uint32_t lastMode, uint32_t blockBit[3]);
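
Note on setLambdaFromQP(): unlike the old setQP(), it returns a value. Per its comment this is the QP used for quantization, clipped to the spec range, while lambda is configured from the unclipped QP, which may run up to QP_MAX_MAX. A minimal sketch of that split follows. The numeric bounds are assumed values for illustration, and QpState is a hypothetical holder, not an x265 type.

```cpp
#include <algorithm>
#include <cassert>

// Assumed values for illustration; the real constants live in the x265 headers.
const int kQpMin = 0;
const int kQpMaxSpec = 51;
const int kQpMaxMax = 69;

struct QpState   // hypothetical stand-in for the encoder's per-CU state
{
    int lambdaQp;   // drives lambda / RD cost, may exceed kQpMaxSpec
    int quantQp;    // drives quantization, always within the spec range

    QpState() : lambdaQp(0), quantQp(0) {}

    // Mirrors the shape of setLambdaFromQP(): store the lambda QP as given,
    // clip a copy for quantization, and return the clipped value.
    int setLambdaFromQp(int qp)
    {
        assert(qp >= kQpMin && qp <= kQpMaxMax);
        lambdaQp = qp;
        quantQp = std::min(std::max(qp, kQpMin), kQpMaxSpec);
        return quantQp;
    }
};
```

Returning the clipped value lets a caller record the quantization QP without repeating the clip.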

 

x265_1.6.tar.gz/source/encoder/sei.h -> x265_1.7.tar.gz/source/encoder/sei.h Changed

@@ -71,6 +71,8 @@
         DECODED_PICTURE_HASH                 = 132,
         SCALABLE_NESTING                     = 133,
         REGION_REFRESH_INFO                  = 134,
+        MASTERING_DISPLAY_INFO               = 137,
+        CONTENT_LIGHT_LEVEL_INFO             = 144,
     };
 
     virtual PayloadType payloadType() const = 0;
@@ -111,6 +113,73 @@
     }
 };
 
+class SEIMasteringDisplayColorVolume : public SEI
+{
+public:
+
+    uint16_t displayPrimaryX[3];
+    uint16_t displayPrimaryY[3];
+    uint16_t whitePointX, whitePointY;
+    uint32_t maxDisplayMasteringLuminance;
+    uint32_t minDisplayMasteringLuminance;
+
+    PayloadType payloadType() const { return MASTERING_DISPLAY_INFO; }
+
+    bool parse(const char* value)
+    {
+        return sscanf(value, "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)",
+                      &displayPrimaryX[0], &displayPrimaryY[0],
+                      &displayPrimaryX[1], &displayPrimaryY[1],
+                      &displayPrimaryX[2], &displayPrimaryY[2],
+                      &whitePointX, &whitePointY,
+                      &maxDisplayMasteringLuminance, &minDisplayMasteringLuminance) == 10;
+    }
+
+    void write(Bitstream& bs, const SPS&)
+    {
+        m_bitIf = &bs;
+
+        WRITE_CODE(MASTERING_DISPLAY_INFO, 8, "payload_type");
+        WRITE_CODE(8 * 2 + 2 * 4, 8, "payload_size");
+
+        for (uint32_t i = 0; i < 3; i++)
+        {
+            WRITE_CODE(displayPrimaryX[i], 16, "display_primaries_x[ c ]");
+            WRITE_CODE(displayPrimaryY[i], 16, "display_primaries_y[ c ]");
+        }
+        WRITE_CODE(whitePointX, 16, "white_point_x");
+        WRITE_CODE(whitePointY, 16, "white_point_y");
+        WRITE_CODE(maxDisplayMasteringLuminance, 32, "max_display_mastering_luminance");
+        WRITE_CODE(minDisplayMasteringLuminance, 32, "min_display_mastering_luminance");
+    }
+};
+
+class SEIContentLightLevel : public SEI
+{
+public:
+
+    uint16_t max_content_light_level;
+    uint16_t max_pic_average_light_level;
+
+    PayloadType payloadType() const { return CONTENT_LIGHT_LEVEL_INFO; }
+
+    bool parse(const char* value)
+    {
+        return sscanf(value, "%hu,%hu",
+                      &max_content_light_level, &max_pic_average_light_level) == 2;
+    }
+
+    void write(Bitstream& bs, const SPS&)
+    {
+        m_bitIf = &bs;
+
+        WRITE_CODE(CONTENT_LIGHT_LEVEL_INFO, 8, "payload_type");
+        WRITE_CODE(4, 8, "payload_size");
+        WRITE_CODE(max_content_light_level,     16, "max_content_light_level");
+        WRITE_CODE(max_pic_average_light_level, 16, "max_pic_average_light_level");
+    }
+};
+
 class SEIDecodedPictureHash : public SEI
 {
 public:
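
Reviewer note: the two `parse()` methods added above rely on `sscanf` returning the number of converted fields, and the mastering display payload size `8 * 2 + 2 * 4` is eight 16-bit values plus two 32-bit values, i.e. 24 bytes. A minimal standalone sketch of that parsing logic follows. The struct `MasteringDisplay`, the function `parseMasterDisplay` and the constant `kMasteringDisplayPayloadBytes` are hypothetical names used only for illustration and are not part of x265. The fields are declared as `unsigned short` and `unsigned int` so that they match `%hu` and `%u` exactly.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for the data members of SEIMasteringDisplayColorVolume.
struct MasteringDisplay
{
    unsigned short primaryX[3];   // order in the string: G, B, R
    unsigned short primaryY[3];
    unsigned short whiteX, whiteY;
    unsigned int   maxLuma, minLuma;
};

// Eight 16-bit values (three primaries and the white point, x and y each)
// plus two 32-bit luminance values.
const int kMasteringDisplayPayloadBytes = 8 * 2 + 2 * 4;

// Returns true only if all ten fields were converted.
bool parseMasterDisplay(const char* value, MasteringDisplay& md)
{
    assert(value != nullptr);
    return std::sscanf(value, "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)",
                       &md.primaryX[0], &md.primaryY[0],
                       &md.primaryX[1], &md.primaryY[1],
                       &md.primaryX[2], &md.primaryY[2],
                       &md.whiteX, &md.whiteY,
                       &md.maxLuma, &md.minLuma) == 10;
}
```

A string such as `G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)` fills all ten fields. A string that stops early, for example after the `B(...)` group, converts fewer fields and is rejected.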

 
​

x265_1.6.tar.gz/source/encoder/slicetype.cpp -> x265_1.7.tar.gz/source/encoder/slicetype.cpp Changed

@@ -44,23 +44,6 @@
 
 namespace {
 
-inline int16_t median(int16_t a, int16_t b, int16_t c)
-{
-    int16_t t = (a - b) & ((a - b) >> 31);
-
-    a -= t;
-    b += t;
-    b -= (b - c) & ((b - c) >> 31);
-    b += (a - b) & ((a - b) >> 31);
-    return b;
-}
-
-inline void median_mv(MV &dst, MV a, MV b, MV c)
-{
-    dst.x = median(a.x, b.x, c.x);
-    dst.y = median(a.y, b.y, c.y);
-}
-
 /* Compute variance to derive AC energy of each block */
 inline uint32_t acEnergyVar(Frame *curFrame, uint64_t sum_ssd, int shift, int plane)
 {
@@ -492,8 +475,6 @@
     m_8x8Blocks = m_8x8Width > 2 && m_8x8Height > 2 ? (m_8x8Width - 2) * (m_8x8Height - 2) : m_8x8Width * m_8x8Height;
 
     m_lastKeyframe = -m_param->keyframeMax;
-    memset(m_preframes, 0, sizeof(m_preframes));
-    m_preTotal = m_preAcquired = m_preCompleted = 0;
     m_sliceTypeBusy = false;
     m_fullQueueSize = X265_MAX(1, m_param->lookaheadDepth);
     m_bAdaptiveQuant = m_param->rc.aqMode || m_param->bEnableWeightedPred || m_param->bEnableWeightedBiPred;
@@ -568,14 +549,14 @@
     return m_tld && m_scratch;
 }
 
-void Lookahead::stop()
+void Lookahead::stopJobs()
 {
     if (m_pool && !m_inputQueue.empty())
     {
-        m_preLookaheadLock.acquire();
+        m_inputLock.acquire();
         m_isActive = false;
         bool wait = m_outputSignalRequired = m_sliceTypeBusy;
-        m_preLookaheadLock.release();
+        m_inputLock.release();
 
         if (wait)
             m_outputSignal.wait();
@@ -634,19 +615,11 @@
             m_filled = true; /* full capacity plus mini-gop lag */
     }
 
-    m_preLookaheadLock.acquire();
-
     m_inputLock.acquire();
     m_inputQueue.pushBack(curFrame);
-    m_inputLock.release();
-
-    m_preframes[m_preTotal++] = &curFrame;
-    X265_CHECK(m_preTotal <= X265_LOOKAHEAD_MAX, "prelookahead overflow\n");
-    
-    m_preLookaheadLock.release();
-
-    if (m_pool)
+    if (m_pool && m_inputQueue.size() >= m_fullQueueSize)
         tryWakeOne();
+    m_inputLock.release();
 }
 
 /* Called by API thread */
@@ -657,74 +630,33 @@
     m_filled = true;
 }
 
-void Lookahead::findJob(int workerThreadID)
+void Lookahead::findJob(int /*workerThreadID*/)
 {
-    Frame* preFrame;
-    bool   doDecide;
-
-    if (!m_isActive)
-        return;
-
-    int tld = workerThreadID;
-    if (workerThreadID < 0)
-        tld = m_pool ? m_pool->m_numWorkers : 0;
+    bool doDecide;
 
-    m_preLookaheadLock.acquire();
-    do
-    {
-        preFrame = NULL;
-        doDecide = false;
+    m_inputLock.acquire();
+    if (m_inputQueue.size() >= m_fullQueueSize && !m_sliceTypeBusy && m_isActive)
+        doDecide = m_sliceTypeBusy = true;
+    else
+        doDecide = m_helpWanted = false;
+    m_inputLock.release();
 
-        if (m_preTotal > m_preAcquired)
-            preFrame = m_preframes[m_preAcquired++];
-        else
-        {
-            if (m_preTotal == m_preCompleted)
-                m_preAcquired = m_preTotal = m_preCompleted = 0;
-
-            /* the worker thread that performs the last pre-lookahead will generally get to run
-             * slicetypeDecide() */
-            m_inputLock.acquire();
-            if (!m_sliceTypeBusy && !m_preTotal && m_inputQueue.size() >= m_fullQueueSize && m_isActive)
-                doDecide = m_sliceTypeBusy = true;
-            else
-                m_helpWanted = false;
-            m_inputLock.release();
-        }
-        m_preLookaheadLock.release();
+    if (!doDecide)
+        return;
 
-        if (preFrame)
-        {
-            ProfileLookaheadTime(m_preLookaheadElapsedTime, m_countPreLookahead);
-            ProfileScopeEvent(prelookahead);
-
-            preFrame->m_lowres.init(preFrame->m_fencPic, preFrame->m_poc);
-            if (m_param->rc.bStatRead && m_param->rc.cuTree && IS_REFERENCED(preFrame))
-                /* cu-tree offsets were read from stats file */;
-            else if (m_bAdaptiveQuant)
-                m_tld[tld].calcAdaptiveQuantFrame(preFrame, m_param);
-            m_tld[tld].lowresIntraEstimate(preFrame->m_lowres);
-
-            m_preLookaheadLock.acquire(); /* re-acquire for next pass */
-            m_preCompleted++;
-        }
-        else if (doDecide)
-        {
-            ProfileLookaheadTime(m_slicetypeDecideElapsedTime, m_countSlicetypeDecide);
-            ProfileScopeEvent(slicetypeDecideEV);
+    ProfileLookaheadTime(m_slicetypeDecideElapsedTime, m_countSlicetypeDecide);
+    ProfileScopeEvent(slicetypeDecideEV);
 
-            slicetypeDecide();
+    slicetypeDecide();
 
-            m_preLookaheadLock.acquire(); /* re-acquire for next pass */
-            if (m_outputSignalRequired)
-            {
-                m_outputSignal.trigger();
-                m_outputSignalRequired = false;
-            }
-            m_sliceTypeBusy = false;
-        }
+    m_inputLock.acquire();
+    if (m_outputSignalRequired)
+    {
+        m_outputSignal.trigger();
+        m_outputSignalRequired = false;
     }
-    while (preFrame || doDecide);
+    m_sliceTypeBusy = false;
+    m_inputLock.release();
 }
 
 /* Called by API thread */
@@ -739,13 +671,11 @@
         if (out)
             return out;
 
-        /* process all pending pre-lookahead frames and run slicetypeDecide() if
-         * necessary */
-        findJob(-1);
+        findJob(-1); /* run slicetypeDecide() if necessary */
 
-        m_preLookaheadLock.acquire();
-        bool wait = m_outputSignalRequired = m_sliceTypeBusy || m_preTotal;
-        m_preLookaheadLock.release();
+        m_inputLock.acquire();
+        bool wait = m_outputSignalRequired = m_sliceTypeBusy;
+        m_inputLock.release();
 
         if (wait)
             m_outputSignal.wait();
@@ -809,7 +739,7 @@
     {
         /* aggregate lowres row satds to CTU resolution */
         curFrame->m_lowres.lowresCostForRc = curFrame->m_lowres.lowresCosts[b - p0][p1 - b];
-        uint32_t lowresRow = 0, lowresCol = 0, lowresCuIdx = 0, sum = 0;
+        uint32_t lowresRow = 0, lowresCol = 0, lowresCuIdx = 0, sum = 0, intraSum = 0;
         uint32_t scale = m_param->maxCUSize / (2 * X265_LOWRES_CU_SIZE);
         uint32_t numCuInHeight = (m_param->sourceHeight + g_maxCUSize - 1) / g_maxCUSize;
         uint32_t widthInLowresCu = (uint32_t)m_8x8Width, heightInLowresCu = (uint32_t)m_8x8Height;
@@ -823,7 +753,7 @@
             lowresRow = row * scale;
             for (uint32_t cnt = 0; cnt < scale && lowresRow < heightInLowresCu; lowresRow++, cnt++)
             {
-                sum = 0;
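
Reviewer note: the reworked `Lookahead::findJob()` above follows a claim-under-lock pattern: take the lock, decide whether this thread may run the job and set a busy flag, release the lock, do the work, then re-take the lock to clear the flag. The sketch below shows that pattern in isolation using `std::mutex`. The class `DecideGate` and its members are hypothetical and simplified (there is no output signal and the queue is never drained). It is not the x265 implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical, simplified model of the busy-flag gating in findJob().
class DecideGate
{
public:
    explicit DecideGate(std::size_t fullQueueSize)
        : m_queued(0), m_fullQueueSize(fullQueueSize), m_busy(false), m_active(true) {}

    // Models a frame being pushed onto the input queue.
    void push()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        ++m_queued;
    }

    // Models stopJobs(): no further work may be claimed.
    void stop()
    {
        std::lock_guard<std::mutex> guard(m_lock);
        m_active = false;
    }

    // Returns true if the caller claimed the job and ran it.
    template <typename Work>
    bool tryRun(Work work)
    {
        {
            std::lock_guard<std::mutex> guard(m_lock);
            if (m_queued < m_fullQueueSize || m_busy || !m_active)
                return false;
            m_busy = true;          // claim the job while holding the lock
        }

        work();                     // the expensive part runs without the lock

        std::lock_guard<std::mutex> guard(m_lock);
        m_busy = false;             // release the claim
        return true;
    }

private:
    std::mutex  m_lock;
    std::size_t m_queued;
    std::size_t m_fullQueueSize;
    bool        m_busy;
    bool        m_active;
};
```

Because the flag is tested and set under the same lock, at most one caller can be inside `work()` at a time, and a second caller returns immediately instead of blocking.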

 
​

x265_1.6.tar.gz/source/encoder/slicetype.h -> x265_1.7.tar.gz/source/encoder/slicetype.h Changed

@@ -105,8 +105,6 @@
     Lock          m_outputLock;
 
     /* pre-lookahead */
-    Frame*        m_preframes[X265_LOOKAHEAD_MAX];
-    int           m_preTotal, m_preAcquired, m_preCompleted;
     int           m_fullQueueSize;
     bool          m_isActive;
     bool          m_sliceTypeBusy;
@@ -114,7 +112,6 @@
     bool          m_outputSignalRequired;
     bool          m_bBatchMotionSearch;
     bool          m_bBatchFrameCosts;
-    Lock          m_preLookaheadLock;
     Event         m_outputSignal;
 
     LookaheadTLD* m_tld;
@@ -143,7 +140,7 @@
 
     bool    create();
     void    destroy();
-    void    stop();
+    void    stopJobs();
 
     void    addPicture(Frame&, int sliceType);
     void    flush();
@@ -176,6 +173,22 @@
     int64_t frameCostRecalculate(Lowres **frames, int p0, int p1, int b);
 };
 
+class PreLookaheadGroup : public BondedTaskGroup
+{
+public:
+
+    Frame* m_preframes[X265_LOOKAHEAD_MAX];
+    Lookahead& m_lookahead;
+
+    PreLookaheadGroup(Lookahead& l) : m_lookahead(l) {}
+
+    void processTasks(int workerThreadID);
+
+protected:
+
+    PreLookaheadGroup& operator=(const PreLookaheadGroup&);
+};
+
 class CostEstimateGroup : public BondedTaskGroup
 {
 public:

 
​

x265_1.6.tar.gz/source/input/input.cpp -> x265_1.7.tar.gz/source/input/input.cpp Changed

 
@@ -27,7 +27,7 @@
 
 using namespace x265;
 
-Input* Input::open(InputFileInfo& info, bool bForceY4m)
+InputFile* InputFile::open(InputFileInfo& info, bool bForceY4m)
 {
     const char * s = strrchr(info.filename, '.');
 
​

x265_1.6.tar.gz/source/input/input.h -> x265_1.7.tar.gz/source/input/input.h Changed

 
@@ -48,23 +48,25 @@
     int sarWidth;
     int sarHeight;
     int frameCount;
+    int timebaseNum;
+    int timebaseDenom;
 
     /* user supplied */
     int skipFrames;
     const char *filename;
 };
 
-class Input
+class InputFile
 {
 protected:
 
-    virtual ~Input()  {}
+    virtual ~InputFile()  {}
 
 public:
 
-    Input()           {}
+    InputFile()           {}
 
-    static Input* open(InputFileInfo& info, bool bForceY4m);
+    static InputFile* open(InputFileInfo& info, bool bForceY4m);
 
     virtual void startReader() = 0;
 
​

x265_1.6.tar.gz/source/input/y4m.cpp -> x265_1.7.tar.gz/source/input/y4m.cpp Changed

 
@@ -46,9 +46,6 @@
     for (int i = 0; i < QUEUE_SIZE; i++)
         buf[i] = NULL;
 
-    readCount.set(0);
-    writeCount.set(0);
-
     threadActive = false;
     colorSpace = info.csp;
     sarWidth = info.sarWidth;
@@ -164,7 +161,7 @@
 void Y4MInput::release()
 {
     threadActive = false;
-    readCount.set(readCount.get()); // unblock file reader
+    readCount.poke();
     stop();
     delete this;
 }
@@ -352,7 +349,7 @@
     while (threadActive);
 
     threadActive = false;
-    writeCount.set(writeCount.get()); // unblock readPicture
+    writeCount.poke();
 }
 
 bool Y4MInput::populateFrameQueue()
​

x265_1.6.tar.gz/source/input/y4m.h -> x265_1.7.tar.gz/source/input/y4m.h Changed

 
@@ -33,7 +33,7 @@
 namespace x265 {
 // x265 private namespace
 
-class Y4MInput : public Input, public Thread
+class Y4MInput : public InputFile, public Thread
 {
 protected:
 
​

x265_1.6.tar.gz/source/input/yuv.cpp -> x265_1.7.tar.gz/source/input/yuv.cpp Changed

 
@@ -44,8 +44,6 @@
     for (int i = 0; i < QUEUE_SIZE; i++)
         buf[i] = NULL;
 
-    readCount.set(0);
-    writeCount.set(0);
     depth = info.depth;
     width = info.width;
     height = info.height;
@@ -152,7 +150,7 @@
 void YUVInput::release()
 {
     threadActive = false;
-    readCount.set(readCount.get()); // unblock read thread
+    readCount.poke();
     stop();
     delete this;
 }
@@ -175,7 +173,7 @@
     }
 
     threadActive = false;
-    writeCount.set(writeCount.get()); // unblock readPicture
+    writeCount.poke();
 }
 
 bool YUVInput::populateFrameQueue()
​

x265_1.6.tar.gz/source/input/yuv.h -> x265_1.7.tar.gz/source/input/yuv.h Changed

 
@@ -33,7 +33,7 @@
 namespace x265 {
 // private x265 namespace
 
-class YUVInput : public Input, public Thread
+class YUVInput : public InputFile, public Thread
 {
 protected:
 
​

x265_1.6.tar.gz/source/output/output.cpp -> x265_1.7.tar.gz/source/output/output.cpp Changed

 
@@ -1,7 +1,8 @@
 /*****************************************************************************
- * Copyright (C) 2013 x265 project
+ * Copyright (C) 2013-2015 x265 project
  *
  * Authors: Steve Borho <steve@borho.org>
+ *          Xinyue Lu <i@7086.in>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -25,9 +26,11 @@
 #include "yuv.h"
 #include "y4m.h"
 
+#include "raw.h"
+
 using namespace x265;
 
-Output* Output::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp)
+ReconFile* ReconFile::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp)
 {
     const char * s = strrchr(fname, '.');
 
@@ -36,3 +39,8 @@
     else
         return new YUVOutput(fname, width, height, bitdepth, csp);
 }
+
+OutputFile* OutputFile::open(const char *fname, InputFileInfo& inputInfo)
+{
+    return new RAWOutput(fname, inputInfo);
+}
​

x265_1.6.tar.gz/source/output/output.h -> x265_1.7.tar.gz/source/output/output.h Changed

@@ -1,7 +1,8 @@
 /*****************************************************************************
- * Copyright (C) 2013 x265 project
+ * Copyright (C) 2013-2015 x265 project
  *
  * Authors: Steve Borho <steve@borho.org>
+ *          Xinyue Lu <i@7086.in>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -25,22 +26,23 @@
 #define X265_OUTPUT_H
 
 #include "x265.h"
+#include "input/input.h"
 
 namespace x265 {
 // private x265 namespace
 
-class Output
+class ReconFile
 {
 protected:
 
-    virtual ~Output()  {}
+    virtual ~ReconFile()  {}
 
 public:
 
-    Output()           {}
+    ReconFile()           {}
 
-    static Output* open(const char *fname, int width, int height, uint32_t bitdepth,
-                        uint32_t fpsNum, uint32_t fpsDenom, int csp);
+    static ReconFile* open(const char *fname, int width, int height, uint32_t bitdepth,
+                           uint32_t fpsNum, uint32_t fpsDenom, int csp);
 
     virtual bool isFail() const = 0;
 
@@ -50,6 +52,35 @@
 
     virtual const char *getName() const = 0;
 };
+
+class OutputFile
+{
+protected:
+
+    virtual ~OutputFile() {}
+
+public:
+
+    OutputFile() {}
+
+    static OutputFile* open(const char* fname, InputFileInfo& inputInfo);
+
+    virtual bool isFail() const = 0;
+
+    virtual bool needPTS() const = 0;
+
+    virtual void release() = 0;
+
+    virtual const char* getName() const = 0;
+
+    virtual void setParam(x265_param* param) = 0;
+
+    virtual int writeHeaders(const x265_nal* nal, uint32_t nalcount) = 0;
+
+    virtual int writeFrame(const x265_nal* nal, uint32_t nalcount, x265_picture& pic) = 0;
+
+    virtual void closeFile(int64_t largest_pts, int64_t second_largest_pts) = 0;
+};
 }
 
 #endif // ifndef X265_OUTPUT_H
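
Reviewer note: `writeHeaders()` and `writeFrame()` in the new `OutputFile` interface take an array of NAL units and return an `int`. A raw (Annex B) sink can satisfy this by appending each payload to a stream and returning the total byte count. The sketch below illustrates this under stated assumptions: `Nal` is a hypothetical stand-in that carries only the `payload` and `sizeBytes` fields, and `writeNals` is an illustrative free function rather than anything declared in x265.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>   // std::ostringstream, handy as an in-memory sink when trying this out

// Hypothetical stand-in for the two x265_nal fields the sketch uses.
struct Nal
{
    uint32_t       sizeBytes;
    const uint8_t* payload;
};

// Writes every NAL payload to the stream in order and returns the total
// number of bytes written.
int writeNals(std::ostream& os, const Nal* nal, uint32_t nalcount)
{
    assert(nal != nullptr || nalcount == 0);

    uint32_t bytes = 0;
    for (uint32_t i = 0; i < nalcount; i++)
    {
        os.write(reinterpret_cast<const char*>(nal[i].payload), nal[i].sizeBytes);
        bytes += nal[i].sizeBytes;
    }
    return static_cast<int>(bytes);
}
```

Passing a `std::ostream&` rather than opening the file inside the function lets the same loop serve a file, standard output, or an in-memory buffer.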

 
​

x265_1.7.tar.gz/source/output/raw.cpp Added


 

x265_1.7.tar.gz/source/output/raw.h Added

@@ -0,0 +1,64 @@
+/*****************************************************************************
+ * Copyright (C) 2013-2015 x265 project
+ *
+ * Authors: Steve Borho <steve@borho.org>
+ *          Xinyue Lu <i@7086.in>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#ifndef X265_HEVC_RAW_H
+#define X265_HEVC_RAW_H
+
+#include "output.h"
+#include "common.h"
+#include <fstream>
+#include <iostream>
+
+namespace x265 {
+class RAWOutput : public OutputFile
+{
+protected:
+
+    std::ostream* ofs;
+
+    bool b_fail;
+
+public:
+
+    RAWOutput(const char* fname, InputFileInfo&);
+
+    bool isFail() const { return b_fail; }
+
+    bool needPTS() const { return false; }
+
+    void release() { delete this; }
+
+    const char* getName() const { return "raw"; }
+
+    void setParam(x265_param* param);
+
+    int writeHeaders(const x265_nal* nal, uint32_t nalcount);
+
+    int writeFrame(const x265_nal* nal, uint32_t nalcount, x265_picture&);
+
+    void closeFile(int64_t largest_pts, int64_t second_largest_pts);
+};
+}
+
+#endif // ifndef X265_HEVC_RAW_H

 

x265_1.7.tar.gz/source/output/reconplay.cpp Added

@@ -0,0 +1,197 @@
+/*****************************************************************************
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Peixuan Zhang <zhangpeixuancn@gmail.com>
+ *          Chunli Zhang <chunli@multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#include "common.h"
+#include "reconplay.h"
+
+#include <signal.h>
+
+using namespace x265;
+
+#if _WIN32
+#define popen  _popen
+#define pclose _pclose
+#define pipemode "wb"
+#else
+#define pipemode "w"
+#endif
+
+bool ReconPlay::pipeValid;
+
+#ifndef _WIN32
+static void sigpipe_handler(int)
+{
+    if (ReconPlay::pipeValid)
+        general_log(NULL, "exec", X265_LOG_ERROR, "pipe closed\n");
+    ReconPlay::pipeValid = false;
+}
+#endif
+
+ReconPlay::ReconPlay(const char* commandLine, x265_param& param)
+{
+#ifndef _WIN32
+    if (signal(SIGPIPE, sigpipe_handler) == SIG_ERR)
+        general_log(&param, "exec", X265_LOG_ERROR, "Unable to register SIGPIPE handler: %s\n", strerror(errno));
+#endif
+
+    width = param.sourceWidth;
+    height = param.sourceHeight;
+    colorSpace = param.internalCsp;
+
+    frameSize = 0;
+    for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+        frameSize += (uint32_t)((width >> x265_cli_csps[colorSpace].width[i]) * (height >> x265_cli_csps[colorSpace].height[i]));
+
+    for (int i = 0; i < RECON_BUF_SIZE; i++)
+    {
+        poc[i] = -1;
+        CHECKED_MALLOC(frameData[i], pixel, frameSize);
+    }
+
+    outputPipe = popen(commandLine, pipemode);
+    if (outputPipe)
+    {
+        const char* csp = (colorSpace >= X265_CSP_I444) ? "444" : (colorSpace >= X265_CSP_I422) ? "422" : "420";
+        const char* depth = (param.internalBitDepth == 10) ? "p10" : "";
+
+        fprintf(outputPipe, "YUV4MPEG2 W%d H%d F%d:%d Ip C%s%s\n", width, height, param.fpsNum, param.fpsDenom, csp, depth);
+
+        pipeValid = true;
+        threadActive = true;
+        start();
+        return;
+    }
+    else
+        general_log(&param, "exec", X265_LOG_ERROR, "popen(%s) failed\n", commandLine);
+
+fail:
+    threadActive = false;
+}
+
+ReconPlay::~ReconPlay()
+{
+    if (threadActive)
+    {
+        threadActive = false;
+        writeCount.poke();
+        stop();
+    }
+
+    if (outputPipe) 
+        pclose(outputPipe);
+
+    for (int i = 0; i < RECON_BUF_SIZE; i++)
+        X265_FREE(frameData[i]);
+}
+
+bool ReconPlay::writePicture(const x265_picture& pic)
+{
+    if (!threadActive || !pipeValid)
+        return false;
+
+    int written = writeCount.get();
+    int read = readCount.get();
+    int currentCursor = pic.poc % RECON_BUF_SIZE;
+
+    /* TODO: it's probably better to drop recon pictures when the ring buffer is
+     * backed up on the display app */
+    while (written - read > RECON_BUF_SIZE - 2 || poc[currentCursor] != -1)
+    {
+        read = readCount.waitForChange(read);
+        if (!threadActive)
+            return false;
+    }
+
+    X265_CHECK(pic.colorSpace == colorSpace, "invalid color space\n");
+    X265_CHECK(pic.bitDepth == X265_DEPTH,   "invalid bit depth\n");
+
+    pixel* buf = frameData[currentCursor];
+    for (int i = 0; i < x265_cli_csps[colorSpace].planes; i++)
+    {
+        char* src = (char*)pic.planes[i];
+        int pwidth = width >> x265_cli_csps[colorSpace].width[i];
+
+        for (int h = 0; h < height >> x265_cli_csps[colorSpace].height[i]; h++)
+        {
+            memcpy(buf, src, pwidth * sizeof(pixel));
+            src += pic.stride[i];
+            buf += pwidth;
+        }
+    }
+
+    poc[currentCursor] = pic.poc;
+    writeCount.incr();
+
+    return true;
+}
+
+void ReconPlay::threadMain()
+{
+    THREAD_NAME("ReconPlayOutput", 0);
+
+    do
+    {
+        /* extract the next output picture in display order and write to pipe */
+        if (!outputFrame())
+            break;
+    }
+    while (threadActive);
+
+    threadActive = false;
+    readCount.poke();
+}
+
+bool ReconPlay::outputFrame()
+{
+    int written = writeCount.get();
+    int read = readCount.get();
+    int currentCursor = read % RECON_BUF_SIZE;
+
+    while (poc[currentCursor] != read)
+    {
+        written = writeCount.waitForChange(written);
+        if (!threadActive)
+            return false;
+    }
+
+    char* buf = (char*)frameData[currentCursor];
+    intptr_t remainSize = frameSize * sizeof(pixel);
+
+    fprintf(outputPipe, "FRAME\n");
+    while (remainSize > 0)
+    {
+        intptr_t retCount = (intptr_t)fwrite(buf, sizeof(char), remainSize, outputPipe);
+
+        if (retCount < 0 || !pipeValid)
+            /* pipe failure, stop writing and start dropping recon pictures */
+            return false;
+    
+        buf += retCount;
+        remainSize -= retCount;
+    }
+
+    poc[currentCursor] = -1;
+    readCount.incr();
+    return true;
+}
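
Note on the frame size and stream header computed in the ReconPlay constructor above: the sketch below reproduces that arithmetic in isolation. The CspInfo struct and the three tables are assumptions standing in for x265_cli_csps (per-plane log2 subsampling shifts), and the function names are hypothetical.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-in for one x265_cli_csps entry: plane count plus the
// per-plane right shifts applied to the luma width and height.
struct CspInfo
{
    int planes;
    int widthShift[3];
    int heightShift[3];
};

static const CspInfo kI420 = { 3, { 0, 1, 1 }, { 0, 1, 1 } };
static const CspInfo kI422 = { 3, { 0, 1, 1 }, { 0, 0, 0 } };
static const CspInfo kI444 = { 3, { 0, 0, 0 }, { 0, 0, 0 } };

// Sum of all plane sizes in pixels, as in the constructor's frameSize loop.
uint32_t frameSizeInPixels(const CspInfo& csp, int width, int height)
{
    uint32_t size = 0;

    for (int i = 0; i < csp.planes; i++)
        size += (uint32_t)((width >> csp.widthShift[i]) * (height >> csp.heightShift[i]));

    return size;
}

// Build the stream header using the same format string as the diff.
std::string y4mHeader(int width, int height, int fpsNum, int fpsDenom,
                      const char* csp, int bitDepth)
{
    char buf[128];
    const char* depth = (bitDepth == 10) ? "p10" : "";

    std::snprintf(buf, sizeof(buf), "YUV4MPEG2 W%d H%d F%d:%d Ip C%s%s\n",
                  width, height, fpsNum, fpsDenom, csp, depth);
    return std::string(buf);
}
```

For a 1920x1080 4:2:0 picture this gives 1920*1080 + 2*(960*540) = 3110400 pixels per frame. The byte count written per frame is that value times sizeof(pixel), as in outputFrame().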

 

x265_1.7.tar.gz/source/output/reconplay.h Added

@@ -0,0 +1,74 @@
+/*****************************************************************************
+ * Copyright (C) 2013 x265 project
+ *
+ * Authors: Peixuan Zhang <zhangpeixuancn@gmail.com>
+ *          Chunli Zhang <chunli@multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#ifndef X265_RECONPLAY_H
+#define X265_RECONPLAY_H
+
+#include "x265.h"
+#include "threading.h"
+#include <cstdio>
+
+namespace x265 {
+// private x265 namespace
+
+class ReconPlay : public Thread
+{
+public:
+
+    ReconPlay(const char* commandLine, x265_param& param);
+
+    virtual ~ReconPlay();
+
+    bool writePicture(const x265_picture& pic);
+
+    static bool pipeValid;
+
+protected:
+
+    enum { RECON_BUF_SIZE = 40 };
+
+    FILE*  outputPipe;     /* The output pipe for player */
+    size_t frameSize;      /* size of one frame in pixels */
+    bool   threadActive;   /* worker thread is active */
+    int    width;          /* width of frame */
+    int    height;         /* height of frame */
+    int    colorSpace;     /* color space of frame */
+
+    int    poc[RECON_BUF_SIZE];
+    pixel* frameData[RECON_BUF_SIZE];
+
+    /* Note that the class uses read and write counters to signal that reads and
+     * writes have occurred in the ring buffer, but writes into the buffer
+     * happen in decode order and the reader must check that the POC it next
+     * needs to send to the pipe is in fact present.  The counters are used to
+     * prevent the writer from getting too far ahead of the reader */
+    ThreadSafeInteger readCount;
+    ThreadSafeInteger writeCount;
+
+    void threadMain();
+    bool outputFrame();
+};
+}
+
+#endif // ifndef X265_RECONPLAY_H
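
Note on the ring buffer described in the header comment above: pictures arrive in decode order, each is stored in slot poc % RECON_BUF_SIZE, and the reader only emits a slot once it holds the next POC in display order. The sketch below models just that slot logic in a single thread. The PocRing class and its tryWrite()/tryRead() methods are hypothetical, and they return immediately where the real code blocks on ThreadSafeInteger::waitForChange().

```cpp
#include <vector>

// Single-threaded model of the ReconPlay slot bookkeeping (illustrative only).
class PocRing
{
public:
    explicit PocRing(int size) : slots(size, -1), readCount(0), writeCount(0) {}

    // Store a picture arriving in decode order. Returns false where the real
    // writer would wait: the writer is too far ahead or the slot is occupied.
    bool tryWrite(int poc)
    {
        int size = (int)slots.size();
        int cursor = poc % size;

        if (writeCount - readCount > size - 2 || slots[cursor] != -1)
            return false;

        slots[cursor] = poc;
        writeCount++;
        return true;
    }

    // Emit the next picture in display order. Returns its POC, or -1 where the
    // real reader would wait because that POC has not been written yet.
    int tryRead()
    {
        int cursor = readCount % (int)slots.size();

        if (slots[cursor] != readCount)
            return -1;

        slots[cursor] = -1;   // mark the slot free for the writer
        return readCount++;
    }

private:
    std::vector<int> slots;   // POC held by each slot, -1 when empty
    int readCount;            // pictures emitted so far, also the next POC to emit
    int writeCount;           // pictures stored so far
};
```

With a ring of size 4, writing POC 0, then 2, then 1 still reads back as 0, 1, 2, because the read after POC 0 finds slot 1 empty and reports that it would wait.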

 

x265_1.6.tar.gz/source/output/y4m.h -> x265_1.7.tar.gz/source/output/y4m.h Changed

 
@@ -30,7 +30,7 @@
 namespace x265 {
 // private x265 namespace
 
-class Y4MOutput : public Output
+class Y4MOutput : public ReconFile
 {
 protected:
 
​

x265_1.6.tar.gz/source/output/yuv.h -> x265_1.7.tar.gz/source/output/yuv.h Changed

 
@@ -32,7 +32,7 @@
 namespace x265 {
 // private x265 namespace
 
-class YUVOutput : public Output
+class YUVOutput : public ReconFile
 {
 protected:
 
​

x265_1.6.tar.gz/source/test/ipfilterharness.cpp -> x265_1.7.tar.gz/source/test/ipfilterharness.cpp Changed

@@ -61,55 +61,6 @@
     }
 }
 
-bool IPFilterHarness::check_IPFilter_primitive(filter_p2s_wxh_t ref, filter_p2s_wxh_t opt, int isChroma, int csp)
-{
-    intptr_t rand_srcStride;
-    int min_size = isChroma ? 2 : 4;
-    int max_size = isChroma ? (MAX_CU_SIZE >> 1) : MAX_CU_SIZE;
-
-    if (isChroma && (csp == X265_CSP_I444))
-    {
-        min_size = 4;
-        max_size = MAX_CU_SIZE;
-    }
-
-    for (int i = 0; i < ITERS; i++)
-    {
-        int index = i % TEST_CASES;
-        int rand_height = (int16_t)rand() % 100;
-        int rand_width = (int16_t)rand() % 100;
-
-        rand_srcStride = rand_width + rand() % 100;
-        if (rand_srcStride < rand_width)
-            rand_srcStride = rand_width;
-
-        rand_width &= ~(min_size - 1);
-        rand_width = x265_clip3(min_size, max_size, rand_width);
-
-        rand_height &= ~(min_size - 1);
-        rand_height = x265_clip3(min_size, max_size, rand_height);
-
-        ref(pixel_test_buff[index],
-            rand_srcStride,
-            IPF_C_output_s,
-            rand_width,
-            rand_height);
-
-        checked(opt, pixel_test_buff[index],
-                rand_srcStride,
-                IPF_vec_output_s,
-                rand_width,
-                rand_height);
-
-        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t)))
-            return false;
-
-        reportfail();
-    }
-
-    return true;
-}
-
 bool IPFilterHarness::check_IPFilterChroma_primitive(filter_pp_t ref, filter_pp_t opt)
 {
     intptr_t rand_srcStride, rand_dstStride;
@@ -518,12 +469,13 @@
     {
         intptr_t rand_srcStride = rand() % 100;
         int index = i % TEST_CASES;
+        intptr_t dstStride = rand() % 100 + 64;
 
-        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s);
+        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s, dstStride);
 
-        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s);
+        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s, dstStride);
 
-        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(pixel)))
+        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t)))
             return false;
 
         reportfail();
@@ -538,12 +490,13 @@
     {
         intptr_t rand_srcStride = rand() % 100;
         int index = i % TEST_CASES;
+        intptr_t dstStride = rand() % 100 + 64;
 
-        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s);
+        ref(pixel_test_buff[index] + i, rand_srcStride, IPF_C_output_s, dstStride);
 
-        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s);
+        checked(opt, pixel_test_buff[index] + i, rand_srcStride, IPF_vec_output_s, dstStride);
 
-        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(pixel)))
+        if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t)))
             return false;
 
         reportfail();
@@ -554,15 +507,6 @@
 
 bool IPFilterHarness::testCorrectness(const EncoderPrimitives& ref, const EncoderPrimitives& opt)
 {
-    if (opt.luma_p2s)
-    {
-        // last parameter does not matter in case of luma
-        if (!check_IPFilter_primitive(ref.luma_p2s, opt.luma_p2s, 0, 1))
-        {
-            printf("luma_p2s failed\n");
-            return false;
-        }
-    }
 
     for (int value = 0; value < NUM_PU_SIZES; value++)
     {
@@ -622,11 +566,11 @@
                 return false;
             }
         }
-        if (opt.pu[value].filter_p2s)
+        if (opt.pu[value].convert_p2s)
         {
-            if (!check_IPFilterLumaP2S_primitive(ref.pu[value].filter_p2s, opt.pu[value].filter_p2s))
+            if (!check_IPFilterLumaP2S_primitive(ref.pu[value].convert_p2s, opt.pu[value].convert_p2s))
             {
-                printf("filter_p2s[%s]", lumaPartStr[value]);
+                printf("convert_p2s[%s]", lumaPartStr[value]);
                 return false;
             }
         }
@@ -634,14 +578,6 @@
 
     for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++)
     {
-        if (opt.chroma[csp].p2s)
-        {
-            if (!check_IPFilter_primitive(ref.chroma[csp].p2s, opt.chroma[csp].p2s, 1, csp))
-            {
-                printf("chroma_p2s[%s]", x265_source_csp_names[csp]);
-                return false;
-            }
-        }
         for (int value = 0; value < NUM_PU_SIZES; value++)
         {
             if (opt.chroma[csp].pu[value].filter_hpp)
@@ -692,9 +628,9 @@
                     return false;
                 }
             }
-            if (opt.chroma[csp].pu[value].chroma_p2s)
+            if (opt.chroma[csp].pu[value].p2s)
             {
-                if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].chroma_p2s, opt.chroma[csp].pu[value].chroma_p2s))
+                if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].p2s, opt.chroma[csp].pu[value].p2s))
                 {
                     printf("chroma_p2s[%s]", chromaPartStr[csp][value]);
                     return false;
@@ -708,19 +644,10 @@
 
 void IPFilterHarness::measureSpeed(const EncoderPrimitives& ref, const EncoderPrimitives& opt)
 {
-    int height = 64;
-    int width = 64;
     int16_t srcStride = 96;
     int16_t dstStride = 96;
     int maxVerticalfilterHalfDistance = 3;
 
-    if (opt.luma_p2s)
-    {
-        printf("luma_p2s\t");
-        REPORT_SPEEDUP(opt.luma_p2s, ref.luma_p2s,
-                       pixel_buff, srcStride, IPF_vec_output_s, width, height);
-    }
-
     for (int value = 0; value < NUM_PU_SIZES; value++)
     {
         if (opt.pu[value].luma_hpp)
@@ -777,23 +704,18 @@
                            pixel_buff + 3 * srcStride, srcStride, IPF_vec_output_p, srcStride, 1, 3);
         }
 
-        if (opt.pu[value].filter_p2s)
+        if (opt.pu[value].convert_p2s)
         {
-            printf("filter_p2s [%s]\t", lumaPartStr[value]);
-            REPORT_SPEEDUP(opt.pu[value].filter_p2s, ref.pu[value].filter_p2s,
-                           pixel_buff, srcStride, IPF_vec_output_s);
+            printf("convert_p2s[%s]\t", lumaPartStr[value]);
+            REPORT_SPEEDUP(opt.pu[value].convert_p2s, ref.pu[value].convert_p2s,
+                               pixel_buff, srcStride,
+                               IPF_vec_output_s, dstStride);
         }
     }
 
     for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++)
     {
         printf("= Color Space %s =\n", x265_source_csp_names[csp]);
-        if (opt.chroma[csp].p2s)
-        {
-            printf("chroma_p2s\t");
-            REPORT_SPEEDUP(opt.chroma[csp].p2s, ref.chroma[csp].p2s,
-                           pixel_buff, srcStride, IPF_vec_output_s, width, height);
-        }
         for (int value = 0; value < NUM_PU_SIZES; value++)
         {
             if (opt.chroma[csp].pu[value].filter_hpp)
@@ -836,13 +758,11 @@
                                short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride,
                                IPF_vec_output_s, dstStride, 1);

 

x265_1.6.tar.gz/source/test/ipfilterharness.h -> x265_1.7.tar.gz/source/test/ipfilterharness.h Changed

 
@@ -50,7 +50,6 @@
     pixel   pixel_test_buff[TEST_CASES][TEST_BUF_SIZE];
     int16_t short_test_buff[TEST_CASES][TEST_BUF_SIZE];
 
-    bool check_IPFilter_primitive(filter_p2s_wxh_t ref, filter_p2s_wxh_t opt, int isChroma, int csp);
     bool check_IPFilterChroma_primitive(filter_pp_t ref, filter_pp_t opt);
     bool check_IPFilterChroma_ps_primitive(filter_ps_t ref, filter_ps_t opt);
     bool check_IPFilterChroma_hps_primitive(filter_hps_t ref, filter_hps_t opt);
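
The ipfilterharness.cpp hunks above give the pixel-to-short checks a fourth argument, a destination stride drawn as rand() % 100 + 64, and compare the outputs as int16_t rather than pixel. The following is a minimal sketch of that reference-versus-optimized pattern, not the harness itself. The names p2s_fn, p2s_ref, p2s_opt and outputs_match are hypothetical, the 8-bit pixel type is an assumption, and the shift and offset constants are illustrative rather than taken from the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

typedef uint8_t pixel; // assumption: 8-bit pixels

// Hypothetical signature modelled on the four-argument calls in the diff.
typedef void (*p2s_fn)(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);

// Plain reference: widen each pixel and store it at the destination stride.
// The shift (6) and offset (8192) are illustrative constants, not from the diff.
template<int W, int H>
void p2s_ref(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            dst[y * dstStride + x] = (int16_t)((src[y * srcStride + x] << 6) - 8192);
}

// Stand-in for an optimized kernel: same arithmetic, row pointers instead of index math.
template<int W, int H>
void p2s_opt(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride)
{
    for (int y = 0; y < H; y++, src += srcStride, dst += dstStride)
        for (int x = 0; x < W; x++)
            dst[x] = (int16_t)((src[x] << 6) - 8192);
}

// Run both kernels on the same input and compare the complete int16_t output
// buffers, so a stray write outside the W x H block is detected as well.
template<int W, int H>
bool outputs_match(p2s_fn ref, p2s_fn opt, intptr_t srcStride, intptr_t dstStride)
{
    assert(srcStride >= W && dstStride >= W);
    std::vector<pixel> src(H * srcStride);
    for (std::size_t i = 0; i < src.size(); i++)
        src[i] = (pixel)(i * 37 + 11); // deterministic filler instead of rand()
    std::vector<int16_t> refOut(H * dstStride, 0), optOut(H * dstStride, 0);
    ref(src.data(), srcStride, refOut.data(), dstStride);
    opt(src.data(), srcStride, optOut.data(), dstStride);
    return std::memcmp(refOut.data(), optOut.data(), refOut.size() * sizeof(int16_t)) == 0;
}
```

A call such as outputs_match<8, 8>(p2s_ref<8, 8>, p2s_opt<8, 8>, 16, 74) exercises a destination stride wider than the block, which is the case the added dstStride argument makes testable.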
​

x265_1.6.tar.gz/source/test/pixelharness.cpp -> x265_1.7.tar.gz/source/test/pixelharness.cpp Changed

@@ -666,7 +666,32 @@
     return true;
 }
 
-bool PixelHarness::check_scale_pp(scale_t ref, scale_t opt)
+bool PixelHarness::check_scale1D_pp(scale1D_t ref, scale1D_t opt)
+{
+    ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
+    ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
+
+    memset(ref_dest, 0, sizeof(ref_dest));
+    memset(opt_dest, 0, sizeof(opt_dest));
+
+    int j = 0;
+    for (int i = 0; i < ITERS; i++)
+    {
+        int index = i % TEST_CASES;
+        checked(opt, opt_dest, pixel_test_buff[index] + j);
+        ref(ref_dest, pixel_test_buff[index] + j);
+
+        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
+bool PixelHarness::check_scale2D_pp(scale2D_t ref, scale2D_t opt)
 {
     ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
     ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
@@ -845,8 +870,8 @@
 
 bool PixelHarness::check_calSign(sign_t ref, sign_t opt)
 {
-    ALIGN_VAR_16(int8_t, ref_dest[64 * 64]);
-    ALIGN_VAR_16(int8_t, opt_dest[64 * 64]);
+    ALIGN_VAR_16(int8_t, ref_dest[64 * 2]);
+    ALIGN_VAR_16(int8_t, opt_dest[64 * 2]);
 
     memset(ref_dest, 0xCD, sizeof(ref_dest));
     memset(opt_dest, 0xCD, sizeof(opt_dest));
@@ -855,12 +880,12 @@
 
     for (int i = 0; i < ITERS; i++)
     {
-        int width = 16 * (rand() % 4 + 1);
+        int width = (rand() % 64) + 1;
 
         ref(ref_dest, pbuf2 + j, pbuf3 + j, width);
         checked(opt, opt_dest, pbuf2 + j, pbuf3 + j, width);
 
-        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(int8_t)))
+        if (memcmp(ref_dest, opt_dest, sizeof(ref_dest)))
             return false;
 
         reportfail();
@@ -883,12 +908,10 @@
     for (int i = 0; i < ITERS; i++)
     {
         int width = 16 * (rand() % 4 + 1);
-        int8_t sign = rand() % 3;
-        if (sign == 2)
-            sign = -1;
+        int stride = width + 1;
 
-        ref(ref_dest, psbuf1 + j, width, sign);
-        checked(opt, opt_dest, psbuf1 + j, width, sign);
+        ref(ref_dest, psbuf1 + j, width, psbuf2 + j, stride);
+        checked(opt, opt_dest, psbuf1 + j, width, psbuf5 + j, stride);
 
         if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
             return false;
@@ -928,7 +951,43 @@
     return true;
 }
 
-bool PixelHarness::check_saoCuOrgE2_t(saoCuOrgE2_t ref, saoCuOrgE2_t opt)
+bool PixelHarness::check_saoCuOrgE2_t(saoCuOrgE2_t ref[2], saoCuOrgE2_t opt[2])
+{
+    ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
+    ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
+
+    memset(ref_dest, 0xCD, sizeof(ref_dest));
+    memset(opt_dest, 0xCD, sizeof(opt_dest));
+
+    for (int id = 0; id < 2; id++)
+    {
+        int j = 0;
+        if (opt[id])
+        {
+            for (int i = 0; i < ITERS; i++)
+            {
+                int width = 16 * (1 << (id * (rand() % 2 + 1))) - (rand() % 2);
+                int stride = width + 1;
+
+                ref[width > 16](ref_dest, psbuf1 + j, psbuf2 + j, psbuf3 + j, width, stride);
+                checked(opt[width > 16], opt_dest, psbuf4 + j, psbuf2 + j, psbuf3 + j, width, stride);
+
+                if (memcmp(psbuf1 + j, psbuf4 + j, width * sizeof(int8_t)))
+                    return false;
+
+                if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
+                    return false;
+
+                reportfail();
+                j += INCR;
+            }
+        }
+    }
+
+    return true;
+}
+
+bool PixelHarness::check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt)
 {
     ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
     ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
@@ -940,16 +999,14 @@
 
     for (int i = 0; i < ITERS; i++)
     {
-        int width = 16 * (rand() % 4 + 1);
-        int stride = width + 1;
-
-        ref(ref_dest, psbuf1 + j, psbuf2 + j, psbuf3 + j, width, stride);
-        checked(opt, opt_dest, psbuf4 + j, psbuf2 + j, psbuf3 + j, width, stride);
+        int stride = 16 * (rand() % 4 + 1);
+        int start = rand() % 2;
+        int end = 16 - rand() % 2;
 
-        if (memcmp(psbuf1 + j, psbuf4 + j, width * sizeof(int8_t)))
-            return false;
+        ref(ref_dest, psbuf2 + j, psbuf1 + j, stride, start, end);
+        checked(opt, opt_dest, psbuf5 + j, psbuf1 + j, stride, start, end);
 
-        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)))
+        if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel)) || memcmp(psbuf2, psbuf5, BUFFSIZE))
             return false;
 
         reportfail();
@@ -959,7 +1016,7 @@
     return true;
 }
 
-bool PixelHarness::check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt)
+bool PixelHarness::check_saoCuOrgE3_32_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt)
 {
     ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
     ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
@@ -971,9 +1028,9 @@
 
     for (int i = 0; i < ITERS; i++)
     {
-        int stride = 16 * (rand() % 4 + 1);
+        int stride = 32 * (rand() % 2 + 1);
         int start = rand() % 2;
-        int end = (16 * (rand() % 4 + 1)) - rand() % 2;
+        int end = (32 * (rand() % 2 + 1)) - rand() % 2;
 
         ref(ref_dest, psbuf2 + j, psbuf1 + j, stride, start, end);
         checked(opt, opt_dest, psbuf5 + j, psbuf1 + j, stride, start, end);
@@ -995,9 +1052,8 @@
 
     memset(ref_dest, 0xCD, sizeof(ref_dest));
     memset(opt_dest, 0xCD, sizeof(opt_dest));
-
-    int width = 16 + rand() % 48;
-    int height = 16 + rand() % 48;
+    int width = 32 + rand() % 32;
+    int height = 32 + rand() % 32;
     intptr_t srcStride = 64;
     intptr_t dstStride = width;
     int j = 0;
@@ -1133,8 +1189,8 @@
     for (int i = 0; i < ITERS; i++)
     {
         int width = 16 * (rand() % 4 + 1);
-        int height = rand() % 64 +1;
-        int stride = rand() % 65;
+        int height = rand() % 63 + 2;
+        int stride = width;
 
         ref(ref_dest, psbuf1 + j, width, height, stride);
         checked(opt, opt_dest, psbuf1 + j, width, height, stride);
@@ -1149,7 +1205,7 @@
     return true;
 }
 
-bool PixelHarness::check_findPosLast(findPosLast_t ref, findPosLast_t opt)
+bool PixelHarness::check_scanPosLast(scanPosLast_t ref, scanPosLast_t opt)
 {
     ALIGN_VAR_16(coeff_t, ref_src[32 * 32 + ITERS * 2]);
     uint8_t ref_coeffNum[MLS_GRP_NUM], opt_coeffNum[MLS_GRP_NUM];      // value range[0, 16]
@@ -1160,6 +1216,14 @@
     for (int i = 0; i < 32 * 32; i++)
     {

 

x265_1.6.tar.gz/source/test/pixelharness.h -> x265_1.7.tar.gz/source/test/pixelharness.h Changed

@@ -76,7 +76,8 @@
     bool check_pixelavg_pp(pixelavg_pp_t ref, pixelavg_pp_t opt);
     bool check_pixel_sub_ps(pixel_sub_ps_t ref, pixel_sub_ps_t opt);
     bool check_pixel_add_ps(pixel_add_ps_t ref, pixel_add_ps_t opt);
-    bool check_scale_pp(scale_t ref, scale_t opt);
+    bool check_scale1D_pp(scale1D_t ref, scale1D_t opt);
+    bool check_scale2D_pp(scale2D_t ref, scale2D_t opt);
     bool check_ssd_s(pixel_ssd_s_t ref, pixel_ssd_s_t opt);
     bool check_blockfill_s(blockfill_s_t ref, blockfill_s_t opt);
     bool check_calresidual(calcresidual_t ref, calcresidual_t opt);
@@ -95,8 +96,9 @@
     bool check_addAvg(addAvg_t, addAvg_t);
     bool check_saoCuOrgE0_t(saoCuOrgE0_t ref, saoCuOrgE0_t opt);
     bool check_saoCuOrgE1_t(saoCuOrgE1_t ref, saoCuOrgE1_t opt);
-    bool check_saoCuOrgE2_t(saoCuOrgE2_t ref, saoCuOrgE2_t opt);
+    bool check_saoCuOrgE2_t(saoCuOrgE2_t ref[], saoCuOrgE2_t opt[]);
     bool check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt);
+    bool check_saoCuOrgE3_32_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt);
     bool check_saoCuOrgB0_t(saoCuOrgB0_t ref, saoCuOrgB0_t opt);
     bool check_planecopy_sp(planecopy_sp_t ref, planecopy_sp_t opt);
     bool check_planecopy_cp(planecopy_cp_t ref, planecopy_cp_t opt);
@@ -104,7 +106,8 @@
     bool check_psyCost_pp(pixelcmp_t ref, pixelcmp_t opt);
     bool check_psyCost_ss(pixelcmp_ss_t ref, pixelcmp_ss_t opt);
     bool check_calSign(sign_t ref, sign_t opt);
-    bool check_findPosLast(findPosLast_t ref, findPosLast_t opt);
+    bool check_scanPosLast(scanPosLast_t ref, scanPosLast_t opt);
+    bool check_findPosFirstLast(findPosFirstLast_t ref, findPosFirstLast_t opt);
 
 public:
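
The new check_saoCuOrgE2_t in pixelharness.cpp takes two-element arrays and picks the kernel with ref[width > 16], where width comes from 16 * (1 << (id * (rand() % 2 + 1))) - (rand() % 2). The sketch below enumerates that expression to show which widths each loop index can produce. The helper names are hypothetical, and a and b stand in for the two rand() % 2 draws.

```cpp
#include <cassert>
#include <set>

// Mirrors the width expression from the diff; a and b replace the two
// (rand() % 2) draws, so each is 0 or 1.
int e2_width(int id, int a, int b)
{
    return 16 * (1 << (id * (a + 1))) - b;
}

// Every width the expression can produce for a given loop index.
std::set<int> e2_widths(int id)
{
    assert(id == 0 || id == 1);
    std::set<int> widths;
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            widths.insert(e2_width(id, a, b));
    return widths;
}

// Index used to select the kernel: 0 for widths up to 16, 1 above that.
int e2_kernel_index(int width)
{
    return width > 16;
}
```

Under this reading, id 0 only ever yields widths 15 and 16 and id 1 yields 31, 32, 63 and 64, so the index computed from width > 16 always equals the loop index.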

 

x265_1.6.tar.gz/source/test/rate-control-tests.txt -> x265_1.7.tar.gz/source/test/rate-control-tests.txt Changed

@@ -1,34 +1,36 @@
-# List of command lines to be run by rate control regression tests, see https://bitbucket.org/sborho/test-harness
-
-# This test is listed first since it currently reproduces bugs
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --pass 1 -F4,--preset medium --bitrate 1000 --pass 2 -F4
-
-# VBV tests, non-deterministic so testing for correctness and bitrate
-# fluctuations - up to 1% bitrate fluctuation is allowed between runs
-RaceHorses_416x240_30_10bit.yuv,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700
-RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --vbv-bufsize 600 --vbv-maxrate 600
-RaceHorses_416x240_30_10bit.yuv,--preset veryslow --bitrate 1100 --vbv-bufsize 1100 --vbv-maxrate 1200
-112_1920x1080_25.yuv,--preset medium --bitrate 1000 --vbv-maxrate 1500 --vbv-bufsize 1500 --aud
-112_1920x1080_25.yuv,--preset medium --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd
-112_1920x1080_25.yuv,--preset medium --bitrate 4000 --vbv-maxrate 12000 --vbv-bufsize 12000 --repeat-headers
-112_1920x1080_25.yuv,--preset superfast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1500 --hrd --strict-cbr
-112_1920x1080_25.yuv,--preset superfast --bitrate 30000 --vbv-maxrate 30000 --vbv-bufsize 30000 --repeat-headers
-112_1920x1080_25.yuv,--preset superfast --bitrate 4000 --vbv-maxrate 6000 --vbv-bufsize 6000 --aud
-112_1920x1080_25.yuv,--preset veryslow --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --repeat-headers
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --vbv-bufsize 3000 --vbv-maxrate 3000 --repeat-headers
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud
-big_buck_bunny_360p24.y4m,--preset medium --crf 1 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 1000 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud --strict-cbr
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 3000 --vbv-bufsize 9000 --vbv-maxrate 9000 --repeat-headers
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd
-big_buck_bunny_360p24.y4m,--preset superfast --crf 6 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud
-
-# multi-pass rate control tests
-big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1,--preset slow --bitrate 200 --pass 2
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4
-112_1920x1080_25.yuv,--preset slow --bitrate 1000 --pass 1 -F4,--preset slow --bitrate 1000 --pass 2 -F4
-112_1920x1080_25.yuv,--preset superfast --crf 12 --pass 1,--preset superfast --bitrate 4000 --pass 2 -F4
-RaceHorses_416x240_30_10bit.yuv,--preset veryslow --crf 40 --pass 1, --preset veryslow --bitrate 200 --pass 2 -F4
-RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 600 --pass 2 -F4
-RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --pass 1,--preset medium --bitrate 500 --pass 3 -F4,--preset medium --bitrate 500 --pass 2 -F4
+# List of command lines to be run by rate control regression tests, see https://bitbucket.org/sborho/test-harness
+
+#These tests should yeild deterministic results
+# This test is listed first since it currently reproduces bugs
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --pass 1 -F4,--preset medium --bitrate 1000 --pass 2 -F4
+fire_1920x1080_30.yuv, --preset slow --bitrate 2000 --tune zero-latency 
+
+
+# VBV tests, non-deterministic so testing for correctness and bitrate
+# fluctuations - up to 1% bitrate fluctuation is allowed between runs
+night_cars_1920x1080_30.yuv,--preset medium --crf 25 --vbv-bufsize 5000 --vbv-maxrate 5000 -F6 --crf-max 34 --crf-min 22
+ducks_take_off_420_720p50.y4m,--preset slow --bitrate 1600 --vbv-bufsize 1600 --vbv-maxrate 1600 --strict-cbr --aq-mode 2 --aq-strength 0.5
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryslow --bitrate 4000 --vbv-bufsize 3000 --vbv-maxrate 4000 --tune grain
+fire_1920x1080_30.yuv,--preset medium --bitrate 1000 --vbv-maxrate 1500 --vbv-bufsize 1500 --aud --pmode --tune ssim
+112_1920x1080_25.yuv,--preset ultrafast --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd --strict-cbr
+Traffic_4096x2048_30.yuv,--preset superfast --bitrate 20000 --vbv-maxrate 20000 --vbv-bufsize 20000 --repeat-headers --strict-cbr
+Traffic_4096x2048_30.yuv,--preset faster --bitrate 8000 --vbv-maxrate 8000 --vbv-bufsize 6000 --aud --repeat-headers --no-open-gop --hrd --pmode --pme
+News-4k.y4m,--preset veryfast --bitrate 3000 --vbv-maxrate 5000 --vbv-bufsize 5000 --repeat-headers --temporal-layers
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 18000 --vbv-bufsize 20000 --vbv-maxrate 18000 --strict-cbr
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 8000 --vbv-bufsize 12000 --vbv-maxrate 10000  --tune grain
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud --hrd --tune fast-decode
+sita_1920x1080_30.yuv,--preset superfast --crf 25 --vbv-bufsize 3000 --vbv-maxrate 4000 --vbv-bufsize 5000 --hrd  --crf-max 30
+sita_1920x1080_30.yuv,--preset superfast --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --aud --strict-cbr
+
+
+
+# multi-pass rate control tests
+big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1 -f 5000,--preset slow --bitrate 200 --pass 2 -f 5000
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass -f 5000 ,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4 -f 5000
+112_1920x1080_25.yuv,--preset fast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1000 --strict-cbr --pass 1 -F4,--preset fast --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --pass 2 -F4
+pine_tree_1920x1080_30.yuv,--preset veryfast --crf 12 --pass 1 -F4,--preset faster --bitrate 4000 --pass 2 -F4
+SteamLocomotiveTrain_2560x1600_60_10bit_crop.yuv, --tune grain --preset ultrafast --bitrate 5000 --vbv-maxrate 5000 --vbv-bufsize 8000 --strict-cbr -F4 --pass 1, --tune grain --preset ultrafast --bitrate 8000 --vbv-maxrate 8000 --vbv-bufsize 8000 -F4 --pass 2
+RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 40 --pass 1, --preset faster --bitrate 200 --pass 2 -F4
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --bitrate 2500 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 2500 --pass 2 -F4
+RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --vbv-maxrate 1000 --vbv-bufsize 1000 --pass 1,--preset fast --bitrate 1000  --vbv-maxrate 1000 --vbv-bufsize 700 --pass 3 -F4,--preset slow --bitrate 500 --vbv-maxrate 500  --vbv-bufsize 700 --pass 2 -F4
+
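
Each non-comment line in rate-control-tests.txt appears to hold a clip name followed by one comma-separated command line per encode pass. The sketch below splits a line on that assumption. How the actual test harness parses the file is not shown in this diff, and RateControlTest, trim_ws and parse_test_line are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct RateControlTest
{
    std::string input;               // source clip name (first field)
    std::vector<std::string> passes; // one command line per encode pass
};

// Strip leading and trailing whitespace.
std::string trim_ws(const std::string& s)
{
    const char* ws = " \t\r\n";
    std::string::size_type b = s.find_first_not_of(ws);
    if (b == std::string::npos)
        return "";
    std::string::size_type e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Returns false for blank lines, '#' comments, and lines without a command field.
bool parse_test_line(const std::string& line, RateControlTest& out)
{
    std::string text = trim_ws(line);
    if (text.empty() || text[0] == '#')
        return false;

    // Split on commas; each field is trimmed because some entries
    // in the file have a space after the comma.
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    for (;;)
    {
        std::string::size_type comma = text.find(',', start);
        std::string::size_type len = (comma == std::string::npos) ? comma : comma - start;
        fields.push_back(trim_ws(text.substr(start, len)));
        if (comma == std::string::npos)
            break;
        start = comma + 1;
    }
    if (fields.size() < 2 || fields[0].empty())
        return false;

    out.input = fields[0];
    out.passes.assign(fields.begin() + 1, fields.end());
    return true;
}
```

With this split, the first big_buck_bunny_360p24.y4m entry yields two pass command lines and the fire_1920x1080_30.yuv entry yields one.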

 
@@ -1,34 +1,36 @@
-# List of command lines to be run by rate control regression tests, see https://bitbucket.org/sborho/test-harness
-
-# This test is listed first since it currently reproduces bugs
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --pass 1 -F4,--preset medium --bitrate 1000 --pass 2 -F4
-
-# VBV tests, non-deterministic so testing for correctness and bitrate
-# fluctuations - up to 1% bitrate fluctuation is allowed between runs
-RaceHorses_416x240_30_10bit.yuv,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700
-RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --vbv-bufsize 600 --vbv-maxrate 600
-RaceHorses_416x240_30_10bit.yuv,--preset veryslow --bitrate 1100 --vbv-bufsize 1100 --vbv-maxrate 1200
-112_1920x1080_25.yuv,--preset medium --bitrate 1000 --vbv-maxrate 1500 --vbv-bufsize 1500 --aud
-112_1920x1080_25.yuv,--preset medium --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd
-112_1920x1080_25.yuv,--preset medium --bitrate 4000 --vbv-maxrate 12000 --vbv-bufsize 12000 --repeat-headers
-112_1920x1080_25.yuv,--preset superfast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1500 --hrd --strict-cbr
-112_1920x1080_25.yuv,--preset superfast --bitrate 30000 --vbv-maxrate 30000 --vbv-bufsize 30000 --repeat-headers
-112_1920x1080_25.yuv,--preset superfast --bitrate 4000 --vbv-maxrate 6000 --vbv-bufsize 6000 --aud
-112_1920x1080_25.yuv,--preset veryslow --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --repeat-headers
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --vbv-bufsize 3000 --vbv-maxrate 3000 --repeat-headers
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud
-big_buck_bunny_360p24.y4m,--preset medium --crf 1 --vbv-bufsize 3000 --vbv-maxrate 3000 --hrd
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 1000 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud --strict-cbr
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 3000 --vbv-bufsize 9000 --vbv-maxrate 9000 --repeat-headers
-big_buck_bunny_360p24.y4m,--preset superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd
-big_buck_bunny_360p24.y4m,--preset superfast --crf 6 --vbv-bufsize 1000 --vbv-maxrate 1000 --aud
-
-# multi-pass rate control tests
-big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1,--preset slow --bitrate 200 --pass 2
-big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4
-112_1920x1080_25.yuv,--preset slow --bitrate 1000 --pass 1 -F4,--preset slow --bitrate 1000 --pass 2 -F4
-112_1920x1080_25.yuv,--preset superfast --crf 12 --pass 1,--preset superfast --bitrate 4000 --pass 2 -F4
-RaceHorses_416x240_30_10bit.yuv,--preset veryslow --crf 40 --pass 1, --preset veryslow --bitrate 200 --pass 2 -F4
-RaceHorses_416x240_30_10bit.yuv,--preset superfast --bitrate 600 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 600 --pass 2 -F4
-RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --pass 1,--preset medium --bitrate 500 --pass 3 -F4,--preset medium --bitrate 500 --pass 2 -F4
+# List of command lines to be run by rate control regression tests, see https://bitbucket.org/sborho/test-harness
+
+#These tests should yield deterministic results
+# This test is listed first since it currently reproduces bugs
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 1000 --pass 1 -F4,--preset medium --bitrate 1000 --pass 2 -F4
+fire_1920x1080_30.yuv, --preset slow --bitrate 2000 --tune zero-latency 
+
+
+# VBV tests, non-deterministic so testing for correctness and bitrate
+# fluctuations - up to 1% bitrate fluctuation is allowed between runs
+night_cars_1920x1080_30.yuv,--preset medium --crf 25 --vbv-bufsize 5000 --vbv-maxrate 5000 -F6 --crf-max 34 --crf-min 22
+ducks_take_off_420_720p50.y4m,--preset slow --bitrate 1600 --vbv-bufsize 1600 --vbv-maxrate 1600 --strict-cbr --aq-mode 2 --aq-strength 0.5
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryslow --bitrate 4000 --vbv-bufsize 3000 --vbv-maxrate 4000 --tune grain
+fire_1920x1080_30.yuv,--preset medium --bitrate 1000 --vbv-maxrate 1500 --vbv-bufsize 1500 --aud --pmode --tune ssim
+112_1920x1080_25.yuv,--preset ultrafast --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 15000 --hrd --strict-cbr
+Traffic_4096x2048_30.yuv,--preset superfast --bitrate 20000 --vbv-maxrate 20000 --vbv-bufsize 20000 --repeat-headers --strict-cbr
+Traffic_4096x2048_30.yuv,--preset faster --bitrate 8000 --vbv-maxrate 8000 --vbv-bufsize 6000 --aud --repeat-headers --no-open-gop --hrd --pmode --pme
+News-4k.y4m,--preset veryfast --bitrate 3000 --vbv-maxrate 5000 --vbv-bufsize 5000 --repeat-headers --temporal-layers
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 18000 --vbv-bufsize 20000 --vbv-maxrate 18000 --strict-cbr
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --bitrate 8000 --vbv-bufsize 12000 --vbv-maxrate 10000  --tune grain
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 600 --aud --hrd --tune fast-decode
+sita_1920x1080_30.yuv,--preset superfast --crf 25 --vbv-bufsize 3000 --vbv-maxrate 4000 --vbv-bufsize 5000 --hrd  --crf-max 30
+sita_1920x1080_30.yuv,--preset superfast --bitrate 3000 --vbv-bufsize 3000 --vbv-maxrate 3000 --aud --strict-cbr
+
+
+
+# multi-pass rate control tests
+big_buck_bunny_360p24.y4m,--preset slow --crf 40 --pass 1 -f 5000,--preset slow --bitrate 200 --pass 2 -f 5000
+big_buck_bunny_360p24.y4m,--preset medium --bitrate 700 --pass 1 -F4 --slow-firstpass -f 5000 ,--preset medium --bitrate 700 --vbv-bufsize 900 --vbv-maxrate 700 --pass 2 -F4 -f 5000
+112_1920x1080_25.yuv,--preset fast --bitrate 1000 --vbv-maxrate 1000 --vbv-bufsize 1000 --strict-cbr --pass 1 -F4,--preset fast --bitrate 1000 --vbv-maxrate 3000 --vbv-bufsize 3000 --pass 2 -F4
+pine_tree_1920x1080_30.yuv,--preset veryfast --crf 12 --pass 1 -F4,--preset faster --bitrate 4000 --pass 2 -F4
+SteamLocomotiveTrain_2560x1600_60_10bit_crop.yuv, --tune grain --preset ultrafast --bitrate 5000 --vbv-maxrate 5000 --vbv-bufsize 8000 --strict-cbr -F4 --pass 1, --tune grain --preset ultrafast --bitrate 8000 --vbv-maxrate 8000 --vbv-bufsize 8000 -F4 --pass 2
+RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 40 --pass 1, --preset faster --bitrate 200 --pass 2 -F4
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --bitrate 2500 --pass 1 -F4 --slow-firstpass,--preset superfast --bitrate 2500 --pass 2 -F4
+RaceHorses_416x240_30_10bit.yuv,--preset medium --crf 26 --vbv-maxrate 1000 --vbv-bufsize 1000 --pass 1,--preset fast --bitrate 1000  --vbv-maxrate 1000 --vbv-bufsize 700 --pass 3 -F4,--preset slow --bitrate 500 --vbv-maxrate 500  --vbv-bufsize 700 --pass 2 -F4
+
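The entries above follow a simple shape: an input clip, then one comma-separated command line per encoder pass. The sketch below shows one way such a line could be split. It is not the test harness's actual parser; `TestCase`, `trim` and `parseTestLine` are hypothetical names, and trimming whitespace around each field is an assumption prompted by the stray spaces in some entries.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One entry of a test list: an input clip plus one command line per pass.
// (Hypothetical type, introduced only for this illustration.)
struct TestCase
{
    std::string inputFile;
    std::vector<std::string> passes;
};

// Strip leading and trailing whitespace from a field.
static std::string trim(const std::string& s)
{
    const char* ws = " \t\r\n";
    size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos)
        return "";
    size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Returns false for blank lines and '#' comments, true when 'out' was filled.
static bool parseTestLine(const std::string& line, TestCase& out)
{
    std::string text = trim(line);
    if (text.empty() || text[0] == '#')
        return false;

    out = TestCase();
    size_t start = 0;
    bool first = true;
    while (start <= text.size())
    {
        size_t comma = text.find(',', start);
        if (comma == std::string::npos)
            comma = text.size();
        std::string field = trim(text.substr(start, comma - start));
        if (first)
        {
            out.inputFile = field; // first field is the clip name
            first = false;
        }
        else if (!field.empty())
            out.passes.push_back(field); // later fields are per-pass options
        start = comma + 1;
    }
    return !out.inputFile.empty();
}
```

A single-pass entry yields one element in `passes`; the multi-pass entries yield two or three, in the order they are written on the line.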
​

x265_1.6.tar.gz/source/test/regression-tests.txt -> x265_1.7.tar.gz/source/test/regression-tests.txt Changed

@@ -12,9 +12,9 @@
 # not auto-detected.
 
 BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190
-BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7
+BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 32
 BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao
-BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3
+BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3 --qg-size 16
 BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0
 BasketballDrive_1920x1080_50.y4m,--preset superfast --psy-rd 1 --ctu 16 --no-wpp
 BasketballDrive_1920x1080_50.y4m,--preset ultrafast --signhide --colormatrix bt709
@@ -29,7 +29,7 @@
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset slow --no-wpp --tune ssim --transfer smpte240m
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset slower --tune ssim --tune fastdecode
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao
-CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryfast --temporal-layers --tune grain
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither --no-psy-rd
@@ -37,8 +37,8 @@
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16
-DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd
-DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd --qg-size 32
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp --qg-size 16
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4
@@ -51,11 +51,11 @@
 Kimono1_1920x1080_24_10bit_444.yuv,--preset superfast --weightb
 KristenAndSara_1280x720_60.y4m,--preset medium --no-cutree --max-tu-size 16
 KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8
-KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16
+KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16 --qg-size 16
 KristenAndSara_1280x720_60.y4m,--preset ultrafast --strong-intra-smoothing
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr
-News-4k.y4m,--preset medium --tune ssim --no-sao
+News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 32
 News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset medium --no-weightp
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune fastdecode
@@ -108,13 +108,13 @@
 parkrun_ter_720p50.y4m,--preset slower --fast-intra --no-rect --tune grain
 silent_cif_420.y4m,--preset medium --me full --rect --amp
 silent_cif_420.y4m,--preset superfast --weightp --rect
-silent_cif_420.y4m,--preset placebo --ctu 32 --no-sao
+silent_cif_420.y4m,--preset placebo --ctu 32 --no-sao --qg-size 16
 vtc1nw_422_ntsc.y4m,--preset medium --scaling-list default --ctu 16 --ref 5
-vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode
+vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode --qg-size 16
 vtc1nw_422_ntsc.y4m,--preset superfast --weightp --nr-intra 100 -F4
 washdc_422_ntsc.y4m,--preset faster --rdoq-level 1 --max-merge 5
 washdc_422_ntsc.y4m,--preset medium --no-weightp --max-tu-size 4
-washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2
+washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2 --qg-size 32
 washdc_422_ntsc.y4m,--preset superfast --psy-rd 1 --tune zerolatency
 washdc_422_ntsc.y4m,--preset ultrafast --weightp --tu-intra-depth 4
 washdc_422_ntsc.y4m,--preset veryfast --tu-inter-depth 4

 
​

x265_1.6.tar.gz/source/test/smoke-tests.txt -> x265_1.7.tar.gz/source/test/smoke-tests.txt Changed

@@ -1,14 +1,18 @@
 # List of command lines to be run by smoke tests, see https://bitbucket.org/sborho/test-harness
 
+# consider VBV tests a failure if new bitrate is more than 5% different
+# from the old bitrate
+# vbv-tolerance = 0.05
+
 big_buck_bunny_360p24.y4m,--preset=superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd --aud --repeat-headers
 big_buck_bunny_360p24.y4m,--preset=medium --bitrate 1000 -F4 --cu-lossless --scaling-list default
-big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --cu-stats --pme
-washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1
+big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --cu-stats --pme --qg-size 16
+washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1 --qg-size 16
 washdc_422_ntsc.y4m,--preset=medium --qp 40 --nr-inter 400 -F4
 washdc_422_ntsc.y4m,--preset=veryslow --pmode --tskip --rdoq-level 0
 old_town_cross_444_720p50.y4m,--preset=ultrafast --weightp --keyint -1
 old_town_cross_444_720p50.y4m,--preset=fast --keyint 20 --min-cu-size 16
-old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode
+old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode --qg-size 32
 RaceHorses_416x240_30_10bit.yuv,--preset=veryfast --cu-stats --max-tu-size 8
 RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --constrained-intra --min-keyint 5 --keyint 10
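The new `vbv-tolerance = 0.05` comment describes a relative bitrate check: a VBV test fails when the new bitrate differs from the old one by more than 5%. A minimal sketch of that comparison follows. The function name `withinVbvTolerance` is hypothetical, and measuring the difference relative to the old bitrate is an assumption read from the comment's wording, not taken from the harness itself.

```cpp
#include <cassert>
#include <cmath>

// Returns true when newBitrate deviates from oldBitrate by no more than
// 'tolerance', expressed as a fraction (0.05 means 5%).
// Assumption: the deviation is measured relative to the old bitrate.
static bool withinVbvTolerance(double oldBitrate, double newBitrate, double tolerance)
{
    assert(oldBitrate > 0.0 && tolerance >= 0.0);
    return std::fabs(newBitrate - oldBitrate) <= tolerance * oldBitrate;
}
```

With a reference of 1000 kbps and a tolerance of 0.05, a run at 1040 kbps would pass and a run at 1060 kbps would be reported as a failure.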

 
​

x265_1.6.tar.gz/source/test/testbench.cpp -> x265_1.7.tar.gz/source/test/testbench.cpp Changed

 
@@ -168,6 +168,7 @@
         { "AVX", X265_CPU_AVX },
         { "XOP", X265_CPU_XOP },
         { "AVX2", X265_CPU_AVX2 },
+        { "BMI2", X265_CPU_AVX2 | X265_CPU_BMI1 | X265_CPU_BMI2 },
         { "", 0 },
     };
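The new "BMI2" row maps one name to a combined mask (AVX2, BMI1 and BMI2 together) instead of a single flag. The sketch below illustrates how such a row can be matched against a detected CPU mask. The `DEMO_CPU_*` values, the table and both helper functions are hypothetical stand-ins, and the rule that every bit of a row must be present is an assumption about how a table like this would be consumed, not a copy of the testbench logic.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Illustrative flag values only; the real X265_CPU_* constants live in x265.h.
static const uint32_t DEMO_CPU_AVX2 = 1u << 0;
static const uint32_t DEMO_CPU_BMI1 = 1u << 1;
static const uint32_t DEMO_CPU_BMI2 = 1u << 2;

struct CpuName
{
    const char* name;
    uint32_t    flags;
};

// Rows are ordered from least to most specific; a zero mask ends the table.
static const CpuName demoCpuNames[] =
{
    { "AVX2", DEMO_CPU_AVX2 },
    { "BMI2", DEMO_CPU_AVX2 | DEMO_CPU_BMI1 | DEMO_CPU_BMI2 },
    { "", 0 },
};

// A row applies only if every bit in its mask was detected.
static bool rowApplies(uint32_t rowFlags, uint32_t detected)
{
    return (rowFlags & detected) == rowFlags;
}

// Name of the last (most specific) applicable row, or "" if none applies.
static const char* lastApplicable(uint32_t detected)
{
    const char* best = "";
    for (int i = 0; demoCpuNames[i].flags; i++)
        if (rowApplies(demoCpuNames[i].flags, detected))
            best = demoCpuNames[i].name;
    return best;
}
```

Under this rule a CPU reporting AVX2 and BMI2 but not BMI1 would still resolve to "AVX2", because the combined row requires all three bits.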
 
​

x265_1.6.tar.gz/source/x265.cpp -> x265_1.7.tar.gz/source/x265.cpp Changed

@@ -27,6 +27,7 @@
 
 #include "input/input.h"
 #include "output/output.h"
+#include "output/reconplay.h"
 #include "filters/filters.h"
 #include "common.h"
 #include "param.h"
@@ -46,12 +47,16 @@
 #include <string>
 #include <ostream>
 #include <fstream>
+#include <queue>
 
+#define CONSOLE_TITLE_SIZE 200
 #ifdef _WIN32
 #include <windows.h>
+static char orgConsoleTitle[CONSOLE_TITLE_SIZE] = "";
 #else
 #define GetConsoleTitle(t, n)
 #define SetConsoleTitle(t)
+#define SetThreadExecutionState(es)
 #endif
 
 using namespace x265;
@@ -65,33 +70,34 @@
 
 struct CLIOptions
 {
-    Input*  input;
-    Output* recon;
-    std::fstream bitstreamFile;
+    InputFile* input;
+    ReconFile* recon;
+    OutputFile* output;
+    FILE*       qpfile;
+    const char* reconPlayCmd;
+    const x265_api* api;
+    x265_param* param;
     bool bProgress;
     bool bForceY4m;
     bool bDither;
-
     uint32_t seek;              // number of frames to skip from the beginning
     uint32_t framesToBeEncoded; // number of frames to encode
     uint64_t totalbytes;
-    size_t   analysisRecordSize; // number of bytes read from or dumped into file
-    int      analysisHeaderSize;
-
     int64_t startTime;
     int64_t prevUpdateTime;
-    float   frameRate;
-    FILE*   qpfile;
-    FILE*   analysisFile;
 
     /* in microseconds */
     static const int UPDATE_INTERVAL = 250000;
 
     CLIOptions()
     {
-        frameRate = 0.f;
         input = NULL;
         recon = NULL;
+        output = NULL;
+        qpfile = NULL;
+        reconPlayCmd = NULL;
+        api = NULL;
+        param = NULL;
         framesToBeEncoded = seek = 0;
         totalbytes = 0;
         bProgress = true;
@@ -99,18 +105,12 @@
         startTime = x265_mdate();
         prevUpdateTime = 0;
         bDither = false;
-        qpfile = NULL;
-        analysisFile = NULL;
-        analysisRecordSize = 0;
-        analysisHeaderSize = 0;
     }
 
     void destroy();
-    void writeNALs(const x265_nal* nal, uint32_t nalcount);
-    void printStatus(uint32_t frameNum, x265_param *param);
-    bool parse(int argc, char **argv, x265_param* param);
+    void printStatus(uint32_t frameNum);
+    bool parse(int argc, char **argv);
     bool parseQPFile(x265_picture &pic_org);
-    bool validateFanout(x265_param*);
 };
 
 void CLIOptions::destroy()
@@ -124,23 +124,12 @@
     if (qpfile)
         fclose(qpfile);
     qpfile = NULL;
-    if (analysisFile)
-        fclose(analysisFile);
-    analysisFile = NULL;
+    if (output)
+        output->release();
+    output = NULL;
 }
 
-void CLIOptions::writeNALs(const x265_nal* nal, uint32_t nalcount)
-{
-    ProfileScopeEvent(bitstreamWrite);
-    for (uint32_t i = 0; i < nalcount; i++)
-    {
-        bitstreamFile.write((const char*)nal->payload, nal->sizeBytes);
-        totalbytes += nal->sizeBytes;
-        nal++;
-    }
-}
-
-void CLIOptions::printStatus(uint32_t frameNum, x265_param *param)
+void CLIOptions::printStatus(uint32_t frameNum)
 {
     char buf[200];
     int64_t time = x265_mdate();
@@ -167,15 +156,16 @@
     prevUpdateTime = time;
 }
 
-bool CLIOptions::parse(int argc, char **argv, x265_param* param)
+bool CLIOptions::parse(int argc, char **argv)
 {
     bool bError = 0;
     int help = 0;
     int inputBitDepth = 8;
+    int outputBitDepth = 0;
     int reconFileBitDepth = 0;
     const char *inputfn = NULL;
     const char *reconfn = NULL;
-    const char *bitstreamfn = NULL;
+    const char *outputfn = NULL;
     const char *preset = NULL;
     const char *tune = NULL;
     const char *profile = NULL;
@@ -192,15 +182,31 @@
         int c = getopt_long(argc, argv, short_options, long_options, NULL);
         if (c == -1)
             break;
-        if (c == 'p')
+        else if (c == 'p')
             preset = optarg;
-        if (c == 't')
+        else if (c == 't')
             tune = optarg;
+        else if (c == 'D')
+            outputBitDepth = atoi(optarg);
         else if (c == '?')
             showHelp(param);
     }
 
-    if (x265_param_default_preset(param, preset, tune) < 0)
+    api = x265_api_get(outputBitDepth);
+    if (!api)
+    {
+        x265_log(NULL, X265_LOG_WARNING, "falling back to default bit-depth\n");
+        api = x265_api_get(0);
+    }
+
+    param = api->param_alloc();
+    if (!param)
+    {
+        x265_log(NULL, X265_LOG_ERROR, "param alloc failed\n");
+        return true;
+    }
+
+    if (api->param_default_preset(param, preset, tune) < 0)
     {
         x265_log(NULL, X265_LOG_ERROR, "preset or tune unrecognized\n");
         return true;
@@ -211,9 +217,7 @@
         int long_options_index = -1;
         int c = getopt_long(argc, argv, short_options, long_options, &long_options_index);
         if (c == -1)
-        {
             break;
-        }
 
         switch (c)
         {
@@ -261,7 +265,7 @@
             OPT2("frame-skip", "seek") this->seek = (uint32_t)x265_atoi(optarg, bError);
             OPT("frames") this->framesToBeEncoded = (uint32_t)x265_atoi(optarg, bError);
             OPT("no-progress") this->bProgress = false;
-            OPT("output") bitstreamfn = optarg;
+            OPT("output") outputfn = optarg;
             OPT("input") inputfn = optarg;
             OPT("recon") reconfn = optarg;
             OPT("input-depth") inputBitDepth = (uint32_t)x265_atoi(optarg, bError);
@@ -271,17 +275,19 @@
             OPT("profile") profile = optarg; /* handled last */
             OPT("preset") /* handled above */;
             OPT("tune")   /* handled above */;
+            OPT("output-depth")   /* handled above */;
+            OPT("recon-y4m-exec") reconPlayCmd = optarg;
             OPT("qpfile")
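The `parse()` changes above replace direct `x265_param_*` calls with calls through an API table selected by output bit depth, falling back to the default table when the requested depth is unavailable. The sketch below models only that selection pattern. `DemoApi`, `demoApiGet` and `selectApi` are hypothetical stand-ins so the example compiles without libx265; they are not the library's real types or signatures.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an API table: one set of entry points per build.
struct DemoApi
{
    int bitDepth;
    const char* (*describe)();
};

static const char* describe8()  { return "8bit build"; }
static const char* describe10() { return "10bit build"; }

static const DemoApi demoApi8  = { 8,  describe8 };
static const DemoApi demoApi10 = { 10, describe10 };

// Hypothetical lookup: 0 asks for the default build, and a depth with no
// matching build yields NULL. 'have10bit' models whether a 10bit library
// happens to be installed alongside the default one.
static const DemoApi* demoApiGet(int bitDepth, bool have10bit)
{
    if (bitDepth == 0 || bitDepth == 8)
        return &demoApi8;
    if (bitDepth == 10 && have10bit)
        return &demoApi10;
    return NULL;
}

// Same shape as the fallback in the diff: try the requested depth first,
// then retry with 0 so the caller still gets a usable table.
static const DemoApi* selectApi(int requestedDepth, bool have10bit)
{
    const DemoApi* api = demoApiGet(requestedDepth, have10bit);
    if (!api)
        api = demoApiGet(0, have10bit);
    return api;
}
```

After selection, every later call goes through the returned table (as the diff does with `api->param_alloc()` and `api->param_default_preset()`), so the parameter struct is allocated by the same build that will consume it.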

 
​

x265_1.6.tar.gz/source/x265.def.in -> x265_1.7.tar.gz/source/x265.def.in Changed

 
@@ -14,6 +14,7 @@
 x265_build_info_str
 x265_encoder_headers
 x265_encoder_parameters
+x265_encoder_reconfig
 x265_encoder_encode
 x265_encoder_get_stats
 x265_encoder_log
​

x265_1.6.tar.gz/source/x265.h -> x265_1.7.tar.gz/source/x265.h Changed

@@ -416,7 +416,7 @@
      *
      * Frame encoders are distributed between the available thread pools, and
      * the encoder will never generate more thread pools than frameNumThreads */
-    char*     numaPools;
+    const char* numaPools;
 
     /* Enable wavefront parallel processing, greatly increases parallelism for
      * less than 1% compression efficiency loss. Requires a thread pool, enabled
@@ -458,7 +458,7 @@
      * order. Otherwise the encoder will emit per-stream statistics into the log
      * file when x265_encoder_log is called (presumably at the end of the
      * encode) */
-    char*     csvfn;
+    const char* csvfn;
 
     /*== Internal Picture Specification ==*/
 
@@ -522,12 +522,21 @@
      * performance. Value must be between 1 and 16, default is 3 */
     int       maxNumReferences;
 
+    /* Allow libx265 to emit HEVC bitstreams which do not meet strict level
+     * requirements. Defaults to false */
+    int       bAllowNonConformance;
+
     /*== Bitstream Options ==*/
 
     /* Flag indicating whether VPS, SPS and PPS headers should be output with
      * each keyframe. Default false */
     int       bRepeatHeaders;
 
+    /* Flag indicating whether the encoder should generate start codes (Annex B
+     * format) or length (file format) before NAL units. Default true, Annex B.
+     * Muxers should set this to the correct value */
+    int       bAnnexB;
+
     /* Flag indicating whether the encoder should emit an Access Unit Delimiter
      * NAL at the start of every access unit. Default false */
     int       bEnableAccessUnitDelimiters;
@@ -869,7 +878,7 @@
     int       analysisMode;
 
     /* Filename for analysisMode save/load. Default name is "x265_analysis.dat" */
-    char*     analysisFileName;
+    const char* analysisFileName;
 
     /*== Rate Control ==*/
 
@@ -962,7 +971,7 @@
 
         /* Filename of the 2pass output/input stats file, if unspecified the
          * encoder will default to using x265_2pass.log */
-        char*     statFileName;
+        const char* statFileName;
 
         /* temporally blur quants */
         double    qblur;
@@ -988,6 +997,12 @@
         /* Enable stricter conditions to check bitrate deviations in CBR mode. May compromise 
          * quality to maintain bitrate adherence */
         int bStrictCbr;
+
+        /* Enable adaptive quantization at CU granularity. This parameter specifies 
+         * the minimum CU size at which QP can be adjusted, i.e. Quantization Group 
+         * (QG) size. Allowed values are 64, 32, 16 provided it falls within the 
+         * inclusive range [maxCUSize, minCUSize]. Experimental, default: maxCUSize*/
+        uint32_t qgSize;
     } rc;
 
     /*== Video Usability Information ==*/
@@ -1084,6 +1099,22 @@
          * conformance cropping window to further crop the displayed window */
         int defDispWinBottomOffset;
     } vui;
+
+    /* SMPTE ST 2086 mastering display color volume SEI info, specified as a
+     * string which is parsed when the stream header SEI are emitted. The string
+     * format is "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)" where %hu
+     * are unsigned 16bit integers and %u are unsigned 32bit integers. The SEI
+     * includes X,Y display primaries for RGB channels, white point X,Y and
+     * max,min luminance values. */
+    const char* masteringDisplayColorVolume;
+
+    /* Content light level info SEI, specified as a string which is parsed when
+     * the stream header SEI are emitted. The string format is "%hu,%hu" where
+     * %hu are unsigned 16bit integers. The first value is the max content light
+     * level (or 0 if no maximum is indicated), the second value is the maximum
+     * picture average light level (or 0). */
+    const char* contentLightLevelInfo;
+
 } x265_param;
 
 /* x265_param_alloc:
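
The two HDR fields added in this hunk are plain strings with the scanf-style formats quoted in their comments. The sketch below shows one way such strings could be parsed with the standard `sscanf`, assuming those formats exactly. The struct and function names are hypothetical, and the sample values used in the follow-up note are illustrative, not taken from this diff.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the parsed SMPTE ST 2086 values. */
typedef struct {
    uint16_t gx, gy, bx, by, rx, ry, wpx, wpy; /* primaries and white point */
    uint32_t maxLum, minLum;                   /* max,min luminance */
} mastering_display;

/* Parses "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)".
 * Returns 0 on success, -1 if fewer than ten values were read. */
static int parse_master_display(const char *s, mastering_display *md)
{
    unsigned short v[8];
    unsigned int lum[2];
    int n = sscanf(s, "G(%hu,%hu)B(%hu,%hu)R(%hu,%hu)WP(%hu,%hu)L(%u,%u)",
                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7],
                   &lum[0], &lum[1]);
    if (n != 10)
        return -1;
    md->gx = v[0];  md->gy = v[1];
    md->bx = v[2];  md->by = v[3];
    md->rx = v[4];  md->ry = v[5];
    md->wpx = v[6]; md->wpy = v[7];
    md->maxLum = lum[0];
    md->minLum = lum[1];
    return 0;
}

/* Parses "%hu,%hu": max content light level, then max picture average
 * light level. Returns 0 on success, -1 otherwise. */
static int parse_max_cll(const char *s, uint16_t *maxCll, uint16_t *maxFall)
{
    unsigned short a, b;
    if (sscanf(s, "%hu,%hu", &a, &b) != 2)
        return -1;
    *maxCll = a;
    *maxFall = b;
    return 0;
}
```

A string such as `G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)` parses into ten fields, and `1000,400` parses into a content light level pair. Because the check only counts converted values, trailing garbage after the last number is not detected by this sketch.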
@@ -1162,12 +1193,10 @@
 void x265_picture_init(x265_param *param, x265_picture *pic);
 
 /* x265_max_bit_depth:
- *      Specifies the maximum number of bits per pixel that x265 can input. This
- *      is also the max bit depth that x265 encodes in.  When x265_max_bit_depth
- *      is 8, the internal and input bit depths must be 8.  When
- *      x265_max_bit_depth is 12, the internal and input bit depths can be
- *      either 8, 10, or 12. Note that the internal bit depth must be the same
- *      for all encoders allocated in the same process. */
+ *      Specifies the number of bits per pixel that x265 uses internally to
+ *      represent a pixel, and the bit depth of the output bitstream.
+ *      param->internalBitDepth must be set to this value. x265_max_bit_depth
+ *      will be 8 for default builds, 10 for HIGH_BIT_DEPTH builds. */
 X265_API extern const int x265_max_bit_depth;
 
 /* x265_version_str:
@@ -1214,6 +1243,21 @@
  *      Once flushing has begun, all subsequent calls must pass pic_in as NULL. */
 int x265_encoder_encode(x265_encoder *encoder, x265_nal **pp_nal, uint32_t *pi_nal, x265_picture *pic_in, x265_picture *pic_out);
 
+/* x265_encoder_reconfig:
+ *      various parameters from x265_param are copied.
+ *      this takes effect immediately, on whichever frame is encoded next;
+ *      returns 0 on success, negative on parameter validation error.
+ *
+ *      not all parameters can be changed; see the actual function for a
+ *      detailed breakdown.  since not all parameters can be changed, moving
+ *      from preset to preset may not always fully copy all relevant parameters,
+ *      but should still work usably in practice. however, more so than for
+ *      other presets, many of the speed shortcuts used in ultrafast cannot be
+ *      switched out of; using reconfig to switch between ultrafast and other
+ *      presets is not recommended without a more fine-grained breakdown of
+ *      parameters to take this into account. */
+int x265_encoder_reconfig(x265_encoder *, x265_param *);
+
 /* x265_encoder_get_stats:
  *       returns encoder statistics */
 void x265_encoder_get_stats(x265_encoder *encoder, x265_stats *, uint32_t statsSizeBytes);
@@ -1253,6 +1297,7 @@
     void          (*picture_init)(x265_param*, x265_picture*);
     x265_encoder* (*encoder_open)(x265_param*);
     void          (*encoder_parameters)(x265_encoder*, x265_param*);
+    int           (*encoder_reconfig)(x265_encoder*, x265_param*);
     int           (*encoder_headers)(x265_encoder*, x265_nal**, uint32_t*);
     int           (*encoder_encode)(x265_encoder*, x265_nal**, uint32_t*, x265_picture*, x265_picture*);
     void          (*encoder_get_stats)(x265_encoder*, x265_stats*, uint32_t);
@@ -1275,8 +1320,14 @@
  *   Retrieve the programming interface for a linked x265 library.
  *   May return NULL if no library is available that supports the
 *   requested bit depth. If bitDepth is 0 the function is guaranteed
- *   to return a non-NULL x265_api pointer, from the system default
- *   libx265 */
+ *   to return a non-NULL x265_api pointer, from the linked libx265.
+ *
+ *   If the requested bitDepth is not supported by the linked libx265,
+ *   it will attempt to dynamically bind x265_api_get() from a shared
+ *   library with an appropriate name:
+ *     8bit:  libx265_main.so
+ *     10bit: libx265_main10.so
+ *   Obviously the shared library file extension is platform specific */
 const x265_api* x265_api_get(int bitDepth);
 
 #ifdef __cplusplus
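
The x265_api_get comment above names the shared libraries that are tried when the linked libx265 does not support the requested bit depth. The mapping can be sketched as a small lookup. `fallback_lib_name` is a hypothetical helper for illustration, and the `.so` suffix is only the Linux-style spelling from the comment, since the extension is platform specific.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the x265 API: returns the shared
 * library name listed in the x265_api_get comment for a given bit depth,
 * or NULL when the comment lists no library for that depth. */
static const char *fallback_lib_name(int bitDepth)
{
    switch (bitDepth) {
    case 8:
        return "libx265_main.so";
    case 10:
        return "libx265_main10.so";
    default:
        return NULL;
    }
}
```

On a POSIX system the returned name would typically be handed to `dlopen`, followed by `dlsym` for the `x265_api_get` symbol. That step is left out here so the sketch has no dependency on an installed library.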

 

x265_1.6.tar.gz/source/x265cli.h -> x265_1.7.tar.gz/source/x265cli.h Changed

@@ -30,7 +30,7 @@
 namespace x265 {
 #endif
 
-static const char short_options[] = "o:p:f:F:r:I:i:b:s:t:q:m:hwV?";
+static const char short_options[] = "o:D:P:p:f:F:r:I:i:b:s:t:q:m:hwV?";
 static const struct option long_options[] =
 {
     { "help",                 no_argument, NULL, 'h' },
@@ -47,16 +47,19 @@
     { "no-pme",               no_argument, NULL, 0 },
     { "pme",                  no_argument, NULL, 0 },
     { "log-level",      required_argument, NULL, 0 },
-    { "profile",        required_argument, NULL, 0 },
+    { "profile",        required_argument, NULL, 'P' },
     { "level-idc",      required_argument, NULL, 0 },
     { "high-tier",            no_argument, NULL, 0 },
     { "no-high-tier",         no_argument, NULL, 0 },
+    { "allow-non-conformance",no_argument, NULL, 0 },
+    { "no-allow-non-conformance",no_argument, NULL, 0 },
     { "csv",            required_argument, NULL, 0 },
     { "no-cu-stats",          no_argument, NULL, 0 },
     { "cu-stats",             no_argument, NULL, 0 },
     { "y4m",                  no_argument, NULL, 0 },
     { "no-progress",          no_argument, NULL, 0 },
     { "output",         required_argument, NULL, 'o' },
+    { "output-depth",   required_argument, NULL, 'D' },
     { "input",          required_argument, NULL, 0 },
     { "input-depth",    required_argument, NULL, 0 },
     { "input-res",      required_argument, NULL, 0 },
@@ -181,6 +184,8 @@
     { "colormatrix",    required_argument, NULL, 0 },
     { "chromaloc",      required_argument, NULL, 0 },
     { "crop-rect",      required_argument, NULL, 0 },
+    { "master-display", required_argument, NULL, 0 },
+    { "max-cll",        required_argument, NULL, 0 },
     { "no-dither",            no_argument, NULL, 0 },
     { "dither",               no_argument, NULL, 0 },
     { "no-repeat-headers",    no_argument, NULL, 0 },
@@ -205,6 +210,8 @@
     { "strict-cbr",           no_argument, NULL, 0 },
     { "temporal-layers",      no_argument, NULL, 0 },
     { "no-temporal-layers",   no_argument, NULL, 0 },
+    { "qg-size",        required_argument, NULL, 0 },
+    { "recon-y4m-exec", required_argument, NULL, 0 },
     { 0, 0, 0, 0 },
     { 0, 0, 0, 0 },
     { 0, 0, 0, 0 },
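
The hunks above add the short options `-D` (for --output-depth) and `-P` (for --profile) to a `getopt_long` table. The reduced sketch below shows how such a table pairs long names with short option characters. The `cli_opts` struct and `parse_cli` function are hypothetical and cover only three options; they are not the x265 CLI's actual parsing code.

```c
#include <getopt.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical result struct for this sketch. */
typedef struct {
    int         outputDepth;
    const char *profile;
    const char *output;
} cli_opts;

/* Parses -o/--output, -D/--output-depth and -P/--profile.
 * Returns 0 on success, -1 on an unknown option or missing argument. */
static int parse_cli(int argc, char **argv, cli_opts *o)
{
    /* A colon after a letter means the option takes an argument. */
    static const char short_opts[] = "o:D:P:";
    static const struct option long_opts[] = {
        { "output",       required_argument, NULL, 'o' },
        { "output-depth", required_argument, NULL, 'D' },
        { "profile",      required_argument, NULL, 'P' },
        { 0, 0, 0, 0 }
    };
    int c;

    optind = 1; /* start scanning at the first argument */
    while ((c = getopt_long(argc, argv, short_opts, long_opts, NULL)) != -1) {
        switch (c) {
        case 'o': o->output = optarg; break;
        case 'D': o->outputDepth = atoi(optarg); break;
        case 'P': o->profile = optarg; break;
        default:  return -1;
        }
    }
    return 0;
}
```

Because each long option maps to the same character as its short form, `-D 10` and `--output-depth 10` reach the same `case 'D'` branch.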
@@ -236,6 +243,7 @@
     H0("-V/--version                     Show version info and exit\n");
     H0("\nOutput Options:\n");
     H0("-o/--output <filename>           Bitstream output file name\n");
+    H0("-D/--output-depth 8|10           Output bit depth (also internal bit depth). Default %d\n", param->internalBitDepth);
     H0("   --log-level <string>          Logging level: none error warning info debug full. Default %s\n", x265::logLevelNames[param->logLevel + 1]);
     H0("   --no-progress                 Disable CLI progress reports\n");
     H0("   --[no-]cu-stats               Enable logging stats about distribution of cu across all modes. Default %s\n",OPT(param->bLogCuStats));
@@ -255,9 +263,10 @@
     H0("   --[no-]ssim                   Enable reporting SSIM metric scores. Default %s\n", OPT(param->bEnableSsim));
     H0("   --[no-]psnr                   Enable reporting PSNR metric scores. Default %s\n", OPT(param->bEnablePsnr));
     H0("\nProfile, Level, Tier:\n");
-    H0("   --profile <string>            Enforce an encode profile: main, main10, mainstillpicture\n");
+    H0("-P/--profile <string>            Enforce an encode profile: main, main10, mainstillpicture\n");
     H0("   --level-idc <integer|float>   Force a minimum required decoder level (as '5.0' or '50')\n");
     H0("   --[no-]high-tier              If a decoder level is specified, this modifier selects High tier of that level\n");
+    H0("   --[no-]allow-non-conformance  Allow the encoder to generate profile NONE bitstreams. Default %s\n", OPT(param->bAllowNonConformance));
     H0("\nThreading, performance:\n");
     H0("   --pools <integer,...>         Comma separated thread count per thread pool (pool per NUMA node)\n");
     H0("                                 '-' implies no threads on node, '+' implies one thread per core on node\n");
@@ -352,12 +361,14 @@
     H0("   --analysis-file <filename>    Specify file name used for either dumping or reading analysis data.\n");
     H0("   --aq-mode <integer>           Mode for Adaptive Quantization - 0:none 1:uniform AQ 2:auto variance. Default %d\n", param->rc.aqMode);
     H0("   --aq-strength <float>         Reduces blocking and blurring in flat and textured areas (0 to 3.0). Default %.2f\n", param->rc.aqStrength);
+    H0("   --qg-size <int>               Specifies the size of the quantization group (64, 32, 16). Default %d\n", param->rc.qgSize);
     H0("   --[no-]cutree                 Enable cutree for Adaptive Quantization. Default %s\n", OPT(param->rc.cuTree));
     H1("   --ipratio <float>             QP factor between I and P. Default %.2f\n", param->rc.ipFactor);
     H1("   --pbratio <float>             QP factor between P and B. Default %.2f\n", param->rc.pbFactor);
     H1("   --qcomp <float>               Weight given to predicted complexity. Default %.2f\n", param->rc.qCompress);
-    H1("   --cbqpoffs <integer>          Chroma Cb QP Offset. Default %d\n", param->cbQpOffset);
-    H1("   --crqpoffs <integer>          Chroma Cr QP Offset. Default %d\n", param->crQpOffset);
+    H1("   --qpstep <integer>            The maximum single adjustment in QP allowed to rate control. Default %d\n", param->rc.qpStep);
+    H1("   --cbqpoffs <integer>          Chroma Cb QP Offset [-12..12]. Default %d\n", param->cbQpOffset);
+    H1("   --crqpoffs <integer>          Chroma Cr QP Offset [-12..12]. Default %d\n", param->crQpOffset);
     H1("   --scaling-list <string>       Specify a file containing HM style quant scaling lists or 'default' or 'off'. Default: off\n");
     H1("   --lambda-file <string>        Specify a file containing replacement values for the lambda tables\n");
     H1("                                 MAX_MAX_QP+1 floats for lambda table, then again for lambda2 table\n");
@@ -384,6 +395,9 @@
     H1("   --colormatrix <string>        Specify color matrix setting from undef, bt709, fcc, bt470bg, smpte170m,\n");
     H1("                                 smpte240m, GBR, YCgCo, bt2020nc, bt2020c. Default undef\n");
     H1("   --chromaloc <integer>         Specify chroma sample location (0 to 5). Default of %d\n", param->vui.chromaSampleLocTypeTopField);
+    H0("   --master-display <string>     SMPTE ST 2086 master display color volume info SEI (HDR)\n");
+    H0("                                    format: G(x,y)B(x,y)R(x,y)WP(x,y)L(max,min)\n");
+    H0("   --max-cll <string>            Emit content light level info SEI as \"cll,fall\" (HDR)\n");
     H0("\nBitstream options:\n");
     H0("   --[no-]repeat-headers         Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders));
     H0("   --[no-]info                   Emit SEI identifying encoder and parameters. Default %s\n", OPT(param->bEmitInfoSEI));
@@ -394,6 +408,7 @@
     H1("\nReconstructed video options (debugging):\n");
     H1("-r/--recon <filename>            Reconstructed raw image YUV or Y4M output file name\n");
     H1("   --recon-depth <integer>       Bit-depth of reconstructed raw image file. Defaults to input bit depth, or 8 if Y4M\n");
+    H1("   --recon-y4m-exec <string>     pipe reconstructed frames to Y4M viewer, ex:\"ffplay -i pipe:0 -autoexit\"\n");
     H1("\nExecutable return codes:\n");
     H1("    0 - encode successful\n");
     H1("    1 - unable to parse command line\n");

 