Changes of Revision 12
x265.changes
Changed
@@ -1,4 +1,30 @@
 -------------------------------------------------------------------
+Wed Feb 3 13:22:42 UTC 2016 - idonmez@suse.com
+
+- Update to version 1.9
+  API Changes:
+  * x265_frame_stats returns many additional fields: maxCLL, maxFALL,
+    residual energy, scenecut and latency logging
+  * --qpfile now supports frametype 'K'
+  * x265 now allows CRF ratecontrol in pass N (N greater than or equal to 2)
+  * Chroma subsampling format YUV 4:0:0 is now fully supported and tested
+  New Features:
+  * Quant offsets: This feature allows block level quantization offsets
+    to be specified for every frame. An API-only feature.
+  * --intra-refresh: Keyframes can be replaced by a moving column
+    of intra blocks in non-keyframes.
+  * --limit-modes: Intelligently restricts mode analysis.
+  * --max-luma and --min-luma for luma clipping, optional for HDR use-cases
+  * Emergency denoising is now enabled by default in very low bitrate,
+    VBV encodes
+  Presets and Performance:
+  * Recently added features lookahead-slices, limit-modes, limit-refs
+    have been enabled by default for applicable presets.
+  * The default psy-rd strength has been increased to 2.0
+  * Multi-socket machines now use a single pool of threads that can
+    work cross-socket.
+
+-------------------------------------------------------------------
 Fri Nov 27 18:21:04 UTC 2015 - aloisio@gmx.com

 - Update to version 1.8:
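As a side note for users picking up this update: the features named above are settable through libx265's generic option parser. A minimal sketch, assuming the 1.9 public header; the option names are taken from the changelog, while the chosen values and the surrounding code are purely illustrative:

    #include <x265.h>
    #include <stddef.h>

    // Illustrative only: enable two of the 1.9 features named in the
    // changelog via x265_param_parse(). Error handling is elided.
    static x265_encoder* open_with_19_features()
    {
        x265_param* param = x265_param_alloc();
        x265_param_default_preset(param, "medium", NULL);

        x265_param_parse(param, "intra-refresh", NULL); // PIR instead of keyframes
        x265_param_parse(param, "max-luma", "1000");    // luma clipping (HDR use-cases)

        x265_encoder* enc = x265_encoder_open(param);   // NULL on failure
        x265_param_free(param);                         // encoder keeps its own copy
        return enc;
    }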
x265.spec
Changed
@@ -1,10 +1,10 @@
 # based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/

 Name:           x265
-%define soname  68
+%define soname  79
 %define libname lib%{name}
 %define libsoname %{libname}-%{soname}
-Version:        1.8
+Version:        1.9
 Release:        0
 License:        GPL-2.0+
 Summary:        A free h265/HEVC encoder - encoder binary
@@ -43,35 +43,34 @@
 streams.

 %prep
-%setup -q -n "%{name}_11047/build/linux"
-cd ../..
+%setup -q -n x265_%{version}
 %patch0 -p1
-cd -
+
 %define FAKE_BUILDDATE %(LC_ALL=C date -u -r %{_sourcedir}/%{name}.changes '+%%b %%e %%Y')
-sed -i -e "s/0.0/%{soname}.0/g" ../../source/cmake/version.cmake
+sed -i -e "s/0.0/%{soname}.0/g" source/cmake/version.cmake

 %build
-export CXXFLAGS="%optflags"
-export CFLAGS="%optflags"
-cmake -DCMAKE_INSTALL_PREFIX=/usr -DENABLE_TESTS=ON -G "Unix Makefiles" ../../source
-cmake -DCMAKE_INSTALL_PREFIX=/usr ../../source
-#./make-Makefiles.bash
+export CXXFLAGS="%{optflags}"
+export CFLAGS="%{optflags}"
+
+cd build/linux
+cmake -DCMAKE_INSTALL_PREFIX=%{_prefix} \
+      -DLIB_INSTALL_DIR=%{_lib} \
+      -DENABLE_TESTS=ON \
+      -G "Unix Makefiles" \
+      ../../source
+
 make %{?_smp_mflags} VERBOSE=1

 %install
+cd build/linux
 %makeinstall
-%ifarch x86_64
-  mv "%{buildroot}/usr/lib" "%{buildroot}%{_libdir}"
-%endif
 rm -f %{buildroot}%{_libdir}/%{libname}.a

 echo "%{libname}-%{soname}" > %{_sourcedir}/baselibs.conf

-%clean
-%{?buildroot:%__rm -rf "%{buildroot}"}
-
 %post -n %{libsoname} -p /sbin/ldconfig
 %postun -n %{libsoname} -p /sbin/ldconfig
x265_1.8.tar.gz/.hg_archival.txt -> x265_1.9.tar.gz/.hg_archival.txt
Changed
@@ -1,5 +1,4 @@
 repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
-node: 5dcc9d3a928c400b41a3547d7bfee10340519e56
+node: 1d3b6e448e01ec40b392ef78b7e55a86249fbe68
 branch: stable
-latesttag: 1.8
-latesttagdistance: 1
+tag: 1.9
x265_1.8.tar.gz/doc/reST/cli.rst -> x265_1.9.tar.gz/doc/reST/cli.rst
Changed
@@ -84,8 +84,8 @@
    it adds one line per run. If :option:`--csv-log-level` is greater than
    0, it writes one line per frame. Default none

-   When frame level logging is enabled, several frame performance
-   statistics are listed:
+   Several frame performance statistics are available when
+   :option:`--csv-log-level` is greater than or equal to 2:

    **DecideWait ms** number of milliseconds the frame encoder had to
    wait, since the previous frame was retrieved by the API thread,
@@ -202,15 +202,29 @@
    "-"       - same as "none"
    "10"      - allocate one pool, using up to 10 cores on node 0
    "-,+"     - allocate one pool, using all cores on node 1
-   "+,-,+"   - allocate two pools, using all cores on nodes 0 and 2
-   "+,-,+,-" - allocate two pools, using all cores on nodes 0 and 2
-   "-,*"     - allocate three pools, using all cores on nodes 1, 2 and 3
+   "+,-,+"   - allocate one pool, using only cores on nodes 0 and 2
+   "+,-,+,-" - allocate one pool, using only cores on nodes 0 and 2
+   "-,*"     - allocate one pool, using all cores on nodes 1, 2 and 3
    "8,8,8,8" - allocate four pools with up to 8 threads in each pool
-
-   The total number of threads will be determined by the number of threads
-   assigned to all nodes. The worker threads will each be given affinity for
-   their node, they will not be allowed to migrate between nodes, but they
-   will be allowed to move between CPU cores within their node.
+   "8,+,+,+" - allocate two pools, the first with 8 threads on node 0,
+               and the second with all cores on nodes 1, 2 and 3
+
+   A thread pool dedicated to a given NUMA node is enabled only when the
+   number of threads to be created on that NUMA node is explicitly mentioned
+   in that corresponding position with the --pools option. Else, all threads
+   are spawned from a single pool. The total number of threads will be
+   determined by the number of threads assigned to the enabled NUMA nodes for
+   that pool. The worker threads are given affinity to all the enabled
+   NUMA nodes for that pool and may migrate between them, unless explicitly
+   specified as described above.
+
+   In the case that any threadpool has more than 64 threads, the threadpool
+   may be broken down into multiple pools of 64 threads each; on 32-bit
+   machines, this number is 32. All pools are given affinity to the NUMA
+   nodes on which the original pool had affinity. For performance reasons,
+   the last thread pool is spawned only if it has more than 32 threads for
+   64-bit machines, or 16 for 32-bit machines. If the total number of threads
+   in the system doesn't obey this constraint, we may spawn fewer threads
+   than cores, which has been empirically shown to be better for performance.

    If the four pool features: :option:`--wpp`, :option:`--pmode`,
    :option:`--pme` and :option:`--lookahead-slices` are all disabled,
@@ -219,10 +233,6 @@
    If "none" is specified, then all four of the thread pool features are
    implicitly disabled.

-   Multiple thread pools will be allocated for any NUMA node with more than
-   64 logical CPU cores. But any given thread pool will always use at most
-   one NUMA node.
-
    Frame encoders are distributed between the available thread pools,
    and the encoder will never generate more thread pools than
    :option:`--frame-threads`. The pools are used for WPP and for
@@ -238,8 +248,12 @@
    system, a POSIX build of libx265 without libnuma will be less work
    efficient. See :ref:`thread pools <pools>` for more detail.

-   Default "", one thread is allocated per detected hardware thread
-   (logical CPU cores) and one thread pool per NUMA node.
+   Default "", one pool is created across all available NUMA nodes, with
+   one thread allocated per detected hardware thread
+   (logical CPU cores). In the case that the total number of threads is more
+   than the maximum size that ATOMIC operations can handle (32 for 32-bit
+   compiles, and 64 for 64-bit compiles), multiple thread pools may be
+   spawned subject to the performance constraint described above.

    Note that the string value will need to be escaped or quoted to
    protect against shell expansion on many platforms

@@ -353,7 +367,7 @@
    **CLI ONLY**

-.. option:: --total-frames <integer>
+.. option:: --frames <integer>

    The number of frames intended to be encoded. It may be left
    unspecified, but when it is specified rate control can make use of
@@ -377,15 +391,15 @@
 .. option:: --input-csp <integer|string>

-   YUV only: Source color space. Only i420, i422, and i444 are
-   supported at this time. The internal color space is always the
-   same as the source color space (libx265 does not support any color
-   space conversions).
+   Chroma Subsampling (YUV only): Only 4:0:0 (monochrome), 4:2:0, 4:2:2,
+   and 4:4:4 are supported at this time. The chroma subsampling format of
+   your input must match your desired output chroma subsampling format
+   (libx265 will not perform any chroma subsampling conversion), and it
+   must be supported by the HEVC profile you have specified.

-   0. i400
-   1. i420 **(default)**
-   2. i422
-   3. i444
+   0. i400 (4:0:0 monochrome) - Not supported by Main or Main10 profiles
+   1. i420 (4:2:0 default)    - Supported by all HEVC profiles
+   2. i422 (4:2:2)            - Not supported by Main, Main10 and Main12 profiles
+   3. i444 (4:4:4)            - Supported by Main 4:4:4, Main 4:4:4 10, Main 4:4:4 12, Main 4:4:4 16 Intra profiles
    4. nv12
    5. nv16
@@ -436,8 +450,8 @@
    depth of the encoder. If the requested bit depth is not the bit
    depth of the linked libx265, it will attempt to bind libx265_main
    for an 8bit encoder, libx265_main10 for a 10bit encoder, or
-   libx265_main12 for a 12bit encoder (EXPERIMENTAL), with the
-   same API version as the linked libx265.
+   libx265_main12 for a 12bit encoder, with the same API version as the
+   linked libx265.

    If the output depth is not specified but :option:`--profile` is
    specified, the output depth will be derived from the profile name.
@@ -486,13 +500,6 @@
    The CLI application will derive the output bit depth from the
    profile name if :option:`--output-depth` is not specified.

-.. note::
-
-   All 12bit presets are extremely unstable, do not use them yet.
-   16bit is not supported at all, but those profiles are included
-   because it is possible for libx265 to make bitstreams compatible
-   with them.
-
 .. option:: --level-idc <integer|float>

    Minimum decoder requirement level. Defaults to 0, which implies
@@ -606,7 +613,8 @@
    +-------+---------------------------------------------------------------+
    | Level | Description                                                   |
    +=======+===============================================================+
-   | 0     | sa8d mode and split decisions, intra w/ source pixels         |
+   | 0     | sa8d mode and split decisions, intra w/ source pixels,        |
+   |       | currently not supported                                       |
    +-------+---------------------------------------------------------------+
    | 1     | recon generated (better intra), RDO merge/skip selection      |
    +-------+---------------------------------------------------------------+
@@ -677,7 +685,16 @@
    (within your decoder level limits) if you enable one or both of
    these flags.

-   This feature is EXPERIMENTAL and functional at all RD levels.
+   Default 3.
+
+.. option:: --limit-modes, --no-limit-modes
+
+   When enabled, limit-modes will limit modes analyzed for each CU using cost
+   metrics from the 4 sub-CUs. When multiple inter modes like :option:`--rect`
+   and/or :option:`--amp` are enabled, this feature will use motion cost
+   heuristics from the 4 sub-CUs to bypass modes that are unlikely to be the
+   best choice. This can significantly improve performance when :option:`--rect`
+   and/or :option:`--amp` are enabled, at minimal compression efficiency loss.

 .. option:: --rect, --no-rect
@@ -1049,9 +1066,9 @@
    energy of the source image in the encoded image at the expense of
    compression efficiency. It only has effect on presets which use
    RDO-based mode decisions (:option:`--rd` 3 and above). 1.0 is a
-   typical value. Default 0.3
+   typical value. Default 2.0

-   **Range of values:** 0 .. 2.0
+   **Range of values:** 0 .. 5.0

 .. option:: --psy-rdoq <float>
@@ -1076,7 +1093,8 @@
    Max intra period in frames. A special case of infinite-gop (single
    keyframe at the beginning of the stream) can be triggered with
-   argument -1. Use 1 to force all-intra. Default 250
+   argument -1. Use 1 to force all-intra. When intra-refresh is enabled
+   it specifies the interval between which refresh sweeps happen. Default 250

 .. option:: --min-keyint, -i <integer>
@@ -1095,6 +1113,14 @@
    :option:`--scenecut` 0 or :option:`--no-scenecut` disables adaptive
    I frame placement. Default 40

+.. option:: --intra-refresh
+
+   Enables Periodic Intra Refresh (PIR) instead of keyframe insertion.
+   PIR can replace keyframes by inserting a column of intra blocks in
+   non-keyframes, that move across the video from one side to the other
+   and thereby refresh the image, but over a period of multiple
+   frames instead of a single keyframe.
+
 .. option:: --rc-lookahead <integer>

    Number of frames for slice-type decision lookahead (a key
@@ -1108,21 +1134,31 @@
 .. option:: --lookahead-slices <0..16>

-   Use multiple worker threads to measure the estimated cost of each
-   frame within the lookahead. When :option:`--b-adapt` is 2, most
-   frame cost estimates will be performed in batch mode, many cost
-   estimates at the same time, and lookahead-slices is ignored for
-   batched estimates. The effect on performance can be quite small.
-   The higher this parameter, the less accurate the frame costs will be
-   (since context is lost across slice boundaries) which will result in
-   less accurate B-frame and scene-cut decisions.
-
-   The encoder may internally lower the number of slices to ensure
-   each slice codes at least 10 16x16 rows of lowres blocks. If slices
-   are used in lookahead, they are logged in the list of tools as
-   *lslices*.
-
-   **Values:** 0 - disabled (default). 1 is the same as 0. Max 16
+   Use multiple worker threads to measure the estimated cost of each frame
+   within the lookahead. The frame is divided into the specified number of
+   slices, and one thread is launched per slice. When :option:`--b-adapt` is
+   2, most frame cost estimates will be performed in batch mode (many cost
+   estimates at the same time) and lookahead-slices is ignored for batched
+   estimates; it may still be used for single cost estimations. The higher this
+   parameter, the less accurate the frame costs will be (since context is lost
+   across slice boundaries) which will result in less accurate B-frame and
+   scene-cut decisions. The effect on performance can be significant, especially
+   on systems with many threads.
+
+   The encoder may internally lower the number of slices or disable
+   slicing to ensure each slice codes at least 10 16x16 rows of lowres
+   blocks, to minimize the impact on quality. For example, for 720p and
+   1080p videos, the number of slices is capped to 4 and 6, respectively.
+   For resolutions lower than 720p, slicing is auto-disabled.
+
+   If slices are used in lookahead, they are logged in the list of tools
+   as *lslices*.
+
+   **Values:** 0 - disabled. 1 is the same as 0. Max 16.
+   Default: 8 for ultrafast, superfast, faster, fast, medium
+   4 for slow, slower
+   disabled for veryslow, placebo

 .. option:: --b-adapt <integer>
@@ -1198,6 +1234,13 @@
    is also non-zero. Both vbv-bufsize and vbv-maxrate are required to
    enable VBV in CRF mode. Default 0 (disabled)

+   Note that when VBV is enabled (with a valid :option:`--vbv-bufsize`),
+   VBV emergency denoising is turned on. This will turn on aggressive
+   denoising at the frame level when frame QP > QP_MAX_SPEC (51), drastically
+   reducing bitrate and allowing ratecontrol to assign lower QPs for
+   the following frames. The visual effect is blurring, but it removes
+   significant blocking/displacement artifacts.
+
 .. option:: --vbv-init <float>

    Initial buffer occupancy. The portion of the decode buffer which
@@ -1405,10 +1448,11 @@
    framenumber frametype QP

-   Frametype can be one of [I,i,P,B,b]. **B** is a referenced B frame,
+   Frametype can be one of [I,i,K,P,B,b]. **B** is a referenced B frame,
    **b** is an unreferenced B frame. **I** is a keyframe (random
-   access point) while **i** is a I frame that is not a keyframe
-   (references are not broken).
+   access point) while **i** is an I frame that is not a keyframe
+   (references are not broken). **K** implies **I** if the closed_gop option
+   is enabled, and **i** otherwise.

    Specifying QP (integer) is optional, and if specified they are
    clamped within the encoder to qpmin/qpmax.
@@ -1551,7 +1595,7 @@
 .. option:: --colorprim <integer|string>

-   Specify color primitive to use when converting to RGB. Default
+   Specify color primaries to use when converting to RGB. Default
    undefined (not signaled)

    1. bt709
@@ -1621,7 +1665,7 @@
    Example for D65P3 1000-nits:

-      G(13200,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)
+      G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)

    Note that this string value will need to be escaped or quoted to
    protect against shell expansion on many platforms. No default.
@@ -1640,6 +1684,16 @@
    Note that this string value will need to be escaped or quoted to
    protect against shell expansion on many platforms. No default.

+.. option:: --min-luma <integer>
+
+   Minimum luma value allowed for input pictures. Any values below min-luma
+   are clipped. Experimental. No default.
+
+.. option:: --max-luma <integer>
+
+   Maximum luma value allowed for input pictures. Any values above max-luma
+   are clipped. Experimental. No default.
+
 Bitstream options
 =================
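To make the pool syntax above concrete, here is a minimal sketch of driving the same strings through the API (the option strings are quoted from the documentation above; the helper function itself is hypothetical):

    #include <x265.h>

    // "8,+,+,+" allocates two pools per the text above: 8 threads pinned to
    // NUMA node 0, plus one pool spanning all cores of nodes 1-3.
    static void configure_threading(x265_param* param)
    {
        x265_param_parse(param, "pools", "8,+,+,+");
        x265_param_parse(param, "frames", "500"); // renamed from --total-frames
    }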
x265_1.8.tar.gz/doc/reST/presets.rst -> x265_1.9.tar.gz/doc/reST/presets.rst
Changed
@@ -6,76 +6,83 @@
 Presets
 =======

-x265 has a number of predefined :option:`--preset` options that make
-trade-offs between encode speed (encoded frames per second) and
+x265 has ten predefined :option:`--preset` options that optimize the
+trade-off between encoding speed (encoded frames per second) and
 compression efficiency (quality per bit in the bitstream). The default
-preset is medium, it does a reasonably good job of finding the best
-possible quality without spending enormous CPU cycles looking for the
-absolute most efficient way to achieve that quality. As you go higher
-than medium, the encoder takes shortcuts to improve performance at the
-expense of quality and compression efficiency. As you go lower than
-medium, the encoder tries harder and harder to achieve the best quailty
-per bit compression ratio.
-
-The presets adjust encoder parameters to affect these trade-offs.
-
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-|              | ultrafast | superfast | veryfast | faster | fast | medium | slow | slower | veryslow | placebo |
-+==============+===========+===========+==========+========+======+========+======+========+==========+=========+
-| ctu          | 32        | 32        | 32       | 64     | 64   | 64     | 64   | 64     | 64       | 64      |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| min-cu-size  | 16        | 8         | 8        | 8      | 8    | 8      | 8    | 8      | 8        | 8       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| bframes      | 3         | 3         | 4        | 4      | 4    | 4      | 4    | 8      | 8        | 8       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| b-adapt      | 0         | 0         | 0        | 0      | 0    | 2      | 2    | 2      | 2        | 2       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| rc-lookahead | 5         | 10        | 15       | 15     | 15   | 20     | 25   | 30     | 40       | 60      |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| scenecut     | 0         | 40        | 40       | 40     | 40   | 40     | 40   | 40     | 40       | 40      |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| refs         | 1         | 1         | 1        | 1      | 2    | 3      | 3    | 3      | 5        | 5       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| me           | dia       | hex       | hex      | hex    | hex  | hex    | star | star   | star     | star    |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| merange      | 57        | 57        | 57       | 57     | 57   | 57     | 57   | 57     | 57       | 92      |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| subme        | 0         | 1         | 1        | 2      | 2    | 2      | 3    | 3      | 4        | 5       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| rect         | 0         | 0         | 0        | 0      | 0    | 0      | 1    | 1      | 1        | 1       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| amp          | 0         | 0         | 0        | 0      | 0    | 0      | 0    | 1      | 1        | 1       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| max-merge    | 2         | 2         | 2        | 2      | 2    | 2      | 3    | 3      | 4        | 5       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| early-skip   | 1         | 1         | 1        | 1      | 0    | 0      | 0    | 0      | 0        | 0       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| fast-intra   | 1         | 1         | 1        | 1      | 1    | 0      | 0    | 0      | 0        | 0       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| b-intra      | 0         | 0         | 0        | 0      | 0    | 0      | 0    | 1      | 1        | 1       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| sao          | 0         | 0         | 1        | 1      | 1    | 1      | 1    | 1      | 1        | 1       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| signhide     | 0         | 1         | 1        | 1      | 1    | 1      | 1    | 1      | 1        | 1       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| weightp      | 0         | 0         | 1        | 1      | 1    | 1      | 1    | 1      | 1        | 1       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| weightb      | 0         | 0         | 0        | 0      | 0    | 0      | 0    | 1      | 1        | 1       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| aq-mode      | 0         | 0         | 1        | 1      | 1    | 1      | 1    | 1      | 1        | 1       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| cuTree       | 0         | 0         | 0        | 0      | 1    | 1      | 1    | 1      | 1        | 1       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| rdLevel      | 2         | 2         | 2        | 2      | 2    | 3      | 4    | 6      | 6        | 6       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| rdoq-level   | 0         | 0         | 0        | 0      | 0    | 0      | 2    | 2      | 2        | 2       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| tu-intra     | 1         | 1         | 1        | 1      | 1    | 1      | 1    | 2      | 3        | 4       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-| tu-inter     | 1         | 1         | 1        | 1      | 1    | 1      | 1    | 2      | 3        | 4       |
-+--------------+-----------+-----------+----------+--------+------+--------+------+--------+----------+---------+
-
-Placebo mode enables transform-skip prediction evaluation.
+preset is medium. It does a reasonably good job of finding the best
+possible quality without spending excessive CPU cycles looking for the
+absolute most efficient way to achieve that quality. When you use
+faster presets, the encoder takes shortcuts to improve performance at
+the expense of quality and compression efficiency. When you use slower
+presets, x265 tests more encoding options, using more computations to
+achieve the best quality at your selected bit rate (or in the case of
+--crf rate control, the lowest bit rate at the selected quality).
+
+The presets adjust encoder parameters as shown in the following table.
+Any parameters below that are specified in your command-line will be
+changed from the value specified by the preset.
+
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+|                 |ultrafast |superfast |veryfast |faster |fast |medium |slow |slower |veryslow |placebo |
++=================+==========+==========+=========+=======+=====+=======+=====+=======+=========+========+
+| ctu             | 32       | 32       | 64      | 64    | 64  | 64    | 64  | 64    | 64      | 64     |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| min-cu-size     | 16       | 8        | 8       | 8     | 8   | 8     | 8   | 8     | 8       | 8      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| bframes         | 3        | 3        | 4       | 4     | 4   | 4     | 4   | 8     | 8       | 8      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| b-adapt         | 0        | 0        | 0       | 0     | 0   | 2     | 2   | 2     | 2       | 2      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| rc-lookahead    | 5        | 10       | 15      | 15    | 15  | 20    | 25  | 30    | 40      | 60     |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| lookahead-slices| 8        | 8        | 8       | 8     | 8   | 8     | 4   | 4     | 1       | 1      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| scenecut        | 0        | 40       | 40      | 40    | 40  | 40    | 40  | 40    | 40      | 40     |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| ref             | 1        | 1        | 2       | 2     | 3   | 3     | 4   | 4     | 5       | 5      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| limit-refs      | 0        | 0        | 3       | 3     | 3   | 3     | 3   | 2     | 1       | 0      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| me              | dia      | hex      | hex     | hex   | hex | hex   | star| star  | star    | star   |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| merange         | 57       | 57       | 57      | 57    | 57  | 57    | 57  | 57    | 57      | 92     |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| subme           | 0        | 1        | 1       | 2     | 2   | 2     | 3   | 3     | 4       | 5      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| rect            | 0        | 0        | 0       | 0     | 0   | 0     | 1   | 1     | 1       | 1      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| amp             | 0        | 0        | 0       | 0     | 0   | 0     | 0   | 1     | 1       | 1      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| limit-modes     | 0        | 0        | 0       | 0     | 0   | 0     | 1   | 1     | 1       | 0      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| max-merge       | 2        | 2        | 2       | 2     | 2   | 2     | 3   | 3     | 4       | 5      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| early-skip      | 1        | 1        | 1       | 1     | 0   | 0     | 0   | 0     | 0       | 0      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| fast-intra      | 1        | 1        | 1       | 1     | 1   | 0     | 0   | 0     | 0       | 0      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| b-intra         | 0        | 0        | 0       | 0     | 0   | 0     | 0   | 1     | 1       | 1      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| sao             | 0        | 0        | 1       | 1     | 1   | 1     | 1   | 1     | 1       | 1      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| signhide        | 0        | 1        | 1       | 1     | 1   | 1     | 1   | 1     | 1       | 1      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| weightp         | 0        | 0        | 1       | 1     | 1   | 1     | 1   | 1     | 1       | 1      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| weightb         | 0        | 0        | 0       | 0     | 0   | 0     | 0   | 1     | 1       | 1      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| aq-mode         | 0        | 0        | 1       | 1     | 1   | 1     | 1   | 1     | 1       | 1      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| cuTree          | 1        | 1        | 1       | 1     | 1   | 1     | 1   | 1     | 1       | 1      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| rdLevel         | 2        | 2        | 2       | 2     | 2   | 3     | 4   | 6     | 6       | 6      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| rdoq-level      | 0        | 0        | 0       | 0     | 0   | 0     | 2   | 2     | 2       | 2      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| tu-intra        | 1        | 1        | 1       | 1     | 1   | 1     | 1   | 2     | 3       | 4      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+
+| tu-inter        | 1        | 1        | 1       | 1     | 1   | 1     | 1   | 2     | 3       | 4      |
++-----------------+----------+----------+---------+-------+-----+-------+-----+-------+---------+--------+

 .. _tunings:
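A minimal sketch of the override rule stated before the table (the preset and the overridden option are arbitrary examples, not part of the diff):

    #include <x265.h>
    #include <stddef.h>

    // The preset fills in the table's defaults; anything set afterwards
    // replaces the preset's value, exactly as on the command line.
    static x265_param* slow_with_more_refs()
    {
        x265_param* param = x265_param_alloc();
        x265_param_default_preset(param, "slow", NULL); // ref=4, limit-refs=3, ...
        x265_param_parse(param, "ref", "5");            // explicit setting wins
        return param;
    }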
x265_1.8.tar.gz/source/CMakeLists.txt -> x265_1.9.tar.gz/source/CMakeLists.txt
Changed
@@ -30,7 +30,7 @@
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)

 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 68)
+set(X265_BUILD 79)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
@@ -45,12 +45,14 @@
 set(POWER_ALIASES ppc64 ppc64le)
 list(FIND POWER_ALIASES "${SYSPROC}" POWERMATCH)
 if("${SYSPROC}" STREQUAL "" OR X86MATCH GREATER "-1")
-    message(STATUS "Detected x86 target processor")
     set(X86 1)
     add_definitions(-DX265_ARCH_X86=1)
     if("${CMAKE_SIZEOF_VOID_P}" MATCHES 8)
         set(X64 1)
         add_definitions(-DX86_64=1)
+        message(STATUS "Detected x86_64 target processor")
+    else()
+        message(STATUS "Detected x86 target processor")
     endif()
 elseif(POWERMATCH GREATER "-1")
     message(STATUS "Detected POWER target processor")
@@ -71,23 +73,27 @@
     if(LIBRT)
         list(APPEND PLATFORM_LIBS rt)
     endif()
+    mark_as_advanced(LIBRT)
     find_library(LIBDL dl)
     if(LIBDL)
         list(APPEND PLATFORM_LIBS dl)
     endif()
-    find_package(Numa)
-    if(NUMA_FOUND)
-        link_directories(${NUMA_LIBRARY_DIR})
-        list(APPEND CMAKE_REQUIRED_LIBRARIES numa)
-        check_symbol_exists(numa_node_of_cpu numa.h NUMA_V2)
-        if(NUMA_V2)
-            add_definitions(-DHAVE_LIBNUMA)
-            message(STATUS "libnuma found, building with support for NUMA nodes")
-            list(APPEND PLATFORM_LIBS numa)
-            include_directories(${NUMA_INCLUDE_DIR})
+    option(ENABLE_LIBNUMA "Enable libnuma usage (Linux only)" ON)
+    if(ENABLE_LIBNUMA)
+        find_package(Numa)
+        if(NUMA_FOUND)
+            link_directories(${NUMA_LIBRARY_DIR})
+            list(APPEND CMAKE_REQUIRED_LIBRARIES numa)
+            check_symbol_exists(numa_node_of_cpu numa.h NUMA_V2)
+            if(NUMA_V2)
+                add_definitions(-DHAVE_LIBNUMA)
+                message(STATUS "libnuma found, building with support for NUMA nodes")
+                list(APPEND PLATFORM_LIBS numa)
+                include_directories(${NUMA_INCLUDE_DIR})
+            endif()
         endif()
-    endif()
-    mark_as_advanced(LIBRT NUMA_FOUND)
+        mark_as_advanced(NUMA_FOUND)
+    endif(ENABLE_LIBNUMA)
     option(NO_ATOMICS "Use a slow mutex to replace atomics" OFF)
     if(NO_ATOMICS)
         add_definitions(-DNO_ATOMICS=1)
@@ -157,6 +163,7 @@
 if(GCC)
     add_definitions(-Wall -Wextra -Wshadow)
     add_definitions(-D__STDC_LIMIT_MACROS=1)
+    add_definitions(-std=gnu++98)
     if(ENABLE_PIC)
         add_definitions(-fPIC)
     endif(ENABLE_PIC)
@@ -379,16 +386,19 @@
 option(ENABLE_VTUNE "Enable Vtune profiling instrumentation" OFF)
 if(ENABLE_VTUNE)
-    add_definitions(-DENABLE_VTUNE)
-    include_directories($ENV{VTUNE_AMPLIFIER_XE_2015_DIR}/include)
-    list(APPEND PLATFORM_LIBS vtune)
-    link_directories($ENV{VTUNE_AMPLIFIER_XE_2015_DIR}/lib64)
-    if(WIN32)
-        list(APPEND PLATFORM_LIBS libittnotify.lib)
-    else()
-        list(APPEND PLATFORM_LIBS libittnotify.a dl)
-    endif()
-    add_subdirectory(profile/vtune)
+    find_package(Vtune)
+    if(VTUNE_FOUND)
+        add_definitions(-DENABLE_VTUNE)
+        include_directories(${VTUNE_INCLUDE_DIR})
+        list(APPEND PLATFORM_LIBS vtune)
+        link_directories(${VTUNE_LIBRARY_DIR})
+        if(WIN32)
+            list(APPEND PLATFORM_LIBS libittnotify.lib)
+        else()
+            list(APPEND PLATFORM_LIBS libittnotify.a dl)
+        endif()
+        add_subdirectory(profile/vtune)
+    endif(VTUNE_FOUND)
 endif(ENABLE_VTUNE)

 option(DETAILED_CU_STATS "Enable internal profiling of encoder work" OFF)
@@ -455,6 +465,9 @@
 if(ENABLE_SHARED)
     add_library(x265-shared SHARED "${PROJECT_BINARY_DIR}/x265.def" ${YASM_OBJS}
                 ${X265_RC_FILE} $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common>)
+    if(EXTRA_LIB)
+        target_link_libraries(x265-shared ${EXTRA_LIB})
+    endif()
     target_link_libraries(x265-shared ${PLATFORM_LIBS})
     if(MSVC)
         set_target_properties(x265-shared PROPERTIES OUTPUT_NAME libx265)
@@ -465,6 +478,8 @@
     set_target_properties(x265-shared PROPERTIES VERSION ${X265_BUILD})
     if(APPLE)
         set_target_properties(x265-shared PROPERTIES MACOSX_RPATH 1)
+    elseif(CYGWIN)
+        # Cygwin is not officially supported or tested. MinGW with msys is recommended.
     else()
         list(APPEND LINKER_OPTIONS "-Wl,-Bsymbolic,-znoexecstack")
     endif()
@@ -480,9 +495,6 @@
             ARCHIVE DESTINATION ${LIB_INSTALL_DIR}
             RUNTIME DESTINATION ${BIN_INSTALL_DIR})
     endif()
-    if(EXTRA_LIB)
-        target_link_libraries(x265-shared ${EXTRA_LIB})
-    endif()
     if(LINKER_OPTIONS)
         # set_target_properties can't do list expansion
         string(REPLACE ";" " " LINKER_OPTION_STR "${LINKER_OPTIONS}")
x265_1.9.tar.gz/source/cmake/FindVtune.cmake
Added
@@ -0,0 +1,25 @@
+# Module for locating Vtune
+#
+# Read-only variables
+#   VTUNE_FOUND: Indicates that the library has been found
+#   VTUNE_INCLUDE_DIR: Points to the vtunes include dir
+#   VTUNE_LIBRARY_DIR: Points to the directory with libraries
+#
+# Copyright (c) 2015 Pradeep Ramachandran
+
+include(FindPackageHandleStandardArgs)
+
+find_path(VTUNE_DIR
+    if(UNIX)
+        NAMES amplxe-vars.sh
+    else()
+        NAMES amplxe-vars.bat
+    endif(UNIX)
+    HINTS $ENV{VTUNE_AMPLIFIER_XE_2016_DIR} $ENV{VTUNE_AMPLIFIER_XE_2015_DIR}
+    DOC "Vtune root directory")
+
+set (VTUNE_INCLUDE_DIR ${VTUNE_DIR}/include)
+set (VTUNE_LIBRARY_DIR ${VTUNE_DIR}/lib64)
+
+mark_as_advanced(VTUNE_DIR)
+find_package_handle_standard_args(VTUNE REQUIRED_VARS VTUNE_DIR VTUNE_INCLUDE_DIR VTUNE_LIBRARY_DIR)
x265_1.8.tar.gz/source/common/bitstream.cpp -> x265_1.9.tar.gz/source/common/bitstream.cpp
Changed
@@ -1,5 +1,6 @@
 #include "common.h"
 #include "bitstream.h"
+#include "threading.h"

 using namespace X265_NS;

@@ -112,16 +113,13 @@

 void SyntaxElementWriter::writeUvlc(uint32_t code)
 {
-    uint32_t length = 1;
-    uint32_t temp = ++code;
+    ++code;

-    X265_CHECK(temp, "writing -1 code, will cause infinite loop\n");
+    X265_CHECK(code, "writing -1 code, will cause infinite loop\n");

-    while (1 != temp)
-    {
-        temp >>= 1;
-        length += 2;
-    }
+    unsigned long idx;
+    CLZ(idx, code);
+    uint32_t length = (uint32_t)idx * 2 + 1;

     // Take care of cases where length > 32
     m_bitIf->write(0, length >> 1);
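For readers checking the new writeUvlc(): the unsigned Exp-Golomb code for v spends floor(log2(v+1)) prefix bits, so its total length is 2*floor(log2(v+1)) + 1. The 1.8 loop counted that two bits per iteration; 1.9 reads it straight off a count-leading-zeros instruction via the CLZ macro. A standalone sketch of the same arithmetic, assuming a GCC/Clang builtin in place of CLZ:

    #include <cstdint>

    // Bit length of the unsigned Exp-Golomb code for v (v >= 0).
    static uint32_t uvlcLength(uint32_t v)
    {
        uint32_t code = v + 1;                    // writeUvlc's ++code
        uint32_t idx  = 31 - __builtin_clz(code); // index of highest set bit
        return 2 * idx + 1;
    }
    // uvlcLength(0) == 1, uvlcLength(1) == 3, uvlcLength(6) == 5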
x265_1.8.tar.gz/source/common/bitstream.h -> x265_1.9.tar.gz/source/common/bitstream.h
Changed
@@ -2,6 +2,7 @@
  * Copyright (C) 2013 x265 project
  *
  * Author: Steve Borho <steve@borho.org>
+ *         Min Chen <chenm003@163.com>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
x265_1.8.tar.gz/source/common/common.h -> x265_1.9.tar.gz/source/common/common.h
Changed
@@ -2,6 +2,7 @@
  * Copyright (C) 2013 x265 project
  *
  * Authors: Deepthi Nandakumar <deepthi@multicorewareinc.com>
+ *          Min Chen <chenm003@163.com>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -134,10 +135,10 @@
 typedef int32_t  ssum2_t; // Signed sum
 #endif // if HIGH_BIT_DEPTH

-#if X265_DEPTH <= 10
-typedef uint32_t sse_ret_t;
+#if X265_DEPTH < 10
+typedef uint32_t sse_t;
 #else
-typedef uint64_t sse_ret_t;
+typedef uint64_t sse_t;
 #endif

 #ifndef NULL
@@ -214,6 +215,7 @@

 #define X265_MALLOC(type, count)    (type*)x265_malloc(sizeof(type) * (count))
 #define X265_FREE(ptr)              x265_free(ptr)
+#define X265_FREE_ZERO(ptr)         x265_free(ptr); (ptr) = NULL
 #define CHECKED_MALLOC(var, type, count) \
 { \
     var = (type*)x265_malloc(sizeof(type) * (count)); \
@@ -317,6 +319,9 @@
 #define CHROMA_V_SHIFT(x) (x == X265_CSP_I420)
 #define X265_MAX_PRED_MODE_PER_CTU 85 * 2 * 8

+#define MAX_NUM_TR_COEFFS     MAX_TR_SIZE * MAX_TR_SIZE // Maximum number of transform coefficients, for a 32x32 transform
+#define MAX_NUM_TR_CATEGORIES 16                        // 32, 16, 8, 4 transform categories each for luma and chroma
+
 namespace X265_NS {

 enum { SAO_NUM_OFFSET = 4 };
@@ -366,25 +371,6 @@
         delete[] ctuParam[2];
     }
 };
-
-/* Stores inter analysis data for a single frame */
-struct analysis_inter_data
-{
-    int32_t*  ref;
-    uint8_t*  depth;
-    uint8_t*  modes;
-    uint32_t* bestMergeCand;
-};
-
-/* Stores intra analysis data for a single frame. This struct needs better packing */
-struct analysis_intra_data
-{
-    uint8_t* depth;
-    uint8_t* modes;
-    char*    partSizes;
-    uint8_t* chromaModes;
-};

 enum TextType
 {
     TEXT_LUMA   = 0, // luma
x265_1.8.tar.gz/source/common/constants.cpp -> x265_1.9.tar.gz/source/common/constants.cpp
Changed
@@ -2,6 +2,7 @@
 * Copyright (C) 2015 x265 project
 *
 * Authors: Steve Borho <steve@borho.org>
+*          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
x265_1.8.tar.gz/source/common/constants.h -> x265_1.9.tar.gz/source/common/constants.h
Changed
@@ -2,6 +2,7 @@
  * Copyright (C) 2015 x265 project
  *
  * Authors: Steve Borho <steve@borho.org>
+ *          Min Chen <chenm003@163.com>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
x265_1.8.tar.gz/source/common/contexts.h -> x265_1.9.tar.gz/source/common/contexts.h
Changed
@@ -2,6 +2,7 @@
 * Copyright (C) 2015 x265 project
 *
 * Authors: Steve Borho <steve@borho.org>
+*          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
x265_1.8.tar.gz/source/common/cudata.cpp -> x265_1.9.tar.gz/source/common/cudata.cpp
Changed
@@ -2,6 +2,7 @@
  * Copyright (C) 2015 x265 project
  *
  * Authors: Steve Borho <steve@borho.org>
+ *          Min Chen <chenm003@163.com>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -192,44 +193,82 @@
         break;
     }

-    /* Each CU's data is layed out sequentially within the charMemBlock */
-    uint8_t *charBuf = dataPool.charMemBlock + (m_numPartitions * BytesPerPartition) * instance;
-
-    m_qp               = (int8_t*)charBuf; charBuf += m_numPartitions;
-    m_log2CUSize       = charBuf; charBuf += m_numPartitions;
-    m_lumaIntraDir     = charBuf; charBuf += m_numPartitions;
-    m_tqBypass         = charBuf; charBuf += m_numPartitions;
-    m_refIdx[0]        = (int8_t*)charBuf; charBuf += m_numPartitions;
-    m_refIdx[1]        = (int8_t*)charBuf; charBuf += m_numPartitions;
-    m_cuDepth          = charBuf; charBuf += m_numPartitions;
-    m_predMode         = charBuf; charBuf += m_numPartitions; /* the order up to here is important in initCTU() and initSubCU() */
-    m_partSize         = charBuf; charBuf += m_numPartitions;
-    m_mergeFlag        = charBuf; charBuf += m_numPartitions;
-    m_interDir         = charBuf; charBuf += m_numPartitions;
-    m_mvpIdx[0]        = charBuf; charBuf += m_numPartitions;
-    m_mvpIdx[1]        = charBuf; charBuf += m_numPartitions;
-    m_tuDepth          = charBuf; charBuf += m_numPartitions;
-    m_transformSkip[0] = charBuf; charBuf += m_numPartitions;
-    m_transformSkip[1] = charBuf; charBuf += m_numPartitions;
-    m_transformSkip[2] = charBuf; charBuf += m_numPartitions;
-    m_cbf[0]           = charBuf; charBuf += m_numPartitions;
-    m_cbf[1]           = charBuf; charBuf += m_numPartitions;
-    m_cbf[2]           = charBuf; charBuf += m_numPartitions;
-    m_chromaIntraDir   = charBuf; charBuf += m_numPartitions;
-
-    X265_CHECK(charBuf == dataPool.charMemBlock + (m_numPartitions * BytesPerPartition) * (instance + 1), "CU data layout is broken\n");
-
-    m_mv[0]  = dataPool.mvMemBlock + (instance * 4) * m_numPartitions;
-    m_mv[1]  = m_mv[0] + m_numPartitions;
-    m_mvd[0] = m_mv[1] + m_numPartitions;
-    m_mvd[1] = m_mvd[0] + m_numPartitions;
-
-    uint32_t cuSize = g_maxCUSize >> depth;
-    uint32_t sizeL = cuSize * cuSize;
-    uint32_t sizeC = sizeL >> (m_hChromaShift + m_vChromaShift);
-    m_trCoeff[0] = dataPool.trCoeffMemBlock + instance * (sizeL + sizeC * 2);
-    m_trCoeff[1] = m_trCoeff[0] + sizeL;
-    m_trCoeff[2] = m_trCoeff[0] + sizeL + sizeC;
+    if (csp == X265_CSP_I400)
+    {
+        /* Each CU's data is layed out sequentially within the charMemBlock */
+        uint8_t *charBuf = dataPool.charMemBlock + (m_numPartitions * (BytesPerPartition - 4)) * instance;
+
+        m_qp               = (int8_t*)charBuf; charBuf += m_numPartitions;
+        m_log2CUSize       = charBuf; charBuf += m_numPartitions;
+        m_lumaIntraDir     = charBuf; charBuf += m_numPartitions;
+        m_tqBypass         = charBuf; charBuf += m_numPartitions;
+        m_refIdx[0]        = (int8_t*)charBuf; charBuf += m_numPartitions;
+        m_refIdx[1]        = (int8_t*)charBuf; charBuf += m_numPartitions;
+        m_cuDepth          = charBuf; charBuf += m_numPartitions;
+        m_predMode         = charBuf; charBuf += m_numPartitions; /* the order up to here is important in initCTU() and initSubCU() */
+        m_partSize         = charBuf; charBuf += m_numPartitions;
+        m_mergeFlag        = charBuf; charBuf += m_numPartitions;
+        m_interDir         = charBuf; charBuf += m_numPartitions;
+        m_mvpIdx[0]        = charBuf; charBuf += m_numPartitions;
+        m_mvpIdx[1]        = charBuf; charBuf += m_numPartitions;
+        m_tuDepth          = charBuf; charBuf += m_numPartitions;
+        m_transformSkip[0] = charBuf; charBuf += m_numPartitions;
+        m_cbf[0]           = charBuf; charBuf += m_numPartitions;
+        m_chromaIntraDir   = charBuf; charBuf += m_numPartitions;
+
+        X265_CHECK(charBuf == dataPool.charMemBlock + (m_numPartitions * (BytesPerPartition - 4)) * (instance + 1), "CU data layout is broken\n"); //BytesPerPartition
+
+        m_mv[0]  = dataPool.mvMemBlock + (instance * 4) * m_numPartitions;
+        m_mv[1]  = m_mv[0] + m_numPartitions;
+        m_mvd[0] = m_mv[1] + m_numPartitions;
+        m_mvd[1] = m_mvd[0] + m_numPartitions;
+
+        uint32_t cuSize = g_maxCUSize >> depth;
+        m_trCoeff[0] = dataPool.trCoeffMemBlock + instance * (cuSize * cuSize);
+        m_trCoeff[1] = m_trCoeff[2] = 0;
+        m_transformSkip[1] = m_transformSkip[2] = m_cbf[1] = m_cbf[2] = 0;
+    }
+    else
+    {
+        /* Each CU's data is layed out sequentially within the charMemBlock */
+        uint8_t *charBuf = dataPool.charMemBlock + (m_numPartitions * BytesPerPartition) * instance;
+
+        m_qp               = (int8_t*)charBuf; charBuf += m_numPartitions;
+        m_log2CUSize       = charBuf; charBuf += m_numPartitions;
+        m_lumaIntraDir     = charBuf; charBuf += m_numPartitions;
+        m_tqBypass         = charBuf; charBuf += m_numPartitions;
+        m_refIdx[0]        = (int8_t*)charBuf; charBuf += m_numPartitions;
+        m_refIdx[1]        = (int8_t*)charBuf; charBuf += m_numPartitions;
+        m_cuDepth          = charBuf; charBuf += m_numPartitions;
+        m_predMode         = charBuf; charBuf += m_numPartitions; /* the order up to here is important in initCTU() and initSubCU() */
+        m_partSize         = charBuf; charBuf += m_numPartitions;
+        m_mergeFlag        = charBuf; charBuf += m_numPartitions;
+        m_interDir         = charBuf; charBuf += m_numPartitions;
+        m_mvpIdx[0]        = charBuf; charBuf += m_numPartitions;
+        m_mvpIdx[1]        = charBuf; charBuf += m_numPartitions;
+        m_tuDepth          = charBuf; charBuf += m_numPartitions;
+        m_transformSkip[0] = charBuf; charBuf += m_numPartitions;
+        m_transformSkip[1] = charBuf; charBuf += m_numPartitions;
+        m_transformSkip[2] = charBuf; charBuf += m_numPartitions;
+        m_cbf[0]           = charBuf; charBuf += m_numPartitions;
+        m_cbf[1]           = charBuf; charBuf += m_numPartitions;
+        m_cbf[2]           = charBuf; charBuf += m_numPartitions;
+        m_chromaIntraDir   = charBuf; charBuf += m_numPartitions;
+
+        X265_CHECK(charBuf == dataPool.charMemBlock + (m_numPartitions * BytesPerPartition) * (instance + 1), "CU data layout is broken\n");
+
+        m_mv[0]  = dataPool.mvMemBlock + (instance * 4) * m_numPartitions;
+        m_mv[1]  = m_mv[0] + m_numPartitions;
+        m_mvd[0] = m_mv[1] + m_numPartitions;
+        m_mvd[1] = m_mvd[0] + m_numPartitions;
+
+        uint32_t cuSize = g_maxCUSize >> depth;
+        uint32_t sizeL = cuSize * cuSize;
+        uint32_t sizeC = sizeL >> (m_hChromaShift + m_vChromaShift); // block chroma part
+        m_trCoeff[0] = dataPool.trCoeffMemBlock + instance * (sizeL + sizeC * 2);
+        m_trCoeff[1] = m_trCoeff[0] + sizeL;
+        m_trCoeff[2] = m_trCoeff[0] + sizeL + sizeC;
+    }
 }

 void CUData::initCTU(const Frame& frame, uint32_t cuAddr, int qp)
@@ -245,7 +284,8 @@
     /* sequential memsets */
     m_partSet((uint8_t*)m_qp, (uint8_t)qp);
     m_partSet(m_log2CUSize, (uint8_t)g_maxLog2CUSize);
-    m_partSet(m_lumaIntraDir, (uint8_t)DC_IDX);
+    m_partSet(m_lumaIntraDir, (uint8_t)ALL_IDX);
+    m_partSet(m_chromaIntraDir, (uint8_t)ALL_IDX);
     m_partSet(m_tqBypass, (uint8_t)frame.m_encData->m_param->bLossless);
     if (m_slice->m_sliceType != I_SLICE)
     {
@@ -256,7 +296,7 @@
     X265_CHECK(!(frame.m_encData->m_param->bLossless && !m_slice->m_pps->bTransquantBypassEnabled), "lossless enabled without TQbypass in PPS\n");

     /* initialize the remaining CU data in one memset */
-    memset(m_cuDepth, 0, (BytesPerPartition - 6) * m_numPartitions);
+    memset(m_cuDepth, 0, (frame.m_param->internalCsp == X265_CSP_I400 ? BytesPerPartition - 11 : BytesPerPartition - 7) * m_numPartitions);

     uint32_t widthInCU = m_slice->m_sps->numCuInWidth;
     m_cuLeft = (m_cuAddr % widthInCU) ? m_encData->getPicCTU(m_cuAddr - 1) : NULL;
@@ -283,14 +323,15 @@
     m_partSet((uint8_t*)m_qp, (uint8_t)qp);

     m_partSet(m_log2CUSize, (uint8_t)cuGeom.log2CUSize);
-    m_partSet(m_lumaIntraDir, (uint8_t)DC_IDX);
+    m_partSet(m_lumaIntraDir, (uint8_t)ALL_IDX);
+    m_partSet(m_chromaIntraDir, (uint8_t)ALL_IDX);
     m_partSet(m_tqBypass, (uint8_t)m_encData->m_param->bLossless);
     m_partSet((uint8_t*)m_refIdx[0], (uint8_t)REF_NOT_VALID);
     m_partSet((uint8_t*)m_refIdx[1], (uint8_t)REF_NOT_VALID);
     m_partSet(m_cuDepth, (uint8_t)cuGeom.depth);

     /* initialize the remaining CU data in one memset */
-    memset(m_predMode, 0, (BytesPerPartition - 7) * m_numPartitions);
+    memset(m_predMode, 0, (ctu.m_chromaFormat == X265_CSP_I400 ? BytesPerPartition - 12 : BytesPerPartition - 8) * m_numPartitions);
 }

 /* Copy the results of a sub-part (split) CU to the parent CU */
@@ -314,13 +355,9 @@
     m_subPartCopy(m_mvpIdx[0] + offset, subCU.m_mvpIdx[0]);
     m_subPartCopy(m_mvpIdx[1] + offset, subCU.m_mvpIdx[1]);
     m_subPartCopy(m_tuDepth + offset, subCU.m_tuDepth);
+
     m_subPartCopy(m_transformSkip[0] + offset, subCU.m_transformSkip[0]);
-    m_subPartCopy(m_transformSkip[1] + offset, subCU.m_transformSkip[1]);
-    m_subPartCopy(m_transformSkip[2] + offset, subCU.m_transformSkip[2]);
     m_subPartCopy(m_cbf[0] + offset, subCU.m_cbf[0]);
-    m_subPartCopy(m_cbf[1] + offset, subCU.m_cbf[1]);
-    m_subPartCopy(m_cbf[2] + offset, subCU.m_cbf[2]);
-    m_subPartCopy(m_chromaIntraDir + offset, subCU.m_chromaIntraDir);

     memcpy(m_mv[0] + offset, subCU.m_mv[0], childGeom.numPartitions * sizeof(MV));
     memcpy(m_mv[1] + offset, subCU.m_mv[1], childGeom.numPartitions * sizeof(MV));
@@ -329,12 +366,21 @@
     uint32_t tmp = 1 << ((g_maxLog2CUSize - childGeom.depth) * 2);
     uint32_t tmp2 = subPartIdx * tmp;
-    memcpy(m_trCoeff[0] + tmp2, subCU.m_trCoeff[0], sizeof(coeff_t) * tmp);
+    memcpy(m_trCoeff[0] + tmp2, subCU.m_trCoeff[0], sizeof(coeff_t)* tmp);

-    uint32_t tmpC = tmp >> (m_hChromaShift + m_vChromaShift);
-    uint32_t tmpC2 = tmp2 >> (m_hChromaShift + m_vChromaShift);
-    memcpy(m_trCoeff[1] + tmpC2, subCU.m_trCoeff[1], sizeof(coeff_t) * tmpC);
-    memcpy(m_trCoeff[2] + tmpC2, subCU.m_trCoeff[2], sizeof(coeff_t) * tmpC);
+    if (subCU.m_chromaFormat != X265_CSP_I400)
+    {
+        m_subPartCopy(m_transformSkip[1] + offset, subCU.m_transformSkip[1]);
+        m_subPartCopy(m_transformSkip[2] + offset, subCU.m_transformSkip[2]);
+        m_subPartCopy(m_cbf[1] + offset, subCU.m_cbf[1]);
+        m_subPartCopy(m_cbf[2] + offset, subCU.m_cbf[2]);
+        m_subPartCopy(m_chromaIntraDir + offset, subCU.m_chromaIntraDir);
+
+        uint32_t tmpC = tmp >> (m_hChromaShift + m_vChromaShift);
+        uint32_t tmpC2 = tmp2 >> (m_hChromaShift + m_vChromaShift);
+        memcpy(m_trCoeff[1] + tmpC2, subCU.m_trCoeff[1], sizeof(coeff_t) * tmpC);
+        memcpy(m_trCoeff[2] + tmpC2, subCU.m_trCoeff[2], sizeof(coeff_t) * tmpC);
+    }
 }

 /* If a sub-CU part is not present (off the edge of the picture) its depth and
@@ -374,12 +420,17 @@
     /* clear residual coding flags */
     m_partSet(m_predMode, cu.m_predMode[0] & (MODE_INTRA | MODE_INTER));
     m_partSet(m_tuDepth, 0);
-    m_partSet(m_transformSkip[0], 0);
-    m_partSet(m_transformSkip[1], 0);
-    m_partSet(m_transformSkip[2], 0);
     m_partSet(m_cbf[0], 0);
-    m_partSet(m_cbf[1], 0);
-    m_partSet(m_cbf[2], 0);
+    m_partSet(m_transformSkip[0], 0);
+
+    if (cu.m_chromaFormat != X265_CSP_I400)
+    {
+        m_partSet(m_chromaIntraDir, (uint8_t)ALL_IDX);
+        m_partSet(m_cbf[1], 0);
+        m_partSet(m_cbf[2], 0);
+        m_partSet(m_transformSkip[1], 0);
+        m_partSet(m_transformSkip[2], 0);
+    }
 }

 /* Copy completed predicted CU to CTU in picture */
@@ -402,30 +453,34 @@
     m_partCopy(ctu.m_mvpIdx[1] + m_absIdxInCTU, m_mvpIdx[1]);
     m_partCopy(ctu.m_tuDepth + m_absIdxInCTU, m_tuDepth);
     m_partCopy(ctu.m_transformSkip[0] + m_absIdxInCTU, m_transformSkip[0]);
-    m_partCopy(ctu.m_transformSkip[1] + m_absIdxInCTU, m_transformSkip[1]);
-    m_partCopy(ctu.m_transformSkip[2] + m_absIdxInCTU, m_transformSkip[2]);
     m_partCopy(ctu.m_cbf[0] + m_absIdxInCTU, m_cbf[0]);
-    m_partCopy(ctu.m_cbf[1] + m_absIdxInCTU, m_cbf[1]);
-    m_partCopy(ctu.m_cbf[2] + m_absIdxInCTU, m_cbf[2]);
-    m_partCopy(ctu.m_chromaIntraDir + m_absIdxInCTU, m_chromaIntraDir);

-    memcpy(ctu.m_mv[0] + m_absIdxInCTU, m_mv[0], m_numPartitions * sizeof(MV));
-    memcpy(ctu.m_mv[1] + m_absIdxInCTU, m_mv[1], m_numPartitions * sizeof(MV));
+    memcpy(ctu.m_mv[0] + m_absIdxInCTU,  m_mv[0],  m_numPartitions * sizeof(MV));
+    memcpy(ctu.m_mv[1] + m_absIdxInCTU,  m_mv[1],  m_numPartitions * sizeof(MV));
     memcpy(ctu.m_mvd[0] + m_absIdxInCTU, m_mvd[0], m_numPartitions * sizeof(MV));
     memcpy(ctu.m_mvd[1] + m_absIdxInCTU, m_mvd[1], m_numPartitions * sizeof(MV));

     uint32_t tmpY = 1 << ((g_maxLog2CUSize - depth) * 2);
     uint32_t tmpY2 = m_absIdxInCTU << (LOG2_UNIT_SIZE * 2);
-    memcpy(ctu.m_trCoeff[0] + tmpY2, m_trCoeff[0], sizeof(coeff_t) * tmpY);
+    memcpy(ctu.m_trCoeff[0] + tmpY2, m_trCoeff[0], sizeof(coeff_t)* tmpY);

-    uint32_t tmpC = tmpY >> (m_hChromaShift + m_vChromaShift);
-    uint32_t tmpC2 = tmpY2 >> (m_hChromaShift + m_vChromaShift);
-    memcpy(ctu.m_trCoeff[1] + tmpC2, m_trCoeff[1], sizeof(coeff_t) * tmpC);
-    memcpy(ctu.m_trCoeff[2] + tmpC2, m_trCoeff[2], sizeof(coeff_t) * tmpC);
+    if (ctu.m_chromaFormat != X265_CSP_I400)
+    {
+        m_partCopy(ctu.m_transformSkip[1] + m_absIdxInCTU, m_transformSkip[1]);
+        m_partCopy(ctu.m_transformSkip[2] + m_absIdxInCTU, m_transformSkip[2]);
+        m_partCopy(ctu.m_cbf[1] + m_absIdxInCTU, m_cbf[1]);
+        m_partCopy(ctu.m_cbf[2] + m_absIdxInCTU, m_cbf[2]);
+        m_partCopy(ctu.m_chromaIntraDir + m_absIdxInCTU, m_chromaIntraDir);
+
+        uint32_t tmpC = tmpY >> (m_hChromaShift + m_vChromaShift);
+        uint32_t tmpC2 = tmpY2 >> (m_hChromaShift + m_vChromaShift);
+        memcpy(ctu.m_trCoeff[1] + tmpC2, m_trCoeff[1], sizeof(coeff_t) * tmpC);
+        memcpy(ctu.m_trCoeff[2] + tmpC2, m_trCoeff[2], sizeof(coeff_t) * tmpC);
+    }
 }

 /* The reverse of copyToPic, called only by encodeResidue */
-void CUData::copyFromPic(const CUData& ctu, const CUGeom& cuGeom)
+void CUData::copyFromPic(const CUData& ctu, const CUGeom& cuGeom, int csp)
 {
     m_encData = ctu.m_encData;
     m_slice = ctu.m_slice;
@@ -451,19 +506,23 @@
     m_partCopy(m_mvpIdx[1], ctu.m_mvpIdx[1] + m_absIdxInCTU);
     m_partCopy(m_chromaIntraDir, ctu.m_chromaIntraDir + m_absIdxInCTU);

-    memcpy(m_mv[0], ctu.m_mv[0] + m_absIdxInCTU, m_numPartitions * sizeof(MV));
-    memcpy(m_mv[1], ctu.m_mv[1] + m_absIdxInCTU, m_numPartitions * sizeof(MV));
+    memcpy(m_mv[0],  ctu.m_mv[0] + m_absIdxInCTU,  m_numPartitions * sizeof(MV));
+    memcpy(m_mv[1],  ctu.m_mv[1] + m_absIdxInCTU,  m_numPartitions * sizeof(MV));
     memcpy(m_mvd[0], ctu.m_mvd[0] + m_absIdxInCTU, m_numPartitions * sizeof(MV));
     memcpy(m_mvd[1], ctu.m_mvd[1] + m_absIdxInCTU, m_numPartitions * sizeof(MV));

     /* clear residual coding flags */
     m_partSet(m_tuDepth, 0);
     m_partSet(m_transformSkip[0], 0);
-    m_partSet(m_transformSkip[1], 0);
-    m_partSet(m_transformSkip[2], 0);
     m_partSet(m_cbf[0], 0);
-    m_partSet(m_cbf[1], 0);
-    m_partSet(m_cbf[2], 0);
+
+    if (csp != X265_CSP_I400)
+    {
+        m_partSet(m_transformSkip[1], 0);
+        m_partSet(m_transformSkip[2], 0);
+        m_partSet(m_cbf[1], 0);
+        m_partSet(m_cbf[2], 0);
+    }
 }

 /* Only called by encodeResidue, these fields can be modified during inter/intra coding */
@@ -473,22 +532,28 @@
     m_partCopy((uint8_t*)ctu.m_qp + m_absIdxInCTU, (uint8_t*)m_qp);

     m_partCopy(ctu.m_transformSkip[0] + m_absIdxInCTU, m_transformSkip[0]);
-    m_partCopy(ctu.m_transformSkip[1] + m_absIdxInCTU, m_transformSkip[1]);
-    m_partCopy(ctu.m_transformSkip[2] + m_absIdxInCTU, m_transformSkip[2]);
     m_partCopy(ctu.m_predMode + m_absIdxInCTU, m_predMode);
     m_partCopy(ctu.m_tuDepth + m_absIdxInCTU, m_tuDepth);
     m_partCopy(ctu.m_cbf[0] + m_absIdxInCTU, m_cbf[0]);
-    m_partCopy(ctu.m_cbf[1] + m_absIdxInCTU, m_cbf[1]);
-    m_partCopy(ctu.m_cbf[2] + m_absIdxInCTU, m_cbf[2]);
-    m_partCopy(ctu.m_chromaIntraDir + m_absIdxInCTU, m_chromaIntraDir);

     uint32_t tmpY = 1 << ((g_maxLog2CUSize - depth) * 2);
     uint32_t tmpY2 = m_absIdxInCTU << (LOG2_UNIT_SIZE * 2);
-    memcpy(ctu.m_trCoeff[0] + tmpY2, m_trCoeff[0], sizeof(coeff_t) * tmpY);
-    tmpY  >>= m_hChromaShift + m_vChromaShift;
-    tmpY2 >>= m_hChromaShift + m_vChromaShift;
-    memcpy(ctu.m_trCoeff[1] + tmpY2, m_trCoeff[1], sizeof(coeff_t) * tmpY);
-    memcpy(ctu.m_trCoeff[2] + tmpY2, m_trCoeff[2], sizeof(coeff_t) * tmpY);
+    memcpy(ctu.m_trCoeff[0] + tmpY2, m_trCoeff[0], sizeof(coeff_t)* tmpY);
+
+    if (ctu.m_chromaFormat != X265_CSP_I400)
+    {
+        m_partCopy(ctu.m_transformSkip[1] + m_absIdxInCTU, m_transformSkip[1]);
+        m_partCopy(ctu.m_transformSkip[2] + m_absIdxInCTU, m_transformSkip[2]);
+
+        m_partCopy(ctu.m_cbf[1] + m_absIdxInCTU, m_cbf[1]);
+        m_partCopy(ctu.m_cbf[2] + m_absIdxInCTU, m_cbf[2]);
+        m_partCopy(ctu.m_chromaIntraDir + m_absIdxInCTU, m_chromaIntraDir);
+
+        tmpY  >>= m_hChromaShift + m_vChromaShift;
+        tmpY2 >>= m_hChromaShift + m_vChromaShift;
+        memcpy(ctu.m_trCoeff[1] + tmpY2, m_trCoeff[1], sizeof(coeff_t) * tmpY);
+        memcpy(ctu.m_trCoeff[2] + tmpY2, m_trCoeff[2], sizeof(coeff_t) * tmpY);
+    }
 }

 const CUData* CUData::getPULeft(uint32_t& lPartUnitIdx, uint32_t curPartUnitIdx) const
@@ -1676,7 +1741,7 @@
     if (tempRefIdx != -1)
     {
         uint32_t cuAddr = neighbours[MD_COLLOCATED].cuAddr[picList];
-        const Frame* colPic = m_slice->m_refPicList[m_slice->isInterB() && !m_slice->m_colFromL0Flag][m_slice->m_colRefIdx];
+        const Frame* colPic = m_slice->m_refFrameList[m_slice->isInterB() && !m_slice->m_colFromL0Flag][m_slice->m_colRefIdx];
         const CUData* colCU = colPic->m_encData->getPicCTU(cuAddr);

         // Scale the vector
@@ -1857,7 +1922,7 @@

 bool CUData::getColMVP(MV& outMV, int& outRefIdx, int picList, int cuAddr, int partUnitIdx) const
 {
-    const Frame* colPic = m_slice->m_refPicList[m_slice->isInterB() && !m_slice->m_colFromL0Flag][m_slice->m_colRefIdx];
+    const Frame* colPic = m_slice->m_refFrameList[m_slice->isInterB() && !m_slice->m_colFromL0Flag][m_slice->m_colRefIdx];
     const CUData* colCU = colPic->m_encData->getPicCTU(cuAddr);

     uint32_t absPartAddr = partUnitIdx & TMVP_UNIT_MASK;
@@ -1892,7 +1957,7 @@
 // Cache the collocated MV.
 bool CUData::getCollocatedMV(int cuAddr, int partUnitIdx, InterNeighbourMV *neighbour) const
 {
-    const Frame* colPic = m_slice->m_refPicList[m_slice->isInterB() && !m_slice->m_colFromL0Flag][m_slice->m_colRefIdx];
+    const Frame* colPic = m_slice->m_refFrameList[m_slice->isInterB() && !m_slice->m_colFromL0Flag][m_slice->m_colRefIdx];
     const CUData* colCU = colPic->m_encData->getPicCTU(cuAddr);

     uint32_t absPartAddr = partUnitIdx & TMVP_UNIT_MASK;
@@ -1951,7 +2016,7 @@
     bool bIsIntra = isIntra(absPartIdx);

     // set the group layout
-    result.log2TrSizeCG = log2TrSize - 2;
+    const uint32_t log2TrSizeCG = log2TrSize - 2;

     // set the scan orders
     if (bIsIntra)
@@ -1979,7 +2044,7 @@
         result.scanType = SCAN_DIAG;

     result.scan = g_scanOrder[result.scanType][log2TrSize - 2];
-    result.scanCG = g_scanOrderCG[result.scanType][result.log2TrSizeCG];
+    result.scanCG = g_scanOrderCG[result.scanType][log2TrSizeCG];

     if (log2TrSize == 2)
         result.firstSignificanceMapContext = 0;
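The repeated csp == X265_CSP_I400 branches above all follow from one sizing rule: a monochrome CU carries a luma coefficient plane only, while every other format appends two chroma planes subsampled by the chroma shifts. A small sketch of that rule (function and parameter names are illustrative, not from the source):

    #include <cstdint>

    // Coefficients allocated per CU instance, per the layout code above.
    static uint32_t coeffBlockSize(uint32_t cuSize, bool i400,
                                   uint32_t hChromaShift, uint32_t vChromaShift)
    {
        uint32_t sizeL = cuSize * cuSize;
        if (i400)
            return sizeL;                 // m_trCoeff[1]/[2] stay NULL
        uint32_t sizeC = sizeL >> (hChromaShift + vChromaShift);
        return sizeL + 2 * sizeC;         // [0] luma, [1]/[2] chroma
    }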
x265_1.8.tar.gz/source/common/cudata.h -> x265_1.9.tar.gz/source/common/cudata.h
Changed
@@ -222,12 +222,12 @@
     void     copyToPic(uint32_t depth) const;

     /* RD-0 methods called only from encodeResidue */
-    void     copyFromPic(const CUData& ctu, const CUGeom& cuGeom);
+    void     copyFromPic(const CUData& ctu, const CUGeom& cuGeom, int csp);
     void     updatePic(uint32_t depth) const;

     void     setPartSizeSubParts(PartSize size) { m_partSet(m_partSize, (uint8_t)size); }
     void     setPredModeSubParts(PredMode mode) { m_partSet(m_predMode, (uint8_t)mode); }
-    void     clearCbf()                         { m_partSet(m_cbf[0], 0); m_partSet(m_cbf[1], 0); m_partSet(m_cbf[2], 0); }
+    void     clearCbf()                         { m_partSet(m_cbf[0], 0); if (m_chromaFormat != X265_CSP_I400) { m_partSet(m_cbf[1], 0); m_partSet(m_cbf[2], 0); } }

     /* these functions all take depth as an absolute depth from CTU, it is used to calculate the number of parts to copy */
     void     setQPSubParts(int8_t qp, uint32_t absPartIdx, uint32_t depth) { s_partSet[depth]((uint8_t*)m_qp + absPartIdx, (uint8_t)qp); }
@@ -246,7 +246,7 @@
     void     setPURefIdx(int list, int8_t refIdx, int absPartIdx, int puIdx);

     uint8_t  getCbf(uint32_t absPartIdx, TextType ttype, uint32_t tuDepth) const { return (m_cbf[ttype][absPartIdx] >> tuDepth) & 0x1; }
-    uint8_t  getQtRootCbf(uint32_t absPartIdx) const { return m_cbf[0][absPartIdx] || m_cbf[1][absPartIdx] || m_cbf[2][absPartIdx]; }
+    uint8_t  getQtRootCbf(uint32_t absPartIdx) const { if (m_chromaFormat == X265_CSP_I400) return m_cbf[0][absPartIdx] || false; else { return m_cbf[0][absPartIdx] || m_cbf[1][absPartIdx] || m_cbf[2][absPartIdx]; } }
     int8_t   getRefQP(uint32_t currAbsIdxInCTU) const;
     uint32_t getInterMergeCandidates(uint32_t absPartIdx, uint32_t puIdx, MVField (*candMvField)[2], uint8_t* candDir) const;
     void     clipMv(MV& outMV) const;
@@ -323,7 +323,6 @@
     const uint16_t *scan;
     const uint16_t *scanCG;
     ScanType scanType;
-    uint32_t log2TrSizeCG;
     uint32_t firstSignificanceMapContext;
 };

@@ -340,8 +339,15 @@
         uint32_t numPartition = NUM_4x4_PARTITIONS >> (depth * 2);
         uint32_t cuSize = g_maxCUSize >> depth;
         uint32_t sizeL = cuSize * cuSize;
-        uint32_t sizeC = sizeL >> (CHROMA_H_SHIFT(csp) + CHROMA_V_SHIFT(csp));
-        CHECKED_MALLOC(trCoeffMemBlock, coeff_t, (sizeL + sizeC * 2) * numInstances);
+        if (csp == X265_CSP_I400)
+        {
+            CHECKED_MALLOC(trCoeffMemBlock, coeff_t, (sizeL) * numInstances);
+        }
+        else
+        {
+            uint32_t sizeC = sizeL >> (CHROMA_H_SHIFT(csp) + CHROMA_V_SHIFT(csp));
+            CHECKED_MALLOC(trCoeffMemBlock, coeff_t, (sizeL + sizeC * 2) * numInstances);
+        }
         CHECKED_MALLOC(charMemBlock, uint8_t, numPartition * numInstances * CUData::BytesPerPartition);
         CHECKED_MALLOC(mvMemBlock, MV, numPartition * 4 * numInstances);
         return true;
View file
x265_1.8.tar.gz/source/common/dct.cpp -> x265_1.9.tar.gz/source/common/dct.cpp
Changed
@@ -703,7 +703,10 @@
         if (level)
             ++numSig;
         level *= sign;
-        qCoef[blockpos] = (int16_t)x265_clip3(-32768, 32767, level);
+
+        // TODO: when we limit range to [-32767, 32767], we can get more performance with output change
+        // But nquant is a little percent in rdoQuant, so I keep old dynamic range for compatible
+        qCoef[blockpos] = (int16_t)abs(x265_clip3(-32768, 32767, level));
     }

     return numSig;
@@ -784,11 +787,12 @@
     return scanPosLast - 1;
 }

+// NOTE: no defined value on lastNZPosInCG & absSumSign when ALL ZEROS block as input
 static uint32_t findPosFirstLast_c(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16])
 {
     int n;

-    for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
+    for (n = SCAN_SET_SIZE - 1; n >= 0; n--)
     {
         const uint32_t idx = scanTbl[n];
         const uint32_t idxY = idx / MLS_CG_SIZE;
@@ -812,8 +816,17 @@

     uint32_t firstNZPosInCG = (uint32_t)n;

+    uint32_t absSumSign = 0;
+    for (n = firstNZPosInCG; n <= (int)lastNZPosInCG; n++)
+    {
+        const uint32_t idx = scanTbl[n];
+        const uint32_t idxY = idx / MLS_CG_SIZE;
+        const uint32_t idxX = idx % MLS_CG_SIZE;
+        absSumSign += dstCoeff[idxY * trSize + idxX];
+    }
+
     // NOTE: when coeff block all ZERO, the lastNZPosInCG is undefined and firstNZPosInCG is 16
-    return ((lastNZPosInCG << 16) | firstNZPosInCG);
+    return ((absSumSign << 31) | (lastNZPosInCG << 8) | firstNZPosInCG);
 }
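The new return value packs three fields into a single 32-bit word: bits 0..7 hold the first nonzero scan position within the 4x4 coefficient group, bits 8..15 the last, and bit 31 the low bit of the coefficient sum, which is exactly the parity that sign-bit hiding needs. A hedged sketch of how a caller could unpack it (the struct and names are illustrative; x265's call sites do this inline):

    #include <cstdint>

    struct PosFirstLast
    {
        uint32_t firstNZ;   // bits 0..7: first nonzero position in the 4x4 CG
        uint32_t lastNZ;    // bits 8..15: last nonzero position
        uint32_t sumParity; // bit 31: low bit of the coefficient sum
    };

    static PosFirstLast unpack(uint32_t packed)
    {
        PosFirstLast r;
        r.firstNZ   = packed & 0xFF;          // 16 signals an all-zero group
        r.lastNZ    = (packed >> 8) & 0xFF;
        r.sumParity = packed >> 31;
        return r;
    }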
View file
x265_1.8.tar.gz/source/common/deblock.cpp -> x265_1.9.tar.gz/source/common/deblock.cpp
Changed
@@ -2,6 +2,7 @@
 * Copyright (C) 2013 x265 project
 *
 * Author: Gopu Govindaswamy <gopu@multicorewareinc.com>
+* Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
@@ -108,7 +109,7 @@
     for (uint32_t e = 0; e < numUnits; e += partIdxIncr)
     {
         edgeFilterLuma(cu, absPartIdx, depth, dir, e, blockStrength);
-        if (!((e0 + e) & chromaMask))
+        if (!((e0 + e) & chromaMask) && cu->m_chromaFormat != X265_CSP_I400)
             edgeFilterChroma(cu, absPartIdx, depth, dir, e, blockStrength);
     }
 }
@@ -209,8 +210,8 @@
     const Slice* const sliceQ = cuQ->m_slice;
     const Slice* const sliceP = cuP->m_slice;

-    const Frame* refP0 = sliceP->getRefPic(0, cuP->m_refIdx[0][partP]);
-    const Frame* refQ0 = sliceQ->getRefPic(0, cuQ->m_refIdx[0][partQ]);
+    const Frame* refP0 = sliceP->m_refFrameList[0][cuP->m_refIdx[0][partP]];
+    const Frame* refQ0 = sliceQ->m_refFrameList[0][cuQ->m_refIdx[0][partQ]];
     const MV& mvP0 = refP0 ? cuP->m_mv[0][partP] : zeroMv;
     const MV& mvQ0 = refQ0 ? cuQ->m_mv[0][partQ] : zeroMv;

@@ -221,8 +222,8 @@
     }

     // (sliceQ->isInterB() || sliceP->isInterB())
-    const Frame* refP1 = sliceP->getRefPic(1, cuP->m_refIdx[1][partP]);
-    const Frame* refQ1 = sliceQ->getRefPic(1, cuQ->m_refIdx[1][partQ]);
+    const Frame* refP1 = sliceP->m_refFrameList[1][cuP->m_refIdx[1][partP]];
+    const Frame* refQ1 = sliceQ->m_refFrameList[1][cuQ->m_refIdx[1][partQ]];
     const MV& mvP1 = refP1 ? cuP->m_mv[1][partP] : zeroMv;
     const MV& mvQ1 = refQ1 ? cuQ->m_mv[1][partQ] : zeroMv;

@@ -279,31 +280,6 @@
 * \param maskQ indicator to enable filtering on partQ
 * \param maskP1 decision weak filter/no filter for partP
 * \param maskQ1 decision weak filter/no filter for partQ */
-static inline void pelFilterLumaStrong(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tc, int32_t maskP, int32_t maskQ)
-{
-    int32_t tc2 = 2 * tc;
-    int32_t tcP = (tc2 & maskP);
-    int32_t tcQ = (tc2 & maskQ);
-    for (int32_t i = 0; i < UNIT_SIZE; i++, src += srcStep)
-    {
-        int16_t m4 = (int16_t)src[0];
-        int16_t m3 = (int16_t)src[-offset];
-        int16_t m5 = (int16_t)src[offset];
-        int16_t m2 = (int16_t)src[-offset * 2];
-        int16_t m6 = (int16_t)src[offset * 2];
-        int16_t m1 = (int16_t)src[-offset * 3];
-        int16_t m7 = (int16_t)src[offset * 3];
-        int16_t m0 = (int16_t)src[-offset * 4];
-        src[-offset * 3] = (pixel)(x265_clip3(-tcP, tcP, ((2 * m0 + 3 * m1 + m2 + m3 + m4 + 4) >> 3) - m1) + m1);
-        src[-offset * 2] = (pixel)(x265_clip3(-tcP, tcP, ((m1 + m2 + m3 + m4 + 2) >> 2) - m2) + m2);
-        src[-offset] = (pixel)(x265_clip3(-tcP, tcP, ((m1 + 2 * m2 + 2 * m3 + 2 * m4 + m5 + 4) >> 3) - m3) + m3);
-        src[0] = (pixel)(x265_clip3(-tcQ, tcQ, ((m2 + 2 * m3 + 2 * m4 + 2 * m5 + m6 + 4) >> 3) - m4) + m4);
-        src[offset] = (pixel)(x265_clip3(-tcQ, tcQ, ((m3 + m4 + m5 + m6 + 2) >> 2) - m5) + m5);
-        src[offset * 2] = (pixel)(x265_clip3(-tcQ, tcQ, ((m3 + m4 + m5 + 3 * m6 + 2 * m7 + 4) >> 3) - m6) + m6);
-    }
-}
-
-/* Weak filter */
 static inline void pelFilterLuma(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tc, int32_t maskP, int32_t maskQ,
                                  int32_t maskP1, int32_t maskQ1)
 {
@@ -445,7 +421,12 @@
                                           useStrongFiltering(offset, beta, tc, src + unitOffset + srcStep * 3));

         if (sw)
-            pelFilterLumaStrong(src + unitOffset, srcStep, offset, tc, maskP, maskQ);
+        {
+            int32_t tc2 = 2 * tc;
+            int32_t tcP = (tc2 & maskP);
+            int32_t tcQ = (tc2 & maskQ);
+            primitives.pelFilterLumaStrong[dir](src + unitOffset, srcStep, offset, tcP, tcQ);
+        }
         else
         {
             int32_t sideThreshold = (beta + (beta >> 1)) >> 3;
View file
x265_1.8.tar.gz/source/common/deblock.h -> x265_1.9.tar.gz/source/common/deblock.h
Changed
@@ -2,6 +2,7 @@
 * Copyright (C) 2013 x265 project
 *
 * Author: Gopu Govindaswamy <gopu@multicorewareinc.com>
+* Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
@@ -37,24 +38,24 @@
 public:

     enum { EDGE_VER, EDGE_HOR };

-    void deblockCTU(const CUData* ctu, const CUGeom& cuGeom, int32_t dir);
+    static void deblockCTU(const CUData* ctu, const CUGeom& cuGeom, int32_t dir);

 protected:

     // CU-level deblocking function
-    void deblockCU(const CUData* cu, const CUGeom& cuGeom, const int32_t dir, uint8_t blockStrength[]);
+    static void deblockCU(const CUData* cu, const CUGeom& cuGeom, const int32_t dir, uint8_t blockStrength[]);

     // set filtering functions
-    void setEdgefilterTU(const CUData* cu, uint32_t absPartIdx, uint32_t tuDepth, int32_t dir, uint8_t blockStrength[]);
-    void setEdgefilterPU(const CUData* cu, uint32_t absPartIdx, int32_t dir, uint8_t blockStrength[], uint32_t numUnits);
-    void setEdgefilterMultiple(const CUData* cu, uint32_t absPartIdx, int32_t dir, int32_t edgeIdx, uint8_t value, uint8_t blockStrength[], uint32_t numUnits);
+    static void setEdgefilterTU(const CUData* cu, uint32_t absPartIdx, uint32_t tuDepth, int32_t dir, uint8_t blockStrength[]);
+    static void setEdgefilterPU(const CUData* cu, uint32_t absPartIdx, int32_t dir, uint8_t blockStrength[], uint32_t numUnits);
+    static void setEdgefilterMultiple(const CUData* cu, uint32_t absPartIdx, int32_t dir, int32_t edgeIdx, uint8_t value, uint8_t blockStrength[], uint32_t numUnits);

     // get filtering functions
-    uint8_t getBoundaryStrength(const CUData* cuQ, int32_t dir, uint32_t partQ, const uint8_t blockStrength[]);
+    static uint8_t getBoundaryStrength(const CUData* cuQ, int32_t dir, uint32_t partQ, const uint8_t blockStrength[]);

     // filter luma/chroma functions
-    void edgeFilterLuma(const CUData* cuQ, uint32_t absPartIdx, uint32_t depth, int32_t dir, int32_t edge, const uint8_t blockStrength[]);
-    void edgeFilterChroma(const CUData* cuQ, uint32_t absPartIdx, uint32_t depth, int32_t dir, int32_t edge, const uint8_t blockStrength[]);
+    static void edgeFilterLuma(const CUData* cuQ, uint32_t absPartIdx, uint32_t depth, int32_t dir, int32_t edge, const uint8_t blockStrength[]);
+    static void edgeFilterChroma(const CUData* cuQ, uint32_t absPartIdx, uint32_t depth, int32_t dir, int32_t edge, const uint8_t blockStrength[]);

     static const uint8_t s_tcTable[54];
     static const uint8_t s_betaTable[52];
View file
x265_1.8.tar.gz/source/common/frame.cpp -> x265_1.9.tar.gz/source/common/frame.cpp
Changed
@@ -33,22 +33,37 @@
     m_bChromaExtended = false;
     m_lowresInit = false;
     m_reconRowCount.set(0);
+    m_reconColCount = NULL;
     m_countRefEncoders = 0;
     m_encData = NULL;
     m_reconPic = NULL;
+    m_quantOffsets = NULL;
     m_next = NULL;
     m_prev = NULL;
     m_param = NULL;
     memset(&m_lowres, 0, sizeof(m_lowres));
 }

-bool Frame::create(x265_param *param)
+bool Frame::create(x265_param *param, float* quantOffsets)
 {
     m_fencPic = new PicYuv;
     m_param = param;

-    return m_fencPic->create(param->sourceWidth, param->sourceHeight, param->internalCsp) &&
-           m_lowres.create(m_fencPic, param->bframes, !!param->rc.aqMode);
+    if (m_fencPic->create(param->sourceWidth, param->sourceHeight, param->internalCsp) &&
+        m_lowres.create(m_fencPic, param->bframes, !!param->rc.aqMode))
+    {
+        X265_CHECK((m_reconColCount == NULL), "m_reconColCount was initialized");
+        m_numRows = (m_fencPic->m_picHeight + g_maxCUSize - 1) / g_maxCUSize;
+        m_reconColCount = new ThreadSafeInteger[m_numRows];
+
+        if (quantOffsets)
+        {
+            int32_t cuCount = m_lowres.maxBlocksInRow * m_lowres.maxBlocksInCol;
+            m_quantOffsets = new float[cuCount];
+        }
+        return true;
+    }
+    return false;
 }

 bool Frame::allocEncodeData(x265_param *param, const SPS& sps)
@@ -56,15 +71,27 @@
     m_encData = new FrameData;
     m_reconPic = new PicYuv;
     m_encData->m_reconPic = m_reconPic;
-    bool ok = m_encData->create(param, sps) && m_reconPic->create(param->sourceWidth, param->sourceHeight, param->internalCsp);
+    bool ok = m_encData->create(*param, sps) && m_reconPic->create(param->sourceWidth, param->sourceHeight, param->internalCsp);
     if (ok)
     {
         /* initialize right border of m_reconpicYuv as SAO may read beyond the
          * end of the picture accessing uninitialized pixels */
         int maxHeight = sps.numCuInHeight * g_maxCUSize;
-        memset(m_reconPic->m_picOrg[0], 0, sizeof(pixel) * m_reconPic->m_stride * maxHeight);
-        memset(m_reconPic->m_picOrg[1], 0, sizeof(pixel) * m_reconPic->m_strideC * (maxHeight >> m_reconPic->m_vChromaShift));
-        memset(m_reconPic->m_picOrg[2], 0, sizeof(pixel) * m_reconPic->m_strideC * (maxHeight >> m_reconPic->m_vChromaShift));
+        memset(m_reconPic->m_picOrg[0], 0, sizeof(pixel)* m_reconPic->m_stride * maxHeight);
+
+        /* use pre-calculated cu/pu offsets cached in the SPS structure */
+        m_reconPic->m_cuOffsetY = sps.cuOffsetY;
+        m_reconPic->m_buOffsetY = sps.buOffsetY;
+
+        if (param->internalCsp != X265_CSP_I400)
+        {
+            memset(m_reconPic->m_picOrg[1], 0, sizeof(pixel) * m_reconPic->m_strideC * (maxHeight >> m_reconPic->m_vChromaShift));
+            memset(m_reconPic->m_picOrg[2], 0, sizeof(pixel) * m_reconPic->m_strideC * (maxHeight >> m_reconPic->m_vChromaShift));

+            /* use pre-calculated cu/pu offsets cached in the SPS structure */
+            m_reconPic->m_cuOffsetC = sps.cuOffsetC;
+            m_reconPic->m_buOffsetC = sps.buOffsetC;
+        }
     }
     return ok;
 }
@@ -100,5 +127,16 @@
         m_reconPic = NULL;
     }

+    if (m_reconColCount)
+    {
+        delete[] m_reconColCount;
+        m_reconColCount = NULL;
+    }
+
+    if (m_quantOffsets)
+    {
+        delete[] m_quantOffsets;
+    }
+
     m_lowres.destroy();
 }
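Frame::create() only allocates m_quantOffsets when the caller actually supplies offsets, which matches the "Quant offsets" API-only feature in the 1.9 release notes. A hedged sketch of the application side, assuming the 1.9 x265_picture exposes a float* quantOffsets field holding one QP offset per 16x16 lowres block (the helper and its parameters are illustrative):

    #include <vector>
    #include "x265.h"   // assumes the 1.9 public API

    // Sketch only: 'enc' and 'pic' are assumed to be already configured
    // (x265_picture_init, planes/strides set) by the caller.
    void encodeWithOffsets(x265_encoder* enc, x265_picture* pic,
                           int blocksInRow, int blocksInCol)
    {
        // One offset per 16x16 lowres block, matching the cuCount computed
        // in Frame::create() above (maxBlocksInRow * maxBlocksInCol).
        std::vector<float> offsets(blocksInRow * blocksInCol, 0.0f);
        offsets[0] = -3.0f;            // e.g. spend more bits on the first block
        pic->quantOffsets = offsets.data();

        x265_nal* nals;
        uint32_t numNals;
        x265_encoder_encode(enc, &nals, &numNals, pic, NULL);
    }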
View file
x265_1.8.tar.gz/source/common/frame.h -> x265_1.9.tar.gz/source/common/frame.h
Changed
@@ -35,7 +35,7 @@
 class PicYuv;
 struct SPS;

-#define IS_REFERENCED(frame) (frame->m_lowres.sliceType != X265_TYPE_B)
+#define IS_REFERENCED(frame) (frame->m_lowres.sliceType != X265_TYPE_B)

 class Frame
 {
@@ -59,8 +59,12 @@
     bool           m_lowresInit;         // lowres init complete (pre-analysis)
     bool           m_bChromaExtended;    // orig chroma planes motion extended for weight analysis

+    float*         m_quantOffsets;       // points to quantOffsets in x265_picture
+
     /* Frame Parallelism - notification between FrameEncoders of available motion reference rows */
     ThreadSafeInteger m_reconRowCount;      // count of CTU rows completely reconstructed and extended for motion reference
+    ThreadSafeInteger* m_reconColCount;     // count of CTU cols completely reconstructed and extended for motion reference
+    int32_t        m_numRows;
     volatile uint32_t m_countRefEncoders;   // count of FrameEncoder threads monitoring m_reconRowCount

     Frame*         m_next;                  // PicList doubly linked list pointers
@@ -69,7 +73,7 @@
     x265_analysis_data m_analysisData;
     Frame();

-    bool create(x265_param *param);
+    bool create(x265_param *param, float* quantOffsets);
     bool allocEncodeData(x265_param *param, const SPS& sps);
     void reinit(const SPS& sps);
     void destroy();
View file
x265_1.8.tar.gz/source/common/framedata.cpp -> x265_1.9.tar.gz/source/common/framedata.cpp
Changed
@@ -31,15 +31,15 @@
     memset(this, 0, sizeof(*this));
 }

-bool FrameData::create(x265_param *param, const SPS& sps)
+bool FrameData::create(const x265_param& param, const SPS& sps)
 {
-    m_param = param;
+    m_param = &param;
     m_slice = new Slice;
     m_picCTU = new CUData[sps.numCUsInFrame];

-    m_cuMemPool.create(0, param->internalCsp, sps.numCUsInFrame);
+    m_cuMemPool.create(0, param.internalCsp, sps.numCUsInFrame);
     for (uint32_t ctuAddr = 0; ctuAddr < sps.numCUsInFrame; ctuAddr++)
-        m_picCTU[ctuAddr].initialize(m_cuMemPool, 0, param->internalCsp, ctuAddr);
+        m_picCTU[ctuAddr].initialize(m_cuMemPool, 0, param.internalCsp, ctuAddr);

     CHECKED_MALLOC(m_cuStat, RCStatCU, sps.numCUsInFrame);
     CHECKED_MALLOC(m_rowStat, RCStatRow, sps.numCuInHeight);
View file
x265_1.8.tar.gz/source/common/framedata.h -> x265_1.9.tar.gz/source/common/framedata.h
Changed
@@ -55,8 +55,7 @@
     double      avgLumaDistortion;
     double      avgChromaDistortion;
     double      avgPsyEnergy;
-    double      avgLumaLevel;
-    double      lumaLevel;
+    double      avgResEnergy;
     double      percentIntraNxN;
     double      percentSkipCu[NUM_CU_DEPTH];
     double      percentMergeCu[NUM_CU_DEPTH];
@@ -69,13 +68,13 @@
     uint64_t    lumaDistortion;
     uint64_t    chromaDistortion;
     uint64_t    psyEnergy;
+    uint64_t    resEnergy;
     uint64_t    cntSkipCu[NUM_CU_DEPTH];
     uint64_t    cntMergeCu[NUM_CU_DEPTH];
     uint64_t    cntInter[NUM_CU_DEPTH];
     uint64_t    cntIntra[NUM_CU_DEPTH];
     uint64_t    cuInterDistribution[NUM_CU_DEPTH][INTER_MODES];
     uint64_t    cuIntraDistribution[NUM_CU_DEPTH][INTRA_MODES];
-    uint16_t    maxLumaLevel;

     FrameStats()
     {
@@ -96,7 +95,7 @@
     Slice*         m_slice;
     SAOParam*      m_saoParam;
-    x265_param*    m_param;
+    const x265_param* m_param;

     FrameData*     m_freeListNext;
     PicYuv*        m_reconPic;
@@ -135,19 +134,44 @@
     RCStatCU*      m_cuStat;
     RCStatRow*     m_rowStat;
     FrameStats     m_frameStats; // stats of current frame for multi-pass encodes

+    /* data needed for periodic intra refresh */
+    struct PeriodicIR
+    {
+        uint32_t pirStartCol;
+        uint32_t pirEndCol;
+        int      framesSinceLastPir;
+    };
+
+    PeriodicIR     m_pir;
     double         m_avgQpRc;    /* avg QP as decided by rate-control */
     double         m_avgQpAq;    /* avg QP as decided by AQ in addition to rate-control */
     double         m_rateFactor; /* calculated based on the Frame QP */

     FrameData();

-    bool create(x265_param *param, const SPS& sps);
+    bool create(const x265_param& param, const SPS& sps);
     void reinit(const SPS& sps);
     void destroy();
+    inline CUData* getPicCTU(uint32_t ctuAddr) { return &m_picCTU[ctuAddr]; }
+};
+
+/* Stores intra analysis data for a single frame. This struct needs better packing */
+struct analysis_intra_data
+{
+    uint8_t*  depth;
+    uint8_t*  modes;
+    char*     partSizes;
+    uint8_t*  chromaModes;
+};

-    CUData* getPicCTU(uint32_t ctuAddr) { return &m_picCTU[ctuAddr]; }
+/* Stores inter analysis data for a single frame */
+struct analysis_inter_data
+{
+    MV*       mv;
+    int32_t*  ref;
+    uint8_t*  depth;
+    uint8_t*  modes;
+    uint32_t* bestMergeCand;
 };
 }
-
 #endif // ifndef X265_FRAMEDATA_H
View file
x265_1.8.tar.gz/source/common/ipfilter.cpp -> x265_1.9.tar.gz/source/common/ipfilter.cpp
Changed
@@ -4,6 +4,7 @@
 * Authors: Deepthi Devaki <deepthidevaki@multicorewareinc.com>,
 *          Rajesh Paulraj <rajesh@multicorewareinc.com>
 *          Praveen Kumar Tiwari <praveen@multicorewareinc.com>
+*          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/common/loopfilter.cpp -> x265_1.9.tar.gz/source/common/loopfilter.cpp
Changed
@@ -3,6 +3,7 @@
 *
 * Authors: Praveen Kumar Tiwari <praveen@multicorewareinc.com>
 *          Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com>
+*          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
@@ -136,6 +137,27 @@
         rec += stride;
     }
 }
+
+static void pelFilterLumaStrong_c(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tcP, int32_t tcQ)
+{
+    for (int32_t i = 0; i < UNIT_SIZE; i++, src += srcStep)
+    {
+        int16_t m4 = (int16_t)src[0];
+        int16_t m3 = (int16_t)src[-offset];
+        int16_t m5 = (int16_t)src[offset];
+        int16_t m2 = (int16_t)src[-offset * 2];
+        int16_t m6 = (int16_t)src[offset * 2];
+        int16_t m1 = (int16_t)src[-offset * 3];
+        int16_t m7 = (int16_t)src[offset * 3];
+        int16_t m0 = (int16_t)src[-offset * 4];
+        src[-offset * 3] = (pixel)(x265_clip3(-tcP, tcP, ((2 * m0 + 3 * m1 + m2 + m3 + m4 + 4) >> 3) - m1) + m1);
+        src[-offset * 2] = (pixel)(x265_clip3(-tcP, tcP, ((m1 + m2 + m3 + m4 + 2) >> 2) - m2) + m2);
+        src[-offset] = (pixel)(x265_clip3(-tcP, tcP, ((m1 + 2 * m2 + 2 * m3 + 2 * m4 + m5 + 4) >> 3) - m3) + m3);
+        src[0] = (pixel)(x265_clip3(-tcQ, tcQ, ((m2 + 2 * m3 + 2 * m4 + 2 * m5 + m6 + 4) >> 3) - m4) + m4);
+        src[offset] = (pixel)(x265_clip3(-tcQ, tcQ, ((m3 + m4 + m5 + m6 + 2) >> 2) - m5) + m5);
+        src[offset * 2] = (pixel)(x265_clip3(-tcQ, tcQ, ((m3 + m4 + m5 + 3 * m6 + 2 * m7 + 4) >> 3) - m6) + m6);
+    }
+}
 }

 namespace X265_NS {
@@ -150,5 +172,9 @@
     p.saoCuOrgE3[1] = processSaoCUE3;
     p.saoCuOrgB0 = processSaoCUB0;
     p.sign = calSign;
+
+    // C code is same for EDGE_VER and EDGE_HOR only asm code is different
+    p.pelFilterLumaStrong[0] = pelFilterLumaStrong_c;
+    p.pelFilterLumaStrong[1] = pelFilterLumaStrong_c;
 }
 }
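The same C routine can back both table entries because the loop is direction-agnostic: the caller expresses the edge orientation entirely through the srcStep/offset pair, so only the assembly versions need distinct EDGE_VER/EDGE_HOR bodies. A hedged sketch of the two call shapes, with the argument pairing inferred from the deblock.cpp changes above (the wrapper and variable names are hypothetical):

    // 'pixel' and 'primitives' come from x265's own headers in the real
    // encoder; tcP/tcQ are the per-side clip bounds (2*tc & side mask).
    void strongFilterEdge(pixel* src, intptr_t stride, int dir,
                          int32_t tcP, int32_t tcQ)
    {
        if (dir == 0)  // EDGE_VER: step down the edge, tap across it
            primitives.pelFilterLumaStrong[0](src, stride, 1, tcP, tcQ);
        else           // EDGE_HOR: the transposed access pattern
            primitives.pelFilterLumaStrong[1](src, 1, stride, tcP, tcQ);
    }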
View file
x265_1.8.tar.gz/source/common/lowres.cpp -> x265_1.9.tar.gz/source/common/lowres.cpp
Changed
@@ -52,6 +52,7 @@
         CHECKED_MALLOC(qpAqOffset, double, cuCount);
         CHECKED_MALLOC(invQscaleFactor, int, cuCount);
         CHECKED_MALLOC(qpCuTreeOffset, double, cuCount);
+        CHECKED_MALLOC(blockVariance, uint32_t, cuCount);
     }
     CHECKED_MALLOC(propagateCost, uint16_t, cuCount);
@@ -120,18 +121,17 @@
     X265_FREE(invQscaleFactor);
     X265_FREE(qpCuTreeOffset);
     X265_FREE(propagateCost);
+    X265_FREE(blockVariance);
 }

 // (re) initialize lowres state
 void Lowres::init(PicYuv *origPic, int poc)
 {
     bLastMiniGopBFrame = false;
-    bScenecut = false; // could be a scene-cut, until ruled out by flash detection
     bKeyframe = false; // Not a keyframe unless identified by lookahead
     frameNum = poc;
     leadingBframes = 0;
     indB = 0;
-    satdCost = (int64_t)-1;
     memset(costEst, -1, sizeof(costEst));
     memset(weightedCostDelta, 0, sizeof(weightedCostDelta));
View file
x265_1.8.tar.gz/source/common/lowres.h -> x265_1.9.tar.gz/source/common/lowres.h
Changed
@@ -143,12 +143,15 @@
     double*   qpAqOffset;      // AQ QP offset values for each 16x16 CU
     double*   qpCuTreeOffset;  // cuTree QP offset values for each 16x16 CU
     int*      invQscaleFactor; // qScale values for qp Aq Offsets
+    uint32_t* blockVariance;
     uint64_t  wp_ssd[3];       // This is different than SSDY, this is sum(pixel^2) - sum(pixel)^2 for entire frame
     uint64_t  wp_sum[3];
+    uint64_t  frameVariance;

     /* cutree intermediate data */
     uint16_t* propagateCost;
     double    weightedCostDelta[X265_BFRAME_MAX + 2];
+    ReferencePlanes weightedRef[X265_BFRAME_MAX + 2];

     bool create(PicYuv *origPic, int _bframes, bool bAqEnabled);
     void destroy();
View file
x265_1.8.tar.gz/source/common/param.cpp -> x265_1.9.tar.gz/source/common/param.cpp
Changed
@@ -147,7 +147,7 @@
     param->bFrameAdaptive = X265_B_ADAPT_TRELLIS;
     param->bBPyramid = 1;
     param->scenecutThreshold = 40; /* Magic number pulled in from x264 */
-    param->lookaheadSlices = 0;
+    param->lookaheadSlices = 8;

     /* Intra Coding Tools */
     param->bEnableConstrainedIntra = 0;
@@ -159,7 +159,8 @@
     param->subpelRefine = 2;
     param->searchRange = 57;
     param->maxNumMergeCand = 2;
-    param->limitReferences = 0;
+    param->limitReferences = 3;
+    param->limitModes = 0;
     param->bEnableWeightedPred = 1;
     param->bEnableWeightedBiPred = 0;
     param->bEnableEarlySkip = 0;
@@ -184,7 +185,7 @@
     param->cbQpOffset = 0;
     param->crQpOffset = 0;
     param->rdPenalty = 0;
-    param->psyRd = 0.3;
+    param->psyRd = 2.0;
     param->psyRdoq = 0.0;
     param->analysisMode = 0;
     param->analysisFileName = NULL;
@@ -241,6 +242,10 @@
     param->vui.defDispWinRightOffset = 0;
     param->vui.defDispWinTopOffset = 0;
     param->vui.defDispWinBottomOffset = 0;
+    param->maxCLL = 0;
+    param->maxFALL = 0;
+    param->minLuma = 0;
+    param->maxLuma = (1 << X265_DEPTH) - 1;
 }

 int x265_param_default_preset(x265_param* param, const char* preset, const char* tune)
@@ -274,9 +279,9 @@
             param->bEnableWeightedPred = 0;
             param->rdLevel = 2;
             param->maxNumReferences = 1;
+            param->limitReferences = 0;
             param->rc.aqStrength = 0.0;
             param->rc.aqMode = X265_AQ_NONE;
-            param->rc.cuTree = 0;
             param->rc.qgSize = 32;
             param->bEnableFastIntra = 1;
         }
@@ -291,9 +296,9 @@
             param->bEnableWeightedPred = 0;
             param->rdLevel = 2;
             param->maxNumReferences = 1;
+            param->limitReferences = 0;
             param->rc.aqStrength = 0.0;
             param->rc.aqMode = X265_AQ_NONE;
-            param->rc.cuTree = 0;
             param->rc.qgSize = 32;
             param->bEnableSAO = 0;
             param->bEnableFastIntra = 1;
         }
@@ -301,13 +306,11 @@
         else if (!strcmp(preset, "veryfast"))
         {
             param->lookaheadDepth = 15;
-            param->maxCUSize = 32;
             param->bFrameAdaptive = 0;
             param->subpelRefine = 1;
             param->bEnableEarlySkip = 1;
             param->rdLevel = 2;
-            param->maxNumReferences = 1;
-            param->rc.cuTree = 0;
+            param->maxNumReferences = 2;
             param->rc.qgSize = 32;
             param->bEnableFastIntra = 1;
         }
@@ -317,8 +320,7 @@
         {
             param->lookaheadDepth = 15;
             param->bFrameAdaptive = 0;
             param->bEnableEarlySkip = 1;
             param->rdLevel = 2;
-            param->maxNumReferences = 1;
-            param->rc.cuTree = 0;
+            param->maxNumReferences = 2;
             param->bEnableFastIntra = 1;
         }
         else if (!strcmp(preset, "fast"))
@@ -326,7 +328,7 @@
             param->lookaheadDepth = 15;
             param->bFrameAdaptive = 0;
             param->rdLevel = 2;
-            param->maxNumReferences = 2;
+            param->maxNumReferences = 3;
             param->bEnableFastIntra = 1;
         }
         else if (!strcmp(preset, "medium"))
@@ -343,6 +345,9 @@
             param->subpelRefine = 3;
             param->maxNumMergeCand = 3;
             param->searchMethod = X265_STAR_SEARCH;
+            param->maxNumReferences = 4;
+            param->limitModes = 1;
+            param->lookaheadSlices = 4; // limit parallelism as already enough work exists
         }
         else if (!strcmp(preset, "slower"))
         {
@@ -359,7 +364,11 @@
             param->subpelRefine = 3;
             param->maxNumMergeCand = 3;
             param->searchMethod = X265_STAR_SEARCH;
+            param->maxNumReferences = 4;
+            param->limitReferences = 2;
+            param->limitModes = 1;
             param->bIntraInBFrames = 1;
+            param->lookaheadSlices = 4; // limit parallelism as already enough work exists
         }
         else if (!strcmp(preset, "veryslow"))
         {
@@ -377,7 +386,10 @@
             param->maxNumMergeCand = 4;
             param->searchMethod = X265_STAR_SEARCH;
             param->maxNumReferences = 5;
+            param->limitReferences = 1;
+            param->limitModes = 1;
             param->bIntraInBFrames = 1;
+            param->lookaheadSlices = 0; // disabled for best quality
         }
         else if (!strcmp(preset, "placebo"))
        {
@@ -397,8 +409,10 @@
             param->searchMethod = X265_STAR_SEARCH;
             param->bEnableTransformSkip = 1;
             param->maxNumReferences = 5;
+            param->limitReferences = 0;
             param->rc.bEnableSlowFirstPass = 1;
             param->bIntraInBFrames = 1;
+            param->lookaheadSlices = 0; // disabled for best quality
             // TODO: optimized esa
         }
         else
@@ -565,10 +579,14 @@
     OPT2("level-idc", "level")
     {
         /* allow "5.1" or "51", both converted to integer 51 */
-        if (atof(value) < 7)
+        /* if level-idc specifies an obviously wrong value in either float or int,
+           throw error consistently. Stronger level checking will be done in encoder_open() */
+        if (atof(value) < 10)
             p->levelIdc = (int)(10 * atof(value) + .5);
-        else
+        else if (atoi(value) < 100)
             p->levelIdc = atoi(value);
+        else
+            bError = true;
     }
     OPT("high-tier") p->bHighTier = atobool(value);
     OPT("allow-non-conformance") p->bAllowNonConformance = atobool(value);
@@ -608,6 +626,7 @@
     OPT2("constrained-intra", "cip") p->bEnableConstrainedIntra = atobool(value);
     OPT("fast-intra") p->bEnableFastIntra = atobool(value);
     OPT("open-gop") p->bOpenGOP = atobool(value);
+    OPT("intra-refresh") p->bIntraRefresh = atobool(value);
     OPT("lookahead-slices") p->lookaheadSlices = atoi(value);
     OPT("scenecut")
     {
@@ -644,6 +663,7 @@
     }
     OPT("ref") p->maxNumReferences = atoi(value);
     OPT("limit-refs") p->limitReferences = atoi(value);
+    OPT("limit-modes") p->limitModes = atobool(value);
     OPT("weightp") p->bEnableWeightedPred = atobool(value);
     OPT("weightb") p->bEnableWeightedBiPred = atobool(value);
     OPT("cbqpoffs") p->cbQpOffset = atoi(value);
@@ -854,7 +874,9 @@
     OPT("analysis-file") p->analysisFileName = strdup(value);
     OPT("qg-size") p->rc.qgSize = atoi(value);
     OPT("master-display") p->masteringDisplayColorVolume = strdup(value);
-    OPT("max-cll") p->contentLightLevelInfo = strdup(value);
+    OPT("max-cll") bError |= sscanf(value, "%hu,%hu", &p->maxCLL, &p->maxFALL) != 2;
+    OPT("min-luma") p->minLuma = (uint16_t)atoi(value);
+    OPT("max-luma") p->maxLuma = (uint16_t)atoi(value);
     else
         return X265_PARAM_BAD_NAME;
 #undef OPT
@@ -1035,6 +1057,8 @@
           "subme must be greater than or equal to 0");
     CHECK(param->limitReferences > 3,
           "limitReferences must be 0, 1, 2 or 3");
+    CHECK(param->limitModes > 1,
+          "limitRectAmp must be 0, 1");
     CHECK(param->frameNumThreads < 0 || param->frameNumThreads > X265_MAX_FRAME_THREADS,
           "frameNumThreads (--frame-threads) must be [0 .. X265_MAX_FRAME_THREADS)");
     CHECK(param->cbQpOffset < -12, "Min. Chroma Cb QP Offset is -12");
@@ -1063,8 +1087,8 @@
     CHECK(param->sourceWidth < (int)param->maxCUSize || param->sourceHeight < (int)param->maxCUSize,
           "Picture size must be at least one CTU");
-    CHECK(param->internalCsp < X265_CSP_I420 || X265_CSP_I444 < param->internalCsp,
-          "Color space must be i420, i422, or i444");
+    CHECK(param->internalCsp < X265_CSP_I400 || X265_CSP_I444 < param->internalCsp,
+          "chroma subsampling must be i400 (4:0:0 monochrome), i420 (4:2:0 default), i422 (4:2:0), i444 (4:4:4)");
     CHECK(param->sourceWidth & !!CHROMA_H_SHIFT(param->internalCsp),
           "Picture width must be an integer multiple of the specified chroma subsampling");
     CHECK(param->sourceHeight & !!CHROMA_V_SHIFT(param->internalCsp),
@@ -1094,7 +1118,7 @@
           "deblocking filter tC offset must be in the range of -6 to +6");
     CHECK(param->deblockingFilterBetaOffset < -6 || param->deblockingFilterBetaOffset > 6,
           "deblocking filter Beta offset must be in the range of -6 to +6");
-    CHECK(param->psyRd < 0 || 2.0 < param->psyRd, "Psy-rd strength must be between 0 and 2.0");
+    CHECK(param->psyRd < 0 || 5.0 < param->psyRd, "Psy-rd strength must be between 0 and 5.0");
     CHECK(param->psyRdoq < 0 || 50.0 < param->psyRdoq, "Psy-rdoq strength must be between 0 and 50.0");
     CHECK(param->bEnableWavefront < 0, "WaveFrontSynchro cannot be negative");
     CHECK((param->vui.aspectRatioIdc < 0
@@ -1170,7 +1194,7 @@
     CHECK(0 > param->noiseReductionIntra || param->noiseReductionIntra > 2000, "Valid noise reduction range 0 - 2000");
     if (param->noiseReductionInter)
         CHECK(0 > param->noiseReductionInter || param->noiseReductionInter > 2000, "Valid noise reduction range 0 - 2000");
-    CHECK(param->rc.rateControlMode == X265_RC_CRF && param->rc.bStatRead,
+    CHECK(param->rc.rateControlMode == X265_RC_CRF && param->rc.bStatRead && param->rc.vbvMaxBitrate == 0,
           "Constant rate-factor is incompatible with 2pass");
     CHECK(param->rc.rateControlMode == X265_RC_CQP && param->rc.bStatRead,
           "Constant QP is incompatible with 2pass");
@@ -1307,6 +1331,7 @@
 #define TOOLVAL(VAL, STR) if (VAL) { sprintf(tmp, STR, VAL); appendtool(param, buf, sizeof(buf), tmp); }
     TOOLOPT(param->bEnableRectInter, "rect");
     TOOLOPT(param->bEnableAMP, "amp");
+    TOOLOPT(param->limitModes, "limit-modes");
     TOOLVAL(param->rdLevel, "rd=%d");
     TOOLVAL(param->psyRd, "psy-rd=%.2lf");
     TOOLVAL(param->rdoqLevel, "rdoq=%d");
@@ -1428,6 +1453,7 @@
     s += sprintf(s, " b-adapt=%d", p->bFrameAdaptive);
     s += sprintf(s, " ref=%d", p->maxNumReferences);
     s += sprintf(s, " limit-refs=%d", p->limitReferences);
+    BOOL(p->limitModes, "limit-modes");
     BOOL(p->bEnableWeightedPred, "weightp");
     BOOL(p->bEnableWeightedBiPred, "weightb");
     s += sprintf(s, " aq-mode=%d", p->rc.aqMode);
@@ -1447,6 +1473,7 @@
     BOOL(p->bSaoNonDeblocked, "sao-non-deblock");
     BOOL(p->bBPyramid, "b-pyramid");
     BOOL(p->rc.cuTree, "cutree");
+    BOOL(p->bIntraRefresh, "intra-refresh");
     s += sprintf(s, " rc=%s", p->rc.rateControlMode == X265_RC_ABR ? (
          p->rc.bStatRead ? "2 pass" : p->rc.bitrate == p->rc.vbvMaxBitrate ? "cbr" : "abr")
          : p->rc.rateControlMode == X265_RC_CRF ? "crf" : "cqp");
View file
x265_1.8.tar.gz/source/common/picyuv.cpp -> x265_1.9.tar.gz/source/common/picyuv.cpp
Changed
@@ -2,6 +2,7 @@
 * Copyright (C) 2015 x265 project
 *
 * Authors: Steve Borho <steve@borho.org>
+ *          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
@@ -42,6 +43,9 @@
     m_cuOffsetC = NULL;
     m_buOffsetY = NULL;
     m_buOffsetC = NULL;
+
+    m_maxLumaLevel = 0;
+    m_avgLumaLevel = 0;
 }

 bool PicYuv::create(uint32_t picWidth, uint32_t picHeight, uint32_t picCsp)
@@ -59,20 +63,27 @@
     m_lumaMarginY = g_maxCUSize + 16; // margin for 8-tap filter and infinite padding
     m_stride = (numCuInWidth * g_maxCUSize) + (m_lumaMarginX << 1);

-    m_chromaMarginX = m_lumaMarginX;  // keep 16-byte alignment for chroma CTUs
-    m_chromaMarginY = m_lumaMarginY >> m_vChromaShift;
-
-    m_strideC = ((numCuInWidth * g_maxCUSize) >> m_hChromaShift) + (m_chromaMarginX * 2);
     int maxHeight = numCuInHeight * g_maxCUSize;

     CHECKED_MALLOC(m_picBuf[0], pixel, m_stride * (maxHeight + (m_lumaMarginY * 2)));
-    CHECKED_MALLOC(m_picBuf[1], pixel, m_strideC * ((maxHeight >> m_vChromaShift) + (m_chromaMarginY * 2)));
-    CHECKED_MALLOC(m_picBuf[2], pixel, m_strideC * ((maxHeight >> m_vChromaShift) + (m_chromaMarginY * 2)));
+    m_picOrg[0] = m_picBuf[0] + m_lumaMarginY * m_stride + m_lumaMarginX;
+
+    if (picCsp != X265_CSP_I400)
+    {
+        m_chromaMarginX = m_lumaMarginX;  // keep 16-byte alignment for chroma CTUs
+        m_chromaMarginY = m_lumaMarginY >> m_vChromaShift;
+        m_strideC = ((numCuInWidth * g_maxCUSize) >> m_hChromaShift) + (m_chromaMarginX * 2);

-    m_picOrg[0] = m_picBuf[0] + m_lumaMarginY * m_stride + m_lumaMarginX;
-    m_picOrg[1] = m_picBuf[1] + m_chromaMarginY * m_strideC + m_chromaMarginX;
-    m_picOrg[2] = m_picBuf[2] + m_chromaMarginY * m_strideC + m_chromaMarginX;
+        CHECKED_MALLOC(m_picBuf[1], pixel, m_strideC * ((maxHeight >> m_vChromaShift) + (m_chromaMarginY * 2)));
+        CHECKED_MALLOC(m_picBuf[2], pixel, m_strideC * ((maxHeight >> m_vChromaShift) + (m_chromaMarginY * 2)));
+        m_picOrg[1] = m_picBuf[1] + m_chromaMarginY * m_strideC + m_chromaMarginX;
+        m_picOrg[2] = m_picBuf[2] + m_chromaMarginY * m_strideC + m_chromaMarginX;
+    }
+    else
+    {
+        m_picBuf[1] = m_picBuf[2] = NULL;
+        m_picOrg[1] = m_picOrg[2] = NULL;
+    }
     return true;

 fail:
@@ -85,27 +96,45 @@
 bool PicYuv::createOffsets(const SPS& sps)
 {
     uint32_t numPartitions = 1 << (g_unitSizeDepth * 2);
-    CHECKED_MALLOC(m_cuOffsetY, intptr_t, sps.numCuInWidth * sps.numCuInHeight);
-    CHECKED_MALLOC(m_cuOffsetC, intptr_t, sps.numCuInWidth * sps.numCuInHeight);
-    for (uint32_t cuRow = 0; cuRow < sps.numCuInHeight; cuRow++)
+
+    if (m_picCsp != X265_CSP_I400)
     {
-        for (uint32_t cuCol = 0; cuCol < sps.numCuInWidth; cuCol++)
+        CHECKED_MALLOC(m_cuOffsetY, intptr_t, sps.numCuInWidth * sps.numCuInHeight);
+        CHECKED_MALLOC(m_cuOffsetC, intptr_t, sps.numCuInWidth * sps.numCuInHeight);
+        for (uint32_t cuRow = 0; cuRow < sps.numCuInHeight; cuRow++)
         {
-            m_cuOffsetY[cuRow * sps.numCuInWidth + cuCol] = m_stride * cuRow * g_maxCUSize + cuCol * g_maxCUSize;
-            m_cuOffsetC[cuRow * sps.numCuInWidth + cuCol] = m_strideC * cuRow * (g_maxCUSize >> m_vChromaShift) + cuCol * (g_maxCUSize >> m_hChromaShift);
+            for (uint32_t cuCol = 0; cuCol < sps.numCuInWidth; cuCol++)
+            {
+                m_cuOffsetY[cuRow * sps.numCuInWidth + cuCol] = m_stride * cuRow * g_maxCUSize + cuCol * g_maxCUSize;
+                m_cuOffsetC[cuRow * sps.numCuInWidth + cuCol] = m_strideC * cuRow * (g_maxCUSize >> m_vChromaShift) + cuCol * (g_maxCUSize >> m_hChromaShift);
+            }
         }
-    }

-    CHECKED_MALLOC(m_buOffsetY, intptr_t, (size_t)numPartitions);
-    CHECKED_MALLOC(m_buOffsetC, intptr_t, (size_t)numPartitions);
-    for (uint32_t idx = 0; idx < numPartitions; ++idx)
-    {
-        intptr_t x = g_zscanToPelX[idx];
-        intptr_t y = g_zscanToPelY[idx];
-        m_buOffsetY[idx] = m_stride * y + x;
-        m_buOffsetC[idx] = m_strideC * (y >> m_vChromaShift) + (x >> m_hChromaShift);
+        CHECKED_MALLOC(m_buOffsetY, intptr_t, (size_t)numPartitions);
+        CHECKED_MALLOC(m_buOffsetC, intptr_t, (size_t)numPartitions);
+        for (uint32_t idx = 0; idx < numPartitions; ++idx)
+        {
+            intptr_t x = g_zscanToPelX[idx];
+            intptr_t y = g_zscanToPelY[idx];
+            m_buOffsetY[idx] = m_stride * y + x;
+            m_buOffsetC[idx] = m_strideC * (y >> m_vChromaShift) + (x >> m_hChromaShift);
+        }
     }
+    else
+    {
+        CHECKED_MALLOC(m_cuOffsetY, intptr_t, sps.numCuInWidth * sps.numCuInHeight);
+        for (uint32_t cuRow = 0; cuRow < sps.numCuInHeight; cuRow++)
+            for (uint32_t cuCol = 0; cuCol < sps.numCuInWidth; cuCol++)
+                m_cuOffsetY[cuRow * sps.numCuInWidth + cuCol] = m_stride * cuRow * g_maxCUSize + cuCol * g_maxCUSize;

+        CHECKED_MALLOC(m_buOffsetY, intptr_t, (size_t)numPartitions);
+        for (uint32_t idx = 0; idx < numPartitions; ++idx)
+        {
+            intptr_t x = g_zscanToPelX[idx];
+            intptr_t y = g_zscanToPelY[idx];
+            m_buOffsetY[idx] = m_stride * y + x;
+        }
+    }
     return true;

 fail:
@@ -121,7 +150,7 @@
 /* Copy pixels from an x265_picture into internal PicYuv instance.
 * Shift pixels as necessary, mask off bits above X265_DEPTH for safety. */
-void PicYuv::copyFromPicture(const x265_picture& pic, int padx, int pady)
+void PicYuv::copyFromPicture(const x265_picture& pic, const x265_param& param, int padx, int pady)
 {
     /* m_picWidth is the width that is being encoded, padx indicates how many
      * of those pixels are padding to reach multiple of MinCU(4) size.
@@ -155,28 +184,29 @@
 #if (X265_DEPTH > 8)
     {
         pixel *yPixel = m_picOrg[0];
-        pixel *uPixel = m_picOrg[1];
-        pixel *vPixel = m_picOrg[2];

         uint8_t *yChar = (uint8_t*)pic.planes[0];
-        uint8_t *uChar = (uint8_t*)pic.planes[1];
-        uint8_t *vChar = (uint8_t*)pic.planes[2];
         int shift = (X265_DEPTH - 8);

         primitives.planecopy_cp(yChar, pic.stride[0] / sizeof(*yChar), yPixel, m_stride, width, height, shift);
-        primitives.planecopy_cp(uChar, pic.stride[1] / sizeof(*uChar), uPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift);
-        primitives.planecopy_cp(vChar, pic.stride[2] / sizeof(*vChar), vPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift);
+
+        if (pic.colorSpace != X265_CSP_I400)
+        {
+            pixel *uPixel = m_picOrg[1];
+            pixel *vPixel = m_picOrg[2];
+
+            uint8_t *uChar = (uint8_t*)pic.planes[1];
+            uint8_t *vChar = (uint8_t*)pic.planes[2];
+
+            primitives.planecopy_cp(uChar, pic.stride[1] / sizeof(*uChar), uPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift);
+            primitives.planecopy_cp(vChar, pic.stride[2] / sizeof(*vChar), vPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift);
        }
     }
 #else /* Case for (X265_DEPTH == 8) */
     // TODO: Does we need this path? may merge into above in future
     {
         pixel *yPixel = m_picOrg[0];
-        pixel *uPixel = m_picOrg[1];
-        pixel *vPixel = m_picOrg[2];
-
         uint8_t *yChar = (uint8_t*)pic.planes[0];
-        uint8_t *uChar = (uint8_t*)pic.planes[1];
-        uint8_t *vChar = (uint8_t*)pic.planes[2];

         for (int r = 0; r < height; r++)
         {
@@ -186,15 +216,24 @@
             yChar += pic.stride[0] / sizeof(*yChar);
         }

-        for (int r = 0; r < height >> m_vChromaShift; r++)
+        if (pic.colorSpace != X265_CSP_I400)
         {
-            memcpy(uPixel, uChar, (width >> m_hChromaShift) * sizeof(pixel));
-            memcpy(vPixel, vChar, (width >> m_hChromaShift) * sizeof(pixel));
+            pixel *uPixel = m_picOrg[1];
+            pixel *vPixel = m_picOrg[2];
+
+            uint8_t *uChar = (uint8_t*)pic.planes[1];
+            uint8_t *vChar = (uint8_t*)pic.planes[2];
+
+            for (int r = 0; r < height >> m_vChromaShift; r++)
+            {
+                memcpy(uPixel, uChar, (width >> m_hChromaShift) * sizeof(pixel));
+                memcpy(vPixel, vChar, (width >> m_hChromaShift) * sizeof(pixel));

-            uPixel += m_strideC;
-            vPixel += m_strideC;
-            uChar += pic.stride[1] / sizeof(*uChar);
-            vChar += pic.stride[2] / sizeof(*vChar);
+                uPixel += m_strideC;
+                vPixel += m_strideC;
+                uChar += pic.stride[1] / sizeof(*uChar);
+                vChar += pic.stride[2] / sizeof(*vChar);
+            }
         }
     }
 #endif /* (X265_DEPTH > 8) */
@@ -205,43 +244,63 @@
         uint16_t mask = (1 << X265_DEPTH) - 1;
         int shift = abs(pic.bitDepth - X265_DEPTH);
         pixel *yPixel = m_picOrg[0];
-        pixel *uPixel = m_picOrg[1];
-        pixel *vPixel = m_picOrg[2];

         uint16_t *yShort = (uint16_t*)pic.planes[0];
-        uint16_t *uShort = (uint16_t*)pic.planes[1];
-        uint16_t *vShort = (uint16_t*)pic.planes[2];

         if (pic.bitDepth > X265_DEPTH)
         {
             /* shift right and mask pixels to final size */
             primitives.planecopy_sp(yShort, pic.stride[0] / sizeof(*yShort), yPixel, m_stride, width, height, shift, mask);
-            primitives.planecopy_sp(uShort, pic.stride[1] / sizeof(*uShort), uPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
-            primitives.planecopy_sp(vShort, pic.stride[2] / sizeof(*vShort), vPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
         }
         else /* Case for (pic.bitDepth <= X265_DEPTH) */
         {
             /* shift left and mask pixels to final size */
             primitives.planecopy_sp_shl(yShort, pic.stride[0] / sizeof(*yShort), yPixel, m_stride, width, height, shift, mask);
-            primitives.planecopy_sp_shl(uShort, pic.stride[1] / sizeof(*uShort), uPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
-            primitives.planecopy_sp_shl(vShort, pic.stride[2] / sizeof(*vShort), vPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
+        }
+
+        if (pic.colorSpace != X265_CSP_I400)
+        {
+            pixel *uPixel = m_picOrg[1];
+            pixel *vPixel = m_picOrg[2];
+
+            uint16_t *uShort = (uint16_t*)pic.planes[1];
+            uint16_t *vShort = (uint16_t*)pic.planes[2];
+
+            if (pic.bitDepth > X265_DEPTH)
+            {
+                primitives.planecopy_sp(uShort, pic.stride[1] / sizeof(*uShort), uPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
+                primitives.planecopy_sp(vShort, pic.stride[2] / sizeof(*vShort), vPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
+            }
+            else /* Case for (pic.bitDepth <= X265_DEPTH) */
+            {
+                primitives.planecopy_sp_shl(uShort, pic.stride[1] / sizeof(*uShort), uPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
+                primitives.planecopy_sp_shl(vShort, pic.stride[2] / sizeof(*vShort), vPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
+            }
         }
     }

     /* extend the right edge if width was not multiple of the minimum CU size */
-    if (padx)
+    uint64_t sumLuma;
+    pixel *Y = m_picOrg[0];
+    m_maxLumaLevel = primitives.planeClipAndMax(Y, m_stride, width, height, &sumLuma, (pixel)param.minLuma, (pixel)param.maxLuma);
+    m_avgLumaLevel = (double)(sumLuma) / (m_picHeight * m_picWidth);
+
+    for (int r = 0; r < height; r++)
     {
-        pixel *Y = m_picOrg[0];
-        pixel *U = m_picOrg[1];
-        pixel *V = m_picOrg[2];
+        for (int x = 0; x < padx; x++)
+            Y[width + x] = Y[width - 1];
+        Y += m_stride;
+    }

-        for (int r = 0; r < height; r++)
-        {
-            for (int x = 0; x < padx; x++)
-                Y[width + x] = Y[width - 1];
+    /* extend the bottom if height was not multiple of the minimum CU size */
+    Y = m_picOrg[0] + (height - 1) * m_stride;
+    for (int i = 1; i <= pady; i++)
+        memcpy(Y + i * m_stride, Y, (width + padx) * sizeof(pixel));

-            Y += m_stride;
-        }
+    if (pic.colorSpace != X265_CSP_I400)
+    {
+        pixel *U = m_picOrg[1];
+        pixel *V = m_picOrg[2];

         for (int r = 0; r < height >> m_vChromaShift; r++)
         {
@@ -254,17 +313,9 @@
             U += m_strideC;
             V += m_strideC;
         }
-    }

-    /* extend the bottom if height was not multiple of the minimum CU size */
-    if (pady)
-    {
-        pixel *Y = m_picOrg[0] + (height - 1) * m_stride;
-        pixel *U = m_picOrg[1] + ((height >> m_vChromaShift) - 1) * m_strideC;
-        pixel *V = m_picOrg[2] + ((height >> m_vChromaShift) - 1) * m_strideC;

-        for (int i = 1; i <= pady; i++)
-            memcpy(Y + i * m_stride, Y, (width + padx) * sizeof(pixel));
+        U = m_picOrg[1] + ((height >> m_vChromaShift) - 1) * m_strideC;
+        V = m_picOrg[2] + ((height >> m_vChromaShift) - 1) * m_strideC;

         for (int j = 1; j <= pady >> m_vChromaShift; j++)
         {
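The two per-picture statistics collected here (m_maxLumaLevel and m_avgLumaLevel) are the raw inputs for the content light level reporting mentioned in the 1.9 release notes: maxCLL is the peak luma level over the whole encode and maxFALL derives from the per-frame averages. The following is a hedged sketch of one plausible way to fold them together; the encoder's actual bookkeeping may differ:

    #include <algorithm>
    #include <cstdint>

    // Illustrative accumulator, not x265's code.
    struct LightLevel
    {
        uint16_t maxCLL = 0;   // peak luma level seen in any frame
        double   sumFALL = 0;  // running sum of per-frame average levels
        int      frames = 0;

        void addFrame(uint16_t maxLumaLevel, double avgLumaLevel)
        {
            maxCLL = std::max(maxCLL, maxLumaLevel);
            sumFALL += avgLumaLevel;
            frames++;
        }
        double maxFALL() const { return frames ? sumFALL / frames : 0; }
    };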
View file
x265_1.8.tar.gz/source/common/picyuv.h -> x265_1.9.tar.gz/source/common/picyuv.h
Changed
@@ -60,13 +60,16 @@
     uint32_t m_chromaMarginX;
     uint32_t m_chromaMarginY;

+    uint16_t m_maxLumaLevel;
+    double   m_avgLumaLevel;
+
     PicYuv();

     bool  create(uint32_t picWidth, uint32_t picHeight, uint32_t csp);
     bool  createOffsets(const SPS& sps);
     void  destroy();

-    void  copyFromPicture(const x265_picture&, int padx, int pady);
+    void  copyFromPicture(const x265_picture&, const x265_param& param, int padx, int pady);

     intptr_t getChromaAddrOffset(uint32_t ctuAddr, uint32_t absPartIdx) const { return m_cuOffsetC[ctuAddr] + m_buOffsetC[absPartIdx]; }
View file
x265_1.8.tar.gz/source/common/pixel.cpp -> x265_1.9.tar.gz/source/common/pixel.cpp
Changed
@@ -25,6 +25,7 @@
 *****************************************************************************/

 #include "common.h"
+#include "slicetype.h"      // LOWRES_COST_MASK
 #include "primitives.h"
 #include "x265.h"

@@ -117,9 +118,9 @@
 }

 template<int lx, int ly, class T1, class T2>
-sse_ret_t sse(const T1* pix1, intptr_t stride_pix1, const T2* pix2, intptr_t stride_pix2)
+sse_t sse(const T1* pix1, intptr_t stride_pix1, const T2* pix2, intptr_t stride_pix2)
 {
-    sse_ret_t sum = 0;
+    sse_t sum = 0;
     int tmp;

     for (int y = 0; y < ly; y++)
@@ -187,37 +188,6 @@
     return (int)(sum >> 1);
 }

-static int satd_4x4(const int16_t* pix1, intptr_t stride_pix1)
-{
-    int32_t tmp[4][4];
-    int32_t s01, s23, d01, d23;
-    int32_t satd = 0;
-    int d;
-
-    for (d = 0; d < 4; d++, pix1 += stride_pix1)
-    {
-        s01 = pix1[0] + pix1[1];
-        s23 = pix1[2] + pix1[3];
-        d01 = pix1[0] - pix1[1];
-        d23 = pix1[2] - pix1[3];
-
-        tmp[d][0] = s01 + s23;
-        tmp[d][1] = s01 - s23;
-        tmp[d][2] = d01 - d23;
-        tmp[d][3] = d01 + d23;
-    }
-
-    for (d = 0; d < 4; d++)
-    {
-        s01 = tmp[0][d] + tmp[1][d];
-        s23 = tmp[2][d] + tmp[3][d];
-        d01 = tmp[0][d] - tmp[1][d];
-        d23 = tmp[2][d] - tmp[3][d];
-        satd += abs(s01 + s23) + abs(s01 - s23) + abs(d01 - d23) + abs(d01 + d23);
-    }
-    return (int)(satd / 2);
-}
-
 // x264's SWAR version of satd 8x4, performs two 4x4 SATDs at once
 static int satd_8x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
 {
@@ -313,57 +283,6 @@
     return (int)((_sa8d_8x8(pix1, i_pix1, pix2, i_pix2) + 2) >> 2);
 }

-inline int _sa8d_8x8(const int16_t* pix1, intptr_t i_pix1)
-{
-    int32_t tmp[8][8];
-    int32_t a0, a1, a2, a3, a4, a5, a6, a7;
-    int32_t sum = 0;
-
-    for (int i = 0; i < 8; i++, pix1 += i_pix1)
-    {
-        a0 = pix1[0] + pix1[1];
-        a1 = pix1[2] + pix1[3];
-        a2 = pix1[4] + pix1[5];
-        a3 = pix1[6] + pix1[7];
-        a4 = pix1[0] - pix1[1];
-        a5 = pix1[2] - pix1[3];
-        a6 = pix1[4] - pix1[5];
-        a7 = pix1[6] - pix1[7];
-        tmp[i][0] = (a0 + a1) + (a2 + a3);
-        tmp[i][1] = (a0 + a1) - (a2 + a3);
-        tmp[i][2] = (a0 - a1) + (a2 - a3);
-        tmp[i][3] = (a0 - a1) - (a2 - a3);
-        tmp[i][4] = (a4 + a5) + (a6 + a7);
-        tmp[i][5] = (a4 + a5) - (a6 + a7);
-        tmp[i][6] = (a4 - a5) + (a6 - a7);
-        tmp[i][7] = (a4 - a5) - (a6 - a7);
-    }
-
-    for (int i = 0; i < 8; i++)
-    {
-        a0 = (tmp[0][i] + tmp[1][i]) + (tmp[2][i] + tmp[3][i]);
-        a2 = (tmp[0][i] + tmp[1][i]) - (tmp[2][i] + tmp[3][i]);
-        a1 = (tmp[0][i] - tmp[1][i]) + (tmp[2][i] - tmp[3][i]);
-        a3 = (tmp[0][i] - tmp[1][i]) - (tmp[2][i] - tmp[3][i]);
-        a4 = (tmp[4][i] + tmp[5][i]) + (tmp[6][i] + tmp[7][i]);
-        a6 = (tmp[4][i] + tmp[5][i]) - (tmp[6][i] + tmp[7][i]);
-        a5 = (tmp[4][i] - tmp[5][i]) + (tmp[6][i] - tmp[7][i]);
-        a7 = (tmp[4][i] - tmp[5][i]) - (tmp[6][i] - tmp[7][i]);
-        a0 = abs(a0 + a4) + abs(a0 - a4);
-        a0 += abs(a1 + a5) + abs(a1 - a5);
-        a0 += abs(a2 + a6) + abs(a2 - a6);
-        a0 += abs(a3 + a7) + abs(a3 - a7);
-        sum += a0;
-    }
-
-    return (int)sum;
-}
-
-static int sa8d_8x8(const int16_t* pix1, intptr_t i_pix1)
-{
-    return (int)((_sa8d_8x8(pix1, i_pix1) + 2) >> 2);
-}
-
 static int sa8d_16x16(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
 {
     int sum = _sa8d_8x8(pix1, i_pix1, pix2, i_pix2)
@@ -403,9 +322,9 @@
 }

 template<int size>
-int pixel_ssd_s_c(const int16_t* a, intptr_t dstride)
+sse_t pixel_ssd_s_c(const int16_t* a, intptr_t dstride)
 {
-    int sum = 0;
+    sse_t sum = 0;
     for (int y = 0; y < size; y++)
     {
         for (int x = 0; x < size; x++)
@@ -783,39 +702,6 @@
     }
 }

-template<int size>
-int psyCost_ss(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride)
-{
-    static int16_t zeroBuf[8] /* = { 0 } */;
-
-    if (size)
-    {
-        int dim = 1 << (size + 2);
-        uint32_t totEnergy = 0;
-        for (int i = 0; i < dim; i += 8)
-        {
-            for (int j = 0; j < dim; j+= 8)
-            {
-                /* AC energy, measured by sa8d (AC + DC) minus SAD (DC) */
-                int sourceEnergy = sa8d_8x8(source + i * sstride + j, sstride) -
-                                   (sad<8, 8>(source + i * sstride + j, sstride, zeroBuf, 0) >> 2);
-                int reconEnergy = sa8d_8x8(recon + i * rstride + j, rstride) -
-                                  (sad<8, 8>(recon + i * rstride + j, rstride, zeroBuf, 0) >> 2);
-
-                totEnergy += abs(sourceEnergy - reconEnergy);
-            }
-        }
-        return totEnergy;
-    }
-    else
-    {
-        /* 4x4 is too small for sa8d */
-        int sourceEnergy = satd_4x4(source, sstride) - (sad<4, 4>(source, sstride, zeroBuf, 0) >> 2);
-        int reconEnergy = satd_4x4(recon, rstride) - (sad<4, 4>(recon, rstride, zeroBuf, 0) >> 2);
-        return abs(sourceEnergy - reconEnergy);
-    }
-}
-
 template<int bx, int by>
 void blockcopy_pp_c(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb)
 {
@@ -960,19 +846,57 @@
 /* Estimate the total amount of influence on future quality that could be had if we
 * were to improve the reference samples used to inter predict any given CU. */
 static void estimateCUPropagateCost(int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts,
-                                    const int32_t* invQscales, const double* fpsFactor, int len)
+                                    const int32_t* invQscales, const double* fpsFactor, int len)
 {
-    double fps = *fpsFactor / 256;
+    double fps = *fpsFactor / 256;  // range[0.01, 1.00]

     for (int i = 0; i < len; i++)
     {
-        double intraCost = intraCosts[i] * invQscales[i];
-        double propagateAmount = (double)propagateIn[i] + intraCost * fps;
-        double propagateNum = (double)intraCosts[i] - (interCosts[i] & ((1 << 14) - 1));
-        double propagateDenom = (double)intraCosts[i];
+        int intraCost = intraCosts[i];
+        int interCost = X265_MIN(intraCosts[i], interCosts[i] & LOWRES_COST_MASK);
+        double propagateIntra  = intraCost * invQscales[i]; // Q16 x Q8.8 = Q24.8
+        double propagateAmount = (double)propagateIn[i] + propagateIntra * fps; // Q16.0 + Q24.8 x Q0.x = Q25.0
+        double propagateNum    = (double)(intraCost - interCost); // Q32 - Q32 = Q33.0
+
+#if 0
+        // algorithm that output match to asm
+        float intraRcp = (float)1.0f / intraCost;   // VC can't mapping this into RCPPS
+        float intraRcpError1 = (float)intraCost * (float)intraRcp;
+        intraRcpError1 *= (float)intraRcp;
+        float intraRcpError2 = intraRcp + intraRcp;
+        float propagateDenom = intraRcpError2 - intraRcpError1;
+        dst[i] = (int)(propagateAmount * propagateNum * (double)propagateDenom + 0.5);
+#else
+        double propagateDenom = (double)intraCost; // Q32
         dst[i] = (int)(propagateAmount * propagateNum / propagateDenom + 0.5);
+#endif
     }
 }
+
+static pixel planeClipAndMax_c(pixel *src, intptr_t stride, int width, int height, uint64_t *outsum, const pixel minPix, const pixel maxPix)
+{
+    pixel maxLumaLevel = 0;
+    uint64_t sumLuma = 0;
+
+    for (int r = 0; r < height; r++)
+    {
+        for (int c = 0; c < width; c++)
+        {
+            /* Clip luma of source picture to max and min values before extending edges of picYuv */
+            src[c] = x265_clip3((pixel)minPix, (pixel)maxPix, src[c]);
+
+            /* Determine maximum and average luma level in a picture */
+            maxLumaLevel = X265_MAX(src[c], maxLumaLevel);
+            sumLuma += src[c];
+        }

+        src += stride;
+    }
+
+    *outsum = sumLuma;
+    return maxLumaLevel;
+}
+
 } // end anonymous namespace

 namespace X265_NS {
@@ -1020,7 +944,6 @@
     p.cu[BLOCK_ ## W ## x ## H].cpy1Dto2D_shl = cpy1Dto2D_shl<W>; \
     p.cu[BLOCK_ ## W ## x ## H].cpy1Dto2D_shr = cpy1Dto2D_shr<W>; \
     p.cu[BLOCK_ ## W ## x ## H].psy_cost_pp = psyCost_pp<BLOCK_ ## W ## x ## H>; \
-    p.cu[BLOCK_ ## W ## x ## H].psy_cost_ss = psyCost_ss<BLOCK_ ## W ## x ## H>; \
     p.cu[BLOCK_ ## W ## x ## H].transpose = transpose<W>; \
     p.cu[BLOCK_ ## W ## x ## H].ssd_s = pixel_ssd_s_c<W>; \
     p.cu[BLOCK_ ## W ## x ## H].var = pixel_var<W>; \
@@ -1258,6 +1181,7 @@
     p.planecopy_cp = planecopy_cp_c;
     p.planecopy_sp = planecopy_sp_c;
     p.planecopy_sp_shl = planecopy_sp_shl_c;
+    p.planeClipAndMax = planeClipAndMax_c;

     p.propagateCost = estimateCUPropagateCost;
 }
 }
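A scalar model makes the two behavioral changes in the propagate-cost rewrite easier to see: the inter cost word is masked down to its cost bits (the high bits of the lowres cost carry list/reference flags) and clamped so it can never exceed the intra cost, which keeps the numerator non-negative. This is a simplified re-derivation under the assumption that LOWRES_COST_MASK equals the 14-bit mask the old inline constant spelled out; it is not the encoder's code:

    #include <algorithm>
    #include <cstdint>

    static const uint32_t kCostMask = (1 << 14) - 1; // assumed LOWRES_COST_MASK

    // 'fps' here is the pre-scaled factor (fpsFactor / 256 in the original).
    static int propagateCost(uint16_t propagateIn, int32_t intraCost,
                             uint16_t interCostWord, int32_t invQscale, double fps)
    {
        int32_t interCost = std::min(intraCost, (int32_t)(interCostWord & kCostMask));
        double amount = propagateIn + (double)intraCost * invQscale * fps;
        double num    = (double)(intraCost - interCost); // cost not explained by inter
        double denom  = (double)intraCost;
        return (int)(amount * num / denom + 0.5);
    }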
View file
x265_1.8.tar.gz/source/common/predict.cpp -> x265_1.9.tar.gz/source/common/predict.cpp
Changed
@@ -2,6 +2,7 @@
 * Copyright (C) 2013 x265 project
 *
 * Authors: Deepthi Nandakumar <deepthi@multicorewareinc.com>
+*          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
@@ -98,7 +99,7 @@
         if (cu.m_slice->m_pps->bUseWeightPred && wp0->bPresentFlag)
         {
-            for (int plane = 0; plane < 3; plane++)
+            for (int plane = 0; plane < (bChroma ? 3 : 1); plane++)
             {
                 wv0[plane].w = wp0[plane].inputWeight;
                 wv0[plane].offset = wp0[plane].inputOffset * (1 << (X265_DEPTH - 8));
@@ -109,18 +110,18 @@
             ShortYuv& shortYuv = m_predShortYuv[0];

             if (bLuma)
-                predInterLumaShort(pu, shortYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
+                predInterLumaShort(pu, shortYuv, *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);
             if (bChroma)
-                predInterChromaShort(pu, shortYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
+                predInterChromaShort(pu, shortYuv, *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);

             addWeightUni(pu, predYuv, shortYuv, wv0, bLuma, bChroma);
         }
         else
         {
             if (bLuma)
-                predInterLumaPixel(pu, predYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
+                predInterLumaPixel(pu, predYuv, *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);
             if (bChroma)
-                predInterChromaPixel(pu, predYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
+                predInterChromaPixel(pu, predYuv, *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);
         }
     }
     else
@@ -141,7 +142,7 @@
         if (pwp0 && pwp1 && (pwp0->bPresentFlag || pwp1->bPresentFlag))
         {
             /* biprediction weighting */
-            for (int plane = 0; plane < 3; plane++)
+            for (int plane = 0; plane < (bChroma ? 3 : 1); plane++)
             {
                 wv0[plane].w = pwp0[plane].inputWeight;
                 wv0[plane].o = pwp0[plane].inputOffset * (1 << (X265_DEPTH - 8));
@@ -158,7 +159,7 @@
         {
             /* uniprediction weighting, always outputs to wv0 */
             const WeightParam* pwp = (refIdx0 >= 0) ? pwp0 : pwp1;
-            for (int plane = 0; plane < 3; plane++)
+            for (int plane = 0; plane < (bChroma ? 3 : 1); plane++)
             {
                 wv0[plane].w = pwp[plane].inputWeight;
                 wv0[plane].offset = pwp[plane].inputOffset * (1 << (X265_DEPTH - 8));
@@ -179,13 +180,13 @@
             if (bLuma)
             {
-                predInterLumaShort(pu, m_predShortYuv[0], *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
-                predInterLumaShort(pu, m_predShortYuv[1], *cu.m_slice->m_refPicList[1][refIdx1]->m_reconPic, mv1);
+                predInterLumaShort(pu, m_predShortYuv[0], *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);
+                predInterLumaShort(pu, m_predShortYuv[1], *cu.m_slice->m_refReconPicList[1][refIdx1], mv1);
             }
             if (bChroma)
             {
-                predInterChromaShort(pu, m_predShortYuv[0], *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
-                predInterChromaShort(pu, m_predShortYuv[1], *cu.m_slice->m_refPicList[1][refIdx1]->m_reconPic, mv1);
+                predInterChromaShort(pu, m_predShortYuv[0], *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);
+                predInterChromaShort(pu, m_predShortYuv[1], *cu.m_slice->m_refReconPicList[1][refIdx1], mv1);
             }

             if (pwp0 && pwp1 && (pwp0->bPresentFlag || pwp1->bPresentFlag))
@@ -203,18 +204,18 @@
             ShortYuv& shortYuv = m_predShortYuv[0];

             if (bLuma)
-                predInterLumaShort(pu, shortYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
+                predInterLumaShort(pu, shortYuv, *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);
             if (bChroma)
-                predInterChromaShort(pu, shortYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
+                predInterChromaShort(pu, shortYuv, *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);

             addWeightUni(pu, predYuv, shortYuv, wv0, bLuma, bChroma);
         }
         else
         {
             if (bLuma)
-                predInterLumaPixel(pu, predYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
+                predInterLumaPixel(pu, predYuv, *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);
             if (bChroma)
-                predInterChromaPixel(pu, predYuv, *cu.m_slice->m_refPicList[0][refIdx0]->m_reconPic, mv0);
+                predInterChromaPixel(pu, predYuv, *cu.m_slice->m_refReconPicList[0][refIdx0], mv0);
         }
     }
     else
@@ -230,18 +231,18 @@
         ShortYuv& shortYuv = m_predShortYuv[0];

         if (bLuma)
-            predInterLumaShort(pu, shortYuv, *cu.m_slice->m_refPicList[1][refIdx1]->m_reconPic, mv1);
+            predInterLumaShort(pu, shortYuv, *cu.m_slice->m_refReconPicList[1][refIdx1], mv1);
         if (bChroma)
-            predInterChromaShort(pu, shortYuv, *cu.m_slice->m_refPicList[1][refIdx1]->m_reconPic, mv1);
+            predInterChromaShort(pu, shortYuv, *cu.m_slice->m_refReconPicList[1][refIdx1], mv1);

         addWeightUni(pu, predYuv, shortYuv, wv0, bLuma, bChroma);
     }
     else
     {
         if (bLuma)
-            predInterLumaPixel(pu, predYuv, *cu.m_slice->m_refPicList[1][refIdx1]->m_reconPic, mv1);
+            predInterLumaPixel(pu, predYuv, *cu.m_slice->m_refReconPicList[1][refIdx1], mv1);
         if (bChroma)
-            predInterChromaPixel(pu, predYuv, *cu.m_slice->m_refPicList[1][refIdx1]->m_reconPic, mv1);
+            predInterChromaPixel(pu, predYuv, *cu.m_slice->m_refReconPicList[1][refIdx1], mv1);
     }
 }
@@ -600,8 +601,9 @@
     int tuSize = 1 << intraNeighbors.log2TrSize;
     int tuSize2 = tuSize << 1;

-    pixel* adiOrigin = cu.m_encData->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + puAbsPartIdx);
-    intptr_t picStride = cu.m_encData->m_reconPic->m_stride;
+    PicYuv* reconPic = cu.m_encData->m_reconPic;
+    pixel* adiOrigin = reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + puAbsPartIdx);
+    intptr_t picStride = reconPic->m_stride;

     fillReferenceSamples(adiOrigin, picStride, intraNeighbors, intraNeighbourBuf[0]);

@@ -648,8 +650,9 @@
 void Predict::initAdiPatternChroma(const CUData& cu, const CUGeom& cuGeom, uint32_t puAbsPartIdx, const IntraNeighbors& intraNeighbors, uint32_t chromaId)
 {
-    const pixel* adiOrigin = cu.m_encData->m_reconPic->getChromaAddr(chromaId, cu.m_cuAddr, cuGeom.absPartIdx + puAbsPartIdx);
-    intptr_t picStride = cu.m_encData->m_reconPic->m_strideC;
+    PicYuv* reconPic = cu.m_encData->m_reconPic;
+    const pixel* adiOrigin = reconPic->getChromaAddr(chromaId, cu.m_cuAddr, cuGeom.absPartIdx + puAbsPartIdx);
+    intptr_t picStride = reconPic->m_strideC;

     fillReferenceSamples(adiOrigin, picStride, intraNeighbors, intraNeighbourBuf[0]);
View file
x265_1.8.tar.gz/source/common/predict.h -> x265_1.9.tar.gz/source/common/predict.h
Changed
@@ -2,6 +2,7 @@
 * Copyright (C) 2013 x265 project
 *
 * Authors: Deepthi Nandakumar <deepthi@multicorewareinc.com>
+*          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/common/primitives.h -> x265_1.9.tar.gz/source/common/primitives.h
Changed
@@ -112,9 +112,9 @@ typedef int (*pixelcmp_t)(const pixel* fenc, intptr_t fencstride, const pixel* fref, intptr_t frefstride); // fenc is aligned typedef int (*pixelcmp_ss_t)(const int16_t* fenc, intptr_t fencstride, const int16_t* fref, intptr_t frefstride); -typedef sse_ret_t (*pixel_sse_t)(const pixel* fenc, intptr_t fencstride, const pixel* fref, intptr_t frefstride); // fenc is aligned -typedef sse_ret_t (*pixel_sse_ss_t)(const int16_t* fenc, intptr_t fencstride, const int16_t* fref, intptr_t frefstride); -typedef int (*pixel_ssd_s_t)(const int16_t* fenc, intptr_t fencstride); +typedef sse_t (*pixel_sse_t)(const pixel* fenc, intptr_t fencstride, const pixel* fref, intptr_t frefstride); // fenc is aligned +typedef sse_t (*pixel_sse_ss_t)(const int16_t* fenc, intptr_t fencstride, const int16_t* fref, intptr_t frefstride); +typedef sse_t (*pixel_ssd_s_t)(const int16_t* fenc, intptr_t fencstride); typedef void (*pixelcmp_x4_t)(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res); typedef void (*pixelcmp_x3_t)(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res); typedef void (*blockfill_s_t)(int16_t* dst, intptr_t dstride, int16_t val); @@ -176,15 +176,16 @@ typedef void (*saoCuOrgE3_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX); typedef void (*saoCuOrgB0_t)(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride); -typedef void (*saoCuStatsBO_t)(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count); -typedef void (*saoCuStatsE0_t)(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count); -typedef void (*saoCuStatsE1_t)(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count); -typedef void (*saoCuStatsE2_t)(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int8_t *upBuff, int endX, int endY, int32_t *stats, int32_t *count); -typedef void (*saoCuStatsE3_t)(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count); +typedef void (*saoCuStatsBO_t)(const int16_t *diff, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count); +typedef void (*saoCuStatsE0_t)(const int16_t *diff, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count); +typedef void (*saoCuStatsE1_t)(const int16_t *diff, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count); +typedef void (*saoCuStatsE2_t)(const int16_t *diff, const pixel *rec, intptr_t stride, int8_t *upBuff1, int8_t *upBuff, int endX, int endY, int32_t *stats, int32_t *count); +typedef void (*saoCuStatsE3_t)(const int16_t *diff, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count); typedef void (*sign_t)(int8_t *dst, const pixel *src1, const pixel *src2, const int endX); typedef void (*planecopy_cp_t) (const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift); typedef void (*planecopy_sp_t) (const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); +typedef pixel (*planeClipAndMax_t)(pixel *src, intptr_t stride, int width, int 
height, uint64_t *outsum, const pixel minPix, const pixel maxPix); typedef void (*cutree_propagate_cost) (int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len); @@ -195,6 +196,8 @@ typedef uint32_t (*costCoeffRemain_t)(uint16_t *absCoeff, int numNonZero, int idx); typedef uint32_t (*costC1C2Flag_t)(uint16_t *absCoeff, intptr_t numC1Flag, uint8_t *baseCtxMod, intptr_t ctxOffset); +typedef void (*pelFilterLumaStrong_t)(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tcP, int32_t tcQ); + /* Function pointers to optimized encoder primitives. Each pointer can reference * either an assembly routine, a SIMD intrinsic primitive, or a C function */ struct EncoderPrimitives @@ -259,7 +262,6 @@ pixel_sse_t sse_pp; // Sum of Square Error (pixel, pixel) fenc alignment not assumed pixel_sse_ss_t sse_ss; // Sum of Square Error (short, short) fenc alignment not assumed pixelcmp_t psy_cost_pp; // difference in AC energy between two pixel blocks - pixelcmp_ss_t psy_cost_ss; // difference in AC energy between two signed residual blocks pixel_ssd_s_t ssd_s; // Sum of Square Error (residual coeff to self) pixelcmp_t sa8d; // Sum of Transformed Differences (8x8 Hadamard), uses satd for 4x4 intra TU @@ -316,6 +318,7 @@ planecopy_cp_t planecopy_cp; planecopy_sp_t planecopy_sp; planecopy_sp_t planecopy_sp_shl; + planeClipAndMax_t planeClipAndMax; weightp_sp_t weight_sp; weightp_pp_t weight_pp; @@ -328,6 +331,7 @@ costCoeffRemain_t costCoeffRemain; costC1C2Flag_t costC1C2Flag; + pelFilterLumaStrong_t pelFilterLumaStrong[2]; // EDGE_VER = 0, EDGE_HOR = 1 /* There is one set of chroma primitives per color space. An encoder will * have just a single color space and thus it will only ever use one entry
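Editor's note: the hunks above widen the SSE-cost return types to sse_t (presumably so high-bit-depth sums cannot overflow) and register two new entries, planeClipAndMax and pelFilterLumaStrong, in the EncoderPrimitives function-pointer table. The diff's own comment states the pattern: each pointer may reference an assembly routine, a SIMD intrinsic, or a C function. Below is a minimal standalone sketch of that pattern; the names (pixel, sse_fn, sse_8x8_c, Primitives) are illustrative stand-ins, not x265's literal setup code.

#include <cstdint>

typedef int16_t pixel;   // stand-in for x265's 'pixel' type
typedef uint64_t sse_t;  // wide accumulator, as in the widened typedefs above

// One "primitive": a function pointer with a fixed signature.
typedef sse_t (*sse_fn)(const pixel* a, intptr_t astride,
                        const pixel* b, intptr_t bstride);

// Portable C reference for an 8x8 block; SIMD builds share the signature
// and simply overwrite the table entry.
static sse_t sse_8x8_c(const pixel* a, intptr_t astride,
                       const pixel* b, intptr_t bstride)
{
    sse_t sum = 0;
    for (int y = 0; y < 8; y++, a += astride, b += bstride)
        for (int x = 0; x < 8; x++)
        {
            int d = a[x] - b[x];
            sum += (sse_t)((int64_t)d * d);
        }
    return sum;
}

struct Primitives { sse_fn sse_pp; };

void setupPrimitives(Primitives& p)
{
    p.sse_pp = sse_8x8_c;  // C baseline; asm setup code later replaces entries
}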
View file
x265_1.8.tar.gz/source/common/quant.cpp -> x265_1.9.tar.gz/source/common/quant.cpp
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2015 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -50,9 +51,8 @@ return y + ((x - y) & ((x - y) >> (sizeof(int) * CHAR_BIT - 1))); // min(x, y) } -inline int getICRate(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, const uint32_t absGoRice, const uint32_t maxVlc, uint32_t c1c2Idx) +inline int getICRate(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, const uint32_t absGoRice, const uint32_t maxVlc, const uint32_t c1c2Rate) { - X265_CHECK(c1c2Idx <= 3, "c1c2Idx check failure\n"); X265_CHECK(absGoRice <= 4, "absGoRice check failure\n"); if (!absLevel) { @@ -94,12 +94,7 @@ uint32_t numBins = fastMin(prefLen + absGoRice, 8 /* g_goRicePrefixLen[absGoRice] + absGoRice */); rate += numBins << 15; - - if (c1c2Idx & 1) - rate += greaterOneBits[1]; - - if (c1c2Idx == 3) - rate += levelAbsBits[1]; + rate += c1c2Rate; } return rate; } @@ -140,7 +135,7 @@ } /* Calculates the cost for specific absolute transform level */ -inline uint32_t getICRateCost(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, uint32_t absGoRice, uint32_t c1c2Idx) +inline uint32_t getICRateCost(uint32_t absLevel, int32_t diffLevel, const int* greaterOneBits, const int* levelAbsBits, uint32_t absGoRice, const uint32_t c1c2Rate) { X265_CHECK(absLevel, "absLevel should not be zero\n"); @@ -175,16 +170,15 @@ rate = (COEF_REMAIN_BIN_REDUCTION + length + absGoRice + 1 + length) << 15; } - if (c1c2Idx & 1) - rate += greaterOneBits[1]; - if (c1c2Idx == 3) - rate += levelAbsBits[1]; + rate += c1c2Rate; return rate; } } } +Quant::rdoQuant_t Quant::rdoQuant_func[NUM_CU_DEPTH] = {&Quant::rdoQuant<2>, &Quant::rdoQuant<3>, &Quant::rdoQuant<4>, &Quant::rdoQuant<5>}; + Quant::Quant() { m_resiDctCoeff = NULL; @@ -229,8 +223,11 @@ { m_nr = m_frameNr ? 
&m_frameNr[ctu.m_encData->m_frameEncoderID] : NULL; m_qpParam[TEXT_LUMA].setQpParam(qp + QP_BD_OFFSET); - setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[0], TEXT_CHROMA_U, ctu.m_chromaFormat); - setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[1], TEXT_CHROMA_V, ctu.m_chromaFormat); + if (ctu.m_chromaFormat != X265_CSP_I400) + { + setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[0], TEXT_CHROMA_U, ctu.m_chromaFormat); + setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[1], TEXT_CHROMA_V, ctu.m_chromaFormat); + } } void Quant::setChromaQP(int qpin, TextType ttype, int chFmt) @@ -444,18 +441,18 @@ primitives.cu[sizeIdx].dct(m_fencShortBuf, m_fencDctCoeff, trSize); } - if (m_nr) + if (m_nr && m_nr->offset) { /* denoise is not applied to intra residual, so DST can be ignored */ int cat = sizeIdx + 4 * !isLuma + 8 * !isIntra; int numCoeff = 1 << (log2TrSize * 2); - primitives.denoiseDct(m_resiDctCoeff, m_nr->residualSum[cat], m_nr->offsetDenoise[cat], numCoeff); + primitives.denoiseDct(m_resiDctCoeff, m_nr->residualSum[cat], m_nr->offset[cat], numCoeff); m_nr->count[cat]++; } } if (m_rdoqLevel) - return rdoQuant(cu, coeff, log2TrSize, ttype, absPartIdx, usePsy); + return (this->*rdoQuant_func[log2TrSize - 2])(cu, coeff, ttype, absPartIdx, usePsy); else { int deltaU[32 * 32]; @@ -550,9 +547,10 @@ /* Rate distortion optimized quantization for entropy coding engines using * probability models like CABAC */ -uint32_t Quant::rdoQuant(const CUData& cu, int16_t* dstCoeff, uint32_t log2TrSize, TextType ttype, uint32_t absPartIdx, bool usePsy) +template<uint32_t log2TrSize> +uint32_t Quant::rdoQuant(const CUData& cu, int16_t* dstCoeff, TextType ttype, uint32_t absPartIdx, bool usePsy) { - int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */ + const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */ int scalingListType = (cu.isIntra(absPartIdx) ? 0 : 3) + ttype; const uint32_t usePsyMask = usePsy ? -1 : 0; @@ -564,13 +562,13 @@ int add = (1 << (qbits - 1)); const int32_t* qCoef = m_scalingList->m_quantCoef[log2TrSize - 2][scalingListType][rem]; - int numCoeff = 1 << (log2TrSize * 2); + const int numCoeff = 1 << (log2TrSize * 2); uint32_t numSig = primitives.nquant(m_resiDctCoeff, qCoef, dstCoeff, qbits, add, numCoeff); X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(dstCoeff), "numSig differ\n"); if (!numSig) return 0; - uint32_t trSize = 1 << log2TrSize; + const uint32_t trSize = 1 << log2TrSize; int64_t lambda2 = m_qpParam[ttype].lambda2; const int64_t psyScale = ((int64_t)m_psyRdoqScale * m_qpParam[ttype].lambda); @@ -580,20 +578,20 @@ const int32_t* unquantScale = m_scalingList->m_dequantCoef[log2TrSize - 2][scalingListType][rem]; int unquantShift = QUANT_IQUANT_SHIFT - QUANT_SHIFT - transformShift + (m_scalingList->m_bEnabled ? 4 : 0); int unquantRound = (unquantShift > per) ? 
1 << (unquantShift - per - 1) : 0; - int scaleBits = SCALE_BITS - 2 * transformShift; + const int scaleBits = SCALE_BITS - 2 * transformShift; #define UNQUANT(lvl) (((lvl) * (unquantScale[blkPos] << per) + unquantRound) >> unquantShift) #define SIGCOST(bits) ((lambda2 * (bits)) >> 8) #define RDCOST(d, bits) ((((int64_t)d * d) << scaleBits) + SIGCOST(bits)) #define PSYVALUE(rec) ((psyScale * (rec)) >> X265_MAX(0, (2 * transformShift + 1))) - int64_t costCoeff[32 * 32]; /* d*d + lambda * bits */ - int64_t costUncoded[32 * 32]; /* d*d + lambda * 0 */ - int64_t costSig[32 * 32]; /* lambda * bits */ + int64_t costCoeff[trSize * trSize]; /* d*d + lambda * bits */ + int64_t costUncoded[trSize * trSize]; /* d*d + lambda * 0 */ + int64_t costSig[trSize * trSize]; /* lambda * bits */ - int rateIncUp[32 * 32]; /* signal overhead of increasing level */ - int rateIncDown[32 * 32]; /* signal overhead of decreasing level */ - int sigRateDelta[32 * 32]; /* signal difference between zero and non-zero */ + int rateIncUp[trSize * trSize]; /* signal overhead of increasing level */ + int rateIncDown[trSize * trSize]; /* signal overhead of decreasing level */ + int sigRateDelta[trSize * trSize]; /* signal difference between zero and non-zero */ int64_t costCoeffGroupSig[MLS_GRP_NUM]; /* lambda * bits of group coding cost */ uint64_t sigCoeffGroupFlag64 = 0; @@ -611,7 +609,8 @@ TUEntropyCodingParameters codeParams; cu.getTUEntropyCodingParameters(codeParams, absPartIdx, log2TrSize, bIsLuma); - const uint32_t cgNum = 1 << (codeParams.log2TrSizeCG * 2); + const uint32_t log2TrSizeCG = log2TrSize - 2; + const uint32_t cgNum = 1 << (log2TrSizeCG * 2); const uint32_t cgStride = (trSize >> MLS_CG_LOG2_SIZE); uint8_t coeffNum[MLS_GRP_NUM]; // value range[0, 16] @@ -742,8 +741,8 @@ { uint32_t ctxSet = (cgScanPos && bIsLuma) ? 2 : 0; const uint32_t cgBlkPos = codeParams.scanCG[cgScanPos]; - const uint32_t cgPosY = cgBlkPos >> codeParams.log2TrSizeCG; - const uint32_t cgPosX = cgBlkPos - (cgPosY << codeParams.log2TrSizeCG); + const uint32_t cgPosY = cgBlkPos >> log2TrSizeCG; + const uint32_t cgPosX = cgBlkPos & ((1 << log2TrSizeCG) - 1); const uint64_t cgBlkPosMask = ((uint64_t)1 << cgBlkPos); const int patternSigCtx = calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, cgStride); const int ctxSigOffset = codeParams.firstSignificanceMapContext + (cgScanPos && bIsLuma ? 3 : 0); @@ -829,6 +828,7 @@ uint32_t subFlagMask = coeffFlag[cgScanPos]; int c2 = 0; uint32_t goRiceParam = 0; + uint32_t levelThreshold = 3; uint32_t c1Idx = 0; uint32_t c2Idx = 0; /* iterate over coefficients in each group in reverse scan order */ @@ -836,7 +836,7 @@ { scanPos = (cgScanPos << MLS_CG_SIZE) + scanPosinCG; uint32_t blkPos = codeParams.scan[scanPos]; - uint32_t maxAbsLevel = abs(dstCoeff[blkPos]); /* abs(quantized coeff) */ + uint32_t maxAbsLevel = dstCoeff[blkPos]; /* abs(quantized coeff) */ int signCoef = m_resiDctCoeff[blkPos]; /* pre-quantization DCT coeff */ int predictedCoef = m_fencDctCoeff[blkPos] - signCoef; /* predicted DCT = source DCT - residual DCT*/ @@ -855,7 +855,11 @@ // coefficient level estimation const int* greaterOneBits = estBitsSbac.greaterOneBits[4 * ctxSet + c1]; - const uint32_t ctxSig = (blkPos == 0) ? 0 : table_cnt[(trSize == 4) ? 4 : patternSigCtx][g_scan4x4[codeParams.scanType][scanPosinCG]] + ctxSigOffset; + //const uint32_t ctxSig = (blkPos == 0) ? 0 : table_cnt[(trSize == 4) ? 
4 : patternSigCtx][g_scan4x4[codeParams.scanType][scanPosinCG]] + ctxSigOffset; + static const uint64_t table_cnt64[4] = {0x0000000100110112ULL, 0x0000000011112222ULL, 0x0012001200120012ULL, 0x2222222222222222ULL}; + uint64_t ctxCnt = (trSize == 4) ? 0x8877886654325410ULL : table_cnt64[patternSigCtx]; + const uint32_t ctxSig = (blkPos == 0) ? 0 : ((ctxCnt >> (4 * g_scan4x4[codeParams.scanType][scanPosinCG])) & 0xF) + ctxSigOffset; + // NOTE: above equal to 'table_cnt[(trSize == 4) ? 4 : patternSigCtx][g_scan4x4[codeParams.scanType][scanPosinCG]] + ctxSigOffset' X265_CHECK(ctxSig == getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codeParams.firstSignificanceMapContext), "sigCtx check failure\n"); // before find lastest non-zero coeff @@ -886,15 +890,17 @@ { subFlagMask >>= 1; - const uint32_t c1c2Idx = ((c1Idx - 8) >> (sizeof(int) * CHAR_BIT - 1)) + (((-(int)c2Idx) >> (sizeof(int) * CHAR_BIT - 1)) + 1) * 2; - const uint32_t baseLevel = ((uint32_t)0xD9 >> (c1c2Idx * 2)) & 3; // {1, 2, 1, 3} + const uint32_t c1c2idx = ((c1Idx - 8) >> (sizeof(int) * CHAR_BIT - 1)) + (((-(int)c2Idx) >> (sizeof(int) * CHAR_BIT - 1)) + 1) * 2; + const uint32_t baseLevel = ((uint32_t)0xD9 >> (c1c2idx * 2)) & 3; // {1, 2, 1, 3} X265_CHECK(!!((int)c1Idx < C1FLAG_NUMBER) == (int)((c1Idx - 8) >> (sizeof(int) * CHAR_BIT - 1)), "scan validation 1\n"); X265_CHECK(!!(c2Idx == 0) == ((-(int)c2Idx) >> (sizeof(int) * CHAR_BIT - 1)) + 1, "scan validation 2\n"); X265_CHECK((int)baseLevel == ((c1Idx < C1FLAG_NUMBER) ? (2 + (c2Idx == 0)) : 1), "scan validation 3\n"); + X265_CHECK(c1c2idx <= 3, "c1c2Idx check failure\n"); // coefficient level estimation const int* levelAbsBits = estBitsSbac.levelAbsBits[ctxSet + c2]; + const uint32_t c1c2Rate = ((c1c2idx & 1) ? greaterOneBits[1] : 0) + ((c1c2idx == 3) ? levelAbsBits[1] : 0); uint32_t level = 0; uint32_t sigCoefBits = 0; @@ -914,13 +920,15 @@ sigCoefBits = estBitsSbac.significantBits[1][ctxSig]; } + const uint32_t unQuantLevel = (maxAbsLevel * (unquantScale[blkPos] << per) + unquantRound); // NOTE: X265_MAX(maxAbsLevel - 1, 1) ==> (X>=2 -> X-1), (X<2 -> 1) | (0 < X < 2 ==> X=1) if (maxAbsLevel == 1) { - uint32_t levelBits = (c1c2Idx & 1) ? greaterOneBits[0] + IEP_RATE : ((1 + goRiceParam) << 15) + IEP_RATE; - X265_CHECK(levelBits == getICRateCost(1, 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE, "levelBits mistake\n"); + uint32_t levelBits = (c1c2idx & 1) ? 
greaterOneBits[0] + IEP_RATE : ((1 + goRiceParam) << 15) + IEP_RATE; + X265_CHECK(levelBits == getICRateCost(1, 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Rate) + IEP_RATE, "levelBits mistake\n"); - int unquantAbsLevel = UNQUANT(1); + int unquantAbsLevel = unQuantLevel >> unquantShift; + X265_CHECK(UNQUANT(1) == unquantAbsLevel, "DQuant check failed\n"); int d = abs(signCoef) - unquantAbsLevel; int64_t curCost = RDCOST(d, sigCoefBits + levelBits); @@ -940,14 +948,18 @@ } else if (maxAbsLevel) { - uint32_t levelBits0 = getICRateCost(maxAbsLevel, maxAbsLevel - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE; - uint32_t levelBits1 = getICRateCost(maxAbsLevel - 1, maxAbsLevel - 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Idx) + IEP_RATE; + uint32_t levelBits0 = getICRateCost(maxAbsLevel, maxAbsLevel - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Rate) + IEP_RATE; + uint32_t levelBits1 = getICRateCost(maxAbsLevel - 1, maxAbsLevel - 1 - baseLevel, greaterOneBits, levelAbsBits, goRiceParam, c1c2Rate) + IEP_RATE; + + const uint32_t preDQuantLevelDiff = (unquantScale[blkPos] << per); - int unquantAbsLevel0 = UNQUANT(maxAbsLevel); + const int unquantAbsLevel0 = unQuantLevel >> unquantShift; + X265_CHECK(UNQUANT(maxAbsLevel) == (uint32_t)unquantAbsLevel0, "DQuant check failed\n"); int d0 = abs(signCoef) - unquantAbsLevel0; int64_t curCost0 = RDCOST(d0, sigCoefBits + levelBits0); - int unquantAbsLevel1 = UNQUANT(maxAbsLevel - 1); + const int unquantAbsLevel1 = (unQuantLevel - preDQuantLevelDiff) >> unquantShift; + X265_CHECK(UNQUANT(maxAbsLevel - 1) == (uint32_t)unquantAbsLevel1, "DQuant check failed\n"); int d1 = abs(signCoef) - unquantAbsLevel1; int64_t curCost1 = RDCOST(d1, sigCoefBits + levelBits1); @@ -1012,9 +1024,9 @@ } else { - rate1 = getICRate(level + 0, diff0 + 1, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Idx); - rate2 = getICRate(level + 1, diff0 + 2, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Idx); - rate0 = getICRate(level - 1, diff0 + 0, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Idx); + rate1 = getICRate(level + 0, diff0 + 1, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Rate); + rate2 = getICRate(level + 1, diff0 + 2, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Rate); + rate0 = getICRate(level - 1, diff0 + 0, greaterOneBits, levelAbsBits, goRiceParam, maxVlc, c1c2Rate); } rateIncUp[blkPos] = rate2 - rate1; rateIncDown[blkPos] = rate0 - rate1; @@ -1026,10 +1038,14 @@ } /* Update CABAC estimation state */ - if (level >= baseLevel && goRiceParam < 4 && level > (3U << goRiceParam)) + if ((level >= baseLevel) && (goRiceParam < 4) && (level > levelThreshold)) + { goRiceParam++; + levelThreshold <<= 1; + } - c1Idx -= (-(int32_t)level) >> 31; + const uint32_t isNonZero = (uint32_t)(-(int32_t)level) >> 31; + c1Idx += isNonZero; /* update bin model */ if (level > 1) @@ -1038,7 +1054,7 @@ c2 += (uint32_t)(c2 - 2) >> 31; c2Idx++; } - else if ((c1 < 3) && (c1 > 0) && level) + else if (((c1 == 1) | (c1 == 2)) & isNonZero) c1++; if (dstCoeff[blkPos]) @@ -1219,7 +1235,8 @@ // Average 49.62 pixels /* clean uncoded coefficients */ - for (int pos = bestLastIdx; pos <= fastMin(lastScanPos, (bestLastIdx | (SCAN_SET_SIZE - 1))); pos++) + X265_CHECK((uint32_t)(fastMin(lastScanPos, bestLastIdx) | (SCAN_SET_SIZE - 1)) < trSize * trSize, "array beyond bound\n"); + for (int pos = bestLastIdx; pos <= (fastMin(lastScanPos, bestLastIdx) | (SCAN_SET_SIZE - 1)); pos++) { 
dstCoeff[codeParams.scan[pos]] = 0; } @@ -1236,7 +1253,8 @@ if (cu.m_slice->m_pps->bSignHideEnabled && numSig >= 2) { const int realLastScanPos = (bestLastIdx - 1) >> LOG2_SCAN_SET_SIZE; - int lastCG = true; + int lastCG = 1; + for (int subSet = realLastScanPos; subSet >= 0; subSet--) { int subPos = subSet << LOG2_SCAN_SET_SIZE; @@ -1248,69 +1266,72 @@ /* measure distance between first and last non-zero coef in this * coding group */ const uint32_t posFirstLast = primitives.findPosFirstLast(&dstCoeff[codeParams.scan[subPos]], trSize, g_scan4x4[codeParams.scanType]); - int firstNZPosInCG = (uint16_t)posFirstLast; - int lastNZPosInCG = posFirstLast >> 16; - + const int firstNZPosInCG = (uint8_t)posFirstLast; + const int lastNZPosInCG = (int8_t)(posFirstLast >> 8); + const uint32_t absSumSign = posFirstLast; if (lastNZPosInCG - firstNZPosInCG >= SBH_THRESHOLD) { - uint32_t signbit = (dstCoeff[codeParams.scan[subPos + firstNZPosInCG]] > 0 ? 0 : 1); - int absSum = 0; + const int32_t signbit = ((int32_t)dstCoeff[codeParams.scan[subPos + firstNZPosInCG]]); +#if CHECKED_BUILD || _DEBUG + int32_t absSum_dummy = 0; for (n = firstNZPosInCG; n <= lastNZPosInCG; n++) - absSum += dstCoeff[codeParams.scan[n + subPos]]; + absSum_dummy += dstCoeff[codeParams.scan[n + subPos]]; + X265_CHECK(((uint32_t)absSum_dummy & 1) == (absSumSign >> 31), "absSumSign check failure\n"); +#endif - if (signbit != (absSum & 1U)) + //if (signbit != absSumSign) + if (((int32_t)(signbit ^ absSumSign)) < 0) { /* We must find a coeff to toggle up or down so the sign bit of the first non-zero coeff * is properly implied. Note dstCoeff[] are signed by this point but curChange and * finalChange imply absolute levels (+1 is away from zero, -1 is towards zero) */ int64_t minCostInc = MAX_INT64, curCost = MAX_INT64; - int minPos = -1; - int16_t finalChange = 0, curChange = 0; + uint32_t minPos = 0; + int8_t finalChange = 0; + int curChange = 0; + uint32_t lastCoeffAdjust = (lastCG & (abs(dstCoeff[codeParams.scan[lastNZPosInCG + subPos]]) == 1)) * 4 * IEP_RATE; for (n = (lastCG ? 
lastNZPosInCG : SCAN_SET_SIZE - 1); n >= 0; --n) { - uint32_t blkPos = codeParams.scan[n + subPos]; - int signCoef = m_resiDctCoeff[blkPos]; /* pre-quantization DCT coeff */ - int absLevel = abs(dstCoeff[blkPos]); + const uint32_t blkPos = codeParams.scan[n + subPos]; + const int32_t signCoef = m_resiDctCoeff[blkPos]; /* pre-quantization DCT coeff */ + const int absLevel = abs(dstCoeff[blkPos]); + // TODO: this is constant in non-scaling mode + const uint32_t preDQuantLevelDiff = (unquantScale[blkPos] << per); + const uint32_t unQuantLevel = (absLevel * (unquantScale[blkPos] << per) + unquantRound); + + int d = abs(signCoef) - (unQuantLevel >> unquantShift); + X265_CHECK((uint32_t)UNQUANT(absLevel) == (unQuantLevel >> unquantShift), "dquant check failed\n"); - int d = abs(signCoef) - UNQUANT(absLevel); - int64_t origDist = (((int64_t)d * d)) << scaleBits; + const int64_t origDist = (((int64_t)d * d)); -#define DELTARDCOST(d, deltabits) ((((int64_t)d * d) << scaleBits) - origDist + ((lambda2 * (int64_t)(deltabits)) >> 8)) +#define DELTARDCOST(d0, d, deltabits) ((((int64_t)d * d - d0) << scaleBits) + ((lambda2 * (int64_t)(deltabits)) >> 8)) + const uint32_t isOne = (absLevel == 1); if (dstCoeff[blkPos]) { - d = abs(signCoef) - UNQUANT(absLevel + 1); - int64_t costUp = DELTARDCOST(d, rateIncUp[blkPos]); + d = abs(signCoef) - ((unQuantLevel + preDQuantLevelDiff) >> unquantShift); + X265_CHECK((uint32_t)UNQUANT(absLevel + 1) == ((unQuantLevel + preDQuantLevelDiff) >> unquantShift), "dquant check failed\n"); + int64_t costUp = DELTARDCOST(origDist, d, rateIncUp[blkPos]); /* if decrementing would make the coeff 0, we can include the * significant coeff flag cost savings */ - d = abs(signCoef) - UNQUANT(absLevel - 1); - bool isOne = abs(dstCoeff[blkPos]) == 1; + d = abs(signCoef) - ((unQuantLevel - preDQuantLevelDiff) >> unquantShift); + X265_CHECK((uint32_t)UNQUANT(absLevel - 1) == ((unQuantLevel - preDQuantLevelDiff) >> unquantShift), "dquant check failed\n"); int downBits = rateIncDown[blkPos] - (isOne ? (IEP_RATE + sigRateDelta[blkPos]) : 0); - int64_t costDown = DELTARDCOST(d, downBits); + int64_t costDown = DELTARDCOST(origDist, d, downBits); - if (lastCG && lastNZPosInCG == n && isOne) - costDown -= 4 * IEP_RATE; + costDown -= lastCoeffAdjust; + curCost = ((n == firstNZPosInCG) & isOne) ? MAX_INT64 : costDown; - if (costUp < costDown) - { - curCost = costUp; - curChange = 1; - } - else - { - curChange = -1; - if (n == firstNZPosInCG && isOne) - curCost = MAX_INT64; - else - curCost = costDown; - } + curChange = 2 * (costUp < costDown) - 1; + curCost = (costUp < costDown) ? costUp : curCost; } - else if (n < firstNZPosInCG && signbit != (signCoef >= 0 ? 
0 : 1U)) + //else if ((n < firstNZPosInCG) & (signbit != ((uint32_t)signCoef >> 31))) + else if ((n < firstNZPosInCG) & ((signbit ^ signCoef) < 0)) { /* don't try to make a new coded coeff before the first coeff if its * sign would be different than the first coeff, the inferred sign would @@ -1320,36 +1341,48 @@ else { /* evaluate changing an uncoded coeff 0 to a coded coeff +/-1 */ - d = abs(signCoef) - UNQUANT(1); - curCost = DELTARDCOST(d, rateIncUp[blkPos] + IEP_RATE + sigRateDelta[blkPos]); + d = abs(signCoef) - ((preDQuantLevelDiff + unquantRound) >> unquantShift); + X265_CHECK((uint32_t)UNQUANT(1) == ((preDQuantLevelDiff + unquantRound) >> unquantShift), "dquant check failed\n"); + curCost = DELTARDCOST(origDist, d, rateIncUp[blkPos] + IEP_RATE + sigRateDelta[blkPos]); curChange = 1; } if (curCost < minCostInc) { minCostInc = curCost; - finalChange = curChange; - minPos = blkPos; + finalChange = (int8_t)curChange; + minPos = blkPos + (absLevel << 16); } + lastCoeffAdjust = 0; } - if (dstCoeff[minPos] == 32767 || dstCoeff[minPos] == -32768) + const int absInMinPos = (minPos >> 16); + minPos = (uint16_t)minPos; + + // if (dstCoeff[minPos] == 32767 || dstCoeff[minPos] == -32768) + if (absInMinPos >= 32767) /* don't allow sign hiding to violate the SPEC range */ finalChange = -1; - if (dstCoeff[minPos] == 0) - numSig++; - else if (finalChange == -1 && abs(dstCoeff[minPos]) == 1) - numSig--; - - if (m_resiDctCoeff[minPos] >= 0) - dstCoeff[minPos] += finalChange; - else - dstCoeff[minPos] -= finalChange; + // NOTE: Reference code + //if (dstCoeff[minPos] == 0) + // numSig++; + //else if (finalChange == -1 && abs(dstCoeff[minPos]) == 1) + // numSig--; + numSig += (absInMinPos == 0) - ((finalChange == -1) & (absInMinPos == 1)); + + + // NOTE: Reference code + //if (m_resiDctCoeff[minPos] >= 0) + // dstCoeff[minPos] += finalChange; + //else + // dstCoeff[minPos] -= finalChange; + const int16_t resiCoeffSign = ((int16_t)m_resiDctCoeff[minPos] >> 16); + dstCoeff[minPos] += (((int16_t)finalChange ^ resiCoeffSign) - resiCoeffSign); } } - lastCG = false; + lastCG = 0; } }
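Editor's note: the most structural change in this file is that rdoQuant() now takes log2TrSize as a template parameter and is reached through the static rdoQuant_func table of member-function pointers, so trSize and the costCoeff/costUncoded array bounds become compile-time constants. A self-contained sketch of that dispatch idiom follows; the class name and placeholder body are illustrative, only the idiom matches the patch.

#include <cstdio>

class Quantizer
{
public:
    // One instantiation per transform size: log2TrSize 2..5 -> 4x4..32x32,
    // so (1 << log2TrSize) and array bounds are compile-time constants.
    template<int log2TrSize>
    int rdoQuant(int qp)
    {
        const int trSize = 1 << log2TrSize;
        return trSize * trSize + qp;      // placeholder for the real RDOQ work
    }

    typedef int (Quantizer::*rdoQuant_t)(int qp);
    static const rdoQuant_t dispatch[4];

    int quant(int log2TrSize, int qp)
    {
        // runtime size picks the compiled specialization, mirroring
        // (this->*rdoQuant_func[log2TrSize - 2])(...)
        return (this->*dispatch[log2TrSize - 2])(qp);
    }
};

const Quantizer::rdoQuant_t Quantizer::dispatch[4] =
{
    &Quantizer::rdoQuant<2>, &Quantizer::rdoQuant<3>,
    &Quantizer::rdoQuant<4>, &Quantizer::rdoQuant<5>
};

int main()
{
    Quantizer q;
    printf("%d\n", q.quant(5, 0));  // prints 1024: the 32x32 path ran
    return 0;
}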
View file
x265_1.8.tar.gz/source/common/quant.h -> x265_1.9.tar.gz/source/common/quant.h
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2015 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -59,18 +60,18 @@ } }; -#define MAX_NUM_TR_COEFFS MAX_TR_SIZE * MAX_TR_SIZE /* Maximum number of transform coefficients, for a 32x32 transform */ -#define MAX_NUM_TR_CATEGORIES 16 /* 32, 16, 8, 4 transform categories each for luma and chroma */ - // NOTE: MUST be 16-byte aligned for asm code struct NoiseReduction { /* 0 = luma 4x4, 1 = luma 8x8, 2 = luma 16x16, 3 = luma 32x32 * 4 = chroma 4x4, 5 = chroma 8x8, 6 = chroma 16x16, 7 = chroma 32x32 * Intra 0..7 - Inter 8..15 */ - ALIGN_VAR_16(uint32_t, residualSum[MAX_NUM_TR_CATEGORIES][MAX_NUM_TR_COEFFS]); - uint32_t count[MAX_NUM_TR_CATEGORIES]; - uint16_t offsetDenoise[MAX_NUM_TR_CATEGORIES][MAX_NUM_TR_COEFFS]; + ALIGN_VAR_16(uint32_t, nrResidualSum[MAX_NUM_TR_CATEGORIES][MAX_NUM_TR_COEFFS]); + uint32_t nrCount[MAX_NUM_TR_CATEGORIES]; + uint16_t nrOffsetDenoise[MAX_NUM_TR_CATEGORIES][MAX_NUM_TR_COEFFS]; + uint16_t (*offset)[MAX_NUM_TR_COEFFS]; + uint32_t (*residualSum)[MAX_NUM_TR_COEFFS]; + uint32_t *count; }; class Quant @@ -125,8 +126,8 @@ const uint32_t sigPos = (uint32_t)(sigCoeffGroupFlag64 >> (cgBlkPos + 1)); // just need lowest 7-bits valid // TODO: instruction BT is faster, but _bittest64 still generate instruction 'BT m, r' in VS2012 - const uint32_t sigRight = ((uint32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos; - const uint32_t sigLower = ((uint32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1)); + const uint32_t sigRight = (cgPosX != (trSizeCG - 1)) & sigPos; + const uint32_t sigLower = (cgPosY != (trSizeCG - 1)) & (sigPos >> (trSizeCG - 1)); return sigRight + sigLower * 2; } @@ -136,8 +137,8 @@ X265_CHECK(cgBlkPos < 64, "cgBlkPos is too large\n"); // NOTE: unsafe shift operator, see NOTE in calcPatternSigCtx const uint32_t sigPos = (uint32_t)(cgGroupMask >> (cgBlkPos + 1)); // just need lowest 8-bits valid - const uint32_t sigRight = ((uint32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos; - const uint32_t sigLower = ((uint32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1)); + const uint32_t sigRight = (cgPosX != (trSizeCG - 1)) & sigPos; + const uint32_t sigLower = (cgPosY != (trSizeCG - 1)) & (sigPos >> (trSizeCG - 1)); return (sigRight | sigLower); } @@ -151,7 +152,14 @@ uint32_t signBitHidingHDQ(int16_t* qcoeff, int32_t* deltaU, uint32_t numSig, const TUEntropyCodingParameters &codingParameters, uint32_t log2TrSize); - uint32_t rdoQuant(const CUData& cu, int16_t* dstCoeff, uint32_t log2TrSize, TextType ttype, uint32_t absPartIdx, bool usePsy); + template<uint32_t log2TrSize> + uint32_t rdoQuant(const CUData& cu, int16_t* dstCoeff, TextType ttype, uint32_t absPartIdx, bool usePsy); + +public: + typedef uint32_t (Quant::*rdoQuant_t)(const CUData& cu, int16_t* dstCoeff, TextType ttype, uint32_t absPartIdx, bool usePsy); + +private: + static rdoQuant_t rdoQuant_func[NUM_CU_DEPTH]; }; }
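Editor's note: NoiseReduction now keeps its owned arrays under nr* names and exposes offset/residualSum/count as pointers, which is what lets the quant.cpp hunk above guard denoising with 'm_nr && m_nr->offset'. A small sketch of the pointer-to-array member involved, with sizes reduced for brevity:

#include <cstring>
#include <cstdint>

enum { CATS = 16, COEFFS = 32 * 32 };  // MAX_NUM_TR_CATEGORIES / MAX_NUM_TR_COEFFS

struct NoiseReduction
{
    uint16_t nrOffsetDenoise[CATS][COEFFS];  // owned storage (renamed nr*)
    uint16_t (*offset)[COEFFS];              // pointer to rows of COEFFS entries
};

int main()
{
    NoiseReduction nr;
    std::memset(nr.nrOffsetDenoise, 0, sizeof(nr.nrOffsetDenoise));

    nr.offset = NULL;                // denoising off: the 'm_nr->offset' guard fails
    nr.offset = nr.nrOffsetDenoise;  // on: offset[cat][i] aliases the owned array

    return nr.offset[3][0];          // same cell as nr.nrOffsetDenoise[3][0]
}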
View file
x265_1.8.tar.gz/source/common/shortyuv.cpp -> x265_1.9.tar.gz/source/common/shortyuv.cpp
Changed
@@ -40,19 +40,26 @@ bool ShortYuv::create(uint32_t size, int csp) { m_csp = csp; + m_size = size; m_hChromaShift = CHROMA_H_SHIFT(csp); m_vChromaShift = CHROMA_V_SHIFT(csp); - - m_size = size; - m_csize = size >> m_hChromaShift; - size_t sizeL = size * size; - size_t sizeC = sizeL >> (m_hChromaShift + m_vChromaShift); - X265_CHECK((sizeC & 15) == 0, "invalid size"); - CHECKED_MALLOC(m_buf[0], int16_t, sizeL + sizeC * 2); - m_buf[1] = m_buf[0] + sizeL; - m_buf[2] = m_buf[0] + sizeL + sizeC; + if (csp != X265_CSP_I400) + { + m_csize = size >> m_hChromaShift; + size_t sizeC = sizeL >> (m_hChromaShift + m_vChromaShift); + X265_CHECK((sizeC & 15) == 0, "invalid size"); + + CHECKED_MALLOC(m_buf[0], int16_t, sizeL + sizeC * 2); + m_buf[1] = m_buf[0] + sizeL; + m_buf[2] = m_buf[0] + sizeL + sizeC; + } + else + { + CHECKED_MALLOC(m_buf[0], int16_t, sizeL); + m_buf[1] = m_buf[2] = NULL; + } return true; fail: @@ -75,8 +82,11 @@ { const int sizeIdx = log2Size - 2; primitives.cu[sizeIdx].sub_ps(m_buf[0], m_size, srcYuv0.m_buf[0], srcYuv1.m_buf[0], srcYuv0.m_size, srcYuv1.m_size); - primitives.chroma[m_csp].cu[sizeIdx].sub_ps(m_buf[1], m_csize, srcYuv0.m_buf[1], srcYuv1.m_buf[1], srcYuv0.m_csize, srcYuv1.m_csize); - primitives.chroma[m_csp].cu[sizeIdx].sub_ps(m_buf[2], m_csize, srcYuv0.m_buf[2], srcYuv1.m_buf[2], srcYuv0.m_csize, srcYuv1.m_csize); + if (m_csp != X265_CSP_I400) + { + primitives.chroma[m_csp].cu[sizeIdx].sub_ps(m_buf[1], m_csize, srcYuv0.m_buf[1], srcYuv1.m_buf[1], srcYuv0.m_csize, srcYuv1.m_csize); + primitives.chroma[m_csp].cu[sizeIdx].sub_ps(m_buf[2], m_csize, srcYuv0.m_buf[2], srcYuv1.m_buf[2], srcYuv0.m_csize, srcYuv1.m_csize); + } } void ShortYuv::copyPartToPartLuma(ShortYuv& dstYuv, uint32_t absPartIdx, uint32_t log2Size) const
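Editor's note: the create() change above adds a 4:0:0 path, allocating only the luma plane and leaving both chroma pointers NULL, while subsampled formats keep the single allocation carved into three planes. A reduced sketch of that layout; the 4:2:0 shifts are hard-coded here, whereas the real code derives them from the colorspace:

#include <cstdint>
#include <cstdlib>

bool createShortYuv(int16_t* buf[3], uint32_t size, bool monochrome)
{
    size_t sizeL = (size_t)size * size;
    if (!monochrome)
    {
        size_t sizeC = sizeL >> 2;   // 4:2:0: each chroma plane is quarter size
        buf[0] = (int16_t*)malloc((sizeL + 2 * sizeC) * sizeof(int16_t));
        if (!buf[0]) return false;
        buf[1] = buf[0] + sizeL;            // Cb directly after luma
        buf[2] = buf[0] + sizeL + sizeC;    // Cr after Cb
    }
    else
    {
        buf[0] = (int16_t*)malloc(sizeL * sizeof(int16_t));
        if (!buf[0]) return false;
        buf[1] = buf[2] = NULL;             // no chroma planes to touch later
    }
    return true;
}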
View file
x265_1.8.tar.gz/source/common/slice.cpp -> x265_1.9.tar.gz/source/common/slice.cpp
Changed
@@ -33,7 +33,9 @@ { if (m_sliceType == I_SLICE) { - memset(m_refPicList, 0, sizeof(m_refPicList)); + memset(m_refFrameList, 0, sizeof(m_refFrameList)); + memset(m_refReconPicList, 0, sizeof(m_refReconPicList)); + memset(m_refPOCList, 0, sizeof(m_refPOCList)); m_numRefIdx[1] = m_numRefIdx[0] = 0; return; } @@ -106,13 +108,13 @@ { cIdx = rIdx % numPocTotalCurr; X265_CHECK(cIdx >= 0 && cIdx < numPocTotalCurr, "RPS index check fail\n"); - m_refPicList[0][rIdx] = rpsCurrList0[cIdx]; + m_refFrameList[0][rIdx] = rpsCurrList0[cIdx]; } if (m_sliceType != B_SLICE) { m_numRefIdx[1] = 0; - memset(m_refPicList[1], 0, sizeof(m_refPicList[1])); + memset(m_refFrameList[1], 0, sizeof(m_refFrameList[1])); } else { @@ -120,13 +122,13 @@ { cIdx = rIdx % numPocTotalCurr; X265_CHECK(cIdx >= 0 && cIdx < numPocTotalCurr, "RPS index check fail\n"); - m_refPicList[1][rIdx] = rpsCurrList1[cIdx]; + m_refFrameList[1][rIdx] = rpsCurrList1[cIdx]; } } for (int dir = 0; dir < 2; dir++) for (int numRefIdx = 0; numRefIdx < m_numRefIdx[dir]; numRefIdx++) - m_refPOCList[dir][numRefIdx] = m_refPicList[dir][numRefIdx]->m_poc; + m_refPOCList[dir][numRefIdx] = m_refFrameList[dir][numRefIdx]->m_poc; } void Slice::disableWeights()
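Editor's note: setRefPicList() keeps its wrap-around fill, only the destination arrays changed names (m_refPicList is split into m_refFrameList, m_refReconPicList and m_refPOCList). The indexing idiom in isolation, with illustrative types: if more reference indices are requested than there are unique pictures in the RPS, candidates repeat modulo the set size.

void fillRefList(int* refList, const int* rpsCurr,
                 int numPocTotalCurr, int numRefIdx)
{
    for (int rIdx = 0; rIdx < numRefIdx; rIdx++)
    {
        int cIdx = rIdx % numPocTotalCurr;  // wrap into the candidate set
        refList[rIdx] = rpsCurr[cIdx];
    }
}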
View file
x265_1.8.tar.gz/source/common/slice.h -> x265_1.9.tar.gz/source/common/slice.h
Changed
@@ -31,6 +31,7 @@ class Frame; class PicList; +class PicYuv; class MotionReference; enum SliceType @@ -104,6 +105,12 @@ struct ProfileTierLevel { + int profileIdc; + int levelIdc; + uint32_t minCrForLevel; + uint32_t maxLumaSrForLevel; + uint32_t bitDepthConstraint; + int chromaFormatConstraint; bool tierFlag; bool progressiveSourceFlag; bool interlacedSourceFlag; @@ -113,12 +120,6 @@ bool intraConstraintFlag; bool onePictureOnlyConstraintFlag; bool lowerBitRateConstraintFlag; - int profileIdc; - int levelIdc; - uint32_t minCrForLevel; - uint32_t maxLumaSrForLevel; - uint32_t bitDepthConstraint; - int chromaFormatConstraint; }; struct HRDInfo @@ -151,21 +152,21 @@ struct VPS { + HRDInfo hrdParameters; + ProfileTierLevel ptl; uint32_t maxTempSubLayers; uint32_t numReorderPics; uint32_t maxDecPicBuffering; uint32_t maxLatencyIncrease; - HRDInfo hrdParameters; - ProfileTierLevel ptl; }; struct Window { - bool bEnabled; int leftOffset; int rightOffset; int topOffset; int bottomOffset; + bool bEnabled; Window() { @@ -175,40 +176,41 @@ struct VUI { - bool aspectRatioInfoPresentFlag; int aspectRatioIdc; int sarWidth; int sarHeight; - - bool overscanInfoPresentFlag; - bool overscanAppropriateFlag; - - bool videoSignalTypePresentFlag; int videoFormat; - bool videoFullRangeFlag; - - bool colourDescriptionPresentFlag; int colourPrimaries; int transferCharacteristics; int matrixCoefficients; - - bool chromaLocInfoPresentFlag; int chromaSampleLocTypeTopField; int chromaSampleLocTypeBottomField; - Window defaultDisplayWindow; - + bool aspectRatioInfoPresentFlag; + bool overscanInfoPresentFlag; + bool overscanAppropriateFlag; + bool videoSignalTypePresentFlag; + bool videoFullRangeFlag; + bool colourDescriptionPresentFlag; + bool chromaLocInfoPresentFlag; bool frameFieldInfoPresentFlag; bool fieldSeqFlag; - bool hrdParametersPresentFlag; - HRDInfo hrdParameters; + HRDInfo hrdParameters; + Window defaultDisplayWindow; TimingInfo timingInfo; }; struct SPS { + /* cached PicYuv offset arrays, shared by all instances of + * PicYuv created by this encoder */ + intptr_t* cuOffsetY; + intptr_t* cuOffsetC; + intptr_t* buOffsetY; + intptr_t* buOffsetC; + int chromaFormatIdc; // use param uint32_t picWidthInLumaSamples; // use param uint32_t picHeightInLumaSamples; // use param @@ -228,8 +230,6 @@ uint32_t quadtreeTUMaxDepthInter; // use param uint32_t quadtreeTUMaxDepthIntra; // use param - bool bUseSAO; // use param - bool bUseAMP; // use param uint32_t maxAMPDepth; uint32_t maxTempSubLayers; // max number of Temporal Sub layers @@ -237,11 +237,26 @@ uint32_t maxLatencyIncrease; int numReorderPics; + bool bUseSAO; // use param + bool bUseAMP; // use param bool bUseStrongIntraSmoothing; // use param bool bTemporalMVPEnabled; Window conformanceWindow; VUI vuiParameters; + + SPS() + { + memset(this, 0, sizeof(*this)); + } + + ~SPS() + { + X265_FREE(cuOffsetY); + X265_FREE(cuOffsetC); + X265_FREE(buOffsetY); + X265_FREE(buOffsetC); + } }; struct PPS @@ -249,6 +264,8 @@ uint32_t maxCuDQPDepth; int chromaQpOffset[2]; // use param + int deblockingFilterBetaOffsetDiv2; + int deblockingFilterTcOffsetDiv2; bool bUseWeightPred; // use param bool bUseWeightedBiPred; // use param @@ -262,17 +279,15 @@ bool bDeblockingFilterControlPresent; bool bPicDisableDeblockingFilter; - int deblockingFilterBetaOffsetDiv2; - int deblockingFilterTcOffsetDiv2; }; struct WeightParam { // Explicit weighted prediction parameters parsed in slice header, - bool bPresentFlag; uint32_t log2WeightDenom; int inputWeight; int inputOffset; + bool 
bPresentFlag; /* makes a non-h265 weight (i.e. fix7), into an h265 weight */ void setFromWeightAndOffset(int w, int o, int denom, bool bNormalize) @@ -304,6 +319,9 @@ const SPS* m_sps; const PPS* m_pps; + Frame* m_refFrameList[2][MAX_NUM_REF + 1]; + PicYuv* m_refReconPicList[2][MAX_NUM_REF + 1]; + WeightParam m_weightPredTable[2][MAX_NUM_REF][3]; // [list][refIdx][0:Y, 1:U, 2:V] MotionReference (*m_mref)[MAX_NUM_REF + 1]; RPS m_rps; @@ -312,34 +330,28 @@ SliceType m_sliceType; int m_sliceQp; int m_poc; - int m_lastIDR; - bool m_bCheckLDC; // TODO: is this necessary? - bool m_sLFaseFlag; // loop filter boundary flag - bool m_colFromL0Flag; // collocated picture from List0 or List1 flag uint32_t m_colRefIdx; // never modified - + int m_numRefIdx[2]; - Frame* m_refPicList[2][MAX_NUM_REF + 1]; int m_refPOCList[2][MAX_NUM_REF + 1]; uint32_t m_maxNumMergeCand; // use param uint32_t m_endCUAddr; + bool m_bCheckLDC; // TODO: is this necessary? + bool m_sLFaseFlag; // loop filter boundary flag + bool m_colFromL0Flag; // collocated picture from List0 or List1 flag + Slice() { m_lastIDR = 0; m_sLFaseFlag = true; m_numRefIdx[0] = m_numRefIdx[1] = 0; - for (int i = 0; i < MAX_NUM_REF; i++) - { - m_refPicList[0][i] = NULL; - m_refPicList[1][i] = NULL; - m_refPOCList[0][i] = 0; - m_refPOCList[1][i] = 0; - } - + memset(m_refFrameList, 0, sizeof(m_refFrameList)); + memset(m_refReconPicList, 0, sizeof(m_refReconPicList)); + memset(m_refPOCList, 0, sizeof(m_refPOCList)); disableWeights(); } @@ -347,8 +359,6 @@ void setRefPicList(PicList& picList); - const Frame* getRefPic(int list, int refIdx) const { return refIdx >= 0 ? m_refPicList[list][refIdx] : NULL; } - bool getRapPicFlag() const { return m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL
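Editor's note: most of the slice.h churn reorders struct members so the bool flags sit after the pointer and int fields. The diff does not state a motivation, but on common ABIs this packs the 1-byte members together instead of scattering alignment padding between wider fields. A toy illustration (sizes vary by ABI, hence "often"):

#include <cstdio>

struct Interleaved { bool a; int x; bool b; int y; bool c; };  // often 20 bytes
struct Grouped     { int x; int y; bool a; bool b; bool c; };  // often 12 bytes

int main()
{
    printf("%zu %zu\n", sizeof(Interleaved), sizeof(Grouped));
    return 0;
}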
View file
x265_1.8.tar.gz/source/common/threading.h -> x265_1.9.tar.gz/source/common/threading.h
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -204,6 +205,15 @@ return ret; } + int getIncr(int n = 1) + { + EnterCriticalSection(&m_cs); + int ret = m_val; + m_val += n; + LeaveCriticalSection(&m_cs); + return ret; + } + void set(int newval) { EnterCriticalSection(&m_cs); @@ -393,6 +403,15 @@ return ret; } + int getIncr(int n = 1) + { + pthread_mutex_lock(&m_mutex); + int ret = m_val; + m_val += n; + pthread_mutex_unlock(&m_mutex); + return ret; + } + void set(int newval) { pthread_mutex_lock(&m_mutex);
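Editor's note: the getIncr() added above (in both the Win32 and pthread variants) is a classic fetch-and-add: it returns the counter's value from before the increment, so concurrent callers each obtain a distinct index. The pthread flavour lifted into a standalone sketch, with an assumed Counter wrapper class:

#include <pthread.h>

class Counter
{
    pthread_mutex_t m_mutex;
    int m_val;

public:
    Counter() : m_val(0) { pthread_mutex_init(&m_mutex, NULL); }
    ~Counter()           { pthread_mutex_destroy(&m_mutex); }

    int getIncr(int n = 1)
    {
        pthread_mutex_lock(&m_mutex);
        int ret = m_val;   // value before the increment
        m_val += n;
        pthread_mutex_unlock(&m_mutex);
        return ret;
    }
};

// usage from several workers:  int myJob = counter.getIncr();  // 0, 1, 2, ...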
View file
x265_1.8.tar.gz/source/common/threadpool.cpp -> x265_1.9.tar.gz/source/common/threadpool.cpp
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -59,6 +60,9 @@ #if HAVE_LIBNUMA #include <numa.h> #endif +#if defined(_MSC_VER) +# define strcasecmp _stricmp +#endif namespace X265_NS { // x265 private namespace @@ -226,8 +230,13 @@ { enum { MAX_NODE_NUM = 127 }; int cpusPerNode[MAX_NODE_NUM + 1]; + int threadsPerPool[MAX_NODE_NUM + 2]; + uint64_t nodeMaskPerPool[MAX_NODE_NUM + 2]; memset(cpusPerNode, 0, sizeof(cpusPerNode)); + memset(threadsPerPool, 0, sizeof(threadsPerPool)); + memset(nodeMaskPerPool, 0, sizeof(nodeMaskPerPool)); + int numNumaNodes = X265_MIN(getNumaNodeCount(), MAX_NODE_NUM); int cpuCount = getCpuCount(); bool bNumaSupport = false; @@ -258,7 +267,7 @@ for (int i = 0; i < numNumaNodes; i++) x265_log(p, X265_LOG_DEBUG, "detected NUMA node %d with %d logical cores\n", i, cpusPerNode[i]); - /* limit nodes based on param->numaPools */ + /* limit threads based on param->numaPools */ if (p->numaPools && *p->numaPools) { const char *nodeStr = p->numaPools; @@ -266,19 +275,30 @@ { if (!*nodeStr) { - cpusPerNode[i] = 0; + threadsPerPool[i] = 0; continue; } else if (*nodeStr == '-') - cpusPerNode[i] = 0; - else if (*nodeStr == '*') + threadsPerPool[i] = 0; + else if (*nodeStr == '*' || !strcasecmp(nodeStr, "NULL")) + { + for (int j = i; j < numNumaNodes; j++) + { + threadsPerPool[numNumaNodes] += cpusPerNode[j]; + nodeMaskPerPool[numNumaNodes] |= ((uint64_t)1 << j); + } break; + } else if (*nodeStr == '+') - ; + { + threadsPerPool[numNumaNodes] += cpusPerNode[i]; + nodeMaskPerPool[numNumaNodes] |= ((uint64_t)1 << i); + } else { int count = atoi(nodeStr); - cpusPerNode[i] = X265_MIN(count, cpusPerNode[i]); + threadsPerPool[i] = X265_MIN(count, cpusPerNode[i]); + nodeMaskPerPool[i] = ((uint64_t)1 << i); } /* consume current node string, comma, and white-space */ @@ -288,14 +308,31 @@ ++nodeStr; } } + else + { + for (int i = 0; i < numNumaNodes; i++) + { + threadsPerPool[numNumaNodes] += cpusPerNode[i]; + nodeMaskPerPool[numNumaNodes] |= ((uint64_t)1 << i); + } + } + + // If the last pool size is > MAX_POOL_THREADS, clip it to spawn thread pools only of size >= 1/2 max (heuristic) + if ((threadsPerPool[numNumaNodes] > MAX_POOL_THREADS) && + ((threadsPerPool[numNumaNodes] % MAX_POOL_THREADS) < (MAX_POOL_THREADS / 2))) + { + threadsPerPool[numNumaNodes] -= (threadsPerPool[numNumaNodes] % MAX_POOL_THREADS); + x265_log(p, X265_LOG_DEBUG, + "Creating only %d worker threads beyond specified numbers with --pools (if specified) to prevent asymmetry in pools; may not use all HW contexts\n", threadsPerPool[numNumaNodes]); + } numPools = 0; - for (int i = 0; i < numNumaNodes; i++) + for (int i = 0; i < numNumaNodes + 1; i++) { if (bNumaSupport) x265_log(p, X265_LOG_DEBUG, "NUMA node %d may use %d logical cores\n", i, cpusPerNode[i]); - if (cpusPerNode[i]) - numPools += (cpusPerNode[i] + MAX_POOL_THREADS - 1) / MAX_POOL_THREADS; + if (threadsPerPool[i]) + numPools += (threadsPerPool[i] + MAX_POOL_THREADS - 1) / MAX_POOL_THREADS; } if (!numPools) @@ -314,20 +351,27 @@ int node = 0; for (int i = 0; i < numPools; i++) { - while (!cpusPerNode[node]) + while (!threadsPerPool[node]) node++; - int cores = X265_MIN(MAX_POOL_THREADS, cpusPerNode[node]); - if (!pools[i].create(cores, maxProviders, node)) + int numThreads = X265_MIN(MAX_POOL_THREADS, threadsPerPool[node]); + if 
(!pools[i].create(numThreads, maxProviders, nodeMaskPerPool[node])) { X265_FREE(pools); numPools = 0; return NULL; } if (numNumaNodes > 1) - x265_log(p, X265_LOG_INFO, "Thread pool %d using %d threads on NUMA node %d\n", i, cores, node); + { + char *nodesstr = new char[64 * strlen(",63") + 1]; + int len = 0; + for (int j = 0; j < 64; j++) + if ((nodeMaskPerPool[node] >> j) & 1) + len += sprintf(nodesstr + len, ",%d", j); + x265_log(p, X265_LOG_INFO, "Thread pool %d using %d threads on numa nodes %s\n", i, numThreads, nodesstr + 1); + } else - x265_log(p, X265_LOG_INFO, "Thread pool created using %d threads\n", cores); - cpusPerNode[node] -= cores; + x265_log(p, X265_LOG_INFO, "Thread pool created using %d threads\n", numThreads); + threadsPerPool[node] -= numThreads; } } else @@ -340,11 +384,37 @@ memset(this, 0, sizeof(*this)); } -bool ThreadPool::create(int numThreads, int maxProviders, int node) +bool ThreadPool::create(int numThreads, int maxProviders, uint64_t nodeMask) { X265_CHECK(numThreads <= MAX_POOL_THREADS, "a single thread pool cannot have more than MAX_POOL_THREADS threads\n"); - m_numaNode = node; +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 + m_winCpuMask = 0x0; + GROUP_AFFINITY groupAffinity; + for (int i = 0; i < getNumaNodeCount(); i++) + { + int numaNode = ((nodeMask >> i) & 0x1U) ? i : -1; + if (numaNode != -1) + if (GetNumaNodeProcessorMaskEx((USHORT)numaNode, &groupAffinity)) + m_winCpuMask |= groupAffinity.Mask; + } + m_numaMask = &m_winCpuMask; +#elif HAVE_LIBNUMA + if (numa_available() >= 0) + { + struct bitmask* nodemask = numa_allocate_nodemask(); + if (nodemask) + { + *(nodemask->maskp) = nodeMask; + m_numaMask = nodemask; + } + else + x265_log(NULL, X265_LOG_ERROR, "unable to get NUMA node mask for %lx\n", nodeMask); + } +#else + (void)nodeMask; +#endif + m_numWorkers = numThreads; m_workers = X265_MALLOC(WorkerThread, numThreads); @@ -398,36 +468,39 @@ X265_FREE(m_workers); X265_FREE(m_jpTable); + +#if HAVE_LIBNUMA + if(m_numaMask) + numa_free_nodemask((struct bitmask*)m_numaMask); +#endif } void ThreadPool::setCurrentThreadAffinity() { - setThreadNodeAffinity(m_numaNode); + setThreadNodeAffinity(m_numaMask); } /* static */ -void ThreadPool::setThreadNodeAffinity(int numaNode) +void ThreadPool::setThreadNodeAffinity(void *numaMask) { #if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 - GROUP_AFFINITY groupAffinity; - if (GetNumaNodeProcessorMaskEx((USHORT)numaNode, &groupAffinity)) - { - if (SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)groupAffinity.Mask)) - return; - } - x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity to NUMA node %d\n", numaNode); + if (SetThreadAffinityMask(GetCurrentThread(), *((DWORD_PTR*)numaMask))) + return; + else + x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity for NUMA node mask\n"); #elif HAVE_LIBNUMA if (numa_available() >= 0) { - numa_run_on_node(numaNode); - numa_set_preferred(numaNode); + numa_run_on_node_mask((struct bitmask*)numaMask); + numa_set_interleave_mask((struct bitmask*)numaMask); numa_set_localalloc(); return; } - x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity to NUMA node %d\n", numaNode); + x265_log(NULL, X265_LOG_ERROR, "unable to set thread affinity for NUMA node mask\n"); #else - (void)numaNode; + (void)numaMask; #endif + return; } /* static */
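Editor's note: pools are now described by a 64-bit NUMA node mask (nodeMaskPerPool) rather than a single node index, and the new log message enumerates the set bits. A compact standalone sketch of that bookkeeping, with illustrative node numbers:

#include <cstdint>
#include <cstdio>

int main()
{
    // Each pool's allowed NUMA nodes are recorded as bits in a uint64_t.
    uint64_t nodeMask = 0;
    const int nodes[] = { 0, 2, 3 };
    for (int i = 0; i < 3; i++)
        nodeMask |= (uint64_t)1 << nodes[i];

    // Same scan the new log message uses to print the node list.
    char buf[64 * 4] = "";
    int len = 0;
    for (int j = 0; j < 64; j++)
        if ((nodeMask >> j) & 1)
            len += sprintf(buf + len, ",%d", j);

    printf("pool may run on numa nodes %s\n", buf + 1);  // skip leading comma
    return 0;
}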
View file
x265_1.8.tar.gz/source/common/threadpool.h -> x265_1.9.tar.gz/source/common/threadpool.h
Changed
@@ -83,7 +83,10 @@ sleepbitmap_t m_sleepBitmap; int m_numProviders; int m_numWorkers; - int m_numaNode; + void* m_numaMask; // node mask in linux, cpu mask in windows +#if defined(_WIN32_WINNT) && _WIN32_WINNT >= _WIN32_WINNT_WIN7 + DWORD_PTR m_winCpuMask; +#endif bool m_isActive; JobProvider** m_jpTable; @@ -92,7 +95,7 @@ ThreadPool(); ~ThreadPool(); - bool create(int numThreads, int maxProviders, int node); + bool create(int numThreads, int maxProviders, uint64_t nodeMask); bool start(); void stopWorkers(); void setCurrentThreadAffinity(); @@ -103,7 +106,7 @@ static int getCpuCount(); static int getNumaNodeCount(); - static void setThreadNodeAffinity(int node); + static void setThreadNodeAffinity(void *numaMask); }; /* Any worker thread may enlist the help of idle worker threads from the same
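Editor's note: the header-side counterpart of the change above: create() now accepts the node mask, and the pool stores it behind a void* so one member can point at either a libnuma bitmask or a Windows processor mask. An assumed call site (fragment, not from the patch) for a pool spanning nodes 1-3:

uint64_t nodeMask = (1ULL << 1) | (1ULL << 2) | (1ULL << 3);  // nodes 1, 2, 3
// pool.create(numThreads, maxProviders, nodeMask);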
View file
x265_1.8.tar.gz/source/common/version.cpp -> x265_1.9.tar.gz/source/common/version.cpp
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/common/wavefront.cpp -> x265_1.9.tar.gz/source/common/wavefront.cpp
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/common/wavefront.h -> x265_1.9.tar.gz/source/common/wavefront.h
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/common/x86/asm-primitives.cpp -> x265_1.9.tar.gz/source/common/x86/asm-primitives.cpp
Changed
@@ -962,11 +962,8 @@ p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar4_sse2); p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar8_sse2); - -#if X265_DEPTH <= 10 p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar16_sse2); p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar32_sse2); -#endif /* X265_DEPTH <= 10 */ ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2); p.cu[BLOCK_4x4].intra_pred[2] = PFX(intra_pred_ang4_2_sse2); @@ -1003,13 +1000,12 @@ p.cu[BLOCK_4x4].intra_pred[33] = PFX(intra_pred_ang4_33_sse2); p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_32x64_sse2); -#if X265_DEPTH <= 10 - p.cu[BLOCK_4x4].sse_ss = PFX(pixel_ssd_ss_4x4_mmx2); - ALL_LUMA_CU(sse_ss, pixel_ssd_ss, sse2); - p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_4x8_mmx2); p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_8x16_sse2); p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_16x32_sse2); +#if X265_DEPTH <= 10 + p.cu[BLOCK_4x4].sse_ss = PFX(pixel_ssd_ss_4x4_mmx2); + ALL_LUMA_CU(sse_ss, pixel_ssd_ss, sse2); #endif p.cu[BLOCK_4x4].dct = PFX(dct4_sse2); p.cu[BLOCK_8x8].dct = PFX(dct8_sse2); @@ -1031,6 +1027,7 @@ ALL_CHROMA_444_PU(p2s, filterPixelToShort, sse2); ALL_LUMA_PU(convert_p2s, filterPixelToShort, sse2); ALL_LUMA_TU(count_nonzero, count_nonzero, sse2); + p.propagateCost = PFX(mbtree_propagate_cost_sse2); } if (cpuMask & X265_CPU_SSE3) { @@ -1144,11 +1141,8 @@ p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar4_sse4); p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar8_sse4); - -#if X265_DEPTH <= 10 p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar16_sse4); p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar32_sse4); -#endif ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse4); INTRA_ANG_SSE4_COMMON(sse4); INTRA_ANG_SSE4_HIGH(sse4); @@ -1158,14 +1152,12 @@ p.weight_sp = PFX(weight_sp_sse4); p.cu[BLOCK_4x4].psy_cost_pp = PFX(psyCost_pp_4x4_sse4); - p.cu[BLOCK_4x4].psy_cost_ss = PFX(psyCost_ss_4x4_sse4); // TODO: check POPCNT flag! 
ALL_LUMA_TU_S(copy_cnt, copy_cnt_, sse4); #if X265_DEPTH <= 10 ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4); #endif - ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4); p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = PFX(filterPixelToShort_2x4_sse4); p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = PFX(filterPixelToShort_2x8_sse4); @@ -1173,6 +1165,7 @@ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = PFX(filterPixelToShort_2x8_sse4); p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = PFX(filterPixelToShort_2x16_sse4); p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = PFX(filterPixelToShort_6x16_sse4); + p.costCoeffRemain = PFX(costCoeffRemain_sse4); } if (cpuMask & X265_CPU_AVX) { @@ -1306,6 +1299,7 @@ p.pu[LUMA_64x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x32_avx); p.pu[LUMA_64x48].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x48_avx); p.pu[LUMA_64x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x64_avx); + p.propagateCost = PFX(mbtree_propagate_cost_avx); } if (cpuMask & X265_CPU_XOP) { @@ -1319,6 +1313,9 @@ } if (cpuMask & X265_CPU_AVX2) { +#if X265_DEPTH == 12 + ASSIGN_SA8D(avx2); +#endif p.cu[BLOCK_4x4].intra_filter = PFX(intra_filter_4x4_avx2); // TODO: the planecopy_sp is really planecopy_SC now, must be fix it @@ -1479,20 +1476,14 @@ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg = PFX(addAvg_32x16_avx2); p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg = PFX(addAvg_32x48_avx2); - p.cu[BLOCK_4x4].psy_cost_ss = PFX(psyCost_ss_4x4_avx2); - p.cu[BLOCK_8x8].psy_cost_ss = PFX(psyCost_ss_8x8_avx2); - p.cu[BLOCK_16x16].psy_cost_ss = PFX(psyCost_ss_16x16_avx2); - p.cu[BLOCK_32x32].psy_cost_ss = PFX(psyCost_ss_32x32_avx2); - p.cu[BLOCK_64x64].psy_cost_ss = PFX(psyCost_ss_64x64_avx2); p.cu[BLOCK_4x4].psy_cost_pp = PFX(psyCost_pp_4x4_avx2); -#if X265_DEPTH <= 10 + p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar16_avx2); + p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar32_avx2); + p.cu[BLOCK_8x8].psy_cost_pp = PFX(psyCost_pp_8x8_avx2); p.cu[BLOCK_16x16].psy_cost_pp = PFX(psyCost_pp_16x16_avx2); p.cu[BLOCK_32x32].psy_cost_pp = PFX(psyCost_pp_32x32_avx2); p.cu[BLOCK_64x64].psy_cost_pp = PFX(psyCost_pp_64x64_avx2); - p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar16_avx2); - p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar32_avx2); -#endif p.cu[BLOCK_16x16].intra_pred[DC_IDX] = PFX(intra_pred_dc16_avx2); p.cu[BLOCK_32x32].intra_pred[DC_IDX] = PFX(intra_pred_dc32_avx2); @@ -1536,20 +1527,13 @@ p.cu[BLOCK_16x16].ssd_s = PFX(pixel_ssd_s_16_avx2); p.cu[BLOCK_32x32].ssd_s = PFX(pixel_ssd_s_32_avx2); -#if X265_DEPTH <= 10 - p.cu[BLOCK_16x16].sse_ss = PFX(pixel_ssd_ss_16x16_avx2); - p.cu[BLOCK_32x32].sse_ss = PFX(pixel_ssd_ss_32x32_avx2); - p.cu[BLOCK_64x64].sse_ss = PFX(pixel_ssd_ss_64x64_avx2); - - p.cu[BLOCK_16x16].sse_pp = PFX(pixel_ssd_16x16_avx2); - p.cu[BLOCK_32x32].sse_pp = PFX(pixel_ssd_32x32_avx2); - p.cu[BLOCK_64x64].sse_pp = PFX(pixel_ssd_64x64_avx2); - p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sse_pp = PFX(pixel_ssd_16x16_avx2); - p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sse_pp = PFX(pixel_ssd_32x32_avx2); + p.cu[BLOCK_16x16].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_16x16_avx2); + p.cu[BLOCK_32x32].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_32x32_avx2); + p.cu[BLOCK_64x64].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_64x64_avx2); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sse_pp = (pixel_sse_t)PFX(pixel_ssd_16x16_avx2); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sse_pp = (pixel_sse_t)PFX(pixel_ssd_32x32_avx2); 
p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_16x32_avx2); p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_32x64_avx2); -#endif - p.quant = PFX(quant_avx2); p.nquant = PFX(nquant_avx2); p.dequant_normal = PFX(dequant_normal_avx2); @@ -1588,21 +1572,16 @@ p.cu[BLOCK_16x16].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_16_avx2); p.cu[BLOCK_32x32].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_32_avx2); -#if X265_DEPTH <= 10 - ALL_LUMA_TU_S(dct, dct, avx2); ALL_LUMA_TU_S(idct, idct, avx2); -#endif + ALL_LUMA_TU_S(dct, dct, avx2); + ALL_LUMA_CU_S(transpose, transpose, avx2); ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, avx2); ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, avx2); -#if X265_DEPTH <= 10 ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, avx2); -#endif ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, avx2); -#if X265_DEPTH <= 10 p.pu[LUMA_4x4].luma_vsp = PFX(interp_8tap_vert_sp_4x4_avx2); // since ALL_LUMA_PU didn't declare 4x4 size, calling separately luma_vsp function to use -#endif p.cu[BLOCK_16x16].add_ps = PFX(pixel_add_ps_16x16_avx2); p.cu[BLOCK_32x32].add_ps = PFX(pixel_add_ps_32x32_avx2); @@ -1625,7 +1604,6 @@ p.pu[LUMA_16x12].sad = PFX(pixel_sad_16x12_avx2); p.pu[LUMA_16x16].sad = PFX(pixel_sad_16x16_avx2); p.pu[LUMA_16x32].sad = PFX(pixel_sad_16x32_avx2); -#if X265_DEPTH <= 10 p.pu[LUMA_16x64].sad = PFX(pixel_sad_16x64_avx2); p.pu[LUMA_32x8].sad = PFX(pixel_sad_32x8_avx2); p.pu[LUMA_32x16].sad = PFX(pixel_sad_32x16_avx2); @@ -1637,7 +1615,6 @@ p.pu[LUMA_64x32].sad = PFX(pixel_sad_64x32_avx2); p.pu[LUMA_64x48].sad = PFX(pixel_sad_64x48_avx2); p.pu[LUMA_64x64].sad = PFX(pixel_sad_64x64_avx2); -#endif p.pu[LUMA_16x4].sad_x3 = PFX(pixel_sad_x3_16x4_avx2); p.pu[LUMA_16x8].sad_x3 = PFX(pixel_sad_x3_16x8_avx2); @@ -1712,7 +1689,6 @@ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = PFX(filterPixelToShort_32x48_avx2); p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = PFX(filterPixelToShort_32x64_avx2); -#if X265_DEPTH <= 10 p.pu[LUMA_4x4].luma_hps = PFX(interp_8tap_horiz_ps_4x4_avx2); p.pu[LUMA_4x8].luma_hps = PFX(interp_8tap_horiz_ps_4x8_avx2); p.pu[LUMA_4x16].luma_hps = PFX(interp_8tap_horiz_ps_4x16_avx2); @@ -1738,7 +1714,6 @@ p.pu[LUMA_48x64].luma_hps = PFX(interp_8tap_horiz_ps_48x64_avx2); p.pu[LUMA_24x32].luma_hps = PFX(interp_8tap_horiz_ps_24x32_avx2); p.pu[LUMA_12x16].luma_hps = PFX(interp_8tap_horiz_ps_12x16_avx2); -#endif p.pu[LUMA_4x4].luma_hpp = PFX(interp_8tap_horiz_pp_4x4_avx2); p.pu[LUMA_4x8].luma_hpp = PFX(interp_8tap_horiz_pp_4x8_avx2); @@ -1766,7 +1741,6 @@ p.pu[LUMA_24x32].luma_hpp = PFX(interp_8tap_horiz_pp_24x32_avx2); p.pu[LUMA_48x64].luma_hpp = PFX(interp_8tap_horiz_pp_48x64_avx2); -#if X265_DEPTH <= 10 p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hps = PFX(interp_4tap_horiz_ps_8x8_avx2); p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_avx2); p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_avx2); @@ -2164,18 +2138,19 @@ p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vsp = PFX(interp_4tap_vert_sp_64x32_avx2); p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vsp = PFX(interp_4tap_vert_sp_64x48_avx2); p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vsp = PFX(interp_4tap_vert_sp_64x64_avx2); -#endif p.frameInitLowres = PFX(frame_init_lowres_core_avx2); + p.propagateCost = PFX(mbtree_propagate_cost_avx2); -#if X265_DEPTH <= 10 // TODO: depends on hps and vsp ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu); // calling luma_hvpp for all sizes 
p.pu[LUMA_4x4].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_4x4>; // ALL_LUMA_PU_T has declared all sizes except 4x4, hence calling luma_hvpp[4x4] -#endif if (cpuMask & X265_CPU_BMI2) + { p.scanPosLast = PFX(scanPosLast_avx2_bmi2); + p.costCoeffNxN = PFX(costCoeffNxN_avx2_bmi2); + } } } #else // if HIGH_BIT_DEPTH @@ -2345,7 +2320,7 @@ p.cu[BLOCK_8x8].idct = PFX(idct8_sse2); // TODO: it is passed smoke test, but we need testbench, so temporary disable - //p.costC1C2Flag = x265_costC1C2Flag_sse2; + p.costC1C2Flag = PFX(costC1C2Flag_sse2); #endif p.idst4x4 = PFX(idst4_sse2); p.dst4x4 = PFX(dst4_sse2); @@ -2356,6 +2331,7 @@ ALL_CHROMA_444_PU(p2s, filterPixelToShort, sse2); ALL_LUMA_PU(convert_p2s, filterPixelToShort, sse2); ALL_LUMA_TU(count_nonzero, count_nonzero, sse2); + p.propagateCost = PFX(mbtree_propagate_cost_sse2); } if (cpuMask & X265_CPU_SSE3) { @@ -2530,7 +2506,6 @@ INTRA_ANG_SSE4(sse4); p.cu[BLOCK_4x4].psy_cost_pp = PFX(psyCost_pp_4x4_sse4); - p.cu[BLOCK_4x4].psy_cost_ss = PFX(psyCost_ss_4x4_sse4); p.pu[LUMA_4x4].convert_p2s = PFX(filterPixelToShort_4x4_sse4); p.pu[LUMA_4x8].convert_p2s = PFX(filterPixelToShort_4x8_sse4); @@ -2552,6 +2527,9 @@ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = PFX(filterPixelToShort_6x16_sse4); #if X86_64 + p.pelFilterLumaStrong[0] = PFX(pelFilterLumaStrong_V_sse4); + p.pelFilterLumaStrong[1] = PFX(pelFilterLumaStrong_H_sse4); + p.saoCuStatsBO = PFX(saoCuStatsBO_sse4); p.saoCuStatsE0 = PFX(saoCuStatsE0_sse4); p.saoCuStatsE1 = PFX(saoCuStatsE1_sse4); @@ -2559,7 +2537,6 @@ p.saoCuStatsE3 = PFX(saoCuStatsE3_sse4); ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4); - ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4); p.costCoeffNxN = PFX(costCoeffNxN_sse4); #endif @@ -2664,6 +2641,7 @@ p.pu[LUMA_48x64].copy_pp = PFX(blockcopy_pp_48x64_avx); p.frameInitLowres = PFX(frame_init_lowres_core_avx); + p.propagateCost = PFX(mbtree_propagate_cost_avx); } if (cpuMask & X265_CPU_XOP) { @@ -2678,6 +2656,14 @@ #if X86_64 if (cpuMask & X265_CPU_AVX2) { + p.cu[BLOCK_16x16].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_ss_16x16_avx2); + p.cu[BLOCK_32x32].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_ss_32x32_avx2); + p.cu[BLOCK_64x64].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_ss_64x64_avx2); + + p.cu[BLOCK_16x16].var = PFX(pixel_var_16x16_avx2); + p.cu[BLOCK_32x32].var = PFX(pixel_var_32x32_avx2); + p.cu[BLOCK_64x64].var = PFX(pixel_var_64x64_avx2); + p.cu[BLOCK_4x4].intra_filter = PFX(intra_filter_4x4_avx2); p.planecopy_sp = PFX(downShift_16_avx2); @@ -2700,12 +2686,6 @@ p.saoCuOrgB0 = PFX(saoCuOrgB0_avx2); p.sign = PFX(calSign_avx2); - p.cu[BLOCK_4x4].psy_cost_ss = PFX(psyCost_ss_4x4_avx2); - p.cu[BLOCK_8x8].psy_cost_ss = PFX(psyCost_ss_8x8_avx2); - p.cu[BLOCK_16x16].psy_cost_ss = PFX(psyCost_ss_16x16_avx2); - p.cu[BLOCK_32x32].psy_cost_ss = PFX(psyCost_ss_32x32_avx2); - p.cu[BLOCK_64x64].psy_cost_ss = PFX(psyCost_ss_64x64_avx2); - p.cu[BLOCK_4x4].psy_cost_pp = PFX(psyCost_pp_4x4_avx2); p.cu[BLOCK_8x8].psy_cost_pp = PFX(psyCost_pp_8x8_avx2); p.cu[BLOCK_16x16].psy_cost_pp = PFX(psyCost_pp_16x16_avx2); @@ -2811,7 +2791,7 @@ p.pu[LUMA_32x24].pixelavg_pp = PFX(pixel_avg_32x24_avx2); p.pu[LUMA_32x16].pixelavg_pp = PFX(pixel_avg_32x16_avx2); p.pu[LUMA_32x8].pixelavg_pp = PFX(pixel_avg_32x8_avx2); - + p.pu[LUMA_48x64].pixelavg_pp = PFX(pixel_avg_48x64_avx2); p.pu[LUMA_64x64].pixelavg_pp = PFX(pixel_avg_64x64_avx2); p.pu[LUMA_64x48].pixelavg_pp = PFX(pixel_avg_64x48_avx2); p.pu[LUMA_64x32].pixelavg_pp = PFX(pixel_avg_64x32_avx2); @@ -2863,6 +2843,11 @@ p.pu[LUMA_32x64].sad_x4 = PFX(pixel_sad_x4_32x64_avx2); 
p.pu[LUMA_32x24].sad_x4 = PFX(pixel_sad_x4_32x24_avx2); p.pu[LUMA_32x8].sad_x4 = PFX(pixel_sad_x4_32x8_avx2); + p.pu[LUMA_48x64].sad_x4 = PFX(pixel_sad_x4_48x64_avx2); + p.pu[LUMA_64x16].sad_x4 = PFX(pixel_sad_x4_64x16_avx2); + p.pu[LUMA_64x32].sad_x4 = PFX(pixel_sad_x4_64x32_avx2); + p.pu[LUMA_64x48].sad_x4 = PFX(pixel_sad_x4_64x48_avx2); + p.pu[LUMA_64x64].sad_x4 = PFX(pixel_sad_x4_64x64_avx2); p.cu[BLOCK_16x16].sse_pp = PFX(pixel_ssd_16x16_avx2); p.cu[BLOCK_32x32].sse_pp = PFX(pixel_ssd_32x32_avx2); @@ -2935,31 +2920,31 @@ p.cu[BLOCK_4x4].intra_pred[32] = PFX(intra_pred_ang4_32_avx2); p.cu[BLOCK_4x4].intra_pred[33] = PFX(intra_pred_ang4_33_avx2); p.cu[BLOCK_8x8].intra_pred[3] = PFX(intra_pred_ang8_3_avx2); - p.cu[BLOCK_8x8].intra_pred[33] = PFX(intra_pred_ang8_33_avx2); p.cu[BLOCK_8x8].intra_pred[4] = PFX(intra_pred_ang8_4_avx2); - p.cu[BLOCK_8x8].intra_pred[32] = PFX(intra_pred_ang8_32_avx2); p.cu[BLOCK_8x8].intra_pred[5] = PFX(intra_pred_ang8_5_avx2); - p.cu[BLOCK_8x8].intra_pred[31] = PFX(intra_pred_ang8_31_avx2); - p.cu[BLOCK_8x8].intra_pred[30] = PFX(intra_pred_ang8_30_avx2); p.cu[BLOCK_8x8].intra_pred[6] = PFX(intra_pred_ang8_6_avx2); p.cu[BLOCK_8x8].intra_pred[7] = PFX(intra_pred_ang8_7_avx2); - p.cu[BLOCK_8x8].intra_pred[29] = PFX(intra_pred_ang8_29_avx2); p.cu[BLOCK_8x8].intra_pred[8] = PFX(intra_pred_ang8_8_avx2); - p.cu[BLOCK_8x8].intra_pred[28] = PFX(intra_pred_ang8_28_avx2); p.cu[BLOCK_8x8].intra_pred[9] = PFX(intra_pred_ang8_9_avx2); - p.cu[BLOCK_8x8].intra_pred[27] = PFX(intra_pred_ang8_27_avx2); - p.cu[BLOCK_8x8].intra_pred[25] = PFX(intra_pred_ang8_25_avx2); - p.cu[BLOCK_8x8].intra_pred[12] = PFX(intra_pred_ang8_12_avx2); - p.cu[BLOCK_8x8].intra_pred[24] = PFX(intra_pred_ang8_24_avx2); p.cu[BLOCK_8x8].intra_pred[11] = PFX(intra_pred_ang8_11_avx2); + p.cu[BLOCK_8x8].intra_pred[12] = PFX(intra_pred_ang8_12_avx2); p.cu[BLOCK_8x8].intra_pred[13] = PFX(intra_pred_ang8_13_avx2); + p.cu[BLOCK_8x8].intra_pred[14] = PFX(intra_pred_ang8_14_avx2); + p.cu[BLOCK_8x8].intra_pred[15] = PFX(intra_pred_ang8_15_avx2); + p.cu[BLOCK_8x8].intra_pred[16] = PFX(intra_pred_ang8_16_avx2); p.cu[BLOCK_8x8].intra_pred[20] = PFX(intra_pred_ang8_20_avx2); p.cu[BLOCK_8x8].intra_pred[21] = PFX(intra_pred_ang8_21_avx2); p.cu[BLOCK_8x8].intra_pred[22] = PFX(intra_pred_ang8_22_avx2); p.cu[BLOCK_8x8].intra_pred[23] = PFX(intra_pred_ang8_23_avx2); - p.cu[BLOCK_8x8].intra_pred[14] = PFX(intra_pred_ang8_14_avx2); - p.cu[BLOCK_8x8].intra_pred[15] = PFX(intra_pred_ang8_15_avx2); - p.cu[BLOCK_8x8].intra_pred[16] = PFX(intra_pred_ang8_16_avx2); + p.cu[BLOCK_8x8].intra_pred[24] = PFX(intra_pred_ang8_24_avx2); + p.cu[BLOCK_8x8].intra_pred[25] = PFX(intra_pred_ang8_25_avx2); + p.cu[BLOCK_8x8].intra_pred[27] = PFX(intra_pred_ang8_27_avx2); + p.cu[BLOCK_8x8].intra_pred[28] = PFX(intra_pred_ang8_28_avx2); + p.cu[BLOCK_8x8].intra_pred[29] = PFX(intra_pred_ang8_29_avx2); + p.cu[BLOCK_8x8].intra_pred[30] = PFX(intra_pred_ang8_30_avx2); + p.cu[BLOCK_8x8].intra_pred[31] = PFX(intra_pred_ang8_31_avx2); + p.cu[BLOCK_8x8].intra_pred[32] = PFX(intra_pred_ang8_32_avx2); + p.cu[BLOCK_8x8].intra_pred[33] = PFX(intra_pred_ang8_33_avx2); p.cu[BLOCK_16x16].intra_pred[3] = PFX(intra_pred_ang16_3_avx2); p.cu[BLOCK_16x16].intra_pred[4] = PFX(intra_pred_ang16_4_avx2); p.cu[BLOCK_16x16].intra_pred[5] = PFX(intra_pred_ang16_5_avx2); @@ -2970,6 +2955,10 @@ p.cu[BLOCK_16x16].intra_pred[12] = PFX(intra_pred_ang16_12_avx2); p.cu[BLOCK_16x16].intra_pred[11] = PFX(intra_pred_ang16_11_avx2); p.cu[BLOCK_16x16].intra_pred[13] = 
PFX(intra_pred_ang16_13_avx2); + p.cu[BLOCK_16x16].intra_pred[14] = PFX(intra_pred_ang16_14_avx2); + p.cu[BLOCK_16x16].intra_pred[15] = PFX(intra_pred_ang16_15_avx2); + p.cu[BLOCK_16x16].intra_pred[16] = PFX(intra_pred_ang16_16_avx2); + p.cu[BLOCK_16x16].intra_pred[17] = PFX(intra_pred_ang16_17_avx2); p.cu[BLOCK_16x16].intra_pred[25] = PFX(intra_pred_ang16_25_avx2); p.cu[BLOCK_16x16].intra_pred[28] = PFX(intra_pred_ang16_28_avx2); p.cu[BLOCK_16x16].intra_pred[27] = PFX(intra_pred_ang16_27_avx2); @@ -2981,6 +2970,21 @@ p.cu[BLOCK_16x16].intra_pred[24] = PFX(intra_pred_ang16_24_avx2); p.cu[BLOCK_16x16].intra_pred[23] = PFX(intra_pred_ang16_23_avx2); p.cu[BLOCK_16x16].intra_pred[22] = PFX(intra_pred_ang16_22_avx2); + p.cu[BLOCK_32x32].intra_pred[5] = PFX(intra_pred_ang32_5_avx2); + p.cu[BLOCK_32x32].intra_pred[6] = PFX(intra_pred_ang32_6_avx2); + p.cu[BLOCK_32x32].intra_pred[7] = PFX(intra_pred_ang32_7_avx2); + p.cu[BLOCK_32x32].intra_pred[8] = PFX(intra_pred_ang32_8_avx2); + p.cu[BLOCK_32x32].intra_pred[9] = PFX(intra_pred_ang32_9_avx2); + p.cu[BLOCK_32x32].intra_pred[10] = PFX(intra_pred_ang32_10_avx2); + p.cu[BLOCK_32x32].intra_pred[11] = PFX(intra_pred_ang32_11_avx2); + p.cu[BLOCK_32x32].intra_pred[12] = PFX(intra_pred_ang32_12_avx2); + p.cu[BLOCK_32x32].intra_pred[13] = PFX(intra_pred_ang32_13_avx2); + p.cu[BLOCK_32x32].intra_pred[14] = PFX(intra_pred_ang32_14_avx2); + p.cu[BLOCK_32x32].intra_pred[15] = PFX(intra_pred_ang32_15_avx2); + p.cu[BLOCK_32x32].intra_pred[16] = PFX(intra_pred_ang32_16_avx2); + p.cu[BLOCK_32x32].intra_pred[17] = PFX(intra_pred_ang32_17_avx2); + p.cu[BLOCK_32x32].intra_pred[19] = PFX(intra_pred_ang32_19_avx2); + p.cu[BLOCK_32x32].intra_pred[20] = PFX(intra_pred_ang32_20_avx2); p.cu[BLOCK_32x32].intra_pred[34] = PFX(intra_pred_ang32_34_avx2); p.cu[BLOCK_32x32].intra_pred[2] = PFX(intra_pred_ang32_2_avx2); p.cu[BLOCK_32x32].intra_pred[26] = PFX(intra_pred_ang32_26_avx2); @@ -3309,6 +3313,12 @@ ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu); p.pu[LUMA_4x4].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_4x4>; + p.pu[LUMA_16x4].convert_p2s = PFX(filterPixelToShort_16x4_avx2); + p.pu[LUMA_16x8].convert_p2s = PFX(filterPixelToShort_16x8_avx2); + p.pu[LUMA_16x12].convert_p2s = PFX(filterPixelToShort_16x12_avx2); + p.pu[LUMA_16x16].convert_p2s = PFX(filterPixelToShort_16x16_avx2); + p.pu[LUMA_16x32].convert_p2s = PFX(filterPixelToShort_16x32_avx2); + p.pu[LUMA_16x64].convert_p2s = PFX(filterPixelToShort_16x64_avx2); p.pu[LUMA_32x8].convert_p2s = PFX(filterPixelToShort_32x8_avx2); p.pu[LUMA_32x16].convert_p2s = PFX(filterPixelToShort_32x16_avx2); p.pu[LUMA_32x24].convert_p2s = PFX(filterPixelToShort_32x24_avx2); @@ -3321,11 +3331,21 @@ p.pu[LUMA_48x64].convert_p2s = PFX(filterPixelToShort_48x64_avx2); p.pu[LUMA_24x32].convert_p2s = PFX(filterPixelToShort_24x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = PFX(filterPixelToShort_16x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = PFX(filterPixelToShort_16x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = PFX(filterPixelToShort_16x12_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = PFX(filterPixelToShort_16x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = PFX(filterPixelToShort_16x32_avx2); p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = PFX(filterPixelToShort_24x32_avx2); p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = PFX(filterPixelToShort_32x8_avx2); p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = PFX(filterPixelToShort_32x16_avx2); 
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = PFX(filterPixelToShort_32x24_avx2); p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = PFX(filterPixelToShort_32x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = PFX(filterPixelToShort_16x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = PFX(filterPixelToShort_16x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = PFX(filterPixelToShort_16x24_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = PFX(filterPixelToShort_16x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = PFX(filterPixelToShort_16x64_avx2); p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = PFX(filterPixelToShort_24x64_avx2); p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = PFX(filterPixelToShort_32x16_avx2); p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = PFX(filterPixelToShort_32x32_avx2); @@ -3616,13 +3636,33 @@ p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vpp = PFX(interp_4tap_vert_pp_64x16_avx2); p.frameInitLowres = PFX(frame_init_lowres_core_avx2); + p.propagateCost = PFX(mbtree_propagate_cost_avx2); + p.saoCuStatsE0 = PFX(saoCuStatsE0_avx2); + p.saoCuStatsE1 = PFX(saoCuStatsE1_avx2); + p.saoCuStatsE2 = PFX(saoCuStatsE2_avx2); + p.saoCuStatsE3 = PFX(saoCuStatsE3_avx2); if (cpuMask & X265_CPU_BMI2) + { p.scanPosLast = PFX(scanPosLast_avx2_bmi2); + p.costCoeffNxN = PFX(costCoeffNxN_avx2_bmi2); + } p.cu[BLOCK_32x32].copy_ps = PFX(blockcopy_ps_32x32_avx2); p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ps = PFX(blockcopy_ps_32x32_avx2); p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ps = PFX(blockcopy_ps_32x64_avx2); p.cu[BLOCK_64x64].copy_ps = PFX(blockcopy_ps_64x64_avx2); + p.planeClipAndMax = PFX(planeClipAndMax_avx2); + + p.pu[LUMA_32x8].sad_x3 = PFX(pixel_sad_x3_32x8_avx2); + p.pu[LUMA_32x16].sad_x3 = PFX(pixel_sad_x3_32x16_avx2); + p.pu[LUMA_32x24].sad_x3 = PFX(pixel_sad_x3_32x24_avx2); + p.pu[LUMA_32x32].sad_x3 = PFX(pixel_sad_x3_32x32_avx2); + p.pu[LUMA_32x64].sad_x3 = PFX(pixel_sad_x3_32x64_avx2); + p.pu[LUMA_64x16].sad_x3 = PFX(pixel_sad_x3_64x16_avx2); + p.pu[LUMA_64x32].sad_x3 = PFX(pixel_sad_x3_64x32_avx2); + p.pu[LUMA_64x48].sad_x3 = PFX(pixel_sad_x3_64x48_avx2); + p.pu[LUMA_64x64].sad_x3 = PFX(pixel_sad_x3_64x64_avx2); + p.pu[LUMA_48x64].sad_x3 = PFX(pixel_sad_x3_48x64_avx2); } #endif
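The hunks above all touch source/common/x86/asm-primitives.cpp, which fills x265's table of function pointers once at startup: entries default to portable C, each `if (cpuMask & X265_CPU_*)` block overwrites them with the fastest implementation that CPU supports, and nested checks such as `X265_CPU_BMI2` gate kernels that need an extra instruction set (scanPosLast, costCoeffNxN). A minimal sketch of that pattern, using simplified, hypothetical names (the real table is `EncoderPrimitives`, with hundreds of entries keyed additionally by partition size and chroma format):

#include <cstdint>

// One pointer per kernel; stubs stand in for the assembly routines.
typedef int (*sad_t)(const uint8_t*, intptr_t, const uint8_t*, intptr_t);

struct Primitives { sad_t sad; };

static int sad_c(const uint8_t*, intptr_t, const uint8_t*, intptr_t)    { return 0; } // portable fallback
static int sad_sse2(const uint8_t*, intptr_t, const uint8_t*, intptr_t) { return 0; } // stands in for asm
static int sad_avx2(const uint8_t*, intptr_t, const uint8_t*, intptr_t) { return 0; } // stands in for asm

enum { CPU_SSE2 = 1 << 0, CPU_AVX2 = 1 << 1, CPU_BMI2 = 1 << 2 };

void setupAsmPrimitives(Primitives& p, uint32_t cpuMask)
{
    p.sad = sad_c;                        // always-valid default
    if (cpuMask & CPU_SSE2)
        p.sad = sad_sse2;                 // baseline SIMD overwrites C
    if (cpuMask & CPU_AVX2)
    {
        p.sad = sad_avx2;                 // widest supported ISA wins
        if (cpuMask & CPU_BMI2)
        {
            // kernels needing AVX2 *and* BMI2 go here, cf. the
            // scanPosLast_avx2_bmi2 / costCoeffNxN_avx2_bmi2 hookups above
        }
    }
}

The removal of the `#if X265_DEPTH <= 10` guards throughout this hunk means those AVX2 entries are now registered in 12-bit builds as well, not only at 8- and 10-bit depth.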
View file
x265_1.8.tar.gz/source/common/x86/blockcopy8.asm -> x265_1.9.tar.gz/source/common/x86/blockcopy8.asm
Changed
@@ -3,6 +3,7 @@
 ;*
 ;* Authors: Praveen Kumar Tiwari <praveen@multicorewareinc.com>
 ;*          Murugan Vairavel <murugan@multicorewareinc.com>
+;*          Min Chen <chenm003@163.com>
 ;*
 ;* This program is free software; you can redistribute it and/or modify
 ;* it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/common/x86/blockcopy8.h -> x265_1.9.tar.gz/source/common/x86/blockcopy8.h
Changed
@@ -2,6 +2,7 @@
  * Copyright (C) 2013 x265 project
  *
  * Authors: Steve Borho <steve@borho.org>
+ *          Min Chen <chenm003@163.com>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/common/x86/const-a.asm -> x265_1.9.tar.gz/source/common/x86/const-a.asm
Changed
@@ -2,6 +2,7 @@
 ;* const-a.asm: x86 global constants
 ;*****************************************************************************
 ;* Copyright (C) 2010-2013 x264 project
+;* Copyright (C) 2013-2015 x265 project
 ;*
 ;* Authors: Loren Merritt <lorenm@u.washington.edu>
 ;*          Fiona Glaser <fiona@x264.com>
@@ -31,10 +32,10 @@
 ;; 8-bit constants

-const pb_0,                 times 16 db 0
+const pb_0,                 times 32 db 0
 const pb_1,                 times 32 db 1
 const pb_2,                 times 32 db 2
-const pb_3,                 times 16 db 3
+const pb_3,                 times 32 db 3
 const pb_4,                 times 32 db 4
 const pb_8,                 times 32 db 8
 const pb_15,                times 32 db 15
@@ -54,6 +55,11 @@
 const pb_shuf8x8c,          times 1 db 0, 0, 0, 0, 2, 2, 2, 2, 4, 4, 4, 4, 6, 6, 6, 6
 const pb_movemask,          times 16 db 0x00
                             times 16 db 0xFF
+
+const pb_movemask_32,       times 32 db 0x00
+                            times 32 db 0xFF
+                            times 32 db 0x00
+
 const pb_0000000000000F0F,  times 2 db 0xff, 0x00
                             times 12 db 0x00
 const pb_000000000000000F,  db 0xff
@@ -61,6 +67,7 @@

 ;; 16-bit constants

+const pw_n1,                times 16 dw -1
 const pw_1,                 times 16 dw 1
 const pw_2,                 times 16 dw 2
 const pw_3,                 times 16 dw 3
@@ -86,12 +93,12 @@
 const pw_ff00,              times 8 dw 0xff00
 const pw_2000,              times 16 dw 0x2000
 const pw_8000,              times 8 dw 0x8000
-const pw_3fff,              times 8 dw 0x3fff
+const pw_3fff,              times 16 dw 0x3fff
 const pw_32_0,              times 4 dw 32,
                             times 4 dw 0
 const pw_pixel_max,         times 16 dw ((1 << BIT_DEPTH)-1)
-const pw_0_15,              times 2 dw 0, 1, 2, 3, 4, 5, 6, 7
+const pw_0_7,               times 2 dw 0, 1, 2, 3, 4, 5, 6, 7
 const pw_ppppmmmm,          times 1 dw 1, 1, 1, 1, -1, -1, -1, -1
 const pw_ppmmppmm,          times 1 dw 1, 1, -1, -1, 1, 1, -1, -1
 const pw_pmpmpmpm,          times 16 dw 1, -1, 1, -1, 1, -1, 1, -1
@@ -107,6 +114,7 @@
                             times 7 dw 0xff
 const hmul_16p,             times 16 db 1
                             times 8 db 1, -1
+const pw_exp2_0_15,         dw 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768

 ;; 32-bit constants

@@ -115,8 +123,9 @@
 const pd_2,                 times 8 dd 2
 const pd_4,                 times 4 dd 4
 const pd_8,                 times 4 dd 8
+const pd_15,                times 8 dd 15
 const pd_16,                times 8 dd 16
-const pd_31,                times 4 dd 31
+const pd_31,                times 8 dd 31
 const pd_32,                times 8 dd 32
 const pd_64,                times 4 dd 64
 const pd_128,               times 4 dd 128
@@ -129,7 +138,12 @@
 const pd_524416,            times 4 dd 524416
 const pd_n32768,            times 8 dd 0xffff8000
 const pd_n131072,           times 4 dd 0xfffe0000
-
+const pd_0000ffff,          times 8 dd 0x0000FFFF
+const pd_planar16_mul0,     times 1 dd 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
+const pd_planar16_mul1,     times 1 dd 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
+const pd_planar32_mul1,     times 1 dd 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16
+const pd_planar32_mul2,     times 1 dd 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32
+const pd_planar16_mul2,     times 1 dd 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
 const trans8_shuf,          times 1 dd 0, 4, 1, 5, 2, 6, 3, 7

 const popcnt_table
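Most of these const-a.asm changes widen 16-byte constants to 32 bytes (pb_0, pb_3, pw_3fff) and add 16-entry dword ramps such as pd_planar16_mul0/1 for the new planar intra kernels. The widening exists so that 256-bit AVX2 code can consume a constant with a single full-width aligned load, while existing 128-bit code keeps using the first half. A rough illustration in C++ intrinsics rather than asm (names and layout are illustrative only, not the actual x265 declarations):

#include <immintrin.h>
#include <cstdint>

// 'times 32 db 3' in the asm corresponds to a 32-byte constant like this:
alignas(32) static const uint8_t pb_3[32] = {
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,  // enough for one XMM load
    3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3   // second half enables a full YMM load
};

// AVX2 code loads the whole 32-byte constant at once...
__m256i load_pb3_avx2() { return _mm256_load_si256(reinterpret_cast<const __m256i*>(pb_3)); }

// ...while legacy SSE code keeps reading the first 16 bytes unchanged.
__m128i load_pb3_sse2() { return _mm_load_si128(reinterpret_cast<const __m128i*>(pb_3)); }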
View file
x265_1.8.tar.gz/source/common/x86/cpu-a.asm -> x265_1.9.tar.gz/source/common/x86/cpu-a.asm
Changed
@@ -2,6 +2,7 @@
 ;* cpu-a.asm: x86 cpu utilities
 ;*****************************************************************************
 ;* Copyright (C) 2003-2013 x264 project
+;* Copyright (C) 2013-2015 x265 project
 ;*
 ;* Authors: Laurent Aimar <fenrir@via.ecp.fr>
 ;*          Loren Merritt <lorenm@u.washington.edu>
View file
x265_1.8.tar.gz/source/common/x86/dct8.asm -> x265_1.9.tar.gz/source/common/x86/dct8.asm
Changed
@@ -2115,15 +2115,15 @@
     mova        m0, [r0]
     pabsw       m1, m0
-    mova        m2, [r1]
+    movu        m2, [r1]
     pmovsxwd    m3, m1
     paddd       m2, m3
-    mova        [r1], m2
-    mova        m2, [r1 + 16]
+    movu        [r1], m2
+    movu        m2, [r1 + 16]
     psrldq      m3, m1, 8
     pmovsxwd    m4, m3
     paddd       m2, m4
-    mova        [r1 + 16], m2
+    movu        [r1 + 16], m2
     movu        m3, [r2]
     psubusw     m1, m3
@@ -2174,7 +2174,7 @@
     pmaddwd     m0, m%4
     phaddd      m2, m0
     paddd       m2, m5
-    psrad       m2, DCT_SHIFT
+    psrad       m2, DCT8_SHIFT1
     packssdw    m2, m2
     vpermq      m2, m2, 0x08
     mova        [r5 + %2], xm2
@@ -2190,7 +2190,7 @@
     phaddd      m8, m9
     phaddd      m6, m8
     paddd       m6, m5
-    psrad       m6, DCT_SHIFT2
+    psrad       m6, DCT8_SHIFT2

     vbroadcasti128 m4, [r6 + %2]
     pmaddwd     m10, m0, m4
@@ -2201,7 +2201,7 @@
     phaddd      m8, m9
     phaddd      m10, m8
     paddd       m10, m5
-    psrad       m10, DCT_SHIFT2
+    psrad       m10, DCT8_SHIFT2

     packssdw    m6, m10
     vpermq      m10, m6, 0xD8
@@ -2210,18 +2210,7 @@
 INIT_YMM avx2
 cglobal dct8, 3, 7, 11, 0-8*16
-%if BIT_DEPTH == 12
-    %define     DCT_SHIFT       6
-    vbroadcasti128  m5, [pd_16]
-%elif BIT_DEPTH == 10
-    %define     DCT_SHIFT       4
-    vbroadcasti128  m5, [pd_8]
-%elif BIT_DEPTH == 8
-    %define     DCT_SHIFT       2
-    vbroadcasti128  m5, [pd_2]
-%else
-    %error Unsupported BIT_DEPTH!
-%endif
+    vbroadcasti128  m5, [pd_ %+ DCT8_ROUND1]
 %define         DCT_SHIFT2      9

     add         r2d, r2d
@@ -2265,7 +2254,7 @@
     DCT8_PASS_1 7 * 16, 7 * 16, 4, 1

     ;pass2
-    vbroadcasti128 m5, [pd_256]
+    vbroadcasti128 m5, [pd_ %+ DCT8_ROUND2]

     mova        m0, [r5]
     mova        m1, [r5 + 32]
@@ -2904,7 +2893,7 @@
 cglobal idct8, 3, 7, 13, 0-8*16
 %if BIT_DEPTH == 12
     %define     IDCT_SHIFT2     8
-    vpbroadcastd m12, [pd_256]
+    vpbroadcastd m12, [pd_128]
 %elif BIT_DEPTH == 10
     %define     IDCT_SHIFT2     10
     vpbroadcastd m12, [pd_512]
@@ -3065,7 +3054,7 @@
 cglobal idct16, 3, 7, 16, 0-16*mmsize
 %if BIT_DEPTH == 12
     %define     IDCT_SHIFT2     8
-    vpbroadcastd m15, [pd_256]
+    vpbroadcastd m15, [pd_128]
 %elif BIT_DEPTH == 10
     %define     IDCT_SHIFT2     10
     vpbroadcastd m15, [pd_512]
@@ -3487,7 +3476,7 @@
 %if BIT_DEPTH == 12
     %define     IDCT_SHIFT2     8
-    vpbroadcastd m15, [pd_256]
+    vpbroadcastd m15, [pd_128]
 %elif BIT_DEPTH == 10
     %define     IDCT_SHIFT2     10
     vpbroadcastd m15, [pd_512]
@@ -3651,7 +3640,7 @@
 %define IDCT_SHIFT1 7
 %if BIT_DEPTH == 12
     %define     IDCT_SHIFT2     8
-    vpbroadcastd m5, [pd_256]
+    vpbroadcastd m5, [pd_128]
 %elif BIT_DEPTH == 10
     %define     IDCT_SHIFT2     10
     vpbroadcastd m5, [pd_512]
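Two separate fixes run through these dct8.asm hunks. The mova/movu changes stop assuming 16-byte alignment for the intermediate buffer, and the rounding constants become bit-depth aware: the offset added before an arithmetic right shift must be half the divisor, i.e. 1 << (shift - 1). For the inverse transform's second pass the shift is 20 - BIT_DEPTH, so at 12 bits (shift 8) the correct offset is 128, which is why the pd_256 loads above become pd_128. A small self-check of that arithmetic (plain C++, not x265 code):

#include <cassert>
#include <cstdint>

// Rounded arithmetic shift right: add half of 2^shift before shifting.
static inline int32_t shiftRound(int32_t v, int shift)
{
    return (v + (1 << (shift - 1))) >> shift;
}

int main()
{
    // HEVC inverse-transform second pass: shift = 20 - BIT_DEPTH.
    assert((1 << (12 - 1)) == 2048); // 8-bit:  shift 12, offset 2048
    assert((1 << (10 - 1)) == 512);  // 10-bit: shift 10, offset 512 (pd_512 above)
    assert((1 << (8 - 1)) == 128);   // 12-bit: shift 8,  offset 128 (pd_128, was pd_256)
    assert(shiftRound(255, 8) == 1 && shiftRound(127, 8) == 0);
    return 0;
}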
View file
x265_1.8.tar.gz/source/common/x86/dct8.h -> x265_1.9.tar.gz/source/common/x86/dct8.h
Changed
@@ -2,6 +2,7 @@
  * Copyright (C) 2013 x265 project
  *
  * Authors: Nabajit Deka <nabajit@multicorewareinc.com>
+ *          Min Chen <chenm003@163.com>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/common/x86/intrapred16.asm -> x265_1.9.tar.gz/source/common/x86/intrapred16.asm
Changed
@@ -109,9 +109,11 @@ cextern pw_16 cextern pw_31 cextern pw_32 +cextern pd_15 cextern pd_16 cextern pd_31 cextern pd_32 +cextern pd_0000ffff cextern pw_4096 cextern pw_pixel_max cextern multiL @@ -123,7 +125,12 @@ cextern pb_unpackwq1 cextern pb_unpackwq2 cextern pw_planar16_mul +cextern pd_planar16_mul0 +cextern pd_planar16_mul1 cextern pw_planar32_mul +cextern pd_planar32_mul1 +cextern pd_planar32_mul2 +cextern pd_planar16_mul2 ;----------------------------------------------------------------------------------- ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* above, int, int filter) @@ -731,6 +738,117 @@ ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) ;--------------------------------------------------------------------------------------- INIT_XMM sse2 +%if ARCH_X86_64 == 1 && BIT_DEPTH == 12 +cglobal intra_pred_planar16, 3,5,13 + add r1d, r1d + pxor m12, m12 + + movu m2, [r2 + 2] + movu m10, [r2 + 18] + + punpckhwd m7, m2, m12 + punpcklwd m2, m12 + punpckhwd m0, m10, m12 + punpcklwd m10, m12 + + movzx r3d, word [r2 + 34] ; topRight = above[16] + lea r4, [pd_planar16_mul1] + + movd m3, r3d + pshufd m3, m3, 0 ; topRight + + pmaddwd m8, m3, [r4 + 3*mmsize] ; (x + 1) * topRight + pmaddwd m4, m3, [r4 + 2*mmsize] ; (x + 1) * topRight + pmaddwd m9, m3, [r4 + 1*mmsize] ; (x + 1) * topRight + pmaddwd m3, m3, [r4 + 0*mmsize] ; (x + 1) * topRight + + mova m11, [pd_15] + pmaddwd m1, m2, m11 ; (blkSize - 1 - y) * above[x] + pmaddwd m6, m7, m11 ; (blkSize - 1 - y) * above[x] + pmaddwd m5, m10, m11 ; (blkSize - 1 - y) * above[x] + pmaddwd m11, m0 ; (blkSize - 1 - y) * above[x] + + paddd m4, m5 + paddd m3, m1 + paddd m8, m11 + paddd m9, m6 + + mova m5, [pd_16] + paddd m3, m5 + paddd m9, m5 + paddd m4, m5 + paddd m8, m5 + + movzx r4d, word [r2 + 98] ; bottomLeft = left[16] + movd m6, r4d + pshufd m6, m6, 0 ; bottomLeft + + paddd m4, m6 + paddd m3, m6 + paddd m8, m6 + paddd m9, m6 + + psubd m1, m6, m0 ; column 12-15 + psubd m11, m6, m10 ; column 8-11 + psubd m10, m6, m7 ; column 4-7 + psubd m6, m2 ; column 0-3 + + add r2, 66 + lea r4, [pd_planar16_mul0] + +%macro INTRA_PRED_PLANAR16_sse2 1 + movzx r3d, word [r2 + %1*2] + movd m5, r3d + pshufd m5, m5, 0 + + pmaddwd m0, m5, [r4 + 3*mmsize] ; column 12-15 + pmaddwd m2, m5, [r4 + 2*mmsize] ; column 8-11 + pmaddwd m7, m5, [r4 + 1*mmsize] ; column 4-7 + pmaddwd m5, m5, [r4 + 0*mmsize] ; column 0-3 + + paddd m0, m8 + paddd m2, m4 + paddd m7, m9 + paddd m5, m3 + + paddd m8, m1 + paddd m4, m11 + paddd m9, m10 + paddd m3, m6 + + psrad m0, 5 + psrad m2, 5 + psrad m7, 5 + psrad m5, 5 + + packssdw m2, m0 + packssdw m5, m7 + movu [r0], m5 + movu [r0 + mmsize], m2 + + add r0, r1 +%endmacro + + INTRA_PRED_PLANAR16_sse2 0 + INTRA_PRED_PLANAR16_sse2 1 + INTRA_PRED_PLANAR16_sse2 2 + INTRA_PRED_PLANAR16_sse2 3 + INTRA_PRED_PLANAR16_sse2 4 + INTRA_PRED_PLANAR16_sse2 5 + INTRA_PRED_PLANAR16_sse2 6 + INTRA_PRED_PLANAR16_sse2 7 + INTRA_PRED_PLANAR16_sse2 8 + INTRA_PRED_PLANAR16_sse2 9 + INTRA_PRED_PLANAR16_sse2 10 + INTRA_PRED_PLANAR16_sse2 11 + INTRA_PRED_PLANAR16_sse2 12 + INTRA_PRED_PLANAR16_sse2 13 + INTRA_PRED_PLANAR16_sse2 14 + INTRA_PRED_PLANAR16_sse2 15 + RET + +%else +; code for BIT_DEPTH == 10 cglobal intra_pred_planar16, 3,3,8 movu m2, [r2 + 2] movu m7, [r2 + 18] @@ -809,7 +927,180 @@ INTRA_PRED_PLANAR_16 14 INTRA_PRED_PLANAR_16 15 RET +%endif + +;--------------------------------------------------------------------------------------- +; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int 
filter) +;--------------------------------------------------------------------------------------- +INIT_XMM sse2 +%if ARCH_X86_64 == 1 && BIT_DEPTH == 12 +cglobal intra_pred_planar32, 3,7,16 + ; NOTE: align stack to 64 bytes, so all of local data in same cache line + mov r6, rsp + sub rsp, 4*mmsize + and rsp, ~63 + %define m16 [rsp + 0 * mmsize] + %define m17 [rsp + 1 * mmsize] + %define m18 [rsp + 2 * mmsize] + %define m19 [rsp + 3 * mmsize] + + add r1, r1 + pxor m12, m12 + + movzx r3d, word [r2 + 66] + lea r4, [planar32_table1] + + movd m0, r3d + pshufd m0, m0, 0 + + pmaddwd m8, m0, [r4 + 0] + pmaddwd m9, m0, [r4 + 16] + pmaddwd m10, m0, [r4 + 32] + pmaddwd m11, m0, [r4 + 48] + pmaddwd m7, m0, [r4 + 64] + pmaddwd m13, m0, [r4 + 80] + pmaddwd m14, m0, [r4 + 96] + pmaddwd m15, m0, [r4 + 112] + + movzx r3d, word [r2 + 194] + movd m0, r3d + pshufd m0, m0, 0 + + paddd m8, m0 + paddd m9, m0 + paddd m10, m0 + paddd m11, m0 + paddd m7, m0 + paddd m13, m0 + paddd m14, m0 + paddd m15, m0 + + paddd m8, [pd_32] + paddd m9, [pd_32] + paddd m10, [pd_32] + paddd m11, [pd_32] + paddd m7, [pd_32] + paddd m13, [pd_32] + paddd m14, [pd_32] + paddd m15, [pd_32] + + movu m1, [r2 + 2] + punpckhwd m5, m1, m12 + pmaddwd m2, m5, [pd_31] + paddd m9, m2 + psubd m2, m0, m5 + + punpcklwd m1, m12 + pmaddwd m5, m1, [pd_31] + paddd m8, m5 + psubd m3, m0, m1 + + movu m1, [r2 + 18] + punpckhwd m5, m1, m12 + pmaddwd m4, m5, [pd_31] + paddd m11, m4 + psubd m4, m0, m5 + + punpcklwd m1, m12 + pmaddwd m5, m1, [pd_31] + paddd m10, m5 + psubd m5, m0, m1 + mova m16, m5 + + movu m1, [r2 + 34] + punpckhwd m6, m1, m12 + psubd m5, m0, m6 + pmaddwd m6, [pd_31] + paddd m13, m6 + + punpcklwd m6, m1, m12 + psubd m1, m0, m6 + mova m17, m1 + pmaddwd m6, [pd_31] + paddd m7, m6 + + movu m1, [r2 + 50] + mova m18, m1 + punpckhwd m6, m1, m12 + psubd m1, m0, m6 + pmaddwd m6, [pd_31] + paddd m15, m6 + + punpcklwd m6, m18, m12 + psubd m12, m0, m6 + mova m19, m12 + pmaddwd m6, [pd_31] + paddd m14, m6 + + add r2, 130 + lea r5, [planar32_table] + +%macro INTRA_PRED_PLANAR32_sse2 0 + movzx r3d, word [r2] + movd m0, r3d + pshufd m0, m0, 0 + + pmaddwd m6, m0, [r5] + pmaddwd m12, m0, [r5 + 16] + paddd m6, m8 + paddd m12, m9 + paddd m8, m3 + paddd m9, m2 + psrad m6, 6 + psrad m12, 6 + packssdw m6, m12 + movu [r0], m6 + + pmaddwd m6, m0, [r5 + 32] + pmaddwd m12, m0, [r5 + 48] + paddd m6, m10 + paddd m12, m11 + paddd m10, m16 + paddd m11, m4 + psrad m6, 6 + psrad m12, 6 + packssdw m6, m12 + movu [r0 + 16], m6 + + pmaddwd m6, m0, [r5 + 64] + pmaddwd m12, m0, [r5 + 80] + paddd m6, m7 + paddd m12, m13 + paddd m7, m17 + paddd m13, m5 + psrad m6, 6 + psrad m12, 6 + packssdw m6, m12 + movu [r0 + 32], m6 + + pmaddwd m6, m0, [r5 + 96] + pmaddwd m12, m0, [r5 + 112] + paddd m6, m14 + paddd m12, m15 + paddd m14, m19 + paddd m15, m1 + psrad m6, 6 + psrad m12, 6 + packssdw m6, m12 + movu [r0 + 48], m6 + + lea r0, [r0 + r1] + add r2, 2 +%endmacro + + mov r4, 8 +.loop: + INTRA_PRED_PLANAR32_sse2 + INTRA_PRED_PLANAR32_sse2 + INTRA_PRED_PLANAR32_sse2 + INTRA_PRED_PLANAR32_sse2 + dec r4 + jnz .loop + mov rsp, r6 + RET +%else +;code for BIT_DEPTH == 10 ;--------------------------------------------------------------------------------------- ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) ;--------------------------------------------------------------------------------------- @@ -917,11 +1208,132 @@ %assign x x+1 %endrep RET +%endif ;--------------------------------------------------------------------------------------- ; void 
intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) ;--------------------------------------------------------------------------------------- INIT_YMM avx2 +%if ARCH_X86_64 == 1 && BIT_DEPTH == 12 +cglobal intra_pred_planar32, 3,4,16 + pmovzxwd m1, [r2 + 2] + pmovzxwd m4, [r2 + 34] + pmovzxwd m2, [r2 + 18] + pmovzxwd m3, [r2 + 50] + lea r2, [r2 + 66] + + movzx r3d, word [r2] + movd xm5, r3d + vpbroadcastd m5, xm5 + + pslld m8, m5, 3 + pmulld m7, m5, [pd_planar16_mul1 + 32] + psubd m6, m7, m8 + pmulld m9, m5, [pd_planar32_mul2 + 32] + psubd m8, m9, m8 + + movzx r3d, word [r2 + 128] + movd xm10, r3d + vpbroadcastd m10, xm10 + + mova m11, m10 + paddd m11, [pd_32] + + paddd m6, m11 + paddd m7, m11 + paddd m8, m11 + paddd m9, m11 + + psubd m0, m10, m1 + mova m13, m0 + pslld m5, m1, 5 + psubd m1, m5, m1 + paddd m12, m6, m1 + + psubd m5, m10, m4 + mova m6, m5 + pslld m1, m4, 5 + psubd m4, m1, m4 + paddd m14, m8, m4 + + psubd m1, m10, m2 + mova m8, m1 + pslld m4, m2, 5 + psubd m2, m4, m2 + paddd m7, m2 + + psubd m11, m10, m3 + mova m15, m11 + pslld m4, m3, 5 + psubd m3, m4, m3 + paddd m9, m3 + + mova m2, [pd_planar32_mul1 + 32] + mova m4, [pd_planar16_mul2 + 32] + + add r1, r1 + +%macro PROCESS_AVX2 1 + movzx r3d, word [r2 + %1 * 2] + movd xm0, r3d + vpbroadcastd m0, xm0 + + pmulld m1, m0, m2 + pslld m3, m0, 3 + paddd m5, m1, m3 + pmulld m0, m4 + paddd m11, m0, m3 + + paddd m5, m12 + paddd m1, m7 + paddd m11, m14 + paddd m0, m9 + + psrad m5, 6 + psrad m1, 6 + psrad m11, 6 + psrad m0, 6 + + packssdw m5, m1 + packssdw m11, m0 + + vpermq m5, m5, q3120 + vpermq m11, m11, q3120 + + movu [r0], m5 + movu [r0 + mmsize], m11 +%endmacro + +%macro INCREMENT_AVX2 0 + paddd m12, m13 + paddd m14, m6 + paddd m7, m8 + paddd m9, m15 + + add r0, r1 +%endmacro + + add r2, mmsize*2 +%assign x 0 +%rep 4 +%assign y 0 +%rep 8 + PROCESS_AVX2 y +%if x + y < 10 + INCREMENT_AVX2 +%endif +%assign y y+1 +%endrep +lea r2, [r2 + 16] +%assign x x+1 +%endrep + RET + +%else +; code for BIT_DEPTH == 10 +;--------------------------------------------------------------------------------------- +; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) +;--------------------------------------------------------------------------------------- cglobal intra_pred_planar32, 3,3,8 movu m1, [r2 + 2] movu m4, [r2 + 34] @@ -980,11 +1392,106 @@ %assign x x+1 %endrep RET +%endif ;--------------------------------------------------------------------------------------- ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) ;--------------------------------------------------------------------------------------- INIT_YMM avx2 +%if ARCH_X86_64 == 1 && BIT_DEPTH == 12 +cglobal intra_pred_planar16, 3,3,11 + add r1d, r1d + + movzx r4d, word [r2 + 34] + movd xm3, r4d + vpbroadcastd m3, xm3 + + movzx r4d, word [r2 + 98] + movd xm4, r4d + vpbroadcastd m4, xm4 + + pmovzxwd m2, [r2 + 2] + pmovzxwd m5, [r2 + 18] + + pmulld m10, m3, [pd_planar16_mul1] + pmulld m7, m3, [pd_planar16_mul1 + 32] + + psubd m10, m2 + pslld m1, m2, 4 + paddd m10, m1 + + psubd m7, m5 + pslld m6, m5, 4 + paddd m9, m6, m7 + + paddd m10, [pd_16] + paddd m9, [pd_16] + paddd m7, m10, m4 + paddd m9, m4 + + psubd m0, m4, m2 + psubd m8, m4, m5 + + add r2, 66 + mova m5, [pd_planar16_mul0] + mova m6, [pd_planar16_mul0 + 32] + mova m10, [pd_0000ffff] + +%macro INTRA_PRED_PLANAR16_AVX2 1 + vpbroadcastd m2, [r2 + %1] + pand m1, m2, m10 + psrld m2, 16 + + pmulld m3, m1, m5 + pmulld m4, m1, m6 + pmulld m1, m2, 
m5 + pmulld m2, m2, m6 + + paddd m3, m7 + paddd m4, m9 + paddd m7, m0 + paddd m9, m8 + + psrad m3, 5 + psrad m4, 5 + + paddd m1, m7 + paddd m2, m9 + + psrad m1, 5 + psrad m2, 5 + + paddd m7, m0 + paddd m9, m8 + + packssdw m3, m4 + packssdw m1, m2 + + vpermq m3, m3, q3120 + vpermq m1, m1, q3120 + + movu [r0], m3 + movu [r0 + r1], m1 +%if %1 <= 24 + lea r0, [r0 + r1 * 2] +%endif +%endmacro + INTRA_PRED_PLANAR16_AVX2 0 + INTRA_PRED_PLANAR16_AVX2 4 + INTRA_PRED_PLANAR16_AVX2 8 + INTRA_PRED_PLANAR16_AVX2 12 + INTRA_PRED_PLANAR16_AVX2 16 + INTRA_PRED_PLANAR16_AVX2 20 + INTRA_PRED_PLANAR16_AVX2 24 + INTRA_PRED_PLANAR16_AVX2 28 +%undef INTRA_PRED_PLANAR16_AVX2 + RET + +%else +; code for BIT_DEPTH == 10 +;--------------------------------------------------------------------------------------- +; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) +;--------------------------------------------------------------------------------------- cglobal intra_pred_planar16, 3,3,4 add r1d, r1d vpbroadcastw m3, [r2 + 34] @@ -1028,6 +1535,7 @@ INTRA_PRED_PLANAR16_AVX2 28 %undef INTRA_PRED_PLANAR16_AVX2 RET +%endif %macro TRANSPOSE_4x4 0 punpckhwd m0, m1, m3 @@ -2216,6 +2724,118 @@ ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) ;--------------------------------------------------------------------------------------- INIT_XMM sse4 +%if ARCH_X86_64 == 1 && BIT_DEPTH == 12 +cglobal intra_pred_planar16, 3,5,12 + add r1d, r1d + + pmovzxwd m2, [r2 + 2] + pmovzxwd m7, [r2 + 10] + pmovzxwd m10, [r2 + 18] + pmovzxwd m0, [r2 + 26] + + movzx r3d, word [r2 + 34] ; topRight = above[16] + lea r4, [pd_planar16_mul1] + + movd m3, r3d + pshufd m3, m3, 0 ; topRight + + pslld m8, m3, 2 + pmulld m3, m3, [r4 + 0*mmsize] ; (x + 1) * topRight + paddd m9, m3, m8 + paddd m4, m9, m8 + paddd m8, m4 + + pslld m1, m2, 4 + pslld m6, m7, 4 + pslld m5, m10, 4 + pslld m11, m0, 4 + psubd m1, m2 + psubd m6, m7 + psubd m5, m10 + psubd m11, m0 + + paddd m4, m5 + paddd m3, m1 + paddd m8, m11 + paddd m9, m6 + + mova m5, [pd_16] + paddd m3, m5 + paddd m9, m5 + paddd m4, m5 + paddd m8, m5 + + movzx r4d, word [r2 + 98] ; bottomLeft = left[16] + movd m6, r4d + pshufd m6, m6, 0 ; bottomLeft + + paddd m4, m6 + paddd m3, m6 + paddd m8, m6 + paddd m9, m6 + + psubd m1, m6, m0 ; column 12-15 + psubd m11, m6, m10 ; column 8-11 + psubd m10, m6, m7 ; column 4-7 + psubd m6, m2 ; column 0-3 + + add r2, 66 + lea r4, [pd_planar16_mul0] + +%macro INTRA_PRED_PLANAR16 1 + movzx r3d, word [r2] + movd m5, r3d + pshufd m5, m5, 0 + + pmulld m0, m5, [r4 + 3*mmsize] ; column 12-15 + pmulld m2, m5, [r4 + 2*mmsize] ; column 8-11 + pmulld m7, m5, [r4 + 1*mmsize] ; column 4-7 + pmulld m5, m5, [r4 + 0*mmsize] ; column 0-3 + + paddd m0, m8 + paddd m2, m4 + paddd m7, m9 + paddd m5, m3 + + paddd m8, m1 + paddd m4, m11 + paddd m9, m10 + paddd m3, m6 + + psrad m0, 5 + psrad m2, 5 + psrad m7, 5 + psrad m5, 5 + + packusdw m2, m0 + packusdw m5, m7 + movu [r0], m5 + movu [r0 + mmsize], m2 + + add r2, 2 + lea r0, [r0 + r1] +%endmacro + + INTRA_PRED_PLANAR16 0 + INTRA_PRED_PLANAR16 1 + INTRA_PRED_PLANAR16 2 + INTRA_PRED_PLANAR16 3 + INTRA_PRED_PLANAR16 4 + INTRA_PRED_PLANAR16 5 + INTRA_PRED_PLANAR16 6 + INTRA_PRED_PLANAR16 7 + INTRA_PRED_PLANAR16 8 + INTRA_PRED_PLANAR16 9 + INTRA_PRED_PLANAR16 10 + INTRA_PRED_PLANAR16 11 + INTRA_PRED_PLANAR16 12 + INTRA_PRED_PLANAR16 13 + INTRA_PRED_PLANAR16 14 + INTRA_PRED_PLANAR16 15 + RET + +%else +; code for BIT_DEPTH == 10 cglobal intra_pred_planar16, 3,3,8 add r1, r1 movu m2, [r2 + 
2] @@ -2293,6 +2913,7 @@ INTRA_PRED_PLANAR16 14 INTRA_PRED_PLANAR16 15 RET +%endif ;--------------------------------------------------------------------------------------- ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
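The bulk of this intrapred16.asm diff adds dedicated 12-bit paths (under `ARCH_X86_64 == 1 && BIT_DEPTH == 12`) for planar intra prediction: at 12 bits the weighted sums no longer fit in 16-bit lanes, so the new SSE2/SSE4/AVX2 kernels keep intermediates in 32-bit dwords (pmaddwd, pmulld, psrad) before packing back to words. Every variant computes the same HEVC planar formula; a plain C++ reference for the 16x16 case, matching the comments in the SSE2 body above ("(x + 1) * topRight", "(blkSize - 1 - y) * above[x]"), would look roughly like this:

#include <cstdint>

typedef uint16_t pixel;   // high-bit-depth build; 12-bit samples

// Reference for 16x16 planar intra prediction (HEVC spec 8.4.4.2.4).
// 'above' and 'left' each hold 17 reference samples; index 16 is the
// top-right / bottom-left corner used by the formula.
void planar16_ref(pixel* dst, intptr_t dstStride,
                  const pixel* above, const pixel* left)
{
    const int size  = 16;
    const int shift = 5;                  // log2(size) + 1
    const pixel topRight   = above[size];
    const pixel bottomLeft = left[size];

    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            // Max intermediate at 12 bits is about 62 * 4095 ~= 254000,
            // well past 16 bits: hence the dword math in the asm above.
            dst[y * dstStride + x] = (pixel)(((size - 1 - x) * left[y] +
                                              (x + 1) * topRight +
                                              (size - 1 - y) * above[x] +
                                              (y + 1) * bottomLeft +
                                              size) >> shift);
}

The assembly reaches the same result by precomputing the (x + 1) and (size - 1 - x) ramps as the pd_planar16_mul0/1 tables from const-a.asm and updating the per-row terms incrementally with paddd instead of re-multiplying.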
View file
x265_1.8.tar.gz/source/common/x86/intrapred8.asm -> x265_1.9.tar.gz/source/common/x86/intrapred8.asm
Changed
@@ -27,7 +27,9 @@ SECTION_RODATA 32 -intra_pred_shuff_0_8: times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 +const intra_pred_shuff_0_8, times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 + db 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9 + intra_pred_shuff_15_0: times 2 db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 intra_filter4_shuf0: times 2 db 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 @@ -54,13 +56,13 @@ c_shuf8_0: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 c_deinterval8: db 0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15 pb_unpackbq: db 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 -c_mode16_12: db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 6 -c_mode16_13: db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 7, 4 -c_mode16_14: db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 12, 10, 7, 5, 2 +c_mode16_12: db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 6 +c_mode16_13: db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 7, 4 +c_mode16_14: db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 12, 10, 7, 5, 2 c_mode16_15: db 0, 0, 0, 0, 0, 0, 0, 0, 15, 13, 11, 9, 8, 6, 4, 2 c_mode16_16: db 8, 6, 5, 3, 2, 0, 15, 14, 12, 11, 9, 8, 6, 5, 3, 2 c_mode16_17: db 4, 2, 1, 0, 15, 14, 12, 11, 10, 9, 7, 6, 5, 4, 2, 1 -c_mode16_18: db 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 +c_mode16_18: db 0, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 ALIGN 32 c_ang8_src1_9_2_10: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9 @@ -259,235 +261,6 @@ db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - -ALIGN 32 -c_ang32_mode_27: db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 - db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 - db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 - db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 - db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 - db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 - db 8, 24, 8, 
24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 - db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - - -ALIGN 32 -c_ang32_mode_28: db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 - db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 - db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 - db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 - db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 - db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 - db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9 - db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19 - db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7 - db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 - db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 - db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - -ALIGN 32 -c_ang32_mode_29: db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 - db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13 - db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 - db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 - db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 
16, 16, 16, 16 - db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 - db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 - db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 - db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 - db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23 - db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - - -ALIGN 32 -c_ang32_mode_30: db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 - db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 - db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 - db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 - db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 - db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23 - db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 - db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 - db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 - db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19 - db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 
32, 0 - - -ALIGN 32 -c_ang32_mode_31: db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19 - db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 - db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 - db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 - db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 - db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 - db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 - db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 - db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 - db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - - -ALIGN 32 -c_ang32_mode_32: db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 - db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 - db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 - db 16, 16, 16, 16, 16, 
16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 - db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 - db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25 - db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 - db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 - db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23 - db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 - db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11 - db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - -ALIGN 32 -c_ang32_mode_25: db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 - db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 - db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 - db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 - db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - -ALIGN 32 -c_ang32_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 
27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 - db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 - db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 - db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 - db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 - db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 - db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 - db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1 - db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23 - db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3 - db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5 - db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - - -ALIGN 32 -c_ang32_mode_23: db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 - db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5 - db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1 - db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 - db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 - db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 - db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7 - db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 - db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3 - db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 
6, 26, 6, 26, 6, 26, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 - db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 - db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 - db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - - -ALIGN 32 -c_ang32_mode_22: db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 - db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5 - db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11 - db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 - db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 - db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 - db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3 - db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9 - db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 - db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 - db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 - db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7 - db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13 - db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - -ALIGN 32 -c_ang32_mode_21: db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 - db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13 - db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11 - db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9 - db 
8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7 - db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3 - db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1 - db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 - db 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 - db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 - db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 - db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 - db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 - db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - - ALIGN 32 intra_pred_shuff_0_4: times 4 db 0, 1, 1, 2, 2, 3, 3, 4 intra_pred4_shuff1: db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5 @@ -560,11 +333,139 @@ db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 -ALIGN 32 -c_ang8_mode_20: db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 - db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 - db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 +const c_ang8_mode_16, db 8, 7, 6, 5, 4, 3, 2, 1, 0, 9, 10, 12, 13, 15, 0, 0 + +const intra_pred8_shuff16, db 0, 1, 1, 2, 3, 3, 4, 5 + db 1, 2, 2, 3, 4, 4, 5, 6 + db 2, 3, 3, 4, 5, 5, 6, 7 + db 3, 4, 4, 5, 6, 6, 7, 8 + db 4, 5, 5, 6, 7, 7, 8, 9 + +const angHor8_tab_16, db (32-11), 11, (32-22), 22, (32-1 ), 1, (32-12), 12, (32-23), 23, (32- 2), 2, (32-13), 13, (32-24), 24 + +const c_ang8_mode_20, db 15, 13, 12, 10, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 0, 0 + +; NOTE: this big table improve speed ~10%, if we have broadcast instruction work on high-128bits infuture, we can remove the table +const angHor8_tab_20, times 8 db (32-24), 24 + times 8 db (32-13), 13 + times 8 db (32- 2), 2 + times 8 db (32-23), 23 + times 8 db (32-12), 12 + times 8 db (32- 1), 1 + times 8 db (32-22), 22 + times 8 db (32-11), 11 + +const ang16_shuf_mode9, times 8 db 0, 1 + times 8 db 1, 2 + +const angHor_tab_9, db (32-2), 2, (32-4), 4, (32-6), 6, (32-8), 8, (32-10), 10, (32-12), 12, (32-14), 14, (32-16), 16 + db 
(32-18), 18, (32-20), 20, (32-22), 22, (32-24), 24, (32-26), 26, (32-28), 28, (32-30), 30, (32-32), 32 + +const angHor_tab_11, db (32-30), 30, (32-28), 28, (32-26), 26, (32-24), 24, (32-22), 22, (32-20), 20, (32-18), 18, (32-16), 16 + db (32-14), 14, (32-12), 12, (32-10), 10, (32- 8), 8, (32- 6), 6, (32- 4), 4, (32- 2), 2, (32- 0), 0 + +const ang16_shuf_mode12, db 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 2, 3, 2, 3 + db 1, 2, 1, 2, 1, 2, 1, 2, 0, 1, 0, 1, 0, 1, 0, 1, 2, 3, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 1, 2 + +const angHor_tab_12, db (32-27), 27, (32-22), 22, (32-17), 17, (32-12), 12, (32-7), 7, (32-2), 2, (32-29), 29, (32-24), 24 + db (32-19), 19, (32-14), 14, (32-9), 9, (32-4), 4, (32-31), 31, (32-26), 26, (32-21), 21, (32-16), 16 + +const ang16_shuf_mode13, db 4, 5, 4, 5, 4, 5, 3, 4, 3, 4, 3, 4, 3, 4, 2, 3, 5, 6, 5, 6, 5, 6, 4, 5, 4, 5, 4, 5, 4, 5, 3, 4 + db 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 1, 2, 0, 1, 0, 1, 3, 4, 3, 4, 2, 3, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2 + db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 7, 4, 0, 0 ,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 7, 4, 0 + +const angHor_tab_13, db (32-23), 23, (32-14), 14, (32-5), 5, (32-28), 28, (32-19), 19, (32-10), 10, (32-1), 1, (32-24), 24 + db (32-15), 15, (32-6), 6, (32-29), 29, (32-20), 20, (32-11), 11, (32-2), 2, (32-25), 25, (32-16), 16 + +const ang16_shuf_mode14, db 6, 7, 6, 7, 5, 6, 5, 6, 4, 5, 4, 5, 4, 5, 3, 4, 7, 8, 7, 8, 6, 7, 6, 7, 5, 6, 5, 6, 5, 6, 4, 5 + db 3, 4, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2, 0, 1, 0, 1, 4, 5, 3, 4, 3, 4, 3, 4, 2, 3, 2, 3, 1, 2, 1, 2 + db 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 12, 10, 7, 5, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 15, 12, 10, 7, 5, 2, 0 + +const angHor_tab_14, db (32-19), 19, (32-6), 6, (32-25), 25, (32-12), 12, (32-31), 31, (32-18), 18, (32-5), 5, (32-24), 24 + db (32-11), 11, (32-30), 30, (32-17), 17, (32-4), 4, (32-23), 23, (32-10), 10, (32-29), 29, (32-16), 16 + +const ang16_shuf_mode15, db 8, 9, 7, 8, 7, 8, 6, 7, 6, 7, 5, 6, 5, 6, 4, 5, 9, 10, 8, 9, 8, 9, 7, 8, 7, 8, 6, 7, 6, 7, 5, 6 + db 4, 5, 3, 4, 3, 4, 2, 3, 2, 3, 1, 2, 1, 2, 0, 1, 5, 6, 4, 5, 4, 5, 3, 4, 3, 4, 2, 3, 2, 3, 1, 2 + db 0, 0, 0, 0, 0, 0, 0, 15, 13, 11, 9, 8, 6, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 15, 13, 11, 9, 8, 6, 4, 2, 0 + +const angHor_tab_15, db (32-15), 15, (32-30), 30, (32-13), 13, (32-28), 28, (32-11), 11, (32-26), 26, (32-9), 9, (32-24), 24 + db (32-7), 7, (32-22), 22, (32-5), 5, (32-20), 20, (32-3), 3, (32-18), 18, (32-1), 1, (32- 16), 16 + +const ang16_shuf_mode16, db 10, 11, 9, 10, 9, 10, 8, 9, 7, 8, 7, 8, 6, 7, 5, 6, 11, 12, 10, 11, 10, 11, 9, 10, 8, 9, 8, 9, 7, 8, 6, 7 + db 5, 6, 4, 5, 3, 4, 3, 4, 2, 3, 1, 2, 1, 2, 0, 1, 6, 7, 5, 6, 4, 5, 4, 5, 3, 4, 2, 3, 2, 3, 1, 2 + db 0 ,0, 0, 0, 0, 15, 14, 12 , 11, 9, 8, 6, 5, 3, 2, 0, 0, 0, 0, 0, 0, 15, 14, 12, 11, 9, 8, 6, 5, 3, 2, 0 + +const angHor_tab_16, db (32-11), 11, (32-22), 22, (32-1), 1, (32-12), 12, (32-23), 23, (32-2), 2, (32-13), 13, (32-24), 24 + db (32-3), 3, (32-14), 14, (32-25), 25, (32-4), 4, (32-15), 15, (32-26), 26, (32-5), 5, (32-16), 16 + +const ang16_shuf_mode17, db 12, 13, 11, 12, 10, 11, 9, 10, 8, 9, 8, 9, 7, 8, 6, 7, 13, 14, 12, 13, 11, 12, 10, 11, 9, 10, 9, 10, 8, 9, 7, 8 + db 5, 6, 4, 5, 4, 5, 3, 4, 2, 3, 1, 2, 0, 1, 0, 1, 6, 7, 5, 6, 5, 6, 4, 5, 3, 4, 2, 3, 1, 2, 1, 2 + db 0, 0, 0, 15, 14, 12, 11, 10, 9, 7, 6, 5, 4, 2, 1, 0, 0, 0, 0, 15, 14, 12, 11, 10, 9, 7, 6, 5, 4, 2, 1, 0 + +const angHor_tab_17, db (32- 6), 6, (32-12), 12, (32-18), 18, (32-24), 24, (32-30), 30, (32- 4), 4, (32-10), 10, (32-16), 16 + db (32-22), 22, 
(32-28), 28, (32- 2), 2, (32- 8), 8, (32-14), 14, (32-20), 20, (32-26), 26, (32- 0), 0 + +; Intrapred_angle32x32, modes 1 to 33 constants +const ang32_shuf_mode9, times 8 db 0, 1 + times 8 db 1, 2 + +const ang32_shuf_mode11, times 8 db 1, 2 + times 8 db 0, 1 + +const ang32_fact_mode12, db (32-27), 27, (32-22), 22, (32-17), 17, (32-12), 12, (32- 7), 7, (32- 2), 2, (32-29), 29, (32-24), 24 + db (32-11), 11, (32- 6), 6, (32- 1), 1, (32-28), 28, (32-23), 23, (32-18), 18, (32-13), 13, (32- 8), 8 + db (32-19), 19, (32-14), 14, (32- 9), 9, (32- 4), 4, (32-31), 31, (32-26), 26, (32-21), 21, (32-16), 16 + db (32- 3), 3, (32-30), 30, (32-25), 25, (32-20), 20, (32-15), 15, (32-10), 10, (32- 5), 5, (32- 0), 0 +const ang32_shuf_mode12, db 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 4, 5, 3, 4, 3, 4, 2, 3, 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 + db 3, 4, 3, 4, 3, 4, 3, 4, 2, 3, 2, 3, 2, 3, 2, 3, 1, 2, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1 +const ang32_shuf_mode24, db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 13, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 10, 3, 3 + dd 0, 0, 7, 3, 0, 0, 7, 3 + +const ang32_fact_mode13, db (32-23), 23, (32-14), 14, (32- 5), 5, (32-28), 28, (32-19), 19, (32-10), 10, (32- 1), 1, (32-24), 24 + db (32- 7), 7, (32-30), 30, (32-21), 21, (32-12), 12, (32- 3), 3, (32-26), 26, (32-17), 17, (32- 8), 8 + db (32-15), 15, (32- 6), 6, (32-29), 29, (32-20), 20, (32-11), 11, (32- 2), 2, (32-25), 25, (32-16), 16 + db (32-31), 31, (32-22), 22, (32-13), 13, (32- 4), 4, (32-27), 27, (32-18), 18, (32- 9), 9, (32- 0), 0 +const ang32_shuf_mode13, db 14, 15, 14, 15, 14, 15, 13, 14, 13, 14, 13, 14, 13, 14, 12, 13, 10, 11, 9, 10, 9, 10, 9, 10, 9, 10, 8, 9, 8, 9, 8, 9 + db 12, 13, 12, 13, 11, 12, 11, 12, 11, 12, 11, 12, 10, 11, 10, 11, 7, 8, 7, 8, 7, 8, 7, 8, 6, 7, 6, 7, 6, 7, 6, 7 + db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 7, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 9, 5, 2 +const ang32_shuf_mode23, db 0, 0, 0, 0, 0, 0, 0, 0, 14, 14, 11, 11, 7, 7, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 12, 12, 9, 9, 5, 5, 2, 2 + +const ang32_fact_mode14, db (32-19), 19, (32- 6), 6, (32-25), 25, (32-12), 12, (32-31), 31, (32-18), 18, (32- 5), 5, (32-24), 24 + db (32- 3), 3, (32-22), 22, (32- 9), 9, (32-28), 28, (32-15), 15, (32- 2), 2, (32-21), 21, (32- 8), 8 + db (32-11), 11, (32-30), 30, (32-17), 17, (32- 4), 4, (32-23), 23, (32-10), 10, (32-29), 29, (32-16), 16 + db (32-27), 27, (32-14), 14, (32- 1), 1, (32-20), 20, (32- 7), 7, (32-26), 26, (32-13), 13, (32- 0), 0 +const ang32_shuf_mode14, db 14, 15, 14, 15, 13, 14, 13, 14, 12, 13, 12, 13, 12, 13, 11, 12, 8, 9, 7, 8, 7, 8, 6, 7, 6, 7, 6, 7, 5, 6, 5, 6 + db 11, 12, 10, 11, 10, 11, 10, 11, 9, 10, 9, 10, 8, 9, 8, 9, 4, 5, 4, 5, 4, 5, 3, 4, 3, 4, 2, 3, 2, 3, 2, 3 + db 0, 0, 0, 0, 0, 0, 0, 0, 15, 12, 10, 7, 5, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 11, 9, 6, 4, 1 +const ang32_shuf_mode22, db 0, 0, 15, 15, 13, 13, 10, 10, 8, 8, 5, 5, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 12, 9, 9, 7, 7, 4, 4, 2 + +const ang32_fact_mode15, db (32-15), 15, (32-30), 30, (32-13), 13, (32-28), 28, (32-11), 11, (32-26), 26, (32- 9), 9, (32-24), 24 + db (32-31), 31, (32-14), 14, (32-29), 29, (32-12), 12, (32-27), 27, (32-10), 10, (32-25), 25, (32- 8), 8 + db (32- 7), 7, (32-22), 22, (32- 5), 5, (32-20), 20, (32- 3), 3, (32-18), 18, (32- 1), 1, (32-16), 16 + db (32-23), 23, (32- 6), 6, (32-21), 21, (32- 4), 4, (32-19), 19, (32- 2), 2, (32-17), 17, (32- 0), 0 +const ang32_shuf_mode15, db 14, 15, 13, 14, 13, 14, 12, 13, 12, 13, 11, 12, 11, 12, 10, 11, 5, 6, 5, 6, 4, 5, 4, 5, 3, 4, 3, 4, 2, 3, 2, 3 + 
db 12, 13, 11, 12, 11, 12, 10, 11, 10, 11, 9, 10, 9, 10, 8, 9, 3, 4, 3, 4, 2, 3, 2, 3, 1, 2, 1, 2, 0, 1, 0, 1 + db 0, 0, 0, 0, 0, 0, 0, 0, 15, 13, 11, 9, 8, 6, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0, 14, 12, 10, 8, 7, 5, 3, 1 +const ang32_shuf_mode21, db 15, 15, 13, 13, 11, 11, 9, 9, 8, 8, 6, 6, 4, 4, 2, 2, 14, 14, 12, 12, 10, 10, 8, 8, 7, 7, 5, 5, 3, 3, 1, 1 + +const ang32_fact_mode16, db (32-11), 11, (32-22), 22, (32- 1), 1, (32-12), 12, (32-23), 23, (32- 2), 2, (32-13), 13, (32-24), 24 + db (32- 3), 3, (32-14), 14, (32-25), 25, (32- 4), 4, (32-15), 15, (32-26), 26, (32- 5), 5, (32-16), 16 + db (32-27), 27, (32- 6), 6, (32-17), 17, (32-28), 28, (32- 7), 7, (32-18), 18, (32-29), 29, (32- 8), 8 + db (32-19), 19, (32-30), 30, (32- 9), 9, (32-20), 20, (32-31), 31, (32-10), 10, (32-21), 21, (32- 0), 0 +const ang32_shuf_mode16, db 14, 15, 13, 14, 13, 14, 12, 13, 11, 12, 11, 12, 10, 11, 9, 10, 9, 10, 8, 9, 7, 8, 7, 8, 6, 7, 5, 6, 5, 6, 4, 5 + db 14, 15, 14, 15, 13, 14, 12, 13, 12, 13, 11, 12, 10, 11, 10, 11, 9, 10, 8, 9, 8, 9, 7, 8, 6, 7, 6, 7, 5, 6, 5, 6 + db 0, 0, 0, 0, 15, 14, 12, 11, 9, 8, 6, 5, 3, 2, 0, 0, 0, 0, 0, 0, 0, 0, 14, 13, 11, 10, 8, 7, 5, 4, 2, 1 + dd 7, 1, 2, 3, 7, 1, 2, 3 +const ang32_shuf_mode20, db 12, 11, 9, 8, 6, 5, 3, 2, 0, 0, 0, 0, 0, 0, 14, 15, 8, 7, 5, 4, 2, 1, 0, 0, 14, 13, 13, 11, 11, 10, 10, 8 + db 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 1, 1, 0, 0 + +const ang32_fact_mode17, db (32- 6), 6, (32-12), 12, (32-18), 18, (32-24), 24, (32-30), 30, (32- 4), 4, (32-10), 10, (32-16), 16 + db (32-22), 22, (32-28), 28, (32- 2), 2, (32- 8), 8, (32-14), 14, (32-20), 20, (32-26), 26, (32- 0), 0 +const ang32_shuf_mode17, db 14, 15, 13, 14, 12, 13, 11, 12, 10, 11, 10, 11, 9, 10, 8, 9, 7, 8, 6, 7, 6, 7, 5, 6, 4, 5, 3, 4, 2, 3, 2, 3 + db 0, 0, 0, 0, 15, 14, 12, 11, 10, 9, 7, 6, 5, 4, 2, 1, 0, 0, 0, 15, 14, 12, 11, 10, 9, 7, 6, 5, 4, 2, 1, 0 +const ang32_shuf_mode19, db 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15 + dd 0, 0, 2, 3, 0, 0, 7, 1 + dd 0, 0, 5, 6, 0, 0, 0, 0 const ang_table %assign x 0 @@ -588,6 +489,7 @@ %endrep SECTION .text +cextern pb_1 cextern pw_2 cextern pw_3 cextern pw_4 @@ -11966,6 +11868,6711 @@ call ang32_mode_3_33_row_0_15 RET + +cglobal ang32_mode_4_32_row_0_15 + test r7d, r7d + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + movu m3, [r2 + 17] ; [48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + movu m4, [r2 + 18] ; [49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m3, m4 ; [41 40 40 39 39 38 38 37 37 36 36 35 35 34 34 33 25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17] + + pmaddubsw m4, m0, [r3 + 5 * 32] ; [21] + pmulhrsw m4, m7 + pmaddubsw m1, m2, [r3 + 5 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + palignr m6, m2, m0, 2 + palignr m1, m3, m2, 2 + pmaddubsw m5, m6, [r3 - 6 * 32] ; [10] + pmulhrsw m5, m7 + pmaddubsw m8, m1, [r3 - 6 * 32] + pmulhrsw m8, m7 + packuswb m5, m8 + + pmaddubsw m6, [r3 + 15 * 32] ; [31] + pmulhrsw m6, m7 + 
pmaddubsw m1, [r3 + 15 * 32] + pmulhrsw m1, m7 + packuswb m6, m1 + + palignr m8, m2, m0, 4 + palignr m1, m3, m2, 4 + pmaddubsw m8, [r3 + 4 * 32] ; [20] + pmulhrsw m8, m7 + pmaddubsw m1, [r3 + 4 * 32] + pmulhrsw m1, m7 + packuswb m8, m1 + + palignr m10, m2, m0, 6 + palignr m11, m3, m2, 6 + pmaddubsw m9, m10, [r3 - 7 * 32] ; [9] + pmulhrsw m9, m7 + pmaddubsw m1, m11, [r3 - 7 * 32] + pmulhrsw m1, m7 + packuswb m9, m1 + + pmaddubsw m10, [r3 + 14 * 32] ; [30] + pmulhrsw m10, m7 + pmaddubsw m11, [r3 + 14 * 32] + pmulhrsw m11, m7 + packuswb m10, m11 + + palignr m11, m2, m0, 8 + palignr m1, m3, m2, 8 + pmaddubsw m11, [r3 + 3 * 32] ; [19] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 + 3 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + palignr m12, m2, m0, 10 + palignr m1, m3, m2, 10 + pmaddubsw m12, [r3 - 8 * 32] ; [8] + pmulhrsw m12, m7 + pmaddubsw m1, [r3 - 8 * 32] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 0 + + ; rows 8 to 15 + palignr m4, m2, m0, 10 + palignr m1, m3, m2, 10 + pmaddubsw m4, [r3 + 13 * 32] ; [29] + pmulhrsw m4, m7 + pmaddubsw m1, [r3 + 13 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + palignr m5, m2, m0, 12 + palignr m1, m3, m2, 12 + pmaddubsw m5, [r3 + 2 * 32] ; [18] + pmulhrsw m5, m7 + pmaddubsw m1, [r3 + 2 * 32] + pmulhrsw m1, m7 + packuswb m5, m1 + + palignr m8, m2, m0, 14 + palignr m1, m3, m2, 14 + pmaddubsw m6, m8, [r3 - 9 * 32] ; [7] + pmulhrsw m6, m7 + pmaddubsw m9, m1, [r3 - 9 * 32] + pmulhrsw m9, m7 + packuswb m6, m9 + + pmaddubsw m8, [r3 + 12 * 32] ; [28] + pmulhrsw m8, m7 + pmaddubsw m1, [r3 + 12 * 32] + pmulhrsw m1, m7 + packuswb m8, m1 + + pmaddubsw m9, m2, [r3 + 1 * 32] ; [17] + pmulhrsw m9, m7 + pmaddubsw m1, m3, [r3 + 1 * 32] + pmulhrsw m1, m7 + packuswb m9, m1 + + movu m0, [r2 + 25] + movu m1, [r2 + 26] + punpcklbw m0, m1 + + palignr m11, m3, m2, 2 + palignr m1, m0, m3, 2 + pmaddubsw m10, m11, [r3 - 10 * 32] ; [6] + pmulhrsw m10, m7 + pmaddubsw m12, m1, [r3 - 10 * 32] + pmulhrsw m12, m7 + packuswb m10, m12 + + pmaddubsw m11, [r3 + 11 * 32] ; [27] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 + 11 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + palignr m0, m3, 4 + palignr m3, m2, 4 + pmaddubsw m3, [r3] ; [16] + pmulhrsw m3, m7 + pmaddubsw m0, [r3] + pmulhrsw m0, m7 + packuswb m3, m0 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 3, 0, 8 + ret + +cglobal ang32_mode_4_32_row_16_31 + test r7d, r7d + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + movu m3, [r2 + 17] ; [48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + movu m4, [r2 + 18] ; [49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m3, m4 ; [41 40 40 39 39 38 38 37 37 36 36 35 35 34 34 33 25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17] + + pmaddubsw m4, m0, [r3 - 11 * 32] ; [5] + pmulhrsw m4, m7 + pmaddubsw m1, m2, [r3 - 11 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + pmaddubsw m5, m0, [r3 + 10 * 32] ; [26] + pmulhrsw m5, m7 + pmaddubsw m1, m2, [r3 + 10 * 32] + pmulhrsw m1, m7 + packuswb m5, m1 + + palignr m6, m2, m0, 2 + palignr m1, m3, m2, 2 + pmaddubsw m6, 
[r3 - 1 * 32] ; [15] + pmulhrsw m6, m7 + pmaddubsw m1, [r3 - 1 * 32] + pmulhrsw m1, m7 + packuswb m6, m1 + + palignr m9, m2, m0, 4 + palignr m10, m3, m2, 4 + pmaddubsw m8, m9, [r3 - 12 * 32] ; [4] + pmulhrsw m8, m7 + pmaddubsw m1, m10, [r3 - 12 * 32] + pmulhrsw m1, m7 + packuswb m8, m1 + + pmaddubsw m9, [r3 + 9 * 32] ; [25] + pmulhrsw m9, m7 + pmaddubsw m10, [r3 + 9 * 32] + pmulhrsw m10, m7 + packuswb m9, m10 + + palignr m10, m2, m0, 6 + palignr m11, m3, m2, 6 + pmaddubsw m10, [r3 - 2 * 32] ; [14] + pmulhrsw m10, m7 + pmaddubsw m11, [r3 - 2 * 32] + pmulhrsw m11, m7 + packuswb m10, m11 + + palignr m12, m2, m0, 8 + palignr m1, m3, m2, 8 + pmaddubsw m11, m12, [r3 - 13 * 32] ; [3] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 - 13 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + palignr m1, m3, m2, 8 + pmaddubsw m12, [r3 + 8 * 32] ; [24] + pmulhrsw m12, m7 + pmaddubsw m1, [r3 + 8 * 32] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 0 + + ; rows 8 to 15 + palignr m4, m2, m0, 10 + palignr m1, m3, m2, 10 + pmaddubsw m4, [r3 - 3 * 32] ; [13] + pmulhrsw m4, m7 + pmaddubsw m1, [r3 - 3 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + palignr m6, m2, m0, 12 + palignr m8, m3, m2, 12 + pmaddubsw m5, m6, [r3 - 14 * 32] ; [2] + pmulhrsw m5, m7 + pmaddubsw m1, m8, [r3 - 14 * 32] + pmulhrsw m1, m7 + packuswb m5, m1 + + pmaddubsw m6, [r3 + 7 * 32] ; [23] + pmulhrsw m6, m7 + pmaddubsw m8, [r3 + 7 * 32] + pmulhrsw m8, m7 + packuswb m6, m8 + + palignr m8, m2, m0, 14 + palignr m1, m3, m2, 14 + pmaddubsw m8, [r3 - 4 * 32] ; [12] + pmulhrsw m8, m7 + pmaddubsw m1, [r3 - 4 * 32] + pmulhrsw m1, m7 + packuswb m8, m1 + + pmaddubsw m9, m2, [r3 - 15 * 32] ; [1] + pmulhrsw m9, m7 + pmaddubsw m1, m3, [r3 - 15 * 32] + pmulhrsw m1, m7 + packuswb m9, m1 + + pmaddubsw m10, m2, [r3 + 6 * 32] ; [22] + pmulhrsw m10, m7 + pmaddubsw m1, m3, [r3 + 6 * 32] + pmulhrsw m1, m7 + packuswb m10, m1 + + movu m0, [r2 + 25] + movu m1, [r2 + 26] + punpcklbw m0, m1 + + palignr m11, m3, m2, 2 + palignr m1, m0, m3, 2 + pmaddubsw m11, [r3 - 5 * 32] ; [11] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 - 5 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + movu m12, [r2 + 11] + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 8 + ret + +INIT_YMM avx2 +cglobal intra_pred_ang32_4, 3,8,13 + add r2, 64 + lea r3, [ang_table_avx2 + 32 * 16] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + mov r4, r0 + xor r7d, r7d + + call ang32_mode_4_32_row_0_15 + + add r4, 16 + mov r0, r4 + add r2, 11 + + call ang32_mode_4_32_row_16_31 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang32_32, 3,8,13 + lea r3, [ang_table_avx2 + 32 * 16] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + xor r7d, r7d + inc r7d + + call ang32_mode_4_32_row_0_15 + + add r2, 11 + + call ang32_mode_4_32_row_16_31 + RET + +cglobal ang32_mode_5_31_row_0_15 + test r7d, r7d + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + movu m3, [r2 + 17] ; [48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + movu m4, [r2 + 18] ; [49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw 
m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m3, m4 ; [41 40 40 39 39 38 38 37 37 36 36 35 35 34 34 33 25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17] + + pmaddubsw m4, m0, [r3 + 1 * 32] ; [17] + pmulhrsw m4, m7 + pmaddubsw m1, m2, [r3 + 1 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + palignr m6, m2, m0, 2 + palignr m1, m3, m2, 2 + pmaddubsw m5, m6, [r3 - 14 * 32] ; [2] + pmulhrsw m5, m7 + pmaddubsw m8, m1, [r3 - 14 * 32] + pmulhrsw m8, m7 + packuswb m5, m8 + + pmaddubsw m6, [r3 + 3 * 32] ; [19] + pmulhrsw m6, m7 + pmaddubsw m1, [r3 + 3 * 32] + pmulhrsw m1, m7 + packuswb m6, m1 + + palignr m9, m2, m0, 4 + palignr m10, m3, m2, 4 + pmaddubsw m8, m9, [r3 - 12 * 32] ; [4] + pmulhrsw m8, m7 + pmaddubsw m1, m10, [r3 - 12 * 32] + pmulhrsw m1, m7 + packuswb m8, m1 + + pmaddubsw m9, [r3 + 5 * 32] ; [21] + pmulhrsw m9, m7 + pmaddubsw m10, [r3 + 5 * 32] + pmulhrsw m10, m7 + packuswb m9, m10 + + palignr m11, m2, m0, 6 + palignr m12, m3, m2, 6 + pmaddubsw m10, m11, [r3 - 10 * 32] ; [6] + pmulhrsw m10, m7 + pmaddubsw m1, m12, [r3 - 10 * 32] + pmulhrsw m1, m7 + packuswb m10, m1 + + pmaddubsw m11, [r3 + 7 * 32] ; [23] + pmulhrsw m11, m7 + pmaddubsw m12, [r3 + 7 * 32] + pmulhrsw m12, m7 + packuswb m11, m12 + + palignr m12, m2, m0, 8 + palignr m1, m3, m2, 8 + pmaddubsw m12, [r3 - 8 * 32] ; [8] + pmulhrsw m12, m7 + pmaddubsw m1, [r3 - 8 * 32] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 0 + + ; rows 8 to 15 + palignr m4, m2, m0, 8 + palignr m1, m3, m2, 8 + pmaddubsw m4, [r3 + 9 * 32] ; [25] + pmulhrsw m4, m7 + pmaddubsw m1, [r3 + 9 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + palignr m6, m2, m0, 10 + palignr m1, m3, m2, 10 + pmaddubsw m5, m6, [r3 - 6 * 32] ; [10] + pmulhrsw m5, m7 + pmaddubsw m8, m1, [r3 - 6 * 32] + pmulhrsw m8, m7 + packuswb m5, m8 + + pmaddubsw m6, [r3 + 11 * 32] ; [27] + pmulhrsw m6, m7 + pmaddubsw m1, [r3 + 11 * 32] + pmulhrsw m1, m7 + packuswb m6, m1 + + palignr m9, m2, m0, 12 + palignr m1, m3, m2, 12 + pmaddubsw m8, m9, [r3 - 4 * 32] ; [12] + pmulhrsw m8, m7 + pmaddubsw m10, m1, [r3 - 4 * 32] + pmulhrsw m10, m7 + packuswb m8, m10 + + pmaddubsw m9, [r3 + 13 * 32] ; [29] + pmulhrsw m9, m7 + pmaddubsw m1, [r3 + 13 * 32] + pmulhrsw m1, m7 + packuswb m9, m1 + + palignr m11, m2, m0, 14 + palignr m1, m3, m2, 14 + pmaddubsw m10, m11, [r3 - 2 * 32] ; [14] + pmulhrsw m10, m7 + pmaddubsw m12, m1, [r3 - 2 * 32] + pmulhrsw m12, m7 + packuswb m10, m12 + + pmaddubsw m11, [r3 + 15 * 32] ; [31] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 + 15 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + pmaddubsw m2, [r3] ; [16] + pmulhrsw m2, m7 + pmaddubsw m3, [r3] + pmulhrsw m3, m7 + packuswb m2, m3 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 2, 0, 8 + ret + +cglobal ang32_mode_5_31_row_16_31 + test r7d, r7d + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + movu m3, [r2 + 17] ; [48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + movu m4, [r2 + 18] ; [49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + 
punpcklbw m3, m4 ; [41 40 40 39 39 38 38 37 37 36 36 35 35 34 34 33 25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17] + + pmaddubsw m4, m0, [r3 - 15 * 32] ; [1] + pmulhrsw m4, m7 + pmaddubsw m1, m2, [r3 - 15 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + pmaddubsw m5, m0, [r3 + 2 * 32] ; [18] + pmulhrsw m5, m7 + pmaddubsw m8, m2, [r3 + 2 * 32] + pmulhrsw m8, m7 + packuswb m5, m8 + + palignr m8, m2, m0, 2 + palignr m9, m3, m2, 2 + pmaddubsw m6, m8, [r3 - 13 * 32] ; [3] + pmulhrsw m6, m7 + pmaddubsw m1, m9, [r3 - 13 * 32] + pmulhrsw m1, m7 + packuswb m6, m1 + + pmaddubsw m8, [r3 + 4 * 32] ; [20] + pmulhrsw m8, m7 + pmaddubsw m9, [r3 + 4 * 32] + pmulhrsw m9, m7 + packuswb m8, m9 + + palignr m10, m2, m0, 4 + palignr m1, m3, m2, 4 + pmaddubsw m9, m10, [r3 - 11 * 32] ; [5] + pmulhrsw m9, m7 + pmaddubsw m11, m1, [r3 - 11 * 32] + pmulhrsw m11, m7 + packuswb m9, m11 + + pmaddubsw m10, [r3 + 6 * 32] ; [22] + pmulhrsw m10, m7 + pmaddubsw m1, [r3 + 6 * 32] + pmulhrsw m1, m7 + packuswb m10, m1 + + palignr m12, m2, m0, 6 + palignr m1, m3, m2, 6 + pmaddubsw m11, m12, [r3 - 9 * 32] ; [7] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 - 9 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + palignr m1, m3, m2, 6 + pmaddubsw m12, [r3 + 8 * 32] ; [24] + pmulhrsw m12, m7 + pmaddubsw m1, [r3 + 8 * 32] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 0 + + ; rows 8 to 15 + palignr m5, m2, m0, 8 + palignr m8, m3, m2, 8 + pmaddubsw m4, m5, [r3 - 7 * 32] ; [9] + pmulhrsw m4, m7 + pmaddubsw m1, m8, [r3 - 7 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + pmaddubsw m5, [r3 + 10 * 32] ; [26] + pmulhrsw m5, m7 + pmaddubsw m8, [r3 + 10 * 32] + pmulhrsw m8, m7 + packuswb m5, m8 + + palignr m8, m2, m0, 10 + palignr m9, m3, m2, 10 + pmaddubsw m6, m8, [r3 - 5 * 32] ; [11] + pmulhrsw m6, m7 + pmaddubsw m1, m9, [r3 - 5 * 32] + pmulhrsw m1, m7 + packuswb m6, m1 + + pmaddubsw m8, [r3 + 12 * 32] ; [28] + pmulhrsw m8, m7 + pmaddubsw m9, [r3 + 12 * 32] + pmulhrsw m9, m7 + packuswb m8, m9 + + palignr m10, m2, m0, 12 + palignr m11, m3, m2, 12 + pmaddubsw m9, m10, [r3 - 3 * 32] ; [13] + pmulhrsw m9, m7 + pmaddubsw m1, m11, [r3 - 3 * 32] + pmulhrsw m1, m7 + packuswb m9, m1 + + pmaddubsw m10, [r3 + 14 * 32] ; [30] + pmulhrsw m10, m7 + pmaddubsw m11, [r3 + 14 * 32] + pmulhrsw m11, m7 + packuswb m10, m11 + + palignr m11, m2, m0, 14 + palignr m1, m3, m2, 14 + pmaddubsw m11, [r3 - 1 * 32] ; [15] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 - 1 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + movu m2, [r2 + 9] + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 2, 0, 8 + ret + +INIT_YMM avx2 +cglobal intra_pred_ang32_5, 3,8,13 + add r2, 64 + lea r3, [ang_table_avx2 + 32 * 16] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + mov r4, r0 + xor r7d, r7d + + call ang32_mode_5_31_row_0_15 + + add r4, 16 + mov r0, r4 + add r2, 9 + + call ang32_mode_5_31_row_16_31 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang32_31, 3,8,13 + lea r3, [ang_table_avx2 + 32 * 16] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + xor r7d, r7d + inc r7d + + call ang32_mode_5_31_row_0_15 + + add r2, 9 + + call ang32_mode_5_31_row_16_31 + RET + +cglobal ang32_mode_6_30_row_0_15 + test r7d, r7d + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + movu m3, [r2 + 17] ; [48 47 46 45 44 
43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + movu m4, [r2 + 18] ; [49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m3, m4 ; [41 40 40 39 39 38 38 37 37 36 36 35 35 34 34 33 25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17] + + pmaddubsw m4, m0, [r3 - 3 * 32] ; [13] + pmulhrsw m4, m7 + pmaddubsw m1, m2, [r3 - 3 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + pmaddubsw m5, m0, [r3 + 10 * 32] ; [26] + pmulhrsw m5, m7 + pmaddubsw m8, m2, [r3 + 10 * 32] + pmulhrsw m8, m7 + packuswb m5, m8 + + palignr m8, m2, m0, 2 + palignr m1, m3, m2, 2 + pmaddubsw m6, m8, [r3 - 9 * 32] ; [7] + pmulhrsw m6, m7 + pmaddubsw m9, m1, [r3 - 9 * 32] + pmulhrsw m9, m7 + packuswb m6, m9 + + pmaddubsw m8, [r3 + 4 * 32] ; [20] + pmulhrsw m8, m7 + pmaddubsw m1, [r3 + 4 * 32] + pmulhrsw m1, m7 + packuswb m8, m1 + + palignr m11, m2, m0, 4 + palignr m1, m3, m2, 4 + pmaddubsw m9, m11, [r3 - 15 * 32] ; [1] + pmulhrsw m9, m7 + pmaddubsw m12, m1, [r3 - 15 * 32] + pmulhrsw m12, m7 + packuswb m9, m12 + + pmaddubsw m10, m11, [r3 - 2 * 32] ; [14] + pmulhrsw m10, m7 + pmaddubsw m12, m1, [r3 - 2 * 32] + pmulhrsw m12, m7 + packuswb m10, m12 + + pmaddubsw m11, [r3 + 11 * 32] ; [27] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 + 11 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + palignr m12, m2, m0, 6 + palignr m1, m3, m2, 6 + pmaddubsw m12, [r3 - 8 * 32] ; [8] + pmulhrsw m12, m7 + pmaddubsw m1, [r3 - 8 * 32] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 0 + + ; rows 8 to 15 + palignr m4, m2, m0, 6 + palignr m1, m3, m2, 6 + pmaddubsw m4, [r3 + 5 * 32] ; [21] + pmulhrsw m4, m7 + pmaddubsw m1, [r3 + 5 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + palignr m8, m2, m0, 8 + palignr m1, m3, m2, 8 + pmaddubsw m5, m8, [r3 - 14 * 32] ; [2] + pmulhrsw m5, m7 + pmaddubsw m9, m1, [r3 - 14 * 32] + pmulhrsw m9, m7 + packuswb m5, m9 + + pmaddubsw m6, m8, [r3 - 1 * 32] ; [15] + pmulhrsw m6, m7 + pmaddubsw m9, m1, [r3 - 1 * 32] + pmulhrsw m9, m7 + packuswb m6, m9 + + pmaddubsw m8, [r3 + 12 * 32] ; [28] + pmulhrsw m8, m7 + pmaddubsw m1, [r3 + 12 * 32] + pmulhrsw m1, m7 + packuswb m8, m1 + + palignr m10, m2, m0, 10 + palignr m1, m3, m2, 10 + pmaddubsw m9, m10, [r3 - 7 * 32] ; [9] + pmulhrsw m9, m7 + pmaddubsw m11, m1, [r3 - 7 * 32] + pmulhrsw m11, m7 + packuswb m9, m11 + + pmaddubsw m10, [r3 + 6 * 32] ; [22] + pmulhrsw m10, m7 + pmaddubsw m1, m1, [r3 + 6 * 32] + pmulhrsw m1, m7 + packuswb m10, m1 + + palignr m3, m2, 12 + palignr m2, m0, 12 + pmaddubsw m11, m2, [r3 - 13 * 32] ; [3] + pmulhrsw m11, m7 + pmaddubsw m1, m3, [r3 - 13 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + pmaddubsw m2, [r3] ; [16] + pmulhrsw m2, m7 + pmaddubsw m3, [r3] + pmulhrsw m3, m7 + packuswb m2, m3 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 2, 0, 8 + ret + +cglobal ang32_mode_6_30_row_16_31 + test r7d, r7d + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + movu m3, [r2 + 17] ; [48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + movu m4, [r2 + 18] ; [49 48 47 46 45 44 43 42 41 40 39 38 37 
36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m3, m4 ; [41 40 40 39 39 38 38 37 37 36 36 35 35 34 34 33 25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17] + + pmaddubsw m4, m0, [r3 + 13 * 32] ; [29] + pmulhrsw m4, m7 + pmaddubsw m1, m2, [r3 + 13 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + palignr m6, m2, m0, 2 + palignr m1, m3, m2, 2 + pmaddubsw m5, m6, [r3 - 6 * 32] ; [10] + pmulhrsw m5, m7 + pmaddubsw m8, m1, [r3 - 6 * 32] + pmulhrsw m8, m7 + packuswb m5, m8 + + pmaddubsw m6, [r3 + 7 * 32] ; [23] + pmulhrsw m6, m7 + pmaddubsw m1, [r3 + 7 * 32] + pmulhrsw m1, m7 + packuswb m6, m1 + + palignr m10, m2, m0, 4 + palignr m1, m3, m2, 4 + pmaddubsw m8, m10, [r3 - 12 * 32] ; [4] + pmulhrsw m8, m7 + pmaddubsw m11, m1, [r3 - 12 * 32] + pmulhrsw m11, m7 + packuswb m8, m11 + + pmaddubsw m9, m10, [r3 + 1 * 32] ; [17] + pmulhrsw m9, m7 + pmaddubsw m11, m1, [r3 + 1 * 32] + pmulhrsw m11, m7 + packuswb m9, m11 + + pmaddubsw m10, [r3 + 14 * 32] ; [30] + pmulhrsw m10, m7 + pmaddubsw m1, [r3 + 14 * 32] + pmulhrsw m1, m7 + packuswb m10, m1 + + palignr m12, m2, m0, 6 + palignr m1, m3, m2, 6 + pmaddubsw m11, m12, [r3 - 5 * 32] ; [11] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 - 5 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + palignr m1, m3, m2, 6 + pmaddubsw m12, [r3 + 8 * 32] ; [24] + pmulhrsw m12, m7 + pmaddubsw m1, [r3 + 8 * 32] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 0 + + ; rows 8 to 15 + palignr m6, m2, m0, 8 + palignr m1, m3, m2, 8 + pmaddubsw m4, m6, [r3 - 11 * 32] ; [5] + pmulhrsw m4, m7 + pmaddubsw m8, m1, [r3 - 11 * 32] + pmulhrsw m8, m7 + packuswb m4, m8 + + pmaddubsw m5, m6, [r3 + 2 * 32] ; [18] + pmulhrsw m5, m7 + pmaddubsw m9, m1, [r3 + 2 * 32] + pmulhrsw m9, m7 + packuswb m5, m9 + + pmaddubsw m6, [r3 + 15 * 32] ; [31] + pmulhrsw m6, m7 + pmaddubsw m1, [r3 + 15 * 32] + pmulhrsw m1, m7 + packuswb m6, m1 + + palignr m9, m2, m0, 10 + palignr m1, m3, m2, 10 + pmaddubsw m8, m9, [r3 - 4 * 32] ; [12] + pmulhrsw m8, m7 + pmaddubsw m10, m1, [r3 - 4 * 32] + pmulhrsw m10, m7 + packuswb m8, m10 + + pmaddubsw m9, [r3 + 9 * 32] ; [25] + pmulhrsw m9, m7 + pmaddubsw m1, [r3 + 9 * 32] + pmulhrsw m1, m7 + packuswb m9, m1 + + palignr m3, m2, 12 + palignr m2, m0, 12 + pmaddubsw m10, m2, [r3 - 10 * 32] ; [6] + pmulhrsw m10, m7 + pmaddubsw m1, m3, [r3 - 10 * 32] + pmulhrsw m1, m7 + packuswb m10, m1 + + pmaddubsw m2, [r3 + 3 * 32] ; [19] + pmulhrsw m2, m7 + pmaddubsw m3, [r3 + 3 * 32] + pmulhrsw m3, m7 + packuswb m2, m3 + + movu m3, [r2 + 8] ; [0] + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 2, 3, 0, 8 + ret + +INIT_YMM avx2 +cglobal intra_pred_ang32_6, 3,8,13 + add r2, 64 + lea r3, [ang_table_avx2 + 32 * 16] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + mov r4, r0 + xor r7d, r7d + + call ang32_mode_6_30_row_0_15 + + add r4, 16 + mov r0, r4 + add r2, 6 + + call ang32_mode_6_30_row_16_31 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang32_30, 3,8,13 + lea r3, [ang_table_avx2 + 32 * 16] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + xor r7d, r7d + inc r7d + + call ang32_mode_6_30_row_0_15 + + add r2, 6 + + call ang32_mode_6_30_row_16_31 + RET + +cglobal ang32_mode_7_29_row_0_15 + test r7d, r7d + ; rows 0 to 7 + movu m0, [r2 + 1] 
; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + movu m3, [r2 + 17] ; [48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + movu m4, [r2 + 18] ; [49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m3, m4 ; [41 40 40 39 39 38 38 37 37 36 36 35 35 34 34 33 25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17] + + pmaddubsw m4, m0, [r3 - 7 * 32] ; [9] + pmulhrsw m4, m7 + pmaddubsw m1, m2, [r3 - 7 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + pmaddubsw m5, m0, [r3 + 2 * 32] ; [18] + pmulhrsw m5, m7 + pmaddubsw m8, m2, [r3 + 2 * 32] + pmulhrsw m8, m7 + packuswb m5, m8 + + pmaddubsw m6, m0, [r3 + 11 * 32] ; [27] + pmulhrsw m6, m7 + pmaddubsw m9, m2, [r3 + 11 * 32] + pmulhrsw m9, m7 + packuswb m6, m9 + + palignr m11, m2, m0, 2 + palignr m1, m3, m2, 2 + pmaddubsw m8, m11, [r3 - 12 * 32] ; [4] + pmulhrsw m8, m7 + pmaddubsw m12, m1, [r3 - 12 * 32] + pmulhrsw m12, m7 + packuswb m8, m12 + + pmaddubsw m9, m11, [r3 - 3 * 32] ; [13] + pmulhrsw m9, m7 + pmaddubsw m12, m1, [r3 - 3 * 32] + pmulhrsw m12, m7 + packuswb m9, m12 + + pmaddubsw m10, m11, [r3 + 6 * 32] ; [22] + pmulhrsw m10, m7 + pmaddubsw m12, m1, [r3 + 6 * 32] + pmulhrsw m12, m7 + packuswb m10, m12 + + pmaddubsw m11, [r3 + 15 * 32] ; [31] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 + 15 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + palignr m12, m2, m0, 4 + palignr m1, m3, m2, 4 + pmaddubsw m12, [r3 - 8 * 32] ; [8] + pmulhrsw m12, m7 + pmaddubsw m1, [r3 - 8 * 32] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 0 + + ; rows 8 to 15 + palignr m5, m2, m0, 4 + palignr m1, m3, m2, 4 + pmaddubsw m4, m5, [r3 + 1 * 32] ; [17] + pmulhrsw m4, m7 + pmaddubsw m8, m1, [r3 + 1 * 32] + pmulhrsw m8, m7 + packuswb m4, m8 + + pmaddubsw m5, [r3 + 10 * 32] ; [26] + pmulhrsw m5, m7 + pmaddubsw m1, [r3 + 10 * 32] + pmulhrsw m1, m7 + packuswb m5, m1 + + palignr m10, m2, m0, 6 + palignr m1, m3, m2, 6 + pmaddubsw m6, m10, [r3 - 13 * 32] ; [3] + pmulhrsw m6, m7 + pmaddubsw m9, m1, [r3 - 13 * 32] + pmulhrsw m9, m7 + packuswb m6, m9 + + pmaddubsw m8, m10, [r3 - 4 * 32] ; [12] + pmulhrsw m8, m7 + pmaddubsw m11, m1, [r3 - 4 * 32] + pmulhrsw m11, m7 + packuswb m8, m11 + + pmaddubsw m9, m10, [r3 + 5 * 32] ; [21] + pmulhrsw m9, m7 + pmaddubsw m11, m1, [r3 + 5 * 32] + pmulhrsw m11, m7 + packuswb m9, m11 + + pmaddubsw m10, [r3 + 14 * 32] ; [30] + pmulhrsw m10, m7 + pmaddubsw m1, [r3 + 14 * 32] + pmulhrsw m1, m7 + packuswb m10, m1 + + palignr m3, m2, 8 + palignr m2, m0, 8 + pmaddubsw m11, m2, [r3 - 9 * 32] ; [7] + pmulhrsw m11, m7 + pmaddubsw m1, m3, [r3 - 9 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + pmaddubsw m2, [r3] ; [16] + pmulhrsw m2, m7 + pmaddubsw m3, [r3] + pmulhrsw m3, m7 + packuswb m2, m3 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 2, 0, 8 + ret + +cglobal ang32_mode_7_29_row_16_31 + test r7d, r7d + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + movu m3, [r2 + 17] 
; [48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + movu m4, [r2 + 18] ; [49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m3, m4 ; [41 40 40 39 39 38 38 37 37 36 36 35 35 34 34 33 25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17] + + pmaddubsw m4, m0, [r3 + 9 * 32] ; [25] + pmulhrsw m4, m7 + pmaddubsw m1, m2, [r3 + 9 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + palignr m9, m2, m0, 2 + palignr m1, m3, m2, 2 + pmaddubsw m5, m9, [r3 - 14 * 32] ; [2] + pmulhrsw m5, m7 + pmaddubsw m8, m1, [r3 - 14 * 32] + pmulhrsw m8, m7 + packuswb m5, m8 + + pmaddubsw m6, m9, [r3 - 5 * 32] ; [11] + pmulhrsw m6, m7 + pmaddubsw m10, m1, [r3 - 5 * 32] + pmulhrsw m10, m7 + packuswb m6, m10 + + pmaddubsw m8, m9, [r3 + 4 * 32] ; [20] + pmulhrsw m8, m7 + pmaddubsw m10, m1, [r3 + 4 * 32] + pmulhrsw m10, m7 + packuswb m8, m10 + + pmaddubsw m9, [r3 + 13 * 32] ; [29] + pmulhrsw m9, m7 + pmaddubsw m1, [r3 + 13 * 32] + pmulhrsw m1, m7 + packuswb m9, m1 + + palignr m12, m2, m0, 4 + palignr m1, m3, m2, 4 + pmaddubsw m10, m12, [r3 - 10 * 32] ; [6] + pmulhrsw m10, m7 + pmaddubsw m11, m1, [r3 - 10 * 32] + pmulhrsw m11, m7 + packuswb m10, m11 + + pmaddubsw m11, m12, [r3 - 1 * 32] ; [15] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 - 1 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + palignr m1, m3, m2, 4 + pmaddubsw m12, [r3 + 8 * 32] ; [24] + pmulhrsw m12, m7 + pmaddubsw m1, [r3 + 8 * 32] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 0 + + ; rows 8 to 15 + palignr m8, m2, m0, 6 + palignr m1, m3, m2, 6 + pmaddubsw m4, m8, [r3 - 15 * 32] ; [1] + pmulhrsw m4, m7 + pmaddubsw m9, m1, [r3 - 15 * 32] + pmulhrsw m9, m7 + packuswb m4, m9 + + pmaddubsw m5, m8, [r3 - 6 * 32] ; [10] + pmulhrsw m5, m7 + pmaddubsw m9, m1, [r3 - 6 * 32] + pmulhrsw m9, m7 + packuswb m5, m9 + + pmaddubsw m6, m8, [r3 + 3 * 32] ; [19] + pmulhrsw m6, m7 + pmaddubsw m9, m1, [r3 + 3 * 32] + pmulhrsw m9, m7 + packuswb m6, m9 + + pmaddubsw m8, [r3 + 12 * 32] ; [28] + pmulhrsw m8, m7 + pmaddubsw m1, [r3 + 12 * 32] + pmulhrsw m1, m7 + packuswb m8, m1 + + palignr m3, m2, 8 + palignr m2, m0, 8 + pmaddubsw m9, m2, [r3 - 11 * 32] ; [5] + pmulhrsw m9, m7 + pmaddubsw m1, m3, [r3 - 11 * 32] + pmulhrsw m1, m7 + packuswb m9, m1 + + pmaddubsw m10, m2, [r3 - 2 * 32] ; [14] + pmulhrsw m10, m7 + pmaddubsw m1, m3, [r3 - 2 * 32] + pmulhrsw m1, m7 + packuswb m10, m1 + + pmaddubsw m2, [r3 + 7 * 32] ; [23] + pmulhrsw m2, m7 + pmaddubsw m3, [r3 + 7 * 32] + pmulhrsw m3, m7 + packuswb m2, m3 + + movu m1, [r2 + 6] ; [0] + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 2, 1, 0, 8 + ret + +INIT_YMM avx2 +cglobal intra_pred_ang32_7, 3,8,13 + add r2, 64 + lea r3, [ang_table_avx2 + 32 * 16] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + mov r4, r0 + xor r7d, r7d + + call ang32_mode_7_29_row_0_15 + + add r4, 16 + mov r0, r4 + add r2, 4 + + call ang32_mode_7_29_row_16_31 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang32_29, 3,8,13 + lea r3, [ang_table_avx2 + 32 * 16] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + xor r7d, r7d + inc r7d + + call ang32_mode_7_29_row_0_15 + + add r2, 4 + + call ang32_mode_7_29_row_16_31 + RET + +cglobal 
ang32_mode_8_28_avx2 + test r7d, r7d + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + movu m3, [r2 + 17] ; [48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + movu m4, [r2 + 18] ; [49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m3, m4 ; [41 40 40 39 39 38 38 37 37 36 36 35 35 34 34 33 25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17] + + pmaddubsw m4, m0, [r3 - 11 * 32] ; [5] + pmulhrsw m4, m7 + pmaddubsw m1, m2, [r3 - 11 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + pmaddubsw m5, m0, [r3 - 6 * 32] ; [10] + pmulhrsw m5, m7 + pmaddubsw m8, m2, [r3 - 6 * 32] + pmulhrsw m8, m7 + packuswb m5, m8 + + pmaddubsw m6, m0, [r3 - 1 * 32] ; [15] + pmulhrsw m6, m7 + pmaddubsw m9, m2, [r3 - 1 * 32] + pmulhrsw m9, m7 + packuswb m6, m9 + + pmaddubsw m8, m0, [r3 + 4 * 32] ; [20] + pmulhrsw m8, m7 + pmaddubsw m12, m2, [r3 + 4 * 32] + pmulhrsw m12, m7 + packuswb m8, m12 + + pmaddubsw m9, m0, [r3 + 9 * 32] ; [25] + pmulhrsw m9, m7 + pmaddubsw m12, m2, [r3 + 9 * 32] + pmulhrsw m12, m7 + packuswb m9, m12 + + pmaddubsw m10, m0, [r3 + 14 * 32] ; [30] + pmulhrsw m10, m7 + pmaddubsw m12, m2, [r3 + 14 * 32] + pmulhrsw m12, m7 + packuswb m10, m12 + + palignr m12, m2, m0, 2 + palignr m1, m3, m2, 2 + pmaddubsw m11, m12, [r3 - 13 * 32] ; [3] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 - 13 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + palignr m1, m3, m2, 2 + pmaddubsw m12, [r3 - 8 * 32] ; [8] + pmulhrsw m12, m7 + pmaddubsw m1, [r3 - 8 * 32] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 0 + + ; rows 8 to 15 + + palignr m8, m2, m0, 2 + palignr m1, m3, m2, 2 + pmaddubsw m4, m8, [r3 - 3 * 32] ; [13] + pmulhrsw m4, m7 + pmaddubsw m9, m1, [r3 - 3 * 32] + pmulhrsw m9, m7 + packuswb m4, m9 + + pmaddubsw m5, m8, [r3 + 2 * 32] ; [18] + pmulhrsw m5, m7 + pmaddubsw m9, m1, [r3 + 2 * 32] + pmulhrsw m9, m7 + packuswb m5, m9 + + pmaddubsw m6, m8, [r3 + 7 * 32] ; [23] + pmulhrsw m6, m7 + pmaddubsw m9, m1, [r3 + 7 * 32] + pmulhrsw m9, m7 + packuswb m6, m9 + + pmaddubsw m8, [r3 + 12 * 32] ; [28] + pmulhrsw m8, m7 + pmaddubsw m1, [r3 + 12 * 32] + pmulhrsw m1, m7 + packuswb m8, m1 + + palignr m12, m2, m0, 4 + palignr m1, m3, m2, 4 + pmaddubsw m9, m12, [r3 - 15 * 32] ; [1] + pmulhrsw m9, m7 + pmaddubsw m11, m1, [r3 - 15 * 32] + pmulhrsw m11, m7 + packuswb m9, m11 + + pmaddubsw m10, m12, [r3 - 10 * 32] ; [6] + pmulhrsw m10, m7 + pmaddubsw m11, m1, [r3 - 10 * 32] + pmulhrsw m11, m7 + packuswb m10, m11 + + pmaddubsw m11, m12, [r3 - 5 * 32] ; [11] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 - 5 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + palignr m1, m3, m2, 4 + pmaddubsw m12, [r3] ; [16] + pmulhrsw m12, m7 + pmaddubsw m1, [r3] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 8 + + ; rows 16 to 23 + + jnz .doNotAdjustBufferPtr + lea r4, [r4 + mmsize/2] + mov r0, r4 +.doNotAdjustBufferPtr: + + palignr m6, m2, m0, 4 + palignr m1, m3, m2, 4 + pmaddubsw m4, m6, [r3 + 5 * 32] ; [21] + pmulhrsw m4, m7 + pmaddubsw m8, m1, [r3 + 5 * 32] + pmulhrsw m8, m7 + 
packuswb m4, m8 + + pmaddubsw m5, m6, [r3 + 10 * 32] ; [26] + pmulhrsw m5, m7 + pmaddubsw m8, m1, [r3 + 10 * 32] + pmulhrsw m8, m7 + packuswb m5, m8 + + pmaddubsw m6, [r3 + 15 * 32] ; [31] + pmulhrsw m6, m7 + pmaddubsw m1, [r3 + 15 * 32] + pmulhrsw m1, m7 + packuswb m6, m1 + + palignr m12, m2, m0, 6 + palignr m1, m3, m2, 6 + pmaddubsw m8, m12, [r3 - 12 * 32] ; [4] + pmulhrsw m8, m7 + pmaddubsw m11, m1, [r3 - 12 * 32] + pmulhrsw m11, m7 + packuswb m8, m11 + + pmaddubsw m9, m12, [r3 - 7 * 32] ; [9] + pmulhrsw m9, m7 + pmaddubsw m11, m1, [r3 - 7 * 32] + pmulhrsw m11, m7 + packuswb m9, m11 + + pmaddubsw m10, m12, [r3 - 2 * 32] ; [14] + pmulhrsw m10, m7 + pmaddubsw m11, m1, [r3 - 2 * 32] + pmulhrsw m11, m7 + packuswb m10, m11 + + pmaddubsw m11, m12, [r3 + 3 * 32] ; [19] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 + 3 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + palignr m1, m3, m2, 6 + pmaddubsw m12, [r3 + 8 * 32] ; [24] + pmulhrsw m12, m7 + pmaddubsw m1, [r3 + 8 * 32] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 16 + + ; rows 24 to 31 + palignr m4, m2, m0, 6 + palignr m1, m3, m2, 6 + pmaddubsw m4, [r3 + 13 * 32] ; [29] + pmulhrsw m4, m7 + pmaddubsw m1, [r3 + 13 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + palignr m3, m2, 8 + palignr m2, m0, 8 + pmaddubsw m5, m2, [r3 - 14 * 32] ; [2] + pmulhrsw m5, m7 + pmaddubsw m9, m3, [r3 - 14 * 32] + pmulhrsw m9, m7 + packuswb m5, m9 + + pmaddubsw m6, m2, [r3 - 9 * 32] ; [7] + pmulhrsw m6, m7 + pmaddubsw m9, m3, [r3 - 9 * 32] + pmulhrsw m9, m7 + packuswb m6, m9 + + pmaddubsw m8, m2, [r3 - 4 * 32] ; [12] + pmulhrsw m8, m7 + pmaddubsw m1, m3, [r3 - 4 * 32] + pmulhrsw m1, m7 + packuswb m8, m1 + + pmaddubsw m9, m2, [r3 + 1 * 32] ; [17] + pmulhrsw m9, m7 + pmaddubsw m11, m3, [r3 + 1 * 32] + pmulhrsw m11, m7 + packuswb m9, m11 + + pmaddubsw m10, m2, [r3 + 6 * 32] ; [22] + pmulhrsw m10, m7 + pmaddubsw m1, m3, [r3 + 6 * 32] + pmulhrsw m1, m7 + packuswb m10, m1 + + pmaddubsw m2, [r3 + 11 * 32] ; [27] + pmulhrsw m2, m7 + pmaddubsw m3, [r3 + 11 * 32] + pmulhrsw m3, m7 + packuswb m2, m3 + + movu m3, [r2 + 6] ; [0] + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 2, 3, 0, 24 + ret + +INIT_YMM avx2 +cglobal intra_pred_ang32_8, 3,8,13 + add r2, 64 + lea r3, [ang_table_avx2 + 32 * 16] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + mov r4, r0 + xor r7d, r7d + + call ang32_mode_8_28_avx2 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang32_28, 3,8,13 + lea r3, [ang_table_avx2 + 32 * 16] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + xor r7d, r7d + inc r7d + + call ang32_mode_8_28_avx2 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang32_9, 3,5,8 + vbroadcasti128 m0, [angHor_tab_9] + vbroadcasti128 m1, [angHor_tab_9 + mmsize/2] + mova m2, [pw_1024] + mova m7, [ang32_shuf_mode9] + lea r3, [r1 * 3] + + vbroadcasti128 m3, [r2 + mmsize*2 + 1] + vbroadcasti128 m6, [r2 + mmsize*2 + 17] + + pshufb m5, m3, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m6, m3, 1 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m6, m3, 2 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1*2], m4 + + palignr m5, m6, m3, 3 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + 
packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m6, m3, 4 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m6, m3, 5 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m6, m3, 6 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1*2], m4 + + palignr m5, m6, m3, 7 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m6, m3, 8 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m6, m3, 9 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m6, m3, 10 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1*2], m4 + + palignr m5, m6, m3, 11 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m6, m3, 12 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m6, m3, 13 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m6, m3, 14 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1*2], m4 + + palignr m5, m6, m3, 15 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + vbroadcasti128 m3, [r2 + mmsize*2 + 33] + + pshufb m5, m6, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 1 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 2 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1*2], m4 + + palignr m5, m3, m6, 3 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 4 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 5 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 6 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1*2], m4 + + palignr m5, m3, m6, 7 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 8 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr 
m5, m3, m6, 9 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 10 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1*2], m4 + + palignr m5, m3, m6, 11 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 12 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 13 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 14 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1*2], m4 + + palignr m5, m3, m6, 15 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + RET + +cglobal intra_pred_ang32_27, 3,5,6 + lea r3, [ang_table_avx2 + 32 * 16] + lea r4, [r1 * 3] ; r4 -> 3 * stride + mova m5, [pw_1024] + + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + movu m3, [r2 + 17] ; [48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + movu m4, [r2 + 18] ; [49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m3, m4 ; [41 40 40 39 39 38 38 37 37 36 36 35 35 34 34 33 25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17] + + pmaddubsw m4, m0, [r3 - 14 * 32] ; [2] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 - 12 * 32] ; [4] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 - 10 * 32] ; [6] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m0, [r3 - 8 * 32] ; [8] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m0, [r3 - 6 * 32] ; [10] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 - 4 * 32] ; [12] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 - 2 * 32] ; [14] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m0, [r3] ; [16] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 8 to 15 + pmaddubsw m4, m0, [r3 + 2 * 32] ; [18] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 + 4 * 32] ; [20] 
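+    ; NOTE: each pmaddubsw above multiplies pairs of adjacent reference bytes by a
+    ; (32-f, f) weight pair from ang_table_avx2, and the pmulhrsw that follows, with
+    ; pw_1024, supplies the HEVC rounding: pmulhrsw(v, 1024) = (v*1024 + 0x4000) >> 15
+    ; = (v + 16) >> 5, so each output pixel is (ref[i]*(32-f) + ref[i+1]*f + 16) >> 5,
+    ; with f the per-row fraction recorded in the [N] comments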
+ pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 + 6 * 32] ; [22] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m0, [r3 + 8 * 32] ; [24] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m0, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 + 12 * 32] ; [28] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 + 14 * 32] ; [30] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m3, m2, 2 + palignr m2, m0, 2 + movu m1, [r2 + 2] ; [0] + movu [r0 + r4], m1 + + lea r0, [r0 + r1 * 4] + + ; rows 16 to 23 + pmaddubsw m4, m2, [r3 - 14 * 32] ; [2] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 - 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m2, [r3 - 12 * 32] ; [4] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 - 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m2, [r3 - 10 * 32] ; [6] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 - 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m2, [r3 - 8 * 32] ; [8] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 - 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m2, [r3 - 6 * 32] ; [10] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 - 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m2, [r3 - 4 * 32] ; [12] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 - 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m2, [r3 - 2 * 32] ; [14] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 - 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m2, [r3] ; [16] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 8 to 15 + pmaddubsw m4, m2, [r3 + 2 * 32] ; [18] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 + 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m2, [r3 + 4 * 32] ; [20] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 + 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m2, [r3 + 6 * 32] ; [22] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 + 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m2, [r3 + 8 * 32] ; [24] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 + 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m2, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 + 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m2, [r3 + 12 * 32] ; [28] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 + 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m2, [r3 + 14 * 32] ; [30] + pmulhrsw m2, m5 + pmaddubsw m3, [r3 + 14 * 32] + pmulhrsw m3, m5 + packuswb m2, m3 + movu [r0 + r1*2], m2 + + movu m1, [r2 + 3] ; [0] + movu [r0 + r4], m1 + RET + +cglobal intra_pred_ang32_10, 5,5,4 + pxor m0, m0 + mova m1, [pb_1] + lea r4, [r1 * 3] + + vbroadcasti128 m2, [r2 + 
mmsize*2 + 1] + + pshufb m3, m2, m0 + movu [r0], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1 * 2], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r4], m3 + + lea r0, [r0 + r1 * 4] + + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1 * 2], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r4], m3 + + lea r0, [r0 + r1 * 4] + + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1 * 2], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r4], m3 + + lea r0, [r0 + r1 * 4] + + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1 * 2], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r4], m3 + + lea r0, [r0 + r1 * 4] + pxor m0, m0 + vbroadcasti128 m2, [r2 + mmsize*2 + mmsize/2 + 1] + + pshufb m3, m2, m0 + movu [r0], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1 * 2], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r4], m3 + + lea r0, [r0 + r1 * 4] + + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1 * 2], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r4], m3 + + lea r0, [r0 + r1 * 4] + + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1 * 2], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r4], m3 + + lea r0, [r0 + r1 * 4] + + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r1 * 2], m3 + paddb m0, m1 + pshufb m3, m2, m0 + movu [r0 + r4], m3 + RET + +cglobal intra_pred_ang32_11, 3,4,8 + vbroadcasti128 m0, [angHor_tab_11] + vbroadcasti128 m1, [angHor_tab_11 + mmsize/2] + mova m2, [pw_1024] + mova m7, [ang32_shuf_mode11] + lea r3, [r1 * 3] + + ; prepare for [16 0 -1 -2 ...] 
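+    ; (the two pinsrb below splice the bytes at [r2] and [r2 + 16] in front of the
+    ; shifted row, supplying the references this negative-angle mode needs to the
+    ; left of index 0)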
+ movu xm3, [r2 + mmsize*2 - 1] + vbroadcasti128 m6, [r2 + mmsize*2 + 15] + + pinsrb xm3, [r2 + 0], 1 + pinsrb xm3, [r2 + 16], 0 + vinserti128 m3, m3, xm3, 1 ; [16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14] + + pshufb m5, m3, m7 ; [ 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 16 0 16 0 16 0 16 0 16 0 16 0 16 0 16 0] + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m6, m3, 1 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m6, m3, 2 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m6, m3, 3 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m6, m3, 4 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m6, m3, 5 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m6, m3, 6 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m6, m3, 7 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m6, m3, 8 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m6, m3, 9 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m6, m3, 10 + pshufb m5, m7 + + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m6, m3, 11 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m6, m3, 12 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m6, m3, 13 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m6, m3, 14 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m6, m3, 15 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + mova m3, m6 + vbroadcasti128 m6, [r2 + mmsize*2 + 15 + 16] + pshufb m5, m3, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m6, m3, 1 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m6, m3, 2 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m6, m3, 3 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + 
pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m6, m3, 4 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m6, m3, 5 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m6, m3, 6 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m6, m3, 7 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m6, m3, 8 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m6, m3, 9 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m6, m3, 10 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m6, m3, 11 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m6, m3, 12 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m6, m3, 13 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m6, m3, 14 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m6, m3, 15 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + RET + +cglobal intra_pred_ang32_25, 3,5,7 + lea r3, [ang_table_avx2 + 32 * 16] + lea r4, [r1 * 3] + mova m5, [pw_1024] + + ; rows 0 to 7 + movu m0, [r2 + 0] ; [31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] + movu m1, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + + pinsrb xm3, [r2], 15 + pinsrb xm3, [r2 + mmsize*2 + 16], 14 + + punpckhbw m2, m0, m1 ; [32 31 31 30 30 29 29 28 28 27 27 26 26 25 25 24 16 15 15 14 14 13 13 12 12 11 11 10 10 9 9 8] + punpcklbw m0, m1 ; [24 23 23 22 22 21 21 20 20 19 19 18 18 17 17 16 8 7 7 6 6 5 5 4 4 3 3 2 2 1 1 0] + vinserti128 m3, m3, xm2, 1 ; [16 15 15 14 14 13 13 12 12 11 11 10 10 9 9 8 0 16 x x x x x x x x x x x x x x] + + pmaddubsw m4, m0, [r3 + 14 * 32] ; [30] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 + 12 * 32] ; [28] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m0, [r3 + 8 * 32] ; [24] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m0, [r3 + 6 * 32] ; [22] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 6 * 32] 
+ pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 + 4 * 32] ; [20] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 + 2 * 32] ; [18] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m0, [r3] ; [16] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 8 to 15 + pmaddubsw m4, m0, [r3 - 2 * 32] ; [14] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 - 4 * 32] ; [12] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 - 6 * 32] ; [10] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m0, [r3 - 8 * 32] ; [8] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m0, [r3 - 10 * 32] ; [6] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 - 12 * 32] ; [4] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 - 14 * 32] ; [2] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + movu m1, [r2] ; [0] + movu [r0 + r4], m1 + + lea r0, [r0 + r1 * 4] + palignr m2, m0, 14 + palignr m0, m3, 14 + + ; rows 16 to 23 + pmaddubsw m4, m0, [r3 + 14 * 32] ; [30] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 + 12 * 32] ; [28] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m0, [r3 + 8 * 32] ; [24] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m0, [r3 + 6 * 32] ; [22] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 + 4 * 32] ; [20] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 + 2 * 32] ; [18] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m0, [r3] ; [16] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 24 to 31 + pmaddubsw m4, m0, [r3 - 2 * 32] ; [14] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 - 4 * 32] ; [12] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 - 6 * 32] ; [10] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + pmaddubsw m4, m0, [r3 - 8 * 32] ; [8] + pmulhrsw m4, m5 + pmaddubsw m1, 
m2, [r3 - 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m0, [r3 - 10 * 32] ; [6] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 - 12 * 32] ; [4] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m0, [r3 - 14 * 32] ; [2] + pmulhrsw m0, m5 + pmaddubsw m2, [r3 - 14 * 32] + pmulhrsw m2, m5 + packuswb m0, m2 + movu [r0 + r1*2], m0 + + movu m1, [r2 + 1] ; [0] + palignr m1, m3, 14 + movu [r0 + r4], m1 + RET + +cglobal intra_pred_ang32_12, 3,4,9 + movu m0, [ang32_fact_mode12] + movu m1, [ang32_fact_mode12 + mmsize] + mova m2, [pw_1024] + mova m7, [ang32_shuf_mode12] + mova m8, [ang32_shuf_mode12 + mmsize] + lea r3, [r1 * 3] + + ; prepare for [26, 19, 13, 6, 0, -1, -2....] + + movu xm4, [r2 + mmsize*2 - 4] + vbroadcasti128 m6, [r2 + mmsize*2 + 12] + + pinsrb xm4, [r2 + 0], 4 + pinsrb xm4, [r2 + 6], 3 + pinsrb xm4, [r2 + 13], 2 + pinsrb xm4, [r2 + 19], 1 + pinsrb xm4, [r2 + 26], 0 + vinserti128 m3, m4, xm4, 1 ; [26, 19, 13, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 26, 19, 13, 6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] + + pshufb m4, m3, m7 ; [ 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 6, 0, 6, 0, 13, 6, 13, 6, 13, 6, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13] + pshufb m5, m3, m8 ; [ 6, 0, 6, 0, 6, 0, 6, 0, 13, 6, 13, 6, 13, 6, 13, 6, 19, 13, 16, 19, 16, 19, 16, 19, 16, 19, 16, 19, 16, 19, 16, 19] + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m6, m3, 1 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m6, m3, 2 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m6, m3, 3 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m6, m3, 4 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m6, m3, 5 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m6, m3, 6 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m6, m3, 7 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m6, m3, 8 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m6, m3, 9 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m6, m3, 10 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m6, m3, 11 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + 
pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m6, m3, 12 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m6, m3, 13 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m6, m3, 14 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m6, m3, 15 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + mova m3, m6 + vbroadcasti128 m6, [r2 + mmsize*2 + 12 + 16] + + pshufb m4, m3, m7 + pshufb m5, m3, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m6, m3, 1 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m6, m3, 2 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m6, m3, 3 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m6, m3, 4 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m6, m3, 5 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m6, m3, 6 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m6, m3, 7 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m6, m3, 8 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m6, m3, 9 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m6, m3, 10 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m6, m3, 11 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m6, m3, 12 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m6, m3, 13 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m6, m3, 14 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 
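+    ; note (editorial addition): the per-column fractions for mode 12
+    ; (intraPredAngle -5) are baked into ang32_fact_mode12, so every row
+    ; reuses the same weight registers while palignr advances the reference
+    ; window by one sample per row; the projected offsets 6/13/19/26 prepared
+    ; earlier follow from invAngle -1638 via ((k * 1638) + 128) >> 8, k = 1..4.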
+ + palignr m4, m6, m3, 15 + pshufb m5, m4, m8 + pshufb m4, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + RET + +cglobal intra_pred_ang32_24, 3,5,8 + lea r3, [ang_table_avx2 + 32 * 16] + lea r4, [r1 * 3] + mova m5, [pw_1024] + + ; rows 0 to 7 + movu m0, [r2 + 0] + movu m1, [r2 + 1] + punpckhbw m2, m0, m1 + punpcklbw m0, m1 + + movu m4, [r2 + mmsize*2] + pshufb m4, [ang32_shuf_mode24] + mova m3, [ang32_shuf_mode24 + mmsize] + vpermd m4, m3, m4 ; [6 6 13 13 19 19 26 26 x x x...] + palignr m3, m0, m4, 1 + vinserti128 m3, m3, xm2, 1 + + pmaddubsw m4, m0, [r3 + 11 * 32] ; [27] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 11 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 + 6 * 32] ; [22] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 + 1 * 32] ; [17] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 1 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m0, [r3 - 4 * 32] ; [12] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m0, [r3 - 9 * 32] ; [7] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 9 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 - 14 * 32] ; [2] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m0, m3, 14 + palignr m7, m2, m0, 14 + + pmaddubsw m4, m6, [r3 + 13 * 32] ; [29] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 13 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 + 8 * 32] ; [24] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 8 to 15 + pmaddubsw m4, m6, [r3 + 3 * 32] ; [19] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 3 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 2 * 32] ; [14] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 7 * 32] ; [9] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 7 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 - 12 * 32] ; [4] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + palignr m6, m0, m3, 12 + palignr m7, m2, m0, 12 + + pmaddubsw m4, m6, [r3 + 15 * 32] ; [31] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 15 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 + 5 * 32] ; [21] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 5 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + pmaddubsw m4, m6, [r3] ; [16] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 16 to 23 + pmaddubsw m4, m6, [r3 - 5 * 32] ; [11] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 5 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 10 * 32] ; [6] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + 
r1], m4 + + pmaddubsw m4, m6, [r3 - 15 * 32] ; [1] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 15 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m0, m3, 10 + palignr m7, m2, m0, 10 + + pmaddubsw m4, m6, [r3 + 12 * 32] ; [28] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m6, [r3 + 7 * 32] ; [23] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 7 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 + 2 * 32] ; [18] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 3 * 32] ; [13] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 3 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 - 8 * 32] ; [8] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 24 to 31 + pmaddubsw m4, m6, [r3 - 13 * 32] ; [3] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 13 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m0, m3, 8 + palignr m7, m2, m0, 8 + + pmaddubsw m4, m6, [r3 + 14 * 32] ; [30] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 + 9 * 32] ; [25] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 9 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + pmaddubsw m4, m6, [r3 + 4 * 32] ; [20] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m6, [r3 - 1 * 32] ; [15] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 1 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 6 * 32] ; [10] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 11 * 32] ; [5] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 11 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pand m6, [pw_00ff] + pand m7, [pw_00ff] + packuswb m6, m7 + movu [r0 + r4], m6 + RET + +cglobal intra_pred_ang32_13, 3,4,9 + movu m0, [ang32_fact_mode13] + movu m1, [ang32_fact_mode13 + mmsize] + mova m2, [pw_1024] + mova m7, [ang32_shuf_mode13] + mova m8, [ang32_shuf_mode13 + mmsize] + lea r3, [r1 * 3] + + ; prepare for [28, 25, 21, 18, 14, 11, 7, 4, 0, -1, -2....] 
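+    ; note (editorial addition): mode 13 uses intraPredAngle -9 / invAngle -910,
+    ; so the projected left samples land at ((k * 910) + 128) >> 8 = 4, 7, 11,
+    ; 14, 18, 21, 25, 28 for k = 1..8 -- the offsets gathered just below.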
+ + movu m6, [r2] + pshufb m6, [ang32_shuf_mode13 + mmsize*2] + mova m3, [ang32_shuf_mode24 + mmsize*1] + vpermd m6, m3, m6 + palignr m6, m6, 1 + vbroadcasti128 m3, [r2 + mmsize*2 + 1] + + palignr m5, m3, m6, 1 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 2 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 3 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 4 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 5 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 6 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 7 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 8 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 9 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 10 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 11 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 12 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 13 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 14 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 15 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + pshufb m4, m3, m7 + pshufb m5, m3, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + mova m6, m3 + vbroadcasti128 m3, [r2 + mmsize*2 + 17] + palignr m5, m3, m6, 1 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 2 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 3 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + 
pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 4 + pshufb m4, m5, m7 + pshufb m5, m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 5 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 6 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 7 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 8 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 9 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 10 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 11 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 12 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 13 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 14 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 15 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + pshufb m4, m3, m7 + pshufb m5, m3, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + RET + +cglobal intra_pred_ang32_23, 3,5,8 + lea r3, [ang_table_avx2 + 32 * 16] + lea r4, [r1 * 3] + mova m5, [pw_1024] + + ; rows 0 to 7 + movu m0, [r2 + 0] + movu m1, [r2 + 1] + punpckhbw m2, m0, m1 + punpcklbw m0, m1 + + movu m4, [r2 + mmsize*2] + pshufb m4, [ang32_shuf_mode23] + vpermq m4, m4, q1313 + palignr m3, m0, m4, 1 + vinserti128 m3, m3, xm2, 1 + + pmaddubsw m4, m0, [r3 + 7 * 32] ; [23] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 7 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 - 2 * 32] ; [14] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m0, [r3 - 11 * 32] ; [5] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 11 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m0, m3, 14 + palignr m7, m2, m0, 14 + + pmaddubsw m4, m6, [r3 + 12 * 32] ; [28] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m6, [r3 + 3 * 32] ; [19] + pmulhrsw m4, m5 + pmaddubsw 
m1, m7, [r3 + 3 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 6 * 32] ; [10] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 15 * 32] ; [1] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 15 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m0, m3, 12 + palignr m7, m2, m0, 12 + + pmaddubsw m4, m6, [r3 + 8 * 32] ; [24] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 8 to 15 + pmaddubsw m4, m6, [r3 - 1 * 32] ; [15] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 1 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 10 * 32] ; [6] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m0, m3, 10 + palignr m7, m2, m0, 10 + + pmaddubsw m4, m6, [r3 + 13 * 32] ; [29] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 13 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 + 4 * 32] ; [20] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m6, [r3 - 5 * 32] ; [11] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 5 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 14 * 32] ; [2] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m0, m3, 8 + palignr m7, m2, m0, 8 + + pmaddubsw m4, m6, [r3 + 9 * 32] ; [25] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 9 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + pmaddubsw m4, m6, [r3] ; [16] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 16 to 23 + pmaddubsw m4, m6, [r3 - 9 * 32] ; [7] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 9 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m0, m3, 6 + palignr m7, m2, m0, 6 + + pmaddubsw m4, m6, [r3 + 14 * 32] ; [30] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 + 5 * 32] ; [21] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 5 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 - 4 * 32] ; [12] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m6, [r3 - 13 * 32] ; [3] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 13 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m0, m3, 4 + palignr m7, m2, m0, 4 + pmaddubsw m4, m6, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 + 1 * 32] ; [17] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 1 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 - 8 * 32] ; [8] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 24 to 31 + palignr m6, m0, m3, 2 + palignr m7, m2, m0, 2 + pmaddubsw m4, m6, [r3 + 15 * 32] ; [31] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 15 * 32] + pmulhrsw m1, m5 + packuswb 
m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 + 6 * 32] ; [22] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 3 * 32] ; [13] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 3 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + pmaddubsw m4, m6, [r3 - 12 * 32] ; [4] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m3, [r3 + 11 * 32] ; [27] + pmulhrsw m4, m5 + pmaddubsw m1, m0, [r3 + 11 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m3, [r3 + 2 * 32] ; [18] + pmulhrsw m4, m5 + pmaddubsw m1, m0, [r3 + 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m3, [r3 - 7 * 32] ; [9] + pmulhrsw m4, m5 + pmaddubsw m1, m0, [r3 - 7 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pand m3, [pw_00ff] + pand m0, [pw_00ff] + packuswb m3, m0 + movu [r0 + r4], m3 + RET + +cglobal intra_pred_ang32_14, 3,4,9 + movu m0, [ang32_fact_mode14] + movu m1, [ang32_fact_mode14 + mmsize] + mova m2, [pw_1024] + mova m7, [ang32_shuf_mode14] + mova m8, [ang32_shuf_mode14 + mmsize] + lea r3, [r1 * 3] + + ; prepare for [30, 27, 25, 22, 20, 17, 15, 12, 10, 7, 5, 2, 0, -1, -2...] + + movu m6, [r2] + pshufb m6, [ang32_shuf_mode14 + mmsize*2] + vpermq m6, m6, 01110111b + pslldq m6, m6, 1 + vbroadcasti128 m3, [r2 + mmsize*2 + 1] + + palignr m5, m3, m6, 1 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 2 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 3 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 4 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 5 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 6 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 7 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 8 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 9 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 10 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 11 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 12 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, 
m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 13 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 14 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 15 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + pshufb m4, m3, m7 + pshufb m5, m3, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + mova m6, m3 + vbroadcasti128 m3, [r2 + mmsize*2 + 17] + palignr m5, m3, m6, 1 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 2 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 3 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 4 + pshufb m4, m5, m7 + pshufb m5, m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 5 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 6 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 7 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 8 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 9 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 10 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 11 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 12 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m5, m3, m6, 13 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m5, m3, m6, 14 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 15 + pshufb m4, m5, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + 
packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + pshufb m4, m3, m7 + pshufb m5, m3, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + RET + +cglobal intra_pred_ang32_22, 3,5,9 + lea r3, [ang_table_avx2 + 32 * 16] + lea r4, [r1 * 3] + mova m5, [pw_1024] + + ; rows 0 to 7 + movu m0, [r2 + 0] + movu m1, [r2 + 1] + punpckhbw m2, m0, m1 + punpcklbw m0, m1 + + movu m4, [r2 + mmsize*2 + 2] + pshufb m4, [ang32_shuf_mode22] + vextracti128 xm8, m4, 1 + + palignr m3, m0, m4, 2 + palignr m3, m8, 15 + vinserti128 m3, m3, xm2, 1 + vinserti128 m8, m8, xm0, 1 + + pmaddubsw m4, m0, [r3 + 3 * 32] ; [19] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 + 3 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m0, [r3 - 10 * 32] ; [6] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m0, m3, 14 + palignr m7, m2, m0, 14 + + pmaddubsw m4, m6, [r3 + 9 * 32] ; [25] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 9 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 - 4 * 32] ; [12] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + palignr m6, m0, m3, 12 + palignr m7, m2, m0, 12 + + pmaddubsw m4, m6, [r3 + 15 * 32] ; [31] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 15 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 + 2 * 32] ; [18] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 11 * 32] ; [5] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 11 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m0, m3, 10 + palignr m7, m2, m0, 10 + + pmaddubsw m4, m6, [r3 + 8 * 32] ; [24] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 8 to 15 + pmaddubsw m4, m6, [r3 - 5 * 32] ; [11] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 5 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m0, m3, 8 + palignr m7, m2, m0, 8 + + pmaddubsw m4, m6, [r3 + 14 * 32] ; [30] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 + 1 * 32] ; [17] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 1 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 - 12 * 32] ; [4] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + palignr m6, m0, m3, 6 + palignr m7, m2, m0, 6 + + pmaddubsw m4, m6, [r3 + 7 * 32] ; [23] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 7 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 6 * 32] ; [10] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m0, m3, 4 + palignr m7, m2, m0, 4 + + pmaddubsw m4, m6, [r3 + 13 * 32] ; [29] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 13 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + pmaddubsw m4, m6, [r3] ; [16] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 16 to 23 + pmaddubsw m4, m6, [r3 - 13 * 32] ; [3] + pmulhrsw m4, m5 + pmaddubsw m1, m7, 
[r3 - 13 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m0, m3, 2 + palignr m7, m2, m0, 2 + + pmaddubsw m4, m6, [r3 + 6 * 32] ; [22] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 7 * 32] ; [9] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 7 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m3, [r3 + 12 * 32] ; [28] + pmulhrsw m4, m5 + pmaddubsw m1, m0, [r3 + 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m3, [r3 - 1 * 32] ; [15] + pmulhrsw m4, m5 + pmaddubsw m1, m0, [r3 - 1 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m3, [r3 - 14 * 32] ; [2] + pmulhrsw m4, m5 + pmaddubsw m1, m0, [r3 - 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m3, m8, 14 + palignr m7, m0, m3, 14 + + pmaddubsw m4, m6, [r3 + 5 * 32] ; [21] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 5 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 - 8 * 32] ; [8] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 24 to 31 + palignr m6, m3, m8, 12 + palignr m7, m0, m3, 12 + pmaddubsw m4, m6, [r3 + 11 * 32] ; [27] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 11 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 2 * 32] ; [14] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 15 * 32] ; [1] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 15 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + palignr m6, m3, m8, 10 + palignr m7, m0, m3, 10 + pmaddubsw m4, m6, [r3 + 4 * 32] ; [20] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m6, [r3 - 9 * 32] ; [7] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 9 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m0, m3, 8 + palignr m3, m8, 8 + pmaddubsw m4, m3, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m5 + pmaddubsw m1, m0, [r3 + 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m3, [r3 - 3 * 32] ; [13] + pmulhrsw m4, m5 + pmaddubsw m1, m0, [r3 - 3 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pand m3, [pw_00ff] + pand m0, [pw_00ff] + packuswb m3, m0 + movu [r0 + r4], m3 + RET + +cglobal intra_pred_ang32_15, 3,4,9 + movu m0, [ang32_fact_mode15] + movu m1, [ang32_fact_mode15 + mmsize] + mova m2, [pw_1024] + mova m7, [ang32_shuf_mode15] + mova m8, [ang32_shuf_mode15 + mmsize] + lea r3, [r1 * 3] + + ; prepare for [30, 28, 26, 24, 23, 21, 19, 17, 15, 13, 11, 9, 8, 6, 4, 2, 0, -1, -2...] 
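+    ; note (editorial addition): mode 15 uses intraPredAngle -17 / invAngle -482,
+    ; giving projected offsets ((k * 482) + 128) >> 8 = 2, 4, 6, 8, 9, 11, 13,
+    ; 15, 17, 19, 21, 23, 24, 26, 28, 30 for k = 1..16, as listed above.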
+ + movu m6, [r2] + pshufb m6, [ang32_shuf_mode15 + mmsize*2] + vpermq m6, m6, 01110111b + + movu xm3, [r2 + mmsize*2] + pinsrb xm3, [r2], 0 + vpermq m3, m3, 01000100b + + palignr m4, m3, m6, 2 + pshufb m4, m7 + pshufb m5, m6, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m3, m6, 3 + pshufb m4, m7 + palignr m5, m3, m6, 1 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 4 + pshufb m4, m7 + palignr m5, m3, m6, 2 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m3, m6, 5 + pshufb m4, m7 + palignr m5, m3, m6, 3 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 6 + pshufb m4, m7 + palignr m5, m3, m6, 4 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m3, m6, 7 + pshufb m4, m7 + palignr m5, m3, m6, 5 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 8 + pshufb m4, m7 + palignr m5, m3, m6, 6 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m3, m6, 9 + pshufb m4, m7 + palignr m5, m3, m6, 7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 10 + pshufb m4, m7 + palignr m5, m3, m6, 8 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m3, m6, 11 + pshufb m4, m7 + palignr m5, m3, m6, 9 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 12 + pshufb m4, m7 + palignr m5, m3, m6, 10 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m3, m6, 13 + pshufb m4, m7 + palignr m5, m3, m6, 11 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 14 + pshufb m4, m7 + palignr m5, m3, m6, 12 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m3, m6, 15 + pshufb m4, m7 + palignr m5, m3, m6, 13 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + pshufb m4, m3, m7 + palignr m5, m3, m6, 14 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 15 + mova m6, m3 + vbroadcasti128 m3, [r2 + mmsize*2 + 16] + + palignr m4, m3, m6, 1 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 2 + pshufb m4, m7 + pshufb m5, m6, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 
+ packuswb m4, m5 + movu [r0], m4 + + palignr m4, m3, m6, 3 + pshufb m4, m7 + palignr m5, m3, m6, 1 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 4 + pshufb m4, m7 + palignr m5, m3, m6, 2 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m3, m6, 5 + pshufb m4, m7 + palignr m5, m3, m6, 3 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 6 + pshufb m4, m7 + palignr m5, m3, m6, 4 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m3, m6, 7 + pshufb m4, m7 + palignr m5, m3, m6, 5 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 8 + pshufb m4, m7 + palignr m5, m3, m6, 6 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m3, m6, 9 + pshufb m4, m7 + palignr m5, m3, m6, 7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 10 + pshufb m4, m7 + palignr m5, m3, m6, 8 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m3, m6, 11 + pshufb m4, m7 + palignr m5, m3, m6, 9 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 12 + pshufb m4, m7 + palignr m5, m3, m6, 10 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m4, m3, m6, 13 + pshufb m4, m7 + palignr m5, m3, m6, 11 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 14 + pshufb m4, m7 + palignr m5, m3, m6, 12 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], m4 + + palignr m4, m3, m6, 15 + pshufb m4, m7 + palignr m5, m3, m6, 13 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1], m4 + + pshufb m4, m3, m7 + palignr m5, m3, m6, 14 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 15 + vbroadcasti128 m6, [r2 + mmsize*2 + 32] + + palignr m4, m6, m3, 1 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r3], m4 + RET + +cglobal intra_pred_ang32_21, 3,5,9 + lea r3, [ang_table_avx2 + 32 * 16] + lea r4, [r1 * 3] + mova m5, [pw_1024] + + ; rows 0 to 7 + movu m0, [r2 + 0] + movu m1, [r2 + 1] + punpckhbw m2, m0, m1 + punpcklbw m0, m1 + + movu m4, [r2 + mmsize*2] + pshufb m4, [ang32_shuf_mode21] + vextracti128 xm6, m4, 1 + + palignr m3, m0, m4, 1 + palignr m8, m3, m6, 1 + vinserti128 m3, m3, xm2, 1 + vinserti128 m8, m8, xm0, 1 + + pmaddubsw m4, m0, [r3 - 1 * 32] ; [15] + pmulhrsw m4, m5 + 
pmaddubsw m1, m2, [r3 - 1 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m0, m3, 14 + palignr m7, m2, m0, 14 + pmaddubsw m4, m6, [r3 + 14 * 32] ; [30] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 3 * 32] ; [13] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 3 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m0, m3, 12 + palignr m7, m2, m0, 12 + pmaddubsw m4, m6, [r3 + 12 * 32] ; [28] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m6, [r3 - 5 * 32] ; [11] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 5 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m0, m3, 10 + palignr m7, m2, m0, 10 + pmaddubsw m4, m6, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 7 * 32] ; [9] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 7 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m0, m3, 8 + palignr m7, m2, m0, 8 + + pmaddubsw m4, m6, [r3 + 8 * 32] ; [24] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 8 to 15 + pmaddubsw m4, m6, [r3 - 9 * 32] ; [7] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 9 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m0, m3, 6 + palignr m7, m2, m0, 6 + pmaddubsw m4, m6, [r3 + 6 * 32] ; [22] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 11 * 32] ; [5] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 11 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m0, m3, 4 + palignr m7, m2, m0, 4 + pmaddubsw m4, m6, [r3 + 4 * 32] ; [20] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m6, [r3 - 13 * 32] ; [3] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 13 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m0, m3, 2 + palignr m7, m2, m0, 2 + pmaddubsw m4, m6, [r3 + 2 * 32] ; [18] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 15 * 32] ; [1] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 15 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + pmaddubsw m4, m3, [r3] ; [16] + pmulhrsw m4, m5 + pmaddubsw m1, m0, [r3] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 16 to 23 + palignr m6, m3, m8, 14 + palignr m7, m0, m3, 14 + pmaddubsw m4, m6, [r3 + 15 * 32] ; [31] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 15 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 2 * 32] ; [14] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m3, m8, 12 + palignr m7, m0, m3, 12 + pmaddubsw m4, m6, [r3 + 13 * 32] ; [29] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 13 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 - 4 * 32] ; [12] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, 
[r0 + r1 * 4] + + palignr m6, m3, m8, 10 + palignr m7, m0, m3, 10 + pmaddubsw m4, m6, [r3 + 11 * 32] ; [27] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 11 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 6 * 32] ; [10] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m3, m8, 8 + palignr m7, m0, m3, 8 + pmaddubsw m4, m6, [r3 + 9 * 32] ; [25] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 9 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 - 8 * 32] ; [8] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 24 to 31 + palignr m6, m3, m8, 6 + palignr m7, m0, m3, 6 + pmaddubsw m4, m6, [r3 + 7 * 32] ; [23] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 7 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 10 * 32] ; [6] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m3, m8, 4 + palignr m7, m0, m3, 4 + pmaddubsw m4, m6, [r3 + 5 * 32] ; [21] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 5 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + pmaddubsw m4, m6, [r3 - 12 * 32] ; [4] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + palignr m6, m3, m8, 2 + palignr m7, m0, m3, 2 + pmaddubsw m4, m6, [r3 + 3 * 32] ; [19] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 3 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 14 * 32] ; [2] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m8, [r3 + 1 * 32] ; [17] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 + 1 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pand m8, [pw_00ff] + pand m3, [pw_00ff] + packuswb m8, m3 + movu [r0 + r4], m8 + RET + +cglobal intra_pred_ang32_16, 3,4,10 + movu m0, [ang32_fact_mode16] + movu m1, [ang32_fact_mode16 + mmsize] + mova m2, [pw_1024] + mova m7, [ang32_shuf_mode16] + mova m8, [ang32_shuf_mode16 + mmsize] + lea r3, [r1 * 3] + + ; prepare for [30, 29, 27, 26, 24, 23, 21, 20, 18, 17, 15, 14, 12, 11, 9, 8, 6, 5, 3, 2, 0, -1, -2...] 
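; Sketch of the intent of the setup below (an assumption read off the offset
; list above): mode 16 predicts at a negative angle, so samples from the second
; reference array at [r2] must be projected onto the main one with the inverse
; angle before filtering.  The listed offsets are the source positions of those
; projected samples; the pshufb/vpermd/palignr chain that follows is a
; vectorised gather of that list into m6/m9, standing in for the scalar
; projection loop of the C reference model.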
+ + movu m6, [r2] + pshufb m6, [ang32_shuf_mode16 + mmsize*2] + mova m9, m6 + mova m3, [ang32_shuf_mode16 + mmsize*3] + vpermd m6, m3, m6 + vpermq m9, m9, q3232 + pslldq m9, 4 + palignr m6, m9, 15 + pslldq m9, 1 + + vbroadcasti128 m3, [r2 + mmsize*2 + 1] + + palignr m4, m3, m6, 1 + palignr m5, m6, m9, 6 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m3, m6, 2 + palignr m5, m6, m9, 7 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 3 + palignr m5, m6, m9, 8 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m3, m6, 4 + palignr m5, m6, m9, 9 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 5 + palignr m5, m6, m9, 10 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m3, m6, 6 + palignr m5, m6, m9, 11 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 7 + palignr m5, m6, m9, 12 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m3, m6, 8 + palignr m5, m6, m9, 13 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 9 + palignr m5, m6, m9, 14 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m3, m6, 10 + palignr m5, m6, m9, 15 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 11 + pshufb m4, m7 + pshufb m5, m6, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m3, m6, 12 + palignr m5, m3, m6, 1 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 13 + palignr m5, m3, m6, 2 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m3, m6, 14 + palignr m5, m3, m6, 3 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 15 + palignr m5, m3, m6, 4 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq 
m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m5, m3, m6, 5 + pshufb m4, m3, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + vbroadcasti128 m9, [r2 + mmsize*2 + 17] + + palignr m4, m9, m3, 1 + palignr m5, m3, m6, 6 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m9, m3, 2 + palignr m5, m3, m6, 7 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m9, m3, 3 + palignr m5, m3, m6, 8 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m9, m3, 4 + palignr m5, m3, m6, 9 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m9, m3, 5 + palignr m5, m3, m6, 10 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m9, m3, 6 + palignr m5, m3, m6, 11 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m9, m3, 7 + palignr m5, m3, m6, 12 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m9, m3, 8 + palignr m5, m3, m6, 13 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m9, m3, 9 + palignr m5, m3, m6, 14 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m9, m3, 10 + palignr m5, m3, m6, 15 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m9, m3, 11 + pshufb m4, m7 + pshufb m5, m3, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m9, m3, 12 + palignr m5, m9, m3, 1 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m9, m3, 13 + palignr m5, m9, m3, 2 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m9, m3, 14 + palignr m5, m9, m3, 3 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m9, m3, 15 + palignr m5, m9, m3, 4 + pshufb m4, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw 
m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m5, m9, m3, 5 + pshufb m4, m9, m7 + pshufb m5, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + RET + +cglobal intra_pred_ang32_20, 3,5,10 + lea r3, [ang_table_avx2 + 32 * 16] + lea r4, [r1 * 3] + mova m5, [pw_1024] + + ; rows 0 to 7 + movu m0, [r2 + 0] + movu m1, [r2 + 1] + punpckhbw m2, m0, m1 + punpcklbw m0, m1 + + movu m4, [r2 + mmsize*2] + pshufb m4, [ang32_shuf_mode20] + mova m9, m4 + vpermq m9, m9, q3333 + mova m7, m4 + vpermq m7, m7, q1111 + palignr m4, m7, 14 + pshufb m4, [ang32_shuf_mode20 + mmsize*1] + + vextracti128 xm6, m4, 1 + palignr m3, m0, m4, 1 + palignr m8, m3, m6, 1 + vinserti128 m3, m3, xm2, 1 + vinserti128 m8, m8, xm0, 1 + vinserti128 m9, m9, xm3, 1 + + pmaddubsw m4, m0, [r3 - 5 * 32] ; [11] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 5 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m0, m3, 14 + palignr m7, m2, m0, 14 + pmaddubsw m4, m6, [r3 + 6 * 32] ; [22] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 15 * 32] ; [1] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 15 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m0, m3, 12 + palignr m7, m2, m0, 12 + pmaddubsw m4, m6, [r3 - 4 * 32] ; [12] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + palignr m6, m0, m3, 10 + palignr m7, m2, m0, 10 + pmaddubsw m4, m6, [r3 + 7 * 32] ; [23] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 7 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 14 * 32] ; [2] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m0, m3, 8 + palignr m7, m2, m0, 8 + pmaddubsw m4, m6, [r3 - 3 * 32] ; [13] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 3 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m0, m3, 6 + palignr m7, m2, m0, 6 + pmaddubsw m4, m6, [r3 + 8 * 32] ; [24] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 8 to 15 + pmaddubsw m4, m6, [r3 - 13 * 32] ; [3] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 13 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m0, m3, 4 + palignr m7, m2, m0, 4 + pmaddubsw m4, m6, [r3 - 2 * 32] ; [14] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m0, m3, 2 + palignr m7, m2, m0, 2 + pmaddubsw m4, m6, [r3 + 9 * 32] ; [25] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 9 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 - 12 * 32] ; [4] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m3, [r3 - 1 * 32] ; [15] + pmulhrsw m4, m5 + pmaddubsw m1, m0, [r3 - 1 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m3, m8, 14 + palignr m7, m0, m3, 14 + pmaddubsw m4, m6, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 11 * 32] ; [5] + pmulhrsw m4, m5 + pmaddubsw 
m1, m7, [r3 - 11 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + palignr m6, m3, m8, 12 + palignr m7, m0, m3, 12 + pmaddubsw m4, m6, [r3] ; [16] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 16 to 23 + palignr m6, m3, m8, 10 + palignr m7, m0, m3, 10 + pmaddubsw m4, m6, [r3 + 11 * 32] ; [27] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 11 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 10 * 32] ; [6] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m3, m8, 8 + palignr m7, m0, m3, 8 + pmaddubsw m4, m6, [r3 + 1 * 32] ; [17] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 1 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m3, m8, 6 + palignr m7, m0, m3, 6 + pmaddubsw m4, m6, [r3 + 12 * 32] ; [28] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + pmaddubsw m4, m6, [r3 - 9 * 32] ; [7] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 9 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m3, m8, 4 + palignr m7, m0, m3, 4 + pmaddubsw m4, m6, [r3 + 2 * 32] ; [18] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m3, m8, 2 + palignr m7, m0, m3, 2 + pmaddubsw m4, m6, [r3 + 13 * 32] ; [29] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 13 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m6, [r3 - 8 * 32] ; [8] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 24 to 31 + pmaddubsw m4, m8, [r3 + 3 * 32] ; [19] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 + 3 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m8, m9, 14 + palignr m7, m3, m8, 14 + pmaddubsw m4, m6, [r3 + 14 * 32] ; [30] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 7 * 32] ; [9] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 7 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + palignr m6, m8, m9, 12 + palignr m7, m3, m8, 12 + pmaddubsw m4, m6, [r3 + 4 * 32] ; [20] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + palignr m6, m8, m9, 10 + palignr m7, m3, m8, 10 + pmaddubsw m4, m6, [r3 + 15 * 32] ; [31] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 15 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 6 * 32] ; [10] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m8, m9, 8 + palignr m7, m3, m8, 8 + pmaddubsw m4, m6, [r3 + 5 * 32] ; [21] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 5 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pand m6, [pw_00ff] + pand m7, [pw_00ff] + packuswb m6, m7 + movu [r0 + r4], m6 + RET + +cglobal intra_pred_ang32_17, 3,4,8 + movu m0, [ang32_fact_mode17] + mova m2, [pw_1024] + mova m7, [ang32_shuf_mode17] + lea r3, [r1 * 3] + + ; prepare for [31, 30, 28, 27, 26, 25, 23, 22, 21, 20, 18, 17, 16, 15, 14, 12, 11, 10, 9, 7, 6, 5, 4, 2, 1, 0, -1, -2...] 
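; Same projection idea as in mode 16 above, with mode 17's own offset table.
; One difference worth noting: a single factor row (ang32_fact_mode17) and one
; shuffle mask serve both 16-pixel halves here, presumably because mode 17's
; per-column fractions repeat with period 16 ((16 * 26) & 31 == 0), whereas
; mode 16's do not ((16 * 21) & 31 == 16) and so needed two factor rows.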
+ + movu m6, [r2] + pshufb m6, [ang32_shuf_mode17 + mmsize] + mova m1, m6 + mova m3, [ang32_shuf_mode16 + mmsize*3] + vpermd m6, m3, m6 + vpermq m1, m1, q3232 + pslldq m1, 4 + + movu xm4, [r2 + mmsize*2] + pinsrb xm4, [r2], 0 + vinserti128 m3, m4, xm4, 1 + + palignr m4, m3, m6, 2 + palignr m5, m6, m1, 5 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m3, m6, 3 + palignr m5, m6, m1, 6 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 4 + palignr m5, m6, m1, 7 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m3, m6, 5 + palignr m5, m6, m1, 8 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 6 + palignr m5, m6, m1, 9 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m3, m6, 7 + palignr m5, m6, m1, 10 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 8 + palignr m5, m6, m1, 11 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m3, m6, 9 + palignr m5, m6, m1, 12 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 10 + palignr m5, m6, m1, 13 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m3, m6, 11 + palignr m5, m6, m1, 14 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m3, m6, 12 + palignr m5, m6, m1, 15 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m3, m6, 13 + pshufb m4, m7 + pshufb m5, m6, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m3, m6, 14 + palignr m5, m3, m6, 1 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m3, m6, 15 + palignr m5, m3, m6, 2 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m5, m3, m6, 3 + pshufb m4, m3, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu 
[r0 + r1 * 2], m4 + + vbroadcasti128 m1, [r2 + mmsize*2 + 16] + palignr m4, m1, m3, 1 + palignr m5, m3, m6, 4 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m1, m3, 2 + palignr m5, m3, m6, 5 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m1, m3, 3 + palignr m5, m3, m6, 6 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m1, m3, 4 + palignr m5, m3, m6, 7 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m1, m3, 5 + palignr m5, m3, m6, 8 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m1, m3, 6 + palignr m5, m3, m6, 9 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m1, m3, 7 + palignr m5, m3, m6, 10 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m1, m3, 8 + palignr m5, m3, m6, 11 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m1, m3, 9 + palignr m5, m3, m6, 12 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m1, m3, 10 + palignr m5, m3, m6, 13 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m1, m3, 11 + palignr m5, m3, m6, 14 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + palignr m4, m1, m3, 12 + palignr m5, m3, m6, 15 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m1, m3, 13 + pshufb m4, m7 + pshufb m5, m3, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + + lea r0, [r0 + r1 * 4] + + palignr m4, m1, m3, 14 + palignr m5, m1, m3, 1 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0], m4 + + palignr m4, m1, m3, 15 + palignr m5, m1, m3, 2 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1], m4 + + vbroadcasti128 m6, [r2 + mmsize*2 + mmsize] + palignr m5, m1, m3, 3 + pshufb m4, m1, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + 
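; On the recurring "vpermq m4, m4, q3120" in these rows: packuswb on a ymm
; register packs within each 128-bit lane, so the packed bytes land qword-wise
; as (m4.lo, m5.lo, m4.hi, m5.hi).  Selecting qwords 0,2,1,3 restores the
; linear 32-pixel row order (all of m4's results, then all of m5's) before the
; movu store.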
pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r1 * 2], m4 + + palignr m4, m6, m1, 1 + palignr m5, m1, m3, 4 + pshufb m4, m7 + pshufb m5, m7 + pmaddubsw m4, m0 + pmaddubsw m5, m0 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + vpermq m4, m4, q3120 + movu [r0 + r3], m4 + RET + +cglobal intra_pred_ang32_19, 3,5,10 + lea r3, [ang_table_avx2 + 32 * 16] + lea r4, [r1 * 3] + mova m5, [pw_1024] + + ; rows 0 to 7 + movu m0, [r2 + 0] + movu m1, [r2 + 1] + punpckhbw m2, m0, m1 + punpcklbw m0, m1 + + movu m4, [r2 + mmsize*2] + pshufb m4, [ang32_shuf_mode17 + mmsize*1] + mova m3, [ang32_shuf_mode19 + mmsize*1] + mova m6, [ang32_shuf_mode19 + mmsize*2] + mova m9, m4 + vpermd m4, m3, m4 + vpermd m9, m6, m9 + pshufb m4, [ang32_shuf_mode19] + pshufb m9, [ang32_shuf_mode19] + + vextracti128 xm6, m4, 1 + palignr m3, m0, m4, 1 + palignr m8, m3, m6, 1 + palignr m7, m8, m9, 1 + vinserti128 m3, m3, xm2, 1 + vinserti128 m8, m8, xm0, 1 + vinserti128 m9, m7, xm3, 1 + + pmaddubsw m4, m0, [r3 - 10 * 32] ; [6] + pmulhrsw m4, m5 + pmaddubsw m1, m2, [r3 - 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m0, m3, 14 + palignr m7, m2, m0, 14 + pmaddubsw m4, m6, [r3 - 4 * 32] ; [12] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m0, m3, 12 + palignr m7, m2, m0, 12 + pmaddubsw m4, m6, [r3 + 2 * 32] ; [18] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m0, m3, 10 + palignr m7, m2, m0, 10 + pmaddubsw m4, m6, [r3 + 8 * 32] ; [24] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + palignr m6, m0, m3, 8 + palignr m7, m2, m0, 8 + pmaddubsw m4, m6, [r3 + 14 * 32] ; [30] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 12 * 32] ; [4] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m0, m3, 6 + palignr m7, m2, m0, 6 + pmaddubsw m4, m6, [r3 - 6 * 32] ; [10] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m0, m3, 4 + palignr m7, m2, m0, 4 + pmaddubsw m4, m6, [r3] ; [16] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 8 to 15 + palignr m6, m0, m3, 2 + palignr m7, m2, m0, 2 + pmaddubsw m4, m6, [r3 + 6 * 32] ; [22] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m3, [r3 + 12 * 32] ; [28] + pmulhrsw m4, m5 + pmaddubsw m1, m0, [r3 + 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m3, [r3 - 14 * 32] ; [2] + pmulhrsw m4, m5 + pmaddubsw m1, m0, [r3 - 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m3, m8, 14 + palignr m7, m0, m3, 14 + pmaddubsw m4, m6, [r3 - 8 * 32] ; [8] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + palignr m6, m3, m8, 12 + palignr m7, m0, m3, 12 + pmaddubsw m4, m6, [r3 - 2 * 32] ; [14] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m3, m8, 10 + palignr m7, m0, 
m3, 10 + pmaddubsw m4, m6, [r3 + 4 * 32] ; [20] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m3, m8, 8 + palignr m7, m0, m3, 8 + pmaddubsw m4, m6, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + pand m6, [pw_00ff] + pand m7, [pw_00ff] + packuswb m6, m7 + movu [r0 + r4], m6 + + lea r0, [r0 + r1 * 4] + + ; rows 16 to 23 + palignr m6, m3, m8, 6 + palignr m7, m0, m3, 6 + pmaddubsw m4, m6, [r3 - 10 * 32] ; [6] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m3, m8, 4 + palignr m7, m0, m3, 4 + pmaddubsw m4, m6, [r3 - 4 * 32] ; [12] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m3, m8, 2 + palignr m7, m0, m3, 2 + pmaddubsw m4, m6, [r3 + 2 * 32] ; [18] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + pmaddubsw m4, m8, [r3 + 8 * 32] ; [24] + pmulhrsw m4, m5 + pmaddubsw m1, m3, [r3 + 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + palignr m6, m8, m9, 14 + palignr m7, m3, m8, 14 + pmaddubsw m4, m6, [r3 + 14 * 32] ; [30] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m6, [r3 - 12 * 32] ; [4] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m8, m9, 12 + palignr m7, m3, m8, 12 + pmaddubsw m4, m6, [r3 - 6 * 32] ; [10] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m8, m9, 10 + palignr m7, m3, m8, 10 + pmaddubsw m4, m6, [r3] ; [16] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + ; rows 24 to 31 + palignr m6, m8, m9, 8 + palignr m7, m3, m8, 8 + pmaddubsw m4, m6, [r3 + 6 * 32] ; [22] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 6 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + palignr m6, m8, m9, 6 + palignr m7, m3, m8, 6 + pmaddubsw m4, m6, [r3 + 12 * 32] ; [28] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 12 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + pmaddubsw m4, m6, [r3 - 14 * 32] ; [2] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 14 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1*2], m4 + + palignr m6, m8, m9, 4 + palignr m7, m3, m8, 4 + pmaddubsw m4, m6, [r3 - 8 * 32] ; [8] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 8 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r4], m4 + + lea r0, [r0 + r1 * 4] + + vpbroadcastb m0, [r2 + mmsize*2 + 31] + palignr m1, m9, m0, 1 + vinserti128 m0, m1, xm8, 1 + + palignr m6, m8, m9, 2 + palignr m7, m3, m8, 2 + pmaddubsw m4, m6, [r3 - 2 * 32] ; [14] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 - 2 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0], m4 + + pmaddubsw m4, m9, [r3 + 4 * 32] ; [20] + pmulhrsw m4, m5 + pmaddubsw m1, m8, [r3 + 4 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1], m4 + + palignr m6, m9, m0, 14 + palignr m7, m8, m9, 14 + pmaddubsw m4, m6, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m5 + pmaddubsw m1, m7, [r3 + 10 * 32] + pmulhrsw m1, m5 + packuswb m4, m1 + movu [r0 + r1 * 2], m4 + + pand m6, [pw_00ff] + pand m7, [pw_00ff] + packuswb m6, m7 + movu [r0 + 
r4], m6 + RET + %endif ; ARCH_X86_64 ;----------------------------------------------------------------------------------------- ; end of intra_pred_ang32 angular modes avx2 asm @@ -12679,70 +19286,113 @@ RET INIT_YMM avx2 -cglobal intra_pred_ang8_16, 3, 6, 6 - mova m3, [pw_1024] - movu xm5, [r2 + 16] - pinsrb xm5, [r2], 0 - lea r5, [intra_pred_shuff_0_8] - mova xm0, xm5 - pslldq xm5, 1 - pinsrb xm5, [r2 + 2], 0 - vinserti128 m0, m0, xm5, 1 - pshufb m0, [r5] - - lea r4, [c_ang8_mode_20] - pmaddubsw m1, m0, [r4] - pmulhrsw m1, m3 - mova xm0, xm5 - pslldq xm5, 1 - pinsrb xm5, [r2 + 3], 0 - vinserti128 m0, m0, xm5, 1 - pshufb m0, [r5] - pmaddubsw m2, m0, [r4 + mmsize] - pmulhrsw m2, m3 - pslldq xm5, 1 - pinsrb xm5, [r2 + 5], 0 - vinserti128 m0, m5, xm5, 1 - pshufb m0, [r5] - pmaddubsw m4, m0, [r4 + 2 * mmsize] - pmulhrsw m4, m3 - pslldq xm5, 1 - pinsrb xm5, [r2 + 6], 0 - mova xm0, xm5 - pslldq xm5, 1 - pinsrb xm5, [r2 + 8], 0 - vinserti128 m0, m0, xm5, 1 - pshufb m0, [r5] - pmaddubsw m0, [r4 + 3 * mmsize] - pmulhrsw m0, m3 - - packuswb m1, m2 - packuswb m4, m0 - - vperm2i128 m2, m1, m4, 00100000b - vperm2i128 m1, m1, m4, 00110001b - punpcklbw m4, m2, m1 - punpckhbw m2, m1 - punpcklwd m1, m4, m2 - punpckhwd m4, m2 - mova m0, [trans8_shuf] - vpermd m1, m0, m1 - vpermd m4, m0, m4 +cglobal intra_pred_ang8_16, 3,4,7 + lea r0, [r0 + r1 * 8] + sub r0, r1 + neg r1 + lea r3, [r1 * 3] + vbroadcasti128 m0, [angHor8_tab_16] ; m0 = factor + mova m1, [intra_pred8_shuff16] ; m1 = 4 of Row shuffle + movu m2, [intra_pred8_shuff16 + 8] ; m2 = 4 of Row shuffle + + ; prepare reference pixel + movq xm3, [r2 + 16 + 1] ; m3 = [-1 -2 -3 -4 -5 -6 -7 -8 x x x x x x x x] + movhps xm3, [r2 + 2] ; m3 = [-1 -2 -3 -4 -5 -6 -7 -8 2 3 x 5 6 x 8 x] + pslldq xm3, 1 + pinsrb xm3, [r2], 0 ; m3 = [ 0 -1 -2 -3 -4 -5 -6 -7 -8 2 3 x 5 6 x 8] + pshufb xm3, [c_ang8_mode_16] + vinserti128 m3, m3, xm3, 1 ; m3 = [-8 -7 -6 -5 -4 -3 -2 -1 0 2 3 5 6 8] + + ; process 4 rows + pshufb m4, m3, m1 + pshufb m5, m3, m2 + psrldq m3, 4 + punpcklbw m6, m5, m4 + punpckhbw m5, m4 + pmaddubsw m6, m0 + pmulhrsw m6, [pw_1024] + pmaddubsw m5, m0 + pmulhrsw m5, [pw_1024] + packuswb m6, m5 + vextracti128 xm5, m6, 1 + movq [r0], xm6 + movhps [r0 + r1], xm6 + movq [r0 + r1 * 2], xm5 + movhps [r0 + r3], xm5 + + ; process 4 rows + lea r0, [r0 + r1 * 4] + pshufb m4, m3, m1 + pshufb m5, m3, m2 + punpcklbw m6, m5, m4 + punpckhbw m5, m4 + pmaddubsw m6, m0 + pmulhrsw m6, [pw_1024] + pmaddubsw m5, m0 + pmulhrsw m5, [pw_1024] + packuswb m6, m5 + vextracti128 xm5, m6, 1 + movq [r0], xm6 + movhps [r0 + r1], xm6 + movq [r0 + r1 * 2], xm5 + movhps [r0 + r3], xm5 + RET - lea r3, [3 * r1] - movq [r0], xm1 - movhps [r0 + r1], xm1 - vextracti128 xm2, m1, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 - lea r0, [r0 + 4 * r1] - movq [r0], xm4 - movhps [r0 + r1], xm4 - vextracti128 xm2, m4, 1 - movq [r0 + 2 * r1], xm2 - movhps [r0 + r3], xm2 +%if 1 +INIT_YMM avx2 +cglobal intra_pred_ang8_20, 3,5,6 + lea r0, [r0 + r1 * 8] + sub r0, r1 + neg r1 + lea r3, [angHor8_tab_20] + lea r4, [r1 * 3] + movu m5, [intra_pred_shuff_0_8 + 16] + + ; prepare reference pixel + movq xm1, [r2 + 1] ; m3 = [ 1 2 3 4 5 6 7 8 x x x x x x x x] + movhps xm1, [r2 + 16 + 2] ; m3 = [ 1 2 3 4 5 6 7 8 -2 -3 x -5 -6 x -8 x] + palignr xm1, xm1, [r2 - 15], 15 ; m3 = [ 0 1 2 3 4 5 6 7 8 -2 -3 x -5 -6 x -8] + pshufb xm1, [c_ang8_mode_20] + vinserti128 m1, m1, xm1, 1 + + ; process 4 rows + pshufb m3, m1, m5 + psrldq m1, 2 + pmaddubsw m3, [r3 + 0 * 16] + pmulhrsw m3, [pw_1024] + + pshufb m4, m1, 
[intra_pred_shuff_0_8] + psrldq m1, 1 + pmaddubsw m4, [r3 + 2 * 16] + pmulhrsw m4, [pw_1024] + + packuswb m3, m4 + vextracti128 xm4, m3, 1 + movq [r0], xm3 + movq [r0 + r1], xm4 + movhps [r0 + r1 * 2], xm3 + movhps [r0 + r4], xm4 + + ; process 4 rows + lea r0, [r0 + r1 * 4] + pshufb m3, m1, m5 + psrldq m1, 1 + pmaddubsw m3, [r3 + 4 * 16] + pmulhrsw m3, [pw_1024] + + pshufb m4, m1, m5 + pmaddubsw m4, [r3 + 6 * 16] + pmulhrsw m4, [pw_1024] + + packuswb m3, m4 + vextracti128 xm4, m3, 1 + movq [r0], xm3 + movq [r0 + r1], xm4 + movhps [r0 + r1 * 2], xm3 + movhps [r0 + r4], xm4 RET +%else INIT_YMM avx2 cglobal intra_pred_ang8_20, 3, 6, 6 mova m3, [pw_1024] @@ -12796,6 +19446,7 @@ movhps [r0 + 2 * r1], xm4 movhps [r0 + r3], xm2 RET +%endif INIT_YMM avx2 cglobal intra_pred_ang8_21, 3, 6, 6 @@ -13275,173 +19926,787 @@ INIT_YMM avx2 -cglobal intra_pred_ang16_12, 3, 6, 13 - mova m11, [pw_1024] - lea r5, [intra_pred_shuff_0_8] - - movu xm9, [r2 + 32] - pinsrb xm9, [r2], 0 - pslldq xm7, xm9, 1 - pinsrb xm7, [r2 + 6], 0 - vinserti128 m9, m9, xm7, 1 - pshufb m9, [r5] - - movu xm12, [r2 + 6 + 32] - - psrldq xm10, xm12, 2 - psrldq xm8, xm12, 1 - vinserti128 m10, m10, xm8, 1 - pshufb m10, [r5] - - lea r3, [3 * r1] - lea r4, [c_ang16_mode_12] - - INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 - INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 - INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 - INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 - - add r4, 4 * mmsize +cglobal intra_pred_ang16_12, 3,4,9 + vbroadcasti128 m0, [angHor_tab_12] + vbroadcasti128 m1, [angHor_tab_12 + mmsize/2] + mova m2, [pw_1024] + mova m7, [ang16_shuf_mode12] + mova m8, [ang16_shuf_mode12 + mmsize] + lea r3, [r1 * 3] + + movu xm4, [r2 + mmsize - 2] + pinsrb xm4, [r2 + 0], 2 + pinsrb xm4, [r2 + 6], 1 + pinsrb xm4, [r2 + 13], 0 + vbroadcasti128 m6, [r2 + mmsize + 14] + vinserti128 m3, m4, xm4, 1 + + pshufb m4, m3, m7 + pshufb m5, m3, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 2 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - pslldq xm7, 1 - pinsrb xm7, [r2 + 13], 0 - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - mova xm8, xm12 - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 - INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 - - movu xm9, [r2 + 31] - pinsrb xm9, [r2 + 6], 0 - pinsrb xm9, [r2 + 0], 1 - pshufb xm9, [r5] - vinserti128 m9, m9, xm7, 1 - - psrldq xm10, xm12, 1 - vinserti128 m10, m10, xm12, 1 - pshufb m10, [r5] + palignr m5, m6, m3, 4 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 6 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 - INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + palignr m5, m6, m3, 8 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 10 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb 
m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - ; transpose and store - INTRA_PRED_TRANS_STORE_16x16 + palignr m5, m6, m3, 12 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 14 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 RET INIT_YMM avx2 -cglobal intra_pred_ang16_13, 3, 6, 14 - mova m11, [pw_1024] - lea r5, [intra_pred_shuff_0_8] - - movu xm13, [r2 + 32] - pinsrb xm13, [r2], 0 - pslldq xm7, xm13, 2 - pinsrb xm7, [r2 + 7], 0 - pinsrb xm7, [r2 + 4], 1 - vinserti128 m9, m13, xm7, 1 - pshufb m9, [r5] - - movu xm12, [r2 + 4 + 32] - - psrldq xm10, xm12, 4 - psrldq xm8, xm12, 2 - vinserti128 m10, m10, xm8, 1 - pshufb m10, [r5] - - lea r3, [3 * r1] - lea r4, [c_ang16_mode_13] - - INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 - INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 +cglobal intra_pred_ang16_13, 3,4,9 + vbroadcasti128 m0, [angHor_tab_13] + vbroadcasti128 m1, [angHor_tab_13 + mmsize/2] + mova m2, [pw_1024] + mova m7, [ang16_shuf_mode13] + mova m8, [ang16_shuf_mode13 + mmsize] + lea r3, [r1 * 3] + + vbroadcasti128 m3, [r2 + mmsize + 1] + vbroadcasti128 m4, [r2] + pshufb m4, [ang16_shuf_mode13 + mmsize * 2] + + palignr m3, m4, 11 + vbroadcasti128 m6, [r2 + mmsize + 12] + + pshufb m4, m3, m7 + pshufb m5, m3, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 2 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - pslldq xm7, 1 - pinsrb xm7, [r2 + 11], 0 - pshufb xm2, xm7, [r5] - vinserti128 m9, m9, xm2, 1 + palignr m5, m6, m3, 4 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 6 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - psrldq xm8, xm12, 1 - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 + palignr m5, m6, m3, 8 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 10 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 + palignr m5, m6, m3, 12 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 14 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + RET - pslldq xm13, 1 - pinsrb xm13, [r2 + 4], 0 - pshufb xm3, xm13, [r5] - vinserti128 m9, m9, xm3, 0 +INIT_YMM avx2 
+cglobal intra_pred_ang16_14, 3,4,9 + vbroadcasti128 m0, [angHor_tab_14] + vbroadcasti128 m1, [angHor_tab_14 + mmsize/2] + mova m2, [pw_1024] + mova m7, [ang16_shuf_mode14] + mova m8, [ang16_shuf_mode14 + mmsize] + lea r3, [r1 * 3] + + vbroadcasti128 m3, [r2 + mmsize + 1] + vbroadcasti128 m4, [r2] + pshufb m4, [ang16_shuf_mode14 + mmsize * 2] + palignr m3, m4, 9 + vbroadcasti128 m6, [r2 + mmsize + 10] + + pshufb m4, m3, m7 + pshufb m5, m3, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 2 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - psrldq xm8, xm12, 3 - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 0 + palignr m5, m6, m3, 4 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 6 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + palignr m5, m6, m3, 8 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 10 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - add r4, 4 * mmsize + palignr m5, m6, m3, 12 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 14 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + RET - INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 - INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 +INIT_YMM avx2 +cglobal intra_pred_ang16_15, 3,4,9 + vbroadcasti128 m0, [angHor_tab_15] + vbroadcasti128 m1, [angHor_tab_15 + mmsize/2] + mova m2, [pw_1024] + mova m7, [ang16_shuf_mode15] + mova m8, [ang16_shuf_mode15 + mmsize] + lea r3, [r1 * 3] + + vbroadcasti128 m3, [r2 + mmsize + 1] + vbroadcasti128 m4, [r2] + pshufb m4, [ang16_shuf_mode15 + mmsize * 2] + palignr m3, m3, m4, 7 + vbroadcasti128 m6, [r2 + mmsize + 8] + + pshufb m4, m3, m7 + pshufb m5, m3, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 2 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - pslldq xm7, 1 - pinsrb xm7, [r2 + 14], 0 - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 + palignr m5, m6, m3, 4 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 6 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw 
m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - mova xm8, xm12 - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 + palignr m5, m6, m3, 8 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 10 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 + palignr m5, m6, m3, 12 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 14 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + RET - pslldq xm13, 1 - pinsrb xm13, [r2 + 7], 0 - pshufb xm13, [r5] - vinserti128 m9, m9, xm13, 0 +INIT_YMM avx2 +cglobal intra_pred_ang16_16, 3,4,9 + vbroadcasti128 m0, [angHor_tab_16] + vbroadcasti128 m1, [angHor_tab_16 + mmsize/2] + mova m2, [pw_1024] + mova m7, [ang16_shuf_mode16] + mova m8, [ang16_shuf_mode16 + mmsize] + lea r3, [r1 * 3] + + vbroadcasti128 m3, [r2 + mmsize + 1] + vbroadcasti128 m4, [r2] + pshufb m4, [ang16_shuf_mode16 + mmsize * 2] + palignr m3, m4, 5 + vbroadcasti128 m6, [r2 + mmsize + 6] + + pshufb m4, m3, m7 + pshufb m5, m3, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 2 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - psrldq xm12, 2 - pshufb xm12, [r5] - vinserti128 m10, m10, xm12, 0 + palignr m5, m6, m3, 4 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 6 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + palignr m5, m6, m3, 8 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 10 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - ; transpose and store - INTRA_PRED_TRANS_STORE_16x16 + palignr m5, m6, m3, 12 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 14 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 RET INIT_YMM avx2 
-cglobal intra_pred_ang16_11, 3, 5, 12 - mova m11, [pw_1024] - - movu xm9, [r2 + 32] - pinsrb xm9, [r2], 0 - pshufb xm9, [intra_pred_shuff_0_8] - vinserti128 m9, m9, xm9, 1 - - vbroadcasti128 m10, [r2 + 8 + 32] - pshufb m10, [intra_pred_shuff_0_8] - - lea r3, [3 * r1] - lea r4, [c_ang16_mode_11] +cglobal intra_pred_ang16_17, 3,4,9 + vbroadcasti128 m0, [angHor_tab_17] + vbroadcasti128 m1, [angHor_tab_17 + mmsize/2] + mova m2, [pw_1024] + mova m7, [ang16_shuf_mode17] + mova m8, [ang16_shuf_mode17 + mmsize] + lea r3, [r1 * 3] + + vbroadcasti128 m3, [r2 + mmsize + 1] + vbroadcasti128 m4, [r2] + pshufb m4, [ang16_shuf_mode17 + mmsize * 2] + palignr m3, m4, 3 + vbroadcasti128 m6, [r2 + mmsize + 4] + + pshufb m4, m3, m7 + pshufb m5, m3, m8 + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 2 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 - INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 - INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 - INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + palignr m5, m6, m3, 4 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 6 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - add r4, 4 * mmsize + palignr m5, m6, m3, 8 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 10 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + lea r0, [r0 + r1 * 4] - INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 - INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 - INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 - INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + palignr m5, m6, m3, 12 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 14 + pshufb m4, m5, m7 + pshufb m5, m8 + + pmaddubsw m4, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + RET - ; transpose and store - INTRA_PRED_TRANS_STORE_16x16 +INIT_YMM avx2 +cglobal intra_pred_ang16_11, 3,4,8 + vbroadcasti128 m0, [angHor_tab_11] + vbroadcasti128 m1, [angHor_tab_11 + mmsize/2] + mova m2, [pw_1024] + mova m7, [ang32_shuf_mode9] + lea r3, [r1 * 3] + + ; prepare for [0 -1 -2...] 
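; Mode 11 is only two 1/32 steps off pure horizontal, so the reference needs
; just the top-left corner prepended (the pinsrb below).  The per-column
; weights live in angHor_tab_11 (split across m0/m1); stepping down the rows
; then reduces to sliding the reference window one byte per output row
; (palignr advances two per row pair, and the high lane of the shared shuffle
; ang32_shuf_mode9 supplies the odd step) before pmaddubsw.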
+ + movu xm3, [r2 + mmsize] + pinsrb xm3, [r2], 0 + vbroadcasti128 m6, [r2 + mmsize + 16] + vinserti128 m3, m3, xm3, 1 + + pshufb m5, m3, m7 ; [ 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2] + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 2 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + + lea r0, [r0 + r1 * 4] + + palignr m5, m6, m3, 4 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 6 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + + lea r0, [r0 + r1 * 4] + + palignr m5, m6, m3, 8 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 10 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + + lea r0, [r0 + r1 * 4] + + palignr m5, m6, m3, 12 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 14 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 RET + ; transpose 8x32 to 16x16, used for intra_ang16x16 avx2 asm %if ARCH_X86_64 == 1 INIT_YMM avx2 @@ -13493,21 +20758,21 @@ movu [r0 + r1 * 2], xm%2 movu [r0 + r5 * 1], xm%11 - lea r0, [r0 + r6] + add r0, r6 movu [r0 + r1 * 0], xm%7 movu [r0 + r1 * 1], xm%8 movu [r0 + r1 * 2], xm%4 movu [r0 + r5 * 1], xm%9 - lea r0, [r0 + r6] + add r0, r6 vextracti128 [r0 + r1 * 0], m%5, 1 vextracti128 [r0 + r1 * 1], m%6, 1 vextracti128 [r0 + r1 * 2], m%2, 1 vextracti128 [r0 + r5 * 1], m%11, 1 - lea r0, [r0 + r6] + add r0, r6 vextracti128 [r0 + r1 * 0], m%7, 1 vextracti128 [r0 + r1 * 1], m%8, 1 @@ -13530,21 +20795,21 @@ movu [r0 + r1 * 2], xm%3 movu [r0 + r5 * 1], xm%4 - lea r0, [r0 + r6] + add r0, r6 movu [r0 + r1 * 0], xm%5 movu [r0 + r1 * 1], xm%6 movu [r0 + r1 * 2], xm%7 movu [r0 + r5 * 1], xm%8 - lea r0, [r0 + r6] + add r0, r6 vextracti128 [r0 + r1 * 0], m%1, 1 vextracti128 [r0 + r1 * 1], m%2, 1 vextracti128 [r0 + r1 * 2], m%3, 1 vextracti128 [r0 + r5 * 1], m%4, 1 - lea r0, [r0 + r6] + add r0, r6 vextracti128 [r0 + r1 * 0], m%5, 1 vextracti128 [r0 + r1 * 1], m%6, 1 @@ -14110,41 +21375,100 @@ %endif ; ARCH_X86_64 INIT_YMM avx2 -cglobal intra_pred_ang16_9, 3, 6, 12 - mova m11, [pw_1024] - lea r5, [intra_pred_shuff_0_8] +cglobal intra_pred_ang16_9, 3,4,8 + vbroadcasti128 m0, [angHor_tab_9] + vbroadcasti128 m1, [angHor_tab_9 + mmsize/2] + mova m2, [pw_1024] + lea r3, [r1 * 3] + mova m7, [ang16_shuf_mode9] - vbroadcasti128 m9, [r2 + 1 + 32] - pshufb m9, [r5] - vbroadcasti128 m10, [r2 + 9 + 32] - pshufb m10, [r5] + vbroadcasti128 m6, [r2 + mmsize + 17] + vbroadcasti128 m3, [r2 + mmsize + 1] - lea r3, [3 * r1] - lea r4, [c_ang16_mode_9] + pshufb m5, m3, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu 
[r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 2 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 + + lea r0, [r0 + r1 * 4] - INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 - INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 - INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 - INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 + palignr m5, m6, m3, 4 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 6 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 - add r4, 4 * mmsize + lea r0, [r0 + r1 * 4] - INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 - INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 - INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 - - movu xm7, [r2 + 2 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm7, [r2 + 10 + 32] - pshufb xm7, [r5] - vinserti128 m10, m10, xm7, 1 + palignr m5, m6, m3, 8 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 10 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 - INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 + lea r0, [r0 + r1 * 4] - ; transpose and store - INTRA_PRED_TRANS_STORE_16x16 + palignr m5, m6, m3, 12 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0], xm4 + vextracti128 [r0 + r1], m4, 1 + + palignr m5, m6, m3, 14 + pshufb m5, m7 + pmaddubsw m4, m5, m0 + pmaddubsw m5, m1 + pmulhrsw m4, m2 + pmulhrsw m5, m2 + packuswb m4, m5 + movu [r0 + r1 * 2], xm4 + vextracti128 [r0 + r3], m4, 1 RET %endif @@ -14587,3020 +21911,6 @@ INTRA_PRED_ANG32_STORE RET -%if ARCH_X86_64 == 1 -%macro INTRA_PRED_ANG32_CAL_ROW 0 - pmaddubsw m6, m2, m10 - pmulhrsw m6, m0 - pmaddubsw m7, m3, m10 - pmulhrsw m7, m0 - pmaddubsw m8, m4, m10 - pmulhrsw m8, m0 - pmaddubsw m9, m5, m10 - pmulhrsw m9, m0 - packuswb m6, m7 - packuswb m8, m9 - vperm2i128 m7, m6, m8, 00100000b - vperm2i128 m6, m6, m8, 00110001b -%endmacro - - -INIT_YMM avx2 -cglobal intra_pred_ang32_27, 3, 5, 11 - mova m0, [pw_1024] - mova m1, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang32_mode_27] - - vbroadcasti128 m2, [r2 + 1] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 9] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 17] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 25] - pshufb m5, m1 - - ;row [0, 1] - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [2, 3] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [4, 5] - mova m10, [r4 + 2 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [6, 7] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [8, 9] - lea r0, [r0 + 4 * r1] - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [10, 11] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [12, 13] - lea r0, [r0 + 4 * r1] - 
mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [14] - mova m10, [r4 + 3 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - vbroadcasti128 m2, [r2 + 2] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 10] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 18] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 26] - pshufb m5, m1 - - ;row [15, 16] - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [17, 18] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [19, 20] - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [21, 22] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [23, 24] - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [25, 26] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [27, 28] - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [29, 30] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [31] - vbroadcasti128 m2, [r2 + 3] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 11] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 19] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 27] - pshufb m5, m1 - - mova m10, [r4 + 4 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang32_28, 3, 5, 11 - mova m0, [pw_1024] - mova m1, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang32_mode_28] - - vbroadcasti128 m2, [r2 + 1] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 9] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 17] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 25] - pshufb m5, m1 - - ;row [0, 1] - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [2, 3] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [4, 5] - mova m10, [r4 + 2 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - vbroadcasti128 m2, [r2 + 2] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 10] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 18] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 26] - pshufb m5, m1 - - ;row [6, 7] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [8, 9] - lea r0, [r0 + 4 * r1] - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [10, 11] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - vbroadcasti128 m2, [r2 + 3] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 11] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 19] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 
27] - pshufb m5, m1 - - ;row [12, 13] - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [14, 15] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [16, 17] - lea r0, [r0 + 4 * r1] - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [18] - mova m10, [r4 + 1 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [19, 20] - vbroadcasti128 m2, [r2 + 4] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 12] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 20] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 28] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row[21, 22] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[23, 24] - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [25, 26] - vbroadcasti128 m2, [r2 + 5] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 13] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 21] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 29] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [27, 28] - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [29, 30] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [31] - vbroadcasti128 m2, [r2 + 6] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 14] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 22] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 30] - pshufb m5, m1 - - mova m10, [r4 + 4 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang32_29, 3, 5, 11 - mova m0, [pw_1024] - mova m1, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang32_mode_29] - - ;row [0, 1] - vbroadcasti128 m2, [r2 + 1] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 9] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 17] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 25] - pshufb m5, m1 - - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [2] - mova m10, [r4 + 1 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [3, 4] - vbroadcasti128 m2, [r2 + 2] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 10] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 18] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 26] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [5, 6] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [7, 8] - vbroadcasti128 m2, [r2 + 3] - pshufb m2, 
m1 - vbroadcasti128 m3, [r2 + 11] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 19] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 27] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [9] - mova m10, [r4 + 1 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r1], m6 - - ;row [10, 11] - vbroadcasti128 m2, [r2 + 4] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 12] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 20] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 28] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [12, 13] - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [14, 15] - vbroadcasti128 m2, [r2 + 5] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 13] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 21] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 29] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [16] - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 1 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row [17, 18] - vbroadcasti128 m2, [r2 + 6] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 14] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 22] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 30] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [19, 20] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [21, 22] - vbroadcasti128 m2, [r2 + 7] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 15] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 23] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 31] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [23] - mova m10, [r4 + 1 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row [24, 25] - vbroadcasti128 m2, [r2 + 8] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 16] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 24] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 32] - pshufb m5, m1 - - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [26, 27] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [28, 29] - vbroadcasti128 m2, [r2 + 9] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 17] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 25] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 33] - pshufb m5, m1 - - lea r0, [r0 + 4 * r1] - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [30] - mova m10, [r4 + 1 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 
- vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [31] - vbroadcasti128 m2, [r2 + 10] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 18] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 26] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 34] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang32_30, 3, 5, 11 - mova m0, [pw_1024] - mova m1, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang32_mode_30] - - ;row [0, 1] - vbroadcasti128 m2, [r2 + 1] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 9] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 17] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 25] - pshufb m5, m1 - - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [2, 3] - vbroadcasti128 m2, [r2 + 2] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 10] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 18] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 26] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [4, 5] - vbroadcasti128 m2, [r2 + 3] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 11] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 19] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 27] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [6] - mova m10, [r4 + 3 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [7, 8] - vbroadcasti128 m2, [r2 + 4] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 12] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 20] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 28] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [9, 10] - vbroadcasti128 m2, [r2 + 5] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 13] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 21] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 29] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [11] - mova m10, [r4 + 2 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row [12, 13] - vbroadcasti128 m2, [r2 + 6] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 14] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 22] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 30] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [14, 15] - vbroadcasti128 m2, [r2 + 7] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 15] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 23] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 31] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [16] - mova m10, [r4 + 
1 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [17, 18] - vbroadcasti128 m2, [r2 + 8] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 16] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 24] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 32] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [19, 20] - vbroadcasti128 m2, [r2 + 9] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 17] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 25] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 33] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - add r4, 4 * mmsize - - ;row [21] - mova m10, [r4 + 0 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r1], m6 - - ;row [22, 23] - vbroadcasti128 m2, [r2 + 10] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 18] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 26] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 34] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [24, 25] - vbroadcasti128 m2, [r2 + 11] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 19] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 27] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 35] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [26] - mova m10, [r4 + 3 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [27, 28] - vbroadcasti128 m2, [r2 + 12] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 20] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 28] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 36] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [29, 30] - vbroadcasti128 m2, [r2 + 13] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 21] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 29] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 37] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [31] - vbroadcasti128 m2, [r2 + 14] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 22] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 30] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 38] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang32_31, 3, 5, 11 - mova m0, [pw_1024] - mova m1, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang32_mode_31] - - ;row [0] - vbroadcasti128 m2, [r2 + 1] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 9] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 17] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 25] - pshufb m5, m1 - - mova m10, [r4 + 0 
* mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row [1, 2] - vbroadcasti128 m2, [r2 + 2] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 10] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 18] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 26] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [3, 4] - vbroadcasti128 m2, [r2 + 3] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 11] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 19] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 27] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [5, 6] - vbroadcasti128 m2, [r2 + 4] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 12] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 20] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 28] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [7, 8] - vbroadcasti128 m2, [r2 + 5] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 13] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 21] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 29] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [9, 10] - vbroadcasti128 m2, [r2 + 6] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 14] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 22] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 30] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [11, 12] - vbroadcasti128 m2, [r2 + 7] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 15] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 23] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 31] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [13, 14] - vbroadcasti128 m2, [r2 + 8] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 16] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 24] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 32] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [15] - vbroadcasti128 m2, [r2 + 9] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 17] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 25] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 33] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row [16, 17] - vbroadcasti128 m2, [r2 + 10] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 18] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 26] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 34] - pshufb m5, m1 - - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [18, 19] - vbroadcasti128 m2, [r2 + 11] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 19] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 27] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 35] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 
+ r3], m6 - - ;row [20, 21] - vbroadcasti128 m2, [r2 + 12] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 20] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 28] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 36] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [22, 23] - vbroadcasti128 m2, [r2 + 13] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 21] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 29] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 37] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [24, 25] - vbroadcasti128 m2, [r2 + 14] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 22] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 30] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 38] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [26, 27] - vbroadcasti128 m2, [r2 + 15] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 23] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 31] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 39] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [28, 29] - vbroadcasti128 m2, [r2 + 16] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 24] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 32] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 40] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [30] - vbroadcasti128 m2, [r2 + 17] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 25] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 33] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 41] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [31] - vbroadcasti128 m2, [r2 + 18] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 26] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 34] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 42] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang32_32, 3, 5, 11 - mova m0, [pw_1024] - mova m1, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang32_mode_32] - - ;row [0] - vbroadcasti128 m2, [r2 + 1] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 9] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 17] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 25] - pshufb m5, m1 - - mova m10, [r4 + 0 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row [1, 2] - vbroadcasti128 m2, [r2 + 2] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 10] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 18] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 26] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [3] - vbroadcasti128 m2, [r2 + 3] - pshufb m2, m1 - vbroadcasti128 m3, [r2 
+ 11] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 19] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 27] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row [4, 5] - vbroadcasti128 m2, [r2 + 4] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 12] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 20] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 28] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [6] - vbroadcasti128 m2, [r2 + 5] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 13] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 21] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 29] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [7, 8] - vbroadcasti128 m2, [r2 + 6] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 14] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 22] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 30] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [9] - vbroadcasti128 m2, [r2 + 7] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 15] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 23] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 31] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r1], m6 - - ;row [10, 11] - vbroadcasti128 m2, [r2 + 8] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 16] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 24] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 32] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [12] - vbroadcasti128 m2, [r2 + 9] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 17] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 25] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 33] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - lea r0, [r0 + 4 * r1] - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row [13, 14] - vbroadcasti128 m2, [r2 + 10] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 18] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 26] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 34] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [15] - vbroadcasti128 m2, [r2 + 11] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 19] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 27] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 35] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row [16, 17] - vbroadcasti128 m2, [r2 + 12] - pshufb m2, m1 - 
vbroadcasti128 m3, [r2 + 20] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 28] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 36] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [18] - vbroadcasti128 m2, [r2 + 13] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 21] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 29] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 37] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [19, 20] - vbroadcasti128 m2, [r2 + 14] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 22] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 30] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 38] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [21] - vbroadcasti128 m2, [r2 + 15] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 23] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 31] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 39] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r1], m6 - - ;row [22, 23] - vbroadcasti128 m2, [r2 + 16] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 24] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 32] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 40] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row [24] - vbroadcasti128 m2, [r2 + 17] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 25] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 33] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 41] - pshufb m5, m1 - - lea r0, [r0 + 4 * r1] - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row [25, 26] - vbroadcasti128 m2, [r2 + 18] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 26] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 34] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 42] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [27] - vbroadcasti128 m2, [r2 + 19] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 27] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 35] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 43] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row [28, 29] - vbroadcasti128 m2, [r2 + 20] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 28] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 36] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 44] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [30] - vbroadcasti128 m2, [r2 + 21] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 29] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 37] - pshufb m4, m1 - 
vbroadcasti128 m5, [r2 + 45] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [31] - vbroadcasti128 m2, [r2 + 22] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 30] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 38] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 46] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang32_25, 3, 5, 11 - mova m0, [pw_1024] - mova m1, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang32_mode_25] - - ;row [0, 1] - vbroadcasti128 m2, [r2 + 0] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 8] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 16] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 24] - pshufb m5, m1 - - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[2, 3] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[4, 5] - mova m10, [r4 + 2 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[6, 7] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[8, 9] - add r4, 4 * mmsize - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[10, 11] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[12, 13] - mova m10, [r4 + 2 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[14, 15] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[16, 17] - movu xm2, [r2 - 1] - pinsrb xm2, [r2 + 80], 0 - vinserti128 m2, m2, xm2, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 7] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 15] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 23] - pshufb m5, m1 - - add r4, 4 * mmsize - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[18, 19] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[20, 21] - mova m10, [r4 + 2 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[22, 23] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[24, 25] - add r4, 4 * mmsize - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[26, 27] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[28, 29] - mova m10, [r4 + 2 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[30, 31] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang32_24, 3, 5, 12 - mova m0, [pw_1024] - mova m1, 
[intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang32_mode_24] - - ;row[0, 1] - vbroadcasti128 m11, [r2 + 0] - pshufb m2, m11, m1 - vbroadcasti128 m3, [r2 + 8] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 16] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 24] - pshufb m5, m1 - - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[2, 3] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[4, 5] - mova m10, [r4 + 2 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[6, 7] - pslldq xm11, 1 - pinsrb xm11, [r2 + 70], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 7] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 15] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 23] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[8, 9] - add r4, 4 * mmsize - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[10, 11] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[12, 13] - pslldq xm11, 1 - pinsrb xm11, [r2 + 77], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 6] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 14] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 22] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[14, 15] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[16, 17] - add r4, 4 * mmsize - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[18] - mova m10, [r4 + 1 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row[19, 20] - pslldq xm11, 1 - pinsrb xm11, [r2 + 83], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 5] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 13] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 21] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row[21, 22] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[23, 24] - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row[25, 26] - pslldq xm11, 1 - pinsrb xm11, [r2 + 90], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 4] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 12] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 20] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[27, 28] - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row[29, 30] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;[row 31] - mova m10, [r4 + 4 * mmsize] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, m10 - pmulhrsw m6, m0 
- vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, m10 - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang32_23, 3, 5, 12 - mova m0, [pw_1024] - mova m1, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang32_mode_23] - - ;row[0, 1] - vbroadcasti128 m11, [r2 + 0] - pshufb m2, m11, m1 - vbroadcasti128 m3, [r2 + 8] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 16] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 24] - pshufb m5, m1 - - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[2] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 1 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 1 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row[3, 4] - pslldq xm11, 1 - pinsrb xm11, [r2 + 68], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 7] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 15] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 23] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row[5, 6] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[7, 8] - pslldq xm11, 1 - pinsrb xm11, [r2 + 71], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 6] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 14] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 22] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row[9] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 1 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 1 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r1], m6 - - ;row[10, 11] - pslldq xm11, 1 - pinsrb xm11, [r2 + 75], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 5] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 13] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 21] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[12, 13] - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[14, 15] - pslldq xm11, 1 - pinsrb xm11, [r2 + 78], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 4] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 12] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 20] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[16] - lea r0, [r0 + 4 * r1] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 1 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 1 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row[17, 18] - pslldq xm11, 1 - pinsrb xm11, [r2 + 82], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 3] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 11] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 19] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[19, 20] - mova m10, [r4 + 3 * mmsize] - - 
INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row[21, 22] - pslldq xm11, 1 - pinsrb xm11, [r2 + 85], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 2] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 10] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 18] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[23] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 1 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 1 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row[24, 25] - pslldq xm11, 1 - pinsrb xm11, [r2 + 89], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 1] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 9] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 17] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[26, 27] - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[28, 29] - pslldq xm11, 1 - pinsrb xm11, [r2 + 92], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 0] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 8] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 16] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[30, 31] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang32_22, 3, 5, 13 - mova m0, [pw_1024] - mova m1, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang32_mode_22] - - ;row[0, 1] - vbroadcasti128 m11, [r2 + 0] - pshufb m2, m11, m1 - vbroadcasti128 m3, [r2 + 8] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 16] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 24] - pshufb m5, m1 - - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[2, 3] - pslldq xm11, 1 - pinsrb xm11, [r2 + 66], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 7] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 15] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 23] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[4, 5] - pslldq xm11, 1 - pinsrb xm11, [r2 + 69], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 6] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 14] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 22] - pshufb m5, m1 - - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[6] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 3 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 3 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row[7, 8] - pslldq xm11, 1 - pinsrb xm11, [r2 + 71], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 5] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 13] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 21] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu 
[r0], m6 - - ;row[9, 10] - pslldq xm11, 1 - pinsrb xm11, [r2 + 74], 0 - vinserti128 m2, m11, xm11, 1 - vinserti128 m2, m2, xm2, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 4] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 12] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 20] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[11] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 2 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 2 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row[12, 13] - pslldq xm11, 1 - pinsrb xm11, [r2 + 76], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 3] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 11] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 19] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[14, 15] - pslldq xm11, 1 - pinsrb xm11, [r2 + 79], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 2] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 10] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 18] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[16] - lea r0, [r0 + 4 * r1] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 1 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 1 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row[17, 18] - pslldq xm11, 1 - pinsrb xm11, [r2 + 81], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 1] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 9] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 17] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[19, 20] - pslldq xm11, 1 - pinsrb xm11, [r2 + 84], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m12, [r2 + 0] - pshufb m3, m12, m1 - vbroadcasti128 m4, [r2 + 8] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 16] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row[21] - add r4, 4 * mmsize - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 0 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 0 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r1], m6 - - ;row[22, 23] - pslldq xm11, 1 - pinsrb xm11, [r2 + 86], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - pslldq xm12, 1 - pinsrb xm12, [r2 + 66], 0 - vinserti128 m3, m12, xm12, 1 - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 7] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 15] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[24, 25] - pslldq xm11, 1 - pinsrb xm11, [r2 + 89], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - pslldq xm12, 1 - pinsrb xm12, [r2 + 69], 0 - vinserti128 m3, m12, xm12, 1 - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 6] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 14] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[26] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, 
[r4 + 3 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 3 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row[27, 28] - pslldq xm11, 1 - pinsrb xm11, [r2 + 91], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - pslldq xm12, 1 - pinsrb xm12, [r2 + 71], 0 - vinserti128 m3, m12, xm12, 1 - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 5] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 13] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row[29, 30] - pslldq xm11, 1 - pinsrb xm11, [r2 + 94], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - pslldq xm12, 1 - pinsrb xm12, [r2 + 74], 0 - vinserti128 m3, m12, xm12, 1 - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 4] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 12] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[31] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 2 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 2 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang32_21, 3, 5, 13 - mova m0, [pw_1024] - mova m1, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang32_mode_21] - - ;row[0] - vbroadcasti128 m11, [r2 + 0] - pshufb m2, m11, m1 - vbroadcasti128 m3, [r2 + 8] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 16] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 24] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 0 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 0 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row[1, 2] - pslldq xm11, 1 - pinsrb xm11, [r2 + 66], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 7] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 15] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 23] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[3, 4] - pslldq xm11, 1 - pinsrb xm11, [r2 + 68], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 6] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 14] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 22] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row[5, 6] - pslldq xm11, 1 - pinsrb xm11, [r2 + 70], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 5] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 13] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 21] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[7, 8] - pslldq xm11, 1 - pinsrb xm11, [r2 + 72], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 4] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 12] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 20] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row[9, 10] - pslldq xm11, 1 - pinsrb xm11, [r2 + 73], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 3] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 11] - pshufb m4, 
m1 - vbroadcasti128 m5, [r2 + 19] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[11, 12] - pslldq xm11, 1 - pinsrb xm11, [r2 + 75], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 2] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 10] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 18] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row[13, 14] - pslldq xm11, 1 - pinsrb xm11, [r2 + 77], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 1] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 9] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 17] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row[15] - pslldq xm11, 1 - pinsrb xm11, [r2 + 79], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - vbroadcasti128 m12, [r2 + 0] - pshufb m3, m12, m1 - vbroadcasti128 m4, [r2 + 8] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 16] - pshufb m5, m1 - vperm2i128 m6, m2, m3, 00100000b - add r4, 4 * mmsize - pmaddubsw m6, [r4 + 0 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 0 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row[16, 17] - pslldq xm11, 1 - pinsrb xm11, [r2 + 81], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - pslldq xm12, 1 - pinsrb xm12, [r2 + 66], 0 - vinserti128 m3, m12, xm12, 1 - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 7] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 15] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - lea r0, [r0 + 4 * r1] - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[18, 19] - pslldq xm11, 1 - pinsrb xm11, [r2 + 83], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - pslldq xm12, 1 - pinsrb xm12, [r2 + 68], 0 - vinserti128 m3, m12, xm12, 1 - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 6] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 14] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[20, 21] - pslldq xm11, 1 - pinsrb xm11, [r2 + 85], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - pslldq xm12, 1 - pinsrb xm12, [r2 + 70], 0 - vinserti128 m3, m12, xm12, 1 - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 5] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 13] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - lea r0, [r0 + 4 * r1] - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[22, 23] - pslldq xm11, 1 - pinsrb xm11, [r2 + 87], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - pslldq xm12, 1 - pinsrb xm12, [r2 + 72], 0 - vinserti128 m3, m12, xm12, 1 - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 4] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 12] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[24, 25] - pslldq xm11, 1 - pinsrb xm11, [r2 + 88], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - pslldq xm12, 1 - pinsrb xm12, [r2 + 73], 0 - vinserti128 m3, m12, xm12, 1 - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 3] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 11] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - lea r0, [r0 + 4 * r1] - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[26, 27] - pslldq xm11, 1 - pinsrb xm11, [r2 + 90], 0 - vinserti128 m2, m11, 
xm11, 1 - pshufb m2, m1 - pslldq xm12, 1 - pinsrb xm12, [r2 + 75], 0 - vinserti128 m3, m12, xm12, 1 - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 2] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 10] - pshufb m5, m1 - - mova m10, [r4 + 2 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - - ;row[28, 29] - pslldq xm11, 1 - pinsrb xm11, [r2 + 92], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - pslldq xm12, 1 - pinsrb xm12, [r2 + 77], 0 - vinserti128 m3, m12, xm12, 1 - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 1] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 9] - pshufb m5, m1 - - mova m10, [r4 + 3 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - lea r0, [r0 + 4 * r1] - movu [r0], m7 - movu [r0 + r1], m6 - - ;row[30, 31] - pslldq xm11, 1 - pinsrb xm11, [r2 + 94], 0 - vinserti128 m2, m11, xm11, 1 - pshufb m2, m1 - pslldq xm12, 1 - pinsrb xm12, [r2 + 79], 0 - vinserti128 m3, m12, xm12, 1 - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 0] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 8] - pshufb m5, m1 - - mova m10, [r4 + 4 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + 2 * r1], m7 - movu [r0 + r3], m6 - RET -%endif - %macro INTRA_PRED_STORE_4x4 0 movd [r0], xm0 pextrd [r0 + r1], xm0, 1
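For reference when reading the hunk above: every removed AVX2 kernel (the INTRA_PRED_ANG32_CAL_ROW helper and the intra_pred_ang32_21 through intra_pred_ang32_32 routines) vectorizes the same two-tap HEVC angular interpolation. pmaddubsw forms the weighted sum of two neighbouring reference bytes using weight pairs that total 32, and the following pmulhrsw against pw_1024 is exactly the spec's rounding shift right by 5, because pmulhrsw returns round(a*b / 2^15) and 1024 / 2^15 = 1/32. A minimal scalar sketch of that arithmetic, with illustrative names that are not taken from the x265 sources:

#include <stdint.h>

/* Scalar model of one predicted row as computed by the removed kernels:
 * pmaddubsw supplies the two multiplies and the add (its byte weight
 * pairs sum to 32), and pmulhrsw with pw_1024 supplies "+16, >> 5". */
static void angular_pred_row(uint8_t *dst, const uint8_t *ref,
                             int width, int idx, int fact)
{
    for (int x = 0; x < width; x++)
        dst[x] = (uint8_t)(((32 - fact) * ref[x + idx]
                            + fact * ref[x + idx + 1] + 16) >> 5);
}

Each INTRA_PRED_ANG32_CAL_ROW invocation produces two such 32-pixel rows at once, which is why the removed routines typically store m7 and m6 to a pair of destination rows after every call; the odd rows that cannot be paired are handled by the inline vperm2i128/packuswb sequences instead.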
View file
x265_1.8.tar.gz/source/common/x86/intrapred8_allangs.asm -> x265_1.9.tar.gz/source/common/x86/intrapred8_allangs.asm
Changed
@@ -27,62 +27,63 @@ SECTION_RODATA 32 -all_ang4_shuff: db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 - db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7 - db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6 - db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5 - db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5 - db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4 - db 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3 - db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12 - db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 4, 0, 0, 9, 9, 10, 10, 11 - db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11 - db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 4, 2, 2, 0, 0, 9, 9, 10 - db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 3, 2, 2, 0, 0, 9, 9, 10 - db 0, 9, 9, 10, 10, 11, 11, 12, 1, 0, 0, 9, 9, 10, 10, 11, 2, 1, 1, 0, 0, 9, 9, 10, 4, 2, 2, 1, 1, 0, 0, 9 - db 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0, 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0 - db 0, 1, 1, 2, 2, 3, 3, 4, 9, 0, 0, 1, 1, 2, 2, 3, 10, 9, 9, 0, 0, 1, 1, 2, 12, 10, 10, 9, 9, 0, 0, 1 - db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 11, 10, 10, 0, 0, 1, 1, 2 - db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3, 12, 10, 10, 0, 0, 1, 1, 2 - db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3 - db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 12, 0, 0, 1, 1, 2, 2, 3 - db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4 - db 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4 - db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5 - db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6 - db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 2, 3, 3, 4, 4, 5, 5, 6 - db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7 - db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7, 4, 5, 5, 6, 6, 7, 7, 8 - db 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8 - -all_ang4: db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8 - db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20 - db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4 - db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20 - db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4 - db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 
10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20 - db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8 - db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24 - db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12 - db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28 - db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12 - db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28 - db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12 - db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24 - db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24 - db 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12 - db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28 - db 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12 - db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28 - db 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12 - db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24 - db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8 - db 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20 - db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4 - db 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20 - db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4 - db 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20 - db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8 +const allAng4_shuf_mode2, db 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7 +const allAng4_shuf_mode3_4, db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5 + db 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6 +const allAng4_shuf_mode5_6, db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4 + db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5 +const allAng4_shuf_mode7_8, db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4 + db 0, 1, 1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4, 4, 5, 0, 1, 1, 2, 2, 3, 3, 
4, 0, 1, 1, 2, 2, 3, 3, 4 +const allAng4_shuf_mode10, db 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3 +const allAng4_shuf_mode11_12, db 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12, 0, 9, 9, 10, 10, 11, 11, 12 +const allAng4_shuf_mode13_14, db 0, 9, 9, 10, 10, 11, 11, 12, 4, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11, 2, 0, 0, 9, 9, 10, 10, 11 +const allAng4_shuf_mode15_16, db 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11, 0, 9, 9, 10, 10, 11, 11, 12, 2, 0, 0, 9, 9, 10, 10, 11 + db 2, 0, 0, 9, 9, 10, 10, 11, 4, 2, 2, 0, 0, 9, 9, 10, 2, 0, 0, 9, 9, 10, 10, 11, 3, 2, 2, 0, 0, 9, 9, 10 +const allAng4_shuf_mode17, db 0, 9, 9, 10, 10, 11, 11, 12, 1, 0, 0, 9, 9, 10, 10, 11, 2, 1, 1, 0, 0, 9, 9, 10, 4, 2, 2, 1, 1, 0, 0, 9 + db 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0, 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0 +const allAng4_shuf_mode18, db 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0, 0, 1, 2, 3, 9, 0, 1, 2, 10, 9, 0, 1, 11, 10, 9, 0 +const allAng4_shuf_mode19_20, db 0, 1, 1, 2, 2, 3, 3, 4, 9, 0, 0, 1, 1, 2, 2, 3, 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3 + db 10, 9, 9, 0, 0, 1, 1, 2, 12, 10, 10, 9, 9, 0, 0, 1, 10, 0, 0, 1, 1, 2, 2, 3, 11, 10, 10, 0, 0, 1, 1, 2 +const allAng4_shuf_mode21_22, db 0, 1, 1, 2, 2, 3, 3, 4, 10, 0, 0, 1, 1, 2, 2, 3, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4 + db 10, 0, 0, 1, 1, 2, 2, 3, 12, 10, 10, 0, 0, 1, 1, 2, 10, 0, 0, 1, 1, 2, 2, 3, 10, 0, 0, 1, 1, 2, 2, 3 +const allAng4_shuf_mode23_24, db 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4 + db 0, 1, 1, 2, 2, 3, 3, 4, 12, 0, 0, 1, 1, 2, 2, 3, 0, 1, 1, 2, 2, 3, 3, 4, 0, 1, 1, 2, 2, 3, 3, 4 +const allAng4_shuf_mode26, db 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4 +const allAng4_shuf_mode27_28, db 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5, 1, 2, 2, 3, 3, 4, 4, 5 +const allAng4_shuf_mode29_30, db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 2, 3, 3, 4, 4, 5, 5, 6, 2, 3, 3, 4, 4, 5, 5, 6 +const allAng4_shuf_mode31_32, db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6 + db 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7 +const allAng4_shuf_mode33, db 1, 2, 2, 3, 3, 4, 4, 5, 2, 3, 3, 4, 4, 5, 5, 6, 3, 4, 4, 5, 5, 6, 6, 7, 4, 5, 5, 6, 6, 7, 7, 8 +const allAng4_shuf_mode34, db 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8 + +const allAng4_fact_mode3_4, db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10 + db 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20 +const allAng4_fact_mode5_6, db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26 + db 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20 +const allAng4_fact_mode7_8, db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10 + db 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20 +const allAng4_fact_mode9, db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 
4, 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8 +const allAng4_fact_mode11_12, db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22 + db 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12 +const allAng4_fact_mode13_14, db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6 + db 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12 +const allAng4_fact_mode15_16, db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22 + db 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12 +const allAng4_fact_mode17, db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24 +const allAng4_fact_mode19_20, db 26, 6, 26, 6, 26, 6, 26, 6, 20, 12, 20, 12, 20, 12, 20, 12, 21, 11, 21, 11, 21, 11, 21, 11, 10, 22, 10, 22, 10, 22, 10, 22 + db 14, 18, 14, 18, 14, 18, 14, 18, 8, 24, 8, 24, 8, 24, 8, 24, 31, 1, 31, 1, 31, 1, 31, 1, 20, 12, 20, 12, 20, 12, 20, 12 +const allAng4_fact_mode21_22, db 17, 15, 17, 15, 17, 15, 17, 15, 2, 30, 2, 30, 2, 30, 2, 30, 13, 19, 13, 19, 13, 19, 13, 19, 26, 6, 26, 6, 26, 6, 26, 6 + db 19, 13, 19, 13, 19, 13, 19, 13, 4, 28, 4, 28, 4, 28, 4, 28, 7, 25, 7, 25, 7, 25, 7, 25, 20, 12, 20, 12, 20, 12, 20, 12 +const allAng4_fact_mode23_24, db 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22 + db 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12 +const allAng4_fact_mode25, db 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24 +const allAng4_fact_mode27_28, db 30, 2, 30, 2, 30, 2, 30, 2, 28, 4, 28, 4, 28, 4, 28, 4, 27, 5, 27, 5, 27, 5, 27, 5, 22, 10, 22, 10, 22, 10, 22, 10 + db 26, 6, 26, 6, 26, 6, 26, 6, 24, 8, 24, 8, 24, 8, 24, 8, 17, 15, 17, 15, 17, 15, 17, 15, 12, 20, 12, 20, 12, 20, 12, 20 +const allAng4_fact_mode29_30, db 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26 + db 5, 27, 5, 27, 5, 27, 5, 27, 28, 4, 28, 4, 28, 4, 28, 4, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20 +const allAng4_fact_mode31_32, db 15, 17, 15, 17, 15, 17, 15, 17, 30, 2, 30, 2, 30, 2, 30, 2, 11, 21, 11, 21, 11, 21, 11, 21, 22, 10, 22, 10, 22, 10, 22, 10 + db 13, 19, 13, 19, 13, 19, 13, 19, 28, 4, 28, 4, 28, 4, 28, 4, 1, 31, 1, 31, 1, 31, 1, 31, 12, 20, 12, 20, 12, 20, 12, 20 +const allAng4_fact_mode33, db 6, 26, 6, 26, 6, 26, 6, 26, 12, 20, 12, 20, 12, 20, 12, 20, 18, 14, 18, 14, 18, 14, 18, 14, 24, 8, 24, 8, 24, 8, 24, 8 SECTION .text @@ -23075,80 +23076,69 @@ ; void all_angs_pred_4x4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma) ;----------------------------------------------------------------------------- INIT_YMM avx2 -cglobal all_angs_pred_4x4, 4, 4, 6 +cglobal all_angs_pred_4x4, 2, 2, 6 mova m5, [pw_1024] - lea r2, [all_ang4] - lea r3, [all_ang4_shuff] ; mode 2 vbroadcasti128 m0, [r1 + 9] - mova xm1, xm0 - psrldq xm1, 1 - pshufb xm1, [r3] + pshufb m1, m0, [allAng4_shuf_mode2] movu [r0], xm1 ; mode 3 - pshufb m1, m0, [r3 + 1 * mmsize] - pmaddubsw m1, [r2] + 
pshufb m1, m0, [allAng4_shuf_mode3_4] + pmaddubsw m1, [allAng4_fact_mode3_4] pmulhrsw m1, m5 ; mode 4 - pshufb m2, m0, [r3 + 2 * mmsize] - pmaddubsw m2, [r2 + 1 * mmsize] + pshufb m2, m0, [allAng4_shuf_mode3_4 + mmsize] + pmaddubsw m2, [allAng4_fact_mode3_4 + mmsize] pmulhrsw m2, m5 packuswb m1, m2 - vpermq m1, m1, 11011000b movu [r0 + (3 - 2) * 16], m1 ; mode 5 - pshufb m1, m0, [r3 + 2 * mmsize] - pmaddubsw m1, [r2 + 2 * mmsize] + pshufb m1, m0, [allAng4_shuf_mode5_6] + pmaddubsw m1, [allAng4_fact_mode5_6] pmulhrsw m1, m5 ; mode 6 - pshufb m2, m0, [r3 + 3 * mmsize] - pmaddubsw m2, [r2 + 3 * mmsize] + pshufb m2, m0, [allAng4_shuf_mode5_6 + mmsize] + pmaddubsw m2, [allAng4_fact_mode5_6 + mmsize] pmulhrsw m2, m5 packuswb m1, m2 - vpermq m1, m1, 11011000b movu [r0 + (5 - 2) * 16], m1 - add r3, 4 * mmsize - add r2, 4 * mmsize - ; mode 7 - pshufb m1, m0, [r3 + 0 * mmsize] - pmaddubsw m1, [r2 + 0 * mmsize] + pshufb m3, m0, [allAng4_shuf_mode7_8] + pmaddubsw m1, m3, [allAng4_fact_mode7_8] pmulhrsw m1, m5 ; mode 8 - pshufb m2, m0, [r3 + 1 * mmsize] - pmaddubsw m2, [r2 + 1 * mmsize] + pshufb m2, m0, [allAng4_shuf_mode7_8 + mmsize] + pmaddubsw m2, [allAng4_fact_mode7_8 + mmsize] pmulhrsw m2, m5 packuswb m1, m2 - vpermq m1, m1, 11011000b movu [r0 + (7 - 2) * 16], m1 ; mode 9 - pshufb m1, m0, [r3 + 1 * mmsize] - pmaddubsw m1, [r2 + 2 * mmsize] - pmulhrsw m1, m5 - packuswb m1, m1 - vpermq m1, m1, 11011000b - movu [r0 + (9 - 2) * 16], xm1 + pmaddubsw m3, [allAng4_fact_mode9] + pmulhrsw m3, m5 + packuswb m3, m3 + vpermq m3, m3, 11011000b + movu [r0 + (9 - 2) * 16], xm3 ; mode 10 - pshufb xm1, xm0, [r3 + 2 * mmsize] + pshufb xm1, xm0, [allAng4_shuf_mode10] movu [r0 + (10 - 2) * 16], xm1 pxor xm1, xm1 @@ -23173,135 +23163,111 @@ ; mode 11 vbroadcasti128 m0, [r1] - pshufb m1, m0, [r3 + 3 * mmsize] - pmaddubsw m1, [r2 + 3 * mmsize] + pshufb m3, m0, [allAng4_shuf_mode11_12] + pmaddubsw m1, m3, [allAng4_fact_mode11_12] pmulhrsw m1, m5 ; mode 12 - add r2, 4 * mmsize - - pshufb m2, m0, [r3 + 3 * mmsize] - pmaddubsw m2, [r2 + 0 * mmsize] + pmaddubsw m2, m3, [allAng4_fact_mode11_12 + mmsize] pmulhrsw m2, m5 packuswb m1, m2 - vpermq m1, m1, 11011000b movu [r0 + (11 - 2) * 16], m1 ; mode 13 - add r3, 4 * mmsize - - pshufb m1, m0, [r3 + 0 * mmsize] - pmaddubsw m1, [r2 + 1 * mmsize] - pmulhrsw m1, m5 + pmaddubsw m3, [allAng4_fact_mode13_14] + pmulhrsw m3, m5 ; mode 14 - pshufb m2, m0, [r3 + 1 * mmsize] - pmaddubsw m2, [r2 + 2 * mmsize] + pshufb m2, m0, [allAng4_shuf_mode13_14] + pmaddubsw m2, [allAng4_fact_mode13_14 + mmsize] pmulhrsw m2, m5 - packuswb m1, m2 - vpermq m1, m1, 11011000b - movu [r0 + (13 - 2) * 16], m1 + packuswb m3, m2 + movu [r0 + (13 - 2) * 16], m3 ; mode 15 - pshufb m1, m0, [r3 + 2 * mmsize] - pmaddubsw m1, [r2 + 3 * mmsize] + pshufb m1, m0, [allAng4_shuf_mode15_16] + pmaddubsw m1, [allAng4_fact_mode15_16] pmulhrsw m1, m5 ; mode 16 - add r2, 4 * mmsize - - pshufb m2, m0, [r3 + 3 * mmsize] - pmaddubsw m2, [r2 + 0 * mmsize] + pshufb m2, m0, [allAng4_shuf_mode15_16 + mmsize] + pmaddubsw m2, [allAng4_fact_mode15_16 + mmsize] pmulhrsw m2, m5 packuswb m1, m2 - vpermq m1, m1, 11011000b movu [r0 + (15 - 2) * 16], m1 ; mode 17 - add r3, 4 * mmsize - - pshufb m1, m0, [r3 + 0 * mmsize] - pmaddubsw m1, [r2 + 1 * mmsize] + pshufb m1, m0, [allAng4_shuf_mode17] + pmaddubsw m1, [allAng4_fact_mode17] pmulhrsw m1, m5 packuswb m1, m1 vpermq m1, m1, 11011000b ; mode 18 - pshufb m2, m0, [r3 + 1 * mmsize] + pshufb m2, m0, [allAng4_shuf_mode18] vinserti128 m1, m1, xm2, 1 movu [r0 + (17 - 2) * 16], m1 ; mode 19 - pshufb 
m1, m0, [r3 + 2 * mmsize] - pmaddubsw m1, [r2 + 2 * mmsize] + pshufb m1, m0, [allAng4_shuf_mode19_20] + pmaddubsw m1, [allAng4_fact_mode19_20] pmulhrsw m1, m5 ; mode 20 - pshufb m2, m0, [r3 + 3 * mmsize] - pmaddubsw m2, [r2 + 3 * mmsize] + pshufb m2, m0, [allAng4_shuf_mode19_20 + mmsize] + pmaddubsw m2, [allAng4_fact_mode19_20 + mmsize] pmulhrsw m2, m5 packuswb m1, m2 - vpermq m1, m1, 11011000b movu [r0 + (19 - 2) * 16], m1 ; mode 21 - add r2, 4 * mmsize - add r3, 4 * mmsize - - pshufb m1, m0, [r3 + 0 * mmsize] - pmaddubsw m1, [r2 + 0 * mmsize] + pshufb m1, m0, [allAng4_shuf_mode21_22] + pmaddubsw m1, [allAng4_fact_mode21_22] pmulhrsw m1, m5 ; mode 22 - pshufb m2, m0, [r3 + 1 * mmsize] - pmaddubsw m2, [r2 + 1 * mmsize] + pshufb m2, m0, [allAng4_shuf_mode21_22 + mmsize] + pmaddubsw m2, [allAng4_fact_mode21_22 + mmsize] pmulhrsw m2, m5 packuswb m1, m2 - vpermq m1, m1, 11011000b movu [r0 + (21 - 2) * 16], m1 ; mode 23 - pshufb m1, m0, [r3 + 2 * mmsize] - pmaddubsw m1, [r2 + 2 * mmsize] + pshufb m3, m0, [allAng4_shuf_mode23_24] + pmaddubsw m1, m3, [allAng4_fact_mode23_24] pmulhrsw m1, m5 ; mode 24 - pshufb m2, m0, [r3 + 3 * mmsize] - pmaddubsw m2, [r2 + 3 * mmsize] + pshufb m2, m0, [allAng4_shuf_mode23_24 + mmsize] + pmaddubsw m2, [allAng4_fact_mode23_24 + mmsize] pmulhrsw m2, m5 packuswb m1, m2 - vpermq m1, m1, 11011000b movu [r0 + (23 - 2) * 16], m1 ; mode 25 - add r2, 4 * mmsize - - pshufb m1, m0, [r3 + 3 * mmsize] - pmaddubsw m1, [r2 + 0 * mmsize] - pmulhrsw m1, m5 - packuswb m1, m1 - vpermq m1, m1, 11011000b - movu [r0 + (25 - 2) * 16], xm1 + pmaddubsw m3, [allAng4_fact_mode25] + pmulhrsw m3, m5 + packuswb m3, m3 + vpermq m3, m3, 11011000b + movu [r0 + (25 - 2) * 16], xm3 ; mode 26 - add r3, 4 * mmsize - - pshufb xm1, xm0, [r3 + 0 * mmsize] + pshufb m1, m0, [allAng4_shuf_mode26] movu [r0 + (26 - 2) * 16], xm1 pxor xm1, xm1 @@ -23326,64 +23292,55 @@ ; mode 27 - pshufb m1, m0, [r3 + 1 * mmsize] - pmaddubsw m1, [r2 + 1 * mmsize] + pshufb m3, m0, [allAng4_shuf_mode27_28] + pmaddubsw m1, m3, [allAng4_fact_mode27_28] pmulhrsw m1, m5 ; mode 28 - pshufb m2, m0, [r3 + 1 * mmsize] - pmaddubsw m2, [r2 + 2 * mmsize] + pmaddubsw m2, m3, [allAng4_fact_mode27_28 + mmsize] pmulhrsw m2, m5 packuswb m1, m2 - vpermq m1, m1, 11011000b movu [r0 + (27 - 2) * 16], m1 ; mode 29 - pshufb m1, m0, [r3 + 2 * mmsize] - pmaddubsw m1, [r2 + 3 * mmsize] - pmulhrsw m1, m5 + pmaddubsw m3, [allAng4_fact_mode29_30] + pmulhrsw m3, m5 ; mode 30 - add r2, 4 * mmsize - - pshufb m2, m0, [r3 + 3 * mmsize] - pmaddubsw m2, [r2 + 0 * mmsize] + pshufb m2, m0, [allAng4_shuf_mode29_30] + pmaddubsw m2, [allAng4_fact_mode29_30 + mmsize] pmulhrsw m2, m5 - packuswb m1, m2 - vpermq m1, m1, 11011000b - movu [r0 + (29 - 2) * 16], m1 + packuswb m3, m2 + movu [r0 + (29 - 2) * 16], m3 ; mode 31 - add r3, 4 * mmsize - - pshufb m1, m0, [r3 + 0 * mmsize] - pmaddubsw m1, [r2 + 1 * mmsize] + pshufb m1, m0, [allAng4_shuf_mode31_32] + pmaddubsw m1, [allAng4_fact_mode31_32] pmulhrsw m1, m5 ; mode 32 - pshufb m2, m0, [r3 + 0 * mmsize] - pmaddubsw m2, [r2 + 2 * mmsize] + pshufb m2, m0, [allAng4_shuf_mode31_32 + mmsize] + pmaddubsw m2, [allAng4_fact_mode31_32 + mmsize] pmulhrsw m2, m5 packuswb m1, m2 - vpermq m1, m1, 11011000b movu [r0 + (31 - 2) * 16], m1 ; mode 33 - pshufb m1, m0, [r3 + 1 * mmsize] - pmaddubsw m1, [r2 + 3 * mmsize] + pshufb m1, m0, [allAng4_shuf_mode33] + pmaddubsw m1, [allAng4_fact_mode33] pmulhrsw m1, m5 packuswb m1, m2 vpermq m1, m1, 11011000b ; mode 34 - pshufb m0, [r3 + 2 * mmsize] + pshufb m0, [allAng4_shuf_mode34] vinserti128 m1, 
m1, xm0, 1 movu [r0 + (33 - 2) * 16], m1 RET
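This rewrite splits the two monolithic all_ang4 / all_ang4_shuff tables into named per-mode constants, so each mode addresses its shuffle and factor rows directly instead of stepping r2/r3 by 4 * mmsize; that is what lets the prototype shrink from four GPRs to two and removes most of the vpermq lane fixups. The output contract is unchanged: one 4x4 prediction per angular mode, stored 16 bytes apart. A hedged C reference for the positive-angle modes (negative angles additionally project from the perpendicular reference; ref here stands for whichever reference row the mode reads, e.g. refPix + 9 for mode 2 above):

    #include <stdint.h>

    /* Sketch of one positive-angle 4x4 prediction, written to
     * dest + (mode - 2) * 16 as in the movu stores above. */
    static void predAng4Pos(uint8_t dst[16], const uint8_t *ref, int angle)
    {
        for (int y = 0; y < 4; y++)
        {
            int pos = (y + 1) * angle;   /* projected sub-pel position  */
            int i   = pos >> 5;          /* integer reference offset    */
            int f   = pos & 31;          /* fraction: weights (32-f, f) */
            for (int x = 0; x < 4; x++)
                dst[y * 4 + x] = (uint8_t)(((32 - f) * ref[x + i]
                                          + f * ref[x + i + 1] + 16) >> 5);
        }
    }

For mode 3 (angle 26) the four rows get fractions 26, 20, 14 and 8, exactly the (6,26), (12,20), (18,14), (24,8) pairs visible in allAng4_fact_mode3_4.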
View file
x265_1.8.tar.gz/source/common/x86/ipfilter16.asm -> x265_1.9.tar.gz/source/common/x86/ipfilter16.asm
Changed
@@ -4869,7 +4869,7 @@ %ifidn %2,pp vbroadcasti128 m8, [INTERP_OFFSET_PP] %elifidn %2, sp - mova m8, [INTERP_OFFSET_SP] + vbroadcasti128 m8, [INTERP_OFFSET_SP] %else vbroadcasti128 m8, [INTERP_OFFSET_PS] %endif @@ -5011,11 +5011,11 @@ mov r4d, %1/2 %ifidn %2, pp - mova m7, [INTERP_OFFSET_PP] + vbroadcasti128 m7, [INTERP_OFFSET_PP] %elifidn %2, sp - mova m7, [INTERP_OFFSET_SP] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %elifidn %2, ps - mova m7, [INTERP_OFFSET_PS] + vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif .loopH: @@ -5183,11 +5183,11 @@ mov r4d, %1/2 %ifidn %2, pp - mova m7, [INTERP_OFFSET_PP] + vbroadcasti128 m7, [INTERP_OFFSET_PP] %elifidn %2, sp - mova m7, [INTERP_OFFSET_SP] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %elifidn %2, ps - mova m7, [INTERP_OFFSET_PS] + vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif .loopH: @@ -5325,11 +5325,11 @@ mov r4d, %1/2 %ifidn %2, pp - mova m7, [INTERP_OFFSET_PP] + vbroadcasti128 m7, [INTERP_OFFSET_PP] %elifidn %2, sp - mova m7, [INTERP_OFFSET_SP] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %elifidn %2, ps - mova m7, [INTERP_OFFSET_PS] + vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif .loopH: @@ -5456,11 +5456,11 @@ mov r4d, %1/2 %ifidn %2, pp - mova m7, [INTERP_OFFSET_PP] + vbroadcasti128 m7, [INTERP_OFFSET_PP] %elifidn %2, sp - mova m7, [INTERP_OFFSET_SP] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %elifidn %2, ps - mova m7, [INTERP_OFFSET_PS] + vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif .loopH: @@ -5609,11 +5609,11 @@ mov r4d, %1/2 %ifidn %2, pp - mova m7, [INTERP_OFFSET_PP] + vbroadcasti128 m7, [INTERP_OFFSET_PP] %elifidn %2, sp - mova m7, [INTERP_OFFSET_SP] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %elifidn %2, ps - mova m7, [INTERP_OFFSET_PS] + vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif .loopH: @@ -5732,11 +5732,11 @@ mov r4d, 32 %ifidn %1, pp - mova m7, [INTERP_OFFSET_PP] + vbroadcasti128 m7, [INTERP_OFFSET_PP] %elifidn %1, sp - mova m7, [INTERP_OFFSET_SP] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %elifidn %1, ps - mova m7, [INTERP_OFFSET_PS] + vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif .loopH: @@ -6068,7 +6068,7 @@ %ifidn %1,pp vbroadcasti128 m6, [pd_32] %elifidn %1, sp - mova m6, [pd_524800] + vbroadcasti128 m6, [INTERP_OFFSET_SP] %else vbroadcasti128 m6, [INTERP_OFFSET_PS] %endif @@ -6178,7 +6178,7 @@ %ifidn %1,pp vbroadcasti128 m11, [pd_32] %elifidn %1, sp - mova m11, [pd_524800] + vbroadcasti128 m11, [INTERP_OFFSET_SP] %else vbroadcasti128 m11, [INTERP_OFFSET_PS] %endif @@ -6816,7 +6816,7 @@ %ifidn %1,pp vbroadcasti128 m14, [pd_32] %elifidn %1, sp - mova m14, [INTERP_OFFSET_SP] + vbroadcasti128 m14, [INTERP_OFFSET_SP] %else vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif @@ -6867,7 +6867,7 @@ %ifidn %3,pp vbroadcasti128 m14, [pd_32] %elifidn %3, sp - mova m14, [INTERP_OFFSET_SP] + vbroadcasti128 m14, [INTERP_OFFSET_SP] %else vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif @@ -6950,7 +6950,7 @@ %ifidn %1,pp vbroadcasti128 m14, [pd_32] %elifidn %1, sp - mova m14, [INTERP_OFFSET_SP] + vbroadcasti128 m14, [INTERP_OFFSET_SP] %else vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif @@ -7597,7 +7597,7 @@ %ifidn %1,pp vbroadcasti128 m11, [pd_32] %elifidn %1, sp - mova m11, [INTERP_OFFSET_SP] + vbroadcasti128 m11, [INTERP_OFFSET_SP] %else vbroadcasti128 m11, [INTERP_OFFSET_PS] %endif @@ -7644,7 +7644,7 @@ %ifidn %1,pp vbroadcasti128 m14, [pd_32] %elifidn %1, sp - mova m14, [INTERP_OFFSET_SP] + vbroadcasti128 m14, [INTERP_OFFSET_SP] %else vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif @@ -7816,7 +7816,7 @@ %ifidn %1,pp vbroadcasti128 m7, [pd_32] %elifidn %1, sp - mova m7, 
[INTERP_OFFSET_SP] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %else vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif @@ -7861,7 +7861,7 @@ %ifidn %1,pp vbroadcasti128 m7, [pd_32] %elifidn %1, sp - mova m7, [INTERP_OFFSET_SP] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %else vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif @@ -7901,7 +7901,7 @@ %ifidn %1,pp vbroadcasti128 m14, [pd_32] %elifidn %1, sp - mova m14, [INTERP_OFFSET_SP] + vbroadcasti128 m14, [INTERP_OFFSET_SP] %else vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif @@ -8248,7 +8248,7 @@ %ifidn %1,pp vbroadcasti128 m7, [pd_32] %elifidn %1, sp - mova m7, [INTERP_OFFSET_SP] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %else vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif @@ -8668,7 +8668,7 @@ %ifidn %1,pp vbroadcasti128 m7, [pd_32] %elifidn %1, sp - mova m7, [INTERP_OFFSET_SP] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %else vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif @@ -8703,7 +8703,7 @@ %ifidn %1,pp vbroadcasti128 m14, [pd_32] %elifidn %1, sp - mova m14, [INTERP_OFFSET_SP] + vbroadcasti128 m14, [INTERP_OFFSET_SP] %else vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif @@ -10342,8 +10342,8 @@ vpermd m3, m5, m3 paddd m3, m2 vextracti128 xm4, m3, 1 - psrad xm3, 2 - psrad xm4, 2 + psrad xm3, INTERP_SHIFT_PS + psrad xm4, INTERP_SHIFT_PS packssdw xm3, xm3 packssdw xm4, xm4 @@ -10375,8 +10375,8 @@ vpermd m3, m5, m3 paddd m3, m2 vextracti128 xm4, m3, 1 - psrad xm3, 2 - psrad xm4, 2 + psrad xm3, INTERP_SHIFT_PS + psrad xm4, INTERP_SHIFT_PS packssdw xm3, xm3 packssdw xm4, xm4 @@ -10441,8 +10441,8 @@ vpermq m4, m4, q3120 paddd m4, m2 vextracti128 xm5,m4, 1 - psrad xm4, 2 - psrad xm5, 2 + psrad xm4, INTERP_SHIFT_PS + psrad xm5, INTERP_SHIFT_PS packssdw xm4, xm5 movu [r2], xm4 @@ -10511,8 +10511,8 @@ vpermq m4, m4, q3120 paddd m4, m2 vextracti128 xm5,m4, 1 - psrad xm4, 2 - psrad xm5, 2 + psrad xm4, INTERP_SHIFT_PS + psrad xm5, INTERP_SHIFT_PS packssdw xm4, xm5 movu [r2 + x], xm4 @@ -10583,8 +10583,8 @@ vpermq m4, m4, q3120 paddd m4, m2 vextracti128 xm5,m4, 1 - psrad xm4, 2 - psrad xm5, 2 + psrad xm4, INTERP_SHIFT_PS + psrad xm5, INTERP_SHIFT_PS packssdw xm4, xm5 movu [r2 + x], xm4 @@ -10609,8 +10609,8 @@ vpermq m6, m6, q3120 paddd m6, m2 vextracti128 xm5,m6, 1 - psrad xm6, 2 - psrad xm5, 2 + psrad xm6, INTERP_SHIFT_PS + psrad xm5, INTERP_SHIFT_PS packssdw xm6, xm5 movu [r2 + 16 + x], xm6 @@ -10690,8 +10690,8 @@ vpermq m4, m4, q3120 paddd m4, m2 vextracti128 xm5, m4, 1 - psrad xm4, 2 - psrad xm5, 2 + psrad xm4, INTERP_SHIFT_PS + psrad xm5, INTERP_SHIFT_PS packssdw xm4, xm5 movu [r2], xm4 @@ -10713,8 +10713,8 @@ vpermq m6, m6, q3120 paddd m6, m2 vextracti128 xm5,m6, 1 - psrad xm6, 2 - psrad xm5, 2 + psrad xm6, INTERP_SHIFT_PS + psrad xm5, INTERP_SHIFT_PS packssdw xm6, xm5 movu [r2 + 16], xm6 @@ -10783,8 +10783,8 @@ vpermq m4, m4, q3120 paddd m4, m2 vextracti128 xm5,m4, 1 - psrad xm4, 2 - psrad xm5, 2 + psrad xm4, INTERP_SHIFT_PS + psrad xm5, INTERP_SHIFT_PS packssdw xm4, xm5 movu [r2], xm4 @@ -10798,7 +10798,7 @@ phaddd m6, m6 vpermq m6, m6, q3120 paddd xm6, xm2 - psrad xm6, 2 + psrad xm6, INTERP_SHIFT_PS packssdw xm6, xm6 movq [r2 + 16], xm6 @@ -10847,7 +10847,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2], xm4 @@ -10906,7 +10906,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2], xm4 @@ -10920,7 +10920,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, 
INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 16], xm4 @@ -10979,7 +10979,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2], xm4 @@ -10993,7 +10993,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 16], xm4 @@ -11007,7 +11007,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 32], xm4 @@ -11061,7 +11061,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2], xm4 @@ -11072,7 +11072,7 @@ phaddd m4, m4 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movq [r2 + 16], xm4 @@ -11126,7 +11126,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2], xm4 @@ -11140,7 +11140,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 16], xm4 @@ -11154,7 +11154,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 32], xm4 @@ -11168,7 +11168,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 48], xm4 @@ -11227,7 +11227,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2], xm4 @@ -11241,7 +11241,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 16], xm4 @@ -11255,7 +11255,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 32], xm4 @@ -11269,7 +11269,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 48], xm4 @@ -11283,7 +11283,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 64], xm4 @@ -11297,7 +11297,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 80], xm4 @@ -11311,7 +11311,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 96], xm4 @@ -11325,7 +11325,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 112], xm4 @@ -11380,7 +11380,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2], xm4 @@ -11394,7 +11394,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 16], xm4 @@ -11408,7 +11408,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 
movu [r2 + 32], xm4 @@ -11422,7 +11422,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 48], xm4 @@ -11436,7 +11436,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 64], xm4 @@ -11450,7 +11450,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movu [r2 + 80], xm4 @@ -11500,7 +11500,7 @@ phaddd m4, m5 paddd m4, m2 vpermq m4, m4, q3120 - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS vextracti128 xm5, m4, 1 packssdw xm4, xm5 movq [r2], xm4 @@ -11537,7 +11537,7 @@ %ifidn %1,pp vbroadcasti128 m14, [pd_32] %elifidn %1, sp - mova m14, [pd_524800] + vbroadcasti128 m14, [INTERP_OFFSET_SP] %else vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif @@ -11665,19 +11665,19 @@ psrad m4, 6 psrad m5, 6 %elifidn %1, sp - psrad m0, 10 - psrad m1, 10 - psrad m2, 10 - psrad m3, 10 - psrad m4, 10 - psrad m5, 10 -%else - psrad m0, 2 - psrad m1, 2 - psrad m2, 2 - psrad m3, 2 - psrad m4, 2 - psrad m5, 2 + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m4, INTERP_SHIFT_SP + psrad m5, INTERP_SHIFT_SP +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS %endif %endif @@ -11736,11 +11736,11 @@ psrad m6, 6 psrad m7, 6 %elifidn %1, sp - psrad m6, 10 - psrad m7, 10 + psrad m6, INTERP_SHIFT_SP + psrad m7, INTERP_SHIFT_SP %else - psrad m6, 2 - psrad m7, 2 + psrad m6, INTERP_SHIFT_PS + psrad m7, INTERP_SHIFT_PS %endif %endif @@ -11814,23 +11814,23 @@ psrad m0, 6 psrad m1, 6 %elifidn %1, sp - psrad m8, 10 - psrad m9, 10 - psrad m10, 10 - psrad m11, 10 - psrad m12, 10 - psrad m13, 10 - psrad m0, 10 - psrad m1, 10 -%else - psrad m8, 2 - psrad m9, 2 - psrad m10, 2 - psrad m11, 2 - psrad m12, 2 - psrad m13, 2 - psrad m0, 2 - psrad m1, 2 + psrad m8, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP + psrad m12, INTERP_SHIFT_SP + psrad m13, INTERP_SHIFT_SP + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP +%else + psrad m8, INTERP_SHIFT_PS + psrad m9, INTERP_SHIFT_PS + psrad m10, INTERP_SHIFT_PS + psrad m11, INTERP_SHIFT_PS + psrad m12, INTERP_SHIFT_PS + psrad m13, INTERP_SHIFT_PS + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS %endif %endif @@ -11954,7 +11954,7 @@ %ifidn %1,pp vbroadcasti128 m7, [pd_32] %elifidn %1, sp - mova m7, [pd_524800] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %else vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif @@ -11966,8 +11966,8 @@ %endmacro FILTER_VER_CHROMA_AVX2_8x2 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_8x2 ps, 0, 2 -FILTER_VER_CHROMA_AVX2_8x2 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_8x2 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_8x2 sp, 1, INTERP_SHIFT_SP FILTER_VER_CHROMA_AVX2_8x2 ss, 0, 6 %macro FILTER_VER_CHROMA_AVX2_4x2 3 @@ -11991,7 +11991,7 @@ %ifidn %1,pp vbroadcasti128 m6, [pd_32] %elifidn %1, sp - mova m6, [pd_524800] + vbroadcasti128 m6, [INTERP_OFFSET_SP] %else vbroadcasti128 m6, [INTERP_OFFSET_PS] %endif @@ -12033,8 +12033,8 @@ %endmacro FILTER_VER_CHROMA_AVX2_4x2 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_4x2 ps, 0, 2 -FILTER_VER_CHROMA_AVX2_4x2 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_4x2 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_4x2 sp, 1, INTERP_SHIFT_SP FILTER_VER_CHROMA_AVX2_4x2 ss, 0, 
6 %macro FILTER_VER_CHROMA_AVX2_4x4 3 @@ -12058,7 +12058,7 @@ %ifidn %1,pp vbroadcasti128 m6, [pd_32] %elifidn %1, sp - mova m6, [pd_524800] + vbroadcasti128 m6, [INTERP_OFFSET_SP] %else vbroadcasti128 m6, [INTERP_OFFSET_PS] %endif @@ -12112,8 +12112,8 @@ %endmacro FILTER_VER_CHROMA_AVX2_4x4 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_4x4 ps, 0, 2 -FILTER_VER_CHROMA_AVX2_4x4 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_4x4 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_4x4 sp, 1, INTERP_SHIFT_SP FILTER_VER_CHROMA_AVX2_4x4 ss, 0, 6 @@ -12138,7 +12138,7 @@ %ifidn %1,pp vbroadcasti128 m7, [pd_32] %elifidn %1, sp - mova m7, [pd_524800] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %else vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif @@ -12225,8 +12225,8 @@ %endmacro FILTER_VER_CHROMA_AVX2_4x8 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_4x8 ps, 0, 2 -FILTER_VER_CHROMA_AVX2_4x8 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_4x8 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_4x8 sp, 1, INTERP_SHIFT_SP FILTER_VER_CHROMA_AVX2_4x8 ss, 0 , 6 %macro PROCESS_LUMA_AVX2_W4_16R_4TAP 3 @@ -12396,7 +12396,7 @@ %ifidn %1,pp vbroadcasti128 m7, [pd_32] %elifidn %1, sp - mova m7, [pd_524800] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %else vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif @@ -12410,12 +12410,12 @@ %endmacro FILTER_VER_CHROMA_AVX2_4xN pp, 16, 1, 6 -FILTER_VER_CHROMA_AVX2_4xN ps, 16, 0, 2 -FILTER_VER_CHROMA_AVX2_4xN sp, 16, 1, 10 +FILTER_VER_CHROMA_AVX2_4xN ps, 16, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_4xN sp, 16, 1, INTERP_SHIFT_SP FILTER_VER_CHROMA_AVX2_4xN ss, 16, 0, 6 FILTER_VER_CHROMA_AVX2_4xN pp, 32, 1, 6 -FILTER_VER_CHROMA_AVX2_4xN ps, 32, 0, 2 -FILTER_VER_CHROMA_AVX2_4xN sp, 32, 1, 10 +FILTER_VER_CHROMA_AVX2_4xN ps, 32, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_4xN sp, 32, 1, INTERP_SHIFT_SP FILTER_VER_CHROMA_AVX2_4xN ss, 32, 0, 6 %macro FILTER_VER_CHROMA_AVX2_8x8 3 @@ -12429,7 +12429,7 @@ %ifdef PIC lea r5, [tab_ChromaCoeffVer] - add r5, r4 + add r5, r4 %else lea r5, [tab_ChromaCoeffVer + r4] %endif @@ -12440,7 +12440,7 @@ %ifidn %1,pp vbroadcasti128 m11, [pd_32] %elifidn %1, sp - mova m11, [pd_524800] + vbroadcasti128 m11, [INTERP_OFFSET_SP] %else vbroadcasti128 m11, [INTERP_OFFSET_PS] %endif @@ -12569,8 +12569,8 @@ %endmacro FILTER_VER_CHROMA_AVX2_8x8 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_8x8 ps, 0, 2 -FILTER_VER_CHROMA_AVX2_8x8 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_8x8 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_8x8 sp, 1, INTERP_SHIFT_SP FILTER_VER_CHROMA_AVX2_8x8 ss, 0, 6 %macro FILTER_VER_CHROMA_AVX2_8x6 3 @@ -12595,7 +12595,7 @@ %ifidn %1,pp vbroadcasti128 m11, [pd_32] %elifidn %1, sp - mova m11, [pd_524800] + vbroadcasti128 m11, [INTERP_OFFSET_SP] %else vbroadcasti128 m11, [INTERP_OFFSET_PS] %endif @@ -12700,8 +12700,8 @@ %endmacro FILTER_VER_CHROMA_AVX2_8x6 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_8x6 ps, 0, 2 -FILTER_VER_CHROMA_AVX2_8x6 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_8x6 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_8x6 sp, 1, INTERP_SHIFT_SP FILTER_VER_CHROMA_AVX2_8x6 ss, 0, 6 %macro PROCESS_CHROMA_AVX2 3 @@ -12785,7 +12785,7 @@ %ifidn %1,pp vbroadcasti128 m7, [pd_32] %elifidn %1, sp - mova m7, [pd_524800] + vbroadcasti128 m7, [INTERP_OFFSET_SP] %else vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif @@ -12799,8 +12799,8 @@ %endmacro FILTER_VER_CHROMA_AVX2_8x4 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_8x4 ps, 0, 2 -FILTER_VER_CHROMA_AVX2_8x4 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_8x4 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_8x4 sp, 1, INTERP_SHIFT_SP FILTER_VER_CHROMA_AVX2_8x4 ss, 0, 6 %macro FILTER_VER_CHROMA_AVX2_8x12 3 @@ -12824,7 +12824,7 @@ %ifidn 
%1,pp vbroadcasti128 m14, [pd_32] %elifidn %1, sp - mova m14, [pd_524800] + vbroadcasti128 m14, [INTERP_OFFSET_SP] %else vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif @@ -13002,6 +13002,6 @@ %endmacro FILTER_VER_CHROMA_AVX2_8x12 pp, 1, 6 -FILTER_VER_CHROMA_AVX2_8x12 ps, 0, 2 -FILTER_VER_CHROMA_AVX2_8x12 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_8x12 ps, 0, INTERP_SHIFT_PS +FILTER_VER_CHROMA_AVX2_8x12 sp, 1, INTERP_SHIFT_SP FILTER_VER_CHROMA_AVX2_8x12 ss, 0, 6
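Two recurring fixes run through this hunk. First, loading the 16-byte INTERP_OFFSET_SP constant into a ymm register with mova is replaced by vbroadcasti128; with a plain mova the upper 128 bits would be filled from whatever follows the constant in the data section, which reads like a latent high-bit-depth bug. Second, the hard-coded shift counts 2 and 10 (and the raw pd_524800 references) become the INTERP_SHIFT_PS / INTERP_SHIFT_SP / INTERP_OFFSET_SP macros, so the same source assembles correctly for both main10 and main12. A hedged derivation of those values, assuming the usual x265 constants IF_INTERNAL_PREC = 14, IF_FILTER_PREC = 6 and IF_INTERNAL_OFFS = 8192:

    #include <stdio.h>

    int main(void)
    {
        for (int depth = 10; depth <= 12; depth += 2)
        {
            int headRoom  = 14 - depth;    /* IF_INTERNAL_PREC - X265_DEPTH */
            int shift_ps  = 6 - headRoom;  /* INTERP_SHIFT_PS               */
            int shift_sp  = 6 + headRoom;  /* INTERP_SHIFT_SP               */
            /* internal-bias removal plus rounding term */
            int offset_sp = (8192 << 6) + (1 << (shift_sp - 1));
            printf("depth %2d: ps >> %d, sp >> %d, sp offset %d\n",
                   depth, shift_ps, shift_sp, offset_sp);
        }
        return 0;
    }

At depth 10 this prints ps >> 2, sp >> 10 and offset 524800, i.e. exactly the literals the patch removes.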
View file
x265_1.8.tar.gz/source/common/x86/ipfilter8.asm -> x265_1.9.tar.gz/source/common/x86/ipfilter8.asm
Changed
@@ -12541,6 +12541,459 @@ ;----------------------------------------------------------------------------- ; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) ;----------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal filterPixelToShort_16x4, 3, 4, 2 + mov r3d, r3m + add r3d, r3d + + ; load constant + vbroadcasti128 m1, [pw_2000] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + lea r1, [r1 * 3] + lea r3, [r3 * 3] + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + RET + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal filterPixelToShort_16x8, 3, 6, 2 + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r5, [r3 * 3] + + ; load constant + vbroadcasti128 m1, [pw_2000] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + RET + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal filterPixelToShort_16x12, 3, 6, 2 + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r5, [r3 * 3] + + ; load constant + vbroadcasti128 m1, [pw_2000] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + RET + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) 
+;----------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal filterPixelToShort_16x16, 3, 6, 2 + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r5, [r3 * 3] + + ; load constant + vbroadcasti128 m1, [pw_2000] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + RET + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal filterPixelToShort_16x24, 3, 7, 2 + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r5, [r3 * 3] + mov r6d, 3 + + ; load constant + vbroadcasti128 m1, [pw_2000] +.loop: + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +%macro P2S_H_16xN_avx2 1 +INIT_YMM avx2 +cglobal filterPixelToShort_16x%1, 3, 7, 2 + mov r3d, r3m + add r3d, r3d + lea r4, [r1 * 3] + lea r5, [r3 * 3] + mov r6d, %1/16 + + ; load constant + vbroadcasti128 m1, [pw_2000] +.loop: + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, 
[r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxbw m0, [r0] + psllw m0, 6 + psubw m0, m1 + movu [r2], m0 + + pmovzxbw m0, [r0 + r1] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3], m0 + + pmovzxbw m0, [r0 + r1 * 2] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r3 * 2], m0 + + pmovzxbw m0, [r0 + r4] + psllw m0, 6 + psubw m0, m1 + movu [r2 + r5], m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r6d + jnz .loop + RET +%endmacro +P2S_H_16xN_avx2 32 +P2S_H_16xN_avx2 64 + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- %macro P2S_H_32xN 1 INIT_XMM ssse3 cglobal filterPixelToShort_32x%1, 3, 7, 6 @@ -25016,67 +25469,57 @@ ; void interp_4tap_horiz_ps_32x32(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx, int isRowExt) ;-----------------------------------------------------------------------------------------------------------------------------; INIT_YMM avx2 -cglobal interp_4tap_horiz_ps_32x32, 4,7,6 +cglobal interp_4tap_horiz_ps_32x32, 4,6,8 mov r4d, r4m - mov r5d, r5m add r3d, r3d + dec r0 -%ifdef PIC - lea r6, [tab_ChromaCoeff] - vpbroadcastd m0, [r6 + r4 * 4] -%else - vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] -%endif + ; check isRowExt + cmp r5m, byte 0 - vbroadcasti128 m2, [pw_1] - vbroadcasti128 m5, [pw_2000] - mova m1, [tab_Tm] + lea r5, [tab_ChromaCoeff] + vpbroadcastw m0, [r5 + r4 * 4 + 0] + vpbroadcastw m1, [r5 + r4 * 4 + 2] + mova m7, [pw_2000] ; register map - ; m0 - interpolate coeff - ; m1 - shuffle order table - ; m2 - constant word 1 - mov r6d, 32 - dec r0 - test r5d, r5d - je .loop - sub r0 , r1 - add r6d , 3 + ; m0 - interpolate coeff Low + ; m1 - interpolate coeff High + ; m7 - constant pw_2000 + mov r4d, 32 + je .loop + sub r0, r1 + add r4d, 3 .loop ; Row 0 - vbroadcasti128 m3, [r0] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 8] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, 11011000b - movu [r2], m3 - - vbroadcasti128 m3, [r0 + 16] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m3, m1 - pmaddubsw m3, m0 - pmaddwd m3, m2 - vbroadcasti128 m4, [r0 + 24] ; [x x x x x A 9 8 7 6 5 4 3 2 1 0] - pshufb m4, m1 - pmaddubsw m4, m0 - pmaddwd m4, m2 - - packssdw m3, m4 - psubw m3, m5 - vpermq m3, m3, 11011000b - movu [r2 + 32], m3 - - add r2, r3 - add r0, r1 - dec r6d - jnz .loop + movu m2, [r0] + movu m3, [r0 + 1] + punpckhbw m4, m2, m3 + punpcklbw m2, m3 + pmaddubsw m4, m0 + pmaddubsw m2, m0 + + movu m3, [r0 + 2] + movu m5, 
[r0 + 3] + punpckhbw m6, m3, m5 + punpcklbw m3, m5 + pmaddubsw m6, m1 + pmaddubsw m3, m1 + + paddw m4, m6 + paddw m2, m3 + psubw m4, m7 + psubw m2, m7 + vperm2i128 m3, m2, m4, 0x20 + vperm2i128 m5, m2, m4, 0x31 + movu [r2], m3 + movu [r2 + mmsize], m5 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop RET ;-----------------------------------------------------------------------------------------------------------------------------
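Besides the interp_4tap_horiz_ps_32x32 rework at the end (which trades the pshufb/pmaddwd pipeline for punpck plus separately broadcast low/high coefficient words), the bulk of this hunk adds AVX2 filterPixelToShort_16xN kernels built from one three-instruction row: pmovzxbw to widen 16 pixels, psllw 6 to scale them to the 14-bit internal precision, and psubw against pw_2000 to apply the internal bias. One row reduces to the following C (the helper name pixelToShortRow is ours):

    #include <stdint.h>

    /* One row of filterPixelToShort for 8-bit input:
     * dst[x] = (src[x] << 6) - 0x2000, i.e. pmovzxbw + psllw 6 + psubw pw_2000. */
    static void pixelToShortRow(const uint8_t *src, int16_t *dst, int width)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)((src[x] << 6) - 0x2000);
    }

The 16x4 through 16x16 variants unroll this fully, 16x24 loops three times over an eight-row body, and the P2S_H_16xN_avx2 macro covers 16x32 and 16x64.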
View file
x265_1.8.tar.gz/source/common/x86/loopfilter.asm -> x265_1.9.tar.gz/source/common/x86/loopfilter.asm
Changed
@@ -26,24 +26,28 @@ ;*****************************************************************************/ %include "x86inc.asm" +%include "x86util.asm" SECTION_RODATA 32 pb_31: times 32 db 31 pb_124: times 32 db 124 pb_15: times 32 db 15 -pb_movemask_32: times 32 db 0x00 - times 32 db 0xFF SECTION .text cextern pb_1 -cextern pb_128 cextern pb_2 +cextern pb_3 +cextern pb_4 +cextern pb_01 +cextern pb_128 +cextern pw_1 +cextern pw_n1 cextern pw_2 +cextern pw_4 cextern pw_pixel_max cextern pb_movemask -cextern pw_1 +cextern pb_movemask_32 cextern hmul_16p -cextern pb_4 ;============================================================================================================ @@ -1989,79 +1993,94 @@ %endif ;-------------------------------------------------------------------------------------------------------------------------- -; saoCuStatsBO_c(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) +; saoCuStatsBO_c(const int16_t *diff, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) ;-------------------------------------------------------------------------------------------------------------------------- %if ARCH_X86_64 INIT_XMM sse4 -cglobal saoCuStatsBO, 7,12,6 - mova m3, [hmul_16p + 16] - mova m4, [pb_124] - mova m5, [pb_4] - xor r7d, r7d +cglobal saoCuStatsBO, 7,13,2 + mova m0, [pb_124] + add r5, 4 + add r6, 4 .loopH: - mov r10, r0 + mov r12, r0 mov r11, r1 mov r9d, r3d + .loopL: movu m1, [r11] - movu m0, [r10] + psrlw m1, 1 ; rec[x] >> boShift + pand m1, m0 - punpckhbw m2, m0, m1 - punpcklbw m0, m1 - psrlw m1, 1 ; rec[x] >> boShift - pmaddubsw m2, m3 - pmaddubsw m0, m3 - pand m1, m4 - paddb m1, m5 + cmp r9d, 8 + jle .proc8 + movq r10, m1 %assign x 0 -%rep 16 - pextrb r7d, m1, x +%rep 8 + movzx r7d, r10b + shr r10, 8 -%if (x < 8) - pextrw r8d, m0, (x % 8) -%else - pextrw r8d, m2, (x % 8) -%endif - movsx r8d, r8w - inc dword [r6 + r7] ; count[classIdx]++ - add [r5 + r7], r8d ; stats[classIdx] += (fenc[x] - rec[x]); + movsx r8d, word [r12 + x*2] ; diff[x] + inc dword [r6 + r7] ; count[classIdx]++ + add [r5 + r7], r8d ; stats[classIdx] += (fenc[x] - rec[x]); +%assign x x+1 +%endrep + movhlps m1, m1 + sub r9d, 8 + add r12, 8*2 + +.proc8: + movq r10, m1 +%assign x 0 +%rep 8 + movzx r7d, r10b + shr r10, 8 + + movsx r8d, word [r12 + x*2] ; diff[x] + inc dword [r6 + r7] ; count[classIdx]++ + add [r5 + r7], r8d ; stats[classIdx] += (fenc[x] - rec[x]); dec r9d - jz .next + jz .next %assign x x+1 %endrep - add r10, 16 + add r12, 8*2 add r11, 16 - jmp .loopL + jmp .loopL .next: - add r0, r2 + add r0, 64*2 ; MAX_CU_SIZE add r1, r2 dec r4d - jnz .loopH + jnz .loopH RET %endif ;----------------------------------------------------------------------------------------------------------------------- -; saoCuStatsE0(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) +; saoCuStatsE0(const int16_t *diff, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) ;----------------------------------------------------------------------------------------------------------------------- %if ARCH_X86_64 INIT_XMM sse4 -cglobal saoCuStatsE0, 5,9,8, 0-32 +cglobal saoCuStatsE0, 3,10,6, 0-32 mov r3d, r3m - mov r8, r5mp + mov r4d, r4m + mov r9, r5mp ; clear internal temporary buffer pxor m0, m0 mova [rsp], m0 mova [rsp + mmsize], m0 mova m4, [pb_128] - mova m5, [hmul_16p + 16] - mova m6, [pb_2] + mova m5, [pb_2] xor r7d, r7d + ; correct stride for diff[] 
and rec + mov r6d, r3d + and r6d, ~15 + sub r2, r6 + lea r8, [(r6 - 64) * 2] ; 64 = MAX_CU_SIZE + .loopH: mov r5d, r3d @@ -2075,100 +2094,257 @@ pinsrb m0, r7d, 15 .loopL: - movu m7, [r1] + movu m3, [r1] movu m2, [r1 + 1] - pxor m1, m7, m4 - pxor m3, m2, m4 - pcmpgtb m2, m1, m3 - pcmpgtb m3, m1 - pand m2, [pb_1] - por m2, m3 ; signRight + pxor m1, m3, m4 + pxor m2, m4 + pcmpgtb m3, m1, m2 + pcmpgtb m2, m1 + pand m3, [pb_1] + por m2, m3 ; signRight palignr m3, m2, m0, 15 - psignb m3, m4 ; signLeft + psignb m3, m4 ; signLeft mova m0, m2 paddb m2, m3 - paddb m2, m6 ; edgeType + paddb m2, m5 ; edgeType ; stats[edgeType] - movu m3, [r0] ; fenc[0-15] - punpckhbw m1, m3, m7 - punpcklbw m3, m7 - pmaddubsw m1, m5 - pmaddubsw m3, m5 - %assign x 0 %rep 16 pextrb r7d, m2, x -%if (x < 8) - pextrw r6d, m3, (x % 8) -%else - pextrw r6d, m1, (x % 8) -%endif - movsx r6d, r6w + movsx r6d, word [r0 + x * 2] inc word [rsp + r7 * 2] ; tmp_count[edgeType]++ add [rsp + 5 * 2 + r7 * 4], r6d ; tmp_stats[edgeType] += (fenc[x] - rec[x]) dec r5d - jz .next + jz .next %assign x x+1 %endrep - add r0q, 16 - add r1q, 16 - jmp .loopL + add r0, 16*2 + add r1, 16 + jmp .loopL .next: - mov r6d, r3d - and r6d, 15 - - sub r6, r3 - add r6, r2 - add r0, r6 - add r1, r6 + sub r0, r8 + add r1, r2 dec r4d - jnz .loopH + jnz .loopH ; sum to global buffer mov r0, r6mp ; s_eoTable = {1, 2, 0, 3, 4} - movzx r5d, word [rsp + 0 * 2] - add [r0 + 1 * 4], r5d - movzx r6d, word [rsp + 1 * 2] - add [r0 + 2 * 4], r6d - movzx r5d, word [rsp + 2 * 2] - add [r0 + 0 * 4], r5d - movzx r6d, word [rsp + 3 * 2] - add [r0 + 3 * 4], r6d + pmovzxwd m0, [rsp + 0 * 2] + pshufd m0, m0, q3102 + movu m1, [r0] + paddd m0, m1 + movu [r0], m0 movzx r5d, word [rsp + 4 * 2] add [r0 + 4 * 4], r5d - mov r6d, [rsp + 5 * 2 + 0 * 4] - add [r8 + 1 * 4], r6d - mov r5d, [rsp + 5 * 2 + 1 * 4] - add [r8 + 2 * 4], r5d - mov r6d, [rsp + 5 * 2 + 2 * 4] - add [r8 + 0 * 4], r6d - mov r5d, [rsp + 5 * 2 + 3 * 4] - add [r8 + 3 * 4], r5d + movu m0, [rsp + 5 * 2 + 0 * 4] + pshufd m0, m0, q3102 + movu m1, [r9] + paddd m0, m1 + movu [r9], m0 mov r6d, [rsp + 5 * 2 + 4 * 4] - add [r8 + 4 * 4], r6d + add [r9 + 4 * 4], r6d + RET + + +;----------------------------------------------------------------------------------------------------------------------- +; saoCuStatsE0(const int16_t *diff, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) +;----------------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +; spending rbp register to avoid x86inc stack alignment problem +cglobal saoCuStatsE0, 3,11,16 + mov r3d, r3m + mov r4d, r4m + mov r9, r5mp + + ; clear internal temporary buffer + pxor xm6, xm6 ; count[0] + pxor xm7, xm7 ; count[1] + pxor xm8, xm8 ; count[2] + pxor xm9, xm9 ; count[3] + pxor xm10, xm10 ; count[4] + pxor xm11, xm11 ; stats[0] + pxor xm12, xm12 ; stats[1] + pxor xm13, xm13 ; stats[2] + pxor xm14, xm14 ; stats[3] + pxor xm15, xm15 ; stats[4] + xor r7d, r7d + + ; correct stride for diff[] and rec + mov r6d, r3d + and r6d, ~15 + sub r2, r6 + lea r8, [(r6 - 64) * 2] ; 64 = MAX_CU_SIZE + lea r10, [pb_movemask_32 + 32] + +.loopH: + mov r5d, r3d + + ; calculate signLeft + mov r7b, [r1] + sub r7b, [r1 - 1] + seta r7b + setb r6b + sub r7b, r6b + neg r7b + pinsrb xm0, r7d, 15 + +.loopL: + mova m4, [pb_128] ; lower performance, but we haven't enough register for stats[] + movu xm3, [r1] + movu xm2, [r1 + 1] + + pxor xm1, xm3, xm4 + pxor xm2, xm4 + pcmpgtb xm3, xm1, xm2 + pcmpgtb 
xm2, xm1 + pand xm3, [pb_1] + por xm2, xm3 ; signRight + + palignr xm3, xm2, xm0, 15 + psignb xm3, xm4 ; signLeft + + mova xm0, xm2 + paddb xm2, xm3 + paddb xm2, [pb_2] ; edgeType + + ; get current process mask + mov r7d, 16 + mov r6d, r5d + cmp r5d, r7d + cmovge r6d, r7d + neg r6 + movu xm1, [r10 + r6] + + ; tmp_count[edgeType]++ + ; tmp_stats[edgeType] += (fenc[x] - rec[x]) + pxor xm3, xm3 + por xm1, xm2 ; apply unavailable pixel mask + movu m5, [r0] ; up to 14bits + + pcmpeqb xm3, xm1, xm3 + psubb xm6, xm3 + pmovsxbw m2, xm3 + pmaddwd m4, m5, m2 + paddd m11, m4 + + pcmpeqb xm3, xm1, [pb_1] + psubb xm7, xm3 + pmovsxbw m2, xm3 + pmaddwd m4, m5, m2 + paddd m12, m4 + + pcmpeqb xm3, xm1, [pb_2] + psubb xm8, xm3 + pmovsxbw m2, xm3 + pmaddwd m4, m5, m2 + paddd m13, m4 + + pcmpeqb xm3, xm1, [pb_3] + psubb xm9, xm3 + pmovsxbw m2, xm3 + pmaddwd m4, m5, m2 + paddd m14, m4 + + pcmpeqb xm3, xm1, [pb_4] + psubb xm10, xm3 + pmovsxbw m2, xm3 + pmaddwd m4, m5, m2 + paddd m15, m4 + + sub r5d, r7d + jle .next + + add r0, 16*2 + add r1, 16 + jmp .loopL + +.next: + sub r0, r8 + add r1, r2 + + dec r4d + jnz .loopH + + ; sum to global buffer + mov r0, r6mp + + ; sum into word + ; WARNING: There have a ovberflow bug on case Block64x64 with ALL pixels are SAME type (HM algorithm never pass Block64x64 into here) + pxor xm0, xm0 + psadbw xm1, xm6, xm0 + psadbw xm2, xm7, xm0 + psadbw xm3, xm8, xm0 + psadbw xm4, xm9, xm0 + psadbw xm5, xm10, xm0 + pshufd xm1, xm1, q3120 + pshufd xm2, xm2, q3120 + pshufd xm3, xm3, q3120 + pshufd xm4, xm4, q3120 + + ; sum count[4] only + movhlps xm6, xm5 + paddd xm5, xm6 + + ; sum count[s_eoTable] + ; s_eoTable = {1, 2, 0, 3, 4} + punpcklqdq xm3, xm1 + punpcklqdq xm2, xm4 + phaddd xm3, xm2 + movu xm1, [r0] + paddd xm3, xm1 + movu [r0], xm3 + movd r5d, xm5 + add [r0 + 4 * 4], r5d + + ; sum stats[s_eoTable] + vextracti128 xm1, m11, 1 + paddd xm1, xm11 + vextracti128 xm2, m12, 1 + paddd xm2, xm12 + vextracti128 xm3, m13, 1 + paddd xm3, xm13 + vextracti128 xm4, m14, 1 + paddd xm4, xm14 + vextracti128 xm5, m15, 1 + paddd xm5, xm15 + + ; s_eoTable = {1, 2, 0, 3, 4} + phaddd xm3, xm1 + phaddd xm2, xm4 + phaddd xm3, xm2 + psubd xm3, xm0, xm3 ; negtive for compensate PMADDWD sign algorithm problem + + ; sum stats[4] only + HADDD xm5, xm6 + psubd xm5, xm0, xm5 + + movu xm1, [r9] + paddd xm3, xm1 + movu [r9], xm3 + movd r6d, xm5 + add [r9 + 4 * 4], r6d RET %endif ;------------------------------------------------------------------------------------------------------------------------------------------- -; saoCuStatsE1_c(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count) +; saoCuStatsE1_c(const int16_t *diff, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count) ;------------------------------------------------------------------------------------------------------------------------------------------- %if ARCH_X86_64 INIT_XMM sse4 -cglobal saoCuStatsE1, 4,12,9,0-32 ; Stack: 5 of stats and 5 of count +cglobal saoCuStatsE1, 4,12,8,0-32 ; Stack: 5 of stats and 5 of count mov r5d, r5m mov r4d, r4m - mov r11d, r5d ; clear internal temporary buffer pxor m0, m0 @@ -2177,7 +2353,6 @@ mova m0, [pb_128] mova m5, [pb_1] mova m6, [pb_2] - mova m8, [hmul_16p + 16] movh m7, [r3 + r4] .loopH: @@ -2194,11 +2369,11 @@ pxor m1, m0 pxor m2, m0 pcmpgtb m3, m1, m2 - pand m3, m5 pcmpgtb m2, m1 + pand m3, m5 por m2, m3 pxor m3, m3 - psubb m3, m2 ; -signDown + psubb m3, m2 ; -signDown ; edgeType movu 
m4, [r11] @@ -2208,26 +2383,14 @@ ; update upBuff1 movu [r11], m3 - ; stats[edgeType] - pxor m1, m0 - movu m3, [r9] - punpckhbw m4, m3, m1 - punpcklbw m3, m1 - pmaddubsw m3, m8 - pmaddubsw m4, m8 - ; 16 pixels %assign x 0 %rep 16 pextrb r7d, m2, x inc word [rsp + r7 * 2] - %if (x < 8) - pextrw r8d, m3, (x % 8) - %else - pextrw r8d, m4, (x % 8) - %endif - movsx r8d, r8w + ; stats[edgeType] + movsx r8d, word [r9 + x * 2] add [rsp + 5 * 2 + r7 * 4], r8d dec r6d @@ -2235,15 +2398,678 @@ %assign x x+1 %endrep - add r9, 16 + add r9, 16*2 + add r10, 16 + add r11, 16 + jmp .loopW + +.next: + ; restore pointer upBuff1 + add r0, 64*2 ; MAX_CU_SIZE + add r1, r2 + + dec r5d + jg .loopH + + ; restore unavailable pixels + movh [r3 + r4], m7 + + ; sum to global buffer + mov r1, r6m + mov r0, r7m + + ; s_eoTable = {1,2,0,3,4} + pmovzxwd m0, [rsp + 0 * 2] + pshufd m0, m0, q3102 + movu m1, [r0] + paddd m0, m1 + movu [r0], m0 + movzx r5d, word [rsp + 4 * 2] + add [r0 + 4 * 4], r5d + + movu m0, [rsp + 5 * 2 + 0 * 4] + pshufd m0, m0, q3102 + movu m1, [r1] + paddd m0, m1 + movu [r1], m0 + mov r6d, [rsp + 5 * 2 + 4 * 4] + add [r1 + 4 * 4], r6d + RET + + +INIT_YMM avx2 +cglobal saoCuStatsE1, 4,13,16 ; Stack: 5 of stats and 5 of count + mov r5d, r5m + mov r4d, r4m + + ; clear internal temporary buffer + pxor xm6, xm6 ; count[0] + pxor xm7, xm7 ; count[1] + pxor xm8, xm8 ; count[2] + pxor xm9, xm9 ; count[3] + pxor xm10, xm10 ; count[4] + pxor xm11, xm11 ; stats[0] + pxor xm12, xm12 ; stats[1] + pxor xm13, xm13 ; stats[2] + pxor xm14, xm14 ; stats[3] + pxor xm15, xm15 ; stats[4] + mova m0, [pb_128] + mova m5, [pb_1] + + ; save unavailable bound pixel + push qword [r3 + r4] + + ; unavailable mask + lea r12, [pb_movemask_32 + 32] + +.loopH: + mov r6d, r4d + mov r9, r0 + mov r10, r1 + mov r11, r3 + +.loopW: + movu xm1, [r10] + movu xm2, [r10 + r2] + + ; signDown + pxor xm1, xm0 + pxor xm2, xm0 + pcmpgtb xm3, xm1, xm2 + pcmpgtb xm2, xm1 + pand xm3, xm5 + por xm2, xm3 + psignb xm3, xm2, xm0 ; -signDown + + ; edgeType + movu xm4, [r11] + paddb xm4, [pb_2] + paddb xm2, xm4 + + ; update upBuff1 (must be delay, above code modify memory[r11]) + movu [r11], xm3 + + ; m[1-4] free in here + + ; get current process group mask + mov r7d, 16 + mov r8d, r6d + cmp r6d, r7d + cmovge r8d, r7d + neg r8 + movu xm1, [r12 + r8] + + ; tmp_count[edgeType]++ + ; tmp_stats[edgeType] += (fenc[x] - rec[x]) + pxor xm3, xm3 + por xm1, xm2 ; apply unavailable pixel mask + movu m4, [r9] ; up to 14bits + + pcmpeqb xm3, xm1, xm3 + psubb xm6, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m11, m3 + + pcmpeqb xm3, xm1, xm5 + psubb xm7, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m12, m3 + + pcmpeqb xm3, xm1, [pb_2] + psubb xm8, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m13, m3 + + pcmpeqb xm3, xm1, [pb_3] + psubb xm9, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m14, m3 + + pcmpeqb xm3, xm1, [pb_4] + psubb xm10, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m15, m3 + + sub r6d, r7d + jle .next + + add r9, 16*2 add r10, 16 add r11, 16 + jmp .loopW + +.next: + ; restore pointer upBuff1 + add r0, 64*2 ; MAX_CU_SIZE + add r1, r2 + + dec r5d + jg .loopH + + ; restore unavailable pixels + pop qword [r3 + r4] + + ; sum to global buffer + mov r1, r6m + mov r0, r7m + + ; sum into word + ; WARNING: There have a ovberflow bug on case Block64x64 with ALL pixels are SAME type (HM algorithm never pass Block64x64 into here) + pxor xm0, xm0 + psadbw xm1, xm6, xm0 + psadbw xm2, xm7, xm0 + psadbw xm3, xm8, xm0 + psadbw xm4, xm9, xm0 + 
psadbw xm5, xm10, xm0 + pshufd xm1, xm1, q3120 + pshufd xm2, xm2, q3120 + pshufd xm3, xm3, q3120 + pshufd xm4, xm4, q3120 + + ; sum count[4] only + movhlps xm6, xm5 + paddd xm5, xm6 + + ; sum count[s_eoTable] + ; s_eoTable = {1, 2, 0, 3, 4} + punpcklqdq xm3, xm1 + punpcklqdq xm2, xm4 + phaddd xm3, xm2 + movu xm1, [r0] + paddd xm3, xm1 + movu [r0], xm3 + movd r5d, xm5 + add [r0 + 4 * 4], r5d + + ; sum stats[s_eoTable] + vextracti128 xm1, m11, 1 + paddd xm1, xm11 + vextracti128 xm2, m12, 1 + paddd xm2, xm12 + vextracti128 xm3, m13, 1 + paddd xm3, xm13 + vextracti128 xm4, m14, 1 + paddd xm4, xm14 + vextracti128 xm5, m15, 1 + paddd xm5, xm15 + + ; s_eoTable = {1, 2, 0, 3, 4} + phaddd xm3, xm1 + phaddd xm2, xm4 + phaddd xm3, xm2 + psubd xm3, xm0, xm3 ; negtive for compensate PMADDWD sign algorithm problem + + ; sum stats[4] only + HADDD xm5, xm6 + psubd xm5, xm0, xm5 + + movu xm1, [r1] + paddd xm3, xm1 + movu [r1], xm3 + movd r6d, xm5 + add [r1 + 4 * 4], r6d + RET +%endif ; ARCH_X86_64 + + +;void saoCuStatsE2_c(const int16_t *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int8_t *upBufft, int endX, int endY, int32_t *stats, int32_t *count) +;{ +; X265_CHECK(endX < MAX_CU_SIZE, "endX check failure\n"); +; X265_CHECK(endY < MAX_CU_SIZE, "endY check failure\n"); +; int x, y; +; int32_t tmp_stats[SAO::NUM_EDGETYPE]; +; int32_t tmp_count[SAO::NUM_EDGETYPE]; +; memset(tmp_stats, 0, sizeof(tmp_stats)); +; memset(tmp_count, 0, sizeof(tmp_count)); +; for (y = 0; y < endY; y++) +; { +; upBufft[0] = signOf(rec[stride] - rec[-1]); +; for (x = 0; x < endX; x++) +; { +; int signDown = signOf2(rec[x], rec[x + stride + 1]); +; X265_CHECK(signDown == signOf(rec[x] - rec[x + stride + 1]), "signDown check failure\n"); +; uint32_t edgeType = signDown + upBuff1[x] + 2; +; upBufft[x + 1] = (int8_t)(-signDown); +; tmp_stats[edgeType] += diff[x]; +; tmp_count[edgeType]++; +; } +; std::swap(upBuff1, upBufft); +; rec += stride; +; fenc += stride; +; } +; for (x = 0; x < SAO::NUM_EDGETYPE; x++) +; { +; stats[SAO::s_eoTable[x]] += tmp_stats[x]; +; count[SAO::s_eoTable[x]] += tmp_count[x]; +; } +;} + +%if ARCH_X86_64 +; TODO: x64 only because I need temporary register r7,r8, easy portab to x86 +INIT_XMM sse4 +cglobal saoCuStatsE2, 5,9,8,0-32 ; Stack: 5 of stats and 5 of count + mov r5d, r5m + + ; clear internal temporary buffer + pxor m0, m0 + mova [rsp], m0 + mova [rsp + mmsize], m0 + mova m0, [pb_128] + mova m5, [pb_1] + mova m6, [pb_2] + +.loopH: + ; TODO: merge into SIMD in below + ; get upBuffX[0] + mov r6b, [r1 + r2] + sub r6b, [r1 - 1] + seta r6b + setb r7b + sub r6b, r7b + mov [r4], r6b + + ; backup unavailable pixels + movh m7, [r4 + r5 + 1] + + mov r6d, r5d +.loopW: + movu m1, [r1] + movu m2, [r1 + r2 + 1] + + ; signDown + ; stats[edgeType] + pxor m1, m0 + pxor m2, m0 + pcmpgtb m3, m1, m2 + pand m3, m5 + pcmpgtb m2, m1 + por m2, m3 + pxor m3, m3 + psubb m3, m2 + + ; edgeType + movu m4, [r3] + paddb m4, m6 + paddb m2, m4 + + ; update upBuff1 + movu [r4 + 1], m3 + + ; 16 pixels +%assign x 0 +%rep 16 + pextrb r7d, m2, x + inc word [rsp + r7 * 2] + + movsx r8d, word [r0 + x * 2] + add [rsp + 5 * 2 + r7 * 4], r8d + + dec r6d + jz .next +%assign x x+1 +%endrep + + add r0, 16*2 + add r1, 16 + add r3, 16 + add r4, 16 + jmp .loopW + +.next: + xchg r3, r4 + + ; restore pointer upBuff1 + mov r6d, r5d + and r6d, ~15 + neg r6 ; MUST BE 64-bits, it is Negtive + + ; move to next row + + ; move back to start point + add r3, r6 + add r4, r6 + + ; adjust with stride + lea r0, [r0 + (r6 + 64) * 2] ; 64 = MAX_CU_SIZE + 
add r1, r2 + add r1, r6 + + ; restore unavailable pixels + movh [r3 + r5 + 1], m7 + + dec byte r6m + jg .loopH + + ; sum to global buffer + mov r1, r7m + mov r0, r8m + + ; s_eoTable = {1,2,0,3,4} + pmovzxwd m0, [rsp + 0 * 2] + pshufd m0, m0, q3102 + movu m1, [r0] + paddd m0, m1 + movu [r0], m0 + movzx r5d, word [rsp + 4 * 2] + add [r0 + 4 * 4], r5d + + movu m0, [rsp + 5 * 2 + 0 * 4] + pshufd m0, m0, q3102 + movu m1, [r1] + paddd m0, m1 + movu [r1], m0 + mov r6d, [rsp + 5 * 2 + 4 * 4] + add [r1 + 4 * 4], r6d + RET + + +INIT_YMM avx2 +cglobal saoCuStatsE2, 5,10,16 ; Stack: 5 of stats and 5 of count + mov r5d, r5m + + ; clear internal temporary buffer + pxor xm6, xm6 ; count[0] + pxor xm7, xm7 ; count[1] + pxor xm8, xm8 ; count[2] + pxor xm9, xm9 ; count[3] + pxor xm10, xm10 ; count[4] + pxor xm11, xm11 ; stats[0] + pxor xm12, xm12 ; stats[1] + pxor xm13, xm13 ; stats[2] + pxor xm14, xm14 ; stats[3] + pxor xm15, xm15 ; stats[4] + mova m0, [pb_128] + + ; unavailable mask + lea r9, [pb_movemask_32 + 32] + +.loopH: + ; TODO: merge into SIMD in below + ; get upBuffX[0] + mov r6b, [r1 + r2] + sub r6b, [r1 - 1] + seta r6b + setb r7b + sub r6b, r7b + mov [r4], r6b + + ; backup unavailable pixels + movq xm5, [r4 + r5 + 1] + + mov r6d, r5d +.loopW: + movu m1, [r1] + movu m2, [r1 + r2 + 1] + + ; signDown + ; stats[edgeType] + pxor xm1, xm0 + pxor xm2, xm0 + pcmpgtb xm3, xm1, xm2 + pand xm3, [pb_1] + pcmpgtb xm2, xm1 + por xm2, xm3 + psignb xm3, xm2, xm0 + + ; edgeType + movu xm4, [r3] + paddb xm4, [pb_2] + paddb xm2, xm4 + + ; update upBuff1 + movu [r4 + 1], xm3 + + ; m[1-4] free in here + + ; get current process group mask + mov r7d, 16 + mov r8d, r6d + cmp r6d, r7d + cmovge r8d, r7d + neg r8 + movu xm1, [r9 + r8] + + ; tmp_count[edgeType]++ + ; tmp_stats[edgeType] += (fenc[x] - rec[x]) + pxor xm3, xm3 + por xm1, xm2 ; apply unavailable pixel mask + movu m4, [r0] ; up to 14bits + + pcmpeqb xm3, xm1, xm3 + psubb xm6, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m11, m3 + + pcmpeqb xm3, xm1, [pb_1] + psubb xm7, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m12, m3 + + pcmpeqb xm3, xm1, [pb_2] + psubb xm8, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m13, m3 + + pcmpeqb xm3, xm1, [pb_3] + psubb xm9, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m14, m3 + + pcmpeqb xm3, xm1, [pb_4] + psubb xm10, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m15, m3 + + sub r6d, r7d + jle .next + + add r0, 16*2 + add r1, 16 + add r3, 16 + add r4, 16 + jmp .loopW + +.next: + xchg r3, r4 + + ; restore pointer upBuff1 + ; TODO: BZHI + mov r6d, r5d + and r6d, ~15 + neg r6 ; MUST BE 64-bits, it is Negtive + + ; move to next row + + ; move back to start point + add r3, r6 + add r4, r6 + + ; adjust with stride + lea r0, [r0 + (r6 + 64) * 2] ; 64 = MAX_CU_SIZE + add r1, r2 + add r1, r6 + + ; restore unavailable pixels + movq [r3 + r5 + 1], xm5 + + dec byte r6m + jg .loopH + + ; sum to global buffer + mov r1, r7m + mov r0, r8m + + ; sum into word + ; WARNING: There have a ovberflow bug on case Block64x64 with ALL pixels are SAME type (HM algorithm never pass Block64x64 into here) + pxor xm0, xm0 + psadbw xm1, xm6, xm0 + psadbw xm2, xm7, xm0 + psadbw xm3, xm8, xm0 + psadbw xm4, xm9, xm0 + psadbw xm5, xm10, xm0 + pshufd xm1, xm1, q3120 + pshufd xm2, xm2, q3120 + pshufd xm3, xm3, q3120 + pshufd xm4, xm4, q3120 + + ; sum count[4] only + movhlps xm6, xm5 + paddd xm5, xm6 + + ; sum count[s_eoTable] + ; s_eoTable = {1, 2, 0, 3, 4} + punpcklqdq xm3, xm1 + punpcklqdq xm2, xm4 + phaddd xm3, xm2 + movu xm1, 
[r0] + paddd xm3, xm1 + movu [r0], xm3 + movd r5d, xm5 + add [r0 + 4 * 4], r5d + + ; sum stats[s_eoTable] + vextracti128 xm1, m11, 1 + paddd xm1, xm11 + vextracti128 xm2, m12, 1 + paddd xm2, xm12 + vextracti128 xm3, m13, 1 + paddd xm3, xm13 + vextracti128 xm4, m14, 1 + paddd xm4, xm14 + vextracti128 xm5, m15, 1 + paddd xm5, xm15 + + ; s_eoTable = {1, 2, 0, 3, 4} + phaddd xm3, xm1 + phaddd xm2, xm4 + phaddd xm3, xm2 + psubd xm3, xm0, xm3 ; negtive for compensate PMADDWD sign algorithm problem + + ; sum stats[4] only + HADDD xm5, xm6 + psubd xm5, xm0, xm5 + + movu xm1, [r1] + paddd xm3, xm1 + movu [r1], xm3 + movd r6d, xm5 + add [r1 + 4 * 4], r6d + RET +%endif ; ARCH_X86_64 + + +;void saoStatE3(const int16_t *diff, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count); +;{ +; memset(tmp_stats, 0, sizeof(tmp_stats)); +; memset(tmp_count, 0, sizeof(tmp_count)); +; for (y = startY; y < endY; y++) +; { +; for (x = startX; x < endX; x++) +; { +; int signDown = signOf2(rec[x], rec[x + stride - 1]); +; uint32_t edgeType = signDown + upBuff1[x] + 2; +; upBuff1[x - 1] = (int8_t)(-signDown); +; tmp_stats[edgeType] += diff[x]; +; tmp_count[edgeType]++; +; } +; upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]); +; rec += stride; +; fenc += stride; +; } +; for (x = 0; x < NUM_EDGETYPE; x++) +; { +; stats[s_eoTable[x]] += tmp_stats[x]; +; count[s_eoTable[x]] += tmp_count[x]; +; } +;} + +%if ARCH_X86_64 +INIT_XMM sse4 +cglobal saoCuStatsE3, 4,9,8,0-32 ; Stack: 5 of stats and 5 of count + mov r4d, r4m + mov r5d, r5m + + ; clear internal temporary buffer + pxor m0, m0 + mova [rsp], m0 + mova [rsp + mmsize], m0 + mova m0, [pb_128] + mova m5, [pb_1] + mova m6, [pb_2] + movh m7, [r3 + r4] + +.loopH: + mov r6d, r4d + +.loopW: + movu m1, [r1] + movu m2, [r1 + r2 - 1] + + ; signDown + pxor m1, m0 + pxor m2, m0 + pcmpgtb m3, m1, m2 + pand m3, m5 + pcmpgtb m2, m1 + por m2, m3 + pxor m3, m3 + psubb m3, m2 + + ; edgeType + movu m4, [r3] + paddb m4, m6 + paddb m2, m4 + + ; update upBuff1 + movu [r3 - 1], m3 + + ; stats[edgeType] + pxor m1, m0 + + ; 16 pixels +%assign x 0 +%rep 16 + pextrb r7d, m2, x + inc word [rsp + r7 * 2] + + movsx r8d, word [r0 + x * 2] + add [rsp + 5 * 2 + r7 * 4], r8d + + dec r6d + jz .next +%assign x x+1 +%endrep + + add r0, 16*2 + add r1, 16 + add r3, 16 jmp .loopW .next: ; restore pointer upBuff1 - add r0, r2 + mov r6d, r4d + and r6d, ~15 + neg r6 ; MUST BE 64-bits, it is Negtive + + ; move to next row + + ; move back to start point + add r3, r6 + + ; adjust with stride + lea r0, [r0 + (r6 + 64) * 2] ; 64 = MAX_CU_SIZE add r1, r2 + add r1, r6 dec r5d jg .loopH @@ -2256,26 +3082,448 @@ mov r0, r7m ; s_eoTable = {1,2,0,3,4} - movzx r6d, word [rsp + 0 * 2] - add [r0 + 1 * 4], r6d - movzx r6d, word [rsp + 1 * 2] - add [r0 + 2 * 4], r6d - movzx r6d, word [rsp + 2 * 2] - add [r0 + 0 * 4], r6d - movzx r6d, word [rsp + 3 * 2] - add [r0 + 3 * 4], r6d - movzx r6d, word [rsp + 4 * 2] - add [r0 + 4 * 4], r6d - - mov r6d, [rsp + 5 * 2 + 0 * 4] - add [r1 + 1 * 4], r6d - mov r6d, [rsp + 5 * 2 + 1 * 4] - add [r1 + 2 * 4], r6d - mov r6d, [rsp + 5 * 2 + 2 * 4] - add [r1 + 0 * 4], r6d - mov r6d, [rsp + 5 * 2 + 3 * 4] - add [r1 + 3 * 4], r6d + pmovzxwd m0, [rsp + 0 * 2] + pshufd m0, m0, q3102 + movu m1, [r0] + paddd m0, m1 + movu [r0], m0 + movzx r5d, word [rsp + 4 * 2] + add [r0 + 4 * 4], r5d + + movu m0, [rsp + 5 * 2 + 0 * 4] + pshufd m0, m0, q3102 + movu m1, [r1] + paddd m0, m1 + movu [r1], m0 mov r6d, [rsp + 5 * 2 + 4 * 4] add [r1 + 4 * 4], r6d 
RET + + +INIT_YMM avx2 +cglobal saoCuStatsE3, 4,10,16 ; Stack: 5 of stats and 5 of count + mov r4d, r4m + mov r5d, r5m + + ; clear internal temporary buffer + pxor xm6, xm6 ; count[0] + pxor xm7, xm7 ; count[1] + pxor xm8, xm8 ; count[2] + pxor xm9, xm9 ; count[3] + pxor xm10, xm10 ; count[4] + pxor xm11, xm11 ; stats[0] + pxor xm12, xm12 ; stats[1] + pxor xm13, xm13 ; stats[2] + pxor xm14, xm14 ; stats[3] + pxor xm15, xm15 ; stats[4] + mova m0, [pb_128] + + ; unavailable mask + lea r9, [pb_movemask_32 + 32] + push qword [r3 + r4] + +.loopH: + mov r6d, r4d + +.loopW: + movu m1, [r1] + movu m2, [r1 + r2 - 1] + + ; signDown + ; stats[edgeType] + pxor xm1, xm0 + pxor xm2, xm0 + pcmpgtb xm3, xm1, xm2 + pand xm3, [pb_1] + pcmpgtb xm2, xm1 + por xm2, xm3 + pxor xm3, xm3 + psubb xm3, xm2 + + ; edgeType + movu xm4, [r3] + paddb xm4, [pb_2] + paddb xm2, xm4 + + ; update upBuff1 + movu [r3 - 1], xm3 + + ; m[1-4] free in here + + ; get current process group mask + mov r7d, 16 + mov r8d, r6d + cmp r6d, r7d + cmovge r8d, r7d + neg r8 + movu xm1, [r9 + r8] + + ; tmp_count[edgeType]++ + ; tmp_stats[edgeType] += (fenc[x] - rec[x]) + pxor xm3, xm3 + por xm1, xm2 ; apply unavailable pixel mask + movu m4, [r0] ; up to 14bits + + pcmpeqb xm3, xm1, xm3 + psubb xm6, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m11, m3 + + pcmpeqb xm3, xm1, [pb_1] + psubb xm7, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m12, m3 + + pcmpeqb xm3, xm1, [pb_2] + psubb xm8, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m13, m3 + + pcmpeqb xm3, xm1, [pb_3] + psubb xm9, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m14, m3 + + pcmpeqb xm3, xm1, [pb_4] + psubb xm10, xm3 + pmovsxbw m2, xm3 + pmaddwd m3, m4, m2 + paddd m15, m3 + + sub r6d, r7d + jle .next + + add r0, 16*2 + add r1, 16 + add r3, 16 + jmp .loopW + +.next: + ; restore pointer upBuff1 + mov r6d, r4d + and r6d, ~15 + neg r6 ; MUST BE 64-bits, it is Negtive + + ; move to next row + + ; move back to start point + add r3, r6 + + ; adjust with stride + lea r0, [r0 + (r6 + 64) * 2] ; 64 = MAX_CU_SIZE + add r1, r2 + add r1, r6 + + dec r5d + jg .loopH + + ; restore unavailable pixels + pop qword [r3 + r4] + + ; sum to global buffer + mov r1, r6m + mov r0, r7m + + ; sum into word + ; WARNING: There have a ovberflow bug on case Block64x64 with ALL pixels are SAME type (HM algorithm never pass Block64x64 into here) + pxor xm0, xm0 + psadbw xm1, xm6, xm0 + psadbw xm2, xm7, xm0 + psadbw xm3, xm8, xm0 + psadbw xm4, xm9, xm0 + psadbw xm5, xm10, xm0 + pshufd xm1, xm1, q3120 + pshufd xm2, xm2, q3120 + pshufd xm3, xm3, q3120 + pshufd xm4, xm4, q3120 + + ; sum count[4] only + movhlps xm6, xm5 + paddd xm5, xm6 + + ; sum count[s_eoTable] + ; s_eoTable = {1, 2, 0, 3, 4} + punpcklqdq xm3, xm1 + punpcklqdq xm2, xm4 + phaddd xm3, xm2 + movu xm1, [r0] + paddd xm3, xm1 + movu [r0], xm3 + movd r5d, xm5 + add [r0 + 4 * 4], r5d + + ; sum stats[s_eoTable] + vextracti128 xm1, m11, 1 + paddd xm1, xm11 + vextracti128 xm2, m12, 1 + paddd xm2, xm12 + vextracti128 xm3, m13, 1 + paddd xm3, xm13 + vextracti128 xm4, m14, 1 + paddd xm4, xm14 + vextracti128 xm5, m15, 1 + paddd xm5, xm15 + + ; s_eoTable = {1, 2, 0, 3, 4} + phaddd xm3, xm1 + phaddd xm2, xm4 + phaddd xm3, xm2 + psubd xm3, xm0, xm3 ; negtive for compensate PMADDWD sign algorithm problem + + ; sum stats[4] only + HADDD xm5, xm6 + psubd xm5, xm0, xm5 + + movu xm1, [r1] + paddd xm3, xm1 + movu [r1], xm3 + movd r6d, xm5 + add [r1 + 4 * 4], r6d + RET +%endif ; ARCH_X86_64 + + +%if ARCH_X86_64 +;; argument registers used - +; r0 - 
src +; r1 - srcStep +; r2 - offset +; r3 - tcP +; r4 - tcQ + +INIT_XMM sse4 +cglobal pelFilterLumaStrong_H, 5,7,10 + mov r1, r2 + neg r3d + neg r4d + neg r1 + + lea r5, [r2 * 3] + lea r6, [r1 * 3] + + pmovzxbw m4, [r0] ; src[0] + pmovzxbw m3, [r0 + r1] ; src[-offset] + pmovzxbw m2, [r0 + r1 * 2] ; src[-offset * 2] + pmovzxbw m1, [r0 + r6] ; src[-offset * 3] + pmovzxbw m0, [r0 + r1 * 4] ; src[-offset * 4] + pmovzxbw m5, [r0 + r2] ; src[offset] + pmovzxbw m6, [r0 + r2 * 2] ; src[offset * 2] + pmovzxbw m7, [r0 + r5] ; src[offset * 3] + + paddw m0, m0 ; m0*2 + mova m8, m2 + paddw m8, m3 ; m2 + m3 + paddw m8, m4 ; m2 + m3 + m4 + mova m9, m8 + paddw m9, m9 ; 2*m2 + 2*m3 + 2*m4 + paddw m8, m1 ; m2 + m3 + m4 + m1 + paddw m0, m8 ; 2*m0 + m2+ m3 + m4 + m1 + paddw m9, m1 + paddw m0, m1 + paddw m9, m5 ; m1 + 2*m2 + 2*m3 + 2*m4 + m5 + paddw m0, m1 ; 2*m0 + 3*m1 + m2 + m3 + m4 + + punpcklqdq m0, m9 + punpcklqdq m1, m3 + + paddw m3, m4 + mova m9, m5 + paddw m9, m6 + paddw m7, m7 ; 2*m7 + paddw m9, m3 ; m3 + m4 + m5 + m6 + mova m3, m9 + paddw m3, m3 ; 2*m3 + 2*m4 + 2*m5 + 2*m6 + paddw m7, m9 ; 2*m7 + m3 + m4 + m5 + m6 + paddw m7, m6 + psubw m3, m6 ; 2*m3 + 2*m4 + 2*m5 + m6 + paddw m7, m6 ; m3 + m4 + m5 + 3*m6 + 2*m7 + paddw m3, m2 ; m2 + 2*m3 + 2*m4 + 2*m5 + m6 + + punpcklqdq m9, m8 + punpcklqdq m3, m7 + punpcklqdq m5, m2 + punpcklqdq m4, m6 + + movd m7, r3d ; -tcP + movd m2, r4d ; -tcQ + pshufb m7, [pb_01] + pshufb m2, [pb_01] + mova m6, m2 + punpcklqdq m6, m7 + + paddw m0, [pw_4] + paddw m3, [pw_4] + paddw m9, [pw_2] + + psraw m0, 3 + psraw m3, 3 + psraw m9, 2 + + psubw m0, m1 + psubw m3, m4 + psubw m9, m5 + + pmaxsw m0, m7 + pmaxsw m3, m2 + pmaxsw m9, m6 + psignw m7, [pw_n1] + psignw m2, [pw_n1] + psignw m6, [pw_n1] + pminsw m0, m7 + pminsw m3, m2 + pminsw m9, m6 + + paddw m0, m1 + paddw m3, m4 + paddw m9, m5 + packuswb m0, m0 + packuswb m3, m9 + + movd [r0 + r6], m0 + pextrd [r0 + r1], m0, 1 + movd [r0], m3 + pextrd [r0 + r2 * 2], m3, 1 + pextrd [r0 + r2 * 1], m3, 2 + pextrd [r0 + r1 * 2], m3, 3 + RET + +INIT_XMM sse4 +cglobal pelFilterLumaStrong_V, 5,5,10 + neg r3d + neg r4d + lea r2, [r1 * 3] + + movh m0, [r0 - 4] ; src[-offset * 4] row 0 + movh m1, [r0 + r1 * 1 - 4] ; src[-offset * 4] row 1 + movh m2, [r0 + r1 * 2 - 4] ; src[-offset * 4] row 2 + movh m3, [r0 + r2 * 1 - 4] ; src[-offset * 4] row 3 + + punpcklbw m0, m1 + punpcklbw m2, m3 + mova m4, m0 + punpcklwd m0, m2 + punpckhwd m4, m2 + mova m1, m0 + mova m2, m0 + mova m3, m0 + pshufd m0, m0, 0 + pshufd m1, m1, 1 + pshufd m2, m2, 2 + pshufd m3, m3, 3 + mova m5, m4 + mova m6, m4 + mova m7, m4 + pshufd m4, m4, 0 + pshufd m5, m5, 1 + pshufd m6, m6, 2 + pshufd m7, m7, 3 + pmovzxbw m0, m0 + pmovzxbw m1, m1 + pmovzxbw m2, m2 + pmovzxbw m3, m3 + pmovzxbw m4, m4 + pmovzxbw m5, m5 + pmovzxbw m6, m6 + pmovzxbw m7, m7 + + paddw m0, m0 ; m0*2 + mova m8, m2 + paddw m8, m3 ; m2 + m3 + paddw m8, m4 ; m2 + m3 + m4 + mova m9, m8 + paddw m9, m9 ; 2*m2 + 2*m3 + 2*m4 + paddw m8, m1 ; m2 + m3 + m4 + m1 + paddw m0, m8 ; 2*m0 + m2+ m3 + m4 + m1 + paddw m9, m1 + paddw m0, m1 + paddw m9, m5 ; m1 + 2*m2 + 2*m3 + 2*m4 + m5 + paddw m0, m1 ; 2*m0 + 3*m1 + m2 + m3 + m4 + + punpcklqdq m0, m9 + punpcklqdq m1, m3 + + paddw m3, m4 + mova m9, m5 + paddw m9, m6 + paddw m7, m7 ; 2*m7 + paddw m9, m3 ; m3 + m4 + m5 + m6 + mova m3, m9 + paddw m3, m3 ; 2*m3 + 2*m4 + 2*m5 + 2*m6 + paddw m7, m9 ; 2*m7 + m3 + m4 + m5 + m6 + paddw m7, m6 + psubw m3, m6 ; 2*m3 + 2*m4 + 2*m5 + m6 + paddw m7, m6 ; m3 + m4 + m5 + 3*m6 + 2*m7 + paddw m3, m2 ; m2 + 2*m3 + 2*m4 + 2*m5 + m6 + + punpcklqdq m9, m8 + 
punpcklqdq m3, m7 + punpcklqdq m5, m2 + punpcklqdq m4, m6 + + movd m7, r3d ; -tcP + movd m2, r4d ; -tcQ + pshufb m7, [pb_01] + pshufb m2, [pb_01] + mova m6, m2 + punpcklqdq m6, m7 + + paddw m0, [pw_4] + paddw m3, [pw_4] + paddw m9, [pw_2] + + psraw m0, 3 + psraw m3, 3 + psraw m9, 2 + + psubw m0, m1 + psubw m3, m4 + psubw m9, m5 + + pmaxsw m0, m7 + pmaxsw m3, m2 + pmaxsw m9, m6 + psignw m7, [pw_n1] + psignw m2, [pw_n1] + psignw m6, [pw_n1] + pminsw m0, m7 + pminsw m3, m2 + pminsw m9, m6 + + paddw m0, m1 + paddw m3, m4 + paddw m9, m5 + packuswb m0, m0 + packuswb m3, m9 + + ; 4x6 output rows - + ; m0 - col 0 + ; m3 - col 3 + mova m1, m0 + mova m2, m3 + mova m4, m3 + mova m5, m3 + pshufd m1, m1, 1 ; col 2 + pshufd m2, m2, 1 ; col 5 + pshufd m4, m4, 2 ; col 4 + pshufd m5, m5, 3 ; col 1 + + ; transpose 4x6 to 6x4 + punpcklbw m0, m5 + punpcklbw m1, m3 + punpcklbw m4, m2 + punpcklwd m0, m1 + + movd [r0 + r1 * 0 - 3], m0 + pextrd [r0 + r1 * 1 - 3], m0, 1 + pextrd [r0 + r1 * 2 - 3], m0, 2 + pextrd [r0 + r2 * 1 - 3], m0, 3 + pextrw [r0 + r1 * 0 + 1], m4, 0 + pextrw [r0 + r1 * 1 + 1], m4, 1 + pextrw [r0 + r1 * 2 + 1], m4, 2 + pextrw [r0 + r2 * 1 + 1], m4, 3 + RET %endif ; ARCH_X86_64
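The pelFilterLumaStrong_H/_V kernels added here are the HEVC strong luma deblocking filter; the register comments spell out the six weighted sums (for example "m1 + 2*m2 + 2*m3 + 2*m4 + m5" is the p0 update). Below is a minimal scalar sketch of that arithmetic, assuming the usual x265 layout: src points at q0, the p side sits at negative multiples of offset, four positions are filtered per call, and tcP/tcQ are the clamp ranges the assembly applies through the pmaxsw/pminsw pairs against negated constants. Names here are illustrative, not the exact x265 C model. The same scalar covers both orientations, since offset and srcStep parameterize them (offset = stride, srcStep = 1 for a horizontal edge, and the reverse for a vertical one).

#include <cstdint>

typedef uint8_t pixel;   // 8-bit build; the asm packs results back with packuswb

static inline int clip3(int lo, int hi, int v) { return v < lo ? lo : (v > hi ? hi : v); }

static void pelFilterLumaStrong_ref(pixel* src, intptr_t srcStep, intptr_t offset,
                                    int tcP, int tcQ)
{
    for (int i = 0; i < 4; i++, src += srcStep)
    {
        // m0..m7 match the register comments: p3..p0 then q0..q3
        int m0 = src[-offset * 4], m1 = src[-offset * 3], m2 = src[-offset * 2];
        int m3 = src[-offset],     m4 = src[0],           m5 = src[offset];
        int m6 = src[offset * 2],  m7 = src[offset * 3];

        // each output is the weighted average, rounded, then delta-clipped to +/-tc
        src[-offset * 3] = (pixel)(m1 + clip3(-tcP, tcP, ((2 * m0 + 3 * m1 + m2 + m3 + m4 + 4) >> 3) - m1));
        src[-offset * 2] = (pixel)(m2 + clip3(-tcP, tcP, ((m1 + m2 + m3 + m4 + 2) >> 2) - m2));
        src[-offset]     = (pixel)(m3 + clip3(-tcP, tcP, ((m1 + 2 * m2 + 2 * m3 + 2 * m4 + m5 + 4) >> 3) - m3));
        src[0]           = (pixel)(m4 + clip3(-tcQ, tcQ, ((m2 + 2 * m3 + 2 * m4 + 2 * m5 + m6 + 4) >> 3) - m4));
        src[offset]      = (pixel)(m5 + clip3(-tcQ, tcQ, ((m3 + m4 + m5 + m6 + 2) >> 2) - m5));
        src[offset * 2]  = (pixel)(m6 + clip3(-tcQ, tcQ, ((m3 + m4 + m5 + 3 * m6 + 2 * m7 + 4) >> 3) - m6));
    }
}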
View file
x265_1.8.tar.gz/source/common/x86/loopfilter.h -> x265_1.9.tar.gz/source/common/x86/loopfilter.h
Changed
@@ -3,6 +3,7 @@ * * Authors: Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com> * Praveen Kumar Tiwari <praveen@multicorewareinc.com> +;* Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -35,14 +36,17 @@ void PFX(saoCuOrgE3_ ## cpu)(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX); \ void PFX(saoCuOrgE3_32_ ## cpu)(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX); \ void PFX(saoCuOrgB0_ ## cpu)(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride); \ - void PFX(saoCuStatsBO_ ## cpu)(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count); \ - void PFX(saoCuStatsE0_ ## cpu)(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count); \ - void PFX(saoCuStatsE1_ ## cpu)(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count); \ - void PFX(saoCuStatsE2_ ## cpu)(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int8_t *upBufft, int endX, int endY, int32_t *stats, int32_t *count); \ - void PFX(saoCuStatsE3_ ## cpu)(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count); \ + void PFX(saoCuStatsBO_ ## cpu)(const int16_t *diff, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count); \ + void PFX(saoCuStatsE0_ ## cpu)(const int16_t *diff, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count); \ + void PFX(saoCuStatsE1_ ## cpu)(const int16_t *diff, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count); \ + void PFX(saoCuStatsE2_ ## cpu)(const int16_t *diff, const pixel *rec, intptr_t stride, int8_t *upBuff1, int8_t *upBufft, int endX, int endY, int32_t *stats, int32_t *count); \ + void PFX(saoCuStatsE3_ ## cpu)(const int16_t *diff, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count); \ void PFX(calSign_ ## cpu)(int8_t *dst, const pixel *src1, const pixel *src2, const int endX); DECL_SAO(sse4); DECL_SAO(avx2); +void PFX(pelFilterLumaStrong_V_sse4)(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tcP, int32_t tcQ); +void PFX(pelFilterLumaStrong_H_sse4)(pixel* src, intptr_t srcStep, intptr_t offset, int32_t tcP, int32_t tcQ); + #endif // ifndef X265_LOOPFILTER_H
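This header change is the interface story behind the rewritten kernels above: every saoCuStats* primitive now receives a precomputed int16_t residual (diff = fenc - rec) laid out with a fixed MAX_CU_SIZE row stride, instead of subtracting fenc - rec per pixel. That is why the assembly drops the hmul_16p/pmaddubsw sequences and steps the diff pointer by 64*2 per row. A minimal scalar sketch of the E0 variant under those assumptions, in the style of the C models quoted in the assembly comments (boundary availability handling omitted, 8-bit pixels assumed, helper name illustrative):

#include <cstdint>

enum { MAX_CU_SIZE = 64, NUM_EDGETYPE = 5 };
static const int s_eoTable[NUM_EDGETYPE] = { 1, 2, 0, 3, 4 };

static inline int signOf(int x) { return (x > 0) - (x < 0); }

void saoCuStatsE0_ref(const int16_t* diff, const uint8_t* rec, intptr_t stride,
                      int endX, int endY, int32_t* stats, int32_t* count)
{
    int32_t tmp_stats[NUM_EDGETYPE] = { 0 };
    int32_t tmp_count[NUM_EDGETYPE] = { 0 };

    for (int y = 0; y < endY; y++)
    {
        int signLeft = signOf(rec[0] - rec[-1]);
        for (int x = 0; x < endX; x++)
        {
            int signRight = signOf(rec[x] - rec[x + 1]);
            int edgeType  = signRight + signLeft + 2;
            signLeft = -signRight;          // next pixel reuses the negated sign
            tmp_stats[edgeType] += diff[x]; // residual comes in precomputed now
            tmp_count[edgeType]++;
        }
        diff += MAX_CU_SIZE;                // fixed stride, 64*2 bytes in the asm
        rec  += stride;
    }
    for (int i = 0; i < NUM_EDGETYPE; i++)
    {
        stats[s_eoTable[i]] += tmp_stats[i];
        count[s_eoTable[i]] += tmp_count[i];
    }
}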
View file
x265_1.8.tar.gz/source/common/x86/mc-a.asm -> x265_1.9.tar.gz/source/common/x86/mc-a.asm
Changed
@@ -2,6 +2,7 @@ ;* mc-a.asm: x86 motion compensation ;***************************************************************************** ;* Copyright (C) 2003-2013 x264 project +;* Copyright (C) 2013-2015 x265 project ;* ;* Authors: Loren Merritt <lorenm@u.washington.edu> ;* Fiona Glaser <fiona@x264.com> @@ -3989,8 +3990,12 @@ test dword r4m, 15 jz pixel_avg_w%1_sse2 %endif +%if (%1 == 8) + jmp pixel_avg_w8_unaligned_sse2 +%else jmp pixel_avg_w%1_mmx2 %endif +%endif %endmacro ;----------------------------------------------------------------------------- @@ -4049,6 +4054,32 @@ lea r4, [r4 + 4 * r5] %endmacro +INIT_XMM sse2 +cglobal pixel_avg_w8_unaligned + AVG_START +.height_loop: +%if HIGH_BIT_DEPTH + ; NO TEST BRANCH! + movu m0, [t2] + movu m1, [t2+SIZEOF_PIXEL*t3] + movu m2, [t4] + movu m3, [t4+SIZEOF_PIXEL*t5] + pavgw m0, m2 + pavgw m1, m3 + movu [t0], m0 + movu [t0+SIZEOF_PIXEL*t1], m1 +%else ;!HIGH_BIT_DEPTH + movq m0, [t2] + movhps m0, [t2+SIZEOF_PIXEL*t3] + movq m1, [t4] + movhps m1, [t4+SIZEOF_PIXEL*t5] + pavgb m0, m1 + movq [t0], m0 + movhps [t0+SIZEOF_PIXEL*t1], m0 +%endif + AVG_END + + ;------------------------------------------------------------------------------------------------------------------------------- ;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) ;------------------------------------------------------------------------------------------------------------------------------- @@ -4115,11 +4146,11 @@ AVGH 4, 4 AVGH 4, 2 -AVG_FUNC 8, movq, movq -AVGH 8, 32 -AVGH 8, 16 -AVGH 8, 8 -AVGH 8, 4 +;AVG_FUNC 8, movq, movq +;AVGH 8, 32 +;AVGH 8, 16 +;AVGH 8, 8 +;AVGH 8, 4 AVG_FUNC 16, movq, movq AVGH 16, 64 @@ -4197,7 +4228,7 @@ AVGH 4, 4 AVGH 4, 2 -AVG_FUNC 8, movq, movq +;AVG_FUNC 8, movq, movq AVGH 8, 32 AVGH 8, 16 AVGH 8, 8 @@ -4418,6 +4449,37 @@ call pixel_avg_16x64_8bit call pixel_avg_16x64_8bit RET + +cglobal pixel_avg_48x64, 6,7,4 + mov r6d, 4 +.loop: +%rep 8 + movu m0, [r2] + movu xm2, [r2 + mmsize] + movu m1, [r4] + movu xm3, [r4 + mmsize] + pavgb m0, m1 + pavgb xm2, xm3 + movu [r0], m0 + movu [r0 + mmsize], xm2 + + movu m0, [r2 + r3] + movu xm2, [r2 + r3 + mmsize] + movu m1, [r4 + r5] + movu xm3, [r4 + r5 + mmsize] + pavgb m0, m1 + pavgb xm2, xm3 + movu [r0 + r1], m0 + movu [r0 + r1 + mmsize], xm2 + + lea r2, [r2 + r3 * 2] + lea r4, [r4 + r5 * 2] + lea r0, [r0 + r1 * 2] +%endrep + + dec r6d + jnz .loop + RET %endif ;=============================================================================
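Two additions in this mc-a.asm diff: the 8-wide unaligned fallback is now SSE2 (pixel_avg_w8_unaligned_sse2, using movq/movhps loads, or movu/pavgw at high bit depth, hence the "NO TEST BRANCH!" comment) instead of the old MMX2 path, whose AVGH 8,x entries are commented out; and an AVX2 pixel_avg_48x64 covers the one width that needs a mixed 32+16 byte split per row. Both reduce to the same per-pixel operation: pavgb/pavgw compute a rounded average, (a + b + 1) >> 1. A scalar sketch for the 8-bit case (helper name illustrative):

#include <cstdint>

void pixel_avg_ref(uint8_t* dst, intptr_t dstStride,
                   const uint8_t* src0, intptr_t stride0,
                   const uint8_t* src1, intptr_t stride1,
                   int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        // per-pixel rounded average, exactly what pavgb computes lane-wise
        for (int x = 0; x < width; x++)
            dst[x] = (uint8_t)((src0[x] + src1[x] + 1) >> 1);
        dst  += dstStride;
        src0 += stride0;
        src1 += stride1;
    }
}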
View file
x265_1.8.tar.gz/source/common/x86/mc-a2.asm -> x265_1.9.tar.gz/source/common/x86/mc-a2.asm
Changed
@@ -2,12 +2,14 @@ ;* mc-a2.asm: x86 motion compensation ;***************************************************************************** ;* Copyright (C) 2005-2013 x264 project +;* Copyright (C) 2013-2015 x265 project ;* ;* Authors: Loren Merritt <lorenm@u.washington.edu> ;* Fiona Glaser <fiona@x264.com> ;* Holger Lubitz <holger@lubitz.org> ;* Mathieu Monnier <manao@melix.net> ;* Oskar Arvidsson <oskar@irock.se> +;* Min Chen <chenm003@163.com> ;* ;* This program is free software; you can redistribute it and/or modify ;* it under the terms of the GNU General Public License as published by @@ -46,6 +48,8 @@ pd_16: times 4 dd 16 pd_0f: times 4 dd 0xffff pf_inv256: times 8 dd 0.00390625 +const pd_inv256, times 4 dq 0.00390625 +const pd_0_5, times 4 dq 0.5 SECTION .text @@ -987,151 +991,227 @@ %endif ;----------------------------------------------------------------------------- -; void mbtree_propagate_cost( int *dst, uint16_t *propagate_in, uint16_t *intra_costs, -; uint16_t *inter_costs, uint16_t *inv_qscales, float *fps_factor, int len ) +; void mbtree_propagate_cost( int *dst, uint16_t *propagate_in, int32_t *intra_costs, +; uint16_t *inter_costs, int32_t *inv_qscales, double *fps_factor, int len ) ;----------------------------------------------------------------------------- -%macro MBTREE 0 +INIT_XMM sse2 cglobal mbtree_propagate_cost, 7,7,7 - add r6d, r6d - lea r0, [r0+r6*2] - add r1, r6 - add r2, r6 - add r3, r6 - add r4, r6 - neg r6 - pxor xmm4, xmm4 - movss xmm6, [r5] - shufps xmm6, xmm6, 0 - mulps xmm6, [pf_inv256] - movdqa xmm5, [pw_3fff] + dec r6d + movsd m6, [r5] + mulpd m6, [pd_inv256] + xor r5d, r5d + lea r0, [r0+r5*2] + pxor m4, m4 + movlhps m6, m6 + mova m5, [pw_3fff] + .loop: - movq xmm2, [r2+r6] ; intra - movq xmm0, [r4+r6] ; invq - movq xmm3, [r3+r6] ; inter - movq xmm1, [r1+r6] ; prop - punpcklwd xmm2, xmm4 - punpcklwd xmm0, xmm4 - pmaddwd xmm0, xmm2 - pand xmm3, xmm5 - punpcklwd xmm1, xmm4 - punpcklwd xmm3, xmm4 -%if cpuflag(fma4) - cvtdq2ps xmm0, xmm0 - cvtdq2ps xmm1, xmm1 - fmaddps xmm0, xmm0, xmm6, xmm1 - cvtdq2ps xmm1, xmm2 - psubd xmm2, xmm3 - cvtdq2ps xmm2, xmm2 - rcpps xmm3, xmm1 - mulps xmm1, xmm3 - mulps xmm0, xmm2 - addps xmm2, xmm3, xmm3 - fnmaddps xmm3, xmm1, xmm3, xmm2 - mulps xmm0, xmm3 -%else - cvtdq2ps xmm0, xmm0 - mulps xmm0, xmm6 ; intra*invq*fps_factor>>8 - cvtdq2ps xmm1, xmm1 ; prop - addps xmm0, xmm1 ; prop + (intra*invq*fps_factor>>8) - cvtdq2ps xmm1, xmm2 ; intra - psubd xmm2, xmm3 ; intra - inter - cvtdq2ps xmm2, xmm2 ; intra - inter - rcpps xmm3, xmm1 ; 1 / intra 1st approximation - mulps xmm1, xmm3 ; intra * (1/intra 1st approx) - mulps xmm1, xmm3 ; intra * (1/intra 1st approx)^2 - mulps xmm0, xmm2 ; (prop + (intra*invq*fps_factor>>8)) * (intra - inter) - addps xmm3, xmm3 ; 2 * (1/intra 1st approx) - subps xmm3, xmm1 ; 2nd approximation for 1/intra - mulps xmm0, xmm3 ; / intra -%endif - cvtps2dq xmm0, xmm0 - movdqa [r0+r6*2], xmm0 - add r6, 8 - jl .loop + movh m2, [r2+r5*4] ; intra + movh m0, [r4+r5*4] ; invq + movd m3, [r3+r5*2] ; inter + pand m3, m5 + punpcklwd m3, m4 + + ; PMINSD + pcmpgtd m1, m2, m3 + pand m3, m1 + pandn m1, m2 + por m3, m1 + + movd m1, [r1+r5*2] ; prop + punpckldq m2, m2 + punpckldq m0, m0 + pmuludq m0, m2 + pshufd m2, m2, q3120 + pshufd m0, m0, q3120 + + punpcklwd m1, m4 + cvtdq2pd m0, m0 + mulpd m0, m6 ; intra*invq*fps_factor>>8 + cvtdq2pd m1, m1 ; prop + addpd m0, m1 ; prop + (intra*invq*fps_factor>>8) + ;cvtdq2ps m1, m2 ; intra + cvtdq2pd m1, m2 ; intra + psubd m2, m3 ; intra - inter + cvtdq2pd m2, m2 ; intra - inter + 
;rcpps m3, m1 + ;mulps m1, m3 ; intra * (1/intra 1st approx) + ;mulps m1, m3 ; intra * (1/intra 1st approx)^2 + ;addps m3, m3 ; 2 * (1/intra 1st approx) + ;subps m3, m1 ; 2nd approximation for 1/intra + ;cvtps2pd m3, m3 ; 1 / intra 1st approximation + mulpd m0, m2 ; (prop + (intra*invq*fps_factor>>8)) * (intra - inter) + ;mulpd m0, m3 ; / intra + + ; TODO: DIVPD very slow, but match to C model output, since it is not bottleneck function, I comment above faster code + divpd m0, m1 + addpd m0, [pd_0_5] + cvttpd2dq m0, m0 + + movh [r0+r5*4], m0 + add r5d, 2 + cmp r5d, r6d + jl .loop + + xor r6d, r5d + jnz .even + movd m2, [r2+r5*4] ; intra + movd m0, [r4+r5*4] ; invq + movd m3, [r3+r5*2] ; inter + pand m3, m5 + punpcklwd m3, m4 + + ; PMINSD + pcmpgtd m1, m2, m3 + pand m3, m1 + pandn m1, m2 + por m3, m1 + + movd m1, [r1+r5*2] ; prop + punpckldq m2, m2 ; DWORD [_ 1 _ 0] + punpckldq m0, m0 + pmuludq m0, m2 ; QWORD [m1 m0] + pshufd m2, m2, q3120 + pshufd m0, m0, q3120 + punpcklwd m1, m4 + cvtdq2pd m0, m0 + mulpd m0, m6 ; intra*invq*fps_factor>>8 + cvtdq2pd m1, m1 ; prop + addpd m0, m1 ; prop + (intra*invq*fps_factor>>8) + cvtdq2pd m1, m2 ; intra + psubd m2, m3 ; intra - inter + cvtdq2pd m2, m2 ; intra - inter + mulpd m0, m2 ; (prop + (intra*invq*fps_factor>>8)) * (intra - inter) + + divpd m0, m1 + addpd m0, [pd_0_5] + cvttpd2dq m0, m0 + movd [r0+r5*4], m0 +.even: RET -%endmacro -INIT_XMM sse2 -MBTREE -; Bulldozer only has a 128-bit float unit, so the AVX version of this function is actually slower. -INIT_XMM fma4 -MBTREE - -%macro INT16_UNPACK 1 - vpunpckhwd xm4, xm%1, xm7 - vpunpcklwd xm%1, xm7 - vinsertf128 m%1, m%1, xm4, 1 -%endmacro +;----------------------------------------------------------------------------- +; void mbtree_propagate_cost( int *dst, uint16_t *propagate_in, int32_t *intra_costs, +; uint16_t *inter_costs, int32_t *inv_qscales, double *fps_factor, int len ) +;----------------------------------------------------------------------------- ; FIXME: align loads/stores to 16 bytes %macro MBTREE_AVX 0 -cglobal mbtree_propagate_cost, 7,7,8 - add r6d, r6d - lea r0, [r0+r6*2] - add r1, r6 - add r2, r6 - add r3, r6 - add r4, r6 - neg r6 - mova xm5, [pw_3fff] - vbroadcastss m6, [r5] - mulps m6, [pf_inv256] -%if notcpuflag(avx2) - pxor xm7, xm7 -%endif +cglobal mbtree_propagate_cost, 7,7,7 + sub r6d, 3 + vbroadcastsd m6, [r5] + mulpd m6, [pd_inv256] + xor r5d, r5d + mova m5, [pw_3fff] + .loop: -%if cpuflag(avx2) - pmovzxwd m0, [r2+r6] ; intra - pmovzxwd m1, [r4+r6] ; invq - pmovzxwd m2, [r1+r6] ; prop - pand xm3, xm5, [r3+r6] ; inter - pmovzxwd m3, xm3 - pmaddwd m1, m0 - psubd m4, m0, m3 - cvtdq2ps m0, m0 - cvtdq2ps m1, m1 - cvtdq2ps m2, m2 - cvtdq2ps m4, m4 - fmaddps m1, m1, m6, m2 - rcpps m3, m0 - mulps m2, m0, m3 - mulps m1, m4 - addps m4, m3, m3 - fnmaddps m4, m2, m3, m4 - mulps m1, m4 -%else - movu xm0, [r2+r6] - movu xm1, [r4+r6] - movu xm2, [r1+r6] - pand xm3, xm5, [r3+r6] - INT16_UNPACK 0 - INT16_UNPACK 1 - INT16_UNPACK 2 - INT16_UNPACK 3 - cvtdq2ps m0, m0 - cvtdq2ps m1, m1 - cvtdq2ps m2, m2 - cvtdq2ps m3, m3 - mulps m1, m0 - subps m4, m0, m3 - mulps m1, m6 ; intra*invq*fps_factor>>8 - addps m1, m2 ; prop + (intra*invq*fps_factor>>8) - rcpps m3, m0 ; 1 / intra 1st approximation - mulps m2, m0, m3 ; intra * (1/intra 1st approx) - mulps m2, m3 ; intra * (1/intra 1st approx)^2 - mulps m1, m4 ; (prop + (intra*invq*fps_factor>>8)) * (intra - inter) - addps m3, m3 ; 2 * (1/intra 1st approx) - subps m3, m2 ; 2nd approximation for 1/intra - mulps m1, m3 ; / intra -%endif - vcvtps2dq m1, m1 
- movu [r0+r6*2], m1 - add r6, 16 - jl .loop + movu xm2, [r2+r5*4] ; intra + movu xm0, [r4+r5*4] ; invq + pmovzxwd xm3, [r3+r5*2] ; inter + pand xm3, xm5 + pminsd xm3, xm2 + + pmovzxwd xm1, [r1+r5*2] ; prop + pmulld xm0, xm2 + cvtdq2pd m0, xm0 + cvtdq2pd m1, xm1 ; prop +;%if cpuflag(avx2) +; fmaddpd m0, m0, m6, m1 +;%else + mulpd m0, m6 ; intra*invq*fps_factor>>8 + addpd m0, m1 ; prop + (intra*invq*fps_factor>>8) +;%endif + cvtdq2pd m1, xm2 ; intra + psubd xm2, xm3 ; intra - inter + cvtdq2pd m2, xm2 ; intra - inter + mulpd m0, m2 ; (prop + (intra*invq*fps_factor>>8)) * (intra - inter) + + ; TODO: DIVPD very slow, but match to C model output, since it is not bottleneck function, I comment above faster code + divpd m0, m1 + addpd m0, [pd_0_5] + cvttpd2dq xm0, m0 + + movu [r0+r5*4], xm0 + add r5d, 4 ; process 4 values in one iteration + cmp r5d, r6d + jl .loop + + add r6d, 3 + xor r6d, r5d + jz .even ; if loop counter is multiple of 4, all values are processed + + and r6d, 3 ; otherwise, remaining unprocessed values must be 1, 2 or 3 + cmp r6d, 1 + je .process1 ; if only 1 value is unprocessed + + ; process 2 values here + movq xm2, [r2+r5*4] ; intra + movq xm0, [r4+r5*4] ; invq + movd xm3, [r3+r5*2] ; inter + pmovzxwd xm3, xm3 + pand xm3, xm5 + pminsd xm3, xm2 + + movd xm1, [r1+r5*2] ; prop + pmovzxwd xm1, xm1 + pmulld xm0, xm2 + cvtdq2pd m0, xm0 + cvtdq2pd m1, xm1 ; prop +;%if cpuflag(avx2) +; fmaddpd m0, m0, m6, m1 +;%else + mulpd m0, m6 ; intra*invq*fps_factor>>8 + addpd m0, m1 ; prop + (intra*invq*fps_factor>>8) +;%endif + cvtdq2pd m1, xm2 ; intra + psubd xm2, xm3 ; intra - inter + cvtdq2pd m2, xm2 ; intra - inter + mulpd m0, m2 ; (prop + (intra*invq*fps_factor>>8)) * (intra - inter) + + divpd m0, m1 + addpd m0, [pd_0_5] + cvttpd2dq xm0, m0 + movq [r0+r5*4], xm0 + + xor r6d, 2 + jz .even + add r5d, 2 + + ; process 1 value here +.process1: + movd xm2, [r2+r5*4] ; intra + movd xm0, [r4+r5*4] ; invq + movzx r6d, word [r3+r5*2] ; inter + movd xm3, r6d + pand xm3, xm5 + pminsd xm3, xm2 + + movzx r6d, word [r1+r5*2] ; prop + movd xm1, r6d + pmulld xm0, xm2 + cvtdq2pd m0, xm0 + cvtdq2pd m1, xm1 ; prop +;%if cpuflag(avx2) +; fmaddpd m0, m0, m6, m1 +;%else + mulpd m0, m6 ; intra*invq*fps_factor>>8 + addpd m0, m1 ; prop + (intra*invq*fps_factor>>8) +;%endif + cvtdq2pd m1, xm2 ; intra + psubd xm2, xm3 ; intra - inter + cvtdq2pd m2, xm2 ; intra - inter + mulpd m0, m2 ; (prop + (intra*invq*fps_factor>>8)) * (intra - inter) + + divpd m0, m1 + addpd m0, [pd_0_5] + cvttpd2dq xm0, m0 + movd [r0+r5*4], xm0 +.even: RET %endmacro INIT_YMM avx MBTREE_AVX -INIT_YMM avx2,fma3 + +INIT_YMM avx2 MBTREE_AVX
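mbtree_propagate_cost previously ran in single precision with an rcpps reciprocal plus a refinement step; this rewrite switches to double precision and a true divpd so the SIMD result is bit-exact against the C model, a cost the inline comments accept because the function is not a bottleneck. The scalar math being matched is sketched below (types follow the new prototypes; the helper name and the nonzero-intra-cost assumption are mine):

#include <cstdint>

void mbtree_propagate_cost_ref(int* dst, const uint16_t* propagateIn,
                               const int32_t* intraCosts, const uint16_t* interCosts,
                               const int32_t* invQscales, const double* fpsFactor, int len)
{
    double fps = *fpsFactor / 256.0;        // the mulpd by pd_inv256
    for (int i = 0; i < len; i++)
    {
        int intra = intraCosts[i];          // assumed nonzero, as in practice
        int inter = interCosts[i] & 0x3fff; // pand with pw_3fff strips ref bits
        if (inter > intra)
            inter = intra;                  // the (emulated or real) PMINSD

        double amount = (double)propagateIn[i] + (double)intra * invQscales[i] * fps;
        // +0.5 then truncate matches 'addpd [pd_0_5]' followed by cvttpd2dq
        dst[i] = (int)(amount * (intra - inter) / intra + 0.5);
    }
}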
View file
x265_1.8.tar.gz/source/common/x86/mc.h -> x265_1.9.tar.gz/source/common/x86/mc.h
Changed
@@ -36,4 +36,14 @@ #undef LOWRES +#define PROPAGATE_COST(cpu) \ + void PFX(mbtree_propagate_cost_ ## cpu)(int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, \ + const uint16_t* interCosts, const int32_t* invQscales, const double* fpsFactor, int len); + +PROPAGATE_COST(sse2) +PROPAGATE_COST(avx) +PROPAGATE_COST(avx2) + +#undef PROPAGATE_COST + #endif // ifndef X265_MC_H
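These widened prototypes are the other half of that change: intra costs and inverse qscales grow from 16-bit to int32_t and the fps factor from float to double, which lets the kernels above load them as 32-bit lanes directly and drop the INT16_UNPACK step the old single-precision path needed.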
View file
x265_1.8.tar.gz/source/common/x86/pixel-a.asm -> x265_1.9.tar.gz/source/common/x86/pixel-a.asm
Changed
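The pixel-a.asm diff below gates the generic 16-bit SA8D macro behind BIT_DEPTH <= 10 and adds a dedicated 12-bit path (sa8d_8x8_12bit plus per-block-size wrappers): 12-bit residuals can overflow the 16-bit Hadamard butterflies, so every lane is widened to 32 bits with pmovzxwd/paddd/psubd. For reference, sa8d is the sum of absolute values of an 8x8 Hadamard transform of the residual; a plain scalar sketch with the (sum + 2) >> 2 normalization of the x264/x265 C model follows (the 12-bit kernels carry an internally scaled sum and round with the trailing add eax, 1 / shr eax, 1 instead, but measure the same metric; names here are illustrative):

#include <cstdint>
#include <cstdlib>

static void hadamard8(int* v)   // in-place 1-D 8-point Hadamard butterfly
{
    for (int step = 1; step < 8; step <<= 1)
        for (int i = 0; i < 8; i += step << 1)
            for (int j = i; j < i + step; j++)
            {
                int a = v[j], b = v[j + step];
                v[j]        = a + b;
                v[j + step] = a - b;
            }
}

int sa8d_8x8_ref(const uint16_t* pix1, intptr_t stride1,
                 const uint16_t* pix2, intptr_t stride2)
{
    int m[8][8];
    for (int y = 0; y < 8; y++)             // residual, then row transform
    {
        for (int x = 0; x < 8; x++)
            m[y][x] = (int)pix1[y * stride1 + x] - (int)pix2[y * stride2 + x];
        hadamard8(m[y]);
    }
    int sum = 0;
    for (int x = 0; x < 8; x++)             // column transform, then |.| sum
    {
        int col[8];
        for (int y = 0; y < 8; y++)
            col[y] = m[y][x];
        hadamard8(col);
        for (int y = 0; y < 8; y++)
            sum += std::abs(col[y]);
    }
    return (sum + 2) >> 2;                  // x264/x265 sa8d normalization
}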
@@ -2,6 +2,7 @@ ;* pixel.asm: x86 pixel metrics ;***************************************************************************** ;* Copyright (C) 2003-2013 x264 project +;* Copyright (C) 2013-2015 x265 project ;* ;* Authors: Loren Merritt <lorenm@u.washington.edu> ;* Holger Lubitz <holger@lubitz.org> @@ -70,6 +71,7 @@ cextern pd_2 cextern hmul_16p cextern pb_movemask +cextern pb_movemask_32 cextern pw_pixel_max ;============================================================================= @@ -6497,6 +6499,1357 @@ %endif ; !ARCH_X86_64 %endmacro ; SA8D + +%if ARCH_X86_64 == 1 && BIT_DEPTH == 12 +INIT_YMM avx2 +cglobal sa8d_8x8_12bit + pmovzxwd m0, [r0] + pmovzxwd m9, [r2] + psubd m0, m9 + + pmovzxwd m1, [r0 + r1] + pmovzxwd m9, [r2 + r3] + psubd m1, m9 + + pmovzxwd m2, [r0 + r1 * 2] + pmovzxwd m9, [r2 + r3 * 2] + psubd m2, m9 + + pmovzxwd m8, [r0 + r4] + pmovzxwd m9, [r2 + r5] + psubd m8, m9 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxwd m4, [r0] + pmovzxwd m9, [r2] + psubd m4, m9 + + pmovzxwd m5, [r0 + r1] + pmovzxwd m9, [r2 + r3] + psubd m5, m9 + + pmovzxwd m3, [r0 + r1 * 2] + pmovzxwd m9, [r2 + r3 * 2] + psubd m3, m9 + + pmovzxwd m7, [r0 + r4] + pmovzxwd m9, [r2 + r5] + psubd m7, m9 + + mova m6, m0 + paddd m0, m1 + psubd m1, m6 + mova m6, m2 + paddd m2, m8 + psubd m8, m6 + mova m6, m0 + + punpckldq m0, m1 + punpckhdq m6, m1 + + mova m1, m0 + paddd m0, m6 + psubd m6, m1 + mova m1, m2 + + punpckldq m2, m8 + punpckhdq m1, m8 + + mova m8, m2 + paddd m2, m1 + psubd m1, m8 + mova m8, m4 + paddd m4, m5 + psubd m5, m8 + mova m8, m3 + paddd m3, m7 + psubd m7, m8 + mova m8, m4 + + punpckldq m4, m5 + punpckhdq m8, m5 + + mova m5, m4 + paddd m4, m8 + psubd m8, m5 + mova m5, m3 + punpckldq m3, m7 + punpckhdq m5, m7 + + mova m7, m3 + paddd m3, m5 + psubd m5, m7 + mova m7, m0 + paddd m0, m2 + psubd m2, m7 + mova m7, m6 + paddd m6, m1 + psubd m1, m7 + mova m7, m0 + + punpcklqdq m0, m2 + punpckhqdq m7, m2 + + mova m2, m0 + paddd m0, m7 + psubd m7, m2 + mova m2, m6 + + punpcklqdq m6, m1 + punpckhqdq m2, m1 + + mova m1, m6 + paddd m6, m2 + psubd m2, m1 + mova m1, m4 + paddd m4, m3 + psubd m3, m1 + mova m1, m8 + paddd m8, m5 + psubd m5, m1 + mova m1, m4 + + punpcklqdq m4, m3 + punpckhqdq m1, m3 + + mova m3, m4 + paddd m4, m1 + psubd m1, m3 + mova m3, m8 + + punpcklqdq m8, m5 + punpckhqdq m3, m5 + + mova m5, m8 + paddd m8, m3 + psubd m3, m5 + mova m5, m0 + paddd m0, m4 + psubd m4, m5 + mova m5, m7 + paddd m7, m1 + psubd m1, m5 + mova m5, m0 + + vinserti128 m0, m0, xm4, 1 + vperm2i128 m5, m5, m4, 00110001b + + pxor m4, m4 + psubd m4, m0 + pmaxsd m0, m4 + pxor m4, m4 + psubd m4, m5 + pmaxsd m5, m4 + pmaxsd m0, m5 + mova m4, m7 + + vinserti128 m7, m7, xm1, 1 + vperm2i128 m4, m4, m1, 00110001b + + pxor m1, m1 + psubd m1, m7 + pmaxsd m7, m1 + pxor m1, m1 + psubd m1, m4 + pmaxsd m4, m1 + pmaxsd m7, m4 + mova m1, m6 + paddd m6, m8 + psubd m8, m1 + mova m1, m2 + paddd m2, m3 + psubd m3, m1 + mova m1, m6 + + vinserti128 m6, m6, xm8, 1 + vperm2i128 m1, m1, m8, 00110001b + + pxor m8, m8 + psubd m8, m6 + pmaxsd m6, m8 + pxor m8, m8 + psubd m8, m1 + pmaxsd m1, m8 + pmaxsd m6, m1 + mova m8, m2 + + vinserti128 m2, m2, xm3, 1 + vperm2i128 m8, m8, m3, 00110001b + + pxor m3, m3 + psubd m3, m2 + pmaxsd m2, m3 + pxor m3, m3 + psubd m3, m8 + pmaxsd m8, m3 + pmaxsd m2, m8 + paddd m0, m6 + paddd m0, m7 + paddd m0, m2 + ret + +cglobal pixel_sa8d_8x8, 4,6,10 + add r1d, r1d + add r3d, r3d + lea r4, [r1 + r1 * 2] + lea r5, [r3 + r3 * 2] + + call sa8d_8x8_12bit + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, 
xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + movd eax, xm0 + add eax, 1 + shr eax, 1 + RET + +cglobal pixel_sa8d_8x16, 4,7,11 + add r1d, r1d + add r3d, r3d + lea r4, [r1 + r1 * 2] + lea r5, [r3 + r3 * 2] + pxor m10, m10 + + call sa8d_8x8_12bit + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + paddd xm0, [pd_1] + psrld xm0, 1 + paddd xm10, xm0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + paddd xm0, [pd_1] + psrld xm0, 1 + paddd xm0, xm10 + movd eax, xm0 + RET + +cglobal pixel_sa8d_16x16, 4,8,11 + add r1d, r1d + add r3d, r3d + lea r4, [r1 + r1 * 2] + lea r5, [r3 + r3 * 2] + mov r6, r0 + mov r7, r2 + pxor m10, m10 + + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r6 + 16] + lea r2, [r7 + 16] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m0, m10 + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + movd eax, xm0 + add eax, 1 + shr eax, 1 + RET + +cglobal pixel_sa8d_16x32, 4,8,12 + add r1d, r1d + add r3d, r3d + lea r4, [r1 + r1 * 2] + lea r5, [r3 + r3 * 2] + mov r6, r0 + mov r7, r2 + pxor m10, m10 + pxor m11, m11 + + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r6 + 16] + lea r2, [r7 + 16] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m0, m10 + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + paddd xm0, [pd_1] + psrld xm0, 1 + paddd xm11, xm0 + + lea r6, [r6 + r1 * 8] + lea r6, [r6 + r1 * 8] + lea r7, [r7 + r3 * 8] + lea r7, [r7 + r3 * 8] + pxor m10, m10 + mov r0, r6 + mov r2, r7 + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r6 + 16] + lea r2, [r7 + 16] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m0, m10 + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + paddd xm0, [pd_1] + psrld xm0, 1 + paddd xm11, xm0 + movd eax, xm11 + RET + +cglobal pixel_sa8d_32x32, 4,8,12 + add r1d, r1d + add r3d, r3d + lea r4, [r1 + r1 * 2] + lea r5, [r3 + r3 * 2] + mov r6, r0 + mov r7, r2 + pxor m10, m10 + pxor m11, m11 + + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r6 + 16] + lea r2, [r7 + 16] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m0, m10 + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + paddd xm0, [pd_1] + psrld xm0, 1 + paddd xm11, xm0 + + pxor m10, m10 + lea r0, [r6 + 32] + lea r2, [r7 + 32] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r6 + 48] + lea r2, [r7 + 48] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + 
r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m0, m10 + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + paddd xm0, [pd_1] + psrld xm0, 1 + paddd xm11, xm0 + + lea r6, [r6 + r1 * 8] + lea r6, [r6 + r1 * 8] + lea r7, [r7 + r3 * 8] + lea r7, [r7 + r3 * 8] + pxor m10, m10 + mov r0, r6 + mov r2, r7 + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r6 + 16] + lea r2, [r7 + 16] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m0, m10 + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + paddd xm0, [pd_1] + psrld xm0, 1 + paddd xm11, xm0 + + pxor m10, m10 + lea r0, [r6 + 32] + lea r2, [r7 + 32] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r6 + 48] + lea r2, [r7 + 48] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m0, m10 + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + paddd xm0, [pd_1] + psrld xm0, 1 + paddd xm11, xm0 + movd eax, xm11 + RET + +cglobal pixel_sa8d_32x64, 4,8,12 + add r1d, r1d + add r3d, r3d + lea r4, [r1 + r1 * 2] + lea r5, [r3 + r3 * 2] + mov r6, r0 + mov r7, r2 + pxor m10, m10 + pxor m11, m11 + + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r6 + 16] + lea r2, [r7 + 16] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m0, m10 + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + paddd xm0, [pd_1] + psrld xm0, 1 + paddd xm11, xm0 + + pxor m10, m10 + lea r0, [r6 + 32] + lea r2, [r7 + 32] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r6 + 48] + lea r2, [r7 + 48] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m0, m10 + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + paddd xm0, [pd_1] + psrld xm0, 1 + paddd xm11, xm0 + + lea r6, [r6 + r1 * 8] + lea r6, [r6 + r1 * 8] + lea r7, [r7 + r3 * 8] + lea r7, [r7 + r3 * 8] + pxor m10, m10 + mov r0, r6 + mov r2, r7 + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r6 + 16] + lea r2, [r7 + 16] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m0, m10 + + vextracti128 xm6, m0, 1 + paddd xm0, xm6 + + movhlps xm6, xm0 + paddd xm0, xm6 + + pshuflw xm6, xm0, 0Eh + paddd xm0, xm6 + paddd xm0, [pd_1] + psrld xm0, 1 + paddd xm11, xm0 + + pxor m10, m10 + lea r0, [r6 + 32] + lea r2, [r7 + 32] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r6 + 48] + lea r2, [r7 + 48] + call sa8d_8x8_12bit + paddd m10, m0 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + call sa8d_8x8_12bit + paddd m0, 
m10
+
+    vextracti128 xm6, m0, 1
+    paddd       xm0, xm6
+
+    movhlps     xm6, xm0
+    paddd       xm0, xm6
+
+    pshuflw     xm6, xm0, 0Eh
+    paddd       xm0, xm6
+    paddd       xm0, [pd_1]
+    psrld       xm0, 1
+    paddd       xm11, xm0
+
+    [...: the remaining 8x8 blocks of this partition repeat the same
+     pattern - step r6/r7 down 16 rows, call sa8d_8x8_12bit per 8x8 block,
+     accumulate into m10, and fold each 16x16 sum into m11 with the
+     horizontal reduction above]
+
+    movd        eax, xm11
+    RET
+
+cglobal pixel_sa8d_64x64, 4,8,12
+    add         r1d, r1d
+    add         r3d, r3d
+    lea         r4, [r1 + r1 * 2]
+    lea         r5, [r3 + r3 * 2]
+    mov         r6, r0
+    mov         r7, r2
+    pxor        m10, m10
+    pxor        m11, m11
+
+    call        sa8d_8x8_12bit
+    paddd       m10, m0
+
+    lea         r0, [r0 + r1 * 4]
+    lea         r2, [r2 + r3 * 4]
+    call        sa8d_8x8_12bit
+    paddd       m10, m0
+
+    lea         r0, [r6 + 16]
+    lea         r2, [r7 + 16]
+    call        sa8d_8x8_12bit
+    paddd       m10, m0
+
+    lea         r0, [r0 + r1 * 4]
+    lea         r2, [r2 + r3 * 4]
+    call        sa8d_8x8_12bit
+    paddd       m0, m10
+
+    vextracti128 xm6, m0, 1
+    paddd       xm0, xm6
+
+    movhlps     xm6, xm0
+    paddd       xm0, xm6
+
+    pshuflw     xm6, xm0, 0Eh
+    paddd       xm0, xm6
+    paddd       xm0, [pd_1]
+    psrld       xm0, 1
+    paddd       xm11, xm0
+
+    [...: the other fifteen 16x16 quadrants (column offsets +32 ... +112
+     plus three r1*16/r3*16 row steps) are processed with the identical
+     call/accumulate/reduce pattern]
+
+    movd        eax, xm11
+    RET
+%endif
 
 ;=============================================================================
 ; INTRA SATD
 ;=============================================================================
@@ -6508,7 +7861,9 @@
 %define movdqu movups
 %define punpcklqdq movlhps
 INIT_XMM sse2
+%if BIT_DEPTH <= 10
 SA8D
+%endif
 SATDS_SSE2
 
 %if HIGH_BIT_DEPTH == 0
@@ -6524,8 +7879,10 @@
 %define LOAD_SUMSUB_16P LOAD_SUMSUB_16P_SSSE3
 %endif
 INIT_XMM ssse3
-SATDS_SSE2
+%if BIT_DEPTH <= 10
 SA8D
+%endif
+SATDS_SSE2
 %undef movdqa ; nehalem doesn't like movaps
 %undef movdqu ; movups
 %undef punpcklqdq ; or movlhps
@@ -6533,21 +7890,24 @@
 %define TRANS TRANS_SSE4
 %define LOAD_DUP_4x8P LOAD_DUP_4x8P_PENRYN
 INIT_XMM sse4
-SATDS_SSE2
+%if BIT_DEPTH <= 10
 SA8D
+%endif
+SATDS_SSE2
 
 ; Sandy/Ivy Bridge and Bulldozer do movddup in the load unit, so
 ; it's effectively free.
 %define LOAD_DUP_4x8P LOAD_DUP_4x8P_CONROE
 INIT_XMM avx
-SATDS_SSE2
 SA8D
+SATDS_SSE2
 
 %define TRANS TRANS_XOP
 INIT_XMM xop
-SATDS_SSE2
+%if BIT_DEPTH <= 10
 SA8D
-
+%endif
+SATDS_SSE2
 
 %if HIGH_BIT_DEPTH == 0
 %define LOAD_SUMSUB_8x4P LOAD_SUMSUB8_16x4P_AVX2
@@ -6555,34 +7915,39 @@
 %define TRANS TRANS_SSE4
 
 %macro LOAD_SUMSUB_8x8P_AVX2 7 ; 4*dst, 2*tmp, mul]
-    movq        xm%1, [r0]
-    movq        xm%3, [r2]
-    movq        xm%2, [r0+r1]
-    movq        xm%4, [r2+r3]
-    vinserti128 m%1, m%1, [r0+4*r1], 1
-    vinserti128 m%3, m%3, [r2+4*r3], 1
-    vinserti128 m%2, m%2, [r0+r4], 1
-    vinserti128 m%4, m%4, [r2+r5], 1
-    punpcklqdq  m%1, m%1
-    punpcklqdq  m%3, m%3
-    punpcklqdq  m%2, m%2
-    punpcklqdq  m%4, m%4
+    movddup     xm%1, [r0]
+    movddup     xm%3, [r2]
+    movddup     xm%2, [r0+4*r1]
+    movddup     xm%5, [r2+4*r3]
+    vinserti128 m%1, m%1, xm%2, 1
+    vinserti128 m%3, m%3, xm%5, 1
+
+    movddup     xm%2, [r0+r1]
+    movddup     xm%4, [r2+r3]
+    movddup     xm%5, [r0+r4]
+    movddup     xm%6, [r2+r5]
+    vinserti128 m%2, m%2, xm%5, 1
+    vinserti128 m%4, m%4, xm%6, 1
+
    DIFF_SUMSUB_SSSE3 %1, %3, %2, %4, %7
    lea         r0, [r0+2*r1]
    lea         r2, [r2+2*r3]
-    movq        xm%3, [r0]
-    movq        xm%5, [r2]
-    movq        xm%4, [r0+r1]
+    movddup     xm%3, [r0]
+    movddup     xm%5, [r0+4*r1]
+    vinserti128 m%3, m%3, xm%5, 1
+
+    movddup     xm%5, [r2]
+    movddup     xm%4, [r2+4*r3]
+    vinserti128 m%5, m%5, xm%4, 1
+
+    movddup     xm%4, [r0+r1]
+    movddup     xm%6, [r0+r4]
+    vinserti128 m%4, m%4, xm%6, 1
+
    movq        xm%6, [r2+r3]
-    vinserti128 m%3, m%3, [r0+4*r1], 1
-    vinserti128 m%5, m%5, [r2+4*r3], 1
-    vinserti128 m%4, m%4, [r0+r4], 1
-    vinserti128 m%6, m%6, [r2+r5], 1
-    punpcklqdq  m%3, m%3
-    punpcklqdq  m%5, m%5
-    punpcklqdq  m%4, m%4
-    punpcklqdq  m%6, m%6
+    movhps      xm%6, [r2+r5]
+    vpermq      m%6, m%6, q1100
    DIFF_SUMSUB_SSSE3 %3, %5, %4, %6, %7
 %endmacro
 
@@ -6789,92 +8154,57 @@
 ;void planecopy_sc(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask)
 ;------------------------------------------------------------------------------------------------------------------------
 INIT_XMM sse2
-cglobal downShift_16, 7,7,3
-    movd        m0, r6d        ; m0 = shift
+cglobal downShift_16, 4,7,3
+    mov         r4d, r4m
+    mov         r5d, r5m
+    movd        m0, r6m        ; m0 = shift
    add         r1, r1
+    dec         r5d
 
 .loopH:
    xor         r6, r6
+
 .loopW:
    movu        m1, [r0 + r6 * 2]
-    movu        m2, [r0 + r6 * 2 + 16]
+    movu        m2, [r0 + r6 * 2 + mmsize]
    psrlw       m1, m0
    psrlw       m2, m0
    packuswb    m1, m2
    movu        [r2 + r6], m1
 
-    add         r6, 16
+    add         r6, mmsize
    cmp         r6d, r4d
-    jl         .loopW
+    jl          .loopW
 
    ; move to next row
    add         r0, r1
    add         r2, r3
    dec         r5d
-    jnz        .loopH
-
-;processing last row of every frame [To handle width which not a multiple of 16]
+    jnz         .loopH
 
+    ;processing last row of every frame [To handle width which not a multiple of 16]
+    ; r4d must be more than or equal to 16(mmsize)
 .loop16:
-    movu        m1, [r0]
-    movu        m2, [r0 + 16]
+    movu        m1, [r0 + (r4 - mmsize) * 2]
+    movu        m2, [r0 + (r4 - mmsize) * 2 + mmsize]
    psrlw       m1, m0
    psrlw       m2, m0
    packuswb    m1, m2
-    movu        [r2], m1
+    movu        [r2 + r4 - mmsize], m1
 
-    add         r0, 2 * mmsize
-    add         r2, mmsize
-    sub         r4d, 16
-    jz          .end
-    cmp         r4d, 15
-    jg          .loop16
+    sub         r4d, mmsize
+    jz          .end
+    cmp         r4d, mmsize
+    jge         .loop16
 
-    cmp         r4d, 8
-    jl          .process4
+    ; process partial pixels
    movu        m1, [r0]
+    movu        m2, [r0 + mmsize]
    psrlw       m1, m0
-    packuswb    m1, m1
-    movh        [r2], m1
+    psrlw       m2, m0
+    packuswb    m1, m2
+    movu        [r2], m1
 
-    [...: the old .process4/.process2/.process1 tails that drained 8/4/2/1
-     leftover pixels one width class at a time are deleted]
 
 .end:
    RET
 
@@ -6883,12 +8213,16 @@
 ;void planecopy_sp(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask)
 ;-------------------------------------------------------------------------------------------------------------------------------------
 INIT_YMM avx2
-cglobal downShift_16, 6,7,3
+cglobal downShift_16, 4,7,3
+    mov         r4d, r4m
+    mov         r5d, r5m
    movd        xm0, r6m       ; m0 = shift
    add         r1d, r1d
+    dec         r5d
 
    [...: the avx2 downShift_16 body gets the same rewrite as the sse2
     version above - the last-row loop now reads/writes full mmsize chunks
     anchored at the right edge (vpermq masks rewritten as q3120) and the
     old .process16/.process8/.process4/.process2/.process1 drain paths
     are deleted]
 
@@ -7122,7 +8403,9 @@
 ;void planecopy_sp_shl(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask)
 ;------------------------------------------------------------------------------------------------------------------------
 INIT_XMM sse2
-cglobal upShift_16, 6,7,4
+cglobal upShift_16, 4,7,4
+    mov         r4d, r4m
+    mov         r5d, r5m
    movd        m0, r6m        ; m0 = shift
    mova        m3, [pw_pixel_max]
    FIX_STRIDES r1d, r3d
 
@@ -7150,68 +8433,34 @@
    [...: same restructure as downShift_16 sse2, using psllw/pand against
     pw_pixel_max; the new last-row loop carries the comment "WARNING:
     width(r4d) MUST BE more than or equal to 16(mmsize) in here" and the
     old scalar drain paths are deleted]
 
@@ -7219,9 +8468,10 @@
 ;-------------------------------------------------------------------------------------------------------------------------------------
 ;void planecopy_sp_shl(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask)
 ;-------------------------------------------------------------------------------------------------------------------------------------
-; TODO: NO TEST CODE!
 INIT_YMM avx2
-cglobal upShift_16, 6,7,4
+cglobal upShift_16, 4,7,4
+    mov         r4d, r4m
+    mov         r5d, r5m
    movd        xm0, r6m       ; m0 = shift
    vbroadcasti128 m3, [pw_pixel_max]
    FIX_STRIDES r1d, r3d
 
@@ -7248,83 +8498,33 @@
    [...: same restructure as the avx2 downShift_16; the stale
     "TODO: NO TEST CODE!" marker is dropped along with the old drain
     paths]
 
@@ -8725,16 +9925,272 @@
    pabsd       xm1, xm1
 %endmacro
 
+%macro PSY_COST_PP_8x8_MAIN12 0
+    ; load source pixels
+    lea         r4, [r1 * 3]
+    pmovzxwd    m0, [r0]
+    pmovzxwd    m1, [r0 + r1]
+    pmovzxwd    m2, [r0 + r1 * 2]
+    pmovzxwd    m3, [r0 + r4]
+    lea         r5, [r0 + r1 * 4]
+    pmovzxwd    m4, [r5]
+    pmovzxwd    m5, [r5 + r1]
+    pmovzxwd    m6, [r5 + r1 * 2]
+    pmovzxwd    m7, [r5 + r4]
+
+    ; source SAD
+    paddd       m8, m0, m1
+    paddd       m8, m2
+    paddd       m8, m3
+    paddd       m8, m4
+    paddd       m8, m5
+    paddd       m8, m6
+    paddd       m8, m7
+
+    vextracti128 xm9, m8, 1
+    paddd       m8, m9
+    movhlps     xm9, xm8
+    paddd       xm8, xm9
+    pshuflw     xm9, xm8, 0Eh
+    paddd       xm8, xm9
+    psrld       m8, 2
+
+    ; source SA8D
+    psubd       m9, m1, m0
+    paddd       m0, m1
+    psubd       m1, m3, m2
+    paddd       m2, m3
+    [...: the rest of the in-register 8x8 Hadamard transform
+     (punpck/padd/psub/vinserti128/vperm2i128/pabsd/pmaxsd) and its
+     horizontal reduction follow, leaving (sa8d + 1) >> 1 in m0]
+    psubd       m11, m0, m8        ; sa8d_8x8 - sad_8x8
+
+    ; load recon pixels (r2/r3) and repeat the identical SAD/SA8D
+    ; computation into m0
+    [...: same instruction sequence against the recon plane]
+
+    psubd       m11, m0
+    pabsd       m11, m11
+%endmacro
+
 %if ARCH_X86_64
-%if HIGH_BIT_DEPTH
+INIT_YMM avx2
+%if HIGH_BIT_DEPTH && BIT_DEPTH == 12
+cglobal psyCost_pp_8x8, 4, 8, 12
+    add         r1d, r1d
+    add         r3d, r3d
+    PSY_COST_PP_8x8_MAIN12
+    movd        eax, xm11
+    RET
+%endif
+
+%if HIGH_BIT_DEPTH && BIT_DEPTH == 10
 cglobal psyCost_pp_8x8, 4, 8, 11
    add         r1d, r1d
    add         r3d, r3d
    PSY_PP_8x8_AVX2
    movd        eax, xm1
    RET
-%else ; !HIGH_BIT_DEPTH
-INIT_YMM avx2
+%endif
+
+%if BIT_DEPTH == 8
 cglobal psyCost_pp_8x8, 4, 8, 13
    lea         r4, [3 * r1]
    lea         r7, [3 * r3]
 
@@ -8746,9 +10202,35 @@
    RET
 %endif
 %endif
 
    [...: the psyCost_pp_16x16, _32x32 and _64x64 entry points are split
     the same way into BIT_DEPTH == 12 / == 10 / == 8 variants; the new
     12-bit versions loop PSY_COST_PP_8x8_MAIN12 over 2x2, 4x4 and 8x8
     grids of 8x8 blocks, accumulating xm11 into xm12]
 
@@ -12186,3 +13726,80 @@
    movd        eax, xm6
    RET
 %endif ; ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 1
+
+
+;-------------------------------------------------------------------------------------------------------------------------------------
+; pixel planeClipAndMax(pixel *src, intptr_t stride, int width, int height, uint64_t *outsum, const pixel minPix, const pixel maxPix)
+;-------------------------------------------------------------------------------------------------------------------------------------
+%if ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0
+INIT_YMM avx2
+cglobal planeClipAndMax, 5,7,8
+    movd        xm0, r5m
+    vpbroadcastb m0, xm0           ; m0 = [min]
+    vpbroadcastb m1, r6m           ; m1 = [max]
+    pxor        m2, m2             ; m2 = sumLuma
+    pxor        m3, m3             ; m3 = maxLumaLevel
+    pxor        m4, m4             ; m4 = zero
+
+    ; get mask to partial register pixels
+    mov         r5d, r2d
+    and         r2d, ~(mmsize - 1)
+    sub         r5d, r2d
+    lea         r6, [pb_movemask_32 + mmsize]
+    sub         r6, r5
+    movu        m5, [r6]           ; m5 = mask for last couple column
+
+.loopH:
+    lea         r5d, [r2 - mmsize]
+
+.loopW:
+    movu        m6, [r0 + r5]
+    pmaxub      m6, m0
+    pminub      m6, m1
+    movu        [r0 + r5], m6      ; store back
+    pmaxub      m3, m6             ; update maxLumaLevel
+    psadbw      m6, m4
+    paddq       m2, m6
+
+    sub         r5d, mmsize
+    jge         .loopW
+
+    ; partial pixels
+    movu        m7, [r0 + r2]
+    pmaxub      m6, m7, m0
+    pminub      m6, m1
+
+    pand        m7, m5             ; get invalid/unchange pixel
+    pandn       m6, m5, m6         ; clear invalid pixels
+    por         m7, m6             ; combin valid & invalid pixels
+    movu        [r0 + r2], m7      ; store back
+    pmaxub      m3, m6             ; update maxLumaLevel
+    psadbw      m6, m4
+    paddq       m2, m6
+
+.next:
+    add         r0, r1
+    dec         r3d
+    jg          .loopH
+
+    ; sumLuma
+    vextracti128 xm0, m2, 1
+    paddq       xm0, xm2
+    movhlps     xm1, xm0
+    paddq       xm0, xm1
+    movq        [r4], xm0
+
+    ; maxLumaLevel
+    vextracti128 xm0, m3, 1
+    pmaxub      xm0, xm3
+    movhlps     xm3, xm0
+    pmaxub      xm0, xm3
+    pmovzxbw    xm0, xm0
+    pxor        xm0, [pb_movemask + 16]
+    phminposuw  xm0, xm0
+
+    movd        eax, xm0
+    not         al
+    movzx       eax, al
+    RET
+%endif ; ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0
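The planeClipAndMax kernel at the end of this file backs the new 1.9 luma-clipping options (--min-luma/--max-luma) and the maxCLL/maxFALL frame statistics: it clamps a plane in place while accumulating the plane's luma sum and tracking the peak level. As a reading aid for the AVX2 code above, here is a hedged C++ sketch of the contract implied by the assembly; the function name comes from the prototype comment, the `pixel` typedef is the 8-bit build's, and this is not claimed to be x265's actual C fallback:

```cpp
#include <cstdint>
#include <algorithm>

typedef uint8_t pixel;   // 8-bit build; the asm above is HIGH_BIT_DEPTH == 0 only

// Hedged scalar model of planeClipAndMax: clip in place, sum, track the max.
static pixel planeClipAndMax_ref(pixel* src, intptr_t stride, int width, int height,
                                 uint64_t* outsum, const pixel minPix, const pixel maxPix)
{
    pixel maxLumaLevel = 0;
    uint64_t sumLuma = 0;
    for (int y = 0; y < height; y++, src += stride)
        for (int x = 0; x < width; x++)
        {
            src[x] = std::min(std::max(src[x], minPix), maxPix); // clip in place
            maxLumaLevel = std::max(maxLumaLevel, src[x]);       // peak luma level
            sumLuma += src[x];                                   // for average luma
        }
    *outsum = sumLuma;
    return maxLumaLevel;
}
```

The vector version does the same work 32 pixels at a time: pmaxub/pminub for the clamp, psadbw against zero for the running sum, and a pb_movemask_32 blend so the partial chunk at the right edge of each row is clipped without disturbing bytes beyond it.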
View file
x265_1.8.tar.gz/source/common/x86/pixel-util.h -> x265_1.9.tar.gz/source/common/x86/pixel-util.h
Changed
@@ -2,6 +2,7 @@
 * Copyright (C) 2013 x265 project
 *
 * Authors: Steve Borho <steve@borho.org>
+;* Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
@@ -55,5 +56,6 @@
 int PFX(scanPosLast_avx2_bmi2(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize));
 uint32_t PFX(findPosFirstLast_ssse3(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]));
 uint32_t PFX(costCoeffNxN_sse4(const uint16_t *scan, const coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, const uint8_t *tabSigCtx, uint32_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase));
+uint32_t PFX(costCoeffNxN_avx2_bmi2(const uint16_t *scan, const coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, const uint8_t *tabSigCtx, uint32_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase));
 
 #endif // ifndef X265_PIXEL_UTIL_H
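The only functional change here is the new costCoeffNxN_avx2_bmi2 declaration next to the SSE4 one. Presumably it is selected through x265's usual function-pointer setup; a hedged sketch of that hookup (the helper name and guard combination are assumptions modeled on how other primitives are dispatched, not a quote of the real asm-primitives.cpp):

```cpp
// Hypothetical dispatch sketch: pick the widest implementation the CPU
// supports. The AVX2 kernel also relies on BMI2 (bzhi/pext), so both
// feature bits must be present before it is selected.
void setupCostCoeffNxN(EncoderPrimitives& p, int cpuMask)   // assumed helper
{
    if (cpuMask & X265_CPU_SSE4)
        p.costCoeffNxN = PFX(costCoeffNxN_sse4);
    if ((cpuMask & X265_CPU_AVX2) && (cpuMask & X265_CPU_BMI2))
        p.costCoeffNxN = PFX(costCoeffNxN_avx2_bmi2);       // new in 1.9
}
```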
View file
x265_1.8.tar.gz/source/common/x86/pixel-util8.asm -> x265_1.9.tar.gz/source/common/x86/pixel-util8.asm
Changed
@@ -49,6 +49,7 @@
 mask_ff:                times 16 db 0xff
                         times 16 db 0
 deinterleave_shuf:      times 2 db 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15
+interleave_shuf:        times 2 db 0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15
 deinterleave_word_shuf: times 2 db 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15
 hmulw_16p:              times 8 dw 1
                         times 4 dw 1, -1
@@ -56,7 +57,7 @@
 SECTION .text
 
 cextern pw_1
-cextern pw_0_15
+cextern pw_0_7
 cextern pb_1
 cextern pb_128
 cextern pw_00ff
@@ -78,6 +79,7 @@
 cextern trans8_shuf
 cextern_naked private_prefix %+ _entropyStateBits
 cextern pb_movemask
+cextern pw_exp2_0_15
 
 ;-----------------------------------------------------------------------------
 ; void getResidual(pixel *fenc, pixel *pred, int16_t *residual, intptr_t stride)
@@ -792,6 +794,7 @@
    pshufd      m6, m6, 0          ; m6 = add
    mov         r3d, r4d           ; r3 = numCoeff
    shr         r4d, 3
+    pxor        m4, m4
 
 .loop:
    pmovsxwd    m0, [r0]           ; m0 = level
@@ -810,13 +813,13 @@
    psignd      m3, m1
    packssdw    m2, m3
+    pabsw       m2, m2
    movu        [r2], m2
 
    add         r0, 16
    add         r1, 32
    add         r2, 16
 
-    pxor        m4, m4
    pcmpeqw     m2, m4
    psubw       m7, m2
@@ -862,9 +865,11 @@
    psignd      m2, m0
    packssdw    m1, m2
-    vpermq      m2, m1, q3120
+    pabsw       m1, m1
+    vpermq      m2, m1, q3120
    movu        [r2], m2
+
    add         r0, mmsize
    add         r1, mmsize * 2
    add         r2, mmsize
@@ -1560,7 +1565,7 @@
    movd        m0, r6d
    pshuflw     m0, m0, 0
    punpcklqdq  m0, m0
-    pcmpgtw     m0, [pw_0_15]
+    pcmpgtw     m0, [pw_0_7]
 
 .loopH:
    mov         r6d, r4d
@@ -1718,7 +1723,7 @@
    pshuflw     m0, m0, 0
    punpcklqdq  m0, m0
    vinserti128 m0, m0, xm0, 1
-    pcmpgtw     m0, [pw_0_15]
+    pcmpgtw     m0, [pw_0_7]
 
 .loopH:
    mov         r6d, r4d
@@ -6397,6 +6402,78 @@
    movd        edx, xm6
 %endif
    RET
+
+INIT_YMM avx2
+cglobal pixel_var_32x32, 2,4,7
+    VAR_START 0
+    mov         r2d, 16
+
+.loop:
+    pmovzxbw    m0, [r0]
+    pmovzxbw    m3, [r0 + 16]
+    pmovzxbw    m1, [r0 + r1]
+    pmovzxbw    m4, [r0 + r1 + 16]
+
+    lea         r0, [r0 + r1 * 2]
+
+    VAR_CORE
+
+    dec         r2d
+    jg          .loop
+
+    vextracti128 xm0, m5, 1
+    vextracti128 xm1, m6, 1
+    paddw       xm5, xm0
+    paddd       xm6, xm1
+    HADDW       xm5, xm2
+    HADDD       xm6, xm1
+
+%if ARCH_X86_64
+    punpckldq   xm5, xm6
+    movq        rax, xm5
+%else
+    movd        eax, xm5
+    movd        edx, xm6
+%endif
+    RET
+
+INIT_YMM avx2
+cglobal pixel_var_64x64, 2,4,7
+    VAR_START 0
+    mov         r2d, 64
+
+    [...: same structure as pixel_var_32x32, loading one 64-pixel row per
+     iteration and widening the m5 sum to dwords before the final
+     HADDD/return sequence]
+
+    RET
 %endif ; !HIGH_BIT_DEPTH
 
 %macro VAR2_END 3
@@ -6578,10 +6655,10 @@
 
 ;-----------------------------------------------------------------------------
-; uint32_t[last first] findPosFirstAndLast(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16])
+; uint32_t[sumSign last first] findPosFirstLast(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16], uint32_t *absSum)
 ;-----------------------------------------------------------------------------
 INIT_XMM ssse3
-cglobal findPosFirstLast, 3,3,3
+cglobal findPosFirstLast, 3,3,4
    ; convert stride to int16_t
    add         r1d, r1d
@@ -6593,10 +6670,22 @@
    movh        m1, [r0]
    movhps      m1, [r0 + r1]
    movh        m2, [r0 + r1 * 2]
-    lea         r1, [r1 * 3]
+    lea         r1d, [r1 * 3]
    movhps      m2, [r0 + r1]
+    pxor        m3, m1, m2
    packsswb    m1, m2
 
+    ; get absSum
+    movhlps     m2, m3
+    pxor        m3, m2
+    pshufd      m2, m3, q2301
+    pxor        m3, m2
+    movd        r0d, m3
+    mov         r2d, r0d
+    shr         r2d, 16
+    xor         r2d, r0d
+    shl         r2d, 31
+
    ; get non-zero mask
    pxor        m2, m2
    pcmpeqb     m1, m2
@@ -6609,319 +6698,10 @@
    not         r0d
    bsr         r1w, r0w
    bsf         eax, r0d    ; side effect: clear AH to Zero
-    shl         r1d, 16
-    or          eax, r1d
-    RET
-
-
    [...: the sse4 saoCuStatsE2 and saoCuStatsE3 kernels, together with
     their C reference comments (~300 lines), are deleted from this file
     in 1.9]
+    shl         r1d, 8
+    or          eax, r2d    ; merge absSumSign
+    or          eax, r1d    ; merge lastNZPosInCG
    RET
-%endif ; ARCH_X86_64
 
 ; uint32_t costCoeffNxN(uint16_t *scan, coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, uint8_t *tabSigCtx, uint16_t scanFlagMask, uint8_t *baseCtx, int offset, int subPosBase)
@@ -6963,7 +6743,7 @@
 %if ARCH_X86_64
 ; uint32_t costCoeffNxN(uint16_t *scan, coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, uint8_t *tabSigCtx, uint16_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase)
 INIT_XMM sse4
-cglobal costCoeffNxN, 6,11,5
+cglobal costCoeffNxN, 6,11,6
    add         r2d, r2d
 
    ; abs(coeff)
@@ -7096,6 +6876,177 @@
 %endif
    and         eax, 0xFFFFFF
    RET
+
+
+; uint32_t costCoeffNxN(uint16_t *scan, coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, uint8_t *tabSigCtx, uint16_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase)
+INIT_YMM avx2,bmi2
+cglobal costCoeffNxN, 6,10,5
+    add         r2d, r2d
+
+    ; abs(coeff)
+    movq        xm1, [r1]
+    movhps      xm1, [r1 + r2]
+    movq        xm2, [r1 + r2 * 2]
+    lea         r2, [r2 * 3]
+    movhps      xm2, [r1 + r2]
+    vinserti128 m1, m1, xm2, 1
+    pabsw       m1, m1
+    ; r[1-2] free here
+
+    ; loading tabSigCtx
+    mova        xm2, [r4]
+    ; r[4] free here
+
+    ; WARNING: beyond-bound read here!
+    ; loading scan table
+    mov         r2d, r8m
+    bzhi        r4d, r5d, r2d      ; clear non-scan mask bits
+    mov         r6d, r2d
+    xor         r2d, 15
+    movu        m0, [r0 + r2 * 2]
+    packuswb    m0, m0
+    pxor        m0, [pb_15]
+    vpermq      m0, m0, q3120
+    add         r4d, r2d           ; r4d = (scanPosSigOff == 15) -> (numNonZero == 0)
+    mov         r2d, r6d
+
+    ; reorder tabSigCtx (+offset)
+    pshufb      xm2, xm0
+    vpbroadcastb xm3, r7m
+    paddb       xm2, xm3
+
+    ; reorder coeff
+    pshufb      m1, [deinterleave_shuf]
+    vpermq      m1, m1, q3120
+    pshufb      m1, m0
+    vpermq      m1, m1, q3120
+    pshufb      m1, [interleave_shuf]
+
+    ; sig mask
+    pxor        xm3, xm3
+    movd        xm4, r5d
+    vpbroadcastw m4, xm4
+    pandn       m4, m4, [pw_exp2_0_15]
+    pcmpeqw     m4, m3
+
+    ; absCoeff[numNonZero] = tmpCoeff[blkPos]
+    ; [0-3]
+    movq        r0, xm4
+    movq        r1, xm1
+    pext        r6, r1, r0
+    mov         qword [r3], r6
+    popcnt      r0, r0
+    shr         r0, 3
+    add         r3, r0
+
+    [...: the [4-7], [8-B] and [C-F] coefficient groups are extracted the
+     same way via pextrq/vextracti128 followed by pext/popcnt]
+
+    ; register mapping
+    ; m0 - Zigzag
+    ; m1 - sigCtx
+    ; r0 - x265_entropyStateBits
+    ; r1 - baseCtx
+    ; r2 - scanPosSigOff
+    ; r5 - scanFlagMask
+    ; r6 - sum
+    ; {r3,r4} - ctxSig[15-0]
+    ; r8m - (numNonZero != 0) || (subPosBase == 0)
+    lea         r0, [private_prefix %+ _entropyStateBits]
+    mov         r1, r6mp
+    xor         r6d, r6d
+    xor         r8d, r8d
+
+    test        r2d, r2d
+    jz          .idx_zero
+
+; {
+;     const uint32_t cnt = tabSigCtx[blkPos] + offset + posOffset;
+;     ctxSig = cnt & posZeroMask;
+;     const uint32_t mstate = baseCtx[ctxSig];
+;     const uint32_t mps = mstate & 1;
+;     const uint32_t stateBits = x265_entropyStateBits[mstate ^ sig];
+;     uint32_t nextState = (stateBits >> 24) + mps;
+;     if ((mstate ^ sig) == 1)
+;         nextState = sig;
+;     baseCtx[ctxSig] = (uint8_t)nextState;
+;     sum += stateBits;
+; }
+; absCoeff[numNonZero] = tmpCoeff[blkPos];
+; numNonZero += sig;
+; scanPosSigOff--;
.loop:
+    shr         r5d, 1
+    setc        r8b                          ; r8 = sig
+    movd        r7d, xm2                     ; r7 = ctxSig
+    movzx       r7d, r7b
+    psrldq      xm2, 1
+    movzx       r9d, byte [r1 + r7]          ; mstate = baseCtx[ctxSig]
+    mov         r3d, r9d
+    and         r3b, 1                       ; mps = mstate & 1
+    xor         r9d, r8d                     ; r9 = mstate ^ sig
+    add         r6d, [r0 + r9 * 4]           ; sum += entropyStateBits[mstate ^ sig]
+    add         r3b, byte [r0 + r9 * 4 + 3]  ; nextState = (stateBits >> 24) + mps
+    cmp         r9d, 1
+    cmove       r3d, r8d
+    mov         byte [r1 + r7], r3b
+
+    dec         r2d
+    jg          .loop
+
+.idx_zero:
+    xor         r2d, r2d
+    cmp         word r9m, 0
+    sete        r2b
+    add         r4d, r2d                     ; (numNonZero != 0) || (subPosBase == 0)
+    jz          .exit
+
+    dec         r2b
+    movd        r3d, xm2
+    and         r2d, r3d
+
+    movzx       r3d, byte [r1 + r2]          ; mstate = baseCtx[ctxSig]
+    mov         r4d, r5d
+    xor         r5d, r3d                     ; r0 = mstate ^ sig
+    and         r3b, 1                       ; mps = mstate & 1
+    add         r6d, [r0 + r5 * 4]           ; sum += x265_entropyStateBits[mstate ^ sig]
+    add         r3b, [r0 + r5 * 4 + 3]       ; nextState = (stateBits >> 24) + mps
+    cmp         r5b, 1
+    cmove       r3d, r4d
+    mov         byte [r1 + r2], r3b
+
+.exit:
+%ifnidn eax,r6d
+    mov         eax, r6d
+%endif
+    and         eax, 0xFFFFFF
+    RET
 %endif ; ARCH_X86_64
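Two of the rewrites above are worth restating in scalar form. findPosFirstLast now returns three packed fields instead of two: bits 7:0 carry the first nonzero scan position, bits 15:8 the last (previously bits 31:16), and bit 31 the parity of the sum of absolute levels, which sign-data hiding needs (the `uint32_t *absSum` in the updated signature comment is evidently folded into that bit rather than passed as a pointer). A hedged C++ model, where the scan-table addressing is an assumption and at least one nonzero coefficient in the 4x4 group is presumed, as in the real caller:

```cpp
#include <cstdint>

// Hedged scalar model of the ssse3 findPosFirstLast above.
static uint32_t findPosFirstLast_ref(const int16_t* dstCoeff, intptr_t trSize,
                                     const uint16_t scanTbl[16])
{
    uint32_t first = 15, last = 0, parity = 0;
    for (uint32_t i = 0; i < 16; i++)
    {
        const uint16_t pos = scanTbl[i];                      // scan -> block pos
        const int16_t c = dstCoeff[(pos >> 2) * trSize + (pos & 3)];
        parity ^= (uint32_t)c;   // |c| and c share bit 0, so this tracks absSum & 1
        if (c)
        {
            if (i < first) first = i;
            last = i;
        }
    }
    return ((parity & 1) << 31)   // absSum parity (sign-data hiding)
         | (last << 8)            // lastNZPosInCG
         | first;                 // firstNZPosInCG
}
```

The costCoeffNxN AVX2 path, for its part, is a straight vectorization of the CABAC significance-flag costing shown in its pseudo-code comment: pext compacts the significant coefficients into absCoeff[], and the scalar .loop walks the 16 flags, updating baseCtx[] state while summing entropyStateBits[].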
View file
x265_1.8.tar.gz/source/common/x86/pixel.h -> x265_1.9.tar.gz/source/common/x86/pixel.h
Changed
@@ -2,10 +2,12 @@
 * pixel.h: x86 pixel metrics
 *****************************************************************************
 * Copyright (C) 2003-2013 x264 project
+ * Copyright (C) 2013-2015 x265 project
 *
 * Authors: Laurent Aimar <fenrir@via.ecp.fr>
 *          Loren Merritt <lorenm@u.washington.edu>
 *          Fiona Glaser <fiona@x264.com>
+;* Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
@@ -34,9 +36,10 @@
 void PFX(upShift_16_avx2)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask);
 void PFX(upShift_8_sse4)(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift);
 void PFX(upShift_8_avx2)(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift);
+pixel PFX(planeClipAndMax_avx2)(pixel *src, intptr_t stride, int width, int height, uint64_t *outsum, const pixel minPix, const pixel maxPix);
 
 #define DECL_PIXELS(cpu) \
-    FUNCDEF_PU(uint32_t, pixel_ssd, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \
+    FUNCDEF_PU(sse_t, pixel_ssd, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \
     FUNCDEF_PU(int, pixel_sa8d, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \
     FUNCDEF_PU(void, pixel_sad_x3, cpu, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*); \
     FUNCDEF_PU(void, pixel_sad_x4, cpu, const pixel*, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*); \
@@ -45,10 +48,10 @@
     FUNCDEF_PU(void, pixel_sub_ps, cpu, int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); \
     FUNCDEF_CHROMA_PU(int, pixel_satd, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \
     FUNCDEF_CHROMA_PU(int, pixel_sad, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \
-    FUNCDEF_CHROMA_PU(uint32_t, pixel_ssd_ss, cpu, const int16_t*, intptr_t, const int16_t*, intptr_t); \
+    FUNCDEF_CHROMA_PU(sse_t, pixel_ssd_ss, cpu, const int16_t*, intptr_t, const int16_t*, intptr_t); \
     FUNCDEF_CHROMA_PU(void, addAvg, cpu, const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); \
-    FUNCDEF_CHROMA_PU(int, pixel_ssd_s, cpu, const int16_t*, intptr_t); \
-    FUNCDEF_TU_S(int, pixel_ssd_s, cpu, const int16_t*, intptr_t); \
+    FUNCDEF_CHROMA_PU(sse_t, pixel_ssd_s, cpu, const int16_t*, intptr_t); \
+    FUNCDEF_TU_S(sse_t, pixel_ssd_s, cpu, const int16_t*, intptr_t); \
     FUNCDEF_TU(uint64_t, pixel_var, cpu, const pixel*, intptr_t); \
     FUNCDEF_TU(int, psyCost_pp, cpu, const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); \
     FUNCDEF_TU(int, psyCost_ss, cpu, const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride)
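The uint32_t-to-sse_t switch in these prototypes is an overflow fix that goes with the new 12-bit kernels: a worst-case 64x64 SSD no longer fits in 32 bits. A hedged sketch of the idea (the exact depth threshold and where x265 defines sse_t may differ):

```cpp
#include <cstdint>

#ifndef X265_DEPTH
#define X265_DEPTH 8        // stand-in for the build-time bit depth
#endif

#if X265_DEPTH <= 10
typedef uint32_t sse_t;     // 64*64 * 1023^2 = 4,286,582,784 just fits in 32 bits
#else
typedef uint64_t sse_t;     // 64*64 * 4095^2 = 68,685,926,400 needs 64 bits
#endif
```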
View file
x265_1.8.tar.gz/source/common/x86/pixeladd8.asm -> x265_1.9.tar.gz/source/common/x86/pixeladd8.asm
Changed
@@ -2,6 +2,7 @@
 ;* Copyright (C) 2013 x265 project
 ;*
 ;* Authors: Praveen Kumar Tiwari <praveen@multicorewareinc.com>
+;* Min Chen <chenm003@163.com>
 ;*
 ;* This program is free software; you can redistribute it and/or modify
 ;* it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/common/x86/sad-a.asm -> x265_1.9.tar.gz/source/common/x86/sad-a.asm
Changed
@@ -2,6 +2,7 @@
 ;* sad-a.asm: x86 sad functions
 ;*****************************************************************************
 ;* Copyright (C) 2003-2013 x264 project
+;* Copyright (C) 2013-2015 x265 project
 ;*
 ;* Authors: Loren Merritt <lorenm@u.washington.edu>
 ;*          Fiona Glaser <fiona@x264.com>
@@ -3328,6 +3329,730 @@
    SAD_X4_END_SSE2 1
 %endmacro
 
+%if ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0
+INIT_YMM avx2
+%macro SAD_X4_64x8_AVX2 0
+    movu        m4, [r0]
+    movu        m5, [r1]
+    movu        m6, [r2]
+    movu        m7, [r3]
+    movu        m8, [r4]
+
+    psadbw      m9, m4, m5
+    paddd       m0, m9
+    psadbw      m5, m4, m6
+    paddd       m1, m5
+    psadbw      m6, m4, m7
+    paddd       m2, m6
+    psadbw      m4, m8
+    paddd       m3, m4
+
+    [...: the same load/psadbw/accumulate group is repeated for the second
+     32-byte half of each row and for all eight rows of the 64x8 tile,
+     stepping fenc by FENC_STRIDE and the four references by r5
+     (with r7 = r5 * 3 for the fourth row of each group)]
+%endmacro
+
+%macro PIXEL_SAD_X4_END_AVX2 0
+    vextracti128 xm4, m0, 1
+    vextracti128 xm5, m1, 1
+    vextracti128 xm6, m2, 1
+    vextracti128 xm7, m3, 1
+    paddd       m0, m4
+    paddd       m1, m5
+    paddd       m2, m6
+    paddd       m3, m7
+    pshufd      xm4, xm0, 2
+    pshufd      xm5, xm1, 2
+    pshufd      xm6, xm2, 2
+    pshufd      xm7, xm3, 2
+    paddd       m0, m4
+    paddd       m1, m5
+    paddd       m2, m6
+    paddd       m3, m7
+
+    movd        [r6 + 0], xm0
+    movd        [r6 + 4], xm1
+    movd        [r6 + 8], xm2
+    movd        [r6 + 12], xm3
+%endmacro
+
+cglobal pixel_sad_x4_64x16, 7,8,10
+    pxor        m0, m0
+    pxor        m1, m1
+    pxor        m2, m2
+    pxor        m3, m3
+    lea         r7, [r5 * 3]
+
+    SAD_X4_64x8_AVX2
+
+    add         r0, FENC_STRIDE * 4
+    lea         r1, [r1 + r5 * 4]
+    lea         r2, [r2 + r5 * 4]
+    lea         r3, [r3 + r5 * 4]
+    lea         r4, [r4 + r5 * 4]
+
+    SAD_X4_64x8_AVX2
+    PIXEL_SAD_X4_END_AVX2
+    RET
+
+    [...: pixel_sad_x4_64x32, _64x48 and _64x64 are identical except that
+     the advance + SAD_X4_64x8_AVX2 pair is repeated 4, 6 and 8 times]
+
+%macro SAD_X4_48x8_AVX2 0
+    [...: as SAD_X4_64x8_AVX2, but each row loads 48 bytes - a full ymm
+     plus a 16-byte xmm half that is vinserti128-merged with the first
+     half of the following row so every psadbw still operates on full
+     32-byte vectors]
+%endmacro
+
+INIT_YMM avx2
+cglobal pixel_sad_x4_48x64, 7,8,10
+    pxor        m0, m0
+    pxor        m1, m1
+    pxor        m2, m2
+    pxor        m3, m3
+    lea         r7, [r5 * 3]
+
+    SAD_X4_48x8_AVX2
+
+    [...: advance + SAD_X4_48x8_AVX2 repeated for the remaining seven
+     48x8 tiles]
+
+    PIXEL_SAD_X4_END_AVX2
+    RET
+%endif
+
 INIT_XMM sse2
 SAD_X_SSE2 3, 16, 16, 7
 SAD_X_SSE2 3, 16,  8, 7
@@ -3949,6 +4674,849 @@
    movd        [r5 + 8], xm1
    RET
 
+%if ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0
+INIT_YMM avx2
+%macro SAD_X3_32x8_AVX2 0
+    movu        m3, [r0]
+    movu        m4, [r1]
+    movu        m5, [r2]
+    movu        m6, [r3]
+
+    psadbw      m7, m3, m4
+    paddd       m0, m7
+    psadbw      m7, m3, m5
+    paddd       m1, m7
+    psadbw      m3, m6
+    paddd       m2, m3
+
+    [...: the group is repeated for all eight rows of the 32x8 tile,
+     stepping fenc by FENC_STRIDE and the three references by r4
+     (r6 = r4 * 3)]
+%endmacro
+
+%macro SAD_X3_64x8_AVX2 0
+    [...: as SAD_X3_32x8_AVX2, with a second 32-byte group per row]
+%endmacro
+
+%macro SAD_X3_48x8_AVX2 0
+    movu        m3, [r0]
+    movu        m4, [r1]
+    movu        m5, [r2]
+    movu        m6, [r3]
+
+    psadbw      m7, m3, m4
+    paddd       m0, m7
+    psadbw      m4, m3, m5
+    paddd       m1, m4
+    psadbw      m3, m6
+    paddd       m2, m3
+
+    [...: the remaining 48-byte rows are handled as in SAD_X4_48x8_AVX2,
+     pairing the 16-byte tails of adjacent rows via vinserti128; the diff
+     continues past the end of this page excerpt]
movu xm5, [r2 + mmsize] + movu xm6, [r3 + mmsize] + vinserti128 m3, m3, [r0 + FENC_STRIDE], 1 + vinserti128 m4, m4, [r1 + r4], 1 + vinserti128 m5, m5, [r2 + r4], 1 + vinserti128 m6, m6, [r3 + r4], 1 + + psadbw m7, m3, m4 + paddd m0, m7 + psadbw m4, m3, m5 + paddd m1, m4 + psadbw m3, m6 + paddd m2, m3 + + movu m3, [r0 + FENC_STRIDE + mmsize/2] + movu m4, [r1 + r4 + mmsize/2] + movu m5, [r2 + r4 + mmsize/2] + movu m6, [r3 + r4 + mmsize/2] + + psadbw m7, m3, m4 + paddd m0, m7 + psadbw m4, m3, m5 + paddd m1, m4 + psadbw m3, m6 + paddd m2, m3 + + movu m3, [r0 + FENC_STRIDE * 2] + movu m4, [r1 + r4 * 2] + movu m5, [r2 + r4 * 2] + movu m6, [r3 + r4 * 2] + + psadbw m7, m3, m4 + paddd m0, m7 + psadbw m4, m3, m5 + paddd m1, m4 + psadbw m3, m6 + paddd m2, m3 + + movu xm3, [r0 + FENC_STRIDE * 2 + mmsize] + movu xm4, [r1 + r4 * 2 + mmsize] + movu xm5, [r2 + r4 * 2 + mmsize] + movu xm6, [r3 + r4 * 2 + mmsize] + vinserti128 m3, m3, [r0 + FENC_STRIDE * 3], 1 + vinserti128 m4, m4, [r1 + r6], 1 + vinserti128 m5, m5, [r2 + r6], 1 + vinserti128 m6, m6, [r3 + r6], 1 + + psadbw m7, m3, m4 + paddd m0, m7 + psadbw m4, m3, m5 + paddd m1, m4 + psadbw m3, m6 + paddd m2, m3 + + movu m3, [r0 + FENC_STRIDE * 3 + mmsize/2] + movu m4, [r1 + r6 + mmsize/2] + movu m5, [r2 + r6 + mmsize/2] + movu m6, [r3 + r6 + mmsize/2] + + psadbw m7, m3, m4 + paddd m0, m7 + psadbw m4, m3, m5 + paddd m1, m4 + psadbw m3, m6 + paddd m2, m3 +%endmacro + +%macro PIXEL_SAD_X3_END_AVX2 0 + vextracti128 xm3, m0, 1 + vextracti128 xm4, m1, 1 + vextracti128 xm5, m2, 1 + paddd m0, m3 + paddd m1, m4 + paddd m2, m5 + pshufd xm3, xm0, 2 + pshufd xm4, xm1, 2 + pshufd xm5, xm2, 2 + paddd m0, m3 + paddd m1, m4 + paddd m2, m5 + + movd [r5 + 0], xm0 + movd [r5 + 4], xm1 + movd [r5 + 8], xm2 +%endmacro + +cglobal pixel_sad_x3_32x8, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + lea r6, [r4 * 3] + + SAD_X3_32x8_AVX2 + PIXEL_SAD_X3_END_AVX2 + RET + +cglobal pixel_sad_x3_32x16, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + lea r6, [r4 * 3] + + SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_32x8_AVX2 + PIXEL_SAD_X3_END_AVX2 + RET + +cglobal pixel_sad_x3_32x24, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + lea r6, [r4 * 3] + + SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_32x8_AVX2 + PIXEL_SAD_X3_END_AVX2 + RET + +cglobal pixel_sad_x3_32x32, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + lea r6, [r4 * 3] + + SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_32x8_AVX2 + PIXEL_SAD_X3_END_AVX2 + RET + +cglobal pixel_sad_x3_32x64, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + lea r6, [r4 * 3] + + SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + 
SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_32x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_32x8_AVX2 + PIXEL_SAD_X3_END_AVX2 + RET + +cglobal pixel_sad_x3_64x16, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + lea r6, [r4 * 3] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + PIXEL_SAD_X3_END_AVX2 + RET + +cglobal pixel_sad_x3_64x32, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + lea r6, [r4 * 3] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + PIXEL_SAD_X3_END_AVX2 + RET + +cglobal pixel_sad_x3_64x48, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + lea r6, [r4 * 3] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + PIXEL_SAD_X3_END_AVX2 + RET + +cglobal pixel_sad_x3_64x64, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + lea r6, [r4 * 3] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_64x8_AVX2 + PIXEL_SAD_X3_END_AVX2 + RET + +cglobal pixel_sad_x3_48x64, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + lea r6, [r4 * 3] + + SAD_X3_48x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_48x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_48x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_48x8_AVX2 + + add 
r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_48x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_48x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_48x8_AVX2 + + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + SAD_X3_48x8_AVX2 + PIXEL_SAD_X3_END_AVX2 + RET +%endif + INIT_YMM avx2 cglobal pixel_sad_x4_8x8, 7,7,5 xorps m0, m0
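All of the AVX2 pixel_sad_x3_* and pixel_sad_x4_* kernels added above share one contract: a single source (fenc) block, held at the fixed FENC_STRIDE pitch, is compared against three or four candidate reference blocks that share a single stride, and the three or four SAD costs are written out together so the motion search loads each fenc row only once. Below is a minimal scalar sketch of that contract for the x3 case; sad_x3_ref and its signature are illustrative stand-ins, not the exact x265 prototype.

#include <cstdint>
#include <cstdlib>

// Hedged scalar model of the sad_x3 contract the AVX2 kernels above
// implement: one fenc block at a fixed pitch, three references sharing
// one stride, three SAD costs written out together.
static void sad_x3_ref(const uint8_t* fenc, int fencStride,
                       const uint8_t* ref0, const uint8_t* ref1,
                       const uint8_t* ref2, intptr_t refStride,
                       int width, int height, int32_t res[3])
{
    res[0] = res[1] = res[2] = 0;
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            int f = fenc[x];
            res[0] += std::abs(f - ref0[x]);   // what each psadbw/paddd pair accumulates
            res[1] += std::abs(f - ref1[x]);
            res[2] += std::abs(f - ref2[x]);
        }
        fenc += fencStride;                    // FENC_STRIDE in the asm
        ref0 += refStride;                     // the shared reference stride register
        ref1 += refStride;
        ref2 += refStride;
    }
}

Each SAD_X3_*x8_AVX2 macro body unrolls eight such rows with psadbw/paddd, the lea blocks between macro invocations advance every pointer four rows at a time, and PIXEL_SAD_X3_END_AVX2 folds the per-lane partial sums and stores the three dword costs.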
x265_1.8.tar.gz/source/common/x86/sad16-a.asm -> x265_1.9.tar.gz/source/common/x86/sad16-a.asm
Changed
@@ -413,77 +413,50 @@ SAD 16, 32 INIT_YMM avx2 -cglobal pixel_sad_16x64, 4,7,4 +cglobal pixel_sad_16x64, 4,5,5 pxor m0, m0 - pxor m3, m3 - mov r4d, 64 / 8 - add r3d, r3d - add r1d, r1d - lea r5, [r1 * 3] - lea r6, [r3 * 3] + mov r4d, 16 + mova m4, [pw_1] .loop: movu m1, [r2] - movu m2, [r2 + r3] + movu m2, [r2 + r3 * 2] psubw m1, [r0] - psubw m2, [r0 + r1] - pabsw m1, m1 - pabsw m2, m2 - paddw m0, m1 - paddw m3, m2 - - movu m1, [r2 + 2 * r3] - movu m2, [r2 + r6] - psubw m1, [r0 + 2 * r1] - psubw m2, [r0 + r5] + psubw m2, [r0 + r1 * 2] pabsw m1, m1 pabsw m2, m2 - paddw m0, m1 - paddw m3, m2 - + paddw m3, m1, m2 lea r0, [r0 + 4 * r1] lea r2, [r2 + 4 * r3] movu m1, [r2] - movu m2, [r2 + r3] + movu m2, [r2 + r3 * 2] psubw m1, [r0] - psubw m2, [r0 + r1] + psubw m2, [r0 + r1 * 2] pabsw m1, m1 pabsw m2, m2 - paddw m0, m1 - paddw m3, m2 - - movu m1, [r2 + 2 * r3] - movu m2, [r2 + r6] - psubw m1, [r0 + 2 * r1] - psubw m2, [r0 + r5] - pabsw m1, m1 - pabsw m2, m2 - paddw m0, m1 - paddw m3, m2 - - lea r0, [r0 + 4 * r1] - lea r2, [r2 + 4 * r3] - - dec r4d - jg .loop - - HADDUWD m0, m1 - HADDUWD m3, m1 - HADDD m0, m1 - HADDD m3, m1 + paddw m1, m2 + pmaddwd m3, m4 paddd m0, m3 + pmaddwd m1, m4 + paddd m0, m1 + lea r0, [r0+4*r1] + lea r2, [r2+4*r3] + dec r4d + jg .loop + HADDD m0, m1 movd eax, xm0 RET INIT_YMM avx2 -cglobal pixel_sad_32x8, 4,7,5 +cglobal pixel_sad_32x8, 4,7,7 pxor m0, m0 mov r4d, 8/4 + mova m6, [pw_1] add r3d, r3d add r1d, r1d - lea r5, [r1 * 3] - lea r6, [r3 * 3] + lea r5d, [r1 * 3] + lea r6d, [r3 * 3] .loop: movu m1, [r2] movu m2, [r2 + 32] @@ -499,8 +472,7 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 + paddw m5, m1, m3 movu m1, [r2 + 2 * r3] movu m2, [r2 + 2 * r3 + 32] @@ -518,24 +490,28 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 + paddw m1, m3 + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 dec r4d jg .loop - HADDW m0, m1 + HADDD m0, m1 movd eax, xm0 RET INIT_YMM avx2 -cglobal pixel_sad_32x16, 4,7,5 +cglobal pixel_sad_32x16, 4,7,7 pxor m0, m0 mov r4d, 16/8 + mova m6, [pw_1] add r3d, r3d add r1d, r1d - lea r5, [r1 * 3] - lea r6, [r3 * 3] + lea r5d, [r1 * 3] + lea r6d, [r3 * 3] .loop: movu m1, [r2] movu m2, [r2 + 32] @@ -551,8 +527,7 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 + paddw m5, m1, m3 movu m1, [r2 + 2 * r3] movu m2, [r2 + 2 * r3 + 32] @@ -570,8 +545,12 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 + paddw m1, m3 + + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 movu m1, [r2] movu m2, [r2 + 32] @@ -587,8 +566,7 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 + paddw m5, m1, m3 movu m1, [r2 + 2 * r3] movu m2, [r2 + 2 * r3 + 32] @@ -606,24 +584,28 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 + paddw m1, m3 + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 dec r4d jg .loop - HADDW m0, m1 + HADDD m0, m1 movd eax, xm0 RET INIT_YMM avx2 -cglobal pixel_sad_32x24, 4,7,5 +cglobal pixel_sad_32x24, 4,7,7 pxor m0, m0 mov r4d, 24/4 + mova m6, [pw_1] add r3d, r3d add r1d, r1d - lea r5, [r1 * 3] - lea r6, [r3 * 3] + lea r5d, [r1 * 3] + lea r6d, [r3 * 3] .loop: movu m1, [r2] movu m2, [r2 + 32] @@ -639,8 +621,7 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 + paddw m5, m1, m3 movu m1, [r2 + 2 * r3] movu m2, [r2 + 2 * r3 + 32] @@ -656,29 +637,30 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 - + paddw m1, m3 + pmaddwd m5, m6 + paddd m0, m5 + 
pmaddwd m1, m6 + paddd m0, m1 lea r0, [r0 + 4 * r1] lea r2, [r2 + 4 * r3] dec r4d jg .loop - HADDUWD m0, m1 HADDD m0, m1 movd eax, xm0 RET - INIT_YMM avx2 -cglobal pixel_sad_32x32, 4,7,5 +cglobal pixel_sad_32x32, 4,7,7 pxor m0, m0 mov r4d, 32/4 + mova m6, [pw_1] add r3d, r3d add r1d, r1d - lea r5, [r1 * 3] - lea r6, [r3 * 3] + lea r5d, [r1 * 3] + lea r6d, [r3 * 3] .loop: movu m1, [r2] movu m2, [r2 + 32] @@ -694,8 +676,7 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 + paddw m5, m1, m3 movu m1, [r2 + 2 * r3] movu m2, [r2 + 2 * r3 + 32] @@ -711,8 +692,12 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 + paddw m1, m3 + + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 lea r0, [r0 + 4 * r1] lea r2, [r2 + 4 * r3] @@ -720,20 +705,19 @@ dec r4d jg .loop - HADDUWD m0, m1 HADDD m0, m1 movd eax, xm0 RET INIT_YMM avx2 -cglobal pixel_sad_32x64, 4,7,6 +cglobal pixel_sad_32x64, 4,7,7 pxor m0, m0 - pxor m5, m5 mov r4d, 64 / 4 + mova m6, [pw_1] add r3d, r3d add r1d, r1d - lea r5, [r1 * 3] - lea r6, [r3 * 3] + lea r5d, [r1 * 3] + lea r6d, [r3 * 3] .loop: movu m1, [r2] movu m2, [r2 + 32] @@ -749,8 +733,7 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m5, m3 + paddw m5, m1, m3 movu m1, [r2 + 2 * r3] movu m2, [r2 + 2 * r3 + 32] @@ -766,29 +749,28 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m5, m3 + paddw m1, m3 + + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 + lea r0, [r0 + 4 * r1] lea r2, [r2 + 4 * r3] - dec r4d + dec r4d jg .loop - HADDUWD m0, m1 - HADDUWD m5, m1 HADDD m0, m1 - HADDD m5, m1 - paddd m0, m5 - movd eax, xm0 RET INIT_YMM avx2 cglobal pixel_sad_48x64, 4, 5, 7 pxor m0, m0 - pxor m5, m5 - pxor m6, m6 mov r4d, 64/2 + mova m6, [pw_1] add r3d, r3d add r1d, r1d .loop: @@ -801,9 +783,8 @@ pabsw m1, m1 pabsw m2, m2 pabsw m3, m3 - paddw m0, m1 - paddw m5, m2 - paddw m6, m3 + paddw m1, m2 + paddw m5, m3, m1 movu m1, [r2 + r3 + 0 * mmsize] movu m2, [r2 + r3 + 1 * mmsize] @@ -814,29 +795,28 @@ pabsw m1, m1 pabsw m2, m2 pabsw m3, m3 - paddw m0, m1 - paddw m5, m2 - paddw m6, m3 + paddw m1, m2 + paddw m3, m1 + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m3, m6 + paddd m0, m3 lea r0, [r0 + 2 * r1] lea r2, [r2 + 2 * r3] dec r4d jg .loop - HADDUWD m0, m1 - HADDUWD m5, m1 - HADDUWD m6, m1 - paddd m0, m5 - paddd m0, m6 - HADDD m0, m1 + HADDD m0, m3 movd eax, xm0 RET INIT_YMM avx2 -cglobal pixel_sad_64x16, 4, 5, 5 +cglobal pixel_sad_64x16, 4, 5, 7 pxor m0, m0 mov r4d, 16 / 2 + mova m6, [pw_1] add r3d, r3d add r1d, r1d .loop: @@ -854,8 +834,8 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 + paddw m5, m1, m3 + movu m1, [r2 + r3] movu m2, [r2 + r3 + 32] movu m3, [r2 + r3 + 64] @@ -870,24 +850,28 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 + paddw m1, m3 + + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 + lea r0, [r0 + 2 * r1] lea r2, [r2 + 2 * r3] - dec r4d - jg .loop + dec r4d + jg .loop - HADDUWD m0, m1 HADDD m0, m1 movd eax, xm0 RET INIT_YMM avx2 -cglobal pixel_sad_64x32, 4, 5, 6 +cglobal pixel_sad_64x32, 4, 5, 7 pxor m0, m0 - pxor m5, m5 mov r4d, 32 / 2 + mova m6, [pw_1] add r3d, r3d add r1d, r1d .loop: @@ -905,8 +889,7 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m5, m3 + paddw m5, m1, m3 movu m1, [r2 + r3] movu m2, [r2 + r3 + 32] @@ -922,29 +905,27 @@ pabsw m4, m4 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m5, m3 + paddw m1, m3 + + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 lea r0, [r0 + 
2 * r1] lea r2, [r2 + 2 * r3] - dec r4d - jg .loop + dec r4d + jg .loop - HADDUWD m0, m1 - HADDUWD m5, m1 - paddd m0, m5 HADDD m0, m1 - movd eax, xm0 RET INIT_YMM avx2 -cglobal pixel_sad_64x48, 4, 5, 8 +cglobal pixel_sad_64x48, 4, 5, 7 pxor m0, m0 - pxor m5, m5 - pxor m6, m6 - pxor m7, m7 mov r4d, 48 / 2 + mova m6, [pw_1] add r3d, r3d add r1d, r1d .loop: @@ -960,10 +941,9 @@ pabsw m2, m2 pabsw m3, m3 pabsw m4, m4 - paddw m0, m1 - paddw m5, m2 - paddw m6, m3 - paddw m7, m4 + paddw m1, m2 + paddw m3, m4 + paddw m5, m1, m3 movu m1, [r2 + r3] movu m2, [r2 + r3 + 32] @@ -977,35 +957,30 @@ pabsw m2, m2 pabsw m3, m3 pabsw m4, m4 - paddw m0, m1 - paddw m5, m2 - paddw m6, m3 - paddw m7, m4 + paddw m1, m2 + paddw m3, m4 + paddw m1, m3 + + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 lea r0, [r0 + 2 * r1] lea r2, [r2 + 2 * r3] - dec r4d - jg .loop + dec r4d + jg .loop - HADDUWD m0, m1 - HADDUWD m5, m1 - HADDUWD m6, m1 - HADDUWD m7, m1 - paddd m0, m5 - paddd m0, m6 - paddd m0, m7 HADDD m0, m1 movd eax, xm0 RET INIT_YMM avx2 -cglobal pixel_sad_64x64, 4, 5, 8 +cglobal pixel_sad_64x64, 4, 5, 7 pxor m0, m0 - pxor m5, m5 - pxor m6, m6 - pxor m7, m7 mov r4d, 64 / 2 + mova m6, [pw_1] add r3d, r3d add r1d, r1d .loop: @@ -1021,10 +996,9 @@ pabsw m2, m2 pabsw m3, m3 pabsw m4, m4 - paddw m0, m1 - paddw m5, m2 - paddw m6, m3 - paddw m7, m4 + paddw m1, m2 + paddw m3, m4 + paddw m5, m1, m3 movu m1, [r2 + r3] movu m2, [r2 + r3 + 32] @@ -1038,25 +1012,22 @@ pabsw m2, m2 pabsw m3, m3 pabsw m4, m4 - paddw m0, m1 - paddw m5, m2 - paddw m6, m3 - paddw m7, m4 + paddw m1, m2 + paddw m3, m4 + paddw m1, m3 + + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 lea r0, [r0 + 2 * r1] lea r2, [r2 + 2 * r3] - dec r4d - jg .loop + dec r4d + jg .loop - HADDUWD m0, m1 - HADDUWD m5, m1 - HADDUWD m6, m1 - HADDUWD m7, m1 - paddd m0, m5 - paddd m0, m6 - paddd m0, m7 - HADDD m0, m1 + HADDD m0, m1 movd eax, xm0 RET
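The dominant change in this sad16-a.asm rewrite is how partial SADs are held: instead of accumulating 16-bit sums in several ymm registers and widening once at the end with HADDUWD, each unrolled group is widened immediately by pmaddwd against the all-ones constant [pw_1], which sums every adjacent pair of 16-bit values into a 32-bit lane. A single dword accumulator (m0) then suffices, overflow headroom no longer constrains the unroll, and the epilogues shrink from HADDW/HADDUWD chains to a plain HADDD. A standalone intrinsics sketch of just that idiom follows; it is an illustrative helper, not x265 code, and it assumes each 16-bit value stays below 2^15 (which the unroll bounds above guarantee for pixel SADs, since pmaddwd treats its inputs as signed).

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Hedged illustration of the pw_1/pmaddwd widening idiom used above:
// multiply each 16-bit partial sum by 1 and add adjacent pairs, yielding
// 32-bit lanes that can be accumulated with paddd without word overflow.
static uint32_t sum_u16_avx2(const uint16_t* p, size_t n)  // n a multiple of 16
{
    const __m256i ones = _mm256_set1_epi16(1);             // [pw_1]
    __m256i acc = _mm256_setzero_si256();                  // dword accumulator (m0)
    for (size_t i = 0; i < n; i += 16)
    {
        __m256i v = _mm256_loadu_si256((const __m256i*)(p + i));
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(v, ones)); // pmaddwd + paddd
    }
    // horizontal reduction, the job HADDD does in the asm epilogues
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i s  = _mm_add_epi32(lo, hi);
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return (uint32_t)_mm_cvtsi128_si32(s);
}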
x265_1.8.tar.gz/source/common/x86/ssd-a.asm -> x265_1.9.tar.gz/source/common/x86/ssd-a.asm
Changed
@@ -2,11 +2,13 @@ ;* ssd-a.asm: x86 ssd functions ;***************************************************************************** ;* Copyright (C) 2003-2013 x264 project +;* Copyright (C) 2013-2015 x265 project ;* ;* Authors: Loren Merritt <lorenm@u.washington.edu> ;* Fiona Glaser <fiona@x264.com> ;* Laurent Aimar <fenrir@via.ecp.fr> ;* Alex Izvorski <aizvorksi@gmail.com> +;* Min Chen <chenm003@163.com> ;* ;* This program is free software; you can redistribute it and/or modify ;* it under the terms of the GNU General Public License as published by @@ -105,8 +107,32 @@ dec r4d jg .loop %endif +%if BIT_DEPTH == 12 && %1 >= 16 && %2 >=16 +%if mmsize == 16 + movu m5, m0 + pxor m6, m6 + punpckldq m0, m6 + punpckhdq m5, m6 + paddq m0, m5 + movhlps m5, m0 + paddq m0, m5 + movq r6, xm0 +%elif mmsize == 32 + movu m1, m0 + pxor m2, m2 + punpckldq m0, m2 + punpckhdq m1, m2 + paddq m0, m1 + vextracti128 xm2, m0, 1 + paddq xm2, xm0 + movhlps xm1, xm2 + paddq xm2, xm1 + movq rax, xm2 +%endif +%else HADDD m0, m5 - movd eax, xm0 + movd eax,xm0 +%endif %ifidn movu,movq ; detect MMX EMMS %endif @@ -168,6 +194,154 @@ movq rax, m9 RET %endmacro +%macro SSD_ONE_SS_32 0 +cglobal pixel_ssd_ss_32x32, 4,5,8 + add r1d, r1d + add r3d, r3d + pxor m5, m5 + pxor m6, m6 + mov r4d, 2 + +.iterate: + mov r5d, 16 + pxor m4, m4 + pxor m7, m7 +.loop: + movu m0, [r0] + movu m1, [r0 + mmsize] + movu m2, [r2] + movu m3, [r2 + mmsize] + psubw m0, m2 + psubw m1, m3 + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m4, m0 + paddd m7, m1 + movu m0, [r0 + 2 * mmsize] + movu m1, [r0 + 3 * mmsize] + movu m2, [r2 + 2 * mmsize] + movu m3, [r2 + 3 * mmsize] + psubw m0, m2 + psubw m1, m3 + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m4, m0 + paddd m7, m1 + + add r0, r1 + add r2, r3 + + dec r5d + jnz .loop + + mova m0, m4 + pxor m1, m1 + punpckldq m0, m1 + punpckhdq m4, m1 + paddq m5, m0 + paddq m6, m4 + + mova m0, m7 + punpckldq m0, m1 + punpckhdq m7, m1 + paddq m5, m0 + paddq m6, m7 + + dec r4d + jnz .iterate + + paddq m5, m6 + movhlps m2, m5 + paddq m5, m2 + movq rax, m5 + RET +%endmacro + +%macro SSD_ONE_SS_64 0 +cglobal pixel_ssd_ss_64x64, 4,6,8 + add r1d, r1d + add r3d, r3d + pxor m5, m5 + pxor m6, m6 + mov r5d, 8 + +.iterate: + pxor m4, m4 + pxor m7, m7 + mov r4d, 8 + +.loop: + ;----process 1st half a row---- + movu m0, [r0] + movu m1, [r0 + mmsize] + movu m2, [r2] + movu m3, [r2 + mmsize] + psubw m0, m2 + psubw m1, m3 + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m4, m0 + paddd m7, m1 + movu m0, [r0 + 2 * mmsize] + movu m1, [r0 + 3 * mmsize] + movu m2, [r2 + 2 * mmsize] + movu m3, [r2 + 3 * mmsize] + psubw m0, m2 + psubw m1, m3 + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m4, m0 + paddd m7, m1 + ;----process 2nd half a row---- + movu m0, [r0 + 4 * mmsize] + movu m1, [r0 + 5 * mmsize] + movu m2, [r2 + 4 * mmsize] + movu m3, [r2 + 5 * mmsize] + psubw m0, m2 + psubw m1, m3 + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m4, m0 + paddd m7, m1 + movu m0, [r0 + 6 * mmsize] + movu m1, [r0 + 7 * mmsize] + movu m2, [r2 + 6 * mmsize] + movu m3, [r2 + 7 * mmsize] + psubw m0, m2 + psubw m1, m3 + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m4, m0 + paddd m7, m1 + + add r0, r1 + add r2, r3 + + dec r4d + jnz .loop + + mova m0, m4 + pxor m1, m1 + punpckldq m0, m1 + punpckhdq m4, m1 + paddq m5, m0 + paddq m6, m4 + + mova m0, m7 + punpckldq m0, m1 + punpckhdq m7, m1 + paddq m5, m0 + paddq m6, m7 + + dec r5 + jne .iterate + + paddq m5, m6 + movhlps m2, m5 + paddq m5, m2 + movq rax, m5 + RET +%endmacro %macro SSD_TWO 2 cglobal pixel_ssd_ss_%1x%2, 4,7,8 @@ -265,8 +439,19 @@ lea r2, 
[r2 + r6] dec r4d jnz .loop +%if BIT_DEPTH == 10 && %1 == 64 && %2 ==64 + movu m5, m0 + pxor m6, m6 + punpckldq m0, m6 + punpckhdq m5, m6 + paddq m0, m5 + movhlps m5, m0 + paddq m0, m5 + movq rax, xm0 +%else HADDD m0, m5 movd eax, xm0 +%endif RET %endmacro %macro SSD_24 2 @@ -370,120 +555,146 @@ %endmacro INIT_YMM avx2 -cglobal pixel_ssd_16x16, 4,7,8 +cglobal pixel_ssd_16x16, 4,7,3 FIX_STRIDES r1, r3 - lea r5, [3 * r1] - lea r6, [3 * r3] - mov r4d, 4 - pxor m0, m0 + lea r5, [3 * r1] + lea r6, [3 * r3] + mov r4d, 4 + pxor m0, m0 .loop: - movu m1, [r0] - movu m2, [r0 + r1] - movu m3, [r0 + r1 * 2] - movu m4, [r0 + r5] - movu m6, [r2] - movu m7, [r2 + r3] - psubw m1, m6 - psubw m2, m7 - movu m6, [r2 + r3 * 2] - movu m7, [r2 + r6] - psubw m3, m6 - psubw m4, m7 - - lea r0, [r0 + r1 * 4] - lea r2, [r2 + r3 * 4] - - pmaddwd m1, m1 - pmaddwd m2, m2 - pmaddwd m3, m3 - pmaddwd m4, m4 - paddd m1, m2 - paddd m3, m4 - paddd m0, m1 - paddd m0, m3 - - dec r4d - jg .loop - - HADDD m0, m5 - movd eax, xm0 - RET + movu m1, [r0] + movu m2, [r0 + r1] + psubw m1, [r2] + psubw m2, [r2 + r3] + pmaddwd m1, m1 + pmaddwd m2, m2 + paddd m0, m1 + paddd m0, m2 + movu m1, [r0 + r1 * 2] + movu m2, [r0 + r5] + psubw m1, [r2 + r3 * 2] + psubw m2, [r2 + r6] + pmaddwd m1, m1 + pmaddwd m2, m2 + paddd m0, m1 + paddd m0, m2 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + dec r4d + jg .loop + + mova m1, m0 + pxor m2, m2 + punpckldq m0, m2 + punpckhdq m1, m2 + paddq m0, m1 + vextracti128 xm2, m0, 1 + paddq xm2, xm0 + movhlps xm1, xm2 + paddq xm2, xm1 + movq rax, xm2 + ret INIT_YMM avx2 -cglobal pixel_ssd_32x32, 4,7,8 - add r1, r1 - add r3, r3 - mov r4d, 16 - pxor m0, m0 -.loop: - movu m1, [r0] - movu m2, [r0 + 32] - movu m3, [r0 + r1] - movu m4, [r0 + r1 + 32] - movu m6, [r2] - movu m7, [r2 + 32] - psubw m1, m6 - psubw m2, m7 - movu m6, [r2 + r3] - movu m7, [r2 + r3 + 32] - psubw m3, m6 - psubw m4, m7 - - lea r0, [r0 + r1 * 2] - lea r2, [r2 + r3 * 2] - - pmaddwd m1, m1 - pmaddwd m2, m2 - pmaddwd m3, m3 - pmaddwd m4, m4 - paddd m1, m2 - paddd m3, m4 - paddd m0, m1 - paddd m0, m3 +cglobal pixel_ssd_32x2 + pxor m0, m0 + + movu m1, [r0] + movu m2, [r0 + 32] + psubw m1, [r2] + psubw m2, [r2 + 32] + pmaddwd m1, m1 + pmaddwd m2, m2 + paddd m0, m1 + paddd m0, m2 + movu m1, [r0 + r1] + movu m2, [r0 + r1 + 32] + psubw m1, [r2 + r3] + psubw m2, [r2 + r3 + 32] + pmaddwd m1, m1 + pmaddwd m2, m2 + paddd m0, m1 + paddd m0, m2 + + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] + + + mova m1, m0 + pxor m2, m2 + punpckldq m0, m2 + punpckhdq m1, m2 + + paddq m3, m0 + paddq m4, m1 +ret - dec r4d - jg .loop - - HADDD m0, m5 - movd eax, xm0 - RET +INIT_YMM avx2 +cglobal pixel_ssd_32x32, 4,5,5 + add r1, r1 + add r3, r3 + pxor m3, m3 + pxor m4, m4 + mov r4, 16 +.iterate: + call pixel_ssd_32x2 + dec r4d + jne .iterate + + paddq m3, m4 + vextracti128 xm4, m3, 1 + paddq xm3, xm4 + movhlps xm4, xm3 + paddq xm3, xm4 + movq rax, xm3 +RET INIT_YMM avx2 -cglobal pixel_ssd_64x64, 4,7,8 - FIX_STRIDES r1, r3 - mov r4d, 64 - pxor m0, m0 +cglobal pixel_ssd_64x64, 4,5,5 + FIX_STRIDES r1, r3 + mov r4d, 64 + pxor m3, m3 + pxor m4, m4 .loop: - movu m1, [r0] - movu m2, [r0+32] - movu m3, [r0+32*2] - movu m4, [r0+32*3] - movu m6, [r2] - movu m7, [r2+32] - psubw m1, m6 - psubw m2, m7 - movu m6, [r2+32*2] - movu m7, [r2+32*3] - psubw m3, m6 - psubw m4, m7 - - lea r0, [r0+r1] - lea r2, [r2+r3] - - pmaddwd m1, m1 - pmaddwd m2, m2 - pmaddwd m3, m3 - pmaddwd m4, m4 - paddd m1, m2 - paddd m3, m4 - paddd m0, m1 - paddd m0, m3 - - dec r4d - jg .loop - - HADDD m0, m5 - movd eax, 
xm0 + pxor m0, m0 + movu m1, [r0] + movu m2, [r0+32] + psubw m1, [r2] + psubw m2, [r2+32] + pmaddwd m1, m1 + pmaddwd m2, m2 + paddd m0, m1 + paddd m0, m2 + movu m1, [r0+32*2] + movu m2, [r0+32*3] + psubw m1, [r2+32*2] + psubw m2, [r2+32*3] + pmaddwd m1, m1 + pmaddwd m2, m2 + paddd m0, m1 + paddd m0, m2 + + lea r0, [r0+r1] + lea r2, [r2+r3] + + mova m1, m0 + pxor m2, m2 + punpckldq m0, m2 + punpckhdq m1, m2 + + paddq m3, m0 + paddq m4, m1 + + dec r4d + jg .loop + + paddq m3, m4 + vextracti128 xm4, m3, 1 + paddq xm3, xm4 + movhlps xm4, xm3 + paddq xm3, xm4 + movq rax, xm3 RET INIT_MMX mmx2 @@ -511,24 +722,23 @@ SSD_ONE 32, 8 SSD_ONE 32, 16 SSD_ONE 32, 24 -SSD_ONE 32, 32 %if BIT_DEPTH <= 10 SSD_ONE 32, 64 + SSD_ONE 32, 32 + SSD_TWO 64, 64 %else SSD_ONE_32 + SSD_ONE_SS_32 + SSD_ONE_SS_64 %endif - SSD_TWO 48, 64 SSD_TWO 64, 16 SSD_TWO 64, 32 SSD_TWO 64, 48 -SSD_TWO 64, 64 + INIT_YMM avx2 -SSD_ONE 16, 8 -SSD_ONE 16, 16 -SSD_ONE 32, 32 -SSD_ONE 64, 64 +SSD_ONE 16, 8 SSD_ONE 16, 32 SSD_ONE 32, 64 %endif ; HIGH_BIT_DEPTH @@ -1002,6 +1212,172 @@ SSD_SS_32xN SSD_SS_48 SSD_SS_64xN + +INIT_YMM avx2 +cglobal pixel_ssd_ss_16x16, 4,6,4 + add r1d, r1d + add r3d, r3d + pxor m2, m2 + pxor m3, m3 + lea r4, [3 * r1] + lea r5, [3 * r3] + + movu m0, [r0] + movu m1, [r0 + r1] + psubw m0, [r2] + psubw m1, [r2 + r3] + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m2, m0 + paddd m3, m1 + + movu m0, [r0 + 2 * r1] + movu m1, [r0 + r4] + psubw m0, [r2 + 2 * r3] + psubw m1, [r2 + r5] + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m2, m0 + paddd m3, m1 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + movu m0, [r0] + movu m1, [r0 + r1] + psubw m0, [r2] + psubw m1, [r2 + r3] + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m2, m0 + paddd m3, m1 + + movu m0, [r0 + 2 * r1] + movu m1, [r0 + r4] + psubw m0, [r2 + 2 * r3] + psubw m1, [r2 + r5] + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m2, m0 + paddd m3, m1 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + movu m0, [r0] + movu m1, [r0 + r1] + psubw m0, [r2] + psubw m1, [r2 + r3] + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m2, m0 + paddd m3, m1 + + movu m0, [r0 + 2 * r1] + movu m1, [r0 + r4] + psubw m0, [r2 + 2 * r3] + psubw m1, [r2 + r5] + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m2, m0 + paddd m3, m1 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + movu m0, [r0] + movu m1, [r0 + r1] + psubw m0, [r2] + psubw m1, [r2 + r3] + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m2, m0 + paddd m3, m1 + + movu m0, [r0 + 2 * r1] + movu m1, [r0 + r4] + psubw m0, [r2 + 2 * r3] + psubw m1, [r2 + r5] + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m2, m0 + paddd m3, m1 + + paddd m2, m3 + HADDD m2, m0 + movd eax, xm2 + RET + +INIT_YMM avx2 +cglobal pixel_ssd_ss_32x32, 4,5,4 + add r1d, r1d + add r3d, r3d + pxor m2, m2 + pxor m3, m3 + mov r4d, 16 +.loop: + movu m0, [r0] + movu m1, [r0 + mmsize] + psubw m0, [r2] + psubw m1, [r2 + mmsize] + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m2, m0 + paddd m3, m1 + movu m0, [r0 + r1] + movu m1, [r0 + r1 + mmsize] + psubw m0, [r2 + r3] + psubw m1, [r2 + r3 + mmsize] + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m2, m0 + paddd m3, m1 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + dec r4d + jne .loop + + paddd m2, m3 + HADDD m2, m0 + movd eax, xm2 + RET + +INIT_YMM avx2 +cglobal pixel_ssd_ss_64x64, 4,5,4 + add r1d, r1d + add r3d, r3d + pxor m2, m2 + pxor m3, m3 + mov r4d,64 +.loop: + movu m0, [r0] + movu m1, [r0 + mmsize] + psubw m0, [r2] + psubw m1, [r2 + mmsize] + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m2, m0 + paddd m3, m1 + movu m0, [r0 + 2 * mmsize] + movu m1, [r0 + 3 * mmsize] 
+ psubw m0, [r2 + 2 * mmsize] + psubw m1, [r2 + 3 * mmsize] + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m2, m0 + paddd m3, m1 + + add r0, r1 + add r2, r3 + + dec r4d + jne .loop + + paddd m2, m3 + HADDD m2, m0 + movd eax, xm2 + RET + %endif ; !HIGH_BIT_DEPTH %if HIGH_BIT_DEPTH == 0 @@ -2729,9 +3105,20 @@ dec r2d jnz .loop +%if BIT_DEPTH >= 10 + movu m1, m0 + pxor m2, m2 + punpckldq m0, m2 + punpckhdq m1, m2 + paddq m0, m1 + movhlps m1, m0 + paddq m0, m1 + movq rax, xm0 +%else ; calculate sum and return HADDD m0, m1 movd eax, m0 +%endif RET INIT_YMM avx2 @@ -2803,8 +3190,20 @@ dec r2d jnz .loop - +%if BIT_DEPTH >= 10 + movu m1, m0 + pxor m2, m2 + punpckldq m0, m2 + punpckhdq m1, m2 + paddq m0, m1 + vextracti128 xm2, m0, 1 + paddq xm2, xm0 + movhlps xm1, xm2 + paddq xm2, xm1 + movq rax, xm2 +%else ; calculate sum and return HADDD m0, m1 movd eax, xm0 +%endif RET
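The new paddq paths in ssd-a.asm (SSD_ONE_SS_32/64 and the BIT_DEPTH-conditional epilogues) exist because squared-error totals outgrow 32 bits at high bit depths: at 12 bits a single squared difference can reach 4095², so a 32x32 block can total about 1.7e10 and a 64x64 block about 6.9e10, both beyond 2^32, while a 10-bit 64x64 block peaks near 4.29e9, which still overflows a signed 32-bit count. The bound is one line of arithmetic, sketched below as a standalone illustration (not x265 code):

#include <cstdint>
#include <cstdio>

// Worst-case SSD totals: why the asm above switches to paddq/64-bit
// horizontal adds for high bit depths. Pure arithmetic, no x265 code.
static uint64_t worstSsd(unsigned bitDepth, unsigned w, unsigned h)
{
    uint64_t maxDiff = (1u << bitDepth) - 1;   // 4095 at 12-bit, 1023 at 10-bit
    return maxDiff * maxDiff * w * h;
}

int main()
{
    std::printf("12-bit 32x32: %llu (uint32 max %u)\n",
                (unsigned long long)worstSsd(12, 32, 32), UINT32_MAX);
    std::printf("12-bit 64x64: %llu\n",
                (unsigned long long)worstSsd(12, 64, 64));
    std::printf("10-bit 64x64: %llu (INT32_MAX %d)\n",
                (unsigned long long)worstSsd(10, 64, 64), INT32_MAX);
}

This is also why SSD_ONE_SS_32/64 spill their dword lanes into qword pairs with punpckldq/punpckhdq + paddq periodically (every 8 or 16 rows) rather than only once at the end.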
x265_1.8.tar.gz/source/common/x86/x86util.asm -> x265_1.9.tar.gz/source/common/x86/x86util.asm
Changed
@@ -5,6 +5,7 @@
 ;*
 ;* Authors: Holger Lubitz <holger@lubitz.org>
 ;*          Loren Merritt <lorenm@u.washington.edu>
+;*          Min Chen <chenm003@163.com>
 ;*
 ;* This program is free software; you can redistribute it and/or modify
 ;* it under the terms of the GNU General Public License as published by
x265_1.8.tar.gz/source/common/yuv.cpp -> x265_1.9.tar.gz/source/common/yuv.cpp
Changed
@@ -2,6 +2,7 @@
  * Copyright (C) 2015 x265 project
  *
  * Authors: Steve Borho <steve@borho.org>
+ *          Min Chen <chenm003@163.com>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -50,7 +51,7 @@
 {
     CHECKED_MALLOC(m_buf[0], pixel, size * size + 8);
     m_buf[1] = m_buf[2] = 0;
-    m_csize = MAX_INT;
+    m_csize = 0;
     return true;
 }
 else
@@ -82,22 +83,26 @@
 {
     pixel* dstY = dstPic.getLumaAddr(cuAddr, absPartIdx);
     primitives.cu[m_part].copy_pp(dstY, dstPic.m_stride, m_buf[0], m_size);
-
-    pixel* dstU = dstPic.getCbAddr(cuAddr, absPartIdx);
-    pixel* dstV = dstPic.getCrAddr(cuAddr, absPartIdx);
-    primitives.chroma[m_csp].cu[m_part].copy_pp(dstU, dstPic.m_strideC, m_buf[1], m_csize);
-    primitives.chroma[m_csp].cu[m_part].copy_pp(dstV, dstPic.m_strideC, m_buf[2], m_csize);
+    if (m_csp != X265_CSP_I400)
+    {
+        pixel* dstU = dstPic.getCbAddr(cuAddr, absPartIdx);
+        pixel* dstV = dstPic.getCrAddr(cuAddr, absPartIdx);
+        primitives.chroma[m_csp].cu[m_part].copy_pp(dstU, dstPic.m_strideC, m_buf[1], m_csize);
+        primitives.chroma[m_csp].cu[m_part].copy_pp(dstV, dstPic.m_strideC, m_buf[2], m_csize);
+    }
 }

 void Yuv::copyFromPicYuv(const PicYuv& srcPic, uint32_t cuAddr, uint32_t absPartIdx)
 {
     const pixel* srcY = srcPic.getLumaAddr(cuAddr, absPartIdx);
     primitives.cu[m_part].copy_pp(m_buf[0], m_size, srcY, srcPic.m_stride);
-
-    const pixel* srcU = srcPic.getCbAddr(cuAddr, absPartIdx);
-    const pixel* srcV = srcPic.getCrAddr(cuAddr, absPartIdx);
-    primitives.chroma[m_csp].cu[m_part].copy_pp(m_buf[1], m_csize, srcU, srcPic.m_strideC);
-    primitives.chroma[m_csp].cu[m_part].copy_pp(m_buf[2], m_csize, srcV, srcPic.m_strideC);
+    if (m_csp != X265_CSP_I400)
+    {
+        const pixel* srcU = srcPic.getCbAddr(cuAddr, absPartIdx);
+        const pixel* srcV = srcPic.getCrAddr(cuAddr, absPartIdx);
+        primitives.chroma[m_csp].cu[m_part].copy_pp(m_buf[1], m_csize, srcU, srcPic.m_strideC);
+        primitives.chroma[m_csp].cu[m_part].copy_pp(m_buf[2], m_csize, srcV, srcPic.m_strideC);
+    }
 }

 void Yuv::copyFromYuv(const Yuv& srcYuv)
@@ -105,8 +110,11 @@
     X265_CHECK(m_size >= srcYuv.m_size, "invalid size\n");

     primitives.cu[m_part].copy_pp(m_buf[0], m_size, srcYuv.m_buf[0], srcYuv.m_size);
-    primitives.chroma[m_csp].cu[m_part].copy_pp(m_buf[1], m_csize, srcYuv.m_buf[1], srcYuv.m_csize);
-    primitives.chroma[m_csp].cu[m_part].copy_pp(m_buf[2], m_csize, srcYuv.m_buf[2], srcYuv.m_csize);
+    if (m_csp != X265_CSP_I400)
+    {
+        primitives.chroma[m_csp].cu[m_part].copy_pp(m_buf[1], m_csize, srcYuv.m_buf[1], srcYuv.m_csize);
+        primitives.chroma[m_csp].cu[m_part].copy_pp(m_buf[2], m_csize, srcYuv.m_buf[2], srcYuv.m_csize);
+    }
 }

 /* This version is intended for use by ME, which required FENC_STRIDE for luma fenc pixels */
@@ -130,11 +138,13 @@
 {
     pixel* dstY = dstYuv.getLumaAddr(absPartIdx);
     primitives.cu[m_part].copy_pp(dstY, dstYuv.m_size, m_buf[0], m_size);
-
-    pixel* dstU = dstYuv.getCbAddr(absPartIdx);
-    pixel* dstV = dstYuv.getCrAddr(absPartIdx);
-    primitives.chroma[m_csp].cu[m_part].copy_pp(dstU, dstYuv.m_csize, m_buf[1], m_csize);
-    primitives.chroma[m_csp].cu[m_part].copy_pp(dstV, dstYuv.m_csize, m_buf[2], m_csize);
+    if (m_csp != X265_CSP_I400)
+    {
+        pixel* dstU = dstYuv.getCbAddr(absPartIdx);
+        pixel* dstV = dstYuv.getCrAddr(absPartIdx);
+        primitives.chroma[m_csp].cu[m_part].copy_pp(dstU, dstYuv.m_csize, m_buf[1], m_csize);
+        primitives.chroma[m_csp].cu[m_part].copy_pp(dstV, dstYuv.m_csize, m_buf[2], m_csize);
+    }
 }

 void Yuv::copyPartToYuv(Yuv& dstYuv, uint32_t absPartIdx) const
@@ -142,20 +152,25 @@
     pixel* srcY = m_buf[0] + getAddrOffset(absPartIdx, m_size);
     pixel* dstY = dstYuv.m_buf[0];
     primitives.cu[dstYuv.m_part].copy_pp(dstY, dstYuv.m_size, srcY, m_size);
-
-    pixel* srcU = m_buf[1] + getChromaAddrOffset(absPartIdx);
-    pixel* srcV = m_buf[2] + getChromaAddrOffset(absPartIdx);
-    pixel* dstU = dstYuv.m_buf[1];
-    pixel* dstV = dstYuv.m_buf[2];
-    primitives.chroma[m_csp].cu[dstYuv.m_part].copy_pp(dstU, dstYuv.m_csize, srcU, m_csize);
-    primitives.chroma[m_csp].cu[dstYuv.m_part].copy_pp(dstV, dstYuv.m_csize, srcV, m_csize);
+    if (m_csp != X265_CSP_I400)
+    {
+        pixel* srcU = m_buf[1] + getChromaAddrOffset(absPartIdx);
+        pixel* srcV = m_buf[2] + getChromaAddrOffset(absPartIdx);
+        pixel* dstU = dstYuv.m_buf[1];
+        pixel* dstV = dstYuv.m_buf[2];
+        primitives.chroma[m_csp].cu[dstYuv.m_part].copy_pp(dstU, dstYuv.m_csize, srcU, m_csize);
+        primitives.chroma[m_csp].cu[dstYuv.m_part].copy_pp(dstV, dstYuv.m_csize, srcV, m_csize);
+    }
 }

 void Yuv::addClip(const Yuv& srcYuv0, const ShortYuv& srcYuv1, uint32_t log2SizeL)
 {
     primitives.cu[log2SizeL - 2].add_ps(m_buf[0], m_size, srcYuv0.m_buf[0], srcYuv1.m_buf[0], srcYuv0.m_size, srcYuv1.m_size);
-    primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps(m_buf[1], m_csize, srcYuv0.m_buf[1], srcYuv1.m_buf[1], srcYuv0.m_csize, srcYuv1.m_csize);
-    primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps(m_buf[2], m_csize, srcYuv0.m_buf[2], srcYuv1.m_buf[2], srcYuv0.m_csize, srcYuv1.m_csize);
+    if (m_csp != X265_CSP_I400)
+    {
+        primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps(m_buf[1], m_csize, srcYuv0.m_buf[1], srcYuv1.m_buf[1], srcYuv0.m_csize, srcYuv1.m_csize);
+        primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps(m_buf[2], m_csize, srcYuv0.m_buf[2], srcYuv1.m_buf[2], srcYuv0.m_csize, srcYuv1.m_csize);
+    }
 }

 void Yuv::addAvg(const ShortYuv& srcYuv0, const ShortYuv& srcYuv1, uint32_t absPartIdx, uint32_t width, uint32_t height, bool bLuma, bool bChroma)
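Every hunk in this yuv.cpp change applies one pattern: chroma-plane work is gated on m_csp != X265_CSP_I400, so monochrome (4:0:0) encodes, one of the 1.9 features listed in the changelog, never touch m_buf[1]/m_buf[2], which create() now leaves unallocated (and m_csize is initialized to 0 rather than the MAX_INT poison value). The shape of the guard in isolation, with MiniYuv and its members as simplified stand-ins for the real class:

#include <cstring>
#include <cstdint>

typedef uint8_t pixel_t;              // stand-in for x265's 'pixel'
enum { CSP_I400 = 0 };                // stand-in for X265_CSP_I400

struct MiniYuv
{
    pixel_t* buf[3];                  // luma plane + two chroma planes
    int      csp;

    // Mirrors the 1.9 pattern above: luma is copied unconditionally,
    // chroma is skipped entirely for 4:0:0 because buf[1]/buf[2] were
    // never allocated.
    void copyFrom(const MiniYuv& src, size_t lumaBytes, size_t chromaBytes)
    {
        std::memcpy(buf[0], src.buf[0], lumaBytes);
        if (csp != CSP_I400)
        {
            std::memcpy(buf[1], src.buf[1], chromaBytes);
            std::memcpy(buf[2], src.buf[2], chromaBytes);
        }
    }
};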
x265_1.8.tar.gz/source/encoder/analysis.cpp -> x265_1.9.tar.gz/source/encoder/analysis.cpp
Changed
@@ -3,6 +3,7 @@ * * Authors: Deepthi Nandakumar <deepthi@multicorewareinc.com> * Steve Borho <steve@borho.org> +* Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -71,12 +72,11 @@ Analysis::Analysis() { - m_reuseIntraDataCTU = NULL; m_reuseInterDataCTU = NULL; m_reuseRef = NULL; m_reuseBestMergeCand = NULL; + m_reuseMv = NULL; } - bool Analysis::create(ThreadLocalData *tld) { m_tld = tld; @@ -127,9 +127,6 @@ m_frame = &frame; #if _DEBUG || CHECKED_BUILD - for (uint32_t i = 0; i <= g_maxCUDepth; i++) - for (uint32_t j = 0; j < MAX_PRED_TYPES; j++) - m_modeDepth[i].pred[j].invalidate(); invalidateContexts(0); #endif @@ -140,40 +137,46 @@ m_modeDepth[0].fencYuv.copyFromPicYuv(*m_frame->m_fencPic, ctu.m_cuAddr, 0); uint32_t numPartition = ctu.m_numPartitions; - if (m_param->analysisMode) + if (m_param->analysisMode && m_slice->m_sliceType != I_SLICE) { - if (m_slice->m_sliceType == I_SLICE) - m_reuseIntraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData; - else - { - int numPredDir = m_slice->isInterP() ? 1 : 2; - m_reuseInterDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData; - m_reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir]; - m_reuseBestMergeCand = &m_reuseInterDataCTU->bestMergeCand[ctu.m_cuAddr * CUGeom::MAX_GEOMS]; - } + int numPredDir = m_slice->isInterP() ? 1 : 2; + m_reuseInterDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData; + m_reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir]; + m_reuseBestMergeCand = &m_reuseInterDataCTU->bestMergeCand[ctu.m_cuAddr * CUGeom::MAX_GEOMS]; + m_reuseMv = &m_reuseInterDataCTU->mv[ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir]; } - ProfileCUScope(ctu, totalCTUTime, totalCTUs); - uint32_t zOrder = 0; if (m_slice->m_sliceType == I_SLICE) { - compressIntraCU(ctu, cuGeom, zOrder, qp); - if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_frame->m_analysisData.intraData) + analysis_intra_data* intraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData; + if (m_param->analysisMode == X265_ANALYSIS_LOAD) + { + memcpy(ctu.m_cuDepth, &intraDataCTU->depth[ctu.m_cuAddr * numPartition], sizeof(uint8_t) * numPartition); + memcpy(ctu.m_lumaIntraDir, &intraDataCTU->modes[ctu.m_cuAddr * numPartition], sizeof(uint8_t) * numPartition); + memcpy(ctu.m_partSize, &intraDataCTU->partSizes[ctu.m_cuAddr * numPartition], sizeof(char) * numPartition); + memcpy(ctu.m_chromaIntraDir, &intraDataCTU->chromaModes[ctu.m_cuAddr * numPartition], sizeof(uint8_t) * numPartition); + } + compressIntraCU(ctu, cuGeom, qp); + if (m_param->analysisMode == X265_ANALYSIS_SAVE && intraDataCTU) { CUData* bestCU = &m_modeDepth[0].bestMode->cu; - memcpy(&m_reuseIntraDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition); - memcpy(&m_reuseIntraDataCTU->modes[ctu.m_cuAddr * numPartition], bestCU->m_lumaIntraDir, sizeof(uint8_t) * numPartition); - memcpy(&m_reuseIntraDataCTU->partSizes[ctu.m_cuAddr * numPartition], bestCU->m_partSize, sizeof(uint8_t) * numPartition); - memcpy(&m_reuseIntraDataCTU->chromaModes[ctu.m_cuAddr * numPartition], bestCU->m_chromaIntraDir, sizeof(uint8_t) * numPartition); + memcpy(&intraDataCTU->depth[ctu.m_cuAddr * numPartition], bestCU->m_cuDepth, sizeof(uint8_t) * numPartition); + memcpy(&intraDataCTU->modes[ctu.m_cuAddr * numPartition], 
bestCU->m_lumaIntraDir, sizeof(uint8_t) * numPartition); + memcpy(&intraDataCTU->partSizes[ctu.m_cuAddr * numPartition], bestCU->m_partSize, sizeof(uint8_t) * numPartition); + memcpy(&intraDataCTU->chromaModes[ctu.m_cuAddr * numPartition], bestCU->m_chromaIntraDir, sizeof(uint8_t) * numPartition); } } else { - if (!m_param->rdLevel) + if (m_param->bIntraRefresh && m_slice->m_sliceType == P_SLICE && + ctu.m_cuPelX / g_maxCUSize >= frame.m_encData->m_pir.pirStartCol + && ctu.m_cuPelX / g_maxCUSize < frame.m_encData->m_pir.pirEndCol) + compressIntraCU(ctu, cuGeom, qp); + else if (!m_param->rdLevel) { /* In RD Level 0/1, copy source pixels into the reconstructed block so - * they are available for intra predictions */ + * they are available for intra predictions */ m_modeDepth[0].fencYuv.copyToPicYuv(*m_frame->m_reconPic, ctu.m_cuAddr, 0); compressInterCU_rd0_4(ctu, cuGeom, qp); @@ -187,6 +190,7 @@ compressInterCU_rd0_4(ctu, cuGeom, qp); else { + uint32_t zOrder = 0; compressInterCU_rd5_6(ctu, cuGeom, zOrder, qp); if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_frame->m_analysisData.interData) { @@ -212,8 +216,7 @@ md.pred[PRED_LOSSLESS].initCosts(); md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom); PartSize size = (PartSize)md.pred[PRED_LOSSLESS].cu.m_partSize[0]; - uint8_t* modes = md.pred[PRED_LOSSLESS].cu.m_lumaIntraDir; - checkIntra(md.pred[PRED_LOSSLESS], cuGeom, size, modes, NULL); + checkIntra(md.pred[PRED_LOSSLESS], cuGeom, size); checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth); } else @@ -226,7 +229,7 @@ } } -void Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t& zOrder, int32_t qp) +void Analysis::compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp) { uint32_t depth = cuGeom.depth; ModeDepth& md = m_modeDepth[depth]; @@ -235,42 +238,37 @@ bool mightSplit = !(cuGeom.flags & CUGeom::LEAF); bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY); - if (m_param->analysisMode == X265_ANALYSIS_LOAD) - { - uint8_t* reuseDepth = &m_reuseIntraDataCTU->depth[parentCTU.m_cuAddr * parentCTU.m_numPartitions]; - uint8_t* reuseModes = &m_reuseIntraDataCTU->modes[parentCTU.m_cuAddr * parentCTU.m_numPartitions]; - char* reusePartSizes = &m_reuseIntraDataCTU->partSizes[parentCTU.m_cuAddr * parentCTU.m_numPartitions]; - uint8_t* reuseChromaModes = &m_reuseIntraDataCTU->chromaModes[parentCTU.m_cuAddr * parentCTU.m_numPartitions]; + bool bAlreadyDecided = parentCTU.m_lumaIntraDir[cuGeom.absPartIdx] != (uint8_t)ALL_IDX; + bool bDecidedDepth = parentCTU.m_cuDepth[cuGeom.absPartIdx] == depth; - if (mightNotSplit && depth == reuseDepth[zOrder] && zOrder == cuGeom.absPartIdx) + if (bAlreadyDecided) + { + if (bDecidedDepth) { - PartSize size = (PartSize)reusePartSizes[zOrder]; - Mode& mode = size == SIZE_2Nx2N ? 
md.pred[PRED_INTRA] : md.pred[PRED_INTRA_NxN]; + Mode& mode = md.pred[0]; + md.bestMode = &mode; mode.cu.initSubCU(parentCTU, cuGeom, qp); - checkIntra(mode, cuGeom, size, &reuseModes[zOrder], &reuseChromaModes[zOrder]); - checkBestMode(mode, depth); + memcpy(mode.cu.m_lumaIntraDir, parentCTU.m_lumaIntraDir + cuGeom.absPartIdx, cuGeom.numPartitions); + memcpy(mode.cu.m_chromaIntraDir, parentCTU.m_chromaIntraDir + cuGeom.absPartIdx, cuGeom.numPartitions); + checkIntra(mode, cuGeom, (PartSize)parentCTU.m_partSize[cuGeom.absPartIdx]); if (m_bTryLossless) tryLossless(cuGeom); if (mightSplit) addSplitFlagCost(*md.bestMode, cuGeom.depth); - - // increment zOrder offset to point to next best depth in sharedDepth buffer - zOrder += g_depthInc[g_maxCUDepth - 1][reuseDepth[zOrder]]; - mightSplit = false; } } - else if (mightNotSplit) + else if (cuGeom.log2CUSize != MAX_LOG2_CU_SIZE && mightNotSplit) { md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); - checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, NULL); + checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N); checkBestMode(md.pred[PRED_INTRA], depth); if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3) { md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp); - checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL); + checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN); checkBestMode(md.pred[PRED_INTRA_NxN], depth); } @@ -281,6 +279,9 @@ addSplitFlagCost(*md.bestMode, cuGeom.depth); } + // stop recursion if we reach the depth of previous analysis decision + mightSplit &= !(bAlreadyDecided && bDecidedDepth); + if (mightSplit) { Mode* splitPred = &md.pred[PRED_SPLIT]; @@ -305,7 +306,7 @@ if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth) nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom)); - compressIntraCU(parentCTU, childGeom, zOrder, nextQP); + compressIntraCU(parentCTU, childGeom, nextQP); // Save best CU and pred data for this sub CU splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx); @@ -317,7 +318,10 @@ { /* record the depth of this non-present sub-CU */ splitCU->setEmptyPart(childGeom, subPartIdx); - zOrder += g_depthInc[g_maxCUDepth - 1][nextDepth]; + + /* Set depth of non-present CU to 0 to ensure that correct CU is fetched as reference to code deltaQP */ + if (bAlreadyDecided) + memset(parentCTU.m_cuDepth + childGeom.absPartIdx, 0, childGeom.numPartitions); } } nextContext->store(splitPred->contexts); @@ -394,32 +398,52 @@ break; case PRED_2Nx2N: + refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[1] | m_splitRefIdx[2] | m_splitRefIdx[3]; + slave.checkInter_rd0_4(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N, refMasks); if (m_slice->m_sliceType == B_SLICE) slave.checkBidir2Nx2N(md.pred[PRED_2Nx2N], md.pred[PRED_BIDIR], pmode.cuGeom); break; case PRED_Nx2N: + refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[2]; /* left */ + refMasks[1] = m_splitRefIdx[1] | m_splitRefIdx[3]; /* right */ + slave.checkInter_rd0_4(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N, refMasks); break; case PRED_2NxN: + refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[1]; /* top */ + refMasks[1] = m_splitRefIdx[2] | m_splitRefIdx[3]; /* bot */ + slave.checkInter_rd0_4(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN, refMasks); break; case PRED_2NxnU: + refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[1]; /* 25% top */ + refMasks[1] = m_splitRefIdx[0] | m_splitRefIdx[1] | m_splitRefIdx[2] | m_splitRefIdx[3]; /* 75% bot */ + slave.checkInter_rd0_4(md.pred[PRED_2NxnU], 
                               pmode.cuGeom, SIZE_2NxnU, refMasks);
        break;

    case PRED_2NxnD:
+       refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[1] | m_splitRefIdx[2] | m_splitRefIdx[3]; /* 75% top */
+       refMasks[1] = m_splitRefIdx[2] | m_splitRefIdx[3];                                       /* 25% bot */
+
        slave.checkInter_rd0_4(md.pred[PRED_2NxnD], pmode.cuGeom, SIZE_2NxnD, refMasks);
        break;

    case PRED_nLx2N:
+       refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[2];                                       /* 25% left */
+       refMasks[1] = m_splitRefIdx[0] | m_splitRefIdx[1] | m_splitRefIdx[2] | m_splitRefIdx[3]; /* 75% right */
+
        slave.checkInter_rd0_4(md.pred[PRED_nLx2N], pmode.cuGeom, SIZE_nLx2N, refMasks);
        break;

    case PRED_nRx2N:
+       refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[1] | m_splitRefIdx[2] | m_splitRefIdx[3]; /* 75% left */
+       refMasks[1] = m_splitRefIdx[1] | m_splitRefIdx[3];                                       /* 25% right */
+
        slave.checkInter_rd0_4(md.pred[PRED_nRx2N], pmode.cuGeom, SIZE_nRx2N, refMasks);
        break;
@@ -433,12 +457,14 @@
    switch (pmode.modes[task])
    {
    case PRED_INTRA:
-       slave.checkIntra(md.pred[PRED_INTRA], pmode.cuGeom, SIZE_2Nx2N, NULL, NULL);
+       slave.checkIntra(md.pred[PRED_INTRA], pmode.cuGeom, SIZE_2Nx2N);
        if (pmode.cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3)
-           slave.checkIntra(md.pred[PRED_INTRA_NxN], pmode.cuGeom, SIZE_NxN, NULL, NULL);
+           slave.checkIntra(md.pred[PRED_INTRA_NxN], pmode.cuGeom, SIZE_NxN);
        break;

    case PRED_2Nx2N:
+       refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[1] | m_splitRefIdx[2] | m_splitRefIdx[3];
+
        slave.checkInter_rd5_6(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N, refMasks);
        md.pred[PRED_BIDIR].rdCost = MAX_INT64;
        if (m_slice->m_sliceType == B_SLICE)
@@ -450,26 +476,42 @@
        break;

    case PRED_Nx2N:
+       refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[2]; /* left */
+       refMasks[1] = m_splitRefIdx[1] | m_splitRefIdx[3]; /* right */
+
        slave.checkInter_rd5_6(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N, refMasks);
        break;

    case PRED_2NxN:
+       refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[1]; /* top */
+       refMasks[1] = m_splitRefIdx[2] | m_splitRefIdx[3]; /* bot */
+
        slave.checkInter_rd5_6(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN, refMasks);
        break;

    case PRED_2NxnU:
+       refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[1];                                       /* 25% top */
+       refMasks[1] = m_splitRefIdx[0] | m_splitRefIdx[1] | m_splitRefIdx[2] | m_splitRefIdx[3]; /* 75% bot */
+
        slave.checkInter_rd5_6(md.pred[PRED_2NxnU], pmode.cuGeom, SIZE_2NxnU, refMasks);
        break;

    case PRED_2NxnD:
+       refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[1] | m_splitRefIdx[2] | m_splitRefIdx[3]; /* 75% top */
+       refMasks[1] = m_splitRefIdx[2] | m_splitRefIdx[3];                                       /* 25% bot */
        slave.checkInter_rd5_6(md.pred[PRED_2NxnD], pmode.cuGeom, SIZE_2NxnD, refMasks);
        break;

    case PRED_nLx2N:
+       refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[2];                                       /* 25% left */
+       refMasks[1] = m_splitRefIdx[0] | m_splitRefIdx[1] | m_splitRefIdx[2] | m_splitRefIdx[3]; /* 75% right */
        slave.checkInter_rd5_6(md.pred[PRED_nLx2N], pmode.cuGeom, SIZE_nLx2N, refMasks);
        break;

    case PRED_nRx2N:
+       refMasks[0] = m_splitRefIdx[0] | m_splitRefIdx[1] | m_splitRefIdx[2] | m_splitRefIdx[3]; /* 75% left */
+       refMasks[1] = m_splitRefIdx[1] | m_splitRefIdx[3];                                       /* 25% right */
        slave.checkInter_rd5_6(md.pred[PRED_nRx2N], pmode.cuGeom, SIZE_nRx2N, refMasks);
        break;
@@ -488,7 +530,7 @@
    while (task >= 0);
}

-void Analysis::compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp)
+uint32_t Analysis::compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp)
{
    uint32_t depth = cuGeom.depth;
    uint32_t cuAddr = parentCTU.m_cuAddr;
@@ -498,19 +540,89 @@
    bool mightSplit = !(cuGeom.flags & CUGeom::LEAF);
    bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY);
    uint32_t minDepth = m_param->rdLevel <= 4 ? topSkipMinDepth(parentCTU, cuGeom) : 0;
+   uint32_t splitRefs[4] = { 0, 0, 0, 0 };

    X265_CHECK(m_param->rdLevel >= 2, "compressInterCU_dist does not support RD 0 or 1\n");

+   PMODE pmode(*this, cuGeom);
+
    if (mightNotSplit && depth >= minDepth)
    {
-       int bTryAmp = m_slice->m_sps->maxAMPDepth > depth;
-       int bTryIntra = m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames;
-
-       PMODE pmode(*this, cuGeom);
-
        /* Initialize all prediction CUs based on parentCTU */
        md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
        md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
+
+       if (m_param->rdLevel <= 4)
+           checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom);
+       else
+           checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, false);
+   }
+
+   bool bNoSplit = false;
+   bool splitIntra = true;
+   if (md.bestMode)
+   {
+       bNoSplit = md.bestMode->cu.isSkipped(0);
+       if (mightSplit && depth && depth >= minDepth && !bNoSplit && m_param->rdLevel <= 4)
+           bNoSplit = recursionDepthCheck(parentCTU, cuGeom, *md.bestMode);
+   }
+
+   if (mightSplit && !bNoSplit)
+   {
+       Mode* splitPred = &md.pred[PRED_SPLIT];
+       splitPred->initCosts();
+       CUData* splitCU = &splitPred->cu;
+       splitCU->initSubCU(parentCTU, cuGeom, qp);
+
+       uint32_t nextDepth = depth + 1;
+       ModeDepth& nd = m_modeDepth[nextDepth];
+       invalidateContexts(nextDepth);
+       Entropy* nextContext = &m_rqt[depth].cur;
+       int nextQP = qp;
+       splitIntra = false;
+
+       for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
+       {
+           const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx);
+           if (childGeom.flags & CUGeom::PRESENT)
+           {
+               m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
+               m_rqt[nextDepth].cur.load(*nextContext);
+
+               if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
+                   nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
+
+               splitRefs[subPartIdx] = compressInterCU_dist(parentCTU, childGeom, nextQP);
+
+               // Save best CU and pred data for this sub CU
+               splitIntra |= nd.bestMode->cu.isIntra(0);
+               splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);
+               splitPred->addSubCosts(*nd.bestMode);
+
+               nd.bestMode->reconYuv.copyToPartYuv(splitPred->reconYuv, childGeom.numPartitions * subPartIdx);
+               nextContext = &nd.bestMode->contexts;
+           }
+           else
+               splitCU->setEmptyPart(childGeom, subPartIdx);
+       }
+       nextContext->store(splitPred->contexts);
+
+       if (mightNotSplit)
+           addSplitFlagCost(*splitPred, cuGeom.depth);
+       else
+           updateModeCost(*splitPred);
+
+       checkDQPForSplitPred(*splitPred, cuGeom);
+   }
+
+   if (mightNotSplit && depth >= minDepth)
+   {
+       int bTryAmp = m_slice->m_sps->maxAMPDepth > depth;
+       int bTryIntra = (m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames) && (!m_param->limitReferences || splitIntra) && (cuGeom.log2CUSize != MAX_LOG2_CU_SIZE);
+
+       if (m_slice->m_pps->bUseDQP && depth <= m_slice->m_pps->maxCuDQPDepth && m_slice->m_pps->maxCuDQPDepth != 0)
+           setLambdaFromQP(parentCTU, qp);
+
        if (bTryIntra)
        {
            md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
@@ -533,6 +645,8 @@
            md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
            pmode.modes[pmode.m_jobTotal++] = PRED_nRx2N;
        }
+       m_splitRefIdx[0] = splitRefs[0]; m_splitRefIdx[1] = splitRefs[1]; m_splitRefIdx[2] = splitRefs[2]; m_splitRefIdx[3] = splitRefs[3];
+
        pmode.tryBondPeers(*m_frame->m_encData->m_jobProvider, pmode.m_jobTotal);

        /* participate in processing jobs, until all are distributed */
@@ -544,8 +658,6 @@
        if (m_param->rdLevel <= 4)
        {
-           checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom);
-
            {
                ProfileCUScope(parentCTU, pmodeBlockTime, countPModeMasters);
                pmode.waitForExit();
@@ -577,7 +689,7 @@
            if (m_param->rdLevel > 2)
            {
                /* RD selection between merge, inter, bidir and intra */
-               if (!m_bChromaSa8d) /* When m_bChromaSa8d is enabled, chroma MC has already been done */
+               if (!m_bChromaSa8d && (m_csp != X265_CSP_I400)) /* When m_bChromaSa8d is enabled, chroma MC has already been done */
                {
                    uint32_t numPU = bestInter->cu.getNumPartInter(0);
                    for (uint32_t puIdx = 0; puIdx < numPU; puIdx++)
@@ -628,14 +740,13 @@
        }
        else
        {
-           checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, false);
            {
                ProfileCUScope(parentCTU, pmodeBlockTime, countPModeMasters);
                pmode.waitForExit();
            }

            checkBestMode(md.pred[PRED_2Nx2N], depth);
-           if (m_slice->m_sliceType == B_SLICE)
+           if (m_slice->m_sliceType == B_SLICE && md.pred[PRED_BIDIR].sa8dCost < MAX_INT64)
                checkBestMode(md.pred[PRED_BIDIR], depth);

            if (m_param->bEnableRectInter)
@@ -660,14 +771,6 @@
            }
        }

-       if (md.bestMode->rdCost == MAX_INT64 && !bTryIntra)
-       {
-           md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
-           checkIntraInInter(md.pred[PRED_INTRA], cuGeom);
-           encodeIntraInInter(md.pred[PRED_INTRA], cuGeom);
-           checkBestMode(md.pred[PRED_INTRA], depth);
-       }
-
        if (m_bTryLossless)
            tryLossless(cuGeom);

@@ -675,59 +778,24 @@
            addSplitFlagCost(*md.bestMode, cuGeom.depth);
    }

-   bool bNoSplit = false;
-   if (md.bestMode)
-   {
-       bNoSplit = md.bestMode->cu.isSkipped(0);
-       if (mightSplit && depth && depth >= minDepth && !bNoSplit && m_param->rdLevel <= 4)
-           bNoSplit = recursionDepthCheck(parentCTU, cuGeom, *md.bestMode);
-   }
-
+   /* compare split RD cost against best cost */
    if (mightSplit && !bNoSplit)
-   {
-       Mode* splitPred = &md.pred[PRED_SPLIT];
-       splitPred->initCosts();
-       CUData* splitCU = &splitPred->cu;
-       splitCU->initSubCU(parentCTU, cuGeom, qp);
-
-       uint32_t nextDepth = depth + 1;
-       ModeDepth& nd = m_modeDepth[nextDepth];
-       invalidateContexts(nextDepth);
-       Entropy* nextContext = &m_rqt[depth].cur;
-       int nextQP = qp;
-
-       for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++)
-       {
-           const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx);
-           if (childGeom.flags & CUGeom::PRESENT)
-           {
-               m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx);
-               m_rqt[nextDepth].cur.load(*nextContext);
-
-               if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
-                   nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));
-
-               compressInterCU_dist(parentCTU, childGeom, nextQP);
-
-               // Save best CU and pred data for this sub CU
-               splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx);
-               splitPred->addSubCosts(*nd.bestMode);
-
-               nd.bestMode->reconYuv.copyToPartYuv(splitPred->reconYuv, childGeom.numPartitions * subPartIdx);
-               nextContext = &nd.bestMode->contexts;
-           }
-           else
-               splitCU->setEmptyPart(childGeom, subPartIdx);
-       }
-       nextContext->store(splitPred->contexts);
-
-       if (mightNotSplit)
-           addSplitFlagCost(*splitPred, cuGeom.depth);
-       else
-           updateModeCost(*splitPred);
+       checkBestMode(md.pred[PRED_SPLIT], depth);

-       checkDQPForSplitPred(*splitPred, cuGeom);
-       checkBestMode(*splitPred, depth);
+   /* determine which motion references the parent CU should search */
+   uint32_t refMask;
+   if (!(m_param->limitReferences & X265_REF_LIMIT_DEPTH))
+       refMask = 0;
+   else if (md.bestMode == &md.pred[PRED_SPLIT])
+       refMask = splitRefs[0] | splitRefs[1] | splitRefs[2] | splitRefs[3];
+   else
+   {
+       /* use best merge/inter mode, in case of intra use 2Nx2N inter references */
+       CUData& cu = md.bestMode->cu.isIntra(0) ? md.pred[PRED_2Nx2N].cu : md.bestMode->cu;
+       uint32_t numPU = cu.getNumPartInter(0);
+       refMask = 0;
+       for (uint32_t puIdx = 0, subPartIdx = 0; puIdx < numPU; puIdx++, subPartIdx += cu.getPUOffset(puIdx, 0))
+           refMask |= cu.getBestRefIdx(subPartIdx);
    }

    if (mightNotSplit)
@@ -742,23 +810,40 @@
    /* Copy best data to encData CTU and recon */
    md.bestMode->cu.copyToPic(depth);
-   if (md.bestMode != &md.pred[PRED_SPLIT])
-       md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, cuAddr, cuGeom.absPartIdx);
+   md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, cuAddr, cuGeom.absPartIdx);
+
+   return refMask;
}

-uint32_t Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp)
+SplitData Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp)
{
    uint32_t depth = cuGeom.depth;
    uint32_t cuAddr = parentCTU.m_cuAddr;
    ModeDepth& md = m_modeDepth[depth];
    md.bestMode = NULL;

+   PicYuv& reconPic = *m_frame->m_reconPic;
+
    bool mightSplit = !(cuGeom.flags & CUGeom::LEAF);
    bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY);
    uint32_t minDepth = topSkipMinDepth(parentCTU, cuGeom);
    bool earlyskip = false;
    bool splitIntra = true;
-   uint32_t splitRefs[4] = { 0, 0, 0, 0 };
+
+   SplitData splitData[4];
+   splitData[0].initSplitCUData();
+   splitData[1].initSplitCUData();
+   splitData[2].initSplitCUData();
+   splitData[3].initSplitCUData();
+
+   // avoid uninitialize value in below reference
+   if (m_param->limitModes)
+   {
+       md.pred[PRED_2Nx2N].bestME[0][0].mvCost = 0; // L0
+       md.pred[PRED_2Nx2N].bestME[0][1].mvCost = 0; // L1
+       md.pred[PRED_2Nx2N].sa8dCost = 0;
+   }
+
    /* Step 1. Evaluate Merge/Skip candidates for likely early-outs */
    if (mightNotSplit && depth >= minDepth)
    {
@@ -804,7 +889,7 @@
                if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
                    nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));

-               splitRefs[subPartIdx] = compressInterCU_rd0_4(parentCTU, childGeom, nextQP);
+               splitData[subPartIdx] = compressInterCU_rd0_4(parentCTU, childGeom, nextQP);

                // Save best CU and pred data for this sub CU
                splitIntra |= nd.bestMode->cu.isIntra(0);
@@ -834,7 +919,7 @@
    /* Split CUs
     *   0  1
     *   2  3 */
-   uint32_t allSplitRefs = splitRefs[0] | splitRefs[1] | splitRefs[2] | splitRefs[3];
+   uint32_t allSplitRefs = splitData[0].splitRefs | splitData[1].splitRefs | splitData[2].splitRefs | splitData[3].splitRefs;
    /* Step 3. Evaluate ME (2Nx2N, rect, amp) and intra modes at current depth */
    if (mightNotSplit && depth >= minDepth)
    {
@@ -852,7 +937,7 @@
            {
                CUData& cu = md.pred[PRED_2Nx2N].cu;
                uint32_t refMask = cu.getBestRefIdx(0);
-               allSplitRefs = splitRefs[0] = splitRefs[1] = splitRefs[2] = splitRefs[3] = refMask;
+               allSplitRefs = splitData[0].splitRefs = splitData[1].splitRefs = splitData[2].splitRefs = splitData[3].splitRefs = refMask;
            }

            if (m_slice->m_sliceType == B_SLICE)
@@ -864,23 +949,80 @@
            Mode *bestInter = &md.pred[PRED_2Nx2N];
            if (m_param->bEnableRectInter)
            {
-               refMasks[0] = splitRefs[0] | splitRefs[2]; /* left */
-               refMasks[1] = splitRefs[1] | splitRefs[3]; /* right */
-               md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
-               checkInter_rd0_4(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, refMasks);
-               if (md.pred[PRED_Nx2N].sa8dCost < bestInter->sa8dCost)
-                   bestInter = &md.pred[PRED_Nx2N];
+               uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost;
+               uint32_t threshold_2NxN, threshold_Nx2N;

-               refMasks[0] = splitRefs[0] | splitRefs[1]; /* top */
-               refMasks[1] = splitRefs[2] | splitRefs[3]; /* bot */
-               md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp);
-               checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks);
-               if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost)
-                   bestInter = &md.pred[PRED_2NxN];
+               if (m_slice->m_sliceType == P_SLICE)
+               {
+                   threshold_2NxN = splitData[0].mvCost[0] + splitData[1].mvCost[0];
+                   threshold_Nx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0];
+               }
+               else
+               {
+                   threshold_2NxN = (splitData[0].mvCost[0] + splitData[1].mvCost[0]
+                                  +  splitData[0].mvCost[1] + splitData[1].mvCost[1] + 1) >> 1;
+                   threshold_Nx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0]
+                                  +  splitData[0].mvCost[1] + splitData[2].mvCost[1] + 1) >> 1;
+               }
+
+               int try_2NxN_first = threshold_2NxN < threshold_Nx2N;
+               if (try_2NxN_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxN)
+               {
+                   refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */
+                   refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */
+                   md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp);
+                   checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks);
+                   if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost)
+                       bestInter = &md.pred[PRED_2NxN];
+               }
+
+               if (splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_Nx2N)
+               {
+                   refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* left */
+                   refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* right */
+                   md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+                   checkInter_rd0_4(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, refMasks);
+                   if (md.pred[PRED_Nx2N].sa8dCost < bestInter->sa8dCost)
+                       bestInter = &md.pred[PRED_Nx2N];
+               }
+
+               if (!try_2NxN_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxN)
+               {
+                   refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */
+                   refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */
+                   md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp);
+                   checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks);
+                   if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost)
+                       bestInter = &md.pred[PRED_2NxN];
+               }
            }

            if (m_slice->m_sps->maxAMPDepth > depth)
            {
+               uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost;
+               uint32_t threshold_2NxnU, threshold_2NxnD, threshold_nLx2N, threshold_nRx2N;
+
+               if (m_slice->m_sliceType == P_SLICE)
+               {
+                   threshold_2NxnU = splitData[0].mvCost[0] + splitData[1].mvCost[0];
+                   threshold_2NxnD = splitData[2].mvCost[0] + splitData[3].mvCost[0];
+
+                   threshold_nLx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0];
+                   threshold_nRx2N = splitData[1].mvCost[0] + splitData[3].mvCost[0];
+               }
+               else
+               {
+                   threshold_2NxnU = (splitData[0].mvCost[0] + splitData[1].mvCost[0]
+                                   +  splitData[0].mvCost[1] + splitData[1].mvCost[1] + 1) >> 1;
+                   threshold_2NxnD = (splitData[2].mvCost[0] + splitData[3].mvCost[0]
+                                   +  splitData[2].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1;
+
+                   threshold_nLx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0]
+                                   +  splitData[0].mvCost[1] + splitData[2].mvCost[1] + 1) >> 1;
+                   threshold_nRx2N = (splitData[1].mvCost[0] + splitData[3].mvCost[0]
+                                   +  splitData[1].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1;
+               }
+
                bool bHor = false, bVer = false;
                if (bestInter->cu.m_partSize[0] == SIZE_2NxN)
                    bHor = true;
@@ -895,42 +1037,76 @@

                if (bHor)
                {
-                   refMasks[0] = splitRefs[0] | splitRefs[1]; /* 25% top */
-                   refMasks[1] = allSplitRefs;                /* 75% bot */
-                   md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp);
-                   checkInter_rd0_4(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, refMasks);
-                   if (md.pred[PRED_2NxnU].sa8dCost < bestInter->sa8dCost)
-                       bestInter = &md.pred[PRED_2NxnU];
-
-                   refMasks[0] = allSplitRefs;                /* 75% top */
-                   refMasks[1] = splitRefs[2] | splitRefs[3]; /* 25% bot */
-                   md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp);
-                   checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks);
-                   if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost)
-                       bestInter = &md.pred[PRED_2NxnD];
+                   int try_2NxnD_first = threshold_2NxnD < threshold_2NxnU;
+                   if (try_2NxnD_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxnD)
+                   {
+                       refMasks[0] = allSplitRefs;                                    /* 75% top */
+                       refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */
+                       md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp);
+                       checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks);
+                       if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost)
+                           bestInter = &md.pred[PRED_2NxnD];
+                   }
+
+                   if (splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxnU)
+                   {
+                       refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* 25% top */
+                       refMasks[1] = allSplitRefs;                                    /* 75% bot */
+                       md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp);
+                       checkInter_rd0_4(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, refMasks);
+                       if (md.pred[PRED_2NxnU].sa8dCost < bestInter->sa8dCost)
+                           bestInter = &md.pred[PRED_2NxnU];
+                   }
+
+                   if (!try_2NxnD_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_2NxnD)
+                   {
+                       refMasks[0] = allSplitRefs;                                    /* 75% top */
+                       refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */
+                       md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp);
+                       checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks);
+                       if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost)
+                           bestInter = &md.pred[PRED_2NxnD];
+                   }
                }

                if (bVer)
                {
-                   refMasks[0] = splitRefs[0] | splitRefs[2]; /* 25% left */
-                   refMasks[1] = allSplitRefs;                /* 75% right */
-                   md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp);
-                   checkInter_rd0_4(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, refMasks);
-                   if (md.pred[PRED_nLx2N].sa8dCost < bestInter->sa8dCost)
-                       bestInter = &md.pred[PRED_nLx2N];
-
-                   refMasks[0] = allSplitRefs;                /* 75% left */
-                   refMasks[1] = splitRefs[1] | splitRefs[3]; /* 25% right */
-                   md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
-                   checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks);
-                   if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost)
-                       bestInter = &md.pred[PRED_nRx2N];
+                   int try_nRx2N_first = threshold_nRx2N < threshold_nLx2N;
+                   if (try_nRx2N_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_nRx2N)
+                   {
+                       refMasks[0] = allSplitRefs;                                    /* 75% left */
+                       refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */
+                       md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+                       checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks);
+                       if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost)
+                           bestInter = &md.pred[PRED_nRx2N];
+                   }
+
+                   if (splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_nLx2N)
+                   {
+                       refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* 25% left */
+                       refMasks[1] = allSplitRefs;                                    /* 75% right */
+                       md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+                       checkInter_rd0_4(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, refMasks);
+                       if (md.pred[PRED_nLx2N].sa8dCost < bestInter->sa8dCost)
+                           bestInter = &md.pred[PRED_nLx2N];
+                   }
+
+                   if (!try_nRx2N_first && splitCost < md.pred[PRED_2Nx2N].sa8dCost + threshold_nRx2N)
+                   {
+                       refMasks[0] = allSplitRefs;                                    /* 75% left */
+                       refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */
+                       md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+                       checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks);
+                       if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost)
+                           bestInter = &md.pred[PRED_nRx2N];
+                   }
                }
            }

-           bool bTryIntra = m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames;
+           bool bTryIntra = (m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames) && cuGeom.log2CUSize != MAX_LOG2_CU_SIZE;
            if (m_param->rdLevel >= 3)
            {
                /* Calculate RD cost of best inter option */
-               if (!m_bChromaSa8d) /* When m_bChromaSa8d is enabled, chroma MC has already been done */
+               if (!m_bChromaSa8d && (m_csp != X265_CSP_I400)) /* When m_bChromaSa8d is enabled, chroma MC has already been done */
                {
                    uint32_t numPU = bestInter->cu.getNumPartInter(0);
                    for (uint32_t puIdx = 0; puIdx < numPU; puIdx++)
@@ -1005,10 +1181,13 @@
        else if (md.bestMode->cu.isInter(0))
        {
            uint32_t numPU = md.bestMode->cu.getNumPartInter(0);
-           for (uint32_t puIdx = 0; puIdx < numPU; puIdx++)
+           if (m_csp != X265_CSP_I400)
            {
-               PredictionUnit pu(md.bestMode->cu, cuGeom, puIdx);
-               motionCompensation(md.bestMode->cu, pu, md.bestMode->predYuv, false, true);
+               for (uint32_t puIdx = 0; puIdx < numPU; puIdx++)
+               {
+                   PredictionUnit pu(md.bestMode->cu, cuGeom, puIdx);
+                   motionCompensation(md.bestMode->cu, pu, md.bestMode->predYuv, false, true);
+               }
            }
            if (m_param->rdLevel == 2)
                encodeResAndCalcRdInterCU(*md.bestMode, cuGeom);
@@ -1019,7 +1198,6 @@
                uint32_t tuDepthRange[2];
                cu.getInterTUQtDepthRange(tuDepthRange, 0);

-               m_rqt[cuGeom.depth].tmpResiYuv.subtract(*md.bestMode->fencYuv, md.bestMode->predYuv, cuGeom.log2CUSize);
                residualTransformQuantInter(*md.bestMode, cuGeom, 0, 0, tuDepthRange);
                if (cu.getQtRootCbf(0))
@@ -1045,9 +1223,12 @@
                    cu.getIntraTUQtDepthRange(tuDepthRange, 0);

                    residualTransformQuantIntra(*md.bestMode, cuGeom, 0, 0, tuDepthRange);
-                   getBestIntraModeChroma(*md.bestMode, cuGeom);
-                   residualQTIntraChroma(*md.bestMode, cuGeom, 0, 0);
-                   md.bestMode->reconYuv.copyFromPicYuv(*m_frame->m_reconPic, cu.m_cuAddr, cuGeom.absPartIdx); // TODO:
+                   if (m_csp != X265_CSP_I400)
+                   {
+                       getBestIntraModeChroma(*md.bestMode, cuGeom);
+                       residualQTIntraChroma(*md.bestMode, cuGeom, 0, 0);
+                   }
+                   md.bestMode->reconYuv.copyFromPicYuv(reconPic, cu.m_cuAddr, cuGeom.absPartIdx); // TODO:
                }
            }
        }
@@ -1074,19 +1255,28 @@
    }

    /* determine which motion references the parent CU should search */
-   uint32_t refMask;
-   if (!(m_param->limitReferences & X265_REF_LIMIT_DEPTH))
-       refMask = 0;
-   else if (md.bestMode == &md.pred[PRED_SPLIT])
-       refMask = allSplitRefs;
-   else
+   SplitData splitCUData;
+   splitCUData.initSplitCUData();
+
+   if (m_param->limitReferences & X265_REF_LIMIT_DEPTH)
    {
-       /* use best merge/inter mode, in case of intra use 2Nx2N inter references */
-       CUData& cu = md.bestMode->cu.isIntra(0) ? md.pred[PRED_2Nx2N].cu : md.bestMode->cu;
-       uint32_t numPU = cu.getNumPartInter(0);
-       refMask = 0;
-       for (uint32_t puIdx = 0, subPartIdx = 0; puIdx < numPU; puIdx++, subPartIdx += cu.getPUOffset(puIdx, 0))
-           refMask |= cu.getBestRefIdx(subPartIdx);
+       if (md.bestMode == &md.pred[PRED_SPLIT])
+           splitCUData.splitRefs = allSplitRefs;
+       else
+       {
+           /* use best merge/inter mode, in case of intra use 2Nx2N inter references */
+           CUData& cu = md.bestMode->cu.isIntra(0) ? md.pred[PRED_2Nx2N].cu : md.bestMode->cu;
+           uint32_t numPU = cu.getNumPartInter(0);
+           for (uint32_t puIdx = 0, subPartIdx = 0; puIdx < numPU; puIdx++, subPartIdx += cu.getPUOffset(puIdx, 0))
+               splitCUData.splitRefs |= cu.getBestRefIdx(subPartIdx);
+       }
+   }
+
+   if (m_param->limitModes)
+   {
+       splitCUData.mvCost[0] = md.pred[PRED_2Nx2N].bestME[0][0].mvCost; // L0
+       splitCUData.mvCost[1] = md.pred[PRED_2Nx2N].bestME[0][1].mvCost; // L1
+       splitCUData.sa8dCost = md.pred[PRED_2Nx2N].sa8dCost;
    }

    if (mightNotSplit)
@@ -1100,15 +1290,14 @@
    }

    /* Copy best data to encData CTU and recon */
-   X265_CHECK(md.bestMode->ok(), "best mode is not ok");
    md.bestMode->cu.copyToPic(depth);
    if (m_param->rdLevel)
-       md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, cuAddr, cuGeom.absPartIdx);
+       md.bestMode->reconYuv.copyToPicYuv(reconPic, cuAddr, cuGeom.absPartIdx);

-   return refMask;
+   return splitCUData;
}

-uint32_t Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp)
+SplitData Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp)
{
    uint32_t depth = cuGeom.depth;
    ModeDepth& md = m_modeDepth[depth];
@@ -1116,6 +1305,16 @@
    bool mightSplit = !(cuGeom.flags & CUGeom::LEAF);
    bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY);
+   bool foundSkip = false;
+   bool splitIntra = true;
+
+   // avoid uninitialize value in below reference
+   if (m_param->limitModes)
+   {
+       md.pred[PRED_2Nx2N].bestME[0][0].mvCost = 0; // L0
+       md.pred[PRED_2Nx2N].bestME[0][1].mvCost = 0; // L1
+       md.pred[PRED_2Nx2N].rdCost = 0;
+   }

    if (m_param->analysisMode == X265_ANALYSIS_LOAD)
    {
@@ -1127,25 +1326,21 @@
            md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
            checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, true);

-           if (m_bTryLossless)
-               tryLossless(cuGeom);
-
-           if (mightSplit)
-               addSplitFlagCost(*md.bestMode, cuGeom.depth);
-
            // increment zOrder offset to point to next best depth in sharedDepth buffer
            zOrder += g_depthInc[g_maxCUDepth - 1][reuseDepth[zOrder]];
-           mightSplit = false;
-           mightNotSplit = false;
+           foundSkip = true;
        }
-   }
+   }
+
+   SplitData splitData[4];
+   splitData[0].initSplitCUData();
+   splitData[1].initSplitCUData();
+   splitData[2].initSplitCUData();
+   splitData[3].initSplitCUData();

-   bool foundSkip = false;
-   bool splitIntra = true;
-   uint32_t splitRefs[4] = { 0, 0, 0, 0 };
    /* Step 1. Evaluate Merge/Skip candidates for likely early-outs */
-   if (mightNotSplit)
+   if (mightNotSplit && !foundSkip)
    {
        md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp);
        md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp);
@@ -1180,7 +1375,7 @@
                if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth)
                    nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom));

-               splitRefs[subPartIdx] = compressInterCU_rd5_6(parentCTU, childGeom, zOrder, nextQP);
+               splitData[subPartIdx] = compressInterCU_rd5_6(parentCTU, childGeom, zOrder, nextQP);

                // Save best CU and pred data for this sub CU
                splitIntra |= nd.bestMode->cu.isIntra(0);
@@ -1207,7 +1402,7 @@
    /* Split CUs
     *   0  1
     *   2  3 */
-   uint32_t allSplitRefs = splitRefs[0] | splitRefs[1] | splitRefs[2] | splitRefs[3];
+   uint32_t allSplitRefs = splitData[0].splitRefs | splitData[1].splitRefs | splitData[2].splitRefs | splitData[3].splitRefs;
    /* Step 3. Evaluate ME (2Nx2N, rect, amp) and intra modes at current depth */
    if (mightNotSplit)
    {
@@ -1226,7 +1421,7 @@
            {
                CUData& cu = md.pred[PRED_2Nx2N].cu;
                uint32_t refMask = cu.getBestRefIdx(0);
-               allSplitRefs = splitRefs[0] = splitRefs[1] = splitRefs[2] = splitRefs[3] = refMask;
+               allSplitRefs = splitData[0].splitRefs = splitData[1].splitRefs = splitData[2].splitRefs = splitData[3].splitRefs = refMask;
            }

            if (m_slice->m_sliceType == B_SLICE)
@@ -1242,22 +1437,78 @@

            if (m_param->bEnableRectInter)
            {
-               refMasks[0] = splitRefs[0] | splitRefs[2]; /* left */
-               refMasks[1] = splitRefs[1] | splitRefs[3]; /* right */
-               md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
-               checkInter_rd5_6(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, refMasks);
-               checkBestMode(md.pred[PRED_Nx2N], cuGeom.depth);
-
-               refMasks[0] = splitRefs[0] | splitRefs[1]; /* top */
-               refMasks[1] = splitRefs[2] | splitRefs[3]; /* bot */
-               md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp);
-               checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks);
-               checkBestMode(md.pred[PRED_2NxN], cuGeom.depth);
+               uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost;
+               uint32_t threshold_2NxN, threshold_Nx2N;
+
+               if (m_slice->m_sliceType == P_SLICE)
+               {
+                   threshold_2NxN = splitData[0].mvCost[0] + splitData[1].mvCost[0];
+                   threshold_Nx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0];
+               }
+               else
+               {
+                   threshold_2NxN = (splitData[0].mvCost[0] + splitData[1].mvCost[0]
+                                  +  splitData[0].mvCost[1] + splitData[1].mvCost[1] + 1) >> 1;
+                   threshold_Nx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0]
+                                  +  splitData[0].mvCost[1] + splitData[2].mvCost[1] + 1) >> 1;
+               }
+
+               int try_2NxN_first = threshold_2NxN < threshold_Nx2N;
+               if (try_2NxN_first && splitCost < md.bestMode->rdCost + threshold_2NxN)
+               {
+                   refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */
+                   refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */
+                   md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp);
+                   checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks);
+                   checkBestMode(md.pred[PRED_2NxN], cuGeom.depth);
+               }
+
+               if (splitCost < md.bestMode->rdCost + threshold_Nx2N)
+               {
+                   refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* left */
+                   refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* right */
+                   md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+                   checkInter_rd5_6(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, refMasks);
+                   checkBestMode(md.pred[PRED_Nx2N], cuGeom.depth);
+               }
+
+               if (!try_2NxN_first && splitCost < md.bestMode->rdCost + threshold_2NxN)
+               {
+                   refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* top */
+                   refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* bot */
+                   md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp);
+                   checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks);
+                   checkBestMode(md.pred[PRED_2NxN], cuGeom.depth);
+               }
            }

            // Try AMP (SIZE_2NxnU, SIZE_2NxnD, SIZE_nLx2N, SIZE_nRx2N)
            if (m_slice->m_sps->maxAMPDepth > depth)
            {
+               uint64_t splitCost = splitData[0].sa8dCost + splitData[1].sa8dCost + splitData[2].sa8dCost + splitData[3].sa8dCost;
+               uint32_t threshold_2NxnU, threshold_2NxnD, threshold_nLx2N, threshold_nRx2N;
+
+               if (m_slice->m_sliceType == P_SLICE)
+               {
+                   threshold_2NxnU = splitData[0].mvCost[0] + splitData[1].mvCost[0];
+                   threshold_2NxnD = splitData[2].mvCost[0] + splitData[3].mvCost[0];
+
+                   threshold_nLx2N = splitData[0].mvCost[0] + splitData[2].mvCost[0];
+                   threshold_nRx2N = splitData[1].mvCost[0] + splitData[3].mvCost[0];
+               }
+               else
+               {
+                   threshold_2NxnU = (splitData[0].mvCost[0] + splitData[1].mvCost[0]
+                                   +  splitData[0].mvCost[1] + splitData[1].mvCost[1] + 1) >> 1;
+                   threshold_2NxnD = (splitData[2].mvCost[0] + splitData[3].mvCost[0]
+                                   +  splitData[2].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1;
+
+                   threshold_nLx2N = (splitData[0].mvCost[0] + splitData[2].mvCost[0]
+                                   +  splitData[0].mvCost[1] + splitData[2].mvCost[1] + 1) >> 1;
+                   threshold_nRx2N = (splitData[1].mvCost[0] + splitData[3].mvCost[0]
+                                   +  splitData[1].mvCost[1] + splitData[3].mvCost[1] + 1) >> 1;
+               }
+
                bool bHor = false, bVer = false;
                if (md.bestMode->cu.m_partSize[0] == SIZE_2NxN)
                    bHor = true;
@@ -1271,47 +1522,80 @@

                if (bHor)
                {
-                   refMasks[0] = splitRefs[0] | splitRefs[1]; /* 25% top */
-                   refMasks[1] = allSplitRefs;                /* 75% bot */
-                   md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp);
-                   checkInter_rd5_6(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, refMasks);
-                   checkBestMode(md.pred[PRED_2NxnU], cuGeom.depth);
-
-                   refMasks[0] = allSplitRefs;                /* 75% top */
-                   refMasks[1] = splitRefs[2] | splitRefs[3]; /* 25% bot */
-                   md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp);
-                   checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks);
-                   checkBestMode(md.pred[PRED_2NxnD], cuGeom.depth);
+                   int try_2NxnD_first = threshold_2NxnD < threshold_2NxnU;
+                   if (try_2NxnD_first && splitCost < md.bestMode->rdCost + threshold_2NxnD)
+                   {
+                       refMasks[0] = allSplitRefs;                                    /* 75% top */
+                       refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */
+                       md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp);
+                       checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks);
+                       checkBestMode(md.pred[PRED_2NxnD], cuGeom.depth);
+                   }
+
+                   if (splitCost < md.bestMode->rdCost + threshold_2NxnU)
+                   {
+                       refMasks[0] = splitData[0].splitRefs | splitData[1].splitRefs; /* 25% top */
+                       refMasks[1] = allSplitRefs;                                    /* 75% bot */
+                       md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp);
+                       checkInter_rd5_6(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, refMasks);
+                       checkBestMode(md.pred[PRED_2NxnU], cuGeom.depth);
+                   }
+
+                   if (!try_2NxnD_first && splitCost < md.bestMode->rdCost + threshold_2NxnD)
+                   {
+                       refMasks[0] = allSplitRefs;                                    /* 75% top */
+                       refMasks[1] = splitData[2].splitRefs | splitData[3].splitRefs; /* 25% bot */
+                       md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp);
+                       checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks);
+                       checkBestMode(md.pred[PRED_2NxnD], cuGeom.depth);
+                   }
                }
+
                if (bVer)
                {
-                   refMasks[0] = splitRefs[0] | splitRefs[2]; /* 25% left */
-                   refMasks[1] = allSplitRefs;                /* 75% right */
-                   md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp);
-                   checkInter_rd5_6(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, refMasks);
-                   checkBestMode(md.pred[PRED_nLx2N], cuGeom.depth);
+                   int try_nRx2N_first = threshold_nRx2N < threshold_nLx2N;
+                   if (try_nRx2N_first && splitCost < md.bestMode->rdCost + threshold_nRx2N)
+                   {
+                       refMasks[0] = allSplitRefs;                                    /* 75% left */
+                       refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */
+                       md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+                       checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks);
+                       checkBestMode(md.pred[PRED_nRx2N], cuGeom.depth);
+                   }
+
+                   if (splitCost < md.bestMode->rdCost + threshold_nLx2N)
+                   {
+                       refMasks[0] = splitData[0].splitRefs | splitData[2].splitRefs; /* 25% left */
+                       refMasks[1] = allSplitRefs;                                    /* 75% right */
+                       md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+                       checkInter_rd5_6(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, refMasks);
+                       checkBestMode(md.pred[PRED_nLx2N], cuGeom.depth);
+                   }

-                   refMasks[0] = allSplitRefs;                /* 75% left */
-                   refMasks[1] = splitRefs[1] | splitRefs[3]; /* 25% right */
-                   md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
-                   checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks);
-                   checkBestMode(md.pred[PRED_nRx2N], cuGeom.depth);
+                   if (!try_nRx2N_first && splitCost < md.bestMode->rdCost + threshold_nRx2N)
+                   {
+                       refMasks[0] = allSplitRefs;                                    /* 75% left */
+                       refMasks[1] = splitData[1].splitRefs | splitData[3].splitRefs; /* 25% right */
+                       md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp);
+                       checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks);
+                       checkBestMode(md.pred[PRED_nRx2N], cuGeom.depth);
+                   }
                }
            }

-           if (m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames)
+           if ((m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames) && cuGeom.log2CUSize != MAX_LOG2_CU_SIZE)
            {
                if (!m_param->limitReferences || splitIntra)
                {
                    ProfileCounter(parentCTU, totalIntraCU[cuGeom.depth]);
                    md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp);
-                   checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, NULL);
+                   checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N);
                    checkBestMode(md.pred[PRED_INTRA], depth);

                    if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3)
                    {
                        md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp);
-                       checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL);
+                       checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN);
                        checkBestMode(md.pred[PRED_INTRA_NxN], depth);
                    }
                }
@@ -1334,27 +1618,34 @@
        checkBestMode(md.pred[PRED_SPLIT], depth);

    /* determine which motion references the parent CU should search */
-   uint32_t refMask;
-   if (!(m_param->limitReferences & X265_REF_LIMIT_DEPTH))
-       refMask = 0;
-   else if (md.bestMode == &md.pred[PRED_SPLIT])
-       refMask = allSplitRefs;
-   else
+   SplitData splitCUData;
+   splitCUData.initSplitCUData();
+   if (m_param->limitReferences & X265_REF_LIMIT_DEPTH)
    {
-       /* use best merge/inter mode, in case of intra use 2Nx2N inter references */
-       CUData& cu = md.bestMode->cu.isIntra(0) ? md.pred[PRED_2Nx2N].cu : md.bestMode->cu;
-       uint32_t numPU = cu.getNumPartInter(0);
-       refMask = 0;
-       for (uint32_t puIdx = 0, subPartIdx = 0; puIdx < numPU; puIdx++, subPartIdx += cu.getPUOffset(puIdx, 0))
-           refMask |= cu.getBestRefIdx(subPartIdx);
+       if (md.bestMode == &md.pred[PRED_SPLIT])
+           splitCUData.splitRefs = allSplitRefs;
+       else
+       {
+           /* use best merge/inter mode, in case of intra use 2Nx2N inter references */
+           CUData& cu = md.bestMode->cu.isIntra(0) ? md.pred[PRED_2Nx2N].cu : md.bestMode->cu;
+           uint32_t numPU = cu.getNumPartInter(0);
+           for (uint32_t puIdx = 0, subPartIdx = 0; puIdx < numPU; puIdx++, subPartIdx += cu.getPUOffset(puIdx, 0))
+               splitCUData.splitRefs |= cu.getBestRefIdx(subPartIdx);
+       }
+   }
+
+   if (m_param->limitModes)
+   {
+       splitCUData.mvCost[0] = md.pred[PRED_2Nx2N].bestME[0][0].mvCost; // L0
+       splitCUData.mvCost[1] = md.pred[PRED_2Nx2N].bestME[0][1].mvCost; // L1
+       splitCUData.sa8dCost = md.pred[PRED_2Nx2N].rdCost;
    }

    /* Copy best data to encData CTU and recon */
-   X265_CHECK(md.bestMode->ok(), "best mode is not ok");
    md.bestMode->cu.copyToPic(depth);
    md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, parentCTU.m_cuAddr, cuGeom.absPartIdx);

-   return refMask;
+   return splitCUData;
}

/* sets md.bestMode if a valid merge candidate is found, else leaves it NULL */
@@ -1389,13 +1680,23 @@
    bestPred->sa8dCost = MAX_INT64;
    int bestSadCand = -1;
    int sizeIdx = cuGeom.log2CUSize - 2;
-
+   int safeX, maxSafeMv;
+   if (m_param->bIntraRefresh && m_slice->m_sliceType == P_SLICE)
+   {
+       safeX = m_slice->m_refFrameList[0][0]->m_encData->m_pir.pirEndCol * g_maxCUSize - 3;
+       maxSafeMv = (safeX - tempPred->cu.m_cuPelX) * 4;
+   }
    for (uint32_t i = 0; i < numMergeCand; ++i)
    {
        if (m_bFrameParallel &&
            (candMvField[i][0].mv.y >= (m_param->searchRange + 1) * 4 ||
             candMvField[i][1].mv.y >= (m_param->searchRange + 1) * 4))
            continue;
+       if (m_param->bIntraRefresh && m_slice->m_sliceType == P_SLICE &&
+           tempPred->cu.m_cuPelX / g_maxCUSize < m_frame->m_encData->m_pir.pirEndCol &&
+           candMvField[i][0].mv.x > maxSafeMv)
+           // skip merge candidates which reference beyond safe reference area
+           continue;

        tempPred->cu.m_mvpIdx[0][0] = (uint8_t)i; // merge candidate ID is stored in L0 MVP idx
        X265_CHECK(m_slice->m_sliceType == B_SLICE || !(candDir[i] & 0x10), " invalid merge for P slice\n");
@@ -1404,12 +1705,11 @@
        tempPred->cu.m_mv[1][0] = candMvField[i][1].mv;
        tempPred->cu.m_refIdx[0][0] = (int8_t)candMvField[i][0].refIdx;
        tempPred->cu.m_refIdx[1][0] = (int8_t)candMvField[i][1].refIdx;
-
-       motionCompensation(tempPred->cu, pu, tempPred->predYuv, true, m_bChromaSa8d);
+       motionCompensation(tempPred->cu, pu, tempPred->predYuv, true, m_bChromaSa8d && (m_csp != X265_CSP_I400));

        tempPred->sa8dBits = getTUBits(i, numMergeCand);
        tempPred->distortion = primitives.cu[sizeIdx].sa8d(fencYuv->m_buf[0], fencYuv->m_size, tempPred->predYuv.m_buf[0], tempPred->predYuv.m_size);
-       if (m_bChromaSa8d)
+       if (m_bChromaSa8d && (m_csp != X265_CSP_I400))
        {
            tempPred->distortion += primitives.chroma[m_csp].cu[sizeIdx].sa8d(fencYuv->m_buf[1], fencYuv->m_csize, tempPred->predYuv.m_buf[1], tempPred->predYuv.m_csize);
            tempPred->distortion += primitives.chroma[m_csp].cu[sizeIdx].sa8d(fencYuv->m_buf[2], fencYuv->m_csize, tempPred->predYuv.m_buf[2], tempPred->predYuv.m_csize);
@@ -1428,7 +1728,7 @@
        return;

    /* calculate the motion compensation for chroma for the best mode selected */
-   if (!m_bChromaSa8d) /* Chroma MC was done above */
+   if (!m_bChromaSa8d && (m_csp != X265_CSP_I400)) /* Chroma MC was done above */
        motionCompensation(bestPred->cu, pu, bestPred->predYuv, false, true);

    if (m_param->rdLevel)
@@ -1463,7 +1763,6 @@
    md.bestMode->cu.setPURefIdx(0, (int8_t)candMvField[bestSadCand][0].refIdx, 0, 0);
    md.bestMode->cu.setPURefIdx(1, (int8_t)candMvField[bestSadCand][1].refIdx, 0, 0);
    checkDQP(*md.bestMode, cuGeom);
-   X265_CHECK(md.bestMode->ok(), "Merge mode not ok\n");
}

/* sets md.bestMode if a valid merge candidate is found, else leaves it NULL */
@@ -1501,7 +1800,12 @@
        first = *m_reuseBestMergeCand;
        last = first + 1;
    }
-
+   int safeX, maxSafeMv;
+   if (m_param->bIntraRefresh && m_slice->m_sliceType == P_SLICE)
+   {
+       safeX = m_slice->m_refFrameList[0][0]->m_encData->m_pir.pirEndCol * g_maxCUSize - 3;
+       maxSafeMv = (safeX - tempPred->cu.m_cuPelX) * 4;
+   }
    for (uint32_t i = first; i < last; i++)
    {
        if (m_bFrameParallel &&
@@ -1524,7 +1828,11 @@
                continue;
            triedBZero = true;
        }
-
+       if (m_param->bIntraRefresh && m_slice->m_sliceType == P_SLICE &&
+           tempPred->cu.m_cuPelX / g_maxCUSize < m_frame->m_encData->m_pir.pirEndCol &&
+           candMvField[i][0].mv.x > maxSafeMv)
+           // skip merge candidates which reference beyond safe reference area
+           continue;
        tempPred->cu.m_mvpIdx[0][0] = (uint8_t)i;    /* merge candidate ID is stored in L0 MVP idx */
        tempPred->cu.m_interDir[0] = candDir[i];
        tempPred->cu.m_mv[0][0] = candMvField[i][0].mv;
@@ -1533,11 +1841,12 @@
        tempPred->cu.m_refIdx[1][0] = (int8_t)candMvField[i][1].refIdx;
        tempPred->cu.setPredModeSubParts(MODE_INTER); /* must be cleared between encode iterations */

-       motionCompensation(tempPred->cu, pu, tempPred->predYuv, true, true);
+       motionCompensation(tempPred->cu, pu, tempPred->predYuv, true, m_csp != X265_CSP_I400);

        uint8_t hasCbf = true;
        bool swapped = false;
-       if (!foundCbf0Merge)
+       /* bypass encoding merge with residual if analysis-mode = load as only SKIP CUs enter this function */
+       if (!foundCbf0Merge && !isShareMergeCand)
        {
            /* if the best prediction has CBF (not a skip) then try merge with residual */
@@ -1586,14 +1895,13 @@
        bestPred->cu.setPURefIdx(0, (int8_t)candMvField[bestCand][0].refIdx, 0, 0);
        bestPred->cu.setPURefIdx(1, (int8_t)candMvField[bestCand][1].refIdx, 0, 0);
        checkDQP(*bestPred, cuGeom);
-       X265_CHECK(bestPred->ok(), "merge mode is not ok");
    }

    if (m_param->analysisMode)
    {
-       m_reuseBestMergeCand++;
        if (m_param->analysisMode == X265_ANALYSIS_SAVE)
            *m_reuseBestMergeCand = bestPred->cu.m_mvpIdx[0][0];
+       m_reuseBestMergeCand++;
    }
}

@@ -1614,18 +1922,20 @@
            {
                bestME[i].ref = *m_reuseRef;
                m_reuseRef++;
+
+               bestME[i].mv = *m_reuseMv;
+               m_reuseMv++;
            }
        }
    }
-
-   predInterSearch(interMode, cuGeom, m_bChromaSa8d, refMask);
+   predInterSearch(interMode, cuGeom, m_bChromaSa8d && (m_csp != X265_CSP_I400), refMask);

    /* predInterSearch sets interMode.sa8dBits */
    const Yuv& fencYuv = *interMode.fencYuv;
    Yuv& predYuv = interMode.predYuv;
    int part = partitionFromLog2Size(cuGeom.log2CUSize);
    interMode.distortion = primitives.cu[part].sa8d(fencYuv.m_buf[0], fencYuv.m_size, predYuv.m_buf[0], predYuv.m_size);
-   if (m_bChromaSa8d)
+   if (m_bChromaSa8d && (m_csp != X265_CSP_I400))
    {
        interMode.distortion += primitives.chroma[m_csp].cu[part].sa8d(fencYuv.m_buf[1], fencYuv.m_csize, predYuv.m_buf[1], predYuv.m_csize);
        interMode.distortion += primitives.chroma[m_csp].cu[part].sa8d(fencYuv.m_buf[2], fencYuv.m_csize, predYuv.m_buf[2], predYuv.m_csize);
@@ -1637,11 +1947,16 @@
        uint32_t numPU = interMode.cu.getNumPartInter(0);
        for (uint32_t puIdx = 0; puIdx < numPU; puIdx++)
        {
+           PredictionUnit pu(interMode.cu, cuGeom, puIdx);
            MotionData* bestME = interMode.bestME[puIdx];
            for (int32_t i = 0; i < numPredDir; i++)
            {
+               if (bestME[i].ref >= 0)
+                   *m_reuseMv = getLowresMV(interMode.cu, pu, i, bestME[i].ref);
+
                *m_reuseRef = bestME[i].ref;
                m_reuseRef++;
+               m_reuseMv++;
            }
        }
    }
@@ -1664,11 +1979,13 @@
            {
                bestME[i].ref = *m_reuseRef;
                m_reuseRef++;
+
+               bestME[i].mv = *m_reuseMv;
+               m_reuseMv++;
            }
        }
    }
-
-   predInterSearch(interMode, cuGeom, true, refMask);
+   predInterSearch(interMode, cuGeom, m_csp != X265_CSP_I400, refMask);

    /* predInterSearch sets interMode.sa8dBits, but this is ignored */
    encodeResAndCalcRdInterCU(interMode, cuGeom);
@@ -1678,11 +1995,16 @@
        uint32_t numPU = interMode.cu.getNumPartInter(0);
        for (uint32_t puIdx = 0; puIdx < numPU; puIdx++)
        {
+           PredictionUnit pu(interMode.cu, cuGeom, puIdx);
            MotionData* bestME = interMode.bestME[puIdx];
            for (int32_t i = 0; i < numPredDir; i++)
            {
+               if (bestME[i].ref >= 0)
+                   *m_reuseMv = getLowresMV(interMode.cu, pu, i, bestME[i].ref);
+
                *m_reuseRef = bestME[i].ref;
                m_reuseRef++;
+               m_reuseMv++;
            }
        }
    }
@@ -1731,10 +2053,10 @@
    cu.m_mvd[1][0] = bestME[1].mv - mvp1;

    PredictionUnit pu(cu, cuGeom, 0);
-   motionCompensation(cu, pu, bidir2Nx2N.predYuv, true, m_bChromaSa8d);
+   motionCompensation(cu, pu, bidir2Nx2N.predYuv, true, m_bChromaSa8d && (m_csp != X265_CSP_I400));

    int sa8d = primitives.cu[partEnum].sa8d(fencYuv.m_buf[0], fencYuv.m_size, bidir2Nx2N.predYuv.m_buf[0], bidir2Nx2N.predYuv.m_size);
-   if (m_bChromaSa8d)
+   if (m_bChromaSa8d && (m_csp != X265_CSP_I400))
    {
        /* Add in chroma distortion */
        sa8d += primitives.chroma[m_csp].cu[partEnum].sa8d(fencYuv.m_buf[1], fencYuv.m_csize, bidir2Nx2N.predYuv.m_buf[1], bidir2Nx2N.predYuv.m_csize);
@@ -1765,16 +2087,16 @@

        int zsa8d;

-       if (m_bChromaSa8d)
+       if (m_bChromaSa8d && (m_csp != X265_CSP_I400))
        {
            cu.m_mv[0][0] = mvzero;
            cu.m_mv[1][0] = mvzero;

            motionCompensation(cu, pu, tmpPredYuv, true, true);
-
            zsa8d  = primitives.cu[partEnum].sa8d(fencYuv.m_buf[0], fencYuv.m_size, tmpPredYuv.m_buf[0], tmpPredYuv.m_size);
            zsa8d += primitives.chroma[m_csp].cu[partEnum].sa8d(fencYuv.m_buf[1], fencYuv.m_csize, tmpPredYuv.m_buf[1], tmpPredYuv.m_csize);
            zsa8d += primitives.chroma[m_csp].cu[partEnum].sa8d(fencYuv.m_buf[2], fencYuv.m_csize, tmpPredYuv.m_buf[2], tmpPredYuv.m_csize);
+
        }
        else
        {
@@ -1810,13 +2132,12 @@
            cu.m_mvd[1][0] = mvzero - mvp1;
            cu.m_mvpIdx[1][0] = (uint8_t)mvpIdx1;

-           if (m_bChromaSa8d)
-               /* real MC was already performed */
+           if (m_bChromaSa8d) /* real MC was already performed */
                bidir2Nx2N.predYuv.copyFromYuv(tmpPredYuv);
            else
-               motionCompensation(cu, pu, bidir2Nx2N.predYuv, true, true);
+               motionCompensation(cu, pu, bidir2Nx2N.predYuv, true, m_csp != X265_CSP_I400);
        }
-       else if (m_bChromaSa8d)
+       else if (m_bChromaSa8d && (m_csp != X265_CSP_I400))
        {
            /* recover overwritten motion vectors */
            cu.m_mv[0][0] = bestME[0].mv;
@@ -1845,7 +2166,9 @@
    Mode *bestMode = m_modeDepth[cuGeom.depth].bestMode;
    CUData& cu = bestMode->cu;

-   cu.copyFromPic(ctu, cuGeom);
+   cu.copyFromPic(ctu, cuGeom, m_csp);
+
+   PicYuv& reconPic = *m_frame->m_reconPic;

    Yuv& fencYuv = m_modeDepth[cuGeom.depth].fencYuv;
    if (cuGeom.depth)
@@ -1860,8 +2183,11 @@
            cu.getIntraTUQtDepthRange(tuDepthRange, 0);

            residualTransformQuantIntra(*bestMode, cuGeom, 0, 0, tuDepthRange);
-           getBestIntraModeChroma(*bestMode, cuGeom);
-           residualQTIntraChroma(*bestMode, cuGeom, 0, 0);
+           if (m_csp != X265_CSP_I400)
+           {
+               getBestIntraModeChroma(*bestMode, cuGeom);
+               residualQTIntraChroma(*bestMode, cuGeom, 0, 0);
+           }
        }
        else // if (cu.isInter(0))
        {
@@ -1876,20 +2202,23 @@
            /* at RD 0, the prediction pixels are accumulated into the top depth predYuv */
            Yuv& predYuv = m_modeDepth[0].bestMode->predYuv;
            pixel* predY = predYuv.getLumaAddr(absPartIdx);
-           pixel* predU = predYuv.getCbAddr(absPartIdx);
-           pixel* predV = predYuv.getCrAddr(absPartIdx);

            primitives.cu[sizeIdx].sub_ps(resiYuv.m_buf[0], resiYuv.m_size, fencYuv.m_buf[0], predY, fencYuv.m_size, predYuv.m_size);

-           primitives.chroma[m_csp].cu[sizeIdx].sub_ps(resiYuv.m_buf[1], resiYuv.m_csize,
+           if (m_csp != X265_CSP_I400)
+           {
+               pixel* predU = predYuv.getCbAddr(absPartIdx);
+               pixel* predV = predYuv.getCrAddr(absPartIdx);
+               primitives.chroma[m_csp].cu[sizeIdx].sub_ps(resiYuv.m_buf[1], resiYuv.m_csize,
                                                        fencYuv.m_buf[1], predU, fencYuv.m_csize, predYuv.m_csize);
-           primitives.chroma[m_csp].cu[sizeIdx].sub_ps(resiYuv.m_buf[2], resiYuv.m_csize,
+               primitives.chroma[m_csp].cu[sizeIdx].sub_ps(resiYuv.m_buf[2], resiYuv.m_csize,
                                                        fencYuv.m_buf[2], predV, fencYuv.m_csize, predYuv.m_csize);
+           }

            uint32_t tuDepthRange[2];
            cu.getInterTUQtDepthRange(tuDepthRange, 0);
@@ -1902,27 +2231,30 @@
            /* residualTransformQuantInter() wrote transformed residual back into
             * resiYuv. Generate the recon pixels by adding it to the prediction */

-           PicYuv& reconPic = *m_frame->m_reconPic;
            if (cu.m_cbf[0][0])
                primitives.cu[sizeIdx].add_ps(reconPic.getLumaAddr(cu.m_cuAddr, absPartIdx), reconPic.m_stride,
                                              predY, resiYuv.m_buf[0], predYuv.m_size, resiYuv.m_size);
            else
                primitives.cu[sizeIdx].copy_pp(reconPic.getLumaAddr(cu.m_cuAddr, absPartIdx), reconPic.m_stride,
                                               predY, predYuv.m_size);
-
-           if (cu.m_cbf[1][0])
-               primitives.chroma[m_csp].cu[sizeIdx].add_ps(reconPic.getCbAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC,
+           if (m_csp != X265_CSP_I400)
+           {
+               pixel* predU = predYuv.getCbAddr(absPartIdx);
+               pixel* predV = predYuv.getCrAddr(absPartIdx);
+               if (cu.m_cbf[1][0])
+                   primitives.chroma[m_csp].cu[sizeIdx].add_ps(reconPic.getCbAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC,
                                                            predU, resiYuv.m_buf[1], predYuv.m_csize, resiYuv.m_csize);
-           else
-               primitives.chroma[m_csp].cu[sizeIdx].copy_pp(reconPic.getCbAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC,
+               else
+                   primitives.chroma[m_csp].cu[sizeIdx].copy_pp(reconPic.getCbAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC,
                                                             predU, predYuv.m_csize);
-           if (cu.m_cbf[2][0])
-               primitives.chroma[m_csp].cu[sizeIdx].add_ps(reconPic.getCrAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC,
+               if (cu.m_cbf[2][0])
+                   primitives.chroma[m_csp].cu[sizeIdx].add_ps(reconPic.getCrAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC,
                                                            predV, resiYuv.m_buf[2], predYuv.m_csize, resiYuv.m_csize);
-           else
-               primitives.chroma[m_csp].cu[sizeIdx].copy_pp(reconPic.getCrAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC,
+               else
+                   primitives.chroma[m_csp].cu[sizeIdx].copy_pp(reconPic.getCrAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC,
                                                             predV, predYuv.m_csize);
+           }
        }

        cu.updatePic(cuGeom.depth);
@@ -1936,7 +2268,6 @@
        mode.contexts.resetBits();
        mode.contexts.codeSplitFlag(mode.cu, 0, depth);
        uint32_t bits = mode.contexts.getNumberOfWrittenBits();
-       mode.mvBits += bits;
        mode.totalBits += bits;
        updateModeCost(mode);
    }
@@ -1947,7 +2278,6 @@
    }
    else
    {
-       mode.mvBits++;
        mode.totalBits++;
        updateModeCost(mode);
    }
@@ -1965,7 +2295,7 @@
    if (m_slice->m_numRefIdx[0])
    {
        numRefs++;
-       const CUData& cu = *m_slice->m_refPicList[0][0]->m_encData->getPicCTU(parentCTU.m_cuAddr);
+       const CUData& cu = *m_slice->m_refFrameList[0][0]->m_encData->getPicCTU(parentCTU.m_cuAddr);
        previousQP = cu.m_qp[0];
        if (!cu.m_cuDepth[cuGeom.absPartIdx])
            return 0;
@@ -1979,7 +2309,7 @@
    if (m_slice->m_numRefIdx[1])
    {
        numRefs++;
-       const CUData& cu = *m_slice->m_refPicList[1][0]->m_encData->getPicCTU(parentCTU.m_cuAddr);
+       const CUData& cu = *m_slice->m_refFrameList[1][0]->m_encData->getPicCTU(parentCTU.m_cuAddr);
        if (!cu.m_cuDepth[cuGeom.absPartIdx])
            return 0;
        for (uint32_t i = 0; i < cuGeom.numPartitions; i += 4)
@@ -2061,10 +2391,10 @@
    return false;
}

-int Analysis::calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom)
+int Analysis::calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom, double baseQp)
{
    FrameData& curEncData = *m_frame->m_encData;
-   double qp = curEncData.m_cuStat[ctu.m_cuAddr].baseQp;
+   double qp = baseQp >= 0 ? baseQp : curEncData.m_cuStat[ctu.m_cuAddr].baseQp;

    /* Use cuTree offsets if cuTree enabled and frame is referenced, else use AQ offsets */
    bool isReferenced = IS_REFERENCED(m_frame);
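
A note on the recurring pattern in the analysis.cpp hunks above: with --limit-modes, a rectangular or AMP partition is only evaluated when the summed sa8d cost of the four sub-CUs undercuts the 2Nx2N cost (or, at RD 5/6, the current best RD cost) plus a threshold derived from the children's motion-vector costs; for B slices the L0 and L1 components are averaged with rounding. The following self-contained sketch illustrates that gate; SubCost and worthTrying2NxN are hypothetical names standing in for the SplitData fields, not symbols from the x265 sources.

    // Sketch of the --limit-modes gate used above (assumed simplification).
    #include <cstdint>

    struct SubCost { uint64_t sa8dCost; uint32_t mvCost[2]; };  // stands in for SplitData

    static bool worthTrying2NxN(const SubCost sub[4], uint64_t cost2Nx2N, bool isPSlice)
    {
        uint64_t splitCost = sub[0].sa8dCost + sub[1].sa8dCost
                           + sub[2].sa8dCost + sub[3].sa8dCost;
        // 2NxN threshold: MV cost of the two children forming the top half;
        // B slices average L0 and L1 with rounding, as in the diff above
        uint32_t threshold = isPSlice
            ? sub[0].mvCost[0] + sub[1].mvCost[0]
            : (sub[0].mvCost[0] + sub[1].mvCost[0]
             + sub[0].mvCost[1] + sub[1].mvCost[1] + 1) >> 1;
        return splitCost < cost2Nx2N + threshold;
    }

When the gate fails, that partition's motion search is skipped outright, which is where the speedup of the limit-modes feature comes from.
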
View file
x265_1.8.tar.gz/source/encoder/analysis.h -> x265_1.9.tar.gz/source/encoder/analysis.h
Changed
@@ -3,6 +3,7 @@
 *
 * Authors: Deepthi Nandakumar <deepthi@multicorewareinc.com>
 *          Steve Borho <steve@borho.org>
+*          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
@@ -40,6 +41,21 @@

class Entropy;

+struct SplitData
+{
+   uint32_t splitRefs;
+   uint32_t mvCost[2];
+   uint64_t sa8dCost;
+
+   void initSplitCUData()
+   {
+       splitRefs = 0;
+       mvCost[0] = 0; // L0
+       mvCost[1] = 0; // L1
+       sa8dCost  = 0;
+   }
+};
+
class Analysis : public Search
{
public:
@@ -101,20 +117,20 @@
    Mode& compressCTU(CUData& ctu, Frame& frame, const CUGeom& cuGeom, const Entropy& initialContext);

protected:
-   /* Analysis data for load/save modes, keeps getting incremented as CTU analysis proceeds and data is consumed or read */
-
    analysis_intra_data* m_reuseIntraDataCTU;
    analysis_inter_data* m_reuseInterDataCTU;
+   MV*                  m_reuseMv;
    int32_t*             m_reuseRef;
    uint32_t*            m_reuseBestMergeCand;
+   uint32_t             m_splitRefIdx[4];

    /* full analysis for an I-slice CU */
-   void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp);
+   void compressIntraCU(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);

    /* full analysis for a P or B slice CU */
-   void compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
-   uint32_t compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
-   uint32_t compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp);
+   uint32_t compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
+   SplitData compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp);
+   SplitData compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp);

    /* measure merge and skip */
    void checkMerge2Nx2N_rd0_4(Mode& skip, Mode& merge, const CUGeom& cuGeom);
@@ -139,13 +155,11 @@
    /* generate residual and recon pixels for an entire CTU recursively (RD0) */
    void encodeResidue(const CUData& parentCTU, const CUGeom& cuGeom);

-   int calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom);
+   int calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom, double baseQP = -1);

    /* check whether current mode is the new best */
    inline void checkBestMode(Mode& mode, uint32_t depth)
    {
-       X265_CHECK(mode.ok(), "mode costs are uninitialized\n");
-
        ModeDepth& md = m_modeDepth[depth];
        if (md.bestMode)
        {
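
SplitData, added above, is the carrier for both new analysis features: splitRefs propagates the --limit-refs reference mask up the CU recursion, while mvCost and sa8dCost feed the --limit-modes thresholds. A parent CU typically folds its four children together like this (a sketch mirroring the compressInterCU_rd0_4 changes earlier in this diff):

    SplitData splitData[4];
    for (int i = 0; i < 4; i++)
        splitData[i].initSplitCUData();        // zero the refs and costs
    /* ... recursive analysis of child i fills splitData[i] ... */
    uint32_t allSplitRefs = splitData[0].splitRefs | splitData[1].splitRefs
                          | splitData[2].splitRefs | splitData[3].splitRefs;
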
View file
x265_1.8.tar.gz/source/encoder/api.cpp -> x265_1.9.tar.gz/source/encoder/api.cpp
Changed
@@ -72,9 +72,7 @@
 #endif

 #if HIGH_BIT_DEPTH
-   if (X265_DEPTH == 12)
-       x265_log(p, X265_LOG_WARNING, "Main12 is HIGHLY experimental, do not use!\n");
-   else if (X265_DEPTH != 10 && X265_DEPTH != 12)
+   if (X265_DEPTH != 10 && X265_DEPTH != 12)
 #else
    if (X265_DEPTH != 8)
 #endif
@@ -247,6 +245,16 @@
    }
}

+int x265_encoder_intra_refresh(x265_encoder *enc)
+{
+   if (!enc)
+       return -1;
+
+   Encoder *encoder = static_cast<Encoder*>(enc);
+   encoder->m_bQueuedIntraRefresh = 1;
+   return 0;
+}
+
void x265_cleanup(void)
{
    if (!g_ctuSizeConfigured)
@@ -268,6 +276,7 @@
    pic->bitDepth = param->internalBitDepth;
    pic->colorSpace = param->internalCsp;
    pic->forceqp = X265_QP_AUTO;
+   pic->quantOffsets = NULL;
    if (param->analysisMode)
    {
        uint32_t widthInCU = (param->sourceWidth + g_maxCUSize - 1) >> g_maxLog2CUSize;
@@ -318,6 +327,7 @@
    &x265_cleanup,

    sizeof(x265_frame_stats),
+   &x265_encoder_intra_refresh,
};

typedef const x265_api* (*api_get_func)(int bitDepth);
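
The new x265_encoder_intra_refresh() entry point only queues a request (m_bQueuedIntraRefresh); the actual refresh column is scheduled by calcRefreshInterval() in encoder.cpp further down this diff. A minimal caller sketch, assuming an encoder handle already opened with --intra-refresh enabled:

    /* ask for a recovery point mid-stream instead of forcing an IDR */
    if (x265_encoder_intra_refresh(enc) < 0)
        fprintf(stderr, "intra-refresh request rejected (NULL encoder handle)\n");
    /* the request is asynchronous: the moving intra column begins on the
     * next scheduled P frame, not necessarily the next output picture */
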
View file
x265_1.8.tar.gz/source/encoder/bitcost.cpp -> x265_1.9.tar.gz/source/encoder/bitcost.cpp
Changed
@@ -2,6 +2,7 @@
 * Copyright (C) 2013 x265 project
 *
 * Authors: Steve Borho <steve@borho.org>
+*          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
@@ -40,7 +41,12 @@
        x265_emms(); // just to be safe

        CalculateLogs();
-       s_costs[qp] = new uint16_t[4 * BC_MAX_MV + 1] + 2 * BC_MAX_MV;
+       s_costs[qp] = X265_MALLOC(uint16_t, 4 * BC_MAX_MV + 1) + 2 * BC_MAX_MV;
+       if (!s_costs[qp])
+       {
+           x265_log(NULL, X265_LOG_ERROR, "BitCost s_costs buffer allocation failure\n");
+           return;
+       }
        double lambda = x265_lambda_tab[qp];

        // estimate same cost for negative and positive MVD
@@ -66,11 +72,16 @@
{
    if (!s_bitsizes)
    {
-       s_bitsizes = new float[2 * BC_MAX_MV + 1];
+       s_bitsizes = X265_MALLOC(float, 4 * BC_MAX_MV + 1) + 2 * BC_MAX_MV;
+       if (!s_bitsizes)
+       {
+           x265_log(NULL, X265_LOG_ERROR, "BitCost s_bitsizes buffer allocation failure\n");
+           return;
+       }
        s_bitsizes[0] = 0.718f;
        float log2_2 = 2.0f / log(2.0f); // 2 x 1/log(2)
        for (int i = 1; i <= 2 * BC_MAX_MV; i++)
-           s_bitsizes[i] = log((float)(i + 1)) * log2_2 + 1.718f;
+           s_bitsizes[i] = s_bitsizes[-i] = log((float)(i + 1)) * log2_2 + 1.718f;
    }
}

@@ -80,12 +91,15 @@
    {
        if (s_costs[i])
        {
-           delete [] (s_costs[i] - 2 * BC_MAX_MV);
+           X265_FREE(s_costs[i] - 2 * BC_MAX_MV);

-           s_costs[i] = 0;
+           s_costs[i] = NULL;
        }
    }

-   delete [] s_bitsizes;
-   s_bitsizes = 0;
+   if (s_bitsizes)
+   {
+       X265_FREE(s_bitsizes - 2 * BC_MAX_MV);
+       s_bitsizes = NULL;
+   }
}
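
The s_bitsizes change above pairs with the bitcost.h hunk below: the table is now allocated with 4 * BC_MAX_MV + 1 entries and the stored pointer is advanced by 2 * BC_MAX_MV, so it can be indexed directly with a signed MV difference, and the fill loop mirrors the negative half (s_bitsizes[i] = s_bitsizes[-i] = ...). That is what lets the header drop its abs() calls. The same zero-centered-table idiom in isolation (toy code, not from x265):

    #include <cstdlib>

    int main()
    {
        const int MAX_D = 1024;                 // toy stand-in for BC_MAX_MV
        float *base  = (float*)malloc((4 * MAX_D + 1) * sizeof(float));
        float *table = base + 2 * MAX_D;        // valid indices: -2*MAX_D .. +2*MAX_D
        for (int i = 0; i <= 2 * MAX_D; i++)
            table[i] = table[-i] = (float)i;    // mirror halves, as s_bitsizes does above
        /* lookups then use table[dx] with a signed dx, no abs() needed */
        free(base);                             // free the unshifted pointer, as destroy() does
        return 0;
    }
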
View file
x265_1.8.tar.gz/source/encoder/bitcost.h -> x265_1.9.tar.gz/source/encoder/bitcost.h
Changed
@@ -47,14 +47,14 @@
    // return bit cost of motion vector difference, without lambda
    inline uint32_t bitcost(const MV& mv) const
    {
-       return (uint32_t)(s_bitsizes[abs(mv.x - m_mvp.x)] +
-                         s_bitsizes[abs(mv.y - m_mvp.y)] + 0.5f);
+       return (uint32_t)(s_bitsizes[mv.x - m_mvp.x] +
+                         s_bitsizes[mv.y - m_mvp.y] + 0.5f);
    }

    static inline uint32_t bitcost(const MV& mv, const MV& mvp)
    {
-       return (uint32_t)(s_bitsizes[abs(mv.x - mvp.x)] +
-                         s_bitsizes[abs(mv.y - mvp.y)] + 0.5f);
+       return (uint32_t)(s_bitsizes[mv.x - mvp.x] +
+                         s_bitsizes[mv.y - mvp.y] + 0.5f);
    }

    static void destroy();
View file
x265_1.8.tar.gz/source/encoder/dpb.cpp -> x265_1.9.tar.gz/source/encoder/dpb.cpp
Changed
@@ -47,16 +47,16 @@
        delete curFrame;
    }

-   while (m_picSymFreeList)
+   while (m_frameDataFreeList)
    {
-       FrameData* next = m_picSymFreeList->m_freeListNext;
-       m_picSymFreeList->destroy();
+       FrameData* next = m_frameDataFreeList->m_freeListNext;
+       m_frameDataFreeList->destroy();

-       m_picSymFreeList->m_reconPic->destroy();
-       delete m_picSymFreeList->m_reconPic;
+       m_frameDataFreeList->m_reconPic->destroy();
+       delete m_frameDataFreeList->m_reconPic;

-       delete m_picSymFreeList;
-       m_picSymFreeList = next;
+       delete m_frameDataFreeList;
+       m_frameDataFreeList = next;
    }
}

@@ -74,13 +74,19 @@
        curFrame->m_reconRowCount.set(0);
        curFrame->m_bChromaExtended = false;

+       // Reset column counter
+       X265_CHECK(curFrame->m_reconColCount != NULL, "curFrame->m_reconColCount check failure");
+       X265_CHECK(curFrame->m_numRows > 0, "curFrame->m_numRows check failure");
+       for(int32_t col = 0; col < curFrame->m_numRows; col++)
+           curFrame->m_reconColCount[col].set(0);
+
        // iterator is invalidated by remove, restart scan
        m_picList.remove(*curFrame);
        iterFrame = m_picList.first();

        m_freeList.pushBack(*curFrame);
-       curFrame->m_encData->m_freeListNext = m_picSymFreeList;
-       m_picSymFreeList = curFrame->m_encData;
+       curFrame->m_encData->m_freeListNext = m_frameDataFreeList;
+       m_frameDataFreeList = curFrame->m_encData;
        curFrame->m_encData = NULL;
        curFrame->m_reconPic = NULL;
    }
@@ -171,7 +177,7 @@
    {
        for (int ref = 0; ref < slice->m_numRefIdx[l]; ref++)
        {
-           Frame *refpic = slice->m_refPicList[l][ref];
+           Frame *refpic = slice->m_refFrameList[l][ref];
            ATOMIC_INC(&refpic->m_countRefEncoders);
        }
    }
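
Context for the new reset loop above: FrameData objects are recycled through m_frameDataFreeList, and m_reconColCount tracks, per CTU row, how many columns of a frame have been reconstructed; with the new --intra-refresh feature, motion search consults this state to bound how far it may reach into a reference. Clearing every row's counter before the encData is parked on the free list prevents stale counts from the frame's previous use unblocking a reader too early; the two X265_CHECK assertions merely guard that the counter array exists and the row count is sane before the loop runs.
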
View file
x265_1.8.tar.gz/source/encoder/dpb.h -> x265_1.9.tar.gz/source/encoder/dpb.h
Changed
@@ -46,14 +46,14 @@
    bool               m_bTemporalSublayer;
    PicList            m_picList;
    PicList            m_freeList;
-   FrameData*         m_picSymFreeList;
+   FrameData*         m_frameDataFreeList;

    DPB(x265_param *param)
    {
        m_lastIDR = 0;
        m_pocCRA = 0;
        m_bRefreshPending = false;
-       m_picSymFreeList = NULL;
+       m_frameDataFreeList = NULL;
        m_maxRefL0 = param->maxNumReferences;
        m_maxRefL1 = param->bBPyramid ? 2 : 1;
        m_bOpenGOP = param->bOpenGOP;
View file
x265_1.8.tar.gz/source/encoder/encoder.cpp -> x265_1.9.tar.gz/source/encoder/encoder.cpp
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -39,6 +40,10 @@ #include "x265.h" +#if _MSC_VER +#pragma warning(disable: 4996) // POSIX functions are just fine, thanks +#endif + namespace X265_NS { const char g_sliceTypeToChar[] = {'B', 'P', 'I'}; } @@ -66,12 +71,9 @@ m_outputCount = 0; m_param = NULL; m_latestParam = NULL; - m_cuOffsetY = NULL; - m_cuOffsetC = NULL; - m_buOffsetY = NULL; - m_buOffsetC = NULL; m_threadPool = NULL; m_analysisFile = NULL; + m_offsetEmergency = NULL; for (int i = 0; i < X265_MAX_FRAME_THREADS; i++) m_frameEncoder[i] = NULL; @@ -191,6 +193,7 @@ { x265_log(m_param, X265_LOG_ERROR, "Unable to allocate scaling list arrays\n"); m_aborted = true; + return; } else if (!m_param->scalingLists || !strcmp(m_param->scalingLists, "off")) m_scalingList.m_bEnabled = false; @@ -198,7 +201,6 @@ m_scalingList.setDefaultScalingList(); else if (m_scalingList.parseScalingList(m_param->scalingLists)) m_aborted = true; - m_scalingList.setupQuantMatrices(); m_lookahead = new Lookahead(m_param, m_threadPool); if (m_numPools) @@ -213,6 +215,82 @@ initVPS(&m_vps); initSPS(&m_sps); initPPS(&m_pps); + + if (m_param->rc.vbvBufferSize) + { + m_offsetEmergency = (uint16_t(*)[MAX_NUM_TR_CATEGORIES][MAX_NUM_TR_COEFFS])X265_MALLOC(uint16_t, MAX_NUM_TR_CATEGORIES * MAX_NUM_TR_COEFFS * (QP_MAX_MAX - QP_MAX_SPEC)); + if (!m_offsetEmergency) + { + x265_log(m_param, X265_LOG_ERROR, "Unable to allocate memory\n"); + m_aborted = true; + return; + } + + bool scalingEnabled = m_scalingList.m_bEnabled; + if (!scalingEnabled) + { + m_scalingList.setDefaultScalingList(); + m_scalingList.setupQuantMatrices(); + } + else + m_scalingList.setupQuantMatrices(); + + for (int q = 0; q < QP_MAX_MAX - QP_MAX_SPEC; q++) + { + for (int cat = 0; cat < MAX_NUM_TR_CATEGORIES; cat++) + { + uint16_t *nrOffset = m_offsetEmergency[q][cat]; + + int trSize = cat & 3; + + int coefCount = 1 << ((trSize + 2) * 2); + + /* Denoise chroma first then luma, then DC. */ + int dcThreshold = (QP_MAX_MAX - QP_MAX_SPEC) * 2 / 3; + int lumaThreshold = (QP_MAX_MAX - QP_MAX_SPEC) * 2 / 3; + int chromaThreshold = 0; + + int thresh = (cat < 4 || (cat >= 8 && cat < 12)) ? lumaThreshold : chromaThreshold; + + double quantF = (double)(1ULL << (q / 6 + 16 + 8)); + + for (int i = 0; i < coefCount; i++) + { + /* True "emergency mode": remove all DCT coefficients */ + if (q == QP_MAX_MAX - QP_MAX_SPEC - 1) + { + nrOffset[i] = INT16_MAX; + continue; + } + + int iThresh = i == 0 ? dcThreshold : thresh; + if (q < iThresh) + { + nrOffset[i] = 0; + continue; + } + + int numList = (cat >= 8) * 3 + ((int)!iThresh); + + double pos = (double)(q - iThresh + 1) / (QP_MAX_MAX - QP_MAX_SPEC - iThresh); + double start = quantF / (m_scalingList.m_quantCoef[trSize][numList][QP_MAX_SPEC % 6][i]); + + // Formula chosen as an exponential scale to vaguely mimic the effects of a higher quantizer. 
+ double bias = (pow(2, pos * (QP_MAX_MAX - QP_MAX_SPEC)) * 0.003 - 0.003) * start; + nrOffset[i] = (uint16_t)X265_MIN(bias + 0.5, INT16_MAX); + } + } + } + + if (!scalingEnabled) + { + m_scalingList.m_bEnabled = false; + m_scalingList.m_bDataPresent = false; + m_scalingList.setupQuantMatrices(); + } + } + else + m_scalingList.setupQuantMatrices(); int numRows = (m_param->sourceHeight + g_maxCUSize - 1) / g_maxCUSize; int numCols = (m_param->sourceWidth + g_maxCUSize - 1) / g_maxCUSize; @@ -259,6 +337,8 @@ m_encodeStartTime = x265_mdate(); m_nalList.m_annexB = !!m_param->bAnnexB; + + m_emitCLLSEI = p->maxCLL || p->maxFALL; } void Encoder::stopJobs() @@ -318,10 +398,7 @@ delete m_rateControl; } - X265_FREE(m_cuOffsetY); - X265_FREE(m_cuOffsetC); - X265_FREE(m_buOffsetY); - X265_FREE(m_buOffsetC); + X265_FREE(m_offsetEmergency); if (m_analysisFile) fclose(m_analysisFile); @@ -335,7 +412,6 @@ free((char*)m_param->scalingLists); free((char*)m_param->numaPools); free((char*)m_param->masteringDisplayColorVolume); - free((char*)m_param->contentLightLevelInfo); PARAM_NS::x265_param_free(m_param); } @@ -361,6 +437,45 @@ } } +void Encoder::calcRefreshInterval(Frame* frameEnc) +{ + Slice* slice = frameEnc->m_encData->m_slice; + uint32_t numBlocksInRow = slice->m_sps->numCuInWidth; + FrameData::PeriodicIR* pir = &frameEnc->m_encData->m_pir; + if (slice->m_sliceType == I_SLICE) + { + pir->framesSinceLastPir = 0; + m_bQueuedIntraRefresh = 0; + /* PIR is currently only supported with ref == 1, so any intra frame effectively refreshes + * the whole frame and counts as an intra refresh. */ + pir->pirEndCol = numBlocksInRow; + } + else if (slice->m_sliceType == P_SLICE) + { + Frame* ref = frameEnc->m_encData->m_slice->m_refFrameList[0][0]; + int pocdiff = frameEnc->m_poc - ref->m_poc; + int numPFramesInGOP = m_param->keyframeMax / pocdiff; + int increment = (numBlocksInRow + numPFramesInGOP - 1) / numPFramesInGOP; + pir->pirEndCol = ref->m_encData->m_pir.pirEndCol; + pir->framesSinceLastPir = ref->m_encData->m_pir.framesSinceLastPir + pocdiff; + if (pir->framesSinceLastPir >= m_param->keyframeMax || + (m_bQueuedIntraRefresh && pir->pirEndCol >= numBlocksInRow)) + { + pir->pirEndCol = 0; + pir->framesSinceLastPir = 0; + m_bQueuedIntraRefresh = 0; + frameEnc->m_lowres.bKeyframe = 1; + } + pir->pirStartCol = pir->pirEndCol; + pir->pirEndCol += increment; + /* If our intra refresh has reached the right side of the frame, we're done. */ + if (pir->pirEndCol >= numBlocksInRow) + { + pir->pirEndCol = numBlocksInRow; + } + } +} + /** * Feed one new input frame into the encoder, get one frame out. If pic_in is * NULL, a flush condition is implied and pic_in must be NULL for all subsequent @@ -395,7 +510,7 @@ { if (pic_in->colorSpace != m_param->internalCsp) { - x265_log(m_param, X265_LOG_ERROR, "Unsupported color space (%d) on input\n", + x265_log(m_param, X265_LOG_ERROR, "Unsupported chroma subsampling (%d) on input\n", pic_in->colorSpace); return -1; } @@ -411,17 +526,20 @@ { inFrame = new Frame; x265_param* p = m_reconfigured? 
m_latestParam : m_param;
-        if (inFrame->create(p))
+        if (inFrame->create(p, pic_in->quantOffsets))
         {
             /* the first PicYuv created is asked to generate the CU and block unit offset
              * arrays which are then shared with all subsequent PicYuv (orig and recon)
              * allocated by this top level encoder */
-            if (m_cuOffsetY)
+            if (m_sps.cuOffsetY)
             {
-                inFrame->m_fencPic->m_cuOffsetC = m_cuOffsetC;
-                inFrame->m_fencPic->m_cuOffsetY = m_cuOffsetY;
-                inFrame->m_fencPic->m_buOffsetC = m_buOffsetC;
-                inFrame->m_fencPic->m_buOffsetY = m_buOffsetY;
+                inFrame->m_fencPic->m_cuOffsetY = m_sps.cuOffsetY;
+                inFrame->m_fencPic->m_buOffsetY = m_sps.buOffsetY;
+                if (pic_in->colorSpace != X265_CSP_I400)
+                {
+                    inFrame->m_fencPic->m_cuOffsetC = m_sps.cuOffsetC;
+                    inFrame->m_fencPic->m_buOffsetC = m_sps.buOffsetC;
+                }
             }
             else
             {
@@ -435,10 +553,15 @@
                 }
                 else
                 {
-                    m_cuOffsetC = inFrame->m_fencPic->m_cuOffsetC;
-                    m_cuOffsetY = inFrame->m_fencPic->m_cuOffsetY;
-                    m_buOffsetC = inFrame->m_fencPic->m_buOffsetC;
-                    m_buOffsetY = inFrame->m_fencPic->m_buOffsetY;
+                    m_sps.cuOffsetY = inFrame->m_fencPic->m_cuOffsetY;
+                    m_sps.buOffsetY = inFrame->m_fencPic->m_buOffsetY;
+                    if (pic_in->colorSpace != X265_CSP_I400)
+                    {
+                        m_sps.cuOffsetC = inFrame->m_fencPic->m_cuOffsetC;
+                        m_sps.cuOffsetY = inFrame->m_fencPic->m_cuOffsetY;
+                        m_sps.buOffsetC = inFrame->m_fencPic->m_buOffsetC;
+                        m_sps.buOffsetY = inFrame->m_fencPic->m_buOffsetY;
+                    }
                 }
             }
         }
@@ -454,17 +577,27 @@
         else
         {
             inFrame = m_dpb->m_freeList.popBack();
+            /* Set lowres scenecut and satdCost here to avoid overwriting the ANALYSIS_READ
+               decision by lowres init */
+            inFrame->m_lowres.bScenecut = false;
+            inFrame->m_lowres.satdCost = (int64_t)-1;
             inFrame->m_lowresInit = false;
         }
 
         /* Copy input picture into a Frame and PicYuv, send to lookahead */
-        inFrame->m_fencPic->copyFromPicture(*pic_in, m_sps.conformanceWindow.rightOffset, m_sps.conformanceWindow.bottomOffset);
+        inFrame->m_fencPic->copyFromPicture(*pic_in, *m_param, m_sps.conformanceWindow.rightOffset, m_sps.conformanceWindow.bottomOffset);
 
         inFrame->m_poc = ++m_pocLast;
         inFrame->m_userData = pic_in->userData;
         inFrame->m_pts = pic_in->pts;
         inFrame->m_forceqp = pic_in->forceqp;
         inFrame->m_param = m_reconfigured ? 
m_latestParam : m_param; + + if (pic_in->quantOffsets != NULL) + { + int cuCount = inFrame->m_lowres.maxBlocksInRow * inFrame->m_lowres.maxBlocksInCol; + memcpy(inFrame->m_quantOffsets, pic_in->quantOffsets, cuCount * sizeof(float)); + } if (m_pocLast == 0) m_firstPts = inFrame->m_pts; @@ -496,11 +629,15 @@ readAnalysisFile(&inputPic->analysisData, inFrame->m_poc); inFrame->m_analysisData.poc = inFrame->m_poc; inFrame->m_analysisData.sliceType = inputPic->analysisData.sliceType; + inFrame->m_analysisData.bScenecut = inputPic->analysisData.bScenecut; + inFrame->m_analysisData.satdCost = inputPic->analysisData.satdCost; inFrame->m_analysisData.numCUsInFrame = inputPic->analysisData.numCUsInFrame; inFrame->m_analysisData.numPartitions = inputPic->analysisData.numPartitions; inFrame->m_analysisData.interData = inputPic->analysisData.interData; inFrame->m_analysisData.intraData = inputPic->analysisData.intraData; sliceType = inputPic->analysisData.sliceType; + inFrame->m_lowres.bScenecut = !!inFrame->m_analysisData.bScenecut; + inFrame->m_lowres.satdCost = inFrame->m_analysisData.satdCost; } m_lookahead->addPicture(*inFrame, sliceType); @@ -563,16 +700,21 @@ pic_out->planes[0] = recpic->m_picOrg[0]; pic_out->stride[0] = (int)(recpic->m_stride * sizeof(pixel)); - pic_out->planes[1] = recpic->m_picOrg[1]; - pic_out->stride[1] = (int)(recpic->m_strideC * sizeof(pixel)); - pic_out->planes[2] = recpic->m_picOrg[2]; - pic_out->stride[2] = (int)(recpic->m_strideC * sizeof(pixel)); + if (m_param->internalCsp != X265_CSP_I400) + { + pic_out->planes[1] = recpic->m_picOrg[1]; + pic_out->stride[1] = (int)(recpic->m_strideC * sizeof(pixel)); + pic_out->planes[2] = recpic->m_picOrg[2]; + pic_out->stride[2] = (int)(recpic->m_strideC * sizeof(pixel)); + } /* Dump analysis data from pic_out to file in save mode and free */ if (m_param->analysisMode == X265_ANALYSIS_SAVE) { pic_out->analysisData.poc = pic_out->poc; pic_out->analysisData.sliceType = pic_out->sliceType; + pic_out->analysisData.bScenecut = outFrame->m_lowres.bScenecut; + pic_out->analysisData.satdCost = outFrame->m_lowres.satdCost; pic_out->analysisData.numCUsInFrame = outFrame->m_analysisData.numCUsInFrame; pic_out->analysisData.numPartitions = outFrame->m_analysisData.numPartitions; pic_out->analysisData.interData = outFrame->m_analysisData.interData; @@ -581,36 +723,57 @@ freeAnalysis(&pic_out->analysisData); } } - if (slice->m_sliceType == P_SLICE) + if (m_param->internalCsp == X265_CSP_I400) { - if (slice->m_weightPredTable[0][0][0].bPresentFlag) - m_numLumaWPFrames++; - if (slice->m_weightPredTable[0][0][1].bPresentFlag || - slice->m_weightPredTable[0][0][2].bPresentFlag) - m_numChromaWPFrames++; + if (slice->m_sliceType == P_SLICE) + { + if (slice->m_weightPredTable[0][0][0].bPresentFlag) + m_numLumaWPFrames++; + } + else if (slice->m_sliceType == B_SLICE) + { + bool bLuma = false; + for (int l = 0; l < 2; l++) + { + if (slice->m_weightPredTable[l][0][0].bPresentFlag) + bLuma = true; + } + if (bLuma) + m_numLumaWPBiFrames++; + } } - else if (slice->m_sliceType == B_SLICE) + else { - bool bLuma = false, bChroma = false; - for (int l = 0; l < 2; l++) + if (slice->m_sliceType == P_SLICE) { - if (slice->m_weightPredTable[l][0][0].bPresentFlag) - bLuma = true; - if (slice->m_weightPredTable[l][0][1].bPresentFlag || - slice->m_weightPredTable[l][0][2].bPresentFlag) - bChroma = true; + if (slice->m_weightPredTable[0][0][0].bPresentFlag) + m_numLumaWPFrames++; + if (slice->m_weightPredTable[0][0][1].bPresentFlag || + 
slice->m_weightPredTable[0][0][2].bPresentFlag) + m_numChromaWPFrames++; } + else if (slice->m_sliceType == B_SLICE) + { + bool bLuma = false, bChroma = false; + for (int l = 0; l < 2; l++) + { + if (slice->m_weightPredTable[l][0][0].bPresentFlag) + bLuma = true; + if (slice->m_weightPredTable[l][0][1].bPresentFlag || + slice->m_weightPredTable[l][0][2].bPresentFlag) + bChroma = true; + } - if (bLuma) - m_numLumaWPBiFrames++; - if (bChroma) - m_numChromaWPBiFrames++; + if (bLuma) + m_numLumaWPBiFrames++; + if (bChroma) + m_numChromaWPBiFrames++; + } } - if (m_aborted) return -1; - finishFrameStats(outFrame, curEncoder, curEncoder->m_accessUnitBits, frameData); + finishFrameStats(outFrame, curEncoder, frameData, m_pocLast); /* Write RateControl Frame level stats in multipass encodes */ if (m_param->rc.bStatWrite) @@ -638,10 +801,10 @@ if (frameEnc && !pass) { /* give this frame a FrameData instance before encoding */ - if (m_dpb->m_picSymFreeList) + if (m_dpb->m_frameDataFreeList) { - frameEnc->m_encData = m_dpb->m_picSymFreeList; - m_dpb->m_picSymFreeList = m_dpb->m_picSymFreeList->m_freeListNext; + frameEnc->m_encData = m_dpb->m_frameDataFreeList; + m_dpb->m_frameDataFreeList = m_dpb->m_frameDataFreeList->m_freeListNext; frameEnc->reinit(m_sps); } else @@ -652,10 +815,6 @@ slice->m_pps = &m_pps; slice->m_maxNumMergeCand = m_param->maxNumMergeCand; slice->m_endCUAddr = slice->realEndAddress(m_sps.numCUsInFrame * NUM_4x4_PARTITIONS); - frameEnc->m_reconPic->m_cuOffsetC = m_cuOffsetC; - frameEnc->m_reconPic->m_cuOffsetY = m_cuOffsetY; - frameEnc->m_reconPic->m_buOffsetC = m_buOffsetC; - frameEnc->m_reconPic->m_buOffsetY = m_buOffsetY; } curEncoder->m_rce.encodeOrder = m_encodedFrameNum++; @@ -690,13 +849,15 @@ if (m_param->rc.rateControlMode != X265_RC_CQP) m_lookahead->getEstimatedPictureCost(frameEnc); + if (m_param->bIntraRefresh) + calcRefreshInterval(frameEnc); /* Allow FrameEncoder::compressFrame() to start in the frame encoder thread */ if (!curEncoder->startCompressFrame(frameEnc)) m_aborted = true; } else if (m_encodedFrameNum) - m_rateControl->setFinalFrameCount(m_encodedFrameNum); + m_rateControl->setFinalFrameCount(m_encodedFrameNum); } while (m_bZeroLatency && ++pass < 2); @@ -708,7 +869,7 @@ encParam->maxNumReferences = param->maxNumReferences; // never uses more refs than specified in stream headers encParam->bEnableLoopFilter = param->bEnableLoopFilter; encParam->deblockingFilterTCOffset = param->deblockingFilterTCOffset; - encParam->deblockingFilterBetaOffset = param->deblockingFilterBetaOffset; + encParam->deblockingFilterBetaOffset = param->deblockingFilterBetaOffset; encParam->bEnableFastIntra = param->bEnableFastIntra; encParam->bEnableEarlySkip = param->bEnableEarlySkip; encParam->bEnableTemporalMvp = param->bEnableTemporalMvp; @@ -943,7 +1104,7 @@ (double)cuStats.countPModeMasters / cuStats.totalCTUs, (double)cuStats.pmodeBlockTime / cuStats.countPModeMasters); x265_log(m_param, X265_LOG_INFO, "CU: %.3lf slaves per PMODE master, each took average of %.3lf ms\n", - (double)cuStats.countPModeTasks / cuStats.countPModeMasters, + (double)cuStats.countPModeTasks / cuStats.countPModeMasters, ELAPSED_MSEC(cuStats.pmodeTime) / cuStats.countPModeTasks); } @@ -1050,6 +1211,15 @@ stats->statsB.psnrU = m_analyzeB.m_psnrSumU / (double)m_analyzeB.m_numPics; stats->statsB.psnrV = m_analyzeB.m_psnrSumV / (double)m_analyzeB.m_numPics; stats->statsB.ssim = x265_ssim2dB(m_analyzeB.m_globalSsim / (double)m_analyzeB.m_numPics); + + stats->maxCLL = m_analyzeAll.m_maxCLL; + stats->maxFALL = 
(uint16_t)(m_analyzeAll.m_maxFALL / m_analyzeAll.m_numPics); + + if (m_emitCLLSEI) + { + m_param->maxCLL = stats->maxCLL; + m_param->maxFALL = stats->maxFALL; + } } /* If new statistics are added to x265_stats, we must check here whether the @@ -1057,9 +1227,10 @@ * future safety) */ } -void Encoder::finishFrameStats(Frame* curFrame, FrameEncoder *curEncoder, uint64_t bits, x265_frame_stats* frameStats) +void Encoder::finishFrameStats(Frame* curFrame, FrameEncoder *curEncoder, x265_frame_stats* frameStats, int inPoc) { PicYuv* reconPic = curFrame->m_reconPic; + uint64_t bits = curEncoder->m_accessUnitBits; //===== calculate PSNR ===== int width = reconPic->m_picWidth - m_sps.conformanceWindow.rightOffset; @@ -1123,6 +1294,9 @@ m_analyzeB.addSsim(ssim); } + m_analyzeAll.m_maxFALL += curFrame->m_fencPic->m_avgLumaLevel; + m_analyzeAll.m_maxCLL = X265_MAX(m_analyzeAll.m_maxCLL, curFrame->m_fencPic->m_maxLumaLevel); + char c = (slice->isIntra() ? 'I' : slice->isInterP() ? 'P' : 'B'); int poc = slice->m_poc; if (!IS_REFERENCED(curFrame)) @@ -1130,11 +1304,15 @@ if (frameStats) { + const int picOrderCntLSB = (slice->m_poc - slice->m_lastIDR + (1 << BITS_FOR_POC)) % (1 << BITS_FOR_POC); + frameStats->encoderOrder = m_outputCount++; frameStats->sliceType = c; - frameStats->poc = poc; + frameStats->poc = picOrderCntLSB; frameStats->qp = curEncData.m_avgQpAq; frameStats->bits = bits; + frameStats->bScenecut = curFrame->m_lowres.bScenecut; + frameStats->frameLatency = inPoc - poc; if (m_param->rc.rateControlMode == X265_RC_CRF) frameStats->rateFactor = curEncData.m_rateFactor; frameStats->psnrY = psnrY; @@ -1173,8 +1351,9 @@ frameStats->avgChromaDistortion = curFrame->m_encData->m_frameStats.avgChromaDistortion; frameStats->avgLumaDistortion = curFrame->m_encData->m_frameStats.avgLumaDistortion; frameStats->avgPsyEnergy = curFrame->m_encData->m_frameStats.avgPsyEnergy; - frameStats->avgLumaLevel = curFrame->m_encData->m_frameStats.avgLumaLevel; - frameStats->maxLumaLevel = curFrame->m_encData->m_frameStats.maxLumaLevel; + frameStats->avgResEnergy = curFrame->m_encData->m_frameStats.avgResEnergy; + frameStats->avgLumaLevel = curFrame->m_fencPic->m_avgLumaLevel; + frameStats->maxLumaLevel = curFrame->m_fencPic->m_maxLumaLevel; for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) { frameStats->cuStats.percentSkipCu[depth] = curFrame->m_encData->m_frameStats.percentSkipCu[depth]; @@ -1227,18 +1406,15 @@ x265_log(m_param, X265_LOG_WARNING, "unable to parse mastering display color volume info\n"); } - if (m_param->contentLightLevelInfo) + if (m_emitCLLSEI) { SEIContentLightLevel cllsei; - if (cllsei.parse(m_param->contentLightLevelInfo)) - { - bs.resetBits(); - cllsei.write(bs, m_sps); - bs.writeByteAlignment(); - list.serialize(NAL_UNIT_PREFIX_SEI, bs); - } - else - x265_log(m_param, X265_LOG_WARNING, "unable to parse content light level info\n"); + cllsei.max_content_light_level = m_param->maxCLL; + cllsei.max_pic_average_light_level = m_param->maxFALL; + bs.resetBits(); + cllsei.write(bs, m_sps); + bs.writeByteAlignment(); + list.serialize(NAL_UNIT_PREFIX_SEI, bs); } if (m_param->bEmitInfoSEI) @@ -1425,6 +1601,7 @@ p->rc.cuTree = 0; p->bEnableWeightedPred = 0; p->bEnableWeightedBiPred = 0; + p->bIntraRefresh = 0; /* SPSs shall have sps_max_dec_pic_buffering_minus1[ sps_max_sub_layers_minus1 ] equal to 0 only */ p->maxNumReferences = 1; @@ -1515,10 +1692,38 @@ if (p->totalFrames && p->totalFrames <= 2 * ((float)p->fpsNum) / p->fpsDenom && p->rc.bStrictCbr) p->lookaheadDepth = p->totalFrames; + if 
(p->bIntraRefresh)
+    {
+        int numCuInWidth = (m_param->sourceWidth + g_maxCUSize - 1) / g_maxCUSize;
+        if (p->maxNumReferences > 1)
+        {
+            x265_log(p, X265_LOG_WARNING, "Max References > 1 + intra-refresh is not supported, setting max num references = 1\n");
+            p->maxNumReferences = 1;
+        }
+
+        if (p->bBPyramid && p->bframes)
+            x265_log(p, X265_LOG_WARNING, "B pyramid cannot be enabled when max references is 1, disabling B pyramid\n");
+        p->bBPyramid = 0;
+
+
+        if (p->bOpenGOP)
+        {
+            x265_log(p, X265_LOG_WARNING, "Open GOP disabled, intra refresh is not compatible with open GOP\n");
+            p->bOpenGOP = 0;
+        }
+
+        x265_log(p, X265_LOG_WARNING, "Scenecut is disabled when Intra Refresh is enabled\n");
+
+        if (((float)numCuInWidth - 1) / m_param->keyframeMax > 1)
+            x265_log(p, X265_LOG_WARNING, "Keyint value is very low. It leads to frequent intra refreshes, possibly on every frame. "
+                     "Preferred use case would be a high keyint value or an API call to refresh when necessary\n");
+
+    }
+
     if (p->scalingLists && p->internalCsp == X265_CSP_I444)
     {
-        x265_log(p, X265_LOG_WARNING, "Scaling lists are not yet supported for 4:4:4 color space\n");
+        x265_log(p, X265_LOG_WARNING, "Scaling lists are not yet supported for 4:4:4 chroma subsampling\n");
         p->scalingLists = 0;
     }
@@ -1536,6 +1741,17 @@
         x265_log(p, X265_LOG_WARNING, "Analysis load/save options incompatible with pmode/pme, Disabling pmode/pme\n");
         p->bDistributeMotionEstimation = p->bDistributeModeAnalysis = 0;
     }
+    if (p->analysisMode && p->rc.cuTree)
+    {
+        x265_log(p, X265_LOG_WARNING, "Analysis load/save options work only with cu-tree off, disabling cu-tree\n");
+        p->rc.cuTree = 0;
+    }
+
+    if (p->bDistributeModeAnalysis && (p->limitReferences >> 1) && 1)
+    {
+        x265_log(p, X265_LOG_WARNING, "Limit reference options 2 and 3 are not supported with pmode. Disabling limit reference\n");
+        p->limitReferences = 0;
+    }
 
     if (p->bEnableTemporalSubLayers && !p->bframes)
     {
@@ -1641,6 +1857,7 @@
 
 void Encoder::allocAnalysis(x265_analysis_data* analysis)
 {
+    X265_CHECK(analysis->sliceType, "invalid slice type\n");
     analysis->interData = analysis->intraData = NULL;
     if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I)
     {
@@ -1654,12 +1871,14 @@
     }
     else
     {
+        int numDir = analysis->sliceType == X265_TYPE_P ? 
1 : 2; analysis_inter_data *interData = (analysis_inter_data*)analysis->interData; CHECKED_MALLOC_ZERO(interData, analysis_inter_data, 1); - CHECKED_MALLOC_ZERO(interData->ref, int32_t, analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * 2); + CHECKED_MALLOC_ZERO(interData->ref, int32_t, analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir); CHECKED_MALLOC(interData->depth, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); CHECKED_MALLOC(interData->modes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); CHECKED_MALLOC_ZERO(interData->bestMergeCand, uint32_t, analysis->numCUsInFrame * CUGeom::MAX_GEOMS); + CHECKED_MALLOC_ZERO(interData->mv, MV, analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir); analysis->interData = interData; } return; @@ -1685,6 +1904,7 @@ X265_FREE(((analysis_inter_data*)analysis->interData)->depth); X265_FREE(((analysis_inter_data*)analysis->interData)->modes); X265_FREE(((analysis_inter_data*)analysis->interData)->bestMergeCand); + X265_FREE(((analysis_inter_data*)analysis->interData)->mv); X265_FREE(analysis->interData); } } @@ -1731,6 +1951,8 @@ analysis->poc = poc; analysis->frameRecordSize = frameRecordSize; X265_FREAD(&analysis->sliceType, sizeof(int), 1, m_analysisFile); + X265_FREAD(&analysis->bScenecut, sizeof(int), 1, m_analysisFile); + X265_FREAD(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFile); X265_FREAD(&analysis->numCUsInFrame, sizeof(int), 1, m_analysisFile); X265_FREAD(&analysis->numPartitions, sizeof(int), 1, m_analysisFile); @@ -1752,6 +1974,7 @@ X265_FREAD(((analysis_inter_data *)analysis->interData)->depth, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); X265_FREAD(((analysis_inter_data *)analysis->interData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); X265_FREAD(((analysis_inter_data *)analysis->interData)->bestMergeCand, sizeof(uint32_t), analysis->numCUsInFrame * CUGeom::MAX_GEOMS, m_analysisFile); + X265_FREAD(((analysis_inter_data *)analysis->interData)->mv, sizeof(MV), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU, m_analysisFile); consumedBytes += frameRecordSize; totalConsumedBytes = consumedBytes; } @@ -1761,6 +1984,7 @@ X265_FREAD(((analysis_inter_data *)analysis->interData)->depth, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); X265_FREAD(((analysis_inter_data *)analysis->interData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); X265_FREAD(((analysis_inter_data *)analysis->interData)->bestMergeCand, sizeof(uint32_t), analysis->numCUsInFrame * CUGeom::MAX_GEOMS, m_analysisFile); + X265_FREAD(((analysis_inter_data *)analysis->interData)->mv, sizeof(MV), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * 2, m_analysisFile); consumedBytes += frameRecordSize; } #undef X265_FREAD @@ -1780,7 +2004,7 @@ /* calculate frameRecordSize */ analysis->frameRecordSize = sizeof(analysis->frameRecordSize) + sizeof(analysis->poc) + sizeof(analysis->sliceType) + - sizeof(analysis->numCUsInFrame) + sizeof(analysis->numPartitions); + sizeof(analysis->numCUsInFrame) + sizeof(analysis->numPartitions) + sizeof(analysis->bScenecut) + sizeof(analysis->satdCost); if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) analysis->frameRecordSize += sizeof(uint8_t) * analysis->numCUsInFrame * analysis->numPartitions * 4; else if (analysis->sliceType == X265_TYPE_P) @@ -1788,17 +2012,20 @@ 
analysis->frameRecordSize += sizeof(int32_t) * analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU; analysis->frameRecordSize += sizeof(uint8_t) * analysis->numCUsInFrame * analysis->numPartitions * 2; analysis->frameRecordSize += sizeof(uint32_t) * analysis->numCUsInFrame * CUGeom::MAX_GEOMS; + analysis->frameRecordSize += sizeof(MV) * analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU; } else { analysis->frameRecordSize += sizeof(int32_t) * analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * 2; analysis->frameRecordSize += sizeof(uint8_t) * analysis->numCUsInFrame * analysis->numPartitions * 2; analysis->frameRecordSize += sizeof(uint32_t) * analysis->numCUsInFrame * CUGeom::MAX_GEOMS; + analysis->frameRecordSize += sizeof(MV) * analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * 2; } - X265_FWRITE(&analysis->frameRecordSize, sizeof(uint32_t), 1, m_analysisFile); X265_FWRITE(&analysis->poc, sizeof(int), 1, m_analysisFile); X265_FWRITE(&analysis->sliceType, sizeof(int), 1, m_analysisFile); + X265_FWRITE(&analysis->bScenecut, sizeof(int), 1, m_analysisFile); + X265_FWRITE(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFile); X265_FWRITE(&analysis->numCUsInFrame, sizeof(int), 1, m_analysisFile); X265_FWRITE(&analysis->numPartitions, sizeof(int), 1, m_analysisFile); @@ -1815,6 +2042,7 @@ X265_FWRITE(((analysis_inter_data*)analysis->interData)->depth, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); X265_FWRITE(((analysis_inter_data*)analysis->interData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); X265_FWRITE(((analysis_inter_data*)analysis->interData)->bestMergeCand, sizeof(uint32_t), analysis->numCUsInFrame * CUGeom::MAX_GEOMS, m_analysisFile); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->mv, sizeof(MV), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU, m_analysisFile); } else { @@ -1822,6 +2050,7 @@ X265_FWRITE(((analysis_inter_data*)analysis->interData)->depth, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); X265_FWRITE(((analysis_inter_data*)analysis->interData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFile); X265_FWRITE(((analysis_inter_data*)analysis->interData)->bestMergeCand, sizeof(uint32_t), analysis->numCUsInFrame * CUGeom::MAX_GEOMS, m_analysisFile); + X265_FWRITE(((analysis_inter_data*)analysis->interData)->mv, sizeof(MV), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * 2, m_analysisFile); } #undef X265_FWRITE }
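For reference, a sketch of the per-frame record implied by the X265_FREAD/X265_FWRITE pairs above. The struct below is illustrative only (x265 serializes each field individually; this type does not exist in the source):

    #include <cstdint>

    struct AnalysisFrameRecord          // hypothetical, for illustration
    {
        uint32_t frameRecordSize;       // total bytes in this record
        int32_t  poc;
        int32_t  sliceType;
        int32_t  bScenecut;             // new in 1.9
        int64_t  satdCost;              // new in 1.9
        int32_t  numCUsInFrame;
        int32_t  numPartitions;
        // payload: four per-partition byte arrays for I/IDR frames, otherwise
        // inter arrays (ref, depth, modes, bestMergeCand) plus the new motion
        // vectors: numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU MVs for P frames
        // and twice that for B frames
    };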
View file
x265_1.8.tar.gz/source/encoder/encoder.h -> x265_1.9.tar.gz/source/encoder/encoder.h
Changed
@@ -45,8 +45,10 @@ double m_psnrSumV; double m_globalSsim; double m_totalQp; + double m_maxFALL; uint64_t m_accBits; uint32_t m_numPics; + uint16_t m_maxCLL; EncStats() { @@ -54,6 +56,8 @@ m_accBits = 0; m_numPics = 0; m_totalQp = 0; + m_maxCLL = 0; + m_maxFALL = 0; } void addQP(double aveQp); @@ -75,64 +79,62 @@ { public: - int m_pocLast; // time index (POC) - int m_encodedFrameNum; - int m_outputCount; + uint32_t m_residualSumEmergency[MAX_NUM_TR_CATEGORIES][MAX_NUM_TR_COEFFS]; + uint32_t m_countEmergency[MAX_NUM_TR_CATEGORIES]; + uint16_t (*m_offsetEmergency)[MAX_NUM_TR_CATEGORIES][MAX_NUM_TR_COEFFS]; - int m_bframeDelay; int64_t m_firstPts; int64_t m_bframeDelayTime; int64_t m_prevReorderedPts[2]; + int64_t m_encodeStartTime; - ThreadPool* m_threadPool; - FrameEncoder* m_frameEncoder[X265_MAX_FRAME_THREADS]; - DPB* m_dpb; - - Frame* m_exportedPic; - + int m_pocLast; // time index (POC) + int m_encodedFrameNum; + int m_outputCount; + int m_bframeDelay; int m_numPools; int m_curEncoder; - /* cached PicYuv offset arrays, shared by all instances of - * PicYuv created by this encoder */ - intptr_t* m_cuOffsetY; - intptr_t* m_cuOffsetC; - intptr_t* m_buOffsetY; - intptr_t* m_buOffsetC; - - /* Collect statistics globally */ - EncStats m_analyzeAll; - EncStats m_analyzeI; - EncStats m_analyzeP; - EncStats m_analyzeB; - int64_t m_encodeStartTime; - // weighted prediction int m_numLumaWPFrames; // number of P frames with weighted luma reference int m_numChromaWPFrames; // number of P frames with weighted chroma reference int m_numLumaWPBiFrames; // number of B frames with weighted luma reference int m_numChromaWPBiFrames; // number of B frames with weighted chroma reference - FILE* m_analysisFile; int m_conformanceMode; - VPS m_vps; - SPS m_sps; - PPS m_pps; - NALList m_nalList; - ScalingList m_scalingList; // quantization matrix information - int m_lastBPSEI; uint32_t m_numDelayedPic; + ThreadPool* m_threadPool; + FrameEncoder* m_frameEncoder[X265_MAX_FRAME_THREADS]; + DPB* m_dpb; + Frame* m_exportedPic; + FILE* m_analysisFile; x265_param* m_param; x265_param* m_latestParam; RateControl* m_rateControl; Lookahead* m_lookahead; + + /* Collect statistics globally */ + EncStats m_analyzeAll; + EncStats m_analyzeI; + EncStats m_analyzeP; + EncStats m_analyzeB; + VPS m_vps; + SPS m_sps; + PPS m_pps; + NALList m_nalList; + ScalingList m_scalingList; // quantization matrix information Window m_conformanceWindow; + bool m_emitCLLSEI; bool m_bZeroLatency; // x265_encoder_encode() returns NALs for the input picture, zero lag bool m_aborted; // fatal error detected bool m_reconfigured; // reconfigure of encoder detected + /* Begin intra refresh when one not in progress or else begin one as soon as the current + * one is done. Requires bIntraRefresh to be set.*/ + int m_bQueuedIntraRefresh; + Encoder(); ~Encoder() {} @@ -164,7 +166,9 @@ void writeAnalysisFile(x265_analysis_data* pic); - void finishFrameStats(Frame* pic, FrameEncoder *curEncoder, uint64_t bits, x265_frame_stats* frameStats); + void finishFrameStats(Frame* pic, FrameEncoder *curEncoder, x265_frame_stats* frameStats, int inPoc); + + void calcRefreshInterval(Frame* frameEnc); protected:
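Distilled from the encoder.cpp hunks earlier in this revision, the new m_maxCLL/m_maxFALL members accumulate like this (lines reproduced from finishFrameStats() and fetchStats() above):

    // per encoded frame:
    m_analyzeAll.m_maxFALL += curFrame->m_fencPic->m_avgLumaLevel;
    m_analyzeAll.m_maxCLL = X265_MAX(m_analyzeAll.m_maxCLL, curFrame->m_fencPic->m_maxLumaLevel);

    // when statistics are fetched: maxCLL is the running peak luma,
    // maxFALL the mean of the per-frame average luma levels
    stats->maxCLL = m_analyzeAll.m_maxCLL;
    stats->maxFALL = (uint16_t)(m_analyzeAll.m_maxFALL / m_analyzeAll.m_numPics);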
View file
x265_1.8.tar.gz/source/encoder/entropy.cpp -> x265_1.9.tar.gz/source/encoder/entropy.cpp
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> +* Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -429,7 +430,8 @@ if (slice.m_sps->bUseSAO) { WRITE_FLAG(saoParam->bSaoFlag[0], "slice_sao_luma_flag"); - WRITE_FLAG(saoParam->bSaoFlag[1], "slice_sao_chroma_flag"); + if (encData.m_param->internalCsp != X265_CSP_I400) + WRITE_FLAG(saoParam->bSaoFlag[1], "slice_sao_chroma_flag"); } // check if numRefIdx match the defaults (1, hard-coded in PPS). If not, override @@ -828,6 +830,79 @@ } } +void Entropy::encodeTransformLuma(const CUData& cu, uint32_t absPartIdx, uint32_t curDepth, uint32_t log2CurSize, + bool& bCodeDQP, const uint32_t depthRange[2]) +{ + const bool subdiv = cu.m_tuDepth[absPartIdx] > curDepth; + + /* in each of these conditions, the subdiv flag is implied and not signaled, + * so we have checks to make sure the implied value matches our intentions */ + if (cu.isIntra(absPartIdx) && cu.m_partSize[absPartIdx] != SIZE_2Nx2N && log2CurSize == MIN_LOG2_CU_SIZE) + { + X265_CHECK(subdiv, "intra NxN requires TU depth below CU depth\n"); + } + else if (cu.isInter(absPartIdx) && cu.m_partSize[absPartIdx] != SIZE_2Nx2N && + !curDepth && cu.m_slice->m_sps->quadtreeTUMaxDepthInter == 1) + { + X265_CHECK(subdiv, "inter TU must be smaller than CU when not 2Nx2N part size: log2CurSize %d, depthRange[0] %d\n", log2CurSize, depthRange[0]); + } + else if (log2CurSize > depthRange[1]) + { + X265_CHECK(subdiv, "TU is larger than the max allowed, it should have been split\n"); + } + else if (log2CurSize == cu.m_slice->m_sps->quadtreeTULog2MinSize || log2CurSize == depthRange[0]) + { + X265_CHECK(!subdiv, "min sized TU cannot be subdivided\n"); + } + else + { + X265_CHECK(log2CurSize > depthRange[0], "transform size failure\n"); + codeTransformSubdivFlag(subdiv, 5 - log2CurSize); + } + + if (subdiv) + { + --log2CurSize; + ++curDepth; + + uint32_t qNumParts = 1 << (log2CurSize - LOG2_UNIT_SIZE) * 2; + + encodeTransformLuma(cu, absPartIdx + 0 * qNumParts, curDepth, log2CurSize, bCodeDQP, depthRange); + encodeTransformLuma(cu, absPartIdx + 1 * qNumParts, curDepth, log2CurSize, bCodeDQP, depthRange); + encodeTransformLuma(cu, absPartIdx + 2 * qNumParts, curDepth, log2CurSize, bCodeDQP, depthRange); + encodeTransformLuma(cu, absPartIdx + 3 * qNumParts, curDepth, log2CurSize, bCodeDQP, depthRange); + return; + } + + if (!cu.isIntra(absPartIdx) && !curDepth) + { + X265_CHECK(cu.getCbf(absPartIdx, TEXT_LUMA, 0), "CBF should have been set\n"); + } + else + codeQtCbfLuma(cu, absPartIdx, curDepth); + + uint32_t cbfY = cu.getCbf(absPartIdx, TEXT_LUMA, curDepth); + + if (!cbfY) + return; + + // dQP: only for CTU once + if (cu.m_slice->m_pps->bUseDQP && bCodeDQP) + { + uint32_t log2CUSize = cu.m_log2CUSize[absPartIdx]; + uint32_t absPartIdxLT = absPartIdx & (0xFF << (log2CUSize - LOG2_UNIT_SIZE) * 2); + codeDeltaQP(cu, absPartIdxLT); + bCodeDQP = false; + } + + if (cbfY) + { + uint32_t coeffOffset = absPartIdx << (LOG2_UNIT_SIZE * 2); + codeCoeffNxN(cu, cu.m_trCoeff[0] + coeffOffset, absPartIdx, log2CurSize, TEXT_LUMA); + } +} + + void Entropy::codePredInfo(const CUData& cu, uint32_t absPartIdx) { if (cu.isIntra(absPartIdx)) // If it is intra mode, encode intra prediction mode. 
@@ -908,7 +983,10 @@ } uint32_t log2CUSize = cu.m_log2CUSize[absPartIdx]; - encodeTransform(cu, absPartIdx, 0, log2CUSize, bCodeDQP, depthRange); + if (cu.m_chromaFormat == X265_CSP_I400) + encodeTransformLuma(cu, absPartIdx, 0, log2CUSize, bCodeDQP, depthRange); + else + encodeTransform(cu, absPartIdx, 0, log2CUSize, bCodeDQP, depthRange); } void Entropy::codeSaoOffset(const SaoCtuParam& ctuParam, int plane) @@ -1010,7 +1088,7 @@ void Entropy::codePredWeightTable(const Slice& slice) { const WeightParam *wp; - bool bChroma = true; // 4:0:0 not yet supported + bool bChroma = slice.m_sps->chromaFormatIdc != X265_CSP_I400; bool bDenomCoded = false; int numRefDirs = slice.m_sliceType == B_SLICE ? 2 : 1; uint32_t totalSignalledWeightFlags = 0; @@ -1565,11 +1643,16 @@ uint8_t * const baseCtx = bIsLuma ? &m_contextState[OFF_SIG_FLAG_CTX] : &m_contextState[OFF_SIG_FLAG_CTX + NUM_SIG_FLAG_CTX_LUMA]; uint32_t c1 = 1; int scanPosSigOff = scanPosLast - (lastScanSet << MLS_CG_SIZE) - 1; - ALIGN_VAR_32(uint16_t, absCoeff[(1 << MLS_CG_SIZE)]); + ALIGN_VAR_32(uint16_t, absCoeff[(1 << MLS_CG_SIZE) + 1]); // extra 2 bytes(+1) space for AVX2 assembly, +1 because (numNonZero<=1) in costCoeffNxN path uint32_t numNonZero = 1; unsigned long lastNZPosInCG; unsigned long firstNZPosInCG; +#if _DEBUG + // Unnecessary, for Valgrind-3.10.0 only + memset(absCoeff, 0, sizeof(absCoeff)); +#endif + absCoeff[0] = (uint16_t)abs(coeff[posLast]); for (int subSet = lastScanSet; subSet >= 0; subSet--) @@ -1715,6 +1798,7 @@ { // maximum g_entropyBits are 18-bits and maximum of count are 16, so intermedia of sum are 22-bits const uint8_t *tabSigCtx = table_cnt[(log2TrSize == 2) ? 4 : (uint32_t)patternSigCtx]; + X265_CHECK(numNonZero <= 1, "numNonZero check failure"); uint32_t sum = primitives.costCoeffNxN(g_scan4x4[codingParameters.scanType], &coeff[blkPosBase], (intptr_t)trSize, absCoeff + numNonZero, tabSigCtx, scanFlagMask, baseCtx, offset + posOffset, scanPosSigOff, subPosBase); #if CHECKED_BUILD || _DEBUG @@ -1919,43 +2003,78 @@ numCtx = bIsLuma ? 12 : 3; } - if (bIsLuma) - { - for (uint32_t bin = 0; bin < 2; bin++) - estBitsSbac.significantBits[bin][0] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX], bin); + const int ctxSigOffset = OFF_SIG_FLAG_CTX + (bIsLuma ? 
0 : NUM_SIG_FLAG_CTX_LUMA); + + estBitsSbac.significantBits[0][0] = sbacGetEntropyBits(m_contextState[ctxSigOffset], 0); + estBitsSbac.significantBits[1][0] = sbacGetEntropyBits(m_contextState[ctxSigOffset], 1); - for (int ctxIdx = firstCtx; ctxIdx < firstCtx + numCtx; ctxIdx++) - for (uint32_t bin = 0; bin < 2; bin++) - estBitsSbac.significantBits[bin][ctxIdx] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + ctxIdx], bin); + for (int ctxIdx = firstCtx; ctxIdx < firstCtx + numCtx; ctxIdx++) + { + estBitsSbac.significantBits[0][ctxIdx] = sbacGetEntropyBits(m_contextState[ctxSigOffset + ctxIdx], 0); + estBitsSbac.significantBits[1][ctxIdx] = sbacGetEntropyBits(m_contextState[ctxSigOffset + ctxIdx], 1); } - else + + const uint32_t maxGroupIdx = log2TrSize * 2 - 1; + if (bIsLuma) { - for (uint32_t bin = 0; bin < 2; bin++) - estBitsSbac.significantBits[bin][0] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + 0)], bin); + if (log2TrSize == 2) + { + for (int i = 0, ctxIdx = 0; i < 2; i++, ctxIdx += NUM_CTX_LAST_FLAG_XY) + { + int bits = 0; + const uint8_t *ctxState = &m_contextState[OFF_CTX_LAST_FLAG_X + ctxIdx]; - for (int ctxIdx = firstCtx; ctxIdx < firstCtx + numCtx; ctxIdx++) - for (uint32_t bin = 0; bin < 2; bin++) - estBitsSbac.significantBits[bin][ctxIdx] = sbacGetEntropyBits(m_contextState[OFF_SIG_FLAG_CTX + (NUM_SIG_FLAG_CTX_LUMA + ctxIdx)], bin); - } + for (uint32_t ctx = 0; ctx < 3; ctx++) + { + estBitsSbac.lastBits[i][ctx] = bits + sbacGetEntropyBits(ctxState[ctx], 0); + bits += sbacGetEntropyBits(ctxState[ctx], 1); + } - int blkSizeOffset = bIsLuma ? ((log2TrSize - 2) * 3 + ((log2TrSize - 1) >> 2)) : NUM_CTX_LAST_FLAG_XY_LUMA; - int ctxShift = bIsLuma ? ((log2TrSize + 1) >> 2) : log2TrSize - 2; - uint32_t maxGroupIdx = log2TrSize * 2 - 1; + estBitsSbac.lastBits[i][maxGroupIdx] = bits; + } + } + else + { + const int blkSizeOffset = ((log2TrSize - 2) * 3 + (log2TrSize == 5)); - uint32_t ctx; - for (int i = 0, ctxIdx = 0; i < 2; i++, ctxIdx += NUM_CTX_LAST_FLAG_XY) + for (int i = 0, ctxIdx = 0; i < 2; i++, ctxIdx += NUM_CTX_LAST_FLAG_XY) + { + int bits = 0; + const uint8_t *ctxState = &m_contextState[OFF_CTX_LAST_FLAG_X + ctxIdx]; + X265_CHECK(maxGroupIdx & 1, "maxGroupIdx check failure\n"); + + for (uint32_t ctx = 0; ctx < (maxGroupIdx >> 1) + 1; ctx++) + { + const int cost0 = sbacGetEntropyBits(ctxState[blkSizeOffset + ctx], 0); + const int cost1 = sbacGetEntropyBits(ctxState[blkSizeOffset + ctx], 1); + estBitsSbac.lastBits[i][ctx * 2 + 0] = bits + cost0; + estBitsSbac.lastBits[i][ctx * 2 + 1] = bits + cost1 + cost0; + bits += 2 * cost1; + } + // correct latest bit cost, it didn't include cost0 + estBitsSbac.lastBits[i][maxGroupIdx] -= sbacGetEntropyBits(ctxState[blkSizeOffset + (maxGroupIdx >> 1)], 0); + } + } + } + else { - int bits = 0; - const uint8_t *ctxState = &m_contextState[OFF_CTX_LAST_FLAG_X + ctxIdx]; + const int blkSizeOffset = NUM_CTX_LAST_FLAG_XY_LUMA; + const int ctxShift = log2TrSize - 2; - for (ctx = 0; ctx < maxGroupIdx; ctx++) + for (int i = 0, ctxIdx = 0; i < 2; i++, ctxIdx += NUM_CTX_LAST_FLAG_XY) { - int ctxOffset = blkSizeOffset + (ctx >> ctxShift); - estBitsSbac.lastBits[i][ctx] = bits + sbacGetEntropyBits(ctxState[ctxOffset], 0); - bits += sbacGetEntropyBits(ctxState[ctxOffset], 1); - } + int bits = 0; + const uint8_t *ctxState = &m_contextState[OFF_CTX_LAST_FLAG_X + ctxIdx]; + + for (uint32_t ctx = 0; ctx < maxGroupIdx; ctx++) + { + int ctxOffset = blkSizeOffset + (ctx >> ctxShift); + estBitsSbac.lastBits[i][ctx] = bits + 
sbacGetEntropyBits(ctxState[ctxOffset], 0); + bits += sbacGetEntropyBits(ctxState[ctxOffset], 1); + } - estBitsSbac.lastBits[i][ctx] = bits; + estBitsSbac.lastBits[i][maxGroupIdx] = bits; + } } }
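Both branches of the restructured lastBits code above fill the same truncated-unary cost table; stripped of the context mapping, the recurrence is the one below (costBit() and ctxFor() are illustrative stand-ins for sbacGetEntropyBits() and the blkSizeOffset/ctxShift indexing):

    int bits = 0;
    for (uint32_t g = 0; g < maxGroupIdx; g++)
    {
        lastBits[g] = bits + costBit(ctxFor(g), 0);   // a stop bin ends the prefix at g
        bits += costBit(ctxFor(g), 1);                // a continue bin extends it past g
    }
    lastBits[maxGroupIdx] = bits;                     // the final index needs no stop bin

The luma large-TU branch unrolls this two indices at a time, since adjacent group indices share one context there, and then subtracts the stop-bin cost that the pairwise fill over-counts for the final index.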
View file
x265_1.8.tar.gz/source/encoder/entropy.h -> x265_1.9.tar.gz/source/encoder/entropy.h
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> +* Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -246,6 +247,8 @@ void encodeTransform(const CUData& cu, uint32_t absPartIdx, uint32_t tuDepth, uint32_t log2TrSize, bool& bCodeDQP, const uint32_t depthRange[2]); + void encodeTransformLuma(const CUData& cu, uint32_t absPartIdx, uint32_t tuDepth, uint32_t log2TrSize, + bool& bCodeDQP, const uint32_t depthRange[2]); void copyFrom(const Entropy& src); void copyContextsFrom(const Entropy& src);
View file
x265_1.8.tar.gz/source/encoder/frameencoder.cpp -> x265_1.9.tar.gz/source/encoder/frameencoder.cpp
Changed
@@ -104,7 +104,8 @@ m_param = top->m_param; m_numRows = numRows; m_numCols = numCols; - m_filterRowDelay = (m_param->bEnableSAO && m_param->bSaoNonDeblocked) ? + m_filterRowDelay = ((m_param->bEnableSAO && m_param->bSaoNonDeblocked) + || (!m_param->bEnableLoopFilter && m_param->bEnableSAO)) ? 2 : (m_param->bEnableSAO || m_param->bEnableLoopFilter ? 1 : 0); m_filterRowDelayCus = m_filterRowDelay * numCols; m_rows = new CTURow[m_numRows]; @@ -124,7 +125,7 @@ m_pool = NULL; } - m_frameFilter.init(top, this, numRows); + m_frameFilter.init(top, this, numRows, numCols); // initialize HRD parameters of SPS if (m_param->bEmitHRDSEI || !!m_param->interlaceMode) @@ -135,7 +136,7 @@ ok &= m_rce.picTimingSEI && m_rce.hrdTiming; } - if (m_param->noiseReductionIntra || m_param->noiseReductionInter) + if (m_param->noiseReductionIntra || m_param->noiseReductionInter || m_param->rc.vbvBufferSize) m_nr = X265_MALLOC(NoiseReduction, 1); if (m_nr) memset(m_nr, 0, sizeof(NoiseReduction)); @@ -275,7 +276,7 @@ m_localTldIdx = 0; } - m_done.trigger(); /* signal that thread is initialized */ + m_done.trigger(); /* signal that thread is initialized */ m_enable.wait(); /* Encoder::encode() triggers this event */ while (m_threadActive) @@ -357,15 +358,52 @@ WeightParam *w = NULL; if ((bUseWeightP || bUseWeightB) && slice->m_weightPredTable[l][ref][0].bPresentFlag) w = slice->m_weightPredTable[l][ref]; - m_mref[l][ref].init(slice->m_refPicList[l][ref]->m_reconPic, w, *m_param); + slice->m_refReconPicList[l][ref] = slice->m_refFrameList[l][ref]->m_reconPic; + m_mref[l][ref].init(slice->m_refReconPicList[l][ref], w, *m_param); } } + int numTLD; + if (m_pool) + numTLD = m_param->bEnableWavefront ? m_pool->m_numWorkers : m_pool->m_numWorkers + m_pool->m_numProviders; + else + numTLD = 1; + /* Get the QP for this frame from rate control. This call may block until * frames ahead of it in encode order have called rateControlEnd() */ int qp = m_top->m_rateControl->rateControlStart(m_frame, &m_rce, m_top); m_rce.newQp = qp; + if (m_nr) + { + if (qp > QP_MAX_SPEC && m_frame->m_param->rc.vbvBufferSize) + { + for (int i = 0; i < numTLD; i++) + { + m_tld[i].analysis.m_quant.m_frameNr[m_jpId].offset = m_top->m_offsetEmergency[qp - QP_MAX_SPEC - 1]; + m_tld[i].analysis.m_quant.m_frameNr[m_jpId].residualSum = m_top->m_residualSumEmergency; + m_tld[i].analysis.m_quant.m_frameNr[m_jpId].count = m_top->m_countEmergency; + } + } + else + { + if (m_param->noiseReductionIntra || m_param->noiseReductionInter) + { + for (int i = 0; i < numTLD; i++) + { + m_tld[i].analysis.m_quant.m_frameNr[m_jpId].offset = m_tld[i].analysis.m_quant.m_frameNr[m_jpId].nrOffsetDenoise; + m_tld[i].analysis.m_quant.m_frameNr[m_jpId].residualSum = m_tld[i].analysis.m_quant.m_frameNr[m_jpId].nrResidualSum; + m_tld[i].analysis.m_quant.m_frameNr[m_jpId].count = m_tld[i].analysis.m_quant.m_frameNr[m_jpId].nrCount; + } + } + else + { + for (int i = 0; i < numTLD; i++) + m_tld[i].analysis.m_quant.m_frameNr[m_jpId].offset = NULL; + } + } + } + /* Clip slice QP to 0-51 spec range before encoding */ slice->m_sliceQp = x265_clip3(-QP_BD_OFFSET, QP_MAX_SPEC, qp); @@ -458,7 +496,7 @@ /* CQP and CRF (without capped VBV) doesn't use mid-frame statistics to * tune RateControl parameters for other frames. * Hence, for these modes, update m_startEndOrder and unlock RC for previous threads waiting in - * RateControlEnd here, after the slicecontexts are initialized. For the rest - ABR + * RateControlEnd here, after the slice contexts are initialized. 
For the rest - ABR * and VBV, unlock only after rateControlUpdateStats of this frame is called */ if (m_param->rc.rateControlMode != X265_RC_ABR && !m_top->m_rateControl->m_isVbv) { @@ -482,7 +520,7 @@ { for (int ref = 0; ref < slice->m_numRefIdx[l]; ref++) { - Frame *refpic = slice->m_refPicList[l][ref]; + Frame *refpic = slice->m_refFrameList[l][ref]; uint32_t reconRowCount = refpic->m_reconRowCount.get(); while ((reconRowCount != m_numRows) && (reconRowCount < row + m_refLagRows)) @@ -521,7 +559,7 @@ int list = l; for (int ref = 0; ref < slice->m_numRefIdx[list]; ref++) { - Frame *refpic = slice->m_refPicList[list][ref]; + Frame *refpic = slice->m_refFrameList[list][ref]; uint32_t reconRowCount = refpic->m_reconRowCount.get(); while ((reconRowCount != m_numRows) && (reconRowCount < i + m_refLagRows)) @@ -572,10 +610,7 @@ m_frame->m_encData->m_frameStats.lumaDistortion += m_rows[i].rowStats.lumaDistortion; m_frame->m_encData->m_frameStats.chromaDistortion += m_rows[i].rowStats.chromaDistortion; m_frame->m_encData->m_frameStats.psyEnergy += m_rows[i].rowStats.psyEnergy; - m_frame->m_encData->m_frameStats.lumaLevel += m_rows[i].rowStats.lumaLevel; - - if (m_rows[i].rowStats.maxLumaLevel > m_frame->m_encData->m_frameStats.maxLumaLevel) - m_frame->m_encData->m_frameStats.maxLumaLevel = m_rows[i].rowStats.maxLumaLevel; + m_frame->m_encData->m_frameStats.resEnergy += m_rows[i].rowStats.resEnergy; for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) { m_frame->m_encData->m_frameStats.cntSkipCu[depth] += m_rows[i].rowStats.cntSkipCu[depth]; @@ -589,7 +624,7 @@ m_frame->m_encData->m_frameStats.avgLumaDistortion = (double)(m_frame->m_encData->m_frameStats.lumaDistortion) / m_frame->m_encData->m_frameStats.totalCtu; m_frame->m_encData->m_frameStats.avgChromaDistortion = (double)(m_frame->m_encData->m_frameStats.chromaDistortion) / m_frame->m_encData->m_frameStats.totalCtu; m_frame->m_encData->m_frameStats.avgPsyEnergy = (double)(m_frame->m_encData->m_frameStats.psyEnergy) / m_frame->m_encData->m_frameStats.totalCtu; - m_frame->m_encData->m_frameStats.avgLumaLevel = m_frame->m_encData->m_frameStats.lumaLevel / m_frame->m_encData->m_frameStats.totalCtu; + m_frame->m_encData->m_frameStats.avgResEnergy = (double)(m_frame->m_encData->m_frameStats.resEnergy) / m_frame->m_encData->m_frameStats.totalCtu; m_frame->m_encData->m_frameStats.percentIntraNxN = (double)(m_frame->m_encData->m_frameStats.cntIntraNxN * 100) / m_frame->m_encData->m_frameStats.totalCu; for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) { @@ -626,22 +661,23 @@ if (m_param->decodedPictureHashSEI) { + int planes = (m_frame->m_param->internalCsp != X265_CSP_I400) ? 
3 : 1; if (m_param->decodedPictureHashSEI == 1) { m_seiReconPictureDigest.m_method = SEIDecodedPictureHash::MD5; - for (int i = 0; i < 3; i++) + for (int i = 0; i < planes; i++) MD5Final(&m_state[i], m_seiReconPictureDigest.m_digest[i]); } else if (m_param->decodedPictureHashSEI == 2) { m_seiReconPictureDigest.m_method = SEIDecodedPictureHash::CRC; - for (int i = 0; i < 3; i++) + for (int i = 0; i < planes; i++) crcFinish(m_crc[i], m_seiReconPictureDigest.m_digest[i]); } else if (m_param->decodedPictureHashSEI == 3) { m_seiReconPictureDigest.m_method = SEIDecodedPictureHash::CHECKSUM; - for (int i = 0; i < 3; i++) + for (int i = 0; i < planes; i++) checksumFinish(m_checksum[i], m_seiReconPictureDigest.m_digest[i]); } @@ -678,41 +714,40 @@ { for (int ref = 0; ref < slice->m_numRefIdx[l]; ref++) { - Frame *refpic = slice->m_refPicList[l][ref]; + Frame *refpic = slice->m_refFrameList[l][ref]; ATOMIC_DEC(&refpic->m_countRefEncoders); } } - int numTLD; - if (m_pool) - numTLD = m_param->bEnableWavefront ? m_pool->m_numWorkers : m_pool->m_numWorkers + m_pool->m_numProviders; - else - numTLD = 1; - if (m_nr) { - /* Accumulate NR statistics from all worker threads */ - for (int i = 0; i < numTLD; i++) + bool nrEnabled = (m_rce.newQp < QP_MAX_SPEC || !m_param->rc.vbvBufferSize) && (m_param->noiseReductionIntra || m_param->noiseReductionInter); + + if (nrEnabled) { - NoiseReduction* nr = &m_tld[i].analysis.m_quant.m_frameNr[m_jpId]; - for (int cat = 0; cat < MAX_NUM_TR_CATEGORIES; cat++) + /* Accumulate NR statistics from all worker threads */ + for (int i = 0; i < numTLD; i++) { - for (int coeff = 0; coeff < MAX_NUM_TR_COEFFS; coeff++) - m_nr->residualSum[cat][coeff] += nr->residualSum[cat][coeff]; - - m_nr->count[cat] += nr->count[cat]; + NoiseReduction* nr = &m_tld[i].analysis.m_quant.m_frameNr[m_jpId]; + for (int cat = 0; cat < MAX_NUM_TR_CATEGORIES; cat++) + { + for (int coeff = 0; coeff < MAX_NUM_TR_COEFFS; coeff++) + m_nr->nrResidualSum[cat][coeff] += nr->nrResidualSum[cat][coeff]; + + m_nr->nrCount[cat] += nr->nrCount[cat]; + } } - } - noiseReductionUpdate(); + noiseReductionUpdate(); - /* Copy updated NR coefficients back to all worker threads */ - for (int i = 0; i < numTLD; i++) - { - NoiseReduction* nr = &m_tld[i].analysis.m_quant.m_frameNr[m_jpId]; - memcpy(nr->offsetDenoise, m_nr->offsetDenoise, sizeof(uint16_t) * MAX_NUM_TR_CATEGORIES * MAX_NUM_TR_COEFFS); - memset(nr->count, 0, sizeof(uint32_t) * MAX_NUM_TR_CATEGORIES); - memset(nr->residualSum, 0, sizeof(uint32_t) * MAX_NUM_TR_CATEGORIES * MAX_NUM_TR_COEFFS); + /* Copy updated NR coefficients back to all worker threads */ + for (int i = 0; i < numTLD; i++) + { + NoiseReduction* nr = &m_tld[i].analysis.m_quant.m_frameNr[m_jpId]; + memcpy(nr->nrOffsetDenoise, m_nr->nrOffsetDenoise, sizeof(uint16_t)* MAX_NUM_TR_CATEGORIES * MAX_NUM_TR_COEFFS); + memset(nr->nrCount, 0, sizeof(uint32_t)* MAX_NUM_TR_CATEGORIES); + memset(nr->nrResidualSum, 0, sizeof(uint32_t)* MAX_NUM_TR_CATEGORIES * MAX_NUM_TR_COEFFS); + } } } @@ -773,7 +808,7 @@ } else { - for (int i = 0; i < 3; i++) + for (int i = 0; i < (m_param->internalCsp != X265_CSP_I400 ? 
3 : 1); i++)
             saoParam->ctuParam[i][cuAddr].reset();
     }
 }
@@ -824,7 +859,7 @@
 // Called by worker threads
 void FrameEncoder::processRowEncoder(int intRow, ThreadLocalData& tld)
 {
-    uint32_t row = (uint32_t)intRow;
+    const uint32_t row = (uint32_t)intRow;
     CTURow& curRow = m_rows[row];
 
     tld.analysis.m_param = m_param;
@@ -858,11 +893,15 @@
     const uint32_t lineStartCUAddr = row * numCols;
     bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0;
 
+    uint32_t maxBlockCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16;
+    uint32_t maxBlockRows = (m_frame->m_fencPic->m_picHeight + (16 - 1)) / 16;
+    uint32_t noOfBlocks = g_maxCUSize / 16;
+
     while (curRow.completed < numCols)
     {
         ProfileScopeEvent(encodeCTU);
 
-        uint32_t col = curRow.completed;
+        const uint32_t col = curRow.completed;
         const uint32_t cuAddr = lineStartCUAddr + col;
         CUData* ctu = curEncData.getPicCTU(cuAddr);
         ctu->initCTU(*m_frame, cuAddr, slice->m_sliceQp);
@@ -882,11 +921,8 @@
                 cuStat.baseQp = curEncData.m_rowStat[row].diagQp;
 
             /* TODO: use defines from slicetype.h for lowres block size */
-            uint32_t maxBlockCols = (m_frame->m_fencPic->m_picWidth + (16 - 1)) / 16;
-            uint32_t maxBlockRows = (m_frame->m_fencPic->m_picHeight + (16 - 1)) / 16;
-            uint32_t noOfBlocks = g_maxCUSize / 16;
-            uint32_t block_y = (cuAddr / curEncData.m_slice->m_sps->numCuInWidth) * noOfBlocks;
-            uint32_t block_x = (cuAddr * noOfBlocks) - block_y * curEncData.m_slice->m_sps->numCuInWidth;
+            uint32_t block_y = (ctu->m_cuPelY >> g_maxLog2CUSize) * noOfBlocks;
+            uint32_t block_x = (ctu->m_cuPelX >> g_maxLog2CUSize) * noOfBlocks;
 
             cuStat.vbvCost = 0;
             cuStat.intraVbvCost = 0;
@@ -926,6 +962,58 @@
             // Save CABAC state for next row
             curRow.bufferedEntropy.loadContexts(rowCoder);
 
+        /* SAO parameter estimation using non-deblocked pixels for CTU bottom and right boundary areas */
+        if (m_param->bEnableSAO && m_param->bSaoNonDeblocked)
+            m_frameFilter.m_parallelFilter[row].m_sao.calcSaoStatsCu_BeforeDblk(m_frame, col, row);
+
+        /* Deblock with idle threading */
+        if (m_param->bEnableLoopFilter | m_param->bEnableSAO)
+        {
+            // TODO: Multiple Threading
+            // Delay ONE row to avoid Intra Prediction Conflict
+            if (m_pool && (row >= 1))
+            {
+                // Wait for the last thread to finish
+                m_frameFilter.m_parallelFilter[row - 1].waitForExit();
+
+                // Processing new group
+                int allowCol = col;
+
+                // avoid race condition on last column
+                if (row >= 2)
+                {
+                    allowCol = X265_MIN(((col == numCols - 1) ? m_frameFilter.m_parallelFilter[row - 2].m_lastDeblocked.get()
+                                                              : m_frameFilter.m_parallelFilter[row - 2].m_lastCol.get()), (int)col);
+                }
+                m_frameFilter.m_parallelFilter[row - 1].m_allowedCol.set(allowCol);
+                m_frameFilter.m_parallelFilter[row - 1].tryBondPeers(*this, 1);
+            }
+
+            // Last Row may start early
+            if (m_pool && (row == m_numRows - 1))
+            {
+                // Wait for the last thread to finish
+                m_frameFilter.m_parallelFilter[row].waitForExit();
+
+                // Deblocking last row
+                int allowCol = col;
+
+                // avoid race condition on last column
+                if (row >= 2)
+                {
+                    allowCol = X265_MIN(((col == numCols - 1) ? m_frameFilter.m_parallelFilter[row - 1].m_lastDeblocked.get()
+                                                              : m_frameFilter.m_parallelFilter[row - 1].m_lastCol.get()), (int)col);
+                }
+                m_frameFilter.m_parallelFilter[row].m_allowedCol.set(allowCol);
+                m_frameFilter.m_parallelFilter[row].tryBondPeers(*this, 1);
+            }
+        }
+        // Both Loopfilter and SAO Disabled
+        else
+        {
+            m_frameFilter.m_parallelFilter[row].processPostCu(col);
+        }
+
         // Completed CU processing
         curRow.completed++;
 
@@ -958,6 +1046,7 @@
             curRow.rowStats.lumaDistortion += best.lumaDistortion;
             curRow.rowStats.chromaDistortion += best.chromaDistortion;
             curRow.rowStats.psyEnergy += best.psyEnergy;
+            curRow.rowStats.resEnergy += best.resEnergy;
             curRow.rowStats.cntIntraNxN += frameLog.cntIntraNxN;
             curRow.rowStats.totalCu += frameLog.totalCu;
             for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
@@ -970,17 +1059,6 @@
                     curRow.rowStats.cuIntraDistribution[depth][n] += frameLog.cuIntraDistribution[depth][n];
             }
 
-            /* calculate maximum and average luma levels */
-            uint32_t ctuLumaLevel = 0;
-            uint32_t ctuNoOfPixels = best.fencYuv->m_size * best.fencYuv->m_size;
-            for (uint32_t i = 0; i < ctuNoOfPixels; i++)
-            {
-                pixel p = best.fencYuv->m_buf[0][i];
-                ctuLumaLevel += p;
-                curRow.rowStats.maxLumaLevel = X265_MAX(p, curRow.rowStats.maxLumaLevel);
-            }
-            curRow.rowStats.lumaLevel += (double)(ctuLumaLevel) / ctuNoOfPixels;
-
             curEncData.m_cuStat[cuAddr].totalBits = best.totalBits;
             x265_emms();
@@ -1065,10 +1143,6 @@
             }
         }
 
-        /* SAO parameter estimation using non-deblocked pixels for CTU bottom and right boundary areas */
-        if (m_param->bEnableSAO && m_param->bSaoNonDeblocked)
-            m_frameFilter.m_sao.calcSaoStatsCu_BeforeDblk(m_frame, col, row);
-
         if (m_param->bEnableWavefront && curRow.completed >= 2 && row < m_numRows - 1 &&
             (!m_bAllRowsStop || intRow + 1 < m_vbvResetTriggerRow))
         {
@@ -1085,7 +1159,7 @@
             ScopedLock self(curRow.lock);
             if ((m_bAllRowsStop && intRow > m_vbvResetTriggerRow) ||
-                (row > 0 && curRow.completed < numCols - 1 && m_rows[row - 1].completed < m_rows[row].completed + 2))
+                (row > 0 && ((curRow.completed < numCols - 1) || (m_rows[row - 1].completed < numCols)) && m_rows[row - 1].completed < m_rows[row].completed + 2))
             {
                 curRow.active = false;
                 curRow.busy = false;
@@ -1127,9 +1201,24 @@
     if (!m_param->bEnableSAO && (m_param->bEnableWavefront || row == m_numRows - 1))
         rowCoder.finishSlice();
 
+    /* Process the remaining deblock work with the current thread */
+    if ((m_param->bEnableLoopFilter | m_param->bEnableSAO) & (row >= 2))
+    {
+        /* TODO: Multiple Threading */
+
+        /* Check whether the previous row can be finished with the current thread */
+        if (m_frameFilter.m_parallelFilter[row - 2].m_lastDeblocked.get() == (int)numCols)
+        {
+            /* stop threading on current row and restart it */
+            m_frameFilter.m_parallelFilter[row - 1].waitForExit();
+            m_frameFilter.m_parallelFilter[row - 1].m_allowedCol.set(numCols);
+            m_frameFilter.m_parallelFilter[row - 1].processTasks(-1);
+        }
+    }
+
+    /* trigger row-wise loop filters */
     if (m_param->bEnableWavefront)
     {
-        /* trigger row-wise loop filters */
         if (row >= m_filterRowDelay)
         {
             enableRowFilter(row - m_filterRowDelay);
@@ -1139,6 +1228,7 @@
             enqueueRowFilter(0);
             tryWakeOne();
         }
+
         if (row == m_numRows - 1)
         {
             for (uint32_t i = m_numRows - m_filterRowDelay; i < m_numRows; i++)
@@ -1247,25 +1337,25 @@
         int trSize = cat & 3;
         int coefCount = 1 << ((trSize + 2) * 2);
 
-        if (m_nr->count[cat] > maxBlocksPerTrSize[trSize])
+        if (m_nr->nrCount[cat] > maxBlocksPerTrSize[trSize])
         {
             for (int i = 0; i < coefCount; i++)
-                m_nr->residualSum[cat][i] >>= 1;
-            m_nr->count[cat] 
>>= 1; + m_nr->nrResidualSum[cat][i] >>= 1; + m_nr->nrCount[cat] >>= 1; } int nrStrength = cat < 8 ? m_param->noiseReductionIntra : m_param->noiseReductionInter; - uint64_t scaledCount = (uint64_t)nrStrength * m_nr->count[cat]; + uint64_t scaledCount = (uint64_t)nrStrength * m_nr->nrCount[cat]; for (int i = 0; i < coefCount; i++) { - uint64_t value = scaledCount + m_nr->residualSum[cat][i] / 2; - uint64_t denom = m_nr->residualSum[cat][i] + 1; - m_nr->offsetDenoise[cat][i] = (uint16_t)(value / denom); + uint64_t value = scaledCount + m_nr->nrResidualSum[cat][i] / 2; + uint64_t denom = m_nr->nrResidualSum[cat][i] + 1; + m_nr->nrOffsetDenoise[cat][i] = (uint16_t)(value / denom); } // Don't denoise DC coefficients - m_nr->offsetDenoise[cat][0] = 0; + m_nr->nrOffsetDenoise[cat][0] = 0; } }
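As a worked example of the noiseReductionUpdate() arithmetic above (counts are made up):

    // offset ~= nrStrength * count / residualSum, with rounding:
    // nrStrength = 5, nrCount[cat] = 10000 blocks, nrResidualSum[cat][i] = 2000
    // value = 5 * 10000 + 2000 / 2 = 51000
    // denom = 2000 + 1 = 2001
    // nrOffsetDenoise[cat][i] = 51000 / 2001 = 25

Coefficients whose accumulated residual is small relative to the block count receive a larger offset, i.e. they are denoised harder, while the DC coefficient is always left at zero.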
View file
x265_1.8.tar.gz/source/encoder/framefilter.cpp -> x265_1.9.tar.gz/source/encoder/framefilter.cpp
Changed
@@ -35,177 +35,486 @@ static uint64_t computeSSD(pixel *fenc, pixel *rec, intptr_t stride, uint32_t width, uint32_t height); static float calculateSSIM(pixel *pix1, intptr_t stride1, pixel *pix2, intptr_t stride2, uint32_t width, uint32_t height, void *buf, uint32_t& cnt); -FrameFilter::FrameFilter() - : m_param(NULL) - , m_frame(NULL) - , m_frameEncoder(NULL) - , m_ssimBuf(NULL) -{ -} - void FrameFilter::destroy() { - if (m_param->bEnableSAO) - m_sao.destroy(); - X265_FREE(m_ssimBuf); + + if (m_parallelFilter) + { + if (m_param->bEnableSAO) + { + for(int row = 0; row < m_numRows; row++) + m_parallelFilter[row].m_sao.destroy((row == 0 ? 1 : 0)); + } + + delete[] m_parallelFilter; + m_parallelFilter = NULL; + } } -void FrameFilter::init(Encoder *top, FrameEncoder *frame, int numRows) +void FrameFilter::init(Encoder *top, FrameEncoder *frame, int numRows, uint32_t numCols) { m_param = top->m_param; m_frameEncoder = frame; m_numRows = numRows; + m_numCols = numCols; m_hChromaShift = CHROMA_H_SHIFT(m_param->internalCsp); m_vChromaShift = CHROMA_V_SHIFT(m_param->internalCsp); m_pad[0] = top->m_sps.conformanceWindow.rightOffset; m_pad[1] = top->m_sps.conformanceWindow.bottomOffset; m_saoRowDelay = m_param->bEnableLoopFilter ? 1 : 0; - m_lastHeight = m_param->sourceHeight % g_maxCUSize ? m_param->sourceHeight % g_maxCUSize : g_maxCUSize; - - if (m_param->bEnableSAO) - if (!m_sao.create(m_param)) - m_param->bEnableSAO = 0; + m_lastHeight = (m_param->sourceHeight % g_maxCUSize) ? (m_param->sourceHeight % g_maxCUSize) : g_maxCUSize; + m_lastWidth = (m_param->sourceWidth % g_maxCUSize) ? (m_param->sourceWidth % g_maxCUSize) : g_maxCUSize; if (m_param->bEnableSsim) m_ssimBuf = X265_MALLOC(int, 8 * (m_param->sourceWidth / 4 + 3)); + + m_parallelFilter = new ParallelFilter[numRows]; + + if (m_parallelFilter) + { + if (m_param->bEnableSAO) + { + for(int row = 0; row < numRows; row++) + { + if (!m_parallelFilter[row].m_sao.create(m_param, (row == 0 ? 1 : 0))) + m_param->bEnableSAO = 0; + else + { + if (row != 0) + m_parallelFilter[row].m_sao.createFromRootNode(&m_parallelFilter[0].m_sao); + } + + } + } + + for(int row = 0; row < numRows; row++) + { + // Setting maximum bound information + m_parallelFilter[row].m_rowHeight = (row == numRows - 1) ? 
m_lastHeight : g_maxCUSize; + m_parallelFilter[row].m_row = row; + m_parallelFilter[row].m_rowAddr = row * numCols; + m_parallelFilter[row].m_frameFilter = this; + + if (row > 0) + m_parallelFilter[row].m_prevRow = &m_parallelFilter[row - 1]; + } + } + } void FrameFilter::start(Frame *frame, Entropy& initState, int qp) { m_frame = frame; - if (m_param->bEnableSAO) - m_sao.startSlice(frame, initState, qp); + // Reset Filter Data Struct + if (m_parallelFilter) + { + for(int row = 0; row < m_numRows; row++) + { + if (m_param->bEnableSAO) + m_parallelFilter[row].m_sao.startSlice(frame, initState, qp); + + m_parallelFilter[row].m_lastCol.set(0); + m_parallelFilter[row].m_allowedCol.set(0); + m_parallelFilter[row].m_lastDeblocked.set(-1); + m_parallelFilter[row].m_encData = frame->m_encData; + } + + // Reset SAO common statistics + if (m_param->bEnableSAO) + m_parallelFilter[0].m_sao.resetStats(); + } } -void FrameFilter::processRow(int row) +/* restore original YUV samples to recon after SAO (if lossless) */ +static void restoreOrigLosslessYuv(const CUData* cu, Frame& frame, uint32_t absPartIdx) { - ProfileScopeEvent(filterCTURow); + const int size = cu->m_log2CUSize[absPartIdx] - 2; + const uint32_t cuAddr = cu->m_cuAddr; -#if DETAILED_CU_STATS - ScopedElapsedTime filterPerfScope(m_frameEncoder->m_cuStats.loopFilterElapsedTime); - m_frameEncoder->m_cuStats.countLoopFilter++; -#endif + PicYuv* reconPic = frame.m_reconPic; + PicYuv* fencPic = frame.m_fencPic; - if (!m_param->bEnableLoopFilter && !m_param->bEnableSAO) + pixel* dst = reconPic->getLumaAddr(cuAddr, absPartIdx); + pixel* src = fencPic->getLumaAddr(cuAddr, absPartIdx); + + primitives.cu[size].copy_pp(dst, reconPic->m_stride, src, fencPic->m_stride); + + if (cu->m_chromaFormat != X265_CSP_I400) { - processRowPost(row); + pixel* dstCb = reconPic->getCbAddr(cuAddr, absPartIdx); + pixel* srcCb = fencPic->getCbAddr(cuAddr, absPartIdx); + pixel* dstCr = reconPic->getCrAddr(cuAddr, absPartIdx); + pixel* srcCr = fencPic->getCrAddr(cuAddr, absPartIdx); + + const int csp = fencPic->m_picCsp; + primitives.chroma[csp].cu[size].copy_pp(dstCb, reconPic->m_strideC, srcCb, fencPic->m_strideC); + primitives.chroma[csp].cu[size].copy_pp(dstCr, reconPic->m_strideC, srcCr, fencPic->m_strideC); + } +} + +/* Original YUV restoration for CU in lossless coding */ +static void origCUSampleRestoration(const CUData* cu, const CUGeom& cuGeom, Frame& frame) +{ + uint32_t absPartIdx = cuGeom.absPartIdx; + if (cu->m_cuDepth[absPartIdx] > cuGeom.depth) + { + for (int subPartIdx = 0; subPartIdx < 4; subPartIdx++) + { + const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx); + if (childGeom.flags & CUGeom::PRESENT) + origCUSampleRestoration(cu, childGeom, frame); + } return; } - FrameData& encData = *m_frame->m_encData; - const uint32_t numCols = encData.m_slice->m_sps->numCuInWidth; - const uint32_t lineStartCUAddr = row * numCols; - if (m_param->bEnableLoopFilter) + // restore original YUV samples + if (cu->m_tqBypass[absPartIdx]) + restoreOrigLosslessYuv(cu, frame, absPartIdx); +} + +void FrameFilter::ParallelFilter::copySaoAboveRef(PicYuv* reconPic, uint32_t cuAddr, int col) +{ + // Copy SAO Top Reference Pixels + int ctuWidth = g_maxCUSize; + const pixel* recY = reconPic->getPlaneAddr(0, cuAddr) - (m_rowAddr == 0 ? 
0 : reconPic->m_stride); + + // Luma + memcpy(&m_sao.m_tmpU[0][col * ctuWidth], recY, ctuWidth * sizeof(pixel)); + X265_CHECK(col * ctuWidth + ctuWidth <= m_sao.m_numCuInWidth * ctuWidth, "m_tmpU buffer beyond bound write detected"); + + // Chroma + if (m_frameFilter->m_param->internalCsp != X265_CSP_I400) + { + ctuWidth >>= m_sao.m_hChromaShift; + + const pixel* recU = reconPic->getPlaneAddr(1, cuAddr) - (m_rowAddr == 0 ? 0 : reconPic->m_strideC); + const pixel* recV = reconPic->getPlaneAddr(2, cuAddr) - (m_rowAddr == 0 ? 0 : reconPic->m_strideC); + memcpy(&m_sao.m_tmpU[1][col * ctuWidth], recU, ctuWidth * sizeof(pixel)); + memcpy(&m_sao.m_tmpU[2][col * ctuWidth], recV, ctuWidth * sizeof(pixel)); + + X265_CHECK(col * ctuWidth + ctuWidth <= m_sao.m_numCuInWidth * ctuWidth, "m_tmpU buffer beyond bound write detected"); + } +} + +void FrameFilter::ParallelFilter::processSaoUnitCu(SAOParam *saoParam, int col) +{ + // TODO: apply SAO on CU and copy back soon, is it necessary? + if (saoParam->bSaoFlag[0]) + m_sao.processSaoUnitCuLuma(saoParam->ctuParam[0], m_row, col); + + if (saoParam->bSaoFlag[1]) + m_sao.processSaoUnitCuChroma(saoParam->ctuParam, m_row, col); + + if (m_encData->m_slice->m_pps->bTransquantBypassEnabled) { - const CUGeom* cuGeoms = m_frameEncoder->m_cuGeoms; - const uint32_t* ctuGeomMap = m_frameEncoder->m_ctuGeomMap; + const CUGeom* cuGeoms = m_frameFilter->m_frameEncoder->m_cuGeoms; + const uint32_t* ctuGeomMap = m_frameFilter->m_frameEncoder->m_ctuGeomMap; - for (uint32_t col = 0; col < numCols; col++) + uint32_t cuAddr = m_rowAddr + col; + const CUData* ctu = m_encData->getPicCTU(cuAddr); + assert(m_frameFilter->m_frame->m_reconPic == m_encData->m_reconPic); + origCUSampleRestoration(ctu, cuGeoms[ctuGeomMap[cuAddr]], *m_frameFilter->m_frame); + } +} + +// NOTE: MUST BE delay a row when Deblock enabled, the Deblock will modify above pixels in Horizon pass +void FrameFilter::ParallelFilter::processPostCu(int col) const +{ + // Update finished CU cursor + m_frameFilter->m_frame->m_reconColCount[m_row].set(col); + + // shortcut path for non-border area + if ((col != 0) & (col != m_frameFilter->m_numCols - 1) & (m_row != 0) & (m_row != m_frameFilter->m_numRows - 1)) + return; + + PicYuv *reconPic = m_frameFilter->m_frame->m_reconPic; + const uint32_t lineStartCUAddr = m_rowAddr + col; + const int realH = getCUHeight(); + const int realW = m_frameFilter->getCUWidth(col); + + const uint32_t lumaMarginX = reconPic->m_lumaMarginX; + const uint32_t lumaMarginY = reconPic->m_lumaMarginY; + const uint32_t chromaMarginX = reconPic->m_chromaMarginX; + const uint32_t chromaMarginY = reconPic->m_chromaMarginY; + const int hChromaShift = reconPic->m_hChromaShift; + const int vChromaShift = reconPic->m_vChromaShift; + const intptr_t stride = reconPic->m_stride; + const intptr_t strideC = reconPic->m_strideC; + pixel *pixY = reconPic->getLumaAddr(lineStartCUAddr); + // // MUST BE check I400 since m_picOrg uninitialize in that case + pixel *pixU = (m_frameFilter->m_param->internalCsp != X265_CSP_I400) ? reconPic->getCbAddr(lineStartCUAddr) : NULL; + pixel *pixV = (m_frameFilter->m_param->internalCsp != X265_CSP_I400) ? 
reconPic->getCrAddr(lineStartCUAddr) : NULL; + int copySizeY = realW; + int copySizeC = (realW >> hChromaShift); + + if ((col == 0) | (col == m_frameFilter->m_numCols - 1)) + { + // TODO: improve by process on Left or Right only + primitives.extendRowBorder(reconPic->getLumaAddr(m_rowAddr), stride, reconPic->m_picWidth, realH, reconPic->m_lumaMarginX); + + if (m_frameFilter->m_param->internalCsp != X265_CSP_I400) { - uint32_t cuAddr = lineStartCUAddr + col; - const CUData* ctu = encData.getPicCTU(cuAddr); - deblockCTU(ctu, cuGeoms[ctuGeomMap[cuAddr]], Deblock::EDGE_VER); + primitives.extendRowBorder(reconPic->getCbAddr(m_rowAddr), strideC, reconPic->m_picWidth >> hChromaShift, realH >> vChromaShift, reconPic->m_chromaMarginX); + primitives.extendRowBorder(reconPic->getCrAddr(m_rowAddr), strideC, reconPic->m_picWidth >> hChromaShift, realH >> vChromaShift, reconPic->m_chromaMarginX); + } + } - if (col > 0) + // Extra Left and Right border on first and last CU + if ((col == 0) | (col == m_frameFilter->m_numCols - 1)) + { + copySizeY += lumaMarginX; + copySizeC += chromaMarginX; + } + + // First column need extension left padding area and first CU + if (col == 0) + { + pixY -= lumaMarginX; + pixU -= chromaMarginX; + pixV -= chromaMarginX; + } + + // Border extend Top + if (m_row == 0) + { + for (uint32_t y = 0; y < lumaMarginY; y++) + memcpy(pixY - (y + 1) * stride, pixY, copySizeY * sizeof(pixel)); + + if (m_frameFilter->m_param->internalCsp != X265_CSP_I400) + { + for (uint32_t y = 0; y < chromaMarginY; y++) { - const CUData* ctuPrev = encData.getPicCTU(cuAddr - 1); - deblockCTU(ctuPrev, cuGeoms[ctuGeomMap[cuAddr - 1]], Deblock::EDGE_HOR); + memcpy(pixU - (y + 1) * strideC, pixU, copySizeC * sizeof(pixel)); + memcpy(pixV - (y + 1) * strideC, pixV, copySizeC * sizeof(pixel)); } } + } - const CUData* ctuPrev = encData.getPicCTU(lineStartCUAddr + numCols - 1); - deblockCTU(ctuPrev, cuGeoms[ctuGeomMap[lineStartCUAddr + numCols - 1]], Deblock::EDGE_HOR); + // Border extend Bottom + if (m_row == m_frameFilter->m_numRows - 1) + { + pixY += (realH - 1) * stride; + pixU += ((realH >> vChromaShift) - 1) * strideC; + pixV += ((realH >> vChromaShift) - 1) * strideC; + for (uint32_t y = 0; y < lumaMarginY; y++) + memcpy(pixY + (y + 1) * stride, pixY, copySizeY * sizeof(pixel)); + + if (m_frameFilter->m_param->internalCsp != X265_CSP_I400) + { + for (uint32_t y = 0; y < chromaMarginY; y++) + { + memcpy(pixU + (y + 1) * strideC, pixU, copySizeC * sizeof(pixel)); + memcpy(pixV + (y + 1) * strideC, pixV, copySizeC * sizeof(pixel)); + } + } } +} - // SAO - SAOParam* saoParam = encData.m_saoParam; - if (m_param->bEnableSAO) +// NOTE: Single Threading only +void FrameFilter::ParallelFilter::processTasks(int /*workerThreadId*/) +{ + SAOParam* saoParam = m_encData->m_saoParam; + const CUGeom* cuGeoms = m_frameFilter->m_frameEncoder->m_cuGeoms; + const uint32_t* ctuGeomMap = m_frameFilter->m_frameEncoder->m_ctuGeomMap; + PicYuv* reconPic = m_encData->m_reconPic; + const int colStart = m_lastCol.get(); + // TODO: Waiting previous row finish or simple clip on it? 
+ const int colEnd = m_allowedCol.get(); + const int numCols = m_frameFilter->m_numCols; + + // Avoid threading conflict + if (colStart >= colEnd) + return; + + for (uint32_t col = (uint32_t)colStart; col < (uint32_t)colEnd; col++) { - m_sao.m_entropyCoder.load(m_frameEncoder->m_initSliceContext); - m_sao.m_rdContexts.next.load(m_frameEncoder->m_initSliceContext); - m_sao.m_rdContexts.cur.load(m_frameEncoder->m_initSliceContext); + const uint32_t cuAddr = m_rowAddr + col; - m_sao.rdoSaoUnitRow(saoParam, row); + if (m_frameFilter->m_param->bEnableLoopFilter) + { + const CUData* ctu = m_encData->getPicCTU(cuAddr); + deblockCTU(ctu, cuGeoms[ctuGeomMap[cuAddr]], Deblock::EDGE_VER); + } - // NOTE: Delay a row because SAO decide need top row pixels at next row, is it HM's bug? - if (row >= m_saoRowDelay) - processSao(row - m_saoRowDelay); - } + if (col >= 1) + { + if (m_frameFilter->m_param->bEnableLoopFilter) + { + const CUData* ctuPrev = m_encData->getPicCTU(cuAddr - 1); + deblockCTU(ctuPrev, cuGeoms[ctuGeomMap[cuAddr - 1]], Deblock::EDGE_HOR); - // this row of CTUs has been encoded + // When SAO Disable, setting column counter here + if ((!m_frameFilter->m_param->bEnableSAO) & (m_row >= 1)) + m_prevRow->processPostCu(col - 1); + } - if (row > 0) - processRowPost(row - 1); + if (m_frameFilter->m_param->bEnableSAO) + { + // Save SAO bottom row reference pixels + copySaoAboveRef(reconPic, cuAddr - 1, col - 1); + + // SAO Decide + if (col >= 2) + { + // NOTE: Delay 2 column to avoid mistake on below case, it is Deblock sync logic issue, less probability but still alive + // ... H V | + // ..S H V | + m_sao.rdoSaoUnitCu(saoParam, m_rowAddr, col - 2, cuAddr - 2); + } + + // Process Previous Row SAO CU + if (m_row >= 1 && col >= 3) + { + // Must delay 1 row to avoid thread data race conflict + m_prevRow->processSaoUnitCu(saoParam, col - 3); + m_prevRow->processPostCu(col - 3); + } + } - if (row == m_numRows - 1) + m_lastDeblocked.set(col); + } + m_lastCol.incr(); + } + + if (colEnd == numCols) { - if (m_param->bEnableSAO) + const uint32_t cuAddr = m_rowAddr + numCols - 1; + + if (m_frameFilter->m_param->bEnableLoopFilter) { - m_sao.rdoSaoUnitRowEnd(saoParam, encData.m_slice->m_sps->numCUsInFrame); + const CUData* ctuPrev = m_encData->getPicCTU(cuAddr); + deblockCTU(ctuPrev, cuGeoms[ctuGeomMap[cuAddr]], Deblock::EDGE_HOR); - for (int i = m_numRows - m_saoRowDelay; i < m_numRows; i++) - processSao(i); + // When SAO Disable, setting column counter here + if ((!m_frameFilter->m_param->bEnableSAO) & (m_row >= 1)) + m_prevRow->processPostCu(numCols - 1); } - processRowPost(row); + // TODO: move processPostCu() into processSaoUnitCu() + if (m_frameFilter->m_param->bEnableSAO) + { + // Save SAO bottom row reference pixels + copySaoAboveRef(reconPic, cuAddr, numCols - 1); + + // SAO Decide + // NOTE: reduce condition check for 1 CU only video, Why someone play with it? 
+ if (numCols >= 2) + m_sao.rdoSaoUnitCu(saoParam, m_rowAddr, numCols - 2, cuAddr - 1); + + if (numCols >= 1) + m_sao.rdoSaoUnitCu(saoParam, m_rowAddr, numCols - 1, cuAddr); + + // Process Previous Rows SAO CU + if (m_row >= 1 && numCols >= 3) + { + m_prevRow->processSaoUnitCu(saoParam, numCols - 3); + m_prevRow->processPostCu(numCols - 3); + } + + if (m_row >= 1 && numCols >= 2) + { + m_prevRow->processSaoUnitCu(saoParam, numCols - 2); + m_prevRow->processPostCu(numCols - 2); + } + + if (m_row >= 1 && numCols >= 1) + { + m_prevRow->processSaoUnitCu(saoParam, numCols - 1); + m_prevRow->processPostCu(numCols - 1); + } + + // Setting column sync counter + if (m_row >= 1) + m_frameFilter->m_frame->m_reconColCount[m_row - 1].set(numCols - 1); + } + m_lastDeblocked.set(numCols); } } -uint32_t FrameFilter::getCUHeight(int rowNum) const +void FrameFilter::processRow(int row) { - return rowNum == m_numRows - 1 ? m_lastHeight : g_maxCUSize; -} + ProfileScopeEvent(filterCTURow); -void FrameFilter::processRowPost(int row) -{ - PicYuv *reconPic = m_frame->m_reconPic; - const uint32_t numCols = m_frame->m_encData->m_slice->m_sps->numCuInWidth; - const uint32_t lineStartCUAddr = row * numCols; - const int realH = getCUHeight(row); +#if DETAILED_CU_STATS + ScopedElapsedTime filterPerfScope(m_frameEncoder->m_cuStats.loopFilterElapsedTime); + m_frameEncoder->m_cuStats.countLoopFilter++; +#endif - // Border extend Left and Right - primitives.extendRowBorder(reconPic->getLumaAddr(lineStartCUAddr), reconPic->m_stride, reconPic->m_picWidth, realH, reconPic->m_lumaMarginX); - primitives.extendRowBorder(reconPic->getCbAddr(lineStartCUAddr), reconPic->m_strideC, reconPic->m_picWidth >> m_hChromaShift, realH >> m_vChromaShift, reconPic->m_chromaMarginX); - primitives.extendRowBorder(reconPic->getCrAddr(lineStartCUAddr), reconPic->m_strideC, reconPic->m_picWidth >> m_hChromaShift, realH >> m_vChromaShift, reconPic->m_chromaMarginX); + if (!m_param->bEnableLoopFilter && !m_param->bEnableSAO) + { + processPostRow(row); + return; + } + FrameData& encData = *m_frame->m_encData; - // Border extend Top - if (!row) + // SAO: was integrate into encode loop + SAOParam* saoParam = encData.m_saoParam; + + /* Processing left block Deblock with current threading */ { - const intptr_t stride = reconPic->m_stride; - const intptr_t strideC = reconPic->m_strideC; - pixel *pixY = reconPic->getLumaAddr(lineStartCUAddr) - reconPic->m_lumaMarginX; - pixel *pixU = reconPic->getCbAddr(lineStartCUAddr) - reconPic->m_chromaMarginX; - pixel *pixV = reconPic->getCrAddr(lineStartCUAddr) - reconPic->m_chromaMarginX; + /* stop threading on current row */ + m_parallelFilter[row].waitForExit(); + + /* Check to avoid previous row process slower than current row */ + X265_CHECK((row < 1) || m_parallelFilter[row - 1].m_lastDeblocked.get() == m_numCols, "previous row not finish"); - for (uint32_t y = 0; y < reconPic->m_lumaMarginY; y++) - memcpy(pixY - (y + 1) * stride, pixY, stride * sizeof(pixel)); + m_parallelFilter[row].m_allowedCol.set(m_numCols); + m_parallelFilter[row].processTasks(-1); - for (uint32_t y = 0; y < reconPic->m_chromaMarginY; y++) + if (row == m_numRows - 1) { - memcpy(pixU - (y + 1) * strideC, pixU, strideC * sizeof(pixel)); - memcpy(pixV - (y + 1) * strideC, pixV, strideC * sizeof(pixel)); + /* TODO: Early start last row */ + if ((row >= 1) && (m_parallelFilter[row - 1].m_lastDeblocked.get() != m_numCols)) + x265_log(m_param, X265_LOG_WARNING, "detected ParallelFilter race condition on last row\n"); + + /* Apply SAO on last row 
of CUs, because we always apply SAO on row[X-1] */ + if (m_param->bEnableSAO) + { + for(int col = 0; col < m_numCols; col++) + { + // NOTE: must use processSaoUnitCu(), it include TQBypass logic + m_parallelFilter[row].processSaoUnitCu(saoParam, col); + } + } + + // Process border extension on last row + for(int col = 0; col < m_numCols; col++) + { + // m_reconColCount will be set in processPostCu() + m_parallelFilter[row].processPostCu(col); + } } } - // Border extend Bottom + // this row of CTUs has been encoded + + if (row > 0) + processPostRow(row - 1); + if (row == m_numRows - 1) { - const intptr_t stride = reconPic->m_stride; - const intptr_t strideC = reconPic->m_strideC; - pixel *pixY = reconPic->getLumaAddr(lineStartCUAddr) - reconPic->m_lumaMarginX + (realH - 1) * stride; - pixel *pixU = reconPic->getCbAddr(lineStartCUAddr) - reconPic->m_chromaMarginX + ((realH >> m_vChromaShift) - 1) * strideC; - pixel *pixV = reconPic->getCrAddr(lineStartCUAddr) - reconPic->m_chromaMarginX + ((realH >> m_vChromaShift) - 1) * strideC; - for (uint32_t y = 0; y < reconPic->m_lumaMarginY; y++) - memcpy(pixY + (y + 1) * stride, pixY, stride * sizeof(pixel)); - - for (uint32_t y = 0; y < reconPic->m_chromaMarginY; y++) + if (m_param->bEnableSAO) { - memcpy(pixU + (y + 1) * strideC, pixU, strideC * sizeof(pixel)); - memcpy(pixV + (y + 1) * strideC, pixV, strideC * sizeof(pixel)); + // Merge numNoSao into RootNode (Node0) + for(int i = 1; i < m_numRows; i++) + { + m_parallelFilter[0].m_sao.m_numNoSao[0] += m_parallelFilter[i].m_sao.m_numNoSao[0]; + m_parallelFilter[0].m_sao.m_numNoSao[1] += m_parallelFilter[i].m_sao.m_numNoSao[1]; + } + + m_parallelFilter[0].m_sao.rdoSaoUnitRowEnd(saoParam, encData.m_slice->m_sps->numCUsInFrame); } + processPostRow(row); } +} + +void FrameFilter::processPostRow(int row) +{ + PicYuv *reconPic = m_frame->m_reconPic; + const uint32_t numCols = m_frame->m_encData->m_slice->m_sps->numCuInWidth; + const uint32_t lineStartCUAddr = row * numCols; // Notify other FrameEncoders that this row of reconstructed pixels is available m_frame->m_reconRowCount.incr(); @@ -217,26 +526,30 @@ intptr_t stride = reconPic->m_stride; uint32_t width = reconPic->m_picWidth - m_pad[0]; - uint32_t height = getCUHeight(row); + uint32_t height = m_parallelFilter[row].getCUHeight(); uint64_t ssdY = computeSSD(fencPic->getLumaAddr(cuAddr), reconPic->getLumaAddr(cuAddr), stride, width, height); - height >>= m_vChromaShift; - width >>= m_hChromaShift; - stride = reconPic->m_strideC; + m_frameEncoder->m_SSDY += ssdY; - uint64_t ssdU = computeSSD(fencPic->getCbAddr(cuAddr), reconPic->getCbAddr(cuAddr), stride, width, height); - uint64_t ssdV = computeSSD(fencPic->getCrAddr(cuAddr), reconPic->getCrAddr(cuAddr), stride, width, height); + if (m_param->internalCsp != X265_CSP_I400) + { + height >>= m_vChromaShift; + width >>= m_hChromaShift; + stride = reconPic->m_strideC; - m_frameEncoder->m_SSDY += ssdY; - m_frameEncoder->m_SSDU += ssdU; - m_frameEncoder->m_SSDV += ssdV; + uint64_t ssdU = computeSSD(fencPic->getCbAddr(cuAddr), reconPic->getCbAddr(cuAddr), stride, width, height); + uint64_t ssdV = computeSSD(fencPic->getCrAddr(cuAddr), reconPic->getCrAddr(cuAddr), stride, width, height); + + m_frameEncoder->m_SSDU += ssdU; + m_frameEncoder->m_SSDV += ssdV; + } } if (m_param->bEnableSsim && m_ssimBuf) { - pixel *rec = m_frame->m_reconPic->m_picOrg[0]; + pixel *rec = reconPic->m_picOrg[0]; pixel *fenc = m_frame->m_fencPic->m_picOrg[0]; - intptr_t stride1 = m_frame->m_fencPic->m_stride; - intptr_t stride2 = 
m_frame->m_reconPic->m_stride; + intptr_t stride1 = reconPic->m_stride; + intptr_t stride2 = m_frame->m_fencPic->m_stride; uint32_t bEnd = ((row + 1) == (this->m_numRows - 1)); uint32_t bStart = (row == 0); uint32_t minPixY = row * g_maxCUSize - 4 * !bStart; @@ -253,55 +566,75 @@ } if (m_param->decodedPictureHashSEI == 1) { - uint32_t height = getCUHeight(row); + uint32_t height = m_parallelFilter[row].getCUHeight(); uint32_t width = reconPic->m_picWidth; intptr_t stride = reconPic->m_stride; if (!row) - { - for (int i = 0; i < 3; i++) - MD5Init(&m_frameEncoder->m_state[i]); - } + MD5Init(&m_frameEncoder->m_state[0]); updateMD5Plane(m_frameEncoder->m_state[0], reconPic->getLumaAddr(cuAddr), width, height, stride); - width >>= m_hChromaShift; - height >>= m_vChromaShift; - stride = reconPic->m_strideC; + if (m_param->internalCsp != X265_CSP_I400) + { + if (!row) + { + MD5Init(&m_frameEncoder->m_state[1]); + MD5Init(&m_frameEncoder->m_state[2]); + } - updateMD5Plane(m_frameEncoder->m_state[1], reconPic->getCbAddr(cuAddr), width, height, stride); - updateMD5Plane(m_frameEncoder->m_state[2], reconPic->getCrAddr(cuAddr), width, height, stride); + width >>= m_hChromaShift; + height >>= m_vChromaShift; + stride = reconPic->m_strideC; + + updateMD5Plane(m_frameEncoder->m_state[1], reconPic->getCbAddr(cuAddr), width, height, stride); + updateMD5Plane(m_frameEncoder->m_state[2], reconPic->getCrAddr(cuAddr), width, height, stride); + } } else if (m_param->decodedPictureHashSEI == 2) { - uint32_t height = getCUHeight(row); + uint32_t height = m_parallelFilter[row].getCUHeight(); uint32_t width = reconPic->m_picWidth; intptr_t stride = reconPic->m_stride; + if (!row) - m_frameEncoder->m_crc[0] = m_frameEncoder->m_crc[1] = m_frameEncoder->m_crc[2] = 0xffff; + m_frameEncoder->m_crc[0] = 0xffff; + updateCRC(reconPic->getLumaAddr(cuAddr), m_frameEncoder->m_crc[0], height, width, stride); - width >>= m_hChromaShift; - height >>= m_vChromaShift; - stride = reconPic->m_strideC; + if (m_param->internalCsp != X265_CSP_I400) + { + width >>= m_hChromaShift; + height >>= m_vChromaShift; + stride = reconPic->m_strideC; + m_frameEncoder->m_crc[1] = m_frameEncoder->m_crc[2] = 0xffff; - updateCRC(reconPic->getCbAddr(cuAddr), m_frameEncoder->m_crc[1], height, width, stride); - updateCRC(reconPic->getCrAddr(cuAddr), m_frameEncoder->m_crc[2], height, width, stride); + updateCRC(reconPic->getCbAddr(cuAddr), m_frameEncoder->m_crc[1], height, width, stride); + updateCRC(reconPic->getCrAddr(cuAddr), m_frameEncoder->m_crc[2], height, width, stride); + } } else if (m_param->decodedPictureHashSEI == 3) { uint32_t width = reconPic->m_picWidth; - uint32_t height = getCUHeight(row); + uint32_t height = m_parallelFilter[row].getCUHeight(); intptr_t stride = reconPic->m_stride; uint32_t cuHeight = g_maxCUSize; + if (!row) - m_frameEncoder->m_checksum[0] = m_frameEncoder->m_checksum[1] = m_frameEncoder->m_checksum[2] = 0; + m_frameEncoder->m_checksum[0] = 0; + updateChecksum(reconPic->m_picOrg[0], m_frameEncoder->m_checksum[0], height, width, stride, row, cuHeight); - width >>= m_hChromaShift; - height >>= m_vChromaShift; - stride = reconPic->m_strideC; - cuHeight >>= m_vChromaShift; + if (m_param->internalCsp != X265_CSP_I400) + { + width >>= m_hChromaShift; + height >>= m_vChromaShift; + stride = reconPic->m_strideC; + cuHeight >>= m_vChromaShift; + + if (!row) + m_frameEncoder->m_checksum[1] = m_frameEncoder->m_checksum[2] = 0; - updateChecksum(reconPic->m_picOrg[1], m_frameEncoder->m_checksum[1], height, width, stride, row, 
cuHeight); - updateChecksum(reconPic->m_picOrg[2], m_frameEncoder->m_checksum[2], height, width, stride, row, cuHeight); + updateChecksum(reconPic->m_picOrg[1], m_frameEncoder->m_checksum[1], height, width, stride, row, cuHeight); + updateChecksum(reconPic->m_picOrg[2], m_frameEncoder->m_checksum[2], height, width, stride, row, cuHeight); + } } if (ATOMIC_INC(&m_frameEncoder->m_completionCount) == 2 * (int)m_frameEncoder->m_numRows) @@ -400,79 +733,3 @@ cnt = (height - 1) * (width - 1); return ssim; } - -/* restore original YUV samples to recon after SAO (if lossless) */ -static void restoreOrigLosslessYuv(const CUData* cu, Frame& frame, uint32_t absPartIdx) -{ - int size = cu->m_log2CUSize[absPartIdx] - 2; - uint32_t cuAddr = cu->m_cuAddr; - - PicYuv* reconPic = frame.m_reconPic; - PicYuv* fencPic = frame.m_fencPic; - - pixel* dst = reconPic->getLumaAddr(cuAddr, absPartIdx); - pixel* src = fencPic->getLumaAddr(cuAddr, absPartIdx); - - primitives.cu[size].copy_pp(dst, reconPic->m_stride, src, fencPic->m_stride); - - pixel* dstCb = reconPic->getCbAddr(cuAddr, absPartIdx); - pixel* srcCb = fencPic->getCbAddr(cuAddr, absPartIdx); - - pixel* dstCr = reconPic->getCrAddr(cuAddr, absPartIdx); - pixel* srcCr = fencPic->getCrAddr(cuAddr, absPartIdx); - - int csp = fencPic->m_picCsp; - primitives.chroma[csp].cu[size].copy_pp(dstCb, reconPic->m_strideC, srcCb, fencPic->m_strideC); - primitives.chroma[csp].cu[size].copy_pp(dstCr, reconPic->m_strideC, srcCr, fencPic->m_strideC); -} - -/* Original YUV restoration for CU in lossless coding */ -static void origCUSampleRestoration(const CUData* cu, const CUGeom& cuGeom, Frame& frame) -{ - uint32_t absPartIdx = cuGeom.absPartIdx; - if (cu->m_cuDepth[absPartIdx] > cuGeom.depth) - { - for (int subPartIdx = 0; subPartIdx < 4; subPartIdx++) - { - const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx); - if (childGeom.flags & CUGeom::PRESENT) - origCUSampleRestoration(cu, childGeom, frame); - } - return; - } - - // restore original YUV samples - if (cu->m_tqBypass[absPartIdx]) - restoreOrigLosslessYuv(cu, frame, absPartIdx); -} - -void FrameFilter::processSao(int row) -{ - FrameData& encData = *m_frame->m_encData; - SAOParam* saoParam = encData.m_saoParam; - - if (saoParam->bSaoFlag[0]) - m_sao.processSaoUnitRow(saoParam->ctuParam[0], row, 0); - - if (saoParam->bSaoFlag[1]) - { - m_sao.processSaoUnitRow(saoParam->ctuParam[1], row, 1); - m_sao.processSaoUnitRow(saoParam->ctuParam[2], row, 2); - } - - if (encData.m_slice->m_pps->bTransquantBypassEnabled) - { - uint32_t numCols = encData.m_slice->m_sps->numCuInWidth; - uint32_t lineStartCUAddr = row * numCols; - - const CUGeom* cuGeoms = m_frameEncoder->m_cuGeoms; - const uint32_t* ctuGeomMap = m_frameEncoder->m_ctuGeomMap; - - for (uint32_t col = 0; col < numCols; col++) - { - uint32_t cuAddr = lineStartCUAddr + col; - const CUData* ctu = encData.getPicCTU(cuAddr); - origCUSampleRestoration(ctu, cuGeoms[ctuGeomMap[cuAddr]], *m_frame); - } - } -}
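The rewritten loop filter replaces the old row-at-a-time path with per-row ParallelFilter tasks that hand work off through three counters: m_allowedCol (how many CTUs the encode pipeline has finished in the row), m_lastCol (the next column this row's task will consume) and m_lastDeblocked (the last fully deblocked column, polled by the row below). SAO analysis trails the deblock cursor by two columns, and the previous row's SAO apply and border extension trail by three, so no CTU is read while a neighbour can still modify it. The standalone sketch below illustrates that handshake only; std::atomic<int> stands in for x265's ThreadSafeInteger and printf for the real per-CTU work.

// parallel_filter_sketch.cpp -- illustration of the column handshake
#include <atomic>
#include <cstdio>

struct ParallelRow
{
    std::atomic<int> allowedCol{0};     // published by the encode pipeline
    std::atomic<int> lastDeblocked{-1}; // published for the row below
    int lastCol = 0;                    // consumed only by this row's task
    int row = 0;
    ParallelRow* prevRow = nullptr;

    void processTasks()
    {
        const int colEnd = allowedCol.load(std::memory_order_acquire);
        for (; lastCol < colEnd; lastCol++)
        {
            const int col = lastCol;
            std::printf("row %d: deblock VER col %d\n", row, col);
            if (col >= 1)
            {
                std::printf("row %d: deblock HOR col %d\n", row, col - 1);
                if (col >= 2) // SAO analysis trails deblock by two columns
                    std::printf("row %d: SAO decide col %d\n", row, col - 2);
                if (prevRow && col >= 3) // apply on the row above, three back
                    std::printf("row %d: SAO apply col %d\n", row - 1, col - 3);
                lastDeblocked.store(col, std::memory_order_release);
            }
        }
    }
};

int main()
{
    ParallelRow rows[2];
    rows[1].row = 1;
    rows[1].prevRow = &rows[0];
    for (int r = 0; r < 2; r++)
    {
        rows[r].allowedCol.store(4); // pretend the encoder finished 4 CTUs
        rows[r].processTasks();
    }
}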
x265_1.8.tar.gz/source/encoder/framefilter.h -> x265_1.9.tar.gz/source/encoder/framefilter.h
Changed
@@ -29,6 +29,7 @@ #include "frame.h" #include "deblock.h" #include "sao.h" +#include "threadpool.h" // class BondedTaskGroup namespace X265_NS { // private x265 namespace @@ -39,7 +40,7 @@ struct ThreadLocalData; // Manages the processing of a single frame loopfilter -class FrameFilter : public Deblock +class FrameFilter { public: @@ -50,24 +51,86 @@ int m_vChromaShift; int m_pad[2]; - SAO m_sao; int m_numRows; + int m_numCols; int m_saoRowDelay; int m_lastHeight; + int m_lastWidth; - void* m_ssimBuf; /* Temp storage for ssim computation */ + void* m_ssimBuf; /* Temp storage for ssim computation */ - FrameFilter(); +#define MAX_PFILTER_CUS (4) /* maximum CUs for every thread */ + class ParallelFilter : public BondedTaskGroup, public Deblock + { + public: + uint32_t m_rowHeight; + int m_row; + uint32_t m_rowAddr; + FrameFilter* m_frameFilter; + FrameData* m_encData; + ParallelFilter* m_prevRow; + SAO m_sao; + ThreadSafeInteger m_lastCol; /* The column that next to process */ + ThreadSafeInteger m_allowedCol; /* The column that processed from Encode pipeline */ + ThreadSafeInteger m_lastDeblocked; /* The column that finished all of Deblock stages */ - void init(Encoder *top, FrameEncoder *frame, int numRows); + ParallelFilter() + : m_rowHeight(0) + , m_row(0) + , m_rowAddr(0) + , m_frameFilter(NULL) + , m_encData(NULL) + , m_prevRow(NULL) + { + } + + ~ParallelFilter() + { } + + void processTasks(int workerThreadId); + + // Apply SAO on a CU in current row + void processSaoUnitCu(SAOParam *saoParam, int col); + + // Copy and Save SAO reference pixels for SAO Rdo decide + void copySaoAboveRef(PicYuv* reconPic, uint32_t cuAddr, int col); + + // Post-Process (Border extension) + void processPostCu(int col) const; + + uint32_t getCUHeight() const + { + return m_rowHeight; + } + + protected: + + ParallelFilter operator=(const ParallelFilter&); + }; + + ParallelFilter* m_parallelFilter; + + FrameFilter() + : m_param(NULL) + , m_frame(NULL) + , m_frameEncoder(NULL) + , m_ssimBuf(NULL) + , m_parallelFilter(NULL) + { + } + + uint32_t getCUWidth(int colNum) const + { + return (colNum == (int)m_numCols - 1) ? m_lastWidth : g_maxCUSize; + } + + void init(Encoder *top, FrameEncoder *frame, int numRows, uint32_t numCols); void destroy(); void start(Frame *pic, Entropy& initState, int qp); void processRow(int row); - void processRowPost(int row); - void processSao(int row); - uint32_t getCUHeight(int rowNum) const; + void processPostRow(int row); }; }
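The new m_lastWidth (alongside m_lastHeight) caches the size of the partial CTU at the right/bottom frame edge, which getCUWidth() and getCUHeight() return for the final column/row. A standalone sketch of that arithmetic, with made-up frame dimensions:

#include <cassert>
#include <cstdint>

// Every CTU is maxCUSize except a partial one at the frame edge when the
// frame dimension is not a multiple of the CTU size.
static uint32_t edgeCuSize(uint32_t frameDim, uint32_t maxCUSize,
                           uint32_t idx, uint32_t numCus)
{
    if (idx != numCus - 1)
        return maxCUSize;
    const uint32_t rem = frameDim % maxCUSize;
    return rem ? rem : maxCUSize;
}

int main()
{
    // 1080 = 16 * 64 + 56: seventeen CTU rows, the last one 56 pixels tall
    assert(edgeCuSize(1080, 64, 16, 17) == 56);
    assert(edgeCuSize(1080, 64,  3, 17) == 64);
    // 1088 is a multiple of 64, so the last row is a full CTU
    assert(edgeCuSize(1088, 64, 16, 17) == 64);
}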
x265_1.8.tar.gz/source/encoder/level.cpp -> x265_1.9.tar.gz/source/encoder/level.cpp
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -462,7 +463,7 @@ { if (param->internalCsp != X265_CSP_I420) { - x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n", + x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input chroma subsampling.\n", profile, x265_source_csp_names[param->internalCsp]); return -1; } @@ -472,7 +473,7 @@ { if (param->internalCsp != X265_CSP_I420 && param->internalCsp != X265_CSP_I422) { - x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n", + x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input chroma subsampling.\n", profile, x265_source_csp_names[param->internalCsp]); return -1; }
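The reworded diagnostics reflect what the check actually tests: chroma subsampling, not color space. Main-profile variants admit only 4:2:0 input, and the 4:2:2 variants admit 4:2:0 or 4:2:2. A compressed sketch of that rule follows; cspAllowed() is an illustrative helper, not the x265 API.

#include <cstdio>

enum Csp { I400, I420, I422, I444 };

static bool cspAllowed(Csp csp, bool is422Profile)
{
    return is422Profile ? (csp == I420 || csp == I422) : (csp == I420);
}

int main()
{
    if (!cspAllowed(I444, false))
        std::printf("main profile not compatible with i444 input chroma subsampling\n");
}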
x265_1.8.tar.gz/source/encoder/motion.cpp -> x265_1.9.tar.gz/source/encoder/motion.cpp
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -188,11 +189,12 @@ satd = primitives.pu[partEnum].satd; sad_x3 = primitives.pu[partEnum].sad_x3; sad_x4 = primitives.pu[partEnum].sad_x4; + chromaSatd = primitives.chroma[fencPUYuv.m_csp].pu[partEnum].satd; /* Enable chroma residual cost if subpelRefine level is greater than 2 and chroma block size * is an even multiple of 4x4 pixels (indicated by non-null chromaSatd pointer) */ - bChromaSATD = subpelRefine > 2 && chromaSatd; + bChromaSATD = subpelRefine > 2 && chromaSatd && (srcFencYuv.m_csp != X265_CSP_I400); X265_CHECK(!(bChromaSATD && !workload[subpelRefine].hpel_satd), "Chroma SATD cannot be used with SAD hpel\n"); ctuAddr = _ctuAddr; @@ -1214,8 +1216,11 @@ const pixel* refCb = ref->getCbAddr(ctuAddr, absPartIdx) + refOffset; const pixel* refCr = ref->getCrAddr(ctuAddr, absPartIdx) + refOffset; - xFrac = qmv.x & ((1 << shiftHor) - 1); - yFrac = qmv.y & ((1 << shiftVer) - 1); + X265_CHECK((hshift == 0) || (hshift == 1), "hshift must be 0 or 1\n"); + X265_CHECK((vshift == 0) || (vshift == 1), "vshift must be 0 or 1\n"); + + xFrac = qmv.x & (hshift ? 7 : 3); + yFrac = qmv.y & (vshift ? 7 : 3); if (!(yFrac | xFrac)) {
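The replaced masks make the motion-vector precision explicit: a luma MV is stored in quarter-pel units (two fraction bits), and along a chroma axis subsampled by two the same value carries eighth-pel precision (three fraction bits), hence qmv & (shift ? 7 : 3). A standalone sketch of the split, with a worked example:

#include <cassert>

struct MvPart { int whole; int frac; };

// shift: 0 for luma, 1 for a chroma axis subsampled by two
static MvPart splitMv(int qmv, int shift)
{
    MvPart p;
    p.frac  = qmv & (shift ? 7 : 3); // low 2 or 3 bits
    p.whole = qmv >> (2 + shift);    // remaining integer offset
    return p;
}

int main()
{
    const MvPart luma = splitMv(13, 0);   // 13 quarter-pels = 3 + 1/4
    assert(luma.whole == 3 && luma.frac == 1);
    const MvPart chroma = splitMv(13, 1); // same value on 4:2:0 = 1 + 5/8
    assert(chroma.whole == 1 && chroma.frac == 5);
}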
x265_1.8.tar.gz/source/encoder/motion.h -> x265_1.9.tar.gz/source/encoder/motion.h
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by
x265_1.8.tar.gz/source/encoder/nal.cpp -> x265_1.9.tar.gz/source/encoder/nal.cpp
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> +* Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by
x265_1.8.tar.gz/source/encoder/ratecontrol.cpp -> x265_1.9.tar.gz/source/encoder/ratecontrol.cpp
Changed
@@ -23,6 +23,10 @@ * For more information, contact us at license @ x265.com. *****************************************************************************/ +#if _MSC_VER +#pragma warning(disable: 4127) // conditional expression is constant, yes I know +#endif + #include "common.h" #include "param.h" #include "frame.h" @@ -142,6 +146,9 @@ rce->expectedVbv = rce2Pass->expectedVbv; rce->blurredComplexity = rce2Pass->blurredComplexity; rce->sliceType = rce2Pass->sliceType; + rce->qpNoVbv = rce2Pass->qpNoVbv; + rce->newQp = rce2Pass->newQp; + rce->qRceq = rce2Pass->qRceq; } } // end anonymous namespace @@ -205,7 +212,7 @@ m_rateFactorMaxDecrement = m_param->rc.rfConstant - m_param->rc.rfConstantMin; } m_isAbr = m_param->rc.rateControlMode != X265_RC_CQP && !m_param->rc.bStatRead; - m_2pass = m_param->rc.rateControlMode == X265_RC_ABR && m_param->rc.bStatRead; + m_2pass = (m_param->rc.rateControlMode == X265_RC_ABR || m_param->rc.vbvMaxBitrate > 0) && m_param->rc.bStatRead; m_bitrate = m_param->rc.bitrate * 1000; m_frameDuration = (double)m_param->fpsDenom / m_param->fpsNum; m_qp = m_param->rc.qp; @@ -219,6 +226,7 @@ m_cutreeStatFileOut = m_cutreeStatFileIn = NULL; m_rce2Pass = NULL; m_lastBsliceSatdCost = 0; + m_movingAvgSum = 0.0; // vbv initialization m_param->rc.vbvBufferSize = x265_clip3(0, 2000000, m_param->rc.vbvBufferSize); @@ -444,6 +452,7 @@ CMP_OPT_FIRST_PASS("open-gop", m_param->bOpenGOP); CMP_OPT_FIRST_PASS("keyint", m_param->keyframeMax); CMP_OPT_FIRST_PASS("scenecut", m_param->scenecutThreshold); + CMP_OPT_FIRST_PASS("intra-refresh", m_param->bIntraRefresh); if ((p = strstr(opts, "b-adapt=")) != 0 && sscanf(p, "b-adapt=%d", &i) && i >= X265_B_ADAPT_NONE && i <= X265_B_ADAPT_TRELLIS) { @@ -488,6 +497,12 @@ x265_log(m_param, X265_LOG_ERROR, "Rce Entries for 2 pass cannot be allocated\n"); return false; } + m_encOrder = X265_MALLOC(int, m_numEntries); + if (!m_encOrder) + { + x265_log(m_param, X265_LOG_ERROR, "Encode order for 2 pass cannot be allocated\n"); + return false; + } /* init all to skipped p frames */ for (int i = 0; i < m_numEntries; i++) { @@ -504,22 +519,24 @@ { RateControlEntry *rce; int frameNumber; + int encodeOrder; char picType; int e; char *next; - double qpRc, qpAq; + double qpRc, qpAq, qNoVbv, qRceq; next = strstr(p, ";"); if (next) *next++ = 0; - e = sscanf(p, " in:%d ", &frameNumber); + e = sscanf(p, " in:%d out:%d", &frameNumber, &encodeOrder); if (frameNumber < 0 || frameNumber >= m_numEntries) { x265_log(m_param, X265_LOG_ERROR, "bad frame number (%d) at stats line %d\n", frameNumber, i); return false; } - rce = &m_rce2Pass[frameNumber]; - e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf", - &picType, &qpRc, &qpAq, &rce->coeffBits, + rce = &m_rce2Pass[encodeOrder]; + m_encOrder[frameNumber] = encodeOrder; + e += sscanf(p, " in:%*d out:%*d type:%c q:%lf q-aq:%lf q-noVbv:%lf q-Rceq:%lf tex:%d mv:%d misc:%d icu:%lf pcu:%lf scu:%lf", + &picType, &qpRc, &qpAq, &qNoVbv, &qRceq, &rce->coeffBits, &rce->mvBits, &rce->miscBits, &rce->iCuCount, &rce->pCuCount, &rce->skipCuCount); rce->keptAsRef = true; @@ -538,13 +555,16 @@ x265_log(m_param, X265_LOG_ERROR, "statistics are damaged at line %d, parser out=%d\n", i, e); return false; } - rce->qScale = x265_qp2qScale(qpRc); + rce->qScale = rce->newQScale = x265_qp2qScale(qpRc); totalQpAq += qpAq; + rce->qpNoVbv = qNoVbv; + rce->qpaRc = qpRc; + rce->qpAq = qpAq; + rce->qRceq = qRceq; p = next; } X265_FREE(statsBuf); - - if (m_param->rc.rateControlMode == X265_RC_ABR) + if 
(m_param->rc.rateControlMode == X265_RC_ABR || m_param->rc.vbvMaxBitrate > 0) { if (!initPass2()) return false; @@ -627,11 +647,8 @@ #undef MAX_DURATION } - -bool RateControl::initPass2() +bool RateControl::analyseABR2Pass(int startIndex, int endIndex, uint64_t allAvailableBits) { - uint64_t allConstBits = 0; - uint64_t allAvailableBits = uint64_t(m_param->rc.bitrate * 1000. * m_numEntries * m_frameDuration); double rateFactor, stepMult; double qBlur = m_param->rc.qblur; double cplxBlur = m_param->rc.complexityBlur; @@ -640,30 +657,19 @@ double *qScale, *blurredQscale; double baseCplx = m_ncu * (m_param->bframes ? 120 : 80); double clippedDuration = CLIP_DURATION(m_frameDuration) / BASE_FRAME_DURATION; - - /* find total/average complexity & const_bits */ - for (int i = 0; i < m_numEntries; i++) - allConstBits += m_rce2Pass[i].miscBits; - - if (allAvailableBits < allConstBits) - { - x265_log(m_param, X265_LOG_ERROR, "requested bitrate is too low. estimated minimum is %d kbps\n", - (int)(allConstBits * m_fps / m_numEntries * 1000.)); - return false; - } - + int framesCount = endIndex - startIndex + 1; /* Blur complexities, to reduce local fluctuation of QP. * We don't blur the QPs directly, because then one very simple frame * could drag down the QP of a nearby complex frame and give it more * bits than intended. */ - for (int i = 0; i < m_numEntries; i++) + for (int i = startIndex; i <= endIndex; i++) { double weightSum = 0; double cplxSum = 0; double weight = 1.0; double gaussianWeight; /* weighted average of cplx of future frames */ - for (int j = 1; j < cplxBlur * 2 && j < m_numEntries - i; j++) + for (int j = 1; j < cplxBlur * 2 && j <= endIndex - i; j++) { RateControlEntry *rcj = &m_rce2Pass[i + j]; weight *= 1 - pow(rcj->iCuCount / m_ncu, 2); @@ -687,11 +693,10 @@ } m_rce2Pass[i].blurredComplexity = cplxSum / weightSum; } - - CHECKED_MALLOC(qScale, double, m_numEntries); + CHECKED_MALLOC(qScale, double, framesCount); if (filterSize > 1) { - CHECKED_MALLOC(blurredQscale, double, m_numEntries); + CHECKED_MALLOC(blurredQscale, double, framesCount); } else blurredQscale = qScale; @@ -702,9 +707,8 @@ * because qscale2bits is not invertible, but we can start with the simple * approximation of scaling the 1st pass by the ratio of bitrates. * The search range is probably overkill, but speed doesn't matter here. */ - expectedBits = 1; - for (int i = 0; i < m_numEntries; i++) + for (int i = startIndex; i <= endIndex; i++) { RateControlEntry* rce = &m_rce2Pass[i]; double q = getQScale(rce, 1.0); @@ -781,12 +785,10 @@ X265_FREE(qScale); if (filterSize > 1) X265_FREE(blurredQscale); - if (m_isVbv) - if (!vbv2Pass(allAvailableBits)) + if (!vbv2Pass(allAvailableBits, endIndex, startIndex)) return false; - expectedBits = countExpectedBits(); - + expectedBits = countExpectedBits(startIndex, endIndex); if (fabs(expectedBits / allAvailableBits - 1.0) > 0.01) { double avgq = 0; @@ -819,7 +821,123 @@ return false; } -bool RateControl::vbv2Pass(uint64_t allAvailableBits) +bool RateControl::initPass2() +{ + uint64_t allConstBits = 0, allCodedBits = 0; + uint64_t allAvailableBits = uint64_t(m_param->rc.bitrate * 1000. 
* m_numEntries * m_frameDuration); + int startIndex, framesCount, endIndex; + int fps = (int)(m_fps + 0.5); + startIndex = endIndex = framesCount = 0; + bool isQpModified = true; + int diffQp = 0; + double targetBits = 0; + double expectedBits = 0; + for (startIndex = 0, endIndex = 0; endIndex < m_numEntries; endIndex++) + { + allConstBits += m_rce2Pass[endIndex].miscBits; + allCodedBits += m_rce2Pass[endIndex].coeffBits + m_rce2Pass[endIndex].mvBits; + if (m_param->rc.rateControlMode == X265_RC_CRF) + { + framesCount = endIndex - startIndex + 1; + diffQp += int (m_rce2Pass[endIndex].qpaRc - m_rce2Pass[endIndex].qpNoVbv); + if (framesCount > fps) + diffQp -= int (m_rce2Pass[endIndex - fps].qpaRc - m_rce2Pass[endIndex - fps].qpNoVbv); + if (framesCount >= fps) + { + if (diffQp >= 1) + { + if (!isQpModified && endIndex > fps) + { + double factor = 2; + double step = 0; + for (int start = endIndex; start <= endIndex + fps - 1 && start < m_numEntries; start++) + { + RateControlEntry *rce = &m_rce2Pass[start]; + targetBits += qScale2bits(rce, x265_qp2qScale(rce->qpNoVbv)); + expectedBits += qScale2bits(rce, rce->qScale); + } + if (expectedBits < 0.95 * targetBits) + { + isQpModified = true; + while (endIndex + fps < m_numEntries) + { + step = pow(2, factor / 6.0); + expectedBits = 0; + for (int start = endIndex; start <= endIndex + fps - 1; start++) + { + RateControlEntry *rce = &m_rce2Pass[start]; + rce->newQScale = rce->qScale / step; + X265_CHECK(rce->newQScale >= 0, "new Qscale is negative\n"); + expectedBits += qScale2bits(rce, rce->newQScale); + rce->newQp = x265_qScale2qp(rce->newQScale); + } + if (expectedBits >= targetBits && step > 1) + factor *= 0.90; + else + break; + } + + if (m_isVbv && endIndex + fps < m_numEntries) + if (!vbv2Pass((uint64_t)targetBits, endIndex + fps - 1, endIndex)) + return false; + + targetBits = 0; + expectedBits = 0; + + for (int start = endIndex - fps; start <= endIndex - 1; start++) + { + RateControlEntry *rce = &m_rce2Pass[start]; + targetBits += qScale2bits(rce, x265_qp2qScale(rce->qpNoVbv)); + } + while (1) + { + step = pow(2, factor / 6.0); + expectedBits = 0; + for (int start = endIndex - fps; start <= endIndex - 1; start++) + { + RateControlEntry *rce = &m_rce2Pass[start]; + rce->newQScale = rce->qScale * step; + X265_CHECK(rce->newQScale >= 0, "new Qscale is negative\n"); + expectedBits += qScale2bits(rce, rce->newQScale); + rce->newQp = x265_qScale2qp(rce->newQScale); + } + if (expectedBits > targetBits && step > 1) + factor *= 1.1; + else + break; + } + if (m_isVbv) + if (!vbv2Pass((uint64_t)targetBits, endIndex - 1, endIndex - fps)) + return false; + diffQp = 0; + startIndex = endIndex + 1; + targetBits = expectedBits = 0; + } + else + targetBits = expectedBits = 0; + } + } + else + isQpModified = false; + } + } + } + + if (m_param->rc.rateControlMode == X265_RC_ABR) + { + if (allAvailableBits < allConstBits) + { + x265_log(m_param, X265_LOG_ERROR, "requested bitrate is too low. estimated minimum is %d kbps\n", + (int)(allConstBits * m_fps / framesCount * 1000.)); + return false; + } + if (!analyseABR2Pass(0, m_numEntries - 1, allAvailableBits)) + return false; + } + return true; +} + +bool RateControl::vbv2Pass(uint64_t allAvailableBits, int endPos, int startPos) { /* for each interval of bufferFull .. 
underflow, uniformly increase the qp of all * frames in the interval until either buffer is full at some intermediate frame or the @@ -845,10 +963,10 @@ { /* not first iteration */ adjustment = X265_MAX(X265_MIN(expectedBits / allAvailableBits, 0.999), 0.9); fills[-1] = m_bufferSize * m_param->rc.vbvBufferInit; - t0 = 0; + t0 = startPos; /* fix overflows */ adjMin = 1; - while (adjMin && findUnderflow(fills, &t0, &t1, 1)) + while (adjMin && findUnderflow(fills, &t0, &t1, 1, endPos)) { adjMin = fixUnderflow(t0, t1, adjustment, MIN_QPSCALE, MAX_MAX_QPSCALE); t0 = t1; @@ -859,20 +977,16 @@ t0 = 0; /* fix underflows -- should be done after overflow, as we'd better undersize target than underflowing VBV */ adjMax = 1; - while (adjMax && findUnderflow(fills, &t0, &t1, 0)) + while (adjMax && findUnderflow(fills, &t0, &t1, 0, endPos)) adjMax = fixUnderflow(t0, t1, 1.001, MIN_QPSCALE, MAX_MAX_QPSCALE ); - - expectedBits = countExpectedBits(); + expectedBits = countExpectedBits(startPos, endPos); } - while ((expectedBits < .995 * allAvailableBits) && ((int64_t)(expectedBits+.5) > (int64_t)(prevBits+.5))); - + while ((expectedBits < .995 * allAvailableBits) && ((int64_t)(expectedBits+.5) > (int64_t)(prevBits+.5)) && !(m_param->rc.rateControlMode == X265_RC_CRF)); if (!adjMax) x265_log(m_param, X265_LOG_WARNING, "vbv-maxrate issue, qpmax or vbv-maxrate too low\n"); - /* store expected vbv filling values for tracking when encoding */ - for (int i = 0; i < m_numEntries; i++) + for (int i = startPos; i <= endPos; i++) m_rce2Pass[i].expectedVbv = m_bufferSize - fills[i]; - X265_FREE(fills - 1); return true; @@ -912,9 +1026,10 @@ m_param->bframes = 1; return X265_TYPE_AUTO; } - int frameType = m_rce2Pass[frameNum].sliceType == I_SLICE ? (frameNum > 0 && m_param->bOpenGOP ? X265_TYPE_I : X265_TYPE_IDR) - : m_rce2Pass[frameNum].sliceType == P_SLICE ? X265_TYPE_P - : (m_rce2Pass[frameNum].sliceType == B_SLICE && m_rce2Pass[frameNum].keptAsRef? X265_TYPE_BREF : X265_TYPE_B); + int index = m_encOrder[frameNum]; + int frameType = m_rce2Pass[index].sliceType == I_SLICE ? (frameNum > 0 && m_param->bOpenGOP ? X265_TYPE_I : X265_TYPE_IDR) + : m_rce2Pass[index].sliceType == P_SLICE ? X265_TYPE_P + : (m_rce2Pass[index].sliceType == B_SLICE && m_rce2Pass[index].keptAsRef ? 
X265_TYPE_BREF : X265_TYPE_B); return frameType; } else @@ -926,16 +1041,20 @@ /* Frame Predictors used in vbv */ for (int i = 0; i < 4; i++) { + m_pred[i].coeffMin = 1.0 / 4; m_pred[i].coeff = 1.0; m_pred[i].count = 1.0; m_pred[i].decay = 0.5; m_pred[i].offset = 0.0; } m_pred[0].coeff = m_pred[3].coeff = 0.75; + m_pred[0].coeffMin = m_pred[3].coeffMin = 0.75 / 4; if (m_param->rc.qCompress >= 0.8) // when tuned for grain { + m_pred[1].coeffMin = 0.75 / 4; m_pred[1].coeff = 0.75; - m_pred[0].coeff = m_pred[3].coeff = 0.50; + m_pred[0].coeff = m_pred[3].coeff = 0.5; + m_pred[0].coeffMin = m_pred[3].coeffMin = 0.5 / 4; } } @@ -965,10 +1084,11 @@ if (m_param->rc.bStatRead) { X265_CHECK(rce->poc >= 0 && rce->poc < m_numEntries, "bad encode ordinal\n"); - copyRceData(rce, &m_rce2Pass[rce->poc]); + int index = m_encOrder[rce->poc]; + copyRceData(rce, &m_rce2Pass[index]); } rce->isActive = true; - bool isRefFrameScenecut = m_sliceType!= I_SLICE && m_curSlice->m_refPicList[0][0]->m_lowres.bScenecut == 1; + bool isRefFrameScenecut = m_sliceType!= I_SLICE && m_curSlice->m_refFrameList[0][0]->m_lowres.bScenecut; if (curFrame->m_lowres.bScenecut) { m_isSceneTransition = true; @@ -995,6 +1115,7 @@ { for (int j = 0; j < 2; j++) { + rce->rowPreds[i][j].coeffMin = 0.25 / 4; rce->rowPreds[i][j].coeff = 0.25; rce->rowPreds[i][j].count = 1.0; rce->rowPreds[i][j].decay = 0.5; @@ -1029,6 +1150,17 @@ } } } + if (!m_isAbr && m_2pass && m_param->rc.rateControlMode == X265_RC_CRF) + { + rce->qpPrev = x265_qScale2qp(rce->qScale); + rce->qScale = rce->newQScale; + rce->qpaRc = curEncData.m_avgQpRc = curEncData.m_avgQpAq = x265_qScale2qp(rce->newQScale); + m_qp = int(rce->qpaRc + 0.5); + rce->frameSizePlanned = qScale2bits(rce, rce->qScale); + m_framesDone++; + return m_qp; + } + if (m_isAbr || m_2pass) // ABR,CRF { if (m_isAbr || m_isVbv) @@ -1200,11 +1332,10 @@ } return q; } - -double RateControl::countExpectedBits() +double RateControl::countExpectedBits(int startPos, int endPos) { double expectedBits = 0; - for( int i = 0; i < m_numEntries; i++ ) + for (int i = startPos; i <= endPos; i++) { RateControlEntry *rce = &m_rce2Pass[i]; rce->expectedBits = (uint64_t)expectedBits; @@ -1212,8 +1343,7 @@ } return expectedBits; } - -bool RateControl::findUnderflow(double *fills, int *t0, int *t1, int over) +bool RateControl::findUnderflow(double *fills, int *t0, int *t1, int over, int endPos) { /* find an interval ending on an overflow or underflow (depending on whether * we're adding or removing bits), and starting on the earliest frame that @@ -1223,7 +1353,7 @@ double fill = fills[*t0 - 1]; double parity = over ? 1. : -1.; int start = -1, end = -1; - for (int i = *t0; i < m_numEntries; i++) + for (int i = *t0; i <= endPos; i++) { fill += (m_frameDuration * m_vbvMaxRate - qScale2bits(&m_rce2Pass[i], m_rce2Pass[i].newQScale)) * parity; @@ -1260,12 +1390,11 @@ } return adjusted; } - bool RateControl::cuTreeReadFor2Pass(Frame* frame) { - uint8_t sliceTypeActual = (uint8_t)m_rce2Pass[frame->m_poc].sliceType; - - if (m_rce2Pass[frame->m_poc].keptAsRef) + int index = m_encOrder[frame->m_poc]; + uint8_t sliceTypeActual = (uint8_t)m_rce2Pass[index].sliceType; + if (m_rce2Pass[index].keptAsRef) { /* TODO: We don't need pre-lookahead to measure AQ offsets, but there is currently * no way to signal this */ @@ -1347,18 +1476,28 @@ { if (m_isAbr) { - double slidingWindowCplxSum = 0; - int start = m_sliderPos > s_slidingWindowFrames ? 
m_sliderPos : 0; - for (int cnt = 0; cnt < s_slidingWindowFrames; cnt++, start++) - { - int pos = start % s_slidingWindowFrames; - slidingWindowCplxSum *= 0.5; - if (!m_satdCostWindow[pos]) - break; - slidingWindowCplxSum += m_satdCostWindow[pos]; + int pos = m_sliderPos % s_slidingWindowFrames; + int addPos = (pos + s_slidingWindowFrames - 1) % s_slidingWindowFrames; + if (m_sliderPos > s_slidingWindowFrames) + { + const static double base = pow(0.5, s_slidingWindowFrames - 1); + m_movingAvgSum -= m_lastRemovedSatdCost * base; + m_movingAvgSum *= 0.5; + m_movingAvgSum += m_satdCostWindow[addPos]; } - rce->movingAvgSum = slidingWindowCplxSum; - m_satdCostWindow[m_sliderPos % s_slidingWindowFrames] = rce->lastSatd; + else if (m_sliderPos == s_slidingWindowFrames) + { + m_movingAvgSum += m_satdCostWindow[addPos]; + } + else if (m_sliderPos > 0) + { + m_movingAvgSum += m_satdCostWindow[addPos]; + m_movingAvgSum *= 0.5; + } + + rce->movingAvgSum = m_movingAvgSum; + m_lastRemovedSatdCost = m_satdCostWindow[pos]; + m_satdCostWindow[pos] = rce->lastSatd; m_sliderPos++; } } @@ -1367,10 +1506,10 @@ { /* B-frames don't have independent rate control, but rather get the * average QP of the two adjacent P-frames + an offset */ - Slice* prevRefSlice = m_curSlice->m_refPicList[0][0]->m_encData->m_slice; - Slice* nextRefSlice = m_curSlice->m_refPicList[1][0]->m_encData->m_slice; - double q0 = m_curSlice->m_refPicList[0][0]->m_encData->m_avgQpRc; - double q1 = m_curSlice->m_refPicList[1][0]->m_encData->m_avgQpRc; + Slice* prevRefSlice = m_curSlice->m_refFrameList[0][0]->m_encData->m_slice; + Slice* nextRefSlice = m_curSlice->m_refFrameList[1][0]->m_encData->m_slice; + double q0 = m_curSlice->m_refFrameList[0][0]->m_encData->m_avgQpRc; + double q1 = m_curSlice->m_refFrameList[1][0]->m_encData->m_avgQpRc; bool i0 = prevRefSlice->m_sliceType == I_SLICE; bool i1 = nextRefSlice->m_sliceType == I_SLICE; int dt0 = abs(m_curSlice->m_poc - prevRefSlice->m_poc); @@ -1386,9 +1525,9 @@ q0 = q1; } } - if (prevRefSlice->m_sliceType == B_SLICE && IS_REFERENCED(m_curSlice->m_refPicList[0][0])) + if (prevRefSlice->m_sliceType == B_SLICE && IS_REFERENCED(m_curSlice->m_refFrameList[0][0])) q0 -= m_pbOffset / 2; - if (nextRefSlice->m_sliceType == B_SLICE && IS_REFERENCED(m_curSlice->m_refPicList[1][0])) + if (nextRefSlice->m_sliceType == B_SLICE && IS_REFERENCED(m_curSlice->m_refFrameList[1][0])) q1 -= m_pbOffset / 2; if (i0 && i1) q = (q0 + q1) / 2 + m_ipOffset; @@ -1512,7 +1651,7 @@ * Then bias the quant up or down if total size so far was far from * the target. * Result: Depending on the value of rate_tolerance, there is a - * tradeoff between quality and bitrate precision. But at large + * trade-off between quality and bitrate precision. But at large * tolerances, the bit distribution approaches that of 2pass. 
*/ double overflow = 1; @@ -1584,7 +1723,7 @@ q = clipQscale(curFrame, rce, q); /* clip qp to permissible range after vbv-lookahead estimation to avoid possible * mispredictions by initial frame size predictors, after each scenecut */ - bool isFrameAfterScenecut = m_sliceType!= I_SLICE && m_curSlice->m_refPicList[0][0]->m_lowres.bScenecut; + bool isFrameAfterScenecut = m_sliceType!= I_SLICE && m_curSlice->m_refFrameList[0][0]->m_lowres.bScenecut; if (!m_2pass && m_isVbv && isFrameAfterScenecut) q = x265_clip3(lqmin, lqmax, q); } @@ -1714,7 +1853,7 @@ } seiBP->m_initialCpbRemovalDelay = (uint32_t)(num * cpbState + denom) / denom; - seiBP->m_initialCpbRemovalDelayOffset = (uint32_t)(num * cpbSize + denom) / denom - seiBP->m_initialCpbRemovalDelay; + seiBP->m_initialCpbRemovalDelayOffset = (uint32_t)((num * cpbSize + denom) / denom - seiBP->m_initialCpbRemovalDelay); } void RateControl::updateVbvPlan(Encoder* enc) @@ -1869,7 +2008,7 @@ double qScale = x265_qp2qScale(qpVbv); FrameData& curEncData = *curFrame->m_encData; int picType = curEncData.m_slice->m_sliceType; - Frame* refFrame = curEncData.m_slice->m_refPicList[0][0]; + Frame* refFrame = curEncData.m_slice->m_refFrameList[0][0]; uint32_t maxRows = curEncData.m_slice->m_sps->numCuInHeight; uint32_t maxCols = curEncData.m_slice->m_sps->numCuInWidth; @@ -1945,6 +2084,8 @@ int RateControl::rowDiagonalVbvRateControl(Frame* curFrame, uint32_t row, RateControlEntry* rce, double& qpVbv) { + if (m_param->rc.bStatRead && m_param->rc.rateControlMode == X265_RC_CRF) + return 0; FrameData& curEncData = *curFrame->m_encData; double qScaleVbv = x265_qp2qScale(qpVbv); uint64_t rowSatdCost = curEncData.m_rowStat[row].diagSatd; @@ -1957,9 +2098,9 @@ } rowSatdCost >>= X265_DEPTH - 8; updatePredictor(rce->rowPred[0], qScaleVbv, (double)rowSatdCost, encodedBits); - if (curEncData.m_slice->m_sliceType == P_SLICE) + if (curEncData.m_slice->m_sliceType != I_SLICE) { - Frame* refFrame = curEncData.m_slice->m_refPicList[0][0]; + Frame* refFrame = curEncData.m_slice->m_refFrameList[0][0]; if (qpVbv < refFrame->m_encData->m_rowStat[row].diagQp) { uint64_t intraRowSatdCost = curEncData.m_rowStat[row].diagIntraSatd; @@ -2137,7 +2278,8 @@ return; const double range = 2; double old_coeff = p->coeff / p->count; - double new_coeff = bits * q / var; + double old_offset = p->offset / p->count; + double new_coeff = X265_MAX((bits * q - old_offset) / var, p->coeffMin ); double new_coeff_clipped = x265_clip3(old_coeff / range, old_coeff * range, new_coeff); double new_offset = bits * q - new_coeff_clipped * var; if (new_offset >= 0) @@ -2192,7 +2334,7 @@ if (m_param->rc.aqMode || m_isVbv) { - if (m_isVbv) + if (m_isVbv && !(m_2pass && m_param->rc.rateControlMode == X265_RC_CRF)) { /* determine avg QP decided by VBV rate control */ for (uint32_t i = 0; i < slice->m_sps->numCuInHeight; i++) @@ -2218,20 +2360,31 @@ { if (m_param->rc.rateControlMode == X265_RC_ABR && !m_param->rc.bStatRead) checkAndResetABR(rce, true); - - if (m_param->rc.rateControlMode == X265_RC_CRF) + } + if (m_param->rc.rateControlMode == X265_RC_CRF) + { + double crfVal, qpRef = curEncData.m_avgQpRc; + bool is2passCrfChange = false; + if (m_2pass) { - if (int(curEncData.m_avgQpRc + 0.5) == slice->m_sliceQp) - curEncData.m_rateFactor = m_rateFactorConstant; - else + if (abs(curEncData.m_avgQpRc - rce->qpPrev) > 0.1) { - /* If vbv changed the frame QP recalculate the rate-factor */ - double baseCplx = m_ncu * (m_param->bframes ? 120 : 80); - double mbtree_offset = m_param->rc.cuTree ? 
(1.0 - m_param->rc.qCompress) * 13.5 : 0; - curEncData.m_rateFactor = pow(baseCplx, 1 - m_qCompress) / - x265_qp2qScale(int(curEncData.m_avgQpRc + 0.5) + mbtree_offset); + qpRef = rce->qpPrev; + is2passCrfChange = true; } } + if (is2passCrfChange || abs(qpRef - rce->qpNoVbv) > 0.5) + { + double crfFactor = rce->qRceq /x265_qp2qScale(qpRef); + double baseCplx = m_ncu * (m_param->bframes ? 120 : 80); + double mbtree_offset = m_param->rc.cuTree ? (1.0 - m_param->rc.qCompress) * 13.5 : 0; + crfVal = x265_qScale2qp(pow(baseCplx, 1 - m_qCompress) / crfFactor) - mbtree_offset; + } + else + crfVal = rce->sliceType == I_SLICE ? m_param->rc.rfConstant - m_ipOffset : + (rce->sliceType == B_SLICE ? m_param->rc.rfConstant + m_pbOffset : m_param->rc.rfConstant); + + curEncData.m_rateFactor = crfVal; } if (m_isAbr && !m_isAbrReset) @@ -2325,9 +2478,10 @@ : rce->sliceType == P_SLICE ? 'P' : IS_REFERENCED(curFrame) ? 'B' : 'b'; if (fprintf(m_statFileOut, - "in:%d out:%d type:%c q:%.2f q-aq:%.2f tex:%d mv:%d misc:%d icu:%.2f pcu:%.2f scu:%.2f ;\n", + "in:%d out:%d type:%c q:%.2f q-aq:%.2f q-noVbv:%.2f q-Rceq:%.2f tex:%d mv:%d misc:%d icu:%.2f pcu:%.2f scu:%.2f ;\n", rce->poc, rce->encodeOrder, cType, curEncData.m_avgQpRc, curEncData.m_avgQpAq, + rce->qpNoVbv, rce->qRceq, curFrame->m_encData->m_frameStats.coeffBits, curFrame->m_encData->m_frameStats.mvBits, curFrame->m_encData->m_frameStats.miscBits,
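Among the rate-control changes above, the sliding-window rewrite replaces a per-frame loop over the SATD cost window with an O(1) update of the same exponentially weighted sum: subtract the evicted cost at its fully decayed weight 0.5^(W-1), halve the sum, then add the newest cost. The standalone sketch below checks the incremental form against the original loop on toy data; W stands in for s_slidingWindowFrames.

#include <cassert>
#include <cmath>
#include <deque>

static const int W = 20;

// The original form: walk the window oldest to newest, halving before each add.
static double loopSum(const std::deque<double>& win)
{
    double s = 0;
    for (double v : win) { s *= 0.5; s += v; }
    return s;
}

int main()
{
    std::deque<double> win;
    double incSum = 0;
    const double base = std::pow(0.5, W - 1); // weight of the oldest entry

    for (int i = 1; i <= 100; i++)
    {
        const double satd = 1000.0 + 37.0 * (i % 7); // arbitrary test costs
        if ((int)win.size() == W)
        {
            incSum -= win.front() * base; // remove the evicted cost's share
            win.pop_front();
        }
        incSum = incSum * 0.5 + satd;     // decay, then admit the new cost
        win.push_back(satd);
        assert(std::fabs(incSum - loopSum(win)) < 1e-6);
    }
}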
x265_1.8.tar.gz/source/encoder/ratecontrol.h -> x265_1.9.tar.gz/source/encoder/ratecontrol.h
Changed
@@ -48,6 +48,7 @@ struct Predictor { + double coeffMin; double coeff; double count; double decay; @@ -74,6 +75,7 @@ double qpaRc; double qpAq; double qRceq; + double qpPrev; double frameSizePlanned; /* frame Size decided by RateCotrol before encoding the frame */ double bufferRate; double movingAvgSum; @@ -167,6 +169,8 @@ int64_t m_satdCostWindow[50]; int64_t m_encodedBitsWindow[50]; int m_sliderPos; + int64_t m_lastRemovedSatdCost; + double m_movingAvgSum; /* To detect a pattern of low detailed static frames in single pass ABR using satdcosts */ int64_t m_lastBsliceSatdCost; @@ -205,8 +209,8 @@ double m_lastAccumPNorm; double m_expectedBitsSum; /* sum of qscale2bits after rceq, ratefactor, and overflow, only includes finished frames */ int64_t m_predictedBits; + int *m_encOrder; RateControlEntry* m_rce2Pass; - struct { uint16_t *qpBuffer[2]; /* Global buffers for converting MB-tree quantizer data. */ @@ -258,11 +262,12 @@ void checkAndResetABR(RateControlEntry* rce, bool isFrameDone); double predictRowsSizeSum(Frame* pic, RateControlEntry* rce, double qpm, int32_t& encodedBits); bool initPass2(); + bool analyseABR2Pass(int startPoc, int endPoc, uint64_t allAvailableBits); void initFramePredictors(); double getDiffLimitedQScale(RateControlEntry *rce, double q); - double countExpectedBits(); - bool vbv2Pass(uint64_t allAvailableBits); - bool findUnderflow(double *fills, int *t0, int *t1, int over); + double countExpectedBits(int startPos, int framesCount); + bool vbv2Pass(uint64_t allAvailableBits, int frameCount, int startPos); + bool findUnderflow(double *fills, int *t0, int *t1, int over, int framesCount); bool fixUnderflow(int t0, int t1, double adjustment, double qscaleMin, double qscaleMax); }; }
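The new coeffMin member floors the coefficient refitted in updatePredictor(), so a single anomalous frame cannot drag the size model, roughly bits ~ (coeff * satd + offset) / qScale, toward zero. A simplified sketch of that predict/update cycle; the decay and count bookkeeping here only loosely mirrors x265.

#include <algorithm>
#include <cstdio>

struct Pred
{
    double coeffMin, coeff, count, decay, offset;

    double predict(double qScale, double satd) const
    {
        return (coeff * satd + offset) / (qScale * count);
    }

    void update(double qScale, double satd, double bits)
    {
        const double range = 2;
        const double oldCoeff  = coeff / count;
        const double oldOffset = offset / count;
        // refit from the observed frame, floored at coeffMin
        double newCoeff = std::max((bits * qScale - oldOffset) / satd, coeffMin);
        newCoeff = std::min(std::max(newCoeff, oldCoeff / range), oldCoeff * range);
        count  = count  * decay + 1;
        coeff  = coeff  * decay + newCoeff;
        offset = offset * decay + std::max(bits * qScale - newCoeff * satd, 0.0);
    }
};

int main()
{
    Pred p = { 0.25 / 4, 0.25, 1.0, 0.5, 0.0 };
    p.update(28.0, 50000.0, 400.0);
    std::printf("predicted bits under the same conditions: %.0f\n",
                p.predict(28.0, 50000.0));
}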
x265_1.8.tar.gz/source/encoder/rdcost.h -> x265_1.9.tar.gz/source/encoder/rdcost.h
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> +* Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -73,13 +74,18 @@ qpCr = x265_clip3(QP_MIN, QP_MAX_SPEC, qp + slice.m_pps->chromaQpOffset[1]); } - int chroma_offset_idx = X265_MIN(qp - qpCb + 12, MAX_CHROMA_LAMBDA_OFFSET); - uint16_t lambdaOffset = m_psyRd ? x265_chroma_lambda2_offset_tab[chroma_offset_idx] : 256; - m_chromaDistWeight[0] = lambdaOffset; + if (slice.m_sps->chromaFormatIdc == X265_CSP_I444) + { + int chroma_offset_idx = X265_MIN(qp - qpCb + 12, MAX_CHROMA_LAMBDA_OFFSET); + uint16_t lambdaOffset = m_psyRd ? x265_chroma_lambda2_offset_tab[chroma_offset_idx] : 256; + m_chromaDistWeight[0] = lambdaOffset; - chroma_offset_idx = X265_MIN(qp - qpCr + 12, MAX_CHROMA_LAMBDA_OFFSET); - lambdaOffset = m_psyRd ? x265_chroma_lambda2_offset_tab[chroma_offset_idx] : 256; - m_chromaDistWeight[1] = lambdaOffset; + chroma_offset_idx = X265_MIN(qp - qpCr + 12, MAX_CHROMA_LAMBDA_OFFSET); + lambdaOffset = m_psyRd ? x265_chroma_lambda2_offset_tab[chroma_offset_idx] : 256; + m_chromaDistWeight[1] = lambdaOffset; + } + else + m_chromaDistWeight[0] = m_chromaDistWeight[1] = 256; } void setLambda(double lambda2, double lambda) @@ -88,9 +94,9 @@ m_lambda = (uint64_t)floor(256.0 * lambda); } - inline uint64_t calcRdCost(sse_ret_t distortion, uint32_t bits) const + inline uint64_t calcRdCost(sse_t distortion, uint32_t bits) const { -#if X265_DEPTH <= 10 +#if X265_DEPTH < 10 X265_CHECK(bits <= (UINT64_MAX - 128) / m_lambda2, "calcRdCost wrap detected dist: %u, bits %u, lambda: " X265_LL "\n", distortion, bits, m_lambda2); @@ -108,15 +114,18 @@ return primitives.cu[size].psy_cost_pp(source, sstride, recon, rstride); } - /* return the difference in energy between the source block and the recon block */ - inline int psyCost(int size, const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride) const - { - return primitives.cu[size].psy_cost_ss(source, sstride, recon, rstride); - } - /* return the RD cost of this prediction, including the effect of psy-rd */ - inline uint64_t calcPsyRdCost(sse_ret_t distortion, uint32_t bits, uint32_t psycost) const + inline uint64_t calcPsyRdCost(sse_t distortion, uint32_t bits, uint32_t psycost) const { +#if X265_DEPTH < 10 + X265_CHECK((bits <= (UINT64_MAX / m_lambda2)) && (psycost <= UINT64_MAX / (m_lambda * m_psyRd)), + "calcPsyRdCost wrap detected dist: %u, bits: %u, lambda: " X265_LL ", lambda2: " X265_LL "\n", + distortion, bits, m_lambda, m_lambda2); +#else + X265_CHECK((bits <= (UINT64_MAX / m_lambda2)) && (psycost <= UINT64_MAX / (m_lambda * m_psyRd)), + "calcPsyRdCost wrap detected dist: " X265_LL ", bits: %u, lambda: " X265_LL ", lambda2: " X265_LL "\n", + distortion, bits, m_lambda, m_lambda2); +#endif return distortion + ((m_lambda * m_psyRd * psycost) >> 24) + ((bits * m_lambda2) >> 8); } @@ -127,9 +136,9 @@ return sadCost + ((bits * m_lambda + 128) >> 8); } - inline sse_ret_t scaleChromaDist(uint32_t plane, sse_ret_t dist) const + inline sse_t scaleChromaDist(uint32_t plane, sse_t dist) const { -#if X265_DEPTH <= 10 +#if X265_DEPTH < 10 X265_CHECK(dist <= (UINT64_MAX - 128) / m_chromaDistWeight[plane - 1], "scaleChromaDist wrap detected dist: %u, lambda: %u\n", dist, m_chromaDistWeight[plane - 1]); @@ -138,11 +147,13 @@ "scaleChromaDist wrap detected dist: " X265_LL " lambda: %u\n", dist, m_chromaDistWeight[plane - 1]); 
#endif - return (sse_ret_t)((dist * (uint64_t)m_chromaDistWeight[plane - 1] + 128) >> 8); + return (sse_t)((dist * (uint64_t)m_chromaDistWeight[plane - 1] + 128) >> 8); } inline uint32_t getCost(uint32_t bits) const { + X265_CHECK(bits <= (UINT64_MAX - 128) / m_lambda, + "getCost wrap detected bits: %u, lambda: " X265_LL "\n", bits, m_lambda); return (uint32_t)((bits * m_lambda + 128) >> 8); } };
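The rdcost.h hunks above all revolve around x265's 8.8 fixed-point lambda arithmetic: setLambda() stores floor(256.0 * lambda), so every cost is an integer multiply followed by a >> 8, and the X265_CHECK asserts guard the 64-bit products against wrap. A minimal standalone sketch of that pattern (an illustration, not the x265 class itself):

#include <cassert>
#include <cmath>
#include <cstdint>

// lambda kept in 8.8 fixed point, as setLambda() does above
static uint64_t toFix8(double lambda) { return (uint64_t)std::floor(256.0 * lambda); }

// dist + lambda2 * bits, with +128 for round-to-nearest;
// the assert mirrors the wrap detection in calcRdCost()
static uint64_t rdCost(uint64_t dist, uint32_t bits, uint64_t lambda2)
{
    assert(lambda2 == 0 || bits <= (UINT64_MAX - 128) / lambda2);
    return dist + ((bits * lambda2 + 128) >> 8);
}

The psy variant in the diff works the same way, adding ((m_lambda * m_psyRd * psycost) >> 24) as a third term, where the larger shift removes the fixed-point scale accumulated by the extra factors.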
View file
x265_1.8.tar.gz/source/encoder/reference.cpp -> x265_1.9.tar.gz/source/encoder/reference.cpp
Changed
@@ -68,7 +68,7 @@ intptr_t stride = reconPic->m_stride; int cuHeight = g_maxCUSize; - for (int c = 0; c < numInterpPlanes; c++) + for (int c = 0; c < (p.internalCsp != X265_CSP_I400 ? numInterpPlanes : 1); c++) { if (c == 1) {
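The reference.cpp change is the simplest instance of the YUV 4:0:0 support called out in the changelog: wherever the encoder loops over planes, the loop now collapses to luma only when internalCsp is X265_CSP_I400. A sketch of the guard, with processPlane() as a hypothetical stand-in for the per-plane work:

#include "x265.h"  // public header; provides x265_param and X265_CSP_I400
#include <cstdio>

static void processPlane(int plane) { std::printf("plane %d\n", plane); } // stand-in worker

static void forEachPlane(const x265_param& p)
{
    int numPlanes = (p.internalCsp != X265_CSP_I400) ? 3 : 1; // 4:0:0 has no Cb/Cr
    for (int c = 0; c < numPlanes; c++)
        processPlane(c); // c == 0 is luma; 1 and 2 never run for monochrome
}

Here the half-pel interpolation of reference planes is simply skipped for the chroma planes that do not exist.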
View file
x265_1.8.tar.gz/source/encoder/sao.cpp -> x265_1.9.tar.gz/source/encoder/sao.cpp
Changed
@@ -73,9 +73,6 @@ SAO::SAO() { - m_count = NULL; - m_offset = NULL; - m_offsetOrg = NULL; m_countPreDblk = NULL; m_offsetOrgPreDblk = NULL; m_refDepth = 0; @@ -84,28 +81,22 @@ m_param = NULL; m_clipTable = NULL; m_clipTableBase = NULL; - m_tmpU1[0] = NULL; - m_tmpU1[1] = NULL; - m_tmpU1[2] = NULL; - m_tmpU2[0] = NULL; - m_tmpU2[1] = NULL; - m_tmpU2[2] = NULL; - m_tmpL1 = NULL; - m_tmpL2 = NULL; - - m_depthSaoRate[0][0] = 0; - m_depthSaoRate[0][1] = 0; - m_depthSaoRate[0][2] = 0; - m_depthSaoRate[0][3] = 0; - m_depthSaoRate[1][0] = 0; - m_depthSaoRate[1][1] = 0; - m_depthSaoRate[1][2] = 0; - m_depthSaoRate[1][3] = 0; + m_tmpU[0] = NULL; + m_tmpU[1] = NULL; + m_tmpU[2] = NULL; + m_tmpL1[0] = NULL; + m_tmpL1[1] = NULL; + m_tmpL1[2] = NULL; + m_tmpL2[0] = NULL; + m_tmpL2[1] = NULL; + m_tmpL2[2] = NULL; + m_depthSaoRate = NULL; } -bool SAO::create(x265_param* param) +bool SAO::create(x265_param* param, int initCommon) { m_param = param; + m_chromaFormat = param->internalCsp; m_hChromaShift = CHROMA_H_SHIFT(param->internalCsp); m_vChromaShift = CHROMA_V_SHIFT(param->internalCsp); @@ -116,37 +107,56 @@ const pixel rangeExt = maxY >> 1; int numCtu = m_numCuInWidth * m_numCuInHeight; - CHECKED_MALLOC(m_clipTableBase, pixel, maxY + 2 * rangeExt); - - CHECKED_MALLOC(m_tmpL1, pixel, g_maxCUSize + 1); - CHECKED_MALLOC(m_tmpL2, pixel, g_maxCUSize + 1); - - for (int i = 0; i < 3; i++) + for (int i = 0; i < (param->internalCsp != X265_CSP_I400 ? 3 : 1); i++) { + CHECKED_MALLOC(m_tmpL1[i], pixel, g_maxCUSize + 1); + CHECKED_MALLOC(m_tmpL2[i], pixel, g_maxCUSize + 1); + // SAO asm code will read 1 pixel before and after, so pad by 2 - CHECKED_MALLOC(m_tmpU1[i], pixel, m_param->sourceWidth + 2); - m_tmpU1[i] += 1; - CHECKED_MALLOC(m_tmpU2[i], pixel, m_param->sourceWidth + 2); - m_tmpU2[i] += 1; + // NOTE: m_param->sourceWidth+2 enough, to avoid condition check in copySaoAboveRef(), I alloc more up to 63 bytes in here + CHECKED_MALLOC(m_tmpU[i], pixel, m_numCuInWidth * g_maxCUSize + 2 + 32); + m_tmpU[i] += 1; } - CHECKED_MALLOC(m_count, PerClass, NUM_PLANE); - CHECKED_MALLOC(m_offset, PerClass, NUM_PLANE); - CHECKED_MALLOC(m_offsetOrg, PerClass, NUM_PLANE); - - CHECKED_MALLOC(m_countPreDblk, PerPlane, numCtu); - CHECKED_MALLOC(m_offsetOrgPreDblk, PerPlane, numCtu); - - m_clipTable = &(m_clipTableBase[rangeExt]); - - for (int i = 0; i < rangeExt; i++) - m_clipTableBase[i] = 0; + if (initCommon) + { + CHECKED_MALLOC(m_countPreDblk, PerPlane, numCtu); + CHECKED_MALLOC(m_offsetOrgPreDblk, PerPlane, numCtu); + CHECKED_MALLOC(m_depthSaoRate, double, 2 * SAO_DEPTHRATE_SIZE); + + m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + 0] = 0; + m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + 1] = 0; + m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + 2] = 0; + m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + 3] = 0; + m_depthSaoRate[1 * SAO_DEPTHRATE_SIZE + 0] = 0; + m_depthSaoRate[1 * SAO_DEPTHRATE_SIZE + 1] = 0; + m_depthSaoRate[1 * SAO_DEPTHRATE_SIZE + 2] = 0; + m_depthSaoRate[1 * SAO_DEPTHRATE_SIZE + 3] = 0; + + CHECKED_MALLOC(m_clipTableBase, pixel, maxY + 2 * rangeExt); + m_clipTable = &(m_clipTableBase[rangeExt]); + + // Share with fast clip lookup table + if (initCommon) + { + for (int i = 0; i < rangeExt; i++) + m_clipTableBase[i] = 0; - for (int i = 0; i < maxY; i++) - m_clipTable[i] = (pixel)i; + for (int i = 0; i < maxY; i++) + m_clipTable[i] = (pixel)i; - for (int i = maxY; i < maxY + rangeExt; i++) - m_clipTable[i] = maxY; + for (int i = maxY; i < maxY + rangeExt; i++) + m_clipTable[i] = maxY; + } + } + else + { + // must initialize these common 
pointer outside of function + m_countPreDblk = NULL; + m_offsetOrgPreDblk = NULL; + m_clipTableBase = NULL; + m_clipTable = NULL; + } return true; @@ -154,34 +164,61 @@ return false; } -void SAO::destroy() +void SAO::createFromRootNode(SAO* root) { - X265_FREE(m_clipTableBase); - - X265_FREE(m_tmpL1); - X265_FREE(m_tmpL2); + X265_CHECK(m_countPreDblk == NULL, "duplicate initialize on m_countPreDblk"); + X265_CHECK(m_offsetOrgPreDblk == NULL, "duplicate initialize on m_offsetOrgPreDblk"); + X265_CHECK(m_depthSaoRate == NULL, "duplicate initialize on m_depthSaoRate"); + X265_CHECK(m_clipTableBase == NULL, "duplicate initialize on m_clipTableBase"); + X265_CHECK(m_clipTable == NULL, "duplicate initialize on m_clipTable"); + + m_countPreDblk = root->m_countPreDblk; + m_offsetOrgPreDblk = root->m_offsetOrgPreDblk; + m_depthSaoRate = root->m_depthSaoRate; + m_clipTableBase = root->m_clipTableBase; // Unnecessary + m_clipTable = root->m_clipTable; +} +void SAO::destroy(int destoryCommon) +{ for (int i = 0; i < 3; i++) { - if (m_tmpU1[i]) X265_FREE(m_tmpU1[i] - 1); - if (m_tmpU2[i]) X265_FREE(m_tmpU2[i] - 1); + if (m_tmpL1[i]) + { + X265_FREE(m_tmpL1[i]); + m_tmpL1[i] = NULL; + } + + if (m_tmpL2[i]) + { + X265_FREE(m_tmpL2[i]); + m_tmpL2[i] = NULL; + } + + if (m_tmpU[i]) + { + X265_FREE(m_tmpU[i] - 1); + m_tmpU[i] = NULL; + } } - X265_FREE(m_count); - X265_FREE(m_offset); - X265_FREE(m_offsetOrg); - X265_FREE(m_countPreDblk); - X265_FREE(m_offsetOrgPreDblk); + if (destoryCommon) + { + X265_FREE_ZERO(m_countPreDblk); + X265_FREE_ZERO(m_offsetOrgPreDblk); + X265_FREE_ZERO(m_depthSaoRate); + X265_FREE_ZERO(m_clipTableBase); + } } /* allocate memory for SAO parameters */ void SAO::allocSaoParam(SAOParam* saoParam) const { + int planes = (m_param->internalCsp != X265_CSP_I400) ? 3 : 1; saoParam->numCuInWidth = m_numCuInWidth; - saoParam->ctuParam[0] = new SaoCtuParam[m_numCuInHeight * m_numCuInWidth]; - saoParam->ctuParam[1] = new SaoCtuParam[m_numCuInHeight * m_numCuInWidth]; - saoParam->ctuParam[2] = new SaoCtuParam[m_numCuInHeight * m_numCuInWidth]; + for (int i = 0; i < planes; i++) + saoParam->ctuParam[i] = new SaoCtuParam[m_numCuInHeight * m_numCuInWidth]; } void SAO::startSlice(Frame* frame, Entropy& initState, int qp) @@ -209,8 +246,6 @@ break; } - resetStats(); - m_entropyCoder.load(initState); m_rdContexts.next.load(initState); m_rdContexts.cur.load(initState); @@ -224,7 +259,7 @@ } saoParam->bSaoFlag[0] = true; - saoParam->bSaoFlag[1] = true; + saoParam->bSaoFlag[1] = m_param->internalCsp != X265_CSP_I400; m_numNoSao[0] = 0; // Luma m_numNoSao[1] = 0; // Chroma @@ -232,9 +267,9 @@ // NOTE: Allow SAO automatic turn-off only when frame parallelism is disabled. if (m_param->frameNumThreads == 1) { - if (m_refDepth > 0 && m_depthSaoRate[0][m_refDepth - 1] > SAO_ENCODING_RATE) + if (m_refDepth > 0 && m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + m_refDepth - 1] > SAO_ENCODING_RATE) saoParam->bSaoFlag[0] = false; - if (m_refDepth > 0 && m_depthSaoRate[1][m_refDepth - 1] > SAO_ENCODING_RATE_CHROMA) + if (m_refDepth > 0 && m_depthSaoRate[1 * SAO_DEPTHRATE_SIZE + m_refDepth - 1] > SAO_ENCODING_RATE_CHROMA) saoParam->bSaoFlag[1] = false; } } @@ -243,12 +278,13 @@ void SAO::processSaoCu(int addr, int typeIdx, int plane) { int x, y; - const CUData* cu = m_frame->m_encData->getPicCTU(addr); - pixel* rec = m_frame->m_reconPic->getPlaneAddr(plane, addr); - intptr_t stride = plane ? 
m_frame->m_reconPic->m_strideC : m_frame->m_reconPic->m_stride; + PicYuv* reconPic = m_frame->m_reconPic; + pixel* rec = reconPic->getPlaneAddr(plane, addr); + intptr_t stride = plane ? reconPic->m_strideC : reconPic->m_stride; uint32_t picWidth = m_param->sourceWidth; uint32_t picHeight = m_param->sourceHeight; - int ctuWidth = g_maxCUSize; + const CUData* cu = m_frame->m_encData->getPicCTU(addr); + int ctuWidth = g_maxCUSize; int ctuHeight = g_maxCUSize; uint32_t lpelx = cu->m_cuPelX; uint32_t tpely = cu->m_cuPelY; @@ -278,17 +314,10 @@ memset(_upBuff1 + MAX_CU_SIZE, 0, 2 * sizeof(int8_t)); /* avoid valgrind uninit warnings */ - { - const pixel* recR = &rec[ctuWidth - 1]; - for (int i = 0; i < ctuHeight + 1; i++) - { - m_tmpL2[i] = *recR; - recR += stride; - } + tmpL = m_tmpL1[plane]; + tmpU = &(m_tmpU[plane][lpelx]); - tmpL = m_tmpL1; - tmpU = &(m_tmpU1[plane][lpelx]); - } + int8_t* offsetEo = m_offsetEo[plane]; switch (typeIdx) { @@ -308,7 +337,7 @@ int edgeType = signRight + signLeft + 2; signLeft = -signRight; - rec[x] = m_clipTable[rec[x] + m_offsetEo[edgeType]]; + rec[x] = m_clipTable[rec[x] + offsetEo[edgeType]]; } rec += stride; @@ -333,7 +362,7 @@ row1LastPxl = rec[stride + ctuWidth - 1]; } - primitives.saoCuOrgE0(rec, m_offsetEo, ctuWidth, signLeft1, stride); + primitives.saoCuOrgE0(rec, offsetEo, ctuWidth, signLeft1, stride); if (!lpelx) { @@ -372,7 +401,7 @@ int edgeType = signDown + upBuff1[x] + 2; upBuff1[x] = -signDown; - rec[x] = m_clipTable[rec[x] + m_offsetEo[edgeType]]; + rec[x] = m_clipTable[rec[x] + offsetEo[edgeType]]; } rec += stride; @@ -385,11 +414,11 @@ int diff = (endY - startY) % 2; for (y = startY; y < endY - diff; y += 2) { - primitives.saoCuOrgE1_2Rows(rec, upBuff1, m_offsetEo, stride, ctuWidth); + primitives.saoCuOrgE1_2Rows(rec, upBuff1, offsetEo, stride, ctuWidth); rec += 2 * stride; } if (diff & 1) - primitives.saoCuOrgE1(rec, upBuff1, m_offsetEo, stride, ctuWidth); + primitives.saoCuOrgE1(rec, upBuff1, offsetEo, stride, ctuWidth); } break; @@ -439,7 +468,7 @@ int8_t signDown = signOf(rec[x] - rec[x + stride + 1]); int edgeType = signDown + upBuff1[x] + 2; upBufft[x + 1] = -signDown; - rec[x] = m_clipTable[rec[x] + m_offsetEo[edgeType]]; + rec[x] = m_clipTable[rec[x] + offsetEo[edgeType]]; } std::swap(upBuff1, upBufft); @@ -453,7 +482,7 @@ { int8_t iSignDown2 = signOf(rec[stride + startX] - tmpL[y]); - primitives.saoCuOrgE2[endX > 16](rec + startX, upBufft + startX, upBuff1 + startX, m_offsetEo, endX - startX, stride); + primitives.saoCuOrgE2[endX > 16](rec + startX, upBufft + startX, upBuff1 + startX, offsetEo, endX - startX, stride); upBufft[startX] = iSignDown2; @@ -485,14 +514,14 @@ int8_t signDown = signOf(rec[x] - tmpL[y + 1]); int edgeType = signDown + upBuff1[x] + 2; upBuff1[x - 1] = -signDown; - rec[x] = m_clipTable[rec[x] + m_offsetEo[edgeType]]; + rec[x] = m_clipTable[rec[x] + offsetEo[edgeType]]; for (x = startX + 1; x < endX; x++) { signDown = signOf(rec[x] - rec[x + stride - 1]); edgeType = signDown + upBuff1[x] + 2; upBuff1[x - 1] = -signDown; - rec[x] = m_clipTable[rec[x] + m_offsetEo[edgeType]]; + rec[x] = m_clipTable[rec[x] + offsetEo[edgeType]]; } upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]); @@ -522,9 +551,9 @@ int8_t signDown = signOf(rec[x] - tmpL[y + 1]); int edgeType = signDown + upBuff1[x] + 2; upBuff1[x - 1] = -signDown; - rec[x] = m_clipTable[rec[x] + m_offsetEo[edgeType]]; + rec[x] = m_clipTable[rec[x] + offsetEo[edgeType]]; - primitives.saoCuOrgE3[endX > 16](rec, upBuff1, m_offsetEo, stride - 1, startX, 
endX); + primitives.saoCuOrgE3[endX > 16](rec, upBuff1, offsetEo, stride - 1, startX, endX); upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]); @@ -536,7 +565,7 @@ } case SAO_BO: { - const int8_t* offsetBo = m_offsetBo; + const int8_t* offsetBo = m_offsetBo[plane]; if (ctuWidth & 15) { @@ -564,98 +593,169 @@ } default: break; } - -// if (iSaoType!=SAO_BO_0 || iSaoType!=SAO_BO_1) - std::swap(m_tmpL1, m_tmpL2); } -/* Process SAO all units */ -void SAO::processSaoUnitRow(SaoCtuParam* ctuParam, int idxY, int plane) +/* Process SAO unit */ +void SAO::processSaoUnitCuLuma(SaoCtuParam* ctuParam, int idxY, int idxX) { - intptr_t stride = plane ? m_frame->m_reconPic->m_strideC : m_frame->m_reconPic->m_stride; - uint32_t picWidth = m_param->sourceWidth; + PicYuv* reconPic = m_frame->m_reconPic; + intptr_t stride = reconPic->m_stride; int ctuWidth = g_maxCUSize; int ctuHeight = g_maxCUSize; - if (plane) + + int addr = idxY * m_numCuInWidth + idxX; + pixel* rec = reconPic->getLumaAddr(addr); + + if (idxX == 0) { - picWidth >>= m_hChromaShift; - ctuWidth >>= m_hChromaShift; - ctuHeight >>= m_vChromaShift; + for (int i = 0; i < ctuHeight + 1; i++) + { + m_tmpL1[0][i] = rec[0]; + rec += stride; + } } - if (!idxY) + bool mergeLeftFlag = (ctuParam[addr].mergeMode == SAO_MERGE_LEFT); + int typeIdx = ctuParam[addr].typeIdx; + + if (idxX != (m_numCuInWidth - 1)) { - pixel* rec = m_frame->m_reconPic->m_picOrg[plane]; - memcpy(m_tmpU1[plane], rec, sizeof(pixel) * picWidth); + rec = reconPic->getLumaAddr(addr); + for (int i = 0; i < ctuHeight + 1; i++) + { + m_tmpL2[0][i] = rec[ctuWidth - 1]; + rec += stride; + } } - int addr = idxY * m_numCuInWidth; - pixel* rec = plane ? m_frame->m_reconPic->getChromaAddr(plane, addr) : m_frame->m_reconPic->getLumaAddr(addr); - - for (int i = 0; i < ctuHeight + 1; i++) + if (typeIdx >= 0) { - m_tmpL1[i] = rec[0]; - rec += stride; + if (!mergeLeftFlag) + { + if (typeIdx == SAO_BO) + { + memset(m_offsetBo[0], 0, sizeof(m_offsetBo[0])); + + for (int i = 0; i < SAO_NUM_OFFSET; i++) + m_offsetBo[0][((ctuParam[addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[addr].offset[i] << SAO_BIT_INC); + } + else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3) + { + int offset[NUM_EDGETYPE]; + offset[0] = 0; + for (int i = 0; i < SAO_NUM_OFFSET; i++) + offset[i + 1] = ctuParam[addr].offset[i] << SAO_BIT_INC; + + for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++) + m_offsetEo[0][edgeType] = (int8_t)offset[s_eoTable[edgeType]]; + } + } + processSaoCu(addr, typeIdx, 0); } + std::swap(m_tmpL1[0], m_tmpL2[0]); +} + +/* Process SAO unit (Chroma only) */ +void SAO::processSaoUnitCuChroma(SaoCtuParam* ctuParam[3], int idxY, int idxX) +{ + PicYuv* reconPic = m_frame->m_reconPic; + intptr_t stride = reconPic->m_strideC; + int ctuWidth = g_maxCUSize; + int ctuHeight = g_maxCUSize; - rec -= (stride << 1); + { + ctuWidth >>= m_hChromaShift; + ctuHeight >>= m_vChromaShift; + } - memcpy(m_tmpU2[plane], rec, sizeof(pixel) * picWidth); + int addr = idxY * m_numCuInWidth + idxX; + pixel* recCb = reconPic->getCbAddr(addr); + pixel* recCr = reconPic->getCrAddr(addr); - for (int idxX = 0; idxX < m_numCuInWidth; idxX++) + if (idxX == 0) { - addr = idxY * m_numCuInWidth + idxX; + for (int i = 0; i < ctuHeight + 1; i++) + { + m_tmpL1[1][i] = recCb[0]; + m_tmpL1[2][i] = recCr[0]; + recCb += stride; + recCr += stride; + } + } - bool mergeLeftFlag = ctuParam[addr].mergeMode == SAO_MERGE_LEFT; - int typeIdx = ctuParam[addr].typeIdx; 
+ bool mergeLeftFlagCb = (ctuParam[1][addr].mergeMode == SAO_MERGE_LEFT); + int typeIdxCb = ctuParam[1][addr].typeIdx; + + bool mergeLeftFlagCr = (ctuParam[2][addr].mergeMode == SAO_MERGE_LEFT); + int typeIdxCr = ctuParam[2][addr].typeIdx; + + if (idxX != (m_numCuInWidth - 1)) + { + recCb = reconPic->getCbAddr(addr); + recCr = reconPic->getCrAddr(addr); + for (int i = 0; i < ctuHeight + 1; i++) + { + m_tmpL2[1][i] = recCb[ctuWidth - 1]; + m_tmpL2[2][i] = recCr[ctuWidth - 1]; + recCb += stride; + recCr += stride; + } + } - if (typeIdx >= 0) + // Process U + if (typeIdxCb >= 0) + { + if (!mergeLeftFlagCb) { - if (!mergeLeftFlag) + if (typeIdxCb == SAO_BO) { - if (typeIdx == SAO_BO) - { - memset(m_offsetBo, 0, sizeof(m_offsetBo)); + memset(m_offsetBo[1], 0, sizeof(m_offsetBo[0])); - for (int i = 0; i < SAO_NUM_OFFSET; i++) - m_offsetBo[((ctuParam[addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[addr].offset[i] << SAO_BIT_INC); - } - else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3) - { - int offset[NUM_EDGETYPE]; - offset[0] = 0; - for (int i = 0; i < SAO_NUM_OFFSET; i++) - offset[i + 1] = ctuParam[addr].offset[i] << SAO_BIT_INC; + for (int i = 0; i < SAO_NUM_OFFSET; i++) + m_offsetBo[1][((ctuParam[1][addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[1][addr].offset[i] << SAO_BIT_INC); + } + else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3) + { + int offset[NUM_EDGETYPE]; + offset[0] = 0; + for (int i = 0; i < SAO_NUM_OFFSET; i++) + offset[i + 1] = ctuParam[1][addr].offset[i] << SAO_BIT_INC; - for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++) - m_offsetEo[edgeType] = (int8_t)offset[s_eoTable[edgeType]]; - } + for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++) + m_offsetEo[1][edgeType] = (int8_t)offset[s_eoTable[edgeType]]; } - processSaoCu(addr, typeIdx, plane); } - else if (idxX != (m_numCuInWidth - 1)) + processSaoCu(addr, typeIdxCb, 1); + } + + // Process V + if (typeIdxCr >= 0) + { + if (!mergeLeftFlagCr) { - rec = plane ? 
m_frame->m_reconPic->getChromaAddr(plane, addr) : m_frame->m_reconPic->getLumaAddr(addr); + if (typeIdxCr == SAO_BO) + { + memset(m_offsetBo[2], 0, sizeof(m_offsetBo[0])); - for (int i = 0; i < ctuHeight + 1; i++) + for (int i = 0; i < SAO_NUM_OFFSET; i++) + m_offsetBo[2][((ctuParam[2][addr].bandPos + i) & (SAO_NUM_BO_CLASSES - 1))] = (int8_t)(ctuParam[2][addr].offset[i] << SAO_BIT_INC); + } + else // if (typeIdx == SAO_EO_0 || typeIdx == SAO_EO_1 || typeIdx == SAO_EO_2 || typeIdx == SAO_EO_3) { - m_tmpL1[i] = rec[ctuWidth - 1]; - rec += stride; + int offset[NUM_EDGETYPE]; + offset[0] = 0; + for (int i = 0; i < SAO_NUM_OFFSET; i++) + offset[i + 1] = ctuParam[2][addr].offset[i] << SAO_BIT_INC; + + for (int edgeType = 0; edgeType < NUM_EDGETYPE; edgeType++) + m_offsetEo[2][edgeType] = (int8_t)offset[s_eoTable[edgeType]]; } } + processSaoCu(addr, typeIdxCb, 2); } - std::swap(m_tmpU1[plane], m_tmpU2[plane]); -} - -void SAO::resetSaoUnit(SaoCtuParam* saoUnit) -{ - saoUnit->mergeMode = SAO_MERGE_NONE; - saoUnit->typeIdx = -1; - saoUnit->bandPos = 0; - - for (int i = 0; i < SAO_NUM_OFFSET; i++) - saoUnit->offset[i] = 0; + std::swap(m_tmpL1[1], m_tmpL2[1]); + std::swap(m_tmpL1[2], m_tmpL2[2]); } void SAO::copySaoUnit(SaoCtuParam* saoUnitDst, const SaoCtuParam* saoUnitSrc) @@ -671,12 +771,13 @@ /* Calculate SAO statistics for current CTU without non-crossing slice */ void SAO::calcSaoStatsCu(int addr, int plane) { + const PicYuv* reconPic = m_frame->m_reconPic; const CUData* cu = m_frame->m_encData->getPicCTU(addr); const pixel* fenc0 = m_frame->m_fencPic->getPlaneAddr(plane, addr); - const pixel* rec0 = m_frame->m_reconPic->getPlaneAddr(plane, addr); + const pixel* rec0 = reconPic->getPlaneAddr(plane, addr); const pixel* fenc; const pixel* rec; - intptr_t stride = plane ? m_frame->m_reconPic->m_strideC : m_frame->m_reconPic->m_stride; + intptr_t stride = plane ? reconPic->m_strideC : reconPic->m_stride; uint32_t picWidth = m_param->sourceWidth; uint32_t picHeight = m_param->sourceHeight; int ctuWidth = g_maxCUSize; @@ -702,24 +803,48 @@ int endX; int endY; - int skipB = plane ? 2 : 4; - int skipR = plane ? 3 : 5; + const int plane_offset = plane ? 2 : 0; + int skipB = 4; + int skipR = 5; - int8_t _upBuff1[MAX_CU_SIZE + 2], *upBuff1 = _upBuff1 + 1; - int8_t _upBufft[MAX_CU_SIZE + 2], *upBufft = _upBufft + 1; + int8_t _upBuff[2 * (MAX_CU_SIZE + 16 + 16)], *upBuff1 = _upBuff + 16, *upBufft = upBuff1 + (MAX_CU_SIZE + 16 + 16); + + ALIGN_VAR_32(int16_t, diff[MAX_CU_SIZE * MAX_CU_SIZE]); + + // Calculate (fenc - frec) and put into diff[] + if ((lpelx + ctuWidth < picWidth) & (tpely + ctuHeight < picHeight)) + { + // WARNING: *) May read beyond bound on video than ctuWidth or ctuHeight is NOT multiple of cuSize + X265_CHECK((ctuWidth == ctuHeight) || (m_chromaFormat != X265_CSP_I420), "video size check failure\n"); + if (plane) + primitives.chroma[m_chromaFormat].cu[g_maxLog2CUSize - 2].sub_ps(diff, MAX_CU_SIZE, fenc0, rec0, stride, stride); + else + primitives.cu[g_maxLog2CUSize - 2].sub_ps(diff, MAX_CU_SIZE, fenc0, rec0, stride, stride); + } + else + { + // path for non-square area (most in edge) + for(int y = 0; y < ctuHeight; y++) + { + for(int x = 0; x < ctuWidth; x++) + { + diff[y * MAX_CU_SIZE + x] = (fenc0[y * stride + x] - rec0[y * stride + x]); + } + } + } // SAO_BO: { if (m_param->bSaoNonDeblocked) { - skipB = plane ? 1 : 3; - skipR = plane ? 2 : 4; + skipB = 3; + skipR = 4; } - endX = (rpelx == picWidth) ? ctuWidth : ctuWidth - skipR; - endY = (bpely == picHeight) ? 
ctuHeight : ctuHeight - skipB; + endX = (rpelx == picWidth) ? ctuWidth : ctuWidth - skipR + plane_offset; + endY = (bpely == picHeight) ? ctuHeight : ctuHeight - skipB + plane_offset; - primitives.saoCuStatsBO(fenc0, rec0, stride, endX, endY, m_offsetOrg[plane][SAO_BO], m_count[plane][SAO_BO]); + primitives.saoCuStatsBO(diff, rec0, stride, endX, endY, m_offsetOrg[plane][SAO_BO], m_count[plane][SAO_BO]); } { @@ -727,84 +852,82 @@ { if (m_param->bSaoNonDeblocked) { - skipB = plane ? 1 : 3; - skipR = plane ? 3 : 5; + skipB = 3; + skipR = 5; } startX = !lpelx; - endX = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth - skipR; + endX = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth - skipR + plane_offset; - primitives.saoCuStatsE0(fenc0 + startX, rec0 + startX, stride, endX - startX, ctuHeight - skipB, m_offsetOrg[plane][SAO_EO_0], m_count[plane][SAO_EO_0]); + primitives.saoCuStatsE0(diff + startX, rec0 + startX, stride, endX - startX, ctuHeight - skipB + plane_offset, m_offsetOrg[plane][SAO_EO_0], m_count[plane][SAO_EO_0]); } // SAO_EO_1: // dir: | { if (m_param->bSaoNonDeblocked) { - skipB = plane ? 2 : 4; - skipR = plane ? 2 : 4; + skipB = 4; + skipR = 4; } - fenc = fenc0; rec = rec0; startY = !tpely; - endX = (rpelx == picWidth) ? ctuWidth : ctuWidth - skipR; - endY = (bpely == picHeight) ? ctuHeight - 1 : ctuHeight - skipB; + endX = (rpelx == picWidth) ? ctuWidth : ctuWidth - skipR + plane_offset; + endY = (bpely == picHeight) ? ctuHeight - 1 : ctuHeight - skipB + plane_offset; if (!tpely) { - fenc += stride; rec += stride; } primitives.sign(upBuff1, rec, &rec[- stride], ctuWidth); - primitives.saoCuStatsE1(fenc0 + startY * stride, rec0 + startY * stride, stride, upBuff1, endX, endY - startY, m_offsetOrg[plane][SAO_EO_1], m_count[plane][SAO_EO_1]); + primitives.saoCuStatsE1(diff + startY * MAX_CU_SIZE, rec0 + startY * stride, stride, upBuff1, endX, endY - startY, m_offsetOrg[plane][SAO_EO_1], m_count[plane][SAO_EO_1]); } // SAO_EO_2: // dir: 135 { if (m_param->bSaoNonDeblocked) { - skipB = plane ? 2 : 4; - skipR = plane ? 3 : 5; + skipB = 4; + skipR = 5; } fenc = fenc0; rec = rec0; startX = !lpelx; - endX = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth - skipR; + endX = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth - skipR + plane_offset; startY = !tpely; - endY = (bpely == picHeight) ? ctuHeight - 1 : ctuHeight - skipB; + endY = (bpely == picHeight) ? ctuHeight - 1 : ctuHeight - skipB + plane_offset; if (!tpely) { fenc += stride; rec += stride; } - primitives.sign(&upBuff1[startX], &rec[startX], &rec[startX - stride - 1], (endX - startX)); + primitives.sign(upBuff1, &rec[startX], &rec[startX - stride - 1], (endX - startX)); - primitives.saoCuStatsE2(fenc0 + startX + startY * stride, rec0 + startX + startY * stride, stride, upBuff1 + startX, upBufft + startX, endX - startX, endY - startY, m_offsetOrg[plane][SAO_EO_2], m_count[plane][SAO_EO_2]); + primitives.saoCuStatsE2(diff + startX + startY * MAX_CU_SIZE, rec0 + startX + startY * stride, stride, upBuff1, upBufft, endX - startX, endY - startY, m_offsetOrg[plane][SAO_EO_2], m_count[plane][SAO_EO_2]); } // SAO_EO_3: // dir: 45 { if (m_param->bSaoNonDeblocked) { - skipB = plane ? 2 : 4; - skipR = plane ? 3 : 5; + skipB = 4; + skipR = 5; } fenc = fenc0; rec = rec0; startX = !lpelx; - endX = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth - skipR; + endX = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth - skipR + plane_offset; startY = !tpely; - endY = (bpely == picHeight) ? ctuHeight - 1 : ctuHeight - skipB; + endY = (bpely == picHeight) ? 
ctuHeight - 1 : ctuHeight - skipB + plane_offset; if (!tpely) { @@ -812,9 +935,9 @@ rec += stride; } - primitives.sign(&upBuff1[startX - 1], &rec[startX - 1], &rec[startX - 1 - stride + 1], (endX - startX + 1)); + primitives.sign(upBuff1, &rec[startX - 1], &rec[startX - 1 - stride + 1], (endX - startX + 1)); - primitives.saoCuStatsE3(fenc0 + startX + startY * stride, rec0 + startX + startY * stride, stride, upBuff1 + startX, endX - startX, endY - startY, m_offsetOrg[plane][SAO_EO_3], m_count[plane][SAO_EO_3]); + primitives.saoCuStatsE3(diff + startX + startY * MAX_CU_SIZE, rec0 + startX + startY * stride, stride, upBuff1 + 1, endX - startX, endY - startY, m_offsetOrg[plane][SAO_EO_3], m_count[plane][SAO_EO_3]); } } } @@ -825,9 +948,10 @@ int x, y; const CUData* cu = frame->m_encData->getPicCTU(addr); + const PicYuv* reconPic = m_frame->m_reconPic; const pixel* fenc; const pixel* rec; - intptr_t stride = m_frame->m_reconPic->m_stride; + intptr_t stride = reconPic->m_stride; uint32_t picWidth = m_param->sourceWidth; uint32_t picHeight = m_param->sourceHeight; int ctuWidth = g_maxCUSize; @@ -857,11 +981,12 @@ memset(m_countPreDblk[addr], 0, sizeof(PerPlane)); memset(m_offsetOrgPreDblk[addr], 0, sizeof(PerPlane)); - for (int plane = 0; plane < NUM_PLANE; plane++) + int plane_offset = 0; + for (int plane = 0; plane < (frame->m_param->internalCsp != X265_CSP_I400 ? NUM_PLANE : 1); plane++) { if (plane == 1) { - stride = frame->m_reconPic->m_strideC; + stride = reconPic->m_strideC; picWidth >>= m_hChromaShift; picHeight >>= m_vChromaShift; ctuWidth >>= m_hChromaShift; @@ -874,14 +999,14 @@ // SAO_BO: - skipB = plane ? 1 : 3; - skipR = plane ? 2 : 4; + skipB = 3 - plane_offset; + skipR = 4 - plane_offset; stats = m_offsetOrgPreDblk[addr][plane][SAO_BO]; count = m_countPreDblk[addr][plane][SAO_BO]; const pixel* fenc0 = m_frame->m_fencPic->getPlaneAddr(plane, addr); - const pixel* rec0 = m_frame->m_reconPic->getPlaneAddr(plane, addr); + const pixel* rec0 = reconPic->getPlaneAddr(plane, addr); fenc = fenc0; rec = rec0; @@ -903,8 +1028,8 @@ // SAO_EO_0: // dir: - { - skipB = plane ? 1 : 3; - skipR = plane ? 3 : 5; + skipB = 3 - plane_offset; + skipR = 5 - plane_offset; stats = m_offsetOrgPreDblk[addr][plane][SAO_EO_0]; count = m_countPreDblk[addr][plane][SAO_EO_0]; @@ -939,8 +1064,8 @@ // SAO_EO_1: // dir: | { - skipB = plane ? 2 : 4; - skipR = plane ? 2 : 4; + skipB = 4 - plane_offset; + skipR = 4 - plane_offset; stats = m_offsetOrgPreDblk[addr][plane][SAO_EO_1]; count = m_countPreDblk[addr][plane][SAO_EO_1]; @@ -984,8 +1109,8 @@ // SAO_EO_2: // dir: 135 { - skipB = plane ? 2 : 4; - skipR = plane ? 3 : 5; + skipB = 4 - plane_offset; + skipR = 5 - plane_offset; stats = m_offsetOrgPreDblk[addr][plane][SAO_EO_2]; count = m_countPreDblk[addr][plane][SAO_EO_2]; @@ -1036,8 +1161,8 @@ // SAO_EO_3: // dir: 45 { - skipB = plane ? 2 : 4; - skipR = plane ? 
3 : 5; + skipB = 4 - plane_offset; + skipR = 5 - plane_offset; stats = m_offsetOrgPreDblk[addr][plane][SAO_EO_3]; count = m_countPreDblk[addr][plane][SAO_EO_3]; @@ -1083,28 +1208,29 @@ fenc += stride; } } + plane_offset = 2; } } /* reset offset statistics */ void SAO::resetStats() { - memset(m_count, 0, sizeof(PerClass) * NUM_PLANE); - memset(m_offset, 0, sizeof(PerClass) * NUM_PLANE); - memset(m_offsetOrg, 0, sizeof(PerClass) * NUM_PLANE); + memset(m_count, 0, sizeof(m_count)); + memset(m_offset, 0, sizeof(m_offset)); + memset(m_offsetOrg, 0, sizeof(m_offsetOrg)); } void SAO::rdoSaoUnitRowEnd(const SAOParam* saoParam, int numctus) { if (!saoParam->bSaoFlag[0]) - m_depthSaoRate[0][m_refDepth] = 1.0; + m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + m_refDepth] = 1.0; else - m_depthSaoRate[0][m_refDepth] = m_numNoSao[0] / ((double)numctus); + m_depthSaoRate[0 * SAO_DEPTHRATE_SIZE + m_refDepth] = m_numNoSao[0] / ((double)numctus); if (!saoParam->bSaoFlag[1]) - m_depthSaoRate[1][m_refDepth] = 1.0; + m_depthSaoRate[1 * SAO_DEPTHRATE_SIZE + m_refDepth] = 1.0; else - m_depthSaoRate[1][m_refDepth] = m_numNoSao[1] / ((double)numctus); + m_depthSaoRate[1 * SAO_DEPTHRATE_SIZE + m_refDepth] = m_numNoSao[1] / ((double)numctus); } void SAO::rdoSaoUnitRow(SAOParam* saoParam, int idxY) @@ -1127,37 +1253,38 @@ if (allowMerge[1]) m_entropyCoder.codeSaoMerge(0); m_entropyCoder.store(m_rdContexts.temp); + // reset stats Y, Cb, Cr - for (int plane = 0; plane < 3; plane++) + X265_CHECK(sizeof(PerPlane) == (sizeof(int32_t) * (NUM_PLANE * MAX_NUM_SAO_TYPE * MAX_NUM_SAO_CLASS)), "Found Padding space in struct PerPlane"); + + // TODO: Confirm the address space is continuous + if (m_param->bSaoNonDeblocked) { - for (int j = 0; j < MAX_NUM_SAO_TYPE; j++) - { - for (int k = 0; k < MAX_NUM_SAO_CLASS; k++) - { - m_offset[plane][j][k] = 0; - if (m_param->bSaoNonDeblocked) - { - m_count[plane][j][k] = m_countPreDblk[addr][plane][j][k]; - m_offsetOrg[plane][j][k] = m_offsetOrgPreDblk[addr][plane][j][k]; - } - else - { - m_count[plane][j][k] = 0; - m_offsetOrg[plane][j][k] = 0; - } - } - } + memcpy(m_count, m_countPreDblk[addr], sizeof(m_count)); + memcpy(m_offsetOrg, m_offsetOrgPreDblk[addr], sizeof(m_offsetOrg)); + } + else + { + memset(m_count, 0, sizeof(m_count)); + memset(m_offsetOrg, 0, sizeof(m_offsetOrg)); + } - saoParam->ctuParam[plane][addr].mergeMode = SAO_MERGE_NONE; - saoParam->ctuParam[plane][addr].typeIdx = -1; - saoParam->ctuParam[plane][addr].bandPos = 0; - if (saoParam->bSaoFlag[plane > 0]) - calcSaoStatsCu(addr, plane); + saoParam->ctuParam[0][addr].reset(); + saoParam->ctuParam[1][addr].reset(); + saoParam->ctuParam[2][addr].reset(); + + if (saoParam->bSaoFlag[0]) + calcSaoStatsCu(addr, 0); + + if (saoParam->bSaoFlag[1]) + { + calcSaoStatsCu(addr, 1); + calcSaoStatsCu(addr, 2); } saoComponentParamDist(saoParam, addr, addrUp, addrLeft, &mergeSaoParam[0][0], mergeDist); - - sao2ChromaParamDist(saoParam, addr, addrUp, addrLeft, mergeSaoParam, mergeDist); + if (m_chromaFormat != X265_CSP_I400) + sao2ChromaParamDist(saoParam, addr, addrUp, addrLeft, mergeSaoParam, mergeDist); if (saoParam->bSaoFlag[0] || saoParam->bSaoFlag[1]) { @@ -1209,14 +1336,122 @@ if (saoParam->ctuParam[0][addr].typeIdx < 0) m_numNoSao[0]++; - if (saoParam->ctuParam[1][addr].typeIdx < 0) + if (m_chromaFormat != X265_CSP_I400 && saoParam->ctuParam[1][addr].typeIdx < 0) m_numNoSao[1]++; + m_entropyCoder.load(m_rdContexts.temp); m_entropyCoder.store(m_rdContexts.cur); } } } +void SAO::rdoSaoUnitCu(SAOParam* saoParam, int rowBaseAddr, int idxX, 
int addr) +{ + SaoCtuParam mergeSaoParam[NUM_MERGE_MODE][2]; + double mergeDist[NUM_MERGE_MODE]; + const bool allowMerge[2] = {(idxX != 0), (rowBaseAddr != 0)}; // left, up + + const int addrUp = rowBaseAddr ? addr - m_numCuInWidth : -1; + const int addrLeft = idxX ? addr - 1 : -1; + + bool chroma = m_param->internalCsp != X265_CSP_I400; + int planes = chroma ? 3 : 1; + + m_entropyCoder.load(m_rdContexts.cur); + if (allowMerge[0]) + m_entropyCoder.codeSaoMerge(0); + if (allowMerge[1]) + m_entropyCoder.codeSaoMerge(0); + m_entropyCoder.store(m_rdContexts.temp); + + // reset stats Y, Cb, Cr + X265_CHECK(sizeof(PerPlane) == (sizeof(int32_t) * (NUM_PLANE * MAX_NUM_SAO_TYPE * MAX_NUM_SAO_CLASS)), "Found Padding space in struct PerPlane"); + + // TODO: Confirm the address space is continuous + if (m_param->bSaoNonDeblocked) + { + memcpy(m_count, m_countPreDblk[addr], sizeof(m_count)); + memcpy(m_offsetOrg, m_offsetOrgPreDblk[addr], sizeof(m_offsetOrg)); + } + else + { + memset(m_count, 0, sizeof(m_count)); + memset(m_offsetOrg, 0, sizeof(m_offsetOrg)); + } + + for (int i = 0; i < planes; i++) + saoParam->ctuParam[i][addr].reset(); + + if (saoParam->bSaoFlag[0]) + calcSaoStatsCu(addr, 0); + + if (saoParam->bSaoFlag[1]) + { + calcSaoStatsCu(addr, 1); + calcSaoStatsCu(addr, 2); + } + + saoComponentParamDist(saoParam, addr, addrUp, addrLeft, &mergeSaoParam[0][0], mergeDist); + if (chroma) + sao2ChromaParamDist(saoParam, addr, addrUp, addrLeft, mergeSaoParam, mergeDist); + + if (saoParam->bSaoFlag[0] || saoParam->bSaoFlag[1]) + { + // Cost of new SAO_params + m_entropyCoder.load(m_rdContexts.cur); + m_entropyCoder.resetBits(); + if (allowMerge[0]) + m_entropyCoder.codeSaoMerge(0); + if (allowMerge[1]) + m_entropyCoder.codeSaoMerge(0); + for (int plane = 0; plane < planes; plane++) + { + if (saoParam->bSaoFlag[plane > 0]) + m_entropyCoder.codeSaoOffset(saoParam->ctuParam[plane][addr], plane); + } + + uint32_t rate = m_entropyCoder.getNumberOfWrittenBits(); + double bestCost = mergeDist[0] + (double)rate; + m_entropyCoder.store(m_rdContexts.temp); + + // Cost of Merge + for (int mergeIdx = 0; mergeIdx < 2; ++mergeIdx) + { + if (!allowMerge[mergeIdx]) + continue; + + m_entropyCoder.load(m_rdContexts.cur); + m_entropyCoder.resetBits(); + if (allowMerge[0]) + m_entropyCoder.codeSaoMerge(1 - mergeIdx); + if (allowMerge[1] && (mergeIdx == 1)) + m_entropyCoder.codeSaoMerge(1); + + rate = m_entropyCoder.getNumberOfWrittenBits(); + double mergeCost = mergeDist[mergeIdx + 1] + (double)rate; + if (mergeCost < bestCost) + { + SaoMergeMode mergeMode = mergeIdx ? 
SAO_MERGE_UP : SAO_MERGE_LEFT; + bestCost = mergeCost; + m_entropyCoder.store(m_rdContexts.temp); + for (int plane = 0; plane < planes; plane++) + { + mergeSaoParam[plane][mergeIdx].mergeMode = mergeMode; + if (saoParam->bSaoFlag[plane > 0]) + copySaoUnit(&saoParam->ctuParam[plane][addr], &mergeSaoParam[plane][mergeIdx]); + } + } + } + + if (saoParam->ctuParam[0][addr].typeIdx < 0) + m_numNoSao[0]++; + if (chroma && saoParam->ctuParam[1][addr].typeIdx < 0) + m_numNoSao[1]++; + m_entropyCoder.load(m_rdContexts.temp); + m_entropyCoder.store(m_rdContexts.cur); + } +} + /** rate distortion optimization of SAO unit */ inline int64_t SAO::estSaoTypeDist(int plane, int typeIdx, double lambda, int32_t* currentDistortionTableBo, double* currentRdCostTableBo) { @@ -1302,7 +1537,6 @@ int currentDistortionTableBo[MAX_NUM_SAO_CLASS]; double currentRdCostTableBo[MAX_NUM_SAO_CLASS]; - resetSaoUnit(lclCtuParam); m_entropyCoder.load(m_rdContexts.temp); m_entropyCoder.resetBits(); m_entropyCoder.codeSaoOffset(*lclCtuParam, 0); @@ -1362,7 +1596,6 @@ m_entropyCoder.store(m_rdContexts.temp); // merge left or merge up - for (int mergeIdx = 0; mergeIdx < 2; mergeIdx++) { SaoCtuParam* mergeSrcParam = NULL; @@ -1389,8 +1622,6 @@ mergeDist[mergeIdx + 1] = ((double)estDist / m_lumaLambda); } - else - resetSaoUnit(&mergeSaoParam[mergeIdx]); } } @@ -1404,8 +1635,6 @@ int bestClassTableBo[2] = { 0, 0 }; int currentDistortionTableBo[MAX_NUM_SAO_CLASS]; - resetSaoUnit(lclCtuParam[0]); - resetSaoUnit(lclCtuParam[1]); m_entropyCoder.load(m_rdContexts.temp); m_entropyCoder.resetBits(); m_entropyCoder.codeSaoOffset(*lclCtuParam[0], 1); @@ -1483,7 +1712,6 @@ m_entropyCoder.store(m_rdContexts.temp); // merge left or merge up - for (int mergeIdx = 0; mergeIdx < 2; mergeIdx++) { for (int compIdx = 0; compIdx < 2; compIdx++) @@ -1512,14 +1740,12 @@ mergeSaoParam[plane][mergeIdx].mergeMode = mergeIdx ? 
SAO_MERGE_UP : SAO_MERGE_LEFT; mergeDist[mergeIdx + 1] += ((double)estDist / m_chromaLambda); } - else - resetSaoUnit(&mergeSaoParam[plane][mergeIdx]); } } } // NOTE: must put in namespace X265_NS since we need class SAO -void saoCuStatsBO_c(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) +void saoCuStatsBO_c(const int16_t *diff, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) { int x, y; const int boShift = X265_DEPTH - SAO_BO_BITS; @@ -1529,21 +1755,23 @@ for (x = 0; x < endX; x++) { int classIdx = 1 + (rec[x] >> boShift); - stats[classIdx] += (fenc[x] - rec[x]); + stats[classIdx] += diff[x]; count[classIdx]++; } - fenc += stride; + diff += MAX_CU_SIZE; rec += stride; } } -void saoCuStatsE0_c(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) +void saoCuStatsE0_c(const int16_t *diff, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) { int x, y; int32_t tmp_stats[SAO::NUM_EDGETYPE]; int32_t tmp_count[SAO::NUM_EDGETYPE]; + X265_CHECK(endX <= MAX_CU_SIZE, "endX too big\n"); + memset(tmp_stats, 0, sizeof(tmp_stats)); memset(tmp_count, 0, sizeof(tmp_count)); @@ -1558,11 +1786,11 @@ signLeft = -signRight; X265_CHECK(edgeType <= 4, "edgeType check failure\n"); - tmp_stats[edgeType] += (fenc[x] - rec[x]); + tmp_stats[edgeType] += diff[x]; tmp_count[edgeType]++; } - fenc += stride; + diff += MAX_CU_SIZE; rec += stride; } @@ -1573,7 +1801,7 @@ } } -void saoCuStatsE1_c(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count) +void saoCuStatsE1_c(const int16_t *diff, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count) { X265_CHECK(endX <= MAX_CU_SIZE, "endX check failure\n"); X265_CHECK(endY <= MAX_CU_SIZE, "endY check failure\n"); @@ -1585,6 +1813,7 @@ memset(tmp_stats, 0, sizeof(tmp_stats)); memset(tmp_count, 0, sizeof(tmp_count)); + X265_CHECK(endX * endY <= (4096 - 16), "Assembly of saoE1 may overflow with this block size\n"); for (y = 0; y < endY; y++) { for (x = 0; x < endX; x++) @@ -1594,10 +1823,11 @@ uint32_t edgeType = signDown + upBuff1[x] + 2; upBuff1[x] = (int8_t)(-signDown); - tmp_stats[edgeType] += (fenc[x] - rec[x]); + X265_CHECK(edgeType <= 4, "edgeType check failure\n"); + tmp_stats[edgeType] += diff[x]; tmp_count[edgeType]++; } - fenc += stride; + diff += MAX_CU_SIZE; rec += stride; } @@ -1608,7 +1838,7 @@ } } -void saoCuStatsE2_c(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int8_t *upBufft, int endX, int endY, int32_t *stats, int32_t *count) +void saoCuStatsE2_c(const int16_t *diff, const pixel *rec, intptr_t stride, int8_t *upBuff1, int8_t *upBufft, int endX, int endY, int32_t *stats, int32_t *count) { X265_CHECK(endX < MAX_CU_SIZE, "endX check failure\n"); X265_CHECK(endY < MAX_CU_SIZE, "endY check failure\n"); @@ -1629,14 +1859,14 @@ X265_CHECK(signDown == signOf(rec[x] - rec[x + stride + 1]), "signDown check failure\n"); uint32_t edgeType = signDown + upBuff1[x] + 2; upBufft[x + 1] = (int8_t)(-signDown); - tmp_stats[edgeType] += (fenc[x] - rec[x]); + tmp_stats[edgeType] += diff[x]; tmp_count[edgeType]++; } std::swap(upBuff1, upBufft); rec += stride; - fenc += stride; + diff += MAX_CU_SIZE; } for (x = 0; x < SAO::NUM_EDGETYPE; x++) @@ -1646,7 +1876,7 @@ } } -void saoCuStatsE3_c(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, 
int endX, int endY, int32_t *stats, int32_t *count) +void saoCuStatsE3_c(const int16_t *diff, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count) { X265_CHECK(endX < MAX_CU_SIZE, "endX check failure\n"); X265_CHECK(endY < MAX_CU_SIZE, "endY check failure\n"); @@ -1668,14 +1898,14 @@ uint32_t edgeType = signDown + upBuff1[x] + 2; upBuff1[x - 1] = (int8_t)(-signDown); - tmp_stats[edgeType] += (fenc[x] - rec[x]); + tmp_stats[edgeType] += diff[x]; tmp_count[edgeType]++; } upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]); rec += stride; - fenc += stride; + diff += MAX_CU_SIZE; } for (x = 0; x < SAO::NUM_EDGETYPE; x++)
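The broadest refactor in sao.cpp above changes the saoCuStats* kernels from reading fenc directly to consuming a precomputed 16-bit residual: calcSaoStatsCu() now subtracts (fenc - rec) once per CTU into diff[] with a fixed MAX_CU_SIZE row stride (using the optimized sub_ps primitive on full CTUs), and the band-offset pass plus the four edge-offset passes all accumulate from that buffer. A minimal sketch of the two halves, assuming 8-bit pixels (x265 also builds with 16-bit pixel types):

#include <cstdint>

enum { MAX_CU_SIZE = 64 };

// one subtraction pass per CTU; diff[] uses a constant row stride
static void buildDiff(const uint8_t* fenc, const uint8_t* rec, intptr_t stride,
                      int width, int height, int16_t* diff)
{
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            diff[y * MAX_CU_SIZE + x] = (int16_t)(fenc[y * stride + x] - rec[y * stride + x]);
}

// band-offset statistics, shaped like saoCuStatsBO_c above: classify each
// reconstructed pixel into one of 32 bands and accumulate its residual
static void statsBO(const int16_t* diff, const uint8_t* rec, intptr_t stride,
                    int endX, int endY, int32_t stats[33], int32_t count[33])
{
    const int boShift = 8 - 5; // X265_DEPTH - SAO_BO_BITS for 8-bit video
    for (int y = 0; y < endY; y++, diff += MAX_CU_SIZE, rec += stride)
        for (int x = 0; x < endX; x++)
        {
            int classIdx = 1 + (rec[x] >> boShift);
            stats[classIdx] += diff[x];
            count[classIdx]++;
        }
}

Computing the residual once and reusing it across all five statistics passes removes four redundant source/recon subtractions per pixel, at the cost of one 64x64 int16 scratch buffer per worker.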
View file
x265_1.8.tar.gz/source/encoder/sao.h -> x265_1.9.tar.gz/source/encoder/sao.h
Changed
@@ -62,6 +62,7 @@ enum { NUM_EDGETYPE = 5 }; enum { NUM_PLANE = 3 }; enum { NUM_MERGE_MODE = 3 }; + enum { SAO_DEPTHRATE_SIZE = 4 }; static const uint32_t s_eoTable[NUM_EDGETYPE]; @@ -71,18 +72,19 @@ protected: /* allocated per part */ - PerClass* m_count; - PerClass* m_offset; - PerClass* m_offsetOrg; + PerPlane m_count; + PerPlane m_offset; + PerPlane m_offsetOrg; /* allocated per CTU */ PerPlane* m_countPreDblk; PerPlane* m_offsetOrgPreDblk; - double m_depthSaoRate[2][4]; - int8_t m_offsetBo[SAO_NUM_BO_CLASSES]; - int8_t m_offsetEo[NUM_EDGETYPE]; + double* m_depthSaoRate; + int8_t m_offsetBo[NUM_PLANE][SAO_NUM_BO_CLASSES]; + int8_t m_offsetEo[NUM_PLANE][NUM_EDGETYPE]; + int m_chromaFormat; int m_numCuInWidth; int m_numCuInHeight; int m_hChromaShift; @@ -91,10 +93,9 @@ pixel* m_clipTable; pixel* m_clipTableBase; - pixel* m_tmpU1[3]; - pixel* m_tmpU2[3]; - pixel* m_tmpL1; - pixel* m_tmpL2; + pixel* m_tmpU[3]; + pixel* m_tmpL1[3]; + pixel* m_tmpL2[3]; public: @@ -119,8 +120,9 @@ SAO(); - bool create(x265_param* param); - void destroy(); + bool create(x265_param* param, int initCommon); + void createFromRootNode(SAO *root); + void destroy(int destoryCommon); void allocSaoParam(SAOParam* saoParam) const; @@ -131,6 +133,8 @@ // CTU-based SAO process without slice granularity void processSaoCu(int addr, int typeIdx, int plane); void processSaoUnitRow(SaoCtuParam* ctuParam, int idxY, int plane); + void processSaoUnitCuLuma(SaoCtuParam* ctuParam, int idxY, int idxX); + void processSaoUnitCuChroma(SaoCtuParam* ctuParam[3], int idxY, int idxX); void copySaoUnit(SaoCtuParam* saoUnitDst, const SaoCtuParam* saoUnitSrc); @@ -146,6 +150,9 @@ void rdoSaoUnitRowEnd(const SAOParam* saoParam, int numctus); void rdoSaoUnitRow(SAOParam* saoParam, int idxY); + void rdoSaoUnitCu(SAOParam* saoParam, int rowBaseAddr, int idxX, int addr); + + friend class FrameFilter; }; }
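sao.h also records a split of SAO state into per-worker and shared parts: m_depthSaoRate drops from a double[2][4] member to a pointer over 2 * SAO_DEPTHRATE_SIZE entries, and the new create(param, initCommon) / createFromRootNode(root) / destroy(destoryCommon) trio lets worker SAO instances alias the root's common tables (m_countPreDblk, m_offsetOrgPreDblk, the clip table) rather than allocate their own. A sketch of the flattened indexing and the ownership contract; apart from SAO_DEPTHRATE_SIZE, the names are illustrative:

enum { SAO_DEPTHRATE_SIZE = 4 };

// same element order as the old double[2][4]: component-major, depth-minor
static inline double& depthSaoRate(double* table, int component /*0 luma, 1 chroma*/, int depth)
{
    return table[component * SAO_DEPTHRATE_SIZE + depth];
}

struct SaoRoot   { double* rate; };  // allocates 2 * SAO_DEPTHRATE_SIZE doubles, later frees them
struct SaoWorker { double* rate; };  // copies the root's pointer; never frees it

Only the destroy() call that passes a true destoryCommon releases the shared buffers, which is why createFromRootNode() asserts each common pointer is still NULL before aliasing it.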
View file
x265_1.8.tar.gz/source/encoder/search.cpp -> x265_1.9.tar.gz/source/encoder/search.cpp
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> +* Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -80,7 +81,7 @@ m_me.init(param.searchMethod, param.subpelRefine, param.internalCsp); bool ok = m_quant.init(param.rdoqLevel, param.psyRdoq, scalingList, m_entropyCoder); - if (m_param->noiseReductionIntra || m_param->noiseReductionInter) + if (m_param->noiseReductionIntra || m_param->noiseReductionInter || m_param->rc.vbvBufferSize) ok &= m_quant.allocNoiseReduction(param); ok &= Predict::allocBuffers(param.internalCsp); /* sets m_hChromaShift & m_vChromaShift */ @@ -97,13 +98,27 @@ * the coeffRQT and reconQtYuv are allocated to the max CU size at every depth. The parts * which are reconstructed at each depth are valid. At the end, the transform depth table * is walked and the coeff and recon at the correct depths are collected */ - for (uint32_t i = 0; i <= m_numLayers; i++) + + if (param.internalCsp != X265_CSP_I400) + { + for (uint32_t i = 0; i <= m_numLayers; i++) + { + CHECKED_MALLOC(m_rqt[i].coeffRQT[0], coeff_t, sizeL + sizeC * 2); + m_rqt[i].coeffRQT[1] = m_rqt[i].coeffRQT[0] + sizeL; + m_rqt[i].coeffRQT[2] = m_rqt[i].coeffRQT[0] + sizeL + sizeC; + ok &= m_rqt[i].reconQtYuv.create(g_maxCUSize, param.internalCsp); + ok &= m_rqt[i].resiQtYuv.create(g_maxCUSize, param.internalCsp); + } + } + else { - CHECKED_MALLOC(m_rqt[i].coeffRQT[0], coeff_t, sizeL + sizeC * 2); - m_rqt[i].coeffRQT[1] = m_rqt[i].coeffRQT[0] + sizeL; - m_rqt[i].coeffRQT[2] = m_rqt[i].coeffRQT[0] + sizeL + sizeC; - ok &= m_rqt[i].reconQtYuv.create(g_maxCUSize, param.internalCsp); - ok &= m_rqt[i].resiQtYuv.create(g_maxCUSize, param.internalCsp); + for (uint32_t i = 0; i <= m_numLayers; i++) + { + CHECKED_MALLOC(m_rqt[i].coeffRQT[0], coeff_t, sizeL); + m_rqt[i].coeffRQT[1] = m_rqt[i].coeffRQT[2] = NULL; + ok &= m_rqt[i].reconQtYuv.create(g_maxCUSize, param.internalCsp); + ok &= m_rqt[i].resiQtYuv.create(g_maxCUSize, param.internalCsp); + } } /* the rest of these buffers are indexed per-depth */ @@ -116,12 +131,22 @@ ok &= m_rqt[i].bidirPredYuv[1].create(cuSize, param.internalCsp); } - CHECKED_MALLOC(m_qtTempCbf[0], uint8_t, numPartitions * 3); - m_qtTempCbf[1] = m_qtTempCbf[0] + numPartitions; - m_qtTempCbf[2] = m_qtTempCbf[0] + numPartitions * 2; - CHECKED_MALLOC(m_qtTempTransformSkipFlag[0], uint8_t, numPartitions * 3); - m_qtTempTransformSkipFlag[1] = m_qtTempTransformSkipFlag[0] + numPartitions; - m_qtTempTransformSkipFlag[2] = m_qtTempTransformSkipFlag[0] + numPartitions * 2; + if (param.internalCsp != X265_CSP_I400) + { + CHECKED_MALLOC(m_qtTempCbf[0], uint8_t, numPartitions * 3); + m_qtTempCbf[1] = m_qtTempCbf[0] + numPartitions; + m_qtTempCbf[2] = m_qtTempCbf[0] + numPartitions * 2; + CHECKED_MALLOC(m_qtTempTransformSkipFlag[0], uint8_t, numPartitions * 3); + m_qtTempTransformSkipFlag[1] = m_qtTempTransformSkipFlag[0] + numPartitions; + m_qtTempTransformSkipFlag[2] = m_qtTempTransformSkipFlag[0] + numPartitions * 2; + } + else + { + CHECKED_MALLOC(m_qtTempCbf[0], uint8_t, numPartitions); + m_qtTempCbf[1] = m_qtTempCbf[2] = NULL; + CHECKED_MALLOC(m_qtTempTransformSkipFlag[0], uint8_t, numPartitions); + m_qtTempTransformSkipFlag[1] = m_qtTempTransformSkipFlag[2] = NULL; + } CHECKED_MALLOC(m_intraPred, pixel, (32 * 32) * (33 + 3)); m_fencScaled = m_intraPred + 32 * 32; @@ -163,12 +188,12 @@ X265_FREE(m_tsRecon); } -int 
Search::setLambdaFromQP(const CUData& ctu, int qp) +int Search::setLambdaFromQP(const CUData& ctu, int qp, int lambdaQp) { X265_CHECK(qp >= QP_MIN && qp <= QP_MAX_MAX, "QP used for lambda is out of range\n"); m_me.setQP(qp); - m_rdCost.setQP(*m_slice, qp); + m_rdCost.setQP(*m_slice, lambdaQp < 0 ? qp : lambdaQp); int quantQP = x265_clip3(QP_MIN, QP_MAX_SPEC, qp); m_quant.setQPforQuant(ctu, quantQP); @@ -446,8 +471,9 @@ } // set reconstruction for next intra prediction blocks if full TU prediction won - pixel* picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx); - intptr_t picStride = m_frame->m_reconPic->m_stride; + PicYuv* reconPic = m_frame->m_reconPic; + pixel* picReconY = reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx); + intptr_t picStride = reconPic->m_stride; primitives.cu[sizeIdx].copy_pp(picReconY, picStride, reconQt, reconQtStride); outCost.rdcost += fullCost.rdcost; @@ -530,7 +556,7 @@ // no residual coded, recon = pred primitives.cu[sizeIdx].copy_pp(tmpRecon, tmpReconStride, pred, stride); - sse_ret_t tmpDist = primitives.cu[sizeIdx].sse_pp(tmpRecon, tmpReconStride, fenc, stride); + sse_t tmpDist = primitives.cu[sizeIdx].sse_pp(tmpRecon, tmpReconStride, fenc, stride); cu.setTransformSkipSubParts(useTSkip, TEXT_LUMA, absPartIdx, fullDepth); cu.setCbfSubParts((!!numSig) << tuDepth, TEXT_LUMA, absPartIdx, fullDepth); @@ -611,8 +637,9 @@ } // set reconstruction for next intra prediction blocks - pixel* picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx); - intptr_t picStride = m_frame->m_reconPic->m_stride; + PicYuv* reconPic = m_frame->m_reconPic; + pixel* picReconY = reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx); + intptr_t picStride = reconPic->m_stride; primitives.cu[sizeIdx].copy_pp(picReconY, picStride, reconQt, reconQtStride); outCost.rdcost += fullCost.rdcost; @@ -661,8 +688,9 @@ uint32_t sizeIdx = log2TrSize - 2; primitives.cu[sizeIdx].calcresidual(fenc, pred, residual, stride); - pixel* picReconY = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx); - intptr_t picStride = m_frame->m_reconPic->m_stride; + PicYuv* reconPic = m_frame->m_reconPic; + pixel* picReconY = reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx); + intptr_t picStride = reconPic->m_stride; uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffY, log2TrSize, TEXT_LUMA, absPartIdx, false); if (numSig) @@ -750,7 +778,7 @@ } /* returns distortion */ -uint32_t Search::codeIntraChromaQt(Mode& mode, const CUGeom& cuGeom, uint32_t tuDepth, uint32_t absPartIdx, uint32_t& psyEnergy) +void Search::codeIntraChromaQt(Mode& mode, const CUGeom& cuGeom, uint32_t tuDepth, uint32_t absPartIdx, Cost& outCost) { CUData& cu = mode.cu; uint32_t log2TrSize = cuGeom.log2CUSize - tuDepth; @@ -758,10 +786,10 @@ if (tuDepth < cu.m_tuDepth[absPartIdx]) { uint32_t qNumParts = 1 << (log2TrSize - 1 - LOG2_UNIT_SIZE) * 2; - uint32_t outDist = 0, splitCbfU = 0, splitCbfV = 0; + uint32_t splitCbfU = 0, splitCbfV = 0; for (uint32_t qIdx = 0, qPartIdx = absPartIdx; qIdx < 4; ++qIdx, qPartIdx += qNumParts) { - outDist += codeIntraChromaQt(mode, cuGeom, tuDepth + 1, qPartIdx, psyEnergy); + codeIntraChromaQt(mode, cuGeom, tuDepth + 1, qPartIdx, outCost); splitCbfU |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1); splitCbfV |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1); } @@ -770,8 +798,7 @@ cu.m_cbf[1][absPartIdx + offs] |= (splitCbfU << tuDepth); 
cu.m_cbf[2][absPartIdx + offs] |= (splitCbfV << tuDepth); } - - return outDist; + return; } uint32_t log2TrSizeC = log2TrSize - m_hChromaShift; @@ -780,7 +807,7 @@ { X265_CHECK(log2TrSize == 2 && m_csp != X265_CSP_I444 && tuDepth, "invalid tuDepth\n"); if (absPartIdx & 3) - return 0; + return; log2TrSizeC = 2; tuDepthC--; } @@ -791,13 +818,15 @@ bool checkTransformSkip = m_slice->m_pps->bTransformSkipEnabled && log2TrSizeC <= MAX_LOG2_TS_SIZE && !cu.m_tqBypass[0]; checkTransformSkip &= !m_param->bEnableTSkipFast || (log2TrSize <= MAX_LOG2_TS_SIZE && cu.m_transformSkip[TEXT_LUMA][absPartIdx]); if (checkTransformSkip) - return codeIntraChromaTSkip(mode, cuGeom, tuDepth, tuDepthC, absPartIdx, psyEnergy); + { + codeIntraChromaTSkip(mode, cuGeom, tuDepth, tuDepthC, absPartIdx, outCost); + return; + } ShortYuv& resiYuv = m_rqt[cuGeom.depth].tmpResiYuv; uint32_t qtLayer = log2TrSize - 2; uint32_t stride = mode.fencYuv->m_csize; const uint32_t sizeIdxC = log2TrSizeC - 2; - sse_ret_t outDist = 0; uint32_t curPartNum = cuGeom.numPartitions >> tuDepthC * 2; const SplitType splitType = (m_csp == X265_CSP_I422) ? VERTICAL_SPLIT : DONT_SPLIT; @@ -821,8 +850,9 @@ coeff_t* coeffC = m_rqt[qtLayer].coeffRQT[chromaId] + coeffOffsetC; pixel* reconQt = m_rqt[qtLayer].reconQtYuv.getChromaAddr(chromaId, absPartIdxC); uint32_t reconQtStride = m_rqt[qtLayer].reconQtYuv.m_csize; - pixel* picReconC = m_frame->m_reconPic->getChromaAddr(chromaId, cu.m_cuAddr, cuGeom.absPartIdx + absPartIdxC); - intptr_t picStride = m_frame->m_reconPic->m_strideC; + PicYuv* reconPic = m_frame->m_reconPic; + pixel* picReconC = reconPic->getChromaAddr(chromaId, cu.m_cuAddr, cuGeom.absPartIdx + absPartIdxC); + intptr_t picStride = reconPic->m_strideC; uint32_t chromaPredMode = cu.m_chromaIntraDir[absPartIdxC]; if (chromaPredMode == DM_CHROMA_IDX) @@ -852,10 +882,10 @@ cu.setCbfPartRange(0, ttype, absPartIdxC, tuIterator.absPartIdxStep); } - outDist += m_rdCost.scaleChromaDist(chromaId, primitives.cu[sizeIdxC].sse_pp(reconQt, reconQtStride, fenc, stride)); + outCost.distortion += m_rdCost.scaleChromaDist(chromaId, primitives.cu[sizeIdxC].sse_pp(reconQt, reconQtStride, fenc, stride)); if (m_rdCost.m_psyRd) - psyEnergy += m_rdCost.psyCost(sizeIdxC, fenc, stride, reconQt, reconQtStride); + outCost.energy += m_rdCost.psyCost(sizeIdxC, fenc, stride, reconQt, reconQtStride); primitives.cu[sizeIdxC].copy_pp(picReconC, picStride, reconQt, reconQtStride); } @@ -867,19 +897,16 @@ offsetSubTUCBFs(cu, TEXT_CHROMA_U, tuDepth, absPartIdx); offsetSubTUCBFs(cu, TEXT_CHROMA_V, tuDepth, absPartIdx); } - - return outDist; } /* returns distortion */ -uint32_t Search::codeIntraChromaTSkip(Mode& mode, const CUGeom& cuGeom, uint32_t tuDepth, uint32_t tuDepthC, uint32_t absPartIdx, uint32_t& psyEnergy) +void Search::codeIntraChromaTSkip(Mode& mode, const CUGeom& cuGeom, uint32_t tuDepth, uint32_t tuDepthC, uint32_t absPartIdx, Cost& outCost) { CUData& cu = mode.cu; uint32_t fullDepth = cuGeom.depth + tuDepth; uint32_t log2TrSize = cuGeom.log2CUSize - tuDepth; const uint32_t log2TrSizeC = 2; uint32_t qtLayer = log2TrSize - 2; - uint32_t outDist = 0; /* At the TU layers above this one, no RDO is performed, only distortion is being measured, * so the entropy coder is not very accurate. 
The best we can do is return it in the same @@ -925,7 +952,7 @@ predIntraChromaAng(chromaPredMode, pred, stride, log2TrSizeC); uint64_t bCost = MAX_INT64; - uint32_t bDist = 0; + sse_t bDist = 0; uint32_t bCbf = 0; uint32_t bEnergy = 0; int bTSkip = 0; @@ -956,7 +983,7 @@ primitives.cu[sizeIdxC].copy_pp(recon, reconStride, pred, stride); cu.setCbfPartRange(0, ttype, absPartIdxC, tuIterator.absPartIdxStep); } - sse_ret_t tmpDist = primitives.cu[sizeIdxC].sse_pp(recon, reconStride, fenc, stride); + sse_t tmpDist = primitives.cu[sizeIdxC].sse_pp(recon, reconStride, fenc, stride); tmpDist = m_rdCost.scaleChromaDist(chromaId, tmpDist); cu.setTransformSkipPartRange(useTSkip, ttype, absPartIdxC, tuIterator.absPartIdxStep); @@ -998,12 +1025,13 @@ cu.setCbfPartRange(bCbf << tuDepth, ttype, absPartIdxC, tuIterator.absPartIdxStep); cu.setTransformSkipPartRange(bTSkip, ttype, absPartIdxC, tuIterator.absPartIdxStep); - pixel* reconPicC = m_frame->m_reconPic->getChromaAddr(chromaId, cu.m_cuAddr, cuGeom.absPartIdx + absPartIdxC); - intptr_t picStride = m_frame->m_reconPic->m_strideC; + PicYuv* reconPic = m_frame->m_reconPic; + pixel* reconPicC = reconPic->getChromaAddr(chromaId, cu.m_cuAddr, cuGeom.absPartIdx + absPartIdxC); + intptr_t picStride = reconPic->m_strideC; primitives.cu[sizeIdxC].copy_pp(reconPicC, picStride, reconQt, reconQtStride); - outDist += bDist; - psyEnergy += bEnergy; + outCost.distortion += bDist; + outCost.energy += bEnergy; } } while (tuIterator.isNextSection()); @@ -1015,7 +1043,6 @@ } m_entropyCoder.load(m_rqt[fullDepth].rqtRoot); - return outDist; } void Search::extractIntraResultChromaQT(CUData& cu, Yuv& reconYuv, uint32_t absPartIdx, uint32_t tuDepth) @@ -1108,8 +1135,9 @@ int16_t* residual = resiYuv.getChromaAddr(chromaId, absPartIdxC); uint32_t coeffOffsetC = absPartIdxC << (LOG2_UNIT_SIZE * 2 - (m_hChromaShift + m_vChromaShift)); coeff_t* coeffC = cu.m_trCoeff[ttype] + coeffOffsetC; - pixel* picReconC = m_frame->m_reconPic->getChromaAddr(chromaId, cu.m_cuAddr, cuGeom.absPartIdx + absPartIdxC); - intptr_t picStride = m_frame->m_reconPic->m_strideC; + PicYuv* reconPic = m_frame->m_reconPic; + pixel* picReconC = reconPic->getChromaAddr(chromaId, cu.m_cuAddr, cuGeom.absPartIdx + absPartIdxC); + intptr_t picStride = reconPic->m_strideC; uint32_t chromaPredMode = cu.m_chromaIntraDir[absPartIdxC]; if (chromaPredMode == DM_CHROMA_IDX) @@ -1150,7 +1178,7 @@ } } -void Search::checkIntra(Mode& intraMode, const CUGeom& cuGeom, PartSize partSize, uint8_t* sharedModes, uint8_t* sharedChromaModes) +void Search::checkIntra(Mode& intraMode, const CUGeom& cuGeom, PartSize partSize) { CUData& cu = intraMode.cu; @@ -1161,34 +1189,43 @@ cu.getIntraTUQtDepthRange(tuDepthRange, 0); intraMode.initCosts(); - intraMode.lumaDistortion += estIntraPredQT(intraMode, cuGeom, tuDepthRange, sharedModes); - intraMode.chromaDistortion += estIntraPredChromaQT(intraMode, cuGeom, sharedChromaModes); - intraMode.distortion += intraMode.lumaDistortion + intraMode.chromaDistortion; + intraMode.lumaDistortion += estIntraPredQT(intraMode, cuGeom, tuDepthRange); + if (m_csp != X265_CSP_I400) + { + intraMode.chromaDistortion += estIntraPredChromaQT(intraMode, cuGeom); + intraMode.distortion += intraMode.lumaDistortion + intraMode.chromaDistortion; + } + else + intraMode.distortion += intraMode.lumaDistortion; m_entropyCoder.resetBits(); if (m_slice->m_pps->bTransquantBypassEnabled) m_entropyCoder.codeCUTransquantBypassFlag(cu.m_tqBypass[0]); + int skipFlagBits = 0; if (!m_slice->isIntra()) { 
m_entropyCoder.codeSkipFlag(cu, 0); + skipFlagBits = m_entropyCoder.getNumberOfWrittenBits(); m_entropyCoder.codePredMode(cu.m_predMode[0]); } m_entropyCoder.codePartSize(cu, 0, cuGeom.depth); m_entropyCoder.codePredInfo(cu, 0); - intraMode.mvBits = m_entropyCoder.getNumberOfWrittenBits(); + intraMode.mvBits = m_entropyCoder.getNumberOfWrittenBits() - skipFlagBits; bool bCodeDQP = m_slice->m_pps->bUseDQP; m_entropyCoder.codeCoeff(cu, 0, bCodeDQP, tuDepthRange); m_entropyCoder.store(intraMode.contexts); intraMode.totalBits = m_entropyCoder.getNumberOfWrittenBits(); - intraMode.coeffBits = intraMode.totalBits - intraMode.mvBits; + intraMode.coeffBits = intraMode.totalBits - intraMode.mvBits - skipFlagBits; if (m_rdCost.m_psyRd) { const Yuv* fencYuv = intraMode.fencYuv; intraMode.psyEnergy = m_rdCost.psyCost(cuGeom.log2CUSize - 2, fencYuv->m_buf[0], fencYuv->m_size, intraMode.reconYuv.m_buf[0], intraMode.reconYuv.m_size); } + intraMode.resEnergy = primitives.cu[cuGeom.log2CUSize - 2].sse_pp(intraMode.fencYuv->m_buf[0], intraMode.fencYuv->m_size, intraMode.predYuv.m_buf[0], intraMode.predYuv.m_size); + updateModeCost(intraMode); checkDQP(intraMode, cuGeom); } @@ -1356,7 +1393,6 @@ intraMode.distortion = bsad; intraMode.sa8dCost = bcost; intraMode.sa8dBits = bbits; - X265_CHECK(intraMode.ok(), "intra mode is not ok"); } void Search::encodeIntraInInter(Mode& intraMode, const CUGeom& cuGeom) @@ -1379,35 +1415,41 @@ extractIntraResultQT(cu, *reconYuv, 0, 0); intraMode.lumaDistortion = icosts.distortion; - intraMode.chromaDistortion = estIntraPredChromaQT(intraMode, cuGeom, NULL); - intraMode.distortion = intraMode.lumaDistortion + intraMode.chromaDistortion; + if (m_csp != X265_CSP_I400) + { + intraMode.chromaDistortion = estIntraPredChromaQT(intraMode, cuGeom); + intraMode.distortion = intraMode.lumaDistortion + intraMode.chromaDistortion; + } + else + intraMode.distortion = intraMode.lumaDistortion; m_entropyCoder.resetBits(); if (m_slice->m_pps->bTransquantBypassEnabled) m_entropyCoder.codeCUTransquantBypassFlag(cu.m_tqBypass[0]); m_entropyCoder.codeSkipFlag(cu, 0); + int skipFlagBits = m_entropyCoder.getNumberOfWrittenBits(); m_entropyCoder.codePredMode(cu.m_predMode[0]); m_entropyCoder.codePartSize(cu, 0, cuGeom.depth); m_entropyCoder.codePredInfo(cu, 0); - intraMode.mvBits += m_entropyCoder.getNumberOfWrittenBits(); + intraMode.mvBits = m_entropyCoder.getNumberOfWrittenBits() - skipFlagBits; bool bCodeDQP = m_slice->m_pps->bUseDQP; m_entropyCoder.codeCoeff(cu, 0, bCodeDQP, tuDepthRange); intraMode.totalBits = m_entropyCoder.getNumberOfWrittenBits(); - intraMode.coeffBits = intraMode.totalBits - intraMode.mvBits; + intraMode.coeffBits = intraMode.totalBits - intraMode.mvBits - skipFlagBits; if (m_rdCost.m_psyRd) { const Yuv* fencYuv = intraMode.fencYuv; intraMode.psyEnergy = m_rdCost.psyCost(cuGeom.log2CUSize - 2, fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); } - + intraMode.resEnergy = primitives.cu[cuGeom.log2CUSize - 2].sse_pp(intraMode.fencYuv->m_buf[0], intraMode.fencYuv->m_size, intraMode.predYuv.m_buf[0], intraMode.predYuv.m_size); m_entropyCoder.store(intraMode.contexts); updateModeCost(intraMode); checkDQP(intraMode, cuGeom); } -uint32_t Search::estIntraPredQT(Mode &intraMode, const CUGeom& cuGeom, const uint32_t depthRange[2], uint8_t* sharedModes) +sse_t Search::estIntraPredQT(Mode &intraMode, const CUGeom& cuGeom, const uint32_t depthRange[2]) { CUData& cu = intraMode.cu; Yuv* reconYuv = &intraMode.reconYuv; @@ -1422,7 +1464,7 @@ uint32_t qNumParts = 
cuGeom.numPartitions >> 2; uint32_t sizeIdx = log2TrSize - 2; uint32_t absPartIdx = 0; - uint32_t totalDistortion = 0; + sse_t totalDistortion = 0; int checkTransformSkip = m_slice->m_pps->bTransformSkipEnabled && !cu.m_tqBypass[0] && cu.m_partSize[0] != SIZE_2Nx2N; @@ -1431,8 +1473,8 @@ { uint32_t bmode = 0; - if (sharedModes) - bmode = sharedModes[puIdx]; + if (intraMode.cu.m_lumaIntraDir[puIdx] != (uint8_t)ALL_IDX) + bmode = intraMode.cu.m_lumaIntraDir[puIdx]; else { uint64_t candCostList[MAX_RD_INTRA_MODES]; @@ -1456,25 +1498,6 @@ int scaleStride = stride; int costShift = 0; - if (tuSize > 32) - { - // origin is 64x64, we scale to 32x32 and setup required parameters - primitives.scale2D_64to32(m_fencScaled, fenc, stride); - fenc = m_fencScaled; - - pixel nScale[129]; - intraNeighbourBuf[1][0] = intraNeighbourBuf[0][0]; - primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1); - - memcpy(&intraNeighbourBuf[0][1], &nScale[1], 2 * 64 * sizeof(pixel)); - memcpy(&intraNeighbourBuf[1][1], &nScale[1], 2 * 64 * sizeof(pixel)); - - scaleTuSize = 32; - scaleStride = 32; - costShift = 2; - sizeIdx = 5 - 2; // log2(scaleTuSize) - 2 - } - m_entropyCoder.loadIntraDirModeLuma(m_rqt[depth].cur); /* there are three cost tiers for intra modes: @@ -1541,9 +1564,10 @@ for (int i = 0; i < maxCandCount; i++) candCostList[i] = MAX_INT64; - uint64_t paddedBcost = bcost + (bcost >> 3); // 1.12% + uint64_t paddedBcost = bcost + (bcost >> 2); // 1.25% for (int mode = 0; mode < 35; mode++) - if (modeCosts[mode] < paddedBcost || (mpms & ((uint64_t)1 << mode))) + if ((modeCosts[mode] < paddedBcost) || ((uint32_t)mode == mpmModes[0])) + /* choose for R-D analysis only if this mode passes cost threshold or matches MPM[0] */ updateCandList(mode, modeCosts[mode], maxCandCount, rdModeList, candCostList); } @@ -1590,10 +1614,11 @@ * output recon picture, so it cannot proceed in parallel with anything else when doing INTRA_NXN. Also * it is not updating m_rdContexts[depth].cur for the later PUs which I suspect is slightly wrong. 
I think * that the contexts should be tracked through each PU */ - pixel* dst = m_frame->m_reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx); - uint32_t dststride = m_frame->m_reconPic->m_stride; - const pixel* src = reconYuv->getLumaAddr(absPartIdx); - uint32_t srcstride = reconYuv->m_size; + PicYuv* reconPic = m_frame->m_reconPic; + pixel* dst = reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx); + uint32_t dststride = reconPic->m_stride; + const pixel* src = reconYuv->getLumaAddr(absPartIdx); + uint32_t srcstride = reconYuv->m_size; primitives.cu[log2TrSize - 2].copy_pp(dst, dststride, src, srcstride); } } @@ -1670,7 +1695,7 @@ cu.setChromIntraDirSubParts(bestMode, 0, cuGeom.depth); } -uint32_t Search::estIntraPredChromaQT(Mode &intraMode, const CUGeom& cuGeom, uint8_t* sharedChromaModes) +sse_t Search::estIntraPredChromaQT(Mode &intraMode, const CUGeom& cuGeom) { CUData& cu = intraMode.cu; Yuv& reconYuv = intraMode.reconYuv; @@ -1679,7 +1704,7 @@ uint32_t initTuDepth = cu.m_partSize[0] != SIZE_2Nx2N && m_csp == X265_CSP_I444; uint32_t log2TrSize = cuGeom.log2CUSize - initTuDepth; uint32_t absPartStep = cuGeom.numPartitions; - uint32_t totalDistortion = 0; + sse_t totalDistortion = 0; int size = partitionFromLog2Size(log2TrSize); @@ -1690,7 +1715,7 @@ uint32_t absPartIdxC = tuIterator.absPartIdxTURelCU; uint32_t bestMode = 0; - uint32_t bestDist = 0; + sse_t bestDist = 0; uint64_t bestCost = MAX_INT64; // init mode list @@ -1698,10 +1723,10 @@ uint32_t maxMode = NUM_CHROMA_MODE; uint32_t modeList[NUM_CHROMA_MODE]; - if (sharedChromaModes && !initTuDepth) + if (intraMode.cu.m_chromaIntraDir[0] != (uint8_t)ALL_IDX && !initTuDepth) { for (uint32_t l = 0; l < NUM_CHROMA_MODE; l++) - modeList[l] = sharedChromaModes[0]; + modeList[l] = intraMode.cu.m_chromaIntraDir[0]; maxMode = 1; } else @@ -1714,8 +1739,8 @@ m_entropyCoder.load(m_rqt[depth].cur); cu.setChromIntraDirSubParts(modeList[mode], absPartIdxC, depth + initTuDepth); - uint32_t psyEnergy = 0; - uint32_t dist = codeIntraChromaQt(intraMode, cuGeom, initTuDepth, absPartIdxC, psyEnergy); + Cost outCost; + codeIntraChromaQt(intraMode, cuGeom, initTuDepth, absPartIdxC, outCost); if (m_slice->m_pps->bTransformSkipEnabled) m_entropyCoder.load(m_rqt[depth].cur); @@ -1738,12 +1763,13 @@ codeCoeffQTChroma(cu, initTuDepth, absPartIdxC, TEXT_CHROMA_U); codeCoeffQTChroma(cu, initTuDepth, absPartIdxC, TEXT_CHROMA_V); uint32_t bits = m_entropyCoder.getNumberOfWrittenBits(); - uint64_t cost = m_rdCost.m_psyRd ? m_rdCost.calcPsyRdCost(dist, bits, psyEnergy) : m_rdCost.calcRdCost(dist, bits); + uint64_t cost = m_rdCost.m_psyRd ? 
m_rdCost.calcPsyRdCost(outCost.distortion, bits, outCost.energy) + : m_rdCost.calcRdCost(outCost.distortion, bits); if (cost < bestCost) { bestCost = cost; - bestDist = dist; + bestDist = outCost.distortion; bestMode = modeList[mode]; extractIntraResultChromaQT(cu, reconYuv, absPartIdxC, initTuDepth); memcpy(m_qtTempCbf[1], cu.m_cbf[1] + absPartIdxC, tuIterator.absPartIdxStep * sizeof(uint8_t)); @@ -1756,15 +1782,16 @@ if (!tuIterator.isLastSection()) { uint32_t zorder = cuGeom.absPartIdx + absPartIdxC; - uint32_t dststride = m_frame->m_reconPic->m_strideC; + PicYuv* reconPic = m_frame->m_reconPic; + uint32_t dststride = reconPic->m_strideC; const pixel* src; pixel* dst; - dst = m_frame->m_reconPic->getCbAddr(cu.m_cuAddr, zorder); + dst = reconPic->getCbAddr(cu.m_cuAddr, zorder); src = reconYuv.getCbAddr(absPartIdxC); primitives.chroma[m_csp].cu[size].copy_pp(dst, dststride, src, reconYuv.m_csize); - dst = m_frame->m_reconPic->getCrAddr(cu.m_cuAddr, zorder); + dst = reconPic->getCrAddr(cu.m_cuAddr, zorder); src = reconYuv.getCrAddr(absPartIdxC); primitives.chroma[m_csp].cu[size].copy_pp(dst, dststride, src, reconYuv.m_csize); } @@ -1865,7 +1892,7 @@ /* find the lowres motion vector from lookahead in middle of current PU */ MV Search::getLowresMV(const CUData& cu, const PredictionUnit& pu, int list, int ref) { - int diffPoc = abs(m_slice->m_poc - m_slice->m_refPicList[list][ref]->m_poc); + int diffPoc = abs(m_slice->m_poc - m_slice->m_refPOCList[list][ref]); if (diffPoc > m_param->bframes + 1) /* poc difference is out of range for lookahead */ return 0; @@ -1905,7 +1932,7 @@ else { cu.clipMv(mvCand); - predInterLumaPixel(pu, tmpPredYuv, *m_slice->m_refPicList[list][ref]->m_reconPic, mvCand); + predInterLumaPixel(pu, tmpPredYuv, *m_slice->m_refReconPicList[list][ref], mvCand); costs[i] = m_me.bufSAD(tmpPredYuv.getLumaAddr(pu.puAbsPartIdx), tmpPredYuv.m_size); } } @@ -1998,7 +2025,8 @@ /* Get total cost of partition, but only include MV bit cost once */ bits += m_me.bitcost(outmv); - uint32_t cost = (satdCost - m_me.mvcost(outmv)) + m_rdCost.getCost(bits); + uint32_t mvCost = m_me.mvcost(outmv); + uint32_t cost = (satdCost - mvCost) + m_rdCost.getCost(bits); /* Refine MVP selection, updates: mvpIdx, bits, cost */ mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost); @@ -2014,6 +2042,7 @@ bestME[list].ref = ref; bestME[list].cost = cost; bestME[list].bits = bits; + bestME[list].mvCost = mvCost; } } @@ -2059,11 +2088,14 @@ cu.getNeighbourMV(puIdx, pu.puAbsPartIdx, interMode.interNeighbours); /* Uni-directional prediction */ - if (m_param->analysisMode == X265_ANALYSIS_LOAD && bestME[0].ref >= 0) + if (m_param->analysisMode == X265_ANALYSIS_LOAD) { for (int list = 0; list < numPredDir; list++) { int ref = bestME[list].ref; + if (ref < 0) + continue; + uint32_t bits = m_listSelBits[list] + MVP_IDX_BITS; bits += getTUBits(ref, numRefIdx[list]); @@ -2072,8 +2104,7 @@ const MV* amvp = interMode.amvpCand[list][ref]; int mvpIdx = selectMVP(cu, pu, amvp, list, ref); MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx]; - - MV lmv = getLowresMV(cu, pu, list, ref); + MV lmv = bestME[list].mv; if (lmv.notZero()) mvc[numMvc++] = lmv; @@ -2082,7 +2113,8 @@ /* Get total cost of partition, but only include MV bit cost once */ bits += m_me.bitcost(outmv); - uint32_t cost = (satdCost - m_me.mvcost(outmv)) + m_rdCost.getCost(bits); + uint32_t mvCost = m_me.mvcost(outmv); + uint32_t cost = (satdCost - mvCost) + m_rdCost.getCost(bits); /* Refine MVP selection, updates: mvpIdx, bits, cost */ mvp = checkBestMVP(amvp, outmv, 
mvpIdx, bits, cost); @@ -2094,6 +2126,7 @@ bestME[list].mvpIdx = mvpIdx; bestME[list].cost = cost; bestME[list].bits = bits; + bestME[list].mvCost = mvCost; } } bDoUnidir = false; @@ -2142,6 +2175,7 @@ } if (bDoUnidir) { + interMode.bestME[puIdx][0].ref = interMode.bestME[puIdx][1].ref = -1; uint32_t refMask = refMasks[puIdx] ? refMasks[puIdx] : (uint32_t)-1; for (int list = 0; list < numPredDir; list++) @@ -2174,19 +2208,21 @@ /* Get total cost of partition, but only include MV bit cost once */ bits += m_me.bitcost(outmv); - uint32_t cost = (satdCost - m_me.mvcost(outmv)) + m_rdCost.getCost(bits); + uint32_t mvCost = m_me.mvcost(outmv); + uint32_t cost = (satdCost - mvCost) + m_rdCost.getCost(bits); /* Refine MVP selection, updates: mvpIdx, bits, cost */ mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost); if (cost < bestME[list].cost) { - bestME[list].mv = outmv; - bestME[list].mvp = mvp; - bestME[list].mvpIdx = mvpIdx; - bestME[list].ref = ref; - bestME[list].cost = cost; - bestME[list].bits = bits; + bestME[list].mv = outmv; + bestME[list].mvp = mvp; + bestME[list].mvpIdx = mvpIdx; + bestME[list].ref = ref; + bestME[list].cost = cost; + bestME[list].bits = bits; + bestME[list].mvCost = mvCost; } } /* the second list ref bits start at bit 16 */ @@ -2221,8 +2257,8 @@ } else { - PicYuv* refPic0 = slice->m_refPicList[0][bestME[0].ref]->m_reconPic; - PicYuv* refPic1 = slice->m_refPicList[1][bestME[1].ref]->m_reconPic; + PicYuv* refPic0 = slice->m_refReconPicList[0][bestME[0].ref]; + PicYuv* refPic1 = slice->m_refReconPicList[1][bestME[1].ref]; Yuv* bidirYuv = m_rqt[cuGeom.depth].bidirPredYuv; /* Generate reference subpels */ @@ -2370,7 +2406,6 @@ motionCompensation(cu, pu, *predYuv, true, bChromaMC); } - X265_CHECK(interMode.ok(), "inter mode is not ok"); interMode.sa8dBits += totalmebits; } @@ -2449,6 +2484,17 @@ cu.clipMv(mvmin); cu.clipMv(mvmax); + if (cu.m_encData->m_param->bIntraRefresh && m_slice->m_sliceType == P_SLICE && + cu.m_cuPelX / g_maxCUSize < m_frame->m_encData->m_pir.pirStartCol && + m_slice->m_refFrameList[0][0]->m_encData->m_pir.pirEndCol < m_slice->m_sps->numCuInWidth) + { + int safeX, maxSafeMv; + safeX = m_slice->m_refFrameList[0][0]->m_encData->m_pir.pirEndCol * g_maxCUSize - 3; + maxSafeMv = (safeX - cu.m_cuPelX) * 4; + mvmax.x = X265_MIN(mvmax.x, maxSafeMv); + mvmin.x = X265_MIN(mvmin.x, maxSafeMv); + } + /* Clip search range to signaled maximum MV length. 
* We do not support this VUI field being changed from the default */ const int maxMvLen = (1 << 15) - 1; @@ -2471,9 +2517,8 @@ CUData& cu = interMode.cu; Yuv* reconYuv = &interMode.reconYuv; const Yuv* fencYuv = interMode.fencYuv; - + Yuv* predYuv = &interMode.predYuv; X265_CHECK(!cu.isIntra(0), "intra CU not expected\n"); - uint32_t depth = cu.m_cuDepth[0]; // No residual coding : SKIP mode @@ -2487,24 +2532,27 @@ // Luma int part = partitionFromLog2Size(cu.m_log2CUSize[0]); interMode.lumaDistortion = primitives.cu[part].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); + interMode.distortion = interMode.lumaDistortion; // Chroma - interMode.chromaDistortion = m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[part].sse_pp(fencYuv->m_buf[1], fencYuv->m_csize, reconYuv->m_buf[1], reconYuv->m_csize)); - interMode.chromaDistortion += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[part].sse_pp(fencYuv->m_buf[2], fencYuv->m_csize, reconYuv->m_buf[2], reconYuv->m_csize)); - interMode.distortion = interMode.lumaDistortion + interMode.chromaDistortion; - + if (m_csp != X265_CSP_I400) + { + interMode.chromaDistortion = m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[part].sse_pp(fencYuv->m_buf[1], fencYuv->m_csize, reconYuv->m_buf[1], reconYuv->m_csize)); + interMode.chromaDistortion += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[part].sse_pp(fencYuv->m_buf[2], fencYuv->m_csize, reconYuv->m_buf[2], reconYuv->m_csize)); + interMode.distortion += interMode.chromaDistortion; + } m_entropyCoder.load(m_rqt[depth].cur); m_entropyCoder.resetBits(); if (m_slice->m_pps->bTransquantBypassEnabled) m_entropyCoder.codeCUTransquantBypassFlag(cu.m_tqBypass[0]); m_entropyCoder.codeSkipFlag(cu, 0); + int skipFlagBits = m_entropyCoder.getNumberOfWrittenBits(); m_entropyCoder.codeMergeIndex(cu, 0); - - interMode.mvBits = m_entropyCoder.getNumberOfWrittenBits(); + interMode.mvBits = m_entropyCoder.getNumberOfWrittenBits() - skipFlagBits; interMode.coeffBits = 0; - interMode.totalBits = interMode.mvBits; + interMode.totalBits = interMode.mvBits + skipFlagBits; if (m_rdCost.m_psyRd) interMode.psyEnergy = m_rdCost.psyCost(part, fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); - + interMode.resEnergy = primitives.cu[part].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size); updateModeCost(interMode); m_entropyCoder.store(interMode.contexts); } @@ -2540,9 +2588,12 @@ uint32_t tqBypass = cu.m_tqBypass[0]; if (!tqBypass) { - sse_ret_t cbf0Dist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size); - cbf0Dist += m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[1], predYuv->m_csize, predYuv->m_buf[1], predYuv->m_csize)); - cbf0Dist += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[2], predYuv->m_csize, predYuv->m_buf[2], predYuv->m_csize)); + sse_t cbf0Dist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size); + if (m_csp != X265_CSP_I400) + { + cbf0Dist += m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[1], predYuv->m_csize, predYuv->m_buf[1], predYuv->m_csize)); + cbf0Dist += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[2], predYuv->m_csize, predYuv->m_buf[2], predYuv->m_csize)); + } /* Consider the RD cost of not signaling any residual */ 
m_entropyCoder.load(m_rqt[depth].cur); @@ -2577,30 +2628,33 @@ if (m_slice->m_pps->bTransquantBypassEnabled) m_entropyCoder.codeCUTransquantBypassFlag(tqBypass); - uint32_t coeffBits, bits; + uint32_t coeffBits, bits, mvBits; if (cu.m_mergeFlag[0] && cu.m_partSize[0] == SIZE_2Nx2N && !cu.getQtRootCbf(0)) { cu.setPredModeSubParts(MODE_SKIP); /* Merge/Skip */ + coeffBits = mvBits = 0; m_entropyCoder.codeSkipFlag(cu, 0); + int skipFlagBits = m_entropyCoder.getNumberOfWrittenBits(); m_entropyCoder.codeMergeIndex(cu, 0); - coeffBits = 0; - bits = m_entropyCoder.getNumberOfWrittenBits(); + mvBits = m_entropyCoder.getNumberOfWrittenBits() - skipFlagBits; + bits = mvBits + skipFlagBits; } else { m_entropyCoder.codeSkipFlag(cu, 0); + int skipFlagBits = m_entropyCoder.getNumberOfWrittenBits(); m_entropyCoder.codePredMode(cu.m_predMode[0]); m_entropyCoder.codePartSize(cu, 0, cuGeom.depth); m_entropyCoder.codePredInfo(cu, 0); - uint32_t mvBits = m_entropyCoder.getNumberOfWrittenBits(); + mvBits = m_entropyCoder.getNumberOfWrittenBits() - skipFlagBits; bool bCodeDQP = m_slice->m_pps->bUseDQP; m_entropyCoder.codeCoeff(cu, 0, bCodeDQP, tuDepthRange); bits = m_entropyCoder.getNumberOfWrittenBits(); - coeffBits = bits - mvBits; + coeffBits = bits - mvBits - skipFlagBits; } m_entropyCoder.store(interMode.contexts); @@ -2611,18 +2665,22 @@ reconYuv->copyFromYuv(*predYuv); // update with clipped distortion and cost (qp estimation loop uses unclipped values) - sse_ret_t bestLumaDist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); - sse_ret_t bestChromaDist = m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[1], fencYuv->m_csize, reconYuv->m_buf[1], reconYuv->m_csize)); - bestChromaDist += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[2], fencYuv->m_csize, reconYuv->m_buf[2], reconYuv->m_csize)); + sse_t bestLumaDist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); + interMode.distortion = bestLumaDist; + if (m_csp != X265_CSP_I400) + { + sse_t bestChromaDist = m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[1], fencYuv->m_csize, reconYuv->m_buf[1], reconYuv->m_csize)); + bestChromaDist += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[2], fencYuv->m_csize, reconYuv->m_buf[2], reconYuv->m_csize)); + interMode.chromaDistortion = bestChromaDist; + interMode.distortion += bestChromaDist; + } if (m_rdCost.m_psyRd) interMode.psyEnergy = m_rdCost.psyCost(sizeIdx, fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); - + interMode.resEnergy = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size); interMode.totalBits = bits; interMode.lumaDistortion = bestLumaDist; - interMode.chromaDistortion = bestChromaDist; - interMode.distortion = bestLumaDist + bestChromaDist; interMode.coeffBits = coeffBits; - interMode.mvBits = bits - coeffBits; + interMode.mvBits = mvBits; updateModeCost(interMode); checkDQP(interMode, cuGeom); } @@ -2641,14 +2699,15 @@ { // code full block uint32_t log2TrSizeC = log2TrSize - m_hChromaShift; - bool bCodeChroma = true; + uint32_t codeChroma = (m_csp != X265_CSP_I400) ? 
1 : 0; + uint32_t tuDepthC = tuDepth; if (log2TrSizeC < 2) { X265_CHECK(log2TrSize == 2 && m_csp != X265_CSP_I444 && tuDepth, "invalid tuDepth\n"); log2TrSizeC = 2; tuDepthC--; - bCodeChroma = !(absPartIdx & 3); + codeChroma &= !(absPartIdx & 3); } uint32_t absPartIdxStep = cuGeom.numPartitions >> tuDepthC * 2; @@ -2682,7 +2741,7 @@ cu.setCbfSubParts(0, TEXT_LUMA, absPartIdx, depth); } - if (bCodeChroma) + if (codeChroma) { uint32_t sizeIdxC = log2TrSizeC - 2; uint32_t strideResiC = resiYuv.m_csize; @@ -2748,19 +2807,25 @@ { residualTransformQuantInter(mode, cuGeom, qPartIdx, tuDepth + 1, depthRange); ycbf |= cu.getCbf(qPartIdx, TEXT_LUMA, tuDepth + 1); - ucbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1); - vcbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1); + if (m_csp != X265_CSP_I400) + { + ucbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1); + vcbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1); + } } for (uint32_t i = 0; i < 4 * qNumParts; ++i) { cu.m_cbf[0][absPartIdx + i] |= ycbf << tuDepth; - cu.m_cbf[1][absPartIdx + i] |= ucbf << tuDepth; - cu.m_cbf[2][absPartIdx + i] |= vcbf << tuDepth; + if (m_csp != X265_CSP_I400) + { + cu.m_cbf[1][absPartIdx + i] |= ucbf << tuDepth; + cu.m_cbf[2][absPartIdx + i] |= vcbf << tuDepth; + } } } } -uint64_t Search::estimateNullCbfCost(uint32_t &dist, uint32_t &psyEnergy, uint32_t tuDepth, TextType compId) +uint64_t Search::estimateNullCbfCost(sse_t dist, uint32_t psyEnergy, uint32_t tuDepth, TextType compId) { uint32_t nullBits = m_entropyCoder.estimateCbfBits(0, compId, tuDepth); @@ -2786,14 +2851,14 @@ X265_CHECK(bCheckFull || bCheckSplit, "check-full or check-split must be set\n"); uint32_t log2TrSizeC = log2TrSize - m_hChromaShift; - bool bCodeChroma = true; + uint32_t codeChroma = (m_csp != X265_CSP_I400) ? 
1 : 0; uint32_t tuDepthC = tuDepth; if (log2TrSizeC < 2) { X265_CHECK(log2TrSize == 2 && m_csp != X265_CSP_I444 && tuDepth, "invalid tuDepth\n"); log2TrSizeC = 2; tuDepthC--; - bCodeChroma = !(absPartIdx & 3); + codeChroma &= !(absPartIdx & 3); } // code full block @@ -2803,7 +2868,7 @@ uint8_t cbfFlag[MAX_NUM_COMPONENT][2 /*0 = top (or whole TU for non-4:2:2) sub-TU, 1 = bottom sub-TU*/] = { { 0, 0 }, {0, 0}, {0, 0} }; uint32_t numSig[MAX_NUM_COMPONENT][2 /*0 = top (or whole TU for non-4:2:2) sub-TU, 1 = bottom sub-TU*/] = { { 0, 0 }, {0, 0}, {0, 0} }; uint32_t singleBits[MAX_NUM_COMPONENT][2 /*0 = top (or whole TU for non-4:2:2) sub-TU, 1 = bottom sub-TU*/] = { { 0, 0 }, { 0, 0 }, { 0, 0 } }; - uint32_t singleDist[MAX_NUM_COMPONENT][2 /*0 = top (or whole TU for non-4:2:2) sub-TU, 1 = bottom sub-TU*/] = { { 0, 0 }, { 0, 0 }, { 0, 0 } }; + sse_t singleDist[MAX_NUM_COMPONENT][2 /*0 = top (or whole TU for non-4:2:2) sub-TU, 1 = bottom sub-TU*/] = { { 0, 0 }, { 0, 0 }, { 0, 0 } }; uint32_t singlePsyEnergy[MAX_NUM_COMPONENT][2 /*0 = top (or whole TU for non-4:2:2) sub-TU, 1 = bottom sub-TU*/] = { { 0, 0 }, { 0, 0 }, { 0, 0 } }; uint32_t bestTransformMode[MAX_NUM_COMPONENT][2 /*0 = top (or whole TU for non-4:2:2) sub-TU, 1 = bottom sub-TU*/] = { { 0, 0 }, { 0, 0 }, { 0, 0 } }; uint64_t minCost[MAX_NUM_COMPONENT][2 /*0 = top (or whole TU for non-4:2:2) sub-TU, 1 = bottom sub-TU*/] = { { MAX_INT64, MAX_INT64 }, {MAX_INT64, MAX_INT64}, {MAX_INT64, MAX_INT64} }; @@ -2819,14 +2884,14 @@ if (bCheckFull) { uint32_t trSizeC = 1 << log2TrSizeC; - int partSize = partitionFromLog2Size(log2TrSize); + int partSize = partitionFromLog2Size(log2TrSize); int partSizeC = partitionFromLog2Size(log2TrSizeC); const uint32_t qtLayer = log2TrSize - 2; uint32_t coeffOffsetY = absPartIdx << (LOG2_UNIT_SIZE * 2); coeff_t* coeffCurY = m_rqt[qtLayer].coeffRQT[0] + coeffOffsetY; - bool checkTransformSkip = m_slice->m_pps->bTransformSkipEnabled && !cu.m_tqBypass[0]; - bool checkTransformSkipY = checkTransformSkip && log2TrSize <= MAX_LOG2_TS_SIZE; + bool checkTransformSkip = m_slice->m_pps->bTransformSkipEnabled && !cu.m_tqBypass[0]; + bool checkTransformSkipY = checkTransformSkip && log2TrSize <= MAX_LOG2_TS_SIZE; bool checkTransformSkipC = checkTransformSkip && log2TrSizeC <= MAX_LOG2_TS_SIZE; cu.setTUDepthSubParts(tuDepth, absPartIdx, depth); @@ -2844,24 +2909,20 @@ if (bSplitPresentFlag && log2TrSize > depthRange[0]) m_entropyCoder.codeTransformSubdivFlag(0, 5 - log2TrSize); - fullCost.bits = m_entropyCoder.getNumberOfWrittenBits(); - // Coding luma cbf flag has been removed from here. The context for cbf flag is different for each depth. - // So it is valid if we encode coefficients and then cbfs at least for analysis. 
-// m_entropyCoder.codeQtCbfLuma(cbfFlag[TEXT_LUMA][0], tuDepth); if (cbfFlag[TEXT_LUMA][0]) m_entropyCoder.codeCoeffNxN(cu, coeffCurY, absPartIdx, log2TrSize, TEXT_LUMA); - - uint32_t singleBitsPrev = m_entropyCoder.getNumberOfWrittenBits(); - singleBits[TEXT_LUMA][0] = singleBitsPrev - fullCost.bits; + singleBits[TEXT_LUMA][0] = m_entropyCoder.getNumberOfWrittenBits(); X265_CHECK(log2TrSize <= 5, "log2TrSize is too large\n"); - uint32_t distY = primitives.cu[partSize].ssd_s(resiYuv.getLumaAddr(absPartIdx), resiYuv.m_size); - uint32_t psyEnergyY = 0; + + //Assuming zero residual + sse_t zeroDistY = primitives.cu[partSize].sse_pp(fenc, fencYuv->m_size, mode.predYuv.getLumaAddr(absPartIdx), mode.predYuv.m_size); + uint32_t zeroPsyEnergyY = 0; if (m_rdCost.m_psyRd) - psyEnergyY = m_rdCost.psyCost(partSize, resiYuv.getLumaAddr(absPartIdx), resiYuv.m_size, (int16_t*)zeroShort, 0); + zeroPsyEnergyY = m_rdCost.psyCost(partSize, fenc, fencYuv->m_size, mode.predYuv.getLumaAddr(absPartIdx), mode.predYuv.m_size); - int16_t* curResiY = m_rqt[qtLayer].resiQtYuv.getLumaAddr(absPartIdx); + int16_t* curResiY = m_rqt[qtLayer].resiQtYuv.getLumaAddr(absPartIdx); uint32_t strideResiY = m_rqt[qtLayer].resiQtYuv.m_size; if (cbfFlag[TEXT_LUMA][0]) @@ -2870,12 +2931,16 @@ // non-zero cost calculation for luma - This is an approximation // finally we have to encode correct cbf after comparing with null cost - const uint32_t nonZeroDistY = primitives.cu[partSize].sse_ss(resiYuv.getLumaAddr(absPartIdx), resiYuv.m_size, curResiY, strideResiY); + pixel* curReconY = m_rqt[qtLayer].reconQtYuv.getLumaAddr(absPartIdx); + uint32_t strideReconY = m_rqt[qtLayer].reconQtYuv.m_size; + primitives.cu[partSize].add_ps(curReconY, strideReconY, mode.predYuv.getLumaAddr(absPartIdx), curResiY, mode.predYuv.m_size, strideResiY); + + const sse_t nonZeroDistY = primitives.cu[partSize].sse_pp(fenc, fencYuv->m_size, curReconY, strideReconY); uint32_t nzCbfBitsY = m_entropyCoder.estimateCbfBits(cbfFlag[TEXT_LUMA][0], TEXT_LUMA, tuDepth); uint32_t nonZeroPsyEnergyY = 0; uint64_t singleCostY = 0; if (m_rdCost.m_psyRd) { - nonZeroPsyEnergyY = m_rdCost.psyCost(partSize, resiYuv.getLumaAddr(absPartIdx), resiYuv.m_size, curResiY, strideResiY); + nonZeroPsyEnergyY = m_rdCost.psyCost(partSize, fenc, fencYuv->m_size, curReconY, strideReconY); singleCostY = m_rdCost.calcPsyRdCost(nonZeroDistY, nzCbfBitsY + singleBits[TEXT_LUMA][0], nonZeroPsyEnergyY); } else @@ -2891,7 +2956,7 @@ // zero-cost calculation for luma. This is an approximation // Initial cost calculation was also an approximation. First resetting the bit counter and then encoding zero cbf. // Now encoding the zero cbf without writing into bitstream, keeping m_fracBits unchanged. The same is valid for chroma. 
- uint64_t nullCostY = estimateNullCbfCost(distY, psyEnergyY, tuDepth, TEXT_LUMA); + uint64_t nullCostY = estimateNullCbfCost(zeroDistY, zeroPsyEnergyY, tuDepth, TEXT_LUMA); if (nullCostY < singleCostY) { @@ -2900,12 +2965,12 @@ primitives.cu[partSize].blockfill_s(curResiY, strideResiY, 0); #if CHECKED_BUILD || _DEBUG uint32_t numCoeffY = 1 << (log2TrSize << 1); - memset(coeffCurY, 0, sizeof(coeff_t) * numCoeffY); + memset(coeffCurY, 0, sizeof(coeff_t)* numCoeffY); #endif if (checkTransformSkipY) minCost[TEXT_LUMA][0] = nullCostY; - singleDist[TEXT_LUMA][0] = distY; - singlePsyEnergy[TEXT_LUMA][0] = psyEnergyY; + singleDist[TEXT_LUMA][0] = zeroDistY; + singlePsyEnergy[TEXT_LUMA][0] = zeroPsyEnergyY; } else { @@ -2919,21 +2984,23 @@ else { if (checkTransformSkipY) - minCost[TEXT_LUMA][0] = estimateNullCbfCost(distY, psyEnergyY, tuDepth, TEXT_LUMA); + minCost[TEXT_LUMA][0] = estimateNullCbfCost(zeroDistY, zeroPsyEnergyY, tuDepth, TEXT_LUMA); primitives.cu[partSize].blockfill_s(curResiY, strideResiY, 0); - singleDist[TEXT_LUMA][0] = distY; - singlePsyEnergy[TEXT_LUMA][0] = psyEnergyY; + singleDist[TEXT_LUMA][0] = zeroDistY; + singleBits[TEXT_LUMA][0] = 0; + singlePsyEnergy[TEXT_LUMA][0] = zeroPsyEnergyY; } cu.setCbfSubParts(cbfFlag[TEXT_LUMA][0] << tuDepth, TEXT_LUMA, absPartIdx, depth); - if (bCodeChroma) + if (codeChroma) { uint32_t coeffOffsetC = coeffOffsetY >> (m_hChromaShift + m_vChromaShift); uint32_t strideResiC = m_rqt[qtLayer].resiQtYuv.m_csize; for (uint32_t chromaId = TEXT_CHROMA_U; chromaId <= TEXT_CHROMA_V; chromaId++) { - uint32_t distC = 0, psyEnergyC = 0; + sse_t zeroDistC = 0; + uint32_t zeroPsyEnergyC = 0; coeff_t* coeffCurC = m_rqt[qtLayer].coeffRQT[chromaId] + coeffOffsetC; TURecurse tuIterator(splitIntoSubTUs ? VERTICAL_SPLIT : DONT_SPLIT, absPartIdxStep, absPartIdx); @@ -2952,14 +3019,18 @@ numSig[chromaId][tuIterator.section] = m_quant.transformNxN(cu, fenc, fencYuv->m_csize, resi, resiYuv.m_csize, coeffCurC + subTUOffset, log2TrSizeC, (TextType)chromaId, absPartIdxC, false); cbfFlag[chromaId][tuIterator.section] = !!numSig[chromaId][tuIterator.section]; + uint32_t latestBitCount = m_entropyCoder.getNumberOfWrittenBits(); if (cbfFlag[chromaId][tuIterator.section]) m_entropyCoder.codeCoeffNxN(cu, coeffCurC + subTUOffset, absPartIdxC, log2TrSizeC, (TextType)chromaId); - uint32_t newBits = m_entropyCoder.getNumberOfWrittenBits(); - singleBits[chromaId][tuIterator.section] = newBits - singleBitsPrev; - singleBitsPrev = newBits; + + singleBits[chromaId][tuIterator.section] = m_entropyCoder.getNumberOfWrittenBits() - latestBitCount; int16_t* curResiC = m_rqt[qtLayer].resiQtYuv.getChromaAddr(chromaId, absPartIdxC); - distC = m_rdCost.scaleChromaDist(chromaId, primitives.cu[log2TrSizeC - 2].ssd_s(resiYuv.getChromaAddr(chromaId, absPartIdxC), resiYuv.m_csize)); + zeroDistC = m_rdCost.scaleChromaDist(chromaId, primitives.cu[log2TrSizeC - 2].sse_pp(fenc, fencYuv->m_csize, mode.predYuv.getChromaAddr(chromaId, absPartIdxC), mode.predYuv.m_csize)); + + if (m_rdCost.m_psyRd) + //Assuming zero residual + zeroPsyEnergyC = m_rdCost.psyCost(partSizeC, fenc, fencYuv->m_csize, mode.predYuv.getChromaAddr(chromaId, absPartIdxC), mode.predYuv.m_csize); if (cbfFlag[chromaId][tuIterator.section]) { @@ -2968,13 +3039,15 @@ // non-zero cost calculation for luma, same as luma - This is an approximation // finally we have to encode correct cbf after comparing with null cost - uint32_t dist = primitives.cu[partSizeC].sse_ss(resiYuv.getChromaAddr(chromaId, absPartIdxC), resiYuv.m_csize, curResiC, 
strideResiC); + pixel* curReconC = m_rqt[qtLayer].reconQtYuv.getChromaAddr(chromaId, absPartIdxC); + uint32_t strideReconC = m_rqt[qtLayer].reconQtYuv.m_csize; + primitives.cu[partSizeC].add_ps(curReconC, strideReconC, mode.predYuv.getChromaAddr(chromaId, absPartIdxC), curResiC, mode.predYuv.m_csize, strideResiC); + sse_t nonZeroDistC = m_rdCost.scaleChromaDist(chromaId, primitives.cu[partSizeC].sse_pp(fenc, fencYuv->m_csize, curReconC, strideReconC)); uint32_t nzCbfBitsC = m_entropyCoder.estimateCbfBits(cbfFlag[chromaId][tuIterator.section], (TextType)chromaId, tuDepth); - uint32_t nonZeroDistC = m_rdCost.scaleChromaDist(chromaId, dist); uint32_t nonZeroPsyEnergyC = 0; uint64_t singleCostC = 0; if (m_rdCost.m_psyRd) { - nonZeroPsyEnergyC = m_rdCost.psyCost(partSizeC, resiYuv.getChromaAddr(chromaId, absPartIdxC), resiYuv.m_csize, curResiC, strideResiC); + nonZeroPsyEnergyC = m_rdCost.psyCost(partSizeC, fenc, fencYuv->m_csize, curReconC, strideReconC); singleCostC = m_rdCost.calcPsyRdCost(nonZeroDistC, nzCbfBitsC + singleBits[chromaId][tuIterator.section], nonZeroPsyEnergyC); } else @@ -2988,7 +3061,7 @@ else { //zero-cost calculation for chroma. This is an approximation - uint64_t nullCostC = estimateNullCbfCost(distC, psyEnergyC, tuDepth, (TextType)chromaId); + uint64_t nullCostC = estimateNullCbfCost(zeroDistC, zeroPsyEnergyC, tuDepth, (TextType)chromaId); if (nullCostC < singleCostC) { @@ -3001,8 +3074,8 @@ #endif if (checkTransformSkipC) minCost[chromaId][tuIterator.section] = nullCostC; - singleDist[chromaId][tuIterator.section] = distC; - singlePsyEnergy[chromaId][tuIterator.section] = psyEnergyC; + singleDist[chromaId][tuIterator.section] = zeroDistC; + singlePsyEnergy[chromaId][tuIterator.section] = zeroPsyEnergyC; } else { @@ -3016,10 +3089,11 @@ else { if (checkTransformSkipC) - minCost[chromaId][tuIterator.section] = estimateNullCbfCost(distC, psyEnergyC, tuDepthC, (TextType)chromaId); + minCost[chromaId][tuIterator.section] = estimateNullCbfCost(zeroDistC, zeroPsyEnergyC, tuDepthC, (TextType)chromaId); primitives.cu[partSizeC].blockfill_s(curResiC, strideResiC, 0); - singleDist[chromaId][tuIterator.section] = distC; - singlePsyEnergy[chromaId][tuIterator.section] = psyEnergyC; + singleBits[chromaId][tuIterator.section] = 0; + singleDist[chromaId][tuIterator.section] = zeroDistC; + singlePsyEnergy[chromaId][tuIterator.section] = zeroPsyEnergyC; } cu.setCbfPartRange(cbfFlag[chromaId][tuIterator.section] << tuDepth, (TextType)chromaId, absPartIdxC, tuIterator.absPartIdxStep); @@ -3030,7 +3104,7 @@ if (checkTransformSkipY) { - uint32_t nonZeroDistY = 0; + sse_t nonZeroDistY = 0; uint32_t nonZeroPsyEnergyY = 0; uint64_t singleCostY = MAX_INT64; @@ -3054,11 +3128,12 @@ m_quant.invtransformNxN(cu, m_tsResidual, trSize, m_tsCoeff, log2TrSize, TEXT_LUMA, false, true, numSigTSkipY); - nonZeroDistY = primitives.cu[partSize].sse_ss(resiYuv.getLumaAddr(absPartIdx), resiYuv.m_size, m_tsResidual, trSize); + primitives.cu[partSize].add_ps(m_tsRecon, trSize, mode.predYuv.getLumaAddr(absPartIdx), m_tsResidual, mode.predYuv.m_size, trSize); + nonZeroDistY = primitives.cu[partSize].sse_pp(fenc, fencYuv->m_size, m_tsRecon, trSize); if (m_rdCost.m_psyRd) { - nonZeroPsyEnergyY = m_rdCost.psyCost(partSize, resiYuv.getLumaAddr(absPartIdx), resiYuv.m_size, m_tsResidual, trSize); + nonZeroPsyEnergyY = m_rdCost.psyCost(partSize, fenc, fencYuv->m_size, m_tsRecon, trSize); singleCostY = m_rdCost.calcPsyRdCost(nonZeroDistY, skipSingleBitsY, nonZeroPsyEnergyY); } else @@ -3081,9 +3156,10 @@ 
cu.setCbfSubParts(cbfFlag[TEXT_LUMA][0] << tuDepth, TEXT_LUMA, absPartIdx, depth); } - if (bCodeChroma && checkTransformSkipC) + if (codeChroma && checkTransformSkipC) { - uint32_t nonZeroDistC = 0, nonZeroPsyEnergyC = 0; + sse_t nonZeroDistC = 0; + uint32_t nonZeroPsyEnergyC = 0; uint64_t singleCostC = MAX_INT64; uint32_t strideResiC = m_rqt[qtLayer].resiQtYuv.m_csize; uint32_t coeffOffsetC = coeffOffsetY >> (m_hChromaShift + m_vChromaShift); @@ -3122,11 +3198,12 @@ m_quant.invtransformNxN(cu, m_tsResidual, trSizeC, m_tsCoeff, log2TrSizeC, (TextType)chromaId, false, true, numSigTSkipC); - uint32_t dist = primitives.cu[partSizeC].sse_ss(resiYuv.getChromaAddr(chromaId, absPartIdxC), resiYuv.m_csize, m_tsResidual, trSizeC); - nonZeroDistC = m_rdCost.scaleChromaDist(chromaId, dist); + primitives.cu[partSizeC].add_ps(m_tsRecon, trSizeC, mode.predYuv.getChromaAddr(chromaId, absPartIdxC), m_tsResidual, mode.predYuv.m_csize, trSizeC); + nonZeroDistC = m_rdCost.scaleChromaDist(chromaId, primitives.cu[partSizeC].sse_pp(fenc, fencYuv->m_csize, m_tsRecon, trSizeC)); if (m_rdCost.m_psyRd) { - nonZeroPsyEnergyC = m_rdCost.psyCost(partSizeC, resiYuv.getChromaAddr(chromaId, absPartIdxC), resiYuv.m_csize, m_tsResidual, trSizeC); + + nonZeroPsyEnergyC = m_rdCost.psyCost(partSizeC, fenc, fencYuv->m_csize, m_tsRecon, trSizeC); singleCostC = m_rdCost.calcPsyRdCost(nonZeroDistC, singleBits[chromaId][tuIterator.section], nonZeroPsyEnergyC); } else @@ -3160,7 +3237,7 @@ m_entropyCoder.resetBits(); //Encode cbf flags - if (bCodeChroma) + if (codeChroma) { if (!splitIntoSubTUs) { @@ -3234,14 +3311,20 @@ { estimateResidualQT(mode, cuGeom, qPartIdx, tuDepth + 1, resiYuv, splitCost, depthRange); ycbf |= cu.getCbf(qPartIdx, TEXT_LUMA, tuDepth + 1); - ucbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1); - vcbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1); + if (m_csp != X265_CSP_I400) + { + ucbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_U, tuDepth + 1); + vcbf |= cu.getCbf(qPartIdx, TEXT_CHROMA_V, tuDepth + 1); + } } for (uint32_t i = 0; i < 4 * qNumParts; ++i) { cu.m_cbf[0][absPartIdx + i] |= ycbf << tuDepth; - cu.m_cbf[1][absPartIdx + i] |= ucbf << tuDepth; - cu.m_cbf[2][absPartIdx + i] |= vcbf << tuDepth; + if (m_csp != X265_CSP_I400) + { + cu.m_cbf[1][absPartIdx + i] |= ucbf << tuDepth; + cu.m_cbf[2][absPartIdx + i] |= vcbf << tuDepth; + } } // Here we were encoding cbfs and coefficients for splitted blocks. 
Since I have collected coefficient bits @@ -3275,7 +3358,7 @@ } cu.setTransformSkipSubParts(bestTransformMode[TEXT_LUMA][0], TEXT_LUMA, absPartIdx, depth); - if (bCodeChroma) + if (codeChroma) { if (!splitIntoSubTUs) { @@ -3298,7 +3381,7 @@ cu.setTUDepthSubParts(tuDepth, absPartIdx, depth); cu.setCbfSubParts(cbfFlag[TEXT_LUMA][0] << tuDepth, TEXT_LUMA, absPartIdx, depth); - if (bCodeChroma) + if (codeChroma) { if (!splitIntoSubTUs) { @@ -3330,18 +3413,20 @@ const bool bSubdiv = tuDepth < cu.m_tuDepth[absPartIdx]; uint32_t log2TrSize = cu.m_log2CUSize[0] - tuDepth; - - if (!(log2TrSize - m_hChromaShift < 2)) + if (m_csp != X265_CSP_I400) { - if (!tuDepth || cu.getCbf(absPartIdx, TEXT_CHROMA_U, tuDepth - 1)) - m_entropyCoder.codeQtCbfChroma(cu, absPartIdx, TEXT_CHROMA_U, tuDepth, !bSubdiv); - if (!tuDepth || cu.getCbf(absPartIdx, TEXT_CHROMA_V, tuDepth - 1)) - m_entropyCoder.codeQtCbfChroma(cu, absPartIdx, TEXT_CHROMA_V, tuDepth, !bSubdiv); - } - else - { - X265_CHECK(cu.getCbf(absPartIdx, TEXT_CHROMA_U, tuDepth) == cu.getCbf(absPartIdx, TEXT_CHROMA_U, tuDepth - 1), "chroma CBF not matching\n"); - X265_CHECK(cu.getCbf(absPartIdx, TEXT_CHROMA_V, tuDepth) == cu.getCbf(absPartIdx, TEXT_CHROMA_V, tuDepth - 1), "chroma CBF not matching\n"); + if (!(log2TrSize - m_hChromaShift < 2)) + { + if (!tuDepth || cu.getCbf(absPartIdx, TEXT_CHROMA_U, tuDepth - 1)) + m_entropyCoder.codeQtCbfChroma(cu, absPartIdx, TEXT_CHROMA_U, tuDepth, !bSubdiv); + if (!tuDepth || cu.getCbf(absPartIdx, TEXT_CHROMA_V, tuDepth - 1)) + m_entropyCoder.codeQtCbfChroma(cu, absPartIdx, TEXT_CHROMA_V, tuDepth, !bSubdiv); + } + else + { + X265_CHECK(cu.getCbf(absPartIdx, TEXT_CHROMA_U, tuDepth) == cu.getCbf(absPartIdx, TEXT_CHROMA_U, tuDepth - 1), "chroma CBF not matching\n"); + X265_CHECK(cu.getCbf(absPartIdx, TEXT_CHROMA_V, tuDepth) == cu.getCbf(absPartIdx, TEXT_CHROMA_V, tuDepth - 1), "chroma CBF not matching\n"); + } } if (!bSubdiv) @@ -3371,14 +3456,14 @@ const uint32_t qtLayer = log2TrSize - 2; uint32_t log2TrSizeC = log2TrSize - m_hChromaShift; - bool bCodeChroma = true; + uint32_t codeChroma = (m_csp != X265_CSP_I400) ? 1 : 0; uint32_t tuDepthC = tuDepth; if (log2TrSizeC < 2) { X265_CHECK(log2TrSize == 2 && m_csp != X265_CSP_I444 && tuDepth, "invalid tuDepth\n"); log2TrSizeC = 2; tuDepthC--; - bCodeChroma = !(absPartIdx & 3); + codeChroma &= !(absPartIdx & 3); } m_rqt[qtLayer].resiQtYuv.copyPartToPartLuma(resiYuv, absPartIdx, log2TrSize); @@ -3389,7 +3474,7 @@ coeff_t* coeffDstY = cu.m_trCoeff[0] + coeffOffsetY; memcpy(coeffDstY, coeffSrcY, sizeof(coeff_t) * numCoeffY); - if (bCodeChroma) + if (codeChroma) { m_rqt[qtLayer].resiQtYuv.copyPartToPartChroma(resiYuv, absPartIdx, log2TrSizeC + m_hChromaShift); @@ -3453,7 +3538,6 @@ mode.contexts.resetBits(); mode.contexts.codeDeltaQP(cu, 0); uint32_t bits = mode.contexts.getNumberOfWrittenBits(); - mode.mvBits += bits; mode.totalBits += bits; updateModeCost(mode); } @@ -3464,7 +3548,6 @@ } else { - mode.mvBits++; mode.totalBits++; updateModeCost(mode); } @@ -3498,7 +3581,6 @@ mode.contexts.resetBits(); mode.contexts.codeDeltaQP(cu, 0); uint32_t bits = mode.contexts.getNumberOfWrittenBits(); - mode.mvBits += bits; mode.totalBits += bits; updateModeCost(mode); } @@ -3509,7 +3591,6 @@ } else { - mode.mvBits++; mode.totalBits++; updateModeCost(mode); }
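One hunk above implements the new --intra-refresh feature at the motion-search level: for P-slice CUs left of the moving refresh column, the horizontal MV range is clamped so prediction can never reach columns of the reference frame that have not been refreshed yet. A minimal standalone sketch of that clamp, assuming 64x64 CTUs and quarter-pel MV units as in the diff (function and parameter names here are illustrative stand-ins, not x265 API):

#include <algorithm>

// Clamp the horizontal MV range for a P-slice CU so that motion
// compensation stays within the CTU columns the reference frame has
// already refreshed. refPirEndCol is the last refreshed CTU column
// in the reference frame (pirEndCol in the diff).
void clampMvForIntraRefresh(int cuPelX, int refPirEndCol,
                            int& mvMaxX, int& mvMinX, int ctuSize = 64)
{
    // Rightmost safe full-pel x; the 3-pixel pull-in presumably keeps
    // the subpel interpolation window inside refreshed samples.
    int safeX     = refPirEndCol * ctuSize - 3;
    int maxSafeMv = (safeX - cuPelX) * 4;   // quarter-pel units

    // Both bounds are clamped with min(), as in the diff: even the
    // lower bound must not point past the refresh front.
    mvMaxX = std::min(mvMaxX, maxSafeMv);
    mvMinX = std::min(mvMinX, maxSafeMv);
}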
x265_1.8.tar.gz/source/encoder/search.h -> x265_1.9.tar.gz/source/encoder/search.h
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> +* Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -84,8 +85,14 @@ MV mvp; int mvpIdx; int ref; - uint32_t cost; int bits; + uint32_t mvCost; + uint32_t cost; + + MotionData() + { + memset(this, 0, sizeof(MotionData)); + } }; struct Mode @@ -105,16 +112,17 @@ // temporal candidate. InterNeighbourMV interNeighbours[6]; - uint64_t rdCost; // sum of partition (psy) RD costs (sse(fenc, recon) + lambda2 * bits) - uint64_t sa8dCost; // sum of partition sa8d distortion costs (sa8d(fenc, pred) + lambda * bits) - uint32_t sa8dBits; // signal bits used in sa8dCost calculation - uint32_t psyEnergy; // sum of partition psycho-visual energy difference - sse_ret_t lumaDistortion; - sse_ret_t chromaDistortion; - sse_ret_t distortion; // sum of partition SSE distortion - uint32_t totalBits; // sum of partition bits (mv + coeff) - uint32_t mvBits; // Mv bits + Ref + block type (or intra mode) - uint32_t coeffBits; // Texture bits (DCT Coeffs) + uint64_t rdCost; // sum of partition (psy) RD costs (sse(fenc, recon) + lambda2 * bits) + uint64_t sa8dCost; // sum of partition sa8d distortion costs (sa8d(fenc, pred) + lambda * bits) + uint32_t sa8dBits; // signal bits used in sa8dCost calculation + uint32_t psyEnergy; // sum of partition psycho-visual energy difference + sse_t resEnergy; // sum of partition residual energy after motion prediction + sse_t lumaDistortion; + sse_t chromaDistortion; + sse_t distortion; // sum of partition SSE distortion + uint32_t totalBits; // sum of partition bits (mv + coeff) + uint32_t mvBits; // Mv bits + Ref + block type (or intra mode) + uint32_t coeffBits; // Texture bits (DCT Coeffs) void initCosts() { @@ -122,6 +130,7 @@ sa8dCost = 0; sa8dBits = 0; psyEnergy = 0; + resEnergy = 0; lumaDistortion = 0; chromaDistortion = 0; distortion = 0; @@ -130,62 +139,13 @@ coeffBits = 0; } - void invalidate() - { - /* set costs to invalid data, catch uninitialized re-use */ - rdCost = UINT64_MAX / 2; - sa8dCost = UINT64_MAX / 2; - sa8dBits = MAX_UINT / 2; - psyEnergy = MAX_UINT / 2; -#if X265_DEPTH <= 10 - lumaDistortion = MAX_UINT / 2; - chromaDistortion = MAX_UINT / 2; - distortion = MAX_UINT / 2; -#else - lumaDistortion = UINT64_MAX / 2; - chromaDistortion = UINT64_MAX / 2; - distortion = UINT64_MAX / 2; -#endif - totalBits = MAX_UINT / 2; - mvBits = MAX_UINT / 2; - coeffBits = MAX_UINT / 2; - } - - bool ok() const - { -#if X265_DEPTH <= 10 - return !(rdCost >= UINT64_MAX / 2 || - sa8dCost >= UINT64_MAX / 2 || - sa8dBits >= MAX_UINT / 2 || - psyEnergy >= MAX_UINT / 2 || - lumaDistortion >= MAX_UINT / 2 || - chromaDistortion >= MAX_UINT / 2 || - distortion >= MAX_UINT / 2 || - totalBits >= MAX_UINT / 2 || - mvBits >= MAX_UINT / 2 || - coeffBits >= MAX_UINT / 2); -#else - return !(rdCost >= UINT64_MAX / 2 || - sa8dCost >= UINT64_MAX / 2 || - sa8dBits >= MAX_UINT / 2 || - psyEnergy >= MAX_UINT / 2 || - lumaDistortion >= UINT64_MAX / 2 || - chromaDistortion >= UINT64_MAX / 2 || - distortion >= UINT64_MAX / 2 || - totalBits >= MAX_UINT / 2 || - mvBits >= MAX_UINT / 2 || - coeffBits >= MAX_UINT / 2); -#endif - } - void addSubCosts(const Mode& subMode) { - X265_CHECK(subMode.ok(), "sub-mode not initialized"); - rdCost += subMode.rdCost; sa8dCost += subMode.sa8dCost; sa8dBits += subMode.sa8dBits; psyEnergy += subMode.psyEnergy; + resEnergy += subMode.resEnergy; lumaDistortion 
+= subMode.lumaDistortion; chromaDistortion += subMode.chromaDistortion; distortion += subMode.distortion; @@ -325,13 +285,13 @@ ~Search(); bool initSearch(const x265_param& param, ScalingList& scalingList); - int setLambdaFromQP(const CUData& ctu, int qp); /* returns real quant QP in valid spec range */ + int setLambdaFromQP(const CUData& ctu, int qp, int lambdaQP = -1); /* returns real quant QP in valid spec range */ // mark temp RD entropy contexts as uninitialized; useful for finding loads without stores void invalidateContexts(int fromDepth); - // full RD search of intra modes. if sharedModes is not NULL, it directly uses them - void checkIntra(Mode& intraMode, const CUGeom& cuGeom, PartSize partSize, uint8_t* sharedModes, uint8_t* sharedChromaModes); + // full RD search of intra modes + void checkIntra(Mode& intraMode, const CUGeom& cuGeom, PartSize partSizes); // select best intra mode using only sa8d costs, cannot measure NxN intra void checkIntraInInter(Mode& intraMode, const CUGeom& cuGeom); @@ -397,10 +357,10 @@ void saveResidualQTData(CUData& cu, ShortYuv& resiYuv, uint32_t absPartIdx, uint32_t tuDepth); // RDO search of luma intra modes; result is fully encoded luma. luma distortion is returned - uint32_t estIntraPredQT(Mode &intraMode, const CUGeom& cuGeom, const uint32_t depthRange[2], uint8_t* sharedModes); + sse_t estIntraPredQT(Mode &intraMode, const CUGeom& cuGeom, const uint32_t depthRange[2]); // RDO select best chroma mode from luma; result is fully encode chroma. chroma distortion is returned - uint32_t estIntraPredChromaQT(Mode &intraMode, const CUGeom& cuGeom, uint8_t* sharedChromaModes); + sse_t estIntraPredChromaQT(Mode &intraMode, const CUGeom& cuGeom); void codeSubdivCbfQTChroma(const CUData& cu, uint32_t tuDepth, uint32_t absPartIdx); void codeInterSubdivCbfQT(CUData& cu, uint32_t absPartIdx, const uint32_t tuDepth, const uint32_t depthRange[2]); @@ -410,12 +370,12 @@ { uint64_t rdcost; uint32_t bits; - sse_ret_t distortion; + sse_t distortion; uint32_t energy; Cost() { rdcost = 0; bits = 0; distortion = 0; energy = 0; } }; - uint64_t estimateNullCbfCost(uint32_t &dist, uint32_t &psyEnergy, uint32_t tuDepth, TextType compId); + uint64_t estimateNullCbfCost(sse_t dist, uint32_t psyEnergy, uint32_t tuDepth, TextType compId); void estimateResidualQT(Mode& mode, const CUGeom& cuGeom, uint32_t absPartIdx, uint32_t depth, ShortYuv& resiYuv, Cost& costs, const uint32_t depthRange[2]); // generate prediction, generate residual and recon. if bAllowSplit, find optimal RQT splits @@ -424,8 +384,8 @@ void extractIntraResultQT(CUData& cu, Yuv& reconYuv, uint32_t tuDepth, uint32_t absPartIdx); // generate chroma prediction, generate residual and recon - uint32_t codeIntraChromaQt(Mode& mode, const CUGeom& cuGeom, uint32_t tuDepth, uint32_t absPartIdx, uint32_t& psyEnergy); - uint32_t codeIntraChromaTSkip(Mode& mode, const CUGeom& cuGeom, uint32_t tuDepth, uint32_t tuDepthC, uint32_t absPartIdx, uint32_t& psyEnergy); + void codeIntraChromaQt(Mode& mode, const CUGeom& cuGeom, uint32_t tuDepth, uint32_t absPartIdx, Cost& outCost); + void codeIntraChromaTSkip(Mode& mode, const CUGeom& cuGeom, uint32_t tuDepth, uint32_t tuDepthC, uint32_t absPartIdx, Cost& outCost); void extractIntraResultChromaQT(CUData& cu, Yuv& reconYuv, uint32_t absPartIdx, uint32_t tuDepth); // reshuffle CBF flags after coding a pair of 4:2:2 chroma blocks
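Most of this header change is mechanical: every distortion field becomes sse_t, and the X265_DEPTH-conditional invalidate()/ok() sentinels, which existed only because 10-bit distortion still (barely) fit in 32 bits, are dropped along with them. A quick worst-case check shows why a single wide typedef is the simpler design; this is a hypothetical demo program, not x265 code, computing the maximal SSE of one 64x64 CTU:

#include <cstdint>
#include <cstdio>

int main()
{
    for (int depth : {8, 10, 12})
    {
        uint64_t maxErr = (1u << depth) - 1;          // worst per-pixel error
        uint64_t ctuSse = maxErr * maxErr * 64 * 64;  // SSE of one 64x64 CTU
        std::printf("%2d-bit: %20llu  fits uint32_t: %s\n", depth,
                    (unsigned long long)ctuSse,
                    ctuSse <= UINT32_MAX ? "yes" : "no");
    }
    return 0;
}

At 10 bits a single CTU already lands within 0.2% of UINT32_MAX, and Mode accumulates distortion across partitions, so a 32-bit sum can silently wrap; standardizing on one sse_t typedef sized for the configured bit depth removes the per-depth sentinel logic deleted above.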
x265_1.8.tar.gz/source/encoder/sei.h -> x265_1.9.tar.gz/source/encoder/sei.h
Changed
@@ -163,12 +163,6 @@ PayloadType payloadType() const { return CONTENT_LIGHT_LEVEL_INFO; } - bool parse(const char* value) - { - return sscanf(value, "%hu,%hu", - &max_content_light_level, &max_pic_average_light_level) == 2; - } - void write(Bitstream& bs, const SPS&) { m_bitIf = &bs; @@ -195,29 +189,31 @@ uint8_t m_digest[3][16]; - void write(Bitstream& bs, const SPS&) + void write(Bitstream& bs, const SPS& sps) { m_bitIf = &bs; + int planes = (sps.chromaFormatIdc != X265_CSP_I400) ? 3 : 1; + WRITE_CODE(DECODED_PICTURE_HASH, 8, "payload_type"); switch (m_method) { case MD5: - WRITE_CODE(1 + 16 * 3, 8, "payload_size"); + WRITE_CODE(1 + 16 * planes, 8, "payload_size"); WRITE_CODE(MD5, 8, "hash_type"); break; case CRC: - WRITE_CODE(1 + 2 * 3, 8, "payload_size"); + WRITE_CODE(1 + 2 * planes, 8, "payload_size"); WRITE_CODE(CRC, 8, "hash_type"); break; case CHECKSUM: - WRITE_CODE(1 + 4 * 3, 8, "payload_size"); + WRITE_CODE(1 + 4 * planes, 8, "payload_size"); WRITE_CODE(CHECKSUM, 8, "hash_type"); break; } - for (int yuvIdx = 0; yuvIdx < 3; yuvIdx++) + for (int yuvIdx = 0; yuvIdx < planes; yuvIdx++) { if (m_method == MD5) {
x265_1.8.tar.gz/source/encoder/slicetype.cpp -> x265_1.9.tar.gz/source/encoder/slicetype.cpp
Changed
@@ -83,8 +83,11 @@ uint32_t var; var = acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[0] + blockOffsetLuma, stride, 0, csp); - var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[1] + blockOffsetChroma, cStride, 1, csp); - var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[2] + blockOffsetChroma, cStride, 2, csp); + if (csp != X265_CSP_I400) + { + var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[1] + blockOffsetChroma, cStride, 1, csp); + var += acEnergyPlane(curFrame, curFrame->m_fencPic->m_picOrg[2] + blockOffsetChroma, cStride, 2, csp); + } x265_emms(); return var; } @@ -96,6 +99,7 @@ int maxRow = curFrame->m_fencPic->m_picHeight; int blockCount = curFrame->m_lowres.maxBlocksInRow * curFrame->m_lowres.maxBlocksInCol; + float* quantOffsets = curFrame->m_quantOffsets; for (int y = 0; y < 3; y++) { curFrame->m_lowres.wp_ssd[y] = 0; @@ -113,10 +117,21 @@ if (param->rc.aqMode && param->rc.aqStrength == 0) { - memset(curFrame->m_lowres.qpCuTreeOffset, 0, cuCount * sizeof(double)); - memset(curFrame->m_lowres.qpAqOffset, 0, cuCount * sizeof(double)); - for (int cuxy = 0; cuxy < cuCount; cuxy++) - curFrame->m_lowres.invQscaleFactor[cuxy] = 256; + if (quantOffsets) + { + for (int cuxy = 0; cuxy < cuCount; cuxy++) + { + curFrame->m_lowres.qpCuTreeOffset[cuxy] = curFrame->m_lowres.qpAqOffset[cuxy] = quantOffsets[cuxy]; + curFrame->m_lowres.invQscaleFactor[cuxy] = x265_exp2fix8(curFrame->m_lowres.qpCuTreeOffset[cuxy]); + } + } + else + { + memset(curFrame->m_lowres.qpCuTreeOffset, 0, cuCount * sizeof(double)); + memset(curFrame->m_lowres.qpAqOffset, 0, cuCount * sizeof(double)); + for (int cuxy = 0; cuxy < cuCount; cuxy++) + curFrame->m_lowres.invQscaleFactor[cuxy] = 256; + } } /* Need variance data for weighted prediction */ @@ -135,19 +150,25 @@ if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE || param->rc.aqMode == X265_AQ_AUTO_VARIANCE_BIASED) { double bit_depth_correction = 1.f / (1 << (2*(X265_DEPTH-8))); + curFrame->m_lowres.frameVariance = 0; + uint64_t rowVariance = 0; for (blockY = 0; blockY < maxRow; blockY += 16) { + rowVariance = 0; for (blockX = 0; blockX < maxCol; blockX += 16) { uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp); + curFrame->m_lowres.blockVariance[blockXY] = energy; + rowVariance += energy; qp_adj = pow(energy * bit_depth_correction + 1, 0.1); curFrame->m_lowres.qpCuTreeOffset[blockXY] = qp_adj; avg_adj += qp_adj; avg_adj_pow2 += qp_adj * qp_adj; blockXY++; } + curFrame->m_lowres.frameVariance += (rowVariance / maxCol); } - + curFrame->m_lowres.frameVariance /= maxRow; avg_adj /= blockCount; avg_adj_pow2 /= blockCount; strength = param->rc.aqStrength * avg_adj; @@ -177,6 +198,8 @@ uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp); qp_adj = strength * (X265_LOG2(X265_MAX(energy, 1)) - (14.427f + 2 * (X265_DEPTH - 8))); } + if (quantOffsets != NULL) + qp_adj += quantOffsets[blockXY]; curFrame->m_lowres.qpAqOffset[blockXY] = qp_adj; curFrame->m_lowres.qpCuTreeOffset[blockXY] = qp_adj; curFrame->m_lowres.invQscaleFactor[blockXY] = x265_exp2fix8(qp_adj); @@ -328,7 +351,7 @@ primitives.weight_pp(ref.buffer[0], wbuffer[0], stride, widthHeight, paddedLines, scale, round << correction, denom + correction, offset); - src = weightedRef.fpelPlane[0]; + src = fenc.weightedRef[fenc.frameNum - ref.frameNum].fpelPlane[0]; } uint32_t cost = 0; @@ -350,7 +373,6 @@ bool LookaheadTLD::allocWeightedRef(Lowres& fenc) { intptr_t planesize = fenc.buffer[1] - fenc.buffer[0]; - intptr_t padoffset = 
fenc.lowresPlane[0] - fenc.buffer[0]; paddedLines = (int)(planesize / fenc.lumaStride); wbuffer[0] = X265_MALLOC(pixel, 4 * planesize); @@ -363,14 +385,6 @@ else return false; - for (int i = 0; i < 4; i++) - weightedRef.lowresPlane[i] = wbuffer[i] + padoffset; - - weightedRef.fpelPlane[0] = weightedRef.lowresPlane[0]; - weightedRef.lumaStride = fenc.lumaStride; - weightedRef.isLowres = true; - weightedRef.isWeighted = false; - return true; } @@ -388,6 +402,16 @@ return; } + ReferencePlanes& weightedRef = fenc.weightedRef[deltaIndex]; + intptr_t padoffset = fenc.lowresPlane[0] - fenc.buffer[0]; + for (int i = 0; i < 4; i++) + weightedRef.lowresPlane[i] = wbuffer[i] + padoffset; + + weightedRef.fpelPlane[0] = weightedRef.lowresPlane[0]; + weightedRef.lumaStride = fenc.lumaStride; + weightedRef.isLowres = true; + weightedRef.isWeighted = false; + /* epsilon is chosen to require at least a numerator of 127 (with denominator = 128) */ float guessScale, fencMean, refMean; x265_emms(); @@ -478,7 +502,13 @@ m_8x8Height = ((m_param->sourceHeight / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS; m_8x8Width = ((m_param->sourceWidth / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS; - m_8x8Blocks = m_8x8Width > 2 && m_8x8Height > 2 ? (m_8x8Width - 2) * (m_8x8Height - 2) : m_8x8Width * m_8x8Height; + m_cuCount = m_8x8Width * m_8x8Height; + m_8x8Blocks = m_8x8Width > 2 && m_8x8Height > 2 ? (m_cuCount + 4 - 2 * (m_8x8Width + m_8x8Height)) : m_cuCount; + + /* Allow the strength to be adjusted via qcompress, since the two concepts + * are very similar. */ + + m_cuTreeStrength = 5.0 * (1.0 - m_param->rc.qCompress); m_lastKeyframe = -m_param->keyframeMax; m_sliceTypeBusy = false; @@ -502,7 +532,16 @@ m_bBatchFrameCosts = m_bBatchMotionSearch; if (m_param->lookaheadSlices && !m_pool) + { + x265_log(param, X265_LOG_WARNING, "No pools found; disabling lookahead-slices\n"); + m_param->lookaheadSlices = 0; + } + + if (m_param->lookaheadSlices && (m_param->sourceHeight < 720)) + { + x265_log(param, X265_LOG_WARNING, "Source height < 720p; disabling lookahead-slices\n"); m_param->lookaheadSlices = 0; + } if (m_param->lookaheadSlices > 1) { @@ -715,16 +754,16 @@ case P_SLICE: b = p1 = poc - l0poc; - frames[p0] = &slice->m_refPicList[0][0]->m_lowres; + frames[p0] = &slice->m_refFrameList[0][0]->m_lowres; frames[b] = &curFrame->m_lowres; break; case B_SLICE: b = poc - l0poc; p1 = b + l1poc - poc; - frames[p0] = &slice->m_refPicList[0][0]->m_lowres; + frames[p0] = &slice->m_refFrameList[0][0]->m_lowres; frames[b] = &curFrame->m_lowres; - frames[p1] = &slice->m_refPicList[1][0]->m_lowres; + frames[p1] = &slice->m_refFrameList[1][0]->m_lowres; break; default: @@ -736,10 +775,13 @@ if (m_param->rc.cuTree && !m_param->rc.bStatRead) /* update row satds based on cutree offsets */ curFrame->m_lowres.satdCost = frameCostRecalculate(frames, p0, p1, b); - else if (m_param->rc.aqMode) - curFrame->m_lowres.satdCost = curFrame->m_lowres.costEstAq[b - p0][p1 - b]; - else - curFrame->m_lowres.satdCost = curFrame->m_lowres.costEst[b - p0][p1 - b]; + else if (m_param->analysisMode != X265_ANALYSIS_LOAD) + { + if (m_param->rc.aqMode) + curFrame->m_lowres.satdCost = curFrame->m_lowres.costEstAq[b - p0][p1 - b]; + else + curFrame->m_lowres.satdCost = curFrame->m_lowres.costEst[b - p0][p1 - b]; + } if (m_param->rc.vbvBufferSize && m_param->rc.vbvMaxBitrate) { @@ -760,6 +802,7 @@ for (uint32_t cnt = 0; cnt < scale && lowresRow < heightInLowresCu; lowresRow++, cnt++) { sum = 0; intraSum = 0; + int diff = 0; lowresCuIdx = lowresRow 
* widthInLowresCu; for (lowresCol = 0; lowresCol < widthInLowresCu; lowresCol++, lowresCuIdx++) { @@ -767,14 +810,18 @@ if (qp_offset) { lowresCuCost = (uint16_t)((lowresCuCost * x265_exp2fix8(qp_offset[lowresCuIdx]) + 128) >> 8); - int32_t intraCuCost = curFrame->m_lowres.intraCost[lowresCuIdx]; + int32_t intraCuCost = curFrame->m_lowres.intraCost[lowresCuIdx]; curFrame->m_lowres.intraCost[lowresCuIdx] = (intraCuCost * x265_exp2fix8(qp_offset[lowresCuIdx]) + 128) >> 8; } + if (m_param->bIntraRefresh && slice->m_sliceType == X265_TYPE_P) + for (uint32_t x = curFrame->m_encData->m_pir.pirStartCol; x <= curFrame->m_encData->m_pir.pirEndCol; x++) + diff += curFrame->m_lowres.intraCost[lowresCuIdx] - lowresCuCost; curFrame->m_lowres.lowresCostForRc[lowresCuIdx] = lowresCuCost; sum += lowresCuCost; intraSum += curFrame->m_lowres.intraCost[lowresCuIdx]; } curFrame->m_encData->m_rowStat[row].satdForVbv += sum; + curFrame->m_encData->m_rowStat[row].satdForVbv += diff; curFrame->m_encData->m_rowStat[row].intraSatdForVbv += intraSum; } } @@ -886,8 +933,7 @@ x265_log(m_param, X265_LOG_WARNING, "B-ref at frame %d incompatible with B-pyramid and %d reference frames\n", frm.sliceType, m_param->maxNumReferences); } - - if (/* (!param->intraRefresh || frm.frameNum == 0) && */ frm.frameNum - m_lastKeyframe >= m_param->keyframeMax) + if ((!m_param->bIntraRefresh || frm.frameNum == 0) && frm.frameNum - m_lastKeyframe >= m_param->keyframeMax) { if (frm.sliceType == X265_TYPE_AUTO || frm.sliceType == X265_TYPE_I) frm.sliceType = m_param->bOpenGOP && m_lastKeyframe >= 0 ? X265_TYPE_I : X265_TYPE_IDR; @@ -1170,7 +1216,7 @@ frames[framecnt + 1] = NULL; keyintLimit = m_param->keyframeMax - frames[0]->frameNum + m_lastKeyframe - 1; - origNumFrames = numFrames = X265_MIN(framecnt, keyintLimit); + origNumFrames = numFrames = m_param->bIntraRefresh ? 
framecnt : X265_MIN(framecnt, keyintLimit); if (bIsVbvLookahead) numFrames = framecnt; @@ -1366,12 +1412,12 @@ if (m_param->rc.cuTree) cuTree(frames, X265_MIN(numFrames, m_param->keyframeMax), bKeyframe); - // if (!param->bIntraRefresh) - for (int j = keyintLimit + 1; j <= numFrames; j += m_param->keyframeMax) - { - frames[j]->sliceType = X265_TYPE_I; - resetStart = X265_MIN(resetStart, j + 1); - } + if (!m_param->bIntraRefresh) + for (int j = keyintLimit + 1; j <= numFrames; j += m_param->keyframeMax) + { + frames[j]->sliceType = X265_TYPE_I; + resetStart = X265_MIN(resetStart, j + 1); + } if (bIsVbvLookahead) vbvLookahead(frames, numFrames, bKeyframe); @@ -1493,7 +1539,7 @@ { if (m_param->keyframeMin == m_param->keyframeMax) threshMin = threshMax; - if (gopSize <= m_param->keyframeMin / 4) + if (gopSize <= m_param->keyframeMin / 4 || m_param->bIntraRefresh) bias = threshMin / 4; else if (gopSize <= m_param->keyframeMin) bias = threshMin * gopSize / m_param->keyframeMin; @@ -1606,7 +1652,6 @@ double averageDuration = totalDuration / (numframes + 1); int i = numframes; - int cuCount = m_8x8Width * m_8x8Height; while (i > 0 && frames[i]->sliceType == X265_TYPE_B) i--; @@ -1620,18 +1665,18 @@ { if (bIntra) { - memset(frames[0]->propagateCost, 0, cuCount * sizeof(uint16_t)); - memcpy(frames[0]->qpCuTreeOffset, frames[0]->qpAqOffset, cuCount * sizeof(double)); + memset(frames[0]->propagateCost, 0, m_cuCount * sizeof(uint16_t)); + memcpy(frames[0]->qpCuTreeOffset, frames[0]->qpAqOffset, m_cuCount * sizeof(double)); return; } std::swap(frames[lastnonb]->propagateCost, frames[0]->propagateCost); - memset(frames[0]->propagateCost, 0, cuCount * sizeof(uint16_t)); + memset(frames[0]->propagateCost, 0, m_cuCount * sizeof(uint16_t)); } else { if (lastnonb < idx) return; - memset(frames[lastnonb]->propagateCost, 0, cuCount * sizeof(uint16_t)); + memset(frames[lastnonb]->propagateCost, 0, m_cuCount * sizeof(uint16_t)); } CostEstimateGroup estGroup(*this, frames); @@ -1647,13 +1692,13 @@ estGroup.singleCost(curnonb, lastnonb, lastnonb); - memset(frames[curnonb]->propagateCost, 0, cuCount * sizeof(uint16_t)); + memset(frames[curnonb]->propagateCost, 0, m_cuCount * sizeof(uint16_t)); bframes = lastnonb - curnonb - 1; if (m_param->bBPyramid && bframes > 1) { int middle = (bframes + 1) / 2 + curnonb; estGroup.singleCost(curnonb, lastnonb, middle); - memset(frames[middle]->propagateCost, 0, cuCount * sizeof(uint16_t)); + memset(frames[middle]->propagateCost, 0, m_cuCount * sizeof(uint16_t)); while (i > curnonb) { int p0 = i > middle ? middle : curnonb; @@ -1804,20 +1849,14 @@ if (ref0Distance && frame->weightedCostDelta[ref0Distance - 1] > 0) weightdelta = (1.0 - frame->weightedCostDelta[ref0Distance - 1]); - /* Allow the strength to be adjusted via qcompress, since the two concepts - * are very similar. 
*/ - - int cuCount = m_8x8Width * m_8x8Height; - double strength = 5.0 * (1.0 - m_param->rc.qCompress); - - for (int cuIndex = 0; cuIndex < cuCount; cuIndex++) + for (int cuIndex = 0; cuIndex < m_cuCount; cuIndex++) { int intracost = (frame->intraCost[cuIndex] * frame->invQscaleFactor[cuIndex] + 128) >> 8; if (intracost) { int propagateCost = (frame->propagateCost[cuIndex] * fpsFactor + 128) >> 8; double log2_ratio = X265_LOG2(intracost + propagateCost) - X265_LOG2(intracost) + weightdelta; - frame->qpCuTreeOffset[cuIndex] = frame->qpAqOffset[cuIndex] - strength * log2_ratio; + frame->qpCuTreeOffset[cuIndex] = frame->qpAqOffset[cuIndex] - m_cuTreeStrength * log2_ratio; } } } @@ -1958,7 +1997,7 @@ if (bDoSearch[1]) fenc->lowresMvs[1][p1 - b - 1][0].x = 0x7FFE; #endif - tld.weightedRef.isWeighted = false; + fenc->weightedRef[b - p0].isWeighted = false; if (param->bEnableWeightedPred && bDoSearch[0]) tld.weightsAnalyse(*m_frames[b], *m_frames[p0]); @@ -2032,7 +2071,7 @@ Lowres *fref1 = m_frames[p1]; Lowres *fenc = m_frames[b]; - ReferencePlanes *wfref0 = tld.weightedRef.isWeighted ? &tld.weightedRef : fref0; + ReferencePlanes *wfref0 = fenc->weightedRef[b - p0].isWeighted ? &fenc->weightedRef[b - p0] : fref0; const int widthInCU = m_lookahead.m_8x8Width; const int heightInCU = m_lookahead.m_8x8Height; @@ -2061,6 +2100,7 @@ for (int i = 0; i < 1 + bBidir; i++) { int& fencCost = fenc->lowresMvCosts[i][listDist[i]][cuXY]; + int skipCost = INT_MAX; if (!bDoSearch[i]) { @@ -2103,12 +2143,20 @@ pixel *src = fref->lowresMC(pelOffset, mvc[idx], subpelbuf, stride); int cost = tld.me.bufSATD(src, stride); COPY2_IF_LT(mvpcost, cost, mvp, mvc[idx]); + /* Except for mv0 case, everyting else is likely to have enough residual to not trigger the skip. */ + if (!mvp.notZero() && bBidir) + skipCost = cost; } } /* ME will never return a cost larger than the cost @MVP, so we do not * have to check that ME cost is more than the estimated merge cost */ fencCost = tld.me.motionEstimate(fref, mvmin, mvmax, mvp, 0, NULL, s_merange, *fencMV); + if (skipCost < 64 && skipCost < fencCost && bBidir) + { + fencCost = skipCost; + *fencMV = 0; + } COPY2_IF_LT(bcost, fencCost, listused, i + 1); }
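Note on the first hunk above: the lookahead's QP-offset scaling is a Q8 fixed-point multiply, and the new skipCost shortcut only takes effect for bidirectional candidates whose zero-MV SATD is tiny (< 64) and cheaper than the motion search result. A minimal sketch of the Q8 idiom alone (not x265 source; names illustrative):

    #include <cstdint>

    // x265_exp2fix8() yields a scale factor in Q8 fixed point (256 == 1.0);
    // multiplying and shifting right by 8 applies it, and +128 rounds to nearest.
    static inline uint32_t scaleCostQ8(uint32_t cost, uint32_t factorQ8)
    {
        return (cost * factorQ8 + 128) >> 8;
    }

    // factorQ8 == 256 leaves the cost unchanged; 128 halves it; 512 doubles it.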
View file
x265_1.8.tar.gz/source/encoder/slicetype.h -> x265_1.9.tar.gz/source/encoder/slicetype.h
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -44,7 +45,6 @@ struct LookaheadTLD { MotionEstimate me; - ReferencePlanes weightedRef; pixel* wbuffer[4]; int widthInCU; int heightInCU; @@ -103,29 +103,30 @@ PicList m_outputQueue; // pictures to be encoded, in encode order Lock m_inputLock; Lock m_outputLock; - - /* pre-lookahead */ - int m_fullQueueSize; - bool m_isActive; - bool m_sliceTypeBusy; - bool m_bAdaptiveQuant; - bool m_outputSignalRequired; - bool m_bBatchMotionSearch; - bool m_bBatchFrameCosts; Event m_outputSignal; - LookaheadTLD* m_tld; x265_param* m_param; Lowres* m_lastNonB; int* m_scratch; // temp buffer for cutree propagate - + + /* pre-lookahead */ + int m_fullQueueSize; int m_histogram[X265_BFRAME_MAX + 1]; int m_lastKeyframe; int m_8x8Width; int m_8x8Height; int m_8x8Blocks; + int m_cuCount; int m_numCoopSlices; int m_numRowsPerSlice; + double m_cuTreeStrength; + + bool m_isActive; + bool m_sliceTypeBusy; + bool m_bAdaptiveQuant; + bool m_outputSignalRequired; + bool m_bBatchMotionSearch; + bool m_bBatchFrameCosts; bool m_filled; bool m_isSceneTransition; Lookahead(x265_param *param, ThreadPool *pool);
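The reshuffled members above cache two values the 1.8 code recomputed per call: m_cuCount (the lowres CU count) and m_cuTreeStrength. Combined with the slicetype.cpp hunks, the per-CU cuTree offset works out as below; a restatement for reference only, not x265 source:

    #include <cmath>

    // strength = 5.0 * (1 - qCompress) is what 1.9 caches as m_cuTreeStrength.
    // intraCost must be non-zero (the encoder skips the CU otherwise).
    double cuTreeQpOffset(double qpAqOffset, int intraCost, int propagateCost,
                          double weightDelta, double qCompress)
    {
        double strength  = 5.0 * (1.0 - qCompress);
        double log2Ratio = std::log2((double)(intraCost + propagateCost))
                         - std::log2((double)intraCost) + weightDelta;
        return qpAqOffset - strength * log2Ratio;
    }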
View file
x265_1.8.tar.gz/source/encoder/weightPrediction.cpp -> x265_1.9.tar.gz/source/encoder/weightPrediction.cpp
Changed
@@ -4,6 +4,7 @@ * Author: Shazeb Nawaz Khan <shazeb@multicorewareinc.com> * Steve Borho <steve@borho.org> * Kavitha Sampas <kavitha@multicorewareinc.com> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -259,13 +260,13 @@ for (int list = 0; list < cache.numPredDir; list++) { WeightParam *weights = wp[list][0]; - Frame *refFrame = slice.m_refPicList[list][0]; + Frame *refFrame = slice.m_refFrameList[list][0]; Lowres& refLowres = refFrame->m_lowres; int diffPoc = abs(curPoc - refFrame->m_poc); /* prepare estimates */ float guessScale[3], fencMean[3], refMean[3]; - for (int plane = 0; plane < 3; plane++) + for (int plane = 0; plane < (param.internalCsp != X265_CSP_I400 ? 3 : 1); plane++) { SET_WEIGHT(weights[plane], false, 1, 0, 0); uint64_t fencVar = fenc.wp_ssd[plane] + !refLowres.wp_ssd[plane]; @@ -289,7 +290,7 @@ MV *mvs = NULL; - for (int plane = 0; plane < 3; plane++) + for (int plane = 0; plane < (param.internalCsp != X265_CSP_I400 ? 3 : 1); plane++) { denom = plane ? chromaDenom : lumaDenom; if (plane && !weights[0].bPresentFlag) @@ -328,7 +329,7 @@ { /* reference chroma planes must be extended prior to being * used as motion compensation sources */ - if (!refFrame->m_bChromaExtended) + if (!refFrame->m_bChromaExtended && param.internalCsp != X265_CSP_I400) { refFrame->m_bChromaExtended = true; PicYuv *refPic = refFrame->m_fencPic;
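Both plane loops above now derive their bound from the chroma format, so monochrome (4:0:0) input analyses luma only. The guard reduces to this one-liner (sketch; X265_CSP_I400 comes from x265.h):

    #include "x265.h"

    static inline int planesToAnalyse(int internalCsp)
    {
        return (internalCsp != X265_CSP_I400) ? 3 : 1;   // Y,Cb,Cr versus Y only
    }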
View file
x265_1.8.tar.gz/source/output/y4m.cpp -> x265_1.9.tar.gz/source/output/y4m.cpp
Changed
@@ -70,7 +70,7 @@
        x265_log(NULL, X265_LOG_WARNING, "y4m: forcing reconstructed pixels to 8 bits\n");
#endif

-    X265_CHECK(pic.colorSpace == colorSpace, "invalid color space\n");
+    X265_CHECK(pic.colorSpace == colorSpace, "invalid chroma subsampling\n");

#if HIGH_BIT_DEPTH
View file
x265_1.8.tar.gz/source/output/yuv.cpp -> x265_1.9.tar.gz/source/output/yuv.cpp
Changed
@@ -53,7 +53,7 @@
    uint64_t fileOffset = pic.poc;
    fileOffset *= frameSize;

-    X265_CHECK(pic.colorSpace == colorSpace, "invalid color space\n");
+    X265_CHECK(pic.colorSpace == colorSpace, "invalid chroma subsampling\n");
    X265_CHECK(pic.bitDepth == (int)depth, "invalid bit depth\n");

#if HIGH_BIT_DEPTH
View file
x265_1.8.tar.gz/source/profile/vtune/CMakeLists.txt -> x265_1.9.tar.gz/source/profile/vtune/CMakeLists.txt
Changed
@@ -1,2 +1,2 @@
-include_directories($ENV{VTUNE_AMPLIFIER_XE_2015_DIR}/include)
+include_directories(${VTUNE_INCLUDE_DIR})
 add_library(vtune vtune.h vtune.cpp ../cpuEvents.h)
View file
x265_1.8.tar.gz/source/profile/vtune/vtune.cpp -> x265_1.9.tar.gz/source/profile/vtune/vtune.cpp
Changed
@@ -30,7 +30,6 @@
const char *stringNames[] =
{
#include "../cpuEvents.h"
-    ""
};
#undef CPU_EVENT

@@ -44,7 +43,8 @@
void vtuneInit()
{
    domain = __itt_domain_create("x265");
-    for (size_t i = 0; i < sizeof(stringNames) / sizeof(const char *); i++)
+    size_t length = sizeof(stringNames) / sizeof(const char *);
+    for (size_t i = 0; i < length; i++)
        taskHandle[i] = __itt_string_handle_create(stringNames[i]);
}
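The trailing "" sentinel could be dropped because the loop bound comes from the array type rather than from a terminator. The sizeof idiom in isolation (illustrative names, not x265 source):

    #include <cstddef>

    const char* names[] = { "decide", "encode", "filter" };
    const size_t length = sizeof(names) / sizeof(names[0]);   // 3, at compile time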
View file
x265_1.8.tar.gz/source/test/checkasm-a.asm -> x265_1.9.tar.gz/source/test/checkasm-a.asm
Changed
@@ -2,9 +2,11 @@
;* checkasm-a.asm: assembly check tool
;*****************************************************************************
;* Copyright (C) 2008-2014 x264 project
+;* Copyright (C) 2013-2015 x265 project
;*
;* Authors: Loren Merritt <lorenm@u.washington.edu>
;*          Henrik Gramner <henrik@gramner.com>
+;*          Min Chen <chenm003@163.com>
;*
;* This program is free software; you can redistribute it and/or modify
;* it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/test/intrapredharness.cpp -> x265_1.9.tar.gz/source/test/intrapredharness.cpp
Changed
@@ -130,6 +130,8 @@
                if (memcmp(pixel_out_vec + k * FENC_STRIDE, pixel_out_c + k * FENC_STRIDE, width * sizeof(pixel)))
                {
                    printf("ang_%dx%d, Mode = %d, Row = %d failed !!\n", width, width, pmode, k);
+                   ref[pmode](pixel_out_c, stride, pixel_buff + j, pmode, bFilter);
+                   opt[pmode](pixel_out_vec, stride, pixel_buff + j, pmode, bFilter);
                    return false;
                }
            }
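The two added calls deliberately re-run both kernels on the inputs that just failed, so a breakpoint placed on them replays the mismatch under a debugger. The same pattern, self-contained with stand-in kernels (all names illustrative):

    #include <cstdint>
    #include <cstring>

    static void refKernel(uint8_t* dst) { dst[0] = 1; }
    static void optKernel(uint8_t* dst) { dst[0] = 2; }   // deliberately mismatched

    static bool checkOnce()
    {
        uint8_t c[16] = {}, v[16] = {};
        refKernel(c);
        optKernel(v);
        if (std::memcmp(c, v, sizeof(c)))
        {
            refKernel(c);   // breakpoint here replays the failing case
            optKernel(v);
            return false;
        }
        return true;
    }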
View file
x265_1.8.tar.gz/source/test/ipfilterharness.h -> x265_1.9.tar.gz/source/test/ipfilterharness.h
Changed
@@ -4,6 +4,7 @@
 * Authors: Deepthi Devaki <deepthidevaki@multicorewareinc.com>,
 *          Rajesh Paulraj <rajesh@multicorewareinc.com>
 *          Praveen Kumar Tiwari <praveen@multicorewareinc.com>
+ *          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/test/pixelharness.cpp -> x265_1.9.tar.gz/source/test/pixelharness.cpp
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -41,6 +42,7 @@ int_test_buff[0][i] = rand() % SHORT_MAX; ushort_test_buff[0][i] = rand() % ((1 << 16) - 1); uchar_test_buff[0][i] = rand() % ((1 << 8) - 1); + residual_test_buff[0][i] = (rand() % (2 * RMAX + 1)) - RMAX - 1;// For sse_ss only pixel_test_buff[1][i] = PIXEL_MIN; short_test_buff[1][i] = SMIN; @@ -49,6 +51,7 @@ int_test_buff[1][i] = SHORT_MIN; ushort_test_buff[1][i] = PIXEL_MIN; uchar_test_buff[1][i] = PIXEL_MIN; + residual_test_buff[1][i] = RMIN; pixel_test_buff[2][i] = PIXEL_MAX; short_test_buff[2][i] = SMAX; @@ -57,6 +60,7 @@ int_test_buff[2][i] = SHORT_MAX; ushort_test_buff[2][i] = ((1 << 16) - 1); uchar_test_buff[2][i] = 255; + residual_test_buff[2][i] = RMAX; pbuf1[i] = rand() & PIXEL_MAX; pbuf2[i] = rand() & PIXEL_MAX; @@ -103,8 +107,8 @@ { int index1 = rand() % TEST_CASES; int index2 = rand() % TEST_CASES; - sse_ret_t vres = (sse_ret_t)checked(opt, pixel_test_buff[index1], stride, pixel_test_buff[index2] + j, stride); - sse_ret_t cres = ref(pixel_test_buff[index1], stride, pixel_test_buff[index2] + j, stride); + sse_t vres = (sse_t)checked(opt, pixel_test_buff[index1], stride, pixel_test_buff[index2] + j, stride); + sse_t cres = ref(pixel_test_buff[index1], stride, pixel_test_buff[index2] + j, stride); if (vres != cres) return false; @@ -124,8 +128,8 @@ { int index1 = rand() % TEST_CASES; int index2 = rand() % TEST_CASES; - sse_ret_t vres = (sse_ret_t)checked(opt, short_test_buff[index1], stride, short_test_buff[index2] + j, stride); - sse_ret_t cres = ref(short_test_buff[index1], stride, short_test_buff[index2] + j, stride); + sse_t vres = (sse_t)checked(opt, residual_test_buff[index1], stride, residual_test_buff[index2] + j, stride); + sse_t cres = ref(residual_test_buff[index1], stride, residual_test_buff[index2] + j, stride); if (vres != cres) return false; @@ -227,8 +231,8 @@ { // NOTE: stride must be multiple of 16, because minimum block is 4x4 int stride = (STRIDE + (rand() % STRIDE)) & ~15; - int cres = ref(sbuf1 + j, stride); - int vres = (int)checked(opt, sbuf1 + j, (intptr_t)stride); + sse_t cres = ref(sbuf1 + j, stride); + sse_t vres = (sse_t)checked(opt, sbuf1 + j, (intptr_t)stride); if (cres != vres) return false; @@ -854,7 +858,7 @@ int width = (rand() % 4) + 1; // range[1-4] float cres = ref(sum0, sum1, width); float vres = checked_float(opt, sum0, sum1, width); - if (fabs(vres - cres) > 0.00001) + if (fabs(vres - cres) > 0.0001) return false; reportfail(); @@ -1061,8 +1065,8 @@ int endX = MAX_CU_SIZE - (rand() % 5); int endY = MAX_CU_SIZE - (rand() % 4) - 1; - ref(pbuf2 + j + 1, pbuf3 + 1, stride, endX, endY, stats_ref, count_ref); - checked(opt, pbuf2 + j + 1, pbuf3 + 1, stride, endX, endY, stats_vec, count_vec); + ref(sbuf2 + j + 1, pbuf3 + 1, stride, endX, endY, stats_ref, count_ref); + checked(opt, sbuf2 + j + 1, pbuf3 + 1, stride, endX, endY, stats_vec, count_vec); if (memcmp(stats_ref, stats_vec, sizeof(stats_ref)) || memcmp(count_ref, count_vec, sizeof(count_ref))) return false; @@ -1097,8 +1101,8 @@ int endX = MAX_CU_SIZE - (rand() % 5) - 1; int endY = MAX_CU_SIZE - (rand() % 4) - 1; - ref(pbuf2 + j + 1, pbuf3 + j + 1, stride, endX, endY, stats_ref, count_ref); - checked(opt, pbuf2 + j + 1, pbuf3 + j + 1, stride, endX, endY, stats_vec, count_vec); + ref(sbuf2 + j + 1, pbuf3 + j 
+ 1, stride, endX, endY, stats_ref, count_ref); + checked(opt, sbuf2 + j + 1, pbuf3 + j + 1, stride, endX, endY, stats_vec, count_vec); if (memcmp(stats_ref, stats_vec, sizeof(stats_ref)) || memcmp(count_ref, count_vec, sizeof(count_ref))) return false; @@ -1141,8 +1145,8 @@ int endX = MAX_CU_SIZE - (rand() % 5); int endY = MAX_CU_SIZE - (rand() % 4) - 1; - ref(pbuf2 + 1, pbuf3 + 1, stride, upBuff1_ref, endX, endY, stats_ref, count_ref); - checked(opt, pbuf2 + 1, pbuf3 + 1, stride, upBuff1_vec, endX, endY, stats_vec, count_vec); + ref(sbuf2 + 1, pbuf3 + 1, stride, upBuff1_ref, endX, endY, stats_ref, count_ref); + checked(opt, sbuf2 + 1, pbuf3 + 1, stride, upBuff1_vec, endX, endY, stats_vec, count_vec); if ( memcmp(_upBuff1_ref, _upBuff1_vec, sizeof(_upBuff1_ref)) || memcmp(stats_ref, stats_vec, sizeof(stats_ref)) @@ -1193,8 +1197,8 @@ int endX = MAX_CU_SIZE - (rand() % 5) - 1; int endY = MAX_CU_SIZE - (rand() % 4) - 1; - ref(pbuf2 + 1, pbuf3 + 1, stride, upBuff1_ref, upBufft_ref, endX, endY, stats_ref, count_ref); - checked(opt, pbuf2 + 1, pbuf3 + 1, stride, upBuff1_vec, upBufft_vec, endX, endY, stats_vec, count_vec); + ref(sbuf2 + 1, pbuf3 + 1, stride, upBuff1_ref, upBufft_ref, endX, endY, stats_ref, count_ref); + checked(opt, sbuf2 + 1, pbuf3 + 1, stride, upBuff1_vec, upBufft_vec, endX, endY, stats_vec, count_vec); // TODO: don't check upBuff*, the latest output pixels different, and can move into stack temporary buffer in future if ( memcmp(_upBuff1_ref, _upBuff1_vec, sizeof(_upBuff1_ref)) @@ -1244,8 +1248,8 @@ int endX = MAX_CU_SIZE - (rand() % 5) - 1; int endY = MAX_CU_SIZE - (rand() % 4) - 1; - ref(pbuf2, pbuf3, stride, upBuff1_ref, endX, endY, stats_ref, count_ref); - checked(opt, pbuf2, pbuf3, stride, upBuff1_vec, endX, endY, stats_vec, count_vec); + ref(sbuf2, pbuf3, stride, upBuff1_ref, endX, endY, stats_ref, count_ref); + checked(opt, sbuf2, pbuf3, stride, upBuff1_vec, endX, endY, stats_vec, count_vec); if ( memcmp(_upBuff1_ref, _upBuff1_vec, sizeof(_upBuff1_ref)) || memcmp(stats_ref, stats_vec, sizeof(stats_ref)) @@ -1295,8 +1299,8 @@ memset(ref_dest, 0xCD, sizeof(ref_dest)); memset(opt_dest, 0xCD, sizeof(opt_dest)); - int width = 32 + rand() % 32; - int height = 32 + rand() % 32; + int width = 32 + (rand() % 32); + int height = 32 + (rand() % 32); intptr_t srcStride = 64; intptr_t dstStride = width; int j = 0; @@ -1304,11 +1308,23 @@ for (int i = 0; i < ITERS; i++) { int index = i % TEST_CASES; + checked(opt, ushort_test_buff[index] + j, srcStride, opt_dest, dstStride, width, height, (int)8, (uint16_t)((1 << X265_DEPTH) - 1)); ref(ushort_test_buff[index] + j, srcStride, ref_dest, dstStride, width, height, (int)8, (uint16_t)((1 << X265_DEPTH) - 1)); - if (memcmp(ref_dest, opt_dest, width * height * sizeof(pixel))) + if (memcmp(ref_dest, opt_dest, dstStride * height * sizeof(pixel))) + { + memcpy(opt_dest, ref_dest, sizeof(ref_dest)); + opt(ushort_test_buff[index] + j, srcStride, opt_dest, dstStride, width, height, (int)8, (uint16_t)((1 << X265_DEPTH) - 1)); return false; + } + + // check tail memory area + for(int x = width; x < dstStride; x++) + { + if (opt_dest[(height - 1 * dstStride) + x] != 0xCD) + return false; + } reportfail(); j += INCR; @@ -1340,6 +1356,13 @@ if (memcmp(ref_dest, opt_dest, sizeof(ref_dest))) return false; + // check tail memory area + for(int x = width; x < dstStride; x++) + { + if (opt_dest[(height - 1 * dstStride) + x] != 0xCD) + return false; + } + reportfail(); j += INCR; } @@ -1356,16 +1379,16 @@ memset(opt_dest, 0xCD, sizeof(opt_dest)); double 
fps = 1.0; - int width = 16 + rand() % 64; int j = 0; for (int i = 0; i < ITERS; i++) { + int width = 16 + rand() % 64; int index = i % TEST_CASES; checked(opt, opt_dest, ushort_test_buff[index] + j, int_test_buff[index] + j, ushort_test_buff[index] + j, int_test_buff[index] + j, &fps, width); ref(ref_dest, ushort_test_buff[index] + j, int_test_buff[index] + j, ushort_test_buff[index] + j, int_test_buff[index] + j, &fps, width); - if (memcmp(ref_dest, opt_dest, width * sizeof(pixel))) + if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel))) return false; reportfail(); @@ -1397,28 +1420,6 @@ return true; } -bool PixelHarness::check_psyCost_ss(pixelcmp_ss_t ref, pixelcmp_ss_t opt) -{ - int j = 0, index1, index2, optres, refres; - intptr_t stride = STRIDE; - - for (int i = 0; i < ITERS; i++) - { - index1 = rand() % TEST_CASES; - index2 = rand() % TEST_CASES; - optres = (int)checked(opt, short_test_buff[index1], stride, short_test_buff[index2] + j, stride); - refres = ref(short_test_buff[index1], stride, short_test_buff[index2] + j, stride); - - if (optres != refres) - return false; - - reportfail(); - j += INCR; - } - - return true; -} - bool PixelHarness::check_saoCuOrgB0_t(saoCuOrgB0_t ref, saoCuOrgB0_t opt) { ALIGN_VAR_16(pixel, ref_dest[64 * 64]); @@ -1570,8 +1571,8 @@ // specially case: all coeff group are zero if (j >= SCAN_SET_SIZE) { - // all zero block the high 16-bits undefined - if ((uint16_t)ref_scanPos != (uint16_t)opt_scanPos) + // all zero block the high 24-bits undefined + if ((uint8_t)ref_scanPos != (uint8_t)opt_scanPos) return false; } else if (ref_scanPos != opt_scanPos) @@ -1586,8 +1587,8 @@ bool PixelHarness::check_costCoeffNxN(costCoeffNxN_t ref, costCoeffNxN_t opt) { ALIGN_VAR_16(coeff_t, ref_src[32 * 32 + ITERS * 3]); - ALIGN_VAR_32(uint16_t, ref_absCoeff[1 << MLS_CG_SIZE]); - ALIGN_VAR_32(uint16_t, opt_absCoeff[1 << MLS_CG_SIZE]); + ALIGN_VAR_32(uint16_t, ref_absCoeff[(1 << MLS_CG_SIZE)]); + ALIGN_VAR_32(uint16_t, opt_absCoeff[(1 << MLS_CG_SIZE) + 4]); memset(ref_absCoeff, 0xCD, sizeof(ref_absCoeff)); memset(opt_absCoeff, 0xCD, sizeof(opt_absCoeff)); @@ -1613,6 +1614,12 @@ ref_src[32 * 32 + i] = 0x1234; } + // Safe check magic + opt_absCoeff[(1 << MLS_CG_SIZE) + 0] = 0x0123; + opt_absCoeff[(1 << MLS_CG_SIZE) + 1] = 0x4567; + opt_absCoeff[(1 << MLS_CG_SIZE) + 2] = 0xBA98; + opt_absCoeff[(1 << MLS_CG_SIZE) + 3] = 0xFEDC; + // generate CABAC context table uint8_t m_contextState_ref[OFF_SIG_FLAG_CTX + NUM_SIG_FLAG_CTX_LUMA]; uint8_t m_contextState_opt[OFF_SIG_FLAG_CTX + NUM_SIG_FLAG_CTX_LUMA]; @@ -1703,8 +1710,8 @@ continue; const uint32_t blkPosBase = scanTbl[subPosBase]; - uint32_t ref_sum = ref(scanTblCG4x4, &ref_src[blkPosBase + i], trSize, ref_absCoeff + numNonZero, rand_tabSigCtx, scanFlagMask, (uint8_t*)ref_baseCtx, offset, rand_scanPosSigOff, subPosBase); - uint32_t opt_sum = (uint32_t)checked(opt, scanTblCG4x4, &ref_src[blkPosBase + i], trSize, opt_absCoeff + numNonZero, rand_tabSigCtx, scanFlagMask, (uint8_t*)opt_baseCtx, offset, rand_scanPosSigOff, subPosBase); + uint32_t ref_sum = ref(scanTblCG4x4, &ref_src[blkPosBase + i], (intptr_t)trSize, ref_absCoeff + numNonZero, rand_tabSigCtx, scanFlagMask, (uint8_t*)ref_baseCtx, offset, rand_scanPosSigOff, subPosBase); + uint32_t opt_sum = (uint32_t)checked(opt, scanTblCG4x4, &ref_src[blkPosBase + i], (intptr_t)trSize, opt_absCoeff + numNonZero, rand_tabSigCtx, scanFlagMask, (uint8_t*)opt_baseCtx, offset, rand_scanPosSigOff, subPosBase); if (ref_sum != opt_sum) return false; @@ -1712,18 +1719,25 @@ return false; // 
NOTE: just first rand_numCoeff valid, but I check full buffer for confirm no overwrite bug - if (memcmp(ref_absCoeff, opt_absCoeff, sizeof(ref_absCoeff))) + if (memcmp(ref_absCoeff, opt_absCoeff, rand_numCoeff * sizeof(ref_absCoeff[0]))) + return false; + + // Check memory beyond-bound write + if ( opt_absCoeff[(1 << MLS_CG_SIZE) + 1] != 0x4567 + || opt_absCoeff[(1 << MLS_CG_SIZE) + 2] != 0xBA98 + || opt_absCoeff[(1 << MLS_CG_SIZE) + 3] != 0xFEDC) return false; reportfail(); } return true; } + bool PixelHarness::check_costCoeffRemain(costCoeffRemain_t ref, costCoeffRemain_t opt) { - ALIGN_VAR_32(uint16_t, absCoeff[1 << MLS_CG_SIZE]); + ALIGN_VAR_32(uint16_t, absCoeff[(1 << MLS_CG_SIZE) + ITERS]); - for (int i = 0; i < (1 << MLS_CG_SIZE); i++) + for (int i = 0; i < (1 << MLS_CG_SIZE) + ITERS; i++) { absCoeff[i] = rand() & SHORT_MAX; // more coeff with value one @@ -1737,20 +1751,168 @@ int numNonZero = rand() % 17; //can be random, range[1, 16] for (k = 0; k < C1FLAG_NUMBER; k++) { - if (absCoeff[k] >= 2) + if (absCoeff[i + k] >= 2) { break; } } firstC2Idx = k; // it is index of exact first coeff that value more than 2 - int ref_sum = ref(absCoeff, numNonZero, firstC2Idx); - int opt_sum = (int)checked(opt, absCoeff, numNonZero, firstC2Idx); + int ref_sum = ref(absCoeff + i, numNonZero, firstC2Idx); + int opt_sum = (int)checked(opt, absCoeff + i, numNonZero, firstC2Idx); if (ref_sum != opt_sum) return false; } return true; } +bool PixelHarness::check_costC1C2Flag(costC1C2Flag_t ref, costC1C2Flag_t opt) +{ + ALIGN_VAR_32(uint16_t, absCoeff[(1 << MLS_CG_SIZE)]); + + // generate CABAC context table + uint8_t ref_baseCtx[8]; + uint8_t opt_baseCtx[8]; + for (int k = 0; k < 8; k++) + { + ref_baseCtx[k] = + opt_baseCtx[k] = (rand() % (125 - 2)) + 2; + } + + for (int i = 0; i < ITERS; i++) + { + int rand_offset = rand() % 4; + int numNonZero = 0; + + // generate test data, all are Absolute value and Aligned + for (int k = 0; k < C1FLAG_NUMBER; k++) + { + int value = rand() & SHORT_MAX; + // more coeff with value [0,2] + if (value < SHORT_MAX * 1 / 3) + value = 0; + else if (value < SHORT_MAX * 2 / 3) + value = 1; + else if (value < SHORT_MAX * 3 / 4) + value = 2; + + if (value) + { + absCoeff[numNonZero] = (uint16_t)value; + numNonZero++; + } + } + if (numNonZero == 0) + { + numNonZero = 1; + absCoeff[0] = 1; + } + + int ref_sum = ref(absCoeff, (intptr_t)numNonZero, ref_baseCtx, (intptr_t)rand_offset); + int opt_sum = (int)checked(opt, absCoeff, (intptr_t)numNonZero, opt_baseCtx, (intptr_t)rand_offset); + if (ref_sum != opt_sum) + { + ref_sum = ref(absCoeff, (intptr_t)numNonZero, ref_baseCtx, (intptr_t)rand_offset); + opt_sum = opt(absCoeff, (intptr_t)numNonZero, opt_baseCtx, (intptr_t)rand_offset); + return false; + } + } + return true; +} + +bool PixelHarness::check_planeClipAndMax(planeClipAndMax_t ref, planeClipAndMax_t opt) +{ + for (int i = 0; i < ITERS; i++) + { + intptr_t rand_stride = rand() % STRIDE; + int rand_width = (rand() % (STRIDE * 2)) + 1; + const int rand_height = (rand() % MAX_HEIGHT) + 1; + const pixel rand_min = rand() % 32; + const pixel rand_max = PIXEL_MAX - (rand() % 32); + uint64_t ref_sum, opt_sum; + + // video width must be more than or equal to 32 + if (rand_width < 32) + rand_width = 32; + + // stride must be more than or equal to width + if (rand_stride < rand_width) + rand_stride = rand_width; + + pixel ref_max = ref(pbuf1, rand_stride, rand_width, rand_height, &ref_sum, rand_min, rand_max); + pixel opt_max = (pixel)checked(opt, pbuf1, rand_stride, rand_width, 
rand_height, &opt_sum, rand_min, rand_max); + + if (ref_max != opt_max) + return false; + } + return true; +} + +bool PixelHarness::check_pelFilterLumaStrong_H(pelFilterLumaStrong_t ref, pelFilterLumaStrong_t opt) +{ + intptr_t srcStep = 1, offset = 64; + int32_t tcP, tcQ, maskP, maskQ, tc; + int j = 0; + + pixel pixel_test_buff1[TEST_CASES][BUFFSIZE]; + for (int i = 0; i < TEST_CASES; i++) + memcpy(pixel_test_buff1[i], pixel_test_buff[i], sizeof(pixel) * BUFFSIZE); + + for (int i = 0; i < ITERS; i++) + { + tc = rand() % PIXEL_MAX; + maskP = (rand() % PIXEL_MAX) - 1; + maskQ = (rand() % PIXEL_MAX) - 1; + tcP = (tc & maskP); + tcQ = (tc & maskQ); + + int index = rand() % 3; + + ref(pixel_test_buff[index] + 4 * offset + j, srcStep, offset, tcP, tcQ); + checked(opt, pixel_test_buff1[index] + 4 * offset + j, srcStep, offset, tcP, tcQ); + + if (memcmp(pixel_test_buff[index], pixel_test_buff1[index], sizeof(pixel) * BUFFSIZE)) + return false; + + reportfail() + j += INCR; + } + + return true; +} + +bool PixelHarness::check_pelFilterLumaStrong_V(pelFilterLumaStrong_t ref, pelFilterLumaStrong_t opt) +{ + intptr_t srcStep = 64, offset = 1; + int32_t tcP, tcQ, maskP, maskQ, tc; + int j = 0; + + pixel pixel_test_buff1[TEST_CASES][BUFFSIZE]; + for (int i = 0; i < TEST_CASES; i++) + memcpy(pixel_test_buff1[i], pixel_test_buff[i], sizeof(pixel) * BUFFSIZE); + + for (int i = 0; i < ITERS; i++) + { + tc = rand() % PIXEL_MAX; + maskP = (rand() % PIXEL_MAX) - 1; + maskQ = (rand() % PIXEL_MAX) - 1; + tcP = (tc & maskP); + tcQ = (tc & maskQ); + + int index = rand() % 3; + + ref(pixel_test_buff[index] + 4 + j, srcStep, offset, tcP, tcQ); + checked(opt, pixel_test_buff1[index] + 4 + j, srcStep, offset, tcP, tcQ); + + if (memcmp(pixel_test_buff[index], pixel_test_buff1[index], sizeof(pixel) * BUFFSIZE)) + return false; + + reportfail() + j += INCR; + } + + return true; +} + bool PixelHarness::testPU(int part, const EncoderPrimitives& ref, const EncoderPrimitives& opt) { if (opt.pu[part].satd) @@ -2039,15 +2201,6 @@ } } - if (opt.cu[i].psy_cost_ss) - { - if (!check_psyCost_ss(ref.cu[i].psy_cost_ss, opt.cu[i].psy_cost_ss)) - { - printf("\npsy_cost_ss[%dx%d] failed!\n", 4 << i, 4 << i); - return false; - } - } - if (i < BLOCK_64x64) { /* TU only primitives */ @@ -2175,7 +2328,7 @@ { if (!check_ssim_4x4x2_core(ref.ssim_4x4x2_core, opt.ssim_4x4x2_core)) { - printf("ssim_end_4 failed!\n"); + printf("ssim_4x4x2_core failed!\n"); return false; } } @@ -2362,6 +2515,7 @@ return false; } } + if (opt.costCoeffNxN) { if (!check_costCoeffNxN(ref.costCoeffNxN, opt.costCoeffNxN)) @@ -2370,6 +2524,7 @@ return false; } } + if (opt.costCoeffRemain) { if (!check_costCoeffRemain(ref.costCoeffRemain, opt.costCoeffRemain)) @@ -2379,6 +2534,43 @@ } } + if (opt.costC1C2Flag) + { + if (!check_costC1C2Flag(ref.costC1C2Flag, opt.costC1C2Flag)) + { + printf("costC1C2Flag failed!\n"); + return false; + } + } + + + if (opt.planeClipAndMax) + { + if (!check_planeClipAndMax(ref.planeClipAndMax, opt.planeClipAndMax)) + { + printf("planeClipAndMax failed!\n"); + return false; + } + } + + if (opt.pelFilterLumaStrong[0]) + { + if (!check_pelFilterLumaStrong_V(ref.pelFilterLumaStrong[0], opt.pelFilterLumaStrong[0])) + { + printf("pelFilterLumaStrong Vertical failed!\n"); + return false; + } + } + + if (opt.pelFilterLumaStrong[1]) + { + if (!check_pelFilterLumaStrong_H(ref.pelFilterLumaStrong[1], opt.pelFilterLumaStrong[1])) + { + printf("pelFilterLumaStrong Horizontal failed!\n"); + return false; + } + } + return true; } @@ -2637,12 +2829,6 @@ 
HEADER("psy_cost_pp[%dx%d]", 4 << i, 4 << i); REPORT_SPEEDUP(opt.cu[i].psy_cost_pp, ref.cu[i].psy_cost_pp, pbuf1, STRIDE, pbuf2, STRIDE); } - - if (opt.cu[i].psy_cost_ss) - { - HEADER("psy_cost_ss[%dx%d]", 4 << i, 4 << i); - REPORT_SPEEDUP(opt.cu[i].psy_cost_ss, ref.cu[i].psy_cost_ss, sbuf1, STRIDE, sbuf2, STRIDE); - } } if (opt.weight_pp) @@ -2745,14 +2931,14 @@ { int32_t stats[33], count[33]; HEADER0("saoCuStatsBO"); - REPORT_SPEEDUP(opt.saoCuStatsBO, ref.saoCuStatsBO, pbuf2, pbuf3, 64, 60, 61, stats, count); + REPORT_SPEEDUP(opt.saoCuStatsBO, ref.saoCuStatsBO, sbuf2, pbuf3, 64, 60, 61, stats, count); } if (opt.saoCuStatsE0) { int32_t stats[33], count[33]; HEADER0("saoCuStatsE0"); - REPORT_SPEEDUP(opt.saoCuStatsE0, ref.saoCuStatsE0, pbuf2, pbuf3, 64, 60, 61, stats, count); + REPORT_SPEEDUP(opt.saoCuStatsE0, ref.saoCuStatsE0, sbuf2, pbuf3, 64, 60, 61, stats, count); } if (opt.saoCuStatsE1) @@ -2761,7 +2947,7 @@ int8_t upBuff1[MAX_CU_SIZE + 2]; memset(upBuff1, 1, sizeof(upBuff1)); HEADER0("saoCuStatsE1"); - REPORT_SPEEDUP(opt.saoCuStatsE1, ref.saoCuStatsE1, pbuf2, pbuf3, 64, upBuff1 + 1,60, 61, stats, count); + REPORT_SPEEDUP(opt.saoCuStatsE1, ref.saoCuStatsE1, sbuf2, pbuf3, 64, upBuff1 + 1,60, 61, stats, count); } if (opt.saoCuStatsE2) @@ -2772,7 +2958,7 @@ memset(upBuff1, 1, sizeof(upBuff1)); memset(upBufft, -1, sizeof(upBufft)); HEADER0("saoCuStatsE2"); - REPORT_SPEEDUP(opt.saoCuStatsE2, ref.saoCuStatsE2, pbuf2, pbuf3, 64, upBuff1 + 1, upBufft + 1, 60, 61, stats, count); + REPORT_SPEEDUP(opt.saoCuStatsE2, ref.saoCuStatsE2, sbuf2, pbuf3, 64, upBuff1 + 1, upBufft + 1, 60, 61, stats, count); } if (opt.saoCuStatsE3) @@ -2781,7 +2967,7 @@ int32_t stats[5], count[5]; memset(upBuff1, 1, sizeof(upBuff1)); HEADER0("saoCuStatsE3"); - REPORT_SPEEDUP(opt.saoCuStatsE3, ref.saoCuStatsE3, pbuf2, pbuf3, 64, upBuff1 + 1, 60, 61, stats, count); + REPORT_SPEEDUP(opt.saoCuStatsE3, ref.saoCuStatsE3, sbuf2, pbuf3, 64, upBuff1 + 1, 60, 61, stats, count); } if (opt.planecopy_sp) @@ -2823,6 +3009,7 @@ coefBuf[3 + 3 * 32] = 0x0BAD; REPORT_SPEEDUP(opt.findPosFirstLast, ref.findPosFirstLast, coefBuf, 32, g_scan4x4[SCAN_DIAG]); } + if (opt.costCoeffNxN) { HEADER0("costCoeffNxN"); @@ -2841,6 +3028,7 @@ REPORT_SPEEDUP(opt.costCoeffNxN, ref.costCoeffNxN, g_scan4x4[SCAN_DIAG], coefBuf, 32, tmpOut, ctxSig, 0xFFFF, ctx, 1, 15, 32); } + if (opt.costCoeffRemain) { HEADER0("costCoeffRemain"); @@ -2849,4 +3037,37 @@ memset(abscoefBuf + 32 * 31, 1, 32 * sizeof(uint16_t)); REPORT_SPEEDUP(opt.costCoeffRemain, ref.costCoeffRemain, abscoefBuf, 16, 3); } + + if (opt.costC1C2Flag) + { + HEADER0("costC1C2Flag"); + ALIGN_VAR_32(uint16_t, abscoefBuf[C1FLAG_NUMBER]); + memset(abscoefBuf, 1, sizeof(abscoefBuf)); + abscoefBuf[C1FLAG_NUMBER - 2] = 2; + abscoefBuf[C1FLAG_NUMBER - 1] = 3; + REPORT_SPEEDUP(opt.costC1C2Flag, ref.costC1C2Flag, abscoefBuf, C1FLAG_NUMBER, (uint8_t*)psbuf1, 1); + } + + if (opt.planeClipAndMax) + { + HEADER0("planeClipAndMax"); + uint64_t dummy; + REPORT_SPEEDUP(opt.planeClipAndMax, ref.planeClipAndMax, pbuf1, 128, 63, 62, &dummy, 1, PIXEL_MAX - 1); + } + + if (opt.pelFilterLumaStrong[0]) + { + int32_t tcP = (rand() % PIXEL_MAX) - 1; + int32_t tcQ = (rand() % PIXEL_MAX) - 1; + HEADER0("pelFilterLumaStrong_Vertical"); + REPORT_SPEEDUP(opt.pelFilterLumaStrong[0], ref.pelFilterLumaStrong[0], pbuf1, STRIDE, 1, tcP, tcQ); + } + + if (opt.pelFilterLumaStrong[1]) + { + int32_t tcP = (rand() % PIXEL_MAX) - 1; + int32_t tcQ = (rand() % PIXEL_MAX) - 1; + HEADER0("pelFilterLumaStrong_Horizontal"); + 
REPORT_SPEEDUP(opt.pelFilterLumaStrong[1], ref.pelFilterLumaStrong[1], pbuf1, 1, STRIDE, tcP, tcQ); + } }
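Several of the hunks above share one hardening pattern: the destination is pre-filled with 0xCD (and a few magic words are planted just past the logical end), the kernel runs, and the tail is then verified untouched so out-of-bounds writes fail the test. A minimal sketch of the tail check for a byte buffer, not x265 source:

    #include <cstddef>
    #include <cstdint>

    // True if nothing past the first 'used' bytes of 'dst' was overwritten.
    static bool tailIntact(const uint8_t* dst, size_t used, size_t total)
    {
        for (size_t i = used; i < total; i++)
            if (dst[i] != 0xCD)
                return false;   // the kernel wrote beyond its reported extent
        return true;
    }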
View file
x265_1.8.tar.gz/source/test/pixelharness.h -> x265_1.9.tar.gz/source/test/pixelharness.h
Changed
@@ -2,6 +2,7 @@
 * Copyright (C) 2013 x265 project
 *
 * Authors: Steve Borho <steve@borho.org>
+ *          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by

@@ -40,6 +41,8 @@
    enum { TEST_CASES = 3 };
    enum { SMAX = 1 << 12 };
    enum { SMIN = -1 << 12 };
+   enum { RMAX = PIXEL_MAX - PIXEL_MIN }; //The maximum value obtained by subtracting pixel values (residual max)
+   enum { RMIN = PIXEL_MIN - PIXEL_MAX }; //The minimum value obtained by subtracting pixel values (residual min)

    ALIGN_VAR_32(pixel, pbuf1[BUFFSIZE]);
    pixel pbuf2[BUFFSIZE];

@@ -64,6 +67,7 @@
    uint16_t ushort_test_buff[TEST_CASES][BUFFSIZE];
    uint8_t uchar_test_buff[TEST_CASES][BUFFSIZE];
    double double_test_buff[TEST_CASES][BUFFSIZE];
+   int16_t residual_test_buff[TEST_CASES][BUFFSIZE];

    bool check_pixelcmp(pixelcmp_t ref, pixelcmp_t opt);
    bool check_pixel_sse(pixel_sse_t ref, pixel_sse_t opt);

@@ -110,12 +114,15 @@
    bool check_planecopy_cp(planecopy_cp_t ref, planecopy_cp_t opt);
    bool check_cutree_propagate_cost(cutree_propagate_cost ref, cutree_propagate_cost opt);
    bool check_psyCost_pp(pixelcmp_t ref, pixelcmp_t opt);
-   bool check_psyCost_ss(pixelcmp_ss_t ref, pixelcmp_ss_t opt);
    bool check_calSign(sign_t ref, sign_t opt);
    bool check_scanPosLast(scanPosLast_t ref, scanPosLast_t opt);
    bool check_findPosFirstLast(findPosFirstLast_t ref, findPosFirstLast_t opt);
    bool check_costCoeffNxN(costCoeffNxN_t ref, costCoeffNxN_t opt);
    bool check_costCoeffRemain(costCoeffRemain_t ref, costCoeffRemain_t opt);
+   bool check_costC1C2Flag(costC1C2Flag_t ref, costC1C2Flag_t opt);
+   bool check_planeClipAndMax(planeClipAndMax_t ref, planeClipAndMax_t opt);
+   bool check_pelFilterLumaStrong_V(pelFilterLumaStrong_t ref, pelFilterLumaStrong_t opt);
+   bool check_pelFilterLumaStrong_H(pelFilterLumaStrong_t ref, pelFilterLumaStrong_t opt);

public:
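The new RMAX/RMIN bounds follow from the definition of a residual as the difference of two pixels, so the range is symmetric about zero. For 8-bit content the bounds work out as below (10-bit would give ±1023); a standalone check, not x265 source:

    enum { PIXEL_MIN = 0, PIXEL_MAX = 255 };   // 8-bit example values
    enum { RMAX = PIXEL_MAX - PIXEL_MIN };     //  255
    enum { RMIN = PIXEL_MIN - PIXEL_MAX };     // -255
    static_assert(RMAX == 255 && RMIN == -255, "8-bit residual bounds");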
View file
x265_1.8.tar.gz/source/test/regression-tests.txt -> x265_1.9.tar.gz/source/test/regression-tests.txt
Changed
@@ -11,124 +11,132 @@ # consistent across many machines, you must force a certain -FN so it is # not auto-detected. +BasketballDrive_1920x1080_50.y4m,--preset ultrafast --signhide --colormatrix bt709 +BasketballDrive_1920x1080_50.y4m,--preset superfast --psy-rd 1 --ctu 16 --no-wpp --limit-modes +BasketballDrive_1920x1080_50.y4m,--preset veryfast --tune zerolatency --no-temporal-mvp BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190 BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 16 --cu-lossless BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao +BasketballDrive_1920x1080_50.y4m,--preset medium --no-cutree --analysis-mode=save --bitrate 7000 --limit-modes,--preset medium --no-cutree --analysis-mode=load --bitrate 7000 --limit-modes BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3 --qg-size 16 --limit-refs 1 BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0 -BasketballDrive_1920x1080_50.y4m,--preset superfast --psy-rd 1 --ctu 16 --no-wpp -BasketballDrive_1920x1080_50.y4m,--preset ultrafast --signhide --colormatrix bt709 -BasketballDrive_1920x1080_50.y4m,--preset veryfast --tune zerolatency --no-temporal-mvp -BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode --limit-refs 1 +BasketballDrive_1920x1080_50.y4m,--preset slower --no-cutree --analysis-mode=save --bitrate 7000,--preset slower --no-cutree --analysis-mode=load --bitrate 7000 +BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode --limit-refs 1 --aq-mode 3 +BasketballDrive_1920x1080_50.y4m,--preset veryslow --no-cutree --analysis-mode=save --bitrate 7000 --tskip-fast,--preset veryslow --no-cutree --analysis-mode=load --bitrate 7000 --tskip-fast +BasketballDrive_1920x1080_50.y4m,--preset veryslow --recon-y4m-exec "ffplay -i pipe:0 -autoexit" +Coastguard-4k.y4m,--preset ultrafast --recon-y4m-exec "ffplay -i pipe:0 -autoexit" +Coastguard-4k.y4m,--preset superfast --tune grain --overscan=crop +Coastguard-4k.y4m,--preset veryfast --no-cutree --analysis-mode=save --bitrate 15000,--preset veryfast --no-cutree --analysis-mode=load --bitrate 15000 Coastguard-4k.y4m,--preset medium --rdoq-level 1 --tune ssim --no-signhide --me umh Coastguard-4k.y4m,--preset slow --tune psnr --cbqpoffs -1 --crqpoffs 1 --limit-refs 1 -Coastguard-4k.y4m,--preset superfast --tune grain --overscan=crop -CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --aq-mode 0 --sar 2 --range full +CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16 +CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao +CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryfast --temporal-layers --tune grain CrowdRun_1920x1080_50_10bit_422.yuv,--preset faster --max-tu-size 4 --min-cu-size 32 +CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --aq-mode 0 --sar 2 --range full CrowdRun_1920x1080_50_10bit_422.yuv,--preset medium --no-wpp --no-cutree --no-strong-intra-smoothing --limit-refs 1 CrowdRun_1920x1080_50_10bit_422.yuv,--preset slow --no-wpp --tune ssim --transfer smpte240m CrowdRun_1920x1080_50_10bit_422.yuv,--preset slower --tune ssim --tune fastdecode --limit-refs 2 -CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao -CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16 -CrowdRun_1920x1080_50_10bit_422.yuv,--preset 
veryfast --temporal-layers --tune grain -CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1 -CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither --no-psy-rd CrowdRun_1920x1080_50_10bit_444.yuv,--preset ultrafast --weightp --no-wpp --no-open-gop +CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither --no-psy-rd CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers --limit-refs 2 +CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1 --limit-modes CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut -DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16 -DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd --qg-size 32 --limit-refs 0 --cu-lossless DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp --qg-size 16 +DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16 --limit-modes +DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd --qg-size 32 --limit-refs 0 --cu-lossless +DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0 --limit-refs 3 -DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4 -FourPeople_1280x720_60.y4m,--preset medium --qp 38 --no-psy-rd FourPeople_1280x720_60.y4m,--preset superfast --no-wpp --lookahead-slices 2 +FourPeople_1280x720_60.y4m,--preset medium --qp 38 --no-psy-rd +FourPeople_1280x720_60.y4m,--preset medium --recon-y4m-exec "ffplay -i pipe:0 -autoexit" +FourPeople_1280x720_60.y4m,--preset veryslow --numa-pools "none" +Keiba_832x480_30.y4m,--preset superfast --no-fast-intra --nr-intra 1000 -F4 Keiba_832x480_30.y4m,--preset medium --pmode --tune grain Keiba_832x480_30.y4m,--preset slower --fast-intra --nr-inter 500 -F4 --limit-refs 0 -Keiba_832x480_30.y4m,--preset superfast --no-fast-intra --nr-intra 1000 -F4 -Kimono1_1920x1080_24_10bit_444.yuv,--preset medium --min-cu-size 32 Kimono1_1920x1080_24_10bit_444.yuv,--preset superfast --weightb -KristenAndSara_1280x720_60.y4m,--preset medium --no-cutree --max-tu-size 16 -KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8 --limit-refs 0 -KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16 --qg-size 16 --limit-refs 1 +Kimono1_1920x1080_24_10bit_444.yuv,--preset medium --min-cu-size 32 KristenAndSara_1280x720_60.y4m,--preset ultrafast --strong-intra-smoothing -NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain --limit-refs 2 +KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16 --qg-size 16 --limit-refs 1 +KristenAndSara_1280x720_60.y4m,--preset medium --no-cutree --max-tu-size 16 +KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8 --limit-refs 0 --limit-modes NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr -News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 16 +NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain --limit-refs 2 +NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset slow --no-cutree --analysis-mode=save --bitrate 9000,--preset slow --no-cutree --analysis-mode=load --bitrate 9000 +News-4k.y4m,--preset ultrafast 
--no-cutree --analysis-mode=save --bitrate 15000,--preset ultrafast --no-cutree --analysis-mode=load --bitrate 15000 News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0 +News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 16 +OldTownCross_1920x1080_50_10bit_422.yuv,--preset superfast --weightp OldTownCross_1920x1080_50_10bit_422.yuv,--preset medium --no-weightp OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune fastdecode -OldTownCross_1920x1080_50_10bit_422.yuv,--preset superfast --weightp +ParkScene_1920x1080_24_10bit_444.yuv,--preset superfast --weightp --lookahead-slices 4 ParkScene_1920x1080_24.y4m,--preset medium --qp 40 --rdpenalty 2 --tu-intra-depth 3 ParkScene_1920x1080_24.y4m,--preset slower --no-weightp -ParkScene_1920x1080_24_10bit_444.yuv,--preset superfast --weightp --lookahead-slices 4 +RaceHorses_416x240_30.y4m,--preset superfast --no-cutree RaceHorses_416x240_30.y4m,--preset medium --tskip-fast --tskip RaceHorses_416x240_30.y4m,--preset slower --keyint -1 --rdoq-level 0 -RaceHorses_416x240_30.y4m,--preset superfast --no-cutree RaceHorses_416x240_30.y4m,--preset veryslow --tskip-fast --tskip --limit-refs 3 -RaceHorses_416x240_30_10bit.yuv,--preset fast --lookahead-slices 2 --b-intra --limit-refs 1 -RaceHorses_416x240_30_10bit.yuv,--preset faster --rdoq-level 0 --dither -RaceHorses_416x240_30_10bit.yuv,--preset slow --tune grain RaceHorses_416x240_30_10bit.yuv,--preset ultrafast --tune psnr --limit-refs 1 RaceHorses_416x240_30_10bit.yuv,--preset veryfast --weightb +RaceHorses_416x240_30_10bit.yuv,--preset faster --rdoq-level 0 --dither +RaceHorses_416x240_30_10bit.yuv,--preset fast --lookahead-slices 2 --b-intra --limit-refs 1 +RaceHorses_416x240_30_10bit.yuv,--preset slow --tune grain --limit-modes RaceHorses_416x240_30_10bit.yuv,--preset placebo --limit-refs 1 SteamLocomotiveTrain_2560x1600_60_10bit_crop.yuv,--preset medium --dither -big_buck_bunny_360p24.y4m,--preset faster --keyint 240 --min-keyint 60 --rc-lookahead 200 -big_buck_bunny_360p24.y4m,--preset medium --keyint 60 --min-keyint 48 --weightb --limit-refs 3 -big_buck_bunny_360p24.y4m,--preset slow --psy-rdoq 2.0 --rdoq-level 1 --no-b-intra -big_buck_bunny_360p24.y4m,--preset superfast --psy-rdoq 2.0 big_buck_bunny_360p24.y4m,--preset ultrafast --deblock=2 +big_buck_bunny_360p24.y4m,--preset superfast --psy-rdoq 2.0 --aq-mode 3 big_buck_bunny_360p24.y4m,--preset veryfast --no-deblock -city_4cif_60fps.y4m,--preset medium --crf 4 --cu-lossless --sao-non-deblock +big_buck_bunny_360p24.y4m,--preset faster --keyint 240 --min-keyint 60 --rc-lookahead 200 +big_buck_bunny_360p24.y4m,--preset medium --keyint 60 --min-keyint 48 --weightb --limit-refs 3 +big_buck_bunny_360p24.y4m,--preset slow --psy-rdoq 2.0 --rdoq-level 1 --no-b-intra --aq-mode 3 city_4cif_60fps.y4m,--preset superfast --rdpenalty 1 --tu-intra-depth 2 +city_4cif_60fps.y4m,--preset medium --crf 4 --cu-lossless --sao-non-deblock city_4cif_60fps.y4m,--preset slower --scaling-list default city_4cif_60fps.y4m,--preset veryslow --rdpenalty 2 --sao-non-deblock --no-b-intra --limit-refs 0 -ducks_take_off_420_720p50.y4m,--preset fast --deblock 6 --bframes 16 --rc-lookahead 40 +ducks_take_off_420_720p50.y4m,--preset ultrafast --constrained-intra --rd 1 +ducks_take_off_444_720p50.y4m,--preset superfast --weightp --limit-refs 2 ducks_take_off_420_720p50.y4m,--preset faster --qp 24 --deblock -6 --limit-refs 2 +ducks_take_off_420_720p50.y4m,--preset fast --deblock 6 --bframes 16 --rc-lookahead 40 ducks_take_off_420_720p50.y4m,--preset medium 
--tskip --tskip-fast --constrained-intra -ducks_take_off_420_720p50.y4m,--preset slow --scaling-list default --qp 40 -ducks_take_off_420_720p50.y4m,--preset ultrafast --constrained-intra --rd 1 -ducks_take_off_420_720p50.y4m,--preset veryslow --constrained-intra --bframes 2 ducks_take_off_444_720p50.y4m,--preset medium --qp 38 --no-scenecut -ducks_take_off_444_720p50.y4m,--preset superfast --weightp --rd 0 --limit-refs 2 +ducks_take_off_420_720p50.y4m,--preset slow --scaling-list default --qp 40 ducks_take_off_444_720p50.y4m,--preset slower --psy-rd 1 --psy-rdoq 2.0 --rdoq-level 1 --limit-refs 1 +ducks_take_off_420_720p50.y4m,--preset slower --no-wpp +ducks_take_off_420_720p50.y4m,--preset veryslow --constrained-intra --bframes 2 +mobile_calendar_422_ntsc.y4m,--preset superfast --weightp mobile_calendar_422_ntsc.y4m,--preset medium --bitrate 500 -F4 mobile_calendar_422_ntsc.y4m,--preset slower --tskip --tskip-fast -mobile_calendar_422_ntsc.y4m,--preset superfast --weightp --rd 0 mobile_calendar_422_ntsc.y4m,--preset veryslow --tskip --limit-refs 2 +old_town_cross_444_720p50.y4m,--preset ultrafast --weightp --min-cu 32 +old_town_cross_444_720p50.y4m,--preset superfast --weightp --min-cu 16 --limit-modes +old_town_cross_444_720p50.y4m,--preset veryfast --qp 1 --tune ssim old_town_cross_444_720p50.y4m,--preset faster --rd 1 --tune zero-latency +old_town_cross_444_720p50.y4m,--preset fast --no-cutree --analysis-mode=save --bitrate 3000 --early-skip,--preset fast --no-cutree --analysis-mode=load --bitrate 3000 --early-skip old_town_cross_444_720p50.y4m,--preset medium --keyint -1 --no-weightp --ref 6 old_town_cross_444_720p50.y4m,--preset slow --rdoq-level 1 --early-skip --ref 7 --no-b-pyramid old_town_cross_444_720p50.y4m,--preset slower --crf 4 --cu-lossless -old_town_cross_444_720p50.y4m,--preset superfast --weightp --min-cu 16 -old_town_cross_444_720p50.y4m,--preset ultrafast --weightp --min-cu 32 -old_town_cross_444_720p50.y4m,--preset veryfast --qp 1 --tune ssim parkrun_ter_720p50.y4m,--preset medium --no-open-gop --sao-non-deblock --crf 4 --cu-lossless parkrun_ter_720p50.y4m,--preset slower --fast-intra --no-rect --tune grain -silent_cif_420.y4m,--preset medium --me full --rect --amp silent_cif_420.y4m,--preset superfast --weightp --rect +silent_cif_420.y4m,--preset medium --me full --rect --amp silent_cif_420.y4m,--preset placebo --ctu 32 --no-sao --qg-size 16 -vtc1nw_422_ntsc.y4m,--preset medium --scaling-list default --ctu 16 --ref 5 -vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode --qg-size 16 +washdc_422_ntsc.y4m,--preset ultrafast --weightp --tu-intra-depth 4 vtc1nw_422_ntsc.y4m,--preset superfast --weightp --nr-intra 100 -F4 -washdc_422_ntsc.y4m,--preset faster --rdoq-level 1 --max-merge 5 -washdc_422_ntsc.y4m,--preset medium --no-weightp --max-tu-size 4 --limit-refs 1 -washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2 --qg-size 32 --limit-refs 1 washdc_422_ntsc.y4m,--preset superfast --psy-rd 1 --tune zerolatency -washdc_422_ntsc.y4m,--preset ultrafast --weightp --tu-intra-depth 4 washdc_422_ntsc.y4m,--preset veryfast --tu-inter-depth 4 -washdc_422_ntsc.y4m,--preset veryslow --crf 4 --cu-lossless --limit-refs 3 -BasketballDrive_1920x1080_50.y4m,--preset medium --no-cutree --analysis-mode=save --bitrate 15000,--preset medium --no-cutree --analysis-mode=load --bitrate 13000,--preset medium --no-cutree --analysis-mode=load --bitrate 11000,--preset medium --no-cutree --analysis-mode=load --bitrate 9000,--preset medium --no-cutree 
--analysis-mode=load --bitrate 7000 -NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset slow --no-cutree --analysis-mode=save --bitrate 15000,--preset slow --no-cutree --analysis-mode=load --bitrate 13000,--preset slow --no-cutree --analysis-mode=load --bitrate 11000,--preset slow --no-cutree --analysis-mode=load --bitrate 9000,--preset slow --no-cutree --analysis-mode=load --bitrate 7000 -old_town_cross_444_720p50.y4m,--preset veryslow --no-cutree --analysis-mode=save --bitrate 15000 --early-skip,--preset veryslow --no-cutree --analysis-mode=load --bitrate 13000 --early-skip,--preset veryslow --no-cutree --analysis-mode=load --bitrate 11000 --early-skip,--preset veryslow --no-cutree --analysis-mode=load --bitrate 9000 --early-skip,--preset veryslow --no-cutree --analysis-mode=load --bitrate 7000 --early-skip -Johnny_1280x720_60.y4m,--preset medium --no-cutree --analysis-mode=save --bitrate 15000 --tskip-fast,--preset medium --no-cutree --analysis-mode=load --bitrate 13000 --tskip-fast,--preset medium --no-cutree --analysis-mode=load --bitrate 11000 --tskip-fast,--preset medium --no-cutree --analysis-mode=load --bitrate 9000 --tskip-fast,--preset medium --no-cutree --analysis-mode=load --bitrate 7000 --tskip-fast -BasketballDrive_1920x1080_50.y4m,--preset medium --recon-y4m-exec "ffplay -i pipe:0 -autoexit" -FourPeople_1280x720_60.y4m,--preset ultrafast --recon-y4m-exec "ffplay -i pipe:0 -autoexit" -FourPeople_1280x720_60.y4m,--preset veryslow --recon-y4m-exec "ffplay -i pipe:0 -autoexit" +washdc_422_ntsc.y4m,--preset faster --rdoq-level 1 --max-merge 5 +vtc1nw_422_ntsc.y4m,--preset medium --scaling-list default --ctu 16 --ref 5 +washdc_422_ntsc.y4m,--preset medium --no-weightp --max-tu-size 4 --limit-refs 1 --aq-mode 2 +vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode --qg-size 16 +washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2 --qg-size 32 --limit-refs 1 +washdc_422_ntsc.y4m,--preset veryslow --crf 4 --cu-lossless --limit-refs 3 --limit-modes + +# Main12 intraCost overflow bug test +720p50_parkrun_ter.y4m,--preset medium # interlace test, even though input YUV is not field seperated -CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --interlace bff CrowdRun_1920x1080_50_10bit_422.yuv,--preset faster --interlace tff +CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --interlace bff # vim: tw=200
View file
x265_1.8.tar.gz/source/test/smoke-tests.txt -> x265_1.9.tar.gz/source/test/smoke-tests.txt
Changed
@@ -19,3 +19,6 @@
CrowdRun_1920x1080_50_10bit_444.yuv,--preset=medium --max-tu-size 16
DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=veryfast --min-cu 16
DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=fast --weightb --interlace bff
+
+# Main12 intraCost overflow bug test
+720p50_parkrun_ter.y4m,--preset medium
View file
x265_1.8.tar.gz/source/test/testbench.cpp -> x265_1.9.tar.gz/source/test/testbench.cpp
Changed
@@ -4,6 +4,7 @@
 * Authors: Gopu Govindaswamy <gopu@govindaswamy.org>
 *          Mandar Gurav <mandar@multicorewareinc.com>
 *          Mahesh Pittala <mahesh@multicorewareinc.com>
+ *          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/test/testharness.h -> x265_1.9.tar.gz/source/test/testharness.h
Changed
@@ -2,6 +2,7 @@
 * Copyright (C) 2013 x265 project
 *
 * Authors: Steve Borho <steve@borho.org>
+ *          Min Chen <chenm003@163.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
View file
x265_1.8.tar.gz/source/x265-extras.cpp -> x265_1.9.tar.gz/source/x265-extras.cpp
Changed
@@ -36,7 +36,7 @@ "I count, I ave-QP, I kbps, I-PSNR Y, I-PSNR U, I-PSNR V, I-SSIM (dB), " "P count, P ave-QP, P kbps, P-PSNR Y, P-PSNR U, P-PSNR V, P-SSIM (dB), " "B count, B ave-QP, B kbps, B-PSNR Y, B-PSNR U, B-PSNR V, B-SSIM (dB), " - "Version\n"; + "MaxCLL, MaxFALL, Version\n"; FILE* x265_csvlog_open(const x265_api& api, const x265_param& param, const char* fname, int level) { @@ -61,54 +61,58 @@ { if (level) { - fprintf(csvfp, "Encode Order, Type, POC, QP, Bits, "); + fprintf(csvfp, "Encode Order, Type, POC, QP, Bits, Scenecut, "); if (param.rc.rateControlMode == X265_RC_CRF) fprintf(csvfp, "RateFactor, "); - fprintf(csvfp, "Y PSNR, U PSNR, V PSNR, YUV PSNR, SSIM, SSIM (dB), List 0, List 1"); - /* detailed performance statistics */ - fprintf(csvfp, ", DecideWait (ms), Row0Wait (ms), Wall time (ms), Ref Wait Wall (ms), Total CTU time (ms), Stall Time (ms), Avg WPP, Row Blocks"); - if (level >= 2) + if (param.bEnablePsnr) + fprintf(csvfp, "Y PSNR, U PSNR, V PSNR, YUV PSNR, "); + if (param.bEnableSsim) + fprintf(csvfp, "SSIM, SSIM(dB), "); + fprintf(csvfp, "Latency, "); + fprintf(csvfp, "List 0, List 1"); + uint32_t size = param.maxCUSize; + for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) + { + fprintf(csvfp, ", Intra %dx%d DC, Intra %dx%d Planar, Intra %dx%d Ang", size, size, size, size, size, size); + size /= 2; + } + fprintf(csvfp, ", 4x4"); + size = param.maxCUSize; + if (param.bEnableRectInter) { - uint32_t size = param.maxCUSize; - for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) - { - fprintf(csvfp, ", Intra %dx%d DC, Intra %dx%d Planar, Intra %dx%d Ang", size, size, size, size, size, size); - size /= 2; - } - fprintf(csvfp, ", 4x4"); - size = param.maxCUSize; - if (param.bEnableRectInter) - { - for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) - { - fprintf(csvfp, ", Inter %dx%d, Inter %dx%d (Rect)", size, size, size, size); - if (param.bEnableAMP) - fprintf(csvfp, ", Inter %dx%d (Amp)", size, size); - size /= 2; - } - } - else - { - for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) - { - fprintf(csvfp, ", Inter %dx%d", size, size); - size /= 2; - } - } - size = param.maxCUSize; for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) { - fprintf(csvfp, ", Skip %dx%d", size, size); + fprintf(csvfp, ", Inter %dx%d, Inter %dx%d (Rect)", size, size, size, size); + if (param.bEnableAMP) + fprintf(csvfp, ", Inter %dx%d (Amp)", size, size); size /= 2; } - size = param.maxCUSize; + } + else + { for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) { - fprintf(csvfp, ", Merge %dx%d", size, size); + fprintf(csvfp, ", Inter %dx%d", size, size); size /= 2; } - fprintf(csvfp, ", Avg Luma Distortion, Avg Chroma Distortion, Avg psyEnergy, Avg Luma Level, Max Luma Level"); } + size = param.maxCUSize; + for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) + { + fprintf(csvfp, ", Skip %dx%d", size, size); + size /= 2; + } + size = param.maxCUSize; + for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) + { + fprintf(csvfp, ", Merge %dx%d", size, size); + size /= 2; + } + fprintf(csvfp, ", Avg Luma Distortion, Avg Chroma Distortion, Avg psyEnergy, Avg Luma Level, Max Luma Level, Avg Residual Energy"); + + /* detailed performance statistics */ + if (level >= 2) + fprintf(csvfp, ", DecideWait (ms), Row0Wait (ms), Wall time (ms), Ref Wait Wall (ms), Total CTU time (ms), Stall Time (ms), Avg WPP, Row Blocks"); fprintf(csvfp, "\n"); } else @@ -125,17 +129,14 @@ return; const x265_frame_stats* frameStats = &pic.frameData; - fprintf(csvfp, "%d, %c-SLICE, %4d, 
%2.2lf, %10d,", frameStats->encoderOrder, frameStats->sliceType, frameStats->poc, frameStats->qp, (int)frameStats->bits); + fprintf(csvfp, "%d, %c-SLICE, %4d, %2.2lf, %10d, %d,", frameStats->encoderOrder, frameStats->sliceType, frameStats->poc, frameStats->qp, (int)frameStats->bits, frameStats->bScenecut); if (param.rc.rateControlMode == X265_RC_CRF) fprintf(csvfp, "%.3lf,", frameStats->rateFactor); if (param.bEnablePsnr) fprintf(csvfp, "%.3lf, %.3lf, %.3lf, %.3lf,", frameStats->psnrY, frameStats->psnrU, frameStats->psnrV, frameStats->psnr); - else - fputs(" -, -, -, -,", csvfp); if (param.bEnableSsim) fprintf(csvfp, " %.6f, %6.3f,", frameStats->ssim, x265_ssim2dB(frameStats->ssim)); - else - fputs(" -, -,", csvfp); + fprintf(csvfp, "%d, ", frameStats->frameLatency); if (frameStats->sliceType == 'I') fputs(" -, -,", csvfp); else @@ -154,32 +155,33 @@ else fputs(" -,", csvfp); } - fprintf(csvfp, " %.1lf, %.1lf, %.1lf, %.1lf, %.1lf, %.1lf,", frameStats->decideWaitTime, frameStats->row0WaitTime, frameStats->wallTime, frameStats->refWaitWallTime, frameStats->totalCTUTime, frameStats->stallTime); - fprintf(csvfp, " %.3lf, %d", frameStats->avgWPP, frameStats->countRowBlocks); - if (level >= 2) + for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) + fprintf(csvfp, "%5.2lf%%, %5.2lf%%, %5.2lf%%,", frameStats->cuStats.percentIntraDistribution[depth][0], frameStats->cuStats.percentIntraDistribution[depth][1], frameStats->cuStats.percentIntraDistribution[depth][2]); + fprintf(csvfp, "%5.2lf%%", frameStats->cuStats.percentIntraNxN); + if (param.bEnableRectInter) { for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) - fprintf(csvfp, ", %5.2lf%%, %5.2lf%%, %5.2lf%%", frameStats->cuStats.percentIntraDistribution[depth][0], frameStats->cuStats.percentIntraDistribution[depth][1], frameStats->cuStats.percentIntraDistribution[depth][2]); - fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentIntraNxN); - if (param.bEnableRectInter) { - for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) - { - fprintf(csvfp, ", %5.2lf%%, %5.2lf%%", frameStats->cuStats.percentInterDistribution[depth][0], frameStats->cuStats.percentInterDistribution[depth][1]); - if (param.bEnableAMP) - fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentInterDistribution[depth][2]); - } + fprintf(csvfp, ", %5.2lf%%, %5.2lf%%", frameStats->cuStats.percentInterDistribution[depth][0], frameStats->cuStats.percentInterDistribution[depth][1]); + if (param.bEnableAMP) + fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentInterDistribution[depth][2]); } - else - { - for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) - fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentInterDistribution[depth][0]); - } - for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) - fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentSkipCu[depth]); + } + else + { for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) - fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentMergeCu[depth]); - fprintf(csvfp, ", %.2lf, %.2lf, %.2lf, %.2lf, %d", frameStats->avgLumaDistortion, frameStats->avgChromaDistortion, frameStats->avgPsyEnergy, frameStats->avgLumaLevel, frameStats->maxLumaLevel); + fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentInterDistribution[depth][0]); + } + for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) + fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentSkipCu[depth]); + for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) + fprintf(csvfp, ", %5.2lf%%", 
frameStats->cuStats.percentMergeCu[depth]); + fprintf(csvfp, ", %.2lf, %.2lf, %.2lf, %.2lf, %d, %.2lf", frameStats->avgLumaDistortion, frameStats->avgChromaDistortion, frameStats->avgPsyEnergy, frameStats->avgLumaLevel, frameStats->maxLumaLevel, frameStats->avgResEnergy); + + if (level >= 2) + { + fprintf(csvfp, ", %.1lf, %.1lf, %.1lf, %.1lf, %.1lf, %.1lf,", frameStats->decideWaitTime, frameStats->row0WaitTime, frameStats->wallTime, frameStats->refWaitWallTime, frameStats->totalCTUTime, frameStats->stallTime); + fprintf(csvfp, " %.3lf, %d", frameStats->avgWPP, frameStats->countRowBlocks); } fprintf(csvfp, "\n"); fflush(stderr); @@ -198,11 +200,13 @@ } // CLI arguments or other + fputc('"', csvfp); for (int i = 1; i < argc; i++) { - if (i) fputc(' ', csvfp); + fputc(' ', csvfp); fputs(argv[i], csvfp); } + fputc('"', csvfp); // current date and time time_t now; @@ -273,7 +277,7 @@ else fprintf(csvfp, " -, -, -, -, -, -, -,"); - fprintf(csvfp, " %s\n", api.version_str); + fprintf(csvfp, " %-6u, %-6u, %s\n", stats.maxCLL, stats.maxFALL, api.version_str); } /* The dithering algorithm is based on Sierra-2-4A error diffusion. */
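The reworked logger above can be driven from API client code roughly as follows. This is a minimal sketch rather than the actual CLI glue: the x265-extras.h header name, the readFrame() helper, and the exact signature of x265_csvlog_frame (inferred from its body in the hunk above) are assumptions.

    #include <cstdio>
    #include "x265.h"
    #include "x265-extras.h"                  // assumed home of the csvlog helpers

    extern bool readFrame(x265_picture& pic); // hypothetical input reader

    void encodeWithCsv(const x265_api& api, x265_param& param, x265_encoder* enc)
    {
        // level 2 enables the per-frame columns plus the detailed
        // performance statistics now appended at the end of each row
        FILE* csvfp = x265_csvlog_open(api, param, "stats.csv", 2);
        x265_picture in, out;
        api.picture_init(&param, &in);
        api.picture_init(&param, &out);
        while (readFrame(in))                 // encoder flush loop elided
        {
            x265_nal* nal = NULL;
            uint32_t nalCount = 0;
            if (api.encoder_encode(enc, &nal, &nalCount, &in, &out) > 0 && csvfp)
                x265_csvlog_frame(csvfp, param, out, 2);  // one CSV row per frame
        }
        if (csvfp)
            fclose(csvfp);
    }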
View file
x265_1.8.tar.gz/source/x265.cpp -> x265_1.9.tar.gz/source/x265.cpp
Changed
@@ -486,6 +486,7 @@ pic_org.forceqp = qp + 1; if (type == 'I') pic_org.sliceType = X265_TYPE_IDR; else if (type == 'i') pic_org.sliceType = X265_TYPE_I; + else if (type == 'K') pic_org.sliceType = param->bOpenGOP ? X265_TYPE_I : X265_TYPE_IDR; else if (type == 'P') pic_org.sliceType = X265_TYPE_P; else if (type == 'B') pic_org.sliceType = X265_TYPE_BREF; else if (type == 'b') pic_org.sliceType = X265_TYPE_B;
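For illustration, a qpfile exercising the new frametype might look like this (frame numbers and QPs are invented; the format is one "framenumber frametype QP" triple per line, with the QP optional):

    0 I 25
    120 K
    240 P 30

Per the hunk above, 'K' resolves to an IDR frame when GOPs are closed and to a plain I slice when bOpenGOP is set.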
View file
x265_1.8.tar.gz/source/x265.def.in -> x265_1.9.tar.gz/source/x265.def.in
Changed
@@ -22,3 +22,4 @@ x265_cleanup x265_api_get_${X265_BUILD} x265_api_query +x265_encoder_intra_refresh
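Besides the directly exported symbol, the new entry point is also reachable through the versioned dispatch struct, to which a matching function pointer is appended in the x265.h hunk below. A hedged sketch, given an already opened encoder handle:

    #include "x265.h"

    void requestRefresh(x265_encoder* enc)
    {
        const x265_api* api = x265_api_get(0);  // 0 selects the native bit depth
        if (api && api->encoder_intra_refresh)  // pointer added in this API revision
            api->encoder_intra_refresh(enc);
    }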
View file
x265_1.8.tar.gz/source/x265.h -> x265_1.9.tar.gz/source/x265.h
Changed
@@ -2,6 +2,7 @@ * Copyright (C) 2013 x265 project * * Authors: Steve Borho <steve@borho.org> + * Min Chen <chenm003@163.com> * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by @@ -91,13 +92,15 @@ /* Stores all analysis data for a single frame */ typedef struct x265_analysis_data { - void* interData; - void* intraData; + int64_t satdCost; uint32_t frameRecordSize; uint32_t poc; uint32_t sliceType; uint32_t numCUsInFrame; uint32_t numPartitions; + void* interData; + void* intraData; + int bScenecut; } x265_analysis_data; /* cu statistics */ @@ -132,6 +135,7 @@ double avgLumaDistortion; double avgChromaDistortion; double avgPsyEnergy; + double avgResEnergy; double avgLumaLevel; uint64_t bits; int encoderOrder; @@ -141,6 +145,8 @@ int list1POC[16]; uint16_t maxLumaLevel; char sliceType; + int bScenecut; + int frameLatency; x265_cu_stats cuStats; } x265_frame_stats; @@ -205,6 +211,13 @@ * this data structure */ x265_analysis_data analysisData; + /* An array of quantizer offsets to be applied to this image during encoding. + * These are added on top of the decisions made by rateControl. + * Adaptive quantization must be enabled to use this feature. These quantizer + * offsets should be given for each 16x16 block. Behavior if quant + * offsets differ between encoding passes is undefined. */ + float *quantOffsets; + /* Frame level statistics */ x265_frame_stats frameData; @@ -378,6 +391,8 @@ x265_sliceType_stats statsI; /* statistics of I slice */ x265_sliceType_stats statsP; /* statistics of P slice */ x265_sliceType_stats statsB; /* statistics of B slice */ + uint16_t maxCLL; /* maximum content light level */ + uint16_t maxFALL; /* maximum frame average light level */ } x265_stats; /* String values accepted by x265_param_parse() (and CLI) for various parameters */ @@ -604,7 +619,7 @@ /* Enables the emission of a user data SEI with the stream headers which * describes the encoder version, build info, and parameters. This is - * very helpful for debugging, but may interfere with regression tests. + * very helpful for debugging, but may interfere with regression tests. * Default enabled */ int bEmitInfoSEI; @@ -664,9 +679,9 @@ int bBPyramid; /* A value which is added to the cost estimate of B frames in the lookahead. - * It may be a positive value (making B frames appear more expensive, which - * causes the lookahead to chose more P frames) or negative, which makes the - * lookahead chose more B frames. Default is 0, there are no limits */ + * It may be a positive value (making B frames appear less expensive, which + * biases the lookahead to choose more B frames) or negative, which makes the + * lookahead choose more P frames. Default is 0, there are no limits */ int bFrameBias; /* The number of frames that must be queued in the lookahead before it may @@ -691,6 +706,11 @@ * should detect scene cuts. The default (40) is recommended. */ int scenecutThreshold; + /* Replace keyframes by using a column of intra blocks that move across the video + * from one side to the other, thereby "refreshing" the image. In effect, instead of a + * big keyframe, the keyframe is "spread" over many frames. */ + int bIntraRefresh; + /*== Coding Unit (CU) definitions ==*/ /* Maximum CU width and height in pixels. The size must be 64, 32, or 16. @@ -810,6 +830,9 @@ * 4 split CUs at the next lower CU depth. 
The two flags may be combined */
     uint32_t limitReferences;
 
+    /* Limit modes analyzed for each CU using cost metrics from the 4 sub-CUs */
+    uint32_t limitModes;
+
     /* ME search method (DIA, HEX, UMH, STAR, FULL). The search patterns
      * (methods) are sorted in increasing complexity, with diamond being the
      * simplest and fastest and full being the slowest. DIA, HEX, and UMH were
@@ -920,7 +943,7 @@
     /* Psycho-visual rate-distortion strength. Only has an effect in presets
      * which use RDO. It makes mode decision favor options which preserve the
      * energy of the source, at the cost of lost compression. The value must
-     * be between 0 and 2.0, 1.0 is typical. Default 0.3 */
+     * be between 0 and 5.0, 1.0 is typical. Default 2.0 */
     double psyRd;
 
     /* Strength of psycho-visual optimizations in quantization. Only has an
@@ -1038,7 +1061,7 @@
 
     /* Enable slow and a more detailed first pass encode in multi pass rate control */
     int bEnableSlowFirstPass;
-    
+
     /* rate-control overrides */
     int zoneCount;
     x265_zone* zones;
@@ -1051,14 +1074,14 @@
      * values will affect all encoders in the same process */
     const char* lambdaFileName;
 
-    /* Enable stricter conditions to check bitrate deviations in CBR mode. May compromise 
+    /* Enable stricter conditions to check bitrate deviations in CBR mode. May compromise
      * quality to maintain bitrate adherence */
     int bStrictCbr;
 
-    /* Enable adaptive quantization at CU granularity. This parameter specifies 
-     * the minimum CU size at which QP can be adjusted, i.e. Quantization Group 
-     * (QG) size. Allowed values are 64, 32, 16 provided it falls within the 
-     * inclusuve range [maxCUSize, minCUSize]. Experimental, default: maxCUSize*/
+    /* Enable adaptive quantization at CU granularity. This parameter specifies
+     * the minimum CU size at which QP can be adjusted, i.e. Quantization Group
+     * (QG) size. Allowed values are 64, 32, 16 provided it falls within the
+     * inclusive range [maxCUSize, minCUSize]. Experimental, default: maxCUSize */
     uint32_t qgSize;
 } rc;
@@ -1165,12 +1188,27 @@
      * max,min luminance values. */
     const char* masteringDisplayColorVolume;
 
-    /* Content light level info SEI, specified as a string which is parsed when
-     * the stream header SEI are emitted. The string format is "%hu,%hu" where
-     * %hu are unsigned 16bit integers. The first value is the max content light
-     * level (or 0 if no maximum is indicated), the second value is the maximum
-     * picture average light level (or 0). */
-    const char* contentLightLevelInfo;
+    /* Maximum Content light level (MaxCLL), specified as an integer that indicates the
+     * maximum pixel intensity level in units of 1 candela per square metre of the
+     * bitstream. x265 will also calculate MaxCLL programmatically from the input
+     * pixel values and set it in the Content light level info SEI */
+    uint16_t maxCLL;
+
+    /* Maximum Frame Average Light Level (MaxFALL), specified as an integer that indicates
+     * the maximum frame average intensity level in units of 1 candela per square
+     * metre of the bitstream. x265 will also calculate MaxFALL programmatically
+     * from the input pixel values and set it in the Content light level info SEI */
+    uint16_t maxFALL;
+
+    /* Minimum luma level of input source picture, specified as an integer which
+     * would automatically increase any luma values below the specified --min-luma
+     * value to that value. */
+    uint16_t minLuma;
+
+    /* Maximum luma level of input source picture, specified as an integer which
+     * would automatically decrease any luma values above the specified --max-luma
+     * value to that value. 
*/ + uint16_t maxLuma; } x265_param; @@ -1211,7 +1249,7 @@ "main422-10", "main422-10-intra", "main444-10", "main444-10-intra", - "main12", "main12-intra", /* Highly Experimental */ + "main12", "main12-intra", "main422-12", "main422-12-intra", "main444-12", "main444-12-intra", @@ -1347,6 +1385,22 @@ * close an encoder handler */ void x265_encoder_close(x265_encoder *); +/* x265_encoder_intra_refresh: + * If an intra refresh is not in progress, begin one with the next P-frame. + * If an intra refresh is in progress, begin one as soon as the current one finishes. + * Requires bIntraRefresh to be set. + * + * Useful for interactive streaming where the client can tell the server that packet loss has + * occurred. In this case, keyint can be set to an extremely high value so that intra refreshes + * occur only when calling x265_encoder_intra_refresh. + * + * In multi-pass encoding, if x265_encoder_intra_refresh is called differently in each pass, + * behavior is undefined. + * + * Should not be called during an x265_encoder_encode. */ + +int x265_encoder_intra_refresh(x265_encoder *); + /* x265_cleanup: * release library static allocations, reset configured CTU size */ void x265_cleanup(void); @@ -1394,6 +1448,7 @@ void (*cleanup)(void); int sizeof_frame_stats; /* sizeof(x265_frame_stats) */ + int (*encoder_intra_refresh)(x265_encoder*); /* add new pointers to the end, or increment X265_MAJOR_VERSION */ } x265_api;
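Per the comments above, quant offsets are an API-only feature, and on-demand refresh requests go through the new x265_encoder_intra_refresh() call. A minimal sketch of both follows; the helper names are invented, adaptive quantization is assumed to be enabled for quantOffsets to take effect, and bIntraRefresh must be set before a refresh can be requested:

    #include <cstdlib>
    #include "x265.h"

    // One float QP offset per 16x16 block, added on top of rate control's
    // decisions; negative offsets spend more bits on those blocks.
    static void attachQuantOffsets(x265_picture& pic, const x265_param& param)
    {
        int bw = (param.sourceWidth + 15) / 16;
        int bh = (param.sourceHeight + 15) / 16;
        float* off = (float*)calloc((size_t)bw * bh, sizeof(float));
        if (!off)
            return;
        for (int y = 0; y < bh; y++)
            for (int x = 0; x < bw / 2; x++)
                off[y * bw + x] = -2.0f;   // favor the left half of the frame
        pic.quantOffsets = off;            // caller frees after the frame is encoded
    }

    // E.g. when a streaming client reports packet loss; per the header comment,
    // this must not be called while x265_encoder_encode() is in flight.
    static void onClientLoss(x265_encoder* enc)
    {
        x265_encoder_intra_refresh(enc);
    }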
View file
x265_1.8.tar.gz/source/x265cli.h -> x265_1.9.tar.gz/source/x265cli.h
Changed
@@ -116,6 +116,7 @@ { "min-keyint", required_argument, NULL, 'i' }, { "scenecut", required_argument, NULL, 0 }, { "no-scenecut", no_argument, NULL, 0 }, + { "intra-refresh", no_argument, NULL, 0 }, { "rc-lookahead", required_argument, NULL, 0 }, { "lookahead-slices", required_argument, NULL, 0 }, { "bframes", required_argument, NULL, 'b' }, @@ -126,6 +127,8 @@ { "b-pyramid", no_argument, NULL, 0 }, { "ref", required_argument, NULL, 0 }, { "limit-refs", required_argument, NULL, 0 }, + { "no-limit-modes", no_argument, NULL, 0 }, + { "limit-modes", no_argument, NULL, 0 }, { "no-weightp", no_argument, NULL, 0 }, { "weightp", no_argument, NULL, 'w' }, { "no-weightb", no_argument, NULL, 0 }, @@ -192,6 +195,8 @@ { "crop-rect", required_argument, NULL, 0 }, /* DEPRECATED */ { "master-display", required_argument, NULL, 0 }, { "max-cll", required_argument, NULL, 0 }, + { "min-luma", required_argument, NULL, 0 }, + { "max-luma", required_argument, NULL, 0 }, { "no-dither", no_argument, NULL, 0 }, { "dither", no_argument, NULL, 0 }, { "no-repeat-headers", no_argument, NULL, 0 }, @@ -251,14 +256,18 @@ H0(" --log-level <string> Logging level: none error warning info debug full. Default %s\n", X265_NS::logLevelNames[param->logLevel + 1]); H0(" --no-progress Disable CLI progress reports\n"); H0(" --csv <filename> Comma separated log file, if csv-log-level > 0 frame level statistics, else one line per run\n"); - H0(" --csv-log-level Level of csv logging, if csv-log-level > 0 frame level statistics, else one line per run: 0-2\n"); + H0(" --csv-log-level <integer> Level of csv logging, if csv-log-level > 0 frame level statistics, else one line per run: 0-2\n"); H0("\nInput Options:\n"); H0(" --input <filename> Raw YUV or Y4M input file name. `-` for stdin\n"); H1(" --y4m Force parsing of input stream as YUV4MPEG2 regardless of file extension\n"); H0(" --fps <float|rational> Source frame rate (float or num/denom), auto-detected if Y4M\n"); H0(" --input-res WxH Source picture size [w x h], auto-detected if Y4M\n"); H1(" --input-depth <integer> Bit-depth of input file. Default 8\n"); - H1(" --input-csp <string> Source color space: i420, i444 or i422, auto-detected if Y4M. Default: i420\n"); + H1(" --input-csp <string> Chroma subsampling, auto-detected if Y4M\n"); + H1(" 0 - i400 (4:0:0 monochrome)\n"); + H1(" 1 - i420 (4:2:0 default)\n"); + H1(" 2 - i422 (4:2:2)\n"); + H1(" 3 - i444 (4:4:4)\n"); H0("-f/--frames <integer> Maximum number of frames to encode. Default all\n"); H0(" --seek <integer> First frame to encode\n"); H1(" --[no-]interlace <bff|tff> Indicate input pictures are interlace fields in temporal order. Default progressive\n"); @@ -292,7 +301,7 @@ H0(" --tu-inter-depth <integer> Max TU recursive depth for inter CUs. Default %d\n", param->tuQTMaxInterDepth); H0("\nAnalysis:\n"); H0(" --rd <0..6> Level of RDO in mode decision 0:least....6:full RDO. Default %d\n", param->rdLevel); - H0(" --[no-]psy-rd <0..2.0> Strength of psycho-visual rate distortion optimization, 0 to disable. Default %.1f\n", param->psyRd); + H0(" --[no-]psy-rd <0..5.0> Strength of psycho-visual rate distortion optimization, 0 to disable. Default %.1f\n", param->psyRd); H0(" --[no-]rdoq-level <0|1|2> Level of RDO in quantization 0:none, 1:levels, 2:levels & coding groups. Default %d\n", param->rdoqLevel); H0(" --[no-]psy-rdoq <0..50.0> Strength of psycho-visual optimization in RDO quantization, 0 to disable. Default %.1f\n", param->psyRdoq); H0(" --[no-]early-skip Enable early SKIP detection. 
Default %s\n", OPT(param->bEnableEarlySkip)); @@ -308,12 +317,13 @@ H0("\nTemporal / motion search options:\n"); H0(" --max-merge <1..5> Maximum number of merge candidates. Default %d\n", param->maxNumMergeCand); H0(" --ref <integer> max number of L0 references to be allowed (1 .. 16) Default %d\n", param->maxNumReferences); - H0(" --limit-refs <0|1|2|3> limit references per depth (1) or CU (2) or both (3). Default %d\n", param->limitReferences); + H0(" --limit-refs <0|1|2|3> Limit references per depth (1) or CU (2) or both (3). Default %d\n", param->limitReferences); H0(" --me <string> Motion search method dia hex umh star full. Default %d\n", param->searchMethod); H0("-m/--subme <integer> Amount of subpel refinement to perform (0:least .. 7:most). Default %d \n", param->subpelRefine); H0(" --merange <integer> Motion search range. Default %d\n", param->searchRange); H0(" --[no-]rect Enable rectangular motion partitions Nx2N and 2NxN. Default %s\n", OPT(param->bEnableRectInter)); H0(" --[no-]amp Enable asymmetric motion partitions, requires --rect. Default %s\n", OPT(param->bEnableAMP)); + H0(" --[no-]limit-modes Limit rectangular and asymmetric motion predictions. Default %d\n", param->limitModes); H1(" --[no-]temporal-mvp Enable temporal MV predictors. Default %s\n", OPT(param->bEnableTemporalMvp)); H0("\nSpatial / intra options:\n"); H0(" --[no-]strong-intra-smoothing Enable strong intra smoothing for 32x32 blocks. Default %s\n", OPT(param->bEnableStrongIntraSmoothing)); @@ -327,6 +337,7 @@ H0("-i/--min-keyint <integer> Scenecuts closer together than this are coded as I, not IDR. Default: auto\n"); H0(" --no-scenecut Disable adaptive I-frame decision\n"); H0(" --scenecut <integer> How aggressively to insert extra I-frames. Default %d\n", param->scenecutThreshold); + H0(" --intra-refresh Use Periodic Intra Refresh instead of IDR frames\n"); H0(" --rc-lookahead <integer> Number of frames for frame-type lookahead (determines encoder latency) Default %d\n", param->lookaheadDepth); H1(" --lookahead-slices <0..16> Number of slices to use per lookahead cost estimate. Default %d\n", param->lookaheadSlices); H0(" --bframes <integer> Maximum number of consecutive b-frames (now it only enables B GOP structure) Default %d\n", param->bframes); @@ -335,7 +346,7 @@ H0(" --[no-]b-pyramid Use B-frames as references. Default %s\n", OPT(param->bBPyramid)); H1(" --qpfile <string> Force frametypes and QPs for some or all frames\n"); H1(" Format of each line: framenumber frametype QP\n"); - H1(" QP is optional (none lets x265 choose). Frametypes: I,i,P,B,b.\n"); + H1(" QP is optional (none lets x265 choose). Frametypes: I,i,K,P,B,b.\n"); H1(" QPs are restricted by qpmin/qpmax.\n"); H0("\nRate control, Adaptive Quantization:\n"); H0(" --bitrate <integer> Target bitrate (kbps) for ABR (implied). Default %d\n", param->rc.bitrate); @@ -403,6 +414,8 @@ H0(" --master-display <string> SMPTE ST 2086 master display color volume info SEI (HDR)\n"); H0(" format: G(x,y)B(x,y)R(x,y)WP(x,y)L(max,min)\n"); H0(" --max-cll <string> Emit content light level info SEI as \"cll,fall\" (HDR)\n"); + H0(" --min-luma <integer> Minimum luma plane value of input source picture\n"); + H0(" --max-luma <integer> Maximum luma plane value of input source picture\n"); H0("\nBitstream options:\n"); H0(" --[no-]repeat-headers Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders)); H0(" --[no-]info Emit SEI identifying encoder and parameters. Default %s\n", OPT(param->bEmitInfoSEI));