x265
Changes of Revision 30
x265.changes
Changed
@@ -1,4 +1,66 @@ ------------------------------------------------------------------- +Tue Oct 9 20:03:53 UTC 2018 - aloisio@gmx.com + +- Update to version 2.9 + New features: + * Support for chunked encoding + + :option:`--chunk-start and --chunk-end` + + Frames preceding first frame of chunk in display order + will be encoded, however, they will be discarded in the + bitstream. + + Frames following last frame of the chunk in display order + will be used in taking lookahead decisions, but, they will + not be encoded. + + This feature can be enabled only in closed GOP structures. + Default disabled. + * Support for HDR10+ version 1 SEI messages. + Encoder enhancements: + * Create API function for allocating and freeing + x265_analysis_data. + * CEA 608/708 support: Read SEI messages from text file and + encode it using userSEI message. + Bug fixes: + * Disable noise reduction when vbv is enabled. + * Support minLuma and maxLuma values changed by the + commandline. + version 2.8 + New features: + * :option:`--asm avx512` used to enable AVX-512 in x265. + Default disabled. + + For 4K main10 high-quality encoding, we are seeing good + gains; for other resolutions and presets, we don't + recommend using this setting for now. + * :option:`--dynamic-refine` dynamically switches between + different inter refine levels. Default disabled. + + It is recommended to use :option:`--refine-intra 4' with + dynamic refinement for a better trade-off between encode + efficiency and performance than using static refinement. + * :option:`--single-sei` + + Encode SEI messages in a single NAL unit instead of + multiple NAL units. Default disabled. + * :option:`--max-ausize-factor` controls the maximum AU size + defined in HEVC specification. + + It represents the percentage of maximum AU size used. + Default is 1. + * VMAF (Video Multi-Method Assessment Fusion) + + Added VMAF support for objective quality measurement of a + video sequence. + + Enable cmake option ENABLE_LIBVMAF to report per frame and + aggregate VMAF score. The frame level VMAF score does not + include temporal scores. + + This is supported only on linux for now. + Encoder enhancements: + * Introduced refine-intra level 4 to improve quality. + * Support for HLG-graded content and pic_struct in SEI message. + Bug Fixes: + * Fix 32 bit build error (using CMAKE GUI) in Linux. + * Fix 32 bit build error for asm primitives. + * Fix build error on mac OS. + * Fix VBV Lookahead in analysis load to achieve target bitrate. + +- Added x265-fix_enable512.patch + +------------------------------------------------------------------- Fri May 4 22:21:57 UTC 2018 - zaitor@opensuse.org - Build with nasm >= 2.13 for openSUSE Leap 42.3 and SLE-12, since
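For orientation, the chunked encoding feature introduced in the 2.9 entry above is driven entirely by the two new options. A minimal sketch of configuring one chunk through the public API follows; the frame numbers, keyint value and preset are illustrative assumptions, only the option names come from the changelog::

    #include <x265.h>

    /* Sketch: encode frames 301..600 of a longer source as one chunk.
     * Chunking requires a closed GOP structure (see the changelog entry). */
    x265_param *p = x265_param_alloc();
    x265_param_default_preset(p, "medium", NULL);
    x265_param_parse(p, "open-gop", "0");          /* closed GOPs only */
    x265_param_parse(p, "keyint", "60");
    x265_param_parse(p, "chunk-start", "301");     /* first frame of the chunk */
    x265_param_parse(p, "chunk-end", "600");       /* last frame kept in the bitstream */
    x265_encoder *enc = x265_encoder_open(p);
    /* ... feed pictures with x265_encoder_encode(), drain, then ... */
    x265_encoder_close(enc);
    x265_param_free(p);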
x265.spec
Changed
@@ -1,10 +1,10 @@ # based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/ Name: x265 -%define soname 151 +%define soname 165 %define libname lib%{name} %define libsoname %{libname}-%{soname} -Version: 2.7 +Version: 2.9 Release: 0 License: GPL-2.0+ Summary: A free h265/HEVC encoder - encoder binary @@ -13,17 +13,15 @@ Source0: https://bitbucket.org/multicoreware/x265/downloads/%{name}_%{version}.tar.gz Patch0: arm.patch Patch1: x265.pkgconfig.patch +Patch2: x265-fix_enable512.patch BuildRequires: gcc BuildRequires: gcc-c++ BuildRequires: cmake >= 2.8.8 BuildRequires: pkg-config BuildRequires: nasm >= 2.13 -%if 0%{?suse_version} > 1310 %ifarch x86_64 BuildRequires: libnuma-devel >= 2.0.9 %endif -%endif -BuildRoot: %{_tmppath}/%{name}-%{version}-build %description x265 is a free library for encoding next-generation H265/HEVC video @@ -47,18 +45,19 @@ %description -n %{libname}-devel x265 is a free library for encoding next-generation H265/HEVC video -streams. +streams. %prep %setup -q -n %{name}_%{version} %patch0 -p1 %patch1 -p1 +%patch2 -p1 sed -i -e "s/0.0/%{soname}.0/g" source/cmake/version.cmake %build -%if 0%{?suse_version} < 1330 +%if 0%{?suse_version} < 1500 cd source %else %define __builddir ./source/build @@ -68,7 +67,7 @@ make %{?_smp_mflags} %install -%if 0%{?suse_version} < 1330 +%if 0%{?suse_version} < 1500 cd source %endif %cmake_install @@ -79,15 +78,14 @@ %postun -n %{libsoname} -p /sbin/ldconfig %files -n %{libsoname} -%defattr(0644,root,root) %{_libdir}/%{libname}.so.%{soname}* -%files -%defattr(0755,root,root) +%files %{_bindir}/%{name} %files -n %{libname}-devel -%defattr(0644,root,root) +%license COPYING +%doc readme.rst %{_includedir}/%{name}.h %{_includedir}/%{name}_config.h %{_libdir}/pkgconfig/%{name}.pc
x265-fix_enable512.patch
Added
@@ -0,0 +1,25 @@ +--- a/source/common/cpu.cpp ++++ b/source/common/cpu.cpp +@@ -110,6 +110,11 @@ const cpu_name_t cpu_names[] = + { "", 0 }, + }; + ++bool detect512() ++{ ++ return(enable512); ++} ++ + #if X265_ARCH_X86 + + extern "C" { +@@ -123,10 +128,6 @@ uint64_t PFX(cpu_xgetbv)(int xcr); + #pragma warning(disable: 4309) // truncation of constant value + #endif + +-bool detect512() +-{ +- return(enable512); +-} + uint32_t cpu_detect(bool benableavx512 ) + { +
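The patch only relocates detect512(): it is now defined immediately after cpu_names[], before the #if X265_ARCH_X86 block from which it is removed further down. A schematic of the resulting layout is shown below; the rationale (an undefined reference to detect512() on builds where the x86-only block is compiled out) is an assumption, not stated in the patch::

    // source/common/cpu.cpp with x265-fix_enable512.patch applied (schematic)
    static bool enable512 = false;                 // added by the 2.9 sources, see the cpu.cpp diff below
    const cpu_name_t cpu_names[] = { /* ... */ };

    bool detect512()                               // now compiled for every architecture
    {
        return(enable512);
    }

    #if X265_ARCH_X86
    /* cpuid/xgetbv helpers and cpu_detect() remain x86-only */
    #endif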
x265_2.7.tar.gz/.hg_archival.txt -> x265_2.9.tar.gz/.hg_archival.txt
Changed
@@ -1,4 +1,4 @@ repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf -node: e41a9bf2bac4a7af2bec2bbadf91e63752d320ef +node: f9681d731f2e56c2ca185cec10daece5939bee07 branch: stable -tag: 2.7 +tag: 2.9
x265_2.7.tar.gz/.hgtags -> x265_2.9.tar.gz/.hgtags
Changed
@@ -25,3 +25,5 @@ e7a4dd48293b7956d4a20df257d23904cc78e376 2.4 64b2d0bf45a52511e57a6b7299160b961ca3d51c 2.5 0e9ea76945c89962cd46cee6537586e2054b2935 2.6 +e41a9bf2bac4a7af2bec2bbadf91e63752d320ef 2.7 +a158a3a029663133455268e2a63ae6b0af2df720 2.8
x265_2.7.tar.gz/doc/reST/api.rst -> x265_2.9.tar.gz/doc/reST/api.rst
Changed
@@ -223,6 +223,18 @@ * returns negative on error, 0 access unit were output.*/ int x265_set_analysis_data(x265_encoder *encoder, x265_analysis_data *analysis_data, int poc, uint32_t cuBytes); +**x265_alloc_analysis_data()** may be used to allocate memory for the x265_analysis_data:: + + /* x265_alloc_analysis_data: + * Allocate memory for the x265_analysis_data object's internal structures. */ + void x265_alloc_analysis_data(x265_param *param, x265_analysis_data* analysis); + +**x265_free_analysis_data()** may be used to free memory for the x265_analysis_data:: + + /* x265_free_analysis_data: + * Free the allocated memory for x265_analysis_data object's internal structures. */ + void x265_free_analysis_data(x265_param *param, x265_analysis_data* analysis); + Pictures ======== @@ -398,7 +410,30 @@ * release library static allocations, reset configured CTU size */ void x265_cleanup(void); +VMAF (Video Multi-Method Assessment Fusion) +========================================== + +If you set the ENABLE_LIBVMAF cmake option to ON, then x265 will report per frame +and aggregate VMAF score for the given input and dump the scores in csv file. +The user also need to specify the :option:`--recon` in command line to get the VMAF scores. + + /* x265_calculate_vmafScore: + * returns VMAF score for the input video. + * This api must be called only after encoding was done. */ + double x265_calculate_vmafscore(x265_param*, x265_vmaf_data*); + + /* x265_calculate_vmaf_framelevelscore: + * returns VMAF score for each frame in a given input video. The frame level VMAF score does not include temporal scores. */ + double x265_calculate_vmaf_framelevelscore(x265_vmaf_framedata*); + +.. Note:: + When setting ENABLE_LIBVMAF cmake option to ON, it is recommended to + also set ENABLE_SHARED to OFF to prevent build problems. + We only need the static library from these builds. + + Binaries build with windows will not have VMAF support. + Multi-library Interface =======================
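Combined with x265_set_analysis_data() documented just above, the new helpers give a simple alloc/use/free pattern. The sketch below only illustrates the pairing of the calls; the reuse options that must be configured on the param and the per-frame filling of the structure are elided::

    #include <cstring>
    #include <x265.h>

    x265_param *p = x265_param_alloc();
    x265_param_default(p);
    /* ... set the analysis save/load options required by the application ... */

    x265_analysis_data analysis;
    memset(&analysis, 0, sizeof(analysis));
    x265_alloc_analysis_data(p, &analysis);   /* new in 2.9: allocates the internal structures */

    /* ... fill the structure and hand it to the encoder per frame via
     *     x265_set_analysis_data(encoder, &analysis, poc, cuBytes) ... */

    x265_free_analysis_data(p, &analysis);    /* every alloc call needs a matching free */
    x265_param_free(p);

    /* For ENABLE_LIBVMAF builds, the aggregate score described in the VMAF section
     * above is queried after encoding with x265_calculate_vmafscore(param, vmafData). */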
x265_2.7.tar.gz/doc/reST/cli.rst -> x265_2.9.tar.gz/doc/reST/cli.rst
Changed
@@ -52,7 +52,7 @@ 2. unable to open encoder 3. unable to generate stream headers 4. encoder abort - + Logging/Statistic Options ========================= @@ -104,6 +104,8 @@ **BufferFill** Bits available for the next frame. Includes bits carried over from the current frame. + **BufferFillFinal** Buffer bits available after removing the frame out of CPB. + **Latency** Latency in terms of number of frames between when the frame was given in and when the frame is given out. @@ -183,11 +185,11 @@ .. option:: --csv-log-level <integer> - Controls the level of detail (and size) of --csv log files - - 0. summary **(default)** - 1. frame level logging - 2. frame level logging with performance statistics + Controls the level of detail (and size) of --csv log files + + 0. summary **(default)** + 1. frame level logging + 2. frame level logging with performance statistics .. option:: --ssim, --no-ssim @@ -254,7 +256,7 @@ "*" - same as default "none" - no thread pools are created, only frame parallelism possible "-" - same as "none" - "10" - allocate one pool, using up to 10 cores on node 0 + "10" - allocate one pool, using up to 10 cores on all available nodes "-,+" - allocate one pool, using all cores on node 1 "+,-,+" - allocate one pool, using only cores on nodes 0 and 2 "+,-,+,-" - allocate one pool, using only cores on nodes 0 and 2 @@ -535,6 +537,20 @@ **CLI ONLY** +.. option:: --chunk-start <integer> + + First frame of the chunk. Frames preceeding this in display order will + be encoded, however, they will be discarded in the bitstream. This + feature can be enabled only in closed GOP structures. + Default 0 (disabled). + +.. option:: --chunk-end <integer> + + Last frame of the chunk. Frames following this in display order will be + used in taking lookahead decisions, but, they will not be encoded. + This feature can be enabled only in closed GOP structures. + Default 0 (disabled). + Profile, Level, Tier ==================== @@ -646,9 +662,9 @@ encoding options, the encoder will attempt to modify/set the right encode specifications. If the encoder is unable to do so, this option will be turned OFF. Highly experimental. - + Default: disabled - + .. note:: :option:`--profile`, :option:`--level-idc`, and @@ -773,7 +789,7 @@ Default 3. .. option:: --limit-modes, --no-limit-modes - + When enabled, limit-modes will limit modes analyzed for each CU using cost metrics from the 4 sub-CUs. When multiple inter modes like :option:`--rect` and/or :option:`--amp` are enabled, this feature will use motion cost @@ -820,6 +836,11 @@ Default: enabled, disabled for :option:`--tune grain` +.. option:: --splitrd-skip, --no-splitrd-skip + + Enable skipping split RD analysis when sum of split CU rdCost larger than one + split CU rdCost for Intra CU. Default disabled. + .. option:: --fast-intra, --no-fast-intra Perform an initial scan of every fifth intra angular mode, then @@ -888,35 +909,36 @@ Note that --analysis-reuse-level must be paired with analysis-reuse-mode. 
- +--------------+------------------------------------------+ - | Level | Description | - +==============+==========================================+ - | 1 | Lookahead information | - +--------------+------------------------------------------+ - | 2 to 4 | Level 1 + intra/inter modes, ref's | - +--------------+------------------------------------------+ - | 5,6 and 9 | Level 2 + rect-amp | - +--------------+------------------------------------------+ - | 7 | Level 5 + AVC size CU refinement | - +--------------+------------------------------------------+ - | 8 | Level 5 + AVC size Full CU analysis-info | - +--------------+------------------------------------------+ - | 10 | Level 5 + Full CU analysis-info | - +--------------+------------------------------------------+ + +--------------+------------------------------------------+ + | Level | Description | + +==============+==========================================+ + | 1 | Lookahead information | + +--------------+------------------------------------------+ + | 2 to 4 | Level 1 + intra/inter modes, ref's | + +--------------+------------------------------------------+ + | 5 and 6 | Level 2 + rect-amp | + +--------------+------------------------------------------+ + | 7 | Level 5 + AVC size CU refinement | + +--------------+------------------------------------------+ + | 8 and 9 | Level 5 + AVC size Full CU analysis-info | + +--------------+------------------------------------------+ + | 10 | Level 5 + Full CU analysis-info | + +--------------+------------------------------------------+ .. option:: --refine-mv-type <string> - Reuse MV information received through API call. Currently receives information for AVC size and the accepted - string input is "avc". Default is disabled. + Reuse MV information received through API call. Currently receives information for AVC size and the accepted + string input is "avc". Default is disabled. .. option:: --scale-factor - Factor by which input video is scaled down for analysis save mode. - This option should be coupled with analysis-reuse-mode option, --analysis-reuse-level 10. - The ctu size of load should be double the size of save. Default 0. + Factor by which input video is scaled down for analysis save mode. + This option should be coupled with analysis-reuse-mode option, + --analysis-reuse-level 10. The ctu size of load can either be the + same as that of save or double the size of save. Default 0. + +.. option:: --refine-intra <0..4> -.. option:: --refine-intra <0..3> - Enables refinement of intra blocks in current encode. Level 0 - Forces both mode and depth from the save encode. @@ -931,8 +953,10 @@ Level 3 - Perform analysis of intra modes for depth reused from first encode. - Default 0. + Level 4 - Does not reuse any analysis information - redo analysis for the intra block. + Default 0. + .. option:: --refine-inter <0..3> Enables refinement of inter blocks in current encode. @@ -954,11 +978,17 @@ Default 0. +.. option:: --dynamic-refine, --no-dynamic-refine + + Dynamically switches :option:`--refine-inter` levels 0-3 based on the content and + the encoder settings. It is recommended to use :option:`--refine-intra` 4 with dynamic + refinement. Default disabled. + .. option:: --refine-mv Enables refinement of motion vector for scaled video. Evaluates the best motion vector by searching the surrounding eight integer and subpel pixel - positions. + positions. Options which affect the transform unit quad-tree, sometimes referred to as the residual quad-tree (RQT). 
@@ -1094,9 +1124,9 @@ quad-tree begins at the same depth of the coded tree unit, but if the maximum TU size is smaller than the CU size then transform QT begins at the depth of the max-tu-size. Default: 32. - + .. option:: --dynamic-rd <0..4> - + Increases the RD level at points where quality drops due to VBV rate control enforcement. The number of CUs for which the RD is reconfigured is determined based on the strength. Strength 1 gives the best FPS, @@ -1107,13 +1137,13 @@ .. option:: --ssim-rd, --no-ssim-rd - Enable/Disable SSIM RDO. SSIM is a better perceptual quality assessment - method as compared to MSE. SSIM based RDO calculation is based on residual - divisive normalization scheme. This normalization is consistent with the - luminance and contrast masking effect of Human Visual System. It is used - for mode selection during analysis of CTUs and can achieve significant - gain in terms of objective quality metrics SSIM and PSNR. It only has effect - on presets which use RDO-based mode decisions (:option:`--rd` 3 and above). + Enable/Disable SSIM RDO. SSIM is a better perceptual quality assessment + method as compared to MSE. SSIM based RDO calculation is based on residual + divisive normalization scheme. This normalization is consistent with the + luminance and contrast masking effect of Human Visual System. It is used + for mode selection during analysis of CTUs and can achieve significant + gain in terms of objective quality metrics SSIM and PSNR. It only has effect + on presets which use RDO-based mode decisions (:option:`--rd` 3 and above). Temporal / motion search options ================================ @@ -1216,8 +1246,8 @@ .. option:: --analyze-src-pics, --no-analyze-src-pics - Enalbe motion estimation with source frame pixels, in this mode, - motion estimation can be computed independently. Default disabled. + Enable motion estimation with source frame pixels, in this mode, + motion estimation can be computed independently. Default disabled. Spatial/intra options ===================== @@ -1362,12 +1392,12 @@ .. option:: --ctu-info <0, 1, 2, 4, 6> - This value enables receiving CTU information asynchronously and determine reaction to the CTU information. Default 0. - 1: force the partitions if CTU information is present. - 2: functionality of (1) and reduce qp if CTU information has changed. - 4: functionality of (1) and force Inter modes when CTU Information has changed, merge/skip otherwise. - This option should be enabled only when planning to invoke the API function x265_encoder_ctu_info to copy ctu-info asynchronously. - If enabled without calling the API function, the encoder will wait indefinitely. + This value enables receiving CTU information asynchronously and determine reaction to the CTU information. Default 0. + 1: force the partitions if CTU information is present. + 2: functionality of (1) and reduce qp if CTU information has changed. + 4: functionality of (1) and force Inter modes when CTU Information has changed, merge/skip otherwise. + This option should be enabled only when planning to invoke the API function x265_encoder_ctu_info to copy ctu-info asynchronously. + If enabled without calling the API function, the encoder will wait indefinitely. .. option:: --intra-refresh @@ -1387,16 +1417,17 @@ Default 20 **Range of values:** Between the maximum consecutive bframe count (:option:`--bframes`) and 250 + .. option:: --gop-lookahead <integer> - Number of frames for GOP boundary decision lookahead. 
If a scenecut frame is found - within this from the gop boundary set by `--keyint`, the GOP will be extented until such a point, - otherwise the GOP will be terminated as set by `--keyint`. Default 0. + Number of frames for GOP boundary decision lookahead. If a scenecut frame is found + within this from the gop boundary set by `--keyint`, the GOP will be extented until such a point, + otherwise the GOP will be terminated as set by `--keyint`. Default 0. - **Range of values:** Between 0 and (`--rc-lookahead` - mini-GOP length) + **Range of values:** Between 0 and (`--rc-lookahead` - mini-GOP length) - It is recommended to have `--gop-lookahaed` less than `--min-keyint` as scenecuts beyond - `--min-keyint` are already being coded as keyframes. + It is recommended to have `--gop-lookahaed` less than `--min-keyint` as scenecuts beyond + `--min-keyint` are already being coded as keyframes. .. option:: --lookahead-slices <0..16> @@ -1412,30 +1443,30 @@ on systems with many threads. The encoder may internally lower the number of slices or disable - slicing to ensure each slice codes at least 10 16x16 rows of lowres - blocks to minimize the impact on quality. For example, for 720p and - 1080p videos, the number of slices is capped to 4 and 6, respectively. - For resolutions lesser than 720p, slicing is auto-disabled. - - If slices are used in lookahead, they are logged in the list of tools - as *lslices* + slicing to ensure each slice codes at least 10 16x16 rows of lowres + blocks to minimize the impact on quality. For example, for 720p and + 1080p videos, the number of slices is capped to 4 and 6, respectively. + For resolutions lesser than 720p, slicing is auto-disabled. + + If slices are used in lookahead, they are logged in the list of tools + as *lslices* **Values:** 0 - disabled. 1 is the same as 0. Max 16. - Default: 8 for ultrafast, superfast, faster, fast, medium - 4 for slow, slower - disabled for veryslow, slower - + Default: 8 for ultrafast, superfast, faster, fast, medium + 4 for slow, slower + disabled for veryslow, slower + .. option:: --lookahead-threads <integer> - Use multiple worker threads dedicated to doing only lookahead instead of sharing - the worker threads with frame Encoders. A dedicated lookahead threadpool is created with the - specified number of worker threads. This can range from 0 upto half the - hardware threads available for encoding. Using too many threads for lookahead can starve - resources for frame Encoder and can harm performance. Default is 0 - disabled, Lookahead + Use multiple worker threads dedicated to doing only lookahead instead of sharing + the worker threads with frame Encoders. A dedicated lookahead threadpool is created with the + specified number of worker threads. This can range from 0 upto half the + hardware threads available for encoding. Using too many threads for lookahead can starve + resources for frame Encoder and can harm performance. Default is 0 - disabled, Lookahead shares worker threads with other FrameEncoders . **Values:** 0 - disabled(default). Max - Half of available hardware threads. - + .. option:: --b-adapt <integer> Set the level of effort in determining B frame placement. @@ -1466,11 +1497,11 @@ .. option:: --b-pyramid, --no-b-pyramid Use B-frames as references, when possible. Default enabled - + .. option:: --force-flush <integer> Force the encoder to flush frames. Default is 0. - + Values: 0 - flush the encoder only when all the input pictures are over. 1 - flush all the frames even when the input is not over. 
@@ -1502,7 +1533,7 @@ any given frame (ensuring a max QP). This is dangerous when CRF is used in combination with VBV as it may result in buffer underruns. Default disabled - + .. option:: --crf-min <0..51.0> Specify an lower limit to the rate factor which may be assigned to @@ -1541,7 +1572,7 @@ Default 0.9 **Range of values:** fractional: 0 - 1.0, or kbits: 2 .. bufsize - + .. option:: --vbv-end <float> Final buffer emptiness. The portion of the decode buffer that must be @@ -1553,7 +1584,7 @@ can specify the starting and ending state of the VBV buffer so that VBV compliance can be maintained when chunks are independently encoded and stitched together. - + .. option:: --vbv-end-fr-adj <float> Frame from which qp has to be adjusted to achieve final decode buffer @@ -1671,31 +1702,31 @@ .. option:: --multi-pass-opt-analysis, --no-multi-pass-opt-analysis - Enable/Disable multipass analysis refinement along with multipass ratecontrol. Based on - the information stored in pass 1, in subsequent passes analysis data is refined - and also redundant steps are skipped. - In pass 1 analysis information like motion vector, depth, reference and prediction - modes of the final best CTU partition is stored for each CTU. - Multipass analysis refinement cannot be enabled when 'analysis-save/analysis-load' option - is enabled and both will be disabled when enabled together. This feature requires 'pmode/pme' - to be disabled and hence pmode/pme will be disabled when enabled at the same time. + Enable/Disable multipass analysis refinement along with multipass ratecontrol. Based on + the information stored in pass 1, in subsequent passes analysis data is refined + and also redundant steps are skipped. + In pass 1 analysis information like motion vector, depth, reference and prediction + modes of the final best CTU partition is stored for each CTU. + Multipass analysis refinement cannot be enabled when 'analysis-save/analysis-load' option + is enabled and both will be disabled when enabled together. This feature requires 'pmode/pme' + to be disabled and hence pmode/pme will be disabled when enabled at the same time. - Default: disabled. + Default: disabled. .. option:: --multi-pass-opt-distortion, --no-multi-pass-opt-distortion - Enable/Disable multipass refinement of qp based on distortion data along with multipass - ratecontrol. In pass 1 distortion of best CTU partition is stored. CTUs with high - distortion get lower(negative)qp offsets and vice-versa for low distortion CTUs in pass 2. - This helps to improve the subjective quality. - Multipass refinement of qp cannot be enabled when 'analysis-save/analysis-load' option - is enabled and both will be disabled when enabled together. 'multi-pass-opt-distortion' - requires 'pmode/pme' to be disabled and hence pmode/pme will be disabled when enabled along with it. + Enable/Disable multipass refinement of qp based on distortion data along with multipass + ratecontrol. In pass 1 distortion of best CTU partition is stored. CTUs with high + distortion get lower(negative)qp offsets and vice-versa for low distortion CTUs in pass 2. + This helps to improve the subjective quality. + Multipass refinement of qp cannot be enabled when 'analysis-save/analysis-load' option + is enabled and both will be disabled when enabled together. 'multi-pass-opt-distortion' + requires 'pmode/pme' to be disabled and hence pmode/pme will be disabled when enabled along with it. - Default: disabled. + Default: disabled. .. 
option:: --strict-cbr, --no-strict-cbr - + Enables stricter conditions to control bitrate deviance from the target bitrate in ABR mode. Bit rate adherence is prioritised over quality. Rate tolerance is reduced to 50%. Default disabled. @@ -1708,7 +1739,7 @@ encoded frames to control QP. strict-cbr allows the encoder to be more aggressive in hitting the target bitrate even for short segment videos. - + .. option:: --cbqpoffs <integer> Offset of Cb chroma QP from the luma QP selected by rate control. @@ -1741,14 +1772,13 @@ qComp sets the quantizer curve compression factor. It weights the frame quantizer based on the complexity of residual (measured by - lookahead). Default value is 0.6. Increasing it to 1 will - effectively generate CQP + lookahead). It's value must be between 0.5 and 1.0. Default value is + 0.6. Increasing it to 1.0 will effectively generate CQP. .. option:: --qpstep <integer> - The maximum single adjustment in QP allowed to rate control. Default - 4 - + The maximum single adjustment in QP allowed to rate control. Default 4 + .. option:: --qpmin <integer> sets a hard lower limit on QP allowed to ratecontrol. Default 0 @@ -1756,21 +1786,21 @@ .. option:: --qpmax <integer> sets a hard upper limit on QP allowed to ratecontrol. Default 69 - + .. option:: --rc-grain, --no-rc-grain - Enables a specialised ratecontrol algorithm for film grain content. This - parameter strictly minimises QP fluctuations within and across frames - and removes pulsing of grain. Default disabled. - Enabled when :option:'--tune' grain is applied. It is highly recommended - that this option is used through the tune grain feature where a combination - of param options are used to improve visual quality. - + Enables a specialised ratecontrol algorithm for film grain content. This + parameter strictly minimises QP fluctuations within and across frames + and removes pulsing of grain. Default disabled. + Enabled when :option:'--tune' grain is applied. It is highly recommended + that this option is used through the tune grain feature where a combination + of param options are used to improve visual quality. + .. option:: --const-vbv, --no-const-vbv - Enables VBV algorithm to be consistent across runs. Default disabled. - Enabled when :option:'--tune' grain is applied. - + Enables VBV algorithm to be consistent across runs. Default disabled. + Enabled when :option:'--tune' grain is applied. + .. option:: --qblur <float> Temporally blur quants. Default 0.5 @@ -1831,17 +1861,18 @@ HEVC specifies a default set of scaling lists which may be enabled without requiring them to be signaled in the SPS. Those scaling lists can be enabled via :option:`--scaling-list` *default*. - + All other strings indicate a filename containing custom scaling lists in the HM format. The encode will abort if the file is not - parsed correctly. Custom lists must be signaled in the SPS + parsed correctly. Custom lists must be signaled in the SPS. A sample + scaling list file is available in `the downloads page <https://bitbucket.org/multicoreware/x265/downloads/reference_scalinglist.txt>`_ .. option:: --lambda-file <filename> Specify a text file containing values for x265_lambda_tab and x265_lambda2_tab. Each table requires MAX_MAX_QP+1 (70) float values. - + The text file syntax is simple. Comma is considered to be white-space. All white-space is ignored. Lines must be less than 2k bytes in length. Content following hash (#) characters are ignored. @@ -1856,6 +1887,11 @@ vectors and splits) and less on residual. 
This feature is intended for experimentation. +.. option:: --max-ausize-factor <float> + + It controls the maximum AU size defined in specification. It represents + the percentage of maximum AU size used. Default is 1. Range is 0.5 to 1. + Loop filters ============ @@ -1975,9 +2011,9 @@ 7. smpte240m 8. film 9. bt2020 - 10. smpte428 - 11. smpte431 - 12. smpte432 + 10. smpte428 + 11. smpte431 + 12. smpte432 .. option:: --transfer <integer|string> @@ -2018,10 +2054,10 @@ 8. YCgCo 9. bt2020nc 10. bt2020c - 11. smpte2085 - 12. chroma-derived-nc - 13. chroma-derived-c - 14. ictcp + 11. smpte2085 + 12. chroma-derived-nc + 13. chroma-derived-c + 14. ictcp .. option:: --chromaloc <0..5> @@ -2075,13 +2111,13 @@ automatically when :option:`--master-display` or :option:`--max-cll` is specified. Useful when there is a desire to signal 0 values for max-cll and max-fall. Default disabled. - + .. option:: --hdr-opt, --no-hdr-opt Add luma and chroma offsets for HDR/WCG content. Input video should be 10 bit 4:2:0. Applicable for HDR content. It is recommended that AQ-mode be enabled along with this feature. Default disabled. - + .. option:: --dhdr10-info <filename> Inserts tone mapping information as an SEI message. It takes as input, @@ -2107,6 +2143,24 @@ Maximum luma value allowed for input pictures. Any values above max-luma are clipped. No default. +.. option:: --nalu-file <filename> + + Text file containing userSEI in POC order : <POC><space><PREFIX><space><NAL UNIT TYPE>/<SEI TYPE><space><SEI Payload> + Parse the input file specified and inserts SEI messages into the bitstream. + Currently, we support only PREFIX SEI messages. This is an "application-only" feature. + +.. option:: --atc-sei <integer> + + Emit the alternative transfer characteristics SEI message where the integer + is the preferred transfer characteristics. Required for HLG (Hybrid Log Gamma) + signalling. Not signalled by default. + +.. option:: --pic-struct <integer> + + Set the picture structure and emits it in the picture timing SEI message. + Values in the range 0..12. See D.3.3 of the HEVC spec. for a detailed explanation. + Required for HLG (Hybrid Log Gamma) signalling. Not signalled by default. + Bitstream options ================= @@ -2173,7 +2227,7 @@ .. option:: --log2-max-poc-lsb <integer> - Maximum of the picture order count. Default 8 + Maximum of the picture order count. Default 8 .. option:: --vui-timing-info, --no-vui-timing-info @@ -2205,21 +2259,28 @@ Only effective at RD levels 5 and 6 +.. option:: --idr-recovery-sei, --no-idr-recovery-sei + Emit RecoveryPoint info as sei in bitstream for each IDR frame. Default disabled. + +.. option:: --single-sei, --no-single-sei + Emit SEI messages in a single NAL unit instead of multiple NALs. Default disabled. + When HRD SEI is enabled the HM decoder will throw a warning. + DCT Approximations ================= .. option:: --lowpass-dct - If enabled, x265 will use low-pass subband dct approximation instead of the - standard dct for 16x16 and 32x32 blocks. This approximation is less computational - intensive but it generates truncated coefficient matrixes for the transformed block. - Empirical analysis shows marginal loss in compression and performance gains up to 10%, - paticularly at moderate bit-rates. + If enabled, x265 will use low-pass subband dct approximation instead of the + standard dct for 16x16 and 32x32 blocks. This approximation is less computational + intensive but it generates truncated coefficient matrixes for the transformed block. 
+ Empirical analysis shows marginal loss in compression and performance gains up to 10%, + paticularly at moderate bit-rates. - This approximation should be considered for platforms with performance and time - constrains. + This approximation should be considered for platforms with performance and time + constrains. - Default disabled. **Experimental feature** + Default disabled. **Experimental feature** Debugging options =================
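Several of the new SEI switches above (--atc-sei, --pic-struct, --single-sei) are meant to be combined for HLG delivery. A hedged sketch of the equivalent API calls follows; the transfer/colour values and the pic-struct code are illustrative choices, not values mandated by the documentation::

    #include <x265.h>

    x265_param *p = x265_param_alloc();
    x265_param_default_preset(p, "slow", NULL);
    x265_param_parse(p, "colorprim", "bt2020");
    x265_param_parse(p, "colormatrix", "bt2020nc");
    x265_param_parse(p, "transfer", "arib-std-b67");   /* HLG transfer curve */
    x265_param_parse(p, "atc-sei", "18");              /* preferred transfer characteristics */
    x265_param_parse(p, "pic-struct", "0");            /* 0 = progressive frame, see HEVC D.3.3 */
    x265_param_parse(p, "single-sei", "1");            /* pack SEI messages into a single NAL */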
x265_2.7.tar.gz/doc/reST/presets.rst -> x265_2.9.tar.gz/doc/reST/presets.rst
Changed
@@ -156,7 +156,10 @@ that strictly minimises QP fluctuations across frames, while still allowing the encoder to hit bitrate targets and VBV buffer limits (with a slightly higher margin of error than normal). It is highly recommended that this -algorithm is used only through the :option:`--tune` *grain* feature. +algorithm is used only through the :option:`--tune` *grain* feature. +Overriding the `--tune` *grain* settings might result in grain strobing, especially +when enabling features like :option:`--aq-mode` and :option:`--cutree` that modify +per-block QPs within a given frame. Fast Decode ~~~~~~~~~~~
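In API terms the whole grain bundle is picked up through the tune argument rather than by toggling the individual options, which is what the paragraph above recommends; the preset name is an arbitrary choice::

    #include <x265.h>

    x265_param *p = x265_param_alloc();
    if (x265_param_default_preset(p, "slow", "grain") < 0)
    {
        /* unknown preset or tune name */
    }
    /* Re-enabling per-block QP features such as aq-mode or cutree afterwards
     * risks the grain strobing described above. */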
x265_2.7.tar.gz/doc/reST/releasenotes.rst -> x265_2.9.tar.gz/doc/reST/releasenotes.rst
Changed
@@ -2,6 +2,69 @@ Release Notes ************* +Version 2.9 +=========== + +Release date - 05/10/2018 + +New features +------------- +1. Support for chunked encoding + + :option:`--chunk-start and --chunk-end` + Frames preceding first frame of chunk in display order will be encoded, however, they will be discarded in the bitstream. + Frames following last frame of the chunk in display order will be used in taking lookahead decisions, but, they will not be encoded. + This feature can be enabled only in closed GOP structures. Default disabled. + +2. Support for HDR10+ version 1 SEI messages. + +Encoder enhancements +-------------------- +1. Create API function for allocating and freeing x265_analysis_data. +2. CEA 608/708 support: Read SEI messages from text file and encode it using userSEI message. + +Bug fixes +--------- +1. Disable noise reduction when vbv is enabled. +2. Support minLuma and maxLuma values changed by the commandline. + +Version 2.8 +=========== + +Release date - 21/05/2018 + +New features +------------- +1. :option:`--asm avx512` used to enable AVX-512 in x265. Default disabled. + For 4K main10 high-quality encoding, we are seeing good gains; for other resolutions and presets, we don't recommend using this setting for now. + +2. :option:`--dynamic-refine` dynamically switches between different inter refine levels. Default disabled. + It is recommended to use :option:`--refine-intra 4' with dynamic refinement for a better trade-off between encode efficiency and performance than using static refinement. + +3. :option:`--single-sei` + Encode SEI messages in a single NAL unit instead of multiple NAL units. Default disabled. + +4. :option:`--max-ausize-factor` controls the maximum AU size defined in HEVC specification. + It represents the percentage of maximum AU size used. Default is 1. + +5. VMAF (Video Multi-Method Assessment Fusion) + Added VMAF support for objective quality measurement of a video sequence. + Enable cmake option ENABLE_LIBVMAF to report per frame and aggregate VMAF score. The frame level VMAF score does not include temporal scores. + This is supported only on linux for now. + +Encoder enhancements +-------------------- +1. Introduced refine-intra level 4 to improve quality. +2. Support for HLG-graded content and pic_struct in SEI message. + +Bug Fixes +--------- +1. Fix 32 bit build error (using CMAKE GUI) in Linux. +2. Fix 32 bit build error for asm primitives. +3. Fix build error on mac OS. +4. Fix VBV Lookahead in analysis load to achieve target bitrate. + + Version 2.7 ===========
x265_2.7.tar.gz/source/CMakeLists.txt -> x265_2.9.tar.gz/source/CMakeLists.txt
Changed
@@ -29,7 +29,7 @@ option(STATIC_LINK_CRT "Statically link C runtime for release builds" OFF) mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD) # X265_BUILD must be incremented each time the public API is changed -set(X265_BUILD 151) +set(X265_BUILD 165) configure_file("${PROJECT_SOURCE_DIR}/x265.def.in" "${PROJECT_BINARY_DIR}/x265.def") configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in" @@ -48,12 +48,12 @@ if("${SYSPROC}" STREQUAL "" OR X86MATCH GREATER "-1") set(X86 1) add_definitions(-DX265_ARCH_X86=1) - if("${CMAKE_SIZEOF_VOID_P}" MATCHES 8) + if(CMAKE_CXX_FLAGS STREQUAL "-m32") + message(STATUS "Detected x86 target processor") + elseif("${CMAKE_SIZEOF_VOID_P}" MATCHES 8) set(X64 1) add_definitions(-DX86_64=1) message(STATUS "Detected x86_64 target processor") - else() - message(STATUS "Detected x86 target processor") endif() elseif(POWERMATCH GREATER "-1") message(STATUS "Detected POWER target processor") @@ -109,6 +109,11 @@ if(NO_ATOMICS) add_definitions(-DNO_ATOMICS=1) endif(NO_ATOMICS) + find_library(VMAF vmaf) + option(ENABLE_LIBVMAF "Enable VMAF" OFF) + if(ENABLE_LIBVMAF) + add_definitions(-DENABLE_LIBVMAF) + endif() endif(UNIX) if(X64 AND NOT WIN32) @@ -536,6 +541,9 @@ if(EXTRA_LIB) target_link_libraries(x265-static ${EXTRA_LIB}) endif() +if(ENABLE_LIBVMAF) + target_link_libraries(x265-static ${VMAF}) +endif() install(TARGETS x265-static LIBRARY DESTINATION ${LIB_INSTALL_DIR} ARCHIVE DESTINATION ${LIB_INSTALL_DIR}) @@ -546,7 +554,7 @@ ARCHIVE DESTINATION ${LIB_INSTALL_DIR}) endif() install(FILES x265.h "${PROJECT_BINARY_DIR}/x265_config.h" DESTINATION include) -if(WIN32) +if((WIN32 AND ENABLE_CLI) OR (WIN32 AND ENABLE_SHARED)) if(MSVC_IDE) install(FILES "${PROJECT_BINARY_DIR}/Debug/x265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS Debug) install(FILES "${PROJECT_BINARY_DIR}/RelWithDebInfo/x265.pdb" DESTINATION ${BIN_INSTALL_DIR} CONFIGURATIONS RelWithDebInfo)
x265_2.7.tar.gz/source/common/common.cpp -> x265_2.9.tar.gz/source/common/common.cpp
Changed
@@ -54,7 +54,7 @@ #endif } -#define X265_ALIGNBYTES 32 +#define X265_ALIGNBYTES 64 #if _WIN32 #if defined(__MINGW32__) && !defined(__MINGW64_VERSION_MAJOR)
x265_2.7.tar.gz/source/common/common.h -> x265_2.9.tar.gz/source/common/common.h
Changed
@@ -75,6 +75,7 @@ #define ALIGN_VAR_8(T, var) T var __attribute__((aligned(8))) #define ALIGN_VAR_16(T, var) T var __attribute__((aligned(16))) #define ALIGN_VAR_32(T, var) T var __attribute__((aligned(32))) +#define ALIGN_VAR_64(T, var) T var __attribute__((aligned(64))) #if defined(__MINGW32__) #define fseeko fseeko64 #define ftello ftello64 @@ -85,6 +86,7 @@ #define ALIGN_VAR_8(T, var) __declspec(align(8)) T var #define ALIGN_VAR_16(T, var) __declspec(align(16)) T var #define ALIGN_VAR_32(T, var) __declspec(align(32)) T var +#define ALIGN_VAR_64(T, var) __declspec(align(64)) T var #define fseeko _fseeki64 #define ftello _ftelli64 #endif // if defined(__GNUC__) @@ -330,6 +332,8 @@ #define START_CODE_OVERHEAD 3 #define FILLER_OVERHEAD (NAL_TYPE_OVERHEAD + START_CODE_OVERHEAD + 1) +#define MAX_NUM_DYN_REFINE (NUM_CU_DEPTH * X265_REFINE_INTER_LEVELS) + namespace X265_NS { enum { SAO_NUM_OFFSET = 4 };
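The new 64-byte variant is used exactly like the existing 8/16/32-byte helpers and pairs with the X265_ALIGNBYTES bump to 64 in common.cpp, presumably so buffers can be handed to 512-bit (AVX-512) loads and stores; the buffer name below is made up::

    /* Stack buffer aligned to a 64-byte boundary; expands to the
     * __attribute__((aligned(64))) or __declspec(align(64)) form shown above. */
    ALIGN_VAR_64(int16_t, coeffs[32 * 32]);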
x265_2.7.tar.gz/source/common/cpu.cpp -> x265_2.9.tar.gz/source/common/cpu.cpp
Changed
@@ -58,10 +58,11 @@ #endif // if X265_ARCH_ARM namespace X265_NS { +static bool enable512 = false; const cpu_name_t cpu_names[] = { #if X265_ARCH_X86 -#define MMX2 X265_CPU_MMX | X265_CPU_MMX2 | X265_CPU_CMOV +#define MMX2 X265_CPU_MMX | X265_CPU_MMX2 { "MMX2", MMX2 }, { "MMXEXT", MMX2 }, { "SSE", MMX2 | X265_CPU_SSE }, @@ -84,13 +85,13 @@ { "BMI2", AVX | X265_CPU_LZCNT | X265_CPU_BMI1 | X265_CPU_BMI2 }, #define AVX2 AVX | X265_CPU_FMA3 | X265_CPU_LZCNT | X265_CPU_BMI1 | X265_CPU_BMI2 | X265_CPU_AVX2 { "AVX2", AVX2}, + { "AVX512", AVX2 | X265_CPU_AVX512 }, #undef AVX2 #undef AVX #undef SSE2 #undef MMX2 { "Cache32", X265_CPU_CACHELINE_32 }, { "Cache64", X265_CPU_CACHELINE_64 }, - { "SlowCTZ", X265_CPU_SLOW_CTZ }, { "SlowAtom", X265_CPU_SLOW_ATOM }, { "SlowPshufb", X265_CPU_SLOW_PSHUFB }, { "SlowPalignr", X265_CPU_SLOW_PALIGNR }, @@ -115,28 +116,32 @@ /* cpu-a.asm */ int PFX(cpu_cpuid_test)(void); void PFX(cpu_cpuid)(uint32_t op, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx); -void PFX(cpu_xgetbv)(uint32_t op, uint32_t *eax, uint32_t *edx); +uint64_t PFX(cpu_xgetbv)(int xcr); } #if defined(_MSC_VER) #pragma warning(disable: 4309) // truncation of constant value #endif -uint32_t cpu_detect(void) +bool detect512() +{ + return(enable512); +} +uint32_t cpu_detect(bool benableavx512 ) { - uint32_t cpu = 0; + uint32_t cpu = 0; uint32_t eax, ebx, ecx, edx; uint32_t vendor[4] = { 0 }; uint32_t max_extended_cap, max_basic_cap; + uint64_t xcr0 = 0; #if !X86_64 if (!PFX(cpu_cpuid_test)()) return 0; #endif - PFX(cpu_cpuid)(0, &eax, vendor + 0, vendor + 2, vendor + 1); - max_basic_cap = eax; + PFX(cpu_cpuid)(0, &max_basic_cap, vendor + 0, vendor + 2, vendor + 1); if (max_basic_cap == 0) return 0; @@ -147,27 +152,24 @@ return cpu; if (edx & 0x02000000) cpu |= X265_CPU_MMX2 | X265_CPU_SSE; - if (edx & 0x00008000) - cpu |= X265_CPU_CMOV; - else - return cpu; if (edx & 0x04000000) cpu |= X265_CPU_SSE2; if (ecx & 0x00000001) cpu |= X265_CPU_SSE3; if (ecx & 0x00000200) - cpu |= X265_CPU_SSSE3; + cpu |= X265_CPU_SSSE3 | X265_CPU_SSE2_IS_FAST; if (ecx & 0x00080000) cpu |= X265_CPU_SSE4; if (ecx & 0x00100000) cpu |= X265_CPU_SSE42; - /* Check OXSAVE and AVX bits */ - if ((ecx & 0x18000000) == 0x18000000) + + if (ecx & 0x08000000) /* XGETBV supported and XSAVE enabled by OS */ { /* Check for OS support */ - PFX(cpu_xgetbv)(0, &eax, &edx); - if ((eax & 0x6) == 0x6) + xcr0 = PFX(cpu_xgetbv)(0); + if ((xcr0 & 0x6) == 0x6) /* XMM/YMM state */ { + if (ecx & 0x10000000) cpu |= X265_CPU_AVX; if (ecx & 0x00001000) cpu |= X265_CPU_FMA3; @@ -178,19 +180,29 @@ { PFX(cpu_cpuid)(7, &eax, &ebx, &ecx, &edx); /* AVX2 requires OS support, but BMI1/2 don't. 
*/ - if ((cpu & X265_CPU_AVX) && (ebx & 0x00000020)) - cpu |= X265_CPU_AVX2; if (ebx & 0x00000008) - { cpu |= X265_CPU_BMI1; - if (ebx & 0x00000100) - cpu |= X265_CPU_BMI2; + if (ebx & 0x00000100) + cpu |= X265_CPU_BMI2; + + if ((xcr0 & 0x6) == 0x6) /* XMM/YMM state */ + { + if (ebx & 0x00000020) + cpu |= X265_CPU_AVX2; + if (benableavx512) + { + if ((xcr0 & 0xE0) == 0xE0) /* OPMASK/ZMM state */ + { + if ((ebx & 0xD0030000) == 0xD0030000) + { + cpu |= X265_CPU_AVX512; + enable512 = true; + } + } + } } } - if (cpu & X265_CPU_SSSE3) - cpu |= X265_CPU_SSE2_IS_FAST; - PFX(cpu_cpuid)(0x80000000, &eax, &ebx, &ecx, &edx); max_extended_cap = eax; @@ -230,8 +242,6 @@ { if (edx & 0x00400000) cpu |= X265_CPU_MMX2; - if (!(cpu & X265_CPU_LZCNT)) - cpu |= X265_CPU_SLOW_CTZ; if ((cpu & X265_CPU_SSE2) && !(cpu & X265_CPU_SSE2_IS_FAST)) cpu |= X265_CPU_SSE2_IS_SLOW; /* AMD CPUs come in two types: terrible at SSE and great at it */ } @@ -244,19 +254,10 @@ int model = ((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0); if (family == 6) { - /* 6/9 (pentium-m "banias"), 6/13 (pentium-m "dothan"), and 6/14 (core1 "yonah") - * theoretically support sse2, but it's significantly slower than mmx for - * almost all of x264's functions, so let's just pretend they don't. */ - if (model == 9 || model == 13 || model == 14) - { - cpu &= ~(X265_CPU_SSE2 | X265_CPU_SSE3); - X265_CHECK(!(cpu & (X265_CPU_SSSE3 | X265_CPU_SSE4)), "unexpected CPU ID %d\n", cpu); - } /* Detect Atom CPU */ - else if (model == 28) + if (model == 28) { cpu |= X265_CPU_SLOW_ATOM; - cpu |= X265_CPU_SLOW_CTZ; cpu |= X265_CPU_SLOW_PSHUFB; } @@ -328,7 +329,7 @@ int PFX(cpu_fast_neon_mrc_test)(void); } -uint32_t cpu_detect(void) +uint32_t cpu_detect(bool benableavx512) { int flags = 0; @@ -371,7 +372,7 @@ #elif X265_ARCH_POWER8 -uint32_t cpu_detect(void) +uint32_t cpu_detect(bool benableavx512) { #if HAVE_ALTIVEC return X265_CPU_ALTIVEC; @@ -382,10 +383,11 @@ #else // if X265_ARCH_POWER8 -uint32_t cpu_detect(void) +uint32_t cpu_detect(bool benableavx512) { return 0; } #endif // if X265_ARCH_X86 } +
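The constants in the new AVX-512 branch are easier to read spelled out. The interpretation below follows the conventional x86 bit assignments and is an annotation on the patch, not part of it::

    /* XCR0 (returned by cpu_xgetbv(0)): which register state the OS saves/restores */
    const uint64_t XCR0_XMM_YMM = 0x06;         /* bits 1-2: SSE (XMM) + AVX (YMM) state */
    const uint64_t XCR0_AVX512  = 0xE0;         /* bits 5-7: opmask, ZMM_Hi256, Hi16_ZMM */

    /* CPUID leaf 7, EBX: feature bits required before X265_CPU_AVX512 is set */
    const uint32_t EBX_AVX512   = 0xD0030000;   /* AVX512F(16) | AVX512DQ(17) |
                                                   AVX512CD(28) | AVX512BW(30) | AVX512VL(31) */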
x265_2.7.tar.gz/source/common/cpu.h -> x265_2.9.tar.gz/source/common/cpu.h
Changed
@@ -26,7 +26,6 @@ #define X265_CPU_H #include "common.h" - /* All assembly functions are prefixed with X265_NS (macro expanded) */ #define PFX3(prefix, name) prefix ## _ ## name #define PFX2(prefix, name) PFX3(prefix, name) @@ -50,7 +49,8 @@ #endif namespace X265_NS { -uint32_t cpu_detect(void); +uint32_t cpu_detect(bool); +bool detect512(); struct cpu_name_t {
x265_2.7.tar.gz/source/common/cudata.cpp -> x265_2.9.tar.gz/source/common/cudata.cpp
Changed
@@ -1626,11 +1626,6 @@ dir |= (1 << list); candMvField[count][list].mv = colmv; candMvField[count][list].refIdx = refIdx; - if (m_encData->m_param->scaleFactor && m_encData->m_param->analysisSave && m_log2CUSize[0] < 4) - { - MV dist(MAX_MV, MAX_MV); - candMvField[count][list].mv = dist; - } } } @@ -1790,14 +1785,7 @@ int curRefPOC = m_slice->m_refPOCList[picList][refIdx]; int curPOC = m_slice->m_poc; - - if (m_encData->m_param->scaleFactor && m_encData->m_param->analysisSave && (m_log2CUSize[0] < 4)) - { - MV dist(MAX_MV, MAX_MV); - pmv[numMvc++] = amvpCand[num++] = dist; - } - else - pmv[numMvc++] = amvpCand[num++] = scaleMvByPOCDist(neighbours[MD_COLLOCATED].mv[picList], curPOC, curRefPOC, colPOC, colRefPOC); + pmv[numMvc++] = amvpCand[num++] = scaleMvByPOCDist(neighbours[MD_COLLOCATED].mv[picList], curPOC, curRefPOC, colPOC, colRefPOC); } }
x265_2.7.tar.gz/source/common/cudata.h -> x265_2.9.tar.gz/source/common/cudata.h
Changed
@@ -224,6 +224,11 @@ uint64_t m_fAc_den[3]; uint64_t m_fDc_den[3]; + /* Feature values per CTU for dynamic refinement */ + uint64_t* m_collectCURd; + uint32_t* m_collectCUVariance; + uint32_t* m_collectCUCount; + CUData(); void initialize(const CUDataMemPool& dataPool, uint32_t depth, const x265_param& param, int instance); @@ -348,8 +353,12 @@ coeff_t* trCoeffMemBlock; MV* mvMemBlock; sse_t* distortionMemBlock; + uint64_t* dynRefineRdBlock; + uint32_t* dynRefCntBlock; + uint32_t* dynRefVarBlock; - CUDataMemPool() { charMemBlock = NULL; trCoeffMemBlock = NULL; mvMemBlock = NULL; distortionMemBlock = NULL; } + CUDataMemPool() { charMemBlock = NULL; trCoeffMemBlock = NULL; mvMemBlock = NULL; distortionMemBlock = NULL; + dynRefineRdBlock = NULL; dynRefCntBlock = NULL; dynRefVarBlock = NULL;} bool create(uint32_t depth, uint32_t csp, uint32_t numInstances, const x265_param& param) {
x265_2.7.tar.gz/source/common/dct.cpp -> x265_2.9.tar.gz/source/common/dct.cpp
Changed
@@ -980,19 +980,110 @@ sum += sbacGetEntropyBits(mstate, firstC2Flag); } } - return (sum & 0x00FFFFFF) + (c1 << 26) + (firstC2Idx << 28); } +template<int log2TrSize> +static void nonPsyRdoQuant_c(int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, uint32_t blkPos) +{ + const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */ + const int scaleBits = SCALE_BITS - 2 * transformShift; + const uint32_t trSize = 1 << log2TrSize; + + for (int y = 0; y < MLS_CG_SIZE; y++) + { + for (int x = 0; x < MLS_CG_SIZE; x++) + { + int64_t signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ + costUncoded[blkPos + x] = static_cast<int64_t>((double)((signCoef * signCoef) << scaleBits)); + *totalUncodedCost += costUncoded[blkPos + x]; + *totalRdCost += costUncoded[blkPos + x]; + } + blkPos += trSize; + } +} +template<int log2TrSize> +static void psyRdoQuant_c(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos) +{ + const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */ + const int scaleBits = SCALE_BITS - 2 * transformShift; + const uint32_t trSize = 1 << log2TrSize; + int max = X265_MAX(0, (2 * transformShift + 1)); + + for (int y = 0; y < MLS_CG_SIZE; y++) + { + for (int x = 0; x < MLS_CG_SIZE; x++) + { + int64_t signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ + int64_t predictedCoef = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/ + + costUncoded[blkPos + x] = static_cast<int64_t>((double)((signCoef * signCoef) << scaleBits)); + + /* when no residual coefficient is coded, predicted coef == recon coef */ + costUncoded[blkPos + x] -= static_cast<int64_t>((double)(((*psyScale) * predictedCoef) >> max)); + + *totalUncodedCost += costUncoded[blkPos + x]; + *totalRdCost += costUncoded[blkPos + x]; + } + blkPos += trSize; + } +} +template<int log2TrSize> +static void psyRdoQuant_c_1(int16_t *m_resiDctCoeff, /*int16_t *m_fencDctCoeff, */ int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, /* int64_t *psyScale,*/ uint32_t blkPos) +{ + const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */ + const int scaleBits = SCALE_BITS - 2 * transformShift; + const uint32_t trSize = 1 << log2TrSize; + + for (int y = 0; y < MLS_CG_SIZE; y++) + { + for (int x = 0; x < MLS_CG_SIZE; x++) + { + int64_t signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ + costUncoded[blkPos + x] = static_cast<int64_t>((double)((signCoef * signCoef) << scaleBits)); + *totalUncodedCost += costUncoded[blkPos + x]; + *totalRdCost += costUncoded[blkPos + x]; + } + blkPos += trSize; + } +} +template<int log2TrSize> +static void psyRdoQuant_c_2(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos) +{ + const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */ + + const uint32_t trSize = 1 << log2TrSize; + int max = X265_MAX(0, (2 * transformShift + 1)); + + for (int y = 0; y < MLS_CG_SIZE; y++) + { + for (int x = 0; x < MLS_CG_SIZE; x++) + { + int64_t signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT 
coeff */ + int64_t predictedCoef = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/ + costUncoded[blkPos + x] -= static_cast<int64_t>((double)(((*psyScale) * predictedCoef) >> max)); + *totalUncodedCost += costUncoded[blkPos + x]; + *totalRdCost += costUncoded[blkPos + x]; + } + blkPos += trSize; + } +} namespace X265_NS { // x265 private namespace - void setupDCTPrimitives_c(EncoderPrimitives& p) { p.dequant_scaling = dequant_scaling_c; p.dequant_normal = dequant_normal_c; p.quant = quant_c; p.nquant = nquant_c; + p.cu[BLOCK_4x4].nonPsyRdoQuant = nonPsyRdoQuant_c<2>; + p.cu[BLOCK_8x8].nonPsyRdoQuant = nonPsyRdoQuant_c<3>; + p.cu[BLOCK_16x16].nonPsyRdoQuant = nonPsyRdoQuant_c<4>; + p.cu[BLOCK_32x32].nonPsyRdoQuant = nonPsyRdoQuant_c<5>; + p.cu[BLOCK_4x4].psyRdoQuant = psyRdoQuant_c<2>; + p.cu[BLOCK_8x8].psyRdoQuant = psyRdoQuant_c<3>; + p.cu[BLOCK_16x16].psyRdoQuant = psyRdoQuant_c<4>; + p.cu[BLOCK_32x32].psyRdoQuant = psyRdoQuant_c<5>; p.dst4x4 = dst4_c; p.cu[BLOCK_4x4].dct = dct4_c; p.cu[BLOCK_8x8].dct = dct8_c; @@ -1013,7 +1104,14 @@ p.cu[BLOCK_8x8].copy_cnt = copy_count<8>; p.cu[BLOCK_16x16].copy_cnt = copy_count<16>; p.cu[BLOCK_32x32].copy_cnt = copy_count<32>; - + p.cu[BLOCK_4x4].psyRdoQuant_1p = psyRdoQuant_c_1<2>; + p.cu[BLOCK_4x4].psyRdoQuant_2p = psyRdoQuant_c_2<2>; + p.cu[BLOCK_8x8].psyRdoQuant_1p = psyRdoQuant_c_1<3>; + p.cu[BLOCK_8x8].psyRdoQuant_2p = psyRdoQuant_c_2<3>; + p.cu[BLOCK_16x16].psyRdoQuant_1p = psyRdoQuant_c_1<4>; + p.cu[BLOCK_16x16].psyRdoQuant_2p = psyRdoQuant_c_2<4>; + p.cu[BLOCK_32x32].psyRdoQuant_1p = psyRdoQuant_c_1<5>; + p.cu[BLOCK_32x32].psyRdoQuant_2p = psyRdoQuant_c_2<5>; p.scanPosLast = scanPosLast_c; p.findPosFirstLast = findPosFirstLast_c; p.costCoeffNxN = costCoeffNxN_c;
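The per-coefficient cost in the new quantisation kernels can be read off the loops above; written out with the names from the code (the psy term only appears in the psyRdoQuant variants)::

    scaleBits     = SCALE_BITS - 2 * transformShift
    costUncoded   = (signCoef * signCoef) << scaleBits
    predictedCoef = fencDctCoeff - resiDctCoeff            /* predicted DCT = source - residual */
    psyCost       = costUncoded - ((psyScale * predictedCoef) >> max(0, 2 * transformShift + 1))

Both totalUncodedCost and totalRdCost accumulate these values over the MLS_CG_SIZE x MLS_CG_SIZE coefficient group starting at blkPos.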
x265_2.7.tar.gz/source/common/frame.cpp -> x265_2.9.tar.gz/source/common/frame.cpp
Changed
@@ -53,6 +53,7 @@ m_addOnDepth = NULL; m_addOnCtuInfo = NULL; m_addOnPrevChange = NULL; + m_classifyFrame = false; } bool Frame::create(x265_param *param, float* quantOffsets) @@ -82,10 +83,18 @@ m_analysisData.wt = NULL; m_analysisData.intraData = NULL; m_analysisData.interData = NULL; - m_analysis2Pass.analysisFramedata = NULL; + m_analysisData.distortionData = NULL; } - if (m_fencPic->create(param, !!m_param->bCopyPicToFrame) && m_lowres.create(m_fencPic, param->bframes, !!param->rc.aqMode || !!param->bAQMotion, param->rc.qgSize)) + if (param->bDynamicRefine) + { + int size = m_param->maxCUDepth * X265_REFINE_INTER_LEVELS; + CHECKED_MALLOC_ZERO(m_classifyRd, uint64_t, size); + CHECKED_MALLOC_ZERO(m_classifyVariance, uint64_t, size); + CHECKED_MALLOC_ZERO(m_classifyCount, uint32_t, size); + } + + if (m_fencPic->create(param, !!m_param->bCopyPicToFrame) && m_lowres.create(param, m_fencPic, param->rc.qgSize)) { X265_CHECK((m_reconColCount == NULL), "m_reconColCount was initialized"); m_numRows = (m_fencPic->m_picHeight + param->maxCUSize - 1) / param->maxCUSize; @@ -94,11 +103,8 @@ if (quantOffsets) { - int32_t cuCount; - if (param->rc.qgSize == 8) - cuCount = m_lowres.maxBlocksInRowFullRes * m_lowres.maxBlocksInColFullRes; - else - cuCount = m_lowres.maxBlocksInRow * m_lowres.maxBlocksInCol; + int32_t cuCount = (param->rc.qgSize == 8) ? m_lowres.maxBlocksInRowFullRes * m_lowres.maxBlocksInColFullRes : + m_lowres.maxBlocksInRow * m_lowres.maxBlocksInCol; m_quantOffsets = new float[cuCount]; } return true; @@ -226,4 +232,11 @@ } m_lowres.destroy(); X265_FREE(m_rcData); + + if (m_param->bDynamicRefine) + { + X265_FREE_ZERO(m_classifyRd); + X265_FREE_ZERO(m_classifyVariance); + X265_FREE_ZERO(m_classifyCount); + } }
x265_2.7.tar.gz/source/common/frame.h -> x265_2.9.tar.gz/source/common/frame.h
Changed
@@ -109,7 +109,6 @@ Frame* m_prev; x265_param* m_param; // Points to the latest param set for the frame. x265_analysis_data m_analysisData; - x265_analysis_2Pass m_analysis2Pass; RcStats* m_rcData; Event m_copyMVType; @@ -122,6 +121,14 @@ uint8_t** m_addOnDepth; uint8_t** m_addOnCtuInfo; int** m_addOnPrevChange; + + /* Average feature values of frames being considered for classification */ + uint64_t* m_classifyRd; + uint64_t* m_classifyVariance; + uint32_t* m_classifyCount; + + bool m_classifyFrame; + Frame(); bool create(x265_param *param, float* quantOffsets);
x265_2.7.tar.gz/source/common/framedata.cpp -> x265_2.9.tar.gz/source/common/framedata.cpp
Changed
@@ -41,9 +41,25 @@ if (param.rc.bStatWrite) m_spsrps = const_cast<RPS*>(sps.spsrps); bool isallocated = m_cuMemPool.create(0, param.internalCsp, sps.numCUsInFrame, param); + if (m_param->bDynamicRefine) + { + CHECKED_MALLOC_ZERO(m_cuMemPool.dynRefineRdBlock, uint64_t, MAX_NUM_DYN_REFINE * sps.numCUsInFrame); + CHECKED_MALLOC_ZERO(m_cuMemPool.dynRefCntBlock, uint32_t, MAX_NUM_DYN_REFINE * sps.numCUsInFrame); + CHECKED_MALLOC_ZERO(m_cuMemPool.dynRefVarBlock, uint32_t, MAX_NUM_DYN_REFINE * sps.numCUsInFrame); + } if (isallocated) + { for (uint32_t ctuAddr = 0; ctuAddr < sps.numCUsInFrame; ctuAddr++) + { + if (m_param->bDynamicRefine) + { + m_picCTU[ctuAddr].m_collectCURd = m_cuMemPool.dynRefineRdBlock + (ctuAddr * MAX_NUM_DYN_REFINE); + m_picCTU[ctuAddr].m_collectCUVariance = m_cuMemPool.dynRefVarBlock + (ctuAddr * MAX_NUM_DYN_REFINE); + m_picCTU[ctuAddr].m_collectCUCount = m_cuMemPool.dynRefCntBlock + (ctuAddr * MAX_NUM_DYN_REFINE); + } m_picCTU[ctuAddr].initialize(m_cuMemPool, 0, param, ctuAddr); + } + } else return false; CHECKED_MALLOC_ZERO(m_cuStat, RCStatCU, sps.numCUsInFrame); @@ -65,6 +81,12 @@ { memset(m_cuStat, 0, sps.numCUsInFrame * sizeof(*m_cuStat)); memset(m_rowStat, 0, sps.numCuInHeight * sizeof(*m_rowStat)); + if (m_param->bDynamicRefine) + { + memset(m_picCTU->m_collectCURd, 0, MAX_NUM_DYN_REFINE * sizeof(uint64_t)); + memset(m_picCTU->m_collectCUVariance, 0, MAX_NUM_DYN_REFINE * sizeof(uint32_t)); + memset(m_picCTU->m_collectCUCount, 0, MAX_NUM_DYN_REFINE * sizeof(uint32_t)); + } } void FrameData::destroy() @@ -75,6 +97,12 @@ m_cuMemPool.destroy(); + if (m_param->bDynamicRefine) + { + X265_FREE(m_cuMemPool.dynRefineRdBlock); + X265_FREE(m_cuMemPool.dynRefCntBlock); + X265_FREE(m_cuMemPool.dynRefVarBlock); + } X265_FREE(m_cuStat); X265_FREE(m_rowStat); for (int i = 0; i < INTEGRAL_PLANE_NUM; i++)
x265_2.7.tar.gz/source/common/framedata.h -> x265_2.9.tar.gz/source/common/framedata.h
Changed
@@ -88,6 +88,11 @@ uint64_t cntInterPu[NUM_CU_DEPTH][INTER_MODES - 1]; uint64_t cntMergePu[NUM_CU_DEPTH][INTER_MODES - 1]; + /* Feature values per row for dynamic refinement */ + uint64_t rowRdDyn[MAX_NUM_DYN_REFINE]; + uint32_t rowVarDyn[MAX_NUM_DYN_REFINE]; + uint32_t rowCntDyn[MAX_NUM_DYN_REFINE]; + FrameStats() { memset(this, 0, sizeof(FrameStats)); @@ -174,47 +179,5 @@ inline CUData* getPicCTU(uint32_t ctuAddr) { return &m_picCTU[ctuAddr]; } }; -/* Stores intra analysis data for a single frame. This struct needs better packing */ -struct analysis_intra_data -{ - uint8_t* depth; - uint8_t* modes; - char* partSizes; - uint8_t* chromaModes; -}; - -/* Stores inter analysis data for a single frame */ -struct analysis_inter_data -{ - int32_t* ref; - uint8_t* depth; - uint8_t* modes; - uint8_t* partSize; - uint8_t* mergeFlag; - uint8_t* interDir; - uint8_t* mvpIdx[2]; - int8_t* refIdx[2]; - MV* mv[2]; - int64_t* sadCost; -}; - -struct analysis2PassFrameData -{ - uint8_t* depth; - MV* m_mv[2]; - int* mvpIdx[2]; - int32_t* ref[2]; - uint8_t* modes; - sse_t* distortion; - sse_t* ctuDistortion; - double* scaledDistortion; - double averageDistortion; - double sdDistortion; - uint32_t highDistortionCtuCount; - uint32_t lowDistortionCtuCount; - double* offset; - double* threshold; -}; - } #endif // ifndef X265_FRAMEDATA_H
x265_2.7.tar.gz/source/common/ipfilter.cpp -> x265_2.9.tar.gz/source/common/ipfilter.cpp
Changed
@@ -379,7 +379,8 @@ p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>; \ p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>; \ p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>; + p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].p2s[NONALIGNED] = filterPixelToShort_c<W, H>;\ + p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].p2s[ALIGNED] = filterPixelToShort_c<W, H>; #define CHROMA_422(W, H) \ p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \ @@ -388,7 +389,8 @@ p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>; \ p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>; \ p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>; + p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].p2s[NONALIGNED] = filterPixelToShort_c<W, H>;\ + p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].p2s[ALIGNED] = filterPixelToShort_c<W, H>; #define CHROMA_444(W, H) \ p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_hpp = interp_horiz_pp_c<4, W, H>; \ @@ -397,7 +399,8 @@ p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vps = interp_vert_ps_c<4, W, H>; \ p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vsp = interp_vert_sp_c<4, W, H>; \ p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vss = interp_vert_ss_c<4, W, H>; \ - p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].p2s = filterPixelToShort_c<W, H>; + p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].p2s[NONALIGNED] = filterPixelToShort_c<W, H>;\ + p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].p2s[ALIGNED] = filterPixelToShort_c<W, H>; #define LUMA(W, H) \ p.pu[LUMA_ ## W ## x ## H].luma_hpp = interp_horiz_pp_c<8, W, H>; \ @@ -407,7 +410,8 @@ p.pu[LUMA_ ## W ## x ## H].luma_vsp = interp_vert_sp_c<8, W, H>; \ p.pu[LUMA_ ## W ## x ## H].luma_vss = interp_vert_ss_c<8, W, H>; \ p.pu[LUMA_ ## W ## x ## H].luma_hvpp = interp_hv_pp_c<8, W, H>; \ - p.pu[LUMA_ ## W ## x ## H].convert_p2s = filterPixelToShort_c<W, H>; + p.pu[LUMA_ ## W ## x ## H].convert_p2s[NONALIGNED] = filterPixelToShort_c<W, H>;\ + p.pu[LUMA_ ## W ## x ## H].convert_p2s[ALIGNED] = filterPixelToShort_c<W, H>; void setupFilterPrimitives_c(EncoderPrimitives& p) {
x265_2.7.tar.gz/source/common/lowres.cpp -> x265_2.9.tar.gz/source/common/lowres.cpp
Changed
@@ -27,10 +27,10 @@ using namespace X265_NS; -bool Lowres::create(PicYuv *origPic, int _bframes, bool bAQEnabled, uint32_t qgSize) +bool Lowres::create(x265_param* param, PicYuv *origPic, uint32_t qgSize) { isLowres = true; - bframes = _bframes; + bframes = param->bframes; width = origPic->m_picWidth / 2; lines = origPic->m_picHeight / 2; lumaStride = width + 2 * origPic->m_lumaMarginX; @@ -41,11 +41,7 @@ maxBlocksInRowFullRes = maxBlocksInRow * 2; maxBlocksInColFullRes = maxBlocksInCol * 2; int cuCount = maxBlocksInRow * maxBlocksInCol; - int cuCountFullRes; - if (qgSize == 8) - cuCountFullRes = maxBlocksInRowFullRes * maxBlocksInColFullRes; - else - cuCountFullRes = cuCount; + int cuCountFullRes = (qgSize > 8) ? cuCount : cuCount << 2; /* rounding the width to multiple of lowres CU size */ width = maxBlocksInRow * X265_LOWRES_CU_SIZE; @@ -53,16 +49,18 @@ size_t planesize = lumaStride * (lines + 2 * origPic->m_lumaMarginY); size_t padoffset = lumaStride * origPic->m_lumaMarginY + origPic->m_lumaMarginX; - if (bAQEnabled) + if (!!param->rc.aqMode) { CHECKED_MALLOC_ZERO(qpAqOffset, double, cuCountFullRes); - CHECKED_MALLOC_ZERO(qpAqMotionOffset, double, cuCountFullRes); CHECKED_MALLOC_ZERO(invQscaleFactor, int, cuCountFullRes); CHECKED_MALLOC_ZERO(qpCuTreeOffset, double, cuCountFullRes); - CHECKED_MALLOC_ZERO(blockVariance, uint32_t, cuCountFullRes); if (qgSize == 8) CHECKED_MALLOC_ZERO(invQscaleFactor8x8, int, cuCount); } + if (origPic->m_param->bAQMotion) + CHECKED_MALLOC_ZERO(qpAqMotionOffset, double, cuCountFullRes); + if (origPic->m_param->bDynamicRefine) + CHECKED_MALLOC_ZERO(blockVariance, uint32_t, cuCountFullRes); CHECKED_MALLOC(propagateCost, uint16_t, cuCount); /* allocate lowres buffers */ @@ -126,14 +124,13 @@ X265_FREE(lowresMvCosts[1][i]); } X265_FREE(qpAqOffset); - X265_FREE(qpAqMotionOffset); X265_FREE(invQscaleFactor); X265_FREE(qpCuTreeOffset); X265_FREE(propagateCost); - X265_FREE(blockVariance); X265_FREE(invQscaleFactor8x8); + X265_FREE(qpAqMotionOffset); + X265_FREE(blockVariance); } - // (re) initialize lowres state void Lowres::init(PicYuv *origPic, int poc) {
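The rewritten cuCountFullRes is equivalent to the branch it replaces: the lowres plane is half resolution in each dimension, so with 8x8 quant groups the full-resolution grid has maxBlocksInRowFullRes * maxBlocksInColFullRes = (2 * maxBlocksInRow) * (2 * maxBlocksInCol) = 4 * cuCount entries, written here as cuCount << 2; for larger quant-group sizes it stays cuCount. The allocations are also split so qpAqMotionOffset and blockVariance are only reserved when AQ motion or dynamic refinement actually needs them. A one-line restatement of the count:

    // Sketch: old and new cuCountFullRes formulations agree for legal qgSize values (8..64).
    inline int cuCountFullRes(int maxBlocksInRow, int maxBlocksInCol, unsigned qgSize)
    {
        int cuCount = maxBlocksInRow * maxBlocksInCol;
        // old: (qgSize == 8) ? (2 * maxBlocksInRow) * (2 * maxBlocksInCol) : cuCount
        return (qgSize > 8) ? cuCount : cuCount << 2;   // cuCount << 2 == 4 * cuCount
    }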
x265_2.7.tar.gz/source/common/lowres.h -> x265_2.9.tar.gz/source/common/lowres.h
Changed
@@ -69,7 +69,7 @@ int qmvy = qmv.y + (qmv.y & 1); int hpelB = (qmvy & 2) | ((qmvx & 2) >> 1); pixel *frefB = lowresPlane[hpelB] + blockOffset + (qmvx >> 2) + (qmvy >> 2) * lumaStride; - primitives.pu[LUMA_8x8].pixelavg_pp(buf, outstride, frefA, lumaStride, frefB, lumaStride, 32); + primitives.pu[LUMA_8x8].pixelavg_pp[(outstride % 64 == 0) && (lumaStride % 64 == 0)](buf, outstride, frefA, lumaStride, frefB, lumaStride, 32); return buf; } else @@ -91,7 +91,7 @@ int qmvy = qmv.y + (qmv.y & 1); int hpelB = (qmvy & 2) | ((qmvx & 2) >> 1); pixel *frefB = lowresPlane[hpelB] + blockOffset + (qmvx >> 2) + (qmvy >> 2) * lumaStride; - primitives.pu[LUMA_8x8].pixelavg_pp(subpelbuf, 8, frefA, lumaStride, frefB, lumaStride, 32); + primitives.pu[LUMA_8x8].pixelavg_pp[NONALIGNED](subpelbuf, 8, frefA, lumaStride, frefB, lumaStride, 32); return comp(fenc, FENC_STRIDE, subpelbuf, 8); } else @@ -152,14 +152,12 @@ uint32_t* blockVariance; uint64_t wp_ssd[3]; // This is different than SSDY, this is sum(pixel^2) - sum(pixel)^2 for entire frame uint64_t wp_sum[3]; - uint64_t frameVariance; /* cutree intermediate data */ uint16_t* propagateCost; double weightedCostDelta[X265_BFRAME_MAX + 2]; ReferencePlanes weightedRef[X265_BFRAME_MAX + 2]; - - bool create(PicYuv *origPic, int _bframes, bool bAqEnabled, uint32_t qgSize); + bool create(x265_param* param, PicYuv *origPic, uint32_t qgSize); void destroy(); void init(PicYuv *origPic, int poc); };
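pixelavg_pp is now a two-entry table, and the first call site above indexes it with a boolean expression: a bool converts to 0 or 1, so the ALIGNED slot (index 1) is taken only when both strides are multiples of 64, and the NONALIGNED slot (index 0) otherwise. Spelled out:

    // Illustrative: the boolean expression used as the primitive-table index at the call site.
    #include <cstdint>
    inline int pixelavgAlignIdx(intptr_t outStride, intptr_t refStride)
    {
        // true -> 1 (ALIGNED slot), false -> 0 (NONALIGNED slot)
        return (outStride % 64 == 0) && (refStride % 64 == 0);
    }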
x265_2.7.tar.gz/source/common/param.cpp -> x265_2.9.tar.gz/source/common/param.cpp
Changed
@@ -105,7 +105,7 @@ memset(param, 0, sizeof(x265_param)); /* Applying default values to all elements in the param structure */ - param->cpuid = X265_NS::cpu_detect(); + param->cpuid = X265_NS::cpu_detect(false); param->bEnableWavefront = 1; param->frameNumThreads = 0; @@ -133,6 +133,7 @@ param->bEmitHRDSEI = 0; param->bEmitInfoSEI = 1; param->bEmitHDRSEI = 0; + param->bEmitIDRRecoverySEI = 0; /* CU definitions */ param->maxCUSize = 64; @@ -155,6 +156,9 @@ param->lookaheadThreads = 0; param->scenecutBias = 5.0; param->radl = 0; + param->chunkStart = 0; + param->chunkEnd = 0; + /* Intra Coding Tools */ param->bEnableConstrainedIntra = 0; param->bEnableStrongIntraSmoothing = 1; @@ -192,6 +196,7 @@ param->bEnableSAO = 1; param->bSaoNonDeblocked = 0; param->bLimitSAO = 0; + /* Coding Quality */ param->cbQpOffset = 0; param->crQpOffset = 0; @@ -289,16 +294,24 @@ param->scaleFactor = 0; param->intraRefine = 0; param->interRefine = 0; + param->bDynamicRefine = 0; param->mvRefine = 0; param->bUseAnalysisFile = 1; param->csvfpt = NULL; param->forceFlush = 0; param->bDisableLookahead = 0; param->bCopyPicToFrame = 1; + param->maxAUSizeFactor = 1; + param->naluFile = NULL; /* DCT Approximations */ param->bLowPassDct = 0; param->bMVType = 0; + param->bSingleSeiNal = 0; + + /* SEI messages */ + param->preferredTransferCharacteristics = -1; + param->pictureStructure = -1; } int x265_param_default_preset(x265_param* param, const char* preset, const char* tune) @@ -606,10 +619,26 @@ if (0) ; OPT("asm") { +#if X265_ARCH_X86 + if (!strcasecmp(value, "avx512")) + { + p->cpuid = X265_NS::cpu_detect(true); + if (!(p->cpuid & X265_CPU_AVX512)) + x265_log(p, X265_LOG_WARNING, "AVX512 is not supported\n"); + } + else + { + if (bValueWasNull) + p->cpuid = atobool(value); + else + p->cpuid = parseCpuName(value, bError, false); + } +#else if (bValueWasNull) p->cpuid = atobool(value); else - p->cpuid = parseCpuName(value, bError); + p->cpuid = parseCpuName(value, bError, false); +#endif } OPT("fps") { @@ -981,6 +1010,7 @@ OPT("limit-sao") p->bLimitSAO = atobool(value); OPT("dhdr10-info") p->toneMapFile = strdup(value); OPT("dhdr10-opt") p->bDhdr10opt = atobool(value); + OPT("idr-recovery-sei") p->bEmitIDRRecoverySEI = atobool(value); OPT("const-vbv") p->rc.bEnableConstVbv = atobool(value); OPT("ctu-info") p->bCTUInfo = atoi(value); OPT("scale-factor") p->scaleFactor = atoi(value); @@ -989,7 +1019,7 @@ OPT("refine-mv")p->mvRefine = atobool(value); OPT("force-flush")p->forceFlush = atoi(value); OPT("splitrd-skip") p->bEnableSplitRdSkip = atobool(value); - OPT("lowpass-dct") p->bLowPassDct = atobool(value); + OPT("lowpass-dct") p->bLowPassDct = atobool(value); OPT("vbv-end") p->vbvBufferEnd = atof(value); OPT("vbv-end-fr-adj") p->vbvEndFrameAdjust = atof(value); OPT("copy-pic") p->bCopyPicToFrame = atobool(value); @@ -1007,11 +1037,19 @@ { bError = true; } - } + } OPT("gop-lookahead") p->gopLookahead = atoi(value); OPT("analysis-save") p->analysisSave = strdup(value); OPT("analysis-load") p->analysisLoad = strdup(value); OPT("radl") p->radl = atoi(value); + OPT("max-ausize-factor") p->maxAUSizeFactor = atof(value); + OPT("dynamic-refine") p->bDynamicRefine = atobool(value); + OPT("single-sei") p->bSingleSeiNal = atobool(value); + OPT("atc-sei") p->preferredTransferCharacteristics = atoi(value); + OPT("pic-struct") p->pictureStructure = atoi(value); + OPT("chunk-start") p->chunkStart = atoi(value); + OPT("chunk-end") p->chunkEnd = atoi(value); + OPT("nalu-file") p->naluFile = strdup(value); else return X265_PARAM_BAD_NAME; 
} @@ -1054,7 +1092,7 @@ * false || no - disabled * integer bitmap value * comma separated list of SIMD names, eg: SSE4.1,XOP */ -int parseCpuName(const char* value, bool& bError) +int parseCpuName(const char* value, bool& bError, bool bEnableavx512) { if (!value) { @@ -1065,7 +1103,7 @@ if (isdigit(value[0])) cpu = x265_atoi(value, bError); else - cpu = !strcmp(value, "auto") || x265_atobool(value, bError) ? X265_NS::cpu_detect() : 0; + cpu = !strcmp(value, "auto") || x265_atobool(value, bError) ? X265_NS::cpu_detect(bEnableavx512) : 0; if (bError) { @@ -1365,8 +1403,10 @@ "Supported values for bCTUInfo are 0, 1, 2, 4, 6"); CHECK(param->interRefine > 3 || param->interRefine < 0, "Invalid refine-inter value, refine-inter levels 0 to 3 supported"); - CHECK(param->intraRefine > 3 || param->intraRefine < 0, + CHECK(param->intraRefine > 4 || param->intraRefine < 0, "Invalid refine-intra value, refine-intra levels 0 to 3 supported"); + CHECK(param->maxAUSizeFactor < 0.5 || param->maxAUSizeFactor > 1.0, + "Supported factor for controlling max AU size is from 0.5 to 1"); #if !X86_64 CHECK(param->searchMethod == X265_SEA && (param->sourceWidth > 840 || param->sourceHeight > 480), "SEA motion search does not support resolutions greater than 480p in 32 bit build"); @@ -1375,6 +1415,21 @@ if (param->masteringDisplayColorVolume || param->maxFALL || param->maxCLL) param->bEmitHDRSEI = 1; + bool isSingleSEI = (param->bRepeatHeaders + || param->bEmitHRDSEI + || param->bEmitInfoSEI + || param->bEmitHDRSEI + || param->bEmitIDRRecoverySEI + || !!param->interlaceMode + || param->preferredTransferCharacteristics > 1 + || param->toneMapFile + || param->naluFile); + + if (!isSingleSEI && param->bSingleSeiNal) + { + param->bSingleSeiNal = 0; + x265_log(param, X265_LOG_WARNING, "None of the SEI messages are enabled. Disabling Single SEI NAL\n"); + } return check_failed; } @@ -1504,6 +1559,7 @@ TOOLVAL(param->bCTUInfo, "ctu-info=%d"); if (param->bMVType == AVC_INFO) TOOLOPT(param->bMVType, "refine-mv-type=avc"); + TOOLOPT(param->bDynamicRefine, "dynamic-refine"); if (param->maxSlices > 1) TOOLVAL(param->maxSlices, "slices=%d"); if (param->bEnableLoopFilter) @@ -1520,6 +1576,7 @@ TOOLOPT(!param->bSaoNonDeblocked && param->bEnableSAO, "sao"); TOOLOPT(param->rc.bStatWrite, "stats-write"); TOOLOPT(param->rc.bStatRead, "stats-read"); + TOOLOPT(param->bSingleSeiNal, "single-sei"); #if ENABLE_HDR10_PLUS TOOLOPT(param->toneMapFile != NULL, "dhdr10-info"); #endif @@ -1560,6 +1617,10 @@ s += sprintf(s, " input-res=%dx%d", p->sourceWidth - padx, p->sourceHeight - pady); s += sprintf(s, " interlace=%d", p->interlaceMode); s += sprintf(s, " total-frames=%d", p->totalFrames); + if (p->chunkStart) + s += sprintf(s, " chunk-start=%d", p->chunkStart); + if (p->chunkEnd) + s += sprintf(s, " chunk-end=%d", p->chunkEnd); s += sprintf(s, " level-idc=%d", p->levelIdc); s += sprintf(s, " high-tier=%d", p->bHighTier); s += sprintf(s, " uhd-bd=%d", p->uhdBluray); @@ -1726,6 +1787,7 @@ BOOL(p->bEmitHDRSEI, "hdr"); BOOL(p->bHDROpt, "hdr-opt"); BOOL(p->bDhdr10opt, "dhdr10-opt"); + BOOL(p->bEmitIDRRecoverySEI, "idr-recovery-sei"); if (p->analysisSave) s += sprintf(s, " analysis-save"); if (p->analysisLoad) @@ -1740,6 +1802,9 @@ BOOL(p->bLowPassDct, "lowpass-dct"); s += sprintf(s, " refine-mv-type=%d", p->bMVType); s += sprintf(s, " copy-pic=%d", p->bCopyPicToFrame); + s += sprintf(s, " max-ausize-factor=%.1f", p->maxAUSizeFactor); + BOOL(p->bDynamicRefine, "dynamic-refine"); + BOOL(p->bSingleSeiNal, "single-sei"); #undef BOOL return buf; }
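Two behavioural points in this hunk are worth noting: AVX-512 is opt-in, since x265_param_default now calls cpu_detect(false) (which masks AVX-512) and only --asm avx512 re-probes with cpu_detect(true), logging a warning when the CPU lacks it; and --single-sei is silently disabled when no SEI-producing option is active. A usage sketch through the public API, assuming only that x265.h is available; the option strings are the ones parsed above, the numeric values are arbitrary examples:

    /* Illustrative sketch: enabling the new 2.9 options via the public API. */
    #include "x265.h"

    x265_param* makeChunkParams()
    {
        x265_param* p = x265_param_alloc();
        x265_param_default_preset(p, "medium", NULL);
        x265_param_parse(p, "asm", "avx512");        /* opt in to AVX-512; warns if unsupported */
        x265_param_parse(p, "dynamic-refine", "1");  /* per-CU switching of inter refine levels */
        x265_param_parse(p, "single-sei", "1");      /* pack SEI messages into a single NAL */
        x265_param_parse(p, "chunk-start", "101");   /* chunked encode: first frame of this chunk */
        x265_param_parse(p, "chunk-end", "200");     /* chunked encode: last frame of this chunk */
        return p;
    }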
x265_2.7.tar.gz/source/common/param.h -> x265_2.9.tar.gz/source/common/param.h
Changed
@@ -33,7 +33,7 @@ char* x265_param2string(x265_param *param, int padx, int pady); int x265_atoi(const char *str, bool& bError); double x265_atof(const char *str, bool& bError); -int parseCpuName(const char *value, bool& bError); +int parseCpuName(const char *value, bool& bError, bool bEnableavx512); void setParamAspectRatio(x265_param *p, int width, int height); void getParamAspectRatio(x265_param *p, int& width, int& height); bool parseLambdaFile(x265_param *param);
x265_2.7.tar.gz/source/common/picyuv.cpp -> x265_2.9.tar.gz/source/common/picyuv.cpp
Changed
@@ -358,6 +358,19 @@ pixel *uPic = m_picOrg[1]; pixel *vPic = m_picOrg[2]; + if(param.minLuma != 0 || param.maxLuma != PIXEL_MAX) + { + for (int r = 0; r < height; r++) + { + for (int c = 0; c < width; c++) + { + yPic[c] = X265_MIN(yPic[c], (pixel)param.maxLuma); + yPic[c] = X265_MAX(yPic[c], (pixel)param.minLuma); + } + yPic += m_stride; + } + } + yPic = m_picOrg[0]; if (param.csvLogLevel >= 2 || param.maxCLL || param.maxFALL) { for (int r = 0; r < height; r++)
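The new loop clamps every luma sample into [param.minLuma, param.maxLuma] before analysis, so limits overridden on the command line take effect; it is skipped when both are still at their defaults. Per sample it is a saturating clamp:

    // Sketch of the per-sample operation applied above (pixel is 8- or 16-bit in x265).
    template <typename pixel_t>
    inline pixel_t clampLuma(pixel_t v, pixel_t minLuma, pixel_t maxLuma)
    {
        return v < minLuma ? minLuma : (v > maxLuma ? maxLuma : v);
    }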
x265_2.7.tar.gz/source/common/picyuv.h -> x265_2.9.tar.gz/source/common/picyuv.h
Changed
@@ -72,6 +72,7 @@ pixel m_maxChromaVLevel; pixel m_minChromaVLevel; double m_avgChromaVLevel; + double m_vmafScore; x265_param *m_param; PicYuv();
x265_2.7.tar.gz/source/common/pixel.cpp -> x265_2.9.tar.gz/source/common/pixel.cpp
Changed
@@ -922,7 +922,7 @@ static void cuTreeFix8Pack(uint16_t *dst, double *src, int count) { for (int i = 0; i < count; i++) - dst[i] = (uint16_t)(src[i] * 256.0); + dst[i] = (uint16_t)(int16_t)(src[i] * 256.0); } static void cuTreeFix8Unpack(double *dst, uint16_t *src, int count) @@ -986,28 +986,34 @@ { #define LUMA_PU(W, H) \ p.pu[LUMA_ ## W ## x ## H].copy_pp = blockcopy_pp_c<W, H>; \ - p.pu[LUMA_ ## W ## x ## H].addAvg = addAvg<W, H>; \ + p.pu[LUMA_ ## W ## x ## H].addAvg[NONALIGNED] = addAvg<W, H>; \ + p.pu[LUMA_ ## W ## x ## H].addAvg[ALIGNED] = addAvg<W, H>; \ p.pu[LUMA_ ## W ## x ## H].sad = sad<W, H>; \ p.pu[LUMA_ ## W ## x ## H].sad_x3 = sad_x3<W, H>; \ p.pu[LUMA_ ## W ## x ## H].sad_x4 = sad_x4<W, H>; \ - p.pu[LUMA_ ## W ## x ## H].pixelavg_pp = pixelavg_pp<W, H>; - + p.pu[LUMA_ ## W ## x ## H].pixelavg_pp[NONALIGNED] = pixelavg_pp<W, H>; \ + p.pu[LUMA_ ## W ## x ## H].pixelavg_pp[ALIGNED] = pixelavg_pp<W, H>; #define LUMA_CU(W, H) \ p.cu[BLOCK_ ## W ## x ## H].sub_ps = pixel_sub_ps_c<W, H>; \ - p.cu[BLOCK_ ## W ## x ## H].add_ps = pixel_add_ps_c<W, H>; \ + p.cu[BLOCK_ ## W ## x ## H].add_ps[NONALIGNED] = pixel_add_ps_c<W, H>; \ + p.cu[BLOCK_ ## W ## x ## H].add_ps[ALIGNED] = pixel_add_ps_c<W, H>; \ p.cu[BLOCK_ ## W ## x ## H].copy_sp = blockcopy_sp_c<W, H>; \ p.cu[BLOCK_ ## W ## x ## H].copy_ps = blockcopy_ps_c<W, H>; \ p.cu[BLOCK_ ## W ## x ## H].copy_ss = blockcopy_ss_c<W, H>; \ - p.cu[BLOCK_ ## W ## x ## H].blockfill_s = blockfill_s_c<W>; \ + p.cu[BLOCK_ ## W ## x ## H].blockfill_s[NONALIGNED] = blockfill_s_c<W>; \ + p.cu[BLOCK_ ## W ## x ## H].blockfill_s[ALIGNED] = blockfill_s_c<W>; \ p.cu[BLOCK_ ## W ## x ## H].cpy2Dto1D_shl = cpy2Dto1D_shl<W>; \ p.cu[BLOCK_ ## W ## x ## H].cpy2Dto1D_shr = cpy2Dto1D_shr<W>; \ - p.cu[BLOCK_ ## W ## x ## H].cpy1Dto2D_shl = cpy1Dto2D_shl<W>; \ + p.cu[BLOCK_ ## W ## x ## H].cpy1Dto2D_shl[NONALIGNED] = cpy1Dto2D_shl<W>; \ + p.cu[BLOCK_ ## W ## x ## H].cpy1Dto2D_shl[ALIGNED] = cpy1Dto2D_shl<W>; \ p.cu[BLOCK_ ## W ## x ## H].cpy1Dto2D_shr = cpy1Dto2D_shr<W>; \ p.cu[BLOCK_ ## W ## x ## H].psy_cost_pp = psyCost_pp<BLOCK_ ## W ## x ## H>; \ p.cu[BLOCK_ ## W ## x ## H].transpose = transpose<W>; \ - p.cu[BLOCK_ ## W ## x ## H].ssd_s = pixel_ssd_s_c<W>; \ + p.cu[BLOCK_ ## W ## x ## H].ssd_s[NONALIGNED] = pixel_ssd_s_c<W>; \ + p.cu[BLOCK_ ## W ## x ## H].ssd_s[ALIGNED] = pixel_ssd_s_c<W>; \ p.cu[BLOCK_ ## W ## x ## H].var = pixel_var<W>; \ - p.cu[BLOCK_ ## W ## x ## H].calcresidual = getResidual<W>; \ + p.cu[BLOCK_ ## W ## x ## H].calcresidual[NONALIGNED] = getResidual<W>; \ + p.cu[BLOCK_ ## W ## x ## H].calcresidual[ALIGNED] = getResidual<W>; \ p.cu[BLOCK_ ## W ## x ## H].sse_pp = sse<W, H, pixel, pixel>; \ p.cu[BLOCK_ ## W ## x ## H].sse_ss = sse<W, H, int16_t, int16_t>; @@ -1102,7 +1108,8 @@ p.cu[BLOCK_64x64].sa8d = sa8d16<64, 64>; #define CHROMA_PU_420(W, H) \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].addAvg = addAvg<W, H>; \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].addAvg[NONALIGNED] = addAvg<W, H>; \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].addAvg[ALIGNED] = addAvg<W, H>; \ p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].copy_pp = blockcopy_pp_c<W, H>; \ CHROMA_PU_420(2, 2); @@ -1165,7 +1172,8 @@ p.chroma[X265_CSP_I420].cu[BLOCK_420_ ## W ## x ## H].copy_ps = blockcopy_ps_c<W, H>; \ p.chroma[X265_CSP_I420].cu[BLOCK_420_ ## W ## x ## H].copy_ss = blockcopy_ss_c<W, H>; \ p.chroma[X265_CSP_I420].cu[BLOCK_420_ ## W ## x ## H].sub_ps = pixel_sub_ps_c<W, H>; \ - p.chroma[X265_CSP_I420].cu[BLOCK_420_ ## W 
## x ## H].add_ps = pixel_add_ps_c<W, H>; + p.chroma[X265_CSP_I420].cu[BLOCK_420_ ## W ## x ## H].add_ps[NONALIGNED] = pixel_add_ps_c<W, H>; \ + p.chroma[X265_CSP_I420].cu[BLOCK_420_ ## W ## x ## H].add_ps[ALIGNED] = pixel_add_ps_c<W, H>; CHROMA_CU_420(2, 2) CHROMA_CU_420(4, 4) @@ -1179,7 +1187,8 @@ p.chroma[X265_CSP_I420].cu[BLOCK_64x64].sa8d = sa8d16<32, 32>; #define CHROMA_PU_422(W, H) \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].addAvg = addAvg<W, H>; \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].addAvg[NONALIGNED] = addAvg<W, H>; \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].addAvg[ALIGNED] = addAvg<W, H>; \ p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].copy_pp = blockcopy_pp_c<W, H>; \ CHROMA_PU_422(2, 4); @@ -1242,7 +1251,8 @@ p.chroma[X265_CSP_I422].cu[BLOCK_422_ ## W ## x ## H].copy_ps = blockcopy_ps_c<W, H>; \ p.chroma[X265_CSP_I422].cu[BLOCK_422_ ## W ## x ## H].copy_ss = blockcopy_ss_c<W, H>; \ p.chroma[X265_CSP_I422].cu[BLOCK_422_ ## W ## x ## H].sub_ps = pixel_sub_ps_c<W, H>; \ - p.chroma[X265_CSP_I422].cu[BLOCK_422_ ## W ## x ## H].add_ps = pixel_add_ps_c<W, H>; + p.chroma[X265_CSP_I422].cu[BLOCK_422_ ## W ## x ## H].add_ps[NONALIGNED] = pixel_add_ps_c<W, H>; \ + p.chroma[X265_CSP_I422].cu[BLOCK_422_ ## W ## x ## H].add_ps[ALIGNED] = pixel_add_ps_c<W, H>; CHROMA_CU_422(2, 4) CHROMA_CU_422(4, 8) @@ -1258,7 +1268,7 @@ p.weight_pp = weight_pp_c; p.weight_sp = weight_sp_c; - p.scale1D_128to64 = scale1D_128to64; + p.scale1D_128to64[NONALIGNED] = p.scale1D_128to64[ALIGNED] = scale1D_128to64; p.scale2D_64to32 = scale2D_64to32; p.frameInitLowres = frame_init_lowres_core; p.ssim_4x4x2_core = ssim_4x4x2_core;
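The cuTreeFix8Pack change routes the conversion through int16_t before the unsigned store: cuTree offsets can be negative, and converting a negative double straight to uint16_t is undefined in C++, while the two-step cast yields a well-defined two's-complement bit pattern for the unpack side to reinterpret. A sketch of the round trip; the reader half is an assumption, it is not part of this hunk:

    // Illustrative fix8 round trip (the unpack half is assumed, not shown in this diff).
    #include <cstdint>

    inline uint16_t packFix8(double v)           // v may be negative (e.g. a cuTree offset)
    {
        return (uint16_t)(int16_t)(v * 256.0);   // int16_t first: defined for negative values
    }

    inline double unpackFix8(uint16_t bits)
    {
        return (double)(int16_t)bits / 256.0;    // reinterpret the stored bits as signed fix8
    }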
x265_2.7.tar.gz/source/common/predict.cpp -> x265_2.9.tar.gz/source/common/predict.cpp
Changed
@@ -91,7 +91,7 @@ MV mv0 = cu.m_mv[0][pu.puAbsPartIdx]; cu.clipMv(mv0); - if (cu.m_slice->m_pps->bUseWeightPred && wp0->bPresentFlag) + if (cu.m_slice->m_pps->bUseWeightPred && wp0->wtPresent) { for (int plane = 0; plane < (bChroma ? 3 : 1); plane++) { @@ -133,7 +133,7 @@ pwp0 = refIdx0 >= 0 ? cu.m_slice->m_weightPredTable[0][refIdx0] : NULL; pwp1 = refIdx1 >= 0 ? cu.m_slice->m_weightPredTable[1][refIdx1] : NULL; - if (pwp0 && pwp1 && (pwp0->bPresentFlag || pwp1->bPresentFlag)) + if (pwp0 && pwp1 && (pwp0->wtPresent || pwp1->wtPresent)) { /* biprediction weighting */ for (int plane = 0; plane < (bChroma ? 3 : 1); plane++) @@ -183,7 +183,7 @@ predInterChromaShort(pu, m_predShortYuv[1], *cu.m_slice->m_refReconPicList[1][refIdx1], mv1); } - if (pwp0 && pwp1 && (pwp0->bPresentFlag || pwp1->bPresentFlag)) + if (pwp0 && pwp1 && (pwp0->wtPresent || pwp1->wtPresent)) addWeightBi(pu, predYuv, m_predShortYuv[0], m_predShortYuv[1], wv0, wv1, bLuma, bChroma); else predYuv.addAvg(m_predShortYuv[0], m_predShortYuv[1], pu.puAbsPartIdx, pu.width, pu.height, bLuma, bChroma); @@ -193,7 +193,7 @@ MV mv0 = cu.m_mv[0][pu.puAbsPartIdx]; cu.clipMv(mv0); - if (pwp0 && pwp0->bPresentFlag) + if (pwp0 && pwp0->wtPresent) { ShortYuv& shortYuv = m_predShortYuv[0]; @@ -220,7 +220,7 @@ /* uniprediction to L1 */ X265_CHECK(refIdx1 >= 0, "refidx1 was not positive\n"); - if (pwp1 && pwp1->bPresentFlag) + if (pwp1 && pwp1->wtPresent) { ShortYuv& shortYuv = m_predShortYuv[0]; @@ -283,7 +283,11 @@ int yFrac = mv.y & 3; if (!(yFrac | xFrac)) - primitives.pu[partEnum].convert_p2s(src, srcStride, dst, dstStride); + { + bool srcbufferAlignCheck = (refPic.m_cuOffsetY[pu.ctuAddr] + refPic.m_buOffsetY[pu.cuAbsPartIdx + pu.puAbsPartIdx] + srcOffset) % 64 == 0; + bool dstbufferAlignCheck = (dstSYuv.getAddrOffset(pu.puAbsPartIdx, dstSYuv.m_size) % 64) == 0; + primitives.pu[partEnum].convert_p2s[srcStride % 64 == 0 && dstStride % 64 == 0 && srcbufferAlignCheck && dstbufferAlignCheck](src, srcStride, dst, dstStride); + } else if (!yFrac) primitives.pu[partEnum].luma_hps(src, srcStride, dst, dstStride, xFrac, 0); else if (!xFrac) @@ -375,8 +379,10 @@ if (!(yFrac | xFrac)) { - primitives.chroma[m_csp].pu[partEnum].p2s(refCb, refStride, dstCb, dstStride); - primitives.chroma[m_csp].pu[partEnum].p2s(refCr, refStride, dstCr, dstStride); + bool srcbufferAlignCheckC = (refPic.m_cuOffsetC[pu.ctuAddr] + refPic.m_buOffsetC[pu.cuAbsPartIdx + pu.puAbsPartIdx] + refOffset) % 64 == 0; + bool dstbufferAlignCheckC = dstSYuv.getChromaAddrOffset(pu.puAbsPartIdx) % 64 == 0; + primitives.chroma[m_csp].pu[partEnum].p2s[refStride % 64 == 0 && dstStride % 64 == 0 && srcbufferAlignCheckC && dstbufferAlignCheckC](refCb, refStride, dstCb, dstStride); + primitives.chroma[m_csp].pu[partEnum].p2s[refStride % 64 == 0 && dstStride % 64 == 0 && srcbufferAlignCheckC && dstbufferAlignCheckC](refCr, refStride, dstCr, dstStride); } else if (!yFrac) {
x265_2.7.tar.gz/source/common/primitives.cpp -> x265_2.9.tar.gz/source/common/primitives.cpp
Changed
@@ -114,9 +114,11 @@ for (int i = 0; i < NUM_PU_SIZES; i++) { p.chroma[X265_CSP_I444].pu[i].copy_pp = p.pu[i].copy_pp; - p.chroma[X265_CSP_I444].pu[i].addAvg = p.pu[i].addAvg; + p.chroma[X265_CSP_I444].pu[i].addAvg[NONALIGNED] = p.pu[i].addAvg[NONALIGNED]; + p.chroma[X265_CSP_I444].pu[i].addAvg[ALIGNED] = p.pu[i].addAvg[ALIGNED]; p.chroma[X265_CSP_I444].pu[i].satd = p.pu[i].satd; - p.chroma[X265_CSP_I444].pu[i].p2s = p.pu[i].convert_p2s; + p.chroma[X265_CSP_I444].pu[i].p2s[NONALIGNED] = p.pu[i].convert_p2s[NONALIGNED]; + p.chroma[X265_CSP_I444].pu[i].p2s[ALIGNED] = p.pu[i].convert_p2s[ALIGNED]; } for (int i = 0; i < NUM_CU_SIZES; i++) @@ -124,7 +126,8 @@ p.chroma[X265_CSP_I444].cu[i].sa8d = p.cu[i].sa8d; p.chroma[X265_CSP_I444].cu[i].sse_pp = p.cu[i].sse_pp; p.chroma[X265_CSP_I444].cu[i].sub_ps = p.cu[i].sub_ps; - p.chroma[X265_CSP_I444].cu[i].add_ps = p.cu[i].add_ps; + p.chroma[X265_CSP_I444].cu[i].add_ps[NONALIGNED] = p.cu[i].add_ps[NONALIGNED]; + p.chroma[X265_CSP_I444].cu[i].add_ps[ALIGNED] = p.cu[i].add_ps[ALIGNED]; p.chroma[X265_CSP_I444].cu[i].copy_ps = p.cu[i].copy_ps; p.chroma[X265_CSP_I444].cu[i].copy_sp = p.cu[i].copy_sp; p.chroma[X265_CSP_I444].cu[i].copy_ss = p.cu[i].copy_ss;
x265_2.7.tar.gz/source/common/primitives.h -> x265_2.9.tar.gz/source/common/primitives.h
Changed
@@ -62,6 +62,13 @@ NUM_CU_SIZES }; +enum AlignPrimitive +{ + NONALIGNED, + ALIGNED, + NUM_ALIGNMENT_TYPES +}; + enum { NUM_TR_SIZE = 4 }; // TU are 4x4, 8x8, 16x16, and 32x32 @@ -216,7 +223,10 @@ typedef void (*integralv_t)(uint32_t *sum, intptr_t stride); typedef void (*integralh_t)(uint32_t *sum, pixel *pix, intptr_t stride); - +typedef void(*nonPsyRdoQuant_t)(int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, uint32_t blkPos); +typedef void(*psyRdoQuant_t)(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos); +typedef void(*psyRdoQuant_t1)(int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost,uint32_t blkPos); +typedef void(*psyRdoQuant_t2)(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos); /* Function pointers to optimized encoder primitives. Each pointer can reference * either an assembly routine, a SIMD intrinsic primitive, or a C function */ struct EncoderPrimitives @@ -242,12 +252,10 @@ filter_sp_t luma_vsp; filter_ss_t luma_vss; filter_hv_pp_t luma_hvpp; // combines hps + vsp - - pixelavg_pp_t pixelavg_pp; // quick bidir using pixels (borrowed from x264) - addAvg_t addAvg; // bidir motion compensation, uses 16bit values - + pixelavg_pp_t pixelavg_pp[NUM_ALIGNMENT_TYPES]; // quick bidir using pixels (borrowed from x264) + addAvg_t addAvg[NUM_ALIGNMENT_TYPES]; // bidir motion compensation, uses 16bit values copy_pp_t copy_pp; - filter_p2s_t convert_p2s; + filter_p2s_t convert_p2s[NUM_ALIGNMENT_TYPES]; } pu[NUM_PU_SIZES]; @@ -265,17 +273,16 @@ dct_t standard_dct; // original dct function, used by lowpass_dct dct_t lowpass_dct; // lowpass dct approximation - calcresidual_t calcresidual; + calcresidual_t calcresidual[NUM_ALIGNMENT_TYPES]; pixel_sub_ps_t sub_ps; - pixel_add_ps_t add_ps; - blockfill_s_t blockfill_s; // block fill, for DC transforms + pixel_add_ps_t add_ps[NUM_ALIGNMENT_TYPES]; + blockfill_s_t blockfill_s[NUM_ALIGNMENT_TYPES]; // block fill, for DC transforms copy_cnt_t copy_cnt; // copy coeff while counting non-zero count_nonzero_t count_nonzero; cpy2Dto1D_shl_t cpy2Dto1D_shl; cpy2Dto1D_shr_t cpy2Dto1D_shr; - cpy1Dto2D_shl_t cpy1Dto2D_shl; + cpy1Dto2D_shl_t cpy1Dto2D_shl[NUM_ALIGNMENT_TYPES]; cpy1Dto2D_shr_t cpy1Dto2D_shr; - copy_sp_t copy_sp; copy_ps_t copy_ps; copy_ss_t copy_ss; @@ -286,16 +293,18 @@ pixel_sse_t sse_pp; // Sum of Square Error (pixel, pixel) fenc alignment not assumed pixel_sse_ss_t sse_ss; // Sum of Square Error (short, short) fenc alignment not assumed pixelcmp_t psy_cost_pp; // difference in AC energy between two pixel blocks - pixel_ssd_s_t ssd_s; // Sum of Square Error (residual coeff to self) + pixel_ssd_s_t ssd_s[NUM_ALIGNMENT_TYPES]; // Sum of Square Error (residual coeff to self) pixelcmp_t sa8d; // Sum of Transformed Differences (8x8 Hadamard), uses satd for 4x4 intra TU - transpose_t transpose; // transpose pixel block; for use with intra all-angs intra_allangs_t intra_pred_allangs; intra_filter_t intra_filter; intra_pred_t intra_pred[NUM_INTRA_MODE]; + nonPsyRdoQuant_t nonPsyRdoQuant; + psyRdoQuant_t psyRdoQuant; + psyRdoQuant_t1 psyRdoQuant_1p; + psyRdoQuant_t2 psyRdoQuant_2p; } cu[NUM_CU_SIZES]; - /* These remaining primitives work on either fixed block sizes or take * block dimensions as arguments and thus do not belong in either the PU or * the 
CU arrays */ @@ -307,7 +316,7 @@ dequant_scaling_t dequant_scaling; dequant_normal_t dequant_normal; denoiseDct_t denoiseDct; - scale1D_t scale1D_128to64; + scale1D_t scale1D_128to64[NUM_ALIGNMENT_TYPES]; scale2D_t scale2D_64to32; ssim_4x4x2_core_t ssim_4x4x2_core; @@ -384,9 +393,9 @@ filter_ss_t filter_vss; filter_pp_t filter_hpp; filter_hps_t filter_hps; - addAvg_t addAvg; + addAvg_t addAvg[NUM_ALIGNMENT_TYPES]; copy_pp_t copy_pp; - filter_p2s_t p2s; + filter_p2s_t p2s[NUM_ALIGNMENT_TYPES]; } pu[NUM_PU_SIZES]; @@ -397,7 +406,7 @@ pixelcmp_t sa8d; // if chroma CU is not multiple of 8x8, will use satd pixel_sse_t sse_pp; pixel_sub_ps_t sub_ps; - pixel_add_ps_t add_ps; + pixel_add_ps_t add_ps[NUM_ALIGNMENT_TYPES]; copy_ps_t copy_ps; copy_sp_t copy_sp;
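With the AlignPrimitive enum, most PU/CU function pointers become two-entry tables: the C setup installs the same kernel in both slots, ISA-specific setup can place an alignment-assuming kernel in the ALIGNED slot only, and call sites select a slot with a boolean alignment test. A minimal self-contained sketch of the pattern; the kernel names and the blockfill signature are simplified for illustration:

    // Illustrative sketch of the NONALIGNED/ALIGNED primitive tables (simplified types).
    #include <cstdint>

    enum AlignIdx { NONALIGNED_SLOT = 0, ALIGNED_SLOT = 1, NUM_SLOTS = 2 };
    typedef void (*blockfill_t)(int16_t* dst, intptr_t stride, int16_t val);

    static void blockfill_c(int16_t*, intptr_t, int16_t)      { /* generic path, always safe */ }
    static void blockfill_avx512(int16_t*, intptr_t, int16_t) { /* assumes 64-aligned dst/stride */ }

    struct CuPrimitives { blockfill_t blockfill_s[NUM_SLOTS]; };

    void setupPrimitives(CuPrimitives& cu, bool haveAvx512)
    {
        cu.blockfill_s[NONALIGNED_SLOT] = blockfill_c;
        cu.blockfill_s[ALIGNED_SLOT]    = haveAvx512 ? blockfill_avx512 : blockfill_c;
    }

    void fillDc(CuPrimitives& cu, int16_t* residual, intptr_t resiStride, int16_t dcVal)
    {
        cu.blockfill_s[resiStride % 64 == 0](residual, resiStride, dcVal);  // bool as slot index
    }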
x265_2.7.tar.gz/source/common/quant.cpp -> x265_2.9.tar.gz/source/common/quant.cpp
Changed
@@ -560,13 +560,11 @@ uint32_t log2TrSize, TextType ttype, bool bIntra, bool useTransformSkip, uint32_t numSig) { const uint32_t sizeIdx = log2TrSize - 2; - if (cu.m_tqBypass[0]) { - primitives.cu[sizeIdx].cpy1Dto2D_shl(residual, coeff, resiStride, 0); + primitives.cu[sizeIdx].cpy1Dto2D_shl[resiStride % 64 == 0](residual, coeff, resiStride, 0); return; } - // Values need to pass as input parameter in dequant int rem = m_qpParam[ttype].rem; int per = m_qpParam[ttype].per; @@ -595,7 +593,7 @@ if (transformShift > 0) primitives.cu[sizeIdx].cpy1Dto2D_shr(residual, m_resiDctCoeff, resiStride, transformShift); else - primitives.cu[sizeIdx].cpy1Dto2D_shl(residual, m_resiDctCoeff, resiStride, -transformShift); + primitives.cu[sizeIdx].cpy1Dto2D_shl[resiStride % 64 == 0](residual, m_resiDctCoeff, resiStride, -transformShift); #endif } else @@ -611,7 +609,7 @@ const int add_2nd = 1 << (shift_2nd - 1); int dc_val = (((m_resiDctCoeff[0] * (64 >> 6) + add_1st) >> shift_1st) * (64 >> 3) + add_2nd) >> shift_2nd; - primitives.cu[sizeIdx].blockfill_s(residual, resiStride, (int16_t)dc_val); + primitives.cu[sizeIdx].blockfill_s[resiStride % 64 == 0](residual, resiStride, (int16_t)dc_val); return; } @@ -644,11 +642,9 @@ X265_CHECK((int)numSig == primitives.cu[log2TrSize - 2].count_nonzero(dstCoeff), "numSig differ\n"); if (!numSig) return 0; - const uint32_t trSize = 1 << log2TrSize; int64_t lambda2 = m_qpParam[ttype].lambda2; - const int64_t psyScale = ((int64_t)m_psyRdoqScale * m_qpParam[ttype].lambda); - + int64_t psyScale = ((int64_t)m_psyRdoqScale * m_qpParam[ttype].lambda); /* unquant constants for measuring distortion. Scaling list quant coefficients have a (1 << 4) * scale applied that must be removed during unquant. Note that in real dequant there is clipping * at several stages. 
We skip the clipping for simplicity when measuring RD cost */ @@ -725,27 +721,15 @@ for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++) { X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n"); - uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE); uint32_t blkPos = codeParams.scan[scanPosBase]; - - // TODO: we can't SIMD optimize because PSYVALUE need 64-bits multiplication, convert to Double can work faster by FMA - for (int y = 0; y < MLS_CG_SIZE; y++) + bool enable512 = detect512(); + if (enable512) + primitives.cu[log2TrSize - 2].psyRdoQuant(m_resiDctCoeff, m_fencDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, &psyScale, blkPos); + else { - for (int x = 0; x < MLS_CG_SIZE; x++) - { - int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ - int predictedCoef = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/ - - costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits; - - /* when no residual coefficient is coded, predicted coef == recon coef */ - costUncoded[blkPos + x] -= PSYVALUE(predictedCoef); - - totalUncodedCost += costUncoded[blkPos + x]; - totalRdCost += costUncoded[blkPos + x]; - } - blkPos += trSize; + primitives.cu[log2TrSize - 2].psyRdoQuant_1p(m_resiDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost,blkPos); + primitives.cu[log2TrSize - 2].psyRdoQuant_2p(m_resiDctCoeff, m_fencDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, &psyScale, blkPos); } } } @@ -755,25 +739,11 @@ for (int cgScanPos = cgLastScanPos + 1; cgScanPos < (int)cgNum ; cgScanPos++) { X265_CHECK(coeffNum[cgScanPos] == 0, "count of coeff failure\n"); - uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE); uint32_t blkPos = codeParams.scan[scanPosBase]; - - for (int y = 0; y < MLS_CG_SIZE; y++) - { - for (int x = 0; x < MLS_CG_SIZE; x++) - { - int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ - costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits; - - totalUncodedCost += costUncoded[blkPos + x]; - totalRdCost += costUncoded[blkPos + x]; - } - blkPos += trSize; - } + primitives.cu[log2TrSize - 2].nonPsyRdoQuant(m_resiDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, blkPos); } } - static const uint8_t table_cnt[5][SCAN_SET_SIZE] = { // patternSigCtx = 0 @@ -833,25 +803,22 @@ // TODO: does we need zero-coeff cost? 
const uint32_t scanPosBase = (cgScanPos << MLS_CG_SIZE); uint32_t blkPos = codeParams.scan[scanPosBase]; - if (usePsyMask) { - // TODO: we can't SIMD optimize because PSYVALUE need 64-bits multiplication, convert to Double can work faster by FMA + bool enable512 = detect512(); + + if (enable512) + primitives.cu[log2TrSize - 2].psyRdoQuant(m_resiDctCoeff, m_fencDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, &psyScale, blkPos); + else + { + primitives.cu[log2TrSize - 2].psyRdoQuant_1p(m_resiDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, blkPos); + primitives.cu[log2TrSize - 2].psyRdoQuant_2p(m_resiDctCoeff, m_fencDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, &psyScale, blkPos); + } + blkPos = codeParams.scan[scanPosBase]; for (int y = 0; y < MLS_CG_SIZE; y++) { for (int x = 0; x < MLS_CG_SIZE; x++) { - int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ - int predictedCoef = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/ - - costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits; - - /* when no residual coefficient is coded, predicted coef == recon coef */ - costUncoded[blkPos + x] -= PSYVALUE(predictedCoef); - - totalUncodedCost += costUncoded[blkPos + x]; - totalRdCost += costUncoded[blkPos + x]; - const uint32_t scanPosOffset = y * MLS_CG_SIZE + x; const uint32_t ctxSig = table_cnt[patternSigCtx][g_scan4x4[codeParams.scanType][scanPosOffset]] + ctxSigOffset; X265_CHECK(trSize > 4, "trSize check failure\n"); @@ -867,16 +834,12 @@ else { // non-psy path + primitives.cu[log2TrSize - 2].nonPsyRdoQuant(m_resiDctCoeff, costUncoded, &totalUncodedCost, &totalRdCost, blkPos); + blkPos = codeParams.scan[scanPosBase]; for (int y = 0; y < MLS_CG_SIZE; y++) { for (int x = 0; x < MLS_CG_SIZE; x++) { - int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ - costUncoded[blkPos + x] = ((int64_t)signCoef * signCoef) << scaleBits; - - totalUncodedCost += costUncoded[blkPos + x]; - totalRdCost += costUncoded[blkPos + x]; - const uint32_t scanPosOffset = y * MLS_CG_SIZE + x; const uint32_t ctxSig = table_cnt[patternSigCtx][g_scan4x4[codeParams.scanType][scanPosOffset]] + ctxSigOffset; X265_CHECK(trSize > 4, "trSize check failure\n");
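The scalar loops deleted in this file now live behind the new per-CU primitives: detect512() picks the fused AVX-512 psyRdoQuant, otherwise the work is split into psyRdoQuant_1p (squared-coefficient cost) and psyRdoQuant_2p (psy correction), with nonPsyRdoQuant covering the non-psy path. The cost each variant must reproduce is the one the removed C code computed; a reference sketch for one 4x4 coefficient group follows, where psyWeight stands in for x265's PSYVALUE macro and the exact shift lives in the kernels:

    // Reference form of the uncoded-cost accumulation now done inside the primitives.
    #include <cstdint>

    static inline int64_t psyWeight(int coef, int64_t psyScale, int shift)
    {
        return (psyScale * coef) >> shift;          // simplified psy-energy term (illustrative)
    }

    void psyRdoQuantRef(const int16_t* resiDct, const int16_t* fencDct, int64_t* costUncoded,
                        int64_t* totalUncodedCost, int64_t* totalRdCost,
                        int64_t psyScale, int shift, int scaleBits,
                        uint32_t blkPos, uint32_t trSize)
    {
        for (int y = 0; y < 4; y++, blkPos += trSize)               // one 4x4 coefficient group
        {
            for (int x = 0; x < 4; x++)
            {
                int signCoef      = resiDct[blkPos + x];            // pre-quantization DCT coeff
                int predictedCoef = fencDct[blkPos + x] - signCoef; // source DCT - residual DCT
                int64_t cost = ((int64_t)signCoef * signCoef) << scaleBits;
                cost -= psyWeight(predictedCoef, psyScale, shift);  // uncoded: recon == prediction
                costUncoded[blkPos + x] = cost;
                *totalUncodedCost += cost;
                *totalRdCost      += cost;
            }
        }
    }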
x265_2.7.tar.gz/source/common/slice.cpp -> x265_2.9.tar.gz/source/common/slice.cpp
Changed
@@ -138,7 +138,7 @@ for (int yuv = 0; yuv < 3; yuv++) { WeightParam& wp = m_weightPredTable[l][i][yuv]; - wp.bPresentFlag = false; + wp.wtPresent = 0; wp.log2WeightDenom = 0; wp.inputWeight = 1; wp.inputOffset = 0;
x265_2.7.tar.gz/source/common/slice.h -> x265_2.9.tar.gz/source/common/slice.h
Changed
@@ -298,7 +298,7 @@ uint32_t log2WeightDenom; int inputWeight; int inputOffset; - bool bPresentFlag; + int wtPresent; /* makes a non-h265 weight (i.e. fix7), into an h265 weight */ void setFromWeightAndOffset(int w, int o, int denom, bool bNormalize) @@ -321,7 +321,7 @@ (w).inputWeight = (s); \ (w).log2WeightDenom = (d); \ (w).inputOffset = (o); \ - (w).bPresentFlag = (b); \ + (w).wtPresent = (b); \ } class Slice @@ -385,14 +385,14 @@ bool getRapPicFlag() const { return m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL + || m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_N_LP || m_nalUnitType == NAL_UNIT_CODED_SLICE_CRA; } - bool getIdrPicFlag() const { - return m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL; + return m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL + || m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_N_LP; } - bool isIRAP() const { return m_nalUnitType >= 16 && m_nalUnitType <= 23; } bool isIntra() const { return m_sliceType == I_SLICE; }
x265_2.7.tar.gz/source/common/x86/asm-primitives.cpp -> x265_2.9.tar.gz/source/common/x86/asm-primitives.cpp
Changed
@@ -404,36 +404,58 @@ p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sa8d = PFX(pixel_sa8d_8x16_ ## cpu); \ p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sa8d = PFX(pixel_sa8d_16x32_ ## cpu); \ p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sa8d = PFX(pixel_sa8d_32x64_ ## cpu) - #define PIXEL_AVG(cpu) \ - p.pu[LUMA_64x64].pixelavg_pp = PFX(pixel_avg_64x64_ ## cpu); \ - p.pu[LUMA_64x48].pixelavg_pp = PFX(pixel_avg_64x48_ ## cpu); \ - p.pu[LUMA_64x32].pixelavg_pp = PFX(pixel_avg_64x32_ ## cpu); \ - p.pu[LUMA_64x16].pixelavg_pp = PFX(pixel_avg_64x16_ ## cpu); \ - p.pu[LUMA_48x64].pixelavg_pp = PFX(pixel_avg_48x64_ ## cpu); \ - p.pu[LUMA_32x64].pixelavg_pp = PFX(pixel_avg_32x64_ ## cpu); \ - p.pu[LUMA_32x32].pixelavg_pp = PFX(pixel_avg_32x32_ ## cpu); \ - p.pu[LUMA_32x24].pixelavg_pp = PFX(pixel_avg_32x24_ ## cpu); \ - p.pu[LUMA_32x16].pixelavg_pp = PFX(pixel_avg_32x16_ ## cpu); \ - p.pu[LUMA_32x8].pixelavg_pp = PFX(pixel_avg_32x8_ ## cpu); \ - p.pu[LUMA_24x32].pixelavg_pp = PFX(pixel_avg_24x32_ ## cpu); \ - p.pu[LUMA_16x64].pixelavg_pp = PFX(pixel_avg_16x64_ ## cpu); \ - p.pu[LUMA_16x32].pixelavg_pp = PFX(pixel_avg_16x32_ ## cpu); \ - p.pu[LUMA_16x16].pixelavg_pp = PFX(pixel_avg_16x16_ ## cpu); \ - p.pu[LUMA_16x12].pixelavg_pp = PFX(pixel_avg_16x12_ ## cpu); \ - p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_ ## cpu); \ - p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_ ## cpu); \ - p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_ ## cpu); \ - p.pu[LUMA_8x32].pixelavg_pp = PFX(pixel_avg_8x32_ ## cpu); \ - p.pu[LUMA_8x16].pixelavg_pp = PFX(pixel_avg_8x16_ ## cpu); \ - p.pu[LUMA_8x8].pixelavg_pp = PFX(pixel_avg_8x8_ ## cpu); \ - p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_8x4_ ## cpu); - + p.pu[LUMA_64x64].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_64x64_ ## cpu); \ + p.pu[LUMA_64x48].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_64x48_ ## cpu); \ + p.pu[LUMA_64x32].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_64x32_ ## cpu); \ + p.pu[LUMA_64x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_64x16_ ## cpu); \ + p.pu[LUMA_48x64].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_48x64_ ## cpu); \ + p.pu[LUMA_32x64].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x64_ ## cpu); \ + p.pu[LUMA_32x32].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x32_ ## cpu); \ + p.pu[LUMA_32x24].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x24_ ## cpu); \ + p.pu[LUMA_32x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x16_ ## cpu); \ + p.pu[LUMA_32x8].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x8_ ## cpu); \ + p.pu[LUMA_24x32].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_24x32_ ## cpu); \ + p.pu[LUMA_16x64].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_16x64_ ## cpu); \ + p.pu[LUMA_16x32].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_16x32_ ## cpu); \ + p.pu[LUMA_16x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_16x16_ ## cpu); \ + p.pu[LUMA_16x12].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_16x12_ ## cpu); \ + p.pu[LUMA_16x8].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_16x8_ ## cpu); \ + p.pu[LUMA_16x4].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_16x4_ ## cpu); \ + p.pu[LUMA_12x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_12x16_ ## cpu); \ + p.pu[LUMA_8x32].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_8x32_ ## cpu); \ + p.pu[LUMA_8x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_8x16_ ## cpu); \ + p.pu[LUMA_8x8].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_8x8_ ## cpu); \ + p.pu[LUMA_8x4].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_8x4_ ## cpu); \ + p.pu[LUMA_64x64].pixelavg_pp[ALIGNED] = PFX(pixel_avg_64x64_ ## cpu); \ + p.pu[LUMA_64x48].pixelavg_pp[ALIGNED] = PFX(pixel_avg_64x48_ ## cpu); \ + 
p.pu[LUMA_64x32].pixelavg_pp[ALIGNED] = PFX(pixel_avg_64x32_ ## cpu); \ + p.pu[LUMA_64x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_64x16_ ## cpu); \ + p.pu[LUMA_48x64].pixelavg_pp[ALIGNED] = PFX(pixel_avg_48x64_ ## cpu); \ + p.pu[LUMA_32x64].pixelavg_pp[ALIGNED] = PFX(pixel_avg_32x64_ ## cpu); \ + p.pu[LUMA_32x32].pixelavg_pp[ALIGNED] = PFX(pixel_avg_32x32_ ## cpu); \ + p.pu[LUMA_32x24].pixelavg_pp[ALIGNED] = PFX(pixel_avg_32x24_ ## cpu); \ + p.pu[LUMA_32x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_32x16_ ## cpu); \ + p.pu[LUMA_32x8].pixelavg_pp[ALIGNED] = PFX(pixel_avg_32x8_ ## cpu); \ + p.pu[LUMA_24x32].pixelavg_pp[ALIGNED] = PFX(pixel_avg_24x32_ ## cpu); \ + p.pu[LUMA_16x64].pixelavg_pp[ALIGNED] = PFX(pixel_avg_16x64_ ## cpu); \ + p.pu[LUMA_16x32].pixelavg_pp[ALIGNED] = PFX(pixel_avg_16x32_ ## cpu); \ + p.pu[LUMA_16x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_16x16_ ## cpu); \ + p.pu[LUMA_16x12].pixelavg_pp[ALIGNED] = PFX(pixel_avg_16x12_ ## cpu); \ + p.pu[LUMA_16x8].pixelavg_pp[ALIGNED] = PFX(pixel_avg_16x8_ ## cpu); \ + p.pu[LUMA_16x4].pixelavg_pp[ALIGNED] = PFX(pixel_avg_16x4_ ## cpu); \ + p.pu[LUMA_12x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_12x16_ ## cpu); \ + p.pu[LUMA_8x32].pixelavg_pp[ALIGNED] = PFX(pixel_avg_8x32_ ## cpu); \ + p.pu[LUMA_8x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_8x16_ ## cpu); \ + p.pu[LUMA_8x8].pixelavg_pp[ALIGNED] = PFX(pixel_avg_8x8_ ## cpu); \ + p.pu[LUMA_8x4].pixelavg_pp[ALIGNED] = PFX(pixel_avg_8x4_ ## cpu); #define PIXEL_AVG_W4(cpu) \ - p.pu[LUMA_4x4].pixelavg_pp = PFX(pixel_avg_4x4_ ## cpu); \ - p.pu[LUMA_4x8].pixelavg_pp = PFX(pixel_avg_4x8_ ## cpu); \ - p.pu[LUMA_4x16].pixelavg_pp = PFX(pixel_avg_4x16_ ## cpu); - + p.pu[LUMA_4x4].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_4x4_ ## cpu); \ + p.pu[LUMA_4x8].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_4x8_ ## cpu); \ + p.pu[LUMA_4x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_4x16_ ## cpu); \ + p.pu[LUMA_4x4].pixelavg_pp[ALIGNED] = PFX(pixel_avg_4x4_ ## cpu); \ + p.pu[LUMA_4x8].pixelavg_pp[ALIGNED] = PFX(pixel_avg_4x8_ ## cpu); \ + p.pu[LUMA_4x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_4x16_ ## cpu); #define CHROMA_420_FILTERS(cpu) \ ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, cpu); \ ALL_CHROMA_420_PU(filter_hps, interp_4tap_horiz_ps, cpu); \ @@ -633,23 +655,32 @@ #define LUMA_PIXELSUB(cpu) \ p.cu[BLOCK_4x4].sub_ps = PFX(pixel_sub_ps_4x4_ ## cpu); \ - p.cu[BLOCK_4x4].add_ps = PFX(pixel_add_ps_4x4_ ## cpu); \ + p.cu[BLOCK_4x4].add_ps[NONALIGNED] = PFX(pixel_add_ps_4x4_ ## cpu); \ + p.cu[BLOCK_4x4].add_ps[ALIGNED] = PFX(pixel_add_ps_4x4_ ## cpu); \ ALL_LUMA_CU(sub_ps, pixel_sub_ps, cpu); \ - ALL_LUMA_CU(add_ps, pixel_add_ps, cpu); + ALL_LUMA_CU(add_ps[NONALIGNED], pixel_add_ps, cpu); \ + ALL_LUMA_CU(add_ps[ALIGNED], pixel_add_ps, cpu); #define CHROMA_420_PIXELSUB_PS(cpu) \ ALL_CHROMA_420_CU(sub_ps, pixel_sub_ps, cpu); \ - ALL_CHROMA_420_CU(add_ps, pixel_add_ps, cpu); + ALL_CHROMA_420_CU(add_ps[NONALIGNED], pixel_add_ps, cpu); \ + ALL_CHROMA_420_CU(add_ps[ALIGNED], pixel_add_ps, cpu); #define CHROMA_422_PIXELSUB_PS(cpu) \ ALL_CHROMA_422_CU(sub_ps, pixel_sub_ps, cpu); \ - ALL_CHROMA_422_CU(add_ps, pixel_add_ps, cpu); + ALL_CHROMA_422_CU(add_ps[NONALIGNED], pixel_add_ps, cpu); \ + ALL_CHROMA_422_CU(add_ps[ALIGNED], pixel_add_ps, cpu); #define LUMA_VAR(cpu) ALL_LUMA_CU(var, pixel_var, cpu) -#define LUMA_ADDAVG(cpu) ALL_LUMA_PU(addAvg, addAvg, cpu); p.pu[LUMA_4x4].addAvg = PFX(addAvg_4x4_ ## cpu) -#define CHROMA_420_ADDAVG(cpu) ALL_CHROMA_420_PU(addAvg, addAvg, cpu); -#define CHROMA_422_ADDAVG(cpu) 
ALL_CHROMA_422_PU(addAvg, addAvg, cpu); +#define LUMA_ADDAVG(cpu) ALL_LUMA_PU(addAvg[NONALIGNED], addAvg, cpu); \ + p.pu[LUMA_4x4].addAvg[NONALIGNED] = PFX(addAvg_4x4_ ## cpu); \ + ALL_LUMA_PU(addAvg[ALIGNED], addAvg, cpu); \ + p.pu[LUMA_4x4].addAvg[ALIGNED] = PFX(addAvg_4x4_ ## cpu) +#define CHROMA_420_ADDAVG(cpu) ALL_CHROMA_420_PU(addAvg[NONALIGNED], addAvg, cpu); \ + ALL_CHROMA_420_PU(addAvg[ALIGNED], addAvg, cpu) +#define CHROMA_422_ADDAVG(cpu) ALL_CHROMA_422_PU(addAvg[NONALIGNED], addAvg, cpu); \ + ALL_CHROMA_422_PU(addAvg[ALIGNED], addAvg, cpu) #define SETUP_INTRA_ANG_COMMON(mode, fno, cpu) \ p.cu[BLOCK_4x4].intra_pred[mode] = PFX(intra_pred_ang4_ ## fno ## _ ## cpu); \ @@ -855,6 +886,10 @@ ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, cpu); \ ALL_CHROMA_444_PU(filter_hps, interp_4tap_horiz_ps, cpu); +#define ASSIGN2(func, fname) \ + func[ALIGNED] = PFX(fname); \ + func[NONALIGNED] = PFX(fname) + namespace X265_NS { // private x265 namespace @@ -873,10 +908,6 @@ void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) // Main10 { -#if !defined(X86_64) -#error "Unsupported build configuration (32bit x86 and HIGH_BIT_DEPTH), you must configure ENABLE_ASSEMBLY=OFF" -#endif - #if X86_64 p.scanPosLast = PFX(scanPosLast_x64); #endif @@ -937,35 +968,69 @@ CHROMA_422_VERT_FILTERS(_sse2); CHROMA_444_VERT_FILTERS(sse2); +#if X86_64 ALL_LUMA_PU(luma_hpp, interp_8tap_horiz_pp, sse2); p.pu[LUMA_4x4].luma_hpp = PFX(interp_8tap_horiz_pp_4x4_sse2); ALL_LUMA_PU(luma_hps, interp_8tap_horiz_ps, sse2); p.pu[LUMA_4x4].luma_hps = PFX(interp_8tap_horiz_ps_4x4_sse2); ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, sse2); ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, sse2); +#endif p.ssim_4x4x2_core = PFX(pixel_ssim_4x4x2_core_sse2); p.ssim_end_4 = PFX(pixel_ssim_end4_sse2); - PIXEL_AVG(sse2); + ASSIGN2(p.pu[LUMA_64x64].pixelavg_pp, pixel_avg_64x64_sse2); + ASSIGN2(p.pu[LUMA_64x48].pixelavg_pp, pixel_avg_64x48_sse2); + ASSIGN2(p.pu[LUMA_64x32].pixelavg_pp, pixel_avg_64x32_sse2); + ASSIGN2(p.pu[LUMA_64x16].pixelavg_pp, pixel_avg_64x16_sse2); + ASSIGN2(p.pu[LUMA_48x64].pixelavg_pp, pixel_avg_48x64_sse2); + ASSIGN2(p.pu[LUMA_32x64].pixelavg_pp, pixel_avg_32x64_sse2); + ASSIGN2(p.pu[LUMA_32x32].pixelavg_pp, pixel_avg_32x32_sse2); + ASSIGN2(p.pu[LUMA_32x24].pixelavg_pp, pixel_avg_32x24_sse2); + ASSIGN2(p.pu[LUMA_32x16].pixelavg_pp, pixel_avg_32x16_sse2); + ASSIGN2(p.pu[LUMA_32x8].pixelavg_pp, pixel_avg_32x8_sse2); + ASSIGN2(p.pu[LUMA_24x32].pixelavg_pp, pixel_avg_24x32_sse2); + ASSIGN2(p.pu[LUMA_16x64].pixelavg_pp, pixel_avg_16x64_sse2); + ASSIGN2(p.pu[LUMA_16x32].pixelavg_pp, pixel_avg_16x32_sse2); + ASSIGN2(p.pu[LUMA_16x16].pixelavg_pp, pixel_avg_16x16_sse2); + ASSIGN2(p.pu[LUMA_16x12].pixelavg_pp, pixel_avg_16x12_sse2); + ASSIGN2(p.pu[LUMA_16x8].pixelavg_pp, pixel_avg_16x8_sse2); + ASSIGN2(p.pu[LUMA_16x4].pixelavg_pp, pixel_avg_16x4_sse2); + ASSIGN2(p.pu[LUMA_12x16].pixelavg_pp, pixel_avg_12x16_sse2); +#if X86_64 + ASSIGN2(p.pu[LUMA_8x32].pixelavg_pp, pixel_avg_8x32_sse2); + ASSIGN2(p.pu[LUMA_8x16].pixelavg_pp, pixel_avg_8x16_sse2); + ASSIGN2(p.pu[LUMA_8x8].pixelavg_pp, pixel_avg_8x8_sse2); + ASSIGN2(p.pu[LUMA_8x4].pixelavg_pp, pixel_avg_8x4_sse2); +#endif PIXEL_AVG_W4(mmx2); LUMA_VAR(sse2); - ALL_LUMA_TU(blockfill_s, blockfill_s, sse2); + ALL_LUMA_TU(blockfill_s[ALIGNED], blockfill_s, sse2); + ALL_LUMA_TU(blockfill_s[NONALIGNED], blockfill_s, sse2); ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, sse2); - ALL_LUMA_TU_S(cpy1Dto2D_shl, cpy1Dto2D_shl_, sse2); + ALL_LUMA_TU_S(cpy1Dto2D_shl[ALIGNED], 
cpy1Dto2D_shl_, sse2); + ALL_LUMA_TU_S(cpy1Dto2D_shl[NONALIGNED], cpy1Dto2D_shl_, sse2); ALL_LUMA_TU_S(cpy2Dto1D_shr, cpy2Dto1D_shr_, sse2); ALL_LUMA_TU_S(cpy2Dto1D_shl, cpy2Dto1D_shl_, sse2); - ALL_LUMA_TU_S(ssd_s, pixel_ssd_s_, sse2); - ALL_LUMA_TU_S(calcresidual, getResidual, sse2); +#if X86_64 + ASSIGN2(p.cu[BLOCK_4x4].ssd_s,pixel_ssd_s_4_sse2 ); + ASSIGN2(p.cu[BLOCK_8x8].ssd_s,pixel_ssd_s_8_sse2); + ASSIGN2(p.cu[BLOCK_16x16].ssd_s,pixel_ssd_s_16_sse2); + ASSIGN2(p.cu[BLOCK_32x32].ssd_s,pixel_ssd_s_32_sse2 ); +#endif + ALL_LUMA_TU_S(calcresidual[ALIGNED], getResidual, sse2); + ALL_LUMA_TU_S(calcresidual[NONALIGNED], getResidual, sse2); ALL_LUMA_TU_S(transpose, transpose, sse2); p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar4_sse2); p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar8_sse2); p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar16_sse2); +#if X86_64 p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar32_sse2); ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2); - +#endif p.cu[BLOCK_4x4].intra_pred[2] = PFX(intra_pred_ang4_2_sse2); p.cu[BLOCK_4x4].intra_pred[3] = PFX(intra_pred_ang4_3_sse2); p.cu[BLOCK_4x4].intra_pred[4] = PFX(intra_pred_ang4_4_sse2); @@ -990,7 +1055,9 @@ p.cu[BLOCK_4x4].intra_pred[23] = PFX(intra_pred_ang4_23_sse2); p.cu[BLOCK_4x4].intra_pred[24] = PFX(intra_pred_ang4_24_sse2); p.cu[BLOCK_4x4].intra_pred[25] = PFX(intra_pred_ang4_25_sse2); +#if X86_64 p.cu[BLOCK_4x4].intra_pred[26] = PFX(intra_pred_ang4_26_sse2); +#endif p.cu[BLOCK_4x4].intra_pred[27] = PFX(intra_pred_ang4_27_sse2); p.cu[BLOCK_4x4].intra_pred[28] = PFX(intra_pred_ang4_28_sse2); p.cu[BLOCK_4x4].intra_pred[29] = PFX(intra_pred_ang4_29_sse2); @@ -999,19 +1066,24 @@ p.cu[BLOCK_4x4].intra_pred[32] = PFX(intra_pred_ang4_32_sse2); p.cu[BLOCK_4x4].intra_pred[33] = PFX(intra_pred_ang4_33_sse2); +#if X86_64 && X265_DEPTH <= 10 + p.cu[BLOCK_4x4].sse_ss = PFX(pixel_ssd_ss_4x4_mmx2); p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_32x64_sse2); p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_4x8_mmx2); p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_8x16_sse2); p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_16x32_sse2); -#if X265_DEPTH <= 10 - p.cu[BLOCK_4x4].sse_ss = PFX(pixel_ssd_ss_4x4_mmx2); - ALL_LUMA_CU(sse_ss, pixel_ssd_ss, sse2); + + p.cu[BLOCK_8x8].sse_ss = PFX(pixel_ssd_ss_8x8_sse2); + p.cu[BLOCK_16x16].sse_ss = PFX(pixel_ssd_ss_16x16_sse2); + p.cu[BLOCK_32x32].sse_ss = PFX(pixel_ssd_ss_32x32_sse2); + p.cu[BLOCK_64x64].sse_ss = PFX(pixel_ssd_ss_64x64_sse2); #endif p.cu[BLOCK_4x4].dct = PFX(dct4_sse2); p.cu[BLOCK_8x8].dct = PFX(dct8_sse2); p.cu[BLOCK_4x4].idct = PFX(idct4_sse2); +#if X86_64 p.cu[BLOCK_8x8].idct = PFX(idct8_sse2); - +#endif p.idst4x4 = PFX(idst4_sse2); p.dst4x4 = PFX(dst4_sse2); @@ -1022,25 +1094,31 @@ //p.planecopy_sp = PFX(downShift_16_sse2); p.planecopy_sp_shl = PFX(upShift_16_sse2); - ALL_CHROMA_420_PU(p2s, filterPixelToShort, sse2); - ALL_CHROMA_422_PU(p2s, filterPixelToShort, sse2); - ALL_CHROMA_444_PU(p2s, filterPixelToShort, sse2); - ALL_LUMA_PU(convert_p2s, filterPixelToShort, sse2); + ALL_CHROMA_420_PU(p2s[ALIGNED], filterPixelToShort, sse2); + ALL_CHROMA_422_PU(p2s[ALIGNED], filterPixelToShort, sse2); + ALL_CHROMA_444_PU(p2s[ALIGNED], filterPixelToShort, sse2); + ALL_LUMA_PU(convert_p2s[ALIGNED], filterPixelToShort, sse2); + ALL_CHROMA_420_PU(p2s[NONALIGNED], 
filterPixelToShort, sse2); + ALL_CHROMA_422_PU(p2s[NONALIGNED], filterPixelToShort, sse2); + ALL_CHROMA_444_PU(p2s[NONALIGNED], filterPixelToShort, sse2); + ALL_LUMA_PU(convert_p2s[NONALIGNED], filterPixelToShort, sse2); ALL_LUMA_TU(count_nonzero, count_nonzero, sse2); p.propagateCost = PFX(mbtree_propagate_cost_sse2); } if (cpuMask & X265_CPU_SSE3) { +#if X86_64 ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, sse3); ALL_CHROMA_422_PU(filter_hpp, interp_4tap_horiz_pp, sse3); ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, sse3); ALL_CHROMA_420_PU(filter_hps, interp_4tap_horiz_ps, sse3); ALL_CHROMA_422_PU(filter_hps, interp_4tap_horiz_ps, sse3); ALL_CHROMA_444_PU(filter_hps, interp_4tap_horiz_ps, sse3); +#endif } if (cpuMask & X265_CPU_SSSE3) { - p.scale1D_128to64 = PFX(scale1D_128to64_ssse3); + ASSIGN2(p.scale1D_128to64, scale1D_128to64_ssse3); p.scale2D_64to32 = PFX(scale2D_64to32_ssse3); // p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_ssse3); this one is broken @@ -1055,60 +1133,65 @@ p.frameInitLowres = PFX(frame_init_lowres_core_ssse3); - ALL_LUMA_PU(convert_p2s, filterPixelToShort, ssse3); + ALL_LUMA_PU(convert_p2s[ALIGNED], filterPixelToShort, ssse3); + ALL_LUMA_PU(convert_p2s[NONALIGNED], filterPixelToShort, ssse3); + + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s, filterPixelToShort_4x4_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s, filterPixelToShort_4x8_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s, filterPixelToShort_4x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s, filterPixelToShort_8x4_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s, filterPixelToShort_8x8_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s, filterPixelToShort_8x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s, filterPixelToShort_8x32_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s, filterPixelToShort_16x4_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s, filterPixelToShort_16x8_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s, filterPixelToShort_16x12_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s, filterPixelToShort_16x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s, filterPixelToShort_16x32_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s, filterPixelToShort_32x8_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s, filterPixelToShort_32x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s, filterPixelToShort_32x24_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s, filterPixelToShort_32x32_ssse3); + + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s, filterPixelToShort_4x4_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s, filterPixelToShort_4x8_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s, filterPixelToShort_4x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s, filterPixelToShort_4x32_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s, filterPixelToShort_8x4_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s, filterPixelToShort_8x8_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s, filterPixelToShort_8x12_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s, filterPixelToShort_8x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s, filterPixelToShort_8x32_ssse3); + 
ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s, filterPixelToShort_8x64_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s, filterPixelToShort_12x32_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s, filterPixelToShort_16x8_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s, filterPixelToShort_16x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s, filterPixelToShort_16x24_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s, filterPixelToShort_16x32_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s, filterPixelToShort_16x64_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s, filterPixelToShort_24x64_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s, filterPixelToShort_32x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s, filterPixelToShort_32x32_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s, filterPixelToShort_32x48_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s, filterPixelToShort_32x64_ssse3); + + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s, filterPixelToShort_4x2_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s, filterPixelToShort_8x2_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s, filterPixelToShort_8x6_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = PFX(filterPixelToShort_4x4_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = PFX(filterPixelToShort_4x8_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = PFX(filterPixelToShort_4x16_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = PFX(filterPixelToShort_8x4_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = PFX(filterPixelToShort_8x8_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = PFX(filterPixelToShort_8x16_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = PFX(filterPixelToShort_8x32_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = PFX(filterPixelToShort_16x4_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = PFX(filterPixelToShort_16x8_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = PFX(filterPixelToShort_16x12_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = PFX(filterPixelToShort_16x16_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = PFX(filterPixelToShort_16x32_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = PFX(filterPixelToShort_32x8_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = PFX(filterPixelToShort_32x16_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = PFX(filterPixelToShort_32x24_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = PFX(filterPixelToShort_32x32_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = PFX(filterPixelToShort_4x4_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = PFX(filterPixelToShort_4x8_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = PFX(filterPixelToShort_4x16_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = PFX(filterPixelToShort_4x32_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = PFX(filterPixelToShort_8x4_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = PFX(filterPixelToShort_8x8_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = PFX(filterPixelToShort_8x12_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = PFX(filterPixelToShort_8x16_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = 
PFX(filterPixelToShort_8x32_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = PFX(filterPixelToShort_8x64_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = PFX(filterPixelToShort_12x32_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = PFX(filterPixelToShort_16x8_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = PFX(filterPixelToShort_16x16_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = PFX(filterPixelToShort_16x24_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = PFX(filterPixelToShort_16x32_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = PFX(filterPixelToShort_16x64_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = PFX(filterPixelToShort_24x64_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = PFX(filterPixelToShort_32x16_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = PFX(filterPixelToShort_32x32_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = PFX(filterPixelToShort_32x48_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = PFX(filterPixelToShort_32x64_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = PFX(filterPixelToShort_4x2_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = PFX(filterPixelToShort_8x2_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = PFX(filterPixelToShort_8x6_ssse3); p.findPosFirstLast = PFX(findPosFirstLast_ssse3); p.fix8Unpack = PFX(cutree_fix8_unpack_ssse3); p.fix8Pack = PFX(cutree_fix8_pack_ssse3); } if (cpuMask & X265_CPU_SSE4) { +#if X86_64 p.pelFilterLumaStrong[0] = PFX(pelFilterLumaStrong_V_sse4); p.pelFilterLumaStrong[1] = PFX(pelFilterLumaStrong_H_sse4); p.pelFilterChroma[0] = PFX(pelFilterChroma_V_sse4); p.pelFilterChroma[1] = PFX(pelFilterChroma_H_sse4); - p.saoCuOrgE0 = PFX(saoCuOrgE0_sse4); +#endif p.saoCuOrgE1 = PFX(saoCuOrgE1_sse4); p.saoCuOrgE1_2Rows = PFX(saoCuOrgE1_2Rows_sse4); p.saoCuOrgE2[0] = PFX(saoCuOrgE2_sse4); @@ -1123,6 +1206,68 @@ CHROMA_422_ADDAVG(sse4); LUMA_FILTERS(sse4); + +#if X86_64 + p.pu[LUMA_4x4].luma_hpp = PFX(interp_8tap_horiz_pp_4x4_sse4); + p.pu[LUMA_4x8].luma_hpp = PFX(interp_8tap_horiz_pp_4x8_sse4); + p.pu[LUMA_4x16].luma_hpp = PFX(interp_8tap_horiz_pp_4x16_sse4); + p.pu[LUMA_4x4].luma_hps = PFX(interp_8tap_horiz_ps_4x4_sse4); + p.pu[LUMA_4x8].luma_hps = PFX(interp_8tap_horiz_ps_4x8_sse4); + p.pu[LUMA_4x16].luma_hps = PFX(interp_8tap_horiz_ps_4x16_sse4); +#endif + + p.pu[LUMA_8x8].luma_hpp = PFX(interp_8tap_horiz_pp_8x8_sse4); + p.pu[LUMA_16x16].luma_hpp = PFX(interp_8tap_horiz_pp_16x16_sse4); + p.pu[LUMA_32x32].luma_hpp = PFX(interp_8tap_horiz_pp_32x32_sse4); + p.pu[LUMA_64x64].luma_hpp = PFX(interp_8tap_horiz_pp_64x64_sse4); + p.pu[LUMA_8x4].luma_hpp = PFX(interp_8tap_horiz_pp_8x4_sse4); + + p.pu[LUMA_16x8].luma_hpp = PFX(interp_8tap_horiz_pp_16x8_sse4); + p.pu[LUMA_8x16].luma_hpp = PFX(interp_8tap_horiz_pp_8x16_sse4); + p.pu[LUMA_16x32].luma_hpp = PFX(interp_8tap_horiz_pp_16x32_sse4); + p.pu[LUMA_32x16].luma_hpp = PFX(interp_8tap_horiz_pp_32x16_sse4); + p.pu[LUMA_64x32].luma_hpp = PFX(interp_8tap_horiz_pp_64x32_sse4); + p.pu[LUMA_32x64].luma_hpp = PFX(interp_8tap_horiz_pp_32x64_sse4); + p.pu[LUMA_16x12].luma_hpp = PFX(interp_8tap_horiz_pp_16x12_sse4); + p.pu[LUMA_12x16].luma_hpp = PFX(interp_8tap_horiz_pp_12x16_sse4); + p.pu[LUMA_16x4].luma_hpp = PFX(interp_8tap_horiz_pp_16x4_sse4); + + p.pu[LUMA_32x24].luma_hpp = PFX(interp_8tap_horiz_pp_32x24_sse4); + p.pu[LUMA_24x32].luma_hpp = PFX(interp_8tap_horiz_pp_24x32_sse4); + p.pu[LUMA_32x8].luma_hpp = 
PFX(interp_8tap_horiz_pp_32x8_sse4); + p.pu[LUMA_8x32].luma_hpp = PFX(interp_8tap_horiz_pp_8x32_sse4); + p.pu[LUMA_64x48].luma_hpp = PFX(interp_8tap_horiz_pp_64x48_sse4); + p.pu[LUMA_48x64].luma_hpp = PFX(interp_8tap_horiz_pp_48x64_sse4); + p.pu[LUMA_64x16].luma_hpp = PFX(interp_8tap_horiz_pp_64x16_sse4); + p.pu[LUMA_16x64].luma_hpp = PFX(interp_8tap_horiz_pp_16x64_sse4); + + p.pu[LUMA_8x8].luma_hps = PFX(interp_8tap_horiz_ps_8x8_sse4); + p.pu[LUMA_16x16].luma_hps = PFX(interp_8tap_horiz_ps_16x16_sse4); + p.pu[LUMA_32x32].luma_hps = PFX(interp_8tap_horiz_ps_32x32_sse4); + p.pu[LUMA_64x64].luma_hps = PFX(interp_8tap_horiz_ps_64x64_sse4); + p.pu[LUMA_8x4].luma_hps = PFX(interp_8tap_horiz_ps_8x4_sse4); + p.pu[LUMA_16x8].luma_hps = PFX(interp_8tap_horiz_ps_16x8_sse4); + p.pu[LUMA_8x16].luma_hps = PFX(interp_8tap_horiz_ps_8x16_sse4); + p.pu[LUMA_16x32].luma_hps = PFX(interp_8tap_horiz_ps_16x32_sse4); + p.pu[LUMA_32x16].luma_hps = PFX(interp_8tap_horiz_ps_32x16_sse4); + p.pu[LUMA_64x32].luma_hps = PFX(interp_8tap_horiz_ps_64x32_sse4); + p.pu[LUMA_32x64].luma_hps = PFX(interp_8tap_horiz_ps_32x64_sse4); + p.pu[LUMA_16x12].luma_hps = PFX(interp_8tap_horiz_ps_16x12_sse4); + p.pu[LUMA_12x16].luma_hps = PFX(interp_8tap_horiz_ps_12x16_sse4); + p.pu[LUMA_16x4].luma_hps = PFX(interp_8tap_horiz_ps_16x4_sse4); + p.pu[LUMA_32x24].luma_hps = PFX(interp_8tap_horiz_ps_32x24_sse4); + p.pu[LUMA_24x32].luma_hps = PFX(interp_8tap_horiz_ps_24x32_sse4); + p.pu[LUMA_32x8].luma_hps = PFX(interp_8tap_horiz_ps_32x8_sse4); + p.pu[LUMA_8x32].luma_hps = PFX(interp_8tap_horiz_ps_8x32_sse4); + p.pu[LUMA_64x48].luma_hps = PFX(interp_8tap_horiz_ps_64x48_sse4); + p.pu[LUMA_48x64].luma_hps = PFX(interp_8tap_horiz_ps_48x64_sse4); + p.pu[LUMA_64x16].luma_hps = PFX(interp_8tap_horiz_ps_64x16_sse4); + p.pu[LUMA_16x64].luma_hps = PFX(interp_8tap_horiz_ps_16x64_sse4); + + ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, sse4); p.pu[LUMA_4x4].luma_vpp = PFX(interp_8tap_vert_pp_4x4_sse4); + ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, sse4); p.pu[LUMA_4x4].luma_vps = PFX(interp_8tap_vert_ps_4x4_sse4); + ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, sse4); p.pu[LUMA_4x4].luma_vsp = PFX(interp_8tap_vert_sp_4x4_sse4); + ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu); p.pu[LUMA_4x4].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_4x4>; CHROMA_420_HORIZ_FILTERS(sse4); CHROMA_420_VERT_FILTERS_SSE4(_sse4); CHROMA_422_HORIZ_FILTERS(_sse4); @@ -1162,16 +1307,16 @@ // TODO: check POPCNT flag! 
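    /* The `#if X86_64` guards introduced in the SSE3 and SSE4 hunks here limit those handwritten
     * kernels (chroma 4-tap horizontal filters, pelFilterLumaStrong/pelFilterChroma, the 4xN luma
     * 8-tap filters) to 64-bit builds. The guard pattern, exactly as it appears in this hunk:
     *
     *   #if X86_64
     *       p.pu[LUMA_4x4].luma_hpp = PFX(interp_8tap_horiz_pp_4x4_sse4);
     *   #endif
     *
     * The diff itself does not state the motivation; on 32-bit x86 these entries simply keep the
     * previous (non-guarded) implementations. */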
ALL_LUMA_TU_S(copy_cnt, copy_cnt_, sse4); -#if X265_DEPTH <= 10 +#if X86_64 && X265_DEPTH <= 10 ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4); #endif - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = PFX(filterPixelToShort_2x4_sse4); - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = PFX(filterPixelToShort_2x8_sse4); - p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = PFX(filterPixelToShort_6x8_sse4); - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = PFX(filterPixelToShort_2x8_sse4); - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = PFX(filterPixelToShort_2x16_sse4); - p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = PFX(filterPixelToShort_6x16_sse4); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s[NONALIGNED] = PFX(filterPixelToShort_2x4_sse4); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s[NONALIGNED] = PFX(filterPixelToShort_2x8_sse4); + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s[NONALIGNED] = PFX(filterPixelToShort_6x8_sse4); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s[NONALIGNED] = PFX(filterPixelToShort_2x8_sse4); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s[NONALIGNED] = PFX(filterPixelToShort_2x16_sse4); + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s[NONALIGNED] = PFX(filterPixelToShort_6x16_sse4); p.costCoeffRemain = PFX(costCoeffRemain_sse4); #if X86_64 p.saoCuStatsE0 = PFX(saoCuStatsE0_sse4); @@ -1180,6 +1325,7 @@ p.saoCuStatsE3 = PFX(saoCuStatsE3_sse4); #endif } +#if X86_64 if (cpuMask & X265_CPU_AVX) { // p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_avx); fails tests @@ -1411,83 +1557,81 @@ p.cu[BLOCK_32x32].intra_pred[32] = PFX(intra_pred_ang32_32_avx2); p.cu[BLOCK_32x32].intra_pred[33] = PFX(intra_pred_ang32_33_avx2); p.cu[BLOCK_32x32].intra_pred[34] = PFX(intra_pred_ang32_2_avx2); - - p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_avx2); - p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2); - p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2); - p.pu[LUMA_16x12].pixelavg_pp = PFX(pixel_avg_16x12_avx2); - p.pu[LUMA_16x16].pixelavg_pp = PFX(pixel_avg_16x16_avx2); - p.pu[LUMA_16x32].pixelavg_pp = PFX(pixel_avg_16x32_avx2); - p.pu[LUMA_16x64].pixelavg_pp = PFX(pixel_avg_16x64_avx2); - p.pu[LUMA_24x32].pixelavg_pp = PFX(pixel_avg_24x32_avx2); - p.pu[LUMA_32x8].pixelavg_pp = PFX(pixel_avg_32x8_avx2); - p.pu[LUMA_32x16].pixelavg_pp = PFX(pixel_avg_32x16_avx2); - p.pu[LUMA_32x24].pixelavg_pp = PFX(pixel_avg_32x24_avx2); - p.pu[LUMA_32x32].pixelavg_pp = PFX(pixel_avg_32x32_avx2); - p.pu[LUMA_32x64].pixelavg_pp = PFX(pixel_avg_32x64_avx2); - p.pu[LUMA_64x16].pixelavg_pp = PFX(pixel_avg_64x16_avx2); - p.pu[LUMA_64x32].pixelavg_pp = PFX(pixel_avg_64x32_avx2); - p.pu[LUMA_64x48].pixelavg_pp = PFX(pixel_avg_64x48_avx2); - p.pu[LUMA_64x64].pixelavg_pp = PFX(pixel_avg_64x64_avx2); - p.pu[LUMA_48x64].pixelavg_pp = PFX(pixel_avg_48x64_avx2); - - p.pu[LUMA_8x4].addAvg = PFX(addAvg_8x4_avx2); - p.pu[LUMA_8x8].addAvg = PFX(addAvg_8x8_avx2); - p.pu[LUMA_8x16].addAvg = PFX(addAvg_8x16_avx2); - p.pu[LUMA_8x32].addAvg = PFX(addAvg_8x32_avx2); - p.pu[LUMA_12x16].addAvg = PFX(addAvg_12x16_avx2); - p.pu[LUMA_16x4].addAvg = PFX(addAvg_16x4_avx2); - p.pu[LUMA_16x8].addAvg = PFX(addAvg_16x8_avx2); - p.pu[LUMA_16x12].addAvg = PFX(addAvg_16x12_avx2); - p.pu[LUMA_16x16].addAvg = PFX(addAvg_16x16_avx2); - p.pu[LUMA_16x32].addAvg = PFX(addAvg_16x32_avx2); - p.pu[LUMA_16x64].addAvg = PFX(addAvg_16x64_avx2); - p.pu[LUMA_24x32].addAvg = PFX(addAvg_24x32_avx2); - p.pu[LUMA_32x8].addAvg = PFX(addAvg_32x8_avx2); - p.pu[LUMA_32x16].addAvg = PFX(addAvg_32x16_avx2); 
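    /* Many of the deleted single-pointer assignments in this hunk reappear below as ASSIGN2(...)
     * calls: entries such as addAvg, pixelavg_pp, convert_p2s and p2s are now small arrays indexed
     * by NONALIGNED/ALIGNED, and ASSIGN2 fills both slots with the same PFX() kernel.
     * A minimal sketch of the assumed expansion (the actual macro is defined elsewhere in x265):
     *
     *   #define ASSIGN2(member, fname) \
     *       (member)[NONALIGNED] = PFX(fname); \
     *       (member)[ALIGNED]    = PFX(fname)
     *
     * Later in this diff, dedicated *_aligned_* AVX-512 kernels overwrite only the [ALIGNED] slot,
     * while the [NONALIGNED] slot keeps the generic kernel. */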
- p.pu[LUMA_32x24].addAvg = PFX(addAvg_32x24_avx2); - p.pu[LUMA_32x32].addAvg = PFX(addAvg_32x32_avx2); - p.pu[LUMA_32x64].addAvg = PFX(addAvg_32x64_avx2); - p.pu[LUMA_48x64].addAvg = PFX(addAvg_48x64_avx2); - p.pu[LUMA_64x16].addAvg = PFX(addAvg_64x16_avx2); - p.pu[LUMA_64x32].addAvg = PFX(addAvg_64x32_avx2); - p.pu[LUMA_64x48].addAvg = PFX(addAvg_64x48_avx2); - p.pu[LUMA_64x64].addAvg = PFX(addAvg_64x64_avx2); - - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].addAvg = PFX(addAvg_8x2_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].addAvg = PFX(addAvg_8x4_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].addAvg = PFX(addAvg_8x6_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].addAvg = PFX(addAvg_8x8_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].addAvg = PFX(addAvg_8x16_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].addAvg = PFX(addAvg_8x32_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].addAvg = PFX(addAvg_12x16_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg = PFX(addAvg_16x4_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg = PFX(addAvg_16x8_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg = PFX(addAvg_16x12_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg = PFX(addAvg_16x16_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg = PFX(addAvg_16x32_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg = PFX(addAvg_32x8_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg = PFX(addAvg_32x16_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg = PFX(addAvg_32x24_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg = PFX(addAvg_32x32_avx2); - - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].addAvg = PFX(addAvg_8x16_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg = PFX(addAvg_16x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg = PFX(addAvg_32x64_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].addAvg = PFX(addAvg_8x8_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg = PFX(addAvg_16x16_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].addAvg = PFX(addAvg_8x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg = PFX(addAvg_32x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg = PFX(addAvg_16x64_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].addAvg = PFX(addAvg_8x12_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].addAvg = PFX(addAvg_8x4_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg = PFX(addAvg_16x24_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg = PFX(addAvg_16x8_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].addAvg = PFX(addAvg_8x64_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].addAvg = PFX(addAvg_24x64_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].addAvg = PFX(addAvg_12x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg = PFX(addAvg_32x16_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg = PFX(addAvg_32x48_avx2); + ASSIGN2(p.pu[LUMA_12x16].pixelavg_pp, pixel_avg_12x16_avx2); + ASSIGN2(p.pu[LUMA_16x4].pixelavg_pp, pixel_avg_16x4_avx2); + ASSIGN2(p.pu[LUMA_16x8].pixelavg_pp, pixel_avg_16x8_avx2); + ASSIGN2(p.pu[LUMA_16x12].pixelavg_pp, pixel_avg_16x12_avx2); + ASSIGN2(p.pu[LUMA_16x16].pixelavg_pp, pixel_avg_16x16_avx2); + ASSIGN2(p.pu[LUMA_16x32].pixelavg_pp, pixel_avg_16x32_avx2); + ASSIGN2(p.pu[LUMA_16x64].pixelavg_pp, pixel_avg_16x64_avx2); + ASSIGN2(p.pu[LUMA_24x32].pixelavg_pp, pixel_avg_24x32_avx2); + ASSIGN2(p.pu[LUMA_32x8].pixelavg_pp, 
pixel_avg_32x8_avx2); + ASSIGN2(p.pu[LUMA_32x16].pixelavg_pp, pixel_avg_32x16_avx2); + ASSIGN2(p.pu[LUMA_32x24].pixelavg_pp, pixel_avg_32x24_avx2); + ASSIGN2(p.pu[LUMA_32x32].pixelavg_pp, pixel_avg_32x32_avx2); + ASSIGN2(p.pu[LUMA_32x64].pixelavg_pp, pixel_avg_32x64_avx2); + ASSIGN2(p.pu[LUMA_64x16].pixelavg_pp, pixel_avg_64x16_avx2); + ASSIGN2(p.pu[LUMA_64x32].pixelavg_pp, pixel_avg_64x32_avx2); + ASSIGN2(p.pu[LUMA_64x48].pixelavg_pp, pixel_avg_64x48_avx2); + ASSIGN2(p.pu[LUMA_64x64].pixelavg_pp, pixel_avg_64x64_avx2); + ASSIGN2(p.pu[LUMA_48x64].pixelavg_pp, pixel_avg_48x64_avx2); + ASSIGN2(p.pu[LUMA_8x4].addAvg, addAvg_8x4_avx2); + ASSIGN2(p.pu[LUMA_8x8].addAvg, addAvg_8x8_avx2); + ASSIGN2(p.pu[LUMA_8x16].addAvg, addAvg_8x16_avx2); + ASSIGN2(p.pu[LUMA_8x32].addAvg, addAvg_8x32_avx2); + ASSIGN2(p.pu[LUMA_12x16].addAvg, addAvg_12x16_avx2); + ASSIGN2(p.pu[LUMA_16x4].addAvg, addAvg_16x4_avx2); + ASSIGN2(p.pu[LUMA_16x8].addAvg, addAvg_16x8_avx2); + ASSIGN2(p.pu[LUMA_16x12].addAvg, addAvg_16x12_avx2); + ASSIGN2(p.pu[LUMA_16x16].addAvg, addAvg_16x16_avx2); + ASSIGN2(p.pu[LUMA_16x32].addAvg, addAvg_16x32_avx2); + ASSIGN2(p.pu[LUMA_16x64].addAvg, addAvg_16x64_avx2); + ASSIGN2(p.pu[LUMA_24x32].addAvg, addAvg_24x32_avx2); + ASSIGN2(p.pu[LUMA_32x8].addAvg, addAvg_32x8_avx2); + ASSIGN2(p.pu[LUMA_32x16].addAvg, addAvg_32x16_avx2); + ASSIGN2(p.pu[LUMA_32x24].addAvg, addAvg_32x24_avx2); + ASSIGN2(p.pu[LUMA_32x32].addAvg, addAvg_32x32_avx2); + ASSIGN2(p.pu[LUMA_32x64].addAvg, addAvg_32x64_avx2); + ASSIGN2(p.pu[LUMA_48x64].addAvg, addAvg_48x64_avx2); + ASSIGN2(p.pu[LUMA_64x16].addAvg, addAvg_64x16_avx2); + ASSIGN2(p.pu[LUMA_64x32].addAvg, addAvg_64x32_avx2); + ASSIGN2(p.pu[LUMA_64x48].addAvg, addAvg_64x48_avx2); + ASSIGN2(p.pu[LUMA_64x64].addAvg, addAvg_64x64_avx2); + + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].addAvg, addAvg_8x2_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].addAvg, addAvg_8x4_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].addAvg, addAvg_8x6_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].addAvg, addAvg_8x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].addAvg, addAvg_8x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].addAvg, addAvg_8x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].addAvg, addAvg_12x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg, addAvg_16x4_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg, addAvg_16x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg, addAvg_16x12_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg, addAvg_16x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg, addAvg_16x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg, addAvg_32x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg, addAvg_32x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg, addAvg_32x24_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg, addAvg_32x32_avx2); + + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].addAvg, addAvg_8x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg, addAvg_16x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg, addAvg_32x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].addAvg, addAvg_8x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg,addAvg_16x16_avx2); + 
ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].addAvg, addAvg_8x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg, addAvg_32x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg, addAvg_16x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].addAvg, addAvg_8x12_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].addAvg, addAvg_8x4_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg, addAvg_16x24_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg, addAvg_16x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].addAvg, addAvg_8x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].addAvg, addAvg_24x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].addAvg, addAvg_12x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg, addAvg_32x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg, addAvg_32x48_avx2); p.cu[BLOCK_4x4].psy_cost_pp = PFX(psyCost_pp_4x4_avx2); p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar16_avx2); @@ -1537,9 +1681,8 @@ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd = PFX(pixel_satd_16x8_avx2); p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = PFX(pixel_satd_32x16_avx2); - p.cu[BLOCK_16x16].ssd_s = PFX(pixel_ssd_s_16_avx2); - p.cu[BLOCK_32x32].ssd_s = PFX(pixel_ssd_s_32_avx2); - + ASSIGN2( p.cu[BLOCK_16x16].ssd_s,pixel_ssd_s_16_avx2); + ASSIGN2( p.cu[BLOCK_32x32].ssd_s,pixel_ssd_s_32_avx2); p.cu[BLOCK_16x16].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_16x16_avx2); p.cu[BLOCK_32x32].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_32x32_avx2); p.cu[BLOCK_64x64].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_64x64_avx2); @@ -1555,7 +1698,7 @@ p.idst4x4 = PFX(idst4_avx2); p.denoiseDct = PFX(denoise_dct_avx2); - p.scale1D_128to64 = PFX(scale1D_128to64_avx2); + ASSIGN2(p.scale1D_128to64, scale1D_128to64_avx2); p.scale2D_64to32 = PFX(scale2D_64to32_avx2); p.weight_pp = PFX(weight_pp_avx2); @@ -1563,16 +1706,15 @@ p.sign = PFX(calSign_avx2); p.planecopy_cp = PFX(upShift_8_avx2); - p.cu[BLOCK_16x16].calcresidual = PFX(getResidual16_avx2); - p.cu[BLOCK_32x32].calcresidual = PFX(getResidual32_avx2); - - p.cu[BLOCK_16x16].blockfill_s = PFX(blockfill_s_16x16_avx2); - p.cu[BLOCK_32x32].blockfill_s = PFX(blockfill_s_32x32_avx2); + ASSIGN2(p.cu[BLOCK_16x16].calcresidual, getResidual16_avx2); + ASSIGN2(p.cu[BLOCK_32x32].calcresidual, getResidual32_avx2); + ASSIGN2(p.cu[BLOCK_16x16].blockfill_s, blockfill_s_16x16_avx2); + ASSIGN2(p.cu[BLOCK_32x32].blockfill_s, blockfill_s_32x32_avx2); ALL_LUMA_TU(count_nonzero, count_nonzero, avx2); - ALL_LUMA_TU_S(cpy1Dto2D_shl, cpy1Dto2D_shl_, avx2); + ALL_LUMA_TU_S(cpy1Dto2D_shl[ALIGNED], cpy1Dto2D_shl_, avx2); + ALL_LUMA_TU_S(cpy1Dto2D_shl[NONALIGNED], cpy1Dto2D_shl_, avx2); ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, avx2); - p.cu[BLOCK_8x8].copy_cnt = PFX(copy_cnt_8_avx2); p.cu[BLOCK_16x16].copy_cnt = PFX(copy_cnt_16_avx2); p.cu[BLOCK_32x32].copy_cnt = PFX(copy_cnt_32_avx2); @@ -1596,13 +1738,13 @@ ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, avx2); p.pu[LUMA_4x4].luma_vsp = PFX(interp_8tap_vert_sp_4x4_avx2); // since ALL_LUMA_PU didn't declare 4x4 size, calling separately luma_vsp function to use - p.cu[BLOCK_16x16].add_ps = PFX(pixel_add_ps_16x16_avx2); - p.cu[BLOCK_32x32].add_ps = PFX(pixel_add_ps_32x32_avx2); - p.cu[BLOCK_64x64].add_ps = PFX(pixel_add_ps_64x64_avx2); - p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = PFX(pixel_add_ps_16x16_avx2); - 
p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = PFX(pixel_add_ps_32x32_avx2); - p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = PFX(pixel_add_ps_16x32_avx2); - p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = PFX(pixel_add_ps_32x64_avx2); + ASSIGN2(p.cu[BLOCK_16x16].add_ps, pixel_add_ps_16x16_avx2); + ASSIGN2(p.cu[BLOCK_32x32].add_ps, pixel_add_ps_32x32_avx2); + ASSIGN2(p.cu[BLOCK_64x64].add_ps, pixel_add_ps_64x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps, pixel_add_ps_16x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps, pixel_add_ps_32x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps, pixel_add_ps_16x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps, pixel_add_ps_32x64_avx2); p.cu[BLOCK_16x16].sub_ps = PFX(pixel_sub_ps_16x16_avx2); p.cu[BLOCK_32x32].sub_ps = PFX(pixel_sub_ps_32x32_avx2); @@ -1663,44 +1805,45 @@ p.pu[LUMA_64x48].sad_x4 = PFX(pixel_sad_x4_64x48_avx2); p.pu[LUMA_64x64].sad_x4 = PFX(pixel_sad_x4_64x64_avx2); - p.pu[LUMA_16x4].convert_p2s = PFX(filterPixelToShort_16x4_avx2); - p.pu[LUMA_16x8].convert_p2s = PFX(filterPixelToShort_16x8_avx2); - p.pu[LUMA_16x12].convert_p2s = PFX(filterPixelToShort_16x12_avx2); - p.pu[LUMA_16x16].convert_p2s = PFX(filterPixelToShort_16x16_avx2); - p.pu[LUMA_16x32].convert_p2s = PFX(filterPixelToShort_16x32_avx2); - p.pu[LUMA_16x64].convert_p2s = PFX(filterPixelToShort_16x64_avx2); - p.pu[LUMA_32x8].convert_p2s = PFX(filterPixelToShort_32x8_avx2); - p.pu[LUMA_32x16].convert_p2s = PFX(filterPixelToShort_32x16_avx2); - p.pu[LUMA_32x24].convert_p2s = PFX(filterPixelToShort_32x24_avx2); - p.pu[LUMA_32x32].convert_p2s = PFX(filterPixelToShort_32x32_avx2); - p.pu[LUMA_32x64].convert_p2s = PFX(filterPixelToShort_32x64_avx2); - p.pu[LUMA_64x16].convert_p2s = PFX(filterPixelToShort_64x16_avx2); - p.pu[LUMA_64x32].convert_p2s = PFX(filterPixelToShort_64x32_avx2); - p.pu[LUMA_64x48].convert_p2s = PFX(filterPixelToShort_64x48_avx2); - p.pu[LUMA_64x64].convert_p2s = PFX(filterPixelToShort_64x64_avx2); - p.pu[LUMA_24x32].convert_p2s = PFX(filterPixelToShort_24x32_avx2); - p.pu[LUMA_48x64].convert_p2s = PFX(filterPixelToShort_48x64_avx2); - - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = PFX(filterPixelToShort_16x4_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = PFX(filterPixelToShort_16x8_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = PFX(filterPixelToShort_16x12_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = PFX(filterPixelToShort_16x16_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = PFX(filterPixelToShort_16x32_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = PFX(filterPixelToShort_24x32_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = PFX(filterPixelToShort_32x8_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = PFX(filterPixelToShort_32x16_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = PFX(filterPixelToShort_32x24_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = PFX(filterPixelToShort_32x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = PFX(filterPixelToShort_16x8_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = PFX(filterPixelToShort_16x16_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = PFX(filterPixelToShort_16x24_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = PFX(filterPixelToShort_16x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = PFX(filterPixelToShort_16x64_avx2); - 
p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = PFX(filterPixelToShort_24x64_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = PFX(filterPixelToShort_32x16_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = PFX(filterPixelToShort_32x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = PFX(filterPixelToShort_32x48_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = PFX(filterPixelToShort_32x64_avx2); + ASSIGN2(p.pu[LUMA_16x4].convert_p2s, filterPixelToShort_16x4_avx2); + ASSIGN2(p.pu[LUMA_16x8].convert_p2s, filterPixelToShort_16x8_avx2); + ASSIGN2(p.pu[LUMA_16x12].convert_p2s, filterPixelToShort_16x12_avx2); + ASSIGN2(p.pu[LUMA_16x16].convert_p2s, filterPixelToShort_16x16_avx2); + ASSIGN2(p.pu[LUMA_16x32].convert_p2s, filterPixelToShort_16x32_avx2); + ASSIGN2(p.pu[LUMA_16x64].convert_p2s, filterPixelToShort_16x64_avx2); + ASSIGN2(p.pu[LUMA_32x8].convert_p2s, filterPixelToShort_32x8_avx2); + ASSIGN2(p.pu[LUMA_32x16].convert_p2s, filterPixelToShort_32x16_avx2); + ASSIGN2(p.pu[LUMA_32x24].convert_p2s, filterPixelToShort_32x24_avx2); + ASSIGN2(p.pu[LUMA_32x32].convert_p2s, filterPixelToShort_32x32_avx2); + ASSIGN2(p.pu[LUMA_32x64].convert_p2s, filterPixelToShort_32x64_avx2); + ASSIGN2(p.pu[LUMA_64x16].convert_p2s, filterPixelToShort_64x16_avx2); + ASSIGN2(p.pu[LUMA_64x32].convert_p2s, filterPixelToShort_64x32_avx2); + ASSIGN2(p.pu[LUMA_64x48].convert_p2s, filterPixelToShort_64x48_avx2); + ASSIGN2(p.pu[LUMA_64x64].convert_p2s, filterPixelToShort_64x64_avx2); + ASSIGN2(p.pu[LUMA_24x32].convert_p2s, filterPixelToShort_24x32_avx2); + ASSIGN2(p.pu[LUMA_48x64].convert_p2s, filterPixelToShort_48x64_avx2); + + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s, filterPixelToShort_16x4_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s, filterPixelToShort_16x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s, filterPixelToShort_16x12_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s, filterPixelToShort_16x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s, filterPixelToShort_16x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s, filterPixelToShort_24x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s, filterPixelToShort_32x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s, filterPixelToShort_32x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s, filterPixelToShort_32x24_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s, filterPixelToShort_32x32_avx2); + + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s, filterPixelToShort_16x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s, filterPixelToShort_16x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s, filterPixelToShort_16x24_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s, filterPixelToShort_16x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s, filterPixelToShort_16x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s, filterPixelToShort_24x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s, filterPixelToShort_32x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s, filterPixelToShort_32x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s, filterPixelToShort_32x48_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s, filterPixelToShort_32x64_avx2); p.pu[LUMA_4x4].luma_hps = 
PFX(interp_8tap_horiz_ps_4x4_avx2); p.pu[LUMA_4x8].luma_hps = PFX(interp_8tap_horiz_ps_4x8_avx2); @@ -2167,6 +2310,14 @@ p.integral_inith[INTEGRAL_8] = PFX(integral8h_avx2); p.integral_inith[INTEGRAL_12] = PFX(integral12h_avx2); p.integral_inith[INTEGRAL_16] = PFX(integral16h_avx2); + p.cu[BLOCK_4x4].nonPsyRdoQuant = PFX(nonPsyRdoQuant4_avx2); + p.cu[BLOCK_8x8].nonPsyRdoQuant = PFX(nonPsyRdoQuant8_avx2); + p.cu[BLOCK_16x16].nonPsyRdoQuant = PFX(nonPsyRdoQuant16_avx2); + p.cu[BLOCK_32x32].nonPsyRdoQuant = PFX(nonPsyRdoQuant32_avx2); + p.cu[BLOCK_4x4].psyRdoQuant_1p = PFX(psyRdoQuant_1p4_avx2); + p.cu[BLOCK_8x8].psyRdoQuant_1p = PFX(psyRdoQuant_1p8_avx2); + p.cu[BLOCK_16x16].psyRdoQuant_1p = PFX(psyRdoQuant_1p16_avx2); + p.cu[BLOCK_32x32].psyRdoQuant_1p = PFX(psyRdoQuant_1p32_avx2); /* TODO: This kernel needs to be modified to work with HIGH_BIT_DEPTH only p.planeClipAndMax = PFX(planeClipAndMax_avx2); */ @@ -2188,6 +2339,844 @@ p.costCoeffNxN = PFX(costCoeffNxN_avx2_bmi2); } } + if (cpuMask & X265_CPU_AVX512) + { + p.cu[BLOCK_16x16].var = PFX(pixel_var_16x16_avx512); + p.cu[BLOCK_32x32].calcresidual[NONALIGNED] = PFX(getResidual32_avx512); + p.cu[BLOCK_32x32].calcresidual[ALIGNED] = PFX(getResidual_aligned32_avx512); + p.cu[BLOCK_64x64].sub_ps = PFX(pixel_sub_ps_64x64_avx512); + p.cu[BLOCK_32x32].sub_ps = PFX(pixel_sub_ps_32x32_avx512); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = PFX(pixel_sub_ps_32x32_avx512); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = PFX(pixel_sub_ps_32x64_avx512); + + p.cu[BLOCK_64x64].add_ps[NONALIGNED] = PFX(pixel_add_ps_64x64_avx512); + p.cu[BLOCK_32x32].add_ps[NONALIGNED] = PFX(pixel_add_ps_32x32_avx512); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps[NONALIGNED] = PFX(pixel_add_ps_32x32_avx512); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps[NONALIGNED] = PFX(pixel_add_ps_32x64_avx512); + + p.cu[BLOCK_32x32].add_ps[ALIGNED] = PFX(pixel_add_ps_aligned_32x32_avx512); + p.cu[BLOCK_64x64].add_ps[ALIGNED] = PFX(pixel_add_ps_aligned_64x64_avx512); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps[ALIGNED] = PFX(pixel_add_ps_aligned_32x32_avx512); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps[ALIGNED] = PFX(pixel_add_ps_aligned_32x64_avx512); + + // 64 X N + p.cu[BLOCK_64x64].copy_ss = PFX(blockcopy_ss_64x64_avx512); + p.pu[LUMA_64x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x64_avx512); + p.pu[LUMA_64x48].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x48_avx512); + p.pu[LUMA_64x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x32_avx512); + p.pu[LUMA_64x16].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x16_avx512); + p.cu[BLOCK_64x64].copy_ps = (copy_ps_t)PFX(blockcopy_ss_64x64_avx512); + p.cu[BLOCK_64x64].copy_sp = (copy_sp_t)PFX(blockcopy_ss_64x64_avx512); + + // 32 X N + p.cu[BLOCK_32x32].copy_ss = PFX(blockcopy_ss_32x32_avx512); + p.pu[LUMA_32x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x64_avx512); + p.pu[LUMA_32x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x32_avx512); + p.pu[LUMA_32x24].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x24_avx512); + p.pu[LUMA_32x16].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x16_avx512); + p.pu[LUMA_32x8].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].copy_pp = 
(copy_pp_t)PFX(blockcopy_ss_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x64_avx512); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ss = PFX(blockcopy_ss_32x32_avx512); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ss = PFX(blockcopy_ss_32x64_avx512); + p.cu[BLOCK_32x32].copy_ps = (copy_ps_t)PFX(blockcopy_ss_32x32_avx512); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ps = (copy_ps_t)PFX(blockcopy_ss_32x32_avx512); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ps = (copy_ps_t)PFX(blockcopy_ss_32x64_avx512); + p.cu[BLOCK_32x32].copy_sp = (copy_sp_t)PFX(blockcopy_ss_32x32_avx512); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_sp = (copy_sp_t)PFX(blockcopy_ss_32x32_avx512); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_sp = (copy_sp_t)PFX(blockcopy_ss_32x64_avx512); + + p.pu[LUMA_64x16].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_64x16_avx512); + p.pu[LUMA_64x32].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_64x32_avx512); + p.pu[LUMA_64x48].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_64x48_avx512); + p.pu[LUMA_64x64].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_64x64_avx512); + p.pu[LUMA_32x8].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_32x8_avx512); + p.pu[LUMA_32x16].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_32x16_avx512); + p.pu[LUMA_32x24].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_32x24_avx512); + p.pu[LUMA_32x32].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_32x32_avx512); + p.pu[LUMA_32x64].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_32x64_avx512); + p.pu[LUMA_48x64].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_48x64_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s[ALIGNED] = PFX(filterPixelToShort_2x4_sse4); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s[ALIGNED] = PFX(filterPixelToShort_2x8_sse4); + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s[ALIGNED] = PFX(filterPixelToShort_6x8_sse4); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s[NONALIGNED] = PFX(filterPixelToShort_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s[NONALIGNED] = PFX(filterPixelToShort_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s[NONALIGNED] = PFX(filterPixelToShort_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s[NONALIGNED] = PFX(filterPixelToShort_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s[NONALIGNED] = PFX(filterPixelToShort_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s[NONALIGNED] = PFX(filterPixelToShort_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s[NONALIGNED] = PFX(filterPixelToShort_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s[NONALIGNED] = PFX(filterPixelToShort_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s[ALIGNED] = PFX(filterPixelToShort_2x8_sse4); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s[ALIGNED] = PFX(filterPixelToShort_2x16_sse4); + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s[ALIGNED] = PFX(filterPixelToShort_6x16_sse4); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].p2s[NONALIGNED] = PFX(filterPixelToShort_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].p2s[NONALIGNED] = 
PFX(filterPixelToShort_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].p2s[NONALIGNED] = PFX(filterPixelToShort_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].p2s[NONALIGNED] = PFX(filterPixelToShort_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].p2s[NONALIGNED] = PFX(filterPixelToShort_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].p2s[NONALIGNED] = PFX(filterPixelToShort_64x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].p2s[NONALIGNED] = PFX(filterPixelToShort_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].p2s[NONALIGNED] = PFX(filterPixelToShort_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].p2s[NONALIGNED] = PFX(filterPixelToShort_64x64_avx512); + + p.pu[LUMA_64x16].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x16_avx512); + p.pu[LUMA_64x32].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x32_avx512); + p.pu[LUMA_64x48].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x48_avx512); + p.pu[LUMA_64x64].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x64_avx512); + p.pu[LUMA_32x8].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x8_avx512); + p.pu[LUMA_32x16].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x16_avx512); + p.pu[LUMA_32x24].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x24_avx512); + p.pu[LUMA_32x32].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x32_avx512); + p.pu[LUMA_32x64].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x64_avx512); + p.pu[LUMA_48x64].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_48x64_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x32_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x64_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_32x8].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x64_avx512); + p.cu[BLOCK_32x32].ssd_s[NONALIGNED] = PFX(pixel_ssd_s_32_avx512); + p.cu[BLOCK_32x32].ssd_s[ALIGNED] = PFX(pixel_ssd_s_aligned_32_avx512); + 
p.cu[BLOCK_16x16].ssd_s[NONALIGNED] = PFX(pixel_ssd_s_16_avx512); + p.cu[BLOCK_16x16].ssd_s[ALIGNED] = PFX(pixel_ssd_s_aligned_16_avx512); + p.pu[LUMA_16x32].sad = PFX(pixel_sad_16x32_avx512); + p.pu[LUMA_16x64].sad = PFX(pixel_sad_16x64_avx512); + p.pu[LUMA_32x8].sad = PFX(pixel_sad_32x8_avx512); + p.pu[LUMA_32x16].sad = PFX(pixel_sad_32x16_avx512); + p.pu[LUMA_32x24].sad = PFX(pixel_sad_32x24_avx512); + p.pu[LUMA_32x32].sad = PFX(pixel_sad_32x32_avx512); + p.pu[LUMA_32x64].sad = PFX(pixel_sad_32x64_avx512); + p.pu[LUMA_48x64].sad = PFX(pixel_sad_48x64_avx512); + p.pu[LUMA_64x16].sad = PFX(pixel_sad_64x16_avx512); + p.pu[LUMA_64x32].sad = PFX(pixel_sad_64x32_avx512); + p.pu[LUMA_64x48].sad = PFX(pixel_sad_64x48_avx512); + p.pu[LUMA_64x64].sad = PFX(pixel_sad_64x64_avx512); + + p.pu[LUMA_64x16].addAvg[NONALIGNED] = PFX(addAvg_64x16_avx512); + p.pu[LUMA_64x32].addAvg[NONALIGNED] = PFX(addAvg_64x32_avx512); + p.pu[LUMA_64x48].addAvg[NONALIGNED] = PFX(addAvg_64x48_avx512); + p.pu[LUMA_64x64].addAvg[NONALIGNED] = PFX(addAvg_64x64_avx512); + p.pu[LUMA_32x8].addAvg[NONALIGNED] = PFX(addAvg_32x8_avx512); + p.pu[LUMA_32x16].addAvg[NONALIGNED] = PFX(addAvg_32x16_avx512); + p.pu[LUMA_32x24].addAvg[NONALIGNED] = PFX(addAvg_32x24_avx512); + p.pu[LUMA_32x32].addAvg[NONALIGNED] = PFX(addAvg_32x32_avx512); + p.pu[LUMA_32x64].addAvg[NONALIGNED] = PFX(addAvg_32x64_avx512); + p.pu[LUMA_16x4].addAvg[NONALIGNED] = PFX(addAvg_16x4_avx512); + p.pu[LUMA_16x8].addAvg[NONALIGNED] = PFX(addAvg_16x8_avx512); + p.pu[LUMA_16x12].addAvg[NONALIGNED] = PFX(addAvg_16x12_avx512); + p.pu[LUMA_16x16].addAvg[NONALIGNED] = PFX(addAvg_16x16_avx512); + p.pu[LUMA_16x32].addAvg[NONALIGNED] = PFX(addAvg_16x32_avx512); + p.pu[LUMA_16x64].addAvg[NONALIGNED] = PFX(addAvg_16x64_avx512); + p.pu[LUMA_48x64].addAvg[NONALIGNED] = PFX(addAvg_48x64_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg[NONALIGNED] = PFX(addAvg_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg[NONALIGNED] = PFX(addAvg_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg[NONALIGNED] = PFX(addAvg_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg[NONALIGNED] = PFX(addAvg_32x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg[NONALIGNED] = PFX(addAvg_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg[NONALIGNED] = PFX(addAvg_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg[NONALIGNED] = PFX(addAvg_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg[NONALIGNED] = PFX(addAvg_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg[NONALIGNED] = PFX(addAvg_16x32_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg[NONALIGNED] = PFX(addAvg_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg[NONALIGNED] = PFX(addAvg_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg[NONALIGNED] = PFX(addAvg_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg[NONALIGNED] = PFX(addAvg_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg[NONALIGNED] = PFX(addAvg_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg[NONALIGNED] = PFX(addAvg_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg[NONALIGNED] = PFX(addAvg_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg[NONALIGNED] = PFX(addAvg_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg[NONALIGNED] = PFX(addAvg_16x8_avx512); + 
p.pu[LUMA_32x8].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x8_avx512); + p.pu[LUMA_32x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x16_avx512); + p.pu[LUMA_32x24].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x24_avx512); + p.pu[LUMA_32x32].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x32_avx512); + p.pu[LUMA_32x64].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_32x64_avx512); + p.pu[LUMA_64x16].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_64x16_avx512); + p.pu[LUMA_64x32].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_64x32_avx512); + p.pu[LUMA_64x48].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_64x48_avx512); + p.pu[LUMA_64x64].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_64x64_avx512); + p.pu[LUMA_48x64].pixelavg_pp[NONALIGNED] = PFX(pixel_avg_48x64_avx512); + + p.pu[LUMA_32x8].pixelavg_pp[ALIGNED] = PFX(pixel_avg_aligned_32x8_avx512); + p.pu[LUMA_32x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_aligned_32x16_avx512); + p.pu[LUMA_32x24].pixelavg_pp[ALIGNED] = PFX(pixel_avg_aligned_32x24_avx512); + p.pu[LUMA_32x32].pixelavg_pp[ALIGNED] = PFX(pixel_avg_aligned_32x32_avx512); + p.pu[LUMA_32x64].pixelavg_pp[ALIGNED] = PFX(pixel_avg_aligned_32x64_avx512); + p.pu[LUMA_48x64].pixelavg_pp[ALIGNED] = PFX(pixel_avg_aligned_48x64_avx512); + p.pu[LUMA_64x16].pixelavg_pp[ALIGNED] = PFX(pixel_avg_aligned_64x16_avx512); + p.pu[LUMA_64x32].pixelavg_pp[ALIGNED] = PFX(pixel_avg_aligned_64x32_avx512); + p.pu[LUMA_64x48].pixelavg_pp[ALIGNED] = PFX(pixel_avg_aligned_64x48_avx512); + p.pu[LUMA_64x64].pixelavg_pp[ALIGNED] = PFX(pixel_avg_aligned_64x64_avx512); + p.pu[LUMA_16x8].sad_x3 = PFX(pixel_sad_x3_16x8_avx512); + p.pu[LUMA_16x12].sad_x3 = PFX(pixel_sad_x3_16x12_avx512); + p.pu[LUMA_16x16].sad_x3 = PFX(pixel_sad_x3_16x16_avx512); + p.pu[LUMA_16x32].sad_x3 = PFX(pixel_sad_x3_16x32_avx512); + p.pu[LUMA_16x64].sad_x3 = PFX(pixel_sad_x3_16x64_avx512); + p.pu[LUMA_32x8].sad_x3 = PFX(pixel_sad_x3_32x8_avx512); + p.pu[LUMA_32x16].sad_x3 = PFX(pixel_sad_x3_32x16_avx512); + p.pu[LUMA_32x24].sad_x3 = PFX(pixel_sad_x3_32x24_avx512); + p.pu[LUMA_32x32].sad_x3 = PFX(pixel_sad_x3_32x32_avx512); + p.pu[LUMA_32x64].sad_x3 = PFX(pixel_sad_x3_32x64_avx512); + //p.pu[LUMA_48x64].sad_x3 = PFX(pixel_sad_x3_48x64_avx512); + p.pu[LUMA_64x16].sad_x3 = PFX(pixel_sad_x3_64x16_avx512); + p.pu[LUMA_64x32].sad_x3 = PFX(pixel_sad_x3_64x32_avx512); + p.pu[LUMA_64x48].sad_x3 = PFX(pixel_sad_x3_64x48_avx512); + p.pu[LUMA_64x64].sad_x3 = PFX(pixel_sad_x3_64x64_avx512); + + p.pu[LUMA_16x8].sad_x4 = PFX(pixel_sad_x4_16x8_avx512); + p.pu[LUMA_16x12].sad_x4 = PFX(pixel_sad_x4_16x12_avx512); + p.pu[LUMA_16x16].sad_x4 = PFX(pixel_sad_x4_16x16_avx512); + p.pu[LUMA_16x32].sad_x4 = PFX(pixel_sad_x4_16x32_avx512); + p.pu[LUMA_16x64].sad_x4 = PFX(pixel_sad_x4_16x64_avx512); + p.pu[LUMA_32x8].sad_x4 = PFX(pixel_sad_x4_32x8_avx512); + p.pu[LUMA_32x16].sad_x4 = PFX(pixel_sad_x4_32x16_avx512); + p.pu[LUMA_32x24].sad_x4 = PFX(pixel_sad_x4_32x24_avx512); + p.pu[LUMA_32x32].sad_x4 = PFX(pixel_sad_x4_32x32_avx512); + p.pu[LUMA_32x64].sad_x4 = PFX(pixel_sad_x4_32x64_avx512); + //p.pu[LUMA_48x64].sad_x4 = PFX(pixel_sad_x4_48x64_avx512); + p.pu[LUMA_64x16].sad_x4 = PFX(pixel_sad_x4_64x16_avx512); + p.pu[LUMA_64x32].sad_x4 = PFX(pixel_sad_x4_64x32_avx512); + p.pu[LUMA_64x48].sad_x4 = PFX(pixel_sad_x4_64x48_avx512); + p.pu[LUMA_64x64].sad_x4 = PFX(pixel_sad_x4_64x64_avx512); + p.cu[BLOCK_16x16].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_16_avx512); + p.cu[BLOCK_32x32].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_32_avx512); + p.cu[BLOCK_32x32].cpy1Dto2D_shl[NONALIGNED] = PFX(cpy1Dto2D_shl_32_avx512); + 
p.cu[BLOCK_32x32].cpy1Dto2D_shl[ALIGNED] = PFX(cpy1Dto2D_shl_aligned_32_avx512); + p.cu[BLOCK_16x16].cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_16_avx512); + p.cu[BLOCK_32x32].cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_32_avx512); + + p.cu[BLOCK_16x16].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_16_avx512); + p.cu[BLOCK_32x32].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_32_avx512); + + p.weight_pp = PFX(weight_pp_avx512); + p.weight_sp = PFX(weight_sp_avx512); + p.dequant_normal = PFX(dequant_normal_avx512); + p.dequant_scaling = PFX(dequant_scaling_avx512); + p.cu[BLOCK_32x32].copy_cnt = PFX(copy_cnt_32_avx512); + p.cu[BLOCK_16x16].copy_cnt = PFX(copy_cnt_16_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_hpp = PFX(interp_4tap_horiz_pp_8x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hpp = PFX(interp_4tap_horiz_pp_8x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_hpp = PFX(interp_4tap_horiz_pp_8x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_hpp = PFX(interp_4tap_horiz_pp_8x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hpp = PFX(interp_4tap_horiz_pp_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_hpp = PFX(interp_4tap_horiz_pp_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hpp = PFX(interp_4tap_horiz_pp_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hpp = PFX(interp_4tap_horiz_pp_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hpp = PFX(interp_4tap_horiz_pp_24x32_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hpp = PFX(interp_4tap_horiz_pp_8x4_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hpp = PFX(interp_4tap_horiz_pp_8x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hpp = PFX(interp_4tap_horiz_pp_8x12_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hpp = PFX(interp_4tap_horiz_pp_8x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hpp = PFX(interp_4tap_horiz_pp_8x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hpp = PFX(interp_4tap_horiz_pp_8x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hpp = PFX(interp_4tap_horiz_pp_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hpp = PFX(interp_4tap_horiz_pp_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hpp = PFX(interp_4tap_horiz_pp_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hpp = PFX(interp_4tap_horiz_pp_32x64_avx512); + 
p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hpp = PFX(interp_4tap_horiz_pp_24x64_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hpp = PFX(interp_4tap_horiz_pp_8x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hpp = PFX(interp_4tap_horiz_pp_8x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hpp = PFX(interp_4tap_horiz_pp_8x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hpp = PFX(interp_4tap_horiz_pp_8x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hpp = PFX(interp_4tap_horiz_pp_16x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hpp = PFX(interp_4tap_horiz_pp_16x12_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hpp = PFX(interp_4tap_horiz_pp_16x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hpp = PFX(interp_4tap_horiz_pp_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hpp = PFX(interp_4tap_horiz_pp_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hpp = PFX(interp_4tap_horiz_pp_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hpp = PFX(interp_4tap_horiz_pp_64x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hpp = PFX(interp_4tap_horiz_pp_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hpp = PFX(interp_4tap_horiz_pp_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hpp = PFX(interp_4tap_horiz_pp_64x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hpp = PFX(interp_4tap_horiz_pp_48x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hpp = PFX(interp_4tap_horiz_pp_24x32_avx512); + + p.pu[LUMA_16x4].addAvg[ALIGNED] = PFX(addAvg_aligned_16x4_avx512); + p.pu[LUMA_16x8].addAvg[ALIGNED] = PFX(addAvg_aligned_16x8_avx512); + p.pu[LUMA_16x12].addAvg[ALIGNED] = PFX(addAvg_aligned_16x12_avx512); + p.pu[LUMA_16x16].addAvg[ALIGNED] = PFX(addAvg_aligned_16x16_avx512); + p.pu[LUMA_16x32].addAvg[ALIGNED] = PFX(addAvg_aligned_16x32_avx512); + p.pu[LUMA_16x64].addAvg[ALIGNED] = PFX(addAvg_aligned_16x64_avx512); + p.pu[LUMA_48x64].addAvg[ALIGNED] = PFX(addAvg_aligned_48x64_avx512); + p.pu[LUMA_32x8].addAvg[ALIGNED] = PFX(addAvg_aligned_32x8_avx512); + p.pu[LUMA_32x16].addAvg[ALIGNED] = PFX(addAvg_aligned_32x16_avx512); + p.pu[LUMA_32x24].addAvg[ALIGNED] = PFX(addAvg_aligned_32x24_avx512); + p.pu[LUMA_32x32].addAvg[ALIGNED] = PFX(addAvg_aligned_32x32_avx512); + p.pu[LUMA_32x64].addAvg[ALIGNED] = PFX(addAvg_aligned_32x64_avx512); + p.pu[LUMA_64x16].addAvg[ALIGNED] = PFX(addAvg_aligned_64x16_avx512); + p.pu[LUMA_64x32].addAvg[ALIGNED] = PFX(addAvg_aligned_64x32_avx512); + p.pu[LUMA_64x48].addAvg[ALIGNED] = PFX(addAvg_aligned_64x48_avx512); + p.pu[LUMA_64x64].addAvg[ALIGNED] = PFX(addAvg_aligned_64x64_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg[ALIGNED] = PFX(addAvg_aligned_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg[ALIGNED] = PFX(addAvg_aligned_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg[ALIGNED] = PFX(addAvg_aligned_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg[ALIGNED] = 
PFX(addAvg_aligned_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg[ALIGNED] = PFX(addAvg_aligned_16x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg[ALIGNED] = PFX(addAvg_aligned_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg[ALIGNED] = PFX(addAvg_aligned_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg[ALIGNED] = PFX(addAvg_aligned_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg[ALIGNED] = PFX(addAvg_aligned_32x32_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg[ALIGNED] = PFX(addAvg_aligned_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg[ALIGNED] = PFX(addAvg_aligned_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg[ALIGNED] = PFX(addAvg_aligned_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg[ALIGNED] = PFX(addAvg_aligned_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg[ALIGNED] = PFX(addAvg_aligned_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg[ALIGNED] = PFX(addAvg_aligned_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg[ALIGNED] = PFX(addAvg_aligned_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg[ALIGNED] = PFX(addAvg_aligned_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg[ALIGNED] = PFX(addAvg_aligned_32x64_avx512); + p.cu[BLOCK_32x32].blockfill_s[NONALIGNED] = PFX(blockfill_s_32x32_avx512); + p.cu[BLOCK_32x32].blockfill_s[ALIGNED] = PFX(blockfill_s_aligned_32x32_avx512); + p.pu[LUMA_8x4].luma_hpp = PFX(interp_8tap_horiz_pp_8x4_avx512); + p.pu[LUMA_8x8].luma_hpp = PFX(interp_8tap_horiz_pp_8x8_avx512); + p.pu[LUMA_8x16].luma_hpp = PFX(interp_8tap_horiz_pp_8x16_avx512); + p.pu[LUMA_8x32].luma_hpp = PFX(interp_8tap_horiz_pp_8x32_avx512); + p.pu[LUMA_16x4].luma_hpp = PFX(interp_8tap_horiz_pp_16x4_avx512); + p.pu[LUMA_16x8].luma_hpp = PFX(interp_8tap_horiz_pp_16x8_avx512); + p.pu[LUMA_16x12].luma_hpp = PFX(interp_8tap_horiz_pp_16x12_avx512); + p.pu[LUMA_16x16].luma_hpp = PFX(interp_8tap_horiz_pp_16x16_avx512); + p.pu[LUMA_16x32].luma_hpp = PFX(interp_8tap_horiz_pp_16x32_avx512); + p.pu[LUMA_16x64].luma_hpp = PFX(interp_8tap_horiz_pp_16x64_avx512); + p.pu[LUMA_24x32].luma_hpp = PFX(interp_8tap_horiz_pp_24x32_avx512); + p.pu[LUMA_32x8].luma_hpp = PFX(interp_8tap_horiz_pp_32x8_avx512); + p.pu[LUMA_32x16].luma_hpp = PFX(interp_8tap_horiz_pp_32x16_avx512); + p.pu[LUMA_32x24].luma_hpp = PFX(interp_8tap_horiz_pp_32x24_avx512); + p.pu[LUMA_32x32].luma_hpp = PFX(interp_8tap_horiz_pp_32x32_avx512); + p.pu[LUMA_32x64].luma_hpp = PFX(interp_8tap_horiz_pp_32x64_avx512); + p.pu[LUMA_64x16].luma_hpp = PFX(interp_8tap_horiz_pp_64x16_avx512); + p.pu[LUMA_64x32].luma_hpp = PFX(interp_8tap_horiz_pp_64x32_avx512); + p.pu[LUMA_64x48].luma_hpp = PFX(interp_8tap_horiz_pp_64x48_avx512); + p.pu[LUMA_64x64].luma_hpp = PFX(interp_8tap_horiz_pp_64x64_avx512); + p.pu[LUMA_48x64].luma_hpp = PFX(interp_8tap_horiz_pp_48x64_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vpp = PFX(interp_4tap_vert_pp_64x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vpp = PFX(interp_4tap_vert_pp_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vpp = PFX(interp_4tap_vert_pp_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vpp = PFX(interp_4tap_vert_pp_64x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vps = PFX(interp_4tap_vert_ps_64x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vps = 
PFX(interp_4tap_vert_ps_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vps = PFX(interp_4tap_vert_ps_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vps = PFX(interp_4tap_vert_ps_64x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vsp = PFX(interp_4tap_vert_sp_64x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vsp = PFX(interp_4tap_vert_sp_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vsp = PFX(interp_4tap_vert_sp_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vsp = PFX(interp_4tap_vert_sp_64x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vss = PFX(interp_4tap_vert_ss_64x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vss = PFX(interp_4tap_vert_ss_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vss = PFX(interp_4tap_vert_ss_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vss = PFX(interp_4tap_vert_ss_64x64_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vpp = PFX(interp_4tap_vert_pp_48x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vps = PFX(interp_4tap_vert_ps_48x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vsp = PFX(interp_4tap_vert_sp_48x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vss = PFX(interp_4tap_vert_ss_48x64_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vpp = PFX(interp_4tap_vert_pp_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vpp = PFX(interp_4tap_vert_pp_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vpp = PFX(interp_4tap_vert_pp_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vps = PFX(interp_4tap_vert_ps_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vps = PFX(interp_4tap_vert_ps_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vps = PFX(interp_4tap_vert_ps_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vss = PFX(interp_4tap_vert_ss_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vss = PFX(interp_4tap_vert_ss_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vss = PFX(interp_4tap_vert_ss_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vss = PFX(interp_4tap_vert_ss_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vss = PFX(interp_4tap_vert_ss_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vsp = PFX(interp_4tap_vert_sp_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vsp = PFX(interp_4tap_vert_sp_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vsp = PFX(interp_4tap_vert_sp_32x64_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vpp = PFX(interp_4tap_vert_pp_16x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vpp = PFX(interp_4tap_vert_pp_16x12_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vpp = 
PFX(interp_4tap_vert_pp_16x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vpp = PFX(interp_4tap_vert_pp_16x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vps = PFX(interp_4tap_vert_ps_16x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vps = PFX(interp_4tap_vert_ps_16x12_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vps = PFX(interp_4tap_vert_ps_16x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vss = PFX(interp_4tap_vert_ss_16x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vss = PFX(interp_4tap_vert_ss_16x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vss = PFX(interp_4tap_vert_ss_16x12_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vss = PFX(interp_4tap_vert_ss_16x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vss = PFX(interp_4tap_vert_ss_16x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vss = PFX(interp_4tap_vert_ss_16x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vsp = PFX(interp_4tap_vert_sp_16x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vsp = PFX(interp_4tap_vert_sp_16x12_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vsp = PFX(interp_4tap_vert_sp_16x64_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = PFX(interp_4tap_vert_pp_8x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = PFX(interp_4tap_vert_pp_8x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = PFX(interp_4tap_vert_pp_8x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vps = PFX(interp_4tap_vert_ps_8x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vps = PFX(interp_4tap_vert_ps_8x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vps = PFX(interp_4tap_vert_ps_8x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vss = PFX(interp_4tap_vert_ss_8x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vss = PFX(interp_4tap_vert_ss_8x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vss = PFX(interp_4tap_vert_ss_8x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vsp = PFX(interp_4tap_vert_sp_8x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vsp = PFX(interp_4tap_vert_sp_8x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vsp = PFX(interp_4tap_vert_sp_8x32_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vpp = PFX(interp_4tap_vert_pp_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vpp = PFX(interp_4tap_vert_pp_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vps = 
PFX(interp_4tap_vert_ps_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vps = PFX(interp_4tap_vert_ps_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vps = PFX(interp_4tap_vert_ps_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vss = PFX(interp_4tap_vert_ss_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vss = PFX(interp_4tap_vert_ss_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vss = PFX(interp_4tap_vert_ss_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vss = PFX(interp_4tap_vert_ss_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vsp = PFX(interp_4tap_vert_sp_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vsp = PFX(interp_4tap_vert_sp_32x64_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vpp = PFX(interp_4tap_vert_pp_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vpp = PFX(interp_4tap_vert_pp_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vps = PFX(interp_4tap_vert_ps_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vps = PFX(interp_4tap_vert_ps_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vss = PFX(interp_4tap_vert_ss_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vss = PFX(interp_4tap_vert_ss_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vss = PFX(interp_4tap_vert_ss_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vss = PFX(interp_4tap_vert_ss_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vss = PFX(interp_4tap_vert_ss_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vsp = PFX(interp_4tap_vert_sp_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vsp = PFX(interp_4tap_vert_sp_16x64_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = PFX(interp_4tap_vert_pp_8x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = PFX(interp_4tap_vert_pp_8x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = PFX(interp_4tap_vert_pp_8x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = PFX(interp_4tap_vert_pp_8x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vps = PFX(interp_4tap_vert_ps_8x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vps = 
PFX(interp_4tap_vert_ps_8x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vps = PFX(interp_4tap_vert_ps_8x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vps = PFX(interp_4tap_vert_ps_8x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vss = PFX(interp_4tap_vert_ss_8x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vss = PFX(interp_4tap_vert_ss_8x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vss = PFX(interp_4tap_vert_ss_8x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vss = PFX(interp_4tap_vert_ss_8x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vsp = PFX(interp_4tap_vert_sp_8x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vsp = PFX(interp_4tap_vert_sp_8x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vsp = PFX(interp_4tap_vert_sp_8x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vsp = PFX(interp_4tap_vert_sp_8x64_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vpp = PFX(interp_4tap_vert_pp_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vpp = PFX(interp_4tap_vert_pp_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vps = PFX(interp_4tap_vert_ps_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vps = PFX(interp_4tap_vert_ps_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vss = PFX(interp_4tap_vert_ss_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vss = PFX(interp_4tap_vert_ss_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vss = PFX(interp_4tap_vert_ss_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vss = PFX(interp_4tap_vert_ss_32x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vsp = PFX(interp_4tap_vert_sp_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vsp = PFX(interp_4tap_vert_sp_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vpp = PFX(interp_4tap_vert_pp_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vpp = PFX(interp_4tap_vert_pp_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vps = PFX(interp_4tap_vert_ps_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vps = PFX(interp_4tap_vert_ps_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vps = 
PFX(interp_4tap_vert_ps_16x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vss = PFX(interp_4tap_vert_ss_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vss = PFX(interp_4tap_vert_ss_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vss = PFX(interp_4tap_vert_ss_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vss = PFX(interp_4tap_vert_ss_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vss = PFX(interp_4tap_vert_ss_16x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vsp = PFX(interp_4tap_vert_sp_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vsp = PFX(interp_4tap_vert_sp_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vpp = PFX(interp_4tap_vert_pp_8x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vpp = PFX(interp_4tap_vert_pp_8x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vpp = PFX(interp_4tap_vert_pp_8x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vps = PFX(interp_4tap_vert_ps_8x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vps = PFX(interp_4tap_vert_ps_8x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vps = PFX(interp_4tap_vert_ps_8x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vss = PFX(interp_4tap_vert_ss_8x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vss = PFX(interp_4tap_vert_ss_8x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vss = PFX(interp_4tap_vert_ss_8x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vsp = PFX(interp_4tap_vert_sp_8x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vsp = PFX(interp_4tap_vert_sp_8x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vsp = PFX(interp_4tap_vert_sp_8x32_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vpp = PFX(interp_4tap_vert_pp_24x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vpp = PFX(interp_4tap_vert_pp_24x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vpp = PFX(interp_4tap_vert_pp_24x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vps = PFX(interp_4tap_vert_ps_24x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vps = PFX(interp_4tap_vert_ps_24x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vps = PFX(interp_4tap_vert_ps_24x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vss = PFX(interp_4tap_vert_ss_24x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vss = PFX(interp_4tap_vert_ss_24x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vss = PFX(interp_4tap_vert_ss_24x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vsp = PFX(interp_4tap_vert_sp_24x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vsp = PFX(interp_4tap_vert_sp_24x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vsp = PFX(interp_4tap_vert_sp_24x32_avx512); + + p.pu[LUMA_8x8].luma_vss = PFX(interp_8tap_vert_ss_8x8_avx512); + p.pu[LUMA_8x16].luma_vss = PFX(interp_8tap_vert_ss_8x16_avx512); + p.pu[LUMA_8x32].luma_vss = PFX(interp_8tap_vert_ss_8x32_avx512); + 
p.pu[LUMA_16x4].luma_vss = PFX(interp_8tap_vert_ss_16x4_avx512); + p.pu[LUMA_16x8].luma_vss = PFX(interp_8tap_vert_ss_16x8_avx512); + p.pu[LUMA_16x12].luma_vss = PFX(interp_8tap_vert_ss_16x12_avx512); + p.pu[LUMA_16x16].luma_vss = PFX(interp_8tap_vert_ss_16x16_avx512); + p.pu[LUMA_16x32].luma_vss = PFX(interp_8tap_vert_ss_16x32_avx512); + p.pu[LUMA_16x64].luma_vss = PFX(interp_8tap_vert_ss_16x64_avx512); + p.pu[LUMA_24x32].luma_vss = PFX(interp_8tap_vert_ss_24x32_avx512); + p.pu[LUMA_32x8].luma_vss = PFX(interp_8tap_vert_ss_32x8_avx512); + p.pu[LUMA_32x16].luma_vss = PFX(interp_8tap_vert_ss_32x16_avx512); + p.pu[LUMA_32x32].luma_vss = PFX(interp_8tap_vert_ss_32x32_avx512); + p.pu[LUMA_32x24].luma_vss = PFX(interp_8tap_vert_ss_32x24_avx512); + p.pu[LUMA_32x64].luma_vss = PFX(interp_8tap_vert_ss_32x64_avx512); + p.pu[LUMA_64x16].luma_vss = PFX(interp_8tap_vert_ss_64x16_avx512); + p.pu[LUMA_64x32].luma_vss = PFX(interp_8tap_vert_ss_64x32_avx512); + p.pu[LUMA_64x48].luma_vss = PFX(interp_8tap_vert_ss_64x48_avx512); + p.pu[LUMA_64x64].luma_vss = PFX(interp_8tap_vert_ss_64x64_avx512); + p.pu[LUMA_48x64].luma_vss = PFX(interp_8tap_vert_ss_48x64_avx512); + + p.pu[LUMA_8x8].luma_vsp = PFX(interp_8tap_vert_sp_8x8_avx512); + p.pu[LUMA_8x16].luma_vsp = PFX(interp_8tap_vert_sp_8x16_avx512); + p.pu[LUMA_8x32].luma_vsp = PFX(interp_8tap_vert_sp_8x32_avx512); + p.pu[LUMA_16x4].luma_vsp = PFX(interp_8tap_vert_sp_16x4_avx512); + p.pu[LUMA_16x8].luma_vsp = PFX(interp_8tap_vert_sp_16x8_avx512); + p.pu[LUMA_16x12].luma_vsp = PFX(interp_8tap_vert_sp_16x12_avx512); + p.pu[LUMA_16x16].luma_vsp = PFX(interp_8tap_vert_sp_16x16_avx512); + p.pu[LUMA_16x32].luma_vsp = PFX(interp_8tap_vert_sp_16x32_avx512); + p.pu[LUMA_16x64].luma_vsp = PFX(interp_8tap_vert_sp_16x64_avx512); + p.pu[LUMA_24x32].luma_vsp = PFX(interp_8tap_vert_sp_24x32_avx512); + p.pu[LUMA_32x8].luma_vsp = PFX(interp_8tap_vert_sp_32x8_avx512); + p.pu[LUMA_32x16].luma_vsp = PFX(interp_8tap_vert_sp_32x16_avx512); + p.pu[LUMA_32x32].luma_vsp = PFX(interp_8tap_vert_sp_32x32_avx512); + p.pu[LUMA_32x24].luma_vsp = PFX(interp_8tap_vert_sp_32x24_avx512); + p.pu[LUMA_32x64].luma_vsp = PFX(interp_8tap_vert_sp_32x64_avx512); + p.pu[LUMA_64x16].luma_vsp = PFX(interp_8tap_vert_sp_64x16_avx512); + p.pu[LUMA_64x32].luma_vsp = PFX(interp_8tap_vert_sp_64x32_avx512); + p.pu[LUMA_64x48].luma_vsp = PFX(interp_8tap_vert_sp_64x48_avx512); + p.pu[LUMA_64x64].luma_vsp = PFX(interp_8tap_vert_sp_64x64_avx512); + p.pu[LUMA_48x64].luma_vsp = PFX(interp_8tap_vert_sp_48x64_avx512); + + p.pu[LUMA_16x4].luma_vpp = PFX(interp_8tap_vert_pp_16x4_avx512); + p.pu[LUMA_16x8].luma_vpp = PFX(interp_8tap_vert_pp_16x8_avx512); + p.pu[LUMA_16x12].luma_vpp = PFX(interp_8tap_vert_pp_16x12_avx512); + p.pu[LUMA_16x16].luma_vpp = PFX(interp_8tap_vert_pp_16x16_avx512); + p.pu[LUMA_16x32].luma_vpp = PFX(interp_8tap_vert_pp_16x32_avx512); + p.pu[LUMA_16x64].luma_vpp = PFX(interp_8tap_vert_pp_16x64_avx512); + p.pu[LUMA_24x32].luma_vpp = PFX(interp_8tap_vert_pp_24x32_avx512); + p.pu[LUMA_32x8].luma_vpp = PFX(interp_8tap_vert_pp_32x8_avx512); + p.pu[LUMA_32x16].luma_vpp = PFX(interp_8tap_vert_pp_32x16_avx512); + p.pu[LUMA_32x32].luma_vpp = PFX(interp_8tap_vert_pp_32x32_avx512); + p.pu[LUMA_32x24].luma_vpp = PFX(interp_8tap_vert_pp_32x24_avx512); + p.pu[LUMA_32x64].luma_vpp = PFX(interp_8tap_vert_pp_32x64_avx512); + p.pu[LUMA_48x64].luma_vpp = PFX(interp_8tap_vert_pp_48x64_avx512); + p.pu[LUMA_64x16].luma_vpp = PFX(interp_8tap_vert_pp_64x16_avx512); + p.pu[LUMA_64x32].luma_vpp = 
PFX(interp_8tap_vert_pp_64x32_avx512); + p.pu[LUMA_64x48].luma_vpp = PFX(interp_8tap_vert_pp_64x48_avx512); + p.pu[LUMA_64x64].luma_vpp = PFX(interp_8tap_vert_pp_64x64_avx512); + + p.pu[LUMA_16x4].luma_vps = PFX(interp_8tap_vert_ps_16x4_avx512); + p.pu[LUMA_16x8].luma_vps = PFX(interp_8tap_vert_ps_16x8_avx512); + p.pu[LUMA_16x12].luma_vps = PFX(interp_8tap_vert_ps_16x12_avx512); + p.pu[LUMA_16x16].luma_vps = PFX(interp_8tap_vert_ps_16x16_avx512); + p.pu[LUMA_16x32].luma_vps = PFX(interp_8tap_vert_ps_16x32_avx512); + p.pu[LUMA_16x64].luma_vps = PFX(interp_8tap_vert_ps_16x64_avx512); + p.pu[LUMA_24x32].luma_vps = PFX(interp_8tap_vert_ps_24x32_avx512); + p.pu[LUMA_32x8].luma_vps = PFX(interp_8tap_vert_ps_32x8_avx512); + p.pu[LUMA_32x16].luma_vps = PFX(interp_8tap_vert_ps_32x16_avx512); + p.pu[LUMA_32x32].luma_vps = PFX(interp_8tap_vert_ps_32x32_avx512); + p.pu[LUMA_32x24].luma_vps = PFX(interp_8tap_vert_ps_32x24_avx512); + p.pu[LUMA_32x64].luma_vps = PFX(interp_8tap_vert_ps_32x64_avx512); + p.pu[LUMA_48x64].luma_vps = PFX(interp_8tap_vert_ps_48x64_avx512); + p.pu[LUMA_64x16].luma_vps = PFX(interp_8tap_vert_ps_64x16_avx512); + p.pu[LUMA_64x32].luma_vps = PFX(interp_8tap_vert_ps_64x32_avx512); + p.pu[LUMA_64x48].luma_vps = PFX(interp_8tap_vert_ps_64x48_avx512); + p.pu[LUMA_64x64].luma_vps = PFX(interp_8tap_vert_ps_64x64_avx512); + + p.cu[BLOCK_8x8].dct = PFX(dct8_avx512); + /* TODO: Currently these kernels' performance is similar to the AVX2 version; we need to improve them further to enable + * it. Probably a VTune analysis will help here. + + * p.cu[BLOCK_16x16].dct = PFX(dct16_avx512); + * p.cu[BLOCK_32x32].dct = PFX(dct32_avx512); */ + + p.cu[BLOCK_8x8].idct = PFX(idct8_avx512); + p.cu[BLOCK_16x16].idct = PFX(idct16_avx512); + p.cu[BLOCK_32x32].idct = PFX(idct32_avx512); + p.quant = PFX(quant_avx512); + p.nquant = PFX(nquant_avx512); + p.denoiseDct = PFX(denoise_dct_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hps = PFX(interp_4tap_horiz_ps_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hps = PFX(interp_4tap_horiz_ps_32x8_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hps = PFX(interp_4tap_horiz_ps_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hps = PFX(interp_4tap_horiz_ps_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hps = PFX(interp_4tap_horiz_ps_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hps = PFX(interp_4tap_horiz_ps_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hps = PFX(interp_4tap_horiz_ps_32x8_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hps = PFX(interp_4tap_horiz_ps_64x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hps = PFX(interp_4tap_horiz_ps_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hps = PFX(interp_4tap_horiz_ps_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hps = PFX(interp_4tap_horiz_ps_64x16_avx512); +
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_hps = PFX(interp_4tap_horiz_ps_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hps = PFX(interp_4tap_horiz_ps_16x4_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hps = PFX(interp_4tap_horiz_ps_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hps = PFX(interp_4tap_horiz_ps_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hps = PFX(interp_4tap_horiz_ps_16x12_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hps = PFX(interp_4tap_horiz_ps_16x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hps = PFX(interp_4tap_horiz_ps_16x64_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hps = PFX(interp_4tap_horiz_ps_48x64_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hps = PFX(interp_4tap_horiz_ps_8x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_hps = PFX(interp_4tap_horiz_ps_8x32_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hps = PFX(interp_4tap_horiz_ps_8x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hps = PFX(interp_4tap_horiz_ps_8x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hps = PFX(interp_4tap_horiz_ps_8x12_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hps = PFX(interp_4tap_horiz_ps_8x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hps = PFX(interp_4tap_horiz_ps_8x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hps = PFX(interp_4tap_horiz_ps_8x32_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hps = PFX(interp_4tap_horiz_ps_24x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hps = PFX(interp_4tap_horiz_ps_24x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hps = PFX(interp_4tap_horiz_ps_24x32_avx512); + + //Luma_hps_32xN + p.pu[LUMA_32x8].luma_hps = PFX(interp_8tap_horiz_ps_32x8_avx512); + p.pu[LUMA_32x16].luma_hps = PFX(interp_8tap_horiz_ps_32x16_avx512); + p.pu[LUMA_32x32].luma_hps = PFX(interp_8tap_horiz_ps_32x32_avx512); + p.pu[LUMA_32x24].luma_hps = 
PFX(interp_8tap_horiz_ps_32x24_avx512); + p.pu[LUMA_32x64].luma_hps = PFX(interp_8tap_horiz_ps_32x64_avx512); + //Luma_hps_64xN + p.pu[LUMA_64x16].luma_hps = PFX(interp_8tap_horiz_ps_64x16_avx512); + p.pu[LUMA_64x32].luma_hps = PFX(interp_8tap_horiz_ps_64x32_avx512); + p.pu[LUMA_64x48].luma_hps = PFX(interp_8tap_horiz_ps_64x48_avx512); + p.pu[LUMA_64x64].luma_hps = PFX(interp_8tap_horiz_ps_64x64_avx512); + //Luma_hps_16xN + p.pu[LUMA_16x4].luma_hps = PFX(interp_8tap_horiz_ps_16x4_avx512); + p.pu[LUMA_16x8].luma_hps = PFX(interp_8tap_horiz_ps_16x8_avx512); + p.pu[LUMA_16x12].luma_hps = PFX(interp_8tap_horiz_ps_16x12_avx512); + p.pu[LUMA_16x16].luma_hps = PFX(interp_8tap_horiz_ps_16x16_avx512); + p.pu[LUMA_16x32].luma_hps = PFX(interp_8tap_horiz_ps_16x32_avx512); + p.pu[LUMA_16x64].luma_hps = PFX(interp_8tap_horiz_ps_16x64_avx512); + //Luma_hps_48x64 + p.pu[LUMA_48x64].luma_hps = PFX(interp_8tap_horiz_ps_48x64_avx512); + //Luma_hps_24x32 + p.pu[LUMA_24x32].luma_hps = PFX(interp_8tap_horiz_ps_24x32_avx512); + //Luma_hps_8xN + p.pu[LUMA_8x4].luma_hps = PFX(interp_8tap_horiz_ps_8x4_avx512); + p.pu[LUMA_8x8].luma_hps = PFX(interp_8tap_horiz_ps_8x8_avx512); + p.pu[LUMA_8x16].luma_hps = PFX(interp_8tap_horiz_ps_8x16_avx512); + p.pu[LUMA_8x32].luma_hps = PFX(interp_8tap_horiz_ps_8x32_avx512); + p.pu[LUMA_16x8].satd = PFX(pixel_satd_16x8_avx512); + p.pu[LUMA_16x16].satd = PFX(pixel_satd_16x16_avx512); + p.pu[LUMA_16x32].satd = PFX(pixel_satd_16x32_avx512); + p.pu[LUMA_16x64].satd = PFX(pixel_satd_16x64_avx512); + p.pu[LUMA_32x8].satd = PFX(pixel_satd_32x8_avx512); + p.pu[LUMA_32x16].satd = PFX(pixel_satd_32x16_avx512); + p.pu[LUMA_32x24].satd = PFX(pixel_satd_32x24_avx512); + p.pu[LUMA_32x32].satd = PFX(pixel_satd_32x32_avx512); + p.pu[LUMA_32x64].satd = PFX(pixel_satd_32x64_avx512); + p.pu[LUMA_64x16].satd = PFX(pixel_satd_64x16_avx512); + p.pu[LUMA_64x32].satd = PFX(pixel_satd_64x32_avx512); + p.pu[LUMA_64x48].satd = PFX(pixel_satd_64x48_avx512); + p.pu[LUMA_64x64].satd = PFX(pixel_satd_64x64_avx512); + p.pu[LUMA_48x64].satd = PFX(pixel_satd_48x64_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].satd = PFX(pixel_satd_16x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].satd = PFX(pixel_satd_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].satd = PFX(pixel_satd_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].satd = PFX(pixel_satd_32x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].satd = PFX(pixel_satd_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].satd = PFX(pixel_satd_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd = PFX(pixel_satd_32x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].satd = PFX(pixel_satd_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].satd = PFX(pixel_satd_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].satd = PFX(pixel_satd_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd = PFX(pixel_satd_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].satd = PFX(pixel_satd_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].satd = PFX(pixel_satd_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = PFX(pixel_satd_32x16_avx512); + + p.cu[BLOCK_32x32].intra_pred[DC_IDX] = PFX(intra_pred_dc32_avx512); + p.cu[BLOCK_32x32].intra_pred[2] = PFX(intra_pred_ang32_2_avx512); + p.cu[BLOCK_32x32].intra_pred[34] = PFX(intra_pred_ang32_2_avx512); + p.cu[BLOCK_32x32].intra_pred[9] = PFX(intra_pred_ang32_9_avx512); + 
p.cu[BLOCK_32x32].intra_pred[10] = PFX(intra_pred_ang32_10_avx512); + p.cu[BLOCK_32x32].intra_pred[11] = PFX(intra_pred_ang32_11_avx512); + p.cu[BLOCK_32x32].intra_pred[18] = PFX(intra_pred_ang32_18_avx512); + p.cu[BLOCK_32x32].intra_pred[25] = PFX(intra_pred_ang32_25_avx512); + p.cu[BLOCK_32x32].intra_pred[26] = PFX(intra_pred_ang32_26_avx512); + p.cu[BLOCK_32x32].intra_pred[27] = PFX(intra_pred_ang32_27_avx512); + p.cu[BLOCK_32x32].intra_pred[5] = PFX(intra_pred_ang32_5_avx512); + p.cu[BLOCK_32x32].intra_pred[31] = PFX(intra_pred_ang32_31_avx512); + p.cu[BLOCK_32x32].intra_pred[32] = PFX(intra_pred_ang32_32_avx512); + p.cu[BLOCK_32x32].intra_pred[4] = PFX(intra_pred_ang32_4_avx512); + p.cu[BLOCK_32x32].intra_pred[30] = PFX(intra_pred_ang32_30_avx512); + p.cu[BLOCK_32x32].intra_pred[6] = PFX(intra_pred_ang32_6_avx512); + p.cu[BLOCK_32x32].intra_pred[29] = PFX(intra_pred_ang32_29_avx512); + p.cu[BLOCK_32x32].intra_pred[7] = PFX(intra_pred_ang32_7_avx512); + p.cu[BLOCK_32x32].intra_pred[8] = PFX(intra_pred_ang32_8_avx512); + p.cu[BLOCK_32x32].intra_pred[28] = PFX(intra_pred_ang32_28_avx512); + p.cu[BLOCK_16x16].intra_pred[9] = PFX(intra_pred_ang16_9_avx512); + p.cu[BLOCK_16x16].intra_pred[11] = PFX(intra_pred_ang16_11_avx512); + p.cu[BLOCK_16x16].intra_pred[25] = PFX(intra_pred_ang16_25_avx512); + p.cu[BLOCK_16x16].intra_pred[27] = PFX(intra_pred_ang16_27_avx512); + p.cu[BLOCK_16x16].intra_pred[8] = PFX(intra_pred_ang16_8_avx512); + p.cu[BLOCK_16x16].intra_pred[28] = PFX(intra_pred_ang16_28_avx512); + p.cu[BLOCK_16x16].intra_pred[5] = PFX(intra_pred_ang16_5_avx512); + p.cu[BLOCK_16x16].intra_pred[31] = PFX(intra_pred_ang16_31_avx512); + p.cu[BLOCK_16x16].intra_pred[4] = PFX(intra_pred_ang16_4_avx512); + p.cu[BLOCK_16x16].intra_pred[32] = PFX(intra_pred_ang16_32_avx512); + p.cu[BLOCK_16x16].intra_pred[6] = PFX(intra_pred_ang16_6_avx512); + p.cu[BLOCK_16x16].intra_pred[30] = PFX(intra_pred_ang16_30_avx512); + p.cu[BLOCK_16x16].intra_pred[7] = PFX(intra_pred_ang16_7_avx512); + p.cu[BLOCK_16x16].intra_pred[29] = PFX(intra_pred_ang16_29_avx512); + p.pu[LUMA_64x64].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_64x64>; + p.pu[LUMA_64x48].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_64x48>; + p.pu[LUMA_64x32].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_64x32>; + p.pu[LUMA_64x16].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_64x16>; + p.pu[LUMA_32x8].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x8>; + p.pu[LUMA_32x16].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x16>; + p.pu[LUMA_32x32].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x32>; + p.pu[LUMA_32x24].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x24>; + p.pu[LUMA_32x64].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x64>; + p.pu[LUMA_16x4].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x4>; + p.pu[LUMA_16x8].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x8>; + p.pu[LUMA_16x12].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x12>; + p.pu[LUMA_16x16].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x16>; + p.pu[LUMA_16x32].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x32>; + p.pu[LUMA_16x64].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x64>; + p.pu[LUMA_48x64].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_48x64>; + + p.cu[BLOCK_16x16].psy_cost_pp = PFX(psyCost_pp_16x16_avx512); + p.cu[BLOCK_32x32].psy_cost_pp = PFX(psyCost_pp_32x32_avx512); + p.cu[BLOCK_64x64].psy_cost_pp = PFX(psyCost_pp_64x64_avx512); + + p.cu[BLOCK_4x4].nonPsyRdoQuant = PFX(nonPsyRdoQuant4_avx512); + p.cu[BLOCK_8x8].nonPsyRdoQuant = PFX(nonPsyRdoQuant8_avx512); + p.cu[BLOCK_16x16].nonPsyRdoQuant = PFX(nonPsyRdoQuant16_avx512); + p.cu[BLOCK_32x32].nonPsyRdoQuant = 
PFX(nonPsyRdoQuant32_avx512); + p.cu[BLOCK_4x4].psyRdoQuant = PFX(psyRdoQuant4_avx512); + p.cu[BLOCK_8x8].psyRdoQuant = PFX(psyRdoQuant8_avx512); + p.cu[BLOCK_16x16].psyRdoQuant = PFX(psyRdoQuant16_avx512); + p.cu[BLOCK_32x32].psyRdoQuant = PFX(psyRdoQuant32_avx512); + + p.cu[BLOCK_32x32].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_32x32_avx512); + p.cu[BLOCK_64x64].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_64x64_avx512); + p.cu[BLOCK_32x32].sse_pp = PFX(pixel_ssd_32x32_avx512); + p.cu[BLOCK_64x64].sse_pp = PFX(pixel_ssd_64x64_avx512); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sse_pp = (pixel_sse_t)PFX(pixel_ssd_32x32_avx512); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sse_pp = (pixel_sse_t)PFX(pixel_ssd_32x64_avx512); + p.planecopy_sp_shl = PFX(upShift_16_avx512); + + } +#endif } #else // if HIGH_BIT_DEPTH @@ -2295,16 +3284,16 @@ //p.frameInitLowres = PFX(frame_init_lowres_core_mmx2); p.frameInitLowres = PFX(frame_init_lowres_core_sse2); - ALL_LUMA_TU(blockfill_s, blockfill_s, sse2); + ALL_LUMA_TU(blockfill_s[NONALIGNED], blockfill_s, sse2); + ALL_LUMA_TU(blockfill_s[ALIGNED], blockfill_s, sse2); ALL_LUMA_TU_S(cpy2Dto1D_shl, cpy2Dto1D_shl_, sse2); ALL_LUMA_TU_S(cpy2Dto1D_shr, cpy2Dto1D_shr_, sse2); - ALL_LUMA_TU_S(cpy1Dto2D_shl, cpy1Dto2D_shl_, sse2); + ALL_LUMA_TU_S(cpy1Dto2D_shl[ALIGNED], cpy1Dto2D_shl_, sse2); + ALL_LUMA_TU_S(cpy1Dto2D_shl[NONALIGNED], cpy1Dto2D_shl_, sse2); ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, sse2); - ALL_LUMA_TU_S(ssd_s, pixel_ssd_s_, sse2); - + ALL_LUMA_TU_S(ssd_s[NONALIGNED], pixel_ssd_s_, sse2); ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse2); ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2); - p.cu[BLOCK_4x4].intra_pred[2] = PFX(intra_pred_ang4_2_sse2); p.cu[BLOCK_4x4].intra_pred[3] = PFX(intra_pred_ang4_3_sse2); p.cu[BLOCK_4x4].intra_pred[4] = PFX(intra_pred_ang4_4_sse2); @@ -2339,9 +3328,8 @@ p.cu[BLOCK_4x4].intra_pred[33] = PFX(intra_pred_ang4_33_sse2); p.cu[BLOCK_4x4].intra_pred_allangs = PFX(all_angs_pred_4x4_sse2); - - p.cu[BLOCK_4x4].calcresidual = PFX(getResidual4_sse2); - p.cu[BLOCK_8x8].calcresidual = PFX(getResidual8_sse2); + ASSIGN2(p.cu[BLOCK_4x4].calcresidual, getResidual4_sse2); + ASSIGN2(p.cu[BLOCK_8x8].calcresidual, getResidual8_sse2); ALL_LUMA_TU_S(transpose, transpose, sse2); p.cu[BLOCK_64x64].transpose = PFX(transpose64_sse2); @@ -2362,10 +3350,14 @@ p.dst4x4 = PFX(dst4_sse2); p.planecopy_sp = PFX(downShift_16_sse2); - ALL_CHROMA_420_PU(p2s, filterPixelToShort, sse2); - ALL_CHROMA_422_PU(p2s, filterPixelToShort, sse2); - ALL_CHROMA_444_PU(p2s, filterPixelToShort, sse2); - ALL_LUMA_PU(convert_p2s, filterPixelToShort, sse2); + ALL_CHROMA_420_PU(p2s[NONALIGNED], filterPixelToShort, sse2); + ALL_CHROMA_422_PU(p2s[NONALIGNED], filterPixelToShort, sse2); + ALL_CHROMA_444_PU(p2s[NONALIGNED], filterPixelToShort, sse2); + ALL_CHROMA_420_PU(p2s[ALIGNED], filterPixelToShort, sse2); + ALL_CHROMA_422_PU(p2s[ALIGNED], filterPixelToShort, sse2); + ALL_CHROMA_444_PU(p2s[ALIGNED], filterPixelToShort, sse2); + ALL_LUMA_PU(convert_p2s[NONALIGNED], filterPixelToShort, sse2); + ALL_LUMA_PU(convert_p2s[ALIGNED], filterPixelToShort, sse2); ALL_LUMA_TU(count_nonzero, count_nonzero, sse2); p.propagateCost = PFX(mbtree_propagate_cost_sse2); } @@ -2411,64 +3403,61 @@ p.pu[LUMA_8x8].luma_hvpp = PFX(interp_8tap_hv_pp_8x8_ssse3); p.frameInitLowres = PFX(frame_init_lowres_core_ssse3); - p.scale1D_128to64 = PFX(scale1D_128to64_ssse3); + ASSIGN2(p.scale1D_128to64, scale1D_128to64_ssse3); p.scale2D_64to32 = PFX(scale2D_64to32_ssse3); - 
p.pu[LUMA_8x4].convert_p2s = PFX(filterPixelToShort_8x4_ssse3); - p.pu[LUMA_8x8].convert_p2s = PFX(filterPixelToShort_8x8_ssse3); - p.pu[LUMA_8x16].convert_p2s = PFX(filterPixelToShort_8x16_ssse3); - p.pu[LUMA_8x32].convert_p2s = PFX(filterPixelToShort_8x32_ssse3); - p.pu[LUMA_16x4].convert_p2s = PFX(filterPixelToShort_16x4_ssse3); - p.pu[LUMA_16x8].convert_p2s = PFX(filterPixelToShort_16x8_ssse3); - p.pu[LUMA_16x12].convert_p2s = PFX(filterPixelToShort_16x12_ssse3); - p.pu[LUMA_16x16].convert_p2s = PFX(filterPixelToShort_16x16_ssse3); - p.pu[LUMA_16x32].convert_p2s = PFX(filterPixelToShort_16x32_ssse3); - p.pu[LUMA_16x64].convert_p2s = PFX(filterPixelToShort_16x64_ssse3); - p.pu[LUMA_32x8].convert_p2s = PFX(filterPixelToShort_32x8_ssse3); - p.pu[LUMA_32x16].convert_p2s = PFX(filterPixelToShort_32x16_ssse3); - p.pu[LUMA_32x24].convert_p2s = PFX(filterPixelToShort_32x24_ssse3); - p.pu[LUMA_32x32].convert_p2s = PFX(filterPixelToShort_32x32_ssse3); - p.pu[LUMA_32x64].convert_p2s = PFX(filterPixelToShort_32x64_ssse3); - p.pu[LUMA_64x16].convert_p2s = PFX(filterPixelToShort_64x16_ssse3); - p.pu[LUMA_64x32].convert_p2s = PFX(filterPixelToShort_64x32_ssse3); - p.pu[LUMA_64x48].convert_p2s = PFX(filterPixelToShort_64x48_ssse3); - p.pu[LUMA_64x64].convert_p2s = PFX(filterPixelToShort_64x64_ssse3); - p.pu[LUMA_12x16].convert_p2s = PFX(filterPixelToShort_12x16_ssse3); - p.pu[LUMA_24x32].convert_p2s = PFX(filterPixelToShort_24x32_ssse3); - p.pu[LUMA_48x64].convert_p2s = PFX(filterPixelToShort_48x64_ssse3); - - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = PFX(filterPixelToShort_8x2_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = PFX(filterPixelToShort_8x4_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = PFX(filterPixelToShort_8x6_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = PFX(filterPixelToShort_8x8_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = PFX(filterPixelToShort_8x16_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = PFX(filterPixelToShort_8x32_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = PFX(filterPixelToShort_16x4_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = PFX(filterPixelToShort_16x8_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = PFX(filterPixelToShort_16x12_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = PFX(filterPixelToShort_16x16_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = PFX(filterPixelToShort_16x32_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = PFX(filterPixelToShort_32x8_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = PFX(filterPixelToShort_32x16_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = PFX(filterPixelToShort_32x24_ssse3); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = PFX(filterPixelToShort_32x32_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = PFX(filterPixelToShort_8x4_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = PFX(filterPixelToShort_8x8_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = PFX(filterPixelToShort_8x12_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = PFX(filterPixelToShort_8x16_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = PFX(filterPixelToShort_8x32_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = PFX(filterPixelToShort_8x64_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = PFX(filterPixelToShort_12x32_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = PFX(filterPixelToShort_16x8_ssse3); - 
p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = PFX(filterPixelToShort_16x16_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = PFX(filterPixelToShort_16x24_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = PFX(filterPixelToShort_16x32_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = PFX(filterPixelToShort_16x64_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = PFX(filterPixelToShort_24x64_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = PFX(filterPixelToShort_32x16_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = PFX(filterPixelToShort_32x32_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = PFX(filterPixelToShort_32x48_ssse3); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = PFX(filterPixelToShort_32x64_ssse3); + ASSIGN2(p.pu[LUMA_8x4].convert_p2s, filterPixelToShort_8x4_ssse3); + ASSIGN2(p.pu[LUMA_8x8].convert_p2s, filterPixelToShort_8x8_ssse3); + ASSIGN2(p.pu[LUMA_8x16].convert_p2s, filterPixelToShort_8x16_ssse3); + ASSIGN2(p.pu[LUMA_8x32].convert_p2s, filterPixelToShort_8x32_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s, filterPixelToShort_8x2_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s, filterPixelToShort_8x4_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s, filterPixelToShort_8x6_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s, filterPixelToShort_8x8_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s, filterPixelToShort_8x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s, filterPixelToShort_8x32_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s, filterPixelToShort_8x4_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s, filterPixelToShort_8x8_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s, filterPixelToShort_8x12_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s, filterPixelToShort_8x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s, filterPixelToShort_8x32_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s, filterPixelToShort_8x64_ssse3); + + ASSIGN2(p.pu[LUMA_16x4].convert_p2s, filterPixelToShort_16x4_ssse3); + ASSIGN2(p.pu[LUMA_16x8].convert_p2s, filterPixelToShort_16x8_ssse3); + ASSIGN2(p.pu[LUMA_16x12].convert_p2s, filterPixelToShort_16x12_ssse3); + ASSIGN2(p.pu[LUMA_16x16].convert_p2s, filterPixelToShort_16x16_ssse3); + ASSIGN2(p.pu[LUMA_16x32].convert_p2s, filterPixelToShort_16x32_ssse3); + ASSIGN2(p.pu[LUMA_16x64].convert_p2s, filterPixelToShort_16x64_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s, filterPixelToShort_16x4_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s, filterPixelToShort_16x8_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s, filterPixelToShort_16x12_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s, filterPixelToShort_16x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s, filterPixelToShort_16x32_ssse3); + + ASSIGN2(p.pu[LUMA_32x8].convert_p2s, filterPixelToShort_32x8_ssse3); + ASSIGN2(p.pu[LUMA_32x16].convert_p2s, filterPixelToShort_32x16_ssse3); + ASSIGN2(p.pu[LUMA_32x24].convert_p2s, filterPixelToShort_32x24_ssse3); + ASSIGN2(p.pu[LUMA_32x32].convert_p2s, filterPixelToShort_32x32_ssse3); + ASSIGN2(p.pu[LUMA_32x64].convert_p2s, filterPixelToShort_32x64_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s, filterPixelToShort_32x8_ssse3); + 
ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s, filterPixelToShort_32x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s, filterPixelToShort_32x24_ssse3); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s, filterPixelToShort_32x32_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s, filterPixelToShort_32x16_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s, filterPixelToShort_32x32_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s, filterPixelToShort_32x48_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s, filterPixelToShort_32x64_ssse3); + + ASSIGN2(p.pu[LUMA_64x16].convert_p2s, filterPixelToShort_64x16_ssse3); + ASSIGN2(p.pu[LUMA_64x32].convert_p2s, filterPixelToShort_64x32_ssse3); + ASSIGN2(p.pu[LUMA_64x48].convert_p2s, filterPixelToShort_64x48_ssse3); + ASSIGN2(p.pu[LUMA_64x64].convert_p2s, filterPixelToShort_64x64_ssse3); + ASSIGN2(p.pu[LUMA_12x16].convert_p2s, filterPixelToShort_12x16_ssse3); + ASSIGN2(p.pu[LUMA_24x32].convert_p2s, filterPixelToShort_24x32_ssse3); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s, filterPixelToShort_24x64_ssse3); + ASSIGN2(p.pu[LUMA_48x64].convert_p2s, filterPixelToShort_48x64_ssse3); + p.findPosFirstLast = PFX(findPosFirstLast_ssse3); p.fix8Unpack = PFX(cutree_fix8_unpack_ssse3); p.fix8Pack = PFX(cutree_fix8_pack_ssse3); @@ -2519,8 +3508,8 @@ CHROMA_420_CU_BLOCKCOPY(ps, sse4); CHROMA_422_CU_BLOCKCOPY(ps, sse4); - p.cu[BLOCK_16x16].calcresidual = PFX(getResidual16_sse4); - p.cu[BLOCK_32x32].calcresidual = PFX(getResidual32_sse4); + ASSIGN2(p.cu[BLOCK_16x16].calcresidual, getResidual16_sse4); + ASSIGN2(p.cu[BLOCK_32x32].calcresidual, getResidual32_sse4); p.cu[BLOCK_8x8].dct = PFX(dct8_sse4); p.denoiseDct = PFX(denoise_dct_sse4); p.quant = PFX(quant_sse4); @@ -2545,24 +3534,25 @@ p.cu[BLOCK_4x4].psy_cost_pp = PFX(psyCost_pp_4x4_sse4); - p.pu[LUMA_4x4].convert_p2s = PFX(filterPixelToShort_4x4_sse4); - p.pu[LUMA_4x8].convert_p2s = PFX(filterPixelToShort_4x8_sse4); - p.pu[LUMA_4x16].convert_p2s = PFX(filterPixelToShort_4x16_sse4); - - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = PFX(filterPixelToShort_2x4_sse4); - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = PFX(filterPixelToShort_2x8_sse4); - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = PFX(filterPixelToShort_4x2_sse4); - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = PFX(filterPixelToShort_4x4_sse4); - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = PFX(filterPixelToShort_4x8_sse4); - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = PFX(filterPixelToShort_4x16_sse4); - p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = PFX(filterPixelToShort_6x8_sse4); - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = PFX(filterPixelToShort_2x8_sse4); - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = PFX(filterPixelToShort_2x16_sse4); - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = PFX(filterPixelToShort_4x4_sse4); - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = PFX(filterPixelToShort_4x8_sse4); - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = PFX(filterPixelToShort_4x16_sse4); - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = PFX(filterPixelToShort_4x32_sse4); - p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = PFX(filterPixelToShort_6x16_sse4); + ASSIGN2(p.pu[LUMA_4x4].convert_p2s, filterPixelToShort_4x4_sse4); + ASSIGN2(p.pu[LUMA_4x8].convert_p2s, filterPixelToShort_4x8_sse4); + ASSIGN2(p.pu[LUMA_4x16].convert_p2s, filterPixelToShort_4x16_sse4); + + 
ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s, filterPixelToShort_2x4_sse4); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s, filterPixelToShort_2x8_sse4); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s, filterPixelToShort_4x2_sse4); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s, filterPixelToShort_4x4_sse4); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s, filterPixelToShort_4x8_sse4); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s, filterPixelToShort_4x16_sse4); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s, filterPixelToShort_6x8_sse4); + + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s, filterPixelToShort_2x8_sse4); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s, filterPixelToShort_2x16_sse4); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s, filterPixelToShort_4x4_sse4); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s, filterPixelToShort_4x8_sse4); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s, filterPixelToShort_4x16_sse4); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s, filterPixelToShort_4x32_sse4); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s, filterPixelToShort_6x16_sse4); #if X86_64 p.pelFilterLumaStrong[0] = PFX(pelFilterLumaStrong_V_sse4); @@ -2732,69 +3722,63 @@ p.cu[BLOCK_32x32].psy_cost_pp = PFX(psyCost_pp_32x32_avx2); p.cu[BLOCK_64x64].psy_cost_pp = PFX(psyCost_pp_64x64_avx2); - p.pu[LUMA_8x4].addAvg = PFX(addAvg_8x4_avx2); - p.pu[LUMA_8x8].addAvg = PFX(addAvg_8x8_avx2); - p.pu[LUMA_8x16].addAvg = PFX(addAvg_8x16_avx2); - p.pu[LUMA_8x32].addAvg = PFX(addAvg_8x32_avx2); - - p.pu[LUMA_12x16].addAvg = PFX(addAvg_12x16_avx2); - - p.pu[LUMA_16x4].addAvg = PFX(addAvg_16x4_avx2); - p.pu[LUMA_16x8].addAvg = PFX(addAvg_16x8_avx2); - p.pu[LUMA_16x12].addAvg = PFX(addAvg_16x12_avx2); - p.pu[LUMA_16x16].addAvg = PFX(addAvg_16x16_avx2); - p.pu[LUMA_16x32].addAvg = PFX(addAvg_16x32_avx2); - p.pu[LUMA_16x64].addAvg = PFX(addAvg_16x64_avx2); - - p.pu[LUMA_24x32].addAvg = PFX(addAvg_24x32_avx2); - - p.pu[LUMA_32x8].addAvg = PFX(addAvg_32x8_avx2); - p.pu[LUMA_32x16].addAvg = PFX(addAvg_32x16_avx2); - p.pu[LUMA_32x24].addAvg = PFX(addAvg_32x24_avx2); - p.pu[LUMA_32x32].addAvg = PFX(addAvg_32x32_avx2); - p.pu[LUMA_32x64].addAvg = PFX(addAvg_32x64_avx2); - - p.pu[LUMA_48x64].addAvg = PFX(addAvg_48x64_avx2); - - p.pu[LUMA_64x16].addAvg = PFX(addAvg_64x16_avx2); - p.pu[LUMA_64x32].addAvg = PFX(addAvg_64x32_avx2); - p.pu[LUMA_64x48].addAvg = PFX(addAvg_64x48_avx2); - p.pu[LUMA_64x64].addAvg = PFX(addAvg_64x64_avx2); - - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].addAvg = PFX(addAvg_8x2_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].addAvg = PFX(addAvg_8x4_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].addAvg = PFX(addAvg_8x6_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].addAvg = PFX(addAvg_8x8_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].addAvg = PFX(addAvg_8x16_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].addAvg = PFX(addAvg_8x32_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].addAvg = PFX(addAvg_12x16_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg = PFX(addAvg_16x4_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg = PFX(addAvg_16x8_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg = PFX(addAvg_16x12_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg = PFX(addAvg_16x16_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg = PFX(addAvg_16x32_avx2); - 
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg = PFX(addAvg_32x8_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg = PFX(addAvg_32x16_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg = PFX(addAvg_32x24_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg = PFX(addAvg_32x32_avx2); - - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].addAvg = PFX(addAvg_8x4_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].addAvg = PFX(addAvg_8x8_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].addAvg = PFX(addAvg_8x12_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].addAvg = PFX(addAvg_8x16_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].addAvg = PFX(addAvg_8x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].addAvg = PFX(addAvg_8x64_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].addAvg = PFX(addAvg_12x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg = PFX(addAvg_16x8_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg = PFX(addAvg_16x16_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg = PFX(addAvg_16x24_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg = PFX(addAvg_16x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg = PFX(addAvg_16x64_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].addAvg = PFX(addAvg_24x64_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg = PFX(addAvg_32x16_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg = PFX(addAvg_32x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg = PFX(addAvg_32x48_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg = PFX(addAvg_32x64_avx2); + ASSIGN2(p.pu[LUMA_8x4].addAvg, addAvg_8x4_avx2); + ASSIGN2(p.pu[LUMA_8x8].addAvg, addAvg_8x8_avx2); + ASSIGN2(p.pu[LUMA_8x16].addAvg, addAvg_8x16_avx2); + ASSIGN2(p.pu[LUMA_8x32].addAvg, addAvg_8x32_avx2); + ASSIGN2(p.pu[LUMA_12x16].addAvg, addAvg_12x16_avx2); + ASSIGN2(p.pu[LUMA_16x4].addAvg, addAvg_16x4_avx2); + ASSIGN2(p.pu[LUMA_16x8].addAvg, addAvg_16x8_avx2); + ASSIGN2(p.pu[LUMA_16x12].addAvg, addAvg_16x12_avx2); + ASSIGN2(p.pu[LUMA_16x16].addAvg, addAvg_16x16_avx2); + ASSIGN2(p.pu[LUMA_16x32].addAvg, addAvg_16x32_avx2); + ASSIGN2(p.pu[LUMA_16x64].addAvg, addAvg_16x64_avx2); + ASSIGN2(p.pu[LUMA_24x32].addAvg, addAvg_24x32_avx2); + ASSIGN2(p.pu[LUMA_32x8].addAvg, addAvg_32x8_avx2); + ASSIGN2(p.pu[LUMA_32x16].addAvg, addAvg_32x16_avx2); + ASSIGN2(p.pu[LUMA_32x24].addAvg, addAvg_32x24_avx2); + ASSIGN2(p.pu[LUMA_32x32].addAvg, addAvg_32x32_avx2); + ASSIGN2(p.pu[LUMA_32x64].addAvg, addAvg_32x64_avx2); + ASSIGN2(p.pu[LUMA_48x64].addAvg, addAvg_48x64_avx2); + ASSIGN2(p.pu[LUMA_64x16].addAvg, addAvg_64x16_avx2); + ASSIGN2(p.pu[LUMA_64x32].addAvg, addAvg_64x32_avx2); + ASSIGN2(p.pu[LUMA_64x48].addAvg, addAvg_64x48_avx2); + ASSIGN2(p.pu[LUMA_64x64].addAvg, addAvg_64x64_avx2); + + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].addAvg, addAvg_8x2_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].addAvg, addAvg_8x4_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].addAvg, addAvg_8x6_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].addAvg, addAvg_8x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].addAvg, addAvg_8x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].addAvg, addAvg_8x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].addAvg, addAvg_12x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg, addAvg_16x4_avx2); + 
ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg, addAvg_16x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg, addAvg_16x12_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg, addAvg_16x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg, addAvg_16x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg, addAvg_32x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg, addAvg_32x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg, addAvg_32x24_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg, addAvg_32x32_avx2); + + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].addAvg, addAvg_8x4_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].addAvg, addAvg_8x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].addAvg, addAvg_8x12_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].addAvg, addAvg_8x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].addAvg, addAvg_8x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].addAvg, addAvg_8x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].addAvg, addAvg_12x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg, addAvg_16x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg, addAvg_16x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg, addAvg_16x24_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg, addAvg_16x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg, addAvg_16x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].addAvg, addAvg_24x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg, addAvg_32x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg, addAvg_32x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg, addAvg_32x48_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg, addAvg_32x64_avx2); p.cu[BLOCK_8x8].sa8d = PFX(pixel_sa8d_8x8_avx2); p.cu[BLOCK_16x16].sa8d = PFX(pixel_sa8d_16x16_avx2); @@ -2803,13 +3787,13 @@ p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sa8d = PFX(pixel_sa8d_16x16_avx2); p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sa8d = PFX(pixel_sa8d_32x32_avx2); - p.cu[BLOCK_16x16].add_ps = PFX(pixel_add_ps_16x16_avx2); - p.cu[BLOCK_32x32].add_ps = PFX(pixel_add_ps_32x32_avx2); - p.cu[BLOCK_64x64].add_ps = PFX(pixel_add_ps_64x64_avx2); - p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = PFX(pixel_add_ps_16x16_avx2); - p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = PFX(pixel_add_ps_32x32_avx2); - p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = PFX(pixel_add_ps_16x32_avx2); - p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = PFX(pixel_add_ps_32x64_avx2); + ASSIGN2(p.cu[BLOCK_16x16].add_ps, pixel_add_ps_16x16_avx2); + ASSIGN2(p.cu[BLOCK_32x32].add_ps, pixel_add_ps_32x32_avx2); + ASSIGN2(p.cu[BLOCK_64x64].add_ps, pixel_add_ps_64x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps, pixel_add_ps_16x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps, pixel_add_ps_32x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps, pixel_add_ps_16x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps, pixel_add_ps_32x64_avx2); p.cu[BLOCK_16x16].sub_ps = PFX(pixel_sub_ps_16x16_avx2); p.cu[BLOCK_32x32].sub_ps = PFX(pixel_sub_ps_32x32_avx2); @@ -2818,25 +3802,23 @@ 
p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = PFX(pixel_sub_ps_32x32_avx2); p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sub_ps = PFX(pixel_sub_ps_16x32_avx2); p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = PFX(pixel_sub_ps_32x64_avx2); - - p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2); - p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2); - p.pu[LUMA_16x12].pixelavg_pp = PFX(pixel_avg_16x12_avx2); - p.pu[LUMA_16x16].pixelavg_pp = PFX(pixel_avg_16x16_avx2); - p.pu[LUMA_16x32].pixelavg_pp = PFX(pixel_avg_16x32_avx2); - p.pu[LUMA_16x64].pixelavg_pp = PFX(pixel_avg_16x64_avx2); - - p.pu[LUMA_32x64].pixelavg_pp = PFX(pixel_avg_32x64_avx2); - p.pu[LUMA_32x32].pixelavg_pp = PFX(pixel_avg_32x32_avx2); - p.pu[LUMA_32x24].pixelavg_pp = PFX(pixel_avg_32x24_avx2); - p.pu[LUMA_32x16].pixelavg_pp = PFX(pixel_avg_32x16_avx2); - p.pu[LUMA_32x8].pixelavg_pp = PFX(pixel_avg_32x8_avx2); - p.pu[LUMA_48x64].pixelavg_pp = PFX(pixel_avg_48x64_avx2); - p.pu[LUMA_64x64].pixelavg_pp = PFX(pixel_avg_64x64_avx2); - p.pu[LUMA_64x48].pixelavg_pp = PFX(pixel_avg_64x48_avx2); - p.pu[LUMA_64x32].pixelavg_pp = PFX(pixel_avg_64x32_avx2); - p.pu[LUMA_64x16].pixelavg_pp = PFX(pixel_avg_64x16_avx2); - + ASSIGN2(p.pu[LUMA_16x4].pixelavg_pp, pixel_avg_16x4_avx2); + ASSIGN2(p.pu[LUMA_16x8].pixelavg_pp, pixel_avg_16x8_avx2); + ASSIGN2(p.pu[LUMA_16x12].pixelavg_pp, pixel_avg_16x12_avx2); + ASSIGN2(p.pu[LUMA_16x16].pixelavg_pp, pixel_avg_16x16_avx2); + ASSIGN2(p.pu[LUMA_16x32].pixelavg_pp, pixel_avg_16x32_avx2); + ASSIGN2(p.pu[LUMA_16x64].pixelavg_pp, pixel_avg_16x64_avx2); + + ASSIGN2(p.pu[LUMA_32x64].pixelavg_pp, pixel_avg_32x64_avx2); + ASSIGN2(p.pu[LUMA_32x32].pixelavg_pp, pixel_avg_32x32_avx2); + ASSIGN2(p.pu[LUMA_32x24].pixelavg_pp, pixel_avg_32x24_avx2); + ASSIGN2(p.pu[LUMA_32x16].pixelavg_pp, pixel_avg_32x16_avx2); + ASSIGN2(p.pu[LUMA_32x8].pixelavg_pp, pixel_avg_32x8_avx2); + ASSIGN2(p.pu[LUMA_48x64].pixelavg_pp, pixel_avg_48x64_avx2); + ASSIGN2(p.pu[LUMA_64x64].pixelavg_pp, pixel_avg_64x64_avx2); + ASSIGN2(p.pu[LUMA_64x48].pixelavg_pp, pixel_avg_64x48_avx2); + ASSIGN2(p.pu[LUMA_64x32].pixelavg_pp, pixel_avg_64x32_avx2); + ASSIGN2(p.pu[LUMA_64x16].pixelavg_pp, pixel_avg_64x16_avx2); p.pu[LUMA_16x16].satd = PFX(pixel_satd_16x16_avx2); p.pu[LUMA_16x8].satd = PFX(pixel_satd_16x8_avx2); p.pu[LUMA_8x16].satd = PFX(pixel_satd_8x16_avx2); @@ -2895,19 +3877,15 @@ p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sse_pp = PFX(pixel_ssd_16x16_avx2); p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sse_pp = PFX(pixel_ssd_32x32_avx2); - p.cu[BLOCK_16x16].ssd_s = PFX(pixel_ssd_s_16_avx2); - p.cu[BLOCK_32x32].ssd_s = PFX(pixel_ssd_s_32_avx2); - + ASSIGN2(p.cu[BLOCK_16x16].ssd_s, pixel_ssd_s_16_avx2); + ASSIGN2(p.cu[BLOCK_32x32].ssd_s, pixel_ssd_s_32_avx2); p.cu[BLOCK_8x8].copy_cnt = PFX(copy_cnt_8_avx2); p.cu[BLOCK_16x16].copy_cnt = PFX(copy_cnt_16_avx2); p.cu[BLOCK_32x32].copy_cnt = PFX(copy_cnt_32_avx2); - - p.cu[BLOCK_16x16].blockfill_s = PFX(blockfill_s_16x16_avx2); - p.cu[BLOCK_32x32].blockfill_s = PFX(blockfill_s_32x32_avx2); - - ALL_LUMA_TU_S(cpy1Dto2D_shl, cpy1Dto2D_shl_, avx2); + ASSIGN2(p.cu[BLOCK_16x16].blockfill_s, blockfill_s_16x16_avx2); + ALL_LUMA_TU_S(cpy1Dto2D_shl[ALIGNED], cpy1Dto2D_shl_, avx2); + ALL_LUMA_TU_S(cpy1Dto2D_shl[NONALIGNED], cpy1Dto2D_shl_, avx2); ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, avx2); - p.cu[BLOCK_8x8].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_8_avx2); p.cu[BLOCK_16x16].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_16_avx2); p.cu[BLOCK_32x32].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_32_avx2); @@ -2923,10 
+3901,10 @@ p.dequant_normal = PFX(dequant_normal_avx2); p.dequant_scaling = PFX(dequant_scaling_avx2); - p.cu[BLOCK_16x16].calcresidual = PFX(getResidual16_avx2); - p.cu[BLOCK_32x32].calcresidual = PFX(getResidual32_avx2); + ASSIGN2(p.cu[BLOCK_16x16].calcresidual, getResidual16_avx2); + ASSIGN2(p.cu[BLOCK_32x32].calcresidual, getResidual32_avx2); - p.scale1D_128to64 = PFX(scale1D_128to64_avx2); + ASSIGN2(p.scale1D_128to64, scale1D_128to64_avx2); p.weight_pp = PFX(weight_pp_avx2); p.weight_sp = PFX(weight_sp_avx2); @@ -3354,44 +4332,45 @@ ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu); p.pu[LUMA_4x4].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_4x4>; - p.pu[LUMA_16x4].convert_p2s = PFX(filterPixelToShort_16x4_avx2); - p.pu[LUMA_16x8].convert_p2s = PFX(filterPixelToShort_16x8_avx2); - p.pu[LUMA_16x12].convert_p2s = PFX(filterPixelToShort_16x12_avx2); - p.pu[LUMA_16x16].convert_p2s = PFX(filterPixelToShort_16x16_avx2); - p.pu[LUMA_16x32].convert_p2s = PFX(filterPixelToShort_16x32_avx2); - p.pu[LUMA_16x64].convert_p2s = PFX(filterPixelToShort_16x64_avx2); - p.pu[LUMA_32x8].convert_p2s = PFX(filterPixelToShort_32x8_avx2); - p.pu[LUMA_32x16].convert_p2s = PFX(filterPixelToShort_32x16_avx2); - p.pu[LUMA_32x24].convert_p2s = PFX(filterPixelToShort_32x24_avx2); - p.pu[LUMA_32x32].convert_p2s = PFX(filterPixelToShort_32x32_avx2); - p.pu[LUMA_32x64].convert_p2s = PFX(filterPixelToShort_32x64_avx2); - p.pu[LUMA_64x16].convert_p2s = PFX(filterPixelToShort_64x16_avx2); - p.pu[LUMA_64x32].convert_p2s = PFX(filterPixelToShort_64x32_avx2); - p.pu[LUMA_64x48].convert_p2s = PFX(filterPixelToShort_64x48_avx2); - p.pu[LUMA_64x64].convert_p2s = PFX(filterPixelToShort_64x64_avx2); - p.pu[LUMA_48x64].convert_p2s = PFX(filterPixelToShort_48x64_avx2); - p.pu[LUMA_24x32].convert_p2s = PFX(filterPixelToShort_24x32_avx2); - - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = PFX(filterPixelToShort_16x4_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = PFX(filterPixelToShort_16x8_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = PFX(filterPixelToShort_16x12_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = PFX(filterPixelToShort_16x16_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = PFX(filterPixelToShort_16x32_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = PFX(filterPixelToShort_24x32_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = PFX(filterPixelToShort_32x8_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = PFX(filterPixelToShort_32x16_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = PFX(filterPixelToShort_32x24_avx2); - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = PFX(filterPixelToShort_32x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = PFX(filterPixelToShort_16x8_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = PFX(filterPixelToShort_16x16_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = PFX(filterPixelToShort_16x24_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = PFX(filterPixelToShort_16x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = PFX(filterPixelToShort_16x64_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = PFX(filterPixelToShort_24x64_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = PFX(filterPixelToShort_32x16_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = PFX(filterPixelToShort_32x32_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = PFX(filterPixelToShort_32x48_avx2); - 
p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = PFX(filterPixelToShort_32x64_avx2); + ASSIGN2(p.pu[LUMA_16x4].convert_p2s, filterPixelToShort_16x4_avx2); + ASSIGN2(p.pu[LUMA_16x8].convert_p2s, filterPixelToShort_16x8_avx2); + ASSIGN2(p.pu[LUMA_16x12].convert_p2s, filterPixelToShort_16x12_avx2); + ASSIGN2(p.pu[LUMA_16x16].convert_p2s, filterPixelToShort_16x16_avx2); + ASSIGN2(p.pu[LUMA_16x32].convert_p2s, filterPixelToShort_16x32_avx2); + ASSIGN2(p.pu[LUMA_16x64].convert_p2s, filterPixelToShort_16x64_avx2); + ASSIGN2(p.pu[LUMA_32x8].convert_p2s, filterPixelToShort_32x8_avx2); + ASSIGN2(p.pu[LUMA_32x16].convert_p2s, filterPixelToShort_32x16_avx2); + ASSIGN2(p.pu[LUMA_32x24].convert_p2s, filterPixelToShort_32x24_avx2); + ASSIGN2(p.pu[LUMA_32x32].convert_p2s, filterPixelToShort_32x32_avx2); + ASSIGN2(p.pu[LUMA_32x64].convert_p2s, filterPixelToShort_32x64_avx2); + ASSIGN2(p.pu[LUMA_64x16].convert_p2s, filterPixelToShort_64x16_avx2); + ASSIGN2(p.pu[LUMA_64x32].convert_p2s, filterPixelToShort_64x32_avx2); + ASSIGN2(p.pu[LUMA_64x48].convert_p2s, filterPixelToShort_64x48_avx2); + ASSIGN2(p.pu[LUMA_64x64].convert_p2s, filterPixelToShort_64x64_avx2); + ASSIGN2(p.pu[LUMA_48x64].convert_p2s, filterPixelToShort_48x64_avx2); + ASSIGN2(p.pu[LUMA_24x32].convert_p2s, filterPixelToShort_24x32_avx2); + + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s, filterPixelToShort_16x4_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s, filterPixelToShort_16x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s, filterPixelToShort_16x12_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s, filterPixelToShort_16x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s, filterPixelToShort_16x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s, filterPixelToShort_24x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s, filterPixelToShort_32x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s, filterPixelToShort_32x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s, filterPixelToShort_32x24_avx2); + ASSIGN2(p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s, filterPixelToShort_32x32_avx2); + + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s, filterPixelToShort_16x8_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s, filterPixelToShort_16x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s, filterPixelToShort_16x24_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s, filterPixelToShort_16x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s, filterPixelToShort_16x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s, filterPixelToShort_24x64_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s, filterPixelToShort_32x16_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s, filterPixelToShort_32x32_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s, filterPixelToShort_32x48_avx2); + ASSIGN2(p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s, filterPixelToShort_32x64_avx2); //i422 for chroma_hpp p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_hpp = PFX(interp_4tap_horiz_pp_12x32_avx2); @@ -3718,6 +4697,707 @@ p.integral_inith[INTEGRAL_16] = PFX(integral16h_avx2); p.integral_inith[INTEGRAL_24] = PFX(integral24h_avx2); p.integral_inith[INTEGRAL_32] = PFX(integral32h_avx2); + p.cu[BLOCK_4x4].nonPsyRdoQuant = PFX(nonPsyRdoQuant4_avx2); + p.cu[BLOCK_8x8].nonPsyRdoQuant = 
PFX(nonPsyRdoQuant8_avx2); + p.cu[BLOCK_16x16].nonPsyRdoQuant = PFX(nonPsyRdoQuant16_avx2); + p.cu[BLOCK_32x32].nonPsyRdoQuant = PFX(nonPsyRdoQuant32_avx2); + p.cu[BLOCK_4x4].psyRdoQuant_1p = PFX(psyRdoQuant_1p4_avx2); + p.cu[BLOCK_8x8].psyRdoQuant_1p = PFX(psyRdoQuant_1p8_avx2); + p.cu[BLOCK_16x16].psyRdoQuant_1p = PFX(psyRdoQuant_1p16_avx2); + p.cu[BLOCK_32x32].psyRdoQuant_1p = PFX(psyRdoQuant_1p32_avx2); + + } + if (cpuMask & X265_CPU_AVX512) + { + p.pu[LUMA_32x8].sad = PFX(pixel_sad_32x8_avx512); + // p.pu[LUMA_32x16].sad = PFX(pixel_sad_32x16_avx512); + p.pu[LUMA_32x24].sad = PFX(pixel_sad_32x24_avx512); + p.pu[LUMA_32x32].sad = PFX(pixel_sad_32x32_avx512); + //p.pu[LUMA_32x64].sad = PFX(pixel_sad_32x64_avx512); + p.pu[LUMA_64x16].sad = PFX(pixel_sad_64x16_avx512); + p.pu[LUMA_64x32].sad = PFX(pixel_sad_64x32_avx512); + p.pu[LUMA_64x48].sad = PFX(pixel_sad_64x48_avx512); + p.pu[LUMA_64x64].sad = PFX(pixel_sad_64x64_avx512); + + p.pu[LUMA_32x8].sad_x3 = PFX(pixel_sad_x3_32x8_avx512); + p.pu[LUMA_32x16].sad_x3 = PFX(pixel_sad_x3_32x16_avx512); + p.pu[LUMA_32x24].sad_x3 = PFX(pixel_sad_x3_32x24_avx512); + p.pu[LUMA_32x32].sad_x3 = PFX(pixel_sad_x3_32x32_avx512); + p.pu[LUMA_32x64].sad_x3 = PFX(pixel_sad_x3_32x64_avx512); + p.pu[LUMA_64x16].sad_x3 = PFX(pixel_sad_x3_64x16_avx512); + p.pu[LUMA_64x32].sad_x3 = PFX(pixel_sad_x3_64x32_avx512); + p.pu[LUMA_64x48].sad_x3 = PFX(pixel_sad_x3_64x48_avx512); + p.pu[LUMA_64x64].sad_x3 = PFX(pixel_sad_x3_64x64_avx512); + p.pu[LUMA_48x64].sad_x3 = PFX(pixel_sad_x3_48x64_avx512); + + p.pu[LUMA_32x32].sad_x4 = PFX(pixel_sad_x4_32x32_avx512); + p.pu[LUMA_32x16].sad_x4 = PFX(pixel_sad_x4_32x16_avx512); + p.pu[LUMA_32x64].sad_x4 = PFX(pixel_sad_x4_32x64_avx512); + p.pu[LUMA_32x24].sad_x4 = PFX(pixel_sad_x4_32x24_avx512); + p.pu[LUMA_32x8].sad_x4 = PFX(pixel_sad_x4_32x8_avx512); + p.pu[LUMA_64x16].sad_x4 = PFX(pixel_sad_x4_64x16_avx512); + p.pu[LUMA_64x32].sad_x4 = PFX(pixel_sad_x4_64x32_avx512); + p.pu[LUMA_64x48].sad_x4 = PFX(pixel_sad_x4_64x48_avx512); + p.pu[LUMA_64x64].sad_x4 = PFX(pixel_sad_x4_64x64_avx512); + p.pu[LUMA_48x64].sad_x4 = PFX(pixel_sad_x4_48x64_avx512); + + p.pu[LUMA_4x4].satd = PFX(pixel_satd_4x4_avx512); + p.pu[LUMA_4x8].satd = PFX(pixel_satd_4x8_avx512); + p.pu[LUMA_4x16].satd = PFX(pixel_satd_4x16_avx512); + p.pu[LUMA_8x4].satd = PFX(pixel_satd_8x4_avx512); + p.pu[LUMA_8x8].satd = PFX(pixel_satd_8x8_avx512); + p.pu[LUMA_8x16].satd = PFX(pixel_satd_8x16_avx512); + p.pu[LUMA_16x8].satd = PFX(pixel_satd_16x8_avx512); + p.pu[LUMA_16x16].satd = PFX(pixel_satd_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].satd = PFX(pixel_satd_4x4_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].satd = PFX(pixel_satd_4x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].satd = PFX(pixel_satd_4x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].satd = PFX(pixel_satd_8x4_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].satd = PFX(pixel_satd_8x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].satd = PFX(pixel_satd_8x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd = PFX(pixel_satd_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].satd = PFX(pixel_satd_16x16_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].satd = PFX(pixel_satd_4x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].satd = PFX(pixel_satd_4x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].satd = PFX(pixel_satd_4x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].satd = PFX(pixel_satd_8x4_avx512); + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].satd = PFX(pixel_satd_8x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].satd = PFX(pixel_satd_8x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].satd = PFX(pixel_satd_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].satd = PFX(pixel_satd_16x16_avx512); + + p.cu[BLOCK_8x8].sa8d = PFX(pixel_sa8d_8x8_avx512); + p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sa8d = PFX(pixel_sa8d_8x8_avx512); + + p.cu[BLOCK_8x8].var = PFX(pixel_var_8x8_avx512); + p.cu[BLOCK_16x16].var = PFX(pixel_var_16x16_avx512); + p.cu[BLOCK_32x32].var = PFX(pixel_var_32x32_avx512); + p.cu[BLOCK_64x64].var = PFX(pixel_var_64x64_avx512); + ASSIGN2(p.pu[LUMA_16x64].pixelavg_pp, pixel_avg_16x64_avx512); + ASSIGN2(p.pu[LUMA_16x32].pixelavg_pp, pixel_avg_16x32_avx512); + ASSIGN2(p.pu[LUMA_16x16].pixelavg_pp, pixel_avg_16x16_avx512); + ASSIGN2(p.pu[LUMA_16x12].pixelavg_pp, pixel_avg_16x12_avx512); + ASSIGN2(p.pu[LUMA_16x8].pixelavg_pp, pixel_avg_16x8_avx512); + ASSIGN2(p.pu[LUMA_16x4].pixelavg_pp, pixel_avg_16x4_avx512); + ASSIGN2(p.pu[LUMA_8x32].pixelavg_pp, pixel_avg_8x32_avx512); + ASSIGN2(p.pu[LUMA_8x16].pixelavg_pp, pixel_avg_8x16_avx512); + ASSIGN2(p.pu[LUMA_8x8].pixelavg_pp, pixel_avg_8x8_avx512); + //p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_8x4_avx512); + p.pu[LUMA_4x4].sad = PFX(pixel_sad_4x4_avx512); + p.pu[LUMA_4x8].sad = PFX(pixel_sad_4x8_avx512); + p.pu[LUMA_4x16].sad = PFX(pixel_sad_4x16_avx512); + p.pu[LUMA_8x4].sad = PFX(pixel_sad_8x4_avx512); + p.pu[LUMA_8x8].sad = PFX(pixel_sad_8x8_avx512); + // p.pu[LUMA_8x16].sad = PFX(pixel_sad_8x16_avx512); + p.pu[LUMA_16x8].sad = PFX(pixel_sad_16x8_avx512); + p.pu[LUMA_16x16].sad = PFX(pixel_sad_16x16_avx512); + + p.pu[LUMA_64x64].copy_pp = PFX(blockcopy_pp_64x64_avx512); + p.pu[LUMA_64x32].copy_pp = PFX(blockcopy_pp_64x32_avx512); + p.pu[LUMA_64x48].copy_pp = PFX(blockcopy_pp_64x48_avx512); + p.pu[LUMA_64x16].copy_pp = PFX(blockcopy_pp_64x16_avx512); + p.pu[LUMA_32x16].copy_pp = PFX(blockcopy_pp_32x16_avx512); + p.pu[LUMA_32x24].copy_pp = PFX(blockcopy_pp_32x24_avx512); + p.pu[LUMA_32x32].copy_pp = PFX(blockcopy_pp_32x32_avx512); + p.pu[LUMA_32x64].copy_pp = PFX(blockcopy_pp_32x64_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].copy_pp = PFX(blockcopy_pp_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].copy_pp = PFX(blockcopy_pp_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].copy_pp = PFX(blockcopy_pp_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].copy_pp = PFX(blockcopy_pp_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].copy_pp = PFX(blockcopy_pp_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].copy_pp = PFX(blockcopy_pp_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].copy_pp = PFX(blockcopy_pp_32x64_avx512); + + p.cu[BLOCK_64x64].copy_sp = PFX(blockcopy_sp_64x64_avx512); + p.cu[BLOCK_32x32].copy_sp = PFX(blockcopy_sp_32x32_avx512); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_sp = PFX(blockcopy_sp_32x32_avx512); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_sp = PFX(blockcopy_sp_32x64_avx512); + + p.cu[BLOCK_32x32].copy_ps = PFX(blockcopy_ps_32x32_avx512); + p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ps = PFX(blockcopy_ps_32x32_avx512); + p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ps = PFX(blockcopy_ps_32x64_avx512); + p.cu[BLOCK_64x64].copy_ps = PFX(blockcopy_ps_64x64_avx512); + + p.scale1D_128to64[NONALIGNED] = PFX(scale1D_128to64_avx512); + p.scale1D_128to64[ALIGNED] = 
PFX(scale1D_128to64_aligned_avx512); + + p.pu[LUMA_64x16].addAvg[NONALIGNED] = PFX(addAvg_64x16_avx512); + p.pu[LUMA_64x32].addAvg[NONALIGNED] = PFX(addAvg_64x32_avx512); + p.pu[LUMA_64x48].addAvg[NONALIGNED] = PFX(addAvg_64x48_avx512); + p.pu[LUMA_64x64].addAvg[NONALIGNED] = PFX(addAvg_64x64_avx512); + p.pu[LUMA_32x8].addAvg[NONALIGNED] = PFX(addAvg_32x8_avx512); + p.pu[LUMA_32x16].addAvg[NONALIGNED] = PFX(addAvg_32x16_avx512); + p.pu[LUMA_32x24].addAvg[NONALIGNED] = PFX(addAvg_32x24_avx512); + p.pu[LUMA_32x32].addAvg[NONALIGNED] = PFX(addAvg_32x32_avx512); + p.pu[LUMA_32x64].addAvg[NONALIGNED] = PFX(addAvg_32x64_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg[NONALIGNED] = PFX(addAvg_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg[NONALIGNED] = PFX(addAvg_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg[NONALIGNED] = PFX(addAvg_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg[NONALIGNED] = PFX(addAvg_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg[NONALIGNED] = PFX(addAvg_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg[NONALIGNED] = PFX(addAvg_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg[NONALIGNED] = PFX(addAvg_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg[NONALIGNED] = PFX(addAvg_32x32_avx512); + + p.pu[LUMA_32x8].addAvg[ALIGNED] = PFX(addAvg_aligned_32x8_avx512); + p.pu[LUMA_32x16].addAvg[ALIGNED] = PFX(addAvg_aligned_32x16_avx512); + p.pu[LUMA_32x24].addAvg[ALIGNED] = PFX(addAvg_aligned_32x24_avx512); + p.pu[LUMA_32x32].addAvg[ALIGNED] = PFX(addAvg_aligned_32x32_avx512); + p.pu[LUMA_32x64].addAvg[ALIGNED] = PFX(addAvg_aligned_32x64_avx512); + p.pu[LUMA_64x16].addAvg[ALIGNED] = PFX(addAvg_aligned_64x16_avx512); + p.pu[LUMA_64x32].addAvg[ALIGNED] = PFX(addAvg_aligned_64x32_avx512); + p.pu[LUMA_64x48].addAvg[ALIGNED] = PFX(addAvg_aligned_64x48_avx512); + p.pu[LUMA_64x64].addAvg[ALIGNED] = PFX(addAvg_aligned_64x64_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg[ALIGNED] = PFX(addAvg_aligned_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg[ALIGNED] = PFX(addAvg_aligned_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg[ALIGNED] = PFX(addAvg_aligned_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg[ALIGNED] = PFX(addAvg_aligned_32x32_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg[ALIGNED] = PFX(addAvg_aligned_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg[ALIGNED] = PFX(addAvg_aligned_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg[ALIGNED] = PFX(addAvg_aligned_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg[ALIGNED] = PFX(addAvg_aligned_32x32_avx512); + + p.cu[BLOCK_32x32].blockfill_s[NONALIGNED] = PFX(blockfill_s_32x32_avx512); + p.cu[BLOCK_32x32].blockfill_s[ALIGNED] = PFX(blockfill_s_aligned_32x32_avx512); + + p.cu[BLOCK_64x64].add_ps[NONALIGNED] = PFX(pixel_add_ps_64x64_avx512); + p.cu[BLOCK_32x32].add_ps[NONALIGNED] = PFX(pixel_add_ps_32x32_avx512); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps[NONALIGNED] = PFX(pixel_add_ps_32x32_avx512); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps[NONALIGNED] = PFX(pixel_add_ps_32x64_avx512); + + p.cu[BLOCK_32x32].add_ps[ALIGNED] = PFX(pixel_add_ps_aligned_32x32_avx512); + p.cu[BLOCK_64x64].add_ps[ALIGNED] = PFX(pixel_add_ps_aligned_64x64_avx512); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps[ALIGNED] = 
PFX(pixel_add_ps_aligned_32x32_avx512); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps[ALIGNED] = PFX(pixel_add_ps_aligned_32x64_avx512); + + p.cu[BLOCK_64x64].sub_ps = PFX(pixel_sub_ps_64x64_avx512); + p.cu[BLOCK_32x32].sub_ps = PFX(pixel_sub_ps_32x32_avx512); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = PFX(pixel_sub_ps_32x32_avx512); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = PFX(pixel_sub_ps_32x64_avx512); + + p.pu[LUMA_64x16].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_64x16_avx512); + p.pu[LUMA_64x32].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_64x32_avx512); + p.pu[LUMA_64x48].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_64x48_avx512); + p.pu[LUMA_64x64].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_64x64_avx512); + p.pu[LUMA_32x8].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_32x8_avx2); + p.pu[LUMA_32x16].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_32x16_avx512); + p.pu[LUMA_32x24].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_32x24_avx512); + p.pu[LUMA_32x32].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_32x32_avx512); + p.pu[LUMA_32x64].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_32x64_avx512); + p.pu[LUMA_48x64].convert_p2s[NONALIGNED] = PFX(filterPixelToShort_48x64_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s[NONALIGNED] = PFX(filterPixelToShort_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s[NONALIGNED] = PFX(filterPixelToShort_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s[NONALIGNED] = PFX(filterPixelToShort_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s[NONALIGNED] = PFX(filterPixelToShort_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s[NONALIGNED] = PFX(filterPixelToShort_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s[NONALIGNED] = PFX(filterPixelToShort_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s[NONALIGNED] = PFX(filterPixelToShort_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s[NONALIGNED] = PFX(filterPixelToShort_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].p2s[NONALIGNED] = PFX(filterPixelToShort_32x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].p2s[NONALIGNED] = PFX(filterPixelToShort_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].p2s[NONALIGNED] = PFX(filterPixelToShort_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].p2s[NONALIGNED] = PFX(filterPixelToShort_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].p2s[NONALIGNED] = PFX(filterPixelToShort_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].p2s[NONALIGNED] = PFX(filterPixelToShort_64x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].p2s[NONALIGNED] = PFX(filterPixelToShort_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].p2s[NONALIGNED] = PFX(filterPixelToShort_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].p2s[NONALIGNED] = PFX(filterPixelToShort_64x64_avx512); + + p.pu[LUMA_64x16].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x16_avx512); + p.pu[LUMA_64x32].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x32_avx512); + p.pu[LUMA_64x48].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x48_avx512); + p.pu[LUMA_64x64].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x64_avx512); + p.pu[LUMA_32x8].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x8_avx512); + p.pu[LUMA_32x16].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x16_avx512); + p.pu[LUMA_32x24].convert_p2s[ALIGNED] = 
PFX(filterPixelToShort_aligned_32x24_avx512); + p.pu[LUMA_32x32].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x32_avx512); + p.pu[LUMA_32x64].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x64_avx512); + p.pu[LUMA_48x64].convert_p2s[ALIGNED] = PFX(filterPixelToShort_aligned_48x64_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x32_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x64_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_32x8].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].p2s[ALIGNED] = PFX(filterPixelToShort_aligned_64x64_avx512); + + p.cu[BLOCK_64x64].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_ss_64x64_avx512); + p.cu[BLOCK_32x32].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_ss_32x32_avx512); + p.cu[BLOCK_16x16].sse_ss = (pixel_sse_ss_t)PFX(pixel_ssd_ss_16x16_avx512); + p.cu[BLOCK_32x32].ssd_s[NONALIGNED] = PFX(pixel_ssd_s_32_avx512); + p.cu[BLOCK_32x32].ssd_s[ALIGNED] = PFX(pixel_ssd_s_32_avx512); + p.cu[BLOCK_16x16].ssd_s[NONALIGNED] = PFX(pixel_ssd_s_16_avx512); + p.cu[BLOCK_16x16].ssd_s[ALIGNED] = PFX(pixel_ssd_s_aligned_16_avx512); + p.cu[BLOCK_32x32].copy_ss = PFX(blockcopy_ss_32x32_avx512); + p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ss = PFX(blockcopy_ss_32x32_avx512); + p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ss = PFX(blockcopy_ss_32x64_avx512); + p.cu[BLOCK_64x64].copy_ss = PFX(blockcopy_ss_64x64_avx512); + + p.cu[BLOCK_32x32].calcresidual[NONALIGNED] = PFX(getResidual32_avx512); + p.cu[BLOCK_32x32].calcresidual[ALIGNED] = PFX(getResidual_aligned32_avx512); + p.cu[BLOCK_16x16].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_16_avx512); + p.cu[BLOCK_32x32].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_32_avx512); + p.cu[BLOCK_32x32].cpy1Dto2D_shl[NONALIGNED] = PFX(cpy1Dto2D_shl_32_avx512); + p.cu[BLOCK_32x32].cpy1Dto2D_shl[ALIGNED] = PFX(cpy1Dto2D_shl_aligned_32_avx512); + p.cu[BLOCK_16x16].cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_16_avx512); + p.cu[BLOCK_32x32].cpy1Dto2D_shr = PFX(cpy1Dto2D_shr_32_avx512); + + p.cu[BLOCK_16x16].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_16_avx512); + 
p.cu[BLOCK_32x32].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_32_avx512); + + p.cu[BLOCK_32x32].copy_cnt = PFX(copy_cnt_32_avx512); + p.cu[BLOCK_16x16].copy_cnt = PFX(copy_cnt_16_avx512); + + p.dequant_normal = PFX(dequant_normal_avx512); + p.dequant_scaling = PFX(dequant_scaling_avx512); + //i444 chroma_hpp + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hpp = PFX(interp_4tap_horiz_pp_64x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hpp = PFX(interp_4tap_horiz_pp_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hpp = PFX(interp_4tap_horiz_pp_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hpp = PFX(interp_4tap_horiz_pp_64x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hpp = PFX(interp_4tap_horiz_pp_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hpp = PFX(interp_4tap_horiz_pp_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hpp = PFX(interp_4tap_horiz_pp_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hpp = PFX(interp_4tap_horiz_pp_16x12_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hpp = PFX(interp_4tap_horiz_pp_16x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hpp = PFX(interp_4tap_horiz_pp_16x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hpp = PFX(interp_4tap_horiz_pp_48x64_avx512); + + //i422 chroma_hpp + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hpp = PFX(interp_4tap_horiz_pp_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hpp = PFX(interp_4tap_horiz_pp_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hpp = PFX(interp_4tap_horiz_pp_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hpp = PFX(interp_4tap_horiz_pp_32x48_avx512); + + //i420 chroma_hpp + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hpp = PFX(interp_4tap_horiz_pp_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_hpp = PFX(interp_4tap_horiz_pp_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hpp = PFX(interp_4tap_horiz_pp_32x24_avx512); + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hpp = PFX(interp_4tap_horiz_pp_32x8_avx512); + + p.weight_pp = PFX(weight_pp_avx512); + p.weight_sp = PFX(weight_sp_avx512); + + //i444 chroma_hps + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hps = PFX(interp_4tap_horiz_ps_64x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hps = PFX(interp_4tap_horiz_ps_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hps = PFX(interp_4tap_horiz_ps_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hps = PFX(interp_4tap_horiz_ps_64x16_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hps = PFX(interp_4tap_horiz_ps_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hps = PFX(interp_4tap_horiz_ps_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hps = PFX(interp_4tap_horiz_ps_32x8_avx512); + + //i422 chroma_hps + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hps = PFX(interp_4tap_horiz_ps_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hps = PFX(interp_4tap_horiz_ps_32x48_avx512); + + //i420 chroma_hps + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hps = PFX(interp_4tap_horiz_ps_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hps = PFX(interp_4tap_horiz_ps_32x8_avx512); + + p.pu[LUMA_16x4].luma_hpp = PFX(interp_8tap_horiz_pp_16x4_avx512); + p.pu[LUMA_16x8].luma_hpp = PFX(interp_8tap_horiz_pp_16x8_avx512); + p.pu[LUMA_16x12].luma_hpp = PFX(interp_8tap_horiz_pp_16x12_avx512); + p.pu[LUMA_16x16].luma_hpp = PFX(interp_8tap_horiz_pp_16x16_avx512); + p.pu[LUMA_16x32].luma_hpp = PFX(interp_8tap_horiz_pp_16x32_avx512); + p.pu[LUMA_16x64].luma_hpp = PFX(interp_8tap_horiz_pp_16x64_avx512); + p.pu[LUMA_32x8].luma_hpp = PFX(interp_8tap_horiz_pp_32x8_avx512); + p.pu[LUMA_32x16].luma_hpp = PFX(interp_8tap_horiz_pp_32x16_avx512); + p.pu[LUMA_32x24].luma_hpp = PFX(interp_8tap_horiz_pp_32x24_avx512); + p.pu[LUMA_32x32].luma_hpp = PFX(interp_8tap_horiz_pp_32x32_avx512); + p.pu[LUMA_32x64].luma_hpp = PFX(interp_8tap_horiz_pp_32x64_avx512); + p.pu[LUMA_64x16].luma_hpp = PFX(interp_8tap_horiz_pp_64x16_avx512); + p.pu[LUMA_64x32].luma_hpp = PFX(interp_8tap_horiz_pp_64x32_avx512); + p.pu[LUMA_64x48].luma_hpp = PFX(interp_8tap_horiz_pp_64x48_avx512); + p.pu[LUMA_64x64].luma_hpp = PFX(interp_8tap_horiz_pp_64x64_avx512); + p.pu[LUMA_48x64].luma_hpp = PFX(interp_8tap_horiz_pp_48x64_avx512); + ASSIGN2(p.pu[LUMA_64x16].pixelavg_pp, pixel_avg_64x16_avx512); + ASSIGN2(p.pu[LUMA_64x32].pixelavg_pp, pixel_avg_64x32_avx512); + ASSIGN2(p.pu[LUMA_64x48].pixelavg_pp, pixel_avg_64x48_avx512); + ASSIGN2(p.pu[LUMA_64x64].pixelavg_pp, pixel_avg_64x64_avx512); + //luma hps + p.pu[LUMA_64x64].luma_hps = PFX(interp_8tap_horiz_ps_64x64_avx512); + p.pu[LUMA_64x48].luma_hps = PFX(interp_8tap_horiz_ps_64x48_avx512); + p.pu[LUMA_64x32].luma_hps = PFX(interp_8tap_horiz_ps_64x32_avx512); + p.pu[LUMA_64x16].luma_hps = PFX(interp_8tap_horiz_ps_64x16_avx512); 
+ + p.pu[LUMA_32x64].luma_hps = PFX(interp_8tap_horiz_ps_32x64_avx512); + p.pu[LUMA_32x32].luma_hps = PFX(interp_8tap_horiz_ps_32x32_avx512); + p.pu[LUMA_32x24].luma_hps = PFX(interp_8tap_horiz_ps_32x24_avx512); + p.pu[LUMA_32x16].luma_hps = PFX(interp_8tap_horiz_ps_32x16_avx512); + p.pu[LUMA_32x8].luma_hps = PFX(interp_8tap_horiz_ps_32x8_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_hps = PFX(interp_4tap_horiz_ps_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hps = PFX(interp_4tap_horiz_ps_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hps = PFX(interp_4tap_horiz_ps_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hps = PFX(interp_4tap_horiz_ps_16x24_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hps = PFX(interp_4tap_horiz_ps_16x12_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hps = PFX(interp_4tap_horiz_ps_16x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hps = PFX(interp_4tap_horiz_ps_16x64_avx512); + + p.pu[LUMA_16x8].luma_hps = PFX(interp_8tap_horiz_ps_16x8_avx512); + p.pu[LUMA_16x16].luma_hps = PFX(interp_8tap_horiz_ps_16x16_avx512); + p.pu[LUMA_16x12].luma_hps = PFX(interp_8tap_horiz_ps_16x12_avx512); + p.pu[LUMA_16x4].luma_hps = PFX(interp_8tap_horiz_ps_16x4_avx512); + p.pu[LUMA_16x32].luma_hps = PFX(interp_8tap_horiz_ps_16x32_avx512); + p.pu[LUMA_16x64].luma_hps = PFX(interp_8tap_horiz_ps_16x64_avx512); + + p.pu[LUMA_48x64].luma_hps = PFX(interp_8tap_horiz_ps_48x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hps = PFX(interp_4tap_horiz_ps_48x64_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vpp = PFX(interp_4tap_vert_pp_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vpp = PFX(interp_4tap_vert_pp_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vpp = PFX(interp_4tap_vert_pp_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vpp = PFX(interp_4tap_vert_pp_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vss = PFX(interp_4tap_vert_ss_8x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vss = 
PFX(interp_4tap_vert_ss_8x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vss = PFX(interp_4tap_vert_ss_8x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vss = PFX(interp_4tap_vert_ss_8x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vss = PFX(interp_4tap_vert_ss_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vss = PFX(interp_4tap_vert_ss_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vss = PFX(interp_4tap_vert_ss_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vss = PFX(interp_4tap_vert_ss_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vss = PFX(interp_4tap_vert_ss_16x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vss = PFX(interp_4tap_vert_ss_24x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vss = PFX(interp_4tap_vert_ss_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vss = PFX(interp_4tap_vert_ss_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vss = PFX(interp_4tap_vert_ss_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vss = PFX(interp_4tap_vert_ss_32x32_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vsp = PFX(interp_4tap_vert_sp_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vsp = PFX(interp_4tap_vert_sp_16x12_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vsp = PFX(interp_4tap_vert_sp_32x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vsp = PFX(interp_4tap_vert_sp_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vpp = PFX(interp_4tap_vert_pp_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vpp = PFX(interp_4tap_vert_pp_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vpp = PFX(interp_4tap_vert_pp_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vpp = PFX(interp_4tap_vert_pp_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vpp = PFX(interp_4tap_vert_pp_32x64_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vss = PFX(interp_4tap_vert_ss_8x4_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vss = PFX(interp_4tap_vert_ss_8x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vss = PFX(interp_4tap_vert_ss_8x12_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vss = PFX(interp_4tap_vert_ss_8x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vss = 
PFX(interp_4tap_vert_ss_8x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vss = PFX(interp_4tap_vert_ss_8x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vss = PFX(interp_4tap_vert_ss_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vss = PFX(interp_4tap_vert_ss_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vss = PFX(interp_4tap_vert_ss_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vss = PFX(interp_4tap_vert_ss_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vss = PFX(interp_4tap_vert_ss_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vss = PFX(interp_4tap_vert_ss_24x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vss = PFX(interp_4tap_vert_ss_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vss = PFX(interp_4tap_vert_ss_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vss = PFX(interp_4tap_vert_ss_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vss = PFX(interp_4tap_vert_ss_32x64_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vsp = PFX(interp_4tap_vert_sp_16x24_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vsp = PFX(interp_4tap_vert_sp_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vsp = PFX(interp_4tap_vert_sp_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vsp = PFX(interp_4tap_vert_sp_32x64_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vpp = PFX(interp_4tap_vert_pp_16x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vpp = PFX(interp_4tap_vert_pp_16x12_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vpp = PFX(interp_4tap_vert_pp_16x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vpp = PFX(interp_4tap_vert_pp_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vpp = PFX(interp_4tap_vert_pp_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vpp = PFX(interp_4tap_vert_pp_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vpp = PFX(interp_4tap_vert_pp_48x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vpp = PFX(interp_4tap_vert_pp_64x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vpp = PFX(interp_4tap_vert_pp_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vpp = PFX(interp_4tap_vert_pp_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vpp = PFX(interp_4tap_vert_pp_64x16_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vss = 
PFX(interp_4tap_vert_ss_8x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vss = PFX(interp_4tap_vert_ss_8x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vss = PFX(interp_4tap_vert_ss_8x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vss = PFX(interp_4tap_vert_ss_8x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vss = PFX(interp_4tap_vert_ss_16x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vss = PFX(interp_4tap_vert_ss_16x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vss = PFX(interp_4tap_vert_ss_16x12_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vss = PFX(interp_4tap_vert_ss_16x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vss = PFX(interp_4tap_vert_ss_16x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vss = PFX(interp_4tap_vert_ss_16x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vss = PFX(interp_4tap_vert_ss_24x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vss = PFX(interp_4tap_vert_ss_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vss = PFX(interp_4tap_vert_ss_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vss = PFX(interp_4tap_vert_ss_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vss = PFX(interp_4tap_vert_ss_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vss = PFX(interp_4tap_vert_ss_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vss = PFX(interp_4tap_vert_ss_64x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vss = PFX(interp_4tap_vert_ss_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vss = PFX(interp_4tap_vert_ss_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vss = PFX(interp_4tap_vert_ss_64x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vss = PFX(interp_4tap_vert_ss_48x64_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vsp = PFX(interp_4tap_vert_sp_16x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vsp = PFX(interp_4tap_vert_sp_16x12_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vsp = PFX(interp_4tap_vert_sp_16x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vsp = PFX(interp_4tap_vert_sp_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vsp = PFX(interp_4tap_vert_sp_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vsp = PFX(interp_4tap_vert_sp_32x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vsp = PFX(interp_4tap_vert_sp_48x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vsp = PFX(interp_4tap_vert_sp_64x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vsp = PFX(interp_4tap_vert_sp_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vsp = PFX(interp_4tap_vert_sp_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vsp = PFX(interp_4tap_vert_sp_64x16_avx512); + + p.pu[LUMA_8x8].luma_vss = PFX(interp_8tap_vert_ss_8x8_avx512); + p.pu[LUMA_8x16].luma_vss = PFX(interp_8tap_vert_ss_8x16_avx512); + p.pu[LUMA_8x32].luma_vss = 
PFX(interp_8tap_vert_ss_8x32_avx512); + p.pu[LUMA_16x4].luma_vss = PFX(interp_8tap_vert_ss_16x4_avx512); + p.pu[LUMA_16x8].luma_vss = PFX(interp_8tap_vert_ss_16x8_avx512); + p.pu[LUMA_16x12].luma_vss = PFX(interp_8tap_vert_ss_16x12_avx512); + p.pu[LUMA_16x16].luma_vss = PFX(interp_8tap_vert_ss_16x16_avx512); + p.pu[LUMA_16x32].luma_vss = PFX(interp_8tap_vert_ss_16x32_avx512); + p.pu[LUMA_16x64].luma_vss = PFX(interp_8tap_vert_ss_16x64_avx512); + p.pu[LUMA_24x32].luma_vss = PFX(interp_8tap_vert_ss_24x32_avx512); + p.pu[LUMA_32x64].luma_vss = PFX(interp_8tap_vert_ss_32x64_avx512); + p.pu[LUMA_32x32].luma_vss = PFX(interp_8tap_vert_ss_32x32_avx512); + p.pu[LUMA_32x24].luma_vss = PFX(interp_8tap_vert_ss_32x24_avx512); + p.pu[LUMA_32x16].luma_vss = PFX(interp_8tap_vert_ss_32x16_avx512); + p.pu[LUMA_32x8].luma_vss = PFX(interp_8tap_vert_ss_32x8_avx512); + p.pu[LUMA_48x64].luma_vss = PFX(interp_8tap_vert_ss_48x64_avx512); + p.pu[LUMA_64x64].luma_vss = PFX(interp_8tap_vert_ss_64x64_avx512); + p.pu[LUMA_64x48].luma_vss = PFX(interp_8tap_vert_ss_64x48_avx512); + p.pu[LUMA_64x32].luma_vss = PFX(interp_8tap_vert_ss_64x32_avx512); + p.pu[LUMA_64x16].luma_vss = PFX(interp_8tap_vert_ss_64x16_avx512); + + p.pu[LUMA_16x64].luma_vpp = PFX(interp_8tap_vert_pp_16x64_avx512); + p.pu[LUMA_16x32].luma_vpp = PFX(interp_8tap_vert_pp_16x32_avx512); + p.pu[LUMA_16x16].luma_vpp = PFX(interp_8tap_vert_pp_16x16_avx512); + p.pu[LUMA_16x8].luma_vpp = PFX(interp_8tap_vert_pp_16x8_avx512); + p.pu[LUMA_32x64].luma_vpp = PFX(interp_8tap_vert_pp_32x64_avx512); + p.pu[LUMA_32x32].luma_vpp = PFX(interp_8tap_vert_pp_32x32_avx512); + p.pu[LUMA_32x24].luma_vpp = PFX(interp_8tap_vert_pp_32x24_avx512); + p.pu[LUMA_32x16].luma_vpp = PFX(interp_8tap_vert_pp_32x16_avx512); + p.pu[LUMA_32x8].luma_vpp = PFX(interp_8tap_vert_pp_32x8_avx512); + p.pu[LUMA_48x64].luma_vpp = PFX(interp_8tap_vert_pp_48x64_avx512); + p.pu[LUMA_64x64].luma_vpp = PFX(interp_8tap_vert_pp_64x64_avx512); + p.pu[LUMA_64x48].luma_vpp = PFX(interp_8tap_vert_pp_64x48_avx512); + p.pu[LUMA_64x32].luma_vpp = PFX(interp_8tap_vert_pp_64x32_avx512); + p.pu[LUMA_64x16].luma_vpp = PFX(interp_8tap_vert_pp_64x16_avx512); + p.pu[LUMA_16x4].luma_vsp = PFX(interp_8tap_vert_sp_16x4_avx512); + p.pu[LUMA_16x8].luma_vsp = PFX(interp_8tap_vert_sp_16x8_avx512); + p.pu[LUMA_16x12].luma_vsp = PFX(interp_8tap_vert_sp_16x12_avx512); + p.pu[LUMA_16x16].luma_vsp = PFX(interp_8tap_vert_sp_16x16_avx512); + p.pu[LUMA_16x32].luma_vsp = PFX(interp_8tap_vert_sp_16x32_avx512); + p.pu[LUMA_16x64].luma_vsp = PFX(interp_8tap_vert_sp_16x64_avx512); + p.pu[LUMA_32x64].luma_vsp = PFX(interp_8tap_vert_sp_32x64_avx512); + p.pu[LUMA_32x32].luma_vsp = PFX(interp_8tap_vert_sp_32x32_avx512); + p.pu[LUMA_32x24].luma_vsp = PFX(interp_8tap_vert_sp_32x24_avx512); + p.pu[LUMA_32x16].luma_vsp = PFX(interp_8tap_vert_sp_32x16_avx512); + p.pu[LUMA_32x8].luma_vsp = PFX(interp_8tap_vert_sp_32x8_avx512); + p.pu[LUMA_48x64].luma_vsp = PFX(interp_8tap_vert_sp_48x64_avx512); + p.pu[LUMA_64x64].luma_vsp = PFX(interp_8tap_vert_sp_64x64_avx512); + p.pu[LUMA_64x48].luma_vsp = PFX(interp_8tap_vert_sp_64x48_avx512); + p.pu[LUMA_64x32].luma_vsp = PFX(interp_8tap_vert_sp_64x32_avx512); + p.pu[LUMA_64x16].luma_vsp = PFX(interp_8tap_vert_sp_64x16_avx512); + + p.cu[BLOCK_8x8].dct = PFX(dct8_avx512); + /* TODO: Currently these kernels performance are similar to AVX2 version, we need a to improve them further to ebable + * it. Probably a Vtune analysis will help here. 
+ + * p.cu[BLOCK_16x16].dct = PFX(dct16_avx512); + * p.cu[BLOCK_32x32].dct = PFX(dct32_avx512); */ + + p.cu[BLOCK_8x8].idct = PFX(idct8_avx512); + p.cu[BLOCK_16x16].idct = PFX(idct16_avx512); + p.cu[BLOCK_32x32].idct = PFX(idct32_avx512); + p.quant = PFX(quant_avx512); + p.nquant = PFX(nquant_avx512); + p.denoiseDct = PFX(denoise_dct_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vps = PFX(interp_4tap_vert_ps_64x64_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vps = PFX(interp_4tap_vert_ps_64x48_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vps = PFX(interp_4tap_vert_ps_64x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vps = PFX(interp_4tap_vert_ps_64x16_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vps = PFX(interp_4tap_vert_ps_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vps = PFX(interp_4tap_vert_ps_32x8_avx512); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vps = PFX(interp_4tap_vert_ps_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vps = PFX(interp_4tap_vert_ps_32x48_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vps = PFX(interp_4tap_vert_ps_32x24_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vps = PFX(interp_4tap_vert_ps_32x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vps = PFX(interp_4tap_vert_ps_32x64_avx512); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vps = PFX(interp_4tap_vert_ps_16x4_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vps = PFX(interp_4tap_vert_ps_16x12_avx512); + //p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_avx512); + + /*p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vps = PFX(interp_4tap_vert_ps_16x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vps = PFX(interp_4tap_vert_ps_16x24_avx512);*/ + + //p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_avx512); + //p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vps = PFX(interp_4tap_vert_ps_16x12_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vps = PFX(interp_4tap_vert_ps_16x4_avx512); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vps = PFX(interp_4tap_vert_ps_16x64_avx512); + 
p.cu[BLOCK_16x16].psy_cost_pp = PFX(psyCost_pp_16x16_avx512); + p.cu[BLOCK_32x32].psy_cost_pp = PFX(psyCost_pp_32x32_avx512); + p.cu[BLOCK_64x64].psy_cost_pp = PFX(psyCost_pp_64x64_avx512); + + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vps = PFX(interp_4tap_vert_ps_48x64_avx512); + + p.pu[LUMA_64x16].luma_vps = PFX(interp_8tap_vert_ps_64x16_avx512); + p.pu[LUMA_64x32].luma_vps = PFX(interp_8tap_vert_ps_64x32_avx512); + p.pu[LUMA_64x48].luma_vps = PFX(interp_8tap_vert_ps_64x48_avx512); + p.pu[LUMA_64x64].luma_vps = PFX(interp_8tap_vert_ps_64x64_avx512); + + p.pu[LUMA_32x8].luma_vps = PFX(interp_8tap_vert_ps_32x8_avx512); + p.pu[LUMA_32x16].luma_vps = PFX(interp_8tap_vert_ps_32x16_avx512); + p.pu[LUMA_32x32].luma_vps = PFX(interp_8tap_vert_ps_32x32_avx512); + p.pu[LUMA_32x24].luma_vps = PFX(interp_8tap_vert_ps_32x24_avx512); + p.pu[LUMA_32x64].luma_vps = PFX(interp_8tap_vert_ps_32x64_avx512); + + p.pu[LUMA_16x8].luma_vps = PFX(interp_8tap_vert_ps_16x8_avx512); + p.pu[LUMA_16x16].luma_vps = PFX(interp_8tap_vert_ps_16x16_avx512); + p.pu[LUMA_16x32].luma_vps = PFX(interp_8tap_vert_ps_16x32_avx512); + //p.pu[LUMA_16x64].luma_vps = PFX(interp_8tap_vert_ps_16x64_avx512); + p.pu[LUMA_48x64].luma_vps = PFX(interp_8tap_vert_ps_48x64_avx512); + + p.pu[LUMA_64x64].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_64x64>; + p.pu[LUMA_64x48].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_64x48>; + p.pu[LUMA_64x32].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_64x32>; + p.pu[LUMA_64x16].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_64x16>; + p.pu[LUMA_32x8].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x8>; + p.pu[LUMA_32x16].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x16>; + p.pu[LUMA_32x32].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x32>; + p.pu[LUMA_32x24].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x24>; + p.pu[LUMA_32x64].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_32x64>; + p.pu[LUMA_16x4].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x4>; + p.pu[LUMA_16x8].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x8>; + p.pu[LUMA_16x12].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x12>; + p.pu[LUMA_16x16].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x16>; + p.pu[LUMA_16x32].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x32>; + p.pu[LUMA_16x64].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_16x64>; + p.pu[LUMA_48x64].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_48x64>; + + p.cu[BLOCK_4x4].nonPsyRdoQuant = PFX(nonPsyRdoQuant4_avx512); + p.cu[BLOCK_8x8].nonPsyRdoQuant = PFX(nonPsyRdoQuant8_avx512); + p.cu[BLOCK_16x16].nonPsyRdoQuant = PFX(nonPsyRdoQuant16_avx512); + p.cu[BLOCK_32x32].nonPsyRdoQuant = PFX(nonPsyRdoQuant32_avx512); + p.cu[BLOCK_4x4].psyRdoQuant = PFX(psyRdoQuant4_avx512); + p.cu[BLOCK_8x8].psyRdoQuant = PFX(psyRdoQuant8_avx512); + p.cu[BLOCK_16x16].psyRdoQuant = PFX(psyRdoQuant16_avx512); + p.cu[BLOCK_32x32].psyRdoQuant = PFX(psyRdoQuant32_avx512); + p.pu[LUMA_32x8].satd = PFX(pixel_satd_32x8_avx512); + p.pu[LUMA_32x16].satd = PFX(pixel_satd_32x16_avx512); + p.pu[LUMA_32x24].satd = PFX(pixel_satd_32x24_avx512); + p.pu[LUMA_32x32].satd = PFX(pixel_satd_32x32_avx512); + p.pu[LUMA_32x64].satd = PFX(pixel_satd_32x64_avx512); + p.pu[LUMA_64x16].satd = PFX(pixel_satd_64x16_avx512); + p.pu[LUMA_64x32].satd = PFX(pixel_satd_64x32_avx512); + p.pu[LUMA_64x48].satd = PFX(pixel_satd_64x48_avx512); + p.pu[LUMA_64x64].satd = PFX(pixel_satd_64x64_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].satd = PFX(pixel_satd_32x32_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].satd = PFX(pixel_satd_32x16_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].satd = 
PFX(pixel_satd_32x24_avx512); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd = PFX(pixel_satd_32x8_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].satd = PFX(pixel_satd_32x64_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].satd = PFX(pixel_satd_32x48_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].satd = PFX(pixel_satd_32x32_avx512); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = PFX(pixel_satd_32x16_avx512); + p.planecopy_sp_shl = PFX(upShift_16_avx512); + p.cu[BLOCK_16x16].count_nonzero = PFX(count_nonzero_16x16_avx512); + p.cu[BLOCK_32x32].count_nonzero = PFX(count_nonzero_32x32_avx512); } #endif @@ -3738,7 +5418,7 @@ // CPU dispatcher function void PFX(intel_cpu_indicator_init)(void) { - uint32_t cpu = x265::cpu_detect(); + uint32_t cpu = x265::cpu_detect(false); if (cpu & X265_CPU_AVX) __intel_cpu_indicator = 0x20000;
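The hunks above replace pairs of direct function-pointer assignments with the ASSIGN2 helper: the affected dispatch slots (pixelavg_pp, ssd_s, blockfill_s, calcresidual, convert_p2s/p2s and others) are now two-entry arrays indexed by NONALIGNED and ALIGNED, and ASSIGN2 fills both entries from a single symbol name. A minimal sketch of such a helper is given below; it only assumes the array-of-two layout visible in the diff, and the exact macro shipped in the x265 headers may differ in detail.

/* Sketch of an ASSIGN2-style helper (assumption: the real definition
 * lives in the upstream headers and may differ). It populates both the
 * non-aligned and aligned entries of a primitive slot from one symbol. */
#define ASSIGN2(member, fname) \
    (member)[NONALIGNED] = PFX(fname); \
    (member)[ALIGNED]    = PFX(fname)

/* Example, matching the assignments in the hunk above:
 *   ASSIGN2(p.cu[BLOCK_16x16].ssd_s, pixel_ssd_s_16_avx2);
 * expands to:
 *   p.cu[BLOCK_16x16].ssd_s[NONALIGNED] = PFX(pixel_ssd_s_16_avx2);
 *   p.cu[BLOCK_16x16].ssd_s[ALIGNED]    = PFX(pixel_ssd_s_16_avx2);
 */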
x265_2.7.tar.gz/source/common/x86/blockcopy8.asm -> x265_2.9.tar.gz/source/common/x86/blockcopy8.asm
Changed
@@ -26,7 +26,10 @@ %include "x86inc.asm" %include "x86util.asm" -SECTION_RODATA 32 +SECTION_RODATA 64 + +ALIGN 64 +const shuf1_avx512, dq 0, 2, 4, 6, 1, 3, 5, 7 cextern pb_4 cextern pb_1 @@ -1103,6 +1106,82 @@ BLOCKCOPY_PP_W64_H4_avx 64, 48 BLOCKCOPY_PP_W64_H4_avx 64, 64 +;---------------------------------------------------------------------------------------------- +; blockcopy_pp avx512 code start +;---------------------------------------------------------------------------------------------- +%macro PROCESS_BLOCKCOPY_PP_64X4_avx512 0 +movu m0, [r2] +movu m1, [r2 + r3] +movu m2, [r2 + 2 * r3] +movu m3, [r2 + r4] + +movu [r0] , m0 +movu [r0 + r1] , m1 +movu [r0 + 2 * r1] , m2 +movu [r0 + r5] , m3 +%endmacro + +%macro PROCESS_BLOCKCOPY_PP_32X4_avx512 0 +movu ym0, [r2] +vinserti32x8 m0, [r2 + r3], 1 +movu ym1, [r2 + 2 * r3] +vinserti32x8 m1, [r2 + r4], 1 + +movu [r0] , ym0 +vextracti32x8 [r0 + r1] , m0, 1 +movu [r0 + 2 * r1] , ym1 +vextracti32x8 [r0 + r5] , m1, 1 +%endmacro + +;---------------------------------------------------------------------------------------------- +; void blockcopy_pp_64x%1(pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride) +;---------------------------------------------------------------------------------------------- +%macro BLOCKCOPY_PP_W64_H4_avx512 1 +INIT_ZMM avx512 +cglobal blockcopy_pp_64x%1, 4, 6, 4 +lea r4, [3 * r3] +lea r5, [3 * r1] + +%rep %1/4 - 1 +PROCESS_BLOCKCOPY_PP_64X4_avx512 +lea r2, [r2 + 4 * r3] +lea r0, [r0 + 4 * r1] +%endrep + +PROCESS_BLOCKCOPY_PP_64X4_avx512 +RET +%endmacro + +BLOCKCOPY_PP_W64_H4_avx512 16 +BLOCKCOPY_PP_W64_H4_avx512 32 +BLOCKCOPY_PP_W64_H4_avx512 48 +BLOCKCOPY_PP_W64_H4_avx512 64 + +%macro BLOCKCOPY_PP_W32_H4_avx512 1 +INIT_ZMM avx512 +cglobal blockcopy_pp_32x%1, 4, 6, 2 + lea r4, [3 * r3] + lea r5, [3 * r1] + +%rep %1/4 - 1 + PROCESS_BLOCKCOPY_PP_32X4_avx512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] +%endrep + PROCESS_BLOCKCOPY_PP_32X4_avx512 + RET +%endmacro + +BLOCKCOPY_PP_W32_H4_avx512 8 +BLOCKCOPY_PP_W32_H4_avx512 16 +BLOCKCOPY_PP_W32_H4_avx512 24 +BLOCKCOPY_PP_W32_H4_avx512 32 +BLOCKCOPY_PP_W32_H4_avx512 48 +BLOCKCOPY_PP_W32_H4_avx512 64 +;---------------------------------------------------------------------------------------------- +; blockcopy_pp avx512 code end +;---------------------------------------------------------------------------------------------- + ;----------------------------------------------------------------------------- ; void blockcopy_sp_2x4(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride) ;----------------------------------------------------------------------------- @@ -2121,6 +2200,86 @@ BLOCKCOPY_SP_W64_H4_avx2 64, 64 +%macro PROCESS_BLOCKCOPY_SP_64x4_AVX512 0 + movu m0, [r2] + movu m1, [r2 + 64] + movu m2, [r2 + r3] + movu m3, [r2 + r3 + 64] + + packuswb m0, m1 + packuswb m2, m3 + vpermq m0, m4, m0 + vpermq m2, m4, m2 + movu [r0], m0 + movu [r0 + r1], m2 + + movu m0, [r2 + 2 * r3] + movu m1, [r2 + 2 * r3 + 64] + movu m2, [r2 + r4] + movu m3, [r2 + r4 + 64] + + packuswb m0, m1 + packuswb m2, m3 + vpermq m0, m4, m0 + vpermq m2, m4, m2 + movu [r0 + 2 * r1], m0 + movu [r0 + r5], m2 +%endmacro + +%macro PROCESS_BLOCKCOPY_SP_32x4_AVX512 0 + movu m0, [r2] + movu m1, [r2 + r3] + movu m2, [r2 + 2 * r3] + movu m3, [r2 + r4] + + packuswb m0, m1 + packuswb m2, m3 + vpermq m0, m4, m0 + vpermq m2, m4, m2 + movu [r0], ym0 + vextracti32x8 [r0 + r1], m0, 1 + movu [r0 + 2 * r1], ym2 + vextracti32x8 [r0 + r5], m2, 1 +%endmacro + 
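For readers skimming the assembly: blockcopy_pp is a plain strided copy between pixel buffers (the 64-wide kernels move a whole 64-byte row per zmm register, the 32-wide ones pack two rows per register), while blockcopy_sp, whose helper macros end here and whose entry points follow below, converts 16-bit coefficients back to pixels with unsigned saturation, which is what the packuswb/vpermq pairs implement. A scalar C++ sketch of the same behaviour, assuming the 8-bit build where pixel is uint8_t:

    #include <algorithm>
    #include <cstdint>
    #include <cstring>

    // Plain pixel-to-pixel block copy (what the blockcopy_pp_WxH kernels do, four rows per pass).
    static void blockcopy_pp_c(uint8_t* dst, intptr_t dstStride,
                               const uint8_t* src, intptr_t srcStride, int bw, int bh)
    {
        for (int y = 0; y < bh; y++, dst += dstStride, src += srcStride)
            std::memcpy(dst, src, (size_t)bw);
    }

    // int16 -> pixel copy with unsigned saturation (the packuswb step in blockcopy_sp).
    static void blockcopy_sp_c(uint8_t* dst, intptr_t dstStride,
                               const int16_t* src, intptr_t srcStride, int bw, int bh)
    {
        for (int y = 0; y < bh; y++, dst += dstStride, src += srcStride)
            for (int x = 0; x < bw; x++)
                dst[x] = (uint8_t)std::min(255, std::max(0, (int)src[x]));
    }

The vpermq against shuf1_avx512 only undoes the qword interleave that packuswb performs across 128-bit lanes; it does not change the arithmetic.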
+;----------------------------------------------------------------------------- +; void blockcopy_sp_%1x%2(pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride) +;----------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal blockcopy_sp_64x64, 4, 6, 5 + mova m4, [shuf1_avx512] + add r3, r3 + lea r4, [3 * r3] + lea r5, [3 * r1] + +%rep 15 + PROCESS_BLOCKCOPY_SP_64x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_BLOCKCOPY_SP_64x4_AVX512 + RET + +%macro BLOCKCOPY_SP_32xN_AVX512 1 +INIT_ZMM avx512 +cglobal blockcopy_sp_32x%1, 4, 6, 5 + mova m4, [shuf1_avx512] + add r3, r3 + lea r4, [3 * r3] + lea r5, [3 * r1] + +%rep %1/4 - 1 + PROCESS_BLOCKCOPY_SP_32x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_BLOCKCOPY_SP_32x4_AVX512 + RET +%endmacro + +BLOCKCOPY_SP_32xN_AVX512 32 +BLOCKCOPY_SP_32xN_AVX512 64 + ;----------------------------------------------------------------------------- ; void blockfill_s_4x4(int16_t* dst, intptr_t dstride, int16_t val) ;----------------------------------------------------------------------------- @@ -2396,6 +2555,43 @@ movu [r0 + r3 + 32], m0 RET +;-------------------------------------------------------------------- +; void blockfill_s_32x32(int16_t* dst, intptr_t dstride, int16_t val) +;-------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal blockfill_s_32x32, 3, 4, 1 +add r1, r1 +lea r3, [3 * r1] +movd xm0, r2d +vpbroadcastw m0, xm0 + +%rep 8 +movu [r0], m0 +movu [r0 + r1], m0 +movu [r0 + 2 * r1], m0 +movu [r0 + r3], m0 +lea r0, [r0 + 4 * r1] +%endrep +RET + +;-------------------------------------------------------------------- +; void blockfill_s_aligned_32x32(int16_t* dst, intptr_t dstride, int16_t val) +;-------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal blockfill_s_aligned_32x32, 3, 4, 1 +add r1, r1 +lea r3, [3 * r1] +movd xm0, r2d +vpbroadcastw m0, xm0 + +%rep 8 +mova [r0], m0 +mova [r0 + r1], m0 +mova [r0 + 2 * r1], m0 +mova [r0 + r3], m0 +lea r0, [r0 + 4 * r1] +%endrep +RET ;----------------------------------------------------------------------------- ; void blockcopy_ps_2x4(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); ;----------------------------------------------------------------------------- @@ -3077,6 +3273,79 @@ BLOCKCOPY_PS_W32_H4_avx2 32, 32 BLOCKCOPY_PS_W32_H4_avx2 32, 64 +%macro PROCESS_BLOCKCOPY_PS_32x8_AVX512 0 + pmovzxbw m0, [r2] + pmovzxbw m1, [r2 + r3] + pmovzxbw m2, [r2 + r3 * 2] + pmovzxbw m3, [r2 + r4] + + movu [r0], m0 + movu [r0 + r1], m1 + movu [r0 + r1 * 2], m2 + movu [r0 + r5], m3 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + pmovzxbw m0, [r2] + pmovzxbw m1, [r2 + r3] + pmovzxbw m2, [r2 + r3 * 2] + pmovzxbw m3, [r2 + r4] + + movu [r0], m0 + movu [r0 + r1], m1 + movu [r0 + r1 * 2], m2 + movu [r0 + r5], m3 +%endmacro + +INIT_ZMM avx512 +cglobal blockcopy_ps_32x32, 4, 6, 4 + add r1, r1 + lea r4, [3 * r3] + lea r5, [3 * r1] + + PROCESS_BLOCKCOPY_PS_32x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_32x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_32x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_32x8_AVX512 + RET + +INIT_ZMM avx512 +cglobal blockcopy_ps_32x64, 4, 6, 4 + add r1, r1 + lea r4, [3 * r3] + lea r5, [3 * r1] + + PROCESS_BLOCKCOPY_PS_32x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + 
PROCESS_BLOCKCOPY_PS_32x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_32x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_32x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_32x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_32x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_32x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_32x8_AVX512 + RET + ;----------------------------------------------------------------------------- ; void blockcopy_ps_%1x%2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); ;----------------------------------------------------------------------------- @@ -3262,6 +3531,79 @@ jnz .loop RET +%macro PROCESS_BLOCKCOPY_PS_64x8_AVX512 0 + pmovzxbw m0, [r2] + pmovzxbw m1, [r2 + 32] + pmovzxbw m2, [r2 + r3] + pmovzxbw m3, [r2 + r3 + 32] + movu [r0], m0 + movu [r0 + 64], m1 + movu [r0 + r1], m2 + movu [r0 + r1 + 64], m3 + + pmovzxbw m0, [r2 + r3 * 2] + pmovzxbw m1, [r2 + r3 * 2 + 32] + pmovzxbw m2, [r2 + r4] + pmovzxbw m3, [r2 + r4 + 32] + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 64], m1 + movu [r0 + r5], m2 + movu [r0 + r5 + 64], m3 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + pmovzxbw m0, [r2] + pmovzxbw m1, [r2 + 32] + pmovzxbw m2, [r2 + r3] + pmovzxbw m3, [r2 + r3 + 32] + movu [r0], m0 + movu [r0 + 64], m1 + movu [r0 + r1], m2 + movu [r0 + r1 + 64], m3 + + pmovzxbw m0, [r2 + r3 * 2] + pmovzxbw m1, [r2 + r3 * 2 + 32] + pmovzxbw m2, [r2 + r4] + pmovzxbw m3, [r2 + r4 + 32] + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 64], m1 + movu [r0 + r5], m2 + movu [r0 + r5 + 64], m3 +%endmacro +;----------------------------------------------------------------------------- +; void blockcopy_ps_%1x%2(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +;----------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal blockcopy_ps_64x64, 4, 6, 4 + add r1, r1 + lea r4, [3 * r3] + lea r5, [3 * r1] + + PROCESS_BLOCKCOPY_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + PROCESS_BLOCKCOPY_PS_64x8_AVX512 + RET + ;----------------------------------------------------------------------------- ; void blockcopy_ss_2x4(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride) ;----------------------------------------------------------------------------- @@ -4051,6 +4393,143 @@ BLOCKCOPY_SS_W32_H4_avx 32, 48 BLOCKCOPY_SS_W32_H4_avx 32, 64 +%macro PROCESS_BLOCKCOPY_SS_W32_H8_avx512 0 + movu m0, [r2] + movu m1, [r2 + r3] + movu m2, [r2 + 2 * r3] + movu m3, [r2 + r6] + lea r2, [r2 + 4 * r3] + + movu [r0], m0 + movu [r0 + r1], m1 + movu [r0 + 2 * r1], m2 + movu [r0 + r5], m3 + lea r0, [r0 + 4 * r1] + + movu m0, [r2] + movu m1, [r2 + r3] + movu m2, [r2 + 2 * r3] + movu m3, [r2 + r6] + lea r2, [r2 + 4 * r3] + + movu [r0], m0 + movu [r0 + r1], m1 + movu [r0 + 2 * r1], m2 + movu [r0 + r5], m3 + lea r0, [r0 + 
4 * r1] +%endmacro + +%macro PROCESS_BLOCKCOPY_SS_W32_H8_LAST_avx512 0 + movu m0, [r2] + movu m1, [r2 + r3] + movu m2, [r2 + 2 * r3] + movu m3, [r2 + r6] + lea r2, [r2 + 4 * r3] + + movu [r0], m0 + movu [r0 + r1], m1 + movu [r0 + 2 * r1], m2 + movu [r0 + r5], m3 + lea r0, [r0 + 4 * r1] + + movu m0, [r2] + movu m1, [r2 + r3] + movu m2, [r2 + 2 * r3] + movu m3, [r2 + r6] + + movu [r0], m0 + movu [r0 + r1], m1 + movu [r0 + 2 * r1], m2 + movu [r0 + r5], m3 +%endmacro + +;----------------------------------------------------------------------------- +; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride) +;----------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal blockcopy_ss_32x8, 4, 7, 4 + + add r1, r1 + add r3, r3 + lea r5, [3 * r1] + lea r6, [3 * r3] + + PROCESS_BLOCKCOPY_SS_W32_H8_LAST_avx512 + RET + +INIT_ZMM avx512 +cglobal blockcopy_ss_32x16, 4, 7, 4 + + add r1, r1 + add r3, r3 + lea r5, [3 * r1] + lea r6, [3 * r3] + + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_LAST_avx512 + RET + +INIT_ZMM avx512 +cglobal blockcopy_ss_32x24, 4, 7, 4 + + add r1, r1 + add r3, r3 + lea r5, [3 * r1] + lea r6, [3 * r3] + + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_LAST_avx512 + RET + +INIT_ZMM avx512 +cglobal blockcopy_ss_32x32, 4, 7, 4 + + add r1, r1 + add r3, r3 + lea r5, [3 * r1] + lea r6, [3 * r3] + + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_LAST_avx512 + RET + +INIT_ZMM avx512 +cglobal blockcopy_ss_32x48, 4, 7, 4 + + add r1, r1 + add r3, r3 + lea r5, [3 * r1] + lea r6, [3 * r3] + + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_LAST_avx512 + RET + +INIT_ZMM avx512 +cglobal blockcopy_ss_32x64, 4, 7, 4 + + add r1, r1 + add r3, r3 + lea r5, [3 * r1] + lea r6, [3 * r3] + + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_avx512 + PROCESS_BLOCKCOPY_SS_W32_H8_LAST_avx512 + RET + ;----------------------------------------------------------------------------- ; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride) ;----------------------------------------------------------------------------- @@ -4349,6 +4828,154 @@ BLOCKCOPY_SS_W64_H4_avx 64, 48 BLOCKCOPY_SS_W64_H4_avx 64, 64 +%macro PROCESS_BLOCKCOPY_SS_W64_H8_avx512 0 + movu m0, [r2] + movu m1, [r2 + mmsize] + movu m2, [r2 + r3] + movu m3, [r2 + r3 + mmsize] + + movu [r0], m0 + movu [r0 + mmsize], m1 + movu [r0 + r1], m2 + movu [r0 + r1 + mmsize], m3 + + movu m0, [r2 + 2 * r3] + movu m1, [r2 + 2 * r3 + mmsize] + movu m2, [r2 + r6] + movu m3, [r2 + r6 + mmsize] + lea r2, [r2 + 4 * r3] + + movu [r0 + 2 * r1], m0 + movu [r0 + 2 * r1 + mmsize], m1 + movu [r0 + r5], m2 + movu [r0 + r5 + mmsize], m3 + lea r0, [r0 + 4 * r1] + + movu m0, [r2] + movu m1, [r2 + mmsize] + movu m2, [r2 + r3] + movu m3, [r2 + r3 + mmsize] + + movu [r0], m0 + movu [r0 + mmsize], m1 + movu [r0 + r1], m2 + movu [r0 + r1 + mmsize], m3 + + movu m0, [r2 + 2 * r3] + movu m1, [r2 + 2 * r3 + mmsize] + movu m2, [r2 + r6] + movu 
m3, [r2 + r6 + mmsize] + lea r2, [r2 + 4 * r3] + + movu [r0 + 2 * r1], m0 + movu [r0 + 2 * r1 + mmsize], m1 + movu [r0 + r5], m2 + movu [r0 + r5 + mmsize], m3 + lea r0, [r0 + 4 * r1] +%endmacro + +%macro PROCESS_BLOCKCOPY_SS_W64_H8_LAST_avx512 0 + movu m0, [r2] + movu m1, [r2 + mmsize] + movu m2, [r2 + r3] + movu m3, [r2 + r3 + mmsize] + + movu [r0], m0 + movu [r0 + mmsize], m1 + movu [r0 + r1], m2 + movu [r0 + r1 + mmsize], m3 + + movu m0, [r2 + 2 * r3] + movu m1, [r2 + 2 * r3 + mmsize] + movu m2, [r2 + r6] + movu m3, [r2 + r6 + mmsize] + lea r2, [r2 + 4 * r3] + + movu [r0 + 2 * r1], m0 + movu [r0 + 2 * r1 + mmsize], m1 + movu [r0 + r5], m2 + movu [r0 + r5 + mmsize], m3 + lea r0, [r0 + 4 * r1] + + movu m0, [r2] + movu m1, [r2 + mmsize] + movu m2, [r2 + r3] + movu m3, [r2 + r3 + mmsize] + + movu [r0], m0 + movu [r0 + mmsize], m1 + movu [r0 + r1], m2 + movu [r0 + r1 + mmsize], m3 + + movu m0, [r2 + 2 * r3] + movu m1, [r2 + 2 * r3 + mmsize] + movu m2, [r2 + r6] + movu m3, [r2 + r6 + mmsize] + + movu [r0 + 2 * r1], m0 + movu [r0 + 2 * r1 + mmsize], m1 + movu [r0 + r5], m2 + movu [r0 + r5 + mmsize], m3 +%endmacro + +;----------------------------------------------------------------------------- +; void blockcopy_ss_%1x%2(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride) +;----------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal blockcopy_ss_64x16, 4, 7, 4 + add r1, r1 + add r3, r3 + lea r5, [3 * r1] + lea r6, [3 * r3] + + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_LAST_avx512 + RET + +INIT_ZMM avx512 +cglobal blockcopy_ss_64x32, 4, 7, 4 + add r1, r1 + add r3, r3 + lea r5, [3 * r1] + lea r6, [3 * r3] + + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_LAST_avx512 + RET + +INIT_ZMM avx512 +cglobal blockcopy_ss_64x48, 4, 7, 4 + add r1, r1 + add r3, r3 + lea r5, [3 * r1] + lea r6, [3 * r3] + + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_LAST_avx512 + RET + +INIT_ZMM avx512 +cglobal blockcopy_ss_64x64, 4, 7, 4 + add r1, r1 + add r3, r3 + lea r5, [3 * r1] + lea r6, [3 * r3] + + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_avx512 + PROCESS_BLOCKCOPY_SS_W64_H8_LAST_avx512 + RET ;-------------------------------------------------------------------------------------- ; void cpy2Dto1D_shr(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); ;-------------------------------------------------------------------------------------- @@ -4572,6 +5199,53 @@ jnz .loop RET +INIT_ZMM avx512 +cglobal cpy2Dto1D_shr_16, 4, 5, 4 + shl r2d, 1 + movd xm0, r3d + pcmpeqw ymm1, ymm1 + psllw ym1, ymm1, xm0 + psraw ym1, 1 + vinserti32x8 m1, ym1, 1 + lea r3, [r2 * 3] + mov r4d, 2 + +.loop: + ; Row 0-1 + movu ym2, [r1] + vinserti32x8 m2, [r1 + r2], 1 + psubw m2, m1 + psraw m2, xm0 + movu [r0], m2 + + ; Row 2-3 + movu ym2, [r1 + 2 * r2] + vinserti32x8 m2, [r1 + r3], 1 + psubw m2, m1 + psraw m2, xm0 + movu [r0 + mmsize], m2 + + lea r1, [r1 + 4 * r2] + ; Row 4-5 + + movu ym2, [r1] + vinserti32x8 m2, [r1 + r2], 1 + psubw m2, m1 + psraw m2, xm0 + movu [r0 + 2 * mmsize], m2 + 
+ ; Row 6-7 + movu ym2, [r1 + 2 * r2] + vinserti32x8 m2, [r1 + r3], 1 + psubw m2, m1 + psraw m2, xm0 + movu [r0 + 3 * mmsize], m2 + + add r0, 4 * mmsize + lea r1, [r1 + 4 * r2] + dec r4d + jnz .loop + RET ;-------------------------------------------------------------------------------------- ; void cpy2Dto1D_shr(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); @@ -4675,6 +5349,48 @@ jnz .loop RET +INIT_ZMM avx512 +cglobal cpy2Dto1D_shr_32, 4, 5, 4 + shl r2d, 1 + movd xm0, r3d + pcmpeqw ymm1, ymm1 + psllw ym1, ymm1, xm0 + psraw ym1, 1 + vinserti32x8 m1, ym1, 1 + lea r3, [r2 * 3] + mov r4d, 8 + +.loop: + ; Row 0 + movu m2, [r1] + psubw m2, m1 + psraw m2, xm0 + movu [r0], m2 + + ; Row 1 + movu m2, [r1 + r2] + psubw m2, m1 + psraw m2, xm0 + movu [r0 + mmsize], m2 + + ; Row 2 + movu m2, [r1 + 2 * r2] + psubw m2, m1 + psraw m2, xm0 + movu [r0 + 2 * mmsize], m2 + + ; Row 3 + movu m2, [r1 + r3] + psubw m2, m1 + psraw m2, xm0 + movu [r0 + 3 * mmsize], m2 + + add r0, 4 * mmsize + lea r1, [r1 + 4 * r2] + dec r4d + jnz .loop + RET + ;-------------------------------------------------------------------------------------- ; void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift) ;-------------------------------------------------------------------------------------- @@ -4931,7 +5647,103 @@ jnz .loop RET +;-------------------------------------------------------------------------------------- +; cpy_1Dto2D_shl avx512 code start +;-------------------------------------------------------------------------------------- +%macro PROCESS_CPY1Dto2D_SHL_32x8_AVX512 0 + movu m1, [r1 + 0 * mmsize] + movu m2, [r1 + 1 * mmsize] + movu m3, [r1 + 2 * mmsize] + movu m4, [r1 + 3 * mmsize] + psllw m1, xm0 + psllw m2, xm0 + psllw m3, xm0 + psllw m4, xm0 + movu [r0], m1 + movu [r0 + r2], m2 + movu [r0 + 2 * r2], m3 + movu [r0 + r3], m4 + + add r1, 4 * mmsize + lea r0, [r0 + r2 * 4] + + movu m1, [r1 + 0 * mmsize] + movu m2, [r1 + 1 * mmsize] + movu m3, [r1 + 2 * mmsize] + movu m4, [r1 + 3 * mmsize] + psllw m1, xm0 + psllw m2, xm0 + psllw m3, xm0 + psllw m4, xm0 + movu [r0], m1 + movu [r0 + r2], m2 + movu [r0 + 2 * r2], m3 + movu [r0 + r3], m4 +%endmacro +;-------------------------------------------------------------------------------------- +; void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift) +;-------------------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal cpy1Dto2D_shl_32, 4, 4, 5 + add r2d, r2d + movd xm0, r3d + lea r3, [3 * r2] +%rep 3 + PROCESS_CPY1Dto2D_SHL_32x8_AVX512 + add r1, 4 * mmsize + lea r0, [r0 + r2 * 4] +%endrep + PROCESS_CPY1Dto2D_SHL_32x8_AVX512 + RET +%macro PROCESS_CPY1Dto2D_SHL_ALIGNED_32x8_AVX512 0 + mova m1, [r1 + 0 * mmsize] + mova m2, [r1 + 1 * mmsize] + mova m3, [r1 + 2 * mmsize] + mova m4, [r1 + 3 * mmsize] + psllw m1, xm0 + psllw m2, xm0 + psllw m3, xm0 + psllw m4, xm0 + mova [r0], m1 + mova [r0 + r2], m2 + mova [r0 + 2 * r2], m3 + mova [r0 + r3], m4 + + add r1, 4 * mmsize + lea r0, [r0 + r2 * 4] + + mova m1, [r1 + 0 * mmsize] + mova m2, [r1 + 1 * mmsize] + mova m3, [r1 + 2 * mmsize] + mova m4, [r1 + 3 * mmsize] + psllw m1, xm0 + psllw m2, xm0 + psllw m3, xm0 + psllw m4, xm0 + mova [r0], m1 + mova [r0 + r2], m2 + mova [r0 + 2 * r2], m3 + mova [r0 + r3], m4 +%endmacro +;-------------------------------------------------------------------------------------- +; void cpy1Dto2D_shl(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift) 
+;-------------------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal cpy1Dto2D_shl_aligned_32, 4, 4, 5 + add r2d, r2d + movd xm0, r3d + lea r3, [3 * r2] +%rep 3 + PROCESS_CPY1Dto2D_SHL_ALIGNED_32x8_AVX512 + add r1, 4 * mmsize + lea r0, [r0 + r2 * 4] +%endrep + PROCESS_CPY1Dto2D_SHL_ALIGNED_32x8_AVX512 + RET +;-------------------------------------------------------------------------------------- +; copy_cnt avx512 code end +;-------------------------------------------------------------------------------------- ;-------------------------------------------------------------------------------------- ; uint32_t copy_cnt(int16_t* dst, const int16_t* src, intptr_t srcStride); ;-------------------------------------------------------------------------------------- @@ -5294,7 +6106,91 @@ movd eax, xm4 RET +;-------------------------------------------------------------------------------------- +; copy_cnt avx512 code start +;-------------------------------------------------------------------------------------- +%macro PROCESS_COPY_CNT_32x4_AVX512 0 + movu m0, [r1] + movu m1, [r1 + r2] + movu [r0], m0 + movu [r0 + mmsize], m1 + packsswb m0, m1 + pminub m0, m3 + + movu m1, [r1 + 2 * r2] + movu m2, [r1 + r3] + movu [r0 + 2 * mmsize], m1 + movu [r0 + 3 * mmsize], m2 + packsswb m1, m2 + pminub m1, m3 + + paddb m0, m1 + paddb m4, m0 +%endmacro + +%macro PROCESS_COPY_CNT_16x4_AVX512 0 + movu ym0, [r1] + vinserti32x8 m0, [r1 + r2], 1 + movu ym1, [r1 + 2 * r2] + vinserti32x8 m1, [r1 + r3], 1 + movu [r0], m0 + movu [r0 + mmsize], m1 + packsswb m0, m1 + pminub m0, m3 + paddb m4, m0 +%endmacro + +%macro PROCESS_COPY_CNT_END_AVX512 0 + pxor m0, m0 + vextracti32x8 ym1, m4, 1 + paddb ym4, ym1 + vextracti32x4 xm1, ym4, 1 + paddb xm4, xm1 + psadbw xm4, xm0 + movhlps xm1, xm4 + paddd xm4, xm1 + movd eax, xm4 +%endmacro + +;-------------------------------------------------------------------------------------- +; uint32_t copy_cnt(int32_t* dst, const int16_t* src, intptr_t stride); +;-------------------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal copy_cnt_32, 3, 4, 5 + add r2d, r2d + lea r3, [3 * r2] + + vbroadcasti32x8 m3, [pb_1] + pxor m4, m4 + +%rep 7 + PROCESS_COPY_CNT_32x4_AVX512 + add r0, 4 * mmsize + lea r1, [r1 + 4 * r2] +%endrep + PROCESS_COPY_CNT_32x4_AVX512 + PROCESS_COPY_CNT_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal copy_cnt_16, 3, 4, 5 + add r2d, r2d + lea r3, [3 * r2] + + vbroadcasti32x8 m3, [pb_1] + pxor m4, m4 +%rep 3 + PROCESS_COPY_CNT_16x4_AVX512 + add r0, 2 * mmsize + lea r1, [r1 + 4 * r2] +%endrep + PROCESS_COPY_CNT_16x4_AVX512 + PROCESS_COPY_CNT_END_AVX512 + RET +;-------------------------------------------------------------------------------------- +; copy_cnt avx512 code end +;-------------------------------------------------------------------------------------- ;-------------------------------------------------------------------------------------- ; void cpy2Dto1D_shl(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); ;-------------------------------------------------------------------------------------- @@ -5558,6 +6454,102 @@ RET ;-------------------------------------------------------------------------------------- +; cpy2Dto1D_shl avx512 code start +;-------------------------------------------------------------------------------------- +%macro PROCESS_CPY2Dto1D_SHL_16x8_AVX512 0 + movu m1, [r1] + vinserti32x8 m1, [r1 + r2], 1 + movu m2, [r1 + 2 * r2] + vinserti32x8 m2, [r1 + r3], 1 
+ + psllw m1, xm0 + psllw m2, xm0 + movu [r0], m1 + movu [r0 + mmsize], m2 + + add r0, 2 * mmsize + lea r1, [r1 + r2 * 4] + + movu m1, [r1] + vinserti32x8 m1, [r1 + r2], 1 + movu m2, [r1 + 2 * r2] + vinserti32x8 m2, [r1 + r3], 1 + + psllw m1, xm0 + psllw m2, xm0 + movu [r0], m1 + movu [r0 + mmsize], m2 +%endmacro + +%macro PROCESS_CPY2Dto1D_SHL_32x8_AVX512 0 + movu m1, [r1] + movu m2, [r1 + r2] + movu m3, [r1 + 2 * r2] + movu m4, [r1 + r3] + + psllw m1, xm0 + psllw m2, xm0 + psllw m3, xm0 + psllw m4, xm0 + movu [r0], m1 + movu [r0 + mmsize], m2 + movu [r0 + 2 * mmsize], m3 + movu [r0 + 3 * mmsize], m4 + + add r0, 4 * mmsize + lea r1, [r1 + r2 * 4] + + movu m1, [r1] + movu m2, [r1 + r2] + movu m3, [r1 + 2 * r2] + movu m4, [r1 + r3] + + psllw m1, xm0 + psllw m2, xm0 + psllw m3, xm0 + psllw m4, xm0 + movu [r0], m1 + movu [r0 + mmsize], m2 + movu [r0 + 2 * mmsize], m3 + movu [r0 + 3 * mmsize], m4 +%endmacro + +;-------------------------------------------------------------------------------------- +; void cpy2Dto1D_shl(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +;-------------------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal cpy2Dto1D_shl_32, 4, 4, 5 + add r2d, r2d + movd xm0, r3d + lea r3, [3 * r2] + + PROCESS_CPY2Dto1D_SHL_32x8_AVX512 + add r0, 4 * mmsize + lea r1, [r1 + r2 * 4] + PROCESS_CPY2Dto1D_SHL_32x8_AVX512 + add r0, 4 * mmsize + lea r1, [r1 + r2 * 4] + PROCESS_CPY2Dto1D_SHL_32x8_AVX512 + add r0, 4 * mmsize + lea r1, [r1 + r2 * 4] + PROCESS_CPY2Dto1D_SHL_32x8_AVX512 + RET + +INIT_ZMM avx512 +cglobal cpy2Dto1D_shl_16, 4, 4, 3 + add r2d, r2d + movd xm0, r3d + lea r3, [3 * r2] + + PROCESS_CPY2Dto1D_SHL_16x8_AVX512 + add r0, 2 * mmsize + lea r1, [r1 + r2 * 4] + PROCESS_CPY2Dto1D_SHL_16x8_AVX512 + RET +;-------------------------------------------------------------------------------------- +; cpy2Dto1D_shl avx512 code end +;-------------------------------------------------------------------------------------- +;-------------------------------------------------------------------------------------- ; void cpy1Dto2D_shr(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift) ;-------------------------------------------------------------------------------------- INIT_XMM sse2 @@ -5785,6 +6777,37 @@ jnz .loop RET +INIT_ZMM avx512 +cglobal cpy1Dto2D_shr_16, 3, 5, 4 + shl r2d, 1 + movd xm0, r3m + pcmpeqw xmm1, xmm1 + psllw xm1, xmm1, xm0 + psraw xm1, 1 + vpbroadcastw m1, xm1 + mov r3d, 4 + lea r4, [r2 * 3] + +.loop: + ; Row 0-1 + movu m2, [r1] + psubw m2, m1 + psraw m2, xm0 + movu [r0], ym2 + vextracti32x8 [r0 + r2], m2, 1 + + ; Row 2-3 + movu m2, [r1 + mmsize] + psubw m2, m1 + psraw m2, xm0 + movu [r0 + r2 * 2], ym2 + vextracti32x8 [r0 + r4], m2, 1 + + add r1, 2 * mmsize + lea r0, [r0 + r2 * 4] + dec r3d + jnz .loop + RET ;-------------------------------------------------------------------------------------- ; void cpy1Dto2D_shr(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift) @@ -5875,3 +6898,30 @@ dec r3d jnz .loop RET + +INIT_ZMM avx512 +cglobal cpy1Dto2D_shr_32, 3, 4, 6 + shl r2d, 1 + movd xm0, r3m + pcmpeqw xmm1, xmm1 + psllw xm1, xmm1, xm0 + psraw xm1, 1 + vpbroadcastw m1, xm1 + mov r3d, 16 + +.loop: + ; Row 0-1 + movu m2, [r1] + movu m3, [r1 + mmsize] + psubw m2, m1 + psubw m3, m1 + psraw m2, xm0 + psraw m3, xm0 + movu [r0], m2 + movu [r0 + r2], m3 + + add r1, 2 * mmsize + lea r0, [r0 + r2 * 2] + dec r3d + jnz .loop + RET
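The cpy*_shl/_shr kernels that close out this file are per-coefficient shifts between a strided 2D block and a packed 1D buffer; the only subtle part is the rounding bias the _shr variants build with pcmpeqw/psllw/psraw. In scalar form (a sketch for clarity, not x265's reference C code):

    #include <cstdint>

    // cpy2Dto1D_shr: dst = (src + (1 << (shift - 1))) >> shift, i.e. an arithmetic shift with
    // rounding. The asm forms the bias as ((-1) << shift) >> 1 == -(1 << (shift - 1)) and then
    // subtracts it, which is the same as adding 1 << (shift - 1).
    static void cpy2Dto1D_shr_c(int16_t* dst, const int16_t* src, intptr_t srcStride,
                                int shift, int size)
    {
        const int round = 1 << (shift - 1);
        for (int y = 0; y < size; y++, src += srcStride, dst += size)
            for (int x = 0; x < size; x++)
                dst[x] = (int16_t)((src[x] + round) >> shift);
    }

    // cpy1Dto2D_shl: dst = src << shift; no rounding term is needed for a left shift.
    static void cpy1Dto2D_shl_c(int16_t* dst, const int16_t* src, intptr_t dstStride,
                                int shift, int size)
    {
        for (int y = 0; y < size; y++, dst += dstStride, src += size)
            for (int x = 0; x < size; x++)
                dst[x] = (int16_t)(src[x] << shift);
    }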
x265_2.7.tar.gz/source/common/x86/blockcopy8.h -> x265_2.9.tar.gz/source/common/x86/blockcopy8.h
Changed
@@ -28,37 +28,48 @@ FUNCDEF_TU_S(void, cpy2Dto1D_shl, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); FUNCDEF_TU_S(void, cpy2Dto1D_shl, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); FUNCDEF_TU_S(void, cpy2Dto1D_shl, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +FUNCDEF_TU_S(void, cpy2Dto1D_shl, avx512, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); FUNCDEF_TU_S(void, cpy2Dto1D_shr, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); FUNCDEF_TU_S(void, cpy2Dto1D_shr, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); FUNCDEF_TU_S(void, cpy2Dto1D_shr, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +FUNCDEF_TU_S(void, cpy2Dto1D_shr, avx512, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); FUNCDEF_TU_S(void, cpy1Dto2D_shl, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); FUNCDEF_TU_S(void, cpy1Dto2D_shl, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); FUNCDEF_TU_S(void, cpy1Dto2D_shl, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); - +FUNCDEF_TU_S(void, cpy1Dto2D_shl, avx512, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +FUNCDEF_TU_S(void, cpy1Dto2D_shl_aligned, avx512, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); FUNCDEF_TU_S(void, cpy1Dto2D_shr, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); FUNCDEF_TU_S(void, cpy1Dto2D_shr, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); FUNCDEF_TU_S(void, cpy1Dto2D_shr, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +FUNCDEF_TU_S(void, cpy1Dto2D_shr, avx512, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); FUNCDEF_TU_S(uint32_t, copy_cnt, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride); FUNCDEF_TU_S(uint32_t, copy_cnt, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride); FUNCDEF_TU_S(uint32_t, copy_cnt, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride); +FUNCDEF_TU_S(uint32_t, copy_cnt, avx512, int16_t* dst, const int16_t* src, intptr_t srcStride); FUNCDEF_TU(void, blockfill_s, sse2, int16_t* dst, intptr_t dstride, int16_t val); FUNCDEF_TU(void, blockfill_s, avx2, int16_t* dst, intptr_t dstride, int16_t val); +FUNCDEF_TU(void, blockfill_s, avx512, int16_t* dst, intptr_t dstride, int16_t val); +FUNCDEF_TU(void, blockfill_s_aligned, avx512, int16_t* dst, intptr_t dstride, int16_t val); FUNCDEF_CHROMA_PU(void, blockcopy_ss, sse2, int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); FUNCDEF_CHROMA_PU(void, blockcopy_ss, avx, int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); +FUNCDEF_CHROMA_PU(void, blockcopy_ss, avx512, int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); FUNCDEF_CHROMA_PU(void, blockcopy_pp, sse2, pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); FUNCDEF_CHROMA_PU(void, blockcopy_pp, avx, pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +FUNCDEF_CHROMA_PU(void, blockcopy_pp, avx512, pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); FUNCDEF_PU(void, blockcopy_sp, sse2, pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); FUNCDEF_PU(void, blockcopy_sp, sse4, pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); FUNCDEF_PU(void, blockcopy_sp, avx2, pixel* dst, intptr_t 
dstStride, const int16_t* src, intptr_t srcStride); +FUNCDEF_PU(void, blockcopy_sp, avx512, pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); FUNCDEF_PU(void, blockcopy_ps, sse2, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); FUNCDEF_PU(void, blockcopy_ps, sse4, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); FUNCDEF_PU(void, blockcopy_ps, avx2, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +FUNCDEF_PU(void, blockcopy_ps, avx512, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); #endif // ifndef X265_I386_PIXEL_H
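Each FUNCDEF_TU_S/FUNCDEF_PU line above fans out into one prototype per block or transform-unit size, so a single added avx512 row exposes the whole family of kernels defined in blockcopy8.asm. Purely as an illustration (the real macros live in x265's internal headers and handle the x265_/PFX name prefixing; the exact expansion below is an assumption), the new cpy1Dto2D_shl declaration would produce something like:

    // Hypothetical expansion, for orientation only; the symbol names mirror the cglobal
    // labels in the .asm diff (cpy1Dto2D_shl_32 built under INIT_ZMM avx512, and so on).
    void x265_cpy1Dto2D_shl_4_avx512(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
    void x265_cpy1Dto2D_shl_8_avx512(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
    void x265_cpy1Dto2D_shl_16_avx512(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);
    void x265_cpy1Dto2D_shl_32_avx512(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);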
x265_2.7.tar.gz/source/common/x86/const-a.asm -> x265_2.9.tar.gz/source/common/x86/const-a.asm
Changed
@@ -28,7 +28,7 @@
 %include "x86inc.asm"
-SECTION_RODATA 32
+SECTION_RODATA 64
 ;; 8-bit constants
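The only change in this file is widening the read-only data section alignment from 32 to 64 bytes: constants that are now loaded with aligned 512-bit instructions must start on a 64-byte boundary or the load faults. The same idea in C++ terms (a sketch; the table mirrors the shuf1_avx512 permute constant added in blockcopy8.asm above):

    #include <cstdint>

    // 64-byte alignment so an aligned zmm load (e.g. vmovdqa64) of this table cannot fault.
    alignas(64) static const uint64_t shuf1_avx512[8] = { 0, 2, 4, 6, 1, 3, 5, 7 };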
x265_2.7.tar.gz/source/common/x86/cpu-a.asm -> x265_2.9.tar.gz/source/common/x86/cpu-a.asm
Changed
@@ -54,18 +54,16 @@
     RET
 ;-----------------------------------------------------------------------------
-; void cpu_xgetbv( int op, int *eax, int *edx )
+; uint64_t cpu_xgetbv( int xcr )
 ;-----------------------------------------------------------------------------
-cglobal cpu_xgetbv, 3,7
-    push  r2
-    push  r1
-    mov   ecx, r0d
+cglobal cpu_xgetbv
+    movifnidn ecx, r0m
     xgetbv
-    pop   r4
-    mov   [r4], eax
-    pop   r4
-    mov   [r4], edx
-    RET
+%if ARCH_X86_64
+    shl   rdx, 32
+    or    rax, rdx
+%endif
+    ret
 %if ARCH_X86_64
@@ -78,7 +76,7 @@
 %if WIN64
     sub   rsp, 32 ; shadow space
 %endif
-    and   rsp, ~31
+    and   rsp, ~(STACK_ALIGNMENT - 1)
     mov   rax, r0
     mov   r0, r1
     mov   r1, r2
@@ -119,7 +117,7 @@
     push  ebp
     mov   ebp, esp
     sub   esp, 12
-    and   esp, ~31
+    and   esp, ~(STACK_ALIGNMENT - 1)
     mov   ecx, [ebp+8]
     mov   edx, [ebp+12]
     mov   [esp], edx
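cpu_xgetbv now returns EDX:EAX packed into one 64-bit value instead of writing through pointer arguments, which suits the AVX-512 path: detection has to confirm via XCR0 that the OS saves the zmm and opmask state before any avx512 kernel may be selected. A hedged sketch of how the new form could be consumed (the exported symbol name and the helper are assumptions; the XCR0 bit layout is from the Intel SDM):

    #include <cstdint>

    extern "C" uint64_t x265_cpu_xgetbv(int xcr);   // assumed export name for the routine above

    static bool osSupportsAvx512()
    {
        const uint64_t xcr0 = x265_cpu_xgetbv(0);   // XCR0: state components enabled by the OS
        const uint64_t ymmState = 0x06;             // bit 1 (XMM) + bit 2 (YMM)
        const uint64_t zmmState = 0xE0;             // bits 5-7 (opmask, ZMM0-15 high, ZMM16-31)
        return (xcr0 & (ymmState | zmmState)) == (ymmState | zmmState);
    }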
x265_2.7.tar.gz/source/common/x86/dct8.asm -> x265_2.9.tar.gz/source/common/x86/dct8.asm
Changed
@@ -28,7 +28,89 @@ %include "x86inc.asm" %include "x86util.asm" -SECTION_RODATA 32 +SECTION_RODATA 64 + +tab_dct32: dw 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64 + dw 90, 90, 88, 85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4, -4, -13, -22, -31, -38, -46, -54, -61, -67, -73, -78, -82, -85, -88, -90, -90 + dw 90, 87, 80, 70, 57, 43, 25, 9, -9, -25, -43, -57, -70, -80, -87, -90, -90, -87, -80, -70, -57, -43, -25, -9, 9, 25, 43, 57, 70, 80, 87, 90 + dw 90, 82, 67, 46, 22, -4, -31, -54, -73, -85, -90, -88, -78, -61, -38, -13, 13, 38, 61, 78, 88, 90, 85, 73, 54, 31, 4, -22, -46, -67, -82, -90 + dw 89, 75, 50, 18, -18, -50, -75, -89, -89, -75, -50, -18, 18, 50, 75, 89, 89, 75, 50, 18, -18, -50, -75, -89, -89, -75, -50, -18, 18, 50, 75, 89 + dw 88, 67, 31, -13, -54, -82, -90, -78, -46, -4, 38, 73, 90, 85, 61, 22, -22, -61, -85, -90, -73, -38, 4, 46, 78, 90, 82, 54, 13, -31, -67, -88 + dw 87, 57, 9, -43, -80, -90, -70, -25, 25, 70, 90, 80, 43, -9, -57, -87, -87, -57, -9, 43, 80, 90, 70, 25, -25, -70, -90, -80, -43, 9, 57, 87 + dw 85, 46, -13, -67, -90, -73, -22, 38, 82, 88, 54, -4, -61, -90, -78, -31, 31, 78, 90, 61, 4, -54, -88, -82, -38, 22, 73, 90, 67, 13, -46, -85 + dw 83, 36, -36, -83, -83, -36, 36, 83, 83, 36, -36, -83, -83, -36, 36, 83, 83, 36, -36, -83, -83, -36, 36, 83, 83, 36, -36, -83, -83, -36, 36, 83 + dw 82, 22, -54, -90, -61, 13, 78, 85, 31, -46, -90, -67, 4, 73, 88, 38, -38, -88, -73, -4, 67, 90, 46, -31, -85, -78, -13, 61, 90, 54, -22, -82 + dw 80, 9, -70, -87, -25, 57, 90, 43, -43, -90, -57, 25, 87, 70, -9, -80, -80, -9, 70, 87, 25, -57, -90, -43, 43, 90, 57, -25, -87, -70, 9, 80 + dw 78, -4, -82, -73, 13, 85, 67, -22, -88, -61, 31, 90, 54, -38, -90, -46, 46, 90, 38, -54, -90, -31, 61, 88, 22, -67, -85, -13, 73, 82, 4, -78 + dw 75, -18, -89, -50, 50, 89, 18, -75, -75, 18, 89, 50, -50, -89, -18, 75, 75, -18, -89, -50, 50, 89, 18, -75, -75, 18, 89, 50, -50, -89, -18, 75 + dw 73, -31, -90, -22, 78, 67, -38, -90, -13, 82, 61, -46, -88, -4, 85, 54, -54, -85, 4, 88, 46, -61, -82, 13, 90, 38, -67, -78, 22, 90, 31, -73 + dw 70, -43, -87, 9, 90, 25, -80, -57, 57, 80, -25, -90, -9, 87, 43, -70, -70, 43, 87, -9, -90, -25, 80, 57, -57, -80, 25, 90, 9, -87, -43, 70 + dw 67, -54, -78, 38, 85, -22, -90, 4, 90, 13, -88, -31, 82, 46, -73, -61, 61, 73, -46, -82, 31, 88, -13, -90, -4, 90, 22, -85, -38, 78, 54, -67 + dw 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64 + dw 61, -73, -46, 82, 31, -88, -13, 90, -4, -90, 22, 85, -38, -78, 54, 67, -67, -54, 78, 38, -85, -22, 90, 4, -90, 13, 88, -31, -82, 46, 73, -61 + dw 57, -80, -25, 90, -9, -87, 43, 70, -70, -43, 87, 9, -90, 25, 80, -57, -57, 80, 25, -90, 9, 87, -43, -70, 70, 43, -87, -9, 90, -25, -80, 57 + dw 54, -85, -4, 88, -46, -61, 82, 13, -90, 38, 67, -78, -22, 90, -31, -73, 73, 31, -90, 22, 78, -67, -38, 90, -13, -82, 61, 46, -88, 4, 85, -54 + dw 50, -89, 18, 75, -75, -18, 89, -50, -50, 89, -18, -75, 75, 18, -89, 50, 50, -89, 18, 75, -75, -18, 89, -50, -50, 89, -18, -75, 75, 18, -89, 50 + dw 46, -90, 38, 54, -90, 31, 61, -88, 22, 67, -85, 13, 73, -82, 4, 78, -78, -4, 82, -73, -13, 85, -67, -22, 88, -61, -31, 90, -54, -38, 90, -46 + dw 43, -90, 57, 25, -87, 70, 9, -80, 80, -9, -70, 87, -25, -57, 90, -43, -43, 90, -57, -25, 87, -70, -9, 80, -80, 9, 70, -87, 25, 57, -90, 43 + dw 38, -88, 73, -4, -67, 90, -46, -31, 85, -78, 13, 61, -90, 54, 22, -82, 82, -22, -54, 90, -61, 
-13, 78, -85, 31, 46, -90, 67, 4, -73, 88, -38 + dw 36, -83, 83, -36, -36, 83, -83, 36, 36, -83, 83, -36, -36, 83, -83, 36, 36, -83, 83, -36, -36, 83, -83, 36, 36, -83, 83, -36, -36, 83, -83, 36 + dw 31, -78, 90, -61, 4, 54, -88, 82, -38, -22, 73, -90, 67, -13, -46, 85, -85, 46, 13, -67, 90, -73, 22, 38, -82, 88, -54, -4, 61, -90, 78, -31 + dw 25, -70, 90, -80, 43, 9, -57, 87, -87, 57, -9, -43, 80, -90, 70, -25, -25, 70, -90, 80, -43, -9, 57, -87, 87, -57, 9, 43, -80, 90, -70, 25 + dw 22, -61, 85, -90, 73, -38, -4, 46, -78, 90, -82, 54, -13, -31, 67, -88, 88, -67, 31, 13, -54, 82, -90, 78, -46, 4, 38, -73, 90, -85, 61, -22 + dw 18, -50, 75, -89, 89, -75, 50, -18, -18, 50, -75, 89, -89, 75, -50, 18, 18, -50, 75, -89, 89, -75, 50, -18, -18, 50, -75, 89, -89, 75, -50, 18 + dw 13, -38, 61, -78, 88, -90, 85, -73, 54, -31, 4, 22, -46, 67, -82, 90, -90, 82, -67, 46, -22, -4, 31, -54, 73, -85, 90, -88, 78, -61, 38, -13 + dw 9, -25, 43, -57, 70, -80, 87, -90, 90, -87, 80, -70, 57, -43, 25, -9, -9, 25, -43, 57, -70, 80, -87, 90, -90, 87, -80, 70, -57, 43, -25, 9 + dw 4, -13, 22, -31, 38, -46, 54, -61, 67, -73, 78, -82, 85, -88, 90, -90, 90, -90, 88, -85, 82, -78, 73, -67, 61, -54, 46, -38, 31, -22, 13, -4 +tab_dct16: dw 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64 + dw 90, 87, 80, 70, 57, 43, 25, 9, -9, -25, -43, -57, -70, -80, -87, -90 + dw 89, 75, 50, 18, -18, -50, -75, -89, -89, -75, -50, -18, 18, 50, 75, 89 + dw 87, 57, 9, -43, -80, -90, -70, -25, 25, 70, 90, 80, 43, -9, -57, -87 + dw 83, 36, -36, -83, -83, -36, 36, 83, 83, 36, -36, -83, -83, -36, 36, 83 + dw 80, 9, -70, -87, -25, 57, 90, 43, -43, -90, -57, 25, 87, 70, -9, -80 + dw 75, -18, -89, -50, 50, 89, 18, -75, -75, 18, 89, 50, -50, -89, -18, 75 + dw 70, -43, -87, 9, 90, 25, -80, -57, 57, 80, -25, -90, -9, 87, 43, -70 + dw 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64, 64, -64, -64, 64 + dw 57, -80, -25, 90, -9, -87, 43, 70, -70, -43, 87, 9, -90, 25, 80, -57 + dw 50, -89, 18, 75, -75, -18, 89, -50, -50, 89, -18, -75, 75, 18, -89, 50 + dw 43, -90, 57, 25, -87, 70, 9, -80, 80, -9, -70, 87, -25, -57, 90, -43 + dw 36, -83, 83, -36, -36, 83, -83, 36, 36, -83, 83, -36, -36, 83, -83, 36 + dw 25, -70, 90, -80, 43, 9, -57, 87, -87, 57, -9, -43, 80, -90, 70, -25 + dw 18, -50, 75, -89, 89, -75, 50, -18, -18, 50, -75, 89, -89, 75, -50, 18 + dw 9, -25, 43, -57, 70, -80, 87, -90, 90, -87, 80, -70, 57, -43, 25, -9 + +dct16_shuf_AVX512: dq 0, 1, 8, 9, 4, 5, 12, 13 +dct16_shuf1_AVX512: dq 2, 3, 10, 11, 6, 7, 14, 15 +dct16_shuf3_AVX512: dq 0, 1, 4, 5, 8, 9, 12, 13 +dct16_shuf4_AVX512: dq 2, 3, 6, 7, 10, 11, 14, 15 +dct16_shuf2_AVX512: dd 0, 4, 8, 12, 2, 6, 10, 14, 16, 20, 24, 28, 18, 22, 26, 30 + +dct8_shuf5_AVX512: dq 0, 2, 4, 6, 1, 3, 5, 7 +dct8_shuf6_AVX512: dq 0, 2, 4, 6, 1, 3, 5, 7 +dct8_shuf8_AVX512: dd 0, 2, 8, 10, 4, 6, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15 +dct8_shuf4_AVX512: times 2 dd 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15 +dct16_shuf7_AVX512: dd 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30 +dct16_shuf9_AVX512: dd 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15 + +dct32_shuf_AVX512: dd 0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20 , 21, 24, 25, 28, 29 +dct32_shuf4_AVX512: times 2 dd 0, 4, 8, 12, 0, 4, 8, 12 +dct32_shuf5_AVX512: dd 0, 0, 0, 0, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0 +dct32_shuf6_AVX512: dd 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, 0, 0, 0, 0 +dct32_shuf7_AVX512: dd 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1 +dct32_shuf8_AVX512: dd -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
+dct16_shuf5_AVX512: dw 0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27, 4, 5, 6, 7, 12, 13, 14, 15, 20, 21, 22, 23, 28, 29, 30, 31 +dct16_shuf6_AVX512: dw 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30 +dct16_shuf8_AVX512: dw 20, 0, 4, 2, 28, 8, 6, 10, 22, 16, 12, 18, 30, 24, 14, 26 + +dct8_shuf7_AVX512: dw 0, 2, 16, 18, 8, 10, 24, 26, 4, 6, 20, 22, 12, 14, 28, 30 +dct8_shuf9_AVX512: times 2 dw 0, 8, 16, 24, 4, 12, 20, 28 +dct32_shuf1_AVX512: dw 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16 +dct32_shuf2_AVX512: dw 0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23, 15, 14, 13, 12, 11, 10, 9, 8, 31, 30, 29, 28, 27, 26, 25, 24 +dct32_shuf3_AVX512: times 2 dw 0, 8, 16, 24, 2, 10, 18, 26 + +dct8_shuf: times 2 db 6, 7, 4, 5, 2, 3, 0, 1, 14, 15, 12, 13, 10, 11, 8, 9 +dct8_shuf_AVX512: times 2 db 4, 5, 6, 7, 0, 1, 2, 3, 12, 13, 14, 15, 8, 9, 10, 11 + tab_dct8: dw 64, 64, 64, 64, 64, 64, 64, 64 dw 89, 75, 50, 18, -18, -50, -75, -89 dw 83, 36, -36, -83, -83, -36, 36, 83 @@ -38,7 +120,10 @@ dw 36, -83, 83, -36, -36, 83, -83, 36 dw 18, -50, 75, -89, 89, -75, 50, -18 -dct8_shuf: times 2 db 6, 7, 4, 5, 2, 3, 0, 1, 14, 15, 12, 13, 10, 11, 8, 9 +tab_dct8_avx512: dw 64, 64, 64, 64, 89, 75, 50, 18 + dw 83, 36, -36, -83, 75, -18, -89, -50 + dw 64, -64, -64, 64, 50, -89, 18, 75 + dw 36, -83, 83, -36, 18, -50, 75, -89 tab_dct16_1: dw 64, 64, 64, 64, 64, 64, 64, 64 dw 90, 87, 80, 70, 57, 43, 25, 9 @@ -57,7 +142,6 @@ dw 18, -50, 75, -89, 89, -75, 50, -18 dw 9, -25, 43, -57, 70, -80, 87, -90 - tab_dct16_2: dw 64, 64, 64, 64, 64, 64, 64, 64 dw -9, -25, -43, -57, -70, -80, -87, -90 dw -89, -75, -50, -18, 18, 50, 75, 89 @@ -155,12 +239,34 @@ times 4 dw 50, -89, 18, 75 times 4 dw 18, -50, 75, -89 +avx512_idct8_1: times 8 dw 64, 83, 64, 36 + times 8 dw 64, 36, -64, -83 + times 8 dw 64, -36, -64, 83 + times 8 dw 64, -83, 64, -36 + +avx512_idct8_2: times 8 dw 89, 75, 50, 18 + times 8 dw 75, -18, -89, -50 + times 8 dw 50, -89, 18, 75 + times 8 dw 18, -50, 75, -89 + +avx512_idct8_3: dw 64, 83, 64, 83, 64, 83, 64, 83, 64, 83, 64, 83, 64, 83, 64, 83, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36 + dw 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, -64, 83, -64, 83, -64, 83, -64, 83, -64, 83, -64, 83, -64, 83, -64, 83 + dw 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, 36, 64, -83, 64, -83, 64, -83, 64, -83, 64, -83, 64, -83, 64, -83, 64, -83 + dw -64, -83, -64, -83, -64, -83, -64, -83, -64, -83, -64, -83, -64, -83, -64, -83, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36, 64, -36 + dw 89, 75, 89, 75, 89, 75, 89, 75, 89, 75, 89, 75, 89, 75, 89, 75, 50, -89, 50, -89, 50, -89, 50, -89, 50, -89, 50, -89, 50, -89, 50, -89 + dw 50, 18, 50, 18, 50, 18, 50, 18, 50, 18, 50, 18, 50, 18, 50, 18, 18, 75, 18, 75, 18, 75, 18, 75, 18, 75, 18, 75, 18, 75, 18, 75 + dw 75, -18, 75, -18, 75, -18, 75, -18, 75, -18, 75, -18, 75, -18, 75, -18, 18, -50, 18, -50, 18, -50, 18, -50, 18, -50, 18, -50, 18, -50, 18, -50 + dw -89, -50, -89, -50, -89, -50, -89, -50, -89, -50, -89, -50, -89, -50, -89, -50, 75, -89, 75, -89, 75, -89, 75, -89, 75, -89, 75, -89, 75, -89, 75, -89 + idct8_shuf1: dd 0, 2, 4, 6, 1, 3, 5, 7 const idct8_shuf2, times 2 db 0, 1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15 idct8_shuf3: times 2 db 12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3 + +idct8_avx512_shuf3: times 4 db 12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3 + tab_idct16_1: dw 90, 87, 80, 70, 
57, 43, 25, 9 dw 87, 57, 9, -43, -80, -90, -70, -25 dw 80, 9, -70, -87, -25, 57, 90, 43 @@ -182,6 +288,31 @@ idct16_shuff: dd 0, 4, 2, 6, 1, 5, 3, 7 idct16_shuff1: dd 2, 6, 0, 4, 3, 7, 1, 5 +idct16_shuff2: dw 0, 16, 2, 18, 4, 20, 6, 22, 8, 24, 10, 26, 12, 28, 14, 30 +idct16_shuff3: dw 1, 17, 3, 19, 5, 21, 7, 23, 9, 25, 11, 27, 13, 29, 15, 31 +idct16_shuff4: dd 0, 8, 2, 10, 4, 12, 6, 14 +idct16_shuff5: dd 1, 9, 3, 11, 5, 13, 7, 15 + + +tab_AVX512_idct16_1: dw 90, 87, 80, 70, 57, 43, 25, 9, 90, 87, 80, 70, 57, 43, 25, 9, 80, 9, -70, -87, -25, 57, 90, 43, 80, 9, -70, -87, -25, 57, 90, 43 + dw 87, 57, 9, -43, -80, -90, -70, -25, 87, 57, 9, -43, -80, -90, -70, -25, 70, -43, -87, 9, 90, 25, -80, -57, 70, -43, -87, 9, 90, 25, -80, -57 + dw 57, -80, -25, 90, -9, -87, 43, 70, 57, -80, -25, 90, -9, -87, 43, 70, 25, -70, 90, -80, 43, 9, -57, 87, 25, -70, 90, -80, 43, 9, -57, 87 + dw 43, -90, 57, 25, -87, 70, 9, -80, 43, -90, 57, 25, -87, 70, 9, -80, 9, -25, 43, -57, 70, -80, 87, -90, 9, -25, 43, -57, 70, -80, 87, -90 + +tab_AVX512_idct16_2: dw 64, 89, 83, 75, 64, 50, 36, 18, 64, 89, 83, 75, 64, 50, 36, 18, 64, 50, -36, -89, -64, 18, 83, 75, 64, 50, -36, -89, -64, 18, 83, 75 + dw 64, 75, 36, -18, -64, -89, -83, -50, 64, 75, 36, -18, -64, -89, -83, -50, 64, 18, -83, -50, 64, 75, -36, -89, 64, 18, -83, -50, 64, 75, -36, -89 + dw 64, -18, -83, 50, 64, -75, -36, 89, 64, -18, -83, 50, 64, -75, -36, 89, 64, -75, 36, 18, -64, 89, -83, 50, 64, -75, 36, 18, -64, 89, -83, 50 + dw 64, -50, -36, 89, -64, -18, 83, -75, 64, -50, -36, 89, -64, -18, 83, -75, 64, -89, 83, -75, 64, -50, 36, -18, 64, -89, 83, -75, 64, -50, 36, -18 + +idct16_AVX512_shuff: dd 0, 4, 2, 6, 1, 5, 3, 7, 8, 12, 10, 14, 9, 13, 11, 15 + +idct16_AVX512_shuff1: dd 2, 6, 0, 4, 3, 7, 1, 5, 10, 14, 8, 12, 11, 15, 9, 13 + +idct16_AVX512_shuff2: dq 0, 1, 8, 9, 4, 5, 12, 13 +idct16_AVX512_shuff3: dq 2, 3, 10, 11, 6, 7, 14, 15 +idct16_AVX512_shuff4: dq 4, 5, 12, 13, 0, 1, 8, 9 +idct16_AVX512_shuff5: dq 6, 7, 14, 15, 2, 3, 10, 11 +idct16_AVX512_shuff6: times 4 db 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 tab_idct32_1: dw 90 ,90 ,88 ,85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4 dw 90, 82, 67, 46, 22, -4, -31, -54, -73, -85, -90, -88, -78, -61, -38, -13 @@ -237,6 +368,71 @@ dw 64, -87, 75, -57, 36, -9, -18, 43, -64, 80, -89, 90, -83, 70, -50, 25 dw 64, -90, 89, -87, 83, -80, 75, -70, 64, -57, 50, -43, 36, -25, 18, -9 + +tab_idct32_AVX512_1: dw 90 ,90 ,88 ,85, 82, 78, 73, 67, 90 ,90 ,88 ,85, 82, 78, 73, 67, 90, 82, 67, 46, 22, -4, -31, -54, 90, 82, 67, 46, 22, -4, -31, -54 + dw 61, 54, 46, 38, 31, 22, 13, 4, 61, 54, 46, 38, 31, 22, 13, 4, -73, -85, -90, -88, -78, -61, -38, -13, -73, -85, -90, -88, -78, -61, -38, -13 + dw 88, 67, 31, -13, -54, -82, -90, -78, 88, 67, 31, -13, -54, -82, -90, -78, 85, 46, -13, -67, -90, -73, -22, 38, 85, 46, -13, -67, -90, -73, -22, 38 + dw -46, -4, 38, 73, 90, 85, 61, 22, -46, -4, 38, 73, 90, 85, 61, 22, 82, 88, 54, -4, -61, -90, -78, -31, 82, 88, 54, -4, -61, -90, -78, -31 + dw 82, 22, -54, -90, -61, 13, 78, 85, 82, 22, -54, -90, -61, 13, 78, 85, 78, -4, -82, -73, 13, 85, 67, -22, 78, -4, -82, -73, 13, 85, 67, -22 + dw 31, -46, -90, -67, 4, 73, 88, 38, 31, -46, -90, -67, 4, 73, 88, 38, -88, -61, 31, 90, 54, -38, -90, -46, -88, -61, 31, 90, 54, -38, -90, -46 + dw 73, -31, -90, -22, 78, 67, -38, -90, 73, -31, -90, -22, 78, 67, -38, -90, 67, -54, -78, 38, 85, -22, -90, 4, 67, -54, -78, 38, 85, -22, -90, 4 + dw -13, 82, 61, -46, -88, -4, 85, 54, -13, 82, 61, -46, -88, -4, 85, 54, 90, 13, -88, -31, 82, 46, -73, -61, 
90, 13, -88, -31, 82, 46, -73, -61 + +tab_idct32_AVX512_5: dw 4, -13, 22, -31, 38, -46, 54, -61, 4, -13, 22, -31, 38, -46, 54, -61, 13, -38, 61, -78, 88, -90, 85, -73, 13, -38, 61, -78, 88, -90, 85, -73 + dw 67, -73, 78, -82, 85, -88, 90, -90, 67, -73, 78, -82, 85, -88, 90, -90, 54, -31, 4, 22, -46, 67, -82, 90, 54, -31, 4, 22, -46, 67, -82, 90 + dw 22, -61, 85, -90, 73, -38, -4, 46, 22, -61, 85, -90, 73, -38, -4, 46, 31, -78, 90, -61, 4, 54, -88, 82, 31, -78, 90, -61, 4, 54, -88, 82 + dw -78, 90, -82, 54, -13, -31, 67, -88, -78, 90, -82, 54, -13, -31, 67, -88, -38, -22, 73, -90, 67, -13, -46, 85, -38, -22, 73, -90, 67, -13, -46, 85 + dw 38, -88, 73, -4, -67, 90, -46, -31, 38, -88, 73, -4, -67, 90, -46, -31, 46, -90, 38, 54, -90, 31, 61, -88, 46, -90, 38, 54, -90, 31, 61, -88 + dw 85, -78, 13, 61, -90, 54, 22, -82, 85, -78, 13, 61, -90, 54, 22, -82, 22, 67, -85, 13, 73, -82, 4, 78, 22, 67, -85, 13, 73, -82, 4, 78 + dw 54, -85, -4, 88, -46, -61, 82, 13, 54, -85, -4, 88, -46, -61, 82, 13, 61, -73, -46, 82, 31, -88, -13, 90, 61, -73, -46, 82, 31, -88, -13, 90 + dw -90, 38, 67, -78, -22, 90, -31, -73, -90, 38, 67, -78, -22, 90, -31, -73, -4, -90, 22, 85, -38, -78, 54, 67, -4, -90, 22, 85, -38, -78, 54, 67 + + +tab_idct32_AVX512_2: dw 64, 89, 83, 75, 64, 50, 36, 18, 64, 89, 83, 75, 64, 50, 36, 18, 64, 75, 36, -18, -64, -89, -83, -50, 64, 75, 36, -18, -64, -89, -83, -50 + dw 64, 50, -36, -89, -64, 18, 83, 75, 64, 50, -36, -89, -64, 18, 83, 75, 64, 18, -83, -50, 64, 75, -36, -89, 64, 18, -83, -50, 64, 75, -36, -89 + dw 64, -18, -83, 50, 64, -75, -36, 89, 64, -18, -83, 50, 64, -75, -36, 89, 64, -50, -36, 89, -64, -18, 83, -75, 64, -50, -36, 89, -64, -18, 83, -75 + dw 64, -75, 36, 18, -64, 89, -83, 50, 64, -75, 36, 18, -64, 89, -83, 50, 64, -89, 83, -75, 64, -50, 36, -18, 64, -89, 83, -75, 64, -50, 36, -18 + +tab_idct32_AVX512_3: dw 90, 87, 80, 70, 57, 43, 25, 9, 90, 87, 80, 70, 57, 43, 25, 9, 87, 57, 9, -43, -80, -90, -70, -25, 87, 57, 9, -43, -80, -90, -70, -25 + dw 80, 9, -70, -87, -25, 57, 90, 43, 80, 9, -70, -87, -25, 57, 90, 43, 70, -43, -87, 9, 90, 25, -80, -57, 70, -43, -87, 9, 90, 25, -80, -57 + dw 57, -80, -25, 90, -9, -87, 43, 70, 57, -80, -25, 90, -9, -87, 43, 70, 43, -90, 57, 25, -87, 70, 9, -80, 43, -90, 57, 25, -87, 70, 9, -80 + dw 25, -70, 90, -80, 43, 9, -57, 87, 25, -70, 90, -80, 43, 9, -57, 87, 9, -25, 43, -57, 70, -80, 87, -90, 9, -25, 43, -57, 70, -80, 87, -90 + +tab_idct32_AVX512_4: dw 90 ,90 ,88 ,85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4, 90 ,90 ,88 ,85, 82, 78, 73, 67, 61, 54, 46, 38, 31, 22, 13, 4 + dw 90, 82, 67, 46, 22, -4, -31, -54, -73, -85, -90, -88, -78, -61, -38, -13, 90, 82, 67, 46, 22, -4, -31, -54, -73, -85, -90, -88, -78, -61, -38, -13 + dw 88, 67, 31, -13, -54, -82, -90, -78, -46, -4, 38, 73, 90, 85, 61, 22, 88, 67, 31, -13, -54, -82, -90, -78, -46, -4, 38, 73, 90, 85, 61, 22 + dw 85, 46, -13, -67, -90, -73, -22, 38, 82, 88, 54, -4, -61, -90, -78, -31, 85, 46, -13, -67, -90, -73, -22, 38, 82, 88, 54, -4, -61, -90, -78, -31 + dw 82, 22, -54, -90, -61, 13, 78, 85, 31, -46, -90, -67, 4, 73, 88, 38, 82, 22, -54, -90, -61, 13, 78, 85, 31, -46, -90, -67, 4, 73, 88, 38 + dw 78, -4, -82, -73, 13, 85, 67, -22, -88, -61, 31, 90, 54, -38, -90, -46, 78, -4, -82, -73, 13, 85, 67, -22, -88, -61, 31, 90, 54, -38, -90, -46 + dw 73, -31, -90, -22, 78, 67, -38, -90, -13, 82, 61, -46, -88, -4, 85, 54, 73, -31, -90, -22, 78, 67, -38, -90, -13, 82, 61, -46, -88, -4, 85, 54 + dw 67, -54, -78, 38, 85, -22, -90, 4, 90, 13, -88, -31, 82, 46, -73, -61, 67, -54, -78, 38, 85, -22, 
-90, 4, 90, 13, -88, -31, 82, 46, -73, -61 + dw 61, -73, -46, 82, 31, -88, -13, 90, -4, -90, 22, 85, -38, -78, 54, 67, 61, -73, -46, 82, 31, -88, -13, 90, -4, -90, 22, 85, -38, -78, 54, 67 + dw 54, -85, -4, 88, -46, -61, 82, 13, -90, 38, 67, -78, -22, 90, -31, -73, 54, -85, -4, 88, -46, -61, 82, 13, -90, 38, 67, -78, -22, 90, -31, -73 + dw 46, -90, 38, 54, -90, 31, 61, -88, 22, 67, -85, 13, 73, -82, 4, 78, 46, -90, 38, 54, -90, 31, 61, -88, 22, 67, -85, 13, 73, -82, 4, 78 + dw 38, -88, 73, -4, -67, 90, -46, -31, 85, -78, 13, 61, -90, 54, 22, -82, 38, -88, 73, -4, -67, 90, -46, -31, 85, -78, 13, 61, -90, 54, 22, -82 + dw 31, -78, 90, -61, 4, 54, -88, 82, -38, -22, 73, -90, 67, -13, -46, 85, 31, -78, 90, -61, 4, 54, -88, 82, -38, -22, 73, -90, 67, -13, -46, 85 + dw 22, -61, 85, -90, 73, -38, -4, 46, -78, 90, -82, 54, -13, -31, 67, -88, 22, -61, 85, -90, 73, -38, -4, 46, -78, 90, -82, 54, -13, -31, 67, -88 + dw 13, -38, 61, -78, 88, -90, 85, -73, 54, -31, 4, 22, -46, 67, -82, 90, 13, -38, 61, -78, 88, -90, 85, -73, 54, -31, 4, 22, -46, 67, -82, 90 + dw 4, -13, 22, -31, 38, -46, 54, -61, 67, -73, 78, -82, 85, -88, 90, -90, 4, -13, 22, -31, 38, -46, 54, -61, 67, -73, 78, -82, 85, -88, 90, -90 + +tab_idct32_AVX512_6: dw 64, 90, 89, 87, 83, 80, 75, 70, 64, 57, 50, 43, 36, 25, 18, 9, 64, 90, 89, 87, 83, 80, 75, 70, 64, 57, 50, 43, 36, 25, 18, 9 + dw 64, 87, 75, 57, 36, 9, -18, -43, -64, -80, -89, -90, -83, -70, -50, -25, 64, 87, 75, 57, 36, 9, -18, -43, -64, -80, -89, -90, -83, -70, -50, -25 + dw 64, 80, 50, 9, -36, -70, -89, -87, -64, -25, 18, 57, 83, 90, 75, 43, 64, 80, 50, 9, -36, -70, -89, -87, -64, -25, 18, 57, 83, 90, 75, 43 + dw 64, 70, 18, -43, -83, -87, -50, 9, 64, 90, 75, 25, -36, -80, -89, -57, 64, 70, 18, -43, -83, -87, -50, 9, 64, 90, 75, 25, -36, -80, -89, -57 + dw 64, 57, -18, -80, -83, -25, 50, 90, 64, -9, -75, -87, -36, 43, 89, 70, 64, 57, -18, -80, -83, -25, 50, 90, 64, -9, -75, -87, -36, 43, 89, 70 + dw 64, 43, -50, -90, -36, 57, 89, 25, -64, -87, -18, 70, 83, 9, -75, -80, 64, 43, -50, -90, -36, 57, 89, 25, -64, -87, -18, 70, 83, 9, -75, -80 + dw 64, 25, -75, -70, 36, 90, 18, -80, -64, 43, 89, 9, -83, -57, 50, 87, 64, 25, -75, -70, 36, 90, 18, -80, -64, 43, 89, 9, -83, -57, 50, 87 + dw 64, 9, -89, -25, 83, 43, -75, -57, 64, 70, -50, -80, 36, 87, -18, -90, 64, 9, -89, -25, 83, 43, -75, -57, 64, 70, -50, -80, 36, 87, -18, -90 + dw 64, -9, -89, 25, 83, -43, -75, 57, 64, -70, -50, 80, 36, -87, -18, 90, 64, -9, -89, 25, 83, -43, -75, 57, 64, -70, -50, 80, 36, -87, -18, 90 + dw 64, -25, -75, 70, 36, -90, 18, 80, -64, -43, 89, -9, -83, 57, 50, -87, 64, -25, -75, 70, 36, -90, 18, 80, -64, -43, 89, -9, -83, 57, 50, -87 + dw 64, -43, -50, 90, -36, -57, 89, -25, -64, 87, -18, -70, 83, -9, -75, 80, 64, -43, -50, 90, -36, -57, 89, -25, -64, 87, -18, -70, 83, -9, -75, 80 + dw 64, -57, -18, 80, -83, 25, 50, -90, 64, 9, -75, 87, -36, -43, 89, -70, 64, -57, -18, 80, -83, 25, 50, -90, 64, 9, -75, 87, -36, -43, 89, -70 + dw 64, -70, 18, 43, -83, 87, -50, -9, 64, -90, 75, -25, -36, 80, -89, 57, 64, -70, 18, 43, -83, 87, -50, -9, 64, -90, 75, -25, -36, 80, -89, 57 + dw 64, -80, 50, -9, -36, 70, -89, 87, -64, 25, 18, -57, 83, -90, 75, -43, 64, -80, 50, -9, -36, 70, -89, 87, -64, 25, 18, -57, 83, -90, 75, -43 + dw 64, -87, 75, -57, 36, -9, -18, 43, -64, 80, -89, 90, -83, 70, -50, 25, 64, -87, 75, -57, 36, -9, -18, 43, -64, 80, -89, 90, -83, 70, -50, 25 + dw 64, -90, 89, -87, 83, -80, 75, -70, 64, -57, 50, -43, 36, -25, 18, -9, 64, -90, 89, -87, 83, -80, 75, -70, 64, -57, 50, -43, 36, -25, 18, -9 + + 
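These *_AVX512 coefficient tables are the existing 16-wide idct rows repeated per 128-bit lane and packed two rows to a 64-byte constant, so one 512-bit vpmaddwd now covers what previously took two 256-bit operations. What each such multiply-add accumulates is, in scalar terms, just an inner product of one coefficient row with the odd (or even) input samples; a small sketch for orientation:

    #include <cstdint>

    // One output contribution of the 32-point inverse transform's odd half: the dot product of
    // the 16 odd-frequency inputs with one row of tab_idct32_1 (vpmaddwd forms these products
    // pairwise and the following adds fold them together).
    static int32_t idct32_odd_dot(const int16_t oddSrc[16], const int16_t coefRow[16])
    {
        int32_t acc = 0;
        for (int k = 0; k < 16; k++)
            acc += (int32_t)oddSrc[k] * coefRow[k];
        return acc;
    }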
avx2_dct4: dw 64, 64, 64, 64, 64, 64, 64, 64, 64, -64, 64, -64, 64, -64, 64, -64 dw 83, 36, 83, 36, 83, 36, 83, 36, 36, -83, 36, -83, 36, -83, 36, -83 @@ -314,9 +510,13 @@ tab_idct8_2: times 1 dw 89, 75, 50, 18, 75, -18, -89, -50 times 1 dw 50, -89, 18, 75, 18, -50, 75, -89 - pb_idct8odd: db 2, 3, 6, 7, 10, 11, 14, 15, 2, 3, 6, 7, 10, 11, 14, 15 +;Scale bits table for rdoQuant +tab_nonpsyRdo8 : dq 5, 7, 9, 11 +tab_nonpsyRdo10: dq 9, 11, 13, 15 +tab_nonpsyRdo12: dq 13, 15, 17, 19 + SECTION .text cextern pd_1 cextern pd_2 @@ -343,6 +543,10 @@ %define DST4_ROUND 16 %define DCT8_SHIFT1 6 %define DCT8_ROUND1 32 + %define RDO_MAX_4 3 + %define RDO_MAX_8 1 + %define RDO_MAX_16 0 + %define RDO_MAX_32 0 %elif BIT_DEPTH == 10 %define DCT4_SHIFT 3 %define DCT4_ROUND 4 @@ -352,6 +556,10 @@ %define DST4_ROUND 4 %define DCT8_SHIFT1 4 %define DCT8_ROUND1 8 + %define RDO_MAX_4 7 + %define RDO_MAX_8 5 + %define RDO_MAX_16 3 + %define RDO_MAX_32 1 %elif BIT_DEPTH == 8 %define DCT4_SHIFT 1 %define DCT4_ROUND 1 @@ -361,6 +569,10 @@ %define DST4_ROUND 1 %define DCT8_SHIFT1 2 %define DCT8_ROUND1 2 + %define RDO_MAX_4 11 + %define RDO_MAX_8 9 + %define RDO_MAX_16 7 + %define RDO_MAX_32 5 %else %error Unsupported BIT_DEPTH! %endif @@ -2165,6 +2377,67 @@ dec r3d jnz .loop RET +%if ARCH_X86_64 == 1 +INIT_ZMM avx512 +cglobal denoise_dct, 4, 4, 22 + pxor m16, m16 + sub r3d, 16 + je .coeff16 + add r3d, 16 + shr r3d, 5 + jmp .loop + +.coeff16: + movu ym19, [r0] + pabsw ym17, ym19 + movu m2, [r1] + pmovsxwd m18, ym17 + paddd m2, m18 + movu [r1], m2 + movu ym3, [r2] + psubusw ym17, ym3 + pcmpgtw ym18, ym17, ym16 + pand ym17, ym18 + psignw ym17, ym19 + movu [r0], ym17 + RET + +.loop: + movu m21, [r0] + pabsw m17, m21 + movu m2, [r1] + pmovsxwd m4, ym17 + paddd m2, m4 + movu [r1], m2 + vextracti64x4 ym4, m17, 1 + + movu m2, [r1 + mmsize] + pmovsxwd m3, ym4 + paddd m2, m3 + movu [r1 + mmsize], m2 + movu m3, [r2] + psubusw m17, m3 + + vextracti64x4 ym20, m17, 1 + pcmpgtw ym18, ym17, ym16 + pcmpgtw ym19, ym20, ym16 + vinserti64x4 m18, m18, ym19, 1 + + pand m17, m18 + vextracti64x4 ym19, m17, 1 + vextracti64x4 ym20, m21, 1 + psignw ym17, ym21 + psignw ym19, ym20 + vinserti64x4 m17, m17, ym19, 1 + + movu [r0], m17 + add r0, mmsize + add r1, mmsize * 2 + add r2, mmsize + dec r3d + jnz .loop + RET +%endif ; ARCH_X86_64 == 1 %if ARCH_X86_64 == 1 %macro DCT8_PASS_1 4 @@ -2270,6 +2543,168 @@ movu [r1 + 96], m10 RET + +%macro DCT8_AVX512_PASS_1 4 + vpmaddwd m%2, m3, m%1 + vpsrlq m8, m%2, 32 + vpaddd m%2, m8 + vpaddd m%2, m5 + vpsrad m%2, DCT8_SHIFT1 + + vpmaddwd m%4, m2, m%3 + vpsrlq m8, m%4, 32 + vpaddd m%4, m8 + vpaddd m%4, m5 + vpsrad m%4, DCT8_SHIFT1 + + vpackssdw m%2, m%4 + vpermw m%2, m1, m%2 +%endmacro + +%macro DCT8_AVX512_PASS_2 4 + vpmaddwd m0, m9, m%1 + vpmaddwd m1, m10, m%1 + vpsrldq m2, m0, 8 + vpsrldq m3, m1, 8 + vpaddd m0, m2 + vpaddd m1, m3 + vpsrlq m2, m0, 32 + vpsrlq m3, m1, 32 + vpaddd m0, m2 + vpaddd m1, m3 + vpaddd m0, m5 + vpsrad m0, DCT8_SHIFT2 + vpaddd m1, m5 + vpsrad m1, DCT8_SHIFT2 + vpackssdw m0, m1 + vpermw m0, m19, m0 + + vpmaddwd m1, m9, m%2 + vpmaddwd m2, m10, m%2 + vpsrldq m3, m1, 8 + vpsrldq m4, m2, 8 + vpaddd m1, m3 + vpaddd m2, m4 + vpsrlq m3, m1, 32 + vpsrlq m4, m2, 32 + vpaddd m1, m3 + vpaddd m2, m4 + vpaddd m1, m5 + vpsrad m1, DCT8_SHIFT2 + vpaddd m2, m5 + vpsrad m2, DCT8_SHIFT2 + vpackssdw m1, m2 + vpermw m1, m19, m1 + vinserti128 ym0, ym0, xm1, 1 + + vpmaddwd m1, m9, m%3 + vpmaddwd m2, m10, m%3 + vpsrldq m3, m1, 8 + vpsrldq m4, m2, 8 + vpaddd m1, m3 + vpaddd m2, m4 + vpsrlq m3, m1, 32 + vpsrlq 
m4, m2, 32 + vpaddd m1, m3 + vpaddd m2, m4 + vpaddd m1, m5 + vpsrad m1, DCT8_SHIFT2 + vpaddd m2, m5 + vpsrad m2, DCT8_SHIFT2 + vpackssdw m1, m2 + vpermw m1, m19, m1 + + vpmaddwd m2, m9, m%4 + vpmaddwd m3, m10, m%4 + vpsrldq m4, m2, 8 + vpsrldq m6, m3, 8 + vpaddd m2, m4 + vpaddd m3, m6 + vpsrlq m4, m2, 32 + vpsrlq m6, m3, 32 + vpaddd m2, m4 + vpaddd m3, m6 + vpaddd m2, m5 + vpsrad m2, DCT8_SHIFT2 + vpaddd m3, m5 + vpsrad m3, DCT8_SHIFT2 + vpackssdw m2, m3 + vpermw m2, m19, m2 + + vinserti128 ym1, ym1, xm2, 1 + vinserti64x4 m0, m0, ym1, 1 +%endmacro + +INIT_ZMM avx512 +cglobal dct8, 3, 7, 24 + + vbroadcasti32x4 m5, [pd_ %+ DCT8_ROUND1] + vbroadcasti32x8 m4, [dct8_shuf] + vbroadcasti32x4 m19, [dct8_shuf9_AVX512] + + add r2d, r2d + lea r3, [r2 * 3] + lea r4, [r0 + r2 * 4] + lea r5, [tab_dct8] + lea r6, [tab_dct8_avx512] + + ;pass1 + mova xm0, [r0] + vinserti128 ym0, ym0, [r4], 1 + mova xm1, [r0 + r2] + vinserti128 ym1, ym1, [r4 + r2], 1 + mova xm2, [r0 + r2 * 2] + vinserti128 ym2, ym2, [r4 + r2 * 2], 1 + mova xm3, [r0 + r3] + vinserti128 ym3, ym3, [r4 + r3], 1 + + vinserti64x4 m0, m0, ym2, 1 + vinserti64x4 m1, m1, ym3, 1 + + vpunpcklqdq m2, m0, m1 + vpunpckhqdq m0, m1 + + vpshufb m0, m4 + vpaddw m3, m2, m0 + vpsubw m2, m0 + + vbroadcasti32x8 m1, [dct8_shuf7_AVX512] + + ; Load all the coefficients togather for better caching + vpbroadcastq m20, [r6 + 0 * 8] + vpbroadcastq m21, [r6 + 1 * 8] + vpbroadcastq m22, [r6 + 2 * 8] + vpbroadcastq m23, [r6 + 3 * 8] + vpbroadcastq m7, [r6 + 4 * 8] + vpbroadcastq m12, [r6 + 5 * 8] + vpbroadcastq m14, [r6 + 6 * 8] + vpbroadcastq m16, [r6 + 7 * 8] + + DCT8_AVX512_PASS_1 20, 9, 21, 10 + DCT8_AVX512_PASS_1 22, 11, 23, 10 + DCT8_AVX512_PASS_1 7, 13, 12, 10 + DCT8_AVX512_PASS_1 14, 15, 16, 10 + + ;pass2 + vbroadcasti32x4 m5, [pd_ %+ DCT8_ROUND2] + + vinserti64x4 m9, m9, ym11, 1 + vinserti64x4 m10, m13, ym15, 1 + + ;Load all the coefficients togather for better caching and reuse common coefficients from PASS 1 + vbroadcasti32x4 m21, [r5 + 1 * 16] + vbroadcasti32x4 m22, [r5 + 2 * 16] + vbroadcasti32x4 m23, [r5 + 3 * 16] + vbroadcasti32x4 m12, [r5 + 5 * 16] + vbroadcasti32x4 m14, [r5 + 6 * 16] + vbroadcasti32x4 m16, [r5 + 7 * 16] + + DCT8_AVX512_PASS_2 20, 21, 22, 23 + movu [r1], m0 + DCT8_AVX512_PASS_2 7, 12, 14, 16 + movu [r1 + 64], m0 + RET + %macro DCT16_PASS_1_E 2 vpbroadcastq m7, [r7 + %1] @@ -2527,10 +2962,401 @@ dec r4d jnz .pass2 RET +%macro DCT16_avx512_PASS_1_O 4 + vbroadcasti32x4 m1, [r5 + %1] + + pmaddwd m3, m6, m1 + vpsrldq m11, m3, 8 + vpaddd m3, m11 + + pmaddwd m11, m8, m1 + vpsrldq m12, m11, 8 + vpaddd m11, m12 + + vpunpcklqdq m12, m3, m11 + vpsrldq m11, m12, 4 + vpaddd m11, m12 + + pmaddwd m3, m10, m1 + vpsrldq m12, m3, 8 + vpaddd m3, m12 + + pmaddwd m12, m2, m1 + vpsrldq m13, m12, 8 + vpaddd m12, m13 + + vpunpcklqdq m13, m3, m12 + vpsrldq m12, m13, 4 + vpaddd m12, m13 + + mova m%3, m26 + vpermi2d m%3, m11, m12 + paddd m%3, m0 + psrad m%3, DCT_SHIFT + + ; next row start + vbroadcasti32x4 m1, [r5 + %2] + + pmaddwd m3, m6, m1 + vpsrldq m11, m3, 8 + vpaddd m3, m11 + + pmaddwd m11, m8, m1 + vpsrldq m12, m11, 8 + vpaddd m11, m12 + + vpunpcklqdq m12, m3, m11 + vpsrldq m11, m12, 4 + vpaddd m11, m12 + + pmaddwd m3, m10, m1 + vpsrldq m12, m3, 8 + vpaddd m3, m12 + + pmaddwd m12, m2, m1 + vpsrldq m13, m12, 8 + vpaddd m12, m13 + + vpunpcklqdq m13, m3, m12 + vpsrldq m12, m13, 4 + vpaddd m12, m13 + + mova m%4, m26 + vpermi2d m%4, m11, m12 + paddd m%4, m0 + psrad m%4, DCT_SHIFT + ;next row end + + packssdw m%3, m%4 + vpermw m%4, m25, m%3 +%endmacro + +%macro 
DCT16_AVX512_PASS_1_LOOP 0 + vbroadcasti32x8 m1, [dct16_shuf1] + mova m2, [dct16_shuf3_AVX512] + mova m3, [dct16_shuf4_AVX512] + + movu ym4, [r0] + movu ym5, [r0 + r2] + vinserti64x4 m4, m4, ym5, 1 + + movu ym5, [r0 + 2 * r2] + movu ym6, [r0 + r3] + vinserti64x4 m5, m5, ym6, 1 + + mova m6, m2 + mova m7, m3 + vpermi2q m6, m4, m5 + vpermi2q m7, m4, m5 + + movu ym4, [r4] + movu ym5, [r4 + r2] + vinserti64x4 m4, m4, ym5, 1 + + movu ym5, [r4 + 2 * r2] + movu ym8, [r4 + r3] + vinserti64x4 m5, m5, ym8, 1 + + mova m8, m2 + mova m9, m3 + vpermi2q m8, m4, m5 + vpermi2q m9, m4, m5 + + vpshufb m7, m1 + vpshufb m9, m1 + + paddw m4, m6, m7 + psubw m6, m7 + + paddw m5, m8, m9 + psubw m8, m9 + + lea r0, [r0 + 8 * r2] + lea r4, [r0 + r2 * 4] + + movu ym7, [r0] + movu ym9, [r0 + r2] + vinserti64x4 m7, m7, ym9, 1 + + movu ym9, [r0 + 2 * r2] + movu ym10, [r0 + r3] + vinserti64x4 m9, m9, ym10, 1 + + mova m10, m2 + mova m11, m3 + vpermi2q m10, m7, m9 + vpermi2q m11, m7, m9 + + vpshufb m11, m1 + paddw m7, m10, m11 + psubw m10, m11 + + movu ym9, [r4] + movu ym11, [r4 + r2] + vinserti64x4 m9, m9, ym11, 1 + + movu ym11, [r4 + 2 * r2] + movu ym12, [r4 + r3] + vinserti64x4 m11, m11, ym12, 1 + + vpermi2q m2, m9, m11 + vpermi2q m3, m9, m11 + + vpshufb m3, m1 + paddw m9, m2, m3 + psubw m2, m3 +%endmacro + +%macro DCT16_avx512_PASS_1_E 4 + vpbroadcastq m1, [r5 + %1] + + pmaddwd m19, m11, m1 + vpsrldq m12, m19, 4 + vpaddd m12, m19 + + pmaddwd m19, m13, m1 + vpsrldq m18, m19, 4 + vpaddd m18, m19 + + mova m%2, m27 + vpermi2d m%2, m12, m18 + paddd m%2, m0 + psrad m%2, DCT_SHIFT + + ; 2nd row + vpbroadcastq m1, [r5 + %3] + + pmaddwd m19, m11, m1 + vpsrldq m12, m19, 4 + vpaddd m12, m19 + + pmaddwd m19, m13, m1 + vpsrldq m18, m19, 4 + vpaddd m18, m19 + + mova m%4, m27 + vpermi2d m%4, m12, m18 + paddd m%4, m0 + psrad m%4, DCT_SHIFT + + packssdw m%2, m%4 + vpermw m%4, m25, m%2 +%endmacro + +%macro DCT16_PASS2_AVX512 10 + vpmaddwd m5, m%2, m%1 + vpsrldq m6, m5, 8 + vpaddd m5, m6 + vpsrldq m6, m5, 4 + vpaddd m5, m6 + + vpmaddwd m6, m%3, m%1 + vpsrldq m7, m6, 8 + vpaddd m6, m7 + vpsrldq m7, m6, 4 + vpaddd m6, m7 + vpunpckldq m7, m5, m6 + + vpmaddwd m5, m%4, m%1 + vpsrldq m6, m5, 8 + vpaddd m5, m6 + vpsrldq m6, m5, 4 + vpaddd m5, m6 + + vpmaddwd m6, m%5, m%1 + vpsrldq m8, m6, 8 + vpaddd m6, m8 + vpsrldq m8, m6, 4 + vpaddd m6, m8 + vpunpckldq m8, m5, m6 + + vpunpcklqdq m5, m7, m8 + vpermd m5, m2, m5 + vpsrldq m6, m5, 4 + vpaddd m5, m6 + + vpmaddwd m6, m%6, m%1 + vpsrldq m7, m6, 8 + vpaddd m6, m7 + vpsrldq m7, m6, 4 + vpaddd m6, m7 + + vpmaddwd m7, m%7, m%1 + vpsrldq m8, m7, 8 + vpaddd m7, m8 + vpsrldq m8, m7, 4 + vpaddd m7, m8 + vpunpckldq m8, m6, m7 + + vpmaddwd m6, m%8, m%1 + vpsrldq m7, m6, 8 + vpaddd m6, m7 + vpsrldq m7, m6, 4 + vpaddd m6, m7 + + vpmaddwd m7, m%9, m%1 + vpsrldq m4, m7, 8 + vpaddd m7, m4 + vpsrldq m4, m7, 4 + vpaddd m7, m4 + vpunpckldq m4, m6, m7 + + vpunpcklqdq m6, m8, m4 + vpermd m6, m2, m6 + vpsrldq m7, m6, 4 + vpaddd m6, m7 + + paddd m5, m0 + psrad m5, DCT_SHIFT2 + paddd m6, m0 + psrad m6, DCT_SHIFT2 + + packssdw m5, m6 + vpermw m%10, m3, m5 +%endmacro + +INIT_ZMM avx512 +cglobal dct16, 3, 6, 29 + +%if BIT_DEPTH == 12 + %define DCT_SHIFT 7 + vbroadcasti32x4 m0, [pd_64] +%elif BIT_DEPTH == 10 + %define DCT_SHIFT 5 + vbroadcasti32x4 m0, [pd_16] +%elif BIT_DEPTH == 8 + %define DCT_SHIFT 3 + vbroadcasti32x4 m0, [pd_4] +%else + %error Unsupported BIT_DEPTH! 
+%endif +%define DCT_SHIFT2 10 + + add r2d, r2d + lea r3, [r2 * 3] + lea r4, [r0 + r2 * 4] + lea r5, [tab_dct16_1 + 8 * 16] + + ;Load reuseable table once to save memory movments + mova m25, [dct16_shuf5_AVX512] + mova m26, [dct16_shuf2_AVX512] + mova m27, [dct16_shuf7_AVX512] + vbroadcasti32x8 m28, [dct16_shuf6_AVX512] + + DCT16_AVX512_PASS_1_LOOP + DCT16_avx512_PASS_1_O -7 * 16, -5 * 16, 15, 14 ;row 1, 3 + DCT16_avx512_PASS_1_O -3 * 16, -1 * 16, 16, 15 ;row 5, 7 + DCT16_avx512_PASS_1_O 1 * 16, 3 * 16, 17, 16 ;row 9, 11 + DCT16_avx512_PASS_1_O 5 * 16, 7 * 16, 18, 17 ;row 13, 15 + + vbroadcasti32x8 m1, [dct16_shuf2] + pshufb m4, m1 + pshufb m5, m1 + pshufb m7, m1 + pshufb m9, m1 + + vpsrldq m3, m4, 2 + vpsubw m11, m4, m3 + vpsrldq m6, m5, 2 + vpsubw m12, m5, m6 + vpsrldq m8, m7, 2 + vpsubw m13, m7, m8 + vpsrldq m10, m9, 2 + vpsubw m18, m9, m10 + + vpermw m11, m28, m11 + vpermw m12, m28, m12 + vinserti64x4 m11, m11, ym12, 1 + + vpermw m13, m28, m13 + vpermw m18, m28, m18 + vinserti64x4 m13, m13, ym18, 1 + + DCT16_avx512_PASS_1_E -6 * 16, 21, -2 * 16, 20 ; row 2, 6 + DCT16_avx512_PASS_1_E 2 * 16, 22, 6 * 16, 21 ; row 10, 14 + + vpaddw m11, m4, m3 + vpaddw m12, m5, m6 + vpaddw m13, m7, m8 + vpaddw m18, m9, m10 + + vpermw m11, m28, m11 + vpermw m12, m28, m12 + vinserti64x4 m11, m11, ym12, 1 + + vpermw m13, m28, m13 + vpermw m18, m28, m18 + vinserti64x4 m13, m13, ym18, 1 + + DCT16_avx512_PASS_1_E -8 * 16, 23, 0 * 16, 22 ; row 0, 8 + DCT16_avx512_PASS_1_E -4 * 16, 24, 4 * 16, 23 ; row 4, 12 + + ;PASS2 + vbroadcasti128 m0, [pd_512] + + lea r5, [tab_dct16] + mova m2, [dct16_shuf9_AVX512] + vbroadcasti32x8 m3, [dct16_shuf8_AVX512] + + vbroadcasti32x8 m1, [r5 + 0 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 9 + vbroadcasti32x8 m1, [r5 + 1 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 10 + vinserti64x4 m9, m9, ym10, 1 + movu [r1 + 0 * 64], m9 + + vbroadcasti32x8 m1, [r5 + 2 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 9 + vbroadcasti32x8 m1, [r5 + 3 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 10 + vinserti64x4 m9, m9, ym10, 1 + movu [r1 + 1 * 64], m9 + + vbroadcasti32x8 m1, [r5 + 4 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 9 + vbroadcasti32x8 m1, [r5 + 5 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 10 + vinserti64x4 m9, m9, ym10, 1 + movu [r1 + 2 * 64], m9 + + vbroadcasti32x8 m1, [r5 + 6 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 9 + vbroadcasti32x8 m1, [r5 + 7 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 10 + vinserti64x4 m9, m9, ym10, 1 + movu [r1 + 3 * 64], m9 + + vbroadcasti32x8 m1, [r5 + 8 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 9 + vbroadcasti32x8 m1, [r5 + 9 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 10 + vinserti64x4 m9, m9, ym10, 1 + movu [r1 + 4 * 64], m9 + + vbroadcasti32x8 m1, [r5 + 10 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 9 + vbroadcasti32x8 m1, [r5 + 11 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 10 + vinserti64x4 m9, m9, ym10, 1 + movu [r1 + 5 * 64], m9 + + vbroadcasti32x8 m1, [r5 + 12 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 9 + vbroadcasti32x8 m1, [r5 + 13 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 10 + vinserti64x4 m9, m9, ym10, 1 + movu [r1 + 6 * 64], m9 + + vbroadcasti32x8 m1, [r5 + 14 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 17, 20, 21, 22, 23, 9 + vbroadcasti32x8 m1, [r5 + 15 * 32] + DCT16_PASS2_AVX512 1, 14, 15, 16, 
17, 20, 21, 22, 23, 10 + vinserti64x4 m9, m9, ym10, 1 + movu [r1 + 7 * 64], m9 + RET %macro DCT32_PASS_1 4 vbroadcasti128 m8, [r7 + %1] - pmaddwd m11, m%3, m8 pmaddwd m12, m%4, m8 phaddd m11, m12 @@ -2791,6 +3617,521 @@ jnz .pass2 RET + +%macro DCT32_avx512_LOOP 4 + movu m1, [r0] + movu m2, [r0 + r2] + + vinserti64x4 m3, m1, ym2, 1 ; row 0l, 1l + vextracti64x4 ym4, m1, 1 + vinserti64x4 m2, m2, ym4, 0 ; row 0h, 1h + vpermw m2, m31, m2 + + psubw m%1, m3, m2 ; O + paddw m3, m2 ; E + mova [r9 + %3 * 64], m3 + + movu m1, [r0 + 2 * r2] + movu m5, [r0 + r3] + + vinserti64x4 m6, m1, ym5, 1 ; row 2l, 3l + vextracti64x4 ym7, m1, 1 + vinserti64x4 m5, m5, ym7, 0 ; row 2h, 3h + vpermw m5, m31, m5 + + psubw m%2, m6, m5 ; O + paddw m6, m5 ; E + mova [r9 + %4 * 64], m6 +%endmacro + +%macro DCT32_avx512_PASS_1_O 3 + pmaddwd m10, m%2, m9 + vpsrldq m11, m10, 8 + vpaddd m10, m11 + + pmaddwd m11, m%3, m9 + vpsrldq m12, m11, 8 + vpaddd m11, m12 + + mova m12, m8 + vpermi2d m12, m10, m11 + vpsrldq m10, m12, 8 + vpaddd m12, m10 + vpsrldq m10, m12, 4 + vpaddd m12, m10 + + vpaddd m12, m0 + vpsrad m12, DCT_SHIFT + vpackssdw m12, m12 + vpermw m12, m30, m12 + movq [r5 + %1], xm12 +%endmacro + +%macro DCT32_avx512_PASS_1_ROW_O 0 + vbroadcasti32x8 m9, [r7 + 1 * 32] + + DCT32_avx512_LOOP 13, 14, 0, 1 + DCT32_avx512_PASS_1_O 1 * 64 + 0 * 8, 13, 14 + + lea r0, [r0 + 4 * r2] + DCT32_avx512_LOOP 15, 16, 2, 3 + DCT32_avx512_PASS_1_O 1 * 64 + 1 * 8, 15, 16 + + lea r0, [r0 + 4 * r2] + DCT32_avx512_LOOP 17, 18, 4, 5 + DCT32_avx512_PASS_1_O 1 * 64 + 2 * 8, 17, 18 + + lea r0, [r0 + 4 * r2] + DCT32_avx512_LOOP 19, 20, 6, 7 + DCT32_avx512_PASS_1_O 1 * 64 + 3 * 8, 19, 20 + + lea r0, [r0 + 4 * r2] + DCT32_avx512_LOOP 21, 22, 8, 9 + DCT32_avx512_PASS_1_O 1 * 64 + 4 * 8, 21, 22 + + lea r0, [r0 + 4 * r2] + DCT32_avx512_LOOP 23, 24, 10, 11 + DCT32_avx512_PASS_1_O 1 * 64 + 5 * 8, 23, 24 + + lea r0, [r0 + 4 * r2] + DCT32_avx512_LOOP 25, 26, 12, 13 + DCT32_avx512_PASS_1_O 1 * 64 + 6 * 8, 25, 26 + + lea r0, [r0 + 4 * r2] + DCT32_avx512_LOOP 27, 28, 14, 15 + DCT32_avx512_PASS_1_O 1 * 64 + 7 * 8, 27, 28 +%endmacro + +%macro DCT32_avx512_PASS_1_ROW_O_1_7 1 + vbroadcasti32x8 m9, [r7 + %1 * 32] + + DCT32_avx512_PASS_1_O %1 * 64 + 0 * 8, 13, 14 + DCT32_avx512_PASS_1_O %1 * 64 + 1 * 8, 15, 16 + DCT32_avx512_PASS_1_O %1 * 64 + 2 * 8, 17, 18 + DCT32_avx512_PASS_1_O %1 * 64 + 3 * 8, 19, 20 + DCT32_avx512_PASS_1_O %1 * 64 + 4 * 8, 21, 22 + DCT32_avx512_PASS_1_O %1 * 64 + 5 * 8, 23, 24 + DCT32_avx512_PASS_1_O %1 * 64 + 6 * 8, 25, 26 + DCT32_avx512_PASS_1_O %1 * 64 + 7 * 8, 27, 28 +%endmacro + +%macro DCT32_avx512_LOOP_EO 4 + mova m4, [rsp + 32 * mmsize + %3 * 64] + vpermw m4, m8, m4 + vextracti64x4 ym5, m4, 1 + + mova m6, [rsp + 32 * mmsize + %4 * 64] + vpermw m6, m8, m6 + vextracti64x4 ym7, m6, 1 + + vinserti64x4 m4, m4, ym6, 1 + vinserti64x4 m5, m5, ym7, 1 + + psubw m%1, m4, m5 ; EO + paddw m%2, m4, m5 ; EE +%endmacro + +%macro DCT32_avx512_PASS_1_ROW_EO 2 + pmaddwd m29, m%2, m12 + vpsrldq m30, m29, 8 + vpaddd m30, m29 + vpsrldq m29, m30, 4 + vpaddd m29, m30 + + vpaddd m29, m0 + vpsrad m29, DCT_SHIFT + vpackssdw m29, m29 + + vpermw m29, m11, m29 + movq [r5 + %1], xm29 +%endmacro + +%macro DCT32_avx512_PASS_1_ROW_EO_0 0 + + mova m8, [dct32_shuf2_AVX512] + vbroadcasti32x4 m12, [r7 + 2 * 32] + + DCT32_avx512_LOOP_EO 13, 14, 0, 1 + DCT32_avx512_PASS_1_ROW_EO 2 * 64 + 0 * 8, 13 + + lea r9, [r9 + 4 * r2] + DCT32_avx512_LOOP_EO 15, 16, 2, 3 + DCT32_avx512_PASS_1_ROW_EO 2 * 64 + 1 * 8, 15 + + lea r9, [r9 + 4 * r2] + DCT32_avx512_LOOP_EO 17, 18, 4, 5 + 
DCT32_avx512_PASS_1_ROW_EO 2 * 64 + 2 * 8, 17 + + lea r9, [r9 + 4 * r2] + DCT32_avx512_LOOP_EO 19, 20, 6, 7 + DCT32_avx512_PASS_1_ROW_EO 2 * 64 + 3 * 8, 19 + + lea r9, [r9 + 4 * r2] + DCT32_avx512_LOOP_EO 21, 22, 8, 9 + DCT32_avx512_PASS_1_ROW_EO 2 * 64 + 4 * 8, 21 + + lea r9, [r9 + 4 * r2] + DCT32_avx512_LOOP_EO 23, 24, 10, 11 + DCT32_avx512_PASS_1_ROW_EO 2 * 64 + 5 * 8, 23 + + lea r9, [r9 + 4 * r2] + DCT32_avx512_LOOP_EO 25, 26, 12, 13 + DCT32_avx512_PASS_1_ROW_EO 2 * 64 + 6 * 8, 25 + + lea r9, [r9 + 4 * r2] + DCT32_avx512_LOOP_EO 27, 28, 14, 15 + DCT32_avx512_PASS_1_ROW_EO 2 * 64 + 7 * 8, 27 + +%endmacro + +%macro DCT32_avx512_PASS_1_ROW_EO_1_7 1 + + vbroadcasti32x4 m12, [r7 + %1 * 32] + + DCT32_avx512_PASS_1_ROW_EO %1 * 64 + 0 * 8, 13 + DCT32_avx512_PASS_1_ROW_EO %1 * 64 + 1 * 8, 15 + DCT32_avx512_PASS_1_ROW_EO %1 * 64 + 2 * 8, 17 + DCT32_avx512_PASS_1_ROW_EO %1 * 64 + 3 * 8, 19 + DCT32_avx512_PASS_1_ROW_EO %1 * 64 + 4 * 8, 21 + DCT32_avx512_PASS_1_ROW_EO %1 * 64 + 5 * 8, 23 + DCT32_avx512_PASS_1_ROW_EO %1 * 64 + 6 * 8, 25 + DCT32_avx512_PASS_1_ROW_EO %1 * 64 + 7 * 8, 27 + +%endmacro + +%macro DCT32_avx512_LOOP_EEO 0 + vpunpcklqdq m2, m14, m16 + vpunpckhqdq m14, m16 + vpshufb m14, m31 + + vpaddw m16, m2, m14 ; EEE + vpsubw m2, m14 ; EE0 + + vpunpcklqdq m3, m18, m20 + vpunpckhqdq m18, m20 + vpshufb m18, m31 + + vpaddw m20, m3, m18 ; EEE + vpsubw m3, m18 ; EE0 + + vpunpcklqdq m4, m22, m24 + vpunpckhqdq m22, m24 + vpshufb m22, m31 + + vpaddw m24, m4, m22 ; EEE + vpsubw m4, m22 ; EE0 + + vpunpcklqdq m5, m26, m28 + vpunpckhqdq m26, m28 + vpshufb m26, m31 + + vpaddw m28, m5, m26 ; EEE + vpsubw m5, m26 ; EE0 +%endmacro + +%macro DCT32_avx512_PASS_1_ROW_EEO 2 + pmaddwd m30, m%2, m1 + vpsrldq m29, m30, 4 + vpaddd m29, m30 + + vpaddd m29, m0 + vpsrad m29, DCT_SHIFT + vpackssdw m29, m29 + + vpermw m29, m27, m29 + movu [r5 + %1], xm29 +%endmacro + +%macro DCT32_avx512_PASS_1_ROW_EEO_1_4 1 + +vpbroadcastq m1, [r7 + %1 * 32] +DCT32_avx512_PASS_1_ROW_EEO %1 * 64 + 0 * 16, 2 +DCT32_avx512_PASS_1_ROW_EEO %1 * 64 + 1 * 16, 3 +DCT32_avx512_PASS_1_ROW_EEO %1 * 64 + 2 * 16, 4 +DCT32_avx512_PASS_1_ROW_EEO %1 * 64 + 3 * 16, 5 + +%endmacro + +%macro DCT32_avx512_PASS_1_ROW_EEEO_1_4 1 + +vpbroadcastq m1, [r7 + %1 * 32] +DCT32_avx512_PASS_1_ROW_EEO %1 * 64 + 0 * 16, 16 +DCT32_avx512_PASS_1_ROW_EEO %1 * 64 + 1 * 16, 20 +DCT32_avx512_PASS_1_ROW_EEO %1 * 64 + 2 * 16, 24 +DCT32_avx512_PASS_1_ROW_EEO %1 * 64 + 3 * 16, 28 + +%endmacro + +%macro DCT32_avx512_PASS2_OPT 5 + pmaddwd m9, m1, m%1 + vpsrldq m10, m9, 8 + vpaddd m9, m10 + + pmaddwd m10, m1, m%2 + vpsrldq m11, m10, 8 + vpaddd m10, m11 + + pmaddwd m11, m1, m%3 + vpsrldq m12, m11, 8 + vpaddd m11, m12 + + pmaddwd m12, m1, m%4 + vpsrldq m13, m12, 8 + vpaddd m12, m13 + + vpsrldq m13, m9, 4 + vpaddd m9, m13 + vpsrldq m13, m10, 4 + vpaddd m10, m13 + vpsrldq m13, m11, 4 + vpaddd m11, m13 + vpsrldq m13, m12, 4 + vpaddd m12, m13 + + vpermd m9, m31, m9 + vpermd m10, m31, m10 + vpermd m11, m31, m11 + vpermd m12, m31, m12 + + vpandd m9, m27 + vpandd m10, m30 + vpandd m11, m29 + vpandd m12, m28 + + vpaddd m9, m10 + vpaddd m11, m12 + vpaddd m9, m11 + + vpsrldq m10, m9, 8 + vpaddd m9, m10 + vpsrldq m10, m9, 4 + vpaddd m9, m10 + + vpermd m9, m31, m9 + vpaddd m9, m0 + vpsrad m9, DCT_SHIFT2 + vpackssdw m9, m9 + movq [r1 + %5], xm9 + +%endmacro + +%macro DCT32_avx512_PASS2 5 + + mova m9, [r5 + %1] + mova m10, [r5 + %2] + mova m11, [r5 + %3] + mova m12, [r5 + %4] + + pmaddwd m9, m1, m9 + vpsrldq m13, m9, 8 + vpaddd m9, m13 + + pmaddwd m10, m1, m10 + vpsrldq m13, m10, 8 + vpaddd 
m10, m13 + + pmaddwd m11, m1, m11 + vpsrldq m13, m11, 8 + vpaddd m11, m13 + + pmaddwd m12, m1, m12 + vpsrldq m13, m12, 8 + vpaddd m12, m13 + + vpsrldq m13, m9, 4 + vpaddd m9, m13 + vpsrldq m13, m10, 4 + vpaddd m10, m13 + vpsrldq m13, m11, 4 + vpaddd m11, m13 + vpsrldq m13, m12, 4 + vpaddd m12, m13 + + vpermd m9, m31, m9 + vpermd m10, m31, m10 + vpermd m11, m31, m11 + vpermd m12, m31, m12 + + vpandd m9, m27 + vpandd m10, m30 + vpandd m11, m29 + vpandd m12, m28 + + vpaddd m9, m10 + vpaddd m11, m12 + vpaddd m9, m11 + + vpsrldq m10, m9, 8 + vpaddd m9, m10 + vpsrldq m10, m9, 4 + vpaddd m9, m10 + + vpermd m9, m31, m9 + vpaddd m9, m0 + vpsrad m9, DCT_SHIFT2 + vpackssdw m9, m9 + movq [r1 + %5], xm9 + +%endmacro + +%macro DCT32_avx512_PASS2_1_ROW 1 + +mova m1, [r8 + %1 * 64] + +DCT32_avx512_PASS2_OPT 2, 3, 4, 14, %1 * 64 + 0 * 8 +DCT32_avx512_PASS2_OPT 15, 16, 17, 18, %1 * 64 + 1 * 8 +DCT32_avx512_PASS2_OPT 19, 20, 21, 22, %1 * 64 + 2 * 8 +DCT32_avx512_PASS2_OPT 23, 24, 25, 26, %1 * 64 + 3 * 8 +DCT32_avx512_PASS2_OPT 5, 6, 7, 8, %1 * 64 + 4 * 8 + +DCT32_avx512_PASS2 20 * 64, 21 * 64, 22 * 64, 23 * 64, %1 * 64 + 5 * 8 +DCT32_avx512_PASS2 24 * 64, 25 * 64, 26 * 64, 27 * 64, %1 * 64 + 6 * 8 +DCT32_avx512_PASS2 28 * 64, 29 * 64, 30 * 64, 31 * 64, %1 * 64 + 7 * 8 + +%endmacro + +INIT_ZMM avx512 +cglobal dct32, 3, 10, 32, 0-(32*mmsize + 16*mmsize) + +%if BIT_DEPTH == 12 + %define DCT_SHIFT 8 + vpbroadcastq m0, [pd_128] +%elif BIT_DEPTH == 10 + %define DCT_SHIFT 6 + vpbroadcastq m0, [pd_32] +%elif BIT_DEPTH == 8 + %define DCT_SHIFT 4 + vpbroadcastq m0, [pd_8] +%else + %error Unsupported BIT_DEPTH! +%endif +%define DCT_SHIFT2 11 + + add r2d, r2d + lea r7, [tab_dct32_1] + lea r8, [tab_dct32] + lea r3, [r2 * 3] + mov r5, rsp + mov r9, 2048 ; 32 * mmsize + add r9, rsp + + mova m31, [dct32_shuf1_AVX512] + + ; PASSS 1 + + vbroadcasti32x8 m30, [dct8_shuf9_AVX512] + mova m8, [dct32_shuf_AVX512] + + DCT32_avx512_PASS_1_ROW_O + DCT32_avx512_PASS_1_ROW_O_1_7 3 + DCT32_avx512_PASS_1_ROW_O_1_7 5 + DCT32_avx512_PASS_1_ROW_O_1_7 7 + DCT32_avx512_PASS_1_ROW_O_1_7 9 + DCT32_avx512_PASS_1_ROW_O_1_7 11 + DCT32_avx512_PASS_1_ROW_O_1_7 13 + DCT32_avx512_PASS_1_ROW_O_1_7 15 + DCT32_avx512_PASS_1_ROW_O_1_7 17 + DCT32_avx512_PASS_1_ROW_O_1_7 19 + DCT32_avx512_PASS_1_ROW_O_1_7 20 + DCT32_avx512_PASS_1_ROW_O_1_7 21 + DCT32_avx512_PASS_1_ROW_O_1_7 23 + DCT32_avx512_PASS_1_ROW_O_1_7 25 + DCT32_avx512_PASS_1_ROW_O_1_7 27 + DCT32_avx512_PASS_1_ROW_O_1_7 29 + DCT32_avx512_PASS_1_ROW_O_1_7 31 + + vbroadcasti32x8 m11, [dct8_shuf9_AVX512] + + DCT32_avx512_PASS_1_ROW_EO_0 + DCT32_avx512_PASS_1_ROW_EO_1_7 6 + DCT32_avx512_PASS_1_ROW_EO_1_7 10 + DCT32_avx512_PASS_1_ROW_EO_1_7 14 + DCT32_avx512_PASS_1_ROW_EO_1_7 18 + DCT32_avx512_PASS_1_ROW_EO_1_7 22 + DCT32_avx512_PASS_1_ROW_EO_1_7 26 + DCT32_avx512_PASS_1_ROW_EO_1_7 30 + + vbroadcasti32x4 m31, [dct8_shuf] + vbroadcasti32x8 m27, [dct32_shuf3_AVX512] + + DCT32_avx512_LOOP_EEO + DCT32_avx512_PASS_1_ROW_EEO_1_4 4 + DCT32_avx512_PASS_1_ROW_EEO_1_4 12 + DCT32_avx512_PASS_1_ROW_EEO_1_4 20 + DCT32_avx512_PASS_1_ROW_EEO_1_4 28 + + DCT32_avx512_PASS_1_ROW_EEEO_1_4 0 + DCT32_avx512_PASS_1_ROW_EEEO_1_4 16 + DCT32_avx512_PASS_1_ROW_EEEO_1_4 8 + DCT32_avx512_PASS_1_ROW_EEEO_1_4 24 + + ; PASS 2 + + vpbroadcastq m0, [pd_1024] + vbroadcasti32x8 m31, [dct32_shuf4_AVX512] + movu m30, [dct32_shuf5_AVX512] + movu m29, [dct32_shuf6_AVX512] + movu m28, [dct32_shuf7_AVX512] + movu m27, [dct32_shuf8_AVX512] + + ;Load the source coefficents into free registers and reuse them for all rows + + mova m2, [r5 + 0 * 64] + 
mova m3, [r5 + 1 * 64] + mova m4, [r5 + 2 * 64] + mova m14, [r5 + 3 * 64] + mova m15, [r5 + 4 * 64] + mova m16, [r5 + 5 * 64] + mova m17, [r5 + 6 * 64] + mova m18, [r5 + 7 * 64] + mova m19, [r5 + 8 * 64] + mova m20, [r5 + 9 * 64] + mova m21, [r5 + 10 * 64] + mova m22, [r5 + 11 * 64] + mova m23, [r5 + 12 * 64] + mova m24, [r5 + 13 * 64] + mova m25, [r5 + 14 * 64] + mova m26, [r5 + 15 * 64] + mova m5, [r5 + 16 * 64] + mova m6, [r5 + 17 * 64] + mova m7, [r5 + 18 * 64] + mova m8, [r5 + 19 * 64] + + DCT32_avx512_PASS2_1_ROW 0 + DCT32_avx512_PASS2_1_ROW 1 + DCT32_avx512_PASS2_1_ROW 2 + DCT32_avx512_PASS2_1_ROW 3 + DCT32_avx512_PASS2_1_ROW 4 + DCT32_avx512_PASS2_1_ROW 5 + DCT32_avx512_PASS2_1_ROW 6 + DCT32_avx512_PASS2_1_ROW 7 + DCT32_avx512_PASS2_1_ROW 8 + DCT32_avx512_PASS2_1_ROW 9 + DCT32_avx512_PASS2_1_ROW 10 + DCT32_avx512_PASS2_1_ROW 11 + DCT32_avx512_PASS2_1_ROW 12 + DCT32_avx512_PASS2_1_ROW 13 + DCT32_avx512_PASS2_1_ROW 14 + DCT32_avx512_PASS2_1_ROW 15 + DCT32_avx512_PASS2_1_ROW 16 + DCT32_avx512_PASS2_1_ROW 17 + DCT32_avx512_PASS2_1_ROW 18 + DCT32_avx512_PASS2_1_ROW 19 + DCT32_avx512_PASS2_1_ROW 20 + DCT32_avx512_PASS2_1_ROW 21 + DCT32_avx512_PASS2_1_ROW 22 + DCT32_avx512_PASS2_1_ROW 23 + DCT32_avx512_PASS2_1_ROW 24 + DCT32_avx512_PASS2_1_ROW 25 + DCT32_avx512_PASS2_1_ROW 26 + DCT32_avx512_PASS2_1_ROW 27 + DCT32_avx512_PASS2_1_ROW 28 + DCT32_avx512_PASS2_1_ROW 29 + DCT32_avx512_PASS2_1_ROW 30 + DCT32_avx512_PASS2_1_ROW 31 + + RET + %macro IDCT8_PASS_1 1 vpbroadcastd m7, [r5 + %1] vpbroadcastd m10, [r5 + %1 + 4] @@ -2969,6 +4310,213 @@ mova [r1 + r3], xm3 RET + +%macro IDCT8_AVX512_PASS_1 0 + pmaddwd m5, m29, m17 + pmaddwd m6, m25, m18 + paddd m5, m6 + + pmaddwd m6, m30, m21 + pmaddwd m3, m26, m22 + paddd m6, m3 + + paddd m3, m5, m6 + paddd m3, m11 + psrad m3, IDCT_SHIFT1 + + psubd m5, m6 + paddd m5, m11 + psrad m5, IDCT_SHIFT1 + + pmaddwd m6, m29, m19 + pmaddwd m8, m25, m20 + paddd m6, m8 + + pmaddwd m8, m30, m23 + pmaddwd m9, m26, m24 + paddd m8, m9 + + paddd m9, m6, m8 + paddd m9, m11 + psrad m9, IDCT_SHIFT1 + + psubd m6, m8 + paddd m6, m11 + psrad m6, IDCT_SHIFT1 + + packssdw m3, m9 + vpermq m3, m3, 0xD8 + + packssdw m6, m5 + vpermq m6, m6, 0xD8 +%endmacro + + +%macro IDCT8_AVX512_PASS_2 0 + mov r7d, 0xAAAA + kmovd k1, r7d + punpcklqdq m2, m3, m13 + punpckhqdq m0, m3, m13 + + pmaddwd m3, m2, [r5] + pmaddwd m5, m2, [r5 + 1 * mmsize] + pmaddwd m6, m2, [r5 + 2 * mmsize] + pmaddwd m7, m2, [r5 + 3 * mmsize] + + vpsrldq m14, m3, 4 + paddd m3, m14 + vpslldq m16, m5, 4 + paddd m5, m16 + vmovdqu32 m3 {k1}, m5 + + vpsrldq m14, m6, 4 + paddd m6, m14 + vpslldq m16, m7, 4 + paddd m7, m16 + vmovdqu32 m6 {k1}, m7 + + punpcklqdq m7, m3, m6 + punpckhqdq m3, m6 + + pmaddwd m5, m0, [r6] + pmaddwd m6, m0, [r6 + 1 * mmsize] + pmaddwd m8, m0, [r6 + 2 * mmsize] + pmaddwd m9, m0, [r6 + 3 * mmsize] + + vpsrldq m14, m5, 4 + paddd m5, m14 + vpslldq m16, m6, 4 + paddd m6, m16 + vmovdqu32 m5 {k1}, m6 + + vpsrldq m14, m8, 4 + paddd m8, m14 + vpslldq m16, m9, 4 + paddd m9, m16 + vmovdqu32 m8 {k1}, m9 + + punpcklqdq m6, m5, m8 + punpckhqdq m5, m8 + + paddd m8, m7, m6 + paddd m8, m12 + psrad m8, IDCT_SHIFT2 + + psubd m7, m6 + paddd m7, m12 + psrad m7, IDCT_SHIFT2 + + pshufb m7, [idct8_avx512_shuf3] + packssdw m8, m7 + + paddd m9, m3, m5 + paddd m9, m12 + psrad m9, IDCT_SHIFT2 + + psubd m3, m5 + paddd m3, m12 + psrad m3, IDCT_SHIFT2 + + pshufb m3, [idct8_avx512_shuf3] + packssdw m9, m3 +%endmacro + + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal idct8, 3, 8, 31 +%if BIT_DEPTH == 12 + %define IDCT_SHIFT2 8 + vpbroadcastd 
m12, [pd_128] +%elif BIT_DEPTH == 10 + %define IDCT_SHIFT2 10 + vpbroadcastd m12, [pd_512] +%elif BIT_DEPTH == 8 + %define IDCT_SHIFT2 12 + vpbroadcastd m12, [pd_2048] +%else + %error Unsupported BIT_DEPTH! +%endif +%define IDCT_SHIFT1 7 + + vpbroadcastd m11, [pd_64] + + lea r4, [avx512_idct8_3] + lea r5, [avx2_idct8_1] + lea r6, [avx2_idct8_2] + movu m16, [idct16_shuff2] + movu m17, [idct16_shuff3] + + ;pass1 + mova ym1, [r0 + 0 * 32] + mova ym0, [r0 + 1 * 32] + mova ym25, ym16 + mova ym26, ym17 + vpermi2w ym25, ym1, ym0 + vpermi2w ym26, ym1, ym0 + + mova ym1, [r0 + 2 * 32] + mova ym0, [r0 + 3 * 32] + mova ym27, ym16 + mova ym28, ym17 + vpermi2w ym27, ym1, ym0 + vpermi2w ym28, ym1, ym0 + + vperm2i128 ym29, ym25, ym26, 0x20 + vperm2i128 ym30, ym25, ym26, 0x31 + vperm2i128 ym25, ym27, ym28, 0x20 + vperm2i128 ym26, ym27, ym28, 0x31 + + vinserti64x4 m29, m29, ym29, 1 + vinserti64x4 m25, m25, ym25, 1 + vinserti64x4 m30, m30, ym30, 1 + vinserti64x4 m26, m26, ym26, 1 + + movu m17, [r4] + movu m18, [r4 + 1 * mmsize] + movu m19, [r4 + 2 * mmsize] + movu m20, [r4 + 3 * mmsize] + movu m21, [r4 + 4 * mmsize] + movu m22, [r4 + 5 * mmsize] + movu m23, [r4 + 6 * mmsize] + movu m24, [r4 + 7 * mmsize] + + IDCT8_AVX512_PASS_1 + + vextracti64x4 ym13, m3, 1 + vextracti64x4 ym14, m6, 1 + vinserti64x4 m3, m3, ym14, 1 + vinserti64x4 m13, m13, ym6, 1 + + ;pass2 + add r2d, r2d + lea r3, [r2 * 3] + lea r5, [avx512_idct8_1] + lea r6, [avx512_idct8_2] + + IDCT8_AVX512_PASS_2 + + vextracti128 xm3, ym8, 1 + mova [r1], xm8 + mova [r1 + r2], xm3 + vextracti128 xm3, ym9, 1 + mova [r1 + r2 * 2], xm9 + mova [r1 + r3], xm3 + + lea r1, [r1 + r2 * 4] + + vextracti64x4 ym10, m8, 1 + vextracti64x4 ym11, m9, 1 + + vextracti128 xm3, ym10, 1 + mova [r1], xm10 + mova [r1 + r2], xm3 + vextracti128 xm3, ym11, 1 + mova [r1 + r2 * 2], xm11 + mova [r1 + r3], xm3 + RET +%endif + %macro IDCT_PASS1 2 vbroadcasti128 m5, [tab_idct16_2 + %1 * 16] @@ -3266,6 +4814,574 @@ jnz .pass2 RET + +%macro IDCT16_AVX512_PASS1 3 + movu m5, [tab_AVX512_idct16_2 + %1 * 64] + pmaddwd m9, m4, m5 + pmaddwd m10, m6, m5 + + vpsrldq m16, m9, 4 + paddd m9, m16 + vpslldq m17, m10, 4 + paddd m10, m17 + vmovdqu32 m9 {k1}, m10 + + pmaddwd m10, m7, m5 + pmaddwd m11, m8, m5 + + vpsrldq m16, m10, 4 + paddd m10, m16 + vpslldq m17, m11, 4 + paddd m11, m17 + vmovdqu32 m10 {k1}, m11 + + vpsrldq m16, m9, 8 + paddd m9, m16 + vpslldq m17, m10, 8 + paddd m10, m17 + vmovdqu32 m9 {k2}, m10 + + mova m5, [tab_AVX512_idct16_1 + %1 * 64] + pmaddwd m10, m28, m5 + pmaddwd m11, m29, m5 + + vpsrldq m16, m10, 4 + paddd m10, m16 + vpslldq m17, m11, 4 + paddd m11, m17 + vmovdqu32 m10 {k1}, m11 + + pmaddwd m11, m30, m5 + pmaddwd m12, m31, m5 + + vpsrldq m16, m11, 4 + paddd m11, m16 + vpslldq m17, m12, 4 + paddd m12, m17 + vmovdqu32 m11 {k1}, m12 + + vpsrldq m16, m10, 8 + paddd m10, m16 + vpslldq m17, m11, 8 + paddd m11, m17 + vmovdqu32 m10 {k2}, m11 + + paddd m11, m9, m10 + paddd m11, m14 + psrad m11, IDCT_SHIFT1 + + psubd m9, m10 + paddd m9, m14 + psrad m9, IDCT_SHIFT1 + + mova m5, [tab_AVX512_idct16_2 + %1 * 64 + 64] + pmaddwd m10, m4, m5 + pmaddwd m12, m6, m5 + + + vpsrldq m16, m10, 4 + paddd m10, m16 + vpslldq m17, m12, 4 + paddd m12, m17 + vmovdqu32 m10 {k1}, m12 + + pmaddwd m12, m7, m5 + pmaddwd m13, m8, m5 + + + vpsrldq m16, m12, 4 + paddd m12, m16 + vpslldq m17, m13, 4 + paddd m13, m17 + vmovdqu32 m12 {k1}, m13 + + + vpsrldq m16, m10, 8 + paddd m10, m16 + vpslldq m17, m12, 8 + paddd m12, m17 + vmovdqu32 m10 {k2}, m12 + + + + mova m5, [tab_AVX512_idct16_1 + %1 * 64 + 64] + pmaddwd m12, 
m28, m5 + pmaddwd m13, m29, m5 + + + vpsrldq m16, m12, 4 + paddd m12, m16 + vpslldq m17, m13, 4 + paddd m13, m17 + vmovdqu32 m12 {k1}, m13 + + pmaddwd m13, m30, m5 + pmaddwd m5, m31 + + + vpsrldq m16, m13, 4 + paddd m13, m16 + vpslldq m17, m5, 4 + paddd m5, m17 + vmovdqu32 m13 {k1}, m5 + + + vpsrldq m16, m12, 8 + paddd m12, m16 + vpslldq m17, m13, 8 + paddd m13, m17 + vmovdqu32 m12 {k2}, m13 + + + paddd m5, m10, m12 + paddd m5, m14 + psrad m5, IDCT_SHIFT1 + + psubd m10, m12 + paddd m10, m14 + psrad m10, IDCT_SHIFT1 + + packssdw m11, m5 + packssdw m9, m10 + + mova m10, [idct16_AVX512_shuff] + mova m5, [idct16_AVX512_shuff1] + + vpermd m%2, m10, m11 + vpermd m%3, m5, m9 +%endmacro + +%macro IDCT16_AVX512_PASS2 2 + vpermq m0, m%1, 0xD8 + + pmaddwd m1, m0, m7 + pmaddwd m2, m0, m8 + + + vpsrldq m14, m1, 4 + paddd m1, m14 + vpslldq m31, m2, 4 + paddd m2, m31 + vmovdqu32 m1 {k1}, m2 + + pmaddwd m2, m0, m9 + pmaddwd m3, m0, m10 + + + vpsrldq m14, m2, 4 + paddd m2, m14 + vpslldq m31, m3, 4 + paddd m3, m31 + vmovdqu32 m2 {k1}, m3 + + + vpsrldq m14, m1, 8 + paddd m1, m14 + vpslldq m31, m2, 8 + paddd m2, m31 + vmovdqu32 m1 {k2}, m2 + + pmaddwd m2, m0, m11 + pmaddwd m3, m0, m12 + + + vpsrldq m14, m2, 4 + paddd m2, m14 + vpslldq m31, m3, 4 + paddd m3, m31 + vmovdqu32 m2 {k1}, m3 + + vbroadcasti64x2 m14, [r5 + 112] + pmaddwd m3, m0, m13 + pmaddwd m4, m0, m14 + + + vpsrldq m14, m3, 4 + paddd m3, m14 + vpslldq m31, m4, 4 + paddd m4, m31 + vmovdqu32 m3 {k1}, m4 + + + vpsrldq m14, m2, 8 + paddd m2, m14 + vpslldq m31, m3, 8 + paddd m3, m31 + vmovdqu32 m2 {k2}, m3 + + vpermq m0, m%2, 0xD8 + pmaddwd m3, m0, m16 + pmaddwd m4, m0, m17 + + + vpsrldq m14, m3, 4 + paddd m3, m14 + vpslldq m31, m4, 4 + paddd m4, m31 + vmovdqu32 m3 {k1}, m4 + + pmaddwd m4, m0, m19 + pmaddwd m5, m0, m23 + + + vpsrldq m14, m4, 4 + paddd m4, m14 + vpslldq m31, m5, 4 + paddd m5, m31 + vmovdqu32 m4 {k1}, m5 + + + vpsrldq m14, m3, 8 + paddd m3, m14 + vpslldq m31, m4, 8 + paddd m4, m31 + vmovdqu32 m3 {k2}, m4 + + + pmaddwd m4, m0, m28 + pmaddwd m5, m0, m29 + + vpsrldq m14, m4, 4 + paddd m4, m14 + vpslldq m31, m5, 4 + paddd m5, m31 + vmovdqu32 m4 {k1}, m5 + + pmaddwd m6, m0, m30 + vbroadcasti64x2 m31, [r6 + 112] + pmaddwd m0, m31 + + + vpsrldq m14, m6, 4 + paddd m6, m14 + vpslldq m31, m0, 4 + paddd m0, m31 + vmovdqu32 m6 {k1}, m0 + + + vpsrldq m14, m4, 8 + paddd m4, m14 + vpslldq m31, m6, 8 + paddd m6, m31 + vmovdqu32 m4 {k2}, m6 + + paddd m5, m1, m3 + paddd m5, m15 + psrad m5, IDCT_SHIFT2 + + psubd m1, m3 + paddd m1, m15 + psrad m1, IDCT_SHIFT2 + + paddd m6, m2, m4 + paddd m6, m15 + psrad m6, IDCT_SHIFT2 + + psubd m2, m4 + paddd m2, m15 + psrad m2, IDCT_SHIFT2 + + packssdw m5, m6 + packssdw m1, m2 + pshufb m2, m1, [idct16_AVX512_shuff6] +%endmacro + + +;------------------------------------------------------- +; void idct16(const int16_t* src, int16_t* dst, intptr_t dstStride) +;------------------------------------------------------- +INIT_ZMM avx512 +cglobal idct16, 3, 8, 32 +%if BIT_DEPTH == 12 + %define IDCT_SHIFT2 8 + vpbroadcastd m15, [pd_128] +%elif BIT_DEPTH == 10 + %define IDCT_SHIFT2 10 + vpbroadcastd m15, [pd_512] +%elif BIT_DEPTH == 8 + %define IDCT_SHIFT2 12 + vpbroadcastd m15, [pd_2048] +%else + %error Unsupported BIT_DEPTH! 
+%endif +%define IDCT_SHIFT1 7 + + vpbroadcastd m14, [pd_64] + + add r2d, r2d + + mov r7d, 0xAAAA + kmovd k1, r7d + mov r7d, 0xCCCC + kmovd k2, r7d + mova ym2, [idct16_shuff2] + mova ym3, [idct16_shuff3] + mova ym26, [idct16_shuff4] + mova ym27, [idct16_shuff5] + +.pass1: + movu xm0, [r0 + 0 * 32] + vinserti128 ym0, ym0, [r0 + 8 * 32], 1 + movu xm1, [r0 + 2 * 32] + vinserti128 ym1, ym1, [r0 + 10 * 32], 1 + + mova ym9, ym2 + mova ym10, ym3 + vpermi2w ym9, ym0, ym1 + vpermi2w ym10, ym0, ym1 + + movu xm0, [r0 + 4 * 32] + vinserti128 ym0, ym0, [r0 + 12 * 32], 1 + movu xm1, [r0 + 6 * 32] + vinserti128 ym1, ym1, [r0 + 14 * 32], 1 + + mova ym11, ym2 + mova ym12, ym3 + vpermi2w ym11, ym0, ym1 + vpermi2w ym12, ym0, ym1 + + mova ym4, ym26 + mova ym6, ym27 + vpermi2d ym4, ym9, ym11 + vpermi2d ym6, ym9, ym11 + + mova ym7, ym26 + mova ym8, ym27 + vpermi2d ym7, ym10, ym12 + vpermi2d ym8, ym10, ym12 + + vpermq ym4, ym4, q3120 + vpermq ym6, ym6, q3120 + vpermq ym7, ym7, q3120 + vpermq ym8, ym8, q3120 + + movu xm0, [r0 + 1 * 32] + vinserti128 ym0, ym0, [r0 + 9 * 32], 1 + movu xm1, [r0 + 3 * 32] + vinserti128 ym1, ym1, [r0 + 11 * 32], 1 + + mova ym9, ym2 + mova ym10, ym3 + vpermi2w ym9, ym0, ym1 + vpermi2w ym10, ym0, ym1 + + movu xm0, [r0 + 5 * 32] + vinserti128 ym0, ym0, [r0 + 13 * 32], 1 + movu xm1, [r0 + 7 * 32] + vinserti128 ym1, ym1, [r0 + 15 * 32], 1 + + mova ym11, ym2 + mova ym12, ym3 + vpermi2w ym11, ym0, ym1 + vpermi2w ym12, ym0, ym1 + + mova ym28, ym26 + mova ym29, ym27 + vpermi2d ym28, ym9, ym11 + vpermi2d ym29, ym9, ym11 + + mova ym30, ym26 + mova ym31, ym27 + vpermi2d ym30, ym10, ym12 + vpermi2d ym31, ym10, ym12 + + vpermq ym28, ym28, q3120 + vpermq ym29, ym29, q3120 + vpermq ym30, ym30, q3120 + vpermq ym31, ym31, q3120 + + vinserti64x4 m4, m4, ym4, 1 + vinserti64x4 m6, m6, ym6, 1 + vinserti64x4 m7, m7, ym7, 1 + vinserti64x4 m8, m8, ym8, 1 + vinserti64x4 m28, m28, ym28, 1 + vinserti64x4 m29, m29, ym29, 1 + vinserti64x4 m30, m30, ym30, 1 + vinserti64x4 m31, m31, ym31, 1 + + IDCT16_AVX512_PASS1 0, 18, 19 + IDCT16_AVX512_PASS1 2, 20, 21 + + add r0, 16 + + movu xm0, [r0 + 0 * 32] + vinserti128 ym0, ym0, [r0 + 8 * 32], 1 + movu xm1, [r0 + 2 * 32] + vinserti128 ym1, ym1, [r0 + 10 * 32], 1 + + mova ym9, ym2 + mova ym10, ym3 + vpermi2w ym9, ym0, ym1 + vpermi2w ym10, ym0, ym1 + + movu xm0, [r0 + 4 * 32] + vinserti128 ym0, ym0, [r0 + 12 * 32], 1 + movu xm1, [r0 + 6 * 32] + vinserti128 ym1, ym1, [r0 + 14 * 32], 1 + + mova ym11, ym2 + mova ym12, ym3 + vpermi2w ym11, ym0, ym1 + vpermi2w ym12, ym0, ym1 + + mova ym4, ym26 + mova ym6, ym27 + vpermi2d ym4, ym9, ym11 + vpermi2d ym6, ym9, ym11 + + mova ym7, ym26 + mova ym8, ym27 + vpermi2d ym7, ym10, ym12 + vpermi2d ym8, ym10, ym12 + + vpermq ym4, ym4, q3120 + vpermq ym6, ym6, q3120 + vpermq ym7, ym7, q3120 + vpermq ym8, ym8, q3120 + + movu xm0, [r0 + 1 * 32] + vinserti128 ym0, ym0, [r0 + 9 * 32], 1 + movu xm1, [r0 + 3 * 32] + vinserti128 ym1, ym1, [r0 + 11 * 32], 1 + + mova ym9, ym2 + mova ym10, ym3 + vpermi2w ym9, ym0, ym1 + vpermi2w ym10, ym0, ym1 + + movu xm0, [r0 + 5 * 32] + vinserti128 ym0, ym0, [r0 + 13 * 32], 1 + movu xm1, [r0 + 7 * 32] + vinserti128 ym1, ym1, [r0 + 15 * 32], 1 + + mova ym11, ym2 + mova ym12, ym3 + vpermi2w ym11, ym0, ym1 + vpermi2w ym12, ym0, ym1 + + mova ym28, ym26 + mova ym29, ym27 + vpermi2d ym28, ym9, ym11 + vpermi2d ym29, ym9, ym11 + + mova ym30, ym26 + mova ym31, ym27 + vpermi2d ym30, ym10, ym12 + vpermi2d ym31, ym10, ym12 + + vpermq ym28, ym28, q3120 + vpermq ym29, ym29, q3120 + vpermq ym30, ym30, q3120 + vpermq ym31, ym31, q3120 
+ + vinserti64x4 m4, m4, ym4, 1 + vinserti64x4 m6, m6, ym6, 1 + vinserti64x4 m7, m7, ym7, 1 + vinserti64x4 m8, m8, ym8, 1 + vinserti64x4 m28, m28, ym28, 1 + vinserti64x4 m29, m29, ym29, 1 + vinserti64x4 m30, m30, ym30, 1 + vinserti64x4 m31, m31, ym31, 1 + + + IDCT16_AVX512_PASS1 0, 22, 23 + IDCT16_AVX512_PASS1 2, 24, 25 + + mova m26, [idct16_AVX512_shuff2] + mova m27, [idct16_AVX512_shuff3] + vpermi2q m26, m18, m22 + vpermi2q m27, m18, m22 + mova m18, [idct16_AVX512_shuff2] + mova m22, [idct16_AVX512_shuff3] + vpermi2q m18, m20, m24 + vpermi2q m22, m20, m24 + mova m20, [idct16_AVX512_shuff4] + mova m24, [idct16_AVX512_shuff5] + vpermi2q m20, m21, m25 + vpermi2q m24, m21, m25 + mova m21, [idct16_AVX512_shuff4] + mova m25, [idct16_AVX512_shuff5] + vpermi2q m21, m19, m23 + vpermi2q m25, m19, m23 + + lea r5, [tab_idct16_2] + lea r6, [tab_idct16_1] + + vbroadcasti64x2 m7, [r5] + vbroadcasti64x2 m8, [r5 + 16] + vbroadcasti64x2 m9, [r5 + 32] + vbroadcasti64x2 m10, [r5 + 48] + vbroadcasti64x2 m11, [r5 + 64] + vbroadcasti64x2 m12, [r5 + 80] + vbroadcasti64x2 m13, [r5 + 96] + + vbroadcasti64x2 m16, [r6] + vbroadcasti64x2 m17, [r6 + 16] + vbroadcasti64x2 m19, [r6 + 32] + vbroadcasti64x2 m23, [r6 + 48] + vbroadcasti64x2 m28, [r6 + 64] + vbroadcasti64x2 m29, [r6 + 80] + vbroadcasti64x2 m30, [r6 + 96] + + + IDCT16_AVX512_PASS2 26, 27 + mova [r1], xm5 + mova [r1 + 16], xm2 + vextracti128 [r1 + r2], ym5, 1 + vextracti128 [r1 + r2 + 16], ym2, 1 + vextracti64x4 ym14, m5, 1 + vextracti64x4 ym31, m2, 1 + lea r1, [r1 + 2 * r2] + mova [r1], xm14 + mova [r1 + 16], xm31 + vextracti128 [r1 + r2], ym14, 1 + vextracti128 [r1 + r2 + 16], ym31, 1 + + IDCT16_AVX512_PASS2 18, 22 + lea r1, [r1 + 2 * r2] + mova [r1], xm5 + mova [r1 + 16], xm2 + vextracti128 [r1 + r2], ym5, 1 + vextracti128 [r1 + r2 + 16], ym2, 1 + vextracti64x4 ym14, m5, 1 + vextracti64x4 ym31, m2, 1 + lea r1, [r1 + 2 * r2] + mova [r1], xm14 + mova [r1 + 16], xm31 + vextracti128 [r1 + r2], ym14, 1 + vextracti128 [r1 + r2 + 16], ym31, 1 + + IDCT16_AVX512_PASS2 20, 24 + lea r1, [r1 + 2 * r2] + mova [r1], xm5 + mova [r1 + 16], xm2 + vextracti128 [r1 + r2], ym5, 1 + vextracti128 [r1 + r2 + 16], ym2, 1 + vextracti64x4 ym14, m5, 1 + vextracti64x4 ym31, m2, 1 + lea r1, [r1 + 2 * r2] + mova [r1], xm14 + mova [r1 + 16], xm31 + vextracti128 [r1 + r2], ym14, 1 + vextracti128 [r1 + r2 + 16], ym31, 1 + + IDCT16_AVX512_PASS2 21, 25 + lea r1, [r1 + 2 * r2] + mova [r1], xm5 + mova [r1 + 16], xm2 + vextracti128 [r1 + r2], ym5, 1 + vextracti128 [r1 + r2 + 16], ym2, 1 + vextracti64x4 ym14, m5, 1 + vextracti64x4 ym31, m2, 1 + lea r1, [r1 + 2 * r2] + mova [r1], xm14 + mova [r1 + 16], xm31 + vextracti128 [r1 + r2], ym14, 1 + vextracti128 [r1 + r2 + 16], ym31, 1 + RET + + + %macro IDCT32_PASS1 1 vbroadcasti128 m3, [tab_idct32_1 + %1 * 32] vbroadcasti128 m13, [tab_idct32_1 + %1 * 32 + 16] @@ -3630,6 +5746,601 @@ jnz .pass2 RET + +%macro IDCT32_AVX512_PASS1 5 + pmaddwd m9, m8, m%4 + pmaddwd m10, m7, m%5 + + paddd m9, m10 + vpsrldq m0, m9, 8 + paddd m9, m0 + vpsrldq m0, m9, 4 + paddd m9, m0 + + pmaddwd m10, m4, m%4 + pmaddwd m11, m1, m%5 + + paddd m10, m11 + vpsrldq m0, m10, 8 + paddd m10, m0 + vpslldq m0, m10, 4 + paddd m10, m0 + + vmovdqu32 m9 {k3}, m10 + + mova m6, [tab_idct32_AVX512_5 + %1 * 64] + mova m5, [tab_idct32_AVX512_5 + %1 * 64 + 64] + + pmaddwd m10, m8, m6 + pmaddwd m11, m7, m5 + + paddd m10, m11 + vpslldq m0, m10, 8 + paddd m10, m0 + vpsrldq m0, m10, 4 + paddd m10, m0 + + pmaddwd m11, m4, m6 + pmaddwd m12, m1, m5 + + paddd m11, m12 + vpslldq m0, m11, 8 + paddd m11, 
m0 + vpslldq m0, m11, 4 + paddd m11, m0 + + vmovdqu32 m10 {k4}, m11 + vmovdqu32 m9 {k2}, m10 + + pmaddwd m10, m3, m%2 + pmaddwd m11, m14, m%2 + + vpsrldq m0, m10, 4 + paddd m10, m0 + vpslldq m5, m11, 4 + paddd m11, m5 + vmovdqu32 m10 {k1}, m11 + + vpsrldq m0, m10, 8 + paddd m10, m0 + + pmaddwd m11, m2, m%3 + pmaddwd m12, m13, m%3 + + vpsrldq m0, m11, 4 + paddd m11, m0 + vpslldq m5, m12, 4 + paddd m12, m5 + vmovdqu32 m11 {k1}, m12 + + vpsrldq m0, m11, 8 + paddd m11, m0 + + paddd m12, m10, m11 + psubd m10, m11 + + punpcklqdq m12, m10 + paddd m10, m9, m12 + paddd m10, m15 + psrad m10, IDCT_SHIFT1 + + psubd m12, m9 + paddd m12, m15 + psrad m12, IDCT_SHIFT1 + + packssdw m10, m12 + vextracti128 xm12, m10, 1 + vextracti64x4 ym5, m10, 1 + vextracti128 xm0, ym5, 1 + + movd [r3 + %1 * 64], xm10 + movd [r3 + 32 + %1 * 64], xm12 + pextrd [r4 - %1 * 64], xm10, 1 + pextrd [r4+ 32 - %1 * 64], xm12, 1 + pextrd [r3 + 16 * 64 + %1 *64], xm10, 3 + pextrd [r3 + 16 * 64 + 32 + %1 * 64], xm12, 3 + pextrd [r4 + 16 * 64 - %1 * 64], xm10, 2 + pextrd [r4 + 16 * 64 + 32 - %1 * 64], xm12, 2 + + movd [r3 + (%1 + 1) * 64], xm5 + movd [r3 + 32 + (%1 + 1) * 64], xm0 + pextrd [r4 - (%1 + 1) * 64], xm5, 1 + pextrd [r4+ 32 - (%1 + 1) * 64], xm0, 1 + pextrd [r3 + 16 * 64 + (%1 + 1) * 64], xm5, 3 + pextrd [r3 + 16 * 64 + 32 + (%1 + 1) * 64], xm0, 3 + pextrd [r4 + 16 * 64 - (%1 + 1) * 64], xm5, 2 + pextrd [r4 + 16 * 64 + 32 - (%1 + 1) * 64], xm0, 2 +%endmacro + +%macro IDCT32_AVX512_PASS2 0 + pmaddwd m2, m0, m7 + pmaddwd m3, m0, m8 + + vpsrldq m24, m2, 4 + paddd m2, m24 + vpslldq m25, m3, 4 + paddd m3, m25 + vmovdqu32 m2 {k1}, m3 + + pmaddwd m3, m0, m9 + pmaddwd m4, m0, m10 + + vpsrldq m24, m3, 4 + paddd m3, m24 + vpslldq m25, m4, 4 + paddd m4, m25 + vmovdqu32 m3 {k1}, m4 + + vpsrldq m24, m2, 8 + paddd m2, m24 + vpslldq m25, m3, 8 + paddd m3, m25 + vmovdqu32 m2 {k2}, m3 + + pmaddwd m3, m0, m11 + pmaddwd m4, m0, m12 + + vpsrldq m24, m3, 4 + paddd m3, m24 + vpslldq m25, m4, 4 + paddd m4, m25 + vmovdqu32 m3 {k1}, m4 + + pmaddwd m4, m0, m13 + pmaddwd m5, m0, m14 + + vpsrldq m24, m4, 4 + paddd m4, m24 + vpslldq m25, m5, 4 + paddd m5, m25 + vmovdqu32 m4 {k1}, m5 + + vpsrldq m24, m3, 8 + paddd m3, m24 + vpslldq m25, m4, 8 + paddd m4, m25 + vmovdqu32 m3 {k2}, m4 + + mova m24, [idct16_AVX512_shuff3] + mova m25, [idct16_AVX512_shuff2] + vpermi2q m24, m2, m3 + vpermi2q m25, m2, m3 + paddd m2, m25, m24 + + pmaddwd m3, m0, m16 + pmaddwd m4, m0, m17 + + vpsrldq m24, m3, 4 + paddd m3, m24 + vpslldq m25, m4, 4 + paddd m4, m25 + vmovdqu32 m3 {k1}, m4 + + pmaddwd m4, m0, m18 + pmaddwd m5, m0, m19 + + vpsrldq m24, m4, 4 + paddd m4, m24 + vpslldq m25, m5, 4 + paddd m5, m25 + vmovdqu32 m4 {k1}, m5 + + vpsrldq m24, m3, 8 + paddd m3, m24 + vpslldq m25, m4, 8 + paddd m4, m25 + vmovdqu32 m3 {k2}, m4 + + pmaddwd m4, m0, m20 + pmaddwd m5, m0, m21 + + vpsrldq m24, m4, 4 + paddd m4, m24 + vpslldq m25, m5, 4 + paddd m5, m25 + vmovdqu32 m4 {k1}, m5 + + pmaddwd m5, m0, m22 + pmaddwd m0, m23 + + vpsrldq m24, m5, 4 + paddd m5, m24 + vpslldq m25, m0, 4 + paddd m0, m25 + vmovdqu32 m5 {k1}, m0 + + vpsrldq m24, m4, 8 + paddd m4, m24 + vpslldq m25, m5, 8 + paddd m5, m25 + vmovdqu32 m4 {k2}, m5 + + mova m24, [idct16_AVX512_shuff3] + mova m25, [idct16_AVX512_shuff2] + vpermi2q m24, m3, m4 + vpermi2q m25, m3, m4 + paddd m3, m25, m24 + + pmaddwd m4, m1, m26 + pmaddwd m0, m1, m27 + + vpsrldq m24, m4, 4 + paddd m4, m24 + vpslldq m25, m0, 4 + paddd m0, m25 + vmovdqu32 m4 {k1}, m0 + + pmaddwd m5, m1, m28 + pmaddwd m0, m1, m29 + + vpsrldq m24, m5, 4 + paddd m5, m24 + 
vpslldq m25, m0, 4 + paddd m0, m25 + vmovdqu32 m5 {k1}, m0 + + + vpsrldq m24, m4, 8 + paddd m4, m24 + vpslldq m25, m5, 8 + paddd m5, m25 + vmovdqu32 m4 {k2}, m5 + + pmaddwd m5, m1, m30 + pmaddwd m0, m1, m31 + + vpsrldq m24, m5, 4 + paddd m5, m24 + vpslldq m25, m0, 4 + paddd m0, m25 + vmovdqu32 m5 {k1}, m0 + + pmaddwd m6, m1, [tab_idct32_AVX512_4 + 6 * mmsize] + pmaddwd m0, m1, [tab_idct32_AVX512_4 + 7 * mmsize] + + vpsrldq m24, m6, 4 + paddd m6, m24 + vpslldq m25, m0, 4 + paddd m0, m25 + vmovdqu32 m6 {k1}, m0 + + vpsrldq m24, m5, 8 + paddd m5, m24 + vpslldq m25, m6, 8 + paddd m6, m25 + vmovdqu32 m5 {k2}, m6 + + mova m24, [idct16_AVX512_shuff3] + mova m25, [idct16_AVX512_shuff2] + vpermi2q m24, m4, m5 + vpermi2q m25, m4, m5 + paddd m4, m25, m24 + + pmaddwd m5, m1, [tab_idct32_AVX512_4 + 8 * mmsize] + pmaddwd m0, m1, [tab_idct32_AVX512_4 + 9 * mmsize] + + vpsrldq m24, m5, 4 + paddd m5, m24 + vpslldq m25, m0, 4 + paddd m0, m25 + vmovdqu32 m5 {k1}, m0 + + pmaddwd m6, m1, [tab_idct32_AVX512_4 + 10 * mmsize] + pmaddwd m0, m1, [tab_idct32_AVX512_4 + 11 * mmsize] + + vpsrldq m24, m6, 4 + paddd m6, m24 + vpslldq m25, m0, 4 + paddd m0, m25 + vmovdqu32 m6 {k1}, m0 + + vpsrldq m24, m5, 8 + paddd m5, m24 + vpslldq m25, m6, 8 + paddd m6, m25 + vmovdqu32 m5 {k2}, m6 + + pmaddwd m6, m1, [tab_idct32_AVX512_4 + 12 * mmsize] + pmaddwd m0, m1, [tab_idct32_AVX512_4 + 13 * mmsize] + + vpsrldq m24, m6, 4 + paddd m6, m24 + vpslldq m25, m0, 4 + paddd m0, m25 + vmovdqu32 m6 {k1}, m0 + + pmaddwd m0, m1, [tab_idct32_AVX512_4 + 14 * mmsize] + pmaddwd m1, [tab_idct32_AVX512_4 + 15 * mmsize] + + vpsrldq m24, m0, 4 + paddd m0, m24 + vpslldq m25, m1, 4 + paddd m1, m25 + vmovdqu32 m0 {k1}, m1 + + vpsrldq m24, m6, 8 + paddd m6, m24 + vpslldq m25, m0, 8 + paddd m0, m25 + vmovdqu32 m6 {k2}, m0 + + mova m24, [idct16_AVX512_shuff3] + mova m25, [idct16_AVX512_shuff2] + vpermi2q m24, m5, m6 + vpermi2q m25, m5, m6 + paddd m5, m25, m24 + + paddd m6, m2, m4 + paddd m6, m15 + psrad m6, IDCT_SHIFT2 + + psubd m2, m4 + paddd m2, m15 + psrad m2, IDCT_SHIFT2 + + paddd m4, m3, m5 + paddd m4, m15 + psrad m4, IDCT_SHIFT2 + + psubd m3, m5 + paddd m3, m15 + psrad m3, IDCT_SHIFT2 + + packssdw m6, m4 + packssdw m2, m3 + + vpermq m6, m6, 0xD8 + vpermq m2, m2, 0x8D + pshufb m2, [idct16_AVX512_shuff6] +%endmacro + +;------------------------------------------------------------------- +; void idct32(const int16_t* src, int16_t* dst, intptr_t dstStride) +;------------------------------------------------------------------- + +INIT_ZMM avx512 +cglobal idct32, 3, 8, 32, 0-32*64 + +%define IDCT_SHIFT1 7 + + vbroadcasti128 m15, [pd_64] + + mov r3, rsp + lea r4, [r3 + 15 * 64] + mov r5d, 8 + mov r7d, 0xAAAA + kmovd k1, r7d + mov r7d, 0xCCCC + kmovd k2, r7d + mov r7d, 0x2222 + kmovd k3, r7d + mov r7d, 0x8888 + kmovd k4, r7d + + + mova m16, [tab_idct32_AVX512_2 + 0 * 64] + mova m17, [tab_idct32_AVX512_2 + 1 * 64] + mova m18, [tab_idct32_AVX512_2 + 2 * 64] + mova m19, [tab_idct32_AVX512_2 + 3 * 64] + + mova m20, [tab_idct32_AVX512_3 + 0 * 64] + mova m21, [tab_idct32_AVX512_3 + 1 * 64] + mova m22, [tab_idct32_AVX512_3 + 2 * 64] + mova m23, [tab_idct32_AVX512_3 + 3 * 64] + + mova m24, [tab_idct32_AVX512_1 + 0 * 64] + mova m25, [tab_idct32_AVX512_1 + 1 * 64] + mova m26, [tab_idct32_AVX512_1 + 2 * 64] + mova m27, [tab_idct32_AVX512_1 + 3 * 64] + mova m28, [tab_idct32_AVX512_1 + 4 * 64] + mova m29, [tab_idct32_AVX512_1 + 5 * 64] + mova m30, [tab_idct32_AVX512_1 + 6 * 64] + mova m31, [tab_idct32_AVX512_1 + 7 * 64] + +.pass1: + movq xm0, [r0 + 2 * 64] + movq xm1, [r0 
+ 18 * 64] + punpcklqdq xm0, xm0, xm1 + movq xm1, [r0 + 0 * 64] + movq xm2, [r0 + 16 * 64] + punpcklqdq xm1, xm1, xm2 + vinserti128 ym0, ym0, xm1, 1 ;[2 18 0 16] + + movq xm1, [r0 + 1 * 64] + movq xm2, [r0 + 9 * 64] + punpcklqdq xm1, xm1, xm2 + movq xm2, [r0 + 17 * 64] + movq xm3, [r0 + 25 * 64] + punpcklqdq xm2, xm2, xm3 + vinserti128 ym1, ym1, xm2, 1 ;[1 9 17 25] + + movq xm2, [r0 + 6 * 64] + movq xm3, [r0 + 22 * 64] + punpcklqdq xm2, xm2, xm3 + movq xm3, [r0 + 4 * 64] + movq xm4, [r0 + 20 * 64] + punpcklqdq xm3, xm3, xm4 + vinserti128 ym2, ym2, xm3, 1 ;[6 22 4 20] + + movq xm3, [r0 + 3 * 64] + movq xm4, [r0 + 11 * 64] + punpcklqdq xm3, xm3, xm4 + movq xm4, [r0 + 19 * 64] + movq xm5, [r0 + 27 * 64] + punpcklqdq xm4, xm4, xm5 + vinserti128 ym3, ym3, xm4, 1 ;[3 11 17 25] + + movq xm4, [r0 + 10 * 64] + movq xm5, [r0 + 26 * 64] + punpcklqdq xm4, xm4, xm5 + movq xm5, [r0 + 8 * 64] + movq xm6, [r0 + 24 * 64] + punpcklqdq xm5, xm5, xm6 + vinserti128 ym4, ym4, xm5, 1 ;[10 26 8 24] + + movq xm5, [r0 + 5 * 64] + movq xm6, [r0 + 13 * 64] + punpcklqdq xm5, xm5, xm6 + movq xm6, [r0 + 21 * 64] + movq xm7, [r0 + 29 * 64] + punpcklqdq xm6, xm6, xm7 + vinserti128 ym5, ym5, xm6, 1 ;[5 13 21 9] + + movq xm6, [r0 + 14 * 64] + movq xm7, [r0 + 30 * 64] + punpcklqdq xm6, xm6, xm7 + movq xm7, [r0 + 12 * 64] + movq xm8, [r0 + 28 * 64] + punpcklqdq xm7, xm7, xm8 + vinserti128 ym6, ym6, xm7, 1 ;[14 30 12 28] + + movq xm7, [r0 + 7 * 64] + movq xm8, [r0 + 15 * 64] + punpcklqdq xm7, xm7, xm8 + movq xm8, [r0 + 23 * 64] + movq xm9, [r0 + 31 * 64] + punpcklqdq xm8, xm8, xm9 + vinserti128 ym7, ym7, xm8, 1 ;[7 15 23 31] + + punpckhwd ym8, ym0, ym2 ;[18 22 16 20] + punpcklwd ym0, ym2 ;[2 6 0 4] + + punpckhwd ym2, ym1, ym3 ;[9 11 25 27] + punpcklwd ym1, ym3 ;[1 3 17 19] + + punpckhwd ym3, ym4, ym6 ;[26 30 24 28] + punpcklwd ym4, ym6 ;[10 14 8 12] + + punpckhwd ym6, ym5, ym7 ;[13 15 29 31] + punpcklwd ym5, ym7 ;[5 7 21 23] + + punpckhdq ym7, ym0, ym4 ;[22 62 102 142 23 63 103 143 02 42 82 122 03 43 83 123] + punpckldq ym0, ym4 ;[20 60 100 140 21 61 101 141 00 40 80 120 01 41 81 121] + + punpckhdq ym4, ym8, ym3 ;[182 222 262 302 183 223 263 303 162 202 242 282 163 203 243 283] + punpckldq ym8, ym3 ;[180 220 260 300 181 221 261 301 160 200 240 280 161 201 241 281] + + punpckhdq ym3, ym1, ym5 ;[12 32 52 72 13 33 53 73 172 192 212 232 173 193 213 233] + punpckldq ym1, ym5 ;[10 30 50 70 11 31 51 71 170 190 210 230 171 191 211 231] + + punpckhdq ym5, ym2, ym6 ;[92 112 132 152 93 113 133 153 252 272 292 312 253 273 293 313] + punpckldq ym2, ym6 ;[90 110 130 150 91 111 131 151 250 270 290 310 251 271 291 311] + + punpckhqdq ym6, ym0, ym8 ;[21 61 101 141 181 221 261 301 01 41 81 121 161 201 241 281] + punpcklqdq ym0, ym8 ;[20 60 100 140 180 220 260 300 00 40 80 120 160 200 240 280] + + punpckhqdq ym8, ym7, ym4 ;[23 63 103 143 183 223 263 303 03 43 83 123 163 203 243 283] + punpcklqdq ym7, ym4 ;[22 62 102 142 182 222 262 302 02 42 82 122 162 202 242 282] + + punpckhqdq ym4, ym1, ym2 ;[11 31 51 71 91 111 131 151 171 191 211 231 251 271 291 311] + punpcklqdq ym1, ym2 ;[10 30 50 70 90 110 130 150 170 190 210 230 250 270 290 310] + + punpckhqdq ym2, ym3, ym5 ;[13 33 53 73 93 113 133 153 173 193 213 233 253 273 293 313] + punpcklqdq ym3, ym5 ;[12 32 52 72 92 112 132 152 172 192 212 232 252 272 292 312] + + vinserti64x4 m7, m7, ym7, 1 + vinserti64x4 m8, m8, ym8, 1 + movu m13, [idct16_AVX512_shuff2] + movu m14, [idct16_AVX512_shuff3] + vpermi2q m13, m7, m8 + vpermi2q m14, m7, m8 + + vinserti64x4 m1, m1, ym1, 1 + vinserti64x4 m4, m4, ym4, 1 + 
movu m7, [idct16_AVX512_shuff3] + movu m8, [idct16_AVX512_shuff2] + vpermi2q m7, m1, m4 + vpermi2q m8, m1, m4 + + vinserti64x4 m3, m3, ym3, 1 + vinserti64x4 m2, m2, ym2, 1 + movu m1, [idct16_AVX512_shuff3] + movu m4, [idct16_AVX512_shuff2] + vpermi2q m1, m3, m2 + vpermi2q m4, m3, m2 + + vinserti64x4 m0, m0, ym0, 1 + vinserti64x4 m6, m6, ym6, 1 + movu m2, [idct16_AVX512_shuff2] + movu m3, [idct16_AVX512_shuff3] + vpermi2q m2, m0, m6 + vpermi2q m3, m0, m6 + + + IDCT32_AVX512_PASS1 0, 16, 20, 24, 25 + IDCT32_AVX512_PASS1 2, 17, 21, 26, 27 + IDCT32_AVX512_PASS1 4, 18, 22, 28, 29 + IDCT32_AVX512_PASS1 6, 19, 23, 30, 31 + + add r0, 8 + add r3, 4 + add r4, 4 + dec r5d + jnz .pass1 + +%if BIT_DEPTH == 12 + %define IDCT_SHIFT2 8 + vpbroadcastd m15, [pd_128] +%elif BIT_DEPTH == 10 + %define IDCT_SHIFT2 10 + vpbroadcastd m15, [pd_512] +%elif BIT_DEPTH == 8 + %define IDCT_SHIFT2 12 + vpbroadcastd m15, [pd_2048] +%else + %error Unsupported BIT_DEPTH! +%endif + + mov r3, rsp + add r2d, r2d + mov r4d, 16 + mov r6d, 0xFFFF0000 + kmovd k3, r6d + + mova m7, [tab_idct32_AVX512_6] + mova m8, [tab_idct32_AVX512_6 + 1 * mmsize] + mova m9, [tab_idct32_AVX512_6 + 2 * mmsize] + mova m10, [tab_idct32_AVX512_6 + 3 * mmsize] + mova m11, [tab_idct32_AVX512_6 + 4 * mmsize] + mova m12, [tab_idct32_AVX512_6 + 5 * mmsize] + mova m13, [tab_idct32_AVX512_6 + 6 * mmsize] + mova m14, [tab_idct32_AVX512_6 + 7 * mmsize] + mova m16, [tab_idct32_AVX512_6 + 8 * mmsize] + mova m17, [tab_idct32_AVX512_6 + 9 * mmsize] + mova m18, [tab_idct32_AVX512_6 + 10 * mmsize] + mova m19, [tab_idct32_AVX512_6 + 11 * mmsize] + mova m20, [tab_idct32_AVX512_6 + 12 * mmsize] + mova m21, [tab_idct32_AVX512_6 + 13 * mmsize] + mova m22, [tab_idct32_AVX512_6 + 14 * mmsize] + mova m23, [tab_idct32_AVX512_6 + 15 * mmsize] + mova m26, [tab_idct32_AVX512_4] + mova m27, [tab_idct32_AVX512_4 + 1 * mmsize] + mova m28, [tab_idct32_AVX512_4 + 2 * mmsize] + mova m29, [tab_idct32_AVX512_4 + 3 * mmsize] + mova m30, [tab_idct32_AVX512_4 + 4 * mmsize] + mova m31, [tab_idct32_AVX512_4 + 5 * mmsize] + +.pass2: + movu ym0, [r3] + movu ym1, [r3 + 32] + vmovdqu16 m0 {k3}, [r3 + 32] + vmovdqu16 m1 {k3}, [r3 + 64] + + IDCT32_AVX512_PASS2 + movu [r1], ym6 + movu [r1 + 32], ym2 + vextracti64x4 ym24, m6, 1 + vextracti64x4 ym25, m2, 1 + add r1, r2 + movu [r1 ], ym24 + movu [r1 + 32], ym25 + + add r1, r2 + add r3, 128 + dec r4d + jnz .pass2 + RET + ;------------------------------------------------------- ; void idct4(const int16_t* src, int16_t* dst, intptr_t dstStride) ;------------------------------------------------------- @@ -3704,4 +6415,1227 @@ movhps [r1 + 2 * r2], xm0 movhps [r1 + r3], xm1 RET + +;static void nonPsyRdoQuant_c(int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, uint32_t blkPos) +;{ +; const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */ +; const int scaleBits = SCALE_BITS - 2 * transformShift; +; const uint32_t trSize = 1 << log2TrSize; + +; for (int y = 0; y < MLS_CG_SIZE; y++) +; { +; for (int x = 0; x < MLS_CG_SIZE; x++) +; { +; int signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ +; costUncoded[blkPos + x] = static_cast<int64_t>((double)((signCoef * signCoef) << scaleBits)); +; *totalUncodedCost += costUncoded[blkPos + x]; +; *totalRdCost += costUncoded[blkPos + x]; +; } +; blkPos += trSize; +; } +;} + 
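+; A hedged worked example of how the scale-bits tables above (tab_nonpsyRdo8/10/12)
+; follow from the C reference, assuming SCALE_BITS = 15 and MAX_TR_DYNAMIC_RANGE = 15
+; (both numeric values are assumptions here, not stated in this patch):
+;   transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize
+;   scaleBits      = SCALE_BITS - 2 * transformShift
+; For X265_DEPTH = 8 and log2TrSize = 2..5 (4x4 .. 32x32 blocks):
+;   transformShift = 5, 4, 3, 2   ->   scaleBits = 5, 7, 9, 11    (tab_nonpsyRdo8)
+; For X265_DEPTH = 10:                 scaleBits = 9, 11, 13, 15   (tab_nonpsyRdo10)
+; For X265_DEPTH = 12:                 scaleBits = 13, 15, 17, 19  (tab_nonpsyRdo12)
+; The nonPsyRdoQuant4/8/16/32 kernels below load one qword from these tables,
+; indexed by block size, and use it as the vpsllq shift count for signCoef * signCoef.
+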
+;--------------------------------------------------------------------------------------------------------------------------------------------------------- +; void nonPsyRdoQuant_c(int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, uint32_t blkPos) +;--------------------------------------------------------------------------------------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal nonPsyRdoQuant4, 5, 5, 8 + mov r4d, r4m + lea r0, [r0 + 2 * r4] + lea r4, [4 * r4] + lea r1, [r1 + 2 * r4] +%if BIT_DEPTH == 12 + mov r4, [tab_nonpsyRdo12] +%elif BIT_DEPTH == 10 + mov r4, [tab_nonpsyRdo10] +%elif BIT_DEPTH == 8 + mov r4, [tab_nonpsyRdo8] +%else + %error Unsupported BIT_DEPTH! + %endif + movq xm3, r4 + movq xm6, [r2] + movq xm7, [r3] + vpxor m4, m4 + vpxor m5, m5 +;Row 1, 2 + movu xm0, [r0] + vpmovsxwq m1, xm0 + vcvtqq2pd m2, m1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2qq m1, m2 + vpsllq m1, xm3 ; costUncoded + paddq m4, m1 + movu [r1], m1 + ;Row 3, 4 + movu xm0, [r0 + 16] + vpmovsxwq m1, xm0 + vcvtqq2pd m2, m1 + vfmadd213pd m2, m2, m5 + vcvtpd2qq m1, m2 + vpsllq m1, xm3 ; costUncoded + paddq m4, m1 + movu [r1 + 64], m1 + vextracti32x8 ym2, m4, 1 + paddq ym4, ym2 + vextracti32x4 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm5 + paddq xm4, xm2 + + paddq xm6, xm4 + paddq xm7, xm4 + + movq [r2], xm6 + movq [r3], xm7 + RET +INIT_ZMM avx512 +cglobal nonPsyRdoQuant8, 5, 5, 8 + mov r4d, r4m + lea r0, [r0 + 2 * r4] + lea r4, [4 * r4] + lea r1, [r1 + 2 * r4] +%if BIT_DEPTH == 12 + mov r4, [tab_nonpsyRdo12 + 8] +%elif BIT_DEPTH == 10 + mov r4, [tab_nonpsyRdo10 + 8] +%elif BIT_DEPTH == 8 + mov r4, [tab_nonpsyRdo8 + 8] +%else + %error Unsupported BIT_DEPTH! + %endif + movq xm3, r4 + movq xm6, [r2] + movq xm7, [r3] + vpxor m4, m4 + vpxor m5, m5 + +;Row 1, 2 + movq xm0, [r0] + pinsrq xm0, [r0 + mmsize/4], 1 + vpmovsxwq m1, xm0 + vcvtqq2pd m2, m1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2qq m1, m2 + vpsllq m1, xm3 ; costUncoded + paddq m4, m1 + movu [r1], ym1 + vextracti32x8 [r1 + mmsize], m1 , 1 + + ;Row 3, 4 + movq xm0, [r0 + mmsize/2] + pinsrq xm0, [r0 + 3 * mmsize/4], 1 + vpmovsxwq m1, xm0 + vcvtqq2pd m2, m1 + vfmadd213pd m2, m2, m5 + vcvtpd2qq m1, m2 + vpsllq m1, xm3 ; costUncoded + paddq m4, m1 + movu [r1 + 2 * mmsize], ym1 + vextracti32x8 [r1 + 3 * mmsize], m1 , 1 + + vextracti32x8 ym2, m4, 1 + paddq ym4, ym2 + vextracti32x4 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm5 + paddq xm4, xm2 + + paddq xm6, xm4 + paddq xm7, xm4 + + movq [r2], xm6 + movq [r3], xm7 + RET +INIT_ZMM avx512 +cglobal nonPsyRdoQuant16, 5, 5, 8 + mov r4d, r4m + lea r0, [r0 + 2 * r4] + lea r4, [4 * r4] + lea r1, [r1 + 2 * r4] +%if BIT_DEPTH == 12 + mov r4, [tab_nonpsyRdo12 + 16] +%elif BIT_DEPTH == 10 + mov r4, [tab_nonpsyRdo10 + 16] +%elif BIT_DEPTH == 8 + mov r4, [tab_nonpsyRdo8 + 16] +%else + %error Unsupported BIT_DEPTH! 
+ %endif + movq xm3, r4 + movq xm6, [r2] + movq xm7, [r3] + vpxor m4, m4 + vpxor m5, m5 + +;Row 1, 2 + movq xm0, [r0] + pinsrq xm0, [r0 + mmsize/2], 1 + vpmovsxwq m1, xm0 + vcvtqq2pd m2, m1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2qq m1, m2 + vpsllq m1, xm3 ; costUncoded + paddq m4, m1 + movu [r1], ym1 + vextracti32x8 [r1 + 2 * mmsize], m1, 1 + + ;Row 3, 4 + movq xm0, [r0 + mmsize] + pinsrq xm0, [r0 + 3 * mmsize/2], 1 + vpmovsxwq m1, xm0 + vcvtqq2pd m2, m1 + vfmadd213pd m2, m2, m5 + vcvtpd2qq m1, m2 + vpsllq m1, xm3 ; costUncoded + paddq m4, m1 + movu [r1 + 4 * mmsize], ym1 + vextracti32x8 [r1 + 6 * mmsize], m1 , 1 + + vextracti32x8 ym2, m4, 1 + paddq ym4, ym2 + vextracti32x4 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm5 + paddq xm4, xm2 + + paddq xm6, xm4 + paddq xm7, xm4 + + movq [r2], xm6 + movq [r3], xm7 + RET +INIT_ZMM avx512 +cglobal nonPsyRdoQuant32, 5, 5, 8 + mov r4d, r4m + lea r0, [r0 + 2 * r4] + lea r4, [4 * r4] + lea r1, [r1 + 2 * r4] +%if BIT_DEPTH == 12 + mov r4, [tab_nonpsyRdo12 + 24] +%elif BIT_DEPTH == 10 + mov r4, [tab_nonpsyRdo10 + 24] +%elif BIT_DEPTH == 8 + mov r4, [tab_nonpsyRdo8 + 24] +%else + %error Unsupported BIT_DEPTH! + %endif + movq xm3, r4 + movq xm6, [r2] + movq xm7, [r3] + vpxor m4, m4 + vpxor m5, m5 + +;Row 1, 2 + movq xm0, [r0] + pinsrq xm0, [r0 + mmsize], 1 + vpmovsxwq m1, xm0 + vcvtqq2pd m2, m1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2qq m1, m2 + vpsllq m1, xm3 ; costUncoded + paddq m4, m1 + movu [r1], ym1 + vextracti32x8 [r1 + 4 * mmsize], m1, 1 + + ;Row 3, 4 + movq xm0, [r0 + 2 * mmsize] + pinsrq xm0, [r0 + 3 * mmsize], 1 + vpmovsxwq m1, xm0 + vcvtqq2pd m2, m1 + vfmadd213pd m2, m2, m5 + vcvtpd2qq m1, m2 + vpsllq m1, xm3 ; costUncoded + paddq m4, m1 + movu [r1 + 8 * mmsize], ym1 + vextracti32x8 [r1 + 12 * mmsize], m1 , 1 + + vextracti32x8 ym2, m4, 1 + paddq ym4, ym2 + vextracti32x4 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm5 + paddq xm4, xm2 + + paddq xm6, xm4 + paddq xm7, xm4 + + movq [r2], xm6 + movq [r3], xm7 + RET +;static void psyRdoQuant_c(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t psyScale, uint32_t blkPos) +;{ +; const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize; /* Represents scaling through forward transform */ +; const int scaleBits = SCALE_BITS - 2 * transformShift; +; const uint32_t trSize = 1 << log2TrSize; +; int max = X265_MAX(0, (2 * transformShift + 1)); +; +; for (int y = 0; y < MLS_CG_SIZE; y++) +; { +; for (int x = 0; x < MLS_CG_SIZE; x++) +; { +; int64_t signCoef = m_resiDctCoeff[blkPos + x]; /* pre-quantization DCT coeff */ +; int64_t predictedCoef = m_fencDctCoeff[blkPos + x] - signCoef; /* predicted DCT = source DCT - residual DCT*/ +; +; costUncoded[blkPos + x] = static_cast<int64_t>((double)(signCoef * signCoef) << scaleBits); +; +; /* when no residual coefficient is coded, predicted coef == recon coef */ +; costUncoded[blkPos + x] -= static_cast<int64_t>((psyScale * (predictedCoef)) >> max); +; +; *totalUncodedCost += costUncoded[blkPos + x]; +; *totalRdCost += costUncoded[blkPos + x]; +; } +; blkPos += trSize; +; } +;} + 
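+; A hedged worked example for the psy-RDO shift constants, under the same assumption
+; MAX_TR_DYNAMIC_RANGE = 15 (not stated in this patch):
+;   max = X265_MAX(0, 2 * transformShift + 1)
+; For X265_DEPTH = 8,  log2TrSize = 2..5: transformShift = 5, 4, 3, 2 -> max = 11, 9, 7, 5
+; For X265_DEPTH = 10: transformShift = 3, 2, 1, 0                    -> max = 7, 5, 3, 1
+; For X265_DEPTH = 12: transformShift = 1, 0, -1, -2 (clamped to 0)   -> max = 3, 1, 0, 0
+; These match the RDO_MAX_4/8/16/32 defines selected per BIT_DEPTH earlier in this
+; patch; the psyRdoQuant4/8/16/32 kernels below use them as the vpsraq shift count
+; for the (psyScale * predictedCoef) term before subtracting it from costUncoded.
+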
+;--------------------------------------------------------------------------------------------------------------------------------------------------------- +; void psyRdoQuant(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos) +;--------------------------------------------------------------------------------------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal psyRdoQuant4, 5, 9, 13 +%if WIN64 + mov r5, r5m +%endif + mov r6d, r6m + vpbroadcastq m12, [r5] ; psyScale + lea r0, [r0 + 2 * r6] + lea r1, [r1 + 2 * r6] + lea r6, [4 * r6] + lea r2, [r2 + 2 * r6] + movq xm0, [r3] + movq xm1, [r4] + +%if BIT_DEPTH == 12 + mov r5, [tab_nonpsyRdo12] ; scaleBits +%elif BIT_DEPTH == 10 + mov r5, [tab_nonpsyRdo10] +%elif BIT_DEPTH == 8 + mov r5, [tab_nonpsyRdo8] +%else + %error Unsupported BIT_DEPTH! +%endif + + movq xm2, r5 + vpxor m4, m4 + vpxor m3, m3 + +;Row 1, 2 + vpmovsxwq m6, [r0] + vpmovsxwq m7, [r1] + psubq m7, m6 ; predictedCoef + + vcvtqq2pd m9, m6 + vfmadd213pd m9, m9, m3 + vcvtpd2qq m8, m9 + vpsllq m8, xm2 ;(signCoef * signCoef) << scaleBits + + vcvtqq2pd m10, m7 + vcvtqq2pd m11, m12 + vfmadd213pd m10, m11, m3 + vcvtpd2qq m9, m10 + vpsraq m9, RDO_MAX_4 ;(psyScale * predictedCoef) >> max + + psubq m8, m9 + paddq m4, m8 + movu [r2], m8 + + ;Row 3, 4 + vpmovsxwq m6, [r0 + 16] + vpmovsxwq m7, [r1 + 16] + psubq m7, m6 ; predictedCoef + + vcvtqq2pd m9, m6 + vfmadd213pd m9, m9, m3 + vcvtpd2qq m8, m9 + vpsllq m8, xm2 ;(signCoef * signCoef) << scaleBits + + vcvtqq2pd m10, m7 + vcvtqq2pd m11, m12 + vfmadd213pd m10, m11, m3 + vcvtpd2qq m9, m10 + vpsraq m9, RDO_MAX_4 ;(psyScale * predictedCoef) >> max + + psubq m8, m9 + paddq m4, m8 + movu [r2 + 64], m8 + + vextracti32x8 ym2, m4, 1 + paddq ym4, ym2 + vextracti32x4 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm3 + paddq xm4, xm2 + + paddq xm0, xm4 + paddq xm1, xm4 + + movq [r3], xm0 + movq [r4], xm1 + RET + +;--------------------------------------------------------------------------------------------------------------------------------------------------------- +; void psyRdoQuant(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos) +;--------------------------------------------------------------------------------------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal psyRdoQuant8, 5, 9, 15 +%if WIN64 + mov r5, r5m +%endif + mov r6d, r6m + vpbroadcastq m12, [r5] ; psyScale + lea r0, [r0 + 2 * r6] + lea r1, [r1 + 2 * r6] + lea r6, [4 * r6] + lea r2, [r2 + 2 * r6] + movq xm0, [r3] + movq xm1, [r4] + +%if BIT_DEPTH == 12 + mov r5, [tab_nonpsyRdo12 + 8] ; scaleBits +%elif BIT_DEPTH == 10 + mov r5, [tab_nonpsyRdo10 + 8] +%elif BIT_DEPTH == 8 + mov r5, [tab_nonpsyRdo8 + 8] +%else + %error Unsupported BIT_DEPTH! 
+%endif + + movq xm2, r5 + vpxor m4, m4 + vpxor m3, m3 + +;Row 1, 2 + movq xm13, [r0] + movq xm14, [r1] + pinsrq xm13, [r0 + mmsize/4], 1 + pinsrq xm14, [r1 + mmsize/4], 1 + vpmovsxwq m6, xm13 + vpmovsxwq m7, xm14 + psubq m7, m6 ; predictedCoef + + vcvtqq2pd m9, m6 + vfmadd213pd m9, m9, m3 + vcvtpd2qq m8, m9 + vpsllq m8, xm2 ;(signCoef * signCoef) << scaleBits + + vcvtqq2pd m10, m7 + vcvtqq2pd m11, m12 + vfmadd213pd m10, m11, m3 + vcvtpd2qq m9, m10 + vpsraq m9, RDO_MAX_8 ;(psyScale * predictedCoef) >> max + + psubq m8, m9 + paddq m4, m8 + movu [r2], ym8 + vextracti32x8 [r2 + mmsize], m8 , 1 + + ;Row 3, 4 + movq xm13, [r0 + mmsize/2] + movq xm14, [r1 + mmsize/2] + pinsrq xm13, [r0 + 3 * mmsize/4], 1 + pinsrq xm14, [r1 + 3 * mmsize/4], 1 + vpmovsxwq m6, xm13 + vpmovsxwq m7, xm14 + psubq m7, m6 ; predictedCoef + + vcvtqq2pd m9, m6 + vfmadd213pd m9, m9, m3 + vcvtpd2qq m8, m9 + vpsllq m8, xm2 ;(signCoef * signCoef) << scaleBits + + vcvtqq2pd m10, m7 + vcvtqq2pd m11, m12 + vfmadd213pd m10, m11, m3 + vcvtpd2qq m9, m10 + vpsraq m9, RDO_MAX_8 ;(psyScale * predictedCoef) >> max + + psubq m8, m9 + paddq m4, m8 + movu [r2 + 2 * mmsize], ym8 + vextracti32x8 [r2 + 3 * mmsize], m8 , 1 + + vextracti32x8 ym2, m4, 1 + paddq ym4, ym2 + vextracti32x4 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm3 + paddq xm4, xm2 + + paddq xm0, xm4 + paddq xm1, xm4 + + movq [r3], xm0 + movq [r4], xm1 + RET + +;--------------------------------------------------------------------------------------------------------------------------------------------------------- +; void psyRdoQuant(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos) +;--------------------------------------------------------------------------------------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal psyRdoQuant16, 5, 9, 15 +%if WIN64 + mov r5, r5m +%endif + mov r6d, r6m + vpbroadcastq m12, [r5] ; psyScale + lea r0, [r0 + 2 * r6] + lea r1, [r1 + 2 * r6] + lea r6, [4 * r6] + lea r2, [r2 + 2 * r6] + movq xm0, [r3] + movq xm1, [r4] + +%if BIT_DEPTH == 12 + mov r5, [tab_nonpsyRdo12 + 16] ; scaleBits +%elif BIT_DEPTH == 10 + mov r5, [tab_nonpsyRdo10 + 16] +%elif BIT_DEPTH == 8 + mov r5, [tab_nonpsyRdo8 + 16] +%else + %error Unsupported BIT_DEPTH! 
+%endif + + movq xm2, r5 + vpxor m4, m4 + vpxor m3, m3 + +;Row 1, 2 + movq xm13, [r0] + movq xm14, [r1] + pinsrq xm13, [r0 + mmsize/2], 1 + pinsrq xm14, [r1 + mmsize/2], 1 + vpmovsxwq m6, xm13 + vpmovsxwq m7, xm14 + psubq m7, m6 ; predictedCoef + + vcvtqq2pd m9, m6 + vfmadd213pd m9, m9, m3 + vcvtpd2qq m8, m9 + vpsllq m8, xm2 ;(signCoef * signCoef) << scaleBits + + vcvtqq2pd m10, m7 + vcvtqq2pd m11, m12 + vfmadd213pd m10, m11, m3 + vcvtpd2qq m9, m10 + vpsraq m9, RDO_MAX_16 ;(psyScale * predictedCoef) >> max + + psubq m8, m9 + paddq m4, m8 + movu [r2], ym8 + vextracti32x8 [r2 + 2 * mmsize], m8 , 1 + + ;Row 3, 4 + movq xm13, [r0 + mmsize] + movq xm14, [r1 + mmsize] + pinsrq xm13, [r0 + 3 * mmsize/2], 1 + pinsrq xm14, [r1 + 3 * mmsize/2], 1 + vpmovsxwq m6, xm13 + vpmovsxwq m7, xm14 + psubq m7, m6 ; predictedCoef + + vcvtqq2pd m9, m6 + vfmadd213pd m9, m9, m3 + vcvtpd2qq m8, m9 + vpsllq m8, xm2 ;(signCoef * signCoef) << scaleBits + + vcvtqq2pd m10, m7 + vcvtqq2pd m11, m12 + vfmadd213pd m10, m11, m3 + vcvtpd2qq m9, m10 + vpsraq m9, RDO_MAX_16 ;(psyScale * predictedCoef) >> max + + psubq m8, m9 + paddq m4, m8 + movu [r2 + 4 * mmsize], ym8 + vextracti32x8 [r2 + 6 * mmsize], m8 , 1 + + vextracti32x8 ym2, m4, 1 + paddq ym4, ym2 + vextracti32x4 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm3 + paddq xm4, xm2 + + paddq xm0, xm4 + paddq xm1, xm4 + + movq [r3], xm0 + movq [r4], xm1 + RET + +;--------------------------------------------------------------------------------------------------------------------------------------------------------- +; void psyRdoQuant(int16_t *m_resiDctCoeff, int16_t *m_fencDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, int64_t *psyScale, uint32_t blkPos) +;--------------------------------------------------------------------------------------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal psyRdoQuant32, 5, 9, 15 +%if WIN64 + mov r5, r5m +%endif + mov r6d, r6m + vpbroadcastq m12, [r5] ; psyScale + lea r0, [r0 + 2 * r6] + lea r1, [r1 + 2 * r6] + lea r6, [4 * r6] + lea r2, [r2 + 2 * r6] + movq xm0, [r3] + movq xm1, [r4] + +%if BIT_DEPTH == 12 + mov r5, [tab_nonpsyRdo12 + 24] ; scaleBits +%elif BIT_DEPTH == 10 + mov r5, [tab_nonpsyRdo10 + 24] +%elif BIT_DEPTH == 8 + mov r5, [tab_nonpsyRdo8 + 24] +%else + %error Unsupported BIT_DEPTH! 
+%endif + + movq xm2, r5 + vpxor m4, m4 + vpxor m3, m3 + +;Row 1, 2 + movq xm13, [r0] + movq xm14, [r1] + pinsrq xm13, [r0 + mmsize], 1 + pinsrq xm14, [r1 + mmsize], 1 + vpmovsxwq m6, xm13 + vpmovsxwq m7, xm14 + psubq m7, m6 ; predictedCoef + + vcvtqq2pd m9, m6 + vfmadd213pd m9, m9, m3 + vcvtpd2qq m8, m9 + vpsllq m8, xm2 ;(signCoef * signCoef) << scaleBits + + vcvtqq2pd m10, m7 + vcvtqq2pd m11, m12 + vfmadd213pd m10, m11, m3 + vcvtpd2qq m9, m10 + vpsraq m9, RDO_MAX_32 ;(psyScale * predictedCoef) >> max + + psubq m8, m9 + paddq m4, m8 + movu [r2], ym8 + vextracti32x8 [r2 + 4 * mmsize], m8 , 1 + + ;Row 3, 4 + movq xm13, [r0 + 2 * mmsize] + movq xm14, [r1 + 2 * mmsize] + pinsrq xm13, [r0 + 3 * mmsize], 1 + pinsrq xm14, [r1 + 3 * mmsize], 1 + vpmovsxwq m6, xm13 + vpmovsxwq m7, xm14 + psubq m7, m6 ; predictedCoef + + vcvtqq2pd m9, m6 + vfmadd213pd m9, m9, m3 + vcvtpd2qq m8, m9 + vpsllq m8, xm2 ;(signCoef * signCoef) << scaleBits + + vcvtqq2pd m10, m7 + vcvtqq2pd m11, m12 + vfmadd213pd m10, m11, m3 + vcvtpd2qq m9, m10 + vpsraq m9, RDO_MAX_32 ;(psyScale * predictedCoef) >> max + + psubq m8, m9 + paddq m4, m8 + movu [r2 + 8 * mmsize], ym8 + vextracti32x8 [r2 + 12 * mmsize], m8 , 1 + + vextracti32x8 ym2, m4, 1 + paddq ym4, ym2 + vextracti32x4 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm3 + paddq xm4, xm2 + + paddq xm0, xm4 + paddq xm1, xm4 + + movq [r3], xm0 + movq [r4], xm1 + RET + +INIT_YMM avx2 +cglobal nonPsyRdoQuant4, 5, 9, 16 + mov r4d, r4m + lea r0, [r0 + 2 * r4] + lea r4, [4 * r4] + lea r1, [r1 + 2 * r4] + movq xm0, [r2] + movq xm1, [r3] + +%if BIT_DEPTH == 12 + mov r5, [tab_nonpsyRdo12] ; scaleBits +%elif BIT_DEPTH == 10 + mov r5, [tab_nonpsyRdo10] +%elif BIT_DEPTH == 8 + mov r5, [tab_nonpsyRdo8] +%else + %error Unsupported BIT_DEPTH! +%endif + movq xm2, r5 + vpxor m4, m4 + vpxor m3, m3 + vpxor m13, m13 + + vpmovsxwd m6, [r0] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1], m13 + + vpmovsxwd m6, [r0 + 8] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 32], m13 + + vpmovsxwd m6, [r0 + 16] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 64], m13 + + vpmovsxwd m6, [r0 +24] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 96], m13 + + + vextracti128 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm3 + paddq xm4, xm2 + + paddq xm0, xm4 + paddq xm1, xm4 + + movq [r2], xm0 + movq [r3], xm1 + RET + + + +INIT_YMM avx2 +cglobal nonPsyRdoQuant8, 5, 5, 8 + mov r4d, r4m + lea r0, [r0 + 2 * r4] + lea r4, [4 * r4] + lea r1, [r1 + 2 * r4] +%if BIT_DEPTH == 12 + mov r4, [tab_nonpsyRdo12 + 8] +%elif BIT_DEPTH == 10 + mov r4, [tab_nonpsyRdo10 + 8] +%elif BIT_DEPTH == 8 + mov r4, [tab_nonpsyRdo8 + 8] +%else + %error Unsupported BIT_DEPTH! 
+ %endif + movq xm3, r4 + movq xm6, [r2] + movq xm7, [r3] + vpxor m4, m4 + vpxor m5, m5 + movq xm0, [r0] + vpmovsxwd m1, xm0 + vcvtdq2pd m2, xm1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2dq xm1, m2 + vpmovsxdq m0 , xm1 + vpsllq m0, xm3 ; costUncoded + paddq m4, m0 + movu [r1], ym0 + vpxor m0, m0 + movq xm0, [r0 +mmsize/2] + vpmovsxwd m1, xm0 + vcvtdq2pd m2, xm1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2dq xm1, m2 + vpmovsxdq m0 , xm1 + vpsllq m0, xm3 ; costUncoded + paddq m4, m0 + movu [r1 +2*mmsize], m0 + vpxor m0, m0 + movq xm0, [r0 +mmsize] + vpmovsxwd m1, xm0 + vcvtdq2pd m2, xm1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2dq xm1, m2 + vpmovsxdq m0 , xm1 + vpsllq m0, xm3 ; costUncoded + paddq m4, m0 + movu [r1 +4*mmsize], m0 + vpxor m0, m0 + movq xm0, [r0 +3*mmsize/2] + vpmovsxwd m1, xm0 + vcvtdq2pd m2, xm1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2dq xm1, m2 + vpmovsxdq m0 , xm1 + vpsllq m0, xm3 ; costUncoded + paddq m4, m0 + movu [r1 +6*mmsize], m0 + + vextracti128 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm5 + paddq xm4, xm2 + + paddq xm6, xm4 + paddq xm7, xm4 + + movq [r2], xm6 + movq [r3], xm7 + RET +INIT_YMM avx2 +cglobal nonPsyRdoQuant16, 5, 5, 8 + mov r4d, r4m + lea r0, [r0 + 2 * r4] + lea r4, [4 * r4] + lea r1, [r1 + 2 * r4] +%if BIT_DEPTH == 12 + mov r4, [tab_nonpsyRdo12 + 16] +%elif BIT_DEPTH == 10 + mov r4, [tab_nonpsyRdo10 + 16] +%elif BIT_DEPTH == 8 + mov r4, [tab_nonpsyRdo8 + 16] +%else + %error Unsupported BIT_DEPTH! 
+ %endif + movq xm3, r4 + movq xm6, [r2] + movq xm7, [r3] + vpxor m4, m4 + vpxor m5, m5 + +;Row 1, 2 + movq xm0, [r0] + vpmovsxwd m1, xm0 + vcvtdq2pd m2, xm1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2dq xm1, m2 + vpmovsxdq m0 , xm1 + vpsllq m0, xm3 ; costUncoded + paddq m4, m0 + movu [r1], ym0 + + movq xm0, [r0 +mmsize] + vpmovsxwd m1, xm0 + vcvtdq2pd m2, xm1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2dq xm1, m2 + vpmovsxdq m0 , xm1 + vpsllq m0, xm3 ; costUncoded + paddq m4, m0 + movu [r1+4*mmsize], ym0 + + movq xm0, [r0 + 2*mmsize] + vpmovsxwd m1, xm0 + vcvtdq2pd m2, xm1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2dq xm1, m2 + vpmovsxdq m0 , xm1 + vpsllq m0, xm3 ; costUncoded + paddq m4, m0 + movu [r1+8*mmsize], ym0 + + movq xm0, [r0 + 3*mmsize] + vpmovsxwd m1, xm0 + vcvtdq2pd m2, xm1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2dq xm1, m2 + vpmovsxdq m0 , xm1 + vpsllq m0, xm3 ; costUncoded + paddq m4, m0 + movu [r1+12*mmsize], ym0 + + + vextracti128 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm5 + paddq xm4, xm2 + + paddq xm6, xm4 + paddq xm7, xm4 + + movq [r2], xm6 + movq [r3], xm7 + RET +INIT_YMM avx2 +cglobal nonPsyRdoQuant32, 5, 5, 8 + mov r4d, r4m + lea r0, [r0 + 2 * r4] + lea r4, [4 * r4] + lea r1, [r1 + 2 * r4] +%if BIT_DEPTH == 12 + mov r4, [tab_nonpsyRdo12 + 24] +%elif BIT_DEPTH == 10 + mov r4, [tab_nonpsyRdo10 + 24] +%elif BIT_DEPTH == 8 + mov r4, [tab_nonpsyRdo8 + 24] +%else + %error Unsupported BIT_DEPTH! 
+ %endif + movq xm3, r4 + movq xm6, [r2] + movq xm7, [r3] + vpxor m4, m4 + vpxor m5, m5 + + movq xm0, [r0] + vpmovsxwd m1, xm0 + vcvtdq2pd m2, xm1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2dq xm1, m2 + vpmovsxdq m0 , xm1 + vpsllq m0, xm3 ; costUncoded + paddq m4, m0 + movu [r1], m0 + vpxor m0, m0 + + movq xm0, [r0 +2*mmsize] + vpmovsxwd m1, xm0 + vcvtdq2pd m2, xm1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2dq xm1, m2 + vpmovsxdq m0 , xm1 + vpsllq m0, xm3 ; costUncoded + paddq m4, m0 + movu [r1 + 8*mmsize], m0 + vpxor m0, m0 + + movq xm0, [r0 +4*mmsize] + vpmovsxwd m1, xm0 + vcvtdq2pd m2, xm1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2dq xm1, m2 + vpmovsxdq m0 , xm1 + vpsllq m0, xm3 ; costUncoded + paddq m4, m0 + movu [r1 +16*mmsize], m0 + vpxor m0, m0 + + movq xm0, [r0 +6*mmsize] + vpmovsxwd m1, xm0 + vcvtdq2pd m2, xm1 ; Convert packed 64-bit integers to packed double-precision (64-bit) floating-point elements + vfmadd213pd m2, m2, m5 ; Multiply packed double-precision (64-bit) floating-point elements + vcvtpd2dq xm1, m2 + vpmovsxdq m0 , xm1 + vpsllq m0, xm3 ; costUncoded + paddq m4, m0 + movu [r1 +24*mmsize], m0 + + vextracti128 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm5 + paddq xm4, xm2 + + paddq xm6, xm4 + paddq xm7, xm4 + + movq [r2], xm6 + movq [r3], xm7 + RET + +INIT_YMM avx2 +cglobal psyRdoQuant_1p4, 5, 9, 16 + mov r4d, r4m + lea r0, [r0 + 2 * r4] + lea r4, [4 * r4] + lea r1, [r1 + 2 * r4] + movq xm0, [r2] + movq xm1, [r3] + +%if BIT_DEPTH == 12 + mov r5, [tab_nonpsyRdo12] ; scaleBits +%elif BIT_DEPTH == 10 + mov r5, [tab_nonpsyRdo10] +%elif BIT_DEPTH == 8 + mov r5, [tab_nonpsyRdo8] +%else + %error Unsupported BIT_DEPTH! 
+%endif + movq xm2, r5 + vpxor m4, m4 + vpxor m3, m3 + vpxor m13, m13 + + vpmovsxwd m6, [r0] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1], m13 + + vpmovsxwd m6, [r0 + 8] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 32], m13 + + vpmovsxwd m6, [r0 + 16] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 64], m13 + + vpmovsxwd m6, [r0 +24] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 96], m13 + + + vextracti128 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm3 + paddq xm4, xm2 + + paddq xm0, xm4 + paddq xm1, xm4 + + movq [r2], xm0 + movq [r3], xm1 + RET +INIT_YMM avx2 +cglobal psyRdoQuant_1p8, 7, 9, 16 + mov r4d, r4m + lea r0, [r0 + 2 * r4] + lea r4, [4 * r4] + lea r1, [r1 + 2 * r4] + movq xm0, [r2] + movq xm1, [r3] +%if BIT_DEPTH == 12 + mov r5, [tab_nonpsyRdo12 +8] ; scaleBits +%elif BIT_DEPTH == 10 + mov r5, [tab_nonpsyRdo10 +8] +%elif BIT_DEPTH == 8 + mov r5, [tab_nonpsyRdo8 + 8 ] +%else + %error Unsupported BIT_DEPTH! +%endif + movq xm2, r5 + vpxor m4, m4 + vpxor m3, m3 + vpxor m13, m13 + + + vpmovsxwd m6, [r0] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1], m13 + + vpmovsxwd m6, [r0 + 16] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 64], m13 + + vpmovsxwd m6, [r0 +32] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 +128], m13 + + vpmovsxwd m6, [r0 + 48] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 192], m13 + + vextracti128 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm3 + paddq xm4, xm2 + + paddq xm0, xm4 + paddq xm1, xm4 + + movq [r2], xm0 + movq [r3], xm1 + RET + +INIT_YMM avx2 +cglobal psyRdoQuant_1p16, 7, 9, 16 + mov r4d, r4m + lea r0, [r0 + 2 * r4] + lea r4, [4 * r4] + lea r1, [r1 + 2 * r4] + movq xm0, [r2] + movq xm1, [r3] +%if BIT_DEPTH == 12 + mov r5, [tab_nonpsyRdo12 + 16] ; scaleBits +%elif BIT_DEPTH == 10 + mov r5, [tab_nonpsyRdo10 + 16] +%elif BIT_DEPTH == 8 + mov r5, [tab_nonpsyRdo8 + 16 ] +%else + %error Unsupported BIT_DEPTH! 
+%endif + movq xm2, r5 + vpxor m4, m4 + vpxor m3, m3 + vpxor m13, m13 + + vpmovsxwd m6, [r0] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1], m13 + + vpmovsxwd m6, [r0 + mmsize] + + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 4*mmsize], m13 + + vpmovsxwd m6, [r0 + 2 * mmsize] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 8*mmsize], m13 + + vpmovsxwd m6, [r0 + 3 * mmsize] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 12*mmsize], m13 + + vextracti128 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm3 + paddq xm4, xm2 + + paddq xm0, xm4 + paddq xm1, xm4 + + movq [r2], xm0 + movq [r3], xm1 + RET + +INIT_YMM avx2 +cglobal psyRdoQuant_1p32, 7, 9, 16 + mov r4d, r4m + lea r0, [r0 + 2 * r4] + lea r4, [4 * r4] + lea r1, [r1 + 2 * r4] + movq xm0, [r2] + movq xm1, [r3] +%if BIT_DEPTH == 12 + mov r5, [tab_nonpsyRdo12 + 24] ; scaleBits +%elif BIT_DEPTH == 10 + mov r5, [tab_nonpsyRdo10 + 24] +%elif BIT_DEPTH == 8 + mov r5, [tab_nonpsyRdo8 + 24] +%else + %error Unsupported BIT_DEPTH! +%endif + movq xm2, r5 + vpxor m4, m4 + vpxor m3, m3 + vpxor m13, m13 + + + vpmovsxwd m6, [r0] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1], m13 + + vpmovsxwd m6, [r0 + 2 * mmsize] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 8 * mmsize], m13 + + vpmovsxwd m6, [r0 + 4 * mmsize] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 16 * mmsize], m13 + + vpmovsxwd m6, [r0 + 6 * mmsize] + vcvtdq2pd m9, xm6 + vfmadd213pd m9, m9, m3 + vcvtpd2dq xm8, m9 + vpmovsxdq m13, xm8 ; 32 bit int to 64 bit int + vpsllq m13, xm2 ;(signCoef * signCoef) << scaleBits + paddq m4, m13 + movu [r1 + 24 *mmsize], m13 + + vextracti128 xm2, m4, 1 + paddq xm4, xm2 + punpckhqdq xm2, xm4, xm3 + paddq xm4, xm2 + + paddq xm0, xm4 + paddq xm1, xm4 + + movq [r2], xm0 + movq [r3], xm1 + RET + %endif
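The kernels added above cover nonPsyRdoQuant{4,8,16,32} and psyRdoQuant{4,8,16,32}; the commented-out psyRdoQuant_c block embedded in the listing is the scalar reference for the psy path. For orientation, here is a minimal scalar sketch of the non-psy variant, obtained from that comment simply by dropping the psyScale term. Constant names (MAX_TR_DYNAMIC_RANGE, SCALE_BITS, MLS_CG_SIZE, X265_DEPTH) are taken from the comment; this is an illustration, not the literal upstream function.

/* Scalar sketch of the non-psy cost path, derived from the psyRdoQuant_c
 * reference comment above by removing the psyScale correction.  The AVX-512
 * code preloads the equivalent scaleBits shift from the tab_nonpsyRdo{8,10,12}
 * tables, indexed by transform size. */
static void nonPsyRdoQuant_sketch(const int16_t* resiDctCoeff, int64_t* costUncoded,
                                  int64_t* totalUncodedCost, int64_t* totalRdCost,
                                  uint32_t blkPos, int log2TrSize)
{
    const int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize;
    const int scaleBits = SCALE_BITS - 2 * transformShift;
    const uint32_t trSize = 1 << log2TrSize;

    for (int y = 0; y < MLS_CG_SIZE; y++)
    {
        for (int x = 0; x < MLS_CG_SIZE; x++)
        {
            int64_t signCoef = resiDctCoeff[blkPos + x];            /* pre-quantization DCT coeff */
            costUncoded[blkPos + x] = (signCoef * signCoef) << scaleBits; /* cost when the coeff is not coded */
            *totalUncodedCost += costUncoded[blkPos + x];
            *totalRdCost += costUncoded[blkPos + x];
        }
        blkPos += trSize;
    }
}

Note that the vector code squares each coefficient through a double-precision FMA sequence (vcvtqq2pd / vfmadd213pd / vcvtpd2qq) before the shift; for 16-bit DCT coefficients that round trip is exact, so the result matches the integer reference.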
View file
x265_2.7.tar.gz/source/common/x86/dct8.h -> x265_2.9.tar.gz/source/common/x86/dct8.h
Changed
@@ -34,6 +34,11 @@ FUNCDEF_TU_S2(void, idct, ssse3, const int16_t* src, int16_t* dst, intptr_t dstStride); FUNCDEF_TU_S2(void, idct, sse4, const int16_t* src, int16_t* dst, intptr_t dstStride); FUNCDEF_TU_S2(void, idct, avx2, const int16_t* src, int16_t* dst, intptr_t dstStride); +FUNCDEF_TU_S2(void, nonPsyRdoQuant, avx512, int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, uint32_t blkPos); +FUNCDEF_TU_S2(void, psyRdoQuant, avx512, int16_t* m_resiDctCoeff, int16_t* m_fencDctCoeff, int64_t* costUncoded, int64_t* totalUncodedCost, int64_t* totalRdCost, int64_t *psyScale, uint32_t blkPos); +FUNCDEF_TU_S2(void, nonPsyRdoQuant, avx2, int16_t *m_resiDctCoeff, int64_t *costUncoded, int64_t *totalUncodedCost, int64_t *totalRdCost, uint32_t blkPos); +FUNCDEF_TU_S2(void, psyRdoQuant_1p, avx2, int16_t* m_resiDctCoeff, int64_t* costUncoded, int64_t* totalUncodedCost, int64_t* totalRdCost, uint32_t blkPos); +FUNCDEF_TU_S2(void, psyRdoQuant_2p, avx2, int16_t* m_resiDctCoeff, int16_t* m_fencDctCoeff, int64_t* costUncoded, int64_t* totalUncodedCost, int64_t* totalRdCost, int64_t *psyScale, uint32_t blkPos); void PFX(dst4_ssse3)(const int16_t* src, int16_t* dst, intptr_t srcStride); void PFX(dst4_sse2)(const int16_t* src, int16_t* dst, intptr_t srcStride); @@ -42,5 +47,11 @@ void PFX(idst4_avx2)(const int16_t* src, int16_t* dst, intptr_t srcStride); void PFX(denoise_dct_sse4)(int16_t* dct, uint32_t* sum, const uint16_t* offset, int size); void PFX(denoise_dct_avx2)(int16_t* dct, uint32_t* sum, const uint16_t* offset, int size); - +void PFX(denoise_dct_avx512)(int16_t* dct, uint32_t* sum, const uint16_t* offset, int size); +void PFX(dct8_avx512)(const int16_t* src, int16_t* dst, intptr_t srcStride); +void PFX(idct8_avx512)(const int16_t* src, int16_t* dst, intptr_t dstStride); +void PFX(idct16_avx512)(const int16_t* src, int16_t* dst, intptr_t dstStride); +void PFX(idct32_avx512)(const int16_t* src, int16_t* dst, intptr_t dstStride); +void PFX(dct32_avx512)(const int16_t* src, int16_t* dst, intptr_t srcStride); +void PFX(dct16_avx512)(const int16_t* src, int16_t* dst, intptr_t srcStride); #endif // ifndef X265_DCT8_H
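The header change above only declares the new per-transform-size entry points (nonPsyRdoQuant, psyRdoQuant, psyRdoQuant_1p/_2p) for AVX2 and AVX-512. As a hedged sketch of how such entry points are typically bound to function pointers behind a CPU-feature check — the wrapper struct, field names, and setup function below are assumptions for illustration, not code quoted from x265 — it could look roughly like this:

/* Illustrative only: selecting the AVX-512 kernels declared in dct8.h at
 * runtime.  PFX() and the *_avx512 symbol names come from the header/asm;
 * everything else here is a made-up scaffold. */
#include "dct8.h"   /* assumes the x265 build tree */

typedef void (*nonPsyRdoQuant_t)(int16_t* resiDctCoeff, int64_t* costUncoded,
                                 int64_t* totalUncodedCost, int64_t* totalRdCost,
                                 uint32_t blkPos);

struct QuantKernels
{
    nonPsyRdoQuant_t nonPsyRdoQuant[4];   /* index 0..3 -> 4x4, 8x8, 16x16, 32x32 TU */
};

static void selectQuantKernels(QuantKernels& k, bool haveAVX512)
{
    if (haveAVX512)
    {
        k.nonPsyRdoQuant[0] = PFX(nonPsyRdoQuant4_avx512);
        k.nonPsyRdoQuant[1] = PFX(nonPsyRdoQuant8_avx512);
        k.nonPsyRdoQuant[2] = PFX(nonPsyRdoQuant16_avx512);
        k.nonPsyRdoQuant[3] = PFX(nonPsyRdoQuant32_avx512);
    }
}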
View file
x265_2.7.tar.gz/source/common/x86/h-ipfilter16.asm -> x265_2.9.tar.gz/source/common/x86/h-ipfilter16.asm
Changed
@@ -47,7 +47,7 @@ h_pd_524800: times 8 dd 524800 -tab_LumaCoeff: dw 0, 0, 0, 64, 0, 0, 0, 0 +h_tab_LumaCoeff: dw 0, 0, 0, 64, 0, 0, 0, 0 dw -1, 4, -10, 58, 17, -5, 1, 0 dw -1, 4, -11, 40, 40, -11, 4, -1 dw 0, 1, -5, 17, 58, -10, 4, -1 @@ -79,8 +79,13 @@ db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13 const interp8_hpp_shuf_new, db 0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 6, 7, 8, 9 - db 4, 5, 6, 7, 6, 7, 8, 9, 8, 9, 10, 11, 10, 11, 12, 13 - + db 4, 5, 6, 7, 6, 7, 8, 9, 8, 9, 10, 11, 10, 11, 12, 13 + +ALIGN 64 +interp8_hpp_shuf1_load_avx512: times 4 db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 +interp8_hpp_shuf2_load_avx512: times 4 db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13 +interp8_hpp_shuf1_store_avx512: times 4 db 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15 + SECTION .text cextern pd_8 cextern pd_32 @@ -207,10 +212,10 @@ add r3d, r3d %ifdef PIC - lea r6, [tab_LumaCoeff] + lea r6, [h_tab_LumaCoeff] mova m0, [r6 + r4] %else - mova m0, [tab_LumaCoeff + r4] + mova m0, [h_tab_LumaCoeff + r4] %endif %ifidn %3, pp @@ -285,7 +290,8 @@ ;------------------------------------------------------------------------------------------------------------ ; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx ;------------------------------------------------------------------------------------------------------------ - FILTER_HOR_LUMA_sse2 4, 4, pp +%if ARCH_X86_64 + FILTER_HOR_LUMA_sse2 4, 4, pp FILTER_HOR_LUMA_sse2 4, 8, pp FILTER_HOR_LUMA_sse2 4, 16, pp FILTER_HOR_LUMA_sse2 8, 4, pp @@ -339,6 +345,7 @@ FILTER_HOR_LUMA_sse2 64, 32, ps FILTER_HOR_LUMA_sse2 64, 48, ps FILTER_HOR_LUMA_sse2 64, 64, ps +%endif ;----------------------------------------------------------------------------- ; void interp_4tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -625,10 +632,10 @@ add r3, r3 %ifdef PIC - lea r6, [tab_LumaCoeff] + lea r6, [h_tab_LumaCoeff] mova m0, [r6 + r4] %else - mova m0, [tab_LumaCoeff + r4] + mova m0, [h_tab_LumaCoeff + r4] %endif %ifidn %3, pp @@ -712,10 +719,10 @@ shl r4d, 4 %ifdef PIC - lea r6, [tab_LumaCoeff] + lea r6, [h_tab_LumaCoeff] mova m0, [r6 + r4] %else - mova m0, [tab_LumaCoeff + r4] + mova m0, [h_tab_LumaCoeff + r4] %endif %ifidn %3, pp @@ -815,10 +822,10 @@ shl r4d, 4 %ifdef PIC - lea r6, [tab_LumaCoeff] + lea r6, [h_tab_LumaCoeff] mova m0, [r6 + r4] %else - mova m0, [tab_LumaCoeff + r4] + mova m0, [h_tab_LumaCoeff + r4] %endif %ifidn %3, pp mova m1, [INTERP_OFFSET_PP] @@ -936,10 +943,10 @@ shl r4d, 4 %ifdef PIC - lea r6, [tab_LumaCoeff] + lea r6, [h_tab_LumaCoeff] mova m0, [r6 + r4] %else - mova m0, [tab_LumaCoeff + r4] + mova m0, [h_tab_LumaCoeff + r4] %endif %ifidn %3, pp @@ -1132,10 +1139,10 @@ shl r4d, 4 %ifdef PIC - lea r6, [tab_LumaCoeff] + lea r6, [h_tab_LumaCoeff] mova m0, [r6 + r4] %else - mova m0, [tab_LumaCoeff + r4] + mova m0, [h_tab_LumaCoeff + r4] %endif %ifidn %3, pp mova m1, [pd_32] @@ -1307,12 +1314,12 @@ mov r4d, r4m shl r4d, 4 %ifdef PIC - lea r5, [tab_LumaCoeff] + lea r5, [h_tab_LumaCoeff] vpbroadcastq m0, [r5 + r4] vpbroadcastq m1, [r5 + r4 + 8] %else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] + vpbroadcastq m0, [h_tab_LumaCoeff + r4] + vpbroadcastq m1, [h_tab_LumaCoeff + r4 + 8] %endif lea r6, [pw_pixel_max] mova m3, [interp8_hpp_shuf] @@ -1376,302 +1383,352 @@ ;------------------------------------------------------------------------------------------------------------- ; void 
interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx ;------------------------------------------------------------------------------------------------------------- -%macro FILTER_HOR_LUMA_W8 1 +%macro PROCESS_IPFILTER_LUMA_PP_8x2_AVX2 0 + movu xm7, [r0] + movu xm8, [r0 + 8] + vinserti128 m7, m7, [r0 + r1], 1 + vinserti128 m8, m8, [r0 + r1 + 8], 1 + pshufb m10, m7, m14 + pshufb m7, m13 + pshufb m11, m8, m14 + pshufb m8, m13 + + pmaddwd m7, m0 + pmaddwd m10, m1 + paddd m7, m10 + pmaddwd m10, m11, m3 + pmaddwd m9, m8, m2 + paddd m10, m9 + paddd m7, m10 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PP + + movu xm9, [r0 + 16] + vinserti128 m9, m9, [r0 + r1 + 16], 1 + pshufb m10, m9, m14 + pshufb m9, m13 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pmaddwd m10, m3 + pmaddwd m9, m2 + paddd m9, m10 + paddd m8, m9 + paddd m8, m4 + psrad m8, INTERP_SHIFT_PP + + packusdw m7, m8 + pshufb m7, m12 + CLIPW m7, m5, m6 + movu [r2], xm7 + vextracti128 [r2 + r3], m7, 1 +%endmacro + +%macro IPFILTER_LUMA_AVX2_8xN 1 INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_8x%1, 4,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 6 - mov r4d, r4m - shl r4d, 4 +cglobal interp_8tap_horiz_pp_8x%1, 5,6,15 + shl r1d, 1 + shl r3d, 1 + sub r0, 6 + mov r4d, r4m + shl r4d, 4 + %ifdef PIC - lea r5, [tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4] - vpbroadcastq m1, [r5 + r4 + 8] + lea r5, [h_tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4] + vpbroadcastd m1, [r5 + r4 + 4] + vpbroadcastd m2, [r5 + r4 + 8] + vpbroadcastd m3, [r5 + r4 + 12] %else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [h_ab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - mova m7, [pd_32] - pxor m2, m2 - - ; register map - ; m0 , m1 interpolate coeff - - mov r4d, %1/2 - -.loop: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 8] - vbroadcasti128 m6, [r0 + 16] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2], xm4 - - vbroadcasti128 m4, [r0 + r1] - vbroadcasti128 m5, [r0 + r1 + 8] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + r1 + 8] - vbroadcasti128 m6, [r0 + r1 + 16] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + r3], xm4 - - lea r2, [r2 + 2 * r3] - lea r0, [r0 + 2 * r1] - dec r4d - jnz .loop + vpbroadcastd m0, [h_tab_LumaCoeff + r4] + vpbroadcastd m1, [h_tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [h_tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [h_tab_LumaCoeff + r4 + 12] +%endif + mova m13, [interp8_hpp_shuf1_load_avx512] + mova m14, [interp8_hpp_shuf2_load_avx512] + mova m12, [interp8_hpp_shuf1_store_avx512] + mova m4, [pd_32] + pxor m5, m5 + mova m6, [pw_pixel_max] + +%rep %1/2 - 1 + PROCESS_IPFILTER_LUMA_PP_8x2_AVX2 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_8x2_AVX2 RET %endmacro -FILTER_HOR_LUMA_W8 4 -FILTER_HOR_LUMA_W8 8 -FILTER_HOR_LUMA_W8 16 -FILTER_HOR_LUMA_W8 32 + +%if ARCH_X86_64 + IPFILTER_LUMA_AVX2_8xN 4 + IPFILTER_LUMA_AVX2_8xN 8 + IPFILTER_LUMA_AVX2_8xN 16 + 
IPFILTER_LUMA_AVX2_8xN 32 +%endif ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx ;------------------------------------------------------------------------------------------------------------- -%macro FILTER_HOR_LUMA_W16 1 +%macro PROCESS_IPFILTER_LUMA_PP_16x1_AVX2 0 + movu m7, [r0] + movu m8, [r0 + 8] + + pshufb m10, m7, m14 + pshufb m7, m13 + pshufb m11, m8, m14 + pshufb m8, m13 + + pmaddwd m7, m0 + pmaddwd m10, m1 + paddd m7, m10 + pmaddwd m10, m11, m3 + pmaddwd m9, m8, m2 + paddd m10, m9 + paddd m7, m10 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PP + + movu m9, [r0 + 16] + pshufb m10, m9, m14 + pshufb m9, m13 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pmaddwd m10, m3 + pmaddwd m9, m2 + paddd m9, m10 + paddd m8, m9 + paddd m8, m4 + psrad m8, INTERP_SHIFT_PP + + packusdw m7, m8 + pshufb m7, m12 + CLIPW m7, m5, m6 + movu [r2], m7 +%endmacro + +%macro IPFILTER_LUMA_AVX2_16xN 1 INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_16x%1, 4,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 6 - mov r4d, r4m - shl r4d, 4 +cglobal interp_8tap_horiz_pp_16x%1, 5,6,15 + shl r1d, 1 + shl r3d, 1 + sub r0, 6 + mov r4d, r4m + shl r4d, 4 + %ifdef PIC - lea r5, [tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4] - vpbroadcastq m1, [r5 + r4 + 8] + lea r5, [h_tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4] + vpbroadcastd m1, [r5 + r4 + 4] + vpbroadcastd m2, [r5 + r4 + 8] + vpbroadcastd m3, [r5 + r4 + 12] %else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - mova m7, [pd_32] - pxor m2, m2 - - ; register map - ; m0 , m1 interpolate coeff - - mov r4d, %1 - -.loop: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 8] - vbroadcasti128 m6, [r0 + 16] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2], xm4 - - vbroadcasti128 m4, [r0 + 16] - vbroadcasti128 m5, [r0 + 24] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 24] - vbroadcasti128 m6, [r0 + 32] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + 16], xm4 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop + vpbroadcastd m0, [h_tab_LumaCoeff + r4] + vpbroadcastd m1, [h_tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [h_tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [h_tab_LumaCoeff + r4 + 12] +%endif + mova m13, [interp8_hpp_shuf1_load_avx512] + mova m14, [interp8_hpp_shuf2_load_avx512] + mova m12, [interp8_hpp_shuf1_store_avx512] + mova m4, [pd_32] + pxor m5, m5 + mova m6, [pw_pixel_max] + +%rep %1 - 1 + PROCESS_IPFILTER_LUMA_PP_16x1_AVX2 + lea r0, [r0 + r1] + lea r2, [r2 + r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_16x1_AVX2 RET %endmacro -FILTER_HOR_LUMA_W16 4 -FILTER_HOR_LUMA_W16 8 -FILTER_HOR_LUMA_W16 12 -FILTER_HOR_LUMA_W16 16 -FILTER_HOR_LUMA_W16 32 -FILTER_HOR_LUMA_W16 64 + +%if ARCH_X86_64 + IPFILTER_LUMA_AVX2_16xN 4 + IPFILTER_LUMA_AVX2_16xN 8 + 
IPFILTER_LUMA_AVX2_16xN 12 + IPFILTER_LUMA_AVX2_16xN 16 + IPFILTER_LUMA_AVX2_16xN 32 + IPFILTER_LUMA_AVX2_16xN 64 +%endif ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx ;------------------------------------------------------------------------------------------------------------- -%macro FILTER_HOR_LUMA_W32 2 +%macro PROCESS_IPFILTER_LUMA_PP_32x1_AVX2 0 + PROCESS_IPFILTER_LUMA_PP_16x1_AVX2 + + movu m7, [r0 + mmsize] + movu m8, [r0 + 8 + mmsize] + + pshufb m10, m7, m14 + pshufb m7, m13 + pshufb m11, m8, m14 + pshufb m8, m13 + + pmaddwd m7, m0 + pmaddwd m10, m1 + paddd m7, m10 + pmaddwd m10, m11, m3 + pmaddwd m9, m8, m2 + paddd m10, m9 + paddd m7, m10 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PP + + movu m9, [r0 + 16 + mmsize] + pshufb m10, m9, m14 + pshufb m9, m13 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pmaddwd m10, m3 + pmaddwd m9, m2 + paddd m9, m10 + paddd m8, m9 + paddd m8, m4 + psrad m8, INTERP_SHIFT_PP + + packusdw m7, m8 + pshufb m7, m12 + CLIPW m7, m5, m6 + movu [r2 + mmsize], m7 +%endmacro + +%macro IPFILTER_LUMA_AVX2_32xN 1 INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_%1x%2, 4,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 6 - mov r4d, r4m - shl r4d, 4 +cglobal interp_8tap_horiz_pp_32x%1, 5,6,15 + shl r1d, 1 + shl r3d, 1 + sub r0, 6 + mov r4d, r4m + shl r4d, 4 + %ifdef PIC - lea r5, [tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4] - vpbroadcastq m1, [r5 + r4 + 8] + lea r5, [h_tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4] + vpbroadcastd m1, [r5 + r4 + 4] + vpbroadcastd m2, [r5 + r4 + 8] + vpbroadcastd m3, [r5 + r4 + 12] %else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - mova m7, [pd_32] - pxor m2, m2 - - ; register map - ; m0 , m1 interpolate coeff - - mov r4d, %2 - -.loop: -%assign x 0 -%rep %1/16 - vbroadcasti128 m4, [r0 + x] - vbroadcasti128 m5, [r0 + 8 + x] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 8 + x] - vbroadcasti128 m6, [r0 + 16 + x] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + x], xm4 - - vbroadcasti128 m4, [r0 + 16 + x] - vbroadcasti128 m5, [r0 + 24 + x] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 24 + x] - vbroadcasti128 m6, [r0 + 32 + x] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + 16 + x], xm4 + vpbroadcastd m0, [h_tab_LumaCoeff + r4] + vpbroadcastd m1, [h_tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [h_tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [h_tab_LumaCoeff + r4 + 12] +%endif + mova m13, [interp8_hpp_shuf1_load_avx512] + mova m14, [interp8_hpp_shuf2_load_avx512] + mova m12, [interp8_hpp_shuf1_store_avx512] + mova m4, [pd_32] + pxor m5, m5 + mova m6, [pw_pixel_max] + +%rep %1 - 1 + PROCESS_IPFILTER_LUMA_PP_32x1_AVX2 + lea r0, [r0 + r1] + lea r2, [r2 + r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_32x1_AVX2 + RET +%endmacro +%if ARCH_X86_64 + IPFILTER_LUMA_AVX2_32xN 8 + 
IPFILTER_LUMA_AVX2_32xN 16 + IPFILTER_LUMA_AVX2_32xN 24 + IPFILTER_LUMA_AVX2_32xN 32 + IPFILTER_LUMA_AVX2_32xN 64 +%endif + +%macro PROCESS_IPFILTER_LUMA_PP_64x1_AVX2 0 + PROCESS_IPFILTER_LUMA_PP_16x1_AVX2 +%assign x 32 +%rep 3 + movu m7, [r0 + x] + movu m8, [r0 + 8 + x] + + pshufb m10, m7, m14 + pshufb m7, m13 + pshufb m11, m8, m14 + pshufb m8, m13 + + pmaddwd m7, m0 + pmaddwd m10, m1 + paddd m7, m10 + pmaddwd m10, m11, m3 + pmaddwd m9, m8, m2 + paddd m10, m9 + paddd m7, m10 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PP + + movu m9, [r0 + 16 + x] + pshufb m10, m9, m14 + pshufb m9, m13 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pmaddwd m10, m3 + pmaddwd m9, m2 + paddd m9, m10 + paddd m8, m9 + paddd m8, m4 + psrad m8, INTERP_SHIFT_PP + + packusdw m7, m8 + pshufb m7, m12 + CLIPW m7, m5, m6 + movu [r2 + x], m7 %assign x x+32 %endrep +%endmacro - add r2, r3 - add r0, r1 - dec r4d - jnz .loop +%macro IPFILTER_LUMA_AVX2_64xN 1 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_64x%1, 5,6,15 + shl r1d, 1 + shl r3d, 1 + sub r0, 6 + mov r4d, r4m + shl r4d, 4 + +%ifdef PIC + lea r5, [h_tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4] + vpbroadcastd m1, [r5 + r4 + 4] + vpbroadcastd m2, [r5 + r4 + 8] + vpbroadcastd m3, [r5 + r4 + 12] +%else + vpbroadcastd m0, [h_tab_LumaCoeff + r4] + vpbroadcastd m1, [h_tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [h_tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [h_tab_LumaCoeff + r4 + 12] +%endif + mova m13, [interp8_hpp_shuf1_load_avx512] + mova m14, [interp8_hpp_shuf2_load_avx512] + mova m12, [interp8_hpp_shuf1_store_avx512] + mova m4, [pd_32] + pxor m5, m5 + mova m6, [pw_pixel_max] + +%rep %1 - 1 + PROCESS_IPFILTER_LUMA_PP_64x1_AVX2 + lea r0, [r0 + r1] + lea r2, [r2 + r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_64x1_AVX2 RET %endmacro -FILTER_HOR_LUMA_W32 32, 8 -FILTER_HOR_LUMA_W32 32, 16 -FILTER_HOR_LUMA_W32 32, 24 -FILTER_HOR_LUMA_W32 32, 32 -FILTER_HOR_LUMA_W32 32, 64 -FILTER_HOR_LUMA_W32 64, 16 -FILTER_HOR_LUMA_W32 64, 32 -FILTER_HOR_LUMA_W32 64, 48 -FILTER_HOR_LUMA_W32 64, 64 + +%if ARCH_X86_64 + IPFILTER_LUMA_AVX2_64xN 16 + IPFILTER_LUMA_AVX2_64xN 32 + IPFILTER_LUMA_AVX2_64xN 48 + IPFILTER_LUMA_AVX2_64xN 64 +%endif ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx @@ -1684,12 +1741,12 @@ mov r4d, r4m shl r4d, 4 %ifdef PIC - lea r5, [tab_LumaCoeff] + lea r5, [h_tab_LumaCoeff] vpbroadcastq m0, [r5 + r4] vpbroadcastq m1, [r5 + r4 + 8] %else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] + vpbroadcastq m0, [h_tab_LumaCoeff + r4] + vpbroadcastq m1, [h_tab_LumaCoeff + r4 + 8] %endif mova m3, [interp8_hpp_shuf] mova m7, [pd_32] @@ -1774,12 +1831,12 @@ mov r4d, r4m shl r4d, 4 %ifdef PIC - lea r5, [tab_LumaCoeff] + lea r5, [h_tab_LumaCoeff] vpbroadcastq m0, [r5 + r4] vpbroadcastq m1, [r5 + r4 + 8] %else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] + vpbroadcastq m0, [h_tab_LumaCoeff + r4] + vpbroadcastq m1, [h_tab_LumaCoeff + r4 + 8] %endif mova m3, [interp8_hpp_shuf] mova m7, [pd_32] @@ -1884,125 +1941,82 @@ ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx ;------------------------------------------------------------------------------------------------------------- +%macro 
PROCESS_IPFILTER_LUMA_PP_48x1_AVX2 0 + PROCESS_IPFILTER_LUMA_PP_32x1_AVX2 + + movu m7, [r0 + 2 * mmsize] + movu m8, [r0 + 8 + 2 * mmsize] + + pshufb m10, m7, m14 + pshufb m7, m13 + pshufb m11, m8, m14 + pshufb m8, m13 + + pmaddwd m7, m0 + pmaddwd m10, m1 + paddd m7, m10 + pmaddwd m10, m11, m3 + pmaddwd m9, m8, m2 + paddd m10, m9 + paddd m7, m10 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PP + + movu m9, [r0 + 16 + 2 * mmsize] + pshufb m10, m9, m14 + pshufb m9, m13 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pmaddwd m10, m3 + pmaddwd m9, m2 + paddd m9, m10 + paddd m8, m9 + paddd m8, m4 + psrad m8, INTERP_SHIFT_PP + + packusdw m7, m8 + pshufb m7, m12 + CLIPW m7, m5, m6 + movu [r2 + 2 * mmsize], m7 +%endmacro + +%if ARCH_X86_64 INIT_YMM avx2 -cglobal interp_8tap_horiz_pp_48x64, 4,6,8 - add r1d, r1d - add r3d, r3d - sub r0, 6 - mov r4d, r4m - shl r4d, 4 +cglobal interp_8tap_horiz_pp_48x64, 5,6,15 + shl r1d, 1 + shl r3d, 1 + sub r0, 6 + mov r4d, r4m + shl r4d, 4 + %ifdef PIC - lea r5, [tab_LumaCoeff] - vpbroadcastq m0, [r5 + r4] - vpbroadcastq m1, [r5 + r4 + 8] + lea r5, [h_tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4] + vpbroadcastd m1, [r5 + r4 + 4] + vpbroadcastd m2, [r5 + r4 + 8] + vpbroadcastd m3, [r5 + r4 + 12] %else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - mova m7, [pd_32] - pxor m2, m2 - - ; register map - ; m0 , m1 interpolate coeff - - mov r4d, 64 - -.loop: -%assign x 0 -%rep 2 - vbroadcasti128 m4, [r0 + x] - vbroadcasti128 m5, [r0 + 8 + x] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 8 + x] - vbroadcasti128 m6, [r0 + 16 + x] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + x], xm4 - - vbroadcasti128 m4, [r0 + 16 + x] - vbroadcasti128 m5, [r0 + 24 + x] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 24 + x] - vbroadcasti128 m6, [r0 + 32 + x] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + 16 + x], xm4 - - vbroadcasti128 m4, [r0 + 32 + x] - vbroadcasti128 m5, [r0 + 40 + x] - pshufb m4, m3 - pshufb m5, m3 - - pmaddwd m4, m0 - pmaddwd m5, m1 - paddd m4, m5 - - vbroadcasti128 m5, [r0 + 40 + x] - vbroadcasti128 m6, [r0 + 48 + x] - pshufb m5, m3 - pshufb m6, m3 - - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m7 - psrad m4, INTERP_SHIFT_PP - - packusdw m4, m4 - vpermq m4, m4, q2020 - CLIPW m4, m2, [pw_pixel_max] - movu [r2 + 32 + x], xm4 - -%assign x x+48 + vpbroadcastd m0, [h_tab_LumaCoeff + r4] + vpbroadcastd m1, [h_tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [h_tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [h_tab_LumaCoeff + r4 + 12] +%endif + mova m13, [interp8_hpp_shuf1_load_avx512] + mova m14, [interp8_hpp_shuf2_load_avx512] + mova m12, [interp8_hpp_shuf1_store_avx512] + mova m4, [pd_32] + pxor m5, m5 + mova m6, [pw_pixel_max] + +%rep 63 + PROCESS_IPFILTER_LUMA_PP_48x1_AVX2 + lea r0, [r0 + r1] + lea r2, [r2 + r3] %endrep - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop + PROCESS_IPFILTER_LUMA_PP_48x1_AVX2 RET +%endif 
;----------------------------------------------------------------------------------------------------------------------------- ;void interp_horiz_ps_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) @@ -2018,12 +2032,12 @@ add r3d, r3d %ifdef PIC - lea r6, [tab_LumaCoeff] + lea r6, [h_tab_LumaCoeff] lea r4, [r4 * 8] vbroadcasti128 m0, [r6 + r4 * 2] %else lea r4, [r4 * 8] - vbroadcasti128 m0, [tab_LumaCoeff + r4 * 2] + vbroadcasti128 m0, [h_tab_LumaCoeff + r4 * 2] %endif vbroadcasti128 m2, [INTERP_OFFSET_PS] @@ -2119,22 +2133,53 @@ IPFILTER_LUMA_PS_4xN_AVX2 8 IPFILTER_LUMA_PS_4xN_AVX2 16 + %macro PROCESS_IPFILTER_LUMA_PS_8x1_AVX2 1 + + %assign x 0 + %rep %1/8 + vbroadcasti128 m4, [r0 + x] + vbroadcasti128 m5, [r0 + 8+ x] + pshufb m4, m3 + pshufb m7, m5, m3 + pmaddwd m4, m0 + pmaddwd m7, m1 + paddd m4, m7 + + vbroadcasti128 m6, [r0 + 16 + x] + pshufb m5, m3 + pshufb m6, m3 + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m2 + vextracti128 xm5,m4, 1 + psrad xm4, INTERP_SHIFT_PS + psrad xm5, INTERP_SHIFT_PS + packssdw xm4, xm5 + movu [r2 + x], xm4 + %assign x x+16 + %endrep + %endmacro + %macro IPFILTER_LUMA_PS_8xN_AVX2 1 INIT_YMM avx2 %if ARCH_X86_64 == 1 cglobal interp_8tap_horiz_ps_8x%1, 4, 6, 8 - add r1d, r1d - add r3d, r3d + shl r1d, 1 + shl r3d, 1 mov r4d, r4m mov r5d, r5m shl r4d, 4 %ifdef PIC - lea r6, [tab_LumaCoeff] + lea r6, [h_tab_LumaCoeff] vpbroadcastq m0, [r6 + r4] vpbroadcastq m1, [r6 + r4 + 8] %else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] + vpbroadcastq m0, [h_tab_LumaCoeff + r4] + vpbroadcastq m1, [h_tab_LumaCoeff + r4 + 8] %endif mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] @@ -2151,30 +2196,7 @@ add r4d, 7 .loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m7, m5, m3 - pmaddwd m4, m0 - pmaddwd m7, m1 - paddd m4, m7 - - vbroadcasti128 m6, [r0 + 16] - pshufb m5, m3 - pshufb m6, m3 - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m2 - vextracti128 xm5,m4, 1 - psrad xm4, INTERP_SHIFT_PS - psrad xm5, INTERP_SHIFT_PS - packssdw xm4, xm5 - - movu [r2], xm4 + PROCESS_IPFILTER_LUMA_PS_8x1_AVX2 8 add r2, r3 add r0, r1 dec r4d @@ -2197,12 +2219,12 @@ mov r5d, r5m shl r4d, 4 %ifdef PIC - lea r6, [tab_LumaCoeff] + lea r6, [h_tab_LumaCoeff] vpbroadcastq m0, [r6 + r4] vpbroadcastq m1, [r6 + r4 + 8] %else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] + vpbroadcastq m0, [h_tab_LumaCoeff + r4] + vpbroadcastq m1, [h_tab_LumaCoeff + r4 + 8] %endif mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] @@ -2218,46 +2240,297 @@ sub r0, r6 add r4d, 7 + .loop0: -%assign x 0 -%rep 24/8 - vbroadcasti128 m4, [r0 + x] - vbroadcasti128 m5, [r0 + 8 + x] - pshufb m4, m3 - pshufb m7, m5, m3 - pmaddwd m4, m0 - pmaddwd m7, m1 - paddd m4, m7 + PROCESS_IPFILTER_LUMA_PS_8x1_AVX2 24 + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif - vbroadcasti128 m6, [r0 + 16 + x] - pshufb m5, m3 - pshufb m6, m3 - pmaddwd m5, m0 - pmaddwd m6, m1 - paddd m5, m6 - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m2 - vextracti128 xm5,m4, 1 - psrad xm4, INTERP_SHIFT_PS - psrad xm5, INTERP_SHIFT_PS - packssdw xm4, xm5 +%macro PROCESS_IPFILTER_LUMA_PS_16x1_AVX2 0 + movu m7, [r0] + movu m8, [r0 + 8] + pshufb m10, m7, m14 + pshufb m7, m13 + pshufb m11, m8, m14 + pshufb m8, m13 + + pmaddwd m7, m0 + pmaddwd m10, m1 + 
paddd m7, m10 + pmaddwd m10, m11, m3 + pmaddwd m9, m8, m2 + paddd m10, m9 + paddd m7, m10 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + movu m9, [r0 + 16] + pshufb m10, m9, m14 + pshufb m9, m13 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pmaddwd m10, m3 + pmaddwd m9, m2 + paddd m9, m10 + paddd m8, m9 + paddd m8, m4 + psrad m8, INTERP_SHIFT_PS + packssdw m7, m8 + pshufb m7, m12 + movu [r2], m7 +%endmacro - movu [r2 + x], xm4 - %assign x x+16 - %endrep +%macro IPFILTER_LUMA_PS_16xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_16x%1, 5, 6, 15 - add r2, r3 - add r0, r1 + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m + shl r4d, 4 +%ifdef PIC + lea r6, [h_tab_LumaCoeff] + vpbroadcastd m0, [r6 + r4] + vpbroadcastd m1, [r6 + r4 + 4] + vpbroadcastd m2, [r6 + r4 + 8] + vpbroadcastd m3, [r6 + r4 + 12] +%else + vpbroadcastd m0, [h_tab_LumaCoeff + r4] + vpbroadcastd m1, [h_tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [h_tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [h_tab_LumaCoeff + r4 + 12] +%endif + mova m13, [interp8_hpp_shuf1_load_avx512] + mova m14, [interp8_hpp_shuf2_load_avx512] + mova m12, [interp8_hpp_shuf1_store_avx512] + vbroadcasti128 m4, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + test r5d, r5d + mov r4d, %1 + jz .loop0 + lea r6, [r1*3] + sub r0, r6 + add r4d, 7 + +.loop0: + + PROCESS_IPFILTER_LUMA_PS_16x1_AVX2 + lea r0, [r0 + r1] + lea r2, [r2 + r3] + ;add r2, r3 + ;add r0, r1 dec r4d jnz .loop0 RET %endif -%macro IPFILTER_LUMA_PS_32_64_AVX2 2 +%endmacro + + IPFILTER_LUMA_PS_16xN_AVX2 4 + IPFILTER_LUMA_PS_16xN_AVX2 8 + IPFILTER_LUMA_PS_16xN_AVX2 12 + IPFILTER_LUMA_PS_16xN_AVX2 16 + IPFILTER_LUMA_PS_16xN_AVX2 32 + IPFILTER_LUMA_PS_16xN_AVX2 64 +%macro PROCESS_IPFILTER_LUMA_PS_32x1_AVX2 0 + PROCESS_IPFILTER_LUMA_PS_16x1_AVX2 + movu m7, [r0 + mmsize] + movu m8, [r0 + 8+ mmsize] + pshufb m10, m7, m14 + pshufb m7, m13 + pshufb m11, m8, m14 + pshufb m8, m13 + + pmaddwd m7, m0 + pmaddwd m10, m1 + paddd m7, m10 + pmaddwd m10, m11, m3 + pmaddwd m9, m8, m2 + paddd m10, m9 + paddd m7, m10 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + movu m9, [r0 + 16+ mmsize] + pshufb m10, m9, m14 + pshufb m9, m13 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pmaddwd m10, m3 + pmaddwd m9, m2 + paddd m9, m10 + paddd m8, m9 + paddd m8, m4 + psrad m8, INTERP_SHIFT_PS + packssdw m7, m8 + pshufb m7, m12 + movu [r2+ mmsize], m7 +%endmacro + +%macro IPFILTER_LUMA_PS_32xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 +cglobal interp_8tap_horiz_ps_32x%1, 5, 6, 15 + + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m + shl r4d, 4 +%ifdef PIC + lea r6, [h_tab_LumaCoeff] + vpbroadcastd m0, [r6 + r4] + vpbroadcastd m1, [r6 + r4 + 4] + vpbroadcastd m2, [r6 + r4 + 8] + vpbroadcastd m3, [r6 + r4 + 12] +%else + vpbroadcastd m0, [h_tab_LumaCoeff + r4] + vpbroadcastd m1, [h_tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [h_tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [h_tab_LumaCoeff + r4 + 12] +%endif + mova m13, [interp8_hpp_shuf1_load_avx512] + mova m14, [interp8_hpp_shuf2_load_avx512] + mova m12, [interp8_hpp_shuf1_store_avx512] + vbroadcasti128 m4, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + test r5d, r5d + mov r4d, %1 + jz .loop0 + lea r6, [r1*3] + sub r0, r6 + add r4d, 7 + +.loop0: + PROCESS_IPFILTER_LUMA_PS_32x1_AVX2 + lea r0, [r0 + r1] + lea r2, [r2 + r3] + ;add r2, r3 + ;add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + + IPFILTER_LUMA_PS_32xN_AVX2 8 + IPFILTER_LUMA_PS_32xN_AVX2 16 + 
IPFILTER_LUMA_PS_32xN_AVX2 24 + IPFILTER_LUMA_PS_32xN_AVX2 32 + IPFILTER_LUMA_PS_32xN_AVX2 64 + +%macro PROCESS_IPFILTER_LUMA_PS_64x1_AVX2 0 + PROCESS_IPFILTER_LUMA_PS_16x1_AVX2 +%assign x 32 +%rep 3 + movu m7, [r0 + x] + movu m8, [r0 + 8+ x] + pshufb m10, m7, m14 + pshufb m7, m13 + pshufb m11, m8, m14 + pshufb m8, m13 + + pmaddwd m7, m0 + pmaddwd m10, m1 + paddd m7, m10 + pmaddwd m10, m11, m3 + pmaddwd m9, m8, m2 + paddd m10, m9 + paddd m7, m10 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + movu m9, [r0 + 16+ x] + pshufb m10, m9, m14 + pshufb m9, m13 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pmaddwd m10, m3 + pmaddwd m9, m2 + paddd m9, m10 + paddd m8, m9 + paddd m8, m4 + psrad m8, INTERP_SHIFT_PS + packssdw m7, m8 + pshufb m7, m12 + movu [r2+ x], m7 +%assign x x+32 +%endrep +%endmacro + +%macro IPFILTER_LUMA_PS_64xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 +cglobal interp_8tap_horiz_ps_64x%1, 5, 6, 15 + + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m + shl r4d, 4 +%ifdef PIC + lea r6, [h_tab_LumaCoeff] + vpbroadcastd m0, [r6 + r4] + vpbroadcastd m1, [r6 + r4 + 4] + vpbroadcastd m2, [r6 + r4 + 8] + vpbroadcastd m3, [r6 + r4 + 12] +%else + vpbroadcastd m0, [h_tab_LumaCoeff + r4] + vpbroadcastd m1, [h_tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [h_tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [h_tab_LumaCoeff + r4 + 12] +%endif + mova m13, [interp8_hpp_shuf1_load_avx512] + mova m14, [interp8_hpp_shuf2_load_avx512] + mova m12, [interp8_hpp_shuf1_store_avx512] + vbroadcasti128 m4, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + test r5d, r5d + mov r4d, %1 + jz .loop0 + lea r6, [r1*3] + sub r0, r6 + add r4d, 7 + +.loop0: + PROCESS_IPFILTER_LUMA_PS_64x1_AVX2 + lea r0, [r0 + r1] + lea r2, [r2 + r3] + ;add r2, r3 + ;add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + + IPFILTER_LUMA_PS_64xN_AVX2 16 + IPFILTER_LUMA_PS_64xN_AVX2 32 + IPFILTER_LUMA_PS_64xN_AVX2 48 + IPFILTER_LUMA_PS_64xN_AVX2 64 + +%macro IPFILTER_LUMA_PS_48xN_AVX2 1 INIT_YMM avx2 %if ARCH_X86_64 == 1 -cglobal interp_8tap_horiz_ps_%1x%2, 4, 6, 8 +cglobal interp_8tap_horiz_ps_48x%1, 5, 9,15 add r1d, r1d add r3d, r3d @@ -2280,7 +2553,7 @@ sub r0, 6 test r5d, r5d - mov r4d, %2 + mov r4d, %1 jz .loop0 lea r6, [r1*3] sub r0, r6 @@ -2288,7 +2561,7 @@ .loop0: %assign x 0 -%rep %1/16 +%rep 3 vbroadcasti128 m4, [r0 + x] vbroadcasti128 m5, [r0 + 4 * SIZEOF_PIXEL + x] pshufb m4, m3 @@ -2351,115 +2624,7 @@ RET %endif %endmacro - - IPFILTER_LUMA_PS_32_64_AVX2 32, 8 - IPFILTER_LUMA_PS_32_64_AVX2 32, 16 - IPFILTER_LUMA_PS_32_64_AVX2 32, 24 - IPFILTER_LUMA_PS_32_64_AVX2 32, 32 - IPFILTER_LUMA_PS_32_64_AVX2 32, 64 - - IPFILTER_LUMA_PS_32_64_AVX2 64, 16 - IPFILTER_LUMA_PS_32_64_AVX2 64, 32 - IPFILTER_LUMA_PS_32_64_AVX2 64, 48 - IPFILTER_LUMA_PS_32_64_AVX2 64, 64 - - IPFILTER_LUMA_PS_32_64_AVX2 48, 64 - -%macro IPFILTER_LUMA_PS_16xN_AVX2 1 -INIT_YMM avx2 -%if ARCH_X86_64 == 1 -cglobal interp_8tap_horiz_ps_16x%1, 4, 6, 8 - - add r1d, r1d - add r3d, r3d - mov r4d, r4m - mov r5d, r5m - shl r4d, 4 -%ifdef PIC - lea r6, [tab_LumaCoeff] - vpbroadcastq m0, [r6 + r4] - vpbroadcastq m1, [r6 + r4 + 8] -%else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] -%endif - mova m3, [interp8_hpp_shuf] - vbroadcasti128 m2, [INTERP_OFFSET_PS] - - ; register map - ; m0 , m1 interpolate coeff - - sub r0, 6 - test r5d, r5d - mov r4d, %1 - jz .loop0 - lea r6, [r1*3] - sub r0, r6 - add r4d, 7 - -.loop0: - vbroadcasti128 m4, [r0] - vbroadcasti128 m5, [r0 + 8] - pshufb m4, m3 - pshufb m7, 
m5, m3 - pmaddwd m4, m0 - pmaddwd m7, m1 - paddd m4, m7 - - vbroadcasti128 m6, [r0 + 16] - pshufb m5, m3 - pshufb m7, m6, m3 - pmaddwd m5, m0 - pmaddwd m7, m1 - paddd m5, m7 - - phaddd m4, m5 - vpermq m4, m4, q3120 - paddd m4, m2 - vextracti128 xm5, m4, 1 - psrad xm4, INTERP_SHIFT_PS - psrad xm5, INTERP_SHIFT_PS - packssdw xm4, xm5 - movu [r2], xm4 - - vbroadcasti128 m5, [r0 + 24] - pshufb m6, m3 - pshufb m7, m5, m3 - pmaddwd m6, m0 - pmaddwd m7, m1 - paddd m6, m7 - - vbroadcasti128 m7, [r0 + 32] - pshufb m5, m3 - pshufb m7, m3 - pmaddwd m5, m0 - pmaddwd m7, m1 - paddd m5, m7 - - phaddd m6, m5 - vpermq m6, m6, q3120 - paddd m6, m2 - vextracti128 xm5,m6, 1 - psrad xm6, INTERP_SHIFT_PS - psrad xm5, INTERP_SHIFT_PS - packssdw xm6, xm5 - movu [r2 + 16], xm6 - - add r2, r3 - add r0, r1 - dec r4d - jnz .loop0 - RET -%endif -%endmacro - - IPFILTER_LUMA_PS_16xN_AVX2 4 - IPFILTER_LUMA_PS_16xN_AVX2 8 - IPFILTER_LUMA_PS_16xN_AVX2 12 - IPFILTER_LUMA_PS_16xN_AVX2 16 - IPFILTER_LUMA_PS_16xN_AVX2 32 - IPFILTER_LUMA_PS_16xN_AVX2 64 - + IPFILTER_LUMA_PS_48xN_AVX2 64 INIT_YMM avx2 %if ARCH_X86_64 == 1 cglobal interp_8tap_horiz_ps_12x16, 4, 6, 8 @@ -2469,12 +2634,12 @@ mov r5d, r5m shl r4d, 4 %ifdef PIC - lea r6, [tab_LumaCoeff] + lea r6, [h_tab_LumaCoeff] vpbroadcastq m0, [r6 + r4] vpbroadcastq m1, [r6 + r4 + 8] %else - vpbroadcastq m0, [tab_LumaCoeff + r4] - vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] + vpbroadcastq m0, [h_tab_LumaCoeff + r4] + vpbroadcastq m1, [h_tab_LumaCoeff + r4 + 8] %endif mova m3, [interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS]
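For readers not fluent in the vector code above, here is a minimal scalar sketch of what the interp_8tap_horiz_ps kernels compute (a sketch under assumed names, not the x265 C reference): each output sample is an 8-tap weighted sum of horizontally neighbouring 16-bit source samples, biased by INTERP_OFFSET_PS and shifted by INTERP_SHIFT_PS — exactly the work the pmaddwd/paddd/psrad/packssdw sequences vectorize 16, 32 or 64 samples at a time.

#include <stdint.h>
#include <stddef.h>

/* Scalar sketch (assumed name interp8_horiz_ps_ref) of the 8-tap horizontal
 * "ps" luma filter that the AVX2/AVX-512 macros above vectorize.
 * src/dst are 16-bit samples; coeff is one row of the luma coefficient table;
 * offset/shift play the role of INTERP_OFFSET_PS / INTERP_SHIFT_PS. */
static void interp8_horiz_ps_ref(const int16_t *src, intptr_t srcStride,
                                 int16_t *dst, intptr_t dstStride,
                                 int width, int height,
                                 const int16_t coeff[8],
                                 int offset, int shift)
{
    src -= 3;                               /* taps cover src[x-3] .. src[x+4];
                                               the asm does "sub r0, 6" on
                                               16-bit pixels for the same reason */
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int k = 0; k < 8; k++)     /* the pmaddwd/paddd accumulation */
                sum += src[x + k] * coeff[k];
            dst[x] = (int16_t)((sum + offset) >> shift);
        }
        src += srcStride;
        dst += dstStride;
    }
}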
View file
x265_2.7.tar.gz/source/common/x86/h4-ipfilter16.asm -> x265_2.9.tar.gz/source/common/x86/h4-ipfilter16.asm
Changed
@@ -52,7 +52,7 @@ tab_Tm16: db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 -tab_ChromaCoeff: dw 0, 64, 0, 0 +h4_tab_ChromaCoeff: dw 0, 64, 0, 0 dw -2, 58, 10, -2 dw -4, 54, 16, -2 dw -6, 46, 28, -4 @@ -279,10 +279,10 @@ add r4d, r4d %ifdef PIC - lea r6, [tab_ChromaCoeff] + lea r6, [h4_tab_ChromaCoeff] movddup m0, [r6 + r4 * 4] %else - movddup m0, [tab_ChromaCoeff + r4 * 4] + movddup m0, [h4_tab_ChromaCoeff + r4 * 4] %endif %ifidn %3, ps @@ -377,6 +377,7 @@ ; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- +%if ARCH_X86_64 FILTER_HOR_CHROMA_sse3 2, 4, pp FILTER_HOR_CHROMA_sse3 2, 8, pp FILTER_HOR_CHROMA_sse3 2, 16, pp @@ -462,6 +463,7 @@ FILTER_HOR_CHROMA_sse3 64, 32, ps FILTER_HOR_CHROMA_sse3 64, 48, ps FILTER_HOR_CHROMA_sse3 64, 64, ps +%endif %macro FILTER_W2_2 1 movu m3, [r0] @@ -530,10 +532,10 @@ add r4d, r4d %ifdef PIC - lea r%6, [tab_ChromaCoeff] + lea r%6, [h4_tab_ChromaCoeff] movh m0, [r%6 + r4 * 4] %else - movh m0, [tab_ChromaCoeff + r4 * 4] + movh m0, [h4_tab_ChromaCoeff + r4 * 4] %endif punpcklqdq m0, m0 @@ -1129,10 +1131,10 @@ add r4d, r4d %ifdef PIC - lea r%4, [tab_ChromaCoeff] + lea r%4, [h4_tab_ChromaCoeff] movh m0, [r%4 + r4 * 4] %else - movh m0, [tab_ChromaCoeff + r4 * 4] + movh m0, [h4_tab_ChromaCoeff + r4 * 4] %endif punpcklqdq m0, m0 @@ -1246,10 +1248,10 @@ sub r0, 2 mov r4d, r4m %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r5 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m1, [h4_interp8_hpp_shuf] vpbroadcastd m2, [pd_32] @@ -1314,10 +1316,10 @@ sub r0, 2 mov r4d, r4m %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r5 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m1, [h4_interp8_hpp_shuf] vpbroadcastd m2, [pd_32] @@ -1370,10 +1372,10 @@ sub r0, 2 mov r4d, r4m %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r5 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m1, [h4_interp8_hpp_shuf] vpbroadcastd m2, [pd_32] @@ -1432,10 +1434,10 @@ sub r0, 2 mov r4d, r4m %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r5 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m1, [h4_interp8_hpp_shuf] vpbroadcastd m2, [pd_32] @@ -1504,10 +1506,10 @@ sub r0, 2 mov r4d, r4m %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r5 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m1, [h4_interp8_hpp_shuf] vpbroadcastd m2, [pd_32] @@ -1579,10 +1581,10 @@ sub r0, 2 mov r4d, r4m %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r5 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m1, [h4_interp8_hpp_shuf] vpbroadcastd m2, [pd_32] @@ -1655,10 +1657,10 @@ sub r0, 2 mov r4d, r4m %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r5 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m1, 
[h4_interp8_hpp_shuf] vpbroadcastd m2, [pd_32] @@ -1724,10 +1726,10 @@ sub r0, 2 mov r4d, r4m %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r5 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m1, [h4_interp8_hpp_shuf] vpbroadcastd m2, [pd_32] @@ -1804,10 +1806,10 @@ sub r0, 2 mov r4d, r4m %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r5 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m1, [h4_interp8_hpp_shuf] vpbroadcastd m2, [pd_32] @@ -1872,10 +1874,10 @@ sub r0, 2 mov r4d, r4m %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r5 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m1, [h4_interp8_hpp_shuf] vpbroadcastd m2, [pd_32] @@ -1934,10 +1936,10 @@ mov r5d, r5m %ifdef PIC - lea r6, [tab_ChromaCoeff] + lea r6, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r6 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m3, [h4_interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] @@ -1993,10 +1995,10 @@ mov r5d, r5m %ifdef PIC - lea r6, [tab_ChromaCoeff] + lea r6, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r6 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m3, [h4_interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] @@ -2066,10 +2068,10 @@ mov r5d, r5m %ifdef PIC - lea r6, [tab_ChromaCoeff] + lea r6, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r6 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m3, [h4_interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] @@ -2148,10 +2150,10 @@ mov r5d, r5m %ifdef PIC - lea r6, [tab_ChromaCoeff] + lea r6, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r6 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m3, [h4_interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] @@ -2213,10 +2215,10 @@ mov r5d, r5m %ifdef PIC - lea r6, [tab_ChromaCoeff] + lea r6, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r6 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m3, [h4_interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] @@ -2314,10 +2316,10 @@ mov r5d, r5m %ifdef PIC - lea r6, [tab_ChromaCoeff] + lea r6, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r6 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m3, [h4_interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] @@ -2467,10 +2469,10 @@ mov r5d, r5m %ifdef PIC - lea r6, [tab_ChromaCoeff] + lea r6, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r6 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m3, [h4_interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS] @@ -2587,10 +2589,10 @@ mov r5d, r5m %ifdef PIC - lea r6, [tab_ChromaCoeff] + lea r6, [h4_tab_ChromaCoeff] vpbroadcastq m0, [r6 + r4 * 8] %else - vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastq m0, [h4_tab_ChromaCoeff + r4 * 8] %endif mova m3, [h4_interp8_hpp_shuf] vbroadcasti128 m2, [INTERP_OFFSET_PS]
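The h4-ipfilter16.asm changes above are largely mechanical: the coefficient table is renamed from tab_ChromaCoeff to h4_tab_ChromaCoeff (presumably to avoid a duplicate-symbol clash with the copies of that table in the other ipfilter sources), and the SSE3 chroma filters are now built only on x86-64. As a hedged sketch of the addressing visible in the diff, both code paths end up loading the same four 16-bit taps for a given fractional position:

#include <stdint.h>

/* Sketch (assumed helper name chroma4_taps): how the SSE3 and AVX2 paths
 * above index the renamed h4_tab_ChromaCoeff table.  Each row holds the four
 * 16-bit taps of one fractional chroma position; coeffIdx is the caller's
 * r4m argument. */
static inline const int16_t *chroma4_taps(const int16_t *h4_tab_ChromaCoeff,
                                          int coeffIdx)
{
    /* SSE3 path: add r4d, r4d ; movddup      m0, [tab + r4 * 4]  -> 8*idx bytes
     * AVX2 path:               vpbroadcastq  m0, [tab + r4 * 8]  -> 8*idx bytes
     * Both land on the same row, i.e. &tab[4 * coeffIdx].                     */
    return h4_tab_ChromaCoeff + 4 * coeffIdx;
}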
View file
x265_2.7.tar.gz/source/common/x86/intrapred.h -> x265_2.9.tar.gz/source/common/x86/intrapred.h
Changed
@@ -76,7 +76,7 @@ FUNCDEF_TU_S2(void, intra_pred_dc, sse2, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter); FUNCDEF_TU_S2(void, intra_pred_dc, sse4, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter); FUNCDEF_TU_S2(void, intra_pred_dc, avx2, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter); - +FUNCDEF_TU_S2(void, intra_pred_dc, avx512, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter); FUNCDEF_TU_S2(void, intra_pred_planar, sse2, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter); FUNCDEF_TU_S2(void, intra_pred_planar, sse4, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter); FUNCDEF_TU_S2(void, intra_pred_planar, avx2, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter); @@ -85,7 +85,7 @@ DECL_ALL(ssse3); DECL_ALL(sse4); DECL_ALL(avx2); - +DECL_ALL(avx512); #undef DECL_ALL #undef DECL_ANGS #undef DECL_ANG
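intrapred.h only grows the declarations (FUNCDEF_TU_S2 and DECL_ALL for avx512); the matching AVX-512 kernels appear in the intrapred16.asm diff below. As a rough scalar sketch of what intra_pred_dc computes for a 32x32 block (shown with separate above/left pointers for clarity — the actual entry points take a single packed srcPix reference buffer, and the optional edge filter is omitted): the 32 above and 32 left reference samples are summed, rounded and divided by 64, and the resulting DC value is broadcast to every output sample. That is the paddw/pmaddwd reduction, the "+ 32, >> 6" step and the vpbroadcastw stores visible in the avx512 intra_pred_dc32 routine further down.

#include <stdint.h>
#include <stddef.h>

typedef uint16_t pixel;   /* 16-bit samples, as in the HIGH_BIT_DEPTH asm */

/* Scalar sketch of 32x32 DC intra prediction (no edge filtering). */
static void intra_pred_dc32_ref(pixel *dst, intptr_t dstStride,
                                const pixel *above, const pixel *left)
{
    int sum = 0;
    for (int i = 0; i < 32; i++)
        sum += above[i] + left[i];          /* 64 reference samples in total   */

    pixel dc = (pixel)((sum + 32) >> 6);    /* matches "paddd [pd_32]; psrld 6" */

    for (int y = 0; y < 32; y++)            /* the asm broadcasts dc with       */
        for (int x = 0; x < 32; x++)        /* vpbroadcastw and row-wise stores */
            dst[y * dstStride + x] = dc;
}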
View file
x265_2.7.tar.gz/source/common/x86/intrapred16.asm -> x265_2.9.tar.gz/source/common/x86/intrapred16.asm
Changed
@@ -71,7 +71,7 @@ const pw_ang8_16, db 0, 0, 0, 0, 0, 0, 12, 13, 10, 11, 6, 7, 4, 5, 0, 1 const pw_ang8_17, db 0, 0, 14, 15, 12, 13, 10, 11, 8, 9, 4, 5, 2, 3, 0, 1 const pw_swap16, times 2 db 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 - +const pw_swap16_avx512, times 4 db 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 const pw_ang16_13, db 14, 15, 8, 9, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 const pw_ang16_16, db 0, 0, 0, 0, 0, 0, 10, 11, 8, 9, 6, 7, 2, 3, 0, 1 @@ -196,6 +196,7 @@ ;----------------------------------------------------------------------------------- ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* above, int, int filter) ;----------------------------------------------------------------------------------- +%if ARCH_X86_64 INIT_XMM sse2 cglobal intra_pred_dc8, 5, 8, 2 movu m0, [r2 + 34] @@ -275,10 +276,13 @@ mov [r0 + r7], r3w .end: RET +%endif ;------------------------------------------------------------------------------------------------------- ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* left, pixel* above, int dirMode, int filter) ;------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +;This code is meant for 64 bit architecture INIT_XMM sse2 cglobal intra_pred_dc16, 5, 10, 4 lea r3, [r2 + 66] @@ -410,6 +414,7 @@ mov [r9 + r1 * 8], r3w .end: RET +%endif ;------------------------------------------------------------------------------------------- ; void intra_pred_dc(pixel* above, pixel* left, pixel* dst, intptr_t dstStride, int filter) @@ -474,6 +479,7 @@ ;------------------------------------------------------------------------------------------------------- ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* left, pixel* above, int dirMode, int filter) ;------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 INIT_YMM avx2 cglobal intra_pred_dc16, 3, 9, 4 mov r3d, r4m @@ -682,6 +688,68 @@ movu [r0 + r2 * 1 + 0], m0 movu [r0 + r2 * 1 + mmsize], m0 RET +INIT_ZMM avx512 +cglobal intra_pred_dc32, 3,3,2 + add r2, 2 + add r1d, r1d + movu m0, [r2] + movu m1, [r2 + 2 * mmsize] + paddw m0, m1 + vextracti32x8 ym1, m0, 1 + paddw ym0, ym1 + vextracti32x4 xm1, m0, 1 + paddw xm0, xm1 + pmaddwd xm0, [pw_1] + movhlps xm1, xm0 + paddd xm0, xm1 + vpsrldq xm1, xm0, 4 + paddd xm0, xm1 + paddd xm0, [pd_32] ; sum = sum + 32 + psrld xm0, 6 ; sum = sum / 64 + vpbroadcastw m0, xm0 + lea r2, [r1 * 3] + ; store DC 32x32 + movu [r0 + r1 * 0 + 0], m0 + movu [r0 + r1 * 1 + 0], m0 + movu [r0 + r1 * 2 + 0], m0 + movu [r0 + r2 * 1 + 0], m0 + lea r0, [r0 + r1 * 4] + movu [r0 + r1 * 0 + 0], m0 + movu [r0 + r1 * 1 + 0], m0 + movu [r0 + r1 * 2 + 0], m0 + movu [r0 + r2 * 1 + 0], m0 + lea r0, [r0 + r1 * 4] + movu [r0 + r1 * 0 + 0], m0 + movu [r0 + r1 * 1 + 0], m0 + movu [r0 + r1 * 2 + 0], m0 + movu [r0 + r2 * 1 + 0], m0 + lea r0, [r0 + r1 * 4] + movu [r0 + r1 * 0 + 0], m0 + movu [r0 + r1 * 1 + 0], m0 + movu [r0 + r1 * 2 + 0], m0 + movu [r0 + r2 * 1 + 0], m0 + lea r0, [r0 + r1 * 4] + movu [r0 + r1 * 0 + 0], m0 + movu [r0 + r1 * 1 + 0], m0 + movu [r0 + r1 * 2 + 0], m0 + movu [r0 + r2 * 1 + 0], m0 + lea r0, [r0 + r1 * 4] + movu [r0 + r1 * 0 + 0], m0 + movu [r0 + r1 * 1 + 0], m0 + movu [r0 + r1 * 2 + 0], m0 + movu [r0 + r2 * 1 + 0], m0 + lea r0, [r0 + r1 * 4] + movu [r0 + r1 * 0 + 0], m0 + movu [r0 + r1 * 1 + 0], m0 + movu [r0 + r1 * 2 + 0], m0 + movu [r0 + r2 * 1 + 0], m0 + lea r0, [r0 + r1 * 4] + movu [r0 + r1 * 0 + 0], m0 + movu [r0 + r1 
* 1 + 0], m0 + movu [r0 + r1 * 2 + 0], m0 + movu [r0 + r2 * 1 + 0], m0 + RET +%endif ;--------------------------------------------------------------------------------------- ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) @@ -1104,6 +1172,7 @@ ;--------------------------------------------------------------------------------------- ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) ;--------------------------------------------------------------------------------------- +%if ARCH_X86_64 INIT_XMM sse2 cglobal intra_pred_planar32, 3,3,16 movd m3, [r2 + 66] ; topRight = above[32] @@ -1209,7 +1278,7 @@ %endrep RET %endif - +%endif ;--------------------------------------------------------------------------------------- ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter) ;--------------------------------------------------------------------------------------- @@ -2063,6 +2132,7 @@ STORE_4x4 RET +%if ARCH_X86_64 cglobal intra_pred_ang4_26, 3,3,3 movh m0, [r2 + 2] ;[8 7 6 5 4 3 2 1] add r1d, r1d @@ -2098,6 +2168,7 @@ mov [r0 + r3], r2w .quit: RET +%endif cglobal intra_pred_ang4_27, 3,3,5 movu m0, [r2 + 2] ;[8 7 6 5 4 3 2 1] @@ -11054,35 +11125,35 @@ %macro TRANSPOSE_STORE_AVX2 11 jnz .skip%11 - punpckhwd m%9, m%1, m%2 - punpcklwd m%1, m%2 - punpckhwd m%2, m%3, m%4 - punpcklwd m%3, m%4 - - punpckldq m%4, m%1, m%3 - punpckhdq m%1, m%3 - punpckldq m%3, m%9, m%2 - punpckhdq m%9, m%2 - - punpckhwd m%10, m%5, m%6 - punpcklwd m%5, m%6 - punpckhwd m%6, m%7, m%8 - punpcklwd m%7, m%8 - - punpckldq m%8, m%5, m%7 - punpckhdq m%5, m%7 - punpckldq m%7, m%10, m%6 - punpckhdq m%10, m%6 - - punpcklqdq m%6, m%4, m%8 - punpckhqdq m%2, m%4, m%8 - punpcklqdq m%4, m%1, m%5 - punpckhqdq m%8, m%1, m%5 - - punpcklqdq m%1, m%3, m%7 - punpckhqdq m%5, m%3, m%7 - punpcklqdq m%3, m%9, m%10 - punpckhqdq m%7, m%9, m%10 + punpckhwd ym%9, ym%1, ym%2 + punpcklwd ym%1, ym%2 + punpckhwd ym%2, ym%3, ym%4 + punpcklwd ym%3, ym%4 + + punpckldq ym%4, ym%1, ym%3 + punpckhdq ym%1, ym%3 + punpckldq ym%3, ym%9, ym%2 + punpckhdq ym%9, ym%2 + + punpckhwd ym%10, ym%5, ym%6 + punpcklwd ym%5, ym%6 + punpckhwd ym%6, ym%7, ym%8 + punpcklwd ym%7, ym%8 + + punpckldq ym%8, ym%5, ym%7 + punpckhdq ym%5, ym%7 + punpckldq ym%7, ym%10, ym%6 + punpckhdq ym%10, ym%6 + + punpcklqdq ym%6, ym%4, ym%8 + punpckhqdq ym%2, ym%4, ym%8 + punpcklqdq ym%4, ym%1, ym%5 + punpckhqdq ym%8, ym%1, ym%5 + + punpcklqdq ym%1, ym%3, ym%7 + punpckhqdq ym%5, ym%3, ym%7 + punpcklqdq ym%3, ym%9, ym%10 + punpckhqdq ym%7, ym%9, ym%10 movu [r0 + r1 * 0 + %11], xm%6 movu [r0 + r1 * 1 + %11], xm%2 @@ -11096,32 +11167,33 @@ movu [r5 + r4 * 1 + %11], xm%7 lea r5, [r5 + r1 * 4] - vextracti128 [r5 + r1 * 0 + %11], m%6, 1 - vextracti128 [r5 + r1 * 1 + %11], m%2, 1 - vextracti128 [r5 + r1 * 2 + %11], m%4, 1 - vextracti128 [r5 + r4 * 1 + %11], m%8, 1 + vextracti128 [r5 + r1 * 0 + %11], ym%6, 1 + vextracti128 [r5 + r1 * 1 + %11], ym%2, 1 + vextracti128 [r5 + r1 * 2 + %11], ym%4, 1 + vextracti128 [r5 + r4 * 1 + %11], ym%8, 1 lea r5, [r5 + r1 * 4] - vextracti128 [r5 + r1 * 0 + %11], m%1, 1 - vextracti128 [r5 + r1 * 1 + %11], m%5, 1 - vextracti128 [r5 + r1 * 2 + %11], m%3, 1 - vextracti128 [r5 + r4 * 1 + %11], m%7, 1 + vextracti128 [r5 + r1 * 0 + %11], ym%1, 1 + vextracti128 [r5 + r1 * 1 + %11], ym%5, 1 + vextracti128 [r5 + r1 * 2 + %11], ym%3, 1 + vextracti128 [r5 + r4 * 1 + %11], ym%7, 1 jmp .end%11 .skip%11: - movu [r0 + r1 * 0], m%1 - movu [r0 + r1 * 1], m%2 - movu [r0 + r1 * 2], m%3 - 
movu [r0 + r4 * 1], m%4 + movu [r0 + r1 * 0], ym%1 + movu [r0 + r1 * 1], ym%2 + movu [r0 + r1 * 2], ym%3 + movu [r0 + r4 * 1], ym%4 lea r0, [r0 + r1 * 4] - movu [r0 + r1 * 0], m%5 - movu [r0 + r1 * 1], m%6 - movu [r0 + r1 * 2], m%7 - movu [r0 + r4 * 1], m%8 + movu [r0 + r1 * 0], ym%5 + movu [r0 + r1 * 1], ym%6 + movu [r0 + r1 * 2], ym%7 + movu [r0 + r4 * 1], ym%8 lea r0, [r0 + r1 * 4] .end%11: %endmacro +%if ARCH_X86_64 ;; angle 16, modes 3 and 33 cglobal ang16_mode_3_33 test r6d, r6d @@ -11771,7 +11843,6 @@ packusdw m11, m3 TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 16 ret - ;; angle 16, modes 7 and 29 cglobal ang16_mode_7_29 test r6d, r6d @@ -18220,10 +18291,2364 @@ mov rsp, [rsp+4*mmsize] RET +%endif ;------------------------------------------------------------------------------------------------------- ; end of avx2 code for intra_pred_ang32 mode 2 to 34 ;------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------- +; avx512 code for intra_pred_ang32 mode 2 to 34 start +;------------------------------------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal intra_pred_ang32_2, 3,5,3 + lea r4, [r2] + add r2, 128 + cmp r3m, byte 34 + cmove r2, r4 + add r1d, r1d + lea r3, [r1 * 3] + movu m0, [r2 + 4] + movu m1, [r2 + 20] + + movu [r0], m0 + palignr m2, m1, m0, 2 + movu [r0 + r1], m2 + palignr m2, m1, m0, 4 + movu [r0 + r1 * 2], m2 + palignr m2, m1, m0, 6 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m1, m0, 8 + movu [r0], m2 + palignr m2, m1, m0, 10 + movu [r0 + r1], m2 + palignr m2, m1, m0, 12 + movu [r0 + r1 * 2], m2 + palignr m2, m1, m0, 14 + movu [r0 + r3], m2 + + movu m0, [r2 + 36] + lea r0, [r0 + r1 * 4] + movu [r0], m1 + palignr m2, m0, m1, 2 + movu [r0 + r1], m2 + palignr m2, m0, m1, 4 + movu [r0 + r1 * 2], m2 + palignr m2, m0, m1, 6 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m0, m1, 8 + movu [r0], m2 + palignr m2, m0, m1, 10 + movu [r0 + r1], m2 + palignr m2, m0, m1, 12 + movu [r0 + r1 * 2], m2 + palignr m2, m0, m1, 14 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + movu m1, [r2 + 52] + + movu [r0], m0 + palignr m2, m1, m0, 2 + movu [r0 + r1], m2 + palignr m2, m1, m0, 4 + movu [r0 + r1 * 2], m2 + palignr m2, m1, m0, 6 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m1, m0, 8 + movu [r0], m2 + palignr m2, m1, m0, 10 + movu [r0 + r1], m2 + palignr m2, m1, m0, 12 + movu [r0 + r1 * 2], m2 + palignr m2, m1, m0, 14 + movu [r0 + r3], m2 + + movu m0, [r2 + 68] + lea r0, [r0 + r1 * 4] + movu [r0], m1 + palignr m2, m0, m1, 2 + movu [r0 + r1], m2 + palignr m2, m0, m1, 4 + movu [r0 + r1 * 2], m2 + palignr m2, m0, m1, 6 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m0, m1, 8 + movu [r0], m2 + palignr m2, m0, m1, 10 + movu [r0 + r1], m2 + palignr m2, m0, m1, 12 + movu [r0 + r1 * 2], m2 + palignr m2, m0, m1, 14 + movu [r0 + r3], m2 + RET + +cglobal intra_pred_ang32_10, 3,4,2 + add r2, mmsize*2 + add r1d, r1d + lea r3, [r1 * 3] + + vpbroadcastw m0, [r2 + 2] ; [1...] + vpbroadcastw m1, [r2 + 2 + 2] ; [2...] + movu [r0], m0 + movu [r0 + r1], m1 + + vpbroadcastw m0, [r2 + 2 + 4] ; [3...] + vpbroadcastw m1, [r2 + 2 + 6] ; [4...] + movu [r0 + r1 * 2], m0 + movu [r0 + r3], m1 + lea r0, [r0 + r1 * 4] + + vpbroadcastw m0, [r2 + 2 + 8] ; [5...] + vpbroadcastw m1, [r2 + 2 + 10] ; [6...] 
+ movu [r0], m0 + movu [r0 + r1], m1 + + vpbroadcastw m0, [r2 + 2 + 12] ; [7...] + vpbroadcastw m1, [r2 + 2 + 14] ; [8...] + movu [r0 + r1 * 2], m0 + movu [r0 + r3], m1 + lea r0, [r0 + r1 *4] + + vpbroadcastw m0, [r2 + 2 + 16] ; [9...] + vpbroadcastw m1, [r2 + 2 + 18] ; [10...] + movu [r0], m0 + movu [r0 + r1], m1 + + vpbroadcastw m0, [r2 + 2 + 20] ; [11...] + vpbroadcastw m1, [r2 + 2 + 22] ; [12...] + movu [r0 + r1 * 2], m0 + movu [r0 + r3], m1 + lea r0, [r0 + r1 *4] + + vpbroadcastw m0, [r2 + 2 + 24] ; [13...] + vpbroadcastw m1, [r2 + 2 + 26] ; [14...] + movu [r0], m0 + movu [r0 + r1], m1 + + vpbroadcastw m0, [r2 + 2 + 28] ; [15...] + vpbroadcastw m1, [r2 + 2 + 30] ; [16...] + movu [r0 + r1 * 2], m0 + movu [r0 + r3], m1 + lea r0, [r0 + r1 *4] + + vpbroadcastw m0, [r2 + 2 + 32] ; [17...] + vpbroadcastw m1, [r2 + 2 + 34] ; [18...] + movu [r0], m0 + movu [r0 + r1], m1 + + vpbroadcastw m0, [r2 + 2 + 36] ; [19...] + vpbroadcastw m1, [r2 + 2 + 38] ; [20...] + movu [r0 + r1 * 2], m0 + movu [r0 + r3], m1 + lea r0, [r0 + r1 *4] + + vpbroadcastw m0, [r2 + 2 + 40] ; [21...] + vpbroadcastw m1, [r2 + 2 + 42] ; [22...] + movu [r0], m0 + movu [r0 + r1], m1 + + vpbroadcastw m0, [r2 + 2 + 44] ; [23...] + vpbroadcastw m1, [r2 + 2 + 46] ; [24...] + movu [r0 + r1 * 2], m0 + movu [r0 + r3], m1 + lea r0, [r0 + r1 *4] + + vpbroadcastw m0, [r2 + 2 + 48] ; [25...] + vpbroadcastw m1, [r2 + 2 + 50] ; [26...] + movu [r0], m0 + movu [r0 + r1], m1 + + vpbroadcastw m0, [r2 + 2 + 52] ; [27...] + vpbroadcastw m1, [r2 + 2 + 54] ; [28...] + movu [r0 + r1 * 2], m0 + movu [r0 + r3], m1 + lea r0, [r0 + r1 *4] + + vpbroadcastw m0, [r2 + 2 + 56] ; [29...] + vpbroadcastw m1, [r2 + 2 + 58] ; [30...] + movu [r0], m0 + movu [r0 + r1], m1 + + vpbroadcastw m0, [r2 + 2 + 60] ; [31...] + vpbroadcastw m1, [r2 + 2 + 62] ; [32...] 
+ movu [r0 + r1 * 2], m0 + movu [r0 + r3], m1 + RET + +cglobal intra_pred_ang32_18, 3,6,6 + mov r4, rsp + sub rsp, 4*(mmsize/2)+gprsize + and rsp, ~63 + mov [rsp+4*(mmsize/2)], r4 + + movu m0, [r2] + mova [rsp + 2*(mmsize/2)], ym0 + vextracti32x8 [rsp + 3*(mmsize/2)], m0, 1 + + movu m2, [r2 + 130] + pshufb m2, [pw_swap16_avx512] + vpermq m2, m2, q1032 + mova [rsp + 1*(mmsize/2)], ym2 + vextracti32x8 [rsp + 0*(mmsize/2)], m2, 1 + + add r1d, r1d + lea r2, [rsp+2*(mmsize/2)] + lea r4, [r1 * 2] + lea r3, [r1 * 3] + lea r5, [r1 * 4] + + movu m0, [r2] + movu m2, [r2 - 16] + movu [r0], m0 + + palignr m4, m0, m2, 14 + palignr m5, m0, m2, 12 + movu [r0 + r1], m4 + movu [r0 + r4], m5 + + palignr m4, m0, m2, 10 + palignr m5, m0, m2, 8 + movu [r0 + r3], m4 + add r0, r5 + movu [r0], m5 + + palignr m4, m0, m2, 6 + palignr m5, m0, m2, 4 + movu [r0 + r1], m4 + movu [r0 + r4], m5 + + palignr m4, m0, m2, 2 + movu [r0 + r3], m4 + add r0, r5 + movu [r0], m2 + + movu m0, [r2 - 32] + palignr m4, m2, m0, 14 + palignr m5, m2, m0, 12 + movu [r0 + r1], m4 + movu [r0 + r4], m5 + + palignr m4, m2, m0, 10 + palignr m5, m2, m0, 8 + movu [r0 + r3], m4 + add r0, r5 + movu [r0], m5 + + palignr m4, m2, m0, 6 + palignr m5, m2, m0, 4 + movu [r0 + r1], m4 + movu [r0 + r4], m5 + + palignr m4, m2, m0, 2 + movu [r0 + r3], m4 + add r0, r5 + movu [r0], m0 + + movu m2, [r2 - 48] + palignr m4, m0, m2, 14 + palignr m5, m0, m2, 12 + movu [r0 + r1], m4 + movu [r0 + r4], m5 + + palignr m4, m0, m2, 10 + palignr m5, m0, m2, 8 + movu [r0 + r3], m4 + add r0, r5 + movu [r0], m5 + + palignr m4, m0, m2, 6 + palignr m5, m0, m2, 4 + movu [r0 + r1], m4 + movu [r0 + r4], m5 + + palignr m4, m0, m2, 2 + movu [r0 + r3], m4 + add r0, r5 + movu [r0], m2 + + movu m0, [r2 - 64] + palignr m4, m2, m0, 14 + palignr m5, m2, m0, 12 + movu [r0 + r1], m4 + movu [r0 + r4], m5 + + palignr m4, m2, m0, 10 + palignr m5, m2, m0, 8 + movu [r0 + r3], m4 + add r0, r5 + movu [r0], m5 + + palignr m4, m2, m0, 6 + palignr m5, m2, m0, 4 + movu [r0 + r1], m4 + movu [r0 + r4], m5 + + palignr m4, m2, m0, 2 + movu [r0 + r3], m4 + mov rsp, [rsp+4*(mmsize/2)] + RET +INIT_ZMM avx512 +cglobal intra_pred_ang32_26, 3,3,2 + movu m0, [r2 + 2] + add r1d, r1d + lea r2, [r1 * 3] + movu [r0], m0 + movu [r0 + r1], m0 + movu [r0 + r1 * 2], m0 + movu [r0 + r2], m0 + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + r1], m0 + movu [r0 + r1 * 2], m0 + movu [r0 + r2], m0 + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + r1], m0 + movu [r0 + r1 * 2], m0 + movu [r0 + r2], m0 + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + r1], m0 + movu [r0 + r1 * 2], m0 + movu [r0 + r2], m0 + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + r1], m0 + movu [r0 + r1 * 2], m0 + movu [r0 + r2], m0 + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + r1], m0 + movu [r0 + r1 * 2], m0 + movu [r0 + r2], m0 + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + r1], m0 + movu [r0 + r1 * 2], m0 + movu [r0 + r2], m0 + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + r1], m0 + movu [r0 + r1 * 2], m0 + movu [r0 + r2], m0 + RET + +;; angle 16, modes 9 and 27 +cglobal ang16_mode_9_27 + test r6d, r6d + + vbroadcasti32x8 m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + vbroadcasti32x8 m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + vbroadcasti32x8 m2, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + vbroadcasti32x8 m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 
15 14 13 12 11 10] + punpcklwd m2, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + + movu ym16, [r3 - 14 * 32] ; [2] + vinserti32x8 m16, [r3 - 12 * 32], 1 ; [4] + pmaddwd m4, m3, m16 + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, m16 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + movu ym16, [r3 - 10 * 32] ; [6] + vinserti32x8 m16, [r3 - 8 * 32], 1 ; [8] + pmaddwd m6, m3, m16 + paddd m6, m15 + psrld m6, 5 + pmaddwd m9, m0, m16 + paddd m9, m15 + psrld m9, 5 + packusdw m6, m9 + vextracti32x8 ym7, m6, 1 + movu ym16, [r3 - 6 * 32] ; [10] + vinserti32x8 m16, [r3 - 4 * 32], 1 ; [12] + pmaddwd m8, m3, m16 + paddd m8, m15 + psrld m8, 5 + pmaddwd m9, m0, m16 + paddd m9, m15 + psrld m9, 5 + packusdw m8, m9 + vextracti32x8 ym9, m8, 1 + movu ym16, [r3 - 2 * 32] ; [14] + vinserti32x8 m16, [r3], 1 ; [16] + pmaddwd m10, m3, m16 + paddd m10, m15 + psrld m10, 5 + pmaddwd m1, m0, m16 + paddd m1, m15 + psrld m1, 5 + packusdw m10, m1 + vextracti32x8 ym11, m10, 1 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 2, 1, 0 + + movu ym16, [r3 + 2 * 32] ; [18] + vinserti32x8 m16, [r3 + 4 * 32], 1 ; [20] + pmaddwd m4, m3, m16 + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, m16 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + movu ym16, [r3 + 6 * 32] ; [22] + vinserti32x8 m16, [r3 + 8 * 32], 1 ; [24] + pmaddwd m6, m3, m16 + paddd m6, m15 + psrld m6, 5 + pmaddwd m8, m0, m16 + paddd m8, m15 + psrld m8, 5 + packusdw m6, m8 + vextracti32x8 ym7, m6, 1 + movu ym16, [r3 + 10 * 32] ; [26] + vinserti32x8 m16, [r3 + 12 * 32], 1 ; [28] + pmaddwd m8, m3, m16 + paddd m8, m15 + psrld m8, 5 + pmaddwd m9, m0, m16 + paddd m9, m15 + psrld m9, 5 + packusdw m8, m9 + vextracti32x8 ym9, m8, 1 + movu ym16, [r3 + 14 * 32] ; [30] + pmaddwd ym3, ym16 + paddd ym3, ym15 + psrld ym3, 5 + pmaddwd ym0, ym16 + paddd ym0, ym15 + psrld ym0, 5 + packusdw ym3, ym0 + + movu ym1, [r2 + 4] + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 3, 1, 0, 2, 16 + ret + +cglobal intra_pred_ang32_9, 3,8,17 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + shl r1d, 1 + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + vbroadcasti32x8 m15, [pd_16] + + call ang16_mode_9_27 + add r2, 2 + lea r0, [r0 + 32] + call ang16_mode_9_27 + add r2, 30 + lea r0, [r7 + 8 * r1] + call ang16_mode_9_27 + add r2, 2 + lea r0, [r0 + 32] + call ang16_mode_9_27 + RET + +cglobal intra_pred_ang32_27, 3,7,17 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 16 * 32] + shl r1d, 1 + lea r4, [r1 * 3] + lea r5, [r0 + 32] + vbroadcasti32x8 m15, [pd_16] + + call ang16_mode_9_27 + add r2, 2 + call ang16_mode_9_27 + add r2, 30 + mov r0, r5 + call ang16_mode_9_27 + add r2, 2 + call ang16_mode_9_27 + RET +;; angle 16, modes 11 and 25 +cglobal ang16_mode_11_25 + test r6d, r6d + + vbroadcasti32x8 m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] + vbroadcasti32x8 m1, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + + punpcklwd m3, m0, m1 ; [12 11 11 10 10 9 9 8 4 3 3 2 2 1 1 0] + punpckhwd m0, m1 ; [16 15 15 14 14 13 13 12 8 7 7 6 6 5 5 4] + + movu ym16, [r3 + 14 * 32] ; [30] + vinserti32x8 m16, [r3 + 12 * 32], 1 ; [28] + pmaddwd m4, m3, m16 + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, m16 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + movu ym16, [r3 + 10 * 32] ; [26] + vinserti32x8 m16, [r3 + 8 * 32], 1 ; [24] + pmaddwd m6, m3, m16 + paddd m6, m15 + psrld m6, 5 + pmaddwd m9, m0, m16 + paddd m9, m15 + psrld m9, 5 + packusdw m6, m9 + vextracti32x8 ym7, m6, 1 + movu ym16, [r3 + 6 * 32] ; [22] + 
vinserti32x8 m16, [r3 + 4 * 32], 1 ; [20] + pmaddwd m8, m3, m16 + paddd m8, m15 + psrld m8, 5 + pmaddwd m9, m0, m16 + paddd m9, m15 + psrld m9, 5 + packusdw m8, m9 + vextracti32x8 ym9, m8, 1 + movu ym16, [r3 + 2 * 32] ; [18] + vinserti32x8 m16, [r3], 1 ; [16] + pmaddwd m10, m3, m16 + paddd m10, m15 + psrld m10, 5 + pmaddwd m1, m0, m16 + paddd m1, m15 + psrld m1, 5 + packusdw m10, m1 + vextracti32x8 ym11, m10, 1 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 2, 1, 0 + + movu ym16, [r3 - 2 * 32] ; [14] + vinserti32x8 m16, [r3 - 4 * 32], 1 ; [12] + pmaddwd m4, m3, m16 + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, m16 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + movu ym16, [r3 - 6 * 32] ; [10] + vinserti32x8 m16, [r3 - 8 * 32], 1 ; [8] + pmaddwd m6, m3, m16 + paddd m6, m15 + psrld m6, 5 + pmaddwd m8, m0, m16 + paddd m8, m15 + psrld m8, 5 + packusdw m6, m8 + vextracti32x8 ym7, m6, 1 + movu ym16, [r3 - 10 * 32] ; [6] + vinserti32x8 m16, [r3 - 12 * 32], 1 ; [4] + pmaddwd m8, m3, m16 + paddd m8, m15 + psrld m8, 5 + pmaddwd m9, m0, m16 + paddd m9, m15 + psrld m9, 5 + packusdw m8, m9 + vextracti32x8 ym9, m8, 1 + pmaddwd ym3, [r3 - 14 * 32] ; [2] + paddd ym3, ym15 + psrld ym3, 5 + pmaddwd ym0, [r3 - 14 * 32] + paddd ym0, ym15 + psrld ym0, 5 + packusdw ym3, ym0 + + movu ym1, [r2] + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 3, 1, 0, 2, 16 + ret + +cglobal intra_pred_ang32_11, 3,8,17, 0-8 + movzx r5d, word [r2 + 128] ; [0] + movzx r6d, word [r2] + mov [rsp], r5w + mov [r2 + 128], r6w + + movzx r5d, word [r2 + 126] ; [16] + movzx r6d, word [r2 + 32] + mov [rsp + 4], r5w + mov [r2 + 126], r6w + vbroadcasti32x8 m15, [pd_16] + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + shl r1d, 1 + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + + call ang16_mode_11_25 + sub r2, 2 + lea r0, [r0 + 32] + call ang16_mode_11_25 + add r2, 34 + lea r0, [r7 + 8 * r1] + call ang16_mode_11_25 + sub r2, 2 + lea r0, [r0 + 32] + call ang16_mode_11_25 + mov r6d, [rsp] + mov [r2 - 30], r6w + mov r6d, [rsp + 4] + mov [r2 - 32], r6w + RET + +cglobal intra_pred_ang32_25, 3,7,17, 0-4 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 16 * 32] + shl r1d, 1 + vbroadcasti32x8 m15, [pd_16] + movzx r4d, word [r2 - 2] + movzx r5d, word [r2 + 160] ; [16] + mov [rsp], r4w + mov [r2 - 2], r5w + + lea r4, [r1 * 3] + lea r5, [r0 + 32] + call ang16_mode_11_25 + sub r2, 2 + call ang16_mode_11_25 + add r2, 34 + mov r0, r5 + call ang16_mode_11_25 + sub r2, 2 + call ang16_mode_11_25 + mov r5d, [rsp] + mov [r2 - 32], r5w + RET + +cglobal intra_pred_ang16_9, 3,7,17 + add r2, 64 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + shl r1d, 1 + lea r4, [r1 * 3] + vbroadcasti32x8 m15, [pd_16] + call ang16_mode_9_27 + RET + +cglobal intra_pred_ang16_27, 3,7,17 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 16 * 32] + shl r1d, 1 + lea r4, [r1 * 3] + vbroadcasti32x8 m15, [pd_16] + call ang16_mode_9_27 + RET + +cglobal intra_pred_ang16_11, 3,7,17, 0-4 + movzx r5d, word [r2 + 64] + movzx r6d, word [r2] + mov [rsp], r5w + mov [r2 + 64], r6w + vbroadcasti32x8 m15, [pd_16] + add r2, 64 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + shl r1d, 1 + lea r4, [r1 * 3] + call ang16_mode_11_25 + mov r6d, [rsp] + mov [r2], r6w + RET + +cglobal intra_pred_ang16_25, 3,7,17 + xor r6d, r6d + inc r6d + vbroadcasti32x8 m15, [pd_16] + lea r3, [ang_table_avx2 + 16 * 32] + shl r1d, 1 + lea r4, [r1 * 3] + call ang16_mode_11_25 + RET +cglobal ang16_mode_5_31 + test r6d, r6d + + vbroadcasti32x8 m0, [r2 + 2] ; [16 15 14 13 12 
11 10 9 8 7 6 5 4 3 2 1] + vbroadcasti32x8 m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + vbroadcasti32x8 m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + vbroadcasti32x8 m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + pmaddwd m4, m3, [r3 + 1 * 32] ; [17] + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, [r3 + 1 * 32] + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + + movu ym16, [r3 - 14 * 32] ; [2] + vinserti32x8 m16, [r3 + 3 * 32] ,1 ; [19] + palignr m6, m0, m3, 4 + pmaddwd m5, m6, m16 + paddd m5, m15 + psrld m5, 5 + palignr m7, m2, m0, 4 + pmaddwd m8, m7, m16 + paddd m8, m15 + psrld m8, 5 + packusdw m5, m8 + vextracti32x8 ym6, m5, 1 + + palignr m8, m0, m3, 8 + palignr m9, m2, m0, 8 + movu ym16, [r3 - 12 * 32] ; [4] + vinserti32x8 m16, [r3 + 5 * 32],1 ; [21] + pmaddwd m7, m8, m16 + paddd m7, m15 + psrld m7, 5 + pmaddwd m10, m9,m16 + paddd m10, m15 + psrld m10, 5 + packusdw m7, m10 + vextracti32x8 ym8, m7, 1 + + palignr m10, m0, m3, 12 + palignr m11, m2, m0, 12 + movu ym16,[r3 - 10 * 32] ; [6] + vinserti32x8 m16, [r3 + 7 * 32] ,1 ; [23] + pmaddwd m9, m10, m16 + paddd m9, m15 + psrld m9, 5 + pmaddwd m3, m11, m16 + paddd m3, m15 + psrld m3, 5 + packusdw m9, m3 + vextracti32x8 ym10, m9, 1 + + pmaddwd m11, m0, [r3 - 8 * 32] ; [8] + paddd m11, m15 + psrld m11, 5 + pmaddwd m3, m2, [r3 - 8 * 32] + paddd m3, m15 + psrld m3, 5 + packusdw m11, m3 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 3, 0 + + pmaddwd m4, m0, [r3 + 9 * 32] ; [25] + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m2, [r3 + 9 * 32] + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m2, m0, 4 + movu ym16, [r3 - 6 * 32] ; [10] + vinserti32x8 m16, [r3 + 11 * 32] ,1 ; [27] + pmaddwd m5, m6,m16 + paddd m5, m15 + psrld m5, 5 + palignr m7, m1, m2, 4 + pmaddwd m3, m7,m16 + paddd m3, m15 + psrld m3, 5 + packusdw m5, m3 + vextracti32x8 ym6, m5, 1 + + palignr m8, m2, m0, 8 + palignr m9, m1, m2, 8 + movu ym16, [r3 - 4 * 32] ; [12] + vinserti32x8 m16, [r3 + 13 * 32] ,1 ; [29] + pmaddwd m7, m8, m16 + paddd m7, m15 + psrld m7, 5 + pmaddwd m3, m9, m16 + paddd m3, m15 + psrld m3, 5 + packusdw m7, m3 + vextracti32x8 ym8, m7, 1 + + + palignr m10, m2, m0, 12 + palignr m11, m1, m2, 12 + movu ym16, [r3 - 2 * 32] ; [14] + vinserti32x8 m16, [r3 + 15 * 32],1 ; [31] + pmaddwd m9, m10, m16 + paddd m9, m15 + psrld m9, 5 + pmaddwd m3, m11, m16 + paddd m3, m15 + psrld m3, 5 + packusdw m9, m3 + vextracti32x8 ym10, m9, 1 + + pmaddwd m2, [r3] ; [16] + paddd m2, m15 + psrld m2, 5 + pmaddwd m1, [r3] + paddd m1, m15 + psrld m1, 5 + packusdw m2, m1 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 2, 0, 1, 16 + ret +;; angle 32, modes 5 and 31 +cglobal ang32_mode_5_31 + test r6d, r6d + + vbroadcasti32x8 m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + vbroadcasti32x8 m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + vbroadcasti32x8 m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + vbroadcasti32x8 m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 
23 23 22 22 21 17 16 16 15 15 14 14 13] + + movu ym16, [r3 - 15 * 32] ; [1] + vinserti32x8 m16, [r3 + 2 * 32],1 ; [18] + pmaddwd m4, m3, m16 + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, m16 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + + + palignr m7, m0, m3, 4 + movu ym16, [r3 - 13 * 32] ; [3] + vinserti32x8 m16, [r3 + 4 * 32] ,1 ; [20] + pmaddwd m6, m7, m16 + paddd m6, m15 + psrld m6, 5 + palignr m8, m2, m0, 4 + pmaddwd m9, m8,m16 + paddd m9, m15 + psrld m9, 5 + packusdw m6, m9 + vextracti32x8 ym7, m6, 1 + + + palignr m9, m0, m3, 8 + movu ym16, [r3 - 11 * 32] ; [5] + vinserti32x8 m16, [r3 + 6 * 32] ,1 ; [22] + pmaddwd m8, m9,m16 + paddd m8, m15 + psrld m8, 5 + palignr m10, m2, m0, 8 + pmaddwd m11, m10,m16 + paddd m11, m15 + psrld m11, 5 + packusdw m8, m11 + vextracti32x8 ym9, m8, 1 + + + palignr m11, m0, m3, 12 + movu ym16, [r3 - 9 * 32] ; [7] + vinserti32x8 m16, [r3 + 8 * 32] ,1 ; [24] + pmaddwd m10, m11,m16 + paddd m10, m15 + psrld m10, 5 + palignr m12, m2, m0, 12 + pmaddwd m3, m12, m16 + paddd m3, m15 + psrld m3, 5 + packusdw m10, m3 + vextracti32x8 ym11, m10, 1 + + + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 3, 0 + + movu ym16, [r3 - 7 * 32] ; [9] + vinserti32x8 m16, [r3 + 10 * 32] ,1 ; [26] + pmaddwd m4, m0, m16 + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m2, m16 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + + + palignr m7, m2, m0, 4 + movu ym16, [r3 - 5 * 32] ; [11] + vinserti32x8 m16, [r3 + 12 * 32],1 ; [28] + pmaddwd m6, m7, m16 + paddd m6, m15 + psrld m6, 5 + palignr m8, m1, m2, 4 + pmaddwd m9, m8,m16 + paddd m9, m15 + psrld m9, 5 + packusdw m6, m9 + vextracti32x8 ym7, m6, 1 + + palignr m9, m2, m0, 8 + movu ym16, [r3 - 3 * 32] ; [13] + vinserti32x8 m16, [r3 + 14 * 32] ,1 ; [30] + pmaddwd m8, m9, m16 + paddd m8, m15 + psrld m8, 5 + palignr m3, m1, m2, 8 + pmaddwd m10, m3, m16 + paddd m10, m15 + psrld m10, 5 + packusdw m8, m10 + vextracti32x8 ym9, m8, 1 + + + + palignr m10, m2, m0, 12 + pmaddwd m10, [r3 - 1 * 32] ; [15] + paddd m10, m15 + psrld m10, 5 + palignr m11, m1, m2, 12 + pmaddwd m11, [r3 - 1 * 32] + paddd m11, m15 + psrld m11, 5 + packusdw m10, m11 + + pmaddwd m2, [r3 - 16 * 32] ; [0] + paddd m2, m15 + psrld m2, 5 + pmaddwd m1, [r3 - 16 * 32] + paddd m1, m15 + psrld m1, 5 + packusdw m2, m1 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 2, 0, 1, 16 + ret +cglobal intra_pred_ang32_5, 3,8,17 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + vbroadcasti32x8 m15, [pd_16] + call ang16_mode_5_31 + + add r2, 18 + lea r0, [r0 + 32] + + call ang32_mode_5_31 + + add r2, 14 + lea r0, [r7 + 8 * r1] + + call ang16_mode_5_31 + vbroadcasti32x8 m15, [pd_16] + add r2, 18 + lea r0, [r0 + 32] + call ang32_mode_5_31 + RET +cglobal intra_pred_ang32_31, 3,7,17 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r5, [r0 + 32] + vbroadcasti32x8 m15, [pd_16] + call ang16_mode_5_31 + + add r2, 18 + + call ang32_mode_5_31 + + add r2, 14 + mov r0, r5 + + call ang16_mode_5_31 + + add r2, 18 + call ang32_mode_5_31 + RET +cglobal intra_pred_ang16_5, 3,7,17 + add r2, 64 + xor r6d, r6d + vbroadcasti32x8 m15, [pd_16] + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + call ang16_mode_5_31 + RET +cglobal intra_pred_ang16_31, 3,7,17 + xor r6d, r6d + inc r6d + vbroadcasti32x8 m15, [pd_16] + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + call ang16_mode_5_31 + RET +;; 
angle 16, modes 4 and 32 +cglobal ang16_mode_4_32 + test r6d, r6d + + vbroadcasti32x8 m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + vbroadcasti32x8 m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + vbroadcasti32x8 m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + vbroadcasti32x8 m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + pmaddwd m4, m3, [r3 + 3 * 32] ; [21] + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, [r3 + 3 * 32] + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m0, m3, 4 ; [14 13 13 12 12 11 11 10 6 5 5 4 4 3 3 2] + palignr m7, m2, m0, 4 ; [18 17 17 16 16 15 15 14 10 9 9 8 8 7 7 6] + movu ym16,[r3 - 8 * 32] ; [10] + vinserti32x8 m16, [r3 + 13 * 32] ,1 ; [31] + pmaddwd m5, m6, m16 + paddd m5, m15 + psrld m5, 5 + pmaddwd m8, m7,m16 + paddd m8, m15 + psrld m8, 5 + packusdw m5, m8 + vextracti32x8 ym6, m5, 1 + + + palignr m7, m0, m3, 8 ; [15 14 14 13 13 12 12 11 7 6 6 5 5 4 4 3] + pmaddwd m7, [r3 + 2 * 32] ; [20] + paddd m7, m15 + psrld m7, 5 + palignr m8, m2, m0, 8 ; [19 18 18 17 17 16 16 15 11 10 10 9 9 8 8 7] + pmaddwd m8, [r3 + 2 * 32] + paddd m8, m15 + psrld m8, 5 + packusdw m7, m8 + + palignr m9, m0, m3, 12 + palignr m3, m2, m0, 12 + movu ym16,[r3 - 9 * 32] ; [9] + vinserti32x8 m16, [r3 + 12 * 32] ,1 ; [30] + pmaddwd m8, m9, m16 + paddd m8, m15 + psrld m8, 5 + pmaddwd m10, m3,m16 + paddd m10,m15 + psrld m10, 5 + packusdw m8, m10 + vextracti32x8 ym9, m8, 1 + + + pmaddwd m10, m0, [r3 + 1 * 32] ; [19] + paddd m10,m15 + psrld m10, 5 + pmaddwd m3, m2, [r3 + 1 * 32] + paddd m3, m15 + psrld m3, 5 + packusdw m10, m3 + + palignr m11, m2, m0, 4 + pmaddwd m11, [r3 - 10 * 32] ; [8] + paddd m11, m15 + psrld m11, 5 + palignr m3, m1, m2, 4 + pmaddwd m3, [r3 - 10 * 32] + paddd m3, m15 + psrld m3, 5 + packusdw m11, m3 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 3, 0 + + palignr m4, m2, m0, 4 + pmaddwd m4, [r3 + 11 * 32] ; [29] + paddd m4, m15 + psrld m4, 5 + palignr m5, m1, m2, 4 + pmaddwd m5, [r3 + 11 * 32] + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + + palignr m5, m2, m0, 8 + pmaddwd m5, [r3] ; [18] + paddd m5, m15 + psrld m5, 5 + palignr m6, m1, m2, 8 + pmaddwd m6, [r3] + paddd m6, m15 + psrld m6, 5 + packusdw m5, m6 + palignr m7, m2, m0, 12 + palignr m8, m1, m2, 12 + movu ym16,[r3 - 11 * 32] ; [7] + vinserti32x8 m16, [r3 + 10 * 32],1 ; [28] + pmaddwd m6, m7, m16 + paddd m6, m15 + psrld m6, 5 + palignr m8, m1, m2, 12 + pmaddwd m3, m8, m16 + paddd m3,m15 + psrld m3, 5 + packusdw m6, m3 + vextracti32x8 ym7, m6, 1 + + movu m0, [r2 + 34] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + pmaddwd m8, m2, [r3 - 1 * 32] ; [17] + paddd m8, m15 + psrld m8, 5 + pmaddwd m9, m1, [r3 - 1 * 32] + paddd m9, m15 + psrld m9, 5 + packusdw m8, m9 + + palignr m3, m0, m0, 2 ; [ x 32 31 30 29 28 27 26 x 24 23 22 21 20 19 18] + punpcklwd m0, m3 ; [29 29 28 28 27 27 26 22 21 20 20 19 19 18 18 17] + + palignr m10, m1, m2, 4 + pmaddwd m9, m10, [r3 - 12 * 32] ; [6] + paddd m9, m15 + psrld m9, 5 + palignr m11, m0, m1, 4 + pmaddwd m3, m11, [r3 - 12 * 32] + paddd m3, m15 + psrld m3, 5 + packusdw m9, m3 + + pmaddwd m10, [r3 + 9 * 32] ; [27] + paddd m10,m15 + psrld m10, 5 + pmaddwd m11, [r3 + 9 * 32] + paddd m11, m15 + psrld m11, 5 + packusdw m10, m11 + + 
palignr m3, m1, m2, 8 + pmaddwd m3, [r3 - 2 * 32] ; [16] + paddd m3, m15 + psrld m3, 5 + palignr m0, m1, 8 + pmaddwd m0, [r3 - 2 * 32] + paddd m0,m15 + psrld m0, 5 + packusdw m3, m0 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 3, 0, 1, 16 + ret +;; angle 32, modes 4 and 32 +cglobal ang32_mode_4_32 + test r6d, r6d + + vbroadcasti32x8 m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + vbroadcasti32x8 m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + vbroadcasti32x8 m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + vbroadcasti32x8 m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + movu ym16, [r3 - 13 * 32] ; [5] + vinserti32x8 m16, [r3 + 8 * 32],1 ; [26] + pmaddwd m4, m3, m16 + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0,m16 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + + palignr m6, m0, m3, 4 ; [14 13 13 12 12 11 11 10 6 5 5 4 4 3 3 2] + pmaddwd m6, [r3 - 3 * 32] ; [15] + paddd m6, m15 + psrld m6, 5 + palignr m7, m2, m0, 4 ; [18 17 17 16 16 15 15 14 10 9 9 8 8 7 7 6] + pmaddwd m7, [r3 - 3 * 32] + paddd m7, m15 + psrld m7, 5 + packusdw m6, m7 + + palignr m8, m0, m3, 8 ; [15 14 14 13 13 12 12 11 7 6 6 5 5 4 4 3] + palignr m9, m2, m0, 8 ; [19 18 18 17 17 16 16 15 11 10 10 9 9 8 8 7] + movu ym16, [r3 - 14 * 32] ; [4] + vinserti32x8 m16, [r3 + 7 * 32] ,1 ; [25] + pmaddwd m7, m8, m16 + paddd m7, m15 + psrld m7, 5 + pmaddwd m10, m9, m16 + paddd m10, m15 + psrld m10, 5 + packusdw m7, m10 + vextracti32x8 ym8, m7, 1 + + palignr m9, m0, m3, 12 + pmaddwd m9, [r3 - 4 * 32] ; [14] + paddd m9, m15 + psrld m9, 5 + palignr m3, m2, m0, 12 + pmaddwd m3, [r3 - 4 * 32] + paddd m3,m15 + psrld m3, 5 + packusdw m9, m3 + + movu ym16, [r3 - 15 * 32] ; [3] + vinserti32x8 m16, [r3 + 6 * 32] ,1 ; [24] + pmaddwd m10, m0, m16 + paddd m10, m15 + psrld m10, 5 + pmaddwd m3, m2, m16 + paddd m3,m15 + psrld m3, 5 + packusdw m10, m3 + vextracti32x8 ym11, m10, 1 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 3, 0 + + palignr m4, m2, m0, 4 + pmaddwd m4, [r3 - 5* 32] ; [13] + paddd m4, m15 + psrld m4, 5 + palignr m5, m1, m2, 4 + pmaddwd m5, [r3 - 5 * 32] + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m2, m0, 8 + palignr m7, m1, m2, 8 + movu ym16, [r3 - 16 * 32] ; [2] + vinserti32x8 m16, [r3 + 5 * 32] ,1 ; [23] + pmaddwd m5, m6, m16 + paddd m5, m15 + psrld m5, 5 + palignr m7, m1, m2, 8 + pmaddwd m8, m7,m16 + paddd m8, m15 + psrld m8, 5 + packusdw m5, m8 + vextracti32x8 ym6, m5, 1 + + + palignr m7, m2, m0, 12 + pmaddwd m7, [r3 - 6 * 32] ; [12] + paddd m7, m15 + psrld m7, 5 + palignr m8, m1, m2, 12 + pmaddwd m8, [r3 - 6 * 32] + paddd m8, m15 + psrld m8, 5 + packusdw m7, m8 + + movu m0, [r2 + 34] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + pmaddwd m8, m2, [r3 - 17 * 32] ; [1] + paddd m8, m15 + psrld m8, 5 + pmaddwd m9, m1, [r3 - 17 * 32] + paddd m9, m15 + psrld m9, 5 + packusdw m8, m9 + + palignr m3, m0, m0, 2 ; [ x 32 31 30 29 28 27 26 x 24 23 22 21 20 19 18] + punpcklwd m0, m3 ; [29 29 28 28 27 27 26 22 21 20 20 19 19 18 18 17] + + pmaddwd m9, m2, [r3 + 4 * 32] ; [22] + paddd m9, m15 + psrld m9, 5 + pmaddwd m3, m1, [r3 + 4 * 32] + paddd m3, m15 + psrld m3, 5 + packusdw m9, m3 + + palignr m10, m1, m2, 4 + pmaddwd m10, [r3 - 7 * 32] ; [11] + 
paddd m10, m15 + psrld m10, 5 + palignr m11, m0, m1, 4 + pmaddwd m11, [r3 - 7 * 32] + paddd m11, m15 + psrld m11, 5 + packusdw m10, m11 + + palignr m3, m1, m2, 8 + pmaddwd m3, [r3 - 18 * 32] ; [0] + paddd m3, m15 + psrld m3, 5 + palignr m0, m1, 8 + pmaddwd m0, [r3 - 18 * 32] + paddd m0, m15 + psrld m0, 5 + packusdw m3, m0 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 3, 0, 1, 16 + ret +cglobal intra_pred_ang32_4, 3,8,17 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 18 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + vbroadcasti32x8 m15, [pd_16] + call ang16_mode_4_32 + + add r2, 22 + lea r0, [r0 + 32] + + call ang32_mode_4_32 + + add r2, 10 + lea r0, [r7 + 8 * r1] + + call ang16_mode_4_32 + + add r2, 22 + lea r0, [r0 + 32] + call ang32_mode_4_32 + RET +cglobal intra_pred_ang32_32, 3,7,17 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 18 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r5, [r0 + 32] + vbroadcasti32x8 m15, [pd_16] + call ang16_mode_4_32 + + add r2, 22 + + call ang32_mode_4_32 + + add r2, 10 + mov r0, r5 + + call ang16_mode_4_32 + add r2, 22 + call ang32_mode_4_32 + RET +cglobal intra_pred_ang16_4, 3,7,17 + add r2, 64 + xor r6d, r6d + vbroadcasti32x8 m15, [pd_16] + lea r3, [ang_table_avx2 + 18 * 32] + add r1d, r1d + lea r4, [r1 * 3] + call ang16_mode_4_32 + RET +cglobal intra_pred_ang16_32, 3,7,17 + xor r6d, r6d + inc r6d + vbroadcasti32x8 m15, [pd_16] + lea r3, [ang_table_avx2 + 18 * 32] + shl r1d, 1 + lea r4, [r1 * 3] + call ang16_mode_4_32 + RET +;; angle 16, modes 6 and 30 +cglobal ang16_mode_6_30 + test r6d, r6d + + vbroadcasti32x8 m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + vbroadcasti32x8 m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + vbroadcasti32x8 m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + vbroadcasti32x8 m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + movu ym16, [r3 - 2 * 32] ; [13] + vinserti32x8 m16, [r3 + 11 * 32] ,1 ; [26] + pmaddwd m4, m3, m16 + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, m16 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + + palignr m7, m0, m3, 4 + palignr m8, m2, m0, 4 + movu ym16, [r3 - 8 * 32] ; [7] + vinserti32x8 m16, [r3 + 5 * 32] ,1 ; [20] + pmaddwd m6, m7, m16 + paddd m6, m15 + psrld m6, 5 + pmaddwd m9, m8, m16 + paddd m9, m15 + psrld m9, 5 + packusdw m6, m9 + vextracti32x8 ym7, m6, 1 + + palignr m10, m0, m3, 8 + palignr m11, m2, m0, 8 + movu ym16, [r3 - 14 * 32] ; [1] + vinserti32x8 m16, [r3 - 1 * 32],1 ; [14] + pmaddwd m8, m10, m16 + paddd m8,m15 + psrld m8, 5 + palignr m11, m2, m0, 8 + pmaddwd m9, m11, m16 + paddd m9, m15 + psrld m9, 5 + packusdw m8, m9 + vextracti32x8 ym9, m8, 1 + + pmaddwd m10, [r3 + 12 * 32] ; [27] + paddd m10,m15 + psrld m10, 5 + pmaddwd m11, [r3 + 12 * 32] + paddd m11, m15 + psrld m11, 5 + packusdw m10, m11 + + palignr m11, m0, m3, 12 + pmaddwd m11, [r3 - 7 * 32] ; [8] + paddd m11, m15 + psrld m11, 5 + palignr m12, m2, m0, 12 + pmaddwd m12, [r3 - 7 * 32] + paddd m12, m15 + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + palignr m4, m0, m3, 12 + pmaddwd m4, [r3 + 6 * 32] ; [21] + paddd m4, m15 + psrld m4, 5 + palignr m5, m2, m0, 12 + pmaddwd m5, [r3 + 6 * 32] + 
paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + + movu ym16, [r3 - 13 * 32] ; [2] + vinserti32x8 m16, [r3] ,1 ; [15] + pmaddwd m5, m0, m16 + paddd m5, m15 + psrld m5, 5 + pmaddwd m3, m2,m16 + paddd m3, m15 + psrld m3, 5 + packusdw m5, m3 + vextracti32x8 ym6, m5, 1 + + pmaddwd m7, m0, [r3 + 13 * 32] ; [28] + paddd m7, m15 + psrld m7, 5 + pmaddwd m3, m2, [r3 + 13 * 32] + paddd m3, m15 + psrld m3, 5 + packusdw m7, m3 + + palignr m9, m2, m0, 4 + palignr m3, m1, m2, 4 + movu ym16, [r3 - 6 * 32] ; [9] + vinserti32x8 m16, [r3 + 7 * 32],1 ; [22] + pmaddwd m8, m9, m16 + paddd m8, m15 + psrld m8, 5 + pmaddwd m10, m3, m16 + paddd m10,m15 + psrld m10, 5 + packusdw m8, m10 + vextracti32x8 ym9, m8, 1 + + + palignr m11, m2, m0, 8 + pmaddwd m10, m11, [r3 - 12 * 32] ; [3] + paddd m10, m15 + psrld m10, 5 + palignr m3, m1, m2, 8 + pmaddwd m12, m3, [r3 - 12 * 32] + paddd m12, m15 + psrld m12, 5 + packusdw m10, m12 + + pmaddwd m11, [r3 + 1 * 32] ; [16] + paddd m11, m15 + psrld m11, 5 + pmaddwd m3, [r3 + 1 * 32] + paddd m3, m15 + psrld m3, 5 + packusdw m11, m3 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 16 + ret +;; angle 32, modes 6 and 30 +cglobal ang32_mode_6_30 + test r6d, r6d + + vbroadcasti32x8 m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + vbroadcasti32x8 m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + vbroadcasti32x8 m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + vbroadcasti32x8 m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + pmaddwd m4, m3, [r3 + 14 * 32] ; [29] + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, [r3 + 14 * 32] + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m0, m3, 4 + palignr m7, m2, m0, 4 + movu ym16, [r3 - 5 * 32] ; [10] + vinserti32x8 m16, [r3 + 8 * 32] ,1 ; [23] + pmaddwd m5, m6, m16 + paddd m5, m15 + psrld m5, 5 + pmaddwd m8, m7, m16 + paddd m8, m15 + psrld m8, 5 + packusdw m5, m8 + vextracti32x8 ym6, m5, 1 + + palignr m9, m0, m3, 8 + palignr m12, m2, m0, 8 + movu ym16, [r3 - 11 * 32] ; [4] + vinserti32x8 m16, [r3 + 2 * 32] ,1 ; [17] + pmaddwd m7, m9, m16 + paddd m7,m15 + psrld m7, 5 + palignr m12, m2, m0, 8 + pmaddwd m11, m12,m16 + paddd m11,m15 + psrld m11, 5 + packusdw m7, m11 + vextracti32x8 ym8, m7, 1 + + pmaddwd m9, [r3 + 15 * 32] ; [30] + paddd m9, m15 + psrld m9, 5 + pmaddwd m12, [r3 + 15 * 32] + paddd m12, m15 + psrld m12, 5 + packusdw m9, m12 + + palignr m11, m0, m3, 12 + pmaddwd m10, m11, [r3 - 4 * 32] ; [11] + paddd m10, m15 + psrld m10, 5 + palignr m12, m2, m0, 12 + pmaddwd m3, m12, [r3 - 4 * 32] + paddd m3, m15 + psrld m3, 5 + packusdw m10, m3 + + pmaddwd m11, [r3 + 9 * 32] ; [24] + paddd m11, m15 + psrld m11, 5 + pmaddwd m12, [r3 + 9 * 32] + paddd m12,m15 + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + movu ym16, [r3 - 10 * 32] ; [5] + vinserti32x8 m16, [r3 + 3 * 32] ,1 ; [18] + pmaddwd m4, m0, m16 + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m2, m16 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + + pmaddwd m6, m0, [r3 + 16 * 32] ; [31] + paddd m6,m15 + psrld m6, 5 + pmaddwd m7, m2, [r3 + 16 * 32] + paddd m7,m15 + psrld m7, 5 + packusdw m6, m7 + + palignr m8, m2, m0, 4 + palignr m9, m1, m2, 4 + movu ym16, [r3 - 3 * 32] ; [12] + 
vinserti32x8 m16, [r3 + 10 * 32],1 ; [25] + pmaddwd m7, m8,m16 + paddd m7,m15 + psrld m7, 5 + pmaddwd m3, m9, m16 + paddd m3, m15 + psrld m3, 5 + packusdw m7, m3 + vextracti32x8 ym8, m7, 1 + + palignr m10, m2, m0, 8 + palignr m12, m1, m2, 8 + movu ym16, [r3 - 9 * 32] ; [6] + vinserti32x8 m16, [r3 + 4 * 32] ,1 ; [19] + pmaddwd m9, m10, m16 + paddd m9, m15 + psrld m9, 5 + pmaddwd m3, m12,m16 + paddd m3, m15 + psrld m3, 5 + packusdw m9, m3 + vextracti32x8 ym10, m9, 1 + + + palignr m11, m2, m0, 12 + pmaddwd m11, [r3 - 15 * 32] ; [0] + paddd m11, m15 + psrld m11, 5 + palignr m3, m1, m2, 12 + pmaddwd m3, [r3 - 15 * 32] + paddd m3, m15 + psrld m3, 5 + packusdw m11, m3 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 16 + ret +cglobal intra_pred_ang32_6, 3,8,17 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + vbroadcasti32x8 m15, [pd_16] + call ang16_mode_6_30 + + add r2, 12 + lea r0, [r0 + 32] + + call ang32_mode_6_30 + + add r2, 20 + lea r0, [r7 + 8 * r1] + + call ang16_mode_6_30 + + add r2, 12 + lea r0, [r0 + 32] + call ang32_mode_6_30 + RET +cglobal intra_pred_ang32_30, 3,7,17 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r5, [r0 + 32] + vbroadcasti32x8 m15, [pd_16] + call ang16_mode_6_30 + + add r2, 12 + + call ang32_mode_6_30 + + add r2, 20 + mov r0, r5 + + call ang16_mode_6_30 + + add r2, 12 + call ang32_mode_6_30 + RET +cglobal intra_pred_ang16_6, 3,7,17 + add r2, 64 + xor r6d, r6d + vbroadcasti32x8 m15, [pd_16] + lea r3, [ang_table_avx2 + 15 * 32] + shl r1d, 1 + lea r4, [r1 * 3] + call ang16_mode_6_30 + RET +cglobal intra_pred_ang16_30, 3,7,17 + xor r6d, r6d + inc r6d + vbroadcasti32x8 m15, [pd_16] + lea r3, [ang_table_avx2 + 15 * 32] + shl r1d, 1 + lea r4, [r1 * 3] + call ang16_mode_6_30 + RET + +;; angle 16, modes 8 and 28 +cglobal ang16_mode_8_28 + test r6d, r6d + + vbroadcasti32x8 m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + vbroadcasti32x8 m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + vbroadcasti32x8 m2, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + vbroadcasti32x8 m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + + movu ym14, [r3 - 10 * 32] + vinserti32x8 m14, [r3 - 5 * 32], 1 + pmaddwd m4, m3, m14 ; [5], [10] + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, m14 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + + movu ym14, [r3] + vinserti32x8 m14, [r3 + 5 * 32], 1 + pmaddwd m6, m3, m14 ; [15], [20] + paddd m6, m15 + psrld m6, 5 + pmaddwd m9, m0, m14 + paddd m9, m15 + psrld m9, 5 + packusdw m6, m9 + vextracti32x8 ym7, m6, 1 + + movu ym14, [r3 + 10 * 32] + vinserti32x8 m14, [r3 + 15 * 32], 1 + pmaddwd m8, m3, m14 ; [25], [30] + paddd m8, m15 + psrld m8, 5 + pmaddwd m9, m0, m14 + paddd m9, m15 + psrld m9, 5 + packusdw m8, m9 + vextracti32x8 ym9, m8, 1 + + palignr m11, m0, m3, 4 + movu ym14, [r3 - 12 * 32] + vinserti32x8 m14, [r3 - 7 * 32], 1 + pmaddwd m10, m11, m14 ; [3], [8] + paddd m10, m15 + psrld m10, 5 + palignr m1, m2, m0, 4 + pmaddwd m12, m1, m14 + paddd m12, m15 + psrld m12, 5 + packusdw m10, m12 + vextracti32x8 ym11, m10, 1 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 0 + + palignr m7, m0, m3, 4 + movu ym14, [r3 - 2 * 32] + vinserti32x8 m14, 
[r3 + 3 * 32], 1 + pmaddwd m4, m7, m14 ; [13], [18] + paddd m4, m15 + psrld m4, 5 + palignr m1, m2, m0, 4 + pmaddwd m5, m1, m14 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + + movu ym14, [r3 + 8 * 32] + vinserti32x8 m14, [r3 + 13 * 32], 1 + pmaddwd m6, m7, m14 ; [23], [28] + paddd m6, m15 + psrld m6, 5 + pmaddwd m8, m1, m14 + paddd m8, m15 + psrld m8, 5 + packusdw m6, m8 + vextracti32x8 ym7, m6, 1 + + movu ym14, [r3 - 14 * 32] + vinserti32x8 m14, [r3 - 9 * 32], 1 + palignr m1, m0, m3, 8 + pmaddwd m8, m1, m14 ; [1], [6] + paddd m8, m15 + psrld m8, 5 + palignr m2, m0, 8 + pmaddwd m9, m2, m14 + paddd m9, m15 + psrld m9, 5 + packusdw m8, m9 + vextracti32x8 ym9, m8, 1 + + movu ym14, [r3 - 4 * 32] + vinserti32x8 m14, [r3 + 1 * 32], 1 + pmaddwd m3, m1, m14 ; [11], [16] + paddd m3, m15 + psrld m3, 5 + pmaddwd m0, m2, m14 + paddd m0, m15 + psrld m0, 5 + packusdw m3, m0 + vextracti32x8 ym1, m3, 1 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 3, 1, 0, 2, 16 + ret + +;; angle 32, modes 8 and 28 +cglobal ang32_mode_8_28 + test r6d, r6d + + vbroadcasti32x8 m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + vbroadcasti32x8 m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + vbroadcasti32x8 m2, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + vbroadcasti32x8 m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + + movu ym14, [r3 + 6 * 32] + vinserti32x8 m14, [r3 + 11 * 32], 1 + pmaddwd m4, m3, m14 ; [21], [26] + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, m14 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + + pmaddwd m6, m3, [r3 + 16 * 32] ; [31] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m0, [r3 + 16 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + palignr m11, m0, m3, 4 + movu ym14, [r3 - 11 * 32] + vinserti32x8 m14, [r3 - 6 * 32], 1 + pmaddwd m7, m11, m14 ; [4], [9] + paddd m7, m15 + psrld m7, 5 + palignr m1, m2, m0, 4 + pmaddwd m8, m1, m14 + paddd m8, m15 + psrld m8, 5 + packusdw m7, m8 + vextracti32x8 ym8, m7, 1 + + movu ym14, [r3 - 1 * 32] + vinserti32x8 m14, [r3 + 4 * 32], 1 + pmaddwd m9, m11, m14 ; [14], [19] + paddd m9, m15 + psrld m9, 5 + pmaddwd m10, m1, m14 + paddd m10, m15 + psrld m10, 5 + packusdw m9, m10 + vextracti32x8 ym10, m9, 1 + + pmaddwd m11, [r3 + 9 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m1, [r3 + 9 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m11, m1 + +TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 0 + + palignr m4, m0, m3, 4 + pmaddwd m4, [r3 + 14 * 32] ; [29] + paddd m4, m15 + psrld m4, 5 + palignr m5, m2, m0, 4 + pmaddwd m5, [r3 + 14 * 32] + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + + palignr m1, m0, m3, 8 + pmaddwd m5, m1, [r3 - 13 * 32] ; [2] + paddd m5, m15 + psrld m5, 5 + palignr m10, m2, m0, 8 + pmaddwd m6, m10, [r3 - 13 * 32] + paddd m6, m15 + psrld m6, 5 + packusdw m5, m6 + + movu ym14, [r3 - 8 * 32] + vinserti32x8 m14, [r3 - 3 * 32], 1 + pmaddwd m6, m1, m14 ; [7], [12] + paddd m6, m15 + psrld m6, 5 + pmaddwd m8, m10, m14 + paddd m8, m15 + psrld m8, 5 + packusdw m6, m8 + vextracti32x8 ym7, m6, 1 + + movu ym14, [r3 + 2 * 32] + vinserti32x8 m14, [r3 + 7 * 32], 1 + pmaddwd m8, m1, m14 ; [17], [22] + paddd m8, m15 + psrld m8, 5 + pmaddwd m9, m10, m14 + paddd m9, m15 + psrld m9, 5 + packusdw m8, m9 + vextracti32x8 ym9, m8, 1 + 
+ pmaddwd m1, [r3 + 12 * 32] ; [27] + paddd m1, [pd_16] + psrld m1, 5 + pmaddwd m10, [r3 + 12 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m1, m10 + + palignr m11, m0, m3, 12 + pmaddwd m11, [r3 - 15 * 32] ; [0] + paddd m11, [pd_16] + psrld m11, 5 + palignr m2, m0, 12 + pmaddwd m2, [r3 - 15 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m11, m2 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 1, 11, 0, 2, 16 + ret + + +cglobal intra_pred_ang32_8, 3,8,16 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + vbroadcasti32x8 m15, [pd_16] + + call ang16_mode_8_28 + + add r2, 4 + lea r0, [r0 + 32] + + call ang32_mode_8_28 + + add r2, 28 + lea r0, [r7 + 8 * r1] + + call ang16_mode_8_28 + + add r2, 4 + lea r0, [r0 + 32] + + call ang32_mode_8_28 + RET + +cglobal intra_pred_ang32_28, 3,7,16 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r5, [r0 + 32] + vbroadcasti32x8 m15, [pd_16] + call ang16_mode_8_28 + + add r2, 4 + + call ang32_mode_8_28 + + add r2, 28 + mov r0, r5 + + call ang16_mode_8_28 + + add r2, 4 + call ang32_mode_8_28 + RET + + cglobal intra_pred_ang16_8, 3,7,16 + add r2, 64 + xor r6d, r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + vbroadcasti32x8 m15, [pd_16] + + call ang16_mode_8_28 + RET + +cglobal intra_pred_ang16_28, 3,7,16 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + vbroadcasti32x8 m15, [pd_16] + + call ang16_mode_8_28 + RET + +;; angle 16, modes 7 and 29 +cglobal ang16_mode_7_29 + test r6d, r6d + + vbroadcasti32x8 m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + vbroadcasti32x8 m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + vbroadcasti32x8 m2, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + vbroadcasti32x8 m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + + movu ym16, [r3 - 8 * 32] ; [9] + vinserti32x8 m16, [r3 + 1 * 32] ,1 ; [18] + pmaddwd m4, m3,m16 + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, m16 + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + vextracti32x8 ym5, m4, 1 + + pmaddwd m6, m3, [r3 + 10 * 32] ; [27] + paddd m6, m15 + psrld m6, 5 + pmaddwd m9, m0, [r3 + 10 * 32] + paddd m9, m15 + psrld m9, 5 + packusdw m6, m9 + + palignr m10, m0, m3, 4 + pmaddwd m7, m10, [r3 - 13 * 32] ; [4] + paddd m7, m15 + psrld m7, 5 + palignr m11, m2, m0, 4 + pmaddwd m8, m11, [r3 - 13 * 32] + paddd m8, m15 + psrld m8, 5 + packusdw m7, m8 + + movu ym16, [r3 - 4 * 32] ; [13] + vinserti32x8 m16, [r3 + 5 * 32],1 ; [22] + pmaddwd m8, m10, m16 + paddd m8, m15 + psrld m8, 5 + pmaddwd m9, m11, m16 + paddd m9, m15 + psrld m9, 5 + packusdw m8, m9 + vextracti32x8 ym9, m8, 1 + + pmaddwd m10, [r3 + 14 * 32] ; [31] + paddd m10, m15 + psrld m10, 5 + pmaddwd m11, [r3 + 14 * 32] + paddd m11, m15 + psrld m11, 5 + packusdw m10, m11 + + palignr m11, m0, m3, 8 + pmaddwd m11, [r3 - 9 * 32] ; [8] + paddd m11, m15 + psrld m11, 5 + palignr m12, m2, m0, 8 + pmaddwd m12, [r3 - 9 * 32] + paddd m12, m15 + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 0 + + palignr m5, m0, m3, 8 + palignr m6, m2, m0, 8 + movu ym16, [r3] ; [17] + vinserti32x8 m16, [r3 + 9 * 32] ,1 ; [26] + pmaddwd m4, m5, m16 + paddd m4, m15 + psrld 
m4, 5 + pmaddwd m7, m6, m16 + paddd m7, m15 + psrld m7, 5 + packusdw m4, m7 + vextracti32x8 ym5, m4, 1 + + + palignr m9, m0, m3, 12 + palignr m3, m2, m0, 12 + movu ym16, [r3 - 14 * 32] ; [3] + vinserti32x8 m16, [r3 - 5 * 32] ,1 ; [12] + pmaddwd m6, m9,m16 + paddd m6, m15 + psrld m6, 5 + pmaddwd m7, m3,m16 + paddd m7, m15 + psrld m7, 5 + packusdw m6, m7 + vextracti32x8 ym7, m6, 1 + + movu ym16, [r3 + 4 * 32] ; [21] + vinserti32x8 m16, [r3 + 13 * 32] ,1 ; [30] + pmaddwd m8, m9,m16 + paddd m8, m15 + psrld m8, 5 + pmaddwd m10, m3, m16 + paddd m10, m15 + psrld m10, 5 + packusdw m8, m10 + vextracti32x8 ym9, m8, 1 + + movu ym16,[r3 - 10 * 32] ; [7] + vinserti32x8 m16, [r3 - 1 * 32] ,1 ; [16] + pmaddwd m10, m0, m16 + paddd m10, m15 + psrld m10, 5 + pmaddwd m12, m2, m16 + paddd m12, m15 + psrld m12, 5 + packusdw m10, m12 + vextracti32x8 ym0, m10, 1 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 16 + ret +;; angle 32, modes 7 and 29 +cglobal ang32_mode_7_29 + test r6d, r6d + + vbroadcasti32x8 m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + vbroadcasti32x8 m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + vbroadcasti32x8 m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + vbroadcasti32x8 m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + pmaddwd m4, m3, [r3 + 8 * 32] ; [25] + paddd m4, m15 + psrld m4, 5 + pmaddwd m5, m0, [r3 + 8 * 32] + paddd m5, m15 + psrld m5, 5 + packusdw m4, m5 + + palignr m8, m0, m3, 4 + pmaddwd m5, m8, [r3 - 15 * 32] ; [2] + paddd m5, m15 + psrld m5, 5 + palignr m9, m2, m0, 4 + pmaddwd m10, m9, [r3 - 15 * 32] + paddd m10, m15 + psrld m10, 5 + packusdw m5, m10 + + movu ym16,[r3 - 6 * 32] ; [11] + vinserti32x8 m16, [r3 + 3 * 32],1 ; [20] + pmaddwd m6, m8, m16 + paddd m6, m15 + psrld m6, 5 + pmaddwd m7, m9, m16 + paddd m7, m15 + psrld m7, 5 + packusdw m6, m7 + vextracti32x8 ym7, m6, 1 + + pmaddwd m8, [r3 + 12 * 32] ; [29] + paddd m8, m15 + psrld m8, 5 + pmaddwd m9, [r3 + 12 * 32] + paddd m9, m15 + psrld m9, 5 + packusdw m8, m9 + + palignr m11, m0, m3, 8 + palignr m12, m2, m0, 8 + movu ym16, [r3 - 11 * 32] ; [6] + vinserti32x8 m16, [r3 - 2 * 32] ,1 ; [15] + pmaddwd m9, m11, m16 + paddd m9, m15 + psrld m9, 5 + palignr m12, m2, m0, 8 + pmaddwd m10, m12, m16 + paddd m10, m15 + psrld m10, 5 + packusdw m9, m10 + vextracti32x8 ym10, m9, 1 + + pmaddwd m11, [r3 + 7 * 32] ; [24] + paddd m11, m15 + psrld m11, 5 + pmaddwd m12, [r3 + 7 * 32] + paddd m12, m15 + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + palignr m5, m0, m3, 12 + palignr m6, m2, m0, 12 + movu ym16, [r3 - 16 * 32] ; [1] + vinserti32x8 m16, [r3 - 7 * 32] ,1 ; [10] + pmaddwd m4, m5, m16 + paddd m4, m15 + psrld m4, 5 + pmaddwd m7, m6, m16 + paddd m7, m15 + psrld m7, 5 + packusdw m4, m7 + vextracti32x8 ym5, m4, 1 + + palignr m9, m0, m3, 12 + pmaddwd m6, m9, [r3 + 2 * 32] ; [19] + paddd m6, m15 + psrld m6, 5 + palignr m3, m2, m0, 12 + pmaddwd m7, m3, [r3 + 2 * 32] + paddd m7, m15 + psrld m7, 5 + packusdw m6, m7 + + pmaddwd m7, m9, [r3 + 11 * 32] ; [28] + paddd m7, m15 + psrld m7, 5 + pmaddwd m8, m3, [r3 + 11 * 32] + paddd m8, m15 + psrld m8, 5 + packusdw m7, m8 + + movu ym16, [r3 - 12 * 32] ; [5] + vinserti32x8 m16, [r3 - 3 * 32] ,1 ; [14] + 
pmaddwd m8, m0, m16 + paddd m8, m15 + psrld m8, 5 + pmaddwd m10, m2, m16 + paddd m10,m15 + psrld m10, 5 + packusdw m8, m10 + vextracti32x8 ym9, m8, 1 + + pmaddwd m10, m0, [r3 + 6 * 32] ; [23] + paddd m10,m15 + psrld m10, 5 + pmaddwd m12, m2, [r3 + 6 * 32] + paddd m12, m15 + psrld m12, 5 + packusdw m10, m12 + + palignr m11, m2, m0, 4 + pmaddwd m11, [r3 - 17 * 32] ; [0] + paddd m11, m15 + psrld m11, 5 + palignr m12, m1, m2, 4 + pmaddwd m12, [r3 - 17 * 32] + paddd m12, m15 + psrld m12, 5 + packusdw m11, m12 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 3, 2, 16 + ret + +cglobal intra_pred_ang32_7, 3,8,17 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 17 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + vbroadcasti32x8 m15, [pd_16] + call ang16_mode_7_29 + + add r2, 8 + lea r0, [r0 + 32] + + call ang32_mode_7_29 + + add r2, 24 + lea r0, [r7 + 8 * r1] + + call ang16_mode_7_29 + + add r2, 8 + lea r0, [r0 + 32] + + call ang32_mode_7_29 + RET + +cglobal intra_pred_ang32_29, 3,7,17 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 17 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r5, [r0 + 32] + vbroadcasti32x8 m15, [pd_16] + call ang16_mode_7_29 + + add r2, 8 + + call ang32_mode_7_29 + + add r2, 24 + mov r0, r5 + + call ang16_mode_7_29 + add r2, 8 + call ang32_mode_7_29 + RET +cglobal intra_pred_ang16_7, 3,7,17 + add r2, 64 + xor r6d, r6d + vbroadcasti32x8 m15, [pd_16] + lea r3, [ang_table_avx2 + 17 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_7_29 + RET + +cglobal intra_pred_ang16_29, 3,7,17 + xor r6d, r6d + inc r6d + vbroadcasti32x8 m15, [pd_16] + lea r3, [ang_table_avx2 + 17 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_7_29 + RET +;------------------------------------------------------------------------------------------------------- +; avx512 code for intra_pred_ang32 mode 2 to 34 end +;------------------------------------------------------------------------------------------------------- %macro MODE_2_34 0 movu m0, [r2 + 4] movu m1, [r2 + 20]
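Note: the ang16/ang32 AVX-512 routines added in the hunk above all vectorise the same per-sample arithmetic. Each predicted sample blends two neighbouring reference samples with a 5-bit fractional weight read from ang_table_avx2, rounds with the pd_16 constant and shifts right by 5 (the recurring paddd m, m15 / psrld m, 5 pairs). A minimal scalar sketch of that step, for orientation only — the function name and arguments are ours, not part of x265:

#include <stdint.h>

/* One output sample of HEVC angular intra prediction (16-bit pixels).
 * (32 - frac) and frac are the coefficient pair the assembly reads from
 * ang_table_avx2; +16 / >>5 match paddd m, m15 and psrld m, 5. */
static inline uint16_t angular_sample(const uint16_t *ref, int i, int frac)
{
    return (uint16_t)(((32 - frac) * ref[i] + frac * ref[i + 1] + 16) >> 5);
}

The bracketed numbers in the assembly comments ([25], [19], [0], ...) are the fractional weights handled by each coefficient row; a frac of 0 reduces to copying the reference sample unchanged.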
View file
x265_2.7.tar.gz/source/common/x86/ipfilter16.asm -> x265_2.9.tar.gz/source/common/x86/ipfilter16.asm
Changed
@@ -45,12 +45,33 @@ %endif -SECTION_RODATA 32 +SECTION_RODATA 64 tab_c_524800: times 4 dd 524800 tab_c_n8192: times 8 dw -8192 pd_524800: times 8 dd 524800 +tab_ChromaCoeff: dw 0, 64, 0, 0 + dw -2, 58, 10, -2 + dw -4, 54, 16, -2 + dw -6, 46, 28, -4 + dw -4, 36, 36, -4 + dw -4, 28, 46, -6 + dw -2, 16, 54, -4 + dw -2, 10, 58, -2 + +tab_LumaCoeff: dw 0, 0, 0, 64, 0, 0, 0, 0 + dw -1, 4, -10, 58, 17, -5, 1, 0 + dw -1, 4, -11, 40, 40, -11, 4, -1 + dw 0, 1, -5, 17, 58, -10, 4, -1 + +ALIGN 64 +tab_LumaCoeffH_avx512: + times 4 dw 0, 0, 0, 64, 0, 0, 0, 0 + times 4 dw -1, 4, -10, 58, 17, -5, 1, 0 + times 4 dw -1, 4, -11, 40, 40, -11, 4, -1 + times 4 dw 0, 1, -5, 17, 58, -10, 4, -1 + ALIGN 32 tab_LumaCoeffV: times 4 dw 0, 0 times 4 dw 0, 64 @@ -71,6 +92,7 @@ times 4 dw -5, 17 times 4 dw 58, -10 times 4 dw 4, -1 + ALIGN 32 tab_LumaCoeffVer: times 8 dw 0, 0 times 8 dw 0, 64 @@ -91,7 +113,62 @@ times 8 dw -5, 17 times 8 dw 58, -10 times 8 dw 4, -1 - + +ALIGN 64 +const tab_ChromaCoeffV_avx512, times 16 dw 0, 64 + times 16 dw 0, 0 + + times 16 dw -2, 58 + times 16 dw 10, -2 + + times 16 dw -4, 54 + times 16 dw 16, -2 + + times 16 dw -6, 46 + times 16 dw 28, -4 + + times 16 dw -4, 36 + times 16 dw 36, -4 + + times 16 dw -4, 28 + times 16 dw 46, -6 + + times 16 dw -2, 16 + times 16 dw 54, -4 + + times 16 dw -2, 10 + times 16 dw 58, -2 + +ALIGN 64 +tab_LumaCoeffVer_avx512: times 16 dw 0, 0 + times 16 dw 0, 64 + times 16 dw 0, 0 + times 16 dw 0, 0 + + times 16 dw -1, 4 + times 16 dw -10, 58 + times 16 dw 17, -5 + times 16 dw 1, 0 + + times 16 dw -1, 4 + times 16 dw -11, 40 + times 16 dw 40, -11 + times 16 dw 4, -1 + + times 16 dw 0, 1 + times 16 dw -5, 17 + times 16 dw 58, -10 + times 16 dw 4, -1 + +ALIGN 64 +const interp8_hpp_shuf1_load_avx512, times 4 db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 + +ALIGN 64 +const interp8_hpp_shuf2_load_avx512, times 4 db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13 + +ALIGN 64 +const interp8_hpp_shuf1_store_avx512, times 4 db 0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15 + SECTION .text cextern pd_8 cextern pd_32 @@ -246,6 +323,7 @@ ;------------------------------------------------------------------------------------------------------------- ; void interp_8tap_vert_pp_%2x%3(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 FILTER_VER_LUMA_sse2 pp, 4, 4 FILTER_VER_LUMA_sse2 pp, 8, 8 FILTER_VER_LUMA_sse2 pp, 8, 4 @@ -300,7 +378,570 @@ FILTER_VER_LUMA_sse2 ps, 48, 64 FILTER_VER_LUMA_sse2 ps, 64, 16 FILTER_VER_LUMA_sse2 ps, 16, 64 +%endif + +;----------------------------------------------------------------------------- +;p2s and p2s_aligned avx512 code start +;----------------------------------------------------------------------------- +%macro P2S_64x4_AVX512 0 + movu m0, [r0] + movu m1, [r0 + r1] + movu m2, [r0 + r1 * 2] + movu m3, [r0 + r5] + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 + + movu m0, [r0 + mmsize] + movu m1, [r0 + r1 + mmsize] + movu m2, [r0 + r1 * 2 + mmsize] + movu m3, [r0 + r5 + mmsize] + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + movu [r2 + 
mmsize], m0 + movu [r2 + r3 + mmsize], m1 + movu [r2 + r3 * 2 + mmsize], m2 + movu [r2 + r4 + mmsize], m3 +%endmacro + +%macro P2S_ALIGNED_64x4_AVX512 0 + mova m0, [r0] + mova m1, [r0 + r1] + mova m2, [r0 + r1 * 2] + mova m3, [r0 + r5] + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + mova [r2], m0 + mova [r2 + r3], m1 + mova [r2 + r3 * 2], m2 + mova [r2 + r4], m3 + + mova m0, [r0 + mmsize] + mova m1, [r0 + r1 + mmsize] + mova m2, [r0 + r1 * 2 + mmsize] + mova m3, [r0 + r5 + mmsize] + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + mova [r2 + mmsize], m0 + mova [r2 + r3 + mmsize], m1 + mova [r2 + r3 * 2 + mmsize], m2 + mova [r2 + r4 + mmsize], m3 +%endmacro + +%macro P2S_32x4_AVX512 0 + movu m0, [r0] + movu m1, [r0 + r1] + movu m2, [r0 + r1 * 2] + movu m3, [r0 + r5] + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 +%endmacro + +%macro P2S_ALIGNED_32x4_AVX512 0 + mova m0, [r0] + mova m1, [r0 + r1] + mova m2, [r0 + r1 * 2] + mova m3, [r0 + r5] + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + mova [r2], m0 + mova [r2 + r3], m1 + mova [r2 + r3 * 2], m2 + mova [r2 + r4], m3 +%endmacro + +%macro P2S_48x4_AVX512 0 + movu m0, [r0] + movu m1, [r0 + r1] + movu m2, [r0 + r1 * 2] + movu m3, [r0 + r5] + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 + + movu ym0, [r0 + mmsize] + movu ym1, [r0 + r1 + mmsize] + movu ym2, [r0 + r1 * 2 + mmsize] + movu ym3, [r0 + r5 + mmsize] + psllw ym0, (14 - BIT_DEPTH) + psllw ym1, (14 - BIT_DEPTH) + psllw ym2, (14 - BIT_DEPTH) + psllw ym3, (14 - BIT_DEPTH) + psubw ym0, ym4 + psubw ym1, ym4 + psubw ym2, ym4 + psubw ym3, ym4 + movu [r2 + mmsize], ym0 + movu [r2 + r3 + mmsize], ym1 + movu [r2 + r3 * 2 + mmsize], ym2 + movu [r2 + r4 + mmsize], ym3 +%endmacro + +%macro P2S_ALIGNED_48x4_AVX512 0 + mova m0, [r0] + mova m1, [r0 + r1] + mova m2, [r0 + r1 * 2] + mova m3, [r0 + r5] + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + mova [r2], m0 + mova [r2 + r3], m1 + mova [r2 + r3 * 2], m2 + mova [r2 + r4], m3 + + mova ym0, [r0 + mmsize] + mova ym1, [r0 + r1 + mmsize] + mova ym2, [r0 + r1 * 2 + mmsize] + mova ym3, [r0 + r5 + mmsize] + psllw ym0, (14 - BIT_DEPTH) + psllw ym1, (14 - BIT_DEPTH) + psllw ym2, (14 - BIT_DEPTH) + psllw ym3, (14 - BIT_DEPTH) + psubw ym0, ym4 + psubw ym1, ym4 + psubw ym2, ym4 + psubw ym3, ym4 + mova [r2 + mmsize], ym0 + mova [r2 + r3 + mmsize], ym1 + mova [r2 + r3 * 2 + mmsize], ym2 + mova [r2 + r4 + mmsize], ym3 +%endmacro + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) 
+;----------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal filterPixelToShort_64x16, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 3 + P2S_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_64x4_AVX512 + RET + + +INIT_ZMM avx512 +cglobal filterPixelToShort_64x32, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 7 + P2S_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_64x48, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 11 + P2S_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_64x64, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 15 + P2S_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_32x8, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] + P2S_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + P2S_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_32x16, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 3 + P2S_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_32x24, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 5 + P2S_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_32x32, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 7 + P2S_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_32x48, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 11 + P2S_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_32x64, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 15 + P2S_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_32x4_AVX512 + RET +INIT_ZMM avx512 +cglobal filterPixelToShort_48x64, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 15 + P2S_48x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_48x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_64x16, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 3 + P2S_ALIGNED_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_ALIGNED_64x4_AVX512 + 
RET + + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_64x32, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 7 + P2S_ALIGNED_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_ALIGNED_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_64x48, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 11 + P2S_ALIGNED_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_ALIGNED_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_64x64, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 15 + P2S_ALIGNED_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_ALIGNED_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_32x8, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] + P2S_ALIGNED_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + P2S_ALIGNED_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_32x16, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 3 + P2S_ALIGNED_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_ALIGNED_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_32x24, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 5 + P2S_ALIGNED_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_ALIGNED_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_32x32, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 7 + P2S_ALIGNED_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_ALIGNED_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_32x48, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 11 + P2S_ALIGNED_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_ALIGNED_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_32x64, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 15 + P2S_ALIGNED_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_ALIGNED_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_48x64, 4, 6, 5 + add r1d, r1d + add r3d, r3d + lea r4, [r3 * 3] + lea r5, [r1 * 3] + + ; load constant + vbroadcasti32x8 m4, [pw_2000] +%rep 15 + P2S_ALIGNED_48x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + P2S_ALIGNED_48x4_AVX512 + RET +;----------------------------------------------------------------------------------------------------------------------------- +;p2s and p2s_aligned avx512 code end +;----------------------------------------------------------------------------------------------------------------------------- %macro PROCESS_LUMA_VER_W4_4R 0 movq m0, [r0] @@ -4611,3 +5252,8822 @@ jnz .loop RET 
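The P2S_* macros above, and the filterPixelToShort_* entry points built from them, convert pixels to the 16-bit internal representation used by the interpolation filters: shift up to 14-bit precision and subtract the pw_2000 offset (psllw (14 - BIT_DEPTH) / psubw against m4). A scalar sketch, assuming pw_2000 is the usual 0x2000 word constant; the C function name and the explicit width/height/bitDepth parameters are ours (the assembly bakes the block size into the symbol name):

#include <stdint.h>

/* filterPixelToShort, scalar form: dst = (src << (14 - bitDepth)) - 0x2000 */
static void pixel_to_short_c(const uint16_t *src, intptr_t srcStride,
                             int16_t *dst, intptr_t dstStride,
                             int width, int height, int bitDepth)
{
    const int shift = 14 - bitDepth;                         /* psllw (14 - BIT_DEPTH) */
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)((src[x] << shift) - 0x2000);  /* psubw pw_2000 */
        src += srcStride;
        dst += dstStride;
    }
}

The _aligned_ variants are identical except that they use aligned loads and stores (mova instead of movu).

The chroma hunks that follow add AVX-512 4-tap interpolation kernels (horizontal pp, vertical pp, horizontal ps). The pp variants all reduce to the same scalar step: a 4-tap convolution with one coefficient row, +32 rounding and a >>6 shift (pd_32 / psrad 6), then a clamp to [0, pw_pixel_max] (CLIPW). A sketch of one horizontal output sample — the function name and layout are illustrative, not x265 API; the sub r0, 2 in the assembly corresponds to the x - 1 start below:

#include <stdint.h>

/* One pp sample of the 4-tap chroma filter; coeff is one row of
 * tab_ChromaCoeff (e.g. {-4, 54, 16, -2}); the taps sum to 64. */
static uint16_t chroma_pp_4tap(const uint16_t *src, int x,
                               const int16_t coeff[4], int pixel_max)
{
    int sum = 0;
    for (int k = 0; k < 4; k++)
        sum += coeff[k] * src[x + k - 1];
    sum = (sum + 32) >> 6;                    /* pd_32, psrad 6     */
    if (sum < 0)         sum = 0;             /* CLIPW low bound    */
    if (sum > pixel_max) sum = pixel_max;     /* pw_pixel_max bound */
    return (uint16_t)sum;
}

The vertical pp kernels apply the same four taps down a column before an equivalent round/shift/clamp (INTERP_OFFSET_PP / INTERP_SHIFT_PP), while the ps kernels keep the intermediate as int16_t, using INTERP_OFFSET_PS / INTERP_SHIFT_PS and packssdw in place of the clamp.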
+;------------------------------------------------------------------------------------------------------------- +;ipfilter_chroma_avx512 code start +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +; avx512 chroma_hpp code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_IPFILTER_CHROMA_PP_8x4_AVX512 0 + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m3 shuffle order table + ; m4 - pd_32 + ; m5 - zero + ; m6 - pw_pixel_max + + movu xm7, [r0] + vinserti32x4 m7, [r0 + r1], 1 + vinserti32x4 m7, [r0 + 2 * r1], 2 + vinserti32x4 m7, [r0 + r6], 3 + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + movu xm8, [r0 + 8] + vinserti32x4 m8, [r0 + r1 + 8], 1 + vinserti32x4 m8, [r0 + 2 * r1 + 8], 2 + vinserti32x4 m8, [r0 + r6 + 8], 3 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2], xm7 + vextracti32x4 [r2 + r3], m7, 1 + vextracti32x4 [r2 + 2 * r3], m7, 2 + vextracti32x4 [r2 + r7], m7, 3 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PP_16x2_AVX512 0 + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m3 shuffle order table + ; m4 - pd_32 + ; m5 - zero + ; m6 - pw_pixel_max + + movu ym7, [r0] + vinserti32x8 m7, [r0 + r1], 1 + movu ym8, [r0 + 8] + vinserti32x8 m8, [r0 + r1 + 8], 1 + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2], ym7 + vextracti32x8 [r2 + r3], m7, 1 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PP_24x4_AVX512 0 + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m3 shuffle order table + ; m4 - pd_32 + ; m5 - zero + ; m6 - pw_pixel_max + + movu ym7, [r0] + vinserti32x8 m7, [r0 + r1], 1 + movu ym8, [r0 + 8] + vinserti32x8 m8, [r0 + r1 + 8], 1 + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2], ym7 + vextracti32x8 [r2 + r3], m7, 1 + + movu ym7, [r0 + 2 * r1] + vinserti32x8 m7, [r0 + r6], 1 + movu ym8, [r0 + 2 * r1 + 8] + vinserti32x8 m8, [r0 + r6 + 8], 1 + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2 + 2 * r3], ym7 + vextracti32x8 [r2 + r7], m7, 1 + + movu xm7, [r0 + mmsize/2] + vinserti32x4 m7, [r0 + r1 + mmsize/2], 1 + vinserti32x4 m7, [r0 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m7, [r0 + r6 + mmsize/2], 3 + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + movu xm8, [r0 + mmsize/2 + 8] + vinserti32x4 m8, [r0 + r1 + mmsize/2 + 8], 1 + vinserti32x4 m8, [r0 + 2 * r1 + mmsize/2 + 8], 2 + vinserti32x4 m8, [r0 + r6 + mmsize/2 
+ 8], 3 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2 + mmsize/2], xm7 + vextracti32x4 [r2 + r3 + mmsize/2], m7, 1 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m7, 2 + vextracti32x4 [r2 + r7 + mmsize/2], m7, 3 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PP_32x2_AVX512 0 + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m3 shuffle order table + ; m4 - pd_32 + ; m5 - zero + ; m6 - pw_pixel_max + + movu m7, [r0] + movu m8, [r0 + 8] + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2], m7 + + movu m7, [r0 + r1] + movu m8, [r0 + r1 + 8] + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2 + r3], m7 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PP_48x2_AVX512 0 + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m3 shuffle order table + ; m4 - pd_32 + ; m5 - zero + ; m6 - pw_pixel_max + + movu m7, [r0] + movu m8, [r0 + 8] + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2], m7 + + movu m7, [r0 + r1] + movu m8, [r0 + r1 + 8] + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2 + r3], m7 + + movu ym7, [r0 + mmsize] + vinserti32x8 m7, [r0 + r1 + mmsize], 1 + movu ym8, [r0 + mmsize + 8] + vinserti32x8 m8, [r0 + r1 + mmsize + 8], 1 + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2 + mmsize], ym7 + vextracti32x8 [r2 + r3 + mmsize], m7, 1 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PP_64x2_AVX512 0 + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m3 shuffle order table + ; m4 - pd_32 + ; m5 - zero + ; m6 - pw_pixel_max + + movu m7, [r0] + movu m8, [r0 + 8] + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2], m7 + + movu m7, [r0 + mmsize] + movu m8, [r0 + mmsize + 8] + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2 + mmsize], m7 + + movu m7, [r0 + r1] + 
movu m8, [r0 + r1 + 8] + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2 + r3], m7 + + movu m7, [r0 + r1 + mmsize] + movu m8, [r0 + r1 + mmsize + 8] + + pshufb m9, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m9, m1 + paddd m7, m9 + paddd m7, m4 + psrad m7, 6 + + pshufb m9, m8, m3 + pshufb m8, m2 + pmaddwd m8, m0 + pmaddwd m9, m1 + paddd m8, m9 + paddd m8, m4 + psrad m8, 6 + + packusdw m7, m8 + CLIPW m7, m5, m6 + pshufb m7, m10 + movu [r2 + r3 + mmsize], m7 +%endmacro +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_pp_8x4, 5,8,11 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m + lea r6, [3 * r1] + lea r7, [3 * r3] +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m4, [pd_32] + pxor m5, m5 + vbroadcasti32x8 m6, [pw_pixel_max] + vbroadcasti32x8 m10, [interp8_hpp_shuf1_store_avx512] + + PROCESS_IPFILTER_CHROMA_PP_8x4_AVX512 + RET +%endif + +%macro IPFILTER_CHROMA_AVX512_8xN 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_pp_8x%1, 5,8,11 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m + lea r6, [3 * r1] + lea r7, [3 * r3] +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m4, [pd_32] + pxor m5, m5 + vbroadcasti32x8 m6, [pw_pixel_max] + vbroadcasti32x8 m10, [interp8_hpp_shuf1_store_avx512] + +%rep %1/4 - 1 + PROCESS_IPFILTER_CHROMA_PP_8x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_IPFILTER_CHROMA_PP_8x4_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_CHROMA_AVX512_8xN 8 +IPFILTER_CHROMA_AVX512_8xN 12 +IPFILTER_CHROMA_AVX512_8xN 16 +IPFILTER_CHROMA_AVX512_8xN 32 +IPFILTER_CHROMA_AVX512_8xN 64 +%endif + +%macro IPFILTER_CHROMA_AVX512_16xN 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_pp_16x%1, 5,6,11 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m4, [pd_32] + pxor m5, m5 + vbroadcasti32x8 m6, [pw_pixel_max] + vbroadcasti32x8 m10, [interp8_hpp_shuf1_store_avx512] + +%rep %1/2 - 1 + PROCESS_IPFILTER_CHROMA_PP_16x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_IPFILTER_CHROMA_PP_16x2_AVX512 + RET 
+%endmacro + +%if ARCH_X86_64 +IPFILTER_CHROMA_AVX512_16xN 4 +IPFILTER_CHROMA_AVX512_16xN 8 +IPFILTER_CHROMA_AVX512_16xN 12 +IPFILTER_CHROMA_AVX512_16xN 16 +IPFILTER_CHROMA_AVX512_16xN 24 +IPFILTER_CHROMA_AVX512_16xN 32 +IPFILTER_CHROMA_AVX512_16xN 64 +%endif + +%macro IPFILTER_CHROMA_AVX512_24xN 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_pp_24x%1, 5,8,11 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m + lea r6, [3 * r1] + lea r7, [3 * r3] +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m4, [pd_32] + pxor m5, m5 + vbroadcasti32x8 m6, [pw_pixel_max] + vbroadcasti32x8 m10, [interp8_hpp_shuf1_store_avx512] + +%rep %1/4 - 1 + PROCESS_IPFILTER_CHROMA_PP_24x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_IPFILTER_CHROMA_PP_24x4_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_CHROMA_AVX512_24xN 32 +IPFILTER_CHROMA_AVX512_24xN 64 +%endif + +%macro IPFILTER_CHROMA_AVX512_32xN 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_pp_32x%1, 5,6,11 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m4, [pd_32] + pxor m5, m5 + vbroadcasti32x8 m6, [pw_pixel_max] + vbroadcasti32x8 m10, [interp8_hpp_shuf1_store_avx512] + +%rep %1/2 - 1 + PROCESS_IPFILTER_CHROMA_PP_32x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_IPFILTER_CHROMA_PP_32x2_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_CHROMA_AVX512_32xN 8 +IPFILTER_CHROMA_AVX512_32xN 16 +IPFILTER_CHROMA_AVX512_32xN 24 +IPFILTER_CHROMA_AVX512_32xN 32 +IPFILTER_CHROMA_AVX512_32xN 48 +IPFILTER_CHROMA_AVX512_32xN 64 +%endif + +%macro IPFILTER_CHROMA_AVX512_64xN 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_pp_64x%1, 5,6,11 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m4, [pd_32] + pxor m5, m5 + vbroadcasti32x8 m6, [pw_pixel_max] + vbroadcasti32x8 m10, [interp8_hpp_shuf1_store_avx512] + +%rep %1/2 - 1 + PROCESS_IPFILTER_CHROMA_PP_64x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_IPFILTER_CHROMA_PP_64x2_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_CHROMA_AVX512_64xN 16 +IPFILTER_CHROMA_AVX512_64xN 32 +IPFILTER_CHROMA_AVX512_64xN 48 +IPFILTER_CHROMA_AVX512_64xN 64 +%endif + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_pp_48x64, 5,6,11 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, 
[interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m4, [pd_32] + pxor m5, m5 + vbroadcasti32x8 m6, [pw_pixel_max] + vbroadcasti32x8 m10, [interp8_hpp_shuf1_store_avx512] + +%rep 31 + PROCESS_IPFILTER_CHROMA_PP_48x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_IPFILTER_CHROMA_PP_48x2_AVX512 + RET +%endif +;------------------------------------------------------------------------------------------------------------- +; avx512 chroma_hpp code end +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +; avx512 chroma_vpp code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_CHROMA_VERT_PP_8x8_AVX512 0 + movu xm1, [r0] + lea r6, [r0 + 2 * r1] + lea r8, [r0 + 4 * r1] + lea r9, [r8 + 2 * r1] + vinserti32x4 m1, [r6], 1 + vinserti32x4 m1, [r8], 2 + vinserti32x4 m1, [r9], 3 + movu xm3, [r0 + r1] + vinserti32x4 m3, [r6 + r1], 1 + vinserti32x4 m3, [r8 + r1], 2 + vinserti32x4 m3, [r9 + r1], 3 + punpcklwd m0, m1, m3 + pmaddwd m0, [r5] + punpckhwd m1, m3 + pmaddwd m1, [r5] + + movu xm4, [r0 + 2 * r1] + vinserti32x4 m4, [r6 + 2 * r1], 1 + vinserti32x4 m4, [r8 + 2 * r1], 2 + vinserti32x4 m4, [r9 + 2 * r1], 3 + punpcklwd m2, m3, m4 + pmaddwd m2, [r5] + punpckhwd m3, m4 + pmaddwd m3, [r5] + + movu xm5, [r0 + r10] + vinserti32x4 m5, [r6 + r10], 1 + vinserti32x4 m5, [r8 + r10], 2 + vinserti32x4 m5, [r9 + r10], 3 + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + mmsize] + paddd m1, m4 + + movu xm4, [r0 + 4 * r1] + vinserti32x4 m4, [r6 + 4 * r1], 1 + vinserti32x4 m4, [r8 + 4 * r1], 2 + vinserti32x4 m4, [r9 + 4 * r1], 3 + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + mmsize] + paddd m3, m5 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, m8 + movu [r2], xm0 + movu [r2 + r3], xm2 + vextracti32x4 [r2 + 2 * r3], m0, 1 + vextracti32x4 [r2 + r7], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2], m0, 2 + vextracti32x4 [r2 + r3], m2, 2 + vextracti32x4 [r2 + 2 * r3], m0, 3 + vextracti32x4 [r2 + r7], m2, 3 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_4tap_vert_pp_8x8, 5, 11, 9 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x8 m7, [INTERP_OFFSET_PP] + vbroadcasti32x8 m8, [pw_pixel_max] + lea r10, [3 * r1] + lea r7, [3 * r3] + PROCESS_CHROMA_VERT_PP_8x8_AVX512 + RET +%endif + +%macro FILTER_VER_PP_CHROMA_8xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_pp_8x%1, 5, 11, 9 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea 
r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x8 m7, [INTERP_OFFSET_PP] + vbroadcasti32x8 m8, [pw_pixel_max] + lea r10, [3 * r1] + lea r7, [3 * r3] +%rep %1/8 - 1 + PROCESS_CHROMA_VERT_PP_8x8_AVX512 + lea r0, [r8 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_PP_8x8_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +FILTER_VER_PP_CHROMA_8xN_AVX512 16 +FILTER_VER_PP_CHROMA_8xN_AVX512 32 +FILTER_VER_PP_CHROMA_8xN_AVX512 64 +%endif + +%macro PROCESS_CHROMA_VERT_PP_16x4_AVX512 0 + movu ym1, [r0] + lea r6, [r0 + 2 * r1] + vinserti32x8 m1, [r6], 1 + movu ym3, [r0 + r1] + vinserti32x8 m3, [r6 + r1], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, [r5] + punpckhwd m1, m3 + pmaddwd m1, [r5] + + movu ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r6 + 2 * r1], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, [r5] + punpckhwd m3, m4 + pmaddwd m3, [r5] + + lea r0, [r0 + 2 * r1] + lea r6, [r6 + 2 * r1] + + movu ym5, [r0 + r1] + vinserti32x8 m5, [r6 + r1], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + mmsize] + paddd m1, m4 + + movu ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r6 + 2 * r1], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + mmsize] + paddd m3, m5 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, m8 + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r7], m2, 1 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_4tap_vert_pp_16x4, 5, 8, 9 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x8 m7, [INTERP_OFFSET_PP] + vbroadcasti32x8 m8, [pw_pixel_max] + lea r7, [3 * r3] + PROCESS_CHROMA_VERT_PP_16x4_AVX512 + RET +%endif + +%macro FILTER_VER_PP_CHROMA_16xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_pp_16x%1, 5, 8, 9 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x8 m7, [INTERP_OFFSET_PP] + vbroadcasti32x8 m8, [pw_pixel_max] + lea r7, [3 * r3] +%rep %1/4 - 1 + PROCESS_CHROMA_VERT_PP_16x4_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_PP_16x4_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +FILTER_VER_PP_CHROMA_16xN_AVX512 8 +FILTER_VER_PP_CHROMA_16xN_AVX512 12 +FILTER_VER_PP_CHROMA_16xN_AVX512 16 +FILTER_VER_PP_CHROMA_16xN_AVX512 24 +FILTER_VER_PP_CHROMA_16xN_AVX512 32 +FILTER_VER_PP_CHROMA_16xN_AVX512 64 +%endif + +%macro PROCESS_CHROMA_VERT_PP_24x8_AVX512 0 + movu ym1, [r0] + lea r6, [r0 + 2 * r1] + lea r8, [r0 + 4 * r1] + lea r9, [r8 + 2 * r1] + + movu ym10, [r8] + movu ym3, [r0 + r1] + movu ym12, [r8 + r1] + vinserti32x8 m1, [r6], 1 + vinserti32x8 m10, [r9], 1 + vinserti32x8 m3, [r6 + r1], 1 + vinserti32x8 m12, [r9 + r1], 1 + + punpcklwd 
m0, m1, m3 + punpcklwd m9, m10, m12 + pmaddwd m0, [r5] + pmaddwd m9, [r5] + punpckhwd m1, m3 + punpckhwd m10, m12 + pmaddwd m1, [r5] + pmaddwd m10, [r5] + + movu ym4, [r0 + 2 * r1] + movu ym13, [r8 + 2 * r1] + vinserti32x8 m4, [r6 + 2 * r1], 1 + vinserti32x8 m13, [r9 + 2 * r1], 1 + punpcklwd m2, m3, m4 + punpcklwd m11, m12, m13 + pmaddwd m2, [r5] + pmaddwd m11, [r5] + punpckhwd m3, m4 + punpckhwd m12, m13 + pmaddwd m3, [r5] + pmaddwd m12, [r5] + + movu ym5, [r0 + r10] + vinserti32x8 m5, [r6 + r10], 1 + movu ym14, [r8 + r10] + vinserti32x8 m14, [r9 + r10], 1 + punpcklwd m6, m4, m5 + punpcklwd m15, m13, m14 + pmaddwd m6, [r5 + mmsize] + pmaddwd m15, [r5 + mmsize] + paddd m0, m6 + paddd m9, m15 + punpckhwd m4, m5 + punpckhwd m13, m14 + pmaddwd m4, [r5 + mmsize] + pmaddwd m13, [r5 + mmsize] + paddd m1, m4 + paddd m10, m13 + + movu ym4, [r0 + 4 * r1] + vinserti32x8 m4, [r6 + 4 * r1], 1 + movu ym13, [r8 + 4 * r1] + vinserti32x8 m13, [r9 + 4 * r1], 1 + punpcklwd m6, m5, m4 + punpcklwd m15, m14, m13 + pmaddwd m6, [r5 + mmsize] + pmaddwd m15, [r5 + mmsize] + paddd m2, m6 + paddd m11, m15 + punpckhwd m5, m4 + punpckhwd m14, m13 + pmaddwd m5, [r5 + mmsize] + pmaddwd m14, [r5 + mmsize] + paddd m3, m5 + paddd m12, m14 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + paddd m12, m7 + + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + psrad m9, INTERP_SHIFT_PP + psrad m10, INTERP_SHIFT_PP + psrad m11, INTERP_SHIFT_PP + psrad m12, INTERP_SHIFT_PP + + packssdw m0, m1 + packssdw m2, m3 + packssdw m9, m10 + packssdw m11, m12 + pxor m5, m5 + CLIPW2 m0, m2, m5, m8 + CLIPW2 m9, m11, m5, m8 + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r7], m2, 1 + lea r11, [r2 + 4 * r3] + movu [r11], ym9 + movu [r11 + r3], ym11 + vextracti32x8 [r11 + 2 * r3], m9, 1 + vextracti32x8 [r11 + r7], m11, 1 + + movu xm1, [r0 + mmsize/2] + vinserti32x4 m1, [r6 + mmsize/2], 1 + vinserti32x4 m1, [r8 + mmsize/2], 2 + vinserti32x4 m1, [r9 + mmsize/2], 3 + movu xm3, [r0 + r1 + mmsize/2] + vinserti32x4 m3, [r6 + r1 + mmsize/2], 1 + vinserti32x4 m3, [r8 + r1 + mmsize/2], 2 + vinserti32x4 m3, [r9 + r1 + mmsize/2], 3 + punpcklwd m0, m1, m3 + pmaddwd m0, [r5] + punpckhwd m1, m3 + pmaddwd m1, [r5] + + movu xm4, [r0 + 2 * r1 + mmsize/2] + vinserti32x4 m4, [r6 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r8 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r9 + 2 * r1 + mmsize/2], 3 + punpcklwd m2, m3, m4 + pmaddwd m2, [r5] + punpckhwd m3, m4 + pmaddwd m3, [r5] + + movu xm5, [r0 + r10 + mmsize/2] + vinserti32x4 m5, [r6 + r10 + mmsize/2], 1 + vinserti32x4 m5, [r8 + r10 + mmsize/2], 2 + vinserti32x4 m5, [r9 + r10 + mmsize/2], 3 + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + mmsize] + paddd m1, m4 + + movu xm4, [r0 + 4 * r1 + mmsize/2] + vinserti32x4 m4, [r6 + 4 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r8 + 4 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r9 + 4 * r1 + mmsize/2], 3 + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + mmsize] + paddd m3, m5 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, m8 + movu [r2 + mmsize/2], xm0 + movu [r2 + r3 + mmsize/2], xm2 + vextracti32x4 [r2 + 2 
* r3 + mmsize/2], m0, 1 + vextracti32x4 [r2 + r7 + mmsize/2], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2 + mmsize/2], m0, 2 + vextracti32x4 [r2 + r3 + mmsize/2], m2, 2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 3 + vextracti32x4 [r2 + r7 + mmsize/2], m2, 3 +%endmacro + +%macro FILTER_VER_PP_CHROMA_24xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_pp_24x%1, 5, 12, 16 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x8 m7, [INTERP_OFFSET_PP] + vbroadcasti32x8 m8, [pw_pixel_max] + lea r10, [3 * r1] + lea r7, [3 * r3] +%rep %1/8 - 1 + PROCESS_CHROMA_VERT_PP_24x8_AVX512 + lea r0, [r8 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_PP_24x8_AVX512 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_PP_CHROMA_24xN_AVX512 32 + FILTER_VER_PP_CHROMA_24xN_AVX512 64 +%endif + +%macro PROCESS_CHROMA_VERT_PP_32x2_AVX512 0 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, [r5] + punpckhwd m1, m3 + pmaddwd m1, [r5] + + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, [r5] + punpckhwd m3, m4 + pmaddwd m3, [r5] + + lea r0, [r0 + 2 * r1] + movu m5, [r0 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + mmsize] + paddd m1, m4 + + movu m4, [r0 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + mmsize] + paddd m3, m5 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, m8 + movu [r2], m0 + movu [r2 + r3], m2 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_PP_CHROMA_32xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_pp_32x%1, 5, 7, 9 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x8 m7, [INTERP_OFFSET_PP] + vbroadcasti32x8 m8, [pw_pixel_max] + +%rep %1/2 - 1 + PROCESS_CHROMA_VERT_PP_32x2_AVX512 + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_CHROMA_VERT_PP_32x2_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +FILTER_VER_PP_CHROMA_32xN_AVX512 8 +FILTER_VER_PP_CHROMA_32xN_AVX512 16 +FILTER_VER_PP_CHROMA_32xN_AVX512 24 +FILTER_VER_PP_CHROMA_32xN_AVX512 32 +FILTER_VER_PP_CHROMA_32xN_AVX512 48 +FILTER_VER_PP_CHROMA_32xN_AVX512 64 +%endif + +%macro PROCESS_CHROMA_VERT_PP_48x4_AVX512 0 + movu m1, [r0] + lea r6, [r0 + 2 * r1] + movu m10, [r6] + movu m3, [r0 + r1] + movu m12, [r6 + r1] + punpcklwd m0, m1, m3 + punpcklwd m9, m10, m12 + pmaddwd m0, [r5] + pmaddwd m9, [r5] + punpckhwd m1, m3 + punpckhwd m10, m12 + pmaddwd m1, [r5] + pmaddwd m10, [r5] + + movu m4, [r0 + 2 * r1] + movu m13, [r6 + 2 * r1] + punpcklwd m2, m3, m4 + punpcklwd m11, m12, m13 + pmaddwd m2, [r5] + pmaddwd m11, [r5] + punpckhwd m3, m4 + punpckhwd m12, m13 + pmaddwd m3, [r5] + pmaddwd m12, [r5] + + movu m5, [r0 + r7] + movu m14, [r6 + r7] + punpcklwd m6, m4, m5 + 
punpcklwd m15, m13, m14 + pmaddwd m6, [r5 + mmsize] + pmaddwd m15, [r5 + mmsize] + paddd m0, m6 + paddd m9, m15 + punpckhwd m4, m5 + punpckhwd m13, m14 + pmaddwd m4, [r5 + mmsize] + pmaddwd m13, [r5 + mmsize] + paddd m1, m4 + paddd m10, m13 + + movu m4, [r0 + 4 * r1] + movu m13, [r6 + 4 * r1] + punpcklwd m6, m5, m4 + punpcklwd m15, m14, m13 + pmaddwd m6, [r5 + mmsize] + pmaddwd m15, [r5 + mmsize] + paddd m2, m6 + paddd m11, m15 + punpckhwd m5, m4 + punpckhwd m14, m13 + pmaddwd m5, [r5 + mmsize] + pmaddwd m14, [r5 + mmsize] + paddd m3, m5 + paddd m12, m14 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + paddd m12, m7 + + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + psrad m9, INTERP_SHIFT_PP + psrad m10, INTERP_SHIFT_PP + psrad m11, INTERP_SHIFT_PP + psrad m12, INTERP_SHIFT_PP + + packssdw m0, m1 + packssdw m2, m3 + packssdw m9, m10 + packssdw m11, m12 + CLIPW2 m0, m2, m16, m8 + CLIPW2 m9, m11, m16, m8 + movu [r2], m0 + movu [r2 + r3], m2 + movu [r2 + 2 * r3], m9 + movu [r2 + r8], m11 + + movu ym1, [r0 + mmsize] + vinserti32x8 m1, [r6 + mmsize], 1 + movu ym3, [r0 + r1 + mmsize] + vinserti32x8 m3, [r6 + r1 + mmsize], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, [r5] + punpckhwd m1, m3 + pmaddwd m1, [r5] + + movu ym4, [r0 + 2 * r1 + mmsize] + vinserti32x8 m4, [r6 + 2 * r1 + mmsize], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, [r5] + punpckhwd m3, m4 + pmaddwd m3, [r5] + + movu ym5, [r0 + r7 + mmsize] + vinserti32x8 m5, [r6 + r7 + mmsize], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + mmsize] + paddd m1, m4 + + movu ym4, [r0 + 4 * r1 + mmsize] + vinserti32x8 m4, [r6 + 4 * r1 + mmsize], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + mmsize] + paddd m3, m5 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m16, m8 + movu [r2 + mmsize], ym0 + movu [r2 + r3 + mmsize], ym2 + vextracti32x8 [r2 + 2 * r3 + mmsize], m0, 1 + vextracti32x8 [r2 + r8 + mmsize], m2, 1 +%endmacro + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_4tap_vert_pp_48x64, 5, 9, 17 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + lea r7, [3 * r1] + lea r8, [3 * r3] + vbroadcasti32x8 m7, [INTERP_OFFSET_PP] + vbroadcasti32x8 m8, [pw_pixel_max] + pxor m16, m16 + +%rep 15 + PROCESS_CHROMA_VERT_PP_48x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_PP_48x4_AVX512 + RET +%endif + +%macro PROCESS_CHROMA_VERT_PP_64x2_AVX512 0 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, [r5] + punpckhwd m1, m3 + pmaddwd m1, [r5] + + movu m9, [r0 + mmsize] + movu m11, [r0 + r1 + mmsize] + punpcklwd m8, m9, m11 + pmaddwd m8, [r5] + punpckhwd m9, m11 + pmaddwd m9, [r5] + + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, [r5] + punpckhwd m3, m4 + pmaddwd m3, [r5] + + movu m12, [r0 + 2 * r1 + mmsize] + punpcklwd m10, m11, m12 + pmaddwd m10, [r5] + punpckhwd m11, m12 + pmaddwd m11, [r5] + + lea r0, [r0 + 2 * r1] + movu m5, [r0 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 
* mmsize] + paddd m1, m4 + + movu m13, [r0 + r1 + mmsize] + punpcklwd m14, m12, m13 + pmaddwd m14, [r5 + 1 * mmsize] + paddd m8, m14 + punpckhwd m12, m13 + pmaddwd m12, [r5 + 1 * mmsize] + paddd m9, m12 + + movu m4, [r0 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m3, m5 + + movu m12, [r0 + 2 * r1 + mmsize] + punpcklwd m14, m13, m12 + pmaddwd m14, [r5 + 1 * mmsize] + paddd m10, m14 + punpckhwd m13, m12 + pmaddwd m13, [r5 + 1 * mmsize] + paddd m11, m13 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m8, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + psrad m8, INTERP_SHIFT_PP + psrad m9, INTERP_SHIFT_PP + psrad m10, INTERP_SHIFT_PP + psrad m11, INTERP_SHIFT_PP + + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 + pxor m5, m5 + CLIPW2 m0, m2, m5, m15 + CLIPW2 m8, m10, m5, m15 + movu [r2], m0 + movu [r2 + r3], m2 + movu [r2 + mmsize], m8 + movu [r2 + r3 + mmsize], m10 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_PP_CHROMA_64xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_pp_64x%1, 5, 7, 16 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x8 m7, [INTERP_OFFSET_PP] + vbroadcasti32x8 m15, [pw_pixel_max] + +%rep %1/2 - 1 + PROCESS_CHROMA_VERT_PP_64x2_AVX512 + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_CHROMA_VERT_PP_64x2_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +FILTER_VER_PP_CHROMA_64xN_AVX512 16 +FILTER_VER_PP_CHROMA_64xN_AVX512 32 +FILTER_VER_PP_CHROMA_64xN_AVX512 48 +FILTER_VER_PP_CHROMA_64xN_AVX512 64 +%endif +;------------------------------------------------------------------------------------------------------------- +; avx512 chroma_vpp code end +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +; avx512 chroma_hps code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_IPFILTER_CHROMA_PS_32x2_AVX512 0 + ; register map + ; m0 , m1 - interpolate coeff + ; m2 , m3 - shuffle load order table + ; m4 - INTERP_OFFSET_PS + ; m5 - shuffle store order table + + movu m6, [r0] + movu m7, [r0 + 8] + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2], m6 + + movu m6, [r0 + r1] + movu m7, [r0 + r1 + 8] + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, 
m7 + pshufb m6, m5 + movu [r2 + r3], m6 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PS_32x1_AVX512 0 + movu m6, [r0] + movu m7, [r0 + 8] + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2], m6 +%endmacro + +%macro IPFILTER_CHROMA_PS_AVX512_32xN 1 +%if ARCH_X86_64 == 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_ps_32x%1, 4,7,9 + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 8] + vpbroadcastd m1, [r6 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x4 m4, [INTERP_OFFSET_PS] + vbroadcasti32x8 m5, [interp8_hpp_shuf1_store_avx512] + + mov r6d, %1 + sub r0, 2 + test r5d, r5d + jz .loop + sub r0, r1 + add r6d, 3 + PROCESS_IPFILTER_CHROMA_PS_32x1_AVX512 + lea r0, [r0 + r1] + lea r2, [r2 + r3] + dec r6d + +.loop: + PROCESS_IPFILTER_CHROMA_PS_32x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + sub r6d, 2 + jnz .loop + RET +%endif +%endmacro + +IPFILTER_CHROMA_PS_AVX512_32xN 8 +IPFILTER_CHROMA_PS_AVX512_32xN 16 +IPFILTER_CHROMA_PS_AVX512_32xN 24 +IPFILTER_CHROMA_PS_AVX512_32xN 32 +IPFILTER_CHROMA_PS_AVX512_32xN 48 +IPFILTER_CHROMA_PS_AVX512_32xN 64 + +%macro PROCESS_IPFILTER_CHROMA_PS_64x2_AVX512 0 + ; register map + ; m0 , m1 - interpolate coeff + ; m2 , m3 -shuffle order table + ; m4 - INTERP_OFFSET_PS + ; m5 - shuffle store order table + + + movu m6, [r0] + movu m7, [r0 + 8] + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2], m6 + + movu m6, [r0 + mmsize] + movu m7, [r0 + mmsize + 8] + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2 + mmsize], m6 + + movu m6, [r0 + r1] + movu m7, [r0 + r1 + 8] + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2 + r3], m6 + + movu m6, [r0 + r1 + mmsize] + movu m7, [r0 + r1 + mmsize + 8] + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2 + r3 + mmsize], m6 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PS_64x1_AVX512 0 + movu m6, [r0] + movu m7, [r0 + 8] + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd 
m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2], m6 + + movu m6, [r0 + mmsize] + movu m7, [r0 + mmsize + 8] + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2 + mmsize], m6 +%endmacro + +%macro IPFILTER_CHROMA_PS_AVX512_64xN 1 +%if ARCH_X86_64 == 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_ps_64x%1, 4,7,9 + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 8] + vpbroadcastd m1, [r6 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x4 m4, [INTERP_OFFSET_PS] + vbroadcasti32x8 m5, [interp8_hpp_shuf1_store_avx512] + mov r6d, %1 + sub r0, 2 + test r5d, r5d + jz .loop + sub r0, r1 + add r6d, 3 + PROCESS_IPFILTER_CHROMA_PS_64x1_AVX512 + lea r0, [r0 + r1] + lea r2, [r2 + r3] + dec r6d + +.loop: + PROCESS_IPFILTER_CHROMA_PS_64x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + sub r6d, 2 + jnz .loop + RET +%endif +%endmacro + +IPFILTER_CHROMA_PS_AVX512_64xN 16 +IPFILTER_CHROMA_PS_AVX512_64xN 32 +IPFILTER_CHROMA_PS_AVX512_64xN 48 +IPFILTER_CHROMA_PS_AVX512_64xN 64 + +%macro PROCESS_IPFILTER_CHROMA_PS_16x2_AVX512 0 + ; register map + ; m0 , m1 - interpolate coeff + ; m2 , m3 - shuffle order table + ; m4 - INTERP_OFFSET_PS + ; m5 - shuffle store order table + + movu ym6, [r0] + vinserti32x8 m6, [r0 + r1], 1 + movu ym7, [r0 + 8] + vinserti32x8 m7, [r0 + r1 + 8], 1 + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2], ym6 + vextracti32x8 [r2 + r3], m6, 1 +%endmacro +%macro PROCESS_IPFILTER_CHROMA_PS_16x1_AVX512 0 + movu ym6, [r0] + vinserti32x8 m6, [r0 + 8], 1 + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + vextracti32x8 ym7, m6, 1 + packssdw ym6, ym7 + pshufb ym6, ym5 + movu [r2], ym6 +%endmacro +%macro IPFILTER_CHROMA_PS_AVX512_16xN 1 +%if ARCH_X86_64 == 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_ps_16x%1, 4,7,9 + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 8] + vpbroadcastd m1, [r6 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + mova m2, [interp8_hpp_shuf1_load_avx512] + mova m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x4 m4, [INTERP_OFFSET_PS] + mova m5, [interp8_hpp_shuf1_store_avx512] + mov r6d, %1 + sub r0, 2 + test r5d, r5d + jz .loop + sub r0, r1 + add r6d, 3 + PROCESS_IPFILTER_CHROMA_PS_16x1_AVX512 + lea r0, [r0 + r1] + lea r2, [r2 + r3] + dec r6d + +.loop: + PROCESS_IPFILTER_CHROMA_PS_16x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + sub r6d, 2 + jnz .loop + RET +%endif +%endmacro + +IPFILTER_CHROMA_PS_AVX512_16xN 4 +IPFILTER_CHROMA_PS_AVX512_16xN 8 
+IPFILTER_CHROMA_PS_AVX512_16xN 12 +IPFILTER_CHROMA_PS_AVX512_16xN 16 +IPFILTER_CHROMA_PS_AVX512_16xN 24 +IPFILTER_CHROMA_PS_AVX512_16xN 32 +IPFILTER_CHROMA_PS_AVX512_16xN 64 + +%macro PROCESS_IPFILTER_CHROMA_PS_48x2_AVX512 0 + ; register map + ; m0 , m1 - interpolate coeff + ; m2 , m3 - shuffle load order table + ; m4 - INTERP_OFFSET_PS + ; m5 - shuffle store order table + + movu m6, [r0] + movu m7, [r0 + 8] + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2], m6 + + movu m6, [r0 + r1] + movu m7, [r0 + r1 + 8] + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2 + r3], m6 + + movu ym6, [r0 + mmsize] + vinserti32x8 m6, [r0 + r1 + mmsize], 1 + movu ym7, [r0 + mmsize + 8] + vinserti32x8 m7, [r0 + r1 + mmsize + 8], 1 + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2 + mmsize], ym6 + vextracti32x8 [r2 + r3 + mmsize], m6, 1 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PS_48x1_AVX512 0 + ; register map + ; m0 , m1 - interpolate coeff + ; m2 , m3 - shuffle load order table + ; m4 - INTERP_OFFSET_PS + ; m5 - shuffle store order table + + movu m6, [r0] + movu m7, [r0 + 8] + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2], m6 + + movu ym6, [r0 + mmsize] + movu ym7, [r0 + mmsize + 8] + + pshufb ym8, ym6, ym3 + pshufb ym6, ym2 + pmaddwd ym6, ym0 + pmaddwd ym8, ym1 + paddd ym6, ym8 + paddd ym6, ym4 + psrad ym6, INTERP_SHIFT_PS + + pshufb ym8, ym7, ym3 + pshufb ym7, ym2 + pmaddwd ym7, ym0 + pmaddwd ym8, ym1 + paddd ym7, ym8 + paddd ym7, ym4 + psrad ym7, INTERP_SHIFT_PS + + packssdw ym6, ym7 + pshufb ym6, ym5 + movu [r2 + mmsize], ym6 +%endmacro + +%if ARCH_X86_64 == 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_ps_48x64, 4,7,9 + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 8] + vpbroadcastd m1, [r6 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x4 m4, [INTERP_OFFSET_PS] + vbroadcasti32x8 m5, [interp8_hpp_shuf1_store_avx512] + + mov r6d, 64 + sub r0, 2 + test r5d, r5d + jz .loop + sub r0, r1 + add r6d, 3 + PROCESS_IPFILTER_CHROMA_PS_48x1_AVX512 + lea r0, [r0 + r1] + lea r2, [r2 + r3] + dec r6d +.loop: + PROCESS_IPFILTER_CHROMA_PS_48x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + sub r6d, 2 + jnz .loop + RET +%endif + +%macro PROCESS_IPFILTER_CHROMA_PS_8x4_AVX512 0 + ; register map + ; m0 , m1 - interpolate 
coeff + ; m2 , m3 - shuffle load order table + ; m4 - INTERP_OFFSET_PS + ; m5 - shuffle store order table + + movu xm6, [r0] + vinserti32x4 m6, [r0 + r1], 1 + vinserti32x4 m6, [r0 + 2 * r1], 2 + vinserti32x4 m6, [r0 + r6], 3 + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + movu xm7, [r0 + 8] + vinserti32x4 m7, [r0 + r1 + 8], 1 + vinserti32x4 m7, [r0 + 2 * r1 + 8], 2 + vinserti32x4 m7, [r0 + r6 + 8], 3 + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2], xm6 + vextracti32x4 [r2 + r3], m6, 1 + vextracti32x4 [r2 + 2 * r3], m6, 2 + vextracti32x4 [r2 + r7], m6, 3 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PS_8x3_AVX512 0 + movu xm6, [r0] + vinserti32x4 m6, [r0 + r1], 1 + vinserti32x4 m6, [r0 + 2 * r1], 2 + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + movu xm7, [r0 + 8] + vinserti32x4 m7, [r0 + r1 + 8], 1 + vinserti32x4 m7, [r0 + 2 * r1 + 8], 2 + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2], xm6 + vextracti32x4 [r2 + r3], m6, 1 + vextracti32x4 [r2 + 2 * r3], m6, 2 +%endmacro + +%macro IPFILTER_CHROMA_PS_AVX512_8xN 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_ps_8x%1, 4,9,9 + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m + + lea r6, [3 * r1] + lea r7, [3 * r3] +%ifdef PIC + lea r8, [tab_ChromaCoeff] + vpbroadcastd m0, [r8 + r4 * 8] + vpbroadcastd m1, [r8 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x4 m4, [INTERP_OFFSET_PS] + vbroadcasti32x8 m5, [interp8_hpp_shuf1_store_avx512] + + mov r8d, %1 + sub r0, 2 + test r5d, r5d + jz .loop + sub r0, r1 + add r8d, 3 + PROCESS_IPFILTER_CHROMA_PS_8x3_AVX512 + lea r0, [r0 + r6] + lea r2, [r2 + r7] + sub r8d, 3 + +.loop: + PROCESS_IPFILTER_CHROMA_PS_8x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + sub r8d, 4 + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_CHROMA_PS_AVX512_8xN 4 +IPFILTER_CHROMA_PS_AVX512_8xN 8 +IPFILTER_CHROMA_PS_AVX512_8xN 12 +IPFILTER_CHROMA_PS_AVX512_8xN 16 +IPFILTER_CHROMA_PS_AVX512_8xN 32 +IPFILTER_CHROMA_PS_AVX512_8xN 64 +%endif + +%macro PROCESS_IPFILTER_CHROMA_PS_24x4_AVX512 0 + ; register map + ; m0 , m1 - interpolate coeff + ; m2 , m3 - shuffle order table + ; m4 - INTERP_OFFSET_PS + ; m5 - shuffle store order table + + movu ym6, [r0] + vinserti32x8 m6, [r0 + r1], 1 + movu ym7, [r0 + 8] + vinserti32x8 m7, [r0 + r1 + 8], 1 + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2], ym6 + vextracti32x8 [r2 + r3], m6, 1 + + movu ym6, [r0 + 2 * r1] + vinserti32x8 m6, [r0 + r6], 1 + movu ym7, [r0 + 2 * r1 + 8] + vinserti32x8 m7, [r0 + r6 + 8], 1 + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, 
m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2 + 2 * r3], ym6 + vextracti32x8 [r2 + r7], m6, 1 + + movu xm6, [r0 + mmsize/2] + vinserti32x4 m6, [r0 + r1 + mmsize/2], 1 + vinserti32x4 m6, [r0 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m6, [r0 + r6 + mmsize/2], 3 + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + movu xm7, [r0 + mmsize/2 + 8] + vinserti32x4 m7, [r0 + r1 + mmsize/2 + 8], 1 + vinserti32x4 m7, [r0 + 2 * r1 + mmsize/2 + 8], 2 + vinserti32x4 m7, [r0 + r6 + mmsize/2 + 8], 3 + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2 + mmsize/2], xm6 + vextracti32x4 [r2 + r3 + mmsize/2], m6, 1 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m6, 2 + vextracti32x4 [r2 + r7 + mmsize/2], m6, 3 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PS_24x3_AVX512 0 + movu ym6, [r0] + vinserti32x8 m6, [r0 + r1], 1 + movu ym7, [r0 + 8] + vinserti32x8 m7, [r0 + r1 + 8], 1 + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2], ym6 + vextracti32x8 [r2 + r3], m6, 1 + + movu ym6, [r0 + 2 * r1] + movu ym7, [r0 + 2 * r1 + 8] + + pshufb ym8, ym6, ym3 + pshufb ym6, ym2 + pmaddwd ym6, ym0 + pmaddwd ym8, ym1 + paddd ym6, ym8 + paddd ym6, ym4 + psrad ym6, INTERP_SHIFT_PS + + pshufb ym8, ym7, ym3 + pshufb ym7, ym2 + pmaddwd ym7, ym0 + pmaddwd ym8, ym1 + paddd ym7, ym8 + paddd ym7, ym4 + psrad ym7, INTERP_SHIFT_PS + + packssdw ym6, ym7 + pshufb ym6, ym5 + movu [r2 + 2 * r3], ym6 + + movu xm6, [r0 + mmsize/2] + vinserti32x4 m6, [r0 + r1 + mmsize/2], 1 + vinserti32x4 m6, [r0 + 2 * r1 + mmsize/2], 2 + + pshufb m8, m6, m3 + pshufb m6, m2 + pmaddwd m6, m0 + pmaddwd m8, m1 + paddd m6, m8 + paddd m6, m4 + psrad m6, INTERP_SHIFT_PS + + movu xm7, [r0 + mmsize/2 + 8] + vinserti32x4 m7, [r0 + r1 + mmsize/2 + 8], 1 + vinserti32x4 m7, [r0 + 2 * r1 + mmsize/2 + 8], 2 + + pshufb m8, m7, m3 + pshufb m7, m2 + pmaddwd m7, m0 + pmaddwd m8, m1 + paddd m7, m8 + paddd m7, m4 + psrad m7, INTERP_SHIFT_PS + + packssdw m6, m7 + pshufb m6, m5 + movu [r2 + mmsize/2], xm6 + vextracti32x4 [r2 + r3 + mmsize/2], m6, 1 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m6, 2 +%endmacro + +%macro IPFILTER_CHROMA_PS_AVX512_24xN 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_ps_24x%1, 4,9,9 + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m + + lea r6, [3 * r1] + lea r7, [3 * r3] +%ifdef PIC + lea r8, [tab_ChromaCoeff] + vpbroadcastd m0, [r8 + r4 * 8] + vpbroadcastd m1, [r8 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 8] + vpbroadcastd m1, [tab_ChromaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m3, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x4 m4, [INTERP_OFFSET_PS] + vbroadcasti32x8 m5,[interp8_hpp_shuf1_store_avx512] + + mov r8d, %1 + sub r0, 2 + test r5d, r5d + jz .loop + sub r0, r1 + add r8d, 3 + PROCESS_IPFILTER_CHROMA_PS_24x3_AVX512 + lea r0, [r0 + r6] + lea r2, [r2 + r7] + sub r8d, 3 + +.loop: + PROCESS_IPFILTER_CHROMA_PS_24x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + sub r8d, 4 + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_CHROMA_PS_AVX512_24xN 
32 +IPFILTER_CHROMA_PS_AVX512_24xN 64 +%endif +;------------------------------------------------------------------------------------------------------------- +; avx512 chroma_hps code end +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +; avx512 chroma_vps code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_CHROMA_VERT_PS_8x8_AVX512 0 + movu xm1, [r0] + lea r6, [r0 + 2 * r1] + lea r8, [r0 + 4 * r1] + lea r9, [r8 + 2 * r1] + vinserti32x4 m1, [r6], 1 + vinserti32x4 m1, [r8], 2 + vinserti32x4 m1, [r9], 3 + movu xm3, [r0 + r1] + vinserti32x4 m3, [r6 + r1], 1 + vinserti32x4 m3, [r8 + r1], 2 + vinserti32x4 m3, [r9 + r1], 3 + punpcklwd m0, m1, m3 + pmaddwd m0, [r5] + punpckhwd m1, m3 + pmaddwd m1, [r5] + + movu xm4, [r0 + 2 * r1] + vinserti32x4 m4, [r6 + 2 * r1], 1 + vinserti32x4 m4, [r8 + 2 * r1], 2 + vinserti32x4 m4, [r9 + 2 * r1], 3 + punpcklwd m2, m3, m4 + pmaddwd m2, [r5] + punpckhwd m3, m4 + pmaddwd m3, [r5] + + movu xm5, [r0 + r10] + vinserti32x4 m5, [r6 + r10], 1 + vinserti32x4 m5, [r8 + r10], 2 + vinserti32x4 m5, [r9 + r10], 3 + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + mmsize] + paddd m1, m4 + + movu xm4, [r0 + 4 * r1] + vinserti32x4 m4, [r6 + 4 * r1], 1 + vinserti32x4 m4, [r8 + 4 * r1], 2 + vinserti32x4 m4, [r9 + 4 * r1], 3 + punpcklwd m6, m5, m4 + pmaddwd m6, m9 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m9 + paddd m3, m5 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 + movu [r2], xm0 + movu [r2 + r3], xm2 + vextracti32x4 [r2 + 2 * r3], m0, 1 + vextracti32x4 [r2 + r7], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2], m0, 2 + vextracti32x4 [r2 + r3], m2, 2 + vextracti32x4 [r2 + 2 * r3], m0, 3 + vextracti32x4 [r2 + r7], m2, 3 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_4tap_vert_ps_8x8, 5, 11, 10 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x4 m7, [INTERP_OFFSET_PS] + lea r10, [3 * r1] + lea r7, [3 * r3] + mova m8, [r5] + mova m9, [r5 + mmsize] + PROCESS_CHROMA_VERT_PS_8x8_AVX512 + RET +%endif + +%macro FILTER_VER_PS_CHROMA_8xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_ps_8x%1, 5, 11, 10 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x4 m7, [INTERP_OFFSET_PS] + lea r10, [3 * r1] + lea r7, [3 * r3] + mova m8, [r5] + mova m9, [r5 + mmsize] +%rep %1/8 - 1 + PROCESS_CHROMA_VERT_PS_8x8_AVX512 + lea r0, [r8 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_PS_8x8_AVX512 + RET +%endmacro + +%if ARCH_X86_64 
+FILTER_VER_PS_CHROMA_8xN_AVX512 16 +FILTER_VER_PS_CHROMA_8xN_AVX512 32 +FILTER_VER_PS_CHROMA_8xN_AVX512 64 +%endif + +%macro PROCESS_CHROMA_VERT_PS_16x4_AVX512 0 + movu ym1, [r0] + lea r6, [r0 + 2 * r1] + vinserti32x8 m1, [r6], 1 + movu ym3, [r0 + r1] + vinserti32x8 m3, [r6 + r1], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m8 + punpckhwd m1, m3 + pmaddwd m1, m8 + + movu ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r6 + 2 * r1], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m8 + punpckhwd m3, m4 + pmaddwd m3, m8 + + movu ym5, [r0 + r8] + vinserti32x8 m5, [r6 + r8], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m9 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m9 + paddd m1, m4 + + movu ym4, [r0 + 4 * r1] + vinserti32x8 m4, [r6 + 4 * r1], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m9 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m9 + paddd m3, m5 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r7], m2, 1 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_4tap_vert_ps_16x4, 5, 9, 10 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x4 m7, [INTERP_OFFSET_PS] + lea r7, [3 * r3] + lea r8, [3 * r1] + mova m8, [r5] + mova m9, [r5 + mmsize] + PROCESS_CHROMA_VERT_PS_16x4_AVX512 + RET +%endif + +%macro FILTER_VER_PS_CHROMA_16xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_ps_16x%1, 5, 9, 10 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x4 m7, [INTERP_OFFSET_PS] + lea r7, [3 * r3] + lea r8, [3 * r1] + mova m8, [r5] + mova m9, [r5 + mmsize] +%rep %1/4 - 1 + PROCESS_CHROMA_VERT_PS_16x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_PS_16x4_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +FILTER_VER_PS_CHROMA_16xN_AVX512 8 +FILTER_VER_PS_CHROMA_16xN_AVX512 12 +FILTER_VER_PS_CHROMA_16xN_AVX512 16 +FILTER_VER_PS_CHROMA_16xN_AVX512 24 +FILTER_VER_PS_CHROMA_16xN_AVX512 32 +FILTER_VER_PS_CHROMA_16xN_AVX512 64 +%endif + +%macro PROCESS_CHROMA_VERT_PS_24x8_AVX512 0 + movu ym1, [r0] + lea r6, [r0 + 2 * r1] + lea r8, [r0 + 4 * r1] + lea r9, [r8 + 2 * r1] + + movu ym10, [r8] + movu ym3, [r0 + r1] + movu ym12, [r8 + r1] + vinserti32x8 m1, [r6], 1 + vinserti32x8 m10, [r9], 1 + vinserti32x8 m3, [r6 + r1], 1 + vinserti32x8 m12, [r9 + r1], 1 + + punpcklwd m0, m1, m3 + punpcklwd m9, m10, m12 + pmaddwd m0, m16 + pmaddwd m9, m16 + punpckhwd m1, m3 + punpckhwd m10, m12 + pmaddwd m1, m16 + pmaddwd m10, m16 + + movu ym4, [r0 + 2 * r1] + movu ym13, [r8 + 2 * r1] + vinserti32x8 m4, [r6 + 2 * r1], 1 + vinserti32x8 m13, [r9 + 2 * r1], 1 + punpcklwd m2, m3, m4 + punpcklwd m11, m12, m13 + pmaddwd m2, m16 + pmaddwd m11, m16 + punpckhwd m3, m4 + punpckhwd m12, m13 + pmaddwd m3, m16 + pmaddwd m12, m16 + + movu ym5, [r0 + 
r10] + vinserti32x8 m5, [r6 + r10], 1 + movu ym14, [r8 + r10] + vinserti32x8 m14, [r9 + r10], 1 + punpcklwd m6, m4, m5 + punpcklwd m15, m13, m14 + pmaddwd m6, m17 + pmaddwd m15, m17 + paddd m0, m6 + paddd m9, m15 + punpckhwd m4, m5 + punpckhwd m13, m14 + pmaddwd m4, m17 + pmaddwd m13, m17 + paddd m1, m4 + paddd m10, m13 + + movu ym4, [r0 + 4 * r1] + vinserti32x8 m4, [r6 + 4 * r1], 1 + movu ym13, [r8 + 4 * r1] + vinserti32x8 m13, [r9 + 4 * r1], 1 + punpcklwd m6, m5, m4 + punpcklwd m15, m14, m13 + pmaddwd m6, m17 + pmaddwd m15, m17 + paddd m2, m6 + paddd m11, m15 + punpckhwd m5, m4 + punpckhwd m14, m13 + pmaddwd m5, m17 + pmaddwd m14, m17 + paddd m3, m5 + paddd m12, m14 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + paddd m12, m7 + + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + psrad m9, INTERP_SHIFT_PS + psrad m10, INTERP_SHIFT_PS + psrad m11, INTERP_SHIFT_PS + psrad m12, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 + packssdw m9, m10 + packssdw m11, m12 + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r7], m2, 1 + lea r11, [r2 + 4 * r3] + movu [r11], ym9 + movu [r11 + r3], ym11 + vextracti32x8 [r11 + 2 * r3], m9, 1 + vextracti32x8 [r11 + r7], m11, 1 + + movu xm1, [r0 + mmsize/2] + vinserti32x4 m1, [r6 + mmsize/2], 1 + vinserti32x4 m1, [r8 + mmsize/2], 2 + vinserti32x4 m1, [r9 + mmsize/2], 3 + movu xm3, [r0 + r1 + mmsize/2] + vinserti32x4 m3, [r6 + r1 + mmsize/2], 1 + vinserti32x4 m3, [r8 + r1 + mmsize/2], 2 + vinserti32x4 m3, [r9 + r1 + mmsize/2], 3 + punpcklwd m0, m1, m3 + pmaddwd m0, m16 + punpckhwd m1, m3 + pmaddwd m1, m16 + + movu xm4, [r0 + 2 * r1 + mmsize/2] + vinserti32x4 m4, [r6 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r8 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r9 + 2 * r1 + mmsize/2], 3 + punpcklwd m2, m3, m4 + pmaddwd m2, m16 + punpckhwd m3, m4 + pmaddwd m3, m16 + + movu xm5, [r0 + r10 + mmsize/2] + vinserti32x4 m5, [r6 + r10 + mmsize/2], 1 + vinserti32x4 m5, [r8 + r10 + mmsize/2], 2 + vinserti32x4 m5, [r9 + r10 + mmsize/2], 3 + punpcklwd m6, m4, m5 + pmaddwd m6, m17 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m17 + paddd m1, m4 + + movu xm4, [r0 + 4 * r1 + mmsize/2] + vinserti32x4 m4, [r6 + 4 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r8 + 4 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r9 + 4 * r1 + mmsize/2], 3 + punpcklwd m6, m5, m4 + pmaddwd m6, m17 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m17 + paddd m3, m5 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 + movu [r2 + mmsize/2], xm0 + movu [r2 + r3 + mmsize/2], xm2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 1 + vextracti32x4 [r2 + r7 + mmsize/2], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2 + mmsize/2], m0, 2 + vextracti32x4 [r2 + r3 + mmsize/2], m2, 2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 3 + vextracti32x4 [r2 + r7 + mmsize/2], m2, 3 +%endmacro + +%macro FILTER_VER_PS_CHROMA_24xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_ps_24x%1, 5, 12, 18 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x4 m7, [INTERP_OFFSET_PS] + lea r10, [3 * r1] + lea r7, [3 * r3] + mova m16, [r5] + mova m17, [r5 + mmsize] 
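; m16/m17 hold the two halves of the selected 4-tap vertical chroma
; coefficient set from tab_ChromaCoeffV_avx512: pmaddwd applies the first
; pair to each row and the row below it, and the second pair to the two
; rows after that.  The %rep below unrolls the 24-wide PS filter in 8-row
; blocks, with the last block emitted after the loop so the final
; iteration skips the pointer advance.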
+%rep %1/8 - 1 + PROCESS_CHROMA_VERT_PS_24x8_AVX512 + lea r0, [r8 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_PS_24x8_AVX512 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_PS_CHROMA_24xN_AVX512 32 + FILTER_VER_PS_CHROMA_24xN_AVX512 64 +%endif + +%macro PROCESS_CHROMA_VERT_PS_32x2_AVX512 0 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, m9 + punpckhwd m1, m3 + pmaddwd m1, m9 + + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, m9 + punpckhwd m3, m4 + pmaddwd m3, m9 + + lea r0, [r0 + 2 * r1] + movu m5, [r0 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, m10 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m10 + paddd m1, m4 + + movu m4, [r0 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, m10 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m10 + paddd m3, m5 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 + movu [r2], m0 + movu [r2 + r3], m2 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_PS_CHROMA_32xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_ps_32x%1, 5, 7, 11 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x4 m7, [INTERP_OFFSET_PS] + mova m9, [r5] + mova m10, [r5 + mmsize] +%rep %1/2 - 1 + PROCESS_CHROMA_VERT_PS_32x2_AVX512 + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_CHROMA_VERT_PS_32x2_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +FILTER_VER_PS_CHROMA_32xN_AVX512 8 +FILTER_VER_PS_CHROMA_32xN_AVX512 16 +FILTER_VER_PS_CHROMA_32xN_AVX512 24 +FILTER_VER_PS_CHROMA_32xN_AVX512 32 +FILTER_VER_PS_CHROMA_32xN_AVX512 48 +FILTER_VER_PS_CHROMA_32xN_AVX512 64 +%endif + +%macro PROCESS_CHROMA_VERT_PS_48x4_AVX512 0 + movu m1, [r0] + lea r6, [r0 + 2 * r1] + movu m10, [r6] + movu m3, [r0 + r1] + movu m12, [r6 + r1] + punpcklwd m0, m1, m3 + punpcklwd m9, m10, m12 + pmaddwd m0, m16 + pmaddwd m9, m16 + punpckhwd m1, m3 + punpckhwd m10, m12 + pmaddwd m1, m16 + pmaddwd m10, m16 + + movu m4, [r0 + 2 * r1] + movu m13, [r6 + 2 * r1] + punpcklwd m2, m3, m4 + punpcklwd m11, m12, m13 + pmaddwd m2, m16 + pmaddwd m11, m16 + punpckhwd m3, m4 + punpckhwd m12, m13 + pmaddwd m3, m16 + pmaddwd m12, m16 + + movu m5, [r0 + r7] + movu m14, [r6 + r7] + punpcklwd m6, m4, m5 + punpcklwd m15, m13, m14 + pmaddwd m6, m17 + pmaddwd m15, m17 + paddd m0, m6 + paddd m9, m15 + punpckhwd m4, m5 + punpckhwd m13, m14 + pmaddwd m4, m17 + pmaddwd m13, m17 + paddd m1, m4 + paddd m10, m13 + + movu m4, [r0 + 4 * r1] + movu m13, [r6 + 4 * r1] + punpcklwd m6, m5, m4 + punpcklwd m15, m14, m13 + pmaddwd m6, m17 + pmaddwd m15, m17 + paddd m2, m6 + paddd m11, m15 + punpckhwd m5, m4 + punpckhwd m14, m13 + pmaddwd m5, m17 + pmaddwd m14, m17 + paddd m3, m5 + paddd m12, m14 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + paddd m12, m7 + + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + psrad m9, INTERP_SHIFT_PS + psrad m10, INTERP_SHIFT_PS + 
psrad m11, INTERP_SHIFT_PS + psrad m12, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 + packssdw m9, m10 + packssdw m11, m12 + movu [r2], m0 + movu [r2 + r3], m2 + movu [r2 + 2 * r3], m9 + movu [r2 + r8], m11 + + movu ym1, [r0 + mmsize] + vinserti32x8 m1, [r6 + mmsize], 1 + movu ym3, [r0 + r1 + mmsize] + vinserti32x8 m3, [r6 + r1 + mmsize], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m16 + punpckhwd m1, m3 + pmaddwd m1, m16 + + movu ym4, [r0 + 2 * r1 + mmsize] + vinserti32x8 m4, [r6 + 2 * r1 + mmsize], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m16 + punpckhwd m3, m4 + pmaddwd m3, m16 + + movu ym5, [r0 + r7 + mmsize] + vinserti32x8 m5, [r6 + r7 + mmsize], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m17 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m17 + paddd m1, m4 + + movu ym4, [r0 + 4 * r1 + mmsize] + vinserti32x8 m4, [r6 + 4 * r1 + mmsize], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m17 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m17 + paddd m3, m5 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 + movu [r2 + mmsize], ym0 + movu [r2 + r3 + mmsize], ym2 + vextracti32x8 [r2 + 2 * r3 + mmsize], m0, 1 + vextracti32x8 [r2 + r8 + mmsize], m2, 1 +%endmacro + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_4tap_vert_ps_48x64, 5, 9, 18 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + lea r7, [3 * r1] + lea r8, [3 * r3] + vbroadcasti32x4 m7, [INTERP_OFFSET_PS] + mova m16, [r5] + mova m17, [r5 + mmsize] +%rep 15 + PROCESS_CHROMA_VERT_PS_48x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_PS_48x4_AVX512 + RET +%endif + +%macro PROCESS_CHROMA_VERT_PS_64x2_AVX512 0 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu m9, [r0 + mmsize] + movu m11, [r0 + r1 + mmsize] + punpcklwd m8, m9, m11 + pmaddwd m8, m15 + punpckhwd m9, m11 + pmaddwd m9, m15 + + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu m12, [r0 + 2 * r1 + mmsize] + punpcklwd m10, m11, m12 + pmaddwd m10, m15 + punpckhwd m11, m12 + pmaddwd m11, m15 + + lea r0, [r0 + 2 * r1] + movu m5, [r0 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m16 + paddd m1, m4 + + movu m13, [r0 + r1 + mmsize] + punpcklwd m14, m12, m13 + pmaddwd m14, m16 + paddd m8, m14 + punpckhwd m12, m13 + pmaddwd m12, m16 + paddd m9, m12 + + movu m4, [r0 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m16 + paddd m3, m5 + + movu m12, [r0 + 2 * r1 + mmsize] + punpcklwd m14, m13, m12 + pmaddwd m14, m16 + paddd m10, m14 + punpckhwd m13, m12 + pmaddwd m13, m16 + paddd m11, m13 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m8, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + psrad m8, INTERP_SHIFT_PS + psrad m9, INTERP_SHIFT_PS + psrad m10, INTERP_SHIFT_PS + psrad m11, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 + movu [r2], m0 + movu [r2 + r3], m2 + movu [r2 + mmsize], m8 + movu [r2 + r3 + mmsize], m10 +%endmacro + 
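; Reference sketch for the vertical PS path (illustrative only; the C-style
; loop below and its variable names are an assumption added for
; documentation, not code taken from this file).  Each output sample is a
; 4-tap weighted sum of vertically adjacent rows, biased and shifted down
; to a signed 16-bit intermediate with no clamping (packssdw saturates the
; 32-bit sums):
;
;   for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
;       for (int x = 0; x < width; x++)
;       {
;           int sum = c[0] * src[x - srcStride] + c[1] * src[x]
;                   + c[2] * src[x + srcStride] + c[3] * src[x + 2 * srcStride];
;           dst[x] = (int16_t)((sum + INTERP_OFFSET_PS) >> INTERP_SHIFT_PS);
;       }
;
; The PP variants above use INTERP_OFFSET_PP/INTERP_SHIFT_PP and clamp the
; result to [0, pw_pixel_max] before storing; the SS/SP variants further
; down shift by 6 with no offset, or by INTERP_SHIFT_SP with
; INTERP_OFFSET_SP plus a clamp, respectively.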
+;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_PS_CHROMA_64xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_ps_64x%1, 5, 7, 17 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] +%endif + vbroadcasti32x4 m7, [INTERP_OFFSET_PS] + mova m15, [r5] + mova m16, [r5 + mmsize] + +%rep %1/2 - 1 + PROCESS_CHROMA_VERT_PS_64x2_AVX512 + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_CHROMA_VERT_PS_64x2_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +FILTER_VER_PS_CHROMA_64xN_AVX512 16 +FILTER_VER_PS_CHROMA_64xN_AVX512 32 +FILTER_VER_PS_CHROMA_64xN_AVX512 48 +FILTER_VER_PS_CHROMA_64xN_AVX512 64 +%endif +;------------------------------------------------------------------------------------------------------------- +; avx512 chroma_vps code end +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +; avx512 chroma_vsp and chroma_vss code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_CHROMA_VERT_S_8x8_AVX512 1 + movu xm1, [r0] + lea r6, [r0 + 2 * r1] + lea r8, [r0 + 4 * r1] + lea r9, [r8 + 2 * r1] + vinserti32x4 m1, [r6], 1 + vinserti32x4 m1, [r8], 2 + vinserti32x4 m1, [r9], 3 + movu xm3, [r0 + r1] + vinserti32x4 m3, [r6 + r1], 1 + vinserti32x4 m3, [r8 + r1], 2 + vinserti32x4 m3, [r9 + r1], 3 + punpcklwd m0, m1, m3 + pmaddwd m0, m8 + punpckhwd m1, m3 + pmaddwd m1, m8 + + movu xm4, [r0 + 2 * r1] + vinserti32x4 m4, [r6 + 2 * r1], 1 + vinserti32x4 m4, [r8 + 2 * r1], 2 + vinserti32x4 m4, [r9 + 2 * r1], 3 + punpcklwd m2, m3, m4 + pmaddwd m2, m8 + punpckhwd m3, m4 + pmaddwd m3, m8 + + movu xm5, [r0 + r10] + vinserti32x4 m5, [r6 + r10], 1 + vinserti32x4 m5, [r8 + r10], 2 + vinserti32x4 m5, [r9 + r10], 3 + punpcklwd m6, m4, m5 + pmaddwd m6, m9 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m9 + paddd m1, m4 + + movu xm4, [r0 + 4 * r1] + vinserti32x4 m4, [r6 + 4 * r1], 1 + vinserti32x4 m4, [r8 + 4 * r1], 2 + vinserti32x4 m4, [r9 + 4 * r1], 3 + punpcklwd m6, m5, m4 + pmaddwd m6, m9 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m9 + paddd m3, m5 + +%ifidn %1,sp + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m10, m11 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2], xm0 + movu [r2 + r3], xm2 + vextracti32x4 [r2 + 2 * r3], m0, 1 + vextracti32x4 [r2 + r7], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2], m0, 2 + vextracti32x4 [r2 + r3], m2, 2 + vextracti32x4 [r2 + 2 * r3], m0, 3 + vextracti32x4 [r2 + r7], m2, 3 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) 
+;----------------------------------------------------------------------------------------------------------------- +%macro CHROMA_VERT_S_8x8_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_8x8, 5, 11, 12 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + mova m8, [r5 + r4] + mova m9, [r5 + r4 + mmsize] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] + mova m8, [r5] + mova m9, [r5 + mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m7, [INTERP_OFFSET_SP] + pxor m10, m10 + vbroadcasti32x8 m11, [pw_pixel_max] +%endif + lea r10, [3 * r1] + lea r7, [3 * r3] + + PROCESS_CHROMA_VERT_S_8x8_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + CHROMA_VERT_S_8x8_AVX512 ss + CHROMA_VERT_S_8x8_AVX512 sp +%endif +%macro FILTER_VER_S_CHROMA_8xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_8x%2, 5, 11, 10 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + mova m8, [r5 + r4] + mova m9, [r5 + r4 + mmsize] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] + mova m8, [r5] + mova m9, [r5 + mmsize] +%endif + +%ifidn %1, sp + vbroadcasti32x4 m7, [INTERP_OFFSET_SP] + pxor m10, m10 + vbroadcasti32x8 m11, [pw_pixel_max] +%endif + lea r10, [3 * r1] + lea r7, [3 * r3] + +%rep %2/8 - 1 + PROCESS_CHROMA_VERT_S_8x8_AVX512 %1 + lea r0, [r8 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_S_8x8_AVX512 %1 + RET +%endmacro +%if ARCH_X86_64 + FILTER_VER_S_CHROMA_8xN_AVX512 ss, 16 + FILTER_VER_S_CHROMA_8xN_AVX512 ss, 32 + FILTER_VER_S_CHROMA_8xN_AVX512 ss, 64 + FILTER_VER_S_CHROMA_8xN_AVX512 sp, 16 + FILTER_VER_S_CHROMA_8xN_AVX512 sp, 32 + FILTER_VER_S_CHROMA_8xN_AVX512 sp, 64 +%endif +%macro PROCESS_CHROMA_VERT_S_16x4_AVX512 1 + movu ym1, [r0] + lea r6, [r0 + 2 * r1] + vinserti32x8 m1, [r6], 1 + movu ym3, [r0 + r1] + vinserti32x8 m3, [r6 + r1], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m8 + punpckhwd m1, m3 + pmaddwd m1, m8 + + movu ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r6 + 2 * r1], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m8 + punpckhwd m3, m4 + pmaddwd m3, m8 + + movu ym5, [r0 + r8] + vinserti32x8 m5, [r6 + r8], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m9 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m9 + paddd m1, m4 + + movu ym4, [r0 + 4 * r1] + vinserti32x8 m4, [r6 + 4 * r1], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m9 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m9 + paddd m3, m5 + +%ifidn %1,sp + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m10, m11 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r7], m2, 1 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro CHROMA_VERT_S_16x4_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_16x4, 5, 9, 12 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + mova m8, [r5 + r4] + mova m9, [r5 + r4 + mmsize] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] + 
mova m8, [r5] + mova m9, [r5 + mmsize] +%endif + +%ifidn %1, sp + vbroadcasti32x4 m7, [INTERP_OFFSET_SP] + pxor m10, m10 + vbroadcasti32x8 m11, [pw_pixel_max] +%endif + lea r7, [3 * r3] + lea r8, [3 * r1] + PROCESS_CHROMA_VERT_S_16x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + CHROMA_VERT_S_16x4_AVX512 ss + CHROMA_VERT_S_16x4_AVX512 sp +%endif +%macro FILTER_VER_S_CHROMA_16xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_16x%2, 5, 9, 12 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + mova m8, [r5 + r4] + mova m9, [r5 + r4 + mmsize] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] + mova m8, [r5] + mova m9, [r5 + mmsize] +%endif + +%ifidn %1, sp + vbroadcasti32x4 m7, [INTERP_OFFSET_SP] + pxor m10, m10 + vbroadcasti32x8 m11, [pw_pixel_max] +%endif + lea r7, [3 * r3] + lea r8, [3 * r1] +%rep %2/4 - 1 + PROCESS_CHROMA_VERT_S_16x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_S_16x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 8 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 12 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 16 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 24 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 32 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 64 + FILTER_VER_S_CHROMA_16xN_AVX512 sp, 8 + FILTER_VER_S_CHROMA_16xN_AVX512 sp, 12 + FILTER_VER_S_CHROMA_16xN_AVX512 sp, 16 + FILTER_VER_S_CHROMA_16xN_AVX512 sp, 24 + FILTER_VER_S_CHROMA_16xN_AVX512 sp, 32 + FILTER_VER_S_CHROMA_16xN_AVX512 sp, 64 +%endif + +%macro PROCESS_CHROMA_VERT_S_24x8_AVX512 1 + movu ym1, [r0] + lea r6, [r0 + 2 * r1] + lea r8, [r0 + 4 * r1] + lea r9, [r8 + 2 * r1] + + movu ym10, [r8] + movu ym3, [r0 + r1] + movu ym12, [r8 + r1] + vinserti32x8 m1, [r6], 1 + vinserti32x8 m10, [r9], 1 + vinserti32x8 m3, [r6 + r1], 1 + vinserti32x8 m12, [r9 + r1], 1 + + punpcklwd m0, m1, m3 + punpcklwd m9, m10, m12 + pmaddwd m0, m16 + pmaddwd m9, m16 + punpckhwd m1, m3 + punpckhwd m10, m12 + pmaddwd m1, m16 + pmaddwd m10, m16 + + movu ym4, [r0 + 2 * r1] + movu ym13, [r8 + 2 * r1] + vinserti32x8 m4, [r6 + 2 * r1], 1 + vinserti32x8 m13, [r9 + 2 * r1], 1 + punpcklwd m2, m3, m4 + punpcklwd m11, m12, m13 + pmaddwd m2, m16 + pmaddwd m11, m16 + punpckhwd m3, m4 + punpckhwd m12, m13 + pmaddwd m3, m16 + pmaddwd m12, m16 + + movu ym5, [r0 + r10] + vinserti32x8 m5, [r6 + r10], 1 + movu ym14, [r8 + r10] + vinserti32x8 m14, [r9 + r10], 1 + punpcklwd m6, m4, m5 + punpcklwd m15, m13, m14 + pmaddwd m6, m17 + pmaddwd m15, m17 + paddd m0, m6 + paddd m9, m15 + punpckhwd m4, m5 + punpckhwd m13, m14 + pmaddwd m4, m17 + pmaddwd m13, m17 + paddd m1, m4 + paddd m10, m13 + + movu ym4, [r0 + 4 * r1] + vinserti32x8 m4, [r6 + 4 * r1], 1 + movu ym13, [r8 + 4 * r1] + vinserti32x8 m13, [r9 + 4 * r1], 1 + punpcklwd m6, m5, m4 + punpcklwd m15, m14, m13 + pmaddwd m6, m17 + pmaddwd m15, m17 + paddd m2, m6 + paddd m11, m15 + punpckhwd m5, m4 + punpckhwd m14, m13 + pmaddwd m5, m17 + pmaddwd m14, m17 + paddd m3, m5 + paddd m12, m14 + +%ifidn %1,sp + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + paddd m12, m7 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP + psrad m12, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + packssdw m9, m10 + packssdw m11, m12 + CLIPW2 m0, m2, m18, m19 + CLIPW2 m9, m11, m18, m19 +%else + psrad m0, 6 + psrad m1, 6 + 
psrad m2, 6 + psrad m3, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + psrad m12, 6 + + packssdw m0, m1 + packssdw m2, m3 + packssdw m9, m10 + packssdw m11, m12 +%endif + + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r7], m2, 1 + lea r11, [r2 + 4 * r3] + movu [r11], ym9 + movu [r11 + r3], ym11 + vextracti32x8 [r11 + 2 * r3], m9, 1 + vextracti32x8 [r11 + r7], m11, 1 + + movu xm1, [r0 + mmsize/2] + vinserti32x4 m1, [r6 + mmsize/2], 1 + vinserti32x4 m1, [r8 + mmsize/2], 2 + vinserti32x4 m1, [r9 + mmsize/2], 3 + movu xm3, [r0 + r1 + mmsize/2] + vinserti32x4 m3, [r6 + r1 + mmsize/2], 1 + vinserti32x4 m3, [r8 + r1 + mmsize/2], 2 + vinserti32x4 m3, [r9 + r1 + mmsize/2], 3 + punpcklwd m0, m1, m3 + pmaddwd m0, m16 + punpckhwd m1, m3 + pmaddwd m1, m16 + + movu xm4, [r0 + 2 * r1 + mmsize/2] + vinserti32x4 m4, [r6 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r8 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r9 + 2 * r1 + mmsize/2], 3 + punpcklwd m2, m3, m4 + pmaddwd m2, m16 + punpckhwd m3, m4 + pmaddwd m3, m16 + + movu xm5, [r0 + r10 + mmsize/2] + vinserti32x4 m5, [r6 + r10 + mmsize/2], 1 + vinserti32x4 m5, [r8 + r10 + mmsize/2], 2 + vinserti32x4 m5, [r9 + r10 + mmsize/2], 3 + punpcklwd m6, m4, m5 + pmaddwd m6, m17 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m17 + paddd m1, m4 + + movu xm4, [r0 + 4 * r1 + mmsize/2] + vinserti32x4 m4, [r6 + 4 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r8 + 4 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r9 + 4 * r1 + mmsize/2], 3 + punpcklwd m6, m5, m4 + pmaddwd m6, m17 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m17 + paddd m3, m5 + +%ifidn %1,sp + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m18, m19 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2 + mmsize/2], xm0 + movu [r2 + r3 + mmsize/2], xm2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 1 + vextracti32x4 [r2 + r7 + mmsize/2], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2 + mmsize/2], m0, 2 + vextracti32x4 [r2 + r3 + mmsize/2], m2, 2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 3 + vextracti32x4 [r2 + r7 + mmsize/2], m2, 3 +%endmacro +%macro FILTER_VER_S_CHROMA_24xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_24x%2, 5, 12, 20 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + mova m16, [r5 + r4] + mova m17, [r5 + r4 + mmsize] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] + mova m16, [r5] + mova m17, [r5 + mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m7, [INTERP_OFFSET_SP] + pxor m18, m18 + vbroadcasti32x8 m19, [pw_pixel_max] +%endif + lea r10, [3 * r1] + lea r7, [3 * r3] +%rep %2/8 - 1 + PROCESS_CHROMA_VERT_S_24x8_AVX512 %1 + lea r0, [r8 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_S_24x8_AVX512 %1 + RET +%endmacro +%if ARCH_X86_64 + FILTER_VER_S_CHROMA_24xN_AVX512 ss, 32 + FILTER_VER_S_CHROMA_24xN_AVX512 ss, 64 + FILTER_VER_S_CHROMA_24xN_AVX512 sp, 32 + FILTER_VER_S_CHROMA_24xN_AVX512 sp, 64 +%endif + +%macro PROCESS_CHROMA_VERT_S_32x4_AVX512 1 + movu m1, [r0] + lea r6, [r0 + 2 * r1] + movu m10, [r6] + movu m3, [r0 + r1] + movu m12, [r6 + r1] + punpcklwd m0, m1, m3 + punpcklwd m9, m10, m12 + pmaddwd m0, m16 + pmaddwd m9, m16 + punpckhwd m1, m3 + punpckhwd m10, m12 + pmaddwd m1, m16 + pmaddwd m10, m16 + movu m4, [r0 + 2 
* r1] + movu m13, [r6 + 2 * r1] + punpcklwd m2, m3, m4 + punpcklwd m11, m12, m13 + pmaddwd m2, m16 + pmaddwd m11, m16 + punpckhwd m3, m4 + punpckhwd m12, m13 + pmaddwd m3, m16 + pmaddwd m12, m16 + + movu m5, [r0 + r7] + movu m14, [r6 + r7] + punpcklwd m6, m4, m5 + punpcklwd m15, m13, m14 + pmaddwd m6, m17 + pmaddwd m15, m17 + paddd m0, m6 + paddd m9, m15 + punpckhwd m4, m5 + punpckhwd m13, m14 + pmaddwd m4, m17 + pmaddwd m13, m17 + paddd m1, m4 + paddd m10, m13 + + movu m4, [r0 + 4 * r1] + movu m13, [r6 + 4 * r1] + punpcklwd m6, m5, m4 + punpcklwd m15, m14, m13 + pmaddwd m6, m17 + pmaddwd m15, m17 + paddd m2, m6 + paddd m11, m15 + punpckhwd m5, m4 + punpckhwd m14, m13 + pmaddwd m5, m17 + pmaddwd m14, m17 + paddd m3, m5 + paddd m12, m14 +%ifidn %1,sp + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + paddd m12, m7 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP + psrad m12, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + packssdw m9, m10 + packssdw m11, m12 + CLIPW2 m0, m2, m18, m19 + CLIPW2 m9, m11, m18, m19 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + psrad m12, 6 + + packssdw m0, m1 + packssdw m2, m3 + packssdw m9, m10 + packssdw m11, m12 +%endif + + movu [r2], m0 + movu [r2 + r3], m2 + movu [r2 + 2 * r3], m9 + movu [r2 + r8], m11 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_CHROMA_32xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_32x%2, 5, 9, 20 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + mova m16, [r5 + r4] + mova m17, [r5 + r4 + mmsize] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] + mova m16, [r5] + mova m17, [r5 + mmsize] +%endif + lea r7, [3 * r1] + lea r8, [3 * r3] +%ifidn %1, sp + vbroadcasti32x4 m7, [INTERP_OFFSET_SP] + pxor m18, m18 + vbroadcasti32x8 m19, [pw_pixel_max] +%endif + +%rep %2/4 - 1 + PROCESS_CHROMA_VERT_S_32x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_S_32x4_AVX512 %1 + RET +%endmacro +%if ARCH_X86_64 + FILTER_VER_S_CHROMA_32xN_AVX512 ss, 8 + FILTER_VER_S_CHROMA_32xN_AVX512 ss, 16 + FILTER_VER_S_CHROMA_32xN_AVX512 ss, 24 + FILTER_VER_S_CHROMA_32xN_AVX512 ss, 32 + FILTER_VER_S_CHROMA_32xN_AVX512 ss, 48 + FILTER_VER_S_CHROMA_32xN_AVX512 ss, 64 + FILTER_VER_S_CHROMA_32xN_AVX512 sp, 8 + FILTER_VER_S_CHROMA_32xN_AVX512 sp, 16 + FILTER_VER_S_CHROMA_32xN_AVX512 sp, 24 + FILTER_VER_S_CHROMA_32xN_AVX512 sp, 32 + FILTER_VER_S_CHROMA_32xN_AVX512 sp, 48 + FILTER_VER_S_CHROMA_32xN_AVX512 sp, 64 +%endif +%macro PROCESS_CHROMA_VERT_S_48x4_AVX512 1 + movu m1, [r0] + lea r6, [r0 + 2 * r1] + movu m10, [r6] + movu m3, [r0 + r1] + movu m12, [r6 + r1] + punpcklwd m0, m1, m3 + punpcklwd m9, m10, m12 + pmaddwd m0, m16 + pmaddwd m9, m16 + punpckhwd m1, m3 + punpckhwd m10, m12 + pmaddwd m1, m16 + pmaddwd m10, m16 + + movu m4, [r0 + 2 * r1] + movu m13, [r6 + 2 * r1] + punpcklwd m2, m3, m4 + punpcklwd m11, m12, m13 + pmaddwd m2, m16 + pmaddwd m11, m16 + punpckhwd m3, m4 + 
punpckhwd m12, m13 + pmaddwd m3, m16 + pmaddwd m12, m16 + + movu m5, [r0 + r7] + movu m14, [r6 + r7] + punpcklwd m6, m4, m5 + punpcklwd m15, m13, m14 + pmaddwd m6, m17 + pmaddwd m15, m17 + paddd m0, m6 + paddd m9, m15 + punpckhwd m4, m5 + punpckhwd m13, m14 + pmaddwd m4, m17 + pmaddwd m13, m17 + paddd m1, m4 + paddd m10, m13 + + movu m4, [r0 + 4 * r1] + movu m13, [r6 + 4 * r1] + punpcklwd m6, m5, m4 + punpcklwd m15, m14, m13 + pmaddwd m6, m17 + pmaddwd m15, m17 + paddd m2, m6 + paddd m11, m15 + punpckhwd m5, m4 + punpckhwd m14, m13 + pmaddwd m5, m17 + pmaddwd m14, m17 + paddd m3, m5 + paddd m12, m14 + +%ifidn %1,sp + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + paddd m12, m7 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP + psrad m12, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + packssdw m9, m10 + packssdw m11, m12 + CLIPW2 m0, m2, m18, m19 + CLIPW2 m9, m11, m18, m19 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + psrad m12, 6 + packssdw m0, m1 + packssdw m2, m3 + packssdw m9, m10 + packssdw m11, m12 +%endif + + movu [r2], m0 + movu [r2 + r3], m2 + movu [r2 + 2 * r3], m9 + movu [r2 + r8], m11 + + movu ym1, [r0 + mmsize] + vinserti32x8 m1, [r6 + mmsize], 1 + movu ym3, [r0 + r1 + mmsize] + vinserti32x8 m3, [r6 + r1 + mmsize], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m16 + punpckhwd m1, m3 + pmaddwd m1, m16 + + movu ym4, [r0 + 2 * r1 + mmsize] + vinserti32x8 m4, [r6 + 2 * r1 + mmsize], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m16 + punpckhwd m3, m4 + pmaddwd m3, m16 + + movu ym5, [r0 + r7 + mmsize] + vinserti32x8 m5, [r6 + r7 + mmsize], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m17 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m17 + paddd m1, m4 + + movu ym4, [r0 + 4 * r1 + mmsize] + vinserti32x8 m4, [r6 + 4 * r1 + mmsize], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m17 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m17 + paddd m3, m5 + +%ifidn %1,sp + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m18, m19 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2 + mmsize], ym0 + movu [r2 + r3 + mmsize], ym2 + vextracti32x8 [r2 + 2 * r3 + mmsize], m0, 1 + vextracti32x8 [r2 + r8 + mmsize], m2, 1 +%endmacro +%macro CHROMA_VERT_S_48x4_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_48x64, 5, 9, 20 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + mova m16, [r5 + r4] + mova m17, [r5 + r4 + mmsize] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] + mova m16, [r5] + mova m17, [r5 + mmsize] +%endif + lea r7, [3 * r1] + lea r8, [3 * r3] +%ifidn %1, sp + vbroadcasti32x4 m7, [INTERP_OFFSET_SP] + pxor m18, m18 + vbroadcasti32x8 m19, [pw_pixel_max] +%endif +%rep 15 + PROCESS_CHROMA_VERT_S_48x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_S_48x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + CHROMA_VERT_S_48x4_AVX512 sp + CHROMA_VERT_S_48x4_AVX512 ss +%endif +%macro PROCESS_CHROMA_VERT_S_64x2_AVX512 1 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd 
m1, m3 + pmaddwd m1, m15 + + movu m9, [r0 + mmsize] + movu m11, [r0 + r1 + mmsize] + punpcklwd m8, m9, m11 + pmaddwd m8, m15 + punpckhwd m9, m11 + pmaddwd m9, m15 + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + movu m12, [r0 + 2 * r1 + mmsize] + punpcklwd m10, m11, m12 + pmaddwd m10, m15 + punpckhwd m11, m12 + pmaddwd m11, m15 + + lea r0, [r0 + 2 * r1] + movu m5, [r0 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m16 + paddd m1, m4 + + movu m13, [r0 + r1 + mmsize] + punpcklwd m14, m12, m13 + pmaddwd m14, m16 + paddd m8, m14 + punpckhwd m12, m13 + pmaddwd m12, m16 + paddd m9, m12 + + movu m4, [r0 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m16 + paddd m3, m5 + + movu m12, [r0 + 2 * r1 + mmsize] + punpcklwd m14, m13, m12 + pmaddwd m14, m16 + paddd m10, m14 + punpckhwd m13, m12 + pmaddwd m13, m16 + paddd m11, m13 + +%ifidn %1,sp + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m8, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m8, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 + CLIPW2 m0, m2, m17, m18 + CLIPW2 m8, m10, m17, m18 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + psrad m8, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 +%endif + + movu [r2], m0 + movu [r2 + r3], m2 + movu [r2 + mmsize], m8 + movu [r2 + r3 + mmsize], m10 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_CHROMA_64xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_64x%2, 5, 7, 19 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 +%ifdef PIC + lea r5, [tab_ChromaCoeffV_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + mmsize] +%else + lea r5, [tab_ChromaCoeffV_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m7, [INTERP_OFFSET_SP] + pxor m17, m17 + vbroadcasti32x8 m18, [pw_pixel_max] +%endif +%rep %2/2 - 1 + PROCESS_CHROMA_VERT_S_64x2_AVX512 %1 + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_CHROMA_VERT_S_64x2_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_CHROMA_64xN_AVX512 ss, 16 + FILTER_VER_S_CHROMA_64xN_AVX512 ss, 32 + FILTER_VER_S_CHROMA_64xN_AVX512 ss, 48 + FILTER_VER_S_CHROMA_64xN_AVX512 ss, 64 + FILTER_VER_S_CHROMA_64xN_AVX512 sp, 16 + FILTER_VER_S_CHROMA_64xN_AVX512 sp, 32 + FILTER_VER_S_CHROMA_64xN_AVX512 sp, 48 + FILTER_VER_S_CHROMA_64xN_AVX512 sp, 64 +%endif +;------------------------------------------------------------------------------------------------------------- +; avx512 chroma_vsp and chroma_vss code end +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +;ipfilter_chroma_avx512 code end 
+;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +;ipfilter_luma_avx512 code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_IPFILTER_LUMA_PP_8x4_AVX512 0 + ; register map + ; m0 , m1, m2, m3 - interpolate coeff + ; m4 , m5 load shuffle order table + ; m6 - pd_32 + ; m7 - zero + ; m8 - pw_pixel_max + ; m9 - store shuffle order table + + movu xm10, [r0] + movu xm11, [r0 + 8] + movu xm12, [r0 + 16] + + vinserti32x4 m10, [r0 + r1], 1 + vinserti32x4 m11, [r0 + r1 + 8], 1 + vinserti32x4 m12, [r0 + r1 + 16], 1 + + vinserti32x4 m10, [r0 + 2 * r1], 2 + vinserti32x4 m11, [r0 + 2 * r1 + 8], 2 + vinserti32x4 m12, [r0 + 2 * r1 + 16], 2 + + vinserti32x4 m10, [r0 + r6], 3 + vinserti32x4 m11, [r0 + r6 + 8], 3 + vinserti32x4 m12, [r0 + r6 + 16], 3 + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m15, m3 + pmaddwd m12, m2 + paddd m12, m15 + paddd m11, m12 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2], xm10 + vextracti32x4 [r2 + r3], m10, 1 + vextracti32x4 [r2 + 2 * r3], m10, 2 + vextracti32x4 [r2 + r7], m10, 3 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PP_16x4_AVX512 0 + ; register map + ; m0 , m1, m2, m3 - interpolate coeff + ; m4 , m5 load shuffle order table + ; m6 - pd_32 + ; m7 - zero + ; m8 - pw_pixel_max + ; m9 - store shuffle order table + + movu ym10, [r0] + vinserti32x8 m10, [r0 + r1], 1 + movu ym11, [r0 + 8] + vinserti32x8 m11, [r0 + r1 + 8], 1 + movu ym12, [r0 + 16] + vinserti32x8 m12, [r0 + r1 + 16], 1 + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m15, m3 + pmaddwd m12, m2 + paddd m12, m15 + paddd m11, m12 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2], ym10 + vextracti32x8 [r2 + r3], m10, 1 + + movu ym10, [r0 + 2 * r1] + vinserti32x8 m10, [r0 + r6], 1 + movu ym11, [r0 + 2 * r1 + 8] + vinserti32x8 m11, [r0 + r6 + 8], 1 + movu ym12, [r0 + 2 * r1 + 16] + vinserti32x8 m12, [r0 + r6 + 16], 1 + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m14, m15, m3 + pmaddwd m16, m12, m2 + paddd m14, m16 + paddd m11, m14 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2 + 2 * r3], ym10 + vextracti32x8 [r2 + r7], m10, 1 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PP_24x4_AVX512 0 + ; register map + ; m0 , m1, m2, m3 - 
interpolate coeff + ; m4 , m5 load shuffle order table + ; m6 - pd_32 + ; m7 - zero + ; m8 - pw_pixel_max + ; m9 - store shuffle order table + + PROCESS_IPFILTER_LUMA_PP_16x4_AVX512 + + movu xm10, [r0 + mmsize/2] + movu xm11, [r0 + mmsize/2 + 8] + movu xm12, [r0 + mmsize/2 + 16] + + vinserti32x4 m10, [r0 + r1 + mmsize/2], 1 + vinserti32x4 m11, [r0 + r1 + mmsize/2 + 8], 1 + vinserti32x4 m12, [r0 + r1 + mmsize/2 + 16], 1 + + vinserti32x4 m10, [r0 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m11, [r0 + 2 * r1 + mmsize/2 + 8], 2 + vinserti32x4 m12, [r0 + 2 * r1 + mmsize/2 + 16], 2 + + vinserti32x4 m10, [r0 + r6 + mmsize/2], 3 + vinserti32x4 m11, [r0 + r6 + mmsize/2 + 8], 3 + vinserti32x4 m12, [r0 + r6 + mmsize/2 + 16], 3 + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m15, m3 + pmaddwd m12, m2 + paddd m12, m15 + paddd m11, m12 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2 + mmsize/2], xm10 + vextracti32x4 [r2 + r3 + mmsize/2], m10, 1 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m10, 2 + vextracti32x4 [r2 + r7 + mmsize/2], m10, 3 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PP_32x2_AVX512 0 + ; register map + ; m0 , m1, m2, m3 - interpolate coeff + ; m4 , m5 load shuffle order table + ; m6 - pd_32 + ; m7 - zero + ; m8 - pw_pixel_max + ; m9 - store shuffle order table + + movu m10, [r0] + movu m11, [r0 + 8] + movu m12, [r0 + 16] + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m15, m3 + pmaddwd m12, m2 + paddd m12, m15 + paddd m11, m12 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2], m10 + + movu m10, [r0 + r1] + movu m11, [r0 + r1 + 8] + movu m12, [r0 + r1 + 16] + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m14, m15, m3 + pmaddwd m16, m12, m2 + paddd m14, m16 + paddd m11, m14 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2 + r3], m10 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PP_48x4_AVX512 0 + ; register map + ; m0 , m1, m2, m3 - interpolate coeff + ; m4 , m5 load shuffle order table + ; m6 - pd_32 + ; m7 - zero + ; m8 - pw_pixel_max + ; m9 - store shuffle order table + + movu m10, [r0] + movu m11, [r0 + 8] + movu m12, [r0 + 16] + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, 
m14 + pmaddwd m15, m3 + pmaddwd m12, m2 + paddd m12, m15 + paddd m11, m12 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2], m10 + + movu m10, [r0 + r1] + movu m11, [r0 + r1 + 8] + movu m12, [r0 + r1 + 16] + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m14, m15, m3 + pmaddwd m16, m12, m2 + paddd m14, m16 + paddd m11, m14 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2 + r3], m10 + + movu m10, [r0 + 2 * r1] + movu m11, [r0 + 2 * r1 + 8] + movu m12, [r0 + 2 * r1 + 16] + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m15, m3 + pmaddwd m12, m2 + paddd m12, m15 + paddd m11, m12 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2 + 2 * r3], m10 + + movu m10, [r0 + r6] + movu m11, [r0 + r6 + 8] + movu m12, [r0 + r6 + 16] + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m14, m15, m3 + pmaddwd m16, m12, m2 + paddd m14, m16 + paddd m11, m14 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2 + r7], m10 + + movu ym10, [r0 + mmsize] + vinserti32x8 m10, [r0 + r1 + mmsize], 1 + movu ym11, [r0 + mmsize + 8] + vinserti32x8 m11, [r0 + r1 + mmsize + 8], 1 + movu ym12, [r0 + mmsize + 16] + vinserti32x8 m12, [r0 + r1 + mmsize + 16], 1 + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m15, m3 + pmaddwd m12, m2 + paddd m12, m15 + paddd m11, m12 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2 + mmsize], ym10 + vextracti32x8 [r2 + r3 + mmsize], m10, 1 + + movu ym10, [r0 + 2 * r1 + mmsize] + vinserti32x8 m10, [r0 + r6 + mmsize], 1 + movu ym11, [r0 + 2 * r1 + mmsize + 8] + vinserti32x8 m11, [r0 + r6 + mmsize + 8], 1 + movu ym12, [r0 + 2 * r1 + mmsize + 16] + vinserti32x8 m12, [r0 + r6 + mmsize + 16], 1 + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd 
m14, m15, m3 + pmaddwd m16, m12, m2 + paddd m14, m16 + paddd m11, m14 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2 + 2 * r3 + mmsize], ym10 + vextracti32x8 [r2 + r7 + mmsize], m10, 1 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PP_64x2_AVX512 0 + ; register map + ; m0 , m1, m2, m3 - interpolate coeff + ; m4 , m5 load shuffle order table + ; m6 - pd_32 + ; m7 - zero + ; m8 - pw_pixel_max + ; m9 - store shuffle order table + + movu m10, [r0] + movu m11, [r0 + 8] + movu m12, [r0 + 16] + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m15, m3 + pmaddwd m12, m2 + paddd m12, m15 + paddd m11, m12 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2], m10 + + movu m10, [r0 + mmsize] + movu m11, [r0 + mmsize + 8] + movu m12, [r0 + mmsize + 16] + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m15, m3 + pmaddwd m12, m2 + paddd m12, m15 + paddd m11, m12 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2 + mmsize], m10 + + movu m10, [r0 + r1] + movu m11, [r0 + r1 + 8] + movu m12, [r0 + r1 + 16] + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m14, m15, m3 + pmaddwd m16, m12, m2 + paddd m14, m16 + paddd m11, m14 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2 + r3], m10 + + movu m10, [r0 + r1 + mmsize] + movu m11, [r0 + r1 + mmsize + 8] + movu m12, [r0 + r1 + mmsize + 16] + + pshufb m13, m10, m5 + pshufb m10, m4 + pshufb m14, m11, m5 + pshufb m11, m4 + pshufb m15, m12, m5 + pshufb m12, m4 + + pmaddwd m10, m0 + pmaddwd m13, m1 + paddd m10, m13 + pmaddwd m13, m14, m3 + pmaddwd m16, m11, m2 + paddd m13, m16 + paddd m10, m13 + paddd m10, m6 + psrad m10, INTERP_SHIFT_PP + + pmaddwd m11, m0 + pmaddwd m14, m1 + paddd m11, m14 + pmaddwd m14, m15, m3 + pmaddwd m16, m12, m2 + paddd m14, m16 + paddd m11, m14 + paddd m11, m6 + psrad m11, INTERP_SHIFT_PP + + packusdw m10, m11 + CLIPW m10, m7, m8 + pshufb m10, m9 + movu [r2 + r3 + mmsize], m10 +%endmacro + +%macro IPFILTER_LUMA_AVX512_8xN 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_pp_8x%1, 5, 8, 17 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 + +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4] + vpbroadcastd m1, [r5 + r4 + 4] + vpbroadcastd m2, [r5 + r4 + 8] + vpbroadcastd m3, [r5 + r4 + 12] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4] + vpbroadcastd m1, [tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [tab_LumaCoeff + r4 + 8] + vpbroadcastd 
m3, [tab_LumaCoeff + r4 + 12] +%endif + vbroadcasti32x8 m4, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m5, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m6, [pd_32] + pxor m7, m7 + vbroadcasti32x8 m8, [pw_pixel_max] + vbroadcasti32x8 m9, [interp8_hpp_shuf1_store_avx512] + lea r6, [3 * r1] + lea r7, [3 * r3] + +%rep %1/4 - 1 + PROCESS_IPFILTER_LUMA_PP_8x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_8x4_AVX512 + RET +%endmacro + +%if ARCH_X86_64 + IPFILTER_LUMA_AVX512_8xN 4 + IPFILTER_LUMA_AVX512_8xN 8 + IPFILTER_LUMA_AVX512_8xN 16 + IPFILTER_LUMA_AVX512_8xN 32 +%endif + +%macro IPFILTER_LUMA_AVX512_16xN 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_pp_16x%1, 5,8,17 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 + +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4] + vpbroadcastd m1, [r5 + r4 + 4] + vpbroadcastd m2, [r5 + r4 + 8] + vpbroadcastd m3, [r5 + r4 + 12] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4] + vpbroadcastd m1, [tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [tab_LumaCoeff + r4 + 12] +%endif + vbroadcasti32x8 m4, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m5, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m6, [pd_32] + pxor m7, m7 + vbroadcasti32x8 m8, [pw_pixel_max] + vbroadcasti32x8 m9, [interp8_hpp_shuf1_store_avx512] + lea r6, [3 * r1] + lea r7, [3 * r3] + +%rep %1/4 - 1 + PROCESS_IPFILTER_LUMA_PP_16x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_16x4_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_LUMA_AVX512_16xN 4 +IPFILTER_LUMA_AVX512_16xN 8 +IPFILTER_LUMA_AVX512_16xN 12 +IPFILTER_LUMA_AVX512_16xN 16 +IPFILTER_LUMA_AVX512_16xN 32 +IPFILTER_LUMA_AVX512_16xN 64 +%endif + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_pp_24x32, 5, 8, 17 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 + +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4] + vpbroadcastd m1, [r5 + r4 + 4] + vpbroadcastd m2, [r5 + r4 + 8] + vpbroadcastd m3, [r5 + r4 + 12] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4] + vpbroadcastd m1, [tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [tab_LumaCoeff + r4 + 12] +%endif + vbroadcasti32x8 m4, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m5, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m6, [pd_32] + pxor m7, m7 + vbroadcasti32x8 m8, [pw_pixel_max] + vbroadcasti32x8 m9, [interp8_hpp_shuf1_store_avx512] + lea r6, [3 * r1] + lea r7, [3 * r3] + +%rep 7 + PROCESS_IPFILTER_LUMA_PP_24x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_24x4_AVX512 + RET +%endif + +%macro IPFILTER_LUMA_AVX512_32xN 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_pp_32x%1, 5,6,17 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 + +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4] + vpbroadcastd m1, [r5 + r4 + 4] + vpbroadcastd m2, [r5 + r4 + 8] + vpbroadcastd m3, [r5 + r4 + 12] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4] + vpbroadcastd m1, [tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [tab_LumaCoeff + r4 + 12] +%endif + vbroadcasti32x8 m4, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m5, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m6, [pd_32] + pxor m7, m7 + vbroadcasti32x8 m8, [pw_pixel_max] + vbroadcasti32x8 m9, [interp8_hpp_shuf1_store_avx512] + +%rep %1/2 - 1 + 
PROCESS_IPFILTER_LUMA_PP_32x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_32x2_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_LUMA_AVX512_32xN 8 +IPFILTER_LUMA_AVX512_32xN 16 +IPFILTER_LUMA_AVX512_32xN 24 +IPFILTER_LUMA_AVX512_32xN 32 +IPFILTER_LUMA_AVX512_32xN 64 +%endif + +%macro IPFILTER_LUMA_AVX512_64xN 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_pp_64x%1, 5,6,17 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 + +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4] + vpbroadcastd m1, [r5 + r4 + 4] + vpbroadcastd m2, [r5 + r4 + 8] + vpbroadcastd m3, [r5 + r4 + 12] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4] + vpbroadcastd m1, [tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [tab_LumaCoeff + r4 + 12] +%endif + vbroadcasti32x8 m4, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m5, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m6, [pd_32] + pxor m7, m7 + vbroadcasti32x8 m8, [pw_pixel_max] + vbroadcasti32x8 m9, [interp8_hpp_shuf1_store_avx512] + +%rep %1/2 - 1 + PROCESS_IPFILTER_LUMA_PP_64x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_64x2_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_LUMA_AVX512_64xN 16 +IPFILTER_LUMA_AVX512_64xN 32 +IPFILTER_LUMA_AVX512_64xN 48 +IPFILTER_LUMA_AVX512_64xN 64 +%endif + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_pp_48x64, 5,8,17 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 + +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4] + vpbroadcastd m1, [r5 + r4 + 4] + vpbroadcastd m2, [r5 + r4 + 8] + vpbroadcastd m3, [r5 + r4 + 12] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4] + vpbroadcastd m1, [tab_LumaCoeff + r4 + 4] + vpbroadcastd m2, [tab_LumaCoeff + r4 + 8] + vpbroadcastd m3, [tab_LumaCoeff + r4 + 12] +%endif + vbroadcasti32x8 m4, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m5, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x8 m6, [pd_32] + pxor m7, m7 + vbroadcasti32x8 m8, [pw_pixel_max] + vbroadcasti32x8 m9, [interp8_hpp_shuf1_store_avx512] + lea r6, [3 * r1] + lea r7, [3 * r3] + +%rep 15 + PROCESS_IPFILTER_LUMA_PP_48x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_48x4_AVX512 + RET +%endif +;------------------------------------------------------------------------------------------------------------- +;avx512 luma_hps code start +;------------------------------------------------------------------------------------------------------------- + +%macro PROCESS_IPFILTER_LUMA_PS_32x2_AVX512 0 + ; register map + ; m0, m1, m2, m3 - interpolate coeff + ; m4, m5 - shuffle load order table + ; m6 - INTERP_OFFSET_PS + ; m7 - shuffle store order table + + movu m8, [r0] + movu m9, [r0 + 8] + movu m10, [r0 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2], m8 + + movu m8, [r0 + r1] + movu m9, [r0 + r1 + 8] + movu m10, [r0 + r1 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + 
pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m12, m13, m3 + pmaddwd m14, m10, m2 + paddd m12, m14 + + paddd m9, m12 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + r3],m8 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PS_32x1_AVX512 0 + movu m8, [r0] + movu m9, [r0 + 8] + movu m10, [r0 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2], m8 +%endmacro + +%macro IPFILTER_LUMA_PS_AVX512_32xN 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_ps_32x%1, 4,7,15 + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m + shl r4d, 6 + +%ifdef PIC + lea r6, [tab_LumaCoeffH_avx512] + vpbroadcastd m0, [r6 + r4] + vpbroadcastd m1, [r6 + r4 + 4] + vpbroadcastd m2, [r6 + r4 + 8] + vpbroadcastd m3, [r6 + r4 + 12] +%else + vpbroadcastd m0, [tab_LumaCoeffH_avx512 + r4] + vpbroadcastd m1, [tab_LumaCoeffH_avx512 + r4 + 4] + vpbroadcastd m2, [tab_LumaCoeffH_avx512 + r4 + 8] + vpbroadcastd m3, [tab_LumaCoeffH_avx512 + r4 + 12] +%endif + vbroadcasti32x8 m4, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m5, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x4 m6, [INTERP_OFFSET_PS] + vbroadcasti32x8 m7, [interp8_hpp_shuf1_store_avx512] + + sub r0, 6 + mov r4d, %1 + test r5d, r5d + jz .loop + lea r6, [r1 * 3] + sub r0, r6 + add r4d, 7 + PROCESS_IPFILTER_LUMA_PS_32x1_AVX512 + lea r0, [r0 + r1] + lea r2, [r2 + r3] + dec r4d + +.loop: + PROCESS_IPFILTER_LUMA_PS_32x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + sub r4d, 2 + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_LUMA_PS_AVX512_32xN 8 +IPFILTER_LUMA_PS_AVX512_32xN 16 +IPFILTER_LUMA_PS_AVX512_32xN 24 +IPFILTER_LUMA_PS_AVX512_32xN 32 +IPFILTER_LUMA_PS_AVX512_32xN 64 +%endif + +%macro PROCESS_IPFILTER_LUMA_PS_64x2_AVX512 0 + ; register map + ; m0, m1, m2, m3 - interpolate coeff + ; m4, m5 - shuffle load order table + ; m6 - INTERP_OFFSET_PS + ; m7 - shuffle store order table + + movu m8, [r0] + movu m9, [r0 + 8] + movu m10, [r0 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2], m8 + + movu m8, [r0 + mmsize] + movu m9, [r0 + mmsize + 8] + movu m10, [r0 + mmsize + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, 
m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + mmsize], m8 + + movu m8, [r0 + r1] + movu m9, [r0 + r1 + 8] + movu m10, [r0 + r1 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m12, m13, m3 + pmaddwd m14, m10, m2 + paddd m12, m14 + paddd m9, m12 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + r3],m8 + + movu m8, [r0 + r1 + mmsize] + movu m9, [r0 + r1 + mmsize + 8] + movu m10, [r0 + r1 + mmsize + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m12, m13, m3 + pmaddwd m14, m10, m2 + paddd m12, m14 + paddd m9, m12 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + r3 + mmsize], m8 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PS_64x1_AVX512 0 + + movu m8, [r0] + movu m9, [r0 + 8] + movu m10, [r0 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2], m8 + + movu m8, [r0 + mmsize] + movu m9, [r0 + mmsize + 8] + movu m10, [r0 + mmsize + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + mmsize], m8 +%endmacro + +%macro IPFILTER_LUMA_PS_AVX512_64xN 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_ps_64x%1, 4,7,15 + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m + shl r4d, 6 + +%ifdef PIC + lea r6, [tab_LumaCoeffH_avx512] + vpbroadcastd m0, [r6 + r4] + vpbroadcastd m1, [r6 + r4 + 4] + vpbroadcastd m2, [r6 + r4 + 8] + vpbroadcastd m3, [r6 + r4 + 12] +%else + vpbroadcastd m0, [tab_LumaCoeffH_avx512 + r4] + vpbroadcastd m1, [tab_LumaCoeffH_avx512 + r4 + 4] + vpbroadcastd m2, [tab_LumaCoeffH_avx512 + r4 + 8] + vpbroadcastd m3, [tab_LumaCoeffH_avx512 + r4 + 12] +%endif + vbroadcasti32x8 m4, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m5, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x4 m6, [INTERP_OFFSET_PS] + vbroadcasti32x8 m7, [interp8_hpp_shuf1_store_avx512] + + sub r0, 6 + mov r4d, %1 + test r5d, r5d + jz .loop + lea r6, [r1 * 3] + sub r0, r6 + add r4d, 7 + PROCESS_IPFILTER_LUMA_PS_64x1_AVX512 + lea 
r0, [r0 + r1] + lea r2, [r2 + r3] + dec r4d + +.loop: + PROCESS_IPFILTER_LUMA_PS_64x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + sub r4d, 2 + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_LUMA_PS_AVX512_64xN 16 +IPFILTER_LUMA_PS_AVX512_64xN 32 +IPFILTER_LUMA_PS_AVX512_64xN 48 +IPFILTER_LUMA_PS_AVX512_64xN 64 +%endif + +%macro PROCESS_IPFILTER_LUMA_PS_16x4_AVX512 0 + ; register map + ; m0, m1, m2, m3 - interpolate coeff + ; m4, m5 - shuffle load order table + ; m6 - INTERP_OFFSET_PS + ; m7 - shuffle store order table + + movu ym8, [r0] + vinserti32x8 m8, [r0 + r1], 1 + movu ym9, [r0 + 8] + vinserti32x8 m9, [r0 + r1 + 8], 1 + movu ym10, [r0 + 16] + vinserti32x8 m10, [r0 + r1 + 16], 1 + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2], ym8 + vextracti32x8 [r2 + r3],m8, 1 + + movu ym8, [r0 + 2 * r1] + vinserti32x8 m8, [r0 + r6], 1 + movu ym9, [r0 + 2 * r1 + 8] + vinserti32x8 m9, [r0 + r6 + 8], 1 + movu ym10, [r0 + 2 * r1 + 16] + vinserti32x8 m10, [r0 + r6 + 16], 1 + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m12, m13, m3 + pmaddwd m14, m10, m2 + paddd m12, m14 + paddd m9, m12 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + 2 * r3], ym8 + vextracti32x8 [r2 + r7], m8, 1 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PS_16x3_AVX512 0 + movu ym8, [r0] + vinserti32x8 m8, [r0 + r1], 1 + movu ym9, [r0 + 8] + vinserti32x8 m9, [r0 + r1 + 8], 1 + movu ym10, [r0 + 16] + vinserti32x8 m10, [r0 + r1 + 16], 1 + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2], ym8 + vextracti32x8 [r2 + r3],m8, 1 + + movu ym8, [r0 + 2 * r1] + movu ym9, [r0 + 2 * r1 + 8] + movu ym10, [r0 + 2 * r1 + 16] + + pshufb ym11, ym8, ym5 + pshufb ym8, ym4 + pmaddwd ym8, ym0 + pmaddwd ym11, ym1 + paddd ym8, ym11 + pshufb ym12, ym9, ym5 + pshufb ym9, ym4 + pmaddwd ym11, ym12, ym3 + pmaddwd ym14, ym9, ym2 + paddd ym11, ym14 + paddd ym8, ym11 + paddd ym8, ym6 + psrad ym8, INTERP_SHIFT_PS + + pshufb ym13, ym10, ym5 + pshufb ym10, ym4 + pmaddwd ym9, ym0 + pmaddwd ym12, ym1 + paddd ym9, ym12 + pmaddwd ym12, ym13, ym3 + pmaddwd ym14, ym10, ym2 + paddd ym12, ym14 + paddd ym9, ym12 + paddd ym9, ym6 + psrad ym9, INTERP_SHIFT_PS + + packssdw ym8, ym9 + pshufb ym8, ym7 + movu [r2 + 2 * r3], ym8 +%endmacro + + +%macro IPFILTER_LUMA_PS_AVX512_16xN 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_ps_16x%1, 4,9,15 
+ shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m + shl r4d, 6 + + lea r6, [3 * r1] + lea r7, [3 * r3] +%ifdef PIC + lea r8, [tab_LumaCoeffH_avx512] + vpbroadcastd m0, [r8 + r4] + vpbroadcastd m1, [r8 + r4 + 4] + vpbroadcastd m2, [r8 + r4 + 8] + vpbroadcastd m3, [r8 + r4 + 12] +%else + vpbroadcastd m0, [tab_LumaCoeffH_avx512 + r4] + vpbroadcastd m1, [tab_LumaCoeffH_avx512 + r4 + 4] + vpbroadcastd m2, [tab_LumaCoeffH_avx512 + r4 + 8] + vpbroadcastd m3, [tab_LumaCoeffH_avx512 + r4 + 12] +%endif + vbroadcasti32x8 m4, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m5, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x4 m6, [INTERP_OFFSET_PS] + vbroadcasti32x8 m7, [interp8_hpp_shuf1_store_avx512] + + sub r0, 6 + mov r4d, %1 + test r5d, r5d + jz .loop + lea r6, [r1 * 3] + sub r0, r6 + add r4d, 7 + PROCESS_IPFILTER_LUMA_PS_16x3_AVX512 + lea r0, [r0 + r6] + lea r2, [r2 + r7] + sub r4d, 3 + +.loop: + PROCESS_IPFILTER_LUMA_PS_16x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + sub r4d, 4 + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_LUMA_PS_AVX512_16xN 4 +IPFILTER_LUMA_PS_AVX512_16xN 8 +IPFILTER_LUMA_PS_AVX512_16xN 12 +IPFILTER_LUMA_PS_AVX512_16xN 16 +IPFILTER_LUMA_PS_AVX512_16xN 32 +IPFILTER_LUMA_PS_AVX512_16xN 64 +%endif + +%macro PROCESS_IPFILTER_LUMA_PS_48x4_AVX512 0 + ; register map + ; m0, m1, m2, m3 - interpolate coeff + ; m4, m5 - shuffle load order table + ; m6 - INTERP_OFFSET_PS + ; m7 - shuffle store order table + + movu m8, [r0] + movu m9, [r0 + 8] + movu m10, [r0 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2], m8 + + movu m8, [r0 + r1] + movu m9, [r0 + r1 + 8] + movu m10, [r0 + r1 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m12, m13, m3 + pmaddwd m14, m10, m2 + paddd m12, m14 + paddd m9, m12 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + r3],m8 + + movu m8, [r0 + 2 * r1] + movu m9, [r0 + 2 * r1 + 8] + movu m10, [r0 + 2 * r1 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + 2 * r3], m8 + + movu m8, [r0 + r6] + movu m9, [r0 + r6 + 8] + movu m10, [r0 + r6 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + 
pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m12, m13, m3 + pmaddwd m14, m10, m2 + paddd m12, m14 + paddd m9, m12 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + r7],m8 + + movu ym8, [r0 + mmsize] + vinserti32x8 m8, [r0 + r1 + mmsize], 1 + movu ym9, [r0 + mmsize + 8] + vinserti32x8 m9, [r0 + r1 + mmsize + 8], 1 + movu ym10, [r0 + mmsize + 16] + vinserti32x8 m10, [r0 + r1 + mmsize + 16], 1 + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + mmsize], ym8 + vextracti32x8 [r2 + r3 + mmsize], m8, 1 + + movu ym8, [r0 + 2 * r1 + mmsize] + vinserti32x8 m8, [r0 + r6 + mmsize], 1 + movu ym9, [r0 + 2 * r1 + mmsize + 8] + vinserti32x8 m9, [r0 + r6 + mmsize + 8], 1 + movu ym10, [r0 + 2 * r1 + mmsize + 16] + vinserti32x8 m10, [r0 + r6 + mmsize + 16], 1 + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m12, m13, m3 + pmaddwd m14, m10, m2 + paddd m12, m14 + paddd m9, m12 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + 2 * r3 + mmsize], ym8 + vextracti32x8 [r2 + r7 + mmsize], m8, 1 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PS_48x3_AVX512 0 + movu m8, [r0] + movu m9, [r0 + 8] + movu m10, [r0 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2], m8 + + movu m8, [r0 + r1] + movu m9, [r0 + r1 + 8] + movu m10, [r0 + r1 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m12, m13, m3 + pmaddwd m14, m10, m2 + paddd m12, m14 + paddd m9, m12 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + r3],m8 + + movu m8, [r0 + 2 * r1] + movu m9, [r0 + 2 * r1 + 8] + movu m10, [r0 + 2 * r1 + 16] + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + 
paddd m10, m13 + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + 2 * r3], m8 + + movu ym8, [r0 + mmsize] + vinserti32x8 m8, [r0 + r1 + mmsize], 1 + movu ym9, [r0 + mmsize + 8] + vinserti32x8 m9, [r0 + r1 + mmsize + 8], 1 + movu ym10, [r0 + mmsize + 16] + vinserti32x8 m10, [r0 + r1 + mmsize + 16], 1 + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + mmsize], ym8 + vextracti32x8 [r2 + r3 + mmsize], m8, 1 + + movu ym8, [r0 + 2 * r1 + mmsize] + movu ym9, [r0 + 2 * r1 + mmsize + 8] + movu ym10, [r0 + 2 * r1 + mmsize + 16] + + pshufb ym11, ym8, ym5 + pshufb ym8, ym4 + pmaddwd ym8, ym0 + pmaddwd ym11, ym1 + paddd ym8, ym11 + pshufb ym12, ym9, ym5 + pshufb ym9, ym4 + pmaddwd ym11, ym12, ym3 + pmaddwd ym14, ym9, ym2 + paddd ym11, ym14 + paddd ym8, ym11 + paddd ym8, ym6 + psrad ym8, INTERP_SHIFT_PS + + pshufb ym13, ym10, ym5 + pshufb ym10, ym4 + pmaddwd ym9, ym0 + pmaddwd ym12, ym1 + paddd ym9, ym12 + pmaddwd ym12, ym13, ym3 + pmaddwd ym14, ym10, ym2 + paddd ym12, ym14 + paddd ym9, ym12 + paddd ym9, ym6 + psrad ym9, INTERP_SHIFT_PS + + packssdw ym8, ym9 + pshufb ym8, ym7 + movu [r2 + 2 * r3 + mmsize], ym8 +%endmacro + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_ps_48x64, 4,9,15 + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m + shl r4d, 6 + lea r6, [3 * r1] + lea r7, [3 * r3] +%ifdef PIC + lea r8, [tab_LumaCoeffH_avx512] + vpbroadcastd m0, [r8 + r4] + vpbroadcastd m1, [r8 + r4 + 4] + vpbroadcastd m2, [r8 + r4 + 8] + vpbroadcastd m3, [r8 + r4 + 12] +%else + vpbroadcastd m0, [tab_LumaCoeffH_avx512 + r4] + vpbroadcastd m1, [tab_LumaCoeffH_avx512 + r4 + 4] + vpbroadcastd m2, [tab_LumaCoeffH_avx512 + r4 + 8] + vpbroadcastd m3, [tab_LumaCoeffH_avx512 + r4 + 12] +%endif + vbroadcasti32x8 m4, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m5, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x4 m6, [INTERP_OFFSET_PS] + vbroadcasti32x8 m7, [interp8_hpp_shuf1_store_avx512] + + sub r0, 6 + mov r4d, 64 + test r5d, r5d + jz .loop + lea r6, [r1 * 3] + sub r0, r6 + add r4d, 7 + PROCESS_IPFILTER_LUMA_PS_48x4_AVX512 + lea r0, [r0 + r6] + lea r2, [r2 + r7] + sub r4d, 3 + +.loop: + PROCESS_IPFILTER_LUMA_PS_48x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + sub r4d, 4 + jnz .loop + RET +%endif + +%macro PROCESS_IPFILTER_LUMA_PS_24x4_AVX512 0 + ; register map + ; m0 , m1, m2, m3 - interpolate coeff table + ; m4 , m5 - load shuffle order table + ; m6 - INTERP_OFFSET_PS + ; m7 - store shuffle order table + + PROCESS_IPFILTER_LUMA_PS_16x4_AVX512 + + movu xm8, [r0 + mmsize/2] + movu xm9, [r0 + mmsize/2 + 8] + movu xm10, [r0 + mmsize/2 + 16] + + vinserti32x4 m8, [r0 + r1 + mmsize/2], 1 + vinserti32x4 m9, [r0 + r1 + mmsize/2 + 8], 1 + vinserti32x4 m10, [r0 + r1 + mmsize/2 + 16], 1 + + vinserti32x4 m8, [r0 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m9, [r0 + 2 * r1 + mmsize/2 + 8], 2 + vinserti32x4 m10, [r0 + 2 * r1 + mmsize/2 + 16], 2 + + vinserti32x4 m8, [r0 + r6 + mmsize/2], 3 + vinserti32x4 m9, [r0 + r6 + mmsize/2 + 8], 3 + vinserti32x4 m10, [r0 + r6 + mmsize/2 + 16], 3 + + pshufb m11, m8, m5 + 
pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + mmsize/2], xm8 + vextracti32x4 [r2 + r3 + mmsize/2], m8, 1 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m8, 2 + vextracti32x4 [r2 + r7 + mmsize/2], m8, 3 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PS_24x3_AVX512 0 + + PROCESS_IPFILTER_LUMA_PS_16x3_AVX512 + + movu xm8, [r0 + mmsize/2] + movu xm9, [r0 + mmsize/2 + 8] + movu xm10, [r0 + mmsize/2 + 16] + + vinserti32x4 m8, [r0 + r1 + mmsize/2], 1 + vinserti32x4 m9, [r0 + r1 + mmsize/2 + 8], 1 + vinserti32x4 m10, [r0 + r1 + mmsize/2 + 16], 1 + + vinserti32x4 m8, [r0 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m9, [r0 + 2 * r1 + mmsize/2 + 8], 2 + vinserti32x4 m10, [r0 + 2 * r1 + mmsize/2 + 16], 2 + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2 + mmsize/2], xm8 + vextracti32x4 [r2 + r3 + mmsize/2], m8, 1 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m8, 2 +%endmacro + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_ps_24x32, 4, 9, 15 + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m + shl r4d, 6 + + lea r6, [3 * r1] + lea r7, [3 * r3] + +%ifdef PIC + lea r8, [tab_LumaCoeffH_avx512] + vpbroadcastd m0, [r8 + r4] + vpbroadcastd m1, [r8 + r4 + 4] + vpbroadcastd m2, [r8 + r4 + 8] + vpbroadcastd m3, [r8 + r4 + 12] +%else + vpbroadcastd m0, [tab_LumaCoeffH_avx512 + r4] + vpbroadcastd m1, [tab_LumaCoeffH_avx512 + r4 + 4] + vpbroadcastd m2, [tab_LumaCoeffH_avx512 + r4 + 8] + vpbroadcastd m3, [tab_LumaCoeffH_avx512 + r4 + 12] +%endif + vbroadcasti32x8 m4, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m5, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x4 m6, [INTERP_OFFSET_PS] + vbroadcasti32x8 m7, [interp8_hpp_shuf1_store_avx512] + + sub r0, 6 + mov r4d, 32 + test r5d, r5d + jz .loop + sub r0, r6 + add r4d, 7 + PROCESS_IPFILTER_LUMA_PS_24x3_AVX512 + lea r0, [r0 + r6] + lea r2, [r2 + r7] + sub r4d, 3 + +.loop: + PROCESS_IPFILTER_LUMA_PS_24x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + sub r4d, 4 + jnz .loop + RET +%endif +%macro PROCESS_IPFILTER_LUMA_PS_8x4_AVX512 0 + ; register map + ; m0 , m1, m2, m3 - interpolate coeff table + ; m4 , m5 - load shuffle order table + ; m6 - INTERP_OFFSET_PS + ; m7 - store shuffle order table + + movu xm8, [r0] + movu xm9, [r0 + 8] + movu xm10, [r0 + 16] + + vinserti32x4 m8, [r0 + r1], 1 + vinserti32x4 m9, [r0 + r1 + 8], 1 + vinserti32x4 m10, [r0 + r1 + 16], 1 + + vinserti32x4 m8, [r0 + 2 * r1], 2 + vinserti32x4 m9, [r0 + 2 * r1 + 8], 2 + vinserti32x4 m10, [r0 + 2 * r1 + 16], 2 + + vinserti32x4 m8, [r0 + r6], 3 + vinserti32x4 m9, [r0 + r6 + 8], 3 + vinserti32x4 m10, [r0 + r6 + 16], 3 + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb 
m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2], xm8 + vextracti32x4 [r2 + r3], m8, 1 + vextracti32x4 [r2 + 2 * r3], m8, 2 + vextracti32x4 [r2 + r7], m8, 3 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PS_8x3_AVX512 0 + movu xm8, [r0] + movu xm9, [r0 + 8] + movu xm10, [r0 + 16] + + vinserti32x4 m8, [r0 + r1], 1 + vinserti32x4 m9, [r0 + r1 + 8], 1 + vinserti32x4 m10, [r0 + r1 + 16], 1 + + vinserti32x4 m8, [r0 + 2 * r1], 2 + vinserti32x4 m9, [r0 + 2 * r1 + 8], 2 + vinserti32x4 m10, [r0 + 2 * r1 + 16], 2 + + pshufb m11, m8, m5 + pshufb m8, m4 + pmaddwd m8, m0 + pmaddwd m11, m1 + paddd m8, m11 + pshufb m12, m9, m5 + pshufb m9, m4 + pmaddwd m11, m12, m3 + pmaddwd m14, m9, m2 + paddd m11, m14 + + paddd m8, m11 + paddd m8, m6 + psrad m8, INTERP_SHIFT_PS + + pshufb m13, m10, m5 + pshufb m10, m4 + pmaddwd m9, m0 + pmaddwd m12, m1 + paddd m9, m12 + pmaddwd m13, m3 + pmaddwd m10, m2 + paddd m10, m13 + + paddd m9, m10 + paddd m9, m6 + psrad m9, INTERP_SHIFT_PS + + packssdw m8, m9 + pshufb m8, m7 + movu [r2], xm8 + vextracti32x4 [r2 + r3], m8, 1 + vextracti32x4 [r2 + 2 * r3], m8, 2 +%endmacro + +%macro IPFILTER_LUMA_PS_AVX512_8xN 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_ps_8x%1, 4, 9, 15 + shl r1d, 1 + shl r3d, 1 + mov r4d, r4m + mov r5d, r5m + shl r4d, 6 + + lea r6, [3 * r1] + lea r7, [3 * r3] + +%ifdef PIC + lea r8, [tab_LumaCoeffH_avx512] + vpbroadcastd m0, [r8 + r4] + vpbroadcastd m1, [r8 + r4 + 4] + vpbroadcastd m2, [r8 + r4 + 8] + vpbroadcastd m3, [r8 + r4 + 12] +%else + vpbroadcastd m0, [tab_LumaCoeffH_avx512 + r4] + vpbroadcastd m1, [tab_LumaCoeffH_avx512 + r4 + 4] + vpbroadcastd m2, [tab_LumaCoeffH_avx512 + r4 + 8] + vpbroadcastd m3, [tab_LumaCoeffH_avx512 + r4 + 12] +%endif + vbroadcasti32x8 m4, [interp8_hpp_shuf1_load_avx512] + vbroadcasti32x8 m5, [interp8_hpp_shuf2_load_avx512] + vbroadcasti32x4 m6, [INTERP_OFFSET_PS] + vbroadcasti32x8 m7, [interp8_hpp_shuf1_store_avx512] + + sub r0, 6 + mov r4d, %1 + test r5d, r5d + jz .loop + sub r0, r6 + add r4d, 7 + PROCESS_IPFILTER_LUMA_PS_8x3_AVX512 + lea r0, [r0 + r6] + lea r2, [r2 + r7] + sub r4d, 3 + +.loop: + PROCESS_IPFILTER_LUMA_PS_8x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + sub r4d, 4 + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 + IPFILTER_LUMA_PS_AVX512_8xN 4 + IPFILTER_LUMA_PS_AVX512_8xN 8 + IPFILTER_LUMA_PS_AVX512_8xN 16 + IPFILTER_LUMA_PS_AVX512_8xN 32 +%endif + +;------------------------------------------------------------------------------------------------------------- +;avx512 luma_hps code end +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +;avx512 luma_vss and luma_vsp code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_LUMA_VERT_S_8x8_AVX512 1 + lea r6, [r0 + 4 * r1] + movu xm1, [r0] ;0 row + vinserti32x4 m1, [r0 + 2 * r1], 1 + vinserti32x4 m1, [r0 + 4 * r1], 2 + vinserti32x4 m1, [r6 + 2 * r1], 3 + movu xm3, [r0 + r1] ;1 row + vinserti32x4 m3, [r0 + r7], 1 + vinserti32x4 m3, [r6 + r1], 2 + vinserti32x4 
m3, [r6 + r7], 3 + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu xm4, [r0 + 2 * r1] ;2 row + vinserti32x4 m4, [r0 + 4 * r1], 1 + vinserti32x4 m4, [r6 + 2 * r1], 2 + vinserti32x4 m4, [r6 + 4 * r1], 3 + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + lea r4, [r6 + 4 * r1] + movu xm5, [r0 + r7] ;3 row + vinserti32x4 m5, [r6 + r1], 1 + vinserti32x4 m5, [r6 + r7], 2 + vinserti32x4 m5, [r4 + r1], 3 + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu xm4, [r0 + 4 * r1] ;4 row + vinserti32x4 m4, [r6 + 2 * r1], 1 + vinserti32x4 m4, [r6 + 4 * r1], 2 + vinserti32x4 m4, [r4 + 2 * r1], 3 + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu xm11, [r6 + r1] ;5 row + vinserti32x4 m11, [r6 + r7], 1 + vinserti32x4 m11, [r4 + r1], 2 + vinserti32x4 m11, [r4 + r7], 3 + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu xm12, [r6 + 2 * r1] ;6 row + vinserti32x4 m12, [r6 + 4 * r1], 1 + vinserti32x4 m12, [r4 + 2 * r1], 2 + vinserti32x4 m12, [r4 + 4 * r1], 3 + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + lea r8, [r4 + 4 * r1] + movu xm13, [r6 + r7] ;7 row + vinserti32x4 m13, [r4 + r1], 1 + vinserti32x4 m13, [r4 + r7], 2 + vinserti32x4 m13, [r8 + r1], 3 + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu xm12, [r6 + 4 * r1] ; 8 row + vinserti32x4 m12, [r4 + 2 * r1], 1 + vinserti32x4 m12, [r4 + 4 * r1], 2 + vinserti32x4 m12, [r8 + 2 * r1], 3 + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2], xm0 + movu [r2 + r3], xm2 + vextracti32x4 [r2 + 2 * r3], m0, 1 + vextracti32x4 [r2 + r5], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2], m0, 2 + vextracti32x4 [r2 + r3], m2, 2 + vextracti32x4 [r2 + 2 * r3], m0, 3 + vextracti32x4 [r2 + r5], m2, 3 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_LUMA_8xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_8x%2, 5, 9, 22 + add r1d, r1d + add r3d, r3d + lea r7, [3 * r1] + sub r0, r7 + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [tab_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m19, [INTERP_OFFSET_SP] + pxor m20, m20 + vbroadcasti32x8 m21, [pw_pixel_max] +%endif + lea r5, [3 * r3] + +%rep %2/8 - 1 + 
PROCESS_LUMA_VERT_S_8x8_AVX512 %1 + lea r0, [r4] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_S_8x8_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_LUMA_8xN_AVX512 ss, 8 + FILTER_VER_S_LUMA_8xN_AVX512 ss, 16 + FILTER_VER_S_LUMA_8xN_AVX512 ss, 32 + FILTER_VER_S_LUMA_8xN_AVX512 sp, 8 + FILTER_VER_S_LUMA_8xN_AVX512 sp, 16 + FILTER_VER_S_LUMA_8xN_AVX512 sp, 32 +%endif + +%macro PROCESS_LUMA_VERT_S_16x4_AVX512 1 + movu ym1, [r0] + movu ym3, [r0 + r1] + vinserti32x8 m1, [r0 + 2 * r1], 1 + vinserti32x8 m3, [r0 + r7], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + lea r6, [r0 + 4 * r1] + movu ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r6], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu ym5, [r0 + r7] + vinserti32x8 m5, [r6 + r1], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu ym4, [r6] + vinserti32x8 m4, [r6 + 2 * r1], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu ym11, [r6 + r1] + vinserti32x8 m11, [r6 + r7], 1 + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu ym12, [r6 + 2 * r1] + vinserti32x8 m12, [r6 + 4 * r1], 1 + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + lea r4, [r6 + 4 * r1] + movu ym13, [r6 + r7] + vinserti32x8 m13, [r4 + r1], 1 + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu ym12, [r6 + 4 * r1] + vinserti32x8 m12, [r4 + 2 * r1], 1 + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r5], m2, 1 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_LUMA_16xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_16x%2, 5, 8, 22 + add r1d, r1d + add r3d, r3d + lea r7, [3 * r1] + sub r0, r7 + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [tab_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m19, [INTERP_OFFSET_SP] + pxor m20, m20 + vbroadcasti32x8 m21, [pw_pixel_max] +%endif + lea r5, [3 * r3] +%rep %2/4 - 1 + PROCESS_LUMA_VERT_S_16x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_S_16x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_LUMA_16xN_AVX512 ss, 4 + 
FILTER_VER_S_LUMA_16xN_AVX512 ss, 8 + FILTER_VER_S_LUMA_16xN_AVX512 ss, 12 + FILTER_VER_S_LUMA_16xN_AVX512 ss, 16 + FILTER_VER_S_LUMA_16xN_AVX512 ss, 32 + FILTER_VER_S_LUMA_16xN_AVX512 ss, 64 + FILTER_VER_S_LUMA_16xN_AVX512 sp, 4 + FILTER_VER_S_LUMA_16xN_AVX512 sp, 8 + FILTER_VER_S_LUMA_16xN_AVX512 sp, 12 + FILTER_VER_S_LUMA_16xN_AVX512 sp, 16 + FILTER_VER_S_LUMA_16xN_AVX512 sp, 32 + FILTER_VER_S_LUMA_16xN_AVX512 sp, 64 +%endif + +%macro PROCESS_LUMA_VERT_S_24x8_AVX512 1 + PROCESS_LUMA_VERT_S_16x4_AVX512 %1 + lea r4, [r6 + 4 * r1] + lea r8, [r4 + 4 * r1] + movu ym1, [r6] + movu ym3, [r6 + r1] + vinserti32x8 m1, [r6 + 2 * r1], 1 + vinserti32x8 m3, [r6 + r7], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu ym4, [r6 + 2 * r1] + vinserti32x8 m4, [r4], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu ym5, [r6 + r7] + vinserti32x8 m5, [r4 + r1], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu ym4, [r4] + vinserti32x8 m4, [r4 + 2 * r1], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu ym11, [r4 + r1] + vinserti32x8 m11, [r4 + r7], 1 + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu ym12, [r4 + 2 * r1] + vinserti32x8 m12, [r4 + 4 * r1], 1 + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu ym13, [r4 + r7] + vinserti32x8 m13, [r8 + r1], 1 + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu ym12, [r4 + 4 * r1] + vinserti32x8 m12, [r8 + 2 * r1], 1 + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%endif + lea r9, [r2 + 4 * r3] + movu [r9], ym0 + movu [r9 + r3], ym2 + vextracti32x8 [r9 + 2 * r3], m0, 1 + vextracti32x8 [r9 + r5], m2, 1 + + movu xm1, [r0 + mmsize/2] + vinserti32x4 m1, [r0 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m1, [r0 + 4 * r1 + mmsize/2], 2 + vinserti32x4 m1, [r6 + 2 * r1 + mmsize/2], 3 + movu xm3, [r0 + r1 + mmsize/2] + vinserti32x4 m3, [r0 + r7 + mmsize/2], 1 + vinserti32x4 m3, [r6 + r1 + mmsize/2], 2 + vinserti32x4 m3, [r6 + r7 + mmsize/2], 3 + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu xm4, [r0 + 2 * r1 + mmsize/2] + vinserti32x4 m4, [r0 + 4 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r6 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r6 + 4 * r1 + mmsize/2], 3 + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu xm5, [r0 + r7 + mmsize/2] + vinserti32x4 m5, [r6 + r1 + mmsize/2], 1 + vinserti32x4 m5, [r6 + r7 + mmsize/2], 2 + vinserti32x4 m5, [r4 + r1 + mmsize/2], 3 + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu xm4, [r0 + 4 * r1 + mmsize/2] + vinserti32x4 m4, [r6 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r6 + 4 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r4 + 2 * r1 + mmsize/2], 3 + 
punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu xm11, [r6 + r1 + mmsize/2] + vinserti32x4 m11, [r6 + r7 + mmsize/2], 1 + vinserti32x4 m11, [r4 + r1 + mmsize/2], 2 + vinserti32x4 m11, [r4 + r7 + mmsize/2], 3 + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu xm12, [r6 + 2 * r1 + mmsize/2] + vinserti32x4 m12, [r6 + 4 * r1 + mmsize/2], 1 + vinserti32x4 m12, [r4 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m12, [r4 + 4 * r1 + mmsize/2], 3 + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu xm13, [r6 + r7 + mmsize/2] + vinserti32x4 m13, [r4 + r1 + mmsize/2], 1 + vinserti32x4 m13, [r4 + r7 + mmsize/2], 2 + vinserti32x4 m13, [r8 + r1 + mmsize/2], 3 + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu xm12, [r6 + 4 * r1 + mmsize/2] + vinserti32x4 m12, [r4 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m12, [r4 + 4 * r1 + mmsize/2], 2 + vinserti32x4 m12, [r8 + 2 * r1 + mmsize/2], 3 + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2 + mmsize/2], xm0 + movu [r2 + r3 + mmsize/2], xm2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 1 + vextracti32x4 [r2 + r5 + mmsize/2], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2 + mmsize/2], m0, 2 + vextracti32x4 [r2 + r3 + mmsize/2], m2, 2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 3 + vextracti32x4 [r2 + r5 + mmsize/2], m2, 3 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_LUMA_24x32_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_24x32, 5, 10, 22 + add r1d, r1d + add r3d, r3d + lea r7, [3 * r1] + sub r0, r7 + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [tab_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m19, [INTERP_OFFSET_SP] + pxor m20, m20 + vbroadcasti32x8 m21, [pw_pixel_max] +%endif + lea r5, [3 * r3] + +%rep 3 + PROCESS_LUMA_VERT_S_24x8_AVX512 %1 + lea r0, [r4] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_S_24x8_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_LUMA_24x32_AVX512 ss + FILTER_VER_S_LUMA_24x32_AVX512 sp +%endif + +%macro PROCESS_LUMA_VERT_S_32x2_AVX512 1 + movu m1, [r0] ;0 row + movu m3, [r0 + r1] ;1 row + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu m4, [r0 + 2 * r1] ;2 row + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu m5, 
[r0 + r7] ;3 row + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu m4, [r0 + 4 * r1] ;4 row + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + lea r6, [r0 + 4 * r1] + + movu m11, [r6 + r1] ;5 row + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu m12, [r6 + 2 * r1] ;6 row + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu m13, [r6 + r7] ;7 row + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu m12, [r6 + 4 * r1] ; 8 row + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2], m0 + movu [r2 + r3], m2 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_LUMA_32xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_32x%2, 5, 8, 22 + add r1d, r1d + add r3d, r3d + lea r7, [3 * r1] + sub r0, r7 + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [tab_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m19, [INTERP_OFFSET_SP] + pxor m20, m20 + vbroadcasti32x8 m21, [pw_pixel_max] +%endif + +%rep %2/2 - 1 + PROCESS_LUMA_VERT_S_32x2_AVX512 %1 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_LUMA_VERT_S_32x2_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_LUMA_32xN_AVX512 ss, 8 + FILTER_VER_S_LUMA_32xN_AVX512 ss, 16 + FILTER_VER_S_LUMA_32xN_AVX512 ss, 32 + FILTER_VER_S_LUMA_32xN_AVX512 ss, 24 + FILTER_VER_S_LUMA_32xN_AVX512 ss, 64 + FILTER_VER_S_LUMA_32xN_AVX512 sp, 8 + FILTER_VER_S_LUMA_32xN_AVX512 sp, 16 + FILTER_VER_S_LUMA_32xN_AVX512 sp, 32 + FILTER_VER_S_LUMA_32xN_AVX512 sp, 24 + FILTER_VER_S_LUMA_32xN_AVX512 sp, 64 +%endif + +%macro PROCESS_LUMA_VERT_S_48x4_AVX512 1 + PROCESS_LUMA_VERT_S_32x2_AVX512 %1 + movu m1, [r0 + 2 * r1] + movu m3, [r0 + r7] + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu m4, [r0 + 4 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu m5, [r6 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + lea r4, [r6 + 4 * r1] + + movu m4, [r6 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu m11, [r6 + r7] + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd 
m4, m11 + pmaddwd m4, m17 + + movu m12, [r4] + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu m13, [r4 + r1] + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu m12, [r4 + 2 * r1] + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2 + 2 * r3], m0 + movu [r2 + r5], m2 + + movu ym1, [r0 + mmsize] + movu ym3, [r0 + r1 + mmsize] + vinserti32x8 m1, [r0 + 2 * r1 + mmsize], 1 + vinserti32x8 m3, [r0 + r7 + mmsize], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu ym4, [r0 + 2 * r1 + mmsize] + vinserti32x8 m4, [r6 + mmsize], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu ym5, [r0 + r7 + mmsize] + vinserti32x8 m5, [r6 + r1 + mmsize], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu ym4, [r6 + mmsize] + vinserti32x8 m4, [r6 + 2 * r1 + mmsize], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu ym11, [r6 + r1 + mmsize] + vinserti32x8 m11, [r6 + r7 + mmsize], 1 + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu ym12, [r6 + 2 * r1 + mmsize] + vinserti32x8 m12, [r6 + 4 * r1 + mmsize], 1 + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu ym13, [r6 + r7 + mmsize] + vinserti32x8 m13, [r4 + r1 + mmsize], 1 + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu ym12, [r6 + 4 * r1 + mmsize] + vinserti32x8 m12, [r4 + 2 * r1 + mmsize], 1 + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2 + mmsize], ym0 + movu [r2 + r3 + mmsize], ym2 + vextracti32x8 [r2 + 2 * r3 + mmsize], m0, 1 + vextracti32x8 [r2 + r5 + mmsize], m2, 1 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_LUMA_48x64_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_48x64, 5, 8, 22 + add r1d, r1d + add r3d, r3d + lea r7, [3 * r1] + sub r0, r7 + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, 
[r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [tab_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m19, [INTERP_OFFSET_SP] + pxor m20, m20 + vbroadcasti32x8 m21, [pw_pixel_max] +%endif + + lea r5, [3 * r3] +%rep 15 + PROCESS_LUMA_VERT_S_48x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_S_48x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_LUMA_48x64_AVX512 ss + FILTER_VER_S_LUMA_48x64_AVX512 sp +%endif + +%macro PROCESS_LUMA_VERT_S_64x2_AVX512 1 + movu m1, [r0] ;0 row + movu m3, [r0 + r1] ;1 row + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu m4, [r0 + 2 * r1] ;2 row + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu m5, [r0 + r7] ;3 row + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu m4, [r0 + 4 * r1] ;4 row + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + lea r6, [r0 + 4 * r1] + + movu m11, [r6 + r1] ;5 row + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu m12, [r6 + 2 * r1] ;6 row + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu m13, [r6 + r7] ;7 row + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu m12, [r6 + 4 * r1] ; 8 row + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2], m0 + movu [r2 + r3], m2 + + movu m1, [r0 + mmsize] ;0 row + movu m3, [r0 + r1 + mmsize] ;1 row + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu m4, [r0 + 2 * r1 + mmsize] ;2 row + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu m5, [r0 + r7 + mmsize] ;3 row + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu m4, [r0 + 4 * r1 + mmsize] ;4 row + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu m11, [r6 + r1 + mmsize] ;5 row + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu m12, [r6 + 2 * r1 + mmsize] ;6 row + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu m13, [r6 + r7 + mmsize] ;7 row + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu m12, [r6 + 4 * r1 + mmsize] ; 8 row + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad 
m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2 + mmsize], m0 + movu [r2 + r3 + mmsize], m2 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_LUMA_64xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_64x%2, 5, 8, 22 + add r1d, r1d + add r3d, r3d + lea r7, [3 * r1] + sub r0, r7 + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [tab_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m19, [INTERP_OFFSET_SP] + pxor m20, m20 + vbroadcasti32x8 m21, [pw_pixel_max] +%endif + +%rep %2/2 - 1 + PROCESS_LUMA_VERT_S_64x2_AVX512 %1 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_LUMA_VERT_S_64x2_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_LUMA_64xN_AVX512 ss, 16 + FILTER_VER_S_LUMA_64xN_AVX512 ss, 32 + FILTER_VER_S_LUMA_64xN_AVX512 ss, 48 + FILTER_VER_S_LUMA_64xN_AVX512 ss, 64 + FILTER_VER_S_LUMA_64xN_AVX512 sp, 16 + FILTER_VER_S_LUMA_64xN_AVX512 sp, 32 + FILTER_VER_S_LUMA_64xN_AVX512 sp, 48 + FILTER_VER_S_LUMA_64xN_AVX512 sp, 64 +%endif +;------------------------------------------------------------------------------------------------------------- +;avx512 luma_vss and luma_vsp code end +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +;avx512 luma_vpp and luma_vps code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_LUMA_VERT_P_16x4_AVX512 1 + lea r5, [r0 + 4 * r1] + movu ym1, [r0] + movu ym3, [r0 + r1] + vinserti32x8 m1, [r0 + 2 * r1], 1 + vinserti32x8 m3, [r0 + r7], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r0 + 4 * r1], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu ym5, [r0 + r7] + vinserti32x8 m5, [r5 + r1], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu ym4, [r5] + vinserti32x8 m4, [r5 + 2 * r1], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + lea r4, [r5 + 4 * r1] + movu ym11, [r5 + r1] + vinserti32x8 m11, [r5 + r7], 1 + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu ym12, [r5 + 2 * r1] + vinserti32x8 m12, [r4], 1 + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu ym13, [r5 + r7] + vinserti32x8 m13, [r4 + r1], 1 + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, 
m12 + paddd m0, m8 + paddd m1, m4 + + movu ym12, [r4] + vinserti32x8 m12, [r4 + 2 * r1], 1 + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + +%ifidn %1, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r8], m2, 1 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_P_LUMA_16xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_16x%2, 5, 9, 22 + add r1d, r1d + add r3d, r3d + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [tab_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, pp + vbroadcasti32x4 m19, [INTERP_OFFSET_PP] + pxor m20, m20 + vbroadcasti32x8 m21, [pw_pixel_max] +%else + vbroadcasti32x4 m19, [INTERP_OFFSET_PS] +%endif + lea r7, [3 * r1] + lea r8, [3 * r3] + sub r0, r7 + +%rep %2/4 - 1 + PROCESS_LUMA_VERT_P_16x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_P_16x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_P_LUMA_16xN_AVX512 ps, 4 + FILTER_VER_P_LUMA_16xN_AVX512 ps, 8 + FILTER_VER_P_LUMA_16xN_AVX512 ps, 12 + FILTER_VER_P_LUMA_16xN_AVX512 ps, 16 + FILTER_VER_P_LUMA_16xN_AVX512 ps, 32 + FILTER_VER_P_LUMA_16xN_AVX512 ps, 64 + FILTER_VER_P_LUMA_16xN_AVX512 pp, 4 + FILTER_VER_P_LUMA_16xN_AVX512 pp, 8 + FILTER_VER_P_LUMA_16xN_AVX512 pp, 12 + FILTER_VER_P_LUMA_16xN_AVX512 pp, 16 + FILTER_VER_P_LUMA_16xN_AVX512 pp, 32 + FILTER_VER_P_LUMA_16xN_AVX512 pp, 64 +%endif + +%macro PROCESS_LUMA_VERT_P_24x4_AVX512 1 + PROCESS_LUMA_VERT_P_16x4_AVX512 %1 + movu xm1, [r0 + mmsize/2] + movu xm3, [r0 + r1 + mmsize/2] + vinserti32x4 m1, [r0 + r1 + mmsize/2], 1 + vinserti32x4 m3, [r0 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m1, [r0 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m3, [r0 + r7 + mmsize/2], 2 + vinserti32x4 m1, [r0 + r7 + mmsize/2], 3 + vinserti32x4 m3, [r0 + 4 * r1 + mmsize/2], 3 + + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu xm4, [r0 + 2 * r1 + mmsize/2] + movu xm5, [r0 + r7 + mmsize/2] + vinserti32x4 m4, [r0 + r7 + mmsize/2], 1 + vinserti32x4 m5, [r5 + mmsize/2], 1 + vinserti32x4 m4, [r5 + mmsize/2], 2 + vinserti32x4 m5, [r5 + r1 + mmsize/2], 2 + vinserti32x4 m4, [r5 + r1 + mmsize/2], 3 + vinserti32x4 m5, [r5 + 2 * r1 + mmsize/2], 3 + + punpcklwd m3, m4, m5 + pmaddwd m3, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m3 + paddd m1, m4 + + movu xm3, [r5 + mmsize/2] + movu xm5, [r5 + r1 + mmsize/2] + vinserti32x4 m3, [r5 + r1 + mmsize/2], 1 + vinserti32x4 m5, [r5 + 2 * r1 + 
mmsize/2], 1 + vinserti32x4 m3, [r5 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m5, [r5 + r7 + mmsize/2], 2 + vinserti32x4 m3, [r5 + r7 + mmsize/2], 3 + vinserti32x4 m5, [r5 + 4 * r1 + mmsize/2], 3 + + punpcklwd m2, m3, m5 + pmaddwd m2, m17 + punpckhwd m3, m5 + pmaddwd m3, m17 + + movu xm6, [r5 + 2 * r1 + mmsize/2] + movu xm7, [r5 + r7 + mmsize/2] + vinserti32x4 m6, [r5 + r7 + mmsize/2], 1 + vinserti32x4 m7, [r4 + mmsize/2], 1 + vinserti32x4 m6, [r4 + mmsize/2], 2 + vinserti32x4 m7, [r4 + r1 + mmsize/2], 2 + vinserti32x4 m6, [r4 + r1 + mmsize/2], 3 + vinserti32x4 m7, [r4 + 2 * r1 + mmsize/2], 3 + + punpcklwd m5, m6, m7 + pmaddwd m5, m18 + punpckhwd m6, m7 + pmaddwd m6, m18 + + paddd m2, m5 + paddd m3, m6 + paddd m0, m2 + paddd m1, m3 + + paddd m0, m19 + paddd m1, m19 + +%ifidn %1, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + packssdw m0, m1 + CLIPW m0, m20, m21 +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + packssdw m0, m1 +%endif + + movu [r2 + mmsize/2], xm0 + vextracti32x4 [r2 + r3 + mmsize/2], m0, 1 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 2 + vextracti32x4 [r2 + r8 + mmsize/2], m0, 3 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_P_LUMA_24xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_24x32, 5, 9, 22 + add r1d, r1d + add r3d, r3d + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [tab_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, pp + vbroadcasti32x4 m19, [INTERP_OFFSET_PP] + pxor m20, m20 + vbroadcasti32x8 m21, [pw_pixel_max] +%else + vbroadcasti32x4 m19, [INTERP_OFFSET_PS] +%endif + lea r7, [3 * r1] + lea r8, [3 * r3] + sub r0, r7 + +%rep 7 + PROCESS_LUMA_VERT_P_24x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_P_24x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_P_LUMA_24xN_AVX512 ps + FILTER_VER_P_LUMA_24xN_AVX512 pp +%endif + +%macro PROCESS_LUMA_VERT_P_32x2_AVX512 1 + movu m1, [r0] ;0 row + movu m3, [r0 + r1] ;1 row + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu m4, [r0 + 2 * r1] ;2 row + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu m5, [r0 + r7] ;3 row + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu m4, [r0 + 4 * r1] ;4 row + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + lea r6, [r0 + 4 * r1] + + movu m11, [r6 + r1] ;5 row + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu m12, [r6 + 2 * r1] ;6 row + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu m13, [r6 + r7] ;7 row + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu m12, [r6 + 4 * r1] ; 8 row + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 
+ pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + +%ifidn %1, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2], m0 + movu [r2 + r3], m2 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_P_LUMA_32xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_32x%2, 5, 8, 22 + add r1d, r1d + add r3d, r3d + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [tab_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, pp + vbroadcasti32x4 m19, [INTERP_OFFSET_PP] + pxor m20, m20 + vbroadcasti32x8 m21, [pw_pixel_max] +%else + vbroadcasti32x4 m19, [INTERP_OFFSET_PS] +%endif + lea r7, [3 * r1] + sub r0, r7 + +%rep %2/2 - 1 + PROCESS_LUMA_VERT_P_32x2_AVX512 %1 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_LUMA_VERT_P_32x2_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_P_LUMA_32xN_AVX512 ps, 8 + FILTER_VER_P_LUMA_32xN_AVX512 ps, 16 + FILTER_VER_P_LUMA_32xN_AVX512 ps, 32 + FILTER_VER_P_LUMA_32xN_AVX512 ps, 24 + FILTER_VER_P_LUMA_32xN_AVX512 ps, 64 + FILTER_VER_P_LUMA_32xN_AVX512 pp, 8 + FILTER_VER_P_LUMA_32xN_AVX512 pp, 16 + FILTER_VER_P_LUMA_32xN_AVX512 pp, 32 + FILTER_VER_P_LUMA_32xN_AVX512 pp, 24 + FILTER_VER_P_LUMA_32xN_AVX512 pp, 64 +%endif + +%macro PROCESS_LUMA_VERT_P_48x4_AVX512 1 + PROCESS_LUMA_VERT_P_32x2_AVX512 %1 + movu m1, [r0 + 2 * r1] + movu m3, [r0 + r7] + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu m4, [r0 + 4 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu m5, [r6 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu m4, [r6 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + lea r4, [r6 + 4 * r1] + + movu m11, [r6 + r7] + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu m12, [r6 + 4 * r1] + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu m13, [r4 + r1] + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu m12, [r4 + 2 * r1] + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + +%ifidn %1, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + + packssdw m0, m1 
+ packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%endif + movu [r2 + 2 * r3], m0 + movu [r2 + r8], m2 + + movu ym1, [r0 + mmsize] + movu ym3, [r0 + r1 + mmsize] + vinserti32x8 m1, [r0 + 2 * r1 + mmsize], 1 + vinserti32x8 m3, [r0 + r7 + mmsize], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu ym4, [r0 + 2 * r1 + mmsize] + vinserti32x8 m4, [r0 + 4 * r1 + mmsize], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu ym5, [r0 + r7 + mmsize] + vinserti32x8 m5, [r6 + r1 + mmsize], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu ym4, [r6 + mmsize] + vinserti32x8 m4, [r6 + 2 * r1 + mmsize], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu ym11, [r6 + r1 + mmsize] + vinserti32x8 m11, [r6 + r7 + mmsize], 1 + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu ym12, [r6 + 2 * r1 + mmsize] + vinserti32x8 m12, [r4 + mmsize], 1 + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu ym13, [r6 + r7 + mmsize] + vinserti32x8 m13, [r4 + r1 + mmsize], 1 + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu ym12, [r4 + mmsize] + vinserti32x8 m12, [r4 + 2 * r1 + mmsize], 1 + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + +%ifidn %1, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2 + mmsize], ym0 + movu [r2 + r3 + mmsize], ym2 + vextracti32x8 [r2 + 2 * r3 + mmsize], m0, 1 + vextracti32x8 [r2 + r8 + mmsize], m2, 1 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_P_LUMA_48x64_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_48x64, 5, 9, 22 + add r1d, r1d + add r3d, r3d + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [tab_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, pp + vbroadcasti32x4 m19, [INTERP_OFFSET_PP] + pxor m20, m20 + vbroadcasti32x8 m21, [pw_pixel_max] +%else + vbroadcasti32x4 m19, [INTERP_OFFSET_PS] +%endif + lea r7, [3 * r1] + lea r8, [3 * r3] + sub r0, r7 + +%rep 15 + PROCESS_LUMA_VERT_P_48x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_P_48x4_AVX512 %1 + RET 
+%endmacro + +%if ARCH_X86_64 + FILTER_VER_P_LUMA_48x64_AVX512 ps + FILTER_VER_P_LUMA_48x64_AVX512 pp +%endif + +%macro PROCESS_LUMA_VERT_P_64x2_AVX512 1 + PROCESS_LUMA_VERT_P_32x2_AVX512 %1 + movu m1, [r0 + mmsize] + movu m3, [r0 + r1 + mmsize] + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu m4, [r0 + 2 * r1 + mmsize] + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu m5, [r0 + r7 + mmsize] + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu m4, [r0 + 4 * r1 + mmsize] + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu m11, [r6 + r1 + mmsize] + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu m12, [r6 + 2 * r1 + mmsize] + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu m13, [r6 + r7 + mmsize] + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu m12, [r6 + 4 * r1 + mmsize] + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + +%ifidn %1, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + + packssdw m0, m1 + packssdw m2, m3 + CLIPW2 m0, m2, m20, m21 +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%endif + + movu [r2 + mmsize], m0 + movu [r2 + r3 + mmsize], m2 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_P_LUMA_64xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_64x%2, 5, 8, 22 + add r1d, r1d + add r3d, r3d + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [tab_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, pp + vbroadcasti32x4 m19, [INTERP_OFFSET_PP] + pxor m20, m20 + vbroadcasti32x8 m21, [pw_pixel_max] +%else + vbroadcasti32x4 m19, [INTERP_OFFSET_PS] +%endif + lea r7, [3 * r1] + sub r0, r7 + +%rep %2/2 - 1 + PROCESS_LUMA_VERT_P_64x2_AVX512 %1 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_LUMA_VERT_P_64x2_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_P_LUMA_64xN_AVX512 ps, 16 + FILTER_VER_P_LUMA_64xN_AVX512 ps, 32 + FILTER_VER_P_LUMA_64xN_AVX512 ps, 48 + FILTER_VER_P_LUMA_64xN_AVX512 ps, 64 + FILTER_VER_P_LUMA_64xN_AVX512 pp, 16 + FILTER_VER_P_LUMA_64xN_AVX512 pp, 32 + FILTER_VER_P_LUMA_64xN_AVX512 pp, 48 + FILTER_VER_P_LUMA_64xN_AVX512 pp, 64 +%endif +;------------------------------------------------------------------------------------------------------------- +;avx512 luma_vpp and luma_vps code end 
+;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +;ipfilter_luma_avx512 code end +;-------------------------------------------------------------------------------------------------------------
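For orientation before the next file's diff: every FILTER_VER_S_LUMA_* and FILTER_VER_P_LUMA_* macro added above vectorizes the same 8-tap vertical luma convolution, just at different block widths and output formats. The scalar C++ sketch below shows that operation in plain form. It is only an illustrative sketch, not x265's reference implementation: the function name and the shiftSP/offsetSP/pixelMax parameters are placeholders standing in for the INTERP_SHIFT_SP, INTERP_OFFSET_SP and pw_pixel_max constants used in the assembly, and coeff[] corresponds to one row of tab_LumaCoeffVer selected by coeffIdx.

#include <cstdint>
#include <algorithm>

// Scalar sketch of the 8-tap vertical filter that the AVX-512 macros above
// vectorize (placeholder names; not the actual x265 C reference code).
static void interp8tapVertScalarSketch(const int16_t* src, intptr_t srcStride,
                                       int16_t* dst, intptr_t dstStride,
                                       int width, int height,
                                       const int16_t coeff[8],   // one row of the luma coefficient table
                                       bool isSP,                // true for the "sp" variants, false for "ss"
                                       int shiftSP, int offsetSP, int pixelMax)
{
    src -= 3 * srcStride;                         // taps start three rows above the current sample
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int k = 0; k < 8; k++)           // eight vertical taps (the pmaddwd pairs in the asm)
                sum += src[x + k * srcStride] * coeff[k];
            if (isSP)                             // "sp": add rounding offset, shift, clip to pixel range
                dst[x] = (int16_t)std::min(std::max((sum + offsetSP) >> shiftSP, 0), pixelMax);
            else                                  // "ss": plain shift by 6, keep the signed 16-bit residual
                dst[x] = (int16_t)(sum >> 6);
        }
        src += srcStride;
        dst += dstStride;
    }
}

The two branches mirror the %ifidn %1, sp blocks in the macros: the ss variants only shift by 6 and keep a 16-bit intermediate, while the sp variants round, shift by INTERP_SHIFT_SP and clip against pw_pixel_max (CLIPW2) before packing back to words.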
View file
x265_2.7.tar.gz/source/common/x86/ipfilter8.asm -> x265_2.9.tar.gz/source/common/x86/ipfilter8.asm
Changed
@@ -26,7 +26,7 @@ %include "x86inc.asm" %include "x86util.asm" -SECTION_RODATA 32 +SECTION_RODATA 64 const tab_Tm, db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 db 8, 9,10,11, 9,10,11,12,10,11,12,13,11,12,13, 14 @@ -43,6 +43,15 @@ const pd_526336, times 8 dd 8192*64+2048 +const tab_ChromaCoeff, db 0, 64, 0, 0 + db -2, 58, 10, -2 + db -4, 54, 16, -2 + db -6, 46, 28, -4 + db -4, 36, 36, -4 + db -4, 28, 46, -6 + db -2, 16, 54, -4 + db -2, 10, 58, -2 + const tab_LumaCoeff, db 0, 0, 0, 64, 0, 0, 0, 0 db -1, 4, -10, 58, 17, -5, 1, 0 db -1, 4, -11, 40, 40, -11, 4, -1 @@ -133,12 +142,115 @@ times 16 db 58, -10 times 16 db 4, -1 +ALIGN 64 +const tab_ChromaCoeffVer_32_avx512, times 32 db 0, 64 + times 32 db 0, 0 + + times 32 db -2, 58 + times 32 db 10, -2 + + times 32 db -4, 54 + times 32 db 16, -2 + + times 32 db -6, 46 + times 32 db 28, -4 + + times 32 db -4, 36 + times 32 db 36, -4 + + times 32 db -4, 28 + times 32 db 46, -6 + + times 32 db -2, 16 + times 32 db 54, -4 + + times 32 db -2, 10 + times 32 db 58, -2 + +ALIGN 64 +const pw_ChromaCoeffVer_32_avx512, times 16 dw 0, 64 + times 16 dw 0, 0 + + times 16 dw -2, 58 + times 16 dw 10, -2 + + times 16 dw -4, 54 + times 16 dw 16, -2 + + times 16 dw -6, 46 + times 16 dw 28, -4 + + times 16 dw -4, 36 + times 16 dw 36, -4 + + times 16 dw -4, 28 + times 16 dw 46, -6 + + times 16 dw -2, 16 + times 16 dw 54, -4 + + times 16 dw -2, 10 + times 16 dw 58, -2 + +ALIGN 64 +const pw_LumaCoeffVer_avx512, times 16 dw 0, 0 + times 16 dw 0, 64 + times 16 dw 0, 0 + times 16 dw 0, 0 + + times 16 dw -1, 4 + times 16 dw -10, 58 + times 16 dw 17, -5 + times 16 dw 1, 0 + + times 16 dw -1, 4 + times 16 dw -11, 40 + times 16 dw 40, -11 + times 16 dw 4, -1 + + times 16 dw 0, 1 + times 16 dw -5, 17 + times 16 dw 58, -10 + times 16 dw 4, -1 + +ALIGN 64 +const tab_LumaCoeffVer_32_avx512, times 32 db 0, 0 + times 32 db 0, 64 + times 32 db 0, 0 + times 32 db 0, 0 + + times 32 db -1, 4 + times 32 db -10, 58 + times 32 db 17, -5 + times 32 db 1, 0 + + times 32 db -1, 4 + times 32 db -11, 40 + times 32 db 40, -11 + times 32 db 4, -1 + + times 32 db 0, 1 + times 32 db -5, 17 + times 32 db 58, -10 + times 32 db 4, -1 + const tab_c_64_n64, times 8 db 64, -64 const interp8_hps_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 -SECTION .text +const interp4_horiz_shuf_load1_avx512, times 2 db 0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6 +const interp4_horiz_shuf_load2_avx512, times 2 db 8, 9, 10, 11, 9, 10, 11, 12, 10, 11, 12, 13, 11, 12, 13, 14 +const interp4_horiz_shuf_load3_avx512, times 2 db 4, 5, 6, 7, 5, 6, 7, 8, 6, 7, 8, 9, 7, 8, 9, 10 + +ALIGN 64 +interp4_vps_store1_avx512: dq 0, 1, 8, 9, 2, 3, 10, 11 +interp4_vps_store2_avx512: dq 4, 5, 12, 13, 6, 7, 14, 15 +const interp4_hps_shuf_avx512, dq 0, 4, 1, 5, 2, 6, 3, 7 +const interp4_hps_store_16xN_avx512, dq 0, 2, 1, 3, 4, 6, 5, 7 +const interp8_hps_store_avx512, dq 0, 1, 4, 5, 2, 3, 6, 7 +const interp8_vsp_store_avx512, dq 0, 2, 4, 6, 1, 3, 5, 7 +SECTION .text cextern pb_128 cextern pw_1 cextern pw_32 @@ -1954,6 +2066,276 @@ P2S_H_32xN_avx2 48 ;----------------------------------------------------------------------------- +;p2s and p2s_aligned 32xN avx512 code start +;----------------------------------------------------------------------------- + +%macro PROCESS_P2S_32x4_AVX512 0 + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + r1] + pmovzxbw m2, [r0 + r1 * 2] + pmovzxbw m3, [r0 + r5] + + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + 
+ movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r6], m3 +%endmacro + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal filterPixelToShort_32x8, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + + PROCESS_P2S_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_32x16, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 3 + PROCESS_P2S_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_32x24, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 5 + PROCESS_P2S_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_32x32, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 7 + PROCESS_P2S_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_32x48, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 11 + PROCESS_P2S_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_32x64, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 15 + PROCESS_P2S_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_32x4_AVX512 + RET +%endif + +%macro PROCESS_P2S_ALIGNED_32x4_AVX512 0 + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + r1] + pmovzxbw m2, [r0 + r1 * 2] + pmovzxbw m3, [r0 + r5] + + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + + mova [r2], m0 + mova [r2 + r3], m1 + mova [r2 + r3 * 2], m2 + mova [r2 + r6], m3 +%endmacro + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_32x8, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + + PROCESS_P2S_ALIGNED_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_ALIGNED_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_32x16, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 3 + PROCESS_P2S_ALIGNED_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_ALIGNED_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_32x24, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, 
[r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 5 + PROCESS_P2S_ALIGNED_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_ALIGNED_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_32x32, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 7 + PROCESS_P2S_ALIGNED_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_ALIGNED_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_32x48, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 11 + PROCESS_P2S_ALIGNED_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_ALIGNED_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_32x64, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 15 + PROCESS_P2S_ALIGNED_32x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_ALIGNED_32x4_AVX512 + RET +%endif +;----------------------------------------------------------------------------- +;p2s and p2s_aligned 32xN avx512 code end +;----------------------------------------------------------------------------- +;----------------------------------------------------------------------------- ; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) ;----------------------------------------------------------------------------- %macro P2S_H_64xN 1 @@ -2269,6 +2651,236 @@ P2S_H_64xN_avx2 48 ;----------------------------------------------------------------------------- +;p2s and p2s_aligned 64xN avx512 code start +;----------------------------------------------------------------------------- +%macro PROCESS_P2S_64x4_AVX512 0 + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + mmsize/2] + pmovzxbw m2, [r0 + r1] + pmovzxbw m3, [r0 + r1 + mmsize/2] + + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + movu [r2], m0 + movu [r2 + mmsize], m1 + movu [r2 + r3], m2 + movu [r2 + r3 + mmsize], m3 + + pmovzxbw m0, [r0 + r1 * 2] + pmovzxbw m1, [r0 + r1 * 2 + mmsize/2] + pmovzxbw m2, [r0 + r5] + pmovzxbw m3, [r0 + r5 + mmsize/2] + + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + movu [r2 + r3 * 2], m0 + movu [r2 + r3 * 2 + mmsize], m1 + movu [r2 + r6], m2 + movu [r2 + r6 + mmsize], m3 +%endmacro + +%macro PROCESS_P2S_ALIGNED_64x4_AVX512 0 + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + mmsize/2] + pmovzxbw m2, [r0 + r1] + pmovzxbw m3, [r0 + r1 + mmsize/2] + + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + mova [r2], m0 + mova [r2 + mmsize], m1 + mova [r2 + r3], m2 + mova [r2 + r3 + mmsize], m3 + + pmovzxbw m0, [r0 + r1 * 2] + pmovzxbw m1, [r0 + r1 * 2 + mmsize/2] + pmovzxbw m2, [r0 + r5] + pmovzxbw m3, [r0 + r5 + mmsize/2] + + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + mova [r2 + r3 * 2], m0 + mova [r2 + r3 * 2 + mmsize], m1 + mova [r2 + r6], m2 + mova [r2 + r6 + mmsize], m3 +%endmacro +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, 
int16_t dstStride) +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal filterPixelToShort_64x64, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 15 + PROCESS_P2S_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_64x48, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 11 + PROCESS_P2S_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_64x32, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 7 + PROCESS_P2S_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_64x16, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 3 + PROCESS_P2S_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_64x64, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 15 + PROCESS_P2S_ALIGNED_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_ALIGNED_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_64x48, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 11 + PROCESS_P2S_ALIGNED_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_ALIGNED_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_64x32, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 7 + PROCESS_P2S_ALIGNED_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_ALIGNED_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_64x16, 3, 7, 5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + +%rep 3 + PROCESS_P2S_ALIGNED_64x4_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_P2S_ALIGNED_64x4_AVX512 + RET +%endif +;----------------------------------------------------------------------------- +;p2s and p2s_aligned 64xN avx512 code end +;----------------------------------------------------------------------------- + +;----------------------------------------------------------------------------- ; void filterPixelToShort(pixel src, intptr_t srcStride, int16_t dst, int16_t dstStride) ;----------------------------------------------------------------------------- %macro P2S_H_12xN 1 @@ -2689,6 +3301,229 @@ jnz .loop RET +;----------------------------------------------------------------------------- +;p2s and p2s_aligned 48xN avx512 code start +;----------------------------------------------------------------------------- +%macro PROCESS_P2S_48x8_AVX512 0 + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + r1] + pmovzxbw m2, [r0 + r1 * 2] + pmovzxbw m3, [r0 + r5] + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw 
m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r6], m3 + + pmovzxbw ym0, [r0 + 32] + pmovzxbw ym1, [r0 + r1 + 32] + pmovzxbw ym2, [r0 + r1 * 2 + 32] + pmovzxbw ym3, [r0 + r5 + 32] + psllw ym0, 6 + psllw ym1, 6 + psllw ym2, 6 + psllw ym3, 6 + psubw ym0, ym4 + psubw ym1, ym4 + psubw ym2, ym4 + psubw ym3, ym4 + movu [r2 + 64], ym0 + movu [r2 + r3 + 64], ym1 + movu [r2 + r3 * 2 + 64], ym2 + movu [r2 + r6 + 64], ym3 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + r1] + pmovzxbw m2, [r0 + r1 * 2] + pmovzxbw m3, [r0 + r5] + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r6], m3 + + pmovzxbw ym0, [r0 + 32] + pmovzxbw ym1, [r0 + r1 + 32] + pmovzxbw ym2, [r0 + r1 * 2 + 32] + pmovzxbw ym3, [r0 + r5 + 32] + psllw ym0, 6 + psllw ym1, 6 + psllw ym2, 6 + psllw ym3, 6 + psubw ym0, ym4 + psubw ym1, ym4 + psubw ym2, ym4 + psubw ym3, ym4 + movu [r2 + 64], ym0 + movu [r2 + r3 + 64], ym1 + movu [r2 + r3 * 2 + 64], ym2 + movu [r2 + r6 + 64], ym3 +%endmacro + +%macro PROCESS_P2S_ALIGNED_48x8_AVX512 0 + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + r1] + pmovzxbw m2, [r0 + r1 * 2] + pmovzxbw m3, [r0 + r5] + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + mova [r2], m0 + mova [r2 + r3], m1 + mova [r2 + r3 * 2], m2 + mova [r2 + r6], m3 + + pmovzxbw ym0, [r0 + 32] + pmovzxbw ym1, [r0 + r1 + 32] + pmovzxbw ym2, [r0 + r1 * 2 + 32] + pmovzxbw ym3, [r0 + r5 + 32] + psllw ym0, 6 + psllw ym1, 6 + psllw ym2, 6 + psllw ym3, 6 + psubw ym0, ym4 + psubw ym1, ym4 + psubw ym2, ym4 + psubw ym3, ym4 + mova [r2 + 64], ym0 + mova [r2 + r3 + 64], ym1 + mova [r2 + r3 * 2 + 64], ym2 + mova [r2 + r6 + 64], ym3 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + r1] + pmovzxbw m2, [r0 + r1 * 2] + pmovzxbw m3, [r0 + r5] + psllw m0, 6 + psllw m1, 6 + psllw m2, 6 + psllw m3, 6 + psubw m0, m4 + psubw m1, m4 + psubw m2, m4 + psubw m3, m4 + mova [r2], m0 + mova [r2 + r3], m1 + mova [r2 + r3 * 2], m2 + mova [r2 + r6], m3 + + pmovzxbw ym0, [r0 + 32] + pmovzxbw ym1, [r0 + r1 + 32] + pmovzxbw ym2, [r0 + r1 * 2 + 32] + pmovzxbw ym3, [r0 + r5 + 32] + psllw ym0, 6 + psllw ym1, 6 + psllw ym2, 6 + psllw ym3, 6 + psubw ym0, ym4 + psubw ym1, ym4 + psubw ym2, ym4 + psubw ym3, ym4 + mova [r2 + 64], ym0 + mova [r2 + r3 + 64], ym1 + mova [r2 + r3 * 2 + 64], ym2 + mova [r2 + r6 + 64], ym3 +%endmacro +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, int16_t dstStride) +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal filterPixelToShort_48x64, 3,7,5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + + PROCESS_P2S_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + 
PROCESS_P2S_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_48x8_AVX512 + RET + +INIT_ZMM avx512 +cglobal filterPixelToShort_aligned_48x64, 3,7,5 + mov r3d, r3m + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + + ; load constant + vpbroadcastd m4, [pw_2000] + + PROCESS_P2S_ALIGNED_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_ALIGNED_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_ALIGNED_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_ALIGNED_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_ALIGNED_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_ALIGNED_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_ALIGNED_48x8_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r3 * 4] + PROCESS_P2S_ALIGNED_48x8_AVX512 + RET +%endif +;----------------------------------------------------------------------------- +;p2s and p2s_aligned 48xN avx512 code end +;----------------------------------------------------------------------------- %macro PROCESS_LUMA_W4_4R 0 movd m0, [r0] @@ -9353,3 +10188,4762 @@ FILTER_VER_LUMA_S_AVX2_32x24 sp FILTER_VER_LUMA_S_AVX2_32x24 ss +;------------------------------------------------------------------------------------------------------------- +;ipfilter_chroma_avx512 code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_IPFILTER_CHROMA_PP_64x1_AVX512 0 + ; register map + ; m0 - interpolate coeff + ; m1, m2 - shuffle order table + ; m3 - constant word 1 + ; m4 - constant word 512 + + movu m5, [r0] + pshufb m6, m5, m2 + pshufb m5, m5, m1 + pmaddubsw m5, m0 + pmaddubsw m6, m0 + pmaddwd m5, m3 + pmaddwd m6, m3 + + movu m7, [r0 + 4] + pshufb m8, m7, m2 + pshufb m7, m7, m1 + pmaddubsw m7, m0 + pmaddubsw m8, m0 + pmaddwd m7, m3 + pmaddwd m8, m3 + + packssdw m5, m7 + packssdw m6, m8 + pmulhrsw m5, m4 + pmulhrsw m6, m4 + packuswb m5, m6 + movu [r2], m5 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PP_32x2_AVX512 0 + ; register map + ; m0 - interpolate coeff + ; m1, m2 - shuffle order table + ; m3 - constant word 1 + ; m4 - constant word 512 + ; m9 - store shuffle order table + + movu ym5, [r0] + vinserti32x8 m5, [r0 + r1], 1 + movu ym7, [r0 + 4] + vinserti32x8 m7, [r0 + r1 + 4], 1 + + pshufb m6, m5, m2 + pshufb m5, m1 + pshufb m8, m7, m2 + pshufb m7, m1 + + pmaddubsw m5, m0 + pmaddubsw m7, m0 + pmaddwd m5, m3 + pmaddwd m7, m3 + + pmaddubsw m6, m0 + pmaddubsw m8, m0 + pmaddwd m6, m3 + pmaddwd m8, m3 + + packssdw m5, m7 + packssdw m6, m8 + pmulhrsw m5, m4 + pmulhrsw m6, m4 + packuswb m5, m6 + movu [r2], ym5 + vextracti32x8 [r2 + r3], m5, 1 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PP_16x4_AVX512 0 + ; register map + ; m0 - interpolate coeff + ; m1, m2 - shuffle order table + ; m3 - constant word 1 + ; m4 - constant word 512 + + movu xm5, [r0] + vinserti32x4 m5, [r0 + r1], 1 + vinserti32x4 m5, [r0 + 2 * r1], 2 + vinserti32x4 m5, [r0 + r6], 3 + pshufb m6, m5, m2 + pshufb m5, m1 + + movu xm7, [r0 + 4] + vinserti32x4 m7, [r0 + r1 + 4], 1 + vinserti32x4 m7, [r0 + 2 * r1 + 4], 2 + vinserti32x4 m7, [r0 + r6 + 4], 3 + pshufb m8, m7, m2 + pshufb m7, m1 + + pmaddubsw m5, m0 + pmaddubsw m7, m0 + pmaddwd m5, m3 + pmaddwd m7, m3 + + pmaddubsw m6, m0 + pmaddubsw m8, m0 + pmaddwd m6, m3 + pmaddwd m8, m3 + + packssdw m5, m7 + packssdw m6, m8 + pmulhrsw m5, m4 + pmulhrsw m6, m4 + packuswb m5, m6 + movu [r2], xm5 + vextracti32x4 [r2 + 
r3], m5, 1 + vextracti32x4 [r2 + 2 * r3], m5, 2 + vextracti32x4 [r2 + r7], m5, 3 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PP_48x4_AVX512 0 + ; register map + ; m0 - interpolate coeff + ; m1, m2 - shuffle order table + ; m3 - constant word 1 + ; m4 - constant word 512 + movu ym5, [r0] + vinserti32x8 m5, [r0 + r1], 1 + movu ym7, [r0 + 4] + vinserti32x8 m7, [r0 + r1 + 4], 1 + + pshufb m6, m5, m2 + pshufb m5, m1 + pshufb m8, m7, m2 + pshufb m7, m1 + + pmaddubsw m5, m0 + pmaddubsw m7, m0 + pmaddwd m5, m3 + pmaddwd m7, m3 + + pmaddubsw m6, m0 + pmaddubsw m8, m0 + pmaddwd m6, m3 + pmaddwd m8, m3 + + packssdw m5, m7 + packssdw m6, m8 + pmulhrsw m5, m4 + pmulhrsw m6, m4 + packuswb m5, m6 + movu [r2], ym5 + vextracti32x8 [r2 + r3], m5, 1 + + movu ym5, [r0 + 2 * r1] + vinserti32x8 m5, [r0 + r6], 1 + movu ym7, [r0 + 2 * r1 + 4] + vinserti32x8 m7, [r0 + r6 + 4], 1 + + pshufb m6, m5, m2 + pshufb m5, m1 + pshufb m8, m7, m2 + pshufb m7, m1 + + pmaddubsw m5, m0 + pmaddubsw m7, m0 + pmaddwd m5, m3 + pmaddwd m7, m3 + + pmaddubsw m6, m0 + pmaddubsw m8, m0 + pmaddwd m6, m3 + pmaddwd m8, m3 + + packssdw m5, m7 + packssdw m6, m8 + pmulhrsw m5, m4 + pmulhrsw m6, m4 + packuswb m5, m6 + movu [r2 + 2 * r3], ym5 + vextracti32x8 [r2 + r7], m5, 1 + + movu xm5, [r0 + mmsize/2] + vinserti32x4 m5, [r0 + r1 + mmsize/2], 1 + vinserti32x4 m5, [r0 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m5, [r0 + r6 + mmsize/2], 3 + pshufb m6, m5, m2 + pshufb m5, m1 + + movu xm7, [r0 + 36] + vinserti32x4 m7, [r0 + r1 + 36], 1 + vinserti32x4 m7, [r0 + 2 * r1 + 36], 2 + vinserti32x4 m7, [r0 + r6 + 36], 3 + pshufb m8, m7, m2 + pshufb m7, m1 + + pmaddubsw m5, m0 + pmaddubsw m7, m0 + pmaddwd m5, m3 + pmaddwd m7, m3 + + pmaddubsw m6, m0 + pmaddubsw m8, m0 + pmaddwd m6, m3 + pmaddwd m8, m3 + + packssdw m5, m7 + packssdw m6, m8 + pmulhrsw m5, m4 + pmulhrsw m6, m4 + packuswb m5, m6 + movu [r2 + mmsize/2], xm5 + vextracti32x4 [r2 + r3 + mmsize/2], m5, 1 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m5, 2 + vextracti32x4 [r2 + r7 + mmsize/2], m5, 3 +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_64xN(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PP_64xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_pp_64x%1, 4,6,9 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti32x8 m1, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m2, [interp4_horiz_shuf_load2_avx512] + vbroadcasti32x8 m3, [pw_1] + vbroadcasti32x8 m4, [pw_512] + dec r0 + +%rep %1 - 1 + PROCESS_IPFILTER_CHROMA_PP_64x1_AVX512 + lea r2, [r2 + r3] + lea r0, [r0 + r1] +%endrep + PROCESS_IPFILTER_CHROMA_PP_64x1_AVX512 + RET +%endmacro + +%if ARCH_X86_64 + IPFILTER_CHROMA_PP_64xN_AVX512 64 + IPFILTER_CHROMA_PP_64xN_AVX512 32 + IPFILTER_CHROMA_PP_64xN_AVX512 48 + IPFILTER_CHROMA_PP_64xN_AVX512 16 +%endif + +%macro IPFILTER_CHROMA_PP_32xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_pp_32x%1, 4,6,9 + mov r4d, r4m + +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti32x8 m1, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m2, [interp4_horiz_shuf_load2_avx512] + vbroadcasti32x8 m3, [pw_1] + vbroadcasti32x8 
m4, [pw_512] + dec r0 + +%rep %1/2 - 1 + PROCESS_IPFILTER_CHROMA_PP_32x2_AVX512 + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] +%endrep + PROCESS_IPFILTER_CHROMA_PP_32x2_AVX512 + RET +%endmacro + +%if ARCH_X86_64 + IPFILTER_CHROMA_PP_32xN_AVX512 16 + IPFILTER_CHROMA_PP_32xN_AVX512 24 + IPFILTER_CHROMA_PP_32xN_AVX512 8 + IPFILTER_CHROMA_PP_32xN_AVX512 32 + IPFILTER_CHROMA_PP_32xN_AVX512 64 + IPFILTER_CHROMA_PP_32xN_AVX512 48 +%endif + +%macro IPFILTER_CHROMA_PP_16xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_pp_16x%1, 4,8,9 + mov r4d, r4m + lea r6, [3 * r1] + lea r7, [3 * r3] +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti32x8 m1, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m2, [interp4_horiz_shuf_load2_avx512] + vbroadcasti32x8 m3, [pw_1] + vbroadcasti32x8 m4, [pw_512] + dec r0 + +%rep %1/4 - 1 + PROCESS_IPFILTER_CHROMA_PP_16x4_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] +%endrep + PROCESS_IPFILTER_CHROMA_PP_16x4_AVX512 + RET +%endmacro + +%if ARCH_X86_64 + IPFILTER_CHROMA_PP_16xN_AVX512 4 + IPFILTER_CHROMA_PP_16xN_AVX512 8 + IPFILTER_CHROMA_PP_16xN_AVX512 12 + IPFILTER_CHROMA_PP_16xN_AVX512 16 + IPFILTER_CHROMA_PP_16xN_AVX512 24 + IPFILTER_CHROMA_PP_16xN_AVX512 32 + IPFILTER_CHROMA_PP_16xN_AVX512 64 +%endif + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_pp_48x64, 4,8,9 + mov r4d, r4m + lea r6, [3 * r1] + lea r7, [3 * r3] +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastd m0, [r5 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti32x8 m1, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m2, [interp4_horiz_shuf_load2_avx512] + vbroadcasti32x8 m3, [pw_1] + vbroadcasti32x8 m4, [pw_512] + dec r0 + +%rep 15 + PROCESS_IPFILTER_CHROMA_PP_48x4_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] +%endrep + PROCESS_IPFILTER_CHROMA_PP_48x4_AVX512 + RET +%endif + +%macro PROCESS_IPFILTER_CHROMA_PS_64x1_AVX512 0 + movu ym6, [r0] + vinserti32x8 m6, [r0 + 4], 1 + pshufb m7, m6, m2 + pshufb m6, m1 + pmaddubsw m6, m0 + pmaddubsw m7, m0 + pmaddwd m6, m3 + pmaddwd m7, m3 + + movu ym8, [r0 + 32] + vinserti32x8 m8, [r0 + 36], 1 + pshufb m9, m8, m2 + pshufb m8, m1 + pmaddubsw m8, m0 + pmaddubsw m9, m0 + pmaddwd m8, m3 + pmaddwd m9, m3 + + packssdw m6, m7 + packssdw m8, m9 + psubw m6, m4 + psubw m8, m4 + vpermq m6, m10, m6 + vpermq m8, m10, m8 + movu [r2], m6 + movu [r2 + mmsize],m8 +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_horiz_ps_64xN(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PS_64xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_ps_64x%1, 4,7,11 + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti32x8 m1, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m2, [interp4_horiz_shuf_load2_avx512] + vbroadcasti32x8 m3, [pw_1] + vbroadcasti32x8 m4, [pw_2000] + mova m10, [interp4_hps_shuf_avx512] + + ; register map + ; m0 - interpolate coeff + ; m1,m2 - load shuffle order table + ; m3 - constant word 1 + ; m4 - constant word 2000 + ; m10 - store shuffle order table + + mov r6d, %1 + dec r0 + test r5d, r5d 
+ je .loop + sub r0, r1 + add r6d, 3 + +.loop: + PROCESS_IPFILTER_CHROMA_PS_64x1_AVX512 + lea r2, [r2 + 2 * r3] + lea r0, [r0 + r1] + dec r6d + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 + IPFILTER_CHROMA_PS_64xN_AVX512 64 + IPFILTER_CHROMA_PS_64xN_AVX512 32 + IPFILTER_CHROMA_PS_64xN_AVX512 48 + IPFILTER_CHROMA_PS_64xN_AVX512 16 +%endif + +%macro PROCESS_IPFILTER_CHROMA_PS_32x1_AVX512 0 + movu ym6, [r0] + vinserti32x8 m6, [r0 + 4], 1 + pshufb m7, m6, m2 + pshufb m6, m6, m1 + pmaddubsw m6, m0 + pmaddubsw m7, m0 + pmaddwd m6, m3 + pmaddwd m7, m3 + + packssdw m6, m7 + psubw m6, m4 + vpermq m6, m8, m6 + movu [r2], m6 +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_horiz_ps_32xN(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PS_32xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_ps_32x%1, 4,7,9 + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti32x8 m1, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m2, [interp4_horiz_shuf_load2_avx512] + vbroadcasti32x8 m3, [pw_1] + vbroadcasti32x8 m4, [pw_2000] + mova m8, [interp4_hps_shuf_avx512] + + ; register map + ; m0 - interpolate coeff + ; m1,m2 - load shuffle order table + ; m3 - constant word 1 + ; m4 - constant word 2000 + ; m8 - store shuffle order table + + mov r6d, %1 + dec r0 + test r5d, r5d + je .loop + sub r0, r1 + add r6d, 3 + +.loop: + PROCESS_IPFILTER_CHROMA_PS_32x1_AVX512 + lea r2, [r2 + 2 * r3] + lea r0, [r0 + r1] + dec r6d + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 + IPFILTER_CHROMA_PS_32xN_AVX512 64 + IPFILTER_CHROMA_PS_32xN_AVX512 48 + IPFILTER_CHROMA_PS_32xN_AVX512 32 + IPFILTER_CHROMA_PS_32xN_AVX512 24 + IPFILTER_CHROMA_PS_32xN_AVX512 16 + IPFILTER_CHROMA_PS_32xN_AVX512 8 +%endif + +%macro PROCESS_IPFILTER_CHROMA_PS_16x2_AVX512 0 + movu xm6, [r0] + vinserti32x4 m6, [r0 + 4], 1 + vinserti32x4 m6, [r0 + r1], 2 + vinserti32x4 m6, [r0 + r1 + 4], 3 + + pshufb m7, m6, m2 + pshufb m6, m6, m1 + pmaddubsw m6, m0 + pmaddubsw m7, m0 + pmaddwd m6, m3 + pmaddwd m7, m3 + + packssdw m6, m7 + psubw m6, m4 + vpermq m6, m8, m6 + movu [r2], ym6 + vextracti32x8 [r2 + r3], m6, 1 +%endmacro + +%macro PROCESS_IPFILTER_CHROMA_PS_16x1_AVX512 0 + movu xm6, [r0] + vinserti32x4 m6, [r0 + 4], 1 + + pshufb ym7, ym6, ym2 + pshufb ym6, ym6, ym1 + pmaddubsw ym6, ym0 + pmaddubsw ym7, ym0 + pmaddwd ym6, ym3 + pmaddwd ym7, ym3 + + packssdw ym6, ym7 + psubw ym6, ym4 + vpermq ym6, ym8, ym6 + movu [r2], ym6 +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_horiz_ps_16xN(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PS_16xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_ps_16x%1, 4,7,9 + mov r4d, r4m + mov r5d, r5m + add r3, r3 + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti32x8 m1, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m2, 
[interp4_horiz_shuf_load2_avx512] + vbroadcasti32x8 m3, [pw_1] + vbroadcasti32x8 m4, [pw_2000] + mova m8, [interp4_hps_store_16xN_avx512] + + ; register map + ; m0 - interpolate coeff + ; m1,m2 - load shuffle order table + ; m3 - constant word 1 + ; m4 - constant word 2000 + ; m8 - store shuffle order table + + mov r6d, %1 + dec r0 + test r5d, r5d + je .loop + sub r0, r1 + add r6d, 3 + PROCESS_IPFILTER_CHROMA_PS_16x1_AVX512 + lea r2, [r2 + r3] + lea r0, [r0 + r1] + dec r6d + +.loop: + PROCESS_IPFILTER_CHROMA_PS_16x2_AVX512 + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + sub r6d, 2 + jnz .loop + + RET +%endmacro + +%if ARCH_X86_64 == 1 + IPFILTER_CHROMA_PS_16xN_AVX512 64 + IPFILTER_CHROMA_PS_16xN_AVX512 32 + IPFILTER_CHROMA_PS_16xN_AVX512 24 + IPFILTER_CHROMA_PS_16xN_AVX512 16 + IPFILTER_CHROMA_PS_16xN_AVX512 12 + IPFILTER_CHROMA_PS_16xN_AVX512 8 + IPFILTER_CHROMA_PS_16xN_AVX512 4 +%endif + +%macro PROCESS_IPFILTER_CHROMA_PS_48x1_AVX512 0 + movu ym6, [r0] + vinserti32x8 m6, [r0 + 4], 1 + pshufb m7, m6, m2 + pshufb m6, m6, m1 + pmaddubsw m6, m0 + pmaddubsw m7, m0 + pmaddwd m6, m3 + pmaddwd m7, m3 + + packssdw m6, m7 + psubw m6, m4 + vpermq m6, m8, m6 + movu [r2], m6 + + movu xm6, [r0 + 32] + vinserti32x4 m6, [r0 + 36], 1 + pshufb ym7, ym6, ym2 + pshufb ym6, ym6, ym1 + pmaddubsw ym6, ym0 + pmaddubsw ym7, ym0 + pmaddwd ym6, ym3 + pmaddwd ym7, ym3 + + packssdw ym6, ym7 + psubw ym6, ym4 + vpermq ym6, ym9, ym6 + movu [r2 + mmsize],ym6 +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_horiz_ps_48xN(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_CHROMA_PS_48xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_horiz_ps_48x%1, 4,7,10 + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastd m0, [r6 + r4 * 4] +%else + vpbroadcastd m0, [tab_ChromaCoeff + r4 * 4] +%endif + + vbroadcasti32x8 m1, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m2, [interp4_horiz_shuf_load2_avx512] + vbroadcasti32x8 m3, [pw_1] + vbroadcasti32x8 m4, [pw_2000] + mova m8, [interp4_hps_shuf_avx512] + mova m9, [interp4_hps_store_16xN_avx512] + + ; register map + ; m0 - interpolate coeff + ; m1,m2 - load shuffle order table + ; m3 - constant word 1 + ; m4 - constant word 2000 + ; m8 - store shuffle order table + + mov r6d, %1 + dec r0 + test r5d, r5d + je .loop + sub r0, r1 + add r6d, 3 + +.loop: + PROCESS_IPFILTER_CHROMA_PS_48x1_AVX512 + lea r2, [r2 + 2 * r3] + lea r0, [r0 + r1] + dec r6d + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 == 1 + IPFILTER_CHROMA_PS_48xN_AVX512 64 +%endif + +;------------------------------------------------------------------------------------------------------------- +;avx512 chroma_vpp and chroma_vps code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_CHROMA_VERT_16x4_AVX512 1 + lea r5, [r0 + 4 * r1] + movu xm1, [r0] + movu xm3, [r0 + r1] + vinserti32x4 m1, [r0 + r1], 1 + vinserti32x4 m3, [r0 + 2 * r1], 1 + vinserti32x4 m1, [r0 + 2 * r1], 2 + vinserti32x4 m3, [r0 + r6], 2 + vinserti32x4 m1, [r0 + r6], 3 + vinserti32x4 m3, [r0 + 4 * r1], 3 + + punpcklbw m0, m1, m3 + pmaddubsw m0, m8 + punpckhbw m1, m3 + pmaddubsw m1, m8 + + movu xm4, [r0 + 2 * r1] + movu xm5, [r0 + r6] + vinserti32x4 m4, [r0 + r6], 1 + vinserti32x4 
m5, [r5], 1 + vinserti32x4 m4, [r5], 2 + vinserti32x4 m5, [r5 + r1], 2 + vinserti32x4 m4, [r5 + r1], 3 + vinserti32x4 m5, [r5 + 2 * r1], 3 + + punpcklbw m3, m4, m5 + pmaddubsw m3, m9 + punpckhbw m4, m5 + pmaddubsw m4, m9 + + paddw m0, m3 + paddw m1, m4 +%ifidn %1,pp + pmulhrsw m0, m7 + pmulhrsw m1, m7 + packuswb m0, m1 + movu [r2], xm0 + vextracti32x4 [r2 + r3], m0, 1 + vextracti32x4 [r2 + 2 * r3], m0, 2 + vextracti32x4 [r2 + r7], m0, 3 +%else + psubw m0, m7 + psubw m1, m7 + mova m2, m10 + mova m3, m11 + + vpermi2q m2, m0, m1 + vpermi2q m3, m0, m1 + + movu [r2], ym2 + vextracti32x8 [r2 + r3], m2, 1 + movu [r2 + 2 * r3], ym3 + vextracti32x8 [r2 + r7], m3, 1 +%endif +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VERT_CHROMA_16xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_16x%2, 4, 10, 12 + mov r4d, r4m + shl r4d, 7 + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32_avx512] + mova m8, [r5 + r4] + mova m9, [r5 + r4 + mmsize] +%else + mova m8, [tab_ChromaCoeffVer_32_avx512 + r4] + mova m9, [tab_ChromaCoeffVer_32_avx512 + r4 + mmsize] +%endif + +%ifidn %1, pp + vbroadcasti32x8 m7, [pw_512] +%else + shl r3d, 1 + vbroadcasti32x8 m7, [pw_2000] + mova m10, [interp4_vps_store1_avx512] + mova m11, [interp4_vps_store2_avx512] +%endif + lea r6, [3 * r1] + lea r7, [3 * r3] + +%rep %2/4 - 1 + PROCESS_CHROMA_VERT_16x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_16x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VERT_CHROMA_16xN_AVX512 pp, 4 + FILTER_VERT_CHROMA_16xN_AVX512 pp, 8 + FILTER_VERT_CHROMA_16xN_AVX512 pp, 12 + FILTER_VERT_CHROMA_16xN_AVX512 pp, 16 + FILTER_VERT_CHROMA_16xN_AVX512 pp, 24 + FILTER_VERT_CHROMA_16xN_AVX512 pp, 32 + FILTER_VERT_CHROMA_16xN_AVX512 pp, 64 + + FILTER_VERT_CHROMA_16xN_AVX512 ps, 4 + FILTER_VERT_CHROMA_16xN_AVX512 ps, 8 + FILTER_VERT_CHROMA_16xN_AVX512 ps, 12 + FILTER_VERT_CHROMA_16xN_AVX512 ps, 16 + FILTER_VERT_CHROMA_16xN_AVX512 ps, 24 + FILTER_VERT_CHROMA_16xN_AVX512 ps, 32 + FILTER_VERT_CHROMA_16xN_AVX512 ps, 64 +%endif +%macro PROCESS_CHROMA_VERT_32x4_AVX512 1 + movu ym1, [r0] + movu ym3, [r0 + r1] + vinserti32x8 m1, [r0 + 2 * r1], 1 + vinserti32x8 m3, [r0 + r6], 1 + punpcklbw m0, m1, m3 + pmaddubsw m0, m8 + punpckhbw m1, m3 + pmaddubsw m1, m8 + + movu ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r0 + 4 * r1], 1 + punpcklbw m2, m3, m4 + pmaddubsw m2, m8 + punpckhbw m3, m4 + pmaddubsw m3, m8 + + lea r0, [r0 + 2 * r1] + + movu ym5, [r0 + r1] + vinserti32x8 m5, [r0 + r6], 1 + punpcklbw m6, m4, m5 + pmaddubsw m6, m9 + paddw m0, m6 + punpckhbw m4, m5 + pmaddubsw m4, m9 + paddw m1, m4 + + movu ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r0 + 4 * r1], 1 + punpcklbw m6, m5, m4 + pmaddubsw m6, m9 + paddw m2, m6 + punpckhbw m5, m4 + pmaddubsw m5, m9 + paddw m3, m5 + +%ifidn %1,pp + pmulhrsw m0, m7 + pmulhrsw m1, m7 + pmulhrsw m2, m7 + pmulhrsw m3, m7 + packuswb m0, m1 + packuswb m2, m3 + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r7], m2, 1 +%else + psubw m0, m7 + psubw m1, m7 + psubw m2, m7 + psubw m3, m7 + + mova m4, m10 + mova m5, m11 + vpermi2q m4, m0, m1 + vpermi2q m5, m0, m1 + mova m6, m10 + mova m12, m11 + vpermi2q m6, m2, m3 + vpermi2q m12, 
m2, m3 + + movu [r2], m4 + movu [r2 + r3], m6 + movu [r2 + 2 * r3], m5 + movu [r2 + r7], m12 +%endif +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VERT_CHROMA_32xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_32x%2, 4, 8, 13 + mov r4d, r4m + shl r4d, 7 + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32_avx512] + mova m8, [r5 + r4] + mova m9, [r5 + r4 + mmsize] +%else + mova m8, [tab_ChromaCoeffVer_32_avx512 + r4] + mova m9, [tab_ChromaCoeffVer_32_avx512 + r4 + mmsize] +%endif + +%ifidn %1,pp + vbroadcasti32x8 m7, [pw_512] +%else + shl r3d, 1 + vbroadcasti32x8 m7, [pw_2000] + mova m10, [interp4_vps_store1_avx512] + mova m11, [interp4_vps_store2_avx512] +%endif + + lea r6, [3 * r1] + lea r7, [3 * r3] + +%rep %2/4 - 1 + PROCESS_CHROMA_VERT_32x4_AVX512 %1 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_32x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VERT_CHROMA_32xN_AVX512 pp, 8 + FILTER_VERT_CHROMA_32xN_AVX512 pp, 16 + FILTER_VERT_CHROMA_32xN_AVX512 pp, 24 + FILTER_VERT_CHROMA_32xN_AVX512 pp, 32 + FILTER_VERT_CHROMA_32xN_AVX512 pp, 48 + FILTER_VERT_CHROMA_32xN_AVX512 pp, 64 + + FILTER_VERT_CHROMA_32xN_AVX512 ps, 8 + FILTER_VERT_CHROMA_32xN_AVX512 ps, 16 + FILTER_VERT_CHROMA_32xN_AVX512 ps, 24 + FILTER_VERT_CHROMA_32xN_AVX512 ps, 32 + FILTER_VERT_CHROMA_32xN_AVX512 ps, 48 + FILTER_VERT_CHROMA_32xN_AVX512 ps, 64 +%endif +%macro PROCESS_CHROMA_VERT_48x4_AVX512 1 + movu ym1, [r0] + movu ym3, [r0 + r1] + vinserti32x8 m1, [r0 + 2 * r1], 1 + vinserti32x8 m3, [r0 + r6], 1 + punpcklbw m0, m1, m3 + pmaddubsw m0, m8 + punpckhbw m1, m3 + pmaddubsw m1, m8 + + movu ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r0 + 4 * r1], 1 + punpcklbw m2, m3, m4 + pmaddubsw m2, m8 + punpckhbw m3, m4 + pmaddubsw m3, m8 + + lea r5, [r0 + 4 * r1] + + movu ym5, [r0 + r6] + vinserti32x8 m5, [r5 + r1], 1 + punpcklbw m6, m4, m5 + pmaddubsw m6, m9 + paddw m0, m6 + punpckhbw m4, m5 + pmaddubsw m4, m9 + paddw m1, m4 + + movu ym4, [r0 + 4 * r1] + vinserti32x8 m4, [r5 + 2 * r1], 1 + punpcklbw m6, m5, m4 + pmaddubsw m6, m9 + paddw m2, m6 + punpckhbw m5, m4 + pmaddubsw m5, m9 + paddw m3, m5 +%ifidn %1, pp + pmulhrsw m0, m7 + pmulhrsw m1, m7 + pmulhrsw m2, m7 + pmulhrsw m3, m7 + + packuswb m0, m1 + packuswb m2, m3 + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r7], m2, 1 +%else + psubw m0, m7 + psubw m1, m7 + psubw m2, m7 + psubw m3, m7 + + mova m4, m10 + mova m5, m11 + vpermi2q m4, m0, m1 + vpermi2q m5, m0, m1 + mova m6, m10 + mova m12, m11 + vpermi2q m6, m2, m3 + vpermi2q m12, m2, m3 + + movu [r2], m4 + movu [r2 + r3], m6 + movu [r2 + 2 * r3], m5 + movu [r2 + r7], m12 +%endif + movu xm1, [r0 + mmsize/2] + movu xm3, [r0 + r1 + mmsize/2] + vinserti32x4 m1, [r0 + r1 + mmsize/2], 1 + vinserti32x4 m3, [r0 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m1, [r0 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m3, [r0 + r6 + mmsize/2], 2 + vinserti32x4 m1, [r0 + r6 + mmsize/2], 3 + vinserti32x4 m3, [r0 + 4 * r1 + mmsize/2], 3 + + punpcklbw m0, m1, m3 + pmaddubsw m0, m8 + punpckhbw m1, m3 + pmaddubsw m1, m8 + + movu xm4, [r0 + 2 * r1 + mmsize/2] + movu xm5, [r0 + r6 + mmsize/2] + vinserti32x4 m4, [r0 + r6 + mmsize/2], 1 + vinserti32x4 m5, 
[r5 + mmsize/2], 1 + vinserti32x4 m4, [r5 + mmsize/2], 2 + vinserti32x4 m5, [r5 + r1 + mmsize/2], 2 + vinserti32x4 m4, [r5 + r1 + mmsize/2], 3 + vinserti32x4 m5, [r5 + 2 * r1 + mmsize/2], 3 + + punpcklbw m3, m4, m5 + pmaddubsw m3, m9 + punpckhbw m4, m5 + pmaddubsw m4, m9 + paddw m0, m3 + paddw m1, m4 +%ifidn %1, pp + pmulhrsw m0, m7 + pmulhrsw m1, m7 + packuswb m0, m1 + movu [r2 + mmsize/2], xm0 + vextracti32x4 [r2 + r3 + mmsize/2], m0, 1 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 2 + vextracti32x4 [r2 + r7 + mmsize/2], m0, 3 +%else + psubw m0, m7 + psubw m1, m7 + mova m2, m10 + mova m3, m11 + + vpermi2q m2, m0, m1 + vpermi2q m3, m0, m1 + + movu [r2 + mmsize], ym2 + vextracti32x8 [r2 + r3 + mmsize], m2, 1 + movu [r2 + 2 * r3 + mmsize], ym3 + vextracti32x8 [r2 + r7 + mmsize], m3, 1 +%endif +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VERT_CHROMA_48x64_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_48x64, 4, 8, 13 + mov r4d, r4m + shl r4d, 7 + sub r0, r1 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32_avx512] + mova m8, [r5 + r4] + mova m9, [r5 + r4 + mmsize] +%else + mova m8, [tab_ChromaCoeffVer_32_avx512 + r4] + mova m9, [tab_ChromaCoeffVer_32_avx512 + r4 + mmsize] +%endif + +%ifidn %1, pp + vbroadcasti32x8 m7, [pw_512] +%else + shl r3d, 1 + vbroadcasti32x8 m7, [pw_2000] + mova m10, [interp4_vps_store1_avx512] + mova m11, [interp4_vps_store2_avx512] +%endif + + lea r6, [3 * r1] + lea r7, [3 * r3] +%rep 15 + PROCESS_CHROMA_VERT_48x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_48x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VERT_CHROMA_48x64_AVX512 pp + FILTER_VERT_CHROMA_48x64_AVX512 ps +%endif +%macro PROCESS_CHROMA_VERT_64x4_AVX512 1 + movu m0, [r0] ; m0 = row 0 + movu m1, [r0 + r1] ; m1 = row 1 + punpcklbw m2, m0, m1 + punpckhbw m3, m0, m1 + pmaddubsw m2, m10 + pmaddubsw m3, m10 + movu m0, [r0 + r1 * 2] ; m0 = row 2 + punpcklbw m4, m1, m0 + punpckhbw m5, m1, m0 + pmaddubsw m4, m10 + pmaddubsw m5, m10 + movu m1, [r0 + r4] ; m1 = row 3 + punpcklbw m6, m0, m1 + punpckhbw m7, m0, m1 + pmaddubsw m8, m6, m11 + pmaddubsw m9, m7, m11 + pmaddubsw m6, m10 + pmaddubsw m7, m10 + paddw m2, m8 + paddw m3, m9 + +%ifidn %1,pp + pmulhrsw m2, m12 + pmulhrsw m3, m12 + packuswb m2, m3 + movu [r2], m2 +%else + psubw m2, m12 + psubw m3, m12 + movu m8, m13 + movu m9, m14 + vpermi2q m8, m2, m3 + vpermi2q m9, m2, m3 + movu [r2], m8 + movu [r2 + mmsize], m9 +%endif + + lea r0, [r0 + r1 * 4] + movu m0, [r0] ; m0 = row 4 + punpcklbw m2, m1, m0 + punpckhbw m3, m1, m0 + pmaddubsw m8, m2, m11 + pmaddubsw m9, m3, m11 + pmaddubsw m2, m10 + pmaddubsw m3, m10 + paddw m4, m8 + paddw m5, m9 + +%ifidn %1,pp + pmulhrsw m4, m12 + pmulhrsw m5, m12 + packuswb m4, m5 + movu [r2 + r3], m4 +%else + psubw m4, m12 + psubw m5, m12 + movu m8, m13 + movu m9, m14 + vpermi2q m8, m4, m5 + vpermi2q m9, m4, m5 + movu [r2 + r3], m8 + movu [r2 + r3 + mmsize], m9 +%endif + + movu m1, [r0 + r1] ; m1 = row 5 + punpcklbw m4, m0, m1 + punpckhbw m5, m0, m1 + pmaddubsw m4, m11 + pmaddubsw m5, m11 + paddw m6, m4 + paddw m7, m5 + +%ifidn %1,pp + pmulhrsw m6, m12 + pmulhrsw m7, m12 + packuswb m6, m7 + movu [r2 + r3 * 2], m6 +%else + psubw m6, m12 + psubw m7, m12 + movu m8, 
m13 + movu m9, m14 + vpermi2q m8, m6, m7 + vpermi2q m9, m6, m7 + movu [r2 + 2 * r3], m8 + movu [r2 + 2 * r3 + mmsize], m9 +%endif + movu m0, [r0 + r1 * 2] ; m0 = row 6 + punpcklbw m6, m1, m0 + punpckhbw m7, m1, m0 + pmaddubsw m6, m11 + pmaddubsw m7, m11 + paddw m2, m6 + paddw m3, m7 + +%ifidn %1,pp + pmulhrsw m2, m12 + pmulhrsw m3, m12 + packuswb m2, m3 + movu [r2 + r5], m2 +%else + psubw m2, m12 + psubw m3, m12 + movu m8, m13 + movu m9, m14 + vpermi2q m8, m2, m3 + vpermi2q m9, m2, m3 + movu [r2 + r5], m8 + movu [r2 + r5 + mmsize], m9 +%endif +%endmacro + +%macro FILTER_VER_CHROMA_AVX512_64xN 2 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_64x%2, 4, 6, 15 + mov r4d, r4m + shl r4d, 7 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer_32_avx512] + mova m10, [r5 + r4] + mova m11, [r5 + r4 + mmsize] +%else + mova m10, [tab_ChromaCoeffVer_32_avx512 + r4] + mova m11, [tab_ChromaCoeffVer_32_avx512 + r4 + mmsize] +%endif + +%ifidn %1,pp + vbroadcasti32x8 m12, [pw_512] +%else + shl r3d, 1 + vbroadcasti32x8 m12, [pw_2000] + mova m13, [interp4_vps_store1_avx512] + mova m14, [interp4_vps_store2_avx512] +%endif + lea r4, [r1 * 3] + sub r0, r1 + lea r5, [r3 * 3] + +%rep %2/4 - 1 + PROCESS_CHROMA_VERT_64x4_AVX512 %1 + lea r2, [r2 + r3 * 4] +%endrep + PROCESS_CHROMA_VERT_64x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 == 1 +FILTER_VER_CHROMA_AVX512_64xN pp, 64 +FILTER_VER_CHROMA_AVX512_64xN pp, 48 +FILTER_VER_CHROMA_AVX512_64xN pp, 32 +FILTER_VER_CHROMA_AVX512_64xN pp, 16 + +FILTER_VER_CHROMA_AVX512_64xN ps, 64 +FILTER_VER_CHROMA_AVX512_64xN ps, 48 +FILTER_VER_CHROMA_AVX512_64xN ps, 32 +FILTER_VER_CHROMA_AVX512_64xN ps, 16 +%endif +;------------------------------------------------------------------------------------------------------------- +;avx512 chroma_vpp and chroma_vps code end +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +;avx512 chroma_vss code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_CHROMA_VERT_SS_8x4_AVX512 0 + lea r5, [r0 + 4 * r1] + movu xm1, [r0] + movu xm3, [r0 + r1] + vinserti32x4 m1, [r0 + r1], 1 + vinserti32x4 m3, [r0 + 2 * r1], 1 + vinserti32x4 m1, [r0 + 2 * r1], 2 + vinserti32x4 m3, [r0 + r6], 2 + vinserti32x4 m1, [r0 + r6], 3 + vinserti32x4 m3, [r0 + 4 * r1], 3 + + punpcklwd m0, m1, m3 + pmaddwd m0, m8 + punpckhwd m1, m3 + pmaddwd m1, m8 + + movu xm4, [r0 + 2 * r1] + movu xm5, [r0 + r6] + vinserti32x4 m4, [r0 + r6], 1 + vinserti32x4 m5, [r5], 1 + vinserti32x4 m4, [r5], 2 + vinserti32x4 m5, [r5 + r1], 2 + vinserti32x4 m4, [r5 + r1], 3 + vinserti32x4 m5, [r5 + 2 * r1], 3 + + punpcklwd m3, m4, m5 + pmaddwd m3, m9 + punpckhwd m4, m5 + pmaddwd m4, m9 + + paddd m0, m3 + paddd m1, m4 + + psrad m0, 6 + psrad m1, 6 + packssdw m0, m1 + movu [r2], xm0 + vextracti32x4 [r2 + r3], m0, 1 + vextracti32x4 [r2 + 2 * r3], m0, 2 + vextracti32x4 [r2 + r7], m0, 3 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_SS_CHROMA_8xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_ss_8x%1, 5, 8, 10 + add r1d, r1d + add r3d, r3d + 
sub r0, r1 + shl r4d, 7 +%ifdef PIC + lea r5, [pw_ChromaCoeffVer_32_avx512] + mova m8, [r5 + r4] + mova m9, [r5 + r4 + mmsize] +%else + lea r5, [pw_ChromaCoeffVer_32_avx512 + r4] + mova m8, [r5] + mova m9, [r5 + mmsize] +%endif + lea r6, [3 * r1] + lea r7, [3 * r3] + +%rep %1/4 - 1 + PROCESS_CHROMA_VERT_SS_8x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_SS_8x4_AVX512 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_SS_CHROMA_8xN_AVX512 4 + FILTER_VER_SS_CHROMA_8xN_AVX512 8 + FILTER_VER_SS_CHROMA_8xN_AVX512 12 + FILTER_VER_SS_CHROMA_8xN_AVX512 16 + FILTER_VER_SS_CHROMA_8xN_AVX512 32 + FILTER_VER_SS_CHROMA_8xN_AVX512 64 +%endif + +%macro PROCESS_CHROMA_VERT_S_16x4_AVX512 1 + movu ym1, [r0] + lea r6, [r0 + 2 * r1] + vinserti32x8 m1, [r6], 1 + movu ym3, [r0 + r1] + vinserti32x8 m3, [r6 + r1], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m7 + punpckhwd m1, m3 + pmaddwd m1, m7 + + movu ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r6 + 2 * r1], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m7 + punpckhwd m3, m4 + pmaddwd m3, m7 + + movu ym5, [r0 + r4] + vinserti32x8 m5, [r6 + r4], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m8 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m8 + paddd m1, m4 + + movu ym4, [r0 + 4 * r1] + vinserti32x8 m4, [r6 + 4 * r1], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m8 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m8 + paddd m3, m5 + +%ifidn %1, sp + paddd m0, m9 + paddd m1, m9 + paddd m2, m9 + paddd m3, m9 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + packuswb m0, m2 + vpermq m0, m10, m0 + movu [r2], xm0 + vextracti32x4 [r2 + r3], m0, 2 + vextracti32x4 [r2 + 2 * r3], m0, 1 + vextracti32x4 [r2 + r5], m0, 3 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + packssdw m0, m1 + packssdw m2, m3 + + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r5], m2, 1 +%endif +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_CHROMA_16xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_16x%2, 4, 7, 11 + mov r4d, r4m + shl r4d, 7 + +%ifdef PIC + lea r5, [pw_ChromaCoeffVer_32_avx512] + mova m7, [r5 + r4] + mova m8, [r5 + r4 + mmsize] +%else + mova m7, [pw_ChromaCoeffVer_32_avx512 + r4] + mova m8, [pw_ChromaCoeffVer_32_avx512 + r4 + mmsize] +%endif + +%ifidn %1, sp + vbroadcasti32x4 m9, [pd_526336] + mova m10, [interp8_vsp_store_avx512] +%else + add r3d, r3d +%endif + add r1d, r1d + sub r0, r1 + lea r4, [r1 * 3] + lea r5, [r3 * 3] + +%rep %2/4 - 1 + PROCESS_CHROMA_VERT_S_16x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_S_16x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 4 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 8 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 12 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 16 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 24 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 32 + FILTER_VER_S_CHROMA_16xN_AVX512 ss, 64 + FILTER_VER_S_CHROMA_16xN_AVX512 sp, 4 + FILTER_VER_S_CHROMA_16xN_AVX512 sp, 8 + FILTER_VER_S_CHROMA_16xN_AVX512 sp, 12 + FILTER_VER_S_CHROMA_16xN_AVX512 sp, 16 + FILTER_VER_S_CHROMA_16xN_AVX512 sp, 24 + FILTER_VER_S_CHROMA_16xN_AVX512 sp, 32 
+ FILTER_VER_S_CHROMA_16xN_AVX512 sp, 64 +%endif + +%macro PROCESS_CHROMA_VERT_SS_24x8_AVX512 0 + movu ym1, [r0] + lea r6, [r0 + 2 * r1] + lea r8, [r0 + 4 * r1] + lea r9, [r8 + 2 * r1] + + movu ym10, [r8] + movu ym3, [r0 + r1] + movu ym12, [r8 + r1] + vinserti32x8 m1, [r6], 1 + vinserti32x8 m10, [r9], 1 + vinserti32x8 m3, [r6 + r1], 1 + vinserti32x8 m12, [r9 + r1], 1 + + punpcklwd m0, m1, m3 + punpcklwd m9, m10, m12 + pmaddwd m0, m16 + pmaddwd m9, m16 + punpckhwd m1, m3 + punpckhwd m10, m12 + pmaddwd m1, m16 + pmaddwd m10, m16 + + movu ym4, [r0 + 2 * r1] + movu ym13, [r8 + 2 * r1] + vinserti32x8 m4, [r6 + 2 * r1], 1 + vinserti32x8 m13, [r9 + 2 * r1], 1 + punpcklwd m2, m3, m4 + punpcklwd m11, m12, m13 + pmaddwd m2, m16 + pmaddwd m11, m16 + punpckhwd m3, m4 + punpckhwd m12, m13 + pmaddwd m3, m16 + pmaddwd m12, m16 + + movu ym5, [r0 + r10] + vinserti32x8 m5, [r6 + r10], 1 + movu ym14, [r8 + r10] + vinserti32x8 m14, [r9 + r10], 1 + punpcklwd m6, m4, m5 + punpcklwd m15, m13, m14 + pmaddwd m6, m17 + pmaddwd m15, m17 + paddd m0, m6 + paddd m9, m15 + punpckhwd m4, m5 + punpckhwd m13, m14 + pmaddwd m4, m17 + pmaddwd m13, m17 + paddd m1, m4 + paddd m10, m13 + + movu ym4, [r0 + 4 * r1] + vinserti32x8 m4, [r6 + 4 * r1], 1 + movu ym13, [r8 + 4 * r1] + vinserti32x8 m13, [r9 + 4 * r1], 1 + punpcklwd m6, m5, m4 + punpcklwd m15, m14, m13 + pmaddwd m6, m17 + pmaddwd m15, m17 + paddd m2, m6 + paddd m11, m15 + punpckhwd m5, m4 + punpckhwd m14, m13 + pmaddwd m5, m17 + pmaddwd m14, m17 + paddd m3, m5 + paddd m12, m14 + + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + psrad m12, 6 + + packssdw m0, m1 + packssdw m2, m3 + packssdw m9, m10 + packssdw m11, m12 + + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r7], m2, 1 + lea r11, [r2 + 4 * r3] + movu [r11], ym9 + movu [r11 + r3], ym11 + vextracti32x8 [r11 + 2 * r3], m9, 1 + vextracti32x8 [r11 + r7], m11, 1 + + movu xm1, [r0 + mmsize/2] + vinserti32x4 m1, [r6 + mmsize/2], 1 + vinserti32x4 m1, [r8 + mmsize/2], 2 + vinserti32x4 m1, [r9 + mmsize/2], 3 + movu xm3, [r0 + r1 + mmsize/2] + vinserti32x4 m3, [r6 + r1 + mmsize/2], 1 + vinserti32x4 m3, [r8 + r1 + mmsize/2], 2 + vinserti32x4 m3, [r9 + r1 + mmsize/2], 3 + punpcklwd m0, m1, m3 + pmaddwd m0, m16 + punpckhwd m1, m3 + pmaddwd m1, m16 + + movu xm4, [r0 + 2 * r1 + mmsize/2] + vinserti32x4 m4, [r6 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r8 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r9 + 2 * r1 + mmsize/2], 3 + punpcklwd m2, m3, m4 + pmaddwd m2, m16 + punpckhwd m3, m4 + pmaddwd m3, m16 + + movu xm5, [r0 + r10 + mmsize/2] + vinserti32x4 m5, [r6 + r10 + mmsize/2], 1 + vinserti32x4 m5, [r8 + r10 + mmsize/2], 2 + vinserti32x4 m5, [r9 + r10 + mmsize/2], 3 + punpcklwd m6, m4, m5 + pmaddwd m6, m17 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m17 + paddd m1, m4 + + movu xm4, [r0 + 4 * r1 + mmsize/2] + vinserti32x4 m4, [r6 + 4 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r8 + 4 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r9 + 4 * r1 + mmsize/2], 3 + punpcklwd m6, m5, m4 + pmaddwd m6, m17 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m17 + paddd m3, m5 + + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 + + movu [r2 + mmsize/2], xm0 + movu [r2 + r3 + mmsize/2], xm2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 1 + vextracti32x4 [r2 + r7 + mmsize/2], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2 + mmsize/2], m0, 2 + vextracti32x4 [r2 + r3 + mmsize/2], m2, 2 + vextracti32x4 [r2 + 2 
* r3 + mmsize/2], m0, 3 + vextracti32x4 [r2 + r7 + mmsize/2], m2, 3 +%endmacro + +%macro FILTER_VER_SS_CHROMA_24xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_ss_24x%1, 5, 12, 18 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 7 +%ifdef PIC + lea r5, [pw_ChromaCoeffVer_32_avx512] + mova m16, [r5 + r4] + mova m17, [r5 + r4 + mmsize] +%else + lea r5, [pw_ChromaCoeffVer_32_avx512 + r4] + mova m16, [r5] + mova m17, [r5 + mmsize] +%endif + lea r10, [3 * r1] + lea r7, [3 * r3] +%rep %1/8 - 1 + PROCESS_CHROMA_VERT_SS_24x8_AVX512 + lea r0, [r8 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_SS_24x8_AVX512 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_SS_CHROMA_24xN_AVX512 32 + FILTER_VER_SS_CHROMA_24xN_AVX512 64 +%endif +%macro PROCESS_CHROMA_VERT_S_32x2_AVX512 1 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, m7 + punpckhwd m1, m3 + pmaddwd m1, m7 + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, m7 + punpckhwd m3, m4 + pmaddwd m3, m7 + movu m5, [r0 + r4] + punpcklwd m6, m4, m5 + pmaddwd m6, m8 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m8 + paddd m1, m4 + movu m4, [r0 + 4 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, m8 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m8 + paddd m3, m5 +%ifidn %1, sp + paddd m0, m9 + paddd m1, m9 + paddd m2, m9 + paddd m3, m9 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + packuswb m0, m2 + vpermq m0, m10, m0 + movu [r2], ym0 + vextracti32x8 [r2 + r3], m0, 1 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + packssdw m0, m1 + packssdw m2, m3 + movu [r2], m0 + movu [r2 + r3], m2 +%endif +%endmacro + +%macro FILTER_VER_S_CHROMA_32xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_32x%2, 4, 6, 11 + mov r4d, r4m + shl r4d, 7 +%ifdef PIC + lea r5, [pw_ChromaCoeffVer_32_avx512] + mova m7, [r5 + r4] + mova m8, [r5 + r4 + mmsize] +%else + mova m7, [pw_ChromaCoeffVer_32_avx512 + r4] + mova m8, [pw_ChromaCoeffVer_32_avx512 + r4 + mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m9, [pd_526336] + mova m10, [interp8_vsp_store_avx512] +%else + add r3d, r3d +%endif + add r1d, r1d + sub r0, r1 + lea r4, [r1 * 3] + lea r5, [r3 * 3] +%rep %2/2 - 1 + PROCESS_CHROMA_VERT_S_32x2_AVX512 %1 + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] +%endrep + PROCESS_CHROMA_VERT_S_32x2_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_CHROMA_32xN_AVX512 ss, 8 + FILTER_VER_S_CHROMA_32xN_AVX512 ss, 16 + FILTER_VER_S_CHROMA_32xN_AVX512 ss, 24 + FILTER_VER_S_CHROMA_32xN_AVX512 ss, 32 + FILTER_VER_S_CHROMA_32xN_AVX512 ss, 48 + FILTER_VER_S_CHROMA_32xN_AVX512 ss, 64 + FILTER_VER_S_CHROMA_32xN_AVX512 sp, 8 + FILTER_VER_S_CHROMA_32xN_AVX512 sp, 16 + FILTER_VER_S_CHROMA_32xN_AVX512 sp, 24 + FILTER_VER_S_CHROMA_32xN_AVX512 sp, 32 + FILTER_VER_S_CHROMA_32xN_AVX512 sp, 48 + FILTER_VER_S_CHROMA_32xN_AVX512 sp, 64 +%endif + +%macro PROCESS_CHROMA_VERT_S_48x4_AVX512 1 + PROCESS_CHROMA_VERT_S_32x2_AVX512 %1 + lea r6, [r0 + 2 * r1] + + movu m1, [r6] + movu m3, [r6 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, m7 + punpckhwd m1, m3 + pmaddwd m1, m7 + movu m4, [r6 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, m7 + punpckhwd m3, m4 + pmaddwd m3, m7 + + movu m5, [r6 + r4] + punpcklwd m6, m4, m5 + pmaddwd m6, m8 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m8 + paddd m1, m4 + + movu m4, [r6 + 4 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, m8 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m8 + paddd m3, m5 + +%ifidn %1, sp + paddd m0, m9 + paddd m1, m9 + paddd 
m2, m9 + paddd m3, m9 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + packuswb m0, m2 + vpermq m0, m10, m0 + movu [r2 + 2 * r3], ym0 + vextracti32x8 [r2 + r5], m0, 1 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 + movu [r2 + 2 * r3], m0 + movu [r2 + r5], m2 +%endif + + movu ym1, [r0 + mmsize] + vinserti32x8 m1, [r6 + mmsize], 1 + movu ym3, [r0 + r1 + mmsize] + vinserti32x8 m3, [r6 + r1 + mmsize], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m7 + punpckhwd m1, m3 + pmaddwd m1, m7 + + movu ym4, [r0 + 2 * r1 + mmsize] + vinserti32x8 m4, [r6 + 2 * r1 + mmsize], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m7 + punpckhwd m3, m4 + pmaddwd m3, m7 + + movu ym5, [r0 + r4 + mmsize] + vinserti32x8 m5, [r6 + r4 + mmsize], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m8 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m8 + paddd m1, m4 + + movu ym4, [r0 + 4 * r1 + mmsize] + vinserti32x8 m4, [r6 + 4 * r1 + mmsize], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m8 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m8 + paddd m3, m5 + +%ifidn %1, sp + paddd m0, m9 + paddd m1, m9 + paddd m2, m9 + paddd m3, m9 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + packuswb m0, m2 + vpermq m0, m10, m0 + movu [r2 + mmsize/2], xm0 + vextracti32x4 [r2 + r3 + mmsize/2], m0, 2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 1 + vextracti32x4 [r2 + r5 + mmsize/2], m0, 3 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + packssdw m0, m1 + packssdw m2, m3 + + movu [r2 + mmsize], ym0 + movu [r2 + r3 + mmsize], ym2 + vextracti32x8 [r2 + 2 * r3 + mmsize], m0, 1 + vextracti32x8 [r2 + r5 + mmsize], m2, 1 +%endif +%endmacro + +%macro FILTER_VER_S_CHROMA_48x64_AVX512 1 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_48x64, 4, 7, 11 + mov r4d, r4m + shl r4d, 7 + +%ifdef PIC + lea r5, [pw_ChromaCoeffVer_32_avx512] + mova m7, [r5 + r4] + mova m8, [r5 + r4 + mmsize] +%else + mova m7, [pw_ChromaCoeffVer_32_avx512 + r4] + mova m8, [pw_ChromaCoeffVer_32_avx512 + r4 + mmsize] +%endif + +%ifidn %1, sp + vbroadcasti32x4 m9, [pd_526336] + mova m10, [interp8_vsp_store_avx512] +%else + add r3d, r3d +%endif + add r1d, r1d + sub r0, r1 + lea r4, [r1 * 3] + lea r5, [r3 * 3] + +%rep 15 + PROCESS_CHROMA_VERT_S_48x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_CHROMA_VERT_S_48x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_CHROMA_48x64_AVX512 ss + FILTER_VER_S_CHROMA_48x64_AVX512 sp +%endif + +%macro PROCESS_CHROMA_VERT_S_64x2_AVX512 1 + PROCESS_CHROMA_VERT_S_32x2_AVX512 %1 + movu m1, [r0 + mmsize] + movu m3, [r0 + r1 + mmsize] + punpcklwd m0, m1, m3 + pmaddwd m0, m7 + punpckhwd m1, m3 + pmaddwd m1, m7 + movu m4, [r0 + 2 * r1 + mmsize] + punpcklwd m2, m3, m4 + pmaddwd m2, m7 + punpckhwd m3, m4 + pmaddwd m3, m7 + + movu m5, [r0 + r4 + mmsize] + punpcklwd m6, m4, m5 + pmaddwd m6, m8 + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, m8 + paddd m1, m4 + + movu m4, [r0 + 4 * r1 + mmsize] + punpcklwd m6, m5, m4 + pmaddwd m6, m8 + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, m8 + paddd m3, m5 + +%ifidn %1, sp + paddd m0, m9 + paddd m1, m9 + paddd m2, m9 + paddd m3, m9 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + packuswb m0, m2 + vpermq m0, m10, m0 + movu [r2 + mmsize/2], ym0 + vextracti32x8 [r2 + r3 + mmsize/2], m0, 1 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw 
m2, m3 + movu [r2 + mmsize], m0 + movu [r2 + r3 + mmsize], m2 +%endif +%endmacro + +%macro FILTER_VER_S_CHROMA_64xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_4tap_vert_%1_64x%2, 4, 6, 11 + mov r4d, r4m + shl r4d, 7 +%ifdef PIC + lea r5, [pw_ChromaCoeffVer_32_avx512] + mova m7, [r5 + r4] + mova m8, [r5 + r4 + mmsize] +%else + mova m7, [pw_ChromaCoeffVer_32_avx512 + r4] + mova m8, [pw_ChromaCoeffVer_32_avx512 + r4 + mmsize] +%endif + +%ifidn %1, sp + vbroadcasti32x4 m9, [pd_526336] + mova m10, [interp8_vsp_store_avx512] +%else + add r3d, r3d +%endif + add r1d, r1d + sub r0, r1 + lea r4, [r1 * 3] + lea r5, [r3 * 3] + +%rep %2/2 - 1 + PROCESS_CHROMA_VERT_S_64x2_AVX512 %1 + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] +%endrep + PROCESS_CHROMA_VERT_S_64x2_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_CHROMA_64xN_AVX512 ss, 16 + FILTER_VER_S_CHROMA_64xN_AVX512 ss, 32 + FILTER_VER_S_CHROMA_64xN_AVX512 ss, 48 + FILTER_VER_S_CHROMA_64xN_AVX512 ss, 64 + FILTER_VER_S_CHROMA_64xN_AVX512 sp, 16 + FILTER_VER_S_CHROMA_64xN_AVX512 sp, 32 + FILTER_VER_S_CHROMA_64xN_AVX512 sp, 48 + FILTER_VER_S_CHROMA_64xN_AVX512 sp, 64 +%endif +;------------------------------------------------------------------------------------------------------------- +;avx512 chroma_vss code end +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +;ipfilter_chroma_avx512 code end +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +;ipfilter_luma_avx512 code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_IPFILTER_LUMA_PP_64x1_AVX512 0 + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m3, m4 shuffle order table + ; m5 - pw_1 + ; m6 - pw_512 + + movu m7, [r0] + movu m9, [r0 + 8] + + pshufb m8, m7, m3 + pshufb m7, m2 + pshufb m10, m9, m3 + pshufb m11, m9, m4 + pshufb m9, m2 + + + pmaddubsw m7, m0 + pmaddubsw m12, m8, m1 + pmaddwd m7, m5 + pmaddwd m12, m5 + paddd m7, m12 + + pmaddubsw m8, m0 + pmaddubsw m12, m9, m1 + pmaddwd m8, m5 + pmaddwd m12, m5 + paddd m8, m12 + + pmaddubsw m9, m0 + pmaddubsw m12, m10, m1 + pmaddwd m9, m5 + pmaddwd m12, m5 + paddd m9, m12 + + pmaddubsw m10, m0 + pmaddubsw m12, m11, m1 + pmaddwd m10, m5 + pmaddwd m12, m5 + paddd m10, m12 + + packssdw m7, m8 + packssdw m9, m10 + pmulhrsw m7, m6 + pmulhrsw m9, m6 + packuswb m7, m9 + movu [r2], m7 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PP_32x2_AVX512 0 + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m3, m4 shuffle order table + ; m5 - pw_1 + ; m6 - pw_512 + + movu ym7, [r0] + vinserti32x8 m7, [r0 + r1], 1 + movu ym9, [r0 + 8] + vinserti32x8 m9, [r0 + r1 + 8], 1 + + pshufb m8, m7, m3 + pshufb m7, m2 + pshufb m10, m9, m3 + pshufb m11, m9, m4 + pshufb m9, m2 + + pmaddubsw m7, m0 + pmaddubsw m12, m8, m1 + pmaddwd m7, m5 + pmaddwd m12, m5 + paddd m7, m12 + + pmaddubsw m8, m0 + pmaddubsw m12, m9, m1 + pmaddwd m8, m5 + pmaddwd m12, m5 + paddd m8, m12 + + pmaddubsw m9, m0 + pmaddubsw m12, m10, m1 + pmaddwd m9, m5 + pmaddwd m12, m5 + paddd m9, m12 + + pmaddubsw m10, m0 + pmaddubsw m12, m11, m1 + pmaddwd m10, m5 + pmaddwd m12, m5 + paddd m10, m12 + + packssdw m7, m8 + packssdw m9, m10 + pmulhrsw m7, m6 + pmulhrsw m9, m6 + 
packuswb m7, m9 + movu [r2], ym7 + vextracti32x8 [r2 + r3], m7, 1 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PP_16x4_AVX512 0 + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m3, m4 shuffle order table + ; m5 - pw_1 + ; m6 - pw_512 + + movu xm7, [r0] + vinserti32x4 m7, [r0 + r1], 1 + vinserti32x4 m7, [r0 + 2 * r1], 2 + vinserti32x4 m7, [r0 + r6], 3 + + pshufb m8, m7, m3 + pshufb m7, m2 + + movu xm9, [r0 + 8] + vinserti32x4 m9, [r0 + r1 + 8], 1 + vinserti32x4 m9, [r0 + 2 * r1 + 8], 2 + vinserti32x4 m9, [r0 + r6 + 8], 3 + + pshufb m10, m9, m3 + pshufb m11, m9, m4 + pshufb m9, m2 + + pmaddubsw m7, m0 + pmaddubsw m12, m8, m1 + pmaddwd m7, m5 + pmaddwd m12, m5 + paddd m7, m12 + + pmaddubsw m8, m0 + pmaddubsw m12, m9, m1 + pmaddwd m8, m5 + pmaddwd m12, m5 + paddd m8, m12 + + pmaddubsw m9, m0 + pmaddubsw m12, m10, m1 + pmaddwd m9, m5 + pmaddwd m12, m5 + paddd m9, m12 + + pmaddubsw m10, m0 + pmaddubsw m12, m11, m1 + pmaddwd m10, m5 + pmaddwd m12, m5 + paddd m10, m12 + + packssdw m7, m8 + packssdw m9, m10 + pmulhrsw m7, m6 + pmulhrsw m9, m6 + packuswb m7, m9 + movu [r2], xm7 + vextracti32x4 [r2 + r3], m7, 1 + vextracti32x4 [r2 + 2 * r3], m7, 2 + vextracti32x4 [r2 + r7], m7, 3 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PP_48x4_AVX512 0 + ; register map + ; m0 , m1 interpolate coeff + ; m2 , m3, m4 shuffle order table + ; m5 - pw_1 + ; m6 - pw_512 + + movu ym7, [r0] + vinserti32x8 m7, [r0 + r1], 1 + movu ym9, [r0 + 8] + vinserti32x8 m9, [r0 + r1 + 8], 1 + + pshufb m8, m7, m3 + pshufb m7, m2 + pshufb m10, m9, m3 + pshufb m11, m9, m4 + pshufb m9, m2 + + pmaddubsw m7, m0 + pmaddubsw m12, m8, m1 + pmaddwd m7, m5 + pmaddwd m12, m5 + paddd m7, m12 + + pmaddubsw m8, m0 + pmaddubsw m12, m9, m1 + pmaddwd m8, m5 + pmaddwd m12, m5 + paddd m8, m12 + + pmaddubsw m9, m0 + pmaddubsw m12, m10, m1 + pmaddwd m9, m5 + pmaddwd m12, m5 + paddd m9, m12 + + pmaddubsw m10, m0 + pmaddubsw m12, m11, m1 + pmaddwd m10, m5 + pmaddwd m12, m5 + paddd m10, m12 + + packssdw m7, m8 + packssdw m9, m10 + pmulhrsw m7, m6 + pmulhrsw m9, m6 + packuswb m7, m9 + movu [r2], ym7 + vextracti32x8 [r2 + r3], m7, 1 + + movu ym7, [r0 + 2 * r1] + vinserti32x8 m7, [r0 + r6], 1 + movu ym9, [r0 + 2 * r1 + 8] + vinserti32x8 m9, [r0 + r6 + 8], 1 + + pshufb m8, m7, m3 + pshufb m7, m2 + pshufb m10, m9, m3 + pshufb m11, m9, m4 + pshufb m9, m2 + + pmaddubsw m7, m0 + pmaddubsw m12, m8, m1 + pmaddwd m7, m5 + pmaddwd m12, m5 + paddd m7, m12 + + pmaddubsw m8, m0 + pmaddubsw m12, m9, m1 + pmaddwd m8, m5 + pmaddwd m12, m5 + paddd m8, m12 + + pmaddubsw m9, m0 + pmaddubsw m12, m10, m1 + pmaddwd m9, m5 + pmaddwd m12, m5 + paddd m9, m12 + + pmaddubsw m10, m0 + pmaddubsw m12, m11, m1 + pmaddwd m10, m5 + pmaddwd m12, m5 + paddd m10, m12 + + packssdw m7, m8 + packssdw m9, m10 + pmulhrsw m7, m6 + pmulhrsw m9, m6 + packuswb m7, m9 + movu [r2 + 2 * r3], ym7 + vextracti32x8 [r2 + r7], m7, 1 + + movu xm7, [r0 + mmsize/2] + vinserti32x4 m7, [r0 + r1 + mmsize/2], 1 + vinserti32x4 m7, [r0 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m7, [r0 + r6 + mmsize/2], 3 + + pshufb m8, m7, m3 + pshufb m7, m2 + + movu xm9, [r0 + 40] + vinserti32x4 m9, [r0 + r1 + 40], 1 + vinserti32x4 m9, [r0 + 2 * r1 + 40], 2 + vinserti32x4 m9, [r0 + r6 + 40], 3 + + pshufb m10, m9, m3 + pshufb m11, m9, m4 + pshufb m9, m2 + + pmaddubsw m7, m0 + pmaddubsw m12, m8, m1 + pmaddwd m7, m5 + pmaddwd m12, m5 + paddd m7, m12 + + pmaddubsw m8, m0 + pmaddubsw m12, m9, m1 + pmaddwd m8, m5 + pmaddwd m12, m5 + paddd m8, m12 + + pmaddubsw m9, m0 + pmaddubsw m12, m10, m1 + pmaddwd m9, m5 + pmaddwd m12, m5 + paddd m9, m12 
+ + pmaddubsw m10, m0 + pmaddubsw m12, m11, m1 + pmaddwd m10, m5 + pmaddwd m12, m5 + paddd m10, m12 + + packssdw m7, m8 + packssdw m9, m10 + pmulhrsw m7, m6 + pmulhrsw m9, m6 + packuswb m7, m9 + movu [r2 + mmsize/2], xm7 + vextracti32x4 [r2 + r3 + mmsize/2], m7, 1 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m7, 2 + vextracti32x4 [r2 + r7 + mmsize/2], m7, 3 +%endmacro + +%macro IPFILTER_LUMA_64xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_pp_64x%1, 4,6,13 + sub r0, 3 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m3, [interp4_horiz_shuf_load3_avx512] + vbroadcasti32x8 m4, [interp4_horiz_shuf_load2_avx512] + vpbroadcastd m5, [pw_1] + vbroadcasti32x8 m6, [pw_512] + +%rep %1-1 + PROCESS_IPFILTER_LUMA_PP_64x1_AVX512 + lea r0, [r0 + r1] + lea r2, [r2 + r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_64x1_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_LUMA_64xN_AVX512 16 +IPFILTER_LUMA_64xN_AVX512 32 +IPFILTER_LUMA_64xN_AVX512 48 +IPFILTER_LUMA_64xN_AVX512 64 +%endif + +%macro IPFILTER_LUMA_32xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_pp_32x%1, 4,6,13 + sub r0, 3 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m3, [interp4_horiz_shuf_load3_avx512] + vbroadcasti32x8 m4, [interp4_horiz_shuf_load2_avx512] + vpbroadcastd m5, [pw_1] + vbroadcasti32x8 m6, [pw_512] + +%rep %1/2 -1 + PROCESS_IPFILTER_LUMA_PP_32x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_32x2_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_LUMA_32xN_AVX512 8 +IPFILTER_LUMA_32xN_AVX512 16 +IPFILTER_LUMA_32xN_AVX512 24 +IPFILTER_LUMA_32xN_AVX512 32 +IPFILTER_LUMA_32xN_AVX512 64 +%endif + +%macro IPFILTER_LUMA_16xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_pp_16x%1, 4,8,14 + sub r0, 3 + mov r4d, r4m + lea r6, [3 * r1] + lea r7, [3 * r3] +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m3, [interp4_horiz_shuf_load3_avx512] + vbroadcasti32x8 m4, [interp4_horiz_shuf_load2_avx512] + vpbroadcastd m5, [pw_1] + vbroadcasti32x8 m6, [pw_512] + +%rep %1/4 -1 + PROCESS_IPFILTER_LUMA_PP_16x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_16x4_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +IPFILTER_LUMA_16xN_AVX512 4 +IPFILTER_LUMA_16xN_AVX512 8 +IPFILTER_LUMA_16xN_AVX512 12 +IPFILTER_LUMA_16xN_AVX512 16 +IPFILTER_LUMA_16xN_AVX512 32 +IPFILTER_LUMA_16xN_AVX512 64 +%endif + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_pp_48x64, 4,8,14 + sub r0, 3 + mov r4d, r4m + lea r6, [3 * r1] + lea r7, [3 * r3] +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastd m0, [r5 + r4 * 8] + vpbroadcastd m1, [r5 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp4_horiz_shuf_load1_avx512] + 
vbroadcasti32x8 m3, [interp4_horiz_shuf_load3_avx512] + vbroadcasti32x8 m4, [interp4_horiz_shuf_load2_avx512] + vpbroadcastd m5, [pw_1] + vbroadcasti32x8 m6, [pw_512] + +%rep 15 + PROCESS_IPFILTER_LUMA_PP_48x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_IPFILTER_LUMA_PP_48x4_AVX512 + RET +%endif + +%macro PROCESS_IPFILTER_LUMA_PS_64x1_AVX512 0 + ; register map + ; m0 , m1 - interpolate coeff + ; m2 , m3, m4 - load shuffle order table + ; m5 - pw_1 + ; m6 - pw_2000 + ; m7 - store shuffle order table + + movu ym8, [r0] + vinserti32x8 m8, [r0 + 8], 1 + pshufb m9, m8, m3 + pshufb m10, m8, m4 + pshufb m8, m2 + + movu ym11, [r0 + mmsize/2] + vinserti32x8 m11, [r0 + mmsize/2 + 8], 1 + pshufb m12, m11, m3 + pshufb m13, m11, m4 + pshufb m11, m2 + + pmaddubsw m8, m0 + pmaddubsw m14, m9, m1 + pmaddwd m8, m5 + pmaddwd m14, m5 + paddd m8, m14 + + pmaddubsw m9, m0 + pmaddubsw m14, m10, m1 + pmaddwd m9, m5 + pmaddwd m14, m5 + paddd m9, m14 + + pmaddubsw m11, m0 + pmaddubsw m14, m12, m1 + pmaddwd m11, m5 + pmaddwd m14, m5 + paddd m11, m14 + + pmaddubsw m12, m0 + pmaddubsw m14, m13, m1 + pmaddwd m12, m5 + pmaddwd m14, m5 + paddd m12, m14 + + + packssdw m8, m9 + packssdw m11, m12 + psubw m8, m6 + psubw m11, m6 + vpermq m8, m7, m8 + vpermq m11, m7, m11 + movu [r2], m8 + movu [r2 + mmsize], m11 +%endmacro + +%macro IPFILTER_LUMA_PS_64xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_ps_64x%1, 4,7,15 + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastd m0, [r6 + r4 * 8] + vpbroadcastd m1, [r6 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m3, [interp4_horiz_shuf_load3_avx512] + vbroadcasti32x8 m4, [interp4_horiz_shuf_load2_avx512] + vpbroadcastd m5, [pw_1] + vbroadcasti32x8 m6, [pw_2000] + mova m7, [interp8_hps_store_avx512] + + mov r4d, %1 + sub r0, 3 + test r5d, r5d + jz .loop + lea r6, [r1 * 3] + sub r0, r6 ; r0(src)-r6 + add r4d, 7 ; blkheight += N - 1 + +.loop: + PROCESS_IPFILTER_LUMA_PS_64x1_AVX512 + lea r0, [r0 + r1] + lea r2, [r2 + 2 * r3] + dec r4d + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 == 1 + IPFILTER_LUMA_PS_64xN_AVX512 16 + IPFILTER_LUMA_PS_64xN_AVX512 32 + IPFILTER_LUMA_PS_64xN_AVX512 48 + IPFILTER_LUMA_PS_64xN_AVX512 64 +%endif + +%macro PROCESS_IPFILTER_LUMA_PS_32x1_AVX512 0 + ; register map + ; m0 , m1 - interpolate coeff + ; m2 , m3, m4 - load shuffle order table + ; m5 - pw_1 + ; m6 - pw_2000 + ; m7 - store shuffle order table + + movu ym8, [r0] + vinserti32x8 m8, [r0 + 8], 1 + pshufb m9, m8, m3 + pshufb m10, m8, m4 + pshufb m8, m2 + + pmaddubsw m8, m0 + pmaddubsw m11, m9, m1 + pmaddwd m8, m5 + pmaddwd m11, m5 + paddd m8, m11 + + pmaddubsw m9, m0 + pmaddubsw m11, m10, m1 + pmaddwd m9, m5 + pmaddwd m11, m5 + paddd m9, m11 + + packssdw m8, m9 + psubw m8, m6 + vpermq m8, m7, m8 + movu [r2], m8 +%endmacro + +%macro IPFILTER_LUMA_PS_32xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_ps_32x%1, 4,7,12 + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastd m0, [r6 + r4 * 8] + vpbroadcastd m1, [r6 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m3, [interp4_horiz_shuf_load3_avx512] + vbroadcasti32x8 m4, [interp4_horiz_shuf_load2_avx512] + vpbroadcastd m5, [pw_1] + vbroadcasti32x8 m6, [pw_2000] + mova 
m7, [interp8_hps_store_avx512] + + mov r4d, %1 + sub r0, 3 + test r5d, r5d + jz .loop + lea r6, [r1 * 3] + sub r0, r6 ; r0(src)-r6 + add r4d, 7 ; blkheight += N - 1 + +.loop: + PROCESS_IPFILTER_LUMA_PS_32x1_AVX512 + lea r0, [r0 + r1] + lea r2, [r2 + 2 * r3] + dec r4d + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 == 1 + IPFILTER_LUMA_PS_32xN_AVX512 8 + IPFILTER_LUMA_PS_32xN_AVX512 16 + IPFILTER_LUMA_PS_32xN_AVX512 24 + IPFILTER_LUMA_PS_32xN_AVX512 32 + IPFILTER_LUMA_PS_32xN_AVX512 64 +%endif + +%macro PROCESS_IPFILTER_LUMA_PS_8TAP_16x2_AVX512 0 + movu xm7, [r0] + vinserti32x4 m7, [r0 + 8], 1 + vinserti32x4 m7, [r0 + r1], 2 + vinserti32x4 m7, [r0 + r1 + 8], 3 + pshufb m8, m7, m3 + pshufb m9, m7, m4 + pshufb m7, m2 + + pmaddubsw m7, m0 + pmaddubsw m10, m8, m1 + pmaddwd m7, m5 + pmaddwd m10, m5 + paddd m7, m10 + + pmaddubsw m8, m0 + pmaddubsw m10, m9, m1 + pmaddwd m8, m5 + pmaddwd m10, m5 + paddd m8, m10 + + packssdw m7, m8 + psubw m7, m6 + movu [r2], ym7 + vextracti32x8 [r2 + r3], m7, 1 +%endmacro + +%macro PROCESS_IPFILTER_LUMA_PS_8TAP_16x1_AVX512 0 + movu xm7, [r0] + vinserti32x4 m7, [r0 + 8], 1 + pshufb ym8, ym7, ym3 + pshufb ym9, ym7, ym4 + pshufb ym7, ym2 + + pmaddubsw ym7, ym0 + pmaddubsw ym10, ym8, ym1 + pmaddwd ym7, ym5 + pmaddwd ym10, ym5 + paddd ym7, ym10 + + pmaddubsw ym8, ym0 + pmaddubsw ym10, ym9, ym1 + pmaddwd ym8, ym5 + pmaddwd ym10, ym5 + paddd ym8, ym10 + + packssdw ym7, ym8 + psubw ym7, ym6 + movu [r2], ym7 +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_horiz_ps_16xN(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_LUMA_PS_8TAP_16xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_ps_16x%1, 4,7,11 + mov r4d, r4m + mov r5d, r5m + add r3, r3 + +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastd m0, [r6 + r4 * 8] + vpbroadcastd m1, [r6 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m3, [interp4_horiz_shuf_load3_avx512] + vbroadcasti32x8 m4, [interp4_horiz_shuf_load2_avx512] + vpbroadcastd m5, [pw_1] + vbroadcasti32x8 m6, [pw_2000] + + ; register map + ; m0 , m1 - interpolate coeff + ; m2 , m3, m4 - load shuffle order table + ; m5 - pw_1 + ; m6 - pw_2000 + + mov r4d, %1 + sub r0, 3 + test r5d, r5d + jz .loop + lea r6, [r1 * 3] + sub r0, r6 ; r0(src)-r6 + add r4d, 7 ; blkheight += N - 1 + PROCESS_IPFILTER_LUMA_PS_8TAP_16x1_AVX512 + lea r0, [r0 + r1] + lea r2, [r2 + r3] + dec r4d + +.loop: + PROCESS_IPFILTER_LUMA_PS_8TAP_16x2_AVX512 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + sub r4d, 2 + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 == 1 + IPFILTER_LUMA_PS_8TAP_16xN_AVX512 4 + IPFILTER_LUMA_PS_8TAP_16xN_AVX512 8 + IPFILTER_LUMA_PS_8TAP_16xN_AVX512 12 + IPFILTER_LUMA_PS_8TAP_16xN_AVX512 16 + IPFILTER_LUMA_PS_8TAP_16xN_AVX512 32 + IPFILTER_LUMA_PS_8TAP_16xN_AVX512 64 +%endif + +%macro PROCESS_IPFILTER_LUMA_PS_48x1_AVX512 0 + ; register map + ; m0 , m1 - interpolate coeff + ; m2 , m3, m4 - load shuffle order table + ; m5 - pw_1 + ; m6 - pw_2000 + ; m7 - store shuffle order table + + movu ym8, [r0] + vinserti32x8 m8, [r0 + 8], 1 + pshufb m9, m8, m3 + pshufb m10, m8, m4 + pshufb m8, m2 + + pmaddubsw m8, m0 + pmaddubsw m11, m9, m1 + pmaddwd m8, m5 + pmaddwd 
m11, m5 + paddd m8, m11 + + pmaddubsw m9, m0 + pmaddubsw m11, m10, m1 + pmaddwd m9, m5 + pmaddwd m11, m5 + paddd m9, m11 + + packssdw m8, m9 + psubw m8, m6 + vpermq m8, m7, m8 + movu [r2], m8 + + movu ym8, [r0 + 32] + vinserti32x4 m8, [r0 + 40], 1 + pshufb ym9, ym8, ym3 + pshufb ym10, ym8, ym4 + pshufb ym8, ym2 + + pmaddubsw ym8, ym0 + pmaddubsw ym11, ym9, ym1 + pmaddwd ym8, ym5 + pmaddwd ym11, ym5 + paddd ym8, ym11 + + pmaddubsw ym9, ym0 + pmaddubsw ym11, ym10, ym1 + pmaddwd ym9, ym5 + pmaddwd ym11, ym5 + paddd ym9, ym11 + + packssdw ym8, ym9 + psubw ym8, ym6 + movu [r2 + mmsize], ym8 +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_horiz_ps_48xN(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;------------------------------------------------------------------------------------------------------------- +%macro IPFILTER_LUMA_PS_48xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_horiz_ps_48x%1, 4,7,12 + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastd m0, [r6 + r4 * 8] + vpbroadcastd m1, [r6 + r4 * 8 + 4] +%else + vpbroadcastd m0, [tab_LumaCoeff + r4 * 8] + vpbroadcastd m1, [tab_LumaCoeff + r4 * 8 + 4] +%endif + vbroadcasti32x8 m2, [interp4_horiz_shuf_load1_avx512] + vbroadcasti32x8 m3, [interp4_horiz_shuf_load3_avx512] + vbroadcasti32x8 m4, [interp4_horiz_shuf_load2_avx512] + vpbroadcastd m5, [pw_1] + vbroadcasti32x8 m6, [pw_2000] + mova m7, [interp8_hps_store_avx512] + + mov r4d, %1 + sub r0, 3 + test r5d, r5d + jz .loop + lea r6, [r1 * 3] + sub r0, r6 ; r0(src)-r6 + add r4d, 7 ; blkheight += N - 1 + +.loop: + PROCESS_IPFILTER_LUMA_PS_48x1_AVX512 + lea r0, [r0 + r1] + lea r2, [r2 + 2 * r3] + dec r4d + jnz .loop + RET +%endmacro + +%if ARCH_X86_64 == 1 + IPFILTER_LUMA_PS_48xN_AVX512 64 +%endif + +;------------------------------------------------------------------------------------------------------------- +;avx512 luma_vss code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_LUMA_VERT_SS_8x8_AVX512 0 + lea r6, [r0 + 4 * r1] + movu xm1, [r0] ;0 row + vinserti32x4 m1, [r0 + 2 * r1], 1 + vinserti32x4 m1, [r0 + 4 * r1], 2 + vinserti32x4 m1, [r6 + 2 * r1], 3 + movu xm3, [r0 + r1] ;1 row + vinserti32x4 m3, [r0 + r7], 1 + vinserti32x4 m3, [r6 + r1], 2 + vinserti32x4 m3, [r6 + r7], 3 + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu xm4, [r0 + 2 * r1] ;2 row + vinserti32x4 m4, [r0 + 4 * r1], 1 + vinserti32x4 m4, [r6 + 2 * r1], 2 + vinserti32x4 m4, [r6 + 4 * r1], 3 + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + lea r4, [r6 + 4 * r1] + movu xm5, [r0 + r7] ;3 row + vinserti32x4 m5, [r6 + r1], 1 + vinserti32x4 m5, [r6 + r7], 2 + vinserti32x4 m5, [r4 + r1], 3 + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu xm4, [r0 + 4 * r1] ;4 row + vinserti32x4 m4, [r6 + 2 * r1], 1 + vinserti32x4 m4, [r6 + 4 * r1], 2 + vinserti32x4 m4, [r4 + 2 * r1], 3 + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu xm11, [r6 + r1] ;5 row + vinserti32x4 m11, [r6 + r7], 1 + vinserti32x4 m11, [r4 + r1], 2 + vinserti32x4 m11, [r4 + r7], 3 + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu xm12, [r6 + 2 * r1] ;6 row + vinserti32x4 
m12, [r6 + 4 * r1], 1 + vinserti32x4 m12, [r4 + 2 * r1], 2 + vinserti32x4 m12, [r4 + 4 * r1], 3 + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + lea r8, [r4 + 4 * r1] + movu xm13, [r6 + r7] ;7 row + vinserti32x4 m13, [r4 + r1], 1 + vinserti32x4 m13, [r4 + r7], 2 + vinserti32x4 m13, [r8 + r1], 3 + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu xm12, [r6 + 4 * r1] ; 8 row + vinserti32x4 m12, [r4 + 2 * r1], 1 + vinserti32x4 m12, [r4 + 4 * r1], 2 + vinserti32x4 m12, [r8 + 2 * r1], 3 + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 + + movu [r2], xm0 + movu [r2 + r3], xm2 + vextracti32x4 [r2 + 2 * r3], m0, 1 + vextracti32x4 [r2 + r5], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2], m0, 2 + vextracti32x4 [r2 + r3], m2, 2 + vextracti32x4 [r2 + 2 * r3], m0, 3 + vextracti32x4 [r2 + r5], m2, 3 +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_SS_LUMA_8xN_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_vert_ss_8x%1, 5, 9, 19 + add r1d, r1d + add r3d, r3d + lea r7, [3 * r1] + sub r0, r7 + shl r4d, 8 +%ifdef PIC + lea r5, [pw_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [pw_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif + + lea r5, [3 * r3] +%rep %1/8 - 1 + PROCESS_LUMA_VERT_SS_8x8_AVX512 + lea r0, [r4] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_SS_8x8_AVX512 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_SS_LUMA_8xN_AVX512 8 + FILTER_VER_SS_LUMA_8xN_AVX512 16 + FILTER_VER_SS_LUMA_8xN_AVX512 32 +%endif +%macro PROCESS_LUMA_VERT_S_16x4_AVX512 1 + movu ym1, [r0] + movu ym3, [r0 + r1] + vinserti32x8 m1, [r0 + 2 * r1], 1 + vinserti32x8 m3, [r0 + r7], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + lea r6, [r0 + 4 * r1] + movu ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r6], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu ym5, [r0 + r7] + vinserti32x8 m5, [r6 + r1], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu ym4, [r6] + vinserti32x8 m4, [r6 + 2 * r1], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu ym11, [r6 + r1] + vinserti32x8 m11, [r6 + r7], 1 + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu ym12, [r6 + 2 * r1] + vinserti32x8 m12, [r6 + 4 * r1], 1 + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + lea r4, [r6 + 4 * r1] + movu ym13, [r6 + r7] + vinserti32x8 m13, [r4 + r1], 1 + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu 
ym12, [r6 + 4 * r1] + vinserti32x8 m12, [r4 + 2 * r1], 1 + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + packuswb m0, m2 + vpermq m0, m20, m0 + movu [r2], xm0 + vextracti32x4 [r2 + r3], m0, 2 + vextracti32x4 [r2 + 2 * r3], m0, 1 + vextracti32x4 [r2 + r5], m0, 3 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 + + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r5], m2, 1 +%endif +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_LUMA_16xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_16x%2, 5, 8, 21 + add r1d, r1d + lea r7, [3 * r1] + sub r0, r7 + shl r4d, 8 +%ifdef PIC + lea r5, [pw_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [pw_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m19, [pd_526336] + mova m20, [interp8_vsp_store_avx512] +%else + add r3d, r3d +%endif + + lea r5, [3 * r3] +%rep %2/4 - 1 + PROCESS_LUMA_VERT_S_16x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_S_16x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_LUMA_16xN_AVX512 ss, 4 + FILTER_VER_S_LUMA_16xN_AVX512 ss, 8 + FILTER_VER_S_LUMA_16xN_AVX512 ss, 12 + FILTER_VER_S_LUMA_16xN_AVX512 ss, 16 + FILTER_VER_S_LUMA_16xN_AVX512 ss, 32 + FILTER_VER_S_LUMA_16xN_AVX512 ss, 64 + FILTER_VER_S_LUMA_16xN_AVX512 sp, 4 + FILTER_VER_S_LUMA_16xN_AVX512 sp, 8 + FILTER_VER_S_LUMA_16xN_AVX512 sp, 12 + FILTER_VER_S_LUMA_16xN_AVX512 sp, 16 + FILTER_VER_S_LUMA_16xN_AVX512 sp, 32 + FILTER_VER_S_LUMA_16xN_AVX512 sp, 64 +%endif +%macro PROCESS_LUMA_VERT_SS_24x8_AVX512 0 + PROCESS_LUMA_VERT_S_16x4_AVX512 ss + lea r4, [r6 + 4 * r1] + lea r8, [r4 + 4 * r1] + movu ym1, [r6] + movu ym3, [r6 + r1] + vinserti32x8 m1, [r6 + 2 * r1], 1 + vinserti32x8 m3, [r6 + r7], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu ym4, [r6 + 2 * r1] + vinserti32x8 m4, [r4], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu ym5, [r6 + r7] + vinserti32x8 m5, [r4 + r1], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu ym4, [r4] + vinserti32x8 m4, [r4 + 2 * r1], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu ym11, [r4 + r1] + vinserti32x8 m11, [r4 + r7], 1 + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu ym12, [r4 + 2 * r1] + vinserti32x8 m12, [r4 + 4 * r1], 1 + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu ym13, [r4 + r7] + vinserti32x8 m13, [r8 + r1], 1 + punpcklwd m14, 
m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu ym12, [r4 + 4 * r1] + vinserti32x8 m12, [r8 + 2 * r1], 1 + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 + + lea r9, [r2 + 4 * r3] + movu [r9], ym0 + movu [r9 + r3], ym2 + vextracti32x8 [r9 + 2 * r3], m0, 1 + vextracti32x8 [r9 + r5], m2, 1 + + movu xm1, [r0 + mmsize/2] + vinserti32x4 m1, [r0 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m1, [r0 + 4 * r1 + mmsize/2], 2 + vinserti32x4 m1, [r6 + 2 * r1 + mmsize/2], 3 + movu xm3, [r0 + r1 + mmsize/2] + vinserti32x4 m3, [r0 + r7 + mmsize/2], 1 + vinserti32x4 m3, [r6 + r1 + mmsize/2], 2 + vinserti32x4 m3, [r6 + r7 + mmsize/2], 3 + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu xm4, [r0 + 2 * r1 + mmsize/2] + vinserti32x4 m4, [r0 + 4 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r6 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r6 + 4 * r1 + mmsize/2], 3 + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu xm5, [r0 + r7 + mmsize/2] + vinserti32x4 m5, [r6 + r1 + mmsize/2], 1 + vinserti32x4 m5, [r6 + r7 + mmsize/2], 2 + vinserti32x4 m5, [r4 + r1 + mmsize/2], 3 + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu xm4, [r0 + 4 * r1 + mmsize/2] + vinserti32x4 m4, [r6 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r6 + 4 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r4 + 2 * r1 + mmsize/2], 3 + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu xm11, [r6 + r1 + mmsize/2] + vinserti32x4 m11, [r6 + r7 + mmsize/2], 1 + vinserti32x4 m11, [r4 + r1 + mmsize/2], 2 + vinserti32x4 m11, [r4 + r7 + mmsize/2], 3 + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu xm12, [r6 + 2 * r1 + mmsize/2] + vinserti32x4 m12, [r6 + 4 * r1 + mmsize/2], 1 + vinserti32x4 m12, [r4 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m12, [r4 + 4 * r1 + mmsize/2], 3 + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu xm13, [r6 + r7 + mmsize/2] + vinserti32x4 m13, [r4 + r1 + mmsize/2], 1 + vinserti32x4 m13, [r4 + r7 + mmsize/2], 2 + vinserti32x4 m13, [r8 + r1 + mmsize/2], 3 + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + + paddd m8, m14 + paddd m4, m12 + paddd m0, m8 + paddd m1, m4 + + movu xm12, [r6 + 4 * r1 + mmsize/2] + vinserti32x4 m12, [r4 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m12, [r4 + 4 * r1 + mmsize/2], 2 + vinserti32x4 m12, [r8 + 2 * r1 + mmsize/2], 3 + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + + paddd m10, m14 + paddd m11, m13 + paddd m2, m10 + paddd m3, m11 + + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 + + movu [r2 + mmsize/2], xm0 + movu [r2 + r3 + mmsize/2], xm2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 1 + vextracti32x4 [r2 + r5 + mmsize/2], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2 + mmsize/2], m0, 2 + vextracti32x4 [r2 + r3 + mmsize/2], m2, 2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 3 + vextracti32x4 [r2 + r5 + mmsize/2], m2, 3 +%endmacro 
+;----------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal interp_8tap_vert_ss_24x32, 5, 10, 19 + add r1d, r1d + add r3d, r3d + lea r7, [3 * r1] + sub r0, r7 + shl r4d, 8 +%ifdef PIC + lea r5, [pw_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [pw_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif + + lea r5, [3 * r3] +%rep 3 + PROCESS_LUMA_VERT_SS_24x8_AVX512 + lea r0, [r4] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_SS_24x8_AVX512 + RET +%endif + +%macro PROCESS_LUMA_VERT_S_32x2_AVX512 1 + movu m1, [r0] ;0 row + movu m3, [r0 + r1] ;1 row + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu m4, [r0 + 2 * r1] ;2 row + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu m5, [r0 + r7] ;3 row + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu m4, [r0 + 4 * r1] ;4 row + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + lea r6, [r0 + 4 * r1] + + movu m11, [r6 + r1] ;5 row + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu m12, [r6 + 2 * r1] ;6 row + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu m13, [r6 + r7] ;7 row + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + paddd m8, m14 + paddd m4, m12 + movu m12, [r6 + 4 * r1] ; 8 row + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + paddd m10, m14 + paddd m11, m13 + + paddd m0, m8 + paddd m1, m4 + paddd m2, m10 + paddd m3, m11 +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + packuswb m0, m2 + vpermq m0, m20, m0 + movu [r2], ym0 + vextracti32x8 [r2 + r3], m0, 1 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 + movu [r2], m0 + movu [r2 + r3], m2 +%endif +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_LUMA_32xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_32x%2, 5, 8, 21 + add r1d, r1d + lea r7, [3 * r1] + sub r0, r7 + shl r4d, 8 +%ifdef PIC + lea r5, [pw_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [pw_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m19, [pd_526336] + mova m20, [interp8_vsp_store_avx512] +%else + add r3d, r3d 
+%endif + +%rep %2/2 - 1 + PROCESS_LUMA_VERT_S_32x2_AVX512 %1 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_LUMA_VERT_S_32x2_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_LUMA_32xN_AVX512 ss, 8 + FILTER_VER_S_LUMA_32xN_AVX512 ss, 16 + FILTER_VER_S_LUMA_32xN_AVX512 ss, 32 + FILTER_VER_S_LUMA_32xN_AVX512 ss, 24 + FILTER_VER_S_LUMA_32xN_AVX512 ss, 64 + FILTER_VER_S_LUMA_32xN_AVX512 sp, 8 + FILTER_VER_S_LUMA_32xN_AVX512 sp, 16 + FILTER_VER_S_LUMA_32xN_AVX512 sp, 32 + FILTER_VER_S_LUMA_32xN_AVX512 sp, 24 + FILTER_VER_S_LUMA_32xN_AVX512 sp, 64 +%endif + +%macro PROCESS_LUMA_VERT_S_48x4_AVX512 1 + PROCESS_LUMA_VERT_S_32x2_AVX512 %1 + movu m1, [r0 + 2 * r1] + movu m3, [r0 + r7] + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu m4, [r0 + 4 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu m5, [r6 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + lea r4, [r6 + 4 * r1] + + movu m4, [r6 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu m11, [r6 + r7] + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu m12, [r4] + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu m13, [r4 + r1] + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + paddd m8, m14 + paddd m4, m12 + movu m12, [r4 + 2 * r1] + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + paddd m10, m14 + paddd m11, m13 + + paddd m0, m8 + paddd m1, m4 + paddd m2, m10 + paddd m3, m11 +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + packuswb m0, m2 + vpermq m0, m20, m0 + movu [r2 + 2 * r3], ym0 + vextracti32x8 [r2 + r5], m0, 1 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 + movu [r2 + 2 * r3], m0 + movu [r2 + r5], m2 +%endif + movu ym1, [r0 + mmsize] + movu ym3, [r0 + r1 + mmsize] + vinserti32x8 m1, [r0 + 2 * r1 + mmsize], 1 + vinserti32x8 m3, [r0 + r7 + mmsize], 1 + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu ym4, [r0 + 2 * r1 + mmsize] + vinserti32x8 m4, [r6 + mmsize], 1 + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu ym5, [r0 + r7 + mmsize] + vinserti32x8 m5, [r6 + r1 + mmsize], 1 + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu ym4, [r6 + mmsize] + vinserti32x8 m4, [r6 + 2 * r1 + mmsize], 1 + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu ym11, [r6 + r1 + mmsize] + vinserti32x8 m11, [r6 + r7 + mmsize], 1 + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu ym12, [r6 + 2 * r1 + mmsize] + vinserti32x8 m12, [r6 + 4 * r1 + mmsize], 1 + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu ym13, [r6 + r7 + mmsize] + vinserti32x8 m13, [r4 + r1 + mmsize], 1 + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + paddd m8, m14 + paddd m4, m12 + movu ym12, [r6 + 4 * r1 + mmsize] + vinserti32x8 m12, [r4 + 2 * r1 + mmsize], 1 + punpcklwd m14, m13, m12 + 
pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + paddd m10, m14 + paddd m11, m13 + + paddd m0, m8 + paddd m1, m4 + paddd m2, m10 + paddd m3, m11 +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + packuswb m0, m2 + vpermq m0, m20, m0 + movu [r2 + mmsize/2], xm0 + vextracti32x4 [r2 + r3 + mmsize/2], m0, 2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 1 + vextracti32x4 [r2 + r5 + mmsize/2], m0, 3 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 + + movu [r2 + mmsize], ym0 + movu [r2 + r3 + mmsize], ym2 + vextracti32x8 [r2 + 2 * r3 + mmsize], m0, 1 + vextracti32x8 [r2 + r5 + mmsize], m2, 1 +%endif +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_LUMA_48x64_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_48x64, 5, 8, 21 + add r1d, r1d + lea r7, [3 * r1] + sub r0, r7 + shl r4d, 8 +%ifdef PIC + lea r5, [pw_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [pw_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m19, [pd_526336] + mova m20, [interp8_vsp_store_avx512] +%else + add r3d, r3d +%endif + + lea r5, [3 * r3] +%rep 15 + PROCESS_LUMA_VERT_S_48x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_S_48x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_LUMA_48x64_AVX512 ss + FILTER_VER_S_LUMA_48x64_AVX512 sp +%endif + +%macro PROCESS_LUMA_VERT_S_64x2_AVX512 1 + PROCESS_LUMA_VERT_S_32x2_AVX512 %1 + movu m1, [r0 + mmsize] ;0 row + movu m3, [r0 + r1 + mmsize] ;1 row + punpcklwd m0, m1, m3 + pmaddwd m0, m15 + punpckhwd m1, m3 + pmaddwd m1, m15 + + movu m4, [r0 + 2 * r1 + mmsize] ;2 row + punpcklwd m2, m3, m4 + pmaddwd m2, m15 + punpckhwd m3, m4 + pmaddwd m3, m15 + + movu m5, [r0 + r7 + mmsize] ;3 row + punpcklwd m6, m4, m5 + pmaddwd m6, m16 + punpckhwd m4, m5 + pmaddwd m4, m16 + + paddd m0, m6 + paddd m1, m4 + + movu m4, [r0 + 4 * r1 + mmsize] ;4 row + punpcklwd m6, m5, m4 + pmaddwd m6, m16 + punpckhwd m5, m4 + pmaddwd m5, m16 + + paddd m2, m6 + paddd m3, m5 + + movu m11, [r6 + r1 + mmsize] ;5 row + punpcklwd m8, m4, m11 + pmaddwd m8, m17 + punpckhwd m4, m11 + pmaddwd m4, m17 + + movu m12, [r6 + 2 * r1 + mmsize] ;6 row + punpcklwd m10, m11, m12 + pmaddwd m10, m17 + punpckhwd m11, m12 + pmaddwd m11, m17 + + movu m13, [r6 + r7 + mmsize] ;7 row + punpcklwd m14, m12, m13 + pmaddwd m14, m18 + punpckhwd m12, m13 + pmaddwd m12, m18 + paddd m8, m14 + paddd m4, m12 + movu m12, [r6 + 4 * r1 + mmsize] ; 8 row + punpcklwd m14, m13, m12 + pmaddwd m14, m18 + punpckhwd m13, m12 + pmaddwd m13, m18 + paddd m10, m14 + paddd m11, m13 + + paddd m0, m8 + paddd m1, m4 + paddd m2, m10 + paddd m3, m11 +%ifidn %1, sp + paddd m0, m19 + paddd m1, m19 + paddd m2, m19 + paddd m3, m19 + + psrad m0, 12 + psrad m1, 12 + psrad m2, 12 + psrad m3, 12 + + packssdw m0, m1 + packssdw m2, m3 + packuswb m0, m2 + vpermq m0, m20, m0 + movu [r2 + 
mmsize/2], ym0 + vextracti32x8 [r2 + r3 + mmsize/2], m0, 1 +%else + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + packssdw m0, m1 + packssdw m2, m3 + movu [r2 + mmsize], m0 + movu [r2 + r3 + mmsize], m2 +%endif +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_S_LUMA_64xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_64x%2, 5, 8, 21 + add r1d, r1d + lea r7, [3 * r1] + sub r0, r7 + shl r4d, 8 +%ifdef PIC + lea r5, [pw_LumaCoeffVer_avx512] + mova m15, [r5 + r4] + mova m16, [r5 + r4 + 1 * mmsize] + mova m17, [r5 + r4 + 2 * mmsize] + mova m18, [r5 + r4 + 3 * mmsize] +%else + lea r5, [pw_LumaCoeffVer_avx512 + r4] + mova m15, [r5] + mova m16, [r5 + 1 * mmsize] + mova m17, [r5 + 2 * mmsize] + mova m18, [r5 + 3 * mmsize] +%endif +%ifidn %1, sp + vbroadcasti32x4 m19, [pd_526336] + mova m20, [interp8_vsp_store_avx512] +%else + add r3d, r3d +%endif + +%rep %2/2 - 1 + PROCESS_LUMA_VERT_S_64x2_AVX512 %1 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_LUMA_VERT_S_64x2_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VER_S_LUMA_64xN_AVX512 ss, 16 + FILTER_VER_S_LUMA_64xN_AVX512 ss, 32 + FILTER_VER_S_LUMA_64xN_AVX512 ss, 48 + FILTER_VER_S_LUMA_64xN_AVX512 ss, 64 + FILTER_VER_S_LUMA_64xN_AVX512 sp, 16 + FILTER_VER_S_LUMA_64xN_AVX512 sp, 32 + FILTER_VER_S_LUMA_64xN_AVX512 sp, 48 + FILTER_VER_S_LUMA_64xN_AVX512 sp, 64 +%endif +;------------------------------------------------------------------------------------------------------------- +;avx512 luma_vss code end +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +;avx512 luma_vpp and luma_vps code start +;------------------------------------------------------------------------------------------------------------- +%macro PROCESS_LUMA_VERT_16x8_AVX512 1 + lea r5, [r0 + 4 * r1] + lea r4, [r5 + 4 * r1] + movu xm1, [r0] + vinserti32x4 m1, [r0 + 2 * r1], 1 + vinserti32x4 m1, [r5], 2 + vinserti32x4 m1, [r5 + 2 * r1], 3 + movu xm3, [r0 + r1] + vinserti32x4 m3, [r0 + r6], 1 + vinserti32x4 m3, [r5 + r1], 2 + vinserti32x4 m3, [r5 + r6], 3 + punpcklbw m0, m1, m3 + pmaddubsw m0, m8 + punpckhbw m1, m3 + pmaddubsw m1, m8 + + movu xm4, [r0 + 2 * r1] + vinserti32x4 m4, [r0 + 4 * r1], 1 + vinserti32x4 m4, [r5 + 2 * r1], 2 + vinserti32x4 m4, [r5 + 4 * r1], 3 + punpcklbw m2, m3, m4 + pmaddubsw m2, m8 + punpckhbw m3, m4 + pmaddubsw m3, m8 + + movu xm5, [r0 + r6] + vinserti32x4 m5, [r5 + r1], 1 + vinserti32x4 m5, [r5 + r6], 2 + vinserti32x4 m5, [r4 + r1], 3 + punpcklbw m6, m4, m5 + pmaddubsw m6, m9 + punpckhbw m4, m5 + pmaddubsw m4, m9 + + paddw m0, m6 + paddw m1, m4 + + movu xm4, [r0 + 4 * r1] + vinserti32x4 m4, [r5 + 2 * r1], 1 + vinserti32x4 m4, [r5 + 4 * r1], 2 + vinserti32x4 m4, [r4 + 2 * r1], 3 + punpcklbw m6, m5, m4 + pmaddubsw m6, m9 + punpckhbw m5, m4 + pmaddubsw m5, m9 + + paddw m2, m6 + paddw m3, m5 + + movu xm15, [r5 + r1] + vinserti32x4 m15, [r5 + r6], 1 + vinserti32x4 m15, [r4 + r1], 2 + vinserti32x4 m15, [r4 + r6], 3 + punpcklbw m12, m4, m15 + pmaddubsw m12, m10 + punpckhbw m13, m4, m15 + pmaddubsw m13, m10 + + lea r8, [r4 + 4 * r1] + 
movu xm4, [r5 + 2 * r1] + vinserti32x4 m4, [r5 + 4 * r1], 1 + vinserti32x4 m4, [r4 + 2 * r1], 2 + vinserti32x4 m4, [r4 + 4 * r1], 3 + punpcklbw m14, m15, m4 + pmaddubsw m14, m10 + punpckhbw m15, m4 + pmaddubsw m15, m10 + + movu xm5, [r5 + r6] + vinserti32x4 m5, [r4 + r1], 1 + vinserti32x4 m5, [r4 + r6], 2 + vinserti32x4 m5, [r8 + r1], 3 + punpcklbw m6, m4, m5 + pmaddubsw m6, m11 + punpckhbw m4, m5 + pmaddubsw m4, m11 + + paddw m12, m6 + paddw m13, m4 + + movu xm4, [r5 + 4 * r1] + vinserti32x4 m4, [r4 + 2 * r1], 1 + vinserti32x4 m4, [r4 + 4 * r1], 2 + vinserti32x4 m4, [r8 + 2 * r1], 3 + punpcklbw m6, m5, m4 + pmaddubsw m6, m11 + punpckhbw m5, m4 + pmaddubsw m5, m11 + + paddw m14, m6 + paddw m15, m5 + + paddw m0, m12 + paddw m1, m13 + paddw m2, m14 + paddw m3, m15 +%ifidn %1,pp + pmulhrsw m0, m7 + pmulhrsw m1, m7 + pmulhrsw m2, m7 + pmulhrsw m3, m7 + + packuswb m0, m1 + packuswb m2, m3 + movu [r2], xm0 + movu [r2 + r3], xm2 + vextracti32x4 [r2 + 2 * r3], m0, 1 + vextracti32x4 [r2 + r7], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2], m0, 2 + vextracti32x4 [r2 + r3], m2, 2 + vextracti32x4 [r2 + 2 * r3], m0, 3 + vextracti32x4 [r2 + r7], m2, 3 +%else + psubw m0, m7 + psubw m1, m7 + mova m12, m16 + mova m13, m17 + vpermi2q m12, m0, m1 + vpermi2q m13, m0, m1 + movu [r2], ym12 + vextracti32x8 [r2 + 2 * r3], m12, 1 + + psubw m2, m7 + psubw m3, m7 + mova m14, m16 + mova m15, m17 + vpermi2q m14, m2, m3 + vpermi2q m15, m2, m3 + movu [r2 + r3], ym14 + vextracti32x8 [r2 + r7], m14, 1 + lea r2, [r2 + 4 * r3] + + movu [r2], ym13 + movu [r2 + r3], ym15 + vextracti32x8 [r2 + 2 * r3], m13, 1 + vextracti32x8 [r2 + r7], m15, 1 +%endif +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VERT_LUMA_16xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_16x%2, 5, 9, 18 + mov r4d, r4m + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_32_avx512] + mova m8, [r5 + r4] + mova m9, [r5 + r4 + 1 * mmsize] + mova m10, [r5 + r4 + 2 * mmsize] + mova m11, [r5 + r4 + 3 * mmsize] +%else + mova m8, [tab_LumaCoeffVer_32_avx512 + r4] + mova m9, [tab_LumaCoeffVer_32_avx512 + r4 + 1 * mmsize] + mova m10, [tab_LumaCoeffVer_32_avx512 + r4 + 2 * mmsize] + mova m11, [tab_LumaCoeffVer_32_avx512 + r4 + 3 * mmsize] +%endif +%ifidn %1, pp + vbroadcasti32x8 m7, [pw_512] +%else + shl r3d, 1 + vbroadcasti32x8 m7, [pw_2000] + mova m16, [interp4_vps_store1_avx512] + mova m17, [interp4_vps_store2_avx512] +%endif + + lea r6, [3 * r1] + lea r7, [3 * r3] + sub r0, r6 + +%rep %2/8 - 1 + PROCESS_LUMA_VERT_16x8_AVX512 %1 + lea r0, [r4] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_16x8_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VERT_LUMA_16xN_AVX512 pp, 8 + FILTER_VERT_LUMA_16xN_AVX512 pp, 16 + FILTER_VERT_LUMA_16xN_AVX512 pp, 32 + FILTER_VERT_LUMA_16xN_AVX512 pp, 64 + + FILTER_VERT_LUMA_16xN_AVX512 ps, 8 + FILTER_VERT_LUMA_16xN_AVX512 ps, 16 + FILTER_VERT_LUMA_16xN_AVX512 ps, 32 + FILTER_VERT_LUMA_16xN_AVX512 ps, 64 +%endif +%macro PROCESS_LUMA_VERT_32x4_AVX512 1 + lea r5, [r0 + 4 * r1] + movu ym1, [r0] + vinserti32x8 m1, [r0 + 2 * r1], 1 + movu ym3, [r0 + r1] + vinserti32x8 m3, [r0 + r6], 1 + punpcklbw m0, m1, m3 + pmaddubsw m0, m8 + punpckhbw m1, m3 + pmaddubsw m1, m8 + + movu ym4, [r0 + 2 * r1] + vinserti32x8 
m4, [r0 + 4 * r1], 1 + punpcklbw m2, m3, m4 + pmaddubsw m2, m8 + punpckhbw m3, m4 + pmaddubsw m3, m8 + + movu ym5, [r0 + r6] + vinserti32x8 m5, [r5 + r1], 1 + punpcklbw m6, m4, m5 + pmaddubsw m6, m9 + punpckhbw m4, m5 + pmaddubsw m4, m9 + + paddw m0, m6 + paddw m1, m4 + + movu ym4, [r0 + 4 * r1] + vinserti32x8 m4, [r5 + 2 * r1], 1 + punpcklbw m6, m5, m4 + pmaddubsw m6, m9 + punpckhbw m5, m4 + pmaddubsw m5, m9 + + paddw m2, m6 + paddw m3, m5 + + lea r4, [r5 + 4 * r1] + movu ym15, [r5 + r1] + vinserti32x8 m15, [r5 + r6], 1 + punpcklbw m12, m4, m15 + pmaddubsw m12, m10 + punpckhbw m13, m4, m15 + pmaddubsw m13, m10 + + movu ym4, [r5 + 2 * r1] + vinserti32x8 m4, [r5 + 4 * r1], 1 + punpcklbw m14, m15, m4 + pmaddubsw m14, m10 + punpckhbw m15, m4 + pmaddubsw m15, m10 + + movu ym5, [r5 + r6] + vinserti32x8 m5, [r4 + r1], 1 + punpcklbw m6, m4, m5 + pmaddubsw m6, m11 + punpckhbw m4, m5 + pmaddubsw m4, m11 + + paddw m12, m6 + paddw m13, m4 + + movu ym4, [r5 + 4 * r1] + vinserti32x8 m4, [r4 + 2 * r1], 1 + punpcklbw m6, m5, m4 + pmaddubsw m6, m11 + punpckhbw m5, m4 + pmaddubsw m5, m11 + + paddw m14, m6 + paddw m15, m5 + + paddw m0, m12 + paddw m1, m13 + paddw m2, m14 + paddw m3, m15 +%ifidn %1,pp + pmulhrsw m0, m7 + pmulhrsw m1, m7 + pmulhrsw m2, m7 + pmulhrsw m3, m7 + + packuswb m0, m1 + packuswb m2, m3 + movu [r2], ym0 + movu [r2 + r3], ym2 + vextracti32x8 [r2 + 2 * r3], m0, 1 + vextracti32x8 [r2 + r7], m2, 1 +%else + psubw m0, m7 + psubw m1, m7 + mova m12, m16 + mova m13, m17 + vpermi2q m12, m0, m1 + vpermi2q m13, m0, m1 + movu [r2], m12 + movu [r2 + 2 * r3], m13 + + psubw m2, m7 + psubw m3, m7 + mova m14, m16 + mova m15, m17 + vpermi2q m14, m2, m3 + vpermi2q m15, m2, m3 + movu [r2 + r3], m14 + movu [r2 + r7], m15 +%endif +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VERT_LUMA_32xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_32x%2, 5, 8, 18 + mov r4d, r4m + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_32_avx512] + mova m8, [r5 + r4] + mova m9, [r5 + r4 + 1 * mmsize] + mova m10, [r5 + r4 + 2 * mmsize] + mova m11, [r5 + r4 + 3 * mmsize] +%else + mova m8, [tab_LumaCoeffVer_32_avx512 + r4] + mova m9, [tab_LumaCoeffVer_32_avx512 + r4 + 1 * mmsize] + mova m10, [tab_LumaCoeffVer_32_avx512 + r4 + 2 * mmsize] + mova m11, [tab_LumaCoeffVer_32_avx512 + r4 + 3 * mmsize] +%endif +%ifidn %1, pp + vbroadcasti32x8 m7, [pw_512] +%else + shl r3d, 1 + vbroadcasti32x8 m7, [pw_2000] + mova m16, [interp4_vps_store1_avx512] + mova m17, [interp4_vps_store2_avx512] +%endif + + lea r6, [3 * r1] + lea r7, [3 * r3] + sub r0, r6 + +%rep %2/4 - 1 + PROCESS_LUMA_VERT_32x4_AVX512 %1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_32x4_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VERT_LUMA_32xN_AVX512 pp, 8 + FILTER_VERT_LUMA_32xN_AVX512 pp, 16 + FILTER_VERT_LUMA_32xN_AVX512 pp, 24 + FILTER_VERT_LUMA_32xN_AVX512 pp, 32 + FILTER_VERT_LUMA_32xN_AVX512 pp, 64 + + FILTER_VERT_LUMA_32xN_AVX512 ps, 8 + FILTER_VERT_LUMA_32xN_AVX512 ps, 16 + FILTER_VERT_LUMA_32xN_AVX512 ps, 24 + FILTER_VERT_LUMA_32xN_AVX512 ps, 32 + FILTER_VERT_LUMA_32xN_AVX512 ps, 64 +%endif +%macro PROCESS_LUMA_VERT_48x8_AVX512 1 +%ifidn %1, pp + PROCESS_LUMA_VERT_32x4_AVX512 pp +%else + 
PROCESS_LUMA_VERT_32x4_AVX512 ps +%endif + lea r8, [r4 + 4 * r1] + lea r9, [r2 + 4 * r3] + movu ym1, [r5] + vinserti32x8 m1, [r5 + 2 * r1], 1 + movu ym3, [r5 + r1] + vinserti32x8 m3, [r5 + r6], 1 + punpcklbw m0, m1, m3 + pmaddubsw m0, m8 + punpckhbw m1, m3 + pmaddubsw m1, m8 + + movu ym4, [r5 + 2 * r1] + vinserti32x8 m4, [r5 + 4 * r1], 1 + punpcklbw m2, m3, m4 + pmaddubsw m2, m8 + punpckhbw m3, m4 + pmaddubsw m3, m8 + + movu ym5, [r5 + r6] + vinserti32x8 m5, [r4 + r1], 1 + punpcklbw m6, m4, m5 + pmaddubsw m6, m9 + punpckhbw m4, m5 + pmaddubsw m4, m9 + + paddw m0, m6 + paddw m1, m4 + + movu ym4, [r5 + 4 * r1] + vinserti32x8 m4, [r4 + 2 * r1], 1 + punpcklbw m6, m5, m4 + pmaddubsw m6, m9 + punpckhbw m5, m4 + pmaddubsw m5, m9 + + paddw m2, m6 + paddw m3, m5 + + movu ym15, [r4 + r1] + vinserti32x8 m15, [r4 + r6], 1 + punpcklbw m12, m4, m15 + pmaddubsw m12, m10 + punpckhbw m13, m4, m15 + pmaddubsw m13, m10 + + movu ym4, [r4 + 2 * r1] + vinserti32x8 m4, [r4 + 4 * r1], 1 + punpcklbw m14, m15, m4 + pmaddubsw m14, m10 + punpckhbw m15, m4 + pmaddubsw m15, m10 + + movu ym5, [r4 + r6] + vinserti32x8 m5, [r8 + r1], 1 + punpcklbw m6, m4, m5 + pmaddubsw m6, m11 + punpckhbw m4, m5 + pmaddubsw m4, m11 + + paddw m12, m6 + paddw m13, m4 + + movu ym4, [r4 + 4 * r1] + vinserti32x8 m4, [r8 + 2 * r1], 1 + punpcklbw m6, m5, m4 + pmaddubsw m6, m11 + punpckhbw m5, m4 + pmaddubsw m5, m11 + + paddw m14, m6 + paddw m15, m5 + + paddw m0, m12 + paddw m1, m13 + paddw m2, m14 + paddw m3, m15 +%ifidn %1,pp + pmulhrsw m0, m7 + pmulhrsw m1, m7 + pmulhrsw m2, m7 + pmulhrsw m3, m7 + packuswb m0, m1 + packuswb m2, m3 + + movu [r9], ym0 + movu [r9 + r3], ym2 + vextracti32x8 [r9 + 2 * r3], m0, 1 + vextracti32x8 [r9 + r7], m2, 1 +%else + psubw m0, m7 + psubw m1, m7 + mova m12, m16 + mova m13, m17 + vpermi2q m12, m0, m1 + vpermi2q m13, m0, m1 + movu [r9], m12 + movu [r9 + 2 * r3], m13 + + psubw m2, m7 + psubw m3, m7 + mova m14, m16 + mova m15, m17 + vpermi2q m14, m2, m3 + vpermi2q m15, m2, m3 + movu [r9 + r3], m14 + movu [r9 + r7], m15 +%endif + movu xm1, [r0 + mmsize/2] + vinserti32x4 m1, [r0 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m1, [r5 + mmsize/2], 2 + vinserti32x4 m1, [r5 + 2 * r1 + mmsize/2], 3 + movu xm3, [r0 + r1 + mmsize/2] + vinserti32x4 m3, [r0 + r6 + mmsize/2], 1 + vinserti32x4 m3, [r5 + r1 + mmsize/2], 2 + vinserti32x4 m3, [r5 + r6 + mmsize/2], 3 + punpcklbw m0, m1, m3 + pmaddubsw m0, m8 + punpckhbw m1, m3 + pmaddubsw m1, m8 + + movu xm4, [r0 + 2 * r1 + mmsize/2] + vinserti32x4 m4, [r0 + 4 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r5 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r5 + 4 * r1 + mmsize/2], 3 + punpcklbw m2, m3, m4 + pmaddubsw m2, m8 + punpckhbw m3, m4 + pmaddubsw m3, m8 + + movu xm5, [r0 + r6 + mmsize/2] + vinserti32x4 m5, [r5 + r1 + mmsize/2], 1 + vinserti32x4 m5, [r5 + r6 + mmsize/2], 2 + vinserti32x4 m5, [r4 + r1 + mmsize/2], 3 + punpcklbw m6, m4, m5 + pmaddubsw m6, m9 + punpckhbw m4, m5 + pmaddubsw m4, m9 + + paddw m0, m6 + paddw m1, m4 + + movu xm4, [r0 + 4 * r1 + mmsize/2] + vinserti32x4 m4, [r5 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r5 + 4 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r4 + 2 * r1 + mmsize/2], 3 + punpcklbw m6, m5, m4 + pmaddubsw m6, m9 + punpckhbw m5, m4 + pmaddubsw m5, m9 + + paddw m2, m6 + paddw m3, m5 + + movu xm15, [r5 + r1 + mmsize/2] + vinserti32x4 m15, [r5 + r6 + mmsize/2], 1 + vinserti32x4 m15, [r4 + r1 + mmsize/2], 2 + vinserti32x4 m15, [r4 + r6 + mmsize/2], 3 + punpcklbw m12, m4, m15 + pmaddubsw m12, m10 + punpckhbw m13, m4, m15 + pmaddubsw m13, m10 + + movu xm4, [r5 + 2 * r1 + 
mmsize/2] + vinserti32x4 m4, [r5 + 4 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r4 + 2 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r4 + 4 * r1 + mmsize/2], 3 + punpcklbw m14, m15, m4 + pmaddubsw m14, m10 + punpckhbw m15, m4 + pmaddubsw m15, m10 + + movu xm5, [r5 + r6 + mmsize/2] + vinserti32x4 m5, [r4 + r1 + mmsize/2], 1 + vinserti32x4 m5, [r4 + r6 + mmsize/2], 2 + vinserti32x4 m5, [r8 + r1 + mmsize/2], 3 + punpcklbw m6, m4, m5 + pmaddubsw m6, m11 + punpckhbw m4, m5 + pmaddubsw m4, m11 + + paddw m12, m6 + paddw m13, m4 + + movu xm4, [r5 + 4 * r1 + mmsize/2] + vinserti32x4 m4, [r4 + 2 * r1 + mmsize/2], 1 + vinserti32x4 m4, [r4 + 4 * r1 + mmsize/2], 2 + vinserti32x4 m4, [r8 + 2 * r1 + mmsize/2], 3 + punpcklbw m6, m5, m4 + pmaddubsw m6, m11 + punpckhbw m5, m4 + pmaddubsw m5, m11 + + paddw m14, m6 + paddw m15, m5 + + paddw m0, m12 + paddw m1, m13 + paddw m2, m14 + paddw m3, m15 +%ifidn %1, pp + pmulhrsw m0, m7 + pmulhrsw m1, m7 + pmulhrsw m2, m7 + pmulhrsw m3, m7 + + packuswb m0, m1 + packuswb m2, m3 + movu [r2 + mmsize/2], xm0 + movu [r2 + r3 + mmsize/2], xm2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 1 + vextracti32x4 [r2 + r7 + mmsize/2], m2, 1 + lea r2, [r2 + 4 * r3] + vextracti32x4 [r2 + mmsize/2], m0, 2 + vextracti32x4 [r2 + r3 + mmsize/2], m2, 2 + vextracti32x4 [r2 + 2 * r3 + mmsize/2], m0, 3 + vextracti32x4 [r2 + r7 + mmsize/2], m2, 3 +%else + psubw m0, m7 + psubw m1, m7 + mova m12, m16 + mova m13, m17 + vpermi2q m12, m0, m1 + vpermi2q m13, m0, m1 + movu [r2 + mmsize], ym12 + vextracti32x8 [r2 + 2 * r3 + mmsize], m12, 1 + + psubw m2, m7 + psubw m3, m7 + mova m14, m16 + mova m15, m17 + vpermi2q m14, m2, m3 + vpermi2q m15, m2, m3 + movu [r2 + r3 + mmsize], ym14 + vextracti32x8 [r2 + r7 + mmsize], m14, 1 + lea r2, [r2 + 4 * r3] + + movu [r2 + mmsize], ym13 + movu [r2 + r3 + mmsize], ym15 + vextracti32x8 [r2 + 2 * r3 + mmsize], m13, 1 + vextracti32x8 [r2 + r7 + mmsize], m15, 1 +%endif +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VERT_LUMA_48x64_AVX512 1 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_48x64, 5, 10, 18 + mov r4d, r4m + shl r4d, 8 + +%ifdef PIC + lea r5, [tab_LumaCoeffVer_32_avx512] + mova m8, [r5 + r4] + mova m9, [r5 + r4 + 1 * mmsize] + mova m10, [r5 + r4 + 2 * mmsize] + mova m11, [r5 + r4 + 3 * mmsize] +%else + mova m8, [tab_LumaCoeffVer_32_avx512 + r4] + mova m9, [tab_LumaCoeffVer_32_avx512 + r4 + 1 * mmsize] + mova m10, [tab_LumaCoeffVer_32_avx512 + r4 + 2 * mmsize] + mova m11, [tab_LumaCoeffVer_32_avx512 + r4 + 3 * mmsize] +%endif +%ifidn %1, pp + vbroadcasti32x8 m7, [pw_512] +%else + shl r3d, 1 + vbroadcasti32x8 m7, [pw_2000] + mova m16, [interp4_vps_store1_avx512] + mova m17, [interp4_vps_store2_avx512] +%endif + + lea r6, [3 * r1] + lea r7, [3 * r3] + sub r0, r6 + +%rep 7 + PROCESS_LUMA_VERT_48x8_AVX512 %1 + lea r0, [r4] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_LUMA_VERT_48x8_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 + FILTER_VERT_LUMA_48x64_AVX512 pp + FILTER_VERT_LUMA_48x64_AVX512 ps +%endif +%macro PROCESS_LUMA_VERT_64x2_AVX512 1 + lea r5, [r0 + 4 * r1] + movu m1, [r0] + movu m3, [r0 + r1] + punpcklbw m0, m1, m3 + pmaddubsw m0, m8 + punpckhbw m1, m3 + pmaddubsw m1, m8 + + movu m4, [r0 + 2 * r1] + punpcklbw m2, m3, m4 + pmaddubsw m2, m8 + punpckhbw 
m3, m4 + pmaddubsw m3, m8 + + movu m5, [r0 + r6] + punpcklbw m6, m4, m5 + pmaddubsw m6, m9 + punpckhbw m4, m5 + pmaddubsw m4, m9 + + paddw m0, m6 + paddw m1, m4 + + movu m4, [r0 + 4 * r1] + punpcklbw m6, m5, m4 + pmaddubsw m6, m9 + punpckhbw m5, m4 + pmaddubsw m5, m9 + + paddw m2, m6 + paddw m3, m5 + + movu m15, [r5 + r1] + punpcklbw m12, m4, m15 + pmaddubsw m12, m10 + punpckhbw m13, m4, m15 + pmaddubsw m13, m10 + + movu m4, [r5 + 2 * r1] + punpcklbw m14, m15, m4 + pmaddubsw m14, m10 + punpckhbw m15, m4 + pmaddubsw m15, m10 + + movu m5, [r5 + r6] + punpcklbw m6, m4, m5 + pmaddubsw m6, m11 + punpckhbw m4, m5 + pmaddubsw m4, m11 + + paddw m12, m6 + paddw m13, m4 + + movu m4, [r5 + 4 * r1] + punpcklbw m6, m5, m4 + pmaddubsw m6, m11 + punpckhbw m5, m4 + pmaddubsw m5, m11 + + paddw m14, m6 + paddw m15, m5 + + paddw m0, m12 + paddw m1, m13 + paddw m2, m14 + paddw m3, m15 +%ifidn %1,pp + pmulhrsw m0, m7 + pmulhrsw m1, m7 + pmulhrsw m2, m7 + pmulhrsw m3, m7 + + packuswb m0, m1 + packuswb m2, m3 + movu [r2], m0 + movu [r2 + r3], m2 +%else + psubw m0, m7 + psubw m1, m7 + mova m12, m16 + mova m13, m17 + vpermi2q m12, m0, m1 + vpermi2q m13, m0, m1 + movu [r2], m12 + movu [r2 + mmsize], m13 + + psubw m2, m7 + psubw m3, m7 + mova m14, m16 + mova m15, m17 + vpermi2q m14, m2, m3 + vpermi2q m15, m2, m3 + movu [r2 + r3], m14 + movu [r2 + r3 + mmsize], m15 +%endif +%endmacro +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VERT_LUMA_64xN_AVX512 2 +INIT_ZMM avx512 +cglobal interp_8tap_vert_%1_64x%2, 5, 8, 18 + mov r4d, r4m + shl r4d, 8 +%ifdef PIC + lea r5, [tab_LumaCoeffVer_32_avx512] + mova m8, [r5 + r4] + mova m9, [r5 + r4 + 1 * mmsize] + mova m10, [r5 + r4 + 2 * mmsize] + mova m11, [r5 + r4 + 3 * mmsize] +%else + mova m8, [tab_LumaCoeffVer_32_avx512 + r4] + mova m9, [tab_LumaCoeffVer_32_avx512 + r4 + 1 * mmsize] + mova m10, [tab_LumaCoeffVer_32_avx512 + r4 + 2 * mmsize] + mova m11, [tab_LumaCoeffVer_32_avx512 + r4 + 3 * mmsize] +%endif +%ifidn %1, pp + vbroadcasti32x8 m7, [pw_512] +%else + shl r3d, 1 + vbroadcasti32x8 m7, [pw_2000] + mova m16, [interp4_vps_store1_avx512] + mova m17, [interp4_vps_store2_avx512] +%endif + + lea r6, [3 * r1] + sub r0, r6 + lea r7, [3 * r3] + +%rep %2/2 - 1 + PROCESS_LUMA_VERT_64x2_AVX512 %1 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%endrep + PROCESS_LUMA_VERT_64x2_AVX512 %1 + RET +%endmacro + +%if ARCH_X86_64 +FILTER_VERT_LUMA_64xN_AVX512 pp, 16 +FILTER_VERT_LUMA_64xN_AVX512 pp, 32 +FILTER_VERT_LUMA_64xN_AVX512 pp, 48 +FILTER_VERT_LUMA_64xN_AVX512 pp, 64 + +FILTER_VERT_LUMA_64xN_AVX512 ps, 16 +FILTER_VERT_LUMA_64xN_AVX512 ps, 32 +FILTER_VERT_LUMA_64xN_AVX512 ps, 48 +FILTER_VERT_LUMA_64xN_AVX512 ps, 64 +%endif +;------------------------------------------------------------------------------------------------------------- +;avx512 luma_vpp and luma_vps code end +;------------------------------------------------------------------------------------------------------------- +;------------------------------------------------------------------------------------------------------------- +;ipfilter_luma_avx512 code end +;------------------------------------------------------------------------------------------------------------- \ No newline at end of file
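The AVX-512 kernels above implement the same arithmetic as x265's portable C interpolation primitives; only the data movement (ZMM lane inserts/extracts, shuffle tables) changes. As a reading aid, here is a hedged C sketch of the vertical 8-tap short-to-short ("ss") filter that the PROCESS_LUMA_VERT_S_* macros vectorize. The function and constant names are illustrative rather than the exact identifiers of x265's C reference; the "sp" variants visible above differ only in adding the pd_526336 rounding offset, shifting by 12 instead of 6, and clamping to the 8-bit pixel range (packuswb).

    #include <stdint.h>

    /* Illustrative only: unoptimized per-pixel form of the vertical 8-tap
     * short-to-short filter.  Like the assembly, it starts three rows above
     * the current row (sub r0, 3 * r1) and shifts the 32-bit sums right by 6
     * (psrad 6) before narrowing (packssdw). */
    static void interp8_vert_ss_sketch(const int16_t* src, intptr_t srcStride,
                                       int16_t* dst, intptr_t dstStride,
                                       const int16_t coeff[8], int width, int height)
    {
        src -= 3 * srcStride;                      /* (taps / 2 - 1) rows of lead-in */
        for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
            for (int x = 0; x < width; x++)
            {
                int32_t sum = 0;
                for (int t = 0; t < 8; t++)
                    sum += src[x + t * srcStride] * coeff[t];
                dst[x] = (int16_t)(sum >> 6);      /* "sp": (sum + 526336) >> 12, then clip to [0, 255] */
            }
    }

Leaving the "ss" output unrounded and unclamped is deliberate: those intermediates stay at the filter's higher internal precision until a later pp/sp stage converts back to pixels.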
View file
x265_2.7.tar.gz/source/common/x86/ipfilter8.h -> x265_2.9.tar.gz/source/common/x86/ipfilter8.h
Changed
@@ -33,6 +33,7 @@
 FUNCDEF_PU(void, interp_8tap_vert_ss, cpu, const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \
 FUNCDEF_PU(void, interp_8tap_hv_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY); \
 FUNCDEF_CHROMA_PU(void, filterPixelToShort, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); \
+FUNCDEF_CHROMA_PU(void, filterPixelToShort_aligned, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); \
 FUNCDEF_CHROMA_PU(void, interp_4tap_horiz_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
 FUNCDEF_CHROMA_PU(void, interp_4tap_horiz_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \
 FUNCDEF_CHROMA_PU(void, interp_4tap_vert_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
@@ -45,5 +46,6 @@
 SETUP_FUNC_DEF(sse3);
 SETUP_FUNC_DEF(sse4);
 SETUP_FUNC_DEF(avx2);
+SETUP_FUNC_DEF(avx512);
 #endif // ifndef X265_IPFILTER8_H
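This header hunk is the glue for the new assembly: it declares the aligned variant of the pixel-to-short conversion next to the existing one, and adds avx512 to the SETUP_FUNC_DEF list so that prototypes for the AVX-512 entry points are emitted alongside the existing SSE/AVX2 ones. For orientation, here is a hedged C sketch of what filterPixelToShort computes (the _aligned variant performs the same arithmetic and presumably only adds alignment guarantees on the buffers); the constants are written out for an 8-bit build and the function name is illustrative.

    #include <stdint.h>

    /* Sketch only: convert 8-bit pixels to the 14-bit signed internal
     * representation used by the ps/ss filter chains
     * (shift = 14 - 8, offset = 8192, i.e. the 0x2000 value behind the
     * pw_2000 constant used in the ps paths above). */
    static void filterPixelToShort_sketch(const uint8_t* src, intptr_t srcStride,
                                          int16_t* dst, intptr_t dstStride,
                                          int width, int height)
    {
        const int shift  = 6;
        const int offset = 8192;
        for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
            for (int x = 0; x < width; x++)
                dst[x] = (int16_t)((src[x] << shift) - offset);
    }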
View file
x265_2.7.tar.gz/source/common/x86/loopfilter.asm -> x265_2.9.tar.gz/source/common/x86/loopfilter.asm
Changed
@@ -58,6 +58,7 @@
 ;============================================================================================================
 INIT_XMM sse4
 %if HIGH_BIT_DEPTH
+%if ARCH_X86_64
 cglobal saoCuOrgE0, 4,5,9
     mov r4d, r4m
     movh m6, [r1]
@@ -157,7 +158,7 @@
     sub r4d, 16
     jnz .loopH
     RET
-
+%endif
 %else ; HIGH_BIT_DEPTH == 1
 cglobal saoCuOrgE0, 5, 5, 8, rec, offsetEo, lcuWidth, signLeft, stride
@@ -249,6 +250,7 @@
 INIT_YMM avx2
 %if HIGH_BIT_DEPTH
+%if ARCH_X86_64
 cglobal saoCuOrgE0, 4,4,9
     vbroadcasti128 m6, [r1]
     movzx r1d, byte [r3]
@@ -308,6 +310,7 @@
     dec r2d
     jnz .loop
     RET
+%endif
 %else ; HIGH_BIT_DEPTH
 cglobal saoCuOrgE0, 5, 5, 7, rec, offsetEo, lcuWidth, signLeft, stride
@@ -1655,6 +1658,7 @@
     RET
 %endif
+%if ARCH_X86_64
 INIT_YMM avx2
 %if HIGH_BIT_DEPTH
 cglobal saoCuOrgB0, 5,7,8
@@ -1814,6 +1818,7 @@
 .end:
     RET
 %endif
+%endif
 ;============================================================================================================
 ; void calSign(int8_t *dst, const Pixel *src1, const Pixel *src2, const int width)
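The loopfilter hunks are guard-only: the high-bit-depth saoCuOrgE0 and saoCuOrgB0 kernels are now compiled only on x86-64, presumably because they ask for more XMM/YMM registers than 32-bit mode provides. For context, here is a hedged C sketch of the per-pixel edge-offset math (SAO EO class 0, horizontal neighbours) that saoCuOrgE0 vectorizes; the real primitive also takes the stride and signLeft arguments visible in the cglobal line above, while this single-row version keeps only the arithmetic, with illustrative names.

    #include <stdint.h>

    static inline int sign3(int v) { return (v > 0) - (v < 0); }

    /* Sketch only: one row of SAO edge offset, class EO_0.  edgeType selects
     * one of five signed offsets; the result is clamped to the 8-bit range.
     * Assumes rec[lcuWidth], the right neighbour of the last pixel, is
     * readable, as it is in the real kernel. */
    static void saoCuOrgE0_sketch(uint8_t* rec, const int8_t* offsetEo,
                                  int lcuWidth, int signLeft)
    {
        for (int x = 0; x < lcuWidth; x++)
        {
            int signRight = sign3(rec[x] - rec[x + 1]);
            int edgeType  = signRight + signLeft + 2;
            signLeft      = -signRight;
            int v         = rec[x] + offsetEo[edgeType];
            rec[x]        = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
    }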
View file
x265_2.7.tar.gz/source/common/x86/mc-a.asm -> x265_2.9.tar.gz/source/common/x86/mc-a.asm
Changed
@@ -46,13 +46,10 @@ %error Unsupport bit depth! %endif -SECTION_RODATA 32 +SECTION_RODATA 64 -ch_shuf: times 2 db 0,2,2,4,4,6,6,8,1,3,3,5,5,7,7,9 -ch_shuf_adj: times 8 db 0 - times 8 db 2 - times 8 db 4 - times 8 db 6 +ALIGN 64 +const shuf_avx512, dq 0, 2, 4, 6, 1, 3, 5, 7 SECTION .text @@ -1037,6 +1034,7 @@ ;------------------------------------------------------------------------------ ; avx2 asm for addAvg high_bit_depth ;------------------------------------------------------------------------------ +%if ARCH_X86_64 INIT_YMM avx2 cglobal addAvg_8x2, 6,6,2, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride movu xm0, [r0] @@ -1114,6 +1112,7 @@ movu [r2], xm0 movu [r2 + r5], xm2 RET +%endif %macro ADDAVG_W8_H4_AVX2 1 cglobal addAvg_8x%1, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride @@ -1168,13 +1167,16 @@ RET %endmacro +%if ARCH_X86_64 ADDAVG_W8_H4_AVX2 4 ADDAVG_W8_H4_AVX2 8 ADDAVG_W8_H4_AVX2 12 ADDAVG_W8_H4_AVX2 16 ADDAVG_W8_H4_AVX2 32 ADDAVG_W8_H4_AVX2 64 +%endif +%if ARCH_X86_64 cglobal addAvg_12x16, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride mova m4, [pw_ %+ ADDAVG_ROUND] mova m5, [pw_pixel_max] @@ -1258,6 +1260,7 @@ dec r6d jnz .loop RET +%endif %macro ADDAVG_W16_H4_AVX2 1 cglobal addAvg_16x%1, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride @@ -1299,6 +1302,7 @@ RET %endmacro +%if ARCH_X86_64 ADDAVG_W16_H4_AVX2 4 ADDAVG_W16_H4_AVX2 8 ADDAVG_W16_H4_AVX2 12 @@ -1306,7 +1310,9 @@ ADDAVG_W16_H4_AVX2 24 ADDAVG_W16_H4_AVX2 32 ADDAVG_W16_H4_AVX2 64 +%endif +%if ARCH_X86_64 cglobal addAvg_24x32, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride mova m4, [pw_ %+ ADDAVG_ROUND] mova m5, [pw_pixel_max] @@ -1418,6 +1424,7 @@ dec r6d jnz .loop RET +%endif %macro ADDAVG_W32_H2_AVX2 1 cglobal addAvg_32x%1, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride @@ -1477,13 +1484,16 @@ RET %endmacro +%if ARCH_X86_64 ADDAVG_W32_H2_AVX2 8 ADDAVG_W32_H2_AVX2 16 ADDAVG_W32_H2_AVX2 24 ADDAVG_W32_H2_AVX2 32 ADDAVG_W32_H2_AVX2 48 ADDAVG_W32_H2_AVX2 64 +%endif +%if ARCH_X86_64 cglobal addAvg_48x64, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride mova m4, [pw_ %+ ADDAVG_ROUND] mova m5, [pw_pixel_max] @@ -1557,6 +1567,7 @@ dec r6d jnz .loop RET +%endif %macro ADDAVG_W64_H1_AVX2 1 cglobal addAvg_64x%1, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride @@ -1652,10 +1663,729 @@ RET %endmacro +%if ARCH_X86_64 ADDAVG_W64_H1_AVX2 16 ADDAVG_W64_H1_AVX2 32 ADDAVG_W64_H1_AVX2 48 ADDAVG_W64_H1_AVX2 64 +%endif +;----------------------------------------------------------------------------- +;addAvg avx512 high bit depth code start +;----------------------------------------------------------------------------- +%macro PROCESS_ADDAVG_16x4_HBD_AVX512 0 + movu ym0, [r0] + vinserti32x8 m0, [r0 + r3], 1 + movu ym1, [r1] + vinserti32x8 m1, [r1 + r4], 1 + + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + + movu [r2], ym0 + vextracti32x8 [r2 + r5], m0, 1 + + movu ym0, [r0 + 2 * r3] + vinserti32x8 m0, [r0 + r6], 1 + movu ym1, [r1 + 2 * r4] + vinserti32x8 m1, [r1 + r7], 1 + + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + + movu [r2 + 2 * r5], ym0 + vextracti32x8 [r2 + r8], m0, 1 +%endmacro + +%macro PROCESS_ADDAVG_32x4_HBD_AVX512 0 + movu m0, [r0] + movu m1, [r1] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2], m0 + + movu m0, [r0 + r3] + movu m1, [r1 + r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5], m0 + + 
movu m0, [r0 + 2 * r3] + movu m1, [r1 + 2 * r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 2 * r5], m0 + + movu m0, [r0 + r6] + movu m1, [r1 + r7] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r8], m0 +%endmacro + +%macro PROCESS_ADDAVG_64x4_HBD_AVX512 0 + movu m0, [r0] + movu m1, [r1] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2], m0 + + movu m0, [r0 + mmsize] + movu m1, [r1 + mmsize] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + mmsize], m0 + + movu m0, [r0 + r3] + movu m1, [r1 + r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5], m0 + + movu m0, [r0 + r3 + mmsize] + movu m1, [r1 + r4 + mmsize] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5 + mmsize], m0 + + movu m0, [r0 + 2 * r3] + movu m1, [r1 + 2 * r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 2 * r5], m0 + + movu m0, [r0 + 2 * r3 + mmsize] + movu m1, [r1 + 2 * r4 + mmsize] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 2 * r5 + mmsize], m0 + + movu m0, [r0 + r6] + movu m1, [r1 + r7] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r8], m0 + + movu m0, [r0 + r6 + mmsize] + movu m1, [r1 + r7 + mmsize] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r8 + mmsize], m0 +%endmacro + +%macro PROCESS_ADDAVG_48x4_HBD_AVX512 0 + movu m0, [r0] + movu m1, [r1] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2], m0 + + movu ym0, [r0 + mmsize] + movu ym1, [r1 + mmsize] + paddw ym0, ym1 + pmulhrsw ym0, ym3 + paddw ym0, ym4 + pmaxsw ym0, ym2 + pminsw ym0, ym5 + movu [r2 + mmsize], ym0 + + movu m0, [r0 + r3] + movu m1, [r1 + r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5], m0 + + movu ym0, [r0 + r3 + mmsize] + movu ym1, [r1 + r4 + mmsize] + paddw ym0, ym1 + pmulhrsw ym0, ym3 + paddw ym0, ym4 + pmaxsw ym0, ym2 + pminsw ym0, ym5 + movu [r2 + r5 + mmsize], ym0 + + movu m0, [r0 + 2 * r3] + movu m1, [r1 + 2 * r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 2 * r5], m0 + + movu ym0, [r0 + 2 * r3 + mmsize] + movu ym1, [r1 + 2 * r4 + mmsize] + paddw ym0, ym1 + pmulhrsw ym0, ym3 + paddw ym0, ym4 + pmaxsw ym0, ym2 + pminsw ym0, ym5 + movu [r2 + 2 * r5 + mmsize], ym0 + + movu m0, [r0 + r6] + movu m1, [r1 + r7] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r8], m0 + + movu ym0, [r0 + r6 + mmsize] + movu ym1, [r1 + r7 + mmsize] + paddw ym0, ym1 + pmulhrsw ym0, ym3 + paddw ym0, ym4 + pmaxsw ym0, ym2 + pminsw ym0, ym5 + movu [r2 + r8 + mmsize], ym0 +%endmacro +;----------------------------------------------------------------------------- +;void addAvg (int16_t* src0, int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride) +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal addAvg_16x4, 6,9,6 + vbroadcasti32x8 m4, [pw_ %+ ADDAVG_ROUND] + vbroadcasti32x8 m5, [pw_pixel_max] + vbroadcasti32x8 m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + lea r6, [3 * r3] + lea r7, [3 * r4] + lea 
r8, [3 * r5] + PROCESS_ADDAVG_16x4_HBD_AVX512 + RET +%endif + +%macro ADDAVG_W16_HBD_AVX512 1 +INIT_ZMM avx512 +cglobal addAvg_16x%1, 6,9,6 + vbroadcasti32x8 m4, [pw_ %+ ADDAVG_ROUND] + vbroadcasti32x8 m5, [pw_pixel_max] + vbroadcasti32x8 m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + lea r6, [3 * r3] + lea r7, [3 * r4] + lea r8, [3 * r5] + +%rep %1/4 - 1 + PROCESS_ADDAVG_16x4_HBD_AVX512 + lea r2, [r2 + 4 * r5] + lea r0, [r0 + 4 * r3] + lea r1, [r1 + 4 * r4] +%endrep + PROCESS_ADDAVG_16x4_HBD_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +ADDAVG_W16_HBD_AVX512 8 +ADDAVG_W16_HBD_AVX512 12 +ADDAVG_W16_HBD_AVX512 16 +ADDAVG_W16_HBD_AVX512 24 +ADDAVG_W16_HBD_AVX512 32 +ADDAVG_W16_HBD_AVX512 64 +%endif + +%macro ADDAVG_W32_HBD_AVX512 1 +INIT_ZMM avx512 +cglobal addAvg_32x%1, 6,9,6 + vbroadcasti32x8 m4, [pw_ %+ ADDAVG_ROUND] + vbroadcasti32x8 m5, [pw_pixel_max] + vbroadcasti32x8 m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + lea r6, [3 * r3] + lea r7, [3 * r4] + lea r8, [3 * r5] + +%rep %1/4 - 1 + PROCESS_ADDAVG_32x4_HBD_AVX512 + lea r2, [r2 + 4 * r5] + lea r0, [r0 + 4 * r3] + lea r1, [r1 + 4 * r4] +%endrep + PROCESS_ADDAVG_32x4_HBD_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +ADDAVG_W32_HBD_AVX512 8 +ADDAVG_W32_HBD_AVX512 16 +ADDAVG_W32_HBD_AVX512 24 +ADDAVG_W32_HBD_AVX512 32 +ADDAVG_W32_HBD_AVX512 48 +ADDAVG_W32_HBD_AVX512 64 +%endif + +%macro ADDAVG_W64_HBD_AVX512 1 +INIT_ZMM avx512 +cglobal addAvg_64x%1, 6,9,6 + vbroadcasti32x8 m4, [pw_ %+ ADDAVG_ROUND] + vbroadcasti32x8 m5, [pw_pixel_max] + vbroadcasti32x8 m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + lea r6, [3 * r3] + lea r7, [3 * r4] + lea r8, [3 * r5] + +%rep %1/4 - 1 + PROCESS_ADDAVG_64x4_HBD_AVX512 + lea r2, [r2 + 4 * r5] + lea r0, [r0 + 4 * r3] + lea r1, [r1 + 4 * r4] +%endrep + PROCESS_ADDAVG_64x4_HBD_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +ADDAVG_W64_HBD_AVX512 16 +ADDAVG_W64_HBD_AVX512 32 +ADDAVG_W64_HBD_AVX512 48 +ADDAVG_W64_HBD_AVX512 64 +%endif + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal addAvg_48x64, 6,9,6 + vbroadcasti32x8 m4, [pw_ %+ ADDAVG_ROUND] + vbroadcasti32x8 m5, [pw_pixel_max] + vbroadcasti32x8 m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + lea r6, [3 * r3] + lea r7, [3 * r4] + lea r8, [3 * r5] + +%rep 15 + PROCESS_ADDAVG_48x4_HBD_AVX512 + lea r2, [r2 + 4 * r5] + lea r0, [r0 + 4 * r3] + lea r1, [r1 + 4 * r4] +%endrep + PROCESS_ADDAVG_48x4_HBD_AVX512 + RET +%endif + +%macro PROCESS_ADDAVG_ALIGNED_16x4_HBD_AVX512 0 + movu ym0, [r0] + vinserti32x8 m0, [r0 + r3], 1 + movu ym1, [r1] + vinserti32x8 m1, [r1 + r4], 1 + + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + + movu [r2], ym0 + vextracti32x8 [r2 + r5], m0, 1 + + movu ym0, [r0 + 2 * r3] + vinserti32x8 m0, [r0 + r6], 1 + movu ym1, [r1 + 2 * r4] + vinserti32x8 m1, [r1 + r7], 1 + + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + + movu [r2 + 2 * r5], ym0 + vextracti32x8 [r2 + r8], m0, 1 +%endmacro + +%macro PROCESS_ADDAVG_ALIGNED_32x4_HBD_AVX512 0 + movu m0, [r0] + movu m1, [r1] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2], m0 + + movu m0, [r0 + r3] + movu m1, [r1 + r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5], m0 + + movu m0, [r0 + 2 * r3] + movu m1, [r1 + 2 * r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 2 * r5], m0 
+ + movu m0, [r0 + r6] + movu m1, [r1 + r7] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r8], m0 +%endmacro + +%macro PROCESS_ADDAVG_ALIGNED_64x4_HBD_AVX512 0 + movu m0, [r0] + movu m1, [r1] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2], m0 + + movu m0, [r0 + mmsize] + movu m1, [r1 + mmsize] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + mmsize], m0 + + movu m0, [r0 + r3] + movu m1, [r1 + r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5], m0 + + movu m0, [r0 + r3 + mmsize] + movu m1, [r1 + r4 + mmsize] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5 + mmsize], m0 + + movu m0, [r0 + 2 * r3] + movu m1, [r1 + 2 * r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 2 * r5], m0 + + movu m0, [r0 + 2 * r3 + mmsize] + movu m1, [r1 + 2 * r4 + mmsize] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 2 * r5 + mmsize], m0 + + movu m0, [r0 + r6] + movu m1, [r1 + r7] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r8], m0 + + movu m0, [r0 + r6 + mmsize] + movu m1, [r1 + r7 + mmsize] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r8 + mmsize], m0 +%endmacro + +%macro PROCESS_ADDAVG_ALIGNED_48x4_HBD_AVX512 0 + movu m0, [r0] + movu m1, [r1] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2], m0 + + movu ym0, [r0 + mmsize] + movu ym1, [r1 + mmsize] + paddw ym0, ym1 + pmulhrsw ym0, ym3 + paddw ym0, ym4 + pmaxsw ym0, ym2 + pminsw ym0, ym5 + movu [r2 + mmsize], ym0 + + movu m0, [r0 + r3] + movu m1, [r1 + r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5], m0 + + movu ym0, [r0 + r3 + mmsize] + movu ym1, [r1 + r4 + mmsize] + paddw ym0, ym1 + pmulhrsw ym0, ym3 + paddw ym0, ym4 + pmaxsw ym0, ym2 + pminsw ym0, ym5 + movu [r2 + r5 + mmsize], ym0 + + movu m0, [r0 + 2 * r3] + movu m1, [r1 + 2 * r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 2 * r5], m0 + + movu ym0, [r0 + 2 * r3 + mmsize] + movu ym1, [r1 + 2 * r4 + mmsize] + paddw ym0, ym1 + pmulhrsw ym0, ym3 + paddw ym0, ym4 + pmaxsw ym0, ym2 + pminsw ym0, ym5 + movu [r2 + 2 * r5 + mmsize], ym0 + + movu m0, [r0 + r6] + movu m1, [r1 + r7] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r8], m0 + + movu ym0, [r0 + r6 + mmsize] + movu ym1, [r1 + r7 + mmsize] + paddw ym0, ym1 + pmulhrsw ym0, ym3 + paddw ym0, ym4 + pmaxsw ym0, ym2 + pminsw ym0, ym5 + movu [r2 + r8 + mmsize], ym0 +%endmacro +;----------------------------------------------------------------------------- +;void addAvg (int16_t* src0, int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride) +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal addAvg_aligned_16x4, 6,9,6 + vbroadcasti32x8 m4, [pw_ %+ ADDAVG_ROUND] + vbroadcasti32x8 m5, [pw_pixel_max] + vbroadcasti32x8 m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + lea r6, [3 * r3] + lea r7, [3 * r4] + lea r8, [3 * r5] + PROCESS_ADDAVG_ALIGNED_16x4_HBD_AVX512 + RET +%endif + +%macro ADDAVG_ALIGNED_W16_HBD_AVX512 1 +INIT_ZMM avx512 
+cglobal addAvg_aligned_16x%1, 6,9,6 + vbroadcasti32x8 m4, [pw_ %+ ADDAVG_ROUND] + vbroadcasti32x8 m5, [pw_pixel_max] + vbroadcasti32x8 m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + lea r6, [3 * r3] + lea r7, [3 * r4] + lea r8, [3 * r5] + +%rep %1/4 - 1 + PROCESS_ADDAVG_ALIGNED_16x4_HBD_AVX512 + lea r2, [r2 + 4 * r5] + lea r0, [r0 + 4 * r3] + lea r1, [r1 + 4 * r4] +%endrep + PROCESS_ADDAVG_ALIGNED_16x4_HBD_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +ADDAVG_ALIGNED_W16_HBD_AVX512 8 +ADDAVG_ALIGNED_W16_HBD_AVX512 12 +ADDAVG_ALIGNED_W16_HBD_AVX512 16 +ADDAVG_ALIGNED_W16_HBD_AVX512 24 +ADDAVG_ALIGNED_W16_HBD_AVX512 32 +ADDAVG_ALIGNED_W16_HBD_AVX512 64 +%endif + +%macro ADDAVG_ALIGNED_W32_HBD_AVX512 1 +INIT_ZMM avx512 +cglobal addAvg_aligned_32x%1, 6,9,6 + vbroadcasti32x8 m4, [pw_ %+ ADDAVG_ROUND] + vbroadcasti32x8 m5, [pw_pixel_max] + vbroadcasti32x8 m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + lea r6, [3 * r3] + lea r7, [3 * r4] + lea r8, [3 * r5] + +%rep %1/4 - 1 + PROCESS_ADDAVG_ALIGNED_32x4_HBD_AVX512 + lea r2, [r2 + 4 * r5] + lea r0, [r0 + 4 * r3] + lea r1, [r1 + 4 * r4] +%endrep + PROCESS_ADDAVG_ALIGNED_32x4_HBD_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +ADDAVG_ALIGNED_W32_HBD_AVX512 8 +ADDAVG_ALIGNED_W32_HBD_AVX512 16 +ADDAVG_ALIGNED_W32_HBD_AVX512 24 +ADDAVG_ALIGNED_W32_HBD_AVX512 32 +ADDAVG_ALIGNED_W32_HBD_AVX512 48 +ADDAVG_ALIGNED_W32_HBD_AVX512 64 +%endif + +%macro ADDAVG_ALIGNED_W64_HBD_AVX512 1 +INIT_ZMM avx512 +cglobal addAvg_aligned_64x%1, 6,9,6 + vbroadcasti32x8 m4, [pw_ %+ ADDAVG_ROUND] + vbroadcasti32x8 m5, [pw_pixel_max] + vbroadcasti32x8 m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + lea r6, [3 * r3] + lea r7, [3 * r4] + lea r8, [3 * r5] + +%rep %1/4 - 1 + PROCESS_ADDAVG_ALIGNED_64x4_HBD_AVX512 + lea r2, [r2 + 4 * r5] + lea r0, [r0 + 4 * r3] + lea r1, [r1 + 4 * r4] +%endrep + PROCESS_ADDAVG_ALIGNED_64x4_HBD_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +ADDAVG_ALIGNED_W64_HBD_AVX512 16 +ADDAVG_ALIGNED_W64_HBD_AVX512 32 +ADDAVG_ALIGNED_W64_HBD_AVX512 48 +ADDAVG_ALIGNED_W64_HBD_AVX512 64 +%endif + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal addAvg_aligned_48x64, 6,9,6 + vbroadcasti32x8 m4, [pw_ %+ ADDAVG_ROUND] + vbroadcasti32x8 m5, [pw_pixel_max] + vbroadcasti32x8 m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + lea r6, [3 * r3] + lea r7, [3 * r4] + lea r8, [3 * r5] + +%rep 15 + PROCESS_ADDAVG_ALIGNED_48x4_HBD_AVX512 + lea r2, [r2 + 4 * r5] + lea r0, [r0 + 4 * r3] + lea r1, [r1 + 4 * r4] +%endrep + PROCESS_ADDAVG_ALIGNED_48x4_HBD_AVX512 + RET +%endif +;----------------------------------------------------------------------------- +;addAvg avx512 high bit depth code end +;----------------------------------------------------------------------------- ;----------------------------------------------------------------------------- %else ; !HIGH_BIT_DEPTH ;----------------------------------------------------------------------------- @@ -2968,7 +3698,221 @@ ;----------------------------------------------------------------------------- ; addAvg avx2 code end ;----------------------------------------------------------------------------- +; addAvg avx512 code start +;----------------------------------------------------------------------------- +%macro PROCESS_ADDAVG_64x2_AVX512 0 + movu m0, [r0] + movu m1, [r1] + movu m2, [r0 + mmsize] + movu m3, [r1 + mmsize] + + paddw m0, m1 + pmulhrsw m0, m4 + paddw m0, m5 + paddw m2, m3 + pmulhrsw m2, m4 + paddw 
m2, m5 + packuswb m0, m2 + vpermq m0, m6, m0 + movu [r2], m0 + + movu m0, [r0 + r3] + movu m1, [r1 + r4] + movu m2, [r0 + r3 + mmsize] + movu m3, [r1 + r4 + mmsize] + + paddw m0, m1 + pmulhrsw m0, m4 + paddw m0, m5 + paddw m2, m3 + pmulhrsw m2, m4 + paddw m2, m5 + + packuswb m0, m2 + vpermq m0, m6, m0 + movu [r2 + r5], m0 +%endmacro + +%macro PROCESS_ADDAVG_32x2_AVX512 0 + movu m0, [r0] + movu m1, [r1] + movu m2, [r0 + r3] + movu m3, [r1 + r4] + + paddw m0, m1 + pmulhrsw m0, m4 + paddw m0, m5 + paddw m2, m3 + pmulhrsw m2, m4 + paddw m2, m5 + + packuswb m0, m2 + vpermq m0, m6, m0 + movu [r2], ym0 + vextracti32x8 [r2 + r5], m0, 1 +%endmacro +;-------------------------------------------------------------------------------------------------------------------- +;void addAvg (int16_t* src0, int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride) +;-------------------------------------------------------------------------------------------------------------------- +%macro ADDAVG_W64_AVX512 1 +INIT_ZMM avx512 +cglobal addAvg_64x%1, 6,6,7 + vbroadcasti32x8 m4, [pw_256] + vbroadcasti32x8 m5, [pw_128] + mova m6, [shuf_avx512] + + add r3, r3 + add r4, r4 + +%rep %1/2 - 1 + PROCESS_ADDAVG_64x2_AVX512 + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] +%endrep + PROCESS_ADDAVG_64x2_AVX512 + RET +%endmacro + +ADDAVG_W64_AVX512 16 +ADDAVG_W64_AVX512 32 +ADDAVG_W64_AVX512 48 +ADDAVG_W64_AVX512 64 + +%macro ADDAVG_W32_AVX512 1 +INIT_ZMM avx512 +cglobal addAvg_32x%1, 6,6,7 + vbroadcasti32x8 m4, [pw_256] + vbroadcasti32x8 m5, [pw_128] + mova m6, [shuf_avx512] + add r3, r3 + add r4, r4 + +%rep %1/2 - 1 + PROCESS_ADDAVG_32x2_AVX512 + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] +%endrep + PROCESS_ADDAVG_32x2_AVX512 + RET +%endmacro + +ADDAVG_W32_AVX512 8 +ADDAVG_W32_AVX512 16 +ADDAVG_W32_AVX512 24 +ADDAVG_W32_AVX512 32 +ADDAVG_W32_AVX512 48 +ADDAVG_W32_AVX512 64 + +%macro PROCESS_ADDAVG_ALIGNED_64x2_AVX512 0 + mova m0, [r0] + mova m1, [r1] + mova m2, [r0 + mmsize] + mova m3, [r1 + mmsize] + + paddw m0, m1 + pmulhrsw m0, m4 + paddw m0, m5 + paddw m2, m3 + pmulhrsw m2, m4 + paddw m2, m5 + + packuswb m0, m2 + vpermq m0, m6, m0 + mova [r2], m0 + + mova m0, [r0 + r3] + mova m1, [r1 + r4] + mova m2, [r0 + r3 + mmsize] + mova m3, [r1 + r4 + mmsize] + + paddw m0, m1 + pmulhrsw m0, m4 + paddw m0, m5 + paddw m2, m3 + pmulhrsw m2, m4 + paddw m2, m5 + + packuswb m0, m2 + vpermq m0, m6, m0 + mova [r2 + r5], m0 +%endmacro + +%macro PROCESS_ADDAVG_ALIGNED_32x2_AVX512 0 + mova m0, [r0] + mova m1, [r1] + mova m2, [r0 + r3] + mova m3, [r1 + r4] + + paddw m0, m1 + pmulhrsw m0, m4 + paddw m0, m5 + paddw m2, m3 + pmulhrsw m2, m4 + paddw m2, m5 + + packuswb m0, m2 + vpermq m0, m6, m0 + mova [r2], ym0 + vextracti32x8 [r2 + r5], m0, 1 +%endmacro +;-------------------------------------------------------------------------------------------------------------------- +;void addAvg (int16_t* src0, int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride) +;-------------------------------------------------------------------------------------------------------------------- +%macro ADDAVG_ALIGNED_W64_AVX512 1 +INIT_ZMM avx512 +cglobal addAvg_aligned_64x%1, 6,6,7 + vbroadcasti32x8 m4, [pw_256] + vbroadcasti32x8 m5, [pw_128] + mova m6, [shuf_avx512] + + add r3, r3 + add r4, r4 + +%rep %1/2 - 1 + PROCESS_ADDAVG_ALIGNED_64x2_AVX512 + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] +%endrep + 
PROCESS_ADDAVG_ALIGNED_64x2_AVX512 + RET +%endmacro + +ADDAVG_ALIGNED_W64_AVX512 16 +ADDAVG_ALIGNED_W64_AVX512 32 +ADDAVG_ALIGNED_W64_AVX512 48 +ADDAVG_ALIGNED_W64_AVX512 64 + +%macro ADDAVG_ALIGNED_W32_AVX512 1 +INIT_ZMM avx512 +cglobal addAvg_aligned_32x%1, 6,6,7 + vbroadcasti32x8 m4, [pw_256] + vbroadcasti32x8 m5, [pw_128] + mova m6, [shuf_avx512] + add r3, r3 + add r4, r4 + +%rep %1/2 - 1 + PROCESS_ADDAVG_ALIGNED_32x2_AVX512 + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] +%endrep + PROCESS_ADDAVG_ALIGNED_32x2_AVX512 + RET +%endmacro + +ADDAVG_ALIGNED_W32_AVX512 8 +ADDAVG_ALIGNED_W32_AVX512 16 +ADDAVG_ALIGNED_W32_AVX512 24 +ADDAVG_ALIGNED_W32_AVX512 32 +ADDAVG_ALIGNED_W32_AVX512 48 +ADDAVG_ALIGNED_W32_AVX512 64 +;----------------------------------------------------------------------------- +; addAvg avx512 code end ;----------------------------------------------------------------------------- %macro ADDAVG_W24_H2 2 INIT_XMM sse4 @@ -3367,11 +4311,11 @@ %endmacro %endif -%macro AVG_END 0 - lea t4, [t4+t5*2*SIZEOF_PIXEL] +%macro AVG_END 0-1 2;rows lea t2, [t2+t3*2*SIZEOF_PIXEL] + lea t4, [t4+t5*2*SIZEOF_PIXEL] lea t0, [t0+t1*2*SIZEOF_PIXEL] - sub eax, 2 + sub eax, %1 jg .height_loop %ifidn movu,movq ; detect MMX EMMS @@ -3434,17 +4378,24 @@ %endmacro %macro BIWEIGHT_START_SSSE3 0 - movzx t6d, byte r6m ; FIXME x86_64 - mov t7d, 64 - sub t7d, t6d - shl t7d, 8 - add t6d, t7d - mova m4, [pw_512] - movd xm3, t6d + movzx t6d, byte r6m ; FIXME x86_64 +%if mmsize > 16 + vbroadcasti128 m4, [pw_512] +%else + mova m4, [pw_512] +%endif + lea t7d, [t6+(64<<8)] + shl t6d, 8 + sub t7d, t6d +%if cpuflag(avx512) + vpbroadcastw m3, t7d +%else + movd xm3, t7d %if cpuflag(avx2) - vpbroadcastw m3, xm3 + vpbroadcastw m3, xm3 %else - SPLATW m3, m3 ; weight_dst,src + SPLATW m3, m3 ; weight_dst,src +%endif %endif %endmacro @@ -3567,6 +4518,38 @@ AVG_WEIGHT 24, 7 AVG_WEIGHT 48, 7 +INIT_YMM avx512 +cglobal pixel_avg_weight_w8 + BIWEIGHT_START + kxnorb k1, k1, k1 + kaddb k1, k1, k1 + AVG_START 5 +.height_loop: + movq xm0, [t2] + movq xm2, [t4] + movq xm1, [t2+t3] + movq xm5, [t4+t5] + lea t2, [t2+t3*2] + lea t4, [t4+t5*2] + vpbroadcastq m0 {k1}, [t2] + vpbroadcastq m2 {k1}, [t4] + vpbroadcastq m1 {k1}, [t2+t3] + vpbroadcastq m5 {k1}, [t4+t5] + punpcklbw m0, m2 + punpcklbw m1, m5 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + pmulhrsw m0, m4 + pmulhrsw m1, m4 + packuswb m0, m1 + vextracti128 xmm1, m0, 1 + movq [t0], xm0 + movhps [t0+t1], xm0 + lea t0, [t0+t1*2] + movq [t0], xmm1 + movhps [t0+t1], xmm1 + AVG_END 4 + INIT_YMM avx2 cglobal pixel_avg_weight_w16 BIWEIGHT_START @@ -3586,6 +4569,35 @@ vextracti128 [t0+t1], m0, 1 AVG_END +INIT_ZMM avx512 + cglobal pixel_avg_weight_w16 + BIWEIGHT_START + AVG_START 5 +.height_loop: + movu xm0, [t2] + movu xm1, [t4] + vinserti128 ym0, [t2+t3], 1 + vinserti128 ym1, [t4+t5], 1 + lea t2, [t2+t3*2] + lea t4, [t4+t5*2] + vinserti32x4 m0, [t2], 2 + vinserti32x4 m1, [t4], 2 + vinserti32x4 m0, [t2+t3], 3 + vinserti32x4 m1, [t4+t5], 3 + SBUTTERFLY bw, 0, 1, 2 + pmaddubsw m0, m3 + pmaddubsw m1, m3 + pmulhrsw m0, m4 + pmulhrsw m1, m4 + packuswb m0, m1 + mova [t0], xm0 + vextracti128 [t0+t1], ym0, 1 + lea t0, [t0+t1*2] + vextracti32x4 [t0], m0, 2 + vextracti32x4 [t0+t1], m0, 3 + AVG_END 4 + +INIT_YMM avx2 cglobal pixel_avg_weight_w32 BIWEIGHT_START AVG_START 5 @@ -3601,6 +4613,7 @@ mova [t0], m0 AVG_END +INIT_YMM avx2 cglobal pixel_avg_weight_w64 BIWEIGHT_START AVG_START 5 @@ -4345,6 +5358,18 @@ AVGH 16, 8 AVGH 16, 4 +INIT_XMM avx512 +AVGH 16, 64 +AVGH 16, 32 +AVGH 16, 16 
+AVGH 16, 12 +AVGH 16, 8 +AVGH 16, 4 +AVGH 8, 32 +AVGH 8, 16 +AVGH 8, 8 +AVGH 8, 4 + %endif ;HIGH_BIT_DEPTH ;------------------------------------------------------------------------------------------------------------------------------- @@ -4482,6 +5507,58 @@ RET %endif +;----------------------------------------------------------------------------- +;pixel_avg_pp avx512 code start +;----------------------------------------------------------------------------- +%macro PROCESS_PIXELAVG_64x4_AVX512 0 + movu m0, [r2] + movu m2, [r2 + r3] + movu m1, [r4] + movu m3, [r4 + r5] + pavgb m0, m1 + pavgb m2, m3 + movu [r0], m0 + movu [r0 + r1], m2 + + movu m0, [r2 + 2 * r3] + movu m2, [r2 + r7] + movu m1, [r4 + 2 * r5] + movu m3, [r4 + r8] + pavgb m0, m1 + pavgb m2, m3 + movu [r0 + 2 * r1], m0 + movu [r0 + r6], m2 +%endmacro + +;------------------------------------------------------------------------------------------------------------------------------- +;void pixelavg_pp(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;------------------------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 && BIT_DEPTH == 8 +%macro PIXEL_AVG_64xN_AVX512 1 +INIT_ZMM avx512 +cglobal pixel_avg_64x%1, 6, 9, 4 + lea r6, [3 * r1] + lea r7, [3 * r3] + lea r8, [3 * r5] + +%rep %1/4 - 1 + PROCESS_PIXELAVG_64x4_AVX512 + lea r2, [r2 + r3 * 4] + lea r4, [r4 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_PIXELAVG_64x4_AVX512 + RET +%endmacro + +PIXEL_AVG_64xN_AVX512 16 +PIXEL_AVG_64xN_AVX512 32 +PIXEL_AVG_64xN_AVX512 48 +PIXEL_AVG_64xN_AVX512 64 +%endif +;----------------------------------------------------------------------------- +;pixel_avg_pp avx512 code end +;----------------------------------------------------------------------------- ;============================================================================= ; pixel avg2 ;============================================================================= @@ -5267,6 +6344,552 @@ RET %endif +;----------------------------------------------------------------------------- +;pixel_avg_pp avx512 high bit depth code start +;----------------------------------------------------------------------------- +%macro PROCESS_PIXELAVG_32x8_HBD_AVX512 0 + movu m0, [r2] + movu m1, [r4] + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m0, m1 + pavgw m2, m3 + movu [r0], m0 + movu [r0 + r1], m2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m0, m1 + pavgw m2, m3 + movu [r0 + r1 * 2], m0 + movu [r0 + r8], m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] + + movu m0, [r2] + movu m1, [r4] + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m0, m1 + pavgw m2, m3 + movu [r0], m0 + movu [r0 + r1], m2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m0, m1 + pavgw m2, m3 + movu [r0 + r1 * 2], m0 + movu [r0 + r8], m2 +%endmacro +%macro PROCESS_PIXELAVG_ALIGNED_32x8_HBD_AVX512 0 + mova m0, [r2] + mova m1, [r4] + mova m2, [r2 + r3] + mova m3, [r4 + r5] + pavgw m0, m1 + pavgw m2, m3 + mova [r0], m0 + mova [r0 + r1], m2 + + mova m0, [r2 + r3 * 2] + mova m1, [r4 + r5 * 2] + mova m2, [r2 + r6] + mova m3, [r4 + r7] + pavgw m0, m1 + pavgw m2, m3 + mova [r0 + r1 * 2], m0 + mova [r0 + r8], m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] + + mova m0, [r2] + mova m1, [r4] + mova m2, [r2 + r3] + mova m3, [r4 + r5] + 
pavgw m0, m1 + pavgw m2, m3 + mova [r0], m0 + mova [r0 + r1], m2 + + mova m0, [r2 + r3 * 2] + mova m1, [r4 + r5 * 2] + mova m2, [r2 + r6] + mova m3, [r4 + r7] + pavgw m0, m1 + pavgw m2, m3 + mova [r0 + r1 * 2], m0 + mova [r0 + r8], m2 +%endmacro + +%macro PROCESS_PIXELAVG_64x8_HBD_AVX512 0 + movu m0, [r2] + movu m1, [r4] + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m0, m1 + pavgw m2, m3 + movu [r0], m0 + movu [r0 + r1], m2 + + movu m0, [r2 + mmsize] + movu m1, [r4 + mmsize] + movu m2, [r2 + r3 + mmsize] + movu m3, [r4 + r5 + mmsize] + pavgw m0, m1 + pavgw m2, m3 + movu [r0 + mmsize], m0 + movu [r0 + r1 + mmsize], m2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m0, m1 + pavgw m2, m3 + movu [r0 + r1 * 2], m0 + movu [r0 + r8], m2 + + movu m0, [r2 + r3 * 2 + mmsize] + movu m1, [r4 + r5 * 2 + mmsize] + movu m2, [r2 + r6 + mmsize] + movu m3, [r4 + r7 + mmsize] + pavgw m0, m1 + pavgw m2, m3 + movu [r0 + r1 * 2 + mmsize], m0 + movu [r0 + r8 + mmsize], m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] + + movu m0, [r2] + movu m1, [r4] + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m0, m1 + pavgw m2, m3 + movu [r0], m0 + movu [r0 + r1], m2 + + movu m0, [r2 + mmsize] + movu m1, [r4 + mmsize] + movu m2, [r2 + r3 + mmsize] + movu m3, [r4 + r5 + mmsize] + pavgw m0, m1 + pavgw m2, m3 + movu [r0 + mmsize], m0 + movu [r0 + r1 + mmsize], m2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m0, m1 + pavgw m2, m3 + movu [r0 + r1 * 2], m0 + movu [r0 + r8], m2 + + movu m0, [r2 + r3 * 2 + mmsize] + movu m1, [r4 + r5 * 2 + mmsize] + movu m2, [r2 + r6 + mmsize] + movu m3, [r4 + r7 + mmsize] + pavgw m0, m1 + pavgw m2, m3 + movu [r0 + r1 * 2 + mmsize], m0 + movu [r0 + r8 + mmsize], m2 +%endmacro +%macro PROCESS_PIXELAVG_ALIGNED_64x8_HBD_AVX512 0 + mova m0, [r2] + mova m1, [r4] + mova m2, [r2 + r3] + mova m3, [r4 + r5] + pavgw m0, m1 + pavgw m2, m3 + mova [r0], m0 + mova [r0 + r1], m2 + + mova m0, [r2 + mmsize] + mova m1, [r4 + mmsize] + mova m2, [r2 + r3 + mmsize] + mova m3, [r4 + r5 + mmsize] + pavgw m0, m1 + pavgw m2, m3 + mova [r0 + mmsize], m0 + mova [r0 + r1 + mmsize], m2 + + mova m0, [r2 + r3 * 2] + mova m1, [r4 + r5 * 2] + mova m2, [r2 + r6] + mova m3, [r4 + r7] + pavgw m0, m1 + pavgw m2, m3 + mova [r0 + r1 * 2], m0 + mova [r0 + r8], m2 + + mova m0, [r2 + r3 * 2 + mmsize] + mova m1, [r4 + r5 * 2 + mmsize] + mova m2, [r2 + r6 + mmsize] + mova m3, [r4 + r7 + mmsize] + pavgw m0, m1 + pavgw m2, m3 + mova [r0 + r1 * 2 + mmsize], m0 + mova [r0 + r8 + mmsize], m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] + + mova m0, [r2] + mova m1, [r4] + mova m2, [r2 + r3] + mova m3, [r4 + r5] + pavgw m0, m1 + pavgw m2, m3 + mova [r0], m0 + mova [r0 + r1], m2 + + mova m0, [r2 + mmsize] + mova m1, [r4 + mmsize] + mova m2, [r2 + r3 + mmsize] + mova m3, [r4 + r5 + mmsize] + pavgw m0, m1 + pavgw m2, m3 + mova [r0 + mmsize], m0 + mova [r0 + r1 + mmsize], m2 + + mova m0, [r2 + r3 * 2] + mova m1, [r4 + r5 * 2] + mova m2, [r2 + r6] + mova m3, [r4 + r7] + pavgw m0, m1 + pavgw m2, m3 + mova [r0 + r1 * 2], m0 + mova [r0 + r8], m2 + + mova m0, [r2 + r3 * 2 + mmsize] + mova m1, [r4 + r5 * 2 + mmsize] + mova m2, [r2 + r6 + mmsize] + mova m3, [r4 + r7 + mmsize] + pavgw m0, m1 + pavgw m2, m3 + mova [r0 + r1 * 2 + mmsize], m0 + mova [r0 + r8 + mmsize], m2 +%endmacro + +%macro PROCESS_PIXELAVG_48x8_HBD_AVX512 0 + movu m0, [r2] + movu m1, [r4] + movu m2, [r2 + r3] + movu m3, 
[r4 + r5] + pavgw m0, m1 + pavgw m2, m3 + movu [r0], m0 + movu [r0 + r1], m2 + + movu ym0, [r2 + mmsize] + movu ym1, [r4 + mmsize] + movu ym2, [r2 + r3 + mmsize] + movu ym3, [r4 + r5 + mmsize] + pavgw ym0, ym1 + pavgw ym2, ym3 + movu [r0 + mmsize], ym0 + movu [r0 + r1 + mmsize], ym2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m0, m1 + pavgw m2, m3 + movu [r0 + r1 * 2], m0 + movu [r0 + r8], m2 + + movu ym0, [r2 + r3 * 2 + mmsize] + movu ym1, [r4 + r5 * 2 + mmsize] + movu ym2, [r2 + r6 + mmsize] + movu ym3, [r4 + r7 + mmsize] + pavgw ym0, ym1 + pavgw ym2, ym3 + movu [r0 + r1 * 2 + mmsize], ym0 + movu [r0 + r8 + mmsize], ym2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] + + movu m0, [r2] + movu m1, [r4] + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m0, m1 + pavgw m2, m3 + movu [r0], m0 + movu [r0 + r1], m2 + + movu ym0, [r2 + mmsize] + movu ym1, [r4 + mmsize] + movu ym2, [r2 + r3 + mmsize] + movu ym3, [r4 + r5 + mmsize] + pavgw ym0, ym1 + pavgw ym2, ym3 + movu [r0 + mmsize], ym0 + movu [r0 + r1 + mmsize], ym2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m0, m1 + pavgw m2, m3 + movu [r0 + r1 * 2], m0 + movu [r0 + r8], m2 + + movu ym0, [r2 + r3 * 2 + mmsize] + movu ym1, [r4 + r5 * 2 + mmsize] + movu ym2, [r2 + r6 + mmsize] + movu ym3, [r4 + r7 + mmsize] + pavgw ym0, ym1 + pavgw ym2, ym3 + movu [r0 + r1 * 2 + mmsize], ym0 + movu [r0 + r8 + mmsize], ym2 +%endmacro +%macro PROCESS_PIXELAVG_ALIGNED_48x8_HBD_AVX512 0 + mova m0, [r2] + mova m1, [r4] + mova m2, [r2 + r3] + mova m3, [r4 + r5] + pavgw m0, m1 + pavgw m2, m3 + mova [r0], m0 + mova [r0 + r1], m2 + + mova ym0, [r2 + mmsize] + mova ym1, [r4 + mmsize] + mova ym2, [r2 + r3 + mmsize] + mova ym3, [r4 + r5 + mmsize] + pavgw ym0, ym1 + pavgw ym2, ym3 + mova [r0 + mmsize], ym0 + mova [r0 + r1 + mmsize], ym2 + + mova m0, [r2 + r3 * 2] + mova m1, [r4 + r5 * 2] + mova m2, [r2 + r6] + mova m3, [r4 + r7] + pavgw m0, m1 + pavgw m2, m3 + mova [r0 + r1 * 2], m0 + mova [r0 + r8], m2 + + mova ym0, [r2 + r3 * 2 + mmsize] + mova ym1, [r4 + r5 * 2 + mmsize] + mova ym2, [r2 + r6 + mmsize] + mova ym3, [r4 + r7 + mmsize] + pavgw ym0, ym1 + pavgw ym2, ym3 + mova [r0 + r1 * 2 + mmsize], ym0 + mova [r0 + r8 + mmsize], ym2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] + + mova m0, [r2] + mova m1, [r4] + mova m2, [r2 + r3] + mova m3, [r4 + r5] + pavgw m0, m1 + pavgw m2, m3 + mova [r0], m0 + mova [r0 + r1], m2 + + mova ym0, [r2 + mmsize] + mova ym1, [r4 + mmsize] + mova ym2, [r2 + r3 + mmsize] + mova ym3, [r4 + r5 + mmsize] + pavgw ym0, ym1 + pavgw ym2, ym3 + mova [r0 + mmsize], ym0 + mova [r0 + r1 + mmsize], ym2 + + mova m0, [r2 + r3 * 2] + mova m1, [r4 + r5 * 2] + mova m2, [r2 + r6] + mova m3, [r4 + r7] + pavgw m0, m1 + pavgw m2, m3 + mova [r0 + r1 * 2], m0 + mova [r0 + r8], m2 + + mova ym0, [r2 + r3 * 2 + mmsize] + mova ym1, [r4 + r5 * 2 + mmsize] + mova ym2, [r2 + r6 + mmsize] + mova ym3, [r4 + r7 + mmsize] + pavgw ym0, ym1 + pavgw ym2, ym3 + mova [r0 + r1 * 2 + mmsize], ym0 + mova [r0 + r8 + mmsize], ym2 +%endmacro + +%macro PIXEL_AVG_HBD_W32 1 +INIT_ZMM avx512 +cglobal pixel_avg_32x%1, 6,9,4 + shl r1d, 1 + shl r3d, 1 + shl r5d, 1 + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + +%rep %1/8 - 1 + PROCESS_PIXELAVG_32x8_HBD_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] +%endrep + PROCESS_PIXELAVG_32x8_HBD_AVX512 + RET +%endmacro + +%if ARCH_X86_64 
+PIXEL_AVG_HBD_W32 8 +PIXEL_AVG_HBD_W32 16 +PIXEL_AVG_HBD_W32 24 +PIXEL_AVG_HBD_W32 32 +PIXEL_AVG_HBD_W32 64 +%endif +%macro PIXEL_AVG_HBD_ALIGNED_W32 1 +INIT_ZMM avx512 +cglobal pixel_avg_aligned_32x%1, 6,9,4 + shl r1d, 1 + shl r3d, 1 + shl r5d, 1 + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + +%rep %1/8 - 1 + PROCESS_PIXELAVG_ALIGNED_32x8_HBD_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] +%endrep + PROCESS_PIXELAVG_ALIGNED_32x8_HBD_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +PIXEL_AVG_HBD_ALIGNED_W32 8 +PIXEL_AVG_HBD_ALIGNED_W32 16 +PIXEL_AVG_HBD_ALIGNED_W32 24 +PIXEL_AVG_HBD_ALIGNED_W32 32 +PIXEL_AVG_HBD_ALIGNED_W32 64 +%endif + +%macro PIXEL_AVG_HBD_W64 1 +INIT_ZMM avx512 +cglobal pixel_avg_64x%1, 6,9,4 + shl r1d, 1 + shl r3d, 1 + shl r5d, 1 + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + +%rep %1/8 - 1 + PROCESS_PIXELAVG_64x8_HBD_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] +%endrep + PROCESS_PIXELAVG_64x8_HBD_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +PIXEL_AVG_HBD_W64 16 +PIXEL_AVG_HBD_W64 32 +PIXEL_AVG_HBD_W64 48 +PIXEL_AVG_HBD_W64 64 +%endif +%macro PIXEL_AVG_HBD_ALIGNED_W64 1 +INIT_ZMM avx512 +cglobal pixel_avg_aligned_64x%1, 6,9,4 + shl r1d, 1 + shl r3d, 1 + shl r5d, 1 + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + +%rep %1/8 - 1 + PROCESS_PIXELAVG_ALIGNED_64x8_HBD_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] +%endrep + PROCESS_PIXELAVG_ALIGNED_64x8_HBD_AVX512 + RET +%endmacro + +%if ARCH_X86_64 +PIXEL_AVG_HBD_ALIGNED_W64 16 +PIXEL_AVG_HBD_ALIGNED_W64 32 +PIXEL_AVG_HBD_ALIGNED_W64 48 +PIXEL_AVG_HBD_ALIGNED_W64 64 +%endif + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_avg_48x64, 6,9,4 + shl r1d, 1 + shl r3d, 1 + shl r5d, 1 + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + +%rep 7 + PROCESS_PIXELAVG_48x8_HBD_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] +%endrep + PROCESS_PIXELAVG_48x8_HBD_AVX512 + RET +%endif + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_avg_aligned_48x64, 6,9,4 + shl r1d, 1 + shl r3d, 1 + shl r5d, 1 + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + +%rep 7 + PROCESS_PIXELAVG_ALIGNED_48x8_HBD_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] +%endrep + PROCESS_PIXELAVG_ALIGNED_48x8_HBD_AVX512 + RET +%endif +;----------------------------------------------------------------------------- +;pixel_avg_pp avx512 high bit depth code end +;----------------------------------------------------------------------------- %endif ; HIGH_BIT_DEPTH %if HIGH_BIT_DEPTH == 0 @@ -5395,6 +7018,7 @@ jg .height_loop RET +%if ARCH_X86_64 INIT_YMM avx2 cglobal pixel_avg2_w20, 6,7 sub r2, r4 @@ -5411,6 +7035,7 @@ sub r5d, 2 jg .height_loop RET +%endif ; Cacheline split code for processors with high latencies for loads ; split over cache lines. See sad-a.asm for a more detailed explanation.
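The addAvg and pixel_avg_pp kernels added above all vectorize the same per-pixel arithmetic: addAvg sums two 16-bit intermediate prediction planes, applies a rounding shift back to pixel precision and clamps to the legal pixel range (the paddw/pmulhrsw/pmaxsw/pminsw sequence), while pixel_avg_pp is a plain rounded average of two pixel planes (pavgb/pavgw). As a reading aid, here is a rough scalar sketch of that arithmetic in C; the constant names and values (IF_INTERNAL_PREC, IF_INTERNAL_OFFS) and the explicit width/height parameters are assumptions for illustration and are not part of this diff.

/* --- editorial reference sketch, not part of the diff --- */
#include <stdint.h>

/* Assumed x265-style constants (illustrative): 16-bit intermediates kept at
 * 14-bit precision, 10-bit output pixels for the high-bit-depth build. */
#define IF_INTERNAL_PREC 14
#define IF_INTERNAL_OFFS (1 << (IF_INTERNAL_PREC - 1))
#define X265_DEPTH       10
#define PIXEL_MAX        ((1 << X265_DEPTH) - 1)

typedef uint16_t pixel;             /* uint8_t in the 8-bit build */

static inline pixel clip_pixel(int v)
{
    return (pixel)(v < 0 ? 0 : (v > PIXEL_MAX ? PIXEL_MAX : v));
}

/* Scalar model of addAvg: add two 16-bit intermediate prediction planes and
 * convert back to pixel precision with rounding, bias removal and clamping. */
static void addAvg_ref(const int16_t *src0, const int16_t *src1, pixel *dst,
                       intptr_t src0Stride, intptr_t src1Stride,
                       intptr_t dstStride, int width, int height)
{
    const int shift  = IF_INTERNAL_PREC + 1 - X265_DEPTH;
    const int offset = (1 << (shift - 1)) + 2 * IF_INTERNAL_OFFS;

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = clip_pixel((src0[x] + src1[x] + offset) >> shift);
        src0 += src0Stride;
        src1 += src1Stride;
        dst  += dstStride;
    }
}

/* Scalar model of pixel_avg_pp: rounded average of two pixel planes, which
 * the AVX-512 kernels implement with pavgb (8-bit) or pavgw (10/12-bit).
 * The trailing int argument of the asm prototype is omitted here. */
static void pixel_avg_pp_ref(pixel *dst, intptr_t dstride,
                             const pixel *src0, intptr_t sstride0,
                             const pixel *src1, intptr_t sstride1,
                             int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
            dst[x] = (pixel)((src0[x] + src1[x] + 1) >> 1);
        dst  += dstride;
        src0 += sstride0;
        src1 += sstride1;
    }
}

The _aligned variants introduced alongside these kernels compute the same result; they only assume suitably aligned buffers so the code can use aligned loads and stores (mova instead of movu).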
x265_2.7.tar.gz/source/common/x86/pixel-a.asm -> x265_2.9.tar.gz/source/common/x86/pixel-a.asm
Changed
@@ -45,6 +45,9 @@ times 2 dw 1, -1 times 4 dw 1 times 2 dw 1, -1 +psy_pp_shuff1: dq 0, 1, 8, 9, 4, 5, 12, 13 +psy_pp_shuff2: dq 2, 3, 10, 11, 6, 7, 14, 15 +psy_pp_shuff3: dq 0, 0, 8, 8, 1, 1, 9, 9 ALIGN 32 transd_shuf1: SHUFFLE_MASK_W 0, 8, 2, 10, 4, 12, 6, 14 @@ -8145,6 +8148,243 @@ %endif ; ARCH_X86_64=1 %endif ; HIGH_BIT_DEPTH +%macro SATD_AVX512_LOAD4 2 ; size, opmask + vpbroadcast%1 m0, [r0] + vpbroadcast%1 m0 {%2}, [r0+2*r1] + vpbroadcast%1 m2, [r2] + vpbroadcast%1 m2 {%2}, [r2+2*r3] + add r0, r1 + add r2, r3 + vpbroadcast%1 m1, [r0] + vpbroadcast%1 m1 {%2}, [r0+2*r1] + vpbroadcast%1 m3, [r2] + vpbroadcast%1 m3 {%2}, [r2+2*r3] +%endmacro + +%macro SATD_AVX512_LOAD8 5 ; size, halfreg, opmask1, opmask2, opmask3 + vpbroadcast%1 %{2}0, [r0] + vpbroadcast%1 %{2}0 {%3}, [r0+2*r1] + vpbroadcast%1 %{2}2, [r2] + vpbroadcast%1 %{2}2 {%3}, [r2+2*r3] + vpbroadcast%1 m0 {%4}, [r0+4*r1] + vpbroadcast%1 m2 {%4}, [r2+4*r3] + vpbroadcast%1 m0 {%5}, [r0+2*r4] + vpbroadcast%1 m2 {%5}, [r2+2*r5] + vpbroadcast%1 %{2}1, [r0+r1] + vpbroadcast%1 %{2}1 {%3}, [r0+r4] + vpbroadcast%1 %{2}3, [r2+r3] + vpbroadcast%1 %{2}3 {%3}, [r2+r5] + lea r0, [r0+4*r1] + lea r2, [r2+4*r3] + vpbroadcast%1 m1 {%4}, [r0+r1] + vpbroadcast%1 m3 {%4}, [r2+r3] + vpbroadcast%1 m1 {%5}, [r0+r4] + vpbroadcast%1 m3 {%5}, [r2+r5] +%endmacro + +%macro SATD_AVX512_PACKED 0 + DIFF_SUMSUB_SSSE3 0, 2, 1, 3, 4 + SUMSUB_BA w, 0, 1, 2 + SBUTTERFLY qdq, 0, 1, 2 + SUMSUB_BA w, 0, 1, 2 + HMAXABSW2 0, 1, 2, 3 +%endmacro + +%macro SATD_AVX512_END 0-1 0 ; sa8d + paddw m0 {k1}{z}, m1 ; zero-extend to dwords +%if ARCH_X86_64 +%if mmsize == 64 + vextracti32x8 ym1, m0, 1 + paddd ym0, ym1 +%endif +%if mmsize >= 32 + vextracti128 xm1, ym0, 1 + paddd xmm0, xm0, xm1 +%endif + punpckhqdq xmm1, xmm0, xmm0 + paddd xmm0, xmm1 + movq rax, xmm0 + rorx rdx, rax, 32 +%if %1 + lea eax, [rax+rdx+1] + shr eax, 1 +%else + add eax, edx +%endif +%else + HADDD m0, m1 + movd eax, xm0 +%if %1 + inc eax + shr eax, 1 +%endif +%endif + RET +%endmacro + +%macro HMAXABSW2 4 ; a, b, tmp1, tmp2 + pabsw m%1, m%1 + pabsw m%2, m%2 + psrldq m%3, m%1, 2 + psrld m%4, m%2, 16 + pmaxsw m%1, m%3 + pmaxsw m%2, m%4 +%endmacro +%if HIGH_BIT_DEPTH==0 +INIT_ZMM avx512 +cglobal pixel_satd_16x8_internal + vbroadcasti64x4 m6, [hmul_16p] + kxnorb k2, k2, k2 + mov r4d, 0x55555555 + knotw k2, k2 + kmovd k1, r4d + lea r4, [3*r1] + lea r5, [3*r3] +satd_16x8_avx512: + vbroadcasti128 ym0, [r0] + vbroadcasti32x4 m0 {k2}, [r0+4*r1] ; 0 0 4 4 + vbroadcasti128 ym4, [r2] + vbroadcasti32x4 m4 {k2}, [r2+4*r3] + vbroadcasti128 ym2, [r0+2*r1] + vbroadcasti32x4 m2 {k2}, [r0+2*r4] ; 2 2 6 6 + vbroadcasti128 ym5, [r2+2*r3] + vbroadcasti32x4 m5 {k2}, [r2+2*r5] + DIFF_SUMSUB_SSSE3 0, 4, 2, 5, 6 + vbroadcasti128 ym1, [r0+r1] + vbroadcasti128 ym4, [r2+r3] + vbroadcasti128 ym3, [r0+r4] + vbroadcasti128 ym5, [r2+r5] + lea r0, [r0+4*r1] + lea r2, [r2+4*r3] + vbroadcasti32x4 m1 {k2}, [r0+r1] ; 1 1 5 5 + vbroadcasti32x4 m4 {k2}, [r2+r3] + vbroadcasti32x4 m3 {k2}, [r0+r4] ; 3 3 7 7 + vbroadcasti32x4 m5 {k2}, [r2+r5] + DIFF_SUMSUB_SSSE3 1, 4, 3, 5, 6 + HADAMARD4_V 0, 1, 2, 3, 4 + HMAXABSW2 0, 2, 4, 5 + HMAXABSW2 1, 3, 4, 5 + paddw m4, m0, m2 ; m1 + paddw m2, m1, m3 ; m0 + ret + +cglobal pixel_satd_8x8_internal + vbroadcasti64x4 m4, [hmul_16p] + mov r4d, 0x55555555 + kmovd k1, r4d ; 01010101 + kshiftlb k2, k1, 5 ; 10100000 + kshiftlb k3, k1, 4 ; 01010000 + lea r4, [3*r1] + lea r5, [3*r3] +satd_8x8_avx512: + SATD_AVX512_LOAD8 q, ym, k1, k2, k3 ; 2 0 2 0 6 4 6 4 + SATD_AVX512_PACKED ; 3 1 3 1 7 5 7 5 + ret + +cglobal 
pixel_satd_16x8, 4,6 + call pixel_satd_16x8_internal_avx512 + jmp satd_zmm_avx512_end + +cglobal pixel_satd_16x16, 4,6 + call pixel_satd_16x8_internal_avx512 + lea r0, [r0+4*r1] + lea r2, [r2+4*r3] + paddw m7, m0, m1 + call satd_16x8_avx512 + paddw m1, m7 + jmp satd_zmm_avx512_end + +cglobal pixel_satd_8x8, 4,6 + call pixel_satd_8x8_internal_avx512 +satd_zmm_avx512_end: + SATD_AVX512_END + +cglobal pixel_satd_8x16, 4,6 + call pixel_satd_8x8_internal_avx512 + lea r0, [r0+4*r1] + lea r2, [r2+4*r3] + paddw m5, m0, m1 + call satd_8x8_avx512 + paddw m1, m5 + jmp satd_zmm_avx512_end + +INIT_YMM avx512 +cglobal pixel_satd_4x8_internal + vbroadcasti128 m4, [hmul_4p] + mov r4d, 0x55550c + kmovd k2, r4d ; 00001100 + kshiftlb k3, k2, 2 ; 00110000 + kshiftlb k4, k2, 4 ; 11000000 + kshiftrd k1, k2, 8 ; 01010101 + lea r4, [3*r1] + lea r5, [3*r3] +satd_4x8_avx512: + SATD_AVX512_LOAD8 d, xm, k2, k3, k4 ; 0 0 2 2 4 4 6 6 +satd_ymm_avx512: ; 1 1 3 3 5 5 7 7 + SATD_AVX512_PACKED + ret + +cglobal pixel_satd_8x4, 4,5 + mova m4, [hmul_16p] + mov r4d, 0x5555 + kmovw k1, r4d + SATD_AVX512_LOAD4 q, k1 ; 2 0 2 0 + call satd_ymm_avx512 ; 3 1 3 1 + jmp satd_ymm_avx512_end2 + +cglobal pixel_satd_4x8, 4,6 + call pixel_satd_4x8_internal_avx512 +satd_ymm_avx512_end: +%if ARCH_X86_64 == 0 + pop r5d + %assign regs_used 5 +%endif +satd_ymm_avx512_end2: + SATD_AVX512_END + +cglobal pixel_satd_4x16, 4,6 + call pixel_satd_4x8_internal_avx512 + lea r0, [r0+4*r1] + lea r2, [r2+4*r3] + paddw m5, m0, m1 + call satd_4x8_avx512 + paddw m1, m5 + jmp satd_ymm_avx512_end + +INIT_XMM avx512 +cglobal pixel_satd_4x4, 4,5 + mova m4, [hmul_4p] + mov r4d, 0x550c + kmovw k2, r4d + kshiftrw k1, k2, 8 + SATD_AVX512_LOAD4 d, k2 ; 0 0 2 2 + SATD_AVX512_PACKED ; 1 1 3 3 + SWAP 0, 1 + SATD_AVX512_END + +INIT_ZMM avx512 +cglobal pixel_sa8d_8x8, 4,6 + vbroadcasti64x4 m4, [hmul_16p] + mov r4d, 0x55555555 + kmovd k1, r4d ; 01010101 + kshiftlb k2, k1, 5 ; 10100000 + kshiftlb k3, k1, 4 ; 01010000 + lea r4, [3*r1] + lea r5, [3*r3] + SATD_AVX512_LOAD8 q, ym, k1, k2, k3 ; 2 0 2 0 6 4 6 4 + DIFF_SUMSUB_SSSE3 0, 2, 1, 3, 4 ; 3 1 3 1 7 5 7 5 + SUMSUB_BA w, 0, 1, 2 + SBUTTERFLY qdq, 0, 1, 2 + SUMSUB_BA w, 0, 1, 2 + shufps m2, m0, m1, q2020 + shufps m1, m0, m1, q3131 + SUMSUB_BA w, 2, 1, 0 + vshufi32x4 m0, m2, m1, q1010 + vshufi32x4 m1, m2, m1, q3232 + SUMSUB_BA w, 0, 1, 2 + HMAXABSW2 0, 1, 2, 3 + SATD_AVX512_END 1 +%endif ; Input 10bit, Output 8bit ;------------------------------------------------------------------------------------------------------------------------ ;void planecopy_sc(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask) @@ -8523,8 +8763,53 @@ .end: RET +INIT_ZMM avx512 +cglobal upShift_16, 4,7,4 + mov r4d, r4m + mov r5d, r5m + movd xm0, r6m ; m0 = shift + vbroadcasti32x4 m3, [pw_pixel_max] + FIX_STRIDES r1d, r3d + dec r5d +.loopH: + xor r6d, r6d +.loopW: + movu m1, [r0 + r6 * SIZEOF_PIXEL] + psllw m1, xm0 + pand m1, m3 + movu [r2 + r6 * SIZEOF_PIXEL], m1 + + add r6, mmsize / SIZEOF_PIXEL + cmp r6d, r4d + jl .loopW + + ; move to next row + add r0, r1 + add r2, r3 + dec r5d + jnz .loopH + ; processing last row of every frame [To handle width which not a multiple of 32] +.loop32: + movu m1, [r0 + (r4 - mmsize/2) * 2] + psllw m1, xm0 + pand m1, m3 + movu [r2 + (r4 - mmsize/2) * 2], m1 + + sub r4d, mmsize/2 + jz .end + cmp r4d, mmsize/2 + jge .loop32 + + ; process partial pixels + movu m1, [r0] + psllw m1, xm0 + pand m1, m3 + movu [r2], m1 + +.end: + RET 
;--------------------------------------------------------------------------------------------------------------------- ;int psyCost_pp(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride) ;--------------------------------------------------------------------------------------------------------------------- @@ -10166,6 +10451,590 @@ pabsd m11, m11 %endmacro +%macro PSY_COST_PP_8x8_AVX512_MAIN12 0 + ; load source and recon pixels + lea r4, [r1 * 3] + pmovzxwd ym0, [r0] + pmovzxwd ym1, [r0 + r1] + pmovzxwd ym2, [r0 + r1 * 2] + pmovzxwd ym3, [r0 + r4] + lea r5, [r0 + r1 * 4] + pmovzxwd ym4, [r5] + pmovzxwd ym5, [r5 + r1] + pmovzxwd ym6, [r5 + r1 * 2] + pmovzxwd ym7, [r5 + r4] + + lea r4, [r3 * 3] + pmovzxwd ym16, [r2] + pmovzxwd ym17, [r2 + r3] + pmovzxwd ym18, [r2 + r3 * 2] + pmovzxwd ym19, [r2 + r4] + lea r5, [r2 + r3 * 4] + pmovzxwd ym20, [r5] + pmovzxwd ym21, [r5 + r3] + pmovzxwd ym22, [r5 + r3 * 2] + pmovzxwd ym23, [r5 + r4] + + vinserti64x4 m0, m0, ym16, 1 + vinserti64x4 m1, m1, ym17, 1 + vinserti64x4 m2, m2, ym18, 1 + vinserti64x4 m3, m3, ym19, 1 + vinserti64x4 m4, m4, ym20, 1 + vinserti64x4 m5, m5, ym21, 1 + vinserti64x4 m6, m6, ym22, 1 + vinserti64x4 m7, m7, ym23, 1 + + ; source + recon SAD + paddd m8, m0, m1 + paddd m8, m2 + paddd m8, m3 + paddd m8, m4 + paddd m8, m5 + paddd m8, m6 + paddd m8, m7 + + vextracti64x4 ym15, m8, 1 + + vextracti128 xm9, ym8, 1 + paddd ym8, ym9 ; sad_8x8 + movhlps xm9, xm8 + paddd xm8, xm9 + pshuflw xm9, xm8, 0Eh + paddd xm8, xm9 + psrld ym8, 2 + + vextracti128 xm9, ym15, 1 + paddd ym15, ym9 ; sad_8x8 + movhlps xm9, xm15 + paddd xm15, xm9 + pshuflw xm9, xm15, 0Eh + paddd xm15, xm9 + psrld ym15, 2 + + ; source and recon SA8D + psubd m9, m1, m0 + paddd m0, m1 + psubd m1, m3, m2 + paddd m2, m3 + punpckhdq m3, m0, m9 + punpckldq m0, m9 + psubd m9, m3, m0 + paddd m0, m3 + punpckhdq m3, m2, m1 + punpckldq m2, m1 + psubd m10, m3, m2 + paddd m2, m3 + psubd m3, m5, m4 + paddd m4, m5 + psubd m5, m7, m6 + paddd m6, m7 + punpckhdq m1, m4, m3 + punpckldq m4, m3 + psubd m7, m1, m4 + paddd m4, m1 + punpckhdq m3, m6, m5 + punpckldq m6, m5 + psubd m1, m3, m6 + paddd m6, m3 + psubd m3, m2, m0 + paddd m0, m2 + psubd m2, m10, m9 + paddd m9, m10 + punpckhqdq m5, m0, m3 + punpcklqdq m0, m3 + psubd m10, m5, m0 + paddd m0, m5 + punpckhqdq m3, m9, m2 + punpcklqdq m9, m2 + psubd m5, m3, m9 + paddd m9, m3 + psubd m3, m6, m4 + paddd m4, m6 + psubd m6, m1, m7 + paddd m7, m1 + punpckhqdq m2, m4, m3 + punpcklqdq m4, m3 + psubd m1, m2, m4 + paddd m4, m2 + punpckhqdq m3, m7, m6 + punpcklqdq m7, m6 + + psubd m2, m3, m7 + paddd m7, m3 + psubd m3, m4, m0 + paddd m0, m4 + psubd m4, m1, m10 + paddd m10, m1 + + mova m16, m13 + mova m17, m14 + vpermi2q m16, m0, m3 + vpermi2q m17, m0, m3 + + pabsd m17, m17 + pabsd m16, m16 + pmaxsd m17, m16 + + mova m18, m13 + mova m19, m14 + vpermi2q m18, m10, m4 + vpermi2q m19, m10, m4 + + pabsd m19, m19 + pabsd m18, m18 + pmaxsd m19, m18 + psubd m18, m7, m9 + paddd m9, m7 + psubd m7, m2, m5 + paddd m5, m2 + + mova m20, m13 + mova m21, m14 + vpermi2q m20, m9, m18 + vpermi2q m21, m9, m18 + + pabsd m21, m21 + pabsd m20, m20 + pmaxsd m21, m20 + + mova m22, m13 + mova m23, m14 + vpermi2q m22, m5, m7 + vpermi2q m23, m5, m7 + + pabsd m23, m23 + pabsd m22, m22 + pmaxsd m23, m22 + paddd m17, m21 + paddd m17, m19 + paddd m17, m23 + + vextracti64x4 ym26, m17, 1 + + vextracti128 xm9, m17, 1 + paddd ym17, ym9 ; sad_8x8 + movhlps xm9, xm17 + paddd xm17, xm9 + pshuflw xm9, xm17, 0Eh + paddd xm17, xm9 + paddd ym17, [pd_1] + psrld ym17, 1 ; sa8d_8x8 + + 
vextracti128 xm9, ym26, 1 + paddd ym26, ym9 ; sad_8x8 + movhlps xm9, xm26 + paddd xm26, xm9 + pshuflw xm9, xm26, 0Eh + paddd xm26, xm9 + paddd ym26, [pd_1] + psrld ym26, 1 ; sa8d_8x8 + + + + psubd ym11, ym17, ym8 ; sa8d_8x8 - sad_8x8 + psubd ym12, ym26, ym15 ; sa8d_8x8 - sad_8x8 + + psubd ym11, ym12 + pabsd ym11, ym11 +%endmacro + +%macro PSY_PP_INPUT_AVX512_MAIN10 0 + lea r4, [r1 * 3] + movu xm0, [r0] + movu xm1, [r0 + r1] + movu xm2, [r0 + r1 * 2] + movu xm3, [r0 + r4] + lea r5, [r0 + r1 * 4] + movu xm4, [r5] + movu xm5, [r5 + r1] + movu xm6, [r5 + r1 * 2] + movu xm7, [r5 + r4] + + lea r4, [r3 * 3] + vinserti128 ym0, ym0, [r2], 1 + vinserti128 ym1, ym1, [r2 + r3], 1 + vinserti128 ym2, ym2, [r2 + r3 * 2], 1 + vinserti128 ym3, ym3, [r2 + r4], 1 + lea r5, [r2 + r3 * 4] + vinserti128 ym4, ym4, [r5], 1 + vinserti128 ym5, ym5, [r5 + r3], 1 + vinserti128 ym6, ym6, [r5 + r3 * 2], 1 + vinserti128 ym7, ym7, [r5 + r4], 1 + + add r0, 16 + add r2, 16 + + lea r4, [r1 * 3] + vinserti32x4 m0, m0, [r0], 2 + vinserti32x4 m1, m1, [r0 + r1], 2 + vinserti32x4 m2, m2, [r0 + r1 * 2], 2 + vinserti32x4 m3, m3, [r0 + r4], 2 + lea r5, [r0 + r1 * 4] + vinserti32x4 m4, m4, [r5], 2 + vinserti32x4 m5, m5, [r5 + r1], 2 + vinserti32x4 m6, m6, [r5 + r1 * 2], 2 + vinserti32x4 m7, m7, [r5 + r4], 2 + + lea r4, [r3 * 3] + vinserti32x4 m0, m0, [r2], 3 + vinserti32x4 m1, m1, [r2 + r3], 3 + vinserti32x4 m2, m2, [r2 + r3 * 2], 3 + vinserti32x4 m3, m3, [r2 + r4], 3 + lea r5, [r2 + r3 * 4] + vinserti32x4 m4, m4, [r5], 3 + vinserti32x4 m5, m5, [r5 + r3], 3 + vinserti32x4 m6, m6, [r5 + r3 * 2], 3 + vinserti32x4 m7, m7, [r5 + r4], 3 +%endmacro + + +%macro PSY_PP_16x8_AVX512_MAIN10 0 + paddw m8, m0, m1 + paddw m8, m2 + paddw m8, m3 + paddw m8, m4 + paddw m8, m5 + paddw m8, m6 + paddw m8, m7 + pmaddwd m8, m14 + + psrldq m9, m8, 8 + paddd m8, m9 + psrldq m9, m8, 4 + paddd m8, m9 + psrld m8, 2 + + psubw m9, m1, m0 + paddw m0, m1 + psubw m1, m3, m2 + paddw m2, m3 + punpckhwd m3, m0, m9 + punpcklwd m0, m9 + psubw m9, m3, m0 + paddw m0, m3 + punpckhwd m3, m2, m1 + punpcklwd m2, m1 + psubw m10, m3, m2 + paddw m2, m3 + + psubw m3, m5, m4 + paddw m4, m5 + psubw m5, m7, m6 + paddw m6, m7 + punpckhwd m1, m4, m3 + punpcklwd m4, m3 + psubw m7, m1, m4 + paddw m4, m1 + punpckhwd m3, m6, m5 + punpcklwd m6, m5 + psubw m1, m3, m6 + paddw m6, m3 + + psubw m3, m2, m0 + paddw m0, m2 + psubw m2, m10, m9 + paddw m9, m10 + punpckhdq m5, m0, m3 + punpckldq m0, m3 + psubw m10, m5, m0 + paddw m0, m5 + punpckhdq m3, m9, m2 + punpckldq m9, m2 + psubw m5, m3, m9 + paddw m9, m3 + + psubw m3, m6, m4 + paddw m4, m6 + psubw m6, m1, m7 + paddw m7, m1 + punpckhdq m2, m4, m3 + punpckldq m4, m3 + psubw m1, m2, m4 + paddw m4, m2 + punpckhdq m3, m7, m6 + punpckldq m7, m6 + psubw m2, m3, m7 + paddw m7, m3 + + psubw m3, m4, m0 + paddw m0, m4 + psubw m4, m1, m10 + paddw m10, m1 + punpckhqdq m6, m0, m3 + punpcklqdq m0, m3 + pabsw m0, m0 + pabsw m6, m6 + pmaxsw m0, m6 + punpckhqdq m3, m10, m4 + punpcklqdq m10, m4 + pabsw m10, m10 + pabsw m3, m3 + pmaxsw m10, m3 + + psubw m3, m7, m9 + paddw m9, m7 + psubw m7, m2, m5 + paddw m5, m2 + punpckhqdq m4, m9, m3 + punpcklqdq m9, m3 + pabsw m9, m9 + pabsw m4, m4 + pmaxsw m9, m4 + punpckhqdq m3, m5, m7 + punpcklqdq m5, m7 + pabsw m5, m5 + pabsw m3, m3 + pmaxsw m5, m3 + + paddd m0, m9 + paddd m0, m10 + paddd m0, m5 + psrld m9, m0, 16 + pslld m0, 16 + psrld m0, 16 + paddd m0, m9 + psrldq m9, m0, 8 + paddd m0, m9 + psrldq m9, m0, 4 + paddd m0, m9 + paddd m0, m15 + psrld m0, 1 + psubd m0, m8 + + vextracti64x4 ym2, m0, 1 + + vextracti128 xm3, 
ym2, 1 + psubd xm3, xm2 + pabsd xm3, xm3 + + vextracti128 xm1, ym0, 1 + psubd xm1, xm0 + pabsd xm1, xm1 + paddd xm1, xm3 +%endmacro + +%macro PSY_PP_INPUT_AVX512_MAIN 0 + movu xm16, [r0 + r1 * 0] + movu xm17, [r0 + r1 * 1] + movu xm18, [r0 + r1 * 2] + movu xm19, [r0 + r4 * 1] + + movu xm20, [r2 + r3 * 0] + movu xm21, [r2 + r3 * 1] + movu xm22, [r2 + r3 * 2] + movu xm23, [r2 + r7 * 1] + + mova m0, m26 + vpermi2q m0, m16, m20 + mova m1, m26 + vpermi2q m1, m17, m21 + mova m2, m26 + vpermi2q m2, m18, m22 + mova m3, m26 + vpermi2q m3, m19, m23 + + + lea r5, [r0 + r1 * 4] + lea r6, [r2 + r3 * 4] + + movu xm16, [r5 + r1 * 0] + movu xm17, [r5 + r1 * 1] + movu xm18, [r5 + r1 * 2] + movu xm19, [r5 + r4 * 1] + + movu xm20, [r6 + r3 * 0] + movu xm21, [r6 + r3 * 1] + movu xm22, [r6 + r3 * 2] + movu xm23, [r6 + r7 * 1] + + mova m4, m26 + vpermi2q m4, m16, m20 + mova m5, m26 + vpermi2q m5, m17, m21 + mova m6, m26 + vpermi2q m6, m18, m22 + mova m7, m26 + vpermi2q m7, m19, m23 +%endmacro + +%macro PSY_PP_16x8_AVX512_MAIN 0 + pmaddubsw m0, m8 + pmaddubsw m1, m8 + pmaddubsw m2, m8 + pmaddubsw m3, m8 + pmaddubsw m4, m8 + pmaddubsw m5, m8 + pmaddubsw m6, m8 + pmaddubsw m7, m8 + + paddw m11, m0, m1 + paddw m11, m2 + paddw m11, m3 + paddw m11, m4 + paddw m11, m5 + paddw m11, m6 + paddw m11, m7 + + pmaddwd m11, m14 + psrldq m10, m11, 4 + paddd m11, m10 + psrld m11, 2 + + mova m9, m0 + paddw m0, m1 + psubw m1, m9 + mova m9, m2 + paddw m2, m3 + psubw m3, m9 + mova m9, m0 + paddw m0, m2 + psubw m2, m9 + mova m9, m1 + paddw m1, m3 + psubw m3, m9 + + movdqa m9, m4 + paddw m4, m5 + psubw m5, m9 + movdqa m9, m6 + paddw m6, m7 + psubw m7, m9 + movdqa m9, m4 + paddw m4, m6 + psubw m6, m9 + movdqa m9, m5 + paddw m5, m7 + psubw m7, m9 + + movdqa m9, m0 + paddw m0, m4 + psubw m4, m9 + movdqa m9, m1 + paddw m1, m5 + psubw m5, m9 + + mova m9, m0 + vshufps m9, m9, m4, 11011101b + vshufps m0, m0, m4, 10001000b + + movdqa m4, m0 + paddw m16, m0, m9 + psubw m17, m9, m4 + + movaps m4, m1 + vshufps m4, m4, m5, 11011101b + vshufps m1, m1, m5, 10001000b + + movdqa m5, m1 + paddw m18, m1, m4 + psubw m19, m4, m5 + + movdqa m5, m2 + paddw m2, m6 + psubw m6, m5 + movdqa m5, m3 + paddw m3, m7 + psubw m7, m5 + + movaps m5, m2 + vshufps m5, m5, m6, 11011101b + vshufps m2, m2, m6, 10001000b + + movdqa m6, m2 + paddw m20, m2, m5 + psubw m21, m5, m6 + + movaps m6, m3 + + vshufps m6, m6, m7, 11011101b + vshufps m3, m3, m7, 10001000b + + movdqa m7, m3 + paddw m22, m3, m6 + psubw m23, m6, m7 + + movdqa m7, m16 + + vextracti64x4 ym24, m16, 1 + vextracti64x4 ym25, m17, 1 + pblendw ym16, ym17, 10101010b + pblendw ym24, ym25, 10101010b + vinserti64x4 m16, m16, ym24, 1 + + pslld m17, 10h + psrld m7, 10h + por m17, m7 + pabsw m16, m16 + pabsw m17, m17 + pmaxsw m16, m17 + movdqa m7, m18 + + vextracti64x4 ym24, m18, 1 + vextracti64x4 ym25, m19, 1 + pblendw ym18, ym19, 10101010b + pblendw ym24, ym25, 10101010b + vinserti64x4 m18, m18, ym24, 1 + + pslld m19, 10h + psrld m7, 10h + por m19, m7 + pabsw m18, m18 + pabsw m19, m19 + pmaxsw m18, m19 + movdqa m7, m20 + + vextracti64x4 ym24, m20, 1 + vextracti64x4 ym25, m21, 1 + pblendw ym20, ym21, 10101010b + pblendw ym24, ym25, 10101010b + vinserti64x4 m20, m20, ym24, 1 + + pslld m21, 10h + psrld m7, 10h + por m21, m7 + pabsw m20, m20 + pabsw m21, m21 + pmaxsw m20, m21 + mova m7, m22 + + vextracti64x4 ym24, m22, 1 + vextracti64x4 ym25, m23, 1 + pblendw ym22, ym23, 10101010b + pblendw ym24, ym25, 10101010b + vinserti64x4 m22, m22, ym24, 1 + + pslld m23, 10h + psrld m7, 10h + por m23, m7 + pabsw m22, m22 + pabsw 
m23, m23 + pmaxsw m22, m23 + paddw m16, m18 + paddw m16, m20 + paddw m16, m22 + pmaddwd m16, m14 + psrldq m1, m16, 8 + paddd m16, m1 + + pshuflw m1, m16, 00001110b + paddd m16, m1 + paddd m16, m15 + psrld m16, 1 + + psubd m16, m11 + vextracti64x4 ym2, m16, 1 + + vextracti128 xm1, ym16, 1 + psubd xm16, xm1 + pabsd xm16, xm16 + + vextracti128 xm3, ym2, 1 + psubd xm3, xm2 + pabsd xm3, xm3 + paddd xm16, xm3 +%endmacro + + %if ARCH_X86_64 INIT_YMM avx2 %if HIGH_BIT_DEPTH && BIT_DEPTH == 12 @@ -10435,6 +11304,257 @@ RET %endif %endif +%if ARCH_X86_64 +INIT_ZMM avx512 +%if HIGH_BIT_DEPTH && BIT_DEPTH == 12 +cglobal psyCost_pp_16x16, 4, 10, 27 + add r1d, r1d + add r3d, r3d + pxor m24, m24 + movu m13, [psy_pp_shuff1] + movu m14, [psy_pp_shuff2] + + mov r8d, 2 +.loopH: + mov r9d, 2 +.loopW: + PSY_COST_PP_8x8_AVX512_MAIN12 + + paddd xm24, xm11 + add r0, 16 + add r2, 16 + dec r9d + jnz .loopW + lea r0, [r0 + r1 * 8 - 32] + lea r2, [r2 + r3 * 8 - 32] + dec r8d + jnz .loopH + movd eax, xm24 + RET +%endif + +%if HIGH_BIT_DEPTH && BIT_DEPTH == 10 +cglobal psyCost_pp_16x16, 4, 10, 16 + add r1d, r1d + add r3d, r3d + pxor m11, m11 + vbroadcasti32x8 m14, [pw_1] + vbroadcasti32x8 m15, [pd_1] + + mov r8d, 2 +.loopH: + PSY_PP_INPUT_AVX512_MAIN10 + PSY_PP_16x8_AVX512_MAIN10 + + paddd xm11, xm1 + lea r0, [r0 + r1 * 8 - 16] + lea r2, [r2 + r3 * 8 - 16] + dec r8d + jnz .loopH + movd eax, xm11 + RET +%endif + +%if BIT_DEPTH == 8 +cglobal psyCost_pp_16x16, 4, 10, 27 + lea r4, [3 * r1] + lea r7, [3 * r3] + vbroadcasti32x8 m8, [hmul_8p] + pxor m13, m13 + vbroadcasti32x8 m14, [pw_1] + vbroadcasti32x8 m15, [pd_1] + movu m26, [psy_pp_shuff3] + + mov r8d, 2 +.loopH: + PSY_PP_INPUT_AVX512_MAIN + PSY_PP_16x8_AVX512_MAIN + + paddd m13, m16 + lea r0, [r0 + r1 * 8] + lea r2, [r2 + r3 * 8] + dec r8d + jnz .loopH + movd eax, xm13 + RET +%endif +%endif + +%if ARCH_X86_64 +INIT_ZMM avx512 +%if HIGH_BIT_DEPTH && BIT_DEPTH == 12 +cglobal psyCost_pp_32x32, 4, 10, 27 + add r1d, r1d + add r3d, r3d + pxor m24, m24 + movu m13, [psy_pp_shuff1] + movu m14, [psy_pp_shuff2] + + mov r8d, 4 +.loopH: + mov r9d, 4 +.loopW: + PSY_COST_PP_8x8_AVX512_MAIN12 + + paddd xm24, xm11 + add r0, 16 + add r2, 16 + dec r9d + jnz .loopW + lea r0, [r0 + r1 * 8 - 64] + lea r2, [r2 + r3 * 8 - 64] + dec r8d + jnz .loopH + movd eax, xm24 + RET +%endif + +%if HIGH_BIT_DEPTH && BIT_DEPTH == 10 +cglobal psyCost_pp_32x32, 4, 10, 16 + add r1d, r1d + add r3d, r3d + pxor m11, m11 + vbroadcasti32x8 m14, [pw_1] + vbroadcasti32x8 m15, [pd_1] + + mov r8d, 4 +.loopH: + mov r9d, 2 +.loopW: + PSY_PP_INPUT_AVX512_MAIN10 + PSY_PP_16x8_AVX512_MAIN10 + + paddd xm11, xm1 + add r0, 16 + add r2, 16 + dec r9d + jnz .loopW + lea r0, [r0 + r1 * 8 - 64] + lea r2, [r2 + r3 * 8 - 64] + dec r8d + jnz .loopH + movd eax, xm11 + RET +%endif + +%if BIT_DEPTH == 8 +cglobal psyCost_pp_32x32, 4, 10, 27 + lea r4, [3 * r1] + lea r7, [3 * r3] + vbroadcasti32x8 m8, [hmul_8p] + pxor m13, m13 + vbroadcasti32x8 m14, [pw_1] + vbroadcasti32x8 m15, [pd_1] + movu m26, [psy_pp_shuff3] + + mov r8d, 4 +.loopH: + mov r9d, 2 +.loopW: + PSY_PP_INPUT_AVX512_MAIN + PSY_PP_16x8_AVX512_MAIN + + paddd m13, m16 + add r0, 16 + add r2, 16 + dec r9d + jnz .loopW + lea r0, [r0 + r1 * 8 - 32] + lea r2, [r2 + r3 * 8 - 32] + dec r8d + jnz .loopH + movd eax, xm13 + RET +%endif +%endif + +%if ARCH_X86_64 +INIT_ZMM avx512 +%if HIGH_BIT_DEPTH && BIT_DEPTH == 12 +cglobal psyCost_pp_64x64, 4, 10, 27 + add r1d, r1d + add r3d, r3d + pxor m24, m24 + movu m13, [psy_pp_shuff1] + movu m14, [psy_pp_shuff2] + + mov r8d, 8 +.loopH: + mov r9d, 8 
+.loopW: + PSY_COST_PP_8x8_AVX512_MAIN12 + + paddd xm24, xm11 + add r0, 16 + add r2, 16 + dec r9d + jnz .loopW + lea r0, [r0 + r1 * 8 - 128] + lea r2, [r2 + r3 * 8 - 128] + dec r8d + jnz .loopH + movd eax, xm24 + RET +%endif + +%if HIGH_BIT_DEPTH && BIT_DEPTH == 10 +cglobal psyCost_pp_64x64, 4, 10, 16 + add r1d, r1d + add r3d, r3d + pxor m11, m11 + vbroadcasti32x8 m14, [pw_1] + vbroadcasti32x8 m15, [pd_1] + + mov r8d, 8 +.loopH: + mov r9d, 4 +.loopW: + PSY_PP_INPUT_AVX512_MAIN10 + PSY_PP_16x8_AVX512_MAIN10 + + paddd xm11, xm1 + add r0, 16 + add r2, 16 + dec r9d + jnz .loopW + lea r0, [r0 + r1 * 8 - 128] + lea r2, [r2 + r3 * 8 - 128] + dec r8d + jnz .loopH + movd eax, xm11 + RET +%endif + +%if BIT_DEPTH == 8 +cglobal psyCost_pp_64x64, 4, 10, 27 + lea r4, [3 * r1] + lea r7, [3 * r3] + vbroadcasti32x8 m8, [hmul_8p] + pxor m13, m13 + vbroadcasti32x8 m14, [pw_1] + vbroadcasti32x8 m15, [pd_1] + movu m26, [psy_pp_shuff3] + + mov r8d, 8 +.loopH: + mov r9d, 4 +.loopW: + PSY_PP_INPUT_AVX512_MAIN + PSY_PP_16x8_AVX512_MAIN + + paddd m13, m16 + add r0, 16 + add r2, 16 + dec r9d + jnz .loopW + lea r0, [r0 + r1 * 8 - 64] + lea r2, [r2 + r3 * 8 - 64] + dec r8d + jnz .loopH + movd eax, xm13 + RET +%endif +%endif ;--------------------------------------------------------------------------------------------------------------------- ;int psyCost_ss(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride) @@ -12993,8 +14113,134 @@ paddd xm0, xm1 movd eax, xm0 RET -%endif ; ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0 +%macro PROCESS_SATD_32x4_AVX512 0 ; function to compute satd cost for 32 columns, 4 rows + ; rows 0-3 + pmovzxbw m0, [r0] + pmovzxbw m4, [r2] + psubw m0, m4 + pmovzxbw m1, [r0 + r1] + pmovzxbw m5, [r2 + r3] + psubw m1, m5 + pmovzxbw m2, [r0 + r1 * 2] + pmovzxbw m4, [r2 + r3 * 2] + psubw m2, m4 + pmovzxbw m3, [r0 + r4] + pmovzxbw m5, [r2 + r5] + psubw m3, m5 + paddw m4, m0, m1 + psubw m1, m0 + paddw m0, m2, m3 + psubw m3, m2 + punpckhwd m2, m4, m1 + punpcklwd m4, m1 + punpckhwd m1, m0, m3 + punpcklwd m0, m3 + paddw m3, m4, m0 + psubw m0, m4 + paddw m4, m2, m1 + psubw m1, m2 + punpckhdq m2, m3, m0 + punpckldq m3, m0 + paddw m0, m3, m2 + psubw m2, m3 + punpckhdq m3, m4, m1 + punpckldq m4, m1 + paddw m1, m4, m3 + psubw m3, m4 + punpckhqdq m4, m0, m1 + punpcklqdq m0, m1 + pabsw m0, m0 + pabsw m4, m4 + pmaxsw m0, m0, m4 + punpckhqdq m1, m2, m3 + punpcklqdq m2, m3 + pabsw m2, m2 + pabsw m1, m1 + pmaxsw m2, m1 + pxor m7, m7 + mova m1, m0 + punpcklwd m1, m7 + paddd m6, m1 + mova m1, m0 + punpckhwd m1, m7 + paddd m6, m1 + pxor m7, m7 + mova m1, m2 + punpcklwd m1, m7 + paddd m6, m1 + mova m1, m2 + punpckhwd m1, m7 + paddd m6, m1 +%endmacro + +%macro SATD_MAIN_AVX512_END 0 + vextracti32x8 ym7, m6, 1 + paddd ym6, ym7 + vextracti128 xm7, ym6, 1 + paddd xm6, xm6, xm7 + punpckhqdq xm7, xm6, xm6 + paddd xm6, xm7 + movq rax, xm6 + rorx rdx, rax, 32 + add eax, edx +%endmacro + +%macro SATD_32xN_AVX512 1 +INIT_ZMM avx512 +cglobal pixel_satd_32x%1, 4,6,8 + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 +%rep %1/4 - 1 + PROCESS_SATD_32x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_SATD_32x4_AVX512 + SATD_MAIN_AVX512_END + RET +%endmacro + +SATD_32xN_AVX512 8 +SATD_32xN_AVX512 16 +SATD_32xN_AVX512 24 +SATD_32xN_AVX512 32 +SATD_32xN_AVX512 48 +SATD_32xN_AVX512 64 + +%macro SATD_64xN_AVX512 1 +INIT_ZMM avx512 +cglobal pixel_satd_64x%1, 4,8,8 + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 + +%rep %1/4 - 1 + PROCESS_SATD_32x4_AVX512 + lea r0, 
[r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_SATD_32x4_AVX512 + lea r0, [r6 + mmsize/2] + lea r2, [r7 + mmsize/2] +%rep %1/4 - 1 + PROCESS_SATD_32x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_SATD_32x4_AVX512 + SATD_MAIN_AVX512_END + RET +%endmacro + +SATD_64xN_AVX512 16 +SATD_64xN_AVX512 32 +SATD_64xN_AVX512 48 +SATD_64xN_AVX512 64 +%endif ; ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0 %if ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 1 INIT_YMM avx2 cglobal calc_satd_16x8 ; function to compute satd cost for 16 columns, 8 rows @@ -13721,6 +14967,257 @@ paddd xm6, xm7 movd eax, xm6 RET + +%macro SATD_HBD_AVX512_END 0 + vextracti32x8 ym7, m6, 1 + paddd ym6, ym7 + vextracti128 xm7, ym6, 1 + paddd xm6, xm7 + pxor xm7, xm7 + movhlps xm7, xm6 + paddd xm6, xm7 + pshufd xm7, xm6, 1 + paddd xm6, xm7 + movd eax, xm6 +%endmacro +%macro PROCESS_SATD_16x8_HBD_AVX512 0 ; function to compute satd cost for 16 columns, 8 rows + ; rows 0-3 + lea r6, [r0 + r1 * 4] + lea r7, [r2 + r3 * 4] + movu ym0, [r0] + movu ym4, [r2] + vinserti32x8 m0, [r6], 1 + vinserti32x8 m4, [r7], 1 + psubw m0, m4 + movu ym1, [r0 + r1] + movu ym5, [r2 + r3] + vinserti32x8 m1, [r6 + r1], 1 + vinserti32x8 m5, [r7 + r3], 1 + psubw m1, m5 + movu ym2, [r0 + r1 * 2] + movu ym4, [r2 + r3 * 2] + vinserti32x8 m2, [r6 + r1 * 2], 1 + vinserti32x8 m4, [r7 + r3 * 2], 1 + psubw m2, m4 + movu ym3, [r0 + r4] + movu ym5, [r2 + r5] + vinserti32x8 m3, [r6 + r4], 1 + vinserti32x8 m5, [r7 + r5], 1 + psubw m3, m5 + + paddw m4, m0, m1 + psubw m1, m0 + paddw m0, m2, m3 + psubw m3, m2 + punpckhwd m2, m4, m1 + punpcklwd m4, m1 + punpckhwd m1, m0, m3 + punpcklwd m0, m3 + paddw m3, m4, m0 + psubw m0, m4 + paddw m4, m2, m1 + psubw m1, m2 + punpckhdq m2, m3, m0 + punpckldq m3, m0 + paddw m0, m3, m2 + psubw m2, m3 + punpckhdq m3, m4, m1 + punpckldq m4, m1 + paddw m1, m4, m3 + psubw m3, m4 + punpckhqdq m4, m0, m1 + punpcklqdq m0, m1 + pabsw m0, m0 + pabsw m4, m4 + pmaxsw m0, m0, m4 + punpckhqdq m1, m2, m3 + punpcklqdq m2, m3 + pabsw m2, m2 + pabsw m1, m1 + pmaxsw m2, m1 + pxor m7, m7 + mova m1, m0 + punpcklwd m1, m7 + paddd m6, m1 + mova m1, m0 + punpckhwd m1, m7 + paddd m6, m1 + pxor m7, m7 + mova m1, m2 + punpcklwd m1, m7 + paddd m6, m1 + mova m1, m2 + punpckhwd m1, m7 + paddd m6, m1 +%endmacro +%macro PROCESS_SATD_32x4_HBD_AVX512 0 ; function to compute satd cost for 32 columns, 4 rows + ; rows 0-3 + movu m0, [r0] + movu m4, [r2] + psubw m0, m4 + movu m1, [r0 + r1] + movu m5, [r2 + r3] + psubw m1, m5 + movu m2, [r0 + r1 * 2] + movu m4, [r2 + r3 * 2] + psubw m2, m4 + movu m3, [r0 + r4] + movu m5, [r2 + r5] + psubw m3, m5 + paddw m4, m0, m1 + psubw m1, m0 + paddw m0, m2, m3 + psubw m3, m2 + punpckhwd m2, m4, m1 + punpcklwd m4, m1 + punpckhwd m1, m0, m3 + punpcklwd m0, m3 + paddw m3, m4, m0 + psubw m0, m4 + paddw m4, m2, m1 + psubw m1, m2 + punpckhdq m2, m3, m0 + punpckldq m3, m0 + paddw m0, m3, m2 + psubw m2, m3 + punpckhdq m3, m4, m1 + punpckldq m4, m1 + paddw m1, m4, m3 + psubw m3, m4 + punpckhqdq m4, m0, m1 + punpcklqdq m0, m1 + pabsw m0, m0 + pabsw m4, m4 + pmaxsw m0, m0, m4 + punpckhqdq m1, m2, m3 + punpcklqdq m2, m3 + pabsw m2, m2 + pabsw m1, m1 + pmaxsw m2, m1 + pxor m7, m7 + mova m1, m0 + punpcklwd m1, m7 + paddd m6, m1 + mova m1, m0 + punpckhwd m1, m7 + paddd m6, m1 + pxor m7, m7 + mova m1, m2 + punpcklwd m1, m7 + paddd m6, m1 + mova m1, m2 + punpckhwd m1, m7 + paddd m6, m1 +%endmacro + +%macro SATD_16xN_HBD_AVX512 1 +INIT_ZMM avx512 +cglobal pixel_satd_16x%1, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, 
[3 * r3] + pxor m6, m6 + +%rep %1/8 - 1 + PROCESS_SATD_16x8_HBD_AVX512 + lea r0, [r6 + 4 * r1] + lea r2, [r7 + 4 * r3] +%endrep + PROCESS_SATD_16x8_HBD_AVX512 + SATD_HBD_AVX512_END + RET +%endmacro + +SATD_16xN_HBD_AVX512 8 +SATD_16xN_HBD_AVX512 16 +SATD_16xN_HBD_AVX512 32 +SATD_16xN_HBD_AVX512 64 + +%macro SATD_32xN_HBD_AVX512 1 +INIT_ZMM avx512 +cglobal pixel_satd_32x%1, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 +%rep %1/4 - 1 + PROCESS_SATD_32x4_HBD_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_SATD_32x4_HBD_AVX512 + SATD_HBD_AVX512_END + RET +%endmacro + +SATD_32xN_HBD_AVX512 8 +SATD_32xN_HBD_AVX512 16 +SATD_32xN_HBD_AVX512 24 +SATD_32xN_HBD_AVX512 32 +SATD_32xN_HBD_AVX512 64 +INIT_ZMM avx512 +cglobal pixel_satd_48x64, 4,10,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r8, r0 + mov r9, r2 + +%rep 15 + PROCESS_SATD_32x4_HBD_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_SATD_32x4_HBD_AVX512 + lea r0, [r8 + mmsize] + lea r2, [r9 + mmsize] +%rep 7 + PROCESS_SATD_16x8_HBD_AVX512 + lea r0, [r6 + 4 * r1] + lea r2, [r7 + 4 * r3] +%endrep + PROCESS_SATD_16x8_HBD_AVX512 + SATD_HBD_AVX512_END + RET + +%macro SATD_64xN_HBD_AVX512 1 +INIT_ZMM avx512 +cglobal pixel_satd_64x%1, 4,8,8 + add r1d, r1d + add r3d, r3d + lea r4, [3 * r1] + lea r5, [3 * r3] + pxor m6, m6 + mov r6, r0 + mov r7, r2 +%rep %1/4 - 1 + PROCESS_SATD_32x4_HBD_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_SATD_32x4_HBD_AVX512 + lea r0, [r6 + mmsize] + lea r2, [r7 + mmsize] +%rep %1/4 - 1 + PROCESS_SATD_32x4_HBD_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_SATD_32x4_HBD_AVX512 + SATD_HBD_AVX512_END + RET +%endmacro + +SATD_64xN_HBD_AVX512 16 +SATD_64xN_HBD_AVX512 32 +SATD_64xN_HBD_AVX512 48 +SATD_64xN_HBD_AVX512 64 %endif ; ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 1 @@ -13818,6 +15315,7 @@ ;lea %8, [%8+4*r3] %endmacro +%if ARCH_X86_64 INIT_YMM avx2 cglobal pixel_satd_8x8, 4,4,7 @@ -14383,5 +15881,5 @@ movd eax, xm0 RET - +%endif %endif ; HIGH_BIT_DEPTH == 1 && BIT_DEPTH == 10
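The hunks above add AVX-512 SATD kernels for the larger partition sizes (up to 64x64) in both the 8-bit and HIGH_BIT_DEPTH builds; the two paths differ mainly in how rows are loaded and widened before the butterfly stages. As a reading aid, the scalar C sketch below (not part of the patch) shows the 4x4 Hadamard SATD that these kernels effectively evaluate over many sub-blocks in parallel; the pixel typedef and the final >>1 halving convention are assumptions based on how these primitives are usually defined, and may not match the exact rounding of the hand-written asm.

#include <stdint.h>
#include <stdlib.h>

typedef uint8_t pixel;              /* 8-bit build; HIGH_BIT_DEPTH would use uint16_t */

static int satd_4x4_ref(const pixel *pix1, intptr_t stride1,
                        const pixel *pix2, intptr_t stride2)
{
    int tmp[4][4], sum = 0;
    /* horizontal 4-point Hadamard on each row of the residual */
    for (int i = 0; i < 4; i++, pix1 += stride1, pix2 += stride2) {
        int a0 = pix1[0] - pix2[0], a1 = pix1[1] - pix2[1];
        int a2 = pix1[2] - pix2[2], a3 = pix1[3] - pix2[3];
        int b0 = a0 + a1, b1 = a0 - a1, b2 = a2 + a3, b3 = a2 - a3;
        tmp[i][0] = b0 + b2; tmp[i][1] = b1 + b3;
        tmp[i][2] = b0 - b2; tmp[i][3] = b1 - b3;
    }
    /* vertical 4-point Hadamard per column, then sum of absolute coefficients */
    for (int j = 0; j < 4; j++) {
        int b0 = tmp[0][j] + tmp[1][j], b1 = tmp[0][j] - tmp[1][j];
        int b2 = tmp[2][j] + tmp[3][j], b3 = tmp[2][j] - tmp[3][j];
        sum += abs(b0 + b2) + abs(b1 + b3) + abs(b0 - b2) + abs(b1 - b3);
    }
    return sum >> 1;
}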
View file
x265_2.7.tar.gz/source/common/x86/pixel-util.h -> x265_2.9.tar.gz/source/common/x86/pixel-util.h
Changed
@@ -27,6 +27,7 @@ #define DEFINE_UTILS(cpu) \ FUNCDEF_TU_S2(void, getResidual, cpu, const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); \ + FUNCDEF_TU_S2(void, getResidual_aligned, cpu, const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); \ FUNCDEF_TU_S2(void, transpose, cpu, pixel* dest, const pixel* src, intptr_t stride); \ FUNCDEF_TU(int, count_nonzero, cpu, const int16_t* quantCoeff); \ uint32_t PFX(quant_ ## cpu(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff)); \ @@ -36,6 +37,7 @@ void PFX(weight_pp_ ## cpu(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset)); \ void PFX(weight_sp_ ## cpu(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset)); \ void PFX(scale1D_128to64_ ## cpu(pixel*, const pixel*)); \ + void PFX(scale1D_128to64_aligned_ ## cpu(pixel*, const pixel*)); \ void PFX(scale2D_64to32_ ## cpu(pixel*, const pixel*, intptr_t)); \ uint32_t PFX(costCoeffRemain_ ## cpu(uint16_t *absCoeff, int numNonZero, int idx)); \ uint32_t PFX(costC1C2Flag_sse2(uint16_t *absCoeff, intptr_t numNonZero, uint8_t *baseCtxMod, intptr_t ctxOffset)); \ @@ -44,6 +46,7 @@ DEFINE_UTILS(ssse3); DEFINE_UTILS(sse4); DEFINE_UTILS(avx2); +DEFINE_UTILS(avx512); #undef DEFINE_UTILS @@ -58,4 +61,7 @@ uint32_t PFX(costCoeffNxN_sse4(const uint16_t *scan, const coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, const uint8_t *tabSigCtx, uint32_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase)); uint32_t PFX(costCoeffNxN_avx2_bmi2(const uint16_t *scan, const coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, const uint8_t *tabSigCtx, uint32_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase)); +int PFX(count_nonzero_16x16_avx512(const int16_t* quantCoeff)); +int PFX(count_nonzero_32x32_avx512(const int16_t* quantCoeff)); + #endif // ifndef X265_PIXEL_UTIL_H
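The header changes above only add prototypes: aligned variants of getResidual and scale1D_128to64, an avx512 instantiation of DEFINE_UTILS, and explicit declarations for the new count_nonzero_16x16/32x32 AVX-512 kernels. A scalar sketch of what those count_nonzero kernels return follows (illustrative, not taken from the patch): the assembly reaches the same answer without a loop by packing words to bytes with packsswb, comparing against zero with vpcmpb (predicate 4, not-equal) and popcounting the resulting mask register.

#include <stdint.h>

/* Scalar reference for count_nonzero_16x16 / count_nonzero_32x32: the number
 * of non-zero int16_t coefficients in the block. */
static int count_nonzero_ref(const int16_t *quantCoeff, int numCoeff)
{
    int count = 0;
    for (int i = 0; i < numCoeff; i++)
        count += (quantCoeff[i] != 0);
    return count;          /* numCoeff = 16*16 or 32*32 for the new kernels */
}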
View file
x265_2.7.tar.gz/source/common/x86/pixel-util8.asm -> x265_2.9.tar.gz/source/common/x86/pixel-util8.asm
Changed
@@ -4,6 +4,7 @@ ;* Authors: Min Chen <chenm003@163.com> <min.chen@multicorewareinc.com> ;* Nabajit Deka <nabajit@multicorewareinc.com> ;* Rajesh Paulraj <rajesh@multicorewareinc.com> +;* Praveen Kumar Tiwari <praveen@multicorewareinc.com> ;* ;* This program is free software; you can redistribute it and/or modify ;* it under the terms of the GNU General Public License as published by @@ -26,7 +27,13 @@ %include "x86inc.asm" %include "x86util.asm" -SECTION_RODATA 32 +SECTION_RODATA 64 + +var_shuf_avx512: db 0,-1, 1,-1, 2,-1, 3,-1, 4,-1, 5,-1, 6,-1, 7,-1 + db 8,-1, 9,-1,10,-1,11,-1,12,-1,13,-1,14,-1,15,-1 +ALIGN 64 +const dequant_shuf1_avx512, dq 0, 2, 4, 6, 1, 3, 5, 7 +const dequant_shuf2_avx512, dq 0, 4, 1, 5, 2, 6, 3, 7 %if BIT_DEPTH == 12 ssim_c1: times 4 dd 107321.76 ; .01*.01*4095*4095*64 @@ -552,6 +559,262 @@ %endrep RET %endif + +%macro PROCESS_GETRESIDUAL32_W4_HBD_AVX512 0 + movu m0, [r0] + movu m1, [r0 + r3] + movu m2, [r0 + r3 * 2] + movu m3, [r0 + r4] + lea r0, [r0 + r3 * 4] + + movu m4, [r1] + movu m5, [r1 + r3] + movu m6, [r1 + r3 * 2] + movu m7, [r1 + r4] + lea r1, [r1 + r3 * 4] + + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 + lea r2, [r2 + r3 * 4] +%endmacro + +%macro PROCESS_GETRESIDUAL32_W4_HBD_AVX512_END 0 + movu m0, [r0] + movu m1, [r0 + r3] + movu m2, [r0 + r3 * 2] + movu m3, [r0 + r4] + + movu m4, [r1] + movu m5, [r1 + r3] + movu m6, [r1 + r3 * 2] + movu m7, [r1 + r4] + + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 +%endmacro + +%macro PROCESS_GETRESIDUAL32_W4_AVX512 0 + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + r3] + pmovzxbw m2, [r0 + r3 * 2] + pmovzxbw m3, [r0 + r4] + lea r0, [r0 + r3 * 4] + + pmovzxbw m4, [r1] + pmovzxbw m5, [r1 + r3] + pmovzxbw m6, [r1 + r3 * 2] + pmovzxbw m7, [r1 + r4] + lea r1, [r1 + r3 * 4] + + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + + movu [r2], m0 + movu [r2 + r3 * 2], m1 + lea r2, [r2 + r3 * 4] + movu [r2], m2 + movu [r2 + r3 * 2], m3 + lea r2, [r2 + r3 * 4] +%endmacro + +%macro PROCESS_GETRESIDUAL32_W4_AVX512_END 0 + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + r3] + pmovzxbw m2, [r0 + r3 * 2] + pmovzxbw m3, [r0 + r4] + + pmovzxbw m4, [r1] + pmovzxbw m5, [r1 + r3] + pmovzxbw m6, [r1 + r3 * 2] + pmovzxbw m7, [r1 + r4] + + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + + movu [r2], m0 + movu [r2 + r3 * 2], m1 + lea r2, [r2 + r3 * 4] + movu [r2], m2 + movu [r2 + r3 * 2], m3 +%endmacro + + +%if HIGH_BIT_DEPTH +INIT_ZMM avx512 +cglobal getResidual32, 4,5,8 + add r3, r3 + lea r4, [r3 * 3] + + PROCESS_GETRESIDUAL32_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_W4_HBD_AVX512_END + RET +%else +INIT_ZMM avx512 +cglobal getResidual32, 4,5,8 + lea r4, [r3 * 3] + + PROCESS_GETRESIDUAL32_W4_AVX512 + PROCESS_GETRESIDUAL32_W4_AVX512 + PROCESS_GETRESIDUAL32_W4_AVX512 + PROCESS_GETRESIDUAL32_W4_AVX512 + PROCESS_GETRESIDUAL32_W4_AVX512 + PROCESS_GETRESIDUAL32_W4_AVX512 + PROCESS_GETRESIDUAL32_W4_AVX512 + PROCESS_GETRESIDUAL32_W4_AVX512_END + RET +%endif + +%macro PROCESS_GETRESIDUAL32_ALIGNED_W4_HBD_AVX512 0 + movu m0, [r0] + movu m1, [r0 + r3] + movu m2, [r0 + r3 * 2] + movu m3, [r0 + r4] + lea r0, [r0 + r3 * 4] + + movu m4, 
[r1] + movu m5, [r1 + r3] + movu m6, [r1 + r3 * 2] + movu m7, [r1 + r4] + lea r1, [r1 + r3 * 4] + + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 + lea r2, [r2 + r3 * 4] +%endmacro + +%macro PROCESS_GETRESIDUAL32_ALIGNED_W4_HBD_AVX512_END 0 + movu m0, [r0] + movu m1, [r0 + r3] + movu m2, [r0 + r3 * 2] + movu m3, [r0 + r4] + + movu m4, [r1] + movu m5, [r1 + r3] + movu m6, [r1 + r3 * 2] + movu m7, [r1 + r4] + + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + + movu [r2], m0 + movu [r2 + r3], m1 + movu [r2 + r3 * 2], m2 + movu [r2 + r4], m3 +%endmacro + +%macro PROCESS_GETRESIDUAL32_ALIGNED_W4_AVX512 0 + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + r3] + pmovzxbw m2, [r0 + r3 * 2] + pmovzxbw m3, [r0 + r4] + lea r0, [r0 + r3 * 4] + + pmovzxbw m4, [r1] + pmovzxbw m5, [r1 + r3] + pmovzxbw m6, [r1 + r3 * 2] + pmovzxbw m7, [r1 + r4] + lea r1, [r1 + r3 * 4] + + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + + movu [r2], m0 + movu [r2 + r3 * 2], m1 + lea r2, [r2 + r3 * 4] + movu [r2], m2 + movu [r2 + r3 * 2], m3 + lea r2, [r2 + r3 * 4] +%endmacro + +%macro PROCESS_GETRESIDUAL32_ALIGNED_W4_AVX512_END 0 + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + r3] + pmovzxbw m2, [r0 + r3 * 2] + pmovzxbw m3, [r0 + r4] + + pmovzxbw m4, [r1] + pmovzxbw m5, [r1 + r3] + pmovzxbw m6, [r1 + r3 * 2] + pmovzxbw m7, [r1 + r4] + + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + + movu [r2], m0 + movu [r2 + r3 * 2], m1 + lea r2, [r2 + r3 * 4] + movu [r2], m2 + movu [r2 + r3 * 2], m3 +%endmacro + + +%if HIGH_BIT_DEPTH +INIT_ZMM avx512 +cglobal getResidual_aligned32, 4,5,8 + add r3, r3 + lea r4, [r3 * 3] + + PROCESS_GETRESIDUAL32_ALIGNED_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_HBD_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_HBD_AVX512_END + RET +%else +INIT_ZMM avx512 +cglobal getResidual_aligned32, 4,5,8 + lea r4, [r3 * 3] + + PROCESS_GETRESIDUAL32_ALIGNED_W4_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_AVX512 + PROCESS_GETRESIDUAL32_ALIGNED_W4_AVX512_END + RET +%endif ;----------------------------------------------------------------------------- ; uint32_t quant(int16_t *coef, int32_t *quantCoeff, int32_t *deltaU, int16_t *qCoef, int qBits, int add, int numCoeff); ;----------------------------------------------------------------------------- @@ -782,6 +1045,133 @@ %endif ; ARCH_X86_64 == 1 +%if ARCH_X86_64 == 1 +INIT_ZMM avx512 +cglobal quant, 5, 6, 22 + ; fill qbits + movd xm4, r4d ; m4 = qbits + + ; fill qbits-8 + sub r4d, 8 + movd xm6, r4d ; m6 = qbits8 + + ; fill offset +%if UNIX64 == 0 + vpbroadcastd m5, r5m ; m5 = add +%else ; Mac + movd xm5, r5m + vpbroadcastd m5, xm5 ; m5 = add +%endif + + vbroadcasti32x8 m9, [pw_1] + + mov r4d, r6m + pxor m7, m7 + sub r4d, 32 + jl .coeff16 + add r4d, 32 + shr r4d, 5 + jmp .loop + +.coeff16: + ; 16 coeff + pxor m7, m7 + pmovsxwd m16, [r0] ; m16 = level + pabsd m1, m16 + pmulld m1, [r1] + paddd m17, m1, m5 + psrad m17, xm4 ; m17 = level1 + + pslld m3, m17, 8 + psrad m1, xm6 + psubd m1, m3 ; m1 = deltaU1 + movu [r2], m1 + 
vextracti64x4 ym19, m17, 1 + vextracti64x4 ym20, m16, 1 + psignd ym17, ym16 + psignd ym19, ym20 + packssdw ym17, ym19 + vpermq ym17, ym17, q3120 + movu [r3], ym17 + + pminuw ym17, ym9 + paddw ym7, ym17 + + ; sum count + xorpd m0, m0 + psadbw ym7, ym0 + vextracti128 xm1, ym7, 1 + paddd xm7, xm1 + movhlps xm0, xm7 + paddd xm7, xm0 + movd eax, xm7 + RET + +.loop: + ; 16 coeff + pmovsxwd m16, [r0] ; m16 = level + pabsd m1, m16 + pmulld m1, [r1] + paddd m17, m1, m5 + psrad m17, xm4 ; m17 = level1 + + pslld m3, m17, 8 + psrad m1, xm6 + psubd m1, m3 ; m1 = deltaU1 + movu [r2], m1 + vextracti64x4 ym19, m17, 1 + vextracti64x4 ym20, m16, 1 + psignd ym17, ym16 + psignd ym19, ym20 + packssdw ym17, ym19 + + ; 16 coeff + pmovsxwd m16, [r0 + mmsize/2] ; m16 = level + pabsd m1, m16 + pmulld m1, [r1 + mmsize] + paddd m18, m1, m5 + psrad m18, xm4 ; m2 = level1 + + pslld m8, m18, 8 + psrad m1, xm6 + psubd m1, m8 ; m1 = deltaU1 + movu [r2 + mmsize], m1 + vextracti64x4 ym21, m18, 1 + vextracti64x4 ym20, m16, 1 + psignd ym18, ym16 + psignd ym21, ym20 + packssdw ym18, ym21 + vinserti64x4 m17, m17, ym18, 1 + vpermq m17, m17, q3120 + + movu [r3], m17 + + pminuw m17, m9 + paddw m7, m17 + + add r0, mmsize + add r1, mmsize * 2 + add r2, mmsize * 2 + add r3, mmsize + + dec r4d + jnz .loop + + ; sum count + xorpd m0, m0 + psadbw m7, m0 + vextracti32x8 ym1, m7, 1 + paddd ym7, ym1 + vextracti64x2 xm1, m7, 1 + paddd xm7, xm1 + pshufd xm1, xm7, 2 + paddd xm7, xm1 + movd eax, xm7 + RET +%endif ; ARCH_X86_64 == 1 + + + ;----------------------------------------------------------------------------- ; uint32_t nquant(int16_t *coef, int32_t *quantCoeff, int16_t *qCoef, int qBits, int add, int numCoeff); ;----------------------------------------------------------------------------- @@ -888,7 +1278,101 @@ paddd xm5, xm0 movd eax, xm5 RET +%if ARCH_X86_64 == 1 +INIT_ZMM avx512 +cglobal nquant, 3,5,22 +%if UNIX64 == 0 + vpbroadcastd m4, r4m +%else ; Mac + movd xm4, r4m + vpbroadcastd m4, xm4 +%endif + vbroadcasti32x8 m6, [pw_1] + mov r4d, r5m + pxor m5, m5 + movd xm3, r3m + sub r4d, 16 + je .coeff16 + add r4d, 16 + shr r4d, 5 + jmp .loop + +.coeff16: + pmovsxwd m16, [r0] + pabsd m17, m16 + pmulld m17, [r1] + paddd m17, m4 + psrad m17, xm3 + + vextracti64x4 ym19, m17, 1 + vextracti64x4 ym20, m16, 1 + psignd ym17, ym16 + psignd ym19, ym20 + packssdw ym17, ym19 + vpermq ym17, ym17, q3120 + pabsw ym17, ym17 + movu [r2], ym17 + pminuw ym17, ym6 + paddw ym5, ym17 + pxor m0, m0 + psadbw ym5, ym0 + vextracti128 xm0, ym5, 1 + paddd xm5, xm0 + pshufd xm0, xm5, 2 + paddd xm5, xm0 + movd eax, xm5 + RET + +.loop: + pmovsxwd m16, [r0] + pabsd m17, m16 + pmulld m17, [r1] + paddd m17, m4 + psrad m17, xm3 + vextracti64x4 ym19, m17, 1 + vextracti64x4 ym20, m16, 1 + psignd ym17, ym16 + psignd ym19, ym20 + packssdw ym17, ym19 + + pmovsxwd m16, [r0 + mmsize/2] + pabsd m18, m16 + pmulld m18, [r1 + mmsize] + paddd m18, m4 + psrad m18, xm3 + vextracti64x4 ym21, m18, 1 + vextracti64x4 ym20, m16, 1 + psignd ym18, ym16 + psignd ym21, ym20 + packssdw ym18, ym21 + vinserti64x4 m17, m17, ym18, 1 + vpermq m17, m17, q3120 + + pabsw m17, m17 + movu [r2], m17 + + add r0, mmsize + add r1, mmsize * 2 + add r2, mmsize + + pminuw m17, m6 + paddw m5, m17 + + dec r4d + jnz .loop + + pxor m0, m0 + psadbw m5, m0 + vextracti32x8 ym1, m5, 1 + paddd ym5, ym1 + vextracti64x2 xm1, m5, 1 + paddd xm5, xm1 + pshufd xm1, xm5, 2 + paddd xm5, xm1 + movd eax, xm5 + RET +%endif ; ARCH_X86_64 == 1 ;----------------------------------------------------------------------------- ; void 
dequant_normal(const int16_t* quantCoef, int32_t* coef, int num, int scale, int shift) @@ -1106,6 +1590,142 @@ jnz .loop RET +;---------------------------------------------------------------------------------------------------------------------- +;void dequant_scaling(const int16_t* src, const int32_t* dequantCoef, int16_t* dst, int num, int mcqp_miper, int shift) +;---------------------------------------------------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal dequant_scaling, 6,7,8 + mova m6, [dequant_shuf1_avx512] + mova m7, [dequant_shuf2_avx512] + add r5d, 4 + mov r6d, r3d + shr r3d, 5 ; num/32 + cmp r5d, r4d + jle .skip + sub r5d, r4d + vpbroadcastd m0, [pd_1] + movd xm1, r5d ; shift - per + dec r5d + movd xm2, r5d ; shift - per - 1 + pslld m0, xm2 ; 1 << shift - per - 1 + +.part0: + pmovsxwd m2, [r0] + pmovsxwd m4, [r0 + 32] + movu m3, [r1] + movu m5, [r1 + 64] + pmulld m2, m3 + pmulld m4, m5 + paddd m2, m0 + paddd m4, m0 + psrad m2, xm1 + psrad m4, xm1 + packssdw m2, m4 + vpermq m2, m6, m2 + cmp r6d, 16 + je .num16part0 + movu [r2], m2 + + add r0, 64 + add r1, 128 + add r2, 64 + dec r3d + jnz .part0 + jmp .end + +.num16part0: + movu [r2], ym2 + jmp .end + +.skip: + sub r4d, r5d ; per - shift + movd xm0, r4d + +.part1: + pmovsxwd m2, [r0] + pmovsxwd m4, [r0 + 32] + movu m3, [r1] + movu m5, [r1 + 64] + pmulld m2, m3 + pmulld m4, m5 + packssdw m2, m4 + + vextracti32x8 ym4, m2, 1 + pmovsxwd m1, ym2 + pmovsxwd m2, ym4 + pslld m1, xm0 + pslld m2, xm0 + packssdw m1, m2 + + vpermq m1, m7, m1 + cmp r6d, 16 + je .num16part1 + movu [r2], m1 + + add r0, 64 + add r1, 128 + add r2, 64 + dec r3d + jnz .part1 + +.num16part1: + movu [r2], ym1 + +.end: + RET + +INIT_ZMM avx512 +cglobal dequant_normal, 5,5,7 + vpbroadcastd m2, [pw_1] ; m2 = word [1] + vpbroadcastd m5, [pd_32767] ; m5 = dword [32767] + vpbroadcastd m6, [pd_n32768] ; m6 = dword [-32768] +%if HIGH_BIT_DEPTH + cmp r3d, 32767 + jle .skip + shr r3d, (BIT_DEPTH - 8) + sub r4d, (BIT_DEPTH - 8) +.skip: +%endif + movd xm0, r4d ; m0 = shift + add r4d, -1+16 + bts r3d, r4d + + movd xm1, r3d + vpbroadcastd m1, xm1 ; m1 = dword [add scale] + + ; m0 = shift + ; m1 = scale + ; m2 = word [1] + mov r3d, r2d + shr r2d, 5 +.loop: + movu m3, [r0] + punpckhwd m4, m3, m2 + punpcklwd m3, m2 + pmaddwd m3, m1 ; m3 = dword (clipQCoef * scale + add) + pmaddwd m4, m1 + psrad m3, xm0 + psrad m4, xm0 + pminsd m3, m5 + pmaxsd m3, m6 + pminsd m4, m5 + pmaxsd m4, m6 + packssdw m3, m4 + + mova [r1 + 0 * mmsize/2], ym3 + cmp r3d, 16 + je .num16 + vextracti32x8 [r1 + 1 * mmsize/2], m3, 1 + + add r0, mmsize + add r1, mmsize + + dec r2d + jnz .loop + RET +.num16: + RET + ;----------------------------------------------------------------------------- ; int x265_count_nonzero_4x4_sse2(const int16_t *quantCoeff); @@ -1238,7 +1858,30 @@ movd eax, xm0 RET +;----------------------------------------------------------------------------- +; int x265_count_nonzero_16x16_avx512(const int16_t *quantCoeff); +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal count_nonzero_16x16, 1,4,2 + mov r1, 0xFFFFFFFFFFFFFFFF + kmovq k2, r1 + xor r3, r3 + pxor m0, m0 +%assign x 0 +%rep 4 + movu m1, [r0 + x] + vpacksswb m1, [r0 + x + 64] +%assign x x+128 + vpcmpb k1 {k2}, m1, m0, 00000100b + kmovq r1, k1 + popcnt r2, r1 + add r3d, r2d +%endrep + mov eax, r3d + RET +%endif ;----------------------------------------------------------------------------- ; int 
x265_count_nonzero_32x32_sse2(const int16_t *quantCoeff); ;----------------------------------------------------------------------------- @@ -1288,6 +1931,30 @@ RET +;----------------------------------------------------------------------------- +; int x265_count_nonzero_32x32_avx512(const int16_t *quantCoeff); +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal count_nonzero_32x32, 1,4,2 + mov r1, 0xFFFFFFFFFFFFFFFF + kmovq k2, r1 + xor r3, r3 + pxor m0, m0 + +%assign x 0 +%rep 16 + movu m1, [r0 + x] + vpacksswb m1, [r0 + x + 64] +%assign x x+128 + vpcmpb k1 {k2}, m1, m0, 00000100b + kmovq r1, k1 + popcnt r2, r1 + add r3d, r2d +%endrep + mov eax, r3d + RET +%endif ;----------------------------------------------------------------------------------------------------------------------------------------------- ;void weight_pp(pixel *src, pixel *dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset) ;----------------------------------------------------------------------------------------------------------------------------------------------- @@ -1531,6 +2198,116 @@ jnz .loopH RET %endif + +%if HIGH_BIT_DEPTH +INIT_ZMM avx512 +cglobal weight_pp, 6, 7, 7 +%define correction (14 - BIT_DEPTH) + mov r6d, r6m + shl r6d, 16 - correction + or r6d, r5d + + movd xm0, r6d + vpbroadcastd m0, xm0 + mov r5d, r7m + sub r5d, correction + movd xm1, r5d + + vpbroadcastd m2, r8m + vbroadcasti32x8 m5, [pw_1] + vbroadcasti32x8 m6, [pw_pixel_max] + + add r2d, r2d + add r3d, r3d + sub r2d, r3d + shr r3d, 6 + +.loopH: + mov r5d, r3d + +.loopW: + movu m4, [r0] + punpcklwd m3, m4, m5 + pmaddwd m3, m0 + psrad m3, xm1 + paddd m3, m2 + + punpckhwd m4, m5 + pmaddwd m4, m0 + psrad m4, xm1 + paddd m4, m2 + + packusdw m3, m4 + pminuw m3, m6 + movu [r1], m3 + + add r0, 64 + add r1, 64 + + dec r5d + jnz .loopW + + lea r0, [r0 + r2] + lea r1, [r1 + r2] + + dec r4d + jnz .loopH +%undef correction + RET +%else +INIT_ZMM avx512 +cglobal weight_pp, 6, 7, 6 + + shl r5d, 6 + mov r6d, r6m + shl r6d, 16 + or r6d, r5d + + movd xm0, r6d + vpbroadcastd m0, xm0 + movd xm1, r7m + vpbroadcastd m2, r8m + + vbroadcasti32x8 m5, [pw_1] + + sub r2d, r3d + shr r3d, 5 + +.loopH: + mov r5d, r3d + +.loopW: + pmovzxbw m4, [r0] + punpcklwd m3, m4, m5 + pmaddwd m3, m0 + psrad m3, xm1 + paddd m3, m2 + + punpckhwd m4, m5 + pmaddwd m4, m0 + psrad m4, xm1 + paddd m4, m2 + + packssdw m3, m4 + vextracti64x4 ym4, m3, 1 + packuswb ym3, ym4 + vpermq ym3, ym3, q3120 + movu [r1], ym3 + + add r0, 32 + add r1, 32 + + dec r5d + jnz .loopW + + lea r0, [r0 + r2] + lea r1, [r1 + r2] + + dec r4d + jnz .loopH + RET +%endif + ;------------------------------------------------------------------------------------------------------------------------------------------------- ;void weight_sp(int16_t *src, pixel *dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset) ;------------------------------------------------------------------------------------------------------------------------------------------------- @@ -1892,6 +2669,149 @@ %endif %endif +%if ARCH_X86_64 == 1 +%if HIGH_BIT_DEPTH +INIT_ZMM avx512 +cglobal weight_sp, 6,9,8 + vbroadcasti32x8 m1, [pw_pixel_max] + vbroadcasti32x8 m2, [pw_1] + + mov r6d, r7m + shl r6d, 16 + or r6d, r6m + movd xm3, r6d + vpbroadcastd m3, xm3 ; m3 = [round w0] + movd xm4, r8m ; m4 = [shift] + vpbroadcastd m5, r9m ; m5 = [offset] + + ; correct row stride + add r3d, r3d + add r2d, r2d + mov 
r6d, r4d + and r6d, ~(mmsize / SIZEOF_PIXEL - 1) + shl r6d, 1 + sub r3d, r6d + sub r2d, r6d + + mov r6d, r4d + and r6d, (mmsize / SIZEOF_PIXEL - 1) + +.loopH: + mov r6d, r4d + +.loopW: + movu m6, [r0] + vbroadcasti32x8 m8, [pw_2000] + paddw m6, m8 + + punpcklwd m7, m6, m2 + pmaddwd m7, m3 ;(round w0) + psrad m7, xm4 ;(shift) + paddd m7, m5 ;(offset) + + punpckhwd m6, m2 + pmaddwd m6, m3 + psrad m6, xm4 + paddd m6, m5 + + packusdw m7, m6 + pminuw m7, m1 + + sub r6d, (mmsize / SIZEOF_PIXEL) + jl .widthLess30 + movu [r1], m7 + lea r0, [r0 + mmsize] + lea r1, [r1 + mmsize] + je .nextH + jmp .loopW + +.widthLess30: + mov r8d, 0xFFFFFFFF + NEG r6d + shrx r8d, r8d, r6d + kmovd k1, r8d + vmovdqu16 [r1] {k1}, m7 + jmp .nextH + +.nextH: + add r0, r2 + add r1, r3 + + dec r5d + jnz .loopH + RET + +%else +INIT_ZMM avx512 +cglobal weight_sp, 6, 10, 7 + mov r7d, r7m + shl r7d, 16 + or r7d, r6m + movd xm0, r7d + vpbroadcastd m0, xm0 ; m0 = times 8 dw w0, round + movd xm1, r8m ; m1 = [shift] + vpbroadcastd m2, r9m ; m2 = times 16 dw offset + vpbroadcastw m3, [pw_1] + vpbroadcastw m4, [pw_2000] + + add r2d, r2d ; 2 * srcstride + + mov r7, r0 + mov r8, r1 +.loopH: + mov r6d, r4d ; width + + ; save old src and dst + mov r0, r7 ; src + mov r1, r8 ; dst + +.loopW: + movu m5, [r0] + paddw m5, m4 + + punpcklwd m6, m5, m3 + pmaddwd m6, m0 + psrad m6, xm1 + paddd m6, m2 + + punpckhwd m5, m3 + pmaddwd m5, m0 + psrad m5, xm1 + paddd m5, m2 + + packssdw m6, m5 + vextracti64x4 ym5, m6, 1 + packuswb ym6, ym5 + vpermq ym6, ym6, q3120 + + sub r6d, 32 + jl .widthLess30 + movu [r1], ym6 + je .nextH + add r0, 64 + add r1, 32 + jmp .loopW + + +.widthLess30: + mov r9d, 0xFFFFFFFF + NEG r6d + shrx r9d, r9d, r6d + kmovd k1, r9d + vmovdqu8 [r1] {k1}, ym6 + jmp .nextH + +.nextH: + lea r7, [r7 + r2] + lea r8, [r8 + r3] + + dec r5d + jnz .loopH + RET +%endif +%endif + + ;----------------------------------------------------------------- ; void transpose_4x4(pixel *dst, pixel *src, intptr_t stride) ;----------------------------------------------------------------- @@ -4060,6 +4980,68 @@ RET %endif +%if HIGH_BIT_DEPTH == 0 +INIT_ZMM avx512 +cglobal scale1D_128to64, 2, 2, 7 + pxor m4, m4 + mova m6, [dequant_shuf1_avx512] + vbroadcasti32x8 m5, [pb_1] + + ;Top pixel + movu m0, [r1] + movu m1, [r1 + 1 * mmsize] + movu m2, [r1 + 2 * mmsize] + movu m3, [r1 + 3 * mmsize] + + pmaddubsw m0, m5 + pavgw m0, m4 + pmaddubsw m1, m5 + pavgw m1, m4 + packuswb m0, m1 + vpermq m0, m6, m0 + movu [r0], m0 + + ;Left pixel + pmaddubsw m2, m5 + pavgw m2, m4 + pmaddubsw m3, m5 + pavgw m3, m4 + packuswb m2, m3 + vpermq m2, m6, m2 + movu [r0 + mmsize], m2 + RET + +INIT_ZMM avx512 +cglobal scale1D_128to64_aligned, 2, 2, 7 + pxor m4, m4 + mova m6, [dequant_shuf1_avx512] + vbroadcasti32x8 m5, [pb_1] + + ;Top pixel + mova m0, [r1] + mova m1, [r1 + 1 * mmsize] + mova m2, [r1 + 2 * mmsize] + mova m3, [r1 + 3 * mmsize] + + pmaddubsw m0, m5 + pavgw m0, m4 + pmaddubsw m1, m5 + pavgw m1, m4 + packuswb m0, m1 + vpermq m0, m6, m0 + mova [r0], m0 + + ;Left pixel + pmaddubsw m2, m5 + pavgw m2, m4 + pmaddubsw m3, m5 + pavgw m3, m4 + packuswb m2, m3 + vpermq m2, m6, m2 + mova [r0 + mmsize], m2 + RET +%endif + ;----------------------------------------------------------------- ; void scale2D_64to32(pixel *dst, pixel *src, intptr_t stride) ;----------------------------------------------------------------- @@ -5323,6 +6305,226 @@ PIXELSUB_PS_W32_H8_avx2 32, 64 %endif +%macro PROCESS_SUB_PS_32x8_AVX512 0 + pmovzxbw m0, [r2] + pmovzxbw m1, [r3] + pmovzxbw m2, [r2 + r4] + pmovzxbw 
m3, [r3 + r5] + pmovzxbw m4, [r2 + 2 * r4] + pmovzxbw m5, [r3 + 2 * r5] + pmovzxbw m6, [r2 + r7] + pmovzxbw m7, [r3 + r8] + + psubw m0, m1 + psubw m2, m3 + psubw m4, m5 + psubw m6, m7 + + movu [r0], m0 + movu [r0 + r1], m2 + movu [r0 + r1 * 2 ], m4 + movu [r0 + r9], m6 + + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] + + pmovzxbw m0, [r2] + pmovzxbw m1, [r3] + pmovzxbw m2, [r2 + r4] + pmovzxbw m3, [r3 + r5] + pmovzxbw m4, [r2 + 2 * r4] + pmovzxbw m5, [r3 + 2 * r5] + pmovzxbw m6, [r2 + r7] + pmovzxbw m7, [r3 + r8] + + psubw m0, m1 + psubw m2, m3 + psubw m4, m5 + psubw m6, m7 + + movu [r0], m0 + movu [r0 + r1], m2 + movu [r0 + r1 * 2 ], m4 + movu [r0 + r9], m6 +%endmacro + +%macro PROCESS_SUB_PS_32x8_HBD_AVX512 0 + movu m0, [r2] + movu m1, [r3] + movu m2, [r2 + r4] + movu m3, [r3 + r5] + psubw m0, m1 + psubw m2, m3 + + movu [r0], m0 + movu [r0 + r1], m2 + + movu m0, [r2 + r4 * 2] + movu m1, [r3 + r5 * 2] + movu m2, [r2 + r7] + movu m3, [r3 + r8] + psubw m0, m1 + psubw m2, m3 + + movu [r0 + r1 * 2], m0 + movu [r0 + r6], m2 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + + movu m0, [r2] + movu m1, [r3] + movu m2, [r2 + r4] + movu m3, [r3 + r5] + psubw m0, m1 + psubw m2, m3 + + movu [r0], m0 + movu [r0 + r1], m2 + + movu m0, [r2 + r4 * 2] + movu m1, [r3 + r5 * 2] + movu m2, [r2 + r7] + movu m3, [r3 + r8] + psubw m0, m1 + psubw m2, m3 + + movu [r0 + r1 * 2], m0 + movu [r0 + r6], m2 +%endmacro + +;----------------------------------------------------------------------------- +; void pixel_sub_ps_32x32(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); +;----------------------------------------------------------------------------- +%if HIGH_BIT_DEPTH +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sub_ps_32x32, 6, 9, 4 + add r1d, r1d + add r4d, r4d + add r5d, r5d + lea r6, [r1 * 3] + lea r7, [r4 * 3] + lea r8, [r5 * 3] + PROCESS_SUB_PS_32x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_32x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_32x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_32x8_HBD_AVX512 + RET + +cglobal pixel_sub_ps_32x64, 6, 9, 4 + add r1d, r1d + add r4d, r4d + add r5d, r5d + lea r6, [r1 * 3] + lea r7, [r4 * 3] + lea r8, [r5 * 3] + PROCESS_SUB_PS_32x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_32x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_32x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_32x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_32x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_32x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_32x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_32x8_HBD_AVX512 + RET +%endif +%else +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sub_ps_32x32, 6, 10, 8 + add r1, r1 + lea r7, [r4 * 3] + lea r8, [r5 * 3] + lea r9, [r1 * 3] + + PROCESS_SUB_PS_32x8_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] + PROCESS_SUB_PS_32x8_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] 
+ lea r0, [r0 + r1 * 4] + PROCESS_SUB_PS_32x8_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] + PROCESS_SUB_PS_32x8_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sub_ps_32x64, 6, 10, 8 + add r1, r1 + lea r7, [r4 * 3] + lea r8, [r5 * 3] + lea r9, [r1 * 3] + + PROCESS_SUB_PS_32x8_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] + PROCESS_SUB_PS_32x8_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] + PROCESS_SUB_PS_32x8_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] + PROCESS_SUB_PS_32x8_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] + PROCESS_SUB_PS_32x8_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] + PROCESS_SUB_PS_32x8_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] + PROCESS_SUB_PS_32x8_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] + PROCESS_SUB_PS_32x8_AVX512 + RET +%endif +%endif + ;----------------------------------------------------------------------------- ; void pixel_sub_ps_64x%2(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); ;----------------------------------------------------------------------------- @@ -5747,6 +6949,251 @@ jnz .loop RET %endif + +%macro PROCESS_SUB_PS_64x8_AVX512 0 + pmovzxbw m0, [r2] + pmovzxbw m1, [r2 + 32] + pmovzxbw m2, [r3] + pmovzxbw m3, [r3 + 32] + pmovzxbw m4, [r2 + r4] + pmovzxbw m5, [r2 + r4 + 32] + pmovzxbw m6, [r3 + r5] + pmovzxbw m7, [r3 + r5 + 32] + + psubw m0, m2 + psubw m1, m3 + psubw m4, m6 + psubw m5, m7 + movu [r0], m0 + movu [r0 + 64], m1 + movu [r0 + 2 * r1], m4 + movu [r0 + 2 * r1 + 64], m5 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 2 * r4] + lea r3, [r3 + 2 * r5] + + pmovzxbw m0, [r2] + pmovzxbw m1, [r2 + 32] + pmovzxbw m2, [r3] + pmovzxbw m3, [r3 + 32] + pmovzxbw m4, [r2 + r4] + pmovzxbw m5, [r2 + r4 + 32] + pmovzxbw m6, [r3 + r5] + pmovzxbw m7, [r3 + r5 + 32] + + psubw m0, m2 + psubw m1, m3 + psubw m4, m6 + psubw m5, m7 + movu [r0], m0 + movu [r0 + 64], m1 + movu [r0 + 2 * r1], m4 + movu [r0 + 2 * r1 + 64], m5 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 2 * r4] + lea r3, [r3 + 2 * r5] + + pmovzxbw m0, [r2] + pmovzxbw m1, [r2 + 32] + pmovzxbw m2, [r3] + pmovzxbw m3, [r3 + 32] + pmovzxbw m4, [r2 + r4] + pmovzxbw m5, [r2 + r4 + 32] + pmovzxbw m6, [r3 + r5] + pmovzxbw m7, [r3 + r5 + 32] + + psubw m0, m2 + psubw m1, m3 + psubw m4, m6 + psubw m5, m7 + movu [r0], m0 + movu [r0 + 64], m1 + movu [r0 + 2 * r1], m4 + movu [r0 + 2 * r1 + 64], m5 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 2 * r4] + lea r3, [r3 + 2 * r5] + + pmovzxbw m0, [r2] + pmovzxbw m1, [r2 + 32] + pmovzxbw m2, [r3] + pmovzxbw m3, [r3 + 32] + pmovzxbw m4, [r2 + r4] + pmovzxbw m5, [r2 + r4 + 32] + pmovzxbw m6, [r3 + r5] + pmovzxbw m7, [r3 + r5 + 32] + + psubw m0, m2 + psubw m1, m3 + psubw m4, m6 + psubw m5, m7 + movu [r0], m0 + movu [r0 + 64], m1 + movu [r0 + 2 * r1], m4 + movu [r0 + 2 * r1 + 64], m5 +%endmacro + +%macro PROCESS_SUB_PS_64x8_HBD_AVX512 0 + movu m0, [r2] + movu m1, [r2 + 64] + movu m4, [r3] + movu m5, [r3 + 64] + psubw m0, m4 + psubw m1, m5 + movu m2, [r2 + r4] + movu m3, [r2 + r4 + 64] + movu m6, [r3 + r5] + movu m7, [r3 + r5 + 64] + psubw m2, m6 + psubw m3, m7 + + movu [r0], m0 + movu [r0 + 64], m1 + movu [r0 + r1], m2 + movu [r0 + r1 + 64], m3 + + movu m0, [r2 + r4 * 2] + movu m1, [r2 + r4 * 2 + 64] + movu m4, [r3 + r5 * 2] + movu m5, [r3 + r5 * 2 + 64] + psubw 
m0, m4 + psubw m1, m5 + movu m2, [r2 + r7] + movu m3, [r2 + r7 + 64] + movu m6, [r3 + r8] + movu m7, [r3 + r8 + 64] + psubw m2, m6 + psubw m3, m7 + + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 64], m1 + movu [r0 + r6], m2 + movu [r0 + r6 + 64], m3 + + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + + movu m0, [r2] + movu m1, [r2 + 64] + movu m4, [r3] + movu m5, [r3 + 64] + psubw m0, m4 + psubw m1, m5 + movu m2, [r2 + r4] + movu m3, [r2 + r4 + 64] + movu m6, [r3 + r5] + movu m7, [r3 + r5 + 64] + psubw m2, m6 + psubw m3, m7 + + movu [r0], m0 + movu [r0 + 64], m1 + movu [r0 + r1], m2 + movu [r0 + r1 + 64], m3 + + movu m0, [r2 + r4 * 2] + movu m1, [r2 + r4 * 2 + 64] + movu m4, [r3 + r5 * 2] + movu m5, [r3 + r5 * 2 + 64] + psubw m0, m4 + psubw m1, m5 + movu m2, [r2 + r7] + movu m3, [r2 + r7 + 64] + movu m6, [r3 + r8] + movu m7, [r3 + r8 + 64] + psubw m2, m6 + psubw m3, m7 + + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 64], m1 + movu [r0 + r6], m2 + movu [r0 + r6 + 64], m3 +%endmacro +;----------------------------------------------------------------------------- +; void pixel_sub_ps_64x64(int16_t *dest, intptr_t destride, pixel *src0, pixel *src1, intptr_t srcstride0, intptr_t srcstride1); +;----------------------------------------------------------------------------- +%if HIGH_BIT_DEPTH +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sub_ps_64x64, 6, 9, 8 + add r1d, r1d + add r4d, r4d + add r5d, r5d + lea r6, [r1 * 3] + lea r7, [r4 * 3] + lea r8, [r5 * 3] + + PROCESS_SUB_PS_64x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_64x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_64x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_64x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_64x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_64x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_64x8_HBD_AVX512 + lea r0, [r0 + r1 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + PROCESS_SUB_PS_64x8_HBD_AVX512 + RET +%endif +%else +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sub_ps_64x64, 6, 7, 8 + PROCESS_SUB_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 2 * r4] + lea r3, [r3 + 2 * r5] + PROCESS_SUB_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 2 * r4] + lea r3, [r3 + 2 * r5] + PROCESS_SUB_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 2 * r4] + lea r3, [r3 + 2 * r5] + PROCESS_SUB_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 2 * r4] + lea r3, [r3 + 2 * r5] + PROCESS_SUB_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 2 * r4] + lea r3, [r3 + 2 * r5] + PROCESS_SUB_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 2 * r4] + lea r3, [r3 + 2 * r5] + PROCESS_SUB_PS_64x8_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 2 * r4] + lea r3, [r3 + 2 * r5] + PROCESS_SUB_PS_64x8_AVX512 + RET +%endif +%endif ;============================================================================= ; variance ;============================================================================= @@ -5757,7 +7204,7 @@ %if HIGH_BIT_DEPTH == 0 %if %1 mova m7, [pw_00ff] -%elif mmsize < 32 +%elif mmsize == 16 pxor m7, m7 ; zero %endif %endif ; !HIGH_BIT_DEPTH @@ -6476,6 +7923,245 @@ RET %endif ; !HIGH_BIT_DEPTH +%macro PROCESS_VAR_32x8_AVX512 0 + pmovzxbw 
m0, [r0] + pmovzxbw m1, [r0 + r1] + pmovzxbw m2, [r0 + 2 * r1] + pmovzxbw m3, [r0 + r2] + + paddw m4, m0 + paddw m4, m1 + paddw m4, m2 + paddw m4, m3 + pmaddwd m0, m0 + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + paddd m5, m0 + paddd m5, m1 + paddd m5, m2 + paddd m5, m3 + + lea r0, [r0 + r1 * 4] + + pmovzxbw m0, [r0] + pmovzxbw m1, [r0 + r1] + pmovzxbw m2, [r0 + 2 * r1] + pmovzxbw m3, [r0 + r2] + + paddw m4, m0 + paddw m4, m1 + paddw m4, m2 + paddw m4, m3 + pmaddwd m0, m0 + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + paddd m5, m0 + paddd m5, m1 + paddd m5, m2 + paddd m5, m3 +%endmacro + +%macro PROCESS_VAR_AVX512_END 0 + vextracti32x8 ym0, m4, 1 + vextracti32x8 ym1, m5, 1 + paddw ym4, ym0 + paddd ym5, ym1 + vextracti32x4 xm0, m4, 1 + vextracti32x4 xm1, m5, 1 + paddw xm4, xm0 + paddd xm5, xm1 + HADDW xm4, xm2 + HADDD xm5, xm1 +%if ARCH_X86_64 + punpckldq xm4, xm5 + movq rax, xm4 +%else + movd eax, xm4 + movd edx, xm5 +%endif +%endmacro +%if ARCH_X86_64 == 1 && HIGH_BIT_DEPTH == 0 +;----------------------------------------------------------------------------- +; int pixel_var_wxh( uint8_t *, intptr_t ) +;----------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal pixel_var_32x32, 2,4,6 + pxor m4, m4 ; sum + pxor m5, m5 ; sum squared + lea r2, [3 * r1] + + PROCESS_VAR_32x8_AVX512 + lea r0, [r0 + r1 * 4] + PROCESS_VAR_32x8_AVX512 + lea r0, [r0 + r1 * 4] + PROCESS_VAR_32x8_AVX512 + lea r0, [r0 + r1 * 4] + PROCESS_VAR_32x8_AVX512 + PROCESS_VAR_AVX512_END + RET + +INIT_ZMM avx512 +cglobal pixel_var_64x64, 2,4,7 + pxor m5, m5 ; sum + pxor m6, m6 ; sum squared + mov r2d, 32 + +.loop: + pmovzxbw m0, [r0] + pmovzxbw m3, [r0 + mmsize/2] + pmovzxbw m1, [r0 + r1] + pmovzxbw m4, [r0 + r1 + mmsize/2] + + lea r0, [r0 + 2 * r1] + + paddw m5, m0 + paddw m5, m3 + paddw m5, m1 + paddw m5, m4 + pmaddwd m0, m0 + pmaddwd m3, m3 + pmaddwd m1, m1 + pmaddwd m4, m4 + paddd m6, m0 + paddd m6, m3 + paddd m6, m1 + paddd m6, m4 + + dec r2d + jg .loop + + pxor m1, m1 + punpcklwd m0, m5, m1 + punpckhwd m5, m1 + paddd m5, m0 + vextracti32x8 ym2, m5, 1 + vextracti32x8 ym1, m6, 1 + paddd ym5, ym2 + paddd ym6, ym1 + vextracti32x4 xm2, m5, 1 + vextracti32x4 xm1, m6, 1 + paddd xm5, xm2 + paddd xm6, xm1 + HADDD xm5, xm2 + HADDD xm6, xm1 + punpckldq xm5, xm6 + movq rax, xm5 + RET +%endif +%macro VAR_AVX512_CORE 1 ; accum +%if %1 + paddw m0, m2 + pmaddwd m2, m2 + paddw m0, m3 + pmaddwd m3, m3 + paddd m1, m2 + paddd m1, m3 +%else + paddw m0, m2, m3 + pmaddwd m2, m2 + pmaddwd m3, m3 + paddd m1, m2, m3 +%endif +%endmacro + +%macro VAR_AVX512_CORE_16x16 1 ; accum +%if HIGH_BIT_DEPTH + mova ym2, [r0] + vinserti64x4 m2, [r0+r1], 1 + mova ym3, [r0+2*r1] + vinserti64x4 m3, [r0+r3], 1 +%else + vbroadcasti64x2 ym2, [r0] + vbroadcasti64x2 m2 {k1}, [r0+r1] + vbroadcasti64x2 ym3, [r0+2*r1] + vbroadcasti64x2 m3 {k1}, [r0+r3] + pshufb m2, m4 + pshufb m3, m4 +%endif + VAR_AVX512_CORE %1 +%endmacro + +%macro VAR_AVX512_CORE_8x8 1 ; accum +%if HIGH_BIT_DEPTH + mova xm2, [r0] + mova xm3, [r0+r1] +%else + movq xm2, [r0] + movq xm3, [r0+r1] +%endif + vinserti128 ym2, [r0+2*r1], 1 + vinserti128 ym3, [r0+r2], 1 + lea r0, [r0+4*r1] + vinserti32x4 m2, [r0], 2 + vinserti32x4 m3, [r0+r1], 2 + vinserti32x4 m2, [r0+2*r1], 3 + vinserti32x4 m3, [r0+r2], 3 +%if HIGH_BIT_DEPTH == 0 + punpcklbw m2, m4 + punpcklbw m3, m4 +%endif + VAR_AVX512_CORE %1 +%endmacro + +INIT_ZMM avx512 +cglobal pixel_var_16x16, 2,4 + FIX_STRIDES r1 + mov r2d, 0xf0 + lea r3, [3*r1] +%if HIGH_BIT_DEPTH == 0 + vbroadcasti64x4 m4, 
[var_shuf_avx512] + kmovb k1, r2d +%endif + VAR_AVX512_CORE_16x16 0 +.loop: + lea r0, [r0+4*r1] + VAR_AVX512_CORE_16x16 1 + sub r2d, 0x50 + jg .loop +%if ARCH_X86_64 == 0 + pop r3d + %assign regs_used 3 +%endif +var_avx512_end: + vbroadcasti32x4 m2, [pw_1] + pmaddwd m0, m2 + SBUTTERFLY dq, 0, 1, 2 + paddd m0, m1 + vextracti32x8 ym1, m0, 1 + paddd ym0, ym1 + vextracti128 xm1, ym0, 1 + paddd xmm0, xm0, xm1 + punpckhqdq xmm1, xmm0, xmm0 + paddd xmm0, xmm1 +%if ARCH_X86_64 + movq rax, xmm0 +%else + movd eax, xmm0 + pextrd edx, xmm0, 1 + %endif + RET + +%if HIGH_BIT_DEPTH == 0 ; 8x8 doesn't benefit from AVX-512 in high bit-depth +cglobal pixel_var_8x8, 2,3 + lea r2, [3*r1] + pxor xm4, xm4 + VAR_AVX512_CORE_8x8 0 + jmp var_avx512_end +%endif + +cglobal pixel_var_8x16, 2,3 + FIX_STRIDES r1 + lea r2, [3*r1] +%if HIGH_BIT_DEPTH == 0 + pxor xm4, xm4 +%endif + VAR_AVX512_CORE_8x8 0 + lea r0, [r0+4*r1] + VAR_AVX512_CORE_8x8 1 + jmp var_avx512_end + %macro VAR2_END 3 HADDW %2, xm1 movd r1d, %2
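pixel-util8.asm gains AVX-512 versions of getResidual, quant, nquant, dequant_scaling, dequant_normal, count_nonzero, weight_pp, weight_sp, scale1D_128to64, pixel_sub_ps and the variance kernels. For orientation, here is a scalar sketch of the quant primitive whose signature is quoted in the comments above; it is illustrative only, and the saturation and overflow handling of the hand-written asm is more involved than shown here.

#include <stdint.h>
#include <stdlib.h>

/* Scalar sketch of quant(): scale |coef| by the quantiser, shift down by
 * qBits with rounding 'add', record the rounding residue in deltaU (used
 * later for RDOQ / sign hiding), restore the sign, and count the non-zero
 * quantised levels -- that count is the return value the AVX-512 kernel
 * accumulates with pminuw/psadbw. */
static uint32_t quant_ref(const int16_t *coef, const int32_t *quantCoeff,
                          int32_t *deltaU, int16_t *qCoef,
                          int qBits, int add, int numCoeff)
{
    uint32_t numSig = 0;
    int qBits8 = qBits - 8;
    for (int i = 0; i < numCoeff; i++) {
        int level = coef[i];
        int sign  = (level < 0) ? -1 : 1;
        int tmp   = abs(level) * quantCoeff[i];
        level     = (tmp + add) >> qBits;
        deltaU[i] = (tmp - (level << qBits)) >> qBits8;
        numSig   += (level != 0);
        qCoef[i]  = (int16_t)(sign * level);   /* the asm additionally saturates to int16 */
    }
    return numSig;
}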
View file
x265_2.7.tar.gz/source/common/x86/pixel.h -> x265_2.9.tar.gz/source/common/x86/pixel.h
Changed
@@ -34,6 +34,7 @@ void PFX(downShift_16_avx2)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); void PFX(upShift_16_sse2)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); void PFX(upShift_16_avx2)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); +void PFX(upShift_16_avx512)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); void PFX(upShift_8_sse4)(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift); void PFX(upShift_8_avx2)(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift); pixel PFX(planeClipAndMax_avx2)(pixel *src, intptr_t stride, int width, int height, uint64_t *outsum, const pixel minPix, const pixel maxPix); @@ -44,14 +45,19 @@ FUNCDEF_PU(void, pixel_sad_x3, cpu, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*); \ FUNCDEF_PU(void, pixel_sad_x4, cpu, const pixel*, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*); \ FUNCDEF_PU(void, pixel_avg, cpu, pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); \ + FUNCDEF_PU(void, pixel_avg_aligned, cpu, pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); \ FUNCDEF_PU(void, pixel_add_ps, cpu, pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); \ + FUNCDEF_PU(void, pixel_add_ps_aligned, cpu, pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); \ FUNCDEF_PU(void, pixel_sub_ps, cpu, int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); \ FUNCDEF_CHROMA_PU(int, pixel_satd, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \ FUNCDEF_CHROMA_PU(int, pixel_sad, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \ FUNCDEF_CHROMA_PU(sse_t, pixel_ssd_ss, cpu, const int16_t*, intptr_t, const int16_t*, intptr_t); \ FUNCDEF_CHROMA_PU(void, addAvg, cpu, const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); \ + FUNCDEF_CHROMA_PU(void, addAvg_aligned, cpu, const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); \ FUNCDEF_CHROMA_PU(sse_t, pixel_ssd_s, cpu, const int16_t*, intptr_t); \ + FUNCDEF_CHROMA_PU(sse_t, pixel_ssd_s_aligned, cpu, const int16_t*, intptr_t); \ FUNCDEF_TU_S(sse_t, pixel_ssd_s, cpu, const int16_t*, intptr_t); \ + FUNCDEF_TU_S(sse_t, pixel_ssd_s_aligned, cpu, const int16_t*, intptr_t); \ FUNCDEF_TU(uint64_t, pixel_var, cpu, const pixel*, intptr_t); \ FUNCDEF_TU(int, psyCost_pp, cpu, const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); \ FUNCDEF_TU(int, psyCost_ss, cpu, const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride) @@ -65,6 +71,7 @@ DECL_PIXELS(avx); DECL_PIXELS(xop); DECL_PIXELS(avx2); +DECL_PIXELS(avx512); #undef DECL_PIXELS
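pixel.h adds the aligned-buffer prototypes (pixel_avg_aligned, pixel_add_ps_aligned, addAvg_aligned, pixel_ssd_s_aligned), an upShift_16_avx512 declaration and the avx512 DECL_PIXELS instantiation. One primitive declared through FUNCDEF_TU in this header, pixel_var, returns two packed sums; the scalar sketch below (illustrative, with an explicit size parameter added here) shows that contract, which is how the AVX-512 pixel_var kernels in pixel-util8.asm assemble their result before movq rax.

#include <stdint.h>

typedef uint8_t pixel;  /* 8-bit build */

/* pixel_var_WxH packs the pixel sum in the low 32 bits of the return value
 * and the sum of squared pixels in the high 32 bits; the caller combines the
 * two to derive the block variance. */
static uint64_t pixel_var_ref(const pixel *pix, intptr_t stride, int size)
{
    uint32_t sum = 0, sqr = 0;
    for (int y = 0; y < size; y++, pix += stride)
        for (int x = 0; x < size; x++) {
            sum += pix[x];
            sqr += (uint32_t)pix[x] * pix[x];
        }
    return sum + ((uint64_t)sqr << 32);
}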
View file
x265_2.7.tar.gz/source/common/x86/pixeladd8.asm -> x265_2.9.tar.gz/source/common/x86/pixeladd8.asm
Changed
@@ -24,11 +24,11 @@ %include "x86inc.asm" %include "x86util.asm" +SECTION_RODATA 64 -SECTION_RODATA 32 - +ALIGN 64 +const store_shuf1_avx512, dq 0, 2, 4, 6, 1, 3, 5, 7 SECTION .text - cextern pw_pixel_max ;----------------------------------------------------------------------------- @@ -768,7 +768,6 @@ PIXEL_ADD_PS_W32_H4_avx2 32 PIXEL_ADD_PS_W32_H4_avx2 64 - ;----------------------------------------------------------------------------- ; void pixel_add_ps_64x%2(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1) ;----------------------------------------------------------------------------- @@ -1145,3 +1144,505 @@ RET %endif + +;----------------------------------------------------------------------------- +; pixel_add_ps avx512 code start +;----------------------------------------------------------------------------- +%macro PROCESS_ADD_PS_64x4_AVX512 0 + pmovzxbw m0, [r2] + pmovzxbw m1, [r2 + mmsize/2] + movu m2, [r3] + movu m3, [r3 + mmsize] + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + vpermq m0, m4, m0 + movu [r0], m0 + pmovzxbw m0, [r2 + r4] + pmovzxbw m1, [r2 + r4 + mmsize/2] + movu m2, [r3 + r5] + movu m3, [r3 + r5 + mmsize] + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + vpermq m0, m4, m0 + movu [r0 + r1], m0 + pmovzxbw m0, [r2 + 2 * r4] + pmovzxbw m1, [r2 + 2 * r4 + mmsize/2] + movu m2, [r3 + 2 * r5] + movu m3, [r3 + 2 * r5 + mmsize] + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + vpermq m0, m4, m0 + movu [r0 + 2 * r1], m0 + + pmovzxbw m0, [r2 + r7] + pmovzxbw m1, [r2 + r7 + mmsize/2] + movu m2, [r3 + r8] + movu m3, [r3 + r8 + mmsize] + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + vpermq m0, m4, m0 + movu [r0 + r6], m0 +%endmacro + +%macro PROCESS_ADD_PS_64x4_HBD_AVX512 0 + movu m0, [r2] + movu m1, [r2 + mmsize] + movu m2, [r3] + movu m3, [r3 + mmsize] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0], m0 + movu [r0 + mmsize], m1 + + movu m0, [r2 + r4] + movu m1, [r2 + r4 + mmsize] + movu m2, [r3 + r5] + movu m3, [r3 + r5 + mmsize] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0 + r1], m0 + movu [r0 + r1 + mmsize], m1 + + movu m0, [r2 + r4 * 2] + movu m1, [r2 + r4 * 2 + mmsize] + movu m2, [r3 + r5 * 2] + movu m3, [r3 + r5 * 2 + mmsize] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + mmsize], m1 + + movu m0, [r2 + r6] + movu m1, [r2 + r6 + mmsize] + movu m2, [r3 + r7] + movu m3, [r3 + r7 + mmsize] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0 + r8], m0 + movu [r0 + r8 + mmsize], m1 +%endmacro + +%macro PROCESS_ADD_PS_64x4_ALIGNED_AVX512 0 + pmovzxbw m0, [r2] + pmovzxbw m1, [r2 + mmsize/2] + mova m2, [r3] + mova m3, [r3 + mmsize] + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + vpermq m0, m4, m0 + mova [r0], m0 + pmovzxbw m0, [r2 + r4] + pmovzxbw m1, [r2 + r4 + mmsize/2] + mova m2, [r3 + r5] + mova m3, [r3 + r5 + mmsize] + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + vpermq m0, m4, m0 + mova [r0 + r1], m0 + pmovzxbw m0, [r2 + 2 * r4] + pmovzxbw m1, [r2 + 2 * r4 + mmsize/2] + mova m2, [r3 + 2 * r5] + mova m3, [r3 + 2 * r5 + mmsize] + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + vpermq m0, m4, m0 + mova [r0 + 2 * r1], m0 + + pmovzxbw m0, [r2 + r7] + pmovzxbw m1, [r2 + r7 + mmsize/2] + mova m2, [r3 + r8] + mova m3, [r3 + r8 + mmsize] + paddw m0, m2 + paddw m1, m3 + packuswb m0, m1 + vpermq m0, m4, m0 + mova [r0 + r6], m0 +%endmacro + +%macro PROCESS_ADD_PS_64x4_HBD_ALIGNED_AVX512 0 + mova m0, [r2] + mova m1, 
[r2 + mmsize] + mova m2, [r3] + mova m3, [r3 + mmsize] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + mova [r0], m0 + mova [r0 + mmsize], m1 + + mova m0, [r2 + r4] + mova m1, [r2 + r4 + mmsize] + mova m2, [r3 + r5] + mova m3, [r3 + r5 + mmsize] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + mova [r0 + r1], m0 + mova [r0 + r1 + mmsize], m1 + + mova m0, [r2 + r4 * 2] + mova m1, [r2 + r4 * 2 + mmsize] + mova m2, [r3 + r5 * 2] + mova m3, [r3 + r5 * 2 + mmsize] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + mova [r0 + r1 * 2], m0 + mova [r0 + r1 * 2 + mmsize], m1 + + mova m0, [r2 + r6] + mova m1, [r2 + r6 + mmsize] + mova m2, [r3 + r7] + mova m3, [r3 + r7 + mmsize] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + mova [r0 + r8], m0 + mova [r0 + r8 + mmsize], m1 +%endmacro + +;----------------------------------------------------------------------------- +; void pixel_add_ps_64x64(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1) +;----------------------------------------------------------------------------- +%if HIGH_BIT_DEPTH +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_add_ps_64x64, 6, 9, 6 + vbroadcasti32x8 m5, [pw_pixel_max] + pxor m4, m4 + add r4d, r4d + add r5d, r5d + add r1d, r1d + lea r6, [r4 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] +%rep 15 + PROCESS_ADD_PS_64x4_HBD_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_ADD_PS_64x4_HBD_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_add_ps_aligned_64x64, 6, 9, 6 + vbroadcasti32x8 m5, [pw_pixel_max] + pxor m4, m4 + add r4d, r4d + add r5d, r5d + add r1d, r1d + lea r6, [r4 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] +%rep 15 + PROCESS_ADD_PS_64x4_HBD_ALIGNED_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_ADD_PS_64x4_HBD_ALIGNED_AVX512 + RET +%endif +%else +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_add_ps_64x64, 6, 9, 4 + add r5, r5 + lea r6, [3 * r1] + lea r7, [3 * r4] + lea r8, [3 * r5] + mova m4, [store_shuf1_avx512] +%rep 15 + PROCESS_ADD_PS_64x4_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_ADD_PS_64x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_add_ps_aligned_64x64, 6, 9, 4 + add r5, r5 + lea r6, [3 * r1] + lea r7, [3 * r4] + lea r8, [3 * r5] + mova m4, [store_shuf1_avx512] +%rep 15 + PROCESS_ADD_PS_64x4_ALIGNED_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_ADD_PS_64x4_ALIGNED_AVX512 + RET +%endif +%endif + +%macro PROCESS_ADD_PS_32x4_AVX512 0 + pmovzxbw m0, [r2] + movu m1, [r3] + pmovzxbw m2, [r2 + r4] + movu m3, [r3 + r5] + paddw m0, m1 + paddw m2, m3 + packuswb m0, m2 + vpermq m0, m4, m0 + movu [r0], ym0 + vextracti32x8 [r0 + r1], m0, 1 + pmovzxbw m0, [r2 + r4 * 2] + movu m1, [r3 + r5 * 2] + pmovzxbw m2, [r2 + r6] + movu m3, [r3 + r7] + paddw m0, m1 + paddw m2, m3 + packuswb m0, m2 + vpermq m0, m4, m0 + movu [r0 + r1 * 2], ym0 + vextracti32x8 [r0 + r8], m0, 1 +%endmacro + +%macro PROCESS_ADD_PS_32x4_HBD_AVX512 0 + movu m0, [r2] + movu m1, [r2 + r4] + movu m2, [r3] + movu m3, [r3 + r5] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0], m0 + movu [r0 + r1], m1 + + movu m0, [r2 + r4 * 2] + movu m1, [r2 + r6] + movu m2, [r3 + r5 * 2] + movu m3, [r3 + r7] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + movu [r0 + r1 * 2], m0 + movu [r0 + r8], m1 +%endmacro + +%macro PROCESS_ADD_PS_32x4_ALIGNED_AVX512 0 + 
pmovzxbw m0, [r2] + mova m1, [r3] + pmovzxbw m2, [r2 + r4] + mova m3, [r3 + r5] + paddw m0, m1 + paddw m2, m3 + packuswb m0, m2 + vpermq m0, m4, m0 + mova [r0], ym0 + vextracti32x8 [r0 + r1], m0, 1 + pmovzxbw m0, [r2 + r4 * 2] + mova m1, [r3 + r5 * 2] + pmovzxbw m2, [r2 + r6] + mova m3, [r3 + r7] + paddw m0, m1 + paddw m2, m3 + packuswb m0, m2 + vpermq m0, m4, m0 + mova [r0 + r1 * 2], ym0 + vextracti32x8 [r0 + r8], m0, 1 +%endmacro + +%macro PROCESS_ADD_PS_32x4_HBD_ALIGNED_AVX512 0 + mova m0, [r2] + mova m1, [r2 + r4] + mova m2, [r3] + mova m3, [r3 + r5] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + mova [r0], m0 + mova [r0 + r1], m1 + + mova m0, [r2 + r4 * 2] + mova m1, [r2 + r6] + mova m2, [r3 + r5 * 2] + mova m3, [r3 + r7] + paddw m0, m2 + paddw m1, m3 + + CLIPW2 m0, m1, m4, m5 + mova [r0 + r1 * 2], m0 + mova [r0 + r8], m1 +%endmacro + +;----------------------------------------------------------------------------- +; void pixel_add_ps_32x32(pixel *dest, intptr_t destride, pixel *src0, int16_t *scr1, intptr_t srcStride0, intptr_t srcStride1) +;----------------------------------------------------------------------------- +%if HIGH_BIT_DEPTH +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_add_ps_32x32, 6, 9, 6 + vbroadcasti32x8 m5, [pw_pixel_max] + pxor m4, m4 + add r4d, r4d + add r5d, r5d + add r1d, r1d + lea r6, [r4 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] +%rep 7 + PROCESS_ADD_PS_32x4_HBD_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_ADD_PS_32x4_HBD_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_add_ps_32x64, 6, 9, 6 + vbroadcasti32x8 m5, [pw_pixel_max] + pxor m4, m4 + add r4d, r4d + add r5d, r5d + add r1d, r1d + lea r6, [r4 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] +%rep 15 + PROCESS_ADD_PS_32x4_HBD_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_ADD_PS_32x4_HBD_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_add_ps_aligned_32x32, 6, 9, 6 + vbroadcasti32x8 m5, [pw_pixel_max] + pxor m4, m4 + add r4d, r4d + add r5d, r5d + add r1d, r1d + lea r6, [r4 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] +%rep 7 + PROCESS_ADD_PS_32x4_HBD_ALIGNED_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_ADD_PS_32x4_HBD_ALIGNED_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_add_ps_aligned_32x64, 6, 9, 6 + vbroadcasti32x8 m5, [pw_pixel_max] + pxor m4, m4 + add r4d, r4d + add r5d, r5d + add r1d, r1d + lea r6, [r4 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] +%rep 15 + PROCESS_ADD_PS_32x4_HBD_ALIGNED_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_ADD_PS_32x4_HBD_ALIGNED_AVX512 + RET +%endif +%else +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_add_ps_32x32, 6, 9, 5 + add r5, r5 + lea r6, [r4 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mova m4, [store_shuf1_avx512] +%rep 7 + PROCESS_ADD_PS_32x4_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_ADD_PS_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_add_ps_32x64, 6, 9, 5 + add r5, r5 + lea r6, [r4 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mova m4, [store_shuf1_avx512] + +%rep 15 + PROCESS_ADD_PS_32x4_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_ADD_PS_32x4_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_add_ps_aligned_32x32, 6, 9, 5 + add r5, r5 + lea r6, [r4 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mova m4, [store_shuf1_avx512] +%rep 7 + 
PROCESS_ADD_PS_32x4_ALIGNED_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_ADD_PS_32x4_ALIGNED_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_add_ps_aligned_32x64, 6, 9, 5 + add r5, r5 + lea r6, [r4 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mova m4, [store_shuf1_avx512] + +%rep 15 + PROCESS_ADD_PS_32x4_ALIGNED_AVX512 + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r5 * 4] + lea r0, [r0 + r1 * 4] +%endrep + PROCESS_ADD_PS_32x4_ALIGNED_AVX512 + RET +%endif +%endif +;----------------------------------------------------------------------------- +; pixel_add_ps avx512 code end +;-----------------------------------------------------------------------------
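The pixel_add_ps AVX-512 block above reconstructs pixels by adding an int16_t residual plane to a pixel prediction plane and clipping, with separate HIGH_BIT_DEPTH (CLIPW2 against pw_pixel_max) and 8-bit (packuswb) paths plus _aligned variants that use mova loads. A scalar sketch of the operation is given below; the explicit width/height parameters are added here only for illustration, since the real primitives are generated per fixed block size.

#include <stdint.h>

typedef uint8_t pixel;  /* 8-bit build; the HIGH_BIT_DEPTH path clips to pw_pixel_max instead of 255 */

/* dst = clip(prediction + residual), row by row, with independent strides for
 * the destination, the pixel source and the int16_t residual source. */
static void pixel_add_ps_ref(pixel *dst, intptr_t dstride,
                             const pixel *src0, const int16_t *src1,
                             intptr_t sstride0, intptr_t sstride1,
                             int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int v = src0[x] + src1[x];
            dst[x] = (pixel)(v < 0 ? 0 : v > 255 ? 255 : v);
        }
        src0 += sstride0;
        src1 += sstride1;
        dst  += dstride;
    }
}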
View file
x265_2.7.tar.gz/source/common/x86/sad-a.asm -> x265_2.9.tar.gz/source/common/x86/sad-a.asm
Changed
@@ -378,111 +378,63 @@ lea r0, [r0 + r1] %endmacro -%macro SAD_W16 0 -;----------------------------------------------------------------------------- -; int pixel_sad_16x16( uint8_t *, intptr_t, uint8_t *, intptr_t ) -;----------------------------------------------------------------------------- -cglobal pixel_sad_16x16, 4,4,8 - movu m0, [r2] - movu m1, [r2+r3] - lea r2, [r2+2*r3] - movu m2, [r2] - movu m3, [r2+r3] - lea r2, [r2+2*r3] - psadbw m0, [r0] - psadbw m1, [r0+r1] - lea r0, [r0+2*r1] - movu m4, [r2] - paddw m0, m1 - psadbw m2, [r0] - psadbw m3, [r0+r1] - lea r0, [r0+2*r1] - movu m5, [r2+r3] - lea r2, [r2+2*r3] - paddw m2, m3 - movu m6, [r2] - movu m7, [r2+r3] - lea r2, [r2+2*r3] - paddw m0, m2 - psadbw m4, [r0] - psadbw m5, [r0+r1] - lea r0, [r0+2*r1] - movu m1, [r2] - paddw m4, m5 - psadbw m6, [r0] - psadbw m7, [r0+r1] - lea r0, [r0+2*r1] - movu m2, [r2+r3] - lea r2, [r2+2*r3] - paddw m6, m7 - movu m3, [r2] - paddw m0, m4 - movu m4, [r2+r3] - lea r2, [r2+2*r3] - paddw m0, m6 - psadbw m1, [r0] - psadbw m2, [r0+r1] - lea r0, [r0+2*r1] - movu m5, [r2] - paddw m1, m2 - psadbw m3, [r0] - psadbw m4, [r0+r1] - lea r0, [r0+2*r1] - movu m6, [r2+r3] - lea r2, [r2+2*r3] - paddw m3, m4 - movu m7, [r2] - paddw m0, m1 - movu m1, [r2+r3] - paddw m0, m3 - psadbw m5, [r0] - psadbw m6, [r0+r1] - lea r0, [r0+2*r1] - paddw m5, m6 - psadbw m7, [r0] - psadbw m1, [r0+r1] - paddw m7, m1 - paddw m0, m5 - paddw m0, m7 - SAD_END_SSE2 +%macro SAD_W16 1 ; h +cglobal pixel_sad_16x%1, 4,4 +%ifidn cpuname, sse2 +.skip_prologue: +%endif +%assign %%i 0 +%if ARCH_X86_64 + lea r6, [3*r1] ; r6 results in fewer REX prefixes than r4 and both are volatile + lea r5, [3*r3] +%rep %1/4 + movu m1, [r2] + psadbw m1, [r0] + movu m3, [r2+r3] + psadbw m3, [r0+r1] + movu m2, [r2+2*r3] + psadbw m2, [r0+2*r1] + movu m4, [r2+r5] + psadbw m4, [r0+r6] +%if %%i != %1/4-1 + lea r2, [r2+4*r3] + lea r0, [r0+4*r1] +%endif + paddw m1, m3 + paddw m2, m4 + ACCUM paddw, 0, 1, %%i + paddw m0, m2 + %assign %%i %%i+1 +%endrep +%else ; The cost of having to save and restore registers on x86-32 +%rep %1/2 ; nullifies the benefit of having 3*stride in registers. 
+ movu m1, [r2] + psadbw m1, [r0] + movu m2, [r2+r3] + psadbw m2, [r0+r1] +%if %%i != %1/2-1 + lea r2, [r2+2*r3] + lea r0, [r0+2*r1] +%endif + ACCUM paddw, 0, 1, %%i + paddw m0, m2 + %assign %%i %%i+1 +%endrep +%endif + SAD_END_SSE2 + %endmacro -;----------------------------------------------------------------------------- -; int pixel_sad_16x8( uint8_t *, intptr_t, uint8_t *, intptr_t ) -;----------------------------------------------------------------------------- -cglobal pixel_sad_16x8, 4,4 - movu m0, [r2] - movu m2, [r2+r3] - lea r2, [r2+2*r3] - movu m3, [r2] - movu m4, [r2+r3] - psadbw m0, [r0] - psadbw m2, [r0+r1] - lea r0, [r0+2*r1] - psadbw m3, [r0] - psadbw m4, [r0+r1] - lea r0, [r0+2*r1] - lea r2, [r2+2*r3] - paddw m0, m2 - paddw m3, m4 - paddw m0, m3 - movu m1, [r2] - movu m2, [r2+r3] - lea r2, [r2+2*r3] - movu m3, [r2] - movu m4, [r2+r3] - psadbw m1, [r0] - psadbw m2, [r0+r1] - lea r0, [r0+2*r1] - psadbw m3, [r0] - psadbw m4, [r0+r1] - lea r0, [r0+2*r1] - lea r2, [r2+2*r3] - paddw m1, m2 - paddw m3, m4 - paddw m0, m1 - paddw m0, m3 - SAD_END_SSE2 +INIT_XMM sse2 +SAD_W16 8 +SAD_W16 16 +INIT_XMM sse3 +SAD_W16 8 +SAD_W16 16 +INIT_XMM sse2, aligned +SAD_W16 8 +SAD_W16 16 +%macro SAD_Wx 0 ;----------------------------------------------------------------------------- ; int pixel_sad_16x12( uint8_t *, intptr_t, uint8_t *, intptr_t ) ;----------------------------------------------------------------------------- @@ -808,11 +760,11 @@ %endmacro INIT_XMM sse2 -SAD_W16 +SAD_Wx INIT_XMM sse3 -SAD_W16 +SAD_Wx INIT_XMM sse2, aligned -SAD_W16 +SAD_Wx %macro SAD_INC_4x8P_SSE 1 movq m1, [r0] @@ -841,7 +793,132 @@ SAD_INC_4x8P_SSE 1 SAD_INC_4x8P_SSE 1 SAD_END_SSE2 + +%macro SAD_W48_AVX512 3 ; w, h, d/q +cglobal pixel_sad_%1x%2, 4,4 + kxnorb k1, k1, k1 + kaddb k1, k1, k1 +%assign %%i 0 +%if ARCH_X86_64 && %2 != 4 + lea r6, [3*r1] + lea r5, [3*r3] +%rep %2/4 + mov%3 m1, [r0] + vpbroadcast%3 m1 {k1}, [r0+r1] + mov%3 m3, [r2] + vpbroadcast%3 m3 {k1}, [r2+r3] + mov%3 m2, [r0+2*r1] + vpbroadcast%3 m2 {k1}, [r0+r6] + mov%3 m4, [r2+2*r3] + vpbroadcast%3 m4 {k1}, [r2+r5] +%if %%i != %2/4-1 + lea r0, [r0+4*r1] + lea r2, [r2+4*r3] +%endif + psadbw m1, m3 + psadbw m2, m4 + ACCUM paddd, 0, 1, %%i + paddd m0, m2 + %assign %%i %%i+1 +%endrep +%else +%rep %2/2 + mov%3 m1, [r0] + vpbroadcast%3 m1 {k1}, [r0+r1] + mov%3 m2, [r2] + vpbroadcast%3 m2 {k1}, [r2+r3] +%if %%i != %2/2-1 + lea r0, [r0+2*r1] + lea r2, [r2+2*r3] +%endif + psadbw m1, m2 + ACCUM paddd, 0, 1, %%i + %assign %%i %%i+1 +%endrep +%endif +%if %1 == 8 + punpckhqdq m1, m0, m0 + paddd m0, m1 +%endif + movd eax, m0 + RET +%endmacro + +INIT_XMM avx512 +SAD_W48_AVX512 4, 4, d +SAD_W48_AVX512 4, 8, d +SAD_W48_AVX512 4, 16, d +SAD_W48_AVX512 8, 4, q +SAD_W48_AVX512 8, 8, q +SAD_W48_AVX512 8, 16, q + +%macro SAD_W16_AVX512_START 1 ; h + cmp r1d, 16 ; optimized for width = 16, which has the + jne pixel_sad_16x%1_sse2.skip_prologue ; rows laid out contiguously in memory + lea r1, [3*r3] +%endmacro + +%macro SAD_W16_AVX512_END 0 + paddd m0, m1 + paddd m0, m2 + paddd m0, m3 +%if mmsize == 64 + vextracti32x8 ym1, m0, 1 + paddd ym0, ym1 +%endif + vextracti128 xm1, ym0, 1 + paddd xmm0, xm0, xm1 + punpckhqdq xmm1, xmm0, xmm0 + paddd xmm0, xmm1 + movd eax, xmm0 RET +%endmacro + +INIT_YMM avx512 +cglobal pixel_sad_16x8, 4,4 + SAD_W16_AVX512_START 8 + movu xm0, [r2] + vinserti128 m0, [r2+r3], 1 + psadbw m0, [r0+0*32] + movu xm1, [r2+2*r3] + vinserti128 m1, [r2+r1], 1 + lea r2, [r2+4*r3] + psadbw m1, [r0+1*32] + movu xm2, [r2] + vinserti128 m2, [r2+r3], 1 + psadbw m2, 
[r0+2*32] + movu xm3, [r2+2*r3] + vinserti128 m3, [r2+r1], 1 + psadbw m3, [r0+3*32] + SAD_W16_AVX512_END + +INIT_ZMM avx512 +cglobal pixel_sad_16x16, 4,4 + SAD_W16_AVX512_START 16 + movu xm0, [r2] + vinserti128 ym0, [r2+r3], 1 + movu xm1, [r2+4*r3] + vinserti32x4 m0, [r2+2*r3], 2 + vinserti32x4 m1, [r2+2*r1], 2 + vinserti32x4 m0, [r2+r1], 3 + lea r2, [r2+4*r3] + vinserti32x4 m1, [r2+r3], 1 + psadbw m0, [r0+0*64] + vinserti32x4 m1, [r2+r1], 3 + lea r2, [r2+4*r3] + psadbw m1, [r0+1*64] + movu xm2, [r2] + vinserti128 ym2, [r2+r3], 1 + movu xm3, [r2+4*r3] + vinserti32x4 m2, [r2+2*r3], 2 + vinserti32x4 m3, [r2+2*r1], 2 + vinserti32x4 m2, [r2+r1], 3 + lea r2, [r2+4*r3] + vinserti32x4 m3, [r2+r3], 1 + psadbw m2, [r0+2*64] + vinserti32x4 m3, [r2+r1], 3 + psadbw m3, [r0+3*64] + SAD_W16_AVX512_END ;============================================================================= ; SAD x3/x4 MMX @@ -4051,6 +4128,263 @@ SAD_X4_48x8_AVX2 PIXEL_SAD_X4_END_AVX2 RET + +;------------------------------------------------------------ +;sad_x4 avx512 code start +;------------------------------------------------------------ +%macro PROCESS_SAD_X4_64x4_AVX512 0 + movu m4, [r0] + movu m5, [r1] + movu m6, [r2] + movu m7, [r3] + movu m8, [r4] + + psadbw m9, m4, m5 + psadbw m5, m4, m6 + psadbw m6, m4, m7 + psadbw m4, m8 + + paddd m0, m9 + paddd m1, m5 + paddd m2, m6 + paddd m3, m4 + + movu m4, [r0 + FENC_STRIDE] + movu m5, [r1 + r5] + movu m6, [r2 + r5] + movu m7, [r3 + r5] + movu m8, [r4 + r5] + + psadbw m9, m4, m5 + psadbw m5, m4, m6 + psadbw m6, m4, m7 + psadbw m4, m8 + paddd m0, m9 + paddd m1, m5 + paddd m2, m6 + paddd m3, m4 + + movu m4, [r0 + FENC_STRIDE * 2] + movu m5, [r1 + r5 * 2] + movu m6, [r2 + r5 * 2] + movu m7, [r3 + r5 * 2] + movu m8, [r4 + r5 * 2] + + psadbw m9, m4, m5 + psadbw m5, m4, m6 + psadbw m6, m4, m7 + psadbw m4, m8 + + paddd m0, m9 + paddd m1, m5 + paddd m2, m6 + paddd m3, m4 + + movu m4, [r0 + FENC_STRIDE * 3] + movu m5, [r1 + r7] + movu m6, [r2 + r7] + movu m7, [r3 + r7] + movu m8, [r4 + r7] + + psadbw m9, m4, m5 + psadbw m5, m4, m6 + psadbw m6, m4, m7 + psadbw m4, m8 + paddd m0, m9 + paddd m1, m5 + paddd m2, m6 + paddd m3, m4 +%endmacro + +%macro PROCESS_SAD_X4_32x4_AVX512 0 + movu ym4, [r0] + movu ym5, [r1] + movu ym6, [r2] + movu ym7, [r3] + movu ym8, [r4] + + vinserti32x8 m4, [r0 + FENC_STRIDE], 1 + vinserti32x8 m5, [r1 + r5], 1 + vinserti32x8 m6, [r2 + r5], 1 + vinserti32x8 m7, [r3 + r5], 1 + vinserti32x8 m8, [r4 + r5], 1 + + psadbw m9, m4, m5 + psadbw m5, m4, m6 + psadbw m6, m4, m7 + psadbw m4, m8 + + paddd m0, m9 + paddd m1, m5 + paddd m2, m6 + paddd m3, m4 + + movu ym4, [r0 + FENC_STRIDE * 2] + movu ym5, [r1 + r5 * 2] + movu ym6, [r2 + r5 * 2] + movu ym7, [r3 + r5 * 2] + movu ym8, [r4 + r5 * 2] + + vinserti32x8 m4, [r0 + FENC_STRIDE * 3], 1 + vinserti32x8 m5, [r1 + r7], 1 + vinserti32x8 m6, [r2 + r7], 1 + vinserti32x8 m7, [r3 + r7], 1 + vinserti32x8 m8, [r4 + r7], 1 + + psadbw m9, m4, m5 + psadbw m5, m4, m6 + psadbw m6, m4, m7 + psadbw m4, m8 + + paddd m0, m9 + paddd m1, m5 + paddd m2, m6 + paddd m3, m4 +%endmacro + +%macro PROCESS_SAD_X4_48x4_AVX512 0 + movu ym4, [r0] + movu ym5, [r1] + movu ym6, [r2] + movu ym7, [r3] + movu ym8, [r4] + + vinserti32x8 m4, [r0 + FENC_STRIDE], 1 + vinserti32x8 m5, [r1 + r5], 1 + vinserti32x8 m6, [r2 + r5], 1 + vinserti32x8 m7, [r3 + r5], 1 + vinserti32x8 m8, [r4 + r5], 1 + + psadbw m9, m4, m5 + psadbw m5, m4, m6 + psadbw m6, m4, m7 + psadbw m4, m8 + + paddd m0, m9 + paddd m1, m5 + paddd m2, m6 + paddd m3, m4 + + movu ym4, [r0 + FENC_STRIDE * 2] + movu 
ym5, [r1 + r5 * 2] + movu ym6, [r2 + r5 * 2] + movu ym7, [r3 + r5 * 2] + movu ym8, [r4 + r5 * 2] + + vinserti32x8 m4, [r0 + FENC_STRIDE * 3], 1 + vinserti32x8 m5, [r1 + r7], 1 + vinserti32x8 m6, [r2 + r7], 1 + vinserti32x8 m7, [r3 + r7], 1 + vinserti32x8 m8, [r4 + r7], 1 + + psadbw m9, m4, m5 + psadbw m5, m4, m6 + psadbw m6, m4, m7 + psadbw m4, m8 + + paddd m0, m9 + paddd m1, m5 + paddd m2, m6 + paddd m3, m4 + + movu xm4, [r0 + mmsize/2] + movu xm5, [r1 + mmsize/2] + movu xm6, [r2 + mmsize/2] + movu xm7, [r3 + mmsize/2] + movu xm8, [r4 + mmsize/2] + vinserti32x4 m4, [r0 + FENC_STRIDE + mmsize/2], 1 + vinserti32x4 m5, [r1 + r5 + mmsize/2], 1 + vinserti32x4 m6, [r2 + r5 + mmsize/2], 1 + vinserti32x4 m7, [r3 + r5 + mmsize/2], 1 + vinserti32x4 m8, [r4 + r5 + mmsize/2], 1 + + vinserti32x4 m4, [r0 + FENC_STRIDE * 2 + mmsize/2], 2 + vinserti32x4 m5, [r1 + r5 * 2 + mmsize/2], 2 + vinserti32x4 m6, [r2 + r5 * 2 + mmsize/2], 2 + vinserti32x4 m7, [r3 + r5 * 2 + mmsize/2], 2 + vinserti32x4 m8, [r4 + r5 * 2 + mmsize/2], 2 + vinserti32x4 m4, [r0 + FENC_STRIDE * 3 + mmsize/2], 3 + vinserti32x4 m5, [r1 + r7 + mmsize/2], 3 + vinserti32x4 m6, [r2 + r7 + mmsize/2], 3 + vinserti32x4 m7, [r3 + r7 + mmsize/2], 3 + vinserti32x4 m8, [r4 + r7 + mmsize/2], 3 + + psadbw m9, m4, m5 + psadbw m5, m4, m6 + psadbw m6, m4, m7 + psadbw m4, m8 + paddd m0, m9 + paddd m1, m5 + paddd m2, m6 + paddd m3, m4 +%endmacro + +%macro PIXEL_SAD_X4_END_AVX512 0 + vextracti32x8 ym4, m0, 1 + vextracti32x8 ym5, m1, 1 + vextracti32x8 ym6, m2, 1 + vextracti32x8 ym7, m3, 1 + paddd ym0, ym4 + paddd ym1, ym5 + paddd ym2, ym6 + paddd ym3, ym7 + vextracti64x2 xm4, m0, 1 + vextracti64x2 xm5, m1, 1 + vextracti64x2 xm6, m2, 1 + vextracti64x2 xm7, m3, 1 + paddd xm0, xm4 + paddd xm1, xm5 + paddd xm2, xm6 + paddd xm3, xm7 + pshufd xm4, xm0, 2 + pshufd xm5, xm1, 2 + pshufd xm6, xm2, 2 + pshufd xm7, xm3, 2 + paddd xm0, xm4 + paddd xm1, xm5 + paddd xm2, xm6 + paddd xm3, xm7 + movd [r6 + 0], xm0 + movd [r6 + 4], xm1 + movd [r6 + 8], xm2 + movd [r6 + 12], xm3 +%endmacro + +%macro SAD_X4_AVX512 2 +INIT_ZMM avx512 +cglobal pixel_sad_x4_%1x%2, 7,8,10 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + lea r7, [r5 * 3] + +%rep %2/4 - 1 + PROCESS_SAD_X4_%1x4_AVX512 + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] +%endrep + PROCESS_SAD_X4_%1x4_AVX512 + PIXEL_SAD_X4_END_AVX512 + RET +%endmacro + +SAD_X4_AVX512 64, 64 +SAD_X4_AVX512 64, 48 +SAD_X4_AVX512 64, 32 +SAD_X4_AVX512 64, 16 +SAD_X4_AVX512 32, 64 +SAD_X4_AVX512 32, 32 +SAD_X4_AVX512 32, 24 +SAD_X4_AVX512 32, 16 +SAD_X4_AVX512 32, 8 +SAD_X4_AVX512 48, 64 +;------------------------------------------------------------ +;sad_x4 avx512 code end +;------------------------------------------------------------ %endif INIT_XMM sse2 @@ -5517,6 +5851,218 @@ RET %endif +;------------------------------------------------------------ +;sad_x3 avx512 code start +;------------------------------------------------------------ +%macro PROCESS_SAD_X3_64x4_AVX512 0 + movu m3, [r0] + movu m4, [r1] + movu m5, [r2] + movu m6, [r3] + + psadbw m7, m3, m4 + psadbw m4, m3, m5 + psadbw m3, m6 + + paddd m0, m7 + paddd m1, m4 + paddd m2, m3 + + movu m3, [r0 + FENC_STRIDE] + movu m4, [r1 + r4] + movu m5, [r2 + r4] + movu m6, [r3 + r4] + + psadbw m7, m3, m4 + psadbw m4, m3, m5 + psadbw m3, m6 + + paddd m0, m7 + paddd m1, m4 + paddd m2, m3 + + movu m3, [r0 + FENC_STRIDE * 2] + movu m4, [r1 + r4 * 2] + movu m5, [r2 + r4 * 2] + movu m6, [r3 + r4 * 2] + + psadbw m7, 
m3, m4 + psadbw m4, m3, m5 + psadbw m3, m6 + + paddd m0, m7 + paddd m1, m4 + paddd m2, m3 + + movu m3, [r0 + FENC_STRIDE * 3] + movu m4, [r1 + r6] + movu m5, [r2 + r6] + movu m6, [r3 + r6] + + psadbw m7, m3, m4 + psadbw m4, m3, m5 + psadbw m3, m6 + + paddd m0, m7 + paddd m1, m4 + paddd m2, m3 +%endmacro + +%macro PROCESS_SAD_X3_32x4_AVX512 0 + movu ym3, [r0] + movu ym4, [r1] + movu ym5, [r2] + movu ym6, [r3] + vinserti32x8 m3, [r0 + FENC_STRIDE], 1 + vinserti32x8 m4, [r1 + r4], 1 + vinserti32x8 m5, [r2 + r4], 1 + vinserti32x8 m6, [r3 + r4], 1 + + psadbw m7, m3, m4 + psadbw m4, m3, m5 + psadbw m3, m6 + + paddd m0, m7 + paddd m1, m4 + paddd m2, m3 + + movu ym3, [r0 + FENC_STRIDE * 2] + movu ym4, [r1 + r4 * 2] + movu ym5, [r2 + r4 * 2] + movu ym6, [r3 + r4 * 2] + vinserti32x8 m3, [r0 + FENC_STRIDE * 3], 1 + vinserti32x8 m4, [r1 + r6], 1 + vinserti32x8 m5, [r2 + r6], 1 + vinserti32x8 m6, [r3 + r6], 1 + + psadbw m7, m3, m4 + psadbw m4, m3, m5 + psadbw m3, m6 + + paddd m0, m7 + paddd m1, m4 + paddd m2, m3 +%endmacro + +%macro PROCESS_SAD_X3_48x4_AVX512 0 + movu ym3, [r0] + movu ym4, [r1] + movu ym5, [r2] + movu ym6, [r3] + vinserti32x8 m3, [r0 + FENC_STRIDE], 1 + vinserti32x8 m4, [r1 + r4], 1 + vinserti32x8 m5, [r2 + r4], 1 + vinserti32x8 m6, [r3 + r4], 1 + + psadbw m7, m3, m4 + psadbw m4, m3, m5 + psadbw m3, m6 + + paddd m0, m7 + paddd m1, m4 + paddd m2, m3 + + movu ym3, [r0 + FENC_STRIDE * 2] + movu ym4, [r1 + r4 * 2] + movu ym5, [r2 + r4 * 2] + movu ym6, [r3 + r4 * 2] + vinserti32x8 m3, [r0 + FENC_STRIDE * 3], 1 + vinserti32x8 m4, [r1 + r6], 1 + vinserti32x8 m5, [r2 + r6], 1 + vinserti32x8 m6, [r3 + r6], 1 + + psadbw m7, m3, m4 + psadbw m4, m3, m5 + psadbw m3, m6 + + paddd m0, m7 + paddd m1, m4 + paddd m2, m3 + + movu xm3, [r0 + mmsize/2] + movu xm4, [r1 + mmsize/2] + movu xm5, [r2 + mmsize/2] + movu xm6, [r3 + mmsize/2] + vinserti32x4 m3, [r0 + FENC_STRIDE + mmsize/2], 1 + vinserti32x4 m4, [r1 + r4 + mmsize/2], 1 + vinserti32x4 m5, [r2 + r4 + mmsize/2], 1 + vinserti32x4 m6, [r3 + r4 + mmsize/2], 1 + + vinserti32x4 m3, [r0 + 2 * FENC_STRIDE + mmsize/2], 2 + vinserti32x4 m4, [r1 + 2 * r4 + mmsize/2], 2 + vinserti32x4 m5, [r2 + 2 * r4 + mmsize/2], 2 + vinserti32x4 m6, [r3 + 2 * r4 + mmsize/2], 2 + vinserti32x4 m3, [r0 + 3 * FENC_STRIDE + mmsize/2], 3 + vinserti32x4 m4, [r1 + r6 + mmsize/2], 3 + vinserti32x4 m5, [r2 + r6 + mmsize/2], 3 + vinserti32x4 m6, [r3 + r6 + mmsize/2], 3 + + psadbw m7, m3, m4 + psadbw m4, m3, m5 + psadbw m3, m6 + paddd m0, m7 + paddd m1, m4 + paddd m2, m3 +%endmacro + +%macro PIXEL_SAD_X3_END_AVX512 0 + vextracti32x8 ym3, m0, 1 + vextracti32x8 ym4, m1, 1 + vextracti32x8 ym5, m2, 1 + paddd ym0, ym3 + paddd ym1, ym4 + paddd ym2, ym5 + vextracti64x2 xm3, m0, 1 + vextracti64x2 xm4, m1, 1 + vextracti64x2 xm5, m2, 1 + paddd xm0, xm3 + paddd xm1, xm4 + paddd xm2, xm5 + pshufd xm3, xm0, 2 + pshufd xm4, xm1, 2 + pshufd xm5, xm2, 2 + paddd xm0, xm3 + paddd xm1, xm4 + paddd xm2, xm5 + movd [r5 + 0], xm0 + movd [r5 + 4], xm1 + movd [r5 + 8], xm2 +%endmacro + +%macro SAD_X3_AVX512 2 +INIT_ZMM avx512 +cglobal pixel_sad_x3_%1x%2, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + lea r6, [r4 * 3] + +%rep %2/4 - 1 + PROCESS_SAD_X3_%1x4_AVX512 + add r0, FENC_STRIDE * 4 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] +%endrep + PROCESS_SAD_X3_%1x4_AVX512 + PIXEL_SAD_X3_END_AVX512 + RET +%endmacro + +SAD_X3_AVX512 64, 64 +SAD_X3_AVX512 64, 48 +SAD_X3_AVX512 64, 32 +SAD_X3_AVX512 64, 16 +SAD_X3_AVX512 32, 64 +SAD_X3_AVX512 32, 32 +SAD_X3_AVX512 32, 24 +SAD_X3_AVX512 
32, 16 +SAD_X3_AVX512 32, 8 +SAD_X3_AVX512 48, 64 +;------------------------------------------------------------ +;sad_x3 avx512 code end +;------------------------------------------------------------ + INIT_YMM avx2 cglobal pixel_sad_x4_8x8, 7,7,5 xorps m0, m0 @@ -6138,4 +6684,77 @@ movd eax, xm0 RET +%macro PROCESS_SAD_64x4_AVX512 0 + movu m1, [r0] + movu m2, [r2] + movu m3, [r0 + r1] + movu m4, [r2 + r3] + psadbw m1, m2 + psadbw m3, m4 + paddd m0, m1 + paddd m0, m3 + movu m1, [r0 + 2 * r1] + movu m2, [r2 + 2 * r3] + movu m3, [r0 + r5] + movu m4, [r2 + r6] + psadbw m1, m2 + psadbw m3, m4 + paddd m0, m1 + paddd m0, m3 +%endmacro + +%macro PROCESS_SAD_32x4_AVX512 0 + movu ym1, [r0] + movu ym2, [r2] + movu ym3, [r0 + 2 * r1] + movu ym4, [r2 + 2 * r3] + vinserti32x8 m1, [r0 + r1], 1 + vinserti32x8 m2, [r2 + r3], 1 + vinserti32x8 m3, [r0 + r5], 1 + vinserti32x8 m4, [r2 + r6], 1 + + psadbw m1, m2 + psadbw m3, m4 + paddd m0, m1 + paddd m0, m3 +%endmacro + +%macro PROCESS_SAD_AVX512_END 0 + vextracti32x8 ym1, m0, 1 + paddd ym0, ym1 + vextracti64x2 xm1, m0, 1 + paddd xm0, xm1 + pshufd xm1, xm0, 2 + paddd xm0, xm1 + movd eax, xm0 +%endmacro +;----------------------------------------------------------------------------- +; int pixel_sad_64x%1( uint8_t *, intptr_t, uint8_t *, intptr_t ) +;----------------------------------------------------------------------------- +%macro SAD_MxN_AVX512 2 +INIT_ZMM avx512 +cglobal pixel_sad_%1x%2, 4, 7, 5 + pxor m0, m0 + lea r5, [3 * r1] + lea r6, [3 * r3] + +%rep %2/4 - 1 + PROCESS_SAD_%1x4_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] +%endrep + PROCESS_SAD_%1x4_AVX512 + PROCESS_SAD_AVX512_END + RET +%endmacro + +SAD_MxN_AVX512 64, 16 +SAD_MxN_AVX512 64, 32 +SAD_MxN_AVX512 64, 48 +SAD_MxN_AVX512 64, 64 +SAD_MxN_AVX512 32, 8 +SAD_MxN_AVX512 32, 16 +SAD_MxN_AVX512 32, 24 +SAD_MxN_AVX512 32, 32 +SAD_MxN_AVX512 32, 64 %endif
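
The rewritten SAD_W16 macro and the new AVX-512 kernels in this file compute the same scalar as the old unrolled SSE2 code: the sum of absolute differences between an encode block and a reference block. The sad_x3/sad_x4 variants amortize the encode-block loads over three or four reference candidates and write one sum per candidate into the int32_t result array passed by the caller. A hedged C sketch of both call shapes follows; the fixed encode stride of 64 (FENC_STRIDE) is an assumption, since the constant is defined elsewhere in x265, and the function names are illustrative rather than the real C fallbacks.

/* Illustrative reference for the pixel_sad and pixel_sad_x4 kernels above.
 * FENC_STRIDE = 64 is an assumption; names are illustrative only. */
#include <stdint.h>
#include <stdlib.h>

#define FENC_STRIDE 64

static int sad_ref(const uint8_t *pix1, intptr_t stride1,
                   const uint8_t *pix2, intptr_t stride2, int bx, int by)
{
    int sum = 0;
    for (int y = 0; y < by; y++)
    {
        for (int x = 0; x < bx; x++)
            sum += abs(pix1[x] - pix2[x]);           /* what psadbw accumulates */
        pix1 += stride1;
        pix2 += stride2;
    }
    return sum;
}

/* One encode block scored against four motion-search candidates; the
 * AVX-512 bodies above share the fenc loads across all four psadbw chains,
 * which is where the speed-up over four separate SAD calls comes from. */
static void sad_x4_ref(const uint8_t *fenc,
                       const uint8_t *ref0, const uint8_t *ref1,
                       const uint8_t *ref2, const uint8_t *ref3,
                       intptr_t frefstride, int32_t res[4], int bx, int by)
{
    res[0] = sad_ref(fenc, FENC_STRIDE, ref0, frefstride, bx, by);
    res[1] = sad_ref(fenc, FENC_STRIDE, ref1, frefstride, bx, by);
    res[2] = sad_ref(fenc, FENC_STRIDE, ref2, frefstride, bx, by);
    res[3] = sad_ref(fenc, FENC_STRIDE, ref3, frefstride, bx, by);
}

The high-bit-depth path in sad16-a.asm below cannot use psadbw on 16-bit samples, so it instead takes psubw/pabsw differences per lane and reduces them with pmaddwd against pw_1; the scalar result is the same.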
View file
x265_2.7.tar.gz/source/common/x86/sad16-a.asm -> x265_2.9.tar.gz/source/common/x86/sad16-a.asm
Changed
@@ -1155,6 +1155,565 @@ SAD_12 12, 16 +%macro PROCESS_SAD_64x8_AVX512 0 + movu m1, [r2] + movu m2, [r2 + mmsize] + movu m3, [r2 + r3] + movu m4, [r2 + r3 + mmsize] + psubw m1, [r0] + psubw m2, [r0 + mmsize] + psubw m3, [r0 + r1] + psubw m4, [r0 + r1 + mmsize] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m5, m1, m3 + + movu m1, [r2 + 2 * r3] + movu m2, [r2 + 2 * r3 + mmsize] + movu m3, [r2 + r5] + movu m4, [r2 + r5 + mmsize] + psubw m1, [r0 + 2 * r1] + psubw m2, [r0 + 2 * r1 + mmsize] + psubw m3, [r0 + r4] + psubw m4, [r0 + r4 + mmsize] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m1, m3 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 + + movu m1, [r2] + movu m2, [r2 + mmsize] + movu m3, [r2 + r3] + movu m4, [r2 + r3 + mmsize] + psubw m1, [r0] + psubw m2, [r0 + mmsize] + psubw m3, [r0 + r1] + psubw m4, [r0 + r1 + mmsize] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m5, m1, m3 + + movu m1, [r2 + 2 * r3] + movu m2, [r2 + 2 * r3 + mmsize] + movu m3, [r2 + r5] + movu m4, [r2 + r5 + mmsize] + psubw m1, [r0 + 2 * r1] + psubw m2, [r0 + 2 * r1 + mmsize] + psubw m3, [r0 + r4] + psubw m4, [r0 + r4 + mmsize] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m1, m3 + + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 +%endmacro + + +%macro PROCESS_SAD_32x8_AVX512 0 + movu m1, [r2] + movu m2, [r2 + r3] + movu m3, [r2 + 2 * r3] + movu m4, [r2 + r5] + psubw m1, [r0] + psubw m2, [r0 + r1] + psubw m3, [r0 + 2 * r1] + psubw m4, [r0 + r4] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m5, m1, m3 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + movu m1, [r2] + movu m2, [r2 + r3] + movu m3, [r2 + 2 * r3] + movu m4, [r2 + r5] + psubw m1, [r0] + psubw m2, [r0 + r1] + psubw m3, [r0 + 2 * r1] + psubw m4, [r0 + r4] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m1, m3 + + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 +%endmacro + +%macro PROCESS_SAD_16x8_AVX512 0 + movu ym1, [r2] + vinserti64x4 m1, [r2 + r3], 1 + movu ym2, [r2 + 2 * r3] + vinserti64x4 m2, [r2 + r5], 1 + movu ym3, [r0] + vinserti64x4 m3, [r0 + r1], 1 + movu ym4, [r0 + 2 * r1] + vinserti64x4 m4, [r0 + r4], 1 + + psubw m1, m3 + psubw m2, m4 + pabsw m1, m1 + pabsw m2, m2 + paddw m5, m1, m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + movu ym1, [r2] + vinserti64x4 m1, [r2 + r3], 1 + movu ym2, [r2 + 2 * r3] + vinserti64x4 m2, [r2 + r5], 1 + movu ym3, [r0] + vinserti64x4 m3, [r0 + r1], 1 + movu ym4, [r0 + 2 * r1] + vinserti64x4 m4, [r0 + r4], 1 + + psubw m1, m3 + psubw m2, m4 + pabsw m1, m1 + pabsw m2, m2 + paddw m1, m2 + + pmaddwd m5, m6 + paddd m0, m5 + pmaddwd m1, m6 + paddd m0, m1 +%endmacro + +%macro PROCESS_SAD_AVX512_END 0 + vextracti32x8 ym1, m0, 1 + paddd ym0, ym1 + vextracti64x2 xm1, m0, 1 + paddd xm0, xm1 + pshufd xm1, xm0, 00001110b + paddd xm0, xm1 + pshufd xm1, xm0, 00000001b + paddd xm0, xm1 + movd eax, xm0 +%endmacro + +;----------------------------------------------------------------------------- +; int pixel_sad_64x%1( uint16_t *, intptr_t, uint16_t *, intptr_t ) +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sad_64x16, 4,6,7 + pxor m0, m0 + + 
vbroadcasti32x8 m6, [pw_1] + + add r3d, r3d + add r1d, r1d + lea r4d, [r1 * 3] + lea r5d, [r3 * 3] + + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + PROCESS_SAD_AVX512_END + RET + +INIT_ZMM avx512 +cglobal pixel_sad_64x32, 4,6,7 + pxor m0, m0 + + vbroadcasti32x8 m6, [pw_1] + + add r3d, r3d + add r1d, r1d + lea r4d, [r1 * 3] + lea r5d, [r3 * 3] + + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + PROCESS_SAD_AVX512_END + RET + +INIT_ZMM avx512 +cglobal pixel_sad_64x48, 4,6,7 + pxor m0, m0 + + vbroadcasti32x8 m6, [pw_1] + + add r3d, r3d + add r1d, r1d + lea r4d, [r1 * 3] + lea r5d, [r3 * 3] + + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + PROCESS_SAD_AVX512_END + RET + +INIT_ZMM avx512 +cglobal pixel_sad_64x64, 4,6,7 + pxor m0, m0 + + vbroadcasti32x8 m6, [pw_1] + + add r3d, r3d + add r1d, r1d + lea r4d, [r1 * 3] + lea r5d, [r3 * 3] + + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_64x8_AVX512 + PROCESS_SAD_AVX512_END + RET +%endif + +;----------------------------------------------------------------------------- +; int pixel_sad_32x%1( uint16_t *, intptr_t, uint16_t *, intptr_t ) +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sad_32x8, 4,6,7 + pxor m0, m0 + + vbroadcasti32x8 m6, [pw_1] + + add r3d, r3d + add r1d, r1d + lea r4d, [r1 * 3] + lea r5d, [r3 * 3] + + PROCESS_SAD_32x8_AVX512 + PROCESS_SAD_AVX512_END + RET + + +INIT_ZMM avx512 +cglobal pixel_sad_32x16, 4,6,7 + pxor m0, m0 + + vbroadcasti32x8 m6, [pw_1] + + add r3d, r3d + add r1d, r1d + lea r4d, [r1 * 3] + lea r5d, [r3 * 3] + + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + PROCESS_SAD_AVX512_END + RET + +INIT_ZMM avx512 +cglobal pixel_sad_32x24, 4,6,7 + pxor m0, m0 + + vbroadcasti32x8 m6, [pw_1] + + add r3d, r3d + add r1d, r1d + lea r4d, [r1 * 3] + lea r5d, [r3 * 3] + + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + PROCESS_SAD_AVX512_END + RET + +INIT_ZMM avx512 +cglobal pixel_sad_32x32, 4,6,7 + pxor m0, m0 + + vbroadcasti32x8 m6, [pw_1] + + add r3d, r3d + add r1d, r1d + lea r4d, [r1 * 3] + lea r5d, [r3 * 3] + + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea 
r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + PROCESS_SAD_AVX512_END + RET + +INIT_ZMM avx512 +cglobal pixel_sad_32x64, 4,6,7 + pxor m0, m0 + + vbroadcasti32x8 m6, [pw_1] + + add r3d, r3d + add r1d, r1d + lea r4d, [r1 * 3] + lea r5d, [r3 * 3] + + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + PROCESS_SAD_32x8_AVX512 + PROCESS_SAD_AVX512_END + RET +%endif + +;----------------------------------------------------------------------------- +; int pixel_sad_16x%1( uint16_t *, intptr_t, uint16_t *, intptr_t ) +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sad_16x32, 4,6,7 + pxor m0, m0 + + vbroadcasti32x8 m6, [pw_1] + + add r3d, r3d + add r1d, r1d + lea r4d, [r1 * 3] + lea r5d, [r3 * 3] + + %rep 3 + PROCESS_SAD_16x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + %endrep + PROCESS_SAD_16x8_AVX512 + PROCESS_SAD_AVX512_END + RET + +INIT_ZMM avx512 +cglobal pixel_sad_16x64, 4,6,7 + pxor m0, m0 + + vbroadcasti32x8 m6, [pw_1] + + add r3d, r3d + add r1d, r1d + lea r4d, [r1 * 3] + lea r5d, [r3 * 3] + + %rep 7 + PROCESS_SAD_16x8_AVX512 + lea r2, [r2 + 4 * r3] + lea r0, [r0 + 4 * r1] + %endrep + PROCESS_SAD_16x8_AVX512 + PROCESS_SAD_AVX512_END + RET +%endif + +;----------------------------------------------------------------------------- +; int pixel_sad_48x64( uint16_t *, intptr_t, uint16_t *, intptr_t ) +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sad_48x64, 4, 7, 9 + pxor m0, m0 + mov r6d, 64/8 + + vbroadcasti32x8 m8, [pw_1] + + add r3d, r3d + add r1d, r1d + lea r4d, [r1 * 3] + lea r5d, [r3 * 3] +.loop: + movu m1, [r2] + movu m2, [r2 + r3] + movu ym3, [r2 + mmsize] + vinserti32x8 m3, [r2 + r3 + mmsize], 1 + movu m4, [r0] + movu m5, [r0 + r1] + movu ym6, [r0 + mmsize] + vinserti32x8 m6, [r0 + r1 + mmsize], 1 + + psubw m1, m4 + psubw m2, m5 + psubw m3, m6 + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + paddw m1, m2 + paddw m7, m3, m1 + + movu m1, [r2 + 2 * r3] + movu m2, [r2 + r5] + movu ym3, [r2 + 2 * r3 + mmsize] + vinserti32x8 m3, [r2 + r5 + mmsize], 1 + movu m4, [r0 + 2 * r1] + movu m5, [r0 + r4] + movu ym6, [r0 + 2 * r1 + mmsize] + vinserti32x8 m6, [r0 + r4 + mmsize], 1 + psubw m1, m4 + psubw m2, m5 + psubw m3, m6 + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + paddw m1, m2 + paddw m1, m3 + + pmaddwd m7, m8 + paddd m0, m7 + pmaddwd m1, m8 + paddd m0, m1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + movu m1, [r2] + movu m2, [r2 + r3] + movu ym3, [r2 + mmsize] + vinserti32x8 m3, [r2 + r3 + mmsize], 1 + movu m4, [r0] + movu m5, [r0 + r1] + movu ym6, [r0 + mmsize] + vinserti32x8 m6, [r0 + r1 + mmsize], 1 + + psubw m1, m4 + psubw m2, m5 + psubw m3, m6 + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + paddw m1, m2 + paddw m7, m3, m1 + + movu m1, [r2 + 2 * r3] + movu m2, [r2 + r5] + movu ym3, [r2 + 2 * r3 + mmsize] + vinserti32x8 m3, [r2 + r5 + mmsize], 1 + movu m4, [r0 + 2 * r1] + movu m5, [r0 + r4] + movu ym6, [r0 + 2 * r1 + mmsize] + vinserti32x8 m6, [r0 + r4 + 
mmsize], 1 + psubw m1, m4 + psubw m2, m5 + psubw m3, m6 + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + paddw m1, m2 + paddw m1, m3 + + pmaddwd m7, m8 + paddd m0, m7 + pmaddwd m1, m8 + paddd m0, m1 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + dec r6d + jg .loop + + PROCESS_SAD_AVX512_END + RET +%endif + ;============================================================================= ; SAD x3/x4 ;============================================================================= @@ -1561,3 +2120,2251 @@ SAD_X 4, 64, 48 SAD_X 4, 64, 64 +;============================ +; SAD x3/x4 avx512 code start +;============================ + +%macro PROCESS_SAD_X4_16x4_AVX512 0 + movu ym8, [r0] + vinserti64x4 m8, [r0 + 2 * FENC_STRIDE], 1 + movu ym4, [r1] + vinserti64x4 m4, [r1 + r5], 1 + movu ym5, [r2] + vinserti64x4 m5, [r2 + r5], 1 + movu ym6, [r3] + vinserti64x4 m6, [r3 + r5], 1 + movu ym7, [r4] + vinserti64x4 m7, [r4 + r5], 1 + + + psubw m4, m8 + psubw m5, m8 + psubw m6, m8 + psubw m7, m8 + pabsw m4, m4 + pabsw m5, m5 + pabsw m6, m6 + pabsw m7, m7 + + pmaddwd m4, m9 + paddd m0, m4 + pmaddwd m5, m9 + paddd m1, m5 + pmaddwd m6, m9 + paddd m2, m6 + pmaddwd m7, m9 + paddd m3, m7 + + movu ym8, [r0 + 4 * FENC_STRIDE] + vinserti64x4 m8, [r0 + 6 * FENC_STRIDE], 1 + movu ym4, [r1 + 2 * r5] + vinserti64x4 m4, [r1 + r7], 1 + movu ym5, [r2 + 2 * r5] + vinserti64x4 m5, [r2 + r7], 1 + movu ym6, [r3 + 2 * r5] + vinserti64x4 m6, [r3 + r7], 1 + movu ym7, [r4 + 2 * r5] + vinserti64x4 m7, [r4 + r7], 1 + + psubw m4, m8 + psubw m5, m8 + psubw m6, m8 + psubw m7, m8 + pabsw m4, m4 + pabsw m5, m5 + pabsw m6, m6 + pabsw m7, m7 + + pmaddwd m4, m9 + paddd m0, m4 + pmaddwd m5, m9 + paddd m1, m5 + pmaddwd m6, m9 + paddd m2, m6 + pmaddwd m7, m9 + paddd m3, m7 +%endmacro + +%macro PROCESS_SAD_X4_32x4_AVX512 0 + movu m8, [r0] + movu m4, [r1] + movu m5, [r2] + movu m6, [r3] + movu m7, [r4] + + + psubw m4, m8 + psubw m5, m8 + psubw m6, m8 + psubw m7, m8 + pabsw m4, m4 + pabsw m5, m5 + pabsw m6, m6 + pabsw m7, m7 + + pmaddwd m4, m9 + paddd m0, m4 + pmaddwd m5, m9 + paddd m1, m5 + pmaddwd m6, m9 + paddd m2, m6 + pmaddwd m7, m9 + paddd m3, m7 + + + movu m8, [r0 + 2 * FENC_STRIDE] + movu m4, [r1 + r5] + movu m5, [r2 + r5] + movu m6, [r3 + r5] + movu m7, [r4 + r5] + + + psubw m4, m8 + psubw m5, m8 + psubw m6, m8 + psubw m7, m8 + pabsw m4, m4 + pabsw m5, m5 + pabsw m6, m6 + pabsw m7, m7 + + pmaddwd m4, m9 + paddd m0, m4 + pmaddwd m5, m9 + paddd m1, m5 + pmaddwd m6, m9 + paddd m2, m6 + pmaddwd m7, m9 + paddd m3, m7 + + movu m8, [r0 + 4 * FENC_STRIDE] + movu m4, [r1 + 2 * r5] + movu m5, [r2 + 2 * r5] + movu m6, [r3 + 2 * r5] + movu m7, [r4 + 2 * r5] + + + psubw m4, m8 + psubw m5, m8 + psubw m6, m8 + psubw m7, m8 + pabsw m4, m4 + pabsw m5, m5 + pabsw m6, m6 + pabsw m7, m7 + + pmaddwd m4, m9 + paddd m0, m4 + pmaddwd m5, m9 + paddd m1, m5 + pmaddwd m6, m9 + paddd m2, m6 + pmaddwd m7, m9 + paddd m3, m7 + + movu m8, [r0 + 6 * FENC_STRIDE] + movu m4, [r1 + r7] + movu m5, [r2 + r7] + movu m6, [r3 + r7] + movu m7, [r4 + r7] + + + psubw m4, m8 + psubw m5, m8 + psubw m6, m8 + psubw m7, m8 + pabsw m4, m4 + pabsw m5, m5 + pabsw m6, m6 + pabsw m7, m7 + + pmaddwd m4, m9 + paddd m0, m4 + pmaddwd m5, m9 + paddd m1, m5 + pmaddwd m6, m9 + paddd m2, m6 + pmaddwd m7, m9 + paddd m3, m7 +%endmacro + +%macro PROCESS_SAD_X4_64x4_AVX512 0 + movu m8, [r0] + movu m10, [r0 + mmsize] + movu m4, [r1] + movu m11, [r1 + mmsize] + movu m5, [r2] + movu m12, [r2 + mmsize] + movu m6, [r3] + movu m13, [r3 + mmsize] + movu m7, [r4] + movu m14, [r4 + mmsize] + + psubw m4, m8 + 
psubw m5, m8 + psubw m6, m8 + psubw m7, m8 + psubw m11, m10 + psubw m12, m10 + psubw m13, m10 + psubw m14, m10 + pabsw m4, m4 + pabsw m5, m5 + pabsw m6, m6 + pabsw m7, m7 + pabsw m11, m11 + pabsw m12, m12 + pabsw m13, m13 + pabsw m14, m14 + paddw m4, m11 + paddw m5, m12 + paddw m6, m13 + paddw m7, m14 + + pmaddwd m4, m9 + paddd m0, m4 + pmaddwd m5, m9 + paddd m1, m5 + pmaddwd m6, m9 + paddd m2, m6 + pmaddwd m7, m9 + paddd m3, m7 + + + movu m8, [r0 + 2 * FENC_STRIDE] + movu m10, [r0 + 2 * FENC_STRIDE + mmsize] + movu m4, [r1 + r5] + movu m11, [r1 + r5 + mmsize] + movu m5, [r2 + r5] + movu m12, [r2 + r5 + mmsize] + movu m6, [r3 + r5] + movu m13, [r3 + r5 + mmsize] + movu m7, [r4 + r5] + movu m14, [r4 + r5 + mmsize] + + psubw m4, m8 + psubw m5, m8 + psubw m6, m8 + psubw m7, m8 + psubw m11, m10 + psubw m12, m10 + psubw m13, m10 + psubw m14, m10 + pabsw m4, m4 + pabsw m5, m5 + pabsw m6, m6 + pabsw m7, m7 + pabsw m11, m11 + pabsw m12, m12 + pabsw m13, m13 + pabsw m14, m14 + paddw m4, m11 + paddw m5, m12 + paddw m6, m13 + paddw m7, m14 + + pmaddwd m4, m9 + paddd m0, m4 + pmaddwd m5, m9 + paddd m1, m5 + pmaddwd m6, m9 + paddd m2, m6 + pmaddwd m7, m9 + paddd m3, m7 + + movu m8, [r0 + 4 * FENC_STRIDE] + movu m10, [r0 + 4 * FENC_STRIDE + mmsize] + movu m4, [r1 + 2 * r5] + movu m11, [r1 + 2 * r5 + mmsize] + movu m5, [r2 + 2 * r5] + movu m12, [r2 + 2 * r5 + mmsize] + movu m6, [r3 + 2 * r5] + movu m13, [r3 + 2 * r5 + mmsize] + movu m7, [r4 + 2 * r5] + movu m14, [r4 + 2 * r5 + mmsize] + + psubw m4, m8 + psubw m5, m8 + psubw m6, m8 + psubw m7, m8 + psubw m11, m10 + psubw m12, m10 + psubw m13, m10 + psubw m14, m10 + pabsw m4, m4 + pabsw m5, m5 + pabsw m6, m6 + pabsw m7, m7 + pabsw m11, m11 + pabsw m12, m12 + pabsw m13, m13 + pabsw m14, m14 + paddw m4, m11 + paddw m5, m12 + paddw m6, m13 + paddw m7, m14 + + pmaddwd m4, m9 + paddd m0, m4 + pmaddwd m5, m9 + paddd m1, m5 + pmaddwd m6, m9 + paddd m2, m6 + pmaddwd m7, m9 + paddd m3, m7 + + movu m8, [r0 + 6 * FENC_STRIDE] + movu m10, [r0 + 6 * FENC_STRIDE + mmsize] + movu m4, [r1 + r7] + movu m11, [r1 + r7 + mmsize] + movu m5, [r2 + r7] + movu m12, [r2 + r7 + mmsize] + movu m6, [r3 + r7] + movu m13, [r3 + r7 + mmsize] + movu m7, [r4 + r7] + movu m14, [r4 + r7 + mmsize] + + psubw m4, m8 + psubw m5, m8 + psubw m6, m8 + psubw m7, m8 + psubw m11, m10 + psubw m12, m10 + psubw m13, m10 + psubw m14, m10 + pabsw m4, m4 + pabsw m5, m5 + pabsw m6, m6 + pabsw m7, m7 + pabsw m11, m11 + pabsw m12, m12 + pabsw m13, m13 + pabsw m14, m14 + paddw m4, m11 + paddw m5, m12 + paddw m6, m13 + paddw m7, m14 + + pmaddwd m4, m9 + paddd m0, m4 + pmaddwd m5, m9 + paddd m1, m5 + pmaddwd m6, m9 + paddd m2, m6 + pmaddwd m7, m9 + paddd m3, m7 +%endmacro + +%macro PROCESS_SAD_X4_END_AVX512 0 + vextracti32x8 ym4, m0, 1 + vextracti32x8 ym5, m1, 1 + vextracti32x8 ym6, m2, 1 + vextracti32x8 ym7, m3, 1 + + paddd ym0, ym4 + paddd ym1, ym5 + paddd ym2, ym6 + paddd ym3, ym7 + + vextracti64x2 xm4, m0, 1 + vextracti64x2 xm5, m1, 1 + vextracti64x2 xm6, m2, 1 + vextracti64x2 xm7, m3, 1 + + paddd xm0, xm4 + paddd xm1, xm5 + paddd xm2, xm6 + paddd xm3, xm7 + + pshufd xm4, xm0, 00001110b + pshufd xm5, xm1, 00001110b + pshufd xm6, xm2, 00001110b + pshufd xm7, xm3, 00001110b + + paddd xm0, xm4 + paddd xm1, xm5 + paddd xm2, xm6 + paddd xm3, xm7 + + pshufd xm4, xm0, 00000001b + pshufd xm5, xm1, 00000001b + pshufd xm6, xm2, 00000001b + pshufd xm7, xm3, 00000001b + + paddd xm0, xm4 + paddd xm1, xm5 + paddd xm2, xm6 + paddd xm3, xm7 + + mov r0, r6mp + movd [r0 + 0], xm0 + movd [r0 + 4], xm1 + movd [r0 + 8], xm2 + 
movd [r0 + 12], xm3 +%endmacro + + +%macro PROCESS_SAD_X3_16x4_AVX512 0 + movu ym6, [r0] + vinserti64x4 m6, [r0 + 2 * FENC_STRIDE], 1 + movu ym3, [r1] + vinserti64x4 m3, [r1 + r4], 1 + movu ym4, [r2] + vinserti64x4 m4, [r2 + r4], 1 + movu ym5, [r3] + vinserti64x4 m5, [r3 + r4], 1 + + psubw m3, m6 + psubw m4, m6 + psubw m5, m6 + pabsw m3, m3 + pabsw m4, m4 + pabsw m5, m5 + + pmaddwd m3, m7 + paddd m0, m3 + pmaddwd m4, m7 + paddd m1, m4 + pmaddwd m5, m7 + paddd m2, m5 + + movu ym6, [r0 + 4 * FENC_STRIDE] + vinserti64x4 m6, [r0 + 6 * FENC_STRIDE], 1 + movu ym3, [r1 + 2 * r4] + vinserti64x4 m3, [r1 + r6], 1 + movu ym4, [r2 + 2 * r4] + vinserti64x4 m4, [r2 + r6], 1 + movu ym5, [r3 + 2 * r4] + vinserti64x4 m5, [r3 + r6], 1 + + psubw m3, m6 + psubw m4, m6 + psubw m5, m6 + pabsw m3, m3 + pabsw m4, m4 + pabsw m5, m5 + + pmaddwd m3, m7 + paddd m0, m3 + pmaddwd m4, m7 + paddd m1, m4 + pmaddwd m5, m7 + paddd m2, m5 +%endmacro + + +%macro PROCESS_SAD_X3_32x4_AVX512 0 + movu m6, [r0] + movu m3, [r1] + movu m4, [r2] + movu m5, [r3] + + + psubw m3, m6 + psubw m4, m6 + psubw m5, m6 + pabsw m3, m3 + pabsw m4, m4 + pabsw m5, m5 + + pmaddwd m3, m7 + paddd m0, m3 + pmaddwd m4, m7 + paddd m1, m4 + pmaddwd m5, m7 + paddd m2, m5 + + movu m6, [r0 + 2 * FENC_STRIDE] + movu m3, [r1 + r4] + movu m4, [r2 + r4] + movu m5, [r3 + r4] + + psubw m3, m6 + psubw m4, m6 + psubw m5, m6 + pabsw m3, m3 + pabsw m4, m4 + pabsw m5, m5 + + pmaddwd m3, m7 + paddd m0, m3 + pmaddwd m4, m7 + paddd m1, m4 + pmaddwd m5, m7 + paddd m2, m5 + + movu m6, [r0 + 4 * FENC_STRIDE] + movu m3, [r1 + 2 * r4] + movu m4, [r2 + 2 * r4] + movu m5, [r3 + 2 * r4] + + psubw m3, m6 + psubw m4, m6 + psubw m5, m6 + pabsw m3, m3 + pabsw m4, m4 + pabsw m5, m5 + + pmaddwd m3, m7 + paddd m0, m3 + pmaddwd m4, m7 + paddd m1, m4 + pmaddwd m5, m7 + paddd m2, m5 + + movu m6, [r0 + 6 * FENC_STRIDE] + movu m3, [r1 + r6] + movu m4, [r2 + r6] + movu m5, [r3 + r6] + + psubw m3, m6 + psubw m4, m6 + psubw m5, m6 + pabsw m3, m3 + pabsw m4, m4 + pabsw m5, m5 + + pmaddwd m3, m7 + paddd m0, m3 + pmaddwd m4, m7 + paddd m1, m4 + pmaddwd m5, m7 + paddd m2, m5 +%endmacro + +%macro PROCESS_SAD_X3_64x4_AVX512 0 + movu m6, [r0] + movu m8, [r0 + mmsize] + movu m3, [r1] + movu m9, [r1 + mmsize] + movu m4, [r2] + movu m10, [r2 + mmsize] + movu m5, [r3] + movu m11, [r3 + mmsize] + + psubw m3, m6 + psubw m9, m8 + psubw m4, m6 + psubw m10, m8 + psubw m5, m6 + psubw m11, m8 + pabsw m3, m3 + pabsw m4, m4 + pabsw m5, m5 + pabsw m9, m9 + pabsw m10, m10 + pabsw m11, m11 + paddw m3, m9 + paddw m4, m10 + paddw m5, m11 + + pmaddwd m3, m7 + paddd m0, m3 + pmaddwd m4, m7 + paddd m1, m4 + pmaddwd m5, m7 + paddd m2, m5 + + movu m6, [r0 + 2 * FENC_STRIDE] + movu m8, [r0 + 2 * FENC_STRIDE + mmsize] + movu m3, [r1 + r4] + movu m9, [r1 + r4 + mmsize] + movu m4, [r2 + r4] + movu m10, [r2 + r4 + mmsize] + movu m5, [r3 + r4] + movu m11, [r3 + r4 + mmsize] + + psubw m3, m6 + psubw m9, m8 + psubw m4, m6 + psubw m10, m8 + psubw m5, m6 + psubw m11, m8 + pabsw m3, m3 + pabsw m4, m4 + pabsw m5, m5 + pabsw m9, m9 + pabsw m10, m10 + pabsw m11, m11 + paddw m3, m9 + paddw m4, m10 + paddw m5, m11 + + pmaddwd m3, m7 + paddd m0, m3 + pmaddwd m4, m7 + paddd m1, m4 + pmaddwd m5, m7 + paddd m2, m5 + + movu m6, [r0 + 4 * FENC_STRIDE] + movu m8, [r0 + 4 * FENC_STRIDE + mmsize] + movu m3, [r1 + 2 * r4] + movu m9, [r1 + 2 * r4 + mmsize] + movu m4, [r2 + 2 * r4] + movu m10, [r2 + 2 * r4 + mmsize] + movu m5, [r3 + 2 * r4] + movu m11, [r3 + 2 * r4 + mmsize] + + psubw m3, m6 + psubw m9, m8 + psubw m4, m6 + psubw m10, m8 + psubw m5, 
m6 + psubw m11, m8 + pabsw m3, m3 + pabsw m4, m4 + pabsw m5, m5 + pabsw m9, m9 + pabsw m10, m10 + pabsw m11, m11 + paddw m3, m9 + paddw m4, m10 + paddw m5, m11 + + pmaddwd m3, m7 + paddd m0, m3 + pmaddwd m4, m7 + paddd m1, m4 + pmaddwd m5, m7 + paddd m2, m5 + + movu m6, [r0 + 6 * FENC_STRIDE] + movu m8, [r0 + 6 * FENC_STRIDE + mmsize] + movu m3, [r1 + r6] + movu m9, [r1 + r6 + mmsize] + movu m4, [r2 + r6] + movu m10, [r2 + r6 + mmsize] + movu m5, [r3 + r6] + movu m11, [r3 + r6 + mmsize] + + psubw m3, m6 + psubw m9, m8 + psubw m4, m6 + psubw m10, m8 + psubw m5, m6 + psubw m11, m8 + pabsw m3, m3 + pabsw m4, m4 + pabsw m5, m5 + pabsw m9, m9 + pabsw m10, m10 + pabsw m11, m11 + paddw m3, m9 + paddw m4, m10 + paddw m5, m11 + + pmaddwd m3, m7 + paddd m0, m3 + pmaddwd m4, m7 + paddd m1, m4 + pmaddwd m5, m7 + paddd m2, m5 +%endmacro + +%macro PROCESS_SAD_X3_END_AVX512 0 + vextracti32x8 ym3, m0, 1 + vextracti32x8 ym4, m1, 1 + vextracti32x8 ym5, m2, 1 + + paddd ym0, ym3 + paddd ym1, ym4 + paddd ym2, ym5 + + vextracti64x2 xm3, m0, 1 + vextracti64x2 xm4, m1, 1 + vextracti64x2 xm5, m2, 1 + + paddd xm0, xm3 + paddd xm1, xm4 + paddd xm2, xm5 + + pshufd xm3, xm0, 00001110b + pshufd xm4, xm1, 00001110b + pshufd xm5, xm2, 00001110b + + paddd xm0, xm3 + paddd xm1, xm4 + paddd xm2, xm5 + + pshufd xm3, xm0, 00000001b + pshufd xm4, xm1, 00000001b + pshufd xm5, xm2, 00000001b + + paddd xm0, xm3 + paddd xm1, xm4 + paddd xm2, xm5 + + %if UNIX64 + movd [r5 + 0], xm0 + movd [r5 + 4], xm1 + movd [r5 + 8], xm2 + %else + mov r0, r5mp + movd [r0 + 0], xm0 + movd [r0 + 4], xm1 + movd [r0 + 8], xm2 +%endif +%endmacro + + +;------------------------------------------------------------------------------------------------------------------------------------------ +; void pixel_sad_x3_16x%1( const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res ) +;------------------------------------------------------------------------------------------------------------------------------------------ +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sad_x3_16x8, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + PROCESS_SAD_X3_16x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_16x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x3_16x12, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + %rep 2 + PROCESS_SAD_X3_16x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + %endrep + PROCESS_SAD_X3_16x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x3_16x16, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + %rep 3 + PROCESS_SAD_X3_16x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + %endrep + PROCESS_SAD_X3_16x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x3_16x32, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + %rep 7 + PROCESS_SAD_X3_16x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + %endrep + PROCESS_SAD_X3_16x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET + +INIT_ZMM avx512 
+cglobal pixel_sad_x3_16x64, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + %rep 15 + PROCESS_SAD_X3_16x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + %endrep + PROCESS_SAD_X3_16x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET +%endif + +;------------------------------------------------------------------------------------------------------------------------------------------ +; void pixel_sad_x3_32x%1( const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res ) +;------------------------------------------------------------------------------------------------------------------------------------------ +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sad_x3_32x8, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET + + +INIT_ZMM avx512 +cglobal pixel_sad_x3_32x16, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x3_32x24, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET + + +INIT_ZMM avx512 +cglobal pixel_sad_x3_32x32, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 
+ r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x3_32x64, 6,7,8 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_32x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET + +;---------------------------------------------------------------------------------------------------------------------------------------- +; int pixel_sad_x3_48x64( const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res ) +;---------------------------------------------------------------------------------------------------------------------------------------- +INIT_ZMM avx512 +cglobal pixel_sad_x3_48x64, 4, 8, 17 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + mov r7d, 64/4 + vbroadcasti32x8 m16, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] +.loop: + movu m4, [r0] + movu m5, [r0 + 2 * FENC_STRIDE] + movu ym6, [r0 + mmsize] + vinserti32x8 m6, [r0 + 2 * FENC_STRIDE + mmsize], 1 + movu m7, [r1] + movu m8, [r1 + r4] + movu ym9, [r1 + mmsize] + vinserti32x8 m9, [r1 + r4 + mmsize], 1 + movu m10, [r2] + movu m11, [r2 + r4] + movu ym12, [r2 + mmsize] + vinserti32x8 m12, [r2 + r4 + mmsize], 1 + movu m13, [r3] + movu m14, [r3 + r4] + movu ym15, [r3 + mmsize] + vinserti32x8 m15, [r3 + r4 + mmsize], 1 + + psubw m7, m4 + psubw m8, m5 + psubw m9, m6 + psubw m10, m4 + psubw m11, m5 + psubw m12, m6 + psubw m13, m4 + psubw m14, m5 + psubw m15, m6 + + pabsw m7, m7 + 
pabsw m8, m8 + pabsw m9, m9 + pabsw m10, m10 + pabsw m11, m11 + pabsw m12, m12 + pabsw m13, m13 + pabsw m14, m14 + pabsw m15, m15 + + paddw m7, m8 + paddw m7, m9 + paddw m10, m11 + paddw m10, m12 + paddw m13, m14 + paddw m13, m15 + + pmaddwd m7, m16 + paddd m0, m7 + pmaddwd m10, m16 + paddd m1, m10 + pmaddwd m13, m16 + paddd m2, m13 + + movu m4, [r0 + 4 * FENC_STRIDE] + movu m5, [r0 + 6 * FENC_STRIDE] + movu ym6, [r0 + 4 * FENC_STRIDE + mmsize] + vinserti32x8 m6, [r0 + 6 * FENC_STRIDE + mmsize], 1 + movu m7, [r1 + 2 * r4] + movu m8, [r1 + r6] + movu ym9, [r1 + 2 * r4 + mmsize] + vinserti32x8 m9, [r1 + r6 + mmsize], 1 + movu m10, [r2 + 2 * r4] + movu m11, [r2 + r6] + movu ym12, [r2 + 2 * r4 + mmsize] + vinserti32x8 m12, [r2 + r6 + mmsize], 1 + movu m13, [r3 + 2 * r4] + movu m14, [r3 + r6] + movu ym15, [r3 + 2 * r4 + mmsize] + vinserti32x8 m15, [r3 + r6 + mmsize], 1 + + psubw m7, m4 + psubw m8, m5 + psubw m9, m6 + psubw m10, m4 + psubw m11, m5 + psubw m12, m6 + psubw m13, m4 + psubw m14, m5 + psubw m15, m6 + + pabsw m7, m7 + pabsw m8, m8 + pabsw m9, m9 + pabsw m10, m10 + pabsw m11, m11 + pabsw m12, m12 + pabsw m13, m13 + pabsw m14, m14 + pabsw m15, m15 + + paddw m7, m8 + paddw m7, m9 + paddw m10, m11 + paddw m10, m12 + paddw m13, m14 + paddw m13, m15 + + pmaddwd m7, m16 + paddd m0, m7 + pmaddwd m10, m16 + paddd m1, m10 + pmaddwd m13, m16 + paddd m2, m13 + + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + + dec r7d + jg .loop + + PROCESS_SAD_X3_END_AVX512 + RET +%endif + +;------------------------------------------------------------------------------------------------------------------------------------------ +; void pixel_sad_x3_64x%1( const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, intptr_t frefstride, int32_t* res ) +;------------------------------------------------------------------------------------------------------------------------------------------ +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sad_x3_64x16, 6,7,12 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x3_64x32, 6,7,12 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add 
r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x3_64x48, 6,7,12 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x3_64x64, 6,7,12 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + + vbroadcasti32x8 m7, [pw_1] + + add r4d, r4d + lea r6d, [r4 * 3] + + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + 
lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r4 * 4] + lea r2, [r2 + r4 * 4] + lea r3, [r3 + r4 * 4] + PROCESS_SAD_X3_64x4_AVX512 + PROCESS_SAD_X3_END_AVX512 + RET +%endif + +;------------------------------------------------------------------------------------------------------------------------------------------------------------ +; void pixel_sad_x4_16x%1( const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res ) +;------------------------------------------------------------------------------------------------------------------------------------------------------------ +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sad_x4_16x8, 6,8,10 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + PROCESS_SAD_X4_16x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_16x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x4_16x12, 6,8,10 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + %rep 2 + PROCESS_SAD_X4_16x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + %endrep + PROCESS_SAD_X4_16x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x4_16x16, 6,8,10 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + %rep 3 + PROCESS_SAD_X4_16x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + %endrep + PROCESS_SAD_X4_16x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x4_16x32, 6,8,10 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + %rep 7 + PROCESS_SAD_X4_16x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + %endrep + PROCESS_SAD_X4_16x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x4_16x64, 6,8,10 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + %rep 15 + PROCESS_SAD_X4_16x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + %endrep + PROCESS_SAD_X4_16x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET +%endif + +;------------------------------------------------------------------------------------------------------------------------------------------------------------ +; void pixel_sad_x4_32x%1( const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res ) +;------------------------------------------------------------------------------------------------------------------------------------------------------------ +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sad_x4_32x8, 6,8,10 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, 
[pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x4_32x16, 6,8,10 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x4_32x24, 6,8,10 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET + + +INIT_ZMM avx512 +cglobal pixel_sad_x4_32x32, 6,8,10 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x4_32x64, 6,8,10 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, 
[r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_32x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET +%endif +;------------------------------------------------------------------------------------------------------------------------------------------------------------ +; void pixel_sad_x4_48x64( const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res ) +;------------------------------------------------------------------------------------------------------------------------------------------------------------ +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sad_x4_48x64, 4, 9, 20 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + mov r8d, 64/4 + + vbroadcasti32x8 m19, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] +.loop: + movu m4, [r0] + movu m5, [r0 + 2 * FENC_STRIDE] + movu ym6, [r0 + mmsize] + vinserti32x8 m6, [r0 + 2 * FENC_STRIDE + mmsize], 1 + movu m7, [r1] + movu m8, [r1 + r5] + movu ym9, [r1 + mmsize] + vinserti32x8 m9, [r1 + r5 + mmsize], 1 + movu m10, [r2] + movu m11, [r2 + r5] + movu ym12, [r2 + mmsize] + vinserti32x8 m12, [r2 + r5 + mmsize], 1 + movu m13, [r3] + movu m14, [r3 + r5] + movu ym15, [r3 + mmsize] + vinserti32x8 m15, [r3 + r5 + mmsize], 1 + movu m16, [r4] + movu m17, [r4 + r5] + movu ym18, [r4 + mmsize] + vinserti32x8 m18, [r4 + r5 + mmsize], 1 + + psubw m7, m4 + psubw m8, m5 + psubw m9, m6 + psubw m10, m4 + psubw m11, 
m5 + psubw m12, m6 + psubw m13, m4 + psubw m14, m5 + psubw m15, m6 + psubw m16, m4 + psubw m17, m5 + psubw m18, m6 + + pabsw m7, m7 + pabsw m8, m8 + pabsw m9, m9 + pabsw m10, m10 + pabsw m11, m11 + pabsw m12, m12 + pabsw m13, m13 + pabsw m14, m14 + pabsw m15, m15 + pabsw m16, m16 + pabsw m17, m17 + pabsw m18, m18 + + paddw m7, m8 + paddw m7, m9 + paddw m10, m11 + paddw m10, m12 + paddw m13, m14 + paddw m13, m15 + paddw m16, m17 + paddw m16, m18 + + pmaddwd m7, m19 + paddd m0, m7 + pmaddwd m10, m19 + paddd m1, m10 + pmaddwd m13, m19 + paddd m2, m13 + pmaddwd m16, m19 + paddd m3, m16 + + movu m4, [r0 + 4 * FENC_STRIDE] + movu m5, [r0 + 6 * FENC_STRIDE] + movu ym6, [r0 + 4 * FENC_STRIDE + mmsize] + vinserti32x8 m6, [r0 + 6 * FENC_STRIDE + mmsize], 1 + movu m7, [r1 + 2 * r5] + movu m8, [r1 + r7] + movu ym9, [r1 + 2 * r5 + mmsize] + vinserti32x8 m9, [r1 + r7 + mmsize], 1 + movu m10, [r2 + 2 * r5] + movu m11, [r2 + r7] + movu ym12, [r2 + 2 * r5 + mmsize] + vinserti32x8 m12, [r2 + r7 + mmsize], 1 + movu m13, [r3 + 2 * r5] + movu m14, [r3 + r7] + movu ym15, [r3 + 2 * r5 + mmsize] + vinserti32x8 m15, [r3 + r7 + mmsize], 1 + movu m16, [r4 + 2 * r5] + movu m17, [r4 + r7] + movu ym18, [r4 + 2 * r5 + mmsize] + vinserti32x8 m18, [r4 + r7 + mmsize], 1 + + + psubw m7, m4 + psubw m8, m5 + psubw m9, m6 + psubw m10, m4 + psubw m11, m5 + psubw m12, m6 + psubw m13, m4 + psubw m14, m5 + psubw m15, m6 + psubw m16, m4 + psubw m17, m5 + psubw m18, m6 + + pabsw m7, m7 + pabsw m8, m8 + pabsw m9, m9 + pabsw m10, m10 + pabsw m11, m11 + pabsw m12, m12 + pabsw m13, m13 + pabsw m14, m14 + pabsw m15, m15 + pabsw m16, m16 + pabsw m17, m17 + pabsw m18, m18 + + paddw m7, m8 + paddw m7, m9 + paddw m10, m11 + paddw m10, m12 + paddw m13, m14 + paddw m13, m15 + paddw m16, m17 + paddw m16, m18 + + pmaddwd m7, m19 + paddd m0, m7 + pmaddwd m10, m19 + paddd m1, m10 + pmaddwd m13, m19 + paddd m2, m13 + pmaddwd m16, m19 + paddd m3, m16 + + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + + dec r8d + jg .loop + + PROCESS_SAD_X4_END_AVX512 + RET +%endif + +;------------------------------------------------------------------------------------------------------------------------------------------------------------ +; void pixel_sad_x4_64x%1( const pixel* pix1, const pixel* pix2, const pixel* pix3, const pixel* pix4, const pixel* pix5, intptr_t frefstride, int32_t* res ) +;------------------------------------------------------------------------------------------------------------------------------------------------------------ +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_sad_x4_64x16, 6,8,15 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x4_64x32, 6,8,15 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, 
[r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x4_64x48, 6,8,15 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET + +INIT_ZMM avx512 +cglobal pixel_sad_x4_64x64, 6,8,15 + pxor m0, m0 + pxor m1, m1 + pxor m2, m2 + pxor m3, m3 + + vbroadcasti32x8 m9, [pw_1] + + add r5d, r5d + lea r7d, [r5 * 3] + + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea 
r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + add r0, FENC_STRIDE * 8 + lea r1, [r1 + r5 * 4] + lea r2, [r2 + r5 * 4] + lea r3, [r3 + r5 * 4] + lea r4, [r4 + r5 * 4] + PROCESS_SAD_X4_64x4_AVX512 + PROCESS_SAD_X4_END_AVX512 + RET +%endif
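For readers who do not work in the x265 assembly layer, the sad_x3/sad_x4 primitives that these unrolled AVX-512 kernels implement compute sums of absolute differences between one encoder block (fenc) and three or four candidate reference blocks in a single call. Below is a rough scalar C++ sketch of the x4 contract described by the comment headers above; it is an illustration only, not the project's code, and the fenc stride is passed explicitly here where the real primitives assume the fixed FENC_STRIDE.

#include <cstdint>
#include <cstdlib>

// Scalar reference for sad_x4: one fenc block against four reference blocks.
// pix1 is the fenc block; pix2..pix5 are the references sharing frefstride.
static void sad_x4_ref(int width, int height,
                       const uint16_t* pix1, intptr_t fencstride,
                       const uint16_t* pix2, const uint16_t* pix3,
                       const uint16_t* pix4, const uint16_t* pix5,
                       intptr_t frefstride, int32_t* res) // res[0..3]
{
    res[0] = res[1] = res[2] = res[3] = 0;
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            res[0] += std::abs((int)pix1[x] - (int)pix2[x]);
            res[1] += std::abs((int)pix1[x] - (int)pix3[x]);
            res[2] += std::abs((int)pix1[x] - (int)pix4[x]);
            res[3] += std::abs((int)pix1[x] - (int)pix5[x]);
        }
        pix1 += fencstride;
        pix2 += frefstride;
        pix3 += frefstride;
        pix4 += frefstride;
        pix5 += frefstride;
    }
}

The AVX-512 routines above do the same work several rows at a time: psubw/pabsw produce 16-bit absolute differences, paddw accumulates them within a register, and pmaddwd against pw_1 widens the partial sums to 32 bits before they are added into the per-reference accumulators.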
View file
x265_2.7.tar.gz/source/common/x86/ssd-a.asm -> x265_2.9.tar.gz/source/common/x86/ssd-a.asm
Changed
@@ -141,6 +141,8 @@ ; Function to find ssd for 32x16 block, sse2, 12 bit depth ; Defined sepeartely to be called from SSD_ONE_32 macro +%if ARCH_X86_64 +;This code is written for 64 bit architecture INIT_XMM sse2 cglobal ssd_ss_32x16 pxor m8, m8 @@ -180,8 +182,10 @@ paddq m4, m5 paddq m9, m4 ret +%endif %macro SSD_ONE_32 0 +%if ARCH_X86_64 cglobal pixel_ssd_ss_32x64, 4,7,10 add r1d, r1d add r3d, r3d @@ -193,7 +197,9 @@ call ssd_ss_32x16 movq rax, m9 RET +%endif %endmacro + %macro SSD_ONE_SS_32 0 cglobal pixel_ssd_ss_32x32, 4,5,8 add r1d, r1d @@ -554,6 +560,7 @@ RET %endmacro +%if ARCH_X86_64 INIT_YMM avx2 cglobal pixel_ssd_16x16, 4,7,3 FIX_STRIDES r1, r3 @@ -697,6 +704,108 @@ movq rax, xm3 RET +INIT_ZMM avx512 +cglobal pixel_ssd_32x2 + pxor m0, m0 + movu m1, [r0] + psubw m1, [r2] + pmaddwd m1, m1 + paddd m0, m1 + movu m1, [r0 + r1] + psubw m1, [r2 + r3] + pmaddwd m1, m1 + paddd m0, m1 + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] + + mova m1, m0 + pxor m2, m2 + punpckldq m0, m2 + punpckhdq m1, m2 + + paddq m3, m0 + paddq m3, m1 +ret + +INIT_ZMM avx512 +cglobal pixel_ssd_32x32, 4,5,5 + shl r1d, 1 + shl r3d, 1 + pxor m3, m3 + mov r4, 16 +.iterate: + call pixel_ssd_32x2 + dec r4d + jne .iterate + + vextracti32x8 ym4, m3, 1 + paddq ym3, ym4 + vextracti32x4 xm4, m3, 1 + paddq xm3, xm4 + movhlps xm4, xm3 + paddq xm3, xm4 + movq rax, xm3 +RET + +INIT_ZMM avx512 +cglobal pixel_ssd_32x64, 4,5,5 + shl r1d, 1 + shl r3d, 1 + pxor m3, m3 + mov r4, 32 +.iterate: + call pixel_ssd_32x2 + dec r4d + jne .iterate + + vextracti32x8 ym4, m3, 1 + paddq ym3, ym4 + vextracti32x4 xm4, m3, 1 + paddq xm3, xm4 + movhlps xm4, xm3 + paddq xm3, xm4 + movq rax, xm3 +RET + +INIT_ZMM avx512 +cglobal pixel_ssd_64x64, 4,5,5 + FIX_STRIDES r1, r3 + mov r4d, 64 + pxor m3, m3 + +.loop: + pxor m0, m0 + movu m1, [r0] + psubw m1, [r2] + pmaddwd m1, m1 + paddd m0, m1 + movu m1, [r0 + mmsize] + psubw m1, [r2 + mmsize] + pmaddwd m1, m1 + paddd m0, m1 + + lea r0, [r0 + r1] + lea r2, [r2 + r3] + + mova m1, m0 + pxor m2, m2 + punpckldq m0, m2 + punpckhdq m1, m2 + paddq m3, m0 + paddq m3, m1 + + dec r4d + jg .loop + + vextracti32x8 ym4, m3, 1 + paddq ym3, ym4 + vextracti32x4 xm4, m3, 1 + paddq xm3, xm4 + movhlps xm4, xm3 + paddq xm3, xm4 + movq rax, xm3 + RET +%endif INIT_MMX mmx2 SSD_ONE 4, 4 SSD_ONE 4, 8 @@ -726,7 +835,9 @@ %if BIT_DEPTH <= 10 SSD_ONE 32, 64 SSD_ONE 32, 32 +%if ARCH_X86_64 SSD_TWO 64, 64 +%endif %else SSD_ONE_32 SSD_ONE_SS_32 @@ -1377,7 +1488,126 @@ HADDD m2, m0 movd eax, xm2 RET +;----------------------------------------------------------------------------- +; ssd_ss avx512 code start +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +%macro PROCESS_SSD_SS_64x4_AVX512 0 + movu m0, [r0] + movu m1, [r0 + mmsize] + movu m2, [r0 + r1] + movu m3, [r0 + r1 + mmsize] + movu m4, [r2] + movu m5, [r2 + mmsize] + movu m6, [r2 + r3] + movu m7, [r2 + r3 + mmsize] + + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + pmaddwd m0, m0 + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + paddd m8, m0 + paddd m8, m1 + paddd m8, m2 + paddd m8, m3 + movu m0, [r0 + 2 * r1] + movu m1, [r0 + 2 * r1 + mmsize] + movu m2, [r0 + r5] + movu m3, [r0 + r5 + mmsize] + movu m4, [r2 + 2 * r3] + movu m5, [r2 + 2 * r3 + mmsize] + movu m6, [r2 + r6] + movu m7, [r2 + r6 + mmsize] + + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + pmaddwd m0, m0 + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + paddd m8, m0 + paddd m8, m1 + paddd m8, m2 + paddd m8, m3 +%endmacro + +%macro 
PROCESS_SSD_SS_32x4_AVX512 0 + movu m0, [r0] + movu m1, [r0 + r1] + movu m2, [r0 + 2 * r1] + movu m3, [r0 + r5] + movu m4, [r2] + movu m5, [r2 + r3] + movu m6, [r2 + 2 * r3] + movu m7, [r2 + r6] + + psubw m0, m4 + psubw m1, m5 + psubw m2, m6 + psubw m3, m7 + pmaddwd m0, m0 + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + paddd m8, m0 + paddd m8, m1 + paddd m8, m2 + paddd m8, m3 +%endmacro + +%macro PROCESS_SSD_SS_16x4_AVX512 0 + movu ym0, [r0] + vinserti32x8 m0, [r0 + r1], 1 + movu ym1, [r0 + 2 * r1] + vinserti32x8 m1, [r0 + r5], 1 + movu ym4, [r2] + vinserti32x8 m4, [r2 + r3], 1 + movu ym5, [r2 + 2 * r3] + vinserti32x8 m5, [r2 + r6], 1 + + psubw m0, m4 + psubw m1, m5 + pmaddwd m0, m0 + pmaddwd m1, m1 + paddd m8, m0 + paddd m8, m1 +%endmacro + +%macro SSD_SS_AVX512 2 +INIT_ZMM avx512 +cglobal pixel_ssd_ss_%1x%2, 4,7,9 + add r1d, r1d + add r3d, r3d + lea r5, [r1 * 3] + lea r6, [r3 * 3] + pxor m8, m8 + +%rep %2/4 - 1 + PROCESS_SSD_SS_%1x4_AVX512 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%endrep + PROCESS_SSD_SS_%1x4_AVX512 + HADDD m8, m0 + movd eax, xm8 + RET +%endmacro + + +SSD_SS_AVX512 64, 64 +SSD_SS_AVX512 32, 32 +SSD_SS_AVX512 16, 16 +%endif +;----------------------------------------------------------------------------- +; ssd_ss avx512 code end +;----------------------------------------------------------------------------- %endif ; !HIGH_BIT_DEPTH %if HIGH_BIT_DEPTH == 0 @@ -3064,7 +3294,7 @@ movd eax, m0 RET - +%if ARCH_X86_64 && BIT_DEPTH >= 10 INIT_XMM sse2 cglobal pixel_ssd_s_32, 2,3,5 add r1, r1 @@ -3105,7 +3335,6 @@ dec r2d jnz .loop -%if BIT_DEPTH >= 10 movu m1, m0 pxor m2, m2 punpckldq m0, m2 @@ -3114,13 +3343,56 @@ movhlps m1, m0 paddq m0, m1 movq rax, xm0 -%else + RET +%endif + +%if BIT_DEPTH == 8 +INIT_XMM sse2 +cglobal pixel_ssd_s_32, 2,3,5 + add r1, r1 + + mov r2d, 16 + pxor m0, m0 +.loop: + movu m1, [r0 + 0 * mmsize] + movu m2, [r0 + 1 * mmsize] + movu m3, [r0 + 2 * mmsize] + movu m4, [r0 + 3 * mmsize] + add r0, r1 + + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + pmaddwd m4, m4 + paddd m1, m2 + paddd m3, m4 + paddd m1, m3 + paddd m0, m1 + + movu m1, [r0 + 0 * mmsize] + movu m2, [r0 + 1 * mmsize] + movu m3, [r0 + 2 * mmsize] + movu m4, [r0 + 3 * mmsize] + add r0, r1 + + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + pmaddwd m4, m4 + paddd m1, m2 + paddd m3, m4 + paddd m1, m3 + paddd m0, m1 + + dec r2d + jnz .loop ; calculate sum and return HADDD m0, m1 movd eax, m0 -%endif RET +%endif +%if ARCH_X86_64 INIT_YMM avx2 cglobal pixel_ssd_s_16, 2,4,5 add r1, r1 @@ -3207,3 +3479,227 @@ movd eax, xm0 %endif RET +%endif +;----------------------------------------------------------------------------- +; ssd_s avx512 code start +;----------------------------------------------------------------------------- +%macro PROCESS_SSD_S_32x8_AVX512 0 + movu m1, [r0] + movu m2, [r0 + r1] + movu m3, [r0 + 2 * r1] + movu m4, [r0 + r3] + + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + pmaddwd m4, m4 + paddd m1, m2 + paddd m3, m4 + paddd m1, m3 + paddd m0, m1 + + lea r0, [r0 + 4 * r1] + + movu m1, [r0] + movu m2, [r0 + r1] + movu m3, [r0 + 2 * r1] + movu m4, [r0 + r3] + + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + pmaddwd m4, m4 + paddd m1, m2 + paddd m3, m4 + paddd m1, m3 + paddd m0, m1 +%endmacro + +%macro PROCESS_SSD_S_16x8_AVX512 0 + movu ym1, [r0] + vinserti32x8 m1, [r0 + r1], 1 + movu ym2, [r0 + 2 * r1] + vinserti32x8 m2, [r0 + r3], 1 + lea r0, [r0 + 4 * r1] + movu ym3, [r0] + vinserti32x8 m3, [r0 + r1], 1 + movu ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r0 + 
r3], 1 + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + pmaddwd m4, m4 + paddd m1, m2 + paddd m3, m4 + paddd m1, m3 + paddd m0, m1 +%endmacro +;----------------------------------------------------------------------------- +; int pixel_ssd_s( int16_t *ref, intptr_t i_stride ) +;----------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_ssd_s_32, 2,4,5 + add r1, r1 + lea r3, [r1 * 3] + pxor m0, m0 + + PROCESS_SSD_S_32x8_AVX512 + lea r0, [r0 + 4 * r1] + PROCESS_SSD_S_32x8_AVX512 + lea r0, [r0 + 4 * r1] + PROCESS_SSD_S_32x8_AVX512 + lea r0, [r0 + 4 * r1] + PROCESS_SSD_S_32x8_AVX512 + + ; calculate sum and return +%if BIT_DEPTH >= 10 + movu m1, m0 + pxor m2, m2 + punpckldq m0, m2 + punpckhdq m1, m2 + paddq m0, m1 + vextracti32x8 ym2, m0, 1 + paddq ym0, ym2 + vextracti32x4 xm2, m0, 1 + paddq xm2, xm0 + movhlps xm1, xm2 + paddq xm2, xm1 + movq rax, xm2 +%else + HADDD m0, m1 + movd eax, xm0 +%endif + RET + +INIT_ZMM avx512 +cglobal pixel_ssd_s_16, 2,4,5 + add r1, r1 + lea r3, [r1 * 3] + pxor m0, m0 + + PROCESS_SSD_S_16x8_AVX512 + lea r0, [r0 + 4 * r1] + PROCESS_SSD_S_16x8_AVX512 + + ; calculate sum and return + HADDD m0, m1 + movd eax, xm0 + RET +%endif +;----------------------------------------------------------------------------- +; ssd_s avx512 code end +;----------------------------------------------------------------------------- +;----------------------------------------------------------------------------- +;ALigned version of macro +;----------------------------------------------------------------------------- +%macro PROCESS_SSD_S_16x8_ALIGNED_AVX512 0 + mova ym1, [r0] + vinserti32x8 m1, [r0 + r1], 1 + mova ym2, [r0 + 2 * r1] + vinserti32x8 m2, [r0 + r3], 1 + lea r0, [r0 + 4 * r1] + mova ym3, [r0] + vinserti32x8 m3, [r0 + r1], 1 + mova ym4, [r0 + 2 * r1] + vinserti32x8 m4, [r0 + r3], 1 + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + pmaddwd m4, m4 + paddd m1, m2 + paddd m3, m4 + paddd m1, m3 + paddd m0, m1 +%endmacro +;--------------------------------------------------------------------------------- +;int pixel_ssd_s_aligned( int16_t *ref, intptr_t i_stride ) +;----------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_ZMM avx512 + +INIT_ZMM avx512 +cglobal pixel_ssd_s_aligned_16, 2,4,5 + add r1, r1 + lea r3, [r1 * 3] + pxor m0, m0 + + PROCESS_SSD_S_16x8_ALIGNED_AVX512 + lea r0, [r0 + 4 * r1] + PROCESS_SSD_S_16x8_ALIGNED_AVX512 + + ; calculate sum and return + HADDD m0, m1 + movd eax, xm0 + RET +%endif +;--------------------------------------------------------------------------------------------- +; aligned implementation for 32 +;--------------------------------------------------------------------------------------------- +%macro PROCESS_SSD_S_32x8_ALIGNED_AVX512 0 + mova m1, [r0] + mova m2, [r0 + r1] + mova m3, [r0 + 2 * r1] + mova m4, [r0 + r3] + + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + pmaddwd m4, m4 + paddd m1, m2 + paddd m3, m4 + paddd m1, m3 + paddd m0, m1 + + lea r0, [r0 + 4 * r1] + + mova m1, [r0] + mova m2, [r0 + r1] + mova m3, [r0 + 2 * r1] + mova m4, [r0 + r3] + + pmaddwd m1, m1 + pmaddwd m2, m2 + pmaddwd m3, m3 + pmaddwd m4, m4 + paddd m1, m2 + paddd m3, m4 + paddd m1, m3 + paddd m0, m1 +%endmacro + +%if ARCH_X86_64 +INIT_ZMM avx512 +cglobal pixel_ssd_s_aligned_32, 2,4,5 + add r1, r1 + lea r3, [r1 * 3] + pxor m0, m0 + + PROCESS_SSD_S_32x8_AVX512 + lea r0, [r0 + 4 * r1] + PROCESS_SSD_S_32x8_ALIGNED_AVX512 + lea r0, [r0 + 4 * r1] + 
PROCESS_SSD_S_32x8_ALIGNED_AVX512 + lea r0, [r0 + 4 * r1] + PROCESS_SSD_S_32x8_ALIGNED_AVX512 + + ; calculate sum and return +%if BIT_DEPTH >= 10 + mova m1, m0 + pxor m2, m2 + punpckldq m0, m2 + punpckhdq m1, m2 + paddq m0, m1 + vextracti32x8 ym2, m0, 1 + paddq ym0, ym2 + vextracti32x4 xm2, m0, 1 + paddq xm2, xm0 + movhlps xm1, xm2 + paddq xm2, xm1 + movq rax, xm2 +%else + HADDD m0, m1 + movd eax, xm0 +%endif + RET +%endif \ No newline at end of file
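The new AVX-512 ssd_ss and ssd_s kernels above accumulate plain sums of squared 16-bit values: ssd_ss takes two int16_t blocks and sums the squared differences, while ssd_s squares a single residual block. A scalar C++ sketch of what the vector code accumulates follows; this is an illustration under the signatures shown in the comments above, not the project's implementation, and as in the asm the high-bit-depth paths return the sum in 64 bits.

#include <cstdint>

// Scalar sketch of pixel_ssd_ss_WxH: SSD between two int16_t blocks.
static uint64_t ssd_ss_ref(const int16_t* a, intptr_t strideA,
                           const int16_t* b, intptr_t strideB,
                           int width, int height)
{
    uint64_t sum = 0;
    for (int y = 0; y < height; y++, a += strideA, b += strideB)
        for (int x = 0; x < width; x++)
        {
            int d = a[x] - b[x];      // difference of two 16-bit samples
            sum += (uint64_t)(d * d); // squared and accumulated
        }
    return sum;
}

// Scalar sketch of pixel_ssd_s_W: sum of squares of one square residual block.
static uint64_t ssd_s_ref(const int16_t* a, intptr_t stride, int size)
{
    uint64_t sum = 0;
    for (int y = 0; y < size; y++, a += stride)
        for (int x = 0; x < size; x++)
            sum += (uint64_t)(a[x] * a[x]);
    return sum;
}

In the asm, pmaddwd performs the multiply and the first level of accumulation in one step, and the ARCH_X86_64 guards added above exist because these kernels use registers (m8 and higher) that are only available in 64-bit builds.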
View file
x265_2.7.tar.gz/source/common/x86/v4-ipfilter16.asm -> x265_2.9.tar.gz/source/common/x86/v4-ipfilter16.asm
Changed
@@ -2931,6 +2931,7 @@
    RET
%endmacro

+%if ARCH_X86_64
    FILTER_VER_CHROMA_AVX2_4xN pp, 16, 1, 6
    FILTER_VER_CHROMA_AVX2_4xN ps, 16, 0, INTERP_SHIFT_PS
    FILTER_VER_CHROMA_AVX2_4xN sp, 16, 1, INTERP_SHIFT_SP
@@ -2939,6 +2940,7 @@
    FILTER_VER_CHROMA_AVX2_4xN ps, 32, 0, INTERP_SHIFT_PS
    FILTER_VER_CHROMA_AVX2_4xN sp, 32, 1, INTERP_SHIFT_SP
    FILTER_VER_CHROMA_AVX2_4xN ss, 32, 0, 6
+%endif

%macro FILTER_VER_CHROMA_AVX2_8x8 3
INIT_YMM avx2
View file
x265_2.7.tar.gz/source/common/x86/v4-ipfilter8.asm -> x265_2.9.tar.gz/source/common/x86/v4-ipfilter8.asm
Changed
@@ -43,7 +43,7 @@ const v4_interp4_vpp_shuf1, dd 0, 1, 1, 2, 2, 3, 3, 4 dd 2, 3, 3, 4, 4, 5, 5, 6 -const tab_ChromaCoeff, db 0, 64, 0, 0 +const v4_tab_ChromaCoeff, db 0, 64, 0, 0 db -2, 58, 10, -2 db -4, 54, 16, -2 db -6, 46, 28, -4 @@ -1031,8 +1031,8 @@ mova m6, [r5 + r4] mova m5, [r5 + r4 + 16] %else - mova m6, [tab_ChromaCoeff + r4] - mova m5, [tab_ChromaCoeff + r4 + 16] + mova m6, [v4_tab_ChromaCoeff + r4] + mova m5, [v4_tab_ChromaCoeff + r4 + 16] %endif %ifidn %1,pp @@ -2114,10 +2114,10 @@ sub r0, r1 %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif lea r4, [r1 * 3] lea r5, [r0 + 4 * r1] @@ -2430,10 +2430,10 @@ sub r0, r1 %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m0, [tab_Cm] @@ -2515,10 +2515,10 @@ sub r0, r1 %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m0, [tab_Cm] @@ -2611,10 +2611,10 @@ sub r0, r1 %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m0, [tab_Cm] @@ -2984,10 +2984,10 @@ sub r0, r1 %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m0, [tab_Cm] @@ -3180,10 +3180,10 @@ punpcklbw m4, m2, m3 %ifdef PIC - lea r6, [tab_ChromaCoeff] + lea r6, [v4_tab_ChromaCoeff] movd m5, [r6 + r4 * 4] %else - movd m5, [tab_ChromaCoeff + r4 * 4] + movd m5, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m6, m5, [tab_Vm] @@ -3233,10 +3233,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m0, [tab_Cm] @@ -3280,10 +3280,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m0, [tab_Cm] @@ -3355,10 +3355,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m0, [tab_Cm] @@ -3442,10 +3442,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m5, [r5 + r4 * 4] %else - movd m5, [tab_ChromaCoeff + r4 * 4] + movd m5, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m6, m5, [tab_Vm] @@ -3513,10 +3513,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m5, [r5 + r4 * 4] %else - movd m5, [tab_ChromaCoeff + r4 * 4] + movd m5, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m6, m5, [tab_Vm] @@ -3605,10 +3605,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m5, [r5 + r4 * 4] %else - movd m5, [tab_ChromaCoeff + r4 * 4] + movd m5, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m6, m5, [tab_Vm] @@ -3700,10 +3700,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - 
movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m1, m0, [tab_Vm] @@ -3786,10 +3786,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m1, m0, [tab_Vm] @@ -3877,10 +3877,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m1, m0, [tab_Vm] @@ -3995,10 +3995,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m1, m0, [tab_Vm] @@ -4091,10 +4091,10 @@ sub r0, r1 %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m5, [r5 + r4 * 4] %else - movd m5, [tab_ChromaCoeff + r4 * 4] + movd m5, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m6, m5, [tab_Vm] @@ -4942,10 +4942,10 @@ sub r0, r1 %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m5, [r5 + r4 * 4] %else - movd m5, [tab_ChromaCoeff + r4 * 4] + movd m5, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m6, m5, [tab_Vm] @@ -5040,10 +5040,10 @@ sub r0, r1 %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m1, m0, [tab_Vm] @@ -5130,10 +5130,10 @@ sub r0, r1 %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m1, m0, [tab_Vm] @@ -7543,10 +7543,10 @@ sub r0, r1 %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m1, m0, [tab_Vm] @@ -7666,10 +7666,10 @@ sub r0, r1 %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m1, m0, [tab_Vm] @@ -8267,10 +8267,10 @@ sub r0, r1 %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m1, m0, [tab_Vm] @@ -8808,10 +8808,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m1, m0, [tab_Vm] @@ -8907,10 +8907,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m0, [tab_Cm] @@ -8981,10 +8981,10 @@ add r3d, r3d %ifdef PIC - lea r5, [tab_ChromaCoeff] + lea r5, [v4_tab_ChromaCoeff] movd m0, [r5 + r4 * 4] %else - movd m0, [tab_ChromaCoeff + r4 * 4] + movd m0, [v4_tab_ChromaCoeff + r4 * 4] %endif pshufb m0, [tab_Cm]
View file
x265_2.7.tar.gz/source/common/x86/x86inc.asm -> x265_2.9.tar.gz/source/common/x86/x86inc.asm
Changed
@@ -82,7 +82,13 @@ %endif %macro SECTION_RODATA 0-1 32 - SECTION .rodata align=%1 + %ifidn __OUTPUT_FORMAT__,win32 + SECTION .rdata align=%1 + %elif WIN64 + SECTION .rdata align=%1 + %else + SECTION .rodata align=%1 + %endif %endmacro %if WIN64 @@ -325,6 +331,8 @@ %endmacro %define required_stack_alignment ((mmsize + 15) & ~15) +%define vzeroupper_required (mmsize > 16 && (ARCH_X86_64 == 0 || xmm_regs_used > 16 || notcpuflag(avx512))) +%define high_mm_regs (16*cpuflag(avx512)) %macro ALLOC_STACK 1-2 0 ; stack_size, n_xmm_regs (for win64 only) %ifnum %1 @@ -438,15 +446,16 @@ %macro WIN64_PUSH_XMM 0 ; Use the shadow space to store XMM6 and XMM7, the rest needs stack space allocated. - %if xmm_regs_used > 6 + %if xmm_regs_used > 6 + high_mm_regs movaps [rstk + stack_offset + 8], xmm6 %endif - %if xmm_regs_used > 7 + %if xmm_regs_used > 7 + high_mm_regs movaps [rstk + stack_offset + 24], xmm7 %endif - %if xmm_regs_used > 8 + %assign %%xmm_regs_on_stack xmm_regs_used - high_mm_regs - 8 + %if %%xmm_regs_on_stack > 0 %assign %%i 8 - %rep xmm_regs_used-8 + %rep %%xmm_regs_on_stack movaps [rsp + (%%i-8)*16 + stack_size + 32], xmm %+ %%i %assign %%i %%i+1 %endrep @@ -455,8 +464,9 @@ %macro WIN64_SPILL_XMM 1 %assign xmm_regs_used %1 - ASSERT xmm_regs_used <= 16 - %if xmm_regs_used > 8 + ASSERT xmm_regs_used <= 16 + high_mm_regs + %assign %%xmm_regs_on_stack xmm_regs_used - high_mm_regs - 8 + %if %%xmm_regs_on_stack > 0 ; Allocate stack space for callee-saved xmm registers plus shadow space and align the stack. %assign %%pad (xmm_regs_used-8)*16 + 32 %assign stack_size_padded %%pad + ((-%%pad-stack_offset-gprsize) & (STACK_ALIGNMENT-1)) @@ -467,9 +477,10 @@ %macro WIN64_RESTORE_XMM_INTERNAL 0 %assign %%pad_size 0 - %if xmm_regs_used > 8 - %assign %%i xmm_regs_used - %rep xmm_regs_used-8 + %assign %%xmm_regs_on_stack xmm_regs_used - high_mm_regs - 8 + %if %%xmm_regs_on_stack > 0 + %assign %%i xmm_regs_used - high_mm_regs + %rep %%xmm_regs_on_stack %assign %%i %%i-1 movaps xmm %+ %%i, [rsp + (%%i-8)*16 + stack_size + 32] %endrep @@ -482,10 +493,10 @@ %assign %%pad_size stack_size_padded %endif %endif - %if xmm_regs_used > 7 + %if xmm_regs_used > 7 + high_mm_regs movaps xmm7, [rsp + stack_offset - %%pad_size + 24] %endif - %if xmm_regs_used > 6 + %if xmm_regs_used > 6 + high_mm_regs movaps xmm6, [rsp + stack_offset - %%pad_size + 8] %endif %endmacro @@ -497,12 +508,12 @@ %assign xmm_regs_used 0 %endmacro -%define has_epilogue regs_used > 7 || xmm_regs_used > 6 || mmsize == 32 || stack_size > 0 +%define has_epilogue regs_used > 7 || stack_size > 0 || vzeroupper_required || xmm_regs_used > 6 + high_mm_regs %macro RET 0 WIN64_RESTORE_XMM_INTERNAL POP_IF_USED 14, 13, 12, 11, 10, 9, 8, 7 - %if mmsize == 32 + %if vzeroupper_required vzeroupper %endif AUTO_REP_RET @@ -526,9 +537,10 @@ DECLARE_REG 13, R12, 64 DECLARE_REG 14, R13, 72 -%macro PROLOGUE 2-5+ ; #args, #regs, #xmm_regs, [stack_size,] arg_names... +%macro PROLOGUE 2-5+ 0; #args, #regs, #xmm_regs, [stack_size,] arg_names... 
%assign num_args %1 %assign regs_used %2 + %assign xmm_regs_used %3 ASSERT regs_used >= num_args SETUP_STACK_POINTER %4 ASSERT regs_used <= 15 @@ -538,7 +550,7 @@ DEFINE_ARGS_INTERNAL %0, %4, %5 %endmacro -%define has_epilogue regs_used > 9 || mmsize == 32 || stack_size > 0 +%define has_epilogue regs_used > 9 || stack_size > 0 || vzeroupper_required %macro RET 0 %if stack_size_padded > 0 @@ -549,7 +561,7 @@ %endif %endif POP_IF_USED 14, 13, 12, 11, 10, 9 - %if mmsize == 32 + %if vzeroupper_required vzeroupper %endif AUTO_REP_RET @@ -594,7 +606,7 @@ DEFINE_ARGS_INTERNAL %0, %4, %5 %endmacro -%define has_epilogue regs_used > 3 || mmsize == 32 || stack_size > 0 +%define has_epilogue regs_used > 3 || stack_size > 0 || vzeroupper_required %macro RET 0 %if stack_size_padded > 0 @@ -605,7 +617,7 @@ %endif %endif POP_IF_USED 6, 5, 4, 3 - %if mmsize == 32 + %if vzeroupper_required vzeroupper %endif AUTO_REP_RET @@ -710,12 +722,22 @@ %assign stack_offset 0 ; stack pointer offset relative to the return address %assign stack_size 0 ; amount of stack space that can be freely used inside a function %assign stack_size_padded 0 ; total amount of allocated stack space, including space for callee-saved xmm registers on WIN64 and alignment padding - %assign xmm_regs_used 0 ; number of XMM registers requested, used for dealing with callee-saved registers on WIN64 + %assign xmm_regs_used 0 ; number of XMM registers requested, used for dealing with callee-saved registers on WIN64 and vzeroupper %ifnidn %3, "" PROLOGUE %3 %endif %endmacro +; Create a global symbol from a local label with the correct name mangling and type +%macro cglobal_label 1 + %if FORMAT_ELF + global current_function %+ %1:function hidden + %else + global current_function %+ %1 + %endif + %1: +%endmacro + %macro cextern 1 %xdefine %1 mangle(private_prefix %+ _ %+ %1) CAT_XDEFINE cglobaled_, %1, 1 @@ -768,10 +790,10 @@ %assign cpuflags_bmi1 (1<<16)| cpuflags_avx | cpuflags_lzcnt %assign cpuflags_bmi2 (1<<17)| cpuflags_bmi1 %assign cpuflags_avx2 (1<<18)| cpuflags_fma3 | cpuflags_bmi2 +%assign cpuflags_avx512 (1<<19)| cpuflags_avx2 ; F, CD, BW, DQ, VL -%assign cpuflags_cache32 (1<<19) -%assign cpuflags_cache64 (1<<20) -%assign cpuflags_slowctz (1<<21) +%assign cpuflags_cache32 (1<<20) +%assign cpuflags_cache64 (1<<21) %assign cpuflags_aligned (1<<22) ; not a cpu feature, but a function variant %assign cpuflags_atom (1<<23) @@ -829,11 +851,12 @@ %endif %endmacro -; Merge mmx and sse* +; Merge mmx and sse*, and avx* ; m# is a simd register of the currently selected size ; xm# is the corresponding xmm register if mmsize >= 16, otherwise the same as m# ; ym# is the corresponding ymm register if mmsize >= 32, otherwise the same as m# -; (All 3 remain in sync through SWAP.) +; zm# is the corresponding zmm register if mmsize >= 64, otherwise the same as m# +; (All 4 remain in sync through SWAP.) 
%macro CAT_XDEFINE 3 %xdefine %1%2 %3 @@ -843,69 +866,100 @@ %undef %1%2 %endmacro +%macro DEFINE_MMREGS 1 ; mmtype + %assign %%prev_mmregs 0 + %ifdef num_mmregs + %assign %%prev_mmregs num_mmregs + %endif + + %assign num_mmregs 8 + %if ARCH_X86_64 && mmsize >= 16 + %assign num_mmregs 16 + %if cpuflag(avx512) || mmsize == 64 + %assign num_mmregs 32 + %endif + %endif + + %assign %%i 0 + %rep num_mmregs + CAT_XDEFINE m, %%i, %1 %+ %%i + CAT_XDEFINE nn%1, %%i, %%i + %assign %%i %%i+1 + %endrep + %if %%prev_mmregs > num_mmregs + %rep %%prev_mmregs - num_mmregs + CAT_UNDEF m, %%i + CAT_UNDEF nn %+ mmtype, %%i + %assign %%i %%i+1 + %endrep + %endif + %xdefine mmtype %1 +%endmacro + +; Prefer registers 16-31 over 0-15 to avoid having to use vzeroupper +%macro AVX512_MM_PERMUTATION 0-1 0 ; start_reg + %if ARCH_X86_64 && cpuflag(avx512) + %assign %%i %1 + %rep 16-%1 + %assign %%i_high %%i+16 + SWAP %%i, %%i_high + %assign %%i %%i+1 + %endrep + %endif +%endmacro + %macro INIT_MMX 0-1+ %assign avx_enabled 0 %define RESET_MM_PERMUTATION INIT_MMX %1 %define mmsize 8 - %define num_mmregs 8 %define mova movq %define movu movq %define movh movd %define movnta movntq - %assign %%i 0 - %rep 8 - CAT_XDEFINE m, %%i, mm %+ %%i - CAT_XDEFINE nnmm, %%i, %%i - %assign %%i %%i+1 - %endrep - %rep 8 - CAT_UNDEF m, %%i - CAT_UNDEF nnmm, %%i - %assign %%i %%i+1 - %endrep INIT_CPUFLAGS %1 + DEFINE_MMREGS mm %endmacro %macro INIT_XMM 0-1+ %assign avx_enabled 0 %define RESET_MM_PERMUTATION INIT_XMM %1 %define mmsize 16 - %define num_mmregs 8 - %if ARCH_X86_64 - %define num_mmregs 16 - %endif %define mova movdqa %define movu movdqu %define movh movq %define movnta movntdq - %assign %%i 0 - %rep num_mmregs - CAT_XDEFINE m, %%i, xmm %+ %%i - CAT_XDEFINE nnxmm, %%i, %%i - %assign %%i %%i+1 - %endrep INIT_CPUFLAGS %1 + DEFINE_MMREGS xmm + %if WIN64 + ; Swap callee-saved registers with volatile registers + AVX512_MM_PERMUTATION 6 + %endif %endmacro %macro INIT_YMM 0-1+ %assign avx_enabled 1 %define RESET_MM_PERMUTATION INIT_YMM %1 %define mmsize 32 - %define num_mmregs 8 - %if ARCH_X86_64 - %define num_mmregs 16 - %endif %define mova movdqa %define movu movdqu %undef movh %define movnta movntdq - %assign %%i 0 - %rep num_mmregs - CAT_XDEFINE m, %%i, ymm %+ %%i - CAT_XDEFINE nnymm, %%i, %%i - %assign %%i %%i+1 - %endrep INIT_CPUFLAGS %1 + DEFINE_MMREGS ymm + AVX512_MM_PERMUTATION +%endmacro + +%macro INIT_ZMM 0-1+ + %assign avx_enabled 1 + %define RESET_MM_PERMUTATION INIT_ZMM %1 + %define mmsize 64 + %define mova movdqa + %define movu movdqu + %undef movh + %define movnta movntdq + INIT_CPUFLAGS %1 + DEFINE_MMREGS zmm + AVX512_MM_PERMUTATION %endmacro INIT_XMM @@ -914,18 +968,26 @@ %define mmmm%1 mm%1 %define mmxmm%1 mm%1 %define mmymm%1 mm%1 + %define mmzmm%1 mm%1 %define xmmmm%1 mm%1 %define xmmxmm%1 xmm%1 %define xmmymm%1 xmm%1 + %define xmmzmm%1 xmm%1 %define ymmmm%1 mm%1 %define ymmxmm%1 xmm%1 %define ymmymm%1 ymm%1 + %define ymmzmm%1 ymm%1 + %define zmmmm%1 mm%1 + %define zmmxmm%1 xmm%1 + %define zmmymm%1 ymm%1 + %define zmmzmm%1 zmm%1 %define xm%1 xmm %+ m%1 %define ym%1 ymm %+ m%1 + %define zm%1 zmm %+ m%1 %endmacro %assign i 0 -%rep 16 +%rep 32 DECLARE_MMCAST i %assign i i+1 %endrep @@ -1060,12 +1122,17 @@ ;============================================================================= %assign i 0 -%rep 16 +%rep 32 %if i < 8 CAT_XDEFINE sizeofmm, i, 8 + CAT_XDEFINE regnumofmm, i, i %endif CAT_XDEFINE sizeofxmm, i, 16 CAT_XDEFINE sizeofymm, i, 32 + CAT_XDEFINE sizeofzmm, i, 64 + CAT_XDEFINE regnumofxmm, i, i + 
CAT_XDEFINE regnumofymm, i, i + CAT_XDEFINE regnumofzmm, i, i %assign i i+1 %endrep %undef i @@ -1182,7 +1249,7 @@ %endmacro %endmacro -; Instructions with both VEX and non-VEX encodings +; Instructions with both VEX/EVEX and legacy encodings ; Non-destructive instructions are written without parameters AVX_INSTR addpd, sse2, 1, 0, 1 AVX_INSTR addps, sse, 1, 0, 1 @@ -1190,12 +1257,12 @@ AVX_INSTR addss, sse, 1, 0, 0 AVX_INSTR addsubpd, sse3, 1, 0, 0 AVX_INSTR addsubps, sse3, 1, 0, 0 -AVX_INSTR aesdec, fnord, 0, 0, 0 -AVX_INSTR aesdeclast, fnord, 0, 0, 0 -AVX_INSTR aesenc, fnord, 0, 0, 0 -AVX_INSTR aesenclast, fnord, 0, 0, 0 -AVX_INSTR aesimc -AVX_INSTR aeskeygenassist +AVX_INSTR aesdec, aesni, 0, 0, 0 +AVX_INSTR aesdeclast, aesni, 0, 0, 0 +AVX_INSTR aesenc, aesni, 0, 0, 0 +AVX_INSTR aesenclast, aesni, 0, 0, 0 +AVX_INSTR aesimc, aesni +AVX_INSTR aeskeygenassist, aesni AVX_INSTR andnpd, sse2, 1, 0, 0 AVX_INSTR andnps, sse, 1, 0, 0 AVX_INSTR andpd, sse2, 1, 0, 1 @@ -1204,10 +1271,42 @@ AVX_INSTR blendps, sse4, 1, 1, 0 AVX_INSTR blendvpd, sse4 ; can't be emulated AVX_INSTR blendvps, sse4 ; can't be emulated +AVX_INSTR cmpeqpd, sse2, 1, 0, 1 +AVX_INSTR cmpeqps, sse, 1, 0, 1 +AVX_INSTR cmpeqsd, sse2, 1, 0, 0 +AVX_INSTR cmpeqss, sse, 1, 0, 0 +AVX_INSTR cmplepd, sse2, 1, 0, 0 +AVX_INSTR cmpleps, sse, 1, 0, 0 +AVX_INSTR cmplesd, sse2, 1, 0, 0 +AVX_INSTR cmpless, sse, 1, 0, 0 +AVX_INSTR cmpltpd, sse2, 1, 0, 0 +AVX_INSTR cmpltps, sse, 1, 0, 0 +AVX_INSTR cmpltsd, sse2, 1, 0, 0 +AVX_INSTR cmpltss, sse, 1, 0, 0 +AVX_INSTR cmpneqpd, sse2, 1, 0, 1 +AVX_INSTR cmpneqps, sse, 1, 0, 1 +AVX_INSTR cmpneqsd, sse2, 1, 0, 0 +AVX_INSTR cmpneqss, sse, 1, 0, 0 +AVX_INSTR cmpnlepd, sse2, 1, 0, 0 +AVX_INSTR cmpnleps, sse, 1, 0, 0 +AVX_INSTR cmpnlesd, sse2, 1, 0, 0 +AVX_INSTR cmpnless, sse, 1, 0, 0 +AVX_INSTR cmpnltpd, sse2, 1, 0, 0 +AVX_INSTR cmpnltps, sse, 1, 0, 0 +AVX_INSTR cmpnltsd, sse2, 1, 0, 0 +AVX_INSTR cmpnltss, sse, 1, 0, 0 +AVX_INSTR cmpordpd, sse2 1, 0, 1 +AVX_INSTR cmpordps, sse 1, 0, 1 +AVX_INSTR cmpordsd, sse2 1, 0, 0 +AVX_INSTR cmpordss, sse 1, 0, 0 AVX_INSTR cmppd, sse2, 1, 1, 0 AVX_INSTR cmpps, sse, 1, 1, 0 AVX_INSTR cmpsd, sse2, 1, 1, 0 AVX_INSTR cmpss, sse, 1, 1, 0 +AVX_INSTR cmpunordpd, sse2, 1, 0, 1 +AVX_INSTR cmpunordps, sse, 1, 0, 1 +AVX_INSTR cmpunordsd, sse2, 1, 0, 0 +AVX_INSTR cmpunordss, sse, 1, 0, 0 AVX_INSTR comisd, sse2 AVX_INSTR comiss, sse AVX_INSTR cvtdq2pd, sse2 @@ -1513,3 +1612,49 @@ FMA4_INSTR fmsubadd, pd, ps FMA4_INSTR fnmadd, pd, ps, sd, ss FMA4_INSTR fnmsub, pd, ps, sd, ss + +; Macros for converting VEX instructions to equivalent EVEX ones. 
+%macro EVEX_INSTR 2-3 0 ; vex, evex, prefer_evex + %macro %1 2-7 fnord, fnord, %1, %2, %3 + %ifidn %3, fnord + %define %%args %1, %2 + %elifidn %4, fnord + %define %%args %1, %2, %3 + %else + %define %%args %1, %2, %3, %4 + %endif + %assign %%evex_required cpuflag(avx512) & %7 + %ifnum regnumof%1 + %if regnumof%1 >= 16 || sizeof%1 > 32 + %assign %%evex_required 1 + %endif + %endif + %ifnum regnumof%2 + %if regnumof%2 >= 16 || sizeof%2 > 32 + %assign %%evex_required 1 + %endif + %endif + %if %%evex_required + %6 %%args + %else + %5 %%args ; Prefer VEX over EVEX due to shorter instruction length + %endif + %endmacro +%endmacro + +EVEX_INSTR vbroadcastf128, vbroadcastf32x4 +EVEX_INSTR vbroadcasti128, vbroadcasti32x4 +EVEX_INSTR vextractf128, vextractf32x4 +EVEX_INSTR vextracti128, vextracti32x4 +EVEX_INSTR vinsertf128, vinsertf32x4 +EVEX_INSTR vinserti128, vinserti32x4 +EVEX_INSTR vmovdqa, vmovdqa32 +EVEX_INSTR vmovdqu, vmovdqu32 +EVEX_INSTR vpand, vpandd +EVEX_INSTR vpandn, vpandnd +EVEX_INSTR vpor, vpord +EVEX_INSTR vpxor, vpxord +EVEX_INSTR vrcpps, vrcp14ps, 1 ; EVEX versions have higher precision +EVEX_INSTR vrcpss, vrcp14ss, 1 +EVEX_INSTR vrsqrtps, vrsqrt14ps, 1 +EVEX_INSTR vrsqrtss, vrsqrt14ss, 1
View file
x265_2.7.tar.gz/source/common/x86/x86util.asm -> x265_2.9.tar.gz/source/common/x86/x86util.asm
Changed
@@ -299,32 +299,44 @@ pminsw %2, %4 %endmacro +%macro MOVHL 2 ; dst, src +%ifidn %1, %2 + punpckhqdq %1, %2 +%elif cpuflag(avx) + punpckhqdq %1, %2, %2 +%elif cpuflag(sse4) + pshufd %1, %2, q3232 ; pshufd is slow on some older CPUs, so only use it on more modern ones +%else + movhlps %1, %2 ; may cause an int/float domain transition and has a dependency on dst +%endif +%endmacro + %macro HADDD 2 ; sum junk -%if sizeof%1 == 32 -%define %2 xmm%2 - vextracti128 %2, %1, 1 -%define %1 xmm%1 - paddd %1, %2 +%if sizeof%1 >= 64 + vextracti32x8 ymm%2, zmm%1, 1 + paddd ymm%1, ymm%2 %endif -%if mmsize >= 16 -%if cpuflag(xop) && sizeof%1 == 16 - vphadddq %1, %1 +%if sizeof%1 >= 32 + vextracti128 xmm%2, ymm%1, 1 + paddd xmm%1, xmm%2 +%endif +%if sizeof%1 >= 16 + MOVHL xmm%2, xmm%1 + paddd xmm%1, xmm%2 %endif - movhlps %2, %1 - paddd %1, %2 +%if cpuflag(xop) && sizeof%1 == 16 + vphadddq xmm%1, xmm%1 %endif %if notcpuflag(xop) - PSHUFLW %2, %1, q0032 - paddd %1, %2 + PSHUFLW xmm%2, xmm%1, q1032 + paddd xmm%1, xmm%2 %endif -%undef %1 -%undef %2 %endmacro %macro HADDW 2 ; reg, tmp %if cpuflag(xop) && sizeof%1 == 16 vphaddwq %1, %1 - movhlps %2, %1 + MOVHL %2, %1 paddd %1, %2 %else pmaddwd %1, [pw_1] @@ -346,7 +358,7 @@ %macro HADDUW 2 %if cpuflag(xop) && sizeof%1 == 16 vphadduwq %1, %1 - movhlps %2, %1 + MOVHL %2, %1 paddd %1, %2 %else HADDUWD %1, %2 @@ -739,25 +751,25 @@ %if %6 ; %5 aligned? mova %1, %4 psubw %1, %5 +%elif cpuflag(avx) + movu %1, %4 + psubw %1, %5 %else movu %1, %4 movu %2, %5 psubw %1, %2 %endif %else ; !HIGH_BIT_DEPTH -%ifidn %3, none movh %1, %4 movh %2, %5 +%ifidn %3, none punpcklbw %1, %2 punpcklbw %2, %2 - psubw %1, %2 %else - movh %1, %4 punpcklbw %1, %3 - movh %2, %5 punpcklbw %2, %3 - psubw %1, %2 %endif + psubw %1, %2 %endif ; HIGH_BIT_DEPTH %endmacro
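The x86util.asm hunk above generalizes the HADDD horizontal-add helper (and introduces MOVHL) so the same macro can reduce 512-bit ZMM, 256-bit YMM and 128-bit XMM registers: it extracts and adds the upper half of the register repeatedly until a single 32-bit sum remains. A plain C++ illustration of that halving reduction, conceptual only and not the macro itself:

#include <cstddef>
#include <cstdint>

// Fold an array of 32-bit partial sums the way HADDD does: keep adding the
// upper half of the lanes onto the lower half until one lane is left.
// count would be 16 for a ZMM register of dwords, 8 for YMM, 4 for XMM.
static uint32_t haddd_ref(uint32_t* lanes, size_t count)
{
    while (count > 1)
    {
        count /= 2;
        for (size_t i = 0; i < count; i++)
            lanes[i] += lanes[i + count];
    }
    return lanes[0];
}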
View file
x265_2.7.tar.gz/source/common/yuv.cpp -> x265_2.9.tar.gz/source/common/yuv.cpp
Changed
@@ -170,11 +170,14 @@ void Yuv::addClip(const Yuv& srcYuv0, const ShortYuv& srcYuv1, uint32_t log2SizeL, int picCsp) { - primitives.cu[log2SizeL - 2].add_ps(m_buf[0], m_size, srcYuv0.m_buf[0], srcYuv1.m_buf[0], srcYuv0.m_size, srcYuv1.m_size); + primitives.cu[log2SizeL - 2].add_ps[(m_size % 64 == 0) && (srcYuv0.m_size % 64 == 0) && (srcYuv1.m_size % 64 == 0)](m_buf[0], + m_size, srcYuv0.m_buf[0], srcYuv1.m_buf[0], srcYuv0.m_size, srcYuv1.m_size); if (m_csp != X265_CSP_I400 && picCsp != X265_CSP_I400) { - primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps(m_buf[1], m_csize, srcYuv0.m_buf[1], srcYuv1.m_buf[1], srcYuv0.m_csize, srcYuv1.m_csize); - primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps(m_buf[2], m_csize, srcYuv0.m_buf[2], srcYuv1.m_buf[2], srcYuv0.m_csize, srcYuv1.m_csize); + primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps[(m_csize % 64 == 0) && (srcYuv0.m_csize % 64 ==0) && (srcYuv1.m_csize % 64 == 0)](m_buf[1], + m_csize, srcYuv0.m_buf[1], srcYuv1.m_buf[1], srcYuv0.m_csize, srcYuv1.m_csize); + primitives.chroma[m_csp].cu[log2SizeL - 2].add_ps[(m_csize % 64 == 0) && (srcYuv0.m_csize % 64 == 0) && (srcYuv1.m_csize % 64 == 0)](m_buf[2], + m_csize, srcYuv0.m_buf[2], srcYuv1.m_buf[2], srcYuv0.m_csize, srcYuv1.m_csize); } if (picCsp == X265_CSP_I400 && m_csp != X265_CSP_I400) { @@ -192,7 +195,7 @@ const int16_t* srcY0 = srcYuv0.getLumaAddr(absPartIdx); const int16_t* srcY1 = srcYuv1.getLumaAddr(absPartIdx); pixel* dstY = getLumaAddr(absPartIdx); - primitives.pu[part].addAvg(srcY0, srcY1, dstY, srcYuv0.m_size, srcYuv1.m_size, m_size); + primitives.pu[part].addAvg[(srcYuv0.m_size % 64 == 0) && (srcYuv1.m_size % 64 == 0) && (m_size % 64 == 0)](srcY0, srcY1, dstY, srcYuv0.m_size, srcYuv1.m_size, m_size); } if (bChroma) { @@ -202,8 +205,8 @@ const int16_t* srcV1 = srcYuv1.getCrAddr(absPartIdx); pixel* dstU = getCbAddr(absPartIdx); pixel* dstV = getCrAddr(absPartIdx); - primitives.chroma[m_csp].pu[part].addAvg(srcU0, srcU1, dstU, srcYuv0.m_csize, srcYuv1.m_csize, m_csize); - primitives.chroma[m_csp].pu[part].addAvg(srcV0, srcV1, dstV, srcYuv0.m_csize, srcYuv1.m_csize, m_csize); + primitives.chroma[m_csp].pu[part].addAvg[(srcYuv0.m_csize % 64 == 0) && (srcYuv1.m_csize % 64 == 0) && (m_csize % 64 == 0)](srcU0, srcU1, dstU, srcYuv0.m_csize, srcYuv1.m_csize, m_csize); + primitives.chroma[m_csp].pu[part].addAvg[(srcYuv0.m_csize % 64 == 0) && (srcYuv1.m_csize % 64 == 0) && (m_csize % 64 == 0)](srcV0, srcV1, dstV, srcYuv0.m_csize, srcYuv1.m_csize, m_csize); } }
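The yuv.cpp hunk above is part of the aligned/unaligned primitive split in this release: add_ps and addAvg become two-element function-pointer arrays, and the boolean index (true when every stride involved is a multiple of 64) is used to pick the variant that may assume 64-byte-aligned rows, which suits the AVX-512 kernels. A minimal sketch of the dispatch idiom, with hypothetical names rather than the real x265 primitive tables:

#include <cstdint>

using pixel = uint16_t;  // high-bit-depth builds; 8-bit builds use uint8_t

// Hypothetical signature mirroring add_ps in spirit: dst = clip(src0 + src1).
typedef void (*add_ps_t)(pixel* dst, intptr_t dstStride,
                         const pixel* src0, const int16_t* src1,
                         intptr_t srcStride0, intptr_t srcStride1);

struct CUPrimitivesSketch
{
    // Index 0: generic version; index 1: variant that may assume
    // 64-byte-aligned strides (for example an AVX-512 kernel).
    add_ps_t add_ps[2];
};

static void addClipSketch(const CUPrimitivesSketch& p,
                          pixel* dst, intptr_t dstStride,
                          const pixel* src0, intptr_t stride0,
                          const int16_t* src1, intptr_t stride1)
{
    // Same selection rule as the boolean index in the diff above.
    bool aligned = (dstStride % 64 == 0) && (stride0 % 64 == 0) && (stride1 % 64 == 0);
    p.add_ps[aligned](dst, dstStride, src0, src1, stride0, stride1);
}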
View file
x265_2.7.tar.gz/source/common/yuv.h -> x265_2.9.tar.gz/source/common/yuv.h
Changed
@@ -38,7 +38,6 @@
 class Yuv
 {
 public:
-
    pixel*   m_buf[3];
    uint32_t m_size;
View file
x265_2.7.tar.gz/source/dynamicHDR10/SeiMetadataDictionary.cpp -> x265_2.9.tar.gz/source/dynamicHDR10/SeiMetadataDictionary.cpp
Changed
@@ -34,6 +34,7 @@
 const std::string BezierCurveNames::NumberOfAnchors = std::string("NumberOfAnchors");
 const std::string BezierCurveNames::KneePointX = std::string("KneePointX");
 const std::string BezierCurveNames::KneePointY = std::string("KneePointY");
+const std::string BezierCurveNames::AnchorsTag = std::string("Anchors");
 const std::string BezierCurveNames::Anchors[] = {std::string("Anchor0"),
                                                  std::string("Anchor1"),
                                                  std::string("Anchor2"),
@@ -69,6 +70,8 @@
 const std::string PercentileNames::TagName = std::string("PercentileLuminance");
 const std::string PercentileNames::NumberOfPercentiles = std::string("NumberOfPercentiles");
+const std::string PercentileNames::DistributionIndex = std::string("DistributionIndex");
+const std::string PercentileNames::DistributionValues = std::string("DistributionValues");
 const std::string PercentileNames::PercentilePercentageValue[] = {std::string("PercentilePercentage0"),
                                                                   std::string("PercentilePercentage1"),
                                                                   std::string("PercentilePercentage2"),
@@ -104,7 +107,9 @@
 const std::string LuminanceNames::TagName = std::string("LuminanceParameters");
+const std::string LuminanceNames::LlcTagName = std::string("LuminanceDistributions");
 const std::string LuminanceNames::AverageRGB = std::string("AverageRGB");
+const std::string LuminanceNames::MaxSCL = std::string("MaxScl");
 const std::string LuminanceNames::MaxSCL0 = std::string("MaxScl0");
 const std::string LuminanceNames::MaxSCL1 = std::string("MaxScl1");
 const std::string LuminanceNames::MaxSCL2 = std::string("MaxScl2");
View file
x265_2.7.tar.gz/source/dynamicHDR10/SeiMetadataDictionary.h -> x265_2.9.tar.gz/source/dynamicHDR10/SeiMetadataDictionary.h
Changed
@@ -48,6 +48,7 @@
        static const std::string NumberOfAnchors;
        static const std::string KneePointX;
        static const std::string KneePointY;
+       static const std::string AnchorsTag;
        static const std::string Anchors[14];
    };
    //Ellipse Selection Data
@@ -79,6 +80,8 @@
    public:
        static const std::string TagName;
        static const std::string NumberOfPercentiles;
+       static const std::string DistributionIndex;
+       static const std::string DistributionValues;
        static const std::string PercentilePercentageValue[15];
        static const std::string PercentileLuminanceValue[15];
    };
@@ -87,7 +90,9 @@
    {
    public:
        static const std::string TagName;
+       static const std::string LlcTagName;
        static const std::string AverageRGB;
+       static const std::string MaxSCL;
        static const std::string MaxSCL0;
        static const std::string MaxSCL1;
        static const std::string MaxSCL2;
View file
x265_2.7.tar.gz/source/dynamicHDR10/metadataFromJson.cpp -> x265_2.9.tar.gz/source/dynamicHDR10/metadataFromJson.cpp
Changed
@@ -46,89 +46,133 @@ int mCurrentStreamBit; int mCurrentStreamByte; - bool luminanceParamFromJson(const Json &data, LuminanceParameters &obj) + bool luminanceParamFromJson(const Json &data, LuminanceParameters &obj, const JsonType jsonType) { JsonObject lumJsonData = data.object_items(); if(!lumJsonData.empty()) { - JsonObject percentileData = lumJsonData[PercentileNames::TagName].object_items(); - obj.order = percentileData[PercentileNames::NumberOfPercentiles].int_value(); - - obj.averageLuminance = static_cast<float>(lumJsonData[LuminanceNames::AverageRGB].number_value()); - obj.maxRLuminance = static_cast<float>(lumJsonData[LuminanceNames::MaxSCL0].number_value()); - obj.maxGLuminance = static_cast<float>(lumJsonData[LuminanceNames::MaxSCL1].number_value()); - obj.maxBLuminance = static_cast<float>(lumJsonData[LuminanceNames::MaxSCL2].number_value()); - - if(!percentileData.empty()) - { - obj.percentiles.resize(obj.order); - for(int i = 0; i < obj.order; ++i) - { - std::string percentileTag = PercentileNames::TagName; - percentileTag += std::to_string(i); - obj.percentiles[i] = static_cast<unsigned int>(percentileData[percentileTag].int_value()); - } - } - - return true; - } - return false; - } - - bool percentagesFromJson(const Json &data, std::vector<unsigned int> &percentages) - { - JsonObject jsonData = data.object_items(); - if(!jsonData.empty()) - { - JsonObject percentileData = jsonData[PercentileNames::TagName].object_items(); - int order = percentileData[PercentileNames::NumberOfPercentiles].int_value(); - - percentages.resize(order); - for(int i = 0; i < order; ++i) - { - std::string percentileTag = PercentileNames::PercentilePercentageValue[i]; - percentages[i] = static_cast<unsigned int>(percentileData[percentileTag].int_value()); - } - - return true; - } + switch(jsonType) + { + case LEGACY: + { + obj.averageLuminance = static_cast<float>(lumJsonData[LuminanceNames::AverageRGB].number_value()); + obj.maxRLuminance = static_cast<float>(lumJsonData[LuminanceNames::MaxSCL0].number_value()); + obj.maxGLuminance = static_cast<float>(lumJsonData[LuminanceNames::MaxSCL1].number_value()); + obj.maxBLuminance = static_cast<float>(lumJsonData[LuminanceNames::MaxSCL2].number_value()); + + JsonObject percentileData = lumJsonData[PercentileNames::TagName].object_items(); + obj.order = percentileData[PercentileNames::NumberOfPercentiles].int_value(); + if(!percentileData.empty()) + { + obj.percentiles.resize(obj.order); + for(int i = 0; i < obj.order; ++i) + { + std::string percentileTag = PercentileNames::TagName; + percentileTag += std::to_string(i); + obj.percentiles[i] = static_cast<unsigned int>(percentileData[percentileTag].int_value()); + } + } + return true; + } break; + case LLC: + { + obj.averageLuminance = static_cast<float>(lumJsonData[LuminanceNames::AverageRGB].number_value()); + JsonArray maxScl = lumJsonData[LuminanceNames::MaxSCL].array_items(); + obj.maxRLuminance = static_cast<float>(maxScl[0].number_value()); + obj.maxGLuminance = static_cast<float>(maxScl[1].number_value()); + obj.maxBLuminance = static_cast<float>(maxScl[2].number_value()); + + JsonObject percentileData = lumJsonData[LuminanceNames::LlcTagName].object_items(); + if(!percentileData.empty()) + { + JsonArray distributionValues = percentileData[PercentileNames::DistributionValues].array_items(); + obj.order = static_cast<int>(distributionValues.size()); + obj.percentiles.resize(obj.order); + for(int i = 0; i < obj.order; ++i) + { + obj.percentiles[i] = static_cast<unsigned 
int>(distributionValues[i].int_value()); + } + } + return true; + } break; + } + } return false; } - bool percentagesFromJson(const Json &data, unsigned int *percentages) + bool percentagesFromJson(const Json &data, std::vector<unsigned int> &percentages, const JsonType jsonType) { JsonObject jsonData = data.object_items(); if(!jsonData.empty()) { - JsonObject percentileData = jsonData[PercentileNames::TagName].object_items(); - int order = percentileData[PercentileNames::NumberOfPercentiles].int_value(); - - for(int i = 0; i < order; ++i) - { - std::string percentileTag = PercentileNames::PercentilePercentageValue[i]; - percentages[i] = static_cast<unsigned int>(percentileData[percentileTag].int_value()); - } + switch(jsonType) + { + case LEGACY: + { + JsonObject percentileData = jsonData[PercentileNames::TagName].object_items(); + int order = percentileData[PercentileNames::NumberOfPercentiles].int_value(); + percentages.resize(order); + for(int i = 0; i < order; ++i) + { + std::string percentileTag = PercentileNames::PercentilePercentageValue[i]; + percentages[i] = static_cast<unsigned int>(percentileData[percentileTag].int_value()); + } + return true; + } break; + case LLC: + { + JsonObject percentileData = jsonData[LuminanceNames::LlcTagName].object_items(); + if(!percentileData.empty()) + { + JsonArray percentageValues = percentileData[PercentileNames::DistributionIndex].array_items(); + int order = static_cast<int>(percentageValues.size()); + percentages.resize(order); + for(int i = 0; i < order; ++i) + { + percentages[i] = static_cast<unsigned int>(percentageValues[i].int_value()); + } + } + return true; + } break; + } - return true; } return false; } - bool bezierCurveFromJson(const Json &data, BezierCurveData &obj) + bool bezierCurveFromJson(const Json &data, BezierCurveData &obj, const JsonType jsonType) { JsonObject jsonData = data.object_items(); if(!jsonData.empty()) { - obj.order = jsonData[BezierCurveNames::NumberOfAnchors].int_value(); - obj.coeff.resize(obj.order); - obj.sPx = jsonData[BezierCurveNames::KneePointX].int_value(); - obj.sPy = jsonData[BezierCurveNames::KneePointY].int_value(); - for(int i = 0; i < obj.order; ++i) - { - obj.coeff[i] = jsonData[BezierCurveNames::Anchors[i]].int_value(); - } - - return true; + switch(jsonType) + { + case LEGACY: + { + obj.sPx = jsonData[BezierCurveNames::KneePointX].int_value(); + obj.sPy = jsonData[BezierCurveNames::KneePointY].int_value(); + obj.order = jsonData[BezierCurveNames::NumberOfAnchors].int_value(); + obj.coeff.resize(obj.order); + for(int i = 0; i < obj.order; ++i) + { + obj.coeff[i] = jsonData[BezierCurveNames::Anchors[i]].int_value(); + } + return true; + } break; + case LLC: + { + obj.sPx = jsonData[BezierCurveNames::KneePointX].int_value(); + obj.sPy = jsonData[BezierCurveNames::KneePointY].int_value(); + JsonArray anchorValues = data[BezierCurveNames::AnchorsTag].array_items(); + obj.order = static_cast<int>(anchorValues.size()); + obj.coeff.resize(obj.order); + for(int i = 0; i < obj.order; ++i) + { + obj.coeff[i] = anchorValues[i].int_value(); + } + return true; + } break; + } } return false; } @@ -162,9 +206,7 @@ void setPayloadSize(uint8_t *dataStream, int positionOnStream, int payload) { int payloadBytes = 1; - for(;payload >= 0xFF; payload -= 0xFF, ++payloadBytes); - if(payloadBytes > 1) { shiftData(dataStream, payloadBytes-1, mCurrentStreamByte, positionOnStream); @@ -196,8 +238,6 @@ } } -// const std::string LocalParameters = std::string("LocalParameters"); -// const std::string TargetDisplayLuminance 
= std::string("TargetedSystemDisplayMaximumLuminance"); }; metadataFromJson::metadataFromJson() : @@ -211,17 +251,17 @@ delete mPimpl; } - bool metadataFromJson::frameMetadataFromJson(const char* filePath, int frame, uint8_t *&metadata) { std::string path(filePath); JsonArray fileData = JsonHelper::readJsonArray(path); - + JsonType jsonType = LEGACY; if(fileData.empty()) { - return false; + jsonType = LLC; + fileData = JsonHelper::readJson(filePath).at("SceneInfo").array_items(); } // frame = frame + 1; //index on the array start at 0 frames starts at 1 @@ -233,7 +273,6 @@ } int mSEIBytesToRead = 509; - if(metadata) { delete(metadata); @@ -241,13 +280,9 @@ metadata = new uint8_t[mSEIBytesToRead]; mPimpl->mCurrentStreamBit = 8; mPimpl->mCurrentStreamByte = 1; + memset(metadata, 0, mSEIBytesToRead); - for(int j = 0; j < mSEIBytesToRead; ++j) - { - (metadata)[j] = 0; - } - - fillMetadataArray(fileData, frame, metadata); + fillMetadataArray(fileData, frame, jsonType, metadata); mPimpl->setPayloadSize(metadata, 0, mPimpl->mCurrentStreamByte); return true; } @@ -256,9 +291,11 @@ { std::string path(filePath); JsonArray fileData = JsonHelper::readJsonArray(path); + JsonType jsonType = LEGACY; if (fileData.empty()) { - return -1; + jsonType = LLC; + fileData = JsonHelper::readJson(filePath).at("SceneInfo").array_items(); } int numFrames = static_cast<int>(fileData.size()); @@ -266,17 +303,12 @@ for (int frame = 0; frame < numFrames; ++frame) { metadata[frame] = new uint8_t[509]; - for (int i = 0; i < 509; ++i) - { - metadata[frame][i] = 0; - } + memset(metadata[frame], 0, 509); mPimpl->mCurrentStreamBit = 8; mPimpl->mCurrentStreamByte = 1; - fillMetadataArray(fileData, frame, metadata[frame]); - + fillMetadataArray(fileData, frame, jsonType, metadata[frame]); mPimpl->setPayloadSize(metadata[frame], 0, mPimpl->mCurrentStreamByte); - } return numFrames; @@ -321,7 +353,7 @@ /* NOTE: We leave TWO BYTES of space for the payload */ mPimpl->mCurrentStreamByte += 2; - fillMetadataArray(fileData, frame, metadata); + fillMetadataArray(fileData, frame, LEGACY, metadata); /* Set payload in bytes 2 & 3 as indicated in Extended InfoFrame Type syntax */ metadata[2] = (mPimpl->mCurrentStreamByte & 0xFF00) >> 8; @@ -331,7 +363,7 @@ int metadataFromJson::movieExtendedInfoFrameMetadataFromJson(const char* filePath, uint8_t **&metadata) { - std::string path(filePath); + std::string path(filePath); JsonArray fileData = JsonHelper::readJsonArray(path); if(fileData.empty()) { @@ -344,9 +376,9 @@ { metadata[frame] = new uint8_t[509]; for(int i = 0; i < 509; ++i) - { - metadata[frame][i] = 0; - } + { + metadata[frame][i] = 0; + } mPimpl->mCurrentStreamBit = 8; mPimpl->mCurrentStreamByte = 0; @@ -356,7 +388,7 @@ /* NOTE: We leave TWO BYTES of space for the payload */ mPimpl->mCurrentStreamByte += 2; - fillMetadataArray(fileData, frame, metadata[frame]); + fillMetadataArray(fileData, frame, LEGACY, metadata[frame]); /* Set payload in bytes 2 & 3 as indicated in Extended InfoFrame Type syntax */ metadata[frame][2] = (mPimpl->mCurrentStreamByte & 0xFF00) >> 8; @@ -366,7 +398,7 @@ return numFrames; } -void metadataFromJson::fillMetadataArray(const JsonArray &fileData, int frame, uint8_t *&metadata) +void metadataFromJson::fillMetadataArray(const JsonArray &fileData, int frame, const JsonType jsonType, uint8_t *&metadata) { const uint8_t countryCode = 0xB5; const uint16_t terminalProviderCode = 0x003C; @@ -381,57 +413,68 @@ mPimpl->appendBits(metadata, applicationIdentifier, 8); mPimpl->appendBits(metadata, applicationVersion, 
8); - //Note: Validated only add up to two local selections, ignore the rest - JsonArray jsonArray = fileData[frame][JsonDataKeys::LocalParameters].array_items(); - int ellipsesNum = static_cast<int>(jsonArray.size() > 2 ? 2 : jsonArray.size()); - uint16_t numWindows = (uint16_t)fileData[frame][JsonDataKeys::NumberOfWindows].int_value(); - mPimpl->appendBits(metadata, numWindows, 2); - for (int i = 0; i < ellipsesNum; ++i) + uint16_t numWindows = 0; + /* HDR10+ LLC doesn't consider local windows */ + if(jsonType & LLC) + { + numWindows = 1; + mPimpl->appendBits(metadata, numWindows, 2); + } + else { - mPimpl->appendBits(metadata, jsonArray[i][EllipseSelectionNames::WindowData] - [EllipseSelectionNames::WindowUpperLeftCornerX].int_value(), 16); - mPimpl->appendBits(metadata, jsonArray[i][EllipseSelectionNames::WindowData] - [EllipseSelectionNames::WindowUpperLeftCornerY].int_value(), 16); - mPimpl->appendBits(metadata, jsonArray[i][EllipseSelectionNames::WindowData] - [EllipseSelectionNames::WindowLowerRightCornerX].int_value(), 16); - mPimpl->appendBits(metadata, jsonArray[i][EllipseSelectionNames::WindowData] - [EllipseSelectionNames::WindowLowerRightCornerY].int_value(), 16); + //Note: Validated only add up to two local selections, ignore the rest + JsonArray jsonArray = fileData[frame][JsonDataKeys::LocalParameters].array_items(); + int ellipsesNum = static_cast<int>(jsonArray.size() > 2 ? 2 : jsonArray.size()); + numWindows = (uint16_t)fileData[frame][JsonDataKeys::NumberOfWindows].int_value(); + mPimpl->appendBits(metadata, numWindows, 2); + for (int i = 0; i < ellipsesNum; ++i) + { + mPimpl->appendBits(metadata, jsonArray[i][EllipseSelectionNames::WindowData] + [EllipseSelectionNames::WindowUpperLeftCornerX].int_value(), 16); + mPimpl->appendBits(metadata, jsonArray[i][EllipseSelectionNames::WindowData] + [EllipseSelectionNames::WindowUpperLeftCornerY].int_value(), 16); + mPimpl->appendBits(metadata, jsonArray[i][EllipseSelectionNames::WindowData] + [EllipseSelectionNames::WindowLowerRightCornerX].int_value(), 16); + mPimpl->appendBits(metadata, jsonArray[i][EllipseSelectionNames::WindowData] + [EllipseSelectionNames::WindowLowerRightCornerY].int_value(), 16); - JsonObject ellipseJsonObject = jsonArray[i][EllipseNames::TagName].object_items(); + JsonObject ellipseJsonObject = jsonArray[i][EllipseNames::TagName].object_items(); - mPimpl->appendBits(metadata, - static_cast<uint16_t>(ellipseJsonObject[EllipseNames::CenterOfEllipseX].int_value()), - 16); + mPimpl->appendBits(metadata, + static_cast<uint16_t>(ellipseJsonObject[EllipseNames::CenterOfEllipseX].int_value()), + 16); - mPimpl->appendBits(metadata, - static_cast<uint16_t>(ellipseJsonObject[EllipseNames::CenterOfEllipseY].int_value()), - 16); + mPimpl->appendBits(metadata, + static_cast<uint16_t>(ellipseJsonObject[EllipseNames::CenterOfEllipseY].int_value()), + 16); - int angle = ellipseJsonObject[EllipseNames::RotationAngle].int_value(); - uint8_t rotationAngle = static_cast<uint8_t>((angle > 180.0) ? angle - 180.0 : angle); - mPimpl->appendBits(metadata, rotationAngle, 8); + int angle = ellipseJsonObject[EllipseNames::RotationAngle].int_value(); + uint8_t rotationAngle = static_cast<uint8_t>((angle > 180.0) ? 
angle - 180.0 : angle); + mPimpl->appendBits(metadata, rotationAngle, 8); - uint16_t semimajorExternalAxis = - static_cast<uint16_t>(ellipseJsonObject[EllipseNames::SemiMajorAxisExternalEllipse].int_value()); + uint16_t semimajorExternalAxis = + static_cast<uint16_t>(ellipseJsonObject[EllipseNames::SemiMajorAxisExternalEllipse].int_value()); - uint16_t semiminorExternalAxis = - static_cast<uint16_t>(ellipseJsonObject[EllipseNames::SemiMinorAxisExternalEllipse].int_value()); + uint16_t semiminorExternalAxis = + static_cast<uint16_t>(ellipseJsonObject[EllipseNames::SemiMinorAxisExternalEllipse].int_value()); - uint16_t semimajorInternalEllipse = - static_cast<uint16_t>(ellipseJsonObject[EllipseNames::SemiMajorAxisInternalEllipse].int_value()); + uint16_t semimajorInternalEllipse = + static_cast<uint16_t>(ellipseJsonObject[EllipseNames::SemiMajorAxisInternalEllipse].int_value()); - mPimpl->appendBits(metadata, semimajorInternalEllipse, 16); + mPimpl->appendBits(metadata, semimajorInternalEllipse, 16); - mPimpl->appendBits(metadata, semimajorExternalAxis, 16); - mPimpl->appendBits(metadata, semiminorExternalAxis, 16); - uint8_t overlapProcessOption = static_cast<uint8_t>(ellipseJsonObject[EllipseNames::OverlapProcessOption].int_value()); - //TODO: Uses Layering method, the value is "1" - mPimpl->appendBits(metadata, overlapProcessOption, 1); + mPimpl->appendBits(metadata, semimajorExternalAxis, 16); + mPimpl->appendBits(metadata, semiminorExternalAxis, 16); + uint8_t overlapProcessOption = static_cast<uint8_t>(ellipseJsonObject[EllipseNames::OverlapProcessOption].int_value()); + //TODO: Uses Layering method, the value is "1" + mPimpl->appendBits(metadata, overlapProcessOption, 1); + } } + /* Targeted System Display Data */ - uint32_t monitorPeak = fileData[frame][JsonDataKeys::TargetDisplayLuminance].int_value(); //500; + uint32_t monitorPeak = fileData[frame][JsonDataKeys::TargetDisplayLuminance].int_value(); mPimpl->appendBits(metadata, monitorPeak, 27); - //NOTE: Set as false for now, as requested + uint8_t targetedSystemDisplayActualPeakLuminanceFlag = 0; mPimpl->appendBits(metadata, targetedSystemDisplayActualPeakLuminanceFlag, 1); if (targetedSystemDisplayActualPeakLuminanceFlag) @@ -439,21 +482,20 @@ //TODO } - /* Max rgb values (maxScl)*/ + /* Max RGB values (maxScl)*/ /* Luminance values/percentile for each window */ for (int w = 0; w < numWindows; ++w) { Json lumObj = fileData[frame][LuminanceNames::TagName]; LuminanceParameters luminanceData; - if (!mPimpl->luminanceParamFromJson(lumObj, luminanceData)) + if(!mPimpl->luminanceParamFromJson(lumObj, luminanceData, jsonType)) { std::cout << "error parsing luminance parameters frame: " << w << std::endl; } - /* NOTE: Maxscl from 0 t 100,000 based on data that says in values of 0.00001 + /* NOTE: Maxscl from 0 to 100,000 based on data that says in values of 0.00001 * one for each channel R,G,B */ - mPimpl->appendBits(metadata, static_cast<uint8_t>(((int)luminanceData.maxRLuminance & 0x10000) >> 16), 1); mPimpl->appendBits(metadata, static_cast<uint16_t>((int)luminanceData.maxRLuminance & 0xFFFF), 16); mPimpl->appendBits(metadata, static_cast<uint8_t>(((int)luminanceData.maxGLuminance & 0x10000) >> 16), 1); @@ -467,11 +509,12 @@ uint8_t numDistributionMaxrgbPercentiles = static_cast<uint8_t>(luminanceData.order); mPimpl->appendBits(metadata, numDistributionMaxrgbPercentiles, 4); - std::vector<unsigned int>percentilPercentages; - mPimpl->percentagesFromJson(lumObj, percentilPercentages); + std::vector<unsigned int>percentilePercentages; + 
mPimpl->percentagesFromJson(lumObj, percentilePercentages, jsonType); + for (int i = 0; i < numDistributionMaxrgbPercentiles; ++i) { - uint8_t distributionMaxrgbPercentage = static_cast<uint8_t>(percentilPercentages.at(i)); + uint8_t distributionMaxrgbPercentage = static_cast<uint8_t>(percentilePercentages.at(i)); mPimpl->appendBits(metadata, distributionMaxrgbPercentage, 7); /* 17bits: 1bit then 16 */ @@ -483,7 +526,7 @@ } /* 10bits: Fraction bright pixels */ - uint16_t fractionBrightPixels = 1; + uint16_t fractionBrightPixels = 0; mPimpl->appendBits(metadata, fractionBrightPixels, 10); } @@ -498,24 +541,24 @@ /* Bezier Curve Data */ for (int w = 0; w < numWindows; ++w) { - uint8_t toneMappingFlag = 1; + uint8_t toneMappingFlag = 0; /* Check if the window contains tone mapping bezier curve data and set toneMappingFlag appropriately */ - //Json bezierData = fileData[frame][BezierCurveNames::TagName]; BezierCurveData curveData; /* Select curve data based on global window */ if (w == 0) - { - if (!mPimpl->bezierCurveFromJson(fileData[frame][BezierCurveNames::TagName], curveData)) + { + if (mPimpl->bezierCurveFromJson(fileData[frame][BezierCurveNames::TagName], curveData, jsonType)) { - toneMappingFlag = 0; + toneMappingFlag = 1; } } - /* Select curve data based on local window */ + /* Select curve data based on local window */ else { - if (!mPimpl->bezierCurveFromJson(jsonArray[w - 1][BezierCurveNames::TagName], curveData)) + JsonArray jsonArray = fileData[frame][JsonDataKeys::LocalParameters].array_items(); + if (mPimpl->bezierCurveFromJson(jsonArray[w - 1][BezierCurveNames::TagName], curveData, jsonType)) { - toneMappingFlag = 0; + toneMappingFlag = 1; } } mPimpl->appendBits(metadata, toneMappingFlag, 1);
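The diff above threads a JsonType selector through luminanceParamFromJson(), percentagesFromJson() and bezierCurveFromJson() so one parser handles both the legacy per-frame layout and HDR10+ LLC (version 1) files, which wrap scenes in a top-level "SceneInfo" array and carry MaxSCL and the percentile distribution as arrays. Below is a minimal side-by-side of the two shapes as C++ string literals; only "SceneInfo" appears verbatim in the diff, so the other tag spellings are assumptions inferred from the identifier names and the values are invented.

    // Illustrative only: the two per-frame JSON shapes the parser now accepts.
    // Tag names are assumptions based on identifiers in the diff; values are made up.
    #include <iostream>
    #include <string>

    int main()
    {
        // Legacy layout: one scalar per colour channel, percentiles keyed by index,
        // order taken from a NumberOfPercentiles count.
        const std::string legacy = R"({
            "LuminanceParameters": {
                "AverageRGB": 401.2,
                "MaxSCL0": 612.0, "MaxSCL1": 598.5, "MaxSCL2": 580.1,
                "Percentile": { "NumberOfPercentiles": 3,
                                "Percentile0": 1, "Percentile1": 14, "Percentile2": 98 }
            }
        })";

        // HDR10+ LLC layout: scenes live under a top-level "SceneInfo" array,
        // MaxSCL is a 3-element array, and the percentile order is derived from
        // the size of the distribution arrays instead of a count field.
        const std::string llc = R"({
            "SceneInfo": [ {
                "LuminanceParameters": {
                    "AverageRGB": 401.2,
                    "MaxSCL": [ 612.0, 598.5, 580.1 ],
                    "LuminanceDistributions": {
                        "DistributionIndex":  [ 1, 25, 50, 75, 99 ],
                        "DistributionValues": [ 3, 62, 98, 143, 529 ]
                    }
                }
            } ]
        })";

        std::cout << legacy << "\n\n" << llc << "\n";
        return 0;
    }

The LEGACY branch keeps reading MaxSCL0/1/2 and NumberOfPercentiles, while the LLC branch takes the order from the array sizes, which is exactly the split visible in the switch statements above.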
View file
x265_2.7.tar.gz/source/dynamicHDR10/metadataFromJson.h -> x265_2.9.tar.gz/source/dynamicHDR10/metadataFromJson.h
Changed
@@ -26,7 +26,7 @@ #define METADATAFROMJSON_H #include<stdint.h> -#include "string" +#include<cstring> #include "JsonHelper.h" class metadataFromJson @@ -36,6 +36,11 @@ metadataFromJson(); ~metadataFromJson(); + enum JsonType{ + LEGACY, + LLC + }; + /** * @brief frameMetadataFromJson: Generates a sigle frame metadata array from Json file with all @@ -98,7 +103,7 @@ class DynamicMetaIO; DynamicMetaIO *mPimpl; - void fillMetadataArray(const JsonArray &fileData, int frame, uint8_t *&metadata); + void fillMetadataArray(const JsonArray &fileData, int frame, const JsonType jsonType, uint8_t *&metadata); }; #endif // METADATAFROMJSON_H
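The header now exposes the JsonType enum (LEGACY, LLC) and passes it through fillMetadataArray(); callers still go through frameMetadataFromJson(), which falls back to LLC parsing when the file is not a plain top-level array. The sketch below is a hypothetical caller, not taken from the x265 sources, and assumes the returned buffer (509 bytes per the .cpp diff) stays owned by the caller after a successful call.

    // Hypothetical caller: convert one frame of an HDR10+ JSON file into the
    // packed SEI payload produced by fillMetadataArray().
    #include <cstdint>
    #include "metadataFromJson.h"

    int main()
    {
        metadataFromJson converter;
        uint8_t* sei = nullptr;   // the callee allocates 509 bytes per the .cpp diff
        bool ok = converter.frameMetadataFromJson("scene_metadata.json", /*frame=*/0, sei);
        if (ok)
        {
            // sei now starts with the country/provider codes followed by the HDR10+
            // payload bits; hand it to the encoder's user-SEI path.
        }
        delete[] sei;             // assumption: the last buffer is freed by the caller
        return 0;
    }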
View file
x265_2.7.tar.gz/source/encoder/analysis.cpp -> x265_2.9.tar.gz/source/encoder/analysis.cpp
Changed
@@ -37,7 +37,7 @@ using namespace X265_NS; /* An explanation of rate distortion levels (--rd-level) - * + * * rd-level 0 generates no recon per CU (NO RDO or Quant) * * sa8d selection between merge / skip / inter / intra and split @@ -187,27 +187,24 @@ for (uint32_t i = 0; i < cuGeom.numPartitions; i++) ctu.m_log2CUSize[i] = (uint8_t)m_param->maxLog2CUSize - ctu.m_cuDepth[i]; } - if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead) + if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && (m_slice->m_sliceType != I_SLICE)) { - m_multipassAnalysis = (analysis2PassFrameData*)m_frame->m_analysis2Pass.analysisFramedata; - m_multipassDepth = &m_multipassAnalysis->depth[ctu.m_cuAddr * ctu.m_numPartitions]; - if (m_slice->m_sliceType != I_SLICE) + int numPredDir = m_slice->isInterP() ? 1 : 2; + m_reuseInterDataCTU = m_frame->m_analysisData.interData; + for (int dir = 0; dir < numPredDir; dir++) { - int numPredDir = m_slice->isInterP() ? 1 : 2; - for (int dir = 0; dir < numPredDir; dir++) - { - m_multipassMv[dir] = &m_multipassAnalysis->m_mv[dir][ctu.m_cuAddr * ctu.m_numPartitions]; - m_multipassMvpIdx[dir] = &m_multipassAnalysis->mvpIdx[dir][ctu.m_cuAddr * ctu.m_numPartitions]; - m_multipassRef[dir] = &m_multipassAnalysis->ref[dir][ctu.m_cuAddr * ctu.m_numPartitions]; - } - m_multipassModes = &m_multipassAnalysis->modes[ctu.m_cuAddr * ctu.m_numPartitions]; + m_reuseMv[dir] = &m_reuseInterDataCTU->mv[dir][ctu.m_cuAddr * ctu.m_numPartitions]; + m_reuseMvpIdx[dir] = &m_reuseInterDataCTU->mvpIdx[dir][ctu.m_cuAddr * ctu.m_numPartitions]; } + m_reuseRef = &m_reuseInterDataCTU->ref[ctu.m_cuAddr * ctu.m_numPartitions]; + m_reuseModes = &m_reuseInterDataCTU->modes[ctu.m_cuAddr * ctu.m_numPartitions]; + m_reuseDepth = &m_reuseInterDataCTU->depth[ctu.m_cuAddr * ctu.m_numPartitions]; } - + if ((m_param->analysisSave || m_param->analysisLoad) && m_slice->m_sliceType != I_SLICE && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel < 10) { int numPredDir = m_slice->isInterP() ? 
1 : 2; - m_reuseInterDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData; + m_reuseInterDataCTU = m_frame->m_analysisData.interData; m_reuseRef = &m_reuseInterDataCTU->ref [ctu.m_cuAddr * X265_MAX_PRED_MODE_PER_CTU * numPredDir]; m_reuseDepth = &m_reuseInterDataCTU->depth[ctu.m_cuAddr * ctu.m_numPartitions]; m_reuseModes = &m_reuseInterDataCTU->modes[ctu.m_cuAddr * ctu.m_numPartitions]; @@ -224,7 +221,7 @@ if (m_slice->m_sliceType == I_SLICE) { - analysis_intra_data* intraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData; + x265_analysis_intra_data* intraDataCTU = m_frame->m_analysisData.intraData; if (m_param->analysisLoad && m_param->analysisReuseLevel > 1) { memcpy(ctu.m_cuDepth, &intraDataCTU->depth[ctu.m_cuAddr * numPartition], sizeof(uint8_t) * numPartition); @@ -243,7 +240,7 @@ if (bCopyAnalysis) { - analysis_inter_data* interDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData; + x265_analysis_inter_data* interDataCTU = m_frame->m_analysisData.interData; int posCTU = ctu.m_cuAddr * numPartition; memcpy(ctu.m_cuDepth, &interDataCTU->depth[posCTU], sizeof(uint8_t) * numPartition); memcpy(ctu.m_predMode, &interDataCTU->modes[posCTU], sizeof(uint8_t) * numPartition); @@ -253,7 +250,7 @@ if ((m_slice->m_sliceType == P_SLICE || m_param->bIntraInBFrames) && !m_param->bMVType) { - analysis_intra_data* intraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData; + x265_analysis_intra_data* intraDataCTU = m_frame->m_analysisData.intraData; memcpy(ctu.m_lumaIntraDir, &intraDataCTU->modes[posCTU], sizeof(uint8_t) * numPartition); memcpy(ctu.m_chromaIntraDir, &intraDataCTU->chromaModes[posCTU], sizeof(uint8_t) * numPartition); } @@ -279,14 +276,14 @@ } else if ((m_param->analysisLoad && m_param->analysisReuseLevel == 10) || ((m_param->bMVType == AVC_INFO) && m_param->analysisReuseLevel >= 7 && ctu.m_numPartitions <= 16)) { - analysis_inter_data* interDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData; + x265_analysis_inter_data* interDataCTU = m_frame->m_analysisData.interData; int posCTU = ctu.m_cuAddr * numPartition; memcpy(ctu.m_cuDepth, &interDataCTU->depth[posCTU], sizeof(uint8_t) * numPartition); memcpy(ctu.m_predMode, &interDataCTU->modes[posCTU], sizeof(uint8_t) * numPartition); memcpy(ctu.m_partSize, &interDataCTU->partSize[posCTU], sizeof(uint8_t) * numPartition); if ((m_slice->m_sliceType == P_SLICE || m_param->bIntraInBFrames) && !(m_param->bMVType == AVC_INFO)) { - analysis_intra_data* intraDataCTU = (analysis_intra_data*)m_frame->m_analysisData.intraData; + x265_analysis_intra_data* intraDataCTU = m_frame->m_analysisData.intraData; memcpy(ctu.m_lumaIntraDir, &intraDataCTU->modes[posCTU], sizeof(uint8_t) * numPartition); memcpy(ctu.m_chromaIntraDir, &intraDataCTU->chromaModes[posCTU], sizeof(uint8_t) * numPartition); } @@ -518,19 +515,20 @@ bool mightSplit = !(cuGeom.flags & CUGeom::LEAF); bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY); - bool bAlreadyDecided = parentCTU.m_lumaIntraDir[cuGeom.absPartIdx] != (uint8_t)ALL_IDX; - bool bDecidedDepth = parentCTU.m_cuDepth[cuGeom.absPartIdx] == depth; + bool bAlreadyDecided = m_param->intraRefine != 4 && parentCTU.m_lumaIntraDir[cuGeom.absPartIdx] != (uint8_t)ALL_IDX; + bool bDecidedDepth = m_param->intraRefine != 4 && parentCTU.m_cuDepth[cuGeom.absPartIdx] == depth; int split = 0; - if (m_param->intraRefine) + if (m_param->intraRefine && m_param->intraRefine != 4) { - split = ((cuGeom.log2CUSize == (uint32_t)(g_log2Size[m_param->minCUSize] + 1)) && 
bDecidedDepth); + split = m_param->scaleFactor && bDecidedDepth && (!mightNotSplit || + ((cuGeom.log2CUSize == (uint32_t)(g_log2Size[m_param->minCUSize] + 1)))); if (cuGeom.log2CUSize == (uint32_t)(g_log2Size[m_param->minCUSize]) && !bDecidedDepth) bAlreadyDecided = false; } if (bAlreadyDecided) { - if (bDecidedDepth) + if (bDecidedDepth && mightNotSplit) { Mode& mode = md.pred[0]; md.bestMode = &mode; @@ -1184,7 +1182,7 @@ if (m_evaluateInter) { - if (m_param->interRefine == 2) + if (m_refineLevel == 2) { if (parentCTU.m_predMode[cuGeom.absPartIdx] == MODE_SKIP) skipModes = true; @@ -1283,11 +1281,11 @@ } } } - if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_multipassAnalysis) + if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_reuseInterDataCTU) { - if (mightNotSplit && depth == m_multipassDepth[cuGeom.absPartIdx]) + if (mightNotSplit && depth == m_reuseDepth[cuGeom.absPartIdx]) { - if (m_multipassModes[cuGeom.absPartIdx] == MODE_SKIP) + if (m_reuseModes[cuGeom.absPartIdx] == MODE_SKIP) { md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); @@ -1307,7 +1305,7 @@ md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom); if (m_param->rdLevel) - skipModes = (m_param->bEnableEarlySkip || m_param->interRefine == 2) + skipModes = (m_param->bEnableEarlySkip || m_refineLevel == 2) && md.bestMode && md.bestMode->cu.isSkipped(0); // TODO: sa8d threshold per depth } if (md.bestMode && m_param->bEnableRecursionSkip && !bCtuInfoCheck && !(m_param->bMVType && m_param->analysisReuseLevel == 7 && (m_modeFlag[0] || m_modeFlag[1]))) @@ -1874,7 +1872,7 @@ if (m_evaluateInter) { - if (m_param->interRefine == 2) + if (m_refineLevel == 2) { if (parentCTU.m_predMode[cuGeom.absPartIdx] == MODE_SKIP) skipModes = true; @@ -1976,11 +1974,11 @@ } } - if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_multipassAnalysis) + if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_reuseInterDataCTU) { - if (mightNotSplit && depth == m_multipassDepth[cuGeom.absPartIdx]) + if (mightNotSplit && depth == m_reuseDepth[cuGeom.absPartIdx]) { - if (m_multipassModes[cuGeom.absPartIdx] == MODE_SKIP) + if (m_reuseModes[cuGeom.absPartIdx] == MODE_SKIP) { md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); @@ -2004,7 +2002,7 @@ md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom); - skipModes = (m_param->bEnableEarlySkip || m_param->interRefine == 2) && + skipModes = (m_param->bEnableEarlySkip || m_refineLevel == 2) && md.bestMode && !md.bestMode->cu.getQtRootCbf(0); refMasks[0] = allSplitRefs; md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); @@ -2413,9 +2411,18 @@ bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY); bool bDecidedDepth = parentCTU.m_cuDepth[cuGeom.absPartIdx] == depth; - int split = (m_param->interRefine && cuGeom.log2CUSize == (uint32_t)(g_log2Size[m_param->minCUSize] + 1) && bDecidedDepth); + TrainingData td; + td.init(parentCTU, cuGeom); - if (bDecidedDepth) + if (!m_param->bDynamicRefine) + m_refineLevel = m_param->interRefine; + else + m_refineLevel = m_frame->m_classifyFrame ? 
1 : 3; + int split = (m_param->scaleFactor && bDecidedDepth && (!mightNotSplit || + (m_refineLevel && cuGeom.log2CUSize == (uint32_t)(g_log2Size[m_param->minCUSize] + 1)))); + td.split = split; + + if (bDecidedDepth && mightNotSplit) { setLambdaFromQP(parentCTU, qp, lqp); @@ -2423,39 +2430,44 @@ md.bestMode = &mode; mode.cu.initSubCU(parentCTU, cuGeom, qp); PartSize size = (PartSize)parentCTU.m_partSize[cuGeom.absPartIdx]; - if (parentCTU.isIntra(cuGeom.absPartIdx) && m_param->interRefine < 2) + if (parentCTU.isIntra(cuGeom.absPartIdx) && m_refineLevel < 2) { - bool reuseModes = !((m_param->intraRefine == 3) || - (m_param->intraRefine == 2 && parentCTU.m_lumaIntraDir[cuGeom.absPartIdx] > DC_IDX)); - if (reuseModes) + if (m_param->intraRefine == 4) + compressIntraCU(parentCTU, cuGeom, qp); + else { - memcpy(mode.cu.m_lumaIntraDir, parentCTU.m_lumaIntraDir + cuGeom.absPartIdx, cuGeom.numPartitions); - memcpy(mode.cu.m_chromaIntraDir, parentCTU.m_chromaIntraDir + cuGeom.absPartIdx, cuGeom.numPartitions); + bool reuseModes = !((m_param->intraRefine == 3) || + (m_param->intraRefine == 2 && parentCTU.m_lumaIntraDir[cuGeom.absPartIdx] > DC_IDX)); + if (reuseModes) + { + memcpy(mode.cu.m_lumaIntraDir, parentCTU.m_lumaIntraDir + cuGeom.absPartIdx, cuGeom.numPartitions); + memcpy(mode.cu.m_chromaIntraDir, parentCTU.m_chromaIntraDir + cuGeom.absPartIdx, cuGeom.numPartitions); + } + checkIntra(mode, cuGeom, size); } - checkIntra(mode, cuGeom, size); } - else if (!parentCTU.isIntra(cuGeom.absPartIdx) && m_param->interRefine < 2) + else if (!parentCTU.isIntra(cuGeom.absPartIdx) && m_refineLevel < 2) { mode.cu.copyFromPic(parentCTU, cuGeom, m_csp, false); uint32_t numPU = parentCTU.getNumPartInter(cuGeom.absPartIdx); for (uint32_t part = 0; part < numPU; part++) { PredictionUnit pu(mode.cu, cuGeom, part); - if (m_param->analysisReuseLevel >= 7) + if ((m_param->analysisLoad && m_param->analysisReuseLevel == 10) || (m_param->bMVType == AVC_INFO && m_param->analysisReuseLevel >= 7)) { - analysis_inter_data* interDataCTU = (analysis_inter_data*)m_frame->m_analysisData.interData; + x265_analysis_inter_data* interDataCTU = m_frame->m_analysisData.interData; int cuIdx = (mode.cu.m_cuAddr * parentCTU.m_numPartitions) + cuGeom.absPartIdx; mode.cu.m_mergeFlag[pu.puAbsPartIdx] = interDataCTU->mergeFlag[cuIdx + part]; mode.cu.setPUInterDir(interDataCTU->interDir[cuIdx + part], pu.puAbsPartIdx, part); for (int list = 0; list < m_slice->isInterB() + 1; list++) { - mode.cu.setPUMv(list, interDataCTU->mv[list][cuIdx + part], pu.puAbsPartIdx, part); + mode.cu.setPUMv(list, interDataCTU->mv[list][cuIdx + part].word, pu.puAbsPartIdx, part); mode.cu.setPURefIdx(list, interDataCTU->refIdx[list][cuIdx + part], pu.puAbsPartIdx, part); mode.cu.m_mvpIdx[list][pu.puAbsPartIdx] = interDataCTU->mvpIdx[list][cuIdx + part]; } if (!mode.cu.m_mergeFlag[pu.puAbsPartIdx]) { - if (m_param->mvRefine) + if (m_param->mvRefine || m_param->interRefine == 1) m_me.setSourcePU(*mode.fencYuv, pu.ctuAddr, pu.cuAbsPartIdx, pu.puAbsPartIdx, pu.width, pu.height, m_param->searchMethod, m_param->subpelRefine, false); //AMVP MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 2]; @@ -2465,23 +2477,37 @@ int ref = mode.cu.m_refIdx[list][pu.puAbsPartIdx]; if (ref == -1) continue; - mode.cu.getPMV(mode.interNeighbours, list, ref, mode.amvpCand[list][ref], mvc); - MV mvp = mode.amvpCand[list][ref][mode.cu.m_mvpIdx[list][pu.puAbsPartIdx]]; - if (m_param->mvRefine) + MV mvp; + + int numMvc = mode.cu.getPMV(mode.interNeighbours, list, ref, mode.amvpCand[list][ref], mvc); + if 
(m_param->interRefine != 1) + mvp = mode.amvpCand[list][ref][mode.cu.m_mvpIdx[list][pu.puAbsPartIdx]]; + else + mvp = interDataCTU->mv[list][cuIdx + part].word; + if (m_param->mvRefine || m_param->interRefine == 1) { MV outmv; - searchMV(mode, pu, list, ref, outmv); + searchMV(mode, pu, list, ref, outmv, mvp, numMvc, mvc); mode.cu.setPUMv(list, outmv, pu.puAbsPartIdx, part); } - mode.cu.m_mvd[list][pu.puAbsPartIdx] = mode.cu.m_mv[list][pu.puAbsPartIdx] - mvp; + mode.cu.m_mvd[list][pu.puAbsPartIdx] = mode.cu.m_mv[list][pu.puAbsPartIdx] - mode.amvpCand[list][ref][mode.cu.m_mvpIdx[list][pu.puAbsPartIdx]]/*mvp*/; } } - else if(m_param->scaleFactor) + else { MVField candMvField[MRG_MAX_NUM_CANDS][2]; // double length for mv of both lists uint8_t candDir[MRG_MAX_NUM_CANDS]; mode.cu.getInterMergeCandidates(pu.puAbsPartIdx, part, candMvField, candDir); uint8_t mvpIdx = mode.cu.m_mvpIdx[0][pu.puAbsPartIdx]; + if (mode.cu.isBipredRestriction()) + { + /* do not allow bidir merge candidates if PU is smaller than 8x8, drop L1 reference */ + if (candDir[mvpIdx] == 3) + { + candDir[mvpIdx] = 1; + candMvField[mvpIdx][1].refIdx = REF_NOT_VALID; + } + } mode.cu.setPUInterDir(candDir[mvpIdx], pu.puAbsPartIdx, part); mode.cu.setPUMv(0, candMvField[mvpIdx][0].mv, pu.puAbsPartIdx, part); mode.cu.setPUMv(1, candMvField[mvpIdx][1].mv, pu.puAbsPartIdx, part); @@ -2491,7 +2517,7 @@ } motionCompensation(mode.cu, pu, mode.predYuv, true, (m_csp != X265_CSP_I400 && m_frame->m_fencPic->m_picCsp != X265_CSP_I400)); } - if (!m_param->interRefine && parentCTU.isSkipped(cuGeom.absPartIdx)) + if (!m_param->interRefine && !m_param->bDynamicRefine && parentCTU.isSkipped(cuGeom.absPartIdx)) encodeResAndCalcRdSkipCU(mode); else encodeResAndCalcRdInterCU(mode, cuGeom); @@ -2502,7 +2528,7 @@ checkDQP(mode, cuGeom); } - if (m_param->interRefine < 2) + if (m_refineLevel < 2) { if (m_bTryLossless) tryLossless(cuGeom); @@ -2530,7 +2556,10 @@ } } - if (m_param->interRefine > 1 || (m_param->interRefine && parentCTU.m_predMode[cuGeom.absPartIdx] == MODE_SKIP && !mode.cu.isSkipped(0))) + if (m_param->bDynamicRefine) + classifyCU(parentCTU,cuGeom, *md.bestMode, td); + + if (m_refineLevel > 1 || (m_refineLevel && parentCTU.m_predMode[cuGeom.absPartIdx] == MODE_SKIP && !mode.cu.isSkipped(0))) { m_evaluateInter = 1; m_param->rdLevel > 4 ? 
compressInterCU_rd5_6(parentCTU, cuGeom, qp) : compressInterCU_rd0_4(parentCTU, cuGeom, qp); @@ -2589,7 +2618,7 @@ else updateModeCost(*splitPred); - if (m_param->interRefine) + if (m_refineLevel) { if (m_param->rdLevel > 1) checkBestMode(*splitPred, cuGeom.depth); @@ -2603,6 +2632,89 @@ md.bestMode->cu.copyToPic(depth); md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, parentCTU.m_cuAddr, cuGeom.absPartIdx); } + if (m_param->bDynamicRefine && bDecidedDepth) + trainCU(parentCTU, cuGeom, *md.bestMode, td); +} + +void Analysis::classifyCU(const CUData& ctu, const CUGeom& cuGeom, const Mode& bestMode, TrainingData& trainData) +{ + uint32_t depth = cuGeom.depth; + trainData.cuVariance = calculateCUVariance(ctu, cuGeom); + if (m_frame->m_classifyFrame) + { + uint64_t diffRefine[X265_REFINE_INTER_LEVELS]; + uint64_t diffRefineRd[X265_REFINE_INTER_LEVELS]; + float probRefine[X265_REFINE_INTER_LEVELS] = { 0 }; + uint8_t varRefineLevel = 1; + uint8_t rdRefineLevel = 1; + uint64_t cuCost = bestMode.rdCost; + int offset = (depth * X265_REFINE_INTER_LEVELS); + if (cuCost < m_frame->m_classifyRd[offset]) + m_refineLevel = 1; + else + { + uint64_t trainingCount = 0; + for (uint8_t i = 0; i < X265_REFINE_INTER_LEVELS; i++) + { + offset = (depth * X265_REFINE_INTER_LEVELS) + i; + trainingCount += m_frame->m_classifyCount[offset]; + } + for (uint8_t i = 0; i < X265_REFINE_INTER_LEVELS; i++) + { + offset = (depth * X265_REFINE_INTER_LEVELS) + i; + /* Calculate distance values */ + diffRefine[i] = abs((int64_t)(trainData.cuVariance - m_frame->m_classifyVariance[offset])); + diffRefineRd[i] = abs((int64_t)(cuCost - m_frame->m_classifyRd[offset])); + + /* Calculate prior probability - ranges between 0 and 1 */ + if (trainingCount) + probRefine[i] = ((float)m_frame->m_classifyCount[offset] / (float)trainingCount); + + /* Bayesian classification - P(c|x)P(x) = P(x|c)P(c) + P(c|x) is the posterior probability of class given predictor. + P(c) is the prior probability of class. + P(x|c) is the likelihood which is the probability of predictor given class. + P(x) is the prior probability of predictor.*/ + int curRefineLevel = m_refineLevel - 1; + if ((diffRefine[i] * probRefine[curRefineLevel]) < (diffRefine[curRefineLevel] * probRefine[i])) + varRefineLevel = i + 1; + if ((diffRefineRd[i] * probRefine[curRefineLevel]) < (diffRefineRd[curRefineLevel] * probRefine[i])) + rdRefineLevel = i + 1; + } + m_refineLevel = X265_MAX(varRefineLevel, rdRefineLevel); + } + } +} + +void Analysis::trainCU(const CUData& ctu, const CUGeom& cuGeom, const Mode& bestMode, TrainingData& trainData) +{ + uint32_t depth = cuGeom.depth; + int classify = 1; + if (!m_frame->m_classifyFrame) + { + /* classify = 1 : CUs for which the save data matches with that after encoding with refine-inter 3 + and CUs that has split. + classify = 2 : CUs which are encoded as simple modes (Skip/Merge/2Nx2N). + classify = 3 : CUs encoded as any other mode. 
*/ + + bool refineInter0 = (trainData.predMode == ctu.m_predMode[cuGeom.absPartIdx] && + trainData.partSize == ctu.m_partSize[cuGeom.absPartIdx] && + trainData.mergeFlag == ctu.m_mergeFlag[cuGeom.absPartIdx]); + bool refineInter1 = (depth == m_param->maxCUDepth - 1) && trainData.split; + if (refineInter0 || refineInter1) + classify = 1; + else if (trainData.partSize == SIZE_2Nx2N && trainData.partSize == ctu.m_partSize[cuGeom.absPartIdx]) + classify = 2; + else + classify = 3; + } + else + classify = m_refineLevel; + uint64_t cuCost = bestMode.rdCost; + int offset = (depth * X265_REFINE_INTER_LEVELS) + classify - 1; + ctu.m_collectCURd[offset] += cuCost; + ctu.m_collectCUVariance[offset] += trainData.cuVariance; + ctu.m_collectCUCount[offset]++; } /* sets md.bestMode if a valid merge candidate is found, else leaves it NULL */ @@ -2900,7 +3012,7 @@ } } - if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_multipassAnalysis) + if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_reuseInterDataCTU) { uint32_t numPU = interMode.cu.getNumPartInter(0); for (uint32_t part = 0; part < numPU; part++) @@ -2908,9 +3020,10 @@ MotionData* bestME = interMode.bestME[part]; for (int32_t i = 0; i < numPredDir; i++) { - bestME[i].ref = m_multipassRef[i][cuGeom.absPartIdx]; - bestME[i].mv = m_multipassMv[i][cuGeom.absPartIdx]; - bestME[i].mvpIdx = m_multipassMvpIdx[i][cuGeom.absPartIdx]; + int* ref = &m_reuseRef[i * m_frame->m_analysisData.numPartitions * m_frame->m_analysisData.numCUsInFrame]; + bestME[i].ref = ref[cuGeom.absPartIdx]; + bestME[i].mv = m_reuseMv[i][cuGeom.absPartIdx].word; + bestME[i].mvpIdx = m_reuseMvpIdx[i][cuGeom.absPartIdx]; } } } @@ -2964,7 +3077,7 @@ } } - if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_multipassAnalysis) + if (m_param->analysisMultiPassRefine && m_param->rc.bStatRead && m_reuseInterDataCTU) { uint32_t numPU = interMode.cu.getNumPartInter(0); for (uint32_t part = 0; part < numPU; part++) @@ -2972,9 +3085,10 @@ MotionData* bestME = interMode.bestME[part]; for (int32_t i = 0; i < numPredDir; i++) { - bestME[i].ref = m_multipassRef[i][cuGeom.absPartIdx]; - bestME[i].mv = m_multipassMv[i][cuGeom.absPartIdx]; - bestME[i].mvpIdx = m_multipassMvpIdx[i][cuGeom.absPartIdx]; + int* ref = &m_reuseRef[i * m_frame->m_analysisData.numPartitions * m_frame->m_analysisData.numCUsInFrame]; + bestME[i].ref = ref[cuGeom.absPartIdx]; + bestME[i].mv = m_reuseMv[i][cuGeom.absPartIdx].word; + bestME[i].mvpIdx = m_reuseMvpIdx[i][cuGeom.absPartIdx]; } } } @@ -3092,11 +3206,9 @@ pixel *fref0 = m_slice->m_mref[0][ref0].getLumaAddr(pu.ctuAddr, pu.cuAbsPartIdx); pixel *fref1 = m_slice->m_mref[1][ref1].getLumaAddr(pu.ctuAddr, pu.cuAbsPartIdx); intptr_t refStride = m_slice->m_mref[0][0].lumaStride; - - primitives.pu[partEnum].pixelavg_pp(tmpPredYuv.m_buf[0], tmpPredYuv.m_size, fref0, refStride, fref1, refStride, 32); + primitives.pu[partEnum].pixelavg_pp[(tmpPredYuv.m_size % 64 == 0) && (refStride % 64 == 0)](tmpPredYuv.m_buf[0], tmpPredYuv.m_size, fref0, refStride, fref1, refStride, 32); zsa8d = primitives.cu[partEnum].sa8d(fencYuv.m_buf[0], fencYuv.m_size, tmpPredYuv.m_buf[0], tmpPredYuv.m_size); } - uint32_t bits0 = bestME[0].bits - m_me.bitcost(bestME[0].mv, mvp0) + m_me.bitcost(mvzero, mvp0); uint32_t bits1 = bestME[1].bits - m_me.bitcost(bestME[1].mv, mvp1) + m_me.bitcost(mvzero, mvp1); uint32_t zcost = zsa8d + m_rdCost.getCost(bits0) + m_rdCost.getCost(bits1); @@ -3221,8 +3333,12 @@ * resiYuv. 
Generate the recon pixels by adding it to the prediction */ if (cu.m_cbf[0][0]) - primitives.cu[sizeIdx].add_ps(reconPic.getLumaAddr(cu.m_cuAddr, absPartIdx), reconPic.m_stride, - predY, resiYuv.m_buf[0], predYuv.m_size, resiYuv.m_size); + { + bool reconPicAlign = (reconPic.m_cuOffsetY[cu.m_cuAddr] + reconPic.m_buOffsetY[absPartIdx]) % 64 == 0; + bool predYalign = predYuv.getAddrOffset(absPartIdx, predYuv.m_size) % 64 == 0; + primitives.cu[sizeIdx].add_ps[reconPicAlign && predYalign && (reconPic.m_stride % 64 == 0) && (predYuv.m_size % 64 == 0) && + (resiYuv.m_size % 64 == 0)](reconPic.getLumaAddr(cu.m_cuAddr, absPartIdx), reconPic.m_stride, predY, resiYuv.m_buf[0], predYuv.m_size, resiYuv.m_size); + } else primitives.cu[sizeIdx].copy_pp(reconPic.getLumaAddr(cu.m_cuAddr, absPartIdx), reconPic.m_stride, predY, predYuv.m_size); @@ -3230,16 +3346,24 @@ { pixel* predU = predYuv.getCbAddr(absPartIdx); pixel* predV = predYuv.getCrAddr(absPartIdx); - if (cu.m_cbf[1][0]) - primitives.chroma[m_csp].cu[sizeIdx].add_ps(reconPic.getCbAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC, - predU, resiYuv.m_buf[1], predYuv.m_csize, resiYuv.m_csize); + if (cu.m_cbf[1][0]) + { + bool reconPicAlign = (reconPic.m_cuOffsetC[cu.m_cuAddr] + reconPic.m_buOffsetC[absPartIdx]) % 64 == 0; + bool predUalign = predYuv.getChromaAddrOffset(absPartIdx) % 64 == 0; + primitives.chroma[m_csp].cu[sizeIdx].add_ps[reconPicAlign && predUalign && (reconPic.m_strideC % 64 == 0) && (predYuv.m_csize % 64 == 0) && + (resiYuv.m_csize % 64 == 0)](reconPic.getCbAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC, predU, resiYuv.m_buf[1], predYuv.m_csize, resiYuv.m_csize); + } else primitives.chroma[m_csp].cu[sizeIdx].copy_pp(reconPic.getCbAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC, predU, predYuv.m_csize); if (cu.m_cbf[2][0]) - primitives.chroma[m_csp].cu[sizeIdx].add_ps(reconPic.getCrAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC, - predV, resiYuv.m_buf[2], predYuv.m_csize, resiYuv.m_csize); + { + bool reconPicAlign = (reconPic.m_cuOffsetC[cu.m_cuAddr] + reconPic.m_buOffsetC[absPartIdx]) % 64 == 0; + bool predValign = predYuv.getChromaAddrOffset(absPartIdx) % 64 == 0; + primitives.chroma[m_csp].cu[sizeIdx].add_ps[reconPicAlign && predValign && (reconPic.m_strideC % 64 == 0) && (predYuv.m_csize % 64 == 0) && + (resiYuv.m_csize % 64 == 0)](reconPic.getCrAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC, predV, resiYuv.m_buf[2], predYuv.m_csize, resiYuv.m_csize); + } else primitives.chroma[m_csp].cu[sizeIdx].copy_pp(reconPic.getCrAddr(cu.m_cuAddr, absPartIdx), reconPic.m_strideC, predV, predYuv.m_csize); @@ -3404,6 +3528,33 @@ return false; } +uint32_t Analysis::calculateCUVariance(const CUData& ctu, const CUGeom& cuGeom) +{ + uint32_t cuVariance = 0; + uint32_t *blockVariance = m_frame->m_lowres.blockVariance; + int loopIncr = (m_param->rc.qgSize == 8) ? 
8 : 16; + + uint32_t width = m_frame->m_fencPic->m_picWidth; + uint32_t height = m_frame->m_fencPic->m_picHeight; + uint32_t block_x = ctu.m_cuPelX + g_zscanToPelX[cuGeom.absPartIdx]; + uint32_t block_y = ctu.m_cuPelY + g_zscanToPelY[cuGeom.absPartIdx]; + uint32_t maxCols = (m_frame->m_fencPic->m_picWidth + (loopIncr - 1)) / loopIncr; + uint32_t blockSize = m_param->maxCUSize >> cuGeom.depth; + uint32_t cnt = 0; + + for (uint32_t block_yy = block_y; block_yy < block_y + blockSize && block_yy < height; block_yy += loopIncr) + { + for (uint32_t block_xx = block_x; block_xx < block_x + blockSize && block_xx < width; block_xx += loopIncr) + { + uint32_t idx = ((block_yy / loopIncr) * (maxCols)) + (block_xx / loopIncr); + cuVariance += blockVariance[idx]; + cnt++; + } + } + + return cuVariance / cnt; +} + int Analysis::calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom, int32_t complexCheck, double baseQp) { FrameData& curEncData = *m_frame->m_encData; @@ -3411,24 +3562,18 @@ if (m_param->analysisMultiPassDistortion && m_param->rc.bStatRead) { - m_multipassAnalysis = (analysis2PassFrameData*)m_frame->m_analysis2Pass.analysisFramedata; - if ((m_multipassAnalysis->threshold[ctu.m_cuAddr] < 0.9 || m_multipassAnalysis->threshold[ctu.m_cuAddr] > 1.1) - && m_multipassAnalysis->highDistortionCtuCount && m_multipassAnalysis->lowDistortionCtuCount) - qp += m_multipassAnalysis->offset[ctu.m_cuAddr]; + x265_analysis_distortion_data* distortionData = m_frame->m_analysisData.distortionData; + if ((distortionData->threshold[ctu.m_cuAddr] < 0.9 || distortionData->threshold[ctu.m_cuAddr] > 1.1) + && distortionData->highDistortionCtuCount && distortionData->lowDistortionCtuCount) + qp += distortionData->offset[ctu.m_cuAddr]; } - int loopIncr; - if (m_param->rc.qgSize == 8) - loopIncr = 8; - else - loopIncr = 16; + int loopIncr = (m_param->rc.qgSize == 8) ? 8 : 16; + /* Use cuTree offsets if cuTree enabled and frame is referenced, else use AQ offsets */ bool isReferenced = IS_REFERENCED(m_frame); - double *qpoffs; - if (complexCheck) - qpoffs = m_frame->m_lowres.qpAqOffset; - else - qpoffs = (isReferenced && m_param->rc.cuTree) ? m_frame->m_lowres.qpCuTreeOffset : m_frame->m_lowres.qpAqOffset; + double *qpoffs = (isReferenced && m_param->rc.cuTree && !complexCheck) ? m_frame->m_lowres.qpCuTreeOffset : + m_frame->m_lowres.qpAqOffset; if (qpoffs) { uint32_t width = m_frame->m_fencPic->m_picWidth; @@ -3439,13 +3584,11 @@ uint32_t blockSize = m_param->maxCUSize >> cuGeom.depth; double qp_offset = 0; uint32_t cnt = 0; - uint32_t idx; - for (uint32_t block_yy = block_y; block_yy < block_y + blockSize && block_yy < height; block_yy += loopIncr) { for (uint32_t block_xx = block_x; block_xx < block_x + blockSize && block_xx < width; block_xx += loopIncr) { - idx = ((block_yy / loopIncr) * (maxCols)) + (block_xx / loopIncr); + uint32_t idx = ((block_yy / loopIncr) * (maxCols)) + (block_xx / loopIncr); qp_offset += qpoffs[idx]; cnt++; } @@ -3458,10 +3601,7 @@ int32_t offset = (int32_t)(qp_offset * 100 + .5); double threshold = (1 - ((x265_ADAPT_RD_STRENGTH - m_param->dynamicRd) * 0.5)); int32_t max_threshold = (int32_t)(threshold * 100 + .5); - if (offset < max_threshold) - return 1; - else - return 0; + return (offset < max_threshold); } }
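The analysis.cpp changes above add the --dynamic-refine machinery: trainCU() accumulates per-depth RD cost, variance and hit counts, and classifyCU() then picks a refine level by comparing the current CU's statistics against those trained means with a naive-Bayes-style test. The following is a standalone sketch of that comparison with invented numbers and the encoder types stripped out; the real code evaluates the variance and RD distances separately and keeps the larger of the two picks.

    // Standalone sketch of classifyCU()'s Bayes-style pick, with invented numbers.
    #include <cstdint>
    #include <cstdlib>
    #include <iostream>

    int main()
    {
        const int LEVELS = 3;                                // stands in for X265_REFINE_INTER_LEVELS
        uint64_t classVariance[LEVELS] = { 120, 480, 900 };  // trained per-level variance means
        uint64_t classCount[LEVELS]    = { 70, 20, 10 };     // how often each level won during training
        uint64_t trainingCount = classCount[0] + classCount[1] + classCount[2];

        uint64_t cuVariance = 500;   // statistic of the CU being classified
        int current = 1;             // level assumed before classification (m_refineLevel)
        int chosen  = 1;

        for (int i = 0; i < LEVELS; i++)
        {
            uint64_t diffI   = std::llabs((int64_t)(cuVariance - classVariance[i]));
            uint64_t diffCur = std::llabs((int64_t)(cuVariance - classVariance[current - 1]));
            float probI   = (float)classCount[i] / trainingCount;
            float probCur = (float)classCount[current - 1] / trainingCount;
            // Same comparison as the diff: level i wins when its distance, scaled by
            // the competing prior, is smaller (posterior ~ likelihood * prior).
            if (diffI * probCur < diffCur * probI)
                chosen = i + 1;
        }
        std::cout << "variance-based refine level: " << chosen << "\n";
        // classifyCU() repeats this with the RD cost and takes the max of the two picks.
        return 0;
    }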
View file
x265_2.7.tar.gz/source/encoder/analysis.h -> x265_2.9.tar.gz/source/encoder/analysis.h
Changed
@@ -123,27 +123,42 @@ protected: /* Analysis data for save/load mode, writes/reads data based on absPartIdx */ - analysis_inter_data* m_reuseInterDataCTU; - int32_t* m_reuseRef; - uint8_t* m_reuseDepth; - uint8_t* m_reuseModes; - uint8_t* m_reusePartSize; - uint8_t* m_reuseMergeFlag; + x265_analysis_inter_data* m_reuseInterDataCTU; + int32_t* m_reuseRef; + uint8_t* m_reuseDepth; + uint8_t* m_reuseModes; + uint8_t* m_reusePartSize; + uint8_t* m_reuseMergeFlag; + x265_analysis_MV* m_reuseMv[2]; + uint8_t* m_reuseMvpIdx[2]; uint32_t m_splitRefIdx[4]; uint64_t* cacheCost; - - analysis2PassFrameData* m_multipassAnalysis; - uint8_t* m_multipassDepth; - MV* m_multipassMv[2]; - int* m_multipassMvpIdx[2]; - int32_t* m_multipassRef[2]; - uint8_t* m_multipassModes; - uint8_t m_evaluateInter; + int32_t m_refineLevel; + uint8_t* m_additionalCtuInfo; int* m_prevCtuInfoChange; + + struct TrainingData + { + uint32_t cuVariance; + uint8_t predMode; + uint8_t partSize; + uint8_t mergeFlag; + int split; + + void init(const CUData& parentCTU, const CUGeom& cuGeom) + { + cuVariance = 0; + predMode = parentCTU.m_predMode[cuGeom.absPartIdx]; + partSize = parentCTU.m_partSize[cuGeom.absPartIdx]; + mergeFlag = parentCTU.m_mergeFlag[cuGeom.absPartIdx]; + split = 0; + } + }; + /* refine RD based on QP for rd-levels 5 and 6 */ void qprdRefine(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp, int32_t lqp); @@ -182,6 +197,10 @@ void encodeResidue(const CUData& parentCTU, const CUGeom& cuGeom); int calculateQpforCuSize(const CUData& ctu, const CUGeom& cuGeom, int32_t complexCheck = 0, double baseQP = -1); + uint32_t calculateCUVariance(const CUData& ctu, const CUGeom& cuGeom); + + void classifyCU(const CUData& ctu, const CUGeom& cuGeom, const Mode& bestMode, TrainingData& trainData); + void trainCU(const CUData& ctu, const CUGeom& cuGeom, const Mode& bestMode, TrainingData& trainData); void calculateNormFactor(CUData& ctu, int qp); void normFactor(const pixel* src, uint32_t blockSize, CUData& ctu, int qp, TextType ttype);
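analysis.h above declares the TrainingData record and calculateCUVariance(), whose result feeds the classification shown earlier. Per the analysis.cpp diff, that variance is simply the mean of the lookahead blockVariance entries covered by the CU, walked on an 8- or 16-pixel grid depending on --qg-size. A self-contained sketch with made-up geometry:

    // Sketch of calculateCUVariance(): average the lookahead per-block variances
    // that fall inside the CU footprint. Picture and CU sizes are invented.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main()
    {
        const uint32_t picWidth = 64, picHeight = 64;
        const uint32_t loopIncr = 16;                               // 8 when --qg-size 8
        const uint32_t maxCols  = (picWidth + loopIncr - 1) / loopIncr;
        const uint32_t maxRows  = (picHeight + loopIncr - 1) / loopIncr;

        // Lookahead block variances, row-major; dummy values for the sketch.
        std::vector<uint32_t> blockVariance(maxCols * maxRows, 100);
        blockVariance[1] = 400;                                     // pretend one block is busier

        uint32_t cuX = 0, cuY = 0, cuSize = 32;                     // 32x32 CU at the top-left
        uint64_t sum = 0, cnt = 0;
        for (uint32_t y = cuY; y < cuY + cuSize && y < picHeight; y += loopIncr)
            for (uint32_t x = cuX; x < cuX + cuSize && x < picWidth; x += loopIncr)
            {
                sum += blockVariance[(y / loopIncr) * maxCols + (x / loopIncr)];
                cnt++;
            }
        std::cout << "CU variance = " << (cnt ? sum / cnt : 0) << "\n";   // prints 175 here
        return 0;
    }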
View file
x265_2.7.tar.gz/source/encoder/api.cpp -> x265_2.9.tar.gz/source/encoder/api.cpp
Changed
@@ -31,6 +31,10 @@ #include "nal.h" #include "bitcost.h" +#if ENABLE_LIBVMAF +#include "libvmaf.h" +#endif + /* multilib namespace reflectors */ #if LINKED_8BIT namespace x265_8bit { @@ -274,10 +278,10 @@ pic_in->analysisData.wt = NULL; pic_in->analysisData.intraData = NULL; pic_in->analysisData.interData = NULL; - pic_in->analysis2Pass.analysisFramedata = NULL; + pic_in->analysisData.distortionData = NULL; } - if (pp_nal && numEncoded > 0) + if (pp_nal && numEncoded > 0 && encoder->m_outputCount >= encoder->m_latestParam->chunkStart) { *pp_nal = &encoder->m_nalList.m_nal[0]; if (pi_nal) *pi_nal = encoder->m_nalList.m_numNal; @@ -285,7 +289,7 @@ else if (pi_nal) *pi_nal = 0; - if (numEncoded && encoder->m_param->csvLogLevel) + if (numEncoded && encoder->m_param->csvLogLevel && encoder->m_outputCount >= encoder->m_latestParam->chunkStart) x265_csvlog_frame(encoder->m_param, pic_out); if (numEncoded < 0) @@ -302,13 +306,34 @@ encoder->fetchStats(outputStats, statsSizeBytes); } } +#if ENABLE_LIBVMAF +void x265_vmaf_encoder_log(x265_encoder* enc, int argc, char **argv, x265_param *param, x265_vmaf_data *vmafdata) +{ + if (enc) + { + Encoder *encoder = static_cast<Encoder*>(enc); + x265_stats stats; + stats.aggregateVmafScore = x265_calculate_vmafscore(param, vmafdata); + if(vmafdata->reference_file) + fclose(vmafdata->reference_file); + if(vmafdata->distorted_file) + fclose(vmafdata->distorted_file); + if(vmafdata) + x265_free(vmafdata); + encoder->fetchStats(&stats, sizeof(stats)); + int padx = encoder->m_sps.conformanceWindow.rightOffset; + int pady = encoder->m_sps.conformanceWindow.bottomOffset; + x265_csvlog_encode(encoder->m_param, &stats, padx, pady, argc, argv); + } +} +#endif void x265_encoder_log(x265_encoder* enc, int argc, char **argv) { if (enc) { Encoder *encoder = static_cast<Encoder*>(enc); - x265_stats stats; + x265_stats stats; encoder->fetchStats(&stats, sizeof(stats)); int padx = encoder->m_sps.conformanceWindow.rightOffset; int pady = encoder->m_sps.conformanceWindow.bottomOffset; @@ -378,6 +403,181 @@ return -1; } +void x265_alloc_analysis_data(x265_param *param, x265_analysis_data* analysis) +{ + x265_analysis_inter_data *interData = analysis->interData = NULL; + x265_analysis_intra_data *intraData = analysis->intraData = NULL; + x265_analysis_distortion_data *distortionData = analysis->distortionData = NULL; + bool isVbv = param->rc.vbvMaxBitrate > 0 && param->rc.vbvBufferSize > 0; + int numDir = 2; //irrespective of P or B slices set direction as 2 + uint32_t numPlanes = param->internalCsp == X265_CSP_I400 ? 1 : 3; + +#if X265_DEPTH < 10 && (LINKED_10BIT || LINKED_12BIT) + uint32_t numCUs_sse_t = param->internalBitDepth > 8 ? analysis->numCUsInFrame << 1 : analysis->numCUsInFrame; +#elif X265_DEPTH >= 10 && LINKED_8BIT + uint32_t numCUs_sse_t = param->internalBitDepth > 8 ? 
analysis->numCUsInFrame : (analysis->numCUsInFrame + 1U) >> 1; +#else + uint32_t numCUs_sse_t = analysis->numCUsInFrame; +#endif + + //Allocate memory for distortionData pointer + CHECKED_MALLOC_ZERO(distortionData, x265_analysis_distortion_data, 1); + CHECKED_MALLOC_ZERO(distortionData->distortion, sse_t, analysis->numPartitions * numCUs_sse_t); + if (param->rc.bStatRead) + { + CHECKED_MALLOC_ZERO(distortionData->ctuDistortion, sse_t, numCUs_sse_t); + CHECKED_MALLOC_ZERO(distortionData->scaledDistortion, double, analysis->numCUsInFrame); + CHECKED_MALLOC_ZERO(distortionData->offset, double, analysis->numCUsInFrame); + CHECKED_MALLOC_ZERO(distortionData->threshold, double, analysis->numCUsInFrame); + } + analysis->distortionData = distortionData; + + if (param->bDisableLookahead && isVbv) + { + CHECKED_MALLOC_ZERO(analysis->lookahead.intraSatdForVbv, uint32_t, analysis->numCuInHeight); + CHECKED_MALLOC_ZERO(analysis->lookahead.satdForVbv, uint32_t, analysis->numCuInHeight); + CHECKED_MALLOC_ZERO(analysis->lookahead.intraVbvCost, uint32_t, analysis->numCUsInFrame); + CHECKED_MALLOC_ZERO(analysis->lookahead.vbvCost, uint32_t, analysis->numCUsInFrame); + } + + //Allocate memory for weightParam pointer + if (!(param->bMVType == AVC_INFO)) + CHECKED_MALLOC_ZERO(analysis->wt, x265_weight_param, numPlanes * numDir); + + if (param->analysisReuseLevel < 2) + return; + + //Allocate memory for intraData pointer + CHECKED_MALLOC_ZERO(intraData, x265_analysis_intra_data, 1); + CHECKED_MALLOC(intraData->depth, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC(intraData->modes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC(intraData->partSizes, char, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC(intraData->chromaModes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + analysis->intraData = intraData; + + //Allocate memory for interData pointer based on ReuseLevels + CHECKED_MALLOC_ZERO(interData, x265_analysis_inter_data, 1); + CHECKED_MALLOC(interData->depth, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC(interData->modes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + + CHECKED_MALLOC_ZERO(interData->mvpIdx[0], uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC_ZERO(interData->mvpIdx[1], uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC_ZERO(interData->mv[0], x265_analysis_MV, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC_ZERO(interData->mv[1], x265_analysis_MV, analysis->numPartitions * analysis->numCUsInFrame); + + if (param->analysisReuseLevel > 4) + { + CHECKED_MALLOC(interData->partSize, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC_ZERO(interData->mergeFlag, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + } + if (param->analysisReuseLevel >= 7) + { + CHECKED_MALLOC(interData->interDir, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC(interData->sadCost, int64_t, analysis->numPartitions * analysis->numCUsInFrame); + for (int dir = 0; dir < numDir; dir++) + { + CHECKED_MALLOC(interData->refIdx[dir], int8_t, analysis->numPartitions * analysis->numCUsInFrame); + CHECKED_MALLOC_ZERO(analysis->modeFlag[dir], uint8_t, analysis->numPartitions * analysis->numCUsInFrame); + } + } + else + { + if (param->analysisMultiPassRefine || param->analysisMultiPassDistortion){ + CHECKED_MALLOC_ZERO(interData->ref, int32_t, 2 
* analysis->numPartitions * analysis->numCUsInFrame); + } + else + CHECKED_MALLOC_ZERO(interData->ref, int32_t, analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir); + } + analysis->interData = interData; + + return; + +fail: + x265_free_analysis_data(param, analysis); +} + +void x265_free_analysis_data(x265_param *param, x265_analysis_data* analysis) +{ + bool isVbv = param->rc.vbvMaxBitrate > 0 && param->rc.vbvBufferSize > 0; + + //Free memory for Lookahead pointers + if (param->bDisableLookahead && isVbv) + { + X265_FREE(analysis->lookahead.satdForVbv); + X265_FREE(analysis->lookahead.intraSatdForVbv); + X265_FREE(analysis->lookahead.vbvCost); + X265_FREE(analysis->lookahead.intraVbvCost); + } + + //Free memory for distortionData pointers + if (analysis->distortionData) + { + X265_FREE((analysis->distortionData)->distortion); + if (param->rc.bStatRead) + { + X265_FREE((analysis->distortionData)->ctuDistortion); + X265_FREE((analysis->distortionData)->scaledDistortion); + X265_FREE((analysis->distortionData)->offset); + X265_FREE((analysis->distortionData)->threshold); + } + X265_FREE(analysis->distortionData); + } + + /* Early exit freeing weights alone if level is 1 (when there is no analysis inter/intra) */ + if (analysis->wt && !(param->bMVType == AVC_INFO)) + X265_FREE(analysis->wt); + + if (param->analysisReuseLevel < 2) + return; + + //Free memory for intraData pointers + if (analysis->intraData) + { + X265_FREE((analysis->intraData)->depth); + X265_FREE((analysis->intraData)->modes); + X265_FREE((analysis->intraData)->partSizes); + X265_FREE((analysis->intraData)->chromaModes); + X265_FREE(analysis->intraData); + analysis->intraData = NULL; + } + + //Free interData pointers + if (analysis->interData) + { + X265_FREE((analysis->interData)->depth); + X265_FREE((analysis->interData)->modes); + X265_FREE((analysis->interData)->mvpIdx[0]); + X265_FREE((analysis->interData)->mvpIdx[1]); + X265_FREE((analysis->interData)->mv[0]); + X265_FREE((analysis->interData)->mv[1]); + + if (param->analysisReuseLevel > 4) + { + X265_FREE((analysis->interData)->mergeFlag); + X265_FREE((analysis->interData)->partSize); + } + if (param->analysisReuseLevel >= 7) + { + int numDir = 2; + X265_FREE((analysis->interData)->interDir); + X265_FREE((analysis->interData)->sadCost); + for (int dir = 0; dir < numDir; dir++) + { + X265_FREE((analysis->interData)->refIdx[dir]); + if (analysis->modeFlag[dir] != NULL) + { + X265_FREE(analysis->modeFlag[dir]); + analysis->modeFlag[dir] = NULL; + } + } + } + else + X265_FREE((analysis->interData)->ref); + X265_FREE(analysis->interData); + analysis->interData = NULL; + } +} + void x265_cleanup(void) { BitCost::destroy(); @@ -457,7 +657,13 @@ &x265_csvlog_frame, &x265_csvlog_encode, &x265_dither_image, - &x265_set_analysis_data + &x265_set_analysis_data, +#if ENABLE_LIBVMAF + &x265_calculate_vmafscore, + &x265_calculate_vmaf_framelevelscore, + &x265_vmaf_encoder_log +#endif + }; typedef const x265_api* (*api_get_func)(int bitDepth); @@ -675,7 +881,7 @@ if (param->rc.rateControlMode == X265_RC_CRF) fprintf(csvfp, "RateFactor, "); if (param->rc.vbvBufferSize) - fprintf(csvfp, "BufferFill, "); + fprintf(csvfp, "BufferFill, BufferFillFinal, "); if (param->bEnablePsnr) fprintf(csvfp, "Y PSNR, U PSNR, V PSNR, YUV PSNR, "); if (param->bEnableSsim) @@ -751,6 +957,9 @@ /* detailed performance statistics */ fprintf(csvfp, ", DecideWait (ms), Row0Wait (ms), Wall time (ms), Ref Wait Wall (ms), Total CTU time (ms)," "Stall Time (ms), Total frame time (ms), Avg WPP, Row 
Blocks"); +#if ENABLE_LIBVMAF + fprintf(csvfp, ", VMAF Frame Score"); +#endif } fprintf(csvfp, "\n"); } @@ -759,6 +968,9 @@ fputs(summaryCSVHeader, csvfp); if (param->csvLogLevel >= 2 || param->maxCLL || param->maxFALL) fputs("MaxCLL, MaxFALL,", csvfp); +#if ENABLE_LIBVMAF + fputs(" Aggregate VMAF Score,", csvfp); +#endif fputs(" Version\n", csvfp); } } @@ -780,7 +992,7 @@ if (param->rc.rateControlMode == X265_RC_CRF) fprintf(param->csvfpt, "%.3lf,", frameStats->rateFactor); if (param->rc.vbvBufferSize) - fprintf(param->csvfpt, "%.3lf,", frameStats->bufferFill); + fprintf(param->csvfpt, "%.3lf, %.3lf,", frameStats->bufferFill, frameStats->bufferFillFinal); if (param->bEnablePsnr) fprintf(param->csvfpt, "%.3lf, %.3lf, %.3lf, %.3lf,", frameStats->psnrY, frameStats->psnrU, frameStats->psnrV, frameStats->psnr); if (param->bEnableSsim) @@ -868,6 +1080,9 @@ frameStats->totalFrameTime); fprintf(param->csvfpt, " %.3lf, %d", frameStats->avgWPP, frameStats->countRowBlocks); +#if ENABLE_LIBVMAF + fprintf(param->csvfpt, ", %lf", frameStats->vmafFrameScore); +#endif } fprintf(param->csvfpt, "\n"); fflush(stderr); @@ -886,7 +1101,11 @@ fputs(summaryCSVHeader, p->csvfpt); if (p->csvLogLevel >= 2 || p->maxCLL || p->maxFALL) fputs("MaxCLL, MaxFALL,", p->csvfpt); +#if ENABLE_LIBVMAF + fputs(" Aggregate VMAF score,", p->csvfpt); +#endif fputs(" Version\n",p->csvfpt); + } // CLI arguments or other if (argc) @@ -907,6 +1126,7 @@ fputc('"', p->csvfpt); fputs(opts, p->csvfpt); fputc('"', p->csvfpt); + X265_FREE(opts); } } @@ -918,7 +1138,6 @@ char buffer[200]; strftime(buffer, 128, "%c", timeinfo); fprintf(p->csvfpt, ", %s, ", buffer); - // elapsed time, fps, bitrate fprintf(p->csvfpt, "%.2f, %.2f, %.2f,", stats->elapsedEncodeTime, stats->encodedPictureCount / stats->elapsedEncodeTime, stats->bitrate); @@ -980,7 +1199,11 @@ fprintf(p->csvfpt, " -, -, -, -, -, -, -,"); if (p->csvLogLevel >= 2 || p->maxCLL || p->maxFALL) fprintf(p->csvfpt, " %-6u, %-6u,", stats->maxCLL, stats->maxFALL); +#if ENABLE_LIBVMAF + fprintf(p->csvfpt, " %lf,", stats->aggregateVmafScore); +#endif fprintf(p->csvfpt, " %s\n", api->version_str); + } } @@ -1071,4 +1294,318 @@ } } +#if ENABLE_LIBVMAF +/* Read y values of single frame for 8-bit input */ +int read_image_byte(FILE *file, float *buf, int width, int height, int stride) +{ + char *byte_ptr = (char *)buf; + unsigned char *tmp_buf = 0; + int i, j; + int ret = 1; + + if (width <= 0 || height <= 0) + { + goto fail_or_end; + } + + if (!(tmp_buf = (unsigned char*)malloc(width))) + { + goto fail_or_end; + } + + for (i = 0; i < height; ++i) + { + float *row_ptr = (float *)byte_ptr; + + if (fread(tmp_buf, 1, width, file) != (size_t)width) + { + goto fail_or_end; + } + + for (j = 0; j < width; ++j) + { + row_ptr[j] = tmp_buf[j]; + } + + byte_ptr += stride; + } + + ret = 0; + +fail_or_end: + free(tmp_buf); + return ret; +} +/* Read y values of single frame for 10-bit input */ +int read_image_word(FILE *file, float *buf, int width, int height, int stride) +{ + char *byte_ptr = (char *)buf; + unsigned short *tmp_buf = 0; + int i, j; + int ret = 1; + + if (width <= 0 || height <= 0) + { + goto fail_or_end; + } + + if (!(tmp_buf = (unsigned short*)malloc(width * 2))) // '*2' to accommodate words + { + goto fail_or_end; + } + + for (i = 0; i < height; ++i) + { + float *row_ptr = (float *)byte_ptr; + + if (fread(tmp_buf, 2, width, file) != (size_t)width) // '2' for word + { + goto fail_or_end; + } + + for (j = 0; j < width; ++j) + { + row_ptr[j] = tmp_buf[j] / 4.0; // '/4' to convert from 10 to 8-bit 
+ } + + byte_ptr += stride; + } + + ret = 0; + +fail_or_end: + free(tmp_buf); + return ret; +} + +int read_frame(float *reference_data, float *distorted_data, float *temp_data, int stride_byte, void *s) +{ + x265_vmaf_data *user_data = (x265_vmaf_data *)s; + int ret; + + // read reference y + if (user_data->internalBitDepth == 8) + { + ret = read_image_byte(user_data->reference_file, reference_data, user_data->width, user_data->height, stride_byte); + } + else if (user_data->internalBitDepth == 10) + { + ret = read_image_word(user_data->reference_file, reference_data, user_data->width, user_data->height, stride_byte); + } + else + { + x265_log(NULL, X265_LOG_ERROR, "Invalid bitdepth\n"); + return 1; + } + if (ret) + { + if (feof(user_data->reference_file)) + { + ret = 2; // OK if end of file + } + return ret; + } + + // read distorted y + if (user_data->internalBitDepth == 8) + { + ret = read_image_byte(user_data->distorted_file, distorted_data, user_data->width, user_data->height, stride_byte); + } + else if (user_data->internalBitDepth == 10) + { + ret = read_image_word(user_data->distorted_file, distorted_data, user_data->width, user_data->height, stride_byte); + } + else + { + x265_log(NULL, X265_LOG_ERROR, "Invalid bitdepth\n"); + return 1; + } + if (ret) + { + if (feof(user_data->distorted_file)) + { + ret = 2; // OK if end of file + } + return ret; + } + + // reference skip u and v + if (user_data->internalBitDepth == 8) + { + if (fread(temp_data, 1, user_data->offset, user_data->reference_file) != (size_t)user_data->offset) + { + x265_log(NULL, X265_LOG_ERROR, "reference fread to skip u and v failed.\n"); + goto fail_or_end; + } + } + else if (user_data->internalBitDepth == 10) + { + if (fread(temp_data, 2, user_data->offset, user_data->reference_file) != (size_t)user_data->offset) + { + x265_log(NULL, X265_LOG_ERROR, "reference fread to skip u and v failed.\n"); + goto fail_or_end; + } + } + else + { + x265_log(NULL, X265_LOG_ERROR, "Invalid format\n"); + goto fail_or_end; + } + + // distorted skip u and v + if (user_data->internalBitDepth == 8) + { + if (fread(temp_data, 1, user_data->offset, user_data->distorted_file) != (size_t)user_data->offset) + { + x265_log(NULL, X265_LOG_ERROR, "distorted fread to skip u and v failed.\n"); + goto fail_or_end; + } + } + else if (user_data->internalBitDepth == 10) + { + if (fread(temp_data, 2, user_data->offset, user_data->distorted_file) != (size_t)user_data->offset) + { + x265_log(NULL, X265_LOG_ERROR, "distorted fread to skip u and v failed.\n"); + goto fail_or_end; + } + } + else + { + x265_log(NULL, X265_LOG_ERROR, "Invalid format\n"); + goto fail_or_end; + } + + +fail_or_end: + return ret; +} + +double x265_calculate_vmafscore(x265_param *param, x265_vmaf_data *data) +{ + double score; + + data->width = param->sourceWidth; + data->height = param->sourceHeight; + data->internalBitDepth = param->internalBitDepth; + + if (param->internalCsp == X265_CSP_I420) + { + if ((param->sourceWidth * param->sourceHeight) % 2 != 0) + x265_log(NULL, X265_LOG_ERROR, "Invalid file size\n"); + data->offset = param->sourceWidth * param->sourceHeight / 2; + } + else if (param->internalCsp == X265_CSP_I422) + data->offset = param->sourceWidth * param->sourceHeight; + else if (param->internalCsp == X265_CSP_I444) + data->offset = param->sourceWidth * param->sourceHeight * 2; + else + x265_log(NULL, X265_LOG_ERROR, "Invalid format\n"); + + compute_vmaf(&score, vcd->format, data->width, data->height, read_frame, data, vcd->model_path, vcd->log_path, 
vcd->log_fmt, vcd->disable_clip, vcd->disable_avx, vcd->enable_transform, vcd->phone_model, vcd->psnr, vcd->ssim, vcd->ms_ssim, vcd->pool); + + return score; +} + +int read_frame_10bit(float *reference_data, float *distorted_data, float *temp_data, int stride, void *s) +{ + x265_vmaf_framedata *user_data = (x265_vmaf_framedata *)s; + + PicYuv *reference_frame = (PicYuv *)user_data->reference_frame; + PicYuv *distorted_frame = (PicYuv *)user_data->distorted_frame; + + if(!user_data->frame_set) { + + int reference_stride = reference_frame->m_stride; + int distorted_stride = distorted_frame->m_stride; + + const uint16_t *reference_ptr = (const uint16_t *)reference_frame->m_picOrg[0]; + const uint16_t *distorted_ptr = (const uint16_t *)distorted_frame->m_picOrg[0]; + + temp_data = reference_data; + + int height = user_data->height; + int width = user_data->width; + + int i,j; + for (i = 0; i < height; i++) { + for ( j = 0; j < width; j++) { + temp_data[j] = ((float)reference_ptr[j] / 4.0); + } + reference_ptr += reference_stride; + temp_data += stride / sizeof(*temp_data); + } + + temp_data = distorted_data; + for (i = 0; i < height; i++) { + for (j = 0; j < width; j++) { + temp_data[j] = ((float)distorted_ptr[j] / 4.0); + } + distorted_ptr += distorted_stride; + temp_data += stride / sizeof(*temp_data); + } + + user_data->frame_set = 1; + return 0; + } + return 2; +} + +int read_frame_8bit(float *reference_data, float *distorted_data, float *temp_data, int stride, void *s) +{ + x265_vmaf_framedata *user_data = (x265_vmaf_framedata *)s; + + PicYuv *reference_frame = (PicYuv *)user_data->reference_frame; + PicYuv *distorted_frame = (PicYuv *)user_data->distorted_frame; + + if(!user_data->frame_set) { + + int reference_stride = reference_frame->m_stride; + int distorted_stride = distorted_frame->m_stride; + + const uint8_t *reference_ptr = (const uint8_t *)reference_frame->m_picOrg[0]; + const uint8_t *distorted_ptr = (const uint8_t *)distorted_frame->m_picOrg[0]; + + temp_data = reference_data; + + int height = user_data->height; + int width = user_data->width; + + int i,j; + for (i = 0; i < height; i++) { + for ( j = 0; j < width; j++) { + temp_data[j] = (float)reference_ptr[j]; + } + reference_ptr += reference_stride; + temp_data += stride / sizeof(*temp_data); + } + + temp_data = distorted_data; + for (i = 0; i < height; i++) { + for (j = 0; j < width; j++) { + temp_data[j] = (float)distorted_ptr[j]; + } + distorted_ptr += distorted_stride; + temp_data += stride / sizeof(*temp_data); + } + + user_data->frame_set = 1; + return 0; + } + return 2; +} + +double x265_calculate_vmaf_framelevelscore(x265_vmaf_framedata *vmafframedata) +{ + double score; + int (*read_frame)(float *reference_data, float *distorted_data, float *temp_data, + int stride, void *s); + if (vmafframedata->internalBitDepth == 8) + read_frame = read_frame_8bit; + else + read_frame = read_frame_10bit; + compute_vmaf(&score, vcd->format, vmafframedata->width, vmafframedata->height, read_frame, vmafframedata, vcd->model_path, vcd->log_path, vcd->log_fmt, vcd->disable_clip, vcd->disable_avx, vcd->enable_transform, vcd->phone_model, vcd->psnr, vcd->ssim, vcd->ms_ssim, vcd->pool); + + return score; +} +#endif } /* end namespace or extern "C" */
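
The csv.cpp additions above are the CSV/VMAF glue introduced in 2.8/2.9: x265 hands compute_vmaf() a frame-reading callback (read_frame, read_frame_8bit, read_frame_10bit) that fills the reference and distorted luma planes as floats, scales 10-bit samples down by 4 into the 8-bit range, and returns 0 when a frame pair was delivered, 2 at end of stream, and a non-zero error code otherwise. As a rough illustration of that callback contract only (this is not x265 code; PairSource, readPlane and readFramePair are made-up names), a standalone reader for two raw 8-bit luma-only files might look like:

/* Minimal sketch of a libvmaf-style read_frame callback, assuming two raw
 * 8-bit luma-only input files. Return convention mirrors the code above:
 * 0 = frame delivered, 2 = clean end of stream, 1 = error. */
#include <cstdio>
#include <cstdint>
#include <vector>

struct PairSource
{
    FILE *ref;       /* reference (original) luma samples      */
    FILE *dis;       /* distorted (reconstructed) luma samples */
    int   width;
    int   height;
};

static int readPlane(FILE *f, float *dst, int width, int height, int strideBytes)
{
    std::vector<uint8_t> row(width);
    for (int y = 0; y < height; y++)
    {
        if (fread(row.data(), 1, width, f) != (size_t)width)
            return feof(f) ? 2 : 1;                      /* 2 == end of stream */
        float *out = (float *)((char *)dst + (size_t)y * strideBytes);
        for (int x = 0; x < width; x++)
            out[x] = (float)row[x];                      /* promote 8-bit samples to float */
    }
    return 0;
}

/* Same shape as the read_frame pointer passed to compute_vmaf() above. */
static int readFramePair(float *refData, float *disData, float * /*temp*/, int strideBytes, void *s)
{
    PairSource *src = (PairSource *)s;
    int ret = readPlane(src->ref, refData, src->width, src->height, strideBytes);
    if (ret)
        return ret;
    return readPlane(src->dis, disData, src->width, src->height, strideBytes);
}

x265_calculate_vmafscore() above follows the same pattern with the x265_vmaf_data handle, using the offset field to skip the chroma planes of each input frame.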
View file
x265_2.7.tar.gz/source/encoder/dpb.cpp -> x265_2.9.tar.gz/source/encoder/dpb.cpp
Changed
@@ -131,9 +131,8 @@
     int pocCurr = slice->m_poc;
     int type = newFrame->m_lowres.sliceType;
     bool bIsKeyFrame = newFrame->m_lowres.bKeyframe;
-
     slice->m_nalUnitType = getNalUnitType(pocCurr, bIsKeyFrame);
-    if (slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL)
+    if (slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL || slice->m_nalUnitType == NAL_UNIT_CODED_SLICE_IDR_N_LP)
         m_lastIDR = pocCurr;
     slice->m_lastIDR = m_lastIDR;
     slice->m_sliceType = IS_X265_TYPE_B(type) ? B_SLICE : (type == X265_TYPE_P) ? P_SLICE : I_SLICE;
@@ -250,7 +249,7 @@
 /* Marking reference pictures when an IDR/CRA is encountered. */
 void DPB::decodingRefreshMarking(int pocCurr, NalUnitType nalUnitType)
 {
-    if (nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL)
+    if (nalUnitType == NAL_UNIT_CODED_SLICE_IDR_W_RADL || nalUnitType == NAL_UNIT_CODED_SLICE_IDR_N_LP)
     {
         /* If the nal_unit_type is IDR, all pictures in the reference picture
          * list are marked as "unused for reference" */
@@ -326,11 +325,9 @@
 NalUnitType DPB::getNalUnitType(int curPOC, bool bIsKeyFrame)
 {
     if (!curPOC)
-        return NAL_UNIT_CODED_SLICE_IDR_W_RADL;
-
+        return NAL_UNIT_CODED_SLICE_IDR_N_LP;
     if (bIsKeyFrame)
-        return m_bOpenGOP ? NAL_UNIT_CODED_SLICE_CRA : NAL_UNIT_CODED_SLICE_IDR_W_RADL;
-
+        return m_bOpenGOP ? NAL_UNIT_CODED_SLICE_CRA : m_bhasLeadingPicture ? NAL_UNIT_CODED_SLICE_IDR_W_RADL : NAL_UNIT_CODED_SLICE_IDR_N_LP;
     if (m_pocCRA && curPOC < m_pocCRA)
         // All leading pictures are being marked as TFD pictures here since
         // current encoder uses all reference pictures while encoding leading
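
In plain terms, the dpb.cpp hunks above widen every IDR check to accept both IDR flavours and change how the keyframe NAL type is chosen: the very first frame and closed-GOP keyframes without leading pictures become IDR_N_LP, closed-GOP keyframes that may be followed by RADL pictures (the new m_bhasLeadingPicture flag, set from the --radl setting in dpb.h below) keep IDR_W_RADL, and open-GOP keyframes remain CRA. A condensed, self-contained restatement of that decision (sketch only, with a local enum rather than the real x265 types):

/* Sketch restating the keyframe branch of getNalUnitType() shown above. */
enum KeyframeNalType { IDR_N_LP, IDR_W_RADL, CRA, NON_KEYFRAME };

KeyframeNalType keyframeNalType(int curPOC, bool bIsKeyFrame, bool bOpenGOP, bool bHasLeadingPicture)
{
    if (!curPOC)
        return IDR_N_LP;                         /* very first frame: IDR with no leading pictures      */
    if (bIsKeyFrame)
        return bOpenGOP ? CRA                    /* open GOP keeps CRA keyframes                        */
             : bHasLeadingPicture ? IDR_W_RADL   /* --radl > 0: IDR that permits RADL leading pictures  */
                                  : IDR_N_LP;    /* closed GOP without RADL: plain IDR                  */
    return NON_KEYFRAME;                         /* unchanged tail of getNalUnitType() handles the rest */
}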
View file
x265_2.7.tar.gz/source/encoder/dpb.h -> x265_2.9.tar.gz/source/encoder/dpb.h
Changed
@@ -40,6 +40,7 @@
     int m_lastIDR;
     int m_pocCRA;
     int m_bOpenGOP;
+    int m_bhasLeadingPicture;
     bool m_bRefreshPending;
     bool m_bTemporalSublayer;
     PicList m_picList;
@@ -50,6 +51,7 @@
     {
         m_lastIDR = 0;
         m_pocCRA = 0;
+        m_bhasLeadingPicture = param->radl;
         m_bRefreshPending = false;
         m_frameDataFreeList = NULL;
         m_bOpenGOP = param->bOpenGOP;
View file
x265_2.7.tar.gz/source/encoder/encoder.cpp -> x265_2.9.tar.gz/source/encoder/encoder.cpp
Changed
@@ -79,6 +79,7 @@ m_threadPool = NULL; m_analysisFileIn = NULL; m_analysisFileOut = NULL; + m_naluFile = NULL; m_offsetEmergency = NULL; m_iFrameNum = 0; m_iPPSQpMinus26 = 0; @@ -96,6 +97,8 @@ #endif m_prevTonemapPayload.payload = NULL; + m_startPoint = 0; + m_saveCTUSize = 0; } inline char *strcatFilename(const char *input, const char *suffix) { @@ -337,10 +340,12 @@ if (m_param->bEmitHRDSEI) m_rateControl->initHRD(m_sps); + if (!m_rateControl->init(m_sps)) m_aborted = true; if (!m_lookahead->create()) m_aborted = true; + initRefIdx(); if (m_param->analysisSave && m_param->bUseAnalysisFile) { @@ -408,10 +413,35 @@ m_emitCLLSEI = p->maxCLL || p->maxFALL; + if (m_param->naluFile) + { + m_naluFile = x265_fopen(m_param->naluFile, "r"); + if (!m_naluFile) + { + x265_log_file(NULL, X265_LOG_ERROR, "%s file not found or Failed to open\n", m_param->naluFile); + m_aborted = true; + } + else + m_enableNal = 1; + } + else + m_enableNal = 0; + #if ENABLE_HDR10_PLUS if (m_bToneMap) m_numCimInfo = m_hdr10plus_api->hdr10plus_json_to_movie_cim(m_param->toneMapFile, m_cim); #endif + if (m_param->bDynamicRefine) + { + /* Allocate memory for 1 GOP and reuse it for the subsequent GOPs */ + int size = (m_param->keyframeMax + m_param->lookaheadDepth) * m_param->maxCUDepth * X265_REFINE_INTER_LEVELS; + CHECKED_MALLOC_ZERO(m_variance, uint64_t, size); + CHECKED_MALLOC_ZERO(m_rdCost, uint64_t, size); + CHECKED_MALLOC_ZERO(m_trainingCount, uint32_t, size); + return; + fail: + m_aborted = true; + } } void Encoder::stopJobs() @@ -516,8 +546,8 @@ curFrame->m_analysisData.numPartitions = m_param->num4x4Partitions; int num16x16inCUWidth = m_param->maxCUSize >> 4; uint32_t ctuAddr, offset, cuPos; - analysis_intra_data * intraData = (analysis_intra_data *)curFrame->m_analysisData.intraData; - analysis_intra_data * srcIntraData = (analysis_intra_data *)analysis_data->intraData; + x265_analysis_intra_data * intraData = curFrame->m_analysisData.intraData; + x265_analysis_intra_data * srcIntraData = analysis_data->intraData; for (int i = 0; i < mbImageHeight; i++) { for (int j = 0; j < mbImageWidth; j++) @@ -546,8 +576,8 @@ curFrame->m_analysisData.numPartitions = m_param->num4x4Partitions; int num16x16inCUWidth = m_param->maxCUSize >> 4; uint32_t ctuAddr, offset, cuPos; - analysis_inter_data * interData = (analysis_inter_data *)curFrame->m_analysisData.interData; - analysis_inter_data * srcInterData = (analysis_inter_data*)analysis_data->interData; + x265_analysis_inter_data * interData = curFrame->m_analysisData.interData; + x265_analysis_inter_data * srcInterData = analysis_data->interData; for (int i = 0; i < mbImageHeight; i++) { for (int j = 0; j < mbImageWidth; j++) @@ -611,7 +641,7 @@ curFrame->m_analysisData = (*analysis_data); curFrame->m_analysisData.numCUsInFrame = widthInCU * heightInCU; curFrame->m_analysisData.numPartitions = m_param->num4x4Partitions; - allocAnalysis(&curFrame->m_analysisData); + x265_alloc_analysis_data(m_param, &curFrame->m_analysisData); if (m_param->maxCUSize == 16) { if (analysis_data->sliceType == X265_TYPE_IDR || analysis_data->sliceType == X265_TYPE_I) @@ -622,8 +652,8 @@ curFrame->m_analysisData.numPartitions = m_param->num4x4Partitions; size_t count = 0; - analysis_intra_data * currIntraData = (analysis_intra_data *)curFrame->m_analysisData.intraData; - analysis_intra_data * intraData = (analysis_intra_data *)analysis_data->intraData; + x265_analysis_intra_data * currIntraData = curFrame->m_analysisData.intraData; + x265_analysis_intra_data * intraData = analysis_data->intraData; 
for (uint32_t d = 0; d < cuBytes; d++) { int bytes = curFrame->m_analysisData.numPartitions >> ((intraData)->depth[d] * 2); @@ -643,14 +673,14 @@ curFrame->m_analysisData.numPartitions = m_param->num4x4Partitions; size_t count = 0; - analysis_inter_data * currInterData = (analysis_inter_data *)curFrame->m_analysisData.interData; - analysis_inter_data * interData = (analysis_inter_data *)analysis_data->interData; + x265_analysis_inter_data * currInterData = curFrame->m_analysisData.interData; + x265_analysis_inter_data * interData = analysis_data->interData; for (uint32_t d = 0; d < cuBytes; d++) { int bytes = curFrame->m_analysisData.numPartitions >> ((interData)->depth[d] * 2); memset(&(currInterData)->depth[count], (interData)->depth[d], bytes); memset(&(currInterData)->modes[count], (interData)->modes[d], bytes); - memcpy(&(currInterData)->sadCost[count], &((analysis_inter_data*)analysis_data->interData)->sadCost[d], bytes); + memcpy(&(currInterData)->sadCost[count], &(analysis_data->interData)->sadCost[d], bytes); if (m_param->analysisReuseLevel > 4) { memset(&(currInterData)->partSize[count], (interData)->partSize[d], bytes); @@ -697,7 +727,13 @@ if (m_bToneMap) m_hdr10plus_api->hdr10plus_clear_movie(m_cim, m_numCimInfo); #endif - + + if (m_param->bDynamicRefine) + { + X265_FREE(m_variance); + X265_FREE(m_rdCost); + X265_FREE(m_trainingCount); + } if (m_exportedPic) { ATOMIC_DEC(&m_exportedPic->m_countRefEncoders); @@ -761,6 +797,8 @@ } X265_FREE(temp); } + if (m_naluFile) + fclose(m_naluFile); if (m_param) { if (m_param->csvfpt) @@ -837,6 +875,77 @@ } } +void Encoder::copyUserSEIMessages(Frame *frame, const x265_picture* pic_in) +{ + x265_sei_payload toneMap; + toneMap.payload = NULL; + int toneMapPayload = 0; + +#if ENABLE_HDR10_PLUS + if (m_bToneMap) + { + int currentPOC = m_pocLast; + if (currentPOC < m_numCimInfo) + { + int32_t i = 0; + toneMap.payloadSize = 0; + while (m_cim[currentPOC][i] == 0xFF) + toneMap.payloadSize += m_cim[currentPOC][i++]; + toneMap.payloadSize += m_cim[currentPOC][i]; + + toneMap.payload = (uint8_t*)x265_malloc(sizeof(uint8_t) * toneMap.payloadSize); + toneMap.payloadType = USER_DATA_REGISTERED_ITU_T_T35; + memcpy(toneMap.payload, &m_cim[currentPOC][i + 1], toneMap.payloadSize); + toneMapPayload = 1; + } + } +#endif + /* seiMsg will contain SEI messages specified in a fixed file format in POC order. 
+ * Format of the file : <POC><space><PREFIX><space><NAL UNIT TYPE>/<SEI TYPE><space><SEI Payload> */ + x265_sei_payload seiMsg; + seiMsg.payload = NULL; + int userPayload = 0; + if (m_enableNal) + { + readUserSeiFile(seiMsg, m_pocLast); + if (seiMsg.payload) + userPayload = 1;; + } + + int numPayloads = pic_in->userSEI.numPayloads + toneMapPayload + userPayload; + frame->m_userSEI.numPayloads = numPayloads; + + if (frame->m_userSEI.numPayloads) + { + if (!frame->m_userSEI.payloads) + { + frame->m_userSEI.payloads = new x265_sei_payload[numPayloads]; + for (int i = 0; i < numPayloads; i++) + frame->m_userSEI.payloads[i].payload = NULL; + } + for (int i = 0; i < numPayloads; i++) + { + x265_sei_payload input; + if ((i == (numPayloads - 1)) && toneMapPayload) + input = toneMap; + else if (m_enableNal) + input = seiMsg; + else + input = pic_in->userSEI.payloads[i]; + + if (!frame->m_userSEI.payloads[i].payload) + frame->m_userSEI.payloads[i].payload = new uint8_t[input.payloadSize]; + memcpy(frame->m_userSEI.payloads[i].payload, input.payload, input.payloadSize); + frame->m_userSEI.payloads[i].payloadSize = input.payloadSize; + frame->m_userSEI.payloads[i].payloadType = input.payloadType; + } + if (toneMap.payload) + x265_free(toneMap.payload); + if (seiMsg.payload) + x265_free(seiMsg.payload); + } +} + /** * Feed one new input frame into the encoder, get one frame out. If pic_in is * NULL, a flush condition is implied and pic_in must be NULL for all subsequent @@ -863,12 +972,12 @@ if (m_exportedPic) { if (!m_param->bUseAnalysisFile && m_param->analysisSave) - freeAnalysis(&m_exportedPic->m_analysisData); + x265_free_analysis_data(m_param, &m_exportedPic->m_analysisData); ATOMIC_DEC(&m_exportedPic->m_countRefEncoders); m_exportedPic = NULL; m_dpb->recycleUnreferenced(); } - if (pic_in) + if (pic_in && (!m_param->chunkEnd || (m_encodedFrameNum < m_param->chunkEnd))) { if (m_latestParam->forceFlush == 1) { @@ -881,27 +990,6 @@ m_latestParam->forceFlush = 0; } - x265_sei_payload toneMap; - toneMap.payload = NULL; -#if ENABLE_HDR10_PLUS - if (m_bToneMap) - { - int currentPOC = m_pocLast + 1; - if (currentPOC < m_numCimInfo) - { - int32_t i = 0; - toneMap.payloadSize = 0; - while (m_cim[currentPOC][i] == 0xFF) - toneMap.payloadSize += m_cim[currentPOC][i++]; - toneMap.payloadSize += m_cim[currentPOC][i]; - - toneMap.payload = (uint8_t*)x265_malloc(sizeof(uint8_t) * toneMap.payloadSize); - toneMap.payloadType = USER_DATA_REGISTERED_ITU_T_T35; - memcpy(toneMap.payload, &m_cim[currentPOC][i+1], toneMap.payloadSize); - } - } -#endif - if (pic_in->bitDepth < 8 || pic_in->bitDepth > 16) { x265_log(m_param, X265_LOG_ERROR, "Input bit depth (%d) must be between 8 and 16\n", @@ -983,36 +1071,7 @@ inFrame->m_forceqp = pic_in->forceqp; inFrame->m_param = (m_reconfigure || m_reconfigureRc) ? 
m_latestParam : m_param; - int toneMapEnable = 0; - if (m_bToneMap && toneMap.payload) - toneMapEnable = 1; - int numPayloads = pic_in->userSEI.numPayloads + toneMapEnable; - inFrame->m_userSEI.numPayloads = numPayloads; - - if (inFrame->m_userSEI.numPayloads) - { - if (!inFrame->m_userSEI.payloads) - { - inFrame->m_userSEI.payloads = new x265_sei_payload[numPayloads]; - for (int i = 0; i < numPayloads; i++) - inFrame->m_userSEI.payloads[i].payload = NULL; - } - for (int i = 0; i < numPayloads; i++) - { - x265_sei_payload input; - if ((i == (numPayloads - 1)) && toneMapEnable) - input = toneMap; - else - input = pic_in->userSEI.payloads[i]; - int size = inFrame->m_userSEI.payloads[i].payloadSize = input.payloadSize; - inFrame->m_userSEI.payloads[i].payloadType = input.payloadType; - if (!inFrame->m_userSEI.payloads[i].payload) - inFrame->m_userSEI.payloads[i].payload = new uint8_t[size]; - memcpy(inFrame->m_userSEI.payloads[i].payload, input.payload, size); - } - if (toneMap.payload) - x265_free(toneMap.payload); - } + copyUserSEIMessages(inFrame, pic_in); if (pic_in->quantOffsets != NULL) { @@ -1049,8 +1108,35 @@ /* Load analysis data before lookahead->addPicture, since sliceType has been decided */ if (m_param->analysisLoad) { - /* readAnalysisFile reads analysis data for the frame and allocates memory based on slicetype */ - readAnalysisFile(&inFrame->m_analysisData, inFrame->m_poc, pic_in); + /* reads analysis data for the frame and allocates memory based on slicetype */ + static int paramBytes = 0; + if (!inFrame->m_poc) + { + x265_analysis_data analysisData = pic_in->analysisData; + paramBytes = validateAnalysisData(&analysisData, 0); + if (paramBytes == -1) + { + m_aborted = true; + return -1; + } + } + if (m_saveCTUSize) + { + cuLocation cuLocInFrame; + cuLocInFrame.init(m_param); + /* Set skipWidth/skipHeight flags when the out of bound pixels in lowRes is greater than half of maxCUSize */ + int extendedWidth = ((m_param->sourceWidth / 2 + m_param->maxCUSize - 1) >> m_param->maxLog2CUSize) * m_param->maxCUSize; + int extendedHeight = ((m_param->sourceHeight / 2 + m_param->maxCUSize - 1) >> m_param->maxLog2CUSize) * m_param->maxCUSize; + uint32_t outOfBoundaryLowres = extendedWidth - m_param->sourceWidth / 2; + if (outOfBoundaryLowres * 2 >= m_param->maxCUSize) + cuLocInFrame.skipWidth = true; + uint32_t outOfBoundaryLowresH = extendedHeight - m_param->sourceHeight / 2; + if (outOfBoundaryLowresH * 2 >= m_param->maxCUSize) + cuLocInFrame.skipHeight = true; + readAnalysisFile(&inFrame->m_analysisData, inFrame->m_poc, pic_in, paramBytes, cuLocInFrame); + } + else + readAnalysisFile(&inFrame->m_analysisData, inFrame->m_poc, pic_in, paramBytes); inFrame->m_poc = inFrame->m_analysisData.poc; sliceType = inFrame->m_analysisData.sliceType; inFrame->m_lowres.bScenecut = !!inFrame->m_analysisData.bScenecut; @@ -1133,7 +1219,7 @@ /* Free up pic_in->analysisData since it has already been used */ if ((m_param->analysisLoad && !m_param->analysisSave) || (m_param->bMVType && slice->m_sliceType != I_SLICE)) - freeAnalysis(&outFrame->m_analysisData); + x265_free_analysis_data(m_param, &outFrame->m_analysisData); if (pic_out) { @@ -1146,6 +1232,7 @@ pic_out->pts = outFrame->m_pts; pic_out->dts = outFrame->m_dts; + pic_out->reorderedPts = outFrame->m_reorderedPts; pic_out->sliceType = outFrame->m_lowres.sliceType; pic_out->planes[0] = recpic->m_picOrg[0]; pic_out->stride[0] = (int)(recpic->m_stride * sizeof(pixel)); @@ -1171,6 +1258,7 @@ pic_out->analysisData.intraData = 
outFrame->m_analysisData.intraData; pic_out->analysisData.modeFlag[0] = outFrame->m_analysisData.modeFlag[0]; pic_out->analysisData.modeFlag[1] = outFrame->m_analysisData.modeFlag[1]; + pic_out->analysisData.distortionData = outFrame->m_analysisData.distortionData; if (m_param->bDisableLookahead) { int factor = 1; @@ -1178,6 +1266,7 @@ factor = m_param->scaleFactor * 2; pic_out->analysisData.numCuInHeight = outFrame->m_analysisData.numCuInHeight; pic_out->analysisData.lookahead.dts = outFrame->m_dts; + pic_out->analysisData.lookahead.reorderedPts = outFrame->m_reorderedPts; pic_out->analysisData.satdCost *= factor; pic_out->analysisData.lookahead.keyframe = outFrame->m_lowres.bKeyframe; pic_out->analysisData.lookahead.lastMiniGopBFrame = outFrame->m_lowres.bLastMiniGopBFrame; @@ -1186,46 +1275,49 @@ int vbvCount = m_param->lookaheadDepth + m_param->bframes + 2; for (int index = 0; index < vbvCount; index++) { - pic_out->analysisData.lookahead.plannedSatd[index] = outFrame->m_lowres.plannedSatd[index] * factor; + pic_out->analysisData.lookahead.plannedSatd[index] = outFrame->m_lowres.plannedSatd[index]; pic_out->analysisData.lookahead.plannedType[index] = outFrame->m_lowres.plannedType[index]; } for (uint32_t index = 0; index < pic_out->analysisData.numCuInHeight; index++) { - outFrame->m_analysisData.lookahead.intraSatdForVbv[index] = outFrame->m_encData->m_rowStat[index].intraSatdForVbv * factor; - outFrame->m_analysisData.lookahead.satdForVbv[index] = outFrame->m_encData->m_rowStat[index].satdForVbv * factor; + outFrame->m_analysisData.lookahead.intraSatdForVbv[index] = outFrame->m_encData->m_rowStat[index].intraSatdForVbv; + outFrame->m_analysisData.lookahead.satdForVbv[index] = outFrame->m_encData->m_rowStat[index].satdForVbv; } pic_out->analysisData.lookahead.intraSatdForVbv = outFrame->m_analysisData.lookahead.intraSatdForVbv; pic_out->analysisData.lookahead.satdForVbv = outFrame->m_analysisData.lookahead.satdForVbv; for (uint32_t index = 0; index < pic_out->analysisData.numCUsInFrame; index++) { - outFrame->m_analysisData.lookahead.intraVbvCost[index] = outFrame->m_encData->m_cuStat[index].intraVbvCost * factor; - outFrame->m_analysisData.lookahead.vbvCost[index] = outFrame->m_encData->m_cuStat[index].vbvCost * factor; + outFrame->m_analysisData.lookahead.intraVbvCost[index] = outFrame->m_encData->m_cuStat[index].intraVbvCost; + outFrame->m_analysisData.lookahead.vbvCost[index] = outFrame->m_encData->m_cuStat[index].vbvCost; } pic_out->analysisData.lookahead.intraVbvCost = outFrame->m_analysisData.lookahead.intraVbvCost; pic_out->analysisData.lookahead.vbvCost = outFrame->m_analysisData.lookahead.vbvCost; } } writeAnalysisFile(&pic_out->analysisData, *outFrame->m_encData); + pic_out->analysisData.saveParam = pic_out->analysisData.saveParam; if (m_param->bUseAnalysisFile) - freeAnalysis(&pic_out->analysisData); + x265_free_analysis_data(m_param, &pic_out->analysisData); } } if (m_param->rc.bStatWrite && (m_param->analysisMultiPassRefine || m_param->analysisMultiPassDistortion)) { if (pic_out) { - pic_out->analysis2Pass.poc = pic_out->poc; - pic_out->analysis2Pass.analysisFramedata = outFrame->m_analysis2Pass.analysisFramedata; + pic_out->analysisData.poc = pic_out->poc; + pic_out->analysisData.interData = outFrame->m_analysisData.interData; + pic_out->analysisData.intraData = outFrame->m_analysisData.intraData; + pic_out->analysisData.distortionData = outFrame->m_analysisData.distortionData; } - writeAnalysis2PassFile(&outFrame->m_analysis2Pass, *outFrame->m_encData, 
outFrame->m_lowres.sliceType); + writeAnalysisFileRefine(&outFrame->m_analysisData, *outFrame->m_encData); } if (m_param->analysisMultiPassRefine || m_param->analysisMultiPassDistortion) - freeAnalysis2Pass(&outFrame->m_analysis2Pass, outFrame->m_lowres.sliceType); + x265_free_analysis_data(m_param, &outFrame->m_analysisData); if (m_param->internalCsp == X265_CSP_I400) { if (slice->m_sliceType == P_SLICE) { - if (slice->m_weightPredTable[0][0][0].bPresentFlag) + if (slice->m_weightPredTable[0][0][0].wtPresent) m_numLumaWPFrames++; } else if (slice->m_sliceType == B_SLICE) @@ -1233,7 +1325,7 @@ bool bLuma = false; for (int l = 0; l < 2; l++) { - if (slice->m_weightPredTable[l][0][0].bPresentFlag) + if (slice->m_weightPredTable[l][0][0].wtPresent) bLuma = true; } if (bLuma) @@ -1244,10 +1336,10 @@ { if (slice->m_sliceType == P_SLICE) { - if (slice->m_weightPredTable[0][0][0].bPresentFlag) + if (slice->m_weightPredTable[0][0][0].wtPresent) m_numLumaWPFrames++; - if (slice->m_weightPredTable[0][0][1].bPresentFlag || - slice->m_weightPredTable[0][0][2].bPresentFlag) + if (slice->m_weightPredTable[0][0][1].wtPresent || + slice->m_weightPredTable[0][0][2].wtPresent) m_numChromaWPFrames++; } else if (slice->m_sliceType == B_SLICE) @@ -1255,10 +1347,10 @@ bool bLuma = false, bChroma = false; for (int l = 0; l < 2; l++) { - if (slice->m_weightPredTable[l][0][0].bPresentFlag) + if (slice->m_weightPredTable[l][0][0].wtPresent) bLuma = true; - if (slice->m_weightPredTable[l][0][1].bPresentFlag || - slice->m_weightPredTable[l][0][2].bPresentFlag) + if (slice->m_weightPredTable[l][0][1].wtPresent || + slice->m_weightPredTable[l][0][2].wtPresent) bChroma = true; } @@ -1271,7 +1363,8 @@ if (m_aborted) return -1; - finishFrameStats(outFrame, curEncoder, frameData, m_pocLast); + if ((m_outputCount + 1) >= m_param->chunkStart) + finishFrameStats(outFrame, curEncoder, frameData, m_pocLast); /* Write RateControl Frame level stats in multipass encodes */ if (m_param->rc.bStatWrite) @@ -1306,8 +1399,12 @@ } else m_exportedPic = outFrame; - - m_numDelayedPic--; + + m_outputCount++; + if (m_param->chunkEnd == m_outputCount) + m_numDelayedPic = 0; + else + m_numDelayedPic--; ret = 1; } @@ -1316,14 +1413,18 @@ * curEncoder is guaranteed to be idle at this point */ if (!pass) frameEnc = m_lookahead->getDecidedPicture(); - if (frameEnc && !pass) + if (frameEnc && !pass && (!m_param->chunkEnd || (m_encodedFrameNum < m_param->chunkEnd))) { if (m_param->analysisMultiPassRefine || m_param->analysisMultiPassDistortion) { - allocAnalysis2Pass(&frameEnc->m_analysis2Pass, frameEnc->m_lowres.sliceType); - frameEnc->m_analysis2Pass.poc = frameEnc->m_poc; + uint32_t widthInCU = (m_param->sourceWidth + m_param->maxCUSize - 1) >> m_param->maxLog2CUSize; + uint32_t heightInCU = (m_param->sourceHeight + m_param->maxCUSize - 1) >> m_param->maxLog2CUSize; + frameEnc->m_analysisData.numCUsInFrame = widthInCU * heightInCU; + frameEnc->m_analysisData.numPartitions = m_param->num4x4Partitions; + x265_alloc_analysis_data(m_param, &frameEnc->m_analysisData); + frameEnc->m_analysisData.poc = frameEnc->m_poc; if (m_param->rc.bStatRead) - readAnalysis2PassFile(&frameEnc->m_analysis2Pass, frameEnc->m_poc, frameEnc->m_lowres.sliceType); + readAnalysisFile(&frameEnc->m_analysisData, frameEnc->m_poc, frameEnc->m_lowres.sliceType); } if (frameEnc->m_reconfigureRc && m_reconfigureRc) @@ -1370,6 +1471,7 @@ if (m_param->analysisLoad && m_param->bDisableLookahead) { frameEnc->m_dts = frameEnc->m_analysisData.lookahead.dts; + frameEnc->m_reorderedPts = 
frameEnc->m_analysisData.lookahead.reorderedPts; if (m_rateControl->m_isVbv) { for (uint32_t index = 0; index < frameEnc->m_analysisData.numCuInHeight; index++) @@ -1436,6 +1538,7 @@ frameEnc->m_encData->m_slice->m_iNumRPSInSPS = m_sps.spsrpsNum; curEncoder->m_rce.encodeOrder = frameEnc->m_encodeOrder = m_encodedFrameNum++; + if (!m_param->analysisLoad || !m_param->bDisableLookahead) { if (m_bframeDelay) @@ -1463,7 +1566,7 @@ analysis->numCUsInFrame = numCUsInFrame; analysis->numCuInHeight = heightInCU; analysis->numPartitions = m_param->num4x4Partitions; - allocAnalysis(analysis); + x265_alloc_analysis_data(m_param, analysis); } /* determine references, setup RPS, etc */ m_dpb->prepareEncode(frameEnc); @@ -2074,7 +2177,7 @@ { const int picOrderCntLSB = slice->m_poc - slice->m_lastIDR; - frameStats->encoderOrder = m_outputCount++; + frameStats->encoderOrder = m_outputCount; frameStats->sliceType = c; frameStats->poc = picOrderCntLSB; frameStats->qp = curEncData.m_avgQpAq; @@ -2083,6 +2186,7 @@ if (m_param->csvLogLevel >= 2) frameStats->ipCostRatio = curFrame->m_lowres.ipCostRatio; frameStats->bufferFill = m_rateControl->m_bufferFillActual; + frameStats->bufferFillFinal = m_rateControl->m_bufferFillFinal; frameStats->frameLatency = inPoc - poc; if (m_param->rc.rateControlMode == X265_RC_CRF) frameStats->rateFactor = curEncData.m_rateFactor; @@ -2106,6 +2210,9 @@ #define ELAPSED_MSEC(start, end) (((double)(end) - (start)) / 1000) if (m_param->csvLogLevel >= 2) { +#if ENABLE_LIBVMAF + frameStats->vmafFrameScore = curFrame->m_fencPic->m_vmafScore; +#endif frameStats->decideWaitTime = ELAPSED_MSEC(0, curEncoder->m_slicetypeWaitTime); frameStats->row0WaitTime = ELAPSED_MSEC(curEncoder->m_startCompressTime, curEncoder->m_row0WaitTime); frameStats->wallTime = ELAPSED_MSEC(curEncoder->m_row0WaitTime, curEncoder->m_endCompressTime); @@ -2265,30 +2372,25 @@ list.serialize(NAL_UNIT_SPS, bs); bs.resetBits(); - sbacCoder.codePPS( m_pps, (m_param->maxSlices <= 1), m_iPPSQpMinus26); + sbacCoder.codePPS(m_pps, (m_param->maxSlices <= 1), m_iPPSQpMinus26); bs.writeByteAlignment(); list.serialize(NAL_UNIT_PPS, bs); + if (m_param->bSingleSeiNal) + bs.resetBits(); + if (m_param->bEmitHDRSEI) { SEIContentLightLevel cllsei; cllsei.max_content_light_level = m_param->maxCLL; cllsei.max_pic_average_light_level = m_param->maxFALL; - bs.resetBits(); - cllsei.write(bs, m_sps); - bs.writeByteAlignment(); - list.serialize(NAL_UNIT_PREFIX_SEI, bs); + cllsei.writeSEImessages(bs, m_sps, NAL_UNIT_PREFIX_SEI, list, m_param->bSingleSeiNal); if (m_param->masteringDisplayColorVolume) { SEIMasteringDisplayColorVolume mdsei; if (mdsei.parse(m_param->masteringDisplayColorVolume)) - { - bs.resetBits(); - mdsei.write(bs, m_sps); - bs.writeByteAlignment(); - list.serialize(NAL_UNIT_PREFIX_SEI, bs); - } + mdsei.writeSEImessages(bs, m_sps, NAL_UNIT_PREFIX_SEI, list, m_param->bSingleSeiNal); else x265_log(m_param, X265_LOG_WARNING, "unable to parse mastering display color volume info\n"); } @@ -2300,21 +2402,18 @@ if (opts) { char *buffer = X265_MALLOC(char, strlen(opts) + strlen(PFX(version_str)) + - strlen(PFX(build_info_str)) + 200); + strlen(PFX(build_info_str)) + 200); if (buffer) { sprintf(buffer, "x265 (build %d) - %s:%s - H.265/HEVC codec - " - "Copyright 2013-2018 (c) Multicoreware, Inc - " - "http://x265.org - options: %s", - X265_BUILD, PFX(version_str), PFX(build_info_str), opts); - - bs.resetBits(); + "Copyright 2013-2018 (c) Multicoreware, Inc - " + "http://x265.org - options: %s", + X265_BUILD, PFX(version_str), 
PFX(build_info_str), opts); + SEIuserDataUnregistered idsei; idsei.m_userData = (uint8_t*)buffer; idsei.setSize((uint32_t)strlen(buffer)); - idsei.write(bs, m_sps); - bs.writeByteAlignment(); - list.serialize(NAL_UNIT_PREFIX_SEI, bs); + idsei.writeSEImessages(bs, m_sps, NAL_UNIT_PREFIX_SEI, list, m_param->bSingleSeiNal); X265_FREE(buffer); } @@ -2329,11 +2428,7 @@ SEIActiveParameterSets sei; sei.m_selfContainedCvsFlag = true; sei.m_noParamSetUpdateFlag = true; - - bs.resetBits(); - sei.write(bs, m_sps); - bs.writeByteAlignment(); - list.serialize(NAL_UNIT_PREFIX_SEI, bs); + sei.writeSEImessages(bs, m_sps, NAL_UNIT_PREFIX_SEI, list, m_param->bSingleSeiNal); } } @@ -2416,7 +2511,7 @@ vui.defaultDisplayWindow.bottomOffset = m_param->vui.defDispWinBottomOffset; vui.defaultDisplayWindow.leftOffset = m_param->vui.defDispWinLeftOffset; - vui.frameFieldInfoPresentFlag = !!m_param->interlaceMode; + vui.frameFieldInfoPresentFlag = !!m_param->interlaceMode || (m_param->pictureStructure >= 0); vui.fieldSeqFlag = !!m_param->interlaceMode; vui.hrdParametersPresentFlag = m_param->bEmitHRDSEI; @@ -2428,6 +2523,7 @@ void Encoder::initPPS(PPS *pps) { bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; + bool bEnableDistOffset = m_param->analysisMultiPassDistortion && m_param->rc.bStatRead; if (!m_param->bLossless && (m_param->rc.aqMode || bIsVbv || m_param->bAQMotion)) { @@ -2435,6 +2531,11 @@ pps->maxCuDQPDepth = g_log2Size[m_param->maxCUSize] - g_log2Size[m_param->rc.qgSize]; X265_CHECK(pps->maxCuDQPDepth <= 3, "max CU DQP depth cannot be greater than 3\n"); } + else if (!m_param->bLossless && bEnableDistOffset) + { + pps->bUseDQP = true; + pps->maxCuDQPDepth = 0; + } else { pps->bUseDQP = false; @@ -2660,32 +2761,51 @@ { p->scaleFactor = 0; } - else if ((!p->analysisLoad && !p->analysisSave) || p->analysisReuseLevel < 10) + else if ((!p->analysisLoad && !p->analysisSave) || (p->analysisReuseLevel > 6 && p->analysisReuseLevel != 10)) { - x265_log(p, X265_LOG_WARNING, "Input scaling works with analysis load/save, analysis-reuse-level 10. Disabling scale-factor.\n"); + x265_log(p, X265_LOG_WARNING, "Input scaling works with analysis load/save and analysis-reuse-level 1-6 and 10. Disabling scale-factor.\n"); p->scaleFactor = 0; } } if (p->intraRefine) { - if (!p->analysisLoad || p->analysisReuseLevel < 10 || !p->scaleFactor) + if (!p->analysisLoad || p->analysisReuseLevel < 10) { - x265_log(p, X265_LOG_WARNING, "Intra refinement requires analysis load, analysis-reuse-level 10, scale factor. Disabling intra refine.\n"); + x265_log(p, X265_LOG_WARNING, "Intra refinement requires analysis load, analysis-reuse-level 10. Disabling intra refine.\n"); p->intraRefine = 0; } } if (p->interRefine) { - if (!p->analysisLoad || p->analysisReuseLevel < 10 || !p->scaleFactor) + if (!p->analysisLoad || p->analysisReuseLevel < 10) + { + x265_log(p, X265_LOG_WARNING, "Inter refinement requires analysis load, analysis-reuse-level 10. Disabling inter refine.\n"); + p->interRefine = 0; + } + } + + if (p->bDynamicRefine) + { + if (!p->analysisLoad || p->analysisReuseLevel < 10) + { + x265_log(p, X265_LOG_WARNING, "Dynamic refinement requires analysis load, analysis-reuse-level 10. Disabling dynamic refine.\n"); + p->bDynamicRefine = 0; + } + if (p->interRefine) { - x265_log(p, X265_LOG_WARNING, "Inter refinement requires analysis load, analysis-reuse-level 10, scale factor. Disabling inter refine.\n"); + x265_log(p, X265_LOG_WARNING, "Inter refine cannot be used with dynamic refine. 
Disabling refine-inter.\n"); p->interRefine = 0; } } + if (p->scaleFactor && p->analysisLoad && !p->interRefine && !p->bDynamicRefine && p->analysisReuseLevel == 10) + { + x265_log(p, X265_LOG_WARNING, "Inter refinement 0 is not supported with scaling and analysis-reuse-level=10. Enabling refine-inter 1.\n"); + p->interRefine = 1; + } - if (p->limitTU && p->interRefine) + if (p->limitTU && (p->interRefine || p->bDynamicRefine)) { x265_log(p, X265_LOG_WARNING, "Inter refinement does not support limitTU. Disabling limitTU.\n"); p->limitTU = 0; @@ -2693,9 +2813,9 @@ if (p->mvRefine) { - if (!p->analysisLoad || p->analysisReuseLevel < 10 || !p->scaleFactor) + if (!p->analysisLoad || p->analysisReuseLevel < 10) { - x265_log(p, X265_LOG_WARNING, "MV refinement requires analysis load, analysis-reuse-level 10, scale factor. Disabling MV refine.\n"); + x265_log(p, X265_LOG_WARNING, "MV refinement requires analysis load, analysis-reuse-level 10. Disabling MV refine.\n"); p->mvRefine = 0; } else if (p->interRefine >= 2) @@ -2711,13 +2831,6 @@ p->bDistributeMotionEstimation = p->bDistributeModeAnalysis = 0; } - if (p->rc.bEnableGrain) - { - x265_log(p, X265_LOG_WARNING, "Rc Grain removes qp fluctuations caused by aq/cutree, Disabling aq,cu-tree\n"); - p->rc.cuTree = 0; - p->rc.aqMode = 0; - } - if (p->bDistributeModeAnalysis && (p->limitReferences >> 1) && 1) { x265_log(p, X265_LOG_WARNING, "Limit reference options 2 and 3 are not supported with pmode. Disabling limit reference\n"); @@ -3054,231 +3167,32 @@ p->radl = 0; x265_log(p, X265_LOG_WARNING, "Radl requires fixed gop-length (keyint == min-keyint). Disabling radl.\n"); } -} - -void Encoder::allocAnalysis(x265_analysis_data* analysis) -{ - X265_CHECK(analysis->sliceType, "invalid slice type\n"); - analysis->interData = analysis->intraData = NULL; - if (m_param->bDisableLookahead && m_rateControl->m_isVbv) - { - CHECKED_MALLOC_ZERO(analysis->lookahead.intraSatdForVbv, uint32_t, analysis->numCuInHeight); - CHECKED_MALLOC_ZERO(analysis->lookahead.satdForVbv, uint32_t, analysis->numCuInHeight); - CHECKED_MALLOC_ZERO(analysis->lookahead.intraVbvCost, uint32_t, analysis->numCUsInFrame); - CHECKED_MALLOC_ZERO(analysis->lookahead.vbvCost, uint32_t, analysis->numCUsInFrame); - } - if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) - { - if (m_param->analysisReuseLevel < 2) - return; - - analysis_intra_data *intraData = (analysis_intra_data*)analysis->intraData; - CHECKED_MALLOC_ZERO(intraData, analysis_intra_data, 1); - CHECKED_MALLOC(intraData->depth, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC(intraData->modes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC(intraData->partSizes, char, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC(intraData->chromaModes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - analysis->intraData = intraData; - } - else - { - int numDir = analysis->sliceType == X265_TYPE_P ? 1 : 2; - uint32_t numPlanes = m_param->internalCsp == X265_CSP_I400 ? 
1 : 3; - if (!(m_param->bMVType == AVC_INFO)) - CHECKED_MALLOC_ZERO(analysis->wt, WeightParam, numPlanes * numDir); - if (m_param->analysisReuseLevel < 2) - return; - - analysis_inter_data *interData = (analysis_inter_data*)analysis->interData; - CHECKED_MALLOC_ZERO(interData, analysis_inter_data, 1); - CHECKED_MALLOC(interData->depth, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC(interData->modes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - if (m_param->analysisReuseLevel > 4) - { - CHECKED_MALLOC(interData->partSize, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC(interData->mergeFlag, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - } - - if (m_param->analysisReuseLevel >= 7) - { - CHECKED_MALLOC(interData->interDir, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC(interData->sadCost, int64_t, analysis->numPartitions * analysis->numCUsInFrame); - for (int dir = 0; dir < numDir; dir++) - { - CHECKED_MALLOC(interData->mvpIdx[dir], uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC(interData->refIdx[dir], int8_t, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC(interData->mv[dir], MV, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC_ZERO(analysis->modeFlag[dir], uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - } - /* Allocate intra in inter */ - if (analysis->sliceType == X265_TYPE_P || m_param->bIntraInBFrames) - { - analysis_intra_data *intraData = (analysis_intra_data*)analysis->intraData; - CHECKED_MALLOC_ZERO(intraData, analysis_intra_data, 1); - CHECKED_MALLOC(intraData->modes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - CHECKED_MALLOC(intraData->chromaModes, uint8_t, analysis->numPartitions * analysis->numCUsInFrame); - analysis->intraData = intraData; - } - } - else - CHECKED_MALLOC_ZERO(interData->ref, int32_t, analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir); - - analysis->interData = interData; - } - return; - -fail: - freeAnalysis(analysis); - m_aborted = true; -} -void Encoder::freeAnalysis(x265_analysis_data* analysis) -{ - if (m_param->bDisableLookahead && m_rateControl->m_isVbv) - { - X265_FREE(analysis->lookahead.satdForVbv); - X265_FREE(analysis->lookahead.intraSatdForVbv); - X265_FREE(analysis->lookahead.vbvCost); - X265_FREE(analysis->lookahead.intraVbvCost); - } - /* Early exit freeing weights alone if level is 1 (when there is no analysis inter/intra) */ - if (analysis->sliceType > X265_TYPE_I && analysis->wt && !(m_param->bMVType == AVC_INFO)) - X265_FREE(analysis->wt); - if (m_param->analysisReuseLevel < 2) - return; - if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) + if ((p->chunkStart || p->chunkEnd) && p->bOpenGOP) { - if (analysis->intraData) - { - X265_FREE(((analysis_intra_data*)analysis->intraData)->depth); - X265_FREE(((analysis_intra_data*)analysis->intraData)->modes); - X265_FREE(((analysis_intra_data*)analysis->intraData)->partSizes); - X265_FREE(((analysis_intra_data*)analysis->intraData)->chromaModes); - X265_FREE(analysis->intraData); - analysis->intraData = NULL; - } + p->chunkStart = p->chunkEnd = 0; + x265_log(p, X265_LOG_WARNING, "Chunking requires closed gop structure. 
Disabling chunking.\n"); } - else - { - if (analysis->intraData) - { - X265_FREE(((analysis_intra_data*)analysis->intraData)->modes); - X265_FREE(((analysis_intra_data*)analysis->intraData)->chromaModes); - X265_FREE(analysis->intraData); - analysis->intraData = NULL; - } - if (analysis->interData) - { - X265_FREE(((analysis_inter_data*)analysis->interData)->depth); - X265_FREE(((analysis_inter_data*)analysis->interData)->modes); - if (m_param->analysisReuseLevel > 4) - { - X265_FREE(((analysis_inter_data*)analysis->interData)->mergeFlag); - X265_FREE(((analysis_inter_data*)analysis->interData)->partSize); - } - if (m_param->analysisReuseLevel >= 7) - { - X265_FREE(((analysis_inter_data*)analysis->interData)->interDir); - X265_FREE(((analysis_inter_data*)analysis->interData)->sadCost); - int numDir = analysis->sliceType == X265_TYPE_P ? 1 : 2; - for (int dir = 0; dir < numDir; dir++) - { - X265_FREE(((analysis_inter_data*)analysis->interData)->mvpIdx[dir]); - X265_FREE(((analysis_inter_data*)analysis->interData)->refIdx[dir]); - X265_FREE(((analysis_inter_data*)analysis->interData)->mv[dir]); - if (analysis->modeFlag[dir] != NULL) - { - X265_FREE(analysis->modeFlag[dir]); - analysis->modeFlag[dir] = NULL; - } - } - } - else - X265_FREE(((analysis_inter_data*)analysis->interData)->ref); - - X265_FREE(analysis->interData); - analysis->interData = NULL; - } - } -} - -void Encoder::allocAnalysis2Pass(x265_analysis_2Pass* analysis, int sliceType) -{ - analysis->analysisFramedata = NULL; - analysis2PassFrameData *analysisFrameData = (analysis2PassFrameData*)analysis->analysisFramedata; - uint32_t widthInCU = (m_param->sourceWidth + m_param->maxCUSize - 1) >> m_param->maxLog2CUSize; - uint32_t heightInCU = (m_param->sourceHeight + m_param->maxCUSize - 1) >> m_param->maxLog2CUSize; - uint32_t numCUsInFrame = widthInCU * heightInCU; - CHECKED_MALLOC_ZERO(analysisFrameData, analysis2PassFrameData, 1); - CHECKED_MALLOC_ZERO(analysisFrameData->depth, uint8_t, m_param->num4x4Partitions * numCUsInFrame); - CHECKED_MALLOC_ZERO(analysisFrameData->distortion, sse_t, m_param->num4x4Partitions * numCUsInFrame); - if (m_param->rc.bStatRead) - { - CHECKED_MALLOC_ZERO(analysisFrameData->ctuDistortion, sse_t, numCUsInFrame); - CHECKED_MALLOC_ZERO(analysisFrameData->scaledDistortion, double, numCUsInFrame); - CHECKED_MALLOC_ZERO(analysisFrameData->offset, double, numCUsInFrame); - CHECKED_MALLOC_ZERO(analysisFrameData->threshold, double, numCUsInFrame); - } - if (!IS_X265_TYPE_I(sliceType)) + if (p->chunkEnd < p->chunkStart) { - CHECKED_MALLOC_ZERO(analysisFrameData->m_mv[0], MV, m_param->num4x4Partitions * numCUsInFrame); - CHECKED_MALLOC_ZERO(analysisFrameData->m_mv[1], MV, m_param->num4x4Partitions * numCUsInFrame); - CHECKED_MALLOC_ZERO(analysisFrameData->mvpIdx[0], int, m_param->num4x4Partitions * numCUsInFrame); - CHECKED_MALLOC_ZERO(analysisFrameData->mvpIdx[1], int, m_param->num4x4Partitions * numCUsInFrame); - CHECKED_MALLOC_ZERO(analysisFrameData->ref[0], int32_t, m_param->num4x4Partitions * numCUsInFrame); - CHECKED_MALLOC_ZERO(analysisFrameData->ref[1], int32_t, m_param->num4x4Partitions * numCUsInFrame); - CHECKED_MALLOC(analysisFrameData->modes, uint8_t, m_param->num4x4Partitions * numCUsInFrame); + p->chunkStart = p->chunkEnd = 0; + x265_log(p, X265_LOG_WARNING, "chunk-end cannot be less than chunk-start. 
Disabling chunking.\n"); } - analysis->analysisFramedata = analysisFrameData; - - return; - -fail: - freeAnalysis2Pass(analysis, sliceType); - m_aborted = true; -} - -void Encoder::freeAnalysis2Pass(x265_analysis_2Pass* analysis, int sliceType) -{ - if (analysis->analysisFramedata) - { - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->depth); - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->distortion); - if (m_param->rc.bStatRead) - { - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->ctuDistortion); - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->scaledDistortion); - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->offset); - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->threshold); - } - if (!IS_X265_TYPE_I(sliceType)) - { - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->m_mv[0]); - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->m_mv[1]); - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->mvpIdx[0]); - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->mvpIdx[1]); - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->ref[0]); - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->ref[1]); - X265_FREE(((analysis2PassFrameData*)analysis->analysisFramedata)->modes); - } - X265_FREE(analysis->analysisFramedata); - } } -void Encoder::readAnalysisFile(x265_analysis_data* analysis, int curPoc, const x265_picture* picIn) +void Encoder::readAnalysisFile(x265_analysis_data* analysis, int curPoc, const x265_picture* picIn, int paramBytes) { - #define X265_FREAD(val, size, readSize, fileOffset, src)\ if (!m_param->bUseAnalysisFile)\ - {\ + {\ memcpy(val, src, (size * readSize));\ - }\ - else if (fread(val, size, readSize, fileOffset) != readSize)\ + }\ + else if (fread(val, size, readSize, fileOffset) != readSize)\ {\ x265_log(NULL, X265_LOG_ERROR, "Error reading analysis data\n");\ - freeAnalysis(analysis);\ + x265_free_analysis_data(m_param, analysis);\ m_aborted = true;\ return;\ }\ @@ -3287,10 +3201,10 @@ static uint64_t totalConsumedBytes = 0; uint32_t depthBytes = 0; if (m_param->bUseAnalysisFile) - fseeko(m_analysisFileIn, totalConsumedBytes, SEEK_SET); + fseeko(m_analysisFileIn, totalConsumedBytes + paramBytes, SEEK_SET); const x265_analysis_data *picData = &(picIn->analysisData); - analysis_intra_data *intraPic = (analysis_intra_data *)picData->intraData; - analysis_inter_data *interPic = (analysis_inter_data *)picData->interData; + x265_analysis_intra_data *intraPic = picData->intraData; + x265_analysis_inter_data *interPic = picData->interData; int poc; uint32_t frameRecordSize; X265_FREAD(&frameRecordSize, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->frameRecordSize)); @@ -3305,7 +3219,7 @@ while (poc != curPoc && !feof(m_analysisFileIn)) { currentOffset += frameRecordSize; - fseeko(m_analysisFileIn, currentOffset, SEEK_SET); + fseeko(m_analysisFileIn, currentOffset + paramBytes, SEEK_SET); X265_FREAD(&frameRecordSize, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->frameRecordSize)); X265_FREAD(&depthBytes, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->depthBytes)); X265_FREAD(&poc, sizeof(int), 1, m_analysisFileIn, &(picData->poc)); @@ -3313,7 +3227,7 @@ if (poc != curPoc || feof(m_analysisFileIn)) { x265_log(NULL, X265_LOG_WARNING, "Error reading analysis data: Cannot find POC %d\n", curPoc); - freeAnalysis(analysis); + x265_free_analysis_data(m_param, analysis); 
return; } } @@ -3337,13 +3251,32 @@ if (m_param->scaleFactor) analysis->numPartitions *= factor; /* Memory is allocated for inter and intra analysis data based on the slicetype */ - allocAnalysis(analysis); + x265_alloc_analysis_data(m_param, analysis); if (m_param->bDisableLookahead && m_rateControl->m_isVbv) { + size_t vbvCount = m_param->lookaheadDepth + m_param->bframes + 2; X265_FREAD(analysis->lookahead.intraVbvCost, sizeof(uint32_t), analysis->numCUsInFrame, m_analysisFileIn, picData->lookahead.intraVbvCost); X265_FREAD(analysis->lookahead.vbvCost, sizeof(uint32_t), analysis->numCUsInFrame, m_analysisFileIn, picData->lookahead.vbvCost); X265_FREAD(analysis->lookahead.satdForVbv, sizeof(uint32_t), analysis->numCuInHeight, m_analysisFileIn, picData->lookahead.satdForVbv); X265_FREAD(analysis->lookahead.intraSatdForVbv, sizeof(uint32_t), analysis->numCuInHeight, m_analysisFileIn, picData->lookahead.intraSatdForVbv); + X265_FREAD(analysis->lookahead.plannedSatd, sizeof(int64_t), vbvCount, m_analysisFileIn, picData->lookahead.plannedSatd); + + if (m_param->scaleFactor) + { + for (uint64_t index = 0; index < vbvCount; index++) + analysis->lookahead.plannedSatd[index] *= factor; + + for (uint32_t i = 0; i < analysis->numCuInHeight; i++) + { + analysis->lookahead.satdForVbv[i] *= factor; + analysis->lookahead.intraSatdForVbv[i] *= factor; + } + for (uint32_t i = 0; i < analysis->numCUsInFrame; i++) + { + analysis->lookahead.vbvCost[i] *= factor; + analysis->lookahead.intraVbvCost[i] *= factor; + } + } } if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) { @@ -3372,22 +3305,22 @@ if (partSizes[d] == SIZE_NxN) partSizes[d] = SIZE_2Nx2N; } - memset(&((analysis_intra_data *)analysis->intraData)->depth[count], depthBuf[d], bytes); - memset(&((analysis_intra_data *)analysis->intraData)->chromaModes[count], modeBuf[d], bytes); - memset(&((analysis_intra_data *)analysis->intraData)->partSizes[count], partSizes[d], bytes); + memset(&(analysis->intraData)->depth[count], depthBuf[d], bytes); + memset(&(analysis->intraData)->chromaModes[count], modeBuf[d], bytes); + memset(&(analysis->intraData)->partSizes[count], partSizes[d], bytes); count += bytes; } if (!m_param->scaleFactor) { - X265_FREAD(((analysis_intra_data *)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFileIn, intraPic->modes); + X265_FREAD((analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFileIn, intraPic->modes); } else { uint8_t *tempLumaBuf = X265_MALLOC(uint8_t, analysis->numCUsInFrame * scaledNumPartition); X265_FREAD(tempLumaBuf, sizeof(uint8_t), analysis->numCUsInFrame * scaledNumPartition, m_analysisFileIn, intraPic->modes); for (uint32_t ctu32Idx = 0, cnt = 0; ctu32Idx < analysis->numCUsInFrame * scaledNumPartition; ctu32Idx++, cnt += factor) - memset(&((analysis_intra_data *)analysis->intraData)->modes[cnt], tempLumaBuf[ctu32Idx], factor); + memset(&(analysis->intraData)->modes[cnt], tempLumaBuf[ctu32Idx], factor); X265_FREE(tempLumaBuf); } X265_FREE(tempBuf); @@ -3456,37 +3389,37 @@ { int bytes = analysis->numPartitions >> (depthBuf[d] * 2); if (m_param->scaleFactor && modeBuf[d] == MODE_INTRA && depthBuf[d] == 0) - depthBuf[d] = 1; - memset(&((analysis_inter_data *)analysis->interData)->depth[count], depthBuf[d], bytes); - memset(&((analysis_inter_data *)analysis->interData)->modes[count], modeBuf[d], bytes); + depthBuf[d] = 1; + memset(&(analysis->interData)->depth[count], 
depthBuf[d], bytes); + memset(&(analysis->interData)->modes[count], modeBuf[d], bytes); if (m_param->analysisReuseLevel > 4) { if (m_param->scaleFactor && modeBuf[d] == MODE_INTRA && partSize[d] == SIZE_NxN) - partSize[d] = SIZE_2Nx2N; - memset(&((analysis_inter_data *)analysis->interData)->partSize[count], partSize[d], bytes); + partSize[d] = SIZE_2Nx2N; + memset(&(analysis->interData)->partSize[count], partSize[d], bytes); int numPU = (modeBuf[d] == MODE_INTRA) ? 1 : nbPartsTable[(int)partSize[d]]; for (int pu = 0; pu < numPU; pu++) { if (pu) d++; - ((analysis_inter_data *)analysis->interData)->mergeFlag[count + pu] = mergeFlag[d]; + (analysis->interData)->mergeFlag[count + pu] = mergeFlag[d]; if (m_param->analysisReuseLevel == 10) { - ((analysis_inter_data *)analysis->interData)->interDir[count + pu] = interDir[d]; + (analysis->interData)->interDir[count + pu] = interDir[d]; for (uint32_t i = 0; i < numDir; i++) { - ((analysis_inter_data *)analysis->interData)->mvpIdx[i][count + pu] = mvpIdx[i][d]; - ((analysis_inter_data *)analysis->interData)->refIdx[i][count + pu] = refIdx[i][d]; + (analysis->interData)->mvpIdx[i][count + pu] = mvpIdx[i][d]; + (analysis->interData)->refIdx[i][count + pu] = refIdx[i][d]; if (m_param->scaleFactor) { mv[i][d].x *= (int16_t)m_param->scaleFactor; mv[i][d].y *= (int16_t)m_param->scaleFactor; } - memcpy(&((analysis_inter_data *)analysis->interData)->mv[i][count + pu], &mv[i][d], sizeof(MV)); + memcpy(&(analysis->interData)->mv[i][count + pu], &mv[i][d], sizeof(MV)); } } } if (m_param->analysisReuseLevel == 10 && bIntraInInter) - memset(&((analysis_intra_data *)analysis->intraData)->chromaModes[count], chromaDir[d], bytes); + memset(&(analysis->intraData)->chromaModes[count], chromaDir[d], bytes); } count += bytes; } @@ -3505,20 +3438,20 @@ { if (!m_param->scaleFactor) { - X265_FREAD(((analysis_intra_data *)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFileIn, intraPic->modes); + X265_FREAD((analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFileIn, intraPic->modes); } else { uint8_t *tempLumaBuf = X265_MALLOC(uint8_t, analysis->numCUsInFrame * scaledNumPartition); X265_FREAD(tempLumaBuf, sizeof(uint8_t), analysis->numCUsInFrame * scaledNumPartition, m_analysisFileIn, intraPic->modes); for (uint32_t ctu32Idx = 0, cnt = 0; ctu32Idx < analysis->numCUsInFrame * scaledNumPartition; ctu32Idx++, cnt += factor) - memset(&((analysis_intra_data *)analysis->intraData)->modes[cnt], tempLumaBuf[ctu32Idx], factor); + memset(&(analysis->intraData)->modes[cnt], tempLumaBuf[ctu32Idx], factor); X265_FREE(tempLumaBuf); } } } else - X265_FREAD(((analysis_inter_data *)analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir, m_analysisFileIn, interPic->ref); + X265_FREAD((analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir, m_analysisFileIn, interPic->ref); consumedBytes += frameRecordSize; if (numDir == 1) @@ -3527,23 +3460,602 @@ #undef X265_FREAD } -void Encoder::readAnalysis2PassFile(x265_analysis_2Pass* analysis2Pass, int curPoc, int sliceType) +void Encoder::readAnalysisFile(x265_analysis_data* analysis, int curPoc, const x265_picture* picIn, int paramBytes, cuLocation cuLoc) +{ +#define X265_FREAD(val, size, readSize, fileOffset, src)\ + if (!m_param->bUseAnalysisFile)\ + {\ + memcpy(val, src, (size * readSize));\ + }\ + else if (fread(val, size, 
readSize, fileOffset) != readSize)\ + {\ + x265_log(NULL, X265_LOG_ERROR, "Error reading analysis data\n");\ + x265_free_analysis_data(m_param, analysis);\ + m_aborted = true;\ + return;\ + }\ + + static uint64_t consumedBytes = 0; + static uint64_t totalConsumedBytes = 0; + uint32_t depthBytes = 0; + if (m_param->bUseAnalysisFile) + fseeko(m_analysisFileIn, totalConsumedBytes + paramBytes, SEEK_SET); + + const x265_analysis_data *picData = &(picIn->analysisData); + x265_analysis_intra_data *intraPic = picData->intraData; + x265_analysis_inter_data *interPic = picData->interData; + + int poc; uint32_t frameRecordSize; + X265_FREAD(&frameRecordSize, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->frameRecordSize)); + X265_FREAD(&depthBytes, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->depthBytes)); + X265_FREAD(&poc, sizeof(int), 1, m_analysisFileIn, &(picData->poc)); + + if (m_param->bUseAnalysisFile) + { + uint64_t currentOffset = totalConsumedBytes; + + /* Seeking to the right frame Record */ + while (poc != curPoc && !feof(m_analysisFileIn)) + { + currentOffset += frameRecordSize; + fseeko(m_analysisFileIn, currentOffset + paramBytes, SEEK_SET); + X265_FREAD(&frameRecordSize, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->frameRecordSize)); + X265_FREAD(&depthBytes, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->depthBytes)); + X265_FREAD(&poc, sizeof(int), 1, m_analysisFileIn, &(picData->poc)); + } + if (poc != curPoc || feof(m_analysisFileIn)) + { + x265_log(NULL, X265_LOG_WARNING, "Error reading analysis data: Cannot find POC %d\n", curPoc); + x265_free_analysis_data(m_param, analysis); + return; + } + } + + /* Now arrived at the right frame, read the record */ + analysis->poc = poc; + analysis->frameRecordSize = frameRecordSize; + X265_FREAD(&analysis->sliceType, sizeof(int), 1, m_analysisFileIn, &(picData->sliceType)); + X265_FREAD(&analysis->bScenecut, sizeof(int), 1, m_analysisFileIn, &(picData->bScenecut)); + X265_FREAD(&analysis->satdCost, sizeof(int64_t), 1, m_analysisFileIn, &(picData->satdCost)); + X265_FREAD(&analysis->numCUsInFrame, sizeof(int), 1, m_analysisFileIn, &(picData->numCUsInFrame)); + X265_FREAD(&analysis->numPartitions, sizeof(int), 1, m_analysisFileIn, &(picData->numPartitions)); + if (m_param->bDisableLookahead) + { + X265_FREAD(&analysis->numCuInHeight, sizeof(uint32_t), 1, m_analysisFileIn, &(picData->numCuInHeight)); + X265_FREAD(&analysis->lookahead, sizeof(x265_lookahead_data), 1, m_analysisFileIn, &(picData->lookahead)); + } + int scaledNumPartition = analysis->numPartitions; + int factor = 1 << m_param->scaleFactor; + + int numPartitions = analysis->numPartitions; + int numCUsInFrame = analysis->numCUsInFrame; + int numCuInHeight = analysis->numCuInHeight; + /* Allocate memory for scaled resoultion's numPartitions and numCUsInFrame*/ + analysis->numPartitions = m_param->num4x4Partitions; + analysis->numCUsInFrame = cuLoc.heightInCU * cuLoc.widthInCU; + analysis->numCuInHeight = cuLoc.heightInCU; + + /* Memory is allocated for inter and intra analysis data based on the slicetype */ + x265_alloc_analysis_data(m_param, analysis); + + analysis->numPartitions = numPartitions * factor; + analysis->numCUsInFrame = numCUsInFrame; + analysis->numCuInHeight = numCuInHeight; + if (m_param->bDisableLookahead && m_rateControl->m_isVbv) + { + uint32_t width = analysis->numCUsInFrame / analysis->numCuInHeight; + bool skipLastRow = (analysis->numCuInHeight * 2) > cuLoc.heightInCU; + bool skipLastCol = (width * 2) > cuLoc.widthInCU; + uint32_t 
*intraVbvCostBuf = NULL, *vbvCostBuf = NULL, *satdForVbvBuf = NULL, *intraSatdForVbvBuf = NULL; + intraVbvCostBuf = X265_MALLOC(uint32_t, analysis->numCUsInFrame); + vbvCostBuf = X265_MALLOC(uint32_t, analysis->numCUsInFrame); + satdForVbvBuf = X265_MALLOC(uint32_t, analysis->numCuInHeight); + intraSatdForVbvBuf = X265_MALLOC(uint32_t, analysis->numCuInHeight); + + X265_FREAD(intraVbvCostBuf, sizeof(uint32_t), analysis->numCUsInFrame, m_analysisFileIn, picData->lookahead.intraVbvCost); + X265_FREAD(vbvCostBuf, sizeof(uint32_t), analysis->numCUsInFrame, m_analysisFileIn, picData->lookahead.vbvCost); + X265_FREAD(satdForVbvBuf, sizeof(uint32_t), analysis->numCuInHeight, m_analysisFileIn, picData->lookahead.satdForVbv); + X265_FREAD(intraSatdForVbvBuf, sizeof(uint32_t), analysis->numCuInHeight, m_analysisFileIn, picData->lookahead.intraSatdForVbv); + + int k = 0; + for (uint32_t i = 0; i < analysis->numCuInHeight; i++) + { + analysis->lookahead.satdForVbv[m_param->scaleFactor * i] = satdForVbvBuf[i] * m_param->scaleFactor; + analysis->lookahead.intraSatdForVbv[m_param->scaleFactor * i] = intraSatdForVbvBuf[i] * m_param->scaleFactor; + if (!(i == (analysis->numCuInHeight - 1) && skipLastRow)) + { + analysis->lookahead.satdForVbv[(m_param->scaleFactor * i) + 1] = satdForVbvBuf[i] * m_param->scaleFactor; + analysis->lookahead.intraSatdForVbv[(m_param->scaleFactor * i) + 1] = intraSatdForVbvBuf[i] * m_param->scaleFactor; + } + + for (uint32_t j = 0; j < width; j++, k++) + { + analysis->lookahead.vbvCost[(i * m_param->scaleFactor * cuLoc.widthInCU) + (j * m_param->scaleFactor)] = vbvCostBuf[k]; + analysis->lookahead.intraVbvCost[(i * m_param->scaleFactor * cuLoc.widthInCU) + (j * m_param->scaleFactor)] = intraVbvCostBuf[k]; + + if (!(j == (width - 1) && skipLastCol)) + { + analysis->lookahead.vbvCost[(i * m_param->scaleFactor * cuLoc.widthInCU) + (j * m_param->scaleFactor) + 1] = vbvCostBuf[k]; + analysis->lookahead.intraVbvCost[(i * m_param->scaleFactor * cuLoc.widthInCU) + (j * m_param->scaleFactor) + 1] = intraVbvCostBuf[k]; + } + if (!(i == (analysis->numCuInHeight - 1) && skipLastRow)) + { + analysis->lookahead.vbvCost[(i * m_param->scaleFactor * cuLoc.widthInCU) + cuLoc.widthInCU + (j * m_param->scaleFactor)] = vbvCostBuf[k]; + analysis->lookahead.intraVbvCost[(i * m_param->scaleFactor * cuLoc.widthInCU) + cuLoc.widthInCU + (j * m_param->scaleFactor)] = intraVbvCostBuf[k]; + if (!(j == (width - 1) && skipLastCol)) + { + analysis->lookahead.vbvCost[(i * m_param->scaleFactor * cuLoc.widthInCU) + cuLoc.widthInCU + (j * m_param->scaleFactor) + 1] = vbvCostBuf[k]; + analysis->lookahead.intraVbvCost[(i * m_param->scaleFactor * cuLoc.widthInCU) + cuLoc.widthInCU + (j * m_param->scaleFactor) + 1] = intraVbvCostBuf[k]; + } + } + } + } + X265_FREE(satdForVbvBuf); + X265_FREE(intraSatdForVbvBuf); + X265_FREE(intraVbvCostBuf); + X265_FREE(vbvCostBuf); + } + + if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) + { + if (m_param->analysisReuseLevel < 2) + return; + + uint8_t *tempBuf = NULL, *depthBuf = NULL, *modeBuf = NULL, *partSizes = NULL; + + tempBuf = X265_MALLOC(uint8_t, depthBytes * 3); + depthBuf = tempBuf; + modeBuf = tempBuf + depthBytes; + partSizes = tempBuf + 2 * depthBytes; + + X265_FREAD(depthBuf, sizeof(uint8_t), depthBytes, m_analysisFileIn, intraPic->depth); + X265_FREAD(modeBuf, sizeof(uint8_t), depthBytes, m_analysisFileIn, intraPic->chromaModes); + X265_FREAD(partSizes, sizeof(uint8_t), depthBytes, m_analysisFileIn, intraPic->partSizes); + + uint32_t 
count = 0; + for (uint32_t d = 0; d < depthBytes; d++) + { + int bytes = analysis->numPartitions >> (depthBuf[d] * 2); + int numCTUCopied = 1; + if (!depthBuf[d]) //copy data of one 64x64 to four scaled 64x64 CTUs. + { + bytes /= 4; + numCTUCopied = 4; + } + if (partSizes[d] == SIZE_NxN) + partSizes[d] = SIZE_2Nx2N; + if ((depthBuf[d] > 1 && m_param->maxCUSize == 64) || (depthBuf[d] && m_param->maxCUSize != 64)) + depthBuf[d]--; + + for (int numCTU = 0; numCTU < numCTUCopied; numCTU++) + { + memset(&(analysis->intraData)->depth[count], depthBuf[d], bytes); + memset(&(analysis->intraData)->chromaModes[count], modeBuf[d], bytes); + memset(&(analysis->intraData)->partSizes[count], partSizes[d], bytes); + count += bytes; + d += getCUIndex(&cuLoc, &count, bytes, 1); + } + } + + cuLoc.evenRowIndex = 0; + cuLoc.oddRowIndex = m_param->num4x4Partitions * cuLoc.widthInCU; + cuLoc.switchCondition = 0; + uint8_t *tempLumaBuf = X265_MALLOC(uint8_t, analysis->numCUsInFrame * scaledNumPartition); + X265_FREAD(tempLumaBuf, sizeof(uint8_t), analysis->numCUsInFrame * scaledNumPartition, m_analysisFileIn, intraPic->modes); + uint32_t cnt = 0; + for (uint32_t ctu32Idx = 0; ctu32Idx < analysis->numCUsInFrame * scaledNumPartition; ctu32Idx++) + { + memset(&(analysis->intraData)->modes[cnt], tempLumaBuf[ctu32Idx], factor); + cnt += factor; + ctu32Idx += getCUIndex(&cuLoc, &cnt, factor, 0); + } + X265_FREE(tempLumaBuf); + X265_FREE(tempBuf); + consumedBytes += frameRecordSize; + } + + else + { + uint32_t numDir = analysis->sliceType == X265_TYPE_P ? 1 : 2; + uint32_t numPlanes = m_param->internalCsp == X265_CSP_I400 ? 1 : 3; + X265_FREAD((WeightParam*)analysis->wt, sizeof(WeightParam), numPlanes * numDir, m_analysisFileIn, (picIn->analysisData.wt)); + if (m_param->analysisReuseLevel < 2) + return; + + uint8_t *tempBuf = NULL, *depthBuf = NULL, *modeBuf = NULL, *partSize = NULL, *mergeFlag = NULL; + uint8_t *interDir = NULL, *chromaDir = NULL, *mvpIdx[2]; + MV* mv[2]; + int8_t* refIdx[2]; + + int numBuf = m_param->analysisReuseLevel > 4 ? 
4 : 2; + bool bIntraInInter = false; + if (m_param->analysisReuseLevel == 10) + { + numBuf++; + bIntraInInter = (analysis->sliceType == X265_TYPE_P || m_param->bIntraInBFrames); + if (bIntraInInter) numBuf++; + } + + tempBuf = X265_MALLOC(uint8_t, depthBytes * numBuf); + depthBuf = tempBuf; + modeBuf = tempBuf + depthBytes; + + X265_FREAD(depthBuf, sizeof(uint8_t), depthBytes, m_analysisFileIn, interPic->depth); + X265_FREAD(modeBuf, sizeof(uint8_t), depthBytes, m_analysisFileIn, interPic->modes); + if (m_param->analysisReuseLevel > 4) + { + partSize = modeBuf + depthBytes; + mergeFlag = partSize + depthBytes; + X265_FREAD(partSize, sizeof(uint8_t), depthBytes, m_analysisFileIn, interPic->partSize); + X265_FREAD(mergeFlag, sizeof(uint8_t), depthBytes, m_analysisFileIn, interPic->mergeFlag); + if (m_param->analysisReuseLevel == 10) + { + interDir = mergeFlag + depthBytes; + X265_FREAD(interDir, sizeof(uint8_t), depthBytes, m_analysisFileIn, interPic->interDir); + if (bIntraInInter) + { + chromaDir = interDir + depthBytes; + X265_FREAD(chromaDir, sizeof(uint8_t), depthBytes, m_analysisFileIn, intraPic->chromaModes); + } + for (uint32_t i = 0; i < numDir; i++) + { + mvpIdx[i] = X265_MALLOC(uint8_t, depthBytes); + refIdx[i] = X265_MALLOC(int8_t, depthBytes); + mv[i] = X265_MALLOC(MV, depthBytes); + X265_FREAD(mvpIdx[i], sizeof(uint8_t), depthBytes, m_analysisFileIn, interPic->mvpIdx[i]); + X265_FREAD(refIdx[i], sizeof(int8_t), depthBytes, m_analysisFileIn, interPic->refIdx[i]); + X265_FREAD(mv[i], sizeof(MV), depthBytes, m_analysisFileIn, interPic->mv[i]); + } + } + } + + uint32_t count = 0; + cuLoc.switchCondition = 0; + for (uint32_t d = 0; d < depthBytes; d++) + { + int bytes = analysis->numPartitions >> (depthBuf[d] * 2); + bool isScaledMaxCUSize = false; + int numCTUCopied = 1; + int writeDepth = depthBuf[d]; + if (!depthBuf[d]) //copy data of one 64x64 to four scaled 64x64 CTUs. + { + isScaledMaxCUSize = true; + bytes /= 4; + numCTUCopied = 4; + } + if ((modeBuf[d] != MODE_INTRA && depthBuf[d] != 0) || (modeBuf[d] == MODE_INTRA && depthBuf[d] > 1)) + writeDepth--; + + for (int numCTU = 0; numCTU < numCTUCopied; numCTU++) + { + memset(&(analysis->interData)->depth[count], writeDepth, bytes); + memset(&(analysis->interData)->modes[count], modeBuf[d], bytes); + if (m_param->analysisReuseLevel == 10 && bIntraInInter) + memset(&(analysis->intraData)->chromaModes[count], chromaDir[d], bytes); + + if (m_param->analysisReuseLevel > 4) + { + puOrientation puOrient; + puOrient.init(); + if (modeBuf[d] == MODE_INTRA && partSize[d] == SIZE_NxN) + partSize[d] = SIZE_2Nx2N; + int partitionSize = partSize[d]; + if (isScaledMaxCUSize && partSize[d] != SIZE_2Nx2N) + partitionSize = getPuShape(&puOrient, partSize[d], numCTU); + memset(&(analysis->interData)->partSize[count], partitionSize, bytes); + int numPU = (modeBuf[d] == MODE_INTRA) ? 
1 : nbPartsTable[(int)partSize[d]]; + for (int pu = 0; pu < numPU; pu++) + { + if (!isScaledMaxCUSize && pu) + d++; + int restoreD = d; + /* Adjust d value when the current CTU takes data from 2nd PU */ + if (puOrient.isRect || (puOrient.isAmp && partitionSize == SIZE_2Nx2N)) + { + if ((numCTU > 1 && !puOrient.isVert) || ((numCTU % 2 == 1) && puOrient.isVert)) + d++; + } + if (puOrient.isAmp && pu) + d++; + + (analysis->interData)->mergeFlag[count + pu] = mergeFlag[d]; + if (m_param->analysisReuseLevel == 10) + { + (analysis->interData)->interDir[count + pu] = interDir[d]; + MV mvCopy[2]; + for (uint32_t i = 0; i < numDir; i++) + { + (analysis->interData)->mvpIdx[i][count + pu] = mvpIdx[i][d]; + (analysis->interData)->refIdx[i][count + pu] = refIdx[i][d]; + mvCopy[i].x = mv[i][d].x * (int16_t)m_param->scaleFactor; + mvCopy[i].y = mv[i][d].y * (int16_t)m_param->scaleFactor; + memcpy(&(analysis->interData)->mv[i][count + pu], &mvCopy[i], sizeof(MV)); + } + } + d = restoreD; // Restore d value after copying each of the 4 64x64 CTUs + + if (isScaledMaxCUSize && (puOrient.isRect || puOrient.isAmp)) + { + /* Skip PU index when current CTU is a 2Nx2N */ + if (partitionSize == SIZE_2Nx2N) + pu++; + /* Adjust d after completion of all 4 CTU copies */ + if (numCTU == 3 && (pu == (numPU - 1))) + d++; + } + } + } + count += bytes; + d += getCUIndex(&cuLoc, &count, bytes, 1); + } + } + + X265_FREE(tempBuf); + + if (m_param->analysisReuseLevel == 10) + { + for (uint32_t i = 0; i < numDir; i++) + { + X265_FREE(mvpIdx[i]); + X265_FREE(refIdx[i]); + X265_FREE(mv[i]); + } + if (bIntraInInter) + { + cuLoc.evenRowIndex = 0; + cuLoc.oddRowIndex = m_param->num4x4Partitions * cuLoc.widthInCU; + cuLoc.switchCondition = 0; + uint8_t *tempLumaBuf = X265_MALLOC(uint8_t, analysis->numCUsInFrame * scaledNumPartition); + X265_FREAD(tempLumaBuf, sizeof(uint8_t), analysis->numCUsInFrame * scaledNumPartition, m_analysisFileIn, intraPic->modes); + uint32_t cnt = 0; + for (uint32_t ctu32Idx = 0; ctu32Idx < analysis->numCUsInFrame * scaledNumPartition; ctu32Idx++) + { + memset(&(analysis->intraData)->modes[cnt], tempLumaBuf[ctu32Idx], factor); + cnt += factor; + ctu32Idx += getCUIndex(&cuLoc, &cnt, factor, 0); + } + X265_FREE(tempLumaBuf); + } + } + else + X265_FREAD((analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir, m_analysisFileIn, interPic->ref); + + consumedBytes += frameRecordSize; + if (numDir == 1) + totalConsumedBytes = consumedBytes; + } + + /* Restore to the current encode's numPartitions and numCUsInFrame */ + analysis->numPartitions = m_param->num4x4Partitions; + analysis->numCUsInFrame = cuLoc.heightInCU * cuLoc.widthInCU; + analysis->numCuInHeight = cuLoc.heightInCU; +#undef X265_FREAD +} + + +int Encoder::validateAnalysisData(x265_analysis_data* analysis, int writeFlag) +{ +#define X265_PARAM_VALIDATE(analysisParam, size, bytes, param, errorMsg)\ + if(!writeFlag)\ + {\ + fileOffset = m_analysisFileIn;\ + if ((!m_param->bUseAnalysisFile && analysisParam != (int)*param) || \ + (m_param->bUseAnalysisFile && (fread(&readValue, size, bytes, fileOffset) != bytes || (readValue != (int)*param))))\ + {\ + x265_log(NULL, X265_LOG_ERROR, "Error reading analysis data. 
Incompatible option : <%s> \n", #errorMsg);\ + m_aborted = true;\ + return -1;\ + }\ + }\ + if(writeFlag)\ + {\ + fileOffset = m_analysisFileOut;\ + if(!m_param->bUseAnalysisFile)\ + analysisParam = *param;\ + else if(fwrite(param, size, bytes, fileOffset) < bytes)\ + {\ + x265_log(NULL, X265_LOG_ERROR, "Error writing analysis data\n"); \ + m_aborted = true;\ + return -1; \ + }\ + }\ + count++; + +#define X265_FREAD(val, size, readSize, fileOffset, src)\ + if (!m_param->bUseAnalysisFile)\ + {\ + memcpy(val, src, (size * readSize));\ + }\ + else if (fread(val, size, readSize, fileOffset) != readSize)\ + {\ + x265_log(NULL, X265_LOG_ERROR, "Error reading analysis data\n");\ + m_aborted = true;\ + return -1;\ + }\ + count++; + + x265_analysis_validate *saveParam = &analysis->saveParam; + FILE* fileOffset = NULL; + int readValue = 0; + int count = 0; + + X265_PARAM_VALIDATE(saveParam->intraRefresh, sizeof(int), 1, &m_param->bIntraRefresh, intra-refresh); + X265_PARAM_VALIDATE(saveParam->maxNumReferences, sizeof(int), 1, &m_param->maxNumReferences, ref); + X265_PARAM_VALIDATE(saveParam->analysisReuseLevel, sizeof(int), 1, &m_param->analysisReuseLevel, analysis-reuse-level); + X265_PARAM_VALIDATE(saveParam->keyframeMax, sizeof(int), 1, &m_param->keyframeMax, keyint); + X265_PARAM_VALIDATE(saveParam->keyframeMin, sizeof(int), 1, &m_param->keyframeMin, min-keyint); + X265_PARAM_VALIDATE(saveParam->openGOP, sizeof(int), 1, &m_param->bOpenGOP, open-gop); + X265_PARAM_VALIDATE(saveParam->bframes, sizeof(int), 1, &m_param->bframes, bframes); + X265_PARAM_VALIDATE(saveParam->bPyramid, sizeof(int), 1, &m_param->bBPyramid, bPyramid); + X265_PARAM_VALIDATE(saveParam->minCUSize, sizeof(int), 1, &m_param->minCUSize, min - cu - size); + X265_PARAM_VALIDATE(saveParam->lookaheadDepth, sizeof(int), 1, &m_param->lookaheadDepth, rc - lookahead); + X265_PARAM_VALIDATE(saveParam->chunkStart, sizeof(int), 1, &m_param->chunkStart, chunk-start); + X265_PARAM_VALIDATE(saveParam->chunkEnd, sizeof(int), 1, &m_param->chunkEnd, chunk-end); + + int sourceHeight, sourceWidth; + if (writeFlag) + { + sourceHeight = m_param->sourceHeight - m_conformanceWindow.bottomOffset; + sourceWidth = m_param->sourceWidth - m_conformanceWindow.rightOffset; + X265_PARAM_VALIDATE(saveParam->sourceWidth, sizeof(int), 1, &sourceWidth, res-width); + X265_PARAM_VALIDATE(saveParam->sourceHeight, sizeof(int), 1, &sourceHeight, res-height); + X265_PARAM_VALIDATE(saveParam->maxCUSize, sizeof(int), 1, &m_param->maxCUSize, ctu); + } + else + { + fileOffset = m_analysisFileIn; + bool error = false; + int curSourceHeight = m_param->sourceHeight - m_conformanceWindow.bottomOffset; + int curSourceWidth = m_param->sourceWidth - m_conformanceWindow.rightOffset; + + X265_FREAD(&sourceWidth, sizeof(int), 1, m_analysisFileIn, &(saveParam->sourceWidth)); + X265_FREAD(&sourceHeight, sizeof(int), 1, m_analysisFileIn, &(saveParam->sourceHeight)); + X265_FREAD(&readValue, sizeof(int), 1, m_analysisFileIn, &(saveParam->maxCUSize)); + + bool isScaledRes = (2 * sourceHeight == curSourceHeight) && (2 * sourceWidth == curSourceWidth); + if (!isScaledRes && (sourceHeight != curSourceHeight || sourceWidth != curSourceWidth + || readValue != (int)m_param->maxCUSize || m_param->scaleFactor)) + error = true; + else if (isScaledRes && !m_param->scaleFactor) + error = true; + else if (isScaledRes && (int)m_param->maxCUSize == readValue) + m_saveCTUSize = 1; + else if (isScaledRes && (g_log2Size[m_param->maxCUSize] - g_log2Size[readValue]) != 1) + error = true; + + if (error) + 
{ + x265_log(NULL, X265_LOG_ERROR, "Error reading analysis data. Incompatible option : <input-res / scale-factor / ctu> \n"); + m_aborted = true; + return -1; + } + } + return (count * sizeof(int)); + +#undef X265_FREAD +#undef X265_PARAM_VALIDATE +} + +/* Toggle between two consecutive CTU rows. The save's CTU is copied +twice consecutively in the first and second CTU row of load*/ + +int Encoder::getCUIndex(cuLocation* cuLoc, uint32_t* count, int bytes, int flag) +{ + int index = 0; + cuLoc->switchCondition += bytes; + int isBoundaryW = (*count % (m_param->num4x4Partitions * cuLoc->widthInCU) == 0); + + /* Width boundary case : + Skip to appropriate index when out of boundary cases occur + Out of boundary may occur when the out of bound pixels along + the width in low resoultion is greater than half of the maxCUSize */ + if (cuLoc->skipWidth && isBoundaryW) + { + if (flag) + index++; + else + { + /* Number of 4x4 blocks in out of bound region */ + int outOfBound = m_param->maxCUSize / 2; + uint32_t sum = (uint32_t)pow((outOfBound >> 2), 2); + index += sum; + } + cuLoc->switchCondition += m_param->num4x4Partitions; + } + + /* Completed writing 2 CTUs - move to the last remembered index of the next CTU row*/ + if (cuLoc->switchCondition == 2 * m_param->num4x4Partitions) + { + if (isBoundaryW) + cuLoc->evenRowIndex = *count + (m_param->num4x4Partitions * cuLoc->widthInCU); // end of row - skip to the next even row + else + cuLoc->evenRowIndex = *count; + *count = cuLoc->oddRowIndex; + + /* Height boundary case : + Skip to appropriate index when out of boundary cases occur + Out of boundary may occur when the out of bound pixels along + the height in low resoultion is greater than half of the maxCUSize */ + int isBoundaryH = (*count >= (m_param->num4x4Partitions * cuLoc->heightInCU * cuLoc->widthInCU)); + if (cuLoc->skipHeight && isBoundaryH) + { + if (flag) + index += 2; + else + { + int outOfBound = m_param->maxCUSize / 2; + uint32_t sum = (uint32_t)(2 * pow((abs(outOfBound) >> 2), 2)); + index += sum; + } + *count = cuLoc->evenRowIndex; + cuLoc->switchCondition = 0; + } + } + /* Completed writing 4 CTUs - move to the last remembered index of + the previous CTU row to copy the next save CTU's data*/ + else if (cuLoc->switchCondition == 4 * m_param->num4x4Partitions) + { + if (isBoundaryW) + cuLoc->oddRowIndex = *count + (m_param->num4x4Partitions * cuLoc->widthInCU); // end of row - skip to the next odd row + else + cuLoc->oddRowIndex = *count; + *count = cuLoc->evenRowIndex; + cuLoc->switchCondition = 0; + } + return index; +} + +/* save load + CTU0 CTU1 CTU2 CTU3 + 2NxN 2Nx2N 2Nx2N 2Nx2N 2Nx2N + NX2N 2Nx2N 2Nx2N 2Nx2N 2Nx2N + 2NxnU 2NxN 2NxN 2Nx2N 2Nx2N + 2NxnD 2Nx2N 2Nx2N 2NxN 2NxN + nLx2N Nx2N 2Nx2N Nx2N 2Nx2N + nRx2N 2Nx2N Nx2N 2Nx2N Nx2N +*/ +int Encoder::getPuShape(puOrientation* puOrient, int partSize, int numCTU) +{ + puOrient->isRect = true; + if (partSize == SIZE_Nx2N) + puOrient->isVert = true; + if (partSize >= SIZE_2NxnU) // All AMP modes + { + puOrient->isAmp = true; + puOrient->isRect = false; + if (partSize == SIZE_2NxnD && numCTU > 1) + return SIZE_2NxN; + else if (partSize == SIZE_2NxnU && numCTU < 2) + return SIZE_2NxN; + else if (partSize == SIZE_nLx2N) + { + puOrient->isVert = true; + if (!(numCTU % 2)) + return SIZE_Nx2N; + } + else if (partSize == SIZE_nRx2N) + { + puOrient->isVert = true; + if (numCTU % 2) + return SIZE_Nx2N; + } + } + return SIZE_2Nx2N; +} + +void Encoder::readAnalysisFile(x265_analysis_data* analysis, int curPoc, int sliceType) { #define 
X265_FREAD(val, size, readSize, fileOffset)\ if (fread(val, size, readSize, fileOffset) != readSize)\ {\ x265_log(NULL, X265_LOG_ERROR, "Error reading analysis 2 pass data\n"); \ - freeAnalysis2Pass(analysis2Pass, sliceType); \ + x265_alloc_analysis_data(m_param, analysis); \ m_aborted = true; \ return; \ }\ uint32_t depthBytes = 0; - uint32_t widthInCU = (m_param->sourceWidth + m_param->maxCUSize - 1) >> m_param->maxLog2CUSize; - uint32_t heightInCU = (m_param->sourceHeight + m_param->maxCUSize - 1) >> m_param->maxLog2CUSize; - uint32_t numCUsInFrame = widthInCU * heightInCU; - int poc; uint32_t frameRecordSize; X265_FREAD(&frameRecordSize, sizeof(uint32_t), 1, m_analysisFileIn); X265_FREAD(&depthBytes, sizeof(uint32_t), 1, m_analysisFileIn); @@ -3552,11 +4064,11 @@ if (poc != curPoc || feof(m_analysisFileIn)) { x265_log(NULL, X265_LOG_WARNING, "Error reading analysis 2 pass data: Cannot find POC %d\n", curPoc); - freeAnalysis2Pass(analysis2Pass, sliceType); + x265_free_analysis_data(m_param, analysis); return; } /* Now arrived at the right frame, read the record */ - analysis2Pass->frameRecordSize = frameRecordSize; + analysis->frameRecordSize = frameRecordSize; uint8_t* tempBuf = NULL, *depthBuf = NULL; sse_t *tempdistBuf = NULL, *distortionBuf = NULL; tempBuf = X265_MALLOC(uint8_t, depthBytes); @@ -3565,76 +4077,85 @@ X265_FREAD(tempdistBuf, sizeof(sse_t), depthBytes, m_analysisFileIn); depthBuf = tempBuf; distortionBuf = tempdistBuf; - analysis2PassFrameData* analysisFrameData = (analysis2PassFrameData*)analysis2Pass->analysisFramedata; + x265_analysis_data *analysisData = (x265_analysis_data*)analysis; + x265_analysis_intra_data *intraData = analysisData->intraData; + x265_analysis_inter_data *interData = analysisData->interData; + x265_analysis_distortion_data *distortionData = analysisData->distortionData; + size_t count = 0; uint32_t ctuCount = 0; double sum = 0, sqrSum = 0; for (uint32_t d = 0; d < depthBytes; d++) { - int bytes = m_param->num4x4Partitions >> (depthBuf[d] * 2); - memset(&analysisFrameData->depth[count], depthBuf[d], bytes); - analysisFrameData->distortion[count] = distortionBuf[d]; - analysisFrameData->ctuDistortion[ctuCount] += analysisFrameData->distortion[count]; + int bytes = analysis->numPartitions >> (depthBuf[d] * 2); + if (IS_X265_TYPE_I(sliceType)) + memset(&intraData->depth[count], depthBuf[d], bytes); + else + memset(&interData->depth[count], depthBuf[d], bytes); + distortionData->distortion[count] = distortionBuf[d]; + distortionData->ctuDistortion[ctuCount] += distortionData->distortion[count]; count += bytes; - if ((count % (unsigned)m_param->num4x4Partitions) == 0) + if ((count % (unsigned)analysis->numPartitions) == 0) { - analysisFrameData->scaledDistortion[ctuCount] = X265_LOG2(X265_MAX(analysisFrameData->ctuDistortion[ctuCount], 1)); - sum += analysisFrameData->scaledDistortion[ctuCount]; - sqrSum += analysisFrameData->scaledDistortion[ctuCount] * analysisFrameData->scaledDistortion[ctuCount]; + distortionData->scaledDistortion[ctuCount] = X265_LOG2(X265_MAX(distortionData->ctuDistortion[ctuCount], 1)); + sum += distortionData->scaledDistortion[ctuCount]; + sqrSum += distortionData->scaledDistortion[ctuCount] * distortionData->scaledDistortion[ctuCount]; ctuCount++; } } - double avg = sum / numCUsInFrame; - analysisFrameData->sdDistortion = pow(((sqrSum / numCUsInFrame) - (avg * avg)), 0.5); - analysisFrameData->averageDistortion = avg; - analysisFrameData->highDistortionCtuCount = analysisFrameData->lowDistortionCtuCount = 0; - for (uint32_t i 
= 0; i < numCUsInFrame; ++i) - { - analysisFrameData->threshold[i] = analysisFrameData->scaledDistortion[i] / analysisFrameData->averageDistortion; - analysisFrameData->offset[i] = (analysisFrameData->averageDistortion - analysisFrameData->scaledDistortion[i]) / analysisFrameData->sdDistortion; - if (analysisFrameData->threshold[i] < 0.9 && analysisFrameData->offset[i] >= 1) - analysisFrameData->lowDistortionCtuCount++; - else if (analysisFrameData->threshold[i] > 1.1 && analysisFrameData->offset[i] <= -1) - analysisFrameData->highDistortionCtuCount++; + double avg = sum / analysis->numCUsInFrame; + distortionData->sdDistortion = pow(((sqrSum / analysis->numCUsInFrame) - (avg * avg)), 0.5); + distortionData->averageDistortion = avg; + distortionData->highDistortionCtuCount = distortionData->lowDistortionCtuCount = 0; + for (uint32_t i = 0; i < analysis->numCUsInFrame; ++i) + { + distortionData->threshold[i] = distortionData->scaledDistortion[i] / distortionData->averageDistortion; + distortionData->offset[i] = (distortionData->averageDistortion - distortionData->scaledDistortion[i]) / distortionData->sdDistortion; + if (distortionData->threshold[i] < 0.9 && distortionData->offset[i] >= 1) + distortionData->lowDistortionCtuCount++; + else if (distortionData->threshold[i] > 1.1 && distortionData->offset[i] <= -1) + distortionData->highDistortionCtuCount++; } if (!IS_X265_TYPE_I(sliceType)) { MV *tempMVBuf[2], *MVBuf[2]; - int32_t *tempRefBuf[2], *refBuf[2]; - int *tempMvpBuf[2], *mvpBuf[2]; + int32_t *tempRefBuf, *refBuf; + uint8_t *tempMvpBuf[2], *mvpBuf[2]; uint8_t* tempModeBuf = NULL, *modeBuf = NULL; - int numDir = sliceType == X265_TYPE_P ? 1 : 2; + tempRefBuf = X265_MALLOC(int32_t, numDir * depthBytes); + for (int i = 0; i < numDir; i++) { tempMVBuf[i] = X265_MALLOC(MV, depthBytes); X265_FREAD(tempMVBuf[i], sizeof(MV), depthBytes, m_analysisFileIn); MVBuf[i] = tempMVBuf[i]; - tempMvpBuf[i] = X265_MALLOC(int, depthBytes); - X265_FREAD(tempMvpBuf[i], sizeof(int), depthBytes, m_analysisFileIn); + tempMvpBuf[i] = X265_MALLOC(uint8_t, depthBytes); + X265_FREAD(tempMvpBuf[i], sizeof(uint8_t), depthBytes, m_analysisFileIn); mvpBuf[i] = tempMvpBuf[i]; - tempRefBuf[i] = X265_MALLOC(int32_t, depthBytes); - X265_FREAD(tempRefBuf[i], sizeof(int32_t), depthBytes, m_analysisFileIn); - refBuf[i] = tempRefBuf[i]; + X265_FREAD(&tempRefBuf[i*depthBytes], sizeof(int32_t), depthBytes, m_analysisFileIn); } + refBuf = tempRefBuf; tempModeBuf = X265_MALLOC(uint8_t, depthBytes); X265_FREAD(tempModeBuf, sizeof(uint8_t), depthBytes, m_analysisFileIn); modeBuf = tempModeBuf; - + count = 0; + for (uint32_t d = 0; d < depthBytes; d++) { - size_t bytes = m_param->num4x4Partitions >> (depthBuf[d] * 2); + size_t bytes = analysis->numPartitions >> (depthBuf[d] * 2); for (int i = 0; i < numDir; i++) { + int32_t* ref = &(analysis->interData)->ref[i * analysis->numPartitions * analysis->numCUsInFrame]; for (size_t j = count, k = 0; k < bytes; j++, k++) { - memcpy(&((analysis2PassFrameData*)analysis2Pass->analysisFramedata)->m_mv[i][j], MVBuf[i] + d, sizeof(MV)); - memcpy(&((analysis2PassFrameData*)analysis2Pass->analysisFramedata)->mvpIdx[i][j], mvpBuf[i] + d, sizeof(int)); - memcpy(&((analysis2PassFrameData*)analysis2Pass->analysisFramedata)->ref[i][j], refBuf[i] + d, sizeof(int32_t)); + memcpy(&(analysis->interData)->mv[i][j], MVBuf[i] + d, sizeof(MV)); + memcpy(&(analysis->interData)->mvpIdx[i][j], mvpBuf[i] + d, sizeof(uint8_t)); + memcpy(&ref[j], refBuf + (i * depthBytes) + d, sizeof(int32_t)); } } - 
memset(&((analysis2PassFrameData *)analysis2Pass->analysisFramedata)->modes[count], modeBuf[d], bytes); + memset(&(analysis->interData)->modes[count], modeBuf[d], bytes); count += bytes; } @@ -3642,8 +4163,8 @@ { X265_FREE(tempMVBuf[i]); X265_FREE(tempMvpBuf[i]); - X265_FREE(tempRefBuf[i]); } + X265_FREE(tempRefBuf); X265_FREE(tempModeBuf); } X265_FREE(tempBuf); @@ -3659,7 +4180,7 @@ if (fwrite(val, size, writeSize, fileOffset) < writeSize)\ {\ x265_log(NULL, X265_LOG_ERROR, "Error writing analysis data\n");\ - freeAnalysis(analysis);\ + x265_free_analysis_data(m_param, analysis);\ m_aborted = true;\ return;\ }\ @@ -3668,6 +4189,15 @@ uint32_t numDir, numPlanes; bool bIntraInInter = false; + if (!analysis->poc) + { + if (validateAnalysisData(analysis, 1) == -1) + { + m_aborted = true; + return; + } + } + /* calculate frameRecordSize */ analysis->frameRecordSize = sizeof(analysis->frameRecordSize) + sizeof(depthBytes) + sizeof(analysis->poc) + sizeof(analysis->sliceType) + sizeof(analysis->numCUsInFrame) + sizeof(analysis->numPartitions) + sizeof(analysis->bScenecut) + sizeof(analysis->satdCost); @@ -3689,7 +4219,7 @@ uint8_t partSize = 0; CUData* ctu = curEncData.getPicCTU(cuAddr); - analysis_intra_data* intraDataCTU = (analysis_intra_data*)analysis->intraData; + x265_analysis_intra_data* intraDataCTU = analysis->intraData; for (uint32_t absPartIdx = 0; absPartIdx < ctu->m_numPartitions; depthBytes++) { @@ -3717,8 +4247,8 @@ uint8_t partSize = 0; CUData* ctu = curEncData.getPicCTU(cuAddr); - analysis_inter_data* interDataCTU = (analysis_inter_data*)analysis->interData; - analysis_intra_data* intraDataCTU = (analysis_intra_data*)analysis->intraData; + x265_analysis_inter_data* interDataCTU = analysis->interData; + x265_analysis_intra_data* intraDataCTU = analysis->intraData; for (uint32_t absPartIdx = 0; absPartIdx < ctu->m_numPartitions; depthBytes++) { @@ -3751,7 +4281,7 @@ { interDataCTU->mvpIdx[dir][depthBytes] = ctu->m_mvpIdx[dir][puabsPartIdx]; interDataCTU->refIdx[dir][depthBytes] = ctu->m_refIdx[dir][puabsPartIdx]; - interDataCTU->mv[dir][depthBytes] = ctu->m_mv[dir][puabsPartIdx]; + interDataCTU->mv[dir][depthBytes].word = ctu->m_mv[dir][puabsPartIdx].word; } } } @@ -3809,58 +4339,58 @@ if (analysis->sliceType == X265_TYPE_IDR || analysis->sliceType == X265_TYPE_I) { - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->depth, sizeof(uint8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->chromaModes, sizeof(uint8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->partSizes, sizeof(char), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFileOut); + X265_FWRITE((analysis->intraData)->depth, sizeof(uint8_t), depthBytes, m_analysisFileOut); + X265_FWRITE((analysis->intraData)->chromaModes, sizeof(uint8_t), depthBytes, m_analysisFileOut); + X265_FWRITE((analysis->intraData)->partSizes, sizeof(char), depthBytes, m_analysisFileOut); + X265_FWRITE((analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFileOut); } else { - X265_FWRITE(((analysis_inter_data*)analysis->interData)->depth, sizeof(uint8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->modes, sizeof(uint8_t), depthBytes, m_analysisFileOut); + X265_FWRITE((analysis->interData)->depth, sizeof(uint8_t), depthBytes, 
m_analysisFileOut); + X265_FWRITE((analysis->interData)->modes, sizeof(uint8_t), depthBytes, m_analysisFileOut); if (m_param->analysisReuseLevel > 4) { - X265_FWRITE(((analysis_inter_data*)analysis->interData)->partSize, sizeof(uint8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->mergeFlag, sizeof(uint8_t), depthBytes, m_analysisFileOut); + X265_FWRITE((analysis->interData)->partSize, sizeof(uint8_t), depthBytes, m_analysisFileOut); + X265_FWRITE((analysis->interData)->mergeFlag, sizeof(uint8_t), depthBytes, m_analysisFileOut); if (m_param->analysisReuseLevel == 10) { - X265_FWRITE(((analysis_inter_data*)analysis->interData)->interDir, sizeof(uint8_t), depthBytes, m_analysisFileOut); - if (bIntraInInter) X265_FWRITE(((analysis_intra_data*)analysis->intraData)->chromaModes, sizeof(uint8_t), depthBytes, m_analysisFileOut); + X265_FWRITE((analysis->interData)->interDir, sizeof(uint8_t), depthBytes, m_analysisFileOut); + if (bIntraInInter) X265_FWRITE((analysis->intraData)->chromaModes, sizeof(uint8_t), depthBytes, m_analysisFileOut); for (uint32_t dir = 0; dir < numDir; dir++) { - X265_FWRITE(((analysis_inter_data*)analysis->interData)->mvpIdx[dir], sizeof(uint8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->refIdx[dir], sizeof(int8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(((analysis_inter_data*)analysis->interData)->mv[dir], sizeof(MV), depthBytes, m_analysisFileOut); + X265_FWRITE((analysis->interData)->mvpIdx[dir], sizeof(uint8_t), depthBytes, m_analysisFileOut); + X265_FWRITE((analysis->interData)->refIdx[dir], sizeof(int8_t), depthBytes, m_analysisFileOut); + X265_FWRITE((analysis->interData)->mv[dir], sizeof(MV), depthBytes, m_analysisFileOut); } if (bIntraInInter) - X265_FWRITE(((analysis_intra_data*)analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFileOut); + X265_FWRITE((analysis->intraData)->modes, sizeof(uint8_t), analysis->numCUsInFrame * analysis->numPartitions, m_analysisFileOut); } } if (m_param->analysisReuseLevel != 10) - X265_FWRITE(((analysis_inter_data*)analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir, m_analysisFileOut); + X265_FWRITE((analysis->interData)->ref, sizeof(int32_t), analysis->numCUsInFrame * X265_MAX_PRED_MODE_PER_CTU * numDir, m_analysisFileOut); } #undef X265_FWRITE } -void Encoder::writeAnalysis2PassFile(x265_analysis_2Pass* analysis2Pass, FrameData &curEncData, int slicetype) +void Encoder::writeAnalysisFileRefine(x265_analysis_data* analysis, FrameData &curEncData) { #define X265_FWRITE(val, size, writeSize, fileOffset)\ if (fwrite(val, size, writeSize, fileOffset) < writeSize)\ {\ x265_log(NULL, X265_LOG_ERROR, "Error writing analysis 2 pass data\n"); \ - freeAnalysis2Pass(analysis2Pass, slicetype); \ + x265_free_analysis_data(m_param, analysis); \ m_aborted = true; \ return; \ }\ uint32_t depthBytes = 0; - uint32_t widthInCU = (m_param->sourceWidth + m_param->maxCUSize - 1) >> m_param->maxLog2CUSize; - uint32_t heightInCU = (m_param->sourceHeight + m_param->maxCUSize - 1) >> m_param->maxLog2CUSize; - uint32_t numCUsInFrame = widthInCU * heightInCU; - analysis2PassFrameData* analysisFrameData = (analysis2PassFrameData*)analysis2Pass->analysisFramedata; + x265_analysis_data *analysisData = (x265_analysis_data*)analysis; + x265_analysis_intra_data *intraData = analysisData->intraData; + x265_analysis_inter_data *interData = 
analysisData->interData; + x265_analysis_distortion_data *distortionData = analysisData->distortionData; - for (uint32_t cuAddr = 0; cuAddr < numCUsInFrame; cuAddr++) + for (uint32_t cuAddr = 0; cuAddr < analysis->numCUsInFrame; cuAddr++) { uint8_t depth = 0; @@ -3869,37 +4399,42 @@ for (uint32_t absPartIdx = 0; absPartIdx < ctu->m_numPartitions; depthBytes++) { depth = ctu->m_cuDepth[absPartIdx]; - analysisFrameData->depth[depthBytes] = depth; - analysisFrameData->distortion[depthBytes] = ctu->m_distortion[absPartIdx]; + if (curEncData.m_slice->m_sliceType == I_SLICE) + intraData->depth[depthBytes] = depth; + else + interData->depth[depthBytes] = depth; + distortionData->distortion[depthBytes] = ctu->m_distortion[absPartIdx]; absPartIdx += ctu->m_numPartitions >> (depth * 2); } } if (curEncData.m_slice->m_sliceType != I_SLICE) { + int32_t* ref[2]; + ref[0] = (analysis->interData)->ref; + ref[1] = &(analysis->interData)->ref[analysis->numPartitions * analysis->numCUsInFrame]; depthBytes = 0; - for (uint32_t cuAddr = 0; cuAddr < numCUsInFrame; cuAddr++) + for (uint32_t cuAddr = 0; cuAddr < analysis->numCUsInFrame; cuAddr++) { uint8_t depth = 0; uint8_t predMode = 0; CUData* ctu = curEncData.getPicCTU(cuAddr); - for (uint32_t absPartIdx = 0; absPartIdx < ctu->m_numPartitions; depthBytes++) { depth = ctu->m_cuDepth[absPartIdx]; - analysisFrameData->m_mv[0][depthBytes] = ctu->m_mv[0][absPartIdx]; - analysisFrameData->mvpIdx[0][depthBytes] = ctu->m_mvpIdx[0][absPartIdx]; - analysisFrameData->ref[0][depthBytes] = ctu->m_refIdx[0][absPartIdx]; + interData->mv[0][depthBytes].word = ctu->m_mv[0][absPartIdx].word; + interData->mvpIdx[0][depthBytes] = ctu->m_mvpIdx[0][absPartIdx]; + ref[0][depthBytes] = ctu->m_refIdx[0][absPartIdx]; predMode = ctu->m_predMode[absPartIdx]; if (ctu->m_refIdx[1][absPartIdx] != -1) { - analysisFrameData->m_mv[1][depthBytes] = ctu->m_mv[1][absPartIdx]; - analysisFrameData->mvpIdx[1][depthBytes] = ctu->m_mvpIdx[1][absPartIdx]; - analysisFrameData->ref[1][depthBytes] = ctu->m_refIdx[1][absPartIdx]; + interData->mv[1][depthBytes].word = ctu->m_mv[1][absPartIdx].word; + interData->mvpIdx[1][depthBytes] = ctu->m_mvpIdx[1][absPartIdx]; + ref[1][depthBytes] = ctu->m_refIdx[1][absPartIdx]; predMode = 4; // used as indiacator if the block is coded as bidir } - analysisFrameData->modes[depthBytes] = predMode; + interData->modes[depthBytes] = predMode; absPartIdx += ctu->m_numPartitions >> (depth * 2); } @@ -3907,34 +4442,40 @@ } /* calculate frameRecordSize */ - analysis2Pass->frameRecordSize = sizeof(analysis2Pass->frameRecordSize) + sizeof(depthBytes) + sizeof(analysis2Pass->poc); - - analysis2Pass->frameRecordSize += depthBytes * sizeof(uint8_t); - analysis2Pass->frameRecordSize += depthBytes * sizeof(sse_t); + analysis->frameRecordSize = sizeof(analysis->frameRecordSize) + sizeof(depthBytes) + sizeof(analysis->poc); + analysis->frameRecordSize += depthBytes * sizeof(uint8_t); + analysis->frameRecordSize += depthBytes * sizeof(sse_t); if (curEncData.m_slice->m_sliceType != I_SLICE) { int numDir = (curEncData.m_slice->m_sliceType == P_SLICE) ? 
1 : 2; - analysis2Pass->frameRecordSize += depthBytes * sizeof(MV) * numDir; - analysis2Pass->frameRecordSize += depthBytes * sizeof(int32_t) * numDir; - analysis2Pass->frameRecordSize += depthBytes * sizeof(int) * numDir; - analysis2Pass->frameRecordSize += depthBytes * sizeof(uint8_t); + analysis->frameRecordSize += depthBytes * sizeof(MV) * numDir; + analysis->frameRecordSize += depthBytes * sizeof(int32_t) * numDir; + analysis->frameRecordSize += depthBytes * sizeof(uint8_t) * numDir; + analysis->frameRecordSize += depthBytes * sizeof(uint8_t); } - X265_FWRITE(&analysis2Pass->frameRecordSize, sizeof(uint32_t), 1, m_analysisFileOut); + X265_FWRITE(&analysis->frameRecordSize, sizeof(uint32_t), 1, m_analysisFileOut); X265_FWRITE(&depthBytes, sizeof(uint32_t), 1, m_analysisFileOut); - X265_FWRITE(&analysis2Pass->poc, sizeof(uint32_t), 1, m_analysisFileOut); - - X265_FWRITE(analysisFrameData->depth, sizeof(uint8_t), depthBytes, m_analysisFileOut); - X265_FWRITE(analysisFrameData->distortion, sizeof(sse_t), depthBytes, m_analysisFileOut); + X265_FWRITE(&analysis->poc, sizeof(uint32_t), 1, m_analysisFileOut); + if (curEncData.m_slice->m_sliceType == I_SLICE) + { + X265_FWRITE((analysis->intraData)->depth, sizeof(uint8_t), depthBytes, m_analysisFileOut); + } + else + { + X265_FWRITE((analysis->interData)->depth, sizeof(uint8_t), depthBytes, m_analysisFileOut); + } + X265_FWRITE(distortionData->distortion, sizeof(sse_t), depthBytes, m_analysisFileOut); if (curEncData.m_slice->m_sliceType != I_SLICE) { int numDir = curEncData.m_slice->m_sliceType == P_SLICE ? 1 : 2; for (int i = 0; i < numDir; i++) { - X265_FWRITE(analysisFrameData->m_mv[i], sizeof(MV), depthBytes, m_analysisFileOut); - X265_FWRITE(analysisFrameData->mvpIdx[i], sizeof(int), depthBytes, m_analysisFileOut); - X265_FWRITE(analysisFrameData->ref[i], sizeof(int32_t), depthBytes, m_analysisFileOut); + int32_t* ref = &(analysis->interData)->ref[i * analysis->numPartitions * analysis->numCUsInFrame]; + X265_FWRITE(interData->mv[i], sizeof(MV), depthBytes, m_analysisFileOut); + X265_FWRITE(interData->mvpIdx[i], sizeof(uint8_t), depthBytes, m_analysisFileOut); + X265_FWRITE(ref, sizeof(int32_t), depthBytes, m_analysisFileOut); } - X265_FWRITE(analysisFrameData->modes, sizeof(uint8_t), depthBytes, m_analysisFileOut); + X265_FWRITE((analysis->interData)->modes, sizeof(uint8_t), depthBytes, m_analysisFileOut); } #undef X265_FWRITE } @@ -3969,6 +4510,51 @@ TOOLCMP(oldParam->rc.rfConstant, newParam->rc.rfConstant, "crf=%f to %f\n"); } +void Encoder::readUserSeiFile(x265_sei_payload& seiMsg, int curPoc) +{ + char line[1024]; + while (fgets(line, sizeof(line), m_naluFile)) + { + int poc = atoi(strtok(line, " ")); + char *prefix = strtok(NULL, " "); + int nalType = atoi(strtok(NULL, "/")); + int payloadType = atoi(strtok(NULL, " ")); + char *base64Encode = strtok(NULL, "\n"); + int base64EncodeLength = (int)strlen(base64Encode); + char *base64Decode = SEI::base64Decode(base64Encode, base64EncodeLength); + if (nalType == NAL_UNIT_PREFIX_SEI && (!strcmp(prefix, "PREFIX"))) + { + int currentPOC = curPoc; + if (currentPOC == poc) + { + seiMsg.payloadSize = (base64EncodeLength / 4) * 3; + seiMsg.payload = (uint8_t*)x265_malloc(sizeof(uint8_t) * seiMsg.payloadSize); + if (!seiMsg.payload) + { + x265_log(m_param, X265_LOG_ERROR, "Unable to allocate memory for SEI payload\n"); + break; + } + if (payloadType == 4) + seiMsg.payloadType = USER_DATA_REGISTERED_ITU_T_T35; + else if (payloadType == 5) + seiMsg.payloadType = USER_DATA_UNREGISTERED; + else + { + 
x265_log(m_param, X265_LOG_WARNING, "Unsupported SEI payload Type for frame %d\n", poc); + break; + } + memcpy(seiMsg.payload, base64Decode, seiMsg.payloadSize); + break; + } + } + else + { + x265_log(m_param, X265_LOG_WARNING, "SEI message for frame %d is not inserted. Will support only PREFIX SEI messages.\n", poc); + break; + } + } +} + bool Encoder::computeSPSRPSIndex() { RPS* rpsInSPS = m_sps.spsrps;
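The readUserSeiFile() routine added above (backing the CEA 608/708 user-SEI feature listed in the 2.9 changelog) pulls one message per line out of the NALU text file via strtok(). Below is a minimal standalone sketch of the per-line format it appears to expect — "<poc> PREFIX <nalType>/<payloadType> <base64 payload>" — with a purely hypothetical sample entry; the real decoding is done by SEI::base64Decode() inside x265.

// Sketch only: parses one hypothetical NALU-file line the same way
// readUserSeiFile() does. The sample line is illustrative, not from x265.
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
    char line[1024] = "12 PREFIX 39/4 dGVzdCBwYXlsb2Fk"; // hypothetical entry for POC 12
    int poc            = atoi(strtok(line, " "));
    const char *prefix = strtok(NULL, " ");     // must literally be "PREFIX"
    int nalType        = atoi(strtok(NULL, "/")); // 39 == NAL_UNIT_PREFIX_SEI in HEVC
    int payloadType    = atoi(strtok(NULL, " ")); // 4 -> ITU-T T.35, 5 -> unregistered
    const char *base64 = strtok(NULL, "\n");      // rest of the line is the payload

    // Same size estimate as the encoder: every 4 base64 chars decode to 3 bytes.
    int payloadSize = ((int)strlen(base64) / 4) * 3;

    printf("poc=%d prefix=%s nalType=%d payloadType=%d payloadSize=%d\n",
           poc, prefix, nalType, payloadType, payloadSize);
    return 0;
}

Only prefix SEI payloads of type 4 (ITU-T T.35) and 5 (unregistered user data) are accepted; anything else hits the warning paths visible in the diff.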
x265_2.7.tar.gz/source/encoder/encoder.h -> x265_2.9.tar.gz/source/encoder/encoder.h
Changed
@@ -90,6 +90,43 @@ RPSListNode* prior; }; +struct cuLocation +{ + bool skipWidth; + bool skipHeight; + uint32_t heightInCU; + uint32_t widthInCU; + uint32_t oddRowIndex; + uint32_t evenRowIndex; + uint32_t switchCondition; + + void init(x265_param* param) + { + skipHeight = false; + skipWidth = false; + heightInCU = (param->sourceHeight + param->maxCUSize - 1) >> param->maxLog2CUSize; + widthInCU = (param->sourceWidth + param->maxCUSize - 1) >> param->maxLog2CUSize; + evenRowIndex = 0; + oddRowIndex = param->num4x4Partitions * widthInCU; + switchCondition = 0; // To switch between odd and even rows + } +}; + +struct puOrientation +{ + bool isVert; + bool isRect; + bool isAmp; + + void init() + { + isRect = false; + isAmp = false; + isVert = false; + } +}; + + class FrameEncoder; class DPB; class Lookahead; @@ -132,6 +169,7 @@ Frame* m_exportedPic; FILE* m_analysisFileIn; FILE* m_analysisFileOut; + FILE* m_naluFile; x265_param* m_param; x265_param* m_latestParam; // Holds latest param during a reconfigure RateControl* m_rateControl; @@ -175,6 +213,7 @@ double m_cR; int m_bToneMap; // Enables tone-mapping + int m_enableNal; #ifdef ENABLE_HDR10_PLUS const hdr10plus_api *m_hdr10plus_api; @@ -184,6 +223,15 @@ x265_sei_payload m_prevTonemapPayload; + /* Collect frame level feature data */ + uint64_t* m_rdCost; + uint64_t* m_variance; + uint32_t* m_trainingCount; + int32_t m_startPoint; + Lock m_dynamicRefineLock; + + bool m_saveCTUSize; + Encoder(); ~Encoder() { @@ -227,21 +275,26 @@ void updateVbvPlan(RateControl* rc); - void allocAnalysis(x265_analysis_data* analysis); + void readAnalysisFile(x265_analysis_data* analysis, int poc, int sliceType); + + void readAnalysisFile(x265_analysis_data* analysis, int poc, const x265_picture* picIn, int paramBytes); - void freeAnalysis(x265_analysis_data* analysis); + void readAnalysisFile(x265_analysis_data* analysis, int poc, const x265_picture* picIn, int paramBytes, cuLocation cuLoc); - void allocAnalysis2Pass(x265_analysis_2Pass* analysis, int sliceType); + int getCUIndex(cuLocation* cuLoc, uint32_t* count, int bytes, int flag); - void freeAnalysis2Pass(x265_analysis_2Pass* analysis, int sliceType); + int getPuShape(puOrientation* puOrient, int partSize, int numCTU); - void readAnalysisFile(x265_analysis_data* analysis, int poc, const x265_picture* picIn); + void writeAnalysisFile(x265_analysis_data* analysis, FrameData &curEncData); + + void writeAnalysisFileRefine(x265_analysis_data* analysis, FrameData &curEncData); - void writeAnalysisFile(x265_analysis_data* pic, FrameData &curEncData); - void readAnalysis2PassFile(x265_analysis_2Pass* analysis2Pass, int poc, int sliceType); - void writeAnalysis2PassFile(x265_analysis_2Pass* analysis2Pass, FrameData &curEncData, int slicetype); void finishFrameStats(Frame* pic, FrameEncoder *curEncoder, x265_frame_stats* frameStats, int inPoc); + int validateAnalysisData(x265_analysis_data* analysis, int readWriteFlag); + + void readUserSeiFile(x265_sei_payload& seiMsg, int poc); + void calcRefreshInterval(Frame* frameEnc); void initRefIdx(); @@ -249,6 +302,8 @@ void updateRefIdx(); bool computeSPSRPSIndex(); + void copyUserSEIMessages(Frame *frame, const x265_picture* pic_in); + protected: void initVPS(VPS *vps);
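The cuLocation helper declared above feeds getCUIndex(), which replicates each saved CTU into a 2x2 block of load-resolution CTUs when analysis data is reused across a 2x scale factor. Here is a minimal sketch of the same grid arithmetic, assuming an illustrative 1920x1080 source with 64x64 CTUs; the x265_param field names are copied from the diff, while param_stub itself is not part of x265.

// Sketch of cuLocation::init()'s arithmetic for a hypothetical 1920x1080 source:
// widthInCU = 30, heightInCU = 17, and with 256 4x4 partitions per 64x64 CTU
// the first odd-row partition index is 256 * 30 = 7680.
#include <cstdint>
#include <cstdio>

struct param_stub            // only the fields cuLocation::init() reads
{
    int sourceWidth, sourceHeight;
    uint32_t maxCUSize, maxLog2CUSize, num4x4Partitions;
};

int main()
{
    param_stub p = { 1920, 1080, 64, 6, 256 };

    uint32_t widthInCU   = (p.sourceWidth  + p.maxCUSize - 1) >> p.maxLog2CUSize;
    uint32_t heightInCU  = (p.sourceHeight + p.maxCUSize - 1) >> p.maxLog2CUSize;
    uint32_t oddRowIndex = p.num4x4Partitions * widthInCU; // first 4x4 index of CTU row 1

    printf("%u x %u CTUs, odd row starts at partition %u\n",
           widthInCU, heightInCU, oddRowIndex);            // 30 x 17 CTUs, 7680
    return 0;
}

evenRowIndex and oddRowIndex simply remember where the two consecutive load rows left off, so getCUIndex() can toggle between them after every pair of copied CTUs, as the comment above that function describes.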
x265_2.7.tar.gz/source/encoder/entropy.cpp -> x265_2.9.tar.gz/source/encoder/entropy.cpp
Changed
@@ -1369,8 +1369,8 @@ } bDenomCoded = true; } - WRITE_FLAG(wp[0].bPresentFlag, "luma_weight_lX_flag"); - totalSignalledWeightFlags += wp[0].bPresentFlag; + WRITE_FLAG(!!wp[0].wtPresent, "luma_weight_lX_flag"); + totalSignalledWeightFlags += wp[0].wtPresent; } if (bChroma) @@ -1378,15 +1378,15 @@ for (int ref = 0; ref < slice.m_numRefIdx[list]; ref++) { wp = slice.m_weightPredTable[list][ref]; - WRITE_FLAG(wp[1].bPresentFlag, "chroma_weight_lX_flag"); - totalSignalledWeightFlags += 2 * wp[1].bPresentFlag; + WRITE_FLAG(!!wp[1].wtPresent, "chroma_weight_lX_flag"); + totalSignalledWeightFlags += 2 * wp[1].wtPresent; } } for (int ref = 0; ref < slice.m_numRefIdx[list]; ref++) { wp = slice.m_weightPredTable[list][ref]; - if (wp[0].bPresentFlag) + if (wp[0].wtPresent) { int deltaWeight = (wp[0].inputWeight - (1 << wp[0].log2WeightDenom)); WRITE_SVLC(deltaWeight, "delta_luma_weight_lX"); @@ -1395,7 +1395,7 @@ if (bChroma) { - if (wp[1].bPresentFlag) + if (wp[1].wtPresent) { for (int plane = 1; plane < 3; plane++) {
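The entropy.cpp hunk swaps the old bPresentFlag bool for !!wp[...].wtPresent. The double negation matters if wtPresent can now hold any nonzero value (an assumption based on the rename alone): WRITE_FLAG emits a single bit and totalSignalledWeightFlags counts signalled flags, so both need an exact 0 or 1. A trivial sketch of the idiom, with writeFlag standing in for the real WRITE_FLAG macro:

// Sketch of the !! normalization used in codePredWeightTable above.
#include <cstdio>

static void writeFlag(unsigned value, const char *name)
{
    // stand-in for WRITE_FLAG: a flag field may only carry 0 or 1
    printf("%s = %u\n", name, value);
}

int main()
{
    int wtPresent = 3;                               // hypothetical non-boolean content
    writeFlag(!!wtPresent, "luma_weight_lX_flag");   // writes 1, not 3

    int totalSignalledWeightFlags = 0;
    totalSignalledWeightFlags += !!wtPresent;        // counts exactly one flag
    printf("total = %d\n", totalSignalledWeightFlags);
    return 0;
}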
x265_2.7.tar.gz/source/encoder/frameencoder.cpp -> x265_2.9.tar.gz/source/encoder/frameencoder.cpp
Changed
@@ -179,7 +179,7 @@ ok &= m_rce.picTimingSEI && m_rce.hrdTiming; } - if (m_param->noiseReductionIntra || m_param->noiseReductionInter || m_param->rc.vbvBufferSize) + if (m_param->noiseReductionIntra || m_param->noiseReductionInter) m_nr = X265_MALLOC(NoiseReduction, 1); if (m_nr) memset(m_nr, 0, sizeof(NoiseReduction)); @@ -365,6 +365,65 @@ return length; } +bool FrameEncoder::writeToneMapInfo(x265_sei_payload *payload) +{ + bool payloadChange = false; + if (m_top->m_prevTonemapPayload.payload != NULL && payload->payloadSize == m_top->m_prevTonemapPayload.payloadSize) + { + if (memcmp(m_top->m_prevTonemapPayload.payload, payload->payload, payload->payloadSize) != 0) + payloadChange = true; + } + else + { + payloadChange = true; + if (m_top->m_prevTonemapPayload.payload != NULL) + x265_free(m_top->m_prevTonemapPayload.payload); + m_top->m_prevTonemapPayload.payload = (uint8_t*)x265_malloc(sizeof(uint8_t)* payload->payloadSize); + } + + if (payloadChange) + { + m_top->m_prevTonemapPayload.payloadType = payload->payloadType; + m_top->m_prevTonemapPayload.payloadSize = payload->payloadSize; + memcpy(m_top->m_prevTonemapPayload.payload, payload->payload, payload->payloadSize); + } + + bool isIDR = m_frame->m_lowres.sliceType == X265_TYPE_IDR; + return (payloadChange || isIDR); +} + +void FrameEncoder::writeTrailingSEIMessages() +{ + Slice* slice = m_frame->m_encData->m_slice; + int planes = (m_param->internalCsp != X265_CSP_I400) ? 3 : 1; + int32_t payloadSize = 0; + + if (m_param->decodedPictureHashSEI == 1) + { + m_seiReconPictureDigest.m_method = SEIDecodedPictureHash::MD5; + for (int i = 0; i < planes; i++) + MD5Final(&m_seiReconPictureDigest.m_state[i], m_seiReconPictureDigest.m_digest[i]); + payloadSize = 1 + 16 * planes; + } + else if (m_param->decodedPictureHashSEI == 2) + { + m_seiReconPictureDigest.m_method = SEIDecodedPictureHash::CRC; + for (int i = 0; i < planes; i++) + crcFinish(m_seiReconPictureDigest.m_crc[i], m_seiReconPictureDigest.m_digest[i]); + payloadSize = 1 + 2 * planes; + } + else if (m_param->decodedPictureHashSEI == 3) + { + m_seiReconPictureDigest.m_method = SEIDecodedPictureHash::CHECKSUM; + for (int i = 0; i < planes; i++) + checksumFinish(m_seiReconPictureDigest.m_checksum[i], m_seiReconPictureDigest.m_digest[i]); + payloadSize = 1 + 4 * planes; + } + + m_seiReconPictureDigest.setSize(payloadSize); + m_seiReconPictureDigest.writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_SUFFIX_SEI, m_nalList, false); +} + void FrameEncoder::compressFrame() { ProfileScopeEvent(frameThread); @@ -393,6 +452,7 @@ * not repeating headers (since AUD is supposed to be the first NAL in the access * unit) */ Slice* slice = m_frame->m_encData->m_slice; + if (m_param->bEnableAccessUnitDelimiters && (m_frame->m_poc || m_param->bRepeatHeaders)) { m_bs.resetBits(); @@ -400,6 +460,8 @@ m_entropyCoder.codeAUD(*slice); m_bs.writeByteAlignment(); m_nalList.serialize(NAL_UNIT_ACCESS_UNIT_DELIMITER, m_bs); + if (m_param->bSingleSeiNal) + m_bs.resetBits(); } if (m_frame->m_lowres.bKeyframe && m_param->bRepeatHeaders) { @@ -459,9 +521,7 @@ wa.waitForExit(); else weightAnalyse(*slice, *m_frame, *m_param); - } - } else slice->disableWeights(); @@ -475,7 +535,7 @@ for (int ref = 0; ref < slice->m_numRefIdx[l]; ref++) { WeightParam *w = NULL; - if ((bUseWeightP || bUseWeightB) && slice->m_weightPredTable[l][ref][0].bPresentFlag) + if ((bUseWeightP || bUseWeightB) && slice->m_weightPredTable[l][ref][0].wtPresent) w = slice->m_weightPredTable[l][ref]; slice->m_refReconPicList[l][ref] = 
slice->m_refFrameList[l][ref]->m_reconPic; m_mref[l][ref].init(slice->m_refReconPicList[l][ref], w, *m_param); @@ -496,41 +556,6 @@ /* Get the QP for this frame from rate control. This call may block until * frames ahead of it in encode order have called rateControlEnd() */ - m_rce.encodeOrder = m_frame->m_encodeOrder; - bool payloadChange = false; - bool writeSei = true; - if (m_param->bDhdr10opt) - { - for (int i = 0; i < m_frame->m_userSEI.numPayloads; i++) - { - x265_sei_payload *payload = &m_frame->m_userSEI.payloads[i]; - if(payload->payloadType == USER_DATA_REGISTERED_ITU_T_T35) - { - if (m_top->m_prevTonemapPayload.payload != NULL && payload->payloadSize == m_top->m_prevTonemapPayload.payloadSize) - { - if (memcmp(m_top->m_prevTonemapPayload.payload, payload->payload, payload->payloadSize) != 0) - payloadChange = true; - } - else - { - payloadChange = true; - if (m_top->m_prevTonemapPayload.payload != NULL) - x265_free(m_top->m_prevTonemapPayload.payload); - m_top->m_prevTonemapPayload.payload = (uint8_t*)x265_malloc(sizeof(uint8_t) * payload->payloadSize); - } - - if (payloadChange) - { - m_top->m_prevTonemapPayload.payloadType = payload->payloadType; - m_top->m_prevTonemapPayload.payloadSize = payload->payloadSize; - memcpy(m_top->m_prevTonemapPayload.payload, payload->payload, payload->payloadSize); - } - - bool isIDR = m_frame->m_lowres.sliceType == X265_TYPE_IDR; - writeSei = payloadChange || isIDR; - } - } - } int qp = m_top->m_rateControl->rateControlStart(m_frame, &m_rce, m_top); m_rce.newQp = qp; @@ -594,7 +619,6 @@ /* reset entropy coders and compute slice id */ m_entropyCoder.load(m_initSliceContext); - for (uint32_t sliceId = 0; sliceId < m_param->maxSlices; sliceId++) for (uint32_t row = m_sliceBaseRow[sliceId]; row < m_sliceBaseRow[sliceId + 1]; row++) m_rows[row].init(m_initSliceContext, sliceId); @@ -620,6 +644,7 @@ m_outStreams[i].resetBits(); } + m_rce.encodeOrder = m_frame->m_encodeOrder; int prevBPSEI = m_rce.encodeOrder ? m_top->m_lastBPSEI : 0; if (m_frame->m_lowres.bKeyframe) @@ -632,18 +657,22 @@ bpSei->m_auCpbRemovalDelayDelta = 1; bpSei->m_cpbDelayOffset = 0; bpSei->m_dpbDelayOffset = 0; - // hrdFullness() calculates the initial CPB removal delay and offset m_top->m_rateControl->hrdFullness(bpSei); - - m_bs.resetBits(); - bpSei->write(m_bs, *slice->m_sps); - m_bs.writeByteAlignment(); - - m_nalList.serialize(NAL_UNIT_PREFIX_SEI, m_bs); + bpSei->writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal); m_top->m_lastBPSEI = m_rce.encodeOrder; } + + if (m_frame->m_lowres.sliceType == X265_TYPE_IDR && m_param->bEmitIDRRecoverySEI) + { + /* Recovery Point SEI require the SPS to be "activated" */ + SEIRecoveryPoint sei; + sei.m_recoveryPocCnt = 0; + sei.m_exactMatchingFlag = true; + sei.m_brokenLinkFlag = false; + sei.writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal); + } } if ((m_param->bEmitHRDSEI || !!m_param->interlaceMode)) @@ -660,8 +689,10 @@ else if (m_param->interlaceMode == 1) sei->m_picStruct = (poc & 1) ? 2 /* bottom */ : 1 /* top */; else - sei->m_picStruct = 0; - sei->m_sourceScanType = 0; + sei->m_picStruct = m_param->pictureStructure; + + sei->m_sourceScanType = m_param->interlaceMode ? 
0 : 1; + sei->m_duplicateFlag = false; } @@ -675,10 +706,14 @@ sei->m_picDpbOutputDelay = slice->m_sps->numReorderPics + poc - m_rce.encodeOrder; } - m_bs.resetBits(); - sei->write(m_bs, *slice->m_sps); - m_bs.writeByteAlignment(); - m_nalList.serialize(NAL_UNIT_PREFIX_SEI, m_bs); + sei->writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal); + } + + if (m_param->preferredTransferCharacteristics > -1 && slice->isIRAP()) + { + SEIAlternativeTC m_seiAlternativeTC; + m_seiAlternativeTC.m_preferredTransferCharacteristics = m_param->preferredTransferCharacteristics; + m_seiAlternativeTC.writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal); } /* Write user SEI */ @@ -689,28 +724,33 @@ { SEIuserDataUnregistered sei; sei.m_userData = payload->payload; - m_bs.resetBits(); sei.setSize(payload->payloadSize); - sei.write(m_bs, *slice->m_sps); - m_bs.writeByteAlignment(); - m_nalList.serialize(NAL_UNIT_PREFIX_SEI, m_bs); + sei.writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal); } else if (payload->payloadType == USER_DATA_REGISTERED_ITU_T_T35) { + bool writeSei = m_param->bDhdr10opt ? writeToneMapInfo(payload) : true; if (writeSei) { - SEICreativeIntentMeta sei; - sei.m_payload = payload->payload; - m_bs.resetBits(); + SEIuserDataRegistered sei; + sei.m_userData = payload->payload; sei.setSize(payload->payloadSize); - sei.write(m_bs, *slice->m_sps); - m_bs.writeByteAlignment(); - m_nalList.serialize(NAL_UNIT_PREFIX_SEI, m_bs); + sei.writeSEImessages(m_bs, *slice->m_sps, NAL_UNIT_PREFIX_SEI, m_nalList, m_param->bSingleSeiNal); } } else x265_log(m_param, X265_LOG_ERROR, "Unrecognized SEI type\n"); } + + bool isSei = ((m_frame->m_lowres.bKeyframe && m_param->bRepeatHeaders) || m_param->bEmitHRDSEI || + !!m_param->interlaceMode || (m_frame->m_lowres.sliceType == X265_TYPE_IDR && m_param->bEmitIDRRecoverySEI) || + m_frame->m_userSEI.numPayloads); + + if (isSei && m_param->bSingleSeiNal) + { + m_bs.writeByteAlignment(); + m_nalList.serialize(NAL_UNIT_PREFIX_SEI, m_bs); + } /* CQP and CRF (without capped VBV) doesn't use mid-frame statistics to * tune RateControl parameters for other frames. * Hence, for these modes, update m_startEndOrder and unlock RC for previous threads waiting in @@ -724,6 +764,9 @@ m_top->m_rateControl->m_startEndOrder.incr(); // faked rateControlEnd calls for negative frames } + if (m_param->bDynamicRefine) + computeAvgTrainingData(); + /* Analyze CTU rows, most of the hard work is done here. Frame is * compressed in a wave-front pattern if WPP is enabled. 
Row based loop * filters runs behind the CTU compression and reconstruction */ @@ -835,76 +878,19 @@ m_frameFilter.processRow(i - m_filterRowDelay); } } +#if ENABLE_LIBVMAF + vmafFrameLevelScore(); +#endif if (m_param->maxSlices > 1) { PicYuv *reconPic = m_frame->m_reconPic; uint32_t height = reconPic->m_picHeight; - uint32_t width = reconPic->m_picWidth; - intptr_t stride = reconPic->m_stride; - const uint32_t hChromaShift = CHROMA_H_SHIFT(m_param->internalCsp); - const uint32_t vChromaShift = CHROMA_V_SHIFT(m_param->internalCsp); + initDecodedPictureHashSEI(0, 0, height); + } - if (m_param->decodedPictureHashSEI == 1) - { - - MD5Init(&m_state[0]); - - updateMD5Plane(m_state[0], reconPic->m_picOrg[0], width, height, stride); - - if (m_param->internalCsp != X265_CSP_I400) - { - MD5Init(&m_state[1]); - MD5Init(&m_state[2]); - - width >>= hChromaShift; - height >>= vChromaShift; - stride = reconPic->m_strideC; - - updateMD5Plane(m_state[1], reconPic->m_picOrg[1], width, height, stride); - updateMD5Plane(m_state[2], reconPic->m_picOrg[2], width, height, stride); - } - } - // TODO: NOT verify code in below mode - else if (m_param->decodedPictureHashSEI == 2) - { - m_crc[0] = 0xffff; - - updateCRC(reconPic->m_picOrg[0], m_crc[0], height, width, stride); - - if (m_param->internalCsp != X265_CSP_I400) - { - width >>= hChromaShift; - height >>= vChromaShift; - stride = reconPic->m_strideC; - m_crc[1] = m_crc[2] = 0xffff; - - updateCRC(reconPic->m_picOrg[1], m_crc[1], height, width, stride); - updateCRC(reconPic->m_picOrg[2], m_crc[2], height, width, stride); - } - } - else if (m_param->decodedPictureHashSEI == 3) - { - uint32_t cuHeight = m_param->maxCUSize; - - m_checksum[0] = 0; - - updateChecksum(reconPic->m_picOrg[0], m_checksum[0], height, width, stride, 0, cuHeight); - - if (m_param->internalCsp != X265_CSP_I400) - { - width >>= hChromaShift; - height >>= vChromaShift; - stride = reconPic->m_strideC; - cuHeight >>= vChromaShift; - - m_checksum[1] = m_checksum[2] = 0; - - updateChecksum(reconPic->m_picOrg[1], m_checksum[1], height, width, stride, 0, cuHeight); - updateChecksum(reconPic->m_picOrg[2], m_checksum[2], height, width, stride, 0, cuHeight); - } - } - } // end of (m_param->maxSlices > 1) + if (m_param->bDynamicRefine && m_top->m_startPoint <= m_frame->m_encodeOrder) //Avoid collecting data that will not be used by future frames. + collectDynDataFrame(); if (m_param->rc.bStatWrite) { @@ -992,8 +978,6 @@ m_bs.resetBits(); const uint32_t sliceAddr = nextSliceRow * m_numCols; - //CUData* ctu = m_frame->m_encData->getPicCTU(sliceAddr); - //const int sliceQp = ctu->m_qp[0]; if (m_param->bOptRefListLengthPPS) { ScopedLock refIdxLock(m_top->m_sliceRefIdxLock); @@ -1040,38 +1024,8 @@ m_nalList.serialize(slice->m_nalUnitType, m_bs); } - if (m_param->decodedPictureHashSEI) - { - int planes = (m_frame->m_param->internalCsp != X265_CSP_I400) ? 
3 : 1; - int32_t payloadSize = 0; - if (m_param->decodedPictureHashSEI == 1) - { - m_seiReconPictureDigest.m_method = SEIDecodedPictureHash::MD5; - for (int i = 0; i < planes; i++) - MD5Final(&m_state[i], m_seiReconPictureDigest.m_digest[i]); - payloadSize = 1 + 16 * planes; - } - else if (m_param->decodedPictureHashSEI == 2) - { - m_seiReconPictureDigest.m_method = SEIDecodedPictureHash::CRC; - for (int i = 0; i < planes; i++) - crcFinish(m_crc[i], m_seiReconPictureDigest.m_digest[i]); - payloadSize = 1 + 2 * planes; - } - else if (m_param->decodedPictureHashSEI == 3) - { - m_seiReconPictureDigest.m_method = SEIDecodedPictureHash::CHECKSUM; - for (int i = 0; i < planes; i++) - checksumFinish(m_checksum[i], m_seiReconPictureDigest.m_digest[i]); - payloadSize = 1 + 4 * planes; - } - m_bs.resetBits(); - m_seiReconPictureDigest.setSize(payloadSize); - m_seiReconPictureDigest.write(m_bs, *slice->m_sps); - m_bs.writeByteAlignment(); - m_nalList.serialize(NAL_UNIT_SUFFIX_SEI, m_bs); - } + writeTrailingSEIMessages(); uint64_t bytes = 0; for (uint32_t i = 0; i < m_nalList.m_numNal; i++) @@ -1160,7 +1114,79 @@ m_cuStats.accumulate(m_tld[i].analysis.m_stats[m_jpId], *m_param); #endif - m_endFrameTime = x265_mdate(); + m_endFrameTime = x265_mdate(); +} + +void FrameEncoder::initDecodedPictureHashSEI(int row, int cuAddr, int height) +{ + PicYuv *reconPic = m_frame->m_reconPic; + uint32_t width = reconPic->m_picWidth; + intptr_t stride = reconPic->m_stride; + uint32_t maxCUHeight = m_param->maxCUSize; + + const uint32_t hChromaShift = CHROMA_H_SHIFT(m_param->internalCsp); + const uint32_t vChromaShift = CHROMA_V_SHIFT(m_param->internalCsp); + + if (m_param->decodedPictureHashSEI == 1) + { + if (!row) + MD5Init(&m_seiReconPictureDigest.m_state[0]); + + updateMD5Plane(m_seiReconPictureDigest.m_state[0], reconPic->getLumaAddr(cuAddr), width, height, stride); + if (m_param->internalCsp != X265_CSP_I400) + { + if (!row) + { + MD5Init(&m_seiReconPictureDigest.m_state[1]); + MD5Init(&m_seiReconPictureDigest.m_state[2]); + } + + width >>= hChromaShift; + height >>= vChromaShift; + stride = reconPic->m_strideC; + + updateMD5Plane(m_seiReconPictureDigest.m_state[1], reconPic->getCbAddr(cuAddr), width, height, stride); + updateMD5Plane(m_seiReconPictureDigest.m_state[2], reconPic->getCrAddr(cuAddr), width, height, stride); + } + } + else if (m_param->decodedPictureHashSEI == 2) + { + + if (!row) + m_seiReconPictureDigest.m_crc[0] = 0xffff; + + updateCRC(reconPic->getLumaAddr(cuAddr), m_seiReconPictureDigest.m_crc[0], height, width, stride); + if (m_param->internalCsp != X265_CSP_I400) + { + width >>= hChromaShift; + height >>= vChromaShift; + stride = reconPic->m_strideC; + m_seiReconPictureDigest.m_crc[1] = m_seiReconPictureDigest.m_crc[2] = 0xffff; + + updateCRC(reconPic->getCbAddr(cuAddr), m_seiReconPictureDigest.m_crc[1], height, width, stride); + updateCRC(reconPic->getCrAddr(cuAddr), m_seiReconPictureDigest.m_crc[2], height, width, stride); + } + } + else if (m_param->decodedPictureHashSEI == 3) + { + if (!row) + m_seiReconPictureDigest.m_checksum[0] = 0; + + updateChecksum(reconPic->m_picOrg[0], m_seiReconPictureDigest.m_checksum[0], height, width, stride, row, maxCUHeight); + if (m_param->internalCsp != X265_CSP_I400) + { + width >>= hChromaShift; + height >>= vChromaShift; + stride = reconPic->m_strideC; + maxCUHeight >>= vChromaShift; + + if (!row) + m_seiReconPictureDigest.m_checksum[1] = m_seiReconPictureDigest.m_checksum[2] = 0; + + updateChecksum(reconPic->m_picOrg[1], 
m_seiReconPictureDigest.m_checksum[1], height, width, stride, row, maxCUHeight); + updateChecksum(reconPic->m_picOrg[2], m_seiReconPictureDigest.m_checksum[2], height, width, stride, row, maxCUHeight); + } + } } void FrameEncoder::encodeSlice(uint32_t sliceAddr) @@ -1367,7 +1393,7 @@ } curRow.avgQPComputed = 1; } - } + } // Initialize restrict on MV range in slices tld.analysis.m_sliceMinY = -(int16_t)(rowInSlice * m_param->maxCUSize * 4) + 3 * 4; @@ -1445,6 +1471,12 @@ // Does all the CU analysis, returns best top level mode decision Mode& best = tld.analysis.compressCTU(*ctu, *m_frame, m_cuGeoms[m_ctuGeomMap[cuAddr]], rowCoder); + /* startPoint > encodeOrder is true when the start point changes for + a new GOP but few frames from the previous GOP is still incomplete. + The data of frames in this interval will not be used by any future frames. */ + if (m_param->bDynamicRefine && m_top->m_startPoint <= m_frame->m_encodeOrder) + collectDynDataRow(*ctu, &curRow.rowStats); + // take a sample of the current active worker count ATOMIC_ADD(&m_totalActiveWorkerCount, m_activeWorkerCount); ATOMIC_INC(&m_activeWorkerCountSamples); @@ -1466,7 +1498,7 @@ { // NOTE: in VBV mode, we may reencode anytime, so we can't do Deblock stage-Horizon and SAO if (!bIsVbv) - { + { // Delay one row to avoid intra prediction conflict if (m_pool && !bFirstRowInSlice) { @@ -1743,24 +1775,24 @@ else if ((uint32_t)m_rce.encodeOrder <= 2 * (m_param->fpsNum / m_param->fpsDenom)) rowCount = X265_MIN((maxRows + 1) / 2, maxRows - 1); else - rowCount = X265_MIN(m_refLagRows / m_param->maxSlices, maxRows - 1); + rowCount = X265_MIN(m_refLagRows / m_param->maxSlices, maxRows - 1); if (rowInSlice == rowCount) { m_rowSliceTotalBits[sliceId] = 0; if (bIsVbv && !(m_param->rc.bEnableConstVbv && m_param->bEnableWavefront)) - { + { for (uint32_t i = m_sliceBaseRow[sliceId]; i < rowCount + m_sliceBaseRow[sliceId]; i++) m_rowSliceTotalBits[sliceId] += curEncData.m_rowStat[i].encodedBits; } else { uint32_t startAddr = m_sliceBaseRow[sliceId] * numCols; - uint32_t finishAddr = startAddr + rowCount * numCols; + uint32_t finishAddr = startAddr + rowCount * numCols; - for (uint32_t cuAddr = startAddr; cuAddr < finishAddr; cuAddr++) + for (uint32_t cuAddr = startAddr; cuAddr < finishAddr; cuAddr++) m_rowSliceTotalBits[sliceId] += curEncData.m_cuStat[cuAddr].totalBits; - } + } if (ATOMIC_INC(&m_sliceCnt) == (int)m_param->maxSlices) { @@ -1827,6 +1859,101 @@ m_completionEvent.trigger(); } +void FrameEncoder::collectDynDataRow(CUData& ctu, FrameStats* rowStats) +{ + for (uint32_t i = 0; i < X265_REFINE_INTER_LEVELS; i++) + { + for (uint32_t depth = 0; depth < m_param->maxCUDepth; depth++) + { + int offset = (depth * X265_REFINE_INTER_LEVELS) + i; + if (ctu.m_collectCUCount[offset]) + { + rowStats->rowVarDyn[offset] += ctu.m_collectCUVariance[offset]; + rowStats->rowRdDyn[offset] += ctu.m_collectCURd[offset]; + rowStats->rowCntDyn[offset] += ctu.m_collectCUCount[offset]; + } + } + } +} + +void FrameEncoder::collectDynDataFrame() +{ + for (uint32_t row = 0; row < m_numRows; row++) + { + for (uint32_t refLevel = 0; refLevel < X265_REFINE_INTER_LEVELS; refLevel++) + { + for (uint32_t depth = 0; depth < m_param->maxCUDepth; depth++) + { + int offset = (depth * X265_REFINE_INTER_LEVELS) + refLevel; + int curFrameIndex = m_frame->m_encodeOrder - m_top->m_startPoint; + int index = (curFrameIndex * X265_REFINE_INTER_LEVELS * m_param->maxCUDepth) + offset; + if (m_rows[row].rowStats.rowCntDyn[offset]) + { + m_top->m_variance[index] += 
m_rows[row].rowStats.rowVarDyn[offset]; + m_top->m_rdCost[index] += m_rows[row].rowStats.rowRdDyn[offset]; + m_top->m_trainingCount[index] += m_rows[row].rowStats.rowCntDyn[offset]; + } + } + } + } +} + +void FrameEncoder::computeAvgTrainingData() +{ + if (m_frame->m_lowres.bScenecut || m_frame->m_lowres.bKeyframe) + { + m_top->m_startPoint = m_frame->m_encodeOrder; + int size = (m_param->keyframeMax + m_param->lookaheadDepth) * m_param->maxCUDepth * X265_REFINE_INTER_LEVELS; + memset(m_top->m_variance, 0, size * sizeof(uint64_t)); + memset(m_top->m_rdCost, 0, size * sizeof(uint64_t)); + memset(m_top->m_trainingCount, 0, size * sizeof(uint32_t)); + } + if (m_frame->m_encodeOrder - m_top->m_startPoint < 2 * m_param->frameNumThreads) + m_frame->m_classifyFrame = false; + else + m_frame->m_classifyFrame = true; + + int size = m_param->maxCUDepth * X265_REFINE_INTER_LEVELS; + memset(m_frame->m_classifyRd, 0, size * sizeof(uint64_t)); + memset(m_frame->m_classifyVariance, 0, size * sizeof(uint64_t)); + memset(m_frame->m_classifyCount, 0, size * sizeof(uint32_t)); + if (m_frame->m_classifyFrame) + { + uint32_t limit = m_frame->m_encodeOrder - m_top->m_startPoint - m_param->frameNumThreads; + for (uint32_t i = 1; i < limit; i++) + { + for (uint32_t j = 0; j < X265_REFINE_INTER_LEVELS; j++) + { + for (uint32_t depth = 0; depth < m_param->maxCUDepth; depth++) + { + int offset = (depth * X265_REFINE_INTER_LEVELS) + j; + int index = (i* X265_REFINE_INTER_LEVELS * m_param->maxCUDepth) + offset; + if (m_top->m_trainingCount[index]) + { + m_frame->m_classifyRd[offset] += m_top->m_rdCost[index] / m_top->m_trainingCount[index]; + m_frame->m_classifyVariance[offset] += m_top->m_variance[index] / m_top->m_trainingCount[index]; + m_frame->m_classifyCount[offset] += m_top->m_trainingCount[index]; + } + } + } + } + /* Calculates the average feature values of historic frames that are being considered for the current frame */ + int historyCount = m_frame->m_encodeOrder - m_param->frameNumThreads - m_top->m_startPoint - 1; + if (historyCount) + { + for (uint32_t j = 0; j < X265_REFINE_INTER_LEVELS; j++) + { + for (uint32_t depth = 0; depth < m_param->maxCUDepth; depth++) + { + int offset = (depth * X265_REFINE_INTER_LEVELS) + j; + m_frame->m_classifyRd[offset] /= historyCount; + m_frame->m_classifyVariance[offset] /= historyCount; + } + } + } + } +} + /* collect statistics about CU coding decisions, return total QP */ int FrameEncoder::collectCTUStatistics(const CUData& ctu, FrameStats* log) { @@ -1949,6 +2076,31 @@ m_nr->nrOffsetDenoise[cat][0] = 0; } } +#if ENABLE_LIBVMAF +void FrameEncoder::vmafFrameLevelScore() +{ + PicYuv *fenc = m_frame->m_fencPic; + PicYuv *recon = m_frame->m_reconPic; + + x265_vmaf_framedata *vmafframedata = (x265_vmaf_framedata*)x265_malloc(sizeof(x265_vmaf_framedata)); + if (!vmafframedata) + { + x265_log(NULL, X265_LOG_ERROR, "vmaf frame data alloc failed\n"); + } + + vmafframedata->height = fenc->m_picHeight; + vmafframedata->width = fenc->m_picWidth; + vmafframedata->frame_set = 0; + vmafframedata->internalBitDepth = m_param->internalBitDepth; + vmafframedata->reference_frame = fenc; + vmafframedata->distorted_frame = recon; + + fenc->m_vmafScore = x265_calculate_vmaf_framelevelscore(vmafframedata); + + if (vmafframedata) + x265_free(vmafframedata); +} +#endif Frame *FrameEncoder::getEncodedPicture(NALList& output) {
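Note on the decoded-picture-hash change above: the per-plane MD5/CRC/checksum state now lives in m_seiReconPictureDigest and is filled by the new initDecodedPictureHashSEI(row, cuAddr, height), which resets the state only on the first row and then accumulates each row band. A minimal self-contained sketch of that init-on-row-0 / accumulate-per-row pattern follows; PlaneHash and the additive update are stand-ins for the internal MD5Context/updateMD5Plane, updateCRC and updateChecksum helpers, not the real x265 types, and 4:2:0 subsampling is assumed for brevity.

#include <cstdint>
#include <cstddef>
#include <vector>
#include <cstdio>

// Stand-in for one plane's running digest (the real code keeps MD5/CRC/checksum
// state per plane inside the SEIDecodedPictureHash object).
struct PlaneHash { uint32_t state = 0; };

// Stand-in for updateMD5Plane()/updateCRC()/updateChecksum(): fold one band of
// 'height' lines into the running state.
static void updatePlane(PlaneHash& h, const uint8_t* src, uint32_t width,
                        uint32_t height, ptrdiff_t stride)
{
    for (uint32_t y = 0; y < height; y++)
        for (uint32_t x = 0; x < width; x++)
            h.state = h.state * 31 + src[y * stride + x];
}

// Mirrors the shape of FrameEncoder::initDecodedPictureHashSEI(): reset state
// only on the first row, then accumulate the band covered by this call.
static void hashRow(PlaneHash planes[3], const uint8_t* const pic[3],
                    uint32_t width, uint32_t bandHeight, ptrdiff_t stride,
                    uint32_t row, bool hasChroma)
{
    if (!row)
        planes[0] = PlaneHash();
    updatePlane(planes[0], pic[0] + row * bandHeight * stride, width, bandHeight, stride);

    if (hasChroma)
    {
        if (!row)
            planes[1] = planes[2] = PlaneHash();
        // 4:2:0 stand-in: halve dimensions and stride for the chroma planes.
        updatePlane(planes[1], pic[1] + row * (bandHeight / 2) * (stride / 2),
                    width / 2, bandHeight / 2, stride / 2);
        updatePlane(planes[2], pic[2] + row * (bandHeight / 2) * (stride / 2),
                    width / 2, bandHeight / 2, stride / 2);
    }
}

int main()
{
    const uint32_t w = 64, h = 32, cuRows = 2, band = h / cuRows;
    std::vector<uint8_t> luma(w * h, 1), cb(w / 2 * h / 2, 2), cr(w / 2 * h / 2, 3);
    const uint8_t* pic[3] = { luma.data(), cb.data(), cr.data() };

    PlaneHash planes[3];
    for (uint32_t row = 0; row < cuRows; row++)   // one call per CTU row
        hashRow(planes, pic, w, band, w, row, true);

    printf("Y %08x Cb %08x Cr %08x\n", planes[0].state, planes[1].state, planes[2].state);
    return 0;
}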
View file
x265_2.7.tar.gz/source/encoder/frameencoder.h -> x265_2.9.tar.gz/source/encoder/frameencoder.h
Changed
@@ -129,6 +129,8 @@ /* blocks until worker thread is done, returns access unit */ Frame *getEncodedPicture(NALList& list); + void initDecodedPictureHashSEI(int row, int cuAddr, int height); + Event m_enable; Event m_done; Event m_completionEvent; @@ -161,9 +163,6 @@ double m_ssim; uint64_t m_accessUnitBits; uint32_t m_ssimCnt; - MD5Context m_state[3]; - uint32_t m_crc[3]; - uint32_t m_checksum[3]; volatile int m_activeWorkerCount; // count of workers currently encoding or filtering CTUs volatile int m_totalActiveWorkerCount; // sum of m_activeWorkerCount sampled at end of each CTU @@ -230,6 +229,8 @@ void threadMain(); int collectCTUStatistics(const CUData& ctu, FrameStats* frameLog); void noiseReductionUpdate(); + void writeTrailingSEIMessages(); + bool writeToneMapInfo(x265_sei_payload *payload); /* Called by WaveFront::findJob() */ virtual void processRow(int row, int threadId); @@ -239,6 +240,12 @@ void enqueueRowFilter(int row) { WaveFront::enqueueRow(row * 2 + 1); } void enableRowEncoder(int row) { WaveFront::enableRow(row * 2 + 0); } void enableRowFilter(int row) { WaveFront::enableRow(row * 2 + 1); } +#if ENABLE_LIBVMAF + void vmafFrameLevelScore(); +#endif + void collectDynDataFrame(); + void computeAvgTrainingData(); + void collectDynDataRow(CUData& ctu, FrameStats* rowStats); }; }
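The collectDynDataRow()/collectDynDataFrame()/computeAvgTrainingData() members declared here feed --dynamic-refine: per-CU variance and RD cost are accumulated into bins indexed by (CU depth, inter refine level) and later averaged over the frames since the current training start point. A compact sketch of that binning and averaging layout, using illustrative sizes in place of the real X265_REFINE_INTER_LEVELS and maxCUDepth values:

#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative bin layout: offset = depth * LEVELS + level, one slot per
// (CU depth, refine level) pair, replicated per frame in the training window.
static const int LEVELS = 3, DEPTHS = 4, BINS = LEVELS * DEPTHS;

struct TrainingData
{
    std::vector<uint64_t> variance, rdCost;
    std::vector<uint32_t> count;
    TrainingData(int frames) : variance(frames * BINS), rdCost(frames * BINS), count(frames * BINS) {}
};

// Accumulate one CU's statistics into the bin for its depth/refine level,
// loosely mirroring collectDynDataRow()/collectDynDataFrame().
static void accumulate(TrainingData& td, int frameIdx, int depth, int level,
                       uint64_t var, uint64_t rd)
{
    int index = frameIdx * BINS + depth * LEVELS + level;
    td.variance[index] += var;
    td.rdCost[index]   += rd;
    td.count[index]    += 1;
}

int main()
{
    TrainingData td(2);
    accumulate(td, 0, 1, 2, 100, 5000);
    accumulate(td, 1, 1, 2, 300, 7000);

    // Averaging step in the spirit of computeAvgTrainingData(): average each
    // bin over the frames that contributed data.
    int offset = 1 * LEVELS + 2;
    uint64_t avgVar = 0, avgRd = 0; uint32_t frames = 0;
    for (int f = 0; f < 2; f++)
    {
        int index = f * BINS + offset;
        if (td.count[index])
        {
            avgVar += td.variance[index] / td.count[index];
            avgRd  += td.rdCost[index] / td.count[index];
            frames++;
        }
    }
    if (frames) { avgVar /= frames; avgRd /= frames; }
    printf("avg variance %llu, avg rd %llu over %u frames\n",
           (unsigned long long)avgVar, (unsigned long long)avgRd, frames);
    return 0;
}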
View file
x265_2.7.tar.gz/source/encoder/framefilter.cpp -> x265_2.9.tar.gz/source/encoder/framefilter.cpp
Changed
@@ -712,78 +712,8 @@ if (m_param->maxSlices == 1) { - if (m_param->decodedPictureHashSEI == 1) - { - uint32_t height = m_parallelFilter[row].getCUHeight(); - uint32_t width = reconPic->m_picWidth; - intptr_t stride = reconPic->m_stride; - - if (!row) - MD5Init(&m_frameEncoder->m_state[0]); - - updateMD5Plane(m_frameEncoder->m_state[0], reconPic->getLumaAddr(cuAddr), width, height, stride); - if (m_param->internalCsp != X265_CSP_I400) - { - if (!row) - { - MD5Init(&m_frameEncoder->m_state[1]); - MD5Init(&m_frameEncoder->m_state[2]); - } - - width >>= m_hChromaShift; - height >>= m_vChromaShift; - stride = reconPic->m_strideC; - - updateMD5Plane(m_frameEncoder->m_state[1], reconPic->getCbAddr(cuAddr), width, height, stride); - updateMD5Plane(m_frameEncoder->m_state[2], reconPic->getCrAddr(cuAddr), width, height, stride); - } - } - else if (m_param->decodedPictureHashSEI == 2) - { - uint32_t height = m_parallelFilter[row].getCUHeight(); - uint32_t width = reconPic->m_picWidth; - intptr_t stride = reconPic->m_stride; - - if (!row) - m_frameEncoder->m_crc[0] = 0xffff; - - updateCRC(reconPic->getLumaAddr(cuAddr), m_frameEncoder->m_crc[0], height, width, stride); - if (m_param->internalCsp != X265_CSP_I400) - { - width >>= m_hChromaShift; - height >>= m_vChromaShift; - stride = reconPic->m_strideC; - m_frameEncoder->m_crc[1] = m_frameEncoder->m_crc[2] = 0xffff; - - updateCRC(reconPic->getCbAddr(cuAddr), m_frameEncoder->m_crc[1], height, width, stride); - updateCRC(reconPic->getCrAddr(cuAddr), m_frameEncoder->m_crc[2], height, width, stride); - } - } - else if (m_param->decodedPictureHashSEI == 3) - { - uint32_t width = reconPic->m_picWidth; - uint32_t height = m_parallelFilter[row].getCUHeight(); - intptr_t stride = reconPic->m_stride; - uint32_t cuHeight = m_param->maxCUSize; - - if (!row) - m_frameEncoder->m_checksum[0] = 0; - - updateChecksum(reconPic->m_picOrg[0], m_frameEncoder->m_checksum[0], height, width, stride, row, cuHeight); - if (m_param->internalCsp != X265_CSP_I400) - { - width >>= m_hChromaShift; - height >>= m_vChromaShift; - stride = reconPic->m_strideC; - cuHeight >>= m_vChromaShift; - - if (!row) - m_frameEncoder->m_checksum[1] = m_frameEncoder->m_checksum[2] = 0; - - updateChecksum(reconPic->m_picOrg[1], m_frameEncoder->m_checksum[1], height, width, stride, row, cuHeight); - updateChecksum(reconPic->m_picOrg[2], m_frameEncoder->m_checksum[2], height, width, stride, row, cuHeight); - } - } + uint32_t height = m_parallelFilter[row].getCUHeight(); + m_frameEncoder->initDecodedPictureHashSEI(row, cuAddr, height); } // end of (m_param->maxSlices == 1) if (ATOMIC_INC(&m_frameEncoder->m_completionCount) == 2 * (int)m_frameEncoder->m_numRows)
View file
x265_2.7.tar.gz/source/encoder/ratecontrol.cpp -> x265_2.9.tar.gz/source/encoder/ratecontrol.cpp
Changed
@@ -1282,6 +1282,12 @@ m_predictedBits = m_totalBits; updateVbvPlan(enc); rce->bufferFill = m_bufferFill; + rce->vbvEndAdj = false; + if (m_param->vbvBufferEnd && rce->encodeOrder >= m_param->vbvEndFrameAdjust * m_param->totalFrames) + { + rce->vbvEndAdj = true; + rce->targetFill = 0; + } int mincr = enc->m_vps.ptl.minCrForLevel; /* Profiles above Main10 don't require maxAU size check, so just set the maximum to a large value. */ @@ -1290,7 +1296,7 @@ else { /* The spec has a special case for the first frame. */ - if (rce->encodeOrder == 0) + if (curFrame->m_lowres.bKeyframe) { /* 1.5 * (Max( PicSizeInSamplesY, fR * MaxLumaSr) + MaxLumaSr * (AuCpbRemovalTime[ 0 ] -AuNominalRemovalTime[ 0 ])) ? MinCr */ double fr = 1. / 300; @@ -1302,6 +1308,7 @@ /* 1.5 * MaxLumaSr * (AuCpbRemovalTime[ n ] - AuCpbRemovalTime[ n - 1 ]) / MinCr */ rce->frameSizeMaximum = 8 * 1.5 * enc->m_vps.ptl.maxLumaSrForLevel * m_frameDuration / mincr; } + rce->frameSizeMaximum *= m_param->maxAUSizeFactor; } } if (!m_isAbr && m_2pass && m_param->rc.rateControlMode == X265_RC_CRF) @@ -2172,12 +2179,12 @@ curBits = predictSize(&m_pred[predType], frameQ[type], (double)satd); bufferFillCur -= curBits; } - if (m_param->vbvBufferEnd && rce->encodeOrder >= m_param->vbvEndFrameAdjust * m_param->totalFrames) + if (rce->vbvEndAdj) { bool loopBreak = false; double bufferDiff = m_param->vbvBufferEnd - (m_bufferFill / m_bufferSize); - targetFill = m_bufferFill + m_bufferSize * (bufferDiff / (m_param->totalFrames - rce->encodeOrder)); - if (bufferFillCur < targetFill) + rce->targetFill = m_bufferFill + m_bufferSize * (bufferDiff / (m_param->totalFrames - rce->encodeOrder)); + if (bufferFillCur < rce->targetFill) { q *= 1.01; loopTerminate |= 1; @@ -2420,6 +2427,7 @@ double rcTol = bufferLeftPlanned / m_param->frameNumThreads * m_rateTolerance; int32_t encodedBitsSoFar = 0; double accFrameBits = predictRowsSizeSum(curFrame, rce, qpVbv, encodedBitsSoFar); + double vbvEndBias = 0.95; /* * Don't increase the row QPs until a sufficent amount of the bits of * the frame have been processed, in case a flat area at the top of the @@ -2441,7 +2449,8 @@ while (qpVbv < qpMax && (((accFrameBits > rce->frameSizePlanned + rcTol) || (rce->bufferFill - accFrameBits < bufferLeftPlanned * 0.5) || - (accFrameBits > rce->frameSizePlanned && qpVbv < rce->qpNoVbv)) + (accFrameBits > rce->frameSizePlanned && qpVbv < rce->qpNoVbv) || + (rce->vbvEndAdj && ((rce->bufferFill - accFrameBits) < (rce->targetFill * vbvEndBias)))) && (!m_param->rc.bStrictCbr ? 1 : abrOvershoot > 0.1))) { qpVbv += stepSize; @@ -2452,7 +2461,8 @@ while (qpVbv > qpMin && (qpVbv > curEncData.m_rowStat[0].rowQp || m_singleFrameVbv) && (((accFrameBits < rce->frameSizePlanned * 0.8f && qpVbv <= prevRowQp) - || accFrameBits < (rce->bufferFill - m_bufferSize + m_bufferRate) * 1.1) + || accFrameBits < (rce->bufferFill - m_bufferSize + m_bufferRate) * 1.1 + || (rce->vbvEndAdj && ((rce->bufferFill - accFrameBits) > (rce->targetFill * vbvEndBias)))) && (!m_param->rc.bStrictCbr ? 
1 : abrOvershoot < 0))) { qpVbv -= stepSize; @@ -2630,8 +2640,9 @@ FrameData& curEncData = *curFrame->m_encData; int64_t actualBits = bits; Slice *slice = curEncData.m_slice; + bool bEnableDistOffset = m_param->analysisMultiPassDistortion && m_param->rc.bStatRead; - if (m_param->rc.aqMode || m_isVbv || m_param->bAQMotion) + if (m_param->rc.aqMode || m_isVbv || m_param->bAQMotion || bEnableDistOffset) { if (m_isVbv && !(m_2pass && m_param->rc.rateControlMode == X265_RC_CRF)) { @@ -2645,10 +2656,10 @@ rce->qpaRc = curEncData.m_avgQpRc; } - if (m_param->rc.aqMode || m_param->bAQMotion) + if (m_param->rc.aqMode || m_param->bAQMotion || bEnableDistOffset) { double avgQpAq = 0; - /* determine actual avg encoded QP, after AQ/cutree adjustments */ + /* determine actual avg encoded QP, after AQ/cutree/distortion adjustments */ for (uint32_t i = 0; i < slice->m_sps->numCuInHeight; i++) avgQpAq += curEncData.m_rowStat[i].sumQpAq; @@ -2792,12 +2803,8 @@ /* called to write out the rate control frame stats info in multipass encodes */ int RateControl::writeRateControlFrameStats(Frame* curFrame, RateControlEntry* rce) { - FrameData& curEncData = *curFrame->m_encData; - int ncu; - if (m_param->rc.qgSize == 8) - ncu = m_ncu * 4; - else - ncu = m_ncu; + FrameData& curEncData = *curFrame->m_encData; + int ncu = (m_param->rc.qgSize == 8) ? m_ncu * 4 : m_ncu; char cType = rce->sliceType == I_SLICE ? (curFrame->m_lowres.sliceType == X265_TYPE_IDR ? 'I' : 'i') : rce->sliceType == P_SLICE ? 'P' : IS_REFERENCED(curFrame) ? 'B' : 'b';
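The rate-control hunks above cache the end-of-stream VBV adjustment in the RateControlEntry (vbvEndAdj, targetFill) and reuse it in the row-level QP loop with a 0.95 bias. A rough standalone sketch of the target-fill computation and the resulting QP decision; the local variable names and numbers are made up for illustration and are not the real RateControl members:

#include <cstdio>

int main()
{
    // Hypothetical stand-ins for the rate-control state used by the change.
    double bufferSize   = 4000000;   // VBV buffer size in bits
    double bufferFill   = 1500000;   // current fullness in bits
    double vbvBufferEnd = 0.6;       // fraction of bufferSize required at end of stream
    int    totalFrames  = 600;
    int    encodeOrder  = 550;       // already past vbvEndFrameAdjust * totalFrames
    double accFrameBits = 90000;     // bits predicted/spent so far in this frame
    const double vbvEndBias = 0.95;  // bias used by the row-level QP loop

    // rce->targetFill = bufferFill + bufferSize * (vbvBufferEnd - fill/size) / framesLeft
    double bufferDiff = vbvBufferEnd - bufferFill / bufferSize;
    double targetFill = bufferFill + bufferSize * (bufferDiff / (totalFrames - encodeOrder));

    // Row-level decision mirrored from the patch: raise QP if the projected
    // fullness drops below targetFill * bias, allow it to drop otherwise.
    double projectedFill = bufferFill - accFrameBits;
    if (projectedFill < targetFill * vbvEndBias)
        printf("targetFill=%.0f projected=%.0f -> raise qpVbv\n", targetFill, projectedFill);
    else
        printf("targetFill=%.0f projected=%.0f -> qpVbv can drop\n", targetFill, projectedFill);
    return 0;
}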
View file
x265_2.7.tar.gz/source/encoder/ratecontrol.h -> x265_2.9.tar.gz/source/encoder/ratecontrol.h
Changed
@@ -82,6 +82,8 @@ double rowCplxrSum; double qpNoVbv; double bufferFill; + double targetFill; + bool vbvEndAdj; double frameDuration; double clippedDuration; double frameSizeEstimated; /* hold frameSize, updated from cu level vbv rc */
View file
x265_2.7.tar.gz/source/encoder/reference.cpp -> x265_2.9.tar.gz/source/encoder/reference.cpp
Changed
@@ -89,7 +89,7 @@ cuHeight >>= reconPic->m_vChromaShift; } - if (wp[c].bPresentFlag) + if (wp[c].wtPresent) { if (!weightBuffer[c]) { @@ -155,12 +155,10 @@ const pixel* src = reconPic->m_picOrg[c] + numWeightedRows * cuHeight * stride; pixel* dst = fpelPlane[c] + numWeightedRows * cuHeight * stride; - // Computing weighted CU rows int correction = IF_INTERNAL_PREC - X265_DEPTH; // intermediate interpolation depth - int padwidth = (width + 15) & ~15; // weightp assembly needs even 16 byte widths + int padwidth = (width + 31) & ~31; // weightp assembly needs even 32 byte widths primitives.weight_pp(src, dst, stride, padwidth, height, w[c].weight, w[c].round << correction, w[c].shift + correction, w[c].offset); - // Extending Left & Right primitives.extendRowBorder(dst, stride, width, height, marginX);
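The reference.cpp change widens the weightp padding from 16- to 32-byte multiples, matching the wider SIMD paths introduced in this release. For reference, a small sketch of the round-up idiom used in that hunk (the alignment must be a power of two):

#include <cstdio>
#include <cassert>

// Round 'width' up to the next multiple of 'align' (power of two); the same
// idiom as "(width + 31) & ~31" in the hunk above.
static int roundUp(int width, int align)
{
    assert((align & (align - 1)) == 0);
    return (width + align - 1) & ~(align - 1);
}

int main()
{
    printf("%d %d %d\n", roundUp(1920, 32), roundUp(1912, 32), roundUp(1, 32)); // prints: 1920 1920 32
    return 0;
}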
View file
x265_2.7.tar.gz/source/encoder/search.cpp -> x265_2.9.tar.gz/source/encoder/search.cpp
Changed
@@ -82,7 +82,7 @@ m_me.init(param.internalCsp); bool ok = m_quant.init(param.psyRdoq, scalingList, m_entropyCoder); - if (m_param->noiseReductionIntra || m_param->noiseReductionInter || m_param->rc.vbvBufferSize) + if (m_param->noiseReductionIntra || m_param->noiseReductionInter ) ok &= m_quant.allocNoiseReduction(param); ok &= Predict::allocBuffers(param.internalCsp); /* sets m_hChromaShift & m_vChromaShift */ @@ -354,14 +354,17 @@ // store original entropy coding status if (bEnableRDOQ) m_entropyCoder.estBit(m_entropyCoder.m_estBitsSbac, log2TrSize, true); - - primitives.cu[sizeIdx].calcresidual(fenc, pred, residual, stride); + primitives.cu[sizeIdx].calcresidual[stride % 64 == 0](fenc, pred, residual, stride); uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffY, log2TrSize, TEXT_LUMA, absPartIdx, false); if (numSig) { m_quant.invtransformNxN(cu, residual, stride, coeffY, log2TrSize, TEXT_LUMA, true, false, numSig); - primitives.cu[sizeIdx].add_ps(reconQt, reconQtStride, pred, residual, stride, stride); + bool reconQtYuvAlign = m_rqt[qtLayer].reconQtYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0; + bool predAlign = mode.predYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0; + bool residualAlign = m_rqt[cuGeom.depth].tmpResiYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0; + bool bufferAlignCheck = (reconQtStride % 64 == 0) && (stride % 64 == 0) && reconQtYuvAlign && predAlign && residualAlign; + primitives.cu[sizeIdx].add_ps[bufferAlignCheck](reconQt, reconQtStride, pred, residual, stride, stride); } else // no coded residual, recon = pred @@ -559,15 +562,19 @@ coeff_t* coeff = (useTSkip ? m_tsCoeff : coeffY); pixel* tmpRecon = (useTSkip ? m_tsRecon : reconQt); + bool tmpReconAlign = (useTSkip ? 1 : (m_rqt[qtLayer].reconQtYuv.getAddrOffset(absPartIdx, m_rqt[qtLayer].reconQtYuv.m_size) % 64 == 0)); uint32_t tmpReconStride = (useTSkip ? 
MAX_TS_SIZE : reconQtStride); - primitives.cu[sizeIdx].calcresidual(fenc, pred, residual, stride); + primitives.cu[sizeIdx].calcresidual[stride % 64 == 0](fenc, pred, residual, stride); uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeff, log2TrSize, TEXT_LUMA, absPartIdx, useTSkip); if (numSig) { m_quant.invtransformNxN(cu, residual, stride, coeff, log2TrSize, TEXT_LUMA, true, useTSkip, numSig); - primitives.cu[sizeIdx].add_ps(tmpRecon, tmpReconStride, pred, residual, stride, stride); + bool residualAlign = m_rqt[cuGeom.depth].tmpResiYuv.getAddrOffset(absPartIdx, m_rqt[cuGeom.depth].tmpResiYuv.m_size) % 64 == 0; + bool predAlign = predYuv->getAddrOffset(absPartIdx, predYuv->m_size) % 64 == 0; + bool bufferAlignCheck = (stride % 64 == 0) && (tmpReconStride % 64 == 0) && tmpReconAlign && residualAlign && predAlign; + primitives.cu[sizeIdx].add_ps[bufferAlignCheck](tmpRecon, tmpReconStride, pred, residual, stride, stride); } else if (useTSkip) { @@ -714,7 +721,7 @@ coeff_t* coeffY = cu.m_trCoeff[0] + coeffOffsetY; uint32_t sizeIdx = log2TrSize - 2; - primitives.cu[sizeIdx].calcresidual(fenc, pred, residual, stride); + primitives.cu[sizeIdx].calcresidual[stride % 64 == 0](fenc, pred, residual, stride); PicYuv* reconPic = m_frame->m_reconPic; pixel* picReconY = reconPic->getLumaAddr(cu.m_cuAddr, cuGeom.absPartIdx + absPartIdx); @@ -724,7 +731,11 @@ if (numSig) { m_quant.invtransformNxN(cu, residual, stride, coeffY, log2TrSize, TEXT_LUMA, true, false, numSig); - primitives.cu[sizeIdx].add_ps(picReconY, picStride, pred, residual, stride, stride); + bool picReconYAlign = (reconPic->m_cuOffsetY[cu.m_cuAddr] + reconPic->m_buOffsetY[cuGeom.absPartIdx + absPartIdx]) % 64 == 0; + bool predAlign = mode.predYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0; + bool residualAlign = m_rqt[cuGeom.depth].tmpResiYuv.getAddrOffset(absPartIdx, m_rqt[cuGeom.depth].tmpResiYuv.m_size)% 64 == 0; + bool bufferAlignCheck = (picStride % 64 == 0) && (stride % 64 == 0) && picReconYAlign && predAlign && residualAlign; + primitives.cu[sizeIdx].add_ps[bufferAlignCheck](picReconY, picStride, pred, residual, stride, stride); cu.setCbfSubParts(1 << tuDepth, TEXT_LUMA, absPartIdx, fullDepth); } else @@ -893,12 +904,17 @@ predIntraChromaAng(chromaPredMode, pred, stride, log2TrSizeC); cu.setTransformSkipPartRange(0, ttype, absPartIdxC, tuIterator.absPartIdxStep); - primitives.cu[sizeIdxC].calcresidual(fenc, pred, residual, stride); + primitives.cu[sizeIdxC].calcresidual[stride % 64 == 0](fenc, pred, residual, stride); + uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffC, log2TrSizeC, ttype, absPartIdxC, false); if (numSig) { m_quant.invtransformNxN(cu, residual, stride, coeffC, log2TrSizeC, ttype, true, false, numSig); - primitives.cu[sizeIdxC].add_ps(reconQt, reconQtStride, pred, residual, stride, stride); + bool reconQtAlign = m_rqt[qtLayer].reconQtYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0; + bool predAlign = mode.predYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0; + bool residualAlign = resiYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0; + bool bufferAlignCheck = reconQtAlign && predAlign && residualAlign && (reconQtStride % 64 == 0) && (stride % 64 == 0); + primitives.cu[sizeIdxC].add_ps[bufferAlignCheck](reconQt, reconQtStride, pred, residual, stride, stride); cu.setCbfPartRange(1 << tuDepth, ttype, absPartIdxC, tuIterator.absPartIdxStep); } else @@ -992,13 +1008,17 @@ pixel* recon = (useTSkip ? 
m_tsRecon : reconQt); uint32_t reconStride = (useTSkip ? MAX_TS_SIZE : reconQtStride); - primitives.cu[sizeIdxC].calcresidual(fenc, pred, residual, stride); + primitives.cu[sizeIdxC].calcresidual[stride % 64 == 0](fenc, pred, residual, stride); uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeff, log2TrSizeC, ttype, absPartIdxC, useTSkip); if (numSig) { m_quant.invtransformNxN(cu, residual, stride, coeff, log2TrSizeC, ttype, true, useTSkip, numSig); - primitives.cu[sizeIdxC].add_ps(recon, reconStride, pred, residual, stride, stride); + bool reconAlign = (useTSkip ? 1 : m_rqt[qtLayer].reconQtYuv.getChromaAddrOffset(absPartIdxC)) % 64 == 0; + bool predYuvAlign = mode.predYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0; + bool residualAlign = m_rqt[cuGeom.depth].tmpResiYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0; + bool bufferAlignCheck = reconAlign && predYuvAlign && residualAlign && (reconStride % 64 == 0) && (stride % 64 == 0); + primitives.cu[sizeIdxC].add_ps[bufferAlignCheck](recon, reconStride, pred, residual, stride, stride); cu.setCbfPartRange(1 << tuDepth, ttype, absPartIdxC, tuIterator.absPartIdxStep); } else if (useTSkip) @@ -1183,12 +1203,17 @@ X265_CHECK(!cu.m_transformSkip[ttype][0], "transform skip not supported at low RD levels\n"); - primitives.cu[sizeIdxC].calcresidual(fenc, pred, residual, stride); + primitives.cu[sizeIdxC].calcresidual[stride % 64 == 0](fenc, pred, residual, stride); + uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffC, log2TrSizeC, ttype, absPartIdxC, false); if (numSig) { m_quant.invtransformNxN(cu, residual, stride, coeffC, log2TrSizeC, ttype, true, false, numSig); - primitives.cu[sizeIdxC].add_ps(picReconC, picStride, pred, residual, stride, stride); + bool picReconCAlign = (reconPic->m_cuOffsetC[cu.m_cuAddr] + reconPic->m_buOffsetC[cuGeom.absPartIdx + absPartIdxC]) % 64 == 0; + bool predAlign = mode.predYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0; + bool residualAlign = resiYuv.getChromaAddrOffset(absPartIdxC)% 64 == 0; + bool bufferAlignCheck = picReconCAlign && predAlign && residualAlign && (picStride % 64 == 0) && (stride % 64 == 0); + primitives.cu[sizeIdxC].add_ps[bufferAlignCheck](picReconC, picStride, pred, residual, stride, stride); cu.setCbfPartRange(1 << tuDepth, ttype, absPartIdxC, tuIterator.absPartIdxStep); } else @@ -1304,7 +1329,7 @@ pixel nScale[129]; intraNeighbourBuf[1][0] = intraNeighbourBuf[0][0]; - primitives.scale1D_128to64(nScale + 1, intraNeighbourBuf[0] + 1); + primitives.scale1D_128to64[NONALIGNED](nScale + 1, intraNeighbourBuf[0] + 1); // we do not estimate filtering for downscaled samples memcpy(&intraNeighbourBuf[0][1], &nScale[1], 2 * 64 * sizeof(pixel)); // Top & Left pixels @@ -2107,18 +2132,24 @@ bestME[list].mvCost = mvCost; } } - -void Search::searchMV(Mode& interMode, const PredictionUnit& pu, int list, int ref, MV& outmv) +void Search::searchMV(Mode& interMode, const PredictionUnit& pu, int list, int ref, MV& outmv, MV mvp, int numMvc, MV* mvc) { CUData& cu = interMode.cu; const Slice *slice = m_slice; - MV mv = cu.m_mv[list][pu.puAbsPartIdx]; + MV mv; + if (m_param->interRefine == 1) + mv = mvp; + else + mv = cu.m_mv[list][pu.puAbsPartIdx]; cu.clipMv(mv); MV mvmin, mvmax; setSearchRange(cu, mv, m_param->searchRange, mvmin, mvmax); - m_me.refineMV(&slice->m_mref[list][ref], mvmin, mvmax, mv, outmv); + if (m_param->interRefine == 1) + m_me.motionEstimate(&m_slice->m_mref[list][ref], mvmin, mvmax, mv, numMvc, mvc, m_param->searchRange, outmv, 
m_param->maxSlices, + m_param->bSourceReferenceEstimation ? m_slice->m_refFrameList[list][ref]->m_fencPic->getLumaAddr(0) : 0); + else + m_me.refineMV(&slice->m_mref[list][ref], mvmin, mvmax, mv, outmv); } - /* find the best inter prediction for each PU of specified mode */ void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC, uint32_t refMasks[2]) { @@ -2138,20 +2169,29 @@ int totalmebits = 0; MV mvzero(0, 0); Yuv& tmpPredYuv = m_rqt[cuGeom.depth].tmpPredYuv; - MergeData merge; memset(&merge, 0, sizeof(merge)); - + bool useAsMVP = false; for (int puIdx = 0; puIdx < numPart; puIdx++) { MotionData* bestME = interMode.bestME[puIdx]; PredictionUnit pu(cu, cuGeom, puIdx); - m_me.setSourcePU(*interMode.fencYuv, pu.ctuAddr, pu.cuAbsPartIdx, pu.puAbsPartIdx, pu.width, pu.height, m_param->searchMethod, m_param->subpelRefine, bChromaMC); - + useAsMVP = false; + x265_analysis_inter_data* interDataCTU = NULL; + int cuIdx; + cuIdx = (interMode.cu.m_cuAddr * m_param->num4x4Partitions) + cuGeom.absPartIdx; + if (m_param->analysisReuseLevel == 10 && m_param->interRefine > 1) + { + interDataCTU = m_frame->m_analysisData.interData; + if ((cu.m_predMode[pu.puAbsPartIdx] == interDataCTU->modes[cuIdx + pu.puAbsPartIdx]) + && (cu.m_partSize[pu.puAbsPartIdx] == interDataCTU->partSize[cuIdx + pu.puAbsPartIdx]) + && !(interDataCTU->mergeFlag[cuIdx + puIdx]) + && (cu.m_cuDepth[0] == interDataCTU->depth[cuIdx])) + useAsMVP = true; + } /* find best cost merge candidate. note: 2Nx2N merge and bidir are handled as separate modes */ uint32_t mrgCost = numPart == 1 ? MAX_UINT : mergeEstimation(cu, cuGeom, pu, puIdx, merge); - bestME[0].cost = MAX_UINT; bestME[1].cost = MAX_UINT; @@ -2159,26 +2199,37 @@ bool bDoUnidir = true; cu.getNeighbourMV(puIdx, pu.puAbsPartIdx, interMode.interNeighbours); - /* Uni-directional prediction */ if ((m_param->analysisLoad && m_param->analysisReuseLevel > 1 && m_param->analysisReuseLevel != 10) - || (m_param->analysisMultiPassRefine && m_param->rc.bStatRead) || (m_param->bMVType == AVC_INFO)) + || (m_param->analysisMultiPassRefine && m_param->rc.bStatRead) || (m_param->bMVType == AVC_INFO) || (useAsMVP)) { for (int list = 0; list < numPredDir; list++) { - int ref = bestME[list].ref; + + int ref = -1; + if (useAsMVP) + ref = interDataCTU->refIdx[list][cuIdx + puIdx]; + + else + ref = bestME[list].ref; if (ref < 0) + { continue; - + } uint32_t bits = m_listSelBits[list] + MVP_IDX_BITS; bits += getTUBits(ref, numRefIdx[list]); int numMvc = cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc); - const MV* amvp = interMode.amvpCand[list][ref]; int mvpIdx = selectMVP(cu, pu, amvp, list, ref); - MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx]; - + MV mvmin, mvmax, outmv, mvp; + if (useAsMVP) + { + mvp = interDataCTU->mv[list][cuIdx + puIdx].word; + mvpIdx = interDataCTU->mvpIdx[list][cuIdx + puIdx]; + } + else + mvp = amvp[mvpIdx]; if (m_param->searchMethod == X265_SEA) { int puX = puIdx & 1; @@ -2198,9 +2249,8 @@ bits += m_me.bitcost(outmv); uint32_t mvCost = m_me.mvcost(outmv); uint32_t cost = (satdCost - mvCost) + m_rdCost.getCost(bits); - /* Refine MVP selection, updates: mvpIdx, bits, cost */ - if (!m_param->analysisMultiPassRefine) + if (!(m_param->analysisMultiPassRefine || useAsMVP)) mvp = checkBestMVP(amvp, outmv, mvpIdx, bits, cost); else { @@ -2225,6 +2275,7 @@ bestME[list].cost = cost; bestME[list].bits = bits; bestME[list].mvCost = mvCost; + bestME[list].ref = ref; } bDoUnidir = false; } @@ -2372,8 +2423,7 @@ /* Generate 
reference subpels */ predInterLumaPixel(pu, bidirYuv[0], *refPic0, bestME[0].mv); predInterLumaPixel(pu, bidirYuv[1], *refPic1, bestME[1].mv); - - primitives.pu[m_me.partEnum].pixelavg_pp(tmpPredYuv.m_buf[0], tmpPredYuv.m_size, bidirYuv[0].getLumaAddr(pu.puAbsPartIdx), bidirYuv[0].m_size, + primitives.pu[m_me.partEnum].pixelavg_pp[(tmpPredYuv.m_size % 64 == 0) && (bidirYuv[0].m_size % 64 == 0) && (bidirYuv[1].m_size % 64 == 0)](tmpPredYuv.m_buf[0], tmpPredYuv.m_size, bidirYuv[0].getLumaAddr(pu.puAbsPartIdx), bidirYuv[0].m_size, bidirYuv[1].getLumaAddr(pu.puAbsPartIdx), bidirYuv[1].m_size, 32); satdCost = m_me.bufSATD(tmpPredYuv.m_buf[0], tmpPredYuv.m_size); } @@ -2415,11 +2465,9 @@ const pixel* ref0 = m_slice->m_mref[0][bestME[0].ref].getLumaAddr(pu.ctuAddr, pu.cuAbsPartIdx + pu.puAbsPartIdx); const pixel* ref1 = m_slice->m_mref[1][bestME[1].ref].getLumaAddr(pu.ctuAddr, pu.cuAbsPartIdx + pu.puAbsPartIdx); intptr_t refStride = slice->m_mref[0][0].lumaStride; - - primitives.pu[m_me.partEnum].pixelavg_pp(tmpPredYuv.m_buf[0], tmpPredYuv.m_size, ref0, refStride, ref1, refStride, 32); + primitives.pu[m_me.partEnum].pixelavg_pp[(tmpPredYuv.m_size % 64 == 0) && (refStride % 64 == 0)](tmpPredYuv.m_buf[0], tmpPredYuv.m_size, ref0, refStride, ref1, refStride, 32); satdCost = m_me.bufSATD(tmpPredYuv.m_buf[0], tmpPredYuv.m_size); } - MV mvp0 = bestME[0].mvp; int mvpIdx0 = bestME[0].mvpIdx; uint32_t bits0 = bestME[0].bits - m_me.bitcost(bestME[0].mv, mvp0) + m_me.bitcost(mvzero, mvp0); @@ -2888,7 +2936,7 @@ } else { - primitives.cu[sizeIdx].blockfill_s(curResiY, strideResiY, 0); + primitives.cu[sizeIdx].blockfill_s[strideResiY % 64 == 0](curResiY, strideResiY, 0); cu.setCbfSubParts(0, TEXT_LUMA, absPartIdx, depth); } @@ -2921,7 +2969,7 @@ } else { - primitives.cu[sizeIdxC].blockfill_s(curResiU, strideResiC, 0); + primitives.cu[sizeIdxC].blockfill_s[strideResiC % 64 == 0](curResiU, strideResiC, 0); cu.setCbfPartRange(0, TEXT_CHROMA_U, absPartIdxC, tuIterator.absPartIdxStep); } @@ -2935,7 +2983,7 @@ } else { - primitives.cu[sizeIdxC].blockfill_s(curResiV, strideResiC, 0); + primitives.cu[sizeIdxC].blockfill_s[strideResiC % 64 == 0](curResiV, strideResiC, 0); cu.setCbfPartRange(0, TEXT_CHROMA_V, absPartIdxC, tuIterator.absPartIdxStep); } } @@ -3168,8 +3216,12 @@ // non-zero cost calculation for luma - This is an approximation // finally we have to encode correct cbf after comparing with null cost pixel* curReconY = m_rqt[qtLayer].reconQtYuv.getLumaAddr(absPartIdx); + bool curReconYAlign = m_rqt[qtLayer].reconQtYuv.getAddrOffset(absPartIdx, m_rqt[qtLayer].reconQtYuv.m_size) % 64 == 0; uint32_t strideReconY = m_rqt[qtLayer].reconQtYuv.m_size; - primitives.cu[partSize].add_ps(curReconY, strideReconY, mode.predYuv.getLumaAddr(absPartIdx), curResiY, mode.predYuv.m_size, strideResiY); + bool predYuvAlign = mode.predYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0; + bool curResiYAlign = m_rqt[qtLayer].resiQtYuv.getAddrOffset(absPartIdx, m_rqt[qtLayer].resiQtYuv.m_size) % 64 == 0; + bool bufferAlignCheck = curReconYAlign && predYuvAlign && curResiYAlign && (strideReconY % 64 == 0) && (mode.predYuv.m_size % 64 == 0) && (strideResiY % 64 == 0); + primitives.cu[partSize].add_ps[bufferAlignCheck](curReconY, strideReconY, mode.predYuv.getLumaAddr(absPartIdx), curResiY, mode.predYuv.m_size, strideResiY); const sse_t nonZeroDistY = primitives.cu[partSize].sse_pp(fenc, fencYuv->m_size, curReconY, strideReconY); uint32_t nzCbfBitsY = m_entropyCoder.estimateCbfBits(cbfFlag[TEXT_LUMA][0], TEXT_LUMA, tuDepth); @@ 
-3203,7 +3255,7 @@ { cbfFlag[TEXT_LUMA][0] = 0; singleBits[TEXT_LUMA][0] = 0; - primitives.cu[partSize].blockfill_s(curResiY, strideResiY, 0); + primitives.cu[partSize].blockfill_s[strideResiY % 64 == 0](curResiY, strideResiY, 0); #if CHECKED_BUILD || _DEBUG uint32_t numCoeffY = 1 << (log2TrSize << 1); memset(coeffCurY, 0, sizeof(coeff_t)* numCoeffY); @@ -3226,7 +3278,7 @@ { if (checkTransformSkipY) minCost[TEXT_LUMA][0] = estimateNullCbfCost(zeroDistY, zeroEnergyY, tuDepth, TEXT_LUMA); - primitives.cu[partSize].blockfill_s(curResiY, strideResiY, 0); + primitives.cu[partSize].blockfill_s[strideResiY % 64 == 0](curResiY, strideResiY, 0); singleDist[TEXT_LUMA][0] = zeroDistY; singleBits[TEXT_LUMA][0] = 0; singleEnergy[TEXT_LUMA][0] = zeroEnergyY; @@ -3284,7 +3336,11 @@ // finally we have to encode correct cbf after comparing with null cost pixel* curReconC = m_rqt[qtLayer].reconQtYuv.getChromaAddr(chromaId, absPartIdxC); uint32_t strideReconC = m_rqt[qtLayer].reconQtYuv.m_csize; - primitives.cu[partSizeC].add_ps(curReconC, strideReconC, mode.predYuv.getChromaAddr(chromaId, absPartIdxC), curResiC, mode.predYuv.m_csize, strideResiC); + bool curReconCAlign = m_rqt[qtLayer].reconQtYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0; + bool predYuvAlign = mode.predYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0; + bool curResiCAlign = m_rqt[qtLayer].resiQtYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0; + bool bufferAlignCheck = curReconCAlign && predYuvAlign && curResiCAlign && (strideReconC % 64 == 0) && (mode.predYuv.m_csize % 64 == 0) && (strideResiC % 64 == 0); + primitives.cu[partSizeC].add_ps[bufferAlignCheck](curReconC, strideReconC, mode.predYuv.getChromaAddr(chromaId, absPartIdxC), curResiC, mode.predYuv.m_csize, strideResiC); sse_t nonZeroDistC = m_rdCost.scaleChromaDist(chromaId, primitives.cu[partSizeC].sse_pp(fenc, fencYuv->m_csize, curReconC, strideReconC)); uint32_t nzCbfBitsC = m_entropyCoder.estimateCbfBits(cbfFlag[chromaId][tuIterator.section], (TextType)chromaId, tuDepth); uint32_t nonZeroEnergyC = 0; uint64_t singleCostC = 0; @@ -3315,7 +3371,7 @@ { cbfFlag[chromaId][tuIterator.section] = 0; singleBits[chromaId][tuIterator.section] = 0; - primitives.cu[partSizeC].blockfill_s(curResiC, strideResiC, 0); + primitives.cu[partSizeC].blockfill_s[strideResiC % 64 == 0](curResiC, strideResiC, 0); #if CHECKED_BUILD || _DEBUG uint32_t numCoeffC = 1 << (log2TrSizeC << 1); memset(coeffCurC + subTUOffset, 0, sizeof(coeff_t) * numCoeffC); @@ -3338,7 +3394,7 @@ { if (checkTransformSkipC) minCost[chromaId][tuIterator.section] = estimateNullCbfCost(zeroDistC, zeroEnergyC, tuDepthC, (TextType)chromaId); - primitives.cu[partSizeC].blockfill_s(curResiC, strideResiC, 0); + primitives.cu[partSizeC].blockfill_s[strideResiC % 64 == 0](curResiC, strideResiC, 0); singleBits[chromaId][tuIterator.section] = 0; singleDist[chromaId][tuIterator.section] = zeroDistC; singleEnergy[chromaId][tuIterator.section] = zeroEnergyC; @@ -3388,8 +3444,10 @@ const uint32_t skipSingleBitsY = m_entropyCoder.getNumberOfWrittenBits(); m_quant.invtransformNxN(cu, m_tsResidual, trSize, m_tsCoeff, log2TrSize, TEXT_LUMA, false, true, numSigTSkipY); + bool predYuvAlign = mode.predYuv.getAddrOffset(absPartIdx, mode.predYuv.m_size) % 64 == 0; - primitives.cu[partSize].add_ps(m_tsRecon, trSize, mode.predYuv.getLumaAddr(absPartIdx), m_tsResidual, mode.predYuv.m_size, trSize); + bool bufferAlignCheck = predYuvAlign && (trSize % 64 == 0) && (mode.predYuv.m_size % 64 == 0); + primitives.cu[partSize].add_ps[bufferAlignCheck](m_tsRecon, 
trSize, mode.predYuv.getLumaAddr(absPartIdx), m_tsResidual, mode.predYuv.m_size, trSize); nonZeroDistY = primitives.cu[partSize].sse_pp(fenc, fencYuv->m_size, m_tsRecon, trSize); if (m_rdCost.m_psyRd) @@ -3466,7 +3524,9 @@ m_quant.invtransformNxN(cu, m_tsResidual, trSizeC, m_tsCoeff, log2TrSizeC, (TextType)chromaId, false, true, numSigTSkipC); - primitives.cu[partSizeC].add_ps(m_tsRecon, trSizeC, mode.predYuv.getChromaAddr(chromaId, absPartIdxC), m_tsResidual, mode.predYuv.m_csize, trSizeC); + bool predYuvAlign = mode.predYuv.getChromaAddrOffset(absPartIdxC) % 64 == 0; + bool bufferAlignCheck = predYuvAlign && (trSizeC % 64 == 0) && (mode.predYuv.m_csize % 64 == 0) && (trSizeC % 64 == 0); + primitives.cu[partSizeC].add_ps[bufferAlignCheck](m_tsRecon, trSizeC, mode.predYuv.getChromaAddr(chromaId, absPartIdxC), m_tsResidual, mode.predYuv.m_csize, trSizeC); nonZeroDistC = m_rdCost.scaleChromaDist(chromaId, primitives.cu[partSizeC].sse_pp(fenc, fencYuv->m_csize, m_tsRecon, trSizeC)); if (m_rdCost.m_psyRd) {
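A large part of the search.cpp diff converts primitives such as add_ps, calcresidual, blockfill_s and pixelavg_pp into two-entry tables indexed by an alignment predicate, so callers pick an aligned kernel only when every pointer offset and stride involved is a multiple of 64. A minimal sketch of that dispatch pattern, with blockfillGeneric/blockfillAligned as hypothetical stand-ins for the real C and AVX-512 kernels:

#include <cstdint>
#include <cstdio>

// Two variants of the same primitive: index 0 is the generic kernel, index 1
// may assume 64-aligned buffers and strides (e.g. wide SIMD loads/stores).
using blockfill_t = void (*)(int16_t* dst, intptr_t stride, int16_t val);

static void blockfillGeneric(int16_t* dst, intptr_t stride, int16_t val)
{
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            dst[y * stride + x] = val;
}

// Stand-in for the aligned kernel; same behaviour, stricter preconditions.
static void blockfillAligned(int16_t* dst, intptr_t stride, int16_t val)
{
    blockfillGeneric(dst, stride, val);
}

static const blockfill_t blockfill_s[2] = { blockfillGeneric, blockfillAligned };

int main()
{
    alignas(64) int16_t resi[64 * 8] = {};
    intptr_t stride = 64;

    // Mirrors the call sites in the patch: the index is a boolean alignment
    // check over every stride/offset involved, e.g. "strideResiY % 64 == 0".
    bool bufferAlignCheck = (stride % 64 == 0) &&
                            (reinterpret_cast<uintptr_t>(resi) % 64 == 0);
    blockfill_s[bufferAlignCheck](resi, stride, 0);

    printf("used %s kernel\n", bufferAlignCheck ? "aligned" : "generic");
    return 0;
}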
View file
x265_2.7.tar.gz/source/encoder/search.h -> x265_2.9.tar.gz/source/encoder/search.h
Changed
@@ -310,8 +310,7 @@ // estimation inter prediction (non-skip) void predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC, uint32_t masks[2]); - - void searchMV(Mode& interMode, const PredictionUnit& pu, int list, int ref, MV& outmv); + void searchMV(Mode& interMode, const PredictionUnit& pu, int list, int ref, MV& outmv, MV mvp, int numMvc, MV* mvc); // encode residual and compute rd-cost for inter mode void encodeResAndCalcRdInterCU(Mode& interMode, const CUGeom& cuGeom); void encodeResAndCalcRdSkipCU(Mode& interMode);
View file
x265_2.7.tar.gz/source/encoder/sei.cpp -> x265_2.9.tar.gz/source/encoder/sei.cpp
Changed
@@ -35,45 +35,40 @@ }; /* marshal a single SEI message sei, storing the marshalled representation - * in bitstream bs */ -void SEI::write(Bitstream& bs, const SPS& sps) +* in bitstream bs */ +void SEI::writeSEImessages(Bitstream& bs, const SPS& sps, NalUnitType nalUnitType, NALList& list, int isNested) { - uint32_t type = m_payloadType; + if (!isNested) + bs.resetBits(); + + BitCounter counter; + m_bitIf = &counter; + writeSEI(sps); + /* count the size of the payload and return the size in bits */ + X265_CHECK(0 == (counter.getNumberOfWrittenBits() & 7), "payload unaligned\n"); + uint32_t payloadData = counter.getNumberOfWrittenBits() >> 3; + + // set bitstream m_bitIf = &bs; - BitCounter count; - bool hrdTypes = (m_payloadType == ACTIVE_PARAMETER_SETS || m_payloadType == PICTURE_TIMING || m_payloadType == BUFFERING_PERIOD); - if (hrdTypes) - { - m_bitIf = &count; - /* virtual writeSEI method, write to bit counter to determine size */ - writeSEI(sps); - m_bitIf = &bs; - uint32_t payloadType = m_payloadType; - for (; payloadType >= 0xff; payloadType -= 0xff) - WRITE_CODE(0xff, 8, "payload_type"); - } - WRITE_CODE(type, 8, "payload_type"); - uint32_t payloadSize; - if (hrdTypes || m_payloadType == USER_DATA_UNREGISTERED || m_payloadType == USER_DATA_REGISTERED_ITU_T_T35) + + uint32_t payloadType = m_payloadType; + for (; payloadType >= 0xff; payloadType -= 0xff) + WRITE_CODE(0xff, 8, "payload_type"); + WRITE_CODE(payloadType, 8, "payload_type"); + + uint32_t payloadSize = payloadData; + for (; payloadSize >= 0xff; payloadSize -= 0xff) + WRITE_CODE(0xff, 8, "payload_size"); + WRITE_CODE(payloadSize, 8, "payload_size"); + + // virtual writeSEI method, write to bs + writeSEI(sps); + + if (!isNested) { - if (hrdTypes) - { - X265_CHECK(0 == (count.getNumberOfWrittenBits() & 7), "payload unaligned\n"); - payloadSize = count.getNumberOfWrittenBits() >> 3; - } - else if (m_payloadType == USER_DATA_UNREGISTERED) - payloadSize = m_payloadSize + 16; - else - payloadSize = m_payloadSize; - - for (; payloadSize >= 0xff; payloadSize -= 0xff) - WRITE_CODE(0xff, 8, "payload_size"); - WRITE_CODE(payloadSize, 8, "payload_size"); + bs.writeByteAlignment(); + list.serialize(nalUnitType, bs); } - else - WRITE_CODE(m_payloadSize, 8, "payload_size"); - /* virtual writeSEI method, write to bs */ - writeSEI(sps); } void SEI::writeByteAlign() @@ -93,3 +88,63 @@ { m_payloadSize = size; } + +/* charSet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/" */ + +char* SEI::base64Decode(char encodedString[], int base64EncodeLength) +{ + char* decodedString; + decodedString = (char*)malloc(sizeof(char) * ((base64EncodeLength / 4) * 3)); + int i, j, k = 0; + // stores the bitstream + int bitstream = 0; + // countBits stores current number of bits in bitstream + int countBits = 0; + // selects 4 characters from encodedString at a time. 
Find the position of each encoded character in charSet and stores in bitstream + for (i = 0; i < base64EncodeLength; i += 4) + { + bitstream = 0, countBits = 0; + for (j = 0; j < 4; j++) + { + // make space for 6 bits + if (encodedString[i + j] != '=') + { + bitstream = bitstream << 6; + countBits += 6; + } + // Finding the position of each encoded character in charSet and storing in bitstream, use OR '|' operator to store bits + + if (encodedString[i + j] >= 'A' && encodedString[i + j] <= 'Z') + bitstream = bitstream | (encodedString[i + j] - 'A'); + + else if (encodedString[i + j] >= 'a' && encodedString[i + j] <= 'z') + bitstream = bitstream | (encodedString[i + j] - 'a' + 26); + + else if (encodedString[i + j] >= '0' && encodedString[i + j] <= '9') + bitstream = bitstream | (encodedString[i + j] - '0' + 52); + + // '+' occurs in 62nd position in charSet + else if (encodedString[i + j] == '+') + bitstream = bitstream | 62; + + // '/' occurs in 63rd position in charSet + else if (encodedString[i + j] == '/') + bitstream = bitstream | 63; + + // to delete appended bits during encoding + else + { + bitstream = bitstream >> 2; + countBits -= 2; + } + } + + while (countBits != 0) + { + countBits -= 8; + decodedString[k++] = (bitstream >> countBits) & 255; + } + } + return decodedString; +} +
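writeSEImessages() above sizes the payload with a BitCounter first, then emits the SEI header with the usual 0xFF escaping of payload_type and payload_size before writing the payload itself (byte alignment and NAL serialization are skipped when the message is aggregated into a single SEI NAL). A small sketch of just the escaped header emission, using a plain byte vector as a stand-in for the Bitstream class:

#include <cstdint>
#include <cstdio>
#include <vector>

// Emit an SEI payload header the way writeSEImessages() does: both the type
// and the size are written as a run of 0xFF bytes followed by the remainder.
static void writeSeiHeader(std::vector<uint8_t>& bs, uint32_t payloadType, uint32_t payloadSize)
{
    for (; payloadType >= 0xff; payloadType -= 0xff)
        bs.push_back(0xff);                 // WRITE_CODE(0xff, 8, "payload_type")
    bs.push_back((uint8_t)payloadType);

    for (; payloadSize >= 0xff; payloadSize -= 0xff)
        bs.push_back(0xff);                 // WRITE_CODE(0xff, 8, "payload_size")
    bs.push_back((uint8_t)payloadSize);
}

int main()
{
    std::vector<uint8_t> bs;
    writeSeiHeader(bs, 4 /* user_data_registered_itu_t_t35 */, 300);
    for (uint8_t b : bs)
        printf("%02x ", (unsigned)b);       // prints: 04 ff 2d
    printf("\n");
    return 0;
}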
View file
x265_2.7.tar.gz/source/encoder/sei.h -> x265_2.9.tar.gz/source/encoder/sei.h
Changed
@@ -27,6 +27,8 @@ #include "common.h" #include "bitstream.h" #include "slice.h" +#include "nal.h" +#include "md5.h" namespace X265_NS { // private namespace @@ -34,11 +36,11 @@ class SEI : public SyntaxElementWriter { public: - /* SEI users call write() to marshal an SEI to a bitstream. - * The write() method calls writeSEI() which encodes the header */ - void write(Bitstream& bs, const SPS& sps); - + /* SEI users call writeSEImessages() to marshal an SEI to a bitstream. + * The writeSEImessages() method calls writeSEI() which encodes the header */ + void writeSEImessages(Bitstream& bs, const SPS& sps, NalUnitType nalUnitType, NALList& list, int isNested); void setSize(uint32_t size); + static char* base64Decode(char encodedString[], int base64EncodeLength); virtual ~SEI() {} protected: SEIPayloadType m_payloadType; @@ -47,6 +49,32 @@ void writeByteAlign(); }; +//seongnam.oh@samsung.com :: for the Creative Intent Meta Data Encoding +class SEIuserDataRegistered : public SEI +{ +public: + SEIuserDataRegistered() + { + m_payloadType = USER_DATA_REGISTERED_ITU_T_T35; + m_payloadSize = 0; + } + + uint8_t *m_userData; + + // daniel.vt@samsung.com :: for the Creative Intent Meta Data Encoding ( seongnam.oh@samsung.com ) + void writeSEI(const SPS&) + { + if (!m_userData) + return; + + uint32_t i = 0; + for (; i < m_payloadSize; ++i) + WRITE_CODE(m_userData[i], 8, "creative_intent_metadata"); + } +}; + +static const uint32_t ISO_IEC_11578_LEN = 16; + class SEIuserDataUnregistered : public SEI { public: @@ -55,11 +83,11 @@ m_payloadType = USER_DATA_UNREGISTERED; m_payloadSize = 0; } - static const uint8_t m_uuid_iso_iec_11578[16]; + static const uint8_t m_uuid_iso_iec_11578[ISO_IEC_11578_LEN]; uint8_t *m_userData; void writeSEI(const SPS&) { - for (uint32_t i = 0; i < 16; i++) + for (uint32_t i = 0; i < ISO_IEC_11578_LEN; i++) WRITE_CODE(m_uuid_iso_iec_11578[i], 8, "sei.uuid_iso_iec_11578[i]"); for (uint32_t i = 0; i < m_payloadSize; i++) WRITE_CODE(m_userData[i], 8, "user_data"); @@ -133,7 +161,12 @@ CRC, CHECKSUM, } m_method; - uint8_t m_digest[3][16]; + + MD5Context m_state[3]; + uint32_t m_crc[3]; + uint32_t m_checksum[3]; + uint8_t m_digest[3][16]; + void writeSEI(const SPS& sps) { int planes = (sps.chromaFormatIdc != X265_CSP_I400) ? 3 : 1; @@ -253,6 +286,11 @@ class SEIRecoveryPoint : public SEI { public: + SEIRecoveryPoint() + { + m_payloadType = RECOVERY_POINT; + m_payloadSize = 0; + } int m_recoveryPocCnt; bool m_exactMatchingFlag; bool m_brokenLinkFlag; @@ -266,28 +304,22 @@ } }; -//seongnam.oh@samsung.com :: for the Creative Intent Meta Data Encoding -class SEICreativeIntentMeta : public SEI +class SEIAlternativeTC : public SEI { public: - SEICreativeIntentMeta() + int m_preferredTransferCharacteristics; + SEIAlternativeTC() { - m_payloadType = USER_DATA_REGISTERED_ITU_T_T35; + m_payloadType = ALTERNATIVE_TRANSFER_CHARACTERISTICS; m_payloadSize = 0; + m_preferredTransferCharacteristics = -1; } - uint8_t *m_payload; - - // daniel.vt@samsung.com :: for the Creative Intent Meta Data Encoding ( seongnam.oh@samsung.com ) void writeSEI(const SPS&) { - if (!m_payload) - return; - - uint32_t i = 0; - for (; i < m_payloadSize; ++i) - WRITE_CODE(m_payload[i], 8, "creative_intent_metadata"); + WRITE_CODE(m_preferredTransferCharacteristics, 8, "Preferred transfer characteristics"); } }; + } #endif // ifndef X265_SEI_H
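SEIAlternativeTC above carries a single preferred_transfer_characteristics byte, which the frame encoder now emits on IRAP frames when the corresponding parameter is set. A minimal sketch of building that one-byte payload; the value 18 (ARIB STD-B67 / HLG) is only an example:

#include <cstdint>
#include <cstdio>
#include <vector>

// Alternative transfer characteristics SEI: the payload body is a single
// 8-bit preferred_transfer_characteristics code.
static std::vector<uint8_t> buildAltTcPayload(uint8_t preferredTc)
{
    return { preferredTc };
}

int main()
{
    std::vector<uint8_t> payload = buildAltTcPayload(18);   // 18 = ARIB STD-B67 (HLG)
    printf("payload size %zu, value %u\n", payload.size(), (unsigned)payload[0]);
    return 0;
}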
View file
x265_2.7.tar.gz/source/encoder/slicetype.cpp -> x265_2.9.tar.gz/source/encoder/slicetype.cpp
Changed
@@ -150,20 +150,14 @@ curFrame->m_lowres.wp_sum[y] = 0; } - /* Calculate Qp offset for each 16x16 or 8x8 block in the frame */ - int blockXY = 0; - int blockX = 0, blockY = 0; - double strength = 0.f; + /* Calculate Qp offset for each 16x16 or 8x8 block in the frame */ if ((param->rc.aqMode == X265_AQ_NONE || param->rc.aqStrength == 0) || (param->rc.bStatRead && param->rc.cuTree && IS_REFERENCED(curFrame))) { - /* Need to init it anyways for CU tree */ - int cuCount = blockCount; - if (param->rc.aqMode && param->rc.aqStrength == 0) { if (quantOffsets) { - for (int cuxy = 0; cuxy < cuCount; cuxy++) + for (int cuxy = 0; cuxy < blockCount; cuxy++) { curFrame->m_lowres.qpCuTreeOffset[cuxy] = curFrame->m_lowres.qpAqOffset[cuxy] = quantOffsets[cuxy]; curFrame->m_lowres.invQscaleFactor[cuxy] = x265_exp2fix8(curFrame->m_lowres.qpCuTreeOffset[cuxy]); @@ -171,61 +165,55 @@ } else { - memset(curFrame->m_lowres.qpCuTreeOffset, 0, cuCount * sizeof(double)); - memset(curFrame->m_lowres.qpAqOffset, 0, cuCount * sizeof(double)); - for (int cuxy = 0; cuxy < cuCount; cuxy++) - curFrame->m_lowres.invQscaleFactor[cuxy] = 256; + memset(curFrame->m_lowres.qpCuTreeOffset, 0, blockCount * sizeof(double)); + memset(curFrame->m_lowres.qpAqOffset, 0, blockCount * sizeof(double)); + for (int cuxy = 0; cuxy < blockCount; cuxy++) + curFrame->m_lowres.invQscaleFactor[cuxy] = 256; } } - /* Need variance data for weighted prediction */ + /* Need variance data for weighted prediction and dynamic refinement*/ if (param->bEnableWeightedPred || param->bEnableWeightedBiPred) { - for (blockY = 0; blockY < maxRow; blockY += loopIncr) - for (blockX = 0; blockX < maxCol; blockX += loopIncr) - acEnergyCu(curFrame, blockX, blockY, param->internalCsp, param->rc.qgSize); + for (int blockY = 0; blockY < maxRow; blockY += loopIncr) + for (int blockX = 0; blockX < maxCol; blockX += loopIncr) + acEnergyCu(curFrame, blockX, blockY, param->internalCsp, param->rc.qgSize); } } else { - blockXY = 0; - double avg_adj_pow2 = 0, avg_adj = 0, qp_adj = 0; - double bias_strength = 0.f; + int blockXY = 0; + double avg_adj_pow2 = 0.f, avg_adj = 0.f, qp_adj = 0.f; + double bias_strength = 0.f, strength = 0.f; if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE || param->rc.aqMode == X265_AQ_AUTO_VARIANCE_BIASED) { - double bit_depth_correction = 1.f / (1 << (2*(X265_DEPTH-8))); - curFrame->m_lowres.frameVariance = 0; - uint64_t rowVariance = 0; - for (blockY = 0; blockY < maxRow; blockY += loopIncr) - { - rowVariance = 0; - for (blockX = 0; blockX < maxCol; blockX += loopIncr) - { - uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp, param->rc.qgSize); - curFrame->m_lowres.blockVariance[blockXY] = energy; - rowVariance += energy; + double bit_depth_correction = 1.f / (1 << (2*(X265_DEPTH-8))); + + for (int blockY = 0; blockY < maxRow; blockY += loopIncr) + { + for (int blockX = 0; blockX < maxCol; blockX += loopIncr) + { + uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp, param->rc.qgSize); qp_adj = pow(energy * bit_depth_correction + 1, 0.1); curFrame->m_lowres.qpCuTreeOffset[blockXY] = qp_adj; avg_adj += qp_adj; avg_adj_pow2 += qp_adj * qp_adj; blockXY++; } - curFrame->m_lowres.frameVariance += (rowVariance / maxCol); } - curFrame->m_lowres.frameVariance /= maxRow; avg_adj /= blockCount; avg_adj_pow2 /= blockCount; strength = param->rc.aqStrength * avg_adj; - avg_adj = avg_adj - 0.5f * (avg_adj_pow2 - (modeTwoConst)) / avg_adj; + avg_adj = avg_adj - 0.5f * (avg_adj_pow2 - modeTwoConst) / avg_adj; 
bias_strength = param->rc.aqStrength; } else strength = param->rc.aqStrength * 1.0397f; blockXY = 0; - for (blockY = 0; blockY < maxRow; blockY += loopIncr) + for (int blockY = 0; blockY < maxRow; blockY += loopIncr) { - for (blockX = 0; blockX < maxCol; blockX += loopIncr) + for (int blockX = 0; blockX < maxCol; blockX += loopIncr) { if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE_BIASED) { @@ -240,7 +228,7 @@ else { uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp,param->rc.qgSize); - qp_adj = strength * (X265_LOG2(X265_MAX(energy, 1)) - (modeOneConst + 2 * (X265_DEPTH - 8))); + qp_adj = strength * (X265_LOG2(X265_MAX(energy, 1)) - (modeOneConst + 2 * (X265_DEPTH - 8))); } if (param->bHDROpt) @@ -308,6 +296,17 @@ curFrame->m_lowres.wp_ssd[i] = ssd - (sum * sum + (width[i] * height[i]) / 2) / (width[i] * height[i]); } } + + if (param->bDynamicRefine) + { + int blockXY = 0; + for (int blockY = 0; blockY < maxRow; blockY += loopIncr) + for (int blockX = 0; blockX < maxCol; blockX += loopIncr) + { + curFrame->m_lowres.blockVariance[blockXY] = acEnergyCu(curFrame, blockX, blockY, param->internalCsp, param->rc.qgSize); + blockXY++; + } + } } void LookaheadTLD::lowresIntraEstimate(Lowres& fenc, uint32_t qgSize) @@ -426,7 +425,7 @@ pixel *src = ref.fpelPlane[0]; intptr_t stride = fenc.lumaStride; - if (wp.bPresentFlag) + if (wp.wtPresent) { int offset = wp.inputOffset << (X265_DEPTH - 8); int scale = wp.inputWeight; @@ -480,7 +479,7 @@ int deltaIndex = fenc.frameNum - ref.frameNum; WeightParam wp; - wp.bPresentFlag = false; + wp.wtPresent = 0; if (!wbuffer[0]) { @@ -1078,85 +1077,97 @@ } int bframes, brefs; - for (bframes = 0, brefs = 0;; bframes++) + if (!m_param->analysisLoad) { - Lowres& frm = list[bframes]->m_lowres; - - if (frm.sliceType == X265_TYPE_BREF && !m_param->bBPyramid && brefs == m_param->bBPyramid) + for (bframes = 0, brefs = 0;; bframes++) { - frm.sliceType = X265_TYPE_B; - x265_log(m_param, X265_LOG_WARNING, "B-ref at frame %d incompatible with B-pyramid\n", - frm.frameNum); - } + Lowres& frm = list[bframes]->m_lowres; - /* pyramid with multiple B-refs needs a big enough dpb that the preceding P-frame stays available. - * smaller dpb could be supported by smart enough use of mmco, but it's easier just to forbid it. */ - else if (frm.sliceType == X265_TYPE_BREF && m_param->bBPyramid && brefs && - m_param->maxNumReferences <= (brefs + 3)) - { - frm.sliceType = X265_TYPE_B; - x265_log(m_param, X265_LOG_WARNING, "B-ref at frame %d incompatible with B-pyramid and %d reference frames\n", - frm.sliceType, m_param->maxNumReferences); - } - if ((!m_param->bIntraRefresh || frm.frameNum == 0) && frm.frameNum - m_lastKeyframe >= m_param->keyframeMax && - (!m_extendGopBoundary || frm.frameNum - m_lastKeyframe >= m_param->keyframeMax + m_param->gopLookahead)) - { - if (frm.sliceType == X265_TYPE_AUTO || frm.sliceType == X265_TYPE_I) - frm.sliceType = m_param->bOpenGOP && m_lastKeyframe >= 0 ? X265_TYPE_I : X265_TYPE_IDR; - bool warn = frm.sliceType != X265_TYPE_IDR; - if (warn && m_param->bOpenGOP) - warn &= frm.sliceType != X265_TYPE_I; - if (warn) + if (frm.sliceType == X265_TYPE_BREF && !m_param->bBPyramid && brefs == m_param->bBPyramid) { - x265_log(m_param, X265_LOG_WARNING, "specified frame type (%d) at %d is not compatible with keyframe interval\n", - frm.sliceType, frm.frameNum); - frm.sliceType = m_param->bOpenGOP && m_lastKeyframe >= 0 ? 
X265_TYPE_I : X265_TYPE_IDR; + frm.sliceType = X265_TYPE_B; + x265_log(m_param, X265_LOG_WARNING, "B-ref at frame %d incompatible with B-pyramid\n", + frm.frameNum); } - } - if (frm.sliceType == X265_TYPE_I && frm.frameNum - m_lastKeyframe >= m_param->keyframeMin) - { - if (m_param->bOpenGOP) + + /* pyramid with multiple B-refs needs a big enough dpb that the preceding P-frame stays available. + * smaller dpb could be supported by smart enough use of mmco, but it's easier just to forbid it. */ + else if (frm.sliceType == X265_TYPE_BREF && m_param->bBPyramid && brefs && + m_param->maxNumReferences <= (brefs + 3)) + { + frm.sliceType = X265_TYPE_B; + x265_log(m_param, X265_LOG_WARNING, "B-ref at frame %d incompatible with B-pyramid and %d reference frames\n", + frm.sliceType, m_param->maxNumReferences); + } + if (((!m_param->bIntraRefresh || frm.frameNum == 0) && frm.frameNum - m_lastKeyframe >= m_param->keyframeMax && + (!m_extendGopBoundary || frm.frameNum - m_lastKeyframe >= m_param->keyframeMax + m_param->gopLookahead)) || + (frm.frameNum == (m_param->chunkStart - 1)) || (frm.frameNum == m_param->chunkEnd)) + { + if (frm.sliceType == X265_TYPE_AUTO || frm.sliceType == X265_TYPE_I) + frm.sliceType = m_param->bOpenGOP && m_lastKeyframe >= 0 ? X265_TYPE_I : X265_TYPE_IDR; + bool warn = frm.sliceType != X265_TYPE_IDR; + if (warn && m_param->bOpenGOP) + warn &= frm.sliceType != X265_TYPE_I; + if (warn) + { + x265_log(m_param, X265_LOG_WARNING, "specified frame type (%d) at %d is not compatible with keyframe interval\n", + frm.sliceType, frm.frameNum); + frm.sliceType = m_param->bOpenGOP && m_lastKeyframe >= 0 ? X265_TYPE_I : X265_TYPE_IDR; + } + } + if ((frm.sliceType == X265_TYPE_I && frm.frameNum - m_lastKeyframe >= m_param->keyframeMin) || (frm.frameNum == (m_param->chunkStart - 1)) || (frm.frameNum == m_param->chunkEnd)) { + if (m_param->bOpenGOP) + { + m_lastKeyframe = frm.frameNum; + frm.bKeyframe = true; + } + else + frm.sliceType = X265_TYPE_IDR; + } + if (frm.sliceType == X265_TYPE_IDR) + { + /* Closed GOP */ m_lastKeyframe = frm.frameNum; frm.bKeyframe = true; + if (bframes > 0 && !m_param->radl) + { + list[bframes - 1]->m_lowres.sliceType = X265_TYPE_P; + bframes--; + } } - else - frm.sliceType = X265_TYPE_IDR; - } - if (frm.sliceType == X265_TYPE_IDR) - { - /* Closed GOP */ - m_lastKeyframe = frm.frameNum; - frm.bKeyframe = true; - if (bframes > 0 && !m_param->radl) + if (bframes == m_param->bframes || !list[bframes + 1]) { - list[bframes - 1]->m_lowres.sliceType = X265_TYPE_P; - bframes--; + if (IS_X265_TYPE_B(frm.sliceType)) + x265_log(m_param, X265_LOG_WARNING, "specified frame type is not compatible with max B-frames\n"); + if (frm.sliceType == X265_TYPE_AUTO || IS_X265_TYPE_B(frm.sliceType)) + frm.sliceType = X265_TYPE_P; } - } - if (m_param->radl && !m_param->bOpenGOP && list[bframes + 1]) - { - if ((frm.frameNum - m_lastKeyframe) > (m_param->keyframeMax - m_param->radl - 1) && (frm.frameNum - m_lastKeyframe) < m_param->keyframeMax) + if (frm.sliceType == X265_TYPE_BREF) + brefs++; + if (frm.sliceType == X265_TYPE_AUTO) frm.sliceType = X265_TYPE_B; - if ((frm.frameNum - m_lastKeyframe) == (m_param->keyframeMax - m_param->radl - 1)) - frm.sliceType = X265_TYPE_P; + else if (!IS_X265_TYPE_B(frm.sliceType)) + break; } - - if (bframes == m_param->bframes || !list[bframes + 1]) + } + else + { + for (bframes = 0, brefs = 0;; bframes++) { - if (IS_X265_TYPE_B(frm.sliceType)) - x265_log(m_param, X265_LOG_WARNING, "specified frame type is not compatible with max B-frames\n"); - if 
(frm.sliceType == X265_TYPE_AUTO || IS_X265_TYPE_B(frm.sliceType)) - frm.sliceType = X265_TYPE_P; - } - if (frm.sliceType == X265_TYPE_BREF) - brefs++; - if (frm.sliceType == X265_TYPE_AUTO) - frm.sliceType = X265_TYPE_B; - else if (!IS_X265_TYPE_B(frm.sliceType)) - break; + Lowres& frm = list[bframes]->m_lowres; + if (frm.sliceType == X265_TYPE_BREF) + brefs++; + if ((IS_X265_TYPE_I(frm.sliceType) && frm.frameNum - m_lastKeyframe >= m_param->keyframeMin) + || (frm.frameNum == (m_param->chunkStart - 1)) || (frm.frameNum == m_param->chunkEnd)) + { + m_lastKeyframe = frm.frameNum; + frm.bKeyframe = true; + } + if (!IS_X265_TYPE_B(frm.sliceType)) + break; + } } - if (bframes) list[bframes - 1]->m_lowres.bLastMiniGopBFrame = true; list[bframes]->m_lowres.leadingBframes = bframes; @@ -1406,7 +1417,19 @@ return; } frames[framecnt + 1] = NULL; - int keyFrameLimit = m_param->keyframeMax + m_lastKeyframe - frames[0]->frameNum - 1; + + int keylimit = m_param->keyframeMax; + if (frames[0]->frameNum < m_param->chunkEnd) + { + int chunkStart = (m_param->chunkStart - m_lastKeyframe - 1); + int chunkEnd = (m_param->chunkEnd - m_lastKeyframe); + if ((chunkStart > 0) && (chunkStart < m_param->keyframeMax)) + keylimit = chunkStart; + else if ((chunkEnd > 0) && (chunkEnd < m_param->keyframeMax)) + keylimit = chunkEnd; + } + + int keyFrameLimit = keylimit + m_lastKeyframe - frames[0]->frameNum - 1; if (m_param->gopLookahead && keyFrameLimit <= m_param->bframes + 1) keyintLimit = keyFrameLimit + m_param->gopLookahead; else @@ -1496,6 +1519,7 @@ int numBFrames = 0; int numAnalyzed = numFrames; bool isScenecut = scenecut(frames, 0, 1, true, origNumFrames); + /* When scenecut threshold is set, use scenecut detection for I frame placements */ if (m_param->scenecutThreshold && isScenecut) { @@ -1603,14 +1627,28 @@ frames[numFrames]->sliceType = X265_TYPE_P; } - /* Check scenecut on the first minigop. */ - for (int j = 1; j < numBFrames + 1; j++) + bool bForceRADL = m_param->radl && !m_param->bOpenGOP; + bool bLastMiniGop = (framecnt >= m_param->bframes + 1) ? false : true; + int preRADL = m_lastKeyframe + m_param->keyframeMax - m_param->radl - 1; /*Frame preceeding RADL in POC order*/ + if (bForceRADL && (frames[0]->frameNum == preRADL) && !bLastMiniGop) + { + int j = 1; + numBFrames = m_param->radl; + for (; j <= m_param->radl; j++) + frames[j]->sliceType = X265_TYPE_B; + frames[j]->sliceType = X265_TYPE_I; + } + else /* Check scenecut and RADL on the first minigop. */ { - if (scenecut(frames, j, j + 1, false, origNumFrames)) + for (int j = 1; j < numBFrames + 1; j++) { - frames[j]->sliceType = X265_TYPE_P; - numAnalyzed = j; - break; + if (scenecut(frames, j, j + 1, false, origNumFrames) || + (bForceRADL && (frames[j]->frameNum == preRADL))) + { + frames[j]->sliceType = X265_TYPE_P; + numAnalyzed = j; + break; + } } } resetStart = bKeyframe ? 
1 : X265_MIN(numBFrames + 2, numAnalyzed + 1); @@ -2513,19 +2551,16 @@ intptr_t stride0 = X265_LOWRES_CU_SIZE, stride1 = X265_LOWRES_CU_SIZE; pixel *src0 = fref0->lowresMC(pelOffset, fenc->lowresMvs[0][listDist[0]][cuXY], subpelbuf0, stride0); pixel *src1 = fref1->lowresMC(pelOffset, fenc->lowresMvs[1][listDist[1]][cuXY], subpelbuf1, stride1); - ALIGN_VAR_32(pixel, ref[X265_LOWRES_CU_SIZE * X265_LOWRES_CU_SIZE]); - primitives.pu[LUMA_8x8].pixelavg_pp(ref, X265_LOWRES_CU_SIZE, src0, stride0, src1, stride1, 32); + primitives.pu[LUMA_8x8].pixelavg_pp[NONALIGNED](ref, X265_LOWRES_CU_SIZE, src0, stride0, src1, stride1, 32); int bicost = tld.me.bufSATD(ref, X265_LOWRES_CU_SIZE); COPY2_IF_LT(bcost, bicost, listused, 3); - /* coloc candidate */ src0 = fref0->lowresPlane[0] + pelOffset; src1 = fref1->lowresPlane[0] + pelOffset; - primitives.pu[LUMA_8x8].pixelavg_pp(ref, X265_LOWRES_CU_SIZE, src0, fref0->lumaStride, src1, fref1->lumaStride, 32); + primitives.pu[LUMA_8x8].pixelavg_pp[NONALIGNED](ref, X265_LOWRES_CU_SIZE, src0, fref0->lumaStride, src1, fref1->lumaStride, 32); bicost = tld.me.bufSATD(ref, X265_LOWRES_CU_SIZE); COPY2_IF_LT(bcost, bicost, listused, 3); - bcost += lowresPenalty; } else /* P, also consider intra */
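The slicetype changes above clamp the lookahead's keyframe search limit to the nearest chunk boundary when chunk encoding is active. A minimal standalone restatement of that clamping, reusing the variable names from the diff (this is a sketch of the keylimit computation only, not the full lookahead logic):

/* Sketch: mirrors the keylimit computation from slicetype.cpp above. */
static int chunkAwareKeyLimit(int keyframeMax, int chunkStart, int chunkEnd,
                              int lastKeyframe, int firstFrameNum)
{
    int keylimit = keyframeMax;
    if (firstFrameNum < chunkEnd)
    {
        int start = chunkStart - lastKeyframe - 1;   /* distance to chunk start */
        int end   = chunkEnd   - lastKeyframe;       /* distance to chunk end   */
        if (start > 0 && start < keyframeMax)
            keylimit = start;
        else if (end > 0 && end < keyframeMax)
            keylimit = end;
    }
    /* Same final expression as the diff: frames remaining before a forced keyframe. */
    return keylimit + lastKeyframe - firstFrameNum - 1;
}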
View file
x265_2.7.tar.gz/source/encoder/weightPrediction.cpp -> x265_2.9.tar.gz/source/encoder/weightPrediction.cpp
Changed
@@ -184,8 +184,7 @@ int denom = w->log2WeightDenom; int round = denom ? 1 << (denom - 1) : 0; int correction = IF_INTERNAL_PREC - X265_DEPTH; /* intermediate interpolation depth */ - int pwidth = ((width + 15) >> 4) << 4; - + int pwidth = ((width + 31) >> 5) << 5; primitives.weight_pp(ref, weightTemp, stride, pwidth, height, weight, round << correction, denom + correction, offset); ref = weightTemp; @@ -294,7 +293,7 @@ for (int plane = 0; plane < (param.internalCsp != X265_CSP_I400 ? 3 : 1); plane++) { denom = plane ? chromaDenom : lumaDenom; - if (plane && !weights[0].bPresentFlag) + if (plane && !weights[0].wtPresent) break; /* Early termination */ @@ -477,12 +476,12 @@ } } - if (weights[0].bPresentFlag) + if (weights[0].wtPresent) { // Make sure both chroma channels match - if (weights[1].bPresentFlag != weights[2].bPresentFlag) + if (weights[1].wtPresent != weights[2].wtPresent) { - if (weights[1].bPresentFlag) + if (weights[1].wtPresent) weights[2] = weights[1]; else weights[1] = weights[2]; @@ -516,15 +515,15 @@ for (int list = 0; list < numPredDir; list++) { WeightParam* w = &wp[list][0][0]; - if (w[0].bPresentFlag || w[1].bPresentFlag || w[2].bPresentFlag) + if (w[0].wtPresent || w[1].wtPresent || w[2].wtPresent) { bWeighted = true; p += sprintf(buf + p, " [L%d:R0 ", list); - if (w[0].bPresentFlag) + if (w[0].wtPresent) p += sprintf(buf + p, "Y{%d/%d%+d}", w[0].inputWeight, 1 << w[0].log2WeightDenom, w[0].inputOffset); - if (w[1].bPresentFlag) + if (w[1].wtPresent) p += sprintf(buf + p, "U{%d/%d%+d}", w[1].inputWeight, 1 << w[1].log2WeightDenom, w[1].inputOffset); - if (w[2].bPresentFlag) + if (w[2].wtPresent) p += sprintf(buf + p, "V{%d/%d%+d}", w[2].inputWeight, 1 << w[2].log2WeightDenom, w[2].inputOffset); p += sprintf(buf + p, "]"); }
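The functional change here is the padded width passed to weight_pp: it is now rounded up to the next multiple of 32 instead of 16, so each weighted row is wide enough for 32-pixel SIMD stores. The shift form used in the diff is just a fast round-up; an equivalent sketch:

/* ((width + 31) >> 5) << 5 rounds width up to a multiple of 32,
 * equivalent to ((width + 31) / 32) * 32. */
static inline int padWidthTo32(int width)
{
    return ((width + 31) >> 5) << 5;
}
/* padWidthTo32(1) == 32, padWidthTo32(33) == 64, padWidthTo32(64) == 64 */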
View file
x265_2.7.tar.gz/source/test/ipfilterharness.cpp -> x265_2.9.tar.gz/source/test/ipfilterharness.cpp
Changed
@@ -489,6 +489,26 @@ return true; } +bool IPFilterHarness::check_IPFilterLumaP2S_aligned_primitive(filter_p2s_t ref, filter_p2s_t opt) +{ + for (int i = 0; i < TEST_CASES; i++) + { + int index = i % TEST_CASES; + intptr_t rand_srcStride[] = { 128, 192, 256, 512 }; + intptr_t dstStride[] = { 192, 256, 512, 576 }; + for (int p = 0; p < 4; p++) + { + ref(pixel_test_buff[index], rand_srcStride[p], IPF_C_output_s, dstStride[p]); + checked(opt, pixel_test_buff[index] + (64 * i), rand_srcStride[p], IPF_vec_output_s, dstStride[p]); + if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t))) + return false; + } + reportfail(); + } + + return true; +} + bool IPFilterHarness::check_IPFilterChromaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt) { for (int i = 0; i < ITERS; i++) @@ -510,6 +530,29 @@ return true; } +bool IPFilterHarness::check_IPFilterChromaP2S_aligned_primitive(filter_p2s_t ref, filter_p2s_t opt) +{ + for (int i = 0; i < TEST_CASES; i++) + { + int index = i % TEST_CASES; + intptr_t rand_srcStride[] = { 128, 192, 256, 512}; + intptr_t dstStride[] = { 192, 256, 512, 576 }; + + for (int p = 0; p < 4; p++) + { + ref(pixel_test_buff[index], rand_srcStride[p], IPF_C_output_s, dstStride[p]); + + checked(opt, pixel_test_buff[index], rand_srcStride[p], IPF_vec_output_s, dstStride[p]); + + if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t))) + return false; + } + reportfail(); + } + + return true; +} + bool IPFilterHarness::testCorrectness(const EncoderPrimitives& ref, const EncoderPrimitives& opt) { @@ -571,14 +614,22 @@ return false; } } - if (opt.pu[value].convert_p2s) + if (opt.pu[value].convert_p2s[NONALIGNED]) { - if (!check_IPFilterLumaP2S_primitive(ref.pu[value].convert_p2s, opt.pu[value].convert_p2s)) + if (!check_IPFilterLumaP2S_primitive(ref.pu[value].convert_p2s[NONALIGNED], opt.pu[value].convert_p2s[NONALIGNED])) { printf("convert_p2s[%s]", lumaPartStr[value]); return false; } } + if (opt.pu[value].convert_p2s[ALIGNED]) + { + if (!check_IPFilterLumaP2S_aligned_primitive(ref.pu[value].convert_p2s[ALIGNED], opt.pu[value].convert_p2s[ALIGNED])) + { + printf("convert_p2s_aligned[%s]", lumaPartStr[value]); + return false; + } + } } for (int csp = X265_CSP_I420; csp < X265_CSP_COUNT; csp++) @@ -633,9 +684,17 @@ return false; } } - if (opt.chroma[csp].pu[value].p2s) + if (opt.chroma[csp].pu[value].p2s[ALIGNED]) + { + if (!check_IPFilterChromaP2S_aligned_primitive(ref.chroma[csp].pu[value].p2s[ALIGNED], opt.chroma[csp].pu[value].p2s[ALIGNED])) + { + printf("chroma_p2s_aligned[%s]", chromaPartStr[csp][value]); + return false; + } + } + if (opt.chroma[csp].pu[value].p2s[NONALIGNED]) { - if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].p2s, opt.chroma[csp].pu[value].p2s)) + if (!check_IPFilterChromaP2S_primitive(ref.chroma[csp].pu[value].p2s[NONALIGNED], opt.chroma[csp].pu[value].p2s[NONALIGNED])) { printf("chroma_p2s[%s]", chromaPartStr[csp][value]); return false; @@ -649,8 +708,8 @@ void IPFilterHarness::measureSpeed(const EncoderPrimitives& ref, const EncoderPrimitives& opt) { - int16_t srcStride = 96; - int16_t dstStride = 96; + int16_t srcStride = 192; /* Multiple of 64 */ + int16_t dstStride = 192; int maxVerticalfilterHalfDistance = 3; for (int value = 0; value < NUM_PU_SIZES; value++) @@ -659,62 +718,70 @@ { printf("luma_hpp[%s]\t", lumaPartStr[value]); REPORT_SPEEDUP(opt.pu[value].luma_hpp, ref.pu[value].luma_hpp, - pixel_buff + srcStride, srcStride, IPF_vec_output_p, dstStride, 1); + pixel_buff + srcStride, srcStride, 
IPF_vec_output_p, dstStride, 1); } if (opt.pu[value].luma_hps) { printf("luma_hps[%s]\t", lumaPartStr[value]); REPORT_SPEEDUP(opt.pu[value].luma_hps, ref.pu[value].luma_hps, - pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, - IPF_vec_output_s, dstStride, 1, 1); + pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, + IPF_vec_output_s, dstStride, 1, 1); } if (opt.pu[value].luma_vpp) { printf("luma_vpp[%s]\t", lumaPartStr[value]); REPORT_SPEEDUP(opt.pu[value].luma_vpp, ref.pu[value].luma_vpp, - pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, - IPF_vec_output_p, dstStride, 1); + pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, + IPF_vec_output_p, dstStride, 1); } if (opt.pu[value].luma_vps) { printf("luma_vps[%s]\t", lumaPartStr[value]); REPORT_SPEEDUP(opt.pu[value].luma_vps, ref.pu[value].luma_vps, - pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, - IPF_vec_output_s, dstStride, 1); + pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, + IPF_vec_output_s, dstStride, 1); } if (opt.pu[value].luma_vsp) { printf("luma_vsp[%s]\t", lumaPartStr[value]); REPORT_SPEEDUP(opt.pu[value].luma_vsp, ref.pu[value].luma_vsp, - short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, - IPF_vec_output_p, dstStride, 1); + short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, + IPF_vec_output_p, dstStride, 1); } if (opt.pu[value].luma_vss) { printf("luma_vss[%s]\t", lumaPartStr[value]); REPORT_SPEEDUP(opt.pu[value].luma_vss, ref.pu[value].luma_vss, - short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, - IPF_vec_output_s, dstStride, 1); + short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, + IPF_vec_output_s, dstStride, 1); } if (opt.pu[value].luma_hvpp) { printf("luma_hv [%s]\t", lumaPartStr[value]); REPORT_SPEEDUP(opt.pu[value].luma_hvpp, ref.pu[value].luma_hvpp, - pixel_buff + 3 * srcStride, srcStride, IPF_vec_output_p, srcStride, 1, 3); + pixel_buff + 3 * srcStride, srcStride, IPF_vec_output_p, srcStride, 1, 3); } - if (opt.pu[value].convert_p2s) + if (opt.pu[value].convert_p2s[NONALIGNED]) { printf("convert_p2s[%s]\t", lumaPartStr[value]); - REPORT_SPEEDUP(opt.pu[value].convert_p2s, ref.pu[value].convert_p2s, - pixel_buff, srcStride, - IPF_vec_output_s, dstStride); + REPORT_SPEEDUP(opt.pu[value].convert_p2s[NONALIGNED], ref.pu[value].convert_p2s[NONALIGNED], + pixel_buff, srcStride, + IPF_vec_output_s, dstStride); + } + + if (opt.pu[value].convert_p2s[ALIGNED]) + { + printf("convert_p2s_aligned[%s]\t", lumaPartStr[value]); + REPORT_SPEEDUP(opt.pu[value].convert_p2s[ALIGNED], ref.pu[value].convert_p2s[ALIGNED], + pixel_buff, srcStride, + IPF_vec_output_s, dstStride); } } @@ -727,47 +794,53 @@ { printf("chroma_hpp[%s]", chromaPartStr[csp][value]); REPORT_SPEEDUP(opt.chroma[csp].pu[value].filter_hpp, ref.chroma[csp].pu[value].filter_hpp, - pixel_buff + srcStride, srcStride, IPF_vec_output_p, dstStride, 1); + pixel_buff + srcStride, srcStride, IPF_vec_output_p, dstStride, 1); } if (opt.chroma[csp].pu[value].filter_hps) { printf("chroma_hps[%s]", chromaPartStr[csp][value]); REPORT_SPEEDUP(opt.chroma[csp].pu[value].filter_hps, ref.chroma[csp].pu[value].filter_hps, - pixel_buff + srcStride, srcStride, IPF_vec_output_s, dstStride, 1, 1); + pixel_buff + srcStride, srcStride, IPF_vec_output_s, dstStride, 1, 1); } if (opt.chroma[csp].pu[value].filter_vpp) { printf("chroma_vpp[%s]", chromaPartStr[csp][value]); REPORT_SPEEDUP(opt.chroma[csp].pu[value].filter_vpp, 
ref.chroma[csp].pu[value].filter_vpp, - pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, - IPF_vec_output_p, dstStride, 1); + pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, + IPF_vec_output_p, dstStride, 1); } if (opt.chroma[csp].pu[value].filter_vps) { printf("chroma_vps[%s]", chromaPartStr[csp][value]); REPORT_SPEEDUP(opt.chroma[csp].pu[value].filter_vps, ref.chroma[csp].pu[value].filter_vps, - pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, - IPF_vec_output_s, dstStride, 1); + pixel_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, + IPF_vec_output_s, dstStride, 1); } if (opt.chroma[csp].pu[value].filter_vsp) { printf("chroma_vsp[%s]", chromaPartStr[csp][value]); REPORT_SPEEDUP(opt.chroma[csp].pu[value].filter_vsp, ref.chroma[csp].pu[value].filter_vsp, - short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, - IPF_vec_output_p, dstStride, 1); + short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, + IPF_vec_output_p, dstStride, 1); } if (opt.chroma[csp].pu[value].filter_vss) { printf("chroma_vss[%s]", chromaPartStr[csp][value]); REPORT_SPEEDUP(opt.chroma[csp].pu[value].filter_vss, ref.chroma[csp].pu[value].filter_vss, - short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, - IPF_vec_output_s, dstStride, 1); + short_buff + maxVerticalfilterHalfDistance * srcStride, srcStride, + IPF_vec_output_s, dstStride, 1); } - if (opt.chroma[csp].pu[value].p2s) + if (opt.chroma[csp].pu[value].p2s[NONALIGNED]) { printf("chroma_p2s[%s]\t", chromaPartStr[csp][value]); - REPORT_SPEEDUP(opt.chroma[csp].pu[value].p2s, ref.chroma[csp].pu[value].p2s, - pixel_buff, srcStride, IPF_vec_output_s, dstStride); + REPORT_SPEEDUP(opt.chroma[csp].pu[value].p2s[NONALIGNED], ref.chroma[csp].pu[value].p2s[NONALIGNED], + pixel_buff, srcStride, IPF_vec_output_s, dstStride); + } + if (opt.chroma[csp].pu[value].p2s[ALIGNED]) + { + printf("chroma_p2s_aligned[%s]\t", chromaPartStr[csp][value]); + REPORT_SPEEDUP(opt.chroma[csp].pu[value].p2s[ALIGNED], ref.chroma[csp].pu[value].p2s[ALIGNED], + pixel_buff, srcStride, IPF_vec_output_s, dstStride); } } }
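The new *_aligned checks exercise the second entry of the convert_p2s/p2s primitive tables with strides that are multiples of 64, since the aligned kernels assume 64-byte-aligned pointers and strides. How a caller chooses between the two entries is not part of this diff; a hedged sketch of the idea, with NONALIGNED/ALIGNED as used in the harness and everything else illustrative:

#include <cstdint>

/* Illustrative only: the ALIGNED kernel is safe when both the pointer and the
 * stride (in bytes) are 64-byte multiples; otherwise fall back to NONALIGNED. */
static bool okForAlignedKernel(const void* p, intptr_t strideBytes)
{
    return (reinterpret_cast<uintptr_t>(p) & 63) == 0 && (strideBytes & 63) == 0;
}
/* filter_p2s_t fn = okForAlignedKernel(src, srcStride * sizeof(pixel))
 *                     ? pu.convert_p2s[ALIGNED] : pu.convert_p2s[NONALIGNED]; */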
View file
x265_2.7.tar.gz/source/test/ipfilterharness.h -> x265_2.9.tar.gz/source/test/ipfilterharness.h
Changed
@@ -40,15 +40,15 @@ enum { TEST_CASES = 3 }; enum { SMAX = 1 << 12 }; enum { SMIN = (unsigned)-1 << 12 }; - ALIGN_VAR_32(pixel, pixel_buff[TEST_BUF_SIZE]); - int16_t short_buff[TEST_BUF_SIZE]; - int16_t IPF_vec_output_s[TEST_BUF_SIZE]; - int16_t IPF_C_output_s[TEST_BUF_SIZE]; - pixel IPF_vec_output_p[TEST_BUF_SIZE]; - pixel IPF_C_output_p[TEST_BUF_SIZE]; + ALIGN_VAR_64(pixel, pixel_buff[TEST_BUF_SIZE]); + ALIGN_VAR_64(int16_t, short_buff[TEST_BUF_SIZE]); + ALIGN_VAR_64(int16_t, IPF_vec_output_s[TEST_BUF_SIZE]); + ALIGN_VAR_64(int16_t, IPF_C_output_s[TEST_BUF_SIZE]); + ALIGN_VAR_64(pixel, IPF_vec_output_p[TEST_BUF_SIZE]); + ALIGN_VAR_64(pixel, IPF_C_output_p[TEST_BUF_SIZE]); - pixel pixel_test_buff[TEST_CASES][TEST_BUF_SIZE]; - int16_t short_test_buff[TEST_CASES][TEST_BUF_SIZE]; + ALIGN_VAR_64(pixel, pixel_test_buff[TEST_CASES][TEST_BUF_SIZE]); + ALIGN_VAR_64(int16_t, short_test_buff[TEST_CASES][TEST_BUF_SIZE]); bool check_IPFilterChroma_primitive(filter_pp_t ref, filter_pp_t opt); bool check_IPFilterChroma_ps_primitive(filter_ps_t ref, filter_ps_t opt); @@ -62,7 +62,9 @@ bool check_IPFilterLuma_ss_primitive(filter_ss_t ref, filter_ss_t opt); bool check_IPFilterLumaHV_primitive(filter_hv_pp_t ref, filter_hv_pp_t opt); bool check_IPFilterLumaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt); + bool check_IPFilterLumaP2S_aligned_primitive(filter_p2s_t ref, filter_p2s_t opt); bool check_IPFilterChromaP2S_primitive(filter_p2s_t ref, filter_p2s_t opt); + bool check_IPFilterChromaP2S_aligned_primitive(filter_p2s_t ref, filter_p2s_t opt); public:
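Widening the test buffers to ALIGN_VAR_64 guarantees 64-byte alignment, which aligned AVX-512 loads and stores require. x265's macro is compiler-specific; a portable C++11 equivalent of what such a declaration provides, shown purely for illustration:

#include <cstdint>

/* 64-byte-aligned buffer, safe as the source or destination of aligned
 * 512-bit vector loads and stores. */
alignas(64) static int16_t example_buff[64 * 64];
/* At runtime: reinterpret_cast<uintptr_t>(example_buff) % 64 == 0. */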
View file
x265_2.7.tar.gz/source/test/mbdstharness.cpp -> x265_2.9.tar.gz/source/test/mbdstharness.cpp
Changed
@@ -61,16 +61,17 @@ for (int i = 0; i < TEST_BUF_SIZE; i++) { short_test_buff[0][i] = (rand() & PIXEL_MAX) - (rand() & PIXEL_MAX); + short_test_buff1[0][i] = (rand() & PIXEL_MAX) - (rand() & PIXEL_MAX); int_test_buff[0][i] = rand() % PIXEL_MAX; int_idct_test_buff[0][i] = (rand() % (SHORT_MAX - SHORT_MIN)) - SHORT_MAX; short_denoise_test_buff1[0][i] = short_denoise_test_buff2[0][i] = (rand() & SHORT_MAX) - (rand() & SHORT_MAX); - short_test_buff[1][i] = -PIXEL_MAX; + short_test_buff1[1][i] = -PIXEL_MAX; int_test_buff[1][i] = -PIXEL_MAX; int_idct_test_buff[1][i] = SHORT_MIN; short_denoise_test_buff1[1][i] = short_denoise_test_buff2[1][i] = -SHORT_MAX; - short_test_buff[2][i] = PIXEL_MAX; + short_test_buff1[2][i] = PIXEL_MAX; int_test_buff[2][i] = PIXEL_MAX; int_idct_test_buff[2][i] = SHORT_MAX; short_denoise_test_buff1[2][i] = short_denoise_test_buff2[2][i] = SHORT_MAX; @@ -252,12 +253,10 @@ bool MBDstHarness::check_nquant_primitive(nquant_t ref, nquant_t opt) { int j = 0; - for (int i = 0; i < ITERS; i++) { - int width = (rand() % 4 + 1) * 4; + int width = 1 << (rand() % 4 + 2); int height = width; - uint32_t optReturnValue = 0; uint32_t refReturnValue = 0; @@ -281,6 +280,136 @@ reportfail(); j += INCR; } + return true; +} + +bool MBDstHarness::check_nonPsyRdoQuant_primitive(nonPsyRdoQuant_t ref, nonPsyRdoQuant_t opt) +{ + int j = 0; + int trSize[4] = { 16, 64, 256, 1024 }; + + ALIGN_VAR_32(int64_t, ref_dest[4 * MAX_TU_SIZE]); + ALIGN_VAR_32(int64_t, opt_dest[4 * MAX_TU_SIZE]); + + for (int i = 0; i < ITERS; i++) + { + int64_t totalRdCostRef = rand(); + int64_t totalUncodedCostRef = rand(); + int64_t totalRdCostOpt = totalRdCostRef; + int64_t totalUncodedCostOpt = totalUncodedCostRef; + + int index = rand() % 4; + uint32_t blkPos = trSize[index]; + int cmp_size = 4 * MAX_TU_SIZE; + + memset(ref_dest, 0, MAX_TU_SIZE * sizeof(int64_t)); + memset(opt_dest, 0, MAX_TU_SIZE * sizeof(int64_t)); + + int index1 = rand() % TEST_CASES; + + ref(short_test_buff[index1] + j, ref_dest, &totalUncodedCostRef, &totalRdCostRef, blkPos); + checked(opt, short_test_buff[index1] + j, opt_dest, &totalUncodedCostOpt, &totalRdCostOpt, blkPos); + + if (memcmp(ref_dest, opt_dest, cmp_size)) + return false; + + if (totalUncodedCostRef != totalUncodedCostOpt) + return false; + + if (totalRdCostRef != totalRdCostOpt) + return false; + + reportfail(); + j += INCR; + } + + return true; +} +bool MBDstHarness::check_psyRdoQuant_primitive(psyRdoQuant_t ref, psyRdoQuant_t opt) +{ + int j = 0; + int trSize[4] = { 16, 64, 256, 1024 }; + + ALIGN_VAR_32(int64_t, ref_dest[4 * MAX_TU_SIZE]); + ALIGN_VAR_32(int64_t, opt_dest[4 * MAX_TU_SIZE]); + + for (int i = 0; i < ITERS; i++) + { + int64_t totalRdCostRef = rand(); + int64_t totalUncodedCostRef = rand(); + int64_t totalRdCostOpt = totalRdCostRef; + int64_t totalUncodedCostOpt = totalUncodedCostRef; + int64_t *psyScale = X265_MALLOC(int64_t, 1); + *psyScale = rand(); + + int index = rand() % 4; + uint32_t blkPos = trSize[index]; + int cmp_size = 4 * MAX_TU_SIZE; + + memset(ref_dest, 0, MAX_TU_SIZE * sizeof(int64_t)); + memset(opt_dest, 0, MAX_TU_SIZE * sizeof(int64_t)); + + int index1 = rand() % TEST_CASES; + + ref(short_test_buff[index1] + j, short_test_buff1[index1] + j, ref_dest, &totalUncodedCostRef, &totalRdCostRef, psyScale, blkPos); + checked(opt, short_test_buff[index1] + j, short_test_buff1[index1] + j, opt_dest, &totalUncodedCostOpt, &totalRdCostOpt, psyScale, blkPos); + + X265_FREE(psyScale); + if (memcmp(ref_dest, opt_dest, cmp_size)) + return false; + + if 
(totalUncodedCostRef != totalUncodedCostOpt) + return false; + + if (totalRdCostRef != totalRdCostOpt) + return false; + + reportfail(); + j += INCR; + } + + return true; +} +bool MBDstHarness::check_psyRdoQuant_primitive_avx2(psyRdoQuant_t1 ref, psyRdoQuant_t1 opt) +{ + int j = 0; + int trSize[4] = { 16, 64, 256, 1024 }; + + ALIGN_VAR_32(int64_t, ref_dest[4 * MAX_TU_SIZE]); + ALIGN_VAR_32(int64_t, opt_dest[4 * MAX_TU_SIZE]); + + for (int i = 0; i < ITERS; i++) + { + int64_t totalRdCostRef = rand(); + int64_t totalUncodedCostRef = rand(); + int64_t totalRdCostOpt = totalRdCostRef; + int64_t totalUncodedCostOpt = totalUncodedCostRef; + + int index = rand() % 4; + uint32_t blkPos = trSize[index]; + int cmp_size = 4 * MAX_TU_SIZE; + + memset(ref_dest, 0, MAX_TU_SIZE * sizeof(int64_t)); + memset(opt_dest, 0, MAX_TU_SIZE * sizeof(int64_t)); + + int index1 = rand() % TEST_CASES; + + ref(short_test_buff[index1] + j, ref_dest, &totalUncodedCostRef, &totalRdCostRef, blkPos); + checked(opt, short_test_buff[index1] + j, opt_dest, &totalUncodedCostOpt, &totalRdCostOpt, blkPos); + + + if (memcmp(ref_dest, opt_dest, cmp_size)) + return false; + + if (totalUncodedCostRef != totalUncodedCostOpt) + return false; + + if (totalRdCostRef != totalRdCostOpt) + return false; + + reportfail(); + j += INCR; + } return true; } @@ -420,6 +549,40 @@ return false; } } + + for (int i = 0; i < NUM_TR_SIZE; i++) + { + if (opt.cu[i].nonPsyRdoQuant) + { + if (!check_nonPsyRdoQuant_primitive(ref.cu[i].nonPsyRdoQuant, opt.cu[i].nonPsyRdoQuant)) + { + printf("nonPsyRdoQuant[%dx%d]: Failed!\n", 4 << i, 4 << i); + return false; + } + } + } + for (int i = 0; i < NUM_TR_SIZE; i++) + { + if (opt.cu[i].psyRdoQuant) + { + if (!check_psyRdoQuant_primitive(ref.cu[i].psyRdoQuant, opt.cu[i].psyRdoQuant)) + { + printf("psyRdoQuant[%dx%d]: Failed!\n", 4 << i, 4 << i); + return false; + } + } + } + for (int i = 0; i < NUM_TR_SIZE; i++) + { + if (opt.cu[i].psyRdoQuant_1p) + { + if (!check_psyRdoQuant_primitive_avx2(ref.cu[i].psyRdoQuant_1p, opt.cu[i].psyRdoQuant_1p)) + { + printf("psyRdoQuant_1p[%dx%d]: Failed!\n", 4 << i, 4 << i); + return false; + } + } + } for (int i = 0; i < NUM_TR_SIZE; i++) { if (opt.cu[i].count_nonzero) @@ -507,6 +670,42 @@ printf("nquant\t\t"); REPORT_SPEEDUP(opt.nquant, ref.nquant, short_test_buff[0], int_test_buff[1], mshortbuf2, 23, 23785, 32 * 32); } + + for (int value = 0; value < NUM_TR_SIZE; value++) + { + if (opt.cu[value].nonPsyRdoQuant) + { + ALIGN_VAR_32(int64_t, opt_dest[4 * MAX_TU_SIZE]); + int64_t totalRdCost = 0; + int64_t totalUncodedCost = 0; + printf("nonPsyRdoQuant[%dx%d]", 4 << value, 4 << value); + REPORT_SPEEDUP(opt.cu[value].nonPsyRdoQuant, ref.cu[value].nonPsyRdoQuant, short_test_buff[0], opt_dest, &totalUncodedCost, &totalRdCost, 0); + } + } + for (int value = 0; value < NUM_TR_SIZE; value++) + { + if (opt.cu[value].psyRdoQuant) + { + ALIGN_VAR_32(int64_t, opt_dest[4 * MAX_TU_SIZE]); + int64_t totalRdCost = 0; + int64_t totalUncodedCost = 0; + int64_t *psyScale = X265_MALLOC(int64_t, 1); + *psyScale = 0; + printf("psyRdoQuant[%dx%d]", 4 << value, 4 << value); + REPORT_SPEEDUP(opt.cu[value].psyRdoQuant, ref.cu[value].psyRdoQuant, short_test_buff[0], short_test_buff1[0], opt_dest, &totalUncodedCost, &totalRdCost, psyScale, 0); + } + } + for (int value = 0; value < NUM_TR_SIZE; value++) + { + if (opt.cu[value].psyRdoQuant_1p) + { + ALIGN_VAR_32(int64_t, opt_dest[4 * MAX_TU_SIZE]); + int64_t totalRdCost = 0; + int64_t totalUncodedCost = 0; + printf("psyRdoQuant_1p[%dx%d]", 4 << value, 4 << 
value); + REPORT_SPEEDUP(opt.cu[value].psyRdoQuant_1p, ref.cu[value].psyRdoQuant_1p, short_test_buff[0], opt_dest, &totalUncodedCost, &totalRdCost, 0); + } + } for (int value = 0; value < NUM_TR_SIZE; value++) { if (opt.cu[value].count_nonzero)
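The new nonPsyRdoQuant/psyRdoQuant checks follow the harness's usual pattern: run the C reference and the optimized kernel on identical coefficient blocks, then require bit-exact cost arrays and identical accumulated uncoded/RD costs. A condensed sketch of that pattern (names and the fixed buffer size are illustrative, not the x265 test API):

#include <cstdint>
#include <cstring>

typedef void (*rdoQuantFn)(const int16_t* coeff, int64_t* costs,
                           int64_t* totalUncoded, int64_t* totalRd, uint32_t blkPos);

/* Compare one reference/optimized pair on a single block of n coefficients (n <= 1024). */
static bool costsMatch(rdoQuantFn ref, rdoQuantFn opt,
                       const int16_t* coeff, uint32_t blkPos, size_t n)
{
    int64_t refCosts[1024] = {0}, optCosts[1024] = {0};
    int64_t refUncoded = 0, refRd = 0, optUncoded = 0, optRd = 0;

    ref(coeff, refCosts, &refUncoded, &refRd, blkPos);
    opt(coeff, optCosts, &optUncoded, &optRd, blkPos);

    return !memcmp(refCosts, optCosts, n * sizeof(int64_t))
        && refUncoded == optUncoded && refRd == optRd;
}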
View file
x265_2.7.tar.gz/source/test/mbdstharness.h -> x265_2.9.tar.gz/source/test/mbdstharness.h
Changed
@@ -51,26 +51,27 @@ int mintbuf2[MAX_TU_SIZE]; int mintbuf3[MAX_TU_SIZE]; int mintbuf4[MAX_TU_SIZE]; - int16_t short_test_buff[TEST_CASES][TEST_BUF_SIZE]; + int16_t short_test_buff1[TEST_CASES][TEST_BUF_SIZE]; int int_test_buff[TEST_CASES][TEST_BUF_SIZE]; int int_idct_test_buff[TEST_CASES][TEST_BUF_SIZE]; - uint32_t mubuf1[MAX_TU_SIZE]; uint32_t mubuf2[MAX_TU_SIZE]; uint16_t mushortbuf1[MAX_TU_SIZE]; int16_t short_denoise_test_buff1[TEST_CASES][TEST_BUF_SIZE]; int16_t short_denoise_test_buff2[TEST_CASES][TEST_BUF_SIZE]; - bool check_dequant_primitive(dequant_scaling_t ref, dequant_scaling_t opt); bool check_dequant_primitive(dequant_normal_t ref, dequant_normal_t opt); + bool check_nonPsyRdoQuant_primitive(nonPsyRdoQuant_t ref, nonPsyRdoQuant_t opt); + bool check_psyRdoQuant_primitive(psyRdoQuant_t ref, psyRdoQuant_t opt); bool check_quant_primitive(quant_t ref, quant_t opt); bool check_nquant_primitive(nquant_t ref, nquant_t opt); bool check_dct_primitive(dct_t ref, dct_t opt, intptr_t width); bool check_idct_primitive(idct_t ref, idct_t opt, intptr_t width); bool check_count_nonzero_primitive(count_nonzero_t ref, count_nonzero_t opt); bool check_denoise_dct_primitive(denoiseDct_t ref, denoiseDct_t opt); + bool check_psyRdoQuant_primitive_avx2(psyRdoQuant_t1 ref, psyRdoQuant_t1 opt); public:
View file
x265_2.7.tar.gz/source/test/pixelharness.cpp -> x265_2.9.tar.gz/source/test/pixelharness.cpp
Changed
@@ -226,6 +226,31 @@ return true; } +bool PixelHarness::check_calresidual_aligned(calcresidual_t ref, calcresidual_t opt) +{ + ALIGN_VAR_16(int16_t, ref_dest[64 * 64]); + ALIGN_VAR_16(int16_t, opt_dest[64 * 64]); + memset(ref_dest, 0, 64 * 64 * sizeof(int16_t)); + memset(opt_dest, 0, 64 * 64 * sizeof(int16_t)); + + int j = 0; + intptr_t stride = STRIDE; + for (int i = 0; i < ITERS; i++) + { + int index = i % TEST_CASES; + checked(opt, pbuf1 + j, pixel_test_buff[index] + j, opt_dest, stride); + ref(pbuf1 + j, pixel_test_buff[index] + j, ref_dest, stride); + + if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(int16_t))) + return false; + + reportfail(); + j += INCR; + } + + return true; +} + bool PixelHarness::check_ssd_s(pixel_ssd_s_t ref, pixel_ssd_s_t opt) { int j = 0; @@ -242,10 +267,27 @@ reportfail(); j += INCR; } - return true; } +bool PixelHarness::check_ssd_s_aligned(pixel_ssd_s_t ref, pixel_ssd_s_t opt) +{ + int j = 0; + for (int i = 0; i < ITERS; i++) + { + // NOTE: stride must be multiple of 16, because minimum block is 4x4 + int stride = STRIDE; + sse_t cres = ref(sbuf1 + j, stride); + sse_t vres = (sse_t)checked(opt, sbuf1 + j, (intptr_t)stride); + + if (cres != vres) + return false; + + reportfail(); + j += INCR+32; + } + return true; +} bool PixelHarness::check_weightp(weightp_sp_t ref, weightp_sp_t opt) { ALIGN_VAR_16(pixel, ref_dest[64 * (64 + 1)]); @@ -290,7 +332,11 @@ memset(ref_dest, 0, 64 * 64 * sizeof(pixel)); memset(opt_dest, 0, 64 * 64 * sizeof(pixel)); int j = 0; + bool enableavx512 = true; int width = 16 * (rand() % 4 + 1); + int cpuid = X265_NS::cpu_detect(enableavx512); + if (cpuid & X265_CPU_AVX512) + width = 32 * (rand() % 2 + 1); int height = 8; int w0 = rand() % 128; int shift = rand() % 8; // maximum is 7, see setFromWeightAndOffset() @@ -441,12 +487,10 @@ return true; } - bool PixelHarness::check_cpy1Dto2D_shl_t(cpy1Dto2D_shl_t ref, cpy1Dto2D_shl_t opt) { - ALIGN_VAR_16(int16_t, ref_dest[64 * 64]); - ALIGN_VAR_16(int16_t, opt_dest[64 * 64]); - + ALIGN_VAR_64(int16_t, ref_dest[64 * 64]); + ALIGN_VAR_64(int16_t, opt_dest[64 * 64]); memset(ref_dest, 0xCD, sizeof(ref_dest)); memset(opt_dest, 0xCD, sizeof(opt_dest)); @@ -469,6 +513,33 @@ return true; } +bool PixelHarness::check_cpy1Dto2D_shl_aligned_t(cpy1Dto2D_shl_t ref, cpy1Dto2D_shl_t opt) +{ + ALIGN_VAR_64(int16_t, ref_dest[64 * 64]); + ALIGN_VAR_64(int16_t, opt_dest[64 * 64]); + + memset(ref_dest, 0xCD, sizeof(ref_dest)); + memset(opt_dest, 0xCD, sizeof(opt_dest)); + + int j = 0; + intptr_t stride = STRIDE; + for (int i = 0; i < ITERS; i++) + { + int shift = (rand() % 7 + 1); + + int index = i % TEST_CASES; + checked(opt, opt_dest, short_test_buff[index] + j, stride, shift); + ref(ref_dest, short_test_buff[index] + j, stride, shift); + + if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(int16_t))) + return false; + + reportfail(); + j += INCR + 32; + } + + return true; +} bool PixelHarness::check_cpy1Dto2D_shr_t(cpy1Dto2D_shr_t ref, cpy1Dto2D_shr_t opt) { @@ -497,11 +568,37 @@ return true; } - bool PixelHarness::check_pixelavg_pp(pixelavg_pp_t ref, pixelavg_pp_t opt) { - ALIGN_VAR_16(pixel, ref_dest[64 * 64]); - ALIGN_VAR_16(pixel, opt_dest[64 * 64]); + ALIGN_VAR_64(pixel, ref_dest[64 * 64]); + ALIGN_VAR_64(pixel, opt_dest[64 * 64]); + int j = 0; + memset(ref_dest, 0xCD, sizeof(ref_dest)); + memset(opt_dest, 0xCD, sizeof(opt_dest)); + + intptr_t stride = STRIDE; + for (int i = 0; i < ITERS; i++) + { + int index1 = rand() % TEST_CASES; + int index2 = rand() % TEST_CASES; + checked(ref, ref_dest, stride, 
pixel_test_buff[index1] + j, + stride, pixel_test_buff[index2] + j, stride, 32); + opt(opt_dest, stride, pixel_test_buff[index1] + j, + stride, pixel_test_buff[index2] + j, stride, 32); + + if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel))) + return false; + + reportfail(); + j += INCR; + } + + return true; +} +bool PixelHarness::check_pixelavg_pp_aligned(pixelavg_pp_t ref, pixelavg_pp_t opt) +{ + ALIGN_VAR_64(pixel, ref_dest[64 * 64]); + ALIGN_VAR_64(pixel, opt_dest[64 * 64]); int j = 0; @@ -522,7 +619,7 @@ return false; reportfail(); - j += INCR; + j += INCR + 32; } return true; @@ -642,8 +739,33 @@ bool PixelHarness::check_blockfill_s(blockfill_s_t ref, blockfill_s_t opt) { - ALIGN_VAR_16(int16_t, ref_dest[64 * 64]); - ALIGN_VAR_16(int16_t, opt_dest[64 * 64]); + ALIGN_VAR_64(int16_t, ref_dest[64 * 64]); + ALIGN_VAR_64(int16_t, opt_dest[64 * 64]); + + memset(ref_dest, 0xCD, sizeof(ref_dest)); + memset(opt_dest, 0xCD, sizeof(opt_dest)); + + intptr_t stride = 64; + for (int i = 0; i < ITERS; i++) + { + int16_t value = (rand() % SHORT_MAX) + 1; + + checked(opt, opt_dest, stride, value); + ref(ref_dest, stride, value); + + if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(int16_t))) + return false; + + reportfail(); + } + + return true; +} + +bool PixelHarness::check_blockfill_s_aligned(blockfill_s_t ref, blockfill_s_t opt) +{ + ALIGN_VAR_64(int16_t, ref_dest[64 * 64]); + ALIGN_VAR_64(int16_t, opt_dest[64 * 64]); memset(ref_dest, 0xCD, sizeof(ref_dest)); memset(opt_dest, 0xCD, sizeof(opt_dest)); @@ -696,8 +818,8 @@ bool PixelHarness::check_scale1D_pp(scale1D_t ref, scale1D_t opt) { - ALIGN_VAR_16(pixel, ref_dest[64 * 64]); - ALIGN_VAR_16(pixel, opt_dest[64 * 64]); + ALIGN_VAR_64(pixel, ref_dest[64 * 64]); + ALIGN_VAR_64(pixel, opt_dest[64 * 64]); memset(ref_dest, 0, sizeof(ref_dest)); memset(opt_dest, 0, sizeof(opt_dest)); @@ -719,6 +841,31 @@ return true; } +bool PixelHarness::check_scale1D_pp_aligned(scale1D_t ref, scale1D_t opt) +{ + ALIGN_VAR_64(pixel, ref_dest[64 * 64]); + ALIGN_VAR_64(pixel, opt_dest[64 * 64]); + + memset(ref_dest, 0, sizeof(ref_dest)); + memset(opt_dest, 0, sizeof(opt_dest)); + + int j = 0; + for (int i = 0; i < ITERS; i++) + { + int index = i % TEST_CASES; + checked(opt, opt_dest, pixel_test_buff[index] + j); + ref(ref_dest, pixel_test_buff[index] + j); + + if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel))) + return false; + + reportfail(); + j += INCR * 2; + } + + return true; +} + bool PixelHarness::check_scale2D_pp(scale2D_t ref, scale2D_t opt) { ALIGN_VAR_16(pixel, ref_dest[64 * 64]); @@ -798,6 +945,31 @@ return true; } +bool PixelHarness::check_pixel_add_ps_aligned(pixel_add_ps_t ref, pixel_add_ps_t opt) +{ + ALIGN_VAR_64(pixel, ref_dest[64 * 64]); + ALIGN_VAR_64(pixel, opt_dest[64 * 64]); + + memset(ref_dest, 0xCD, sizeof(ref_dest)); + memset(opt_dest, 0xCD, sizeof(opt_dest)); + + int j = 0; + intptr_t stride2 = 64, stride = STRIDE; + for (int i = 0; i < ITERS; i++) + { + int index1 = rand() % TEST_CASES; + int index2 = rand() % TEST_CASES; + checked(opt, opt_dest, stride2, pixel_test_buff[index1] + j, short_test_buff[index2] + j, stride, stride); + ref(ref_dest, stride2, pixel_test_buff[index1] + j, short_test_buff[index2] + j, stride, stride); + if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel))) + return false; + + reportfail(); + j += 2 * INCR; + } + return true; +} + bool PixelHarness::check_pixel_var(var_t ref, var_t opt) { int j = 0; @@ -870,8 +1042,8 @@ bool PixelHarness::check_addAvg(addAvg_t ref, addAvg_t opt) { - ALIGN_VAR_16(pixel, 
ref_dest[64 * 64]); - ALIGN_VAR_16(pixel, opt_dest[64 * 64]); + ALIGN_VAR_64(pixel, ref_dest[64 * 64]); + ALIGN_VAR_64(pixel, opt_dest[64 * 64]); int j = 0; @@ -895,6 +1067,32 @@ return true; } +bool PixelHarness::check_addAvg_aligned(addAvg_t ref, addAvg_t opt) +{ + ALIGN_VAR_64(pixel, ref_dest[64 * 64]); + ALIGN_VAR_64(pixel, opt_dest[64 * 64]); + + int j = 0; + + memset(ref_dest, 0xCD, sizeof(ref_dest)); + memset(opt_dest, 0xCD, sizeof(opt_dest)); + intptr_t stride = STRIDE; + + for (int i = 0; i < ITERS; i++) + { + int index1 = rand() % TEST_CASES; + int index2 = rand() % TEST_CASES; + ref(short_test_buff2[index1] + j, short_test_buff2[index2] + j, ref_dest, stride, stride, stride); + checked(opt, short_test_buff2[index1] + j, short_test_buff2[index2] + j, opt_dest, stride, stride, stride); + if (memcmp(ref_dest, opt_dest, 64 * 64 * sizeof(pixel))) + return false; + + reportfail(); + j += INCR * 2; + } + + return true; +} bool PixelHarness::check_calSign(sign_t ref, sign_t opt) { ALIGN_VAR_16(int8_t, ref_dest[64 * 2]); @@ -2109,15 +2307,22 @@ return false; } } - - if (opt.pu[part].pixelavg_pp) + if (opt.pu[part].pixelavg_pp[NONALIGNED]) { - if (!check_pixelavg_pp(ref.pu[part].pixelavg_pp, opt.pu[part].pixelavg_pp)) + if (!check_pixelavg_pp(ref.pu[part].pixelavg_pp[NONALIGNED], opt.pu[part].pixelavg_pp[NONALIGNED])) { printf("pixelavg_pp[%s]: failed!\n", lumaPartStr[part]); return false; } } + if (opt.pu[part].pixelavg_pp[ALIGNED]) + { + if (!check_pixelavg_pp_aligned(ref.pu[part].pixelavg_pp[ALIGNED], opt.pu[part].pixelavg_pp[ALIGNED])) + { + printf("pixelavg_pp_aligned[%s]: failed!\n", lumaPartStr[part]); + return false; + } + } if (opt.pu[part].copy_pp) { @@ -2128,15 +2333,24 @@ } } - if (opt.pu[part].addAvg) + if (opt.pu[part].addAvg[NONALIGNED]) { - if (!check_addAvg(ref.pu[part].addAvg, opt.pu[part].addAvg)) + if (!check_addAvg(ref.pu[part].addAvg[NONALIGNED], opt.pu[part].addAvg[NONALIGNED])) { printf("addAvg[%s] failed\n", lumaPartStr[part]); return false; } } + if (opt.pu[part].addAvg[ALIGNED]) + { + if (!check_addAvg_aligned(ref.pu[part].addAvg[ALIGNED], opt.pu[part].addAvg[ALIGNED])) + { + printf("addAvg_aligned[%s] failed\n", lumaPartStr[part]); + return false; + } + } + if (part < NUM_CU_SIZES) { if (opt.cu[part].sse_pp) @@ -2166,15 +2380,24 @@ } } - if (opt.cu[part].add_ps) + if (opt.cu[part].add_ps[NONALIGNED]) { - if (!check_pixel_add_ps(ref.cu[part].add_ps, opt.cu[part].add_ps)) + if (!check_pixel_add_ps(ref.cu[part].add_ps[NONALIGNED], opt.cu[part].add_ps[NONALIGNED])) { printf("add_ps[%s] failed\n", lumaPartStr[part]); return false; } } + if (opt.cu[part].add_ps[ALIGNED]) + { + if (!check_pixel_add_ps_aligned(ref.cu[part].add_ps[ALIGNED], opt.cu[part].add_ps[ALIGNED])) + { + printf("add_ps_aligned[%s] failed\n", lumaPartStr[part]); + return false; + } + } + if (opt.cu[part].copy_ss) { if (!check_copy_ss(ref.cu[part].copy_ss, opt.cu[part].copy_ss)) @@ -2213,14 +2436,22 @@ return false; } } - if (opt.chroma[i].pu[part].addAvg) + if (opt.chroma[i].pu[part].addAvg[NONALIGNED]) { - if (!check_addAvg(ref.chroma[i].pu[part].addAvg, opt.chroma[i].pu[part].addAvg)) + if (!check_addAvg(ref.chroma[i].pu[part].addAvg[NONALIGNED], opt.chroma[i].pu[part].addAvg[NONALIGNED])) { printf("chroma_addAvg[%s][%s] failed\n", x265_source_csp_names[i], chromaPartStr[i][part]); return false; } } + if (opt.chroma[i].pu[part].addAvg[ALIGNED]) + { + if (!check_addAvg_aligned(ref.chroma[i].pu[part].addAvg[ALIGNED], opt.chroma[i].pu[part].addAvg[ALIGNED])) + { + 
printf("chroma_addAvg_aligned[%s][%s] failed\n", x265_source_csp_names[i], chromaPartStr[i][part]); + return false; + } + } if (opt.chroma[i].pu[part].satd) { if (!check_pixelcmp(ref.chroma[i].pu[part].satd, opt.chroma[i].pu[part].satd)) @@ -2247,14 +2478,22 @@ return false; } } - if (opt.chroma[i].cu[part].add_ps) + if (opt.chroma[i].cu[part].add_ps[NONALIGNED]) { - if (!check_pixel_add_ps(ref.chroma[i].cu[part].add_ps, opt.chroma[i].cu[part].add_ps)) + if (!check_pixel_add_ps(ref.chroma[i].cu[part].add_ps[NONALIGNED], opt.chroma[i].cu[part].add_ps[NONALIGNED])) { printf("chroma_add_ps[%s][%s] failed\n", x265_source_csp_names[i], chromaPartStr[i][part]); return false; } } + if (opt.chroma[i].cu[part].add_ps[ALIGNED]) + { + if (!check_pixel_add_ps_aligned(ref.chroma[i].cu[part].add_ps[ALIGNED], opt.chroma[i].cu[part].add_ps[ALIGNED])) + { + printf("chroma_add_ps_aligned[%s][%s] failed\n", x265_source_csp_names[i], chromaPartStr[i][part]); + return false; + } + } if (opt.chroma[i].cu[part].copy_sp) { if (!check_copy_sp(ref.chroma[i].cu[part].copy_sp, opt.chroma[i].cu[part].copy_sp)) @@ -2333,15 +2572,23 @@ } } - if (opt.cu[i].blockfill_s) + if (opt.cu[i].blockfill_s[NONALIGNED]) { - if (!check_blockfill_s(ref.cu[i].blockfill_s, opt.cu[i].blockfill_s)) + if (!check_blockfill_s(ref.cu[i].blockfill_s[NONALIGNED], opt.cu[i].blockfill_s[NONALIGNED])) { printf("blockfill_s[%dx%d]: failed!\n", 4 << i, 4 << i); return false; } } + if (opt.cu[i].blockfill_s[ALIGNED]) + { + if (!check_blockfill_s_aligned(ref.cu[i].blockfill_s[ALIGNED], opt.cu[i].blockfill_s[ALIGNED])) + { + printf("blockfill_s_aligned[%dx%d]: failed!\n", 4 << i, 4 << i); + return false; + } + } if (opt.cu[i].var) { if (!check_pixel_var(ref.cu[i].var, opt.cu[i].var)) @@ -2364,15 +2611,24 @@ { /* TU only primitives */ - if (opt.cu[i].calcresidual) + if (opt.cu[i].calcresidual[NONALIGNED]) { - if (!check_calresidual(ref.cu[i].calcresidual, opt.cu[i].calcresidual)) + if (!check_calresidual(ref.cu[i].calcresidual[NONALIGNED], opt.cu[i].calcresidual[NONALIGNED])) { printf("calcresidual width: %d failed!\n", 4 << i); return false; } } + if (opt.cu[i].calcresidual[ALIGNED]) + { + if (!check_calresidual_aligned(ref.cu[i].calcresidual[ALIGNED], opt.cu[i].calcresidual[ALIGNED])) + { + printf("calcresidual_aligned width: %d failed!\n", 4 << i); + return false; + } + } + if (opt.cu[i].transpose) { if (!check_transpose(ref.cu[i].transpose, opt.cu[i].transpose)) @@ -2381,16 +2637,22 @@ return false; } } - - if (opt.cu[i].ssd_s) + if (opt.cu[i].ssd_s[NONALIGNED]) { - if (!check_ssd_s(ref.cu[i].ssd_s, opt.cu[i].ssd_s)) + if (!check_ssd_s(ref.cu[i].ssd_s[NONALIGNED], opt.cu[i].ssd_s[NONALIGNED])) { printf("ssd_s[%dx%d]: failed!\n", 4 << i, 4 << i); return false; } } - + if (opt.cu[i].ssd_s[ALIGNED]) + { + if (!check_ssd_s_aligned(ref.cu[i].ssd_s[ALIGNED], opt.cu[i].ssd_s[ALIGNED])) + { + printf("ssd_s_aligned[%dx%d]: failed!\n", 4 << i, 4 << i); + return false; + } + } if (opt.cu[i].copy_cnt) { if (!check_copy_cnt_t(ref.cu[i].copy_cnt, opt.cu[i].copy_cnt)) @@ -2417,15 +2679,22 @@ return false; } } - - if (opt.cu[i].cpy1Dto2D_shl) + if (opt.cu[i].cpy1Dto2D_shl[NONALIGNED]) { - if (!check_cpy1Dto2D_shl_t(ref.cu[i].cpy1Dto2D_shl, opt.cu[i].cpy1Dto2D_shl)) + if (!check_cpy1Dto2D_shl_t(ref.cu[i].cpy1Dto2D_shl[NONALIGNED], opt.cu[i].cpy1Dto2D_shl[NONALIGNED])) { printf("cpy1Dto2D_shl[%dx%d] failed!\n", 4 << i, 4 << i); return false; } } + if (opt.cu[i].cpy1Dto2D_shl[ALIGNED]) + { + if (!check_cpy1Dto2D_shl_aligned_t(ref.cu[i].cpy1Dto2D_shl[ALIGNED], 
opt.cu[i].cpy1Dto2D_shl[ALIGNED])) + { + printf("cpy1Dto2D_shl_aligned[%dx%d] failed!\n", 4 << i, 4 << i); + return false; + } + } if (opt.cu[i].cpy1Dto2D_shr) { @@ -2465,15 +2734,24 @@ } } - if (opt.scale1D_128to64) + if (opt.scale1D_128to64[NONALIGNED]) { - if (!check_scale1D_pp(ref.scale1D_128to64, opt.scale1D_128to64)) + if (!check_scale1D_pp(ref.scale1D_128to64[NONALIGNED], opt.scale1D_128to64[NONALIGNED])) { printf("scale1D_128to64 failed!\n"); return false; } } + if (opt.scale1D_128to64[ALIGNED]) + { + if (!check_scale1D_pp_aligned(ref.scale1D_128to64[ALIGNED], opt.scale1D_128to64[ALIGNED])) + { + printf("scale1D_128to64_aligned failed!\n"); + return false; + } + } + if (opt.scale2D_64to32) { if (!check_scale2D_pp(ref.scale2D_64to32, opt.scale2D_64to32)) @@ -2830,13 +3108,17 @@ HEADER("satd[%s]", lumaPartStr[part]); REPORT_SPEEDUP(opt.pu[part].satd, ref.pu[part].satd, pbuf1, STRIDE, fref, STRIDE); } - - if (opt.pu[part].pixelavg_pp) + if (opt.pu[part].pixelavg_pp[NONALIGNED]) { HEADER("avg_pp[%s]", lumaPartStr[part]); - REPORT_SPEEDUP(opt.pu[part].pixelavg_pp, ref.pu[part].pixelavg_pp, pbuf1, STRIDE, pbuf2, STRIDE, pbuf3, STRIDE, 32); + REPORT_SPEEDUP(opt.pu[part].pixelavg_pp[NONALIGNED], ref.pu[part].pixelavg_pp[NONALIGNED], pbuf1, STRIDE, pbuf2, STRIDE, pbuf3, STRIDE, 32); } + if (opt.pu[part].pixelavg_pp[ALIGNED]) + { + HEADER("avg_pp_aligned[%s]", lumaPartStr[part]); + REPORT_SPEEDUP(opt.pu[part].pixelavg_pp[ALIGNED], ref.pu[part].pixelavg_pp[ALIGNED], pbuf1, STRIDE, pbuf2, STRIDE, pbuf3, STRIDE, 32); + } if (opt.pu[part].sad) { HEADER("sad[%s]", lumaPartStr[part]); @@ -2861,10 +3143,15 @@ REPORT_SPEEDUP(opt.pu[part].copy_pp, ref.pu[part].copy_pp, pbuf1, 64, pbuf2, 64); } - if (opt.pu[part].addAvg) + if (opt.pu[part].addAvg[NONALIGNED]) { HEADER("addAvg[%s]", lumaPartStr[part]); - REPORT_SPEEDUP(opt.pu[part].addAvg, ref.pu[part].addAvg, sbuf1, sbuf2, pbuf1, STRIDE, STRIDE, STRIDE); + REPORT_SPEEDUP(opt.pu[part].addAvg[NONALIGNED], ref.pu[part].addAvg[NONALIGNED], sbuf1, sbuf2, pbuf1, STRIDE, STRIDE, STRIDE); + } + if (opt.pu[part].addAvg[ALIGNED]) + { + HEADER("addAvg_aligned[%s]", lumaPartStr[part]); + REPORT_SPEEDUP(opt.pu[part].addAvg[ALIGNED], ref.pu[part].addAvg[ALIGNED], sbuf1, sbuf2, pbuf1, STRIDE, STRIDE, STRIDE); } if (part < NUM_CU_SIZES) @@ -2885,10 +3172,15 @@ HEADER("sub_ps[%s]", lumaPartStr[part]); REPORT_SPEEDUP(opt.cu[part].sub_ps, ref.cu[part].sub_ps, (int16_t*)pbuf1, FENC_STRIDE, pbuf2, pbuf1, STRIDE, STRIDE); } - if (opt.cu[part].add_ps) + if (opt.cu[part].add_ps[NONALIGNED]) { HEADER("add_ps[%s]", lumaPartStr[part]); - REPORT_SPEEDUP(opt.cu[part].add_ps, ref.cu[part].add_ps, pbuf1, FENC_STRIDE, pbuf2, sbuf1, STRIDE, STRIDE); + REPORT_SPEEDUP(opt.cu[part].add_ps[NONALIGNED], ref.cu[part].add_ps[NONALIGNED], pbuf1, FENC_STRIDE, pbuf2, sbuf1, STRIDE, STRIDE); + } + if (opt.cu[part].add_ps[ALIGNED]) + { + HEADER("add_ps_aligned[%s]", lumaPartStr[part]); + REPORT_SPEEDUP(opt.cu[part].add_ps[ALIGNED], ref.cu[part].add_ps[ALIGNED], pbuf1, FENC_STRIDE, pbuf2, sbuf1, STRIDE, STRIDE); } if (opt.cu[part].copy_ss) { @@ -2914,10 +3206,15 @@ HEADER("[%s] copy_pp[%s]", x265_source_csp_names[i], chromaPartStr[i][part]); REPORT_SPEEDUP(opt.chroma[i].pu[part].copy_pp, ref.chroma[i].pu[part].copy_pp, pbuf1, 64, pbuf2, 128); } - if (opt.chroma[i].pu[part].addAvg) + if (opt.chroma[i].pu[part].addAvg[NONALIGNED]) { HEADER("[%s] addAvg[%s]", x265_source_csp_names[i], chromaPartStr[i][part]); - REPORT_SPEEDUP(opt.chroma[i].pu[part].addAvg, ref.chroma[i].pu[part].addAvg, sbuf1, 
sbuf2, pbuf1, STRIDE, STRIDE, STRIDE); + REPORT_SPEEDUP(opt.chroma[i].pu[part].addAvg[NONALIGNED], ref.chroma[i].pu[part].addAvg[NONALIGNED], sbuf1, sbuf2, pbuf1, STRIDE, STRIDE, STRIDE); + } + if (opt.chroma[i].pu[part].addAvg[ALIGNED]) + { + HEADER("[%s] addAvg_aligned[%s]", x265_source_csp_names[i], chromaPartStr[i][part]); + REPORT_SPEEDUP(opt.chroma[i].pu[part].addAvg[ALIGNED], ref.chroma[i].pu[part].addAvg[ALIGNED], sbuf1, sbuf2, pbuf1, STRIDE, STRIDE, STRIDE); } if (opt.chroma[i].pu[part].satd) { @@ -2951,10 +3248,15 @@ HEADER("[%s] sub_ps[%s]", x265_source_csp_names[i], chromaPartStr[i][part]); REPORT_SPEEDUP(opt.chroma[i].cu[part].sub_ps, ref.chroma[i].cu[part].sub_ps, (int16_t*)pbuf1, FENC_STRIDE, pbuf2, pbuf1, STRIDE, STRIDE); } - if (opt.chroma[i].cu[part].add_ps) + if (opt.chroma[i].cu[part].add_ps[NONALIGNED]) { HEADER("[%s] add_ps[%s]", x265_source_csp_names[i], chromaPartStr[i][part]); - REPORT_SPEEDUP(opt.chroma[i].cu[part].add_ps, ref.chroma[i].cu[part].add_ps, pbuf1, FENC_STRIDE, pbuf2, sbuf1, STRIDE, STRIDE); + REPORT_SPEEDUP(opt.chroma[i].cu[part].add_ps[NONALIGNED], ref.chroma[i].cu[part].add_ps[NONALIGNED], pbuf1, FENC_STRIDE, pbuf2, sbuf1, STRIDE, STRIDE); + } + if (opt.chroma[i].cu[part].add_ps[ALIGNED]) + { + HEADER("[%s] add_ps_aligned[%s]", x265_source_csp_names[i], chromaPartStr[i][part]); + REPORT_SPEEDUP(opt.chroma[i].cu[part].add_ps[ALIGNED], ref.chroma[i].cu[part].add_ps[ALIGNED], pbuf1, FENC_STRIDE, pbuf2, sbuf1, STRIDE, STRIDE); } if (opt.chroma[i].cu[part].sa8d) { @@ -3000,29 +3302,42 @@ measurePartition(part, ref, opt); } } - for (int i = 0; i < NUM_CU_SIZES; i++) { - if ((i <= BLOCK_32x32) && opt.cu[i].ssd_s) + if ((i <= BLOCK_32x32) && opt.cu[i].ssd_s[NONALIGNED]) { HEADER("ssd_s[%dx%d]", 4 << i, 4 << i); - REPORT_SPEEDUP(opt.cu[i].ssd_s, ref.cu[i].ssd_s, sbuf1, STRIDE); + REPORT_SPEEDUP(opt.cu[i].ssd_s[NONALIGNED], ref.cu[i].ssd_s[NONALIGNED], sbuf1, STRIDE); + } + if ((i <= BLOCK_32x32) && opt.cu[i].ssd_s[ALIGNED]) + { + HEADER("ssd_s_aligned[%dx%d]", 4 << i, 4 << i); + REPORT_SPEEDUP(opt.cu[i].ssd_s[ALIGNED], ref.cu[i].ssd_s[ALIGNED], sbuf1, STRIDE); } if (opt.cu[i].sa8d) { HEADER("sa8d[%dx%d]", 4 << i, 4 << i); REPORT_SPEEDUP(opt.cu[i].sa8d, ref.cu[i].sa8d, pbuf1, STRIDE, pbuf2, STRIDE); } - if (opt.cu[i].calcresidual) + if (opt.cu[i].calcresidual[NONALIGNED]) { HEADER("residual[%dx%d]", 4 << i, 4 << i); - REPORT_SPEEDUP(opt.cu[i].calcresidual, ref.cu[i].calcresidual, pbuf1, pbuf2, sbuf1, 64); + REPORT_SPEEDUP(opt.cu[i].calcresidual[NONALIGNED], ref.cu[i].calcresidual[NONALIGNED], pbuf1, pbuf2, sbuf1, 64); } - - if (opt.cu[i].blockfill_s) + if (opt.cu[i].calcresidual[ALIGNED]) + { + HEADER("residual_aligned[%dx%d]", 4 << i, 4 << i); + REPORT_SPEEDUP(opt.cu[i].calcresidual[ALIGNED], ref.cu[i].calcresidual[ALIGNED], pbuf1, pbuf2, sbuf1, 64); + } + if (opt.cu[i].blockfill_s[NONALIGNED]) { HEADER("blkfill[%dx%d]", 4 << i, 4 << i); - REPORT_SPEEDUP(opt.cu[i].blockfill_s, ref.cu[i].blockfill_s, sbuf1, 64, SHORT_MAX); + REPORT_SPEEDUP(opt.cu[i].blockfill_s[NONALIGNED], ref.cu[i].blockfill_s[NONALIGNED], sbuf1, 64, SHORT_MAX); + } + if (opt.cu[i].blockfill_s[ALIGNED]) + { + HEADER("blkfill_aligned[%dx%d]", 4 << i, 4 << i); + REPORT_SPEEDUP(opt.cu[i].blockfill_s[ALIGNED], ref.cu[i].blockfill_s[ALIGNED], sbuf1, 64, SHORT_MAX); } if (opt.cu[i].transpose) @@ -3049,13 +3364,17 @@ HEADER("cpy2Dto1D_shr[%dx%d]", 4 << i, 4 << i); REPORT_SPEEDUP(opt.cu[i].cpy2Dto1D_shr, ref.cu[i].cpy2Dto1D_shr, sbuf1, sbuf2, STRIDE, 3); } - - if ((i < BLOCK_64x64) && 
opt.cu[i].cpy1Dto2D_shl) + if ((i < BLOCK_64x64) && opt.cu[i].cpy1Dto2D_shl[NONALIGNED]) { HEADER("cpy1Dto2D_shl[%dx%d]", 4 << i, 4 << i); - REPORT_SPEEDUP(opt.cu[i].cpy1Dto2D_shl, ref.cu[i].cpy1Dto2D_shl, sbuf1, sbuf2, STRIDE, 64); + REPORT_SPEEDUP(opt.cu[i].cpy1Dto2D_shl[NONALIGNED], ref.cu[i].cpy1Dto2D_shl[NONALIGNED], sbuf1, sbuf2, STRIDE, 64); } + if ((i < BLOCK_64x64) && opt.cu[i].cpy1Dto2D_shl[ALIGNED]) + { + HEADER("cpy1Dto2D_shl_aligned[%dx%d]", 4 << i, 4 << i); + REPORT_SPEEDUP(opt.cu[i].cpy1Dto2D_shl[ALIGNED], ref.cu[i].cpy1Dto2D_shl[ALIGNED], sbuf1, sbuf2, STRIDE, 64); + } if ((i < BLOCK_64x64) && opt.cu[i].cpy1Dto2D_shr) { HEADER("cpy1Dto2D_shr[%dx%d]", 4 << i, 4 << i); @@ -3093,10 +3412,16 @@ REPORT_SPEEDUP(opt.frameInitLowres, ref.frameInitLowres, pbuf2, pbuf1, pbuf2, pbuf3, pbuf4, 64, 64, 64, 64); } - if (opt.scale1D_128to64) + if (opt.scale1D_128to64[NONALIGNED]) { HEADER0("scale1D_128to64"); - REPORT_SPEEDUP(opt.scale1D_128to64, ref.scale1D_128to64, pbuf2, pbuf1); + REPORT_SPEEDUP(opt.scale1D_128to64[NONALIGNED], ref.scale1D_128to64[NONALIGNED], pbuf2, pbuf1); + } + + if (opt.scale1D_128to64[ALIGNED]) + { + HEADER0("scale1D_128to64_aligned"); + REPORT_SPEEDUP(opt.scale1D_128to64[ALIGNED], ref.scale1D_128to64[ALIGNED], pbuf2, pbuf1); } if (opt.scale2D_64to32)
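Throughout pixelharness.cpp the single function pointers (pixelavg_pp, addAvg, add_ps, blockfill_s, ssd_s, calcresidual, cpy1Dto2D_shl, scale1D_128to64) become two-entry tables indexed by NONALIGNED/ALIGNED, and each populated entry is validated and benchmarked separately. A condensed sketch of that table shape and test-loop pattern (only NONALIGNED/ALIGNED are taken from the harness; the other types and names are illustrative):

#include <cstdint>

enum { NONALIGNED = 0, ALIGNED = 1 };

typedef void (*addAvgFn)(const int16_t* src0, const int16_t* src1, unsigned char* dst,
                         intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride);

struct PuPrimitives
{
    addAvgFn addAvg[2];   /* [NONALIGNED], [ALIGNED] */
};

/* Test-loop shape: exercise whichever variants the build populated.
 *
 * for (int a = NONALIGNED; a <= ALIGNED; a++)
 *     if (opt.pu[part].addAvg[a] &&
 *         !checkAddAvg(ref.pu[part].addAvg[a], opt.pu[part].addAvg[a], a == ALIGNED))
 *         return false;
 */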
View file
x265_2.7.tar.gz/source/test/pixelharness.h -> x265_2.9.tar.gz/source/test/pixelharness.h
Changed
@@ -44,30 +44,30 @@ enum { RMAX = PIXEL_MAX - PIXEL_MIN }; //The maximum value obtained by subtracting pixel values (residual max) enum { RMIN = PIXEL_MIN - PIXEL_MAX }; //The minimum value obtained by subtracting pixel values (residual min) - ALIGN_VAR_32(pixel, pbuf1[BUFFSIZE]); - pixel pbuf2[BUFFSIZE]; - pixel pbuf3[BUFFSIZE]; - pixel pbuf4[BUFFSIZE]; - int ibuf1[BUFFSIZE]; - int8_t psbuf1[BUFFSIZE]; - int8_t psbuf2[BUFFSIZE]; - int8_t psbuf3[BUFFSIZE]; - int8_t psbuf4[BUFFSIZE]; - int8_t psbuf5[BUFFSIZE]; + ALIGN_VAR_64(pixel, pbuf1[BUFFSIZE]); + ALIGN_VAR_64(pixel, pbuf2[BUFFSIZE]); + ALIGN_VAR_64(pixel, pbuf3[BUFFSIZE]); + ALIGN_VAR_64(pixel, pbuf4[BUFFSIZE]); + ALIGN_VAR_64(int, ibuf1[BUFFSIZE]); + ALIGN_VAR_64(int8_t, psbuf1[BUFFSIZE]); + ALIGN_VAR_64(int8_t, psbuf2[BUFFSIZE]); + ALIGN_VAR_64(int8_t, psbuf3[BUFFSIZE]); + ALIGN_VAR_64(int8_t, psbuf4[BUFFSIZE]); + ALIGN_VAR_64(int8_t, psbuf5[BUFFSIZE]); - int16_t sbuf1[BUFFSIZE]; - int16_t sbuf2[BUFFSIZE]; - int16_t sbuf3[BUFFSIZE]; + ALIGN_VAR_64(int16_t, sbuf1[BUFFSIZE]); + ALIGN_VAR_64(int16_t, sbuf2[BUFFSIZE]); + ALIGN_VAR_64(int16_t, sbuf3[BUFFSIZE]); - pixel pixel_test_buff[TEST_CASES][BUFFSIZE]; - int16_t short_test_buff[TEST_CASES][BUFFSIZE]; - int16_t short_test_buff1[TEST_CASES][BUFFSIZE]; - int16_t short_test_buff2[TEST_CASES][BUFFSIZE]; - int int_test_buff[TEST_CASES][BUFFSIZE]; - uint16_t ushort_test_buff[TEST_CASES][BUFFSIZE]; - uint8_t uchar_test_buff[TEST_CASES][BUFFSIZE]; - double double_test_buff[TEST_CASES][BUFFSIZE]; - int16_t residual_test_buff[TEST_CASES][BUFFSIZE]; + ALIGN_VAR_64(pixel, pixel_test_buff[TEST_CASES][BUFFSIZE]); + ALIGN_VAR_64(int16_t, short_test_buff[TEST_CASES][BUFFSIZE]); + ALIGN_VAR_64(int16_t, short_test_buff1[TEST_CASES][BUFFSIZE]); + ALIGN_VAR_64(int16_t, short_test_buff2[TEST_CASES][BUFFSIZE]); + ALIGN_VAR_64(int, int_test_buff[TEST_CASES][BUFFSIZE]); + ALIGN_VAR_64(uint16_t, ushort_test_buff[TEST_CASES][BUFFSIZE]); + ALIGN_VAR_64(uint8_t, uchar_test_buff[TEST_CASES][BUFFSIZE]); + ALIGN_VAR_64(double, double_test_buff[TEST_CASES][BUFFSIZE]); + ALIGN_VAR_64(int16_t, residual_test_buff[TEST_CASES][BUFFSIZE]); bool check_pixelcmp(pixelcmp_t ref, pixelcmp_t opt); bool check_pixel_sse(pixel_sse_t ref, pixel_sse_t opt); @@ -79,13 +79,19 @@ bool check_copy_ps(copy_ps_t ref, copy_ps_t opt); bool check_copy_ss(copy_ss_t ref, copy_ss_t opt); bool check_pixelavg_pp(pixelavg_pp_t ref, pixelavg_pp_t opt); + bool check_pixelavg_pp_aligned(pixelavg_pp_t ref, pixelavg_pp_t opt); bool check_pixel_sub_ps(pixel_sub_ps_t ref, pixel_sub_ps_t opt); bool check_pixel_add_ps(pixel_add_ps_t ref, pixel_add_ps_t opt); + bool check_pixel_add_ps_aligned(pixel_add_ps_t ref, pixel_add_ps_t opt); bool check_scale1D_pp(scale1D_t ref, scale1D_t opt); + bool check_scale1D_pp_aligned(scale1D_t ref, scale1D_t opt); bool check_scale2D_pp(scale2D_t ref, scale2D_t opt); bool check_ssd_s(pixel_ssd_s_t ref, pixel_ssd_s_t opt); + bool check_ssd_s_aligned(pixel_ssd_s_t ref, pixel_ssd_s_t opt); bool check_blockfill_s(blockfill_s_t ref, blockfill_s_t opt); + bool check_blockfill_s_aligned(blockfill_s_t ref, blockfill_s_t opt); bool check_calresidual(calcresidual_t ref, calcresidual_t opt); + bool check_calresidual_aligned(calcresidual_t ref, calcresidual_t opt); bool check_transpose(transpose_t ref, transpose_t opt); bool check_weightp(weightp_pp_t ref, weightp_pp_t opt); bool check_weightp(weightp_sp_t ref, weightp_sp_t opt); @@ -93,12 +99,14 @@ bool check_cpy2Dto1D_shl_t(cpy2Dto1D_shl_t ref, cpy2Dto1D_shl_t opt); bool 
check_cpy2Dto1D_shr_t(cpy2Dto1D_shr_t ref, cpy2Dto1D_shr_t opt); bool check_cpy1Dto2D_shl_t(cpy1Dto2D_shl_t ref, cpy1Dto2D_shl_t opt); + bool check_cpy1Dto2D_shl_aligned_t(cpy1Dto2D_shl_t ref, cpy1Dto2D_shl_t opt); bool check_cpy1Dto2D_shr_t(cpy1Dto2D_shr_t ref, cpy1Dto2D_shr_t opt); bool check_copy_cnt_t(copy_cnt_t ref, copy_cnt_t opt); bool check_pixel_var(var_t ref, var_t opt); bool check_ssim_4x4x2_core(ssim_4x4x2_core_t ref, ssim_4x4x2_core_t opt); bool check_ssim_end(ssim_end4_t ref, ssim_end4_t opt); bool check_addAvg(addAvg_t, addAvg_t); + bool check_addAvg_aligned(addAvg_t, addAvg_t); bool check_saoCuOrgE0_t(saoCuOrgE0_t ref, saoCuOrgE0_t opt); bool check_saoCuOrgE1_t(saoCuOrgE1_t ref, saoCuOrgE1_t opt); bool check_saoCuOrgE2_t(saoCuOrgE2_t ref[], saoCuOrgE2_t opt[]);
View file
x265_2.7.tar.gz/source/test/regression-tests.txt -> x265_2.9.tar.gz/source/test/regression-tests.txt
Changed
@@ -23,12 +23,12 @@ BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0 --limit-tu 4 BasketballDrive_1920x1080_50.y4m,--preset slower --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 10 --bitrate 7000 --limit-tu 0::--preset slower --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 10 --bitrate 7000 --limit-tu 0 BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode --limit-refs 1 --aq-mode 3 --limit-tu 3 -BasketballDrive_1920x1080_50.y4m,--preset veryslow --no-cutree --analysis-save x265_analysis.dat --bitrate 7000 --tskip-fast --limit-tu 2::--preset veryslow --no-cutree --analysis-load x265_analysis.dat --bitrate 7000 --tskip-fast --limit-tu 2 +BasketballDrive_1920x1080_50.y4m,--preset veryslow --no-cutree --analysis-save x265_analysis.dat --crf 18 --tskip-fast --limit-tu 2::--preset veryslow --no-cutree --analysis-load x265_analysis.dat --crf 18 --tskip-fast --limit-tu 2 BasketballDrive_1920x1080_50.y4m,--preset veryslow --recon-y4m-exec "ffplay -i pipe:0 -autoexit" Coastguard-4k.y4m,--preset ultrafast --recon-y4m-exec "ffplay -i pipe:0 -autoexit" Coastguard-4k.y4m,--preset superfast --tune grain --overscan=crop Coastguard-4k.y4m,--preset superfast --tune grain --pme --aq-strength 2 --merange 190 -Coastguard-4k.y4m,--preset veryfast --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 1 --bitrate 15000::--preset veryfast --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 1 --bitrate 15000 +Coastguard-4k.y4m,--preset veryfast --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 1 --qp 35::--preset veryfast --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 1 --qp 35 Coastguard-4k.y4m,--preset medium --rdoq-level 1 --tune ssim --no-signhide --me umh --slices 2 Coastguard-4k.y4m,--preset slow --tune psnr --cbqpoffs -1 --crqpoffs 1 --limit-refs 1 CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16 @@ -69,12 +69,11 @@ KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8 --limit-refs 0 --limit-modes --limit-tu 1 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain --limit-refs 2 -NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset slow --no-cutree --analysis-save x265_analysis.dat --rd 5 --analysis-reuse-level 10 --bitrate 9000::--preset slow --no-cutree --analysis-load x265_analysis.dat --rd 5 --analysis-reuse-level 10 --bitrate 9000 +NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset slow --no-cutree --analysis-save x265_analysis.dat --rd 5 --analysis-reuse-level 10 --bitrate 9000 --vbv-maxrate 9000 --vbv-bufsize 9000::--preset slow --no-cutree --analysis-load x265_analysis.dat --rd 5 --analysis-reuse-level 10 --bitrate 9000 --vbv-maxrate 9000 --vbv-bufsize 9000 News-4k.y4m,--preset ultrafast --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 2 --bitrate 15000::--preset ultrafast --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 2 --bitrate 15000 News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0 News-4k.y4m,--preset superfast --slices 4 --aq-mode 0 News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 16 -News-4k.y4m,--preset slower --opt-cu-delta-qp News-4k.y4m,--preset veryslow --no-rskip News-4k.y4m,--preset veryslow --pme --crf 40 OldTownCross_1920x1080_50_10bit_422.yuv,--preset superfast --weightp @@ -104,7 +103,6 @@ 
city_4cif_60fps.y4m,--preset superfast --rdpenalty 1 --tu-intra-depth 2 city_4cif_60fps.y4m,--preset medium --crf 4 --cu-lossless --sao-non-deblock city_4cif_60fps.y4m,--preset slower --scaling-list default -city_4cif_60fps.y4m,--preset veryslow --opt-cu-delta-qp city_4cif_60fps.y4m,--preset veryslow --rdpenalty 2 --sao-non-deblock --no-b-intra --limit-refs 0 ducks_take_off_420_720p50.y4m,--preset ultrafast --constrained-intra --rd 1 ducks_take_off_444_720p50.y4m,--preset superfast --weightp --limit-refs 2 @@ -151,7 +149,7 @@ Kimono1_1920x1080_24_400.yuv,--preset veryslow --crf 4 --cu-lossless --slices 2 --limit-refs 3 --limit-modes Kimono1_1920x1080_24_400.yuv,--preset placebo --ctu 32 --max-tu-size 8 --limit-tu 2 big_buck_bunny_360p24.y4m, --keyint 60 --min-keyint 40 --gop-lookahead 14 -BasketballDrive_1920x1080_50.y4m, --preset medium --no-open-gop --keyint 50 --min-keyint 50 --radl 2 +BasketballDrive_1920x1080_50.y4m, --preset medium --no-open-gop --keyint 50 --min-keyint 50 --radl 2 --vbv-maxrate 5000 --vbv-bufsize 5000 # Main12 intraCost overflow bug test 720p50_parkrun_ter.y4m,--preset medium @@ -167,4 +165,15 @@ #low-pass dct test 720p50_parkrun_ter.y4m,--preset medium --lowpass-dct +#scaled save/load test +crowd_run_1080p50.y4m,--preset ultrafast --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 1 --scale-factor 2 --crf 26 --vbv-maxrate 8000 --vbv-bufsize 8000::crowd_run_2160p50.y4m, --preset ultrafast --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 1 --scale-factor 2 --crf 26 --vbv-maxrate 12000 --vbv-bufsize 12000 +crowd_run_1080p50.y4m,--preset superfast --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 5000 --vbv-bufsize 5000::crowd_run_2160p50.y4m, --preset superfast --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 2 --scale-factor 2 --crf 22 --vbv-maxrate 10000 --vbv-bufsize 10000 +crowd_run_1080p50.y4m,--preset fast --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 5 --scale-factor 2 --qp 18::crowd_run_2160p50.y4m, --preset fast --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 5 --scale-factor 2 --qp 18 +crowd_run_1080p50.y4m,--preset medium --no-cutree --analysis-save x265_analysis.dat --analysis-reuse-level 10 --scale-factor 2 --bitrate 5000 --vbv-maxrate 5000 --vbv-bufsize 5000 --early-skip --tu-inter-depth 3::crowd_run_2160p50.y4m, --preset medium --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 4 --dynamic-refine::crowd_run_2160p50.y4m, --preset medium --no-cutree --analysis-load x265_analysis.dat --analysis-reuse-level 10 --scale-factor 2 --bitrate 10000 --vbv-maxrate 10000 --vbv-bufsize 10000 --early-skip --tu-inter-depth 3 --refine-intra 3 --refine-inter 3 +RaceHorses_416x240_30.y4m,--preset slow --no-cutree --ctu 16 --analysis-save x265_analysis.dat --analysis-reuse-level 10 --scale-factor 2 --crf 22 --vbv-maxrate 1000 --vbv-bufsize 1000::RaceHorses_832x480_30.y4m, --preset slow --no-cutree --ctu 32 --analysis-load x265_analysis.dat --analysis-save x265_analysis_2.dat --analysis-reuse-level 10 --scale-factor 2 --crf 16 --vbv-maxrate 4000 --vbv-bufsize 4000 --refine-intra 0 --refine-inter 1::RaceHorses_1664x960_30.y4m,--preset slow --no-cutree --ctu 64 --analysis-load x265_analysis_2.dat --analysis-reuse-level 10 --scale-factor 2 --crf 12 --vbv-maxrate 7000 
--vbv-bufsize 7000 --refine-intra 2 --refine-inter 2 +ElFunete_960x540_60.yuv,--colorprim bt709 --transfer bt709 --chromaloc 2 --aud --repeat-headers --no-opt-qp-pps --no-opt-ref-list-length-pps --wpp --no-interlace --sar 1:1 --min-keyint 60 --no-open-gop --rc-lookahead 180 --bframes 5 --b-intra --ref 4 --cbqpoffs -2 --crqpoffs -2 --lookahead-threads 0 --weightb --qg-size 8 --me star --preset veryslow --frame-threads 1 --b-adapt 2 --aq-mode 3 --rd 6 --pools 15 --colormatrix bt709 --keyint 120 --high-tier --ctu 64 --tune psnr --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500 --analysis-reuse-level 10 --analysis-save elfuente_960x540.dat --scale-factor 2::ElFunete_1920x1080_60.yuv,--colorprim bt709 --transfer bt709 --chromaloc 2 --aud --repeat-headers --no-opt-qp-pps --no-opt-ref-list-length-pps --wpp --no-interlace --sar 1:1 --min-keyint 60 --no-open-gop --rc-lookahead 180 --bframes 5 --b-intra --ref 4 --cbqpoffs -2 --crqpoffs -2 --lookahead-threads 0 --weightb --qg-size 8 --me star --preset veryslow --frame-threads 1 --b-adapt 2 --aq-mode 3 --rd 6 --pools 15 --colormatrix bt709 --keyint 120 --high-tier --ctu 64 --tune psnr --bitrate 10000 --vbv-bufsize 30000 --vbv-maxrate 17500 --analysis-reuse-level 10 --analysis-save elfuente_1920x1080.dat --limit-tu 0 --scale-factor 2 --analysis-load elfuente_960x540.dat --refine-intra 4 --refine-inter 2::ElFuente_3840x2160_60.yuv,--colorprim bt709 --transfer bt709 --chromaloc 2 --aud --repeat-headers --no-opt-qp-pps --no-opt-ref-list-length-pps --wpp --no-interlace --sar 1:1 --min-keyint 60 --no-open-gop --rc-lookahead 180 --bframes 5 --b-intra --ref 4 --cbqpoffs -2 --crqpoffs -2 --lookahead-threads 0 --weightb --qg-size 8 --me star --preset veryslow --frame-threads 1 --b-adapt 2 --aq-mode 3 --rd 6 --pools 15 --colormatrix bt709 --keyint 120 --high-tier --ctu 64 --tune=psnr --bitrate 24000 --vbv-bufsize 84000 --vbv-maxrate 49000 --analysis-reuse-level 10 --limit-tu 0 --scale-factor 2 --analysis-load elfuente_1920x1080.dat --refine-intra 4 --refine-inter 2 + +#segment encoding +BasketballDrive_1920x1080_50.y4m, --preset ultrafast --no-open-gop --chunk-start 100 --chunk-end 200 + # vim: tw=200
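The segment-encoding entry above exercises the new --chunk-start/--chunk-end options added in 2.9. Below is a minimal sketch of driving the same feature through the C API; it assumes a closed-GOP configuration (chunking only works with closed GOPs) and the frame numbers are purely illustrative.

    /* Sketch: encode frames 100..200 of a longer sequence as one chunk.
     * Frames before chunkStart are encoded but dropped from the bitstream;
     * frames after chunkEnd only feed lookahead decisions. */
    x265_param *param = x265_param_alloc();
    x265_param_default_preset(param, "ultrafast", NULL);
    param->bOpenGOP   = 0;     /* --no-open-gop: chunking requires closed GOPs */
    param->chunkStart = 100;   /* --chunk-start 100 (display order) */
    param->chunkEnd   = 200;   /* --chunk-end 200 (display order)   */
    x265_encoder *enc = x265_encoder_open(param);
    /* ... feed pictures with x265_encoder_encode() as usual ... */
    x265_encoder_close(enc);
    x265_param_free(param);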
View file
x265_2.7.tar.gz/source/test/smoke-tests.txt -> x265_2.9.tar.gz/source/test/smoke-tests.txt
Changed
@@ -13,7 +13,7 @@ old_town_cross_444_720p50.y4m,--preset=fast --keyint 20 --min-cu-size 16 old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode --qg-size 32 RaceHorses_416x240_30_10bit.yuv,--preset=veryfast --max-tu-size 8 -RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1 --opt-cu-delta-qp +RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1 CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --constrained-intra --min-keyint 5 --keyint 10 CrowdRun_1920x1080_50_10bit_444.yuv,--preset=medium --max-tu-size 16 --tu-inter-depth 2 --limit-tu 3 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset=veryfast --min-cu 16
View file
x265_2.7.tar.gz/source/test/testbench.cpp -> x265_2.9.tar.gz/source/test/testbench.cpp
Changed
@@ -96,7 +96,8 @@ int main(int argc, char *argv[]) { - int cpuid = X265_NS::cpu_detect(); + bool enableavx512 = true; + int cpuid = X265_NS::cpu_detect(enableavx512); const char *testname = 0; if (!(argc & 1)) @@ -117,7 +118,7 @@ if (!strncmp(name, "cpuid", strlen(name))) { bool bError = false; - cpuid = parseCpuName(value, bError); + cpuid = parseCpuName(value, bError, enableavx512); if (bError) { printf("Invalid CPU name: %s\n", value); @@ -169,6 +170,7 @@ { "XOP", X265_CPU_XOP }, { "AVX2", X265_CPU_AVX2 }, { "BMI2", X265_CPU_AVX2 | X265_CPU_BMI1 | X265_CPU_BMI2 }, + { "AVX512", X265_CPU_AVX512 }, { "ARMv6", X265_CPU_ARMV6 }, { "NEON", X265_CPU_NEON }, { "FastNeonMRC", X265_CPU_FAST_NEON_MRC },
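The testbench now passes an explicit enableavx512 flag into cpu_detect()/parseCpuName() and registers an "AVX512" name. For API users the capability surfaces as a new bit in the cpuid mask of x265_param; a minimal sketch of testing that bit (the X265_CPU_AVX512 value is defined in the x265.h hunk further down; a cpuid of 0 means the encoder auto-detects internally):

    /* Sketch: test the new AVX-512 bit in a caller-supplied cpu mask. */
    #include <stdio.h>
    #include "x265.h"

    static void report_avx512(const x265_param *param)
    {
        if (param->cpuid & X265_CPU_AVX512)
            printf("AVX-512 primitives requested\n");
        else
            printf("AVX-512 primitives not requested (the 2.9 default)\n");
    }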
View file
x265_2.7.tar.gz/source/test/testharness.h -> x265_2.9.tar.gz/source/test/testharness.h
Changed
@@ -72,7 +72,7 @@ #include <x86intrin.h> #elif ( !defined(__APPLE__) && defined (__GNUC__) && defined(__ARM_NEON__)) #include <arm_neon.h> -#elif defined(__GNUC__) +#elif defined(__GNUC__) && (!defined(__clang__) || __clang_major__ < 4) /* fallback for older GCC/MinGW */ static inline uint32_t __rdtsc(void) { @@ -91,7 +91,7 @@ } #endif // ifdef _MSC_VER -#define BENCH_RUNS 1000 +#define BENCH_RUNS 2000 // Adapted from checkasm.c, runs each optimized primitive four times, measures rdtsc // and discards invalid times. Repeats 1000 times to get a good average. Then measures
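The harness change keeps a portable __rdtsc() fallback and doubles BENCH_RUNS. For reference, this is the cycle-counting pattern the bench relies on; a sketch only, where do_primitive() stands in for whichever kernel is being timed and is not part of the harness:

    /* Sketch of the rdtsc timing pattern: read the time-stamp counter around
     * a batch of calls and keep the delta. do_primitive() is a placeholder. */
    #include <stdint.h>
    #if defined(_MSC_VER)
    #include <intrin.h>
    #elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    #include <x86intrin.h>
    #endif

    extern void do_primitive(void);   /* hypothetical kernel under test */

    static uint64_t time_primitive(int runs)
    {
        uint64_t start = __rdtsc();
        for (int i = 0; i < runs; i++)
            do_primitive();
        return __rdtsc() - start;
    }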
View file
x265_2.7.tar.gz/source/x265.cpp -> x265_2.9.tar.gz/source/x265.cpp
Changed
@@ -75,6 +75,7 @@ const char* reconPlayCmd; const x265_api* api; x265_param* param; + x265_vmaf_data* vmafData; bool bProgress; bool bForceY4m; bool bDither; @@ -96,6 +97,7 @@ reconPlayCmd = NULL; api = NULL; param = NULL; + vmafData = NULL; framesToBeEncoded = seek = 0; totalbytes = 0; bProgress = true; @@ -142,7 +144,7 @@ { int eta = (int)(elapsed * (framesToBeEncoded - frameNum) / ((int64_t)frameNum * 1000000)); sprintf(buf, "x265 [%.1f%%] %d/%d frames, %.2f fps, %.2f kb/s, eta %d:%02d:%02d", - 100. * frameNum / framesToBeEncoded, frameNum, framesToBeEncoded, fps, bitrate, + 100. * frameNum / (param->chunkEnd ? param->chunkEnd : param->totalFrames), frameNum, (param->chunkEnd ? param->chunkEnd : param->totalFrames), fps, bitrate, eta / 3600, (eta / 60) % 60, eta % 60); } else @@ -216,6 +218,14 @@ x265_log(NULL, X265_LOG_ERROR, "param alloc failed\n"); return true; } +#if ENABLE_LIBVMAF + vmafData = (x265_vmaf_data*)x265_malloc(sizeof(x265_vmaf_data)); + if(!vmafData) + { + x265_log(NULL, X265_LOG_ERROR, "vmaf data alloc failed\n"); + return true; + } +#endif if (api->param_default_preset(param, preset, tune) < 0) { @@ -363,6 +373,7 @@ info.frameCount = 0; getParamAspectRatio(param, info.sarWidth, info.sarHeight); + this->input = InputFile::open(info, this->bForceY4m); if (!this->input || this->input->isFail()) { @@ -392,7 +403,7 @@ if (this->framesToBeEncoded == 0 && info.frameCount > (int)seek) this->framesToBeEncoded = info.frameCount - seek; param->totalFrames = this->framesToBeEncoded; - + /* Force CFR until we have support for VFR */ info.timebaseNum = param->fpsDenom; info.timebaseDenom = param->fpsNum; @@ -439,7 +450,30 @@ param->sourceWidth, param->sourceHeight, param->fpsNum, param->fpsDenom, x265_source_csp_names[param->internalCsp]); } +#if ENABLE_LIBVMAF + if (!reconfn) + { + x265_log(param, X265_LOG_ERROR, "recon file must be specified to get VMAF score, try --help for help\n"); + return true; + } + const char *str = strrchr(info.filename, '.'); + if (!strcmp(str, ".y4m")) + { + x265_log(param, X265_LOG_ERROR, "VMAF supports YUV file format only.\n"); + return true; + } + if(param->internalCsp == X265_CSP_I420 || param->internalCsp == X265_CSP_I422 || param->internalCsp == X265_CSP_I444) + { + vmafData->reference_file = x265_fopen(inputfn, "rb"); + vmafData->distorted_file = x265_fopen(reconfn, "rb"); + } + else + { + x265_log(param, X265_LOG_ERROR, "VMAF will support only yuv420p, yu422p, yu444p, yuv420p10le, yuv422p10le, yuv444p10le formats.\n"); + return true; + } +#endif this->output = OutputFile::open(outputfn, info); if (this->output->isFail()) { @@ -555,7 +589,9 @@ x265_param* param = cliopt.param; const x265_api* api = cliopt.api; - +#if ENABLE_LIBVMAF + x265_vmaf_data* vmafdata = cliopt.vmafData; +#endif /* This allows muxers to modify bitstream format */ cliopt.output->setParam(param); @@ -712,7 +748,7 @@ if (!numEncoded) break; } - + /* clear progress report */ if (cliopt.bProgress) fprintf(stderr, "%*s\r", 80, " "); @@ -723,7 +759,11 @@ api->encoder_get_stats(encoder, &stats, sizeof(stats)); if (param->csvfn && !b_ctrl_c) +#if ENABLE_LIBVMAF + api->vmaf_encoder_log(encoder, argc, argv, param, vmafdata); +#else api->encoder_log(encoder, argc, argv); +#endif api->encoder_close(encoder); int64_t second_largest_pts = 0;
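When ENABLE_LIBVMAF is on, the CLI code above opens the raw source and the recon file and hands them to the encoder for scoring. A minimal sketch of the same flow for a library user, assuming a YUV (not Y4M) source in one of the supported chroma formats; the helper name and file paths are illustrative:

    /* Sketch: aggregate VMAF score after an encode. Requires a build with the
     * ENABLE_LIBVMAF cmake option; must be called only after encoding finishes. */
    #include <stdio.h>
    #include <stdlib.h>
    #include "x265.h"

    static double score_encode(x265_param *param)
    {
        x265_vmaf_data *vmaf = calloc(1, sizeof(x265_vmaf_data));
        vmaf->width  = param->sourceWidth;
        vmaf->height = param->sourceHeight;
        vmaf->internalBitDepth = param->internalBitDepth;
        vmaf->offset = 0;
        vmaf->reference_file = fopen("input.yuv", "rb");   /* illustrative path */
        vmaf->distorted_file = fopen("recon.yuv", "rb");   /* illustrative path */
        double score = x265_calculate_vmafscore(param, vmaf);
        fclose(vmaf->reference_file);
        fclose(vmaf->distorted_file);
        free(vmaf);
        return score;
    }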
View file
x265_2.7.tar.gz/source/x265.h -> x265_2.9.tar.gz/source/x265.h
Changed
@@ -31,6 +31,10 @@ extern "C" { #endif +#if _MSC_VER +#pragma warning(disable: 4201) // non-standard extension used (nameless struct/union) +#endif + /* x265_encoder: * opaque handler for encoder */ typedef struct x265_encoder x265_encoder; @@ -105,25 +109,107 @@ int lastMiniGopBFrame; int plannedType[X265_LOOKAHEAD_MAX + 1]; int64_t dts; + int64_t reorderedPts; } x265_lookahead_data; +typedef struct x265_analysis_validate +{ + int maxNumReferences; + int analysisReuseLevel; + int sourceWidth; + int sourceHeight; + int keyframeMax; + int keyframeMin; + int openGOP; + int bframes; + int bPyramid; + int maxCUSize; + int minCUSize; + int intraRefresh; + int lookaheadDepth; + int chunkStart; + int chunkEnd; +}x265_analysis_validate; + +/* Stores intra analysis data for a single frame. This struct needs better packing */ +typedef struct x265_analysis_intra_data +{ + uint8_t* depth; + uint8_t* modes; + char* partSizes; + uint8_t* chromaModes; +}x265_analysis_intra_data; + +typedef struct x265_analysis_MV +{ + union{ + struct { int16_t x, y; }; + + int32_t word; + }; +}x265_analysis_MV; + +/* Stores inter analysis data for a single frame */ +typedef struct x265_analysis_inter_data +{ + int32_t* ref; + uint8_t* depth; + uint8_t* modes; + uint8_t* partSize; + uint8_t* mergeFlag; + uint8_t* interDir; + uint8_t* mvpIdx[2]; + int8_t* refIdx[2]; + x265_analysis_MV* mv[2]; + int64_t* sadCost; +}x265_analysis_inter_data; + +typedef struct x265_weight_param +{ + uint32_t log2WeightDenom; + int inputWeight; + int inputOffset; + int wtPresent; +}x265_weight_param; + +#if X265_DEPTH < 10 +typedef uint32_t sse_t; +#else +typedef uint64_t sse_t; +#endif + +typedef struct x265_analysis_distortion_data +{ + sse_t* distortion; + sse_t* ctuDistortion; + double* scaledDistortion; + double averageDistortion; + double sdDistortion; + uint32_t highDistortionCtuCount; + uint32_t lowDistortionCtuCount; + double* offset; + double* threshold; +}x265_analysis_distortion_data; + /* Stores all analysis data for a single frame */ typedef struct x265_analysis_data { - int64_t satdCost; - uint32_t frameRecordSize; - uint32_t poc; - uint32_t sliceType; - uint32_t numCUsInFrame; - uint32_t numPartitions; - uint32_t depthBytes; - int bScenecut; - void* wt; - void* interData; - void* intraData; - uint32_t numCuInHeight; - x265_lookahead_data lookahead; - uint8_t* modeFlag[2]; + int64_t satdCost; + uint32_t frameRecordSize; + uint32_t poc; + uint32_t sliceType; + uint32_t numCUsInFrame; + uint32_t numPartitions; + uint32_t depthBytes; + int bScenecut; + x265_weight_param* wt; + x265_analysis_inter_data* interData; + x265_analysis_intra_data* intraData; + uint32_t numCuInHeight; + x265_lookahead_data lookahead; + uint8_t* modeFlag[2]; + x265_analysis_validate saveParam; + x265_analysis_distortion_data* distortionData; } x265_analysis_data; /* cu statistics */ @@ -152,14 +238,6 @@ /* All the above values will add up to 100%. 
*/ } x265_pu_stats; - -typedef struct x265_analysis_2Pass -{ - uint32_t poc; - uint32_t frameRecordSize; - void* analysisFramedata; -}x265_analysis_2Pass; - /* Frame level statistics */ typedef struct x265_frame_stats { @@ -208,6 +286,8 @@ x265_cu_stats cuStats; x265_pu_stats puStats; double totalFrameTime; + double vmafFrameScore; + double bufferFillFinal; } x265_frame_stats; typedef struct x265_ctu_info_t @@ -264,6 +344,7 @@ REGION_REFRESH_INFO = 134, MASTERING_DISPLAY_INFO = 137, CONTENT_LIGHT_LEVEL_INFO = 144, + ALTERNATIVE_TRANSFER_CHARACTERISTICS = 147, } SEIPayloadType; typedef struct x265_sei_payload @@ -362,7 +443,8 @@ int height; - x265_analysis_2Pass analysis2Pass; + // pts is reordered in the order of encoding. + int64_t reorderedPts; } x265_picture; typedef enum @@ -378,39 +460,38 @@ /* CPU flags */ /* x86 */ -#define X265_CPU_CMOV 0x0000001 -#define X265_CPU_MMX 0x0000002 -#define X265_CPU_MMX2 0x0000004 /* MMX2 aka MMXEXT aka ISSE */ +#define X265_CPU_MMX (1 << 0) +#define X265_CPU_MMX2 (1 << 1) /* MMX2 aka MMXEXT aka ISSE */ #define X265_CPU_MMXEXT X265_CPU_MMX2 -#define X265_CPU_SSE 0x0000008 -#define X265_CPU_SSE2 0x0000010 -#define X265_CPU_SSE3 0x0000020 -#define X265_CPU_SSSE3 0x0000040 -#define X265_CPU_SSE4 0x0000080 /* SSE4.1 */ -#define X265_CPU_SSE42 0x0000100 /* SSE4.2 */ -#define X265_CPU_LZCNT 0x0000200 /* Phenom support for "leading zero count" instruction. */ -#define X265_CPU_AVX 0x0000400 /* AVX support: requires OS support even if YMM registers aren't used. */ -#define X265_CPU_XOP 0x0000800 /* AMD XOP */ -#define X265_CPU_FMA4 0x0001000 /* AMD FMA4 */ -#define X265_CPU_AVX2 0x0002000 /* AVX2 */ -#define X265_CPU_FMA3 0x0004000 /* Intel FMA3 */ -#define X265_CPU_BMI1 0x0008000 /* BMI1 */ -#define X265_CPU_BMI2 0x0010000 /* BMI2 */ +#define X265_CPU_SSE (1 << 2) +#define X265_CPU_SSE2 (1 << 3) +#define X265_CPU_LZCNT (1 << 4) +#define X265_CPU_SSE3 (1 << 5) +#define X265_CPU_SSSE3 (1 << 6) +#define X265_CPU_SSE4 (1 << 7) /* SSE4.1 */ +#define X265_CPU_SSE42 (1 << 8) /* SSE4.2 */ +#define X265_CPU_AVX (1 << 9) /* Requires OS support even if YMM registers aren't used. 
*/ +#define X265_CPU_XOP (1 << 10) /* AMD XOP */ +#define X265_CPU_FMA4 (1 << 11) /* AMD FMA4 */ +#define X265_CPU_FMA3 (1 << 12) /* Intel FMA3 */ +#define X265_CPU_BMI1 (1 << 13) /* BMI1 */ +#define X265_CPU_BMI2 (1 << 14) /* BMI2 */ +#define X265_CPU_AVX2 (1 << 15) /* AVX2 */ +#define X265_CPU_AVX512 (1 << 16) /* AVX-512 {F, CD, BW, DQ, VL}, requires OS support */ /* x86 modifiers */ -#define X265_CPU_CACHELINE_32 0x0020000 /* avoid memory loads that span the border between two cachelines */ -#define X265_CPU_CACHELINE_64 0x0040000 /* 32/64 is the size of a cacheline in bytes */ -#define X265_CPU_SSE2_IS_SLOW 0x0080000 /* avoid most SSE2 functions on Athlon64 */ -#define X265_CPU_SSE2_IS_FAST 0x0100000 /* a few functions are only faster on Core2 and Phenom */ -#define X265_CPU_SLOW_SHUFFLE 0x0200000 /* The Conroe has a slow shuffle unit (relative to overall SSE performance) */ -#define X265_CPU_STACK_MOD4 0x0400000 /* if stack is only mod4 and not mod16 */ -#define X265_CPU_SLOW_CTZ 0x0800000 /* BSR/BSF x86 instructions are really slow on some CPUs */ -#define X265_CPU_SLOW_ATOM 0x1000000 /* The Atom is terrible: slow SSE unaligned loads, slow +#define X265_CPU_CACHELINE_32 (1 << 17) /* avoid memory loads that span the border between two cachelines */ +#define X265_CPU_CACHELINE_64 (1 << 18) /* 32/64 is the size of a cacheline in bytes */ +#define X265_CPU_SSE2_IS_SLOW (1 << 19) /* avoid most SSE2 functions on Athlon64 */ +#define X265_CPU_SSE2_IS_FAST (1 << 20) /* a few functions are only faster on Core2 and Phenom */ +#define X265_CPU_SLOW_SHUFFLE (1 << 21) /* The Conroe has a slow shuffle unit (relative to overall SSE performance) */ +#define X265_CPU_STACK_MOD4 (1 << 22) /* if stack is only mod4 and not mod16 */ +#define X265_CPU_SLOW_ATOM (1 << 23) /* The Atom is terrible: slow SSE unaligned loads, slow * SIMD multiplies, slow SIMD variable shifts, slow pshufb, * cacheline split penalties -- gather everything here that * isn't shared by other CPUs to avoid making half a dozen * new SLOW flags. */ -#define X265_CPU_SLOW_PSHUFB 0x2000000 /* such as on the Intel Atom */ -#define X265_CPU_SLOW_PALIGNR 0x4000000 /* such as on the AMD Bobcat */ +#define X265_CPU_SLOW_PSHUFB (1 << 24) /* such as on the Intel Atom */ +#define X265_CPU_SLOW_PALIGNR (1 << 25) /* such as on the AMD Bobcat */ /* ARM */ #define X265_CPU_ARMV6 0x0000001 @@ -459,11 +540,9 @@ #define X265_AQ_VARIANCE 1 #define X265_AQ_AUTO_VARIANCE 2 #define X265_AQ_AUTO_VARIANCE_BIASED 3 - #define x265_ADAPT_RD_STRENGTH 4 - +#define X265_REFINE_INTER_LEVELS 3 /* NOTE! 
For this release only X265_CSP_I420 and X265_CSP_I444 are supported */ - /* Supported internal color space types (according to semantics of chroma_format_idc) */ #define X265_CSP_I400 0 /* yuv 4:0:0 planar */ #define X265_CSP_I420 1 /* yuv 4:2:0 planar */ @@ -535,6 +614,7 @@ double elapsedEncodeTime; /* wall time since encoder was opened */ double elapsedVideoTime; /* encoded picture count / frame rate */ double bitrate; /* accBits / elapsed video time */ + double aggregateVmafScore; /* aggregate VMAF score for input video*/ uint64_t accBits; /* total bits output thus far */ uint32_t encodedPictureCount; /* number of output pictures thus far */ uint32_t totalWPFrames; /* number of uni-directional weighted frames used */ @@ -571,6 +651,47 @@ float bitrateFactor; } x265_zone; +/* data to calculate aggregate VMAF score */ +typedef struct x265_vmaf_data +{ + int width; + int height; + size_t offset; + int internalBitDepth; + FILE *reference_file; /* FILE pointer for input file */ + FILE *distorted_file; /* FILE pointer for recon file generated*/ +}x265_vmaf_data; + +/* data to calculate frame level VMAF score */ +typedef struct x265_vmaf_framedata +{ + int width; + int height; + int frame_set; + int internalBitDepth; + void *reference_frame; /* points to fenc of particular frame */ + void *distorted_frame; /* points to recon of particular frame */ +}x265_vmaf_framedata; + +/* common data needed to calculate both frame level and video level VMAF scores */ +typedef struct x265_vmaf_commondata +{ + char *format; + char *model_path; + char *log_path; + char *log_fmt; + int disable_clip; + int disable_avx; + int enable_transform; + int phone_model; + int psnr; + int ssim; + int ms_ssim; + char *pool; +}x265_vmaf_commondata; + +static const x265_vmaf_commondata vcd[] = { { NULL, (char *)"/usr/local/share/model/vmaf_v0.6.1.pkl", NULL, NULL, 0, 0, 0, 0, 0, 0, 0, NULL } }; + /* x265 input parameters * * For version safety you may use x265_param_alloc/free() to manage the @@ -584,7 +705,6 @@ * somehow flawed on your target hardware. The asm function tables are * process global, the first encoder configures them for all encoders */ int cpuid; - /*== Parallelism Features ==*/ /* Number of concurrently encoded frames between 1 and X265_MAX_FRAME_THREADS @@ -1153,6 +1273,18 @@ * Default is 0, which is recommended */ int crQpOffset; + /* Specifies the preferred transfer characteristics syntax element in the + * alternative transfer characteristics SEI message (see. D.2.38 and D.3.38 of + * JCTVC-W1005 http://phenix.it-sudparis.eu/jct/doc_end_user/documents/23_San%20Diego/wg11/JCTVC-W1005-v4.zip + * */ + int preferredTransferCharacteristics; + + /* + * Specifies the value for the pic_struc syntax element of the picture timing SEI message (See D2.3 and D3.3) + * of the HEVC spec. for a detailed explanation + * */ + int pictureStructure; + struct { /* Explicit mode of rate-control, necessary for API users. It must @@ -1548,6 +1680,36 @@ /*Number of RADL pictures allowed in front of IDR*/ int radl; + + /* This value controls the maximum AU size defined in specification + * It represents the percentage of maximum AU size used. + * Default is 1 (which is 100%). Range is 0.5 to 1. */ + double maxAUSizeFactor; + + /* Enables the emission of a Recovery Point SEI with the stream headers + * at each IDR frame describing poc of the recovery point, exact matching flag + * and broken link flag. Default is disabled. 
*/ + int bEmitIDRRecoverySEI; + + /* Dynamically change refine-inter at block level*/ + int bDynamicRefine; + + /* Enable writing all SEI messgaes in one single NAL instead of mul*/ + int bSingleSeiNal; + + + /* First frame of the chunk. Frames preceeding this in display order will + * be encoded, however, they will be discarded in the bitstream. + * Default 0 (disabled). */ + int chunkStart; + + /* Last frame of the chunk. Frames following this in display order will be + * used in taking lookahead decisions, but, they will not be encoded. + * Default 0 (disabled). */ + int chunkEnd; + /* File containing base64 encoded SEI messages in POC order */ + const char* naluFile; + } x265_param; /* x265_param_alloc: @@ -1660,6 +1822,14 @@ * A static string describing the compiler and target architecture */ X265_API extern const char *x265_build_info_str; +/* x265_alloc_analysis_data: +* Allocate memory for the x265_analysis_data object's internal structures. */ +void x265_alloc_analysis_data(x265_param *param, x265_analysis_data* analysis); + +/* +* Free the allocated memory for x265_analysis_data object's internal structures. */ +void x265_free_analysis_data(x265_param *param, x265_analysis_data* analysis); + /* Force a link error in the case of linking against an incompatible API version. * Glue #defines exist to force correct macro expansion; the final output of the macro * is x265_encoder_open_##X265_BUILD (for purposes of dlopen). */ @@ -1787,6 +1957,22 @@ /* In-place downshift from a bit-depth greater than 8 to a bit-depth of 8, using * the residual bits to dither each row. */ void x265_dither_image(x265_picture *, int picWidth, int picHeight, int16_t *errorBuf, int bitDepth); +#if ENABLE_LIBVMAF +/* x265_calculate_vmafScore: + * returns VMAF score for the input video. + * This api must be called only after encoding was done. */ +double x265_calculate_vmafscore(x265_param*, x265_vmaf_data*); + +/* x265_calculate_vmaf_framelevelscore: + * returns VMAF score for each frame in a given input video. */ +double x265_calculate_vmaf_framelevelscore(x265_vmaf_framedata*); +/* x265_vmaf_encoder_log: + * write a line to the configured CSV file. If a CSV filename was not + * configured, or file open failed, this function will perform no write. + * This api will be called only when ENABLE_LIBVMAF cmake option is set */ +void x265_vmaf_encoder_log(x265_encoder *encoder, int argc, char **argv, x265_param*, x265_vmaf_data*); + +#endif #define X265_MAJOR_VERSION 1 @@ -1840,6 +2026,11 @@ void (*csvlog_encode)(const x265_param*, const x265_stats *, int, int, int, char**); void (*dither_image)(x265_picture*, int, int, int16_t*, int); int (*set_analysis_data)(x265_encoder *encoder, x265_analysis_data *analysis_data, int poc, uint32_t cuBytes); +#if ENABLE_LIBVMAF + double (*calculate_vmafscore)(x265_param *, x265_vmaf_data *); + double (*calculate_vmaf_framelevelscore)(x265_vmaf_framedata *); + void (*vmaf_encoder_log)(x265_encoder*, int, char**, x265_param *, x265_vmaf_data *); +#endif /* add new pointers to the end, or increment X265_MAJOR_VERSION */ } x265_api;
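x265.h now exposes the analysis buffers as typed structs (x265_analysis_inter_data, x265_analysis_intra_data, x265_analysis_distortion_data) and adds public allocation helpers. A minimal sketch of the alloc/free round trip, assuming the caller pre-fills the frame geometry before calling the allocator; the 1080p / 64x64-CTU numbers are illustrative and this is not a complete analysis save/load setup:

    #include <string.h>
    #include "x265.h"

    /* Sketch: allocate and release per-frame analysis buffers with the new
     * public helpers. Geometry fields are assumed to be set by the caller. */
    static void analysis_roundtrip(x265_param *param)
    {
        x265_analysis_data analysis;
        memset(&analysis, 0, sizeof(analysis));
        analysis.numCUsInFrame = ((1920 + 63) / 64) * ((1080 + 63) / 64);
        analysis.numCuInHeight = (1080 + 63) / 64;
        analysis.numPartitions = 256;            /* 4x4 partitions per 64x64 CTU */
        x265_alloc_analysis_data(param, &analysis);
        /* ... populate or read analysis.interData / analysis.intraData here ... */
        x265_free_analysis_data(param, &analysis);
    }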
View file
x265_2.7.tar.gz/source/x265cli.h -> x265_2.9.tar.gz/source/x265cli.h
Changed
@@ -152,6 +152,8 @@ { "vbv-init", required_argument, NULL, 0 }, { "vbv-end", required_argument, NULL, 0 }, { "vbv-end-fr-adj", required_argument, NULL, 0 }, + { "chunk-start", required_argument, NULL, 0 }, + { "chunk-end", required_argument, NULL, 0 }, { "bitrate", required_argument, NULL, 0 }, { "qp", required_argument, NULL, 'q' }, { "aq-mode", required_argument, NULL, 0 }, @@ -263,6 +265,8 @@ { "scale-factor", required_argument, NULL, 0 }, { "refine-intra", required_argument, NULL, 0 }, { "refine-inter", required_argument, NULL, 0 }, + { "dynamic-refine", no_argument, NULL, 0 }, + { "no-dynamic-refine", no_argument, NULL, 0 }, { "strict-cbr", no_argument, NULL, 0 }, { "temporal-layers", no_argument, NULL, 0 }, { "no-temporal-layers", no_argument, NULL, 0 }, @@ -293,6 +297,14 @@ { "refine-mv-type", required_argument, NULL, 0 }, { "copy-pic", no_argument, NULL, 0 }, { "no-copy-pic", no_argument, NULL, 0 }, + { "max-ausize-factor", required_argument, NULL, 0 }, + { "idr-recovery-sei", no_argument, NULL, 0 }, + { "no-idr-recovery-sei", no_argument, NULL, 0 }, + { "single-sei", no_argument, NULL, 0 }, + { "no-single-sei", no_argument, NULL, 0 }, + { "atc-sei", required_argument, NULL, 0 }, + { "pic-struct", required_argument, NULL, 0 }, + { "nalu-file", required_argument, NULL, 0 }, { 0, 0, 0, 0 }, { 0, 0, 0, 0 }, { 0, 0, 0, 0 }, @@ -343,6 +355,7 @@ H0(" --dhdr10-info <filename> JSON file containing the Creative Intent Metadata to be encoded as Dynamic Tone Mapping\n"); H0(" --[no-]dhdr10-opt Insert tone mapping SEI only for IDR frames and when the tone mapping information changes. Default disabled\n"); #endif + H0(" --nalu-file <filename> Text file containing SEI messages in the following format : <POC><space><PREFIX><space><NAL UNIT TYPE>/<SEI TYPE><space><SEI Payload>\n"); H0("-f/--frames <integer> Maximum number of frames to encode. Default all\n"); H0(" --seek <integer> First frame to encode\n"); H1(" --[no-]interlace <bff|tff> Indicate input pictures are interlace fields in temporal order. Default progressive\n"); @@ -389,7 +402,7 @@ H0(" --[no-]early-skip Enable early SKIP detection. Default %s\n", OPT(param->bEnableEarlySkip)); H0(" --[no-]rskip Enable early exit from recursion. Default %s\n", OPT(param->bEnableRecursionSkip)); H1(" --[no-]tskip-fast Enable fast intra transform skipping. Default %s\n", OPT(param->bEnableTSkipFast)); - H1(" --[no-]splitrd-skip Enable skipping split RD analysis when sum of split CU rdCost larger than none split CU rdCost for Intra CU. Default %s\n", OPT(param->bEnableSplitRdSkip)); + H1(" --[no-]splitrd-skip Enable skipping split RD analysis when sum of split CU rdCost larger than one split CU rdCost for Intra CU. Default %s\n", OPT(param->bEnableSplitRdSkip)); H1(" --nr-intra <integer> An integer value in range of 0 to 2000, which denotes strength of noise reduction in intra CUs. Default 0\n"); H1(" --nr-inter <integer> An integer value in range of 0 to 2000, which denotes strength of noise reduction in inter CUs. Default 0\n"); H0(" --ctu-info <integer> Enable receiving ctu information asynchronously and determine reaction to the CTU information (0, 1, 2, 4, 6) Default 0\n" @@ -459,6 +472,8 @@ H0(" --vbv-init <float> Initial VBV buffer occupancy (fraction of bufsize or in kbits). Default %.2f\n", param->rc.vbvBufferInit); H0(" --vbv-end <float> Final VBV buffer emptiness (fraction of bufsize or in kbits). Default 0 (disabled)\n"); H0(" --vbv-end-fr-adj <float> Frame from which qp has to be adjusted to achieve final decode buffer emptiness. 
Default 0\n"); + H0(" --chunk-start <integer> First frame of the chunk. Default 0 (disabled)\n"); + H0(" --chunk-end <integer> Last frame of the chunk. Default 0 (disabled)\n"); H0(" --pass Multi pass rate control.\n" " - 1 : First pass, creates stats file\n" " - 2 : Last pass, does not overwrite stats file\n" @@ -475,11 +490,12 @@ H0(" --analysis-reuse-level <1..10> Level of analysis reuse indicates amount of info stored/reused in save/load mode, 1:least..10:most. Default %d\n", param->analysisReuseLevel); H0(" --refine-mv-type <string> Reuse MV information received through API call. Supported option is avc. Default disabled - %d\n", param->bMVType); H0(" --scale-factor <int> Specify factor by which input video is scaled down for analysis save mode. Default %d\n", param->scaleFactor); - H0(" --refine-intra <0..3> Enable intra refinement for encode that uses analysis-load.\n" + H0(" --refine-intra <0..4> Enable intra refinement for encode that uses analysis-load.\n" " - 0 : Forces both mode and depth from the save encode.\n" " - 1 : Functionality of (0) + evaluate all intra modes at min-cu-size's depth when current depth is one smaller than min-cu-size's depth.\n" " - 2 : Functionality of (1) + irrespective of size evaluate all angular modes when the save encode decides the best mode as angular.\n" " - 3 : Functionality of (1) + irrespective of size evaluate all intra modes.\n" + " - 4 : Re-evaluate all intra blocks, does not reuse data from save encode.\n" " Default:%d\n", param->intraRefine); H0(" --refine-inter <0..3> Enable inter refinement for encode that uses analysis-load.\n" " - 0 : Forces both mode and depth from the save encode.\n" @@ -488,6 +504,7 @@ " - 2 : Functionality of (1) + irrespective of size restrict the modes evaluated when specific modes are decided as the best mode by the save encode.\n" " - 3 : Functionality of (1) + irrespective of size evaluate all inter modes.\n" " Default:%d\n", param->interRefine); + H0(" --[no-]dynamic-refine Dynamically changes refine-inter level for each CU. Default %s\n", OPT(param->bDynamicRefine)); H0(" --[no-]refine-mv Enable mv refinement for load mode. Default %s\n", OPT(param->mvRefine)); H0(" --aq-mode <integer> Mode for Adaptive Quantization - 0:none 1:uniform AQ 2:auto variance 3:auto variance with bias to dark scenes. Default %d\n", param->rc.aqMode); H0(" --aq-strength <float> Reduces blocking and blurring in flat and textured areas (0 to 3.0). Default %.2f\n", param->rc.aqStrength); @@ -515,6 +532,8 @@ H1(" MAX_MAX_QP+1 floats for lambda table, then again for lambda2 table\n"); H1(" Blank lines and lines starting with hash(#) are ignored\n"); H1(" Comma is considered to be white-space\n"); + H0(" --max-ausize-factor <float> This value controls the maximum AU size defined in specification.\n"); + H0(" It represents the percentage of maximum AU size used. Default %.1f\n", param->maxAUSizeFactor); H0("\nLoop filters (deblock and SAO):\n"); H0(" --[no-]deblock Enable Deblocking Loop Filter, optionally specify tC:Beta offsets Default %s\n", OPT(param->bEnableLoopFilter)); H0(" --[no-]sao Enable Sample Adaptive Offset. Default %s\n", OPT(param->bEnableSAO)); @@ -548,9 +567,12 @@ H0(" --[no-]repeat-headers Emit SPS and PPS headers at each keyframe. Default %s\n", OPT(param->bRepeatHeaders)); H0(" --[no-]info Emit SEI identifying encoder and parameters. Default %s\n", OPT(param->bEmitInfoSEI)); H0(" --[no-]hrd Enable HRD parameters signaling. 
Default %s\n", OPT(param->bEmitHRDSEI)); + H0(" --[no-]idr-recovery-sei Emit recovery point infor SEI at each IDR frame \n"); H0(" --[no-]temporal-layers Enable a temporal sublayer for unreferenced B frames. Default %s\n", OPT(param->bEnableTemporalSubLayers)); H0(" --[no-]aud Emit access unit delimiters at the start of each access unit. Default %s\n", OPT(param->bEnableAccessUnitDelimiters)); H1(" --hash <integer> Decoded Picture Hash SEI 0: disabled, 1: MD5, 2: CRC, 3: Checksum. Default %d\n", param->decodedPictureHashSEI); + H0(" --atc-sei <integer> Emit the alternative transfer characteristics SEI message where the integer is the preferred transfer characteristics. Default disabled\n"); + H0(" --pic-struct <integer> Set the picture structure and emits it in the picture timing SEI message. Values in the range 0..12. See D.3.3 of the HEVC spec. for a detailed explanation.\n"); H0(" --log2-max-poc-lsb <integer> Maximum of the picture order count\n"); H0(" --[no-]vui-timing-info Emit VUI timing information in the bistream. Default %s\n", OPT(param->bEmitVUITimingInfo)); H0(" --[no-]vui-hrd-info Emit VUI HRD information in the bistream. Default %s\n", OPT(param->bEmitVUIHRDInfo));
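The new switches above map directly onto x265_param fields introduced in this release. A minimal API-side sketch with illustrative values taken from the option help text (not a recommended configuration); param is assumed to be an already-allocated x265_param:

    /* Sketch: API equivalents of the new 2.9 CLI switches (values illustrative). */
    param->bDynamicRefine = 1;                      /* --dynamic-refine                   */
    param->bSingleSeiNal  = 1;                      /* --single-sei                       */
    param->bEmitIDRRecoverySEI = 1;                 /* --idr-recovery-sei                 */
    param->preferredTransferCharacteristics = 18;   /* --atc-sei 18 (ARIB STD-B67 / HLG)  */
    param->pictureStructure = 0;                    /* --pic-struct 0, progressive frame  */
    param->maxAUSizeFactor = 1.0;                   /* --max-ausize-factor, 100% of max AU */
    param->naluFile = "sei_messages.txt";           /* --nalu-file, POC-ordered SEI text  */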