x265: Changes of Revision 11
x265.changes
Changed
@@ -1,4 +1,47 @@
 -------------------------------------------------------------------
+Fri Nov 27 18:21:04 UTC 2015 - aloisio@gmx.com
+
+- Update to version 1.8:
+  API Changes:
+  * Experimental support for Main12 is now enabled. Partial
+    assembly support exists.
+  * Main12 and Intra/Still picture profiles are now supported.
+    Still picture profile is detected based on
+    x265_param::totalFrames.
+  * Three classes of encoding statistics are now available
+    through the API.
+    + x265_stats - contains encoding statistics, available
+      through x265_encoder_get_stats()
+    + x265_frame_stats and x265_cu_stats - contains frame
+      encoding statistics, available through recon x265_picture
+  * --csv
+  * x265_encoder_log() is now deprecated
+  * x265_param::csvfn is also deprecated
+  * --log-level now controls only console logging, frame
+    level console logging has been removed.
+  * Support added for new color transfer characteristic ARIB
+    STD-B67
+  New Features:
+  * limit-refs
+    + This feature limits the references analysed for
+      individual CUS.
+    + Provides a nice tradeoff between efficiency and
+      performance.
+    + aq-mode 3
+  * A new aq-mode that provides additional biasing for
+    low-light conditions.
+  * An improved scene cut detection logic that allows
+    ratecontrol to manage visual quality at fade-ins and
+    fade-outs better.
+  Preset and Tune Options:
+  * tune grain
+    + Increases psyRdoq strength to 10.0, and rdoq-level to 2.
+    + qg-size
+  * Default value changed to 32.
+- soname bump to 68
+- Reworked arm.patch for 1.8
+
+-------------------------------------------------------------------
 Fri May 29 09:11:02 UTC 2015 - aloisio@gmx.com
 
 - soname bump to 59
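The statistics interfaces added in 1.8 and listed in this changelog entry are small. Below is a minimal C sketch of how an application might consume them, assuming an encoder that has finished flushing; x265_encoder_get_stats() is the documented entry point, but the exact field set of x265_stats varies across API versions, so the two fields printed here are illustrative:

    #include <stdio.h>
    #include <string.h>
    #include "x265.h"

    /* Report whole-encode statistics. Passing sizeof(x265_stats) lets
     * the library guard against structure-size mismatches. */
    static void report_stats(x265_encoder *encoder)
    {
        x265_stats stats;
        memset(&stats, 0, sizeof(stats));
        x265_encoder_get_stats(encoder, &stats, sizeof(stats));
        printf("encoded %u frames, global PSNR %.3f\n",
               stats.encodedPictureCount, stats.globalPsnr);
    }

Per-frame statistics (x265_frame_stats) are delivered through the recon x265_picture returned by x265_encoder_encode(), so no additional call is needed for frame-level data.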
x265.spec
Changed
@@ -1,10 +1,10 @@
 # based on the spec file from https://build.opensuse.org/package/view_file/home:Simmphonie/libx265/
 
 Name:           x265
-%define soname  59
+%define soname  68
 %define libname lib%{name}
 %define libsoname %{libname}-%{soname}
-Version:        1.7
+Version:        1.8
 Release:        0
 License:        GPL-2.0+
 Summary:        A free h265/HEVC encoder - encoder binary
@@ -43,9 +43,9 @@ streams.
 
 %prep
-%setup -q -n "%{name}_%{version}/build/linux"
+%setup -q -n "%{name}_11047/build/linux"
 cd ../..
-%patch0
+%patch0 -p1
 cd -
 %define FAKE_BUILDDATE %(LC_ALL=C date -u -r %{_sourcedir}/%{name}.changes '+%%b %%e %%Y')
 sed -i -e "s/0.0/%{soname}.0/g" ../../source/cmake/version.cmake
arm.patch
Changed
@@ -1,9 +1,11 @@
---- source/CMakeLists.txt.orig	2015-04-28 21:43:18.585528552 +0200
-+++ source/CMakeLists.txt	2015-04-28 21:47:14.995334232 +0200
-@@ -50,10 +50,18 @@
-     set(X64 1)
-     add_definitions(-DX86_64=1)
- endif()
+Index: x265_11047/source/CMakeLists.txt
+===================================================================
+--- x265_11047.orig/source/CMakeLists.txt
++++ x265_11047/source/CMakeLists.txt
+@@ -56,10 +56,22 @@ elseif(POWERMATCH GREATER "-1")
+     message(STATUS "Detected POWER target processor")
+     set(POWER 1)
+     add_definitions(-DX265_ARCH_POWER=1)
 +elseif(${SYSPROC} MATCHES "armv5.*")
 +    message(STATUS "Detected ARMV5 system processor")
 +    set(ARMV5 1)
@@ -19,10 +21,14 @@
 +    message(STATUS "Detected ARMV7 system processor")
 +    set(ARMV7 1)
 +    add_definitions(-DX265_ARCH_ARM=1 -DHAVE_ARMV6=1 -DHAVE_NEON=0)
++elseif(${SYSPROC} STREQUAL "aarch64")
++    message(STATUS "Detected AArch64 system processor")
++    set(ARMV7 1)
++    add_definitions(-DX265_ARCH_ARM=1 -DHAVE_ARMV6=1 -DHAVE_NEON=0)
 else()
     message(STATUS "CMAKE_SYSTEM_PROCESSOR value `${CMAKE_SYSTEM_PROCESSOR}` is unknown")
     message(STATUS "Please add this value near ${CMAKE_CURRENT_LIST_FILE}:${CMAKE_CURRENT_LIST_LINE}")
-@@ -155,8 +163,8 @@
+@@ -169,8 +181,8 @@ if(GCC)
     elseif(X86 AND NOT X64)
         add_definitions(-march=i686)
     endif()
@@ -33,8 +39,10 @@
 endif()
 if(FPROFILE_GENERATE)
     if(INTEL_CXX)
---- source/common/cpu.cpp.orig	2015-04-28 21:47:44.634923269 +0200
-+++ source/common/cpu.cpp	2015-04-28 21:49:50.305468867 +0200
+Index: x265_11047/source/common/cpu.cpp
+===================================================================
+--- x265_11047.orig/source/common/cpu.cpp
++++ x265_11047/source/common/cpu.cpp
 @@ -37,7 +37,7 @@
 #include <machine/cpu.h>
 #endif
@@ -44,20 +52,3 @@
 #include <signal.h>
 #include <setjmp.h>
 static sigjmp_buf jmpbuf;
-@@ -340,7 +340,6 @@
-     }
-
-     canjump = 1;
--    x265_cpu_neon_test();
-     canjump = 0;
-     signal(SIGILL, oldsig);
- #endif // if !HAVE_NEON
-@@ -356,7 +355,7 @@
-     // which may result in incorrect detection and the counters stuck enabled.
-     // right now Apple does not seem to support performance counters for this test
- #ifndef __MACH__
--    flags |= x265_cpu_fast_neon_mrc_test() ? X265_CPU_FAST_NEON_MRC : 0;
-+    //flags |= x265_cpu_fast_neon_mrc_test() ? X265_CPU_FAST_NEON_MRC : 0;
- #endif
-     // TODO: write dual issue test? currently it's A8 (dual issue) vs. A9 (fast mrc)
- #endif // if HAVE_ARMV6
baselibs.conf
Changed
@@ -1,1 +1,1 @@
-libx265-59
+libx265-68
x265_1.7.tar.gz/source/filters/filters.cpp
Deleted
@@ -1,79 +0,0 @@
-/*****************************************************************************
- * Copyright (C) 2013 x265 project
- *
- * Authors: Selvakumar Nithiyaruban <selvakumar@multicorewareinc.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
- *
- * This program is also available under a commercial proprietary license.
- * For more information, contact us at license @ x265.com.
- *****************************************************************************/
-
-#include "filters.h"
-#include "common.h"
-
-/* The dithering algorithm is based on Sierra-2-4A error diffusion. */
-void ditherPlane(pixel *dst, int dstStride, uint16_t *src, int srcStride,
-                 int width, int height, int16_t *errors, int bitDepth)
-{
-    const int lShift = 16 - bitDepth;
-    const int rShift = 16 - bitDepth + 2;
-    const int half = (1 << (16 - bitDepth + 1));
-    const int pixelMax = (1 << bitDepth) - 1;
-
-    memset(errors, 0, (width + 1) * sizeof(int16_t));
-    int pitch = 1;
-    for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
-    {
-        int16_t err = 0;
-        for (int x = 0; x < width; x++)
-        {
-            err = err * 2 + errors[x] + errors[x + 1];
-            dst[x * pitch] = (pixel)x265_clip3(0, pixelMax, ((src[x * 1] << 2) + err + half) >> rShift);
-            errors[x] = err = src[x * pitch] - (dst[x * pitch] << lShift);
-        }
-    }
-}
-
-void ditherImage(x265_picture& picIn, int picWidth, int picHeight, int16_t *errorBuf, int bitDepth)
-{
-    /* This portion of code is from readFrame in x264. */
-    for (int i = 0; i < x265_cli_csps[picIn.colorSpace].planes; i++)
-    {
-        if ((picIn.bitDepth & 7) && (picIn.bitDepth != 16))
-        {
-            /* upconvert non 16bit high depth planes to 16bit */
-            uint16_t *plane = (uint16_t*)picIn.planes[i];
-            uint32_t pixelCount = x265_picturePlaneSize(picIn.colorSpace, picWidth, picHeight, i);
-            int lShift = 16 - picIn.bitDepth;
-
-            /* This loop assumes width is equal to stride which
-             * happens to be true for file reader outputs */
-            for (uint32_t j = 0; j < pixelCount; j++)
-            {
-                plane[j] = plane[j] << lShift;
-            }
-        }
-    }
-
-    for (int i = 0; i < x265_cli_csps[picIn.colorSpace].planes; i++)
-    {
-        int height = (int)(picHeight >> x265_cli_csps[picIn.colorSpace].height[i]);
-        int width = (int)(picWidth >> x265_cli_csps[picIn.colorSpace].width[i]);
-
-        ditherPlane(((pixel*)picIn.planes[i]), picIn.stride[i] / sizeof(pixel), ((uint16_t*)picIn.planes[i]),
-                    picIn.stride[i] / 2, width, height, errorBuf, bitDepth);
-    }
-}
x265_1.7.tar.gz/source/filters/filters.h
Deleted
@@ -1,31 +0,0 @@
-/*****************************************************************************
- * Copyright (C) 2013 x265 project
- *
- * Authors: Selvakumar Nithiyaruban <selvakumar@multicorewareinc.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
- *
- * This program is also available under a commercial proprietary license.
- * For more information, contact us at license @ x265.com.
- *****************************************************************************/
-
-#ifndef X265_FILTERS_H
-#define X265_FILTERS_H
-
-#include "x265.h"
-
-void ditherImage(x265_picture&, int picWidth, int picHeight, int16_t *errorBuf, int bitDepth);
-
-#endif //X265_FILTERS_H
x265_1.7.tar.gz/.hg_archival.txt -> x265_1.8.tar.gz/.hg_archival.txt
Changed
@@ -1,4 +1,5 @@
 repo: 09fe40627f03a0f9c3e6ac78b22ac93da23f9fdf
-node: 8425278def1edf0931dc33fc518e1950063e76b0
+node: 5dcc9d3a928c400b41a3547d7bfee10340519e56
 branch: stable
-tag: 1.7
+latesttag: 1.8
+latesttagdistance: 1
x265_1.7.tar.gz/.hgtags -> x265_1.8.tar.gz/.hgtags
Changed
@@ -15,3 +15,5 @@
 5e604833c5aa605d0b6efbe5234492b5e7d8ac61 1.4
 9f0324125f53a12f766f6ed6f98f16e2f42337f4 1.5
 cbeb7d8a4880e4020c4545dd8e498432c3c6cad3 1.6
+8425278def1edf0931dc33fc518e1950063e76b0 1.7
+e27327f5da35c5feb660360336fdc94bd0afe719 1.8
x265_1.8.tar.gz/build/linux/multilib.sh
Added
@@ -0,0 +1,41 @@
+#!/bin/sh
+
+mkdir -p 8bit 10bit 12bit
+
+cd 12bit
+cmake ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF -DMAIN12=ON
+make ${MAKEFLAGS}
+
+cd ../10bit
+cmake ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF
+make ${MAKEFLAGS}
+
+cd ../8bit
+ln -sf ../10bit/libx265.a libx265_main10.a
+ln -sf ../12bit/libx265.a libx265_main12.a
+cmake ../../../source -DEXTRA_LIB="x265_main10.a;x265_main12.a" -DEXTRA_LINK_FLAGS=-L. -DLINKED_10BIT=ON -DLINKED_12BIT=ON
+make ${MAKEFLAGS}
+
+# rename the 8bit library, then combine all three into libx265.a
+mv libx265.a libx265_main.a
+
+uname=`uname`
+if [ "$uname" = "Linux" ]
+then
+
+# On Linux, we use GNU ar to combine the static libraries together
+ar -M <<EOF
+CREATE libx265.a
+ADDLIB libx265_main.a
+ADDLIB libx265_main10.a
+ADDLIB libx265_main12.a
+SAVE
+END
+EOF
+
+else
+
+# Mac/BSD libtool
+libtool -static -o libx265.a libx265_main.a libx265_main10.a libx265_main12.a 2>/dev/null
+
+fi
x265_1.8.tar.gz/build/msys/multilib.sh
Added
@@ -0,0 +1,29 @@
+#!/bin/sh
+
+mkdir -p 8bit 10bit 12bit
+
+cd 12bit
+cmake -G "MSYS Makefiles" ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF -DMAIN12=ON
+make ${MAKEFLAGS}
+cp libx265.a ../8bit/libx265_main12.a
+
+cd ../10bit
+cmake -G "MSYS Makefiles" ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF
+make ${MAKEFLAGS}
+cp libx265.a ../8bit/libx265_main10.a
+
+cd ../8bit
+cmake -G "MSYS Makefiles" ../../../source -DEXTRA_LIB="x265_main10.a;x265_main12.a" -DEXTRA_LINK_FLAGS=-L. -DLINKED_10BIT=ON -DLINKED_12BIT=ON
+make ${MAKEFLAGS}
+
+# rename the 8bit library, then combine all three into libx265.a using GNU ar
+mv libx265.a libx265_main.a
+
+ar -M <<EOF
+CREATE libx265.a
+ADDLIB libx265_main.a
+ADDLIB libx265_main10.a
+ADDLIB libx265_main12.a
+SAVE
+END
+EOF
x265_1.8.tar.gz/build/vc10-x86_64/multilib.bat
Added
@@ -0,0 +1,44 @@
+@echo off
+if "%VS100COMNTOOLS%" == "" (
+  msg "%username%" "Visual Studio 10 not detected"
+  exit 1
+)
+
+call "%VS100COMNTOOLS%\..\..\VC\vcvarsall.bat"
+
+@mkdir 12bit
+@mkdir 10bit
+@mkdir 8bit
+
+@cd 12bit
+cmake -G "Visual Studio 10 Win64" ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF -DMAIN12=ON
+if exist x265.sln (
+  MSBuild /property:Configuration="Release" x265.sln
+  copy/y Release\x265-static.lib ..\8bit\x265-static-main12.lib
+)
+
+@cd ..\10bit
+cmake -G "Visual Studio 10 Win64" ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF
+if exist x265.sln (
+  MSBuild /property:Configuration="Release" x265.sln
+  copy/y Release\x265-static.lib ..\8bit\x265-static-main10.lib
+)
+
+@cd ..\8bit
+if not exist x265-static-main10.lib (
+  msg "%username%" "10bit build failed"
+  exit 1
+)
+if not exist x265-static-main12.lib (
+  msg "%username%" "12bit build failed"
+  exit 1
+)
+cmake -G "Visual Studio 10 Win64" ../../../source -DEXTRA_LIB="x265-static-main10.lib;x265-static-main12.lib" -DLINKED_10BIT=ON -DLINKED_12BIT=ON
+if exist x265.sln (
+  MSBuild /property:Configuration="Release" x265.sln
+  :: combine static libraries (ignore warnings caused by winxp.cpp hacks)
+  move Release\x265-static.lib x265-static-main.lib
+  LIB.EXE /ignore:4006 /ignore:4221 /OUT:Release\x265-static.lib x265-static-main.lib x265-static-main10.lib x265-static-main12.lib
+)
+
+pause
x265_1.8.tar.gz/build/vc11-x86_64/multilib.bat
Added
@@ -0,0 +1,44 @@
+@echo off
+if "%VS110COMNTOOLS%" == "" (
+  msg "%username%" "Visual Studio 11 not detected"
+  exit 1
+)
+
+call "%VS110COMNTOOLS%\..\..\VC\vcvarsall.bat"
+
+@mkdir 12bit
+@mkdir 10bit
+@mkdir 8bit
+
+@cd 12bit
+cmake -G "Visual Studio 11 Win64" ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF -DMAIN12=ON
+if exist x265.sln (
+  MSBuild /property:Configuration="Release" x265.sln
+  copy/y Release\x265-static.lib ..\8bit\x265-static-main12.lib
+)
+
+@cd ..\10bit
+cmake -G "Visual Studio 11 Win64" ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF
+if exist x265.sln (
+  MSBuild /property:Configuration="Release" x265.sln
+  copy/y Release\x265-static.lib ..\8bit\x265-static-main10.lib
+)
+
+@cd ..\8bit
+if not exist x265-static-main10.lib (
+  msg "%username%" "10bit build failed"
+  exit 1
+)
+if not exist x265-static-main12.lib (
+  msg "%username%" "12bit build failed"
+  exit 1
+)
+cmake -G "Visual Studio 11 Win64" ../../../source -DEXTRA_LIB="x265-static-main10.lib;x265-static-main12.lib" -DLINKED_10BIT=ON -DLINKED_12BIT=ON
+if exist x265.sln (
+  MSBuild /property:Configuration="Release" x265.sln
+  :: combine static libraries (ignore warnings caused by winxp.cpp hacks)
+  move Release\x265-static.lib x265-static-main.lib
+  LIB.EXE /ignore:4006 /ignore:4221 /OUT:Release\x265-static.lib x265-static-main.lib x265-static-main10.lib x265-static-main12.lib
+)
+
+pause
x265_1.8.tar.gz/build/vc12-x86_64/multilib.bat
Added
@@ -0,0 +1,44 @@
+@echo off
+if "%VS120COMNTOOLS%" == "" (
+  msg "%username%" "Visual Studio 12 not detected"
+  exit 1
+)
+
+call "%VS120COMNTOOLS%\..\..\VC\vcvarsall.bat"
+
+@mkdir 12bit
+@mkdir 10bit
+@mkdir 8bit
+
+@cd 12bit
+cmake -G "Visual Studio 12 Win64" ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF -DMAIN12=ON
+if exist x265.sln (
+  MSBuild /property:Configuration="Release" x265.sln
+  copy/y Release\x265-static.lib ..\8bit\x265-static-main12.lib
+)
+
+@cd ..\10bit
+cmake -G "Visual Studio 12 Win64" ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF
+if exist x265.sln (
+  MSBuild /property:Configuration="Release" x265.sln
+  copy/y Release\x265-static.lib ..\8bit\x265-static-main10.lib
+)
+
+@cd ..\8bit
+if not exist x265-static-main10.lib (
+  msg "%username%" "10bit build failed"
+  exit 1
+)
+if not exist x265-static-main12.lib (
+  msg "%username%" "12bit build failed"
+  exit 1
+)
+cmake -G "Visual Studio 12 Win64" ../../../source -DEXTRA_LIB="x265-static-main10.lib;x265-static-main12.lib" -DLINKED_10BIT=ON -DLINKED_12BIT=ON
+if exist x265.sln (
+  MSBuild /property:Configuration="Release" x265.sln
+  :: combine static libraries (ignore warnings caused by winxp.cpp hacks)
+  move Release\x265-static.lib x265-static-main.lib
+  LIB.EXE /ignore:4006 /ignore:4221 /OUT:Release\x265-static.lib x265-static-main.lib x265-static-main10.lib x265-static-main12.lib
+)
+
+pause
x265_1.8.tar.gz/build/vc9-x86_64/multilib.bat
Added
@@ -0,0 +1,44 @@
+@echo off
+if "%VS90COMNTOOLS%" == "" (
+  msg "%username%" "Visual Studio 9 not detected"
+  exit 1
+)
+
+call "%VS90COMNTOOLS%\..\..\VC\vcvarsall.bat"
+
+@mkdir 12bit
+@mkdir 10bit
+@mkdir 8bit
+
+@cd 12bit
+cmake -G "Visual Studio 9 2008 Win64" ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF -DMAIN12=ON
+if exist x265.sln (
+  MSBuild /property:Configuration="Release" x265.sln
+  copy/y Release\x265-static.lib ..\8bit\x265-static-main12.lib
+)
+
+@cd ..\10bit
+cmake -G "Visual Studio 9 2008 Win64" ../../../source -DHIGH_BIT_DEPTH=ON -DEXPORT_C_API=OFF -DENABLE_SHARED=OFF -DENABLE_CLI=OFF
+if exist x265.sln (
+  MSBuild /property:Configuration="Release" x265.sln
+  copy/y Release\x265-static.lib ..\8bit\x265-static-main10.lib
+)
+
+@cd ..\8bit
+if not exist x265-static-main10.lib (
+  msg "%username%" "10bit build failed"
+  exit 1
+)
+if not exist x265-static-main12.lib (
+  msg "%username%" "12bit build failed"
+  exit 1
+)
+cmake -G "Visual Studio 9 2008 Win64" ../../../source -DEXTRA_LIB="x265-static-main10.lib;x265-static-main12.lib" -DLINKED_10BIT=ON -DLINKED_12BIT=ON
+if exist x265.sln (
+  MSBuild /property:Configuration="Release" x265.sln
+  :: combine static libraries (ignore warnings caused by winxp.cpp hacks)
+  move Release\x265-static.lib x265-static-main.lib
+  LIB.EXE /ignore:4006 /ignore:4221 /OUT:Release\x265-static.lib x265-static-main.lib x265-static-main10.lib x265-static-main12.lib
+)
+
+pause
x265_1.7.tar.gz/doc/reST/api.rst -> x265_1.8.tar.gz/doc/reST/api.rst
Changed
@@ -41,9 +41,9 @@
 x265 will accept input pixels of any depth between 8 and 16 bits
 regardless of the depth of its internal pixels (8 or 10). It will shift
 and mask input pixels as required to reach the internal depth. If
-downshifting is being performed using our CLI application, the
-:option:`--dither` option may be enabled to reduce banding. This feature
-is not available through the C interface.
+downshifting is being performed using our CLI application (to 8 bits),
+the :option:`--dither` option may be enabled to reduce banding. This
+feature is not available through the C interface.
 
 Encoder
 =======
@@ -159,7 +159,8 @@
    helps future-proof your code in many ways, but the x265 API is
    versioned in such a way that we prevent linkage against a build of
    x265 that does not match the version of the header you are compiling
-   against. This is function of the X265_BUILD macro.
+   against (unless you use x265_api_query() to acquire the library's
+   interfaces). This is function of the X265_BUILD macro.
 
 **x265_encoder_parameters()** may be used to get a copy of the param
 structure from the encoder after it has been opened, in order to see the
@@ -190,7 +191,7 @@
     * presets is not recommended without a more fine-grained breakdown of
     * parameters to take this into account. */
    int x265_encoder_reconfig(x265_encoder *, x265_param *);
-
+
 Pictures
 ========
@@ -320,7 +321,8 @@
    provided, the encoder will fill it with data pertaining to the
    output picture corresponding to the output NALs, including the
    recontructed image, POC and decode timestamp. These pictures will be
-   in encode (or decode) order.
+   in encode (or decode) order. The encoder will also write corresponding
+   frame encode statistics into **x265_frame_stats**.
 
 When the last of the raw input pictures has been sent to the encoder,
 **x265_encoder_encode()** must still be called repeatedly with a
@@ -338,15 +340,6 @@
 Cleanup
 =======
 
-At the end of the encode, the application will want to trigger logging
-of the final encode statistics, if :option:`--csv` had been specified::
-
-    /* x265_encoder_log:
-     *   write a line to the configured CSV file. If a CSV filename was not
-     *   configured, or file open failed, or the log level indicated frame level
-     *   logging, this function will perform no write. */
-    void x265_encoder_log(x265_encoder *encoder, int argc, char **argv);
-
 Finally, the encoder must be closed in order to free all of its
 resources. An encoder that has been flushed cannot be restarted and
 reused. Once **x265_encoder_close()** has been called, the encoder
@@ -370,52 +363,150 @@
 Multi-library Interface
 =======================
 
-If your application might want to make a runtime selection between
-a number of libx265 libraries (perhaps 8bpp and 16bpp), then you will
-want to use the multi-library interface.
-
-Instead of directly using all of the **x265_** methods documented
-above, you query an x265_api structure from your libx265 and then use
-the function pointers within that structure of the same name, but
-without the **x265_** prefix. So **x265_param_default()** becomes
-**api->param_default()**. The key method is x265_api_get()::
-
-    /* x265_api_get:
-     *   Retrieve the programming interface for a linked x265 library.
-     *   May return NULL if no library is available that supports the
-     *   requested bit depth. If bitDepth is 0, the function is guarunteed
-     *   to return a non-NULL x265_api pointer from the system default
-     *   libx265 */
-    const x265_api* x265_api_get(int bitDepth);
-
-Note that using this multi-library API in your application is only the
-first step.
-
-Your application must link to one build of libx265 (statically or
-dynamically) and this linked version of libx265 will support one
-bit-depth (8 or 10 bits).
-
-Your application must now request the API for the bitDepth you would
-prefer the encoder to use (8 or 10). If the requested bitdepth is zero,
-or if it matches the bitdepth of the system default libx265 (the
-currently linked library), then this library will be used for encode.
-If you request a different bit-depth, the linked libx265 will attempt
-to dynamically bind a shared library with a name appropriate for the
-requested bit-depth:
-
-    8-bit:  libx265_main.dll
-    10-bit: libx265_main10.dll
-
-    (the shared library extension is obviously platform specific. On
-    Linux it is .so while on Mac it is .dylib)
-
-For example on Windows, one could package together an x265.exe
-statically linked against the 8bpp libx265 together with a
-libx265_main10.dll in the same folder, and this executable would be able
-to encode main and main10 bitstreams.
-
-On Linux, x265 packagers could install 8bpp static and shared libraries
-under the name libx265 (so all applications link against 8bpp libx265)
-and then also install libx265_main10.so (symlinked to its numbered solib).
-Thus applications which use x265_api_get() will be able to generate main
-or main10 bitstreams.
+If your application might want to make a runtime bit-depth selection, it
+will need to use one of these bit-depth introspection interfaces which
+returns an API structure containing the public function entry points and
+constants.
+
+Instead of directly using all of the **x265_** methods documented above,
+you query an x265_api structure from your libx265 and then use the
+function pointers of the same name (minus the **x265_** prefix) within
+that structure. For instance **x265_param_default()** becomes
+**api->param_default()**.
+
+x265_api_get
+------------
+
+The first bit-depth instrospecton method is x265_api_get(). It designed
+for applications that might statically link with libx265, or will at
+least be tied to a particular SONAME or API version::
+
+    /* x265_api_get:
+     *   Retrieve the programming interface for a linked x265 library.
+     *   May return NULL if no library is available that supports the
+     *   requested bit depth. If bitDepth is 0, the function is guarunteed
+     *   to return a non-NULL x265_api pointer from the system default
+     *   libx265 */
+    const x265_api* x265_api_get(int bitDepth);
+
+Like **x265_encoder_encode()**, this function has the build number
+automatically appended to the function name via macros. This ties your
+application to a particular binary API version of libx265 (the one you
+compile against). If you attempt to link with a libx265 with a different
+API version number, the link will fail.
+
+Obviously this has no meaningful effect on applications which statically
+link to libx265.
+
+x265_api_query
+--------------
+
+The second bit-depth introspection method is designed for applications
+which need more flexibility in API versioning. If you use
+**x265_api_query()** and dynamically link to libx265 at runtime (using
+dlopen() on POSIX or LoadLibrary() on Windows) your application is no
+longer directly tied to the API version that it was compiled against::
+
+    /* x265_api_query:
+     *   Retrieve the programming interface for a linked x265 library, like
+     *   x265_api_get(), except this function accepts X265_BUILD as the second
+     *   argument rather than using the build number as part of the function name.
+     *   Applications which dynamically link to libx265 can use this interface to
+     *   query the library API and achieve a relative amount of version skew
+     *   flexibility. The function may return NULL if the library determines that
+     *   the apiVersion that your application was compiled against is not compatible
+     *   with the library you have linked with.
+     *
+     *   api_major_version will be incremented any time non-backward compatible
+     *   changes are made to any public structures or functions. If
+     *   api_major_version does not match X265_MAJOR_VERSION from the x265.h your
+     *   application compiled against, your application must not use the returned
+     *   x265_api pointer.
+     *
+     *   Users of this API *must* also validate the sizes of any structures which
+     *   are not treated as opaque in application code. For instance, if your
+     *   application dereferences a x265_param pointer, then it must check that
+     *   api->sizeof_param matches the sizeof(x265_param) that your application
+     *   compiled with. */
+    const x265_api* x265_api_query(int bitDepth, int apiVersion, int* err);
+
+A number of validations must be performed on the returned API structure
+in order to determine if it is safe for use by your application. If you
+do not perform these checks, your application is liable to crash::
+
+    if (api->api_major_version != X265_MAJOR_VERSION) /* do not use */
+    if (api->sizeof_param != sizeof(x265_param))      /* do not use */
+    if (api->sizeof_picture != sizeof(x265_picture))  /* do not use */
+    if (api->sizeof_stats != sizeof(x265_stats))      /* do not use */
+    if (api->sizeof_zone != sizeof(x265_zone))        /* do not use */
+    etc.
+
+Note that if your application does not directly allocate or dereference
+one of these structures, if it treats the structure as opaque or does
+not use it at all, then it can skip the size check for that structure.
+
+In particular, if your application uses api->param_alloc(),
+api->param_free(), api->param_parse(), etc and never directly accesses
+any x265_param fields, then it can skip the check on the
+sizeof(x265_parm) and thereby ignore changes to that structure (which
+account for a large percentage of X265_BUILD bumps).
+
+Build Implications
+------------------
+
+By default libx265 will place all of its internal C++ classes and
+functions within an x265 namespace and export all of the C functions
+documented in this file. Obviously this prevents 8bit and 10bit builds
+of libx265 from being statically linked into a single binary, all of
+those symbols would collide.
+
+However, if you set the EXPORT_C_API cmake option to OFF then libx265
+will use a bit-depth specific namespace and prefix for its assembly
+functions (x265_8bit, x265_10bit or x265_12bit) and export no C
+functions.
+
+In this way you can build one or more libx265 libraries without any
+exported C interface and link them into a libx265 build that does export
+a C interface. The build which exported the C functions becomes the
+*default* bit depth for the combined library, and the other bit depths
+are available via the bit-depth introspection methods.
+
+.. Note::
+
+    When setting EXPORT_C_API cmake option to OFF, it is recommended to
+    also set ENABLE_SHARED and ENABLE_CLI to OFF to prevent build
+    problems. We only need the static library from these builds.
+
+If an application requests a bit-depth that is not supported by the
+default library or any of the additionally linked libraries, the
+introspection method will fall-back to an attempt to dynamically bind a
+shared library with a name appropriate for the requested bit-depth::
+
+    8-bit:  libx265_main
+    10-bit: libx265_main10
+    12-bit: libx265_main12
+
+If the profile-named library is not found, it will then try to bind a
+generic libx265 in the hopes that it is a multilib library with all bit
+depths.
+
+Packaging and Distribution
+--------------------------
+
+We recommend that packagers distribute a single combined shared/static
+library build which includes all the bit depth libraries linked
+together. See the multilib scripts in our :file:`build/` subdirectories
+for examples of how to affect these combined library builds. It is the
+packager's discretion which bit-depth exports the public C functions and
+thus becomes the default bit-depth for the combined library.
+
+.. Note::
+
+    Windows packagers might want to build libx265 with WINXP_SUPPORT
+    enabled. This makes the resulting binaries functional on XP and
+    Vista. Without this flag, the minimum supported host O/S is Windows
+    7. Also note that binaries built with WINXP_SUPPORT will *not* have
+    NUMA support and they will have slightly less performance.
+
+    STATIC_LINK_CRT is also recommended so end-users will not need to
+    install any additional MSVC C runtime libraries.
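The validation rules quoted in the api.rst text above collapse naturally into one helper. Here is a hedged C sketch of the x265_api_query() handshake, using only the entry points, macros, and structure fields shown in that documentation; error handling is reduced to returning NULL:

    #include <stddef.h>
    #include "x265.h"

    /* Acquire a 10-bit API at runtime and validate it before any
     * structure is dereferenced, per the rules quoted above. */
    static const x265_api* acquire_main10_api(void)
    {
        int err = 0;
        const x265_api *api = x265_api_query(10, X265_BUILD, &err);
        if (!api)
            return NULL; /* no compatible 10-bit library found */
        if (api->api_major_version != X265_MAJOR_VERSION)
            return NULL; /* non-backward-compatible API change */
        if (api->sizeof_param != sizeof(x265_param))
            return NULL; /* x265_param layout mismatch */
        if (api->sizeof_picture != sizeof(x265_picture))
            return NULL; /* x265_picture layout mismatch */
        return api;      /* safe: use api->param_alloc() and friends */
    }

An application that treats x265_param as opaque, always going through api->param_parse() and related calls, may skip the sizeof_param check, as the documentation notes.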
x265_1.7.tar.gz/doc/reST/cli.rst -> x265_1.8.tar.gz/doc/reST/cli.rst
Changed
@@ -28,7 +28,7 @@
 Generally, when an option expects a string value from a list of
 strings the user may specify the integer ordinal of the value they
 desire. ie:
-:option:`--log-level` 4 is equivalent to :option:`--log-level` debug.
+:option:`--log-level` 3 is equivalent to :option:`--log-level` debug.
 
 Executable Options
 ==================
@@ -52,6 +52,7 @@
 	2. unable to open encoder
 	3. unable to generate stream headers
 	4. encoder abort
+	5. unable to open csv file
 
 Logging/Statistic Options
 =========================
@@ -67,9 +68,8 @@
 	0. error
 	1. warning
 	2. info **(default)**
-	3. frame
-	4. debug
-	5. full
+	3. debug
+	4. full
 
 .. option:: --no-progress
@@ -80,9 +80,9 @@
 .. option:: --csv <filename>
 
 	Writes encoding results to a comma separated value log file. Creates
-	the file if it doesnt already exist, else adds one line per run. if
-	:option:`--log-level` is frame or above, it writes one line per
-	frame. Default none
+	the file if it doesnt already exist. If :option:`--csv-log-level` is 0,
+	it adds one line per run. If :option:`--csv-log-level` is greater than
+	0, it writes one line per frame. Default none
 
 	When frame level logging is enabled, several frame performance
 	statistics are listed:
@@ -123,13 +123,17 @@
 	enough ahead for the necessary reference data to be available. This
 	is more of a problem for P frames where some blocks are much more
 	expensive than others.
+
+	**CLI ONLY**
 
+.. option:: --csv-log-level <integer>
-.. option:: --cu-stats, --no-cu-stats
+	CSV logging level. Default 0
+	0. summary
+	1. frame level logging
+	2. frame level logging with performance statistics
 
-	Records statistics on how each CU was coded (split depths and other
-	mode decisions) and reports those statistics at the end of the
-	encode. Default disabled
+	**CLI ONLY**
 
 .. option:: --ssim, --no-ssim
@@ -349,6 +353,13 @@
 
 	**CLI ONLY**
 
+.. option:: --total-frames <integer>
+
+	The number of frames intended to be encoded. It may be left
+	unspecified, but when it is specified rate control can make use of
+	this information. It is also used to determine if an encode is
+	actually a stillpicture profile encode (single frame)
+
 .. option:: --dither
 
 	Enable high quality downscaling. Dithering is based on the diffusion
@@ -384,7 +395,7 @@
 
 	**Range of values:** positive int or float, or num/denom
 
-.. option:: --interlaceMode <false|tff|bff>, --no-interlaceMode
+.. option:: --interlace <false|tff|bff>, --no-interlace
 
 	0. progressive pictures **(default)**
 	1. top field first
@@ -419,14 +430,18 @@
 
 	**CLI ONLY**
 
-.. option:: --output-depth, -D 8|10
+.. option:: --output-depth, -D 8|10|12
 
 	Bitdepth of output HEVC bitstream, which is also the internal bit
 	depth of the encoder. If the requested bit depth is not the bit
 	depth of the linked libx265, it will attempt to bind libx265_main
-	for an 8bit encoder, or libx265_main10 for a 10bit encoder, with the
+	for an 8bit encoder, libx265_main10 for a 10bit encoder, or
+	libx265_main12 for a 12bit encoder (EXPERIMENTAL), with the
 	same API version as the linked libx265.
 
+	If the output depth is not specified but :option:`--profile` is
+	specified, the output depth will be derived from the profile name.
+
 	**CLI ONLY**
 
 Profile, Level, Tier
@@ -439,15 +454,44 @@
 	profile. May abort the encode if the specified profile is
 	impossible to be supported by the compile options chosen for the
 	encoder (a high bit depth encoder will be unable to output
-	bitstreams compliant with Main or Mainstillpicture).
+	bitstreams compliant with Main or MainStillPicture).
+
+	The following profiles are supported in x265.
+
+	8bit profiles::
+
+		main, main-intra, mainstillpicture (or msp for short)
+		main444-8 main444-intra main444-stillpicture
+		See note below on signaling intra and stillpicture profiles.
+
+	10bit profiles::
+
+		main10, main10-intra
+		main422-10, main422-10-intra
+		main444-10, main444-10-intra
+
+	12bit profiles::
+
+		main12, main12-intra
+		main422-12, main422-12-intra
+		main444-12, main444-12-intra
+
+	**CLI ONLY**
 
-	API users must use x265_param_apply_profile() after configuring
+	API users must call x265_param_apply_profile() after configuring
 	their param structure. Any changes made to the param structure
 	after this call might make the encode non-compliant.
 
-	**Values:** main, main10, mainstillpicture, main422-8, main422-10, main444-8, main444-10
+	The CLI application will derive the output bit depth from the
+	profile name if :option:`--output-depth` is not specified.
 
-	**CLI ONLY**
+.. note::
+
+	All 12bit presets are extremely unstable, do not use them yet.
+	16bit is not supported at all, but those profiles are included
+	because it is possible for libx265 to make bitstreams compatible
+	with them.
@@ -479,6 +523,9 @@
 	specified level, main tier first, turning on high tier only if
 	necessary and available at that level.
 
+	If :option:`--level-idc` has not been specified, this argument is
+	ignored.
+
 .. option:: --ref <1..16>
 
 	Max number of L0 references to be allowed. This number has a linear
@@ -511,6 +558,7 @@
 	Default: disabled
 
 .. note::
+
 	:option:`--profile`, :option:`--level-idc`, and
 	:option:`--high-tier` are only intended for use when you are
 	targeting a particular decoder (or decoders) with fixed resource
@@ -519,6 +567,29 @@
 	parameters to meet those requirements but it will never raise
 	them. It may enable VBV constraints on a CRF encode.
 
+	Also note that x265 determines the decoder requirement profile and
+	level in three steps. First, the user configures an x265_param
+	structure with their suggested encoder options and then optionally
+	calls x265_param_apply_profile() to enforce a specific profile
+	(main, main10, etc). Second, an encoder is created from this
+	x265_param instance and the :option:`--level-idc` and
+	:option:`--high-tier` parameters are used to reduce bitrate or other
+	features in order to enforce the target level. Finally, the encoder
+	re-examines the final set of parameters and detects the actual
+	minimum decoder requirement level and this is what is signaled in
+	the bitstream headers. The detected decoder level will only use High
+	tier if the user specified a High tier level.
+
+	The signaled profile will be determined by the encoder's internal
+	bitdepth and input color space. If :option:`--keyint` is 0 or 1,
+	then an intra variant of the profile will be signaled.
+
+	If :option:`--total-frames` is 1, then a stillpicture variant will
+	be signaled, but this parameter is not always set by applications,
+	particularly not when the CLI uses stdin streaming or when libx265
+	is used by third-party applications.
+
 Mode decision / Analysis
 ========================
@@ -581,6 +652,33 @@
 	be consistent for all of them since the encoder configures several
 	key global data structures based on this range.
 
+.. option:: --limit-refs <0|1|2|3>
+
+	When set to X265_REF_LIMIT_DEPTH (1) x265 will limit the references
+	analyzed at the current depth based on the references used to code
+	the 4 sub-blocks at the next depth. For example, a 16x16 CU will
+	only use the references used to code its four 8x8 CUs.
+
+	When set to X265_REF_LIMIT_CU (2), the rectangular and asymmetrical
+	partitions will only use references selected by the 2Nx2N motion
+	search (including at the lowest depth which is otherwise unaffected
+	by the depth limit).
+
+	When set to 3 (X265_REF_LIMIT_DEPTH && X265_REF_LIMIT_CU), the 2Nx2N
+	motion search at each depth will only use references from the split
+	CUs and the rect/amp motion searches at that depth will only use the
+	reference(s) selected by 2Nx2N.
+
+	For all non-zero values of limit-refs, the current depth will evaluate
+	intra mode (in inter slices), only if intra mode was chosen as the best
+	mode for atleast one of the 4 sub-blocks.
+
+	You can often increase the number of references you are using
+	(within your decoder level limits) if you enable one or
+	both of these flags.
+
+	This feature is EXPERIMENTAL and functional at all RD levels.
+
 .. option:: --rect, --no-rect
 
 	Enable analysis of rectangular motion partitions Nx2N and 2NxN
@@ -874,7 +972,11 @@
 
 .. option:: --strong-intra-smoothing, --no-strong-intra-smoothing
 
-	Enable strong intra smoothing for 32x32 intra blocks. Default enabled
+	Enable strong intra smoothing for 32x32 intra blocks. This flag
+	performs bi-linear interpolation of the corner reference samples
+	for a strong smoothing effect. The purpose is to prevent blocking
+	or banding artifacts in regions with few/zero AC coefficients.
+	Default enabled
 
 .. option:: --constrained-intra, --no-constrained-intra
@@ -1133,7 +1235,7 @@
 	ignored. Slower presets will generally achieve better compression
 	efficiency (and generate smaller bitstreams). Default disabled.
 
-.. option:: --aq-mode <0|1|2>
+.. option:: --aq-mode <0|1|2|3>
 
 	Adaptive Quantization operating mode. Raise or lower per-block
 	quantization based on complexity analysis of the source image. The
@@ -1144,6 +1246,7 @@
 	0. disabled
 	1. AQ enabled **(default)**
 	2. AQ enabled with auto-variance
+	3. AQ enabled with auto-variance and bias to dark scenes
 
 .. option:: --aq-strength <float>
@@ -1410,13 +1513,19 @@
 	15. 3:2
 	16. 2:1
 
-.. option:: --crop-rect <left,top,right,bottom>
+.. option:: --display-window <left,top,right,bottom>
 
 	Define the (overscan) region of the image that does not contain
 	information because it was added to achieve certain resolution or
-	aspect ratio. The decoder may be directed to crop away this region
-	before displaying the images via the :option:`--overscan` option.
-	Default undefined (not signaled)
+	aspect ratio (the areas are typically black bars). The decoder may
+	be directed to crop away this region before displaying the images
+	via the :option:`--overscan` option. Default undefined (not
+	signaled).
+
+	Note that this has nothing to do with padding added internally by
+	the encoder to ensure the pictures size is a multiple of the minimum
+	coding unit (4x4). That padding is signaled in a separate
+	"conformance window" and is not user-configurable.
 
 .. option:: --overscan <show|crop>
@@ -1476,6 +1585,7 @@
 	15. bt2020-12
 	16. smpte-st-2084
 	17. smpte-st-428
+	18. arib-std-b67
 
 .. option:: --colormatrix <integer|string>
@@ -1509,7 +1619,7 @@
 	integers. The SEI includes X,Y display primaries for RGB channels,
 	white point X,Y and max,min luminance values. (HDR)
 
-	Example for P65D3 1000-nits:
+	Example for D65P3 1000-nits:
 
 	G(13200,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)
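Several of the options added in this release (--csv-log-level, --limit-refs, --aq-mode 3) are also reachable through the generic name/value API, which accepts the same names as the CLI options without the leading dashes. A short C sketch, assuming a default-initialized param; x265_param_parse() returns 0 on success:

    #include "x265.h"

    /* Enable some of the 1.8 additions documented above.
     * x265_param_parse() returns X265_PARAM_BAD_NAME or
     * X265_PARAM_BAD_VALUE when an option is rejected. */
    static int configure_new_options(x265_param *param)
    {
        int rc = 0;
        rc |= x265_param_parse(param, "limit-refs", "3"); /* EXPERIMENTAL */
        rc |= x265_param_parse(param, "aq-mode", "3");    /* dark-scene bias */
        rc |= x265_param_parse(param, "qg-size", "32");   /* the new default */
        return rc; /* non-zero if any option was rejected */
    }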
x265_1.7.tar.gz/doc/reST/presets.rst -> x265_1.8.tar.gz/doc/reST/presets.rst
Changed
@@ -114,12 +114,12 @@
 ~~~~~~~~~~~~~~~~~~~~
 
 :option:`--tune` *grain* tries to improve the retention of film grain in
-the reconstructed output. It helps rate distortion optimizations select
-modes which preserve high frequency noise:
+the reconstructed output. It disables rate distortion optimizations in
+quantization, and increases the default psy-rd.
 
     * :option:`--psy-rd` 0.5
-    * :option:`--rdoq-level` 1
-    * :option:`--psy-rdoq` 30
+    * :option:`--rdoq-level` 0
+    * :option:`--psy-rdoq` 0
 
 It lowers the strength of adaptive quantization, so residual energy can
 be more evenly distributed across the (noisy) picture:
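For API users, the preset/tune pair described here is applied with x265_param_default_preset(), which must run before any manual option overrides so that the tuned values can still be adjusted afterwards. A minimal C sketch; the "slow" preset is an arbitrary choice for illustration:

    #include "x265.h"

    /* Apply tune=grain on top of a preset; returns non-zero if either
     * name is unrecognized by the linked libx265. Any manual option
     * changes must come after this call or the preset overwrites them. */
    static int setup_grain_tune(x265_param *param)
    {
        return x265_param_default_preset(param, "slow", "grain");
    }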
x265_1.7.tar.gz/doc/reST/threading.rst -> x265_1.8.tar.gz/doc/reST/threading.rst
Changed
@@ -28,7 +28,7 @@
 providers are recommended to call this method when they make new jobs
 available.
 
-Worker jobs are not allowed to block except when abosultely necessary
+Worker jobs are not allowed to block except when absolutely necessary
 for data locking. If a job becomes blocked, the work function is
 expected to drop that job so the worker thread may go back to the pool
 and find more work.
@@ -94,10 +94,10 @@
 
 If a worker thread job has work which can be performed in parallel by
 many threads, it may allocate a bonded task group and enlist the help of
-other idle worker threads in the same pool. Those threads will cooperate
-to complete the work of the bonded task group and then return to their
-idle states. The larger and more uniform those tasks are, the better the
-bonded task group will perform.
+other idle worker threads from the same thread pool. Those threads will
+cooperate to complete the work of the bonded task group and then return
+to their idle states. The larger and more uniform those tasks are, the
+better the bonded task group will perform.
 
 Parallel Mode Analysis
 ~~~~~~~~~~~~~~~~~~~~~~
@@ -105,19 +105,20 @@
 When :option:`--pmode` is enabled, each CU (at all depths from 64x64 to
 8x8) will distribute its analysis work to the thread pool via a bonded
 task group. Each analysis job will measure the cost of one prediction
-for the CU: merge, skip, intra, inter (2Nx2N, Nx2N, 2NxN, and AMP). At
-slower presets, the amount of increased parallelism is often enough to
-be able to reduce frame parallelism while achieving the same overall CPU
-utilization. Reducing frame threads is often beneficial to ABR and VBV
-rate control.
+for the CU: merge, skip, intra, inter (2Nx2N, Nx2N, 2NxN, and AMP).
+
+At slower presets, the amount of increased parallelism from pmode is
+often enough to be able to reduce or disable frame parallelism while
+achieving the same overall CPU utilization. Reducing frame threads is
+often beneficial to ABR and VBV rate control.
 
 Parallel Motion Estimation
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 When :option:`--pme` is enabled all of the analysis functions which
 perform motion searches to reference frames will distribute those motion
-searches as jobs for worker threads via a bonded task group (if more
-than two motion searches are required).
+searches to other worker threads via a bonded task group (if more than
+two motion searches are required).
 
 Frame Threading
 ===============
@@ -241,7 +242,7 @@
 bonded task groups to measure single frame cost estimates using slices.
 (see :option:`--lookahead-slices`)
 
-The function slicetypeDecide() itself is also be performed by a worker
+The main slicetypeDecide() function itself is also performed by a worker
 thread if your encoder has a thread pool, else it runs within the
 context of the thread which calls the x265_encoder_encode().
x265_1.7.tar.gz/source/CMakeLists.txt -> x265_1.8.tar.gz/source/CMakeLists.txt
Changed
@@ -30,7 +30,7 @@
 mark_as_advanced(FPROFILE_USE FPROFILE_GENERATE NATIVE_BUILD)
 
 # X265_BUILD must be incremented each time the public API is changed
-set(X265_BUILD 59)
+set(X265_BUILD 68)
 configure_file("${PROJECT_SOURCE_DIR}/x265.def.in"
                "${PROJECT_BINARY_DIR}/x265.def")
 configure_file("${PROJECT_SOURCE_DIR}/x265_config.h.in"
@@ -42,6 +42,8 @@
 string(TOLOWER "${CMAKE_SYSTEM_PROCESSOR}" SYSPROC)
 set(X86_ALIASES x86 i386 i686 x86_64 amd64)
 list(FIND X86_ALIASES "${SYSPROC}" X86MATCH)
+set(POWER_ALIASES ppc64 ppc64le)
+list(FIND POWER_ALIASES "${SYSPROC}" POWERMATCH)
 if("${SYSPROC}" STREQUAL "" OR X86MATCH GREATER "-1")
     message(STATUS "Detected x86 target processor")
     set(X86 1)
@@ -50,6 +52,10 @@
         set(X64 1)
         add_definitions(-DX86_64=1)
     endif()
+elseif(POWERMATCH GREATER "-1")
+    message(STATUS "Detected POWER target processor")
+    set(POWER 1)
+    add_definitions(-DX265_ARCH_POWER=1)
 elseif(${SYSPROC} STREQUAL "armv6l")
     message(STATUS "Detected ARM target processor")
     set(ARM 1)
@@ -82,6 +88,10 @@
         endif()
     endif()
     mark_as_advanced(LIBRT NUMA_FOUND)
+    option(NO_ATOMICS "Use a slow mutex to replace atomics" OFF)
+    if(NO_ATOMICS)
+        add_definitions(-DNO_ATOMICS=1)
+    endif(NO_ATOMICS)
 endif(UNIX)
 
 if(X64 AND NOT WIN32)
@@ -260,6 +270,8 @@
         message(STATUS "Found Yasm ${YASM_VERSION_STRING} to build assembly primitives")
         option(ENABLE_ASSEMBLY "Enable use of assembly coded primitives" ON)
     endif()
+else()
+    option(ENABLE_ASSEMBLY "Enable use of assembly coded primitives" OFF)
 endif()
 
 option(CHECKED_BUILD "Enable run-time sanity checks (debugging)" OFF)
@@ -270,23 +282,59 @@
 # Build options
 set(LIB_INSTALL_DIR lib CACHE STRING "Install location of libraries")
 set(BIN_INSTALL_DIR bin CACHE STRING "Install location of executables")
+set(EXTRA_LIB "" CACHE STRING "Extra libraries to link against")
+set(EXTRA_LINK_FLAGS "" CACHE STRING "Extra link flags")
+if(EXTRA_LINK_FLAGS)
+    list(APPEND LINKER_OPTIONS ${EXTRA_LINK_FLAGS})
+endif()
+if(EXTRA_LIB)
+    option(LINKED_8BIT  "8bit libx265 is being linked with this library" OFF)
+    option(LINKED_10BIT "10bit libx265 is being linked with this library" OFF)
+    option(LINKED_12BIT "12bit libx265 is being linked with this library" OFF)
+endif(EXTRA_LIB)
+mark_as_advanced(EXTRA_LIB EXTRA_LINK_FLAGS)
 
 if(X64)
-    # NOTE: We only officially support 16bit-per-pixel compiles of x265
-    # on 64bit architectures. 16bpp plus large resolution plus slow
+    # NOTE: We only officially support high-bit-depth compiles of x265
+    # on 64bit architectures. Main10 plus large resolution plus slow
     # preset plus 32bit address space usually means malloc failure. You
     # can disable this if(X64) check if you desparately need a 32bit
     # build with 10bit/12bit support, but this violates the "shrink wrap
     # license" so to speak. If it breaks you get to keep both halves.
-    # You will likely need to compile without assembly
-    option(HIGH_BIT_DEPTH "Store pixels as 16bit values" OFF)
+    # You will need to disable assembly manually.
+    option(HIGH_BIT_DEPTH "Store pixel samples as 16bit values (Main10/Main12)" OFF)
 endif(X64)
 
 if(HIGH_BIT_DEPTH)
-    add_definitions(-DHIGH_BIT_DEPTH=1)
+    option(MAIN12 "Support Main12 instead of Main10" OFF)
+    if(MAIN12)
+        add_definitions(-DHIGH_BIT_DEPTH=1 -DX265_DEPTH=12)
+    else()
+        add_definitions(-DHIGH_BIT_DEPTH=1 -DX265_DEPTH=10)
+    endif()
 else(HIGH_BIT_DEPTH)
-    add_definitions(-DHIGH_BIT_DEPTH=0)
+    add_definitions(-DHIGH_BIT_DEPTH=0 -DX265_DEPTH=8)
 endif(HIGH_BIT_DEPTH)
 
+# this option can only be used when linking multiple libx265 libraries
+# together, and some alternate API access method is implemented.
+option(EXPORT_C_API "Implement public C programming interface" ON)
+mark_as_advanced(EXPORT_C_API)
+if(EXPORT_C_API)
+    set(X265_NS x265)
+    add_definitions(-DEXPORT_C_API=1)
+elseif(HIGH_BIT_DEPTH)
+    if(MAIN12)
+        set(X265_NS x265_12bit)
+    else()
+        set(X265_NS x265_10bit)
+    endif()
+    add_definitions(-DEXPORT_C_API=0)
+else()
+    set(X265_NS x265_8bit)
+    add_definitions(-DEXPORT_C_API=0)
+endif()
+add_definitions(-DX265_NS=${X265_NS})
+
 option(WARNINGS_AS_ERRORS "Stop compiles on first warning" OFF)
 if(WARNINGS_AS_ERRORS)
     if(GCC)
@@ -375,6 +423,9 @@
 if(NOT MSVC)
     set_target_properties(x265-static PROPERTIES OUTPUT_NAME x265)
 endif()
+if(EXTRA_LIB)
+    target_link_libraries(x265-static ${EXTRA_LIB})
+endif()
 install(TARGETS x265-static
     LIBRARY DESTINATION ${LIB_INSTALL_DIR}
     ARCHIVE DESTINATION ${LIB_INSTALL_DIR})
@@ -415,7 +466,7 @@
     if(APPLE)
         set_target_properties(x265-shared PROPERTIES MACOSX_RPATH 1)
     else()
-        set_target_properties(x265-shared PROPERTIES LINK_FLAGS "-Wl,-Bsymbolic,-znoexecstack")
+        list(APPEND LINKER_OPTIONS "-Wl,-Bsymbolic,-znoexecstack")
     endif()
 endif()
 set_target_properties(x265-shared PROPERTIES SOVERSION ${X265_BUILD})
@@ -429,6 +480,9 @@
             ARCHIVE DESTINATION ${LIB_INSTALL_DIR}
             RUNTIME DESTINATION ${BIN_INSTALL_DIR})
     endif()
+    if(EXTRA_LIB)
+        target_link_libraries(x265-shared ${EXTRA_LIB})
+    endif()
     if(LINKER_OPTIONS)
         # set_target_properties can't do list expansion
         string(REPLACE ";" " " LINKER_OPTION_STR "${LINKER_OPTIONS}")
@@ -468,16 +522,14 @@
 endif()
 
 # Main CLI application
-option(ENABLE_CLI "Build standalone CLI application" ON)
+set(ENABLE_CLI ON CACHE BOOL "Build standalone CLI application")
 if(ENABLE_CLI)
     file(GLOB InputFiles input/input.cpp input/yuv.cpp input/y4m.cpp input/*.h)
     file(GLOB OutputFiles output/output.cpp output/reconplay.cpp output/*.h
                           output/yuv.cpp output/y4m.cpp # recon
                           output/raw.cpp)               # muxers
-    file(GLOB FilterFiles filters/*.cpp filters/*.h)
     source_group(input FILES ${InputFiles})
     source_group(output FILES ${OutputFiles})
-    source_group(filters FILES ${FilterFiles})
 
     check_include_files(getopt.h HAVE_GETOPT_H)
     if(NOT HAVE_GETOPT_H)
@@ -487,13 +539,18 @@
         include_directories(compat/getopt)
         set(GETOPT compat/getopt/getopt.c compat/getopt/getopt.h)
     endif(NOT HAVE_GETOPT_H)
+    if(WIN32)
+        set(ExportDefs "${PROJECT_BINARY_DIR}/x265.def")
+    endif(WIN32)
 
     if(XCODE)
         # Xcode seems unable to link the CLI with libs, so link as one targget
-        add_executable(cli ../COPYING ${InputFiles} ${OutputFiles} ${FilterFiles} ${GETOPT} x265.cpp x265.h x265cli.h
-                       $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> ${YASM_OBJS} ${YASM_SRCS})
+        add_executable(cli ../COPYING ${InputFiles} ${OutputFiles} ${GETOPT}
+                       x265.cpp x265.h x265cli.h x265-extras.h x265-extras.cpp
+                       $<TARGET_OBJECTS:encoder> $<TARGET_OBJECTS:common> ${YASM_OBJS} ${YASM_SRCS})
     else()
-        add_executable(cli ../COPYING ${InputFiles} ${OutputFiles} ${FilterFiles} ${GETOPT} ${X265_RC_FILE} x265.cpp x265.h x265cli.h)
+        add_executable(cli ../COPYING ${InputFiles} ${OutputFiles} ${GETOPT} ${X265_RC_FILE}
+                       ${ExportDefs} x265.cpp x265.h x265cli.h x265-extras.h x265-extras.cpp)
         if(WIN32 OR NOT ENABLE_SHARED OR INTEL_CXX)
             # The CLI cannot link to the shared library on Windows, it
             # requires internal APIs not exported from the DLL
x265_1.7.tar.gz/source/cmake/CMakeASM_YASMInformation.cmake -> x265_1.8.tar.gz/source/cmake/CMakeASM_YASMInformation.cmake
Changed
@@ -31,9 +31,13 @@
 endif()
 
 if(HIGH_BIT_DEPTH)
-    list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=10)
+    if(MAIN12)
+        list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=12 -DX265_NS=${X265_NS})
+    else()
+        list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=1 -DBIT_DEPTH=10 -DX265_NS=${X265_NS})
+    endif()
 else()
-    list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=0 -DBIT_DEPTH=8)
+    list(APPEND ASM_FLAGS -DHIGH_BIT_DEPTH=0 -DBIT_DEPTH=8 -DX265_NS=${X265_NS})
 endif()
 
 list(APPEND ASM_FLAGS "${CMAKE_ASM_YASM_FLAGS}")
x265_1.7.tar.gz/source/cmake/FindYasm.cmake -> x265_1.8.tar.gz/source/cmake/FindYasm.cmake
Changed
@@ -2,7 +2,7 @@
 
 # Simple path search with YASM_ROOT environment variable override
 find_program(YASM_EXECUTABLE
-    NAMES yasm yasm-1.2.0-win32 yasm-1.2.0-win64
+    NAMES yasm yasm-1.2.0-win32 yasm-1.2.0-win64 yasm yasm-1.3.0-win32 yasm-1.3.0-win64
     HINTS $ENV{YASM_ROOT} ${YASM_ROOT}
     PATH_SUFFIXES bin
     )
x265_1.7.tar.gz/source/common/CMakeLists.txt -> x265_1.8.tar.gz/source/common/CMakeLists.txt
Changed
@@ -1,7 +1,21 @@
 # vim: syntax=cmake
+
+list(APPEND VFLAGS "-DX265_VERSION=${X265_VERSION}")
+if(EXTRA_LIB)
+    if(LINKED_8BIT)
+        list(APPEND VFLAGS "-DLINKED_8BIT=1")
+    endif(LINKED_8BIT)
+    if(LINKED_10BIT)
+        list(APPEND VFLAGS "-DLINKED_10BIT=1")
+    endif(LINKED_10BIT)
+    if(LINKED_12BIT)
+        list(APPEND VFLAGS "-DLINKED_12BIT=1")
+    endif(LINKED_12BIT)
+endif(EXTRA_LIB)
+
 if(ENABLE_ASSEMBLY)
     set_source_files_properties(threading.cpp primitives.cpp PROPERTIES COMPILE_FLAGS -DENABLE_ASSEMBLY=1)
+    list(APPEND VFLAGS "-DENABLE_ASSEMBLY=1")
 
     set(SSE3  vec/dct-sse3.cpp)
     set(SSSE3 vec/dct-ssse3.cpp)
@@ -46,7 +60,7 @@
                mc-a2.asm pixel-util8.asm blockcopy8.asm
                pixeladd8.asm dct8.asm)
     if(HIGH_BIT_DEPTH)
-        set(A_SRCS ${A_SRCS} sad16-a.asm intrapred16.asm ipfilter16.asm)
+        set(A_SRCS ${A_SRCS} sad16-a.asm intrapred16.asm ipfilter16.asm loopfilter.asm)
     else()
         set(A_SRCS ${A_SRCS} sad-a.asm intrapred8.asm intrapred8_allangs.asm ipfilter8.asm loopfilter.asm)
     endif()
@@ -69,6 +83,10 @@
     source_group(Assembly FILES ${ASM_PRIMITIVES})
 endif(ENABLE_ASSEMBLY)
 
+# set_target_properties can't do list expansion
+string(REPLACE ";" " " VERSION_FLAGS "${VFLAGS}")
+set_source_files_properties(version.cpp PROPERTIES COMPILE_FLAGS ${VERSION_FLAGS})
+
 check_symbol_exists(strtok_r "string.h" HAVE_STRTOK_R)
 if(HAVE_STRTOK_R)
     set_source_files_properties(param.cpp PROPERTIES COMPILE_FLAGS -DHAVE_STRTOK_R=1)
@@ -81,11 +99,8 @@
     set(WINXP winxp.h winxp.cpp)
 endif(WIN32)
 
-set_source_files_properties(version.cpp PROPERTIES COMPILE_FLAGS -DX265_VERSION=${X265_VERSION})
-
 add_library(common OBJECT
-    ${ASM_PRIMITIVES} ${VEC_PRIMITIVES}
-    ${LIBCOMMON_SRC} ${LIBCOMMON_HDR} ${WINXP}
+    ${ASM_PRIMITIVES} ${VEC_PRIMITIVES} ${WINXP}
     primitives.cpp primitives.h
     pixel.cpp dct.cpp ipfilter.cpp intrapred.cpp loopfilter.cpp
     constants.cpp constants.h
x265_1.7.tar.gz/source/common/bitstream.cpp -> x265_1.8.tar.gz/source/common/bitstream.cpp
Changed
@@ -1,7 +1,7 @@
 #include "common.h"
 #include "bitstream.h"
 
-using namespace x265;
+using namespace X265_NS;
 
 #if defined(_MSC_VER)
 #pragma warning(disable: 4244)
x265_1.7.tar.gz/source/common/bitstream.h -> x265_1.8.tar.gz/source/common/bitstream.h
Changed
@@ -24,7 +24,7 @@
 #ifndef X265_BITSTREAM_H
 #define X265_BITSTREAM_H 1
 
-namespace x265 {
+namespace X265_NS {
 // private namespace
 
 class BitInterface
x265_1.7.tar.gz/source/common/common.cpp -> x265_1.8.tar.gz/source/common/common.cpp
Changed
@@ -33,6 +33,8 @@
 #include <sys/time.h>
 #endif
 
+namespace X265_NS {
+
 #if CHECKED_BUILD || _DEBUG
 int g_checkFailures;
 #endif
@@ -50,8 +52,6 @@
 #endif
 }
 
-using namespace x265;
-
 #define X265_ALIGNBYTES 32
 
 #if _WIN32
@@ -215,3 +215,5 @@
     fclose(fh);
     return NULL;
 }
+
+}
x265_1.7.tar.gz/source/common/common.h -> x265_1.8.tar.gz/source/common/common.h
Changed
@@ -106,7 +106,7 @@
 /* If compiled with CHECKED_BUILD perform run-time checks and log any that
  * fail, both to stderr and to a file */
 #if CHECKED_BUILD || _DEBUG
-extern int g_checkFailures;
+namespace X265_NS { extern int g_checkFailures; }
 #define X265_CHECK(expr, ...) if (!(expr)) { \
     x265_log(NULL, X265_LOG_ERROR, __VA_ARGS__); \
     FILE *fp = fopen("x265_check_failures.txt", "a"); \
@@ -126,16 +126,20 @@
 typedef uint64_t sum2_t;
 typedef uint64_t pixel4;
 typedef int64_t  ssum2_t;
-#define X265_DEPTH 10          // compile time configurable bit depth
 #else
 typedef uint8_t  pixel;
 typedef uint16_t sum_t;
 typedef uint32_t sum2_t;
 typedef uint32_t pixel4;
-typedef int32_t  ssum2_t; //Signed sum
-#define X265_DEPTH 8           // compile time configurable bit depth
+typedef int32_t  ssum2_t; // Signed sum
 #endif // if HIGH_BIT_DEPTH
 
+#if X265_DEPTH <= 10
+typedef uint32_t sse_ret_t;
+#else
+typedef uint64_t sse_ret_t;
+#endif
+
 #ifndef NULL
 #define NULL 0
 #endif
@@ -313,7 +317,7 @@
 #define CHROMA_V_SHIFT(x) (x == X265_CSP_I420)
 #define X265_MAX_PRED_MODE_PER_CTU 85 * 2 * 8
 
-namespace x265 {
+namespace X265_NS {
 
 enum { SAO_NUM_OFFSET = 4 };
@@ -409,9 +413,7 @@
 /* located in pixel.cpp */
 void extendPicBorder(pixel* recon, intptr_t stride, int width, int height, int marginX, int marginY);
 
-}
-
-/* outside x265 namespace, but prefixed. defined in common.cpp */
+/* located in common.cpp */
 int64_t x265_mdate(void);
 #define x265_log(param, ...) general_log(param, "x265", __VA_ARGS__)
 void general_log(const x265_param* param, const char* caller, int level, const char* fmt, ...);
@@ -426,7 +428,10 @@
 void x265_free(void *ptr);
 char* x265_slurp_file(const char *filename);
 
-void x265_setup_primitives(x265_param* param, int cpu); /* primitives.cpp */
+/* located in primitives.cpp */
+void x265_setup_primitives(x265_param* param);
+void x265_report_simd(x265_param* param);
+}
 
 #include "constants.h"
x265_1.7.tar.gz/source/common/constants.cpp -> x265_1.8.tar.gz/source/common/constants.cpp
@@ -25,9 +25,50 @@
 #include "constants.h"
 #include "threading.h"

-namespace x265 {
+namespace X265_NS {
+
+#if X265_DEPTH == 12
+
+// lambda = pow(2, (double)q / 6 - 2) * (1 << (12 - 8));
+double x265_lambda_tab[QP_MAX_MAX + 1] =
+{
+    4.0000,    4.4898,    5.0397,    5.6569,    6.3496,
+    7.1272,    8.0000,    8.9797,    10.0794,   11.3137,
+    12.6992,   14.2544,   16.0000,   17.9594,   20.1587,
+    22.6274,   25.3984,   28.5088,   32.0000,   35.9188,
+    40.3175,   45.2548,   50.7968,   57.0175,   64.0000,
+    71.8376,   80.6349,   90.5097,   101.5937,  114.0350,
+    128.0000,  143.6751,  161.2699,  181.0193,  203.1873,
+    228.0701,  256.0000,  287.3503,  322.5398,  362.0387,
+    406.3747,  456.1401,  512.0000,  574.7006,  645.0796,
+    724.0773,  812.7493,  912.2803,  1024.0000, 1149.4011,
+    1290.1592, 1448.1547, 1625.4987, 1824.5606, 2048.0000,
+    2298.8023, 2580.3183, 2896.3094, 3250.9974, 3649.1211,
+    4096.0000, 4597.6045, 5160.6366, 5792.6188, 6501.9947,
+    7298.2423, 8192.0000, 9195.2091, 10321.2732, 11585.2375
+};
+
+// lambda2 = pow(lambda, 2) * scale (0.85);
+double x265_lambda2_tab[QP_MAX_MAX + 1] =
+{
+    13.6000,       17.1349,       21.5887,       27.2000,       34.2699,
+    43.1773,       54.4000,       68.5397,       86.3546,       108.8000,
+    137.0794,      172.7092,      217.6000,      274.1588,      345.4185,
+    435.2000,      548.3176,      690.8369,      870.4000,      1096.6353,
+    1381.6739,     1740.8000,     2193.2706,     2763.3478,     3481.6000,
+    4386.5411,     5526.6955,     6963.2000,     8773.0822,     11053.3910,
+    13926.4000,    17546.1645,    22106.7819,    27852.8000,    35092.3291,
+    44213.5641,    55705.6000,    70184.6579,    88427.1282,    111411.2000,
+    140369.3159,   176854.2563,   222822.4000,   280738.6324,   353708.5127,
+    445644.8001,   561477.2648,   707417.0237,   891289.6000,   1122954.5277,
+    1414834.0484,  1782579.2003,  2245909.0566,  2829668.0981,  3565158.4000,
+    4491818.1146,  5659336.1938,  7130316.8013,  8983636.2264,  11318672.3923,
+    14260633.6000, 17967272.4585, 22637344.7751, 28521267.1953, 35934544.9165,
+    45274689.5567, 57042534.4000, 71869089.8338, 90549379.1181, 114085068.8008
+};
+
+#elif X265_DEPTH == 10

-#if HIGH_BIT_DEPTH
 // lambda = pow(2, (double)q / 6 - 2) * (1 << (X265_DEPTH - 8));
 double x265_lambda_tab[QP_MAX_MAX + 1] =
 {
@@ -324,11 +365,12 @@
     4, 12, 20, 28, 5, 13, 21, 29, 6, 14, 22, 30, 7, 15, 23, 31,
     36, 44, 52, 60, 37, 45, 53, 61, 38, 46, 54, 62, 39, 47, 55, 63 }
 };

-ALIGN_VAR_16(const uint16_t, g_scan4x4[NUM_SCAN_TYPE][4 * 4]) =
+ALIGN_VAR_16(const uint16_t, g_scan4x4[NUM_SCAN_TYPE + 1][4 * 4]) =
 {
     { 0, 4, 1, 8, 5, 2, 12, 9, 6, 3, 13, 10, 7, 14, 11, 15 },
     { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
-    { 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15 }
+    { 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15 },
+    { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }
 };

 const uint16_t g_scan16x16[16 * 16] =
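Editor's note: the new Main12 table follows the formula quoted in its comment, lambda(q) = 2^(q/6 - 2) * 2^(12-8). For q = 0 this gives 0.25 * 16 = 4.0 and for q = 1 about 4.4898, matching the first entries. A standalone sketch that regenerates the table for comparison (illustrative, not x265 code):

    // Regenerate the 12-bit lambda table from its formula (sanity check):
    #include <cmath>
    #include <cstdio>
    int main()
    {
        for (int q = 0; q <= 69; q++)  // QP_MAX_MAX == 69 implied by the 70-entry table
            printf("%10.4f%s", pow(2.0, q / 6.0 - 2.0) * 16.0, (q % 5 == 4) ? "\n" : ", ");
        return 0;
    }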
x265_1.7.tar.gz/source/common/constants.h -> x265_1.8.tar.gz/source/common/constants.h
@@ -26,7 +26,7 @@
 #include "common.h"

-namespace x265 {
+namespace X265_NS {
 // private namespace

 extern int g_ctuSizeConfigured;
@@ -83,7 +83,7 @@
 extern const uint16_t* const g_scanOrder[NUM_SCAN_TYPE][NUM_SCAN_SIZE];
 extern const uint16_t* const g_scanOrderCG[NUM_SCAN_TYPE][NUM_SCAN_SIZE];
 extern const uint16_t g_scan8x8diag[8 * 8];
-extern const uint16_t g_scan4x4[NUM_SCAN_TYPE][4 * 4];
+extern const uint16_t g_scan4x4[NUM_SCAN_TYPE + 1][4 * 4]; // +1 for safe buffer area for codeCoeffNxN assembly optimize, there have up to 15 bytes beyond bound read

 extern const uint8_t g_lastCoeffTable[32];
 extern const uint8_t g_goRiceRange[5]; // maximum value coded with Rice codes
x265_1.7.tar.gz/source/common/contexts.h -> x265_1.8.tar.gz/source/common/contexts.h
@@ -102,11 +102,12 @@
 #define OFF_TQUANT_BYPASS_FLAG_CTX (OFF_TRANSFORMSKIP_FLAG_CTX + 2 * NUM_TRANSFORMSKIP_FLAG_CTX)
 #define MAX_OFF_CTX_MOD (OFF_TQUANT_BYPASS_FLAG_CTX + NUM_TQUANT_BYPASS_FLAG_CTX)

-namespace x265 {
+extern "C" const uint32_t PFX(entropyStateBits)[128];
+
+namespace X265_NS {
 // private namespace

 extern const uint32_t g_entropyBits[128];
-extern const uint32_t g_entropyStateBits[128];
 extern const uint8_t g_nextState[128][2];

 #define sbacGetMps(S) ((S) & 1)
x265_1.7.tar.gz/source/common/cpu.cpp -> x265_1.8.tar.gz/source/common/cpu.cpp
@@ -57,7 +57,7 @@
 #endif // if X265_ARCH_ARM

-namespace x265 {
+namespace X265_NS {
 const cpu_name_t cpu_names[] =
 {
 #if X265_ARCH_X86
@@ -107,9 +107,9 @@
 extern "C" {
 /* cpu-a.asm */
-int x265_cpu_cpuid_test(void);
-void x265_cpu_cpuid(uint32_t op, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx);
-void x265_cpu_xgetbv(uint32_t op, uint32_t *eax, uint32_t *edx);
+int PFX(cpu_cpuid_test)(void);
+void PFX(cpu_cpuid)(uint32_t op, uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx);
+void PFX(cpu_xgetbv)(uint32_t op, uint32_t *eax, uint32_t *edx);
 }

 #if defined(_MSC_VER)
@@ -125,16 +125,16 @@
     uint32_t max_extended_cap, max_basic_cap;

 #if !X86_64
-    if (!x265_cpu_cpuid_test())
+    if (!PFX(cpu_cpuid_test)())
         return 0;
 #endif

-    x265_cpu_cpuid(0, &eax, vendor + 0, vendor + 2, vendor + 1);
+    PFX(cpu_cpuid)(0, &eax, vendor + 0, vendor + 2, vendor + 1);
     max_basic_cap = eax;
     if (max_basic_cap == 0)
         return 0;

-    x265_cpu_cpuid(1, &eax, &ebx, &ecx, &edx);
+    PFX(cpu_cpuid)(1, &eax, &ebx, &ecx, &edx);
     if (edx & 0x00800000)
         cpu |= X265_CPU_MMX;
     else
@@ -159,7 +159,7 @@
     if ((ecx & 0x18000000) == 0x18000000)
     {
         /* Check for OS support */
-        x265_cpu_xgetbv(0, &eax, &edx);
+        PFX(cpu_xgetbv)(0, &eax, &edx);
         if ((eax & 0x6) == 0x6)
         {
             cpu |= X265_CPU_AVX;
@@ -170,7 +170,7 @@
     if (max_basic_cap >= 7)
     {
-        x265_cpu_cpuid(7, &eax, &ebx, &ecx, &edx);
+        PFX(cpu_cpuid)(7, &eax, &ebx, &ecx, &edx);
         /* AVX2 requires OS support, but BMI1/2 don't. */
         if ((cpu & X265_CPU_AVX) && (ebx & 0x00000020))
             cpu |= X265_CPU_AVX2;
@@ -185,12 +185,12 @@
     if (cpu & X265_CPU_SSSE3)
         cpu |= X265_CPU_SSE2_IS_FAST;

-    x265_cpu_cpuid(0x80000000, &eax, &ebx, &ecx, &edx);
+    PFX(cpu_cpuid)(0x80000000, &eax, &ebx, &ecx, &edx);
     max_extended_cap = eax;

     if (max_extended_cap >= 0x80000001)
     {
-        x265_cpu_cpuid(0x80000001, &eax, &ebx, &ecx, &edx);
+        PFX(cpu_cpuid)(0x80000001, &eax, &ebx, &ecx, &edx);

         if (ecx & 0x00000020)
             cpu |= X265_CPU_LZCNT; /* Supported by Intel chips starting with Haswell */
@@ -233,7 +233,7 @@
     if (!strcmp((char*)vendor, "GenuineIntel"))
     {
-        x265_cpu_cpuid(1, &eax, &ebx, &ecx, &edx);
+        PFX(cpu_cpuid)(1, &eax, &ebx, &ecx, &edx);
         int family = ((eax >> 8) & 0xf) + ((eax >> 20) & 0xff);
         int model  = ((eax >> 4) & 0xf) + ((eax >> 12) & 0xf0);
         if (family == 6)
@@ -264,11 +264,11 @@
     if ((!strcmp((char*)vendor, "GenuineIntel") || !strcmp((char*)vendor, "CyrixInstead")) && !(cpu & X265_CPU_SSE42))
     {
         /* cacheline size is specified in 3 places, any of which may be missing */
-        x265_cpu_cpuid(1, &eax, &ebx, &ecx, &edx);
+        PFX(cpu_cpuid)(1, &eax, &ebx, &ecx, &edx);
         int cache = (ebx & 0xff00) >> 5; // cflush size
         if (!cache && max_extended_cap >= 0x80000006)
         {
-            x265_cpu_cpuid(0x80000006, &eax, &ebx, &ecx, &edx);
+            PFX(cpu_cpuid)(0x80000006, &eax, &ebx, &ecx, &edx);
             cache = ecx & 0xff; // cacheline size
         }
         if (!cache && max_basic_cap >= 2)
@@ -281,7 +281,7 @@
             int max, i = 0;
             do
             {
-                x265_cpu_cpuid(2, buf + 0, buf + 1, buf + 2, buf + 3);
+                PFX(cpu_cpuid)(2, buf + 0, buf + 1, buf + 2, buf + 3);
                 max = buf[0] & 0xff;
                 buf[0] &= ~0xff;
                 for (int j = 0; j < 4; j++)
@@ -318,8 +318,8 @@
 #elif X265_ARCH_ARM

 extern "C" {
-void x265_cpu_neon_test(void);
-int x265_cpu_fast_neon_mrc_test(void);
+void PFX(cpu_neon_test)(void);
+int PFX(cpu_fast_neon_mrc_test)(void);
 }

 uint32_t cpu_detect(void)
@@ -340,7 +340,7 @@
     }

     canjump = 1;
-    x265_cpu_neon_test();
+    PFX(cpu_neon_test)();
     canjump = 0;
     signal(SIGILL, oldsig);
 #endif // if !HAVE_NEON
@@ -356,7 +356,7 @@
     // which may result in incorrect detection and the counters stuck enabled.
     // right now Apple does not seem to support performance counters for this test
 #ifndef __MACH__
-    flags |= x265_cpu_fast_neon_mrc_test() ? X265_CPU_FAST_NEON_MRC : 0;
+    flags |= PFX(cpu_fast_neon_mrc_test)() ? X265_CPU_FAST_NEON_MRC : 0;
 #endif
     // TODO: write dual issue test? currently it's A8 (dual issue) vs. A9 (fast mrc)
 #endif // if HAVE_ARMV6
x265_1.7.tar.gz/source/common/cpu.h -> x265_1.8.tar.gz/source/common/cpu.h
@@ -27,24 +27,29 @@
 #include "common.h"

+/* All assembly functions are prefixed with X265_NS (macro expanded) */
+#define PFX3(prefix, name) prefix ## _ ## name
+#define PFX2(prefix, name) PFX3(prefix, name)
+#define PFX(name)          PFX2(X265_NS, name)
+
 // from cpu-a.asm, if ASM primitives are compiled, else primitives.cpp
-extern "C" void x265_cpu_emms(void);
-extern "C" void x265_safe_intel_cpu_indicator_init(void);
+extern "C" void PFX(cpu_emms)(void);
+extern "C" void PFX(safe_intel_cpu_indicator_init)(void);

 #if _MSC_VER && _WIN64
-#define x265_emms() x265_cpu_emms()
+#define x265_emms() PFX(cpu_emms)()
 #elif _MSC_VER
 #include <mmintrin.h>
 #define x265_emms() _mm_empty()
 #elif __GNUC__
 // Cannot use _mm_empty() directly without compiling all the source with
 // a fixed CPU arch, which we would like to avoid at the moment
-#define x265_emms() x265_cpu_emms()
+#define x265_emms() PFX(cpu_emms)()
 #else
-#define x265_emms() x265_cpu_emms()
+#define x265_emms() PFX(cpu_emms)()
 #endif

-namespace x265 {
+namespace X265_NS {

 uint32_t cpu_detect(void);

 struct cpu_name_t
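Editor's note: the two-level indirection in PFX is deliberate. Token pasting (##) happens before argument expansion, so pasting X265_NS directly would produce the literal symbol X265_NS_name. A minimal sketch, assuming the multilib build defines X265_NS to some per-depth name such as x265_10b (that exact name is an assumption, chosen here for illustration):

    // Compile with -DX265_NS=x265_10b to see the expansion:
    #define PFX3(prefix, name) prefix ## _ ## name   // pastes two already-expanded tokens
    #define PFX2(prefix, name) PFX3(prefix, name)    // forces expansion of prefix first
    #define PFX(name)          PFX2(X265_NS, name)

    // PFX(cpu_emms) expands to x265_10b_cpu_emms.
    // A single-level "#define PFX(name) X265_NS ## _ ## name" would instead
    // emit the literal token X265_NS_cpu_emms, because ## suppresses expansion.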
x265_1.7.tar.gz/source/common/cudata.cpp -> x265_1.8.tar.gz/source/common/cudata.cpp
@@ -28,33 +28,33 @@
 #include "mv.h"
 #include "cudata.h"

-using namespace x265;
-
-namespace {
-// file private namespace
+using namespace X265_NS;

 /* for all bcast* and copy* functions, dst and src are aligned to MIN(size, 32) */

-void bcast1(uint8_t* dst, uint8_t val)  { dst[0] = val; }
+static void bcast1(uint8_t* dst, uint8_t val)  { dst[0] = val; }

-void copy4(uint8_t* dst, uint8_t* src)  { ((uint32_t*)dst)[0] = ((uint32_t*)src)[0]; }
-void bcast4(uint8_t* dst, uint8_t val)  { ((uint32_t*)dst)[0] = 0x01010101u * val; }
+static void copy4(uint8_t* dst, uint8_t* src)  { ((uint32_t*)dst)[0] = ((uint32_t*)src)[0]; }
+static void bcast4(uint8_t* dst, uint8_t val)  { ((uint32_t*)dst)[0] = 0x01010101u * val; }

-void copy16(uint8_t* dst, uint8_t* src) { ((uint64_t*)dst)[0] = ((uint64_t*)src)[0]; ((uint64_t*)dst)[1] = ((uint64_t*)src)[1]; }
-void bcast16(uint8_t* dst, uint8_t val) { uint64_t bval = 0x0101010101010101ULL * val; ((uint64_t*)dst)[0] = bval; ((uint64_t*)dst)[1] = bval; }
+static void copy16(uint8_t* dst, uint8_t* src) { ((uint64_t*)dst)[0] = ((uint64_t*)src)[0]; ((uint64_t*)dst)[1] = ((uint64_t*)src)[1]; }
+static void bcast16(uint8_t* dst, uint8_t val) { uint64_t bval = 0x0101010101010101ULL * val; ((uint64_t*)dst)[0] = bval; ((uint64_t*)dst)[1] = bval; }

-void copy64(uint8_t* dst, uint8_t* src) { ((uint64_t*)dst)[0] = ((uint64_t*)src)[0]; ((uint64_t*)dst)[1] = ((uint64_t*)src)[1];
-                                          ((uint64_t*)dst)[2] = ((uint64_t*)src)[2]; ((uint64_t*)dst)[3] = ((uint64_t*)src)[3];
-                                          ((uint64_t*)dst)[4] = ((uint64_t*)src)[4]; ((uint64_t*)dst)[5] = ((uint64_t*)src)[5];
-                                          ((uint64_t*)dst)[6] = ((uint64_t*)src)[6]; ((uint64_t*)dst)[7] = ((uint64_t*)src)[7]; }
-void bcast64(uint8_t* dst, uint8_t val) { uint64_t bval = 0x0101010101010101ULL * val;
-                                          ((uint64_t*)dst)[0] = bval; ((uint64_t*)dst)[1] = bval; ((uint64_t*)dst)[2] = bval; ((uint64_t*)dst)[3] = bval;
-                                          ((uint64_t*)dst)[4] = bval; ((uint64_t*)dst)[5] = bval; ((uint64_t*)dst)[6] = bval; ((uint64_t*)dst)[7] = bval; }
+static void copy64(uint8_t* dst, uint8_t* src) { ((uint64_t*)dst)[0] = ((uint64_t*)src)[0]; ((uint64_t*)dst)[1] = ((uint64_t*)src)[1];
+                                                 ((uint64_t*)dst)[2] = ((uint64_t*)src)[2]; ((uint64_t*)dst)[3] = ((uint64_t*)src)[3];
+                                                 ((uint64_t*)dst)[4] = ((uint64_t*)src)[4]; ((uint64_t*)dst)[5] = ((uint64_t*)src)[5];
+                                                 ((uint64_t*)dst)[6] = ((uint64_t*)src)[6]; ((uint64_t*)dst)[7] = ((uint64_t*)src)[7]; }
+static void bcast64(uint8_t* dst, uint8_t val) { uint64_t bval = 0x0101010101010101ULL * val;
+                                                 ((uint64_t*)dst)[0] = bval; ((uint64_t*)dst)[1] = bval; ((uint64_t*)dst)[2] = bval; ((uint64_t*)dst)[3] = bval;
+                                                 ((uint64_t*)dst)[4] = bval; ((uint64_t*)dst)[5] = bval; ((uint64_t*)dst)[6] = bval; ((uint64_t*)dst)[7] = bval; }

 /* at 256 bytes, memset/memcpy will probably use SIMD more effectively than our uint64_t hack,
  * but hand-written assembly would beat it. */
-void copy256(uint8_t* dst, uint8_t* src) { memcpy(dst, src, 256); }
-void bcast256(uint8_t* dst, uint8_t val) { memset(dst, val, 256); }
+static void copy256(uint8_t* dst, uint8_t* src) { memcpy(dst, src, 256); }
+static void bcast256(uint8_t* dst, uint8_t val) { memset(dst, val, 256); }
+
+namespace {
+// file private namespace

 /* Check whether 2 addresses point to the same column */
 inline bool isEqualCol(int addrA, int addrB, int numUnits)
@@ -112,38 +112,6 @@
     return MV((int16_t)mvx, (int16_t)mvy);
 }

-// Partition table.
-// First index is partitioning mode. Second index is partition index.
-// Third index is 0 for partition sizes, 1 for partition offsets. The
-// sizes and offsets are encoded as two packed 4-bit values (X,Y).
-// X and Y represent 1/4 fractions of the block size.
-const uint32_t partTable[8][4][2] =
-{
-    //        XY
-    { { 0x44, 0x00 }, { 0x00, 0x00 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2Nx2N.
-    { { 0x42, 0x00 }, { 0x42, 0x02 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxN.
-    { { 0x24, 0x00 }, { 0x24, 0x20 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_Nx2N.
-    { { 0x22, 0x00 }, { 0x22, 0x20 }, { 0x22, 0x02 }, { 0x22, 0x22 } }, // SIZE_NxN.
-    { { 0x41, 0x00 }, { 0x43, 0x01 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxnU.
-    { { 0x43, 0x00 }, { 0x41, 0x03 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxnD.
-    { { 0x14, 0x00 }, { 0x34, 0x10 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_nLx2N.
-    { { 0x34, 0x00 }, { 0x14, 0x30 }, { 0x00, 0x00 }, { 0x00, 0x00 } }  // SIZE_nRx2N.
-};
-
-// Partition Address table.
-// First index is partitioning mode. Second index is partition address.
-const uint32_t partAddrTable[8][4] =
-{
-    { 0x00, 0x00, 0x00, 0x00 }, // SIZE_2Nx2N.
-    { 0x00, 0x08, 0x08, 0x08 }, // SIZE_2NxN.
-    { 0x00, 0x04, 0x04, 0x04 }, // SIZE_Nx2N.
-    { 0x00, 0x04, 0x08, 0x0C }, // SIZE_NxN.
-    { 0x00, 0x02, 0x02, 0x02 }, // SIZE_2NxnU.
-    { 0x00, 0x0A, 0x0A, 0x0A }, // SIZE_2NxnD.
-    { 0x00, 0x01, 0x01, 0x01 }, // SIZE_nLx2N.
-    { 0x00, 0x05, 0x05, 0x05 }  // SIZE_nRx2N.
-};
-
 }

 cubcast_t CUData::s_partSet[NUM_FULL_DEPTH] = { NULL, NULL, NULL, NULL, NULL };
x265_1.7.tar.gz/source/common/cudata.h -> x265_1.8.tar.gz/source/common/cudata.h
@@ -28,7 +28,7 @@
 #include "slice.h"
 #include "mv.h"

-namespace x265 {
+namespace X265_NS {
 // private namespace

 class FrameData;
@@ -121,6 +121,38 @@
 // Partition count table, index represents partitioning mode.
 const uint32_t nbPartsTable[8] = { 1, 2, 2, 4, 2, 2, 2, 2 };

+// Partition table.
+// First index is partitioning mode. Second index is partition index.
+// Third index is 0 for partition sizes, 1 for partition offsets. The
+// sizes and offsets are encoded as two packed 4-bit values (X,Y).
+// X and Y represent 1/4 fractions of the block size.
+const uint32_t partTable[8][4][2] =
+{
+    //        XY
+    { { 0x44, 0x00 }, { 0x00, 0x00 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2Nx2N.
+    { { 0x42, 0x00 }, { 0x42, 0x02 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxN.
+    { { 0x24, 0x00 }, { 0x24, 0x20 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_Nx2N.
+    { { 0x22, 0x00 }, { 0x22, 0x20 }, { 0x22, 0x02 }, { 0x22, 0x22 } }, // SIZE_NxN.
+    { { 0x41, 0x00 }, { 0x43, 0x01 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxnU.
+    { { 0x43, 0x00 }, { 0x41, 0x03 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_2NxnD.
+    { { 0x14, 0x00 }, { 0x34, 0x10 }, { 0x00, 0x00 }, { 0x00, 0x00 } }, // SIZE_nLx2N.
+    { { 0x34, 0x00 }, { 0x14, 0x30 }, { 0x00, 0x00 }, { 0x00, 0x00 } }  // SIZE_nRx2N.
+};
+
+// Partition Address table.
+// First index is partitioning mode. Second index is partition address.
+const uint32_t partAddrTable[8][4] =
+{
+    { 0x00, 0x00, 0x00, 0x00 }, // SIZE_2Nx2N.
+    { 0x00, 0x08, 0x08, 0x08 }, // SIZE_2NxN.
+    { 0x00, 0x04, 0x04, 0x04 }, // SIZE_Nx2N.
+    { 0x00, 0x04, 0x08, 0x0C }, // SIZE_NxN.
+    { 0x00, 0x02, 0x02, 0x02 }, // SIZE_2NxnU.
+    { 0x00, 0x0A, 0x0A, 0x0A }, // SIZE_2NxnD.
+    { 0x00, 0x01, 0x01, 0x01 }, // SIZE_nLx2N.
+    { 0x00, 0x05, 0x05, 0x05 }  // SIZE_nRx2N.
+};
+
 // Holds part data for a CU of a given size, from an 8x8 CU to a CTU
 class CUData
 {
@@ -222,8 +254,11 @@
     void     getNeighbourMV(uint32_t puIdx, uint32_t absPartIdx, InterNeighbourMV* neighbours) const;
     void     getIntraTUQtDepthRange(uint32_t tuDepthRange[2], uint32_t absPartIdx) const;
     void     getInterTUQtDepthRange(uint32_t tuDepthRange[2], uint32_t absPartIdx) const;
+    uint32_t getBestRefIdx(uint32_t subPartIdx) const { return ((m_interDir[subPartIdx] & 1) << m_refIdx[0][subPartIdx]) |
+                                                              (((m_interDir[subPartIdx] >> 1) & 1) << (m_refIdx[1][subPartIdx] + 16)); }
+    uint32_t getPUOffset(uint32_t puIdx, uint32_t absPartIdx) const { return (partAddrTable[(int)m_partSize[absPartIdx]][puIdx] << (g_unitSizeDepth - m_cuDepth[absPartIdx]) * 2) >> 4; }

-    uint32_t getNumPartInter() const { return nbPartsTable[(int)m_partSize[0]]; }
+    uint32_t getNumPartInter(uint32_t absPartIdx) const { return nbPartsTable[(int)m_partSize[absPartIdx]]; }
     bool     isIntra(uint32_t absPartIdx) const { return m_predMode[absPartIdx] == MODE_INTRA; }
     bool     isInter(uint32_t absPartIdx) const { return !!(m_predMode[absPartIdx] & MODE_INTER); }
     bool     isSkipped(uint32_t absPartIdx) const { return m_predMode[absPartIdx] == MODE_SKIP; }
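Editor's note: the packed (X,Y) nibbles in partTable decode into part widths and heights as quarter-fractions of the CU size, per the table's own comment. A small sketch (decodePart is a hypothetical helper, not in x265):

    // Decode a partTable size/offset entry: high nibble = X, low nibble = Y,
    // each counted in quarters of the CU dimension.
    static void decodePart(uint32_t code, int cuSize, int& x, int& y)
    {
        x = (int)((code >> 4) & 0xF) * cuSize / 4;
        y = (int)(code & 0xF) * cuSize / 4;
    }
    // For a 32x32 CU, SIZE_2NxN part 0 has size code 0x42 -> 32x16 (full
    // width, half height), and part 1 has offset code 0x02 -> (0, 16),
    // i.e. the lower half of the CU.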
x265_1.7.tar.gz/source/common/dct.cpp -> x265_1.8.tar.gz/source/common/dct.cpp
@@ -29,19 +29,18 @@
 #include "common.h"
 #include "primitives.h"
+#include "contexts.h"   // costCoeffNxN_c
+#include "threading.h"  // CLZ

-using namespace x265;
+using namespace X265_NS;

 #if _MSC_VER
 #pragma warning(disable: 4127) // conditional expression is constant, typical for templated functions
 #endif

-namespace {
-// anonymous file-static namespace
-
 // Fast DST Algorithm. Full matrix multiplication for DST and Fast DST algorithm
 // give identical results
-void fastForwardDst(const int16_t* block, int16_t* coeff, int shift)  // input block, output coeff
+static void fastForwardDst(const int16_t* block, int16_t* coeff, int shift)  // input block, output coeff
 {
     int c[4];
     int rnd_factor = 1 << (shift - 1);
@@ -61,7 +60,7 @@
     }
 }

-void inversedst(const int16_t* tmp, int16_t* block, int shift)  // input tmp, output block
+static void inversedst(const int16_t* tmp, int16_t* block, int shift)  // input tmp, output block
 {
     int i, c[4];
     int rnd_factor = 1 << (shift - 1);
@@ -81,7 +80,7 @@
     }
 }

-void partialButterfly16(const int16_t* src, int16_t* dst, int shift, int line)
+static void partialButterfly16(const int16_t* src, int16_t* dst, int shift, int line)
 {
     int j, k;
     int E[8], O[8];
@@ -134,7 +133,7 @@
     }
 }

-void partialButterfly32(const int16_t* src, int16_t* dst, int shift, int line)
+static void partialButterfly32(const int16_t* src, int16_t* dst, int shift, int line)
 {
     int j, k;
     int E[16], O[16];
@@ -203,7 +202,7 @@
     }
 }

-void partialButterfly8(const int16_t* src, int16_t* dst, int shift, int line)
+static void partialButterfly8(const int16_t* src, int16_t* dst, int shift, int line)
 {
     int j, k;
     int E[4], O[4];
@@ -240,7 +239,7 @@
     }
 }

-void partialButterflyInverse4(const int16_t* src, int16_t* dst, int shift, int line)
+static void partialButterflyInverse4(const int16_t* src, int16_t* dst, int shift, int line)
 {
     int j;
     int E[2], O[2];
@@ -265,7 +264,7 @@
     }
 }

-void partialButterflyInverse8(const int16_t* src, int16_t* dst, int shift, int line)
+static void partialButterflyInverse8(const int16_t* src, int16_t* dst, int shift, int line)
 {
     int j, k;
     int E[4], O[4];
@@ -301,7 +300,7 @@
     }
 }

-void partialButterflyInverse16(const int16_t* src, int16_t* dst, int shift, int line)
+static void partialButterflyInverse16(const int16_t* src, int16_t* dst, int shift, int line)
 {
     int j, k;
     int E[8], O[8];
@@ -352,7 +351,7 @@
     }
 }

-void partialButterflyInverse32(const int16_t* src, int16_t* dst, int shift, int line)
+static void partialButterflyInverse32(const int16_t* src, int16_t* dst, int shift, int line)
 {
     int j, k;
     int E[16], O[16];
@@ -416,7 +415,7 @@
     }
 }

-void partialButterfly4(const int16_t* src, int16_t* dst, int shift, int line)
+static void partialButterfly4(const int16_t* src, int16_t* dst, int shift, int line)
 {
     int j;
     int E[2], O[2];
@@ -440,7 +439,7 @@
     }
 }

-void dst4_c(const int16_t* src, int16_t* dst, intptr_t srcStride)
+static void dst4_c(const int16_t* src, int16_t* dst, intptr_t srcStride)
 {
     const int shift_1st = 1 + X265_DEPTH - 8;
     const int shift_2nd = 8;
@@ -457,7 +456,7 @@
     fastForwardDst(coef, dst, shift_2nd);
 }

-void dct4_c(const int16_t* src, int16_t* dst, intptr_t srcStride)
+static void dct4_c(const int16_t* src, int16_t* dst, intptr_t srcStride)
 {
     const int shift_1st = 1 + X265_DEPTH - 8;
     const int shift_2nd = 8;
@@ -474,7 +473,7 @@
     partialButterfly4(coef, dst, shift_2nd, 4);
 }

-void dct8_c(const int16_t* src, int16_t* dst, intptr_t srcStride)
+static void dct8_c(const int16_t* src, int16_t* dst, intptr_t srcStride)
 {
     const int shift_1st = 2 + X265_DEPTH - 8;
     const int shift_2nd = 9;
@@ -491,7 +490,7 @@
     partialButterfly8(coef, dst, shift_2nd, 8);
 }

-void dct16_c(const int16_t* src, int16_t* dst, intptr_t srcStride)
+static void dct16_c(const int16_t* src, int16_t* dst, intptr_t srcStride)
 {
     const int shift_1st = 3 + X265_DEPTH - 8;
     const int shift_2nd = 10;
@@ -508,7 +507,7 @@
     partialButterfly16(coef, dst, shift_2nd, 16);
 }

-void dct32_c(const int16_t* src, int16_t* dst, intptr_t srcStride)
+static void dct32_c(const int16_t* src, int16_t* dst, intptr_t srcStride)
 {
     const int shift_1st = 4 + X265_DEPTH - 8;
     const int shift_2nd = 11;
@@ -525,7 +524,7 @@
     partialButterfly32(coef, dst, shift_2nd, 32);
 }

-void idst4_c(const int16_t* src, int16_t* dst, intptr_t dstStride)
+static void idst4_c(const int16_t* src, int16_t* dst, intptr_t dstStride)
 {
     const int shift_1st = 7;
     const int shift_2nd = 12 - (X265_DEPTH - 8);
@@ -542,7 +541,7 @@
     }
 }

-void idct4_c(const int16_t* src, int16_t* dst, intptr_t dstStride)
+static void idct4_c(const int16_t* src, int16_t* dst, intptr_t dstStride)
 {
     const int shift_1st = 7;
     const int shift_2nd = 12 - (X265_DEPTH - 8);
@@ -559,7 +558,7 @@
     }
 }

-void idct8_c(const int16_t* src, int16_t* dst, intptr_t dstStride)
+static void idct8_c(const int16_t* src, int16_t* dst, intptr_t dstStride)
 {
     const int shift_1st = 7;
     const int shift_2nd = 12 - (X265_DEPTH - 8);
@@ -576,7 +575,7 @@
     }
 }

-void idct16_c(const int16_t* src, int16_t* dst, intptr_t dstStride)
+static void idct16_c(const int16_t* src, int16_t* dst, intptr_t dstStride)
 {
     const int shift_1st = 7;
     const int shift_2nd = 12 - (X265_DEPTH - 8);
@@ -593,7 +592,7 @@
     }
 }

-void idct32_c(const int16_t* src, int16_t* dst, intptr_t dstStride)
+static void idct32_c(const int16_t* src, int16_t* dst, intptr_t dstStride)
 {
     const int shift_1st = 7;
     const int shift_2nd = 12 - (X265_DEPTH - 8);
@@ -610,10 +609,10 @@
     }
 }

-void dequant_normal_c(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift)
+static void dequant_normal_c(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift)
 {
 #if HIGH_BIT_DEPTH
-    X265_CHECK(scale < 32768 || ((scale & 3) == 0 && shift > 2), "dequant invalid scale %d\n", scale);
+    X265_CHECK(scale < 32768 || ((scale & 3) == 0 && shift > (X265_DEPTH - 8)), "dequant invalid scale %d\n", scale);
 #else
     // NOTE: maximum of scale is (72 * 256)
     X265_CHECK(scale < 32768, "dequant invalid scale %d\n", scale);
@@ -634,7 +633,7 @@
     }
 }

-void dequant_scaling_c(const int16_t* quantCoef, const int32_t* deQuantCoef, int16_t* coef, int num, int per, int shift)
+static void dequant_scaling_c(const int16_t* quantCoef, const int32_t* deQuantCoef, int16_t* coef, int num, int per, int shift)
 {
     X265_CHECK(num <= 32 * 32, "dequant num %d too large\n", num);
@@ -662,7 +661,7 @@
     }
 }

-uint32_t quant_c(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff)
+static uint32_t quant_c(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff)
 {
     X265_CHECK(qBits >= 8, "qBits less than 8\n");
     X265_CHECK((numCoeff % 16) == 0, "numCoeff must be multiple of 16\n");
@@ -686,7 +685,7 @@
     return numSig;
 }

-uint32_t nquant_c(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff)
+static uint32_t nquant_c(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff)
 {
     X265_CHECK((numCoeff % 16) == 0, "number of quant coeff is not multiple of 4x4\n");
     X265_CHECK((uint32_t)add < ((uint32_t)1 << qBits), "2 ^ qBits less than add\n");
@@ -739,7 +738,7 @@
     return numSig;
 }

-void denoiseDct_c(int16_t* dctCoef, uint32_t* resSum, const uint16_t* offset, int numCoeff)
+static void denoiseDct_c(int16_t* dctCoef, uint32_t* resSum, const uint16_t* offset, int numCoeff)
 {
     for (int i = 0; i < numCoeff; i++)
     {
@@ -752,7 +751,7 @@
     }
 }

-int scanPosLast_c(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* /*scanCG4x4*/, const int /*trSize*/)
+static int scanPosLast_c(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* /*scanCG4x4*/, const int /*trSize*/)
 {
     memset(coeffNum, 0, MLS_GRP_NUM * sizeof(*coeffNum));
     memset(coeffFlag, 0, MLS_GRP_NUM * sizeof(*coeffFlag));
@@ -785,7 +784,7 @@
     return scanPosLast - 1;
 }

-uint32_t findPosFirstLast_c(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16])
+static uint32_t findPosFirstLast_c(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16])
 {
     int n;
@@ -798,11 +797,11 @@
             break;
     }

-    X265_CHECK(n >= 0, "non-zero coeff scan failuare!\n");
+    X265_CHECK(n >= -1, "non-zero coeff scan failuare!\n");

     uint32_t lastNZPosInCG = (uint32_t)n;

-    for (n = 0;; n++)
+    for (n = 0; n < SCAN_SET_SIZE; n++)
     {
         const uint32_t idx = scanTbl[n];
         const uint32_t idxY = idx / MLS_CG_SIZE;
@@ -813,12 +812,166 @@
     uint32_t firstNZPosInCG = (uint32_t)n;

+    // NOTE: when coeff block all ZERO, the lastNZPosInCG is undefined and firstNZPosInCG is 16
     return ((lastNZPosInCG << 16) | firstNZPosInCG);
 }

-} // closing - anonymous file-static namespace

-namespace x265 {
+static uint32_t costCoeffNxN_c(const uint16_t *scan, const coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, const uint8_t *tabSigCtx, uint32_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase)
+{
+    ALIGN_VAR_32(uint16_t, tmpCoeff[SCAN_SET_SIZE]);
+    uint32_t numNonZero = (scanPosSigOff < (SCAN_SET_SIZE - 1) ? 1 : 0);
+    uint32_t sum = 0;
+
+    // correct offset to match assembly
+    absCoeff -= numNonZero;
+
+    for (int i = 0; i < MLS_CG_SIZE; i++)
+    {
+        tmpCoeff[i * MLS_CG_SIZE + 0] = (uint16_t)abs(coeff[i * trSize + 0]);
+        tmpCoeff[i * MLS_CG_SIZE + 1] = (uint16_t)abs(coeff[i * trSize + 1]);
+        tmpCoeff[i * MLS_CG_SIZE + 2] = (uint16_t)abs(coeff[i * trSize + 2]);
+        tmpCoeff[i * MLS_CG_SIZE + 3] = (uint16_t)abs(coeff[i * trSize + 3]);
+    }
+
+    do
+    {
+        uint32_t blkPos, sig, ctxSig;
+        blkPos = scan[scanPosSigOff];
+        const uint32_t posZeroMask = (subPosBase + scanPosSigOff) ? ~0 : 0;
+        sig = scanFlagMask & 1;
+        scanFlagMask >>= 1;
+        X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n");
+        if ((scanPosSigOff != 0) || (subPosBase == 0) || numNonZero)
+        {
+            const uint32_t cnt = tabSigCtx[blkPos] + offset;
+            ctxSig = cnt & posZeroMask;
+
+            //X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, codingParameters.scan[subPosBase + scanPosSigOff], bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");;
+            //encodeBin(sig, baseCtx[ctxSig]);
+            const uint32_t mstate = baseCtx[ctxSig];
+            const uint32_t mps = mstate & 1;
+            const uint32_t stateBits = PFX(entropyStateBits)[mstate ^ sig];
+            uint32_t nextState = (stateBits >> 24) + mps;
+            if ((mstate ^ sig) == 1)
+                nextState = sig;
+            X265_CHECK(sbacNext(mstate, sig) == nextState, "nextState check failure\n");
+            X265_CHECK(sbacGetEntropyBits(mstate, sig) == (stateBits & 0xFFFFFF), "entropyBits check failure\n");
+            baseCtx[ctxSig] = (uint8_t)nextState;
+            sum += stateBits;
+        }
+        assert(numNonZero <= 15);
+        assert(blkPos <= 15);
+        absCoeff[numNonZero] = tmpCoeff[blkPos];
+        numNonZero += sig;
+        scanPosSigOff--;
+    }
+    while(scanPosSigOff >= 0);
+
+    return (sum & 0xFFFFFF);
+}
+
+static uint32_t costCoeffRemain_c(uint16_t *absCoeff, int numNonZero, int idx)
+{
+    uint32_t goRiceParam = 0;
+
+    uint32_t sum = 0;
+    int baseLevel = 3;
+    do
+    {
+        if (idx >= C1FLAG_NUMBER)
+            baseLevel = 1;
+
+        // TODO: the IDX is not really idx, so this check inactive
+        //X265_CHECK(baseLevel == ((idx < C1FLAG_NUMBER) ? (2 + firstCoeff2) : 1), "baseLevel check failurr\n");
+        int codeNumber = absCoeff[idx] - baseLevel;
+
+        if (codeNumber >= 0)
+        {
+            //writeCoefRemainExGolomb(absCoeff[idx] - baseLevel, goRiceParam);
+            uint32_t length = 0;
+
+            codeNumber = ((uint32_t)codeNumber >> goRiceParam) - COEF_REMAIN_BIN_REDUCTION;
+            if (codeNumber >= 0)
+            {
+                {
+                    unsigned long cidx;
+                    CLZ(cidx, codeNumber + 1);
+                    length = cidx;
+                }
+                X265_CHECK((codeNumber != 0) || (length == 0), "length check failure\n");
+
+                codeNumber = (length + length);
+            }
+            sum += (COEF_REMAIN_BIN_REDUCTION + 1 + goRiceParam + codeNumber);
+
+            if (absCoeff[idx] > (COEF_REMAIN_BIN_REDUCTION << goRiceParam))
+                goRiceParam = (goRiceParam + 1) - (goRiceParam >> 2);
+            X265_CHECK(goRiceParam <= 4, "goRiceParam check failure\n");
+        }
+        baseLevel = 2;
+        idx++;
+    }
+    while(idx < numNonZero);
+
+    return sum;
+}
+
+static uint32_t costC1C2Flag_c(uint16_t *absCoeff, intptr_t numC1Flag, uint8_t *baseCtxMod, intptr_t ctxOffset)
+{
+    uint32_t sum = 0;
+    uint32_t c1 = 1;
+    uint32_t firstC2Idx = 8;
+    uint32_t firstC2Flag = 2;
+    uint32_t c1Next = 0xFFFFFFFE;
+
+    int idx = 0;
+    do
+    {
+        uint32_t symbol1 = absCoeff[idx] > 1;
+        uint32_t symbol2 = absCoeff[idx] > 2;
+        //encodeBin(symbol1, baseCtxMod[c1]);
+        {
+            const uint32_t mstate = baseCtxMod[c1];
+            baseCtxMod[c1] = sbacNext(mstate, symbol1);
+            sum += sbacGetEntropyBits(mstate, symbol1);
+        }
+
+        if (symbol1)
+            c1Next = 0;
+
+        if (symbol1 + firstC2Flag == 3)
+            firstC2Flag = symbol2;
+
+        if (symbol1 + firstC2Idx == 9)
+            firstC2Idx = idx;
+
+        c1 = (c1Next & 3);
+        c1Next >>= 2;
+        X265_CHECK(c1 <= 3, "c1 check failure\n");
+        idx++;
+    }
+    while(idx < numC1Flag);
+
+    if (!c1)
+    {
+        X265_CHECK((firstC2Flag <= 1), "firstC2FlagIdx check failure\n");
+
+        baseCtxMod += ctxOffset;
+
+        //encodeBin(firstC2Flag, baseCtxMod[0]);
+        {
+            const uint32_t mstate = baseCtxMod[0];
+            baseCtxMod[0] = sbacNext(mstate, firstC2Flag);
+            sum += sbacGetEntropyBits(mstate, firstC2Flag);
+        }
+    }
+
+    return (sum & 0x00FFFFFF) + (c1 << 26) + (firstC2Idx << 28);
+}
+
+namespace X265_NS {
 // x265 private namespace

 void setupDCTPrimitives_c(EncoderPrimitives& p)
@@ -850,5 +1003,8 @@
     p.scanPosLast = scanPosLast_c;
     p.findPosFirstLast = findPosFirstLast_c;
+    p.costCoeffNxN = costCoeffNxN_c;
+    p.costCoeffRemain = costCoeffRemain_c;
+    p.costC1C2Flag = costC1C2Flag_c;
 }
 }
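Editor's note: costCoeffRemain_c above prices coeff_abs_level_remaining without emitting bits: a truncated-unary Rice prefix plus goRiceParam suffix bits, switching to the exp-Golomb escape (the CLZ path; x265's CLZ macro yields the index of the most significant set bit) once the value is large. A worked trace of the arithmetic, assuming COEF_REMAIN_BIN_REDUCTION is 3 as in the HEVC spec (that constant's value is an assumption here):

    // absCoeff = 7, baseLevel = 3, goRiceParam = 0:
    //   codeNumber = 7 - 3 = 4
    //   (4 >> 0) - 3 = 1 >= 0        -> exp-Golomb escape path
    //   CLZ gives MSB index of 1+1:     length = 1, codeNumber = 2
    //   bits = 3 + 1 + 0 + 2 = 6
    // absCoeff = 3, baseLevel = 3, goRiceParam = 0:
    //   codeNumber = 0; (0 >> 0) - 3 = -3 stays negative -> pure Rice prefix
    //   bits = 3 + 1 + 0 + (-3) = 1  (a single terminated prefix bin)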
x265_1.7.tar.gz/source/common/deblock.cpp -> x265_1.8.tar.gz/source/common/deblock.cpp
@@ -28,7 +28,7 @@
 #include "slice.h"
 #include "mv.h"

-using namespace x265;
+using namespace X265_NS;

 #define DEBLOCK_SMALLEST_BLOCK 8
 #define DEFAULT_INTRA_TC_OFFSET 2
x265_1.7.tar.gz/source/common/deblock.h -> x265_1.8.tar.gz/source/common/deblock.h
@@ -26,7 +26,7 @@
 #include "common.h"

-namespace x265 {
+namespace X265_NS {
 // private namespace

 class CUData;
x265_1.7.tar.gz/source/common/frame.cpp -> x265_1.8.tar.gz/source/common/frame.cpp
@@ -26,7 +26,7 @@
 #include "picyuv.h"
 #include "framedata.h"

-using namespace x265;
+using namespace X265_NS;

 Frame::Frame()
 {
x265_1.7.tar.gz/source/common/frame.h -> x265_1.8.tar.gz/source/common/frame.h
@@ -28,7 +28,7 @@
 #include "lowres.h"
 #include "threading.h"

-namespace x265 {
+namespace X265_NS {
 // private namespace

 class FrameData;
x265_1.7.tar.gz/source/common/framedata.cpp -> x265_1.8.tar.gz/source/common/framedata.cpp
@@ -24,7 +24,7 @@
 #include "framedata.h"
 #include "picyuv.h"

-using namespace x265;
+using namespace X265_NS;

 FrameData::FrameData()
 {
x265_1.7.tar.gz/source/common/framedata.h -> x265_1.8.tar.gz/source/common/framedata.h
@@ -28,12 +28,61 @@
 #include "slice.h"
 #include "cudata.h"

-namespace x265 {
+namespace X265_NS {
 // private namespace

 class PicYuv;
 class JobProvider;

+#define INTER_MODES 4 // 2Nx2N, 2NxN, Nx2N, AMP modes
+#define INTRA_MODES 3 // DC, Planar, Angular modes
+
+/* Current frame stats for 2 pass */
+struct FrameStats
+{
+    int         mvBits;    /* MV bits (MV+Ref+Block Type) */
+    int         coeffBits; /* Texture bits (DCT coefs) */
+    int         miscBits;
+
+    int         intra8x8Cnt;
+    int         inter8x8Cnt;
+    int         skip8x8Cnt;
+
+    /* CU type counts stored as percentage */
+    double      percent8x8Intra;
+    double      percent8x8Inter;
+    double      percent8x8Skip;
+    double      avgLumaDistortion;
+    double      avgChromaDistortion;
+    double      avgPsyEnergy;
+    double      avgLumaLevel;
+    double      lumaLevel;
+    double      percentIntraNxN;
+    double      percentSkipCu[NUM_CU_DEPTH];
+    double      percentMergeCu[NUM_CU_DEPTH];
+    double      percentIntraDistribution[NUM_CU_DEPTH][INTRA_MODES];
+    double      percentInterDistribution[NUM_CU_DEPTH][3]; // 2Nx2N, RECT, AMP modes percentage
+
+    uint64_t    cntIntraNxN;
+    uint64_t    totalCu;
+    uint64_t    totalCtu;
+    uint64_t    lumaDistortion;
+    uint64_t    chromaDistortion;
+    uint64_t    psyEnergy;
+    uint64_t    cntSkipCu[NUM_CU_DEPTH];
+    uint64_t    cntMergeCu[NUM_CU_DEPTH];
+    uint64_t    cntInter[NUM_CU_DEPTH];
+    uint64_t    cntIntra[NUM_CU_DEPTH];
+    uint64_t    cuInterDistribution[NUM_CU_DEPTH][INTER_MODES];
+    uint64_t    cuIntraDistribution[NUM_CU_DEPTH][INTRA_MODES];
+    uint16_t    maxLumaLevel;
+
+    FrameStats()
+    {
+        memset(this, 0, sizeof(FrameStats));
+    }
+};
+
 /* Per-frame data that is used during encodes and referenced while the picture
  * is available for reference. A FrameData instance is attached to a Frame as it
  * comes out of the lookahead. Frames which are not being encoded do not have a
@@ -85,6 +134,7 @@
     RCStatCU*      m_cuStat;
     RCStatRow*     m_rowStat;

+    FrameStats     m_frameStats; // stats of current frame for multi-pass encodes
     double         m_avgQpRc;    /* avg QP as decided by rate-control */
     double         m_avgQpAq;    /* avg QP as decided by AQ in addition to rate-control */
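Editor's note: this FrameStats struct carries the raw counters behind the per-frame statistics that 1.8 exposes through its new stats API. A hedged sketch of how the counts could be reduced to the percent* fields (the reduction below is illustrative, not x265's actual code):

    // Illustrative reduction from counts to percentages, assuming totalCu
    // has already been accumulated over the frame:
    void finishStats(FrameStats& s)
    {
        if (!s.totalCu)
            return;
        s.percentIntraNxN = 100.0 * s.cntIntraNxN / s.totalCu;
        for (int d = 0; d < NUM_CU_DEPTH; d++)
        {
            s.percentSkipCu[d]  = 100.0 * s.cntSkipCu[d] / s.totalCu;
            s.percentMergeCu[d] = 100.0 * s.cntMergeCu[d] / s.totalCu;
        }
    }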
x265_1.7.tar.gz/source/common/intrapred.cpp -> x265_1.8.tar.gz/source/common/intrapred.cpp
@@ -24,7 +24,7 @@
 #include "common.h"
 #include "primitives.h"

-using namespace x265;
+using namespace X265_NS;

 namespace
 {
@@ -50,7 +50,7 @@
     filtered[tuSize2 + tuSize2] = leftLast;
 }

-void dcPredFilter(const pixel* above, const pixel* left, pixel* dst, intptr_t dststride, int size)
+static void dcPredFilter(const pixel* above, const pixel* left, pixel* dst, intptr_t dststride, int size)
 {
     // boundary pixels processing
     dst[0] = (pixel)((above[0] + left[0] + 2 * dst[0] + 2) >> 2);
@@ -234,7 +234,7 @@
     }
 }

-namespace x265 {
+namespace X265_NS {
 // x265 private namespace

 void setupIntraPrimitives_c(EncoderPrimitives& p)
x265_1.7.tar.gz/source/common/ipfilter.cpp -> x265_1.8.tar.gz/source/common/ipfilter.cpp
@@ -27,13 +27,15 @@
 #include "primitives.h"
 #include "x265.h"

-using namespace x265;
+using namespace X265_NS;

 #if _MSC_VER
 #pragma warning(disable: 4127) // conditional expression is constant, typical for templated functions
 #endif

 namespace {
+// file local namespace
+
 template<int width, int height>
 void filterPixelToShort_c(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride)
 {
@@ -53,7 +55,7 @@
     }
 }

-void extendCURowColBorder(pixel* txt, intptr_t stride, int width, int height, int marginX)
+static void extendCURowColBorder(pixel* txt, intptr_t stride, int width, int height, int marginX)
 {
     for (int y = 0; y < height; y++)
     {
@@ -369,7 +371,7 @@
     }
 }

-namespace x265 {
+namespace X265_NS {
 // x265 private namespace

 #define CHROMA_420(W, H) \
x265_1.7.tar.gz/source/common/loopfilter.cpp -> x265_1.8.tar.gz/source/common/loopfilter.cpp
@@ -36,13 +36,13 @@
     return (x >> 31) | ((int)((((uint32_t)-x)) >> 31));
 }

-void calSign(int8_t *dst, const pixel *src1, const pixel *src2, const int endX)
+static void calSign(int8_t *dst, const pixel *src1, const pixel *src2, const int endX)
 {
     for (int x = 0; x < endX; x++)
         dst[x] = signOf(src1[x] - src2[x]);
 }

-void processSaoCUE0(pixel * rec, int8_t * offsetEo, int width, int8_t* signLeft, intptr_t stride)
+static void processSaoCUE0(pixel * rec, int8_t * offsetEo, int width, int8_t* signLeft, intptr_t stride)
 {
     int x, y;
     int8_t signRight, signLeft0;
@@ -62,7 +62,7 @@
     }
 }

-void processSaoCUE1(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width)
+static void processSaoCUE1(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width)
 {
     int x;
     int8_t signDown;
@@ -77,7 +77,7 @@
     }
 }

-void processSaoCUE1_2Rows(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width)
+static void processSaoCUE1_2Rows(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width)
 {
     int x, y;
     int8_t signDown;
@@ -96,7 +96,7 @@
     }
 }

-void processSaoCUE2(pixel * rec, int8_t * bufft, int8_t * buff1, int8_t * offsetEo, int width, intptr_t stride)
+static void processSaoCUE2(pixel * rec, int8_t * bufft, int8_t * buff1, int8_t * offsetEo, int width, intptr_t stride)
 {
     int x;
     for (x = 0; x < width; x++)
@@ -108,7 +108,7 @@
     }
 }

-void processSaoCUE3(pixel *rec, int8_t *upBuff1, int8_t *offsetEo, intptr_t stride, int startX, int endX)
+static void processSaoCUE3(pixel *rec, int8_t *upBuff1, int8_t *offsetEo, intptr_t stride, int startX, int endX)
 {
     int8_t signDown;
     int8_t edgeType;
@@ -122,7 +122,7 @@
     }
 }

-void processSaoCUB0(pixel* rec, const int8_t* offset, int ctuWidth, int ctuHeight, intptr_t stride)
+static void processSaoCUB0(pixel* rec, const int8_t* offset, int ctuWidth, int ctuHeight, intptr_t stride)
 {
 #define SAO_BO_BITS 5
     const int boShift = X265_DEPTH - SAO_BO_BITS;
@@ -138,7 +138,7 @@
     }
 }

-namespace x265 {
+namespace X265_NS {
 void setupLoopFilterPrimitives_c(EncoderPrimitives &p)
 {
     p.saoCuOrgE0 = processSaoCUE0;
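Editor's note: the signOf expression visible in this hunk's context computes sign(x) branch-free: the arithmetic right shift of x contributes -1 for negatives, and the unsigned shift of -x contributes 1 for positives. A quick standalone check (illustrative; relies on the near-universal arithmetic shift of signed values, as the original code does):

    static inline int signOf(int x)
    {
        return (x >> 31) | ((int)(((uint32_t)-x) >> 31));
    }
    // signOf(7)  ==  0 | 1 ==  1
    // signOf(-7) == -1 | 0 == -1   (the -1 from x >> 31 dominates the OR)
    // signOf(0)  ==  0 | 0 ==  0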
x265_1.7.tar.gz/source/common/lowres.cpp -> x265_1.8.tar.gz/source/common/lowres.cpp
@@ -25,7 +25,7 @@
 #include "lowres.h"
 #include "mv.h"

-using namespace x265;
+using namespace X265_NS;

 bool Lowres::create(PicYuv *origPic, int _bframes, bool bAQEnabled)
 {
@@ -36,13 +36,13 @@
     lumaStride = width + 2 * origPic->m_lumaMarginX;
     if (lumaStride & 31)
         lumaStride += 32 - (lumaStride & 31);

-    int cuWidth = (width + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
-    int cuHeight = (lines + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
-    int cuCount = cuWidth * cuHeight;
+    maxBlocksInRow = (width + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
+    maxBlocksInCol = (lines + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
+    int cuCount = maxBlocksInRow * maxBlocksInCol;

     /* rounding the width to multiple of lowres CU size */
-    width = cuWidth * X265_LOWRES_CU_SIZE;
-    lines = cuHeight * X265_LOWRES_CU_SIZE;
+    width = maxBlocksInRow * X265_LOWRES_CU_SIZE;
+    lines = maxBlocksInCol * X265_LOWRES_CU_SIZE;

     size_t planesize = lumaStride * (lines + 2 * origPic->m_lumaMarginY);
     size_t padoffset = lumaStride * origPic->m_lumaMarginY + origPic->m_lumaMarginX;
@@ -74,7 +74,7 @@
     {
         for (int j = 0; j < bframes + 2; j++)
         {
-            CHECKED_MALLOC(rowSatds[i][j], int32_t, cuHeight);
+            CHECKED_MALLOC(rowSatds[i][j], int32_t, maxBlocksInCol);
             CHECKED_MALLOC(lowresCosts[i][j], uint16_t, cuCount);
         }
     }
@@ -126,7 +126,7 @@
 void Lowres::init(PicYuv *origPic, int poc)
 {
     bLastMiniGopBFrame = false;
-    bScenecut = true;  // could be a scene-cut, until ruled out by flash detection
+    bScenecut = false; // could be a scene-cut, until ruled out by flash detection
     bKeyframe = false; // Not a keyframe unless identified by lookahead
     frameNum = poc;
     leadingBframes = 0;
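Editor's note: the renamed maxBlocksInRow/maxBlocksInCol use the standard add-then-shift ceiling division, with X265_LOWRES_CU_BITS the log2 of the lowres block size. Worked numbers, assuming an 8x8 lowres block (X265_LOWRES_CU_BITS == 3; that value is an assumption here, implied but not shown by this diff):

    // maxBlocksInRow = ceil(width / 8) = (width + 7) >> 3
    int width = 961;                            // e.g. an odd half-res width
    int maxBlocksInRow = (width + 8 - 1) >> 3;  // (961 + 7) >> 3 = 121
    width = maxBlocksInRow * 8;                 // rounds up to 968, a block multiple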
x265_1.7.tar.gz/source/common/lowres.h -> x265_1.8.tar.gz/source/common/lowres.h
@@ -29,7 +29,7 @@
 #include "picyuv.h"
 #include "mv.h"

-namespace x265 {
+namespace X265_NS {
 // private namespace

 struct ReferencePlanes
@@ -130,6 +130,8 @@
     uint16_t(*lowresCosts[X265_BFRAME_MAX + 2][X265_BFRAME_MAX + 2]);
     int32_t*  lowresMvCosts[2][X265_BFRAME_MAX + 1];
     MV*       lowresMvs[2][X265_BFRAME_MAX + 1];
+    uint32_t  maxBlocksInRow;
+    uint32_t  maxBlocksInCol;

     /* used for vbvLookahead */
     int       plannedType[X265_LOOKAHEAD_MAX + 1];
x265_1.7.tar.gz/source/common/md5.cpp -> x265_1.8.tar.gz/source/common/md5.cpp
@@ -25,7 +25,7 @@
 #include "common.h"
 #include "md5.h"

-namespace x265 {
+namespace X265_NS {
 // private x265 namespace

 #ifndef ARCH_BIG_ENDIAN
x265_1.7.tar.gz/source/common/md5.h -> x265_1.8.tar.gz/source/common/md5.h
@@ -27,7 +27,7 @@
 #include "common.h"

-namespace x265 {
+namespace X265_NS {
 //private x265 namespace

 typedef struct MD5Context
x265_1.7.tar.gz/source/common/mv.h -> x265_1.8.tar.gz/source/common/mv.h
@@ -27,7 +27,7 @@
 #include "common.h"
 #include "primitives.h"

-namespace x265 {
+namespace X265_NS {
 // private x265 namespace

 #if _MSC_VER
x265_1.7.tar.gz/source/common/param.cpp -> x265_1.8.tar.gz/source/common/param.cpp
@@ -52,7 +52,7 @@
  */
 #undef strtok_r

-char* strtok_r(char* str, const char* delim, char** nextp)
+static char* strtok_r(char* str, const char* delim, char** nextp)
 {
     if (!str)
         str = *nextp;
@@ -76,27 +76,35 @@
 #endif // if !defined(HAVE_STRTOK_R)

-using namespace x265;
+#if EXPORT_C_API
+
+/* these functions are exported as C functions (default) */
+using namespace X265_NS;
+extern "C" {
+
+#else
+
+/* these functions exist within private namespace (multilib) */
+namespace X265_NS {
+
+#endif

-extern "C"
 x265_param *x265_param_alloc()
 {
     return (x265_param*)x265_malloc(sizeof(x265_param));
 }

-extern "C"
 void x265_param_free(x265_param* p)
 {
     x265_free(p);
 }

-extern "C"
 void x265_param_default(x265_param* param)
 {
     memset(param, 0, sizeof(x265_param));

     /* Applying default values to all elements in the param structure */
-    param->cpuid = x265::cpu_detect();
+    param->cpuid = X265_NS::cpu_detect();
     param->bEnableWavefront = 1;
     param->frameNumThreads = 0;
@@ -111,7 +119,7 @@
     param->bEnableSsim = 0;

     /* Source specifications */
-    param->internalBitDepth = x265_max_bit_depth;
+    param->internalBitDepth = X265_DEPTH;
     param->internalCsp = X265_CSP_I420;

     param->levelIdc = 0;
@@ -151,6 +159,7 @@
     param->subpelRefine = 2;
     param->searchRange = 57;
     param->maxNumMergeCand = 2;
+    param->limitReferences = 0;
     param->bEnableWeightedPred = 1;
     param->bEnableWeightedBiPred = 0;
     param->bEnableEarlySkip = 0;
@@ -197,6 +206,7 @@
     param->rc.rateControlMode = X265_RC_CRF;
     param->rc.qp = 32;
     param->rc.aqMode = X265_AQ_VARIANCE;
+    param->rc.qgSize = 32;
     param->rc.aqStrength = 1.0;
     param->rc.cuTree = 1;
     param->rc.rfConstantMax = 0;
@@ -210,7 +220,6 @@
     param->rc.zones = NULL;
     param->rc.bEnableSlowFirstPass = 0;
     param->rc.bStrictCbr = 0;
-    param->rc.qgSize = 64; /* Same as maxCUSize */

     /* Video Usability Information (VUI) */
     param->vui.aspectRatioIdc = 0;
@@ -234,10 +243,13 @@
     param->vui.defDispWinBottomOffset = 0;
 }

-extern "C"
 int x265_param_default_preset(x265_param* param, const char* preset, const char* tune)
 {
-    x265_param_default(param);
+#if EXPORT_C_API
+    ::x265_param_default(param);
+#else
+    X265_NS::x265_param_default(param);
+#endif

     if (preset)
     {
@@ -430,8 +442,8 @@
             param->deblockingFilterBetaOffset = -2;
             param->deblockingFilterTCOffset = -2;
             param->bIntraInBFrames = 0;
-            param->rdoqLevel = 1;
-            param->psyRdoq = 30;
+            param->rdoqLevel = 2;
+            param->psyRdoq = 10.0;
             param->psyRd = 0.5;
             param->rc.ipFactor = 1.1;
             param->rc.pbFactor = 1.1;
@@ -459,16 +471,6 @@
     return 0;
 }

-static double x265_atof(const char* str, bool& bError)
-{
-    char *end;
-    double v = strtod(str, &end);
-
-    if (end == str || *end != '\0')
-        bError = true;
-    return v;
-}
-
 static int parseName(const char* arg, const char* const* names, bool& bError)
 {
     for (int i = 0; names[i]; i++)
@@ -485,7 +487,6 @@
 #define atof(str) x265_atof(str, bError)
 #define atobool(str) (bNameWasBool = true, x265_atobool(str, bError))

-extern "C"
 int x265_param_parse(x265_param* p, const char* name, const char* value)
 {
     bool bError = false;
@@ -581,6 +582,7 @@
         }
     }
     OPT("cu-stats") p->bLogCuStats = atobool(value);
+    OPT("total-frames") p->totalFrames = atoi(value);
     OPT("annexb") p->bAnnexB = atobool(value);
     OPT("repeat-headers") p->bRepeatHeaders = atobool(value);
     OPT("wpp") p->bEnableWavefront = atobool(value);
@@ -641,6 +643,7 @@
         }
     }
     OPT("ref") p->maxNumReferences = atoi(value);
+    OPT("limit-refs") p->limitReferences = atoi(value);
     OPT("weightp") p->bEnableWeightedPred = atobool(value);
     OPT("weightb") p->bEnableWeightedBiPred = atobool(value);
     OPT("cbqpoffs") p->cbQpOffset = atoi(value);
@@ -827,7 +830,7 @@
         p->vui.chromaSampleLocTypeTopField = atoi(value);
         p->vui.chromaSampleLocTypeBottomField = p->vui.chromaSampleLocTypeTopField;
     }
-    OPT("crop-rect")
+    OPT2("display-window", "crop-rect")
     {
         p->vui.bEnableDefaultDisplayWindowFlag = 1;
         bError |= sscanf(value, "%d,%d,%d,%d",
@@ -845,7 +848,6 @@
         p->rc.bStatRead = pass & 2;
     }
     OPT("stats") p->rc.statFileName = strdup(value);
-    OPT("csv") p->csvfn = strdup(value);
     OPT("scaling-list") p->scalingLists = strdup(value);
     OPT2("pools", "numa-pools") p->numaPools = strdup(value);
     OPT("lambda-file") p->rc.lambdaFileName = strdup(value);
@@ -864,7 +866,9 @@
     return bError ? X265_PARAM_BAD_VALUE : 0;
 }

-namespace x265 {
+} /* end extern "C" or namespace */
+
+namespace X265_NS {
 // internal encoder functions

 int x265_atoi(const char* str, bool& bError)
@@ -877,6 +881,16 @@
     return v;
 }

+double x265_atof(const char* str, bool& bError)
+{
+    char *end;
+    double v = strtod(str, &end);
+
+    if (end == str || *end != '\0')
+        bError = true;
+    return v;
+}
+
 /* cpu name can be:
  *   auto || true - x265::cpu_detect()
  *   false || no  - disabled
@@ -893,7 +907,7 @@
     if (isdigit(value[0]))
         cpu = x265_atoi(value, bError);
     else
-        cpu = !strcmp(value, "auto") || x265_atobool(value, bError) ? x265::cpu_detect() : 0;
+        cpu = !strcmp(value, "auto") || x265_atobool(value, bError) ? X265_NS::cpu_detect() : 0;

     if (bError)
     {
@@ -904,12 +918,12 @@
     for (init = buf; (tok = strtok_r(init, ",", &saveptr)); init = NULL)
     {
         int i;
-        for (i = 0; x265::cpu_names[i].flags && strcasecmp(tok, x265::cpu_names[i].name); i++)
+        for (i = 0; X265_NS::cpu_names[i].flags && strcasecmp(tok, X265_NS::cpu_names[i].name); i++)
         {
         }

-        cpu |= x265::cpu_names[i].flags;
-        if (!x265::cpu_names[i].flags)
+        cpu |= X265_NS::cpu_names[i].flags;
+        if (!X265_NS::cpu_names[i].flags)
             bError = 1;
     }

@@ -997,15 +1011,8 @@
     uint32_t tuQTMaxLog2Size = X265_MIN(maxLog2CUSize, 5);
     uint32_t tuQTMinLog2Size = 2; //log2(4)

-    /* These checks might be temporary */
-#if HIGH_BIT_DEPTH
-    CHECK(param->internalBitDepth != 10,
-          "x265 was compiled for 10bit encodes, only 10bit internal depth supported");
-#else
-    CHECK(param->internalBitDepth != 8,
-          "x265 was compiled for 8bit encodes, only 8bit internal depth supported");
-#endif
-
+    CHECK(param->internalBitDepth != X265_DEPTH,
+          "internalBitDepth must match compiled bit depth");
     CHECK(param->minCUSize != 64 && param->minCUSize != 32 && param->minCUSize != 16 && param->minCUSize != 8,
          "minimim CU size must be 8, 16, 32, or 64");
     CHECK(param->minCUSize > param->maxCUSize,
@@ -1026,6 +1033,8 @@
          "subme must be less than or equal to X265_MAX_SUBPEL_LEVEL (7)");
     CHECK(param->subpelRefine < 0,
          "subme must be greater than or equal to 0");
+    CHECK(param->limitReferences > 3,
+          "limitReferences must be 0, 1, 2 or 3");
     CHECK(param->frameNumThreads < 0 || param->frameNumThreads > X265_MAX_FRAME_THREADS,
          "frameNumThreads (--frame-threads) must be [0 .. X265_MAX_FRAME_THREADS)");
     CHECK(param->cbQpOffset < -12, "Min. Chroma Cb QP Offset is -12");
@@ -1077,7 +1086,7 @@
          "Lookahead depth must be less than 256");
     CHECK(param->lookaheadSlices > 16 || param->lookaheadSlices < 0,
          "Lookahead slices must between 0 and 16");
-    CHECK(param->rc.aqMode < X265_AQ_NONE || X265_AQ_AUTO_VARIANCE < param->rc.aqMode,
+    CHECK(param->rc.aqMode < X265_AQ_NONE || X265_AQ_AUTO_VARIANCE_BIASED < param->rc.aqMode,
          "Aq-Mode is out of range");
     CHECK(param->rc.aqStrength < 0 || param->rc.aqStrength > 3,
          "Aq-Strength is out of range");
@@ -1088,7 +1097,6 @@
     CHECK(param->psyRd < 0 || 2.0 < param->psyRd, "Psy-rd strength must be between 0 and 2.0");
     CHECK(param->psyRdoq < 0 || 50.0 < param->psyRdoq, "Psy-rdoq strength must be between 0 and 50.0");
     CHECK(param->bEnableWavefront < 0, "WaveFrontSynchro cannot be negative");
-    CHECK(!param->bEnableWavefront && param->rc.vbvBufferSize, "VBV requires wave-front parallelism (--wpp)");
     CHECK((param->vui.aspectRatioIdc < 0
            || param->vui.aspectRatioIdc > 16)
           && param->vui.aspectRatioIdc != X265_EXTENDED_SAR,
@@ -1106,11 +1114,11 @@
          "Color Primaries must be undef, bt709, bt470m,"
          " bt470bg, smpte170m, smpte240m, film or bt2020");
     CHECK(param->vui.transferCharacteristics < 0
-          || param->vui.transferCharacteristics > 17
+          || param->vui.transferCharacteristics > 18
           || param->vui.transferCharacteristics == 3,
          "Transfer Characteristics must be undef, bt709, bt470m, bt470bg,"
          " smpte170m, smpte240m, linear, log100, log316, iec61966-2-4, bt1361e,"
-          " iec61966-2-1, bt2020-10, bt2020-12, smpte-st-2084 or smpte-st-428");
+          " iec61966-2-1, bt2020-10, bt2020-12, smpte-st-2084, smpte-st-428 or arib-std-b67");
     CHECK(param->vui.matrixCoeffs < 0
           || param->vui.matrixCoeffs > 10
           || param->vui.matrixCoeffs == 3,
@@ -1245,9 +1253,6 @@
     if (param->logLevel < X265_LOG_INFO)
         return;

-#if HIGH_BIT_DEPTH
-    x265_log(param, X265_LOG_INFO, "Internal bit depth              : %d\n", param->internalBitDepth);
-#endif
     if (param->interlaceMode)
         x265_log(param, X265_LOG_INFO, "Interlaced field inputs         : %s\n", x265_interlace_names[param->interlaceMode]);
@@ -1271,8 +1276,10 @@
     x265_log(param, X265_LOG_INFO, "Intra 32x32 TU penalty type     : %d\n", param->rdPenalty);
     x265_log(param, X265_LOG_INFO, "Lookahead / bframes / badapt    : %d / %d / %d\n", param->lookaheadDepth, param->bframes, param->bFrameAdaptive);
-    x265_log(param, X265_LOG_INFO, "b-pyramid / weightp / weightb / refs: %d / %d / %d / %d\n",
-             param->bBPyramid, param->bEnableWeightedPred, param->bEnableWeightedBiPred, param->maxNumReferences);
+    x265_log(param, X265_LOG_INFO, "b-pyramid / weightp / weightb   : %d / %d / %d\n",
+             param->bBPyramid, param->bEnableWeightedPred, param->bEnableWeightedBiPred);
+    x265_log(param, X265_LOG_INFO, "References / ref-limit  cu / depth  : %d / %d / %d\n",
+             param->maxNumReferences, !!(param->limitReferences & X265_REF_LIMIT_CU), !!(param->limitReferences & X265_REF_LIMIT_DEPTH));
     if (param->rc.aqMode)
         x265_log(param, X265_LOG_INFO, "AQ: mode / str / qg-size / cu-tree  : %d / %0.1f / %d / %d\n", param->rc.aqMode,
@@ -1420,9 +1427,11 @@
     s += sprintf(s, " bframe-bias=%d", p->bFrameBias);
     s += sprintf(s, " b-adapt=%d", p->bFrameAdaptive);
     s += sprintf(s, " ref=%d", p->maxNumReferences);
+    s += sprintf(s, " limit-refs=%d", p->limitReferences);
     BOOL(p->bEnableWeightedPred, "weightp");
     BOOL(p->bEnableWeightedBiPred, "weightb");
     s += sprintf(s, " aq-mode=%d", p->rc.aqMode);
+    s += sprintf(s, " qg-size=%d", p->rc.qgSize);
     s += sprintf(s, " aq-strength=%.2f", p->rc.aqStrength);
     s += sprintf(s, " cbqpoffs=%d", p->cbQpOffset);
     s += sprintf(s, " crqpoffs=%d", p->crQpOffset);
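Editor's note: the EXPORT_C_API split above is the core of the 1.8 multilib mechanism: the same definitions compile either as public C-linkage symbols (standalone library) or as namespace-private symbols that a bit-depth dispatcher links behind the scenes. A minimal sketch of the pattern (x265_example_api is a hypothetical function, shown only to isolate the idiom):

    #if EXPORT_C_API
    extern "C" {            /* default: public C ABI */
    #else
    namespace X265_NS {     /* multilib: hidden behind a per-depth namespace */
    #endif

    void x265_example_api(void) { /* ... */ }

    } /* end extern "C" or namespace */

Either way the body is written once; only the linkage of the emitted symbol changes.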
x265_1.7.tar.gz/source/common/param.h -> x265_1.8.tar.gz/source/common/param.h
@@ -2,6 +2,7 @@
  * Copyright (C) 2013 x265 project
  *
  * Authors: Deepthi Nandakumar <deepthi@multicorewareinc.com>
+ *          Praveen Kumar Tiwari <praveen@multicorewareinc.com>
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
@@ -24,7 +25,8 @@
 #ifndef X265_PARAM_H
 #define X265_PARAM_H

-namespace x265 {
+namespace X265_NS {
+
 int   x265_check_params(x265_param *param);
 int   x265_set_globals(x265_param *param);
 void  x265_print_params(x265_param *param);
@@ -32,13 +34,27 @@
 void  x265_param_apply_fastfirstpass(x265_param *p);
 char* x265_param2string(x265_param *param);
 int   x265_atoi(const char *str, bool& bError);
+double x265_atof(const char *str, bool& bError);
 int   parseCpuName(const char *value, bool& bError);
 void  setParamAspectRatio(x265_param *p, int width, int height);
 void  getParamAspectRatio(x265_param *p, int& width, int& height);
 bool  parseLambdaFile(x265_param *param);

 /* this table is kept internal to avoid confusion, since log level indices start at -1 */
-static const char * const logLevelNames[] = { "none", "error", "warning", "info", "frame", "debug", "full", 0 };
+static const char * const logLevelNames[] = { "none", "error", "warning", "info", "debug", "full", 0 };
+
+#if EXPORT_C_API
+#define PARAM_NS
+#else
+/* declare param functions within private namespace */
+void x265_param_free(x265_param *);
+x265_param* x265_param_alloc();
+void x265_param_default(x265_param *param);
+int x265_param_default_preset(x265_param *, const char *preset, const char *tune);
+int x265_param_apply_profile(x265_param *, const char *profile);
+int x265_param_parse(x265_param *p, const char *name, const char *value);
+#define PARAM_NS X265_NS
+#endif

 #define MAXPARAMSIZE 2000
 }
x265_1.7.tar.gz/source/common/piclist.cpp -> x265_1.8.tar.gz/source/common/piclist.cpp
@@ -25,7 +25,7 @@
 #include "piclist.h"
 #include "frame.h"

-using namespace x265;
+using namespace X265_NS;

 void PicList::pushFront(Frame& curFrame)
 {
x265_1.7.tar.gz/source/common/piclist.h -> x265_1.8.tar.gz/source/common/piclist.h
@@ -24,9 +24,10 @@
 #ifndef X265_PICLIST_H
 #define X265_PICLIST_H

-#include <cstdlib>
+#include "common.h"
+
+namespace X265_NS {

-namespace x265 {
 class Frame;

 class PicList
x265_1.7.tar.gz/source/common/picyuv.cpp -> x265_1.8.tar.gz/source/common/picyuv.cpp
@@ -26,7 +26,7 @@ #include "slice.h" #include "primitives.h" -using namespace x265; +using namespace X265_NS; PicYuv::PicYuv() { @@ -148,52 +148,62 @@ padx++; pady++; - if (pic.bitDepth < X265_DEPTH) - { - pixel *yPixel = m_picOrg[0]; - pixel *uPixel = m_picOrg[1]; - pixel *vPixel = m_picOrg[2]; + X265_CHECK(pic.bitDepth >= 8, "pic.bitDepth check failure"); - uint8_t *yChar = (uint8_t*)pic.planes[0]; - uint8_t *uChar = (uint8_t*)pic.planes[1]; - uint8_t *vChar = (uint8_t*)pic.planes[2]; - int shift = X265_MAX(0, X265_DEPTH - pic.bitDepth); - - primitives.planecopy_cp(yChar, pic.stride[0] / sizeof(*yChar), yPixel, m_stride, width, height, shift); - primitives.planecopy_cp(uChar, pic.stride[1] / sizeof(*uChar), uPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift); - primitives.planecopy_cp(vChar, pic.stride[2] / sizeof(*vChar), vPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift); - } - else if (pic.bitDepth == 8) + if (pic.bitDepth == 8) { - pixel *yPixel = m_picOrg[0]; - pixel *uPixel = m_picOrg[1]; - pixel *vPixel = m_picOrg[2]; +#if (X265_DEPTH > 8) + { + pixel *yPixel = m_picOrg[0]; + pixel *uPixel = m_picOrg[1]; + pixel *vPixel = m_picOrg[2]; + + uint8_t *yChar = (uint8_t*)pic.planes[0]; + uint8_t *uChar = (uint8_t*)pic.planes[1]; + uint8_t *vChar = (uint8_t*)pic.planes[2]; + int shift = (X265_DEPTH - 8); + + primitives.planecopy_cp(yChar, pic.stride[0] / sizeof(*yChar), yPixel, m_stride, width, height, shift); + primitives.planecopy_cp(uChar, pic.stride[1] / sizeof(*uChar), uPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift); + primitives.planecopy_cp(vChar, pic.stride[2] / sizeof(*vChar), vPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift); + } +#else /* Case for (X265_DEPTH == 8) */ + // TODO: Does we need this path? 
may merge into above in future
+        {
+            pixel *yPixel = m_picOrg[0];
+            pixel *uPixel = m_picOrg[1];
+            pixel *vPixel = m_picOrg[2];

-        uint8_t *yChar = (uint8_t*)pic.planes[0];
-        uint8_t *uChar = (uint8_t*)pic.planes[1];
-        uint8_t *vChar = (uint8_t*)pic.planes[2];
+            uint8_t *yChar = (uint8_t*)pic.planes[0];
+            uint8_t *uChar = (uint8_t*)pic.planes[1];
+            uint8_t *vChar = (uint8_t*)pic.planes[2];

-        for (int r = 0; r < height; r++)
-        {
-            memcpy(yPixel, yChar, width * sizeof(pixel));
+            for (int r = 0; r < height; r++)
+            {
+                memcpy(yPixel, yChar, width * sizeof(pixel));

-            yPixel += m_stride;
-            yChar += pic.stride[0] / sizeof(*yChar);
-        }
+                yPixel += m_stride;
+                yChar += pic.stride[0] / sizeof(*yChar);
+            }

-        for (int r = 0; r < height >> m_vChromaShift; r++)
-        {
-            memcpy(uPixel, uChar, (width >> m_hChromaShift) * sizeof(pixel));
-            memcpy(vPixel, vChar, (width >> m_hChromaShift) * sizeof(pixel));
+            for (int r = 0; r < height >> m_vChromaShift; r++)
+            {
+                memcpy(uPixel, uChar, (width >> m_hChromaShift) * sizeof(pixel));
+                memcpy(vPixel, vChar, (width >> m_hChromaShift) * sizeof(pixel));

-            uPixel += m_strideC;
-            vPixel += m_strideC;
-            uChar += pic.stride[1] / sizeof(*uChar);
-            vChar += pic.stride[2] / sizeof(*vChar);
+                uPixel += m_strideC;
+                vPixel += m_strideC;
+                uChar += pic.stride[1] / sizeof(*uChar);
+                vChar += pic.stride[2] / sizeof(*vChar);
+            }
         }
+#endif /* (X265_DEPTH > 8) */
     }
     else /* pic.bitDepth > 8 */
     {
+        /* defensive programming, mask off bits that are supposed to be zero */
+        uint16_t mask = (1 << X265_DEPTH) - 1;
+        int shift = abs(pic.bitDepth - X265_DEPTH);
         pixel *yPixel = m_picOrg[0];
         pixel *uPixel = m_picOrg[1];
         pixel *vPixel = m_picOrg[2];
@@ -202,15 +212,20 @@
         uint16_t *uShort = (uint16_t*)pic.planes[1];
         uint16_t *vShort = (uint16_t*)pic.planes[2];

-        /* defensive programming, mask off bits that are supposed to be zero */
-        uint16_t mask = (1 << X265_DEPTH) - 1;
-        int shift = X265_MAX(0, pic.bitDepth - X265_DEPTH);
-
-        /* shift and mask pixels to final size */
-
-        primitives.planecopy_sp(yShort, pic.stride[0] / sizeof(*yShort), yPixel, m_stride, width, height, shift, mask);
-        primitives.planecopy_sp(uShort, pic.stride[1] / sizeof(*uShort), uPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
-        primitives.planecopy_sp(vShort, pic.stride[2] / sizeof(*vShort), vPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
+        if (pic.bitDepth > X265_DEPTH)
+        {
+            /* shift right and mask pixels to final size */
+            primitives.planecopy_sp(yShort, pic.stride[0] / sizeof(*yShort), yPixel, m_stride, width, height, shift, mask);
+            primitives.planecopy_sp(uShort, pic.stride[1] / sizeof(*uShort), uPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
+            primitives.planecopy_sp(vShort, pic.stride[2] / sizeof(*vShort), vPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
+        }
+        else /* Case for (pic.bitDepth <= X265_DEPTH) */
+        {
+            /* shift left and mask pixels to final size */
+            primitives.planecopy_sp_shl(yShort, pic.stride[0] / sizeof(*yShort), yPixel, m_stride, width, height, shift, mask);
+            primitives.planecopy_sp_shl(uShort, pic.stride[1] / sizeof(*uShort), uPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
+            primitives.planecopy_sp_shl(vShort, pic.stride[2] / sizeof(*vShort), vPixel, m_strideC, width >> m_hChromaShift, height >> m_vChromaShift, shift, mask);
+        }
     }

     /* extend the right edge if width was not multiple of the minimum CU size */
@@ -259,7 +274,7 @@
 }
 }

-namespace x265 {
+namespace X265_NS {

 template<uint32_t OUTPUT_BITDEPTH_DIV8>
 static void md5_block(MD5Context& md5, const pixel* plane, uint32_t n)
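The picyuv.cpp hunk above replaces the old unconditional shift-right copy with an explicit dispatch on input depth versus internal depth: pic.bitDepth > X265_DEPTH shifts right (planecopy_sp), anything narrower shifts left (the new planecopy_sp_shl), and both mask the result so bits above X265_DEPTH are provably zero. A minimal standalone sketch of that logic, assuming nothing from x265 beyond what the hunk shows (convert_plane and InternalDepth are made-up names):

    #include <cstdint>
    #include <cstdlib>

    // InternalDepth stands in for X265_DEPTH; srcDepth for pic.bitDepth.
    template<int InternalDepth>
    void convert_plane(const uint16_t* src, ptrdiff_t srcStride,
                       uint16_t* dst, ptrdiff_t dstStride,
                       int width, int height, int srcDepth)
    {
        const uint16_t mask = (1 << InternalDepth) - 1;
        const int shift = std::abs(srcDepth - InternalDepth);
        for (int r = 0; r < height; r++)
        {
            for (int c = 0; c < width; c++)
                dst[c] = (srcDepth > InternalDepth)
                    ? (uint16_t)((src[c] >> shift) & mask)  // planecopy_sp path
                    : (uint16_t)((src[c] << shift) & mask); // planecopy_sp_shl path
            src += srcStride;
            dst += dstStride;
        }
    }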
View file
x265_1.7.tar.gz/source/common/picyuv.h -> x265_1.8.tar.gz/source/common/picyuv.h
Changed
@@ -28,7 +28,7 @@
 #include "md5.h"
 #include "x265.h"

-namespace x265 {
+namespace X265_NS {
 // private namespace

 class ShortYuv;
View file
x265_1.7.tar.gz/source/common/pixel.cpp -> x265_1.8.tar.gz/source/common/pixel.cpp
Changed
@@ -30,7 +30,7 @@
 #include <cstdlib> // abs()

-using namespace x265;
+using namespace X265_NS;

 namespace {
 // place functions in anonymous namespace (file static)
@@ -117,9 +117,9 @@
 }

 template<int lx, int ly, class T1, class T2>
-int sse(const T1* pix1, intptr_t stride_pix1, const T2* pix2, intptr_t stride_pix2)
+sse_ret_t sse(const T1* pix1, intptr_t stride_pix1, const T2* pix2, intptr_t stride_pix2)
 {
-    int sum = 0;
+    sse_ret_t sum = 0;
     int tmp;

     for (int y = 0; y < ly; y++)
@@ -159,7 +159,7 @@
     return (a + s) ^ s;
 }

-int satd_4x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+static int satd_4x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
 {
     sum2_t tmp[4][2];
     sum2_t a0, a1, a2, a3, b0, b1;
@@ -219,7 +219,7 @@
 }

 // x264's SWAR version of satd 8x4, performs two 4x4 SATDs at once
-int satd_8x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
+static int satd_8x4(const pixel* pix1, intptr_t stride_pix1, const pixel* pix2, intptr_t stride_pix2)
 {
     sum2_t tmp[4][4];
     sum2_t a0, a1, a2, a3;
@@ -308,7 +308,7 @@
     return (int)sum;
 }

-int sa8d_8x8(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
+inline int sa8d_8x8(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
 {
     return (int)((_sa8d_8x8(pix1, i_pix1, pix2, i_pix2) + 2) >> 2);
 }
@@ -359,12 +359,12 @@
     return (int)sum;
 }

-int sa8d_8x8(const int16_t* pix1, intptr_t i_pix1)
+static int sa8d_8x8(const int16_t* pix1, intptr_t i_pix1)
 {
     return (int)((_sa8d_8x8(pix1, i_pix1) + 2) >> 2);
 }

-int sa8d_16x16(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
+static int sa8d_16x16(const pixel* pix1, intptr_t i_pix1, const pixel* pix2, intptr_t i_pix2)
 {
     int sum = _sa8d_8x8(pix1, i_pix1, pix2, i_pix2)
         + _sa8d_8x8(pix1 + 8, i_pix1, pix2 + 8, i_pix2)
@@ -516,7 +516,7 @@
             dst[k * blockSize + l] = src[l * stride + k];
 }

-void weight_sp_c(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset)
+static void weight_sp_c(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset)
 {
     int x, y;
@@ -541,7 +541,7 @@
     }
 }

-void weight_pp_c(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset)
+static void weight_pp_c(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset)
 {
     int x, y;
@@ -582,7 +582,7 @@
     }
 }

-void scale1D_128to64(pixel *dst, const pixel *src)
+static void scale1D_128to64(pixel *dst, const pixel *src)
 {
     int x;
     const pixel* src1 = src;
@@ -608,7 +608,7 @@
     }
 }

-void scale2D_64to32(pixel* dst, const pixel* src, intptr_t stride)
+static void scale2D_64to32(pixel* dst, const pixel* src, intptr_t stride)
 {
     uint32_t x, y;
@@ -627,6 +627,7 @@
     }
 }

+static
 void frame_init_lowres_core(const pixel* src0, pixel* dst0, pixel* dsth, pixel* dstv, pixel* dstc,
                             intptr_t src_stride, intptr_t dst_stride, int width, int height)
 {
@@ -653,7 +654,7 @@
 }

 /* structural similarity metric */
-void ssim_4x4x2_core(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums[2][4])
+static void ssim_4x4x2_core(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums[2][4])
 {
     for (int z = 0; z < 2; z++)
     {
@@ -681,7 +682,7 @@
     }
 }

-float ssim_end_1(int s1, int s2, int ss, int s12)
+static float ssim_end_1(int s1, int s2, int ss, int s12)
 {
     /* Maximum value for 10-bit is: ss*64 = (2^10-1)^2*16*4*64 = 4286582784, which will overflow in some cases.
      * s1*s1, s2*s2, and s1*s2 also obtain this value for edge cases: ((2^10-1)*16*4)^2 = 4286582784.
@@ -689,7 +690,7 @@
 #define PIXEL_MAX ((1 << X265_DEPTH) - 1)
 #if HIGH_BIT_DEPTH
-    X265_CHECK(X265_DEPTH == 10, "ssim invalid depth\n");
+    X265_CHECK((X265_DEPTH == 10) || (X265_DEPTH == 12), "ssim invalid depth\n");
 #define type float
     static const float ssim_c1 = (float)(.01 * .01 * PIXEL_MAX * PIXEL_MAX * 64);
     static const float ssim_c2 = (float)(.03 * .03 * PIXEL_MAX * PIXEL_MAX * 64 * 63);
@@ -711,7 +712,7 @@
 #undef PIXEL_MAX
 }

-float ssim_end_4(int sum0[5][4], int sum1[5][4], int width)
+static float ssim_end_4(int sum0[5][4], int sum1[5][4], int width)
 {
     float ssim = 0.0;
@@ -920,7 +921,7 @@
     }
 }

-void planecopy_cp_c(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift)
+static void planecopy_cp_c(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift)
 {
     for (int r = 0; r < height; r++)
     {
@@ -932,7 +933,7 @@
     }
 }

-void planecopy_sp_c(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask)
+static void planecopy_sp_c(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask)
 {
     for (int r = 0; r < height; r++)
     {
@@ -944,9 +945,21 @@
     }
 }

+static void planecopy_sp_shl_c(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask)
+{
+    for (int r = 0; r < height; r++)
+    {
+        for (int c = 0; c < width; c++)
+            dst[c] = (pixel)((src[c] << shift) & mask);
+
+        dst += dstStride;
+        src += srcStride;
+    }
+}
+
 /* Estimate the total amount of influence on future quality that could be had if we
  * were to improve the reference samples used to inter predict any given CU. */
-void estimateCUPropagateCost(int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts,
+static void estimateCUPropagateCost(int* dst, const uint16_t* propagateIn, const int32_t* intraCosts, const uint16_t* interCosts,
                              const int32_t* invQscales, const double* fpsFactor, int len)
 {
     double fps = *fpsFactor / 256;
@@ -962,7 +975,7 @@
 }
 } // end anonymous namespace

-namespace x265 {
+namespace X265_NS {
 // x265 private namespace

 /* Extend the edges of a picture so that it may safely be used for motion
@@ -1244,6 +1257,7 @@

     p.planecopy_cp = planecopy_cp_c;
     p.planecopy_sp = planecopy_sp_c;
+    p.planecopy_sp_shl = planecopy_sp_shl_c;
     p.propagateCost = estimateCUPropagateCost;
 }
 }
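The sse() template above now returns sse_ret_t instead of int. The motivation is the same overflow already noted in the ssim_end_1() comment: with the new 12-bit support, a worst-case squared-error sum no longer fits in 32 bits at all. A quick bound check (standalone arithmetic, not x265 code):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        const int64_t maxDiff = (1 << 12) - 1;               // 4095 at Main12
        const int64_t worst = 64LL * 64 * maxDiff * maxDiff; // one 64x64 CU
        printf("worst 12-bit 64x64 SSE = %lld\n", (long long)worst); // ~6.87e10
        printf("UINT32_MAX             = %u\n", UINT32_MAX);         // ~4.29e9
        return 0;
    }

so sse_ret_t is presumably widened to 64 bits in high-bit-depth builds (its typedef lives outside this excerpt).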
View file
x265_1.7.tar.gz/source/common/predict.cpp -> x265_1.8.tar.gz/source/common/predict.cpp
Changed
@@ -28,7 +28,7 @@
 #include "predict.h"
 #include "primitives.h"

-using namespace x265;
+using namespace X265_NS;

 #if _MSC_VER
 #pragma warning(disable: 4127) // conditional expression is constant
@@ -776,30 +776,17 @@
     // Fill left & below-left samples
     adiTemp += picStride;
     adi--;
-    pNeighborFlags--;

-    for (int j = 0; j < leftUnits; j++)
+    // NOTE: over copy here, but reduce condition operators
+    for (int j = 0; j < leftUnits * unitHeight; j++)
     {
-        if (*pNeighborFlags)
-            for (int i = 0; i < unitHeight; i++)
-                adi[-i] = adiTemp[i * picStride];
-
-        adiTemp += unitHeight * picStride;
-        adi -= unitHeight;
-        pNeighborFlags--;
+        adi[-j] = adiTemp[j * picStride];
     }

     // Fill above & above-right samples
     adiTemp = adiOrigin - picStride;
     adi = adiLineBuffer + (leftUnits * unitHeight) + unitWidth;
-    pNeighborFlags = bNeighborFlags + leftUnits + 1;
-    for (int j = 0; j < aboveUnits; j++)
-    {
-        if (*pNeighborFlags)
-            memcpy(adi, adiTemp, unitWidth * sizeof(*adiTemp));
-        adiTemp += unitWidth;
-        adi += unitWidth;
-        pNeighborFlags++;
-    }
+    // NOTE: over copy here, but reduce condition operators
+    memcpy(adi, adiTemp, aboveUnits * unitWidth * sizeof(*adiTemp));

     // Pad reference samples when necessary
     int curr = 0;
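The predict.cpp change above drops the per-unit availability tests in favor of a deliberate "over copy": the whole left and above reference spans are copied unconditionally, and samples from unavailable neighbours are overwritten by the "Pad reference samples" pass that follows. A standalone sketch of the trade-off with simplified types (fill_checked and fill_overcopy are illustrative names, not x265 functions):

    #include <cstdint>
    #include <cstring>

    typedef uint8_t pixel;

    // Before: one availability test (and branch) per neighbour unit.
    void fill_checked(pixel* adi, const pixel* src, const bool* avail,
                      int units, int unitWidth)
    {
        for (int j = 0; j < units; j++, adi += unitWidth, src += unitWidth)
            if (avail[j])
                memcpy(adi, src, unitWidth * sizeof(pixel));
    }

    // After: a single branch-free copy; whatever lands in unavailable
    // units is corrected by the later padding step in the real code.
    void fill_overcopy(pixel* adi, const pixel* src, int units, int unitWidth)
    {
        memcpy(adi, src, (size_t)units * unitWidth * sizeof(pixel));
    }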
View file
x265_1.7.tar.gz/source/common/predict.h -> x265_1.8.tar.gz/source/common/predict.h
Changed
@@ -30,7 +30,7 @@
 #include "shortyuv.h"
 #include "yuv.h"

-namespace x265 {
+namespace X265_NS {

 class CUData;
 class Slice;
View file
x265_1.7.tar.gz/source/common/primitives.cpp -> x265_1.8.tar.gz/source/common/primitives.cpp
Changed
@@ -24,7 +24,7 @@
 #include "common.h"
 #include "primitives.h"

-namespace x265 {
+namespace X265_NS {
 // x265 private namespace

 extern const uint8_t lumaPartitionMapTable[] =
@@ -56,6 +56,7 @@
 void setupFilterPrimitives_c(EncoderPrimitives &p);
 void setupIntraPrimitives_c(EncoderPrimitives &p);
 void setupLoopFilterPrimitives_c(EncoderPrimitives &p);
+void setupSaoPrimitives_c(EncoderPrimitives &p);

 void setupCPrimitives(EncoderPrimitives &p)
 {
@@ -64,6 +65,7 @@
     setupFilterPrimitives_c(p);      // ipfilter.cpp
     setupIntraPrimitives_c(p);       // intrapred.cpp
     setupLoopFilterPrimitives_c(p);  // loopfilter.cpp
+    setupSaoPrimitives_c(p);         // sao.cpp
 }

 void setupAliasPrimitives(EncoderPrimitives &p)
@@ -72,7 +74,7 @@
     /* at HIGH_BIT_DEPTH, pixel == short so we can alias many primitives */
     for (int i = 0; i < NUM_CU_SIZES; i++)
     {
-        p.cu[i].sse_pp = (pixelcmp_t)p.cu[i].sse_ss;
+        p.cu[i].sse_pp = (pixel_sse_t)p.cu[i].sse_ss;
         p.cu[i].copy_ps = (copy_ps_t)p.pu[i].copy_pp;
         p.cu[i].copy_sp = (copy_sp_t)p.pu[i].copy_pp;
@@ -185,62 +187,36 @@
     p.chroma[X265_CSP_I422].cu[BLOCK_422_2x4].sse_pp = NULL;
 }
-}
-using namespace x265;
-/* cpuid >= 0 - force CPU type
- * cpuid < 0 - auto-detect if uninitialized */
-void x265_setup_primitives(x265_param *param, int cpuid)
+void x265_report_simd(x265_param* param)
 {
-    if (cpuid < 0)
-        cpuid = x265::cpu_detect();
-
-    // initialize global variables
-    if (!primitives.pu[0].sad)
-    {
-        setupCPrimitives(primitives);
-
-        /* We do not want the encoder to use the un-optimized intra all-angles
-         * C references. It is better to call the individual angle functions
-         * instead. We must check for NULL before using this primitive */
-        for (int i = 0; i < NUM_TR_SIZE; i++)
-            primitives.cu[i].intra_pred_allangs = NULL;
-
-#if ENABLE_ASSEMBLY
-        setupInstrinsicPrimitives(primitives, cpuid);
-        setupAssemblyPrimitives(primitives, cpuid);
-#else
-        x265_log(param, X265_LOG_WARNING, "Assembly not supported in this binary\n");
-#endif
-
-        setupAliasPrimitives(primitives);
-    }
-
     if (param->logLevel >= X265_LOG_INFO)
     {
+        int cpuid = param->cpuid;
+
        char buf[1000];
        char *p = buf + sprintf(buf, "using cpu capabilities:");
        char *none = p;
-        for (int i = 0; x265::cpu_names[i].flags; i++)
+        for (int i = 0; X265_NS::cpu_names[i].flags; i++)
        {
-            if (!strcmp(x265::cpu_names[i].name, "SSE")
+            if (!strcmp(X265_NS::cpu_names[i].name, "SSE")
                && (cpuid & X265_CPU_SSE2))
                continue;
-            if (!strcmp(x265::cpu_names[i].name, "SSE2")
+            if (!strcmp(X265_NS::cpu_names[i].name, "SSE2")
                && (cpuid & (X265_CPU_SSE2_IS_FAST | X265_CPU_SSE2_IS_SLOW)))
                continue;
-            if (!strcmp(x265::cpu_names[i].name, "SSE3")
+            if (!strcmp(X265_NS::cpu_names[i].name, "SSE3")
                && (cpuid & X265_CPU_SSSE3 || !(cpuid & X265_CPU_CACHELINE_64)))
                continue;
-            if (!strcmp(x265::cpu_names[i].name, "SSE4.1")
+            if (!strcmp(X265_NS::cpu_names[i].name, "SSE4.1")
                && (cpuid & X265_CPU_SSE42))
                continue;
-            if (!strcmp(x265::cpu_names[i].name, "BMI1")
+            if (!strcmp(X265_NS::cpu_names[i].name, "BMI1")
                && (cpuid & X265_CPU_BMI2))
                continue;
-            if ((cpuid & x265::cpu_names[i].flags) == x265::cpu_names[i].flags
-                && (!i || x265::cpu_names[i].flags != x265::cpu_names[i - 1].flags))
-                p += sprintf(p, " %s", x265::cpu_names[i].name);
+            if ((cpuid & X265_NS::cpu_names[i].flags) == X265_NS::cpu_names[i].flags
+                && (!i || X265_NS::cpu_names[i].flags != X265_NS::cpu_names[i - 1].flags))
+                p += sprintf(p, " %s", X265_NS::cpu_names[i].name);
        }

        if (p == none)
@@ -249,14 +225,40 @@
     }
 }

+void x265_setup_primitives(x265_param *param)
+{
+    if (!primitives.pu[0].sad)
+    {
+        setupCPrimitives(primitives);
+
+        /* We do not want the encoder to use the un-optimized intra all-angles
+         * C references. It is better to call the individual angle functions
+         * instead. We must check for NULL before using this primitive */
+        for (int i = 0; i < NUM_TR_SIZE; i++)
+            primitives.cu[i].intra_pred_allangs = NULL;
+
+#if ENABLE_ASSEMBLY
+        setupInstrinsicPrimitives(primitives, param->cpuid);
+        setupAssemblyPrimitives(primitives, param->cpuid);
+#endif
+
+        setupAliasPrimitives(primitives);
+    }
+
+    x265_report_simd(param);
+}
+}
+
 #if ENABLE_ASSEMBLY
 /* these functions are implemented in assembly. When assembly is not being
  * compiled, they are unnecessary and can be NOPs */
 #else
 extern "C" {
-int x265_cpu_cpuid_test(void) { return 0; }
-void x265_cpu_emms(void) {}
-void x265_cpu_cpuid(uint32_t, uint32_t *eax, uint32_t *, uint32_t *, uint32_t *) { *eax = 0; }
-void x265_cpu_xgetbv(uint32_t, uint32_t *, uint32_t *) {}
+int PFX(cpu_cpuid_test)(void) { return 0; }
+void PFX(cpu_emms)(void) {}
+void PFX(cpu_cpuid)(uint32_t, uint32_t *eax, uint32_t *, uint32_t *, uint32_t *) { *eax = 0; }
+void PFX(cpu_xgetbv)(uint32_t, uint32_t *, uint32_t *) {}
+void PFX(cpu_neon_test)(void) {}
+int PFX(cpu_fast_neon_mrc_test)(void) { return 0; }
 }
 #endif
View file
x265_1.7.tar.gz/source/common/primitives.h -> x265_1.8.tar.gz/source/common/primitives.h
Changed
@@ -33,7 +33,7 @@
 #include "common.h"
 #include "cpu.h"

-namespace x265 {
+namespace X265_NS {
 // x265 private namespace

 enum LumaPU
@@ -112,6 +112,8 @@
 typedef int  (*pixelcmp_t)(const pixel* fenc, intptr_t fencstride, const pixel* fref, intptr_t frefstride); // fenc is aligned
 typedef int  (*pixelcmp_ss_t)(const int16_t* fenc, intptr_t fencstride, const int16_t* fref, intptr_t frefstride);
+typedef sse_ret_t (*pixel_sse_t)(const pixel* fenc, intptr_t fencstride, const pixel* fref, intptr_t frefstride); // fenc is aligned
+typedef sse_ret_t (*pixel_sse_ss_t)(const int16_t* fenc, intptr_t fencstride, const int16_t* fref, intptr_t frefstride);
 typedef int  (*pixel_ssd_s_t)(const int16_t* fenc, intptr_t fencstride);
 typedef void (*pixelcmp_x4_t)(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, const pixel* fref3, intptr_t frefstride, int32_t* res);
 typedef void (*pixelcmp_x3_t)(const pixel* fenc, const pixel* fref0, const pixel* fref1, const pixel* fref2, intptr_t frefstride, int32_t* res);
@@ -173,6 +175,13 @@
 typedef void (*saoCuOrgE2_t)(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
 typedef void (*saoCuOrgE3_t)(pixel* rec, int8_t* upBuff1, int8_t* m_offsetEo, intptr_t stride, int startX, int endX);
 typedef void (*saoCuOrgB0_t)(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride);
+
+typedef void (*saoCuStatsBO_t)(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count);
+typedef void (*saoCuStatsE0_t)(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count);
+typedef void (*saoCuStatsE1_t)(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count);
+typedef void (*saoCuStatsE2_t)(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int8_t *upBuff, int endX, int endY, int32_t *stats, int32_t *count);
+typedef void (*saoCuStatsE3_t)(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count);
+
 typedef void (*sign_t)(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
 typedef void (*planecopy_cp_t) (const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift);
 typedef void (*planecopy_sp_t) (const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask);
@@ -182,6 +191,10 @@
 typedef int (*scanPosLast_t)(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize);
 typedef uint32_t (*findPosFirstLast_t)(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]);
+typedef uint32_t (*costCoeffNxN_t)(const uint16_t *scan, const coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, const uint8_t *tabSigCtx, uint32_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase);
+typedef uint32_t (*costCoeffRemain_t)(uint16_t *absCoeff, int numNonZero, int idx);
+typedef uint32_t (*costC1C2Flag_t)(uint16_t *absCoeff, intptr_t numC1Flag, uint8_t *baseCtxMod, intptr_t ctxOffset);
+
 /* Function pointers to optimized encoder primitives. Each pointer can reference
  * either an assembly routine, a SIMD intrinsic primitive, or a C function */
 struct EncoderPrimitives
@@ -242,8 +255,9 @@
         copy_pp_t       copy_pp;       // alias to pu[].copy_pp
         var_t           var;           // block internal variance
-        pixelcmp_t      sse_pp;        // Sum of Square Error (pixel, pixel) fenc alignment not assumed
-        pixelcmp_ss_t   sse_ss;        // Sum of Square Error (short, short) fenc alignment not assumed
+
+        pixel_sse_t     sse_pp;        // Sum of Square Error (pixel, pixel) fenc alignment not assumed
+        pixel_sse_ss_t  sse_ss;        // Sum of Square Error (short, short) fenc alignment not assumed
         pixelcmp_t      psy_cost_pp;   // difference in AC energy between two pixel blocks
         pixelcmp_ss_t   psy_cost_ss;   // difference in AC energy between two signed residual blocks
         pixel_ssd_s_t   ssd_s;         // Sum of Square Error (residual coeff to self)
@@ -289,12 +303,19 @@
     saoCuOrgE3_t          saoCuOrgE3[2];
     saoCuOrgB0_t          saoCuOrgB0;
+
+    saoCuStatsBO_t        saoCuStatsBO;
+    saoCuStatsE0_t        saoCuStatsE0;
+    saoCuStatsE1_t        saoCuStatsE1;
+    saoCuStatsE2_t        saoCuStatsE2;
+    saoCuStatsE3_t        saoCuStatsE3;
+
     downscale_t           frameInitLowres;
     cutree_propagate_cost propagateCost;
     extendCURowBorder_t   extendRowBorder;
     planecopy_cp_t        planecopy_cp;
     planecopy_sp_t        planecopy_sp;
+    planecopy_sp_t        planecopy_sp_shl;
     weightp_sp_t          weight_sp;
     weightp_pp_t          weight_pp;
@@ -303,6 +324,11 @@
     scanPosLast_t         scanPosLast;
     findPosFirstLast_t    findPosFirstLast;
+    costCoeffNxN_t        costCoeffNxN;
+    costCoeffRemain_t     costCoeffRemain;
+    costC1C2Flag_t        costC1C2Flag;
+
+
     /* There is one set of chroma primitives per color space. An encoder will
      * have just a single color space and thus it will only ever use one entry
      * in this array. However we always fill all entries in the array in case
@@ -335,7 +361,7 @@
     struct CUChroma
     {
         pixelcmp_t     sa8d;    // if chroma CU is not multiple of 8x8, will use satd
-        pixelcmp_t     sse_pp;
+        pixel_sse_t    sse_pp;
         pixel_sub_ps_t sub_ps;
         pixel_add_ps_t add_ps;
@@ -377,4 +403,10 @@
 void setupAliasPrimitives(EncoderPrimitives &p);
 }

+#if !EXPORT_C_API
+extern const int PFX(max_bit_depth);
+extern const char* PFX(version_str);
+extern const char* PFX(build_info_str);
+#endif
+
 #endif // ifndef X265_PRIMITIVES_H
View file
x265_1.7.tar.gz/source/common/quant.cpp -> x265_1.8.tar.gz/source/common/quant.cpp
Changed
@@ -30,7 +30,7 @@
 #include "cudata.h"
 #include "contexts.h"

-using namespace x265;
+using namespace X265_NS;

 #define SIGN(x,y) ((x^(y >> 31))-(y >> 31))
@@ -204,7 +204,6 @@
     m_resiDctCoeff = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE * 2);
     m_fencDctCoeff = m_resiDctCoeff + (MAX_TR_SIZE * MAX_TR_SIZE);
     m_fencShortBuf = X265_MALLOC(int16_t, MAX_TR_SIZE * MAX_TR_SIZE);
-    m_tqBypass = false;

     return m_resiDctCoeff && m_fencShortBuf;
 }
@@ -228,9 +227,6 @@

 void Quant::setQPforQuant(const CUData& ctu, int qp)
 {
-    m_tqBypass = !!ctu.m_tqBypass[0];
-    if (m_tqBypass)
-        return;
     m_nr = m_frameNr ? &m_frameNr[ctu.m_encData->m_frameEncoderID] : NULL;
     m_qpParam[TEXT_LUMA].setQpParam(qp + QP_BD_OFFSET);
     setChromaQP(qp + ctu.m_slice->m_pps->chromaQpOffset[0], TEXT_CHROMA_U, ctu.m_chromaFormat);
@@ -251,30 +247,63 @@
 }

 /* To minimize the distortion only. No rate is considered */
-uint32_t Quant::signBitHidingHDQ(int16_t* coeff, int32_t* deltaU, uint32_t numSig, const TUEntropyCodingParameters &codeParams)
+uint32_t Quant::signBitHidingHDQ(int16_t* coeff, int32_t* deltaU, uint32_t numSig, const TUEntropyCodingParameters &codeParams, uint32_t log2TrSize)
 {
-    const uint32_t log2TrSizeCG = codeParams.log2TrSizeCG;
+    uint32_t trSize = 1 << log2TrSize;
     const uint16_t* scan = codeParams.scan;
-    bool lastCG = true;

-    for (int cg = (1 << (log2TrSizeCG * 2)) - 1; cg >= 0; cg--)
+    uint8_t coeffNum[MLS_GRP_NUM];      // value range[0, 16]
+    uint16_t coeffSign[MLS_GRP_NUM];    // bit mask map for non-zero coeff sign
+    uint16_t coeffFlag[MLS_GRP_NUM];    // bit mask map for non-zero coeff
+
+#if CHECKED_BUILD || _DEBUG
+    // clean output buffer, the asm version of scanPosLast Never output anything after latest non-zero coeff group
+    memset(coeffNum, 0, sizeof(coeffNum));
+    memset(coeffSign, 0, sizeof(coeffNum));
+    memset(coeffFlag, 0, sizeof(coeffNum));
+#endif
+    const int lastScanPos = primitives.scanPosLast(codeParams.scan, coeff, coeffSign, coeffFlag, coeffNum, numSig, g_scan4x4[codeParams.scanType], trSize);
+    const int cgLastScanPos = (lastScanPos >> LOG2_SCAN_SET_SIZE);
+    unsigned long tmp;
+
+    // first CG need specially processing
+    const uint32_t correctOffset = 0x0F & (lastScanPos ^ 0xF);
+    coeffFlag[cgLastScanPos] <<= correctOffset;
+
+    for (int cg = cgLastScanPos; cg >= 0; cg--)
     {
         int cgStartPos = cg << LOG2_SCAN_SET_SIZE;
         int n;

+#if CHECKED_BUILD || _DEBUG
         for (n = SCAN_SET_SIZE - 1; n >= 0; --n)
             if (coeff[scan[n + cgStartPos]])
                 break;
-        if (n < 0)
-            continue;
+        int lastNZPosInCG0 = n;
+#endif

-        int lastNZPosInCG = n;
+        if (coeffNum[cg] == 0)
+        {
+            X265_CHECK(lastNZPosInCG0 < 0, "all zero block check failure\n");
+            continue;
+        }

+#if CHECKED_BUILD || _DEBUG
         for (n = 0;; n++)
             if (coeff[scan[n + cgStartPos]])
                 break;

-        int firstNZPosInCG = n;
+        int firstNZPosInCG0 = n;
+#endif
+
+        CLZ(tmp, coeffFlag[cg]);
+        const int firstNZPosInCG = (15 ^ tmp);
+
+        CTZ(tmp, coeffFlag[cg]);
+        const int lastNZPosInCG = (15 ^ tmp);
+
+        X265_CHECK(firstNZPosInCG0 == firstNZPosInCG, "firstNZPosInCG0 check failure\n");
+        X265_CHECK(lastNZPosInCG0 == lastNZPosInCG, "lastNZPosInCG0 check failure\n");

         if (lastNZPosInCG - firstNZPosInCG >= SBH_THRESHOLD)
         {
@@ -287,12 +316,17 @@
             if (signbit != (absSum & 0x1)) // compare signbit with sum_parity
             {
                 int minCostInc = MAX_INT,  minPos = -1, curCost = MAX_INT;
-                int16_t finalChange = 0, curChange = 0;
+                int32_t finalChange = 0, curChange = 0;
+                uint32_t cgFlags = coeffFlag[cg];
+                if (cg == cgLastScanPos)
+                    cgFlags >>= correctOffset;

-                for (n = (lastCG ? lastNZPosInCG : SCAN_SET_SIZE - 1); n >= 0; --n)
+                for (n = (cg == cgLastScanPos ? lastNZPosInCG : SCAN_SET_SIZE - 1); n >= 0; --n)
                 {
                     uint32_t blkPos = scan[n + cgStartPos];
-                    if (coeff[blkPos])
+                    X265_CHECK(!!coeff[blkPos] == !!(cgFlags & 1), "non zero coeff check failure\n");
+
+                    if (cgFlags & 1)
                     {
                         if (deltaU[blkPos] > 0)
                         {
@@ -301,8 +335,11 @@
                         }
                         else
                         {
-                            if (n == firstNZPosInCG && abs(coeff[blkPos]) == 1)
+                            if ((cgFlags == 1) && (abs(coeff[blkPos]) == 1))
+                            {
+                                X265_CHECK(n == firstNZPosInCG, "firstNZPosInCG position check failure\n");
                                 curCost = MAX_INT;
+                            }
                             else
                             {
                                 curCost = deltaU[blkPos];
@@ -312,8 +349,9 @@
                     }
                     else
                     {
-                        if (n < firstNZPosInCG)
+                        if (cgFlags == 0)
                         {
+                            X265_CHECK(n < firstNZPosInCG, "firstNZPosInCG position check failure\n");
                             uint32_t thisSignBit = m_resiDctCoeff[blkPos] >= 0 ? 0 : 1;
                             if (thisSignBit != signbit)
                                 curCost = MAX_INT;
@@ -336,6 +374,7 @@
                         finalChange = curChange;
                         minPos = blkPos;
                     }
+                    cgFlags>>=1;
                 }

                 /* do not allow change to violate coeff clamp */
@@ -347,14 +386,12 @@
                 else if (finalChange == -1 && abs(coeff[minPos]) == 1)
                     numSig--;

-                if (m_resiDctCoeff[minPos] >= 0)
-                    coeff[minPos] += finalChange;
-                else
-                    coeff[minPos] -= finalChange;
+                {
+                    const int16_t sigMask = ((int16_t)m_resiDctCoeff[minPos]) >> 15;
+                    coeff[minPos] += ((int16_t)finalChange ^ sigMask) - sigMask;
+                }
             }
         }
-
-        lastCG = false;
     }

     return numSig;
@@ -364,7 +401,8 @@
                              coeff_t* coeff, uint32_t log2TrSize, TextType ttype, uint32_t absPartIdx, bool useTransformSkip)
 {
     const uint32_t sizeIdx = log2TrSize - 2;
-    if (m_tqBypass)
+
+    if (cu.m_tqBypass[0])
     {
         X265_CHECK(log2TrSize >= 2 && log2TrSize <= 5, "Block size mistake!\n");
         return primitives.cu[sizeIdx].copy_cnt(coeff, residual, resiStride);
     }
@@ -437,18 +475,19 @@
     {
         TUEntropyCodingParameters codeParams;
         cu.getTUEntropyCodingParameters(codeParams, absPartIdx, log2TrSize, isLuma);
-        return signBitHidingHDQ(coeff, deltaU, numSig, codeParams);
+        return signBitHidingHDQ(coeff, deltaU, numSig, codeParams, log2TrSize);
     }
     else
         return numSig;
 }
 }

-void Quant::invtransformNxN(int16_t* residual, uint32_t resiStride, const coeff_t* coeff,
+void Quant::invtransformNxN(const CUData& cu, int16_t* residual, uint32_t resiStride, const coeff_t* coeff,
                             uint32_t log2TrSize, TextType ttype, bool bIntra, bool useTransformSkip, uint32_t numSig)
 {
     const uint32_t sizeIdx = log2TrSize - 2;
-    if (m_tqBypass)
+
+    if (cu.m_tqBypass[0])
     {
         primitives.cu[sizeIdx].cpy1Dto2D_shl(residual, coeff, resiStride, 0);
         return;
@@ -546,7 +585,7 @@
 #define UNQUANT(lvl)    (((lvl) * (unquantScale[blkPos] << per) + unquantRound) >> unquantShift)
 #define SIGCOST(bits)   ((lambda2 * (bits)) >> 8)
 #define RDCOST(d, bits) ((((int64_t)d * d) << scaleBits) + SIGCOST(bits))
-#define PSYVALUE(rec)   ((psyScale * (rec)) >> (2 * transformShift + 1))
+#define PSYVALUE(rec)   ((psyScale * (rec)) >> X265_MAX(0, (2 * transformShift + 1)))

     int64_t costCoeff[32 * 32];   /* d*d + lambda * bits */
     int64_t costUncoded[32 * 32]; /* d*d + lambda * 0 */
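The rewritten signBitHidingHDQ() above no longer rescans coefficients to locate the first and last non-zero position in each 4x4 group: scanPosLast() fills coeffFlag[] with one occupancy bit per scan position, and the CLZ/CTZ macros (defined in the threading.h hunk further down, expanding to GCC builtins on that path) recover both positions in constant time. A runnable illustration of the convention the code implies, where bit 15 corresponds to the earliest scan position:

    #include <cstdio>

    int main()
    {
        unsigned flags = 0x0A10; // non-zero coeffs at scan positions 4, 6, 11
        unsigned long tmp;

        tmp = (unsigned long)__builtin_clz(flags) ^ 31; // what CLZ(tmp, x) does
        int firstNZ = 15 ^ (int)tmp;                    // 4: earliest non-zero

        tmp = (unsigned long)__builtin_ctz(flags);      // what CTZ(tmp, x) does
        int lastNZ = 15 ^ (int)tmp;                     // 11: latest non-zero

        printf("first=%d last=%d\n", firstNZ, lastNZ);
        return 0;
    }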
View file
x265_1.7.tar.gz/source/common/quant.h -> x265_1.8.tar.gz/source/common/quant.h
Changed
@@ -28,7 +28,7 @@
 #include "scalinglist.h"
 #include "contexts.h"

-namespace x265 {
+namespace X265_NS {
 // private namespace

 class CUData;
@@ -41,7 +41,7 @@
     int per;
     int qp;
     int64_t lambda2; /* FIX8 */
-    int32_t lambda;  /* FIX8, dynamic range is 18-bits in 8bpp and 20-bits in 16bpp */
+    int32_t lambda;  /* FIX8, dynamic range is 18-bits in Main and 20-bits in Main10 */

     QpParam() : qp(MAX_INT) {}
@@ -68,9 +68,9 @@
     /* 0 = luma 4x4, 1 = luma 8x8, 2 = luma 16x16, 3 = luma 32x32
      * 4 = chroma 4x4, 5 = chroma 8x8, 6 = chroma 16x16, 7 = chroma 32x32
      * Intra 0..7 - Inter 8..15 */
-    uint16_t offsetDenoise[MAX_NUM_TR_CATEGORIES][MAX_NUM_TR_COEFFS];
-    uint32_t residualSum[MAX_NUM_TR_CATEGORIES][MAX_NUM_TR_COEFFS];
+    ALIGN_VAR_16(uint32_t, residualSum[MAX_NUM_TR_CATEGORIES][MAX_NUM_TR_COEFFS]);
     uint32_t count[MAX_NUM_TR_CATEGORIES];
+    uint16_t offsetDenoise[MAX_NUM_TR_CATEGORIES][MAX_NUM_TR_COEFFS];
 };

 class Quant
@@ -94,7 +94,6 @@
     NoiseReduction*    m_nr;
     NoiseReduction*    m_frameNr; // Array of NR structures, one for each frameEncoder
-    bool               m_tqBypass;

     Quant();
     ~Quant();
@@ -109,7 +108,7 @@
     uint32_t transformNxN(const CUData& cu, const pixel* fenc, uint32_t fencStride, const int16_t* residual, uint32_t resiStride,
                           coeff_t* coeff, uint32_t log2TrSize, TextType ttype, uint32_t absPartIdx, bool useTransformSkip);

-    void invtransformNxN(int16_t* residual, uint32_t resiStride, const coeff_t* coeff,
+    void invtransformNxN(const CUData& cu, int16_t* residual, uint32_t resiStride, const coeff_t* coeff,
                          uint32_t log2TrSize, TextType ttype, bool bIntra, bool useTransformSkip, uint32_t numSig);

     /* Pattern decision for context derivation process of significant_coeff_flag */
@@ -126,9 +125,9 @@
         const uint32_t sigPos = (uint32_t)(sigCoeffGroupFlag64 >> (cgBlkPos + 1)); // just need lowest 7-bits valid

         // TODO: instruction BT is faster, but _bittest64 still generate instruction 'BT m, r' in VS2012
-        const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & (sigPos & 1);
-        const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 2)) & 2;
-        return sigRight + sigLower;
+        const uint32_t sigRight = ((uint32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos;
+        const uint32_t sigLower = ((uint32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1));
+        return sigRight + sigLower * 2;
     }

     /* Context derivation process of coeff_abs_significant_flag */
@@ -137,10 +136,10 @@
         X265_CHECK(cgBlkPos < 64, "cgBlkPos is too large\n");
         // NOTE: unsafe shift operator, see NOTE in calcPatternSigCtx
         const uint32_t sigPos = (uint32_t)(cgGroupMask >> (cgBlkPos + 1)); // just need lowest 8-bits valid
-        const uint32_t sigRight = ((int32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos;
-        const uint32_t sigLower = ((int32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1));
+        const uint32_t sigRight = ((uint32_t)(cgPosX - (trSizeCG - 1)) >> 31) & sigPos;
+        const uint32_t sigLower = ((uint32_t)(cgPosY - (trSizeCG - 1)) >> 31) & (sigPos >> (trSizeCG - 1));

-        return (sigRight | sigLower) & 1;
+        return (sigRight | sigLower);
    }

     /* static methods shared with entropy.cpp */
@@ -150,7 +149,7 @@

     void setChromaQP(int qpin, TextType ttype, int chFmt);

-    uint32_t signBitHidingHDQ(int16_t* qcoeff, int32_t* deltaU, uint32_t numSig, const TUEntropyCodingParameters &codingParameters);
+    uint32_t signBitHidingHDQ(int16_t* qcoeff, int32_t* deltaU, uint32_t numSig, const TUEntropyCodingParameters &codingParameters, uint32_t log2TrSize);

     uint32_t rdoQuant(const CUData& cu, int16_t* dstCoeff, uint32_t log2TrSize, TextType ttype, uint32_t absPartIdx, bool usePsy);
 };
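Both context helpers above lean on the same branchless comparison: for operands below 2^31, (uint32_t)(x - k) >> 31 is 1 exactly when x < k, since the subtraction wraps to a value with its top bit set. The switch from the old (int32_t)... >> 31 (an all-ones mask) to a logical shift (a clean 0 or 1) is what lets the callers drop the trailing & 1 / & 2 fixups. A self-checking sketch:

    #include <cassert>
    #include <cstdint>

    static uint32_t is_less(uint32_t x, uint32_t k)
    {
        return (uint32_t)(x - k) >> 31; // 1 iff x < k (for x, k below 2^31)
    }

    int main()
    {
        assert(is_less(2, 7) == 1);
        assert(is_less(7, 7) == 0);
        assert(is_less(9, 7) == 0);
        return 0;
    }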
View file
x265_1.7.tar.gz/source/common/scalinglist.cpp -> x265_1.8.tar.gz/source/common/scalinglist.cpp
Changed
@@ -80,7 +80,7 @@
     },
 };

-int quantTSDefault4x4[16] =
+static int quantTSDefault4x4[16] =
 {
     16, 16, 16, 16,
     16, 16, 16, 16,
@@ -88,7 +88,7 @@
     16, 16, 16, 16
 };

-int quantIntraDefault8x8[64] =
+static int quantIntraDefault8x8[64] =
 {
     16, 16, 16, 16, 17, 18, 21, 24,
     16, 16, 16, 16, 17, 19, 22, 25,
@@ -100,7 +100,7 @@
     24, 25, 29, 36, 47, 65, 88, 115
 };

-int quantInterDefault8x8[64] =
+static int quantInterDefault8x8[64] =
 {
     16, 16, 16, 16, 17, 18, 20, 24,
     16, 16, 16, 17, 18, 20, 24, 25,
@@ -114,7 +114,7 @@
 }

-namespace x265 {
+namespace X265_NS {
 // private namespace

 const int ScalingList::s_numCoefPerSize[NUM_SIZES] = { 16, 64, 256, 1024 };
View file
x265_1.7.tar.gz/source/common/scalinglist.h -> x265_1.8.tar.gz/source/common/scalinglist.h
Changed
@@ -26,7 +26,7 @@

 #include "common.h"

-namespace x265 {
+namespace X265_NS {
 // private namespace

 class ScalingList
View file
x265_1.7.tar.gz/source/common/shortyuv.cpp -> x265_1.8.tar.gz/source/common/shortyuv.cpp
Changed
@@ -28,7 +28,7 @@

 #include "x265.h"

-using namespace x265;
+using namespace X265_NS;

 ShortYuv::ShortYuv()
 {
View file
x265_1.7.tar.gz/source/common/shortyuv.h -> x265_1.8.tar.gz/source/common/shortyuv.h
Changed
@@ -28,7 +28,7 @@

 #include "common.h"

-namespace x265 {
+namespace X265_NS {
 // private namespace

 class Yuv;
View file
x265_1.7.tar.gz/source/common/slice.cpp -> x265_1.8.tar.gz/source/common/slice.cpp
Changed
@@ -27,7 +27,7 @@
 #include "picyuv.h"
 #include "slice.h"

-using namespace x265;
+using namespace X265_NS;

 void Slice::setRefPicList(PicList& picList)
 {
View file
x265_1.7.tar.gz/source/common/slice.h -> x265_1.8.tar.gz/source/common/slice.h
Changed
@@ -26,7 +26,7 @@

 #include "common.h"

-namespace x265 {
+namespace X265_NS {
 // private namespace

 class Frame;
@@ -111,6 +111,7 @@
     bool     frameOnlyConstraintFlag;
     bool     profileCompatibilityFlag[32];
     bool     intraConstraintFlag;
+    bool     onePictureOnlyConstraintFlag;
     bool     lowerBitRateConstraintFlag;
     int      profileIdc;
     int      levelIdc;
View file
x265_1.7.tar.gz/source/common/threading.cpp -> x265_1.8.tar.gz/source/common/threading.cpp
Changed
@@ -21,21 +21,73 @@
  * For more information, contact us at license @ x265.com
  *****************************************************************************/

+#include "common.h"
 #include "threading.h"
+#include "cpu.h"

-namespace x265 {
+namespace X265_NS {
 // x265 private namespace

 #if X265_ARCH_X86 && !defined(X86_64) && ENABLE_ASSEMBLY && defined(__GNUC__)
-extern "C" intptr_t x265_stack_align(void (*func)(), ...);
-#define x265_stack_align(func, ...) x265_stack_align((void (*)())func, __VA_ARGS__)
+extern "C" intptr_t PFX(stack_align)(void (*func)(), ...);
+#define STACK_ALIGN(func, ...) PFX(stack_align)((void (*)())func, __VA_ARGS__)
 #else
-#define x265_stack_align(func, ...) func(__VA_ARGS__)
+#define STACK_ALIGN(func, ...) func(__VA_ARGS__)
+#endif
+
+#if NO_ATOMICS
+pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+int no_atomic_or(int* ptr, int mask)
+{
+    pthread_mutex_lock(&g_mutex);
+    int ret = *ptr;
+    *ptr |= mask;
+    pthread_mutex_unlock(&g_mutex);
+    return ret;
+}
+
+int no_atomic_and(int* ptr, int mask)
+{
+    pthread_mutex_lock(&g_mutex);
+    int ret = *ptr;
+    *ptr &= mask;
+    pthread_mutex_unlock(&g_mutex);
+    return ret;
+}
+
+int no_atomic_inc(int* ptr)
+{
+    pthread_mutex_lock(&g_mutex);
+    *ptr += 1;
+    int ret = *ptr;
+    pthread_mutex_unlock(&g_mutex);
+    return ret;
+}
+
+int no_atomic_dec(int* ptr)
+{
+    pthread_mutex_lock(&g_mutex);
+    *ptr -= 1;
+    int ret = *ptr;
+    pthread_mutex_unlock(&g_mutex);
+    return ret;
+}
+
+int no_atomic_add(int* ptr, int val)
+{
+    pthread_mutex_lock(&g_mutex);
+    *ptr += val;
+    int ret = *ptr;
+    pthread_mutex_unlock(&g_mutex);
+    return ret;
+}
 #endif

 /* C shim for forced stack alignment */
 static void stackAlignMain(Thread *instance)
 {
+    // defer processing to the virtual function implemented in the derived class
     instance->threadMain();
 }

@@ -43,8 +95,7 @@

 static DWORD WINAPI ThreadShim(Thread *instance)
 {
-    // defer processing to the virtual function implemented in the derived class
-    x265_stack_align(stackAlignMain, instance);
+    STACK_ALIGN(stackAlignMain, instance);

     return 0;
 }
@@ -77,7 +128,7 @@
     // defer processing to the virtual function implemented in the derived class
     Thread *instance = reinterpret_cast<Thread *>(opaque);
-    x265_stack_align(stackAlignMain, instance);
+    STACK_ALIGN(stackAlignMain, instance);

     return NULL;
 }
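With NO_ATOMICS defined, every ATOMIC_* macro (see the threading.h hunk below) routes through these helpers, so calling code is unchanged; correctness comes from a single process-wide mutex, at the cost of serializing all atomic operations on targets that lack compiler builtins. A self-contained POSIX sketch of the same pattern:

    #include <pthread.h>

    static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

    // Mirrors no_atomic_inc() above: returns the post-increment value.
    static int emulated_atomic_inc(int* ptr)
    {
        pthread_mutex_lock(&g_lock);
        int ret = ++*ptr;
        pthread_mutex_unlock(&g_lock);
        return ret;
    }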
View file
x265_1.7.tar.gz/source/common/threading.h -> x265_1.8.tar.gz/source/common/threading.h
Changed
@@ -42,7 +42,30 @@
 #include <sys/sysctl.h>
 #endif

-#ifdef __GNUC__               /* GCCs builtin atomics */
+#if NO_ATOMICS
+
+#include <sys/time.h>
+#include <unistd.h>
+
+namespace X265_NS {
+// x265 private namespace
+int no_atomic_or(int* ptr, int mask);
+int no_atomic_and(int* ptr, int mask);
+int no_atomic_inc(int* ptr);
+int no_atomic_dec(int* ptr);
+int no_atomic_add(int* ptr, int val);
+}
+
+#define CLZ(id, x)            id = (unsigned long)__builtin_clz(x) ^ 31
+#define CTZ(id, x)            id = (unsigned long)__builtin_ctz(x)
+#define ATOMIC_OR(ptr, mask)  no_atomic_or((int*)ptr, mask)
+#define ATOMIC_AND(ptr, mask) no_atomic_and((int*)ptr, mask)
+#define ATOMIC_INC(ptr)       no_atomic_inc((int*)ptr)
+#define ATOMIC_DEC(ptr)       no_atomic_dec((int*)ptr)
+#define ATOMIC_ADD(ptr, val)  no_atomic_add((int*)ptr, val)
+#define GIVE_UP_TIME()        usleep(0)
+
+#elif __GNUC__                /* GCCs builtin atomics */

 #include <sys/time.h>
 #include <unistd.h>
@@ -71,7 +94,7 @@

 #endif // ifdef __GNUC__

-namespace x265 {
+namespace X265_NS {
 // x265 private namespace

 #ifdef _WIN32
@@ -463,6 +486,6 @@
     void stop();
 };

-} // end namespace x265
+} // end namespace X265_NS

 #endif // ifndef X265_THREADING_H
View file
x265_1.7.tar.gz/source/common/threadpool.cpp -> x265_1.8.tar.gz/source/common/threadpool.cpp
Changed
@@ -60,7 +60,7 @@
 #include <numa.h>
 #endif

-namespace x265 {
+namespace X265_NS {
 // x265 private namespace

 class WorkerThread : public Thread
@@ -310,7 +310,7 @@
     ThreadPool *pools = new ThreadPool[numPools];
     if (pools)
     {
-        int maxProviders = (p->frameNumThreads + 1 + numPools - 1) / numPools; /* +1 is Lookahead */
+        int maxProviders = (p->frameNumThreads + numPools - 1) / numPools + 1; /* +1 is Lookahead, always assigned to threadpool 0 */
         int node = 0;
         for (int i = 0; i < numPools; i++)
         {
@@ -480,4 +480,4 @@
 #endif
 }

-} // end namespace x265
+} // end namespace X265_NS
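The maxProviders change is easiest to see with numbers: the old form folded the lookahead's +1 into the dividend, letting the ceiling division amortize it away; the new form computes ceil(frameNumThreads / numPools) and then reserves a lookahead slot per pool, since the lookahead provider is pinned to pool 0:

    #include <cstdio>

    int main()
    {
        int frameNumThreads = 3, numPools = 2;
        int oldMax = (frameNumThreads + 1 + numPools - 1) / numPools; // = 2
        int newMax = (frameNumThreads + numPools - 1) / numPools + 1; // = 3
        printf("old=%d new=%d\n", oldMax, newMax); // old is one slot short
        return 0;
    }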
View file
x265_1.7.tar.gz/source/common/threadpool.h -> x265_1.8.tar.gz/source/common/threadpool.h
Changed
@@ -27,7 +27,7 @@
 #include "common.h"
 #include "threading.h"

-namespace x265 {
+namespace X265_NS {
 // x265 private namespace

 class ThreadPool;
@@ -113,7 +113,7 @@
  * called. If it returns non-zero then some number of slave worker threads are
  * already in the process of calling your processTasks() function. The master
  * thread should participate and call processTasks() itself. When
- * waitForExit() returns, all bonded peer threads are quarunteed to have
+ * waitForExit() returns, all bonded peer threads are guaranteed to have
  * exitied processTasks(). Since the thread count is small, it uses explicit
  * locking instead of atomic counters and bitmasks */
 class BondedTaskGroup
@@ -167,6 +167,6 @@
     virtual void processTasks(int workerThreadId) = 0;
 };

-} // end namespace x265
+} // end namespace X265_NS

 #endif // ifndef X265_THREADPOOL_H
View file
x265_1.7.tar.gz/source/common/vec/dct-sse3.cpp -> x265_1.8.tar.gz/source/common/vec/dct-sse3.cpp
Changed
@@ -33,19 +33,13 @@
 #include <xmmintrin.h> // SSE
 #include <pmmintrin.h> // SSE3

-using namespace x265;
+using namespace X265_NS;

-namespace {
 #define SHIFT1  7
 #define ADD1    64

-#if HIGH_BIT_DEPTH
-#define SHIFT2  10
-#define ADD2    512
-#else
-#define SHIFT2  12
-#define ADD2    2048
-#endif
+#define SHIFT2  (12 - (X265_DEPTH - 8))
+#define ADD2    (1 << ((SHIFT2) - 1))

 ALIGN_VAR_32(static const int16_t, tab_idct_8x8[12][8]) =
 {
@@ -62,7 +56,8 @@
     {  83,  36,  83,  36,  83,  36,  83,  36 },
     {  36, -83,  36, -83,  36, -83,  36, -83 }
 };
-void idct8(const int16_t* src, int16_t* dst, intptr_t stride)
+
+static void idct8(const int16_t* src, int16_t* dst, intptr_t stride)
 {
     __m128i m128iS0, m128iS1, m128iS2, m128iS3, m128iS4, m128iS5, m128iS6, m128iS7, m128iAdd, m128Tmp0, m128Tmp1, m128Tmp2, m128Tmp3, E0h, E1h, E2h, E3h, E0l, E1l, E2l, E3l, O0h, O1h, O2h, O3h, O0l, O1l, O2l, O3l, EE0l, EE1l, E00l, E01l, EE0h, EE1h, E00h, E01h;
     __m128i T00, T01, T02, T03, T04, T05, T06, T07;
@@ -299,7 +294,7 @@
     _mm_storeh_pi((__m64*)&dst[7 * stride + 4], _mm_castsi128_ps(T11));
 }

-void idct16(const int16_t *src, int16_t *dst, intptr_t stride)
+static void idct16(const int16_t *src, int16_t *dst, intptr_t stride)
 {
 #define READ_UNPACKHILO(offset)\
     const __m128i T_00_00A = _mm_unpacklo_epi16(*(__m128i*)&src[1 * 16 + offset], *(__m128i*)&src[3 * 16 + offset]);\
@@ -677,7 +672,7 @@
 #undef UNPACKHILO
 #undef READ_UNPACKHILO

-void idct32(const int16_t *src, int16_t *dst, intptr_t stride)
+static void idct32(const int16_t *src, int16_t *dst, intptr_t stride)
 {
     //Odd
     const __m128i c16_p90_p90   = _mm_set1_epi32(0x005A005A); //column 0
@@ -1418,9 +1413,7 @@
     }
 }

-}
-
-namespace x265 {
+namespace X265_NS {
 void setupIntrinsicDCT_sse3(EncoderPrimitives &p)
 {
     /* Note: We have AVX2 assembly for these functions, but since AVX2 is still
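The parameterized SHIFT2/ADD2 fold the old HIGH_BIT_DEPTH pair into one expression and extend it to the 12-bit build; per the HEVC spec the inverse-transform output shift is 12 - (bitDepth - 8), with the rounding offset always half of the shifted range. A quick check that the formula reproduces the removed constants:

    #include <cstdio>

    int main()
    {
        for (int depth = 8; depth <= 12; depth += 2)
        {
            int shift2 = 12 - (depth - 8);
            int add2 = 1 << (shift2 - 1);
            printf("X265_DEPTH=%d -> SHIFT2=%d ADD2=%d\n", depth, shift2, add2);
        }
        return 0; // 8 -> 12/2048, 10 -> 10/512 (the old values), 12 -> 8/128
    }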
View file
x265_1.7.tar.gz/source/common/vec/dct-sse41.cpp -> x265_1.8.tar.gz/source/common/vec/dct-sse41.cpp
Changed
@@ -33,10 +33,9 @@
 #include <xmmintrin.h> // SSE
 #include <smmintrin.h> // SSE4.1

-using namespace x265;
+using namespace X265_NS;

-namespace {
-void dequant_scaling(const int16_t* quantCoef, const int32_t *deQuantCoef, int16_t* coef, int num, int per, int shift)
+static void dequant_scaling(const int16_t* quantCoef, const int32_t *deQuantCoef, int16_t* coef, int num, int per, int shift)
 {
     X265_CHECK(num <= 32 * 32, "dequant num too large\n");
@@ -100,9 +99,8 @@
         }
     }
 }
-}

-namespace x265 {
+namespace X265_NS {
 void setupIntrinsicDCT_sse41(EncoderPrimitives &p)
 {
     p.dequant_scaling = dequant_scaling;
View file
x265_1.7.tar.gz/source/common/vec/dct-ssse3.cpp -> x265_1.8.tar.gz/source/common/vec/dct-ssse3.cpp
Changed
@@ -34,9 +34,20 @@
 #include <pmmintrin.h> // SSE3
 #include <tmmintrin.h> // SSSE3

-using namespace x265;
+#define DCT16_SHIFT1  (3 + X265_DEPTH - 8)
+#define DCT16_ADD1    (1 << ((DCT16_SHIFT1) - 1))
+
+#define DCT16_SHIFT2  10
+#define DCT16_ADD2    (1 << ((DCT16_SHIFT2) - 1))
+
+#define DCT32_SHIFT1  (DCT16_SHIFT1 + 1)
+#define DCT32_ADD1    (1 << ((DCT32_SHIFT1) - 1))
+
+#define DCT32_SHIFT2  (DCT16_SHIFT2 + 1)
+#define DCT32_ADD2    (1 << ((DCT32_SHIFT2) - 1))
+
+using namespace X265_NS;

-namespace {
 ALIGN_VAR_32(static const int16_t, tab_dct_8[][8]) =
 {
     { 0x0100, 0x0F0E, 0x0706, 0x0908, 0x0302, 0x0D0C, 0x0504, 0x0B0A },
@@ -99,22 +110,11 @@
 #undef MAKE_COEF
 };

-void dct16(const int16_t *src, int16_t *dst, intptr_t stride)
+static void dct16(const int16_t *src, int16_t *dst, intptr_t stride)
 {
-#if HIGH_BIT_DEPTH
-#define SHIFT1  5
-#define ADD1    16
-#else
-#define SHIFT1  3
-#define ADD1    4
-#endif
-
-#define SHIFT2  10
-#define ADD2    512
-
     // Const
-    __m128i c_4     = _mm_set1_epi32(ADD1);
-    __m128i c_512   = _mm_set1_epi32(ADD2);
+    __m128i c_4     = _mm_set1_epi32(DCT16_ADD1);
+    __m128i c_512   = _mm_set1_epi32(DCT16_ADD2);

     int i;
@@ -202,29 +202,29 @@
         T60  = _mm_madd_epi16(T50, _mm_load_si128((__m128i*)tab_dct_8[1]));
         T61  = _mm_madd_epi16(T51, _mm_load_si128((__m128i*)tab_dct_8[1]));
-        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), SHIFT1);
-        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), SHIFT1);
+        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), DCT16_SHIFT1);
+        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), DCT16_SHIFT1);
         T70  = _mm_packs_epi32(T60, T61);
         _mm_store_si128((__m128i*)&tmp[0 * 16 + i], T70);

         T60  = _mm_madd_epi16(T50, _mm_load_si128((__m128i*)tab_dct_8[2]));
         T61  = _mm_madd_epi16(T51, _mm_load_si128((__m128i*)tab_dct_8[2]));
-        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), SHIFT1);
-        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), SHIFT1);
+        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), DCT16_SHIFT1);
+        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), DCT16_SHIFT1);
         T70  = _mm_packs_epi32(T60, T61);
         _mm_store_si128((__m128i*)&tmp[8 * 16 + i], T70);

         T60  = _mm_madd_epi16(T52, _mm_load_si128((__m128i*)tab_dct_8[3]));
         T61  = _mm_madd_epi16(T53, _mm_load_si128((__m128i*)tab_dct_8[3]));
-        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), SHIFT1);
-        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), SHIFT1);
+        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), DCT16_SHIFT1);
+        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), DCT16_SHIFT1);
         T70  = _mm_packs_epi32(T60, T61);
         _mm_store_si128((__m128i*)&tmp[4 * 16 + i], T70);

         T60  = _mm_madd_epi16(T52, _mm_load_si128((__m128i*)tab_dct_8[4]));
         T61  = _mm_madd_epi16(T53, _mm_load_si128((__m128i*)tab_dct_8[4]));
-        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), SHIFT1);
-        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), SHIFT1);
+        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), DCT16_SHIFT1);
+        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), DCT16_SHIFT1);
         T70  = _mm_packs_epi32(T60, T61);
         _mm_store_si128((__m128i*)&tmp[12 * 16 + i], T70);
@@ -234,8 +234,8 @@
         T63  = _mm_madd_epi16(T47, _mm_load_si128((__m128i*)tab_dct_8[5]));
         T60  = _mm_hadd_epi32(T60, T61);
         T61  = _mm_hadd_epi32(T62, T63);
-        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), SHIFT1);
-        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), SHIFT1);
+        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), DCT16_SHIFT1);
+        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), DCT16_SHIFT1);
         T70  = _mm_packs_epi32(T60, T61);
         _mm_store_si128((__m128i*)&tmp[2 * 16 + i], T70);
@@ -245,8 +245,8 @@
         T63  = _mm_madd_epi16(T47, _mm_load_si128((__m128i*)tab_dct_8[6]));
         T60  = _mm_hadd_epi32(T60, T61);
         T61  = _mm_hadd_epi32(T62, T63);
-        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), SHIFT1);
-        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), SHIFT1);
+        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), DCT16_SHIFT1);
+        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), DCT16_SHIFT1);
         T70  = _mm_packs_epi32(T60, T61);
         _mm_store_si128((__m128i*)&tmp[6 * 16 + i], T70);
@@ -256,8 +256,8 @@
         T63  = _mm_madd_epi16(T47, _mm_load_si128((__m128i*)tab_dct_8[7]));
         T60  = _mm_hadd_epi32(T60, T61);
         T61  = _mm_hadd_epi32(T62, T63);
-        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), SHIFT1);
-        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), SHIFT1);
+        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), DCT16_SHIFT1);
+        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), DCT16_SHIFT1);
         T70  = _mm_packs_epi32(T60, T61);
         _mm_store_si128((__m128i*)&tmp[10 * 16 + i], T70);
@@ -267,8 +267,8 @@
         T63  = _mm_madd_epi16(T47, _mm_load_si128((__m128i*)tab_dct_8[8]));
         T60  = _mm_hadd_epi32(T60, T61);
         T61  = _mm_hadd_epi32(T62, T63);
-        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), SHIFT1);
-        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), SHIFT1);
+        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), DCT16_SHIFT1);
+        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), DCT16_SHIFT1);
         T70  = _mm_packs_epi32(T60, T61);
         _mm_store_si128((__m128i*)&tmp[14 * 16 + i], T70);
@@ -287,8 +287,8 @@
         T63  = _mm_hadd_epi32(T66, T67); \
         T60  = _mm_hadd_epi32(T60, T61); \
         T61  = _mm_hadd_epi32(T62, T63); \
-        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), SHIFT1); \
-        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), SHIFT1); \
+        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_4), DCT16_SHIFT1); \
+        T61  = _mm_srai_epi32(_mm_add_epi32(T61, c_4), DCT16_SHIFT1); \
         T70  = _mm_packs_epi32(T60, T61); \
         _mm_store_si128((__m128i*)&tmp[(dstPos) * 16 + i], T70);
@@ -352,8 +352,8 @@
         T40  = _mm_hadd_epi32(T30, T31);
         T41  = _mm_hsub_epi32(T30, T31);
-        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), SHIFT2);
-        T41  = _mm_srai_epi32(_mm_add_epi32(T41, c_512), SHIFT2);
+        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), DCT16_SHIFT2);
+        T41  = _mm_srai_epi32(_mm_add_epi32(T41, c_512), DCT16_SHIFT2);
         T40  = _mm_packs_epi32(T40, T40);
         T41  = _mm_packs_epi32(T41, T41);
         _mm_storel_epi64((__m128i*)&dst[0 * 16 + i], T40);
@@ -377,7 +377,7 @@
         T31  = _mm_hadd_epi32(T32, T33);

         T40  = _mm_hadd_epi32(T30, T31);
-        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), SHIFT2);
+        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), DCT16_SHIFT2);
         T40  = _mm_packs_epi32(T40, T40);
         _mm_storel_epi64((__m128i*)&dst[4 * 16 + i], T40);
@@ -399,7 +399,7 @@
         T31  = _mm_hadd_epi32(T32, T33);

         T40  = _mm_hadd_epi32(T30, T31);
-        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), SHIFT2);
+        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), DCT16_SHIFT2);
         T40  = _mm_packs_epi32(T40, T40);
         _mm_storel_epi64((__m128i*)&dst[12 * 16 + i], T40);
@@ -421,7 +421,7 @@
         T31  = _mm_hadd_epi32(T32, T33);

         T40  = _mm_hadd_epi32(T30, T31);
-        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), SHIFT2);
+        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), DCT16_SHIFT2);
         T40  = _mm_packs_epi32(T40, T40);
         _mm_storel_epi64((__m128i*)&dst[2 * 16 + i], T40);
@@ -443,7 +443,7 @@
         T31  = _mm_hadd_epi32(T32, T33);

         T40  = _mm_hadd_epi32(T30, T31);
-        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), SHIFT2);
+        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), DCT16_SHIFT2);
         T40  = _mm_packs_epi32(T40, T40);
         _mm_storel_epi64((__m128i*)&dst[6 * 16 + i], T40);
@@ -465,7 +465,7 @@
         T31  = _mm_hadd_epi32(T32, T33);

         T40  = _mm_hadd_epi32(T30, T31);
-        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), SHIFT2);
+        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), DCT16_SHIFT2);
         T40  = _mm_packs_epi32(T40, T40);
         _mm_storel_epi64((__m128i*)&dst[10 * 16 + i], T40);
@@ -487,7 +487,7 @@
         T31  = _mm_hadd_epi32(T32, T33);

         T40  = _mm_hadd_epi32(T30, T31);
-        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), SHIFT2);
+        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), DCT16_SHIFT2);
         T40  = _mm_packs_epi32(T40, T40);
         _mm_storel_epi64((__m128i*)&dst[14 * 16 + i], T40);
@@ -510,7 +510,7 @@
         T31  = _mm_hadd_epi32(T32, T33); \
 \
         T40  = _mm_hadd_epi32(T30, T31); \
-        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), SHIFT2); \
+        T40  = _mm_srai_epi32(_mm_add_epi32(T40, c_512), DCT16_SHIFT2); \
         T40  = _mm_packs_epi32(T40, T40); \
         _mm_storel_epi64((__m128i*)&dst[(dstPos) * 16 + i], T40);
@@ -524,10 +524,6 @@
     MAKE_ODD(28, 15);
 #undef MAKE_ODD
 }
-#undef SHIFT1
-#undef ADD1
-#undef SHIFT2
-#undef ADD2
 }

 ALIGN_VAR_32(static const int16_t, tab_dct_32_0[][8]) =
@@ -680,22 +676,11 @@
 #undef MAKE_COEF16
 };

-void dct32(const int16_t *src, int16_t *dst, intptr_t stride)
+static void dct32(const int16_t *src, int16_t *dst, intptr_t stride)
 {
-#if HIGH_BIT_DEPTH
-#define SHIFT1  6
-#define ADD1    32
-#else
-#define SHIFT1  4
-#define ADD1    8
-#endif
-
-#define SHIFT2  11
-#define ADD2    1024
-
     // Const
-    __m128i c_8     = _mm_set1_epi32(ADD1);
-    __m128i c_1024  = _mm_set1_epi32(ADD2);
+    __m128i c_8     = _mm_set1_epi32(DCT32_ADD1);
+    __m128i c_1024  = _mm_set1_epi32(DCT32_ADD2);

     int i;
@@ -840,15 +825,15 @@
         T50  = _mm_hadd_epi32(T40, T41);
         T51  = _mm_hadd_epi32(T42, T43);
-        T50  = _mm_srai_epi32(_mm_add_epi32(T50, c_8), SHIFT1);
-        T51  = _mm_srai_epi32(_mm_add_epi32(T51, c_8), SHIFT1);
+        T50  = _mm_srai_epi32(_mm_add_epi32(T50, c_8), DCT32_SHIFT1);
+        T51  = _mm_srai_epi32(_mm_add_epi32(T51, c_8), DCT32_SHIFT1);
         T60  = _mm_packs_epi32(T50, T51);
         im[0][i] = T60;

         T50  = _mm_hsub_epi32(T40, T41);
         T51  = _mm_hsub_epi32(T42, T43);
-        T50  = _mm_srai_epi32(_mm_add_epi32(T50, c_8), SHIFT1);
-        T51  = _mm_srai_epi32(_mm_add_epi32(T51, c_8), SHIFT1);
+        T50  = _mm_srai_epi32(_mm_add_epi32(T50, c_8), DCT32_SHIFT1);
+        T51  = _mm_srai_epi32(_mm_add_epi32(T51, c_8), DCT32_SHIFT1);
         T60  = _mm_packs_epi32(T50, T51);
         im[16][i] = T60;
@@ -868,8 +853,8 @@
         T50  = _mm_hadd_epi32(T40, T41);
         T51  = _mm_hadd_epi32(T42, T43);
-        T50  = _mm_srai_epi32(_mm_add_epi32(T50, c_8), SHIFT1);
-        T51  = _mm_srai_epi32(_mm_add_epi32(T51, c_8), SHIFT1);
+        T50  = _mm_srai_epi32(_mm_add_epi32(T50, c_8), DCT32_SHIFT1);
+        T51  = _mm_srai_epi32(_mm_add_epi32(T51, c_8), DCT32_SHIFT1);
         T60  = _mm_packs_epi32(T50, T51);
         im[8][i] = T60;
@@ -889,8 +874,8 @@
         T50  = _mm_hadd_epi32(T40, T41);
         T51  = _mm_hadd_epi32(T42, T43);
-        T50  = _mm_srai_epi32(_mm_add_epi32(T50, c_8), SHIFT1);
-        T51  = _mm_srai_epi32(_mm_add_epi32(T51, c_8), SHIFT1);
+        T50  = _mm_srai_epi32(_mm_add_epi32(T50, c_8), DCT32_SHIFT1);
+        T51  = _mm_srai_epi32(_mm_add_epi32(T51, c_8), DCT32_SHIFT1);
         T60  = _mm_packs_epi32(T50, T51);
         im[24][i] = T60;
@@ -911,8 +896,8 @@
 \
         T50  = _mm_hadd_epi32(T40, T41); \
         T51  = _mm_hadd_epi32(T42, T43); \
-        T50  = _mm_srai_epi32(_mm_add_epi32(T50, c_8), SHIFT1); \
-        T51  = _mm_srai_epi32(_mm_add_epi32(T51, c_8), SHIFT1); \
+        T50  = _mm_srai_epi32(_mm_add_epi32(T50, c_8), DCT32_SHIFT1); \
+        T51  = _mm_srai_epi32(_mm_add_epi32(T51, c_8), DCT32_SHIFT1); \
         T60  = _mm_packs_epi32(T50, T51); \
         im[(dstPos)][i] = T60;
@@ -974,8 +959,8 @@
 \
         T50  = _mm_hadd_epi32(T50, T51); \
         T51  = _mm_hadd_epi32(T52, T53); \
-        T50  = _mm_srai_epi32(_mm_add_epi32(T50, c_8), SHIFT1); \
-        T51  = _mm_srai_epi32(_mm_add_epi32(T51, c_8), SHIFT1); \
+        T50  = _mm_srai_epi32(_mm_add_epi32(T50, c_8), DCT32_SHIFT1); \
+        T51  = _mm_srai_epi32(_mm_add_epi32(T51, c_8), DCT32_SHIFT1); \
         T60  = _mm_packs_epi32(T50, T51); \
         im[(dstPos)][i] = T60;
@@ -1083,7 +1068,7 @@
 \
         T60  = _mm_hadd_epi32(T60, T61); \
 \
-        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_1024), SHIFT2); \
+        T60  = _mm_srai_epi32(_mm_add_epi32(T60, c_1024), DCT32_SHIFT2); \
         T60  = _mm_packs_epi32(T60, T60); \
         _mm_storel_epi64((__m128i*)&dst[(dstPos) * 32 + (i * 4) + 0], T60); \
@@ -1125,14 +1110,9 @@
     MAKE_ODD(158, 159, 160, 161, 31);
 #undef MAKE_ODD
 }
-#undef SHIFT1
-#undef ADD1
-#undef SHIFT2
-#undef ADD2
-}
 }

-namespace x265 {
+namespace X265_NS {
 void setupIntrinsicDCT_ssse3(EncoderPrimitives &p)
 {
     /* Note: We have AVX2 assembly for these two functions, but since AVX2 is
View file
x265_1.7.tar.gz/source/common/vec/vec-primitives.cpp -> x265_1.8.tar.gz/source/common/vec/vec-primitives.cpp
Changed
@@ -32,12 +32,13 @@
 #define HAVE_SSE4
 #define HAVE_AVX2
 #elif defined(__GNUC__)
-#if __clang__ || (__GNUC__ >= 4 && __GNUC_MINOR__ >= 3)
+#define GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__)
+#if __clang__ || GCC_VERSION >= 40300 /* gcc_version >= gcc-4.3.0 */
 #define HAVE_SSE3
 #define HAVE_SSSE3
 #define HAVE_SSE4
 #endif
-#if __clang__ || (__GNUC__ >= 4 && __GNUC_MINOR__ >= 7)
+#if __clang__ || GCC_VERSION >= 40700 /* gcc_version >= gcc-4.7.0 */
 #define HAVE_AVX2
 #endif
 #elif defined(_MSC_VER)
@@ -50,7 +51,7 @@
 #endif // compiler checks
 #endif // if X265_ARCH_X86

-namespace x265 {
+namespace X265_NS {
 // private x265 namespace

 void setupIntrinsicDCT_sse3(EncoderPrimitives&);
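The composite GCC_VERSION fixes a real misdetection: the old test demanded __GNUC_MINOR__ >= 3 (or >= 7) even on major versions past 4, so gcc 5.1 (major 5, minor 1) would have had its SSE4/AVX2 intrinsic paths disabled. Encoded as one integer, the comparison is monotonic across releases:

    #include <cstdio>

    int main()
    {
        int major = 5, minor = 1, patch = 0;                  // gcc 5.1.0
        int gccVersion = major * 10000 + minor * 100 + patch; // 50100
        printf("old test: %d\n", major >= 4 && minor >= 7);   // 0, wrongly fails
        printf("new test: %d\n", gccVersion >= 40700);        // 1, passes
        return 0;
    }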
View file
x265_1.7.tar.gz/source/common/version.cpp -> x265_1.8.tar.gz/source/common/version.cpp
Changed
@@ -23,71 +23,109 @@

 #include "x265.h"
 #include "common.h"
+#include "primitives.h"

 #define XSTR(x) STR(x)
 #define STR(x) #x

 #if defined(__clang__)
-#define NVM_COMPILEDBY  "[clang " XSTR(__clang_major__) "." XSTR(__clang_minor__) "." XSTR(__clang_patchlevel__) "]"
+#define COMPILEDBY  "[clang " XSTR(__clang_major__) "." XSTR(__clang_minor__) "." XSTR(__clang_patchlevel__) "]"
 #ifdef __IA64__
-#define NVM_ONARCH    "[on 64-bit] "
+#define ONARCH    "[on 64-bit] "
 #else
-#define NVM_ONARCH    "[on 32-bit] "
+#define ONARCH    "[on 32-bit] "
 #endif
 #endif

 #if defined(__GNUC__) && !defined(__INTEL_COMPILER) && !defined(__clang__)
-#define NVM_COMPILEDBY  "[GCC " XSTR(__GNUC__) "." XSTR(__GNUC_MINOR__) "." XSTR(__GNUC_PATCHLEVEL__) "]"
+#define COMPILEDBY  "[GCC " XSTR(__GNUC__) "." XSTR(__GNUC_MINOR__) "." XSTR(__GNUC_PATCHLEVEL__) "]"
 #ifdef __IA64__
-#define NVM_ONARCH    "[on 64-bit] "
+#define ONARCH    "[on 64-bit] "
 #else
-#define NVM_ONARCH    "[on 32-bit] "
+#define ONARCH    "[on 32-bit] "
 #endif
 #endif

 #ifdef __INTEL_COMPILER
-#define NVM_COMPILEDBY  "[ICC " XSTR(__INTEL_COMPILER) "]"
+#define COMPILEDBY  "[ICC " XSTR(__INTEL_COMPILER) "]"
 #elif  _MSC_VER
-#define NVM_COMPILEDBY  "[MSVC " XSTR(_MSC_VER) "]"
+#define COMPILEDBY  "[MSVC " XSTR(_MSC_VER) "]"
 #endif

-#ifndef NVM_COMPILEDBY
-#define NVM_COMPILEDBY "[Unk-CXX]"
+#ifndef COMPILEDBY
+#define COMPILEDBY "[Unk-CXX]"
 #endif

 #ifdef _WIN32
-#define NVM_ONOS    "[Windows]"
+#define ONOS    "[Windows]"
 #elif  __linux
-#define NVM_ONOS    "[Linux]"
+#define ONOS    "[Linux]"
 #elif  __OpenBSD__
-#define NVM_ONOS    "[OpenBSD]"
+#define ONOS    "[OpenBSD]"
 #elif  __CYGWIN__
-#define NVM_ONOS    "[Cygwin]"
+#define ONOS    "[Cygwin]"
 #elif __APPLE__
-#define NVM_ONOS    "[Mac OS X]"
+#define ONOS    "[Mac OS X]"
 #else
-#define NVM_ONOS    "[Unk-OS]"
+#define ONOS    "[Unk-OS]"
 #endif

 #if X86_64
-#define NVM_BITS    "[64 bit]"
+#define BITS    "[64 bit]"
 #else
-#define NVM_BITS    "[32 bit]"
+#define BITS    "[32 bit]"
+#endif
+
+#if defined(ENABLE_ASSEMBLY)
+#define ASM     ""
+#else
+#define ASM     "[noasm]"
+#endif
+
+#if NO_ATOMICS
+#define ATOMICS "[no-atomics]"
+#else
+#define ATOMICS ""
 #endif

 #if CHECKED_BUILD
-#define CHECKED "[CHECKED] "
+#define CHECKED "[CHECKED] "
 #else
-#define CHECKED " "
+#define CHECKED " "
 #endif

-#if HIGH_BIT_DEPTH
-#define BITDEPTH "16bpp"
-const int x265_max_bit_depth = 10;
+#if X265_DEPTH == 12
+
+#define BITDEPTH "12bit"
+const int PFX(max_bit_depth) = 12;
+
+#elif X265_DEPTH == 10
+
+#define BITDEPTH "10bit"
+const int PFX(max_bit_depth) = 10;
+
+#elif X265_DEPTH == 8
+
+#define BITDEPTH "8bit"
+const int PFX(max_bit_depth) = 8;
+
+#endif
+
+#if LINKED_8BIT
+#define ADD8 "+8bit"
+#else
+#define ADD8 ""
+#endif
+#if LINKED_10BIT
+#define ADD10 "+10bit"
+#else
+#define ADD10 ""
+#endif
+#if LINKED_12BIT
+#define ADD12 "+12bit"
 #else
-#define BITDEPTH "8bpp"
-const int x265_max_bit_depth = 8;
+#define ADD12 ""
 #endif

-const char *x265_version_str = XSTR(X265_VERSION);
-const char *x265_build_info_str = NVM_ONOS NVM_COMPILEDBY NVM_BITS CHECKED BITDEPTH;
+const char* PFX(version_str) = XSTR(X265_VERSION);
+const char* PFX(build_info_str) = ONOS COMPILEDBY BITS ASM ATOMICS CHECKED BITDEPTH ADD8 ADD10 ADD12;
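With the NVM_ prefixes dropped and the new ASM, ATOMICS and ADD* tokens, the build string now also records assembly support, atomics support, internal depth, and any extra bit-depth libraries linked in. A sketch of how the adjacent string literals compose, with illustrative values only (the real tokens come from the preprocessor logic above; the *_X names here are made up):

    #include <cstdio>

    #define ONOS_X   "[Linux]"
    #define CXX_X    "[GCC 4.8.5]"
    #define BITS_X   "[64 bit]"
    #define ASM_X    ""   // assembly enabled
    #define ATOM_X   ""   // native atomics available
    #define CHK_X    " "  // not a CHECKED build
    #define DEPTH_X  "8bit"
    #define LINK_X   "+10bit+12bit"

    static const char* build_info =
        ONOS_X CXX_X BITS_X ASM_X ATOM_X CHK_X DEPTH_X LINK_X;

    int main() { puts(build_info); return 0; }
    // prints: [Linux][GCC 4.8.5][64 bit] 8bit+10bit+12bit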
View file
x265_1.7.tar.gz/source/common/wavefront.cpp -> x265_1.8.tar.gz/source/common/wavefront.cpp
Changed
@@ -26,7 +26,7 @@
 #include "wavefront.h"
 #include "common.h"

-namespace x265 {
+namespace X265_NS {
 // x265 private namespace

 bool WaveFront::init(int numRows)
View file
x265_1.7.tar.gz/source/common/wavefront.h -> x265_1.8.tar.gz/source/common/wavefront.h
Changed
@@ -27,7 +27,7 @@
 #include "common.h"
 #include "threadpool.h"

-namespace x265 {
+namespace X265_NS {
 // x265 private namespace

 // Generic wave-front scheduler, manages busy-state of CU rows as a priority
@@ -92,6 +92,6 @@
     // derived classes.
     virtual void processRow(int row, int threadId) = 0;
 };
-} // end namespace x265
+} // end namespace X265_NS

 #endif // ifndef X265_WAVEFRONT_H
View file
x265_1.7.tar.gz/source/common/winxp.cpp -> x265_1.8.tar.gz/source/common/winxp.cpp
Changed
@@ -25,7 +25,7 @@

 #if defined(_WIN32) && (_WIN32_WINNT < 0x0600) // _WIN32_WINNT_VISTA

-namespace x265 {
+namespace X265_NS {
 /* Mimic CONDITION_VARIABLE functions only supported on Vista+ */

 int WINAPI cond_init(ConditionVariable *cond)
@@ -121,7 +121,7 @@
     DeleteCriticalSection(&cond->broadcastMutex);
     DeleteCriticalSection(&cond->waiterCountMutex);
 }
-} // namespace x265
+} // namespace X265_NS

 #elif defined(_MSC_VER)
View file
x265_1.7.tar.gz/source/common/winxp.h -> x265_1.8.tar.gz/source/common/winxp.h
Changed
@@ -30,7 +30,7 @@
 #include <intrin.h> // _InterlockedCompareExchange64
 #endif

-namespace x265 {
+namespace X265_NS {

 /* non-native condition variable */
 typedef struct
 {
@@ -49,14 +49,14 @@
 void cond_destroy(ConditionVariable *cond);

 /* map missing API symbols to our structure and functions */
-#define CONDITION_VARIABLE          x265::ConditionVariable
-#define InitializeConditionVariable x265::cond_init
-#define SleepConditionVariableCS    x265::cond_wait
-#define WakeConditionVariable       x265::cond_signal
-#define WakeAllConditionVariable    x265::cond_broadcast
-#define XP_CONDITION_VAR_FREE       x265::cond_destroy
+#define CONDITION_VARIABLE          X265_NS::ConditionVariable
+#define InitializeConditionVariable X265_NS::cond_init
+#define SleepConditionVariableCS    X265_NS::cond_wait
+#define WakeConditionVariable       X265_NS::cond_signal
+#define WakeAllConditionVariable    X265_NS::cond_broadcast
+#define XP_CONDITION_VAR_FREE       X265_NS::cond_destroy

-} // namespace x265
+} // namespace X265_NS

 #else // if defined(_WIN32) && (_WIN32_WINNT < 0x0600)
x265_1.7.tar.gz/source/common/x86/asm-primitives.cpp -> x265_1.8.tar.gz/source/common/x86/asm-primitives.cpp
Changed
@@ -28,6 +28,83 @@ #include "x265.h" #include "cpu.h" +#define FUNCDEF_TU(ret, name, cpu, ...) \ + ret PFX(name ## _4x4_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _8x8_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _16x16_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _32x32_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _64x64_ ## cpu(__VA_ARGS__)) + +#define FUNCDEF_TU_S(ret, name, cpu, ...) \ + ret PFX(name ## _4_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _8_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _16_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _32_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## _64_ ## cpu(__VA_ARGS__)) + +#define FUNCDEF_TU_S2(ret, name, cpu, ...) \ + ret PFX(name ## 4_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## 8_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## 16_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## 32_ ## cpu(__VA_ARGS__)); \ + ret PFX(name ## 64_ ## cpu(__VA_ARGS__)) + +#define FUNCDEF_PU(ret, name, cpu, ...) \ + ret PFX(name ## _4x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _64x64_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _4x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _64x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x64_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x12_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _12x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _4x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x24_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _24x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _64x48_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _48x64_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _64x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x64_ ## cpu)(__VA_ARGS__) + +#define FUNCDEF_CHROMA_PU(ret, name, cpu, ...) 
\ + FUNCDEF_PU(ret, name, cpu, __VA_ARGS__); \ + ret PFX(name ## _4x2_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _2x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x2_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _2x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x6_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _6x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x12_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _12x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _6x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x6_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _2x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x2_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _4x12_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _12x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x12_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _12x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x4_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _4x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _32x48_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _48x32_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _16x24_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _24x16_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _8x64_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _64x8_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _64x24_ ## cpu)(__VA_ARGS__); \ + ret PFX(name ## _24x64_ ## cpu)(__VA_ARGS__); + extern "C" { #include "pixel.h" #include "pixel-util.h" @@ -40,31 +117,31 @@ } #define ALL_LUMA_CU_TYPED(prim, fncdef, fname, cpu) \ - p.cu[BLOCK_8x8].prim = fncdef x265_ ## fname ## _8x8_ ## cpu; \ - p.cu[BLOCK_16x16].prim = fncdef x265_ ## fname ## _16x16_ ## cpu; \ - p.cu[BLOCK_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu; \ - p.cu[BLOCK_64x64].prim = fncdef x265_ ## fname ## _64x64_ ## cpu + p.cu[BLOCK_8x8].prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.cu[BLOCK_16x16].prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.cu[BLOCK_32x32].prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.cu[BLOCK_64x64].prim = fncdef PFX(fname ## _64x64_ ## cpu) #define ALL_LUMA_CU_TYPED_S(prim, fncdef, fname, cpu) \ - p.cu[BLOCK_8x8].prim = fncdef x265_ ## fname ## 8_ ## cpu; \ - p.cu[BLOCK_16x16].prim = fncdef x265_ ## fname ## 16_ ## cpu; \ - p.cu[BLOCK_32x32].prim = fncdef x265_ ## fname ## 32_ ## cpu; \ - p.cu[BLOCK_64x64].prim = fncdef x265_ ## fname ## 64_ ## cpu + p.cu[BLOCK_8x8].prim = fncdef PFX(fname ## 8_ ## cpu); \ + p.cu[BLOCK_16x16].prim = fncdef PFX(fname ## 16_ ## cpu); \ + p.cu[BLOCK_32x32].prim = fncdef PFX(fname ## 32_ ## cpu); \ + p.cu[BLOCK_64x64].prim = fncdef PFX(fname ## 64_ ## cpu) #define ALL_LUMA_TU_TYPED(prim, fncdef, fname, cpu) \ - p.cu[BLOCK_4x4].prim = fncdef x265_ ## fname ## _4x4_ ## cpu; \ - p.cu[BLOCK_8x8].prim = fncdef x265_ ## fname ## _8x8_ ## cpu; \ - p.cu[BLOCK_16x16].prim = fncdef x265_ ## fname ## _16x16_ ## cpu; \ - p.cu[BLOCK_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu + p.cu[BLOCK_4x4].prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.cu[BLOCK_8x8].prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.cu[BLOCK_16x16].prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.cu[BLOCK_32x32].prim = fncdef PFX(fname ## _32x32_ ## cpu) #define ALL_LUMA_TU_TYPED_S(prim, fncdef, fname, cpu) \ - p.cu[BLOCK_4x4].prim = fncdef x265_ ## fname ## 4_ ## cpu; \ - p.cu[BLOCK_8x8].prim = fncdef x265_ ## fname ## 8_ ## cpu; \ - p.cu[BLOCK_16x16].prim = fncdef x265_ ## fname ## 16_ ## cpu; \ - p.cu[BLOCK_32x32].prim = fncdef x265_ ## fname ## 32_ ## cpu + p.cu[BLOCK_4x4].prim = fncdef PFX(fname ## 4_ ## cpu); \ + p.cu[BLOCK_8x8].prim = fncdef PFX(fname ## 8_ ## cpu); \ + 
p.cu[BLOCK_16x16].prim = fncdef PFX(fname ## 16_ ## cpu); \ + p.cu[BLOCK_32x32].prim = fncdef PFX(fname ## 32_ ## cpu) #define ALL_LUMA_BLOCKS_TYPED(prim, fncdef, fname, cpu) \ - p.cu[BLOCK_4x4].prim = fncdef x265_ ## fname ## _4x4_ ## cpu; \ - p.cu[BLOCK_8x8].prim = fncdef x265_ ## fname ## _8x8_ ## cpu; \ - p.cu[BLOCK_16x16].prim = fncdef x265_ ## fname ## _16x16_ ## cpu; \ - p.cu[BLOCK_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu; \ - p.cu[BLOCK_64x64].prim = fncdef x265_ ## fname ## _64x64_ ## cpu; + p.cu[BLOCK_4x4].prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.cu[BLOCK_8x8].prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.cu[BLOCK_16x16].prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.cu[BLOCK_32x32].prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.cu[BLOCK_64x64].prim = fncdef PFX(fname ## _64x64_ ## cpu); #define ALL_LUMA_CU(prim, fname, cpu) ALL_LUMA_CU_TYPED(prim, , fname, cpu) #define ALL_LUMA_CU_S(prim, fname, cpu) ALL_LUMA_CU_TYPED_S(prim, , fname, cpu) #define ALL_LUMA_TU(prim, fname, cpu) ALL_LUMA_TU_TYPED(prim, , fname, cpu) @@ -72,30 +149,30 @@ #define ALL_LUMA_TU_S(prim, fname, cpu) ALL_LUMA_TU_TYPED_S(prim, , fname, cpu) #define ALL_LUMA_PU_TYPED(prim, fncdef, fname, cpu) \ - p.pu[LUMA_8x8].prim = fncdef x265_ ## fname ## _8x8_ ## cpu; \ - p.pu[LUMA_16x16].prim = fncdef x265_ ## fname ## _16x16_ ## cpu; \ - p.pu[LUMA_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu; \ - p.pu[LUMA_64x64].prim = fncdef x265_ ## fname ## _64x64_ ## cpu; \ - p.pu[LUMA_8x4].prim = fncdef x265_ ## fname ## _8x4_ ## cpu; \ - p.pu[LUMA_4x8].prim = fncdef x265_ ## fname ## _4x8_ ## cpu; \ - p.pu[LUMA_16x8].prim = fncdef x265_ ## fname ## _16x8_ ## cpu; \ - p.pu[LUMA_8x16].prim = fncdef x265_ ## fname ## _8x16_ ## cpu; \ - p.pu[LUMA_16x32].prim = fncdef x265_ ## fname ## _16x32_ ## cpu; \ - p.pu[LUMA_32x16].prim = fncdef x265_ ## fname ## _32x16_ ## cpu; \ - p.pu[LUMA_64x32].prim = fncdef x265_ ## fname ## _64x32_ ## cpu; \ - p.pu[LUMA_32x64].prim = fncdef x265_ ## fname ## _32x64_ ## cpu; \ - p.pu[LUMA_16x12].prim = fncdef x265_ ## fname ## _16x12_ ## cpu; \ - p.pu[LUMA_12x16].prim = fncdef x265_ ## fname ## _12x16_ ## cpu; \ - p.pu[LUMA_16x4].prim = fncdef x265_ ## fname ## _16x4_ ## cpu; \ - p.pu[LUMA_4x16].prim = fncdef x265_ ## fname ## _4x16_ ## cpu; \ - p.pu[LUMA_32x24].prim = fncdef x265_ ## fname ## _32x24_ ## cpu; \ - p.pu[LUMA_24x32].prim = fncdef x265_ ## fname ## _24x32_ ## cpu; \ - p.pu[LUMA_32x8].prim = fncdef x265_ ## fname ## _32x8_ ## cpu; \ - p.pu[LUMA_8x32].prim = fncdef x265_ ## fname ## _8x32_ ## cpu; \ - p.pu[LUMA_64x48].prim = fncdef x265_ ## fname ## _64x48_ ## cpu; \ - p.pu[LUMA_48x64].prim = fncdef x265_ ## fname ## _48x64_ ## cpu; \ - p.pu[LUMA_64x16].prim = fncdef x265_ ## fname ## _64x16_ ## cpu; \ - p.pu[LUMA_16x64].prim = fncdef x265_ ## fname ## _16x64_ ## cpu + p.pu[LUMA_8x8].prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.pu[LUMA_16x16].prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.pu[LUMA_32x32].prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.pu[LUMA_64x64].prim = fncdef PFX(fname ## _64x64_ ## cpu); \ + p.pu[LUMA_8x4].prim = fncdef PFX(fname ## _8x4_ ## cpu); \ + p.pu[LUMA_4x8].prim = fncdef PFX(fname ## _4x8_ ## cpu); \ + p.pu[LUMA_16x8].prim = fncdef PFX(fname ## _16x8_ ## cpu); \ + p.pu[LUMA_8x16].prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.pu[LUMA_16x32].prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.pu[LUMA_32x16].prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.pu[LUMA_64x32].prim = fncdef PFX(fname ## _64x32_ ## cpu); \ + 
p.pu[LUMA_32x64].prim = fncdef PFX(fname ## _32x64_ ## cpu); \ + p.pu[LUMA_16x12].prim = fncdef PFX(fname ## _16x12_ ## cpu); \ + p.pu[LUMA_12x16].prim = fncdef PFX(fname ## _12x16_ ## cpu); \ + p.pu[LUMA_16x4].prim = fncdef PFX(fname ## _16x4_ ## cpu); \ + p.pu[LUMA_4x16].prim = fncdef PFX(fname ## _4x16_ ## cpu); \ + p.pu[LUMA_32x24].prim = fncdef PFX(fname ## _32x24_ ## cpu); \ + p.pu[LUMA_24x32].prim = fncdef PFX(fname ## _24x32_ ## cpu); \ + p.pu[LUMA_32x8].prim = fncdef PFX(fname ## _32x8_ ## cpu); \ + p.pu[LUMA_8x32].prim = fncdef PFX(fname ## _8x32_ ## cpu); \ + p.pu[LUMA_64x48].prim = fncdef PFX(fname ## _64x48_ ## cpu); \ + p.pu[LUMA_48x64].prim = fncdef PFX(fname ## _48x64_ ## cpu); \ + p.pu[LUMA_64x16].prim = fncdef PFX(fname ## _64x16_ ## cpu); \ + p.pu[LUMA_16x64].prim = fncdef PFX(fname ## _16x64_ ## cpu) #define ALL_LUMA_PU(prim, fname, cpu) ALL_LUMA_PU_TYPED(prim, , fname, cpu) #define ALL_LUMA_PU_T(prim, fname) \ @@ -125,237 +202,237 @@ p.pu[LUMA_16x64].prim = fname<LUMA_16x64> #define ALL_CHROMA_420_CU_TYPED(prim, fncdef, fname, cpu) \ - p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].prim = fncdef x265_ ## fname ## _4x4_ ## cpu; \ - p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].prim = fncdef x265_ ## fname ## _8x8_ ## cpu; \ - p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].prim = fncdef x265_ ## fname ## _16x16_ ## cpu; \ - p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu + p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].prim = fncdef PFX(fname ## _32x32_ ## cpu) #define ALL_CHROMA_420_CU_TYPED_S(prim, fncdef, fname, cpu) \ - p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].prim = fncdef x265_ ## fname ## _4_ ## cpu; \ - p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].prim = fncdef x265_ ## fname ## _8_ ## cpu; \ - p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].prim = fncdef x265_ ## fname ## _16_ ## cpu; \ - p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].prim = fncdef x265_ ## fname ## _32_ ## cpu + p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].prim = fncdef PFX(fname ## _4_ ## cpu); \ + p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].prim = fncdef PFX(fname ## _8_ ## cpu); \ + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].prim = fncdef PFX(fname ## _16_ ## cpu); \ + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].prim = fncdef PFX(fname ## _32_ ## cpu) #define ALL_CHROMA_420_CU(prim, fname, cpu) ALL_CHROMA_420_CU_TYPED(prim, , fname, cpu) #define ALL_CHROMA_420_CU_S(prim, fname, cpu) ALL_CHROMA_420_CU_TYPED_S(prim, , fname, cpu) #define ALL_CHROMA_420_PU_TYPED(prim, fncdef, fname, cpu) \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].prim = fncdef x265_ ## fname ## _4x4_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].prim = fncdef x265_ ## fname ## _8x8_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].prim = fncdef x265_ ## fname ## _16x16_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].prim = fncdef x265_ ## fname ## _4x2_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].prim = fncdef x265_ ## fname ## _2x4_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].prim = fncdef x265_ ## fname ## _8x4_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].prim = fncdef x265_ ## fname ## _4x8_ 
## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].prim = fncdef x265_ ## fname ## _16x8_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].prim = fncdef x265_ ## fname ## _8x16_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].prim = fncdef x265_ ## fname ## _32x16_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].prim = fncdef x265_ ## fname ## _16x32_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].prim = fncdef x265_ ## fname ## _8x6_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].prim = fncdef x265_ ## fname ## _6x8_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].prim = fncdef x265_ ## fname ## _8x2_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].prim = fncdef x265_ ## fname ## _2x8_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].prim = fncdef x265_ ## fname ## _16x12_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].prim = fncdef x265_ ## fname ## _12x16_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].prim = fncdef x265_ ## fname ## _16x4_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].prim = fncdef x265_ ## fname ## _4x16_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].prim = fncdef x265_ ## fname ## _32x24_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].prim = fncdef x265_ ## fname ## _24x32_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].prim = fncdef x265_ ## fname ## _32x8_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].prim = fncdef x265_ ## fname ## _8x32_ ## cpu + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].prim = fncdef PFX(fname ## _4x2_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].prim = fncdef PFX(fname ## _2x4_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].prim = fncdef PFX(fname ## _8x4_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].prim = fncdef PFX(fname ## _4x8_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].prim = fncdef PFX(fname ## _16x8_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].prim = fncdef PFX(fname ## _8x6_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].prim = fncdef PFX(fname ## _6x8_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].prim = fncdef PFX(fname ## _8x2_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].prim = fncdef PFX(fname ## _2x8_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].prim = fncdef PFX(fname ## _16x12_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].prim = fncdef PFX(fname ## _12x16_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].prim = fncdef PFX(fname ## _16x4_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].prim = fncdef PFX(fname ## _4x16_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].prim = fncdef PFX(fname ## _32x24_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].prim = fncdef PFX(fname ## _24x32_ ## cpu); \ + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].prim = fncdef PFX(fname ## _32x8_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].prim = fncdef PFX(fname ## _8x32_ ## cpu) #define ALL_CHROMA_420_PU(prim, fname, cpu) ALL_CHROMA_420_PU_TYPED(prim, , fname, cpu) #define ALL_CHROMA_420_4x4_PU_TYPED(prim, fncdef, fname, cpu) \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].prim = fncdef x265_ ## fname ## _4x4_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].prim = fncdef x265_ ## fname ## _8x8_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].prim = fncdef x265_ ## fname ## _16x16_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].prim = fncdef x265_ ## fname ## _8x4_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].prim = fncdef x265_ ## fname ## _4x8_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].prim = fncdef x265_ ## fname ## _16x8_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].prim = fncdef x265_ ## fname ## _8x16_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].prim = fncdef x265_ ## fname ## _32x16_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].prim = fncdef x265_ ## fname ## _16x32_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].prim = fncdef x265_ ## fname ## _16x12_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].prim = fncdef x265_ ## fname ## _12x16_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].prim = fncdef x265_ ## fname ## _16x4_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].prim = fncdef x265_ ## fname ## _4x16_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].prim = fncdef x265_ ## fname ## _32x24_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].prim = fncdef x265_ ## fname ## _24x32_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].prim = fncdef x265_ ## fname ## _32x8_ ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].prim = fncdef x265_ ## fname ## _8x32_ ## cpu + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].prim = fncdef PFX(fname ## _8x4_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].prim = fncdef PFX(fname ## _4x8_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].prim = fncdef PFX(fname ## _16x8_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].prim = fncdef PFX(fname ## _16x12_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].prim = fncdef PFX(fname ## _12x16_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].prim = fncdef PFX(fname ## _16x4_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].prim = fncdef PFX(fname ## _4x16_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].prim = fncdef PFX(fname ## _32x24_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].prim = fncdef PFX(fname ## _24x32_ ## cpu); \ + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].prim = fncdef PFX(fname ## _32x8_ ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].prim = fncdef PFX(fname ## _8x32_ ## cpu) #define ALL_CHROMA_420_4x4_PU(prim, fname, cpu) ALL_CHROMA_420_4x4_PU_TYPED(prim, , fname, cpu) #define ALL_CHROMA_422_CU_TYPED(prim, fncdef, fname, cpu) \ - p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].prim = fncdef x265_ ## fname ## _4x8_ ## cpu; \ - p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].prim = fncdef x265_ ## fname ## _8x16_ ## cpu; \ - p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].prim = fncdef x265_ ## fname ## _16x32_ ## cpu; \ - p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].prim = fncdef x265_ ## fname ## _32x64_ ## cpu + p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].prim = fncdef PFX(fname ## _4x8_ ## cpu); \ + p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].prim = fncdef PFX(fname ## _32x64_ ## cpu) #define ALL_CHROMA_422_CU(prim, fname, cpu) ALL_CHROMA_422_CU_TYPED(prim, , fname, cpu) #define ALL_CHROMA_422_PU_TYPED(prim, fncdef, fname, cpu) \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].prim = fncdef x265_ ## fname ## _4x8_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].prim = fncdef x265_ ## fname ## _8x16_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].prim = fncdef x265_ ## fname ## _16x32_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].prim = fncdef x265_ ## fname ## _32x64_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].prim = fncdef x265_ ## fname ## _4x4_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].prim = fncdef x265_ ## fname ## _2x8_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].prim = fncdef x265_ ## fname ## _8x8_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].prim = fncdef x265_ ## fname ## _4x16_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].prim = fncdef x265_ ## fname ## _16x16_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].prim = fncdef x265_ ## fname ## _8x32_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].prim = fncdef x265_ ## fname ## _16x64_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].prim = fncdef x265_ ## fname ## _8x12_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].prim = fncdef x265_ ## fname ## _6x16_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].prim = fncdef x265_ ## fname ## _8x4_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].prim = fncdef x265_ ## fname ## _2x16_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].prim = fncdef x265_ ## fname ## _16x24_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].prim = fncdef x265_ ## fname ## _12x32_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].prim = fncdef x265_ ## fname ## _16x8_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].prim = fncdef x265_ ## fname ## _4x32_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].prim = fncdef x265_ ## fname ## _32x48_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].prim = fncdef x265_ ## fname ## _24x64_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].prim = fncdef x265_ ## fname ## _32x16_ ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].prim = fncdef x265_ ## fname ## _8x64_ ## cpu + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].prim = 
fncdef PFX(fname ## _4x8_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].prim = fncdef PFX(fname ## _32x64_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].prim = fncdef PFX(fname ## _2x8_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].prim = fncdef PFX(fname ## _4x16_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].prim = fncdef PFX(fname ## _8x32_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].prim = fncdef PFX(fname ## _16x64_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].prim = fncdef PFX(fname ## _8x12_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].prim = fncdef PFX(fname ## _6x16_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].prim = fncdef PFX(fname ## _8x4_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].prim = fncdef PFX(fname ## _2x16_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].prim = fncdef PFX(fname ## _16x24_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].prim = fncdef PFX(fname ## _12x32_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].prim = fncdef PFX(fname ## _16x8_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].prim = fncdef PFX(fname ## _4x32_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].prim = fncdef PFX(fname ## _32x48_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].prim = fncdef PFX(fname ## _24x64_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].prim = fncdef PFX(fname ## _8x64_ ## cpu) #define ALL_CHROMA_422_PU(prim, fname, cpu) ALL_CHROMA_422_PU_TYPED(prim, , fname, cpu) #define ALL_CHROMA_444_PU_TYPED(prim, fncdef, fname, cpu) \ - p.chroma[X265_CSP_I444].pu[LUMA_4x4].prim = fncdef x265_ ## fname ## _4x4_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_8x8].prim = fncdef x265_ ## fname ## _8x8_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_16x16].prim = fncdef x265_ ## fname ## _16x16_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_32x32].prim = fncdef x265_ ## fname ## _32x32_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_64x64].prim = fncdef x265_ ## fname ## _64x64_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_8x4].prim = fncdef x265_ ## fname ## _8x4_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_4x8].prim = fncdef x265_ ## fname ## _4x8_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_16x8].prim = fncdef x265_ ## fname ## _16x8_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_8x16].prim = fncdef x265_ ## fname ## _8x16_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_16x32].prim = fncdef x265_ ## fname ## _16x32_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_32x16].prim = fncdef x265_ ## fname ## _32x16_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_64x32].prim = fncdef x265_ ## fname ## _64x32_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_32x64].prim = fncdef x265_ ## fname ## _32x64_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_16x12].prim = fncdef x265_ ## fname ## _16x12_ ## cpu; \ - 
p.chroma[X265_CSP_I444].pu[LUMA_12x16].prim = fncdef x265_ ## fname ## _12x16_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_16x4].prim = fncdef x265_ ## fname ## _16x4_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_4x16].prim = fncdef x265_ ## fname ## _4x16_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_32x24].prim = fncdef x265_ ## fname ## _32x24_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_24x32].prim = fncdef x265_ ## fname ## _24x32_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_32x8].prim = fncdef x265_ ## fname ## _32x8_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_8x32].prim = fncdef x265_ ## fname ## _8x32_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_64x48].prim = fncdef x265_ ## fname ## _64x48_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_48x64].prim = fncdef x265_ ## fname ## _48x64_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_64x16].prim = fncdef x265_ ## fname ## _64x16_ ## cpu; \ - p.chroma[X265_CSP_I444].pu[LUMA_16x64].prim = fncdef x265_ ## fname ## _16x64_ ## cpu + p.chroma[X265_CSP_I444].pu[LUMA_4x4].prim = fncdef PFX(fname ## _4x4_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_8x8].prim = fncdef PFX(fname ## _8x8_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_16x16].prim = fncdef PFX(fname ## _16x16_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_32x32].prim = fncdef PFX(fname ## _32x32_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_64x64].prim = fncdef PFX(fname ## _64x64_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_8x4].prim = fncdef PFX(fname ## _8x4_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_4x8].prim = fncdef PFX(fname ## _4x8_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_16x8].prim = fncdef PFX(fname ## _16x8_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_8x16].prim = fncdef PFX(fname ## _8x16_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_16x32].prim = fncdef PFX(fname ## _16x32_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_32x16].prim = fncdef PFX(fname ## _32x16_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_64x32].prim = fncdef PFX(fname ## _64x32_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_32x64].prim = fncdef PFX(fname ## _32x64_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_16x12].prim = fncdef PFX(fname ## _16x12_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_12x16].prim = fncdef PFX(fname ## _12x16_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_16x4].prim = fncdef PFX(fname ## _16x4_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_4x16].prim = fncdef PFX(fname ## _4x16_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_32x24].prim = fncdef PFX(fname ## _32x24_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_24x32].prim = fncdef PFX(fname ## _24x32_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_32x8].prim = fncdef PFX(fname ## _32x8_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_8x32].prim = fncdef PFX(fname ## _8x32_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_64x48].prim = fncdef PFX(fname ## _64x48_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_48x64].prim = fncdef PFX(fname ## _48x64_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_64x16].prim = fncdef PFX(fname ## _64x16_ ## cpu); \ + p.chroma[X265_CSP_I444].pu[LUMA_16x64].prim = fncdef PFX(fname ## _16x64_ ## cpu) #define ALL_CHROMA_444_PU(prim, fname, cpu) ALL_CHROMA_444_PU_TYPED(prim, , fname, cpu) #define AVC_LUMA_PU(name, cpu) \ - p.pu[LUMA_16x16].name = x265_pixel_ ## name ## _16x16_ ## cpu; \ - p.pu[LUMA_16x8].name = x265_pixel_ ## name ## _16x8_ ## cpu; \ - p.pu[LUMA_8x16].name = x265_pixel_ ## name ## _8x16_ ## cpu; \ - p.pu[LUMA_8x8].name = x265_pixel_ ## name ## _8x8_ ## cpu; \ - p.pu[LUMA_8x4].name = x265_pixel_ ## 
name ## _8x4_ ## cpu; \ - p.pu[LUMA_4x8].name = x265_pixel_ ## name ## _4x8_ ## cpu; \ - p.pu[LUMA_4x4].name = x265_pixel_ ## name ## _4x4_ ## cpu; \ - p.pu[LUMA_4x16].name = x265_pixel_ ## name ## _4x16_ ## cpu + p.pu[LUMA_16x16].name = PFX(pixel_ ## name ## _16x16_ ## cpu); \ + p.pu[LUMA_16x8].name = PFX(pixel_ ## name ## _16x8_ ## cpu); \ + p.pu[LUMA_8x16].name = PFX(pixel_ ## name ## _8x16_ ## cpu); \ + p.pu[LUMA_8x8].name = PFX(pixel_ ## name ## _8x8_ ## cpu); \ + p.pu[LUMA_8x4].name = PFX(pixel_ ## name ## _8x4_ ## cpu); \ + p.pu[LUMA_4x8].name = PFX(pixel_ ## name ## _4x8_ ## cpu); \ + p.pu[LUMA_4x4].name = PFX(pixel_ ## name ## _4x4_ ## cpu); \ + p.pu[LUMA_4x16].name = PFX(pixel_ ## name ## _4x16_ ## cpu) #define HEVC_SAD(cpu) \ - p.pu[LUMA_8x32].sad = x265_pixel_sad_8x32_ ## cpu; \ - p.pu[LUMA_16x4].sad = x265_pixel_sad_16x4_ ## cpu; \ - p.pu[LUMA_16x12].sad = x265_pixel_sad_16x12_ ## cpu; \ - p.pu[LUMA_16x32].sad = x265_pixel_sad_16x32_ ## cpu; \ - p.pu[LUMA_16x64].sad = x265_pixel_sad_16x64_ ## cpu; \ - p.pu[LUMA_32x8].sad = x265_pixel_sad_32x8_ ## cpu; \ - p.pu[LUMA_32x16].sad = x265_pixel_sad_32x16_ ## cpu; \ - p.pu[LUMA_32x24].sad = x265_pixel_sad_32x24_ ## cpu; \ - p.pu[LUMA_32x32].sad = x265_pixel_sad_32x32_ ## cpu; \ - p.pu[LUMA_32x64].sad = x265_pixel_sad_32x64_ ## cpu; \ - p.pu[LUMA_64x16].sad = x265_pixel_sad_64x16_ ## cpu; \ - p.pu[LUMA_64x32].sad = x265_pixel_sad_64x32_ ## cpu; \ - p.pu[LUMA_64x48].sad = x265_pixel_sad_64x48_ ## cpu; \ - p.pu[LUMA_64x64].sad = x265_pixel_sad_64x64_ ## cpu; \ - p.pu[LUMA_48x64].sad = x265_pixel_sad_48x64_ ## cpu; \ - p.pu[LUMA_24x32].sad = x265_pixel_sad_24x32_ ## cpu; \ - p.pu[LUMA_12x16].sad = x265_pixel_sad_12x16_ ## cpu + p.pu[LUMA_8x32].sad = PFX(pixel_sad_8x32_ ## cpu); \ + p.pu[LUMA_16x4].sad = PFX(pixel_sad_16x4_ ## cpu); \ + p.pu[LUMA_16x12].sad = PFX(pixel_sad_16x12_ ## cpu); \ + p.pu[LUMA_16x32].sad = PFX(pixel_sad_16x32_ ## cpu); \ + p.pu[LUMA_16x64].sad = PFX(pixel_sad_16x64_ ## cpu); \ + p.pu[LUMA_32x8].sad = PFX(pixel_sad_32x8_ ## cpu); \ + p.pu[LUMA_32x16].sad = PFX(pixel_sad_32x16_ ## cpu); \ + p.pu[LUMA_32x24].sad = PFX(pixel_sad_32x24_ ## cpu); \ + p.pu[LUMA_32x32].sad = PFX(pixel_sad_32x32_ ## cpu); \ + p.pu[LUMA_32x64].sad = PFX(pixel_sad_32x64_ ## cpu); \ + p.pu[LUMA_64x16].sad = PFX(pixel_sad_64x16_ ## cpu); \ + p.pu[LUMA_64x32].sad = PFX(pixel_sad_64x32_ ## cpu); \ + p.pu[LUMA_64x48].sad = PFX(pixel_sad_64x48_ ## cpu); \ + p.pu[LUMA_64x64].sad = PFX(pixel_sad_64x64_ ## cpu); \ + p.pu[LUMA_48x64].sad = PFX(pixel_sad_48x64_ ## cpu); \ + p.pu[LUMA_24x32].sad = PFX(pixel_sad_24x32_ ## cpu); \ + p.pu[LUMA_12x16].sad = PFX(pixel_sad_12x16_ ## cpu) #define HEVC_SAD_X3(cpu) \ - p.pu[LUMA_16x8].sad_x3 = x265_pixel_sad_x3_16x8_ ## cpu; \ - p.pu[LUMA_16x12].sad_x3 = x265_pixel_sad_x3_16x12_ ## cpu; \ - p.pu[LUMA_16x16].sad_x3 = x265_pixel_sad_x3_16x16_ ## cpu; \ - p.pu[LUMA_16x32].sad_x3 = x265_pixel_sad_x3_16x32_ ## cpu; \ - p.pu[LUMA_16x64].sad_x3 = x265_pixel_sad_x3_16x64_ ## cpu; \ - p.pu[LUMA_32x8].sad_x3 = x265_pixel_sad_x3_32x8_ ## cpu; \ - p.pu[LUMA_32x16].sad_x3 = x265_pixel_sad_x3_32x16_ ## cpu; \ - p.pu[LUMA_32x24].sad_x3 = x265_pixel_sad_x3_32x24_ ## cpu; \ - p.pu[LUMA_32x32].sad_x3 = x265_pixel_sad_x3_32x32_ ## cpu; \ - p.pu[LUMA_32x64].sad_x3 = x265_pixel_sad_x3_32x64_ ## cpu; \ - p.pu[LUMA_24x32].sad_x3 = x265_pixel_sad_x3_24x32_ ## cpu; \ - p.pu[LUMA_48x64].sad_x3 = x265_pixel_sad_x3_48x64_ ## cpu; \ - p.pu[LUMA_64x16].sad_x3 = x265_pixel_sad_x3_64x16_ ## cpu; \ - p.pu[LUMA_64x32].sad_x3 = 
x265_pixel_sad_x3_64x32_ ## cpu; \ - p.pu[LUMA_64x48].sad_x3 = x265_pixel_sad_x3_64x48_ ## cpu; \ - p.pu[LUMA_64x64].sad_x3 = x265_pixel_sad_x3_64x64_ ## cpu + p.pu[LUMA_16x8].sad_x3 = PFX(pixel_sad_x3_16x8_ ## cpu); \ + p.pu[LUMA_16x12].sad_x3 = PFX(pixel_sad_x3_16x12_ ## cpu); \ + p.pu[LUMA_16x16].sad_x3 = PFX(pixel_sad_x3_16x16_ ## cpu); \ + p.pu[LUMA_16x32].sad_x3 = PFX(pixel_sad_x3_16x32_ ## cpu); \ + p.pu[LUMA_16x64].sad_x3 = PFX(pixel_sad_x3_16x64_ ## cpu); \ + p.pu[LUMA_32x8].sad_x3 = PFX(pixel_sad_x3_32x8_ ## cpu); \ + p.pu[LUMA_32x16].sad_x3 = PFX(pixel_sad_x3_32x16_ ## cpu); \ + p.pu[LUMA_32x24].sad_x3 = PFX(pixel_sad_x3_32x24_ ## cpu); \ + p.pu[LUMA_32x32].sad_x3 = PFX(pixel_sad_x3_32x32_ ## cpu); \ + p.pu[LUMA_32x64].sad_x3 = PFX(pixel_sad_x3_32x64_ ## cpu); \ + p.pu[LUMA_24x32].sad_x3 = PFX(pixel_sad_x3_24x32_ ## cpu); \ + p.pu[LUMA_48x64].sad_x3 = PFX(pixel_sad_x3_48x64_ ## cpu); \ + p.pu[LUMA_64x16].sad_x3 = PFX(pixel_sad_x3_64x16_ ## cpu); \ + p.pu[LUMA_64x32].sad_x3 = PFX(pixel_sad_x3_64x32_ ## cpu); \ + p.pu[LUMA_64x48].sad_x3 = PFX(pixel_sad_x3_64x48_ ## cpu); \ + p.pu[LUMA_64x64].sad_x3 = PFX(pixel_sad_x3_64x64_ ## cpu) #define HEVC_SAD_X4(cpu) \ - p.pu[LUMA_16x8].sad_x4 = x265_pixel_sad_x4_16x8_ ## cpu; \ - p.pu[LUMA_16x12].sad_x4 = x265_pixel_sad_x4_16x12_ ## cpu; \ - p.pu[LUMA_16x16].sad_x4 = x265_pixel_sad_x4_16x16_ ## cpu; \ - p.pu[LUMA_16x32].sad_x4 = x265_pixel_sad_x4_16x32_ ## cpu; \ - p.pu[LUMA_16x64].sad_x4 = x265_pixel_sad_x4_16x64_ ## cpu; \ - p.pu[LUMA_32x8].sad_x4 = x265_pixel_sad_x4_32x8_ ## cpu; \ - p.pu[LUMA_32x16].sad_x4 = x265_pixel_sad_x4_32x16_ ## cpu; \ - p.pu[LUMA_32x24].sad_x4 = x265_pixel_sad_x4_32x24_ ## cpu; \ - p.pu[LUMA_32x32].sad_x4 = x265_pixel_sad_x4_32x32_ ## cpu; \ - p.pu[LUMA_32x64].sad_x4 = x265_pixel_sad_x4_32x64_ ## cpu; \ - p.pu[LUMA_24x32].sad_x4 = x265_pixel_sad_x4_24x32_ ## cpu; \ - p.pu[LUMA_48x64].sad_x4 = x265_pixel_sad_x4_48x64_ ## cpu; \ - p.pu[LUMA_64x16].sad_x4 = x265_pixel_sad_x4_64x16_ ## cpu; \ - p.pu[LUMA_64x32].sad_x4 = x265_pixel_sad_x4_64x32_ ## cpu; \ - p.pu[LUMA_64x48].sad_x4 = x265_pixel_sad_x4_64x48_ ## cpu; \ - p.pu[LUMA_64x64].sad_x4 = x265_pixel_sad_x4_64x64_ ## cpu + p.pu[LUMA_16x8].sad_x4 = PFX(pixel_sad_x4_16x8_ ## cpu); \ + p.pu[LUMA_16x12].sad_x4 = PFX(pixel_sad_x4_16x12_ ## cpu); \ + p.pu[LUMA_16x16].sad_x4 = PFX(pixel_sad_x4_16x16_ ## cpu); \ + p.pu[LUMA_16x32].sad_x4 = PFX(pixel_sad_x4_16x32_ ## cpu); \ + p.pu[LUMA_16x64].sad_x4 = PFX(pixel_sad_x4_16x64_ ## cpu); \ + p.pu[LUMA_32x8].sad_x4 = PFX(pixel_sad_x4_32x8_ ## cpu); \ + p.pu[LUMA_32x16].sad_x4 = PFX(pixel_sad_x4_32x16_ ## cpu); \ + p.pu[LUMA_32x24].sad_x4 = PFX(pixel_sad_x4_32x24_ ## cpu); \ + p.pu[LUMA_32x32].sad_x4 = PFX(pixel_sad_x4_32x32_ ## cpu); \ + p.pu[LUMA_32x64].sad_x4 = PFX(pixel_sad_x4_32x64_ ## cpu); \ + p.pu[LUMA_24x32].sad_x4 = PFX(pixel_sad_x4_24x32_ ## cpu); \ + p.pu[LUMA_48x64].sad_x4 = PFX(pixel_sad_x4_48x64_ ## cpu); \ + p.pu[LUMA_64x16].sad_x4 = PFX(pixel_sad_x4_64x16_ ## cpu); \ + p.pu[LUMA_64x32].sad_x4 = PFX(pixel_sad_x4_64x32_ ## cpu); \ + p.pu[LUMA_64x48].sad_x4 = PFX(pixel_sad_x4_64x48_ ## cpu); \ + p.pu[LUMA_64x64].sad_x4 = PFX(pixel_sad_x4_64x64_ ## cpu) #define ASSIGN_SSE_PP(cpu) \ - p.cu[BLOCK_8x8].sse_pp = x265_pixel_ssd_8x8_ ## cpu; \ - p.cu[BLOCK_16x16].sse_pp = x265_pixel_ssd_16x16_ ## cpu; \ - p.cu[BLOCK_32x32].sse_pp = x265_pixel_ssd_32x32_ ## cpu; \ - p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sse_pp = x265_pixel_ssd_8x16_ ## cpu; \ - p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sse_pp = 
x265_pixel_ssd_16x32_ ## cpu; \ - p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sse_pp = x265_pixel_ssd_32x64_ ## cpu; + p.cu[BLOCK_8x8].sse_pp = PFX(pixel_ssd_8x8_ ## cpu); \ + p.cu[BLOCK_16x16].sse_pp = PFX(pixel_ssd_16x16_ ## cpu); \ + p.cu[BLOCK_32x32].sse_pp = PFX(pixel_ssd_32x32_ ## cpu); \ + p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sse_pp = PFX(pixel_ssd_8x16_ ## cpu); \ + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sse_pp = PFX(pixel_ssd_16x32_ ## cpu); \ + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sse_pp = PFX(pixel_ssd_32x64_ ## cpu); #define ASSIGN_SSE_SS(cpu) ALL_LUMA_BLOCKS(sse_ss, pixel_ssd_ss, cpu) #define ASSIGN_SA8D(cpu) \ ALL_LUMA_CU(sa8d, pixel_sa8d, cpu); \ - p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sa8d = x265_pixel_sa8d_8x16_ ## cpu; \ - p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sa8d = x265_pixel_sa8d_16x32_ ## cpu; \ - p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sa8d = x265_pixel_sa8d_32x64_ ## cpu + p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sa8d = PFX(pixel_sa8d_8x16_ ## cpu); \ + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sa8d = PFX(pixel_sa8d_16x32_ ## cpu); \ + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sa8d = PFX(pixel_sa8d_32x64_ ## cpu) #define PIXEL_AVG(cpu) \ - p.pu[LUMA_64x64].pixelavg_pp = x265_pixel_avg_64x64_ ## cpu; \ - p.pu[LUMA_64x48].pixelavg_pp = x265_pixel_avg_64x48_ ## cpu; \ - p.pu[LUMA_64x32].pixelavg_pp = x265_pixel_avg_64x32_ ## cpu; \ - p.pu[LUMA_64x16].pixelavg_pp = x265_pixel_avg_64x16_ ## cpu; \ - p.pu[LUMA_48x64].pixelavg_pp = x265_pixel_avg_48x64_ ## cpu; \ - p.pu[LUMA_32x64].pixelavg_pp = x265_pixel_avg_32x64_ ## cpu; \ - p.pu[LUMA_32x32].pixelavg_pp = x265_pixel_avg_32x32_ ## cpu; \ - p.pu[LUMA_32x24].pixelavg_pp = x265_pixel_avg_32x24_ ## cpu; \ - p.pu[LUMA_32x16].pixelavg_pp = x265_pixel_avg_32x16_ ## cpu; \ - p.pu[LUMA_32x8].pixelavg_pp = x265_pixel_avg_32x8_ ## cpu; \ - p.pu[LUMA_24x32].pixelavg_pp = x265_pixel_avg_24x32_ ## cpu; \ - p.pu[LUMA_16x64].pixelavg_pp = x265_pixel_avg_16x64_ ## cpu; \ - p.pu[LUMA_16x32].pixelavg_pp = x265_pixel_avg_16x32_ ## cpu; \ - p.pu[LUMA_16x16].pixelavg_pp = x265_pixel_avg_16x16_ ## cpu; \ - p.pu[LUMA_16x12].pixelavg_pp = x265_pixel_avg_16x12_ ## cpu; \ - p.pu[LUMA_16x8].pixelavg_pp = x265_pixel_avg_16x8_ ## cpu; \ - p.pu[LUMA_16x4].pixelavg_pp = x265_pixel_avg_16x4_ ## cpu; \ - p.pu[LUMA_12x16].pixelavg_pp = x265_pixel_avg_12x16_ ## cpu; \ - p.pu[LUMA_8x32].pixelavg_pp = x265_pixel_avg_8x32_ ## cpu; \ - p.pu[LUMA_8x16].pixelavg_pp = x265_pixel_avg_8x16_ ## cpu; \ - p.pu[LUMA_8x8].pixelavg_pp = x265_pixel_avg_8x8_ ## cpu; \ - p.pu[LUMA_8x4].pixelavg_pp = x265_pixel_avg_8x4_ ## cpu; + p.pu[LUMA_64x64].pixelavg_pp = PFX(pixel_avg_64x64_ ## cpu); \ + p.pu[LUMA_64x48].pixelavg_pp = PFX(pixel_avg_64x48_ ## cpu); \ + p.pu[LUMA_64x32].pixelavg_pp = PFX(pixel_avg_64x32_ ## cpu); \ + p.pu[LUMA_64x16].pixelavg_pp = PFX(pixel_avg_64x16_ ## cpu); \ + p.pu[LUMA_48x64].pixelavg_pp = PFX(pixel_avg_48x64_ ## cpu); \ + p.pu[LUMA_32x64].pixelavg_pp = PFX(pixel_avg_32x64_ ## cpu); \ + p.pu[LUMA_32x32].pixelavg_pp = PFX(pixel_avg_32x32_ ## cpu); \ + p.pu[LUMA_32x24].pixelavg_pp = PFX(pixel_avg_32x24_ ## cpu); \ + p.pu[LUMA_32x16].pixelavg_pp = PFX(pixel_avg_32x16_ ## cpu); \ + p.pu[LUMA_32x8].pixelavg_pp = PFX(pixel_avg_32x8_ ## cpu); \ + p.pu[LUMA_24x32].pixelavg_pp = PFX(pixel_avg_24x32_ ## cpu); \ + p.pu[LUMA_16x64].pixelavg_pp = PFX(pixel_avg_16x64_ ## cpu); \ + p.pu[LUMA_16x32].pixelavg_pp = PFX(pixel_avg_16x32_ ## cpu); \ + p.pu[LUMA_16x16].pixelavg_pp = PFX(pixel_avg_16x16_ ## cpu); \ + 
p.pu[LUMA_16x12].pixelavg_pp = PFX(pixel_avg_16x12_ ## cpu); \ + p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_ ## cpu); \ + p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_ ## cpu); \ + p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_ ## cpu); \ + p.pu[LUMA_8x32].pixelavg_pp = PFX(pixel_avg_8x32_ ## cpu); \ + p.pu[LUMA_8x16].pixelavg_pp = PFX(pixel_avg_8x16_ ## cpu); \ + p.pu[LUMA_8x8].pixelavg_pp = PFX(pixel_avg_8x8_ ## cpu); \ + p.pu[LUMA_8x4].pixelavg_pp = PFX(pixel_avg_8x4_ ## cpu); #define PIXEL_AVG_W4(cpu) \ - p.pu[LUMA_4x4].pixelavg_pp = x265_pixel_avg_4x4_ ## cpu; \ - p.pu[LUMA_4x8].pixelavg_pp = x265_pixel_avg_4x8_ ## cpu; \ - p.pu[LUMA_4x16].pixelavg_pp = x265_pixel_avg_4x16_ ## cpu; + p.pu[LUMA_4x4].pixelavg_pp = PFX(pixel_avg_4x4_ ## cpu); \ + p.pu[LUMA_4x8].pixelavg_pp = PFX(pixel_avg_4x8_ ## cpu); \ + p.pu[LUMA_4x16].pixelavg_pp = PFX(pixel_avg_4x16_ ## cpu); #define CHROMA_420_FILTERS(cpu) \ ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, cpu); \ @@ -376,7 +453,7 @@ ALL_CHROMA_444_PU(filter_vps, interp_4tap_vert_ps, cpu); #define SETUP_CHROMA_420_VSP_FUNC_DEF(W, H, cpu) \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = x265_interp_4tap_vert_sp_ ## W ## x ## H ## cpu; + p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = PFX(interp_4tap_vert_sp_ ## W ## x ## H ## cpu); #define CHROMA_420_VSP_FILTERS_SSE4(cpu) \ SETUP_CHROMA_420_VSP_FUNC_DEF(4, 4, cpu); \ @@ -407,7 +484,7 @@ SETUP_CHROMA_420_VSP_FUNC_DEF(8, 32, cpu); #define SETUP_CHROMA_422_VSP_FUNC_DEF(W, H, cpu) \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = x265_interp_4tap_vert_sp_ ## W ## x ## H ## cpu; + p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = PFX(interp_4tap_vert_sp_ ## W ## x ## H ## cpu); #define CHROMA_422_VSP_FILTERS_SSE4(cpu) \ SETUP_CHROMA_422_VSP_FUNC_DEF(4, 8, cpu); \ @@ -438,7 +515,7 @@ SETUP_CHROMA_422_VSP_FUNC_DEF(8, 64, cpu); #define SETUP_CHROMA_444_VSP_FUNC_DEF(W, H, cpu) \ - p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vsp = x265_interp_4tap_vert_sp_ ## W ## x ## H ## cpu; + p.chroma[X265_CSP_I444].pu[LUMA_ ## W ## x ## H].filter_vsp = PFX(interp_4tap_vert_sp_ ## W ## x ## H ## cpu); #define CHROMA_444_VSP_FILTERS_SSE4(cpu) \ SETUP_CHROMA_444_VSP_FUNC_DEF(4, 4, cpu); \ @@ -470,7 +547,7 @@ SETUP_CHROMA_444_VSP_FUNC_DEF(8, 32, cpu); #define SETUP_CHROMA_420_VSS_FUNC_DEF(W, H, cpu) \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = x265_interp_4tap_vert_ss_ ## W ## x ## H ## cpu; + p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = PFX(interp_4tap_vert_ss_ ## W ## x ## H ## cpu); #define CHROMA_420_VSS_FILTERS(cpu) \ SETUP_CHROMA_420_VSS_FUNC_DEF(4, 4, cpu); \ @@ -501,7 +578,7 @@ SETUP_CHROMA_420_VSS_FUNC_DEF(6, 8, cpu); #define SETUP_CHROMA_422_VSS_FUNC_DEF(W, H, cpu) \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = x265_interp_4tap_vert_ss_ ## W ## x ## H ## cpu; + p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = PFX(interp_4tap_vert_ss_ ## W ## x ## H ## cpu); #define CHROMA_422_VSS_FILTERS(cpu) \ SETUP_CHROMA_422_VSS_FUNC_DEF(4, 8, cpu); \ @@ -534,29 +611,29 @@ #define CHROMA_444_VSS_FILTERS(cpu) ALL_CHROMA_444_PU(filter_vss, interp_4tap_vert_ss, cpu) #define LUMA_FILTERS(cpu) \ - ALL_LUMA_PU(luma_hpp, interp_8tap_horiz_pp, cpu); p.pu[LUMA_4x4].luma_hpp = x265_interp_8tap_horiz_pp_4x4_ ## cpu; \ - ALL_LUMA_PU(luma_hps, interp_8tap_horiz_ps, cpu); p.pu[LUMA_4x4].luma_hps = x265_interp_8tap_horiz_ps_4x4_ ## cpu; \ 
- ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, cpu); p.pu[LUMA_4x4].luma_vpp = x265_interp_8tap_vert_pp_4x4_ ## cpu; \ - ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, cpu); p.pu[LUMA_4x4].luma_vps = x265_interp_8tap_vert_ps_4x4_ ## cpu; \ - ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, cpu); p.pu[LUMA_4x4].luma_vsp = x265_interp_8tap_vert_sp_4x4_ ## cpu; \ + ALL_LUMA_PU(luma_hpp, interp_8tap_horiz_pp, cpu); p.pu[LUMA_4x4].luma_hpp = PFX(interp_8tap_horiz_pp_4x4_ ## cpu); \ + ALL_LUMA_PU(luma_hps, interp_8tap_horiz_ps, cpu); p.pu[LUMA_4x4].luma_hps = PFX(interp_8tap_horiz_ps_4x4_ ## cpu); \ + ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, cpu); p.pu[LUMA_4x4].luma_vpp = PFX(interp_8tap_vert_pp_4x4_ ## cpu); \ + ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, cpu); p.pu[LUMA_4x4].luma_vps = PFX(interp_8tap_vert_ps_4x4_ ## cpu); \ + ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, cpu); p.pu[LUMA_4x4].luma_vsp = PFX(interp_8tap_vert_sp_4x4_ ## cpu); \ ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu); p.pu[LUMA_4x4].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_4x4>; -#define LUMA_VSS_FILTERS(cpu) ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, cpu); p.pu[LUMA_4x4].luma_vss = x265_interp_8tap_vert_ss_4x4_ ## cpu +#define LUMA_VSS_FILTERS(cpu) ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, cpu); p.pu[LUMA_4x4].luma_vss = PFX(interp_8tap_vert_ss_4x4_ ## cpu) #define LUMA_CU_BLOCKCOPY(type, cpu) \ - p.cu[BLOCK_4x4].copy_ ## type = x265_blockcopy_ ## type ## _4x4_ ## cpu; \ + p.cu[BLOCK_4x4].copy_ ## type = PFX(blockcopy_ ## type ## _4x4_ ## cpu); \ ALL_LUMA_CU(copy_ ## type, blockcopy_ ## type, cpu); #define CHROMA_420_CU_BLOCKCOPY(type, cpu) ALL_CHROMA_420_CU(copy_ ## type, blockcopy_ ## type, cpu) #define CHROMA_422_CU_BLOCKCOPY(type, cpu) ALL_CHROMA_422_CU(copy_ ## type, blockcopy_ ## type, cpu) -#define LUMA_PU_BLOCKCOPY(type, cpu) ALL_LUMA_PU(copy_ ## type, blockcopy_ ## type, cpu); p.pu[LUMA_4x4].copy_ ## type = x265_blockcopy_ ## type ## _4x4_ ## cpu +#define LUMA_PU_BLOCKCOPY(type, cpu) ALL_LUMA_PU(copy_ ## type, blockcopy_ ## type, cpu); p.pu[LUMA_4x4].copy_ ## type = PFX(blockcopy_ ## type ## _4x4_ ## cpu) #define CHROMA_420_PU_BLOCKCOPY(type, cpu) ALL_CHROMA_420_PU(copy_ ## type, blockcopy_ ## type, cpu) #define CHROMA_422_PU_BLOCKCOPY(type, cpu) ALL_CHROMA_422_PU(copy_ ## type, blockcopy_ ## type, cpu) #define LUMA_PIXELSUB(cpu) \ - p.cu[BLOCK_4x4].sub_ps = x265_pixel_sub_ps_4x4_ ## cpu; \ - p.cu[BLOCK_4x4].add_ps = x265_pixel_add_ps_4x4_ ## cpu; \ + p.cu[BLOCK_4x4].sub_ps = PFX(pixel_sub_ps_4x4_ ## cpu); \ + p.cu[BLOCK_4x4].add_ps = PFX(pixel_add_ps_4x4_ ## cpu); \ ALL_LUMA_CU(sub_ps, pixel_sub_ps, cpu); \ ALL_LUMA_CU(add_ps, pixel_add_ps, cpu); @@ -570,26 +647,26 @@ #define LUMA_VAR(cpu) ALL_LUMA_CU(var, pixel_var, cpu) -#define LUMA_ADDAVG(cpu) ALL_LUMA_PU(addAvg, addAvg, cpu); p.pu[LUMA_4x4].addAvg = x265_addAvg_4x4_ ## cpu +#define LUMA_ADDAVG(cpu) ALL_LUMA_PU(addAvg, addAvg, cpu); p.pu[LUMA_4x4].addAvg = PFX(addAvg_4x4_ ## cpu) #define CHROMA_420_ADDAVG(cpu) ALL_CHROMA_420_PU(addAvg, addAvg, cpu); #define CHROMA_422_ADDAVG(cpu) ALL_CHROMA_422_PU(addAvg, addAvg, cpu); #define SETUP_INTRA_ANG_COMMON(mode, fno, cpu) \ - p.cu[BLOCK_4x4].intra_pred[mode] = x265_intra_pred_ang4_ ## fno ## _ ## cpu; \ - p.cu[BLOCK_8x8].intra_pred[mode] = x265_intra_pred_ang8_ ## fno ## _ ## cpu; \ - p.cu[BLOCK_16x16].intra_pred[mode] = x265_intra_pred_ang16_ ## fno ## _ ## cpu; \ - p.cu[BLOCK_32x32].intra_pred[mode] = x265_intra_pred_ang32_ ## fno ## _ ## cpu; + p.cu[BLOCK_4x4].intra_pred[mode] = PFX(intra_pred_ang4_ ## fno ## _ ## cpu); \ + 
p.cu[BLOCK_8x8].intra_pred[mode] = PFX(intra_pred_ang8_ ## fno ## _ ## cpu); \ + p.cu[BLOCK_16x16].intra_pred[mode] = PFX(intra_pred_ang16_ ## fno ## _ ## cpu); \ + p.cu[BLOCK_32x32].intra_pred[mode] = PFX(intra_pred_ang32_ ## fno ## _ ## cpu); #define SETUP_INTRA_ANG4(mode, fno, cpu) \ - p.cu[BLOCK_4x4].intra_pred[mode] = x265_intra_pred_ang4_ ## fno ## _ ## cpu; + p.cu[BLOCK_4x4].intra_pred[mode] = PFX(intra_pred_ang4_ ## fno ## _ ## cpu); #define SETUP_INTRA_ANG16_32(mode, fno, cpu) \ - p.cu[BLOCK_16x16].intra_pred[mode] = x265_intra_pred_ang16_ ## fno ## _ ## cpu; \ - p.cu[BLOCK_32x32].intra_pred[mode] = x265_intra_pred_ang32_ ## fno ## _ ## cpu; + p.cu[BLOCK_16x16].intra_pred[mode] = PFX(intra_pred_ang16_ ## fno ## _ ## cpu); \ + p.cu[BLOCK_32x32].intra_pred[mode] = PFX(intra_pred_ang32_ ## fno ## _ ## cpu); #define SETUP_INTRA_ANG4_8(mode, fno, cpu) \ - p.cu[BLOCK_4x4].intra_pred[mode] = x265_intra_pred_ang4_ ## fno ## _ ## cpu; \ - p.cu[BLOCK_8x8].intra_pred[mode] = x265_intra_pred_ang8_ ## fno ## _ ## cpu; + p.cu[BLOCK_4x4].intra_pred[mode] = PFX(intra_pred_ang4_ ## fno ## _ ## cpu); \ + p.cu[BLOCK_8x8].intra_pred[mode] = PFX(intra_pred_ang8_ ## fno ## _ ## cpu); #define INTRA_ANG_SSSE3(cpu) \ SETUP_INTRA_ANG_COMMON(2, 2, cpu); \ @@ -614,9 +691,9 @@ SETUP_INTRA_ANG_COMMON(18, 18, cpu); #define SETUP_INTRA_ANG_HIGH(mode, fno, cpu) \ - p.cu[BLOCK_8x8].intra_pred[mode] = x265_intra_pred_ang8_ ## fno ## _ ## cpu; \ - p.cu[BLOCK_16x16].intra_pred[mode] = x265_intra_pred_ang16_ ## fno ## _ ## cpu; \ - p.cu[BLOCK_32x32].intra_pred[mode] = x265_intra_pred_ang32_ ## fno ## _ ## cpu; + p.cu[BLOCK_8x8].intra_pred[mode] = PFX(intra_pred_ang8_ ## fno ## _ ## cpu); \ + p.cu[BLOCK_16x16].intra_pred[mode] = PFX(intra_pred_ang16_ ## fno ## _ ## cpu); \ + p.cu[BLOCK_32x32].intra_pred[mode] = PFX(intra_pred_ang32_ ## fno ## _ ## cpu); #define INTRA_ANG_SSE4_HIGH(cpu) \ SETUP_INTRA_ANG_HIGH(19, 19, cpu); \ @@ -689,10 +766,10 @@ ALL_CHROMA_420_4x4_PU(filter_vsp, interp_4tap_vert_sp, cpu) #define SETUP_CHROMA_420_VERT_FUNC_DEF(W, H, cpu) \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = x265_interp_4tap_vert_ss_ ## W ## x ## H ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vpp = x265_interp_4tap_vert_pp_ ## W ## x ## H ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vps = x265_interp_4tap_vert_ps_ ## W ## x ## H ## cpu; \ - p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = x265_interp_4tap_vert_sp_ ## W ## x ## H ## cpu; + p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vss = PFX(interp_4tap_vert_ss_ ## W ## x ## H ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vpp = PFX(interp_4tap_vert_pp_ ## W ## x ## H ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vps = PFX(interp_4tap_vert_ps_ ## W ## x ## H ## cpu); \ + p.chroma[X265_CSP_I420].pu[CHROMA_420_ ## W ## x ## H].filter_vsp = PFX(interp_4tap_vert_sp_ ## W ## x ## H ## cpu); #define CHROMA_420_VERT_FILTERS_SSE4(cpu) \ SETUP_CHROMA_420_VERT_FUNC_DEF(2, 4, cpu); \ @@ -701,10 +778,10 @@ SETUP_CHROMA_420_VERT_FUNC_DEF(6, 8, cpu); #define SETUP_CHROMA_422_VERT_FUNC_DEF(W, H, cpu) \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = x265_interp_4tap_vert_ss_ ## W ## x ## H ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vpp = x265_interp_4tap_vert_pp_ ## W ## x ## H ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vps = 
x265_interp_4tap_vert_ps_ ## W ## x ## H ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = x265_interp_4tap_vert_sp_ ## W ## x ## H ## cpu; + p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vss = PFX(interp_4tap_vert_ss_ ## W ## x ## H ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vpp = PFX(interp_4tap_vert_pp_ ## W ## x ## H ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vps = PFX(interp_4tap_vert_ps_ ## W ## x ## H ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_vsp = PFX(interp_4tap_vert_sp_ ## W ## x ## H ## cpu); #define CHROMA_422_VERT_FILTERS(cpu) \ SETUP_CHROMA_422_VERT_FUNC_DEF(4, 8, cpu); \ @@ -745,8 +822,8 @@ ALL_CHROMA_420_PU(filter_hps, interp_4tap_horiz_ps, cpu); #define SETUP_CHROMA_422_HORIZ_FUNC_DEF(W, H, cpu) \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hpp = x265_interp_4tap_horiz_pp_ ## W ## x ## H ## cpu; \ - p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hps = x265_interp_4tap_horiz_ps_ ## W ## x ## H ## cpu; + p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hpp = PFX(interp_4tap_horiz_pp_ ## W ## x ## H ## cpu); \ + p.chroma[X265_CSP_I422].pu[CHROMA_422_ ## W ## x ## H].filter_hps = PFX(interp_4tap_horiz_ps_ ## W ## x ## H ## cpu); #define CHROMA_422_HORIZ_FILTERS(cpu) \ SETUP_CHROMA_422_HORIZ_FUNC_DEF(4, 8, cpu); \ @@ -778,7 +855,7 @@ ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, cpu); \ ALL_CHROMA_444_PU(filter_hps, interp_4tap_horiz_ps, cpu); -namespace x265 { +namespace X265_NS { // private x265 namespace template<int size> @@ -788,20 +865,20 @@ const int filterSize = NTAPS_LUMA; const int halfFilterSize = filterSize >> 1; - x265::primitives.pu[size].luma_hps(src, srcStride, immed, MAX_CU_SIZE, idxX, 1); - x265::primitives.pu[size].luma_vsp(immed + (halfFilterSize - 1) * MAX_CU_SIZE, MAX_CU_SIZE, dst, dstStride, idxY); + primitives.pu[size].luma_hps(src, srcStride, immed, MAX_CU_SIZE, idxX, 1); + primitives.pu[size].luma_vsp(immed + (halfFilterSize - 1) * MAX_CU_SIZE, MAX_CU_SIZE, dst, dstStride, idxY); } #if HIGH_BIT_DEPTH -void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) // 16bpp +void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) // Main10 { #if !defined(X86_64) #error "Unsupported build configuration (32bit x86 and HIGH_BIT_DEPTH), you must configure ENABLE_ASSEMBLY=OFF" #endif #if X86_64 - p.scanPosLast = x265_scanPosLast_x64; + p.scanPosLast = PFX(scanPosLast_x64); #endif if (cpuMask & X265_CPU_SSE2) @@ -810,36 +887,39 @@ * for SSE2 and then use both MMX and SSE2 functions */ AVC_LUMA_PU(sad, mmx2); - p.pu[LUMA_16x16].sad = x265_pixel_sad_16x16_sse2; - p.pu[LUMA_16x8].sad = x265_pixel_sad_16x8_sse2; + p.pu[LUMA_16x16].sad = PFX(pixel_sad_16x16_sse2); + p.pu[LUMA_16x8].sad = PFX(pixel_sad_16x8_sse2); + p.pu[LUMA_8x16].sad = PFX(pixel_sad_8x16_sse2); HEVC_SAD(sse2); - p.pu[LUMA_4x4].sad_x3 = x265_pixel_sad_x3_4x4_mmx2; - p.pu[LUMA_4x8].sad_x3 = x265_pixel_sad_x3_4x8_mmx2; - p.pu[LUMA_4x16].sad_x3 = x265_pixel_sad_x3_4x16_mmx2; - p.pu[LUMA_8x4].sad_x3 = x265_pixel_sad_x3_8x4_sse2; - p.pu[LUMA_8x8].sad_x3 = x265_pixel_sad_x3_8x8_sse2; - p.pu[LUMA_8x16].sad_x3 = x265_pixel_sad_x3_8x16_sse2; - p.pu[LUMA_8x32].sad_x3 = x265_pixel_sad_x3_8x32_sse2; - p.pu[LUMA_16x4].sad_x3 = x265_pixel_sad_x3_16x4_sse2; - p.pu[LUMA_12x16].sad_x3 = x265_pixel_sad_x3_12x16_mmx2; + p.pu[LUMA_4x4].sad_x3 = PFX(pixel_sad_x3_4x4_mmx2); + p.pu[LUMA_4x8].sad_x3 = 
PFX(pixel_sad_x3_4x8_mmx2); + p.pu[LUMA_4x16].sad_x3 = PFX(pixel_sad_x3_4x16_mmx2); + p.pu[LUMA_8x4].sad_x3 = PFX(pixel_sad_x3_8x4_sse2); + p.pu[LUMA_8x8].sad_x3 = PFX(pixel_sad_x3_8x8_sse2); + p.pu[LUMA_8x16].sad_x3 = PFX(pixel_sad_x3_8x16_sse2); + p.pu[LUMA_8x32].sad_x3 = PFX(pixel_sad_x3_8x32_sse2); + p.pu[LUMA_16x4].sad_x3 = PFX(pixel_sad_x3_16x4_sse2); + p.pu[LUMA_12x16].sad_x3 = PFX(pixel_sad_x3_12x16_mmx2); HEVC_SAD_X3(sse2); - p.pu[LUMA_4x4].sad_x4 = x265_pixel_sad_x4_4x4_mmx2; - p.pu[LUMA_4x8].sad_x4 = x265_pixel_sad_x4_4x8_mmx2; - p.pu[LUMA_4x16].sad_x4 = x265_pixel_sad_x4_4x16_mmx2; - p.pu[LUMA_8x4].sad_x4 = x265_pixel_sad_x4_8x4_sse2; - p.pu[LUMA_8x8].sad_x4 = x265_pixel_sad_x4_8x8_sse2; - p.pu[LUMA_8x16].sad_x4 = x265_pixel_sad_x4_8x16_sse2; - p.pu[LUMA_8x32].sad_x4 = x265_pixel_sad_x4_8x32_sse2; - p.pu[LUMA_16x4].sad_x4 = x265_pixel_sad_x4_16x4_sse2; - p.pu[LUMA_12x16].sad_x4 = x265_pixel_sad_x4_12x16_mmx2; + p.pu[LUMA_4x4].sad_x4 = PFX(pixel_sad_x4_4x4_mmx2); + p.pu[LUMA_4x8].sad_x4 = PFX(pixel_sad_x4_4x8_mmx2); + p.pu[LUMA_4x16].sad_x4 = PFX(pixel_sad_x4_4x16_mmx2); + p.pu[LUMA_8x4].sad_x4 = PFX(pixel_sad_x4_8x4_sse2); + p.pu[LUMA_8x8].sad_x4 = PFX(pixel_sad_x4_8x8_sse2); + p.pu[LUMA_8x16].sad_x4 = PFX(pixel_sad_x4_8x16_sse2); + p.pu[LUMA_8x32].sad_x4 = PFX(pixel_sad_x4_8x32_sse2); + p.pu[LUMA_16x4].sad_x4 = PFX(pixel_sad_x4_16x4_sse2); + p.pu[LUMA_12x16].sad_x4 = PFX(pixel_sad_x4_12x16_mmx2); HEVC_SAD_X4(sse2); - p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_mmx2; + p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_mmx2); ALL_LUMA_PU(satd, pixel_satd, sse2); +#if X265_DEPTH <= 10 ASSIGN_SA8D(sse2); +#endif /* X265_DEPTH <= 10 */ LUMA_PIXELSUB(sse2); CHROMA_420_PIXELSUB_PS(sse2); CHROMA_422_PIXELSUB_PS(sse2); @@ -848,7 +928,7 @@ CHROMA_420_CU_BLOCKCOPY(ss, sse2); CHROMA_422_CU_BLOCKCOPY(ss, sse2); - p.pu[LUMA_4x4].copy_pp = (copy_pp_t)x265_blockcopy_ss_4x4_sse2; + p.pu[LUMA_4x4].copy_pp = (copy_pp_t)PFX(blockcopy_ss_4x4_sse2); ALL_LUMA_PU_TYPED(copy_pp, (copy_pp_t), blockcopy_ss, sse2); ALL_CHROMA_420_PU_TYPED(copy_pp, (copy_pp_t), blockcopy_ss, sse2); ALL_CHROMA_422_PU_TYPED(copy_pp, (copy_pp_t), blockcopy_ss, sse2); @@ -857,8 +937,15 @@ CHROMA_422_VERT_FILTERS(_sse2); CHROMA_444_VERT_FILTERS(sse2); - p.ssim_4x4x2_core = x265_pixel_ssim_4x4x2_core_sse2; - p.ssim_end_4 = x265_pixel_ssim_end4_sse2; + ALL_LUMA_PU(luma_hpp, interp_8tap_horiz_pp, sse2); + p.pu[LUMA_4x4].luma_hpp = PFX(interp_8tap_horiz_pp_4x4_sse2); + ALL_LUMA_PU(luma_hps, interp_8tap_horiz_ps, sse2); + p.pu[LUMA_4x4].luma_hps = PFX(interp_8tap_horiz_ps_4x4_sse2); + ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, sse2); + ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, sse2); + + p.ssim_4x4x2_core = PFX(pixel_ssim_4x4x2_core_sse2); + p.ssim_end_4 = PFX(pixel_ssim_end4_sse2); PIXEL_AVG(sse2); PIXEL_AVG_W4(mmx2); LUMA_VAR(sse2); @@ -873,149 +960,160 @@ ALL_LUMA_TU_S(calcresidual, getResidual, sse2); ALL_LUMA_TU_S(transpose, transpose, sse2); - ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse2); - ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2); + p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar4_sse2); + p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar8_sse2); - p.cu[BLOCK_4x4].intra_pred[2] = x265_intra_pred_ang4_2_sse2; - p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_sse2; - p.cu[BLOCK_4x4].intra_pred[4] = x265_intra_pred_ang4_4_sse2; - p.cu[BLOCK_4x4].intra_pred[5] = x265_intra_pred_ang4_5_sse2; - p.cu[BLOCK_4x4].intra_pred[6] = 
x265_intra_pred_ang4_6_sse2; - p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_sse2; - p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_sse2; - p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_sse2; - p.cu[BLOCK_4x4].intra_pred[10] = x265_intra_pred_ang4_10_sse2; - p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_sse2; - p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_sse2; - p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_sse2; - p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_sse2; - p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_sse2; - p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_sse2; - p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_sse2; - p.cu[BLOCK_4x4].intra_pred[18] = x265_intra_pred_ang4_18_sse2; - p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_17_sse2; - p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_16_sse2; - p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_15_sse2; - p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_14_sse2; - p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_13_sse2; - p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_12_sse2; - p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_11_sse2; - p.cu[BLOCK_4x4].intra_pred[26] = x265_intra_pred_ang4_26_sse2; - p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_9_sse2; - p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_8_sse2; - p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_7_sse2; - p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_6_sse2; - p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_5_sse2; - p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_4_sse2; - p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_3_sse2; +#if X265_DEPTH <= 10 + p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar16_sse2); + p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar32_sse2); +#endif /* X265_DEPTH <= 10 */ + ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2); - p.cu[BLOCK_4x4].sse_ss = x265_pixel_ssd_ss_4x4_mmx2; + p.cu[BLOCK_4x4].intra_pred[2] = PFX(intra_pred_ang4_2_sse2); + p.cu[BLOCK_4x4].intra_pred[3] = PFX(intra_pred_ang4_3_sse2); + p.cu[BLOCK_4x4].intra_pred[4] = PFX(intra_pred_ang4_4_sse2); + p.cu[BLOCK_4x4].intra_pred[5] = PFX(intra_pred_ang4_5_sse2); + p.cu[BLOCK_4x4].intra_pred[6] = PFX(intra_pred_ang4_6_sse2); + p.cu[BLOCK_4x4].intra_pred[7] = PFX(intra_pred_ang4_7_sse2); + p.cu[BLOCK_4x4].intra_pred[8] = PFX(intra_pred_ang4_8_sse2); + p.cu[BLOCK_4x4].intra_pred[9] = PFX(intra_pred_ang4_9_sse2); + p.cu[BLOCK_4x4].intra_pred[10] = PFX(intra_pred_ang4_10_sse2); + p.cu[BLOCK_4x4].intra_pred[11] = PFX(intra_pred_ang4_11_sse2); + p.cu[BLOCK_4x4].intra_pred[12] = PFX(intra_pred_ang4_12_sse2); + p.cu[BLOCK_4x4].intra_pred[13] = PFX(intra_pred_ang4_13_sse2); + p.cu[BLOCK_4x4].intra_pred[14] = PFX(intra_pred_ang4_14_sse2); + p.cu[BLOCK_4x4].intra_pred[15] = PFX(intra_pred_ang4_15_sse2); + p.cu[BLOCK_4x4].intra_pred[16] = PFX(intra_pred_ang4_16_sse2); + p.cu[BLOCK_4x4].intra_pred[17] = PFX(intra_pred_ang4_17_sse2); + p.cu[BLOCK_4x4].intra_pred[18] = PFX(intra_pred_ang4_18_sse2); + p.cu[BLOCK_4x4].intra_pred[19] = PFX(intra_pred_ang4_19_sse2); + p.cu[BLOCK_4x4].intra_pred[20] = PFX(intra_pred_ang4_20_sse2); + p.cu[BLOCK_4x4].intra_pred[21] = PFX(intra_pred_ang4_21_sse2); + p.cu[BLOCK_4x4].intra_pred[22] = PFX(intra_pred_ang4_22_sse2); + p.cu[BLOCK_4x4].intra_pred[23] = PFX(intra_pred_ang4_23_sse2); + p.cu[BLOCK_4x4].intra_pred[24] = 
PFX(intra_pred_ang4_24_sse2); + p.cu[BLOCK_4x4].intra_pred[25] = PFX(intra_pred_ang4_25_sse2); + p.cu[BLOCK_4x4].intra_pred[26] = PFX(intra_pred_ang4_26_sse2); + p.cu[BLOCK_4x4].intra_pred[27] = PFX(intra_pred_ang4_27_sse2); + p.cu[BLOCK_4x4].intra_pred[28] = PFX(intra_pred_ang4_28_sse2); + p.cu[BLOCK_4x4].intra_pred[29] = PFX(intra_pred_ang4_29_sse2); + p.cu[BLOCK_4x4].intra_pred[30] = PFX(intra_pred_ang4_30_sse2); + p.cu[BLOCK_4x4].intra_pred[31] = PFX(intra_pred_ang4_31_sse2); + p.cu[BLOCK_4x4].intra_pred[32] = PFX(intra_pred_ang4_32_sse2); + p.cu[BLOCK_4x4].intra_pred[33] = PFX(intra_pred_ang4_33_sse2); + + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_32x64_sse2); +#if X265_DEPTH <= 10 + p.cu[BLOCK_4x4].sse_ss = PFX(pixel_ssd_ss_4x4_mmx2); ALL_LUMA_CU(sse_ss, pixel_ssd_ss, sse2); - p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sse_pp = (pixelcmp_t)x265_pixel_ssd_ss_4x8_mmx2; - p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sse_pp = (pixelcmp_t)x265_pixel_ssd_ss_8x16_sse2; - p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sse_pp = (pixelcmp_t)x265_pixel_ssd_ss_16x32_sse2; - p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sse_pp = (pixelcmp_t)x265_pixel_ssd_ss_32x64_sse2; - - p.cu[BLOCK_4x4].dct = x265_dct4_sse2; - p.cu[BLOCK_8x8].dct = x265_dct8_sse2; - p.cu[BLOCK_4x4].idct = x265_idct4_sse2; - p.cu[BLOCK_8x8].idct = x265_idct8_sse2; + p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_4x8_mmx2); + p.chroma[X265_CSP_I422].cu[BLOCK_422_8x16].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_8x16_sse2); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_16x32_sse2); +#endif + p.cu[BLOCK_4x4].dct = PFX(dct4_sse2); + p.cu[BLOCK_8x8].dct = PFX(dct8_sse2); + p.cu[BLOCK_4x4].idct = PFX(idct4_sse2); + p.cu[BLOCK_8x8].idct = PFX(idct8_sse2); - p.idst4x4 = x265_idst4_sse2; + p.idst4x4 = PFX(idst4_sse2); + p.dst4x4 = PFX(dst4_sse2); LUMA_VSS_FILTERS(sse2); - p.frameInitLowres = x265_frame_init_lowres_core_sse2; + p.frameInitLowres = PFX(frame_init_lowres_core_sse2); + // TODO: the planecopy_sp is really planecopy_SC now, must be fix it + //p.planecopy_sp = PFX(downShift_16_sse2); + p.planecopy_sp_shl = PFX(upShift_16_sse2); + + ALL_CHROMA_420_PU(p2s, filterPixelToShort, sse2); + ALL_CHROMA_422_PU(p2s, filterPixelToShort, sse2); + ALL_CHROMA_444_PU(p2s, filterPixelToShort, sse2); + ALL_LUMA_PU(convert_p2s, filterPixelToShort, sse2); + ALL_LUMA_TU(count_nonzero, count_nonzero, sse2); + } + if (cpuMask & X265_CPU_SSE3) + { + ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, sse3); + ALL_CHROMA_422_PU(filter_hpp, interp_4tap_horiz_pp, sse3); + ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, sse3); + ALL_CHROMA_420_PU(filter_hps, interp_4tap_horiz_ps, sse3); + ALL_CHROMA_422_PU(filter_hps, interp_4tap_horiz_ps, sse3); + ALL_CHROMA_444_PU(filter_hps, interp_4tap_horiz_ps, sse3); } if (cpuMask & X265_CPU_SSSE3) { - p.scale1D_128to64 = x265_scale1D_128to64_ssse3; - p.scale2D_64to32 = x265_scale2D_64to32_ssse3; + p.scale1D_128to64 = PFX(scale1D_128to64_ssse3); + p.scale2D_64to32 = PFX(scale2D_64to32_ssse3); - // p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_ssse3; this one is broken + // p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_ssse3); this one is broken ALL_LUMA_PU(satd, pixel_satd, ssse3); +#if X265_DEPTH <= 10 ASSIGN_SA8D(ssse3); +#endif INTRA_ANG_SSSE3(ssse3); - p.dst4x4 = x265_dst4_ssse3; - p.cu[BLOCK_8x8].idct = x265_idct8_ssse3; - p.cu[BLOCK_4x4].count_nonzero = 
x265_count_nonzero_4x4_ssse3; - p.cu[BLOCK_8x8].count_nonzero = x265_count_nonzero_8x8_ssse3; - p.cu[BLOCK_16x16].count_nonzero = x265_count_nonzero_16x16_ssse3; - p.cu[BLOCK_32x32].count_nonzero = x265_count_nonzero_32x32_ssse3; - p.frameInitLowres = x265_frame_init_lowres_core_ssse3; - - p.pu[LUMA_4x4].convert_p2s = x265_filterPixelToShort_4x4_ssse3; - p.pu[LUMA_4x8].convert_p2s = x265_filterPixelToShort_4x8_ssse3; - p.pu[LUMA_4x16].convert_p2s = x265_filterPixelToShort_4x16_ssse3; - p.pu[LUMA_8x4].convert_p2s = x265_filterPixelToShort_8x4_ssse3; - p.pu[LUMA_8x8].convert_p2s = x265_filterPixelToShort_8x8_ssse3; - p.pu[LUMA_8x16].convert_p2s = x265_filterPixelToShort_8x16_ssse3; - p.pu[LUMA_8x32].convert_p2s = x265_filterPixelToShort_8x32_ssse3; - p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_ssse3; - p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_ssse3; - p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_ssse3; - p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_ssse3; - p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_ssse3; - p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_ssse3; - p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_ssse3; - p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_ssse3; - p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_ssse3; - p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_ssse3; - p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_ssse3; - p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_ssse3; - p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_ssse3; - p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_ssse3; - p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_ssse3; - p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_ssse3; - p.pu[LUMA_12x16].convert_p2s = x265_filterPixelToShort_12x16_ssse3; - p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_ssse3; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = x265_filterPixelToShort_4x4_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = x265_filterPixelToShort_4x8_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = x265_filterPixelToShort_4x16_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = x265_filterPixelToShort_8x4_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = x265_filterPixelToShort_8x8_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = x265_filterPixelToShort_8x16_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = x265_filterPixelToShort_8x32_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = x265_filterPixelToShort_16x32_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = x265_filterPixelToShort_4x4_ssse3; - 
p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = x265_filterPixelToShort_4x8_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = x265_filterPixelToShort_4x16_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = x265_filterPixelToShort_4x32_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = x265_filterPixelToShort_8x4_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = x265_filterPixelToShort_8x8_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = x265_filterPixelToShort_8x12_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = x265_filterPixelToShort_8x16_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = x265_filterPixelToShort_8x32_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = x265_filterPixelToShort_8x64_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = x265_filterPixelToShort_12x32_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_ssse3; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = x265_filterPixelToShort_4x2_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = x265_filterPixelToShort_8x2_ssse3; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = x265_filterPixelToShort_8x6_ssse3; - p.findPosFirstLast = x265_findPosFirstLast_ssse3; + p.dst4x4 = PFX(dst4_ssse3); + p.cu[BLOCK_8x8].idct = PFX(idct8_ssse3); + + p.frameInitLowres = PFX(frame_init_lowres_core_ssse3); + + ALL_LUMA_PU(convert_p2s, filterPixelToShort, ssse3); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = PFX(filterPixelToShort_4x4_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = PFX(filterPixelToShort_4x8_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = PFX(filterPixelToShort_4x16_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = PFX(filterPixelToShort_8x4_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = PFX(filterPixelToShort_8x8_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = PFX(filterPixelToShort_8x16_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = PFX(filterPixelToShort_8x32_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = PFX(filterPixelToShort_16x4_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = PFX(filterPixelToShort_16x8_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = PFX(filterPixelToShort_16x12_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = PFX(filterPixelToShort_16x16_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = PFX(filterPixelToShort_16x32_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = PFX(filterPixelToShort_32x8_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = PFX(filterPixelToShort_32x16_ssse3); + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = PFX(filterPixelToShort_32x24_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = PFX(filterPixelToShort_32x32_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = PFX(filterPixelToShort_4x4_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = PFX(filterPixelToShort_4x8_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = PFX(filterPixelToShort_4x16_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = PFX(filterPixelToShort_4x32_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = PFX(filterPixelToShort_8x4_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = PFX(filterPixelToShort_8x8_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = PFX(filterPixelToShort_8x12_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = PFX(filterPixelToShort_8x16_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = PFX(filterPixelToShort_8x32_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = PFX(filterPixelToShort_8x64_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = PFX(filterPixelToShort_12x32_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = PFX(filterPixelToShort_16x8_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = PFX(filterPixelToShort_16x16_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = PFX(filterPixelToShort_16x24_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = PFX(filterPixelToShort_16x32_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = PFX(filterPixelToShort_16x64_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = PFX(filterPixelToShort_24x64_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = PFX(filterPixelToShort_32x16_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = PFX(filterPixelToShort_32x32_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = PFX(filterPixelToShort_32x48_ssse3); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = PFX(filterPixelToShort_32x64_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = PFX(filterPixelToShort_4x2_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = PFX(filterPixelToShort_8x2_ssse3); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = PFX(filterPixelToShort_8x6_ssse3); + p.findPosFirstLast = PFX(findPosFirstLast_ssse3); } if (cpuMask & X265_CPU_SSE4) { + p.saoCuOrgE0 = PFX(saoCuOrgE0_sse4); + p.saoCuOrgE1 = PFX(saoCuOrgE1_sse4); + p.saoCuOrgE1_2Rows = PFX(saoCuOrgE1_2Rows_sse4); + p.saoCuOrgE2[0] = PFX(saoCuOrgE2_sse4); + p.saoCuOrgE2[1] = PFX(saoCuOrgE2_sse4); + p.saoCuOrgE3[0] = PFX(saoCuOrgE3_sse4); + p.saoCuOrgE3[1] = PFX(saoCuOrgE3_sse4); + p.saoCuOrgB0 = PFX(saoCuOrgB0_sse4); + p.sign = PFX(calSign_sse4); + LUMA_ADDAVG(sse4); CHROMA_420_ADDAVG(sse4); CHROMA_422_ADDAVG(sse4); @@ -1027,297 +1125,1065 @@ CHROMA_422_VERT_FILTERS_SSE4(_sse4); CHROMA_444_HORIZ_FILTERS(sse4); - p.cu[BLOCK_8x8].dct = x265_dct8_sse4; - p.quant = x265_quant_sse4; - p.nquant = x265_nquant_sse4; - p.dequant_normal = x265_dequant_normal_sse4; + p.cu[BLOCK_8x8].dct = PFX(dct8_sse4); + p.quant = PFX(quant_sse4); + p.nquant = PFX(nquant_sse4); + p.dequant_normal = PFX(dequant_normal_sse4); + p.dequant_scaling = PFX(dequant_scaling_sse4); - // p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_sse4; fails tests + // p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_sse4); fails tests ALL_LUMA_PU(satd, pixel_satd, sse4); +#if X265_DEPTH <= 10 ASSIGN_SA8D(sse4); +#endif - 
ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse4); + p.cu[BLOCK_4x4].intra_filter = PFX(intra_filter_4x4_sse4); + p.cu[BLOCK_8x8].intra_filter = PFX(intra_filter_8x8_sse4); + p.cu[BLOCK_16x16].intra_filter = PFX(intra_filter_16x16_sse4); + p.cu[BLOCK_32x32].intra_filter = PFX(intra_filter_32x32_sse4); + + p.cu[BLOCK_4x4].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar4_sse4); + p.cu[BLOCK_8x8].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar8_sse4); + +#if X265_DEPTH <= 10 + p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar16_sse4); + p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar32_sse4); +#endif ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse4); INTRA_ANG_SSE4_COMMON(sse4); INTRA_ANG_SSE4_HIGH(sse4); - p.planecopy_cp = x265_upShift_8_sse4; - p.weight_pp = x265_weight_pp_sse4; - p.weight_sp = x265_weight_sp_sse4; + p.planecopy_cp = PFX(upShift_8_sse4); + p.weight_pp = PFX(weight_pp_sse4); + p.weight_sp = PFX(weight_sp_sse4); - p.cu[BLOCK_4x4].psy_cost_pp = x265_psyCost_pp_4x4_sse4; - p.cu[BLOCK_4x4].psy_cost_ss = x265_psyCost_ss_4x4_sse4; + p.cu[BLOCK_4x4].psy_cost_pp = PFX(psyCost_pp_4x4_sse4); + p.cu[BLOCK_4x4].psy_cost_ss = PFX(psyCost_ss_4x4_sse4); // TODO: check POPCNT flag! ALL_LUMA_TU_S(copy_cnt, copy_cnt_, sse4); +#if X265_DEPTH <= 10 ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4); +#endif ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4); - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = x265_filterPixelToShort_2x4_sse4; - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = x265_filterPixelToShort_2x8_sse4; - p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = x265_filterPixelToShort_6x8_sse4; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = x265_filterPixelToShort_2x8_sse4; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = x265_filterPixelToShort_2x16_sse4; - p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = x265_filterPixelToShort_6x16_sse4; + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = PFX(filterPixelToShort_2x4_sse4); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = PFX(filterPixelToShort_2x8_sse4); + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = PFX(filterPixelToShort_6x8_sse4); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = PFX(filterPixelToShort_2x8_sse4); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = PFX(filterPixelToShort_2x16_sse4); + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = PFX(filterPixelToShort_6x16_sse4); } if (cpuMask & X265_CPU_AVX) { - // p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_avx; fails tests - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].satd = x265_pixel_satd_16x24_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].satd = x265_pixel_satd_32x48_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].satd = x265_pixel_satd_24x64_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].satd = x265_pixel_satd_8x64_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd = x265_pixel_satd_8x12_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = x265_pixel_satd_12x32_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd = x265_pixel_satd_4x32_avx; + // p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_avx); fails tests + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].satd = PFX(pixel_satd_16x24_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].satd = PFX(pixel_satd_32x48_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].satd = PFX(pixel_satd_24x64_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].satd = PFX(pixel_satd_8x64_avx); + 
p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd = PFX(pixel_satd_8x12_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = PFX(pixel_satd_12x32_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd = PFX(pixel_satd_4x32_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].satd = PFX(pixel_satd_4x8_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].satd = PFX(pixel_satd_8x16_avx); + // p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].satd = PFX(pixel_satd_4x4_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].satd = PFX(pixel_satd_8x8_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].satd = PFX(pixel_satd_4x16_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].satd = PFX(pixel_satd_8x32_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].satd = PFX(pixel_satd_8x4_avx); ALL_LUMA_PU(satd, pixel_satd, avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].satd = PFX(pixel_satd_8x8_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].satd = PFX(pixel_satd_8x4_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].satd = PFX(pixel_satd_8x16_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].satd = PFX(pixel_satd_8x32_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].satd = PFX(pixel_satd_12x16_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].satd = PFX(pixel_satd_24x32_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].satd = PFX(pixel_satd_4x16_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].satd = PFX(pixel_satd_4x8_avx); +#if X265_DEPTH <= 10 ASSIGN_SA8D(avx); +#endif + p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sa8d = PFX(pixel_sa8d_8x8_avx); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sa8d = PFX(pixel_sa8d_16x16_avx); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sa8d = PFX(pixel_sa8d_32x32_avx); LUMA_VAR(avx); - p.ssim_4x4x2_core = x265_pixel_ssim_4x4x2_core_avx; - p.ssim_end_4 = x265_pixel_ssim_end4_avx; + p.ssim_4x4x2_core = PFX(pixel_ssim_4x4x2_core_avx); + p.ssim_end_4 = PFX(pixel_ssim_end4_avx); // copy_pp primitives // 16 x N - p.pu[LUMA_64x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x64_avx; - p.pu[LUMA_16x4].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x4_avx; - p.pu[LUMA_16x8].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x8_avx; - p.pu[LUMA_16x12].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x12_avx; - p.pu[LUMA_16x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x16_avx; - p.pu[LUMA_16x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x32_avx; - p.pu[LUMA_16x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x64_avx; - p.pu[LUMA_64x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x16_avx; - p.pu[LUMA_64x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x32_avx; - p.pu[LUMA_64x48].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x48_avx; - p.pu[LUMA_64x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x64_avx; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x4_avx; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x8_avx; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x12_avx; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x16_avx; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x32_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x16_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x24_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_16x32_avx; + p.pu[LUMA_64x64].copy_pp = 
(copy_pp_t)PFX(blockcopy_ss_64x64_avx); + p.pu[LUMA_16x4].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x4_avx); + p.pu[LUMA_16x8].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x8_avx); + p.pu[LUMA_16x12].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x12_avx); + p.pu[LUMA_16x16].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x16_avx); + p.pu[LUMA_16x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x32_avx); + p.pu[LUMA_16x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x64_avx); + p.pu[LUMA_64x16].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x16_avx); + p.pu[LUMA_64x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x32_avx); + p.pu[LUMA_64x48].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x48_avx); + p.pu[LUMA_64x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x64_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x4_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x8_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x12_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x16_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x32_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x16_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x24_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x32_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x64_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].copy_pp = (copy_pp_t)PFX(blockcopy_ss_16x8_avx); // 24 X N - p.pu[LUMA_24x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_24x32_avx; - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_24x32_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_24x64_avx; + p.pu[LUMA_24x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_24x32_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_24x32_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_24x64_avx); // 32 x N - p.pu[LUMA_32x8].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x8_avx; - p.pu[LUMA_32x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x16_avx; - p.pu[LUMA_32x24].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x24_avx; - p.pu[LUMA_32x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x32_avx; - p.pu[LUMA_32x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x64_avx; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x8_avx; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x16_avx; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x24_avx; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x32_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x16_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x32_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x48_avx; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_32x64_avx; + p.pu[LUMA_32x8].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x8_avx); + p.pu[LUMA_32x16].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x16_avx); + p.pu[LUMA_32x24].copy_pp = 
(copy_pp_t)PFX(blockcopy_ss_32x24_avx); + p.pu[LUMA_32x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x32_avx); + p.pu[LUMA_32x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x64_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x8_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x16_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x24_avx); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x32_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x16_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x32_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x48_avx); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_32x64_avx); // 48 X 64 - p.pu[LUMA_48x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_48x64_avx; + p.pu[LUMA_48x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_48x64_avx); // copy_ss primitives // 16 X N - p.cu[BLOCK_16x16].copy_ss = x265_blockcopy_ss_16x16_avx; - p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_ss = x265_blockcopy_ss_16x16_avx; - p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_ss = x265_blockcopy_ss_16x32_avx; + p.cu[BLOCK_16x16].copy_ss = PFX(blockcopy_ss_16x16_avx); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_ss = PFX(blockcopy_ss_16x16_avx); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_ss = PFX(blockcopy_ss_16x32_avx); // 32 X N - p.cu[BLOCK_32x32].copy_ss = x265_blockcopy_ss_32x32_avx; - p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ss = x265_blockcopy_ss_32x32_avx; - p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ss = x265_blockcopy_ss_32x64_avx; + p.cu[BLOCK_32x32].copy_ss = PFX(blockcopy_ss_32x32_avx); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ss = PFX(blockcopy_ss_32x32_avx); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ss = PFX(blockcopy_ss_32x64_avx); // 64 X N - p.cu[BLOCK_64x64].copy_ss = x265_blockcopy_ss_64x64_avx; + p.cu[BLOCK_64x64].copy_ss = PFX(blockcopy_ss_64x64_avx); // copy_ps primitives // 16 X N - p.cu[BLOCK_16x16].copy_ps = (copy_ps_t)x265_blockcopy_ss_16x16_avx; - p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_ps = (copy_ps_t)x265_blockcopy_ss_16x16_avx; - p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_ps = (copy_ps_t)x265_blockcopy_ss_16x32_avx; + p.cu[BLOCK_16x16].copy_ps = (copy_ps_t)PFX(blockcopy_ss_16x16_avx); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_ps = (copy_ps_t)PFX(blockcopy_ss_16x16_avx); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_ps = (copy_ps_t)PFX(blockcopy_ss_16x32_avx); // 32 X N - p.cu[BLOCK_32x32].copy_ps = (copy_ps_t)x265_blockcopy_ss_32x32_avx; - p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ps = (copy_ps_t)x265_blockcopy_ss_32x32_avx; - p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ps = (copy_ps_t)x265_blockcopy_ss_32x64_avx; + p.cu[BLOCK_32x32].copy_ps = (copy_ps_t)PFX(blockcopy_ss_32x32_avx); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_ps = (copy_ps_t)PFX(blockcopy_ss_32x32_avx); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_ps = (copy_ps_t)PFX(blockcopy_ss_32x64_avx); // 64 X N - p.cu[BLOCK_64x64].copy_ps = (copy_ps_t)x265_blockcopy_ss_64x64_avx; + p.cu[BLOCK_64x64].copy_ps = (copy_ps_t)PFX(blockcopy_ss_64x64_avx); // copy_sp primitives // 16 X N - p.cu[BLOCK_16x16].copy_sp = (copy_sp_t)x265_blockcopy_ss_16x16_avx; - 
p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_sp = (copy_sp_t)x265_blockcopy_ss_16x16_avx; - p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_sp = (copy_sp_t)x265_blockcopy_ss_16x32_avx; + p.cu[BLOCK_16x16].copy_sp = (copy_sp_t)PFX(blockcopy_ss_16x16_avx); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_sp = (copy_sp_t)PFX(blockcopy_ss_16x16_avx); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_sp = (copy_sp_t)PFX(blockcopy_ss_16x32_avx); // 32 X N - p.cu[BLOCK_32x32].copy_sp = (copy_sp_t)x265_blockcopy_ss_32x32_avx; - p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_sp = (copy_sp_t)x265_blockcopy_ss_32x32_avx; - p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_sp = (copy_sp_t)x265_blockcopy_ss_32x64_avx; + p.cu[BLOCK_32x32].copy_sp = (copy_sp_t)PFX(blockcopy_ss_32x32_avx); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_sp = (copy_sp_t)PFX(blockcopy_ss_32x32_avx); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_sp = (copy_sp_t)PFX(blockcopy_ss_32x64_avx); // 64 X N - p.cu[BLOCK_64x64].copy_sp = (copy_sp_t)x265_blockcopy_ss_64x64_avx; + p.cu[BLOCK_64x64].copy_sp = (copy_sp_t)PFX(blockcopy_ss_64x64_avx); - p.frameInitLowres = x265_frame_init_lowres_core_avx; + p.frameInitLowres = PFX(frame_init_lowres_core_avx); - p.pu[LUMA_64x16].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x16_avx; - p.pu[LUMA_64x32].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x32_avx; - p.pu[LUMA_64x48].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x48_avx; - p.pu[LUMA_64x64].copy_pp = (copy_pp_t)x265_blockcopy_ss_64x64_avx; + p.pu[LUMA_64x16].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x16_avx); + p.pu[LUMA_64x32].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x32_avx); + p.pu[LUMA_64x48].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x48_avx); + p.pu[LUMA_64x64].copy_pp = (copy_pp_t)PFX(blockcopy_ss_64x64_avx); } if (cpuMask & X265_CPU_XOP) { - //p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_xop; this one is broken + //p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_xop); this one is broken ALL_LUMA_PU(satd, pixel_satd, xop); +#if X265_DEPTH <= 10 ASSIGN_SA8D(xop); +#endif LUMA_VAR(xop); - p.frameInitLowres = x265_frame_init_lowres_core_xop; + p.frameInitLowres = PFX(frame_init_lowres_core_xop); } if (cpuMask & X265_CPU_AVX2) { - p.pu[LUMA_48x64].satd = x265_pixel_satd_48x64_avx2; + p.cu[BLOCK_4x4].intra_filter = PFX(intra_filter_4x4_avx2); - p.pu[LUMA_64x16].satd = x265_pixel_satd_64x16_avx2; - p.pu[LUMA_64x32].satd = x265_pixel_satd_64x32_avx2; - p.pu[LUMA_64x48].satd = x265_pixel_satd_64x48_avx2; - p.pu[LUMA_64x64].satd = x265_pixel_satd_64x64_avx2; - - p.pu[LUMA_32x8].satd = x265_pixel_satd_32x8_avx2; - p.pu[LUMA_32x16].satd = x265_pixel_satd_32x16_avx2; - p.pu[LUMA_32x24].satd = x265_pixel_satd_32x24_avx2; - p.pu[LUMA_32x32].satd = x265_pixel_satd_32x32_avx2; - p.pu[LUMA_32x64].satd = x265_pixel_satd_32x64_avx2; - - p.pu[LUMA_16x4].satd = x265_pixel_satd_16x4_avx2; - p.pu[LUMA_16x8].satd = x265_pixel_satd_16x8_avx2; - p.pu[LUMA_16x12].satd = x265_pixel_satd_16x12_avx2; - p.pu[LUMA_16x16].satd = x265_pixel_satd_16x16_avx2; - p.pu[LUMA_16x32].satd = x265_pixel_satd_16x32_avx2; - p.pu[LUMA_16x64].satd = x265_pixel_satd_16x64_avx2; - - p.cu[BLOCK_32x32].ssd_s = x265_pixel_ssd_s_32_avx2; - p.cu[BLOCK_16x16].sse_ss = x265_pixel_ssd_ss_16x16_avx2; - - p.quant = x265_quant_avx2; - p.nquant = x265_nquant_avx2; - p.dequant_normal = x265_dequant_normal_avx2; - - p.scale1D_128to64 = x265_scale1D_128to64_avx2; - p.scale2D_64to32 = x265_scale2D_64to32_avx2; - // p.weight_pp = x265_weight_pp_avx2; fails tests 
+ // TODO: the planecopy_sp is really planecopy_SC now, must be fix it + //p.planecopy_sp = PFX(downShift_16_avx2); + p.planecopy_sp_shl = PFX(upShift_16_avx2); + + p.saoCuOrgE0 = PFX(saoCuOrgE0_avx2); + p.saoCuOrgE1 = PFX(saoCuOrgE1_avx2); + p.saoCuOrgE1_2Rows = PFX(saoCuOrgE1_2Rows_avx2); + p.saoCuOrgE2[0] = PFX(saoCuOrgE2_avx2); + p.saoCuOrgE2[1] = PFX(saoCuOrgE2_32_avx2); + p.saoCuOrgE3[0] = PFX(saoCuOrgE3_avx2); + p.saoCuOrgE3[1] = PFX(saoCuOrgE3_32_avx2); + p.saoCuOrgB0 = PFX(saoCuOrgB0_avx2); + + p.cu[BLOCK_16x16].intra_pred[2] = PFX(intra_pred_ang16_2_avx2); + p.cu[BLOCK_16x16].intra_pred[3] = PFX(intra_pred_ang16_3_avx2); + p.cu[BLOCK_16x16].intra_pred[4] = PFX(intra_pred_ang16_4_avx2); + p.cu[BLOCK_16x16].intra_pred[5] = PFX(intra_pred_ang16_5_avx2); + p.cu[BLOCK_16x16].intra_pred[6] = PFX(intra_pred_ang16_6_avx2); + p.cu[BLOCK_16x16].intra_pred[7] = PFX(intra_pred_ang16_7_avx2); + p.cu[BLOCK_16x16].intra_pred[8] = PFX(intra_pred_ang16_8_avx2); + p.cu[BLOCK_16x16].intra_pred[9] = PFX(intra_pred_ang16_9_avx2); + p.cu[BLOCK_16x16].intra_pred[10] = PFX(intra_pred_ang16_10_avx2); + p.cu[BLOCK_16x16].intra_pred[11] = PFX(intra_pred_ang16_11_avx2); + p.cu[BLOCK_16x16].intra_pred[12] = PFX(intra_pred_ang16_12_avx2); + p.cu[BLOCK_16x16].intra_pred[13] = PFX(intra_pred_ang16_13_avx2); + p.cu[BLOCK_16x16].intra_pred[14] = PFX(intra_pred_ang16_14_avx2); + p.cu[BLOCK_16x16].intra_pred[15] = PFX(intra_pred_ang16_15_avx2); + p.cu[BLOCK_16x16].intra_pred[16] = PFX(intra_pred_ang16_16_avx2); + p.cu[BLOCK_16x16].intra_pred[17] = PFX(intra_pred_ang16_17_avx2); + p.cu[BLOCK_16x16].intra_pred[18] = PFX(intra_pred_ang16_18_avx2); + p.cu[BLOCK_16x16].intra_pred[19] = PFX(intra_pred_ang16_19_avx2); + p.cu[BLOCK_16x16].intra_pred[20] = PFX(intra_pred_ang16_20_avx2); + p.cu[BLOCK_16x16].intra_pred[21] = PFX(intra_pred_ang16_21_avx2); + p.cu[BLOCK_16x16].intra_pred[22] = PFX(intra_pred_ang16_22_avx2); + p.cu[BLOCK_16x16].intra_pred[23] = PFX(intra_pred_ang16_23_avx2); + p.cu[BLOCK_16x16].intra_pred[24] = PFX(intra_pred_ang16_24_avx2); + p.cu[BLOCK_16x16].intra_pred[25] = PFX(intra_pred_ang16_25_avx2); + p.cu[BLOCK_16x16].intra_pred[26] = PFX(intra_pred_ang16_26_avx2); + p.cu[BLOCK_16x16].intra_pred[27] = PFX(intra_pred_ang16_27_avx2); + p.cu[BLOCK_16x16].intra_pred[28] = PFX(intra_pred_ang16_28_avx2); + p.cu[BLOCK_16x16].intra_pred[29] = PFX(intra_pred_ang16_29_avx2); + p.cu[BLOCK_16x16].intra_pred[30] = PFX(intra_pred_ang16_30_avx2); + p.cu[BLOCK_16x16].intra_pred[31] = PFX(intra_pred_ang16_31_avx2); + p.cu[BLOCK_16x16].intra_pred[32] = PFX(intra_pred_ang16_32_avx2); + p.cu[BLOCK_16x16].intra_pred[33] = PFX(intra_pred_ang16_33_avx2); + p.cu[BLOCK_16x16].intra_pred[34] = PFX(intra_pred_ang16_2_avx2); + + p.cu[BLOCK_32x32].intra_pred[2] = PFX(intra_pred_ang32_2_avx2); + p.cu[BLOCK_32x32].intra_pred[3] = PFX(intra_pred_ang32_3_avx2); + p.cu[BLOCK_32x32].intra_pred[4] = PFX(intra_pred_ang32_4_avx2); + p.cu[BLOCK_32x32].intra_pred[5] = PFX(intra_pred_ang32_5_avx2); + p.cu[BLOCK_32x32].intra_pred[6] = PFX(intra_pred_ang32_6_avx2); + p.cu[BLOCK_32x32].intra_pred[7] = PFX(intra_pred_ang32_7_avx2); + p.cu[BLOCK_32x32].intra_pred[8] = PFX(intra_pred_ang32_8_avx2); + p.cu[BLOCK_32x32].intra_pred[9] = PFX(intra_pred_ang32_9_avx2); + p.cu[BLOCK_32x32].intra_pred[10] = PFX(intra_pred_ang32_10_avx2); + p.cu[BLOCK_32x32].intra_pred[11] = PFX(intra_pred_ang32_11_avx2); + p.cu[BLOCK_32x32].intra_pred[12] = PFX(intra_pred_ang32_12_avx2); + p.cu[BLOCK_32x32].intra_pred[13] = PFX(intra_pred_ang32_13_avx2); + 
p.cu[BLOCK_32x32].intra_pred[14] = PFX(intra_pred_ang32_14_avx2); + p.cu[BLOCK_32x32].intra_pred[15] = PFX(intra_pred_ang32_15_avx2); + p.cu[BLOCK_32x32].intra_pred[16] = PFX(intra_pred_ang32_16_avx2); + p.cu[BLOCK_32x32].intra_pred[17] = PFX(intra_pred_ang32_17_avx2); + p.cu[BLOCK_32x32].intra_pred[18] = PFX(intra_pred_ang32_18_avx2); + p.cu[BLOCK_32x32].intra_pred[19] = PFX(intra_pred_ang32_19_avx2); + p.cu[BLOCK_32x32].intra_pred[20] = PFX(intra_pred_ang32_20_avx2); + p.cu[BLOCK_32x32].intra_pred[21] = PFX(intra_pred_ang32_21_avx2); + p.cu[BLOCK_32x32].intra_pred[22] = PFX(intra_pred_ang32_22_avx2); + p.cu[BLOCK_32x32].intra_pred[23] = PFX(intra_pred_ang32_23_avx2); + p.cu[BLOCK_32x32].intra_pred[24] = PFX(intra_pred_ang32_24_avx2); + p.cu[BLOCK_32x32].intra_pred[25] = PFX(intra_pred_ang32_25_avx2); + p.cu[BLOCK_32x32].intra_pred[26] = PFX(intra_pred_ang32_26_avx2); + p.cu[BLOCK_32x32].intra_pred[27] = PFX(intra_pred_ang32_27_avx2); + p.cu[BLOCK_32x32].intra_pred[28] = PFX(intra_pred_ang32_28_avx2); + p.cu[BLOCK_32x32].intra_pred[29] = PFX(intra_pred_ang32_29_avx2); + p.cu[BLOCK_32x32].intra_pred[30] = PFX(intra_pred_ang32_30_avx2); + p.cu[BLOCK_32x32].intra_pred[31] = PFX(intra_pred_ang32_31_avx2); + p.cu[BLOCK_32x32].intra_pred[32] = PFX(intra_pred_ang32_32_avx2); + p.cu[BLOCK_32x32].intra_pred[33] = PFX(intra_pred_ang32_33_avx2); + p.cu[BLOCK_32x32].intra_pred[34] = PFX(intra_pred_ang32_2_avx2); + + p.pu[LUMA_12x16].pixelavg_pp = PFX(pixel_avg_12x16_avx2); + p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2); + p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2); + p.pu[LUMA_16x12].pixelavg_pp = PFX(pixel_avg_16x12_avx2); + p.pu[LUMA_16x16].pixelavg_pp = PFX(pixel_avg_16x16_avx2); + p.pu[LUMA_16x32].pixelavg_pp = PFX(pixel_avg_16x32_avx2); + p.pu[LUMA_16x64].pixelavg_pp = PFX(pixel_avg_16x64_avx2); + p.pu[LUMA_24x32].pixelavg_pp = PFX(pixel_avg_24x32_avx2); + p.pu[LUMA_32x8].pixelavg_pp = PFX(pixel_avg_32x8_avx2); + p.pu[LUMA_32x16].pixelavg_pp = PFX(pixel_avg_32x16_avx2); + p.pu[LUMA_32x24].pixelavg_pp = PFX(pixel_avg_32x24_avx2); + p.pu[LUMA_32x32].pixelavg_pp = PFX(pixel_avg_32x32_avx2); + p.pu[LUMA_32x64].pixelavg_pp = PFX(pixel_avg_32x64_avx2); + p.pu[LUMA_64x16].pixelavg_pp = PFX(pixel_avg_64x16_avx2); + p.pu[LUMA_64x32].pixelavg_pp = PFX(pixel_avg_64x32_avx2); + p.pu[LUMA_64x48].pixelavg_pp = PFX(pixel_avg_64x48_avx2); + p.pu[LUMA_64x64].pixelavg_pp = PFX(pixel_avg_64x64_avx2); + p.pu[LUMA_48x64].pixelavg_pp = PFX(pixel_avg_48x64_avx2); + + p.pu[LUMA_8x4].addAvg = PFX(addAvg_8x4_avx2); + p.pu[LUMA_8x8].addAvg = PFX(addAvg_8x8_avx2); + p.pu[LUMA_8x16].addAvg = PFX(addAvg_8x16_avx2); + p.pu[LUMA_8x32].addAvg = PFX(addAvg_8x32_avx2); + p.pu[LUMA_12x16].addAvg = PFX(addAvg_12x16_avx2); + p.pu[LUMA_16x4].addAvg = PFX(addAvg_16x4_avx2); + p.pu[LUMA_16x8].addAvg = PFX(addAvg_16x8_avx2); + p.pu[LUMA_16x12].addAvg = PFX(addAvg_16x12_avx2); + p.pu[LUMA_16x16].addAvg = PFX(addAvg_16x16_avx2); + p.pu[LUMA_16x32].addAvg = PFX(addAvg_16x32_avx2); + p.pu[LUMA_16x64].addAvg = PFX(addAvg_16x64_avx2); + p.pu[LUMA_24x32].addAvg = PFX(addAvg_24x32_avx2); + p.pu[LUMA_32x8].addAvg = PFX(addAvg_32x8_avx2); + p.pu[LUMA_32x16].addAvg = PFX(addAvg_32x16_avx2); + p.pu[LUMA_32x24].addAvg = PFX(addAvg_32x24_avx2); + p.pu[LUMA_32x32].addAvg = PFX(addAvg_32x32_avx2); + p.pu[LUMA_32x64].addAvg = PFX(addAvg_32x64_avx2); + p.pu[LUMA_48x64].addAvg = PFX(addAvg_48x64_avx2); + p.pu[LUMA_64x16].addAvg = PFX(addAvg_64x16_avx2); + p.pu[LUMA_64x32].addAvg = PFX(addAvg_64x32_avx2); + p.pu[LUMA_64x48].addAvg = 
PFX(addAvg_64x48_avx2); + p.pu[LUMA_64x64].addAvg = PFX(addAvg_64x64_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].addAvg = PFX(addAvg_8x2_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].addAvg = PFX(addAvg_8x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].addAvg = PFX(addAvg_8x6_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].addAvg = PFX(addAvg_8x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].addAvg = PFX(addAvg_8x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].addAvg = PFX(addAvg_8x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].addAvg = PFX(addAvg_12x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg = PFX(addAvg_16x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg = PFX(addAvg_16x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg = PFX(addAvg_16x12_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg = PFX(addAvg_16x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg = PFX(addAvg_16x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg = PFX(addAvg_32x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg = PFX(addAvg_32x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg = PFX(addAvg_32x24_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg = PFX(addAvg_32x32_avx2); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].addAvg = PFX(addAvg_8x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg = PFX(addAvg_16x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg = PFX(addAvg_32x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].addAvg = PFX(addAvg_8x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg = PFX(addAvg_16x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].addAvg = PFX(addAvg_8x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg = PFX(addAvg_32x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg = PFX(addAvg_16x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].addAvg = PFX(addAvg_8x12_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].addAvg = PFX(addAvg_8x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg = PFX(addAvg_16x24_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg = PFX(addAvg_16x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].addAvg = PFX(addAvg_8x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].addAvg = PFX(addAvg_24x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].addAvg = PFX(addAvg_12x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg = PFX(addAvg_32x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg = PFX(addAvg_32x48_avx2); + + p.cu[BLOCK_4x4].psy_cost_ss = PFX(psyCost_ss_4x4_avx2); + p.cu[BLOCK_8x8].psy_cost_ss = PFX(psyCost_ss_8x8_avx2); + p.cu[BLOCK_16x16].psy_cost_ss = PFX(psyCost_ss_16x16_avx2); + p.cu[BLOCK_32x32].psy_cost_ss = PFX(psyCost_ss_32x32_avx2); + p.cu[BLOCK_64x64].psy_cost_ss = PFX(psyCost_ss_64x64_avx2); + p.cu[BLOCK_4x4].psy_cost_pp = PFX(psyCost_pp_4x4_avx2); +#if X265_DEPTH <= 10 + p.cu[BLOCK_8x8].psy_cost_pp = PFX(psyCost_pp_8x8_avx2); + p.cu[BLOCK_16x16].psy_cost_pp = PFX(psyCost_pp_16x16_avx2); + p.cu[BLOCK_32x32].psy_cost_pp = PFX(psyCost_pp_32x32_avx2); + p.cu[BLOCK_64x64].psy_cost_pp = PFX(psyCost_pp_64x64_avx2); + p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar16_avx2); + p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar32_avx2); +#endif + + p.cu[BLOCK_16x16].intra_pred[DC_IDX] = PFX(intra_pred_dc16_avx2); + 
p.cu[BLOCK_32x32].intra_pred[DC_IDX] = PFX(intra_pred_dc32_avx2); - p.cu[BLOCK_16x16].calcresidual = x265_getResidual16_avx2; - p.cu[BLOCK_32x32].calcresidual = x265_getResidual32_avx2; + p.pu[LUMA_48x64].satd = PFX(pixel_satd_48x64_avx2); + + p.pu[LUMA_64x16].satd = PFX(pixel_satd_64x16_avx2); + p.pu[LUMA_64x32].satd = PFX(pixel_satd_64x32_avx2); + p.pu[LUMA_64x48].satd = PFX(pixel_satd_64x48_avx2); + p.pu[LUMA_64x64].satd = PFX(pixel_satd_64x64_avx2); + + p.pu[LUMA_32x8].satd = PFX(pixel_satd_32x8_avx2); + p.pu[LUMA_32x16].satd = PFX(pixel_satd_32x16_avx2); + p.pu[LUMA_32x24].satd = PFX(pixel_satd_32x24_avx2); + p.pu[LUMA_32x32].satd = PFX(pixel_satd_32x32_avx2); + p.pu[LUMA_32x64].satd = PFX(pixel_satd_32x64_avx2); + + p.pu[LUMA_16x4].satd = PFX(pixel_satd_16x4_avx2); + p.pu[LUMA_16x8].satd = PFX(pixel_satd_16x8_avx2); + p.pu[LUMA_16x12].satd = PFX(pixel_satd_16x12_avx2); + p.pu[LUMA_16x16].satd = PFX(pixel_satd_16x16_avx2); + p.pu[LUMA_16x32].satd = PFX(pixel_satd_16x32_avx2); + p.pu[LUMA_16x64].satd = PFX(pixel_satd_16x64_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].satd = PFX(pixel_satd_16x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].satd = PFX(pixel_satd_16x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].satd = PFX(pixel_satd_16x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].satd = PFX(pixel_satd_16x12_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].satd = PFX(pixel_satd_16x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].satd = PFX(pixel_satd_32x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].satd = PFX(pixel_satd_32x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].satd = PFX(pixel_satd_32x24_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd = PFX(pixel_satd_32x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].satd = PFX(pixel_satd_16x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].satd = PFX(pixel_satd_32x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].satd = PFX(pixel_satd_16x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].satd = PFX(pixel_satd_32x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].satd = PFX(pixel_satd_16x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd = PFX(pixel_satd_16x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = PFX(pixel_satd_32x16_avx2); + + p.cu[BLOCK_16x16].ssd_s = PFX(pixel_ssd_s_16_avx2); + p.cu[BLOCK_32x32].ssd_s = PFX(pixel_ssd_s_32_avx2); + +#if X265_DEPTH <= 10 + p.cu[BLOCK_16x16].sse_ss = PFX(pixel_ssd_ss_16x16_avx2); + p.cu[BLOCK_32x32].sse_ss = PFX(pixel_ssd_ss_32x32_avx2); + p.cu[BLOCK_64x64].sse_ss = PFX(pixel_ssd_ss_64x64_avx2); + + p.cu[BLOCK_16x16].sse_pp = PFX(pixel_ssd_16x16_avx2); + p.cu[BLOCK_32x32].sse_pp = PFX(pixel_ssd_32x32_avx2); + p.cu[BLOCK_64x64].sse_pp = PFX(pixel_ssd_64x64_avx2); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sse_pp = PFX(pixel_ssd_16x16_avx2); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sse_pp = PFX(pixel_ssd_32x32_avx2); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_16x32_avx2); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sse_pp = (pixel_sse_t)PFX(pixel_ssd_ss_32x64_avx2); +#endif - p.cu[BLOCK_16x16].blockfill_s = x265_blockfill_s_16x16_avx2; - p.cu[BLOCK_32x32].blockfill_s = x265_blockfill_s_32x32_avx2; + p.quant = PFX(quant_avx2); + p.nquant = PFX(nquant_avx2); + p.dequant_normal = PFX(dequant_normal_avx2); + p.dequant_scaling = PFX(dequant_scaling_avx2); + p.dst4x4 = PFX(dst4_avx2); + p.idst4x4 = PFX(idst4_avx2); + 
p.denoiseDct = PFX(denoise_dct_avx2); + + p.scale1D_128to64 = PFX(scale1D_128to64_avx2); + p.scale2D_64to32 = PFX(scale2D_64to32_avx2); + + p.weight_pp = PFX(weight_pp_avx2); + p.weight_sp = PFX(weight_sp_avx2); + p.sign = PFX(calSign_avx2); + p.planecopy_cp = PFX(upShift_8_avx2); + + p.cu[BLOCK_16x16].calcresidual = PFX(getResidual16_avx2); + p.cu[BLOCK_32x32].calcresidual = PFX(getResidual32_avx2); + + p.cu[BLOCK_16x16].blockfill_s = PFX(blockfill_s_16x16_avx2); + p.cu[BLOCK_32x32].blockfill_s = PFX(blockfill_s_32x32_avx2); ALL_LUMA_TU(count_nonzero, count_nonzero, avx2); ALL_LUMA_TU_S(cpy1Dto2D_shl, cpy1Dto2D_shl_, avx2); ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, avx2); - p.cu[BLOCK_8x8].copy_cnt = x265_copy_cnt_8_avx2; - p.cu[BLOCK_16x16].copy_cnt = x265_copy_cnt_16_avx2; - p.cu[BLOCK_32x32].copy_cnt = x265_copy_cnt_32_avx2; - - p.cu[BLOCK_8x8].cpy2Dto1D_shl = x265_cpy2Dto1D_shl_8_avx2; - p.cu[BLOCK_16x16].cpy2Dto1D_shl = x265_cpy2Dto1D_shl_16_avx2; - p.cu[BLOCK_32x32].cpy2Dto1D_shl = x265_cpy2Dto1D_shl_32_avx2; - - p.cu[BLOCK_8x8].cpy2Dto1D_shr = x265_cpy2Dto1D_shr_8_avx2; - p.cu[BLOCK_16x16].cpy2Dto1D_shr = x265_cpy2Dto1D_shr_16_avx2; - p.cu[BLOCK_32x32].cpy2Dto1D_shr = x265_cpy2Dto1D_shr_32_avx2; + p.cu[BLOCK_8x8].copy_cnt = PFX(copy_cnt_8_avx2); + p.cu[BLOCK_16x16].copy_cnt = PFX(copy_cnt_16_avx2); + p.cu[BLOCK_32x32].copy_cnt = PFX(copy_cnt_32_avx2); + + p.cu[BLOCK_8x8].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_8_avx2); + p.cu[BLOCK_16x16].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_16_avx2); + p.cu[BLOCK_32x32].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_32_avx2); + + p.cu[BLOCK_8x8].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_8_avx2); + p.cu[BLOCK_16x16].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_16_avx2); + p.cu[BLOCK_32x32].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_32_avx2); +#if X265_DEPTH <= 10 ALL_LUMA_TU_S(dct, dct, avx2); ALL_LUMA_TU_S(idct, idct, avx2); +#endif ALL_LUMA_CU_S(transpose, transpose, avx2); ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, avx2); ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, avx2); +#if X265_DEPTH <= 10 ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, avx2); +#endif ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, avx2); +#if X265_DEPTH <= 10 + p.pu[LUMA_4x4].luma_vsp = PFX(interp_8tap_vert_sp_4x4_avx2); // since ALL_LUMA_PU didn't declare 4x4 size, calling separately luma_vsp function to use +#endif + + p.cu[BLOCK_16x16].add_ps = PFX(pixel_add_ps_16x16_avx2); + p.cu[BLOCK_32x32].add_ps = PFX(pixel_add_ps_32x32_avx2); + p.cu[BLOCK_64x64].add_ps = PFX(pixel_add_ps_64x64_avx2); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = PFX(pixel_add_ps_16x16_avx2); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = PFX(pixel_add_ps_32x32_avx2); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = PFX(pixel_add_ps_16x32_avx2); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = PFX(pixel_add_ps_32x64_avx2); + + p.cu[BLOCK_16x16].sub_ps = PFX(pixel_sub_ps_16x16_avx2); + p.cu[BLOCK_32x32].sub_ps = PFX(pixel_sub_ps_32x32_avx2); + p.cu[BLOCK_64x64].sub_ps = PFX(pixel_sub_ps_64x64_avx2); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sub_ps = PFX(pixel_sub_ps_16x16_avx2); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = PFX(pixel_sub_ps_32x32_avx2); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sub_ps = PFX(pixel_sub_ps_16x32_avx2); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = PFX(pixel_sub_ps_32x64_avx2); + + p.pu[LUMA_16x4].sad = PFX(pixel_sad_16x4_avx2); + p.pu[LUMA_16x8].sad = PFX(pixel_sad_16x8_avx2); + p.pu[LUMA_16x12].sad = PFX(pixel_sad_16x12_avx2); + p.pu[LUMA_16x16].sad = 
PFX(pixel_sad_16x16_avx2); + p.pu[LUMA_16x32].sad = PFX(pixel_sad_16x32_avx2); +#if X265_DEPTH <= 10 + p.pu[LUMA_16x64].sad = PFX(pixel_sad_16x64_avx2); + p.pu[LUMA_32x8].sad = PFX(pixel_sad_32x8_avx2); + p.pu[LUMA_32x16].sad = PFX(pixel_sad_32x16_avx2); + p.pu[LUMA_32x24].sad = PFX(pixel_sad_32x24_avx2); + p.pu[LUMA_32x32].sad = PFX(pixel_sad_32x32_avx2); + p.pu[LUMA_32x64].sad = PFX(pixel_sad_32x64_avx2); + p.pu[LUMA_48x64].sad = PFX(pixel_sad_48x64_avx2); + p.pu[LUMA_64x16].sad = PFX(pixel_sad_64x16_avx2); + p.pu[LUMA_64x32].sad = PFX(pixel_sad_64x32_avx2); + p.pu[LUMA_64x48].sad = PFX(pixel_sad_64x48_avx2); + p.pu[LUMA_64x64].sad = PFX(pixel_sad_64x64_avx2); +#endif - p.cu[BLOCK_16x16].add_ps = x265_pixel_add_ps_16x16_avx2; - p.cu[BLOCK_32x32].add_ps = x265_pixel_add_ps_32x32_avx2; - p.cu[BLOCK_64x64].add_ps = x265_pixel_add_ps_64x64_avx2; - p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = x265_pixel_add_ps_16x16_avx2; - p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = x265_pixel_add_ps_32x32_avx2; - p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = x265_pixel_add_ps_16x32_avx2; - p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = x265_pixel_add_ps_32x64_avx2; - - p.cu[BLOCK_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2; - p.cu[BLOCK_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2; - p.cu[BLOCK_64x64].sub_ps = x265_pixel_sub_ps_64x64_avx2; - p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2; - p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2; - p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sub_ps = x265_pixel_sub_ps_16x32_avx2; - p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = x265_pixel_sub_ps_32x64_avx2; - - p.pu[LUMA_16x4].sad = x265_pixel_sad_16x4_avx2; - p.pu[LUMA_16x8].sad = x265_pixel_sad_16x8_avx2; - p.pu[LUMA_16x12].sad = x265_pixel_sad_16x12_avx2; - p.pu[LUMA_16x16].sad = x265_pixel_sad_16x16_avx2; - p.pu[LUMA_16x32].sad = x265_pixel_sad_16x32_avx2; - - p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_avx2; - p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_avx2; - p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_avx2; - p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_avx2; - p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_avx2; - p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_avx2; - p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_avx2; - p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_avx2; - p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_avx2; - p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_avx2; - p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_avx2; - p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_avx2; - p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_avx2; - p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_avx2; - p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_avx2; - p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_avx2; - p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = 
x265_filterPixelToShort_16x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = x265_filterPixelToShort_24x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_avx2; - - p.pu[LUMA_4x4].luma_hps = x265_interp_8tap_horiz_ps_4x4_avx2; - p.pu[LUMA_4x8].luma_hps = x265_interp_8tap_horiz_ps_4x8_avx2; - p.pu[LUMA_4x16].luma_hps = x265_interp_8tap_horiz_ps_4x16_avx2; + p.pu[LUMA_16x4].sad_x3 = PFX(pixel_sad_x3_16x4_avx2); + p.pu[LUMA_16x8].sad_x3 = PFX(pixel_sad_x3_16x8_avx2); + p.pu[LUMA_16x12].sad_x3 = PFX(pixel_sad_x3_16x12_avx2); + p.pu[LUMA_16x16].sad_x3 = PFX(pixel_sad_x3_16x16_avx2); + p.pu[LUMA_16x32].sad_x3 = PFX(pixel_sad_x3_16x32_avx2); + p.pu[LUMA_16x64].sad_x3 = PFX(pixel_sad_x3_16x64_avx2); + p.pu[LUMA_32x8].sad_x3 = PFX(pixel_sad_x3_32x8_avx2); + p.pu[LUMA_32x16].sad_x3 = PFX(pixel_sad_x3_32x16_avx2); + p.pu[LUMA_32x24].sad_x3 = PFX(pixel_sad_x3_32x24_avx2); + p.pu[LUMA_32x32].sad_x3 = PFX(pixel_sad_x3_32x32_avx2); + p.pu[LUMA_32x64].sad_x3 = PFX(pixel_sad_x3_32x64_avx2); + p.pu[LUMA_48x64].sad_x3 = PFX(pixel_sad_x3_48x64_avx2); + p.pu[LUMA_64x16].sad_x3 = PFX(pixel_sad_x3_64x16_avx2); + p.pu[LUMA_64x32].sad_x3 = PFX(pixel_sad_x3_64x32_avx2); + p.pu[LUMA_64x48].sad_x3 = PFX(pixel_sad_x3_64x48_avx2); + p.pu[LUMA_64x64].sad_x3 = PFX(pixel_sad_x3_64x64_avx2); + + p.pu[LUMA_16x4].sad_x4 = PFX(pixel_sad_x4_16x4_avx2); + p.pu[LUMA_16x8].sad_x4 = PFX(pixel_sad_x4_16x8_avx2); + p.pu[LUMA_16x12].sad_x4 = PFX(pixel_sad_x4_16x12_avx2); + p.pu[LUMA_16x16].sad_x4 = PFX(pixel_sad_x4_16x16_avx2); + p.pu[LUMA_16x32].sad_x4 = PFX(pixel_sad_x4_16x32_avx2); + p.pu[LUMA_16x64].sad_x4 = PFX(pixel_sad_x4_16x64_avx2); + p.pu[LUMA_32x8].sad_x4 = PFX(pixel_sad_x4_32x8_avx2); + p.pu[LUMA_32x16].sad_x4 = PFX(pixel_sad_x4_32x16_avx2); + p.pu[LUMA_32x24].sad_x4 = PFX(pixel_sad_x4_32x24_avx2); + p.pu[LUMA_32x32].sad_x4 = PFX(pixel_sad_x4_32x32_avx2); + p.pu[LUMA_32x64].sad_x4 = PFX(pixel_sad_x4_32x64_avx2); + p.pu[LUMA_48x64].sad_x4 = PFX(pixel_sad_x4_48x64_avx2); + p.pu[LUMA_64x16].sad_x4 = PFX(pixel_sad_x4_64x16_avx2); + p.pu[LUMA_64x32].sad_x4 = PFX(pixel_sad_x4_64x32_avx2); + p.pu[LUMA_64x48].sad_x4 = PFX(pixel_sad_x4_64x48_avx2); + p.pu[LUMA_64x64].sad_x4 = PFX(pixel_sad_x4_64x64_avx2); + + p.pu[LUMA_16x4].convert_p2s = PFX(filterPixelToShort_16x4_avx2); + p.pu[LUMA_16x8].convert_p2s = PFX(filterPixelToShort_16x8_avx2); + 
+ p.pu[LUMA_16x12].convert_p2s = PFX(filterPixelToShort_16x12_avx2);
+ p.pu[LUMA_16x16].convert_p2s = PFX(filterPixelToShort_16x16_avx2);
+ p.pu[LUMA_16x32].convert_p2s = PFX(filterPixelToShort_16x32_avx2);
+ p.pu[LUMA_16x64].convert_p2s = PFX(filterPixelToShort_16x64_avx2);
+ p.pu[LUMA_32x8].convert_p2s = PFX(filterPixelToShort_32x8_avx2);
+ p.pu[LUMA_32x16].convert_p2s = PFX(filterPixelToShort_32x16_avx2);
+ p.pu[LUMA_32x24].convert_p2s = PFX(filterPixelToShort_32x24_avx2);
+ p.pu[LUMA_32x32].convert_p2s = PFX(filterPixelToShort_32x32_avx2);
+ p.pu[LUMA_32x64].convert_p2s = PFX(filterPixelToShort_32x64_avx2);
+ p.pu[LUMA_64x16].convert_p2s = PFX(filterPixelToShort_64x16_avx2);
+ p.pu[LUMA_64x32].convert_p2s = PFX(filterPixelToShort_64x32_avx2);
+ p.pu[LUMA_64x48].convert_p2s = PFX(filterPixelToShort_64x48_avx2);
+ p.pu[LUMA_64x64].convert_p2s = PFX(filterPixelToShort_64x64_avx2);
+ p.pu[LUMA_24x32].convert_p2s = PFX(filterPixelToShort_24x32_avx2);
+ p.pu[LUMA_48x64].convert_p2s = PFX(filterPixelToShort_48x64_avx2);
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = PFX(filterPixelToShort_16x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = PFX(filterPixelToShort_16x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = PFX(filterPixelToShort_16x12_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = PFX(filterPixelToShort_16x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = PFX(filterPixelToShort_16x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = PFX(filterPixelToShort_24x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = PFX(filterPixelToShort_32x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = PFX(filterPixelToShort_32x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = PFX(filterPixelToShort_32x24_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = PFX(filterPixelToShort_32x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = PFX(filterPixelToShort_16x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = PFX(filterPixelToShort_16x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = PFX(filterPixelToShort_16x24_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = PFX(filterPixelToShort_16x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = PFX(filterPixelToShort_16x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = PFX(filterPixelToShort_24x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = PFX(filterPixelToShort_32x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = PFX(filterPixelToShort_32x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = PFX(filterPixelToShort_32x48_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = PFX(filterPixelToShort_32x64_avx2);
+
+#if X265_DEPTH <= 10
+ p.pu[LUMA_4x4].luma_hps = PFX(interp_8tap_horiz_ps_4x4_avx2);
+ p.pu[LUMA_4x8].luma_hps = PFX(interp_8tap_horiz_ps_4x8_avx2);
+ p.pu[LUMA_4x16].luma_hps = PFX(interp_8tap_horiz_ps_4x16_avx2);
+ p.pu[LUMA_8x8].luma_hps = PFX(interp_8tap_horiz_ps_8x8_avx2);
+ p.pu[LUMA_8x4].luma_hps = PFX(interp_8tap_horiz_ps_8x4_avx2);
+ p.pu[LUMA_8x16].luma_hps = PFX(interp_8tap_horiz_ps_8x16_avx2);
+ p.pu[LUMA_8x32].luma_hps = PFX(interp_8tap_horiz_ps_8x32_avx2);
+ p.pu[LUMA_16x4].luma_hps = PFX(interp_8tap_horiz_ps_16x4_avx2);
+ p.pu[LUMA_16x8].luma_hps = PFX(interp_8tap_horiz_ps_16x8_avx2);
+ p.pu[LUMA_16x12].luma_hps = PFX(interp_8tap_horiz_ps_16x12_avx2);
+ p.pu[LUMA_16x16].luma_hps = PFX(interp_8tap_horiz_ps_16x16_avx2);
+ p.pu[LUMA_16x32].luma_hps = PFX(interp_8tap_horiz_ps_16x32_avx2);
+ p.pu[LUMA_16x64].luma_hps = PFX(interp_8tap_horiz_ps_16x64_avx2);
+ p.pu[LUMA_32x8].luma_hps = PFX(interp_8tap_horiz_ps_32x8_avx2);
+ p.pu[LUMA_32x16].luma_hps = PFX(interp_8tap_horiz_ps_32x16_avx2);
+ p.pu[LUMA_32x32].luma_hps = PFX(interp_8tap_horiz_ps_32x32_avx2);
+ p.pu[LUMA_32x24].luma_hps = PFX(interp_8tap_horiz_ps_32x24_avx2);
+ p.pu[LUMA_32x64].luma_hps = PFX(interp_8tap_horiz_ps_32x64_avx2);
+ p.pu[LUMA_64x64].luma_hps = PFX(interp_8tap_horiz_ps_64x64_avx2);
+ p.pu[LUMA_64x16].luma_hps = PFX(interp_8tap_horiz_ps_64x16_avx2);
+ p.pu[LUMA_64x32].luma_hps = PFX(interp_8tap_horiz_ps_64x32_avx2);
+ p.pu[LUMA_64x48].luma_hps = PFX(interp_8tap_horiz_ps_64x48_avx2);
+ p.pu[LUMA_48x64].luma_hps = PFX(interp_8tap_horiz_ps_48x64_avx2);
+ p.pu[LUMA_24x32].luma_hps = PFX(interp_8tap_horiz_ps_24x32_avx2);
+ p.pu[LUMA_12x16].luma_hps = PFX(interp_8tap_horiz_ps_12x16_avx2);
+#endif
+
+ p.pu[LUMA_4x4].luma_hpp = PFX(interp_8tap_horiz_pp_4x4_avx2);
+ p.pu[LUMA_4x8].luma_hpp = PFX(interp_8tap_horiz_pp_4x8_avx2);
+ p.pu[LUMA_4x16].luma_hpp = PFX(interp_8tap_horiz_pp_4x16_avx2);
+ p.pu[LUMA_8x4].luma_hpp = PFX(interp_8tap_horiz_pp_8x4_avx2);
+ p.pu[LUMA_8x8].luma_hpp = PFX(interp_8tap_horiz_pp_8x8_avx2);
+ p.pu[LUMA_8x16].luma_hpp = PFX(interp_8tap_horiz_pp_8x16_avx2);
+ p.pu[LUMA_8x32].luma_hpp = PFX(interp_8tap_horiz_pp_8x32_avx2);
+ p.pu[LUMA_16x4].luma_hpp = PFX(interp_8tap_horiz_pp_16x4_avx2);
+ p.pu[LUMA_16x8].luma_hpp = PFX(interp_8tap_horiz_pp_16x8_avx2);
+ p.pu[LUMA_16x12].luma_hpp = PFX(interp_8tap_horiz_pp_16x12_avx2);
+ p.pu[LUMA_16x16].luma_hpp = PFX(interp_8tap_horiz_pp_16x16_avx2);
+ p.pu[LUMA_16x32].luma_hpp = PFX(interp_8tap_horiz_pp_16x32_avx2);
+ p.pu[LUMA_16x64].luma_hpp = PFX(interp_8tap_horiz_pp_16x64_avx2);
+ p.pu[LUMA_32x8].luma_hpp = PFX(interp_8tap_horiz_pp_32x8_avx2);
+ p.pu[LUMA_32x16].luma_hpp = PFX(interp_8tap_horiz_pp_32x16_avx2);
+ p.pu[LUMA_32x24].luma_hpp = PFX(interp_8tap_horiz_pp_32x24_avx2);
+ p.pu[LUMA_32x32].luma_hpp = PFX(interp_8tap_horiz_pp_32x32_avx2);
+ p.pu[LUMA_32x64].luma_hpp = PFX(interp_8tap_horiz_pp_32x64_avx2);
+ p.pu[LUMA_64x16].luma_hpp = PFX(interp_8tap_horiz_pp_64x16_avx2);
+ p.pu[LUMA_64x32].luma_hpp = PFX(interp_8tap_horiz_pp_64x32_avx2);
+ p.pu[LUMA_64x48].luma_hpp = PFX(interp_8tap_horiz_pp_64x48_avx2);
+ p.pu[LUMA_64x64].luma_hpp = PFX(interp_8tap_horiz_pp_64x64_avx2);
+ p.pu[LUMA_12x16].luma_hpp = PFX(interp_8tap_horiz_pp_12x16_avx2);
+ p.pu[LUMA_24x32].luma_hpp = PFX(interp_8tap_horiz_pp_24x32_avx2);
+ p.pu[LUMA_48x64].luma_hpp = PFX(interp_8tap_horiz_pp_48x64_avx2);
+
+#if X265_DEPTH <= 10
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hps = PFX(interp_4tap_horiz_ps_8x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_hps = PFX(interp_4tap_horiz_ps_8x6_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_hps = PFX(interp_4tap_horiz_ps_8x2_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_hps = PFX(interp_4tap_horiz_ps_8x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_hps = PFX(interp_4tap_horiz_ps_16x12_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hps = PFX(interp_4tap_horiz_ps_16x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hps = PFX(interp_4tap_horiz_ps_32x24_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hps = PFX(interp_4tap_horiz_ps_32x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hps = PFX(interp_4tap_horiz_ps_24x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_hps = PFX(interp_4tap_horiz_ps_12x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_hps = PFX(interp_4tap_horiz_ps_6x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hps = PFX(interp_4tap_horiz_ps_8x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hps = PFX(interp_4tap_horiz_ps_8x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hps = PFX(interp_4tap_horiz_ps_8x12_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hps = PFX(interp_4tap_horiz_ps_8x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hps = PFX(interp_4tap_horiz_ps_16x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hps = PFX(interp_4tap_horiz_ps_16x24_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hps = PFX(interp_4tap_horiz_ps_32x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hps = PFX(interp_4tap_horiz_ps_32x48_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_hps = PFX(interp_4tap_horiz_ps_12x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hps = PFX(interp_4tap_horiz_ps_24x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_hps = PFX(interp_4tap_horiz_ps_6x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hps = PFX(interp_4tap_horiz_ps_8x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hps = PFX(interp_4tap_horiz_ps_8x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hps = PFX(interp_4tap_horiz_ps_16x12_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hps = PFX(interp_4tap_horiz_ps_16x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hps = PFX(interp_4tap_horiz_ps_16x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hps = PFX(interp_4tap_horiz_ps_32x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hps = PFX(interp_4tap_horiz_ps_32x24_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hps = PFX(interp_4tap_horiz_ps_32x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hps = PFX(interp_4tap_horiz_ps_64x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hps = PFX(interp_4tap_horiz_ps_64x48_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hps = PFX(interp_4tap_horiz_ps_64x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hps = PFX(interp_4tap_horiz_ps_64x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hps = PFX(interp_4tap_horiz_ps_48x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hps = PFX(interp_4tap_horiz_ps_24x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_hps = PFX(interp_4tap_horiz_ps_12x16_avx2);
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_hpp = PFX(interp_4tap_horiz_pp_6x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_hpp = PFX(interp_4tap_horiz_pp_8x2_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_hpp = PFX(interp_4tap_horiz_pp_8x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_hpp = PFX(interp_4tap_horiz_pp_8x6_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hpp = PFX(interp_4tap_horiz_pp_8x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_hpp = PFX(interp_4tap_horiz_pp_8x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_hpp = PFX(interp_4tap_horiz_pp_8x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hpp = PFX(interp_4tap_horiz_pp_16x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_hpp = PFX(interp_4tap_horiz_pp_16x12_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hpp = PFX(interp_4tap_horiz_pp_32x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hpp = PFX(interp_4tap_horiz_pp_32x24_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_hpp = PFX(interp_4tap_horiz_pp_12x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hpp = PFX(interp_4tap_horiz_pp_24x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_hpp = PFX(interp_4tap_horiz_pp_6x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hpp = PFX(interp_4tap_horiz_pp_8x4_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hpp = PFX(interp_4tap_horiz_pp_8x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hpp = PFX(interp_4tap_horiz_pp_8x12_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hpp = PFX(interp_4tap_horiz_pp_8x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hpp = PFX(interp_4tap_horiz_pp_8x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hpp = PFX(interp_4tap_horiz_pp_8x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hpp = PFX(interp_4tap_horiz_pp_16x24_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hpp = PFX(interp_4tap_horiz_pp_16x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hpp = PFX(interp_4tap_horiz_pp_32x48_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hpp = PFX(interp_4tap_horiz_pp_32x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_hpp = PFX(interp_4tap_horiz_pp_12x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hpp = PFX(interp_4tap_horiz_pp_24x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hpp = PFX(interp_4tap_horiz_pp_8x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hpp = PFX(interp_4tap_horiz_pp_8x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hpp = PFX(interp_4tap_horiz_pp_8x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hpp = PFX(interp_4tap_horiz_pp_8x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hpp = PFX(interp_4tap_horiz_pp_16x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hpp = PFX(interp_4tap_horiz_pp_16x12_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hpp = PFX(interp_4tap_horiz_pp_16x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hpp = PFX(interp_4tap_horiz_pp_32x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hpp = PFX(interp_4tap_horiz_pp_32x24_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hpp = PFX(interp_4tap_horiz_pp_32x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_hpp = PFX(interp_4tap_horiz_pp_12x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hpp = PFX(interp_4tap_horiz_pp_24x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hpp = PFX(interp_4tap_horiz_pp_64x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hpp = PFX(interp_4tap_horiz_pp_64x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hpp = PFX(interp_4tap_horiz_pp_64x48_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hpp = PFX(interp_4tap_horiz_pp_64x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hpp = PFX(interp_4tap_horiz_pp_48x64_avx2);
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vpp = PFX(interp_4tap_vert_pp_4x2_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vps = PFX(interp_4tap_vert_ps_4x2_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vsp = PFX(interp_4tap_vert_sp_4x2_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vss = PFX(interp_4tap_vert_ss_4x2_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vpp = PFX(interp_4tap_vert_pp_4x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vps = PFX(interp_4tap_vert_ps_4x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vsp = PFX(interp_4tap_vert_sp_4x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vss = PFX(interp_4tap_vert_ss_4x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vpp = PFX(interp_4tap_vert_pp_4x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vps = PFX(interp_4tap_vert_ps_4x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vsp = PFX(interp_4tap_vert_sp_4x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vss = PFX(interp_4tap_vert_ss_4x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vpp = PFX(interp_4tap_vert_pp_4x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vps = PFX(interp_4tap_vert_ps_4x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vsp = PFX(interp_4tap_vert_sp_4x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vss = PFX(interp_4tap_vert_ss_4x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vpp = PFX(interp_4tap_vert_pp_8x2_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vps = PFX(interp_4tap_vert_ps_8x2_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vsp = PFX(interp_4tap_vert_sp_8x2_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vss = PFX(interp_4tap_vert_ss_8x2_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vpp = PFX(interp_4tap_vert_pp_8x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vps = PFX(interp_4tap_vert_ps_8x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vsp = PFX(interp_4tap_vert_sp_8x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vss = PFX(interp_4tap_vert_ss_8x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vpp = PFX(interp_4tap_vert_pp_8x6_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vps = PFX(interp_4tap_vert_ps_8x6_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vsp = PFX(interp_4tap_vert_sp_8x6_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vss = PFX(interp_4tap_vert_ss_8x6_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vpp = PFX(interp_4tap_vert_pp_8x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vps = PFX(interp_4tap_vert_ps_8x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vsp = PFX(interp_4tap_vert_sp_8x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vss = PFX(interp_4tap_vert_ss_8x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = PFX(interp_4tap_vert_pp_8x12_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vps = PFX(interp_4tap_vert_ps_8x12_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vsp = PFX(interp_4tap_vert_sp_8x12_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vss = PFX(interp_4tap_vert_ss_8x12_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vpp = PFX(interp_4tap_vert_pp_8x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vps = PFX(interp_4tap_vert_ps_8x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vsp = PFX(interp_4tap_vert_sp_8x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vss = PFX(interp_4tap_vert_ss_8x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vpp = PFX(interp_4tap_vert_pp_8x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vps = PFX(interp_4tap_vert_ps_8x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vsp = PFX(interp_4tap_vert_sp_8x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vss = PFX(interp_4tap_vert_ss_8x32_avx2);
+
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = PFX(interp_4tap_vert_pp_4x4_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vps = PFX(interp_4tap_vert_ps_4x4_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vsp = PFX(interp_4tap_vert_sp_4x4_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vss = PFX(interp_4tap_vert_ss_4x4_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = PFX(interp_4tap_vert_pp_4x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vps = PFX(interp_4tap_vert_ps_4x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vsp = PFX(interp_4tap_vert_sp_4x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vss = PFX(interp_4tap_vert_ss_4x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vpp = PFX(interp_4tap_vert_pp_4x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vps = PFX(interp_4tap_vert_ps_4x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vsp = PFX(interp_4tap_vert_sp_4x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vss = PFX(interp_4tap_vert_ss_4x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vpp = PFX(interp_4tap_vert_pp_4x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vps = PFX(interp_4tap_vert_ps_4x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vsp = PFX(interp_4tap_vert_sp_4x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vss = PFX(interp_4tap_vert_ss_4x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = PFX(interp_4tap_vert_pp_8x4_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vps = PFX(interp_4tap_vert_ps_8x4_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vsp = PFX(interp_4tap_vert_sp_8x4_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vss = PFX(interp_4tap_vert_ss_8x4_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = PFX(interp_4tap_vert_pp_8x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vps = PFX(interp_4tap_vert_ps_8x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vsp = PFX(interp_4tap_vert_sp_8x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vss = PFX(interp_4tap_vert_ss_8x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = PFX(interp_4tap_vert_pp_8x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vps = PFX(interp_4tap_vert_ps_8x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vsp = PFX(interp_4tap_vert_sp_8x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vss = PFX(interp_4tap_vert_ss_8x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = PFX(interp_4tap_vert_pp_8x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vps = PFX(interp_4tap_vert_ps_8x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vsp = PFX(interp_4tap_vert_sp_8x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vss = PFX(interp_4tap_vert_ss_8x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = PFX(interp_4tap_vert_pp_8x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vps = PFX(interp_4tap_vert_ps_8x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vsp = PFX(interp_4tap_vert_sp_8x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vss = PFX(interp_4tap_vert_ss_8x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = PFX(interp_4tap_vert_pp_4x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vps = PFX(interp_4tap_vert_ps_4x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vsp = PFX(interp_4tap_vert_sp_4x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vss = PFX(interp_4tap_vert_ss_4x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vpp = PFX(interp_4tap_vert_pp_4x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vps = PFX(interp_4tap_vert_ps_4x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vsp = PFX(interp_4tap_vert_sp_4x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vss = PFX(interp_4tap_vert_ss_4x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vpp = PFX(interp_4tap_vert_pp_4x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vps = PFX(interp_4tap_vert_ps_4x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vsp = PFX(interp_4tap_vert_sp_4x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vss = PFX(interp_4tap_vert_ss_4x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vpp = PFX(interp_4tap_vert_pp_8x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vps = PFX(interp_4tap_vert_ps_8x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vsp = PFX(interp_4tap_vert_sp_8x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vss = PFX(interp_4tap_vert_ss_8x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = PFX(interp_4tap_vert_pp_8x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vps = PFX(interp_4tap_vert_ps_8x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vsp = PFX(interp_4tap_vert_sp_8x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vss = PFX(interp_4tap_vert_ss_8x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = PFX(interp_4tap_vert_pp_8x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vps = PFX(interp_4tap_vert_ps_8x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vsp = PFX(interp_4tap_vert_sp_8x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vss = PFX(interp_4tap_vert_ss_8x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = PFX(interp_4tap_vert_pp_8x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vps = PFX(interp_4tap_vert_ps_8x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vsp = PFX(interp_4tap_vert_sp_8x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vss = PFX(interp_4tap_vert_ss_8x32_avx2);
+
+
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vss = PFX(interp_4tap_vert_ss_6x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vsp = PFX(interp_4tap_vert_sp_6x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vps = PFX(interp_4tap_vert_ps_6x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vpp = PFX(interp_4tap_vert_pp_6x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_vpp = PFX(interp_4tap_vert_pp_12x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_vps = PFX(interp_4tap_vert_ps_12x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_vss = PFX(interp_4tap_vert_ss_12x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_vsp = PFX(interp_4tap_vert_sp_12x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vpp = PFX(interp_4tap_vert_pp_16x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vpp = PFX(interp_4tap_vert_pp_16x12_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vps = PFX(interp_4tap_vert_ps_16x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vps = PFX(interp_4tap_vert_ps_16x12_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vss = PFX(interp_4tap_vert_ss_16x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vss = PFX(interp_4tap_vert_ss_16x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vss = PFX(interp_4tap_vert_ss_16x12_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vss = PFX(interp_4tap_vert_ss_16x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vss = PFX(interp_4tap_vert_ss_16x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vsp = PFX(interp_4tap_vert_sp_16x4_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vsp = PFX(interp_4tap_vert_sp_16x12_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vpp = PFX(interp_4tap_vert_pp_24x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vps = PFX(interp_4tap_vert_ps_24x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vss = PFX(interp_4tap_vert_ss_24x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vsp = PFX(interp_4tap_vert_sp_24x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vpp = PFX(interp_4tap_vert_pp_32x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vpp = PFX(interp_4tap_vert_pp_32x24_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vps = PFX(interp_4tap_vert_ps_32x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vps = PFX(interp_4tap_vert_ps_32x24_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vss = PFX(interp_4tap_vert_ss_32x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vss = PFX(interp_4tap_vert_ss_32x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vss = PFX(interp_4tap_vert_ss_32x24_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vss = PFX(interp_4tap_vert_ss_32x32_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vsp = PFX(interp_4tap_vert_sp_32x8_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vsp = PFX(interp_4tap_vert_sp_32x24_avx2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vss = PFX(interp_4tap_vert_ss_6x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vsp = PFX(interp_4tap_vert_sp_6x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vps = PFX(interp_4tap_vert_ps_6x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vpp = PFX(interp_4tap_vert_pp_6x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vpp = PFX(interp_4tap_vert_pp_12x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vps = PFX(interp_4tap_vert_ps_12x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vss = PFX(interp_4tap_vert_ss_12x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vsp = PFX(interp_4tap_vert_sp_12x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vpp = PFX(interp_4tap_vert_pp_16x24_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vpp = PFX(interp_4tap_vert_pp_16x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vps = PFX(interp_4tap_vert_ps_16x24_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vps = PFX(interp_4tap_vert_ps_16x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vss = PFX(interp_4tap_vert_ss_16x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vss = PFX(interp_4tap_vert_ss_16x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vss = PFX(interp_4tap_vert_ss_16x24_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vss = PFX(interp_4tap_vert_ss_16x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vss = PFX(interp_4tap_vert_ss_16x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vsp = PFX(interp_4tap_vert_sp_16x24_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vsp = PFX(interp_4tap_vert_sp_16x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vpp = PFX(interp_4tap_vert_pp_24x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vps = PFX(interp_4tap_vert_ps_24x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vss = PFX(interp_4tap_vert_ss_24x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vsp = PFX(interp_4tap_vert_sp_24x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vpp = PFX(interp_4tap_vert_pp_32x48_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vpp = PFX(interp_4tap_vert_pp_32x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vps = PFX(interp_4tap_vert_ps_32x48_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vps = PFX(interp_4tap_vert_ps_32x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vss = PFX(interp_4tap_vert_ss_32x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vss = PFX(interp_4tap_vert_ss_32x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vss = PFX(interp_4tap_vert_ss_32x48_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vss = PFX(interp_4tap_vert_ss_32x64_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vsp = PFX(interp_4tap_vert_sp_32x48_avx2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vsp = PFX(interp_4tap_vert_sp_32x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vpp = PFX(interp_4tap_vert_pp_12x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vps = PFX(interp_4tap_vert_ps_12x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vss = PFX(interp_4tap_vert_ss_12x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vsp = PFX(interp_4tap_vert_sp_12x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vpp = PFX(interp_4tap_vert_pp_16x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vpp = PFX(interp_4tap_vert_pp_16x12_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vpp = PFX(interp_4tap_vert_pp_16x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vps = PFX(interp_4tap_vert_ps_16x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vps = PFX(interp_4tap_vert_ps_16x12_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vps = PFX(interp_4tap_vert_ps_16x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vss = PFX(interp_4tap_vert_ss_16x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vss = PFX(interp_4tap_vert_ss_16x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vss = PFX(interp_4tap_vert_ss_16x12_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vss = PFX(interp_4tap_vert_ss_16x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vss = PFX(interp_4tap_vert_ss_16x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vss = PFX(interp_4tap_vert_ss_16x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vsp = PFX(interp_4tap_vert_sp_16x4_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vsp = PFX(interp_4tap_vert_sp_16x12_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vsp = PFX(interp_4tap_vert_sp_16x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vpp = PFX(interp_4tap_vert_pp_24x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vps = PFX(interp_4tap_vert_ps_24x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vss = PFX(interp_4tap_vert_ss_24x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vsp = PFX(interp_4tap_vert_sp_24x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vpp = PFX(interp_4tap_vert_pp_32x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vpp = PFX(interp_4tap_vert_pp_32x24_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vpp = PFX(interp_4tap_vert_pp_32x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vps = PFX(interp_4tap_vert_ps_32x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vps = PFX(interp_4tap_vert_ps_32x24_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vps = PFX(interp_4tap_vert_ps_32x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vss = PFX(interp_4tap_vert_ss_32x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vss = PFX(interp_4tap_vert_ss_32x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vss = PFX(interp_4tap_vert_ss_32x24_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vss = PFX(interp_4tap_vert_ss_32x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vss = PFX(interp_4tap_vert_ss_32x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vsp = PFX(interp_4tap_vert_sp_32x8_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vsp = PFX(interp_4tap_vert_sp_32x24_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vsp = PFX(interp_4tap_vert_sp_32x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vpp = PFX(interp_4tap_vert_pp_48x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vps = PFX(interp_4tap_vert_ps_48x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vss = PFX(interp_4tap_vert_ss_48x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vsp = PFX(interp_4tap_vert_sp_48x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vpp = PFX(interp_4tap_vert_pp_64x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vpp = PFX(interp_4tap_vert_pp_64x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vpp = PFX(interp_4tap_vert_pp_64x48_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vpp = PFX(interp_4tap_vert_pp_64x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vps = PFX(interp_4tap_vert_ps_64x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vps = PFX(interp_4tap_vert_ps_64x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vps = PFX(interp_4tap_vert_ps_64x48_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vps = PFX(interp_4tap_vert_ps_64x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vss = PFX(interp_4tap_vert_ss_64x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vss = PFX(interp_4tap_vert_ss_64x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vss = PFX(interp_4tap_vert_ss_64x48_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vss = PFX(interp_4tap_vert_ss_64x64_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vsp = PFX(interp_4tap_vert_sp_64x16_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vsp = PFX(interp_4tap_vert_sp_64x32_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vsp = PFX(interp_4tap_vert_sp_64x48_avx2);
+ p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vsp = PFX(interp_4tap_vert_sp_64x64_avx2);
+#endif
+
+ p.frameInitLowres = PFX(frame_init_lowres_core_avx2);
+
+#if X265_DEPTH <= 10
+ // TODO: depends on hps and vsp
+ ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu); // calling luma_hvpp for all sizes
+ p.pu[LUMA_4x4].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_4x4>; // ALL_LUMA_PU_T has declared all sizes except 4x4, hence calling luma_hvpp[4x4]
+#endif

 if (cpuMask & X265_CPU_BMI2)
- p.scanPosLast = x265_scanPosLast_avx2_bmi2;
+ p.scanPosLast = PFX(scanPosLast_avx2_bmi2);
 }
}
#else // if HIGH_BIT_DEPTH

-void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) // 8bpp
+void setupAssemblyPrimitives(EncoderPrimitives &p, int cpuMask) // Main
{
#if X86_64
- p.scanPosLast = x265_scanPosLast_x64;
+ p.scanPosLast = PFX(scanPosLast_x64);
#endif

 if (cpuMask & X265_CPU_SSE2)
@@ -1328,27 +2194,27 @@
 AVC_LUMA_PU(sad_x3, mmx2);
 AVC_LUMA_PU(sad_x4, mmx2);

- p.pu[LUMA_16x16].sad = x265_pixel_sad_16x16_sse2;
- p.pu[LUMA_16x16].sad_x3 = x265_pixel_sad_x3_16x16_sse2;
- p.pu[LUMA_16x16].sad_x4 = x265_pixel_sad_x4_16x16_sse2;
- p.pu[LUMA_16x8].sad = x265_pixel_sad_16x8_sse2;
- p.pu[LUMA_16x8].sad_x3 = x265_pixel_sad_x3_16x8_sse2;
- p.pu[LUMA_16x8].sad_x4 = x265_pixel_sad_x4_16x8_sse2;
+ p.pu[LUMA_16x16].sad = PFX(pixel_sad_16x16_sse2);
+ p.pu[LUMA_16x16].sad_x3 = PFX(pixel_sad_x3_16x16_sse2);
+ p.pu[LUMA_16x16].sad_x4 = PFX(pixel_sad_x4_16x16_sse2);
+ p.pu[LUMA_16x8].sad = PFX(pixel_sad_16x8_sse2);
+ p.pu[LUMA_16x8].sad_x3 = PFX(pixel_sad_x3_16x8_sse2);
+ p.pu[LUMA_16x8].sad_x4 = PFX(pixel_sad_x4_16x8_sse2);
 HEVC_SAD(sse2);

- p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_mmx2;
+ p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_mmx2);
 ALL_LUMA_PU(satd, pixel_satd, sse2);

- p.cu[BLOCK_4x4].sse_pp = x265_pixel_ssd_4x4_mmx;
- p.cu[BLOCK_8x8].sse_pp = x265_pixel_ssd_8x8_mmx;
- p.cu[BLOCK_16x16].sse_pp = x265_pixel_ssd_16x16_mmx;
+ p.cu[BLOCK_4x4].sse_pp = PFX(pixel_ssd_4x4_mmx);
+ p.cu[BLOCK_8x8].sse_pp = PFX(pixel_ssd_8x8_mmx);
+ p.cu[BLOCK_16x16].sse_pp = PFX(pixel_ssd_16x16_mmx);

 PIXEL_AVG_W4(mmx2);
 PIXEL_AVG(sse2);
 LUMA_VAR(sse2);
 ASSIGN_SA8D(sse2);

- p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sse_pp = x265_pixel_ssd_4x8_mmx;
+ p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sse_pp = PFX(pixel_ssd_4x8_mmx);
 ASSIGN_SSE_PP(sse2);
 ASSIGN_SSE_SS(sse2);
@@ -1370,50 +2236,53 @@
 CHROMA_420_VSP_FILTERS(_sse2);
 CHROMA_422_VSP_FILTERS(_sse2);
 CHROMA_444_VSP_FILTERS(_sse2);
- p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_sse2;
- p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vpp = x265_interp_4tap_vert_pp_2x8_sse2;
- p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vpp = x265_interp_4tap_vert_pp_4x2_sse2;
- p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2;
- p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2;
- p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vpp = x265_interp_4tap_vert_pp_2x16_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vpp = x265_interp_4tap_vert_pp_4x32_sse2;
- p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_sse2;
- p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_sse2;
- p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_sse2;
#if X86_64
- p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vpp = x265_interp_4tap_vert_pp_6x8_sse2;
- p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vpp = x265_interp_4tap_vert_pp_8x2_sse2;
- p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
- p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vpp = x265_interp_4tap_vert_pp_8x6_sse2;
- p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2;
- p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2;
- p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vpp = x265_interp_4tap_vert_pp_6x16_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = x265_interp_4tap_vert_pp_8x12_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2;
- p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = x265_interp_4tap_vert_pp_8x64_sse2;
- p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_sse2;
- p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_sse2;
- p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_sse2;
- p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_sse2;
+ ALL_CHROMA_420_PU(filter_vpp, interp_4tap_vert_pp, sse2);
+ ALL_CHROMA_422_PU(filter_vpp, interp_4tap_vert_pp, sse2);
+ ALL_CHROMA_444_PU(filter_vpp, interp_4tap_vert_pp, sse2);
+ ALL_CHROMA_420_PU(filter_vps, interp_4tap_vert_ps, sse2);
+ ALL_CHROMA_422_PU(filter_vps, interp_4tap_vert_ps, sse2);
+ ALL_CHROMA_444_PU(filter_vps, interp_4tap_vert_ps, sse2);
+ ALL_LUMA_PU(luma_vpp, interp_8tap_vert_pp, sse2);
+ ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, sse2);
+#else
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vpp = PFX(interp_4tap_vert_pp_2x4_sse2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vpp = PFX(interp_4tap_vert_pp_2x8_sse2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vpp = PFX(interp_4tap_vert_pp_4x2_sse2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vpp = PFX(interp_4tap_vert_pp_4x4_sse2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vpp = PFX(interp_4tap_vert_pp_4x8_sse2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vpp = PFX(interp_4tap_vert_pp_4x16_sse2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vpp = PFX(interp_4tap_vert_pp_2x16_sse2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = PFX(interp_4tap_vert_pp_4x4_sse2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = PFX(interp_4tap_vert_pp_4x8_sse2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vpp = PFX(interp_4tap_vert_pp_4x16_sse2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vpp = PFX(interp_4tap_vert_pp_4x32_sse2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = PFX(interp_4tap_vert_pp_4x4_sse2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vpp = PFX(interp_4tap_vert_pp_4x8_sse2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vpp = PFX(interp_4tap_vert_pp_4x16_sse2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vps = PFX(interp_4tap_vert_ps_2x4_sse2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vps = PFX(interp_4tap_vert_ps_2x8_sse2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vps = PFX(interp_4tap_vert_ps_4x2_sse2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vps = PFX(interp_4tap_vert_ps_4x4_sse2);
+ p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vps = PFX(interp_4tap_vert_ps_4x8_sse2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vps = PFX(interp_4tap_vert_ps_2x16_sse2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vps = PFX(interp_4tap_vert_ps_4x4_sse2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vps = PFX(interp_4tap_vert_ps_4x8_sse2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vps = PFX(interp_4tap_vert_ps_4x16_sse2);
+ p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vps = PFX(interp_4tap_vert_ps_4x32_sse2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vps = PFX(interp_4tap_vert_ps_4x4_sse2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vps = PFX(interp_4tap_vert_ps_4x8_sse2);
+ p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vps = PFX(interp_4tap_vert_ps_4x16_sse2);
#endif

 ALL_LUMA_PU(luma_hpp, interp_8tap_horiz_pp, sse2);
- p.pu[LUMA_4x4].luma_hpp = x265_interp_8tap_horiz_pp_4x4_sse2;
+ p.pu[LUMA_4x4].luma_hpp = PFX(interp_8tap_horiz_pp_4x4_sse2);
 ALL_LUMA_PU(luma_hps, interp_8tap_horiz_ps, sse2);
- p.pu[LUMA_4x4].luma_hps = x265_interp_8tap_horiz_ps_4x4_sse2;
- p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_sse3;
+ p.pu[LUMA_4x4].luma_hps = PFX(interp_8tap_horiz_ps_4x4_sse2);
+ p.pu[LUMA_8x8].luma_hvpp = PFX(interp_8tap_hv_pp_8x8_sse3);

- //p.frameInitLowres = x265_frame_init_lowres_core_mmx2;
- p.frameInitLowres = x265_frame_init_lowres_core_sse2;
+ //p.frameInitLowres = PFX(frame_init_lowres_core_mmx2);
+ p.frameInitLowres = PFX(frame_init_lowres_core_sse2);

 ALL_LUMA_TU(blockfill_s, blockfill_s, sse2);
 ALL_LUMA_TU_S(cpy2Dto1D_shl, cpy2Dto1D_shl_, sse2);
@@ -1425,81 +2294,93 @@
 ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse2);
 ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse2);

- p.cu[BLOCK_4x4].intra_pred[2] = x265_intra_pred_ang4_2_sse2;
- p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_sse2;
- p.cu[BLOCK_4x4].intra_pred[4] = x265_intra_pred_ang4_4_sse2;
- p.cu[BLOCK_4x4].intra_pred[5] = x265_intra_pred_ang4_5_sse2;
- p.cu[BLOCK_4x4].intra_pred[6] = x265_intra_pred_ang4_6_sse2;
- p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_sse2;
- p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_sse2;
- p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_sse2;
- p.cu[BLOCK_4x4].intra_pred[10] = x265_intra_pred_ang4_10_sse2;
- p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_sse2;
- p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_sse2;
- p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_sse2;
- p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_sse2;
- p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_sse2;
- p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_sse2;
- p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_sse2;
- p.cu[BLOCK_4x4].intra_pred[18] = x265_intra_pred_ang4_18_sse2;
- p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_17_sse2;
- p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_16_sse2;
- p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_15_sse2;
- p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_14_sse2;
- p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_13_sse2;
- p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_12_sse2;
- p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_11_sse2;
- p.cu[BLOCK_4x4].intra_pred[26] = x265_intra_pred_ang4_26_sse2;
- p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_9_sse2;
- p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_8_sse2;
- p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_7_sse2;
- p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_6_sse2;
- p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_5_sse2;
- p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_4_sse2;
- p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_3_sse2;
+ p.cu[BLOCK_4x4].intra_pred[2] = PFX(intra_pred_ang4_2_sse2);
+ p.cu[BLOCK_4x4].intra_pred[3] = PFX(intra_pred_ang4_3_sse2);
+ p.cu[BLOCK_4x4].intra_pred[4] = PFX(intra_pred_ang4_4_sse2);
+ p.cu[BLOCK_4x4].intra_pred[5] = PFX(intra_pred_ang4_5_sse2);
+ p.cu[BLOCK_4x4].intra_pred[6] = PFX(intra_pred_ang4_6_sse2);
+ p.cu[BLOCK_4x4].intra_pred[7] = PFX(intra_pred_ang4_7_sse2);
+ p.cu[BLOCK_4x4].intra_pred[8] = PFX(intra_pred_ang4_8_sse2);
+ p.cu[BLOCK_4x4].intra_pred[9] = PFX(intra_pred_ang4_9_sse2);
+ p.cu[BLOCK_4x4].intra_pred[10] = PFX(intra_pred_ang4_10_sse2);
+ p.cu[BLOCK_4x4].intra_pred[11] = PFX(intra_pred_ang4_11_sse2);
+ p.cu[BLOCK_4x4].intra_pred[12] = PFX(intra_pred_ang4_12_sse2);
+ p.cu[BLOCK_4x4].intra_pred[13] = PFX(intra_pred_ang4_13_sse2);
+ p.cu[BLOCK_4x4].intra_pred[14] = PFX(intra_pred_ang4_14_sse2);
+ p.cu[BLOCK_4x4].intra_pred[15] = PFX(intra_pred_ang4_15_sse2);
+ p.cu[BLOCK_4x4].intra_pred[16] = PFX(intra_pred_ang4_16_sse2);
+ p.cu[BLOCK_4x4].intra_pred[17] = PFX(intra_pred_ang4_17_sse2);
+ p.cu[BLOCK_4x4].intra_pred[18] = PFX(intra_pred_ang4_18_sse2);
+ p.cu[BLOCK_4x4].intra_pred[19] = PFX(intra_pred_ang4_19_sse2);
+ p.cu[BLOCK_4x4].intra_pred[20] = PFX(intra_pred_ang4_20_sse2);
+ p.cu[BLOCK_4x4].intra_pred[21] = PFX(intra_pred_ang4_21_sse2);
+ p.cu[BLOCK_4x4].intra_pred[22] = PFX(intra_pred_ang4_22_sse2);
+ p.cu[BLOCK_4x4].intra_pred[23] = PFX(intra_pred_ang4_23_sse2);
+ p.cu[BLOCK_4x4].intra_pred[24] = PFX(intra_pred_ang4_24_sse2);
+ p.cu[BLOCK_4x4].intra_pred[25] = PFX(intra_pred_ang4_25_sse2);
+ p.cu[BLOCK_4x4].intra_pred[26] = PFX(intra_pred_ang4_26_sse2);
+ p.cu[BLOCK_4x4].intra_pred[27] = PFX(intra_pred_ang4_27_sse2);
+ p.cu[BLOCK_4x4].intra_pred[28] = PFX(intra_pred_ang4_28_sse2);
+ p.cu[BLOCK_4x4].intra_pred[29] = PFX(intra_pred_ang4_29_sse2);
+ p.cu[BLOCK_4x4].intra_pred[30] = PFX(intra_pred_ang4_30_sse2);
+ p.cu[BLOCK_4x4].intra_pred[31] = PFX(intra_pred_ang4_31_sse2);
+ p.cu[BLOCK_4x4].intra_pred[32] = PFX(intra_pred_ang4_32_sse2);
+ p.cu[BLOCK_4x4].intra_pred[33] = PFX(intra_pred_ang4_33_sse2);

- p.cu[BLOCK_4x4].intra_pred_allangs = x265_all_angs_pred_4x4_sse2;
+ p.cu[BLOCK_4x4].intra_pred_allangs = PFX(all_angs_pred_4x4_sse2);

- p.cu[BLOCK_4x4].calcresidual = x265_getResidual4_sse2;
- p.cu[BLOCK_8x8].calcresidual = x265_getResidual8_sse2;
+ p.cu[BLOCK_4x4].calcresidual = PFX(getResidual4_sse2);
+ p.cu[BLOCK_8x8].calcresidual = PFX(getResidual8_sse2);

 ALL_LUMA_TU_S(transpose, transpose, sse2);
- p.cu[BLOCK_64x64].transpose = x265_transpose64_sse2;
+ p.cu[BLOCK_64x64].transpose = PFX(transpose64_sse2);

- p.ssim_4x4x2_core = x265_pixel_ssim_4x4x2_core_sse2;
- p.ssim_end_4 = x265_pixel_ssim_end4_sse2;
+ p.ssim_4x4x2_core = PFX(pixel_ssim_4x4x2_core_sse2);
+ p.ssim_end_4 = PFX(pixel_ssim_end4_sse2);

- p.cu[BLOCK_4x4].dct = x265_dct4_sse2;
- p.cu[BLOCK_8x8].dct = x265_dct8_sse2;
- p.cu[BLOCK_4x4].idct = x265_idct4_sse2;
+ p.cu[BLOCK_4x4].dct = PFX(dct4_sse2);
+ p.cu[BLOCK_8x8].dct = PFX(dct8_sse2);
+ p.cu[BLOCK_4x4].idct = PFX(idct4_sse2);
#if X86_64
- p.cu[BLOCK_8x8].idct = x265_idct8_sse2;
+ p.cu[BLOCK_8x8].idct = PFX(idct8_sse2);
+
+ // TODO: it is passed smoke test, but we need testbench, so temporary disable
+ //p.costC1C2Flag = x265_costC1C2Flag_sse2;
#endif

- p.idst4x4 = x265_idst4_sse2;
+ p.idst4x4 = PFX(idst4_sse2);
+ p.dst4x4 = PFX(dst4_sse2);

- p.planecopy_sp = x265_downShift_16_sse2;
+ p.planecopy_sp = PFX(downShift_16_sse2);
+ ALL_CHROMA_420_PU(p2s, filterPixelToShort, sse2);
+ ALL_CHROMA_422_PU(p2s, filterPixelToShort, sse2);
+ ALL_CHROMA_444_PU(p2s, filterPixelToShort, sse2);
+ ALL_LUMA_PU(convert_p2s, filterPixelToShort, sse2);
+ ALL_LUMA_TU(count_nonzero, count_nonzero, sse2);
 }

 if (cpuMask & X265_CPU_SSE3)
 {
 ALL_CHROMA_420_PU(filter_hpp, interp_4tap_horiz_pp, sse3);
 ALL_CHROMA_422_PU(filter_hpp, interp_4tap_horiz_pp, sse3);
 ALL_CHROMA_444_PU(filter_hpp, interp_4tap_horiz_pp, sse3);
+ ALL_CHROMA_420_PU(filter_hps, interp_4tap_horiz_ps, sse3);
+ ALL_CHROMA_422_PU(filter_hps, interp_4tap_horiz_ps, sse3);
+ ALL_CHROMA_444_PU(filter_hps, interp_4tap_horiz_ps, sse3);
 }

 if (cpuMask & X265_CPU_SSSE3)
 {
- p.pu[LUMA_8x16].sad_x3 = x265_pixel_sad_x3_8x16_ssse3;
- p.pu[LUMA_8x32].sad_x3 = x265_pixel_sad_x3_8x32_ssse3;
- p.pu[LUMA_12x16].sad_x3 = x265_pixel_sad_x3_12x16_ssse3;
+ p.pu[LUMA_8x16].sad_x3 = PFX(pixel_sad_x3_8x16_ssse3);
+ p.pu[LUMA_8x32].sad_x3 = PFX(pixel_sad_x3_8x32_ssse3);
+ p.pu[LUMA_12x16].sad_x3 = PFX(pixel_sad_x3_12x16_ssse3);
 HEVC_SAD_X3(ssse3);

- p.pu[LUMA_8x4].sad_x4 = x265_pixel_sad_x4_8x4_ssse3;
- p.pu[LUMA_8x8].sad_x4 = x265_pixel_sad_x4_8x8_ssse3;
- p.pu[LUMA_8x16].sad_x4 = x265_pixel_sad_x4_8x16_ssse3;
- p.pu[LUMA_8x32].sad_x4 = x265_pixel_sad_x4_8x32_ssse3;
- p.pu[LUMA_12x16].sad_x4 = x265_pixel_sad_x4_12x16_ssse3;
+ p.pu[LUMA_8x4].sad_x4 = PFX(pixel_sad_x4_8x4_ssse3);
+ p.pu[LUMA_8x8].sad_x4 = PFX(pixel_sad_x4_8x8_ssse3);
+ p.pu[LUMA_8x16].sad_x4 = PFX(pixel_sad_x4_8x16_ssse3);
+ p.pu[LUMA_8x32].sad_x4 = PFX(pixel_sad_x4_8x32_ssse3);
+ p.pu[LUMA_12x16].sad_x4 = PFX(pixel_sad_x4_12x16_ssse3);
 HEVC_SAD_X4(ssse3);

- p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_ssse3;
+ p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_ssse3);
 ALL_LUMA_PU(satd, pixel_satd, ssse3);
 ASSIGN_SA8D(ssse3);
@@ -1508,89 +2389,87 @@
 INTRA_ANG_SSSE3(ssse3);

 ASSIGN_SSE_PP(ssse3);
- p.cu[BLOCK_4x4].sse_pp = x265_pixel_ssd_4x4_ssse3;
- p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sse_pp = x265_pixel_ssd_4x8_ssse3;
-
- p.dst4x4 = x265_dst4_ssse3;
- p.cu[BLOCK_8x8].idct = x265_idct8_ssse3;
+ p.cu[BLOCK_4x4].sse_pp = PFX(pixel_ssd_4x4_ssse3);
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_4x8].sse_pp = PFX(pixel_ssd_4x8_ssse3);
-        ALL_LUMA_TU(count_nonzero, count_nonzero, ssse3);
+        p.dst4x4 = PFX(dst4_ssse3);
+        p.cu[BLOCK_8x8].idct = PFX(idct8_ssse3);

        // MUST be done after LUMA_FILTERS() to overwrite default version
-        p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_ssse3;
+        p.pu[LUMA_8x8].luma_hvpp = PFX(interp_8tap_hv_pp_8x8_ssse3);
-        p.frameInitLowres = x265_frame_init_lowres_core_ssse3;
-        p.scale1D_128to64 = x265_scale1D_128to64_ssse3;
-        p.scale2D_64to32 = x265_scale2D_64to32_ssse3;
-
-        p.pu[LUMA_8x4].convert_p2s = x265_filterPixelToShort_8x4_ssse3;
-        p.pu[LUMA_8x8].convert_p2s = x265_filterPixelToShort_8x8_ssse3;
-        p.pu[LUMA_8x16].convert_p2s = x265_filterPixelToShort_8x16_ssse3;
-        p.pu[LUMA_8x32].convert_p2s = x265_filterPixelToShort_8x32_ssse3;
-        p.pu[LUMA_16x4].convert_p2s = x265_filterPixelToShort_16x4_ssse3;
-        p.pu[LUMA_16x8].convert_p2s = x265_filterPixelToShort_16x8_ssse3;
-        p.pu[LUMA_16x12].convert_p2s = x265_filterPixelToShort_16x12_ssse3;
-        p.pu[LUMA_16x16].convert_p2s = x265_filterPixelToShort_16x16_ssse3;
-        p.pu[LUMA_16x32].convert_p2s = x265_filterPixelToShort_16x32_ssse3;
-        p.pu[LUMA_16x64].convert_p2s = x265_filterPixelToShort_16x64_ssse3;
-        p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_ssse3;
-        p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_ssse3;
-        p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_ssse3;
-        p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_ssse3;
-        p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_ssse3;
-        p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_ssse3;
-        p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_ssse3;
-        p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_ssse3;
-        p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_ssse3;
-        p.pu[LUMA_12x16].convert_p2s = x265_filterPixelToShort_12x16_ssse3;
-        p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_ssse3;
-        p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_ssse3;
-
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = x265_filterPixelToShort_8x2_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = x265_filterPixelToShort_8x4_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = x265_filterPixelToShort_8x6_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = x265_filterPixelToShort_8x8_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = x265_filterPixelToShort_8x16_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = x265_filterPixelToShort_8x32_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = x265_filterPixelToShort_16x4_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = x265_filterPixelToShort_16x8_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = x265_filterPixelToShort_16x12_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = x265_filterPixelToShort_16x16_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = x265_filterPixelToShort_16x32_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_ssse3;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = x265_filterPixelToShort_8x4_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = x265_filterPixelToShort_8x8_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = x265_filterPixelToShort_8x12_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = x265_filterPixelToShort_8x16_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = x265_filterPixelToShort_8x32_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = x265_filterPixelToShort_8x64_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = x265_filterPixelToShort_12x32_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = x265_filterPixelToShort_16x8_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = x265_filterPixelToShort_16x16_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = x265_filterPixelToShort_16x24_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = x265_filterPixelToShort_16x32_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = x265_filterPixelToShort_16x64_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_ssse3;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_ssse3;
-        p.findPosFirstLast = x265_findPosFirstLast_ssse3;
+        p.frameInitLowres = PFX(frame_init_lowres_core_ssse3);
+        p.scale1D_128to64 = PFX(scale1D_128to64_ssse3);
+        p.scale2D_64to32 = PFX(scale2D_64to32_ssse3);
+
+        p.pu[LUMA_8x4].convert_p2s = PFX(filterPixelToShort_8x4_ssse3);
+        p.pu[LUMA_8x8].convert_p2s = PFX(filterPixelToShort_8x8_ssse3);
+        p.pu[LUMA_8x16].convert_p2s = PFX(filterPixelToShort_8x16_ssse3);
+        p.pu[LUMA_8x32].convert_p2s = PFX(filterPixelToShort_8x32_ssse3);
+        p.pu[LUMA_16x4].convert_p2s = PFX(filterPixelToShort_16x4_ssse3);
+        p.pu[LUMA_16x8].convert_p2s = PFX(filterPixelToShort_16x8_ssse3);
+        p.pu[LUMA_16x12].convert_p2s = PFX(filterPixelToShort_16x12_ssse3);
+        p.pu[LUMA_16x16].convert_p2s = PFX(filterPixelToShort_16x16_ssse3);
+        p.pu[LUMA_16x32].convert_p2s = PFX(filterPixelToShort_16x32_ssse3);
+        p.pu[LUMA_16x64].convert_p2s = PFX(filterPixelToShort_16x64_ssse3);
+        p.pu[LUMA_32x8].convert_p2s = PFX(filterPixelToShort_32x8_ssse3);
+        p.pu[LUMA_32x16].convert_p2s = PFX(filterPixelToShort_32x16_ssse3);
+        p.pu[LUMA_32x24].convert_p2s = PFX(filterPixelToShort_32x24_ssse3);
+        p.pu[LUMA_32x32].convert_p2s = PFX(filterPixelToShort_32x32_ssse3);
+        p.pu[LUMA_32x64].convert_p2s = PFX(filterPixelToShort_32x64_ssse3);
+        p.pu[LUMA_64x16].convert_p2s = PFX(filterPixelToShort_64x16_ssse3);
+        p.pu[LUMA_64x32].convert_p2s = PFX(filterPixelToShort_64x32_ssse3);
+        p.pu[LUMA_64x48].convert_p2s = PFX(filterPixelToShort_64x48_ssse3);
+        p.pu[LUMA_64x64].convert_p2s = PFX(filterPixelToShort_64x64_ssse3);
+        p.pu[LUMA_12x16].convert_p2s = PFX(filterPixelToShort_12x16_ssse3);
+        p.pu[LUMA_24x32].convert_p2s = PFX(filterPixelToShort_24x32_ssse3);
+        p.pu[LUMA_48x64].convert_p2s = PFX(filterPixelToShort_48x64_ssse3);
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].p2s = PFX(filterPixelToShort_8x2_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].p2s = PFX(filterPixelToShort_8x4_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].p2s = PFX(filterPixelToShort_8x6_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].p2s = PFX(filterPixelToShort_8x8_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].p2s = PFX(filterPixelToShort_8x16_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].p2s = PFX(filterPixelToShort_8x32_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].p2s = PFX(filterPixelToShort_16x4_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].p2s = PFX(filterPixelToShort_16x8_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].p2s = PFX(filterPixelToShort_16x12_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].p2s = PFX(filterPixelToShort_16x16_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].p2s = PFX(filterPixelToShort_16x32_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = PFX(filterPixelToShort_32x8_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = PFX(filterPixelToShort_32x16_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = PFX(filterPixelToShort_32x24_ssse3);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = PFX(filterPixelToShort_32x32_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].p2s = PFX(filterPixelToShort_8x4_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].p2s = PFX(filterPixelToShort_8x8_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].p2s = PFX(filterPixelToShort_8x12_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].p2s = PFX(filterPixelToShort_8x16_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].p2s = PFX(filterPixelToShort_8x32_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].p2s = PFX(filterPixelToShort_8x64_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].p2s = PFX(filterPixelToShort_12x32_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].p2s = PFX(filterPixelToShort_16x8_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].p2s = PFX(filterPixelToShort_16x16_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].p2s = PFX(filterPixelToShort_16x24_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].p2s = PFX(filterPixelToShort_16x32_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].p2s = PFX(filterPixelToShort_16x64_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = PFX(filterPixelToShort_24x64_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = PFX(filterPixelToShort_32x16_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = PFX(filterPixelToShort_32x32_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = PFX(filterPixelToShort_32x48_ssse3);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = PFX(filterPixelToShort_32x64_ssse3);
+        p.findPosFirstLast = PFX(findPosFirstLast_ssse3);
    }
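Each of these blocks is guarded by a single bit test on cpuMask, and the blocks run from weakest to strongest instruction set, so a later block silently overwrites the pointers installed by an earlier one. A compilable sketch of that dispatch idiom; the flag names (CPU_SSE2, CPU_AVX2) and kernels (sad_*) are invented stand-ins for x265's flags and assembly routines:

#include <cstdint>
#include <cstdio>

enum : uint32_t { CPU_SSE2 = 1u << 0, CPU_SSSE3 = 1u << 1, CPU_AVX2 = 1u << 2 };

static int sad_c(void)    { return 0; }   // portable fallback
static int sad_sse2(void) { return 1; }
static int sad_avx2(void) { return 2; }

struct Primitives { int (*sad)(void); };

// Blocks are ordered weakest to strongest, so the last supported
// instruction set wins -- exactly the overwrite pattern in the diff.
static void setup(Primitives& p, uint32_t cpuMask)
{
    p.sad = sad_c;
    if (cpuMask & CPU_SSE2) p.sad = sad_sse2;
    if (cpuMask & CPU_AVX2) p.sad = sad_avx2;
}

int main()
{
    Primitives p;
    setup(p, CPU_SSE2 | CPU_AVX2);
    std::printf("selected kernel: %d\n", p.sad()); // prints 2 (AVX2)
    return 0;
}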
    if (cpuMask & X265_CPU_SSE4)
    {
-        p.sign = x265_calSign_sse4;
-        p.saoCuOrgE0 = x265_saoCuOrgE0_sse4;
-        p.saoCuOrgE1 = x265_saoCuOrgE1_sse4;
-        p.saoCuOrgE1_2Rows = x265_saoCuOrgE1_2Rows_sse4;
-        p.saoCuOrgE2[0] = x265_saoCuOrgE2_sse4;
-        p.saoCuOrgE2[1] = x265_saoCuOrgE2_sse4;
-        p.saoCuOrgE3[0] = x265_saoCuOrgE3_sse4;
-        p.saoCuOrgE3[1] = x265_saoCuOrgE3_sse4;
-        p.saoCuOrgB0 = x265_saoCuOrgB0_sse4;
+        p.sign = PFX(calSign_sse4);
+        p.saoCuOrgE0 = PFX(saoCuOrgE0_sse4);
+        p.saoCuOrgE1 = PFX(saoCuOrgE1_sse4);
+        p.saoCuOrgE1_2Rows = PFX(saoCuOrgE1_2Rows_sse4);
+        p.saoCuOrgE2[0] = PFX(saoCuOrgE2_sse4);
+        p.saoCuOrgE2[1] = PFX(saoCuOrgE2_sse4);
+        p.saoCuOrgE3[0] = PFX(saoCuOrgE3_sse4);
+        p.saoCuOrgE3[1] = PFX(saoCuOrgE3_sse4);
+        p.saoCuOrgB0 = PFX(saoCuOrgB0_sse4);

        LUMA_ADDAVG(sse4);
        CHROMA_420_ADDAVG(sse4);
@@ -1599,11 +2478,11 @@
        // TODO: check POPCNT flag!
        ALL_LUMA_TU_S(copy_cnt, copy_cnt_, sse4);
-        p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_sse4;
+        p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_sse4);
        ALL_LUMA_PU(satd, pixel_satd, sse4);
        ASSIGN_SA8D(sse4);
        ASSIGN_SSE_SS(sse4);
-        p.cu[BLOCK_64x64].sse_pp = x265_pixel_ssd_64x64_sse4;
+        p.cu[BLOCK_64x64].sse_pp = PFX(pixel_ssd_64x64_sse4);

        LUMA_PIXELSUB(sse4);
        CHROMA_420_PIXELSUB_PS(sse4);
@@ -1620,22 +2499,28 @@
        CHROMA_444_VSP_FILTERS_SSE4(_sse4);

        // MUST be done after LUMA_FILTERS() to overwrite default version
-        p.pu[LUMA_8x8].luma_hvpp = x265_interp_8tap_hv_pp_8x8_ssse3;
+        p.pu[LUMA_8x8].luma_hvpp = PFX(interp_8tap_hv_pp_8x8_ssse3);

        LUMA_CU_BLOCKCOPY(ps, sse4);
        CHROMA_420_CU_BLOCKCOPY(ps, sse4);
        CHROMA_422_CU_BLOCKCOPY(ps, sse4);

-        p.cu[BLOCK_16x16].calcresidual = x265_getResidual16_sse4;
-        p.cu[BLOCK_32x32].calcresidual = x265_getResidual32_sse4;
-        p.cu[BLOCK_8x8].dct = x265_dct8_sse4;
-        p.denoiseDct = x265_denoise_dct_sse4;
-        p.quant = x265_quant_sse4;
-        p.nquant = x265_nquant_sse4;
-        p.dequant_normal = x265_dequant_normal_sse4;
-
-        p.weight_pp = x265_weight_pp_sse4;
-        p.weight_sp = x265_weight_sp_sse4;
+        p.cu[BLOCK_16x16].calcresidual = PFX(getResidual16_sse4);
+        p.cu[BLOCK_32x32].calcresidual = PFX(getResidual32_sse4);
+        p.cu[BLOCK_8x8].dct = PFX(dct8_sse4);
+        p.denoiseDct = PFX(denoise_dct_sse4);
+        p.quant = PFX(quant_sse4);
+        p.nquant = PFX(nquant_sse4);
+        p.dequant_normal = PFX(dequant_normal_sse4);
+        p.dequant_scaling = PFX(dequant_scaling_sse4);
+
+        p.weight_pp = PFX(weight_pp_sse4);
+        p.weight_sp = PFX(weight_sp_sse4);
+
+        p.cu[BLOCK_4x4].intra_filter = PFX(intra_filter_4x4_sse4);
+        p.cu[BLOCK_8x8].intra_filter = PFX(intra_filter_8x8_sse4);
+        p.cu[BLOCK_16x16].intra_filter = PFX(intra_filter_16x16_sse4);
+        p.cu[BLOCK_32x32].intra_filter = PFX(intra_filter_32x32_sse4);

        ALL_LUMA_TU_S(intra_pred[PLANAR_IDX], intra_pred_planar, sse4);
        ALL_LUMA_TU_S(intra_pred[DC_IDX], intra_pred_dc, sse4);
@@ -1644,475 +2529,500 @@
        INTRA_ANG_SSE4_COMMON(sse4);
        INTRA_ANG_SSE4(sse4);

-        p.cu[BLOCK_4x4].psy_cost_pp = x265_psyCost_pp_4x4_sse4;
-        p.cu[BLOCK_4x4].psy_cost_ss = x265_psyCost_ss_4x4_sse4;
+        p.cu[BLOCK_4x4].psy_cost_pp = PFX(psyCost_pp_4x4_sse4);
+        p.cu[BLOCK_4x4].psy_cost_ss = PFX(psyCost_ss_4x4_sse4);

-        p.pu[LUMA_4x4].convert_p2s = x265_filterPixelToShort_4x4_sse4;
-        p.pu[LUMA_4x8].convert_p2s = x265_filterPixelToShort_4x8_sse4;
-        p.pu[LUMA_4x16].convert_p2s = x265_filterPixelToShort_4x16_sse4;
-
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = x265_filterPixelToShort_2x4_sse4;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = x265_filterPixelToShort_2x8_sse4;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = x265_filterPixelToShort_4x2_sse4;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = x265_filterPixelToShort_4x4_sse4;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = x265_filterPixelToShort_4x8_sse4;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = x265_filterPixelToShort_4x16_sse4;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = x265_filterPixelToShort_6x8_sse4;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = x265_filterPixelToShort_2x8_sse4;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = x265_filterPixelToShort_2x16_sse4;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = x265_filterPixelToShort_4x4_sse4;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = x265_filterPixelToShort_4x8_sse4;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = x265_filterPixelToShort_4x16_sse4;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = x265_filterPixelToShort_4x32_sse4;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = x265_filterPixelToShort_6x16_sse4;
+        p.pu[LUMA_4x4].convert_p2s = PFX(filterPixelToShort_4x4_sse4);
+        p.pu[LUMA_4x8].convert_p2s = PFX(filterPixelToShort_4x8_sse4);
+        p.pu[LUMA_4x16].convert_p2s = PFX(filterPixelToShort_4x16_sse4);
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].p2s = PFX(filterPixelToShort_2x4_sse4);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].p2s = PFX(filterPixelToShort_2x8_sse4);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].p2s = PFX(filterPixelToShort_4x2_sse4);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].p2s = PFX(filterPixelToShort_4x4_sse4);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].p2s = PFX(filterPixelToShort_4x8_sse4);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].p2s = PFX(filterPixelToShort_4x16_sse4);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].p2s = PFX(filterPixelToShort_6x8_sse4);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].p2s = PFX(filterPixelToShort_2x8_sse4);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].p2s = PFX(filterPixelToShort_2x16_sse4);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].p2s = PFX(filterPixelToShort_4x4_sse4);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].p2s = PFX(filterPixelToShort_4x8_sse4);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].p2s = PFX(filterPixelToShort_4x16_sse4);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].p2s = PFX(filterPixelToShort_4x32_sse4);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].p2s = PFX(filterPixelToShort_6x16_sse4);
 #if X86_64
+        p.saoCuStatsBO = PFX(saoCuStatsBO_sse4);
+        p.saoCuStatsE0 = PFX(saoCuStatsE0_sse4);
+        p.saoCuStatsE1 = PFX(saoCuStatsE1_sse4);
+        p.saoCuStatsE2 = PFX(saoCuStatsE2_sse4);
+        p.saoCuStatsE3 = PFX(saoCuStatsE3_sse4);
+
        ALL_LUMA_CU(psy_cost_pp, psyCost_pp, sse4);
        ALL_LUMA_CU(psy_cost_ss, psyCost_ss, sse4);
+
+        p.costCoeffNxN = PFX(costCoeffNxN_sse4);
 #endif
+        p.costCoeffRemain = PFX(costCoeffRemain_sse4);
    }
    if (cpuMask & X265_CPU_AVX)
    {
-        p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].satd = x265_pixel_satd_16x24_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].satd = x265_pixel_satd_32x48_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].satd = x265_pixel_satd_24x64_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].satd = x265_pixel_satd_8x64_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd = x265_pixel_satd_8x12_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = x265_pixel_satd_12x32_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd = x265_pixel_satd_4x32_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].satd = x265_pixel_satd_16x32_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].satd = x265_pixel_satd_32x64_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].satd = x265_pixel_satd_16x16_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].satd = x265_pixel_satd_32x32_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].satd = x265_pixel_satd_16x64_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd = x265_pixel_satd_16x8_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = x265_pixel_satd_32x16_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].satd = x265_pixel_satd_8x4_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].satd = x265_pixel_satd_8x16_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].satd = x265_pixel_satd_8x8_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].satd = x265_pixel_satd_8x32_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].satd = x265_pixel_satd_4x8_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].satd = x265_pixel_satd_4x16_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].satd = x265_pixel_satd_4x4_avx;
+        p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].satd = PFX(pixel_satd_16x24_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].satd = PFX(pixel_satd_32x48_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].satd = PFX(pixel_satd_24x64_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].satd = PFX(pixel_satd_8x64_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].satd = PFX(pixel_satd_8x12_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].satd = PFX(pixel_satd_12x32_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].satd = PFX(pixel_satd_4x32_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].satd = PFX(pixel_satd_16x32_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].satd = PFX(pixel_satd_32x64_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].satd = PFX(pixel_satd_16x16_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].satd = PFX(pixel_satd_32x32_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].satd = PFX(pixel_satd_16x64_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].satd = PFX(pixel_satd_16x8_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].satd = PFX(pixel_satd_32x16_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].satd = PFX(pixel_satd_8x4_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].satd = PFX(pixel_satd_8x16_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].satd = PFX(pixel_satd_8x8_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].satd = PFX(pixel_satd_8x32_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].satd = PFX(pixel_satd_4x8_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].satd = PFX(pixel_satd_4x16_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].satd = PFX(pixel_satd_4x4_avx);
        ALL_LUMA_PU(satd, pixel_satd, avx);
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].satd = x265_pixel_satd_4x4_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].satd = x265_pixel_satd_8x8_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].satd = x265_pixel_satd_16x16_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].satd = x265_pixel_satd_32x32_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].satd = x265_pixel_satd_8x4_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].satd = x265_pixel_satd_4x8_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].satd = x265_pixel_satd_16x8_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].satd = x265_pixel_satd_8x16_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].satd = x265_pixel_satd_32x16_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].satd = x265_pixel_satd_16x32_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].satd = x265_pixel_satd_16x12_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].satd = x265_pixel_satd_12x16_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].satd = x265_pixel_satd_16x4_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].satd = x265_pixel_satd_4x16_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].satd = x265_pixel_satd_32x24_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].satd = x265_pixel_satd_24x32_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd = x265_pixel_satd_32x8_avx;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].satd = x265_pixel_satd_8x32_avx;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].satd = PFX(pixel_satd_4x4_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].satd = PFX(pixel_satd_8x8_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].satd = PFX(pixel_satd_16x16_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].satd = PFX(pixel_satd_32x32_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].satd = PFX(pixel_satd_8x4_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].satd = PFX(pixel_satd_4x8_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].satd = PFX(pixel_satd_16x8_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].satd = PFX(pixel_satd_8x16_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].satd = PFX(pixel_satd_32x16_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].satd = PFX(pixel_satd_16x32_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].satd = PFX(pixel_satd_16x12_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].satd = PFX(pixel_satd_12x16_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].satd = PFX(pixel_satd_16x4_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].satd = PFX(pixel_satd_4x16_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].satd = PFX(pixel_satd_32x24_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].satd = PFX(pixel_satd_24x32_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].satd = PFX(pixel_satd_32x8_avx);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].satd = PFX(pixel_satd_8x32_avx);
        ASSIGN_SA8D(avx);
-        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sa8d = x265_pixel_sa8d_32x32_avx;
-        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sa8d = x265_pixel_sa8d_16x16_avx;
-        p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sa8d = x265_pixel_sa8d_8x8_avx;
-        p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].sa8d = x265_pixel_satd_4x4_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sa8d = PFX(pixel_sa8d_32x32_avx);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sa8d = PFX(pixel_sa8d_16x16_avx);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sa8d = PFX(pixel_sa8d_8x8_avx);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_4x4].sa8d = PFX(pixel_satd_4x4_avx);
        ASSIGN_SSE_PP(avx);
-        p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sse_pp = x265_pixel_ssd_8x8_avx;
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sse_pp = PFX(pixel_ssd_8x8_avx);
        ASSIGN_SSE_SS(avx);
        LUMA_VAR(avx);

-        p.pu[LUMA_12x16].sad_x3 = x265_pixel_sad_x3_12x16_avx;
-        p.pu[LUMA_16x4].sad_x3 = x265_pixel_sad_x3_16x4_avx;
+        p.pu[LUMA_12x16].sad_x3 = PFX(pixel_sad_x3_12x16_avx);
+        p.pu[LUMA_16x4].sad_x3 = PFX(pixel_sad_x3_16x4_avx);
        HEVC_SAD_X3(avx);
-        p.pu[LUMA_12x16].sad_x4 = x265_pixel_sad_x4_12x16_avx;
-        p.pu[LUMA_16x4].sad_x4 = x265_pixel_sad_x4_16x4_avx;
+        p.pu[LUMA_12x16].sad_x4 = PFX(pixel_sad_x4_12x16_avx);
+        p.pu[LUMA_16x4].sad_x4 = PFX(pixel_sad_x4_16x4_avx);
        HEVC_SAD_X4(avx);
-        p.ssim_4x4x2_core = x265_pixel_ssim_4x4x2_core_avx;
-        p.ssim_end_4 = x265_pixel_ssim_end4_avx;
+        p.ssim_4x4x2_core = PFX(pixel_ssim_4x4x2_core_avx);
+        p.ssim_end_4 = PFX(pixel_ssim_end4_avx);

-        p.cu[BLOCK_16x16].copy_ss = x265_blockcopy_ss_16x16_avx;
-        p.cu[BLOCK_32x32].copy_ss = x265_blockcopy_ss_32x32_avx;
-        p.cu[BLOCK_64x64].copy_ss = x265_blockcopy_ss_64x64_avx;
-        p.chroma[X265_CSP_I420].cu[CHROMA_420_16x16].copy_ss = x265_blockcopy_ss_16x16_avx;
-        p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ss = x265_blockcopy_ss_32x32_avx;
-        p.chroma[X265_CSP_I422].cu[CHROMA_422_16x32].copy_ss = x265_blockcopy_ss_16x32_avx;
-        p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ss = x265_blockcopy_ss_32x64_avx;
+        p.cu[BLOCK_16x16].copy_ss = PFX(blockcopy_ss_16x16_avx);
+        p.cu[BLOCK_32x32].copy_ss = PFX(blockcopy_ss_32x32_avx);
+        p.cu[BLOCK_64x64].copy_ss = PFX(blockcopy_ss_64x64_avx);
+        p.chroma[X265_CSP_I420].cu[CHROMA_420_16x16].copy_ss = PFX(blockcopy_ss_16x16_avx);
+        p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ss = PFX(blockcopy_ss_32x32_avx);
+        p.chroma[X265_CSP_I422].cu[CHROMA_422_16x32].copy_ss = PFX(blockcopy_ss_16x32_avx);
+        p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ss = PFX(blockcopy_ss_32x64_avx);

-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = x265_blockcopy_pp_32x8_avx;
-        p.pu[LUMA_32x8].copy_pp = x265_blockcopy_pp_32x8_avx;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].copy_pp = PFX(blockcopy_pp_32x8_avx);
+        p.pu[LUMA_32x8].copy_pp = PFX(blockcopy_pp_32x8_avx);

-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].copy_pp = x265_blockcopy_pp_32x16_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].copy_pp = x265_blockcopy_pp_32x16_avx;
-        p.pu[LUMA_32x16].copy_pp = x265_blockcopy_pp_32x16_avx;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].copy_pp = PFX(blockcopy_pp_32x16_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].copy_pp = PFX(blockcopy_pp_32x16_avx);
+        p.pu[LUMA_32x16].copy_pp = PFX(blockcopy_pp_32x16_avx);

-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].copy_pp = x265_blockcopy_pp_32x24_avx;
-        p.pu[LUMA_32x24].copy_pp = x265_blockcopy_pp_32x24_avx;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].copy_pp = PFX(blockcopy_pp_32x24_avx);
+        p.pu[LUMA_32x24].copy_pp = PFX(blockcopy_pp_32x24_avx);

-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].copy_pp = x265_blockcopy_pp_32x32_avx;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].copy_pp = x265_blockcopy_pp_32x32_avx;
-        p.pu[LUMA_32x32].copy_pp = x265_blockcopy_pp_32x32_avx;
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].copy_pp = PFX(blockcopy_pp_32x32_avx);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].copy_pp = PFX(blockcopy_pp_32x32_avx);
+        p.pu[LUMA_32x32].copy_pp = PFX(blockcopy_pp_32x32_avx);

-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].copy_pp = x265_blockcopy_pp_32x48_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].copy_pp = PFX(blockcopy_pp_32x48_avx);

-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].copy_pp = x265_blockcopy_pp_32x64_avx;
-        p.pu[LUMA_32x64].copy_pp = x265_blockcopy_pp_32x64_avx;
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].copy_pp = PFX(blockcopy_pp_32x64_avx);
+        p.pu[LUMA_32x64].copy_pp = PFX(blockcopy_pp_32x64_avx);

-        p.pu[LUMA_64x16].copy_pp = x265_blockcopy_pp_64x16_avx;
-        p.pu[LUMA_64x32].copy_pp = x265_blockcopy_pp_64x32_avx;
-        p.pu[LUMA_64x48].copy_pp = x265_blockcopy_pp_64x48_avx;
-        p.pu[LUMA_64x64].copy_pp = x265_blockcopy_pp_64x64_avx;
+        p.pu[LUMA_64x16].copy_pp = PFX(blockcopy_pp_64x16_avx);
+        p.pu[LUMA_64x32].copy_pp = PFX(blockcopy_pp_64x32_avx);
+        p.pu[LUMA_64x48].copy_pp = PFX(blockcopy_pp_64x48_avx);
+        p.pu[LUMA_64x64].copy_pp = PFX(blockcopy_pp_64x64_avx);

-        p.pu[LUMA_48x64].copy_pp = x265_blockcopy_pp_48x64_avx;
+        p.pu[LUMA_48x64].copy_pp = PFX(blockcopy_pp_48x64_avx);

-        p.frameInitLowres = x265_frame_init_lowres_core_avx;
+        p.frameInitLowres = PFX(frame_init_lowres_core_avx);
    }
    if (cpuMask & X265_CPU_XOP)
    {
-        //p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = x265_pixel_satd_4x4_xop; this one is broken
+        //p.pu[LUMA_4x4].satd = p.cu[BLOCK_4x4].sa8d = PFX(pixel_satd_4x4_xop); this one is broken
        ALL_LUMA_PU(satd, pixel_satd, xop);
        ASSIGN_SA8D(xop);
        LUMA_VAR(xop);
-        p.cu[BLOCK_8x8].sse_pp = x265_pixel_ssd_8x8_xop;
-        p.cu[BLOCK_16x16].sse_pp = x265_pixel_ssd_16x16_xop;
-        p.frameInitLowres = x265_frame_init_lowres_core_xop;
+        p.cu[BLOCK_8x8].sse_pp = PFX(pixel_ssd_8x8_xop);
+        p.cu[BLOCK_16x16].sse_pp = PFX(pixel_ssd_16x16_xop);
+        p.frameInitLowres = PFX(frame_init_lowres_core_xop);
    }
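The comment "// MUST be done after LUMA_FILTERS() to overwrite default version" seen earlier in this diff captures a subtlety of this registration style: a bulk-assignment macro installs a whole family of pointers, so a hand-tuned kernel for one size has to be assigned afterwards or the macro clobbers it. A small compilable sketch of that ordering hazard; all names here are invented, and the ALL_LUMA_PU macro below is a simplified stand-in for x265's real one:

#include <cstdio>

typedef void (*filter_t)(void);

static void filter_generic(void)  { std::puts("generic"); }
static void filter_8x8_fast(void) { std::puts("hand-tuned 8x8"); }

enum { LUMA_8x8, LUMA_16x16, NUM_PU };
static filter_t luma_hvpp[NUM_PU];

// Bulk assignment, loosely analogous to the LUMA_FILTERS()/ALL_LUMA_PU() macros.
#define ALL_LUMA_PU(table, fn) \
    for (int i = 0; i < NUM_PU; i++) table[i] = fn

int main()
{
    ALL_LUMA_PU(luma_hvpp, filter_generic);
    // MUST come after the bulk macro, or the generic version wins:
    luma_hvpp[LUMA_8x8] = filter_8x8_fast;

    luma_hvpp[LUMA_8x8]();    // "hand-tuned 8x8"
    luma_hvpp[LUMA_16x16]();  // "generic"
    return 0;
}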
#if X86_64
    if (cpuMask & X265_CPU_AVX2)
    {
-        p.planecopy_sp = x265_downShift_16_avx2;
+        p.cu[BLOCK_4x4].intra_filter = PFX(intra_filter_4x4_avx2);
-        p.cu[BLOCK_32x32].intra_pred[DC_IDX] = x265_intra_pred_dc32_avx2;
+        p.planecopy_sp = PFX(downShift_16_avx2);
-        p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = x265_intra_pred_planar16_avx2;
-        p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = x265_intra_pred_planar32_avx2;
+        p.cu[BLOCK_32x32].intra_pred[DC_IDX] = PFX(intra_pred_dc32_avx2);
-        p.idst4x4 = x265_idst4_avx2;
-        p.dst4x4 = x265_dst4_avx2;
-        p.scale2D_64to32 = x265_scale2D_64to32_avx2;
-        p.saoCuOrgE0 = x265_saoCuOrgE0_avx2;
-        p.saoCuOrgE1 = x265_saoCuOrgE1_avx2;
-        p.saoCuOrgE1_2Rows = x265_saoCuOrgE1_2Rows_avx2;
-        p.saoCuOrgE2[0] = x265_saoCuOrgE2_avx2;
-        p.saoCuOrgE2[1] = x265_saoCuOrgE2_32_avx2;
-        p.saoCuOrgE3[0] = x265_saoCuOrgE3_avx2;
-        p.saoCuOrgE3[1] = x265_saoCuOrgE3_32_avx2;
-        p.saoCuOrgB0 = x265_saoCuOrgB0_avx2;
-        p.sign = x265_calSign_avx2;
-
-        p.cu[BLOCK_4x4].psy_cost_ss = x265_psyCost_ss_4x4_avx2;
-        p.cu[BLOCK_8x8].psy_cost_ss = x265_psyCost_ss_8x8_avx2;
-        p.cu[BLOCK_16x16].psy_cost_ss = x265_psyCost_ss_16x16_avx2;
-        p.cu[BLOCK_32x32].psy_cost_ss = x265_psyCost_ss_32x32_avx2;
-        p.cu[BLOCK_64x64].psy_cost_ss = x265_psyCost_ss_64x64_avx2;
-
-        p.cu[BLOCK_4x4].psy_cost_pp = x265_psyCost_pp_4x4_avx2;
-        p.cu[BLOCK_8x8].psy_cost_pp = x265_psyCost_pp_8x8_avx2;
-        p.cu[BLOCK_16x16].psy_cost_pp = x265_psyCost_pp_16x16_avx2;
-        p.cu[BLOCK_32x32].psy_cost_pp = x265_psyCost_pp_32x32_avx2;
-        p.cu[BLOCK_64x64].psy_cost_pp = x265_psyCost_pp_64x64_avx2;
-
-        p.pu[LUMA_8x4].addAvg = x265_addAvg_8x4_avx2;
-        p.pu[LUMA_8x8].addAvg = x265_addAvg_8x8_avx2;
-        p.pu[LUMA_8x16].addAvg = x265_addAvg_8x16_avx2;
-        p.pu[LUMA_8x32].addAvg = x265_addAvg_8x32_avx2;
-
-        p.pu[LUMA_12x16].addAvg = x265_addAvg_12x16_avx2;
-
-        p.pu[LUMA_16x4].addAvg = x265_addAvg_16x4_avx2;
-        p.pu[LUMA_16x8].addAvg = x265_addAvg_16x8_avx2;
-        p.pu[LUMA_16x12].addAvg = x265_addAvg_16x12_avx2;
-        p.pu[LUMA_16x16].addAvg = x265_addAvg_16x16_avx2;
-        p.pu[LUMA_16x32].addAvg = x265_addAvg_16x32_avx2;
-        p.pu[LUMA_16x64].addAvg = x265_addAvg_16x64_avx2;
-
-        p.pu[LUMA_24x32].addAvg = x265_addAvg_24x32_avx2;
-
-        p.pu[LUMA_32x8].addAvg = x265_addAvg_32x8_avx2;
-        p.pu[LUMA_32x16].addAvg = x265_addAvg_32x16_avx2;
-        p.pu[LUMA_32x24].addAvg = x265_addAvg_32x24_avx2;
-        p.pu[LUMA_32x32].addAvg = x265_addAvg_32x32_avx2;
-        p.pu[LUMA_32x64].addAvg = x265_addAvg_32x64_avx2;
-
-        p.pu[LUMA_48x64].addAvg = x265_addAvg_48x64_avx2;
-
-        p.pu[LUMA_64x16].addAvg = x265_addAvg_64x16_avx2;
-        p.pu[LUMA_64x32].addAvg = x265_addAvg_64x32_avx2;
-        p.pu[LUMA_64x48].addAvg = x265_addAvg_64x48_avx2;
-        p.pu[LUMA_64x64].addAvg = x265_addAvg_64x64_avx2;
-
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].addAvg = x265_addAvg_8x2_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].addAvg = x265_addAvg_8x4_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].addAvg = x265_addAvg_8x6_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].addAvg = x265_addAvg_8x8_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].addAvg = x265_addAvg_8x16_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].addAvg = x265_addAvg_8x32_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].addAvg = x265_addAvg_12x16_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg = x265_addAvg_16x4_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg = x265_addAvg_16x8_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg = x265_addAvg_16x12_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg = x265_addAvg_16x16_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg = x265_addAvg_16x32_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg = x265_addAvg_32x8_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg = x265_addAvg_32x16_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg = x265_addAvg_32x24_avx2;
-        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg = x265_addAvg_32x32_avx2;
-
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].addAvg = x265_addAvg_8x4_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].addAvg = x265_addAvg_8x8_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].addAvg = x265_addAvg_8x12_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].addAvg = x265_addAvg_8x16_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].addAvg = x265_addAvg_8x32_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].addAvg = x265_addAvg_8x64_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].addAvg = x265_addAvg_12x32_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg = x265_addAvg_16x8_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg = x265_addAvg_16x16_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg = x265_addAvg_16x24_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg = x265_addAvg_16x32_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg = x265_addAvg_16x64_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].addAvg = x265_addAvg_24x64_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg = x265_addAvg_32x16_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg = x265_addAvg_32x32_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg = x265_addAvg_32x48_avx2;
-        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg = x265_addAvg_32x64_avx2;
-
-        p.cu[BLOCK_16x16].add_ps = x265_pixel_add_ps_16x16_avx2;
-        p.cu[BLOCK_32x32].add_ps = x265_pixel_add_ps_32x32_avx2;
-        p.cu[BLOCK_64x64].add_ps = x265_pixel_add_ps_64x64_avx2;
-        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = x265_pixel_add_ps_16x16_avx2;
-        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = x265_pixel_add_ps_32x32_avx2;
-        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = x265_pixel_add_ps_16x32_avx2;
-        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = x265_pixel_add_ps_32x64_avx2;
-
-        p.cu[BLOCK_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2;
-        p.cu[BLOCK_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2;
-        p.cu[BLOCK_64x64].sub_ps = x265_pixel_sub_ps_64x64_avx2;
-        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sub_ps = x265_pixel_sub_ps_16x16_avx2;
-        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = x265_pixel_sub_ps_32x32_avx2;
-        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sub_ps = x265_pixel_sub_ps_16x32_avx2;
-        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = x265_pixel_sub_ps_32x64_avx2;
-
-        p.pu[LUMA_16x4].pixelavg_pp = x265_pixel_avg_16x4_avx2;
-        p.pu[LUMA_16x8].pixelavg_pp = x265_pixel_avg_16x8_avx2;
-        p.pu[LUMA_16x12].pixelavg_pp = x265_pixel_avg_16x12_avx2;
-        p.pu[LUMA_16x16].pixelavg_pp = x265_pixel_avg_16x16_avx2;
-        p.pu[LUMA_16x32].pixelavg_pp = x265_pixel_avg_16x32_avx2;
-        p.pu[LUMA_16x64].pixelavg_pp = x265_pixel_avg_16x64_avx2;
-
-        p.pu[LUMA_32x64].pixelavg_pp = x265_pixel_avg_32x64_avx2;
-        p.pu[LUMA_32x32].pixelavg_pp = x265_pixel_avg_32x32_avx2;
-        p.pu[LUMA_32x24].pixelavg_pp = x265_pixel_avg_32x24_avx2;
-        p.pu[LUMA_32x16].pixelavg_pp = x265_pixel_avg_32x16_avx2;
-        p.pu[LUMA_32x8].pixelavg_pp = x265_pixel_avg_32x8_avx2;
-
-        p.pu[LUMA_64x64].pixelavg_pp = x265_pixel_avg_64x64_avx2;
-        p.pu[LUMA_64x48].pixelavg_pp = x265_pixel_avg_64x48_avx2;
-        p.pu[LUMA_64x32].pixelavg_pp = x265_pixel_avg_64x32_avx2;
-        p.pu[LUMA_64x16].pixelavg_pp = x265_pixel_avg_64x16_avx2;
-
-        p.pu[LUMA_16x16].satd = x265_pixel_satd_16x16_avx2;
-        p.pu[LUMA_16x8].satd = x265_pixel_satd_16x8_avx2;
-        p.pu[LUMA_8x16].satd = x265_pixel_satd_8x16_avx2;
-        p.pu[LUMA_8x8].satd = x265_pixel_satd_8x8_avx2;
-
-        p.pu[LUMA_16x4].satd = x265_pixel_satd_16x4_avx2;
-        p.pu[LUMA_16x12].satd = x265_pixel_satd_16x12_avx2;
-        p.pu[LUMA_16x32].satd = x265_pixel_satd_16x32_avx2;
-        p.pu[LUMA_16x64].satd = x265_pixel_satd_16x64_avx2;
-
-        p.pu[LUMA_32x8].satd = x265_pixel_satd_32x8_avx2;
-        p.pu[LUMA_32x16].satd = x265_pixel_satd_32x16_avx2;
-        p.pu[LUMA_32x24].satd = x265_pixel_satd_32x24_avx2;
-        p.pu[LUMA_32x32].satd = x265_pixel_satd_32x32_avx2;
-        p.pu[LUMA_32x64].satd = x265_pixel_satd_32x64_avx2;
-        p.pu[LUMA_48x64].satd = x265_pixel_satd_48x64_avx2;
-        p.pu[LUMA_64x16].satd = x265_pixel_satd_64x16_avx2;
-        p.pu[LUMA_64x32].satd = x265_pixel_satd_64x32_avx2;
-        p.pu[LUMA_64x48].satd = x265_pixel_satd_64x48_avx2;
-        p.pu[LUMA_64x64].satd = x265_pixel_satd_64x64_avx2;
-
-        p.pu[LUMA_32x8].sad = x265_pixel_sad_32x8_avx2;
-        p.pu[LUMA_32x16].sad = x265_pixel_sad_32x16_avx2;
-        p.pu[LUMA_32x24].sad = x265_pixel_sad_32x24_avx2;
-        p.pu[LUMA_32x32].sad = x265_pixel_sad_32x32_avx2;
-        p.pu[LUMA_32x64].sad = x265_pixel_sad_32x64_avx2;
-        p.pu[LUMA_48x64].sad = x265_pixel_sad_48x64_avx2;
-        p.pu[LUMA_64x16].sad = x265_pixel_sad_64x16_avx2;
-        p.pu[LUMA_64x32].sad = x265_pixel_sad_64x32_avx2;
-        p.pu[LUMA_64x48].sad = x265_pixel_sad_64x48_avx2;
-        p.pu[LUMA_64x64].sad = x265_pixel_sad_64x64_avx2;
-
-        p.pu[LUMA_8x4].sad_x3 = x265_pixel_sad_x3_8x4_avx2;
-        p.pu[LUMA_8x8].sad_x3 = x265_pixel_sad_x3_8x8_avx2;
-        p.pu[LUMA_8x16].sad_x3 = x265_pixel_sad_x3_8x16_avx2;
-
-        p.pu[LUMA_8x8].sad_x4 = x265_pixel_sad_x4_8x8_avx2;
-        p.pu[LUMA_16x8].sad_x4 = x265_pixel_sad_x4_16x8_avx2;
-        p.pu[LUMA_16x12].sad_x4 = x265_pixel_sad_x4_16x12_avx2;
-        p.pu[LUMA_16x16].sad_x4 = x265_pixel_sad_x4_16x16_avx2;
-        p.pu[LUMA_16x32].sad_x4 = x265_pixel_sad_x4_16x32_avx2;
-
-        p.cu[BLOCK_16x16].sse_pp = x265_pixel_ssd_16x16_avx2;
-        p.cu[BLOCK_32x32].sse_pp = x265_pixel_ssd_32x32_avx2;
-        p.cu[BLOCK_64x64].sse_pp = x265_pixel_ssd_64x64_avx2;
-        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sse_pp = x265_pixel_ssd_16x16_avx2;
-        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sse_pp = x265_pixel_ssd_32x32_avx2;
-
-        p.cu[BLOCK_16x16].ssd_s = x265_pixel_ssd_s_16_avx2;
-        p.cu[BLOCK_32x32].ssd_s = x265_pixel_ssd_s_32_avx2;
-
-        p.cu[BLOCK_8x8].copy_cnt = x265_copy_cnt_8_avx2;
-        p.cu[BLOCK_16x16].copy_cnt = x265_copy_cnt_16_avx2;
-        p.cu[BLOCK_32x32].copy_cnt = x265_copy_cnt_32_avx2;
+        p.cu[BLOCK_16x16].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar16_avx2);
+        p.cu[BLOCK_32x32].intra_pred[PLANAR_IDX] = PFX(intra_pred_planar32_avx2);
-        p.cu[BLOCK_16x16].blockfill_s = x265_blockfill_s_16x16_avx2;
-        p.cu[BLOCK_32x32].blockfill_s = x265_blockfill_s_32x32_avx2;
+        p.idst4x4 = PFX(idst4_avx2);
+        p.dst4x4 = PFX(dst4_avx2);
+        p.scale2D_64to32 = PFX(scale2D_64to32_avx2);
+        p.saoCuOrgE0 = PFX(saoCuOrgE0_avx2);
+        p.saoCuOrgE1 = PFX(saoCuOrgE1_avx2);
+        p.saoCuOrgE1_2Rows = PFX(saoCuOrgE1_2Rows_avx2);
+        p.saoCuOrgE2[0] = PFX(saoCuOrgE2_avx2);
+        p.saoCuOrgE2[1] = PFX(saoCuOrgE2_32_avx2);
+        p.saoCuOrgE3[0] = PFX(saoCuOrgE3_avx2);
+        p.saoCuOrgE3[1] = PFX(saoCuOrgE3_32_avx2);
+        p.saoCuOrgB0 = PFX(saoCuOrgB0_avx2);
+        p.sign = PFX(calSign_avx2);
+
+        p.cu[BLOCK_4x4].psy_cost_ss = PFX(psyCost_ss_4x4_avx2);
+        p.cu[BLOCK_8x8].psy_cost_ss = PFX(psyCost_ss_8x8_avx2);
+        p.cu[BLOCK_16x16].psy_cost_ss = PFX(psyCost_ss_16x16_avx2);
+        p.cu[BLOCK_32x32].psy_cost_ss = PFX(psyCost_ss_32x32_avx2);
+        p.cu[BLOCK_64x64].psy_cost_ss = PFX(psyCost_ss_64x64_avx2);
+
+        p.cu[BLOCK_4x4].psy_cost_pp = PFX(psyCost_pp_4x4_avx2);
+        p.cu[BLOCK_8x8].psy_cost_pp = PFX(psyCost_pp_8x8_avx2);
+        p.cu[BLOCK_16x16].psy_cost_pp = PFX(psyCost_pp_16x16_avx2);
+        p.cu[BLOCK_32x32].psy_cost_pp = PFX(psyCost_pp_32x32_avx2);
+        p.cu[BLOCK_64x64].psy_cost_pp = PFX(psyCost_pp_64x64_avx2);
+
+        p.pu[LUMA_8x4].addAvg = PFX(addAvg_8x4_avx2);
+        p.pu[LUMA_8x8].addAvg = PFX(addAvg_8x8_avx2);
+        p.pu[LUMA_8x16].addAvg = PFX(addAvg_8x16_avx2);
+        p.pu[LUMA_8x32].addAvg = PFX(addAvg_8x32_avx2);
+
+        p.pu[LUMA_12x16].addAvg = PFX(addAvg_12x16_avx2);
+
+        p.pu[LUMA_16x4].addAvg = PFX(addAvg_16x4_avx2);
+        p.pu[LUMA_16x8].addAvg = PFX(addAvg_16x8_avx2);
+        p.pu[LUMA_16x12].addAvg = PFX(addAvg_16x12_avx2);
+        p.pu[LUMA_16x16].addAvg = PFX(addAvg_16x16_avx2);
+        p.pu[LUMA_16x32].addAvg = PFX(addAvg_16x32_avx2);
+        p.pu[LUMA_16x64].addAvg = PFX(addAvg_16x64_avx2);
+
+        p.pu[LUMA_24x32].addAvg = PFX(addAvg_24x32_avx2);
+
+        p.pu[LUMA_32x8].addAvg = PFX(addAvg_32x8_avx2);
+        p.pu[LUMA_32x16].addAvg = PFX(addAvg_32x16_avx2);
+        p.pu[LUMA_32x24].addAvg = PFX(addAvg_32x24_avx2);
+        p.pu[LUMA_32x32].addAvg = PFX(addAvg_32x32_avx2);
+        p.pu[LUMA_32x64].addAvg = PFX(addAvg_32x64_avx2);
+
+        p.pu[LUMA_48x64].addAvg = PFX(addAvg_48x64_avx2);
+
+        p.pu[LUMA_64x16].addAvg = PFX(addAvg_64x16_avx2);
+        p.pu[LUMA_64x32].addAvg = PFX(addAvg_64x32_avx2);
+        p.pu[LUMA_64x48].addAvg = PFX(addAvg_64x48_avx2);
+        p.pu[LUMA_64x64].addAvg = PFX(addAvg_64x64_avx2);
+
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].addAvg = PFX(addAvg_8x2_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].addAvg = PFX(addAvg_8x4_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].addAvg = PFX(addAvg_8x6_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].addAvg = PFX(addAvg_8x8_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].addAvg = PFX(addAvg_8x16_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].addAvg = PFX(addAvg_8x32_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].addAvg = PFX(addAvg_12x16_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].addAvg = PFX(addAvg_16x4_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].addAvg = PFX(addAvg_16x8_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].addAvg = PFX(addAvg_16x12_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].addAvg = PFX(addAvg_16x16_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].addAvg = PFX(addAvg_16x32_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].addAvg = PFX(addAvg_32x8_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].addAvg = PFX(addAvg_32x16_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].addAvg = PFX(addAvg_32x24_avx2);
+        p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].addAvg = PFX(addAvg_32x32_avx2);
+
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].addAvg = PFX(addAvg_8x4_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].addAvg = PFX(addAvg_8x8_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].addAvg = PFX(addAvg_8x12_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].addAvg = PFX(addAvg_8x16_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].addAvg = PFX(addAvg_8x32_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].addAvg = PFX(addAvg_8x64_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].addAvg = PFX(addAvg_12x32_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].addAvg = PFX(addAvg_16x8_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].addAvg = PFX(addAvg_16x16_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].addAvg = PFX(addAvg_16x24_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].addAvg = PFX(addAvg_16x32_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].addAvg = PFX(addAvg_16x64_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].addAvg = PFX(addAvg_24x64_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].addAvg = PFX(addAvg_32x16_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].addAvg = PFX(addAvg_32x32_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].addAvg = PFX(addAvg_32x48_avx2);
+        p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].addAvg = PFX(addAvg_32x64_avx2);
+
+        p.cu[BLOCK_8x8].sa8d = PFX(pixel_sa8d_8x8_avx2);
+        p.cu[BLOCK_16x16].sa8d = PFX(pixel_sa8d_16x16_avx2);
+        p.cu[BLOCK_32x32].sa8d = PFX(pixel_sa8d_32x32_avx2);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_8x8].sa8d = PFX(pixel_sa8d_8x8_avx2);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sa8d = PFX(pixel_sa8d_16x16_avx2);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sa8d = PFX(pixel_sa8d_32x32_avx2);
+
+        p.cu[BLOCK_16x16].add_ps = PFX(pixel_add_ps_16x16_avx2);
+        p.cu[BLOCK_32x32].add_ps = PFX(pixel_add_ps_32x32_avx2);
+        p.cu[BLOCK_64x64].add_ps = PFX(pixel_add_ps_64x64_avx2);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].add_ps = PFX(pixel_add_ps_16x16_avx2);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].add_ps = PFX(pixel_add_ps_32x32_avx2);
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].add_ps = PFX(pixel_add_ps_16x32_avx2);
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].add_ps = PFX(pixel_add_ps_32x64_avx2);
+
+        p.cu[BLOCK_16x16].sub_ps = PFX(pixel_sub_ps_16x16_avx2);
+        p.cu[BLOCK_32x32].sub_ps = PFX(pixel_sub_ps_32x32_avx2);
+        p.cu[BLOCK_64x64].sub_ps = PFX(pixel_sub_ps_64x64_avx2);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sub_ps = PFX(pixel_sub_ps_16x16_avx2);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sub_ps = PFX(pixel_sub_ps_32x32_avx2);
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].sub_ps = PFX(pixel_sub_ps_16x32_avx2);
+        p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].sub_ps = PFX(pixel_sub_ps_32x64_avx2);
+
+        p.pu[LUMA_16x4].pixelavg_pp = PFX(pixel_avg_16x4_avx2);
+        p.pu[LUMA_16x8].pixelavg_pp = PFX(pixel_avg_16x8_avx2);
+        p.pu[LUMA_16x12].pixelavg_pp = PFX(pixel_avg_16x12_avx2);
+        p.pu[LUMA_16x16].pixelavg_pp = PFX(pixel_avg_16x16_avx2);
+        p.pu[LUMA_16x32].pixelavg_pp = PFX(pixel_avg_16x32_avx2);
+        p.pu[LUMA_16x64].pixelavg_pp = PFX(pixel_avg_16x64_avx2);
+
+        p.pu[LUMA_32x64].pixelavg_pp = PFX(pixel_avg_32x64_avx2);
+        p.pu[LUMA_32x32].pixelavg_pp = PFX(pixel_avg_32x32_avx2);
+        p.pu[LUMA_32x24].pixelavg_pp = PFX(pixel_avg_32x24_avx2);
+        p.pu[LUMA_32x16].pixelavg_pp = PFX(pixel_avg_32x16_avx2);
+        p.pu[LUMA_32x8].pixelavg_pp = PFX(pixel_avg_32x8_avx2);
+
+        p.pu[LUMA_64x64].pixelavg_pp = PFX(pixel_avg_64x64_avx2);
+        p.pu[LUMA_64x48].pixelavg_pp = PFX(pixel_avg_64x48_avx2);
+        p.pu[LUMA_64x32].pixelavg_pp = PFX(pixel_avg_64x32_avx2);
+        p.pu[LUMA_64x16].pixelavg_pp = PFX(pixel_avg_64x16_avx2);
+
+        p.pu[LUMA_16x16].satd = PFX(pixel_satd_16x16_avx2);
+        p.pu[LUMA_16x8].satd = PFX(pixel_satd_16x8_avx2);
+        p.pu[LUMA_8x16].satd = PFX(pixel_satd_8x16_avx2);
+        p.pu[LUMA_8x8].satd = PFX(pixel_satd_8x8_avx2);
+
+        p.pu[LUMA_16x4].satd = PFX(pixel_satd_16x4_avx2);
+        p.pu[LUMA_16x12].satd = PFX(pixel_satd_16x12_avx2);
+        p.pu[LUMA_16x32].satd = PFX(pixel_satd_16x32_avx2);
+        p.pu[LUMA_16x64].satd = PFX(pixel_satd_16x64_avx2);
+
+        p.pu[LUMA_32x8].satd = PFX(pixel_satd_32x8_avx2);
+        p.pu[LUMA_32x16].satd = PFX(pixel_satd_32x16_avx2);
+        p.pu[LUMA_32x24].satd = PFX(pixel_satd_32x24_avx2);
+        p.pu[LUMA_32x32].satd = PFX(pixel_satd_32x32_avx2);
+        p.pu[LUMA_32x64].satd = PFX(pixel_satd_32x64_avx2);
+        p.pu[LUMA_48x64].satd = PFX(pixel_satd_48x64_avx2);
+        p.pu[LUMA_64x16].satd = PFX(pixel_satd_64x16_avx2);
+        p.pu[LUMA_64x32].satd = PFX(pixel_satd_64x32_avx2);
+        p.pu[LUMA_64x48].satd = PFX(pixel_satd_64x48_avx2);
+        p.pu[LUMA_64x64].satd = PFX(pixel_satd_64x64_avx2);
+
+        p.pu[LUMA_32x8].sad = PFX(pixel_sad_32x8_avx2);
+        p.pu[LUMA_32x16].sad = PFX(pixel_sad_32x16_avx2);
+        p.pu[LUMA_32x24].sad = PFX(pixel_sad_32x24_avx2);
+        p.pu[LUMA_32x32].sad = PFX(pixel_sad_32x32_avx2);
+        p.pu[LUMA_32x64].sad = PFX(pixel_sad_32x64_avx2);
+        p.pu[LUMA_48x64].sad = PFX(pixel_sad_48x64_avx2);
+        p.pu[LUMA_64x16].sad = PFX(pixel_sad_64x16_avx2);
+        p.pu[LUMA_64x32].sad = PFX(pixel_sad_64x32_avx2);
+        p.pu[LUMA_64x48].sad = PFX(pixel_sad_64x48_avx2);
+        p.pu[LUMA_64x64].sad = PFX(pixel_sad_64x64_avx2);
+
+        p.pu[LUMA_8x4].sad_x3 = PFX(pixel_sad_x3_8x4_avx2);
+        p.pu[LUMA_8x8].sad_x3 = PFX(pixel_sad_x3_8x8_avx2);
+        p.pu[LUMA_8x16].sad_x3 = PFX(pixel_sad_x3_8x16_avx2);
+
+        p.pu[LUMA_8x8].sad_x4 = PFX(pixel_sad_x4_8x8_avx2);
+        p.pu[LUMA_16x8].sad_x4 = PFX(pixel_sad_x4_16x8_avx2);
+        p.pu[LUMA_16x12].sad_x4 = PFX(pixel_sad_x4_16x12_avx2);
+        p.pu[LUMA_16x16].sad_x4 = PFX(pixel_sad_x4_16x16_avx2);
+        p.pu[LUMA_16x32].sad_x4 = PFX(pixel_sad_x4_16x32_avx2);
+        p.pu[LUMA_32x32].sad_x4 = PFX(pixel_sad_x4_32x32_avx2);
+        p.pu[LUMA_32x16].sad_x4 = PFX(pixel_sad_x4_32x16_avx2);
+        p.pu[LUMA_32x64].sad_x4 = PFX(pixel_sad_x4_32x64_avx2);
+        p.pu[LUMA_32x24].sad_x4 = PFX(pixel_sad_x4_32x24_avx2);
+        p.pu[LUMA_32x8].sad_x4 = PFX(pixel_sad_x4_32x8_avx2);
+
+        p.cu[BLOCK_16x16].sse_pp = PFX(pixel_ssd_16x16_avx2);
+        p.cu[BLOCK_32x32].sse_pp = PFX(pixel_ssd_32x32_avx2);
+        p.cu[BLOCK_64x64].sse_pp = PFX(pixel_ssd_64x64_avx2);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].sse_pp = PFX(pixel_ssd_16x16_avx2);
+        p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].sse_pp = PFX(pixel_ssd_32x32_avx2);
+
+        p.cu[BLOCK_16x16].ssd_s = PFX(pixel_ssd_s_16_avx2);
+        p.cu[BLOCK_32x32].ssd_s = PFX(pixel_ssd_s_32_avx2);
+
+        p.cu[BLOCK_8x8].copy_cnt = PFX(copy_cnt_8_avx2);
+        p.cu[BLOCK_16x16].copy_cnt = PFX(copy_cnt_16_avx2);
+        p.cu[BLOCK_32x32].copy_cnt = PFX(copy_cnt_32_avx2);
+
+        p.cu[BLOCK_16x16].blockfill_s = PFX(blockfill_s_16x16_avx2);
+        p.cu[BLOCK_32x32].blockfill_s = PFX(blockfill_s_32x32_avx2);

        ALL_LUMA_TU_S(cpy1Dto2D_shl, cpy1Dto2D_shl_, avx2);
        ALL_LUMA_TU_S(cpy1Dto2D_shr, cpy1Dto2D_shr_, avx2);
-        p.cu[BLOCK_8x8].cpy2Dto1D_shl = x265_cpy2Dto1D_shl_8_avx2;
-        p.cu[BLOCK_16x16].cpy2Dto1D_shl = x265_cpy2Dto1D_shl_16_avx2;
-        p.cu[BLOCK_32x32].cpy2Dto1D_shl = x265_cpy2Dto1D_shl_32_avx2;
-
-        p.cu[BLOCK_8x8].cpy2Dto1D_shr = x265_cpy2Dto1D_shr_8_avx2;
-        p.cu[BLOCK_16x16].cpy2Dto1D_shr = x265_cpy2Dto1D_shr_16_avx2;
-        p.cu[BLOCK_32x32].cpy2Dto1D_shr = x265_cpy2Dto1D_shr_32_avx2;
+        p.cu[BLOCK_8x8].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_8_avx2);
+        p.cu[BLOCK_16x16].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_16_avx2);
+        p.cu[BLOCK_32x32].cpy2Dto1D_shl = PFX(cpy2Dto1D_shl_32_avx2);
+
+        p.cu[BLOCK_8x8].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_8_avx2);
+        p.cu[BLOCK_16x16].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_16_avx2);
+        p.cu[BLOCK_32x32].cpy2Dto1D_shr = PFX(cpy2Dto1D_shr_32_avx2);

        ALL_LUMA_TU(count_nonzero, count_nonzero, avx2);
-        p.denoiseDct = x265_denoise_dct_avx2;
-        p.quant = x265_quant_avx2;
-        p.nquant = x265_nquant_avx2;
-        p.dequant_normal = x265_dequant_normal_avx2;
-
-        p.cu[BLOCK_16x16].calcresidual = x265_getResidual16_avx2;
-        p.cu[BLOCK_32x32].calcresidual = x265_getResidual32_avx2;
-
-        p.scale1D_128to64 = x265_scale1D_128to64_avx2;
-        p.weight_pp = x265_weight_pp_avx2;
-        p.weight_sp = x265_weight_sp_avx2;
+        p.denoiseDct = PFX(denoise_dct_avx2);
+        p.quant = PFX(quant_avx2);
+        p.nquant = PFX(nquant_avx2);
+        p.dequant_normal = PFX(dequant_normal_avx2);
+        p.dequant_scaling = PFX(dequant_scaling_avx2);
+
+        p.cu[BLOCK_16x16].calcresidual = PFX(getResidual16_avx2);
+        p.cu[BLOCK_32x32].calcresidual = PFX(getResidual32_avx2);
+
+        p.scale1D_128to64 = PFX(scale1D_128to64_avx2);
+        p.weight_pp = PFX(weight_pp_avx2);
+        p.weight_sp = PFX(weight_sp_avx2);

        // intra_pred functions
-        p.cu[BLOCK_4x4].intra_pred[3] = x265_intra_pred_ang4_3_avx2;
-        p.cu[BLOCK_4x4].intra_pred[4] = x265_intra_pred_ang4_4_avx2;
-        p.cu[BLOCK_4x4].intra_pred[5] = x265_intra_pred_ang4_5_avx2;
-        p.cu[BLOCK_4x4].intra_pred[6] = x265_intra_pred_ang4_6_avx2;
-        p.cu[BLOCK_4x4].intra_pred[7] = x265_intra_pred_ang4_7_avx2;
-        p.cu[BLOCK_4x4].intra_pred[8] = x265_intra_pred_ang4_8_avx2;
-        p.cu[BLOCK_4x4].intra_pred[9] = x265_intra_pred_ang4_9_avx2;
-        p.cu[BLOCK_4x4].intra_pred[11] = x265_intra_pred_ang4_11_avx2;
-        p.cu[BLOCK_4x4].intra_pred[12] = x265_intra_pred_ang4_12_avx2;
-        p.cu[BLOCK_4x4].intra_pred[13] = x265_intra_pred_ang4_13_avx2;
-        p.cu[BLOCK_4x4].intra_pred[14] = x265_intra_pred_ang4_14_avx2;
-        p.cu[BLOCK_4x4].intra_pred[15] = x265_intra_pred_ang4_15_avx2;
-        p.cu[BLOCK_4x4].intra_pred[16] = x265_intra_pred_ang4_16_avx2;
-        p.cu[BLOCK_4x4].intra_pred[17] = x265_intra_pred_ang4_17_avx2;
-        p.cu[BLOCK_4x4].intra_pred[19] = x265_intra_pred_ang4_19_avx2;
-        p.cu[BLOCK_4x4].intra_pred[20] = x265_intra_pred_ang4_20_avx2;
-        p.cu[BLOCK_4x4].intra_pred[21] = x265_intra_pred_ang4_21_avx2;
-        p.cu[BLOCK_4x4].intra_pred[22] = x265_intra_pred_ang4_22_avx2;
-        p.cu[BLOCK_4x4].intra_pred[23] = x265_intra_pred_ang4_23_avx2;
-        p.cu[BLOCK_4x4].intra_pred[24] = x265_intra_pred_ang4_24_avx2;
-        p.cu[BLOCK_4x4].intra_pred[25] = x265_intra_pred_ang4_25_avx2;
-        p.cu[BLOCK_4x4].intra_pred[27] = x265_intra_pred_ang4_27_avx2;
-        p.cu[BLOCK_4x4].intra_pred[28] = x265_intra_pred_ang4_28_avx2;
-        p.cu[BLOCK_4x4].intra_pred[29] = x265_intra_pred_ang4_29_avx2;
-        p.cu[BLOCK_4x4].intra_pred[30] = x265_intra_pred_ang4_30_avx2;
-        p.cu[BLOCK_4x4].intra_pred[31] = x265_intra_pred_ang4_31_avx2;
-        p.cu[BLOCK_4x4].intra_pred[32] = x265_intra_pred_ang4_32_avx2;
-        p.cu[BLOCK_4x4].intra_pred[33] = x265_intra_pred_ang4_33_avx2;
-        p.cu[BLOCK_8x8].intra_pred[3] = x265_intra_pred_ang8_3_avx2;
-        p.cu[BLOCK_8x8].intra_pred[33] = x265_intra_pred_ang8_33_avx2;
-        p.cu[BLOCK_8x8].intra_pred[4] = x265_intra_pred_ang8_4_avx2;
-        p.cu[BLOCK_8x8].intra_pred[32] = x265_intra_pred_ang8_32_avx2;
-        p.cu[BLOCK_8x8].intra_pred[5] = x265_intra_pred_ang8_5_avx2;
-        p.cu[BLOCK_8x8].intra_pred[31] = x265_intra_pred_ang8_31_avx2;
-        p.cu[BLOCK_8x8].intra_pred[30] = x265_intra_pred_ang8_30_avx2;
-        p.cu[BLOCK_8x8].intra_pred[6] = x265_intra_pred_ang8_6_avx2;
-        p.cu[BLOCK_8x8].intra_pred[7] = x265_intra_pred_ang8_7_avx2;
-        p.cu[BLOCK_8x8].intra_pred[29] = x265_intra_pred_ang8_29_avx2;
-        p.cu[BLOCK_8x8].intra_pred[8] = x265_intra_pred_ang8_8_avx2;
-        p.cu[BLOCK_8x8].intra_pred[28] = x265_intra_pred_ang8_28_avx2;
-        p.cu[BLOCK_8x8].intra_pred[9] = x265_intra_pred_ang8_9_avx2;
-        p.cu[BLOCK_8x8].intra_pred[27] = x265_intra_pred_ang8_27_avx2;
-        p.cu[BLOCK_8x8].intra_pred[25] = x265_intra_pred_ang8_25_avx2;
-        p.cu[BLOCK_8x8].intra_pred[12] = x265_intra_pred_ang8_12_avx2;
-        p.cu[BLOCK_8x8].intra_pred[24] = x265_intra_pred_ang8_24_avx2;
-        p.cu[BLOCK_8x8].intra_pred[11] = x265_intra_pred_ang8_11_avx2;
-        p.cu[BLOCK_8x8].intra_pred[13] = x265_intra_pred_ang8_13_avx2;
-        p.cu[BLOCK_8x8].intra_pred[20] = x265_intra_pred_ang8_20_avx2;
-        p.cu[BLOCK_8x8].intra_pred[21] = x265_intra_pred_ang8_21_avx2;
-        p.cu[BLOCK_8x8].intra_pred[22] = x265_intra_pred_ang8_22_avx2;
-        p.cu[BLOCK_8x8].intra_pred[23] = x265_intra_pred_ang8_23_avx2;
-        p.cu[BLOCK_8x8].intra_pred[14] = x265_intra_pred_ang8_14_avx2;
-        p.cu[BLOCK_8x8].intra_pred[15] = x265_intra_pred_ang8_15_avx2;
-        p.cu[BLOCK_8x8].intra_pred[16] = x265_intra_pred_ang8_16_avx2;
-        p.cu[BLOCK_16x16].intra_pred[3] = x265_intra_pred_ang16_3_avx2;
-        p.cu[BLOCK_16x16].intra_pred[4] = x265_intra_pred_ang16_4_avx2;
-        p.cu[BLOCK_16x16].intra_pred[5] = x265_intra_pred_ang16_5_avx2;
-        p.cu[BLOCK_16x16].intra_pred[6] = x265_intra_pred_ang16_6_avx2;
-        p.cu[BLOCK_16x16].intra_pred[7] = x265_intra_pred_ang16_7_avx2;
-        p.cu[BLOCK_16x16].intra_pred[8] = x265_intra_pred_ang16_8_avx2;
-        p.cu[BLOCK_16x16].intra_pred[9] = x265_intra_pred_ang16_9_avx2;
-        p.cu[BLOCK_16x16].intra_pred[12] = x265_intra_pred_ang16_12_avx2;
-        p.cu[BLOCK_16x16].intra_pred[11] = x265_intra_pred_ang16_11_avx2;
-        p.cu[BLOCK_16x16].intra_pred[13] = x265_intra_pred_ang16_13_avx2;
-        p.cu[BLOCK_16x16].intra_pred[25] = x265_intra_pred_ang16_25_avx2;
-        p.cu[BLOCK_16x16].intra_pred[28] = x265_intra_pred_ang16_28_avx2;
-        p.cu[BLOCK_16x16].intra_pred[27] = x265_intra_pred_ang16_27_avx2;
-        p.cu[BLOCK_16x16].intra_pred[29] = x265_intra_pred_ang16_29_avx2;
-        p.cu[BLOCK_16x16].intra_pred[30] = x265_intra_pred_ang16_30_avx2;
-        p.cu[BLOCK_16x16].intra_pred[31] = x265_intra_pred_ang16_31_avx2;
-        p.cu[BLOCK_16x16].intra_pred[32] = x265_intra_pred_ang16_32_avx2;
-        p.cu[BLOCK_16x16].intra_pred[33] = x265_intra_pred_ang16_33_avx2;
-        p.cu[BLOCK_16x16].intra_pred[24] = x265_intra_pred_ang16_24_avx2;
-        p.cu[BLOCK_16x16].intra_pred[23] = x265_intra_pred_ang16_23_avx2;
-        p.cu[BLOCK_16x16].intra_pred[22] = x265_intra_pred_ang16_22_avx2;
-        p.cu[BLOCK_32x32].intra_pred[34] = x265_intra_pred_ang32_34_avx2;
-        p.cu[BLOCK_32x32].intra_pred[2] = x265_intra_pred_ang32_2_avx2;
-        p.cu[BLOCK_32x32].intra_pred[26] = x265_intra_pred_ang32_26_avx2;
-        p.cu[BLOCK_32x32].intra_pred[27] = x265_intra_pred_ang32_27_avx2;
-        p.cu[BLOCK_32x32].intra_pred[28] = x265_intra_pred_ang32_28_avx2;
-        p.cu[BLOCK_32x32].intra_pred[29] = x265_intra_pred_ang32_29_avx2;
-        p.cu[BLOCK_32x32].intra_pred[30] = x265_intra_pred_ang32_30_avx2;
-        p.cu[BLOCK_32x32].intra_pred[31] = x265_intra_pred_ang32_31_avx2;
-        p.cu[BLOCK_32x32].intra_pred[32] = x265_intra_pred_ang32_32_avx2;
-        p.cu[BLOCK_32x32].intra_pred[33] = x265_intra_pred_ang32_33_avx2;
-        p.cu[BLOCK_32x32].intra_pred[25] = x265_intra_pred_ang32_25_avx2;
-        p.cu[BLOCK_32x32].intra_pred[24] = x265_intra_pred_ang32_24_avx2;
-        p.cu[BLOCK_32x32].intra_pred[23] = x265_intra_pred_ang32_23_avx2;
-        p.cu[BLOCK_32x32].intra_pred[22] = x265_intra_pred_ang32_22_avx2;
-        p.cu[BLOCK_32x32].intra_pred[21] = x265_intra_pred_ang32_21_avx2;
-        p.cu[BLOCK_32x32].intra_pred[18] = x265_intra_pred_ang32_18_avx2;
+        p.cu[BLOCK_4x4].intra_pred[3] = PFX(intra_pred_ang4_3_avx2);
+        p.cu[BLOCK_4x4].intra_pred[4] = PFX(intra_pred_ang4_4_avx2);
+        p.cu[BLOCK_4x4].intra_pred[5] = PFX(intra_pred_ang4_5_avx2);
+        p.cu[BLOCK_4x4].intra_pred[6] = PFX(intra_pred_ang4_6_avx2);
+        p.cu[BLOCK_4x4].intra_pred[7] = PFX(intra_pred_ang4_7_avx2);
+        p.cu[BLOCK_4x4].intra_pred[8] = PFX(intra_pred_ang4_8_avx2);
+        p.cu[BLOCK_4x4].intra_pred[9] = PFX(intra_pred_ang4_9_avx2);
+        p.cu[BLOCK_4x4].intra_pred[11] = PFX(intra_pred_ang4_11_avx2);
+        p.cu[BLOCK_4x4].intra_pred[12] = PFX(intra_pred_ang4_12_avx2);
+        p.cu[BLOCK_4x4].intra_pred[13] = PFX(intra_pred_ang4_13_avx2);
+        p.cu[BLOCK_4x4].intra_pred[14] = PFX(intra_pred_ang4_14_avx2);
+        p.cu[BLOCK_4x4].intra_pred[15] = PFX(intra_pred_ang4_15_avx2);
+        p.cu[BLOCK_4x4].intra_pred[16] = PFX(intra_pred_ang4_16_avx2);
+        p.cu[BLOCK_4x4].intra_pred[17] = PFX(intra_pred_ang4_17_avx2);
+        p.cu[BLOCK_4x4].intra_pred[19] = PFX(intra_pred_ang4_19_avx2);
+        p.cu[BLOCK_4x4].intra_pred[20] = PFX(intra_pred_ang4_20_avx2);
+        p.cu[BLOCK_4x4].intra_pred[21] = PFX(intra_pred_ang4_21_avx2);
+        p.cu[BLOCK_4x4].intra_pred[22] = PFX(intra_pred_ang4_22_avx2);
+        p.cu[BLOCK_4x4].intra_pred[23] = PFX(intra_pred_ang4_23_avx2);
+        p.cu[BLOCK_4x4].intra_pred[24] = PFX(intra_pred_ang4_24_avx2);
+        p.cu[BLOCK_4x4].intra_pred[25] = PFX(intra_pred_ang4_25_avx2);
+        p.cu[BLOCK_4x4].intra_pred[27] = PFX(intra_pred_ang4_27_avx2);
+        p.cu[BLOCK_4x4].intra_pred[28] = PFX(intra_pred_ang4_28_avx2);
+        p.cu[BLOCK_4x4].intra_pred[29] = PFX(intra_pred_ang4_29_avx2);
+        p.cu[BLOCK_4x4].intra_pred[30] = PFX(intra_pred_ang4_30_avx2);
+        p.cu[BLOCK_4x4].intra_pred[31] = PFX(intra_pred_ang4_31_avx2);
+        p.cu[BLOCK_4x4].intra_pred[32] = PFX(intra_pred_ang4_32_avx2);
+        p.cu[BLOCK_4x4].intra_pred[33] = PFX(intra_pred_ang4_33_avx2);
+        p.cu[BLOCK_8x8].intra_pred[3] = PFX(intra_pred_ang8_3_avx2);
+        p.cu[BLOCK_8x8].intra_pred[33] = PFX(intra_pred_ang8_33_avx2);
+        p.cu[BLOCK_8x8].intra_pred[4] = PFX(intra_pred_ang8_4_avx2);
+        p.cu[BLOCK_8x8].intra_pred[32] = PFX(intra_pred_ang8_32_avx2);
+        p.cu[BLOCK_8x8].intra_pred[5] = PFX(intra_pred_ang8_5_avx2);
+        p.cu[BLOCK_8x8].intra_pred[31] = PFX(intra_pred_ang8_31_avx2);
+        p.cu[BLOCK_8x8].intra_pred[30] = PFX(intra_pred_ang8_30_avx2);
+        p.cu[BLOCK_8x8].intra_pred[6] = PFX(intra_pred_ang8_6_avx2);
+        p.cu[BLOCK_8x8].intra_pred[7] = PFX(intra_pred_ang8_7_avx2);
+        p.cu[BLOCK_8x8].intra_pred[29] = PFX(intra_pred_ang8_29_avx2);
+        p.cu[BLOCK_8x8].intra_pred[8] = PFX(intra_pred_ang8_8_avx2);
+        p.cu[BLOCK_8x8].intra_pred[28] = PFX(intra_pred_ang8_28_avx2);
+        p.cu[BLOCK_8x8].intra_pred[9] = PFX(intra_pred_ang8_9_avx2);
+        p.cu[BLOCK_8x8].intra_pred[27] = PFX(intra_pred_ang8_27_avx2);
+        p.cu[BLOCK_8x8].intra_pred[25] = PFX(intra_pred_ang8_25_avx2);
+        p.cu[BLOCK_8x8].intra_pred[12] = PFX(intra_pred_ang8_12_avx2);
+        p.cu[BLOCK_8x8].intra_pred[24] = PFX(intra_pred_ang8_24_avx2);
+        p.cu[BLOCK_8x8].intra_pred[11] = PFX(intra_pred_ang8_11_avx2);
+        p.cu[BLOCK_8x8].intra_pred[13] = PFX(intra_pred_ang8_13_avx2);
+        p.cu[BLOCK_8x8].intra_pred[20] = PFX(intra_pred_ang8_20_avx2);
+        p.cu[BLOCK_8x8].intra_pred[21] = PFX(intra_pred_ang8_21_avx2);
+        p.cu[BLOCK_8x8].intra_pred[22] = PFX(intra_pred_ang8_22_avx2);
+        p.cu[BLOCK_8x8].intra_pred[23] = PFX(intra_pred_ang8_23_avx2);
+        p.cu[BLOCK_8x8].intra_pred[14] = PFX(intra_pred_ang8_14_avx2);
+        p.cu[BLOCK_8x8].intra_pred[15] = PFX(intra_pred_ang8_15_avx2);
+        p.cu[BLOCK_8x8].intra_pred[16] = PFX(intra_pred_ang8_16_avx2);
+        p.cu[BLOCK_16x16].intra_pred[3] = PFX(intra_pred_ang16_3_avx2);
+        p.cu[BLOCK_16x16].intra_pred[4] = PFX(intra_pred_ang16_4_avx2);
+        p.cu[BLOCK_16x16].intra_pred[5] = PFX(intra_pred_ang16_5_avx2);
+        p.cu[BLOCK_16x16].intra_pred[6] = PFX(intra_pred_ang16_6_avx2);
+        p.cu[BLOCK_16x16].intra_pred[7] = PFX(intra_pred_ang16_7_avx2);
+        p.cu[BLOCK_16x16].intra_pred[8] = PFX(intra_pred_ang16_8_avx2);
PFX(intra_pred_ang16_8_avx2); + p.cu[BLOCK_16x16].intra_pred[9] = PFX(intra_pred_ang16_9_avx2); + p.cu[BLOCK_16x16].intra_pred[12] = PFX(intra_pred_ang16_12_avx2); + p.cu[BLOCK_16x16].intra_pred[11] = PFX(intra_pred_ang16_11_avx2); + p.cu[BLOCK_16x16].intra_pred[13] = PFX(intra_pred_ang16_13_avx2); + p.cu[BLOCK_16x16].intra_pred[25] = PFX(intra_pred_ang16_25_avx2); + p.cu[BLOCK_16x16].intra_pred[28] = PFX(intra_pred_ang16_28_avx2); + p.cu[BLOCK_16x16].intra_pred[27] = PFX(intra_pred_ang16_27_avx2); + p.cu[BLOCK_16x16].intra_pred[29] = PFX(intra_pred_ang16_29_avx2); + p.cu[BLOCK_16x16].intra_pred[30] = PFX(intra_pred_ang16_30_avx2); + p.cu[BLOCK_16x16].intra_pred[31] = PFX(intra_pred_ang16_31_avx2); + p.cu[BLOCK_16x16].intra_pred[32] = PFX(intra_pred_ang16_32_avx2); + p.cu[BLOCK_16x16].intra_pred[33] = PFX(intra_pred_ang16_33_avx2); + p.cu[BLOCK_16x16].intra_pred[24] = PFX(intra_pred_ang16_24_avx2); + p.cu[BLOCK_16x16].intra_pred[23] = PFX(intra_pred_ang16_23_avx2); + p.cu[BLOCK_16x16].intra_pred[22] = PFX(intra_pred_ang16_22_avx2); + p.cu[BLOCK_32x32].intra_pred[34] = PFX(intra_pred_ang32_34_avx2); + p.cu[BLOCK_32x32].intra_pred[2] = PFX(intra_pred_ang32_2_avx2); + p.cu[BLOCK_32x32].intra_pred[26] = PFX(intra_pred_ang32_26_avx2); + p.cu[BLOCK_32x32].intra_pred[27] = PFX(intra_pred_ang32_27_avx2); + p.cu[BLOCK_32x32].intra_pred[28] = PFX(intra_pred_ang32_28_avx2); + p.cu[BLOCK_32x32].intra_pred[29] = PFX(intra_pred_ang32_29_avx2); + p.cu[BLOCK_32x32].intra_pred[30] = PFX(intra_pred_ang32_30_avx2); + p.cu[BLOCK_32x32].intra_pred[31] = PFX(intra_pred_ang32_31_avx2); + p.cu[BLOCK_32x32].intra_pred[32] = PFX(intra_pred_ang32_32_avx2); + p.cu[BLOCK_32x32].intra_pred[33] = PFX(intra_pred_ang32_33_avx2); + p.cu[BLOCK_32x32].intra_pred[25] = PFX(intra_pred_ang32_25_avx2); + p.cu[BLOCK_32x32].intra_pred[24] = PFX(intra_pred_ang32_24_avx2); + p.cu[BLOCK_32x32].intra_pred[23] = PFX(intra_pred_ang32_23_avx2); + p.cu[BLOCK_32x32].intra_pred[22] = PFX(intra_pred_ang32_22_avx2); + p.cu[BLOCK_32x32].intra_pred[21] = PFX(intra_pred_ang32_21_avx2); + p.cu[BLOCK_32x32].intra_pred[18] = PFX(intra_pred_ang32_18_avx2); + p.cu[BLOCK_32x32].intra_pred[3] = PFX(intra_pred_ang32_3_avx2); // all_angs primitives - p.cu[BLOCK_4x4].intra_pred_allangs = x265_all_angs_pred_4x4_avx2; + p.cu[BLOCK_4x4].intra_pred_allangs = PFX(all_angs_pred_4x4_avx2); // copy_sp primitives - p.cu[BLOCK_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2; - p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_sp = x265_blockcopy_sp_16x16_avx2; - p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_sp = x265_blockcopy_sp_16x32_avx2; - - p.cu[BLOCK_32x32].copy_sp = x265_blockcopy_sp_32x32_avx2; - p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_sp = x265_blockcopy_sp_32x32_avx2; - p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_sp = x265_blockcopy_sp_32x64_avx2; + p.cu[BLOCK_16x16].copy_sp = PFX(blockcopy_sp_16x16_avx2); + p.chroma[X265_CSP_I420].cu[BLOCK_420_16x16].copy_sp = PFX(blockcopy_sp_16x16_avx2); + p.chroma[X265_CSP_I422].cu[BLOCK_422_16x32].copy_sp = PFX(blockcopy_sp_16x32_avx2); + + p.cu[BLOCK_32x32].copy_sp = PFX(blockcopy_sp_32x32_avx2); + p.chroma[X265_CSP_I420].cu[BLOCK_420_32x32].copy_sp = PFX(blockcopy_sp_32x32_avx2); + p.chroma[X265_CSP_I422].cu[BLOCK_422_32x64].copy_sp = PFX(blockcopy_sp_32x64_avx2); - p.cu[BLOCK_64x64].copy_sp = x265_blockcopy_sp_64x64_avx2; + p.cu[BLOCK_64x64].copy_sp = PFX(blockcopy_sp_64x64_avx2); // copy_ps primitives - p.cu[BLOCK_16x16].copy_ps = x265_blockcopy_ps_16x16_avx2; - 
p.chroma[X265_CSP_I420].cu[CHROMA_420_16x16].copy_ps = x265_blockcopy_ps_16x16_avx2; - p.chroma[X265_CSP_I422].cu[CHROMA_422_16x32].copy_ps = x265_blockcopy_ps_16x32_avx2; - - p.cu[BLOCK_32x32].copy_ps = x265_blockcopy_ps_32x32_avx2; - p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ps = x265_blockcopy_ps_32x32_avx2; - p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ps = x265_blockcopy_ps_32x64_avx2; + p.cu[BLOCK_16x16].copy_ps = PFX(blockcopy_ps_16x16_avx2); + p.chroma[X265_CSP_I420].cu[CHROMA_420_16x16].copy_ps = PFX(blockcopy_ps_16x16_avx2); + p.chroma[X265_CSP_I422].cu[CHROMA_422_16x32].copy_ps = PFX(blockcopy_ps_16x32_avx2); + + p.cu[BLOCK_32x32].copy_ps = PFX(blockcopy_ps_32x32_avx2); + p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ps = PFX(blockcopy_ps_32x32_avx2); + p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ps = PFX(blockcopy_ps_32x64_avx2); - p.cu[BLOCK_64x64].copy_ps = x265_blockcopy_ps_64x64_avx2; + p.cu[BLOCK_64x64].copy_ps = PFX(blockcopy_ps_64x64_avx2); ALL_LUMA_TU_S(dct, dct, avx2); ALL_LUMA_TU_S(idct, idct, avx2); @@ -2122,577 +3032,604 @@ ALL_LUMA_PU(luma_vps, interp_8tap_vert_ps, avx2); ALL_LUMA_PU(luma_vsp, interp_8tap_vert_sp, avx2); ALL_LUMA_PU(luma_vss, interp_8tap_vert_ss, avx2); + p.pu[LUMA_4x4].luma_vsp = PFX(interp_8tap_vert_sp_4x4_avx2); // missing 4x8, 4x16, 24x32, 12x16 for the fill set of luma PU - p.pu[LUMA_4x4].luma_hpp = x265_interp_8tap_horiz_pp_4x4_avx2; - p.pu[LUMA_4x8].luma_hpp = x265_interp_8tap_horiz_pp_4x8_avx2; - p.pu[LUMA_4x16].luma_hpp = x265_interp_8tap_horiz_pp_4x16_avx2; - p.pu[LUMA_8x4].luma_hpp = x265_interp_8tap_horiz_pp_8x4_avx2; - p.pu[LUMA_8x8].luma_hpp = x265_interp_8tap_horiz_pp_8x8_avx2; - p.pu[LUMA_8x16].luma_hpp = x265_interp_8tap_horiz_pp_8x16_avx2; - p.pu[LUMA_8x32].luma_hpp = x265_interp_8tap_horiz_pp_8x32_avx2; - p.pu[LUMA_16x4].luma_hpp = x265_interp_8tap_horiz_pp_16x4_avx2; - p.pu[LUMA_16x8].luma_hpp = x265_interp_8tap_horiz_pp_16x8_avx2; - p.pu[LUMA_16x12].luma_hpp = x265_interp_8tap_horiz_pp_16x12_avx2; - p.pu[LUMA_16x16].luma_hpp = x265_interp_8tap_horiz_pp_16x16_avx2; - p.pu[LUMA_16x32].luma_hpp = x265_interp_8tap_horiz_pp_16x32_avx2; - p.pu[LUMA_16x64].luma_hpp = x265_interp_8tap_horiz_pp_16x64_avx2; - p.pu[LUMA_32x8].luma_hpp = x265_interp_8tap_horiz_pp_32x8_avx2; - p.pu[LUMA_32x16].luma_hpp = x265_interp_8tap_horiz_pp_32x16_avx2; - p.pu[LUMA_32x24].luma_hpp = x265_interp_8tap_horiz_pp_32x24_avx2; - p.pu[LUMA_32x32].luma_hpp = x265_interp_8tap_horiz_pp_32x32_avx2; - p.pu[LUMA_32x64].luma_hpp = x265_interp_8tap_horiz_pp_32x64_avx2; - p.pu[LUMA_64x64].luma_hpp = x265_interp_8tap_horiz_pp_64x64_avx2; - p.pu[LUMA_64x48].luma_hpp = x265_interp_8tap_horiz_pp_64x48_avx2; - p.pu[LUMA_64x32].luma_hpp = x265_interp_8tap_horiz_pp_64x32_avx2; - p.pu[LUMA_64x16].luma_hpp = x265_interp_8tap_horiz_pp_64x16_avx2; - p.pu[LUMA_48x64].luma_hpp = x265_interp_8tap_horiz_pp_48x64_avx2; - p.pu[LUMA_24x32].luma_hpp = x265_interp_8tap_horiz_pp_24x32_avx2; - p.pu[LUMA_12x16].luma_hpp = x265_interp_8tap_horiz_pp_12x16_avx2; - - p.pu[LUMA_4x4].luma_hps = x265_interp_8tap_horiz_ps_4x4_avx2; - p.pu[LUMA_4x8].luma_hps = x265_interp_8tap_horiz_ps_4x8_avx2; - p.pu[LUMA_4x16].luma_hps = x265_interp_8tap_horiz_ps_4x16_avx2; - p.pu[LUMA_8x4].luma_hps = x265_interp_8tap_horiz_ps_8x4_avx2; - p.pu[LUMA_8x8].luma_hps = x265_interp_8tap_horiz_ps_8x8_avx2; - p.pu[LUMA_8x16].luma_hps = x265_interp_8tap_horiz_ps_8x16_avx2; - p.pu[LUMA_8x32].luma_hps = x265_interp_8tap_horiz_ps_8x32_avx2; - p.pu[LUMA_16x8].luma_hps = 
x265_interp_8tap_horiz_ps_16x8_avx2; - p.pu[LUMA_16x16].luma_hps = x265_interp_8tap_horiz_ps_16x16_avx2; - p.pu[LUMA_16x12].luma_hps = x265_interp_8tap_horiz_ps_16x12_avx2; - p.pu[LUMA_16x4].luma_hps = x265_interp_8tap_horiz_ps_16x4_avx2; - p.pu[LUMA_16x32].luma_hps = x265_interp_8tap_horiz_ps_16x32_avx2; - p.pu[LUMA_16x64].luma_hps = x265_interp_8tap_horiz_ps_16x64_avx2; - - p.pu[LUMA_32x32].luma_hps = x265_interp_8tap_horiz_ps_32x32_avx2; - p.pu[LUMA_32x16].luma_hps = x265_interp_8tap_horiz_ps_32x16_avx2; - p.pu[LUMA_32x24].luma_hps = x265_interp_8tap_horiz_ps_32x24_avx2; - p.pu[LUMA_32x8].luma_hps = x265_interp_8tap_horiz_ps_32x8_avx2; - p.pu[LUMA_32x64].luma_hps = x265_interp_8tap_horiz_ps_32x64_avx2; - p.pu[LUMA_48x64].luma_hps = x265_interp_8tap_horiz_ps_48x64_avx2; - p.pu[LUMA_64x64].luma_hps = x265_interp_8tap_horiz_ps_64x64_avx2; - p.pu[LUMA_64x48].luma_hps = x265_interp_8tap_horiz_ps_64x48_avx2; - p.pu[LUMA_64x32].luma_hps = x265_interp_8tap_horiz_ps_64x32_avx2; - p.pu[LUMA_64x16].luma_hps = x265_interp_8tap_horiz_ps_64x16_avx2; - p.pu[LUMA_12x16].luma_hps = x265_interp_8tap_horiz_ps_12x16_avx2; - p.pu[LUMA_24x32].luma_hps = x265_interp_8tap_horiz_ps_24x32_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hpp = x265_interp_4tap_horiz_pp_8x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_hpp = x265_interp_4tap_horiz_pp_4x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hpp = x265_interp_4tap_horiz_pp_32x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hpp = x265_interp_4tap_horiz_pp_16x16_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_hpp = x265_interp_4tap_horiz_pp_2x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_hpp = x265_interp_4tap_horiz_pp_2x8_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_hpp = x265_interp_4tap_horiz_pp_4x2_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_hpp = x265_interp_4tap_horiz_pp_4x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_hpp = x265_interp_4tap_horiz_pp_4x16_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hpp = x265_interp_4tap_horiz_pp_16x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hpp = x265_interp_4tap_horiz_pp_16x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_hpp = x265_interp_4tap_horiz_pp_16x12_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hpp = x265_interp_4tap_horiz_pp_16x32_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_hpp = x265_interp_4tap_horiz_pp_6x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_hpp = x265_interp_4tap_horiz_pp_6x16_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hpp = x265_interp_4tap_horiz_pp_32x24_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hpp = x265_interp_4tap_horiz_pp_32x8_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_hpp = x265_interp_4tap_horiz_pp_8x2_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_hpp = x265_interp_4tap_horiz_pp_8x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_hpp = x265_interp_4tap_horiz_pp_8x6_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_hpp = x265_interp_4tap_horiz_pp_8x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_hpp = x265_interp_4tap_horiz_pp_8x32_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_hpp = x265_interp_4tap_horiz_pp_12x16_avx2; - - 
p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hps = x265_interp_4tap_horiz_ps_32x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hps = x265_interp_4tap_horiz_ps_16x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_hps = x265_interp_4tap_horiz_ps_4x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hps = x265_interp_4tap_horiz_ps_8x8_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_hps = x265_interp_4tap_horiz_ps_4x2_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_hps = x265_interp_4tap_horiz_ps_4x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_hps = x265_interp_4tap_horiz_ps_4x16_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_hps = x265_interp_4tap_horiz_ps_8x2_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_hps = x265_interp_4tap_horiz_ps_8x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_hps = x265_interp_4tap_horiz_ps_8x6_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_hps = x265_interp_4tap_horiz_ps_8x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_hps = x265_interp_4tap_horiz_ps_8x16_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hps = x265_interp_4tap_horiz_ps_16x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_hps = x265_interp_4tap_horiz_ps_16x12_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hps = x265_interp_4tap_horiz_ps_16x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hps = x265_interp_4tap_horiz_ps_16x4_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hps = x265_interp_4tap_horiz_ps_24x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hps = x265_interp_4tap_horiz_ps_32x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hps = x265_interp_4tap_horiz_ps_32x24_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hps = x265_interp_4tap_horiz_ps_32x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_hps = x265_interp_4tap_horiz_ps_2x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_hps = x265_interp_4tap_horiz_ps_2x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_hps = x265_interp_4tap_horiz_ps_6x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hpp = x265_interp_4tap_horiz_pp_24x32_avx2; - - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vpp = x265_interp_4tap_vert_pp_2x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vpp = x265_interp_4tap_vert_pp_4x2_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vpp = x265_interp_4tap_vert_pp_6x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vpp = x265_interp_4tap_vert_pp_8x2_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vpp = x265_interp_4tap_vert_pp_8x6_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_avx2; - 
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_vpp = x265_interp_4tap_vert_pp_12x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vpp = x265_interp_4tap_vert_pp_16x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vpp = x265_interp_4tap_vert_pp_16x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vpp = x265_interp_4tap_vert_pp_16x12_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vpp = x265_interp_4tap_vert_pp_16x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vpp = x265_interp_4tap_vert_pp_16x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vpp = x265_interp_4tap_vert_pp_24x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vpp = x265_interp_4tap_vert_pp_32x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vpp = x265_interp_4tap_vert_pp_32x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vpp = x265_interp_4tap_vert_pp_32x24_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vpp = x265_interp_4tap_vert_pp_32x32_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vps = x265_interp_4tap_vert_ps_2x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vps = x265_interp_4tap_vert_ps_2x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vps = x265_interp_4tap_vert_ps_4x2_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vps = x265_interp_4tap_vert_ps_6x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vps = x265_interp_4tap_vert_ps_8x2_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vps = x265_interp_4tap_vert_ps_8x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vps = x265_interp_4tap_vert_ps_8x6_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vps = x265_interp_4tap_vert_ps_8x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vps = x265_interp_4tap_vert_ps_8x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vps = x265_interp_4tap_vert_ps_8x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_vps = x265_interp_4tap_vert_ps_12x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vps = x265_interp_4tap_vert_ps_16x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vps = x265_interp_4tap_vert_ps_16x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vps = x265_interp_4tap_vert_ps_16x12_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vps = x265_interp_4tap_vert_ps_4x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vps = x265_interp_4tap_vert_ps_16x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vps = x265_interp_4tap_vert_ps_16x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vps = x265_interp_4tap_vert_ps_24x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vps = x265_interp_4tap_vert_ps_32x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vps = x265_interp_4tap_vert_ps_32x24_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vps = x265_interp_4tap_vert_ps_32x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vps = x265_interp_4tap_vert_ps_32x8_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vsp = x265_interp_4tap_vert_sp_4x4_avx2; - 
p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vsp = x265_interp_4tap_vert_sp_8x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vsp = x265_interp_4tap_vert_sp_16x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vsp = x265_interp_4tap_vert_sp_32x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vsp = x265_interp_4tap_vert_sp_2x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vsp = x265_interp_4tap_vert_sp_2x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vsp = x265_interp_4tap_vert_sp_4x2_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vsp = x265_interp_4tap_vert_sp_4x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vsp = x265_interp_4tap_vert_sp_4x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vsp = x265_interp_4tap_vert_sp_6x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vsp = x265_interp_4tap_vert_sp_8x2_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vsp = x265_interp_4tap_vert_sp_8x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vsp = x265_interp_4tap_vert_sp_8x6_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vsp = x265_interp_4tap_vert_sp_8x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vsp = x265_interp_4tap_vert_sp_8x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_vsp = x265_interp_4tap_vert_sp_12x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vsp = x265_interp_4tap_vert_sp_16x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vsp = x265_interp_4tap_vert_sp_16x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vsp = x265_interp_4tap_vert_sp_16x12_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vsp = x265_interp_4tap_vert_sp_16x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vsp = x265_interp_4tap_vert_sp_24x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vsp = x265_interp_4tap_vert_sp_32x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vsp = x265_interp_4tap_vert_sp_32x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vsp = x265_interp_4tap_vert_sp_32x24_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vss = x265_interp_4tap_vert_ss_4x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vss = x265_interp_4tap_vert_ss_8x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vss = x265_interp_4tap_vert_ss_16x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vss = x265_interp_4tap_vert_ss_32x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vss = x265_interp_4tap_vert_ss_2x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vss = x265_interp_4tap_vert_ss_2x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vss = x265_interp_4tap_vert_ss_4x2_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vss = x265_interp_4tap_vert_ss_4x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vss = x265_interp_4tap_vert_ss_4x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vss = x265_interp_4tap_vert_ss_6x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vss = x265_interp_4tap_vert_ss_8x2_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vss = x265_interp_4tap_vert_ss_8x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vss = x265_interp_4tap_vert_ss_8x6_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vss = x265_interp_4tap_vert_ss_8x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vss 
= x265_interp_4tap_vert_ss_8x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_vss = x265_interp_4tap_vert_ss_12x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vss = x265_interp_4tap_vert_ss_16x4_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vss = x265_interp_4tap_vert_ss_16x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vss = x265_interp_4tap_vert_ss_16x12_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vss = x265_interp_4tap_vert_ss_16x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vss = x265_interp_4tap_vert_ss_24x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vss = x265_interp_4tap_vert_ss_32x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vss = x265_interp_4tap_vert_ss_32x24_avx2; + p.pu[LUMA_4x4].luma_hpp = PFX(interp_8tap_horiz_pp_4x4_avx2); + p.pu[LUMA_4x8].luma_hpp = PFX(interp_8tap_horiz_pp_4x8_avx2); + p.pu[LUMA_4x16].luma_hpp = PFX(interp_8tap_horiz_pp_4x16_avx2); + p.pu[LUMA_8x4].luma_hpp = PFX(interp_8tap_horiz_pp_8x4_avx2); + p.pu[LUMA_8x8].luma_hpp = PFX(interp_8tap_horiz_pp_8x8_avx2); + p.pu[LUMA_8x16].luma_hpp = PFX(interp_8tap_horiz_pp_8x16_avx2); + p.pu[LUMA_8x32].luma_hpp = PFX(interp_8tap_horiz_pp_8x32_avx2); + p.pu[LUMA_16x4].luma_hpp = PFX(interp_8tap_horiz_pp_16x4_avx2); + p.pu[LUMA_16x8].luma_hpp = PFX(interp_8tap_horiz_pp_16x8_avx2); + p.pu[LUMA_16x12].luma_hpp = PFX(interp_8tap_horiz_pp_16x12_avx2); + p.pu[LUMA_16x16].luma_hpp = PFX(interp_8tap_horiz_pp_16x16_avx2); + p.pu[LUMA_16x32].luma_hpp = PFX(interp_8tap_horiz_pp_16x32_avx2); + p.pu[LUMA_16x64].luma_hpp = PFX(interp_8tap_horiz_pp_16x64_avx2); + p.pu[LUMA_32x8].luma_hpp = PFX(interp_8tap_horiz_pp_32x8_avx2); + p.pu[LUMA_32x16].luma_hpp = PFX(interp_8tap_horiz_pp_32x16_avx2); + p.pu[LUMA_32x24].luma_hpp = PFX(interp_8tap_horiz_pp_32x24_avx2); + p.pu[LUMA_32x32].luma_hpp = PFX(interp_8tap_horiz_pp_32x32_avx2); + p.pu[LUMA_32x64].luma_hpp = PFX(interp_8tap_horiz_pp_32x64_avx2); + p.pu[LUMA_64x64].luma_hpp = PFX(interp_8tap_horiz_pp_64x64_avx2); + p.pu[LUMA_64x48].luma_hpp = PFX(interp_8tap_horiz_pp_64x48_avx2); + p.pu[LUMA_64x32].luma_hpp = PFX(interp_8tap_horiz_pp_64x32_avx2); + p.pu[LUMA_64x16].luma_hpp = PFX(interp_8tap_horiz_pp_64x16_avx2); + p.pu[LUMA_48x64].luma_hpp = PFX(interp_8tap_horiz_pp_48x64_avx2); + p.pu[LUMA_24x32].luma_hpp = PFX(interp_8tap_horiz_pp_24x32_avx2); + p.pu[LUMA_12x16].luma_hpp = PFX(interp_8tap_horiz_pp_12x16_avx2); + + p.pu[LUMA_4x4].luma_hps = PFX(interp_8tap_horiz_ps_4x4_avx2); + p.pu[LUMA_4x8].luma_hps = PFX(interp_8tap_horiz_ps_4x8_avx2); + p.pu[LUMA_4x16].luma_hps = PFX(interp_8tap_horiz_ps_4x16_avx2); + p.pu[LUMA_8x4].luma_hps = PFX(interp_8tap_horiz_ps_8x4_avx2); + p.pu[LUMA_8x8].luma_hps = PFX(interp_8tap_horiz_ps_8x8_avx2); + p.pu[LUMA_8x16].luma_hps = PFX(interp_8tap_horiz_ps_8x16_avx2); + p.pu[LUMA_8x32].luma_hps = PFX(interp_8tap_horiz_ps_8x32_avx2); + p.pu[LUMA_16x8].luma_hps = PFX(interp_8tap_horiz_ps_16x8_avx2); + p.pu[LUMA_16x16].luma_hps = PFX(interp_8tap_horiz_ps_16x16_avx2); + p.pu[LUMA_16x12].luma_hps = PFX(interp_8tap_horiz_ps_16x12_avx2); + p.pu[LUMA_16x4].luma_hps = PFX(interp_8tap_horiz_ps_16x4_avx2); + p.pu[LUMA_16x32].luma_hps = PFX(interp_8tap_horiz_ps_16x32_avx2); + p.pu[LUMA_16x64].luma_hps = PFX(interp_8tap_horiz_ps_16x64_avx2); + + p.pu[LUMA_32x32].luma_hps = PFX(interp_8tap_horiz_ps_32x32_avx2); + p.pu[LUMA_32x16].luma_hps = 
PFX(interp_8tap_horiz_ps_32x16_avx2); + p.pu[LUMA_32x24].luma_hps = PFX(interp_8tap_horiz_ps_32x24_avx2); + p.pu[LUMA_32x8].luma_hps = PFX(interp_8tap_horiz_ps_32x8_avx2); + p.pu[LUMA_32x64].luma_hps = PFX(interp_8tap_horiz_ps_32x64_avx2); + p.pu[LUMA_48x64].luma_hps = PFX(interp_8tap_horiz_ps_48x64_avx2); + p.pu[LUMA_64x64].luma_hps = PFX(interp_8tap_horiz_ps_64x64_avx2); + p.pu[LUMA_64x48].luma_hps = PFX(interp_8tap_horiz_ps_64x48_avx2); + p.pu[LUMA_64x32].luma_hps = PFX(interp_8tap_horiz_ps_64x32_avx2); + p.pu[LUMA_64x16].luma_hps = PFX(interp_8tap_horiz_ps_64x16_avx2); + p.pu[LUMA_12x16].luma_hps = PFX(interp_8tap_horiz_ps_12x16_avx2); + p.pu[LUMA_24x32].luma_hps = PFX(interp_8tap_horiz_ps_24x32_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hpp = PFX(interp_4tap_horiz_pp_8x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_hpp = PFX(interp_4tap_horiz_pp_4x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_hpp = PFX(interp_4tap_horiz_pp_2x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_hpp = PFX(interp_4tap_horiz_pp_2x8_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_hpp = PFX(interp_4tap_horiz_pp_4x2_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_hpp = PFX(interp_4tap_horiz_pp_4x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_hpp = PFX(interp_4tap_horiz_pp_4x16_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hpp = PFX(interp_4tap_horiz_pp_16x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_hpp = PFX(interp_4tap_horiz_pp_16x12_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_hpp = PFX(interp_4tap_horiz_pp_6x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_hpp = PFX(interp_4tap_horiz_pp_6x16_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hpp = PFX(interp_4tap_horiz_pp_32x24_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hpp = PFX(interp_4tap_horiz_pp_32x8_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_hpp = PFX(interp_4tap_horiz_pp_8x2_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_hpp = PFX(interp_4tap_horiz_pp_8x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_hpp = PFX(interp_4tap_horiz_pp_8x6_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_hpp = PFX(interp_4tap_horiz_pp_8x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_hpp = PFX(interp_4tap_horiz_pp_8x32_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_hpp = PFX(interp_4tap_horiz_pp_12x16_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_hps = PFX(interp_4tap_horiz_ps_4x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_hps = PFX(interp_4tap_horiz_ps_8x8_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_hps = PFX(interp_4tap_horiz_ps_4x2_avx2); + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_hps = PFX(interp_4tap_horiz_ps_4x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_hps = PFX(interp_4tap_horiz_ps_4x16_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_hps = PFX(interp_4tap_horiz_ps_8x2_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_hps = PFX(interp_4tap_horiz_ps_8x6_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_hps = PFX(interp_4tap_horiz_ps_8x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_hps = PFX(interp_4tap_horiz_ps_16x12_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_hps = PFX(interp_4tap_horiz_ps_16x4_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hps = PFX(interp_4tap_horiz_ps_24x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_hps = PFX(interp_4tap_horiz_ps_32x24_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_hps = PFX(interp_4tap_horiz_ps_32x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_hps = PFX(interp_4tap_horiz_ps_2x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_hps = PFX(interp_4tap_horiz_ps_2x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_hps = PFX(interp_4tap_horiz_ps_6x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_hpp = PFX(interp_4tap_horiz_pp_24x32_avx2); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = PFX(interp_4tap_vert_pp_4x4_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vpp = PFX(interp_4tap_vert_pp_4x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vpp = PFX(interp_4tap_vert_pp_4x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vpp = PFX(interp_4tap_vert_pp_8x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vpp = PFX(interp_4tap_vert_pp_2x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vpp = PFX(interp_4tap_vert_pp_2x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vpp = PFX(interp_4tap_vert_pp_4x2_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vpp = PFX(interp_4tap_vert_pp_4x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vpp = PFX(interp_4tap_vert_pp_6x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vpp = PFX(interp_4tap_vert_pp_8x2_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vpp = PFX(interp_4tap_vert_pp_8x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vpp = PFX(interp_4tap_vert_pp_8x6_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vpp = PFX(interp_4tap_vert_pp_8x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vpp = PFX(interp_4tap_vert_pp_8x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_vpp = PFX(interp_4tap_vert_pp_12x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vpp = PFX(interp_4tap_vert_pp_16x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vpp = PFX(interp_4tap_vert_pp_16x12_avx2); + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vpp = PFX(interp_4tap_vert_pp_24x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vpp = PFX(interp_4tap_vert_pp_32x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vpp = PFX(interp_4tap_vert_pp_32x24_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vps = PFX(interp_4tap_vert_ps_2x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vps = PFX(interp_4tap_vert_ps_2x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vps = PFX(interp_4tap_vert_ps_4x2_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vps = PFX(interp_4tap_vert_ps_4x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vps = PFX(interp_4tap_vert_ps_4x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vps = PFX(interp_4tap_vert_ps_6x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vps = PFX(interp_4tap_vert_ps_8x2_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vps = PFX(interp_4tap_vert_ps_8x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vps = PFX(interp_4tap_vert_ps_8x6_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vps = PFX(interp_4tap_vert_ps_8x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vps = PFX(interp_4tap_vert_ps_8x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vps = PFX(interp_4tap_vert_ps_8x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_vps = PFX(interp_4tap_vert_ps_12x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vps = PFX(interp_4tap_vert_ps_16x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vps = PFX(interp_4tap_vert_ps_16x12_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vps = PFX(interp_4tap_vert_ps_4x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vps = PFX(interp_4tap_vert_ps_24x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vps = PFX(interp_4tap_vert_ps_32x24_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vps = PFX(interp_4tap_vert_ps_32x8_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vsp = PFX(interp_4tap_vert_sp_4x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vsp = PFX(interp_4tap_vert_sp_8x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vsp = PFX(interp_4tap_vert_sp_2x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vsp = PFX(interp_4tap_vert_sp_2x8_avx2); + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vsp = PFX(interp_4tap_vert_sp_4x2_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vsp = PFX(interp_4tap_vert_sp_4x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vsp = PFX(interp_4tap_vert_sp_4x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vsp = PFX(interp_4tap_vert_sp_6x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vsp = PFX(interp_4tap_vert_sp_8x2_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vsp = PFX(interp_4tap_vert_sp_8x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vsp = PFX(interp_4tap_vert_sp_8x6_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vsp = PFX(interp_4tap_vert_sp_8x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vsp = PFX(interp_4tap_vert_sp_8x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_vsp = PFX(interp_4tap_vert_sp_12x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vsp = PFX(interp_4tap_vert_sp_16x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vsp = PFX(interp_4tap_vert_sp_16x12_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vsp = PFX(interp_4tap_vert_sp_24x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vsp = PFX(interp_4tap_vert_sp_32x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vsp = PFX(interp_4tap_vert_sp_32x24_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x4].filter_vss = PFX(interp_4tap_vert_ss_4x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x8].filter_vss = PFX(interp_4tap_vert_ss_8x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x16].filter_vss = PFX(interp_4tap_vert_ss_16x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].filter_vss = PFX(interp_4tap_vert_ss_32x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x4].filter_vss = PFX(interp_4tap_vert_ss_2x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_2x8].filter_vss = PFX(interp_4tap_vert_ss_2x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x2].filter_vss = PFX(interp_4tap_vert_ss_4x2_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x8].filter_vss = PFX(interp_4tap_vert_ss_4x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_4x16].filter_vss = PFX(interp_4tap_vert_ss_4x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_6x8].filter_vss = PFX(interp_4tap_vert_ss_6x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x2].filter_vss = PFX(interp_4tap_vert_ss_8x2_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x4].filter_vss = PFX(interp_4tap_vert_ss_8x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x6].filter_vss = PFX(interp_4tap_vert_ss_8x6_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x16].filter_vss = PFX(interp_4tap_vert_ss_8x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_8x32].filter_vss = PFX(interp_4tap_vert_ss_8x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_12x16].filter_vss = PFX(interp_4tap_vert_ss_12x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x4].filter_vss = PFX(interp_4tap_vert_ss_16x4_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x8].filter_vss = PFX(interp_4tap_vert_ss_16x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_16x12].filter_vss = PFX(interp_4tap_vert_ss_16x12_avx2); + 
p.chroma[X265_CSP_I420].pu[CHROMA_420_16x32].filter_vss = PFX(interp_4tap_vert_ss_16x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].filter_vss = PFX(interp_4tap_vert_ss_24x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].filter_vss = PFX(interp_4tap_vert_ss_32x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].filter_vss = PFX(interp_4tap_vert_ss_32x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].filter_vss = PFX(interp_4tap_vert_ss_32x24_avx2); //i422 for chroma_vss - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vss = x265_interp_4tap_vert_ss_4x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vss = x265_interp_4tap_vert_ss_8x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vss = x265_interp_4tap_vert_ss_16x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vss = x265_interp_4tap_vert_ss_4x4_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vss = x265_interp_4tap_vert_ss_2x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vss = x265_interp_4tap_vert_ss_8x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vss = x265_interp_4tap_vert_ss_4x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vss = x265_interp_4tap_vert_ss_16x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vss = x265_interp_4tap_vert_ss_8x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vss = x265_interp_4tap_vert_ss_32x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vss = x265_interp_4tap_vert_ss_8x4_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vss = x265_interp_4tap_vert_ss_32x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vss = x265_interp_4tap_vert_ss_16x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vss = x265_interp_4tap_vert_ss_24x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vss = x265_interp_4tap_vert_ss_8x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vss = x265_interp_4tap_vert_ss_32x48_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vss = x265_interp_4tap_vert_ss_8x12_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vss = x265_interp_4tap_vert_ss_6x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vss = x265_interp_4tap_vert_ss_2x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vss = x265_interp_4tap_vert_ss_16x24_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vss = x265_interp_4tap_vert_ss_12x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vss = x265_interp_4tap_vert_ss_4x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vss = x265_interp_4tap_vert_ss_2x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vss = PFX(interp_4tap_vert_ss_4x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vss = PFX(interp_4tap_vert_ss_8x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vss = PFX(interp_4tap_vert_ss_16x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vss = PFX(interp_4tap_vert_ss_4x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vss = PFX(interp_4tap_vert_ss_2x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vss = PFX(interp_4tap_vert_ss_8x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vss = PFX(interp_4tap_vert_ss_4x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vss = PFX(interp_4tap_vert_ss_16x16_avx2); + 
p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vss = PFX(interp_4tap_vert_ss_8x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vss = PFX(interp_4tap_vert_ss_32x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vss = PFX(interp_4tap_vert_ss_8x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vss = PFX(interp_4tap_vert_ss_32x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vss = PFX(interp_4tap_vert_ss_32x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vss = PFX(interp_4tap_vert_ss_16x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vss = PFX(interp_4tap_vert_ss_24x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vss = PFX(interp_4tap_vert_ss_8x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vss = PFX(interp_4tap_vert_ss_32x48_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vss = PFX(interp_4tap_vert_ss_8x12_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vss = PFX(interp_4tap_vert_ss_6x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vss = PFX(interp_4tap_vert_ss_2x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vss = PFX(interp_4tap_vert_ss_16x24_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vss = PFX(interp_4tap_vert_ss_12x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vss = PFX(interp_4tap_vert_ss_4x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vss = PFX(interp_4tap_vert_ss_2x4_avx2); //i444 for chroma_vss - p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vss = x265_interp_4tap_vert_ss_4x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vss = x265_interp_4tap_vert_ss_8x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vss = x265_interp_4tap_vert_ss_16x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vss = x265_interp_4tap_vert_ss_32x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vss = x265_interp_4tap_vert_ss_64x64_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vss = x265_interp_4tap_vert_ss_8x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vss = x265_interp_4tap_vert_ss_4x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vss = x265_interp_4tap_vert_ss_16x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vss = x265_interp_4tap_vert_ss_8x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vss = x265_interp_4tap_vert_ss_32x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vss = x265_interp_4tap_vert_ss_16x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vss = x265_interp_4tap_vert_ss_16x12_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vss = x265_interp_4tap_vert_ss_12x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vss = x265_interp_4tap_vert_ss_16x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vss = x265_interp_4tap_vert_ss_4x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vss = x265_interp_4tap_vert_ss_32x24_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vss = x265_interp_4tap_vert_ss_24x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vss = x265_interp_4tap_vert_ss_32x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vss = x265_interp_4tap_vert_ss_8x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vss = x265_interp_4tap_vert_ss_64x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vss = x265_interp_4tap_vert_ss_32x64_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vss = x265_interp_4tap_vert_ss_64x48_avx2; - 
p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vss = x265_interp_4tap_vert_ss_48x64_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vss = x265_interp_4tap_vert_ss_64x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vss = x265_interp_4tap_vert_ss_16x64_avx2; - - p.pu[LUMA_16x16].luma_hvpp = x265_interp_8tap_hv_pp_16x16_avx2; - - p.pu[LUMA_32x8].convert_p2s = x265_filterPixelToShort_32x8_avx2; - p.pu[LUMA_32x16].convert_p2s = x265_filterPixelToShort_32x16_avx2; - p.pu[LUMA_32x24].convert_p2s = x265_filterPixelToShort_32x24_avx2; - p.pu[LUMA_32x32].convert_p2s = x265_filterPixelToShort_32x32_avx2; - p.pu[LUMA_32x64].convert_p2s = x265_filterPixelToShort_32x64_avx2; - p.pu[LUMA_64x16].convert_p2s = x265_filterPixelToShort_64x16_avx2; - p.pu[LUMA_64x32].convert_p2s = x265_filterPixelToShort_64x32_avx2; - p.pu[LUMA_64x48].convert_p2s = x265_filterPixelToShort_64x48_avx2; - p.pu[LUMA_64x64].convert_p2s = x265_filterPixelToShort_64x64_avx2; - p.pu[LUMA_48x64].convert_p2s = x265_filterPixelToShort_48x64_avx2; - p.pu[LUMA_24x32].convert_p2s = x265_filterPixelToShort_24x32_avx2; - - p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = x265_filterPixelToShort_24x32_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = x265_filterPixelToShort_32x8_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = x265_filterPixelToShort_32x16_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = x265_filterPixelToShort_32x24_avx2; - p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = x265_filterPixelToShort_32x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = x265_filterPixelToShort_24x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = x265_filterPixelToShort_32x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = x265_filterPixelToShort_32x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = x265_filterPixelToShort_32x48_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = x265_filterPixelToShort_32x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vss = PFX(interp_4tap_vert_ss_4x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vss = PFX(interp_4tap_vert_ss_8x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vss = PFX(interp_4tap_vert_ss_16x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vss = PFX(interp_4tap_vert_ss_32x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vss = PFX(interp_4tap_vert_ss_64x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vss = PFX(interp_4tap_vert_ss_8x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vss = PFX(interp_4tap_vert_ss_4x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vss = PFX(interp_4tap_vert_ss_16x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vss = PFX(interp_4tap_vert_ss_8x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vss = PFX(interp_4tap_vert_ss_32x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vss = PFX(interp_4tap_vert_ss_16x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vss = PFX(interp_4tap_vert_ss_16x12_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vss = PFX(interp_4tap_vert_ss_12x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vss = PFX(interp_4tap_vert_ss_16x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vss = PFX(interp_4tap_vert_ss_4x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vss = PFX(interp_4tap_vert_ss_32x24_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vss = PFX(interp_4tap_vert_ss_24x32_avx2); + 
p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vss = PFX(interp_4tap_vert_ss_32x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vss = PFX(interp_4tap_vert_ss_8x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vss = PFX(interp_4tap_vert_ss_64x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vss = PFX(interp_4tap_vert_ss_32x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vss = PFX(interp_4tap_vert_ss_64x48_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vss = PFX(interp_4tap_vert_ss_48x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vss = PFX(interp_4tap_vert_ss_64x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vss = PFX(interp_4tap_vert_ss_16x64_avx2); + + p.pu[LUMA_16x16].luma_hvpp = PFX(interp_8tap_hv_pp_16x16_avx2); + + ALL_LUMA_PU_T(luma_hvpp, interp_8tap_hv_pp_cpu); + p.pu[LUMA_4x4].luma_hvpp = interp_8tap_hv_pp_cpu<LUMA_4x4>; + + p.pu[LUMA_32x8].convert_p2s = PFX(filterPixelToShort_32x8_avx2); + p.pu[LUMA_32x16].convert_p2s = PFX(filterPixelToShort_32x16_avx2); + p.pu[LUMA_32x24].convert_p2s = PFX(filterPixelToShort_32x24_avx2); + p.pu[LUMA_32x32].convert_p2s = PFX(filterPixelToShort_32x32_avx2); + p.pu[LUMA_32x64].convert_p2s = PFX(filterPixelToShort_32x64_avx2); + p.pu[LUMA_64x16].convert_p2s = PFX(filterPixelToShort_64x16_avx2); + p.pu[LUMA_64x32].convert_p2s = PFX(filterPixelToShort_64x32_avx2); + p.pu[LUMA_64x48].convert_p2s = PFX(filterPixelToShort_64x48_avx2); + p.pu[LUMA_64x64].convert_p2s = PFX(filterPixelToShort_64x64_avx2); + p.pu[LUMA_48x64].convert_p2s = PFX(filterPixelToShort_48x64_avx2); + p.pu[LUMA_24x32].convert_p2s = PFX(filterPixelToShort_24x32_avx2); + + p.chroma[X265_CSP_I420].pu[CHROMA_420_24x32].p2s = PFX(filterPixelToShort_24x32_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x8].p2s = PFX(filterPixelToShort_32x8_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x16].p2s = PFX(filterPixelToShort_32x16_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x24].p2s = PFX(filterPixelToShort_32x24_avx2); + p.chroma[X265_CSP_I420].pu[CHROMA_420_32x32].p2s = PFX(filterPixelToShort_32x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].p2s = PFX(filterPixelToShort_24x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].p2s = PFX(filterPixelToShort_32x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].p2s = PFX(filterPixelToShort_32x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].p2s = PFX(filterPixelToShort_32x48_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].p2s = PFX(filterPixelToShort_32x64_avx2); //i422 for chroma_hpp - p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_hpp = x265_interp_4tap_horiz_pp_12x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hpp = x265_interp_4tap_horiz_pp_24x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hpp = x265_interp_4tap_horiz_pp_2x16_avx2; - - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hpp = x265_interp_4tap_horiz_pp_2x16_avx2; - - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hpp = x265_interp_4tap_horiz_pp_4x4_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hpp = x265_interp_4tap_horiz_pp_4x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_hpp = x265_interp_4tap_horiz_pp_4x16_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_hpp = PFX(interp_4tap_horiz_pp_12x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hpp = PFX(interp_4tap_horiz_pp_24x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hpp = PFX(interp_4tap_horiz_pp_2x16_avx2); + + 
p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hpp = PFX(interp_4tap_horiz_pp_2x16_avx2); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hpp = PFX(interp_4tap_horiz_pp_4x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hpp = PFX(interp_4tap_horiz_pp_4x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_hpp = PFX(interp_4tap_horiz_pp_4x16_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hpp = x265_interp_4tap_horiz_pp_8x4_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hpp = x265_interp_4tap_horiz_pp_8x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hpp = x265_interp_4tap_horiz_pp_8x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hpp = x265_interp_4tap_horiz_pp_8x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hpp = x265_interp_4tap_horiz_pp_8x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hpp = x265_interp_4tap_horiz_pp_8x12_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hpp = PFX(interp_4tap_horiz_pp_8x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hpp = PFX(interp_4tap_horiz_pp_8x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hpp = PFX(interp_4tap_horiz_pp_8x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hpp = PFX(interp_4tap_horiz_pp_8x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hpp = PFX(interp_4tap_horiz_pp_8x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hpp = PFX(interp_4tap_horiz_pp_8x12_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hpp = x265_interp_4tap_horiz_pp_16x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hpp = x265_interp_4tap_horiz_pp_16x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hpp = x265_interp_4tap_horiz_pp_16x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hpp = x265_interp_4tap_horiz_pp_16x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hpp = x265_interp_4tap_horiz_pp_16x24_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hpp = PFX(interp_4tap_horiz_pp_16x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hpp = PFX(interp_4tap_horiz_pp_16x24_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hpp = x265_interp_4tap_horiz_pp_32x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hpp = x265_interp_4tap_horiz_pp_32x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hpp = x265_interp_4tap_horiz_pp_32x48_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hpp = PFX(interp_4tap_horiz_pp_32x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hpp = PFX(interp_4tap_horiz_pp_32x48_avx2); - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_hpp = x265_interp_4tap_horiz_pp_2x8_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_hpp = PFX(interp_4tap_horiz_pp_2x8_avx2); //i444 filters hpp - p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_hpp = 
x265_interp_4tap_horiz_pp_4x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hpp = x265_interp_4tap_horiz_pp_8x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hpp = x265_interp_4tap_horiz_pp_16x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hpp = x265_interp_4tap_horiz_pp_32x32_avx2; - - p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_hpp = x265_interp_4tap_horiz_pp_4x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_hpp = x265_interp_4tap_horiz_pp_4x16_avx2; - - p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hpp = x265_interp_4tap_horiz_pp_8x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hpp = x265_interp_4tap_horiz_pp_8x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hpp = x265_interp_4tap_horiz_pp_8x32_avx2; - - p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hpp = x265_interp_4tap_horiz_pp_16x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hpp = x265_interp_4tap_horiz_pp_16x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hpp = x265_interp_4tap_horiz_pp_16x12_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hpp = x265_interp_4tap_horiz_pp_16x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hpp = x265_interp_4tap_horiz_pp_16x64_avx2; - - p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_hpp = x265_interp_4tap_horiz_pp_12x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hpp = x265_interp_4tap_horiz_pp_24x32_avx2; - - p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hpp = x265_interp_4tap_horiz_pp_32x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hpp = x265_interp_4tap_horiz_pp_32x64_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hpp = x265_interp_4tap_horiz_pp_32x24_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hpp = x265_interp_4tap_horiz_pp_32x8_avx2; - - p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hpp = x265_interp_4tap_horiz_pp_64x64_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hpp = x265_interp_4tap_horiz_pp_64x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hpp = x265_interp_4tap_horiz_pp_64x48_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hpp = x265_interp_4tap_horiz_pp_64x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hpp = x265_interp_4tap_horiz_pp_48x64_avx2; - - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hps = x265_interp_4tap_horiz_ps_4x4_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hps = x265_interp_4tap_horiz_ps_4x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_hps = x265_interp_4tap_horiz_ps_4x16_avx2; - - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hps = x265_interp_4tap_horiz_ps_8x4_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hps = x265_interp_4tap_horiz_ps_8x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hps = x265_interp_4tap_horiz_ps_8x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hps = x265_interp_4tap_horiz_ps_8x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hps = x265_interp_4tap_horiz_ps_8x64_avx2; //adding macro call - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hps = x265_interp_4tap_horiz_ps_8x12_avx2; //adding macro call - - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hps = x265_interp_4tap_horiz_ps_16x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hps = x265_interp_4tap_horiz_ps_16x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hps = x265_interp_4tap_horiz_ps_16x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hps = x265_interp_4tap_horiz_ps_16x64_avx2;//adding macro call - 
p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hps = x265_interp_4tap_horiz_ps_16x24_avx2;//adding macro call - - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hps = x265_interp_4tap_horiz_ps_32x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hps = x265_interp_4tap_horiz_ps_32x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hps = x265_interp_4tap_horiz_ps_32x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hps = x265_interp_4tap_horiz_ps_32x48_avx2; - - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_hps = x265_interp_4tap_horiz_ps_2x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hps = x265_interp_4tap_horiz_ps_24x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hps = x265_interp_4tap_horiz_ps_2x16_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_hpp = PFX(interp_4tap_horiz_pp_4x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hpp = PFX(interp_4tap_horiz_pp_8x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hpp = PFX(interp_4tap_horiz_pp_16x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hpp = PFX(interp_4tap_horiz_pp_32x32_avx2); + + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_hpp = PFX(interp_4tap_horiz_pp_4x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_hpp = PFX(interp_4tap_horiz_pp_4x16_avx2); + + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hpp = PFX(interp_4tap_horiz_pp_8x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hpp = PFX(interp_4tap_horiz_pp_8x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hpp = PFX(interp_4tap_horiz_pp_8x32_avx2); + + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hpp = PFX(interp_4tap_horiz_pp_16x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hpp = PFX(interp_4tap_horiz_pp_16x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hpp = PFX(interp_4tap_horiz_pp_16x12_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hpp = PFX(interp_4tap_horiz_pp_16x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hpp = PFX(interp_4tap_horiz_pp_16x64_avx2); + + p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_hpp = PFX(interp_4tap_horiz_pp_12x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hpp = PFX(interp_4tap_horiz_pp_24x32_avx2); + + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hpp = PFX(interp_4tap_horiz_pp_32x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hpp = PFX(interp_4tap_horiz_pp_32x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hpp = PFX(interp_4tap_horiz_pp_32x24_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hpp = PFX(interp_4tap_horiz_pp_32x8_avx2); + + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hpp = PFX(interp_4tap_horiz_pp_64x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hpp = PFX(interp_4tap_horiz_pp_64x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hpp = PFX(interp_4tap_horiz_pp_64x48_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hpp = PFX(interp_4tap_horiz_pp_64x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hpp = PFX(interp_4tap_horiz_pp_48x64_avx2); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_hps = PFX(interp_4tap_horiz_ps_4x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_hps = PFX(interp_4tap_horiz_ps_4x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_hps = PFX(interp_4tap_horiz_ps_4x16_avx2); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_hps = 
PFX(interp_4tap_horiz_ps_8x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_hps = PFX(interp_4tap_horiz_ps_8x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_hps = PFX(interp_4tap_horiz_ps_8x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_hps = PFX(interp_4tap_horiz_ps_8x12_avx2); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_hps = PFX(interp_4tap_horiz_ps_16x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_hps = PFX(interp_4tap_horiz_ps_16x24_avx2); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_hps = PFX(interp_4tap_horiz_ps_32x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_hps = PFX(interp_4tap_horiz_ps_32x48_avx2); + + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_hps = PFX(interp_4tap_horiz_ps_2x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_hps = PFX(interp_4tap_horiz_ps_24x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_hps = PFX(interp_4tap_horiz_ps_2x16_avx2); //i444 chroma_hps - p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hps = x265_interp_4tap_horiz_ps_64x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hps = x265_interp_4tap_horiz_ps_64x48_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hps = x265_interp_4tap_horiz_ps_64x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hps = x265_interp_4tap_horiz_ps_64x64_avx2; - - p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_hps = x265_interp_4tap_horiz_ps_4x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hps = x265_interp_4tap_horiz_ps_8x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hps = x265_interp_4tap_horiz_ps_16x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hps = x265_interp_4tap_horiz_ps_32x32_avx2; - - p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_hps = x265_interp_4tap_horiz_ps_4x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_hps = x265_interp_4tap_horiz_ps_4x16_avx2; - - p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hps = x265_interp_4tap_horiz_ps_8x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hps = x265_interp_4tap_horiz_ps_8x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hps = x265_interp_4tap_horiz_ps_8x32_avx2; - - p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hps = x265_interp_4tap_horiz_ps_16x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hps = x265_interp_4tap_horiz_ps_16x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hps = x265_interp_4tap_horiz_ps_16x12_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hps = x265_interp_4tap_horiz_ps_16x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hps = x265_interp_4tap_horiz_ps_16x64_avx2; - - p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hps = x265_interp_4tap_horiz_ps_24x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hps = x265_interp_4tap_horiz_ps_48x64_avx2; - - p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hps = x265_interp_4tap_horiz_ps_32x16_avx2; - 
p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hps = x265_interp_4tap_horiz_ps_32x64_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hps = x265_interp_4tap_horiz_ps_32x24_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hps = x265_interp_4tap_horiz_ps_32x8_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_hps = PFX(interp_4tap_horiz_ps_64x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_hps = PFX(interp_4tap_horiz_ps_64x48_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_hps = PFX(interp_4tap_horiz_ps_64x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_hps = PFX(interp_4tap_horiz_ps_64x64_avx2); + + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_hps = PFX(interp_4tap_horiz_ps_4x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_hps = PFX(interp_4tap_horiz_ps_8x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_hps = PFX(interp_4tap_horiz_ps_16x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_hps = PFX(interp_4tap_horiz_ps_32x32_avx2); + + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_hps = PFX(interp_4tap_horiz_ps_4x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_hps = PFX(interp_4tap_horiz_ps_4x16_avx2); + + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_hps = PFX(interp_4tap_horiz_ps_8x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_hps = PFX(interp_4tap_horiz_ps_8x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_hps = PFX(interp_4tap_horiz_ps_8x32_avx2); + + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_hps = PFX(interp_4tap_horiz_ps_16x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_hps = PFX(interp_4tap_horiz_ps_16x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_hps = PFX(interp_4tap_horiz_ps_16x12_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_hps = PFX(interp_4tap_horiz_ps_16x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_hps = PFX(interp_4tap_horiz_ps_16x64_avx2); + + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_hps = PFX(interp_4tap_horiz_ps_24x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_hps = PFX(interp_4tap_horiz_ps_48x64_avx2); + + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_hps = PFX(interp_4tap_horiz_ps_32x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_hps = PFX(interp_4tap_horiz_ps_32x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_hps = PFX(interp_4tap_horiz_ps_32x24_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_hps = PFX(interp_4tap_horiz_ps_32x8_avx2); //i422 for chroma_vsp - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vsp = x265_interp_4tap_vert_sp_4x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vsp = x265_interp_4tap_vert_sp_8x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vsp = x265_interp_4tap_vert_sp_16x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vsp = x265_interp_4tap_vert_sp_4x4_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vsp = x265_interp_4tap_vert_sp_2x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vsp = x265_interp_4tap_vert_sp_8x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vsp = x265_interp_4tap_vert_sp_4x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vsp = x265_interp_4tap_vert_sp_16x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vsp = x265_interp_4tap_vert_sp_8x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vsp = x265_interp_4tap_vert_sp_32x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vsp = x265_interp_4tap_vert_sp_8x4_avx2; - 
p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vsp = x265_interp_4tap_vert_sp_16x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vsp = x265_interp_4tap_vert_sp_32x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vsp = x265_interp_4tap_vert_sp_32x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vsp = x265_interp_4tap_vert_sp_16x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vsp = x265_interp_4tap_vert_sp_24x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vsp = x265_interp_4tap_vert_sp_8x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vsp = x265_interp_4tap_vert_sp_32x48_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vsp = x265_interp_4tap_vert_sp_8x12_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vsp = x265_interp_4tap_vert_sp_6x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vsp = x265_interp_4tap_vert_sp_2x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vsp = x265_interp_4tap_vert_sp_16x24_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vsp = x265_interp_4tap_vert_sp_12x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vsp = x265_interp_4tap_vert_sp_4x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vsp = x265_interp_4tap_vert_sp_2x4_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vsp = PFX(interp_4tap_vert_sp_4x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vsp = PFX(interp_4tap_vert_sp_8x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vsp = PFX(interp_4tap_vert_sp_4x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vsp = PFX(interp_4tap_vert_sp_2x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vsp = PFX(interp_4tap_vert_sp_8x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vsp = PFX(interp_4tap_vert_sp_4x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vsp = PFX(interp_4tap_vert_sp_8x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vsp = PFX(interp_4tap_vert_sp_8x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vsp = PFX(interp_4tap_vert_sp_32x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vsp = PFX(interp_4tap_vert_sp_16x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vsp = PFX(interp_4tap_vert_sp_24x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vsp = PFX(interp_4tap_vert_sp_8x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vsp = PFX(interp_4tap_vert_sp_32x48_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vsp = PFX(interp_4tap_vert_sp_8x12_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_6x16].filter_vsp = PFX(interp_4tap_vert_sp_6x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vsp = PFX(interp_4tap_vert_sp_2x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vsp = PFX(interp_4tap_vert_sp_16x24_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vsp = PFX(interp_4tap_vert_sp_12x32_avx2); + 
p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vsp = PFX(interp_4tap_vert_sp_4x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vsp = PFX(interp_4tap_vert_sp_2x4_avx2); //i444 for chroma_vsp - p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vsp = x265_interp_4tap_vert_sp_4x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vsp = x265_interp_4tap_vert_sp_8x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vsp = x265_interp_4tap_vert_sp_16x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vsp = x265_interp_4tap_vert_sp_32x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vsp = x265_interp_4tap_vert_sp_64x64_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vsp = x265_interp_4tap_vert_sp_8x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vsp = x265_interp_4tap_vert_sp_4x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vsp = x265_interp_4tap_vert_sp_16x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vsp = x265_interp_4tap_vert_sp_8x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vsp = x265_interp_4tap_vert_sp_32x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vsp = x265_interp_4tap_vert_sp_16x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vsp = x265_interp_4tap_vert_sp_16x12_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vsp = x265_interp_4tap_vert_sp_12x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vsp = x265_interp_4tap_vert_sp_16x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vsp = x265_interp_4tap_vert_sp_4x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vsp = x265_interp_4tap_vert_sp_32x24_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vsp = x265_interp_4tap_vert_sp_24x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vsp = x265_interp_4tap_vert_sp_32x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vsp = x265_interp_4tap_vert_sp_8x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vsp = x265_interp_4tap_vert_sp_64x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vsp = x265_interp_4tap_vert_sp_32x64_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vsp = x265_interp_4tap_vert_sp_64x48_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vsp = x265_interp_4tap_vert_sp_48x64_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vsp = x265_interp_4tap_vert_sp_64x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vsp = x265_interp_4tap_vert_sp_16x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vsp = PFX(interp_4tap_vert_sp_4x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vsp = PFX(interp_4tap_vert_sp_8x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vsp = PFX(interp_4tap_vert_sp_16x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vsp = PFX(interp_4tap_vert_sp_32x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vsp = PFX(interp_4tap_vert_sp_64x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vsp = PFX(interp_4tap_vert_sp_8x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vsp = PFX(interp_4tap_vert_sp_4x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vsp = PFX(interp_4tap_vert_sp_16x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vsp = PFX(interp_4tap_vert_sp_8x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vsp = PFX(interp_4tap_vert_sp_32x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vsp = PFX(interp_4tap_vert_sp_16x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vsp = PFX(interp_4tap_vert_sp_16x12_avx2); + 
p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vsp = PFX(interp_4tap_vert_sp_12x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vsp = PFX(interp_4tap_vert_sp_16x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vsp = PFX(interp_4tap_vert_sp_4x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vsp = PFX(interp_4tap_vert_sp_32x24_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vsp = PFX(interp_4tap_vert_sp_24x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vsp = PFX(interp_4tap_vert_sp_32x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vsp = PFX(interp_4tap_vert_sp_8x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vsp = PFX(interp_4tap_vert_sp_64x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vsp = PFX(interp_4tap_vert_sp_32x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vsp = PFX(interp_4tap_vert_sp_64x48_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vsp = PFX(interp_4tap_vert_sp_48x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vsp = PFX(interp_4tap_vert_sp_64x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vsp = PFX(interp_4tap_vert_sp_16x64_avx2); //i422 for chroma_vps - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vps = x265_interp_4tap_vert_ps_8x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vps = x265_interp_4tap_vert_ps_16x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vps = x265_interp_4tap_vert_ps_2x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vps = x265_interp_4tap_vert_ps_8x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vps = x265_interp_4tap_vert_ps_4x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vps = x265_interp_4tap_vert_ps_16x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vps = x265_interp_4tap_vert_ps_8x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vps = x265_interp_4tap_vert_ps_32x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vps = x265_interp_4tap_vert_ps_8x4_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vps = x265_interp_4tap_vert_ps_16x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vps = x265_interp_4tap_vert_ps_32x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vps = x265_interp_4tap_vert_ps_16x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vps = x265_interp_4tap_vert_ps_8x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vps = x265_interp_4tap_vert_ps_32x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vps = x265_interp_4tap_vert_ps_32x48_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vps = x265_interp_4tap_vert_ps_12x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vps = x265_interp_4tap_vert_ps_8x12_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vps = x265_interp_4tap_vert_ps_2x4_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vps = x265_interp_4tap_vert_ps_16x24_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vps = PFX(interp_4tap_vert_ps_4x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vps = PFX(interp_4tap_vert_ps_8x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vps = 
PFX(interp_4tap_vert_ps_4x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vps = PFX(interp_4tap_vert_ps_2x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vps = PFX(interp_4tap_vert_ps_8x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vps = PFX(interp_4tap_vert_ps_4x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vps = PFX(interp_4tap_vert_ps_8x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vps = PFX(interp_4tap_vert_ps_8x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vps = PFX(interp_4tap_vert_ps_16x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vps = PFX(interp_4tap_vert_ps_8x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vps = PFX(interp_4tap_vert_ps_32x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vps = PFX(interp_4tap_vert_ps_32x48_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vps = PFX(interp_4tap_vert_ps_12x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vps = PFX(interp_4tap_vert_ps_8x12_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vps = PFX(interp_4tap_vert_ps_2x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vps = PFX(interp_4tap_vert_ps_16x24_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vps = PFX(interp_4tap_vert_ps_2x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vps = PFX(interp_4tap_vert_ps_4x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vps = PFX(interp_4tap_vert_ps_24x64_avx2); //i444 for chroma_vps - p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vps = x265_interp_4tap_vert_ps_4x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vps = x265_interp_4tap_vert_ps_8x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vps = x265_interp_4tap_vert_ps_16x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vps = x265_interp_4tap_vert_ps_32x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vps = x265_interp_4tap_vert_ps_8x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vps = x265_interp_4tap_vert_ps_4x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vps = x265_interp_4tap_vert_ps_16x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vps = x265_interp_4tap_vert_ps_8x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vps = x265_interp_4tap_vert_ps_32x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vps = x265_interp_4tap_vert_ps_16x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vps = x265_interp_4tap_vert_ps_16x12_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vps = x265_interp_4tap_vert_ps_12x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vps = x265_interp_4tap_vert_ps_16x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vps = x265_interp_4tap_vert_ps_4x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vps = x265_interp_4tap_vert_ps_32x24_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vps = x265_interp_4tap_vert_ps_24x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vps = x265_interp_4tap_vert_ps_32x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vps = 
x265_interp_4tap_vert_ps_8x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vps = x265_interp_4tap_vert_ps_16x64_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vps = x265_interp_4tap_vert_ps_32x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vps = PFX(interp_4tap_vert_ps_4x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vps = PFX(interp_4tap_vert_ps_8x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vps = PFX(interp_4tap_vert_ps_16x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vps = PFX(interp_4tap_vert_ps_32x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vps = PFX(interp_4tap_vert_ps_8x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vps = PFX(interp_4tap_vert_ps_4x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vps = PFX(interp_4tap_vert_ps_16x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vps = PFX(interp_4tap_vert_ps_8x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vps = PFX(interp_4tap_vert_ps_32x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vps = PFX(interp_4tap_vert_ps_16x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vps = PFX(interp_4tap_vert_ps_16x12_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vps = PFX(interp_4tap_vert_ps_12x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vps = PFX(interp_4tap_vert_ps_16x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vps = PFX(interp_4tap_vert_ps_4x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vps = PFX(interp_4tap_vert_ps_32x24_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vps = PFX(interp_4tap_vert_ps_24x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vps = PFX(interp_4tap_vert_ps_32x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vps = PFX(interp_4tap_vert_ps_8x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vps = PFX(interp_4tap_vert_ps_16x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vps = PFX(interp_4tap_vert_ps_32x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vps = PFX(interp_4tap_vert_ps_48x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vps = PFX(interp_4tap_vert_ps_64x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vps = PFX(interp_4tap_vert_ps_64x48_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vps = PFX(interp_4tap_vert_ps_64x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vps = PFX(interp_4tap_vert_ps_64x16_avx2); //i422 for chroma_vpp - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vpp = x265_interp_4tap_vert_pp_16x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vpp = x265_interp_4tap_vert_pp_2x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vpp = x265_interp_4tap_vert_pp_16x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vpp = x265_interp_4tap_vert_pp_32x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_avx2; - 
p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vpp = x265_interp_4tap_vert_pp_16x8_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vpp = x265_interp_4tap_vert_pp_32x16_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vpp = x265_interp_4tap_vert_pp_16x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = x265_interp_4tap_vert_pp_8x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vpp = x265_interp_4tap_vert_pp_32x64_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vpp = x265_interp_4tap_vert_pp_32x48_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vpp = x265_interp_4tap_vert_pp_12x32_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = x265_interp_4tap_vert_pp_8x12_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vpp = x265_interp_4tap_vert_pp_2x4_avx2; - p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vpp = x265_interp_4tap_vert_pp_16x24_avx2; + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x8].filter_vpp = PFX(interp_4tap_vert_pp_4x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x16].filter_vpp = PFX(interp_4tap_vert_pp_8x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x4].filter_vpp = PFX(interp_4tap_vert_pp_4x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x8].filter_vpp = PFX(interp_4tap_vert_pp_2x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x8].filter_vpp = PFX(interp_4tap_vert_pp_8x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x16].filter_vpp = PFX(interp_4tap_vert_pp_4x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x32].filter_vpp = PFX(interp_4tap_vert_pp_8x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x4].filter_vpp = PFX(interp_4tap_vert_pp_8x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x64].filter_vpp = PFX(interp_4tap_vert_pp_16x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x64].filter_vpp = PFX(interp_4tap_vert_pp_8x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x64].filter_vpp = PFX(interp_4tap_vert_pp_32x64_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_32x48].filter_vpp = PFX(interp_4tap_vert_pp_32x48_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_12x32].filter_vpp = PFX(interp_4tap_vert_pp_12x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_8x12].filter_vpp = PFX(interp_4tap_vert_pp_8x12_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x4].filter_vpp = PFX(interp_4tap_vert_pp_2x4_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_16x24].filter_vpp = PFX(interp_4tap_vert_pp_16x24_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_2x16].filter_vpp = PFX(interp_4tap_vert_pp_2x16_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_4x32].filter_vpp = PFX(interp_4tap_vert_pp_4x32_avx2); + p.chroma[X265_CSP_I422].pu[CHROMA_422_24x64].filter_vpp = PFX(interp_4tap_vert_pp_24x64_avx2); //i444 for chroma_vpp - p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = x265_interp_4tap_vert_pp_4x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = x265_interp_4tap_vert_pp_8x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vpp = x265_interp_4tap_vert_pp_16x16_avx2; - 
p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vpp = x265_interp_4tap_vert_pp_32x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vpp = x265_interp_4tap_vert_pp_8x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vpp = x265_interp_4tap_vert_pp_4x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vpp = x265_interp_4tap_vert_pp_16x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = x265_interp_4tap_vert_pp_8x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vpp = x265_interp_4tap_vert_pp_32x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vpp = x265_interp_4tap_vert_pp_16x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vpp = x265_interp_4tap_vert_pp_16x12_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vpp = x265_interp_4tap_vert_pp_12x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vpp = x265_interp_4tap_vert_pp_16x4_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vpp = x265_interp_4tap_vert_pp_4x16_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vpp = x265_interp_4tap_vert_pp_32x24_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vpp = x265_interp_4tap_vert_pp_24x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vpp = x265_interp_4tap_vert_pp_32x8_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = x265_interp_4tap_vert_pp_8x32_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vpp = x265_interp_4tap_vert_pp_16x64_avx2; - p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vpp = x265_interp_4tap_vert_pp_32x64_avx2; + p.chroma[X265_CSP_I444].pu[LUMA_4x4].filter_vpp = PFX(interp_4tap_vert_pp_4x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x8].filter_vpp = PFX(interp_4tap_vert_pp_8x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x16].filter_vpp = PFX(interp_4tap_vert_pp_16x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x32].filter_vpp = PFX(interp_4tap_vert_pp_32x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x4].filter_vpp = PFX(interp_4tap_vert_pp_8x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_4x8].filter_vpp = PFX(interp_4tap_vert_pp_4x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x8].filter_vpp = PFX(interp_4tap_vert_pp_16x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x16].filter_vpp = PFX(interp_4tap_vert_pp_8x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x16].filter_vpp = PFX(interp_4tap_vert_pp_32x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x32].filter_vpp = PFX(interp_4tap_vert_pp_16x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x12].filter_vpp = PFX(interp_4tap_vert_pp_16x12_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_12x16].filter_vpp = PFX(interp_4tap_vert_pp_12x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x4].filter_vpp = PFX(interp_4tap_vert_pp_16x4_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_4x16].filter_vpp = PFX(interp_4tap_vert_pp_4x16_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x24].filter_vpp = PFX(interp_4tap_vert_pp_32x24_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_24x32].filter_vpp = PFX(interp_4tap_vert_pp_24x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x8].filter_vpp = PFX(interp_4tap_vert_pp_32x8_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_8x32].filter_vpp = PFX(interp_4tap_vert_pp_8x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_16x64].filter_vpp = PFX(interp_4tap_vert_pp_16x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_32x64].filter_vpp = PFX(interp_4tap_vert_pp_32x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_48x64].filter_vpp = PFX(interp_4tap_vert_pp_48x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x64].filter_vpp = PFX(interp_4tap_vert_pp_64x64_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x48].filter_vpp = 
PFX(interp_4tap_vert_pp_64x48_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x32].filter_vpp = PFX(interp_4tap_vert_pp_64x32_avx2); + p.chroma[X265_CSP_I444].pu[LUMA_64x16].filter_vpp = PFX(interp_4tap_vert_pp_64x16_avx2); + + p.frameInitLowres = PFX(frame_init_lowres_core_avx2); if (cpuMask & X265_CPU_BMI2) - p.scanPosLast = x265_scanPosLast_avx2_bmi2; + p.scanPosLast = PFX(scanPosLast_avx2_bmi2); + p.cu[BLOCK_32x32].copy_ps = PFX(blockcopy_ps_32x32_avx2); + p.chroma[X265_CSP_I420].cu[CHROMA_420_32x32].copy_ps = PFX(blockcopy_ps_32x32_avx2); + p.chroma[X265_CSP_I422].cu[CHROMA_422_32x64].copy_ps = PFX(blockcopy_ps_32x64_avx2); + p.cu[BLOCK_64x64].copy_ps = PFX(blockcopy_ps_64x64_avx2); + } #endif } #endif // if HIGH_BIT_DEPTH -} // namespace x265 +} // namespace X265_NS extern "C" { #ifdef __INTEL_COMPILER @@ -2704,7 +3641,7 @@ // Global variable indicating cpu int __intel_cpu_indicator = 0; // CPU dispatcher function -void x265_intel_cpu_indicator_init(void) +void PFX(intel_cpu_indicator_init)(void) { uint32_t cpu = x265::cpu_detect(); @@ -2737,7 +3674,7 @@ } #else // ifdef __INTEL_COMPILER -void x265_intel_cpu_indicator_init(void) {} +void PFX(intel_cpu_indicator_init)(void) {} #endif // ifdef __INTEL_COMPILER }
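
Note on the renaming above: throughout asm-primitives.cpp the 1.8 diff replaces hard-coded x265_-prefixed symbol names with the PFX() macro, and the closing namespace comment changes from x265 to X265_NS, so the same sources can be compiled at several bit depths and linked into one binary without symbol clashes. Below is a minimal sketch of how such a prefixing macro can be built with preprocessor token pasting; the macro and function names are illustrative only, and the real PFX in x265 also folds the namespace/bit-depth suffix into the name.

#include <cstdio>

// Hypothetical stand-in for x265's PFX: paste a fixed prefix onto a name
// at preprocessor time so every exported symbol gets a uniform prefix.
#define PASTE(a, b) a ## b
#define MY_PFX(name) PASTE(x265_, name)

static void MY_PFX(hello)(void)      // expands to: static void x265_hello(void)
{
    std::puts("called x265_hello");
}

int main()
{
    MY_PFX(hello)();                 // call site uses the same macro
    return 0;
}

Changing the prefix then means redefining one macro rather than touching every assignment in the function tables above.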
View file
x265_1.7.tar.gz/source/common/x86/blockcopy8.asm -> x265_1.8.tar.gz/source/common/x86/blockcopy8.asm
Changed
@@ -3043,43 +3043,31 @@ ;----------------------------------------------------------------------------- %macro BLOCKCOPY_PS_W32_H4_avx2 2 INIT_YMM avx2 -cglobal blockcopy_ps_%1x%2, 4, 7, 3 +cglobal blockcopy_ps_%1x%2, 4, 7, 2 add r1, r1 mov r4d, %2/4 lea r5, [3 * r3] lea r6, [3 * r1] - pxor m0, m0 - .loop: - movu m1, [r2] - punpcklbw m2, m1, m0 - punpckhbw m1, m1, m0 - vperm2i128 m3, m2, m1, 00100000b - vperm2i128 m2, m2, m1, 00110001b - movu [r0], m3 - movu [r0 + 32], m2 - movu m1, [r2 + r3] - punpcklbw m2, m1, m0 - punpckhbw m1, m1, m0 - vperm2i128 m3, m2, m1, 00100000b - vperm2i128 m2, m2, m1, 00110001b - movu [r0 + r1], m3 - movu [r0 + r1 + 32], m2 - movu m1, [r2 + 2 * r3] - punpcklbw m2, m1, m0 - punpckhbw m1, m1, m0 - vperm2i128 m3, m2, m1, 00100000b - vperm2i128 m2, m2, m1, 00110001b - movu [r0 + 2 * r1], m3 - movu [r0 + 2 * r1 + 32], m2 - movu m1, [r2 + r5] - punpcklbw m2, m1, m0 - punpckhbw m1, m1, m0 - vperm2i128 m3, m2, m1, 00100000b - vperm2i128 m2, m2, m1, 00110001b - movu [r0 + r6], m3 - movu [r0 + r6 + 32], m2 - + pmovzxbw m0, [r2 + 0] + pmovzxbw m1, [r2 + 16] + movu [r0 + 0], m0 + movu [r0 + 32], m1 + + pmovzxbw m0, [r2 + r3 + 0] + pmovzxbw m1, [r2 + r3 + 16] + movu [r0 + r1 + 0], m0 + movu [r0 + r1 + 32], m1 + + pmovzxbw m0, [r2 + r3 * 2 + 0] + pmovzxbw m1, [r2 + r3 * 2 + 16] + movu [r0 + r1 * 2 + 0], m0 + movu [r0 + r1 * 2 + 32], m1 + + pmovzxbw m0, [r2 + r5 + 0] + pmovzxbw m1, [r2 + r5 + 16] + movu [r0 + r6 + 0], m0 + movu [r0 + r6 + 32], m1 lea r0, [r0 + 4 * r1] lea r2, [r2 + 4 * r3] dec r4d @@ -3228,71 +3216,49 @@ INIT_YMM avx2 cglobal blockcopy_ps_64x64, 4, 7, 4 add r1, r1 - mov r4d, 64/4 + mov r4d, 64/8 lea r5, [3 * r3] lea r6, [3 * r1] - pxor m0, m0 - .loop: - movu m1, [r2] - punpcklbw m2, m1, m0 - punpckhbw m1, m1, m0 - vperm2i128 m3, m2, m1, 00100000b - vperm2i128 m2, m2, m1, 00110001b - movu [r0], m3 - movu [r0 + 32], m2 - movu m1, [r2 + 32] - punpcklbw m2, m1, m0 - punpckhbw m1, m1, m0 - vperm2i128 m3, m2, m1, 00100000b - vperm2i128 m2, m2, m1, 00110001b - movu [r0 + 64], m3 - movu [r0 + 96], m2 - movu m1, [r2 + r3] - punpcklbw m2, m1, m0 - punpckhbw m1, m1, m0 - vperm2i128 m3, m2, m1, 00100000b - vperm2i128 m2, m2, m1, 00110001b - movu [r0 + r1], m3 - movu [r0 + r1 + 32], m2 - movu m1, [r2 + r3 + 32] - punpcklbw m2, m1, m0 - punpckhbw m1, m1, m0 - vperm2i128 m3, m2, m1, 00100000b - vperm2i128 m2, m2, m1, 00110001b - movu [r0 + r1 + 64], m3 - movu [r0 + r1 + 96], m2 - movu m1, [r2 + 2 * r3] - punpcklbw m2, m1, m0 - punpckhbw m1, m1, m0 - vperm2i128 m3, m2, m1, 00100000b - vperm2i128 m2, m2, m1, 00110001b - movu [r0 + 2 * r1], m3 - movu [r0 + 2 * r1 + 32], m2 - movu m1, [r2 + 2 * r3 + 32] - punpcklbw m2, m1, m0 - punpckhbw m1, m1, m0 - vperm2i128 m3, m2, m1, 00100000b - vperm2i128 m2, m2, m1, 00110001b - movu [r0 + 2 * r1 + 64], m3 - movu [r0 + 2 * r1 + 96], m2 - movu m1, [r2 + r5] - punpcklbw m2, m1, m0 - punpckhbw m1, m1, m0 - vperm2i128 m3, m2, m1, 00100000b - vperm2i128 m2, m2, m1, 00110001b - movu [r0 + r6], m3 - movu [r0 + r6 + 32], m2 - movu m1, [r2 + r5 + 32] - punpcklbw m2, m1, m0 - punpckhbw m1, m1, m0 - vperm2i128 m3, m2, m1, 00100000b - vperm2i128 m2, m2, m1, 00110001b - movu [r0 + r6 + 64], m3 - movu [r0 + r6 + 96], m2 - +%rep 2 + pmovzxbw m0, [r2 + 0] + pmovzxbw m1, [r2 + 16] + pmovzxbw m2, [r2 + 32] + pmovzxbw m3, [r2 + 48] + movu [r0 + 0], m0 + movu [r0 + 32], m1 + movu [r0 + 64], m2 + movu [r0 + 96], m3 + + pmovzxbw m0, [r2 + r3 + 0] + pmovzxbw m1, [r2 + r3 + 16] + pmovzxbw m2, [r2 + r3 + 32] + pmovzxbw m3, [r2 + r3 + 48] + movu [r0 + r1 + 0], m0 
+ movu [r0 + r1 + 32], m1 + movu [r0 + r1 + 64], m2 + movu [r0 + r1 + 96], m3 + + pmovzxbw m0, [r2 + r3 * 2 + 0] + pmovzxbw m1, [r2 + r3 * 2 + 16] + pmovzxbw m2, [r2 + r3 * 2 + 32] + pmovzxbw m3, [r2 + r3 * 2 + 48] + movu [r0 + r1 * 2 + 0], m0 + movu [r0 + r1 * 2 + 32], m1 + movu [r0 + r1 * 2 + 64], m2 + movu [r0 + r1 * 2 + 96], m3 + + pmovzxbw m0, [r2 + r5 + 0] + pmovzxbw m1, [r2 + r5 + 16] + pmovzxbw m2, [r2 + r5 + 32] + pmovzxbw m3, [r2 + r5 + 48] + movu [r0 + r6 + 0], m0 + movu [r0 + r6 + 32], m1 + movu [r0 + r6 + 64], m2 + movu [r0 + r6 + 96], m3 lea r0, [r0 + 4 * r1] lea r2, [r2 + 4 * r3] +%endrep dec r4d jnz .loop RET
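
The rewritten blockcopy_ps kernels above drop the pxor/punpcklbw/punpckhbw/vperm2i128 widening sequence in favor of pmovzxbw, which zero-extends 16 source bytes straight into 16 words of a ymm register; this cuts the instruction count per row and the vector registers needed (note the xmm count in cglobal dropping from 3 to 2). A rough C++ intrinsics equivalent of one 32-pixel row is sketched below, assuming the blockcopy_ps argument convention from blockcopy8.h; the function name is illustrative, not part of x265.

#include <immintrin.h>
#include <cstdint>

// One row of a 32-wide pixel-to-short copy: each uint8_t pixel is
// zero-extended to int16_t, 16 pixels per vpmovzxbw (_mm256_cvtepu8_epi16).
void copy_ps_row32(int16_t* dst, const uint8_t* src)
{
    __m256i lo = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(src + 0)));
    __m256i hi = _mm256_cvtepu8_epi16(_mm_loadu_si128((const __m128i*)(src + 16)));
    _mm256_storeu_si256((__m256i*)(dst + 0),  lo);   // words 0..15
    _mm256_storeu_si256((__m256i*)(dst + 16), hi);   // words 16..31
}

The asm unrolls this over four rows per loop iteration (and the 64x64 variant over eight, via %rep 2), which is why the row counters change from %2/4 to 64/8.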
View file
x265_1.7.tar.gz/source/common/x86/blockcopy8.h -> x265_1.8.tar.gz/source/common/x86/blockcopy8.h
Changed
@@ -24,240 +24,40 @@ #ifndef X265_BLOCKCOPY8_H #define X265_BLOCKCOPY8_H -void x265_cpy2Dto1D_shl_4_sse2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy2Dto1D_shl_8_sse2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy2Dto1D_shl_16_sse2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy2Dto1D_shl_32_sse2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy2Dto1D_shr_4_sse2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy2Dto1D_shr_8_sse2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy2Dto1D_shr_16_sse2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy2Dto1D_shr_32_sse2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy1Dto2D_shl_4_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy1Dto2D_shl_8_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy1Dto2D_shl_16_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy1Dto2D_shl_32_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy1Dto2D_shl_4_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift); -void x265_cpy1Dto2D_shl_8_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift); -void x265_cpy1Dto2D_shl_16_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift); -void x265_cpy1Dto2D_shl_32_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift); -void x265_cpy1Dto2D_shr_4_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy1Dto2D_shr_8_avx2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift); -void x265_cpy1Dto2D_shr_16_avx2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift); -void x265_cpy1Dto2D_shr_32_avx2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift); -void x265_cpy1Dto2D_shr_4_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift); -void x265_cpy1Dto2D_shr_8_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift); -void x265_cpy1Dto2D_shr_16_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift); -void x265_cpy1Dto2D_shr_32_sse2(int16_t* dst, const int16_t* src, intptr_t dstStride, int shift); -void x265_cpy2Dto1D_shl_8_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy2Dto1D_shl_16_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy2Dto1D_shl_32_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy2Dto1D_shr_8_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy2Dto1D_shr_16_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -void x265_cpy2Dto1D_shr_32_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -uint32_t x265_copy_cnt_4_sse4(int16_t* dst, const int16_t* src, intptr_t srcStride); -uint32_t x265_copy_cnt_8_sse4(int16_t* dst, const int16_t* src, intptr_t srcStride); -uint32_t x265_copy_cnt_16_sse4(int16_t* dst, const int16_t* src, intptr_t srcStride); -uint32_t x265_copy_cnt_32_sse4(int16_t* dst, const int16_t* src, intptr_t srcStride); -uint32_t x265_copy_cnt_4_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride); -uint32_t x265_copy_cnt_8_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride); 
-uint32_t x265_copy_cnt_16_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride); -uint32_t x265_copy_cnt_32_avx2(int16_t* dst, const int16_t* src, intptr_t srcStride); +FUNCDEF_TU_S(void, cpy2Dto1D_shl, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +FUNCDEF_TU_S(void, cpy2Dto1D_shl, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +FUNCDEF_TU_S(void, cpy2Dto1D_shl, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -#define SETUP_BLOCKCOPY_FUNC(W, H, cpu) \ - void x265_blockcopy_pp_ ## W ## x ## H ## cpu(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); \ - void x265_blockcopy_sp_ ## W ## x ## H ## cpu(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); \ - void x265_blockcopy_ss_ ## W ## x ## H ## cpu(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +FUNCDEF_TU_S(void, cpy2Dto1D_shr, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +FUNCDEF_TU_S(void, cpy2Dto1D_shr, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +FUNCDEF_TU_S(void, cpy2Dto1D_shr, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -#define SETUP_BLOCKCOPY_PS(W, H, cpu) \ - void x265_blockcopy_ps_ ## W ## x ## H ## cpu(int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +FUNCDEF_TU_S(void, cpy1Dto2D_shl, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +FUNCDEF_TU_S(void, cpy1Dto2D_shl, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +FUNCDEF_TU_S(void, cpy1Dto2D_shl, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -#define SETUP_BLOCKCOPY_SP(W, H, cpu) \ - void x265_blockcopy_sp_ ## W ## x ## H ## cpu(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +FUNCDEF_TU_S(void, cpy1Dto2D_shr, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +FUNCDEF_TU_S(void, cpy1Dto2D_shr, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); +FUNCDEF_TU_S(void, cpy1Dto2D_shr, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride, int shift); -#define SETUP_BLOCKCOPY_SS_PP(W, H, cpu) \ - void x265_blockcopy_pp_ ## W ## x ## H ## cpu(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); \ - void x265_blockcopy_ss_ ## W ## x ## H ## cpu(int16_t* a, intptr_t stridea, const int16_t* b, intptr_t strideb); +FUNCDEF_TU_S(uint32_t, copy_cnt, sse2, int16_t* dst, const int16_t* src, intptr_t srcStride); +FUNCDEF_TU_S(uint32_t, copy_cnt, sse4, int16_t* dst, const int16_t* src, intptr_t srcStride); +FUNCDEF_TU_S(uint32_t, copy_cnt, avx2, int16_t* dst, const int16_t* src, intptr_t srcStride); -#define BLOCKCOPY_COMMON(cpu) \ - SETUP_BLOCKCOPY_FUNC(4, 4, cpu); \ - SETUP_BLOCKCOPY_FUNC(4, 2, cpu); \ - SETUP_BLOCKCOPY_FUNC(8, 8, cpu); \ - SETUP_BLOCKCOPY_FUNC(8, 4, cpu); \ - SETUP_BLOCKCOPY_FUNC(4, 8, cpu); \ - SETUP_BLOCKCOPY_FUNC(8, 6, cpu); \ - SETUP_BLOCKCOPY_FUNC(8, 2, cpu); \ - SETUP_BLOCKCOPY_FUNC(16, 16, cpu); \ - SETUP_BLOCKCOPY_FUNC(16, 8, cpu); \ - SETUP_BLOCKCOPY_FUNC(8, 16, cpu); \ - SETUP_BLOCKCOPY_FUNC(16, 12, cpu); \ - SETUP_BLOCKCOPY_FUNC(12, 16, cpu); \ - SETUP_BLOCKCOPY_FUNC(16, 4, cpu); \ - SETUP_BLOCKCOPY_FUNC(4, 16, cpu); \ - SETUP_BLOCKCOPY_FUNC(32, 32, cpu); \ - SETUP_BLOCKCOPY_FUNC(32, 16, cpu); \ - SETUP_BLOCKCOPY_FUNC(16, 32, cpu); \ - SETUP_BLOCKCOPY_FUNC(32, 24, cpu); \ - SETUP_BLOCKCOPY_FUNC(24, 32, cpu); \ - SETUP_BLOCKCOPY_FUNC(32, 8, cpu); \ - SETUP_BLOCKCOPY_FUNC(8, 32, cpu); \ - 
SETUP_BLOCKCOPY_FUNC(64, 64, cpu); \ - SETUP_BLOCKCOPY_FUNC(64, 32, cpu); \ - SETUP_BLOCKCOPY_FUNC(32, 64, cpu); \ - SETUP_BLOCKCOPY_FUNC(64, 48, cpu); \ - SETUP_BLOCKCOPY_FUNC(48, 64, cpu); \ - SETUP_BLOCKCOPY_FUNC(64, 16, cpu); \ - SETUP_BLOCKCOPY_FUNC(16, 64, cpu); +FUNCDEF_TU(void, blockfill_s, sse2, int16_t* dst, intptr_t dstride, int16_t val); +FUNCDEF_TU(void, blockfill_s, avx2, int16_t* dst, intptr_t dstride, int16_t val); -#define BLOCKCOPY_SP(cpu) \ - SETUP_BLOCKCOPY_SP(2, 4, cpu); \ - SETUP_BLOCKCOPY_SP(2, 8, cpu); \ - SETUP_BLOCKCOPY_SP(6, 8, cpu); \ - \ - SETUP_BLOCKCOPY_SP(2, 16, cpu); \ - SETUP_BLOCKCOPY_SP(4, 32, cpu); \ - SETUP_BLOCKCOPY_SP(6, 16, cpu); \ - SETUP_BLOCKCOPY_SP(8, 12, cpu); \ - SETUP_BLOCKCOPY_SP(8, 64, cpu); \ - SETUP_BLOCKCOPY_SP(12, 32, cpu); \ - SETUP_BLOCKCOPY_SP(16, 24, cpu); \ - SETUP_BLOCKCOPY_SP(24, 64, cpu); \ - SETUP_BLOCKCOPY_SP(32, 48, cpu); +FUNCDEF_CHROMA_PU(void, blockcopy_ss, sse2, int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); +FUNCDEF_CHROMA_PU(void, blockcopy_ss, avx, int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -#define BLOCKCOPY_SS_PP(cpu) \ - SETUP_BLOCKCOPY_SS_PP(2, 4, cpu); \ - SETUP_BLOCKCOPY_SS_PP(2, 8, cpu); \ - SETUP_BLOCKCOPY_SS_PP(6, 8, cpu); \ - \ - SETUP_BLOCKCOPY_SS_PP(2, 16, cpu); \ - SETUP_BLOCKCOPY_SS_PP(4, 32, cpu); \ - SETUP_BLOCKCOPY_SS_PP(6, 16, cpu); \ - SETUP_BLOCKCOPY_SS_PP(8, 12, cpu); \ - SETUP_BLOCKCOPY_SS_PP(8, 64, cpu); \ - SETUP_BLOCKCOPY_SS_PP(12, 32, cpu); \ - SETUP_BLOCKCOPY_SS_PP(16, 24, cpu); \ - SETUP_BLOCKCOPY_SS_PP(24, 64, cpu); \ - SETUP_BLOCKCOPY_SS_PP(32, 48, cpu); - +FUNCDEF_CHROMA_PU(void, blockcopy_pp, sse2, pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +FUNCDEF_CHROMA_PU(void, blockcopy_pp, avx, pixel* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); -#define BLOCKCOPY_PS(cpu) \ - SETUP_BLOCKCOPY_PS(2, 4, cpu); \ - SETUP_BLOCKCOPY_PS(2, 8, cpu); \ - SETUP_BLOCKCOPY_PS(4, 2, cpu); \ - SETUP_BLOCKCOPY_PS(4, 4, cpu); \ - SETUP_BLOCKCOPY_PS(4, 8, cpu); \ - SETUP_BLOCKCOPY_PS(4, 16, cpu); \ - SETUP_BLOCKCOPY_PS(6, 8, cpu); \ - SETUP_BLOCKCOPY_PS(8, 2, cpu); \ - SETUP_BLOCKCOPY_PS(8, 4, cpu); \ - SETUP_BLOCKCOPY_PS(8, 6, cpu); \ - SETUP_BLOCKCOPY_PS(8, 8, cpu); \ - SETUP_BLOCKCOPY_PS(8, 16, cpu); \ - SETUP_BLOCKCOPY_PS(8, 32, cpu); \ - SETUP_BLOCKCOPY_PS(12, 16, cpu); \ - SETUP_BLOCKCOPY_PS(16, 4, cpu); \ - SETUP_BLOCKCOPY_PS(16, 8, cpu); \ - SETUP_BLOCKCOPY_PS(16, 12, cpu); \ - SETUP_BLOCKCOPY_PS(16, 16, cpu); \ - SETUP_BLOCKCOPY_PS(16, 32, cpu); \ - SETUP_BLOCKCOPY_PS(24, 32, cpu); \ - SETUP_BLOCKCOPY_PS(32, 8, cpu); \ - SETUP_BLOCKCOPY_PS(32, 16, cpu); \ - SETUP_BLOCKCOPY_PS(32, 24, cpu); \ - SETUP_BLOCKCOPY_PS(32, 32, cpu); \ - SETUP_BLOCKCOPY_PS(16, 64, cpu); \ - SETUP_BLOCKCOPY_PS(32, 64, cpu); \ - SETUP_BLOCKCOPY_PS(48, 64, cpu); \ - SETUP_BLOCKCOPY_PS(64, 16, cpu); \ - SETUP_BLOCKCOPY_PS(64, 32, cpu); \ - SETUP_BLOCKCOPY_PS(64, 48, cpu); \ - SETUP_BLOCKCOPY_PS(64, 64, cpu); \ - \ - SETUP_BLOCKCOPY_PS(2, 16, cpu); \ - SETUP_BLOCKCOPY_PS(4, 32, cpu); \ - SETUP_BLOCKCOPY_PS(6, 16, cpu); \ - SETUP_BLOCKCOPY_PS(8, 12, cpu); \ - SETUP_BLOCKCOPY_PS(8, 64, cpu); \ - SETUP_BLOCKCOPY_PS(12, 32, cpu); \ - SETUP_BLOCKCOPY_PS(16, 24, cpu); \ - SETUP_BLOCKCOPY_PS(24, 64, cpu); \ - SETUP_BLOCKCOPY_PS(32, 48, cpu); - -BLOCKCOPY_COMMON(_sse2); -BLOCKCOPY_SS_PP(_sse2); -BLOCKCOPY_SP(_sse4); -BLOCKCOPY_PS(_sse4); - -BLOCKCOPY_SP(_sse2); - -void x265_blockfill_s_4x4_sse2(int16_t* dst, intptr_t dstride, int16_t val); 
-void x265_blockfill_s_8x8_sse2(int16_t* dst, intptr_t dstride, int16_t val); -void x265_blockfill_s_16x16_sse2(int16_t* dst, intptr_t dstride, int16_t val); -void x265_blockfill_s_32x32_sse2(int16_t* dst, intptr_t dstride, int16_t val); -void x265_blockcopy_ss_16x4_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_16x8_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_16x12_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_16x16_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_16x24_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_16x32_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_16x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_64x16_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_64x32_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_64x48_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_64x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_32x8_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_32x16_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_32x24_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_32x32_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_32x48_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_32x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_48x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_24x32_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); -void x265_blockcopy_ss_24x64_avx(int16_t* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); - -void x265_blockcopy_pp_32x8_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); -void x265_blockcopy_pp_32x16_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); -void x265_blockcopy_pp_32x24_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); -void x265_blockcopy_pp_32x32_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); -void x265_blockcopy_pp_32x48_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); -void x265_blockcopy_pp_32x64_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); -void x265_blockcopy_pp_64x16_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); -void x265_blockcopy_pp_64x32_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); -void x265_blockcopy_pp_64x48_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); -void x265_blockcopy_pp_64x64_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); -void x265_blockcopy_pp_48x64_avx(pixel* a, intptr_t stridea, const pixel* b, intptr_t strideb); - -void x265_blockfill_s_16x16_avx2(int16_t* dst, intptr_t dstride, 
int16_t val); -void x265_blockfill_s_32x32_avx2(int16_t* dst, intptr_t dstride, int16_t val); -// copy_sp primitives -// 16 x N -void x265_blockcopy_sp_16x16_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); -void x265_blockcopy_sp_16x32_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); - -// 32 x N -void x265_blockcopy_sp_32x32_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); -void x265_blockcopy_sp_32x64_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); - -// 64 x N -void x265_blockcopy_sp_64x64_avx2(pixel* a, intptr_t stridea, const int16_t* b, intptr_t strideb); -// copy_ps primitives -// 16 x N -void x265_blockcopy_ps_16x16_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); -void x265_blockcopy_ps_16x32_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); - -// 32 x N -void x265_blockcopy_ps_32x32_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); -void x265_blockcopy_ps_32x64_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); - -// 64 x N -void x265_blockcopy_ps_64x64_avx2(int16_t* a, intptr_t stridea, const pixel* b, intptr_t strideb); - -#undef BLOCKCOPY_COMMON -#undef BLOCKCOPY_SS_PP -#undef BLOCKCOPY_SP -#undef BLOCKCOPY_PS -#undef SETUP_BLOCKCOPY_PS -#undef SETUP_BLOCKCOPY_SP -#undef SETUP_BLOCKCOPY_SS_PP -#undef SETUP_BLOCKCOPY_FUNC +FUNCDEF_PU(void, blockcopy_sp, sse2, pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); +FUNCDEF_PU(void, blockcopy_sp, sse4, pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); +FUNCDEF_PU(void, blockcopy_sp, avx2, pixel* dst, intptr_t dstStride, const int16_t* src, intptr_t srcStride); +FUNCDEF_PU(void, blockcopy_ps, sse2, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +FUNCDEF_PU(void, blockcopy_ps, sse4, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); +FUNCDEF_PU(void, blockcopy_ps, avx2, int16_t* dst, intptr_t dstStride, const pixel* src, intptr_t srcStride); #endif // ifndef X265_I386_PIXEL_H
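
The header rewrite above collapses hundreds of per-size prototypes and helper macros into a handful of FUNCDEF_* invocations that stamp out one declaration per block size and CPU suffix. A hedged sketch of the pattern follows, with a made-up macro name and without the PFX wrapping that x265's real FUNCDEF macros apply; only four transform sizes are shown.

#include <cstdint>

// Illustrative FUNCDEF_TU_S-style macro: declare one prototype per
// transform-unit size (4/8/16/32) for a given function name and cpu suffix.
#define MY_FUNCDEF_TU_S(ret, name, cpu, ...) \
    ret name ## _4_  ## cpu(__VA_ARGS__);    \
    ret name ## _8_  ## cpu(__VA_ARGS__);    \
    ret name ## _16_ ## cpu(__VA_ARGS__);    \
    ret name ## _32_ ## cpu(__VA_ARGS__)

// Expands to prototypes for cpy2Dto1D_shl_4_sse2, cpy2Dto1D_shl_8_sse2,
// cpy2Dto1D_shl_16_sse2 and cpy2Dto1D_shl_32_sse2.
MY_FUNCDEF_TU_S(void, cpy2Dto1D_shl, sse2,
                int16_t* dst, const int16_t* src, intptr_t srcStride, int shift);

One macro call per CPU variant replaces the old SETUP_BLOCKCOPY_* blocks, so adding an AVX2 flavor of a primitive is a single new line rather than a list of thirty declarations.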
View file
x265_1.7.tar.gz/source/common/x86/const-a.asm -> x265_1.8.tar.gz/source/common/x86/const-a.asm
Changed
@@ -41,7 +41,7 @@
 const pb_16, times 32 db 16
 const pb_32, times 32 db 32
 const pb_64, times 32 db 64
-const pb_128, times 16 db 128
+const pb_128, times 32 db 128
 const pb_a1, times 16 db 0xa1
 
 const pb_01, times 8 db 0, 1
@@ -62,7 +62,9 @@
 ;; 16-bit constants
 
 const pw_1, times 16 dw 1
-const pw_2, times 8 dw 2
+const pw_2, times 16 dw 2
+const pw_3, times 16 dw 3
+const pw_7, times 16 dw 7
 const pw_m2, times 8 dw -2
 const pw_4, times 8 dw 4
 const pw_8, times 8 dw 8
@@ -75,9 +77,11 @@
 const pw_256, times 16 dw 256
 const pw_257, times 16 dw 257
 const pw_512, times 16 dw 512
-const pw_1023, times 8 dw 1023
+const pw_1023, times 16 dw 1023
 const pw_1024, times 16 dw 1024
+const pw_2048, times 16 dw 2048
 const pw_4096, times 16 dw 4096
+const pw_8192, times 8 dw 8192
 const pw_00ff, times 16 dw 0x00ff
 const pw_ff00, times 8 dw 0xff00
 const pw_2000, times 16 dw 0x2000
@@ -90,7 +94,7 @@
 const pw_0_15, times 2 dw 0, 1, 2, 3, 4, 5, 6, 7
 const pw_ppppmmmm, times 1 dw 1, 1, 1, 1, -1, -1, -1, -1
 const pw_ppmmppmm, times 1 dw 1, 1, -1, -1, 1, 1, -1, -1
-const pw_pmpmpmpm, times 1 dw 1, -1, 1, -1, 1, -1, 1, -1
+const pw_pmpmpmpm, times 16 dw 1, -1, 1, -1, 1, -1, 1, -1
 const pw_pmmpzzzz, times 1 dw 1, -1, -1, 1, 0, 0, 0, 0
 const multi_2Row, times 1 dw 1, 2, 3, 4, 1, 2, 3, 4
 const multiH, times 1 dw 9, 10, 11, 12, 13, 14, 15, 16
@@ -100,7 +104,9 @@
 const pw_planar16_mul, times 1 dw 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
 const pw_planar32_mul, times 1 dw 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16
 const pw_FFFFFFFFFFFFFFF0, dw 0x00
-                  times 7 dw 0xff
+                  times 7 dw 0xff
+const hmul_16p, times 16 db 1
+                times 8 db 1, -1
 
 ;; 32-bit constants
@@ -109,8 +115,9 @@
 const pd_2, times 8 dd 2
 const pd_4, times 4 dd 4
 const pd_8, times 4 dd 8
-const pd_16, times 4 dd 16
-const pd_32, times 4 dd 32
+const pd_16, times 8 dd 16
+const pd_31, times 4 dd 31
+const pd_32, times 8 dd 32
 const pd_64, times 4 dd 64
 const pd_128, times 4 dd 128
 const pd_256, times 4 dd 256
@@ -119,10 +126,11 @@
 const pd_2048, times 4 dd 2048
 const pd_ffff, times 4 dd 0xffff
 const pd_32767, times 4 dd 32767
-const pd_n32768, times 4 dd 0xffff8000
+const pd_524416, times 4 dd 524416
+const pd_n32768, times 8 dd 0xffff8000
+const pd_n131072, times 4 dd 0xfffe0000
 
 const trans8_shuf, times 1 dd 0, 4, 1, 5, 2, 6, 3, 7
-const deinterleave_shufd, times 1 dd 0, 4, 1, 5, 2, 6, 3, 7
 
 const popcnt_table
 %assign x 0
@@ -131,5 +139,3 @@
     db ((x>>0)&1)+((x>>1)&1)+((x>>2)&1)+((x>>3)&1)+((x>>4)&1)+((x>>5)&1)+((x>>6)&1)+((x>>7)&1)
 %assign x x+1
 %endrep
-
-const sw_64, dd 64
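The pattern behind most of these constant changes is a register-width bump for AVX2: "times 8 dw" fills one 16-byte XMM register, while the widened "times 16 dw" variants fill a full 32-byte YMM register, so AVX2 kernels can load the constant with a single full-width mova/vbroadcast. Expressed in C for illustration only (array names are hypothetical, not x265 symbols; _Static_assert requires C11):

#include <stdint.h>
static const int16_t pw_2_xmm[8]  = {2, 2, 2, 2, 2, 2, 2, 2};  /* 8 words  = 16 bytes, one XMM */
static const int16_t pw_2_ymm[16] = {2, 2, 2, 2, 2, 2, 2, 2,
                                     2, 2, 2, 2, 2, 2, 2, 2};  /* 16 words = 32 bytes, one YMM */
_Static_assert(sizeof(pw_2_ymm) == 2 * sizeof(pw_2_xmm), "YMM constant is twice the XMM one");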
View file
x265_1.7.tar.gz/source/common/x86/dct8.asm -> x265_1.8.tar.gz/source/common/x86/dct8.asm
Changed
@@ -157,7 +157,7 @@
 
 idct8_shuf1: dd 0, 2, 4, 6, 1, 3, 5, 7
 
-idct8_shuf2: times 2 db 0, 1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15
+const idct8_shuf2, times 2 db 0, 1, 2, 3, 8, 9, 10, 11, 4, 5, 6, 7, 12, 13, 14, 15
 
 idct8_shuf3: times 2 db 12, 13, 14, 15, 8, 9, 10, 11, 4, 5, 6, 7, 0, 1, 2, 3
 
@@ -332,20 +332,48 @@
 cextern pd_2048
 cextern pw_ppppmmmm
 cextern trans8_shuf
+
+
+%if BIT_DEPTH == 12
+    %define DCT4_SHIFT      5
+    %define DCT4_ROUND      16
+    %define IDCT_SHIFT      8
+    %define IDCT_ROUND      128
+    %define DST4_SHIFT      5
+    %define DST4_ROUND      16
+    %define DCT8_SHIFT1     6
+    %define DCT8_ROUND1     32
+%elif BIT_DEPTH == 10
+    %define DCT4_SHIFT      3
+    %define DCT4_ROUND      4
+    %define IDCT_SHIFT      10
+    %define IDCT_ROUND      512
+    %define DST4_SHIFT      3
+    %define DST4_ROUND      4
+    %define DCT8_SHIFT1     4
+    %define DCT8_ROUND1     8
+%elif BIT_DEPTH == 8
+    %define DCT4_SHIFT      1
+    %define DCT4_ROUND      1
+    %define IDCT_SHIFT      12
+    %define IDCT_ROUND      2048
+    %define DST4_SHIFT      1
+    %define DST4_ROUND      1
+    %define DCT8_SHIFT1     2
+    %define DCT8_ROUND1     2
+%else
+    %error Unsupported BIT_DEPTH!
+%endif
+
+%define DCT8_ROUND2         256
+%define DCT8_SHIFT2         9
+
 ;------------------------------------------------------
 ;void dct4(const int16_t* src, int16_t* dst, intptr_t srcStride)
 ;------------------------------------------------------
 INIT_XMM sse2
 cglobal dct4, 3, 4, 8
-%if BIT_DEPTH == 10
-    %define DCT_SHIFT 3
-    mova m7, [pd_4]
-%elif BIT_DEPTH == 8
-    %define DCT_SHIFT 1
-    mova m7, [pd_1]
-%else
-    %error Unsupported BIT_DEPTH!
-%endif
+    mova m7, [pd_ %+ DCT4_ROUND]
     add r2d, r2d
     lea r3, [tab_dct4]
 
@@ -372,19 +400,19 @@
     psubw m2, m0
     pmaddwd m0, m1, m4
     paddd m0, m7
-    psrad m0, DCT_SHIFT
+    psrad m0, DCT4_SHIFT
     pmaddwd m3, m2, m5
     paddd m3, m7
-    psrad m3, DCT_SHIFT
+    psrad m3, DCT4_SHIFT
     packssdw m0, m3
     pshufd m0, m0, 0xD8
     pshufhw m0, m0, 0xB1
     pmaddwd m1, m6
     paddd m1, m7
-    psrad m1, DCT_SHIFT
+    psrad m1, DCT4_SHIFT
     pmaddwd m2, [r3 + 3 * 16]
     paddd m2, m7
-    psrad m2, DCT_SHIFT
+    psrad m2, DCT4_SHIFT
     packssdw m1, m2
     pshufd m1, m1, 0xD8
     pshufhw m1, m1, 0xB1
@@ -431,15 +459,7 @@
 ; - r2: source stride
 INIT_YMM avx2
 cglobal dct4, 3, 4, 8, src, dst, srcStride
-%if BIT_DEPTH == 10
-    %define DCT_SHIFT 3
-    vbroadcasti128 m7, [pd_4]
-%elif BIT_DEPTH == 8
-    %define DCT_SHIFT 1
-    vbroadcasti128 m7, [pd_1]
-%else
-    %error Unsupported BIT_DEPTH!
-%endif
+    vbroadcasti128 m7, [pd_ %+ DCT4_ROUND]
     add r2d, r2d
     lea r3, [avx2_dct4]
 
@@ -461,11 +481,11 @@
 
     pmaddwd m2, m5
     paddd m2, m7
-    psrad m2, DCT_SHIFT
+    psrad m2, DCT4_SHIFT
 
     pmaddwd m0, m6
     paddd m0, m7
-    psrad m0, DCT_SHIFT
+    psrad m0, DCT4_SHIFT
     packssdw m2, m0
     pshufb m2, m4
 
@@ -493,30 +513,19 @@
 ;void idct4(const int16_t* src, int16_t* dst, intptr_t dstStride)
 ;-------------------------------------------------------
 INIT_XMM sse2
-cglobal idct4, 3, 4, 7
-%if BIT_DEPTH == 8
-    %define IDCT4_OFFSET [pd_2048]
-    %define IDCT4_SHIFT 12
-%elif BIT_DEPTH == 10
-    %define IDCT4_OFFSET [pd_512]
-    %define IDCT4_SHIFT 10
-%else
-    %error Unsupported BIT_DEPTH!
-%endif
+cglobal idct4, 3, 4, 6
     add r2d, r2d
    lea r3, [tab_dct4]
 
-    mova m6, [pd_64]
-
     movu m0, [r0 + 0 * 16]
     movu m1, [r0 + 1 * 16]
 
     punpcklwd m2, m0, m1
     pmaddwd m3, m2, [r3 + 0 * 16]      ; m3 = E1
-    paddd m3, m6
+    paddd m3, [pd_64]
 
     pmaddwd m2, [r3 + 2 * 16]          ; m2 = E2
-    paddd m2, m6
+    paddd m2, [pd_64]
 
     punpckhwd m0, m1
     pmaddwd m1, m0, [r3 + 1 * 16]      ; m1 = O1
@@ -540,29 +549,27 @@
     punpcklwd m0, m1, m4               ; m0 = m128iA
     punpckhwd m1, m4                   ; m1 = m128iD
 
-    mova m6, IDCT4_OFFSET
-
     punpcklwd m2, m0, m1
     pmaddwd m3, m2, [r3 + 0 * 16]
-    paddd m3, m6                       ; m3 = E1
+    paddd m3, [pd_ %+ IDCT_ROUND]      ; m3 = E1
 
     pmaddwd m2, [r3 + 2 * 16]
-    paddd m2, m6                       ; m2 = E2
+    paddd m2, [pd_ %+ IDCT_ROUND]      ; m2 = E2
 
     punpckhwd m0, m1
     pmaddwd m1, m0, [r3 + 1 * 16]      ; m1 = O1
     pmaddwd m0, [r3 + 3 * 16]          ; m0 = O2
 
     paddd m4, m3, m1
-    psrad m4, IDCT4_SHIFT              ; m4 = m128iA
+    psrad m4, IDCT_SHIFT               ; m4 = m128iA
     paddd m5, m2, m0
-    psrad m5, IDCT4_SHIFT
+    psrad m5, IDCT_SHIFT
     packssdw m4, m5                    ; m4 = m128iA
 
     psubd m2, m0
-    psrad m2, IDCT4_SHIFT
+    psrad m2, IDCT_SHIFT
     psubd m3, m1
-    psrad m3, IDCT4_SHIFT
+    psrad m3, IDCT_SHIFT
     packssdw m2, m3                    ; m2 = m128iD
 
     punpcklwd m1, m4, m2
@@ -576,7 +583,139 @@
     movlps [r1 + 2 * r2], m1
     lea r1, [r1 + 2 * r2]
     movhps [r1 + r2], m1
+    RET
+
+;------------------------------------------------------
+;void dst4(const int16_t* src, int16_t* dst, intptr_t srcStride)
+;------------------------------------------------------
+INIT_XMM sse2
+%if ARCH_X86_64
+cglobal dst4, 3, 4, 8+4
+    %define coef0 m8
+    %define coef1 m9
+    %define coef2 m10
+    %define coef3 m11
+%else ; ARCH_X86_64 = 0
+cglobal dst4, 3, 4, 8
+    %define coef0 [r3 + 0 * 16]
+    %define coef1 [r3 + 1 * 16]
+    %define coef2 [r3 + 2 * 16]
+    %define coef3 [r3 + 3 * 16]
+%endif ; ARCH_X86_64
+    mova m5, [pd_ %+ DST4_ROUND]
+    add r2d, r2d
+    lea r3, [tab_dst4]
+%if ARCH_X86_64
+    mova coef0, [r3 + 0 * 16]
+    mova coef1, [r3 + 1 * 16]
+    mova coef2, [r3 + 2 * 16]
+    mova coef3, [r3 + 3 * 16]
+%endif
+    movh m0, [r0 + 0 * r2]             ; load
+    movhps m0, [r0 + 1 * r2]
+    lea r0, [r0 + 2 * r2]
+    movh m1, [r0]
+    movhps m1, [r0 + r2]
+    pmaddwd m2, m0, coef0              ; DST1
+    pmaddwd m3, m1, coef0
+    pshufd m6, m2, q2301
+    pshufd m7, m3, q2301
+    paddd m2, m6
+    paddd m3, m7
+    pshufd m2, m2, q3120
+    pshufd m3, m3, q3120
+    punpcklqdq m2, m3
+    paddd m2, m5
+    psrad m2, DST4_SHIFT
+    pmaddwd m3, m0, coef1
+    pmaddwd m4, m1, coef1
+    pshufd m6, m4, q2301
+    pshufd m7, m3, q2301
+    paddd m4, m6
+    paddd m3, m7
+    pshufd m4, m4, q3120
+    pshufd m3, m3, q3120
+    punpcklqdq m3, m4
+    paddd m3, m5
+    psrad m3, DST4_SHIFT
+    packssdw m2, m3                    ; m2 = T70
+    pmaddwd m3, m0, coef2
+    pmaddwd m4, m1, coef2
+    pshufd m6, m4, q2301
+    pshufd m7, m3, q2301
+    paddd m4, m6
+    paddd m3, m7
+    pshufd m4, m4, q3120
+    pshufd m3, m3, q3120
+    punpcklqdq m3, m4
+    paddd m3, m5
+    psrad m3, DST4_SHIFT
+    pmaddwd m0, coef3
+    pmaddwd m1, coef3
+    pshufd m6, m0, q2301
+    pshufd m7, m1, q2301
+    paddd m0, m6
+    paddd m1, m7
+    pshufd m0, m0, q3120
+    pshufd m1, m1, q3120
+    punpcklqdq m0, m1
+    paddd m0, m5
+    psrad m0, DST4_SHIFT
+    packssdw m3, m0                    ; m3 = T71
+    mova m5, [pd_128]
+
+    pmaddwd m0, m2, coef0              ; DST2
+    pmaddwd m1, m3, coef0
+    pshufd m6, m0, q2301
+    pshufd m7, m1, q2301
+    paddd m0, m6
+    paddd m1, m7
+    pshufd m0, m0, q3120
+    pshufd m1, m1, q3120
+    punpcklqdq m0, m1
+    paddd m0, m5
+    psrad m0, 8
+
+    pmaddwd m4, m2, coef1
+    pmaddwd m1, m3, coef1
+    pshufd m6, m4, q2301
+    pshufd m7, m1, q2301
+    paddd m4, m6
+    paddd m1, m7
+    pshufd m4, m4, q3120
+    pshufd m1, m1, q3120
+    punpcklqdq m4, m1
+    paddd m4, m5
+    psrad m4, 8
+    packssdw m0, m4
+    movu [r1 + 0 * 16], m0
+
+    pmaddwd m0, m2, coef2
+    pmaddwd m1, m3, coef2
+    pshufd m6, m0, q2301
+    pshufd m7, m1, q2301
+    paddd m0, m6
+    paddd m1, m7
+    pshufd m0, m0, q3120
+    pshufd m1, m1, q3120
+    punpcklqdq m0, m1
+    paddd m0, m5
+    psrad m0, 8
+
+    pmaddwd m2, coef3
+    pmaddwd m3, coef3
+    pshufd m6, m2, q2301
+    pshufd m7, m3, q2301
+    paddd m2, m6
+    paddd m3, m7
+    pshufd m2, m2, q3120
+    pshufd m3, m3, q3120
+    punpcklqdq m2, m3
+    paddd m2, m5
+    psrad m2, 8
+    packssdw m0, m2
+    movu [r1 + 1 * 16], m0
     RET
 
 ;------------------------------------------------------
@@ -595,13 +734,7 @@
     %define coef0 m6
     %define coef1 m7
 
-%if BIT_DEPTH == 8
-    %define DST_SHIFT 1
-    mova m5, [pd_1]
-%elif BIT_DEPTH == 10
-    %define DST_SHIFT 3
-    mova m5, [pd_4]
-%endif
+    mova m5, [pd_ %+ DST4_ROUND]
    add r2d, r2d
     lea r3, [tab_dst4]
     mova coef0, [r3 + 0 * 16]
@@ -621,23 +754,23 @@
     pmaddwd m3, m1, coef0
     phaddd m2, m3
     paddd m2, m5
-    psrad m2, DST_SHIFT
+    psrad m2, DST4_SHIFT
     pmaddwd m3, m0, coef1
     pmaddwd m4, m1, coef1
     phaddd m3, m4
     paddd m3, m5
-    psrad m3, DST_SHIFT
+    psrad m3, DST4_SHIFT
     packssdw m2, m3                    ; m2 = T70
     pmaddwd m3, m0, coef2
     pmaddwd m4, m1, coef2
     phaddd m3, m4
     paddd m3, m5
-    psrad m3, DST_SHIFT
+    psrad m3, DST4_SHIFT
     pmaddwd m0, coef3
     pmaddwd m1, coef3
     phaddd m0, m1
     paddd m0, m5
-    psrad m0, DST_SHIFT
+    psrad m0, DST4_SHIFT
     packssdw m3, m0                    ; m3 = T71
     mova m5, [pd_128]
 
@@ -668,7 +801,6 @@
     psrad m2, 8
     packssdw m0, m2
     movu [r1 + 1 * 16], m0
-
     RET
 
 ;------------------------------------------------------------------
@@ -676,13 +808,7 @@
 ;------------------------------------------------------------------
 INIT_YMM avx2
 cglobal dst4, 3, 4, 6
-%if BIT_DEPTH == 8
-    %define DST_SHIFT 1
-    vpbroadcastd m5, [pd_1]
-%elif BIT_DEPTH == 10
-    %define DST_SHIFT 3
-    vpbroadcastd m5, [pd_4]
-%endif
+    vbroadcasti128 m5, [pd_ %+ DST4_ROUND]
     mova m4, [trans8_shuf]
     add r2d, r2d
     lea r3, [pw_dst4_tab]
@@ -699,12 +825,12 @@
     pmaddwd m1, m0, [r3 + 1 * 32]
     phaddd m2, m1
     paddd m2, m5
-    psrad m2, DST_SHIFT
+    psrad m2, DST4_SHIFT
     pmaddwd m3, m0, [r3 + 2 * 32]
    pmaddwd m1, m0, [r3 + 3 * 32]
     phaddd m3, m1
     paddd m3, m5
-    psrad m3, DST_SHIFT
+    psrad m3, DST4_SHIFT
     packssdw m2, m3
     vpermd m2, m4, m2
 
@@ -729,15 +855,7 @@
 ;-------------------------------------------------------
 INIT_XMM sse2
 cglobal idst4, 3, 4, 7
-%if BIT_DEPTH == 8
-    mova m6, [pd_2048]
-    %define IDCT4_SHIFT 12
-%elif BIT_DEPTH == 10
-    mova m6, [pd_512]
-    %define IDCT4_SHIFT 10
-%else
-    %error Unsupported BIT_DEPTH!
-%endif
+    mova m6, [pd_ %+ IDCT_ROUND]
     add r2d, r2d
     lea r3, [tab_idst4]
     mova m5, [pd_64]
@@ -785,23 +903,23 @@
     pmaddwd m3, m2, [r3 + 1 * 16]
     paddd m0, m3
     paddd m0, m6
-    psrad m0, IDCT4_SHIFT              ; m0 = S0
+    psrad m0, IDCT_SHIFT               ; m0 = S0
     pmaddwd m3, m1, [r3 + 2 * 16]
     pmaddwd m4, m2, [r3 + 3 * 16]
     paddd m3, m4
     paddd m3, m6
-    psrad m3, IDCT4_SHIFT              ; m3 = S8
+    psrad m3, IDCT_SHIFT               ; m3 = S8
     packssdw m0, m3                    ; m0 = m128iA
     pmaddwd m3, m1, [r3 + 4 * 16]
     pmaddwd m4, m2, [r3 + 5 * 16]
     paddd m3, m4
     paddd m3, m6
-    psrad m3, IDCT4_SHIFT              ; m3 = S0
+    psrad m3, IDCT_SHIFT               ; m3 = S0
     pmaddwd m1, [r3 + 6 * 16]
     pmaddwd m2, [r3 + 7 * 16]
     paddd m1, m2
     paddd m1, m6
-    psrad m1, IDCT4_SHIFT              ; m1 = S8
+    psrad m1, IDCT_SHIFT               ; m1 = S8
     packssdw m3, m1                    ; m3 = m128iD
     punpcklwd m1, m0, m3
     punpckhwd m0, m3
@@ -821,15 +939,7 @@
 ;-----------------------------------------------------------------
 INIT_YMM avx2
 cglobal idst4, 3, 4, 6
-%if BIT_DEPTH == 8
-    vpbroadcastd m4, [pd_2048]
-    %define IDCT4_SHIFT 12
-%elif BIT_DEPTH == 10
-    vpbroadcastd m4, [pd_512]
-    %define IDCT4_SHIFT 10
-%else
-    %error Unsupported BIT_DEPTH!
-%endif
+    vbroadcasti128 m4, [pd_ %+ IDCT_ROUND]
     add r2d, r2d
     lea r3, [pw_idst4_tab]
 
@@ -870,12 +980,12 @@
     pmaddwd m3, m2, [r3 + 1 * 32]
     paddd m0, m3
     paddd m0, m4
-    psrad m0, IDCT4_SHIFT
+    psrad m0, IDCT_SHIFT
     pmaddwd m3, m1, [r3 + 2 * 32]
     pmaddwd m2, m2, [r3 + 3 * 32]
     paddd m3, m2
     paddd m3, m4
-    psrad m3, IDCT4_SHIFT
+    psrad m3, IDCT_SHIFT
     packssdw m0, m3
     pshufb m1, m0, [pb_idst4_shuf]
 
@@ -906,17 +1016,6 @@
 ; ...
 ; Row6[4-7] Row7[4-7]
 ;------------------------
-%if BIT_DEPTH == 10
-    %define DCT_SHIFT1 4
-    %define DCT_ADD1 [pd_8]
-%elif BIT_DEPTH == 8
-    %define DCT_SHIFT1 2
-    %define DCT_ADD1 [pd_2]
-%else
-    %error Unsupported BIT_DEPTH!
-%endif
-%define DCT_ADD2 [pd_256]
-%define DCT_SHIFT2 9
     add r2, r2
     lea r3, [r2 * 3]
 
@@ -962,8 +1061,8 @@
     punpckhqdq m7, m5
     punpcklqdq m1, m5
     paddd m1, m7
-    paddd m1, DCT_ADD1
-    psrad m1, DCT_SHIFT1
+    paddd m1, [pd_ %+ DCT8_ROUND1]
+    psrad m1, DCT8_SHIFT1
 %if x == 1
     pshufd m1, m1, 0x1B
 %endif
@@ -977,8 +1076,8 @@
     punpckhqdq m7, m5
     punpcklqdq m1, m5
     paddd m1, m7
-    paddd m1, DCT_ADD1
-    psrad m1, DCT_SHIFT1
+    paddd m1, [pd_ %+ DCT8_ROUND1]
+    psrad m1, DCT8_SHIFT1
 %if x == 1
     pshufd m1, m1, 0x1B
 %endif
@@ -992,8 +1091,8 @@
     punpckhqdq m7, m5
     punpcklqdq m1, m5
     paddd m1, m7
-    paddd m1, DCT_ADD1
-    psrad m1, DCT_SHIFT1
+    paddd m1, [pd_ %+ DCT8_ROUND1]
+    psrad m1, DCT8_SHIFT1
 %if x == 1
     pshufd m1, m1, 0x1B
 %endif
@@ -1007,8 +1106,8 @@
     punpckhqdq m7, m0
     punpcklqdq m4, m0
     paddd m4, m7
-    paddd m4, DCT_ADD1
-    psrad m4, DCT_SHIFT1
+    paddd m4, [pd_ %+ DCT8_ROUND1]
+    psrad m4, DCT8_SHIFT1
 %if x == 1
     pshufd m4, m4, 0x1B
 %endif
@@ -1026,29 +1125,29 @@
     pshuflw m2, m2, 0xD8
     pshufhw m2, m2, 0xD8
     pmaddwd m3, m0, [r4 + 0*16]
-    paddd m3, DCT_ADD1
-    psrad m3, DCT_SHIFT1
+    paddd m3, [pd_ %+ DCT8_ROUND1]
+    psrad m3, DCT8_SHIFT1
 %if x == 1
     pshufd m3, m3, 0x1B
 %endif
     mova [r5 + 0*2*mmsize], m3         ; Row 0
     pmaddwd m0, [r4 + 2*16]
-    paddd m0, DCT_ADD1
-    psrad m0, DCT_SHIFT1
+    paddd m0, [pd_ %+ DCT8_ROUND1]
+    psrad m0, DCT8_SHIFT1
 %if x == 1
     pshufd m0, m0, 0x1B
 %endif
     mova [r5 + 4*2*mmsize], m0         ; Row 4
     pmaddwd m3, m2, [r4 + 1*16]
-    paddd m3, DCT_ADD1
-    psrad m3, DCT_SHIFT1
+    paddd m3, [pd_ %+ DCT8_ROUND1]
+    psrad m3, DCT8_SHIFT1
 %if x == 1
     pshufd m3, m3, 0x1B
 %endif
     mova [r5 + 2*2*mmsize], m3         ; Row 2
     pmaddwd m2, [r4 + 3*16]
-    paddd m2, DCT_ADD1
-    psrad m2, DCT_SHIFT1
+    paddd m2, [pd_ %+ DCT8_ROUND1]
+    psrad m2, DCT8_SHIFT1
 %if x == 1
     pshufd m2, m2, 0x1B
 %endif
@@ -1108,16 +1207,16 @@
     punpckhqdq m7, m5
     punpcklqdq m3, m5
     paddd m3, m7                       ; m3 = [Row2 Row0]
-    paddd m3, DCT_ADD2
-    psrad m3, DCT_SHIFT2
+    paddd m3, [pd_ %+ DCT8_ROUND2]
+    psrad m3, DCT8_SHIFT2
     pshufd m4, m4, 0xD8
     pshufd m2, m2, 0xD8
     mova m7, m4
     punpckhqdq m7, m2
     punpcklqdq m4, m2
     psubd m4, m7                       ; m4 = [Row6 Row4]
-    paddd m4, DCT_ADD2
-    psrad m4, DCT_SHIFT2
+    paddd m4, [pd_ %+ DCT8_ROUND2]
+    psrad m4, DCT8_SHIFT2
 
     packssdw m3, m3
     movd [r1 + 0*mmsize], m3
@@ -1178,8 +1277,8 @@
     punpckhqdq m7, m4
     punpcklqdq m2, m4
     paddd m2, m7                       ; m2 = [Row3 Row1]
-    paddd m2, DCT_ADD2
-    psrad m2, DCT_SHIFT2
+    paddd m2, [pd_ %+ DCT8_ROUND2]
+    psrad m2, DCT8_SHIFT2
 
     packssdw m2, m2
     movd [r1 + 1*mmsize], m2
@@ -1234,8 +1333,8 @@
     punpckhqdq m7, m4
     punpcklqdq m2, m4
     paddd m2, m7                       ; m2 = [Row7 Row5]
-    paddd m2, DCT_ADD2
-    psrad m2, DCT_SHIFT2
+    paddd m2, [pd_ %+ DCT8_ROUND2]
+    psrad m2, DCT8_SHIFT2
 
     packssdw m2, m2
     movd [r1 + 5*mmsize], m2
@@ -1249,10 +1348,6 @@
 %endrep
     RET
 
-%undef IDCT_SHIFT1
-%undef IDCT_ADD1
-%undef IDCT_SHIFT2
-%undef IDCT_ADD2
 ;-------------------------------------------------------
 ; void dct8(const int16_t* src, int16_t* dst, intptr_t srcStride)
 ;-------------------------------------------------------
@@ -1269,15 +1364,7 @@
 ; ...
 ; Row6[4-7] Row7[4-7]
 ;------------------------
-%if BIT_DEPTH == 10
-    %define DCT_SHIFT 4
-    mova m6, [pd_8]
-%elif BIT_DEPTH == 8
-    %define DCT_SHIFT 2
-    mova m6, [pd_2]
-%else
-    %error Unsupported BIT_DEPTH!
-%endif
+    mova m6, [pd_ %+ DCT8_ROUND1]
     add r2, r2
     lea r3, [r2 * 3]
 
@@ -1319,7 +1406,7 @@
     pmaddwd m5, m0, [r4 + 0*16]
     phaddd m1, m5
     paddd m1, m6
-    psrad m1, DCT_SHIFT
+    psrad m1, DCT8_SHIFT1
 %if x == 1
     pshufd m1, m1, 0x1B
 %endif
@@ -1329,7 +1416,7 @@
     pmaddwd m5, m0, [r4 + 1*16]
     phaddd m1, m5
     paddd m1, m6
-    psrad m1, DCT_SHIFT
+    psrad m1, DCT8_SHIFT1
 %if x == 1
     pshufd m1, m1, 0x1B
 %endif
@@ -1339,7 +1426,7 @@
     pmaddwd m5, m0, [r4 + 2*16]
     phaddd m1, m5
     paddd m1, m6
-    psrad m1, DCT_SHIFT
+    psrad m1, DCT8_SHIFT1
 %if x == 1
     pshufd m1, m1, 0x1B
 %endif
@@ -1349,7 +1436,7 @@
     pmaddwd m0, [r4 + 3*16]
     phaddd m4, m0
     paddd m4, m6
-    psrad m4, DCT_SHIFT
+    psrad m4, DCT8_SHIFT1
 %if x == 1
     pshufd m4, m4, 0x1B
 %endif
@@ -1364,28 +1451,28 @@
     pshufb m2, [pb_unpackhlw1]
     pmaddwd m3, m0, [r4 + 0*16]
     paddd m3, m6
-    psrad m3, DCT_SHIFT
+    psrad m3, DCT8_SHIFT1
 %if x == 1
     pshufd m3, m3, 0x1B
 %endif
     mova [r5 + 0*2*mmsize], m3         ; Row 0
     pmaddwd m0, [r4 + 2*16]
     paddd m0, m6
-    psrad m0, DCT_SHIFT
+    psrad m0, DCT8_SHIFT1
 %if x == 1
     pshufd m0, m0, 0x1B
 %endif
     mova [r5 + 4*2*mmsize], m0         ; Row 4
     pmaddwd m3, m2, [r4 + 1*16]
     paddd m3, m6
-    psrad m3, DCT_SHIFT
+    psrad m3, DCT8_SHIFT1
 %if x == 1
     pshufd m3, m3, 0x1B
 %endif
     mova [r5 + 2*2*mmsize], m3         ; Row 2
     pmaddwd m2, [r4 + 3*16]
     paddd m2, m6
-    psrad m2, DCT_SHIFT
+    psrad m2, DCT8_SHIFT1
 %if x == 1
     pshufd m2, m2, 0x1B
 %endif
@@ -1483,16 +1570,6 @@
 ;-------------------------------------------------------
 %if ARCH_X86_64
 INIT_XMM sse2
-%if BIT_DEPTH == 10
-    %define IDCT_SHIFT 10
-    %define IDCT_ADD pd_512
-%elif BIT_DEPTH == 8
-    %define IDCT_SHIFT 12
-    %define IDCT_ADD pd_2048
-%else
-    %error Unsupported BIT_DEPTH!
-%endif
-
 cglobal idct8, 3, 6, 16, 0-5*mmsize
     mova m9, [r0 + 1 * mmsize]
     mova m1, [r0 + 3 * mmsize]
@@ -1742,18 +1819,19 @@
     psubd m10, m2
     mova m2, m4
     pmaddwd m12, [tab_dct4 + 3 * mmsize]
-    paddd m0, [IDCT_ADD]
-    paddd m1, [IDCT_ADD]
-    paddd m8, [IDCT_ADD]
-    paddd m10, [IDCT_ADD]
+    mova m15, [pd_ %+ IDCT_ROUND]
+    paddd m0, m15
+    paddd m1, m15
+    paddd m8, m15
+    paddd m10, m15
     paddd m2, m13
     paddd m3, m12
-    paddd m2, [IDCT_ADD]
-    paddd m3, [IDCT_ADD]
+    paddd m2, m15
+    paddd m3, m15
     psubd m4, m13
     psubd m6, m12
-    paddd m4, [IDCT_ADD]
-    paddd m6, [IDCT_ADD]
+    paddd m4, m15
+    paddd m6, m15
     mova m15, [rsp + 4 * mmsize]
     mova m12, m8
     psubd m8, m7
@@ -1849,16 +1927,12 @@
     movq [r1 + r3 * 2 + 8], m8
     movhps [r1 + r0 + 8], m8
     RET
-
-%undef IDCT_SHIFT
-%undef IDCT_ADD
 %endif
 
 ;-------------------------------------------------------
 ; void idct8(const int16_t* src, int16_t* dst, intptr_t dstStride)
 ;-------------------------------------------------------
 INIT_XMM ssse3
-
 cglobal patial_butterfly_inverse_internal_pass1
     movh m0, [r0]
     movhps m0, [r0 + 2 * 16]
@@ -1950,13 +2024,6 @@
     ret
 
 %macro PARTIAL_BUTTERFLY_PROCESS_ROW 1
-%if BIT_DEPTH == 10
-    %define IDCT_SHIFT 10
-%elif BIT_DEPTH == 8
-    %define IDCT_SHIFT 12
-%else
-    %error Unsupported BIT_DEPTH!
-%endif
    pshufb m4, %1, [pb_idct8even]
     pmaddwd m4, [tab_idct8_1]
     phsubd m5, m4
@@ -1978,11 +2045,10 @@
     pshufd m4, m4, 0x1B
     packssdw %1, m4
-%undef IDCT_SHIFT
 %endmacro
 
+INIT_XMM ssse3
 cglobal patial_butterfly_inverse_internal_pass2
-
     mova m0, [r5]
     PARTIAL_BUTTERFLY_PROCESS_ROW m0
     movu [r1], m0
@@ -1998,9 +2064,9 @@
     mova m3, [r5 + 48]
     PARTIAL_BUTTERFLY_PROCESS_ROW m3
     movu [r1 + r3], m3
-
     ret
 
+INIT_XMM ssse3
 cglobal idct8, 3,7,8 ;,0-16*mmsize
     ; alignment stack to 64-bytes
     mov r5, rsp
@@ -2019,13 +2085,7 @@
 
     call patial_butterfly_inverse_internal_pass1
 
-%if BIT_DEPTH == 10
-    mova m6, [pd_512]
-%elif BIT_DEPTH == 8
-    mova m6, [pd_2048]
-%else
-    %error Unsupported BIT_DEPTH!
-%endif
+    mova m6, [pd_ %+ IDCT_ROUND]
     add r2, r2
     lea r3, [r2 * 3]
     lea r4, [tab_idct8_2]
@@ -2150,7 +2210,10 @@
 
 INIT_YMM avx2
 cglobal dct8, 3, 7, 11, 0-8*16
-%if BIT_DEPTH == 10
+%if BIT_DEPTH == 12
    %define DCT_SHIFT 6
+    vbroadcasti128 m5, [pd_16]
+%elif BIT_DEPTH == 10
     %define DCT_SHIFT 4
     vbroadcasti128 m5, [pd_8]
 %elif BIT_DEPTH == 8
@@ -2316,7 +2379,10 @@
 %endmacro
 
 INIT_YMM avx2
 cglobal dct16, 3, 9, 16, 0-16*mmsize
-%if BIT_DEPTH == 10
+%if BIT_DEPTH == 12
+    %define DCT_SHIFT 7
+    vbroadcasti128 m9, [pd_64]
+%elif BIT_DEPTH == 10
     %define DCT_SHIFT 5
     vbroadcasti128 m9, [pd_16]
 %elif BIT_DEPTH == 8
@@ -2539,7 +2605,10 @@
 
 INIT_YMM avx2
 cglobal dct32, 3, 9, 16, 0-64*mmsize
-%if BIT_DEPTH == 10
+%if BIT_DEPTH == 12
+    %define DCT_SHIFT 8
+    vpbroadcastq m9, [pd_128]
+%elif BIT_DEPTH == 10
     %define DCT_SHIFT 6
     vpbroadcastq m9, [pd_32]
 %elif BIT_DEPTH == 8
@@ -2833,7 +2902,10 @@
 
 INIT_YMM avx2
 cglobal idct8, 3, 7, 13, 0-8*16
-%if BIT_DEPTH == 10
+%if BIT_DEPTH == 12
+    %define IDCT_SHIFT2 8
+    vpbroadcastd m12, [pd_256]
+%elif BIT_DEPTH == 10
     %define IDCT_SHIFT2 10
     vpbroadcastd m12, [pd_512]
 %elif BIT_DEPTH == 8
@@ -2991,7 +3063,10 @@
 ;-------------------------------------------------------
 INIT_YMM avx2
 cglobal idct16, 3, 7, 16, 0-16*mmsize
-%if BIT_DEPTH == 10
+%if BIT_DEPTH == 12
+    %define IDCT_SHIFT2 8
+    vpbroadcastd m15, [pd_256]
+%elif BIT_DEPTH == 10
     %define IDCT_SHIFT2 10
     vpbroadcastd m15, [pd_512]
 %elif BIT_DEPTH == 8
@@ -3410,7 +3485,10 @@
     dec r5d
     jnz .pass1
 
-%if BIT_DEPTH == 10
+%if BIT_DEPTH == 12
+    %define IDCT_SHIFT2 8
+    vpbroadcastd m15, [pd_256]
+%elif BIT_DEPTH == 10
     %define IDCT_SHIFT2 10
     vpbroadcastd m15, [pd_512]
 %elif BIT_DEPTH == 8
@@ -3571,7 +3649,10 @@
 cglobal idct4, 3, 4, 6
 
 %define IDCT_SHIFT1 7
-%if BIT_DEPTH == 10
+%if BIT_DEPTH == 12
+    %define IDCT_SHIFT2 8
+    vpbroadcastd m5, [pd_256]
+%elif BIT_DEPTH == 10
    %define IDCT_SHIFT2 10
     vpbroadcastd m5, [pd_512]
 %elif BIT_DEPTH == 8
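The new shift/round tables are what makes the Main12 additions in this file mechanical: they encode HEVC's standard transform scaling, where the first forward-DCT stage shifts by log2(N) + bitDepth - 9, the final inverse stage shifts by 20 - bitDepth, and every ROUND constant is 1 << (shift - 1). A quick self-contained check of the values above (a sketch for verification, not x265 source):

#include <assert.h>

static int dct_shift1(int log2N, int bitDepth) { return log2N + bitDepth - 9; }
static int idct_shift(int bitDepth)            { return 20 - bitDepth; }

int main(void)
{
    /* DCT4 first stage, 12-bit: shift 5, round 16 */
    assert(dct_shift1(2, 12) == 5  && (1 << (5 - 1))  == 16);
    /* DCT8 first stage, 12-bit: shift 6, round 32 */
    assert(dct_shift1(3, 12) == 6  && (1 << (6 - 1))  == 32);
    /* inverse transform last stage: 12-bit -> 8/128, 8-bit -> 12/2048 */
    assert(idct_shift(12) == 8  && (1 << (8 - 1))  == 128);
    assert(idct_shift(8)  == 12 && (1 << (12 - 1)) == 2048);
    /* DCT8 second stage is bit-depth independent: log2(8) + 6 = 9, round 256 */
    assert(3 + 6 == 9 && (1 << (9 - 1)) == 256);
    return 0;
}

This is also why the asm switches from symbolic [pd_4]/[pd_8] loads to the token-pasted [pd_ %+ DCT4_ROUND] form: one %define table per bit depth replaces a dozen scattered %if BIT_DEPTH blocks.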
View file
x265_1.7.tar.gz/source/common/x86/dct8.h -> x265_1.8.tar.gz/source/common/x86/dct8.h
Changed
@@ -23,27 +23,23 @@
 #ifndef X265_DCT8_H
 #define X265_DCT8_H
 
-void x265_dct4_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride);
-void x265_dct8_sse2(const int16_t* src, int16_t* dst, intptr_t srcStride);
-void x265_dst4_ssse3(const int16_t* src, int16_t* dst, intptr_t srcStride);
-void x265_dst4_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
-void x265_dct8_sse4(const int16_t* src, int16_t* dst, intptr_t srcStride);
-void x265_dct4_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
-void x265_dct8_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
-void x265_dct16_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
-void x265_dct32_avx2(const int16_t* src, int16_t* dst, intptr_t srcStride);
-void x265_idst4_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
-void x265_idst4_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride);
-void x265_idct4_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
-void x265_idct4_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride);
-void x265_idct8_sse2(const int16_t* src, int16_t* dst, intptr_t dstStride);
-void x265_idct8_ssse3(const int16_t* src, int16_t* dst, intptr_t dstStride);
-void x265_idct8_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride);
-void x265_idct16_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride);
-void x265_idct32_avx2(const int16_t* src, int16_t* dst, intptr_t dstStride);
+FUNCDEF_TU_S2(void, dct, sse2, const int16_t* src, int16_t* dst, intptr_t srcStride);
+FUNCDEF_TU_S2(void, dct, ssse3, const int16_t* src, int16_t* dst, intptr_t srcStride);
+FUNCDEF_TU_S2(void, dct, sse4, const int16_t* src, int16_t* dst, intptr_t srcStride);
+FUNCDEF_TU_S2(void, dct, avx2, const int16_t* src, int16_t* dst, intptr_t srcStride);
 
-void x265_denoise_dct_sse4(int16_t* dct, uint32_t* sum, const uint16_t* offset, int size);
-void x265_denoise_dct_avx2(int16_t* dct, uint32_t* sum, const uint16_t* offset, int size);
+FUNCDEF_TU_S2(void, idct, sse2, const int16_t* src, int16_t* dst, intptr_t dstStride);
+FUNCDEF_TU_S2(void, idct, ssse3, const int16_t* src, int16_t* dst, intptr_t dstStride);
+FUNCDEF_TU_S2(void, idct, sse4, const int16_t* src, int16_t* dst, intptr_t dstStride);
+FUNCDEF_TU_S2(void, idct, avx2, const int16_t* src, int16_t* dst, intptr_t dstStride);
+
+void PFX(dst4_ssse3)(const int16_t* src, int16_t* dst, intptr_t srcStride);
+void PFX(dst4_sse2)(const int16_t* src, int16_t* dst, intptr_t srcStride);
+void PFX(idst4_sse2)(const int16_t* src, int16_t* dst, intptr_t srcStride);
+void PFX(dst4_avx2)(const int16_t* src, int16_t* dst, intptr_t srcStride);
+void PFX(idst4_avx2)(const int16_t* src, int16_t* dst, intptr_t srcStride);
+void PFX(denoise_dct_sse4)(int16_t* dct, uint32_t* sum, const uint16_t* offset, int size);
+void PFX(denoise_dct_avx2)(int16_t* dct, uint32_t* sum, const uint16_t* offset, int size);
 
 #endif // ifndef X265_DCT8_H
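As in the pixel header, hand-written per-size prototypes give way to macros; only dst4/idst4 stay explicit, since the 4x4 DST has no larger siblings. A hedged sketch of the shape these macros presumably take (the real PFX/FUNCDEF_TU_S2 definitions live elsewhere in the 1.8 tree; the suffix spelling here is an assumption):

#define PFX(name) x265_ ## name   /* assumed: build-dependent symbol prefix */

/* sketch: one line per (primitive, cpu) covers all four transform sizes */
#define FUNCDEF_TU_S2(ret, name, cpu, ...) \
    ret PFX(name ## 4_ ## cpu)(__VA_ARGS__);  \
    ret PFX(name ## 8_ ## cpu)(__VA_ARGS__);  \
    ret PFX(name ## 16_ ## cpu)(__VA_ARGS__); \
    ret PFX(name ## 32_ ## cpu)(__VA_ARGS__)

/* e.g. FUNCDEF_TU_S2(void, idct, avx2, ...) would declare x265_idct4_avx2
 * through x265_idct32_avx2 in one line */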
View file
x265_1.7.tar.gz/source/common/x86/intrapred.h -> x265_1.8.tar.gz/source/common/x86/intrapred.h
Changed
@@ -26,262 +26,68 @@
 #ifndef X265_INTRAPRED_H
 #define X265_INTRAPRED_H
 
-void x265_intra_pred_dc4_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
-void x265_intra_pred_dc8_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
-void x265_intra_pred_dc16_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
-void x265_intra_pred_dc32_sse2(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
-void x265_intra_pred_dc4_sse4(pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
-void x265_intra_pred_dc8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
-void x265_intra_pred_dc16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
-void x265_intra_pred_dc32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
-void x265_intra_pred_dc32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int filter);
-
-void x265_intra_pred_planar4_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
-void x265_intra_pred_planar8_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
-void x265_intra_pred_planar16_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
-void x265_intra_pred_planar32_sse2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
-void x265_intra_pred_planar4_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
-void x265_intra_pred_planar8_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
-void x265_intra_pred_planar16_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
-void x265_intra_pred_planar32_sse4(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
-void x265_intra_pred_planar16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
-void x265_intra_pred_planar32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int, int);
-
 #define DECL_ANG(bsize, mode, cpu) \
-    void x265_intra_pred_ang ## bsize ## _ ## mode ## _ ## cpu(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+    void PFX(intra_pred_ang ## bsize ## _ ## mode ## _ ## cpu)(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
+
+#define DECL_ANGS(bsize, cpu) \
+    DECL_ANG(bsize, 2, cpu); \
+    DECL_ANG(bsize, 3, cpu); \
+    DECL_ANG(bsize, 4, cpu); \
+    DECL_ANG(bsize, 5, cpu); \
+    DECL_ANG(bsize, 6, cpu); \
+    DECL_ANG(bsize, 7, cpu); \
+    DECL_ANG(bsize, 8, cpu); \
+    DECL_ANG(bsize, 9, cpu); \
+    DECL_ANG(bsize, 10, cpu); \
+    DECL_ANG(bsize, 11, cpu); \
+    DECL_ANG(bsize, 12, cpu); \
+    DECL_ANG(bsize, 13, cpu); \
+    DECL_ANG(bsize, 14, cpu); \
+    DECL_ANG(bsize, 15, cpu); \
+    DECL_ANG(bsize, 16, cpu); \
+    DECL_ANG(bsize, 17, cpu); \
+    DECL_ANG(bsize, 18, cpu); \
+    DECL_ANG(bsize, 19, cpu); \
+    DECL_ANG(bsize, 20, cpu); \
+    DECL_ANG(bsize, 21, cpu); \
+    DECL_ANG(bsize, 22, cpu); \
+    DECL_ANG(bsize, 23, cpu); \
+    DECL_ANG(bsize, 24, cpu); \
+    DECL_ANG(bsize, 25, cpu); \
+    DECL_ANG(bsize, 26, cpu); \
+    DECL_ANG(bsize, 27, cpu); \
+    DECL_ANG(bsize, 28, cpu); \
+    DECL_ANG(bsize, 29, cpu); \
+    DECL_ANG(bsize, 30, cpu); \
+    DECL_ANG(bsize, 31, cpu); \
+    DECL_ANG(bsize, 32, cpu); \
+    DECL_ANG(bsize, 33, cpu); \
+    DECL_ANG(bsize, 34, cpu)
 
-DECL_ANG(4, 2, sse2);
-DECL_ANG(4, 3, sse2);
-DECL_ANG(4, 4, sse2);
-DECL_ANG(4, 5, sse2);
-DECL_ANG(4, 6, sse2);
-DECL_ANG(4, 7, sse2);
-DECL_ANG(4, 8, sse2);
-DECL_ANG(4, 9, sse2);
-DECL_ANG(4, 10, sse2);
-DECL_ANG(4, 11, sse2);
-DECL_ANG(4, 12, sse2);
-DECL_ANG(4, 13, sse2);
-DECL_ANG(4, 14, sse2);
-DECL_ANG(4, 15, sse2);
-DECL_ANG(4, 16, sse2);
-DECL_ANG(4, 17, sse2);
-DECL_ANG(4, 18, sse2);
-DECL_ANG(4, 26, sse2);
+#define DECL_ALL(cpu) \
+    FUNCDEF_TU(void, all_angs_pred, cpu, pixel *dest, pixel *refPix, pixel *filtPix, int bLuma); \
+    FUNCDEF_TU(void, intra_filter, cpu, const pixel *samples, pixel *filtered); \
+    DECL_ANGS(4, cpu); \
+    DECL_ANGS(8, cpu); \
+    DECL_ANGS(16, cpu); \
+    DECL_ANGS(32, cpu)
 
-DECL_ANG(4, 2, ssse3);
-DECL_ANG(4, 3, sse4);
-DECL_ANG(4, 4, sse4);
-DECL_ANG(4, 5, sse4);
-DECL_ANG(4, 6, sse4);
-DECL_ANG(4, 7, sse4);
-DECL_ANG(4, 8, sse4);
-DECL_ANG(4, 9, sse4);
-DECL_ANG(4, 10, sse4);
-DECL_ANG(4, 11, sse4);
-DECL_ANG(4, 12, sse4);
-DECL_ANG(4, 13, sse4);
-DECL_ANG(4, 14, sse4);
-DECL_ANG(4, 15, sse4);
-DECL_ANG(4, 16, sse4);
-DECL_ANG(4, 17, sse4);
-DECL_ANG(4, 18, sse4);
-DECL_ANG(4, 26, sse4);
-DECL_ANG(8, 2, ssse3);
-DECL_ANG(8, 3, sse4);
-DECL_ANG(8, 4, sse4);
-DECL_ANG(8, 5, sse4);
-DECL_ANG(8, 6, sse4);
-DECL_ANG(8, 7, sse4);
-DECL_ANG(8, 8, sse4);
-DECL_ANG(8, 9, sse4);
-DECL_ANG(8, 10, sse4);
-DECL_ANG(8, 11, sse4);
-DECL_ANG(8, 12, sse4);
-DECL_ANG(8, 13, sse4);
-DECL_ANG(8, 14, sse4);
-DECL_ANG(8, 15, sse4);
-DECL_ANG(8, 16, sse4);
-DECL_ANG(8, 17, sse4);
-DECL_ANG(8, 18, sse4);
-DECL_ANG(8, 19, sse4);
-DECL_ANG(8, 20, sse4);
-DECL_ANG(8, 21, sse4);
-DECL_ANG(8, 22, sse4);
-DECL_ANG(8, 23, sse4);
-DECL_ANG(8, 24, sse4);
-DECL_ANG(8, 25, sse4);
-DECL_ANG(8, 26, sse4);
-DECL_ANG(8, 27, sse4);
-DECL_ANG(8, 28, sse4);
-DECL_ANG(8, 29, sse4);
-DECL_ANG(8, 30, sse4);
-DECL_ANG(8, 31, sse4);
-DECL_ANG(8, 32, sse4);
-DECL_ANG(8, 33, sse4);
+FUNCDEF_TU_S2(void, intra_pred_dc, sse2, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+FUNCDEF_TU_S2(void, intra_pred_dc, sse4, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+FUNCDEF_TU_S2(void, intra_pred_dc, avx2, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
 
-DECL_ANG(16, 2, ssse3);
-DECL_ANG(16, 3, sse4);
-DECL_ANG(16, 4, sse4);
-DECL_ANG(16, 5, sse4);
-DECL_ANG(16, 6, sse4);
-DECL_ANG(16, 7, sse4);
-DECL_ANG(16, 8, sse4);
-DECL_ANG(16, 9, sse4);
-DECL_ANG(16, 10, sse4);
-DECL_ANG(16, 11, sse4);
-DECL_ANG(16, 12, sse4);
-DECL_ANG(16, 13, sse4);
-DECL_ANG(16, 14, sse4);
-DECL_ANG(16, 15, sse4);
-DECL_ANG(16, 16, sse4);
-DECL_ANG(16, 17, sse4);
-DECL_ANG(16, 18, sse4);
-DECL_ANG(16, 19, sse4);
-DECL_ANG(16, 20, sse4);
-DECL_ANG(16, 21, sse4);
-DECL_ANG(16, 22, sse4);
-DECL_ANG(16, 23, sse4);
-DECL_ANG(16, 24, sse4);
-DECL_ANG(16, 25, sse4);
-DECL_ANG(16, 26, sse4);
-DECL_ANG(16, 27, sse4);
-DECL_ANG(16, 28, sse4);
-DECL_ANG(16, 29, sse4);
-DECL_ANG(16, 30, sse4);
-DECL_ANG(16, 31, sse4);
-DECL_ANG(16, 32, sse4);
-DECL_ANG(16, 33, sse4);
+FUNCDEF_TU_S2(void, intra_pred_planar, sse2, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+FUNCDEF_TU_S2(void, intra_pred_planar, sse4, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
+FUNCDEF_TU_S2(void, intra_pred_planar, avx2, pixel* dst, intptr_t dstStride, const pixel*srcPix, int, int filter);
 
-DECL_ANG(32, 2, ssse3);
-DECL_ANG(32, 3, sse4);
-DECL_ANG(32, 4, sse4);
-DECL_ANG(32, 5, sse4);
-DECL_ANG(32, 6, sse4);
-DECL_ANG(32, 7, sse4);
-DECL_ANG(32, 8, sse4);
-DECL_ANG(32, 9, sse4);
-DECL_ANG(32, 10, sse4);
-DECL_ANG(32, 11, sse4);
-DECL_ANG(32, 12, sse4);
-DECL_ANG(32, 13, sse4);
-DECL_ANG(32, 14, sse4);
-DECL_ANG(32, 15, sse4);
-DECL_ANG(32, 16, sse4);
-DECL_ANG(32, 17, sse4);
-DECL_ANG(32, 18, sse4);
-DECL_ANG(32, 19, sse4);
-DECL_ANG(32, 20, sse4);
-DECL_ANG(32, 21, sse4);
-DECL_ANG(32, 22, sse4);
-DECL_ANG(32, 23, sse4);
-DECL_ANG(32, 24, sse4);
-DECL_ANG(32, 25, sse4);
-DECL_ANG(32, 26, sse4);
-DECL_ANG(32, 27, sse4);
-DECL_ANG(32, 28, sse4);
-DECL_ANG(32, 29, sse4);
-DECL_ANG(32, 30, sse4);
-DECL_ANG(32, 31, sse4);
-DECL_ANG(32, 32, sse4);
-DECL_ANG(32, 33, sse4);
+DECL_ALL(sse2);
+DECL_ALL(ssse3);
+DECL_ALL(sse4);
+DECL_ALL(avx2);
+
+#undef DECL_ALL
+#undef DECL_ANGS
 #undef DECL_ANG
 
-void x265_intra_pred_ang4_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_14_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_15_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_17_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_19_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_20_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang4_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_14_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_15_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_16_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_20_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang8_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_4_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_5_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_6_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_7_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_8_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_9_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_12_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_11_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_13_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang16_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_34_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_2_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_26_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_27_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_28_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_29_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_30_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_31_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_32_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_33_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_25_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_24_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_23_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_22_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_21_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_intra_pred_ang32_18_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
-void x265_all_angs_pred_4x4_sse2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
-void x265_all_angs_pred_4x4_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
-void x265_all_angs_pred_8x8_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
-void x265_all_angs_pred_16x16_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
-void x265_all_angs_pred_32x32_sse4(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
-void x265_all_angs_pred_4x4_avx2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
+
+
 #endif // ifndef X265_INTRAPRED_H
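The arithmetic behind the collapse: modes 2 through 34 give 33 angular predictors per block size, times four block sizes (4, 8, 16, 32), so each DECL_ALL(cpu) line stands in for 132 intra_pred_ang prototypes plus the all_angs_pred and intra_filter entry points. An abbreviated view of what DECL_ALL(avx2) presumably unfolds to (illustrative; the intra_filter suffix spelling is an assumption, and pixel is the header's usual bit-depth-dependent typedef):

void x265_all_angs_pred_4x4_avx2(pixel *dest, pixel *refPix, pixel *filtPix, int bLuma);
void x265_intra_filter_4x4_avx2(const pixel *samples, pixel *filtered);
void x265_intra_pred_ang4_2_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
void x265_intra_pred_ang4_3_avx2(pixel* dst, intptr_t dstStride, const pixel* srcPix, int dirMode, int bFilter);
/* ... through ang4_34, then the same 33 modes for the 8, 16 and 32 sizes ... */

A side effect worth noting: the macro declares every (size, mode, cpu) combination up front, even ones with no assembly implementation yet, so adding a new kernel no longer requires touching this header.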
View file
x265_1.7.tar.gz/source/common/x86/intrapred16.asm -> x265_1.8.tar.gz/source/common/x86/intrapred16.asm
Changed
@@ -35,39 +35,52 @@
 %assign x x+1
 %endrep
 
-const shuf_mode_13_23, db 0, 0, 14, 15, 6, 7, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
-const shuf_mode_14_22, db 14, 15, 10, 11, 4, 5, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
-const shuf_mode_15_21, db 12, 13, 8, 9, 4, 5, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
-const shuf_mode_16_20, db 2, 3, 0, 1, 14, 15, 12, 13, 8, 9, 6, 7, 2, 3, 0, 1
-const shuf_mode_17_19, db 0, 1, 14, 15, 12, 13, 10, 11, 6, 7, 4, 5, 2, 3, 0, 1
-const shuf_mode32_18, db 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1
-const pw_punpcklwd, db 0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 6, 7, 8, 9
-const c_mode32_10_0, db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
-
-const pw_unpackwdq, times 8 db 0,1
-const pw_ang8_12, db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 0, 1
-const pw_ang8_13, db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 15, 8, 9, 0, 1
-const pw_ang8_14, db 0, 0, 0, 0, 0, 0, 0, 0, 14, 15, 10, 11, 4, 5, 0, 1
-const pw_ang8_15, db 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 8, 9, 4, 5, 0, 1
-const pw_ang8_16, db 0, 0, 0, 0, 0, 0, 12, 13, 10, 11, 6, 7, 4, 5, 0, 1
-const pw_ang8_17, db 0, 0, 14, 15, 12, 13, 10, 11, 8, 9, 4, 5, 2, 3, 0, 1
-const pw_swap16, db 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1
+const ang_table_avx2
+%assign x 0
+%rep 32
+    times 8 dw (32-x), x
+%assign x x+1
+%endrep
 
-const pw_ang16_13, db 14, 15, 8, 9, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
-const pw_ang16_16, db 0, 0, 0, 0, 0, 0, 10, 11, 8, 9, 6, 7, 2, 3, 0, 1
+const pw_ang16_12_24, db 0, 0, 0, 0, 0, 0, 0, 0, 14, 15, 14, 15, 0, 1, 0, 1
+const pw_ang16_13_23, db 2, 3, 2, 3, 14, 15, 14, 15, 6, 7, 6, 7, 0, 1, 0, 1
+const pw_ang16_14_22, db 2, 3, 2, 3, 10, 11, 10, 11, 6, 7, 6, 7, 0, 1, 0, 1
+const pw_ang16_15_21, db 12, 13, 12, 13, 8, 9, 8, 9, 4, 5, 4, 5, 0, 1, 0, 1
+const pw_ang16_16_20, db 8, 9, 8, 9, 6, 7, 6, 7, 2, 3, 2, 3, 0, 1, 0, 1
+
+const pw_ang32_12_24, db 0, 1, 0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 6, 7
+const pw_ang32_13_23, db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 15, 6, 7, 0, 1
+const pw_ang32_14_22, db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 11, 6, 7, 0, 1
+const pw_ang32_15_21, db 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 8, 9, 4, 5, 0, 1
+const pw_ang32_16_20, db 0, 0, 0, 0, 0, 0, 0, 0, 8, 9, 6, 7, 2, 3, 0, 1
+const pw_ang32_17_19_0, db 0, 0, 0, 0, 12, 13, 10, 11, 8, 9, 6, 7, 2, 3, 0, 1
+
+const shuf_mode_13_23, db 0, 0, 14, 15, 6, 7, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
+const shuf_mode_14_22, db 14, 15, 10, 11, 4, 5, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
+const shuf_mode_15_21, db 12, 13, 8, 9, 4, 5, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
+const shuf_mode_16_20, db 2, 3, 0, 1, 14, 15, 12, 13, 8, 9, 6, 7, 2, 3, 0, 1
+const shuf_mode_17_19, db 0, 1, 14, 15, 12, 13, 10, 11, 6, 7, 4, 5, 2, 3, 0, 1
+const shuf_mode32_18, db 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1
+const pw_punpcklwd, db 0, 1, 2, 3, 2, 3, 4, 5, 4, 5, 6, 7, 6, 7, 8, 9
+const c_mode32_10_0, db 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1
+
+const pw_ang8_12, db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 0, 1
+const pw_ang8_13, db 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 15, 8, 9, 0, 1
+const pw_ang8_14, db 0, 0, 0, 0, 0, 0, 0, 0, 14, 15, 10, 11, 4, 5, 0, 1
+const pw_ang8_15, db 0, 0, 0, 0, 0, 0, 0, 0, 12, 13, 8, 9, 4, 5, 0, 1
+const pw_ang8_16, db 0, 0, 0, 0, 0, 0, 12, 13, 10, 11, 6, 7, 4, 5, 0, 1
+const pw_ang8_17, db 0, 0, 14, 15, 12, 13, 10, 11, 8, 9, 4, 5, 2, 3, 0, 1
+const pw_swap16, times 2 db 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1
+
+const pw_ang16_13, db 14, 15, 8, 9, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+const pw_ang16_16, db 0, 0, 0, 0, 0, 0, 10, 11, 8, 9, 6, 7, 2, 3, 0, 1
+
+intra_filter4_shuf0: db 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ,11, 12, 13
+intra_filter4_shuf1: db 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ,11, 12, 13
+intra_filter4_shuf2: times 2 db 4, 5, 0, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
 
 ;; (blkSize - 1 - x)
-pw_planar4_0: dw 3, 2, 1, 0, 3, 2, 1, 0
-pw_planar4_1: dw 3, 3, 3, 3, 3, 3, 3, 3
-pw_planar8_0: dw 7, 6, 5, 4, 3, 2, 1, 0
-pw_planar8_1: dw 7, 7, 7, 7, 7, 7, 7, 7
-pw_planar16_0: dw 15, 14, 13, 12, 11, 10, 9, 8
-pw_planar16_1: dw 15, 15, 15, 15, 15, 15, 15, 15
-pd_planar32_1: dd 31, 31, 31, 31
-
-pw_planar32_1: dw 31, 31, 31, 31, 31, 31, 31, 31
-pw_planar32_L: dw 31, 30, 29, 28, 27, 26, 25, 24
-pw_planar32_H: dw 23, 22, 21, 20, 19, 18, 17, 16
+pw_planar4_0: dw 3, 2, 1, 0, 3, 2, 1, 0
 
 const planar32_table
 %assign x 31
@@ -85,16 +98,22 @@
 
 SECTION .text
 
+cextern pb_01
 cextern pw_1
 cextern pw_2
+cextern pw_3
+cextern pw_7
 cextern pw_4
 cextern pw_8
+cextern pw_15
 cextern pw_16
+cextern pw_31
 cextern pw_32
-cextern pw_1023
 cextern pd_16
+cextern pd_31
 cextern pd_32
 cextern pw_4096
+cextern pw_pixel_max
 cextern multiL
 cextern multiH
 cextern multiH2
@@ -103,6 +122,8 @@
 cextern pw_swap
 cextern pb_unpackwq1
 cextern pb_unpackwq2
+cextern pw_planar16_mul
+cextern pw_planar32_mul
 
 ;-----------------------------------------------------------------------------------
 ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* above, int, int filter)
 ;-----------------------------------------------------------------------------------
@@ -121,7 +142,7 @@
     test r4d, r4d
 
     paddw m0, [pw_4]
-    psraw m0, 3
+    psrlw m0, 3
 
     ; store DC 4x4
     movh [r0], m0
@@ -140,7 +161,7 @@
     ; filter top
     movh m1, [r2 + 2]
     paddw m1, m0
-    psraw m1, 2
+    psrlw m1, 2
     movh [r0], m1             ; overwrite top-left pixel, we will update it later
 
     ; filter top-left
@@ -155,7 +176,7 @@
     ; filter left
     movu m1, [r2 + 20]
     paddw m1, m0
-    psraw m1, 2
+    psrlw m1, 2
 
     movd r3d, m1
     mov [r0 + r1 * 2], r3w
     shr r3d, 16
@@ -181,7 +202,7 @@
 
     pmaddwd m0, [pw_1]
     paddw m0, [pw_8]
-    psraw m0, 4               ; sum = sum / 16
+    psrlw m0, 4               ; sum = sum / 16
     pshuflw m0, m0, 0
     pshufd m0, m0, 0          ; m0 = word [dc_val ...]
@@ -214,7 +235,7 @@
     ; filter top
     movu m0, [r2 + 2]
     paddw m0, m1
-    psraw m0, 2
+    psrlw m0, 2
     movu [r0], m0
 
     ; filter top-left
@@ -229,7 +250,7 @@
     ; filter left
     movu m0, [r2 + 36]
     paddw m0, m1
-    psraw m0, 2
+    psrlw m0, 2
     movh r3, m0
     mov [r0 + r1 * 2], r3w
     shr r3, 16
@@ -263,14 +284,10 @@
     paddw m0, m1
     paddw m2, m3
     paddw m0, m2
-    movhlps m1, m0
-    paddw m0, m1
-    pshuflw m1, m0, 0x6E
-    paddw m0, m1
-    pmaddwd m0, [pw_1]
+    HADDUW m0, m1
+    paddd m0, [pd_16]
+    psrld m0, 5
 
-    paddw m0, [pw_16]
-    psraw m0, 5
     movd r5d, m0
     pshuflw m0, m0, 0         ; m0 = word [dc_val ...]
     pshufd m0, m0, 0
@@ -326,11 +343,11 @@
     ; filter top
     movu m2, [r2 + 2]
     paddw m2, m1
-    psraw m2, 2
+    psrlw m2, 2
     movu [r0], m2
     movu m3, [r2 + 18]
     paddw m3, m1
-    psraw m3, 2
+    psrlw m3, 2
     movu [r0 + 16], m3
 
     ; filter top-left
@@ -345,7 +362,7 @@
     ; filter left
     movu m2, [r3 + 2]
     paddw m2, m1
-    psraw m2, 2
+    psrlw m2, 2
 
     movq r2, m2
     pshufd m2, m2, 0xEE
@@ -367,7 +384,7 @@
 
     movu m3, [r3 + 18]
     paddw m3, m1
-    psraw m3, 2
+    psrlw m3, 2
 
     movq r3, m3
     pshufd m3, m3, 0xEE
@@ -402,20 +419,19 @@
     paddw m0, m1
     paddw m2, m3
     paddw m0, m2
+    HADDUWD m0, m1
+
     movu m1, [r2]
-    movu m3, [r2 + 16]
-    movu m4, [r2 + 32]
-    movu m5, [r2 + 48]
+    movu m2, [r2 + 16]
+    movu m3, [r2 + 32]
+    movu m4, [r2 + 48]
+    paddw m1, m2
+    paddw m3, m4
     paddw m1, m3
-    paddw m4, m5
-    paddw m1, m4
-    paddw m0, m1
-    movhlps m1, m0
-    paddw m0, m1
-    pshuflw m1, m0, 0x6E
-    paddw m0, m1
-    pmaddwd m0, [pw_1]
+    HADDUWD m1, m2
+    paddd m0, m1
+    HADDD m0, m1
 
     paddd m0, [pd_32]         ; sum = sum + 32
     psrld m0, 6               ; sum = sum / 64
     pshuflw m0, m0, 0
@@ -448,6 +464,218 @@
 %endrep
     RET
 
+;-------------------------------------------------------------------------------------------------------
+; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel* left, pixel* above, int dirMode, int filter)
+;-------------------------------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal intra_pred_dc16, 3, 9, 4
+    mov r3d, r4m
+    add r1d, r1d
+    movu m0, [r2 + 66]
+    movu m2, [r2 + 2]
+    paddw m0, m2              ; dynamic range 13 bits
+
+    vextracti128 xm1, m0, 1
+    paddw xm0, xm1            ; dynamic range 14 bits
+    movhlps xm1, xm0
+    paddw xm0, xm1            ; dynamic range 15 bits
+    pmaddwd xm0, [pw_1]
+    phaddd xm0, xm0
+    paddd xm0, [pd_16]
+    psrld xm0, 5
+    movd r5d, xm0
+    vpbroadcastw m0, xm0
+
+    test r3d, r3d
+
+    ; store DC 16x16
+    lea r6, [r1 + r1 * 2]     ; index 3
+    lea r7, [r1 + r1 * 4]     ; index 5
+    lea r8, [r6 + r1 * 4]     ; index 7
+    lea r4, [r0 + r8 * 1]     ; base + 7
+
+    movu [r0], m0
+    movu [r0 + r1], m0
+    movu [r0 + r1 * 2], m0
+    movu [r0 + r6], m0
+    movu [r0 + r1 * 4], m0
+    movu [r0 + r7], m0
+    movu [r0 + r6 * 2], m0
+    movu [r4], m0
+    movu [r0 + r1 * 8], m0
+    movu [r4 + r1 * 2], m0
+    movu [r0 + r7 * 2], m0
+    movu [r4 + r1 * 4], m0
+    movu [r0 + r6 * 4], m0
+    movu [r4 + r6 * 2], m0
+    movu [r4 + r8], m0
+    movu [r4 + r1 * 8], m0
+
+    ; Do DC Filter
+    jz .end
+    mova m1, [pw_2]
+    pmullw m1, m0
+    paddw m1, [pw_2]
+    movd r3d, xm1
+    paddw m1, m0
+
+    ; filter top
+    movu m2, [r2 + 2]
+    paddw m2, m1
+    psrlw m2, 2
+    movu [r0], m2
+
+    ; filter top-left
+    movzx r3d, r3w
+    movzx r5d, word [r2 + 66]
+    add r3d, r5d
+    movzx r5d, word [r2 + 2]
+    add r5d, r3d
+    shr r5d, 2
+    mov [r0], r5w
+
+    ; filter left
+    movu m2, [r2 + 68]
+    paddw m2, m1
+    psrlw m2, 2
+    vextracti128 xm3, m2, 1
+
+    movq r3, xm2
+    pshufd xm2, xm2, 0xEE
+    mov [r0 + r1], r3w
+    shr r3, 16
+    mov [r0 + r1 * 2], r3w
+    shr r3, 16
+    mov [r0 + r6], r3w
+    shr r3, 16
+    mov [r0 + r1 * 4], r3w
+    movq r3, xm2
+    mov [r0 + r7], r3w
+    shr r3, 16
+    mov [r0 + r6 * 2], r3w
+    shr r3, 16
+    mov [r4], r3w
+    shr r3, 16
+    mov [r0 + r1 * 8], r3w
+
+    movq r3, xm3
+    pshufd xm3, xm3, 0xEE
+    mov [r4 + r1 * 2], r3w
+    shr r3, 16
+    mov [r0 + r7 * 2], r3w
+    shr r3, 16
+    mov [r4 + r1 * 4], r3w
+    shr r3, 16
+    mov [r0 + r6 * 4], r3w
+    movq r3, xm3
+    mov [r4 + r6 * 2], r3w
+    shr r3, 16
+    mov [r4 + r8], r3w
+    shr r3, 16
+    mov [r4 + r1 * 8], r3w
+.end:
+    RET
+
+;---------------------------------------------------------------------------------------------
+; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter)
+;---------------------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal intra_pred_dc32, 3,3,3
+    add r2, 2
+    add r1d, r1d
+    movu m0, [r2]
+    movu m1, [r2 + 32]
+    add r2, mmsize*4          ; r2 += 128
+    paddw m0, m1              ; dynamic range 13 bits
+    movu m1, [r2]
+    movu m2, [r2 + 32]
+    paddw m1, m2              ; dynamic range 13 bits
+    paddw m0, m1              ; dynamic range 14 bits
+    vextracti128 xm1, m0, 1
+    paddw xm0, xm1            ; dynamic range 15 bits
+    pmaddwd xm0, [pw_1]
+    movhlps xm1, xm0
+    paddd xm0, xm1
+    phaddd xm0, xm0
+    paddd xm0, [pd_32]        ; sum = sum + 32
+    psrld xm0, 6              ; sum = sum / 64
+    vpbroadcastw m0, xm0
+
+    lea r2, [r1 * 3]
+    ; store DC 32x32
+    movu [r0 + r1 * 0 + 0], m0
+    movu [r0 + r1 * 0 + mmsize], m0
+    movu [r0 + r1 * 1 + 0], m0
+    movu [r0 + r1 * 1 + mmsize], m0
+    movu [r0 + r1 * 2 + 0], m0
+    movu [r0 + r1 * 2 + mmsize], m0
+    movu [r0 + r2 * 1 + 0], m0
+    movu [r0 + r2 * 1 + mmsize], m0
+    lea r0, [r0 + r1 * 4]
+    movu [r0 + r1 * 0 + 0], m0
+    movu [r0 + r1 * 0 + mmsize], m0
+    movu [r0 + r1 * 1 + 0], m0
+    movu [r0 + r1 * 1 + mmsize], m0
+    movu [r0 + r1 * 2 + 0], m0
+    movu [r0 + r1 * 2 + mmsize], m0
+    movu [r0 + r2 * 1 + 0], m0
+    movu [r0 + r2 * 1 + mmsize], m0
+    lea r0, [r0 + r1 * 4]
+    movu [r0 + r1 * 0 + 0], m0
+    movu [r0 + r1 * 0 + mmsize], m0
+    movu [r0 + r1 * 1 + 0], m0
+    movu [r0 + r1 * 1 + mmsize], m0
+    movu [r0 + r1 * 2 + 0], m0
+    movu [r0 + r1 * 2 + mmsize], m0
+    movu [r0 + r2 * 1 + 0], m0
+    movu [r0 + r2 * 1 + mmsize], m0
+    lea r0, [r0 + r1 * 4]
+    movu [r0 + r1 * 0 + 0], m0
+    movu [r0 + r1 * 0 + mmsize], m0
+    movu [r0 + r1 * 1 + 0], m0
+    movu [r0 + r1 * 1 + mmsize], m0
+    movu [r0 + r1 * 2 + 0], m0
+    movu [r0 + r1 * 2 + mmsize], m0
+    movu [r0 + r2 * 1 + 0], m0
+    movu [r0 + r2 * 1 + mmsize], m0
+    lea r0, [r0 + r1 * 4]
+    movu [r0 + r1 * 0 + 0], m0
+    movu [r0 + r1 * 0 + mmsize], m0
+    movu [r0 + r1 * 1 + 0], m0
+    movu [r0 + r1 * 1 + mmsize], m0
+    movu [r0 + r1 * 2 + 0], m0
+    movu [r0 + r1 * 2 + mmsize], m0
+    movu [r0 + r2 * 1 + 0], m0
+    movu [r0 + r2 * 1 + mmsize], m0
+    lea r0, [r0 + r1 * 4]
+    movu [r0 + r1 * 0 + 0], m0
+    movu [r0 + r1 * 0 + mmsize], m0
+    movu [r0 + r1 * 1 + 0], m0
+    movu [r0 + r1 * 1 + mmsize], m0
+    movu [r0 + r1 * 2 + 0], m0
+    movu [r0 + r1 * 2 + mmsize], m0
+    movu [r0 + r2 * 1 + 0], m0
+    movu [r0 + r2 * 1 + mmsize], m0
+    lea r0, [r0 + r1 * 4]
+    movu [r0 + r1 * 0 + 0], m0
+    movu [r0 + r1 * 0 + mmsize], m0
+    movu [r0 + r1 * 1 + 0], m0
+    movu [r0 + r1 * 1 + mmsize], m0
+    movu [r0 + r1 * 2 + 0], m0
+    movu [r0 + r1 * 2 + mmsize], m0
+    movu [r0 + r2 * 1 + 0], m0
+    movu [r0 + r2 * 1 + mmsize], m0
+    lea r0, [r0 + r1 * 4]
+    movu [r0 + r1 * 0 + 0], m0
+    movu [r0 + r1 * 0 + mmsize], m0
+    movu [r0 + r1 * 1 + 0], m0
+    movu [r0 + r1 * 1 + mmsize], m0
+    movu [r0 + r1 * 2 + 0], m0
+    movu [r0 + r1 * 2 + mmsize], m0
+    movu [r0 + r2 * 1 + 0], m0
+    movu [r0 + r2 * 1 + mmsize], m0
+    RET
+
 ;---------------------------------------------------------------------------------------
 ; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
 ;---------------------------------------------------------------------------------------
@@ -465,7 +693,7 @@
     pshufd m4, m4, 0          ; v_bottomLeft
 
     pmullw m3, [multiL]       ; (x + 1) * topRight
-    pmullw m0, m1, [pw_planar8_1]    ; (blkSize - 1 - y) * above[x]
+    pmullw m0, m1, [pw_7]     ; (blkSize - 1 - y) * above[x]
    paddw m3, [pw_8]
     paddw m3, m4
     paddw m3, m0
@@ -479,9 +707,9 @@
     pshufhw m1, m2, 0x55 * (%1 - 4)
     pshufd m1, m1, 0xAA
 %endif
-    pmullw m1, [pw_planar8_0]
+    pmullw m1, [pw_planar16_mul + mmsize]
     paddw m1, m3
-    psraw m1, 4
+    psrlw m1, 4
     movu [r0], m1
 %if (%1 < 7)
     paddw m3, m4
@@ -517,8 +745,8 @@
     pmullw m4, m3, [multiH]   ; (x + 1) * topRight
     pmullw m3, [multiL]       ; (x + 1) * topRight
-    pmullw m1, m2, [pw_planar16_1]   ; (blkSize - 1 - y) * above[x]
-    pmullw m5, m7, [pw_planar16_1]   ; (blkSize - 1 - y) * above[x]
+    pmullw m1, m2, [pw_15]    ; (blkSize - 1 - y) * above[x]
+    pmullw m5, m7, [pw_15]    ; (blkSize - 1 - y) * above[x]
     paddw m4, [pw_16]
     paddw m3, [pw_16]
     paddw m4, m6
@@ -554,8 +782,8 @@
     paddw m4, m1
     lea r0, [r0 + r1 * 2]
 %endif
-    pmullw m0, m5, [pw_planar8_0]
-    pmullw m5, [pw_planar16_0]
+    pmullw m0, m5, [pw_planar16_mul + mmsize]
+    pmullw m5, [pw_planar16_mul]
     paddw m0, m4
     paddw m5, m3
     psraw m5, 5
@@ -611,7 +839,7 @@
     mova m9, m6
     mova m10, m6
-    mova m12, [pw_planar32_1]
+    mova m12, [pw_31]
     movu m4, [r2 + 2]
     psubw m8, m4
     pmullw m4, m12
@@ -632,10 +860,10 @@
     pmullw m5, m12
     paddw m3, m5
 
-    mova m12, [pw_planar32_L]
-    mova m13, [pw_planar32_H]
-    mova m14, [pw_planar16_0]
-    mova m15, [pw_planar8_0]
+    mova m12, [pw_planar32_mul]
+    mova m13, [pw_planar32_mul + mmsize]
+    mova m14, [pw_planar16_mul]
+    mova m15, [pw_planar16_mul + mmsize]
     add r1, r1
 
 %macro PROCESS 1
@@ -690,6 +918,154 @@
 %endrep
     RET
 
+;---------------------------------------------------------------------------------------
+; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
+;---------------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal intra_pred_planar32, 3,3,8
+    movu m1, [r2 + 2]
+    movu m4, [r2 + 34]
+    lea r2, [r2 + 66]
+    vpbroadcastw m3, [r2]         ; topRight   = above[32]
+    pmullw m0, m3, [multiL]       ; (x + 1) * topRight
+    pmullw m2, m3, [multiH2]      ; (x + 1) * topRight
+    vpbroadcastw m6, [r2 + 128]   ; bottomLeft = left[32]
+    mova m5, m6
+    paddw m5, [pw_32]
+
+    paddw m0, m5
+    paddw m2, m5
+    mova m5, m6
+    psubw m3, m6, m1
+    pmullw m1, [pw_31]
+    paddw m0, m1
+    psubw m5, m4
+    pmullw m4, [pw_31]
+    paddw m2, m4
+
+    mova m6, [pw_planar32_mul]
+    mova m4, [pw_planar16_mul]
+    add r1, r1
+
+%macro PROCESS_AVX2 1
+    vpbroadcastw m7, [r2 + %1 * 2]
+    pmullw m1, m7, m6
+    pmullw m7, m4
+    paddw m1, m0
+    paddw m7, m2
+    psrlw m1, 6
+    psrlw m7, 6
+    movu [r0], m1
+    movu [r0 + mmsize], m7
+%endmacro
+
+%macro INCREMENT_AVX2 0
+    paddw m2, m5
+    paddw m0, m3
+    add r0, r1
+%endmacro
+
+    add r2, mmsize*2
+%assign x 0
+%rep 4
+%assign y 0
+%rep 8
+    PROCESS_AVX2 y
+%if x + y < 10
+    INCREMENT_AVX2
+%endif
+%assign y y+1
+%endrep
+lea r2, [r2 + 16]
+%assign x x+1
+%endrep
+    RET
+
+;---------------------------------------------------------------------------------------
+; void intra_pred_planar(pixel* dst, intptr_t dstStride, pixel*srcPix, int, int filter)
+;---------------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal intra_pred_planar16, 3,3,4
+    add r1d, r1d
+    vpbroadcastw m3, [r2 + 34]
+    vpbroadcastw m4, [r2 + 98]
+    mova m0, [pw_planar16_mul]
+    movu m2, [r2 + 2]
+
+    pmullw m3, [multiL]           ; (x + 1) * topRight
+    pmullw m1, m2, [pw_15]        ; (blkSize - 1 - y) * above[x]
+    paddw m3, [pw_16]
+    paddw m3, m4
+    paddw m3, m1
+    psubw m4, m2
+    add r2, 66
+
+%macro INTRA_PRED_PLANAR16_AVX2 1
+    vpbroadcastw m1, [r2 + %1]
+    vpbroadcastw m2, [r2 + %1 + 2]
+
+    pmullw m1, m0
+    pmullw m2, m0
+    paddw m1, m3
+    paddw m3, m4
+    psraw m1, 5
+    paddw m2, m3
+    psraw m2, 5
+    paddw m3, m4
+    movu [r0], m1
+    movu [r0 + r1], m2
+%if %1 <= 24
+    lea r0, [r0 + r1 * 2]
+%endif
+%endmacro
+    INTRA_PRED_PLANAR16_AVX2 0
+    INTRA_PRED_PLANAR16_AVX2 4
+    INTRA_PRED_PLANAR16_AVX2 8
+    INTRA_PRED_PLANAR16_AVX2 12
+ INTRA_PRED_PLANAR16_AVX2 16 + INTRA_PRED_PLANAR16_AVX2 20 + INTRA_PRED_PLANAR16_AVX2 24 + INTRA_PRED_PLANAR16_AVX2 28 +%undef INTRA_PRED_PLANAR16_AVX2 + RET + +%macro TRANSPOSE_4x4 0 + punpckhwd m0, m1, m3 + punpcklwd m1, m3 + punpckhwd m3, m1, m0 + punpcklwd m1, m0 +%endmacro + +%macro STORE_4x4 0 + add r1, r1 + movh [r0], m1 + movhps [r0 + r1], m1 + movh [r0 + r1 * 2], m3 + lea r1, [r1 * 3] + movhps [r0 + r1], m3 +%endmacro + +%macro CALC_4x4 4 + mova m0, [pd_16] + pmaddwd m1, [ang_table + %1 * 16] + paddd m1, m0 + psrld m1, 5 + + pmaddwd m2, [ang_table + %2 * 16] + paddd m2, m0 + psrld m2, 5 + packssdw m1, m2 + + pmaddwd m3, [ang_table + %3 * 16] + paddd m3, m0 + psrld m3, 5 + + pmaddwd m4, [ang_table + %4 * 16] + paddd m4, m0 + psrld m4, 5 + packssdw m3, m4 +%endmacro + ;----------------------------------------------------------------------------------------- ; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter) ;----------------------------------------------------------------------------------------- @@ -712,229 +1088,153 @@ movh [r0 + r1], m0 RET -cglobal intra_pred_ang4_3, 3,5,8 - mov r4d, 2 - cmp r3m, byte 33 - mov r3d, 18 - cmove r3d, r4d - - movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - +cglobal intra_pred_ang4_3, 3,3,5 + movu m0, [r2 + 18] ;[8 7 6 5 4 3 2 1] + mova m1, m0 + psrldq m0, 2 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] mova m2, m0 psrldq m0, 2 - punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] + punpcklwd m2, m0 ;[6 5 5 4 4 3 3 2] mova m3, m0 psrldq m0, 2 - punpcklwd m3, m0 ; [6 5 5 4 4 3 3 2] + punpcklwd m3, m0 ;[7 6 6 5 5 4 4 3] mova m4, m0 psrldq m0, 2 - punpcklwd m4, m0 ; [7 6 6 5 5 4 4 3] - mova m5, m0 - psrldq m0, 2 - punpcklwd m5, m0 ; [8 7 7 6 6 5 5 4] + punpcklwd m4, m0 ;[8 7 7 6 6 5 5 4] + CALC_4x4 26, 20, 14, 8 - lea r3, [ang_table + 20 * 16] - mova m0, [r3 + 6 * 16] ; [26] - mova m1, [r3] ; [20] - mova m6, [r3 - 6 * 16] ; [14] - mova m7, [r3 - 12 * 16] ; [ 8] - jmp .do_filter4x4 - - -ALIGN 16 -.do_filter4x4: - lea r4, [pd_16] - pmaddwd m2, m0 - paddd m2, [r4] - psrld m2, 5 - - pmaddwd m3, m1 - paddd m3, [r4] - psrld m3, 5 - packssdw m2, m3 - - pmaddwd m4, m6 - paddd m4, [r4] - psrld m4, 5 - - pmaddwd m5, m7 - paddd m5, [r4] - psrld m5, 5 - packssdw m4, m5 + TRANSPOSE_4x4 - jz .store - - ; transpose 4x4 - punpckhwd m0, m2, m4 - punpcklwd m2, m4 - punpckhwd m4, m2, m0 - punpcklwd m2, m0 - -.store: - add r1, r1 - movh [r0], m2 - movhps [r0 + r1], m2 - movh [r0 + r1 * 2], m4 - lea r1, [r1 * 3] - movhps [r0 + r1], m4 + STORE_4x4 RET -cglobal intra_pred_ang4_4, 3,5,8 - mov r4d, 2 - cmp r3m, byte 32 - mov r3d, 18 - cmove r3d, r4d - - movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] +cglobal intra_pred_ang4_33, 3,3,5 + movu m0, [r2 + 2] ;[8 7 6 5 4 3 2 1] + mova m1, m0 + psrldq m0, 2 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] mova m2, m0 psrldq m0, 2 - punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] + punpcklwd m2, m0 ;[6 5 5 4 4 3 3 2] mova m3, m0 psrldq m0, 2 - punpcklwd m3, m0 ; [6 5 5 4 4 3 3 2] - mova m4, m3 - mova m5, m0 + punpcklwd m3, m0 ;[7 6 6 5 5 4 4 3] + mova m4, m0 psrldq m0, 2 - punpcklwd m5, m0 ; [7 6 6 5 5 4 4 3] + punpcklwd m4, m0 ;[8 7 7 6 6 5 5 4] - lea r3, [ang_table + 18 * 16] - mova m0, [r3 + 3 * 16] ; [21] - mova m1, [r3 - 8 * 16] ; [10] - mova m6, [r3 + 13 * 16] ; [31] - mova m7, [r3 + 2 * 16] ; [20] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + CALC_4x4 26, 20, 14, 8 -cglobal intra_pred_ang4_5, 3,5,8 - mov r4d, 2 - cmp r3m, byte 31 - mov r3d, 18 - cmove r3d, r4d + STORE_4x4 + RET - movu m0, [r2 + r3] ; [8 7 6 5 
4 3 2 1] - mova m2, m0 +cglobal intra_pred_ang4_4, 3,3,5 + movu m0, [r2 + 18] ;[8 7 6 5 4 3 2 1] + mova m1, m0 psrldq m0, 2 - punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] - mova m3, m0 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] + mova m2, m0 psrldq m0, 2 - punpcklwd m3, m0 ; [6 5 5 4 4 3 3 2] - mova m4, m3 - mova m5, m0 + punpcklwd m2, m0 ;[6 5 5 4 4 3 3 2] + mova m3, m2 + mova m4, m0 psrldq m0, 2 - punpcklwd m5, m0 ; [7 6 6 5 5 4 4 3] + punpcklwd m4, m0 ;[7 6 6 5 5 4 4 3] - lea r3, [ang_table + 10 * 16] - mova m0, [r3 + 7 * 16] ; [17] - mova m1, [r3 - 8 * 16] ; [ 2] - mova m6, [r3 + 9 * 16] ; [19] - mova m7, [r3 - 6 * 16] ; [ 4] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + CALC_4x4 21, 10, 31, 20 -cglobal intra_pred_ang4_6, 3,5,8 - mov r4d, 2 - cmp r3m, byte 30 - mov r3d, 18 - cmove r3d, r4d + TRANSPOSE_4x4 - movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - mova m2, m0 + STORE_4x4 + RET + +cglobal intra_pred_ang4_6, 3,3,5 + movu m0, [r2 + 18] ;[8 7 6 5 4 3 2 1] + mova m1, m0 psrldq m0, 2 - punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] - mova m3, m2 - mova m4, m0 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] + mova m2, m1 + mova m3, m0 psrldq m0, 2 - punpcklwd m4, m0 ; [6 5 5 4 4 3 3 2] - mova m5, m4 + punpcklwd m3, m0 ;[6 5 5 4 4 3 3 2] + mova m4, m3 - lea r3, [ang_table + 19 * 16] - mova m0, [r3 - 6 * 16] ; [13] - mova m1, [r3 + 7 * 16] ; [26] - mova m6, [r3 - 12 * 16] ; [ 7] - mova m7, [r3 + 1 * 16] ; [20] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + CALC_4x4 13, 26, 7, 20 -cglobal intra_pred_ang4_7, 3,5,8 - mov r4d, 2 - cmp r3m, byte 29 - mov r3d, 18 - cmove r3d, r4d + TRANSPOSE_4x4 - movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - mova m2, m0 + STORE_4x4 + RET + +cglobal intra_pred_ang4_7, 3,3,5 + movu m0, [r2 + 18] ;[8 7 6 5 4 3 2 1] + mova m1, m0 psrldq m0, 2 - punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] - mova m3, m2 - mova m4, m2 - mova m5, m0 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] + mova m2, m1 + mova m3, m1 + mova m4, m0 psrldq m0, 2 - punpcklwd m5, m0 ; [6 5 5 4 4 3 3 2] + punpcklwd m4, m0 ;[6 5 5 4 4 3 3 2] - lea r3, [ang_table + 20 * 16] - mova m0, [r3 - 11 * 16] ; [ 9] - mova m1, [r3 - 2 * 16] ; [18] - mova m6, [r3 + 7 * 16] ; [27] - mova m7, [r3 - 16 * 16] ; [ 4] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + CALC_4x4 9, 18, 27, 4 -cglobal intra_pred_ang4_8, 3,5,8 - mov r4d, 2 - cmp r3m, byte 28 - mov r3d, 18 - cmove r3d, r4d + TRANSPOSE_4x4 - movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - mova m2, m0 + STORE_4x4 + RET + +cglobal intra_pred_ang4_8, 3,3,5 + movu m0, [r2 + 18] ;[8 7 6 5 4 3 2 1] + mova m1, m0 psrldq m0, 2 - punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] - mova m3, m2 - mova m4, m2 - mova m5, m2 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] + mova m2, m1 + mova m3, m1 + mova m4, m1 - lea r3, [ang_table + 13 * 16] - mova m0, [r3 - 8 * 16] ; [ 5] - mova m1, [r3 - 3 * 16] ; [10] - mova m6, [r3 + 2 * 16] ; [15] - mova m7, [r3 + 7 * 16] ; [20] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + CALC_4x4 5, 10, 15, 20 -cglobal intra_pred_ang4_9, 3,5,8 - mov r4d, 2 - cmp r3m, byte 27 - mov r3d, 18 - cmove r3d, r4d + TRANSPOSE_4x4 - movu m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - mova m2, m0 + STORE_4x4 + RET + +cglobal intra_pred_ang4_9, 3,3,5 + movu m0, [r2 + 18] ;[8 7 6 5 4 3 2 1] + mova m1, m0 psrldq m0, 2 - punpcklwd m2, m0 ; [5 4 4 3 3 2 2 1] - mova m3, m2 - mova m4, m2 - mova m5, m2 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] + mova m2, m1 + mova m3, m1 + mova m4, m1 - lea r3, [ang_table + 4 * 16] - mova 
m0, [r3 - 2 * 16] ; [ 2] - mova m1, [r3 - 0 * 16] ; [ 4] - mova m6, [r3 + 2 * 16] ; [ 6] - mova m7, [r3 + 4 * 16] ; [ 8] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + CALC_4x4 2, 4, 6, 8 + + TRANSPOSE_4x4 + + STORE_4x4 + RET cglobal intra_pred_ang4_10, 3,3,3 - movh m0, [r2 + 18] ; [4 3 2 1] + movh m0, [r2 + 18] ;[4 3 2 1] - punpcklwd m0, m0 ;[4 4 3 3 2 2 1 1] + punpcklwd m0, m0 ;[4 4 3 3 2 2 1 1] pshufd m1, m0, 0xFA - add r1, r1 + add r1d, r1d pshufd m0, m0, 0x50 movhps [r0 + r1], m0 movh [r0 + r1 * 2], m1 - lea r1, [r1 * 3] + lea r1d, [r1 * 3] movhps [r0 + r1], m1 cmp r4m, byte 0 jz .quit ; filter - movd m2, [r2] ; [7 6 5 4 3 2 1 0] + movd m2, [r2] ;[7 6 5 4 3 2 1 0] pshuflw m2, m2, 0x00 movh m1, [r2 + 2] psubw m1, m2 @@ -942,13 +1242,321 @@ paddw m0, m1 pxor m1, m1 pmaxsw m0, m1 - pminsw m0, [pw_1023] + pminsw m0, [pw_pixel_max] .quit: movh [r0], m0 RET +cglobal intra_pred_ang4_11, 3,3,5 + movh m0, [r2 + 18] ;[x x x 4 3 2 1 0] + movh m1, [r2 - 6] + punpcklqdq m1, m0 + psrldq m1, 6 + punpcklwd m1, m0 ;[4 3 3 2 2 1 1 0] + mova m2, m1 + mova m3, m1 + mova m4, m1 + + CALC_4x4 30, 28, 26, 24 + + TRANSPOSE_4x4 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_12, 3,3,5 + movh m0, [r2 + 18] + movh m1, [r2 - 6] + punpcklqdq m1, m0 + psrldq m1, 6 + punpcklwd m1, m0 ;[4 3 3 2 2 1 1 0] + mova m2, m1 + mova m3, m1 + mova m4, m1 + + CALC_4x4 27, 22, 17, 12 + + TRANSPOSE_4x4 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_13, 3,3,5 + movd m4, [r2 + 6] + movd m1, [r2 - 2] + movh m0, [r2 + 18] + punpcklwd m4, m1 + punpcklqdq m4, m0 + psrldq m4, 4 + mova m1, m4 + psrldq m1, 2 + punpcklwd m4, m1 ;[3 2 2 1 1 0 0 x] + punpcklwd m1, m0 ;[4 3 3 2 2 1 1 0] + mova m2, m1 + mova m3, m1 + + CALC_4x4 23, 14, 5, 28 + + TRANSPOSE_4x4 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_14, 3,3,5 + movd m4, [r2 + 2] + movd m1, [r2 - 2] + movh m0, [r2 + 18] + punpcklwd m4, m1 + punpcklqdq m4, m0 + psrldq m4, 4 + mova m1, m4 + psrldq m1, 2 + punpcklwd m4, m1 ;[3 2 2 1 1 0 0 x] + punpcklwd m1, m0 ;[4 3 3 2 2 1 1 0] + mova m2, m1 + mova m3, m4 + + CALC_4x4 19, 6, 25, 12 + + TRANSPOSE_4x4 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_15, 3,3,5 + movd m3, [r2] ;[x x x A] + movh m4, [r2 + 4] ;[x C x B] + movh m0, [r2 + 18] ;[4 3 2 1] + pshuflw m4, m4, 0x22 ;[B C B C] + punpcklqdq m4, m3 ;[x x x A B C B C] + psrldq m4, 2 ;[x x x x A B C B] + punpcklqdq m4, m0 + psrldq m4, 2 + mova m1, m4 + mova m2, m4 + psrldq m1, 4 + psrldq m2, 2 + punpcklwd m4, m2 ;[2 1 1 0 0 x x y] + punpcklwd m2, m1 ;[3 2 2 1 1 0 0 x] + punpcklwd m1, m0 ;[4 3 3 2 2 1 1 0] + mova m3, m2 + + CALC_4x4 15, 30, 13, 28 + + TRANSPOSE_4x4 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_16, 3,3,5 + movd m3, [r2] ;[x x x A] + movd m4, [r2 + 4] ;[x x C B] + movh m0, [r2 + 18] ;[4 3 2 1] + punpcklwd m4, m3 ;[x C A B] + pshuflw m4, m4, 0x4A ;[A B C C] + punpcklqdq m4, m0 ;[4 3 2 1 A B C C] + psrldq m4, 2 + mova m1, m4 + mova m2, m4 + psrldq m1, 4 + psrldq m2, 2 + punpcklwd m4, m2 ;[2 1 1 0 0 x x y] + punpcklwd m2, m1 ;[3 2 2 1 1 0 0 x] + punpcklwd m1, m0 ;[4 3 3 2 2 1 1 0] + mova m3, m2 + + CALC_4x4 11, 22, 1, 12 + + TRANSPOSE_4x4 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_17, 3,3,5 + movd m3, [r2] + movh m4, [r2 + 2] ;[D x C B] + pshuflw m4, m4, 0x1F ;[B C D D] + punpcklqdq m4, m3 ;[x x x A B C D D] + psrldq m4, 2 ;[x x x x A B C D] + movhps m4, [r2 + 18] + + mova m3, m4 + psrldq m3, 2 + punpcklwd m4, m3 + mova m2, m3 + psrldq m2, 2 + punpcklwd m3, m2 + mova m1, m2 + psrldq m1, 2 + punpcklwd m2, m1 + mova m0, m1 + psrldq m0, 2 + 
punpcklwd m1, m0 + + CALC_4x4 6, 12, 18, 24 + + TRANSPOSE_4x4 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_18, 3,3,1 + movh m0, [r2 + 16] + pinsrw m0, [r2], 0 + pshuflw m0, m0, q0123 + movhps m0, [r2 + 2] + add r1, r1 + lea r2, [r1 * 3] + movh [r0 + r2], m0 + psrldq m0, 2 + movh [r0 + r1 * 2], m0 + psrldq m0, 2 + movh [r0 + r1], m0 + psrldq m0, 2 + movh [r0], m0 + RET + + cglobal intra_pred_ang4_19, 3,3,5 + movd m3, [r2] + movh m4, [r2 + 18] ;[D x C B] + pshuflw m4, m4, 0x1F ;[B C D D] + punpcklqdq m4, m3 ;[x x x A B C D D] + psrldq m4, 2 ;[x x x x A B C D] + movhps m4, [r2 + 2] + + mova m3, m4 + psrldq m3, 2 + punpcklwd m4, m3 + mova m2, m3 + psrldq m2, 2 + punpcklwd m3, m2 + mova m1, m2 + psrldq m1, 2 + punpcklwd m2, m1 + mova m0, m1 + psrldq m0, 2 + punpcklwd m1, m0 + + CALC_4x4 6, 12, 18, 24 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_20, 3,3,5 + movd m3, [r2] ;[x x x A] + movd m4, [r2 + 20] ;[x x C B] + movh m0, [r2 + 2] ;[4 3 2 1] + punpcklwd m4, m3 ;[x C A B] + pshuflw m4, m4, 0x4A ;[A B C C] + punpcklqdq m4, m0 ;[4 3 2 1 A B C C] + psrldq m4, 2 + mova m1, m4 + mova m2, m4 + psrldq m1, 4 + psrldq m2, 2 + punpcklwd m4, m2 ;[2 1 1 0 0 x x y] + punpcklwd m2, m1 ;[3 2 2 1 1 0 0 x] + punpcklwd m1, m0 ;[4 3 3 2 2 1 1 0] + mova m3, m2 + + CALC_4x4 11, 22, 1, 12 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_21, 3,3,5 + movd m3, [r2] ;[x x x A] + movh m4, [r2 + 20] ;[x C x B] + movh m0, [r2 + 2] ;[4 3 2 1] + pshuflw m4, m4, 0x22 ;[B C B C] + punpcklqdq m4, m3 ;[x x x A B C B C] + psrldq m4, 2 ;[x x x x A B C B] + punpcklqdq m4, m0 + psrldq m4, 2 + mova m1, m4 + mova m2, m4 + psrldq m1, 4 + psrldq m2, 2 + punpcklwd m4, m2 ;[2 1 1 0 0 x x y] + punpcklwd m2, m1 ;[3 2 2 1 1 0 0 x] + punpcklwd m1, m0 ;[4 3 3 2 2 1 1 0] + mova m3, m2 + + CALC_4x4 15, 30, 13, 28 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_22, 3,3,5 + movd m4, [r2 + 18] + movd m1, [r2 - 2] + movh m0, [r2 + 2] + punpcklwd m4, m1 + punpcklqdq m4, m0 + psrldq m4, 4 + mova m1, m4 + psrldq m1, 2 + punpcklwd m4, m1 ;[3 2 2 1 1 0 0 x] + punpcklwd m1, m0 ;[4 3 3 2 2 1 1 0] + mova m2, m1 + mova m3, m4 + + CALC_4x4 19, 6, 25, 12 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_23, 3,3,5 + movd m4, [r2 + 22] + movd m1, [r2 - 2] + movh m0, [r2 + 2] + punpcklwd m4, m1 + punpcklqdq m4, m0 + psrldq m4, 4 + mova m1, m4 + psrldq m1, 2 + punpcklwd m4, m1 ;[3 2 2 1 1 0 0 x] + punpcklwd m1, m0 ;[4 3 3 2 2 1 1 0] + mova m2, m1 + mova m3, m1 + + CALC_4x4 23, 14, 5, 28 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_24, 3,3,5 + movh m0, [r2 + 2] + movh m1, [r2 - 6] + punpcklqdq m1, m0 + psrldq m1, 6 + punpcklwd m1, m0 ;[4 3 3 2 2 1 1 0] + mova m2, m1 + mova m3, m1 + mova m4, m1 + + CALC_4x4 27, 22, 17, 12 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_25, 3,3,5 + movh m0, [r2 + 2] ;[x x x 4 3 2 1 0] + movh m1, [r2 - 6] + punpcklqdq m1, m0 + psrldq m1, 6 + punpcklwd m1, m0 ;[4 3 3 2 2 1 1 0] + mova m2, m1 + mova m3, m1 + mova m4, m1 + + CALC_4x4 30, 28, 26, 24 + + STORE_4x4 + RET + cglobal intra_pred_ang4_26, 3,3,3 - movh m0, [r2 + 2] ; [8 7 6 5 4 3 2 1] + movh m0, [r2 + 2] ;[8 7 6 5 4 3 2 1] add r1d, r1d ; store movh [r0], m0 @@ -970,7 +1578,7 @@ paddw m0, m1 pxor m1, m1 pmaxsw m0, m1 - pminsw m0, [pw_1023] + pminsw m0, [pw_pixel_max] movh r2, m0 mov [r0], r2w @@ -983,213 +1591,121 @@ .quit: RET -cglobal intra_pred_ang4_11, 3,5,8 - xor r4d, r4d - cmp r3m, byte 25 - mov r3d, 16 - cmove r3d, r4d +cglobal intra_pred_ang4_27, 3,3,5 + movu m0, [r2 + 2] ;[8 7 6 5 4 3 2 1] + mova m1, m0 + psrldq m0, 2 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] + mova m2, m1 + 
mova m3, m1 + mova m4, m1 - movh m1, [r2 + r3 + 2] ; [x x x 4 3 2 1 0] - movh m2, [r2 - 6] - punpcklqdq m2, m1 - psrldq m2, 6 - punpcklwd m2, m1 ; [4 3 3 2 2 1 1 0] - mova m3, m2 - mova m4, m2 - mova m5, m2 + CALC_4x4 2, 4, 6, 8 - lea r3, [ang_table + 24 * 16] - mova m0, [r3 + 6 * 16] ; [24] - mova m1, [r3 + 4 * 16] ; [26] - mova m6, [r3 + 2 * 16] ; [28] - mova m7, [r3 + 0 * 16] ; [30] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + STORE_4x4 + RET -cglobal intra_pred_ang4_12, 3,5,8 - xor r4d, r4d - cmp r3m, byte 24 - mov r3d, 16 - cmove r3d, r4d +cglobal intra_pred_ang4_28, 3,3,5 - movh m1, [r2 + r3 + 2] - movh m2, [r2 - 6] - punpcklqdq m2, m1 - psrldq m2, 6 - punpcklwd m2, m1 ; [4 3 3 2 2 1 1 0] - mova m3, m2 - mova m4, m2 - mova m5, m2 + movu m0, [r2 + 2] ;[8 7 6 5 4 3 2 1] + mova m1, m0 + psrldq m0, 2 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] + mova m2, m1 + mova m3, m1 + mova m4, m1 - lea r3, [ang_table + 20 * 16] - mova m0, [r3 + 7 * 16] ; [27] - mova m1, [r3 + 2 * 16] ; [22] - mova m6, [r3 - 3 * 16] ; [17] - mova m7, [r3 - 8 * 16] ; [12] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + CALC_4x4 5, 10, 15, 20 -cglobal intra_pred_ang4_13, 3,5,8 - xor r4d, r4d - cmp r3m, byte 23 - mov r3d, 16 - jz .next - xchg r3d, r4d -.next: - movd m5, [r2 + r3 + 6] - movd m2, [r2 - 2] - movh m0, [r2 + r4 + 2] - punpcklwd m5, m2 - punpcklqdq m5, m0 - psrldq m5, 4 - mova m2, m5 - psrldq m2, 2 - punpcklwd m5, m2 ; [3 2 2 1 1 0 0 x] - punpcklwd m2, m0 ; [4 3 3 2 2 1 1 0] - mova m3, m2 - mova m4, m2 + STORE_4x4 + RET - lea r3, [ang_table + 21 * 16] - mova m0, [r3 + 2 * 16] ; [23] - mova m1, [r3 - 7 * 16] ; [14] - mova m6, [r3 - 16 * 16] ; [ 5] - mova m7, [r3 + 7 * 16] ; [28] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) +cglobal intra_pred_ang4_29, 3,3,5 + movu m0, [r2 + 2] ;[8 7 6 5 4 3 2 1] + mova m1, m0 + psrldq m0, 2 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] + mova m2, m1 + mova m3, m1 + mova m4, m0 + psrldq m0, 2 + punpcklwd m4, m0 ;[6 5 5 4 4 3 3 2] -cglobal intra_pred_ang4_14, 3,5,8 - xor r4d, r4d - cmp r3m, byte 22 - mov r3d, 16 - jz .next - xchg r3d, r4d -.next: - movd m5, [r2 + r3 + 2] - movd m2, [r2 - 2] - movh m0, [r2 + r4 + 2] - punpcklwd m5, m2 - punpcklqdq m5, m0 - psrldq m5, 4 - mova m2, m5 - psrldq m2, 2 - punpcklwd m5, m2 ; [3 2 2 1 1 0 0 x] - punpcklwd m2, m0 ; [4 3 3 2 2 1 1 0] - mova m3, m2 - mova m4, m5 + CALC_4x4 9, 18, 27, 4 - lea r3, [ang_table + 19 * 16] - mova m0, [r3 + 0 * 16] ; [19] - mova m1, [r3 - 13 * 16] ; [ 6] - mova m6, [r3 + 6 * 16] ; [25] - mova m7, [r3 - 7 * 16] ; [12] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + STORE_4x4 + RET -cglobal intra_pred_ang4_15, 3,5,8 - xor r4d, r4d - cmp r3m, byte 21 - mov r3d, 16 - jz .next - xchg r3d, r4d -.next: - movd m4, [r2] ;[x x x A] - movh m5, [r2 + r3 + 4] ;[x C x B] - movh m0, [r2 + r4 + 2] ;[4 3 2 1] - pshuflw m5, m5, 0x22 ;[B C B C] - punpcklqdq m5, m4 ;[x x x A B C B C] - psrldq m5, 2 ;[x x x x A B C B] - punpcklqdq m5, m0 - psrldq m5, 2 - mova m2, m5 - mova m3, m5 - psrldq m2, 4 - psrldq m3, 2 - punpcklwd m5, m3 ; [2 1 1 0 0 x x y] - punpcklwd m3, m2 ; [3 2 2 1 1 0 0 x] - punpcklwd m2, m0 ; [4 3 3 2 2 1 1 0] +cglobal intra_pred_ang4_30, 3,3,5 + movu m0, [r2 + 2] ;[8 7 6 5 4 3 2 1] + mova m1, m0 + psrldq m0, 2 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] + mova m2, m1 + mova m3, m0 + psrldq m0, 2 + punpcklwd m3, m0 ;[6 5 5 4 4 3 3 2] mova m4, m3 - lea r3, [ang_table + 23 * 16] - mova m0, [r3 - 8 
* 16] ; [15] - mova m1, [r3 + 7 * 16] ; [30] - mova m6, [r3 - 10 * 16] ; [13] - mova m7, [r3 + 5 * 16] ; [28] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + CALC_4x4 13, 26, 7, 20 -cglobal intra_pred_ang4_16, 3,5,8 - xor r4d, r4d - cmp r3m, byte 20 - mov r3d, 16 - jz .next - xchg r3d, r4d -.next: - movd m4, [r2] ;[x x x A] - movd m5, [r2 + r3 + 4] ;[x x C B] - movh m0, [r2 + r4 + 2] ;[4 3 2 1] - punpcklwd m5, m4 ;[x C A B] - pshuflw m5, m5, 0x4A ;[A B C C] - punpcklqdq m5, m0 ;[4 3 2 1 A B C C] - psrldq m5, 2 - mova m2, m5 - mova m3, m5 - psrldq m2, 4 - psrldq m3, 2 - punpcklwd m5, m3 ; [2 1 1 0 0 x x y] - punpcklwd m3, m2 ; [3 2 2 1 1 0 0 x] - punpcklwd m2, m0 ; [4 3 3 2 2 1 1 0] - mova m4, m3 + STORE_4x4 + RET - lea r3, [ang_table + 19 * 16] - mova m0, [r3 - 8 * 16] ; [11] - mova m1, [r3 + 3 * 16] ; [22] - mova m6, [r3 - 18 * 16] ; [ 1] - mova m7, [r3 - 7 * 16] ; [12] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) +cglobal intra_pred_ang4_5, 3,3,5 + movu m0, [r2 + 18] ;[8 7 6 5 4 3 2 1] + mova m1, m0 + psrldq m0, 2 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] + mova m2, m0 + psrldq m0, 2 + punpcklwd m2, m0 ;[6 5 5 4 4 3 3 2] + mova m3, m2 + mova m4, m0 + psrldq m0, 2 + punpcklwd m4, m0 ;[7 6 6 5 5 4 4 3] -cglobal intra_pred_ang4_17, 3,5,8 - xor r4d, r4d - cmp r3m, byte 19 - mov r3d, 16 - jz .next - xchg r3d, r4d -.next: - movd m4, [r2] - movh m5, [r2 + r3 + 2] ;[D x C B] - pshuflw m5, m5, 0x1F ;[B C D D] - punpcklqdq m5, m4 ;[x x x A B C D D] - psrldq m5, 2 ;[x x x x A B C D] - movhps m5, [r2 + r4 + 2] + CALC_4x4 17, 2, 19, 4 - mova m4, m5 - psrldq m4, 2 - punpcklwd m5, m4 - mova m3, m4 - psrldq m3, 2 - punpcklwd m4, m3 - mova m2, m3 - psrldq m2, 2 - punpcklwd m3, m2 - mova m1, m2 - psrldq m1, 2 - punpcklwd m2, m1 + TRANSPOSE_4x4 - lea r3, [ang_table + 14 * 16] - mova m0, [r3 - 8 * 16] ; [ 6] - mova m1, [r3 - 2 * 16] ; [12] - mova m6, [r3 + 4 * 16] ; [18] - mova m7, [r3 + 10 * 16] ; [24] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + STORE_4x4 + RET -cglobal intra_pred_ang4_18, 3,3,1 - movh m0, [r2 + 16] - pinsrw m0, [r2], 0 - pshuflw m0, m0, q0123 - movhps m0, [r2 + 2] - add r1, r1 - lea r2, [r1 * 3] - movh [r0 + r2], m0 +cglobal intra_pred_ang4_31, 3,3,5 + movu m0, [r2 + 2] ;[8 7 6 5 4 3 2 1] + mova m1, m0 psrldq m0, 2 - movh [r0 + r1 * 2], m0 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] + mova m2, m0 psrldq m0, 2 - movh [r0 + r1], m0 + punpcklwd m2, m0 ;[6 5 5 4 4 3 3 2] + mova m3, m2 + mova m4, m0 psrldq m0, 2 - movh [r0], m0 + punpcklwd m4, m0 ;[7 6 6 5 5 4 4 3] + + CALC_4x4 17, 2, 19, 4 + + STORE_4x4 + RET + + cglobal intra_pred_ang4_32, 3,3,5 + movu m0, [r2 + 2] ;[8 7 6 5 4 3 2 1] + mova m1, m0 + psrldq m0, 2 + punpcklwd m1, m0 ;[5 4 4 3 3 2 2 1] + mova m2, m0 + psrldq m0, 2 + punpcklwd m2, m0 ;[6 5 5 4 4 3 3 2] + mova m3, m2 + mova m4, m0 + psrldq m0, 2 + punpcklwd m4, m0 ;[7 6 6 5 5 4 4 3] + + CALC_4x4 21, 10, 31, 20 + + STORE_4x4 RET ;----------------------------------------------------------------------------------- @@ -1232,7 +1748,7 @@ ; filter top movu m1, [r2] paddw m1, m0 - psraw m1, 2 + psrlw m1, 2 movh [r0], m1 ; overwrite top-left pixel, we will update it later ; filter top-left @@ -1247,7 +1763,7 @@ lea r0, [r0 + r1 * 2] movu m1, [r3 + 2] paddw m1, m0 - psraw m1, 2 + psrlw m1, 2 movd r3d, m1 mov [r0], r3w shr r3d, 16 @@ -1269,7 +1785,7 @@ pshufd m4, m4, 0xAA pmullw m3, [multi_2Row] ; (x + 1) * topRight - pmullw m0, m1, [pw_planar4_1] ; (blkSize - 1 - y) * above[x] + pmullw 
m0, m1, [pw_3] ; (blkSize - 1 - y) * above[x] paddw m3, [pw_4] paddw m3, m4 @@ -1356,7 +1872,7 @@ ; filter top movu m0, [r2] paddw m0, m1 - psraw m0, 2 + psrlw m0, 2 movu [r6], m0 ; filter top-left @@ -1371,7 +1887,7 @@ add r6, r1 movu m0, [r3 + 2] paddw m0, m1 - psraw m0, 2 + psrlw m0, 2 pextrw [r6], m0, 0 pextrw [r6 + r1], m0, 1 pextrw [r6 + r1 * 2], m0, 2 @@ -1397,13 +1913,13 @@ movu m2, [r2] movu m3, [r2 + 16] - paddw m0, m1 + paddw m0, m1 ; dynamic range 13 bits paddw m2, m3 - paddw m0, m2 - movhlps m1, m0 - paddw m0, m1 - phaddw m0, m0 + paddw m0, m2 ; dynamic range 14 bits + movhlps m1, m0 ; dynamic range 15 bits + paddw m0, m1 ; dynamic range 16 bits pmaddwd m0, [pw_1] + phaddd m0, m0 movd r5d, m0 add r5d, 16 @@ -1467,11 +1983,11 @@ ; filter top movu m2, [r2] paddw m2, m1 - psraw m2, 2 + psrlw m2, 2 movu [r6], m2 movu m3, [r2 + 16] paddw m3, m1 - psraw m3, 2 + psrlw m3, 2 movu [r6 + 16], m3 ; filter top-left @@ -1486,7 +2002,7 @@ add r6, r1 movu m2, [r3 + 2] paddw m2, m1 - psraw m2, 2 + psrlw m2, 2 pextrw [r6], m2, 0 pextrw [r6 + r1], m2, 1 @@ -1503,7 +2019,7 @@ lea r6, [r6 + r1 * 2] movu m3, [r3 + 18] paddw m3, m1 - psraw m3, 2 + psrlw m3, 2 pextrw [r6], m3, 0 pextrw [r6 + r1], m3, 1 @@ -1530,21 +2046,21 @@ movu m1, [r3 + 16] movu m2, [r3 + 32] movu m3, [r3 + 48] - paddw m0, m1 + paddw m0, m1 ; dynamic range 13 bits paddw m2, m3 - paddw m0, m2 + paddw m0, m2 ; dynamic range 14 bits movu m1, [r2] movu m3, [r2 + 16] movu m4, [r2 + 32] movu m5, [r2 + 48] - paddw m1, m3 + paddw m1, m3 ; dynamic range 13 bits paddw m4, m5 - paddw m1, m4 - paddw m0, m1 - movhlps m1, m0 - paddw m0, m1 - phaddw m0, m0 + paddw m1, m4 ; dynamic range 14 bits + paddw m0, m1 ; dynamic range 15 bits pmaddwd m0, [pw_1] + movhlps m1, m0 + paddd m0, m1 + phaddd m0, m0 paddd m0, [pd_32] ; sum = sum + 32 psrld m0, 6 ; sum = sum / 64 @@ -1607,7 +2123,7 @@ pshufd m4, m4, 0xAA pmullw m3, [multi_2Row] ; (x + 1) * topRight - pmullw m0, m1, [pw_planar4_1] ; (blkSize - 1 - y) * above[x] + pmullw m0, m1, [pw_3] ; (blkSize - 1 - y) * above[x] paddw m3, [pw_4] paddw m3, m4 @@ -1663,12 +2179,12 @@ pshufd m4, m4, 0 ; v_bottomLeft pmullw m3, [multiL] ; (x + 1) * topRight - pmullw m0, m1, [pw_planar8_1] ; (blkSize - 1 - y) * above[x] + pmullw m0, m1, [pw_7] ; (blkSize - 1 - y) * above[x] paddw m3, [pw_8] paddw m3, m4 paddw m3, m0 psubw m4, m1 - mova m0, [pw_planar8_0] + mova m0, [pw_planar16_mul + mmsize] %macro INTRA_PRED_PLANAR8 1 %if (%1 < 4) @@ -1681,7 +2197,7 @@ pmullw m1, m0 paddw m1, m3 paddw m3, m4 - psraw m1, 4 + psrlw m1, 4 movu [r0], m1 lea r0, [r0 + r1] %endmacro @@ -1715,8 +2231,8 @@ pmullw m4, m3, [multiH] ; (x + 1) * topRight pmullw m3, [multiL] ; (x + 1) * topRight - pmullw m1, m2, [pw_planar16_1] ; (blkSize - 1 - y) * above[x] - pmullw m5, m7, [pw_planar16_1] ; (blkSize - 1 - y) * above[x] + pmullw m1, m2, [pw_15] ; (blkSize - 1 - y) * above[x] + pmullw m5, m7, [pw_15] ; (blkSize - 1 - y) * above[x] paddw m4, [pw_16] paddw m3, [pw_16] paddw m4, m6 @@ -1747,8 +2263,8 @@ %endif %endif %endif - pmullw m0, m5, [pw_planar8_0] - pmullw m5, [pw_planar16_0] + pmullw m0, m5, [pw_planar16_mul + mmsize] + pmullw m5, [pw_planar16_mul] paddw m0, m4 paddw m5, m3 paddw m3, m6 @@ -1865,28 +2381,28 @@ ; above[0-3] * (blkSize - 1 - y) pmovzxwd m4, [r2 + 2] - pmulld m5, m4, [pd_planar32_1] + pmulld m5, m4, [pd_31] paddd m0, m5 psubd m5, m6, m4 mova m8, m5 ; above[4-7] * (blkSize - 1 - y) pmovzxwd m4, [r2 + 10] - pmulld m5, m4, [pd_planar32_1] + pmulld m5, m4, [pd_31] paddd m1, m5 psubd m5, m6, m4 mova m9, m5 ; above[8-11] * 
(blkSize - 1 - y) pmovzxwd m4, [r2 + 18] - pmulld m5, m4, [pd_planar32_1] + pmulld m5, m4, [pd_31] paddd m2, m5 psubd m5, m6, m4 mova m10, m5 ; above[12-15] * (blkSize - 1 - y) pmovzxwd m4, [r2 + 26] - pmulld m5, m4, [pd_planar32_1] + pmulld m5, m4, [pd_31] paddd m3, m5 psubd m5, m6, m4 mova m11, m5 @@ -1894,7 +2410,7 @@ ; above[16-19] * (blkSize - 1 - y) pmovzxwd m4, [r2 + 34] mova m7, m12 - pmulld m5, m4, [pd_planar32_1] + pmulld m5, m4, [pd_31] paddd m7, m5 mova m12, m7 psubd m5, m6, m4 @@ -1903,7 +2419,7 @@ ; above[20-23] * (blkSize - 1 - y) pmovzxwd m4, [r2 + 42] mova m7, m13 - pmulld m5, m4, [pd_planar32_1] + pmulld m5, m4, [pd_31] paddd m7, m5 mova m13, m7 psubd m5, m6, m4 @@ -1912,7 +2428,7 @@ ; above[24-27] * (blkSize - 1 - y) pmovzxwd m4, [r2 + 50] mova m7, m14 - pmulld m5, m4, [pd_planar32_1] + pmulld m5, m4, [pd_31] paddd m7, m5 mova m14, m7 psubd m5, m6, m4 @@ -1921,7 +2437,7 @@ ; above[28-31] * (blkSize - 1 - y) pmovzxwd m4, [r2 + 58] mova m7, m15 - pmulld m5, m4, [pd_planar32_1] + pmulld m5, m4, [pd_31] paddd m7, m5 mova m15, m7 psubd m5, m6, m4 @@ -2235,7 +2751,7 @@ paddw m0, m1 pxor m1, m1 pmaxsw m0, m1 - pminsw m0, [pw_1023] + pminsw m0, [pw_pixel_max] .quit: movh [r0], m0 RET @@ -2264,7 +2780,7 @@ paddw m0, m1 pxor m1, m1 pmaxsw m0, m1 - pminsw m0, [pw_1023] + pminsw m0, [pw_pixel_max] pextrw [r0], m0, 0 pextrw [r0 + r1], m0, 1 @@ -3439,33 +3955,33 @@ RET cglobal intra_pred_ang8_10, 3,6,3 - movu m1, [r2 + 34] ; [8 7 6 5 4 3 2 1] - pshufb m0, m1, [pw_unpackwdq] ; [1 1 1 1 1 1 1 1] + movu m1, [r2 + 34] ; [8 7 6 5 4 3 2 1] + pshufb m0, m1, [pb_01] ; [1 1 1 1 1 1 1 1] add r1, r1 lea r3, [r1 * 3] psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [2 2 2 2 2 2 2 2] + pshufb m2, m1, [pb_01] ; [2 2 2 2 2 2 2 2] movu [r0 + r1], m2 psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [3 3 3 3 3 3 3 3] + pshufb m2, m1, [pb_01] ; [3 3 3 3 3 3 3 3] movu [r0 + r1 * 2], m2 psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [4 4 4 4 4 4 4 4] + pshufb m2, m1, [pb_01] ; [4 4 4 4 4 4 4 4] movu [r0 + r3], m2 lea r5, [r0 + r1 *4] psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [5 5 5 5 5 5 5 5] + pshufb m2, m1, [pb_01] ; [5 5 5 5 5 5 5 5] movu [r5], m2 psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [6 6 6 6 6 6 6 6] + pshufb m2, m1, [pb_01] ; [6 6 6 6 6 6 6 6] movu [r5 + r1], m2 psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [7 7 7 7 7 7 7 7] + pshufb m2, m1, [pb_01] ; [7 7 7 7 7 7 7 7] movu [r5 + r1 * 2], m2 psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [8 8 8 8 8 8 8 8] + pshufb m2, m1, [pb_01] ; [8 8 8 8 8 8 8 8] movu [r5 + r3], m2 cmp r4m, byte 0 @@ -3474,14 +3990,14 @@ ; filter movh m1, [r2] ; [3 2 1 0] - pshufb m2, m1, [pw_unpackwdq] ; [0 0 0 0 0 0 0 0] + pshufb m2, m1, [pb_01] ; [0 0 0 0 0 0 0 0] movu m1, [r2 + 2] ; [8 7 6 5 4 3 2 1] psubw m1, m2 psraw m1, 1 paddw m0, m1 pxor m1, m1 pmaxsw m0, m1 - pminsw m0, [pw_1023] + pminsw m0, [pw_pixel_max] .quit: movu [r0], m0 RET @@ -5344,16 +5860,16 @@ jz .quit ; filter - pshufb m0, [pw_unpackwdq] + pshufb m0, [pb_01] pinsrw m1, [r2], 0 ; [3 2 1 0] - pshufb m2, m1, [pw_unpackwdq] ; [0 0 0 0 0 0 0 0] + pshufb m2, m1, [pb_01] ; [0 0 0 0 0 0 0 0] movu m1, [r2 + 2 + 32] ; [8 7 6 5 4 3 2 1] psubw m1, m2 psraw m1, 1 paddw m0, m1 pxor m1, m1 pmaxsw m0, m1 - pminsw m0, [pw_1023] + pminsw m0, [pw_pixel_max] pextrw [r0], m0, 0 pextrw [r0 + r1], m0, 1 pextrw [r0 + r1 * 2], m0, 2 @@ -9679,73 +10195,73 @@ mov r5d, r4m movu m1, [r2 + 2 + 64] ; [8 7 6 5 4 3 2 1] movu m3, [r2 + 18 + 64] ; [16 15 14 13 12 11 10 9] - pshufb m0, m1, [pw_unpackwdq] ; [1 1 1 1 1 1 1 1] + 
pshufb m0, m1, [pb_01] ; [1 1 1 1 1 1 1 1] add r1, r1 lea r4, [r1 * 3] psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [2 2 2 2 2 2 2 2] + pshufb m2, m1, [pb_01] ; [2 2 2 2 2 2 2 2] movu [r0 + r1], m2 movu [r0 + r1 + 16], m2 psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [3 3 3 3 3 3 3 3] + pshufb m2, m1, [pb_01] ; [3 3 3 3 3 3 3 3] movu [r0 + r1 * 2], m2 movu [r0 + r1 * 2 + 16], m2 psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [4 4 4 4 4 4 4 4] + pshufb m2, m1, [pb_01] ; [4 4 4 4 4 4 4 4] movu [r0 + r4], m2 movu [r0 + r4 + 16], m2 lea r3, [r0 + r1 *4] psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [5 5 5 5 5 5 5 5] + pshufb m2, m1, [pb_01] ; [5 5 5 5 5 5 5 5] movu [r3], m2 movu [r3 + 16], m2 psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [6 6 6 6 6 6 6 6] + pshufb m2, m1, [pb_01] ; [6 6 6 6 6 6 6 6] movu [r3 + r1], m2 movu [r3 + r1 + 16], m2 psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [7 7 7 7 7 7 7 7] + pshufb m2, m1, [pb_01] ; [7 7 7 7 7 7 7 7] movu [r3 + r1 * 2], m2 movu [r3 + r1 * 2 + 16], m2 psrldq m1, 2 - pshufb m2, m1, [pw_unpackwdq] ; [8 8 8 8 8 8 8 8] + pshufb m2, m1, [pb_01] ; [8 8 8 8 8 8 8 8] movu [r3 + r4], m2 movu [r3 + r4 + 16], m2 lea r3, [r3 + r1 *4] - pshufb m2, m3, [pw_unpackwdq] ; [9 9 9 9 9 9 9 9] + pshufb m2, m3, [pb_01] ; [9 9 9 9 9 9 9 9] movu [r3], m2 movu [r3 + 16], m2 psrldq m3, 2 - pshufb m2, m3, [pw_unpackwdq] ; [10 10 10 10 10 10 10 10] + pshufb m2, m3, [pb_01] ; [10 10 10 10 10 10 10 10] movu [r3 + r1], m2 movu [r3 + r1 + 16], m2 psrldq m3, 2 - pshufb m2, m3, [pw_unpackwdq] ; [11 11 11 11 11 11 11 11] + pshufb m2, m3, [pb_01] ; [11 11 11 11 11 11 11 11] movu [r3 + r1 * 2], m2 movu [r3 + r1 * 2 + 16], m2 psrldq m3, 2 - pshufb m2, m3, [pw_unpackwdq] ; [12 12 12 12 12 12 12 12] + pshufb m2, m3, [pb_01] ; [12 12 12 12 12 12 12 12] movu [r3 + r4], m2 movu [r3 + r4 + 16], m2 lea r3, [r3 + r1 *4] psrldq m3, 2 - pshufb m2, m3, [pw_unpackwdq] ; [13 13 13 13 13 13 13 13] + pshufb m2, m3, [pb_01] ; [13 13 13 13 13 13 13 13] movu [r3], m2 movu [r3 + 16], m2 psrldq m3, 2 - pshufb m2, m3, [pw_unpackwdq] ; [14 14 14 14 14 14 14 14] + pshufb m2, m3, [pb_01] ; [14 14 14 14 14 14 14 14] movu [r3 + r1], m2 movu [r3 + r1 + 16], m2 psrldq m3, 2 - pshufb m2, m3, [pw_unpackwdq] ; [15 15 15 15 15 15 15 15] + pshufb m2, m3, [pb_01] ; [15 15 15 15 15 15 15 15] movu [r3 + r1 * 2], m2 movu [r3 + r1 * 2 + 16], m2 psrldq m3, 2 - pshufb m2, m3, [pw_unpackwdq] ; [16 16 16 16 16 16 16 16] + pshufb m2, m3, [pb_01] ; [16 16 16 16 16 16 16 16] movu [r3 + r4], m2 movu [r3 + r4 + 16], m2 mova m3, m0 @@ -9755,7 +10271,7 @@ ; filter pinsrw m1, [r2], 0 ; [3 2 1 0] - pshufb m2, m1, [pw_unpackwdq] ; [0 0 0 0 0 0 0 0] + pshufb m2, m1, [pb_01] ; [0 0 0 0 0 0 0 0] movu m1, [r2 + 2] ; [8 7 6 5 4 3 2 1] movu m3, [r2 + 18] ; [16 15 14 13 12 11 10 9] psubw m1, m2 @@ -9766,9 +10282,9 @@ paddw m0, m1 pxor m1, m1 pmaxsw m0, m1 - pminsw m0, [pw_1023] + pminsw m0, [pw_pixel_max] pmaxsw m3, m1 - pminsw m3, [pw_1023] + pminsw m3, [pw_pixel_max] .quit: movu [r0], m0 movu [r0 + 16], m3 @@ -9825,9 +10341,9 @@ ; filter - pshufb m0, [pw_unpackwdq] + pshufb m0, [pb_01] pinsrw m1, [r2], 0 ; [3 2 1 0] - pshufb m2, m1, [pw_unpackwdq] ; [0 0 0 0 0 0 0 0] + pshufb m2, m1, [pb_01] ; [0 0 0 0 0 0 0 0] movu m1, [r2 + 2 + 64] ; [8 7 6 5 4 3 2 1] movu m3, [r2 + 18 + 64] ; [16 15 14 13 12 11 10 9] psubw m1, m2 @@ -9838,9 +10354,9 @@ paddw m0, m1 pxor m1, m1 pmaxsw m0, m1 - pminsw m0, [pw_1023] + pminsw m0, [pw_pixel_max] pmaxsw m3, m1 - pminsw m3, [pw_1023] + pminsw m3, [pw_pixel_max] pextrw [r0], m0, 0 pextrw [r0 + r1], m0, 1 
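; [editor's note] Two recurring substitutions in these hunks: the
; pw_1023 -> pw_pixel_max clamps generalize the hard-coded 10-bit
; ceiling (1023) to the build-time bit depth, which the new Main12
; profile requires, and the pw_unpackwdq -> pb_01 renames appear to be
; a constants cleanup, the word-broadcast pshufb mask itself being
; unchanged. Scalar view of the edge filter whose result is clamped
; and stored here (Clip3-style helper assumed, not the literal C
; reference):
;     pel = clip3(0, (1 << BIT_DEPTH) - 1,
;                 dc + ((ref[i] - topLeft) >> 1));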
pextrw [r0 + r1 * 2], m0, 2 @@ -9862,6 +10378,7231 @@ .quit: RET +;------------------------------------------------------------------------------------------------------- +; avx2 code for intra_pred_ang16 mode 2 to 34 start +;------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal intra_pred_ang16_2, 3,5,3 + lea r4, [r2] + add r2, 64 + cmp r3m, byte 34 + cmove r2, r4 + add r1d, r1d + lea r3, [r1 * 3] + movu m0, [r2 + 4] + movu m1, [r2 + 20] + + movu [r0], m0 + palignr m2, m1, m0, 2 + movu [r0 + r1], m2 + palignr m2, m1, m0, 4 + movu [r0 + r1 * 2], m2 + palignr m2, m1, m0, 6 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m1, m0, 8 + movu [r0], m2 + palignr m2, m1, m0, 10 + movu [r0 + r1], m2 + palignr m2, m1, m0, 12 + movu [r0 + r1 * 2], m2 + palignr m2, m1, m0, 14 + movu [r0 + r3], m2 + + movu m0, [r2 + 36] + lea r0, [r0 + r1 * 4] + movu [r0], m1 + palignr m2, m0, m1, 2 + movu [r0 + r1], m2 + palignr m2, m0, m1, 4 + movu [r0 + r1 * 2], m2 + palignr m2, m0, m1, 6 + movu [r0 + r3], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m0, m1, 8 + movu [r0], m2 + palignr m2, m0, m1, 10 + movu [r0 + r1], m2 + palignr m2, m0, m1, 12 + movu [r0 + r1 * 2], m2 + palignr m2, m0, m1, 14 + movu [r0 + r3], m2 + RET + +%macro TRANSPOSE_STORE_AVX2 11 + jnz .skip%11 + punpckhwd m%9, m%1, m%2 + punpcklwd m%1, m%2 + punpckhwd m%2, m%3, m%4 + punpcklwd m%3, m%4 + + punpckldq m%4, m%1, m%3 + punpckhdq m%1, m%3 + punpckldq m%3, m%9, m%2 + punpckhdq m%9, m%2 + + punpckhwd m%10, m%5, m%6 + punpcklwd m%5, m%6 + punpckhwd m%6, m%7, m%8 + punpcklwd m%7, m%8 + + punpckldq m%8, m%5, m%7 + punpckhdq m%5, m%7 + punpckldq m%7, m%10, m%6 + punpckhdq m%10, m%6 + + punpcklqdq m%6, m%4, m%8 + punpckhqdq m%2, m%4, m%8 + punpcklqdq m%4, m%1, m%5 + punpckhqdq m%8, m%1, m%5 + + punpcklqdq m%1, m%3, m%7 + punpckhqdq m%5, m%3, m%7 + punpcklqdq m%3, m%9, m%10 + punpckhqdq m%7, m%9, m%10 + + movu [r0 + r1 * 0 + %11], xm%6 + movu [r0 + r1 * 1 + %11], xm%2 + movu [r0 + r1 * 2 + %11], xm%4 + movu [r0 + r4 * 1 + %11], xm%8 + + lea r5, [r0 + r1 * 4] + movu [r5 + r1 * 0 + %11], xm%1 + movu [r5 + r1 * 1 + %11], xm%5 + movu [r5 + r1 * 2 + %11], xm%3 + movu [r5 + r4 * 1 + %11], xm%7 + + lea r5, [r5 + r1 * 4] + vextracti128 [r5 + r1 * 0 + %11], m%6, 1 + vextracti128 [r5 + r1 * 1 + %11], m%2, 1 + vextracti128 [r5 + r1 * 2 + %11], m%4, 1 + vextracti128 [r5 + r4 * 1 + %11], m%8, 1 + + lea r5, [r5 + r1 * 4] + vextracti128 [r5 + r1 * 0 + %11], m%1, 1 + vextracti128 [r5 + r1 * 1 + %11], m%5, 1 + vextracti128 [r5 + r1 * 2 + %11], m%3, 1 + vextracti128 [r5 + r4 * 1 + %11], m%7, 1 + jmp .end%11 +.skip%11: + movu [r0 + r1 * 0], m%1 + movu [r0 + r1 * 1], m%2 + movu [r0 + r1 * 2], m%3 + movu [r0 + r4 * 1], m%4 + + lea r0, [r0 + r1 * 4] + movu [r0 + r1 * 0], m%5 + movu [r0 + r1 * 1], m%6 + movu [r0 + r1 * 2], m%7 + movu [r0 + r4 * 1], m%8 + lea r0, [r0 + r1 * 4] +.end%11: +%endmacro + +;; angle 16, modes 3 and 33 +cglobal ang16_mode_3_33 + test r6d, r6d + + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + movu m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + movu m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 
15 15 14 14 13] + + pmaddwd m4, m3, [r3 + 10 * 32] ; [26] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 + 10 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m5, m0, m3, 4 ; [14 13 13 12 12 11 11 10 6 5 5 4 4 3 3 2] + pmaddwd m5, [r3 + 4 * 32] ; [20] + paddd m5, [pd_16] + psrld m5, 5 + palignr m6, m2, m0, 4 ; [18 17 17 16 16 15 15 14 10 9 9 8 8 7 7 6] + pmaddwd m6, [r3 + 4 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + palignr m6, m0, m3, 8 ; [15 14 14 13 13 12 12 11 7 6 6 5 5 4 4 3] + pmaddwd m6, [r3 - 2 * 32] ; [14] + paddd m6, [pd_16] + psrld m6, 5 + palignr m7, m2, m0, 8 ; [19 18 18 17 17 16 16 15 11 10 10 9 9 8 8 7] + pmaddwd m7, [r3 - 2 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m7, m0, m3, 12 ; [16 15 15 14 14 13 13 12 8 7 7 6 6 5 5 4] + pmaddwd m7, [r3 - 8 * 32] ; [8] + paddd m7, [pd_16] + psrld m7, 5 + palignr m8, m2, m0, 12 ; [20 19 19 18 18 17 17 16 12 11 11 10 10 9 9 8] + pmaddwd m8, [r3 - 8 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m0, [r3 - 14 * 32] ; [2] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m3, m2, [r3 - 14 * 32] ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m8, m3 + + pmaddwd m9, m0, [r3 + 12 * 32] ; [28] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m3, m2, [r3 + 12 * 32] ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + palignr m10, m2, m0, 4 ; [18 17 17 16 16 15 15 14 10 9 9 8 8 7 7 6] + pmaddwd m10, [r3 + 6 * 32] ; [22] + paddd m10, [pd_16] + psrld m10, 5 + palignr m3, m1, m2, 4 ; [22 21 21 20 20 19 19 18 14 13 13 12 12 11 11 10] + pmaddwd m3, [r3 + 6 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m10, m3 + + palignr m11, m2, m0, 8 ; [19 18 18 17 17 16 16 15 11 10 10 9 9 8 8 7] + pmaddwd m11, [r3] ; [16] + paddd m11, [pd_16] + psrld m11, 5 + palignr m3, m1, m2, 8 ; [23 22 22 21 21 20 20 19 15 14 14 13 13 12 12 11] + pmaddwd m3, [r3] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m11, m3 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 3, 0 + + palignr m4, m2, m0, 12 ; [20 19 19 18 18 17 17 16 12 11 11 10 10 9 9 8] + pmaddwd m4, [r3 - 6 * 32] ; [10] + paddd m4, [pd_16] + psrld m4, 5 + palignr m5, m1, m2, 12 ; [24 23 23 22 22 21 21 20 15 16 15 14 14 13 13 12] + pmaddwd m5, [r3 - 6 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m2, [r3 - 12 * 32] ; [4] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m1, [r3 - 12 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + movu m0, [r2 + 34] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + pmaddwd m6, m2, [r3 + 14 * 32] ; [30] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, m1, [r3 + 14 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m3, m0, m0, 2 ; [ x 32 31 30 29 28 27 26 x 24 23 22 21 20 19 18] + punpcklwd m0, m3 ; [29 29 28 28 27 27 26 22 21 20 20 19 19 18 18 17] + + palignr m7, m1, m2, 4 + pmaddwd m7, [r3 + 8 * 32] ; [24] + paddd m7, [pd_16] + psrld m7, 5 + palignr m8, m0, m1, 4 + pmaddwd m8, [r3 + 8 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m8, m1, m2, 8 + pmaddwd m8, [r3 + 2 * 32] ; [18] + paddd m8, [pd_16] + psrld m8, 5 + palignr m9, m0, m1, 8 + pmaddwd m9, [r3 + 2 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m9, m1, m2, 12 + pmaddwd m9, [r3 - 4 * 32] ; [12] + paddd m9, [pd_16] + psrld m9, 5 + palignr m3, m0, m1, 12 + pmaddwd m3, [r3 - 4 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw 
m9, m3 + + pmaddwd m1, [r3 - 10 * 32] ; [6] + paddd m1, [pd_16] + psrld m1, 5 + pmaddwd m0, [r3 - 10 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m1, m0 + + movu m2, [r2 + 28] + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 1, 2, 0, 3, 16 + ret + +;; angle 16, modes 4 and 32 +cglobal ang16_mode_4_32 + test r6d, r6d + + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + movu m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + movu m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + pmaddwd m4, m3, [r3 + 3 * 32] ; [21] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 + 3 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m0, m3, 4 ; [14 13 13 12 12 11 11 10 6 5 5 4 4 3 3 2] + pmaddwd m5, m6, [r3 - 8 * 32] ; [10] + paddd m5, [pd_16] + psrld m5, 5 + palignr m7, m2, m0, 4 ; [18 17 17 16 16 15 15 14 10 9 9 8 8 7 7 6] + pmaddwd m8, m7, [r3 - 8 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, [r3 + 13 * 32] ; [31] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, [r3 + 13 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m7, m0, m3, 8 ; [15 14 14 13 13 12 12 11 7 6 6 5 5 4 4 3] + pmaddwd m7, [r3 + 2 * 32] ; [20] + paddd m7, [pd_16] + psrld m7, 5 + palignr m8, m2, m0, 8 ; [19 18 18 17 17 16 16 15 11 10 10 9 9 8 8 7] + pmaddwd m8, [r3 + 2 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m9, m0, m3, 12 + pmaddwd m8, m9, [r3 - 9 * 32] ; [9] + paddd m8, [pd_16] + psrld m8, 5 + palignr m3, m2, m0, 12 + pmaddwd m10, m3, [r3 - 9 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m8, m10 + + pmaddwd m9, [r3 + 12 * 32] ; [30] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m3, [r3 + 12 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + pmaddwd m10, m0, [r3 + 1 * 32] ; [19] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m3, m2, [r3 + 1 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m10, m3 + + palignr m11, m2, m0, 4 + pmaddwd m11, [r3 - 10 * 32] ; [8] + paddd m11, [pd_16] + psrld m11, 5 + palignr m3, m1, m2, 4 + pmaddwd m3, [r3 - 10 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m11, m3 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 3, 0 + + palignr m4, m2, m0, 4 + pmaddwd m4, [r3 + 11 * 32] ; [29] + paddd m4, [pd_16] + psrld m4, 5 + palignr m5, m1, m2, 4 + pmaddwd m5, [r3 + 11 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m5, m2, m0, 8 + pmaddwd m5, [r3] ; [18] + paddd m5, [pd_16] + psrld m5, 5 + palignr m6, m1, m2, 8 + pmaddwd m6, [r3] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + palignr m7, m2, m0, 12 + pmaddwd m6, m7, [r3 - 11 * 32] ; [7] + paddd m6, [pd_16] + psrld m6, 5 + palignr m8, m1, m2, 12 + pmaddwd m3, m8, [r3 - 11 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m6, m3 + + pmaddwd m7, [r3 + 10 * 32] ; [28] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, [r3 + 10 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + movu m0, [r2 + 34] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + pmaddwd m8, m2, [r3 - 1 * 32] ; [17] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m1, [r3 - 1 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m3, 
m0, m0, 2 ; [ x 32 31 30 29 28 27 26 x 24 23 22 21 20 19 18] + punpcklwd m0, m3 ; [29 29 28 28 27 27 26 22 21 20 20 19 19 18 18 17] + + palignr m10, m1, m2, 4 + pmaddwd m9, m10, [r3 - 12 * 32] ; [6] + paddd m9, [pd_16] + psrld m9, 5 + palignr m11, m0, m1, 4 + pmaddwd m3, m11, [r3 - 12 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + pmaddwd m10, [r3 + 9 * 32] ; [27] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, [r3 + 9 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + palignr m3, m1, m2, 8 + pmaddwd m3, [r3 - 2 * 32] ; [16] + paddd m3, [pd_16] + psrld m3, 5 + palignr m0, m1, 8 + pmaddwd m0, [r3 - 2 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m3, m0 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 3, 0, 1, 16 + ret + +;; angle 16, modes 5 and 31 +cglobal ang16_mode_5_31 + test r6d, r6d + + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + movu m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + movu m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + pmaddwd m4, m3, [r3 + 1 * 32] ; [17] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 + 1 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m0, m3, 4 + pmaddwd m5, m6, [r3 - 14 * 32] ; [2] + paddd m5, [pd_16] + psrld m5, 5 + palignr m7, m2, m0, 4 + pmaddwd m8, m7, [r3 - 14 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, [r3 + 3 * 32] ; [19] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, [r3 + 3 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m8, m0, m3, 8 + pmaddwd m7, m8, [r3 - 12 * 32] ; [4] + paddd m7, [pd_16] + psrld m7, 5 + palignr m9, m2, m0, 8 + pmaddwd m10, m9, [r3 - 12 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m7, m10 + + pmaddwd m8, [r3 + 5 * 32] ; [21] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 + 5 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m10, m0, m3, 12 + pmaddwd m9, m10, [r3 - 10 * 32] ; [6] + paddd m9, [pd_16] + psrld m9, 5 + palignr m11, m2, m0, 12 + pmaddwd m3, m11, [r3 - 10 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + pmaddwd m10, [r3 + 7 * 32] ; [23] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, [r3 + 7 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + pmaddwd m11, m0, [r3 - 8 * 32] ; [8] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m3, m2, [r3 - 8 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m11, m3 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 3, 0 + + pmaddwd m4, m0, [r3 + 9 * 32] ; [25] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m2, [r3 + 9 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m2, m0, 4 + pmaddwd m5, m6, [r3 - 6 * 32] ; [10] + paddd m5, [pd_16] + psrld m5, 5 + palignr m7, m1, m2, 4 + pmaddwd m3, m7, [r3 - 6 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m5, m3 + + pmaddwd m6, [r3 + 11 * 32] ; [27] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, [r3 + 11 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m8, m2, m0, 8 + pmaddwd m7, m8, [r3 - 4 * 32] ; [12] + paddd m7, [pd_16] + psrld m7, 5 + palignr m9, m1, m2, 8 + pmaddwd m3, m9, [r3 - 4 * 32] + paddd m3, [pd_16] 
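; [editor's note] This paddd/psrld pair is the rounding step of the
; interpolation every ang16 kernel here repeats: pmaddwd against a row
; of ang_table multiplies interleaved (ref[i], ref[i+1]) word pairs by
; (32 - f, f), so per sample the block computes
;     pred = ((32 - f) * ref[i] + f * ref[i+1] + 16) >> 5
; where f is the fractional offset printed in the trailing ;[n]
; comments.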
+ psrld m3, 5 + packusdw m7, m3 + + pmaddwd m8, [r3 + 13 * 32] ; [29] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 + 13 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m10, m2, m0, 12 + pmaddwd m9, m10, [r3 - 2 * 32] ; [14] + paddd m9, [pd_16] + psrld m9, 5 + palignr m11, m1, m2, 12 + pmaddwd m3, m11, [r3 - 2 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + pmaddwd m10, [r3 + 15 * 32] ; [31] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, [r3 + 15 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + pmaddwd m2, [r3] ; [16] + paddd m2, [pd_16] + psrld m2, 5 + pmaddwd m1, [r3] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m2, m1 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 2, 0, 1, 16 + ret + +;; angle 16, modes 6 and 30 +cglobal ang16_mode_6_30 + test r6d, r6d + + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + movu m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + movu m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + pmaddwd m4, m3, [r3 - 2 * 32] ; [13] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 - 2 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 11 * 32] ; [26] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 + 11 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + palignr m7, m0, m3, 4 + pmaddwd m6, m7, [r3 - 8 * 32] ; [7] + paddd m6, [pd_16] + psrld m6, 5 + palignr m8, m2, m0, 4 + pmaddwd m9, m8, [r3 - 8 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + pmaddwd m7, [r3 + 5 * 32] ; [20] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, [r3 + 5 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m10, m0, m3, 8 + pmaddwd m8, m10, [r3 - 14 * 32] ; [1] + paddd m8, [pd_16] + psrld m8, 5 + palignr m11, m2, m0, 8 + pmaddwd m9, m11, [r3 - 14 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m10, [r3 - 1 * 32] ; [14] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m12, m11, [r3 - 1 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m9, m12 + + pmaddwd m10, [r3 + 12 * 32] ; [27] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, [r3 + 12 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + palignr m11, m0, m3, 12 + pmaddwd m11, [r3 - 7 * 32] ; [8] + paddd m11, [pd_16] + psrld m11, 5 + palignr m12, m2, m0, 12 + pmaddwd m12, [r3 - 7 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + palignr m4, m0, m3, 12 + pmaddwd m4, [r3 + 6 * 32] ; [21] + paddd m4, [pd_16] + psrld m4, 5 + palignr m5, m2, m0, 12 + pmaddwd m5, [r3 + 6 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m0, [r3 - 13 * 32] ; [2] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m3, m2, [r3 - 13 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m5, m3 + + pmaddwd m6, m0, [r3] ; [15] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, m2, [r3] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + pmaddwd m7, m0, [r3 + 13 * 32] ; [28] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m3, m2, [r3 + 13 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m7, m3 + + palignr m9, m2, m0, 4 
+ pmaddwd m8, m9, [r3 - 6 * 32] ; [9] + paddd m8, [pd_16] + psrld m8, 5 + palignr m3, m1, m2, 4 + pmaddwd m10, m3, [r3 - 6 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m8, m10 + + pmaddwd m9, [r3 + 7 * 32] ; [22] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m3, [r3 + 7 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + palignr m11, m2, m0, 8 + pmaddwd m10, m11, [r3 - 12 * 32] ; [3] + paddd m10, [pd_16] + psrld m10, 5 + palignr m3, m1, m2, 8 + pmaddwd m12, m3, [r3 - 12 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + pmaddwd m11, [r3 + 1 * 32] ; [16] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m3, [r3 + 1 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m11, m3 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 16 + ret + +;; angle 16, modes 7 and 29 +cglobal ang16_mode_7_29 + test r6d, r6d + + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + movu m2, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + movu m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + + pmaddwd m4, m3, [r3 - 8 * 32] ; [9] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 - 8 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 1 * 32] ; [18] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 + 1 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, m3, [r3 + 10 * 32] ; [27] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m0, [r3 + 10 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + palignr m10, m0, m3, 4 + pmaddwd m7, m10, [r3 - 13 * 32] ; [4] + paddd m7, [pd_16] + psrld m7, 5 + palignr m11, m2, m0, 4 + pmaddwd m8, m11, [r3 - 13 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m10, [r3 - 4 * 32] ; [13] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m11, [r3 - 4 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m10, [r3 + 5 * 32] ; [22] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m12, m11, [r3 + 5 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m9, m12 + + pmaddwd m10, [r3 + 14 * 32] ; [31] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, [r3 + 14 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + palignr m11, m0, m3, 8 + pmaddwd m11, [r3 - 9 * 32] ; [8] + paddd m11, [pd_16] + psrld m11, 5 + palignr m12, m2, m0, 8 + pmaddwd m12, [r3 - 9 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 0 + + palignr m5, m0, m3, 8 + pmaddwd m4, m5, [r3] ; [17] + paddd m4, [pd_16] + psrld m4, 5 + palignr m6, m2, m0, 8 + pmaddwd m7, m6, [r3] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m4, m7 + + pmaddwd m5, [r3 + 9 * 32] ; [26] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, [r3 + 9 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + palignr m9, m0, m3, 12 + pmaddwd m6, m9, [r3 - 14 * 32] ; [3] + paddd m6, [pd_16] + psrld m6, 5 + palignr m3, m2, m0, 12 + pmaddwd m7, m3, [r3 - 14 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + pmaddwd m7, m9, [r3 - 5 * 32] ; [12] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m3, [r3 - 5 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m9, [r3 + 4 * 32] ; [21] + paddd m8, [pd_16] + psrld m8, 5 + 
pmaddwd m10, m3, [r3 + 4 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m8, m10 + + pmaddwd m9, [r3 + 13 * 32] ; [30] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m3, [r3 + 13 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + pmaddwd m10, m0, [r3 - 10 * 32] ; [7] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m2, [r3 - 10 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + pmaddwd m0, [r3 - 1 * 32] ; [16] + paddd m0, [pd_16] + psrld m0, 5 + pmaddwd m2, [r3 - 1 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m0, m2 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 16 + ret + +;; angle 16, modes 8 and 28 +cglobal ang16_mode_8_28 + test r6d, r6d + + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + movu m2, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + movu m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + + pmaddwd m4, m3, [r3 - 10 * 32] ; [5] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 - 10 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 - 5 * 32] ; [10] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 - 5 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, m3, [r3] ; [15] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m0, [r3] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + pmaddwd m7, m3, [r3 + 5 * 32] ; [20] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m0, [r3 + 5 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m3, [r3 + 10 * 32] ; [25] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m0, [r3 + 10 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m3, [r3 + 15 * 32] ; [30] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m0, [r3 + 15 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + palignr m11, m0, m3, 4 + pmaddwd m10, m11, [r3 - 12 * 32] ; [3] + paddd m10, [pd_16] + psrld m10, 5 + palignr m1, m2, m0, 4 + pmaddwd m12, m1, [r3 - 12 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + pmaddwd m11, [r3 - 7 * 32] ; [8] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m1, [r3 - 7 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m11, m1 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 0 + + palignr m7, m0, m3, 4 + pmaddwd m4, m7, [r3 - 2 * 32] ; [13] + paddd m4, [pd_16] + psrld m4, 5 + palignr m1, m2, m0, 4 + pmaddwd m5, m1, [r3 - 2 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m7, [r3 + 3 * 32] ; [18] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m1, [r3 + 3 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m7, [r3 + 8 * 32] ; [23] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m1, [r3 + 8 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, [r3 + 13 * 32] ; [28] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m1, [r3 + 13 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m7, m1 + + palignr m1, m0, m3, 8 + pmaddwd m8, m1, [r3 - 14 * 32] ; [1] + paddd m8, [pd_16] + psrld m8, 5 + palignr m2, m0, 8 + pmaddwd m9, m2, [r3 - 14 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m1, [r3 - 9 * 32] ; [6] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m3, m2, [r3 - 9 * 32] + paddd m3, [pd_16] 
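; [editor's note] Why the arithmetic stays in dword lanes throughout:
; with 12-bit input a sample times a 5-bit weight (up to 32 * 4095)
; needs 18 bits, so pmaddwd/paddd/psrld keep full precision and
; packusdw saturates back to unsigned words only after the >> 5.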
+ psrld m3, 5 + packusdw m9, m3 + + pmaddwd m3, m1, [r3 - 4 * 32] ; [11] + paddd m3, [pd_16] + psrld m3, 5 + pmaddwd m0, m2, [r3 - 4 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m3, m0 + + pmaddwd m1, [r3 + 1 * 32] ; [16] + paddd m1, [pd_16] + psrld m1, 5 + pmaddwd m2, [r3 + 1 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m1, m2 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 3, 1, 0, 2, 16 + ret + +;; angle 16, modes 9 and 27 +cglobal ang16_mode_9_27 + test r6d, r6d + + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + movu m2, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + movu m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + + pmaddwd m4, m3, [r3 - 14 * 32] ; [2] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 - 14 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 - 12 * 32] ; [4] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 - 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, m3, [r3 - 10 * 32] ; [6] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m0, [r3 - 10 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + pmaddwd m7, m3, [r3 - 8 * 32] ; [8] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m0, [r3 - 8 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m3, [r3 - 6 * 32] ; [10] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m0, [r3 - 6 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m3, [r3 - 4 * 32] ; [12] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m0, [r3 - 4 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m3, [r3 - 2 * 32] ; [14] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m1, m0, [r3 - 2 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m10, m1 + + pmaddwd m11, m3, [r3] ; [16] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m1, m0, [r3] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m11, m1 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 2, 1, 0 + + pmaddwd m4, m3, [r3 + 2 * 32] ; [18] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 + 2 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 4 * 32] ; [20] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m0, [r3 + 4 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m3, [r3 + 6 * 32] ; [22] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m0, [r3 + 6 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, m3, [r3 + 8 * 32] ; [24] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m1, m0, [r3 + 8 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m7, m1 + + pmaddwd m8, m3, [r3 + 10 * 32] ; [26] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m0, [r3 + 10 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m3, [r3 + 12 * 32] ; [28] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m1, m0, [r3 + 12 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m9, m1 + + pmaddwd m3, [r3 + 14 * 32] ; [30] + paddd m3, [pd_16] + psrld m3, 5 + pmaddwd m0, [r3 + 14 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m3, m0 + + movu m1, [r2 + 4] + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 3, 1, 0, 2, 16 + ret + +;; angle 16, modes 11 and 25 +cglobal ang16_mode_11_25 + test r6d, r6d + + 
movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] + movu m1, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + + punpcklwd m3, m0, m1 ; [12 11 11 10 10 9 9 8 4 3 3 2 2 1 1 0] + punpckhwd m0, m1 ; [16 15 15 14 14 13 13 12 8 7 7 6 6 5 5 4] + + pmaddwd m4, m3, [r3 + 14 * 32] ; [30] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 + 14 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 12 * 32] ; [28] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 + 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, m3, [r3 + 10 * 32] ; [26] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m0, [r3 + 10 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + pmaddwd m7, m3, [r3 + 8 * 32] ; [24] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m0, [r3 + 8 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m3, [r3 + 6 * 32] ; [22] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m0, [r3 + 6 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m3, [r3 + 4 * 32] ; [20] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m0, [r3 + 4 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m3, [r3 + 2 * 32] ; [18] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m1, m0, [r3 + 2 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m10, m1 + + pmaddwd m11, m3, [r3] ; [16] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m1, m0, [r3] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m11, m1 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 2, 1, 0 + + pmaddwd m4, m3, [r3 - 2 * 32] ; [14] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 - 2 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 - 4 * 32] ; [12] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m0, [r3 - 4 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m3, [r3 - 6 * 32] ; [10] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m0, [r3 - 6 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, m3, [r3 - 8 * 32] ; [8] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m1, m0, [r3 - 8 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m7, m1 + + pmaddwd m8, m3, [r3 - 10 * 32] ; [6] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m0, [r3 - 10 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m3, [r3 - 12 * 32] ; [4] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m1, m0, [r3 - 12 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m9, m1 + + pmaddwd m3, [r3 - 14 * 32] ; [2] + paddd m3, [pd_16] + psrld m3, 5 + pmaddwd m0, [r3 - 14 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m3, m0 + + movu m1, [r2] + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 3, 1, 0, 2, 16 + ret + +;; angle 16, modes 12 and 24 +cglobal ang16_mode_12_24 + test r6d, r6d + + movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] + movu m4, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + + punpcklwd m3, m0, m4 ; [12 11 11 10 10 9 9 8 4 3 3 2 2 1 1 0] + punpckhwd m2, m0, m4 ; [16 15 15 14 14 13 13 12 8 7 7 6 6 5 5 4] + + pmaddwd m4, m3, [r3 + 11 * 32] ; [27] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m2, [r3 + 11 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 6 * 32] ; [22] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m2, [r3 + 6 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, m3, [r3 + 1 * 32] ; [17] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m2, [r3 + 1 * 32] + paddd m9, [pd_16] + psrld m9, 
5 + packusdw m6, m9 + + pmaddwd m7, m3, [r3 - 4 * 32] ; [12] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m2, [r3 - 4 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m3, [r3 - 9 * 32] ; [7] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m2, [r3 - 9 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m3, [r3 - 14 * 32] ; [2] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m2, [r3 - 14 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m9, m2 + + punpcklwd m3, m0, m0 ; [11 11 10 10 9 9 8 8 3 3 2 2 1 1 0 0] + punpckhwd m0, m0 ; [15 15 14 14 13 13 12 12 7 7 6 6 5 5 4 4] + vinserti128 m1, m1, xm0, 1 ; [ 7 7 6 6 5 5 4 4 6 6 13 13 x x x x] + + palignr m2, m3, m1, 14 + palignr m13, m0, m3, 14 + + pmaddwd m10, m2, [r3 + 13 * 32] ; [29] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m13, [r3 + 13 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + pmaddwd m11, m2, [r3 + 8 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m13, [r3 + 8 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m11, m13 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + palignr m13, m0, m3, 14 + + pmaddwd m4, m2, [r3 + 3 * 32] ; [19] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m13, [r3 + 3 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m2, [r3 - 2 * 32] ; [14] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m13, [r3 - 2 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m2, [r3 - 7 * 32] ; [9] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m13, [r3 - 7 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, m2, [r3 - 12 * 32] ; [4] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 - 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m0, m3, 10 + palignr m3, m1, 10 + + pmaddwd m8, m3, [r3 + 15 * 32] ; [31] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m0, [r3 + 15 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m3, [r3 + 10 * 32] ; [26] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m1, m0, [r3 + 10 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m9, m1 + + pmaddwd m1, m3, [r3 + 5 * 32] ; [21] + paddd m1, [pd_16] + psrld m1, 5 + pmaddwd m2, m0, [r3 + 5 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m1, m2 + + pmaddwd m3, [r3] ; [16] + paddd m3, [pd_16] + psrld m3, 5 + pmaddwd m0, [r3] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m3, m0 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 1, 3, 0, 2, 16 + ret + +;; angle 16, modes 13 and 23 +cglobal ang16_mode_13_23 + test r6d, r6d + + movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] + movu m4, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + + punpcklwd m3, m0, m4 ; [12 11 11 10 10 9 9 8 4 3 3 2 2 1 1 0] + punpckhwd m2, m0, m4 ; [16 15 15 14 14 13 13 12 8 7 7 6 6 5 5 4] + + pmaddwd m4, m3, [r3 + 7 * 32] ; [23] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m2, [r3 + 7 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 - 2 * 32] ; [14] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m2, [r3 - 2 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m3, [r3 - 11 * 32] ; [5] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m2, [r3 - 11 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m6, m2 + + punpcklwd m3, m0, m0 ; [11 11 10 10 9 9 8 8 3 3 2 2 1 1 0 0] + punpckhwd m0, m0 ; [15 15 14 14 13 13 12 12 7 7 6 6 5 5 4 4] + vinserti128 m1, m1, xm0, 1 ; [ 7 7 6 6 5 5 4 4 4 4 7 7 11 11 14 14] + + palignr m2, m3, m1, 14 + 
palignr m13, m0, m3, 14 + + pmaddwd m7, m2, [r3 + 12 * 32] ; [28] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 + 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m2, [r3 + 3 * 32] ; [19] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m13, [r3 + 3 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m2, [r3 - 6 * 32] ; [10] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m13, [r3 - 6 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m2, [r3 - 15 * 32] ; [1] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m13, [r3 - 15 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + palignr m2, m3, m1, 10 + palignr m13, m0, m3, 10 + + pmaddwd m11, m2, [r3 + 8 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m13, [r3 + 8 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m11, m13 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + palignr m13, m0, m3, 10 + + pmaddwd m4, m2, [r3 - 1 * 32] ; [15] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m13, [r3 - 1 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m2, [r3 - 10 * 32] ; [6] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m13, [r3 - 10 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + palignr m2, m3, m1, 6 + palignr m13, m0, m3, 6 + + pmaddwd m6, m2, [r3 + 13 * 32] ; [29] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m13, [r3 + 13 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, m2, [r3 + 4 * 32] ; [20] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 + 4 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m2, [r3 - 5 * 32] ; [11] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m13, [r3 - 5 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m2, [r3 - 14 * 32] ; [2] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m13, [r3 - 14 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m9, m13 + + palignr m0, m3, 2 + palignr m3, m1, 2 + + pmaddwd m1, m3, [r3 + 9 * 32] ; [25] + paddd m1, [pd_16] + psrld m1, 5 + pmaddwd m2, m0, [r3 + 9 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m1, m2 + + pmaddwd m3, [r3] ; [16] + paddd m3, [pd_16] + psrld m3, 5 + pmaddwd m0, [r3] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m3, m0 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 1, 3, 0, 2, 16 + ret + +;; angle 16, modes 14 and 22 +cglobal ang16_mode_14_22 + test r6d, r6d + + movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] + movu m4, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + + punpcklwd m3, m0, m4 ; [12 11 11 10 10 9 9 8 4 3 3 2 2 1 1 0] + punpckhwd m2, m0, m4 ; [16 15 15 14 14 13 13 12 8 7 7 6 6 5 5 4] + + pmaddwd m4, m3, [r3 + 3 * 32] ; [19] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m2, [r3 + 3 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 - 10 * 32] ; [6] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m2, [r3 - 10 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m5, m2 + + punpcklwd m3, m0, m0 ; [11 11 10 10 9 9 8 8 3 3 2 2 1 1 0 0] + punpckhwd m0, m0 ; [15 15 14 14 13 13 12 12 7 7 6 6 5 5 4 4] + vinserti128 m1, m1, xm0, 1 ; [ 7 7 6 6 5 5 4 4 2 2 5 5 7 7 10 10] + vinserti128 m14, m14, xm3, 1 ; [ 3 3 2 2 1 1 0 0 12 12 15 15 x x x x] + + palignr m2, m3, m1, 14 + palignr m13, m0, m3, 14 + + pmaddwd m6, m2, [r3 + 9 * 32] ; [25] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m13, [r3 + 9 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + pmaddwd m7, m2, [r3 - 4 * 32] 
; [12] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 - 4 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m2, m3, m1, 10 ; [10 9 9 8 8 7 7 6 2 1 1 0 0 2 2 5] + palignr m13, m0, m3, 10 ; [14 13 13 12 12 11 11 10 6 5 5 4 4 3 3 2] + + pmaddwd m8, m2, [r3 + 15 * 32] ; [31] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m13, [r3 + 15 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m2, [r3 + 2 * 32] ; [18] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m13, [r3 + 2 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m2, [r3 - 11 * 32] ; [5] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m13, [r3 - 11 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + palignr m2, m3, m1, 6 ; [ 9 8 8 7 7 6 6 5 1 0 0 2 2 5 5 7] + palignr m13, m0, m3, 6 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + + pmaddwd m11, m2, [r3 + 8 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m13, [r3 + 8 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m11, m13 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + palignr m13, m0, m3, 6 + + pmaddwd m4, m2, [r3 - 5 * 32] ; [11] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m13, [r3 - 5 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m2, m0, m3, 2 ; [12 11 11 10 10 9 9 8 4 3 3 2 2 1 1 0] + palignr m13, m3, m1, 2 ; [ 8 7 7 6 6 5 5 4 0 2 2 5 5 7 7 10] + + pmaddwd m5, m13, [r3 + 14 * 32] ; [30] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m2, [r3 + 14 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m13, [r3 + 1 * 32] ; [17] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m2, [r3 + 1 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, m13, [r3 - 12 * 32] ; [4] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m2, [r3 - 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m2, m1, m14, 14 ; [ 7 6 6 5 5 4 4 3 2 5 5 7 7 10 10 12] + palignr m0, m3, m1, 14 ; [11 10 10 9 9 8 8 7 3 2 2 1 1 0 0 2] + + pmaddwd m8, m2, [r3 + 7 * 32] ; [23] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m0, [r3 + 7 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m2, [r3 - 6 * 32] ; [10] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m2, m0, [r3 - 6 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m9, m2 + + palignr m3, m1, 10 ; [10 9 9 8 8 7 7 6 2 1 1 0 0 2 2 5] + palignr m1, m14, 10 ; [ 6 5 5 4 4 3 3 2 5 7 7 10 10 12 12 15] + + pmaddwd m2, m1, [r3 + 13 * 32] ; [29] + paddd m2, [pd_16] + psrld m2, 5 + pmaddwd m0, m3, [r3 + 13 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m2, m0 + + pmaddwd m1, [r3] ; [16] + paddd m1, [pd_16] + psrld m1, 5 + pmaddwd m3, [r3] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m1, m3 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 2, 1, 0, 3, 16 + ret + +;; angle 16, modes 15 and 21 +cglobal ang16_mode_15_21 + test r6d, r6d + + movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] + movu m4, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + + punpcklwd m3, m0, m4 ; [12 11 11 10 10 9 9 8 4 3 3 2 2 1 1 0] + punpckhwd m2, m0, m4 ; [16 15 15 14 14 13 13 12 8 7 7 6 6 5 5 4] + + pmaddwd m4, m3, [r3 - 1 * 32] ; [15] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m2, [r3 - 1 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + punpcklwd m3, m0, m0 ; [11 11 10 10 9 9 8 8 3 3 2 2 1 1 0 0] + punpckhwd m0, m0 ; [15 15 14 14 13 13 12 12 7 7 6 6 5 5 4 4] + vinserti128 m1, m1, xm0, 1 + vinserti128 m14, m14, xm3, 1 + + palignr m2, m3, m1, 14 + palignr 
m13, m0, m3, 14 + + pmaddwd m5, m2, [r3 + 14 * 32] ; [30] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m13, [r3 + 14 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, m2, [r3 - 3 * 32] ; [13] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m13, [r3 - 3 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + palignr m2, m3, m1, 10 + palignr m13, m0, m3, 10 + + pmaddwd m7, m2, [r3 + 12 * 32] ; [28] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 + 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m2, [r3 - 5 * 32] ; [11] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m13, [r3 - 5 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m2, m3, m1, 6 + palignr m13, m0, m3, 6 + + pmaddwd m9, m2, [r3 + 10 * 32] ; [26] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m13, [r3 + 10 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m2, [r3 - 7 * 32] ; [9] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m13, [r3 - 7 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + palignr m2, m3, m1, 2 + palignr m13, m0, m3, 2 + + pmaddwd m11, m2, [r3 + 8 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m13, [r3 + 8 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m11, m13 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + palignr m13, m0, m3, 2 + + pmaddwd m4, m2, [r3 - 9 * 32] ; [7] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m13, [r3 - 9 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m1, m14, 14 + palignr m7, m3, m1, 14 + + pmaddwd m5, m6, [r3 + 6 * 32] ; [22] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m7, [r3 + 6 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, [r3 - 11 * 32] ; [5] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, [r3 - 11 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m8, m1, m14, 10 + palignr m9, m3, m1, 10 + + pmaddwd m7, m8, [r3 + 4 * 32] ; [20] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m10, m9, [r3 + 4 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m7, m10 + + pmaddwd m8, [r3 - 13 * 32] ; [3] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 - 13 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m2, m1, m14, 6 + palignr m0, m3, m1, 6 + + pmaddwd m9, m2, [r3 + 2 * 32] ; [18] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m13, m0, [r3 + 2 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m9, m13 + + pmaddwd m2, [r3 - 15 * 32] ; [1] + paddd m2, [pd_16] + psrld m2, 5 + pmaddwd m0, [r3 - 15 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m2, m0 + + palignr m3, m1, 2 + palignr m1, m14, 2 + + pmaddwd m1, [r3] ; [16] + paddd m1, [pd_16] + psrld m1, 5 + pmaddwd m3, [r3] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m1, m3 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 2, 1, 0, 3, 16 + ret + +;; angle 16, modes 16 and 20 +cglobal ang16_mode_16_20 + test r6d, r6d + + movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] + movu m4, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + + punpcklwd m3, m0, m4 ; [12 11 11 10 10 9 9 8 4 3 3 2 2 1 1 0] + punpckhwd m12, m0, m4 ; [16 15 15 14 14 13 13 12 8 7 7 6 6 5 5 4] + + pmaddwd m4, m3, [r3 - 5 * 32] ; [11] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m12, [r3 - 5 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + punpcklwd m3, m0, m0 ; [11 11 10 10 9 9 8 8 3 3 2 2 1 1 0 0] + punpckhwd m0, m0 ; [15 15 14 14 13 13 12 12 7 7 6 6 5 5 4 4] + vinserti128 m1, m1, xm0, 1 ; [ 7 7 6 6 
5 5 4 4 2 2 3 3 5 5 6 6] + vinserti128 m14, m14, xm3, 1 ; [ 3 3 2 2 1 1 0 0 8 8 9 9 11 11 12 12] + vinserti128 m2, m2, xm1, 1 ; [ 2 2 3 3 5 5 6 6 14 14 15 15 x x x x] + + palignr m12, m3, m1, 14 + palignr m13, m0, m3, 14 + + pmaddwd m5, m12, [r3 + 6 * 32] ; [22] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m13, [r3 + 6 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, m12, [r3 - 15 * 32] ; [1] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m13, [r3 - 15 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + palignr m12, m3, m1, 10 + palignr m13, m0, m3, 10 + + pmaddwd m7, m12, [r3 - 4 * 32] ; [12] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 - 4 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m12, m3, m1, 6 + palignr m13, m0, m3, 6 + + pmaddwd m8, m12, [r3 + 7 * 32] ; [23] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m13, [r3 + 7 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m12, [r3 - 14 * 32] ; [2] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m13, [r3 - 14 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + palignr m12, m3, m1, 2 + palignr m13, m0, m3, 2 + + pmaddwd m10, m12, [r3 - 3 * 32] ; [13] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, m13, [r3 - 3 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + palignr m12, m1, m14, 14 + palignr m13, m3, m1, 14 + + pmaddwd m11, m12, [r3 + 8 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m13, [r3 + 8 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m11, m13 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 0, 13, 0 + + palignr m13, m3, m1, 14 + + pmaddwd m4, m12, [r3 - 13 * 32] ; [3] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m13, [r3 - 13 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m1, m14, 10 + palignr m7, m3, m1, 10 + + pmaddwd m5, m6, [r3 - 2 * 32] ; [14] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m7, [r3 - 2 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + palignr m7, m1, m14, 6 + palignr m10, m3, m1, 6 + + pmaddwd m6, m7, [r3 + 9 * 32] ; [25] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m10, [r3 + 9 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, [r3 - 12 * 32] ; [4] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m10, [r3 - 12 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m7, m10 + + palignr m8, m1, m14, 2 ; [ 4 3 3 2 2 1 1 0 6 8 8 9 9 11 11 12] + palignr m9, m3, m1, 2 ; [ 8 7 7 6 6 5 5 4 0 2 2 3 3 5 5 6] + + pmaddwd m8, [r3 - 1 * 32] ; [15] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 - 1 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m12, m14, m2, 14 + palignr m0, m1, m14, 14 + + pmaddwd m9, m12, [r3 + 10 * 32] ; [26] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m13, m0, [r3 + 10 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m9, m13 + + pmaddwd m12, [r3 - 11 * 32] ; [5] + paddd m12, [pd_16] + psrld m12, 5 + pmaddwd m0, [r3 - 11 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m12, m0 + + palignr m1, m14, 10 + palignr m14, m2, 10 + + pmaddwd m14, [r3] ; [16] + paddd m14, [pd_16] + psrld m14, 5 + pmaddwd m1, [r3] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m14, m1 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 12, 14, 0, 3, 16 + ret + +;; angle 16, modes 17 and 19 +cglobal ang16_mode_17_19 + test r6d, r6d + + movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] + movu m4, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + + punpcklwd m3, m0, m4 ; [12 11 11 10 10 9 9 
8 4 3 3 2 2 1 1 0] + punpckhwd m12, m0, m4 ; [16 15 15 14 14 13 13 12 8 7 7 6 6 5 5 4] + + pmaddwd m4, m3, [r3 - 10 * 32] ; [6] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m12, [r3 - 10 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + punpcklwd m3, m0, m0 ; [11 11 10 10 9 9 8 8 3 3 2 2 1 1 0 0] + punpckhwd m0, m0 ; [15 15 14 14 13 13 12 12 7 7 6 6 5 5 4 4] + vinserti128 m1, m1, xm0, 1 ; [ 7 7 6 6 5 5 4 4 2 2 3 3 5 5 6 6] + vinserti128 m14, m14, xm3, 1 ; [ 3 3 2 2 1 1 0 0 8 8 9 9 11 11 12 12] + vinserti128 m2, m2, xm1, 1 ; [ 2 2 3 3 5 5 6 6 14 14 15 15 x x x x] + + palignr m12, m3, m1, 14 + palignr m13, m0, m3, 14 + + pmaddwd m5, m12, [r3 - 4 * 32] ; [12] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m13, [r3 - 4 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + palignr m12, m3, m1, 10 + palignr m13, m0, m3, 10 + + pmaddwd m6, m12, [r3 + 2 * 32] ; [18] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m13, [r3 + 2 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + palignr m12, m3, m1, 6 + palignr m13, m0, m3, 6 + + pmaddwd m7, m12, [r3 + 8 * 32] ; [24] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 + 8 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m12, m3, m1, 2 + palignr m13, m0, m3, 2 + + pmaddwd m8, m12, [r3 + 14 * 32] ; [30] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m13, [r3 + 14 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m12, [r3 - 12 * 32] ; [4] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m13, [r3 - 12 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + palignr m12, m1, m14, 14 + palignr m13, m3, m1, 14 + + pmaddwd m10, m12, [r3 - 6 * 32] ; [10] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, m13, [r3 - 6 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + palignr m12, m1, m14, 10 + palignr m13, m3, m1, 10 + + pmaddwd m11, m12, [r3] ; [16] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m13, [r3] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m11, m13 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 0, 13, 0 + + palignr m12, m1, m14, 6 + palignr m13, m3, m1, 6 + + pmaddwd m4, m12, [r3 + 6 * 32] ; [22] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m13, [r3 + 6 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m12, m1, m14, 2 + palignr m13, m3, m1, 2 + + pmaddwd m5, m12, [r3 + 12 * 32] ; [28] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m13, [r3 + 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, m12, [r3 - 14 * 32] ; [2] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m13, [r3 - 14 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + palignr m7, m14, m2, 14 + palignr m0, m1, m14, 14 + + pmaddwd m7, [r3 - 8 * 32] ; [8] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m0, [r3 - 8 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m7, m0 + + palignr m8, m14, m2, 10 + palignr m9, m1, m14, 10 + + pmaddwd m8, [r3 - 2 * 32] ; [14] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 - 2 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m9, m14, m2, 6 + palignr m13, m1, m14, 6 + + pmaddwd m9, [r3 + 4 * 32] ; [20] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m13, [r3 + 4 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m9, m13 + + palignr m1, m14, 2 + palignr m14, m2, 2 + + pmaddwd m12, m14, [r3 + 10 * 32] ; [26] + paddd m12, [pd_16] + psrld m12, 5 + pmaddwd m0, m1, [r3 + 10 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m12, m0 + + pmaddwd m14, [r3 - 16 * 32] ; [0] + 
paddd m14, [pd_16] + psrld m14, 5 + pmaddwd m1, [r3 - 16 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m14, m1 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 12, 14, 0, 3, 16 + ret + +cglobal intra_pred_ang16_3, 3,7,13 + add r2, 64 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_3_33 + RET + +cglobal intra_pred_ang16_33, 3,7,13 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_3_33 + RET + +cglobal intra_pred_ang16_4, 3,7,13 + add r2, 64 + xor r6d, r6d + lea r3, [ang_table_avx2 + 18 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_4_32 + RET + +cglobal intra_pred_ang16_32, 3,7,13 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 18 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_4_32 + RET + +cglobal intra_pred_ang16_5, 3,7,13 + add r2, 64 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_5_31 + RET + +cglobal intra_pred_ang16_31, 3,7,13 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_5_31 + RET + +cglobal intra_pred_ang16_6, 3,7,14 + add r2, 64 + xor r6d, r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_6_30 + RET + +cglobal intra_pred_ang16_30, 3,7,14 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_6_30 + RET + +cglobal intra_pred_ang16_7, 3,7,13 + add r2, 64 + xor r6d, r6d + lea r3, [ang_table_avx2 + 17 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_7_29 + RET + +cglobal intra_pred_ang16_29, 3,7,13 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 17 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_7_29 + RET + +cglobal intra_pred_ang16_8, 3,7,13 + add r2, 64 + xor r6d, r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_8_28 + RET + +cglobal intra_pred_ang16_28, 3,7,13 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_8_28 + RET + +cglobal intra_pred_ang16_9, 3,7,12 + add r2, 64 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_9_27 + RET + +cglobal intra_pred_ang16_27, 3,7,12 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_9_27 + RET + +cglobal intra_pred_ang16_10, 3,6,3 + mov r5d, r4m + add r1d, r1d + lea r4, [r1 * 3] + + vpbroadcastw m2, [r2 + 2 + 64] ; [1...] + mova m0, m2 + movu [r0], m2 + vpbroadcastw m1, [r2 + 2 + 64 + 2] ; [2...] + movu [r0 + r1], m1 + vpbroadcastw m2, [r2 + 2 + 64 + 4] ; [3...] + movu [r0 + r1 * 2], m2 + vpbroadcastw m1, [r2 + 2 + 64 + 6] ; [4...] + movu [r0 + r4], m1 + + lea r3, [r0 + r1 * 4] + vpbroadcastw m2, [r2 + 2 + 64 + 8] ; [5...] + movu [r3], m2 + vpbroadcastw m1, [r2 + 2 + 64 + 10] ; [6...] + movu [r3 + r1], m1 + vpbroadcastw m2, [r2 + 2 + 64 + 12] ; [7...] + movu [r3 + r1 * 2], m2 + vpbroadcastw m1, [r2 + 2 + 64 + 14] ; [8...] + movu [r3 + r4], m1 + + lea r3, [r3 + r1 *4] + vpbroadcastw m2, [r2 + 2 + 64 + 16] ; [9...] + movu [r3], m2 + vpbroadcastw m1, [r2 + 2 + 64 + 18] ; [10...] + movu [r3 + r1], m1 + vpbroadcastw m2, [r2 + 2 + 64 + 20] ; [11...] + movu [r3 + r1 * 2], m2 + vpbroadcastw m1, [r2 + 2 + 64 + 22] ; [12...] + movu [r3 + r4], m1 + + lea r3, [r3 + r1 *4] + vpbroadcastw m2, [r2 + 2 + 64 + 24] ; [13...] 
+ movu [r3], m2 + vpbroadcastw m1, [r2 + 2 + 64 + 26] ; [14...] + movu [r3 + r1], m1 + vpbroadcastw m2, [r2 + 2 + 64 + 28] ; [15...] + movu [r3 + r1 * 2], m2 + vpbroadcastw m1, [r2 + 2 + 64 + 30] ; [16...] + movu [r3 + r4], m1 + + cmp r5d, byte 0 + jz .quit + + ; filter + vpbroadcastw m2, [r2] ; [0 0...] + movu m1, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + psubw m1, m2 + psraw m1, 1 + paddw m0, m1 + pxor m1, m1 + pmaxsw m0, m1 + pminsw m0, [pw_pixel_max] +.quit: + movu [r0], m0 + RET + +cglobal intra_pred_ang16_26, 3,6,4 + mov r5d, r4m + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + add r1d, r1d + lea r4, [r1 * 3] + + movu [r0], m0 + movu [r0 + r1], m0 + movu [r0 + r1 * 2], m0 + movu [r0 + r4], m0 + + lea r3, [r0 + r1 *4] + movu [r3], m0 + movu [r3 + r1], m0 + movu [r3 + r1 * 2], m0 + movu [r3 + r4], m0 + + lea r3, [r3 + r1 *4] + movu [r3], m0 + movu [r3 + r1], m0 + movu [r3 + r1 * 2], m0 + movu [r3 + r4], m0 + + lea r3, [r3 + r1 *4] + movu [r3], m0 + movu [r3 + r1], m0 + movu [r3 + r1 * 2], m0 + movu [r3 + r4], m0 + + cmp r5d, byte 0 + jz .quit + + ; filter + + vpbroadcastw m0, xm0 + vpbroadcastw m2, [r2] + movu m1, [r2 + 2 + 64] + psubw m1, m2 + psraw m1, 1 + paddw m0, m1 + pxor m1, m1 + pmaxsw m0, m1 + pminsw m0, [pw_pixel_max] + pextrw [r0], xm0, 0 + pextrw [r0 + r1], xm0, 1 + pextrw [r0 + r1 * 2], xm0, 2 + pextrw [r0 + r4], xm0, 3 + lea r0, [r0 + r1 * 4] + pextrw [r0], xm0, 4 + pextrw [r0 + r1], xm0, 5 + pextrw [r0 + r1 * 2], xm0, 6 + pextrw [r0 + r4], xm0, 7 + lea r0, [r0 + r1 * 4] + vpermq m0, m0, 11101110b + pextrw [r0], xm0, 0 + pextrw [r0 + r1], xm0, 1 + pextrw [r0 + r1 * 2], xm0, 2 + pextrw [r0 + r4], xm0, 3 + pextrw [r3], xm0, 4 + pextrw [r3 + r1], xm0, 5 + pextrw [r3 + r1 * 2], xm0, 6 + pextrw [r3 + r4], xm0, 7 +.quit: + RET + +cglobal intra_pred_ang16_11, 3,7,12, 0-4 + movzx r5d, word [r2 + 64] + movzx r6d, word [r2] + mov [rsp], r5w + mov [r2 + 64], r6w + + add r2, 64 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_11_25 + + mov r6d, [rsp] + mov [r2], r6w + RET + +cglobal intra_pred_ang16_25, 3,7,12 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + + call ang16_mode_11_25 + RET + +cglobal intra_pred_ang16_12, 3,7,14, 0-4 + movzx r5d, word [r2 + 64] + movzx r6d, word [r2] + mov [rsp], r5w + mov [r2 + 64], r6w + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + movu xm1, [r2 + 12] ; [13 12 11 10 9 8 7 6] + pshufb xm1, [pw_ang16_12_24] ; [ 6 6 13 13 x x x x] + xor r6d, r6d + add r2, 64 + + call ang16_mode_12_24 + + mov r6d, [rsp] + mov [r2], r6w + RET + +cglobal intra_pred_ang16_24, 3,7,14, 0-4 + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + movu xm1, [r2 + 76] ; [13 12 11 10 9 8 7 6] + pshufb xm1, [pw_ang16_12_24] ; [ 6 6 13 13 x x x x] + xor r6d, r6d + inc r6d + + call ang16_mode_12_24 + RET + +cglobal intra_pred_ang16_13, 3,7,14, 0-4 + movzx r5d, word [r2 + 64] + movzx r6d, word [r2] + mov [rsp], r5w + mov [r2 + 64], r6w + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + movu xm1, [r2 + 8] ; [11 x x x 7 x x 4] + pinsrw xm1, [r2 + 28], 1 ; [11 x x x 7 x 14 4] + pshufb xm1, [pw_ang16_13_23] ; [ 4 4 7 7 11 11 14 14] + xor r6d, r6d + add r2, 64 + + call ang16_mode_13_23 + + mov r6d, [rsp] + mov [r2], r6w + RET + +cglobal intra_pred_ang16_23, 3,7,14, 0-4 + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + movu xm1, [r2 + 72] ; [11 10 9 8 7 6 5 4] + pinsrw 
xm1, [r2 + 92], 1 ; [11 x x x 7 x 14 4] + pshufb xm1, [pw_ang16_13_23] ; [ 4 4 7 7 11 11 14 14] + xor r6d, r6d + inc r6d + + call ang16_mode_13_23 + RET + +cglobal intra_pred_ang16_14, 3,7,15, 0-4 + movzx r5d, word [r2 + 64] + movzx r6d, word [r2] + mov [rsp], r5w + mov [r2 + 64], r6w + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + movu xm1, [r2 + 4] ; [ x x 7 x 5 x x 2] + pinsrw xm1, [r2 + 20], 1 ; [ x x 7 x 5 x 10 2] + movu xm14, [r2 + 24] ; [ x x x x 15 x x 12] + pshufb xm14, [pw_ang16_14_22] ; [12 12 15 15 x x x x] + pshufb xm1, [pw_ang16_14_22] ; [ 2 2 5 5 7 7 10 10] + xor r6d, r6d + add r2, 64 + + call ang16_mode_14_22 + + mov r6d, [rsp] + mov [r2], r6w + RET + +cglobal intra_pred_ang16_22, 3,7,15, 0-4 + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + movu xm1, [r2 + 68] ; [ x x 7 x 5 x x 2] + pinsrw xm1, [r2 + 84], 1 ; [ x x 7 x 5 x 10 2] + movu xm14, [r2 + 88] ; [ x x x x 15 x x 12] + pshufb xm14, [pw_ang16_14_22] ; [12 12 15 15 x x x x] + pshufb xm1, [pw_ang16_14_22] ; [ 2 2 5 5 7 7 10 10] + xor r6d, r6d + inc r6d + + call ang16_mode_14_22 + RET + +cglobal intra_pred_ang16_15, 3,7,15, 0-4 + movzx r5d, word [r2 + 64] + movzx r6d, word [r2] + mov [rsp], r5w + mov [r2 + 64], r6w + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + movu xm1, [r2 + 4] ; [ x 8 x 6 x 4 x 2] + movu xm14, [r2 + 18] ; [ x 15 x 13 x 11 x 9] + pshufb xm14, [pw_ang16_15_21] ; [ 9 9 11 11 13 13 15 15] + pshufb xm1, [pw_ang16_15_21] ; [ 2 2 4 4 6 6 8 8] + xor r6d, r6d + add r2, 64 + + call ang16_mode_15_21 + + mov r6d, [rsp] + mov [r2], r6w + RET + +cglobal intra_pred_ang16_21, 3,7,15, 0-4 + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + movu xm1, [r2 + 68] ; [ x 8 x 6 x 4 x 2] + movu xm14, [r2 + 82] ; [ x 15 x 13 x 11 x 9] + pshufb xm14, [pw_ang16_15_21] ; [ 9 9 11 11 13 13 15 15] + pshufb xm1, [pw_ang16_15_21] ; [ 2 2 4 4 6 6 8 8] + xor r6d, r6d + inc r6d + + call ang16_mode_15_21 + RET + +cglobal intra_pred_ang16_16, 3,7,15, 0-4 + movzx r5d, word [r2 + 64] + movzx r6d, word [r2] + mov [rsp], r5w + mov [r2 + 64], r6w + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + movu xm1, [r2 + 4] ; [ x x x 6 5 x 3 2] + movu xm14, [r2 + 16] ; [ x x x 12 11 x 9 8] + movu xm2, [r2 + 28] ; [ x x x x x x 15 14] + pshufb xm14, [pw_ang16_16_20] ; [ 8 8 9 9 11 11 12 12] + pshufb xm1, [pw_ang16_16_20] ; [ 2 2 3 3 5 5 6 6] + pshufb xm2, [pw_ang16_16_20] ; [14 14 15 15 x x x x] + xor r6d, r6d + add r2, 64 + + call ang16_mode_16_20 + + mov r6d, [rsp] + mov [r2], r6w + RET + +cglobal intra_pred_ang16_20, 3,7,15, 0-4 + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + movu xm1, [r2 + 68] ; [ x x x 6 5 x 3 2] + movu xm14, [r2 + 80] ; [ x x x 12 11 x 9 8] + movu xm2, [r2 + 92] ; [ x x x x x x 15 14] + pshufb xm14, [pw_ang16_16_20] ; [ 8 8 9 9 11 11 12 12] + pshufb xm1, [pw_ang16_16_20] ; [ 2 2 3 3 5 5 6 6] + pshufb xm2, [pw_ang16_16_20] ; [14 14 15 15 x x x x] + xor r6d, r6d + inc r6d + + call ang16_mode_16_20 + RET + +cglobal intra_pred_ang16_17, 3,7,15, 0-4 + movzx r5d, word [r2 + 64] + movzx r6d, word [r2] + mov [rsp], r5w + mov [r2 + 64], r6w + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + movu xm1, [r2 + 2] ; [ x x x 6 5 x 3 2] + movu xm14, [r2 + 12] ; [ x x x 12 11 x 9 8] + movu xm2, [r2 + 22] ; [ x x x x x x 15 14] + pshufb xm14, [pw_ang16_16_20] ; [ 8 8 9 9 11 11 12 12] + pshufb xm1, [pw_ang16_16_20] ; [ 2 2 3 3 5 5 6 6] + pshufb xm2, [pw_ang16_16_20] ; [14 14 15 
15 x x x x] + xor r6d, r6d + add r2, 64 + + call ang16_mode_17_19 + + mov r6d, [rsp] + mov [r2], r6w + RET + +cglobal intra_pred_ang16_19, 3,7,15, 0-4 + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + movu xm1, [r2 + 66] ; [ x x x 6 5 x 3 2] + movu xm14, [r2 + 76] ; [ x x x 12 11 x 9 8] + movu xm2, [r2 + 86] ; [ x x x x x x 15 14] + pshufb xm14, [pw_ang16_16_20] ; [ 8 8 9 9 11 11 12 12] + pshufb xm1, [pw_ang16_16_20] ; [ 2 2 3 3 5 5 6 6] + pshufb xm2, [pw_ang16_16_20] ; [14 14 15 15 x x x x] + xor r6d, r6d + inc r6d + + call ang16_mode_17_19 + RET + +cglobal intra_pred_ang16_18, 3,5,4 + add r1d, r1d + lea r4, [r1 * 3] + movu m1, [r2] + movu m0, [r2 + 2 + 64] + pshufb m0, [pw_swap16] + mova m3, m0 + vinserti128 m0, m0, xm1, 1 + movu [r0], m1 + palignr m2, m1, m0, 14 + movu [r0 + r1], m2 + + palignr m2, m1, m0, 12 + movu [r0 + r1 * 2], m2 + palignr m2, m1, m0, 10 + movu [r0 + r4], m2 + + lea r0, [r0 + r1 * 4] + palignr m2, m1, m0, 8 + movu [r0], m2 + palignr m2, m1, m0, 6 + movu [r0 + r1], m2 + palignr m2, m1, m0, 4 + movu [r0 + r1 * 2], m2 + palignr m2, m1, m0, 2 + movu [r0 + r4], m2 + + lea r0, [r0 + r1 * 4] + movu [r0], m0 + vpermq m3, m3, 01001110b + palignr m2, m0, m3, 14 + movu [r0 + r1], m2 + palignr m2, m0, m3, 12 + movu [r0 + r1 * 2], m2 + palignr m2, m0, m3, 10 + movu [r0 + r4], m2 + palignr m2, m1, m0, 10 + + lea r0, [r0 + r1 * 4] + palignr m2, m0, m3, 8 + movu [r0], m2 + palignr m2, m0, m3, 6 + movu [r0 + r1], m2 + palignr m2, m0, m3, 4 + movu [r0 + r1 * 2], m2 + palignr m2, m0, m3, 2 + movu [r0 + r4], m2 + palignr m1, m0, 2 + RET + +;------------------------------------------------------------------------------------------------------- +; end of avx2 code for intra_pred_ang16 mode 2 to 34 +;------------------------------------------------------------------------------------------------------- + +;------------------------------------------------------------------------------------------------------- +; avx2 code for intra_pred_ang32 mode 2 to 34 start +;------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal intra_pred_ang32_2, 3,5,6 + lea r4, [r2] + add r2, 128 + cmp r3m, byte 34 + cmove r2, r4 + add r1d, r1d + lea r3, [r1 * 3] + movu m0, [r2 + 4] + movu m1, [r2 + 20] + movu m3, [r2 + 36] + movu m4, [r2 + 52] + + movu [r0], m0 + movu [r0 + 32], m3 + palignr m2, m1, m0, 2 + palignr m5, m4, m3, 2 + movu [r0 + r1], m2 + movu [r0 + r1 + 32], m5 + palignr m2, m1, m0, 4 + palignr m5, m4, m3, 4 + movu [r0 + r1 * 2], m2 + movu [r0 + r1 * 2 + 32], m5 + palignr m2, m1, m0, 6 + palignr m5, m4, m3, 6 + movu [r0 + r3], m2 + movu [r0 + r3 + 32], m5 + + lea r0, [r0 + r1 * 4] + palignr m2, m1, m0, 8 + palignr m5, m4, m3, 8 + movu [r0], m2 + movu [r0 + 32], m5 + palignr m2, m1, m0, 10 + palignr m5, m4, m3, 10 + movu [r0 + r1], m2 + movu [r0 + r1 + 32], m5 + palignr m2, m1, m0, 12 + palignr m5, m4, m3, 12 + movu [r0 + r1 * 2], m2 + movu [r0 + r1 * 2 + 32], m5 + palignr m2, m1, m0, 14 + palignr m5, m4, m3, 14 + movu [r0 + r3], m2 + movu [r0 + r3 + 32], m5 + + movu m0, [r2 + 36] + movu m3, [r2 + 68] + lea r0, [r0 + r1 * 4] + movu [r0], m1 + movu [r0 + 32], m4 + palignr m2, m0, m1, 2 + palignr m5, m3, m4, 2 + movu [r0 + r1], m2 + movu [r0 + r1 + 32], m5 + palignr m2, m0, m1, 4 + palignr m5, m3, m4, 4 + movu [r0 + r1 * 2], m2 + movu [r0 + r1 * 2 + 32], m5 + palignr m2, m0, m1, 6 + palignr m5, m3, m4, 6 + movu [r0 + r3], m2 + movu [r0 + r3 + 32], m5 + + lea r0, [r0 + r1 * 4] + palignr m2, m0, m1, 8 + palignr 
m5, m3, m4, 8 + movu [r0], m2 + movu [r0 + 32], m5 + palignr m2, m0, m1, 10 + palignr m5, m3, m4, 10 + movu [r0 + r1], m2 + movu [r0 + r1 + 32], m5 + palignr m2, m0, m1, 12 + palignr m5, m3, m4, 12 + movu [r0 + r1 * 2], m2 + movu [r0 + r1 * 2 + 32], m5 + palignr m2, m0, m1, 14 + palignr m5, m3, m4, 14 + movu [r0 + r3], m2 + movu [r0 + r3 + 32], m5 + + lea r0, [r0 + r1 * 4] + movu m1, [r2 + 52] + movu m4, [r2 + 84] + + movu [r0], m0 + movu [r0 + 32], m3 + palignr m2, m1, m0, 2 + palignr m5, m4, m3, 2 + movu [r0 + r1], m2 + movu [r0 + r1 + 32], m5 + palignr m2, m1, m0, 4 + palignr m5, m4, m3, 4 + movu [r0 + r1 * 2], m2 + movu [r0 + r1 * 2 + 32], m5 + palignr m2, m1, m0, 6 + palignr m5, m4, m3, 6 + movu [r0 + r3], m2 + movu [r0 + r3 + 32], m5 + + lea r0, [r0 + r1 * 4] + palignr m2, m1, m0, 8 + palignr m5, m4, m3, 8 + movu [r0], m2 + movu [r0 + 32], m5 + palignr m2, m1, m0, 10 + palignr m5, m4, m3, 10 + movu [r0 + r1], m2 + movu [r0 + r1 + 32], m5 + palignr m2, m1, m0, 12 + palignr m5, m4, m3, 12 + movu [r0 + r1 * 2], m2 + movu [r0 + r1 * 2 + 32], m5 + palignr m2, m1, m0, 14 + palignr m5, m4, m3, 14 + movu [r0 + r3], m2 + movu [r0 + r3 + 32], m5 + + movu m0, [r2 + 68] + movu m3, [r2 + 100] + lea r0, [r0 + r1 * 4] + movu [r0], m1 + movu [r0 + 32], m4 + palignr m2, m0, m1, 2 + palignr m5, m3, m4, 2 + movu [r0 + r1], m2 + movu [r0 + r1 + 32], m5 + palignr m2, m0, m1, 4 + palignr m5, m3, m4, 4 + movu [r0 + r1 * 2], m2 + movu [r0 + r1 * 2 + 32], m5 + palignr m2, m0, m1, 6 + palignr m5, m3, m4, 6 + movu [r0 + r3], m2 + movu [r0 + r3 + 32], m5 + + lea r0, [r0 + r1 * 4] + palignr m2, m0, m1, 8 + palignr m5, m3, m4, 8 + movu [r0], m2 + movu [r0 + 32], m5 + palignr m2, m0, m1, 10 + palignr m5, m3, m4, 10 + movu [r0 + r1], m2 + movu [r0 + r1 + 32], m5 + palignr m2, m0, m1, 12 + palignr m5, m3, m4, 12 + movu [r0 + r1 * 2], m2 + movu [r0 + r1 * 2 + 32], m5 + palignr m2, m0, m1, 14 + palignr m5, m3, m4, 14 + movu [r0 + r3], m2 + movu [r0 + r3 + 32], m5 + RET + +cglobal intra_pred_ang32_3, 3,8,13 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + + call ang16_mode_3_33 + + add r2, 26 + lea r0, [r0 + 32] + + call ang16_mode_3_33 + + add r2, 6 + lea r0, [r7 + 8 * r1] + + call ang16_mode_3_33 + + add r2, 26 + lea r0, [r0 + 32] + + call ang16_mode_3_33 + RET + +cglobal intra_pred_ang32_33, 3,7,13 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r5, [r0 + 32] + + call ang16_mode_3_33 + + add r2, 26 + + call ang16_mode_3_33 + + add r2, 6 + mov r0, r5 + + call ang16_mode_3_33 + + add r2, 26 + + call ang16_mode_3_33 + RET + +;; angle 32, modes 4 and 32 +cglobal ang32_mode_4_32 + test r6d, r6d + + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + movu m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + movu m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + pmaddwd m4, m3, [r3 - 13 * 32] ; [5] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 - 13 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 8 * 32] ; [26] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 + 8 
* 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + palignr m6, m0, m3, 4 ; [14 13 13 12 12 11 11 10 6 5 5 4 4 3 3 2] + pmaddwd m6, [r3 - 3 * 32] ; [15] + paddd m6, [pd_16] + psrld m6, 5 + palignr m7, m2, m0, 4 ; [18 17 17 16 16 15 15 14 10 9 9 8 8 7 7 6] + pmaddwd m7, [r3 - 3 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m8, m0, m3, 8 ; [15 14 14 13 13 12 12 11 7 6 6 5 5 4 4 3] + pmaddwd m7, m8, [r3 - 14 * 32] ; [4] + paddd m7, [pd_16] + psrld m7, 5 + palignr m9, m2, m0, 8 ; [19 18 18 17 17 16 16 15 11 10 10 9 9 8 8 7] + pmaddwd m10, m9, [r3 - 14 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m7, m10 + + pmaddwd m8, [r3 + 7 * 32] ; [25] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 + 7 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m9, m0, m3, 12 + pmaddwd m9, [r3 - 4 * 32] ; [14] + paddd m9, [pd_16] + psrld m9, 5 + palignr m3, m2, m0, 12 + pmaddwd m3, [r3 - 4 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + pmaddwd m10, m0, [r3 - 15 * 32] ; [3] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m3, m2, [r3 - 15 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m10, m3 + + pmaddwd m11, m0, [r3 + 6 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m3, m2, [r3 + 6 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m11, m3 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 3, 0 + + palignr m4, m2, m0, 4 + pmaddwd m4, [r3 - 5* 32] ; [13] + paddd m4, [pd_16] + psrld m4, 5 + palignr m5, m1, m2, 4 + pmaddwd m5, [r3 - 5 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m2, m0, 8 + pmaddwd m5, m6, [r3 - 16 * 32] ; [2] + paddd m5, [pd_16] + psrld m5, 5 + palignr m7, m1, m2, 8 + pmaddwd m8, m7, [r3 - 16 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, [r3 + 5 * 32] ; [23] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, [r3 + 5 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m7, m2, m0, 12 + pmaddwd m7, [r3 - 6 * 32] ; [12] + paddd m7, [pd_16] + psrld m7, 5 + palignr m8, m1, m2, 12 + pmaddwd m8, [r3 - 6 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + movu m0, [r2 + 34] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + pmaddwd m8, m2, [r3 - 17 * 32] ; [1] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m1, [r3 - 17 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m3, m0, m0, 2 ; [ x 32 31 30 29 28 27 26 x 24 23 22 21 20 19 18] + punpcklwd m0, m3 ; [29 29 28 28 27 27 26 22 21 20 20 19 19 18 18 17] + + pmaddwd m9, m2, [r3 + 4 * 32] ; [22] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m3, m1, [r3 + 4 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + palignr m10, m1, m2, 4 + pmaddwd m10, [r3 - 7 * 32] ; [11] + paddd m10, [pd_16] + psrld m10, 5 + palignr m11, m0, m1, 4 + pmaddwd m11, [r3 - 7 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + palignr m3, m1, m2, 8 + pmaddwd m3, [r3 - 18 * 32] ; [0] + paddd m3, [pd_16] + psrld m3, 5 + palignr m0, m1, 8 + pmaddwd m0, [r3 - 18 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m3, m0 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 3, 0, 1, 16 + ret + +cglobal intra_pred_ang32_4, 3,8,13 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 18 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + + call ang16_mode_4_32 + + add r2, 22 + lea r0, [r0 + 32] + + call ang32_mode_4_32 + + add r2, 10 + lea r0, [r7 + 8 * r1] + + call ang16_mode_4_32 + + add r2, 22 + lea r0, [r0 + 32] + + call ang32_mode_4_32 + RET + +cglobal 
intra_pred_ang32_32, 3,7,13 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 18 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r5, [r0 + 32] + + call ang16_mode_4_32 + + add r2, 22 + + call ang32_mode_4_32 + + add r2, 10 + mov r0, r5 + + call ang16_mode_4_32 + + add r2, 22 + + call ang32_mode_4_32 + RET + +;; angle 32, modes 5 and 31 +cglobal ang32_mode_5_31 + test r6d, r6d + + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + movu m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + movu m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + pmaddwd m4, m3, [r3 - 15 * 32] ; [1] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 - 15 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 2 * 32] ; [18] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 + 2 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + palignr m7, m0, m3, 4 + pmaddwd m6, m7, [r3 - 13 * 32] ; [3] + paddd m6, [pd_16] + psrld m6, 5 + palignr m8, m2, m0, 4 + pmaddwd m9, m8, [r3 - 13 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + pmaddwd m7, [r3 + 4 * 32] ; [20] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, [r3 + 4 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m9, m0, m3, 8 + pmaddwd m8, m9, [r3 - 11 * 32] ; [5] + paddd m8, [pd_16] + psrld m8, 5 + palignr m10, m2, m0, 8 + pmaddwd m11, m10, [r3 - 11 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m8, m11 + + pmaddwd m9, [r3 + 6 * 32] ; [22] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, [r3 + 6 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + palignr m11, m0, m3, 12 + pmaddwd m10, m11, [r3 - 9 * 32] ; [7] + paddd m10, [pd_16] + psrld m10, 5 + palignr m12, m2, m0, 12 + pmaddwd m3, m12, [r3 - 9 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m10, m3 + + pmaddwd m11, [r3 + 8 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m12, [r3 + 8 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 3, 0 + + pmaddwd m4, m0, [r3 - 7 * 32] ; [9] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m2, [r3 - 7 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m0, [r3 + 10 * 32] ; [26] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m3, m2, [r3 + 10 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m5, m3 + + palignr m7, m2, m0, 4 + pmaddwd m6, m7, [r3 - 5 * 32] ; [11] + paddd m6, [pd_16] + psrld m6, 5 + palignr m8, m1, m2, 4 + pmaddwd m9, m8, [r3 - 5 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + pmaddwd m7, [r3 + 12 * 32] ; [28] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, [r3 + 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m9, m2, m0, 8 + pmaddwd m8, m9, [r3 - 3 * 32] ; [13] + paddd m8, [pd_16] + psrld m8, 5 + palignr m3, m1, m2, 8 + pmaddwd m10, m3, [r3 - 3 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m8, m10 + + pmaddwd m9, [r3 + 14 * 32] ; [30] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m3, [r3 + 14 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + palignr m10, m2, m0, 12 + pmaddwd m10, [r3 - 1 * 32] ; [15] + paddd m10, [pd_16] + psrld m10, 
5 + palignr m11, m1, m2, 12 + pmaddwd m11, [r3 - 1 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + pmaddwd m2, [r3 - 16 * 32] ; [0] + paddd m2, [pd_16] + psrld m2, 5 + pmaddwd m1, [r3 - 16 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m2, m1 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 2, 0, 1, 16 + ret + +cglobal intra_pred_ang32_5, 3,8,13 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + + call ang16_mode_5_31 + + add r2, 18 + lea r0, [r0 + 32] + + call ang32_mode_5_31 + + add r2, 14 + lea r0, [r7 + 8 * r1] + + call ang16_mode_5_31 + + add r2, 18 + lea r0, [r0 + 32] + + call ang32_mode_5_31 + RET + +cglobal intra_pred_ang32_31, 3,7,13 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r5, [r0 + 32] + + call ang16_mode_5_31 + + add r2, 18 + + call ang32_mode_5_31 + + add r2, 14 + mov r0, r5 + + call ang16_mode_5_31 + + add r2, 18 + + call ang32_mode_5_31 + RET + +;; angle 32, modes 6 and 30 +cglobal ang32_mode_6_30 + test r6d, r6d + + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + movu m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + movu m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + pmaddwd m4, m3, [r3 + 14 * 32] ; [29] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 + 14 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m0, m3, 4 + pmaddwd m5, m6, [r3 - 5 * 32] ; [10] + paddd m5, [pd_16] + psrld m5, 5 + palignr m7, m2, m0, 4 + pmaddwd m8, m7, [r3 - 5 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, [r3 + 8 * 32] ; [23] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, [r3 + 8 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m9, m0, m3, 8 + pmaddwd m7, m9, [r3 - 11 * 32] ; [4] + paddd m7, [pd_16] + psrld m7, 5 + palignr m12, m2, m0, 8 + pmaddwd m11, m12, [r3 - 11 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m7, m11 + + pmaddwd m8, m9, [r3 + 2 * 32] ; [17] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m10, m12, [r3 + 2 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m8, m10 + + pmaddwd m9, [r3 + 15 * 32] ; [30] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m12, [r3 + 15 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m9, m12 + + palignr m11, m0, m3, 12 + pmaddwd m10, m11, [r3 - 4 * 32] ; [11] + paddd m10, [pd_16] + psrld m10, 5 + palignr m12, m2, m0, 12 + pmaddwd m3, m12, [r3 - 4 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m10, m3 + + pmaddwd m11, [r3 + 9 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m12, [r3 + 9 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + pmaddwd m4, m0, [r3 - 10 * 32] ; [5] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m2, [r3 - 10 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m0, [r3 + 3 * 32] ; [18] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m3, m2, [r3 + 3 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m5, m3 + + pmaddwd m6, m0, [r3 + 16 * 32] ; [31] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, m2, 
[r3 + 16 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m8, m2, m0, 4 + pmaddwd m7, m8, [r3 - 3 * 32] ; [12] + paddd m7, [pd_16] + psrld m7, 5 + palignr m9, m1, m2, 4 + pmaddwd m3, m9, [r3 - 3 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m7, m3 + + pmaddwd m8, [r3 + 10 * 32] ; [25] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 + 10 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m10, m2, m0, 8 + pmaddwd m9, m10, [r3 - 9 * 32] ; [6] + paddd m9, [pd_16] + psrld m9, 5 + palignr m12, m1, m2, 8 + pmaddwd m3, m12, [r3 - 9 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + pmaddwd m10, [r3 + 4 * 32] ; [19] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, [r3 + 4 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + palignr m11, m2, m0, 12 + pmaddwd m11, [r3 - 15 * 32] ; [0] + paddd m11, [pd_16] + psrld m11, 5 + palignr m3, m1, m2, 12 + pmaddwd m3, [r3 - 15 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m11, m3 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 16 + ret + +cglobal intra_pred_ang32_6, 3,8,14 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + + call ang16_mode_6_30 + + add r2, 12 + lea r0, [r0 + 32] + + call ang32_mode_6_30 + + add r2, 20 + lea r0, [r7 + 8 * r1] + + call ang16_mode_6_30 + + add r2, 12 + lea r0, [r0 + 32] + + call ang32_mode_6_30 + RET + +cglobal intra_pred_ang32_30, 3,7,14 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r5, [r0 + 32] + + call ang16_mode_6_30 + + add r2, 12 + + call ang32_mode_6_30 + + add r2, 20 + mov r0, r5 + + call ang16_mode_6_30 + + add r2, 12 + + call ang32_mode_6_30 + RET + +;; angle 32, modes 7 and 29 +cglobal ang32_mode_7_29 + test r6d, r6d + + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + movu m1, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + movu m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m1, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + punpckhwd m1, m4 ; [25 24 24 23 23 22 22 21 17 16 16 15 15 14 14 13] + + pmaddwd m4, m3, [r3 + 8 * 32] ; [25] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 + 8 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m8, m0, m3, 4 + pmaddwd m5, m8, [r3 - 15 * 32] ; [2] + paddd m5, [pd_16] + psrld m5, 5 + palignr m9, m2, m0, 4 + pmaddwd m10, m9, [r3 - 15 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m5, m10 + + pmaddwd m6, m8, [r3 - 6 * 32] ; [11] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, m9, [r3 - 6 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + pmaddwd m7, m8, [r3 + 3 * 32] ; [20] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m10, m9, [r3 + 3 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m7, m10 + + pmaddwd m8, [r3 + 12 * 32] ; [29] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 + 12 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m11, m0, m3, 8 + pmaddwd m9, m11, [r3 - 11 * 32] ; [6] + paddd m9, [pd_16] + psrld m9, 5 + palignr m12, m2, m0, 8 + pmaddwd m10, m12, [r3 - 11 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m11, [r3 - 2 * 32] ; [15] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m13, m12, [r3 - 
2 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m10, m13 + + pmaddwd m11, [r3 + 7 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m12, [r3 + 7 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + palignr m5, m0, m3, 12 + pmaddwd m4, m5, [r3 - 16 * 32] ; [1] + paddd m4, [pd_16] + psrld m4, 5 + palignr m6, m2, m0, 12 + pmaddwd m7, m6, [r3 - 16 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m4, m7 + + pmaddwd m5, [r3 - 7 * 32] ; [10] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, [r3 - 7 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + palignr m9, m0, m3, 12 + pmaddwd m6, m9, [r3 + 2 * 32] ; [19] + paddd m6, [pd_16] + psrld m6, 5 + palignr m3, m2, m0, 12 + pmaddwd m7, m3, [r3 + 2 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + pmaddwd m7, m9, [r3 + 11 * 32] ; [28] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m3, [r3 + 11 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m0, [r3 - 12 * 32] ; [5] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m10, m2, [r3 - 12 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m8, m10 + + pmaddwd m9, m0, [r3 - 3 * 32] ; [14] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m3, m2, [r3 - 3 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + pmaddwd m10, m0, [r3 + 6 * 32] ; [23] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m2, [r3 + 6 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + palignr m11, m2, m0, 4 + pmaddwd m11, [r3 - 17 * 32] ; [0] + paddd m11, [pd_16] + psrld m11, 5 + palignr m12, m1, m2, 4 + pmaddwd m12, [r3 - 17 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 3, 2, 16 + ret + +cglobal intra_pred_ang32_7, 3,8,14 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 17 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + + call ang16_mode_7_29 + + add r2, 8 + lea r0, [r0 + 32] + + call ang32_mode_7_29 + + add r2, 24 + lea r0, [r7 + 8 * r1] + + call ang16_mode_7_29 + + add r2, 8 + lea r0, [r0 + 32] + + call ang32_mode_7_29 + RET + +cglobal intra_pred_ang32_29, 3,7,14 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 17 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r5, [r0 + 32] + + call ang16_mode_7_29 + + add r2, 8 + + call ang32_mode_7_29 + + add r2, 24 + mov r0, r5 + + call ang16_mode_7_29 + + add r2, 8 + + call ang32_mode_7_29 + RET + +;; angle 32, modes 8 and 28 +cglobal ang32_mode_8_28 + test r6d, r6d + + movu m0, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 4] ; [17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpcklwd m3, m0, m1 ; [13 12 12 11 11 10 10 9 5 4 4 3 3 2 2 1] + punpckhwd m0, m1 ; [17 16 16 15 15 14 14 13 9 8 8 7 7 6 6 5] + + movu m2, [r2 + 18] ; [24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9] + movu m4, [r2 + 20] ; [25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10] + punpcklwd m2, m4 ; [21 20 20 19 19 18 18 17 13 12 12 11 11 10 10 9] + + pmaddwd m4, m3, [r3 + 6 * 32] ; [21] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 + 6 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 11 * 32] ; [26] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 + 11 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, m3, [r3 + 16 * 32] ; [31] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m0, [r3 + 16 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + palignr m11, m0, m3, 4 + pmaddwd m7, m11, [r3 - 
11 * 32] ; [4] + paddd m7, [pd_16] + psrld m7, 5 + palignr m1, m2, m0, 4 + pmaddwd m8, m1, [r3 - 11 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m11, [r3 - 6 * 32] ; [9] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m1, [r3 - 6 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m11, [r3 - 1 * 32] ; [14] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m1, [r3 - 1 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m11, [r3 + 4 * 32] ; [19] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m1, [r3 + 4 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + pmaddwd m11, [r3 + 9 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m1, [r3 + 9 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m11, m1 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 0 + + palignr m4, m0, m3, 4 + pmaddwd m4, [r3 + 14 * 32] ; [29] + paddd m4, [pd_16] + psrld m4, 5 + palignr m5, m2, m0, 4 + pmaddwd m5, [r3 + 14 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m1, m0, m3, 8 + pmaddwd m5, m1, [r3 - 13 * 32] ; [2] + paddd m5, [pd_16] + psrld m5, 5 + palignr m10, m2, m0, 8 + pmaddwd m6, m10, [r3 - 13 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m1, [r3 - 8 * 32] ; [7] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m10, [r3 - 8 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, m1, [r3 - 3 * 32] ; [12] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m10, [r3 - 3 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m1, [r3 + 2 * 32] ; [17] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m10, [r3 + 2 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m1, [r3 + 7 * 32] ; [22] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m11, m10, [r3 + 7 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m9, m11 + + pmaddwd m1, [r3 + 12 * 32] ; [27] + paddd m1, [pd_16] + psrld m1, 5 + pmaddwd m10, [r3 + 12 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m1, m10 + + palignr m11, m0, m3, 12 + pmaddwd m11, [r3 - 15 * 32] ; [0] + paddd m11, [pd_16] + psrld m11, 5 + palignr m2, m0, 12 + pmaddwd m2, [r3 - 15 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m11, m2 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 1, 11, 0, 2, 16 + ret + +cglobal intra_pred_ang32_8, 3,8,13 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + + call ang16_mode_8_28 + + add r2, 4 + lea r0, [r0 + 32] + + call ang32_mode_8_28 + + add r2, 28 + lea r0, [r7 + 8 * r1] + + call ang16_mode_8_28 + + add r2, 4 + lea r0, [r0 + 32] + + call ang32_mode_8_28 + RET + +cglobal intra_pred_ang32_28, 3,7,13 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 15 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r5, [r0 + 32] + + call ang16_mode_8_28 + + add r2, 4 + + call ang32_mode_8_28 + + add r2, 28 + mov r0, r5 + + call ang16_mode_8_28 + + add r2, 4 + + call ang32_mode_8_28 + RET + +cglobal intra_pred_ang32_9, 3,8,13 + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + + call ang16_mode_9_27 + + add r2, 2 + lea r0, [r0 + 32] + + call ang16_mode_9_27 + + add r2, 30 + lea r0, [r7 + 8 * r1] + + call ang16_mode_9_27 + + add r2, 2 + lea r0, [r0 + 32] + + call ang16_mode_9_27 + RET + +cglobal intra_pred_ang32_27, 3,7,13 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r5, 
[r0 + 32] + + call ang16_mode_9_27 + + add r2, 2 + + call ang16_mode_9_27 + + add r2, 30 + mov r0, r5 + + call ang16_mode_9_27 + + add r2, 2 + + call ang16_mode_9_27 + RET + +cglobal intra_pred_ang32_10, 3,4,2 + add r2, mmsize*4 + add r1d, r1d + lea r3, [r1 * 3] + + vpbroadcastw m0, [r2 + 2] ; [1...] + movu [r0], m0 + movu [r0 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 2] ; [2...] + movu [r0 + r1], m1 + movu [r0 + r1 + 32], m1 + vpbroadcastw m0, [r2 + 2 + 4] ; [3...] + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 6] ; [4...] + movu [r0 + r3], m1 + movu [r0 + r3 + 32], m1 + + lea r0, [r0 + r1 * 4] + vpbroadcastw m0, [r2 + 2 + 8] ; [5...] + movu [r0], m0 + movu [r0 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 10] ; [6...] + movu [r0 + r1], m1 + movu [r0 + r1 + 32], m1 + vpbroadcastw m0, [r2 + 2 + 12] ; [7...] + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 14] ; [8...] + movu [r0 + r3], m1 + movu [r0 + r3 + 32], m1 + + lea r0, [r0 + r1 *4] + vpbroadcastw m0, [r2 + 2 + 16] ; [9...] + movu [r0], m0 + movu [r0 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 18] ; [10...] + movu [r0 + r1], m1 + movu [r0 + r1 + 32], m1 + vpbroadcastw m0, [r2 + 2 + 20] ; [11...] + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 22] ; [12...] + movu [r0 + r3], m1 + movu [r0 + r3 + 32], m1 + + lea r0, [r0 + r1 *4] + vpbroadcastw m0, [r2 + 2 + 24] ; [13...] + movu [r0], m0 + movu [r0 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 26] ; [14...] + movu [r0 + r1], m1 + movu [r0 + r1 + 32], m1 + vpbroadcastw m0, [r2 + 2 + 28] ; [15...] + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 30] ; [16...] + movu [r0 + r3], m1 + movu [r0 + r3 + 32], m1 + + lea r0, [r0 + r1 *4] + vpbroadcastw m0, [r2 + 2 + 32] ; [17...] + movu [r0], m0 + movu [r0 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 34] ; [18...] + movu [r0 + r1], m1 + movu [r0 + r1 + 32], m1 + vpbroadcastw m0, [r2 + 2 + 36] ; [19...] + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 38] ; [20...] + movu [r0 + r3], m1 + movu [r0 + r3 + 32], m1 + + lea r0, [r0 + r1 *4] + vpbroadcastw m0, [r2 + 2 + 40] ; [21...] + movu [r0], m0 + movu [r0 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 42] ; [22...] + movu [r0 + r1], m1 + movu [r0 + r1 + 32], m1 + vpbroadcastw m0, [r2 + 2 + 44] ; [23...] + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 46] ; [24...] + movu [r0 + r3], m1 + movu [r0 + r3 + 32], m1 + + lea r0, [r0 + r1 *4] + vpbroadcastw m0, [r2 + 2 + 48] ; [25...] + movu [r0], m0 + movu [r0 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 50] ; [26...] + movu [r0 + r1], m1 + movu [r0 + r1 + 32], m1 + vpbroadcastw m0, [r2 + 2 + 52] ; [27...] + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 54] ; [28...] + movu [r0 + r3], m1 + movu [r0 + r3 + 32], m1 + + lea r0, [r0 + r1 *4] + vpbroadcastw m0, [r2 + 2 + 56] ; [29...] + movu [r0], m0 + movu [r0 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 58] ; [30...] + movu [r0 + r1], m1 + movu [r0 + r1 + 32], m1 + vpbroadcastw m0, [r2 + 2 + 60] ; [31...] + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m0 + vpbroadcastw m1, [r2 + 2 + 62] ; [32...] 
+ movu [r0 + r3], m1 + movu [r0 + r3 + 32], m1 + RET + +cglobal intra_pred_ang32_26, 3,3,2 + movu m0, [r2 + 2] + movu m1, [r2 + 34] + add r1d, r1d + lea r2, [r1 * 3] + + movu [r0], m0 + movu [r0 + 32], m1 + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m1 + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m1 + movu [r0 + r2], m0 + movu [r0 + r2 + 32], m1 + + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + 32], m1 + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m1 + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m1 + movu [r0 + r2], m0 + movu [r0 + r2 + 32], m1 + + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + 32], m1 + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m1 + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m1 + movu [r0 + r2], m0 + movu [r0 + r2 + 32], m1 + + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + 32], m1 + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m1 + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m1 + movu [r0 + r2], m0 + movu [r0 + r2 + 32], m1 + + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + 32], m1 + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m1 + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m1 + movu [r0 + r2], m0 + movu [r0 + r2 + 32], m1 + + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + 32], m1 + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m1 + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m1 + movu [r0 + r2], m0 + movu [r0 + r2 + 32], m1 + + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + 32], m1 + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m1 + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m1 + movu [r0 + r2], m0 + movu [r0 + r2 + 32], m1 + + lea r0, [r0 + r1 *4] + movu [r0], m0 + movu [r0 + 32], m1 + movu [r0 + r1], m0 + movu [r0 + r1 + 32], m1 + movu [r0 + r1 * 2], m0 + movu [r0 + r1 * 2 + 32], m1 + movu [r0 + r2], m0 + movu [r0 + r2 + 32], m1 + RET + +cglobal intra_pred_ang32_11, 3,8,12, 0-8 + movzx r5d, word [r2 + 128] ; [0] + movzx r6d, word [r2] + mov [rsp], r5w + mov [r2 + 128], r6w + + movzx r5d, word [r2 + 126] ; [16] + movzx r6d, word [r2 + 32] + mov [rsp + 4], r5w + mov [r2 + 126], r6w + + add r2, 128 + xor r6d, r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + lea r4, [r1 * 3] + lea r7, [r0 + 8 * r1] + + call ang16_mode_11_25 + + sub r2, 2 + lea r0, [r0 + 32] + + call ang16_mode_11_25 + + add r2, 34 + lea r0, [r7 + 8 * r1] + + call ang16_mode_11_25 + + sub r2, 2 + lea r0, [r0 + 32] + + call ang16_mode_11_25 + + mov r6d, [rsp] + mov [r2 - 30], r6w + mov r6d, [rsp + 4] + mov [r2 - 32], r6w + RET + +cglobal intra_pred_ang32_25, 3,7,12, 0-4 + xor r6d, r6d + inc r6d + lea r3, [ang_table_avx2 + 16 * 32] + add r1d, r1d + + movzx r4d, word [r2 - 2] + movzx r5d, word [r2 + 160] ; [16] + mov [rsp], r4w + mov [r2 - 2], r5w + + lea r4, [r1 * 3] + lea r5, [r0 + 32] + + call ang16_mode_11_25 + + sub r2, 2 + + call ang16_mode_11_25 + + add r2, 34 + mov r0, r5 + + call ang16_mode_11_25 + + sub r2, 2 + + call ang16_mode_11_25 + + mov r5d, [rsp] + mov [r2 - 32], r5w + RET + +;; angle 32, modes 12 and 24, row 0 to 15 +cglobal ang32_mode_12_24_0_15 + test r6d, r6d + + movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] + movu m4, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + + punpcklwd m3, m0, m4 ; [12 11 11 10 10 9 9 8 4 3 3 2 2 1 1 0] + punpckhwd m2, m0, m4 ; [16 15 15 14 14 13 13 12 8 7 7 6 6 5 5 4] + + pmaddwd m4, m3, [r3 + 11 * 32] ; [27] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m2, [r3 + 11 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 6 * 32] ; [22] + paddd m5, [pd_16] + 
psrld m5, 5 + pmaddwd m8, m2, [r3 + 6 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, m3, [r3 + 1 * 32] ; [17] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m2, [r3 + 1 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + pmaddwd m7, m3, [r3 - 4 * 32] ; [12] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m2, [r3 - 4 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m3, [r3 - 9 * 32] ; [7] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m2, [r3 - 9 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m3, [r3 - 14 * 32] ; [2] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m2, [r3 - 14 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m9, m2 + + movu xm1, [r2 - 8] + pshufb xm1, [pw_ang32_12_24] + punpcklwd m3, m0, m0 ; [11 11 10 10 9 9 8 8 3 3 2 2 1 1 0 0] + punpckhwd m0, m0 ; [15 15 14 14 13 13 12 12 7 7 6 6 5 5 4 4] + vinserti128 m1, m1, xm0, 1 ; [ 7 7 6 6 5 5 4 4 6 6 13 13 19 19 26 26] + + palignr m2, m3, m1, 14 ; [11 10 10 9 9 8 8 7 3 2 2 1 1 0 0 6] + palignr m13, m0, m3, 14 ; [15 14 14 13 13 12 12 11 7 6 6 5 5 4 4 3] + + pmaddwd m10, m2, [r3 + 13 * 32] ; [29] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m13, [r3 + 13 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + pmaddwd m11, m2, [r3 + 8 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m13, [r3 + 8 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m11, m13 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + palignr m13, m0, m3, 14 + + pmaddwd m4, m2, [r3 + 3 * 32] ; [19] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m13, [r3 + 3 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m2, [r3 - 2 * 32] ; [14] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m13, [r3 - 2 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m2, [r3 - 7 * 32] ; [9] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m13, [r3 - 7 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, m2, [r3 - 12 * 32] ; [4] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 - 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m0, m3, 10 + palignr m3, m1, 10 + + pmaddwd m8, m3, [r3 + 15 * 32] ; [31] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m0, [r3 + 15 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m3, [r3 + 10 * 32] ; [26] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m0, [r3 + 10 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m3, [r3 + 5 * 32] ; [21] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m2, m0, [r3 + 5 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m10, m2 + + pmaddwd m3, [r3] ; [16] + paddd m3, [pd_16] + psrld m3, 5 + pmaddwd m0, [r3] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m3, m0 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 3, 0, 2, 16 + ret + +;; angle 32, modes 12 and 24, row 16 to 31 +cglobal ang32_mode_12_24_16_31 + test r6d, r6d + + movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] + movu m4, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + + punpcklwd m3, m0, m4 ; [12 11 11 10 10 9 9 8 4 3 3 2 2 1 1 0] + punpckhwd m2, m0, m4 ; [16 15 15 14 14 13 13 12 8 7 7 6 6 5 5 4] + + punpcklwd m3, m0, m0 ; [11 11 10 10 9 9 8 8 3 3 2 2 1 1 0 0] + punpckhwd m0, m0 ; [15 15 14 14 13 13 12 12 7 7 6 6 5 5 4 4] + + palignr m2, m3, m1, 10 + palignr m13, m0, m3, 10 + + pmaddwd m4, m2, [r3 - 5 * 32] ; [11] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m13, [r3 - 5 * 32] + paddd 
m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m2, [r3 - 10 * 32] ; [6] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m13, [r3 - 10 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, m2, [r3 - 15 * 32] ; [1] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m13, [r3 - 15 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + palignr m2, m3, m1, 6 + palignr m13, m0, m3, 6 + + pmaddwd m7, m2, [r3 + 12 * 32] ; [28] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 + 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m2, [r3 + 7 * 32] ; [23] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m13, [r3 + 7 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m2, [r3 + 2 * 32] ; [18] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m13, [r3 + 2 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m2, [r3 - 3 * 32] ; [13] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m13, [r3 - 3 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + pmaddwd m11, m2, [r3 - 8 * 32] ; [8] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m13, [r3 - 8 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m11, m13 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + palignr m13, m0, m3, 6 + + pmaddwd m4, m2, [r3 - 13 * 32] ; [3] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m13, [r3 - 13 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m2, m3, m1, 2 + palignr m13, m0, m3, 2 + + pmaddwd m5, m2, [r3 + 14 * 32] ; [30] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m13, [r3 + 14 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m2, [r3 + 9 * 32] ; [25] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m13, [r3 + 9 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, m2, [r3 + 4 * 32] ; [20] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 + 4 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m2, [r3 - 1 * 32] ; [15] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m13, [r3 - 1 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m2, [r3 - 6 * 32] ; [10] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m13, [r3 - 6 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m2, [r3 - 11 * 32] ; [5] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m13, [r3 - 11 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + pmaddwd m2, [r3 - 16 * 32] ; [0] + paddd m2, [pd_16] + psrld m2, 5 + pmaddwd m13, [r3 - 16 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m2, m13 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 2, 0, 3, 16 + ret + +cglobal intra_pred_ang32_12, 3,8,14, 0-16 + movu xm0, [r2 + 114] + mova [rsp], xm0 + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + + pinsrw xm1, [r2], 7 ; [0] + pinsrw xm1, [r2 + 12], 6 ; [6] + pinsrw xm1, [r2 + 26], 5 ; [13] + pinsrw xm1, [r2 + 38], 4 ; [19] + pinsrw xm1, [r2 + 52], 3 ; [26] + movu [r2 + 114], xm1 + + xor r6d, r6d + add r2, 128 + lea r7, [r0 + 8 * r1] + + call ang32_mode_12_24_0_15 + + lea r0, [r0 + 32] + + call ang32_mode_12_24_16_31 + + add r2, 32 + lea r0, [r7 + 8 * r1] + + call ang32_mode_12_24_0_15 + + lea r0, [r0 + 32] + + call ang32_mode_12_24_16_31 + + mova xm0, [rsp] + movu [r2 - 46], xm0 + RET + +cglobal intra_pred_ang32_24, 3,7,14, 0-16 + movu xm0, [r2 - 16] + mova [rsp], xm0 + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + + pinsrw 
xm1, [r2 + 140], 7 ; [6] + pinsrw xm1, [r2 + 154], 6 ; [13] + pinsrw xm1, [r2 + 166], 5 ; [19] + pinsrw xm1, [r2 + 180], 4 ; [26] + movu [r2 - 16], xm1 + + xor r6d, r6d + inc r6d + lea r5, [r0 + 32] + + call ang32_mode_12_24_0_15 + + call ang32_mode_12_24_16_31 + + add r2, 32 + mov r0, r5 + + call ang32_mode_12_24_0_15 + + call ang32_mode_12_24_16_31 + + mova xm0, [rsp] + movu [r2 - 48], xm0 + RET + +;; angle 32, modes 13 and 23, row 0 to 15 +cglobal ang32_mode_13_23_row_0_15 + test r6d, r6d + + movu m0, [r2] ; [15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0] + movu m4, [r2 + 2] ; [16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + + punpcklwd m3, m0, m4 ; [12 11 11 10 10 9 9 8 4 3 3 2 2 1 1 0] + punpckhwd m2, m0, m4 ; [16 15 15 14 14 13 13 12 8 7 7 6 6 5 5 4] + + pmaddwd m4, m3, [r3 + 7 * 32] ; [23] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m2, [r3 + 7 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 - 2 * 32] ; [14] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m2, [r3 - 2 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m3, [r3 - 11 * 32] ; [5] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m2, [r3 - 11 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m6, m2 + + movu xm1, [r2 - 8] + pshufb xm1, [pw_ang32_12_24] + punpcklwd m3, m0, m0 ; [11 11 10 10 9 9 8 8 3 3 2 2 1 1 0 0] + punpckhwd m0, m0 ; [15 15 14 14 13 13 12 12 7 7 6 6 5 5 4 4] + vinserti128 m1, m1, xm0, 1 ; [ 7 7 6 6 5 5 4 4 4 4 7 7 11 11 14 14] + + palignr m2, m3, m1, 14 + palignr m13, m0, m3, 14 + + pmaddwd m7, m2, [r3 + 12 * 32] ; [28] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 + 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m2, [r3 + 3 * 32] ; [19] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m13, [r3 + 3 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m2, [r3 - 6 * 32] ; [10] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m13, [r3 - 6 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m2, [r3 - 15 * 32] ; [1] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m13, [r3 - 15 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + palignr m2, m3, m1, 10 + palignr m13, m0, m3, 10 + + pmaddwd m11, m2, [r3 + 8 * 32] ; [24] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m13, [r3 + 8 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m11, m13 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + palignr m13, m0, m3, 10 + + pmaddwd m4, m2, [r3 - 1 * 32] ; [15] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m13, [r3 - 1 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m2, [r3 - 10 * 32] ; [6] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m13, [r3 - 10 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + palignr m2, m3, m1, 6 + palignr m13, m0, m3, 6 + + pmaddwd m6, m2, [r3 + 13 * 32] ; [29] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m13, [r3 + 13 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, m2, [r3 + 4 * 32] ; [20] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 + 4 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m2, [r3 - 5 * 32] ; [11] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m13, [r3 - 5 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m2, [r3 - 14 * 32] ; [2] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m13, [r3 - 14 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m9, m13 + + palignr m0, m3, 2 + palignr m3, m1, 2 + + 
pmaddwd m1, m3, [r3 + 9 * 32] ; [25] + paddd m1, [pd_16] + psrld m1, 5 + pmaddwd m2, m0, [r3 + 9 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m1, m2 + + pmaddwd m3, [r3] ; [16] + paddd m3, [pd_16] + psrld m3, 5 + pmaddwd m0, [r3] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m3, m0 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 1, 3, 0, 2, 16 + ret + +;; angle 32, modes 13 and 23, row 16 to 31 +cglobal ang32_mode_13_23_row_16_31 + test r6d, r6d + + movu m0, [r2] ; [11 10 9 8 7 6 5 4 3 2 1 0 4 7 11 14] + movu m5, [r2 + 2] ; [12 11 10 9 8 7 6 5 4 3 2 1 0 4 7 11] + + punpcklwd m4, m0, m5 ; [ 8 7 7 6 6 5 5 4 0 4 4 7 7 11 11 14] + punpckhwd m2, m0, m5 ; [12 11 11 10 10 9 9 8 4 3 3 2 2 1 1 0] + + pmaddwd m4, [r3 - 9 * 32] ; [7] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m2, [r3 - 9 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m4, m2 + + movu xm1, [r2 - 8] + pshufb xm1, [pw_ang32_12_24] ; [18 18 21 21 25 25 28 28] + punpcklwd m3, m0, m0 ; [ 7 7 6 6 5 5 4 4 4 4 7 7 11 11 14 14] + punpckhwd m0, m0 ; [11 11 10 10 9 9 8 8 3 3 2 2 1 1 0 0] + vinserti128 m1, m1, xm0, 1 ; [ 3 3 2 2 1 1 0 0 18 18 21 21 25 25 28 28] + + palignr m2, m3, m1, 14 + palignr m13, m0, m3, 14 + + pmaddwd m5, m2, [r3 + 14 * 32] ; [30] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m13, [r3 + 14 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m2, [r3 + 5 * 32] ; [21] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, m13, [r3 + 5 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + pmaddwd m7, m2, [r3 - 4 * 32] ; [12] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 - 4 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + pmaddwd m8, m2, [r3 - 13 * 32] ; [3] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m13, [r3 - 13 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m2, m3, m1, 10 + palignr m13, m0, m3, 10 + + pmaddwd m9, m2, [r3 + 10 * 32] ; [26] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m13, [r3 + 10 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m2, [r3 + 1 * 32] ; [17] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m13, [r3 + 1 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + pmaddwd m11, m2, [r3 - 8 * 32] ; [8] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m13, [r3 - 8 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m11, m13 + + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 0 + + palignr m2, m3, m1, 6 + palignr m13, m0, m3, 6 + + pmaddwd m4, m2, [r3 + 15 * 32] ; [31] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m13, [r3 + 15 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m2, [r3 + 6 * 32] ; [22] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m13, [r3 + 6 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m2, [r3 - 3 * 32] ; [13] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m13, [r3 - 3 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, m2, [r3 - 12 * 32] ; [4] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, m13, [r3 - 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m0, m3, 2 + palignr m3, m1, 2 + + pmaddwd m8, m3, [r3 + 11 * 32] ; [27] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m0, [r3 + 11 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m3, [r3 + 2 * 32] ; [18] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m0, [r3 + 2 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m1, m3, [r3 - 7 * 32] ; [9] + paddd m1, [pd_16] + 
psrld m1, 5 + pmaddwd m2, m0, [r3 - 7 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m1, m2 + + pmaddwd m3, [r3 - 16 * 32] ; [0] + paddd m3, [pd_16] + psrld m3, 5 + pmaddwd m0, [r3 - 16 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m3, m0 + TRANSPOSE_STORE_AVX2 4, 5, 6, 7, 8, 9, 1, 3, 0, 2, 16 + ret + +cglobal intra_pred_ang32_13, 3,8,14, 0-mmsize + movu m0, [r2 + 112] + mova [rsp], m0 + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + + movu xm1, [r2 + 8] + movu xm2, [r2 + 36] + pshufb xm1, [pw_ang32_13_23] + pshufb xm2, [pw_ang32_13_23] + pinsrw xm1, [r2 + 28], 4 + pinsrw xm2, [r2 + 56], 4 + punpckhqdq xm2, xm1 ; [ 4 7 8 11 18 21 25 28] + + movzx r6d, word [r2] + mov [r2 + 128], r6w + movu [r2 + 112], xm2 + + xor r6d, r6d + add r2, 128 + lea r7, [r0 + 8 * r1] + + call ang32_mode_13_23_row_0_15 + + sub r2, 8 + lea r0, [r0 + 32] + + call ang32_mode_13_23_row_16_31 + + add r2, 40 + lea r0, [r7 + 8 * r1] + + call ang32_mode_13_23_row_0_15 + + sub r2, 8 + lea r0, [r0 + 32] + + call ang32_mode_13_23_row_16_31 + + mova m0, [rsp] + movu [r2 - 40], m0 + RET + +cglobal intra_pred_ang32_23, 3,7,14, 0-16 + movu xm0, [r2 - 16] + mova [rsp], xm0 + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + + movu xm1, [r2 + 136] + movu xm2, [r2 + 164] + pshufb xm1, [pw_ang32_13_23] + pshufb xm2, [pw_ang32_13_23] + pinsrw xm1, [r2 + 156], 4 + pinsrw xm2, [r2 + 184], 4 + punpckhqdq xm2, xm1 ; [ 4 7 8 11 18 21 25 28] + + movu [r2 - 16], xm2 + + xor r6d, r6d + inc r6d + lea r5, [r0 + 32] + + call ang32_mode_13_23_row_0_15 + + sub r2, 8 + + call ang32_mode_13_23_row_16_31 + + add r2, 40 + mov r0, r5 + + call ang32_mode_13_23_row_0_15 + + sub r2, 8 + + call ang32_mode_13_23_row_16_31 + + mova xm0, [rsp] + movu [r2 - 40], xm0 + RET + +%macro TRANSPOSE_STORE_AVX2_STACK 11 + jnz .skip%11 + punpckhwd m%9, m%1, m%2 + punpcklwd m%1, m%2 + punpckhwd m%2, m%3, m%4 + punpcklwd m%3, m%4 + + punpckldq m%4, m%1, m%3 + punpckhdq m%1, m%3 + punpckldq m%3, m%9, m%2 + punpckhdq m%9, m%2 + + punpckhwd m%10, m%5, m%6 + punpcklwd m%5, m%6 + punpckhwd m%6, m%7, m%8 + punpcklwd m%7, m%8 + + punpckldq m%8, m%5, m%7 + punpckhdq m%5, m%7 + punpckldq m%7, m%10, m%6 + punpckhdq m%10, m%6 + + punpcklqdq m%6, m%4, m%8 + punpckhqdq m%2, m%4, m%8 + punpcklqdq m%4, m%1, m%5 + punpckhqdq m%8, m%1, m%5 + + punpcklqdq m%1, m%3, m%7 + punpckhqdq m%5, m%3, m%7 + punpcklqdq m%3, m%9, m%10 + punpckhqdq m%7, m%9, m%10 + + movu [r0 + r1 * 0 + %11], xm%6 + movu [r0 + r1 * 1 + %11], xm%2 + movu [r0 + r1 * 2 + %11], xm%4 + movu [r0 + r4 * 1 + %11], xm%8 + + lea r5, [r0 + r1 * 4] + movu [r5 + r1 * 0 + %11], xm%1 + movu [r5 + r1 * 1 + %11], xm%5 + movu [r5 + r1 * 2 + %11], xm%3 + movu [r5 + r4 * 1 + %11], xm%7 + + lea r5, [r5 + r1 * 4] + vextracti128 [r5 + r1 * 0 + %11], m%6, 1 + vextracti128 [r5 + r1 * 1 + %11], m%2, 1 + vextracti128 [r5 + r1 * 2 + %11], m%4, 1 + vextracti128 [r5 + r4 * 1 + %11], m%8, 1 + + lea r5, [r5 + r1 * 4] + vextracti128 [r5 + r1 * 0 + %11], m%1, 1 + vextracti128 [r5 + r1 * 1 + %11], m%5, 1 + vextracti128 [r5 + r1 * 2 + %11], m%3, 1 + vextracti128 [r5 + r4 * 1 + %11], m%7, 1 + jmp .end%11 +.skip%11: +%if %11 == 16 + lea r7, [r0 + 8 * r1] +%else + lea r7, [r0] +%endif + movu [r7 + r1 * 0], m%1 + movu [r7 + r1 * 1], m%2 + movu [r7 + r1 * 2], m%3 + movu [r7 + r4 * 1], m%4 + +%if %11 == 16 + lea r7, [r7 + r1 * 4] +%else + lea r7, [r7 + r1 * 4] +%endif + movu [r7 + r1 * 0], m%5 + movu [r7 + r1 * 1], m%6 + movu [r7 + r1 * 2], m%7 + movu [r7 + r4 * 1], m%8 +.end%11: +%endmacro + +;; angle 
32, modes 14 and 22, row 0 to 15 +cglobal ang32_mode_14_22_rows_0_15 + test r6d, r6d + + movu m0, [r2 - 12] + movu m1, [r2 - 10] + + punpcklwd m3, m0, m1 + punpckhwd m0, m1 + + movu m1, [r2 + 4] + movu m4, [r2 + 6] + punpcklwd m2, m1, m4 + punpckhwd m1, m4 + + pmaddwd m4, m3, [r3] ; [16] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 13 * 32] ; [29] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 + 13 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + palignr m7, m0, m3, 4 + pmaddwd m6, m7, [r3 - 6 * 32] ; [10] + paddd m6, [pd_16] + psrld m6, 5 + palignr m8, m2, m0, 4 + pmaddwd m9, m8, [r3 - 6 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + pmaddwd m7, [r3 + 7 * 32] ; [23] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, [r3 + 7 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m10, m0, m3, 8 + pmaddwd m8, m10, [r3 - 12 * 32] ; [4] + paddd m8, [pd_16] + psrld m8, 5 + palignr m12, m2, m0, 8 + pmaddwd m9, m12, [r3 - 12 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m10, [r3 + 1 * 32] ; [17] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m11, m12, [r3 + 1 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m9, m11 + + pmaddwd m10, [r3 + 14 * 32] ; [30] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, [r3 + 14 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + palignr m11, m0, m3, 12 + pmaddwd m11, [r3 - 5 * 32] ; [11] + paddd m11, [pd_16] + psrld m11, 5 + palignr m12, m2, m0, 12 + pmaddwd m12, [r3 - 5 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2_STACK 11, 10, 9, 8, 7, 6, 5, 4, 12, 13, 16 + + palignr m4, m0, m3, 12 + pmaddwd m4, [r3 + 8 * 32] ; [24] + paddd m4, [pd_16] + psrld m4, 5 + palignr m5, m2, m0, 12 + pmaddwd m5, [r3 + 8 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m0, [r3 - 11 * 32] ; [5] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m3, m2, [r3 - 11 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m5, m3 + + pmaddwd m6, m0, [r3 + 2 * 32] ; [18] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, m2, [r3 + 2 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + pmaddwd m7, m0, [r3 + 15 * 32] ; [31] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m3, m2, [r3 + 15 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m7, m3 + + palignr m9, m2, m0, 4 + palignr m10, m1, m2, 4 + pmaddwd m8, m9, [r3 - 4 * 32] ; [12] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m11, m10, [r3 - 4 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m8, m11 + + pmaddwd m9, [r3 + 9 * 32] ; [25] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, [r3 + 9 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + palignr m1, m2, 8 + palignr m2, m0, 8 + + pmaddwd m10, m2, [r3 - 10 * 32] ; [6] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m12, m1, [r3 - 10 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m10, m12 + + pmaddwd m2, [r3 + 3 * 32] ; [19] + paddd m2, [pd_16] + psrld m2, 5 + pmaddwd m1, [r3 + 3 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m2, m1 + TRANSPOSE_STORE_AVX2_STACK 2, 10, 9, 8, 7, 6, 5, 4, 0, 1, 0 + ret + +;; angle 32, modes 14 and 22, rows 16 to 31 +cglobal ang32_mode_14_22_rows_16_31 + test r6d, r6d + + movu m0, [r2 - 24] + movu m1, [r2 - 22] + + punpcklwd m3, m0, m1 + punpckhwd m0, m1 + + movu m1, [r2 - 8] + movu m4, [r2 - 6] + punpcklwd m2, m1, m4 + punpckhwd m1, m4 + + pmaddwd m4, m3, [r3 - 16 * 32] ; [0] + paddd m4, [pd_16] + psrld 
m4, 5 + pmaddwd m5, m0, [r3 - 16 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 - 3 * 32] ; [13] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 - 3 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, m3, [r3 + 10 * 32] ; [26] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m0, [r3 + 10 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + palignr m8, m0, m3, 4 + palignr m9, m2, m0, 4 + pmaddwd m7, m8, [r3 - 9 * 32] ; [7] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m10, m9, [r3 - 9 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m7, m10 + + pmaddwd m8, [r3 + 4 * 32] ; [20] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 + 4 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m11, m0, m3, 8 + palignr m12, m2, m0, 8 + pmaddwd m9, m11, [r3 - 15 * 32] ; [1] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m12, [r3 - 15 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m11, [r3 - 2 * 32] ; [14] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m13, m12, [r3 - 2 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m10, m13 + + pmaddwd m11, [r3 + 11 * 32] ; [27] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m12, [r3 + 11 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2_STACK 11, 10, 9, 8, 7, 6, 5, 4, 12, 13, 16 + + palignr m5, m0, m3, 12 + palignr m6, m2, m0, 12 + pmaddwd m4, m5, [r3 - 8 * 32] ; [8] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m7, m6, [r3 - 8 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m4, m7 + + pmaddwd m5, [r3 + 5 * 32] ; [21] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, [r3 + 5 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + pmaddwd m6, m0, [r3 - 14 * 32] ; [2] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, m2, [r3 - 14 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + pmaddwd m7, m0, [r3 - 1 * 32] ; [15] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m3, m2, [r3 - 1 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m7, m3 + + pmaddwd m8, m0, [r3 + 12 * 32] ; [28] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m11, m2, [r3 + 12 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m8, m11 + + palignr m10, m2, m0, 4 + palignr m11, m1, m2, 4 + + pmaddwd m9, m10, [r3 - 7 * 32] ; [9] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m3, m11, [r3 - 7 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + pmaddwd m10, [r3 + 6 * 32] ; [22] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, [r3 + 6 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + palignr m1, m2, 8 + palignr m2, m0, 8 + + pmaddwd m2, [r3 - 13 * 32] ; [3] + paddd m2, [pd_16] + psrld m2, 5 + pmaddwd m1, [r3 - 13 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m2, m1 + TRANSPOSE_STORE_AVX2_STACK 2, 10, 9, 8, 7, 6, 5, 4, 0, 1, 0 + ret + +cglobal intra_pred_ang32_14, 3,8,14 + mov r6, rsp + sub rsp, 4*mmsize+gprsize + and rsp, ~63 + mov [rsp+4*mmsize], r6 + + movu m0, [r2 + 128] + movu m1, [r2 + 160] + movd xm2, [r2 + 192] + + mova [rsp + 1*mmsize], m0 + mova [rsp + 2*mmsize], m1 + movd [rsp + 3*mmsize], xm2 + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + + movu xm1, [r2 + 4] + movu xm2, [r2 + 24] + movu xm3, [r2 + 44] + pshufb xm1, [pw_ang32_14_22] + pshufb xm2, [pw_ang32_14_22] + pshufb xm3, [pw_ang32_14_22] + pinsrw xm1, [r2 + 20], 4 + pinsrw xm2, [r2 + 40], 4 + pinsrw xm3, [r2 + 60], 4 + + punpckhqdq xm2, xm1 ; [ 2 5 7 10 12 15 17 20] + punpckhqdq xm3, xm3 ; [22 25 27 30 22 25 27 30] + 
+ movzx r6d, word [r2] + mov [rsp + 1*mmsize], r6w + movu [rsp + 16], xm2 + movq [rsp + 8], xm3 + + xor r6d, r6d + lea r2, [rsp + 1*mmsize] + lea r7, [r0 + 8 * r1] + + call ang32_mode_14_22_rows_0_15 + + lea r0, [r0 + 32] + + call ang32_mode_14_22_rows_16_31 + + add r2, 32 + lea r0, [r7 + 8 * r1] + + call ang32_mode_14_22_rows_0_15 + + lea r0, [r0 + 32] + + call ang32_mode_14_22_rows_16_31 + + mov rsp, [rsp+4*mmsize] + RET + +cglobal intra_pred_ang32_22, 3,8,14 + mov r6, rsp + sub rsp, 4*mmsize+gprsize + and rsp, ~63 + mov [rsp+4*mmsize], r6 + + movu m0, [r2] + movu m1, [r2 + 32] + movd xm2, [r2 + 64] + + mova [rsp + 1*mmsize], m0 + mova [rsp + 2*mmsize], m1 + movd [rsp + 3*mmsize], xm2 + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + + movu xm1, [r2 + 132] + movu xm2, [r2 + 152] + movu xm3, [r2 + 172] + pshufb xm1, [pw_ang32_14_22] + pshufb xm2, [pw_ang32_14_22] + pshufb xm3, [pw_ang32_14_22] + pinsrw xm1, [r2 + 148], 4 + pinsrw xm2, [r2 + 168], 4 + pinsrw xm3, [r2 + 188], 4 + + punpckhqdq xm2, xm1 ; [ 2 5 7 10 12 15 17 20] + punpckhqdq xm3, xm3 ; [22 25 27 30 22 25 27 30] + + movu [rsp + 16], xm2 + movq [rsp + 8], xm3 + + xor r6d, r6d + inc r6d + lea r2, [rsp + 1*mmsize] + lea r5, [r0 + 32] + + call ang32_mode_14_22_rows_0_15 + + lea r0, [r0 + 8 * r1] + lea r0, [r0 + 8 * r1] + + call ang32_mode_14_22_rows_16_31 + + add r2, 32 + mov r0, r5 + + call ang32_mode_14_22_rows_0_15 + + lea r0, [r0 + 8 * r1] + lea r0, [r0 + 8 * r1] + + call ang32_mode_14_22_rows_16_31 + + mov rsp, [rsp+4*mmsize] + RET + +;; angle 32, modes 15 and 21, row 0 to 15 +cglobal ang32_mode_15_21_rows_0_15 + test r6d, r6d + + movu m0, [r2 - 16] + movu m1, [r2 - 14] + + punpcklwd m3, m0, m1 + punpckhwd m0, m1 + + movu m1, [r2] + movu m4, [r2 + 2] + punpcklwd m2, m1, m4 + punpckhwd m1, m4 + + pmaddwd m4, m3, [r3] ; [16] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m0, m3, 4 + palignr m7, m2, m0, 4 + pmaddwd m5, m6, [r3 - 15 * 32] ; [1] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m7, [r3 - 15 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, [r3 + 2 * 32] ; [18] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, [r3 + 2 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m8, m0, m3, 8 + palignr m9, m2, m0, 8 + pmaddwd m7, m8, [r3 - 13 * 32] ; [3] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m10, m9, [r3 - 13 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m7, m10 + + pmaddwd m8, [r3 + 4 * 32] ; [20] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 + 4 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m10, m0, m3, 12 + palignr m11, m2, m0, 12 + pmaddwd m9, m10, [r3 - 11 * 32] ; [5] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m12, m11, [r3 - 11 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m9, m12 + + pmaddwd m10, [r3 + 6 * 32] ; [22] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, [r3 + 6 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + pmaddwd m11, m0, [r3 - 9 * 32] ; [7] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m12, m2, [r3 - 9 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2_STACK 11, 10, 9, 8, 7, 6, 5, 4, 12, 13, 16 + + pmaddwd m4, m0, [r3 + 8 * 32] ; [24] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m2, [r3 + 8 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m6, m2, m0, 4 + palignr m7, m1, m2, 4 + pmaddwd m5, m6, [r3 - 7 * 32] ; [9] + paddd m5, [pd_16] + 
psrld m5, 5 + pmaddwd m3, m7, [r3 - 7 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m5, m3 + + pmaddwd m6, [r3 + 10 * 32] ; [26] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, [r3 + 10 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m8, m2, m0, 8 + palignr m9, m1, m2, 8 + pmaddwd m7, m8, [r3 - 5 * 32] ; [11] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m3, m9, [r3 - 5 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m7, m3 + + pmaddwd m8, [r3 + 12 * 32] ; [28] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 + 12 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m10, m2, m0, 12 + palignr m11, m1, m2, 12 + pmaddwd m9, m10, [r3 - 3 * 32] ; [13] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m3, m11, [r3 - 3 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + pmaddwd m10, [r3 + 14 * 32] ; [30] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, [r3 + 14 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + pmaddwd m2, [r3 - 1 * 32] ; [15] + paddd m2, [pd_16] + psrld m2, 5 + pmaddwd m1, [r3 - 1 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m2, m1 + TRANSPOSE_STORE_AVX2_STACK 2, 10, 9, 8, 7, 6, 5, 4, 0, 1, 0 + ret + +;; angle 32, modes 15 and 21, rows 16 to 31 +cglobal ang32_mode_15_21_rows_16_31 + test r6d, r6d + + movu m0, [r2 - 32] + movu m1, [r2 - 30] + + punpcklwd m3, m0, m1 + punpckhwd m0, m1 + + movu m1, [r2 - 16] + movu m4, [r2 - 14] + punpcklwd m2, m1, m4 + punpckhwd m1, m4 + + pmaddwd m4, m3, [r3 - 16 * 32] ; [0] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 - 16 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 1 * 32] ; [17] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 + 1 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + palignr m7, m0, m3, 4 + palignr m8, m2, m0, 4 + pmaddwd m6, m7, [r3 - 14 * 32] ; [2] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m8, [r3 - 14 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + pmaddwd m7, [r3 + 3 * 32] ; [19] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, [r3 + 3 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m9, m0, m3, 8 + palignr m10, m2, m0, 8 + pmaddwd m8, m9, [r3 - 12 * 32] ; [4] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m11, m10, [r3 - 12 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m8, m11 + + pmaddwd m9, [r3 + 5 * 32] ; [21] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, [r3 + 5 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + palignr m11, m0, m3, 12 + palignr m12, m2, m0, 12 + pmaddwd m10, m11, [r3 - 10 * 32] ; [6] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m13, m12, [r3 - 10 * 32] + paddd m13, [pd_16] + psrld m13, 5 + packusdw m10, m13 + + pmaddwd m11, [r3 + 7 * 32] ; [23] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m12, [r3 + 7 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2_STACK 11, 10, 9, 8, 7, 6, 5, 4, 12, 13, 16 + + pmaddwd m4, m0, [r3 - 8 * 32] ; [8] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m7, m2, [r3 - 8 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m4, m7 + + pmaddwd m5, m0, [r3 + 9 * 32] ; [25] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, m2, [r3 + 9 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + palignr m7, m2, m0, 4 + palignr m8, m1, m2, 4 + pmaddwd m6, m7, [r3 - 6 * 32] ; [10] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m3, m8, [r3 - 6 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m6, m3 + + pmaddwd m7, [r3 + 11 * 32] ; [27] + paddd m7, [pd_16] 
+ psrld m7, 5 + pmaddwd m8, [r3 + 11 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m9, m2, m0, 8 + palignr m3, m1, m2, 8 + pmaddwd m8, m9, [r3 - 4 * 32] ; [12] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m11, m3, [r3 - 4 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m8, m11 + + pmaddwd m9, [r3 + 13 * 32] ; [29] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m3, [r3 + 13 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + palignr m1, m2, 12 + palignr m2, m0, 12 + pmaddwd m10, m2, [r3 - 2 * 32] ; [14] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, m1, [r3 - 2 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + pmaddwd m2, [r3 + 15 * 32] ; [31] + paddd m2, [pd_16] + psrld m2, 5 + pmaddwd m1, [r3 + 15 * 32] + paddd m1, [pd_16] + psrld m1, 5 + packusdw m2, m1 + TRANSPOSE_STORE_AVX2_STACK 2, 10, 9, 8, 7, 6, 5, 4, 0, 1, 0 + ret + +cglobal intra_pred_ang32_15, 3,8,14 + mov r6, rsp + sub rsp, 4*mmsize+gprsize + and rsp, ~63 + mov [rsp+4*mmsize], r6 + + movu m0, [r2 + 128] + movu m1, [r2 + 160] + movd xm2, [r2 + 192] + + mova [rsp + 1*mmsize], m0 + mova [rsp + 2*mmsize], m1 + movd [rsp + 3*mmsize], xm2 + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + + movu xm1, [r2 + 4] + movu xm2, [r2 + 18] + movu xm3, [r2 + 34] + movu xm4, [r2 + 48] + pshufb xm1, [pw_ang32_15_21] + pshufb xm2, [pw_ang32_15_21] + pshufb xm3, [pw_ang32_15_21] + pshufb xm4, [pw_ang32_15_21] + + punpckhqdq xm2, xm1 + punpckhqdq xm4, xm3 + + movzx r6d, word [r2] + mov [rsp + 1*mmsize], r6w + movu [rsp + 16], xm2 + movu [rsp], xm4 + + xor r6d, r6d + lea r2, [rsp + 1*mmsize] + lea r7, [r0 + 8 * r1] + + call ang32_mode_15_21_rows_0_15 + + lea r0, [r0 + 32] + + call ang32_mode_15_21_rows_16_31 + + add r2, 32 + lea r0, [r7 + 8 * r1] + + call ang32_mode_15_21_rows_0_15 + + lea r0, [r0 + 32] + + call ang32_mode_15_21_rows_16_31 + + mov rsp, [rsp+4*mmsize] + RET + +cglobal intra_pred_ang32_21, 3,8,14 + mov r6, rsp + sub rsp, 4*mmsize+gprsize + and rsp, ~63 + mov [rsp+4*mmsize], r6 + + movu m0, [r2] + movu m1, [r2 + 32] + movd xm2, [r2 + 64] + + mova [rsp + 1*mmsize], m0 + mova [rsp + 2*mmsize], m1 + movd [rsp + 3*mmsize], xm2 + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + + movu xm1, [r2 + 132] + movu xm2, [r2 + 146] + movu xm3, [r2 + 162] + movu xm4, [r2 + 176] + pshufb xm1, [pw_ang32_15_21] + pshufb xm2, [pw_ang32_15_21] + pshufb xm3, [pw_ang32_15_21] + pshufb xm4, [pw_ang32_15_21] + + punpckhqdq xm2, xm1 + punpckhqdq xm4, xm3 + + movu [rsp + 16], xm2 + movu [rsp], xm4 + + xor r6d, r6d + inc r6d + lea r2, [rsp + 1*mmsize] + lea r5, [r0 + 32] + + call ang32_mode_15_21_rows_0_15 + + lea r0, [r0 + 8 * r1] + lea r0, [r0 + 8 * r1] + + call ang32_mode_15_21_rows_16_31 + + add r2, 32 + mov r0, r5 + + call ang32_mode_15_21_rows_0_15 + + lea r0, [r0 + 8 * r1] + lea r0, [r0 + 8 * r1] + + call ang32_mode_15_21_rows_16_31 + + mov rsp, [rsp+4*mmsize] + RET + +;; angle 32, modes 16 and 20, row 0 to 15 +cglobal ang32_mode_16_20_rows_0_15 + test r6d, r6d + + movu m0, [r2 - 20] + movu m1, [r2 - 18] + + punpcklwd m3, m0, m1 + punpckhwd m0, m1 + + movu m1, [r2 - 4] ; [ 3 2 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13] + movu m4, [r2 - 2] ; [ 2 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14] + punpcklwd m2, m1, m4 ; [-3 -2 -4 -3 -5 -4 -6 -5 -11 -10 -12 -11 -13 -12 -14 -13] + punpckhwd m1, m4 ; [ 2 3 2 0 -1 0 -2 -1 -7 -6 -8 -7 -9 -8 -10 -9] + + pmaddwd m4, m3, [r3] ; [16] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3] + paddd m5, [pd_16] + 
psrld m5, 5 + packusdw m4, m5 + + palignr m6, m0, m3, 4 + palignr m7, m2, m0, 4 + pmaddwd m5, m6, [r3 - 11 * 32] ; [5] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m7, [r3 - 11 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + pmaddwd m6, [r3 + 10 * 32] ; [26] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m7, [r3 + 10 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m6, m7 + + palignr m8, m0, m3, 8 + palignr m9, m2, m0, 8 + pmaddwd m7, m8, [r3 - 1 * 32] ; [15] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m10, m9, [r3 - 1 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m7, m10 + + palignr m9, m0, m3, 12 + palignr m12, m2, m0, 12 + pmaddwd m8, m9, [r3 - 12 * 32] ; [4] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m10, m12, [r3 - 12 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m8, m10 + + pmaddwd m9, [r3 + 9 * 32] ; [25] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m12, [r3 + 9 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m9, m12 + + pmaddwd m10, m0, [r3 - 2 * 32] ; [14] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, m2, [r3 - 2 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + palignr m11, m2, m0, 4 + palignr m12, m1, m2, 4 + pmaddwd m11, [r3 - 13 * 32] ; [3] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m12, [r3 - 13 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2_STACK 11, 10, 9, 8, 7, 6, 5, 4, 12, 13, 16 + + palignr m4, m2, m0, 4 + palignr m5, m1, m2, 4 + pmaddwd m4, [r3 + 8 * 32] ; [24] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, [r3 + 8 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr m5, m2, m0, 8 + palignr m3, m1, m2, 8 + pmaddwd m5, [r3 - 3 * 32] ; [13] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m3, [r3 - 3 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m5, m3 + + palignr m7, m2, m0, 12 + palignr m3, m1, m2, 12 + pmaddwd m6, m7, [r3 - 14 * 32] ; [2] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m3, [r3 - 14 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, [r3 + 7 * 32] ; [23] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m3, [r3 + 7 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m7, m3 + + pmaddwd m8, m2, [r3 - 4 * 32] ; [12] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m1, [r3 - 4 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + movu m0, [r2 - 2] + movu m1, [r2] + + punpcklwd m3, m0, m1 + punpckhwd m0, m1 + + movu m2, [r2 + 14] + movu m1, [r2 + 16] + punpcklwd m2, m1 + + pmaddwd m9, m3, [r3 - 15 * 32] ; [1] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, m0, [r3 - 15 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + pmaddwd m10, m3, [r3 + 6 * 32] ; [22] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, m0, [r3 + 6 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + palignr m2, m0, 4 + palignr m0, m3, 4 + pmaddwd m0, [r3 - 5 * 32] ; [11] + paddd m0, [pd_16] + psrld m0, 5 + pmaddwd m2, [r3 - 5 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m0, m2 + TRANSPOSE_STORE_AVX2_STACK 0, 10, 9, 8, 7, 6, 5, 4, 2, 1, 0 + ret + +;; angle 32, modes 16 and 20, rows 16 to 31 +cglobal ang32_mode_16_20_rows_16_31 + test r6d, r6d + + movu m0, [r2 - 40] + movu m1, [r2 - 38] + + punpcklwd m3, m0, m1 + punpckhwd m0, m1 + + movu m1, [r2 - 24] + movu m4, [r2 - 22] + punpcklwd m2, m1, m4 + punpckhwd m1, m4 + + pmaddwd m4, m3, [r3 - 16 * 32] ; [0] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 - 16 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 5 * 32] ; [21] + paddd 
m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 + 5 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + palignr m7, m0, m3, 4 + palignr m8, m2, m0, 4 + pmaddwd m6, m7, [r3 - 6 * 32] ; [10] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m9, m8, [r3 - 6 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m6, m9 + + pmaddwd m7, [r3 + 15 * 32] ; [31] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m8, [r3 + 15 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m7, m8 + + palignr m8, m0, m3, 8 + palignr m9, m2, m0, 8 + pmaddwd m8, [r3 + 4 * 32] ; [20] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 + 4 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m10, m0, m3, 12 + palignr m11, m2, m0, 12 + pmaddwd m9, m10, [r3 - 7 * 32] ; [9] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m12, m11, [r3 - 7 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m9, m12 + + pmaddwd m10, [r3 + 14 * 32] ; [30] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, [r3 + 14 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + pmaddwd m11, m0, [r3 + 3 * 32] ; [19] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m12, m2, [r3 + 3 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2_STACK 11, 10, 9, 8, 7, 6, 5, 4, 12, 13, 16 + + palignr m5, m2, m0, 4 + palignr m6, m1, m2, 4 + pmaddwd m4, m5, [r3 - 8 * 32] ; [8] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m7, m6, [r3 - 8 * 32] + paddd m7, [pd_16] + psrld m7, 5 + packusdw m4, m7 + + pmaddwd m5, [r3 + 13 * 32] ; [29] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m6, [r3 + 13 * 32] + paddd m6, [pd_16] + psrld m6, 5 + packusdw m5, m6 + + palignr m6, m2, m0, 8 + palignr m3, m1, m2, 8 + pmaddwd m6, [r3 + 2 * 32] ; [18] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m3, [r3 + 2 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m6, m3 + + palignr m8, m2, m0, 12 + palignr m9, m1, m2, 12 + pmaddwd m7, m8, [r3 - 9 * 32] ; [7] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m10, m9, [r3 - 9 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m7, m10 + + pmaddwd m8, [r3 + 12 * 32] ; [28] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, [r3 + 12 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + pmaddwd m9, m2, [r3 + 1 * 32] ; [17] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m3, m1, [r3 + 1 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m9, m3 + + movu m0, [r2 - 22] + movu m1, [r2 - 20] + punpcklwd m3, m0, m1 + punpckhwd m0, m1 + + pmaddwd m10, m3, [r3 - 10 * 32] ; [6] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, m0, [r3 - 10 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + pmaddwd m3, [r3 + 11 * 32] ; [27] + paddd m3, [pd_16] + psrld m3, 5 + pmaddwd m0, [r3 + 11 * 32] + paddd m0, [pd_16] + psrld m0, 5 + packusdw m3, m0 + TRANSPOSE_STORE_AVX2_STACK 3, 10, 9, 8, 7, 6, 5, 4, 0, 1, 0 + ret + +cglobal intra_pred_ang32_16, 3,8,14 + mov r6, rsp + sub rsp, 5*mmsize+gprsize + and rsp, ~63 + mov [rsp+5*mmsize], r6 + + movu m0, [r2 + 128] + movu m1, [r2 + 160] + movd xm2, [r2 + 192] + + mova [rsp + 2*mmsize], m0 + mova [rsp + 3*mmsize], m1 + movd [rsp + 4*mmsize], xm2 + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + + movu xm1, [r2 + 4] + movu xm2, [r2 + 16] + movu xm3, [r2 + 28] + movu xm4, [r2 + 40] + movu xm5, [r2 + 52] + pshufb xm1, [pw_ang32_16_20] + pshufb xm2, [pw_ang32_16_20] + pshufb xm3, [pw_ang32_16_20] + pshufb xm4, [pw_ang32_16_20] + pshufb xm5, [pw_ang32_16_20] + + punpckhqdq xm2, xm1 + punpckhqdq xm4, xm3 + punpckhqdq xm5, xm5 + + movzx r6d, word [r2] + mov 
[rsp + 2*mmsize], r6w + movu [rsp + 48], xm2 + movu [rsp + 32], xm4 + movq [rsp + 24], xm5 + + xor r6d, r6d + lea r2, [rsp + 2*mmsize] + lea r7, [r0 + 8 * r1] + + call ang32_mode_16_20_rows_0_15 + + lea r0, [r0 + 32] + + call ang32_mode_16_20_rows_16_31 + + add r2, 32 + lea r0, [r7 + 8 * r1] + + call ang32_mode_16_20_rows_0_15 + + lea r0, [r0 + 32] + + call ang32_mode_16_20_rows_16_31 + + mov rsp, [rsp+5*mmsize] + RET + +cglobal intra_pred_ang32_20, 3,8,14 + mov r6, rsp + sub rsp, 5*mmsize+gprsize + and rsp, ~63 + mov [rsp+5*mmsize], r6 + + movu m0, [r2] + movu m1, [r2 + 32] + movd xm2, [r2 + 64] + + mova [rsp + 2*mmsize], m0 + mova [rsp + 3*mmsize], m1 + movd [rsp + 4*mmsize], xm2 + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + + movu xm1, [r2 + 132] + movu xm2, [r2 + 144] + movu xm3, [r2 + 156] + movu xm4, [r2 + 168] + movu xm5, [r2 + 180] + pshufb xm1, [pw_ang32_16_20] + pshufb xm2, [pw_ang32_16_20] + pshufb xm3, [pw_ang32_16_20] + pshufb xm4, [pw_ang32_16_20] + pshufb xm5, [pw_ang32_16_20] + + punpckhqdq xm2, xm1 + punpckhqdq xm4, xm3 + punpckhqdq xm5, xm5 + + movu [rsp + 48], xm2 + movu [rsp + 32], xm4 + movq [rsp + 24], xm5 + + xor r6d, r6d + inc r6d + lea r2, [rsp + 2*mmsize] + lea r5, [r0 + 32] + + call ang32_mode_16_20_rows_0_15 + + lea r0, [r0 + 8 * r1] + lea r0, [r0 + 8 * r1] + + call ang32_mode_16_20_rows_16_31 + + add r2, 32 + mov r0, r5 + + call ang32_mode_16_20_rows_0_15 + + lea r0, [r0 + 8 * r1] + lea r0, [r0 + 8 * r1] + + call ang32_mode_16_20_rows_16_31 + + mov rsp, [rsp+5*mmsize] + RET + +;; angle 32, modes 17 and 19, row 0 to 15 +cglobal ang32_mode_17_19_rows_0_15 + test r6d, r6d + + movu m0, [r2 - 24] + movu m1, [r2 - 22] + + punpcklwd m3, m0, m1 + punpckhwd m0, m1 + + movu m1, [r2 - 8] + movu m4, [r2 - 6] + punpcklwd m2, m1, m4 + punpckhwd m1, m4 + + pmaddwd m4, m3, [r3 - 16 * 32] ; [0] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, m0, [r3 - 16 * 32] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + pmaddwd m5, m3, [r3 + 10 * 32] ; [26] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m8, m0, [r3 + 10 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m5, m8 + + palignr m6, m0, m3, 4 + palignr m8, m2, m0, 4 + pmaddwd m6, [r3 + 4 * 32] ; [20] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, [r3 + 4 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + palignr m7, m0, m3, 8 + palignr m9, m2, m0, 8 + pmaddwd m7, [r3 - 2 * 32] ; [14] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m9, [r3 - 2 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m7, m9 + + palignr m8, m0, m3, 12 + palignr m10, m2, m0, 12 + pmaddwd m8, [r3 - 8 * 32] ; [8] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m10, [r3 - 8 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m8, m10 + + pmaddwd m9, m0, [r3 - 14 * 32] ; [2] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m12, m2, [r3 - 14 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m9, m12 + + pmaddwd m10, m0, [r3 + 12 * 32] ; [28] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, m2, [r3 + 12 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + palignr m11, m2, m0, 4 + palignr m12, m1, m2, 4 + pmaddwd m11, [r3 + 6 * 32] ; [22] + paddd m11, [pd_16] + psrld m11, 5 + pmaddwd m12, [r3 + 6 * 32] + paddd m12, [pd_16] + psrld m12, 5 + packusdw m11, m12 + + TRANSPOSE_STORE_AVX2_STACK 11, 10, 9, 8, 7, 6, 5, 4, 12, 13, 16 + + palignr m4, m2, m0, 8 + palignr m5, m1, m2, 8 + pmaddwd m4, [r3] ; [16] + paddd m4, [pd_16] + psrld m4, 5 + pmaddwd m5, [r3] + paddd m5, [pd_16] + psrld m5, 5 + packusdw m4, m5 + + palignr 
m5, m2, m0, 12 + palignr m3, m1, m2, 12 + pmaddwd m5, [r3 - 6 * 32] ; [10] + paddd m5, [pd_16] + psrld m5, 5 + pmaddwd m3, [r3 - 6 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m5, m3 + + pmaddwd m6, m2, [r3 - 12 * 32] ; [4] + paddd m6, [pd_16] + psrld m6, 5 + pmaddwd m8, m1, [r3 - 12 * 32] + paddd m8, [pd_16] + psrld m8, 5 + packusdw m6, m8 + + pmaddwd m7, m2, [r3 + 14 * 32] ; [30] + paddd m7, [pd_16] + psrld m7, 5 + pmaddwd m3, m1, [r3 + 14 * 32] + paddd m3, [pd_16] + psrld m3, 5 + packusdw m7, m3 + + movu m0, [r2 - 6] + movu m1, [r2 - 4] + + punpcklwd m3, m0, m1 + punpckhwd m0, m1 + + movu m2, [r2 + 10] + movu m1, [r2 + 12] + punpcklwd m2, m1 + + pmaddwd m8, m3, [r3 + 8 * 32] ; [24] + paddd m8, [pd_16] + psrld m8, 5 + pmaddwd m9, m0, [r3 + 8 * 32] + paddd m9, [pd_16] + psrld m9, 5 + packusdw m8, m9 + + palignr m9, m0, m3, 4 + palignr m10, m2, m0, 4 + pmaddwd m9, [r3 + 2 * 32] ; [18] + paddd m9, [pd_16] + psrld m9, 5 + pmaddwd m10, [r3 + 2 * 32] + paddd m10, [pd_16] + psrld m10, 5 + packusdw m9, m10 + + palignr m10, m0, m3, 8 + palignr m11, m2, m0, 8 + pmaddwd m10, [r3 - 4 * 32] ; [12] + paddd m10, [pd_16] + psrld m10, 5 + pmaddwd m11, [r3 - 4 * 32] + paddd m11, [pd_16] + psrld m11, 5 + packusdw m10, m11 + + palignr m2, m0, 12 + palignr m0, m3, 12 + pmaddwd m0, [r3 - 10 * 32] ; [6] + paddd m0, [pd_16] + psrld m0, 5 + pmaddwd m2, [r3 - 10 * 32] + paddd m2, [pd_16] + psrld m2, 5 + packusdw m0, m2 + TRANSPOSE_STORE_AVX2_STACK 0, 10, 9, 8, 7, 6, 5, 4, 2, 1, 0 + ret + +cglobal intra_pred_ang32_17, 3,8,14 + mov r6, rsp + sub rsp, 5*mmsize+gprsize + and rsp, ~63 + mov [rsp+5*mmsize], r6 + + movu m0, [r2 + 128] + movu m1, [r2 + 160] + movd xm2, [r2 + 192] + + mova [rsp + 2*mmsize], m0 + mova [rsp + 3*mmsize], m1 + movd [rsp + 4*mmsize], xm2 + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + + movu xm1, [r2 + 2] + movu xm2, [r2 + 18] + movu xm3, [r2 + 34] + movu xm4, [r2 + 50] + pshufb xm1, [pw_ang32_17_19_0] + pshufb xm2, [shuf_mode_17_19] + pshufb xm3, [pw_ang32_17_19_0] + pshufb xm4, [shuf_mode_17_19] + + movzx r6d, word [r2] + mov [rsp + 2*mmsize], r6w + movu [rsp + 48], xm1 + movu [rsp + 36], xm2 + movu [rsp + 22], xm3 + movu [rsp + 10], xm4 + + xor r6d, r6d + lea r2, [rsp + 2*mmsize] + lea r7, [r0 + 8 * r1] + + call ang32_mode_17_19_rows_0_15 + + sub r2, 26 + lea r0, [r0 + 32] + + call ang32_mode_17_19_rows_0_15 + + add r2, 58 + lea r0, [r7 + 8 * r1] + + call ang32_mode_17_19_rows_0_15 + + sub r2, 26 + lea r0, [r0 + 32] + + call ang32_mode_17_19_rows_0_15 + + mov rsp, [rsp+5*mmsize] + RET + +cglobal intra_pred_ang32_19, 3,8,14 + mov r6, rsp + sub rsp, 5*mmsize+gprsize + and rsp, ~63 + mov [rsp+5*mmsize], r6 + + movu m0, [r2] + movu m1, [r2 + 32] + movd xm2, [r2 + 64] + + mova [rsp + 2*mmsize], m0 + mova [rsp + 3*mmsize], m1 + movd [rsp + 4*mmsize], xm2 + + add r1d, r1d + lea r4, [r1 * 3] + lea r3, [ang_table_avx2 + 16 * 32] + + movu xm1, [r2 + 130] + movu xm2, [r2 + 146] + movu xm3, [r2 + 162] + movu xm4, [r2 + 178] + pshufb xm1, [pw_ang32_17_19_0] + pshufb xm2, [shuf_mode_17_19] + pshufb xm3, [pw_ang32_17_19_0] + pshufb xm4, [shuf_mode_17_19] + + movu [rsp + 48], xm1 + movu [rsp + 36], xm2 + movu [rsp + 22], xm3 + movu [rsp + 10], xm4 + + xor r6d, r6d + inc r6d + lea r2, [rsp + 2*mmsize] + lea r5, [r0 + 32] + + call ang32_mode_17_19_rows_0_15 + + sub r2, 26 + lea r0, [r0 + 8 * r1] + lea r0, [r0 + 8 * r1] + + call ang32_mode_17_19_rows_0_15 + + add r2, 58 + mov r0, r5 + + call ang32_mode_17_19_rows_0_15 + + sub r2, 26 + lea r0, [r0 + 8 * r1] + lea r0, [r0 + 8 * 
r1] + + call ang32_mode_17_19_rows_0_15 + + mov rsp, [rsp+5*mmsize] + RET + +cglobal intra_pred_ang32_18, 3,6,6 + mov r4, rsp + sub rsp, 4*mmsize+gprsize + and rsp, ~63 + mov [rsp+4*mmsize], r4 + + movu m0, [r2] + movu m1, [r2 + 32] + mova [rsp + 2*mmsize], m0 + mova [rsp + 3*mmsize], m1 + + movu m2, [r2 + 130] + movu m3, [r2 + 162] + pshufb m2, [pw_swap16] + pshufb m3, [pw_swap16] + vpermq m2, m2, 01001110b + vpermq m3, m3, 01001110b + mova [rsp + 1*mmsize], m2 + mova [rsp + 0*mmsize], m3 + + add r1d, r1d + lea r2, [rsp+2*mmsize] + lea r4, [r1 * 2] + lea r3, [r1 * 3] + lea r5, [r1 * 4] + + movu m0, [r2] + movu m1, [r2 + 32] + movu m2, [r2 - 16] + movu m3, [r2 + 16] + + movu [r0], m0 + movu [r0 + 32], m1 + + palignr m4, m0, m2, 14 + palignr m5, m1, m3, 14 + movu [r0 + r1], m4 + movu [r0 + r1 + 32], m5 + + palignr m4, m0, m2, 12 + palignr m5, m1, m3, 12 + movu [r0 + r4], m4 + movu [r0 + r4 + 32], m5 + + palignr m4, m0, m2, 10 + palignr m5, m1, m3, 10 + movu [r0 + r3], m4 + movu [r0 + r3 + 32], m5 + + add r0, r5 + + palignr m4, m0, m2, 8 + palignr m5, m1, m3, 8 + movu [r0], m4 + movu [r0 + 32], m5 + + palignr m4, m0, m2, 6 + palignr m5, m1, m3, 6 + movu [r0 + r1], m4 + movu [r0 + r1 + 32], m5 + + palignr m4, m0, m2, 4 + palignr m5, m1, m3, 4 + movu [r0 + r4], m4 + movu [r0 + r4 + 32], m5 + + palignr m4, m0, m2, 2 + palignr m5, m1, m3, 2 + movu [r0 + r3], m4 + movu [r0 + r3 + 32], m5 + + add r0, r5 + + movu [r0], m2 + movu [r0 + 32], m3 + + movu m0, [r2 - 32] + movu m1, [r2] + + palignr m4, m2, m0, 14 + palignr m5, m3, m1, 14 + movu [r0 + r1], m4 + movu [r0 + r1 + 32], m5 + + palignr m4, m2, m0, 12 + palignr m5, m3, m1, 12 + movu [r0 + r4], m4 + movu [r0 + r4 + 32], m5 + + palignr m4, m2, m0, 10 + palignr m5, m3, m1, 10 + movu [r0 + r3], m4 + movu [r0 + r3 + 32], m5 + + add r0, r5 + + palignr m4, m2, m0, 8 + palignr m5, m3, m1, 8 + movu [r0], m4 + movu [r0 + 32], m5 + + palignr m4, m2, m0, 6 + palignr m5, m3, m1, 6 + movu [r0 + r1], m4 + movu [r0 + r1 + 32], m5 + + palignr m4, m2, m0, 4 + palignr m5, m3, m1, 4 + movu [r0 + r4], m4 + movu [r0 + r4 + 32], m5 + + palignr m4, m2, m0, 2 + palignr m5, m3, m1, 2 + movu [r0 + r3], m4 + movu [r0 + r3 + 32], m5 + + add r0, r5 + + movu [r0], m0 + movu [r0 + 32], m1 + + movu m2, [r2 - 48] + movu m3, [r2 - 16] + + palignr m4, m0, m2, 14 + palignr m5, m1, m3, 14 + movu [r0 + r1], m4 + movu [r0 + r1 + 32], m5 + + palignr m4, m0, m2, 12 + palignr m5, m1, m3, 12 + movu [r0 + r4], m4 + movu [r0 + r4 + 32], m5 + + palignr m4, m0, m2, 10 + palignr m5, m1, m3, 10 + movu [r0 + r3], m4 + movu [r0 + r3 + 32], m5 + + add r0, r5 + + palignr m4, m0, m2, 8 + palignr m5, m1, m3, 8 + movu [r0], m4 + movu [r0 + 32], m5 + + palignr m4, m0, m2, 6 + palignr m5, m1, m3, 6 + movu [r0 + r1], m4 + movu [r0 + r1 + 32], m5 + + palignr m4, m0, m2, 4 + palignr m5, m1, m3, 4 + movu [r0 + r4], m4 + movu [r0 + r4 + 32], m5 + + palignr m4, m0, m2, 2 + palignr m5, m1, m3, 2 + movu [r0 + r3], m4 + movu [r0 + r3 + 32], m5 + + add r0, r5 + + movu [r0], m2 + movu [r0 + 32], m3 + + movu m0, [r2 - 64] + movu m1, [r2 - 32] + + palignr m4, m2, m0, 14 + palignr m5, m3, m1, 14 + movu [r0 + r1], m4 + movu [r0 + r1 + 32], m5 + + palignr m4, m2, m0, 12 + palignr m5, m3, m1, 12 + movu [r0 + r4], m4 + movu [r0 + r4 + 32], m5 + + palignr m4, m2, m0, 10 + palignr m5, m3, m1, 10 + movu [r0 + r3], m4 + movu [r0 + r3 + 32], m5 + + add r0, r5 + + palignr m4, m2, m0, 8 + palignr m5, m3, m1, 8 + movu [r0], m4 + movu [r0 + 32], m5 + + palignr m4, m2, m0, 6 + palignr m5, m3, m1, 6 + movu [r0 + r1], m4 + movu [r0 
+ r1 + 32], m5 + + palignr m4, m2, m0, 4 + palignr m5, m3, m1, 4 + movu [r0 + r4], m4 + movu [r0 + r4 + 32], m5 + + palignr m4, m2, m0, 2 + palignr m5, m3, m1, 2 + movu [r0 + r3], m4 + movu [r0 + r3 + 32], m5 + + mov rsp, [rsp+4*mmsize] + RET +;------------------------------------------------------------------------------------------------------- +; end of avx2 code for intra_pred_ang32 mode 2 to 34 +;------------------------------------------------------------------------------------------------------- + %macro MODE_2_34 0 movu m0, [r2 + 4] movu m1, [r2 + 20] @@ -13892,3 +21633,439 @@ dec r4 jnz .loop RET + +;----------------------------------------------------------------------------------- +; void intra_filter_NxN(const pixel* references, pixel* filtered) +;----------------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal intra_filter_4x4, 2,4,5 + mov r2w, word [r0 + 16] ; topLast + mov r3w, word [r0 + 32] ; LeftLast + + ; filtering top + movu m0, [r0 + 0] + movu m1, [r0 + 16] + movu m2, [r0 + 32] + + pshufb m4, m0, [intra_filter4_shuf0] ; [6 5 4 3 2 1 0 1] samples[i - 1] + palignr m3, m1, m0, 4 + pshufb m3, [intra_filter4_shuf1] ; [8 7 6 5 4 3 2 9] samples[i + 1] + + psllw m0, 1 + paddw m4, m3 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + + ; filtering left + palignr m4, m1, m1, 14 + pinsrw m4, [r0], 1 + palignr m3, m2, m1, 4 + pshufb m3, [intra_filter4_shuf1] + + psllw m1, 1 + paddw m4, m3 + paddw m1, m4 + paddw m1, [pw_2] + psrlw m1, 2 + + movu [r1], m0 + movu [r1 + 16], m1 + mov [r1 + 16], r2w ; topLast + mov [r1 + 32], r3w ; LeftLast + RET + +INIT_XMM sse4 +cglobal intra_filter_8x8, 2,4,6 + mov r2w, word [r0 + 32] ; topLast + mov r3w, word [r0 + 64] ; LeftLast + + ; filtering top + movu m0, [r0] + movu m1, [r0 + 16] + movu m2, [r0 + 32] + + pshufb m4, m0, [intra_filter4_shuf0] + palignr m5, m1, m0, 2 + pinsrw m5, [r0 + 34], 0 + + palignr m3, m1, m0, 14 + psllw m0, 1 + paddw m4, m5 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + + palignr m4, m2, m1, 2 + psllw m1, 1 + paddw m4, m3 + paddw m1, m4 + paddw m1, [pw_2] + psrlw m1, 2 + movu [r1], m0 + movu [r1 + 16], m1 + + ; filtering left + movu m1, [r0 + 48] + movu m0, [r0 + 64] + + palignr m4, m2, m2, 14 + pinsrw m4, [r0], 1 + palignr m5, m1, m2, 2 + + palignr m3, m1, m2, 14 + palignr m0, m1, 2 + + psllw m2, 1 + paddw m4, m5 + paddw m2, m4 + paddw m2, [pw_2] + psrlw m2, 2 + + psllw m1, 1 + paddw m0, m3 + paddw m1, m0 + paddw m1, [pw_2] + psrlw m1, 2 + + movu [r1 + 32], m2 + movu [r1 + 48], m1 + mov [r1 + 32], r2w ; topLast + mov [r1 + 64], r3w ; LeftLast + RET + +INIT_XMM sse4 +cglobal intra_filter_16x16, 2,4,6 + mov r2w, word [r0 + 64] ; topLast + mov r3w, word [r0 + 128] ; LeftLast + + ; filtering top + movu m0, [r0] + movu m1, [r0 + 16] + movu m2, [r0 + 32] + + pshufb m4, m0, [intra_filter4_shuf0] + palignr m5, m1, m0, 2 + pinsrw m5, [r0 + 66], 0 + + palignr m3, m1, m0, 14 + psllw m0, 1 + paddw m4, m5 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + + palignr m4, m2, m1, 2 + psllw m5, m1, 1 + paddw m4, m3 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + movu [r1], m0 + movu [r1 + 16], m5 + + movu m0, [r0 + 48] + movu m5, [r0 + 64] + + palignr m3, m2, m1, 14 + palignr m4, m0, m2, 2 + + psllw m1, m2, 1 + paddw m3, m4 + paddw m1, m3 + paddw m1, [pw_2] + psrlw m1, 2 + + palignr m3, m0, m2, 14 + palignr m4, m5, m0, 2 + + psllw m0, 1 + paddw m4, m3 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + movu [r1 + 32], m1 + movu [r1 + 48], m0 + + ; filtering left + movu m1, [r0 + 80] + movu m2, 
[r0 + 96] + + palignr m4, m5, m5, 14 + pinsrw m4, [r0], 1 + palignr m0, m1, m5, 2 + + psllw m3, m5, 1 + paddw m4, m0 + paddw m3, m4 + paddw m3, [pw_2] + psrlw m3, 2 + + palignr m0, m1, m5, 14 + palignr m4, m2, m1, 2 + + psllw m5, m1, 1 + paddw m4, m0 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + movu [r1 + 64], m3 + movu [r1 + 80], m5 + + movu m5, [r0 + 112] + movu m0, [r0 + 128] + + palignr m3, m2, m1, 14 + palignr m4, m5, m2, 2 + + psllw m1, m2, 1 + paddw m3, m4 + paddw m1, m3 + paddw m1, [pw_2] + psrlw m1, 2 + + palignr m3, m5, m2, 14 + palignr m4, m0, m5, 2 + + psllw m5, 1 + paddw m4, m3 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + movu [r1 + 96], m1 + movu [r1 + 112], m5 + + mov [r1 + 64], r2w ; topLast + mov [r1 + 128], r3w ; LeftLast + RET + +INIT_XMM sse4 +cglobal intra_filter_32x32, 2,4,6 + mov r2w, word [r0 + 128] ; topLast + mov r3w, word [r0 + 256] ; LeftLast + + ; filtering top + ; 0 to 15 + movu m0, [r0 + 0] + movu m1, [r0 + 16] + movu m2, [r0 + 32] + + pshufb m4, m0, [intra_filter4_shuf0] + palignr m5, m1, m0, 2 + pinsrw m5, [r0 + 130], 0 + + palignr m3, m1, m0, 14 + psllw m0, 1 + paddw m4, m5 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + + palignr m4, m2, m1, 2 + psllw m5, m1, 1 + paddw m4, m3 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + movu [r1], m0 + movu [r1 + 16], m5 + + ; 16 to 31 + movu m0, [r0 + 48] + movu m5, [r0 + 64] + + palignr m3, m2, m1, 14 + palignr m4, m0, m2, 2 + + psllw m1, m2, 1 + paddw m3, m4 + paddw m1, m3 + paddw m1, [pw_2] + psrlw m1, 2 + + palignr m3, m0, m2, 14 + palignr m4, m5, m0, 2 + + psllw m2, m0, 1 + paddw m4, m3 + paddw m2, m4 + paddw m2, [pw_2] + psrlw m2, 2 + movu [r1 + 32], m1 + movu [r1 + 48], m2 + + ; 32 to 47 + movu m1, [r0 + 80] + movu m2, [r0 + 96] + + palignr m3, m5, m0, 14 + palignr m4, m1, m5, 2 + + psllw m0, m5, 1 + paddw m3, m4 + paddw m0, m3 + paddw m0, [pw_2] + psrlw m0, 2 + + palignr m3, m1, m5, 14 + palignr m4, m2, m1, 2 + + psllw m5, m1, 1 + paddw m4, m3 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + movu [r1 + 64], m0 + movu [r1 + 80], m5 + + ; 48 to 63 + movu m0, [r0 + 112] + movu m5, [r0 + 128] + + palignr m3, m2, m1, 14 + palignr m4, m0, m2, 2 + + psllw m1, m2, 1 + paddw m3, m4 + paddw m1, m3 + paddw m1, [pw_2] + psrlw m1, 2 + + palignr m3, m0, m2, 14 + palignr m4, m5, m0, 2 + + psllw m0, 1 + paddw m4, m3 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + movu [r1 + 96], m1 + movu [r1 + 112], m0 + + ; filtering left + ; 64 to 79 + movu m1, [r0 + 144] + movu m2, [r0 + 160] + + palignr m4, m5, m5, 14 + pinsrw m4, [r0], 1 + palignr m0, m1, m5, 2 + + psllw m3, m5, 1 + paddw m4, m0 + paddw m3, m4 + paddw m3, [pw_2] + psrlw m3, 2 + + palignr m0, m1, m5, 14 + palignr m4, m2, m1, 2 + + psllw m5, m1, 1 + paddw m4, m0 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + movu [r1 + 128], m3 + movu [r1 + 144], m5 + + ; 80 to 95 + movu m5, [r0 + 176] + movu m0, [r0 + 192] + + palignr m3, m2, m1, 14 + palignr m4, m5, m2, 2 + + psllw m1, m2, 1 + paddw m3, m4 + paddw m1, m3 + paddw m1, [pw_2] + psrlw m1, 2 + + palignr m3, m5, m2, 14 + palignr m4, m0, m5, 2 + + psllw m2, m5, 1 + paddw m4, m3 + paddw m2, m4 + paddw m2, [pw_2] + psrlw m2, 2 + movu [r1 + 160], m1 + movu [r1 + 176], m2 + + ; 96 to 111 + movu m1, [r0 + 208] + movu m2, [r0 + 224] + + palignr m3, m0, m5, 14 + palignr m4, m1, m0, 2 + + psllw m5, m0, 1 + paddw m3, m4 + paddw m5, m3 + paddw m5, [pw_2] + psrlw m5, 2 + + palignr m3, m1, m0, 14 + palignr m4, m2, m1, 2 + + psllw m0, m1, 1 + paddw m4, m3 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + movu [r1 + 
192], m5 + movu [r1 + 208], m0 + + ; 112 to 127 + movu m5, [r0 + 240] + movu m0, [r0 + 256] + + palignr m3, m2, m1, 14 + palignr m4, m5, m2, 2 + + psllw m1, m2, 1 + paddw m3, m4 + paddw m1, m3 + paddw m1, [pw_2] + psrlw m1, 2 + + palignr m3, m5, m2, 14 + palignr m4, m0, m5, 2 + + psllw m5, 1 + paddw m4, m3 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + movu [r1 + 224], m1 + movu [r1 + 240], m5 + + mov [r1 + 128], r2w ; topLast + mov [r1 + 256], r3w ; LeftLast + RET + +INIT_YMM avx2 +cglobal intra_filter_4x4, 2,4,4 + mov r2w, word [r0 + 16] ; topLast + mov r3w, word [r0 + 32] ; LeftLast + + ; filtering top + movu m0, [r0] + vpbroadcastw m2, xm0 + movu m1, [r0 + 16] + + palignr m3, m0, m2, 14 ; [6 5 4 3 2 1 0 0] [14 13 12 11 10 9 8 0] + pshufb m3, [intra_filter4_shuf2] ; [6 5 4 3 2 1 0 1] [14 13 12 11 10 9 0 9] samples[i - 1] + palignr m1, m0, 4 ; [9 8 7 6 5 4 3 2] + palignr m1, m1, 14 ; [9 8 7 6 5 4 3 2] + + psllw m0, 1 + paddw m3, m1 + paddw m0, m3 + paddw m0, [pw_2] + psrlw m0, 2 + + movu [r1], m0 + mov [r1 + 16], r2w ; topLast + mov [r1 + 32], r3w ; LeftLast + RET
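Note for reviewers of the assembly hunks above and below: nearly every block is the same two arithmetic patterns, vectorized. Each pmaddwd/paddd/psrld group (and the pmaddwd/paddw/psraw variant in the 8-bit file below) is the HEVC two-tap angular interpolation, ((32 - f)*a + f*b + 16) >> 5, using coefficient pairs stored pre-interleaved as (32 - f, f) in ang_table_avx2 / pw_ang_table; the new intra_filter_NxN routines are the 1-2-1 reference-sample smoothing filter, with the saved topLast/LeftLast corner samples written back unfiltered. The following is a minimal scalar sketch of that arithmetic, assuming 16-bit pixels (matching this file's word-sized operations); the names pixel, angular_interp, intra_filter_ref, ref, filtered are illustrative placeholders, not the x265 API.

    /* Scalar sketch of the arithmetic the SIMD kernels vectorize.
     * Assumption: 16-bit pixels; names are illustrative, not x265's. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint16_t pixel;

    /* Two-tap angular interpolation: one pmaddwd with an ang_table_avx2
     * entry computes (32 - f)*a + f*b in a single step because the table
     * stores (32 - f, f) pairs interleaved with the sample pairs; the
     * paddd [pd_16] / psrld 5 sequence then rounds and rescales. */
    static inline pixel angular_interp(pixel a, pixel b, int f)
    {
        return (pixel)(((32 - f) * a + f * b + 16) >> 5);
    }

    /* 1-2-1 smoothing as in intra_filter_NxN: (left + 2*cur + right + 2)
     * >> 2 over the interior samples (count >= 2 assumed), with the first
     * sample and the saved topLast/LeftLast samples passed through. */
    static void intra_filter_ref(const pixel *ref, pixel *filtered, int count)
    {
        filtered[0] = ref[0];                       /* corner kept as-is */
        for (int i = 1; i + 1 < count; i++)
            filtered[i] = (pixel)((ref[i - 1] + 2 * ref[i] + ref[i + 1] + 2) >> 2);
        filtered[count - 1] = ref[count - 1];       /* topLast/LeftLast restore */
    }

    int main(void)
    {
        pixel ref[5] = { 100, 200, 300, 400, 500 };
        pixel out[5];
        intra_filter_ref(ref, out, 5);
        /* out[1] = (100 + 2*200 + 300 + 2) >> 2 = 200;
         * angular_interp(100, 200, 26) = (6*100 + 26*200 + 16) >> 5 = 181 */
        printf("%d %d\n", out[1], angular_interp(100, 200, 26));
        return 0;
    }

Keeping the coefficients pre-interleaved is what lets a single pmaddwd emit eight rounded-and-ready 32-bit results per ymm register (two pmaddwd plus a packusdw yield sixteen pixels); the annotated fraction values in the comments above, e.g. "; [26]" or "; [10]", are the f in this formula.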
View file
x265_1.7.tar.gz/source/common/x86/intrapred8.asm -> x265_1.8.tar.gz/source/common/x86/intrapred8.asm
Changed
@@ -30,6 +30,10 @@ intra_pred_shuff_0_8: times 2 db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8 intra_pred_shuff_15_0: times 2 db 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0 +intra_filter4_shuf0: times 2 db 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 +intra_filter4_shuf1: times 2 db 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 +intra_filter4_shuf2: times 2 db 4, 5, 0, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 + pb_0_8 times 8 db 0, 8 pb_unpackbw1 times 2 db 1, 8, 2, 8, 3, 8, 4, 8 pb_swap8: times 2 db 7, 6, 5, 4, 3, 2, 1, 0 @@ -191,16 +195,6 @@ intra_pred_shuff_0_15: db 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 15 ALIGN 32 -c_ang16_mode_8: db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1 - db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 - db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - -ALIGN 32 c_ang16_mode_29: db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13 @@ -212,16 +206,6 @@ db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 ALIGN 32 -c_ang16_mode_7: db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 - db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 - db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3 - db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 - db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - -ALIGN 32 c_ang16_mode_30: db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 db 31, 1, 31, 1, 31, 
1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 @@ -232,18 +216,6 @@ db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 db 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - - -ALIGN 32 -c_ang16_mode_6: db 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 - db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2 - db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9 - db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 - db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - ALIGN 32 c_ang16_mode_31: db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19 @@ -255,66 +227,6 @@ db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - -ALIGN 32 -c_ang16_mode_5: db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 - db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 - db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 - db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - -ALIGN 32 -c_ang16_mode_32: db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 - db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 
13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 - db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 - db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - -ALIGN 32 -c_ang16_mode_4: db 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29, 3, 29 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 1, 31, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7, 25, 7 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17 - db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 - db 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 13, 19, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - -ALIGN 32 -c_ang16_mode_33: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 - db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 - db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 - db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 - db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 - db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 
32, 0, 32, 0, 32, 0 - -ALIGN 32 -c_ang16_mode_3: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 - db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 - db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - ALIGN 32 c_ang16_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 @@ -476,38 +388,6 @@ db 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11, 21, 11 db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - -ALIGN 32 -c_ang32_mode_33: db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 - db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 - db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 - db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 - db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 - db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26 - db 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20 - db 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 
14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 - db 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8 - db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 - db 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 - db 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 - db 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10 - db 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30 - db 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 - db 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18 - db 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 - db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6 - db 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - - - ALIGN 32 c_ang32_mode_25: db 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 db 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24, 8, 24 @@ -526,8 +406,6 @@ db 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 26, 6, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4 db 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 30, 2, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0 - - ALIGN 32 c_ang32_mode_24: db 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22, 10, 22 db 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 15, 17, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12, 20, 12 @@ -664,15 +542,6 @@ ALIGN 32 ;; (blkSize - 1 - x) pw_planar4_0: dw 3, 2, 1, 0, 3, 2, 1, 0 -pw_planar4_1: dw 3, 3, 3, 3, 3, 3, 3, 3 -pw_planar8_0: dw 7, 6, 5, 4, 3, 2, 1, 0 -pw_planar8_1: dw 7, 7, 7, 7, 7, 7, 7, 7 -pw_planar16_0: dw 15, 14, 13, 12, 11, 10, 9, 8 -pw_planar16_1: dw 15, 15, 15, 15, 15, 15, 15, 15 -pw_planar32_1: dw 31, 31, 31, 31, 31, 31, 31, 31 -pw_planar32_L: dw 31, 30, 29, 28, 27, 26, 25, 24 -pw_planar32_H: dw 23, 22, 21, 20, 19, 18, 17, 16 - ALIGN 32 c_ang8_mode_13: db 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 9, 23, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14, 18, 14 db 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 27, 5, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28, 4, 28 @@ -704,6 +573,13 @@ %assign x x+1 %endrep +const ang_table_avx2 +%assign x 0 +%rep 32 + times 16 db (32-x), x +%assign x x+1 +%endrep + const pw_ang_table %assign x 0 %rep 32 @@ -712,9 +588,10 @@ %endrep SECTION .text - cextern pw_2 +cextern pw_3 cextern pw_4 +cextern pw_7 cextern pw_8 cextern pw_16 cextern pw_15 @@ -1149,9 +1026,8 @@ pshufd m3, m3, 0xAA pshufhw m4, m2, 0 ; bottomLeft pshufd m4, m4, 0xAA - pmullw m3, [multi_2Row] ; (x + 1) * topRight - pmullw m0, m1, [pw_planar4_1] ; (blkSize - 1 - y) * above[x] + pmullw m0, m1, [pw_3] ; (blkSize - 1 - y) * above[x] paddw m3, [pw_4] paddw m3, m4 paddw m3, m0 @@ -1210,9 
+1086,8 @@ pshuflw m4, m4, 0x00 pshufd m3, m3, 0x44 pshufd m4, m4, 0x44 - pmullw m3, [multiL] ; (x + 1) * topRight - pmullw m0, m1, [pw_planar8_1] ; (blkSize - 1 - y) * above[x] + pmullw m0, m1, [pw_7] ; (blkSize - 1 - y) * above[x] paddw m3, [pw_8] paddw m3, m4 paddw m3, m0 @@ -1226,7 +1101,7 @@ pshufhw m5, m2, 0x55 * (%1 - 4) pshufd m5, m5, 0xAA %endif - pmullw m5, [pw_planar8_0] + pmullw m5, [pw_planar16_mul + mmsize] paddw m5, m3 psraw m5, 4 packuswb m5, m5 @@ -1266,11 +1141,10 @@ pshuflw m6, m6, 0x00 pshufd m3, m3, 0x44 ; v_topRight pshufd m6, m6, 0x44 ; v_bottomLeft - pmullw m4, m3, [multiH] ; (x + 1) * topRight pmullw m3, [multiL] ; (x + 1) * topRight - pmullw m1, m2, [pw_planar16_1] ; (blkSize - 1 - y) * above[x] - pmullw m5, m7, [pw_planar16_1] ; (blkSize - 1 - y) * above[x] + pmullw m1, m2, [pw_15] ; (blkSize - 1 - y) * above[x] + pmullw m5, m7, [pw_15] ; (blkSize - 1 - y) * above[x] paddw m4, [pw_16] paddw m3, [pw_16] paddw m4, m6 @@ -1308,8 +1182,8 @@ paddw m4, m1 lea r0, [r0 + r1] %endif - pmullw m0, m5, [pw_planar8_0] - pmullw m5, [pw_planar16_0] + pmullw m0, m5, [pw_planar16_mul + mmsize] + pmullw m5, [pw_planar16_mul] paddw m0, m4 paddw m5, m3 psraw m5, 5 @@ -1368,8 +1242,7 @@ mova m8, m11 mova m9, m11 mova m10, m11 - - mova m12, [pw_planar32_1] + mova m12, [pw_31] movh m4, [r2 + 1] punpcklbw m4, m7 psubw m8, m4 @@ -1393,11 +1266,10 @@ psubw m11, m4 pmullw m4, m12 paddw m3, m4 - - mova m12, [pw_planar32_L] - mova m13, [pw_planar32_H] - mova m14, [pw_planar16_0] - mova m15, [pw_planar8_0] + mova m12, [pw_planar32_mul] + mova m13, [pw_planar32_mul + mmsize] + mova m14, [pw_planar16_mul] + mova m15, [pw_planar16_mul + mmsize] %macro PROCESS 1 pmullw m5, %1, m12 pmullw m6, %1, m13 @@ -1480,42 +1352,37 @@ punpcklbw m4, m7 psubw m5, m6, m4 mova [rsp + 0 * mmsize], m5 - pmullw m4, [pw_planar32_1] + pmullw m4, [pw_31] paddw m0, m4 - movh m4, [r2 + 9] punpcklbw m4, m7 psubw m5, m6, m4 mova [rsp + 1 * mmsize], m5 - pmullw m4, [pw_planar32_1] + pmullw m4, [pw_31] paddw m1, m4 - movh m4, [r2 + 17] punpcklbw m4, m7 psubw m5, m6, m4 mova [rsp + 2 * mmsize], m5 - pmullw m4, [pw_planar32_1] + pmullw m4, [pw_31] paddw m2, m4 - movh m4, [r2 + 25] punpcklbw m4, m7 psubw m5, m6, m4 mova [rsp + 3 * mmsize], m5 - pmullw m4, [pw_planar32_1] + pmullw m4, [pw_31] paddw m3, m4 - %macro PROCESS 1 - pmullw m5, %1, [pw_planar32_L] - pmullw m6, %1, [pw_planar32_H] + pmullw m5, %1, [pw_planar32_mul] + pmullw m6, %1, [pw_planar32_mul + mmsize] paddw m5, m0 paddw m6, m1 psraw m5, 6 psraw m6, 6 packuswb m5, m6 movu [r0], m5 - - pmullw m5, %1, [pw_planar16_0] - pmullw %1, [pw_planar8_0] + pmullw m5, %1, [pw_planar16_mul] + pmullw %1, [pw_planar16_mul + mmsize] paddw m5, m2 paddw %1, m3 psraw m5, 6 @@ -1559,6 +1426,30 @@ %endif ; end ARCH_X86_32 +%macro STORE_4x4 0 + movd [r0], m0 + psrldq m0, 4 + movd [r0 + r1], m0 + psrldq m0, 4 + movd [r0 + r1 * 2], m0 + lea r1, [r1 * 3] + psrldq m0, 4 + movd [r0 + r1], m0 +%endmacro + +%macro TRANSPOSE_4x4 0 + pshufd m0, m0, 0xD8 + pshufd m1, m2, 0xD8 + pshuflw m0, m0, 0xD8 + pshuflw m1, m1, 0xD8 + pshufhw m0, m0, 0xD8 + pshufhw m1, m1, 0xD8 + mova m2, m0 + punpckldq m0, m1 + punpckhdq m2, m1 + packuswb m0, m2 +%endmacro + ;----------------------------------------------------------------------------------------- ; void intraPredAng4(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter) ;----------------------------------------------------------------------------------------- @@ -1581,214 +1472,208 @@ RET INIT_XMM sse2 -cglobal intra_pred_ang4_3, 3,5,8 - 
mov r4d, 1 - cmp r3m, byte 33 - mov r3d, 9 - cmove r3d, r4d +cglobal intra_pred_ang4_3, 3,3,5 + movh m3, [r2 + 9] ; [8 7 6 5 4 3 2 1] + punpcklbw m3, m3 + psrldq m3, 1 + movh m0, m3 ;[x x x x x x x x 5 4 4 3 3 2 2 1] + psrldq m3, 2 + movh m1, m3 ;[x x x x x x x x 6 5 5 4 4 3 3 2] + psrldq m3, 2 + movh m2, m3 ;[x x x x x x x x 7 6 6 5 5 4 4 3] + psrldq m3, 2 ;[x x x x x x x x 8 7 7 6 6 5 5 4] + + pxor m4, m4 + punpcklbw m1, m4 + pmaddwd m1, [pw_ang_table + 20 * 16] + punpcklbw m0, m4 + pmaddwd m0, [pw_ang_table + 26 * 16] + packssdw m0, m1 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m3, m4 + pmaddwd m3, [pw_ang_table + 8 * 16] + punpcklbw m2, m4 + pmaddwd m2, [pw_ang_table + 14 * 16] + packssdw m2, m3 + paddw m2, [pw_16] + psraw m2, 5 - movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - mova m1, m0 - psrldq m1, 1 ; [x 8 7 6 5 4 3 2] - punpcklbw m0, m1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] - mova m1, m0 - psrldq m1, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2] - mova m2, m0 - psrldq m2, 4 ; [x x x x x x x x 7 6 6 5 5 4 4 3] - mova m3, m0 - psrldq m3, 6 ; [x x x x x x x x 8 7 7 6 6 5 5 4] - punpcklqdq m0, m1 - punpcklqdq m2, m3 + TRANSPOSE_4x4 - lea r3, [pw_ang_table + 20 * 16] - mova m4, [r3 + 6 * 16] ; [26] - mova m5, [r3] ; [20] - mova m6, [r3 - 6 * 16] ; [14] - mova m7, [r3 - 12 * 16] ; [ 8] - jmp .do_filter4x4 + STORE_4x4 + RET - ; NOTE: share path, input is m0=[1 0], m2=[3 2], m3,m4=coef, flag_z=no_transpose -ALIGN 16 -.do_filter4x4: - pxor m1, m1 - punpckhbw m3, m0 - psrlw m3, 8 - pmaddwd m3, m5 - punpcklbw m0, m1 - pmaddwd m0, m4 +cglobal intra_pred_ang4_4, 3,3,5 + movh m1, [r2 + 9] ;[8 7 6 5 4 3 2 1] + punpcklbw m1, m1 + psrldq m1, 1 + movh m0, m1 ;[x x x x x x x x 5 4 4 3 3 2 2 1] + psrldq m1, 2 + movh m2, m1 ;[x x x x x x x x 6 5 5 4 4 3 3 2] + psrldq m1, 2 ;[x x x x x x x x 7 6 6 5 5 4 4 3] + + pxor m4, m4 + punpcklbw m2, m4 + mova m3, m2 + pmaddwd m3, [pw_ang_table + 10 * 16] + punpcklbw m0, m4 + pmaddwd m0, [pw_ang_table + 21 * 16] packssdw m0, m3 paddw m0, [pw_16] psraw m0, 5 - punpckhbw m3, m2 - psrlw m3, 8 - pmaddwd m3, m7 - punpcklbw m2, m1 - pmaddwd m2, m6 - packssdw m2, m3 + punpcklbw m1, m4 + pmaddwd m1, [pw_ang_table + 20 * 16] + pmaddwd m2, [pw_ang_table + 31 * 16] + packssdw m2, m1 paddw m2, [pw_16] psraw m2, 5 - ; NOTE: mode 33 doesn't reorder, UNSAFE but I don't use any instruction that affect eflag register before - jz .store - - ; transpose 4x4 c_trans_4x4 db 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15 - pshufd m0, m0, 0xD8 - pshufd m1, m2, 0xD8 - pshuflw m0, m0, 0xD8 - pshuflw m1, m1, 0xD8 - pshufhw m0, m0, 0xD8 - pshufhw m1, m1, 0xD8 - mova m2, m0 - punpckldq m0, m1 - punpckhdq m2, m1 + TRANSPOSE_4x4 -.store: - packuswb m0, m2 - movd [r0], m0 - psrldq m0, 4 - movd [r0 + r1], m0 - psrldq m0, 4 - movd [r0 + r1 * 2], m0 - lea r1, [r1 * 3] - psrldq m0, 4 - movd [r0 + r1], m0 + STORE_4x4 RET -cglobal intra_pred_ang4_4, 3,5,8 - xor r4d, r4d - inc r4d - cmp r3m, byte 32 - mov r3d, 9 - cmove r3d, r4d - - movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - punpcklbw m0, m0 - psrldq m0, 1 - mova m2, m0 - psrldq m2, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2] - mova m1, m0 - psrldq m1, 4 ; [x x x x x x x x 7 6 6 5 5 4 4 3] - punpcklqdq m0, m2 - punpcklqdq m2, m1 +cglobal intra_pred_ang4_5, 3,3,5 + movh m3, [r2 + 9] ;[8 7 6 5 4 3 2 1] + punpcklbw m3, m3 + psrldq m3, 1 + mova m0, m3 ;[x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + psrldq m3, 2 + mova m2, m3 ;[x x x x x x x x 6 5 5 4 4 3 3 2] + psrldq m3, 2 ;[x x x x x x x x 7 6 6 5 5 4 4 3] - lea r3, [pw_ang_table + 18 * 16] - mova m4, [r3 + 3 * 16] ; [21] - 
mova m5, [r3 - 8 * 16] ; [10] - mova m6, [r3 + 13 * 16] ; [31] - mova m7, [r3 + 2 * 16] ; [20] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + pxor m1, m1 + punpcklbw m2, m1 + mova m4, m2 + pmaddwd m4, [pw_ang_table + 2 * 16] + punpcklbw m0, m1 + pmaddwd m0, [pw_ang_table + 17 * 16] + packssdw m0, m4 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m3, m1 + pmaddwd m3, [pw_ang_table + 4 * 16] + pmaddwd m2, [pw_ang_table + 19 * 16] + packssdw m2, m3 + paddw m2, [pw_16] + psraw m2, 5 -cglobal intra_pred_ang4_5, 3,5,8 - xor r4d, r4d - inc r4d - cmp r3m, byte 31 - mov r3d, 9 - cmove r3d, r4d + TRANSPOSE_4x4 - movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - punpcklbw m0, m0 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] - psrldq m0, 1 - mova m2, m0 - psrldq m2, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2] - mova m3, m0 - psrldq m3, 4 ; [x x x x x x x x 7 6 6 5 5 4 4 3] - punpcklqdq m0, m2 - punpcklqdq m2, m3 + STORE_4x4 + RET - lea r3, [pw_ang_table + 10 * 16] - mova m4, [r3 + 7 * 16] ; [17] - mova m5, [r3 - 8 * 16] ; [ 2] - mova m6, [r3 + 9 * 16] ; [19] - mova m7, [r3 - 6 * 16] ; [ 4] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) +cglobal intra_pred_ang4_6, 3,3,4 + movh m2, [r2 + 9] ;[8 7 6 5 4 3 2 1] + punpcklbw m2, m2 + psrldq m2, 1 + movh m0, m2 ;[x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + psrldq m2, 2 ;[x x x 8 8 7 7 6 6 5 5 4 4 3 3 2] -cglobal intra_pred_ang4_6, 3,5,8 - xor r4d, r4d - inc r4d - cmp r3m, byte 30 - mov r3d, 9 - cmove r3d, r4d + pxor m1, m1 + punpcklbw m0, m1 + mova m3, m0 + pmaddwd m3, [pw_ang_table + 26 * 16] + pmaddwd m0, [pw_ang_table + 13 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m2, m1 + mova m3, m2 + pmaddwd m3, [pw_ang_table + 20 * 16] + pmaddwd m2, [pw_ang_table + 7 * 16] + packssdw m2, m3 + paddw m2, [pw_16] + psraw m2, 5 - movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - punpcklbw m0, m0 - psrldq m0, 1 ; [x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] - mova m2, m0 - psrldq m2, 2 ; [x x x 8 8 7 7 6 6 5 5 4 4 3 3 2] - punpcklqdq m0, m0 - punpcklqdq m2, m2 + TRANSPOSE_4x4 - lea r3, [pw_ang_table + 19 * 16] - mova m4, [r3 - 6 * 16] ; [13] - mova m5, [r3 + 7 * 16] ; [26] - mova m6, [r3 - 12 * 16] ; [ 7] - mova m7, [r3 + 1 * 16] ; [20] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + STORE_4x4 + RET -cglobal intra_pred_ang4_7, 3,5,8 - xor r4d, r4d - inc r4d - cmp r3m, byte 29 - mov r3d, 9 - cmove r3d, r4d +cglobal intra_pred_ang4_7, 3,3,5 + movh m3, [r2 + 9] ;[8 7 6 5 4 3 2 1] + punpcklbw m3, m3 + psrldq m3, 1 + movh m0, m3 ;[x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + psrldq m3, 2 ;[x x x x x x x x 6 5 5 4 4 3 3 2] - movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] - punpcklbw m0, m0 - psrldq m0, 1 + pxor m1, m1 + punpcklbw m0, m1 + mova m4, m0 mova m2, m0 - psrldq m2, 2 ; [x x x x x x x x 6 5 5 4 4 3 3 2] - punpcklqdq m0, m0 - punpcklqdq m2, m2 - movhlps m2, m0 + pmaddwd m4, [pw_ang_table + 18 * 16] + pmaddwd m0, [pw_ang_table + 9 * 16] + packssdw m0, m4 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m3, m1 + pmaddwd m3, [pw_ang_table + 4 * 16] + pmaddwd m2, [pw_ang_table + 27 * 16] + packssdw m2, m3 + paddw m2, [pw_16] + psraw m2, 5 - lea r3, [pw_ang_table + 20 * 16] - mova m4, [r3 - 11 * 16] ; [ 9] - mova m5, [r3 - 2 * 16] ; [18] - mova m6, [r3 + 7 * 16] ; [27] - mova m7, [r3 - 16 * 16] ; [ 4] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + TRANSPOSE_4x4 -cglobal intra_pred_ang4_8, 3,5,8 - xor r4d, r4d - inc r4d - cmp r3m, byte 28 - mov r3d, 9 - cmove r3d, r4d + STORE_4x4 + 
RET - movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] +cglobal intra_pred_ang4_8, 3,3,5 + movh m0, [r2 + 9] ;[8 7 6 5 4 3 2 1] punpcklbw m0, m0 - psrldq m0, 1 - punpcklqdq m0, m0 + psrldq m0, 1 ;[x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + + pxor m1, m1 + punpcklbw m0, m1 mova m2, m0 + mova m3, m0 + mova m4, m2 + pmaddwd m3, [pw_ang_table + 10 * 16] + pmaddwd m0, [pw_ang_table + 5 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + pmaddwd m4, [pw_ang_table + 20 * 16] + pmaddwd m2, [pw_ang_table + 15 * 16] + packssdw m2, m4 + paddw m2, [pw_16] + psraw m2, 5 - lea r3, [pw_ang_table + 13 * 16] - mova m4, [r3 - 8 * 16] ; [ 5] - mova m5, [r3 - 3 * 16] ; [10] - mova m6, [r3 + 2 * 16] ; [15] - mova m7, [r3 + 7 * 16] ; [20] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + TRANSPOSE_4x4 -cglobal intra_pred_ang4_9, 3,5,8 - xor r4d, r4d - inc r4d - cmp r3m, byte 27 - mov r3d, 9 - cmove r3d, r4d + STORE_4x4 + RET - movh m0, [r2 + r3] ; [8 7 6 5 4 3 2 1] +cglobal intra_pred_ang4_9, 3,3,5 + movh m0, [r2 + 9] ;[8 7 6 5 4 3 2 1] punpcklbw m0, m0 - psrldq m0, 1 ; [x 8 7 6 5 4 3 2] - punpcklqdq m0, m0 + psrldq m0, 1 ;[x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + + pxor m1, m1 + punpcklbw m0, m1 mova m2, m0 + mova m3, m0 + mova m4, m2 + pmaddwd m3, [pw_ang_table + 4 * 16] + pmaddwd m0, [pw_ang_table + 2 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + pmaddwd m4, [pw_ang_table + 8 * 16] + pmaddwd m2, [pw_ang_table + 6 * 16] + packssdw m2, m4 + paddw m2, [pw_16] + psraw m2, 5 - lea r3, [pw_ang_table + 4 * 16] - mova m4, [r3 - 2 * 16] ; [ 2] - mova m5, [r3 - 0 * 16] ; [ 4] - mova m6, [r3 + 2 * 16] ; [ 6] - mova m7, [r3 + 4 * 16] ; [ 8] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + TRANSPOSE_4x4 + + STORE_4x4 + RET cglobal intra_pred_ang4_10, 3,5,4 - movd m0, [r2 + 9] ; [8 7 6 5 4 3 2 1] + movd m0, [r2 + 9] ;[8 7 6 5 4 3 2 1] punpcklbw m0, m0 punpcklwd m0, m0 pshufd m1, m0, 1 @@ -1804,7 +1689,7 @@ ; filter pxor m3, m3 punpcklbw m0, m3 - movh m1, [r2] ; [4 3 2 1 0] + movh m1, [r2] ;[4 3 2 1 0] punpcklbw m1, m3 pshuflw m2, m1, 0x00 psrldq m1, 2 @@ -1817,243 +1702,268 @@ movd [r0], m0 RET -cglobal intra_pred_ang4_26, 3,4,4 - movd m0, [r2 + 1] ; [8 7 6 5 4 3 2 1] - - ; store - movd [r0], m0 - movd [r0 + r1], m0 - movd [r0 + r1 * 2], m0 - lea r3, [r1 * 3] - movd [r0 + r3], m0 - - ; filter - cmp r4m, byte 0 - jz .quit +cglobal intra_pred_ang4_11, 3,3,5 + movd m1, [r2 + 9] ;[4 3 2 1] + movh m0, [r2 - 7] ;[A x x x x x x x] + punpcklbw m1, m1 ;[4 4 3 3 2 2 1 1] + punpcklqdq m0, m1 ;[4 4 3 3 2 2 1 1 A x x x x x x x]] + psrldq m0, 7 ;[x x x x x x x x 4 3 3 2 2 1 1 A] - pxor m3, m3 - punpcklbw m0, m3 - pshuflw m0, m0, 0x00 - movd m2, [r2] - punpcklbw m2, m3 - pshuflw m2, m2, 0x00 - movd m1, [r2 + 9] - punpcklbw m1, m3 - psubw m1, m2 - psraw m1, 1 - paddw m0, m1 - packuswb m0, m0 + pxor m1, m1 + punpcklbw m0, m1 + mova m2, m0 + mova m3, m0 + mova m4, m2 + pmaddwd m3, [pw_ang_table + 28 * 16] + pmaddwd m0, [pw_ang_table + 30 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + pmaddwd m4, [pw_ang_table + 24 * 16] + pmaddwd m2, [pw_ang_table + 26 * 16] + packssdw m2, m4 + paddw m2, [pw_16] + psraw m2, 5 - movd r2, m0 - mov [r0], r2b - shr r2, 8 - mov [r0 + r1], r2b - shr r2, 8 - mov [r0 + r1 * 2], r2b - shr r2, 8 - mov [r0 + r3], r2b + TRANSPOSE_4x4 -.quit: + STORE_4x4 RET -cglobal intra_pred_ang4_11, 3,5,8 - xor r4d, r4d - cmp r3m, byte 25 - mov r3d, 8 - cmove r3d, r4d - - movd m1, [r2 + r3 + 1] ;[4 3 2 1] + cglobal intra_pred_ang4_12, 3,3,5 + movd m1, 
[r2 + 9] ;[4 3 2 1] movh m0, [r2 - 7] ;[A x x x x x x x] punpcklbw m1, m1 ;[4 4 3 3 2 2 1 1] - punpcklqdq m0, m1 ;[4 4 3 3 2 2 1 1 A x x x x x x x]] + punpcklqdq m0, m1 ;[4 4 3 3 2 2 1 1 A x x x x x x x] psrldq m0, 7 ;[x x x x x x x x 4 3 3 2 2 1 1 A] - punpcklqdq m0, m0 + + pxor m1, m1 + punpcklbw m0, m1 mova m2, m0 + mova m3, m0 + mova m4, m2 + pmaddwd m3, [pw_ang_table + 22 * 16] + pmaddwd m0, [pw_ang_table + 27 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + pmaddwd m4, [pw_ang_table + 12 * 16] + pmaddwd m2, [pw_ang_table + 17 * 16] + packssdw m2, m4 + paddw m2, [pw_16] + psraw m2, 5 - lea r3, [pw_ang_table + 24 * 16] + TRANSPOSE_4x4 - mova m4, [r3 + 6 * 16] ; [24] - mova m5, [r3 + 4 * 16] ; [26] - mova m6, [r3 + 2 * 16] ; [28] - mova m7, [r3 + 0 * 16] ; [30] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + STORE_4x4 + RET -cglobal intra_pred_ang4_12, 3,5,8 - xor r4d, r4d - cmp r3m, byte 24 - mov r3d, 8 - cmove r3d, r4d + cglobal intra_pred_ang4_24, 3,3,5 + movd m1, [r2 + 1] ;[4 3 2 1] + movh m0, [r2 - 7] ;[A x x x x x x x] + punpcklbw m1, m1 ;[4 4 3 3 2 2 1 1] + punpcklqdq m0, m1 ;[4 4 3 3 2 2 1 1 A x x x x x x x] + psrldq m0, 7 ;[x x x x x x x x 4 3 3 2 2 1 1 A] - movd m1, [r2 + r3 + 1] - movh m0, [r2 - 7] - punpcklbw m1, m1 - punpcklqdq m0, m1 - psrldq m0, 7 - punpcklqdq m0, m0 + pxor m1, m1 + punpcklbw m0, m1 mova m2, m0 + mova m3, m0 + mova m4, m2 + pmaddwd m3, [pw_ang_table + 22 * 16] + pmaddwd m0, [pw_ang_table + 27 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + pmaddwd m4, [pw_ang_table + 12 * 16] + pmaddwd m2, [pw_ang_table + 17 * 16] + packssdw m2, m4 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 - lea r3, [pw_ang_table + 20 * 16] - mova m4, [r3 + 7 * 16] ; [27] - mova m5, [r3 + 2 * 16] ; [22] - mova m6, [r3 - 3 * 16] ; [17] - mova m7, [r3 - 8 * 16] ; [12] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) - -cglobal intra_pred_ang4_13, 3,5,8 - xor r4d, r4d - cmp r3m, byte 23 - mov r3d, 8 - jz .next - xchg r3d, r4d + STORE_4x4 + RET -.next: +cglobal intra_pred_ang4_13, 3,3,5 movd m1, [r2 - 1] ;[x x A x] - movd m2, [r2 + r4 + 1] ;[4 3 2 1] - movd m0, [r2 + r3 + 3] ;[x x B x] + movd m2, [r2 + 9] ;[4 3 2 1] + movd m0, [r2 + 3] ;[x x B x] punpcklbw m0, m1 ;[x x x x A B x x] punpckldq m0, m2 ;[4 3 2 1 A B x x] psrldq m0, 2 ;[x x 4 3 2 1 A B] - punpcklbw m0, m0 ;[x x x x 4 4 3 3 2 2 1 1 A A B B] - mova m1, m0 - psrldq m0, 3 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] - psrldq m1, 1 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] - movh m2, m0 - punpcklqdq m0, m0 - punpcklqdq m2, m1 + punpcklbw m0, m0 + psrldq m0, 1 + movh m3, m0 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] + psrldq m0, 2 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] - lea r3, [pw_ang_table + 21 * 16] - mova m4, [r3 + 2 * 16] ; [23] - mova m5, [r3 - 7 * 16] ; [14] - mova m6, [r3 - 16 * 16] ; [ 5] - mova m7, [r3 + 7 * 16] ; [28] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + pxor m1, m1 + punpcklbw m0, m1 + mova m4, m0 + mova m2, m0 + pmaddwd m4, [pw_ang_table + 14 * 16] + pmaddwd m0, [pw_ang_table + 23 * 16] + packssdw m0, m4 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m3, m1 + pmaddwd m3, [pw_ang_table + 28 * 16] + pmaddwd m2, [pw_ang_table + 5 * 16] + packssdw m2, m3 + paddw m2, [pw_16] + psraw m2, 5 -cglobal intra_pred_ang4_14, 3,5,8 - xor r4d, r4d - cmp r3m, byte 22 - mov r3d, 8 - jz .next - xchg r3d, r4d + TRANSPOSE_4x4 -.next: + STORE_4x4 + RET + +cglobal intra_pred_ang4_14, 3,3,4 movd m1, [r2 - 1] ;[x x A x] - movd 
m0, [r2 + r3 + 1] ;[x x B x] + movd m0, [r2 + 1] ;[x x B x] punpcklbw m0, m1 ;[A B x x] - movd m1, [r2 + r4 + 1] ;[4 3 2 1] + movd m1, [r2 + 9] ;[4 3 2 1] punpckldq m0, m1 ;[4 3 2 1 A B x x] psrldq m0, 2 ;[x x 4 3 2 1 A B] punpcklbw m0, m0 ;[x x x x 4 4 3 3 2 2 1 1 A A B B] - mova m2, m0 - psrldq m0, 3 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] - psrldq m2, 1 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] - punpcklqdq m0, m0 - punpcklqdq m2, m2 + psrldq m0, 1 + movh m2, m0 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] + psrldq m0, 2 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] - lea r3, [pw_ang_table + 19 * 16] - mova m4, [r3 + 0 * 16] ; [19] - mova m5, [r3 - 13 * 16] ; [ 6] - mova m6, [r3 + 6 * 16] ; [25] - mova m7, [r3 - 7 * 16] ; [12] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + pxor m1, m1 + punpcklbw m0, m1 + mova m3, m0 + pmaddwd m3, [pw_ang_table + 6 * 16] + pmaddwd m0, [pw_ang_table + 19 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m2, m1 + mova m3, m2 + pmaddwd m3, [pw_ang_table + 12 * 16] + pmaddwd m2, [pw_ang_table + 25 * 16] + packssdw m2, m3 + paddw m2, [pw_16] + psraw m2, 5 -cglobal intra_pred_ang4_15, 3,5,8 - xor r4d, r4d - cmp r3m, byte 21 - mov r3d, 8 - jz .next - xchg r3d, r4d + TRANSPOSE_4x4 -.next: + STORE_4x4 + RET + +cglobal intra_pred_ang4_15, 3,3,5 movd m0, [r2] ;[x x x A] - movd m1, [r2 + r3 + 2] ;[x x x B] + movd m1, [r2 + 2] ;[x x x B] punpcklbw m1, m0 ;[x x A B] - movd m0, [r2 + r3 + 3] ;[x x C x] + movd m0, [r2 + 3] ;[x x C x] punpcklwd m0, m1 ;[A B C x] - movd m1, [r2 + r4 + 1] ;[4 3 2 1] + movd m1, [r2 + 9] ;[4 3 2 1] punpckldq m0, m1 ;[4 3 2 1 A B C x] psrldq m0, 1 ;[x 4 3 2 1 A B C] punpcklbw m0, m0 ;[x x 4 4 3 3 2 2 1 1 A A B B C C] psrldq m0, 1 - movh m1, m0 - psrldq m0, 2 - movh m2, m0 + movh m1, m0 ;[x x x 4 4 3 3 2 2 1 1 A A B B C] psrldq m0, 2 - punpcklqdq m0, m2 - punpcklqdq m2, m1 + movh m2, m0 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] + psrldq m0, 2 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] - lea r3, [pw_ang_table + 23 * 16] - mova m4, [r3 - 8 * 16] ; [15] - mova m5, [r3 + 7 * 16] ; [30] - mova m6, [r3 - 10 * 16] ; [13] - mova m7, [r3 + 5 * 16] ; [28] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + pxor m4, m4 + punpcklbw m2, m4 + mova m3, m2 + pmaddwd m3, [pw_ang_table + 30 * 16] + punpcklbw m0, m4 + pmaddwd m0, [pw_ang_table + 15 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m1, m4 + pmaddwd m1, [pw_ang_table + 28 * 16] + pmaddwd m2, [pw_ang_table + 13 * 16] + packssdw m2, m1 + paddw m2, [pw_16] + psraw m2, 5 -cglobal intra_pred_ang4_16, 3,5,8 - xor r4d, r4d - cmp r3m, byte 20 - mov r3d, 8 - jz .next - xchg r3d, r4d + TRANSPOSE_4x4 -.next: + STORE_4x4 + RET + +cglobal intra_pred_ang4_16, 3,3,5 movd m2, [r2] ;[x x x A] - movd m1, [r2 + r3 + 2] ;[x x x B] + movd m1, [r2 + 2] ;[x x x B] punpcklbw m1, m2 ;[x x A B] - movh m0, [r2 + r3 + 2] ;[x x C x] + movd m0, [r2 + 2] ;[x x C x] punpcklwd m0, m1 ;[A B C x] - movd m1, [r2 + r4 + 1] ;[4 3 2 1] + movd m1, [r2 + 9] ;[4 3 2 1] punpckldq m0, m1 ;[4 3 2 1 A B C x] psrldq m0, 1 ;[x 4 3 2 1 A B C] punpcklbw m0, m0 ;[x x 4 4 3 3 2 2 1 1 A A B B C C] psrldq m0, 1 - movh m1, m0 + movh m1, m0 ;[x x x 4 4 3 3 2 2 1 1 A A B B C] psrldq m0, 2 - movh m2, m0 - psrldq m0, 2 - punpcklqdq m0, m2 - punpcklqdq m2, m1 + movh m2, m0 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] + psrldq m0, 2 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] - lea r3, [pw_ang_table + 19 * 16] - mova m4, [r3 - 8 * 16] ; [11] - mova m5, [r3 + 3 * 16] ; [22] - mova m6, [r3 - 18 * 16] ; [ 1] - mova 
m7, [r3 - 7 * 16] ; [12] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + pxor m4, m4 + punpcklbw m2, m4 + mova m3, m2 + pmaddwd m3, [pw_ang_table + 22 * 16] + punpcklbw m0, m4 + pmaddwd m0, [pw_ang_table + 11 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m1, m4 + pmaddwd m1, [pw_ang_table + 12 * 16] + pmaddwd m2, [pw_ang_table + 1 * 16] + packssdw m2, m1 + paddw m2, [pw_16] + psraw m2, 5 -cglobal intra_pred_ang4_17, 3,5,8 - xor r4d, r4d - cmp r3m, byte 19 - mov r3d, 8 - jz .next - xchg r3d, r4d + TRANSPOSE_4x4 -.next: + STORE_4x4 + RET + +cglobal intra_pred_ang4_17, 3,3,5 movd m2, [r2] ;[x x x A] - movd m3, [r2 + r3 + 1] ;[x x x B] - movd m4, [r2 + r3 + 2] ;[x x x C] - movd m0, [r2 + r3 + 4] ;[x x x D] + movd m3, [r2 + 1] ;[x x x B] + movd m4, [r2 + 2] ;[x x x C] + movd m0, [r2 + 4] ;[x x x D] punpcklbw m3, m2 ;[x x A B] punpcklbw m0, m4 ;[x x C D] punpcklwd m0, m3 ;[A B C D] - movd m1, [r2 + r4 + 1] ;[4 3 2 1] + movd m1, [r2 + 9] ;[4 3 2 1] punpckldq m0, m1 ;[4 3 2 1 A B C D] punpcklbw m0, m0 ;[4 4 3 3 2 2 1 1 A A B B C C D D] psrldq m0, 1 - movh m1, m0 - psrldq m0, 2 - movh m2, m0 - punpcklqdq m2, m1 + movh m1, m0 ;[x 4 4 3 3 2 2 1 1 A A B B C C D] psrldq m0, 2 - movh m1, m0 + movh m2, m0 ;[x x x 4 4 3 3 2 2 1 1 A A B B C] psrldq m0, 2 - punpcklqdq m0, m1 + movh m3, m0 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] + psrldq m0, 2 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] - lea r3, [pw_ang_table + 14 * 16] - mova m4, [r3 - 8 * 16] ; [ 6] - mova m5, [r3 - 2 * 16] ; [12] - mova m6, [r3 + 4 * 16] ; [18] - mova m7, [r3 + 10 * 16] ; [24] - jmp mangle(private_prefix %+ _ %+ intra_pred_ang4_3 %+ SUFFIX %+ .do_filter4x4) + pxor m4, m4 + punpcklbw m3, m4 + pmaddwd m3, [pw_ang_table + 12 * 16] + punpcklbw m0, m4 + pmaddwd m0, [pw_ang_table + 6 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m1, m4 + pmaddwd m1, [pw_ang_table + 24 * 16] + punpcklbw m2, m4 + pmaddwd m2, [pw_ang_table + 18 * 16] + packssdw m2, m1 + paddw m2, [pw_16] + psraw m2, 5 + + TRANSPOSE_4x4 + + STORE_4x4 + RET cglobal intra_pred_ang4_18, 3,4,2 mov r3d, [r2 + 8] @@ -2073,6 +1983,440 @@ movd [r0], m0 RET +cglobal intra_pred_ang4_19, 3,3,5 + movd m2, [r2] ;[x x x A] + movd m3, [r2 + 9] ;[x x x B] + movd m4, [r2 + 10] ;[x x x C] + movd m0, [r2 + 12] ;[x x x D] + punpcklbw m3, m2 ;[x x A B] + punpcklbw m0, m4 ;[x x C D] + punpcklwd m0, m3 ;[A B C D] + movd m1, [r2 + 1] ;[4 3 2 1] + punpckldq m0, m1 ;[4 3 2 1 A B C D] + punpcklbw m0, m0 ;[4 4 3 3 2 2 1 1 A A B B C C D D] + psrldq m0, 1 + movh m1, m0 ;[x 4 4 3 3 2 2 1 1 A A B B C C D] + psrldq m0, 2 + movh m2, m0 ;[x x x 4 4 3 3 2 2 1 1 A A B B C] + psrldq m0, 2 + movh m3, m0 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] + psrldq m0, 2 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] + + pxor m4, m4 + punpcklbw m3, m4 + pmaddwd m3, [pw_ang_table + 12 * 16] + punpcklbw m0, m4 + pmaddwd m0, [pw_ang_table + 6 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m1, m4 + pmaddwd m1, [pw_ang_table + 24 * 16] + punpcklbw m2, m4 + pmaddwd m2, [pw_ang_table + 18 * 16] + packssdw m2, m1 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_20, 3,3,5 + movd m2, [r2] ;[x x x A] + movd m1, [r2 + 10] ;[x x x B] + punpcklbw m1, m2 ;[x x A B] + movd m0, [r2 + 10] ;[x x C x] + punpcklwd m0, m1 ;[A B C x] + movd m1, [r2 + 1] ;[4 3 2 1] + punpckldq m0, m1 ;[4 3 2 1 A B C x] + psrldq m0, 1 ;[x 4 3 2 1 A B C] + punpcklbw m0, m0 ;[x x 4 4 3 3 2 2 1 1 A A B B C C] + psrldq m0, 1 + movh m1, m0 ;[x x x 4 4 3 3 
2 2 1 1 A A B B C] + psrldq m0, 2 + movh m2, m0 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] + psrldq m0, 2 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] + + pxor m4, m4 + punpcklbw m2, m4 + mova m3, m2 + pmaddwd m3, [pw_ang_table + 22 * 16] + punpcklbw m0, m4 + pmaddwd m0, [pw_ang_table + 11 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m1, m4 + pmaddwd m1, [pw_ang_table + 12 * 16] + pmaddwd m2, [pw_ang_table + 1 * 16] + packssdw m2, m1 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_21, 3,3,5 + movd m0, [r2] ;[x x x A] + movd m1, [r2 + 10] ;[x x x B] + punpcklbw m1, m0 ;[x x A B] + movd m0, [r2 + 11] ;[x x C x] + punpcklwd m0, m1 ;[A B C x] + movd m1, [r2 + 1] ;[4 3 2 1] + punpckldq m0, m1 ;[4 3 2 1 A B C x] + psrldq m0, 1 ;[x 4 3 2 1 A B C] + punpcklbw m0, m0 ;[x x 4 4 3 3 2 2 1 1 A A B B C C] + psrldq m0, 1 + movh m1, m0 ;[x x x 4 4 3 3 2 2 1 1 A A B B C] + psrldq m0, 2 + movh m2, m0 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] + psrldq m0, 2 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] + + pxor m4, m4 + punpcklbw m2, m4 + mova m3, m2 + pmaddwd m3, [pw_ang_table + 30 * 16] + punpcklbw m0, m4 + pmaddwd m0, [pw_ang_table + 15 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m1, m4 + pmaddwd m1, [pw_ang_table + 28 * 16] + pmaddwd m2, [pw_ang_table + 13 * 16] + packssdw m2, m1 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_22, 3,3,4 + movd m1, [r2 - 1] ;[x x A x] + movd m0, [r2 + 9] ;[x x B x] + punpcklbw m0, m1 ;[A B x x] + movd m1, [r2 + 1] ;[4 3 2 1] + punpckldq m0, m1 ;[4 3 2 1 A B x x] + psrldq m0, 2 ;[x x 4 3 2 1 A B] + punpcklbw m0, m0 ;[x x x x 4 4 3 3 2 2 1 1 A A B B] + psrldq m0, 1 + movh m2, m0 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] + psrldq m0, 2 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] + + pxor m1, m1 + punpcklbw m0, m1 + mova m3, m0 + pmaddwd m3, [pw_ang_table + 6 * 16] + pmaddwd m0, [pw_ang_table + 19 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m2, m1 + mova m3, m2 + pmaddwd m3, [pw_ang_table + 12 * 16] + pmaddwd m2, [pw_ang_table + 25 * 16] + packssdw m2, m3 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_23, 3,3,5 + movd m1, [r2 - 1] ;[x x A x] + movd m2, [r2 + 1] ;[4 3 2 1] + movd m0, [r2 + 11] ;[x x B x] + punpcklbw m0, m1 ;[x x x x A B x x] + punpckldq m0, m2 ;[4 3 2 1 A B x x] + psrldq m0, 2 ;[x x 4 3 2 1 A B] + punpcklbw m0, m0 + psrldq m0, 1 + mova m3, m0 ;[x x x x x 4 4 3 3 2 2 1 1 A A B] + psrldq m0, 2 ;[x x x x x x x 4 4 3 3 2 2 1 1 A] + + pxor m1, m1 + punpcklbw m0, m1 + mova m4, m0 + mova m2, m0 + pmaddwd m4, [pw_ang_table + 14 * 16] + pmaddwd m0, [pw_ang_table + 23 * 16] + packssdw m0, m4 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m3, m1 + pmaddwd m3, [pw_ang_table + 28 * 16] + pmaddwd m2, [pw_ang_table + 5 * 16] + packssdw m2, m3 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_25, 3,3,5 + movd m1, [r2 + 1] ;[4 3 2 1] + movh m0, [r2 - 7] ;[A x x x x x x x] + punpcklbw m1, m1 ;[4 4 3 3 2 2 1 1] + punpcklqdq m0, m1 ;[4 4 3 3 2 2 1 1 A x x x x x x x] + psrldq m0, 7 ;[x x x x x x x x 4 3 3 2 2 1 1 A] + + pxor m1, m1 + punpcklbw m0, m1 + mova m2, m0 + mova m3, m0 + mova m4, m2 + pmaddwd m3, [pw_ang_table + 28 * 16] + pmaddwd m0, [pw_ang_table + 30 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + pmaddwd m4, [pw_ang_table + 24 * 16] + pmaddwd m2, [pw_ang_table + 26 * 16] + packssdw m2, m4 + paddw m2, [pw_16] + psraw m2, 5 + packuswb 
m0, m2 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_26, 3,4,4 + movd m0, [r2 + 1] ;[8 7 6 5 4 3 2 1] + + ; store + movd [r0], m0 + movd [r0 + r1], m0 + movd [r0 + r1 * 2], m0 + lea r3, [r1 * 3] + movd [r0 + r3], m0 + + ; filter + cmp r4m, byte 0 + jz .quit + + pxor m3, m3 + punpcklbw m0, m3 + pshuflw m0, m0, 0x00 + movd m2, [r2] + punpcklbw m2, m3 + pshuflw m2, m2, 0x00 + movd m1, [r2 + 9] + punpcklbw m1, m3 + psubw m1, m2 + psraw m1, 1 + paddw m0, m1 + packuswb m0, m0 + + movd r2, m0 + mov [r0], r2b + shr r2, 8 + mov [r0 + r1], r2b + shr r2, 8 + mov [r0 + r1 * 2], r2b + shr r2, 8 + mov [r0 + r3], r2b + +.quit: + RET + +cglobal intra_pred_ang4_27, 3,3,5 + movh m0, [r2 + 1] ;[8 7 6 5 4 3 2 1] + punpcklbw m0, m0 + psrldq m0, 1 ;[x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + + pxor m1, m1 + punpcklbw m0, m1 + mova m2, m0 + mova m3, m0 + mova m4, m2 + pmaddwd m3, [pw_ang_table + 4 * 16] + pmaddwd m0, [pw_ang_table + 2 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + pmaddwd m4, [pw_ang_table + 8 * 16] + pmaddwd m2, [pw_ang_table + 6 * 16] + packssdw m2, m4 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_28, 3,3,5 + movh m0, [r2 + 1] ;[8 7 6 5 4 3 2 1] + punpcklbw m0, m0 + psrldq m0, 1 ;[x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + + pxor m1, m1 + punpcklbw m0, m1 + mova m2, m0 + mova m3, m0 + mova m4, m2 + pmaddwd m3, [pw_ang_table + 10 * 16] + pmaddwd m0, [pw_ang_table + 5 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + pmaddwd m4, [pw_ang_table + 20 * 16] + pmaddwd m2, [pw_ang_table + 15 * 16] + packssdw m2, m4 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_29, 3,3,5 + movh m3, [r2 + 1] ;[8 7 6 5 4 3 2 1] + punpcklbw m3, m3 + psrldq m3, 1 + movh m0, m3 ;[x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + psrldq m3, 2 ;[x x x x x x x x 6 5 5 4 4 3 3 2] + + pxor m1, m1 + punpcklbw m0, m1 + mova m4, m0 + mova m2, m0 + pmaddwd m4, [pw_ang_table + 18 * 16] + pmaddwd m0, [pw_ang_table + 9 * 16] + packssdw m0, m4 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m3, m1 + pmaddwd m3, [pw_ang_table + 4 * 16] + pmaddwd m2, [pw_ang_table + 27 * 16] + packssdw m2, m3 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_30, 3,3,4 + movh m2, [r2 + 1] ;[8 7 6 5 4 3 2 1] + punpcklbw m2, m2 + psrldq m2, 1 + movh m0, m2 ;[x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + psrldq m2, 2 ;[x x x 8 8 7 7 6 6 5 5 4 4 3 3 2] + + pxor m1, m1 + punpcklbw m0, m1 + mova m3, m0 + pmaddwd m3, [pw_ang_table + 26 * 16] + pmaddwd m0, [pw_ang_table + 13 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m2, m1 + mova m3, m2 + pmaddwd m3, [pw_ang_table + 20 * 16] + pmaddwd m2, [pw_ang_table + 7 * 16] + packssdw m2, m3 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_31, 3,3,5 + movh m3, [r2 + 1] ;[8 7 6 5 4 3 2 1] + punpcklbw m3, m3 + psrldq m3, 1 + mova m0, m3 ;[x 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + psrldq m3, 2 + mova m2, m3 ;[x x x x x x x x 6 5 5 4 4 3 3 2] + psrldq m3, 2 ;[x x x x x x x x 7 6 6 5 5 4 4 3] + + pxor m1, m1 + punpcklbw m2, m1 + mova m4, m2 + pmaddwd m4, [pw_ang_table + 2 * 16] + punpcklbw m0, m1 + pmaddwd m0, [pw_ang_table + 17 * 16] + packssdw m0, m4 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m3, m1 + pmaddwd m3, [pw_ang_table + 4 * 16] + pmaddwd m2, [pw_ang_table + 19 * 16] + packssdw m2, m3 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_32, 3,3,5 + movh m1, [r2 + 1] ;[8 
7 6 5 4 3 2 1] + punpcklbw m1, m1 + psrldq m1, 1 + movh m0, m1 ;[x x x x x x x x 5 4 4 3 3 2 2 1] + psrldq m1, 2 + movh m2, m1 ;[x x x x x x x x 6 5 5 4 4 3 3 2] + psrldq m1, 2 ;[x x x x x x x x 7 6 6 5 5 4 4 3] + + pxor m4, m4 + punpcklbw m2, m4 + mova m3, m2 + pmaddwd m3, [pw_ang_table + 10 * 16] + punpcklbw m0, m4 + pmaddwd m0, [pw_ang_table + 21 * 16] + packssdw m0, m3 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m1, m4 + pmaddwd m1, [pw_ang_table + 20 * 16] + pmaddwd m2, [pw_ang_table + 31 * 16] + packssdw m2, m1 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 + + STORE_4x4 + RET + +cglobal intra_pred_ang4_33, 3,3,5 + movh m3, [r2 + 1] ; [8 7 6 5 4 3 2 1] + punpcklbw m3, m3 + psrldq m3, 1 + movh m0, m3 ;[x x x x x x x x 5 4 4 3 3 2 2 1] + psrldq m3, 2 + movh m1, m3 ;[x x x x x x x x 6 5 5 4 4 3 3 2] + psrldq m3, 2 + movh m2, m3 ;[x x x x x x x x 7 6 6 5 5 4 4 3] + psrldq m3, 2 ;[x x x x x x x x 8 7 7 6 6 5 5 4] + + pxor m4, m4 + punpcklbw m1, m4 + pmaddwd m1, [pw_ang_table + 20 * 16] + punpcklbw m0, m4 + pmaddwd m0, [pw_ang_table + 26 * 16] + packssdw m0, m1 + paddw m0, [pw_16] + psraw m0, 5 + punpcklbw m3, m4 + pmaddwd m3, [pw_ang_table + 8 * 16] + punpcklbw m2, m4 + pmaddwd m2, [pw_ang_table + 14 * 16] + packssdw m2, m3 + paddw m2, [pw_16] + psraw m2, 5 + packuswb m0, m2 + + STORE_4x4 + RET + ;--------------------------------------------------------------------------------------------- ; void intra_pred_dc(pixel* dst, intptr_t dstStride, pixel *srcPix, int dirMode, int bFilter) ;--------------------------------------------------------------------------------------------- @@ -2474,9 +2818,8 @@ pshufd m3, m3, 0xAA pshufhw m4, m2, 0 ; bottomLeft pshufd m4, m4, 0xAA - pmullw m3, [multi_2Row] ; (x + 1) * topRight - pmullw m0, m1, [pw_planar4_1] ; (blkSize - 1 - y) * above[x] + pmullw m0, m1, [pw_3] ; (blkSize - 1 - y) * above[x] mova m6, [pw_planar4_0] paddw m3, [pw_4] paddw m3, m4 @@ -2533,10 +2876,9 @@ pshufb m4, m0 punpcklbw m3, m0 ; v_topRight punpcklbw m4, m0 ; v_bottomLeft - pmullw m3, [multiL] ; (x + 1) * topRight - pmullw m0, m1, [pw_planar8_1] ; (blkSize - 1 - y) * above[x] - mova m6, [pw_planar8_0] + pmullw m0, m1, [pw_7] ; (blkSize - 1 - y) * above[x] + mova m6, [pw_planar16_mul + mmsize] paddw m3, [pw_8] paddw m3, m4 paddw m3, m0 @@ -2585,11 +2927,10 @@ pshufb m6, m0 punpcklbw m3, m0 ; v_topRight punpcklbw m6, m0 ; v_bottomLeft - pmullw m4, m3, [multiH] ; (x + 1) * topRight pmullw m3, [multiL] ; (x + 1) * topRight - pmullw m1, m2, [pw_planar16_1] ; (blkSize - 1 - y) * above[x] - pmullw m5, m7, [pw_planar16_1] ; (blkSize - 1 - y) * above[x] + pmullw m1, m2, [pw_15] ; (blkSize - 1 - y) * above[x] + pmullw m5, m7, [pw_15] ; (blkSize - 1 - y) * above[x] paddw m4, [pw_16] paddw m3, [pw_16] paddw m4, m6 @@ -2620,8 +2961,8 @@ %endif %endif %endif - pmullw m0, m5, [pw_planar8_0] - pmullw m5, [pw_planar16_0] + pmullw m0, m5, [pw_planar16_mul + mmsize] + pmullw m5, [pw_planar16_mul] paddw m0, m4 paddw m5, m3 paddw m3, m6 @@ -2738,27 +3079,23 @@ paddw m1, [pw_32] paddw m2, [pw_32] paddw m3, [pw_32] - pmovzxbw m4, [r2 + 1] - pmullw m5, m4, [pw_planar32_1] + pmullw m5, m4, [pw_31] paddw m0, m5 psubw m5, m6, m4 mova m8, m5 - pmovzxbw m4, [r2 + 9] - pmullw m5, m4, [pw_planar32_1] + pmullw m5, m4, [pw_31] paddw m1, m5 psubw m5, m6, m4 mova m9, m5 - pmovzxbw m4, [r2 + 17] - pmullw m5, m4, [pw_planar32_1] + pmullw m5, m4, [pw_31] paddw m2, m5 psubw m5, m6, m4 mova m10, m5 - pmovzxbw m4, [r2 + 25] - pmullw m5, m4, [pw_planar32_1] + pmullw m5, m4, [pw_31] paddw m3, m5 psubw m5, m6, m4 mova 
m11, m5 @@ -2768,9 +3105,8 @@ movd m4, [r2] pshufb m4, m7 punpcklbw m4, m7 - - pmullw m5, m4, [pw_planar32_L] - pmullw m6, m4, [pw_planar32_H] + pmullw m5, m4, [pw_planar32_mul] + pmullw m6, m4, [pw_planar32_mul + mmsize] paddw m5, m0 paddw m6, m1 paddw m0, m8 @@ -2779,9 +3115,8 @@ psraw m6, 6 packuswb m5, m6 movu [r0], m5 - - pmullw m5, m4, [pw_planar16_0] - pmullw m4, [pw_planar8_0] + pmullw m5, m4, [pw_planar16_mul] + pmullw m4, [pw_planar16_mul + mmsize] paddw m5, m2 paddw m4, m3 paddw m2, m10 @@ -11337,6 +11672,304 @@ jnz .loop RET +;----------------------------------------------------------------------------------------- +; start of intra_pred_ang32 angular modes avx2 asm +;----------------------------------------------------------------------------------------- + +%if ARCH_X86_64 == 1 +INIT_YMM avx2 + +; register mapping : +; %1-%8 - output registers +; %9 - temp register +; %10 - for label naming +%macro TRANSPOSE_32x8_AVX2 10 + jnz .skip%10 + + ; transpose 8x32 to 32x8 and then store + punpcklbw m%9, m%1, m%2 + punpckhbw m%1, m%2 + punpcklbw m%2, m%3, m%4 + punpckhbw m%3, m%4 + punpcklbw m%4, m%5, m%6 + punpckhbw m%5, m%6 + punpcklbw m%6, m%7, m%8 + punpckhbw m%7, m%8 + + punpcklwd m%8, m%9, m%2 + punpckhwd m%9, m%2 + punpcklwd m%2, m%4, m%6 + punpckhwd m%4, m%6 + punpcklwd m%6, m%1, m%3 + punpckhwd m%1, m%3 + punpcklwd m%3, m%5, m%7 + punpckhwd m%5, m%7 + + punpckldq m%7, m%8, m%2 + punpckhdq m%8, m%2 + punpckldq m%2, m%6, m%3 + punpckhdq m%6, m%3 + punpckldq m%3, m%9, m%4 + punpckhdq m%9, m%4 + punpckldq m%4, m%1, m%5 + punpckhdq m%1, m%5 + + movq [r0 + r1 * 0], xm%7 + movhps [r0 + r1 * 1], xm%7 + movq [r0 + r1 * 2], xm%8 + movhps [r0 + r5 * 1], xm%8 + + lea r0, [r0 + r6] + + movq [r0 + r1 * 0], xm%3 + movhps [r0 + r1 * 1], xm%3 + movq [r0 + r1 * 2], xm%9 + movhps [r0 + r5 * 1], xm%9 + + lea r0, [r0 + r6] + + movq [r0 + r1 * 0], xm%2 + movhps [r0 + r1 * 1], xm%2 + movq [r0 + r1 * 2], xm%6 + movhps [r0 + r5 * 1], xm%6 + + lea r0, [r0 + r6] + + movq [r0 + r1 * 0], xm%4 + movhps [r0 + r1 * 1], xm%4 + movq [r0 + r1 * 2], xm%1 + movhps [r0 + r5 * 1], xm%1 + + lea r0, [r0 + r6] + + vpermq m%8, m%8, 00001110b + vpermq m%7, m%7, 00001110b + vpermq m%6, m%6, 00001110b + vpermq m%3, m%3, 00001110b + vpermq m%9, m%9, 00001110b + vpermq m%2, m%2, 00001110b + vpermq m%4, m%4, 00001110b + vpermq m%1, m%1, 00001110b + + movq [r0 + r1 * 0], xm%7 + movhps [r0 + r1 * 1], xm%7 + movq [r0 + r1 * 2], xm%8 + movhps [r0 + r5 * 1], xm%8 + + lea r0, [r0 + r6] + + movq [r0 + r1 * 0], xm%3 + movhps [r0 + r1 * 1], xm%3 + movq [r0 + r1 * 2], xm%9 + movhps [r0 + r5 * 1], xm%9 + + lea r0, [r0 + r6] + + movq [r0 + r1 * 0], xm%2 + movhps [r0 + r1 * 1], xm%2 + movq [r0 + r1 * 2], xm%6 + movhps [r0 + r5 * 1], xm%6 + + lea r0, [r0 + r6] + + movq [r0 + r1 * 0], xm%4 + movhps [r0 + r1 * 1], xm%4 + movq [r0 + r1 * 2], xm%1 + movhps [r0 + r5 * 1], xm%1 + + lea r0, [r4 + 8] + jmp .end%10 +.skip%10: + movu [r0 + r1 * 0], m%1 + movu [r0 + r1 * 1], m%2 + movu [r0 + r1 * 2], m%3 + movu [r0 + r5 * 1], m%4 + + lea r0, [r0 + r6] + + movu [r0 + r1 * 0], m%5 + movu [r0 + r1 * 1], m%6 + movu [r0 + r1 * 2], m%7 + movu [r0 + r5 * 1], m%8 + + lea r0, [r0 + r6] +.end%10: +%endmacro + +cglobal ang32_mode_3_33_row_0_15 + test r7d, r7d + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + movu m3, [r2 + 17] ; [48 47 46 45 44 43 42 41 40 39 38 37 36 35 
34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17] + movu m4, [r2 + 18] ; [49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + punpcklbw m3, m4 ; [41 40 40 39 39 38 38 37 37 36 36 35 35 34 34 33 25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17] + + pmaddubsw m4, m0, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m7 + pmaddubsw m1, m2, [r3 + 10 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + palignr m5, m2, m0, 2 + palignr m1, m3, m2, 2 + pmaddubsw m5, [r3 + 4 * 32] ; [20] + pmulhrsw m5, m7 + pmaddubsw m1, [r3 + 4 * 32] + pmulhrsw m1, m7 + packuswb m5, m1 + + palignr m6, m2, m0, 4 + palignr m1, m3, m2, 4 + pmaddubsw m6, [r3 - 2 * 32] ; [14] + pmulhrsw m6, m7 + pmaddubsw m1, [r3 - 2 * 32] + pmulhrsw m1, m7 + packuswb m6, m1 + + palignr m8, m2, m0, 6 + palignr m1, m3, m2, 6 + pmaddubsw m8, [r3 - 8 * 32] ; [8] + pmulhrsw m8, m7 + pmaddubsw m1, [r3 - 8 * 32] + pmulhrsw m1, m7 + packuswb m8, m1 + + palignr m10, m2, m0, 8 + palignr m11, m3, m2, 8 + pmaddubsw m9, m10, [r3 - 14 * 32] ; [2] + pmulhrsw m9, m7 + pmaddubsw m1, m11, [r3 - 14 * 32] + pmulhrsw m1, m7 + packuswb m9, m1 + + pmaddubsw m10, [r3 + 12 * 32] ; [28] + pmulhrsw m10, m7 + pmaddubsw m11, [r3 + 12 * 32] + pmulhrsw m11, m7 + packuswb m10, m11 + + palignr m11, m2, m0, 10 + palignr m1, m3, m2, 10 + pmaddubsw m11, [r3 + 6 * 32] ; [22] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 + 6 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + palignr m12, m2, m0, 12 + palignr m1, m3, m2, 12 + pmaddubsw m12, [r3] ; [16] + pmulhrsw m12, m7 + pmaddubsw m1, [r3] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 0 + + ; rows 8 to 15 + palignr m4, m2, m0, 14 + palignr m1, m3, m2, 14 + pmaddubsw m4, [r3 - 6 * 32] ; [10] + pmulhrsw m4, m7 + pmaddubsw m1, [r3 - 6 * 32] + pmulhrsw m1, m7 + packuswb m4, m1 + + pmaddubsw m5, m2, [r3 - 12 * 32] ; [4] + pmulhrsw m5, m7 + pmaddubsw m1, m3, [r3 - 12 * 32] + pmulhrsw m1, m7 + packuswb m5, m1 + + pmaddubsw m6, m2, [r3 + 14 * 32] ; [30] + pmulhrsw m6, m7 + pmaddubsw m1, m3, [r3 + 14 * 32] + pmulhrsw m1, m7 + packuswb m6, m1 + + movu m0, [r2 + 25] + movu m1, [r2 + 26] + punpcklbw m0, m1 + + palignr m8, m3, m2, 2 + palignr m1, m0, m3, 2 + pmaddubsw m8, [r3 + 8 * 32] ; [24] + pmulhrsw m8, m7 + pmaddubsw m1, [r3 + 8 * 32] + pmulhrsw m1, m7 + packuswb m8, m1 + + palignr m9, m3, m2, 4 + palignr m1, m0, m3, 4 + pmaddubsw m9, [r3 + 2 * 32] ; [18] + pmulhrsw m9, m7 + pmaddubsw m1, [r3 + 2 * 32] + pmulhrsw m1, m7 + packuswb m9, m1 + + palignr m10, m3, m2, 6 + palignr m1, m0, m3, 6 + pmaddubsw m10, [r3 - 4 * 32] ; [12] + pmulhrsw m10, m7 + pmaddubsw m1, [r3 - 4 * 32] + pmulhrsw m1, m7 + packuswb m10, m1 + + palignr m11, m3, m2, 8 + palignr m1, m0, m3, 8 + pmaddubsw m11, [r3 - 10 * 32] ; [6] + pmulhrsw m11, m7 + pmaddubsw m1, [r3 - 10 * 32] + pmulhrsw m1, m7 + packuswb m11, m1 + + movu m12, [r2 + 14] + + TRANSPOSE_32x8_AVX2 4, 5, 6, 8, 9, 10, 11, 12, 1, 8 + ret + +INIT_YMM avx2 +cglobal intra_pred_ang32_3, 3,8,13 + add r2, 64 + lea r3, [ang_table_avx2 + 32 * 16] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + mov r4, r0 + xor r7d, r7d + + call ang32_mode_3_33_row_0_15 + + add r4, 16 + mov r0, r4 + add r2, 13 + + call ang32_mode_3_33_row_0_15 + RET + +INIT_YMM avx2 +cglobal 
intra_pred_ang32_33, 3,8,13 + lea r3, [ang_table_avx2 + 32 * 16] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + xor r7d, r7d + inc r7d + + call ang32_mode_3_33_row_0_15 + + add r2, 13 + + call ang32_mode_3_33_row_0_15 + RET +%endif ; ARCH_X86_64 +;----------------------------------------------------------------------------------------- +; end of intra_pred_ang32 angular modes avx2 asm +;----------------------------------------------------------------------------------------- ;----------------------------------------------------------------------------------------- ; void intraPredAng8(pixel* dst, intptr_t dstStride, pixel* src, int dirMode, int bFilter) @@ -12809,546 +13442,672 @@ INTRA_PRED_TRANS_STORE_16x16 RET - -INIT_YMM avx2 -cglobal intra_pred_ang16_3, 3, 6, 12 - mova m11, [pw_1024] - lea r5, [intra_pred_shuff_0_8] - - movu xm9, [r2 + 1 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 9 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 8 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 16 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - lea r3, [3 * r1] - lea r4, [c_ang16_mode_3] - - INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 - - movu xm9, [r2 + 2 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 10 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 9 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 17 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 - - movu xm7, [r2 + 3 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 0 - - movu xm8, [r2 + 11 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 0 - - INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 - - movu xm9, [r2 + 4 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 12 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 10 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 18 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 - - movu xm9, [r2 + 5 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 13 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 11 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 19 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - add r4, 4 * mmsize - - INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 - - movu xm7, [r2 + 12 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 20 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 - - movu xm9, [r2 + 6 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 14 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 13 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 21 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 - - movu xm9, [r2 + 7 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 15 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 14 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 22 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 - - ; transpose and store - INTRA_PRED_TRANS_STORE_16x16 - RET - - -INIT_YMM avx2 -cglobal intra_pred_ang16_4, 3, 6, 12 - mova m11, [pw_1024] - lea r5, [intra_pred_shuff_0_8] - - movu xm9, [r2 + 1 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 9 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 6 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 14 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - lea r3, [3 * r1] - lea r4, [c_ang16_mode_4] 
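
Both the retired 16x16 kernels being removed in this hunk and the new shared ang16_mode_*/ang32_mode_* bodies evaluate the same HEVC angular interpolation; the bracketed numbers in comments such as "; [26]" are the per-row fractional weights ((y + 1) * intraPredAngle) & 31. The ang_table_avx2 entries hold the matching (32 - frac, frac) byte pairs for pmaddubsw, and pmulhrsw against pw_1024 performs the rounded shift, since round(x * 1024 / 32768) equals (x + 16) >> 5 for non-negative x. A minimal scalar sketch of the filter follows; the function names and the flat reference layout are illustrative, not x265 API:

#include <cstdint>

// Two-tap weighted filter realized by each pmaddubsw/pmulhrsw pair:
// dst = ((32 - frac) * ref[idx] + frac * ref[idx + 1] + 16) >> 5
static inline uint8_t angPel(const uint8_t* ref, int idx, int frac)
{
    return (uint8_t)(((32 - frac) * ref[idx] + frac * ref[idx + 1] + 16) >> 5);
}

// Scalar reference for a vertical-family angular mode; ref[0] is assumed
// to be the reference sample just past the block's top-left corner.
void angularRef(uint8_t* dst, intptr_t stride, const uint8_t* ref,
                int size, int intraPredAngle)
{
    for (int y = 0; y < size; y++)
    {
        int pos  = (y + 1) * intraPredAngle; // fixed point, 5 fractional bits
        int idx  = pos >> 5;
        int frac = pos & 31;                 // the "[26]", "[20]", ... weights
        for (int x = 0; x < size; x++)
            dst[y * stride + x] = angPel(ref, idx + x, frac);
    }
}

For mode 3 (angle 26) this produces the weight sequence 26, 20, 14, 8, 2, 28, 22, 16, ... visible in the comments above, and rows where frac is 0 degenerate to the plain copies (the movu/pmovzxbw loads from [r2 + 14]) in those kernels.
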
- - INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 - - movu xm9, [r2 + 2 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 10 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 7 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 15 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 - - movu xm7, [r2 + 8 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 16 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 - - movu xm7, [r2 + 3 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 0 - - movu xm8, [r2 + 11 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 0 - - INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 - - add r4, 4 * mmsize - - movu xm9, [r2 + 4 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 12 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 9 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 17 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 - - movu xm7, [r2 + 10 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 18 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 - - movu xm7, [r2 + 5 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 0 - - movu xm8, [r2 + 13 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 0 - - INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 - - movu xm9, [r2 + 6 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 14 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 11 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 19 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 - - ; transpose and store - INTRA_PRED_TRANS_STORE_16x16 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang16_5, 3, 6, 12 - mova m11, [pw_1024] - lea r5, [intra_pred_shuff_0_8] - - movu xm9, [r2 + 1 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 9 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 5 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 13 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - lea r3, [3 * r1] - lea r4, [c_ang16_mode_5] - - INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 - - movu xm9, [r2 + 2 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 10 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 6 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 14 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 - INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 - - movu xm9, [r2 + 3 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 11 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 7 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 15 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 - - add r4, 4 * mmsize - - INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 - - movu xm9, [r2 + 4 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 12 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 8 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 16 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 - INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 - - movu xm9, [r2 + 5 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 13 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 9 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 17 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 - - ; transpose and store - 
INTRA_PRED_TRANS_STORE_16x16 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang16_6, 3, 6, 12 - mova m11, [pw_1024] - lea r5, [intra_pred_shuff_0_8] - - movu xm9, [r2 + 1 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 9 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 4 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 12 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - lea r3, [3 * r1] - lea r4, [c_ang16_mode_6] - - INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 - - movu xm7, [r2 + 5 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 13 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 - - movu xm7, [r2 + 2 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 0 - - movu xm8, [r2 + 10 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 0 - - INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 - INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 - - add r4, 4 * mmsize - - movu xm9, [r2 + 3 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 11 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 6 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 14 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 - INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 - - movu xm7, [r2 + 7 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 15 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 - - movu xm7, [r2 + 4 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 0 - - movu xm8, [r2 + 12 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 0 - - INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 - - ; transpose and store - INTRA_PRED_TRANS_STORE_16x16 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang16_7, 3, 6, 12 - mova m11, [pw_1024] - lea r5, [intra_pred_shuff_0_8] - - movu xm9, [r2 + 1 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 9 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 3 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 11 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - lea r3, [3 * r1] - lea r4, [c_ang16_mode_7] - - INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 - INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 - - movu xm7, [r2 + 4 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 12 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 - - movu xm7, [r2 + 2 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 0 - - movu xm8, [r2 + 10 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 0 - - INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 - - add r4, 4 * mmsize - - INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 - INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 - - movu xm7, [r2 + 5 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 13 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 1 - - INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 - - movu xm7, [r2 + 3 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 0 - - movu xm8, [r2 + 11 + 32] - pshufb xm8, [r5] - vinserti128 m10, m10, xm8, 0 - - INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 - - ; transpose and store - INTRA_PRED_TRANS_STORE_16x16 - RET - +; transpose 8x32 to 16x16, used for intra_ang16x16 avx2 asm +%if ARCH_X86_64 == 1 INIT_YMM avx2 -cglobal intra_pred_ang16_8, 3, 6, 12 - mova m11, [pw_1024] - lea r5, [intra_pred_shuff_0_8] - - movu xm9, [r2 + 1 + 32] - pshufb xm9, [r5] - movu xm10, [r2 + 9 + 32] - pshufb xm10, [r5] - - movu xm7, [r2 + 2 + 32] - pshufb xm7, [r5] - vinserti128 m9, m9, xm7, 1 - - movu xm8, [r2 + 10 + 32] - pshufb xm8, [r5] - vinserti128 
m10, m10, xm8, 1 - - lea r3, [3 * r1] - lea r4, [c_ang16_mode_8] +%macro TRANSPOSE_STORE_8x32 12 + jc .skip - INTRA_PRED_ANG16_CAL_ROW m0, m1, 0 - INTRA_PRED_ANG16_CAL_ROW m1, m2, 1 - INTRA_PRED_ANG16_CAL_ROW m2, m3, 2 - INTRA_PRED_ANG16_CAL_ROW m3, m4, 3 - - add r4, 4 * mmsize - - movu xm4, [r2 + 3 + 32] - pshufb xm4, [r5] - vinserti128 m9, m9, xm4, 1 - - movu xm5, [r2 + 11 + 32] - pshufb xm5, [r5] - vinserti128 m10, m10, xm5, 1 - - INTRA_PRED_ANG16_CAL_ROW m4, m5, 0 - INTRA_PRED_ANG16_CAL_ROW m5, m6, 1 - - vinserti128 m9, m9, xm7, 0 - vinserti128 m10, m10, xm8, 0 + punpcklbw m%9, m%1, m%2 + punpckhbw m%1, m%2 + punpcklbw m%10, m%3, m%4 + punpckhbw m%3, m%4 + + punpcklwd m%11, m%9, m%10 + punpckhwd m%9, m%10 + punpcklwd m%10, m%1, m%3 + punpckhwd m%1, m%3 + + punpckldq m%12, m%11, m%10 + punpckhdq m%11, m%10 + punpckldq m%10, m%9, m%1 + punpckhdq m%9, m%1 + + punpcklbw m%1, m%5, m%6 + punpckhbw m%5, m%6 + punpcklbw m%2, m%7, m%8 + punpckhbw m%7, m%8 + + punpcklwd m%3, m%1, m%2 + punpckhwd m%1, m%2 + punpcklwd m%4, m%5, m%7 + punpckhwd m%5, m%7 + + punpckldq m%2, m%3, m%4 + punpckhdq m%3, m%4 + punpckldq m%4, m%1, m%5 + punpckhdq m%1, m%5 + + punpckldq m%5, m%12, m%2 + punpckhdq m%6, m%12, m%2 + punpckldq m%7, m%10, m%4 + punpckhdq m%8, m%10, m%4 + + punpckldq m%2, m%11, m%3 + punpckhdq m%11, m%11, m%3 + punpckldq m%4, m%9, m%1 + punpckhdq m%9, m%9, m%1 + + movu [r0 + r1 * 0], xm%5 + movu [r0 + r1 * 1], xm%6 + movu [r0 + r1 * 2], xm%2 + movu [r0 + r5 * 1], xm%11 + + lea r0, [r0 + r6] + + movu [r0 + r1 * 0], xm%7 + movu [r0 + r1 * 1], xm%8 + movu [r0 + r1 * 2], xm%4 + movu [r0 + r5 * 1], xm%9 + + lea r0, [r0 + r6] + + vextracti128 [r0 + r1 * 0], m%5, 1 + vextracti128 [r0 + r1 * 1], m%6, 1 + vextracti128 [r0 + r1 * 2], m%2, 1 + vextracti128 [r0 + r5 * 1], m%11, 1 + + lea r0, [r0 + r6] + + vextracti128 [r0 + r1 * 0], m%7, 1 + vextracti128 [r0 + r1 * 1], m%8, 1 + vextracti128 [r0 + r1 * 2], m%4, 1 + vextracti128 [r0 + r5 * 1], m%9, 1 + jmp .end + +.skip: + vpermq m%1, m%1, q3120 + vpermq m%2, m%2, q3120 + vpermq m%3, m%3, q3120 + vpermq m%4, m%4, q3120 + vpermq m%5, m%5, q3120 + vpermq m%6, m%6, q3120 + vpermq m%7, m%7, q3120 + vpermq m%8, m%8, q3120 + + movu [r0 + r1 * 0], xm%1 + movu [r0 + r1 * 1], xm%2 + movu [r0 + r1 * 2], xm%3 + movu [r0 + r5 * 1], xm%4 + + lea r0, [r0 + r6] + + movu [r0 + r1 * 0], xm%5 + movu [r0 + r1 * 1], xm%6 + movu [r0 + r1 * 2], xm%7 + movu [r0 + r5 * 1], xm%8 + + lea r0, [r0 + r6] + + vextracti128 [r0 + r1 * 0], m%1, 1 + vextracti128 [r0 + r1 * 1], m%2, 1 + vextracti128 [r0 + r1 * 2], m%3, 1 + vextracti128 [r0 + r5 * 1], m%4, 1 + + lea r0, [r0 + r6] + + vextracti128 [r0 + r1 * 0], m%5, 1 + vextracti128 [r0 + r1 * 1], m%6, 1 + vextracti128 [r0 + r1 * 2], m%7, 1 + vextracti128 [r0 + r5 * 1], m%8, 1 +.end: +%endmacro - INTRA_PRED_ANG16_CAL_ROW m6, m7, 2 - INTRA_PRED_ANG16_CAL_ROW m7, m8, 3 +cglobal ang16_mode_3_33 + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + vextracti128 xm1, m0, 1 + vperm2i128 m0, m0, m2, 0x20 ; [17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + vperm2i128 m2, m2, m1, 0x20 ; [25 24 24 23 23 22 22 21 21 
20 20 19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + + pmaddubsw m4, m0, [r3 + 10 * 32] ; [26] + pmulhrsw m4, m7 + + palignr m5, m2, m0, 2 + pmaddubsw m5, [r3 + 4 * 32] ; [20] + pmulhrsw m5, m7 + + palignr m6, m2, m0, 4 + palignr m8, m2, m0, 6 + pmaddubsw m6, [r3 - 2 * 32] ; [14] + pmulhrsw m6, m7 + pmaddubsw m8, [r3 - 8 * 32] ; [8] + pmulhrsw m8, m7 + + palignr m10, m2, m0, 8 + pmaddubsw m9, m10, [r3 - 14 * 32] ; [2] + pmulhrsw m9, m7 + pmaddubsw m10, [r3 + 12 * 32] ; [28] + pmulhrsw m10, m7 + + palignr m11, m2, m0, 10 + palignr m12, m2, m0, 12 + pmaddubsw m11, [r3 + 6 * 32] ; [22] + pmulhrsw m11, m7 + pmaddubsw m12, [r3] ; [16] + pmulhrsw m12, m7 + + ; rows 8 to 15 + palignr m3, m2, m0, 14 + palignr m1, m1, m2, 14 + pmaddubsw m3, [r3 - 6 * 32] ; [10] + pmulhrsw m3, m7 + packuswb m4, m3 + + pmaddubsw m3, m2, [r3 - 12 * 32] ; [4] + pmulhrsw m3, m7 + packuswb m5, m3 + + pmaddubsw m3, m2, [r3 + 14 * 32] ; [30] + pmulhrsw m3, m7 + packuswb m6, m3 + + movu xm0, [r2 + 25] + movu xm1, [r2 + 26] + punpcklbw m0, m1 + mova m1, m2 + vinserti128 m1, m1, xm0, 0 + vpermq m1, m1, 01001110b + + palignr m3, m1, m2, 2 + pmaddubsw m3, [r3 + 8 * 32] ; [24] + pmulhrsw m3, m7 + packuswb m8, m3 + + palignr m3, m1, m2, 4 + pmaddubsw m3, [r3 + 2 * 32] ; [18] + pmulhrsw m3, m7 + packuswb m9, m3 + + palignr m3, m1, m2, 6 + pmaddubsw m3, [r3 - 4 * 32] ; [12] + pmulhrsw m3, m7 + packuswb m10, m3 + + palignr m3, m1, m2, 8 + pmaddubsw m3, [r3 - 10 * 32] ; [6] + pmulhrsw m3, m7 + packuswb m11, m3 + + pmovzxbw m1, [r2 + 14] + packuswb m12, m1 + + TRANSPOSE_STORE_8x32 4, 5, 6, 8, 9, 10, 11, 12, 0, 1, 2, 3 + ret + +INIT_YMM avx2 +cglobal intra_pred_ang16_3, 3, 7, 13 + add r2, 32 + lea r3, [ang_table_avx2 + 16 * 32] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + clc + + call ang16_mode_3_33 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang16_33, 3, 7, 13 + lea r3, [ang_table_avx2 + 16 * 32] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + stc + + call ang16_mode_3_33 + RET + +cglobal ang16_mode_4_32 + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + vextracti128 xm1, m0, 1 + vperm2i128 m0, m0, m2, 0x20 ; [17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + vperm2i128 m2, m2, m1, 0x20 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + + pmaddubsw m4, m0, [r3 + 5 * 32] ; [21] + pmulhrsw m4, m7 + + palignr m1, m2, m0, 2 + pmaddubsw m5, m1, [r3 - 6 * 32] ; [10] + pmulhrsw m5, m7 + + palignr m8, m2, m0, 4 + pmaddubsw m6, m1, [r3 + 15 * 32] ; [31] + pmulhrsw m6, m7 + pmaddubsw m8, [r3 + 4 * 32] ; [20] + pmulhrsw m8, m7 + + palignr m10, m2, m0, 6 + pmaddubsw m9, m10, [r3 - 7 * 32] ; [9] + pmulhrsw m9, m7 + pmaddubsw m10, [r3 + 14 * 32] ; [30] + pmulhrsw m10, m7 + + palignr m11, m2, m0, 8 + palignr m1, m2, m0, 10 + pmaddubsw m11, [r3 + 3 * 32] ; [19] + pmulhrsw m11, m7 + pmaddubsw m12, m1, [r3 - 8 * 32] ; [8] + pmulhrsw m12, m7 + + ; rows 8 to 15 + pmaddubsw m3, m1, [r3 + 13 * 32] ; [29] + pmulhrsw m3, m7 + packuswb m4, m3 + + 
palignr m3, m2, m0, 12 + pmaddubsw m3, m3, [r3 + 2 * 32] ; [18] + pmulhrsw m3, m7 + packuswb m5, m3 + + palignr m1, m2, m0, 14 + pmaddubsw m3, m1, [r3 - 9 * 32] ; [7] + pmulhrsw m3, m7 + packuswb m6, m3 + + pmaddubsw m3, m1, [r3 + 12 * 32] ; [28] + pmulhrsw m3, m7 + packuswb m8, m3 + + palignr m3, m2, m0, 16 + pmaddubsw m3, [r3 + 1 * 32] ; [17] + pmulhrsw m3, m7 + packuswb m9, m3 + + movu xm0, [r2 + 25] + movu xm1, [r2 + 26] + punpcklbw m0, m1 + mova m1, m2 + vinserti128 m1, m1, xm0, 0 + vpermq m1, m1, 01001110b + + palignr m0, m1, m2, 2 + pmaddubsw m3, m0, [r3 - 10 * 32] ; [6] + pmulhrsw m3, m7 + packuswb m10, m3 + + pmaddubsw m3, m0, [r3 + 11 * 32] ; [27] + pmulhrsw m3, m7 + packuswb m11, m3 + + palignr m1, m1, m2, 4 + pmaddubsw m1, [r3] ; [16] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_STORE_8x32 4, 5, 6, 8, 9, 10, 11, 12, 0, 1, 2, 3 + ret + +INIT_YMM avx2 +cglobal intra_pred_ang16_4, 3, 7, 13 + add r2, 32 + lea r3, [ang_table_avx2 + 16 * 32] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + clc + + call ang16_mode_4_32 + RET + +INIT_YMM avx2 +cglobal intra_pred_ang16_32, 3, 7, 13 + lea r3, [ang_table_avx2 + 16 * 32] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + stc + + call ang16_mode_4_32 + RET + +cglobal ang16_mode_5 + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + vextracti128 xm1, m0, 1 + vperm2i128 m0, m0, m2, 0x20 ; [17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + vperm2i128 m2, m2, m1, 0x20 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + + pmaddubsw m4, m0, [r3 + 1 * 32] ; [17] + pmulhrsw m4, m7 + + palignr m1, m2, m0, 2 + pmaddubsw m5, m1, [r3 - 14 * 32] ; [2] + pmulhrsw m5, m7 + + palignr m3, m2, m0, 4 + pmaddubsw m6, m1, [r3 + 3 * 32] ; [19] + pmulhrsw m6, m7 + pmaddubsw m8, m3, [r3 - 12 * 32] ; [4] + pmulhrsw m8, m7 + pmaddubsw m9, m3, [r3 + 5 * 32] ; [21] + pmulhrsw m9, m7 + + palignr m3, m2, m0, 6 + pmaddubsw m10, m3, [r3 - 10 * 32] ; [6] + pmulhrsw m10, m7 + + palignr m1, m2, m0, 8 + pmaddubsw m11, m3, [r3 + 7 * 32] ; [23] + pmulhrsw m11, m7 + pmaddubsw m12, m1, [r3 - 8 * 32] ; [8] + pmulhrsw m12, m7 + + ; rows 8 to 15 + pmaddubsw m3, m1, [r3 + 9 * 32] ; [25] + pmulhrsw m3, m7 + packuswb m4, m3 + + palignr m1, m2, m0, 10 + pmaddubsw m3, m1, [r3 - 6 * 32] ; [10] + pmulhrsw m3, m7 + packuswb m5, m3 + + pmaddubsw m3, m1, [r3 + 11 * 32] ; [27] + pmulhrsw m3, m7 + packuswb m6, m3 + + palignr m1, m2, m0, 12 + pmaddubsw m3, m1, [r3 - 4 * 32] ; [12] + pmulhrsw m3, m7 + packuswb m8, m3 + + pmaddubsw m3, m1, [r3 + 13 * 32] ; [29] + pmulhrsw m3, m7 + packuswb m9, m3 + + palignr m1, m2, m0, 14 + pmaddubsw m3, m1, [r3 - 2 * 32] ; [14] + pmulhrsw m3, m7 + packuswb m10, m3 + + pmaddubsw m3, m1, [r3 + 15 * 32] ; [31] + pmulhrsw m3, m7 + packuswb m11, m3 + + palignr m1, m2, m0, 16 + pmaddubsw m1, [r3] ; [16] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_STORE_8x32 4, 5, 6, 8, 9, 10, 11, 12, 0, 1, 2, 3 + ret + +INIT_YMM avx2 +cglobal intra_pred_ang16_5, 3, 7, 13 + add r2, 32 + lea 
r3, [ang_table_avx2 + 16 * 32] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + clc + + call ang16_mode_5 + RET + +cglobal ang16_mode_6 + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + vextracti128 xm1, m0, 1 + vperm2i128 m0, m0, m2, 0x20 ; [17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + vperm2i128 m2, m2, m1, 0x20 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + + pmaddubsw m4, m0, [r3 - 3 * 32] ; [13] + pmulhrsw m4, m7 + + pmaddubsw m5, m0, [r3 + 10 * 32] ; [26] + pmulhrsw m5, m7 + + palignr m3, m2, m0, 2 + pmaddubsw m6, m3, [r3 - 9 * 32] ; [7] + pmulhrsw m6, m7 + pmaddubsw m8, m3, [r3 + 4 * 32] ; [20] + pmulhrsw m8, m7 + + palignr m3, m2, m0, 4 + pmaddubsw m9, m3, [r3 - 15 * 32] ; [1] + pmulhrsw m9, m7 + + pmaddubsw m10, m3, [r3 - 2 * 32] ; [14] + pmulhrsw m10, m7 + + pmaddubsw m11, m3, [r3 + 11 * 32] ; [27] + pmulhrsw m11, m7 + + palignr m1, m2, m0, 6 + pmaddubsw m12, m1, [r3 - 8 * 32] ; [8] + pmulhrsw m12, m7 + + ; rows 8 to 15 + pmaddubsw m3, m1, [r3 + 5 * 32] ; [21] + pmulhrsw m3, m7 + packuswb m4, m3 + + palignr m1, m2, m0, 8 + pmaddubsw m3, m1, [r3 - 14 * 32] ; [2] + pmulhrsw m3, m7 + packuswb m5, m3 + + pmaddubsw m3, m1, [r3 - 1 * 32] ; [15] + pmulhrsw m3, m7 + packuswb m6, m3 + + pmaddubsw m3, m1, [r3 + 12 * 32] ; [28] + pmulhrsw m3, m7 + packuswb m8, m3 + + palignr m1, m2, m0, 10 + pmaddubsw m3, m1, [r3 - 7 * 32] ; [9] + pmulhrsw m3, m7 + packuswb m9, m3 + + pmaddubsw m3, m1, [r3 + 6 * 32] ; [22] + pmulhrsw m3, m7 + packuswb m10, m3 + + palignr m1, m2, m0, 12 + pmaddubsw m3, m1, [r3 - 13 * 32] ; [3] + pmulhrsw m3, m7 + packuswb m11, m3 + + pmaddubsw m1, [r3] ; [16] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_STORE_8x32 4, 5, 6, 8, 9, 10, 11, 12, 0, 1, 2, 3 + ret + +INIT_YMM avx2 +cglobal intra_pred_ang16_6, 3, 7, 13 + add r2, 32 + lea r3, [ang_table_avx2 + 16 * 32] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + clc + + call ang16_mode_6 + RET + +cglobal ang16_mode_7 + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + vextracti128 xm1, m0, 1 + vperm2i128 m0, m0, m2, 0x20 ; [17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + vperm2i128 m2, m2, m1, 0x20 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + + pmaddubsw m4, m0, [r3 - 7 * 32] ; [9] + pmulhrsw m4, m7 + + pmaddubsw m5, m0, [r3 + 2 * 32] ; [18] + pmulhrsw m5, m7 + pmaddubsw m6, m0, [r3 + 11 * 32] ; [27] + pmulhrsw m6, m7 + + palignr m3, m2, m0, 2 + pmaddubsw m8, m3, [r3 - 12 * 32] ; [4] + pmulhrsw m8, m7 + + pmaddubsw m9, m3, [r3 - 
3 * 32] ; [13] + pmulhrsw m9, m7 + + pmaddubsw m10, m3, [r3 + 6 * 32] ; [22] + pmulhrsw m10, m7 + + pmaddubsw m11, m3, [r3 + 15 * 32] ; [31] + pmulhrsw m11, m7 + + palignr m1, m2, m0, 4 + pmaddubsw m12, m1, [r3 - 8 * 32] ; [8] + pmulhrsw m12, m7 + + ; rows 8 to 15 + pmaddubsw m3, m1, [r3 + 1 * 32] ; [17] + pmulhrsw m3, m7 + packuswb m4, m3 + + pmaddubsw m3, m1, [r3 + 10 * 32] ; [26] + pmulhrsw m3, m7 + packuswb m5, m3 + + palignr m1, m2, m0, 6 + pmaddubsw m3, m1, [r3 - 13 * 32] ; [3] + pmulhrsw m3, m7 + packuswb m6, m3 + + pmaddubsw m3, m1, [r3 - 4 * 32] ; [12] + pmulhrsw m3, m7 + packuswb m8, m3 + + pmaddubsw m3, m1, [r3 + 5 * 32] ; [21] + pmulhrsw m3, m7 + packuswb m9, m3 + + pmaddubsw m3, m1, [r3 + 14 * 32] ; [30] + pmulhrsw m3, m7 + packuswb m10, m3 + + palignr m1, m2, m0, 8 + pmaddubsw m3, m1, [r3 - 9 * 32] ; [7] + pmulhrsw m3, m7 + packuswb m11, m3 + + pmaddubsw m1, [r3] ; [16] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_STORE_8x32 4, 5, 6, 8, 9, 10, 11, 12, 0, 1, 2, 3 + ret + +INIT_YMM avx2 +cglobal intra_pred_ang16_7, 3, 7, 13 + add r2, 32 + lea r3, [ang_table_avx2 + 16 * 32] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + clc + + call ang16_mode_7 + RET + +cglobal ang16_mode_8 + ; rows 0 to 7 + movu m0, [r2 + 1] ; [32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1] + movu m1, [r2 + 2] ; [33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2] + + punpckhbw m2, m0, m1 ; [33 32 32 31 31 30 30 29 29 28 28 27 27 26 26 25 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + punpcklbw m0, m1 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + vextracti128 xm1, m0, 1 + vperm2i128 m0, m0, m2, 0x20 ; [17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9 9 8 8 7 7 6 6 5 5 4 4 3 3 2 2 1] + vperm2i128 m2, m2, m1, 0x20 ; [25 24 24 23 23 22 22 21 21 20 20 19 19 18 18 17 17 16 16 15 15 14 14 13 13 12 12 11 11 10 10 9] + + pmaddubsw m4, m0, [r3 - 11 * 32] ; [5] + pmulhrsw m4, m7 + pmaddubsw m5, m0, [r3 - 6 * 32] ; [10] + pmulhrsw m5, m7 + + pmaddubsw m6, m0, [r3 - 1 * 32] ; [15] + pmulhrsw m6, m7 + pmaddubsw m8, m0, [r3 + 4 * 32] ; [20] + pmulhrsw m8, m7 + pmaddubsw m9, m0, [r3 + 9 * 32] ; [25] + pmulhrsw m9, m7 + + pmaddubsw m10, m0, [r3 + 14 * 32] ; [30] + pmulhrsw m10, m7 + palignr m1, m2, m0, 2 + pmaddubsw m11, m1, [r3 - 13 * 32] ; [3] + pmulhrsw m11, m7 + pmaddubsw m12, m1, [r3 - 8 * 32] ; [8] + pmulhrsw m12, m7 + + ; rows 8 to 15 + pmaddubsw m3, m1, [r3 - 3 * 32] ; [13] + pmulhrsw m3, m7 + packuswb m4, m3 + pmaddubsw m3, m1, [r3 + 2 * 32] ; [18] + pmulhrsw m3, m7 + packuswb m5, m3 + + pmaddubsw m3, m1, [r3 + 7 * 32] ; [23] + pmulhrsw m3, m7 + packuswb m6, m3 + pmaddubsw m3, m1, [r3 + 12 * 32] ; [28] + pmulhrsw m3, m7 + packuswb m8, m3 + + palignr m1, m2, m0, 4 + pmaddubsw m3, m1, [r3 - 15 * 32] ; [1] + pmulhrsw m3, m7 + packuswb m9, m3 + pmaddubsw m3, m1, [r3 - 10 * 32] ; [6] + pmulhrsw m3, m7 + packuswb m10, m3 + + pmaddubsw m3, m1, [r3 - 5 * 32] ; [11] + pmulhrsw m3, m7 + packuswb m11, m3 + pmaddubsw m1, [r3] ; [16] + pmulhrsw m1, m7 + packuswb m12, m1 + + TRANSPOSE_STORE_8x32 4, 5, 6, 8, 9, 10, 11, 12, 0, 1, 2, 3 + ret + +INIT_YMM avx2 +cglobal intra_pred_ang16_8, 3, 7, 13 + add r2, 32 + lea r3, [ang_table_avx2 + 16 * 32] + lea r5, [r1 * 3] ; r5 -> 3 * stride + lea r6, [r1 * 4] ; r6 -> 4 * stride + mova m7, [pw_1024] + clc - ; transpose and store - INTRA_PRED_TRANS_STORE_16x16 + call ang16_mode_8 RET +%endif ; ARCH_X86_64 INIT_YMM avx2 
cglobal intra_pred_ang16_9, 3, 6, 12 @@ -13588,120 +14347,6 @@ RET INIT_YMM avx2 -cglobal intra_pred_ang16_32, 3, 5, 6 - mova m0, [pw_1024] - mova m5, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang16_mode_32] - - INTRA_PRED_ANG16_MC2 1 - INTRA_PRED_ANG16_MC3 r0, 0 - - INTRA_PRED_ANG16_MC2 2 - INTRA_PRED_ANG16_MC0 r0 + r1, r0 + 2 * r1, 1 - - INTRA_PRED_ANG16_MC2 3 - INTRA_PRED_ANG16_MC3 r0 + r3, 2 - - INTRA_PRED_ANG16_MC2 4 - lea r0, [r0 + 4 * r1] - INTRA_PRED_ANG16_MC0 r0, r0 + r1, 3 - - INTRA_PRED_ANG16_MC2 5 - - add r4, 4 * mmsize - INTRA_PRED_ANG16_MC3 r0 + 2 * r1, 0 - - INTRA_PRED_ANG16_MC2 6 - INTRA_PRED_ANG16_MC0 r0 + r3, r0 + 4 * r1, 1 - INTRA_PRED_ANG16_MC2 7 - - lea r0, [r0 + 4 * r1] - INTRA_PRED_ANG16_MC3 r0 + r1, 2 - INTRA_PRED_ANG16_MC2 8 - INTRA_PRED_ANG16_MC0 r0 + 2 * r1, r0 + r3, 3 - INTRA_PRED_ANG16_MC2 9 - - lea r0, [r0 + 4 * r1] - add r4, 4 * mmsize - - INTRA_PRED_ANG16_MC3 r0, 0 - INTRA_PRED_ANG16_MC2 10 - INTRA_PRED_ANG16_MC0 r0 + r1, r0 + 2 * r1, 1 - INTRA_PRED_ANG16_MC2 11 - INTRA_PRED_ANG16_MC3 r0 + r3, 2 - RET - -INIT_YMM avx2 -cglobal intra_pred_ang16_33, 3, 5, 6 - mova m0, [pw_1024] - mova m5, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang16_mode_33] - - INTRA_PRED_ANG16_MC2 1 - vperm2i128 m1, m1, m2, 00100000b - pmaddubsw m3, m1, [r4 + 0 * mmsize] - pmulhrsw m3, m0 - - INTRA_PRED_ANG16_MC2 2 - INTRA_PRED_ANG16_MC4 r0, r0 + r1, 1 - - INTRA_PRED_ANG16_MC2 3 - vperm2i128 m1, m1, m2, 00100000b - pmaddubsw m3, m1, [r4 + 2 * mmsize] - pmulhrsw m3, m0 - - INTRA_PRED_ANG16_MC2 4 - INTRA_PRED_ANG16_MC4 r0 + 2 * r1, r0 + r3, 3 - - lea r0, [r0 + 4 * r1] - add r4, 4 * mmsize - - INTRA_PRED_ANG16_MC2 5 - INTRA_PRED_ANG16_MC0 r0, r0 + r1, 0 - - INTRA_PRED_ANG16_MC2 6 - vperm2i128 m1, m1, m2, 00100000b - pmaddubsw m3, m1, [r4 + 1 * mmsize] - pmulhrsw m3, m0 - - INTRA_PRED_ANG16_MC2 7 - INTRA_PRED_ANG16_MC4 r0 + 2 * r1, r0 + r3, 2 - - INTRA_PRED_ANG16_MC2 8 - lea r0, [r0 + 4 * r1] - INTRA_PRED_ANG16_MC3 r0, 3 - - INTRA_PRED_ANG16_MC2 9 - add r4, 4 * mmsize - INTRA_PRED_ANG16_MC0 r0 + r1, r0 + 2 * r1, 0 - - INTRA_PRED_ANG16_MC2 10 - vperm2i128 m1, m1, m2, 00100000b - pmaddubsw m3, m1, [r4 + 1 * mmsize] - pmulhrsw m3, m0 - - INTRA_PRED_ANG16_MC2 11 - INTRA_PRED_ANG16_MC4 r0 + r3, r0 + 4 * r1, 2 - - lea r0, [r0 + 4 * r1] - - INTRA_PRED_ANG16_MC2 12 - vperm2i128 m1, m1, m2, 00100000b - pmaddubsw m3, m1, [r4 + 3 * mmsize] - pmulhrsw m3, m0 - - INTRA_PRED_ANG16_MC2 13 - INTRA_PRED_ANG16_MC4 r0 + r1, r0 + 2 * r1, 4 - - add r4, 4 * mmsize - - INTRA_PRED_ANG16_MC2 14 - INTRA_PRED_ANG16_MC3 r0 + r3, 1 - RET - -INIT_YMM avx2 cglobal intra_pred_ang16_24, 3, 5, 6 mova m0, [pw_1024] mova m5, [intra_pred_shuff_0_8] @@ -15677,547 +16322,6 @@ RET INIT_YMM avx2 -cglobal intra_pred_ang32_33, 3, 5, 11 - mova m0, [pw_1024] - mova m1, [intra_pred_shuff_0_8] - lea r3, [3 * r1] - lea r4, [c_ang32_mode_33] - - ;row [0] - vbroadcasti128 m2, [r2 + 1] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 9] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 17] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 25] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 0 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 0 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row [1] - vbroadcasti128 m2, [r2 + 2] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 10] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 18] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 26] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw 
m6, [r4 + 1 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 1 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r1], m6 - - ;row [2] - vbroadcasti128 m2, [r2 + 3] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 11] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 19] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 27] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 2 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 2 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [3] - vbroadcasti128 m2, [r2 + 4] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 12] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 20] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 28] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 3 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 3 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row [4, 5] - vbroadcasti128 m2, [r2 + 5] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 13] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 21] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 29] - pshufb m5, m1 - - add r4, 4 * mmsize - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [6] - vbroadcasti128 m2, [r2 + 6] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 14] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 22] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 30] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 1 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 1 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [7] - vbroadcasti128 m2, [r2 + 7] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 15] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 23] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 31] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 2 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 2 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row [8] - vbroadcasti128 m2, [r2 + 8] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 16] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 24] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 32] - pshufb m5, m1 - - lea r0, [r0 + 4 * r1] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 3 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 3 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row [9, 10] - vbroadcasti128 m2, [r2 + 9] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 17] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 25] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 33] - pshufb m5, m1 - - add r4, 4 * mmsize - mova m10, [r4 + 0 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [11] - vbroadcasti128 m2, [r2 + 10] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 18] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 26] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 34] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 1 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 1 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - 
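
The per-row ang32 routine being deleted here is superseded by ang32_mode_3_33_row_0_15 earlier in this hunk, which produces eight rows per pass and serves both modes of the mirrored pair from one body: mode 3 predicts from the left neighbours (hence the add r2, 64 in its wrapper) and must transpose the result, while mode 33 predicts from the row above and stores directly. The choice is carried in r7d and tested by the jnz in TRANSPOSE_32x8_AVX2, or in the carry flag (clc/stc plus jc) for the 16x16 TRANSPOSE_STORE_8x32. A minimal sketch of that shared-kernel idea, with illustrative names only:

#include <cstdint>

// One prediction body serves a mirrored HEVC mode pair (k and 36 - k):
// compute the vertical-mode block once, then store it as-is (mode 33)
// or store its transpose (mode 3).
static void storePredicted(uint8_t* dst, intptr_t stride,
                           const uint8_t* blk, int size, bool transposed)
{
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            dst[y * stride + x] = transposed ? blk[x * size + y]
                                             : blk[y * size + x];
}

Folding each pair into one callable is what lets this rewrite delete long per-mode listings such as this one.
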
- ;row [12] - vbroadcasti128 m2, [r2 + 11] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 19] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 27] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 35] - pshufb m5, m1 - - lea r0, [r0 + 4 * r1] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 2 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 2 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row [13] - vbroadcasti128 m2, [r2 + 12] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 20] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 28] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 36] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 3 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 3 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r1], m6 - - ;row [14] - vbroadcasti128 m2, [r2 + 13] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 21] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 29] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 37] - pshufb m5, m1 - - add r4, 4 * mmsize - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 0 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 0 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [15, 16] - vbroadcasti128 m2, [r2 + 14] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 22] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 30] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 38] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r3], m7 - lea r0, [r0 + 4 * r1] - movu [r0], m6 - - ;row [17] - vbroadcasti128 m2, [r2 + 15] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 23] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 31] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 39] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 2 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 2 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r1], m6 - - ;row [18] - vbroadcasti128 m2, [r2 + 16] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 24] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 32] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 40] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 3 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 3 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [19] - vbroadcasti128 m2, [r2 + 17] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 25] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 33] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 41] - pshufb m5, m1 - - add r4, 4 * mmsize - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 0 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 0 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row [20, 21] - vbroadcasti128 m2, [r2 + 18] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 26] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 34] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 42] - pshufb m5, m1 - - lea r0, [r0 + 4 * r1] - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0], m7 - movu [r0 + r1], m6 - - ;row [22] - vbroadcasti128 m2, [r2 + 19] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 27] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 35] - pshufb m4, m1 - 
vbroadcasti128 m5, [r2 + 43] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 2 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 2 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [23] - vbroadcasti128 m2, [r2 + 20] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 28] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 36] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 44] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 3 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 3 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row [24] - vbroadcasti128 m2, [r2 + 21] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 29] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 37] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 45] - pshufb m5, m1 - - add r4, 4 * mmsize - lea r0, [r0 + 4 * r1] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 0 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 0 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row [25, 26] - vbroadcasti128 m2, [r2 + 22] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 30] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 38] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 46] - pshufb m5, m1 - - mova m10, [r4 + 1 * mmsize] - - INTRA_PRED_ANG32_CAL_ROW - movu [r0 + r1], m7 - movu [r0 + 2 * r1], m6 - - ;row [27] - vbroadcasti128 m2, [r2 + 23] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 31] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 39] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 47] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 2 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 2 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - - ;row [28] - vbroadcasti128 m2, [r2 + 24] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 32] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 40] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 48] - pshufb m5, m1 - - lea r0, [r0 + 4 * r1] - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 3 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 3 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0], m6 - - ;row [29] - vbroadcasti128 m2, [r2 + 25] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 33] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 41] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 49] - pshufb m5, m1 - - add r4, 4 * mmsize - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 0 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 0 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r1], m6 - - ;row [30] - vbroadcasti128 m2, [r2 + 26] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 34] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 42] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 50] - pshufb m5, m1 - - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 1 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 1 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + 2 * r1], m6 - - ;row [31] - vbroadcasti128 m2, [r2 + 27] - pshufb m2, m1 - vbroadcasti128 m3, [r2 + 35] - pshufb m3, m1 - vbroadcasti128 m4, [r2 + 43] - pshufb m4, m1 - vbroadcasti128 m5, [r2 + 51] - pshufb m5, m1 
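
At the end of this file diff, the new intra_filter_4x4/8x8/16x16/32x32 routines implement HEVC's [1 2 1]/4 reference-sample smoothing: each psllw/paddw/psrlw group computes (ref[i - 1] + 2 * ref[i] + ref[i + 1] + 2) >> 2, and the r2b/r3b byte saves keep the last top and left samples unfiltered. A scalar sketch follows; the flat layout (corner sample, then 2N top samples, then 2N left samples) is inferred from the offsets those routines read and should be treated as an assumption:

#include <cstdint>

// Scalar reference for the [1 2 1]/4 smoothing that intra_filter_NxN
// vectorizes. refs[0] is the corner, refs[1 .. 2N] the top row,
// refs[2N + 1 .. 4N] the left column.
void intraFilterRef(const uint8_t* refs, uint8_t* filtered, int size)
{
    const int topLast  = 2 * size;
    const int leftLast = 4 * size;

    // the corner blends its two perpendicular neighbours
    filtered[0] = (uint8_t)((refs[1] + 2 * refs[0] + refs[topLast + 1] + 2) >> 2);

    for (int i = 1; i < leftLast; i++)
    {
        // the first left sample neighbours the corner, not the last top sample
        int prev = (i == topLast + 1) ? 0 : i - 1;
        filtered[i] = (uint8_t)((refs[prev] + 2 * refs[i] + refs[i + 1] + 2) >> 2);
    }

    // edge endpoints pass through, matching the topLast/LeftLast save/restore
    filtered[topLast]  = refs[topLast];
    filtered[leftLast] = refs[leftLast];
}
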
- - vperm2i128 m6, m2, m3, 00100000b - pmaddubsw m6, [r4 + 2 * mmsize] - pmulhrsw m6, m0 - vperm2i128 m7, m4, m5, 00100000b - pmaddubsw m7, [r4 + 2 * mmsize] - pmulhrsw m7, m0 - packuswb m6, m7 - vpermq m6, m6, 11011000b - movu [r0 + r3], m6 - RET - -INIT_YMM avx2 cglobal intra_pred_ang32_25, 3, 5, 11 mova m0, [pw_1024] mova m1, [intra_pred_shuff_0_8] @@ -17826,3 +17930,443 @@ INTRA_PRED_STORE_4x4 RET + +;----------------------------------------------------------------------------------- +; void intra_filter_NxN(const pixel* references, pixel* filtered) +;----------------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal intra_filter_4x4, 2,4,5 + mov r2b, byte [r0 + 8] ; topLast + mov r3b, byte [r0 + 16] ; LeftLast + + ; filtering top + pmovzxbw m0, [r0 + 0] + pmovzxbw m1, [r0 + 8] + pmovzxbw m2, [r0 + 16] + + pshufb m4, m0, [intra_filter4_shuf0] ; [6 5 4 3 2 1 0 1] samples[i - 1] + palignr m3, m1, m0, 4 + pshufb m3, [intra_filter4_shuf1] ; [8 7 6 5 4 3 2 9] samples[i + 1] + + psllw m0, 1 + paddw m4, m3 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + + ; filtering left + palignr m4, m1, m1, 14 ; [14 13 12 11 10 9 8 15] samples[i - 1] + pinsrb m4, [r0], 2 ; [14 13 12 11 10 9 0 15] samples[i + 1] + palignr m3, m2, m1, 4 + pshufb m3, [intra_filter4_shuf1] + + psllw m1, 1 + paddw m4, m3 + paddw m1, m4 + paddw m1, [pw_2] + psrlw m1, 2 + packuswb m0, m1 + + movu [r1], m0 + mov [r1 + 8], r2b ; topLast + mov [r1 + 16], r3b ; LeftLast + RET + +INIT_XMM sse4 +cglobal intra_filter_8x8, 2,4,6 + mov r2b, byte [r0 + 16] ; topLast + mov r3b, byte [r0 + 32] ; LeftLast + + ; filtering top + pmovzxbw m0, [r0 + 0] + pmovzxbw m1, [r0 + 8] + pmovzxbw m2, [r0 + 16] + + pshufb m4, m0, [intra_filter4_shuf0] ; [6 5 4 3 2 1 0 1] samples[i - 1] + palignr m5, m1, m0, 2 + pinsrb m5, [r0 + 17], 0 ; [8 7 6 5 4 3 2 9] samples[i + 1] + + palignr m3, m1, m0, 14 + psllw m0, 1 + paddw m4, m5 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + + palignr m4, m2, m1, 2 + psllw m1, 1 + paddw m4, m3 + paddw m1, m4 + paddw m1, [pw_2] + psrlw m1, 2 + + packuswb m0, m1 + movu [r1], m0 + + ; filtering left + pmovzxbw m1, [r0 + 24] + pmovzxbw m0, [r0 + 32] + + palignr m4, m2, m2, 14 + pinsrb m4, [r0], 2 + palignr m5, m1, m2, 2 + + palignr m3, m1, m2, 14 + palignr m0, m1, 2 + + psllw m2, 1 + paddw m4, m5 + paddw m2, m4 + paddw m2, [pw_2] + psrlw m2, 2 + + psllw m1, 1 + paddw m0, m3 + paddw m1, m0 + paddw m1, [pw_2] + psrlw m1, 2 + + packuswb m2, m1 + movu [r1 + 16], m2 + mov [r1 + 16], r2b ; topLast + mov [r1 + 32], r3b ; LeftLast + RET + +INIT_XMM sse4 +cglobal intra_filter_16x16, 2,4,6 + mov r2b, byte [r0 + 32] ; topLast + mov r3b, byte [r0 + 64] ; LeftLast + + ; filtering top + pmovzxbw m0, [r0 + 0] + pmovzxbw m1, [r0 + 8] + pmovzxbw m2, [r0 + 16] + + pshufb m4, m0, [intra_filter4_shuf0] ; [6 5 4 3 2 1 0 1] samples[i - 1] + palignr m5, m1, m0, 2 + pinsrb m5, [r0 + 33], 0 ; [8 7 6 5 4 3 2 9] samples[i + 1] + + palignr m3, m1, m0, 14 + psllw m0, 1 + paddw m4, m5 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + + palignr m4, m2, m1, 2 + psllw m5, m1, 1 + paddw m4, m3 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + packuswb m0, m5 + movu [r1], m0 + + pmovzxbw m0, [r0 + 24] + pmovzxbw m5, [r0 + 32] + + palignr m3, m2, m1, 14 + palignr m4, m0, m2, 2 + + psllw m1, m2, 1 + paddw m3, m4 + paddw m1, m3 + paddw m1, [pw_2] + psrlw m1, 2 + + palignr m3, m0, m2, 14 + palignr m4, m5, m0, 2 + + psllw m0, 1 + paddw m4, m3 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + packuswb m1, m0 + movu [r1 + 16], 
m1 + + ; filtering left + pmovzxbw m1, [r0 + 40] + pmovzxbw m2, [r0 + 48] + + palignr m4, m5, m5, 14 + pinsrb m4, [r0], 2 + palignr m0, m1, m5, 2 + + psllw m3, m5, 1 + paddw m4, m0 + paddw m3, m4 + paddw m3, [pw_2] + psrlw m3, 2 + + palignr m0, m1, m5, 14 + palignr m4, m2, m1, 2 + + psllw m5, m1, 1 + paddw m4, m0 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + packuswb m3, m5 + movu [r1 + 32], m3 + + pmovzxbw m5, [r0 + 56] + pmovzxbw m0, [r0 + 64] + + palignr m3, m2, m1, 14 + palignr m4, m5, m2, 2 + + psllw m1, m2, 1 + paddw m3, m4 + paddw m1, m3 + paddw m1, [pw_2] + psrlw m1, 2 + + palignr m3, m5, m2, 14 + palignr m4, m0, m5, 2 + + psllw m5, 1 + paddw m4, m3 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + packuswb m1, m5 + movu [r1 + 48], m1 + + mov [r1 + 32], r2b ; topLast + mov [r1 + 64], r3b ; LeftLast + RET + +INIT_XMM sse4 +cglobal intra_filter_32x32, 2,4,6 + mov r2b, byte [r0 + 64] ; topLast + mov r3b, byte [r0 + 128] ; LeftLast + + ; filtering top + ; 0 to 15 + pmovzxbw m0, [r0 + 0] + pmovzxbw m1, [r0 + 8] + pmovzxbw m2, [r0 + 16] + + pshufb m4, m0, [intra_filter4_shuf0] ; [6 5 4 3 2 1 0 1] samples[i - 1] + palignr m5, m1, m0, 2 + pinsrb m5, [r0 + 65], 0 ; [8 7 6 5 4 3 2 9] samples[i + 1] + + palignr m3, m1, m0, 14 + psllw m0, 1 + paddw m4, m5 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + + palignr m4, m2, m1, 2 + psllw m5, m1, 1 + paddw m4, m3 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + packuswb m0, m5 + movu [r1], m0 + + ; 16 to 31 + pmovzxbw m0, [r0 + 24] + pmovzxbw m5, [r0 + 32] + + palignr m3, m2, m1, 14 + palignr m4, m0, m2, 2 + + psllw m1, m2, 1 + paddw m3, m4 + paddw m1, m3 + paddw m1, [pw_2] + psrlw m1, 2 + + palignr m3, m0, m2, 14 + palignr m4, m5, m0, 2 + + psllw m2, m0, 1 + paddw m4, m3 + paddw m2, m4 + paddw m2, [pw_2] + psrlw m2, 2 + packuswb m1, m2 + movu [r1 + 16], m1 + + ; 32 to 47 + pmovzxbw m1, [r0 + 40] + pmovzxbw m2, [r0 + 48] + + palignr m3, m5, m0, 14 + palignr m4, m1, m5, 2 + + psllw m0, m5, 1 + paddw m3, m4 + paddw m0, m3 + paddw m0, [pw_2] + psrlw m0, 2 + + palignr m3, m1, m5, 14 + palignr m4, m2, m1, 2 + + psllw m5, m1, 1 + paddw m4, m3 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + packuswb m0, m5 + movu [r1 + 32], m0 + + ; 48 to 63 + pmovzxbw m0, [r0 + 56] + pmovzxbw m5, [r0 + 64] + + palignr m3, m2, m1, 14 + palignr m4, m0, m2, 2 + + psllw m1, m2, 1 + paddw m3, m4 + paddw m1, m3 + paddw m1, [pw_2] + psrlw m1, 2 + + palignr m3, m0, m2, 14 + palignr m4, m5, m0, 2 + + psllw m0, 1 + paddw m4, m3 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + packuswb m1, m0 + movu [r1 + 48], m1 + + ; filtering left + ; 64 to 79 + pmovzxbw m1, [r0 + 72] + pmovzxbw m2, [r0 + 80] + + palignr m4, m5, m5, 14 + pinsrb m4, [r0], 2 + palignr m0, m1, m5, 2 + + psllw m3, m5, 1 + paddw m4, m0 + paddw m3, m4 + paddw m3, [pw_2] + psrlw m3, 2 + + palignr m0, m1, m5, 14 + palignr m4, m2, m1, 2 + + psllw m5, m1, 1 + paddw m4, m0 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + packuswb m3, m5 + movu [r1 + 64], m3 + + ; 80 to 95 + pmovzxbw m5, [r0 + 88] + pmovzxbw m0, [r0 + 96] + + palignr m3, m2, m1, 14 + palignr m4, m5, m2, 2 + + psllw m1, m2, 1 + paddw m3, m4 + paddw m1, m3 + paddw m1, [pw_2] + psrlw m1, 2 + + palignr m3, m5, m2, 14 + palignr m4, m0, m5, 2 + + psllw m2, m5, 1 + paddw m4, m3 + paddw m2, m4 + paddw m2, [pw_2] + psrlw m2, 2 + packuswb m1, m2 + movu [r1 + 80], m1 + + ; 96 to 111 + pmovzxbw m1, [r0 + 104] + pmovzxbw m2, [r0 + 112] + + palignr m3, m0, m5, 14 + palignr m4, m1, m0, 2 + + psllw m5, m0, 1 + paddw m3, m4 + paddw m5, m3 + paddw m5, [pw_2] + 
psrlw m5, 2 + + palignr m3, m1, m0, 14 + palignr m4, m2, m1, 2 + + psllw m0, m1, 1 + paddw m4, m3 + paddw m0, m4 + paddw m0, [pw_2] + psrlw m0, 2 + packuswb m5, m0 + movu [r1 + 96], m5 + + ; 112 to 127 + pmovzxbw m5, [r0 + 120] + pmovzxbw m0, [r0 + 128] + + palignr m3, m2, m1, 14 + palignr m4, m5, m2, 2 + + psllw m1, m2, 1 + paddw m3, m4 + paddw m1, m3 + paddw m1, [pw_2] + psrlw m1, 2 + + palignr m3, m5, m2, 14 + palignr m4, m0, m5, 2 + + psllw m5, 1 + paddw m4, m3 + paddw m5, m4 + paddw m5, [pw_2] + psrlw m5, 2 + packuswb m1, m5 + movu [r1 + 112], m1 + + mov [r1 + 64], r2b ; topLast + mov [r1 + 128], r3b ; LeftLast + RET + +INIT_YMM avx2 +cglobal intra_filter_4x4, 2,4,4 + mov r2b, byte [r0 + 8] ; topLast + mov r3b, byte [r0 + 16] ; LeftLast + + ; filtering top + pmovzxbw m0, [r0] + vpbroadcastw m2, xm0 + pmovzxbw m1, [r0 + 8] + + palignr m3, m0, m2, 14 ; [6 5 4 3 2 1 0 0] [14 13 12 11 10 9 8 0] + pshufb m3, [intra_filter4_shuf2] ; [6 5 4 3 2 1 0 1] [14 13 12 11 10 9 0 9] samples[i - 1] + palignr m1, m0, 4 ; [9 8 7 6 5 4 3 2] + palignr m1, m1, 14 ; [9 8 7 6 5 4 3 2] + + psllw m0, 1 + paddw m3, m1 + paddw m0, m3 + paddw m0, [pw_2] + psrlw m0, 2 + + packuswb m0, m0 + vpermq m0, m0, 10001000b + + movu [r1], xm0 + mov [r1 + 8], r2b ; topLast + mov [r1 + 16], r3b ; LeftLast + RET
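The intra_filter_NxN kernels added above all implement the same HEVC reference-sample smoothing: each sample in the packed reference array becomes a [1 2 1]/4 weighted average of itself and its two neighbours, except the last top and last left samples (saved into r2b/r3b on entry and re-stored after the vector writes), which pass through unfiltered. The corner sample at index 0 is smoothed against top[1] and left[1], and also serves as the missing predecessor of left[1]. A plain-C sketch of the operation the SSE4/AVX2 code vectorises (the function name, "pixel" typedef, and loop structure are illustrative, not the x265 API):

    typedef unsigned char pixel;    /* 8-bit build, matching intrapred8.asm */

    /* refs layout matches the asm: refs[0] = corner, refs[1..2N] = top row,
     * refs[2N+1..4N] = left column; 4N+1 samples for an NxN block. */
    static void intra_filter_ref(const pixel *refs, pixel *filtered, int N)
    {
        int topLast  = 2 * N;               /* byte [r0 + 2N] in the asm */
        int leftLast = 4 * N;               /* byte [r0 + 4N] in the asm */
        int i;

        /* corner: its neighbours are top[1] and left[1] */
        filtered[0] = (pixel)((refs[1] + 2 * refs[0] + refs[topLast + 1] + 2) >> 2);

        for (i = 1; i < topLast; i++)       /* top row */
            filtered[i] = (pixel)((refs[i - 1] + 2 * refs[i] + refs[i + 1] + 2) >> 2);

        /* left[1]: the corner, not topLast, is its predecessor
         * (the pinsrb m4, [r0], 2 in the asm) */
        filtered[topLast + 1] = (pixel)((refs[0] + 2 * refs[topLast + 1]
                                         + refs[topLast + 2] + 2) >> 2);
        for (i = topLast + 2; i < leftLast; i++)   /* rest of the left column */
            filtered[i] = (pixel)((refs[i - 1] + 2 * refs[i] + refs[i + 1] + 2) >> 2);

        filtered[topLast]  = refs[topLast];    /* re-stored from r2b */
        filtered[leftLast] = refs[leftLast];   /* re-stored from r3b */
    }

The SIMD versions filter the whole array, topLast/leftLast included, and then simply overwrite those two bytes with the saved originals, which is cheaper than masking them out of the vector stores.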
View file
x265_1.7.tar.gz/source/common/x86/ipfilter16.asm -> x265_1.8.tar.gz/source/common/x86/ipfilter16.asm
Changed
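Most of the hunk below is a Main12-readiness change: the hard-coded round-and-shift constants that were only valid for 10-bit input (psrad m4, 2 together with the pd_n32768 offset) are replaced by INTERP_SHIFT_PS/INTERP_OFFSET_PS-style macros selected on BIT_DEPTH at assembly time, the chroma coefficient tables are widened from 4-word to 8-word repeats for the new AVX2 kernels, and SSE2/SSE3 fallbacks plus AVX2 paths are added for the luma/chroma interpolation and filterPixelToShort primitives. In C terms, the two output flavours the macros parameterise reduce to the following (a sketch based only on the %define block visible in the hunk; the function names are illustrative, not x265 API):

    #include <stdint.h>

    #define BIT_DEPTH 10                    /* this file builds for 10 or 12 */

    #if BIT_DEPTH == 10
    #define SHIFT_PS  2                     /* INTERP_SHIFT_PS */
    #define OFFSET_PS (-32768)              /* pd_n32768       */
    #elif BIT_DEPTH == 12
    #define SHIFT_PS  4                     /* INTERP_SHIFT_PS */
    #define OFFSET_PS (-131072)             /* pd_n131072      */
    #else
    #error "unsupported bit depth"          /* mirrors the %error guard */
    #endif

    /* "sum" is the raw filter accumulation; the HEVC taps sum to 64. */
    static uint16_t interp_pp(int32_t sum)  /* pixel -> pixel path */
    {
        int32_t max = (1 << BIT_DEPTH) - 1; /* pw_pixel_max */
        int32_t v = (sum + 32) >> 6;        /* INTERP_OFFSET_PP, INTERP_SHIFT_PP */
        return (uint16_t)(v < 0 ? 0 : v > max ? max : v);
    }

    static int16_t interp_ps(int32_t sum)   /* pixel -> 16-bit intermediate path */
    {
        return (int16_t)((sum + OFFSET_PS) >> SHIFT_PS);
    }

Both PS offsets reduce to the same -8192 bias after the shift (-32768 >> 2 = -131072 >> 4 = -8192, i.e. pw_2000, the same constant filterPixelToShort subtracts), which keeps the 14-bit intermediate inside signed 16-bit range; the SP constants for the second pass (pd_524800 = 64*8192 + 512 with shift 10 for 10-bit, pd_524416 = 64*8192 + 128 with shift 8 for 12-bit) fold the removal of that bias, scaled by the 64-sum taps, into the final downshift.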
@@ -3,6 +3,7 @@ ;* ;* Authors: Nabajit Deka <nabajit@multicorewareinc.com> ;* Murugan Vairavel <murugan@multicorewareinc.com> +;* Min Chen <chenm003@163.com> ;* ;* This program is free software; you can redistribute it and/or modify ;* it under the terms of the GNU General Public License as published by @@ -25,10 +26,28 @@ %include "x86inc.asm" %include "x86util.asm" + +%define INTERP_OFFSET_PP pd_32 +%define INTERP_SHIFT_PP 6 + +%if BIT_DEPTH == 10 + %define INTERP_SHIFT_PS 2 + %define INTERP_OFFSET_PS pd_n32768 + %define INTERP_SHIFT_SP 10 + %define INTERP_OFFSET_SP pd_524800 +%elif BIT_DEPTH == 12 + %define INTERP_SHIFT_PS 4 + %define INTERP_OFFSET_PS pd_n131072 + %define INTERP_SHIFT_SP 8 + %define INTERP_OFFSET_SP pd_524416 +%else + %error Unsupport bit depth! +%endif + + SECTION_RODATA 32 -tab_c_32: times 4 dd 32 -tab_c_n32768: times 4 dd -32768 +tab_c_32: times 8 dd 32 tab_c_524800: times 4 dd 524800 tab_c_n8192: times 8 dw -8192 pd_524800: times 8 dd 524800 @@ -44,29 +63,53 @@ dw -2, 16, 54, -4 dw -2, 10, 58, -2 -tab_ChromaCoeffV: times 4 dw 0, 64 - times 4 dw 0, 0 +const tab_ChromaCoeffV, times 8 dw 0, 64 + times 8 dw 0, 0 + + times 8 dw -2, 58 + times 8 dw 10, -2 + + times 8 dw -4, 54 + times 8 dw 16, -2 + + times 8 dw -6, 46 + times 8 dw 28, -4 + + times 8 dw -4, 36 + times 8 dw 36, -4 - times 4 dw -2, 58 - times 4 dw 10, -2 + times 8 dw -4, 28 + times 8 dw 46, -6 - times 4 dw -4, 54 - times 4 dw 16, -2 + times 8 dw -2, 16 + times 8 dw 54, -4 - times 4 dw -6, 46 - times 4 dw 28, -4 + times 8 dw -2, 10 + times 8 dw 58, -2 - times 4 dw -4, 36 - times 4 dw 36, -4 +tab_ChromaCoeffVer: times 8 dw 0, 64 + times 8 dw 0, 0 - times 4 dw -4, 28 - times 4 dw 46, -6 + times 8 dw -2, 58 + times 8 dw 10, -2 - times 4 dw -2, 16 - times 4 dw 54, -4 + times 8 dw -4, 54 + times 8 dw 16, -2 - times 4 dw -2, 10 - times 4 dw 58, -2 + times 8 dw -6, 46 + times 8 dw 28, -4 + + times 8 dw -4, 36 + times 8 dw 36, -4 + + times 8 dw -4, 28 + times 8 dw 46, -6 + + times 8 dw -2, 16 + times 8 dw 54, -4 + + times 8 dw -2, 10 + times 8 dw 58, -2 tab_LumaCoeff: dw 0, 0, 0, 64, 0, 0, 0, 0 dw -1, 4, -10, 58, 17, -5, 1, 0 @@ -115,11 +158,1024 @@ const interp8_hps_shuf, dd 0, 4, 1, 5, 2, 6, 3, 7 +const interp8_hpp_shuf, db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 + db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13 + +const pb_shuf, db 0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9 + db 4, 5, 6, 7, 8, 9, 10, 11, 6, 7, 8, 9, 10, 11, 12, 13 + + SECTION .text +cextern pd_8 cextern pd_32 cextern pw_pixel_max +cextern pd_524416 cextern pd_n32768 +cextern pd_n131072 cextern pw_2000 +cextern idct8_shuf2 + +%macro FILTER_LUMA_HOR_4_sse2 1 + movu m4, [r0 + %1] ; m4 = src[0-7] + movu m5, [r0 + %1 + 2] ; m5 = src[1-8] + pmaddwd m4, m0 + pmaddwd m5, m0 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m4, m4, q3120 + pshufd m5, m5, q3120 + punpcklqdq m4, m5 + + movu m5, [r0 + %1 + 4] ; m5 = src[2-9] + movu m3, [r0 + %1 + 6] ; m3 = src[3-10] + pmaddwd m5, m0 + pmaddwd m3, m0 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m5, m5, q3120 + pshufd m3, m3, q3120 + punpcklqdq m5, m3 + + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m4, m4, q3120 + pshufd m5, m5, q3120 + punpcklqdq m4, m5 + paddd m4, m1 +%endmacro + +%macro FILTER_LUMA_HOR_8_sse2 1 + movu m4, [r0 + %1] ; m4 = src[0-7] + movu m5, [r0 + %1 + 2] ; m5 = src[1-8] + pmaddwd m4, m0 + pmaddwd m5, m0 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m2, m5, 
q2301 + paddd m5, m2 + pshufd m4, m4, q3120 + pshufd m5, m5, q3120 + punpcklqdq m4, m5 + + movu m5, [r0 + %1 + 4] ; m5 = src[2-9] + movu m3, [r0 + %1 + 6] ; m3 = src[3-10] + pmaddwd m5, m0 + pmaddwd m3, m0 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m5, m5, q3120 + pshufd m3, m3, q3120 + punpcklqdq m5, m3 + + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m4, m4, q3120 + pshufd m5, m5, q3120 + punpcklqdq m4, m5 + paddd m4, m1 + + movu m5, [r0 + %1 + 8] ; m5 = src[4-11] + movu m6, [r0 + %1 + 10] ; m6 = src[5-12] + pmaddwd m5, m0 + pmaddwd m6, m0 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m6, q2301 + paddd m6, m2 + pshufd m5, m5, q3120 + pshufd m6, m6, q3120 + punpcklqdq m5, m6 + + movu m6, [r0 + %1 + 12] ; m6 = src[6-13] + movu m3, [r0 + %1 + 14] ; m3 = src[7-14] + pmaddwd m6, m0 + pmaddwd m3, m0 + pshufd m2, m6, q2301 + paddd m6, m2 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m6, m6, q3120 + pshufd m3, m3, q3120 + punpcklqdq m6, m3 + + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m6, q2301 + paddd m6, m2 + pshufd m5, m5, q3120 + pshufd m6, m6, q3120 + punpcklqdq m5, m6 + paddd m5, m1 +%endmacro + +;------------------------------------------------------------------------------------------------------------ +; void interp_8tap_horiz_p%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------ +%macro FILTER_HOR_LUMA_sse2 3 +INIT_XMM sse2 +cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 + mov r4d, r4m + sub r0, 6 + shl r4d, 4 + add r1d, r1d + add r3d, r3d + +%ifdef PIC + lea r6, [tab_LumaCoeff] + mova m0, [r6 + r4] +%else + mova m0, [tab_LumaCoeff + r4] +%endif + +%ifidn %3, pp + mova m1, [pd_32] + pxor m7, m7 +%else + mova m1, [INTERP_OFFSET_PS] +%endif + + mov r4d, %2 +%ifidn %3, ps + cmp r5m, byte 0 + je .loopH + lea r6, [r1 + 2 * r1] + sub r0, r6 + add r4d, 7 +%endif + +.loopH: +%assign x 0 +%rep %1/8 + FILTER_LUMA_HOR_8_sse2 x + +%ifidn %3, pp + psrad m4, 6 + psrad m5, 6 + packssdw m4, m5 + CLIPW m4, m7, [pw_pixel_max] +%else + %if BIT_DEPTH == 10 + psrad m4, 2 + psrad m5, 2 + %elif BIT_DEPTH == 12 + psrad m4, 4 + psrad m5, 4 + %endif + packssdw m4, m5 +%endif + + movu [r2 + x], m4 +%assign x x+16 +%endrep + +%rep (%1 % 8)/4 + FILTER_LUMA_HOR_4_sse2 x + +%ifidn %3, pp + psrad m4, 6 + packssdw m4, m4 + CLIPW m4, m7, [pw_pixel_max] +%else + %if BIT_DEPTH == 10 + psrad m4, 2 + %elif BIT_DEPTH == 12 + psrad m4, 4 + %endif + packssdw m4, m4 +%endif + + movh [r2 + x], m4 +%endrep + + add r0, r1 + add r2, r3 + + dec r4d + jnz .loopH + RET + +%endmacro + +;------------------------------------------------------------------------------------------------------------ +; void interp_8tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------ + FILTER_HOR_LUMA_sse2 4, 4, pp + FILTER_HOR_LUMA_sse2 4, 8, pp + FILTER_HOR_LUMA_sse2 4, 16, pp + FILTER_HOR_LUMA_sse2 8, 4, pp + FILTER_HOR_LUMA_sse2 8, 8, pp + FILTER_HOR_LUMA_sse2 8, 16, pp + FILTER_HOR_LUMA_sse2 8, 32, pp + FILTER_HOR_LUMA_sse2 12, 16, pp + FILTER_HOR_LUMA_sse2 16, 4, pp + FILTER_HOR_LUMA_sse2 16, 8, pp + FILTER_HOR_LUMA_sse2 16, 12, pp + FILTER_HOR_LUMA_sse2 16, 16, pp + FILTER_HOR_LUMA_sse2 16, 32, pp + FILTER_HOR_LUMA_sse2 16, 64, pp + FILTER_HOR_LUMA_sse2 24, 32, pp + 
FILTER_HOR_LUMA_sse2 32, 8, pp + FILTER_HOR_LUMA_sse2 32, 16, pp + FILTER_HOR_LUMA_sse2 32, 24, pp + FILTER_HOR_LUMA_sse2 32, 32, pp + FILTER_HOR_LUMA_sse2 32, 64, pp + FILTER_HOR_LUMA_sse2 48, 64, pp + FILTER_HOR_LUMA_sse2 64, 16, pp + FILTER_HOR_LUMA_sse2 64, 32, pp + FILTER_HOR_LUMA_sse2 64, 48, pp + FILTER_HOR_LUMA_sse2 64, 64, pp + +;--------------------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx, int isRowExt) +;--------------------------------------------------------------------------------------------------------------------------- + FILTER_HOR_LUMA_sse2 4, 4, ps + FILTER_HOR_LUMA_sse2 4, 8, ps + FILTER_HOR_LUMA_sse2 4, 16, ps + FILTER_HOR_LUMA_sse2 8, 4, ps + FILTER_HOR_LUMA_sse2 8, 8, ps + FILTER_HOR_LUMA_sse2 8, 16, ps + FILTER_HOR_LUMA_sse2 8, 32, ps + FILTER_HOR_LUMA_sse2 12, 16, ps + FILTER_HOR_LUMA_sse2 16, 4, ps + FILTER_HOR_LUMA_sse2 16, 8, ps + FILTER_HOR_LUMA_sse2 16, 12, ps + FILTER_HOR_LUMA_sse2 16, 16, ps + FILTER_HOR_LUMA_sse2 16, 32, ps + FILTER_HOR_LUMA_sse2 16, 64, ps + FILTER_HOR_LUMA_sse2 24, 32, ps + FILTER_HOR_LUMA_sse2 32, 8, ps + FILTER_HOR_LUMA_sse2 32, 16, ps + FILTER_HOR_LUMA_sse2 32, 24, ps + FILTER_HOR_LUMA_sse2 32, 32, ps + FILTER_HOR_LUMA_sse2 32, 64, ps + FILTER_HOR_LUMA_sse2 48, 64, ps + FILTER_HOR_LUMA_sse2 64, 16, ps + FILTER_HOR_LUMA_sse2 64, 32, ps + FILTER_HOR_LUMA_sse2 64, 48, ps + FILTER_HOR_LUMA_sse2 64, 64, ps + +%macro PROCESS_LUMA_VER_W4_4R_sse2 0 + movq m0, [r0] + movq m1, [r0 + r1] + punpcklwd m0, m1 ;m0=[0 1] + pmaddwd m0, [r6 + 0 *16] ;m0=[0+1] Row1 + + lea r0, [r0 + 2 * r1] + movq m4, [r0] + punpcklwd m1, m4 ;m1=[1 2] + pmaddwd m1, [r6 + 0 *16] ;m1=[1+2] Row2 + + movq m5, [r0 + r1] + punpcklwd m4, m5 ;m4=[2 3] + pmaddwd m2, m4, [r6 + 0 *16] ;m2=[2+3] Row3 + pmaddwd m4, [r6 + 1 * 16] + paddd m0, m4 ;m0=[0+1+2+3] Row1 + + lea r0, [r0 + 2 * r1] + movq m4, [r0] + punpcklwd m5, m4 ;m5=[3 4] + pmaddwd m3, m5, [r6 + 0 *16] ;m3=[3+4] Row4 + pmaddwd m5, [r6 + 1 * 16] + paddd m1, m5 ;m1 = [1+2+3+4] Row2 + + movq m5, [r0 + r1] + punpcklwd m4, m5 ;m4=[4 5] + pmaddwd m6, m4, [r6 + 1 * 16] + paddd m2, m6 ;m2=[2+3+4+5] Row3 + pmaddwd m4, [r6 + 2 * 16] + paddd m0, m4 ;m0=[0+1+2+3+4+5] Row1 + + lea r0, [r0 + 2 * r1] + movq m4, [r0] + punpcklwd m5, m4 ;m5=[5 6] + pmaddwd m6, m5, [r6 + 1 * 16] + paddd m3, m6 ;m3=[3+4+5+6] Row4 + pmaddwd m5, [r6 + 2 * 16] + paddd m1, m5 ;m1=[1+2+3+4+5+6] Row2 + + movq m5, [r0 + r1] + punpcklwd m4, m5 ;m4=[6 7] + pmaddwd m6, m4, [r6 + 2 * 16] + paddd m2, m6 ;m2=[2+3+4+5+6+7] Row3 + pmaddwd m4, [r6 + 3 * 16] + paddd m0, m4 ;m0=[0+1+2+3+4+5+6+7] Row1 end + + lea r0, [r0 + 2 * r1] + movq m4, [r0] + punpcklwd m5, m4 ;m5=[7 8] + pmaddwd m6, m5, [r6 + 2 * 16] + paddd m3, m6 ;m3=[3+4+5+6+7+8] Row4 + pmaddwd m5, [r6 + 3 * 16] + paddd m1, m5 ;m1=[1+2+3+4+5+6+7+8] Row2 end + + movq m5, [r0 + r1] + punpcklwd m4, m5 ;m4=[8 9] + pmaddwd m4, [r6 + 3 * 16] + paddd m2, m4 ;m2=[2+3+4+5+6+7+8+9] Row3 end + + movq m4, [r0 + 2 * r1] + punpcklwd m5, m4 ;m5=[9 10] + pmaddwd m5, [r6 + 3 * 16] + paddd m3, m5 ;m3=[3+4+5+6+7+8+9+10] Row4 end +%endmacro + +;-------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert_%1_%2x%3(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
+;-------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_LUMA_sse2 3 +INIT_XMM sse2 +cglobal interp_8tap_vert_%1_%2x%3, 5, 7, 8 + + add r1d, r1d + add r3d, r3d + lea r5, [r1 + 2 * r1] + sub r0, r5 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_LumaCoeffV] + lea r6, [r5 + r4] +%else + lea r6, [tab_LumaCoeffV + r4] +%endif + +%ifidn %1,pp + mova m7, [INTERP_OFFSET_PP] +%define SHIFT 6 +%elifidn %1,ps + mova m7, [INTERP_OFFSET_PS] + %if BIT_DEPTH == 10 + %define SHIFT 2 + %elif BIT_DEPTH == 12 + %define SHIFT 4 + %endif +%endif + + mov r4d, %3/4 +.loopH: +%assign x 0 +%rep %2/4 + PROCESS_LUMA_VER_W4_4R_sse2 + + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + + psrad m0, SHIFT + psrad m1, SHIFT + psrad m2, SHIFT + psrad m3, SHIFT + + packssdw m0, m1 + packssdw m2, m3 + +%ifidn %1,pp + pxor m1, m1 + CLIPW2 m0, m2, m1, [pw_pixel_max] +%endif + + movh [r2 + x], m0 + movhps [r2 + r3 + x], m0 + lea r5, [r2 + 2 * r3] + movh [r5 + x], m2 + movhps [r5 + r3 + x], m2 + + lea r5, [8 * r1 - 2 * 4] + sub r0, r5 +%assign x x+8 +%endrep + + lea r0, [r0 + 4 * r1 - 2 * %2] + lea r2, [r2 + 4 * r3] + + dec r4d + jnz .loopH + + RET +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert_pp_%2x%3(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------- + FILTER_VER_LUMA_sse2 pp, 4, 4 + FILTER_VER_LUMA_sse2 pp, 8, 8 + FILTER_VER_LUMA_sse2 pp, 8, 4 + FILTER_VER_LUMA_sse2 pp, 4, 8 + FILTER_VER_LUMA_sse2 pp, 16, 16 + FILTER_VER_LUMA_sse2 pp, 16, 8 + FILTER_VER_LUMA_sse2 pp, 8, 16 + FILTER_VER_LUMA_sse2 pp, 16, 12 + FILTER_VER_LUMA_sse2 pp, 12, 16 + FILTER_VER_LUMA_sse2 pp, 16, 4 + FILTER_VER_LUMA_sse2 pp, 4, 16 + FILTER_VER_LUMA_sse2 pp, 32, 32 + FILTER_VER_LUMA_sse2 pp, 32, 16 + FILTER_VER_LUMA_sse2 pp, 16, 32 + FILTER_VER_LUMA_sse2 pp, 32, 24 + FILTER_VER_LUMA_sse2 pp, 24, 32 + FILTER_VER_LUMA_sse2 pp, 32, 8 + FILTER_VER_LUMA_sse2 pp, 8, 32 + FILTER_VER_LUMA_sse2 pp, 64, 64 + FILTER_VER_LUMA_sse2 pp, 64, 32 + FILTER_VER_LUMA_sse2 pp, 32, 64 + FILTER_VER_LUMA_sse2 pp, 64, 48 + FILTER_VER_LUMA_sse2 pp, 48, 64 + FILTER_VER_LUMA_sse2 pp, 64, 16 + FILTER_VER_LUMA_sse2 pp, 16, 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert_ps_%2x%3(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------- + FILTER_VER_LUMA_sse2 ps, 4, 4 + FILTER_VER_LUMA_sse2 ps, 8, 8 + FILTER_VER_LUMA_sse2 ps, 8, 4 + FILTER_VER_LUMA_sse2 ps, 4, 8 + FILTER_VER_LUMA_sse2 ps, 16, 16 + FILTER_VER_LUMA_sse2 ps, 16, 8 + FILTER_VER_LUMA_sse2 ps, 8, 16 + FILTER_VER_LUMA_sse2 ps, 16, 12 + FILTER_VER_LUMA_sse2 ps, 12, 16 + FILTER_VER_LUMA_sse2 ps, 16, 4 + FILTER_VER_LUMA_sse2 ps, 4, 16 + FILTER_VER_LUMA_sse2 ps, 32, 32 + FILTER_VER_LUMA_sse2 ps, 32, 16 + FILTER_VER_LUMA_sse2 ps, 16, 32 + FILTER_VER_LUMA_sse2 ps, 32, 24 + FILTER_VER_LUMA_sse2 ps, 24, 32 + FILTER_VER_LUMA_sse2 ps, 32, 8 + FILTER_VER_LUMA_sse2 ps, 8, 32 + FILTER_VER_LUMA_sse2 ps, 64, 64 + FILTER_VER_LUMA_sse2 ps, 64, 32 + FILTER_VER_LUMA_sse2 ps, 32, 64 + FILTER_VER_LUMA_sse2 ps, 64, 48 + FILTER_VER_LUMA_sse2 ps, 48, 64 + FILTER_VER_LUMA_sse2 ps, 64, 16 + 
FILTER_VER_LUMA_sse2 ps, 16, 64 + +%macro FILTERH_W2_4_sse3 2 + movh m3, [r0 + %1] + movhps m3, [r0 + %1 + 2] + pmaddwd m3, m0 + movh m4, [r0 + r1 + %1] + movhps m4, [r0 + r1 + %1 + 2] + pmaddwd m4, m0 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m3, m3, q3120 + pshufd m4, m4, q3120 + punpcklqdq m3, m4 + paddd m3, m1 + movh m5, [r0 + 2 * r1 + %1] + movhps m5, [r0 + 2 * r1 + %1 + 2] + pmaddwd m5, m0 + movh m4, [r0 + r4 + %1] + movhps m4, [r0 + r4 + %1 + 2] + pmaddwd m4, m0 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m5, m5, q3120 + pshufd m4, m4, q3120 + punpcklqdq m5, m4 + paddd m5, m1 +%ifidn %2, pp + psrad m3, 6 + psrad m5, 6 + packssdw m3, m5 + CLIPW m3, m7, m6 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movd [r2 + %1], m3 + psrldq m3, 4 + movd [r2 + r3 + %1], m3 + psrldq m3, 4 + movd [r2 + r3 * 2 + %1], m3 + psrldq m3, 4 + movd [r2 + r5 + %1], m3 +%endmacro + +%macro FILTERH_W2_3_sse3 1 + movh m3, [r0 + %1] + movhps m3, [r0 + %1 + 2] + pmaddwd m3, m0 + movh m4, [r0 + r1 + %1] + movhps m4, [r0 + r1 + %1 + 2] + pmaddwd m4, m0 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m3, m3, q3120 + pshufd m4, m4, q3120 + punpcklqdq m3, m4 + paddd m3, m1 + + movh m5, [r0 + 2 * r1 + %1] + movhps m5, [r0 + 2 * r1 + %1 + 2] + pmaddwd m5, m0 + + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m5, m5, q3120 + paddd m5, m1 + + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 + + movd [r2 + %1], m3 + psrldq m3, 4 + movd [r2 + r3 + %1], m3 + psrldq m3, 4 + movd [r2 + r3 * 2 + %1], m3 +%endmacro + +%macro FILTERH_W4_2_sse3 2 + movh m3, [r0 + %1] + movhps m3, [r0 + %1 + 2] + pmaddwd m3, m0 + movh m4, [r0 + %1 + 4] + movhps m4, [r0 + %1 + 6] + pmaddwd m4, m0 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m3, m3, q3120 + pshufd m4, m4, q3120 + punpcklqdq m3, m4 + paddd m3, m1 + + movh m5, [r0 + r1 + %1] + movhps m5, [r0 + r1 + %1 + 2] + pmaddwd m5, m0 + movh m4, [r0 + r1 + %1 + 4] + movhps m4, [r0 + r1 + %1 + 6] + pmaddwd m4, m0 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m5, m5, q3120 + pshufd m4, m4, q3120 + punpcklqdq m5, m4 + paddd m5, m1 +%ifidn %2, pp + psrad m3, 6 + psrad m5, 6 + packssdw m3, m5 + CLIPW m3, m7, m6 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movh [r2 + %1], m3 + movhps [r2 + r3 + %1], m3 +%endmacro + +%macro FILTERH_W4_1_sse3 1 + movh m3, [r0 + 2 * r1 + %1] + movhps m3, [r0 + 2 * r1 + %1 + 2] + pmaddwd m3, m0 + movh m4, [r0 + 2 * r1 + %1 + 4] + movhps m4, [r0 + 2 * r1 + %1 + 6] + pmaddwd m4, m0 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m3, m3, q3120 + pshufd m4, m4, q3120 + punpcklqdq m3, m4 + paddd m3, m1 + + psrad m3, INTERP_SHIFT_PS + packssdw m3, m3 + movh [r2 + r3 * 2 + %1], m3 +%endmacro + +%macro FILTERH_W8_1_sse3 2 + movh m3, [r0 + %1] + movhps m3, [r0 + %1 + 2] + pmaddwd m3, m0 + movh m4, [r0 + %1 + 4] + movhps m4, [r0 + %1 + 6] + pmaddwd m4, m0 + pshufd m2, m3, q2301 + paddd m3, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m3, m3, q3120 + pshufd m4, m4, q3120 + punpcklqdq m3, m4 + paddd m3, m1 + + movh m5, [r0 + %1 + 8] + movhps m5, [r0 + %1 + 10] + pmaddwd m5, m0 + movh m4, [r0 + %1 + 12] + movhps m4, [r0 + %1 + 14] + pmaddwd m4, m0 + pshufd m2, m5, q2301 + paddd m5, m2 + pshufd m2, m4, q2301 + paddd m4, m2 + pshufd m5, m5, 
q3120 + pshufd m4, m4, q3120 + punpcklqdq m5, m4 + paddd m5, m1 +%ifidn %2, pp + psrad m3, 6 + psrad m5, 6 + packssdw m3, m5 + CLIPW m3, m7, m6 +%else + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + packssdw m3, m5 +%endif + movdqu [r2 + %1], m3 +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_HOR_CHROMA_sse3 3 +INIT_XMM sse3 +cglobal interp_4tap_horiz_%3_%1x%2, 4, 7, 8 + add r3, r3 + add r1, r1 + sub r0, 2 + mov r4d, r4m + add r4d, r4d + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + movddup m0, [r6 + r4 * 4] +%else + movddup m0, [tab_ChromaCoeff + r4 * 4] +%endif + +%ifidn %3, ps + mova m1, [INTERP_OFFSET_PS] + cmp r5m, byte 0 +%if %1 <= 6 + lea r4, [r1 * 3] + lea r5, [r3 * 3] +%endif + je .skip + sub r0, r1 +%if %1 <= 6 +%assign y 1 +%else +%assign y 3 +%endif +%assign z 0 +%rep y +%assign x 0 +%rep %1/8 + FILTERH_W8_1_sse3 x, %3 +%assign x x+16 +%endrep +%if %1 == 4 || (%1 == 6 && z == 0) || (%1 == 12 && z == 0) + FILTERH_W4_2_sse3 x, %3 + FILTERH_W4_1_sse3 x +%assign x x+8 +%endif +%if %1 == 2 || (%1 == 6 && z == 0) + FILTERH_W2_3_sse3 x +%endif +%if %1 <= 6 + lea r0, [r0 + r4] + lea r2, [r2 + r5] +%else + lea r0, [r0 + r1] + lea r2, [r2 + r3] +%endif +%assign z z+1 +%endrep +.skip: +%elifidn %3, pp + pxor m7, m7 + mova m6, [pw_pixel_max] + mova m1, [tab_c_32] +%if %1 == 2 || %1 == 6 + lea r4, [r1 * 3] + lea r5, [r3 * 3] +%endif +%endif + +%if %1 == 2 +%assign y %2/4 +%elif %1 <= 6 +%assign y %2/2 +%else +%assign y %2 +%endif +%assign z 0 +%rep y +%assign x 0 +%rep %1/8 + FILTERH_W8_1_sse3 x, %3 +%assign x x+16 +%endrep +%if %1 == 4 || %1 == 6 || (%1 == 12 && (z % 2) == 0) + FILTERH_W4_2_sse3 x, %3 +%assign x x+8 +%endif +%if %1 == 2 || (%1 == 6 && (z % 2) == 0) + FILTERH_W2_4_sse3 x, %3 +%endif +%assign z z+1 +%if z < y +%if %1 == 2 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%elif %1 <= 6 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] +%else + lea r0, [r0 + r1] + lea r2, [r2 + r3] +%endif +%endif ;z < y +%endrep + + RET +%endmacro + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- + +FILTER_HOR_CHROMA_sse3 2, 4, pp +FILTER_HOR_CHROMA_sse3 2, 8, pp +FILTER_HOR_CHROMA_sse3 2, 16, pp +FILTER_HOR_CHROMA_sse3 4, 2, pp +FILTER_HOR_CHROMA_sse3 4, 4, pp +FILTER_HOR_CHROMA_sse3 4, 8, pp +FILTER_HOR_CHROMA_sse3 4, 16, pp +FILTER_HOR_CHROMA_sse3 4, 32, pp +FILTER_HOR_CHROMA_sse3 6, 8, pp +FILTER_HOR_CHROMA_sse3 6, 16, pp +FILTER_HOR_CHROMA_sse3 8, 2, pp +FILTER_HOR_CHROMA_sse3 8, 4, pp +FILTER_HOR_CHROMA_sse3 8, 6, pp +FILTER_HOR_CHROMA_sse3 8, 8, pp +FILTER_HOR_CHROMA_sse3 8, 12, pp +FILTER_HOR_CHROMA_sse3 8, 16, pp +FILTER_HOR_CHROMA_sse3 8, 32, pp +FILTER_HOR_CHROMA_sse3 8, 64, pp +FILTER_HOR_CHROMA_sse3 12, 16, pp +FILTER_HOR_CHROMA_sse3 12, 32, pp +FILTER_HOR_CHROMA_sse3 16, 4, pp +FILTER_HOR_CHROMA_sse3 16, 8, pp +FILTER_HOR_CHROMA_sse3 16, 12, pp +FILTER_HOR_CHROMA_sse3 16, 16, pp +FILTER_HOR_CHROMA_sse3 16, 24, pp +FILTER_HOR_CHROMA_sse3 16, 32, pp +FILTER_HOR_CHROMA_sse3 16, 64, pp +FILTER_HOR_CHROMA_sse3 24, 32, pp +FILTER_HOR_CHROMA_sse3 24, 64, pp +FILTER_HOR_CHROMA_sse3 32, 8, pp +FILTER_HOR_CHROMA_sse3 
32, 16, pp +FILTER_HOR_CHROMA_sse3 32, 24, pp +FILTER_HOR_CHROMA_sse3 32, 32, pp +FILTER_HOR_CHROMA_sse3 32, 48, pp +FILTER_HOR_CHROMA_sse3 32, 64, pp +FILTER_HOR_CHROMA_sse3 48, 64, pp +FILTER_HOR_CHROMA_sse3 64, 16, pp +FILTER_HOR_CHROMA_sse3 64, 32, pp +FILTER_HOR_CHROMA_sse3 64, 48, pp +FILTER_HOR_CHROMA_sse3 64, 64, pp + +;----------------------------------------------------------------------------- +; void interp_4tap_horiz_ps_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- + +FILTER_HOR_CHROMA_sse3 2, 4, ps +FILTER_HOR_CHROMA_sse3 2, 8, ps +FILTER_HOR_CHROMA_sse3 2, 16, ps +FILTER_HOR_CHROMA_sse3 4, 2, ps +FILTER_HOR_CHROMA_sse3 4, 4, ps +FILTER_HOR_CHROMA_sse3 4, 8, ps +FILTER_HOR_CHROMA_sse3 4, 16, ps +FILTER_HOR_CHROMA_sse3 4, 32, ps +FILTER_HOR_CHROMA_sse3 6, 8, ps +FILTER_HOR_CHROMA_sse3 6, 16, ps +FILTER_HOR_CHROMA_sse3 8, 2, ps +FILTER_HOR_CHROMA_sse3 8, 4, ps +FILTER_HOR_CHROMA_sse3 8, 6, ps +FILTER_HOR_CHROMA_sse3 8, 8, ps +FILTER_HOR_CHROMA_sse3 8, 12, ps +FILTER_HOR_CHROMA_sse3 8, 16, ps +FILTER_HOR_CHROMA_sse3 8, 32, ps +FILTER_HOR_CHROMA_sse3 8, 64, ps +FILTER_HOR_CHROMA_sse3 12, 16, ps +FILTER_HOR_CHROMA_sse3 12, 32, ps +FILTER_HOR_CHROMA_sse3 16, 4, ps +FILTER_HOR_CHROMA_sse3 16, 8, ps +FILTER_HOR_CHROMA_sse3 16, 12, ps +FILTER_HOR_CHROMA_sse3 16, 16, ps +FILTER_HOR_CHROMA_sse3 16, 24, ps +FILTER_HOR_CHROMA_sse3 16, 32, ps +FILTER_HOR_CHROMA_sse3 16, 64, ps +FILTER_HOR_CHROMA_sse3 24, 32, ps +FILTER_HOR_CHROMA_sse3 24, 64, ps +FILTER_HOR_CHROMA_sse3 32, 8, ps +FILTER_HOR_CHROMA_sse3 32, 16, ps +FILTER_HOR_CHROMA_sse3 32, 24, ps +FILTER_HOR_CHROMA_sse3 32, 32, ps +FILTER_HOR_CHROMA_sse3 32, 48, ps +FILTER_HOR_CHROMA_sse3 32, 64, ps +FILTER_HOR_CHROMA_sse3 48, 64, ps +FILTER_HOR_CHROMA_sse3 64, 16, ps +FILTER_HOR_CHROMA_sse3 64, 32, ps +FILTER_HOR_CHROMA_sse3 64, 48, ps +FILTER_HOR_CHROMA_sse3 64, 64, ps + +%macro FILTER_P2S_2_4_sse2 1 + movd m0, [r0 + %1] + movd m2, [r0 + r1 * 2 + %1] + movhps m0, [r0 + r1 + %1] + movhps m2, [r0 + r4 + %1] + psllw m0, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psubw m0, m1 + psubw m2, m1 + + movd [r2 + r3 * 0 + %1], m0 + movd [r2 + r3 * 2 + %1], m2 + movhlps m0, m0 + movhlps m2, m2 + movd [r2 + r3 * 1 + %1], m0 + movd [r2 + r5 + %1], m2 +%endmacro + +%macro FILTER_P2S_4_4_sse2 1 + movh m0, [r0 + %1] + movhps m0, [r0 + r1 + %1] + psllw m0, (14 - BIT_DEPTH) + psubw m0, m1 + movh [r2 + r3 * 0 + %1], m0 + movhps [r2 + r3 * 1 + %1], m0 + + movh m2, [r0 + r1 * 2 + %1] + movhps m2, [r0 + r4 + %1] + psllw m2, (14 - BIT_DEPTH) + psubw m2, m1 + movh [r2 + r3 * 2 + %1], m2 + movhps [r2 + r5 + %1], m2 +%endmacro + +%macro FILTER_P2S_4_2_sse2 0 + movh m0, [r0] + movhps m0, [r0 + r1 * 2] + psllw m0, (14 - BIT_DEPTH) + psubw m0, [pw_2000] + movh [r2 + r3 * 0], m0 + movhps [r2 + r3 * 2], m0 +%endmacro + +%macro FILTER_P2S_8_4_sse2 1 + movu m0, [r0 + %1] + movu m2, [r0 + r1 + %1] + psllw m0, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psubw m0, m1 + psubw m2, m1 + movu [r2 + r3 * 0 + %1], m0 + movu [r2 + r3 * 1 + %1], m2 + + movu m3, [r0 + r1 * 2 + %1] + movu m4, [r0 + r4 + %1] + psllw m3, (14 - BIT_DEPTH) + psllw m4, (14 - BIT_DEPTH) + psubw m3, m1 + psubw m4, m1 + movu [r2 + r3 * 2 + %1], m3 + movu [r2 + r5 + %1], m4 +%endmacro + +%macro FILTER_P2S_8_2_sse2 1 + movu m0, [r0 + %1] + movu m2, [r0 + r1 + %1] + psllw m0, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psubw m0, m1 + psubw m2, m1 + movu [r2 + r3 * 0 + %1], m0 + movu 
[r2 + r3 * 1 + %1], m2 +%endmacro + +;----------------------------------------------------------------------------- +; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride) +;----------------------------------------------------------------------------- +%macro FILTER_PIX_TO_SHORT_sse2 2 +INIT_XMM sse2 +cglobal filterPixelToShort_%1x%2, 4, 6, 3 +%if %2 == 2 +%if %1 == 4 + FILTER_P2S_4_2_sse2 +%elif %1 == 8 + add r1d, r1d + add r3d, r3d + mova m1, [pw_2000] + FILTER_P2S_8_2_sse2 0 +%endif +%else + add r1d, r1d + add r3d, r3d + mova m1, [pw_2000] + lea r4, [r1 * 3] + lea r5, [r3 * 3] +%assign y 1 +%rep %2/4 +%assign x 0 +%rep %1/8 + FILTER_P2S_8_4_sse2 x +%if %2 == 6 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + FILTER_P2S_8_2_sse2 x +%endif +%assign x x+16 +%endrep +%rep (%1 % 8)/4 + FILTER_P2S_4_4_sse2 x +%assign x x+8 +%endrep +%rep (%1 % 4)/2 + FILTER_P2S_2_4_sse2 x +%endrep +%if y < %2/4 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] +%assign y y+1 +%endif +%endrep +%endif +RET +%endmacro + + FILTER_PIX_TO_SHORT_sse2 2, 4 + FILTER_PIX_TO_SHORT_sse2 2, 8 + FILTER_PIX_TO_SHORT_sse2 2, 16 + FILTER_PIX_TO_SHORT_sse2 4, 2 + FILTER_PIX_TO_SHORT_sse2 4, 4 + FILTER_PIX_TO_SHORT_sse2 4, 8 + FILTER_PIX_TO_SHORT_sse2 4, 16 + FILTER_PIX_TO_SHORT_sse2 4, 32 + FILTER_PIX_TO_SHORT_sse2 6, 8 + FILTER_PIX_TO_SHORT_sse2 6, 16 + FILTER_PIX_TO_SHORT_sse2 8, 2 + FILTER_PIX_TO_SHORT_sse2 8, 4 + FILTER_PIX_TO_SHORT_sse2 8, 6 + FILTER_PIX_TO_SHORT_sse2 8, 8 + FILTER_PIX_TO_SHORT_sse2 8, 12 + FILTER_PIX_TO_SHORT_sse2 8, 16 + FILTER_PIX_TO_SHORT_sse2 8, 32 + FILTER_PIX_TO_SHORT_sse2 8, 64 + FILTER_PIX_TO_SHORT_sse2 12, 16 + FILTER_PIX_TO_SHORT_sse2 12, 32 + FILTER_PIX_TO_SHORT_sse2 16, 4 + FILTER_PIX_TO_SHORT_sse2 16, 8 + FILTER_PIX_TO_SHORT_sse2 16, 12 + FILTER_PIX_TO_SHORT_sse2 16, 16 + FILTER_PIX_TO_SHORT_sse2 16, 24 + FILTER_PIX_TO_SHORT_sse2 16, 32 + FILTER_PIX_TO_SHORT_sse2 16, 64 + FILTER_PIX_TO_SHORT_sse2 24, 32 + FILTER_PIX_TO_SHORT_sse2 24, 64 + FILTER_PIX_TO_SHORT_sse2 32, 8 + FILTER_PIX_TO_SHORT_sse2 32, 16 + FILTER_PIX_TO_SHORT_sse2 32, 24 + FILTER_PIX_TO_SHORT_sse2 32, 32 + FILTER_PIX_TO_SHORT_sse2 32, 48 + FILTER_PIX_TO_SHORT_sse2 32, 64 + FILTER_PIX_TO_SHORT_sse2 48, 64 + FILTER_PIX_TO_SHORT_sse2 64, 16 + FILTER_PIX_TO_SHORT_sse2 64, 32 + FILTER_PIX_TO_SHORT_sse2 64, 48 + FILTER_PIX_TO_SHORT_sse2 64, 64 ;------------------------------------------------------------------------------------------------------------ ; void interp_8tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) @@ -127,7 +1183,6 @@ %macro FILTER_HOR_LUMA_W4 3 INIT_XMM sse4 cglobal interp_8tap_horiz_%3_%1x%2, 4, 7, 8 - mov r4d, r4m sub r0, 6 shl r4d, 4 @@ -141,12 +1196,12 @@ mova m0, [tab_LumaCoeff + r4] %endif -%ifidn %3, pp +%ifidn %3, pp mova m1, [pd_32] pxor m6, m6 mova m7, [pw_pixel_max] %else - mova m1, [pd_n32768] + mova m1, [INTERP_OFFSET_PS] %endif mov r4d, %2 @@ -180,7 +1235,7 @@ packusdw m4, m4 CLIPW m4, m6, m7 %else - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS packssdw m4, m4 %endif @@ -228,17 +1283,17 @@ mova m0, [tab_LumaCoeff + r4] %endif -%ifidn %3, pp +%ifidn %3, pp mova m1, [pd_32] pxor m7, m7 %else - mova m1, [pd_n32768] + mova m1, [INTERP_OFFSET_PS] %endif mov r4d, %2 %ifidn %3, ps cmp r5m, byte 0 - je .loopH + je .loopH lea r6, [r1 + 2 * r1] sub r0, r6 add r4d, 7 @@ -274,14 +1329,14 @@ phaddd m6, m3 phaddd m5, m6 paddd m5, m1 -%ifidn %3, pp +%ifidn %3, pp psrad m4, 6 psrad m5, 6 packusdw m4, m5 CLIPW m4, m7, [pw_pixel_max] %else - 
psrad m4, 2 - psrad m5, 2 + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m4, m5 %endif @@ -291,7 +1346,7 @@ add r2, r3 dec r4d - jnz .loopH + jnz .loopH RET %endmacro @@ -330,10 +1385,10 @@ %else mova m0, [tab_LumaCoeff + r4] %endif -%ifidn %3, pp - mova m1, [pd_32] +%ifidn %3, pp + mova m1, [INTERP_OFFSET_PP] %else - mova m1, [pd_n32768] + mova m1, [INTERP_OFFSET_PS] %endif mov r4d, %2 @@ -375,15 +1430,15 @@ phaddd m6, m7 phaddd m5, m6 paddd m5, m1 -%ifidn %3, pp - psrad m4, 6 - psrad m5, 6 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m4, m5 pxor m5, m5 CLIPW m4, m5, [pw_pixel_max] %else - psrad m4, 2 - psrad m5, 2 + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m4, m5 %endif @@ -403,13 +1458,13 @@ phaddd m5, m2 phaddd m4, m5 paddd m4, m1 -%ifidn %3, pp - psrad m4, 6 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP packusdw m4, m4 pxor m5, m5 CLIPW m4, m5, [pw_pixel_max] %else - psrad m4, 2 + psrad m4, INTERP_SHIFT_PS packssdw m4, m4 %endif @@ -453,10 +1508,10 @@ mova m0, [tab_LumaCoeff + r4] %endif -%ifidn %3, pp +%ifidn %3, pp mova m1, [pd_32] %else - mova m1, [pd_n32768] + mova m1, [INTERP_OFFSET_PS] %endif mov r4d, %2 @@ -501,14 +1556,14 @@ phaddd m5, m6 paddd m5, m1 %ifidn %3, pp - psrad m4, 6 - psrad m5, 6 + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m4, m5 pxor m5, m5 CLIPW m4, m5, [pw_pixel_max] %else - psrad m4, 2 - psrad m5, 2 + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m4, m5 %endif movu [r2 + x], m4 @@ -541,15 +1596,15 @@ phaddd m6, m2 phaddd m5, m6 paddd m5, m1 -%ifidn %3, pp - psrad m4, 6 - psrad m5, 6 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m4, m5 pxor m5, m5 CLIPW m4, m5, [pw_pixel_max] %else - psrad m4, 2 - psrad m5, 2 + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m4, m5 %endif movu [r2 + 16 + x], m4 @@ -648,10 +1703,10 @@ %else mova m0, [tab_LumaCoeff + r4] %endif -%ifidn %3, pp +%ifidn %3, pp mova m1, [pd_32] %else - mova m1, [pd_n32768] + mova m1, [INTERP_OFFSET_PS] %endif mov r4d, %2 @@ -693,15 +1748,15 @@ phaddd m6, m7 phaddd m5, m6 paddd m5, m1 -%ifidn %3, pp - psrad m4, 6 - psrad m5, 6 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m4, m5 pxor m5, m5 CLIPW m4, m5, [pw_pixel_max] %else - psrad m4, 2 - psrad m5, 2 + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m4, m5 %endif movu [r2], m4 @@ -734,15 +1789,15 @@ phaddd m6, m7 phaddd m5, m6 paddd m5, m1 -%ifidn %3, pp - psrad m4, 6 - psrad m5, 6 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m4, m5 pxor m5, m5 CLIPW m4, m5, [pw_pixel_max] %else - psrad m4, 2 - psrad m5, 2 + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m4, m5 %endif movu [r2 + 16], m4 @@ -775,15 +1830,15 @@ phaddd m6, m7 phaddd m5, m6 paddd m5, m1 -%ifidn %3, pp - psrad m4, 6 - psrad m5, 6 +%ifidn %3, pp + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m4, m5 pxor m5, m5 CLIPW m4, m5, [pw_pixel_max] %else - psrad m4, 2 - psrad m5, 2 + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m4, m5 %endif movu [r2 + 32], m4 @@ -816,11 +1871,11 @@ phaddd m3, m4 paddd m3, m1 %ifidn %1, pp - psrad m3, 6 + psrad m3, INTERP_SHIFT_PP packusdw m3, m3 CLIPW m3, m7, m6 %else - psrad m3, 2 + psrad m3, INTERP_SHIFT_PS packssdw m3, m3 %endif movd [r2], m3 @@ -846,19 +1901,728 @@ phaddd m5, m4 paddd m5, m1 %ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad 
m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m7, m6 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2], m3 movhps [r2 + r3], m3 %endmacro +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro FILTER_HOR_LUMA_W4_avx2 1 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_4x%1, 4,7,7 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + lea r6, [pw_pixel_max] + mova m3, [interp8_hpp_shuf] + mova m6, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, %1/2 + +.loop: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + phaddd m4, m4 + vpermq m4, m4, q3120 + paddd m4, m6 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [r6] + movq [r2], xm4 + + vbroadcasti128 m4, [r0 + r1] + vbroadcasti128 m5, [r0 + r1 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + phaddd m4, m4 + vpermq m4, m4, q3120 + paddd m4, m6 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [r6] + movq [r2 + r3], xm4 + + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + dec r4d + jnz .loop + RET +%endmacro +FILTER_HOR_LUMA_W4_avx2 4 +FILTER_HOR_LUMA_W4_avx2 8 +FILTER_HOR_LUMA_W4_avx2 16 + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro FILTER_HOR_LUMA_W8 1 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_8x%1, 4,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + mova m7, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, %1/2 + +.loop: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 8] + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + r1] + vbroadcasti128 m5, [r0 + r1 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + r1 + 8] + vbroadcasti128 m6, [r0 + r1 + 16] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw 
m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + r3], xm4 + + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + dec r4d + jnz .loop + RET +%endmacro +FILTER_HOR_LUMA_W8 4 +FILTER_HOR_LUMA_W8 8 +FILTER_HOR_LUMA_W8 16 +FILTER_HOR_LUMA_W8 32 + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro FILTER_HOR_LUMA_W16 1 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_16x%1, 4,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + mova m7, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, %1 + +.loop: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 8] + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 24] + vbroadcasti128 m6, [r0 + 32] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + 16], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET +%endmacro +FILTER_HOR_LUMA_W16 4 +FILTER_HOR_LUMA_W16 8 +FILTER_HOR_LUMA_W16 12 +FILTER_HOR_LUMA_W16 16 +FILTER_HOR_LUMA_W16 32 +FILTER_HOR_LUMA_W16 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +%macro FILTER_HOR_LUMA_W32 2 +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_%1x%2, 4,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + mova m7, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, %2 + +.loop: +%assign x 0 +%rep %1/16 + vbroadcasti128 m4, [r0 + x] + vbroadcasti128 m5, [r0 + 8 + x] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 8 + x] + vbroadcasti128 m6, [r0 + 16 + x] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 
+ x], xm4 + + vbroadcasti128 m4, [r0 + 16 + x] + vbroadcasti128 m5, [r0 + 24 + x] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 24 + x] + vbroadcasti128 m6, [r0 + 32 + x] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + 16 + x], xm4 + +%assign x x+32 +%endrep + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET +%endmacro +FILTER_HOR_LUMA_W32 32, 8 +FILTER_HOR_LUMA_W32 32, 16 +FILTER_HOR_LUMA_W32 32, 24 +FILTER_HOR_LUMA_W32 32, 32 +FILTER_HOR_LUMA_W32 32, 64 +FILTER_HOR_LUMA_W32 64, 16 +FILTER_HOR_LUMA_W32 64, 32 +FILTER_HOR_LUMA_W32 64, 48 +FILTER_HOR_LUMA_W32 64, 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_12x16, 4,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + mova m7, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, 16 + +.loop: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 8] + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 24] + vbroadcasti128 m6, [r0 + 32] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movq [r2 + 16], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_24x32, 4,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + mova m7, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, 32 + +.loop: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, 
m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 8] + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 24] + vbroadcasti128 m6, [r0 + 32] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + 16], xm4 + + vbroadcasti128 m4, [r0 + 32] + vbroadcasti128 m5, [r0 + 40] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 40] + vbroadcasti128 m6, [r0 + 48] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + 32], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_horiz_pp(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_8tap_horiz_pp_48x64, 4,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 6 + mov r4d, r4m + shl r4d, 4 +%ifdef PIC + lea r5, [tab_LumaCoeff] + vpbroadcastq m0, [r5 + r4] + vpbroadcastq m1, [r5 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [interp8_hpp_shuf] + mova m7, [pd_32] + pxor m2, m2 + + ; register map + ; m0 , m1 interpolate coeff + + mov r4d, 64 + +.loop: +%assign x 0 +%rep 2 + vbroadcasti128 m4, [r0 + x] + vbroadcasti128 m5, [r0 + 8 + x] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 8 + x] + vbroadcasti128 m6, [r0 + 16 + x] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + x], xm4 + + vbroadcasti128 m4, [r0 + 16 + x] + vbroadcasti128 m5, [r0 + 24 + x] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 24 + x] + vbroadcasti128 m6, [r0 + 32 + x] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + 16 + x], xm4 + + vbroadcasti128 m4, [r0 + 32 + x] + vbroadcasti128 m5, [r0 + 40 + x] + pshufb m4, m3 + pshufb m5, m3 + + pmaddwd m4, m0 + pmaddwd m5, m1 + paddd m4, m5 + + vbroadcasti128 m5, [r0 + 40 + x] + vbroadcasti128 m6, [r0 + 48 + x] + pshufb m5, m3 + pshufb m6, m3 + + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m7 + psrad m4, INTERP_SHIFT_PP + + packusdw m4, m4 + 
vpermq m4, m4, q2020 + CLIPW m4, m2, [pw_pixel_max] + movu [r2 + 32 + x], xm4 + +%assign x x+48 +%endrep + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop + RET + ;----------------------------------------------------------------------------- ; void interp_4tap_horiz_%3_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- @@ -883,35 +2647,35 @@ mova m2, [tab_Tm16] %ifidn %3, ps - mova m1, [tab_c_n32768] + mova m1, [INTERP_OFFSET_PS] cmp r5m, byte 0 je .skip - sub r0, r1 - movu m3, [r0] - pshufb m3, m3, m2 - pmaddwd m3, m0 + sub r0, r1 + movu m3, [r0] + pshufb m3, m3, m2 + pmaddwd m3, m0 + + %if %1 == 4 + movu m4, [r0 + 4] + pshufb m4, m4, m2 + pmaddwd m4, m0 + phaddd m3, m4 + %else + phaddd m3, m3 + %endif + + paddd m3, m1 + psrad m3, INTERP_SHIFT_PS + packssdw m3, m3 + + %if %1 == 2 + movd [r2], m3 + %else + movh [r2], m3 + %endif - %if %1 == 4 - movu m4, [r0 + 4] - pshufb m4, m4, m2 - pmaddwd m4, m0 - phaddd m3, m4 - %else - phaddd m3, m3 - %endif - - paddd m3, m1 - psrad m3, 2 - packssdw m3, m3 - - %if %1 == 2 - movd [r2], m3 - %else - movh [r2], m3 - %endif - - add r0, r1 - add r2, r3 + add r0, r1 + add r2, r3 FILTER_W%1_2 %3 lea r0, [r0 + 2 * r1] lea r2, [r2 + 2 * r3] @@ -931,8 +2695,7 @@ lea r2, [r2 + 2 * r3] FILTER_W%1_2 %3 %endrep - -RET + RET %endmacro FILTER_CHROMA_H 2, 4, pp, 6, 8, 5 @@ -971,13 +2734,13 @@ phaddd m4, m4 paddd m4, m1 %ifidn %1, pp - psrad m3, 6 - psrad m4, 6 + psrad m3, INTERP_SHIFT_PP + psrad m4, INTERP_SHIFT_PP packusdw m3, m4 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m4, 2 + psrad m3, INTERP_SHIFT_PS + psrad m4, INTERP_SHIFT_PS packssdw m3, m4 %endif movh [r2], m3 @@ -1011,13 +2774,13 @@ phaddd m5, m4 paddd m5, m1 %ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2], m3 @@ -1051,13 +2814,13 @@ phaddd m5, m4 paddd m5, m1 %ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2], m3 @@ -1073,11 +2836,11 @@ paddd m3, m1 %ifidn %1, pp - psrad m3, 6 + psrad m3, INTERP_SHIFT_PP packusdw m3, m3 CLIPW m3, m6, m7 %else - psrad m3, 2 + psrad m3, INTERP_SHIFT_PS packssdw m3, m3 %endif movh [r2 + 16], m3 @@ -1110,13 +2873,13 @@ phaddd m5, m4 paddd m5, m1 %ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2], m3 @@ -1140,13 +2903,13 @@ phaddd m5, m4 paddd m5, m1 %ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2 + 16], m3 @@ -1180,13 +2943,13 @@ phaddd m5, m4 paddd m5, m1 %ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2], m3 @@ -1210,13 +2973,13 @@ phaddd m5, m4 paddd m5, m1 
%ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2 + 16], m3 @@ -1240,13 +3003,13 @@ phaddd m5, m4 paddd m5, m1 %ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2 + 32], m3 @@ -1280,13 +3043,13 @@ phaddd m5, m4 paddd m5, m1 %ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2], m3 @@ -1310,13 +3073,13 @@ phaddd m5, m4 paddd m5, m1 %ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2 + 16], m3 @@ -1340,13 +3103,13 @@ phaddd m5, m4 paddd m5, m1 %ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2 + 32], m3 @@ -1370,13 +3133,13 @@ phaddd m5, m4 paddd m5, m1 %ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2 + 48], m3 @@ -1410,13 +3173,13 @@ phaddd m5, m4 paddd m5, m1 %ifidn %1, pp - psrad m3, 6 - psrad m5, 6 + psrad m3, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP packusdw m3, m5 CLIPW m3, m6, m7 %else - psrad m3, 2 - psrad m5, 2 + psrad m3, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS packssdw m3, m5 %endif movh [r2 + %2], m3 @@ -1485,7 +3248,7 @@ mova m2, [tab_Tm16] %ifidn %3, ps - mova m1, [tab_c_n32768] + mova m1, [INTERP_OFFSET_PS] cmp r5m, byte 0 je .skip sub r0, r1 @@ -1586,37 +3349,726 @@ movq m0, [r0] movq m1, [r0 + r1] punpcklwd m0, m1 ;m0=[0 1] - pmaddwd m0, [r6 + 0 *16] ;m0=[0+1] Row1 + pmaddwd m0, [r6 + 0 *32] ;m0=[0+1] Row1 lea r0, [r0 + 2 * r1] movq m4, [r0] punpcklwd m1, m4 ;m1=[1 2] - pmaddwd m1, [r6 + 0 *16] ;m1=[1+2] Row2 + pmaddwd m1, [r6 + 0 *32] ;m1=[1+2] Row2 movq m5, [r0 + r1] punpcklwd m4, m5 ;m4=[2 3] - pmaddwd m2, m4, [r6 + 0 *16] ;m2=[2+3] Row3 - pmaddwd m4, [r6 + 1 * 16] + pmaddwd m2, m4, [r6 + 0 *32] ;m2=[2+3] Row3 + pmaddwd m4, [r6 + 1 * 32] paddd m0, m4 ;m0=[0+1+2+3] Row1 done lea r0, [r0 + 2 * r1] movq m4, [r0] punpcklwd m5, m4 ;m5=[3 4] - pmaddwd m3, m5, [r6 + 0 *16] ;m3=[3+4] Row4 - pmaddwd m5, [r6 + 1 * 16] + pmaddwd m3, m5, [r6 + 0 *32] ;m3=[3+4] Row4 + pmaddwd m5, [r6 + 1 * 32] paddd m1, m5 ;m1 = [1+2+3+4] Row2 movq m5, [r0 + r1] punpcklwd m4, m5 ;m4=[4 5] - pmaddwd m4, [r6 + 1 * 16] + pmaddwd m4, [r6 + 1 * 32] paddd m2, m4 ;m2=[2+3+4+5] Row3 movq m4, [r0 + 2 * r1] punpcklwd m5, m4 ;m5=[5 6] - pmaddwd m5, [r6 + 1 * 16] + pmaddwd m5, [r6 + 1 * 32] paddd m3, m5 ;m3=[3+4+5+6] Row4 %endmacro +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, 
intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_6xN 1 +cglobal interp_4tap_horiz_pp_6x%1, 5,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r4d, %1/2 +.loop: + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, INTERP_SHIFT_PP ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movq [r2], xm3 + pextrd [r2 + 8], xm3, 2 + + vbroadcasti128 m3, [r0 + r1] + vbroadcasti128 m4, [r0 + r1 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, INTERP_SHIFT_PP ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movq [r2 + r3], xm3 + pextrd [r2 + r3 + 8], xm3, 2 + + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] + dec r4d + jnz .loop + RET +%endmacro +IPFILTER_CHROMA_avx2_6xN 8 +IPFILTER_CHROMA_avx2_6xN 16 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_8x2, 5,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, INTERP_SHIFT_PP ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3,q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2], xm3 + + vbroadcasti128 m3, [r0 + r1] + vbroadcasti128 m4, [r0 + r1 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, INTERP_SHIFT_PP ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3,q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2 + r3], xm3 + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal interp_4tap_horiz_pp_8x4, 5,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, 
[idct8_shuf2] + mova m7, [pw_pixel_max] + +%rep 2 + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3,q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2], xm3 + + vbroadcasti128 m3, [r0 + r1] + vbroadcasti128 m4, [r0 + r1 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3,q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2 + r3], xm3 + + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] +%endrep + RET + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_8xN 1 +cglobal interp_4tap_horiz_pp_8x%1, 5,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r4d, %1/2 +.loop: + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2], xm3 + + vbroadcasti128 m3, [r0 + r1] + vbroadcasti128 m4, [r0 + r1 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2 + r3], xm3 + + lea r0, [r0 + r1 * 2] + lea r2, [r2 + r3 * 2] + dec r4d + jnz .loop + RET +%endmacro +IPFILTER_CHROMA_avx2_8xN 6 +IPFILTER_CHROMA_avx2_8xN 8 +IPFILTER_CHROMA_avx2_8xN 12 +IPFILTER_CHROMA_avx2_8xN 16 +IPFILTER_CHROMA_avx2_8xN 32 +IPFILTER_CHROMA_avx2_8xN 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_16xN 1 +%if ARCH_X86_64 +cglobal interp_4tap_horiz_pp_16x%1, 5,6,9 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r4d, %1 +.loop: + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 
0] + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m8, [r0 + 24] + + pshufb m4, m1 + pshufb m8, m1 + + pmaddwd m4, m0 + pmaddwd m8, m0 + phaddd m4, m8 + paddd m4, m2 + psrad m4, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m4, m4 + vpermq m4, m4, q2020 + pshufb xm4, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + vinserti128 m3, m3, xm4, 1 + CLIPW m3, m5, m7 + movu [r2], m3 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET +%endif +%endmacro +IPFILTER_CHROMA_avx2_16xN 4 +IPFILTER_CHROMA_avx2_16xN 8 +IPFILTER_CHROMA_avx2_16xN 12 +IPFILTER_CHROMA_avx2_16xN 16 +IPFILTER_CHROMA_avx2_16xN 24 +IPFILTER_CHROMA_avx2_16xN 32 +IPFILTER_CHROMA_avx2_16xN 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_32xN 1 +%if ARCH_X86_64 +cglobal interp_4tap_horiz_pp_32x%1, 5,6,9 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r6d, %1 +.loop: +%assign x 0 +%rep 2 + vbroadcasti128 m3, [r0 + x] + vbroadcasti128 m4, [r0 + 8 + x] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + + vbroadcasti128 m4, [r0 + 16 + x] + vbroadcasti128 m8, [r0 + 24 + x] + pshufb m4, m1 + pshufb m8, m1 + + pmaddwd m4, m0 + pmaddwd m8, m0 + phaddd m4, m8 + paddd m4, m2 + psrad m4, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m4, m4 + vpermq m4, m4, q2020 + pshufb xm4, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + vinserti128 m3, m3, xm4, 1 + CLIPW m3, m5, m7 + movu [r2 + x], m3 + %assign x x+32 + %endrep + + add r0, r1 + add r2, r3 + dec r6d + jnz .loop + RET +%endif +%endmacro +IPFILTER_CHROMA_avx2_32xN 8 +IPFILTER_CHROMA_avx2_32xN 16 +IPFILTER_CHROMA_avx2_32xN 24 +IPFILTER_CHROMA_avx2_32xN 32 +IPFILTER_CHROMA_avx2_32xN 48 +IPFILTER_CHROMA_avx2_32xN 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_12xN 1 +%if ARCH_X86_64 +cglobal interp_4tap_horiz_pp_12x%1, 5,6,8 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r4d, %1 +.loop: + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movu [r2], xm3 + + vbroadcasti128 m3, [r0 + 16] + 
vbroadcasti128 m4, [r0 + 24] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 ; m3 = DWORD[7 6 3 2 5 4 1 0] + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 ; m3 = WORD[7 6 5 4 3 2 1 0] + CLIPW xm3, xm5, xm7 + movq [r2 + 16], xm3 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET +%endif +%endmacro +IPFILTER_CHROMA_avx2_12xN 16 +IPFILTER_CHROMA_avx2_12xN 32 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_24xN 1 +%if ARCH_X86_64 +cglobal interp_4tap_horiz_pp_24x%1, 5,6,9 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r4d, %1 +.loop: + vbroadcasti128 m3, [r0] + vbroadcasti128 m4, [r0 + 8] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m8, [r0 + 24] + pshufb m4, m1 + pshufb m8, m1 + + pmaddwd m4, m0 + pmaddwd m8, m0 + phaddd m4, m8 + paddd m4, m2 + psrad m4, 6 + + packusdw m3, m4 + vpermq m3, m3, q3120 + pshufb m3, m6 + CLIPW m3, m5, m7 + movu [r2], m3 + + vbroadcasti128 m3, [r0 + 32] + vbroadcasti128 m4, [r0 + 40] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 + + packusdw m3, m3 + vpermq m3, m3, q2020 + pshufb xm3, xm6 + CLIPW xm3, xm5, xm7 + movu [r2 + 32], xm3 + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET +%endif +%endmacro +IPFILTER_CHROMA_avx2_24xN 32 +IPFILTER_CHROMA_avx2_24xN 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%macro IPFILTER_CHROMA_avx2_64xN 1 +%if ARCH_X86_64 +cglobal interp_4tap_horiz_pp_64x%1, 5,6,9 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r6d, %1 +.loop: +%assign x 0 +%rep 4 + vbroadcasti128 m3, [r0 + x] + vbroadcasti128 m4, [r0 + 8 + x] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 + + vbroadcasti128 m4, [r0 + 16 + x] + vbroadcasti128 m8, [r0 + 24 + x] + pshufb m4, m1 + pshufb m8, m1 + + pmaddwd m4, m0 + pmaddwd m8, m0 + phaddd m4, m8 + paddd m4, m2 + psrad m4, 6 + + packusdw m3, m4 + vpermq m3, m3, q3120 + pshufb m3, m6 + CLIPW m3, m5, m7 + movu [r2 + x], m3 + %assign x x+32 + %endrep + + add r0, r1 + add r2, r3 + dec r6d + jnz .loop + RET +%endif +%endmacro +IPFILTER_CHROMA_avx2_64xN 16 +IPFILTER_CHROMA_avx2_64xN 32 +IPFILTER_CHROMA_avx2_64xN 48 
+IPFILTER_CHROMA_avx2_64xN 64 + +;------------------------------------------------------------------------------------------------------------- +; void interp_4tap_horiz_pp(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx +;------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +%if ARCH_X86_64 +cglobal interp_4tap_horiz_pp_48x64, 5,6,9 + add r1d, r1d + add r3d, r3d + sub r0, 2 + mov r4d, r4m +%ifdef PIC + lea r5, [tab_ChromaCoeff] + vpbroadcastq m0, [r5 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m1, [interp8_hpp_shuf] + vpbroadcastd m2, [pd_32] + pxor m5, m5 + mova m6, [idct8_shuf2] + mova m7, [pw_pixel_max] + + mov r4d, 64 +.loop: +%assign x 0 +%rep 3 + vbroadcasti128 m3, [r0 + x] + vbroadcasti128 m4, [r0 + 8 + x] + pshufb m3, m1 + pshufb m4, m1 + + pmaddwd m3, m0 + pmaddwd m4, m0 + phaddd m3, m4 + paddd m3, m2 + psrad m3, 6 + + vbroadcasti128 m4, [r0 + 16 + x] + vbroadcasti128 m8, [r0 + 24 + x] + pshufb m4, m1 + pshufb m8, m1 + + pmaddwd m4, m0 + pmaddwd m8, m0 + phaddd m4, m8 + paddd m4, m2 + psrad m4, 6 + + packusdw m3, m4 + vpermq m3, m3, q3120 + pshufb m3, m6 + CLIPW m3, m5, m7 + movu [r2 + x], m3 +%assign x x+32 +%endrep + + add r0, r1 + add r2, r3 + dec r4d + jnz .loop + RET +%endif + ;----------------------------------------------------------------------------------------------------------------- ; void interp_4tap_vert_%3_%1x%2(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------------------------------------------- @@ -1627,7 +4079,7 @@ add r1d, r1d add r3d, r3d sub r0, r1 - shl r4d, 5 + shl r4d, 6 %ifdef PIC lea r5, [tab_ChromaCoeffV] @@ -1642,12 +4094,12 @@ %ifnidn %3, ps mova m7, [pw_pixel_max] %ifidn %3, pp - mova m6, [tab_c_32] + mova m6, [INTERP_OFFSET_PP] %else - mova m6, [tab_c_524800] + mova m6, [INTERP_OFFSET_SP] %endif %else - mova m6, [tab_c_n32768] + mova m6, [INTERP_OFFSET_PS] %endif %endif @@ -1669,10 +4121,10 @@ paddd m1, m6 paddd m2, m6 paddd m3, m6 - psrad m0, 2 - psrad m1, 2 - psrad m2, 2 - psrad m3, 2 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS packssdw m0, m1 packssdw m2, m3 @@ -1682,15 +4134,15 @@ paddd m2, m6 paddd m3, m6 %ifidn %3, pp - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP %else - psrad m0, 10 - psrad m1, 10 - psrad m2, 10 - psrad m3, 10 + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP %endif packssdw m0, m1 packssdw m2, m3 @@ -1848,7 +4300,7 @@ movd m2, [r0] punpcklwd m1, m2 ;m1=[1 2] punpcklqdq m0, m1 ;m0=[0 1 1 2] - pmaddwd m0, [%1 + 0 *16] ;m0=[0+1 1+2] Row 1-2 + pmaddwd m0, [%1 + 0 *32] ;m0=[0+1 1+2] Row 1-2 movd m1, [r0 + r1] punpcklwd m2, m1 ;m2=[2 3] @@ -1858,8 +4310,8 @@ punpcklwd m1, m3 ;m2=[3 4] punpcklqdq m2, m1 ;m2=[2 3 3 4] - pmaddwd m4, m2, [%1 + 1 * 16] ;m4=[2+3 3+4] Row 1-2 - pmaddwd m2, [%1 + 0 * 16] ;m2=[2+3 3+4] Row 3-4 + pmaddwd m4, m2, [%1 + 1 * 32] ;m4=[2+3 3+4] Row 1-2 + pmaddwd m2, [%1 + 0 * 32] ;m2=[2+3 3+4] Row 3-4 paddd m0, m4 ;m0=[0+1+2+3 1+2+3+4] Row 1-2 movd m1, [r0 + r1] @@ -1868,7 +4320,7 @@ movd m4, [r0 + 2 * r1] punpcklwd m1, m4 ;m1=[5 6] punpcklqdq m3, m1 ;m2=[4 5 5 6] - pmaddwd m3, [%1 + 1 * 16] ;m3=[4+5 5+6] Row 3-4 + 
pmaddwd m3, [%1 + 1 * 32] ;m3=[4+5 5+6] Row 3-4 paddd m2, m3 ;m2=[2+3+4+5 3+4+5+6] Row 3-4 %endmacro @@ -1882,7 +4334,7 @@ add r1d, r1d add r3d, r3d sub r0, r1 - shl r4d, 5 + shl r4d, 6 %ifdef PIC lea r5, [tab_ChromaCoeffV] @@ -1897,12 +4349,12 @@ pxor m7, m7 mova m6, [pw_pixel_max] %ifidn %2, pp - mova m5, [tab_c_32] + mova m5, [INTERP_OFFSET_PP] %else - mova m5, [tab_c_524800] + mova m5, [INTERP_OFFSET_SP] %endif %else - mova m5, [tab_c_n32768] + mova m5, [INTERP_OFFSET_PS] %endif %endif @@ -1915,18 +4367,18 @@ %elifidn %2, ps paddd m0, m5 paddd m2, m5 - psrad m0, 2 - psrad m2, 2 + psrad m0, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS packssdw m0, m2 %else paddd m0, m5 paddd m2, m5 %ifidn %2, pp - psrad m0, 6 - psrad m2, 6 + psrad m0, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP %else - psrad m0, 10 - psrad m2, 10 + psrad m0, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP %endif packusdw m0, m2 CLIPW m0, m7, m6 @@ -1942,7 +4394,6 @@ dec r4d jnz .loopH - RET %endmacro @@ -1970,11 +4421,10 @@ %macro FILTER_VER_CHROMA_W4 3 INIT_XMM sse4 cglobal interp_4tap_vert_%2_4x%1, 5, 6, %3 - add r1d, r1d add r3d, r3d sub r0, r1 - shl r4d, 5 + shl r4d, 6 %ifdef PIC lea r5, [tab_ChromaCoeffV] @@ -1992,12 +4442,12 @@ pxor m6, m6 mova m5, [pw_pixel_max] %ifidn %2, pp - mova m4, [tab_c_32] + mova m4, [INTERP_OFFSET_PP] %else - mova m4, [tab_c_524800] + mova m4, [INTERP_OFFSET_SP] %endif %else - mova m4, [tab_c_n32768] + mova m4, [INTERP_OFFSET_PS] %endif %endif @@ -2008,21 +4458,21 @@ movh m0, [r0] movh m1, [r0 + r1] punpcklwd m0, m1 ;m0=[0 1] - pmaddwd m0, [r5 + 0 *16] ;m0=[0+1] Row1 + pmaddwd m0, [r5 + 0 *32] ;m0=[0+1] Row1 lea r0, [r0 + 2 * r1] movh m2, [r0] punpcklwd m1, m2 ;m1=[1 2] - pmaddwd m1, [r5 + 0 *16] ;m1=[1+2] Row2 + pmaddwd m1, [r5 + 0 *32] ;m1=[1+2] Row2 movh m3, [r0 + r1] punpcklwd m2, m3 ;m4=[2 3] - pmaddwd m2, [r5 + 1 * 16] + pmaddwd m2, [r5 + 1 * 32] paddd m0, m2 ;m0=[0+1+2+3] Row1 done movh m2, [r0 + 2 * r1] punpcklwd m3, m2 ;m5=[3 4] - pmaddwd m3, [r5 + 1 * 16] + pmaddwd m3, [r5 + 1 * 32] paddd m1, m3 ;m1=[1+2+3+4] Row2 done %ifidn %2, ss @@ -2032,18 +4482,18 @@ %elifidn %2, ps paddd m0, m4 paddd m1, m4 - psrad m0, 2 - psrad m1, 2 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS packssdw m0, m1 %else paddd m0, m4 paddd m1, m4 %ifidn %2, pp - psrad m0, 6 - psrad m1, 6 + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP %else - psrad m0, 10 - psrad m1, 10 + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP %endif packusdw m0, m1 CLIPW m0, m6, m5 @@ -2057,7 +4507,6 @@ dec r4d jnz .loop %endif - RET %endmacro @@ -2077,11 +4526,10 @@ %macro FILTER_VER_CHROMA_W6 3 INIT_XMM sse4 cglobal interp_4tap_vert_%2_6x%1, 5, 7, %3 - add r1d, r1d add r3d, r3d sub r0, r1 - shl r4d, 5 + shl r4d, 6 %ifdef PIC lea r5, [tab_ChromaCoeffV] @@ -2096,12 +4544,12 @@ %ifnidn %2, ps mova m7, [pw_pixel_max] %ifidn %2, pp - mova m6, [tab_c_32] + mova m6, [INTERP_OFFSET_PP] %else - mova m6, [tab_c_524800] + mova m6, [INTERP_OFFSET_SP] %endif %else - mova m6, [tab_c_n32768] + mova m6, [INTERP_OFFSET_PS] %endif %endif @@ -2121,10 +4569,10 @@ paddd m1, m6 paddd m2, m6 paddd m3, m6 - psrad m0, 2 - psrad m1, 2 - psrad m2, 2 - psrad m3, 2 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS packssdw m0, m1 packssdw m2, m3 @@ -2134,15 +4582,15 @@ paddd m2, m6 paddd m3, m6 %ifidn %2, pp - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP 
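; NOTE: two coordinated changes run through these vertical chroma
; kernels. First, coefficient addressing doubles: shl r4d, 5 becomes
; shl r4d, 6 and every [r5 + n * 16] becomes [r5 + n * 32], consistent
; with the tab_ChromaCoeffV rows having been widened to 32 bytes so the
; AVX2 kernels below can address the same table as [r5 + n * mmsize].
; Second, the sp/ss rounding also goes through macros; a scalar sketch
; of those two paths, with the 10-bit constants inferred from
; tab_c_524800 (524800 = (8192 << 6) + (1 << 9)):
;
;     /* sp: 16-bit plane -> pixel; 10-bit shift is 10                 */
;     int sp = (sum + (8192 << 6) + (1 << 9)) >> 10;
;     sp = clamp(sp, 0, pixel_max);
;     /* ss: 16-bit plane -> 16-bit plane; plain shift, no offset      */
;     int16_t ss = sum >> 6;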
%else - psrad m0, 10 - psrad m1, 10 - psrad m2, 10 - psrad m3, 10 + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP %endif packssdw m0, m1 packssdw m2, m3 @@ -2169,18 +4617,18 @@ %elifidn %2, ps paddd m0, m6 paddd m2, m6 - psrad m0, 2 - psrad m2, 2 + psrad m0, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS packssdw m0, m2 %else paddd m0, m6 paddd m2, m6 %ifidn %2, pp - psrad m0, 6 - psrad m2, 6 + psrad m0, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP %else - psrad m0, 10 - psrad m2, 10 + psrad m0, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP %endif packusdw m0, m2 CLIPW m0, m5, m7 @@ -2197,7 +4645,6 @@ dec r4d jnz .loopH - RET %endmacro @@ -2215,31 +4662,31 @@ movu m1, [r0] movu m3, [r0 + r1] punpcklwd m0, m1, m3 - pmaddwd m0, [r5 + 0 * 16] ;m0 = [0l+1l] Row1l + pmaddwd m0, [r5 + 0 * 32] ;m0 = [0l+1l] Row1l punpckhwd m1, m3 - pmaddwd m1, [r5 + 0 * 16] ;m1 = [0h+1h] Row1h + pmaddwd m1, [r5 + 0 * 32] ;m1 = [0h+1h] Row1h movu m4, [r0 + 2 * r1] punpcklwd m2, m3, m4 - pmaddwd m2, [r5 + 0 * 16] ;m2 = [1l+2l] Row2l + pmaddwd m2, [r5 + 0 * 32] ;m2 = [1l+2l] Row2l punpckhwd m3, m4 - pmaddwd m3, [r5 + 0 * 16] ;m3 = [1h+2h] Row2h + pmaddwd m3, [r5 + 0 * 32] ;m3 = [1h+2h] Row2h lea r0, [r0 + 2 * r1] movu m5, [r0 + r1] punpcklwd m6, m4, m5 - pmaddwd m6, [r5 + 1 * 16] ;m6 = [2l+3l] Row1l + pmaddwd m6, [r5 + 1 * 32] ;m6 = [2l+3l] Row1l paddd m0, m6 ;m0 = [0l+1l+2l+3l] Row1l sum punpckhwd m4, m5 - pmaddwd m4, [r5 + 1 * 16] ;m6 = [2h+3h] Row1h + pmaddwd m4, [r5 + 1 * 32] ;m6 = [2h+3h] Row1h paddd m1, m4 ;m1 = [0h+1h+2h+3h] Row1h sum movu m4, [r0 + 2 * r1] punpcklwd m6, m5, m4 - pmaddwd m6, [r5 + 1 * 16] ;m6 = [3l+4l] Row2l + pmaddwd m6, [r5 + 1 * 32] ;m6 = [3l+4l] Row2l paddd m2, m6 ;m2 = [1l+2l+3l+4l] Row2l sum punpckhwd m5, m4 - pmaddwd m5, [r5 + 1 * 16] ;m1 = [3h+4h] Row2h + pmaddwd m5, [r5 + 1 * 32] ;m1 = [3h+4h] Row2h paddd m3, m5 ;m3 = [1h+2h+3h+4h] Row2h sum %endmacro @@ -2253,7 +4700,7 @@ add r1d, r1d add r3d, r3d sub r0, r1 - shl r4d, 5 + shl r4d, 6 %ifdef PIC lea r5, [tab_ChromaCoeffV] @@ -2265,11 +4712,11 @@ mov r4d, %2/2 %ifidn %3, pp - mova m7, [tab_c_32] + mova m7, [INTERP_OFFSET_PP] %elifidn %3, sp - mova m7, [tab_c_524800] + mova m7, [INTERP_OFFSET_SP] %elifidn %3, ps - mova m7, [tab_c_n32768] + mova m7, [INTERP_OFFSET_PS] %endif .loopH: @@ -2288,10 +4735,10 @@ paddd m1, m7 paddd m2, m7 paddd m3, m7 - psrad m0, 2 - psrad m1, 2 - psrad m2, 2 - psrad m3, 2 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS packssdw m0, m1 packssdw m2, m3 @@ -2301,15 +4748,15 @@ paddd m2, m7 paddd m3, m7 %ifidn %3, pp - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP %else - psrad m0, 10 - psrad m1, 10 - psrad m2, 10 - psrad m3, 10 + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP %endif packssdw m0, m1 packssdw m2, m3 @@ -2325,7 +4772,6 @@ dec r4d jnz .loopH - RET %endmacro @@ -2366,10 +4812,1028 @@ FILTER_VER_CHROMA_W8 8, 12, pp, 8 FILTER_VER_CHROMA_W8 8, 64, pp, 8 +%macro PROCESS_CHROMA_VERT_W16_2R 0 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, [r5 + 0 * 32] + punpckhwd m1, m3 + pmaddwd m1, [r5 + 0 * 32] + + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, [r5 + 0 * 32] + punpckhwd m3, m4 + pmaddwd m3, [r5 + 0 * 32] + + lea r0, [r0 + 2 * r1] + movu m5, [r0 + r1] + punpcklwd 
m6, m4, m5 + pmaddwd m6, [r5 + 1 * 32] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 * 32] + paddd m1, m4 + + movu m4, [r0 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * 32] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * 32] + paddd m3, m5 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_AVX2_6xN 2 +INIT_YMM avx2 +%if ARCH_X86_64 +cglobal interp_4tap_vert_%2_6x%1, 4, 7, 10 + mov r4d, r4m + add r1d, r1d + add r3d, r3d + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + + sub r0, r1 + mov r6d, %1/4 + +%ifidn %2,pp + vbroadcasti128 m8, [INTERP_OFFSET_PP] +%elifidn %2, sp + mova m8, [INTERP_OFFSET_SP] +%else + vbroadcasti128 m8, [INTERP_OFFSET_PS] +%endif + +.loopH: + movu xm0, [r0] + movu xm1, [r0 + r1] + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + + movu xm2, [r0 + r1 * 2] + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + + lea r4, [r1 * 3] + movu xm3, [r0 + r4] + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m4 + + lea r0, [r0 + r1 * 4] + movu xm4, [r0] + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + pmaddwd m3, [r5] + paddd m1, m5 + + movu xm5, [r0 + r1] + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + pmaddwd m4, [r5] + paddd m2, m6 + + movu xm6, [r0 + r1 * 2] + punpckhwd xm7, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddwd m7, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m7 + lea r4, [r3 * 3] +%ifidn %2,ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 +%else + paddd m0, m8 + paddd m1, m8 + paddd m2, m8 + paddd m3, m8 +%ifidn %2,pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%elifidn %2, sp + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS +%endif +%endif + + packssdw m0, m1 + packssdw m2, m3 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + pxor m5, m5 + mova m9, [pw_pixel_max] +%ifidn %2,pp + CLIPW m0, m5, m9 + CLIPW m2, m5, m9 +%elifidn %2, sp + CLIPW m0, m5, m9 + CLIPW m2, m5, m9 +%endif + + vextracti128 xm1, m0, 1 + vextracti128 xm3, m2, 1 + movq [r2], xm0 + pextrd [r2 + 8], xm0, 2 + movq [r2 + r3], xm1 + pextrd [r2 + r3 + 8], xm1, 2 + movq [r2 + r3 * 2], xm2 + pextrd [r2 + r3 * 2 + 8], xm2, 2 + movq [r2 + r4], xm3 + pextrd [r2 + r4 + 8], xm3, 2 + + lea r2, [r2 + r3 * 4] + dec r6d + jnz .loopH + RET +%endif +%endmacro +FILTER_VER_CHROMA_AVX2_6xN 8, pp +FILTER_VER_CHROMA_AVX2_6xN 8, ps +FILTER_VER_CHROMA_AVX2_6xN 8, ss +FILTER_VER_CHROMA_AVX2_6xN 8, sp +FILTER_VER_CHROMA_AVX2_6xN 16, pp +FILTER_VER_CHROMA_AVX2_6xN 16, ps +FILTER_VER_CHROMA_AVX2_6xN 16, ss +FILTER_VER_CHROMA_AVX2_6xN 16, sp + 
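; NOTE: every AVX2 vertical kernel that follows (16xN, 32xN, 64xN,
; 12xN, 24xN, 48x64) reuses the two-row scheme of
; PROCESS_CHROMA_VERT_W16_2R above: punpcklwd/punpckhwd interleave two
; vertically adjacent rows so a single pmaddwd multiplies each 32-bit
; lane by one coefficient pair, (c0,c1) from [r5] and (c2,c3) from
; [r5 + 1 * mmsize], accumulated with paddd. A scalar sketch of the
; 4-tap vertical filter this computes (tap placement inferred from the
; sub r0, r1 preceding each loop):
;
;     for (int x = 0; x < width; x++)
;     {
;         int sum = c[0] * src[x - 1 * srcStride]
;                 + c[1] * src[x + 0 * srcStride]
;                 + c[2] * src[x + 1 * srcStride]
;                 + c[3] * src[x + 2 * srcStride];
;         dst[x] = round_and_clip(sum);  /* pp/ps/sp/ss, as above */
;     }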
+;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W16_16xN_avx2 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%2_16x%1, 5, 6, %3 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + + mov r4d, %1/2 + +%ifidn %2, pp + mova m7, [INTERP_OFFSET_PP] +%elifidn %2, sp + mova m7, [INTERP_OFFSET_SP] +%elifidn %2, ps + mova m7, [INTERP_OFFSET_PS] +%endif + +.loopH: + PROCESS_CHROMA_VERT_W16_2R +%ifidn %2, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%elifidn %2, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + %ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%endif + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, [pw_pixel_max] +%endif + + movu [r2], m0 + movu [r2 + r3], m2 + lea r2, [r2 + 2 * r3] + dec r4d + jnz .loopH + RET +%endmacro + FILTER_VER_CHROMA_W16_16xN_avx2 4, pp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 8, pp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 12, pp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 24, pp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 16, pp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 32, pp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 64, pp, 8 + + FILTER_VER_CHROMA_W16_16xN_avx2 4, ps, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 8, ps, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 12, ps, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 24, ps, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 16, ps, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 32, ps, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 64, ps, 8 + + FILTER_VER_CHROMA_W16_16xN_avx2 4, ss, 7 + FILTER_VER_CHROMA_W16_16xN_avx2 8, ss, 7 + FILTER_VER_CHROMA_W16_16xN_avx2 12, ss, 7 + FILTER_VER_CHROMA_W16_16xN_avx2 24, ss, 7 + FILTER_VER_CHROMA_W16_16xN_avx2 16, ss, 7 + FILTER_VER_CHROMA_W16_16xN_avx2 32, ss, 7 + FILTER_VER_CHROMA_W16_16xN_avx2 64, ss, 7 + + FILTER_VER_CHROMA_W16_16xN_avx2 4, sp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 8, sp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 12, sp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 24, sp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 16, sp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 32, sp, 8 + FILTER_VER_CHROMA_W16_16xN_avx2 64, sp, 8 + +%macro PROCESS_CHROMA_VERT_W32_2R 0 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, [r5 + 0 * mmsize] + punpckhwd m1, m3 + pmaddwd m1, [r5 + 0 * mmsize] + + movu m9, [r0 + mmsize] + movu m11, [r0 + r1 + mmsize] + punpcklwd m8, m9, m11 + pmaddwd m8, [r5 + 0 * mmsize] + punpckhwd m9, m11 + pmaddwd m9, [r5 + 0 * mmsize] + + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, [r5 + 0 * mmsize] + punpckhwd m3, m4 + pmaddwd m3, [r5 + 0 * mmsize] + + movu m12, [r0 + 2 * r1 + mmsize] + punpcklwd m10, m11, m12 + pmaddwd m10, [r5 + 0 * mmsize] + punpckhwd m11, m12 + pmaddwd m11, [r5 + 0 * mmsize] + + lea r6, [r0 + 2 * 
r1] + movu m5, [r6 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 * mmsize] + paddd m1, m4 + + movu m13, [r6 + r1 + mmsize] + punpcklwd m14, m12, m13 + pmaddwd m14, [r5 + 1 * mmsize] + paddd m8, m14 + punpckhwd m12, m13 + pmaddwd m12, [r5 + 1 * mmsize] + paddd m9, m12 + + movu m4, [r6 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m3, m5 + + movu m12, [r6 + 2 * r1 + mmsize] + punpcklwd m14, m13, m12 + pmaddwd m14, [r5 + 1 * mmsize] + paddd m10, m14 + punpckhwd m13, m12 + pmaddwd m13, [r5 + 1 * mmsize] + paddd m11, m13 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W16_32xN_avx2 3 +INIT_YMM avx2 +%if ARCH_X86_64 +cglobal interp_4tap_vert_%2_32x%1, 5, 7, %3 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + mov r4d, %1/2 + +%ifidn %2, pp + mova m7, [INTERP_OFFSET_PP] +%elifidn %2, sp + mova m7, [INTERP_OFFSET_SP] +%elifidn %2, ps + mova m7, [INTERP_OFFSET_PS] +%endif + +.loopH: + PROCESS_CHROMA_VERT_W32_2R +%ifidn %2, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + psrad m8, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 +%elifidn %2, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + paddd m8, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + psrad m8, INTERP_SHIFT_PS + psrad m9, INTERP_SHIFT_PS + psrad m10, INTERP_SHIFT_PS + psrad m11, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m8, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + %ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + psrad m8, INTERP_SHIFT_PP + psrad m9, INTERP_SHIFT_PP + psrad m10, INTERP_SHIFT_PP + psrad m11, INTERP_SHIFT_PP +%else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m8, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP +%endif + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 + pxor m5, m5 + CLIPW2 m0, m2, m5, [pw_pixel_max] + CLIPW2 m8, m10, m5, [pw_pixel_max] +%endif + + movu [r2], m0 + movu [r2 + r3], m2 + movu [r2 + mmsize], m8 + movu [r2 + r3 + mmsize], m10 + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + dec r4d + jnz .loopH + RET +%endif +%endmacro + FILTER_VER_CHROMA_W16_32xN_avx2 8, pp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 16, pp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 24, pp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 32, pp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 48, pp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 64, pp, 15 + + FILTER_VER_CHROMA_W16_32xN_avx2 8, ps, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 16, ps, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 24, ps, 15 
+ FILTER_VER_CHROMA_W16_32xN_avx2 32, ps, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 48, ps, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 64, ps, 15 + + FILTER_VER_CHROMA_W16_32xN_avx2 8, ss, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 16, ss, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 24, ss, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 32, ss, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 48, ss, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 64, ss, 15 + + FILTER_VER_CHROMA_W16_32xN_avx2 8, sp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 16, sp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 24, sp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 32, sp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 48, sp, 15 + FILTER_VER_CHROMA_W16_32xN_avx2 64, sp, 15 + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W16_64xN_avx2 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%2_64x%1, 5, 7, %3 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + mov r4d, %1/2 + +%ifidn %2, pp + mova m7, [INTERP_OFFSET_PP] +%elifidn %2, sp + mova m7, [INTERP_OFFSET_SP] +%elifidn %2, ps + mova m7, [INTERP_OFFSET_PS] +%endif + +.loopH: +%assign x 0 +%rep 4 + movu m1, [r0 + x] + movu m3, [r0 + r1 + x] + movu m5, [r5 + 0 * mmsize] + punpcklwd m0, m1, m3 + pmaddwd m0, m5 + punpckhwd m1, m3 + pmaddwd m1, m5 + + movu m4, [r0 + 2 * r1 + x] + punpcklwd m2, m3, m4 + pmaddwd m2, m5 + punpckhwd m3, m4 + pmaddwd m3, m5 + + lea r6, [r0 + 2 * r1] + movu m5, [r6 + r1 + x] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 * mmsize] + paddd m1, m4 + + movu m4, [r6 + 2 * r1 + x] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m3, m5 + +%ifidn %2, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%elifidn %2, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 +%ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%endif + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, [pw_pixel_max] +%endif + + movu [r2 + x], m0 + movu [r2 + r3 + x], m2 +%assign x x+mmsize +%endrep + + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + dec r4d + jnz .loopH + RET +%endmacro + FILTER_VER_CHROMA_W16_64xN_avx2 16, ss, 7 + FILTER_VER_CHROMA_W16_64xN_avx2 32, ss, 7 + FILTER_VER_CHROMA_W16_64xN_avx2 48, ss, 7 + FILTER_VER_CHROMA_W16_64xN_avx2 64, ss, 7 + FILTER_VER_CHROMA_W16_64xN_avx2 16, sp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 32, sp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 48, sp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 64, sp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 16, ps, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 32, ps, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 48, ps, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 64, ps, 8 + 
FILTER_VER_CHROMA_W16_64xN_avx2 16, pp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 32, pp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 48, pp, 8 + FILTER_VER_CHROMA_W16_64xN_avx2 64, pp, 8 + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W16_12xN_avx2 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%2_12x%1, 5, 8, %3 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + mov r4d, %1/2 + +%ifidn %2, pp + mova m7, [INTERP_OFFSET_PP] +%elifidn %2, sp + mova m7, [INTERP_OFFSET_SP] +%elifidn %2, ps + mova m7, [INTERP_OFFSET_PS] +%endif + +.loopH: + PROCESS_CHROMA_VERT_W16_2R +%ifidn %2, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%elifidn %2, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + %ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%endif + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, [pw_pixel_max] +%endif + + movu [r2], xm0 + movu [r2 + r3], xm2 + vextracti128 xm0, m0, 1 + vextracti128 xm2, m2, 1 + movq [r2 + 16], xm0 + movq [r2 + r3 + 16], xm2 + lea r2, [r2 + 2 * r3] + dec r4d + jnz .loopH + RET +%endmacro + FILTER_VER_CHROMA_W16_12xN_avx2 16, ss, 7 + FILTER_VER_CHROMA_W16_12xN_avx2 16, sp, 8 + FILTER_VER_CHROMA_W16_12xN_avx2 16, ps, 8 + FILTER_VER_CHROMA_W16_12xN_avx2 16, pp, 8 + FILTER_VER_CHROMA_W16_12xN_avx2 32, ss, 7 + FILTER_VER_CHROMA_W16_12xN_avx2 32, sp, 8 + FILTER_VER_CHROMA_W16_12xN_avx2 32, ps, 8 + FILTER_VER_CHROMA_W16_12xN_avx2 32, pp, 8 + +%macro PROCESS_CHROMA_VERT_W24_2R 0 + movu m1, [r0] + movu m3, [r0 + r1] + punpcklwd m0, m1, m3 + pmaddwd m0, [r5 + 0 * mmsize] + punpckhwd m1, m3 + pmaddwd m1, [r5 + 0 * mmsize] + + movu xm9, [r0 + mmsize] + movu xm11, [r0 + r1 + mmsize] + punpcklwd xm8, xm9, xm11 + pmaddwd xm8, [r5 + 0 * mmsize] + punpckhwd xm9, xm11 + pmaddwd xm9, [r5 + 0 * mmsize] + + movu m4, [r0 + 2 * r1] + punpcklwd m2, m3, m4 + pmaddwd m2, [r5 + 0 * mmsize] + punpckhwd m3, m4 + pmaddwd m3, [r5 + 0 * mmsize] + + movu xm12, [r0 + 2 * r1 + mmsize] + punpcklwd xm10, xm11, xm12 + pmaddwd xm10, [r5 + 0 * mmsize] + punpckhwd xm11, xm12 + pmaddwd xm11, [r5 + 0 * mmsize] + + lea r6, [r0 + 2 * r1] + movu m5, [r6 + r1] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 * mmsize] + paddd m1, m4 + + movu xm13, [r6 + r1 + mmsize] + punpcklwd xm14, xm12, xm13 + pmaddwd xm14, [r5 + 1 * mmsize] + paddd xm8, xm14 + punpckhwd xm12, xm13 + pmaddwd xm12, [r5 + 1 * mmsize] + paddd xm9, xm12 + + movu m4, [r6 + 2 * r1] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m3, m5 + + movu xm12, [r6 + 2 * r1 + mmsize] + punpcklwd xm14, xm13, xm12 + 
pmaddwd xm14, [r5 + 1 * mmsize] + paddd xm10, xm14 + punpckhwd xm13, xm12 + pmaddwd xm13, [r5 + 1 * mmsize] + paddd xm11, xm13 +%endmacro + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W16_24xN_avx2 3 +INIT_YMM avx2 +%if ARCH_X86_64 +cglobal interp_4tap_vert_%2_24x%1, 5, 7, %3 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + r4] +%endif + mov r4d, %1/2 + +%ifidn %2, pp + mova m7, [INTERP_OFFSET_PP] +%elifidn %2, sp + mova m7, [INTERP_OFFSET_SP] +%elifidn %2, ps + mova m7, [INTERP_OFFSET_PS] +%endif + +.loopH: + PROCESS_CHROMA_VERT_W24_2R +%ifidn %2, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + psrad m8, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 +%elifidn %2, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + paddd m8, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + psrad m8, INTERP_SHIFT_PS + psrad m9, INTERP_SHIFT_PS + psrad m10, INTERP_SHIFT_PS + psrad m11, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + paddd m8, m7 + paddd m9, m7 + paddd m10, m7 + paddd m11, m7 + %ifidn %2, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + psrad m8, INTERP_SHIFT_PP + psrad m9, INTERP_SHIFT_PP + psrad m10, INTERP_SHIFT_PP + psrad m11, INTERP_SHIFT_PP +%else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m8, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP +%endif + packssdw m0, m1 + packssdw m2, m3 + packssdw m8, m9 + packssdw m10, m11 + pxor m5, m5 + CLIPW2 m0, m2, m5, [pw_pixel_max] + CLIPW2 m8, m10, m5, [pw_pixel_max] +%endif + + movu [r2], m0 + movu [r2 + r3], m2 + movu [r2 + mmsize], xm8 + movu [r2 + r3 + mmsize], xm10 + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + dec r4d + jnz .loopH + RET +%endif +%endmacro + FILTER_VER_CHROMA_W16_24xN_avx2 32, ss, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 32, sp, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 32, ps, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 32, pp, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 64, ss, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 64, sp, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 64, ps, 15 + FILTER_VER_CHROMA_W16_24xN_avx2 64, pp, 15 + +;----------------------------------------------------------------------------------------------------------------- +; void interp_4tap_vert(int16_t *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_CHROMA_W16_48x64_avx2 2 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_48x64, 5, 7, %2 + add r1d, r1d + add r3d, r3d + sub r0, r1 + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + lea r5, [r5 + r4] +%else + lea r5, [tab_ChromaCoeffV + 
r4] +%endif + mov r4d, 32 + +%ifidn %1, pp + mova m7, [INTERP_OFFSET_PP] +%elifidn %1, sp + mova m7, [INTERP_OFFSET_SP] +%elifidn %1, ps + mova m7, [INTERP_OFFSET_PS] +%endif + +.loopH: +%assign x 0 +%rep 3 + movu m1, [r0 + x] + movu m3, [r0 + r1 + x] + movu m5, [r5 + 0 * mmsize] + punpcklwd m0, m1, m3 + pmaddwd m0, m5 + punpckhwd m1, m3 + pmaddwd m1, m5 + + movu m4, [r0 + 2 * r1 + x] + punpcklwd m2, m3, m4 + pmaddwd m2, m5 + punpckhwd m3, m4 + pmaddwd m3, m5 + + lea r6, [r0 + 2 * r1] + movu m5, [r6 + r1 + x] + punpcklwd m6, m4, m5 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m0, m6 + punpckhwd m4, m5 + pmaddwd m4, [r5 + 1 * mmsize] + paddd m1, m4 + + movu m4, [r6 + 2 * r1 + x] + punpcklwd m6, m5, m4 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m2, m6 + punpckhwd m5, m4 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m3, m5 + +%ifidn %1, ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + + packssdw m0, m1 + packssdw m2, m3 +%elifidn %1, ps + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + + packssdw m0, m1 + packssdw m2, m3 +%else + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 +%ifidn %1, pp + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%else + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%endif + packssdw m0, m1 + packssdw m2, m3 + pxor m5, m5 + CLIPW2 m0, m2, m5, [pw_pixel_max] +%endif + + movu [r2 + x], m0 + movu [r2 + r3 + x], m2 +%assign x x+mmsize +%endrep + + lea r2, [r2 + 2 * r3] + lea r0, [r0 + 2 * r1] + dec r4d + jnz .loopH + RET +%endmacro + + FILTER_VER_CHROMA_W16_48x64_avx2 pp, 8 + FILTER_VER_CHROMA_W16_48x64_avx2 ps, 8 + FILTER_VER_CHROMA_W16_48x64_avx2 ss, 7 + FILTER_VER_CHROMA_W16_48x64_avx2 sp, 8 INIT_XMM sse2 cglobal chroma_p2s, 3, 7, 3 - ; load width and height mov r3d, r3m mov r4d, r4m @@ -2385,11 +5849,11 @@ lea r6, [r0 + r5 * 2] movu m0, [r6] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) paddw m0, m2 movu m1, [r6 + r1] - psllw m1, 4 + psllw m1, (14 - BIT_DEPTH) paddw m1, m2 add r5d, 8 @@ -2422,7 +5886,6 @@ sub r4d, 2 jnz .loopH - RET %macro PROCESS_LUMA_VER_W4_4R 0 @@ -2510,7 +5973,7 @@ lea r6, [tab_LumaCoeffV + r4] %endif - mova m7, [pd_32] + mova m7, [INTERP_OFFSET_PP] mov dword [rsp], %2/4 .loopH: @@ -2523,10 +5986,10 @@ paddd m2, m7 paddd m3, m7 - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP packssdw m0, m1 packssdw m2, m3 @@ -2552,7 +6015,6 @@ dec dword [rsp] jnz .loopH - RET %endmacro @@ -2608,7 +6070,7 @@ %elifidn %1, sp mova m6, [pd_524800] %else - vbroadcasti128 m6, [pd_n32768] + vbroadcasti128 m6, [INTERP_OFFSET_PS] %endif movq xm0, [r0] @@ -2661,14 +6123,14 @@ paddd m0, m6 paddd m2, m6 %ifidn %1,pp - psrad m0, 6 - psrad m2, 6 + psrad m0, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP %elifidn %1, sp - psrad m0, 10 - psrad m2, 10 + psrad m0, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP %else - psrad m0, 2 - psrad m2, 2 + psrad m0, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS %endif %endif @@ -2718,7 +6180,7 @@ %elifidn %1, sp mova m11, [pd_524800] %else - vbroadcasti128 m11, [pd_n32768] + vbroadcasti128 m11, [INTERP_OFFSET_PS] %endif movu xm0, [r0] ; m0 = row 0 @@ -2829,20 +6291,20 @@ paddd m2, m11 paddd m3, m11 %ifidn %1,pp - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 -%elifidn 
%1, sp - psrad m0, 10 - psrad m1, 10 - psrad m2, 10 - psrad m3, 10 -%else - psrad m0, 2 - psrad m1, 2 - psrad m2, 2 - psrad m3, 2 + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%elifidn %1, sp + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS %endif %endif @@ -2900,20 +6362,20 @@ paddd m6, m11 paddd m7, m11 %ifidn %1,pp - psrad m4, 6 - psrad m5, 6 - psrad m6, 6 - psrad m7, 6 -%elifidn %1, sp - psrad m4, 10 - psrad m5, 10 - psrad m6, 10 - psrad m7, 10 -%else - psrad m4, 2 - psrad m5, 2 - psrad m6, 2 - psrad m7, 2 + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + psrad m6, INTERP_SHIFT_PP + psrad m7, INTERP_SHIFT_PP +%elifidn %1, sp + psrad m4, INTERP_SHIFT_SP + psrad m5, INTERP_SHIFT_SP + psrad m6, INTERP_SHIFT_SP + psrad m7, INTERP_SHIFT_SP +%else + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + psrad m6, INTERP_SHIFT_PS + psrad m7, INTERP_SHIFT_PS %endif %endif @@ -3073,26 +6535,26 @@ paddd m4, m14 paddd m5, m14 %ifidn %1,pp - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - psrad m4, 6 - psrad m5, 6 -%elifidn %1, sp - psrad m0, 10 - psrad m1, 10 - psrad m2, 10 - psrad m3, 10 - psrad m4, 10 - psrad m5, 10 -%else - psrad m0, 2 - psrad m1, 2 - psrad m2, 2 - psrad m3, 2 - psrad m4, 2 - psrad m5, 2 + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP +%elifidn %1, sp + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m4, INTERP_SHIFT_SP + psrad m5, INTERP_SHIFT_SP +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS %endif %endif @@ -3155,14 +6617,14 @@ paddd m6, m14 paddd m7, m14 %ifidn %1,pp - psrad m6, 6 - psrad m7, 6 + psrad m6, INTERP_SHIFT_PP + psrad m7, INTERP_SHIFT_PP %elifidn %1, sp - psrad m6, 10 - psrad m7, 10 + psrad m6, INTERP_SHIFT_SP + psrad m7, INTERP_SHIFT_SP %else - psrad m6, 2 - psrad m7, 2 + psrad m6, INTERP_SHIFT_PS + psrad m7, INTERP_SHIFT_PS %endif %endif @@ -3269,32 +6731,32 @@ paddd m0, m14 paddd m1, m14 %ifidn %1,pp - psrad m8, 6 - psrad m9, 6 - psrad m10, 6 - psrad m11, 6 - psrad m12, 6 - psrad m13, 6 - psrad m0, 6 - psrad m1, 6 -%elifidn %1, sp - psrad m8, 10 - psrad m9, 10 - psrad m10, 10 - psrad m11, 10 - psrad m12, 10 - psrad m13, 10 - psrad m0, 10 - psrad m1, 10 -%else - psrad m8, 2 - psrad m9, 2 - psrad m10, 2 - psrad m11, 2 - psrad m12, 2 - psrad m13, 2 - psrad m0, 2 - psrad m1, 2 + psrad m8, INTERP_SHIFT_PP + psrad m9, INTERP_SHIFT_PP + psrad m10, INTERP_SHIFT_PP + psrad m11, INTERP_SHIFT_PP + psrad m12, INTERP_SHIFT_PP + psrad m13, INTERP_SHIFT_PP + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP +%elifidn %1, sp + psrad m8, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP + psrad m12, INTERP_SHIFT_SP + psrad m13, INTERP_SHIFT_SP + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP +%else + psrad m8, INTERP_SHIFT_PS + psrad m9, INTERP_SHIFT_PS + psrad m10, INTERP_SHIFT_PS + psrad m11, INTERP_SHIFT_PS + psrad m12, INTERP_SHIFT_PS + psrad m13, INTERP_SHIFT_PS + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS %endif %endif @@ 
-3354,9 +6816,9 @@ %ifidn %1,pp vbroadcasti128 m14, [pd_32] %elifidn %1, sp - mova m14, [pd_524800] + mova m14, [INTERP_OFFSET_SP] %else - vbroadcasti128 m14, [pd_n32768] + vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif lea r6, [r3 * 3] mov r9d, %2 / 8 @@ -3405,9 +6867,9 @@ %ifidn %3,pp vbroadcasti128 m14, [pd_32] %elifidn %3, sp - mova m14, [pd_524800] + mova m14, [INTERP_OFFSET_SP] %else - vbroadcasti128 m14, [pd_n32768] + vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif lea r6, [r3 * 3] @@ -3488,9 +6950,9 @@ %ifidn %1,pp vbroadcasti128 m14, [pd_32] %elifidn %1, sp - mova m14, [pd_524800] + mova m14, [INTERP_OFFSET_SP] %else - vbroadcasti128 m14, [pd_n32768] + vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif lea r6, [r3 * 3] lea r7, [r1 * 4] @@ -3624,26 +7086,26 @@ paddd m4, m14 paddd m5, m14 %ifidn %1,pp - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - psrad m4, 6 - psrad m5, 6 -%elifidn %1, sp - psrad m0, 10 - psrad m1, 10 - psrad m2, 10 - psrad m3, 10 - psrad m4, 10 - psrad m5, 10 -%else - psrad m0, 2 - psrad m1, 2 - psrad m2, 2 - psrad m3, 2 - psrad m4, 2 - psrad m5, 2 + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP +%elifidn %1, sp + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m4, INTERP_SHIFT_SP + psrad m5, INTERP_SHIFT_SP +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS %endif %endif @@ -3706,14 +7168,14 @@ paddd m6, m14 paddd m7, m14 %ifidn %1,pp - psrad m6, 6 - psrad m7, 6 + psrad m6, INTERP_SHIFT_PP + psrad m7, INTERP_SHIFT_PP %elifidn %1, sp - psrad m6, 10 - psrad m7, 10 + psrad m6, INTERP_SHIFT_SP + psrad m7, INTERP_SHIFT_SP %else - psrad m6, 2 - psrad m7, 2 + psrad m6, INTERP_SHIFT_PS + psrad m7, INTERP_SHIFT_PS %endif %endif @@ -3820,32 +7282,32 @@ paddd m0, m14 paddd m1, m14 %ifidn %1,pp - psrad m8, 6 - psrad m9, 6 - psrad m10, 6 - psrad m11, 6 - psrad m12, 6 - psrad m13, 6 - psrad m0, 6 - psrad m1, 6 -%elifidn %1, sp - psrad m8, 10 - psrad m9, 10 - psrad m10, 10 - psrad m11, 10 - psrad m12, 10 - psrad m13, 10 - psrad m0, 10 - psrad m1, 10 -%else - psrad m8, 2 - psrad m9, 2 - psrad m10, 2 - psrad m11, 2 - psrad m12, 2 - psrad m13, 2 - psrad m0, 2 - psrad m1, 2 + psrad m8, INTERP_SHIFT_PP + psrad m9, INTERP_SHIFT_PP + psrad m10, INTERP_SHIFT_PP + psrad m11, INTERP_SHIFT_PP + psrad m12, INTERP_SHIFT_PP + psrad m13, INTERP_SHIFT_PP + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP +%elifidn %1, sp + psrad m8, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP + psrad m12, INTERP_SHIFT_SP + psrad m13, INTERP_SHIFT_SP + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP +%else + psrad m8, INTERP_SHIFT_PS + psrad m9, INTERP_SHIFT_PS + psrad m10, INTERP_SHIFT_PS + psrad m11, INTERP_SHIFT_PS + psrad m12, INTERP_SHIFT_PS + psrad m13, INTERP_SHIFT_PS + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS %endif %endif @@ -4020,26 +7482,26 @@ paddd m4, m11 paddd m5, m11 %ifidn %1,pp - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 - psrad m4, 6 - psrad m5, 6 -%elifidn %1, sp - psrad m0, 10 - psrad m1, 10 - psrad m2, 10 - psrad m3, 10 - psrad m4, 10 - psrad m5, 10 -%else - psrad m0, 2 - psrad m1, 2 - psrad m2, 2 - psrad m3, 2 - psrad m4, 2 - psrad m5, 2 + psrad m0, INTERP_SHIFT_PP + psrad m1, 
INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP +%elifidn %1, sp + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP + psrad m4, INTERP_SHIFT_SP + psrad m5, INTERP_SHIFT_SP +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS %endif %endif @@ -4091,14 +7553,14 @@ paddd m6, m11 paddd m7, m11 %ifidn %1,pp - psrad m6, 6 - psrad m7, 6 + psrad m6, INTERP_SHIFT_PP + psrad m7, INTERP_SHIFT_PP %elifidn %1, sp - psrad m6, 10 - psrad m7, 10 + psrad m6, INTERP_SHIFT_SP + psrad m7, INTERP_SHIFT_SP %else - psrad m6, 2 - psrad m7, 2 + psrad m6, INTERP_SHIFT_PS + psrad m7, INTERP_SHIFT_PS %endif %endif @@ -4135,9 +7597,9 @@ %ifidn %1,pp vbroadcasti128 m11, [pd_32] %elifidn %1, sp - mova m11, [pd_524800] + mova m11, [INTERP_OFFSET_SP] %else - vbroadcasti128 m11, [pd_n32768] + vbroadcasti128 m11, [INTERP_OFFSET_PS] %endif mova m12, [pw_pixel_max] lea r6, [r3 * 3] @@ -4182,9 +7644,9 @@ %ifidn %1,pp vbroadcasti128 m14, [pd_32] %elifidn %1, sp - mova m14, [pd_524800] + mova m14, [INTERP_OFFSET_SP] %else - vbroadcasti128 m14, [pd_n32768] + vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif lea r6, [r3 * 3] mov r9d, 4 @@ -4300,20 +7762,20 @@ paddd m2, m7 paddd m3, m7 %ifidn %1,pp - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 -%elifidn %1, sp - psrad m0, 10 - psrad m1, 10 - psrad m2, 10 - psrad m3, 10 -%else - psrad m0, 2 - psrad m1, 2 - psrad m2, 2 - psrad m3, 2 + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%elifidn %1, sp + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS %endif %endif @@ -4336,7 +7798,7 @@ %macro FILTER_VER_LUMA_AVX2_16x4 1 INIT_YMM avx2 -cglobal interp_8tap_vert_%1_16x4, 4, 7, 8, 0 - gprsize +cglobal interp_8tap_vert_%1_16x4, 4, 7, 8, 0-gprsize mov r4d, r4m shl r4d, 7 add r1d, r1d @@ -4354,9 +7816,9 @@ %ifidn %1,pp vbroadcasti128 m7, [pd_32] %elifidn %1, sp - mova m7, [pd_524800] + mova m7, [INTERP_OFFSET_SP] %else - vbroadcasti128 m7, [pd_n32768] + vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif mov dword [rsp], 2 .loopW: @@ -4399,9 +7861,9 @@ %ifidn %1,pp vbroadcasti128 m7, [pd_32] %elifidn %1, sp - mova m7, [pd_524800] + mova m7, [INTERP_OFFSET_SP] %else - vbroadcasti128 m7, [pd_n32768] + vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif PROCESS_LUMA_AVX2_W8_4R %1 @@ -4439,9 +7901,9 @@ %ifidn %1,pp vbroadcasti128 m14, [pd_32] %elifidn %1, sp - mova m14, [pd_524800] + mova m14, [INTERP_OFFSET_SP] %else - vbroadcasti128 m14, [pd_n32768] + vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif mova m13, [pw_pixel_max] pxor m12, m12 @@ -4549,20 +8011,20 @@ paddd m2, m14 paddd m3, m14 %ifidn %1,pp - psrad m0, 6 - psrad m1, 6 - psrad m2, 6 - psrad m3, 6 -%elifidn %1, sp - psrad m0, 10 - psrad m1, 10 - psrad m2, 10 - psrad m3, 10 -%else - psrad m0, 2 - psrad m1, 2 - psrad m2, 2 - psrad m3, 2 + psrad m0, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP +%elifidn %1, sp + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP +%else + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, 
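
For completeness, evaluating the same formulas at BIT_DEPTH = 12 (the other depth implied by the (14 - BIT_DEPTH) expressions later in this diff) gives the values the new macros must switch to; this is my extrapolation from the pattern, not numbers stated anywhere in the diff:

    #include <stdio.h>

    int main(void)
    {
        int headRoom = 14 - 12;                                           /* 2 */
        printf("PS shift %d, SP shift %d\n", 6 - headRoom, 6 + headRoom); /* 4, 8 */
        printf("SP offset %d, PS offset %d\n",
               (1 << (6 + headRoom - 1)) + (8192 << 6),                   /* 524416  */
               -(8192 << (6 - headRoom)));                                /* -131072 */
        return 0;
    }
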
INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS %endif %endif @@ -4640,20 +8102,20 @@ paddd m6, m14 paddd m7, m14 %ifidn %1,pp - psrad m4, 6 - psrad m5, 6 - psrad m6, 6 - psrad m7, 6 -%elifidn %1, sp - psrad m4, 10 - psrad m5, 10 - psrad m6, 10 - psrad m7, 10 -%else - psrad m4, 2 - psrad m5, 2 - psrad m6, 2 - psrad m7, 2 + psrad m4, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP + psrad m6, INTERP_SHIFT_PP + psrad m7, INTERP_SHIFT_PP +%elifidn %1, sp + psrad m4, INTERP_SHIFT_SP + psrad m5, INTERP_SHIFT_SP + psrad m6, INTERP_SHIFT_SP + psrad m7, INTERP_SHIFT_SP +%else + psrad m4, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS + psrad m6, INTERP_SHIFT_PS + psrad m7, INTERP_SHIFT_PS %endif %endif @@ -4717,20 +8179,20 @@ paddd m10, m14 paddd m11, m14 %ifidn %1,pp - psrad m8, 6 - psrad m9, 6 - psrad m10, 6 - psrad m11, 6 -%elifidn %1, sp - psrad m8, 10 - psrad m9, 10 - psrad m10, 10 - psrad m11, 10 -%else - psrad m8, 2 - psrad m9, 2 - psrad m10, 2 - psrad m11, 2 + psrad m8, INTERP_SHIFT_PP + psrad m9, INTERP_SHIFT_PP + psrad m10, INTERP_SHIFT_PP + psrad m11, INTERP_SHIFT_PP +%elifidn %1, sp + psrad m8, INTERP_SHIFT_SP + psrad m9, INTERP_SHIFT_SP + psrad m10, INTERP_SHIFT_SP + psrad m11, INTERP_SHIFT_SP +%else + psrad m8, INTERP_SHIFT_PS + psrad m9, INTERP_SHIFT_PS + psrad m10, INTERP_SHIFT_PS + psrad m11, INTERP_SHIFT_PS %endif %endif @@ -4786,9 +8248,9 @@ %ifidn %1,pp vbroadcasti128 m7, [pd_32] %elifidn %1, sp - mova m7, [pd_524800] + mova m7, [INTERP_OFFSET_SP] %else - vbroadcasti128 m7, [pd_n32768] + vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif lea r6, [r3 * 3] @@ -4850,14 +8312,14 @@ paddd m0, m7 paddd m2, m7 %ifidn %1,pp - psrad m0, 6 - psrad m2, 6 + psrad m0, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP %elifidn %1, sp - psrad m0, 10 - psrad m2, 10 + psrad m0, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP %else - psrad m0, 2 - psrad m2, 2 + psrad m0, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS %endif %endif @@ -4901,14 +8363,14 @@ paddd m4, m7 paddd m1, m7 %ifidn %1,pp - psrad m4, 6 - psrad m1, 6 + psrad m4, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP %elifidn %1, sp - psrad m4, 10 - psrad m1, 10 + psrad m4, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP %else - psrad m4, 2 - psrad m1, 2 + psrad m4, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS %endif %endif @@ -4993,14 +8455,14 @@ paddd m0, m7 paddd m2, m7 %ifidn %1,pp - psrad m0, 6 - psrad m2, 6 + psrad m0, INTERP_SHIFT_PP + psrad m2, INTERP_SHIFT_PP %elifidn %1, sp - psrad m0, 10 - psrad m2, 10 + psrad m0, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP %else - psrad m0, 2 - psrad m2, 2 + psrad m0, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS %endif %endif @@ -5051,14 +8513,14 @@ paddd m4, m7 paddd m1, m7 %ifidn %1,pp - psrad m4, 6 - psrad m1, 6 + psrad m4, INTERP_SHIFT_PP + psrad m1, INTERP_SHIFT_PP %elifidn %1, sp - psrad m4, 10 - psrad m1, 10 + psrad m4, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP %else - psrad m4, 2 - psrad m1, 2 + psrad m4, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS %endif %endif @@ -5109,14 +8571,14 @@ paddd m6, m7 paddd m5, m7 %ifidn %1,pp - psrad m6, 6 - psrad m5, 6 + psrad m6, INTERP_SHIFT_PP + psrad m5, INTERP_SHIFT_PP %elifidn %1, sp - psrad m6, 10 - psrad m5, 10 + psrad m6, INTERP_SHIFT_SP + psrad m5, INTERP_SHIFT_SP %else - psrad m6, 2 - psrad m5, 2 + psrad m6, INTERP_SHIFT_PS + psrad m5, INTERP_SHIFT_PS %endif %endif @@ -5160,17 +8622,17 @@ paddd m0, m7 paddd m3, m7 %ifidn %1,pp - psrad m0, 6 - psrad m3, 6 + psrad m0, INTERP_SHIFT_PP + psrad m3, INTERP_SHIFT_PP %elifidn %1, sp - psrad m0, 10 - psrad m3, 10 + psrad m0, 
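
Every one of these blocks ends the same way after the shift: packssdw narrows the dwords with signed saturation, and for pp/sp outputs CLIPW clamps the result into [0, pw_pixel_max]. An illustrative scalar equivalent of that shared tail (the names here are mine):

    #include <stdint.h>

    int16_t filter_tail(int32_t sum, int32_t offset, int shift,
                        int clipToPixel, int32_t pixelMax)
    {
        int32_t v = (sum + offset) >> shift;       /* paddd + psrad            */
        if (v < INT16_MIN) v = INT16_MIN;          /* packssdw saturation      */
        if (v > INT16_MAX) v = INT16_MAX;
        if (clipToPixel)                           /* CLIPW m, 0, pw_pixel_max */
        {
            if (v < 0)        v = 0;
            if (v > pixelMax) v = pixelMax;
        }
        return (int16_t)v;
    }
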
INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP %else - psrad m0, 2 - psrad m3, 2 + psrad m0, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS %endif %endif - + packssdw m0, m3 %ifidn %1,pp CLIPW m0, m1, [pw_pixel_max] @@ -5206,9 +8668,9 @@ %ifidn %1,pp vbroadcasti128 m7, [pd_32] %elifidn %1, sp - mova m7, [pd_524800] + mova m7, [INTERP_OFFSET_SP] %else - vbroadcasti128 m7, [pd_n32768] + vbroadcasti128 m7, [INTERP_OFFSET_PS] %endif lea r6, [r3 * 3] PROCESS_LUMA_AVX2_W4_16R %1 @@ -5241,9 +8703,9 @@ %ifidn %1,pp vbroadcasti128 m14, [pd_32] %elifidn %1, sp - mova m14, [pd_524800] + mova m14, [INTERP_OFFSET_SP] %else - vbroadcasti128 m14, [pd_n32768] + vbroadcasti128 m14, [INTERP_OFFSET_PS] %endif lea r6, [r3 * 3] PROCESS_LUMA_AVX2_W8_16R %1 @@ -5280,7 +8742,7 @@ lea r6, [tab_LumaCoeffV + r4] %endif - mova m7, [pd_n32768] + mova m7, [INTERP_OFFSET_PS] mov dword [rsp], %2/4 .loopH: @@ -5293,10 +8755,10 @@ paddd m2, m7 paddd m3, m7 - psrad m0, 2 - psrad m1, 2 - psrad m2, 2 - psrad m3, 2 + psrad m0, INTERP_SHIFT_PS + psrad m1, INTERP_SHIFT_PS + psrad m2, INTERP_SHIFT_PS + psrad m3, INTERP_SHIFT_PS packssdw m0, m1 packssdw m2, m3 @@ -5319,7 +8781,6 @@ dec dword [rsp] jnz .loopH - RET %endmacro @@ -5372,7 +8833,7 @@ lea r6, [tab_LumaCoeffV + r4] %endif - mova m7, [tab_c_524800] + mova m7, [INTERP_OFFSET_SP] mov dword [rsp], %2/4 .loopH: @@ -5385,10 +8846,10 @@ paddd m2, m7 paddd m3, m7 - psrad m0, 10 - psrad m1, 10 - psrad m2, 10 - psrad m3, 10 + psrad m0, INTERP_SHIFT_SP + psrad m1, INTERP_SHIFT_SP + psrad m2, INTERP_SHIFT_SP + psrad m3, INTERP_SHIFT_SP packssdw m0, m1 packssdw m2, m3 @@ -5414,7 +8875,6 @@ dec dword [rsp] jnz .loopH - RET %endmacro @@ -5498,7 +8958,6 @@ dec dword [rsp] jnz .loopH - RET %endmacro @@ -5546,7 +9005,7 @@ %rep %1/4 movd m0, [r0] movhps m0, [r0 + r1] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m1 movd [r2 + r3 * 0], m0 @@ -5554,7 +9013,7 @@ movd m0, [r0 + r1 * 2] movhps m0, [r0 + r4] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m1 movd [r2 + r3 * 2], m0 @@ -5587,14 +9046,14 @@ %rep %1/4 movh m0, [r0] movhps m0, [r0 + r1] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m1 movh [r2 + r3 * 0], m0 movhps [r2 + r3 * 1], m0 movh m0, [r0 + r1 * 2] movhps m0, [r0 + r5] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m1 movh [r2 + r3 * 2], m0 movhps [r2 + r4], m0 @@ -5620,11 +9079,10 @@ movh m0, [r0] movhps m0, [r0 + r1] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, [pw_2000] movh [r2 + r3 * 0], m0 movhps [r2 + r3 * 1], m0 - RET ;----------------------------------------------------------------------------- @@ -5648,9 +9106,9 @@ .loop movu m0, [r0] movu m1, [r0 + r1] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movh [r2 + r3 * 0], m0 @@ -5660,9 +9118,9 @@ movu m0, [r0 + r1 * 2] movu m1, [r0 + r5] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movh [r2 + r3 * 2], m0 @@ -5700,22 +9158,22 @@ .loop movu m0, [r0] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m1 movu [r2 + r3 * 0], m0 movu m0, [r0 + r1] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m1 movu [r2 + r3 * 1], m0 movu m0, [r0 + r1 * 2] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m1 movu [r2 + r3 * 2], m0 movu m0, [r0 + r5] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m1 movu [r2 + r4], m0 @@ -5745,14 +9203,13 @@ movu m0, [r0] movu m1, [r0 + r1] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, [pw_2000] - 
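
From here the diff generalizes the pixel-to-short copies: psllw m0, 4 becomes psllw m0, (14 - BIT_DEPTH), the shift up to the 14-bit internal precision, followed by psubw with pw_2000 (0x2000 = 8192) to re-center. A scalar sketch of the operation these hunks parameterize:

    #include <stdint.h>

    void pixel_to_short(const uint16_t *src, int16_t *dst, int n, int bitDepth)
    {
        int shift = 14 - bitDepth;             /* psllw m0, (14 - BIT_DEPTH): 4 at 10-bit, 2 at 12-bit */
        for (int i = 0; i < n; i++)
            dst[i] = (int16_t)((src[i] << shift) - 8192);   /* psubw m0, [pw_2000] */
    }
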
psllw m1, 4 psubw m1, [pw_2000] movu [r2 + r3 * 0], m0 movu [r2 + r3 * 1], m1 - RET ;----------------------------------------------------------------------------- @@ -5774,11 +9231,11 @@ movu m1, [r0 + r1] movu m2, [r0 + r1 * 2] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m3 - psllw m1, 4 + psllw m1, (14 - BIT_DEPTH) psubw m1, m3 - psllw m2, 4 + psllw m2, (14 - BIT_DEPTH) psubw m2, m3 movu [r2 + r3 * 0], m0 @@ -5789,18 +9246,17 @@ movu m1, [r0 + r1 * 4] movu m2, [r0 + r5 ] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m3 - psllw m1, 4 + psllw m1, (14 - BIT_DEPTH) psubw m1, m3 - psllw m2, 4 + psllw m2, (14 - BIT_DEPTH) psubw m2, m3 movu [r2 + r6], m0 movu [r2 + r3 * 4], m1 lea r2, [r2 + r3 * 4] movu [r2 + r3], m2 - RET ;----------------------------------------------------------------------------- @@ -5824,9 +9280,9 @@ .loop movu m0, [r0] movu m1, [r0 + r1] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 + psllw m1, (14 - BIT_DEPTH) psubw m1, m2 movu [r2 + r3 * 0], m0 @@ -5834,9 +9290,9 @@ movu m0, [r0 + r1 * 2] movu m1, [r0 + r5] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 + psllw m1, (14 - BIT_DEPTH) psubw m1, m2 movu [r2 + r3 * 2], m0 @@ -5844,9 +9300,9 @@ movu m0, [r0 + 16] movu m1, [r0 + r1 + 16] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 + psllw m1, (14 - BIT_DEPTH) psubw m1, m2 movu [r2 + r3 * 0 + 16], m0 @@ -5854,9 +9310,9 @@ movu m0, [r0 + r1 * 2 + 16] movu m1, [r0 + r5 + 16] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 + psllw m1, (14 - BIT_DEPTH) psubw m1, m2 movu [r2 + r3 * 2 + 16], m0 @@ -5898,9 +9354,9 @@ .loop movu m0, [r0] movu m1, [r0 + r1] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 + psllw m1, (14 - BIT_DEPTH) psubw m1, m2 movu [r2 + r3 * 0], m0 @@ -5908,9 +9364,9 @@ movu m0, [r0 + r1 * 2] movu m1, [r0 + r5] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 + psllw m1, (14 - BIT_DEPTH) psubw m1, m2 movu [r2 + r3 * 2], m0 @@ -5954,13 +9410,13 @@ movu m1, [r0 + r1] movu m2, [r0 + r1 * 2] movu m3, [r0 + r5] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0], m0 @@ -5972,13 +9428,13 @@ movu m1, [r0 + r1 + 16] movu m2, [r0 + r1 * 2 + 16] movu m3, [r0 + r5 + 16] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 16], m0 @@ -5990,13 +9446,13 @@ movu m1, [r0 + r1 + 32] movu m2, [r0 + r1 * 2 + 32] movu m3, [r0 + r5 + 32] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 32], m0 @@ -6008,13 +9464,13 @@ movu m1, [r0 + r1 + 48] movu m2, [r0 + r1 * 2 + 48] movu m3, [r0 + r5 + 48] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 48], m0 @@ -6057,9 +9513,9 @@ .loop movu m0, [r0] movu m1, [r0 + r1] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 + 
psllw m1, (14 - BIT_DEPTH) psubw m1, m2 movu [r2 + r3 * 0], m0 @@ -6067,9 +9523,9 @@ movu m0, [r0 + r1 * 2] movu m1, [r0 + r5] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 2], m0 @@ -6077,9 +9533,9 @@ movu m0, [r0 + 32] movu m1, [r0 + r1 + 32] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 0 + 32], m0 @@ -6087,9 +9543,9 @@ movu m0, [r0 + r1 * 2 + 32] movu m1, [r0 + r5 + 32] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 2 + 32], m0 @@ -6132,13 +9588,13 @@ movu m1, [r0 + r1] movu m2, [r0 + r1 * 2] movu m3, [r0 + r5] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0], m0 @@ -6150,13 +9606,13 @@ movu m1, [r0 + r1 + 16] movu m2, [r0 + r1 * 2 + 16] movu m3, [r0 + r5 + 16] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 16], m0 @@ -6168,13 +9624,13 @@ movu m1, [r0 + r1 + 32] movu m2, [r0 + r1 * 2 + 32] movu m3, [r0 + r5 + 32] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 32], m0 @@ -6186,13 +9642,13 @@ movu m1, [r0 + r1 + 48] movu m2, [r0 + r1 * 2 + 48] movu m3, [r0 + r5 + 48] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 48], m0 @@ -6204,13 +9660,13 @@ movu m1, [r0 + r1 + 64] movu m2, [r0 + r1 * 2 + 64] movu m3, [r0 + r5 + 64] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 64], m0 @@ -6222,13 +9678,13 @@ movu m1, [r0 + r1 + 80] movu m2, [r0 + r1 * 2 + 80] movu m3, [r0 + r5 + 80] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 80], m0 @@ -6240,13 +9696,13 @@ movu m1, [r0 + r1 + 96] movu m2, [r0 + r1 * 2 + 96] movu m3, [r0 + r5 + 96] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 96], m0 @@ -6258,13 +9714,13 @@ movu m1, [r0 + r1 + 112] movu m2, [r0 + r1 * 2 + 112] movu m3, [r0 + r5 + 112] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 112], m0 @@ -6305,9 +9761,9 @@ .loop movu m0, [r0] movu m1, [r0 + r1] - psllw m0, 
4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 0], m0 @@ -6315,9 +9771,9 @@ movu m0, [r0 + r1 * 2] movu m1, [r0 + r5] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 2], m0 @@ -6325,9 +9781,9 @@ movu m0, [r0 + 32] movu m1, [r0 + r1 + 32] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 0 + 32], m0 @@ -6335,9 +9791,9 @@ movu m0, [r0 + r1 * 2 + 32] movu m1, [r0 + r5 + 32] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 2 + 32], m0 @@ -6345,9 +9801,9 @@ movu m0, [r0 + 64] movu m1, [r0 + r1 + 64] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 0 + 64], m0 @@ -6355,9 +9811,9 @@ movu m0, [r0 + r1 * 2 + 64] movu m1, [r0 + r5 + 64] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 2 + 64], m0 @@ -6365,9 +9821,9 @@ movu m0, [r0 + 96] movu m1, [r0 + r1 + 96] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 0 + 96], m0 @@ -6375,9 +9831,9 @@ movu m0, [r0 + r1 * 2 + 96] movu m1, [r0 + r5 + 96] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 2 + 96], m0 @@ -6418,13 +9874,13 @@ movu m1, [r0 + r1] movu m2, [r0 + r1 * 2] movu m3, [r0 + r5] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0], m0 @@ -6436,13 +9892,13 @@ movu m1, [r0 + r1 + 16] movu m2, [r0 + r1 * 2 + 16] movu m3, [r0 + r5 + 16] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 16], m0 @@ -6454,13 +9910,13 @@ movu m1, [r0 + r1 + 32] movu m2, [r0 + r1 * 2 + 32] movu m3, [r0 + r5 + 32] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 32], m0 @@ -6499,36 +9955,36 @@ .loop movu m0, [r0] movu m1, [r0 + 32] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 0], m0 movu [r2 + r3 * 0 + 32], xm1 movu m0, [r0 + r1] movu m1, [r0 + r1 + 32] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 1], m0 movu [r2 + r3 * 1 + 32], xm1 movu m0, [r0 + r1 * 2] movu m1, [r0 + r1 * 2 + 32] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 2], m0 movu [r2 + r3 * 2 + 32], xm1 movu m0, [r0 + r5] movu m1, [r0 + r5 + 32] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r4], m0 movu [r2 + r4 + 32], xm1 @@ -6564,9 +10020,9 @@ .loop movu m0, [r0] movu m1, [r0 + r1] - psllw m0, 4 + psllw m0, (14 
- BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 0], m0 @@ -6574,9 +10030,9 @@ movu m0, [r0 + r1 * 2] movu m1, [r0 + r5] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) psubw m0, m2 - psllw m1, 4 psubw m1, m2 movu [r2 + r3 * 2], m0 @@ -6584,7 +10040,7 @@ movh m0, [r0 + 16] movhps m0, [r0 + r1 + 16] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m2 movh [r2 + r3 * 0 + 16], m0 @@ -6592,7 +10048,7 @@ movh m0, [r0 + r1 * 2 + 16] movhps m0, [r0 + r5 + 16] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) psubw m0, m2 movh [r2 + r3 * 2 + 16], m0 @@ -6630,13 +10086,13 @@ movu m1, [r0 + r1] movu m2, [r0 + r1 * 2] movu m3, [r0 + r5] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0], m0 @@ -6648,13 +10104,13 @@ movu m1, [r0 + r1 + 16] movu m2, [r0 + r1 * 2 + 16] movu m3, [r0 + r5 + 16] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 16], m0 @@ -6666,13 +10122,13 @@ movu m1, [r0 + r1 + 32] movu m2, [r0 + r1 * 2 + 32] movu m3, [r0 + r5 + 32] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 32], m0 @@ -6684,13 +10140,13 @@ movu m1, [r0 + r1 + 48] movu m2, [r0 + r1 * 2 + 48] movu m3, [r0 + r5 + 48] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 48], m0 @@ -6702,13 +10158,13 @@ movu m1, [r0 + r1 + 64] movu m2, [r0 + r1 * 2 + 64] movu m3, [r0 + r5 + 64] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 64], m0 @@ -6720,13 +10176,13 @@ movu m1, [r0 + r1 + 80] movu m2, [r0 + r1 * 2 + 80] movu m3, [r0 + r5 + 80] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) + psllw m3, (14 - BIT_DEPTH) psubw m0, m4 - psllw m1, 4 psubw m1, m4 - psllw m2, 4 psubw m2, m4 - psllw m3, 4 psubw m3, m4 movu [r2 + r3 * 0 + 80], m0 @@ -6762,11 +10218,11 @@ movu m0, [r0] movu m1, [r0 + 32] movu m2, [r0 + 64] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) psubw m0, m3 - psllw m1, 4 psubw m1, m3 - psllw m2, 4 psubw m2, m3 movu [r2 + r3 * 0], m0 movu [r2 + r3 * 0 + 32], m1 @@ -6775,11 +10231,11 @@ movu m0, [r0 + r1] movu m1, [r0 + r1 + 32] movu m2, [r0 + r1 + 64] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) psubw m0, m3 - psllw m1, 4 psubw m1, m3 - psllw m2, 4 psubw m2, m3 movu [r2 + r3 * 1], m0 movu [r2 + r3 * 1 + 32], m1 @@ -6788,11 +10244,11 @@ movu m0, [r0 + r1 * 2] movu m1, [r0 + r1 * 2 + 32] movu m2, [r0 + r1 * 2 + 64] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) psubw m0, m3 - psllw m1, 4 
psubw m1, m3 - psllw m2, 4 psubw m2, m3 movu [r2 + r3 * 2], m0 movu [r2 + r3 * 2 + 32], m1 @@ -6801,11 +10257,11 @@ movu m0, [r0 + r5] movu m1, [r0 + r5 + 32] movu m2, [r0 + r5 + 64] - psllw m0, 4 + psllw m0, (14 - BIT_DEPTH) + psllw m1, (14 - BIT_DEPTH) + psllw m2, (14 - BIT_DEPTH) psubw m0, m3 - psllw m1, 4 psubw m1, m3 - psllw m2, 4 psubw m2, m3 movu [r2 + r4], m0 movu [r2 + r4 + 32], m1 @@ -6831,18 +10287,17 @@ mov r4d, r4m add r1d, r1d add r3d, r3d -%ifdef PIC +%ifdef PIC lea r6, [tab_LumaCoeff] - lea r4 , [r4 * 8] + lea r4, [r4 * 8] vbroadcasti128 m0, [r6 + r4 * 2] - %else - lea r4 , [r4 * 8] + lea r4, [r4 * 8] vbroadcasti128 m0, [tab_LumaCoeff + r4 * 2] %endif - vbroadcasti128 m2, [pd_n32768] + vbroadcasti128 m2, [INTERP_OFFSET_PS] ; register map ; m0 - interpolate coeff @@ -6934,3 +10389,2619 @@ IPFILTER_LUMA_PS_4xN_AVX2 4 IPFILTER_LUMA_PS_4xN_AVX2 8 IPFILTER_LUMA_PS_4xN_AVX2 16 + +%macro IPFILTER_LUMA_PS_8xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_8x%1, 4, 6, 8 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + shl r4d, 4 +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4] + vpbroadcastq m1, [r6 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + test r5d, r5d + mov r4d, %1 + jz .loop0 + lea r6, [r1*3] + sub r0, r6 + add r4d, 7 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m7, m5, m3 + pmaddwd m4, m0 + pmaddwd m7, m1 + paddd m4, m7 + + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m6, m3 + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m2 + vextracti128 xm5,m4, 1 + psrad xm4, 2 + psrad xm5, 2 + packssdw xm4, xm5 + + movu [r2], xm4 + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + + IPFILTER_LUMA_PS_8xN_AVX2 4 + IPFILTER_LUMA_PS_8xN_AVX2 8 + IPFILTER_LUMA_PS_8xN_AVX2 16 + IPFILTER_LUMA_PS_8xN_AVX2 32 + +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_24x32, 4, 6, 8 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + shl r4d, 4 +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4] + vpbroadcastq m1, [r6 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + test r5d, r5d + mov r4d, 32 + jz .loop0 + lea r6, [r1*3] + sub r0, r6 + add r4d, 7 + +.loop0: +%assign x 0 +%rep 24/8 + vbroadcasti128 m4, [r0 + x] + vbroadcasti128 m5, [r0 + 8 + x] + pshufb m4, m3 + pshufb m7, m5, m3 + pmaddwd m4, m0 + pmaddwd m7, m1 + paddd m4, m7 + + vbroadcasti128 m6, [r0 + 16 + x] + pshufb m5, m3 + pshufb m6, m3 + pmaddwd m5, m0 + pmaddwd m6, m1 + paddd m5, m6 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m2 + vextracti128 xm5,m4, 1 + psrad xm4, 2 + psrad xm5, 2 + packssdw xm4, xm5 + + movu [r2 + x], xm4 + %assign x x+16 + %endrep + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif + + +%macro IPFILTER_LUMA_PS_32_64_AVX2 2 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_%1x%2, 4, 6, 8 + + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + shl r4d, 4 +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4] + vpbroadcastq m1, [r6 + r4 + 8] +%else + vpbroadcastq m0, 
[tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + test r5d, r5d + mov r4d, %2 + jz .loop0 + lea r6, [r1*3] + sub r0, r6 + add r4d, 7 + +.loop0: +%assign x 0 +%rep %1/16 + vbroadcasti128 m4, [r0 + x] + vbroadcasti128 m5, [r0 + 8 + x] + pshufb m4, m3 + pshufb m7, m5, m3 + + pmaddwd m4, m0 + pmaddwd m7, m1 + paddd m4, m7 + + vbroadcasti128 m6, [r0 + 16 + x] + pshufb m5, m3 + pshufb m7, m6, m3 + + pmaddwd m5, m0 + pmaddwd m7, m1 + paddd m5, m7 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m2 + vextracti128 xm5,m4, 1 + psrad xm4, 2 + psrad xm5, 2 + packssdw xm4, xm5 + + movu [r2 + x], xm4 + + vbroadcasti128 m5, [r0 + 24 + x] + pshufb m6, m3 + pshufb m7, m5, m3 + + pmaddwd m6, m0 + pmaddwd m7, m1 + paddd m6, m7 + + vbroadcasti128 m7, [r0 + 32 + x] + pshufb m5, m3 + pshufb m7, m3 + + pmaddwd m5, m0 + pmaddwd m7, m1 + paddd m5, m7 + + phaddd m6, m5 + vpermq m6, m6, q3120 + paddd m6, m2 + vextracti128 xm5,m6, 1 + psrad xm6, 2 + psrad xm5, 2 + packssdw xm6, xm5 + + movu [r2 + 16 + x], xm6 + %assign x x+32 + %endrep + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + + IPFILTER_LUMA_PS_32_64_AVX2 32, 8 + IPFILTER_LUMA_PS_32_64_AVX2 32, 16 + IPFILTER_LUMA_PS_32_64_AVX2 32, 24 + IPFILTER_LUMA_PS_32_64_AVX2 32, 32 + IPFILTER_LUMA_PS_32_64_AVX2 32, 64 + + IPFILTER_LUMA_PS_32_64_AVX2 64, 16 + IPFILTER_LUMA_PS_32_64_AVX2 64, 32 + IPFILTER_LUMA_PS_32_64_AVX2 64, 48 + IPFILTER_LUMA_PS_32_64_AVX2 64, 64 + + IPFILTER_LUMA_PS_32_64_AVX2 48, 64 + +%macro IPFILTER_LUMA_PS_16xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_16x%1, 4, 6, 8 + + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + shl r4d, 4 +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4] + vpbroadcastq m1, [r6 + r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + test r5d, r5d + mov r4d, %1 + jz .loop0 + lea r6, [r1*3] + sub r0, r6 + add r4d, 7 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m7, m5, m3 + pmaddwd m4, m0 + pmaddwd m7, m1 + paddd m4, m7 + + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m7, m6, m3 + pmaddwd m5, m0 + pmaddwd m7, m1 + paddd m5, m7 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m2 + vextracti128 xm5, m4, 1 + psrad xm4, 2 + psrad xm5, 2 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m5, [r0 + 24] + pshufb m6, m3 + pshufb m7, m5, m3 + pmaddwd m6, m0 + pmaddwd m7, m1 + paddd m6, m7 + + vbroadcasti128 m7, [r0 + 32] + pshufb m5, m3 + pshufb m7, m3 + pmaddwd m5, m0 + pmaddwd m7, m1 + paddd m5, m7 + + phaddd m6, m5 + vpermq m6, m6, q3120 + paddd m6, m2 + vextracti128 xm5,m6, 1 + psrad xm6, 2 + psrad xm5, 2 + packssdw xm6, xm5 + movu [r2 + 16], xm6 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + + IPFILTER_LUMA_PS_16xN_AVX2 4 + IPFILTER_LUMA_PS_16xN_AVX2 8 + IPFILTER_LUMA_PS_16xN_AVX2 12 + IPFILTER_LUMA_PS_16xN_AVX2 16 + IPFILTER_LUMA_PS_16xN_AVX2 32 + IPFILTER_LUMA_PS_16xN_AVX2 64 + +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_8tap_horiz_ps_12x16, 4, 6, 8 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + shl r4d, 4 +%ifdef PIC + lea r6, [tab_LumaCoeff] + vpbroadcastq m0, [r6 + r4] + vpbroadcastq m1, [r6 + 
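
The IPFILTER_LUMA_PS_8xN/24x32/32_64 bodies added above all follow one pattern: broadcast two coefficient quadwords, pshufb-gather, pmaddwd, phaddd, add INTERP_OFFSET_PS, shift, pack. A scalar reference sketch under assumed argument and coefficient layout (this is not x265's C prototype): sub r0, 6 is the 3-sample tap centering, and the r5m flag starts 3 rows early and runs 7 extra rows for a following vertical pass. Note the new bodies still hard-code psrad ..., 2 — the 10-bit INTERP_SHIFT_PS value — rather than the macro introduced earlier in this diff.

    #include <stdint.h>

    void interp_8tap_horiz_ps_ref(const uint16_t *src, intptr_t srcStride,
                                  int16_t *dst, intptr_t dstStride,
                                  int width, int height,
                                  const int16_t coeff[8],
                                  int32_t offset, int shift, int isRowExt)
    {
        if (isRowExt)
        {
            src    -= 3 * srcStride;   /* sub r0, r6  (r6 = r1 * 3) */
            height += 7;               /* add r4d, 7                */
        }
        for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
            for (int x = 0; x < width; x++)
            {
                int32_t sum = 0;
                for (int t = 0; t < 8; t++)
                    sum += coeff[t] * src[x + t - 3];   /* sub r0, 6 = 3 samples */
                dst[x] = (int16_t)((sum + offset) >> shift);
            }
    }
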
r4 + 8] +%else + vpbroadcastq m0, [tab_LumaCoeff + r4] + vpbroadcastq m1, [tab_LumaCoeff + r4 + 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 6 + test r5d, r5d + mov r4d, 16 + jz .loop0 + lea r6, [r1*3] + sub r0, r6 + add r4d, 7 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m7, m5, m3 + pmaddwd m4, m0 + pmaddwd m7, m1 + paddd m4, m7 + + vbroadcasti128 m6, [r0 + 16] + pshufb m5, m3 + pshufb m7, m6, m3 + pmaddwd m5, m0 + pmaddwd m7, m1 + paddd m5, m7 + + phaddd m4, m5 + vpermq m4, m4, q3120 + paddd m4, m2 + vextracti128 xm5,m4, 1 + psrad xm4, 2 + psrad xm5, 2 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m5, [r0 + 24] + pshufb m6, m3 + pshufb m5, m3 + pmaddwd m6, m0 + pmaddwd m5, m1 + paddd m6, m5 + + phaddd m6, m6 + vpermq m6, m6, q3120 + paddd xm6, xm2 + psrad xm6, 2 + packssdw xm6, xm6 + movq [r2 + 16], xm6 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif + +%macro IPFILTER_CHROMA_PS_8xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_8x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + + IPFILTER_CHROMA_PS_8xN_AVX2 4 + IPFILTER_CHROMA_PS_8xN_AVX2 8 + IPFILTER_CHROMA_PS_8xN_AVX2 16 + IPFILTER_CHROMA_PS_8xN_AVX2 32 + IPFILTER_CHROMA_PS_8xN_AVX2 6 + IPFILTER_CHROMA_PS_8xN_AVX2 2 + IPFILTER_CHROMA_PS_8xN_AVX2 12 + IPFILTER_CHROMA_PS_8xN_AVX2 64 + +%macro IPFILTER_CHROMA_PS_16xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_16x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 16], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + +IPFILTER_CHROMA_PS_16xN_AVX2 16 +IPFILTER_CHROMA_PS_16xN_AVX2 8 +IPFILTER_CHROMA_PS_16xN_AVX2 32 +IPFILTER_CHROMA_PS_16xN_AVX2 12 +IPFILTER_CHROMA_PS_16xN_AVX2 4 +IPFILTER_CHROMA_PS_16xN_AVX2 64 +IPFILTER_CHROMA_PS_16xN_AVX2 24 + +%macro IPFILTER_CHROMA_PS_24xN_AVX2 1 +INIT_YMM avx2 +%if 
ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_24x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 16], xm4 + + vbroadcasti128 m4, [r0 + 32] + vbroadcasti128 m5, [r0 + 40] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 32], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + +IPFILTER_CHROMA_PS_24xN_AVX2 32 +IPFILTER_CHROMA_PS_24xN_AVX2 64 + +%macro IPFILTER_CHROMA_PS_12xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_12x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + pshufb m4, m3 + pmaddwd m4, m0 + phaddd m4, m4 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movq [r2 + 16], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + +IPFILTER_CHROMA_PS_12xN_AVX2 16 +IPFILTER_CHROMA_PS_12xN_AVX2 32 + +%macro IPFILTER_CHROMA_PS_32xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_32x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 16], xm4 + 
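
The chroma routines are the same skeleton with a 4-tap kernel: sub r0, 2 centers one sample back, and the row-extension path starts one row early and runs three extra rows (sub r0, r1 / add r4d, 3). A scalar sketch, with the same hedges as the luma one:

    #include <stdint.h>

    void interp_4tap_horiz_ps_ref(const uint16_t *src, intptr_t srcStride,
                                  int16_t *dst, intptr_t dstStride,
                                  int width, int height,
                                  const int16_t coeff[4],
                                  int32_t offset, int shift, int isRowExt)
    {
        if (isRowExt)
        {
            src    -= srcStride;       /* sub r0, r1 */
            height += 3;               /* add r4d, 3 */
        }
        for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
            for (int x = 0; x < width; x++)
            {
                int32_t sum = 0;
                for (int t = 0; t < 4; t++)
                    sum += coeff[t] * src[x + t - 1];   /* sub r0, 2 = 1 sample */
                dst[x] = (int16_t)((sum + offset) >> shift);
            }
    }
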
+ vbroadcasti128 m4, [r0 + 32] + vbroadcasti128 m5, [r0 + 40] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 32], xm4 + + vbroadcasti128 m4, [r0 + 48] + vbroadcasti128 m5, [r0 + 56] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 48], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + +IPFILTER_CHROMA_PS_32xN_AVX2 32 +IPFILTER_CHROMA_PS_32xN_AVX2 16 +IPFILTER_CHROMA_PS_32xN_AVX2 24 +IPFILTER_CHROMA_PS_32xN_AVX2 8 +IPFILTER_CHROMA_PS_32xN_AVX2 64 +IPFILTER_CHROMA_PS_32xN_AVX2 48 + + +%macro IPFILTER_CHROMA_PS_64xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_64x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 16], xm4 + + vbroadcasti128 m4, [r0 + 32] + vbroadcasti128 m5, [r0 + 40] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 32], xm4 + + vbroadcasti128 m4, [r0 + 48] + vbroadcasti128 m5, [r0 + 56] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 48], xm4 + + vbroadcasti128 m4, [r0 + 64] + vbroadcasti128 m5, [r0 + 72] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 64], xm4 + + vbroadcasti128 m4, [r0 + 80] + vbroadcasti128 m5, [r0 + 88] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 80], xm4 + + vbroadcasti128 m4, [r0 + 96] + vbroadcasti128 m5, [r0 + 104] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 96], xm4 + + vbroadcasti128 m4, [r0 + 112] + vbroadcasti128 m5, [r0 + 120] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 112], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + 
+IPFILTER_CHROMA_PS_64xN_AVX2 64 +IPFILTER_CHROMA_PS_64xN_AVX2 48 +IPFILTER_CHROMA_PS_64xN_AVX2 32 +IPFILTER_CHROMA_PS_64xN_AVX2 16 + +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_48x64, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, 64 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2], xm4 + + vbroadcasti128 m4, [r0 + 16] + vbroadcasti128 m5, [r0 + 24] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 16], xm4 + + vbroadcasti128 m4, [r0 + 32] + vbroadcasti128 m5, [r0 + 40] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 32], xm4 + + vbroadcasti128 m4, [r0 + 48] + vbroadcasti128 m5, [r0 + 56] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 48], xm4 + + vbroadcasti128 m4, [r0 + 64] + vbroadcasti128 m5, [r0 + 72] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 64], xm4 + + vbroadcasti128 m4, [r0 + 80] + vbroadcasti128 m5, [r0 + 88] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movu [r2 + 80], xm4 + + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif + +%macro IPFILTER_CHROMA_PS_6xN_AVX2 1 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_horiz_ps_6x%1, 4, 7, 6 + add r1d, r1d + add r3d, r3d + mov r4d, r4m + mov r5d, r5m + +%ifdef PIC + lea r6, [tab_ChromaCoeff] + vpbroadcastq m0, [r6 + r4 * 8] +%else + vpbroadcastq m0, [tab_ChromaCoeff + r4 * 8] +%endif + mova m3, [pb_shuf] + vbroadcasti128 m2, [INTERP_OFFSET_PS] + + ; register map + ; m0 , m1 interpolate coeff + + sub r0, 2 + test r5d, r5d + mov r4d, %1 + jz .loop0 + sub r0, r1 + add r4d, 3 + +.loop0: + vbroadcasti128 m4, [r0] + vbroadcasti128 m5, [r0 + 8] + pshufb m4, m3 + pshufb m5, m3 + pmaddwd m4, m0 + pmaddwd m5, m0 + phaddd m4, m5 + paddd m4, m2 + vpermq m4, m4, q3120 + psrad m4, 2 + vextracti128 xm5, m4, 1 + packssdw xm4, xm5 + movq [r2], xm4 + pextrd [r2 + 8], xm4, 2 + add r2, r3 + add r0, r1 + dec r4d + jnz .loop0 + RET +%endif +%endmacro + + IPFILTER_CHROMA_PS_6xN_AVX2 8 + IPFILTER_CHROMA_PS_6xN_AVX2 16 + +%macro FILTER_VER_CHROMA_AVX2_8xN 2 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_8x%2, 4, 9, 15 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + vbroadcasti128 
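
The 6xN variant's store deserves a note: a 6-sample row has no single-instruction store, so it is written as one movq (samples 0-3) plus pextrd ..., 2 (samples 4-5). The same split in C:

    #include <stdint.h>
    #include <string.h>

    void store6(int16_t *dst, const int16_t row[8])
    {
        memcpy(dst,     row,     8);   /* movq   [r2], xm4        : samples 0-3 */
        memcpy(dst + 4, row + 4, 4);   /* pextrd [r2 + 8], xm4, 2 : samples 4-5 */
    }
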
m14, [pd_32] +%elifidn %1, sp + mova m14, [pd_524800] +%else + vbroadcasti128 m14, [INTERP_OFFSET_PS] +%endif + lea r6, [r3 * 3] + lea r7, [r1 * 4] + mov r8d, %2 / 16 +.loopH: + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] + + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm7, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddwd m7, m5, [r5 + 1 * mmsize] + paddd m3, m7 + pmaddwd m5, [r5] + + movu xm7, [r0 + r4] ; m7 = row 7 + punpckhwd xm8, xm6, xm7 + punpcklwd xm6, xm7 + vinserti128 m6, m6, xm8, 1 + pmaddwd m8, m6, [r5 + 1 * mmsize] + paddd m4, m8 + pmaddwd m6, [r5] + + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 8 + punpckhwd xm9, xm7, xm8 + punpcklwd xm7, xm8 + vinserti128 m7, m7, xm9, 1 + pmaddwd m9, m7, [r5 + 1 * mmsize] + paddd m5, m9 + pmaddwd m7, [r5] + + + movu xm9, [r0 + r1] ; m9 = row 9 + punpckhwd xm10, xm8, xm9 + punpcklwd xm8, xm9 + vinserti128 m8, m8, xm10, 1 + pmaddwd m10, m8, [r5 + 1 * mmsize] + paddd m6, m10 + pmaddwd m8, [r5] + + + movu xm10, [r0 + r1 * 2] ; m10 = row 10 + punpckhwd xm11, xm9, xm10 + punpcklwd xm9, xm10 + vinserti128 m9, m9, xm11, 1 + pmaddwd m11, m9, [r5 + 1 * mmsize] + paddd m7, m11 + pmaddwd m9, [r5] + + movu xm11, [r0 + r4] ; m11 = row 11 + punpckhwd xm12, xm10, xm11 + punpcklwd xm10, xm11 + vinserti128 m10, m10, xm12, 1 + pmaddwd m12, m10, [r5 + 1 * mmsize] + paddd m8, m12 + pmaddwd m10, [r5] + + lea r0, [r0 + r1 * 4] + movu xm12, [r0] ; m12 = row 12 + punpckhwd xm13, xm11, xm12 + punpcklwd xm11, xm12 + vinserti128 m11, m11, xm13, 1 + pmaddwd m13, m11, [r5 + 1 * mmsize] + paddd m9, m13 + pmaddwd m11, [r5] + +%ifidn %1,ss + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + psrad m4, 6 + psrad m5, 6 +%else + paddd m0, m14 + paddd m1, m14 + paddd m2, m14 + paddd m3, m14 + paddd m4, m14 + paddd m5, m14 +%ifidn %1,pp + psrad m0, 6 + psrad m1, 6 + psrad m2, 6 + psrad m3, 6 + psrad m4, 6 + psrad m5, 6 +%elifidn %1, sp + psrad m0, 10 + psrad m1, 10 + psrad m2, 10 + psrad m3, 10 + psrad m4, 10 + psrad m5, 10 +%else + psrad m0, 2 + psrad m1, 2 + psrad m2, 2 + psrad m3, 2 + psrad m4, 2 + psrad m5, 2 +%endif +%endif + + packssdw m0, m1 + packssdw m2, m3 + packssdw m4, m5 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + vpermq m4, m4, q3120 + pxor m5, m5 + mova m3, [pw_pixel_max] +%ifidn %1,pp + CLIPW m0, m5, m3 + CLIPW m2, m5, m3 + CLIPW m4, m5, m3 +%elifidn %1, sp + CLIPW m0, m5, m3 + CLIPW m2, m5, m3 + CLIPW m4, m5, m3 +%endif + + vextracti128 xm1, m0, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + vextracti128 xm1, m2, 1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm1 + lea r2, [r2 + r3 * 4] + vextracti128 xm1, m4, 1 + movu [r2], xm4 + movu [r2 + r3], xm1 + + movu xm13, [r0 + r1] ; m13 = row 13 + punpckhwd xm0, xm12, xm13 + 
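
FILTER_VER_CHROMA_AVX2_8xN interleaves adjacent rows with punpcklwd/punpckhwd so each pmaddwd evaluates two vertical taps at once; note that this newly added code still loads the literal pd_524800 and shifts by literal 6/10/2 in its %ifidn branches instead of the INTERP_* macros adopted earlier in the diff. Functionally it is a 4-tap vertical filter starting one row above the block (sub r0, r1); a scalar sketch with an illustrative prototype (the ss variant skips the rounding offset, matching the %ifidn %1,ss branches):

    #include <stdint.h>

    void interp_4tap_vert_ref(const int16_t *src, intptr_t srcStride,
                              int16_t *dst, intptr_t dstStride,
                              int width, int height,
                              const int16_t coeff[4],
                              int32_t offset, int shift, int addOffset)
    {
        src -= srcStride;                              /* sub r0, r1 */
        for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
            for (int x = 0; x < width; x++)
            {
                int32_t sum = 0;
                for (int t = 0; t < 4; t++)
                    sum += coeff[t] * src[x + t * srcStride];
                if (addOffset)                         /* skipped for ss */
                    sum += offset;
                dst[x] = (int16_t)(sum >> shift);
            }
    }
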
punpcklwd xm12, xm13 + vinserti128 m12, m12, xm0, 1 + pmaddwd m0, m12, [r5 + 1 * mmsize] + paddd m10, m0 + pmaddwd m12, [r5] + + movu xm0, [r0 + r1 * 2] ; m0 = row 14 + punpckhwd xm1, xm13, xm0 + punpcklwd xm13, xm0 + vinserti128 m13, m13, xm1, 1 + pmaddwd m1, m13, [r5 + 1 * mmsize] + paddd m11, m1 + pmaddwd m13, [r5] + +%ifidn %1,ss + psrad m6, 6 + psrad m7, 6 +%else + paddd m6, m14 + paddd m7, m14 +%ifidn %1,pp + psrad m6, 6 + psrad m7, 6 +%elifidn %1, sp + psrad m6, 10 + psrad m7, 10 +%else + psrad m6, 2 + psrad m7, 2 +%endif +%endif + + packssdw m6, m7 + vpermq m6, m6, q3120 +%ifidn %1,pp + CLIPW m6, m5, m3 +%elifidn %1, sp + CLIPW m6, m5, m3 +%endif + vextracti128 xm7, m6, 1 + movu [r2 + r3 * 2], xm6 + movu [r2 + r6], xm7 + + movu xm1, [r0 + r4] ; m1 = row 15 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m2, m0, [r5 + 1 * mmsize] + paddd m12, m2 + pmaddwd m0, [r5] + + lea r0, [r0 + r1 * 4] + movu xm2, [r0] ; m2 = row 16 + punpckhwd xm6, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm6, 1 + pmaddwd m6, m1, [r5 + 1 * mmsize] + paddd m13, m6 + pmaddwd m1, [r5] + + movu xm6, [r0 + r1] ; m6 = row 17 + punpckhwd xm4, xm2, xm6 + punpcklwd xm2, xm6 + vinserti128 m2, m2, xm4, 1 + pmaddwd m2, [r5 + 1 * mmsize] + paddd m0, m2 + + movu xm4, [r0 + r1 * 2] ; m4 = row 18 + punpckhwd xm2, xm6, xm4 + punpcklwd xm6, xm4 + vinserti128 m6, m6, xm2, 1 + pmaddwd m6, [r5 + 1 * mmsize] + paddd m1, m6 + +%ifidn %1,ss + psrad m8, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + psrad m12, 6 + psrad m13, 6 + psrad m0, 6 + psrad m1, 6 +%else + paddd m8, m14 + paddd m9, m14 + paddd m10, m14 + paddd m11, m14 + paddd m12, m14 + paddd m13, m14 + paddd m0, m14 + paddd m1, m14 +%ifidn %1,pp + psrad m8, 6 + psrad m9, 6 + psrad m10, 6 + psrad m11, 6 + psrad m12, 6 + psrad m13, 6 + psrad m0, 6 + psrad m1, 6 +%elifidn %1, sp + psrad m8, 10 + psrad m9, 10 + psrad m10, 10 + psrad m11, 10 + psrad m12, 10 + psrad m13, 10 + psrad m0, 10 + psrad m1, 10 +%else + psrad m8, 2 + psrad m9, 2 + psrad m10, 2 + psrad m11, 2 + psrad m12, 2 + psrad m13, 2 + psrad m0, 2 + psrad m1, 2 +%endif +%endif + + packssdw m8, m9 + packssdw m10, m11 + packssdw m12, m13 + packssdw m0, m1 + vpermq m8, m8, q3120 + vpermq m10, m10, q3120 + vpermq m12, m12, q3120 + vpermq m0, m0, q3120 +%ifidn %1,pp + CLIPW m8, m5, m3 + CLIPW m10, m5, m3 + CLIPW m12, m5, m3 + CLIPW m0, m5, m3 +%elifidn %1, sp + CLIPW m8, m5, m3 + CLIPW m10, m5, m3 + CLIPW m12, m5, m3 + CLIPW m0, m5, m3 +%endif + vextracti128 xm9, m8, 1 + vextracti128 xm11, m10, 1 + vextracti128 xm13, m12, 1 + vextracti128 xm1, m0, 1 + lea r2, [r2 + r3 * 4] + movu [r2], xm8 + movu [r2 + r3], xm9 + movu [r2 + r3 * 2], xm10 + movu [r2 + r6], xm11 + lea r2, [r2 + r3 * 4] + movu [r2], xm12 + movu [r2 + r3], xm13 + movu [r2 + r3 * 2], xm0 + movu [r2 + r6], xm1 + lea r2, [r2 + r3 * 4] + dec r8d + jnz .loopH + RET +%endif +%endmacro + +FILTER_VER_CHROMA_AVX2_8xN pp, 16 +FILTER_VER_CHROMA_AVX2_8xN ps, 16 +FILTER_VER_CHROMA_AVX2_8xN ss, 16 +FILTER_VER_CHROMA_AVX2_8xN sp, 16 +FILTER_VER_CHROMA_AVX2_8xN pp, 32 +FILTER_VER_CHROMA_AVX2_8xN ps, 32 +FILTER_VER_CHROMA_AVX2_8xN sp, 32 +FILTER_VER_CHROMA_AVX2_8xN ss, 32 +FILTER_VER_CHROMA_AVX2_8xN pp, 64 +FILTER_VER_CHROMA_AVX2_8xN ps, 64 +FILTER_VER_CHROMA_AVX2_8xN sp, 64 +FILTER_VER_CHROMA_AVX2_8xN ss, 64 + +%macro PROCESS_CHROMA_AVX2_8x2 3 + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + + movu xm2, [r0 + r1 * 2] 
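
The 16-row loop body recycles each interleaved row pair twice: pmaddwd against [r5] opens a new output row (head taps), while pmaddwd against [r5 + 1 * mmsize] closes the output row started two inputs earlier (tail taps). A single-column scalar view of that reuse; the (c0,c1)/(c2,c3) tap pairing is my reading of the coefficient table layout:

    #include <stdint.h>

    /* rows: height + 3 input rows of one column; out: height results. */
    void vert4_sliding(const int16_t *rows, int32_t *out, int height,
                       const int16_t c[4])
    {
        for (int y = 0; y < height; y++)
        {
            int32_t head = c[0] * rows[y]     + c[1] * rows[y + 1]; /* pmaddwd [r5]              */
            int32_t tail = c[2] * rows[y + 2] + c[3] * rows[y + 3]; /* pmaddwd [r5 + 1 * mmsize] */
            out[y] = head + tail;
        }
    }
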
; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m2, m2, [r5 + 1 * mmsize] + paddd m0, m2 + + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m3, m3, [r5 + 1 * mmsize] + paddd m1, m3 + +%ifnidn %1,ss + paddd m0, m7 + paddd m1, m7 +%endif + psrad m0, %3 + psrad m1, %3 + + packssdw m0, m1 + vpermq m0, m0, q3120 + pxor m4, m4 + +%if %2 + CLIPW m0, m4, [pw_pixel_max] +%endif + vextracti128 xm1, m0, 1 +%endmacro + + +%macro FILTER_VER_CHROMA_AVX2_8x2 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x2, 4, 6, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + vbroadcasti128 m7, [pd_32] +%elifidn %1, sp + mova m7, [pd_524800] +%else + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + + PROCESS_CHROMA_AVX2_8x2 %1, %2, %3 + movu [r2], xm0 + movu [r2 + r3], xm1 + RET +%endmacro + +FILTER_VER_CHROMA_AVX2_8x2 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_8x2 ps, 0, 2 +FILTER_VER_CHROMA_AVX2_8x2 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_8x2 ss, 0, 6 + +%macro FILTER_VER_CHROMA_AVX2_4x2 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x2, 4, 6, 7 + mov r4d, r4m + add r1d, r1d + add r3d, r3d + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + +%ifidn %1,pp + vbroadcasti128 m6, [pd_32] +%elifidn %1, sp + mova m6, [pd_524800] +%else + vbroadcasti128 m6, [INTERP_OFFSET_PS] +%endif + + movq xm0, [r0] ; row 0 + movq xm1, [r0 + r1] ; row 1 + punpcklwd xm0, xm1 + + movq xm2, [r0 + r1 * 2] ; row 2 + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + + movq xm3, [r0 + r4] ; row 3 + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] ; row 4 + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m5, m2, [r5 + 1 * mmsize] + paddd m0, m5 + +%ifnidn %1, ss + paddd m0, m6 +%endif + psrad m0, %3 + packssdw m0, m0 + pxor m1, m1 + +%if %2 + CLIPW m0, m1, [pw_pixel_max] +%endif + + vextracti128 xm2, m0, 1 + lea r4, [r3 * 3] + movq [r2], xm0 + movq [r2 + r3], xm2 + RET +%endmacro + +FILTER_VER_CHROMA_AVX2_4x2 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_4x2 ps, 0, 2 +FILTER_VER_CHROMA_AVX2_4x2 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_4x2 ss, 0, 6 + +%macro FILTER_VER_CHROMA_AVX2_4x4 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x4, 4, 6, 7 + mov r4d, r4m + add r1d, r1d + add r3d, r3d + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + +%ifidn %1,pp + vbroadcasti128 m6, [pd_32] +%elifidn %1, sp + mova m6, [pd_524800] +%else + vbroadcasti128 m6, [INTERP_OFFSET_PS] +%endif + movq xm0, [r0] ; row 0 + movq xm1, [r0 + r1] ; row 1 + punpcklwd xm0, xm1 + + movq xm2, [r0 + r1 * 2] ; row 2 + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + + movq xm3, [r0 + r4] ; row 3 + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] ; row 4 + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m5, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m5 + + movq xm3, [r0 + r1] ; row 5 + punpcklwd xm4, xm3 + movq xm1, 
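
The trailing macro arguments in these instantiations encode (clip, shift) per variant: %2 gates the CLIPW and %3 feeds psrad. Restated as data, using the 10-bit values taken directly from the instantiations above:

    static const struct { const char *mode; int clip, shift; } variants[] = {
        { "pp", 1,  6 },   /* pixel -> pixel: clamp to pixel range       */
        { "ps", 0,  2 },   /* pixel -> short: no clamp, 10-bit PS shift  */
        { "sp", 1, 10 },   /* short -> pixel: clamp, 10-bit SP shift     */
        { "ss", 0,  6 },   /* short -> short: no clamp, filter precision */
    };
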
[r0 + r1 * 2] ; row 6 + punpcklwd xm3, xm1 + vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] + pmaddwd m4, [r5 + 1 * mmsize] + paddd m2, m4 + +%ifnidn %1,ss + paddd m0, m6 + paddd m2, m6 +%endif + psrad m0, %3 + psrad m2, %3 + + packssdw m0, m2 + pxor m1, m1 +%if %2 + CLIPW m0, m1, [pw_pixel_max] +%endif + + vextracti128 xm2, m0, 1 + lea r4, [r3 * 3] + movq [r2], xm0 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r4], xm2 + RET +%endmacro + +FILTER_VER_CHROMA_AVX2_4x4 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_4x4 ps, 0, 2 +FILTER_VER_CHROMA_AVX2_4x4 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_4x4 ss, 0, 6 + + +%macro FILTER_VER_CHROMA_AVX2_4x8 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x8, 4, 7, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + +%ifidn %1,pp + vbroadcasti128 m7, [pd_32] +%elifidn %1, sp + mova m7, [pd_524800] +%else + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + lea r6, [r3 * 3] + + movq xm0, [r0] ; row 0 + movq xm1, [r0 + r1] ; row 1 + punpcklwd xm0, xm1 + movq xm2, [r0 + r1 * 2] ; row 2 + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + + movq xm3, [r0 + r4] ; row 3 + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] ; row 4 + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m5, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m5 + + movq xm3, [r0 + r1] ; row 5 + punpcklwd xm4, xm3 + movq xm1, [r0 + r1 * 2] ; row 6 + punpcklwd xm3, xm1 + vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] + pmaddwd m5, m4, [r5 + 1 * mmsize] + paddd m2, m5 + pmaddwd m4, [r5] + + movq xm3, [r0 + r4] ; row 7 + punpcklwd xm1, xm3 + lea r0, [r0 + 4 * r1] + movq xm6, [r0] ; row 8 + punpcklwd xm3, xm6 + vinserti128 m1, m1, xm3, 1 ; m1 = [8 7 7 6] + pmaddwd m5, m1, [r5 + 1 * mmsize] + paddd m4, m5 + pmaddwd m1, [r5] + + movq xm3, [r0 + r1] ; row 9 + punpcklwd xm6, xm3 + movq xm5, [r0 + 2 * r1] ; row 10 + punpcklwd xm3, xm5 + vinserti128 m6, m6, xm3, 1 ; m6 = [A 9 9 8] + pmaddwd m6, [r5 + 1 * mmsize] + paddd m1, m6 +%ifnidn %1,ss + paddd m0, m7 + paddd m2, m7 +%endif + psrad m0, %3 + psrad m2, %3 + packssdw m0, m2 + pxor m6, m6 + mova m3, [pw_pixel_max] +%if %2 + CLIPW m0, m6, m3 +%endif + vextracti128 xm2, m0, 1 + movq [r2], xm0 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r6], xm2 +%ifnidn %1,ss + paddd m4, m7 + paddd m1, m7 +%endif + psrad m4, %3 + psrad m1, %3 + packssdw m4, m1 +%if %2 + CLIPW m4, m6, m3 +%endif + vextracti128 xm1, m4, 1 + lea r2, [r2 + r3 * 4] + movq [r2], xm4 + movq [r2 + r3], xm1 + movhps [r2 + r3 * 2], xm4 + movhps [r2 + r6], xm1 + RET +%endmacro + +FILTER_VER_CHROMA_AVX2_4x8 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_4x8 ps, 0, 2 +FILTER_VER_CHROMA_AVX2_4x8 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_4x8 ss, 0 , 6 + +%macro PROCESS_LUMA_AVX2_W4_16R_4TAP 3 + movq xm0, [r0] ; row 0 + movq xm1, [r0 + r1] ; row 1 + punpcklwd xm0, xm1 + movq xm2, [r0 + r1 * 2] ; row 2 + punpcklwd xm1, xm2 + vinserti128 m0, m0, xm1, 1 ; m0 = [2 1 1 0] + pmaddwd m0, [r5] + movq xm3, [r0 + r4] ; row 3 + punpcklwd xm2, xm3 + lea r0, [r0 + 4 * r1] + movq xm4, [r0] ; row 4 + punpcklwd xm3, xm4 + vinserti128 m2, m2, xm3, 1 ; m2 = [4 3 3 2] + pmaddwd m5, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m5 + movq xm3, [r0 + r1] ; row 5 + punpcklwd xm4, xm3 + movq xm1, [r0 + r1 * 2] ; row 6 + punpcklwd xm3, xm1 + vinserti128 m4, m4, xm3, 1 ; m4 = [6 5 5 4] + pmaddwd m5, m4, 
[r5 + 1 * mmsize] + paddd m2, m5 + pmaddwd m4, [r5] + movq xm3, [r0 + r4] ; row 7 + punpcklwd xm1, xm3 + lea r0, [r0 + 4 * r1] + movq xm6, [r0] ; row 8 + punpcklwd xm3, xm6 + vinserti128 m1, m1, xm3, 1 ; m1 = [8 7 7 6] + pmaddwd m5, m1, [r5 + 1 * mmsize] + paddd m4, m5 + pmaddwd m1, [r5] + movq xm3, [r0 + r1] ; row 9 + punpcklwd xm6, xm3 + movq xm5, [r0 + 2 * r1] ; row 10 + punpcklwd xm3, xm5 + vinserti128 m6, m6, xm3, 1 ; m6 = [10 9 9 8] + pmaddwd m3, m6, [r5 + 1 * mmsize] + paddd m1, m3 + pmaddwd m6, [r5] +%ifnidn %1,ss + paddd m0, m7 + paddd m2, m7 +%endif + psrad m0, %3 + psrad m2, %3 + packssdw m0, m2 + pxor m3, m3 +%if %2 + CLIPW m0, m3, [pw_pixel_max] +%endif + vextracti128 xm2, m0, 1 + movq [r2], xm0 + movq [r2 + r3], xm2 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r6], xm2 + movq xm2, [r0 + r4] ;row 11 + punpcklwd xm5, xm2 + lea r0, [r0 + 4 * r1] + movq xm0, [r0] ; row 12 + punpcklwd xm2, xm0 + vinserti128 m5, m5, xm2, 1 ; m5 = [12 11 11 10] + pmaddwd m2, m5, [r5 + 1 * mmsize] + paddd m6, m2 + pmaddwd m5, [r5] + movq xm2, [r0 + r1] ; row 13 + punpcklwd xm0, xm2 + movq xm3, [r0 + 2 * r1] ; row 14 + punpcklwd xm2, xm3 + vinserti128 m0, m0, xm2, 1 ; m0 = [14 13 13 12] + pmaddwd m2, m0, [r5 + 1 * mmsize] + paddd m5, m2 + pmaddwd m0, [r5] +%ifnidn %1,ss + paddd m4, m7 + paddd m1, m7 +%endif + psrad m4, %3 + psrad m1, %3 + packssdw m4, m1 + pxor m2, m2 +%if %2 + CLIPW m4, m2, [pw_pixel_max] +%endif + + vextracti128 xm1, m4, 1 + lea r2, [r2 + r3 * 4] + movq [r2], xm4 + movq [r2 + r3], xm1 + movhps [r2 + r3 * 2], xm4 + movhps [r2 + r6], xm1 + movq xm4, [r0 + r4] ; row 15 + punpcklwd xm3, xm4 + lea r0, [r0 + 4 * r1] + movq xm1, [r0] ; row 16 + punpcklwd xm4, xm1 + vinserti128 m3, m3, xm4, 1 ; m3 = [16 15 15 14] + pmaddwd m4, m3, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m3, [r5] + movq xm4, [r0 + r1] ; row 17 + punpcklwd xm1, xm4 + movq xm2, [r0 + 2 * r1] ; row 18 + punpcklwd xm4, xm2 + vinserti128 m1, m1, xm4, 1 ; m1 = [18 17 17 16] + pmaddwd m1, [r5 + 1 * mmsize] + paddd m3, m1 + +%ifnidn %1,ss + paddd m6, m7 + paddd m5, m7 +%endif + psrad m6, %3 + psrad m5, %3 + packssdw m6, m5 + pxor m1, m1 +%if %2 + CLIPW m6, m1, [pw_pixel_max] +%endif + vextracti128 xm5, m6, 1 + lea r2, [r2 + r3 * 4] + movq [r2], xm6 + movq [r2 + r3], xm5 + movhps [r2 + r3 * 2], xm6 + movhps [r2 + r6], xm5 +%ifnidn %1,ss + paddd m0, m7 + paddd m3, m7 +%endif + psrad m0, %3 + psrad m3, %3 + packssdw m0, m3 +%if %2 + CLIPW m0, m1, [pw_pixel_max] +%endif + vextracti128 xm3, m0, 1 + lea r2, [r2 + r3 * 4] + movq [r2], xm0 + movq [r2 + r3], xm3 + movhps [r2 + r3 * 2], xm0 + movhps [r2 + r6], xm3 +%endmacro + + +%macro FILTER_VER_CHROMA_AVX2_4xN 4 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_4x%2, 4, 8, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + mov r7d, %2 / 16 +%ifidn %1,pp + vbroadcasti128 m7, [pd_32] +%elifidn %1, sp + mova m7, [pd_524800] +%else + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + lea r6, [r3 * 3] +.loopH: + PROCESS_LUMA_AVX2_W4_16R_4TAP %1, %3, %4 + lea r2, [r2 + r3 * 4] + dec r7d + jnz .loopH + RET +%endmacro + +FILTER_VER_CHROMA_AVX2_4xN pp, 16, 1, 6 +FILTER_VER_CHROMA_AVX2_4xN ps, 16, 0, 2 +FILTER_VER_CHROMA_AVX2_4xN sp, 16, 1, 10 +FILTER_VER_CHROMA_AVX2_4xN ss, 16, 0, 6 +FILTER_VER_CHROMA_AVX2_4xN pp, 32, 1, 6 +FILTER_VER_CHROMA_AVX2_4xN ps, 32, 0, 2 +FILTER_VER_CHROMA_AVX2_4xN sp, 32, 1, 10 +FILTER_VER_CHROMA_AVX2_4xN ss, 32, 0, 6 + +%macro 
FILTER_VER_CHROMA_AVX2_8x8 3 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_8x8, 4, 6, 12 + mov r4d, r4m + add r1d, r1d + add r3d, r3d + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + +%ifidn %1,pp + vbroadcasti128 m11, [pd_32] +%elifidn %1, sp + mova m11, [pd_524800] +%else + vbroadcasti128 m11, [INTERP_OFFSET_PS] +%endif + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m4 ; res row0 done(0,1,2,3) + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + pmaddwd m3, [r5] + paddd m1, m5 ;res row1 done(1, 2, 3, 4) + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + pmaddwd m4, [r5] + paddd m2, m6 ;res row2 done(2,3,4,5) + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm7, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddwd m7, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m7 ;res row3 done(3,4,5,6) + movu xm7, [r0 + r4] ; m7 = row 7 + punpckhwd xm8, xm6, xm7 + punpcklwd xm6, xm7 + vinserti128 m6, m6, xm8, 1 + pmaddwd m8, m6, [r5 + 1 * mmsize] + pmaddwd m6, [r5] + paddd m4, m8 ;res row4 done(4,5,6,7) + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 8 + punpckhwd xm9, xm7, xm8 + punpcklwd xm7, xm8 + vinserti128 m7, m7, xm9, 1 + pmaddwd m9, m7, [r5 + 1 * mmsize] + pmaddwd m7, [r5] + paddd m5, m9 ;res row5 done(5,6,7,8) + movu xm9, [r0 + r1] ; m9 = row 9 + punpckhwd xm10, xm8, xm9 + punpcklwd xm8, xm9 + vinserti128 m8, m8, xm10, 1 + pmaddwd m8, [r5 + 1 * mmsize] + paddd m6, m8 ;res row6 done(6,7,8,9) + movu xm10, [r0 + r1 * 2] ; m10 = row 10 + punpckhwd xm8, xm9, xm10 + punpcklwd xm9, xm10 + vinserti128 m9, m9, xm8, 1 + pmaddwd m9, [r5 + 1 * mmsize] + paddd m7, m9 ;res row7 done 7,8,9,10 + lea r4, [r3 * 3] +%ifnidn %1,ss + paddd m0, m11 + paddd m1, m11 + paddd m2, m11 + paddd m3, m11 +%endif + psrad m0, %3 + psrad m1, %3 + psrad m2, %3 + psrad m3, %3 + packssdw m0, m1 + packssdw m2, m3 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + pxor m1, m1 + mova m3, [pw_pixel_max] +%if %2 + CLIPW m0, m1, m3 + CLIPW m2, m1, m3 +%endif + vextracti128 xm9, m0, 1 + vextracti128 xm8, m2, 1 + movu [r2], xm0 + movu [r2 + r3], xm9 + movu [r2 + r3 * 2], xm2 + movu [r2 + r4], xm8 +%ifnidn %1,ss + paddd m4, m11 + paddd m5, m11 + paddd m6, m11 + paddd m7, m11 +%endif + psrad m4, %3 + psrad m5, %3 + psrad m6, %3 + psrad m7, %3 + packssdw m4, m5 + packssdw m6, m7 + vpermq m4, m4, q3120 + vpermq m6, m6, q3120 +%if %2 + CLIPW m4, m1, m3 + CLIPW m6, m1, m3 +%endif + vextracti128 xm5, m4, 1 + vextracti128 xm7, m6, 1 + lea r2, [r2 + r3 * 4] + movu [r2], xm4 + movu [r2 + r3], xm5 + movu [r2 + r3 * 2], xm6 + movu [r2 + r4], xm7 + RET +%endif +%endmacro + +FILTER_VER_CHROMA_AVX2_8x8 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_8x8 ps, 0, 2 +FILTER_VER_CHROMA_AVX2_8x8 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_8x8 ss, 0, 6 + +%macro FILTER_VER_CHROMA_AVX2_8x6 3 +INIT_YMM 
avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_8x6, 4, 6, 12 + mov r4d, r4m + add r1d, r1d + add r3d, r3d + shl r4d, 6 + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 + +%ifidn %1,pp + vbroadcasti128 m11, [pd_32] +%elifidn %1, sp + mova m11, [pd_524800] +%else + vbroadcasti128 m11, [INTERP_OFFSET_PS] +%endif + + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + pmaddwd m2, [r5] + paddd m0, m4 ; r0 done(0,1,2,3) + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + pmaddwd m3, [r5] + paddd m1, m5 ;r1 done(1, 2, 3, 4) + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + pmaddwd m4, [r5] + paddd m2, m6 ;r2 done(2,3,4,5) + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm7, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddwd m7, m5, [r5 + 1 * mmsize] + pmaddwd m5, [r5] + paddd m3, m7 ;r3 done(3,4,5,6) + movu xm7, [r0 + r4] ; m7 = row 7 + punpckhwd xm8, xm6, xm7 + punpcklwd xm6, xm7 + vinserti128 m6, m6, xm8, 1 + pmaddwd m8, m6, [r5 + 1 * mmsize] + paddd m4, m8 ;r4 done(4,5,6,7) + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 8 + punpckhwd xm9, xm7, xm8 + punpcklwd xm7, xm8 + vinserti128 m7, m7, xm9, 1 + pmaddwd m7, m7, [r5 + 1 * mmsize] + paddd m5, m7 ;r5 done(5,6,7,8) + lea r4, [r3 * 3] +%ifnidn %1,ss + paddd m0, m11 + paddd m1, m11 + paddd m2, m11 + paddd m3, m11 +%endif + psrad m0, %3 + psrad m1, %3 + psrad m2, %3 + psrad m3, %3 + packssdw m0, m1 + packssdw m2, m3 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + pxor m10, m10 + mova m9, [pw_pixel_max] +%if %2 + CLIPW m0, m10, m9 + CLIPW m2, m10, m9 +%endif + vextracti128 xm1, m0, 1 + vextracti128 xm3, m2, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r4], xm3 +%ifnidn %1,ss + paddd m4, m11 + paddd m5, m11 +%endif + psrad m4, %3 + psrad m5, %3 + packssdw m4, m5 + vpermq m4, m4, 11011000b +%if %2 + CLIPW m4, m10, m9 +%endif + vextracti128 xm5, m4, 1 + lea r2, [r2 + r3 * 4] + movu [r2], xm4 + movu [r2 + r3], xm5 + RET +%endif +%endmacro + +FILTER_VER_CHROMA_AVX2_8x6 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_8x6 ps, 0, 2 +FILTER_VER_CHROMA_AVX2_8x6 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_8x6 ss, 0, 6 + +%macro PROCESS_CHROMA_AVX2 3 + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] + movu xm5, [r0 + 
r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m4, [r5 + 1 * mmsize] + paddd m2, m4 + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm4, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm4, 1 + pmaddwd m5, [r5 + 1 * mmsize] + paddd m3, m5 +%ifnidn %1,ss + paddd m0, m7 + paddd m1, m7 + paddd m2, m7 + paddd m3, m7 +%endif + psrad m0, %3 + psrad m1, %3 + psrad m2, %3 + psrad m3, %3 + packssdw m0, m1 + packssdw m2, m3 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + pxor m4, m4 +%if %2 + CLIPW m0, m4, [pw_pixel_max] + CLIPW m2, m4, [pw_pixel_max] +%endif + vextracti128 xm1, m0, 1 + vextracti128 xm3, m2, 1 +%endmacro + + +%macro FILTER_VER_CHROMA_AVX2_8x4 3 +INIT_YMM avx2 +cglobal interp_4tap_vert_%1_8x4, 4, 6, 8 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + add r3d, r3d +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + vbroadcasti128 m7, [pd_32] +%elifidn %1, sp + mova m7, [pd_524800] +%else + vbroadcasti128 m7, [INTERP_OFFSET_PS] +%endif + PROCESS_CHROMA_AVX2 %1, %2, %3 + movu [r2], xm0 + movu [r2 + r3], xm1 + movu [r2 + r3 * 2], xm2 + lea r4, [r3 * 3] + movu [r2 + r4], xm3 + RET +%endmacro + +FILTER_VER_CHROMA_AVX2_8x4 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_8x4 ps, 0, 2 +FILTER_VER_CHROMA_AVX2_8x4 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_8x4 ss, 0, 6 + +%macro FILTER_VER_CHROMA_AVX2_8x12 3 +INIT_YMM avx2 +%if ARCH_X86_64 == 1 +cglobal interp_4tap_vert_%1_8x12, 4, 7, 15 + mov r4d, r4m + shl r4d, 6 + add r1d, r1d + add r3d, r3d + +%ifdef PIC + lea r5, [tab_ChromaCoeffVer] + add r5, r4 +%else + lea r5, [tab_ChromaCoeffVer + r4] +%endif + + lea r4, [r1 * 3] + sub r0, r1 +%ifidn %1,pp + vbroadcasti128 m14, [pd_32] +%elifidn %1, sp + mova m14, [pd_524800] +%else + vbroadcasti128 m14, [INTERP_OFFSET_PS] +%endif + lea r6, [r3 * 3] + movu xm0, [r0] ; m0 = row 0 + movu xm1, [r0 + r1] ; m1 = row 1 + punpckhwd xm2, xm0, xm1 + punpcklwd xm0, xm1 + vinserti128 m0, m0, xm2, 1 + pmaddwd m0, [r5] + movu xm2, [r0 + r1 * 2] ; m2 = row 2 + punpckhwd xm3, xm1, xm2 + punpcklwd xm1, xm2 + vinserti128 m1, m1, xm3, 1 + pmaddwd m1, [r5] + movu xm3, [r0 + r4] ; m3 = row 3 + punpckhwd xm4, xm2, xm3 + punpcklwd xm2, xm3 + vinserti128 m2, m2, xm4, 1 + pmaddwd m4, m2, [r5 + 1 * mmsize] + paddd m0, m4 + pmaddwd m2, [r5] + lea r0, [r0 + r1 * 4] + movu xm4, [r0] ; m4 = row 4 + punpckhwd xm5, xm3, xm4 + punpcklwd xm3, xm4 + vinserti128 m3, m3, xm5, 1 + pmaddwd m5, m3, [r5 + 1 * mmsize] + paddd m1, m5 + pmaddwd m3, [r5] + movu xm5, [r0 + r1] ; m5 = row 5 + punpckhwd xm6, xm4, xm5 + punpcklwd xm4, xm5 + vinserti128 m4, m4, xm6, 1 + pmaddwd m6, m4, [r5 + 1 * mmsize] + paddd m2, m6 + pmaddwd m4, [r5] + movu xm6, [r0 + r1 * 2] ; m6 = row 6 + punpckhwd xm7, xm5, xm6 + punpcklwd xm5, xm6 + vinserti128 m5, m5, xm7, 1 + pmaddwd m7, m5, [r5 + 1 * mmsize] + paddd m3, m7 + pmaddwd m5, [r5] + movu xm7, [r0 + r4] ; m7 = row 7 + punpckhwd xm8, xm6, xm7 + punpcklwd xm6, xm7 + vinserti128 m6, m6, xm8, 1 + pmaddwd m8, m6, [r5 + 1 * mmsize] + paddd m4, m8 + pmaddwd m6, [r5] + lea r0, [r0 + r1 * 4] + movu xm8, [r0] ; m8 = row 8 + punpckhwd xm9, xm7, xm8 + punpcklwd xm7, xm8 + vinserti128 m7, m7, xm9, 1 + pmaddwd m9, m7, [r5 + 1 * mmsize] + paddd m5, m9 + pmaddwd m7, [r5] + movu xm9, [r0 + r1] ; m9 = row 9 + punpckhwd xm10, xm8, xm9 + punpcklwd xm8, xm9 + vinserti128 m8, m8, xm10, 1 + pmaddwd m10, m8, [r5 + 1 * mmsize] + paddd m6, m10 + pmaddwd m8, [r5] + movu xm10, [r0 + r1 * 2] ; m10 = row 10 + 
punpckhwd xm11, xm9, xm10 + punpcklwd xm9, xm10 + vinserti128 m9, m9, xm11, 1 + pmaddwd m11, m9, [r5 + 1 * mmsize] + paddd m7, m11 + pmaddwd m9, [r5] + movu xm11, [r0 + r4] ; m11 = row 11 + punpckhwd xm12, xm10, xm11 + punpcklwd xm10, xm11 + vinserti128 m10, m10, xm12, 1 + pmaddwd m12, m10, [r5 + 1 * mmsize] + paddd m8, m12 + pmaddwd m10, [r5] + lea r0, [r0 + r1 * 4] + movu xm12, [r0] ; m12 = row 12 + punpckhwd xm13, xm11, xm12 + punpcklwd xm11, xm12 + vinserti128 m11, m11, xm13, 1 + pmaddwd m13, m11, [r5 + 1 * mmsize] + paddd m9, m13 + pmaddwd m11, [r5] +%ifnidn %1,ss + paddd m0, m14 + paddd m1, m14 + paddd m2, m14 + paddd m3, m14 + paddd m4, m14 + paddd m5, m14 +%endif + psrad m0, %3 + psrad m1, %3 + psrad m2, %3 + psrad m3, %3 + psrad m4, %3 + psrad m5, %3 + packssdw m0, m1 + packssdw m2, m3 + packssdw m4, m5 + vpermq m0, m0, q3120 + vpermq m2, m2, q3120 + vpermq m4, m4, q3120 + pxor m5, m5 + mova m3, [pw_pixel_max] +%if %2 + CLIPW m0, m5, m3 + CLIPW m2, m5, m3 + CLIPW m4, m5, m3 +%endif + vextracti128 xm1, m0, 1 + movu [r2], xm0 + movu [r2 + r3], xm1 + vextracti128 xm1, m2, 1 + movu [r2 + r3 * 2], xm2 + movu [r2 + r6], xm1 + lea r2, [r2 + r3 * 4] + vextracti128 xm1, m4, 1 + movu [r2], xm4 + movu [r2 + r3], xm1 + movu xm13, [r0 + r1] ; m13 = row 13 + punpckhwd xm0, xm12, xm13 + punpcklwd xm12, xm13 + vinserti128 m12, m12, xm0, 1 + pmaddwd m12, m12, [r5 + 1 * mmsize] + paddd m10, m12 + movu xm0, [r0 + r1 * 2] ; m0 = row 14 + punpckhwd xm1, xm13, xm0 + punpcklwd xm13, xm0 + vinserti128 m13, m13, xm1, 1 + pmaddwd m13, m13, [r5 + 1 * mmsize] + paddd m11, m13 +%ifnidn %1,ss + paddd m6, m14 + paddd m7, m14 + paddd m8, m14 + paddd m9, m14 + paddd m10, m14 + paddd m11, m14 +%endif + psrad m6, %3 + psrad m7, %3 + psrad m8, %3 + psrad m9, %3 + psrad m10, %3 + psrad m11, %3 + packssdw m6, m7 + packssdw m8, m9 + packssdw m10, m11 + vpermq m6, m6, q3120 + vpermq m8, m8, q3120 + vpermq m10, m10, q3120 +%if %2 + CLIPW m6, m5, m3 + CLIPW m8, m5, m3 + CLIPW m10, m5, m3 +%endif + vextracti128 xm7, m6, 1 + vextracti128 xm9, m8, 1 + vextracti128 xm11, m10, 1 + movu [r2 + r3 * 2], xm6 + movu [r2 + r6], xm7 + lea r2, [r2 + r3 * 4] + movu [r2], xm8 + movu [r2 + r3], xm9 + movu [r2 + r3 * 2], xm10 + movu [r2 + r6], xm11 + RET +%endif +%endmacro + +FILTER_VER_CHROMA_AVX2_8x12 pp, 1, 6 +FILTER_VER_CHROMA_AVX2_8x12 ps, 0, 2 +FILTER_VER_CHROMA_AVX2_8x12 sp, 1, 10 +FILTER_VER_CHROMA_AVX2_8x12 ss, 0, 6
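The pp/ps/sp/ss variants instantiated above share one dataflow and differ only in the rounding constant, the right shift, and whether the result is clamped; these are exactly the offset register, the %2 clip flag, and the %3 shift that each invocation passes. As a cross-check, here is a minimal scalar C++ sketch of what one output sample of these 4-tap vertical filters computes in a 10-bit build; the names are illustrative rather than x265's reference code, and 1023 stands in for pw_pixel_max:

    #include <algorithm>
    #include <cstdint>

    // variant   offset   shift  clip   constant used by the asm
    //   pp          32       6  yes    pd_32
    //   ps      -32768       2  no     INTERP_OFFSET_PS
    //   sp      524800      10  yes    pd_524800
    //   ss           0       6  no     (none)
    static int16_t vert4Tap(const int16_t* src, intptr_t stride,
                            const int16_t coeff[4],
                            int32_t offset, int shift, bool clip)
    {
        int32_t sum = 0;
        for (int i = 0; i < 4; i++)            // rows N-1 .. N+2, as in the asm
            sum += coeff[i] * src[i * stride]; // pmaddwd/paddd accumulation
        int32_t val = (sum + offset) >> shift; // paddd + psrad
        if (clip)                              // CLIPW against pw_pixel_max
            val = std::min(std::max(val, 0), 1023);
        return (int16_t)val;
    }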
x265_1.7.tar.gz/source/common/x86/ipfilter8.asm -> x265_1.8.tar.gz/source/common/x86/ipfilter8.asm
Changed
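Most hunks below deduplicate the 8-bit filters: the copy-pasted per-size bodies (interp_4tap_horiz_pp_2x4/2x8/2x16, the 4xN family, and the vertical pp functions) are folded into width- and height-parameterized macros, and new ps (pixel-to-short) outputs plus filterPixelToShort kernels are added, which is why pw_8192 becomes a new extern. For orientation, a minimal C++ sketch of the two conversions those new paths compute, with illustrative names rather than the project's reference code; IF_INTERNAL_OFFS is the 8192 (0x2000) value behind the pw_2000/pw_8192 constants:

    #include <cstdint>

    static const int32_t IF_INTERNAL_OFFS = 8192; // == 0x2000

    // "ps" 4-tap horizontal output for 8-bit input: the filtered sum keeps
    // its intermediate precision (no post-filter shift at 8 bpp) and the
    // offset is removed, matching the psubw with pw_2000 in FILTER_4TAP_HPS_sse3.
    static int16_t horiz4TapPs(const uint8_t* src, const int16_t coeff[4])
    {
        int32_t sum = 0;
        for (int i = 0; i < 4; i++)
            sum += coeff[i] * src[i - 1];      // taps at columns -1 .. +2
        return (int16_t)(sum - IF_INTERNAL_OFFS);
    }

    // filterPixelToShort: promote a plain pixel into the same intermediate
    // domain, matching the psllw 6 / psubw pw_8192 pairing in FILTER_P2S.
    static int16_t pixelToShort(uint8_t pix)
    {
        return (int16_t)(((int32_t)pix << 6) - IF_INTERNAL_OFFS);
    }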
@@ -301,6 +301,7 @@ cextern pw_32 cextern pw_512 cextern pw_2000 +cextern pw_8192 %macro FILTER_H4_w2_2_sse2 0 pxor m3, m3 @@ -330,80 +331,38 @@ %endmacro ;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse3 -cglobal interp_4tap_horiz_pp_2x4, 4, 6, 6, src, srcstride, dst, dststride - mov r4d, r4m - mova m5, [pw_32] - -%ifdef PIC - lea r5, [tabw_ChromaCoeff] - movddup m4, [r5 + r4 * 8] -%else - movddup m4, [tabw_ChromaCoeff + r4 * 8] -%endif - - FILTER_H4_w2_2_sse2 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] - FILTER_H4_w2_2_sse2 - - RET - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +; void interp_4tap_horiz_pp_2xN(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- +%macro FILTER_H4_W2xN_sse3 1 INIT_XMM sse3 -cglobal interp_4tap_horiz_pp_2x8, 4, 6, 6, src, srcstride, dst, dststride - mov r4d, r4m - mova m5, [pw_32] +cglobal interp_4tap_horiz_pp_2x%1, 4, 6, 6, src, srcstride, dst, dststride + mov r4d, r4m + mova m5, [pw_32] %ifdef PIC - lea r5, [tabw_ChromaCoeff] - movddup m4, [r5 + r4 * 8] + lea r5, [tabw_ChromaCoeff] + movddup m4, [r5 + r4 * 8] %else - movddup m4, [tabw_ChromaCoeff + r4 * 8] + movddup m4, [tabw_ChromaCoeff + r4 * 8] %endif %assign x 1 -%rep 4 +%rep %1/2 FILTER_H4_w2_2_sse2 -%if x < 4 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] +%if x < %1/2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] %endif %assign x x+1 %endrep RET -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_2x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse3 -cglobal interp_4tap_horiz_pp_2x16, 4, 6, 6, src, srcstride, dst, dststride - mov r4d, r4m - mova m5, [pw_32] - -%ifdef PIC - lea r5, [tabw_ChromaCoeff] - movddup m4, [r5 + r4 * 8] -%else - movddup m4, [tabw_ChromaCoeff + r4 * 8] -%endif - -%assign x 1 -%rep 8 - FILTER_H4_w2_2_sse2 -%if x < 8 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] -%endif -%assign x x+1 -%endrep +%endmacro - RET + FILTER_H4_W2xN_sse3 4 + FILTER_H4_W2xN_sse3 8 + FILTER_H4_W2xN_sse3 16 %macro FILTER_H4_w4_2_sse2 0 pxor m5, m5 @@ -447,143 +406,41 @@ %endmacro ;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse3 -cglobal interp_4tap_horiz_pp_4x2, 4, 6, 8, src, srcstride, dst, dststride - mov r4d, r4m - mova m7, [pw_32] - -%ifdef PIC - lea r5, [tabw_ChromaCoeff] - movddup m4, [r5 + r4 * 8] -%else - movddup m4, [tabw_ChromaCoeff + r4 * 8] -%endif - - FILTER_H4_w4_2_sse2 - - RET - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_4x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) 
-;----------------------------------------------------------------------------- -INIT_XMM sse3 -cglobal interp_4tap_horiz_pp_4x4, 4, 6, 8, src, srcstride, dst, dststride - mov r4d, r4m - mova m7, [pw_32] - -%ifdef PIC - lea r5, [tabw_ChromaCoeff] - movddup m4, [r5 + r4 * 8] -%else - movddup m4, [tabw_ChromaCoeff + r4 * 8] -%endif - - FILTER_H4_w4_2_sse2 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] - FILTER_H4_w4_2_sse2 - - RET - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_4x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse3 -cglobal interp_4tap_horiz_pp_4x8, 4, 6, 8, src, srcstride, dst, dststride - mov r4d, r4m - mova m7, [pw_32] - -%ifdef PIC - lea r5, [tabw_ChromaCoeff] - movddup m4, [r5 + r4 * 8] -%else - movddup m4, [tabw_ChromaCoeff + r4 * 8] -%endif - -%assign x 1 -%rep 4 - FILTER_H4_w4_2_sse2 -%if x < 4 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] -%endif -%assign x x+1 -%endrep - - RET - -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_4x16(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -INIT_XMM sse3 -cglobal interp_4tap_horiz_pp_4x16, 4, 6, 8, src, srcstride, dst, dststride - mov r4d, r4m - mova m7, [pw_32] - -%ifdef PIC - lea r5, [tabw_ChromaCoeff] - movddup m4, [r5 + r4 * 8] -%else - movddup m4, [tabw_ChromaCoeff + r4 * 8] -%endif - -%assign x 1 -%rep 8 - FILTER_H4_w4_2_sse2 -%if x < 8 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] -%endif -%assign x x+1 -%endrep - - RET - -;----------------------------------------------------------------------------- ; void interp_4tap_horiz_pp_4x32(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- +%macro FILTER_H4_W4xN_sse3 1 INIT_XMM sse3 -cglobal interp_4tap_horiz_pp_4x32, 4, 6, 8, src, srcstride, dst, dststride - mov r4d, r4m - mova m7, [pw_32] +cglobal interp_4tap_horiz_pp_4x%1, 4, 6, 8, src, srcstride, dst, dststride + mov r4d, r4m + mova m7, [pw_32] %ifdef PIC - lea r5, [tabw_ChromaCoeff] - movddup m4, [r5 + r4 * 8] + lea r5, [tabw_ChromaCoeff] + movddup m4, [r5 + r4 * 8] %else - movddup m4, [tabw_ChromaCoeff + r4 * 8] + movddup m4, [tabw_ChromaCoeff + r4 * 8] %endif %assign x 1 -%rep 16 +%rep %1/2 FILTER_H4_w4_2_sse2 -%if x < 16 - lea srcq, [srcq + srcstrideq * 2] - lea dstq, [dstq + dststrideq * 2] +%if x < %1/2 + lea srcq, [srcq + srcstrideq * 2] + lea dstq, [dstq + dststrideq * 2] %endif %assign x x+1 %endrep RET -%macro FILTER_H4_w2_2 3 - movh %2, [srcq - 1] - pshufb %2, %2, Tm0 - movh %1, [srcq + srcstrideq - 1] - pshufb %1, %1, Tm0 - punpcklqdq %2, %1 - pmaddubsw %2, coef2 - phaddw %2, %2 - pmulhrsw %2, %3 - packuswb %2, %2 - movd r4, %2 - mov [dstq], r4w - shr r4, 16 - mov [dstq + dststrideq], r4w %endmacro + FILTER_H4_W4xN_sse3 2 + FILTER_H4_W4xN_sse3 4 + FILTER_H4_W4xN_sse3 8 + FILTER_H4_W4xN_sse3 16 + FILTER_H4_W4xN_sse3 32 + %macro FILTER_H4_w6_sse2 0 pxor m4, m4 movh m0, [srcq - 1] @@ -762,58 +619,145 @@ IPFILTER_CHROMA_sse3 8, 64 IPFILTER_CHROMA_sse3 12, 32 -;----------------------------------------------------------------------------- -; void interp_4tap_horiz_pp_%1x%2(pixel *src, 
intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) -;----------------------------------------------------------------------------- -%macro IPFILTER_CHROMA_W_sse3 2 + IPFILTER_CHROMA_sse3 16, 4 + IPFILTER_CHROMA_sse3 16, 8 + IPFILTER_CHROMA_sse3 16, 12 + IPFILTER_CHROMA_sse3 16, 16 + IPFILTER_CHROMA_sse3 16, 32 + IPFILTER_CHROMA_sse3 32, 8 + IPFILTER_CHROMA_sse3 32, 16 + IPFILTER_CHROMA_sse3 32, 24 + IPFILTER_CHROMA_sse3 24, 32 + IPFILTER_CHROMA_sse3 32, 32 + + IPFILTER_CHROMA_sse3 16, 24 + IPFILTER_CHROMA_sse3 16, 64 + IPFILTER_CHROMA_sse3 32, 48 + IPFILTER_CHROMA_sse3 24, 64 + IPFILTER_CHROMA_sse3 32, 64 + + IPFILTER_CHROMA_sse3 64, 64 + IPFILTER_CHROMA_sse3 64, 32 + IPFILTER_CHROMA_sse3 64, 48 + IPFILTER_CHROMA_sse3 48, 64 + IPFILTER_CHROMA_sse3 64, 16 + +%macro FILTER_2 2 + movd m3, [srcq + %1] + movd m4, [srcq + 1 + %1] + punpckldq m3, m4 + punpcklbw m3, m0 + pmaddwd m3, m1 + packssdw m3, m3 + pshuflw m4, m3, q2301 + paddw m3, m4 + psrldq m3, 2 + psubw m3, m2 + movd [dstq + %2], m3 +%endmacro + +%macro FILTER_4 2 + movd m3, [srcq + %1] + movd m4, [srcq + 1 + %1] + punpckldq m3, m4 + punpcklbw m3, m0 + pmaddwd m3, m1 + movd m4, [srcq + 2 + %1] + movd m5, [srcq + 3 + %1] + punpckldq m4, m5 + punpcklbw m4, m0 + pmaddwd m4, m1 + packssdw m3, m4 + pshuflw m4, m3, q2301 + pshufhw m4, m4, q2301 + paddw m3, m4 + psrldq m3, 2 + pshufd m3, m3, q3120 + psubw m3, m2 + movh [dstq + %2], m3 +%endmacro + +%macro FILTER_4TAP_HPS_sse3 2 INIT_XMM sse3 -cglobal interp_4tap_horiz_pp_%1x%2, 4, 6, 8, src, srcstride, dst, dststride - mov r4d, r4m - mova m7, [pw_32] - pxor m4, m4 +cglobal interp_4tap_horiz_ps_%1x%2, 4, 7, 6, src, srcstride, dst, dststride + mov r4d, r4m + add dststrided, dststrided + mova m2, [pw_2000] + pxor m0, m0 + %ifdef PIC - lea r5, [tabw_ChromaCoeff] - movddup m6, [r5 + r4 * 8] + lea r6, [tabw_ChromaCoeff] + movddup m1, [r6 + r4 * 8] %else - movddup m6, [tabw_ChromaCoeff + r4 * 8] + movddup m1, [tabw_ChromaCoeff + r4 * 8] %endif -%assign x 1 -%rep %2 - FILTER_H4_w%1_sse2 -%if x < %2 - add srcq, srcstrideq - add dstq, dststrideq -%endif -%assign x x+1 + mov r4d, %2 + cmp r5m, byte 0 + je .loopH + sub srcq, srcstrideq + add r4d, 3 + +.loopH: +%assign x -1 +%assign y 0 +%rep %1/4 + FILTER_4 x,y +%assign x x+4 +%assign y y+8 %endrep +%rep (%1 % 4)/2 + FILTER_2 x,y +%endrep + add srcq, srcstrideq + add dstq, dststrideq + dec r4d + jnz .loopH RET %endmacro - IPFILTER_CHROMA_W_sse3 16, 4 - IPFILTER_CHROMA_W_sse3 16, 8 - IPFILTER_CHROMA_W_sse3 16, 12 - IPFILTER_CHROMA_W_sse3 16, 16 - IPFILTER_CHROMA_W_sse3 16, 32 - IPFILTER_CHROMA_W_sse3 32, 8 - IPFILTER_CHROMA_W_sse3 32, 16 - IPFILTER_CHROMA_W_sse3 32, 24 - IPFILTER_CHROMA_W_sse3 24, 32 - IPFILTER_CHROMA_W_sse3 32, 32 - - IPFILTER_CHROMA_W_sse3 16, 24 - IPFILTER_CHROMA_W_sse3 16, 64 - IPFILTER_CHROMA_W_sse3 32, 48 - IPFILTER_CHROMA_W_sse3 24, 64 - IPFILTER_CHROMA_W_sse3 32, 64 - - IPFILTER_CHROMA_W_sse3 64, 64 - IPFILTER_CHROMA_W_sse3 64, 32 - IPFILTER_CHROMA_W_sse3 64, 48 - IPFILTER_CHROMA_W_sse3 48, 64 - IPFILTER_CHROMA_W_sse3 64, 16 + FILTER_4TAP_HPS_sse3 2, 4 + FILTER_4TAP_HPS_sse3 2, 8 + FILTER_4TAP_HPS_sse3 2, 16 + FILTER_4TAP_HPS_sse3 4, 2 + FILTER_4TAP_HPS_sse3 4, 4 + FILTER_4TAP_HPS_sse3 4, 8 + FILTER_4TAP_HPS_sse3 4, 16 + FILTER_4TAP_HPS_sse3 4, 32 + FILTER_4TAP_HPS_sse3 6, 8 + FILTER_4TAP_HPS_sse3 6, 16 + FILTER_4TAP_HPS_sse3 8, 2 + FILTER_4TAP_HPS_sse3 8, 4 + FILTER_4TAP_HPS_sse3 8, 6 + FILTER_4TAP_HPS_sse3 8, 8 + FILTER_4TAP_HPS_sse3 8, 12 + FILTER_4TAP_HPS_sse3 8, 16 + FILTER_4TAP_HPS_sse3 8, 32 + 
FILTER_4TAP_HPS_sse3 8, 64 + FILTER_4TAP_HPS_sse3 12, 16 + FILTER_4TAP_HPS_sse3 12, 32 + FILTER_4TAP_HPS_sse3 16, 4 + FILTER_4TAP_HPS_sse3 16, 8 + FILTER_4TAP_HPS_sse3 16, 12 + FILTER_4TAP_HPS_sse3 16, 16 + FILTER_4TAP_HPS_sse3 16, 24 + FILTER_4TAP_HPS_sse3 16, 32 + FILTER_4TAP_HPS_sse3 16, 64 + FILTER_4TAP_HPS_sse3 24, 32 + FILTER_4TAP_HPS_sse3 24, 64 + FILTER_4TAP_HPS_sse3 32, 8 + FILTER_4TAP_HPS_sse3 32, 16 + FILTER_4TAP_HPS_sse3 32, 24 + FILTER_4TAP_HPS_sse3 32, 32 + FILTER_4TAP_HPS_sse3 32, 48 + FILTER_4TAP_HPS_sse3 32, 64 + FILTER_4TAP_HPS_sse3 48, 64 + FILTER_4TAP_HPS_sse3 64, 16 + FILTER_4TAP_HPS_sse3 64, 32 + FILTER_4TAP_HPS_sse3 64, 48 + FILTER_4TAP_HPS_sse3 64, 64 %macro FILTER_H8_W8_sse2 0 movh m1, [r0 + x - 3] @@ -1042,6 +986,365 @@ IPFILTER_LUMA_sse2 64, 16, ps IPFILTER_LUMA_sse2 16, 64, ps +%macro PROCESS_LUMA_W4_4R_sse2 0 + movd m2, [r0] + movd m7, [r0 + r1] + punpcklbw m2, m7 ; m2=[0 1] + + lea r0, [r0 + 2 * r1] + movd m3, [r0] + punpcklbw m7, m3 ; m7=[1 2] + punpcklbw m2, m0 + punpcklbw m7, m0 + pmaddwd m2, [r6 + 0 * 32] + pmaddwd m7, [r6 + 0 * 32] + packssdw m2, m7 ; m2=[0+1 1+2] + + movd m7, [r0 + r1] + punpcklbw m3, m7 ; m3=[2 3] + lea r0, [r0 + 2 * r1] + movd m5, [r0] + punpcklbw m7, m5 ; m7=[3 4] + punpcklbw m3, m0 + punpcklbw m7, m0 + pmaddwd m4, m3, [r6 + 1 * 32] + pmaddwd m6, m7, [r6 + 1 * 32] + packssdw m4, m6 ; m4=[2+3 3+4] + paddw m2, m4 ; m2=[0+1+2+3 1+2+3+4] Row1-2 + pmaddwd m3, [r6 + 0 * 32] + pmaddwd m7, [r6 + 0 * 32] + packssdw m3, m7 ; m3=[2+3 3+4] Row3-4 + + movd m7, [r0 + r1] + punpcklbw m5, m7 ; m5=[4 5] + lea r0, [r0 + 2 * r1] + movd m4, [r0] + punpcklbw m7, m4 ; m7=[5 6] + punpcklbw m5, m0 + punpcklbw m7, m0 + pmaddwd m6, m5, [r6 + 2 * 32] + pmaddwd m8, m7, [r6 + 2 * 32] + packssdw m6, m8 ; m6=[4+5 5+6] + paddw m2, m6 ; m2=[0+1+2+3+4+5 1+2+3+4+5+6] Row1-2 + pmaddwd m5, [r6 + 1 * 32] + pmaddwd m7, [r6 + 1 * 32] + packssdw m5, m7 ; m5=[4+5 5+6] + paddw m3, m5 ; m3=[2+3+4+5 3+4+5+6] Row3-4 + + movd m7, [r0 + r1] + punpcklbw m4, m7 ; m4=[6 7] + lea r0, [r0 + 2 * r1] + movd m5, [r0] + punpcklbw m7, m5 ; m7=[7 8] + punpcklbw m4, m0 + punpcklbw m7, m0 + pmaddwd m6, m4, [r6 + 3 * 32] + pmaddwd m8, m7, [r6 + 3 * 32] + packssdw m6, m8 ; m7=[6+7 7+8] + paddw m2, m6 ; m2=[0+1+2+3+4+5+6+7 1+2+3+4+5+6+7+8] Row1-2 end + pmaddwd m4, [r6 + 2 * 32] + pmaddwd m7, [r6 + 2 * 32] + packssdw m4, m7 ; m4=[6+7 7+8] + paddw m3, m4 ; m3=[2+3+4+5+6+7 3+4+5+6+7+8] Row3-4 + + movd m7, [r0 + r1] + punpcklbw m5, m7 ; m5=[8 9] + movd m4, [r0 + 2 * r1] + punpcklbw m7, m4 ; m7=[9 10] + punpcklbw m5, m0 + punpcklbw m7, m0 + pmaddwd m5, [r6 + 3 * 32] + pmaddwd m7, [r6 + 3 * 32] + packssdw m5, m7 ; m5=[8+9 9+10] + paddw m3, m5 ; m3=[2+3+4+5+6+7+8+9 3+4+5+6+7+8+9+10] Row3-4 end +%endmacro + +%macro PROCESS_LUMA_W8_4R_sse2 0 + movq m7, [r0] + movq m6, [r0 + r1] + punpcklbw m7, m6 + punpcklbw m2, m7, m0 + punpckhbw m7, m0 + pmaddwd m2, [r6 + 0 * 32] + pmaddwd m7, [r6 + 0 * 32] + packssdw m2, m7 ; m2=[0+1] Row1 + + lea r0, [r0 + 2 * r1] + movq m7, [r0] + punpcklbw m6, m7 + punpcklbw m3, m6, m0 + punpckhbw m6, m0 + pmaddwd m3, [r6 + 0 * 32] + pmaddwd m6, [r6 + 0 * 32] + packssdw m3, m6 ; m3=[1+2] Row2 + + movq m6, [r0 + r1] + punpcklbw m7, m6 + punpckhbw m8, m7, m0 + punpcklbw m7, m0 + pmaddwd m4, m7, [r6 + 0 * 32] + pmaddwd m9, m8, [r6 + 0 * 32] + packssdw m4, m9 ; m4=[2+3] Row3 + pmaddwd m7, [r6 + 1 * 32] + pmaddwd m8, [r6 + 1 * 32] + packssdw m7, m8 + paddw m2, m7 ; m2=[0+1+2+3] Row1 + + lea r0, [r0 + 2 * r1] + movq m10, [r0] + punpcklbw m6, m10 + punpckhbw m8, m6, m0 + punpcklbw m6, m0 + 
pmaddwd m5, m6, [r6 + 0 * 32] + pmaddwd m9, m8, [r6 + 0 * 32] + packssdw m5, m9 ; m5=[3+4] Row4 + pmaddwd m6, [r6 + 1 * 32] + pmaddwd m8, [r6 + 1 * 32] + packssdw m6, m8 + paddw m3, m6 ; m3 = [1+2+3+4] Row2 + + movq m6, [r0 + r1] + punpcklbw m10, m6 + punpckhbw m8, m10, m0 + punpcklbw m10, m0 + pmaddwd m7, m10, [r6 + 1 * 32] + pmaddwd m9, m8, [r6 + 1 * 32] + packssdw m7, m9 + pmaddwd m10, [r6 + 2 * 32] + pmaddwd m8, [r6 + 2 * 32] + packssdw m10, m8 + paddw m2, m10 ; m2=[0+1+2+3+4+5] Row1 + paddw m4, m7 ; m4=[2+3+4+5] Row3 + + lea r0, [r0 + 2 * r1] + movq m10, [r0] + punpcklbw m6, m10 + punpckhbw m8, m6, m0 + punpcklbw m6, m0 + pmaddwd m7, m6, [r6 + 1 * 32] + pmaddwd m9, m8, [r6 + 1 * 32] + packssdw m7, m9 + pmaddwd m6, [r6 + 2 * 32] + pmaddwd m8, [r6 + 2 * 32] + packssdw m6, m8 + paddw m3, m6 ; m3=[1+2+3+4+5+6] Row2 + paddw m5, m7 ; m5=[3+4+5+6] Row4 + + movq m6, [r0 + r1] + punpcklbw m10, m6 + punpckhbw m8, m10, m0 + punpcklbw m10, m0 + pmaddwd m7, m10, [r6 + 2 * 32] + pmaddwd m9, m8, [r6 + 2 * 32] + packssdw m7, m9 + pmaddwd m10, [r6 + 3 * 32] + pmaddwd m8, [r6 + 3 * 32] + packssdw m10, m8 + paddw m2, m10 ; m2=[0+1+2+3+4+5+6+7] Row1 end + paddw m4, m7 ; m4=[2+3+4+5+6+7] Row3 + + lea r0, [r0 + 2 * r1] + movq m10, [r0] + punpcklbw m6, m10 + punpckhbw m8, m6, m0 + punpcklbw m6, m0 + pmaddwd m7, m6, [r6 + 2 * 32] + pmaddwd m9, m8, [r6 + 2 * 32] + packssdw m7, m9 + pmaddwd m6, [r6 + 3 * 32] + pmaddwd m8, [r6 + 3 * 32] + packssdw m6, m8 + paddw m3, m6 ; m3=[1+2+3+4+5+6+7+8] Row2 end + paddw m5, m7 ; m5=[3+4+5+6+7+8] Row4 + + movq m6, [r0 + r1] + punpcklbw m10, m6 + punpckhbw m8, m10, m0 + punpcklbw m10, m0 + pmaddwd m8, [r6 + 3 * 32] + pmaddwd m10, [r6 + 3 * 32] + packssdw m10, m8 + paddw m4, m10 ; m4=[2+3+4+5+6+7+8+9] Row3 end + + movq m10, [r0 + 2 * r1] + punpcklbw m6, m10 + punpckhbw m8, m6, m0 + punpcklbw m6, m0 + pmaddwd m8, [r6 + 3 * 32] + pmaddwd m6, [r6 + 3 * 32] + packssdw m6, m8 + paddw m5, m6 ; m5=[3+4+5+6+7+8+9+10] Row4 end +%endmacro + +;------------------------------------------------------------------------------------------------------------- +; void interp_8tap_vert_%3_4x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;------------------------------------------------------------------------------------------------------------- +%macro FILTER_VER_LUMA_sse2 3 +INIT_XMM sse2 +cglobal interp_8tap_vert_%3_%1x%2, 5, 8, 11 + lea r5, [3 * r1] + sub r0, r5 + shl r4d, 7 + +%ifdef PIC + lea r6, [pw_LumaCoeffVer] + add r6, r4 +%else + lea r6, [pw_LumaCoeffVer + r4] +%endif + +%ifidn %3,pp + mova m1, [pw_32] +%else + mova m1, [pw_2000] + add r3d, r3d +%endif + + mov r4d, %2/4 + lea r5, [3 * r3] + pxor m0, m0 + +.loopH: +%assign x 0 +%rep (%1 / 8) + PROCESS_LUMA_W8_4R_sse2 + +%ifidn %3,pp + paddw m2, m1 + paddw m3, m1 + paddw m4, m1 + paddw m5, m1 + psraw m2, 6 + psraw m3, 6 + psraw m4, 6 + psraw m5, 6 + + packuswb m2, m3 + packuswb m4, m5 + + movh [r2 + x], m2 + movhps [r2 + r3 + x], m2 + movh [r2 + 2 * r3 + x], m4 + movhps [r2 + r5 + x], m4 +%else + psubw m2, m1 + psubw m3, m1 + psubw m4, m1 + psubw m5, m1 + + movu [r2 + (2*x)], m2 + movu [r2 + r3 + (2*x)], m3 + movu [r2 + 2 * r3 + (2*x)], m4 + movu [r2 + r5 + (2*x)], m5 +%endif +%assign x x+8 +%if %1 > 8 + lea r7, [8 * r1 - 8] + sub r0, r7 +%endif +%endrep + +%rep (%1 % 8)/4 + PROCESS_LUMA_W4_4R_sse2 + +%ifidn %3,pp + paddw m2, m1 + psraw m2, 6 + paddw m3, m1 + psraw m3, 6 + + packuswb m2, m3 + + movd [r2 + x], m2 + psrldq m2, 4 + movd [r2 + r3 + x], m2 + psrldq m2, 4 + movd [r2 + 2 * r3 + x], m2 + 
psrldq m2, 4 + movd [r2 + r5 + x], m2 +%else + psubw m2, m1 + psubw m3, m1 + + movh [r2 + (2*x)], m2 + movhps [r2 + r3 + (2*x)], m2 + movh [r2 + 2 * r3 + (2*x)], m3 + movhps [r2 + r5 + (2*x)], m3 +%endif +%endrep + + lea r2, [r2 + 4 * r3] +%if %1 <= 8 + lea r7, [4 * r1] + sub r0, r7 +%elif %1 == 12 + lea r7, [4 * r1 + 8] + sub r0, r7 +%else + lea r0, [r0 + 4 * r1 - %1] +%endif + + dec r4d + jnz .loopH + + RET + +%endmacro + +%if ARCH_X86_64 + FILTER_VER_LUMA_sse2 4, 4, pp + FILTER_VER_LUMA_sse2 4, 8, pp + FILTER_VER_LUMA_sse2 4, 16, pp + FILTER_VER_LUMA_sse2 8, 4, pp + FILTER_VER_LUMA_sse2 8, 8, pp + FILTER_VER_LUMA_sse2 8, 16, pp + FILTER_VER_LUMA_sse2 8, 32, pp + FILTER_VER_LUMA_sse2 12, 16, pp + FILTER_VER_LUMA_sse2 16, 4, pp + FILTER_VER_LUMA_sse2 16, 8, pp + FILTER_VER_LUMA_sse2 16, 12, pp + FILTER_VER_LUMA_sse2 16, 16, pp + FILTER_VER_LUMA_sse2 16, 32, pp + FILTER_VER_LUMA_sse2 16, 64, pp + FILTER_VER_LUMA_sse2 24, 32, pp + FILTER_VER_LUMA_sse2 32, 8, pp + FILTER_VER_LUMA_sse2 32, 16, pp + FILTER_VER_LUMA_sse2 32, 24, pp + FILTER_VER_LUMA_sse2 32, 32, pp + FILTER_VER_LUMA_sse2 32, 64, pp + FILTER_VER_LUMA_sse2 48, 64, pp + FILTER_VER_LUMA_sse2 64, 16, pp + FILTER_VER_LUMA_sse2 64, 32, pp + FILTER_VER_LUMA_sse2 64, 48, pp + FILTER_VER_LUMA_sse2 64, 64, pp + + FILTER_VER_LUMA_sse2 4, 4, ps + FILTER_VER_LUMA_sse2 4, 8, ps + FILTER_VER_LUMA_sse2 4, 16, ps + FILTER_VER_LUMA_sse2 8, 4, ps + FILTER_VER_LUMA_sse2 8, 8, ps + FILTER_VER_LUMA_sse2 8, 16, ps + FILTER_VER_LUMA_sse2 8, 32, ps + FILTER_VER_LUMA_sse2 12, 16, ps + FILTER_VER_LUMA_sse2 16, 4, ps + FILTER_VER_LUMA_sse2 16, 8, ps + FILTER_VER_LUMA_sse2 16, 12, ps + FILTER_VER_LUMA_sse2 16, 16, ps + FILTER_VER_LUMA_sse2 16, 32, ps + FILTER_VER_LUMA_sse2 16, 64, ps + FILTER_VER_LUMA_sse2 24, 32, ps + FILTER_VER_LUMA_sse2 32, 8, ps + FILTER_VER_LUMA_sse2 32, 16, ps + FILTER_VER_LUMA_sse2 32, 24, ps + FILTER_VER_LUMA_sse2 32, 32, ps + FILTER_VER_LUMA_sse2 32, 64, ps + FILTER_VER_LUMA_sse2 48, 64, ps + FILTER_VER_LUMA_sse2 64, 16, ps + FILTER_VER_LUMA_sse2 64, 32, ps + FILTER_VER_LUMA_sse2 64, 48, ps + FILTER_VER_LUMA_sse2 64, 64, ps +%endif + %macro WORD_TO_DOUBLE 1 %if ARCH_X86_64 punpcklbw %1, m8 @@ -1052,19 +1355,26 @@ %endmacro ;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_2xn(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +; void interp_4tap_vert_%1_2x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- -%macro FILTER_V4_W2_H4_sse2 1 +%macro FILTER_V4_W2_H4_sse2 2 INIT_XMM sse2 %if ARCH_X86_64 -cglobal interp_4tap_vert_pp_2x%1, 4, 6, 9 +cglobal interp_4tap_vert_%1_2x%2, 4, 6, 9 pxor m8, m8 %else -cglobal interp_4tap_vert_pp_2x%1, 4, 6, 8 +cglobal interp_4tap_vert_%1_2x%2, 4, 6, 8 %endif mov r4d, r4m sub r0, r1 +%ifidn %1,pp + mova m1, [pw_32] +%elifidn %1,ps + mova m1, [pw_2000] + add r3d, r3d +%endif + %ifdef PIC lea r5, [tabw_ChromaCoeff] movh m0, [r5 + r4 * 8] @@ -1073,11 +1383,10 @@ %endif punpcklqdq m0, m0 - mova m1, [pw_32] lea r5, [3 * r1] %assign x 1 -%rep %1/4 +%rep %2/4 movd m2, [r0] movd m3, [r0 + r1] movd m4, [r0 + 2 * r1] @@ -1104,7 +1413,6 @@ pshuflw m3, m2, q2301 pshufhw m3, m3, q2301 paddw m2, m3 - psrld m2, 16 movd m7, [r0 + r1] @@ -1128,8 +1436,10 @@ pshuflw m5, m4, q2301 pshufhw m5, m5, q2301 paddw m4, m5 - psrld m4, 16 +%ifidn %1,pp + psrld m2, 16 + psrld m4, 16 packssdw m2, m4 paddw m2, m1 psraw m2, 6 @@ -1157,8 
+1467,24 @@ shr r4, 16 mov [r2 + r3], r4w %endif +%elifidn %1,ps + psrldq m2, 2 + psrldq m4, 2 + pshufd m2, m2, q3120 + pshufd m4, m4, q3120 + psubw m4, m1 + psubw m2, m1 -%if x < %1/4 + movd [r2], m2 + psrldq m2, 4 + movd [r2 + r3], m2 + lea r2, [r2 + 2 * r3] + movd [r2], m4 + psrldq m4, 4 + movd [r2 + r3], m4 +%endif + +%if x < %2/4 lea r2, [r2 + 2 * r3] %endif %assign x x+1 @@ -1167,16 +1493,20 @@ %endmacro - FILTER_V4_W2_H4_sse2 4 - FILTER_V4_W2_H4_sse2 8 - FILTER_V4_W2_H4_sse2 16 + FILTER_V4_W2_H4_sse2 pp, 4 + FILTER_V4_W2_H4_sse2 pp, 8 + FILTER_V4_W2_H4_sse2 pp, 16 + + FILTER_V4_W2_H4_sse2 ps, 4 + FILTER_V4_W2_H4_sse2 ps, 8 + FILTER_V4_W2_H4_sse2 ps, 16 ;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +; void interp_4tap_vert_%1_4x2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- +%macro FILTER_V2_W4_H4_sse2 1 INIT_XMM sse2 -cglobal interp_4tap_vert_pp_4x2, 4, 6, 8 - +cglobal interp_4tap_vert_%1_4x2, 4, 6, 8 mov r4d, r4m sub r0, r1 pxor m7, m7 @@ -1225,6 +1555,8 @@ pshuflw m5, m3, q2301 pshufhw m5, m5, q2301 paddw m3, m5 + +%ifidn %1, pp psrld m2, 16 psrld m3, 16 packssdw m2, m3 @@ -1236,18 +1568,35 @@ movd [r2], m2 psrldq m2, 4 movd [r2 + r3], m2 +%elifidn %1, ps + psrldq m2, 2 + psrldq m3, 2 + pshufd m2, m2, q3120 + pshufd m3, m3, q3120 + punpcklqdq m2, m3 + + add r3d, r3d + psubw m2, [pw_2000] + movh [r2], m2 + movhps [r2 + r3], m2 +%endif RET +%endmacro + + FILTER_V2_W4_H4_sse2 pp + FILTER_V2_W4_H4_sse2 ps + ;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +; void interp_4tap_vert_%1_4x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- -%macro FILTER_V4_W4_H4_sse2 1 +%macro FILTER_V4_W4_H4_sse2 2 INIT_XMM sse2 %if ARCH_X86_64 -cglobal interp_4tap_vert_pp_4x%1, 4, 6, 9 +cglobal interp_4tap_vert_%1_4x%2, 4, 6, 9 pxor m8, m8 %else -cglobal interp_4tap_vert_pp_4x%1, 4, 6, 8 +cglobal interp_4tap_vert_%1_4x%2, 4, 6, 8 %endif mov r4d, r4m @@ -1260,12 +1609,19 @@ movh m0, [tabw_ChromaCoeff + r4 * 8] %endif +%ifidn %1,pp mova m1, [pw_32] +%elifidn %1,ps + add r3d, r3d + mova m1, [pw_2000] +%endif + lea r5, [3 * r1] + lea r4, [3 * r3] punpcklqdq m0, m0 %assign x 1 -%rep %1/4 +%rep %2/4 movd m2, [r0] movd m3, [r0 + r1] movd m4, [r0 + 2 * r1] @@ -1302,12 +1658,24 @@ pshuflw m7, m3, q2301 pshufhw m7, m7, q2301 paddw m3, m7 + +%ifidn %1,pp psrld m2, 16 psrld m3, 16 packssdw m2, m3 - paddw m2, m1 psraw m2, 6 +%elifidn %1,ps + psrldq m2, 2 + psrldq m3, 2 + pshufd m2, m2, q3120 + pshufd m3, m3, q3120 + punpcklqdq m2, m3 + + psubw m2, m1 + movh [r2], m2 + movhps [r2 + r3], m2 +%endif movd m7, [r0 + r1] @@ -1341,6 +1709,8 @@ pshuflw m7, m5, q2301 pshufhw m7, m7, q2301 paddw m5, m7 + +%ifidn %1,pp psrld m4, 16 psrld m5, 16 packssdw m4, m5 @@ -1352,32 +1722,47 @@ movd [r2], m2 psrldq m2, 4 movd [r2 + r3], m2 - lea r2, [r2 + 2 * r3] psrldq m2, 4 - movd [r2], m2 + movd [r2 + 2 * r3], m2 psrldq m2, 4 - movd [r2 + r3], m2 + movd [r2 + r4], m2 +%elifidn %1,ps + psrldq m4, 2 + psrldq m5, 2 + pshufd m4, m4, q3120 + pshufd m5, m5, q3120 + punpcklqdq m4, m5 + psubw m4, m1 + movh [r2 + 2 * r3], m4 + movhps [r2 + r4], m4 
+%endif -%if x < %1/4 - lea r2, [r2 + 2 * r3] +%if x < %2/4 + lea r2, [r2 + 4 * r3] %endif + %assign x x+1 %endrep RET + %endmacro - FILTER_V4_W4_H4_sse2 4 - FILTER_V4_W4_H4_sse2 8 - FILTER_V4_W4_H4_sse2 16 - FILTER_V4_W4_H4_sse2 32 + FILTER_V4_W4_H4_sse2 pp, 4 + FILTER_V4_W4_H4_sse2 pp, 8 + FILTER_V4_W4_H4_sse2 pp, 16 + FILTER_V4_W4_H4_sse2 pp, 32 + + FILTER_V4_W4_H4_sse2 ps, 4 + FILTER_V4_W4_H4_sse2 ps, 8 + FILTER_V4_W4_H4_sse2 ps, 16 + FILTER_V4_W4_H4_sse2 ps, 32 ;----------------------------------------------------------------------------- -;void interp_4tap_vert_pp_6x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;void interp_4tap_vert_%1_6x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- -%macro FILTER_V4_W6_H4_sse2 1 +%macro FILTER_V4_W6_H4_sse2 2 INIT_XMM sse2 -cglobal interp_4tap_vert_pp_6x%1, 4, 7, 10 - +cglobal interp_4tap_vert_%1_6x%2, 4, 7, 10 mov r4d, r4m sub r0, r1 shl r4d, 5 @@ -1392,11 +1777,16 @@ mova m5, [tab_ChromaCoeffV + r4 + 16] %endif +%ifidn %1,pp mova m4, [pw_32] +%elifidn %1,ps + mova m4, [pw_2000] + add r3d, r3d +%endif lea r5, [3 * r1] %assign x 1 -%rep %1/4 +%rep %2/4 movq m0, [r0] movq m1, [r0 + r1] movq m2, [r0 + 2 * r1] @@ -1423,12 +1813,20 @@ paddw m0, m7 +%ifidn %1,pp paddw m0, m4 psraw m0, 6 packuswb m0, m0 + movd [r2], m0 pextrw r6d, m0, 2 mov [r2 + 4], r6w +%elifidn %1,ps + psubw m0, m4 + movh [r2], m0 + pshufd m0, m0, 2 + movd [r2 + 8], m0 +%endif lea r0, [r0 + 4 * r1] @@ -1452,12 +1850,21 @@ paddw m1, m7 +%ifidn %1,pp paddw m1, m4 psraw m1, 6 packuswb m1, m1 + movd [r2 + r3], m1 pextrw r6d, m1, 2 mov [r2 + r3 + 4], r6w +%elifidn %1,ps + psubw m1, m4 + movh [r2 + r3], m1 + pshufd m1, m1, 2 + movd [r2 + r3 + 8], m1 +%endif + movq m1, [r0 + r1] punpcklbw m7, m0, m1 @@ -1476,14 +1883,21 @@ packssdw m7, m8 paddw m2, m7 + lea r2, [r2 + 2 * r3] +%ifidn %1,pp paddw m2, m4 psraw m2, 6 packuswb m2, m2 - lea r2, [r2 + 2 * r3] movd [r2], m2 pextrw r6d, m2, 2 mov [r2 + 4], r6w +%elifidn %1,ps + psubw m2, m4 + movh [r2], m2 + pshufd m2, m2, 2 + movd [r2 + 8], m2 +%endif movq m2, [r0 + 2 * r1] punpcklbw m1, m2 @@ -1504,17 +1918,25 @@ paddw m3, m1 +%ifidn %1,pp paddw m3, m4 psraw m3, 6 packuswb m3, m3 movd [r2 + r3], m3 - pextrw r6d, m3, 2 + pextrw r6d, m3, 2 mov [r2 + r3 + 4], r6w +%elifidn %1,ps + psubw m3, m4 + movh [r2 + r3], m3 + pshufd m3, m3, 2 + movd [r2 + r3 + 8], m3 +%endif -%if x < %1/4 +%if x < %2/4 lea r2, [r2 + 2 * r3] %endif + %assign x x+1 %endrep RET @@ -1522,22 +1944,29 @@ %endmacro %if ARCH_X86_64 - FILTER_V4_W6_H4_sse2 8 - FILTER_V4_W6_H4_sse2 16 + FILTER_V4_W6_H4_sse2 pp, 8 + FILTER_V4_W6_H4_sse2 pp, 16 + FILTER_V4_W6_H4_sse2 ps, 8 + FILTER_V4_W6_H4_sse2 ps, 16 %endif ;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +; void interp_4tap_vert_%1_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- -%macro FILTER_V4_W8_sse2 1 +%macro FILTER_V4_W8_sse2 2 INIT_XMM sse2 -cglobal interp_4tap_vert_pp_8x%1, 4, 7, 12 - +cglobal interp_4tap_vert_%1_8x%2, 4, 7, 12 mov r4d, r4m sub r0, r1 shl r4d, 5 pxor m9, m9 + +%ifidn %1,pp mova m4, [pw_32] +%elifidn %1,ps + mova m4, [pw_2000] + add r3d, r3d +%endif %ifdef PIC lea r6, [tab_ChromaCoeffV] @@ -1573,8 +2002,13 @@ paddw 
m0, m7 +%ifidn %1,pp paddw m0, m4 psraw m0, 6 +%elifidn %1,ps + psubw m0, m4 + movu [r2], m0 +%endif movq m11, [r0 + 4 * r1] @@ -1597,13 +2031,18 @@ paddw m1, m7 +%ifidn %1,pp paddw m1, m4 psraw m1, 6 packuswb m1, m0 movhps [r2], m1 movh [r2 + r3], m1 -%if %1 == 2 ;end of 8x2 +%elifidn %1,ps + psubw m1, m4 + movu [r2 + r3], m1 +%endif +%if %2 == 2 ;end of 8x2 RET %else @@ -1629,8 +2068,13 @@ paddw m2, m7 +%ifidn %1,pp paddw m2, m4 psraw m2, 6 +%elifidn %1,ps + psubw m2, m4 + movu [r2 + 2 * r3], m2 +%endif movq m10, [r6 + 2 * r1] @@ -1652,15 +2096,20 @@ packssdw m7, m8 paddw m3, m7 + lea r5, [r2 + 2 * r3] +%ifidn %1,pp paddw m3, m4 psraw m3, 6 packuswb m3, m2 movhps [r2 + 2 * r3], m3 - lea r5, [r2 + 2 * r3] movh [r5 + r3], m3 -%if %1 == 4 ;end of 8x4 +%elifidn %1,ps + psubw m3, m4 + movu [r5 + r3], m3 +%endif +%if %2 == 4 ;end of 8x4 RET %else @@ -1684,10 +2133,15 @@ pmaddwd m8, m5 packssdw m7, m8 - paddw m11, m7 + paddw m11, m7 - paddw m11, m4 - psraw m11, 6 +%ifidn %1, pp + paddw m11, m4 + psraw m11, 6 +%elifidn %1,ps + psubw m11, m4 + movu [r2 + 4 * r3], m11 +%endif movq m7, [r0 + 8 * r1] @@ -1709,15 +2163,20 @@ packssdw m3, m8 paddw m1, m3 + lea r5, [r2 + 4 * r3] +%ifidn %1,pp paddw m1, m4 psraw m1, 6 packuswb m1, m11 movhps [r2 + 4 * r3], m1 - lea r5, [r2 + 4 * r3] movh [r5 + r3], m1 -%if %1 == 6 +%elifidn %1,ps + psubw m1, m4 + movu [r5 + r3], m1 +%endif +%if %2 == 6 RET %else @@ -1728,18 +2187,20 @@ %endmacro %if ARCH_X86_64 - FILTER_V4_W8_sse2 2 - FILTER_V4_W8_sse2 4 - FILTER_V4_W8_sse2 6 + FILTER_V4_W8_sse2 pp, 2 + FILTER_V4_W8_sse2 pp, 4 + FILTER_V4_W8_sse2 pp, 6 + FILTER_V4_W8_sse2 ps, 2 + FILTER_V4_W8_sse2 ps, 4 + FILTER_V4_W8_sse2 ps, 6 %endif ;----------------------------------------------------------------------------- -; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +; void interp_4tap_vert_%1_8x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) ;----------------------------------------------------------------------------- %macro FILTER_V4_W8_H8_H16_H32_sse2 2 INIT_XMM sse2 -cglobal interp_4tap_vert_pp_%1x%2, 4, 6, 11 - +cglobal interp_4tap_vert_%1_8x%2, 4, 6, 11 mov r4d, r4m sub r0, r1 shl r4d, 5 @@ -1754,7 +2215,13 @@ mova m5, [tab_ChromaCoeff + r4 + 16] %endif +%ifidn %1,pp mova m4, [pw_32] +%elifidn %1,ps + mova m4, [pw_2000] + add r3d, r3d +%endif + lea r5, [r1 * 3] %assign x 1 @@ -1784,8 +2251,14 @@ packssdw m7, m8 paddw m0, m7 + +%ifidn %1,pp paddw m0, m4 psraw m0, 6 +%elifidn %1,ps + psubw m0, m4 + movu [r2], m0 +%endif lea r0, [r0 + 4 * r1] movq m10, [r0] @@ -1807,12 +2280,18 @@ packssdw m7, m8 paddw m1, m7 + +%ifidn %1,pp paddw m1, m4 psraw m1, 6 packuswb m0, m1 movh [r2], m0 movhps [r2 + r3], m0 +%elifidn %1,ps + psubw m1, m4 + movu [r2 + r3], m1 +%endif movq m1, [r0 + r1] punpcklbw m10, m1 @@ -1832,8 +2311,15 @@ packssdw m10, m8 paddw m2, m10 + lea r2, [r2 + 2 * r3] + +%ifidn %1,pp paddw m2, m4 psraw m2, 6 +%elifidn %1,ps + psubw m2, m4 + movu [r2], m2 +%endif movq m7, [r0 + 2 * r1] punpcklbw m1, m7 @@ -1853,13 +2339,19 @@ packssdw m1, m8 paddw m3, m1 + +%ifidn %1,pp paddw m3, m4 psraw m3, 6 packuswb m2, m3 - lea r2, [r2 + 2 * r3] movh [r2], m2 movhps [r2 + r3], m2 +%elifidn %1,ps + psubw m3, m4 + movu [r2 + r3], m3 +%endif + %if x < %2/4 lea r2, [r2 + 2 * r3] %endif @@ -1868,13 +2360,1123 @@ %endmacro %if ARCH_X86_64 - FILTER_V4_W8_H8_H16_H32_sse2 8, 8 - FILTER_V4_W8_H8_H16_H32_sse2 8, 16 - FILTER_V4_W8_H8_H16_H32_sse2 8, 32 + FILTER_V4_W8_H8_H16_H32_sse2 pp, 8 + 
FILTER_V4_W8_H8_H16_H32_sse2 pp, 16 + FILTER_V4_W8_H8_H16_H32_sse2 pp, 32 + + FILTER_V4_W8_H8_H16_H32_sse2 pp, 12 + FILTER_V4_W8_H8_H16_H32_sse2 pp, 64 + + FILTER_V4_W8_H8_H16_H32_sse2 ps, 8 + FILTER_V4_W8_H8_H16_H32_sse2 ps, 16 + FILTER_V4_W8_H8_H16_H32_sse2 ps, 32 + + FILTER_V4_W8_H8_H16_H32_sse2 ps, 12 + FILTER_V4_W8_H8_H16_H32_sse2 ps, 64 +%endif + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_%1_12x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W12_H2_sse2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_%1_12x%2, 4, 6, 11 + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifidn %1,pp + mova m6, [pw_32] +%elifidn %1,ps + mova m6, [pw_2000] + add r3d, r3d +%endif + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + mova m1, [r5 + r4] + mova m0, [r5 + r4 + 16] +%else + mova m1, [tab_ChromaCoeffV + r4] + mova m0, [tab_ChromaCoeffV + r4 + 16] +%endif + +%assign x 1 +%rep %2/2 + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m1 + pmaddwd m8, m1 + packssdw m2, m8 + + lea r0, [r0 + 2 * r1] + movu m5, [r0] + movu m7, [r0 + r1] + + punpcklbw m10, m5, m7 + movhlps m8, m10 + punpcklbw m10, m9 + punpcklbw m8, m9 + pmaddwd m10, m0 + pmaddwd m8, m0 + packssdw m10, m8 + + paddw m4, m10 + + punpckhbw m10, m5, m7 + movhlps m8, m10 + punpcklbw m10, m9 + punpcklbw m8, m9 + pmaddwd m10, m0 + pmaddwd m8, m0 + packssdw m10, m8 + + paddw m2, m10 + +%ifidn %1,pp + paddw m4, m6 + psraw m4, 6 + paddw m2, m6 + psraw m2, 6 + + packuswb m4, m2 + movh [r2], m4 + psrldq m4, 8 + movd [r2 + 8], m4 +%elifidn %1,ps + psubw m4, m6 + psubw m2, m6 + movu [r2], m4 + movh [r2 + 16], m2 +%endif + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m4 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m1 + pmaddwd m8, m1 + packssdw m3, m8 + + movu m5, [r0 + 2 * r1] + punpcklbw m2, m7, m5 + punpckhbw m7, m5 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m0 + pmaddwd m8, m0 + packssdw m2, m8 + + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m0 + pmaddwd m8, m0 + packssdw m7, m8 + + paddw m4, m2 + paddw m3, m7 + +%ifidn %1,pp + paddw m4, m6 + psraw m4, 6 + paddw m3, m6 + psraw m3, 6 + + packuswb m4, m3 + movh [r2 + r3], m4 + psrldq m4, 8 + movd [r2 + r3 + 8], m4 +%elifidn %1,ps + psubw m4, m6 + psubw m3, m6 + movu [r2 + r3], m4 + movh [r2 + r3 + 16], m3 +%endif + +%if x < %2/2 + lea r2, [r2 + 2 * r3] +%endif +%assign x x+1 +%endrep + RET + +%endmacro + +%if ARCH_X86_64 + FILTER_V4_W12_H2_sse2 pp, 16 + FILTER_V4_W12_H2_sse2 pp, 32 + FILTER_V4_W12_H2_sse2 ps, 16 + FILTER_V4_W12_H2_sse2 ps, 32 +%endif + +;----------------------------------------------------------------------------- +; void interp_4tap_vert_%1_16x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W16_H2_sse2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_%1_16x%2, 4, 6, 11 + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifidn %1,pp + mova m6, [pw_32] +%elifidn %1,ps + mova m6, [pw_2000] 
+ add r3d, r3d +%endif + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + mova m1, [r5 + r4] + mova m0, [r5 + r4 + 16] +%else + mova m1, [tab_ChromaCoeffV + r4] + mova m0, [tab_ChromaCoeffV + r4 + 16] +%endif + +%assign x 1 +%rep %2/2 + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m1 + pmaddwd m8, m1 + packssdw m2, m8 + + lea r0, [r0 + 2 * r1] + movu m5, [r0] + movu m10, [r0 + r1] + + punpckhbw m7, m5, m10 + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m0 + pmaddwd m8, m0 + packssdw m7, m8 + paddw m2, m7 + + punpcklbw m7, m5, m10 + movhlps m8, m7 + punpcklbw m7, m9 + punpcklbw m8, m9 + pmaddwd m7, m0 + pmaddwd m8, m0 + packssdw m7, m8 + paddw m4, m7 + +%ifidn %1,pp + paddw m4, m6 + psraw m4, 6 + paddw m2, m6 + psraw m2, 6 + + packuswb m4, m2 + movu [r2], m4 +%elifidn %1,ps + psubw m4, m6 + psubw m2, m6 + movu [r2], m4 + movu [r2 + 16], m2 +%endif + + punpcklbw m4, m3, m5 + punpckhbw m3, m5 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m3 + punpcklbw m3, m9 + punpcklbw m8, m9 + pmaddwd m3, m1 + pmaddwd m8, m1 + packssdw m3, m8 + + movu m5, [r0 + 2 * r1] + + punpcklbw m2, m10, m5 + punpckhbw m10, m5 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m0 + pmaddwd m8, m0 + packssdw m2, m8 + + movhlps m8, m10 + punpcklbw m10, m9 + punpcklbw m8, m9 + pmaddwd m10, m0 + pmaddwd m8, m0 + packssdw m10, m8 + + paddw m4, m2 + paddw m3, m10 + +%ifidn %1,pp + paddw m4, m6 + psraw m4, 6 + paddw m3, m6 + psraw m3, 6 + + packuswb m4, m3 + movu [r2 + r3], m4 +%elifidn %1,ps + psubw m4, m6 + psubw m3, m6 + movu [r2 + r3], m4 + movu [r2 + r3 + 16], m3 +%endif + +%if x < %2/2 + lea r2, [r2 + 2 * r3] +%endif +%assign x x+1 +%endrep + RET + +%endmacro + +%if ARCH_X86_64 + FILTER_V4_W16_H2_sse2 pp, 4 + FILTER_V4_W16_H2_sse2 pp, 8 + FILTER_V4_W16_H2_sse2 pp, 12 + FILTER_V4_W16_H2_sse2 pp, 16 + FILTER_V4_W16_H2_sse2 pp, 32 + + FILTER_V4_W16_H2_sse2 pp, 24 + FILTER_V4_W16_H2_sse2 pp, 64 + + FILTER_V4_W16_H2_sse2 ps, 4 + FILTER_V4_W16_H2_sse2 ps, 8 + FILTER_V4_W16_H2_sse2 ps, 12 + FILTER_V4_W16_H2_sse2 ps, 16 + FILTER_V4_W16_H2_sse2 ps, 32 + + FILTER_V4_W16_H2_sse2 ps, 24 + FILTER_V4_W16_H2_sse2 ps, 64 +%endif + +;----------------------------------------------------------------------------- +;void interp_4tap_vert_%1_24%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx) +;----------------------------------------------------------------------------- +%macro FILTER_V4_W24_sse2 2 +INIT_XMM sse2 +cglobal interp_4tap_vert_%1_24x%2, 4, 6, 11 + mov r4d, r4m + sub r0, r1 + shl r4d, 5 + pxor m9, m9 + +%ifidn %1,pp + mova m6, [pw_32] +%elifidn %1,ps + mova m6, [pw_2000] + add r3d, r3d +%endif + +%ifdef PIC + lea r5, [tab_ChromaCoeffV] + mova m1, [r5 + r4] + mova m0, [r5 + r4 + 16] +%else + mova m1, [tab_ChromaCoeffV + r4] + mova m0, [tab_ChromaCoeffV + r4 + 16] +%endif + +%assign x 1 +%rep %2/2 + movu m2, [r0] + movu m3, [r0 + r1] + + punpcklbw m4, m2, m3 + punpckhbw m2, m3 + + movhlps m8, m4 + punpcklbw m4, m9 + punpcklbw m8, m9 + pmaddwd m4, m1 + pmaddwd m8, m1 + packssdw m4, m8 + + movhlps m8, m2 + punpcklbw m2, m9 + punpcklbw m8, m9 + pmaddwd m2, m1 + pmaddwd m8, m1 + packssdw m2, m8 + + lea r5, [r0 + 2 * r1] + movu m5, [r5] + movu m10, [r5 + r1] + punpcklbw m7, m5, m10 + 
+ movhlps m8, m7
+ punpcklbw m7, m9
+ punpcklbw m8, m9
+ pmaddwd m7, m0
+ pmaddwd m8, m0
+ packssdw m7, m8
+ paddw m4, m7
+
+ punpckhbw m7, m5, m10
+
+ movhlps m8, m7
+ punpcklbw m7, m9
+ punpcklbw m8, m9
+ pmaddwd m7, m0
+ pmaddwd m8, m0
+ packssdw m7, m8
+
+ paddw m2, m7
+
+%ifidn %1,pp
+ paddw m4, m6
+ psraw m4, 6
+ paddw m2, m6
+ psraw m2, 6
+
+ packuswb m4, m2
+ movu [r2], m4
+%elifidn %1,ps
+ psubw m4, m6
+ psubw m2, m6
+ movu [r2], m4
+ movu [r2 + 16], m2
+%endif
+
+ punpcklbw m4, m3, m5
+ punpckhbw m3, m5
+
+ movhlps m8, m4
+ punpcklbw m4, m9
+ punpcklbw m8, m9
+ pmaddwd m4, m1
+ pmaddwd m8, m1
+ packssdw m4, m8
+
+ movhlps m8, m3
+ punpcklbw m3, m9
+ punpcklbw m8, m9
+ pmaddwd m3, m1
+ pmaddwd m8, m1
+ packssdw m3, m8
+
+ movu m2, [r5 + 2 * r1]
+
+ punpcklbw m5, m10, m2
+ punpckhbw m10, m2
+
+ movhlps m8, m5
+ punpcklbw m5, m9
+ punpcklbw m8, m9
+ pmaddwd m5, m0
+ pmaddwd m8, m0
+ packssdw m5, m8
+
+ movhlps m8, m10
+ punpcklbw m10, m9
+ punpcklbw m8, m9
+ pmaddwd m10, m0
+ pmaddwd m8, m0
+ packssdw m10, m8
+
+ paddw m4, m5
+ paddw m3, m10
+
+%ifidn %1,pp
+ paddw m4, m6
+ psraw m4, 6
+ paddw m3, m6
+ psraw m3, 6
+
+ packuswb m4, m3
+ movu [r2 + r3], m4
+%elifidn %1,ps
+ psubw m4, m6
+ psubw m3, m6
+ movu [r2 + r3], m4
+ movu [r2 + r3 + 16], m3
+%endif
+
+ movq m2, [r0 + 16]
+ movq m3, [r0 + r1 + 16]
+ movq m4, [r5 + 16]
+ movq m5, [r5 + r1 + 16]
+
+ punpcklbw m2, m3
+ punpcklbw m4, m5
+
+ movhlps m8, m4
+ punpcklbw m4, m9
+ punpcklbw m8, m9
+ pmaddwd m4, m0
+ pmaddwd m8, m0
+ packssdw m4, m8
+
+ movhlps m8, m2
+ punpcklbw m2, m9
+ punpcklbw m8, m9
+ pmaddwd m2, m1
+ pmaddwd m8, m1
+ packssdw m2, m8
+
+ paddw m2, m4
+
+%ifidn %1,pp
+ paddw m2, m6
+ psraw m2, 6
+%elifidn %1,ps
+ psubw m2, m6
+ movu [r2 + 32], m2
+%endif
+
+ movq m3, [r0 + r1 + 16]
+ movq m4, [r5 + 16]
+ movq m5, [r5 + r1 + 16]
+ movq m7, [r5 + 2 * r1 + 16]
+
+ punpcklbw m3, m4
+ punpcklbw m5, m7
+
+ movhlps m8, m5
+ punpcklbw m5, m9
+ punpcklbw m8, m9
+ pmaddwd m5, m0
+ pmaddwd m8, m0
+ packssdw m5, m8
+
+ movhlps m8, m3
+ punpcklbw m3, m9
+ punpcklbw m8, m9
+ pmaddwd m3, m1
+ pmaddwd m8, m1
+ packssdw m3, m8
+
+ paddw m3, m5
+
+%ifidn %1,pp
+ paddw m3, m6
+ psraw m3, 6
+
+ packuswb m2, m3
+ movh [r2 + 16], m2
+ movhps [r2 + r3 + 16], m2
+%elifidn %1,ps
+ psubw m3, m6
+ movu [r2 + r3 + 32], m3
+%endif
+
+%if x < %2/2
+ mov r0, r5
+ lea r2, [r2 + 2 * r3]
+%endif
+%assign x x+1
+%endrep
+ RET
+
+%endmacro
+
+%if ARCH_X86_64
+ FILTER_V4_W24_sse2 pp, 32
+ FILTER_V4_W24_sse2 pp, 64
+ FILTER_V4_W24_sse2 ps, 32
+ FILTER_V4_W24_sse2 ps, 64
+%endif
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_vert_%1_32x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+%macro FILTER_V4_W32_sse2 2
+INIT_XMM sse2
+cglobal interp_4tap_vert_%1_32x%2, 4, 6, 10
+ mov r4d, r4m
+ sub r0, r1
+ shl r4d, 5
+ pxor m9, m9
+
+%ifidn %1,pp
+ mova m6, [pw_32]
+%elifidn %1,ps
+ mova m6, [pw_2000]
+ add r3d, r3d
+%endif
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeffV]
+ mova m1, [r5 + r4]
+ mova m0, [r5 + r4 + 16]
+%else
+ mova m1, [tab_ChromaCoeffV + r4]
+ mova m0, [tab_ChromaCoeffV + r4 + 16]
+%endif
+
+ mov r4d, %2
+
+.loop:
+ movu m2, [r0]
+ movu m3, [r0 + r1]
+
+ punpcklbw m4, m2, m3
+ punpckhbw m2, m3
+
+ movhlps m8, m4
+ punpcklbw m4, m9
+ punpcklbw m8, m9
+ pmaddwd m4, m1
+ pmaddwd m8, m1
+ packssdw m4, m8
+
+ movhlps m8, m2
+ punpcklbw m2, m9
+ punpcklbw m8, m9
+ pmaddwd m2, m1
+ pmaddwd m8, m1
+ packssdw m2, m8
+
+ lea r5, [r0 + 2 * r1]
+ movu m3, [r5]
+ movu m5, [r5 + r1]
+
+ punpcklbw m7, m3, m5
+ punpckhbw m3, m5
+
+ movhlps m8, m7
+ punpcklbw m7, m9
+ punpcklbw m8, m9
+ pmaddwd m7, m0
+ pmaddwd m8, m0
+ packssdw m7, m8
+
+ movhlps m8, m3
+ punpcklbw m3, m9
+ punpcklbw m8, m9
+ pmaddwd m3, m0
+ pmaddwd m8, m0
+ packssdw m3, m8
+
+ paddw m4, m7
+ paddw m2, m3
+
+%ifidn %1,pp
+ paddw m4, m6
+ psraw m4, 6
+ paddw m2, m6
+ psraw m2, 6
+
+ packuswb m4, m2
+ movu [r2], m4
+%elifidn %1,ps
+ psubw m4, m6
+ psubw m2, m6
+ movu [r2], m4
+ movu [r2 + 16], m2
+%endif
+
+ movu m2, [r0 + 16]
+ movu m3, [r0 + r1 + 16]
+
+ punpcklbw m4, m2, m3
+ punpckhbw m2, m3
+
+ movhlps m8, m4
+ punpcklbw m4, m9
+ punpcklbw m8, m9
+ pmaddwd m4, m1
+ pmaddwd m8, m1
+ packssdw m4, m8
+
+ movhlps m8, m2
+ punpcklbw m2, m9
+ punpcklbw m8, m9
+ pmaddwd m2, m1
+ pmaddwd m8, m1
+ packssdw m2, m8
+
+ movu m3, [r5 + 16]
+ movu m5, [r5 + r1 + 16]
+
+ punpcklbw m7, m3, m5
+ punpckhbw m3, m5
+
+ movhlps m8, m7
+ punpcklbw m7, m9
+ punpcklbw m8, m9
+ pmaddwd m7, m0
+ pmaddwd m8, m0
+ packssdw m7, m8
+
+ movhlps m8, m3
+ punpcklbw m3, m9
+ punpcklbw m8, m9
+ pmaddwd m3, m0
+ pmaddwd m8, m0
+ packssdw m3, m8
+
+ paddw m4, m7
+ paddw m2, m3
+
+%ifidn %1,pp
+ paddw m4, m6
+ psraw m4, 6
+ paddw m2, m6
+ psraw m2, 6
+
+ packuswb m4, m2
+ movu [r2 + 16], m4
+%elifidn %1,ps
+ psubw m4, m6
+ psubw m2, m6
+ movu [r2 + 32], m4
+ movu [r2 + 48], m2
+%endif
+
+ lea r0, [r0 + r1]
+ lea r2, [r2 + r3]
+ dec r4
+ jnz .loop
+ RET
+
+%endmacro
+
+%if ARCH_X86_64
+ FILTER_V4_W32_sse2 pp, 8
+ FILTER_V4_W32_sse2 pp, 16
+ FILTER_V4_W32_sse2 pp, 24
+ FILTER_V4_W32_sse2 pp, 32
+
+ FILTER_V4_W32_sse2 pp, 48
+ FILTER_V4_W32_sse2 pp, 64
- FILTER_V4_W8_H8_H16_H32_sse2 8, 12
- FILTER_V4_W8_H8_H16_H32_sse2 8, 64
+ FILTER_V4_W32_sse2 ps, 8
+ FILTER_V4_W32_sse2 ps, 16
+ FILTER_V4_W32_sse2 ps, 24
+ FILTER_V4_W32_sse2 ps, 32
+
+ FILTER_V4_W32_sse2 ps, 48
+ FILTER_V4_W32_sse2 ps, 64
+%endif
+
+;-----------------------------------------------------------------------------
+; void interp_4tap_vert_%1_%2x%3(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
+;-----------------------------------------------------------------------------
+%macro FILTER_V4_W16n_H2_sse2 3
+INIT_XMM sse2
+cglobal interp_4tap_vert_%1_%2x%3, 4, 7, 11
+ mov r4d, r4m
+ sub r0, r1
+ shl r4d, 5
+ pxor m9, m9
+
+%ifidn %1,pp
+ mova m7, [pw_32]
+%elifidn %1,ps
+ mova m7, [pw_2000]
+ add r3d, r3d
+%endif
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeffV]
+ mova m1, [r5 + r4]
+ mova m0, [r5 + r4 + 16]
+%else
+ mova m1, [tab_ChromaCoeffV + r4]
+ mova m0, [tab_ChromaCoeffV + r4 + 16]
+%endif
+
+ mov r4d, %3/2
+
+.loop:
+
+ mov r6d, %2/16
+
+.loopW:
+
+ movu m2, [r0]
+ movu m3, [r0 + r1]
+
+ punpcklbw m4, m2, m3
+ punpckhbw m2, m3
+
+ movhlps m8, m4
+ punpcklbw m4, m9
+ punpcklbw m8, m9
+ pmaddwd m4, m1
+ pmaddwd m8, m1
+ packssdw m4, m8
+
+ movhlps m8, m2
+ punpcklbw m2, m9
+ punpcklbw m8, m9
+ pmaddwd m2, m1
+ pmaddwd m8, m1
+ packssdw m2, m8
+
+ lea r5, [r0 + 2 * r1]
+ movu m5, [r5]
+ movu m6, [r5 + r1]
+
+ punpckhbw m10, m5, m6
+ movhlps m8, m10
+ punpcklbw m10, m9
+ punpcklbw m8, m9
+ pmaddwd m10, m0
+ pmaddwd m8, m0
+ packssdw m10, m8
+ paddw m2, m10
+
+ punpcklbw m10, m5, m6
+ movhlps m8, m10
+ punpcklbw m10, m9
+ punpcklbw m8, m9
+ pmaddwd m10, m0
+ pmaddwd m8, m0
+ packssdw m10, m8
+ paddw m4, m10
+
+%ifidn %1,pp
+ paddw m4, m7
+ psraw m4, 6
+ paddw m2, m7
+ psraw m2, 6
+
+ packuswb m4, m2
+ movu [r2], m4
+%elifidn %1,ps
+ psubw m4, m7
+ psubw m2, m7
+ movu [r2], m4
+ movu [r2 + 16], m2
+%endif
+
+ punpcklbw m4, m3, m5
+ punpckhbw m3, m5
+
+ movhlps m8, m4
+ punpcklbw m4, m9
+ punpcklbw m8, m9
+ pmaddwd m4, m1
+ pmaddwd m8, m1
+ packssdw m4, m8
+
+ movhlps m8, m3
+ punpcklbw m3, m9
+ punpcklbw m8, m9
+ pmaddwd m3, m1
+ pmaddwd m8, m1
+ packssdw m3, m8
+
+ movu m5, [r5 + 2 * r1]
+
+ punpcklbw m2, m6, m5
+ punpckhbw m6, m5
+
+ movhlps m8, m2
+ punpcklbw m2, m9
+ punpcklbw m8, m9
+ pmaddwd m2, m0
+ pmaddwd m8, m0
+ packssdw m2, m8
+
+ movhlps m8, m6
+ punpcklbw m6, m9
+ punpcklbw m8, m9
+ pmaddwd m6, m0
+ pmaddwd m8, m0
+ packssdw m6, m8
+
+ paddw m4, m2
+ paddw m3, m6
+
+%ifidn %1,pp
+ paddw m4, m7
+ psraw m4, 6
+ paddw m3, m7
+ psraw m3, 6
+
+ packuswb m4, m3
+ movu [r2 + r3], m4
+ add r2, 16
+%elifidn %1,ps
+ psubw m4, m7
+ psubw m3, m7
+ movu [r2 + r3], m4
+ movu [r2 + r3 + 16], m3
+ add r2, 32
+%endif
+
+ add r0, 16
+ dec r6d
+ jnz .loopW
+
+ lea r0, [r0 + r1 * 2 - %2]
+
+%ifidn %1,pp
+ lea r2, [r2 + r3 * 2 - %2]
+%elifidn %1,ps
+ lea r2, [r2 + r3 * 2 - (%2 * 2)]
+%endif
+
+ dec r4d
+ jnz .loop
+ RET
+
+%endmacro
+
+%if ARCH_X86_64
+ FILTER_V4_W16n_H2_sse2 pp, 64, 64
+ FILTER_V4_W16n_H2_sse2 pp, 64, 32
+ FILTER_V4_W16n_H2_sse2 pp, 64, 48
+ FILTER_V4_W16n_H2_sse2 pp, 48, 64
+ FILTER_V4_W16n_H2_sse2 pp, 64, 16
+ FILTER_V4_W16n_H2_sse2 ps, 64, 64
+ FILTER_V4_W16n_H2_sse2 ps, 64, 32
+ FILTER_V4_W16n_H2_sse2 ps, 64, 48
+ FILTER_V4_W16n_H2_sse2 ps, 48, 64
+ FILTER_V4_W16n_H2_sse2 ps, 64, 16
+%endif
+
+%macro FILTER_P2S_2_4_sse2 1
+ movd m2, [r0 + %1]
+ movd m3, [r0 + r1 + %1]
+ punpcklwd m2, m3
+ movd m3, [r0 + r1 * 2 + %1]
+ movd m4, [r0 + r4 + %1]
+ punpcklwd m3, m4
+ punpckldq m2, m3
+ punpcklbw m2, m0
+ psllw m2, 6
+ psubw m2, m1
+
+ movd [r2 + r3 * 0 + %1 * 2], m2
+ psrldq m2, 4
+ movd [r2 + r3 * 1 + %1 * 2], m2
+ psrldq m2, 4
+ movd [r2 + r3 * 2 + %1 * 2], m2
+ psrldq m2, 4
+ movd [r2 + r5 + %1 * 2], m2
+%endmacro
+
+%macro FILTER_P2S_4_4_sse2 1
+ movd m2, [r0 + %1]
+ movd m3, [r0 + r1 + %1]
+ movd m4, [r0 + r1 * 2 + %1]
+ movd m5, [r0 + r4 + %1]
+ punpckldq m2, m3
+ punpcklbw m2, m0
+ punpckldq m4, m5
+ punpcklbw m4, m0
+ psllw m2, 6
+ psllw m4, 6
+ psubw m2, m1
+ psubw m4, m1
+ movh [r2 + r3 * 0 + %1 * 2], m2
+ movh [r2 + r3 * 2 + %1 * 2], m4
+ movhps [r2 + r3 * 1 + %1 * 2], m2
+ movhps [r2 + r5 + %1 * 2], m4
+%endmacro
+
+%macro FILTER_P2S_4_2_sse2 0
+ movd m2, [r0]
+ movd m3, [r0 + r1]
+ punpckldq m2, m3
+ punpcklbw m2, m0
+ psllw m2, 6
+ psubw m2, [pw_8192]
+ movh [r2], m2
+ movhps [r2 + r3 * 2], m2
+%endmacro
+
+%macro FILTER_P2S_8_4_sse2 1
+ movh m2, [r0 + %1]
+ movh m3, [r0 + r1 + %1]
+ movh m4, [r0 + r1 * 2 + %1]
+ movh m5, [r0 + r4 + %1]
+ punpcklbw m2, m0
+ punpcklbw m3, m0
+ punpcklbw m5, m0
+ punpcklbw m4, m0
+ psllw m2, 6
+ psllw m3, 6
+ psllw m5, 6
+ psllw m4, 6
+ psubw m2, m1
+ psubw m3, m1
+ psubw m4, m1
+ psubw m5, m1
+ movu [r2 + r3 * 0 + %1 * 2], m2
+ movu [r2 + r3 * 1 + %1 * 2], m3
+ movu [r2 + r3 * 2 + %1 * 2], m4
+ movu [r2 + r5 + %1 * 2], m5
+%endmacro
+
+%macro FILTER_P2S_8_2_sse2 1
+ movh m2, [r0 + %1]
+ movh m3, [r0 + r1 + %1]
+ punpcklbw m2, m0
+ punpcklbw m3, m0
+ psllw m2, 6
+ psllw m3, 6
+ psubw m2, m1
+ psubw m3, m1
+ movu [r2 + r3 * 0 + %1 * 2], m2
+ movu [r2 + r3 * 1 + %1 * 2], m3
+%endmacro
+
+;-----------------------------------------------------------------------------
+; void filterPixelToShort(pixel *src, intptr_t srcStride, int16_t *dst, intptr_t dstStride)
+;-----------------------------------------------------------------------------
+%macro FILTER_PIX_TO_SHORT_sse2 2
+INIT_XMM sse2
+cglobal filterPixelToShort_%1x%2, 4, 6, 6
+ pxor m0, m0
+%if %2 == 2
+%if %1 == 4
+ FILTER_P2S_4_2_sse2
+%elif %1 == 8
+ add r3d, r3d
+ mova m1, [pw_8192]
+ FILTER_P2S_8_2_sse2 0
+%endif
+%else
+ add r3d, r3d
+ mova m1, [pw_8192]
+ lea r4, [r1 * 3]
+ lea r5, [r3 * 3]
+%assign y 1
+%rep %2/4
+%assign x 0
+%rep %1/8
+ FILTER_P2S_8_4_sse2 x
+%if %2 == 6
+ lea r0, [r0 + 4 * r1]
+ lea r2, [r2 + 4 * r3]
+ FILTER_P2S_8_2_sse2 x
+%endif
+%assign x x+8
+%endrep
+%rep (%1 % 8)/4
+ FILTER_P2S_4_4_sse2 x
+%assign x x+4
+%endrep
+%rep (%1 % 4)/2
+ FILTER_P2S_2_4_sse2 x
+%endrep
+%if y < %2/4
+ lea r0, [r0 + 4 * r1]
+ lea r2, [r2 + 4 * r3]
+%assign y y+1
+%endif
+%endrep
+%endif
+RET
+%endmacro
+
+ FILTER_PIX_TO_SHORT_sse2 2, 4
+ FILTER_PIX_TO_SHORT_sse2 2, 8
+ FILTER_PIX_TO_SHORT_sse2 2, 16
+ FILTER_PIX_TO_SHORT_sse2 4, 2
+ FILTER_PIX_TO_SHORT_sse2 4, 4
+ FILTER_PIX_TO_SHORT_sse2 4, 8
+ FILTER_PIX_TO_SHORT_sse2 4, 16
+ FILTER_PIX_TO_SHORT_sse2 4, 32
+ FILTER_PIX_TO_SHORT_sse2 6, 8
+ FILTER_PIX_TO_SHORT_sse2 6, 16
+ FILTER_PIX_TO_SHORT_sse2 8, 2
+ FILTER_PIX_TO_SHORT_sse2 8, 4
+ FILTER_PIX_TO_SHORT_sse2 8, 6
+ FILTER_PIX_TO_SHORT_sse2 8, 8
+ FILTER_PIX_TO_SHORT_sse2 8, 12
+ FILTER_PIX_TO_SHORT_sse2 8, 16
+ FILTER_PIX_TO_SHORT_sse2 8, 32
+ FILTER_PIX_TO_SHORT_sse2 8, 64
+ FILTER_PIX_TO_SHORT_sse2 12, 16
+ FILTER_PIX_TO_SHORT_sse2 12, 32
+ FILTER_PIX_TO_SHORT_sse2 16, 4
+ FILTER_PIX_TO_SHORT_sse2 16, 8
+ FILTER_PIX_TO_SHORT_sse2 16, 12
+ FILTER_PIX_TO_SHORT_sse2 16, 16
+ FILTER_PIX_TO_SHORT_sse2 16, 24
+ FILTER_PIX_TO_SHORT_sse2 16, 32
+ FILTER_PIX_TO_SHORT_sse2 16, 64
+ FILTER_PIX_TO_SHORT_sse2 24, 32
+ FILTER_PIX_TO_SHORT_sse2 24, 64
+ FILTER_PIX_TO_SHORT_sse2 32, 8
+ FILTER_PIX_TO_SHORT_sse2 32, 16
+ FILTER_PIX_TO_SHORT_sse2 32, 24
+ FILTER_PIX_TO_SHORT_sse2 32, 32
+ FILTER_PIX_TO_SHORT_sse2 32, 48
+ FILTER_PIX_TO_SHORT_sse2 32, 64
+ FILTER_PIX_TO_SHORT_sse2 48, 64
+ FILTER_PIX_TO_SHORT_sse2 64, 16
+ FILTER_PIX_TO_SHORT_sse2 64, 32
+ FILTER_PIX_TO_SHORT_sse2 64, 48
+ FILTER_PIX_TO_SHORT_sse2 64, 64
+
+%macro FILTER_H4_w2_2 3
+ movh %2, [srcq - 1]
+ pshufb %2, %2, Tm0
+ movh %1, [srcq + srcstrideq - 1]
+ pshufb %1, %1, Tm0
+ punpcklqdq %2, %1
+ pmaddubsw %2, coef2
+ phaddw %2, %2
+ pmulhrsw %2, %3
+ packuswb %2, %2
+ movd r4, %2
+ mov [dstq], r4w
+ shr r4, 16
+ mov [dstq + dststrideq], r4w
+%endmacro
+
;-----------------------------------------------------------------------------
; void interp_4tap_horiz_pp_2x4(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -4570,6 +6172,123 @@
FILTER_VER_CHROMA_AVX2_2x8 pp
FILTER_VER_CHROMA_AVX2_2x8 ps
+%macro FILTER_VER_CHROMA_AVX2_2x16 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_2x16, 4, 6, 3
+ mov r4d, r4m
+ shl r4d, 6
+ sub r0, r1
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeffVer_32]
+ add r5, r4
+%else
+ lea r5, [tab_ChromaCoeffVer_32 + r4]
+%endif
+
+ lea r4, [r1 * 3]
+
+ movd xm1, [r0]
+ pinsrw xm1, [r0 + r1], 1
+ pinsrw xm1, [r0 + r1 * 2], 2
+ pinsrw xm1, [r0 + r4], 3
+ lea r0, [r0 + r1 * 4]
+ pinsrw xm1, [r0], 4
+ pinsrw xm1, [r0 + r1], 5
+ pinsrw xm1, [r0 + r1 * 2], 6
+ pinsrw xm1, [r0 + r4], 7
+ lea r0, [r0 + r1 * 4]
+ pinsrw xm0, [r0], 4
+ pinsrw xm0, [r0 + r1], 5
+ pinsrw xm0, [r0 + r1 * 2], 6
+ pinsrw xm0, [r0 + r4], 7
+ punpckhqdq xm0, xm1, xm0
+ vinserti128 m1, m1, xm0, 1
+
+ pshufb m2, m1, [interp_vert_shuf]
+ pshufb m1, [interp_vert_shuf + 32]
+ pmaddubsw m2, [r5]
+ pmaddubsw m1, [r5 + 1 * mmsize]
+ paddw m2, m1
+
+ lea r0, [r0 + r1 * 4]
+ pinsrw xm1, [r0], 4
+ pinsrw xm1, [r0 + r1], 5
+ pinsrw xm1, [r0 + r1 * 2], 6
+ pinsrw xm1, [r0 + r4], 7
+ punpckhqdq xm1, xm0, xm1
+ lea r0, [r0 + r1 * 4]
+ pinsrw xm0, [r0], 4
+ pinsrw xm0, [r0 + r1], 5
+ pinsrw xm0, [r0 + r1 * 2], 6
+ punpckhqdq xm0, xm1, xm0
+ vinserti128 m1, m1, xm0, 1
+
+ pshufb m0, m1, [interp_vert_shuf]
+ pshufb m1, [interp_vert_shuf + 32]
+ pmaddubsw m0, [r5]
+ pmaddubsw m1, [r5 + 1 * mmsize]
+ paddw m0, m1
+%ifidn %1,pp
+ mova m1, [pw_512]
+ pmulhrsw m2, m1
+ pmulhrsw m0, m1
+ packuswb m2, m0
+ lea r4, [r3 * 3]
+ pextrw [r2], xm2, 0
+ pextrw [r2 + r3], xm2, 1
+ pextrw [r2 + r3 * 2], xm2, 2
+ pextrw [r2 + r4], xm2, 3
+ vextracti128 xm0, m2, 1
+ lea r2, [r2 + r3 * 4]
+ pextrw [r2], xm0, 0
+ pextrw [r2 + r3], xm0, 1
+ pextrw [r2 + r3 * 2], xm0, 2
+ pextrw [r2 + r4], xm0, 3
+ lea r2, [r2 + r3 * 4]
+ pextrw [r2], xm2, 4
+ pextrw [r2 + r3], xm2, 5
+ pextrw [r2 + r3 * 2], xm2, 6
+ pextrw [r2 + r4], xm2, 7
+ lea r2, [r2 + r3 * 4]
+ pextrw [r2], xm0, 4
+ pextrw [r2 + r3], xm0, 5
+ pextrw [r2 + r3 * 2], xm0, 6
+ pextrw [r2 + r4], xm0, 7
+%else
+ add r3d, r3d
+ lea r4, [r3 * 3]
+ vbroadcasti128 m1, [pw_2000]
+ psubw m2, m1
+ psubw m0, m1
+ vextracti128 xm1, m2, 1
+ movd [r2], xm2
+ pextrd [r2 + r3], xm2, 1
+ pextrd [r2 + r3 * 2], xm2, 2
+ pextrd [r2 + r4], xm2, 3
+ lea r2, [r2 + r3 * 4]
+ movd [r2], xm1
+ pextrd [r2 + r3], xm1, 1
+ pextrd [r2 + r3 * 2], xm1, 2
+ pextrd [r2 + r4], xm1, 3
+ vextracti128 xm1, m0, 1
+ lea r2, [r2 + r3 * 4]
+ movd [r2], xm0
+ pextrd [r2 + r3], xm0, 1
+ pextrd [r2 + r3 * 2], xm0, 2
+ pextrd [r2 + r4], xm0, 3
+ lea r2, [r2 + r3 * 4]
+ movd [r2], xm1
+ pextrd [r2 + r3], xm1, 1
+ pextrd [r2 + r3 * 2], xm1, 2
+ pextrd [r2 + r4], xm1, 3
+%endif
+ RET
+%endmacro
+
+ FILTER_VER_CHROMA_AVX2_2x16 pp
+ FILTER_VER_CHROMA_AVX2_2x16 ps
+
;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_2x8(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
;-----------------------------------------------------------------------------
@@ -4971,10 +6690,10 @@
FILTER_VER_CHROMA_AVX2_4x8 pp
FILTER_VER_CHROMA_AVX2_4x8 ps
-%macro FILTER_VER_CHROMA_AVX2_4x16 1
-INIT_YMM avx2
+%macro FILTER_VER_CHROMA_AVX2_4xN 2
%if ARCH_X86_64 == 1
-cglobal interp_4tap_vert_%1_4x16, 4, 6, 9
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_4x%2, 4, 6, 12
 mov r4d, r4m
 shl r4d, 6
 sub r0, r1
@@ -4987,7 +6706,16 @@
%endif
 lea r4, [r1 * 3]
-
+ mova m10, [r5]
+ mova m11, [r5 + mmsize]
+%ifidn %1,pp
+ mova m9, [pw_512]
+%else
+ add r3d, r3d
+ mova m9, [pw_2000]
+%endif
+ lea r5, [r3 * 3]
+%rep %2 / 16
 movd xm1, [r0]
 pinsrd xm1, [r0 + r1], 1
 pinsrd xm1, [r0 + r1 * 2], 2
@@ -5035,29 +6763,27 @@
 pshufb m6, m6, m5
 pshufb m7, m7, m5
 pshufb m8, m8, m5
- pmaddubsw m0, [r5]
- pmaddubsw m6, [r5]
- pmaddubsw m7, [r5]
- pmaddubsw m8, [r5]
- pmaddubsw m1, [r5 + mmsize]
- pmaddubsw m2, [r5 + mmsize]
- pmaddubsw m3, [r5 + mmsize]
- pmaddubsw m4, [r5 + mmsize]
+ pmaddubsw m0, m10
+ pmaddubsw m6, m10
+ pmaddubsw m7, m10
+ pmaddubsw m8, m10
+ pmaddubsw m1, m11
+ pmaddubsw m2, m11
+ pmaddubsw m3, m11
+ pmaddubsw m4, m11
 paddw m0, m1 ; m0 = WORD ROW[3 2 1 0]
 paddw m6, m2 ; m6 = WORD ROW[7 6 5 4]
 paddw m7, m3 ; m7 = WORD ROW[11 10 9 8]
 paddw m8, m4 ; m8 = WORD ROW[15 14 13 12]
%ifidn %1,pp
- mova m5, [pw_512]
- pmulhrsw m0, m5
- pmulhrsw m6, m5
- pmulhrsw m7, m5
- pmulhrsw m8, m5
+ pmulhrsw m0, m9
+ pmulhrsw m6, m9
+ pmulhrsw m7, m9
+ pmulhrsw m8, m9
 packuswb m0, m6
 packuswb m7, m8
 vextracti128 xm1, m0, 1
 vextracti128 xm2, m7, 1
- lea r5, [r3 * 3]
 movd [r2], xm0
 pextrd [r2 + r3], xm0, 1
 movd [r2 + r3 * 2], xm1
@@ -5078,17 +6804,14 @@
 pextrd [r2 + r3 * 2], xm2, 2
 pextrd [r2 + r5], xm2, 3
%else
- add r3d, r3d
- mova m5, [pw_2000]
- psubw m0, m5
- psubw m6, m5
- psubw m7, m5
- psubw m8, m5
+ psubw m0, m9
+ psubw m6, m9
+ psubw m7, m9
+ psubw m8, m9
 vextracti128 xm1, m0, 1
 vextracti128 xm2, m6, 1
 vextracti128 xm3, m7, 1
 vextracti128 xm4, m8, 1
- lea r5, [r3 * 3]
 movq [r2], xm0
 movhps [r2 + r3], xm0
 movq [r2 + r3 * 2], xm1
@@ -5109,12 +6832,16 @@
 movq [r2 + r3 * 2], xm4
 movhps [r2 + r5], xm4
%endif
+ lea r2, [r2 + r3 * 4]
+%endrep
 RET
%endif
%endmacro
- FILTER_VER_CHROMA_AVX2_4x16 pp
- FILTER_VER_CHROMA_AVX2_4x16 ps
+ FILTER_VER_CHROMA_AVX2_4xN pp, 16
+ FILTER_VER_CHROMA_AVX2_4xN ps, 16
+ FILTER_VER_CHROMA_AVX2_4xN pp, 32
+ FILTER_VER_CHROMA_AVX2_4xN ps, 32
;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
@@ -9116,6 +10843,149 @@
FILTER_VER_CHROMA_AVX2_24x32 pp
FILTER_VER_CHROMA_AVX2_24x32 ps
+%macro FILTER_VER_CHROMA_AVX2_24x64 1
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_24x64, 4, 7, 13
+ mov r4d, r4m
+ shl r4d, 6
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeffVer_32]
+ add r5, r4
+%else
+ lea r5, [tab_ChromaCoeffVer_32 + r4]
+%endif
+
+ mova m10, [r5]
+ mova m11, [r5 + mmsize]
+ lea r4, [r1 * 3]
+ sub r0, r1
+%ifidn %1,pp
+ mova m12, [pw_512]
+%else
+ add r3d, r3d
+ vbroadcasti128 m12, [pw_2000]
+%endif
+ lea r5, [r3 * 3]
+ mov r6d, 16
+.loopH:
+ movu m0, [r0] ; m0 = row 0
+ movu m1, [r0 + r1] ; m1 = row 1
+ punpcklbw m2, m0, m1
+ punpckhbw m3, m0, m1
+ pmaddubsw m2, m10
+ pmaddubsw m3, m10
+ movu m0, [r0 + r1 * 2] ; m0 = row 2
+ punpcklbw m4, m1, m0
+ punpckhbw m5, m1, m0
+ pmaddubsw m4, m10
+ pmaddubsw m5, m10
+ movu m1, [r0 + r4] ; m1 = row 3
+ punpcklbw m6, m0, m1
+ punpckhbw m7, m0, m1
+ pmaddubsw m8, m6, m11
+ pmaddubsw m9, m7, m11
+ pmaddubsw m6, m10
+ pmaddubsw m7, m10
+ paddw m2, m8
+ paddw m3, m9
+%ifidn %1,pp
+ pmulhrsw m2, m12
+ pmulhrsw m3, m12
+ packuswb m2, m3
+ movu [r2], xm2
+ vextracti128 xm2, m2, 1
+ movq [r2 + 16], xm2
+%else
+ psubw m2, m12
+ psubw m3, m12
+ vperm2i128 m0, m2, m3, 0x20
+ vperm2i128 m2, m2, m3, 0x31
+ movu [r2], m0
+ movu [r2 + mmsize], xm2
+%endif
+ lea r0, [r0 + r1 * 4]
+ movu m0, [r0] ; m0 = row 4
+ punpcklbw m2, m1, m0
+ punpckhbw m3, m1, m0
+ pmaddubsw m8, m2, m11
+ pmaddubsw m9, m3, m11
+ pmaddubsw m2, m10
+ pmaddubsw m3, m10
+ paddw m4, m8
+ paddw m5, m9
+%ifidn %1,pp
+ pmulhrsw m4, m12
+ pmulhrsw m5, m12
+ packuswb m4, m5
+ movu [r2 + r3], xm4
+ vextracti128 xm4, m4, 1
+ movq [r2 + r3 + 16], xm4
+%else
+ psubw m4, m12
+ psubw m5, m12
+ vperm2i128 m1, m4, m5, 0x20
+ vperm2i128 m4, m4, m5, 0x31
+ movu [r2 + r3], m1
+ movu [r2 + r3 + mmsize], xm4
+%endif
+
+ movu m1, [r0 + r1] ; m1 = row 5
+ punpcklbw m4, m0, m1
+ punpckhbw m5, m0, m1
+ pmaddubsw m4, m11
+ pmaddubsw m5, m11
+ paddw m6, m4
+ paddw m7, m5
+%ifidn %1,pp
+ pmulhrsw m6, m12
+ pmulhrsw m7, m12
+ packuswb m6, m7
+ movu [r2 + r3 * 2], xm6
+ vextracti128 xm6, m6, 1
+ movq [r2 + r3 * 2 + 16], xm6
+%else
+ psubw m6, m12
+ psubw m7, m12
+ vperm2i128 m0, m6, m7, 0x20
+ vperm2i128 m6, m6, m7, 0x31
+ movu [r2 + r3 * 2], m0
+ movu [r2 + r3 * 2 + mmsize], xm6
+%endif
+
+ movu m0, [r0 + r1 * 2] ; m0 = row 6
+ punpcklbw m6, m1, m0
+ punpckhbw m7, m1, m0
+ pmaddubsw m6, m11
+ pmaddubsw m7, m11
+ paddw m2, m6
+ paddw m3, m7
+%ifidn %1,pp
+ pmulhrsw m2, m12
+ pmulhrsw m3, m12
+ packuswb m2, m3
+ movu [r2 + r5], xm2
+ vextracti128 xm2, m2, 1
+ movq [r2 + r5 + 16], xm2
+%else
+ psubw m2, m12
+ psubw m3, m12
+ vperm2i128 m0, m2, m3, 0x20
+ vperm2i128 m2, m2, m3, 0x31
+ movu [r2 + r5], m0
+ movu [r2 + r5 + mmsize], xm2
+%endif
+ lea r2, [r2 + r3 * 4]
+ dec r6d
+ jnz .loopH
+ RET
+%endif
+%endmacro
+
+ FILTER_VER_CHROMA_AVX2_24x64 pp
+ FILTER_VER_CHROMA_AVX2_24x64 ps
+
%macro FILTER_VER_CHROMA_AVX2_16x4 1
INIT_YMM avx2
cglobal interp_4tap_vert_%1_16x4, 4, 6, 8
@@ -9898,6 +11768,364 @@
FILTER_VER_CHROMA_AVX2_32xN ps, 16
FILTER_VER_CHROMA_AVX2_32xN ps, 8
+%macro FILTER_VER_CHROMA_AVX2_48x64 1
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_48x64, 4, 8, 13
+ mov r4d, r4m
+ shl r4d, 6
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeffVer_32]
+ add r5, r4
+%else
+ lea r5, [tab_ChromaCoeffVer_32 + r4]
+%endif
+
+ mova m10, [r5]
+ mova m11, [r5 + mmsize]
+ lea r4, [r1 * 3]
+ sub r0, r1
+%ifidn %1,pp
+ mova m12, [pw_512]
+%else
+ add r3d, r3d
+ vbroadcasti128 m12, [pw_2000]
+%endif
+ lea r5, [r3 * 3]
+ lea r7, [r1 * 4]
+ mov r6d, 16
+.loopH:
+ movu m0, [r0] ; m0 = row 0
+ movu m1, [r0 + r1] ; m1 = row 1
+ punpcklbw m2, m0, m1
+ punpckhbw m3, m0, m1
+ pmaddubsw m2, m10
+ pmaddubsw m3, m10
+ movu m0, [r0 + r1 * 2] ; m0 = row 2
+ punpcklbw m4, m1, m0
+ punpckhbw m5, m1, m0
+ pmaddubsw m4, m10
+ pmaddubsw m5, m10
+ movu m1, [r0 + r4] ; m1 = row 3
+ punpcklbw m6, m0, m1
+ punpckhbw m7, m0, m1
+ pmaddubsw m8, m6, m11
+ pmaddubsw m9, m7, m11
+ pmaddubsw m6, m10
+ pmaddubsw m7, m10
+ paddw m2, m8
+ paddw m3, m9
+%ifidn %1,pp
+ pmulhrsw m2, m12
+ pmulhrsw m3, m12
+ packuswb m2, m3
+ movu [r2], m2
+%else
+ psubw m2, m12
+ psubw m3, m12
+ vperm2i128 m0, m2, m3, 0x20
+ vperm2i128 m2, m2, m3, 0x31
+ movu [r2], m0
+ movu [r2 + mmsize], m2
+%endif
+ lea r0, [r0 + r1 * 4]
+ movu m0, [r0] ; m0 = row 4
+ punpcklbw m2, m1, m0
+ punpckhbw m3, m1, m0
+ pmaddubsw m8, m2, m11
+ pmaddubsw m9, m3, m11
+ pmaddubsw m2, m10
+ pmaddubsw m3, m10
+ paddw m4, m8
+ paddw m5, m9
+%ifidn %1,pp
+ pmulhrsw m4, m12
+ pmulhrsw m5, m12
+ packuswb m4, m5
+ movu [r2 + r3], m4
+%else
+ psubw m4, m12
+ psubw m5, m12
+ vperm2i128 m1, m4, m5, 0x20
+ vperm2i128 m4, m4, m5, 0x31
+ movu [r2 + r3], m1
+ movu [r2 + r3 + mmsize], m4
+%endif
+
+ movu m1, [r0 + r1] ; m1 = row 5
+ punpcklbw m4, m0, m1
+ punpckhbw m5, m0, m1
+ pmaddubsw m4, m11
+ pmaddubsw m5, m11
+ paddw m6, m4
+ paddw m7, m5
+%ifidn %1,pp
+ pmulhrsw m6, m12
+ pmulhrsw m7, m12
+ packuswb m6, m7
+ movu [r2 + r3 * 2], m6
+%else
+ psubw m6, m12
+ psubw m7, m12
+ vperm2i128 m0, m6, m7, 0x20
+ vperm2i128 m6, m6, m7, 0x31
+ movu [r2 + r3 * 2], m0
+ movu [r2 + r3 * 2 + mmsize], m6
+%endif
+
+ movu m0, [r0 + r1 * 2] ; m0 = row 6
+ punpcklbw m6, m1, m0
+ punpckhbw m7, m1, m0
+ pmaddubsw m6, m11
+ pmaddubsw m7, m11
+ paddw m2, m6
+ paddw m3, m7
+%ifidn %1,pp
+ pmulhrsw m2, m12
+ pmulhrsw m3, m12
+ packuswb m2, m3
+ movu [r2 + r5], m2
+ add r2, 32
+%else
+ psubw m2, m12
+ psubw m3, m12
+ vperm2i128 m0, m2, m3, 0x20
+ vperm2i128 m2, m2, m3, 0x31
+ movu [r2 + r5], m0
+ movu [r2 + r5 + mmsize], m2
+ add r2, 64
+%endif
+ sub r0, r7
+
+ movu xm0, [r0 + 32] ; m0 = row 0
+ movu xm1, [r0 + r1 + 32] ; m1 = row 1
+ punpckhbw xm2, xm0, xm1
+ punpcklbw xm0, xm1
+ vinserti128 m0, m0, xm2, 1
+ pmaddubsw m0, m10
+ movu xm2, [r0 + r1 * 2 + 32] ; m2 = row 2
+ punpckhbw xm3, xm1, xm2
+ punpcklbw xm1, xm2
+ vinserti128 m1, m1, xm3, 1
+ pmaddubsw m1, m10
+ movu xm3, [r0 + r4 + 32] ; m3 = row 3
+ punpckhbw xm4, xm2, xm3
+ punpcklbw xm2, xm3
+ vinserti128 m2, m2, xm4, 1
+ pmaddubsw m4, m2, m11
+ paddw m0, m4
+ pmaddubsw m2, m10
+ lea r0, [r0 + r1 * 4]
+ movu xm4, [r0 + 32] ; m4 = row 4
+ punpckhbw xm5, xm3, xm4
+ punpcklbw xm3, xm4
+ vinserti128 m3, m3, xm5, 1
+ pmaddubsw m5, m3, m11
+ paddw m1, m5
+ pmaddubsw m3, m10
+ movu xm5, [r0 + r1 + 32] ; m5 = row 5
+ punpckhbw xm6, xm4, xm5
+ punpcklbw xm4, xm5
+ vinserti128 m4, m4, xm6, 1
+ pmaddubsw m4, m11
+ paddw m2, m4
+ movu xm6, [r0 + r1 * 2 + 32] ; m6 = row 6
+ punpckhbw xm7, xm5, xm6
+ punpcklbw xm5, xm6
+ vinserti128 m5, m5, xm7, 1
+ pmaddubsw m5, m11
+ paddw m3, m5
+%ifidn %1,pp
+ pmulhrsw m0, m12 ; m0 = word: row 0
+ pmulhrsw m1, m12 ; m1 = word: row 1
+ pmulhrsw m2, m12 ; m2 = word: row 2
+ pmulhrsw m3, m12 ; m3 = word: row 3
+ packuswb m0, m1
+ packuswb m2, m3
+ vpermq m0, m0, 11011000b
+ vpermq m2, m2, 11011000b
+ vextracti128 xm1, m0, 1
+ vextracti128 xm3, m2, 1
+ movu [r2], xm0
+ movu [r2 + r3], xm1
+ movu [r2 + r3 * 2], xm2
+ movu [r2 + r5], xm3
+ lea r2, [r2 + r3 * 4 - 32]
+%else
+ psubw m0, m12 ; m0 = word: row 0
+ psubw m1, m12 ; m1 = word: row 1
+ psubw m2, m12 ; m2 = word: row 2
+ psubw m3, m12 ; m3 = word: row 3
+ movu [r2], m0
+ movu [r2 + r3], m1
+ movu [r2 + r3 * 2], m2
+ movu [r2 + r5], m3
+ lea r2, [r2 + r3 * 4 - 64]
+%endif
+ dec r6d
+ jnz .loopH
+ RET
+%endif
+%endmacro
+
+ FILTER_VER_CHROMA_AVX2_48x64 pp
+ FILTER_VER_CHROMA_AVX2_48x64 ps
+
+%macro FILTER_VER_CHROMA_AVX2_64xN 2
+%if ARCH_X86_64 == 1
+INIT_YMM avx2
+cglobal interp_4tap_vert_%1_64x%2, 4, 8, 13
+ mov r4d, r4m
+ shl r4d, 6
+
+%ifdef PIC
+ lea r5, [tab_ChromaCoeffVer_32]
+ add r5, r4
+%else
+ lea r5, [tab_ChromaCoeffVer_32 + r4]
+%endif
+
+ mova m10, [r5]
+ mova m11, [r5 + mmsize]
+ lea r4, [r1 * 3]
+ sub r0, r1
+%ifidn %1,pp
+ mova m12, [pw_512]
+%else
+ add r3d, r3d
+ vbroadcasti128 m12, [pw_2000]
+%endif
+ lea r5, [r3 * 3]
+ lea r7, [r1 * 4]
+ mov r6d, %2 / 4
+.loopH:
+%assign x 0
+%rep 2
+ movu m0, [r0 + x] ; m0 = row 0
+ movu m1, [r0 + r1 + x] ; m1 = row 1
+ punpcklbw m2, m0, m1
+ punpckhbw m3, m0, m1
+ pmaddubsw m2, m10
+ pmaddubsw m3, m10
+ movu m0, [r0 + r1 * 2 + x] ; m0 = row 2
+ punpcklbw m4, m1, m0
+ punpckhbw m5, m1, m0
+ pmaddubsw m4, m10
+ pmaddubsw m5, m10
+ movu m1, [r0 + r4 + x] ; m1 = row 3
+ punpcklbw m6, m0, m1
+ punpckhbw m7, m0, m1
+ pmaddubsw m8, m6, m11
+ pmaddubsw m9, m7, m11
+ pmaddubsw m6, m10
+ pmaddubsw m7, m10
+ paddw m2, m8
+ paddw m3, m9
+%ifidn %1,pp
+ pmulhrsw m2, m12
+ pmulhrsw m3, m12
+ packuswb m2, m3
+ movu [r2], m2
+%else
+ psubw m2, m12
+ psubw m3, m12
+ vperm2i128 m0, m2, m3, 0x20
+ vperm2i128 m2, m2, m3, 0x31
+ movu [r2], m0
+ movu [r2 + mmsize], m2
+%endif
+ lea r0, [r0 + r1 * 4]
+ movu m0, [r0 + x] ; m0 = row 4
+ punpcklbw m2, m1, m0
+ punpckhbw m3, m1, m0
+ pmaddubsw m8, m2, m11
+ pmaddubsw m9, m3, m11
+ pmaddubsw m2, m10
+ pmaddubsw m3, m10
+ paddw m4, m8
+ paddw m5, m9
+%ifidn %1,pp
+ pmulhrsw m4, m12
+ pmulhrsw m5, m12
+ packuswb m4, m5
+ movu [r2 + r3], m4
+%else
+ psubw m4, m12
+ psubw m5, m12
+ vperm2i128 m1, m4, m5, 0x20
+ vperm2i128 m4, m4, m5, 0x31
+ movu [r2 + r3], m1
+ movu [r2 + r3 + mmsize], m4
+%endif
+
+ movu m1, [r0 + r1 + x] ; m1 = row 5
+ punpcklbw m4, m0, m1
+ punpckhbw m5, m0, m1
+ pmaddubsw m4, m11
+ pmaddubsw m5, m11
+ paddw m6, m4
+ paddw m7, m5
+%ifidn %1,pp
+ pmulhrsw m6, m12
+ pmulhrsw m7, m12
+ packuswb m6, m7
+ movu [r2 + r3 * 2], m6
+%else
+ psubw m6, m12
+ psubw m7, m12
+ vperm2i128 m0, m6, m7, 0x20
+ vperm2i128 m6, m6, m7, 0x31
+ movu [r2 + r3 * 2], m0
+ movu [r2 + r3 * 2 + mmsize], m6
+%endif
+
+ movu m0, [r0 + r1 * 2 + x] ; m0 = row 6
+ punpcklbw m6, m1, m0
+ punpckhbw m7, m1, m0
+ pmaddubsw m6, m11
+ pmaddubsw m7, m11
+ paddw m2, m6
+ paddw m3, m7
+%ifidn %1,pp
+ pmulhrsw m2, m12
+ pmulhrsw m3, m12
+ packuswb m2, m3
+ movu [r2 + r5], m2
+ add r2, 32
+%else
+ psubw m2, m12
+ psubw m3, m12
+ vperm2i128 m0, m2, m3, 0x20
+ vperm2i128 m2, m2, m3, 0x31
+ movu [r2 + r5], m0
+ movu [r2 + r5 + mmsize], m2
+ add r2, 64
+%endif
+ sub r0, r7
+%assign x x+32
+%endrep
+%ifidn %1,pp
+ lea r2, [r2 + r3 * 4 - 64]
+%else
+ lea r2, [r2 + r3 * 4 - 128]
+%endif
+ add r0, r7
+ dec r6d
+ jnz .loopH
+ RET
+%endif
+%endmacro
+
+ FILTER_VER_CHROMA_AVX2_64xN pp, 64
+ FILTER_VER_CHROMA_AVX2_64xN pp, 48
+ FILTER_VER_CHROMA_AVX2_64xN pp, 32
+ FILTER_VER_CHROMA_AVX2_64xN pp, 16
+ FILTER_VER_CHROMA_AVX2_64xN ps, 64
+ FILTER_VER_CHROMA_AVX2_64xN ps, 48
+ FILTER_VER_CHROMA_AVX2_64xN ps, 32
+ FILTER_VER_CHROMA_AVX2_64xN ps, 16
+
;-----------------------------------------------------------------------------
; void interp_4tap_vert_pp_%1x%2(pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx)
;-----------------------------------------------------------------------------
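For orientation, the pp and ps entry points generated above differ only in how the 4-tap weighted sum leaves the pipeline. Below is a scalar sketch of that arithmetic, assuming pw_32/pw_512 implement round-to-nearest on the 6-bit-shifted sum and that pw_2000 is the 16-bit intermediate offset 0x2000 (8192); the function and parameter names are illustrative, not part of the x265 API:

```c
#include <stdint.h>

/* Hypothetical scalar model of the interp_4tap_vert pp/ps kernels above.
 * coeff is one 4-tap row of the chroma coefficient table the assembly
 * indexes from tab_ChromaCoeffV / tab_ChromaCoeffVer_32.               */
static void vert4tap_pp_scalar(const uint8_t* src, intptr_t srcStride,
                               uint8_t* dst, intptr_t dstStride,
                               const int8_t coeff[4], int width, int height)
{
    src -= srcStride;                       /* mirrors "sub r0, r1" */
    for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int t = 0; t < 4; t++)
                sum += coeff[t] * src[x + t * srcStride];
            int v = (sum + 32) >> 6;        /* pw_32 rounding, psraw 6 */
            dst[x] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v)); /* packuswb */
        }
}

/* ps variant: the same sum is kept as a 16-bit intermediate minus the
 * pw_2000 offset; dstStride is in int16_t elements, which is why the
 * assembly doubles the byte stride with "add r3d, r3d".              */
static void vert4tap_ps_scalar(const uint8_t* src, intptr_t srcStride,
                               int16_t* dst, intptr_t dstStride,
                               const int8_t coeff[4], int width, int height)
{
    src -= srcStride;
    for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
        for (int x = 0; x < width; x++)
        {
            int sum = 0;
            for (int t = 0; t < 4; t++)
                sum += coeff[t] * src[x + t * srcStride];
            dst[x] = (int16_t)(sum - 8192);  /* psubw m, [pw_2000] */
        }
}
```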
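The filterPixelToShort kernels instantiated above perform no filtering at all; they only widen pixels into that same 16-bit intermediate domain. A minimal scalar equivalent of the psllw 6 / psubw [pw_8192] sequence in the macros (names again hypothetical):

```c
#include <stdint.h>

/* Hypothetical scalar model of filterPixelToShort_WxH:
 * dst[x] = (src[x] << 6) - 8192, the "psllw m, 6 / psubw m, [pw_8192]"
 * pair from FILTER_P2S_*_sse2 above.                                   */
static void pixel_to_short_scalar(const uint8_t* src, intptr_t srcStride,
                                  int16_t* dst, intptr_t dstStride,
                                  int width, int height)
{
    for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)((src[x] << 6) - 8192);
}
```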
View file
x265_1.7.tar.gz/source/common/x86/ipfilter8.h -> x265_1.8.tar.gz/source/common/x86/ipfilter8.h
Changed
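The hunk below deletes the long per-size declaration lists from ipfilter8.h. To make the token-pasting machinery being removed easier to read, here is one worked expansion; the invocation is hypothetical, but the expansion follows directly from the macro body shown in the hunk:

```c
/* SETUP_LUMA_FUNC_DEF(4, 4, _sse4) pastes W, H and the cpu suffix
 * together and yields these four declarations:                    */
void x265_interp_8tap_horiz_pp_4x4_sse4(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
void x265_interp_8tap_horiz_ps_4x4_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
void x265_interp_8tap_vert_pp_4x4_sse4(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
void x265_interp_8tap_vert_ps_4x4_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
```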
@@ -24,912 +24,26 @@
#ifndef X265_IPFILTER8_H
#define X265_IPFILTER8_H
-#define SETUP_LUMA_FUNC_DEF(W, H, cpu) \
- void x265_interp_8tap_horiz_pp_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
- void x265_interp_8tap_horiz_ps_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \
- void x265_interp_8tap_vert_pp_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
- void x265_interp_8tap_vert_ps_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
-
-#define LUMA_FILTERS(cpu) \
- SETUP_LUMA_FUNC_DEF(4, 4, cpu); \
- SETUP_LUMA_FUNC_DEF(8, 8, cpu); \
- SETUP_LUMA_FUNC_DEF(8, 4, cpu); \
- SETUP_LUMA_FUNC_DEF(4, 8, cpu); \
- SETUP_LUMA_FUNC_DEF(16, 16, cpu); \
- SETUP_LUMA_FUNC_DEF(16, 8, cpu); \
- SETUP_LUMA_FUNC_DEF(8, 16, cpu); \
- SETUP_LUMA_FUNC_DEF(16, 12, cpu); \
- SETUP_LUMA_FUNC_DEF(12, 16, cpu); \
- SETUP_LUMA_FUNC_DEF(16, 4, cpu); \
- SETUP_LUMA_FUNC_DEF(4, 16, cpu); \
- SETUP_LUMA_FUNC_DEF(32, 32, cpu); \
- SETUP_LUMA_FUNC_DEF(32, 16, cpu); \
- SETUP_LUMA_FUNC_DEF(16, 32, cpu); \
- SETUP_LUMA_FUNC_DEF(32, 24, cpu); \
- SETUP_LUMA_FUNC_DEF(24, 32, cpu); \
- SETUP_LUMA_FUNC_DEF(32, 8, cpu); \
- SETUP_LUMA_FUNC_DEF(8, 32, cpu); \
- SETUP_LUMA_FUNC_DEF(64, 64, cpu); \
- SETUP_LUMA_FUNC_DEF(64, 32, cpu); \
- SETUP_LUMA_FUNC_DEF(32, 64, cpu); \
- SETUP_LUMA_FUNC_DEF(64, 48, cpu); \
- SETUP_LUMA_FUNC_DEF(48, 64, cpu); \
- SETUP_LUMA_FUNC_DEF(64, 16, cpu); \
- SETUP_LUMA_FUNC_DEF(16, 64, cpu)
-
-#define SETUP_LUMA_SP_FUNC_DEF(W, H, cpu) \
- void x265_interp_8tap_vert_sp_ ## W ## x ## H ## cpu(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
-
-#define LUMA_SP_FILTERS(cpu) \
- SETUP_LUMA_SP_FUNC_DEF(4, 4, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(8, 8, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(8, 4, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(4, 8, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(16, 16, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(16, 8, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(8, 16, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(16, 12, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(12, 16, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(16, 4, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(4, 16, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(32, 32, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(32, 16, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(16, 32, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(32, 24, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(24, 32, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(32, 8, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(8, 32, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(64, 64, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(64, 32, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(32, 64, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(64, 48, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(48, 64, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(64, 16, cpu); \
- SETUP_LUMA_SP_FUNC_DEF(16, 64, cpu);
-
-#define SETUP_LUMA_SS_FUNC_DEF(W, H, cpu) \
- void x265_interp_8tap_vert_ss_ ## W ## x ## H ## cpu(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
-
-#define LUMA_SS_FILTERS(cpu) \
- SETUP_LUMA_SS_FUNC_DEF(4, 4, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(8, 8, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(8, 4, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(4, 8, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(16, 16, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(16, 8, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(8, 16, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(16, 12, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(12, 16, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(16, 4, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(4, 16, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(32, 32, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(32, 16, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(16, 32, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(32, 24, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(24, 32, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(32, 8, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(8, 32, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(64, 64, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(64, 32, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(32, 64, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(64, 48, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(48, 64, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(64, 16, cpu); \
- SETUP_LUMA_SS_FUNC_DEF(16, 64, cpu);
-
-#if HIGH_BIT_DEPTH
-
-#define SETUP_CHROMA_420_VERT_FUNC_DEF(W, H, cpu) \
- void x265_interp_4tap_vert_ss_ ## W ## x ## H ## cpu(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \
- void x265_interp_4tap_vert_sp_ ## W ## x ## H ## cpu(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
- void x265_interp_4tap_vert_pp_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
- void x265_interp_4tap_vert_ps_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
-
-#define CHROMA_420_VERT_FILTERS(cpu) \
- SETUP_CHROMA_420_VERT_FUNC_DEF(4, 4, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 6, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 2, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 12, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(12, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 4, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 24, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(24, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 32, cpu)
-
-#define CHROMA_420_VERT_FILTERS_SSE4(cpu) \
- SETUP_CHROMA_420_VERT_FUNC_DEF(2, 4, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(4, 2, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(6, 8, cpu);
-
-#define CHROMA_422_VERT_FILTERS(cpu) \
- SETUP_CHROMA_420_VERT_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 12, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 24, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(12, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(4, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 64, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 64, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 48, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(24, 64, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 64, cpu);
-
-#define CHROMA_422_VERT_FILTERS_SSE4(cpu) \
- SETUP_CHROMA_420_VERT_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(2, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(4, 4, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(6, 16, cpu);
-
-#define CHROMA_444_VERT_FILTERS(cpu) \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 12, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(12, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 4, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 24, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(24, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 8, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(8, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(64, 64, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(64, 32, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(32, 64, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(64, 48, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(48, 64, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(64, 16, cpu); \
- SETUP_CHROMA_420_VERT_FUNC_DEF(16, 64, cpu)
-
-#define SETUP_CHROMA_420_HORIZ_FUNC_DEF(W, H, cpu) \
- void x265_interp_4tap_horiz_pp_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
- void x265_interp_4tap_horiz_ps_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt);
-
-#define CHROMA_420_HORIZ_FILTERS(cpu) \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(4, 4, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(4, 2, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(2, 4, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 6, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(6, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 2, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 12, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(12, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 4, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 24, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(24, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 32, cpu)
-
-#define CHROMA_422_HORIZ_FILTERS(cpu) \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(4, 4, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 12, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(6, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(2, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 24, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(12, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(4, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 64, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 64, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 48, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(24, 64, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 64, cpu)
-
-#define CHROMA_444_HORIZ_FILTERS(cpu) \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 12, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(12, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 4, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 24, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(24, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 8, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(8, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(64, 64, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(64, 32, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(32, 64, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(64, 48, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(48, 64, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(64, 16, cpu); \
- SETUP_CHROMA_420_HORIZ_FUNC_DEF(16, 64, cpu)
-
-void x265_filterPixelToShort_4x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_4x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_4x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_24x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_12x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_48x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x4_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x12_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_32x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_32x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_32x24_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_32x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_32x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_64x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_64x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_64x48_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_64x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_24x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_48x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-
-#define SETUP_CHROMA_P2S_FUNC_DEF(W, H, cpu) \
- void x265_filterPixelToShort_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-
-#define CHROMA_420_P2S_FILTERS_SSSE3(cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(4, 2, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(8, 2, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(8, 6, cpu);
-
-#define CHROMA_420_P2S_FILTERS_SSE4(cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(2, 4, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(6, 8, cpu);
-
-#define CHROMA_422_P2S_FILTERS_SSSE3(cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(4, 32, cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(8, 12, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(8, 64, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(12, 32, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu);
-
-#define CHROMA_422_P2S_FILTERS_SSE4(cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(2, 16, cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(6, 16, cpu);
-
-#define CHROMA_420_P2S_FILTERS_AVX2(cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 4, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 12, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(24, 32, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 8, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 24, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu);
-
-#define CHROMA_422_P2S_FILTERS_AVX2(cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 64, cpu);
-
-CHROMA_420_VERT_FILTERS(_sse2);
-CHROMA_420_HORIZ_FILTERS(_sse4);
-CHROMA_420_VERT_FILTERS_SSE4(_sse4);
-CHROMA_420_P2S_FILTERS_SSSE3(_ssse3);
-CHROMA_420_P2S_FILTERS_SSE4(_sse4);
-CHROMA_420_P2S_FILTERS_AVX2(_avx2);
-
-CHROMA_422_VERT_FILTERS(_sse2);
-CHROMA_422_HORIZ_FILTERS(_sse4);
-CHROMA_422_VERT_FILTERS_SSE4(_sse4);
-CHROMA_422_P2S_FILTERS_SSE4(_sse4);
-CHROMA_422_P2S_FILTERS_SSSE3(_ssse3);
-CHROMA_422_P2S_FILTERS_AVX2(_avx2);
-
-CHROMA_444_VERT_FILTERS(_sse2);
-CHROMA_444_HORIZ_FILTERS(_sse4);
-
-#undef CHROMA_420_VERT_FILTERS_SSE4
-#undef CHROMA_420_VERT_FILTERS
-#undef SETUP_CHROMA_420_VERT_FUNC_DEF
-#undef CHROMA_420_HORIZ_FILTERS
-#undef SETUP_CHROMA_420_HORIZ_FUNC_DEF
-
-#undef CHROMA_422_VERT_FILTERS
-#undef CHROMA_422_VERT_FILTERS_SSE4
-#undef CHROMA_422_HORIZ_FILTERS
-
-#undef CHROMA_444_VERT_FILTERS
-#undef CHROMA_444_HORIZ_FILTERS
-
-#else // if HIGH_BIT_DEPTH
-
-#define SETUP_CHROMA_FUNC_DEF(W, H, cpu) \
- void x265_interp_4tap_horiz_pp_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
- void x265_interp_4tap_horiz_ps_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \
- void x265_interp_4tap_vert_pp_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \
- void x265_interp_4tap_vert_ps_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
-
-#define CHROMA_420_FILTERS(cpu) \
- SETUP_CHROMA_FUNC_DEF(4, 4, cpu); \
- SETUP_CHROMA_FUNC_DEF(4, 2, cpu); \
- SETUP_CHROMA_FUNC_DEF(2, 4, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 6, cpu); \
- SETUP_CHROMA_FUNC_DEF(6, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 2, cpu); \
- SETUP_CHROMA_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 12, cpu); \
- SETUP_CHROMA_FUNC_DEF(12, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 4, cpu); \
- SETUP_CHROMA_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 24, cpu); \
- SETUP_CHROMA_FUNC_DEF(24, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 32, cpu)
-
-#define CHROMA_422_FILTERS(cpu) \
- SETUP_CHROMA_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(4, 4, cpu); \
- SETUP_CHROMA_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 12, cpu); \
- SETUP_CHROMA_FUNC_DEF(6, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_FUNC_DEF(2, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 24, cpu); \
- SETUP_CHROMA_FUNC_DEF(12, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(4, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 64, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 64, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 48, cpu); \
- SETUP_CHROMA_FUNC_DEF(24, 64, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 64, cpu);
-
-#define CHROMA_444_FILTERS(cpu) \
- SETUP_CHROMA_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 12, cpu); \
- SETUP_CHROMA_FUNC_DEF(12, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 4, cpu); \
- SETUP_CHROMA_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 24, cpu); \
- SETUP_CHROMA_FUNC_DEF(24, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 8, cpu); \
- SETUP_CHROMA_FUNC_DEF(8, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(64, 64, cpu); \
- SETUP_CHROMA_FUNC_DEF(64, 32, cpu); \
- SETUP_CHROMA_FUNC_DEF(32, 64, cpu); \
- SETUP_CHROMA_FUNC_DEF(64, 48, cpu); \
- SETUP_CHROMA_FUNC_DEF(48, 64, cpu); \
- SETUP_CHROMA_FUNC_DEF(64, 16, cpu); \
- SETUP_CHROMA_FUNC_DEF(16, 64, cpu);
-
-#define SETUP_CHROMA_SP_FUNC_DEF(W, H, cpu) \
- void x265_interp_4tap_vert_sp_ ## W ## x ## H ## cpu(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
-
-#define CHROMA_420_SP_FILTERS(cpu) \
- SETUP_CHROMA_SP_FUNC_DEF(8, 2, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 6, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 32, cpu);
-
-#define CHROMA_420_SP_FILTERS_SSE4(cpu) \
- SETUP_CHROMA_SP_FUNC_DEF(2, 4, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(4, 2, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(4, 4, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(6, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 12, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(12, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 4, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 24, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(24, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 8, cpu);
-
-#define CHROMA_422_SP_FILTERS(cpu) \
- SETUP_CHROMA_SP_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 12, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 64, cpu);
-
-#define CHROMA_422_SP_FILTERS_SSE4(cpu) \
- SETUP_CHROMA_SP_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(2, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(4, 4, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(4, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(6, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 24, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(12, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 64, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 64, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 48, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(24, 64, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 16, cpu);
-
-#define CHROMA_444_SP_FILTERS(cpu) \
- SETUP_CHROMA_SP_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 12, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(12, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 4, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 24, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(24, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 8, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(8, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(64, 64, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(64, 32, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(32, 64, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(64, 48, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(48, 64, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(64, 16, cpu); \
- SETUP_CHROMA_SP_FUNC_DEF(16, 64, cpu);
-
-#define SETUP_CHROMA_SS_FUNC_DEF(W, H, cpu) \
- void x265_interp_4tap_vert_ss_ ## W ## x ## H ## cpu(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
-
-#define CHROMA_420_SS_FILTERS(cpu) \
- SETUP_CHROMA_SS_FUNC_DEF(4, 4, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(4, 2, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 6, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 2, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 12, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(12, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 4, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 24, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(24, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 32, cpu);
-
-#define CHROMA_420_SS_FILTERS_SSE4(cpu) \
- SETUP_CHROMA_SS_FUNC_DEF(2, 4, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(6, 8, cpu);
-
-#define CHROMA_422_SS_FILTERS(cpu) \
- SETUP_CHROMA_SS_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(4, 4, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 12, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 24, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(12, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(4, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 64, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 64, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 48, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(24, 64, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 64, cpu);
-
-#define CHROMA_422_SS_FILTERS_SSE4(cpu) \
- SETUP_CHROMA_SS_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(2, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(6, 16, cpu);
-
-#define CHROMA_444_SS_FILTERS(cpu) \
- SETUP_CHROMA_SS_FUNC_DEF(8, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 4, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(4, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 12, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(12, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 4, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(4, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 24, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(24, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 8, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(8, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(64, 64, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(64, 32, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(32, 64, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(64, 48, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(48, 64, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(64, 16, cpu); \
- SETUP_CHROMA_SS_FUNC_DEF(16, 64, cpu);
-
-#define SETUP_CHROMA_P2S_FUNC_DEF(W, H, cpu) \
- void x265_filterPixelToShort_ ## W ## x ## H ## cpu(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-
-#define CHROMA_420_P2S_FILTERS_SSE4(cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(2, 4, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(4, 2, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(6, 8, cpu);
-
-#define CHROMA_420_P2S_FILTERS_SSSE3(cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(8, 2, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(8, 6, cpu);
-
-#define CHROMA_422_P2S_FILTERS_SSE4(cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(2, 8, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(2, 16, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(6, 16, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(4, 32, cpu);
-
-#define CHROMA_422_P2S_FILTERS_SSSE3(cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(8, 12, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(8, 64, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(12, 32, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 24, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(16, 64, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu);
-
-#define CHROMA_420_P2S_FILTERS_AVX2(cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(24, 32, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 8, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 24, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu);
-
-#define CHROMA_422_P2S_FILTERS_AVX2(cpu) \
- SETUP_CHROMA_P2S_FUNC_DEF(24, 64, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 16, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 32, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 48, cpu); \
- SETUP_CHROMA_P2S_FUNC_DEF(32, 64, cpu);
-
-CHROMA_420_FILTERS(_sse4);
-CHROMA_420_FILTERS(_avx2);
-CHROMA_420_SP_FILTERS(_sse2);
-CHROMA_420_SP_FILTERS_SSE4(_sse4);
-CHROMA_420_SP_FILTERS(_avx2);
-CHROMA_420_SP_FILTERS_SSE4(_avx2);
-CHROMA_420_SS_FILTERS(_sse2);
-CHROMA_420_SS_FILTERS_SSE4(_sse4);
-CHROMA_420_SS_FILTERS(_avx2);
-CHROMA_420_SS_FILTERS_SSE4(_avx2);
-CHROMA_420_P2S_FILTERS_SSE4(_sse4);
-CHROMA_420_P2S_FILTERS_SSSE3(_ssse3);
-CHROMA_420_P2S_FILTERS_AVX2(_avx2);
-
-CHROMA_422_FILTERS(_sse4);
-CHROMA_422_FILTERS(_avx2);
-CHROMA_422_SP_FILTERS(_sse2);
-CHROMA_422_SP_FILTERS(_avx2);
-CHROMA_422_SP_FILTERS_SSE4(_sse4);
-CHROMA_422_SP_FILTERS_SSE4(_avx2);
-CHROMA_422_SS_FILTERS(_sse2);
-CHROMA_422_SS_FILTERS(_avx2);
-CHROMA_422_SS_FILTERS_SSE4(_sse4);
-CHROMA_422_SS_FILTERS_SSE4(_avx2);
-CHROMA_422_P2S_FILTERS_SSE4(_sse4);
-CHROMA_422_P2S_FILTERS_SSSE3(_ssse3);
-CHROMA_422_P2S_FILTERS_AVX2(_avx2);
-void x265_interp_4tap_vert_ss_2x4_avx2(const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx);
-void x265_interp_4tap_vert_sp_2x4_avx2(const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx);
-
-CHROMA_444_FILTERS(_sse4);
-CHROMA_444_SP_FILTERS(_sse4);
-CHROMA_444_SS_FILTERS(_sse2);
-CHROMA_444_FILTERS(_avx2);
-CHROMA_444_SP_FILTERS(_avx2);
-CHROMA_444_SS_FILTERS(_avx2);
-
-#undef SETUP_CHROMA_FUNC_DEF
-#undef SETUP_CHROMA_SP_FUNC_DEF
-#undef SETUP_CHROMA_SS_FUNC_DEF
-#undef CHROMA_420_FILTERS
-#undef CHROMA_420_SP_FILTERS
-#undef CHROMA_420_SS_FILTERS
-#undef CHROMA_420_SS_FILTERS_SSE4
-#undef CHROMA_420_SP_FILTERS_SSE4
-
-#undef CHROMA_422_FILTERS
-#undef CHROMA_422_SP_FILTERS
-#undef CHROMA_422_SS_FILTERS
-#undef CHROMA_422_SS_FILTERS_SSE4
-#undef CHROMA_422_SP_FILTERS_SSE4
-
-#undef CHROMA_444_FILTERS
-#undef CHROMA_444_SP_FILTERS
-#undef CHROMA_444_SS_FILTERS
-
-#endif // if HIGH_BIT_DEPTH
-
-LUMA_FILTERS(_sse4);
-LUMA_SP_FILTERS(_sse4);
-LUMA_SS_FILTERS(_sse2);
-LUMA_FILTERS(_avx2);
-LUMA_SP_FILTERS(_avx2);
-LUMA_SS_FILTERS(_avx2);
-void x265_interp_8tap_hv_pp_8x8_ssse3(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-void x265_interp_8tap_hv_pp_16x16_avx2(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY);
-void x265_filterPixelToShort_4x4_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_4x8_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_4x16_sse4(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_8x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_8x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_8x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_8x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x4_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride);
-void x265_filterPixelToShort_16x8_ssse3(const pixel* src, intptr_t srcStride, int16_t*
dst, intptr_t dstStride); -void x265_filterPixelToShort_16x12_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_16x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_16x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_16x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_32x8_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_32x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_32x24_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_32x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_32x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_64x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_64x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_64x48_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_64x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_12x16_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_24x32_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_48x64_ssse3(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_32x8_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_32x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_32x24_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_32x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_32x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_64x16_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_64x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_64x48_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_64x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_48x64_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_filterPixelToShort_24x32_avx2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); -void x265_interp_4tap_horiz_pp_2x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_2x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_2x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_4x2_sse3(const pixel 
*src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_4x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_4x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_4x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_4x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_6x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_6x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_8x2_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_8x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_8x6_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_8x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_8x12_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_8x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_8x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_8x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_12x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_12x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_16x4_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_16x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_16x12_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_16x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_16x24_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_16x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_16x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_24x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_24x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_32x8_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_32x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_32x24_sse3(const 
pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_32x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_32x48_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_32x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_48x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_64x16_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_64x32_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_64x48_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_horiz_pp_64x64_sse3(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_4x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_4x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_4x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_8x4_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_8x8_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_8x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_8x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_12x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_16x4_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_16x8_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_16x12_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_16x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_16x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_16x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_24x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_32x8_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_32x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_32x24_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_32x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void 
x265_interp_8tap_horiz_pp_32x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_48x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_64x16_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_64x32_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_64x48_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_pp_64x64_sse2(const pixel* src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_8tap_horiz_ps_4x4_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_4x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_4x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_8x4_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_8x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_8x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_8x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_12x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_16x4_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_16x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_16x12_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_16x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_16x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_16x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_24x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_32x8_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_32x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_32x24_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_32x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_32x64_sse2(const pixel* src, intptr_t 
srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_48x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_64x16_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_64x32_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_64x48_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_horiz_ps_64x64_sse2(const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); -void x265_interp_8tap_hv_pp_8x8_sse3(const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY); -void x265_interp_4tap_vert_pp_2x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_2x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_2x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_4x2_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_4x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_4x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_4x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_4x32_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -#ifdef X86_64 -void x265_interp_4tap_vert_pp_6x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_6x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_8x2_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_8x4_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_8x6_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_8x8_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_8x12_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_8x16_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_8x32_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -void x265_interp_4tap_vert_pp_8x64_sse2(const pixel *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int coeffIdx); -#endif -#undef LUMA_FILTERS -#undef LUMA_SP_FILTERS -#undef LUMA_SS_FILTERS -#undef SETUP_LUMA_FUNC_DEF -#undef SETUP_LUMA_SP_FUNC_DEF -#undef SETUP_LUMA_SS_FUNC_DEF - -#endif // ifndef X265_MC_H +#define SETUP_FUNC_DEF(cpu) \ + FUNCDEF_PU(void, interp_8tap_horiz_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); 
\ + FUNCDEF_PU(void, interp_8tap_horiz_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \ + FUNCDEF_PU(void, interp_8tap_vert_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_PU(void, interp_8tap_vert_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_PU(void, interp_8tap_vert_sp, cpu, const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_PU(void, interp_8tap_vert_ss, cpu, const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_PU(void, interp_8tap_hv_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int idxX, int idxY); \ + FUNCDEF_CHROMA_PU(void, filterPixelToShort, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride); \ + FUNCDEF_CHROMA_PU(void, interp_4tap_horiz_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_CHROMA_PU(void, interp_4tap_horiz_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx, int isRowExt); \ + FUNCDEF_CHROMA_PU(void, interp_4tap_vert_pp, cpu, const pixel* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_CHROMA_PU(void, interp_4tap_vert_ps, cpu, const pixel* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_CHROMA_PU(void, interp_4tap_vert_sp, cpu, const int16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int coeffIdx); \ + FUNCDEF_CHROMA_PU(void, interp_4tap_vert_ss, cpu, const int16_t* src, intptr_t srcStride, int16_t* dst, intptr_t dstStride, int coeffIdx) + +SETUP_FUNC_DEF(sse2); +SETUP_FUNC_DEF(ssse3); +SETUP_FUNC_DEF(sse3); +SETUP_FUNC_DEF(sse4); +SETUP_FUNC_DEF(avx2); + +#endif // ifndef X265_IPFILTER8_H
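The hunk above is the heart of the 1.8 header cleanup: hundreds of hand-written per-block-size prototypes are replaced by a single SETUP_FUNC_DEF macro that expands FUNCDEF_PU/FUNCDEF_CHROMA_PU once per ISA. As a rough illustration of the token-pasting fan-out involved, here is a minimal, self-contained C sketch; DEMO_FUNCDEF_PU, the three block sizes, and demo_interp_horiz_pp are made-up names, and x265's real FUNCDEF_PU covers the full HEVC partition list rather than this toy subset:

    #include <stdio.h>

    /* Hypothetical stand-in for x265's FUNCDEF_PU: declare one symbol per
     * block size for a given CPU suffix via token pasting. */
    #define DEMO_FUNCDEF_PU(ret, name, cpu, ...) \
        ret name ## _4x4_ ## cpu(__VA_ARGS__);   \
        ret name ## _8x8_ ## cpu(__VA_ARGS__);   \
        ret name ## _16x16_ ## cpu(__VA_ARGS__)

    /* One line per ISA now declares the whole kernel family, e.g.
     * demo_interp_horiz_pp_4x4_sse2 ... demo_interp_horiz_pp_16x16_avx2. */
    DEMO_FUNCDEF_PU(void, demo_interp_horiz_pp, sse2, const unsigned char* src, long srcStride);
    DEMO_FUNCDEF_PU(void, demo_interp_horiz_pp, avx2, const unsigned char* src, long srcStride);

    int main(void)
    {
        /* Declarations only: in x265 the definitions live in the per-ISA
         * assembly files, so nothing is called here. */
        puts("kernel families declared by macro expansion");
        return 0;
    }

The practical win is that adding a new kernel family or a new ISA touches one macro line instead of dozens of declarations, which is what makes the Main12 build matrix in this release manageable.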
View file
x265_1.7.tar.gz/source/common/x86/loopfilter.asm -> x265_1.8.tar.gz/source/common/x86/loopfilter.asm
Changed
@@ -29,6 +29,7 @@ SECTION_RODATA 32
pb_31: times 32 db 31
+pb_124: times 32 db 124
pb_15: times 32 db 15
pb_movemask_32: times 32 db 0x00
                times 32 db 0xFF
@@ -38,13 +39,118 @@
cextern pb_128
cextern pb_2
cextern pw_2
+cextern pw_pixel_max
cextern pb_movemask
+cextern pw_1
+cextern hmul_16p
+cextern pb_4

;============================================================================================================
; void saoCuOrgE0(pixel * rec, int8_t * offsetEo, int lcuWidth, int8_t* signLeft, intptr_t stride)
;============================================================================================================
INIT_XMM sse4
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE0, 4,5,9
+    mov r4d, r4m
+    movh m6, [r1]
+    movzx r1d, byte [r3]
+    pxor m5, m5
+    neg r1b
+    movd m0, r1d
+    lea r1, [r0 + r4 * 2]
+    mov r4d, r2d
+
+.loop:
+    movu m7, [r0]
+    movu m8, [r0 + 16]
+    movu m2, [r0 + 2]
+    movu m1, [r0 + 18]
+
+    pcmpgtw m3, m7, m2
+    pcmpgtw m2, m7
+    pcmpgtw m4, m8, m1
+    pcmpgtw m1, m8
+
+    packsswb m3, m4
+    packsswb m2, m1
+
+    pand m3, [pb_1]
+    por m3, m2
+
+    palignr m2, m3, m5, 15
+    por m2, m0
+
+    mova m4, [pw_pixel_max]
+    psignb m2, [pb_128]                 ; m2 = signLeft
+    pxor m0, m0
+    palignr m0, m3, 15
+    paddb m3, m2
+    paddb m3, [pb_2]                    ; m2 = uiEdgeType
+    pshufb m2, m6, m3
+    pmovsxbw m3, m2                     ; offsetEo
+    punpckhbw m2, m2
+    psraw m2, 8
+    paddw m7, m3
+    paddw m8, m2
+    pmaxsw m7, m5
+    pmaxsw m8, m5
+    pminsw m7, m4
+    pminsw m8, m4
+    movu [r0], m7
+    movu [r0 + 16], m8
+
+    add r0q, 32
+    sub r2d, 16
+    jnz .loop
+
+    movzx r3d, byte [r3 + 1]
+    neg r3b
+    movd m0, r3d
+.loopH:
+    movu m7, [r1]
+    movu m8, [r1 + 16]
+    movu m2, [r1 + 2]
+    movu m1, [r1 + 18]
+
+    pcmpgtw m3, m7, m2
+    pcmpgtw m2, m7
+    pcmpgtw m4, m8, m1
+    pcmpgtw m1, m8
+
+    packsswb m3, m4
+    packsswb m2, m1
+
+    pand m3, [pb_1]
+    por m3, m2
+
+    palignr m2, m3, m5, 15
+    por m2, m0
+
+    mova m4, [pw_pixel_max]
+    psignb m2, [pb_128]                 ; m2 = signLeft
+    pxor m0, m0
+    palignr m0, m3, 15
+    paddb m3, m2
+    paddb m3, [pb_2]                    ; m2 = uiEdgeType
+    pshufb m2, m6, m3
+    pmovsxbw m3, m2                     ; offsetEo
+    punpckhbw m2, m2
+    psraw m2, 8
+    paddw m7, m3
+    paddw m8, m2
+    pmaxsw m7, m5
+    pmaxsw m8, m5
+    pminsw m7, m4
+    pminsw m8, m4
+    movu [r1], m7
+    movu [r1 + 16], m8
+
+    add r1q, 32
+    sub r4d, 16
+    jnz .loopH
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE0, 5, 5, 8, rec, offsetEo, lcuWidth, signLeft, stride
    mov r4d, r4m
@@ -130,8 +236,70 @@
    sub r4d, 16
    jnz .loopH
    RET
+%endif

INIT_YMM avx2
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE0, 4,4,9
+    vbroadcasti128 m6, [r1]
+    movzx r1d, byte [r3]
+    neg r1b
+    movd xm0, r1d
+    movzx r1d, byte [r3 + 1]
+    neg r1b
+    movd xm1, r1d
+    vinserti128 m0, m0, xm1, 1
+    mova m5, [pw_pixel_max]
+    mov r1d, r4m
+    add r1d, r1d
+    shr r2d, 4
+
+.loop:
+    movu m7, [r0]
+    movu m8, [r0 + r1]
+    movu m2, [r0 + 2]
+    movu m1, [r0 + r1 + 2]
+
+    pcmpgtw m3, m7, m2
+    pcmpgtw m2, m7
+    pcmpgtw m4, m8, m1
+    pcmpgtw m1, m8
+
+    packsswb m3, m4
+    packsswb m2, m1
+    vpermq m3, m3, 11011000b
+    vpermq m2, m2, 11011000b
+
+    pand m3, [pb_1]
+    por m3, m2
+
+    pslldq m2, m3, 1
+    por m2, m0
+
+    psignb m2, [pb_128]                 ; m2 = signLeft
+    pxor m0, m0
+    palignr m0, m3, 15
+    paddb m3, m2
+    paddb m3, [pb_2]                    ; m3 = uiEdgeType
+    pshufb m2, m6, m3
+    pmovsxbw m3, xm2                    ; offsetEo
+    vextracti128 xm2, m2, 1
+    pmovsxbw m2, xm2
+    pxor m4, m4
+    paddw m7, m3
+    paddw m8, m2
+    pmaxsw m7, m4
+    pmaxsw m8, m4
+    pminsw m7, m5
+    pminsw m8, m5
+    movu [r0], m7
+    movu [r0 + r1], m8
+
+    add r0q, 32
+    dec r2d
+    jnz .loop
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE0, 5, 5, 7, rec, offsetEo, lcuWidth, signLeft, stride
    mov r4d, r4m
@@ -184,11 +352,68 @@
    sub r2d, 16
    jnz .loop
    RET
+%endif

;==================================================================================================
; void saoCuOrgE1(pixel *pRec, int8_t *m_iUpBuff1, int8_t *m_iOffsetEo, Int iStride, Int iLcuWidth)
;==================================================================================================
INIT_XMM sse4
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE1, 4,5,8
+    add r3d, r3d
+    mov r4d, r4m
+    pxor m0, m0                         ; m0 = 0
+    mova m6, [pb_2]                     ; m6 = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
+    shr r4d, 4
+.loop
+    movu m7, [r0]
+    movu m5, [r0 + 16]
+    movu m3, [r0 + r3]
+    movu m1, [r0 + r3 + 16]
+
+    pcmpgtw m2, m7, m3
+    pcmpgtw m3, m7
+    pcmpgtw m4, m5, m1
+    pcmpgtw m1, m5
+
+    packsswb m2, m4
+    packsswb m3, m1
+
+    pand m2, [pb_1]
+    por m2, m3
+
+    movu m3, [r1]                       ; m3 = m_iUpBuff1
+
+    paddb m3, m2
+    paddb m3, m6
+
+    movu m4, [r2]                       ; m4 = m_iOffsetEo
+    pshufb m1, m4, m3
+
+    psubb m3, m0, m2
+    movu [r1], m3
+
+    pmovsxbw m3, m1
+    punpckhbw m1, m1
+    psraw m1, 8
+
+    paddw m7, m3
+    paddw m5, m1
+
+    pmaxsw m7, m0
+    pmaxsw m5, m0
+    pminsw m7, [pw_pixel_max]
+    pminsw m5, [pw_pixel_max]
+
+    movu [r0], m7
+    movu [r0 + 16], m5
+
+    add r0, 32
+    add r1, 16
+    dec r4d
+    jnz .loop
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE1, 3, 5, 8, pRec, m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth
    mov r3d, r3m
    mov r4d, r4m
@@ -234,8 +459,54 @@
    dec r4d
    jnz .loop
    RET
+%endif

INIT_YMM avx2
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE1, 4,5,6
+    add r3d, r3d
+    mov r4d, r4m
+    mova m4, [pb_2]
+    shr r4d, 4
+    mova m0, [pw_pixel_max]
+.loop
+    movu m5, [r0]
+    movu m3, [r0 + r3]
+
+    pcmpgtw m2, m5, m3
+    pcmpgtw m3, m5
+
+    packsswb m2, m3
+    vpermq m3, m2, 11011101b
+    vpermq m2, m2, 10001000b
+
+    pand xm2, [pb_1]
+    por xm2, xm3
+
+    movu xm3, [r1]                      ; m3 = m_iUpBuff1
+
+    paddb xm3, xm2
+    paddb xm3, xm4
+
+    movu xm1, [r2]                      ; m1 = m_iOffsetEo
+    pshufb xm1, xm3
+    pmovsxbw m3, xm1
+
+    paddw m5, m3
+    pxor m3, m3
+    pmaxsw m5, m3
+    pminsw m5, m0
+    movu [r0], m5
+
+    psubb xm3, xm2
+    movu [r1], xm3
+
+    add r0, 32
+    add r1, 16
+    dec r4d
+    jnz .loop
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE1, 3, 5, 8, pRec, m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth
    mov r3d, r3m
    mov r4d, r4m
@@ -277,11 +548,117 @@
    dec r4d
    jnz .loop
    RET
+%endif

;========================================================================================================
; void saoCuOrgE1_2Rows(pixel *pRec, int8_t *m_iUpBuff1, int8_t *m_iOffsetEo, Int iStride, Int iLcuWidth)
;========================================================================================================
INIT_XMM sse4
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE1_2Rows, 4,7,8
+    add r3d, r3d
+    mov r4d, r4m
+    pxor m0, m0                         ; m0 = 0
+    mova m6, [pw_pixel_max]
+    mov r5d, r4d
+    shr r4d, 4
+    mov r6, r0
+.loop
+    movu m7, [r0]
+    movu m5, [r0 + 16]
+    movu m3, [r0 + r3]
+    movu m1, [r0 + r3 + 16]
+
+    pcmpgtw m2, m7, m3
+    pcmpgtw m3, m7
+    pcmpgtw m4, m5, m1
+    pcmpgtw m1, m5
+    packsswb m2, m4
+    packsswb m3, m1
+    pand m2, [pb_1]
+    por m2, m3
+
+    movu m3, [r1]                       ; m3 = m_iUpBuff1
+
+    paddb m3, m2
+    paddb m3, [pb_2]
+
+    movu m4, [r2]                       ; m4 = m_iOffsetEo
+    pshufb m1, m4, m3
+
+    psubb m3, m0, m2
+    movu [r1], m3
+
+    pmovsxbw m3, m1
+    punpckhbw m1, m1
+    psraw m1, 8
+
+    paddw m7, m3
+    paddw m5, m1
+
+    pmaxsw m7, m0
+    pmaxsw m5, m0
+    pminsw m7, m6
+    pminsw m5, m6
+
+    movu [r0], m7
+    movu [r0 + 16], m5
+
+    add r0, 32
+    add r1, 16
+    dec r4d
+    jnz .loop
+
+    sub r1, r5
+    shr r5d, 4
+    lea r0, [r6 + r3]
+.loopH:
+    movu m7, [r0]
+    movu m5, [r0 + 16]
+    movu m3, [r0 + r3]
+    movu m1, [r0 + r3 + 16]
+
+    pcmpgtw m2, m7, m3
+    pcmpgtw m3, m7
+    pcmpgtw m4, m5, m1
+    pcmpgtw m1, m5
+    packsswb m2, m4
+    packsswb m3, m1
+    pand m2, [pb_1]
+    por m2, m3
+
+    movu m3, [r1]                       ; m3 = m_iUpBuff1
+
+    paddb m3, m2
+    paddb m3, [pb_2]
+
+    movu m4, [r2]                       ; m4 = m_iOffsetEo
+    pshufb m1, m4, m3
+
+    psubb m3, m0, m2
+    movu [r1], m3
+
+    pmovsxbw m3, m1
+    punpckhbw m1, m1
+    psraw m1, 8
+
+    paddw m7, m3
+    paddw m5, m1
+
+    pmaxsw m7, m0
+    pmaxsw m5, m0
+    pminsw m7, m6
+    pminsw m5, m6
+
+    movu [r0], m7
+    movu [r0 + 16], m5
+
+    add r0, 32
+    add r1, 16
+    dec r5d
+    jnz .loopH
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE1_2Rows, 3, 5, 8, pRec, m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth
    mov r3d, r3m
    mov r4d, r4m
@@ -352,8 +729,65 @@
    dec r4d
    jnz .loop
    RET
+%endif

INIT_YMM avx2
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE1_2Rows, 4,5,8
+    add r3d, r3d
+    mov r4d, r4m
+    mova m4, [pw_pixel_max]
+    vbroadcasti128 m6, [r2]             ; m6 = m_iOffsetEo
+    shr r4d, 4
+.loop
+    movu m7, [r0]
+    movu m5, [r0 + r3]
+    movu m1, [r0 + r3 * 2]
+
+    pcmpgtw m2, m7, m5
+    pcmpgtw m3, m5, m7
+    pcmpgtw m0, m5, m1
+    pcmpgtw m1, m5
+
+    packsswb m2, m0
+    packsswb m3, m1
+    vpermq m2, m2, 11011000b
+    vpermq m3, m3, 11011000b
+
+    pand m2, [pb_1]
+    por m2, m3
+
+    movu xm3, [r1]                      ; m3 = m_iUpBuff1
+    pxor m0, m0
+    psubb m1, m0, m2
+    vinserti128 m3, m3, xm1, 1
+    vextracti128 [r1], m1, 1
+
+    paddb m3, m2
+    paddb m3, [pb_2]
+
+    pshufb m1, m6, m3
+    pmovsxbw m3, xm1
+    vextracti128 xm1, m1, 1
+    pmovsxbw m1, xm1
+
+    paddw m7, m3
+    paddw m5, m1
+
+    pmaxsw m7, m0
+    pmaxsw m5, m0
+    pminsw m7, m4
+    pminsw m5, m4
+
+    movu [r0], m7
+    movu [r0 + r3], m5
+
+    add r0, 32
+    add r1, 16
+    dec r4d
+    jnz .loop
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE1_2Rows, 3, 5, 7, pRec, m_iUpBuff1, m_iOffsetEo, iStride, iLcuWidth
    mov r3d, r3m
    mov r4d, r4m
@@ -401,11 +835,70 @@
    dec r4d
    jnz .loop
    RET
+%endif

;======================================================================================================================================================
; void saoCuOrgE2(pixel * rec, int8_t * bufft, int8_t * buff1, int8_t * offsetEo, int lcuWidth, intptr_t stride)
;======================================================================================================================================================
INIT_XMM sse4
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE2, 6,6,8
+    mov r4d, r4m
+    add r5d, r5d
+    pxor m0, m0
+    inc r1
+    movh m6, [r0 + r4 * 2]
+    movhps m6, [r1 + r4]
+
+.loop
+    movu m7, [r0]
+    movu m5, [r0 + 16]
+    movu m3, [r0 + r5 + 2]
+    movu m1, [r0 + r5 + 18]
+
+    pcmpgtw m2, m7, m3
+    pcmpgtw m3, m7
+    pcmpgtw m4, m5, m1
+    pcmpgtw m1, m5
+    packsswb m2, m4
+    packsswb m3, m1
+    pand m2, [pb_1]
+    por m2, m3
+
+    movu m3, [r2]
+
+    paddb m3, m2
+    paddb m3, [pb_2]
+
+    movu m4, [r3]
+    pshufb m4, m3
+
+    psubb m3, m0, m2
+    movu [r1], m3
+
+    pmovsxbw m3, m4
+    punpckhbw m4, m4
+    psraw m4, 8
+
+    paddw m7, m3
+    paddw m5, m4
+    pmaxsw m7, m0
+    pmaxsw m5, m0
+    pminsw m7, [pw_pixel_max]
+    pminsw m5, [pw_pixel_max]
+    movu [r0], m7
+    movu [r0 + 16], m5
+
+    add r0, 32
+    add r1, 16
+    add r2, 16
+    sub r4, 16
+    jg .loop
+
+    movh [r0 + r4 * 2], m6
+    movhps [r1 + r4], m6
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE2, 5, 6, 8, rec, bufft, buff1, offsetEo, lcuWidth
    mov r4d, r4m
    mov r5d, r5m
@@ -456,8 +949,58 @@
    movh [r0 + r4], m5
    movhps [r1 + r4], m5
    RET
+%endif

INIT_YMM avx2
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE2, 6,6,7
+    mov r4d, r4m
+    add r5d, r5d
+    inc r1
+    movq xm4, [r0 + r4 * 2]
+    movhps xm4, [r1 + r4]
+    vbroadcasti128 m5, [r3]
+    mova m6, [pw_pixel_max]
+.loop
+    movu m1, [r0]
+    movu m3, [r0 + r5 + 2]
+
+    pcmpgtw m2, m1, m3
+    pcmpgtw m3, m1
+
+    packsswb m2, m3
+    vpermq m3, m2, 11011101b
+    vpermq m2, m2, 10001000b
+
+    pand xm2, [pb_1]
+    por xm2, xm3
+
+    movu xm3, [r2]
+
+    paddb xm3, xm2
+    paddb xm3, [pb_2]
+    pshufb xm0, xm5, xm3
+    pmovsxbw m3, xm0
+
+    pxor m0, m0
+    paddw m1, m3
+    pmaxsw m1, m0
+    pminsw m1, m6
+    movu [r0], m1
+
+    psubb xm0, xm2
+    movu [r1], xm0
+
+    add r0, 32
+    add r1, 16
+    add r2, 16
+    sub r4, 16
+    jg .loop
+
+    movq [r0 + r4 * 2], xm4
+    movhps [r1 + r4], xm4
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE2, 5, 6, 7, rec, bufft, buff1, offsetEo, lcuWidth
    mov r4d, r4m
    mov r5d, r5m
@@ -497,8 +1040,70 @@
    movq [r0 + r4], xm6
    movhps [r1 + r4], xm6
    RET
+%endif

INIT_YMM avx2
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE2_32, 6,6,8
+    mov r4d, r4m
+    add r5d, r5d
+    inc r1
+    movq xm4, [r0 + r4 * 2]
+    movhps xm4, [r1 + r4]
+    vbroadcasti128 m5, [r3]
+
+.loop
+    movu m1, [r0]
+    movu m7, [r0 + 32]
+    movu m3, [r0 + r5 + 2]
+    movu m6, [r0 + r5 + 34]
+
+    pcmpgtw m2, m1, m3
+    pcmpgtw m0, m7, m6
+    pcmpgtw m3, m1
+    pcmpgtw m6, m7
+
+    packsswb m2, m0
+    packsswb m3, m6
+    vpermq m3, m3, 11011000b
+    vpermq m2, m2, 11011000b
+
+    pand m2, [pb_1]
+    por m2, m3
+
+    movu m3, [r2]
+
+    paddb m3, m2
+    paddb m3, [pb_2]
+    pshufb m0, m5, m3
+
+    pmovsxbw m3, xm0
+    vextracti128 xm0, m0, 1
+    pmovsxbw m6, xm0
+
+    pxor m0, m0
+    paddw m1, m3
+    paddw m7, m6
+    pmaxsw m1, m0
+    pmaxsw m7, m0
+    pminsw m1, [pw_pixel_max]
+    pminsw m7, [pw_pixel_max]
+    movu [r0], m1
+    movu [r0 + 32], m7
+
+    psubb m0, m2
+    movu [r1], m0
+
+    add r0, 64
+    add r1, 32
+    add r2, 32
+    sub r4, 32
+    jg .loop
+
+    movq [r0 + r4 * 2], xm4
+    movhps [r1 + r4], xm4
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE2_32, 5, 6, 8, rec, bufft, buff1, offsetEo, lcuWidth
    mov r4d, r4m
    mov r5d, r5m
@@ -550,11 +1155,79 @@
    movq [r0 + r4], xm6
    movhps [r1 + r4], xm6
    RET
+%endif

;=======================================================================================================
;void saoCuOrgE3(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX)
;=======================================================================================================
INIT_XMM sse4
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE3, 4,6,8
+    add r3d, r3d
+    mov r4d, r4m
+    mov r5d, r5m
+
+    ; save latest 2 pixels for case startX=1 or left_endX=15
+    movh m6, [r0 + r5 * 2]
+    movhps m6, [r1 + r5 - 1]
+
+    ; move to startX+1
+    inc r4d
+    lea r0, [r0 + r4 * 2]               ; x = startX + 1
+    add r1, r4
+    sub r5d, r4d
+    pxor m0, m0
+
+.loop:
+    movu m7, [r0]
+    movu m5, [r0 + 16]
+    movu m3, [r0 + r3]
+    movu m1, [r0 + r3 + 16]
+
+    pcmpgtw m2, m7, m3
+    pcmpgtw m3, m7
+    pcmpgtw m4, m5, m1
+    pcmpgtw m1, m5
+    packsswb m2, m4
+    packsswb m3, m1
+    pand m2, [pb_1]
+    por m2, m3
+
+    movu m3, [r1]                       ; m3 = m_iUpBuff1
+
+    paddb m3, m2
+    paddb m3, [pb_2]                    ; m3 = uiEdgeType
+
+    movu m4, [r2]                       ; m4 = m_iOffsetEo
+    pshufb m4, m3
+
+    psubb m3, m0, m2
+    movu [r1 - 1], m3
+
+    pmovsxbw m3, m4
+    punpckhbw m4, m4
+    psraw m4, 8
+
+    paddw m7, m3
+    paddw m5, m4
+    pmaxsw m7, m0
+    pmaxsw m5, m0
+    pminsw m7, [pw_pixel_max]
+    pminsw m5, [pw_pixel_max]
+    movu [r0], m7
+    movu [r0 + 16], m5
+
+    add r0, 32
+    add r1, 16
+
+    sub r5, 16
+    jg .loop
+
+    ; restore last pixels (up to 2)
+    movh [r0 + r5 * 2], m6
+    movhps [r1 + r5 - 1], m6
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE3, 3,6,8
    mov r3d, r3m
    mov r4d, r4m
@@ -618,8 +1291,64 @@
    movh [r0 + r5], m7
    movhps [r1 + r5 - 1], m7
    RET
+%endif

INIT_YMM avx2
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE3, 4,6,6
+    add r3d, r3d
+    mov r4d, r4m
+    mov r5d, r5m
+
+    ; save latest 2 pixels for case startX=1 or left_endX=15
+    movq xm5, [r0 + r5 * 2]
+    movhps xm5, [r1 + r5 - 1]
+
+    ; move to startX+1
+    inc r4d
+    lea r0, [r0 + r4 * 2]               ; x = startX + 1
+    add r1, r4
+    sub r5d, r4d
+    movu xm4, [r2]
+
+.loop:
+    movu m1, [r0]
+    movu m0, [r0 + r3]
+
+    pcmpgtw m2, m1, m0
+    pcmpgtw m0, m1
+    packsswb m2, m0
+    vpermq m0, m2, 11011101b
+    vpermq m2, m2, 10001000b
+    pand m2, [pb_1]
+    por m2, m0
+
+    movu xm0, [r1]
+    paddb xm0, xm2
+    paddb xm0, [pb_2]
+
+    pshufb xm3, xm4, xm0
+    pmovsxbw m3, xm3
+
+    paddw m1, m3
+    pxor m0, m0
+    pmaxsw m1, m0
+    pminsw m1, [pw_pixel_max]
+    movu [r0], m1
+
+    psubb xm0, xm2
+    movu [r1 - 1], xm0
+
+    add r0, 32
+    add r1, 16
+    sub r5, 16
+    jg .loop
+
+    ; restore last pixels (up to 2)
+    movq [r0 + r5 * 2], xm5
+    movhps [r1 + r5 - 1], xm5
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE3, 3, 6, 8
    mov r3d, r3m
    mov r4d, r4m
@@ -680,8 +1409,76 @@
    movq [r0 + r5], xm7
    movhps [r1 + r5 - 1], xm7
    RET
+%endif

INIT_YMM avx2
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgE3_32, 3,6,8
+    add r3d, r3d
+    mov r4d, r4m
+    mov r5d, r5m
+
+    ; save latest 2 pixels for case startX=1 or left_endX=15
+    movq xm5, [r0 + r5 * 2]
+    movhps xm5, [r1 + r5 - 1]
+
+    ; move to startX+1
+    inc r4d
+    lea r0, [r0 + r4 * 2]               ; x = startX + 1
+    add r1, r4
+    sub r5d, r4d
+    vbroadcasti128 m4, [r2]
+
+.loop:
+    movu m1, [r0]
+    movu m7, [r0 + 32]
+    movu m0, [r0 + r3]
+    movu m6, [r0 + r3 + 32]
+
+    pcmpgtw m2, m1, m0
+    pcmpgtw m3, m7, m6
+    pcmpgtw m0, m1
+    pcmpgtw m6, m7
+
+    packsswb m2, m3
+    packsswb m0, m6
+    vpermq m2, m2, 11011000b
+    vpermq m0, m0, 11011000b
+    pand m2, [pb_1]
+    por m2, m0
+
+    movu m0, [r1]
+    paddb m0, m2
+    paddb m0, [pb_2]
+
+    pshufb m3, m4, m0
+    vextracti128 xm6, m3, 1
+    pmovsxbw m3, xm3
+    pmovsxbw m6, xm6
+
+    paddw m1, m3
+    paddw m7, m6
+    pxor m0, m0
+    pmaxsw m1, m0
+    pmaxsw m7, m0
+    pminsw m1, [pw_pixel_max]
+    pminsw m7, [pw_pixel_max]
+    movu [r0], m1
+    movu [r0 + 32], m7
+
+    psubb m0, m2
+    movu [r1 - 1], m0
+
+    add r0, 64
+    add r1, 32
+    sub r5, 32
+    jg .loop
+
+    ; restore last pixels (up to 2)
+    movq [r0 + r5 * 2], xm5
+    movhps [r1 + r5 - 1], xm5
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgE3_32, 3, 6, 8
    mov r3d, r3m
    mov r4d, r4m
@@ -746,11 +1543,62 @@
    movq [r0 + r5], xm7
    movhps [r1 + r5 - 1], xm7
    RET
+%endif

;=====================================================================================
; void saoCuOrgB0(pixel* rec, const pixel* offset, int lcuWidth, int lcuHeight, int stride)
;=====================================================================================
INIT_XMM sse4
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgB0, 5,7,8
+    add r4d, r4d
+
+    shr r2d, 4
+    movu m3, [r1]                       ; offset[0-15]
+    movu m4, [r1 + 16]                  ; offset[16-31]
+    pxor m7, m7
+
+.loopH
+    mov r5d, r2d
+    xor r6, r6
+
+.loopW
+    movu m2, [r0 + r6]
+    movu m5, [r0 + r6 + 16]
+    psrlw m0, m2, (BIT_DEPTH - 5)
+    psrlw m6, m5, (BIT_DEPTH - 5)
+    packuswb m0, m6
+    pand m0, [pb_31]                    ; m0 = [index]
+
+    pshufb m6, m3, m0
+    pshufb m1, m4, m0
+    pcmpgtb m0, [pb_15]                 ; m0 = [mask]
+
+    pblendvb m6, m6, m1, m0             ; NOTE: don't use 3 parameters style, x264 macro have some bug!
+
+    pmovsxbw m0, m6                     ; offset
+    punpckhbw m6, m6
+    psraw m6, 8
+
+    paddw m2, m0
+    paddw m5, m6
+    pmaxsw m2, m7
+    pmaxsw m5, m7
+    pminsw m2, [pw_pixel_max]
+    pminsw m5, [pw_pixel_max]
+
+    movu [r0 + r6], m2
+    movu [r0 + r6 + 16], m5
+    add r6d, 32
+    dec r5d
+    jnz .loopW
+
+    lea r0, [r0 + r4]
+
+    dec r3d
+    jnz .loopH
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgB0, 4, 7, 8

    mov r3d, r3m
@@ -796,8 +1644,92 @@
    dec r3d
    jnz .loopH
    RET
+%endif

INIT_YMM avx2
+%if HIGH_BIT_DEPTH
+cglobal saoCuOrgB0, 5,7,8
+    vbroadcasti128 m3, [r1]
+    vbroadcasti128 m4, [r1 + 16]
+    add r4d, r4d
+    lea r1, [r4 * 2]
+    sub r1d, r2d
+    sub r1d, r2d
+    shr r2d, 4
+    mova m7, [pw_pixel_max]
+
+    mov r6d, r3d
+    shr r3d, 1
+
+.loopH
+    mov r5d, r2d
+.loopW
+    movu m2, [r0]
+    movu m5, [r0 + r4]
+    psrlw m0, m2, (BIT_DEPTH - 5)
+    psrlw m6, m5, (BIT_DEPTH - 5)
+    packuswb m0, m6
+    vpermq m0, m0, 11011000b
+    pand m0, [pb_31]                    ; m0 = [index]
+
+    pshufb m6, m3, m0
+    pshufb m1, m4, m0
+    pcmpgtb m0, [pb_15]                 ; m0 = [mask]
+
+    pblendvb m6, m6, m1, m0             ; NOTE: don't use 3 parameters style, x264 macro have some bug!
+
+    pmovsxbw m0, xm6
+    vextracti128 xm6, m6, 1
+    pmovsxbw m6, xm6
+
+    paddw m2, m0
+    paddw m5, m6
+    pxor m1, m1
+    pmaxsw m2, m1
+    pmaxsw m5, m1
+    pminsw m2, m7
+    pminsw m5, m7
+
+    movu [r0], m2
+    movu [r0 + r4], m5
+
+    add r0, 32
+    dec r5d
+    jnz .loopW
+
+    add r0, r1
+    dec r3d
+    jnz .loopH
+
+    test r6b, 1
+    jz .end
+    xor r1, r1
+.loopW1:
+    movu m2, [r0 + r1]
+    psrlw m0, m2, (BIT_DEPTH - 5)
+    packuswb m0, m0
+    vpermq m0, m0, 10001000b
+    pand m0, [pb_31]                    ; m0 = [index]
+
+    pshufb m6, m3, m0
+    pshufb m1, m4, m0
+    pcmpgtb m0, [pb_15]                 ; m0 = [mask]
+
+    pblendvb m6, m6, m1, m0             ; NOTE: don't use 3 parameters style, x264 macro have some bug!
+    pmovsxbw m0, xm6                    ; offset
+
+    paddw m2, m0
+    pxor m0, m0
+    pmaxsw m2, m0
+    pminsw m2, m7
+
+    movu [r0 + r1], m2
+    add r1d, 32
+    dec r2d
+    jnz .loopW1
+.end:
+    RET
+%else ; HIGH_BIT_DEPTH
cglobal saoCuOrgB0, 4, 7, 8

    mov r3d, r3m
@@ -872,11 +1804,54 @@
    jnz .loopW1
.end
    RET
+%endif

;============================================================================================================
; void calSign(int8_t *dst, const Pixel *src1, const Pixel *src2, const int width)
;============================================================================================================
INIT_XMM sse4
+%if HIGH_BIT_DEPTH
+cglobal calSign, 4, 7, 5
+    mova m0, [pw_1]
+    mov r4d, r3d
+    shr r3d, 4
+    add r3d, 1
+    mov r5, r0
+    movu m4, [r0 + r4]
+.loop
+    movu m1, [r1]                       ; m2 = pRec[x]
+    movu m2, [r2]                       ; m3 = pTmpU[x]
+
+    pcmpgtw m3, m1, m2
+    pcmpgtw m2, m1
+    pand m3, m0
+    por m3, m2
+    packsswb m3, m3
+    movh [r0], xm3
+
+    movu m1, [r1 + 16]                  ; m2 = pRec[x]
+    movu m2, [r2 + 16]                  ; m3 = pTmpU[x]
+
+    pcmpgtw m3, m1, m2
+    pcmpgtw m2, m1
+    pand m3, m0
+    por m3, m2
+    packsswb m3, m3
+    movh [r0 + 8], xm3
+
+    add r0, 16
+    add r1, 32
+    add r2, 32
+    dec r3d
+    jnz .loop
+
+    mov r6, r0
+    sub r6, r5
+    sub r4, r6
+    movu [r0 + r4], m4
+    RET
+%else ; HIGH_BIT_DEPTH
+
cglobal calSign, 4,5,6
    mova m0, [pb_128]
    mova m1, [pb_1]
@@ -925,8 +1900,44 @@
.end:
    RET
+%endif

INIT_YMM avx2
+%if HIGH_BIT_DEPTH
+cglobal calSign, 4, 7, 5
+    mova m0, [pw_1]
+    mov r4d, r3d
+    shr r3d, 4
+    add r3d, 1
+    mov r5, r0
+    movu m4, [r0 + r4]
+
+.loop
+    movu m1, [r1]                       ; m2 = pRec[x]
+    movu m2, [r2]                       ; m3 = pTmpU[x]
+
+    pcmpgtw m3, m1, m2
+    pcmpgtw m2, m1
+
+    pand m3, m0
+    por m3, m2
+    packsswb m3, m3
+    vpermq m3, m3, q3220
+    movu [r0 ], xm3
+
+    add r0, 16
+    add r1, 32
+    add r2, 32
+    dec r3d
+    jnz .loop
+
+    mov r6, r0
+    sub r6, r5
+    sub r4, r6
+    movu [r0 + r4], m4
+    RET
+%else ; HIGH_BIT_DEPTH
+
cglobal calSign, 4, 5, 6
    vbroadcasti128 m0, [pb_128]
    mova m1, [pb_1]
@@ -975,3 +1986,296 @@
.end:
    RET
+%endif
+
+;--------------------------------------------------------------------------------------------------------------------------
+; saoCuStatsBO_c(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count)
+;--------------------------------------------------------------------------------------------------------------------------
+%if ARCH_X86_64
+INIT_XMM sse4
+cglobal saoCuStatsBO, 7,12,6
+    mova m3, [hmul_16p + 16]
+    mova m4, [pb_124]
+    mova m5, [pb_4]
+    xor r7d, r7d
+
+.loopH:
+    mov r10, r0
+    mov r11, r1
+    mov r9d, r3d
+.loopL:
+    movu m1, [r11]
+    movu m0, [r10]
+
+    punpckhbw m2, m0, m1
+    punpcklbw m0, m1
+    psrlw m1, 1                         ; rec[x] >> boShift
+    pmaddubsw m2, m3
+    pmaddubsw m0, m3
+    pand m1, m4
+    paddb m1, m5
+
+%assign x 0
+%rep 16
+    pextrb r7d, m1, x
+
+%if (x < 8)
+    pextrw r8d, m0, (x % 8)
+%else
+    pextrw r8d, m2, (x % 8)
+%endif
+    movsx r8d, r8w
+    inc dword [r6 + r7]                 ; count[classIdx]++
+    add [r5 + r7], r8d                  ; stats[classIdx] += (fenc[x] - rec[x]);
+    dec r9d
+    jz .next
+%assign x x+1
+%endrep
+
+    add r10, 16
+    add r11, 16
+    jmp .loopL
+
+.next:
+    add r0, r2
+    add r1, r2
+    dec r4d
+    jnz .loopH
+    RET
+%endif
+
+;-----------------------------------------------------------------------------------------------------------------------
+; saoCuStatsE0(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count)
+;-----------------------------------------------------------------------------------------------------------------------
+%if ARCH_X86_64
+INIT_XMM sse4
+cglobal saoCuStatsE0, 5,9,8, 0-32
+    mov r3d, r3m
+    mov r8, r5mp
+
+    ; clear internal temporary buffer
+    pxor m0, m0
+    mova [rsp], m0
+    mova [rsp + mmsize], m0
+    mova m4, [pb_128]
+    mova m5, [hmul_16p + 16]
+    mova m6, [pb_2]
+    xor r7d, r7d
+
+.loopH:
+    mov r5d, r3d
+
+    ; calculate signLeft
+    mov r7b, [r1]
+    sub r7b, [r1 - 1]
+    seta r7b
+    setb r6b
+    sub r7b, r6b
+    neg r7b
+    pinsrb m0, r7d, 15
+
+.loopL:
+    movu m7, [r1]
+    movu m2, [r1 + 1]
+
+    pxor m1, m7, m4
+    pxor m3, m2, m4
+    pcmpgtb m2, m1, m3
+    pcmpgtb m3, m1
+    pand m2, [pb_1]
+    por m2, m3                          ; signRight
+
+    palignr m3, m2, m0, 15
+    psignb m3, m4                       ; signLeft
+
+    mova m0, m2
+    paddb m2, m3
+    paddb m2, m6                        ; edgeType
+
+    ; stats[edgeType]
+    movu m3, [r0]                       ; fenc[0-15]
+    punpckhbw m1, m3, m7
+    punpcklbw m3, m7
+    pmaddubsw m1, m5
+    pmaddubsw m3, m5
+
+%assign x 0
+%rep 16
+    pextrb r7d, m2, x
+
+%if (x < 8)
+    pextrw r6d, m3, (x % 8)
+%else
+    pextrw r6d, m1, (x % 8)
+%endif
+    movsx r6d, r6w
+    inc word [rsp + r7 * 2]             ; tmp_count[edgeType]++
+    add [rsp + 5 * 2 + r7 * 4], r6d     ; tmp_stats[edgeType] += (fenc[x] - rec[x])
+    dec r5d
+    jz .next
+%assign x x+1
+%endrep
+
+    add r0q, 16
+    add r1q, 16
+    jmp .loopL
+
+.next:
+    mov r6d, r3d
+    and r6d, 15
+
+    sub r6, r3
+    add r6, r2
+    add r0, r6
+    add r1, r6
+
+    dec r4d
+    jnz .loopH
+
+    ; sum to global buffer
+    mov r0, r6mp
+
+    ; s_eoTable = {1, 2, 0, 3, 4}
+    movzx r5d, word [rsp + 0 * 2]
+    add [r0 + 1 * 4], r5d
+    movzx r6d, word [rsp + 1 * 2]
+    add [r0 + 2 * 4], r6d
+    movzx r5d, word [rsp + 2 * 2]
+    add [r0 + 0 * 4], r5d
+    movzx r6d, word [rsp + 3 * 2]
+    add [r0 + 3 * 4], r6d
+    movzx r5d, word [rsp + 4 * 2]
+    add [r0 + 4 * 4], r5d
+
+    mov r6d, [rsp + 5 * 2 + 0 * 4]
+    add [r8 + 1 * 4], r6d
+    mov r5d, [rsp + 5 * 2 + 1 * 4]
+    add [r8 + 2 * 4], r5d
+    mov r6d, [rsp + 5 * 2 + 2 * 4]
+    add [r8 + 0 * 4], r6d
+    mov r5d, [rsp + 5 * 2 + 3 * 4]
+    add [r8 + 3 * 4], r5d
+    mov r6d, [rsp + 5 * 2 + 4 * 4]
+    add [r8 + 4 * 4], r6d
+    RET
+%endif
+
+;-------------------------------------------------------------------------------------------------------------------------------------------
+; saoCuStatsE1_c(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count)
+;-------------------------------------------------------------------------------------------------------------------------------------------
+%if ARCH_X86_64
+INIT_XMM sse4
+cglobal saoCuStatsE1, 4,12,9,0-32    ; Stack: 5 of stats and 5 of count
+    mov r5d, r5m
+    mov r4d, r4m
+    mov r11d, r5d
+
+    ; clear internal temporary buffer
+    pxor m0, m0
+    mova [rsp], m0
+    mova [rsp + mmsize], m0
+    mova m0, [pb_128]
+    mova m5, [pb_1]
+    mova m6, [pb_2]
+    mova m8, [hmul_16p + 16]
+    movh m7, [r3 + r4]
+
+.loopH:
+    mov r6d, r4d
+    mov r9, r0
+    mov r10, r1
+    mov r11, r3
+
+.loopW:
+    movu m1, [r10]
+    movu m2, [r10 + r2]
+
+    ; signDown
+    pxor m1, m0
+    pxor m2, m0
+    pcmpgtb m3, m1, m2
+    pand m3, m5
+    pcmpgtb m2, m1
+    por m2, m3
+    pxor m3, m3
+    psubb m3, m2                        ; -signDown
+
+    ; edgeType
+    movu m4, [r11]
+    paddb m4, m6
+    paddb m2, m4
+
+    ; update upBuff1
+    movu [r11], m3
+
+    ; stats[edgeType]
+    pxor m1, m0
+    movu m3, [r9]
+    punpckhbw m4, m3, m1
+    punpcklbw m3, m1
+    pmaddubsw m3, m8
+    pmaddubsw m4, m8
+
+    ; 16 pixels
+%assign x 0
+%rep 16
+    pextrb r7d, m2, x
+    inc word [rsp + r7 * 2]
+
+    %if (x < 8)
+    pextrw r8d, m3, (x % 8)
+    %else
+    pextrw r8d, m4, (x % 8)
+    %endif
+    movsx r8d, r8w
+    add [rsp + 5 * 2 + r7 * 4], r8d
+
+    dec r6d
+    jz .next
+%assign x x+1
+%endrep
+
+    add r9, 16
+    add r10, 16
+    add r11, 16
+    jmp .loopW
+
+.next:
+    ; restore pointer upBuff1
+    add r0, r2
+    add r1, r2
+
+    dec r5d
+    jg .loopH
+
+    ; restore unavailable pixels
+    movh [r3 + r4], m7
+
+    ; sum to global buffer
+    mov r1, r6m
+    mov r0, r7m
+
+    ; s_eoTable = {1,2,0,3,4}
+    movzx r6d, word [rsp + 0 * 2]
+    add [r0 + 1 * 4], r6d
+    movzx r6d, word [rsp + 1 * 2]
+    add [r0 + 2 * 4], r6d
+    movzx r6d, word [rsp + 2 * 2]
+    add [r0 + 0 * 4], r6d
+    movzx r6d, word [rsp + 3 * 2]
+    add [r0 + 3 * 4], r6d
+    movzx r6d, word [rsp + 4 * 2]
+    add [r0 + 4 * 4], r6d
+
+    mov r6d, [rsp + 5 * 2 + 0 * 4]
+    add [r1 + 1 * 4], r6d
+    mov r6d, [rsp + 5 * 2 + 1 * 4]
+    add [r1 + 2 * 4], r6d
+    mov r6d, [rsp + 5 * 2 + 2 * 4]
+    add [r1 + 0 * 4], r6d
+    mov r6d, [rsp + 5 * 2 + 3 * 4]
+    add [r1 + 3 * 4], r6d
+    mov r6d, [rsp + 5 * 2 + 4 * 4]
+    add [r1 + 4 * 4], r6d
+    RET
+%endif ; ARCH_X86_64
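The new saoCuStats kernels above move SAO's analysis pass (not just the filtering pass) into assembly. What they vectorize is a simple per-pixel classification; a scalar C reference of the saoCuStatsE0-style logic, reconstructed from the asm's own comments (including the s_eoTable = {1, 2, 0, 3, 4} remap), looks roughly like the sketch below. Function and parameter names here are illustrative, not x265's exported API, and the caller is assumed to guarantee that rec[-1] and rec[endX] are valid neighbors:

    #include <stdint.h>

    /* three-way sign: -1, 0 or +1 */
    static inline int sign3(int z) { return (z > 0) - (z < 0); }

    /* remaps raw edge types 0..4 to the category order used by rate control */
    static const int s_eoTable[5] = { 1, 2, 0, 3, 4 };

    void sao_stats_e0_row(const uint8_t* fenc, const uint8_t* rec, int endX,
                          int32_t stats[5], int32_t count[5])
    {
        for (int x = 0; x < endX; x++)
        {
            /* edgeType = signLeft + signRight + 2, in the range 0..4 */
            int edgeType = sign3(rec[x] - rec[x - 1])
                         + sign3(rec[x] - rec[x + 1]) + 2;
            stats[s_eoTable[edgeType]] += fenc[x] - rec[x];  /* distortion sum */
            count[s_eoTable[edgeType]]++;                    /* population    */
        }
    }

The SSE4 version processes 16 pixels per iteration: the two sign3() terms become pcmpgtb pairs, the signLeft chain is carried across chunks with palignr, and the per-class scatter (which SIMD cannot do directly) is unrolled with %rep 16 and pextrb/pextrw.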
View file
x265_1.7.tar.gz/source/common/x86/loopfilter.h -> x265_1.8.tar.gz/source/common/x86/loopfilter.h
Changed
@@ -25,21 +25,24 @@
#ifndef X265_LOOPFILTER_H
#define X265_LOOPFILTER_H

-void x265_saoCuOrgE0_sse4(pixel * rec, int8_t * offsetEo, int endX, int8_t* signLeft, intptr_t stride);
-void x265_saoCuOrgE0_avx2(pixel * rec, int8_t * offsetEo, int endX, int8_t* signLeft, intptr_t stride);
-void x265_saoCuOrgE1_sse4(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
-void x265_saoCuOrgE1_avx2(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
-void x265_saoCuOrgE1_2Rows_sse4(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
-void x265_saoCuOrgE1_2Rows_avx2(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width);
-void x265_saoCuOrgE2_sse4(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
-void x265_saoCuOrgE2_avx2(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
-void x265_saoCuOrgE2_32_avx2(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride);
-void x265_saoCuOrgE3_sse4(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX);
-void x265_saoCuOrgE3_avx2(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX);
-void x265_saoCuOrgE3_32_avx2(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX);
-void x265_saoCuOrgB0_sse4(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride);
-void x265_saoCuOrgB0_avx2(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride);
-void x265_calSign_sse4(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
-void x265_calSign_avx2(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
+#define DECL_SAO(cpu) \
+    void PFX(saoCuOrgE0_ ## cpu)(pixel * rec, int8_t * offsetEo, int endX, int8_t* signLeft, intptr_t stride); \
+    void PFX(saoCuOrgE1_ ## cpu)(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width); \
+    void PFX(saoCuOrgE1_2Rows_ ## cpu)(pixel* rec, int8_t* upBuff1, int8_t* offsetEo, intptr_t stride, int width); \
+    void PFX(saoCuOrgE2_ ## cpu)(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride); \
+    void PFX(saoCuOrgE2_ ## cpu)(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride); \
+    void PFX(saoCuOrgE2_32_ ## cpu)(pixel* rec, int8_t* pBufft, int8_t* pBuff1, int8_t* offsetEo, int lcuWidth, intptr_t stride); \
+    void PFX(saoCuOrgE3_ ## cpu)(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX); \
+    void PFX(saoCuOrgE3_32_ ## cpu)(pixel *rec, int8_t *upBuff1, int8_t *m_offsetEo, intptr_t stride, int startX, int endX); \
+    void PFX(saoCuOrgB0_ ## cpu)(pixel* rec, const int8_t* offsetBo, int ctuWidth, int ctuHeight, intptr_t stride); \
+    void PFX(saoCuStatsBO_ ## cpu)(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count); \
+    void PFX(saoCuStatsE0_ ## cpu)(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count); \
+    void PFX(saoCuStatsE1_ ## cpu)(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count); \
+    void PFX(saoCuStatsE2_ ## cpu)(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int8_t *upBufft, int endX, int endY, int32_t *stats, int32_t *count); \
+    void PFX(saoCuStatsE3_ ## cpu)(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count); \
+    void PFX(calSign_ ## cpu)(int8_t *dst, const pixel *src1, const pixel *src2, const int endX);
+
+DECL_SAO(sse4);
+DECL_SAO(avx2);

#endif // ifndef X265_LOOPFILTER_H
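Besides collapsing the declaration list into DECL_SAO, this header change routes every exported symbol name through PFX instead of hard-coding the x265_ prefix. The mechanism is plain token pasting; a minimal, self-contained C sketch of the idiom follows. DEMO_NS and the helper CAT macros are assumed stand-in names: in x265 the real namespace token comes from the build configuration (with x265 as the default, so the declared symbols stay identical to 1.7's):

    #include <stdio.h>

    #define DEMO_NS x265                    /* assumed default namespace token */
    #define DEMO_CAT2(a, b) a ## _ ## b
    #define DEMO_CAT(a, b) DEMO_CAT2(a, b)  /* extra level forces expansion   */
    #define PFX(name) DEMO_CAT(DEMO_NS, name)

    /* Expands to: void x265_saoCuOrgE0_sse4(void); */
    void PFX(saoCuOrgE0_sse4)(void);

    int main(void)
    {
        puts("PFX(saoCuOrgE0_sse4) expanded to x265_saoCuOrgE0_sse4");
        return 0;
    }

Inside DECL_SAO the inner paste runs first (saoCuOrgE0_ ## cpu becomes saoCuOrgE0_sse4), then PFX prepends the namespace; making the prefix a single macro is presumably what lets differently-configured builds of the library coexist without symbol clashes.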
View file
x265_1.7.tar.gz/source/common/x86/mc-a.asm -> x265_1.8.tar.gz/source/common/x86/mc-a.asm
Changed
@@ -32,6 +32,19 @@
%include "x86inc.asm"
%include "x86util.asm"

+%if BIT_DEPTH==8
+    %define ADDAVG_FACTOR 256
+    %define ADDAVG_ROUND 128
+%elif BIT_DEPTH==10
+    %define ADDAVG_FACTOR 1024
+    %define ADDAVG_ROUND 512
+%elif BIT_DEPTH==12
+    %define ADDAVG_FACTOR 4096
+    %define ADDAVG_ROUND 2048
+%else
+    %error Unsupport bit depth!
+%endif
+
SECTION_RODATA 32

ch_shuf: times 2 db 0,2,2,4,4,6,6,8,1,3,3,5,5,7,7,9
@@ -54,11 +67,12 @@
cextern pw_512
cextern pw_1023
cextern pw_1024
+cextern pw_2048
+cextern pw_4096
cextern pw_00ff
cextern pw_pixel_max
-cextern sw_64
cextern pd_32
-cextern deinterleave_shufd
+cextern pd_64

;====================================================================================================================
;void addAvg (int16_t* src0, int16_t* src1, pixel* dst, intptr_t src0Stride, intptr_t src1Stride, intptr_t dstStride)
;====================================================================================================================
@@ -93,23 +107,24 @@
    punpcklqdq m1, m2
    punpcklqdq m3, m5
    paddw m1, m3
-    pmulhrsw m1, [pw_1024]
-    paddw m1, [pw_512]
+    pmulhrsw m1, [pw_ %+ ADDAVG_FACTOR]
+    paddw m1, [pw_ %+ ADDAVG_ROUND]
    pxor m0, m0
    pmaxsw m1, m0
-    pminsw m1, [pw_1023]
+    pminsw m1, [pw_pixel_max]
    movd [r2], m1
    pextrd [r2 + r5], m1, 1
    lea r2, [r2 + 2 * r5]
    pextrd [r2], m1, 2
    pextrd [r2 + r5], m1, 3
-    RET
+
+
;-----------------------------------------------------------------------------
INIT_XMM sse4
cglobal addAvg_2x8, 6,6,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m0, [pw_512]
+    mova m0, [pw_ %+ ADDAVG_ROUND]
    pxor m7, m7
    add r3, r3
    add r4, r4
@@ -137,11 +152,11 @@
    punpcklqdq m1, m2
    punpcklqdq m3, m5
    paddw m1, m3
-    pmulhrsw m1, [pw_1024]
+    pmulhrsw m1, [pw_ %+ ADDAVG_FACTOR]
    paddw m1, m0

    pmaxsw m1, m7
-    pminsw m1, [pw_1023]
+    pminsw m1, [pw_pixel_max]
    movd [r2], m1
    pextrd [r2 + r5], m1, 1
    lea r2, [r2 + 2 * r5]
@@ -157,8 +172,8 @@
;-----------------------------------------------------------------------------
INIT_XMM sse4
cglobal addAvg_2x16, 6,7,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m6, [pw_1023]
-    mova m7, [pw_1024]
+    mova m6, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    mov r6d, 16/4
    add r3, r3
    add r4, r4
@@ -184,7 +199,7 @@
    punpcklqdq m3, m5
    paddw m1, m3
    pmulhrsw m1, m7
-    paddw m1, [pw_512]
+    paddw m1, [pw_ %+ ADDAVG_ROUND]
    pxor m0, m0
    pmaxsw m1, m0
    pminsw m1, m6
@@ -214,21 +229,21 @@
    punpcklqdq m0, m1
    punpcklqdq m2, m3
    paddw m0, m2
-    pmulhrsw m0, [pw_1024]
-    paddw m0, [pw_512]
+    pmulhrsw m0, [pw_ %+ ADDAVG_FACTOR]
+    paddw m0, [pw_ %+ ADDAVG_ROUND]
    pxor m6, m6
    pmaxsw m0, m6
-    pminsw m0, [pw_1023]
+    pminsw m0, [pw_pixel_max]
    movh [r2], m0
    movhps [r2 + r5], m0
    RET

;-----------------------------------------------------------------------------
INIT_XMM sse4
cglobal addAvg_6x8, 6,6,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m4, [pw_512]
-    mova m5, [pw_1023]
-    mova m7, [pw_1024]
+    mova m4, [pw_ %+ ADDAVG_ROUND]
+    mova m5, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    pxor m6, m6
    add r3, r3
    add r4, r4
@@ -265,9 +280,9 @@
;-----------------------------------------------------------------------------
INIT_XMM sse4
cglobal addAvg_6x16, 6,7,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m4, [pw_512]
-    mova m5, [pw_1023]
-    mova m7, [pw_1024]
+    mova m4, [pw_ %+ ADDAVG_ROUND]
+    mova m5, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    pxor m6, m6
    mov r6d, 16/2
    add r3, r3
@@ -301,9 +316,9 @@
;-----------------------------------------------------------------------------
INIT_XMM sse4
cglobal addAvg_8x2, 6,6,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m4, [pw_512]
-    mova m5, [pw_1023]
-    mova m7, [pw_1024]
+    mova m4, [pw_ %+ ADDAVG_ROUND]
+    mova m5, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    pxor m6, m6
    add r3, r3
    add r4, r4
@@ -332,9 +347,9 @@
;-----------------------------------------------------------------------------
INIT_XMM sse4
cglobal addAvg_8x6, 6,6,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m4, [pw_512]
-    mova m5, [pw_1023]
-    mova m7, [pw_1024]
+    mova m4, [pw_ %+ ADDAVG_ROUND]
+    mova m5, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    pxor m6, m6
    add r3, r3
    add r4, r4
@@ -371,9 +386,9 @@
%macro ADDAVG_W4_H4 1
INIT_XMM sse4
cglobal addAvg_4x%1, 6,7,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m4, [pw_512]
-    mova m5, [pw_1023]
-    mova m7, [pw_1024]
+    mova m4, [pw_ %+ ADDAVG_ROUND]
+    mova m5, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    pxor m6, m6
    add r3, r3
    add r4, r4
@@ -421,9 +436,9 @@
%macro ADDAVG_W8_H4 1
INIT_XMM sse4
cglobal addAvg_8x%1, 6,7,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m4, [pw_512]
-    mova m5, [pw_1023]
-    mova m7, [pw_1024]
+    mova m4, [pw_ %+ ADDAVG_ROUND]
+    mova m5, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    pxor m6, m6
    add r3, r3
    add r4, r4
@@ -471,9 +486,9 @@
%macro ADDAVG_W12_H4 1
INIT_XMM sse4
cglobal addAvg_12x%1, 6,7,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m4, [pw_512]
-    mova m5, [pw_1023]
-    mova m7, [pw_1024]
+    mova m4, [pw_ %+ ADDAVG_ROUND]
+    mova m5, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    pxor m6, m6
    add r3, r3
    add r4, r4
@@ -533,9 +548,9 @@
%macro ADDAVG_W16_H4 1
INIT_XMM sse4
cglobal addAvg_16x%1, 6,7,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m4, [pw_512]
-    mova m5, [pw_1023]
-    mova m7, [pw_1024]
+    mova m4, [pw_ %+ ADDAVG_ROUND]
+    mova m5, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    pxor m6, m6
    add r3, r3
    add r4, r4
@@ -602,9 +617,9 @@
%macro ADDAVG_W24_H2 2
INIT_XMM sse4
cglobal addAvg_%1x%2, 6,7,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m4, [pw_512]
-    mova m5, [pw_1023]
-    mova m7, [pw_1024]
+    mova m4, [pw_ %+ ADDAVG_ROUND]
+    mova m5, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    pxor m6, m6
    add r3, r3
    add r4, r4
@@ -684,9 +699,9 @@
%macro ADDAVG_W32_H2 1
INIT_XMM sse4
cglobal addAvg_32x%1, 6,7,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m4, [pw_512]
-    mova m5, [pw_1023]
-    mova m7, [pw_1024]
+    mova m4, [pw_ %+ ADDAVG_ROUND]
+    mova m5, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    pxor m6, m6
    add r3, r3
    add r4, r4
@@ -788,9 +803,9 @@
%macro ADDAVG_W48_H2 1
INIT_XMM sse4
cglobal addAvg_48x%1, 6,7,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m4, [pw_512]
-    mova m5, [pw_1023]
-    mova m7, [pw_1024]
+    mova m4, [pw_ %+ ADDAVG_ROUND]
+    mova m5, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    pxor m6, m6
    add r3, r3
    add r4, r4
@@ -922,9 +937,9 @@
%macro ADDAVG_W64_H1 1
INIT_XMM sse4
cglobal addAvg_64x%1, 6,7,8, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
-    mova m4, [pw_512]
-    mova m5, [pw_1023]
-    mova m7, [pw_1024]
+    mova m4, [pw_ %+ ADDAVG_ROUND]
+    mova m5, [pw_pixel_max]
+    mova m7, [pw_ %+ ADDAVG_FACTOR]
    pxor m6, m6
    add r3, r3
    add r4, r4
@@ -1017,6 +1032,629 @@
ADDAVG_W64_H1 32
ADDAVG_W64_H1 48
ADDAVG_W64_H1 64
+
+;------------------------------------------------------------------------------
+; avx2 asm for addAvg high_bit_depth
+;------------------------------------------------------------------------------
+INIT_YMM avx2
+cglobal addAvg_8x2, 6,6,2, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride
+    movu xm0, [r0]
+    vinserti128 m0, m0, [r0 + r3 * 2], 1
+    movu
xm1, [r1] + vinserti128 m1, m1, [r1 + r4 * 2], 1 + + paddw m0, m1 + pxor m1, m1 + pmulhrsw m0, [pw_ %+ ADDAVG_FACTOR] + paddw m0, [pw_ %+ ADDAVG_ROUND] + pmaxsw m0, m1 + pminsw m0, [pw_pixel_max] + vextracti128 xm1, m0, 1 + movu [r2], xm0 + movu [r2 + r5 * 2], xm1 + RET + +cglobal addAvg_8x6, 6,6,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride + mova m4, [pw_ %+ ADDAVG_ROUND] + mova m5, [pw_pixel_max] + mova m3, [pw_ %+ ADDAVG_FACTOR] + pxor m1, m1 + add r3d, r3d + add r4d, r4d + add r5d, r5d + + movu xm0, [r0] + vinserti128 m0, m0, [r0 + r3], 1 + movu xm2, [r1] + vinserti128 m2, m2, [r1 + r4], 1 + + paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + vextracti128 xm2, m0, 1 + movu [r2], xm0 + movu [r2 + r5], xm2 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + movu xm0, [r0] + vinserti128 m0, m0, [r0 + r3], 1 + movu xm2, [r1] + vinserti128 m2, m2, [r1 + r4], 1 + + paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + vextracti128 xm2, m0, 1 + movu [r2], xm0 + movu [r2 + r5], xm2 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + movu xm0, [r0] + vinserti128 m0, m0, [r0 + r3], 1 + movu xm2, [r1] + vinserti128 m2, m2, [r1 + r4], 1 + + paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + vextracti128 xm2, m0, 1 + movu [r2], xm0 + movu [r2 + r5], xm2 + RET + +%macro ADDAVG_W8_H4_AVX2 1 +cglobal addAvg_8x%1, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride + mova m4, [pw_ %+ ADDAVG_ROUND] + mova m5, [pw_pixel_max] + mova m3, [pw_ %+ ADDAVG_FACTOR] + pxor m1, m1 + add r3d, r3d + add r4d, r4d + add r5d, r5d + mov r6d, %1/4 + +.loop: + movu m0, [r0] + vinserti128 m0, m0, [r0 + r3], 1 + movu m2, [r1] + vinserti128 m2, m2, [r1 + r4], 1 + + paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + vextracti128 xm2, m0, 1 + movu [r2], xm0 + movu [r2 + r5], xm2 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + movu m0, [r0] + vinserti128 m0, m0, [r0 + r3], 1 + movu m2, [r1] + vinserti128 m2, m2, [r1 + r4], 1 + + paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + vextracti128 xm2, m0, 1 + movu [r2], xm0 + movu [r2 + r5], xm2 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + dec r6d + jnz .loop + RET +%endmacro + +ADDAVG_W8_H4_AVX2 4 +ADDAVG_W8_H4_AVX2 8 +ADDAVG_W8_H4_AVX2 12 +ADDAVG_W8_H4_AVX2 16 +ADDAVG_W8_H4_AVX2 32 +ADDAVG_W8_H4_AVX2 64 + +cglobal addAvg_12x16, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride + mova m4, [pw_ %+ ADDAVG_ROUND] + mova m5, [pw_pixel_max] + mova m3, [pw_ %+ ADDAVG_FACTOR] + pxor m1, m1 + add r3, r3 + add r4, r4 + add r5, r5 + mov r6d, 4 + +.loop: +%rep 2 + movu m0, [r0] + movu m2, [r1] + paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + vextracti128 xm2, m0, 1 + movu [r2], xm0 + movq [r2 + 16], xm2 + + movu m0, [r0 + r3] + movu m2, [r1 + r4] + paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + vextracti128 xm2, m0, 1 + movu [r2 + r5], xm0 + movq [r2 + r5 + 16], xm2 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] +%endrep + dec r6d + jnz .loop + RET + +cglobal addAvg_12x32, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride + mova m4, [pw_ %+ ADDAVG_ROUND] + mova m5, [pw_pixel_max] + paddw m3, m4, m4 + pxor m1, m1 + add r3, r3 + add r4, r4 + add r5, r5 + mov r6d, 8 + +.loop: +%rep 2 + movu m0, [r0] + movu m2, [r1] + 
paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + vextracti128 xm2, m0, 1 + movu [r2], xm0 + movq [r2 + 16], xm2 + + movu m0, [r0 + r3] + movu m2, [r1 + r4] + paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + vextracti128 xm2, m0, 1 + movu [r2 + r5], xm0 + movq [r2 + r5 + 16], xm2 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] +%endrep + dec r6d + jnz .loop + RET + +%macro ADDAVG_W16_H4_AVX2 1 +cglobal addAvg_16x%1, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride + mova m4, [pw_ %+ ADDAVG_ROUND] + mova m5, [pw_pixel_max] + mova m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + mov r6d, %1/4 + +.loop: +%rep 2 + movu m0, [r0] + movu m1, [r1] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2], m0 + + movu m0, [r0 + r3] + movu m1, [r1 + r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5], m0 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] +%endrep + dec r6d + jnz .loop + RET +%endmacro + +ADDAVG_W16_H4_AVX2 4 +ADDAVG_W16_H4_AVX2 8 +ADDAVG_W16_H4_AVX2 12 +ADDAVG_W16_H4_AVX2 16 +ADDAVG_W16_H4_AVX2 24 +ADDAVG_W16_H4_AVX2 32 +ADDAVG_W16_H4_AVX2 64 + +cglobal addAvg_24x32, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride + mova m4, [pw_ %+ ADDAVG_ROUND] + mova m5, [pw_pixel_max] + mova m3, [pw_ %+ ADDAVG_FACTOR] + pxor m1, m1 + add r3, r3 + add r4, r4 + add r5, r5 + + mov r6d, 16 + +.loop: + movu m0, [r0] + movu m2, [r1] + paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + movu [r2], m0 + + movu xm0, [r0 + 32] + movu xm2, [r1 + 32] + paddw xm0, xm2 + pmulhrsw xm0, xm3 + paddw xm0, xm4 + pmaxsw xm0, xm1 + pminsw xm0, xm5 + movu [r2 + 32], xm0 + + movu m0, [r0 + r3] + movu m2, [r1 + r4] + paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + movu [r2 + r5], m0 + + movu xm2, [r0 + r3 + 32] + movu xm0, [r1 + r4 + 32] + paddw xm2, xm0 + pmulhrsw xm2, xm3 + paddw xm2, xm4 + pmaxsw xm2, xm1 + pminsw xm2, xm5 + movu [r2 + r5 + 32], xm2 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + dec r6d + jnz .loop + RET + +cglobal addAvg_24x64, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride + mova m4, [pw_ %+ ADDAVG_ROUND] + mova m5, [pw_pixel_max] + paddw m3, m4, m4 + pxor m1, m1 + add r3, r3 + add r4, r4 + add r5, r5 + + mov r6d, 32 + +.loop: + movu m0, [r0] + movu m2, [r1] + paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + movu [r2], m0 + + movu xm0, [r0 + 32] + movu xm2, [r1 + 32] + paddw xm0, xm2 + pmulhrsw xm0, xm3 + paddw xm0, xm4 + pmaxsw xm0, xm1 + pminsw xm0, xm5 + movu [r2 + 32], xm0 + + movu m0, [r0 + r3] + movu m2, [r1 + r4] + paddw m0, m2 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m1 + pminsw m0, m5 + movu [r2 + r5], m0 + + movu xm2, [r0 + r3 + 32] + movu xm0, [r1 + r4 + 32] + paddw xm2, xm0 + pmulhrsw xm2, xm3 + paddw xm2, xm4 + pmaxsw xm2, xm1 + pminsw xm2, xm5 + movu [r2 + r5 + 32], xm2 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + dec r6d + jnz .loop + RET + +%macro ADDAVG_W32_H2_AVX2 1 +cglobal addAvg_32x%1, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride + mova m4, [pw_ %+ ADDAVG_ROUND] + mova m5, [pw_pixel_max] + mova m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + + mov r6d, %1/2 + +.loop: + movu m0, [r0] + movu m1, [r1] + paddw m0, m1 + 
pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2], m0 + + movu m0, [r0 + 32] + movu m1, [r1 + 32] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 32], m0 + + movu m0, [r0 + r3] + movu m1, [r1 + r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5], m0 + + movu m0, [r0 + r3 + 32] + movu m1, [r1 + r4 + 32] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5 + 32], m0 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + dec r6d + jnz .loop + RET +%endmacro + +ADDAVG_W32_H2_AVX2 8 +ADDAVG_W32_H2_AVX2 16 +ADDAVG_W32_H2_AVX2 24 +ADDAVG_W32_H2_AVX2 32 +ADDAVG_W32_H2_AVX2 48 +ADDAVG_W32_H2_AVX2 64 + +cglobal addAvg_48x64, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride + mova m4, [pw_ %+ ADDAVG_ROUND] + mova m5, [pw_pixel_max] + mova m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3, r3 + add r4, r4 + add r5, r5 + + mov r6d, 32 + +.loop: + movu m0, [r0] + movu m1, [r1] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2], m0 + + movu m0, [r0 + 32] + movu m1, [r1 + 32] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 32], m0 + + movu m0, [r0 + 64] + movu m1, [r1 + 64] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 64], m0 + + movu m0, [r0 + r3] + movu m1, [r1 + r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5], m0 + + movu m0, [r0 + r3 + 32] + movu m1, [r1 + r4 + 32] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5 + 32], m0 + + movu m0, [r0 + r3 + 64] + movu m1, [r1 + r4 + 64] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5 + 64], m0 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] + lea r1, [r1 + 2 * r4] + + dec r6d + jnz .loop + RET + +%macro ADDAVG_W64_H1_AVX2 1 +cglobal addAvg_64x%1, 6,7,6, pSrc0, pSrc1, pDst, iStride0, iStride1, iDstStride + mova m4, [pw_ %+ ADDAVG_ROUND] + mova m5, [pw_pixel_max] + mova m3, [pw_ %+ ADDAVG_FACTOR] + pxor m2, m2 + add r3d, r3d + add r4d, r4d + add r5d, r5d + + mov r6d, %1/2 + +.loop: + movu m0, [r0] + movu m1, [r1] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2], m0 + + movu m0, [r0 + 32] + movu m1, [r1 + 32] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 32], m0 + + movu m0, [r0 + 64] + movu m1, [r1 + 64] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 64], m0 + + movu m0, [r0 + 96] + movu m1, [r1 + 96] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + 96], m0 + + movu m0, [r0 + r3] + movu m1, [r1 + r4] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5], m0 + + movu m0, [r0 + r3 + 32] + movu m1, [r1 + r4 + 32] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5 + 32], m0 + + movu m0, [r0 + r3 + 64] + movu m1, [r1 + r4 + 64] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5 + 64], m0 + + movu m0, [r0 + r3 + 96] + movu m1, [r1 + r4 + 96] + paddw m0, m1 + pmulhrsw m0, m3 + paddw m0, m4 + pmaxsw m0, m2 + pminsw m0, m5 + movu [r2 + r5 + 96], m0 + + lea r2, [r2 + 2 * r5] + lea r0, [r0 + 2 * r3] 
+ lea r1, [r1 + 2 * r4] + + dec r6d + jnz .loop + RET +%endmacro + +ADDAVG_W64_H1_AVX2 16 +ADDAVG_W64_H1_AVX2 32 +ADDAVG_W64_H1_AVX2 48 +ADDAVG_W64_H1_AVX2 64 ;----------------------------------------------------------------------------- %else ; !HIGH_BIT_DEPTH ;----------------------------------------------------------------------------- @@ -3387,6 +4025,87 @@ AVG_END %endmacro +%macro pixel_avg_W8 0 + movu m0, [r2] + movu m1, [r4] + pavgw m0, m1 + movu [r0], m0 + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m2, m3 + movu [r0 + r1], m2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + pavgw m0, m1 + movu [r0 + r1 * 2], m0 + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m2, m3 + movu [r0 + r8], m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] +%endmacro + +;------------------------------------------------------------------------------------------------------------------------------- +;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;------------------------------------------------------------------------------------------------------------------------------- +%if HIGH_BIT_DEPTH +%if ARCH_X86_64 +INIT_XMM sse2 +cglobal pixel_avg_8x4, 6,9,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + pixel_avg_W8 + RET + +cglobal pixel_avg_8x8, 6,9,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + pixel_avg_W8 + pixel_avg_W8 + RET + +cglobal pixel_avg_8x16, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 4 +.loop + pixel_avg_W8 + dec r9d + jnz .loop + RET + +cglobal pixel_avg_8x32, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 8 +.loop + pixel_avg_W8 + dec r9d + jnz .loop + RET +%endif +%endif + %if HIGH_BIT_DEPTH INIT_MMX mmx2 @@ -3438,11 +4157,6 @@ AVGH 4, 4 AVGH 4, 2 -AVG_FUNC 8, movdqu, movdqa -AVGH 8, 32 -AVGH 8, 16 -AVGH 8, 8 -AVGH 8, 4 AVG_FUNC 16, movdqu, movdqa AVGH 16, 64 @@ -3586,24 +4300,12 @@ AVGH 4, 8 AVGH 4, 4 AVGH 4, 2 + INIT_XMM avx2 ; TODO: active AVX2 after debug ;AVG_FUNC 24, movdqu, movdqa ;AVGH 24, 32 -AVG_FUNC 64, movdqu, movdqa -AVGH 64, 64 -AVGH 64, 48 -AVGH 64, 32 -AVGH 64, 16 - -AVG_FUNC 32, movdqu, movdqa -AVGH 32, 64 -AVGH 32, 32 -AVGH 32, 24 -AVGH 32, 16 -AVGH 32, 8 - AVG_FUNC 16, movdqu, movdqa AVGH 16, 64 AVGH 16, 32 @@ -3614,7 +4316,109 @@ %endif ;HIGH_BIT_DEPTH +;------------------------------------------------------------------------------------------------------------------------------- +;void pixelavg_pp(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;------------------------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 && BIT_DEPTH == 8 +INIT_YMM avx2 +cglobal pixel_avg_8x32 +%rep 4 + movu m0, [r2] + movu m2, [r2 + r3] + movu m1, [r4] + movu m3, [r4 + r5] + pavgb m0, m1 + pavgb m2, m3 + movu [r0], m0 + movu [r0 + r1], m2 + + lea r2, [r2 + r3 * 2] + lea r4, [r4 + r5 * 2] + lea r0, [r0 + r1 * 2] +%endrep + ret +cglobal pixel_avg_16x64_8bit +%rep 8 + movu m0, [r2] + movu m2, [r2 + mmsize] + movu m1, [r4] + movu m3, [r4 + mmsize] + pavgb m0, m1 + pavgb m2, m3 + movu [r0], m0 + movu [r0 + mmsize], m2 + + movu m0, [r2 + r3] + movu m2, [r2 + r3 + mmsize] + movu m1, [r4 
+ r5] + movu m3, [r4 + r5 + mmsize] + pavgb m0, m1 + pavgb m2, m3 + movu [r0 + r1], m0 + movu [r0 + r1 + mmsize], m2 + + lea r2, [r2 + r3 * 2] + lea r4, [r4 + r5 * 2] + lea r0, [r0 + r1 * 2] +%endrep + ret + +cglobal pixel_avg_32x8, 6,6,4 + call pixel_avg_8x32 + RET + +cglobal pixel_avg_32x16, 6,6,4 + call pixel_avg_8x32 + call pixel_avg_8x32 + RET + +cglobal pixel_avg_32x24, 6,6,4 + call pixel_avg_8x32 + call pixel_avg_8x32 + call pixel_avg_8x32 + RET + +cglobal pixel_avg_32x32, 6,6,4 + call pixel_avg_8x32 + call pixel_avg_8x32 + call pixel_avg_8x32 + call pixel_avg_8x32 + RET + +cglobal pixel_avg_32x64, 6,6,4 + call pixel_avg_8x32 + call pixel_avg_8x32 + call pixel_avg_8x32 + call pixel_avg_8x32 + call pixel_avg_8x32 + call pixel_avg_8x32 + call pixel_avg_8x32 + call pixel_avg_8x32 + RET + +cglobal pixel_avg_64x16, 6,6,4 + call pixel_avg_16x64_8bit + RET + +cglobal pixel_avg_64x32, 6,6,4 + call pixel_avg_16x64_8bit + call pixel_avg_16x64_8bit + RET + +cglobal pixel_avg_64x48, 6,6,4 + call pixel_avg_16x64_8bit + call pixel_avg_16x64_8bit + call pixel_avg_16x64_8bit + RET + +cglobal pixel_avg_64x64, 6,6,4 + call pixel_avg_16x64_8bit + call pixel_avg_16x64_8bit + call pixel_avg_16x64_8bit + call pixel_avg_16x64_8bit + RET +%endif ;============================================================================= ; pixel avg2 @@ -3817,6 +4621,590 @@ INIT_YMM avx2 PIXEL_AVG_W18 +;------------------------------------------------------------------------------------------------------------------------------- +;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;------------------------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_avg_12x16, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 4 + +.loop + movu m0, [r2] + movu m1, [r4] + pavgw m0, m1 + movu [r0], xm0 + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m2, m3 + movu [r0 + r1], xm2 + + vextracti128 xm0, m0, 1 + vextracti128 xm2, m2, 1 + movq [r0 + 16], xm0 + movq [r0 + r1 + 16], xm2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + pavgw m0, m1 + movu [r0 + r1 * 2], xm0 + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m2, m3 + movu [r0 + r8], xm2 + + vextracti128 xm0, m0, 1 + vextracti128 xm2, m2, 1 + movq [r0 + r1 * 2 + 16], xm0 + movq [r0 + r8 + 16], xm2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] + dec r9d + jnz .loop + RET +%endif + +%macro pixel_avg_H4 0 + movu m0, [r2] + movu m1, [r4] + pavgw m0, m1 + movu [r0], m0 + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m2, m3 + movu [r0 + r1], m2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + pavgw m0, m1 + movu [r0 + r1 * 2], m0 + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m2, m3 + movu [r0 + r8], m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] +%endmacro + +;------------------------------------------------------------------------------------------------------------------------------- +;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;------------------------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_avg_16x4, 6,9,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea 
r7, [r5 * 3] + lea r8, [r1 * 3] + pixel_avg_H4 + RET + +cglobal pixel_avg_16x8, 6,9,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + pixel_avg_H4 + pixel_avg_H4 + RET + +cglobal pixel_avg_16x12, 6,9,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + pixel_avg_H4 + pixel_avg_H4 + pixel_avg_H4 + RET +%endif + +%macro pixel_avg_H16 0 + movu m0, [r2] + movu m1, [r4] + pavgw m0, m1 + movu [r0], m0 + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m2, m3 + movu [r0 + r1], m2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + pavgw m0, m1 + movu [r0 + r1 * 2], m0 + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m2, m3 + movu [r0 + r8], m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] +%endmacro + +;------------------------------------------------------------------------------------------------------------------------------- +;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;------------------------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_avg_16x16, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 4 +.loop + pixel_avg_H16 + dec r9d + jnz .loop + RET + +cglobal pixel_avg_16x32, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 4 +.loop + pixel_avg_H16 + pixel_avg_H16 + dec r9d + jnz .loop + RET + +cglobal pixel_avg_16x64, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 4 +.loop + pixel_avg_H16 + pixel_avg_H16 + pixel_avg_H16 + pixel_avg_H16 + dec r9d + jnz .loop + RET +%endif + +;------------------------------------------------------------------------------------------------------------------------------- +;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;------------------------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_avg_24x32, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 8 + +.loop + movu m0, [r2] + movu m1, [r4] + pavgw m0, m1 + movu [r0], m0 + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m2, m3 + movu [r0 + r1], m2 + + movu xm0, [r2 + 32] + movu xm1, [r4 + 32] + pavgw xm0, xm1 + movu [r0 + 32], xm0 + movu xm2, [r2 + r3 + 32] + movu xm3, [r4 + r5 + 32] + pavgw xm2, xm3 + movu [r0 + r1 + 32], xm2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + pavgw m0, m1 + movu [r0 + r1 * 2], m0 + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m2, m3 + movu [r0 + r8], m2 + + movu xm0, [r2 + r3 * 2 + 32] + movu xm1, [r4 + r5 * 2 + 32] + pavgw xm0, xm1 + movu [r0 + r1 * 2 + 32], xm0 + movu xm2, [r2 + r6 + 32] + movu xm3, [r4 + r7 + 32] + pavgw xm2, xm3 + movu [r0 + r8 + 32], xm2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] + dec r9d + jnz .loop + RET +%endif + +%macro pixel_avg_W32 0 + movu m0, [r2] + movu m1, [r4] + pavgw m0, m1 + movu [r0], m0 + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m2, m3 + movu [r0 + r1], m2 + + movu m0, [r2 + 32] + movu m1, [r4 + 32] + pavgw m0, m1 
+ movu [r0 + 32], m0 + movu m2, [r2 + r3 + 32] + movu m3, [r4 + r5 + 32] + pavgw m2, m3 + movu [r0 + r1 + 32], m2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + pavgw m0, m1 + movu [r0 + r1 * 2], m0 + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m2, m3 + movu [r0 + r8], m2 + + movu m0, [r2 + r3 * 2 + 32] + movu m1, [r4 + r5 * 2 + 32] + pavgw m0, m1 + movu [r0 + r1 * 2 + 32], m0 + movu m2, [r2 + r6 + 32] + movu m3, [r4 + r7 + 32] + pavgw m2, m3 + movu [r0 + r8 + 32], m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] +%endmacro + +;------------------------------------------------------------------------------------------------------------------------------- +;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;------------------------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_avg_32x8, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 2 +.loop + pixel_avg_W32 + dec r9d + jnz .loop + RET + +cglobal pixel_avg_32x16, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 4 +.loop + pixel_avg_W32 + dec r9d + jnz .loop + RET + +cglobal pixel_avg_32x24, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 6 +.loop + pixel_avg_W32 + dec r9d + jnz .loop + RET + +cglobal pixel_avg_32x32, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 8 +.loop + pixel_avg_W32 + dec r9d + jnz .loop + RET + +cglobal pixel_avg_32x64, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 16 +.loop + pixel_avg_W32 + dec r9d + jnz .loop + RET +%endif + +%macro pixel_avg_W64 0 + movu m0, [r2] + movu m1, [r4] + pavgw m0, m1 + movu [r0], m0 + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m2, m3 + movu [r0 + r1], m2 + + movu m0, [r2 + 32] + movu m1, [r4 + 32] + pavgw m0, m1 + movu [r0 + 32], m0 + movu m2, [r2 + r3 + 32] + movu m3, [r4 + r5 + 32] + pavgw m2, m3 + movu [r0 + r1 + 32], m2 + + movu m0, [r2 + 64] + movu m1, [r4 + 64] + pavgw m0, m1 + movu [r0 + 64], m0 + movu m2, [r2 + r3 + 64] + movu m3, [r4 + r5 + 64] + pavgw m2, m3 + movu [r0 + r1 + 64], m2 + + movu m0, [r2 + 96] + movu m1, [r4 + 96] + pavgw m0, m1 + movu [r0 + 96], m0 + movu m2, [r2 + r3 + 96] + movu m3, [r4 + r5 + 96] + pavgw m2, m3 + movu [r0 + r1 + 96], m2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + pavgw m0, m1 + movu [r0 + r1 * 2], m0 + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m2, m3 + movu [r0 + r8], m2 + + movu m0, [r2 + r3 * 2 + 32] + movu m1, [r4 + r5 * 2 + 32] + pavgw m0, m1 + movu [r0 + r1 * 2 + 32], m0 + movu m2, [r2 + r6 + 32] + movu m3, [r4 + r7 + 32] + pavgw m2, m3 + movu [r0 + r8 + 32], m2 + + movu m0, [r2 + r3 * 2 + 64] + movu m1, [r4 + r5 * 2 + 64] + pavgw m0, m1 + movu [r0 + r1 * 2 + 64], m0 + movu m2, [r2 + r6 + 64] + movu m3, [r4 + r7 + 64] + pavgw m2, m3 + movu [r0 + r8 + 64], m2 + + movu m0, [r2 + r3 * 2 + 96] + movu m1, [r4 + r5 * 2 + 96] + pavgw m0, m1 + movu [r0 + r1 * 2 + 96], m0 + movu m2, [r2 + r6 + 96] + movu m3, [r4 + r7 + 96] + pavgw m2, m3 + movu [r0 + r8 + 96], m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] +%endmacro + 
+;------------------------------------------------------------------------------------------------------------------------------- +;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;------------------------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_avg_64x16, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 4 +.loop + pixel_avg_W64 + dec r9d + jnz .loop + RET + +cglobal pixel_avg_64x32, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 8 +.loop + pixel_avg_W64 + dec r9d + jnz .loop + RET + +cglobal pixel_avg_64x48, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 12 +.loop + pixel_avg_W64 + dec r9d + jnz .loop + RET + +cglobal pixel_avg_64x64, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 16 +.loop + pixel_avg_W64 + dec r9d + jnz .loop + RET +%endif + +;------------------------------------------------------------------------------------------------------------------------------- +;void pixelavg_pp(pixel dst, intptr_t dstride, const pixel src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int) +;------------------------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal pixel_avg_48x64, 6,10,4 + add r1d, r1d + add r3d, r3d + add r5d, r5d + lea r6, [r3 * 3] + lea r7, [r5 * 3] + lea r8, [r1 * 3] + mov r9d, 16 + +.loop + movu m0, [r2] + movu m1, [r4] + pavgw m0, m1 + movu [r0], m0 + movu m2, [r2 + r3] + movu m3, [r4 + r5] + pavgw m2, m3 + movu [r0 + r1], m2 + + movu m0, [r2 + 32] + movu m1, [r4 + 32] + pavgw m0, m1 + movu [r0 + 32], m0 + movu m2, [r2 + r3 + 32] + movu m3, [r4 + r5 + 32] + pavgw m2, m3 + movu [r0 + r1 + 32], m2 + + movu m0, [r2 + 64] + movu m1, [r4 + 64] + pavgw m0, m1 + movu [r0 + 64], m0 + movu m2, [r2 + r3 + 64] + movu m3, [r4 + r5 + 64] + pavgw m2, m3 + movu [r0 + r1 + 64], m2 + + movu m0, [r2 + r3 * 2] + movu m1, [r4 + r5 * 2] + pavgw m0, m1 + movu [r0 + r1 * 2], m0 + movu m2, [r2 + r6] + movu m3, [r4 + r7] + pavgw m2, m3 + movu [r0 + r8], m2 + + movu m0, [r2 + r3 * 2 + 32] + movu m1, [r4 + r5 * 2 + 32] + pavgw m0, m1 + movu [r0 + r1 * 2 + 32], m0 + movu m2, [r2 + r6 + 32] + movu m3, [r4 + r7 + 32] + pavgw m2, m3 + movu [r0 + r8 + 32], m2 + + movu m0, [r2 + r3 * 2 + 64] + movu m1, [r4 + r5 * 2 + 64] + pavgw m0, m1 + movu [r0 + r1 * 2 + 64], m0 + movu m2, [r2 + r6 + 64] + movu m3, [r4 + r7 + 64] + pavgw m2, m3 + movu [r0 + r8 + 64], m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + lea r4, [r4 + 4 * r5] + dec r9d + jnz .loop + RET +%endif + %endif ; HIGH_BIT_DEPTH %if HIGH_BIT_DEPTH == 0 @@ -3973,7 +5361,7 @@ %macro INIT_SHIFT 2 and eax, 7 shl eax, 3 - movd %1, [sw_64] + movd %1, [pd_64] movd %2, eax psubw %1, %2 %endmacro
View file
x265_1.7.tar.gz/source/common/x86/mc-a2.asm -> x265_1.8.tar.gz/source/common/x86/mc-a2.asm
Changed
@@ -692,7 +692,7 @@
 %endmacro

 %macro FILT32x4U 4
-    mova      m1, [r0+r5]
+    movu      m1, [r0+r5]
     pavgb     m0, m1, [r0]
     movu      m3, [r0+r5+1]
     pavgb     m2, m3, [r0+1]
@@ -701,7 +701,7 @@
     pavgb     m0, m2
     pavgb     m1, m3

-    mova      m3, [r0+r5+mmsize]
+    movu      m3, [r0+r5+mmsize]
     pavgb     m2, m3, [r0+mmsize]
     movu      m5, [r0+r5+1+mmsize]
     pavgb     m4, m5, [r0+1+mmsize]
@@ -722,10 +722,10 @@
     vpermq    m1, m4, q3120
     vpermq    m2, m2, q3120
     vpermq    m3, m5, q3120
-    mova      [%1], m0
-    mova      [%2], m1
-    mova      [%3], m2
-    mova      [%4], m3
+    movu      [%1], m0
+    movu      [%2], m1
+    movu      [%3], m2
+    movu      [%4], m3
 %endmacro

 %macro FILT16x2 4
@@ -796,8 +796,8 @@
 %endmacro

 %macro FILT8xA 4
-    mova      m3, [r0+%4+mmsize]
-    mova      m2, [r0+%4]
+    movu      m3, [r0+%4+mmsize]
+    movu      m2, [r0+%4]
     pavgw     m3, [r0+%4+r5+mmsize]
     pavgw     m2, [r0+%4+r5]
     PALIGNR   %1, m3, 2, m6
@@ -815,9 +815,13 @@
     packssdw  m3, %1
     packssdw  m5, m4
 %endif
-    mova      [%2], m3
-    mova      [%3], m5
-    mova      %1, m2
+%if cpuflag(avx2)
+    vpermq    m3, m3, q3120
+    vpermq    m5, m5, q3120
+%endif
+    movu      [%2], m3
+    movu      [%3], m5
+    movu      %1, m2
 %endmacro

 ;-----------------------------------------------------------------------------
@@ -871,8 +875,8 @@
 .vloop:
     mov       r6d, r7m
 %ifnidn cpuname, mmx2
-    mova      m0, [r0]
-    mova      m1, [r0+r5]
+    movu      m0, [r0]
+    movu      m1, [r0+r5]
     pavgw     m0, m1
     pavgw     m1, [r0+r5*2]
 %endif
@@ -977,7 +981,7 @@
 FRAME_INIT_LOWRES
 INIT_XMM xop
 FRAME_INIT_LOWRES
-%if HIGH_BIT_DEPTH==0
+%if ARCH_X86_64 == 1
 INIT_YMM avx2
 FRAME_INIT_LOWRES
 %endif
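These FILT*/FRAME_INIT_LOWRES hunks change memory access, not arithmetic: aligned mova loads and stores are relaxed to movu (the 32-byte AVX2 accesses into the lowres planes are not guaranteed to be aligned), the AVX2 store path gains a vpermq lane fix-up, and the AVX2 build of FRAME_INIT_LOWRES is now gated on x86-64 rather than on bit depth. The filter itself stays the four-sample rounding average used to build the half-resolution lookahead planes; below is a scalar sketch of the 8-bit case, written from memory of the x264-style C fallback, so treat the details as indicative.

    #include <stdint.h>

    /* Each lowres plane is a half-resolution image at one of four
     * half-pel phases. FILTER mirrors the chained pavgb averages;
     * HIGH_BIT_DEPTH builds do the same with 16-bit pixels and pavgw. */
    #define FILTER(a, b, c, d) ((((a + b + 1) >> 1) + ((c + d + 1) >> 1) + 1) >> 1)

    static void lowres_core_sketch(const uint8_t *src0, uint8_t *dst0, uint8_t *dsth,
                                   uint8_t *dstv, uint8_t *dstc,
                                   intptr_t src_stride, intptr_t dst_stride,
                                   int width, int height)
    {
        for (int y = 0; y < height; y++)
        {
            const uint8_t *src1 = src0 + src_stride;
            const uint8_t *src2 = src1 + src_stride;
            for (int x = 0; x < width; x++)
            {
                dst0[x] = FILTER(src0[2*x],   src1[2*x],   src0[2*x+1], src1[2*x+1]);
                dsth[x] = FILTER(src0[2*x+1], src1[2*x+1], src0[2*x+2], src1[2*x+2]);
                dstv[x] = FILTER(src1[2*x],   src2[2*x],   src1[2*x+1], src2[2*x+1]);
                dstc[x] = FILTER(src1[2*x+1], src2[2*x+1], src1[2*x+2], src2[2*x+2]);
            }
            src0 += src_stride * 2;
            dst0 += dst_stride; dsth += dst_stride;
            dstv += dst_stride; dstc += dst_stride;
        }
    }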
View file
x265_1.7.tar.gz/source/common/x86/mc.h -> x265_1.8.tar.gz/source/common/x86/mc.h
Changed
@@ -25,45 +25,15 @@
 #define X265_MC_H

 #define LOWRES(cpu) \
-    void x265_frame_init_lowres_core_ ## cpu(const pixel* src0, pixel* dst0, pixel* dsth, pixel* dstv, pixel* dstc, \
+    void PFX(frame_init_lowres_core_ ## cpu)(const pixel* src0, pixel* dst0, pixel* dsth, pixel* dstv, pixel* dstc, \
                                              intptr_t src_stride, intptr_t dst_stride, int width, int height);
 LOWRES(mmx2)
 LOWRES(sse2)
 LOWRES(ssse3)
 LOWRES(avx)
+LOWRES(avx2)
 LOWRES(xop)

-#define DECL_SUF(func, args) \
-    void func ## _mmx2 args; \
-    void func ## _sse2 args; \
-    void func ## _ssse3 args;
-DECL_SUF(x265_pixel_avg_64x64, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_64x48, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_64x32, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_64x16, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_48x64, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_32x64, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_32x32, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_32x24, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_32x16, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_32x8, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_24x32, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_16x64, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_16x32, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_16x16, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_16x12, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_16x8, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_16x4, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_12x16, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_8x32, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_8x16, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_8x8, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_8x4, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_4x16, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_4x8, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-DECL_SUF(x265_pixel_avg_4x4, (pixel*, intptr_t, const pixel*, intptr_t, const pixel*, intptr_t, int))
-
 #undef LOWRES
-#undef DECL_SUF

 #endif // ifndef X265_MC_H
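mc.h now routes its exports through PFX() instead of spelling out the x265_ prefix, declares the new AVX2 lowres kernel, and drops the DECL_SUF block (the pixel_avg_* prototypes are generated elsewhere in 1.8). PFX is x265's token-pasting namespace macro; its real definition lives in the project's common headers, but it behaves approximately like this stand-in:

    /* Approximation of x265's PFX() namespacing. Prefixing every
     * exported symbol with a configurable namespace is what allows,
     * for example, differently configured builds of the same kernels
     * to coexist in one binary. */
    #define X265_NS x265                          /* the default namespace */
    #define PFX3(prefix, name) prefix ## _ ## name
    #define PFX2(prefix, name) PFX3(prefix, name) /* extra level so X265_NS expands */
    #define PFX(name) PFX2(X265_NS, name)

    void PFX(demo_kernel)(void);                  /* declares x265_demo_kernel */

So PFX(frame_init_lowres_core_avx2) still expands to x265_frame_init_lowres_core_avx2 in a default build; the indirection just makes the prefix configurable.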
View file
x265_1.7.tar.gz/source/common/x86/pixel-a.asm -> x265_1.8.tar.gz/source/common/x86/pixel-a.asm
Changed
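The biggest change in the pixel-a.asm diff below is Main12 support in the SATD/SA8D kernels: when BIT_DEPTH == 12 the 16-bit Hadamard partial sums are widened to dwords (punpcklwd/punpckhwd plus paddd) and the final reductions use HADDD instead of HADDUW, while the many hand-written movhlps/pshufd/paddd tails are folded into the same HADDD macro. The diff also reworks upShift_8 so the last row is finished by reprocessing one overlapping vector that ends exactly at the row width, rather than a scalar tail ladder (see the "Mac OS X can't read beyond array bound" comments). The overflow arithmetic that forces the SATD widening fits in a few lines of C:

    #include <assert.h>
    #include <stdint.h>

    /* A 2-D 4x4 Hadamard has worst-case gain 16, and a 12-bit residual
     * spans +/-4095, so one coefficient can already touch the top of
     * the uint16_t range; accumulating in 16-bit lanes (paddw) wraps. */
    int main(void)
    {
        const int max_residual = (1 << 12) - 1;        /* 4095  */
        const int max_coeff    = 4 * 4 * max_residual; /* 65520 */
        assert(max_coeff <= UINT16_MAX);               /* one coefficient just fits */
        assert(2 * max_coeff > UINT16_MAX);            /* a sum of two already wraps */
        return 0;
    }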
@@ -9,6 +9,7 @@ ;* Alex Izvorski <aizvorksi@gmail.com> ;* Fiona Glaser <fiona@x264.com> ;* Oskar Arvidsson <oskar@irock.se> +;* Min Chen <chenm003@163.com> ;* ;* This program is free software; you can redistribute it and/or modify ;* it under the terms of the GNU General Public License as published by @@ -32,8 +33,6 @@ %include "x86util.asm" SECTION_RODATA 32 -hmul_16p: times 16 db 1 - times 8 db 1, -1 hmul_8p: times 8 db 1 times 4 db 1, -1 times 8 db 1 @@ -45,8 +44,7 @@ times 2 dw 1, -1 times 4 dw 1 times 2 dw 1, -1 -ALIGN 32 -hmul_w: times 2 dw 1, -1, 1, -1, 1, -1, 1, -1 + ALIGN 32 transd_shuf1: SHUFFLE_MASK_W 0, 8, 2, 10, 4, 12, 6, 14 transd_shuf2: SHUFFLE_MASK_W 1, 9, 3, 11, 5, 13, 7, 15 @@ -54,8 +52,6 @@ sw_f0: dq 0xfff0, 0 pd_f0: times 4 dd 0xffff0000 -pw_76543210: dw 0, 1, 2, 3, 4, 5, 6, 7 - SECTION .text cextern pb_0 @@ -72,6 +68,9 @@ cextern pd_1 cextern popcnt_table cextern pd_2 +cextern hmul_16p +cextern pb_movemask +cextern pw_pixel_max ;============================================================================= ; SATD @@ -242,6 +241,12 @@ %endif HADAMARD4_2D 4, 5, 6, 7, 3, %%n paddw m4, m6 +;%if HIGH_BIT_DEPTH && (BIT_DEPTH == 12) +; pxor m5, m5 +; punpcklwd m6, m4, m5 +; punpckhwd m4, m5 +; paddd m4, m6 +;%endif SWAP %%n, 4 %endmacro @@ -257,15 +262,45 @@ HADAMARD 1, max, %2, %4, %6, %7 %endif %ifnidn %9, swap + %if (BIT_DEPTH == 12) + pxor m%6, m%6 + punpcklwd m%7, m%2, m%6 + punpckhwd m%2, m%6 + paddd m%8, m%7 + paddd m%8, m%2 + %else paddw m%8, m%2 + %endif %else SWAP %8, %2 + %if (BIT_DEPTH == 12) + pxor m%6, m%6 + punpcklwd m%7, m%8, m%6 + punpckhwd m%8, m%6 + paddd m%8, m%7 + %endif %endif %if %1 + %if (BIT_DEPTH == 12) + pxor m%6, m%6 + punpcklwd m%7, m%4, m%6 + punpckhwd m%4, m%6 + paddd m%8, m%7 + paddd m%8, m%4 + %else paddw m%8, m%4 + %endif %else HADAMARD 1, max, %3, %5, %6, %7 + %if (BIT_DEPTH == 12) + pxor m%6, m%6 + punpcklwd m%7, m%3, m%6 + punpckhwd m%3, m%6 + paddd m%8, m%7 + paddd m%8, m%3 + %else paddw m%8, m%3 + %endif %endif %endmacro @@ -281,29 +316,23 @@ %endif pxor m%10, m%10 - mova m%9, m%2 - punpcklwd m%9, m%10 + punpcklwd m%9, m%2, m%10 paddd m%8, m%9 - mova m%9, m%2 - punpckhwd m%9, m%10 + punpckhwd m%9, m%2, m%10 paddd m%8, m%9 %if %1 pxor m%10, m%10 - mova m%9, m%4 - punpcklwd m%9, m%10 + punpcklwd m%9, m%4, m%10 paddd m%8, m%9 - mova m%9, m%4 - punpckhwd m%9, m%10 + punpckhwd m%9, m%4, m%10 paddd m%8, m%9 %else HADAMARD 1, max, %3, %5, %6, %7 pxor m%10, m%10 - mova m%9, m%3 - punpcklwd m%9, m%10 + punpcklwd m%9, m%3, m%10 paddd m%8, m%9 - mova m%9, m%3 - punpckhwd m%9, m%10 + punpckhwd m%9, m%3, m%10 paddd m%8, m%9 %endif %endmacro @@ -326,6 +355,7 @@ movd eax, m0 and eax, 0xffff %endif ; HIGH_BIT_DEPTH + EMMS RET %endmacro @@ -336,136 +366,10 @@ ; int pixel_satd_16x16( uint8_t *, intptr_t, uint8_t *, intptr_t ) ;----------------------------------------------------------------------------- INIT_MMX mmx2 -cglobal pixel_satd_16x4_internal - SATD_4x4_MMX m2, 0, 0 - SATD_4x4_MMX m1, 4, 0 - paddw m0, m2 - SATD_4x4_MMX m2, 8, 0 - paddw m0, m1 - SATD_4x4_MMX m1, 12, 0 - paddw m0, m2 - paddw m0, m1 - ret - -cglobal pixel_satd_8x8_internal - SATD_4x4_MMX m2, 0, 0 - SATD_4x4_MMX m1, 4, 1 - paddw m0, m2 - paddw m0, m1 -pixel_satd_8x4_internal_mmx2: - SATD_4x4_MMX m2, 0, 0 - SATD_4x4_MMX m1, 4, 0 - paddw m0, m2 - paddw m0, m1 - ret - -%if HIGH_BIT_DEPTH -%macro SATD_MxN_MMX 3 -cglobal pixel_satd_%1x%2, 4,7 - SATD_START_MMX - pxor m0, m0 - call pixel_satd_%1x%3_internal_mmx2 - HADDUW m0, m1 - movd r6d, m0 -%rep %2/%3-1 - pxor m0, m0 - lea r0, [r0+4*r1] - 
lea r2, [r2+4*r3] - call pixel_satd_%1x%3_internal_mmx2 - movd m2, r4 - HADDUW m0, m1 - movd r4, m0 - add r6, r4 - movd r4, m2 -%endrep - movifnidn eax, r6d - RET -%endmacro - -SATD_MxN_MMX 16, 16, 4 -SATD_MxN_MMX 16, 8, 4 -SATD_MxN_MMX 8, 16, 8 -%endif ; HIGH_BIT_DEPTH - -%if HIGH_BIT_DEPTH == 0 -cglobal pixel_satd_16x16, 4,6 - SATD_START_MMX - pxor m0, m0 -%rep 3 - call pixel_satd_16x4_internal_mmx2 - lea r0, [r0+4*r1] - lea r2, [r2+4*r3] -%endrep - call pixel_satd_16x4_internal_mmx2 - HADDUW m0, m1 - movd eax, m0 - RET - -cglobal pixel_satd_16x8, 4,6 - SATD_START_MMX - pxor m0, m0 - call pixel_satd_16x4_internal_mmx2 - lea r0, [r0+4*r1] - lea r2, [r2+4*r3] - call pixel_satd_16x4_internal_mmx2 - SATD_END_MMX - -cglobal pixel_satd_8x16, 4,6 - SATD_START_MMX - pxor m0, m0 - call pixel_satd_8x8_internal_mmx2 - lea r0, [r0+4*r1] - lea r2, [r2+4*r3] - call pixel_satd_8x8_internal_mmx2 - SATD_END_MMX -%endif ; !HIGH_BIT_DEPTH - -cglobal pixel_satd_8x8, 4,6 - SATD_START_MMX - pxor m0, m0 - call pixel_satd_8x8_internal_mmx2 - SATD_END_MMX - -cglobal pixel_satd_8x4, 4,6 - SATD_START_MMX - pxor m0, m0 - call pixel_satd_8x4_internal_mmx2 - SATD_END_MMX - -cglobal pixel_satd_4x16, 4,6 - SATD_START_MMX - SATD_4x4_MMX m0, 0, 1 - SATD_4x4_MMX m1, 0, 1 - paddw m0, m1 - SATD_4x4_MMX m1, 0, 1 - paddw m0, m1 - SATD_4x4_MMX m1, 0, 0 - paddw m0, m1 - SATD_END_MMX - -cglobal pixel_satd_4x8, 4,6 - SATD_START_MMX - SATD_4x4_MMX m0, 0, 1 - SATD_4x4_MMX m1, 0, 0 - paddw m0, m1 - SATD_END_MMX - cglobal pixel_satd_4x4, 4,6 SATD_START_MMX SATD_4x4_MMX m0, 0, 0 -%if HIGH_BIT_DEPTH - HADDUW m0, m1 - movd eax, m0 -%else ; !HIGH_BIT_DEPTH - pshufw m1, m0, q1032 - paddw m0, m1 - pshufw m1, m0, q2301 - paddw m0, m1 - movd eax, m0 - and eax, 0xffff -%endif ; HIGH_BIT_DEPTH - EMMS - RET + SATD_END_MMX %macro SATD_START_SSE2 2-3 0 FIX_STRIDES r1, r3 @@ -485,10 +389,14 @@ %macro SATD_END_SSE2 1-2 %if HIGH_BIT_DEPTH + %if BIT_DEPTH == 12 + HADDD %1, xm0 + %else ; BIT_DEPTH == 12 HADDUW %1, xm0 -%if %0 == 2 + %endif ; BIT_DEPTH == 12 + %if %0 == 2 paddd %1, %2 -%endif + %endif %else HADDW %1, xm7 %endif @@ -631,7 +539,11 @@ mova m7, [hmul_4p] %endif SATD_4x8_SSE vertical, 0, swap - HADDW m7, m1 +%if BIT_DEPTH == 12 + HADDD m7, m1 +%else + HADDUW m7, m1 +%endif movd eax, m7 RET @@ -644,7 +556,11 @@ lea r0, [r0+r1*2*SIZEOF_PIXEL] lea r2, [r2+r3*2*SIZEOF_PIXEL] SATD_4x8_SSE vertical, 1, add - HADDW m7, m1 +%if BIT_DEPTH == 12 + HADDD m7, m1 +%else + HADDUW m7, m1 +%endif movd eax, m7 RET @@ -690,12 +606,8 @@ mova m7, [pw_00ff] %endif call pixel_satd_16x4_internal2 - pxor m9, m9 - movhlps m9, m10 - paddd m10, m9 - pshufd m9, m10, 1 - paddd m10, m9 - movd eax, m10 + HADDD m10, m0 + movd eax, m10 RET cglobal pixel_satd_16x8, 4,6,14 @@ -757,12 +669,8 @@ %%pixel_satd_16x8_internal: call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 - pxor m9, m9 - movhlps m9, m10 - paddd m10, m9 - pshufd m9, m10, 1 - paddd m10, m9 - movd eax, m10 + HADDD m10, m0 + movd eax, m10 RET cglobal pixel_satd_32x8, 4,8,14 ;if WIN64 && notcpuflag(avx) @@ -778,12 +686,8 @@ lea r2, [r7 + 16] call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 - pxor m9, m9 - movhlps m9, m10 - paddd m10, m9 - pshufd m9, m10, 1 - paddd m10, m9 - movd eax, m10 + HADDD m10, m0 + movd eax, m10 RET cglobal pixel_satd_32x16, 4,8,14 ;if WIN64 && notcpuflag(avx) @@ -803,11 +707,7 @@ call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 - pxor m9, m9 - movhlps m9, m10 - paddd m10, m9 - pshufd m9, m10, 1 - paddd m10, m9 + HADDD m10, 
m0 movd eax, m10 RET @@ -832,12 +732,8 @@ call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 - pxor m9, m9 - movhlps m9, m10 - paddd m10, m9 - pshufd m9, m10, 1 - paddd m10, m9 - movd eax, m10 + HADDD m10, m0 + movd eax, m10 RET cglobal pixel_satd_32x32, 4,8,14 ;if WIN64 && notcpuflag(avx) @@ -865,12 +761,8 @@ call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 - pxor m9, m9 - movhlps m9, m10 - paddd m10, m9 - pshufd m9, m10, 1 - paddd m10, m9 - movd eax, m10 + HADDD m10, m0 + movd eax, m10 RET cglobal pixel_satd_32x64, 4,8,14 ;if WIN64 && notcpuflag(avx) @@ -914,12 +806,8 @@ call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 - pxor m9, m9 - movhlps m9, m10 - paddd m10, m9 - pshufd m9, m10, 1 - paddd m10, m9 - movd eax, m10 + HADDD m10, m0 + movd eax, m10 RET cglobal pixel_satd_48x64, 4,8,14 ;if WIN64 && notcpuflag(avx) @@ -981,12 +869,8 @@ call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 - pxor m9, m9 - movhlps m9, m10 - paddd m10, m9 - pshufd m9, m10, 1 - paddd m10, m9 - movd eax, m10 + HADDD m10, m0 + movd eax, m10 RET cglobal pixel_satd_64x16, 4,8,14 ;if WIN64 && notcpuflag(avx) @@ -1018,12 +902,8 @@ call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 - pxor m9, m9 - movhlps m9, m10 - paddd m10, m9 - pshufd m9, m10, 1 - paddd m10, m9 - movd eax, m10 + HADDD m10, m0 + movd eax, m10 RET cglobal pixel_satd_64x32, 4,8,14 ;if WIN64 && notcpuflag(avx) @@ -1072,12 +952,8 @@ call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 - pxor m9, m9 - movhlps m9, m10 - paddd m10, m9 - pshufd m9, m10, 1 - paddd m10, m9 - movd eax, m10 + HADDD m10, m0 + movd eax, m10 RET cglobal pixel_satd_64x48, 4,8,14 ;if WIN64 && notcpuflag(avx) @@ -1142,12 +1018,8 @@ call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 - pxor m9, m9 - movhlps m9, m10 - paddd m10, m9 - pshufd m9, m10, 1 - paddd m10, m9 - movd eax, m10 + HADDD m10, m0 + movd eax, m10 RET cglobal pixel_satd_64x64, 4,8,14 ;if WIN64 && notcpuflag(avx) @@ -1228,12 +1100,8 @@ call pixel_satd_16x4_internal2 call pixel_satd_16x4_internal2 - pxor m9, m9 - movhlps m9, m10 - paddd m10, m9 - pshufd m9, m10, 1 - paddd m10, m9 - movd eax, m10 + HADDD m10, m0 + movd eax, m10 RET %else @@ -1250,11 +1118,7 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 + HADDD m6, m0 movd eax, m6 RET %else @@ -1271,12 +1135,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif %if WIN64 @@ -1314,12 +1174,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_32x48, 4,7,8,0-gprsize ;if !WIN64 @@ -1359,12 +1215,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -1401,12 +1253,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, 
m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_24x64, 4,7,8,0-gprsize ;if !WIN64 @@ -1443,12 +1291,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -1465,12 +1309,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_8x64, 4,7,8,0-gprsize ;if !WIN64 @@ -1485,12 +1325,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -1515,12 +1351,8 @@ mov [rsp], r2 call pixel_satd_8x8_internal2 call %%pixel_satd_8x4_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -1565,12 +1397,8 @@ lea r0, [r0 + r1*2*SIZEOF_PIXEL] lea r2, [r2 + r3*2*SIZEOF_PIXEL] SATD_4x8_SSE vertical, 1, 4, 5 - pxor m1, m1 - movhlps m1, m7 - paddd m7, m1 - pshufd m1, m7, 1 - paddd m7, m1 - movd eax, m7 + HADDD m7, m0 + movd eax, m7 RET %else cglobal pixel_satd_12x32, 4,7,8,0-gprsize @@ -1614,12 +1442,8 @@ lea r0, [r0 + r1*2*SIZEOF_PIXEL] lea r2, [r2 + r3*2*SIZEOF_PIXEL] SATD_4x8_SSE vertical, 1, 4, 5 - pxor m1, m1 - movhlps m1, m7 - paddd m7, m1 - pshufd m1, m7, 1 - paddd m7, m1 - movd eax, m7 + HADDD m7, m0 + movd eax, m7 RET %endif %else ;HIGH_BIT_DEPTH @@ -1735,12 +1559,8 @@ lea r0, [r0 + r1*2*SIZEOF_PIXEL] lea r2, [r2 + r3*2*SIZEOF_PIXEL] SATD_4x8_SSE vertical, 1, 4, 5 - pxor m1, m1 - movhlps m1, m7 - paddd m7, m1 - pshufd m1, m7, 1 - paddd m7, m1 - movd eax, m7 + HADDD m7, m0 + movd eax, m7 RET %else cglobal pixel_satd_4x32, 4,7,8,0-gprsize @@ -1827,12 +1647,8 @@ lea r0, [r6 + 24*SIZEOF_PIXEL] lea r2, [r7 + 24*SIZEOF_PIXEL] call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_32x8, 4,7,8,0-gprsize ;if !WIN64 @@ -1852,12 +1668,8 @@ mov r2, [rsp] add r2, 24*SIZEOF_PIXEL call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -1880,12 +1692,8 @@ lea r2, [r7 + 24*SIZEOF_PIXEL] call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_32x16, 4,7,8,0-gprsize ;if !WIN64 @@ -1909,12 +1717,8 @@ add r2, 24*SIZEOF_PIXEL call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -1941,12 +1745,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_32x24, 4,7,8,0-gprsize ;if !WIN64 @@ -1974,12 +1774,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - 
pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -2010,12 +1806,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_32x32, 4,7,8,0-gprsize ;if !WIN64 @@ -2047,12 +1839,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -2099,12 +1887,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_32x64, 4,7,8,0-gprsize ;if !WIN64 @@ -2152,12 +1936,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -2224,12 +2004,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_48x64, 4,7,8,0-gprsize ;if !WIN64 @@ -2299,12 +2075,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -2344,12 +2116,8 @@ lea r2, [r7 + 56*SIZEOF_PIXEL] call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_64x16, 4,7,8,0-gprsize ;if !WIN64 @@ -2393,12 +2161,8 @@ add r2,56*SIZEOF_PIXEL call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -2453,12 +2217,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_64x32, 4,7,8,0-gprsize ;if !WIN64 @@ -2518,12 +2278,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -2594,12 +2350,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m8, m8 - movhlps m8, m6 - paddd m6, m8 - pshufd m8, m6, 1 - paddd m6, m8 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_64x48, 4,7,8,0-gprsize ;if !WIN64 @@ -2675,12 +2427,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -2767,12 +2515,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m8, m8 
- movhlps m8, m6 - paddd m6, m8 - pshufd m8, m6, 1 - paddd m6, m8 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_64x64, 4,7,8,0-gprsize ;if !WIN64 @@ -2864,12 +2608,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -2883,12 +2623,8 @@ call %%pixel_satd_8x4_internal2 RESTORE_AND_INC_POINTERS call %%pixel_satd_8x4_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %if WIN64 @@ -2901,12 +2637,8 @@ call pixel_satd_8x8_internal2 RESTORE_AND_INC_POINTERS call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %if WIN64 @@ -2921,12 +2653,8 @@ RESTORE_AND_INC_POINTERS call pixel_satd_8x8_internal2 call %%pixel_satd_8x4_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %if WIN64 @@ -2941,12 +2669,8 @@ RESTORE_AND_INC_POINTERS call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %if WIN64 @@ -2965,12 +2689,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %if WIN64 @@ -2997,12 +2717,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif @@ -3029,12 +2745,8 @@ lea r0, [r0 + r1*2*SIZEOF_PIXEL] lea r2, [r2 + r3*2*SIZEOF_PIXEL] SATD_4x8_SSE vertical, 1, 4, 5 - pxor m1, m1 - movhlps m1, m7 - paddd m7, m1 - pshufd m1, m7, 1 - paddd m7, m1 - movd eax, m7 + HADDD m7, m0 + movd eax, m7 RET %else cglobal pixel_satd_12x16, 4,7,8,0-gprsize @@ -3060,12 +2772,8 @@ lea r0, [r0 + r1*2*SIZEOF_PIXEL] lea r2, [r2 + r3*2*SIZEOF_PIXEL] SATD_4x8_SSE vertical, 1, 4, 5 - pxor m1, m1 - movhlps m1, m7 - paddd m7, m1 - pshufd m1, m7, 1 - paddd m7, m1 - movd eax, m7 + HADDD m7, m0 + movd eax, m7 RET %endif %else ;HIGH_BIT_DEPTH @@ -3149,12 +2857,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %else cglobal pixel_satd_24x32, 4,7,8,0-gprsize @@ -3179,12 +2883,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %endif ;WIN64 @@ -3201,12 +2901,8 @@ call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET %if WIN64 @@ -3217,12 +2913,8 @@ SATD_START_SSE2 m6, m7 call pixel_satd_8x8_internal2 call pixel_satd_8x8_internal2 - pxor m7, m7 - movhlps m7, m6 - paddd m6, m7 - pshufd m7, m6, 1 - paddd m6, m7 - movd eax, m6 + HADDD m6, m0 + movd eax, m6 RET cglobal pixel_satd_8x8, 
4,6,8 @@ -6982,11 +6674,119 @@ add eax, 1 shr eax, 1 RET + +cglobal pixel_sa8d_16x16, 4,6,8 + SATD_START_AVX2 m6, m7, 1 + + call pixel_sa8d_8x8_internal ; pix[0] + + sub r0, r1 + sub r0, r1 + add r0, 8*SIZEOF_PIXEL + sub r2, r3 + sub r2, r3 + add r2, 8*SIZEOF_PIXEL + call pixel_sa8d_8x8_internal ; pix[8] + + add r0, r4 + add r0, r1 + add r2, r5 + add r2, r3 + call pixel_sa8d_8x8_internal ; pix[8*stride+8] + + sub r0, r1 + sub r0, r1 + sub r0, 8*SIZEOF_PIXEL + sub r2, r3 + sub r2, r3 + sub r2, 8*SIZEOF_PIXEL + call pixel_sa8d_8x8_internal ; pix[8*stride] + + ; TODO: analyze Dynamic Range + vextracti128 xm0, m6, 1 + paddusw xm6, xm0 + HADDUW xm6, xm0 + movd eax, xm6 + add eax, 1 + shr eax, 1 + RET + +cglobal pixel_sa8d_16x16_internal + call pixel_sa8d_8x8_internal ; pix[0] + + sub r0, r1 + sub r0, r1 + add r0, 8*SIZEOF_PIXEL + sub r2, r3 + sub r2, r3 + add r2, 8*SIZEOF_PIXEL + call pixel_sa8d_8x8_internal ; pix[8] + + add r0, r4 + add r0, r1 + add r2, r5 + add r2, r3 + call pixel_sa8d_8x8_internal ; pix[8*stride+8] + + sub r0, r1 + sub r0, r1 + sub r0, 8*SIZEOF_PIXEL + sub r2, r3 + sub r2, r3 + sub r2, 8*SIZEOF_PIXEL + call pixel_sa8d_8x8_internal ; pix[8*stride] + + ; TODO: analyze Dynamic Range + vextracti128 xm0, m6, 1 + paddusw xm6, xm0 + HADDUW xm6, xm0 + movd eax, xm6 + add eax, 1 + shr eax, 1 + ret + +%if ARCH_X86_64 +cglobal pixel_sa8d_32x32, 4,8,8 + ; TODO: R6 is RAX on x64 platform, so we use it directly + + SATD_START_AVX2 m6, m7, 1 + xor r7d, r7d + + call pixel_sa8d_16x16_internal ; [0] + pxor m6, m6 + add r7d, eax + + add r0, r4 + add r0, r1 + add r2, r5 + add r2, r3 + call pixel_sa8d_16x16_internal ; [2] + pxor m6, m6 + add r7d, eax + + lea eax, [r4 * 5 - 16] + sub r0, rax + sub r0, r1 + lea eax, [r5 * 5 - 16] + sub r2, rax + sub r2, r3 + call pixel_sa8d_16x16_internal ; [1] + pxor m6, m6 + add r7d, eax + + add r0, r4 + add r0, r1 + add r2, r5 + add r2, r3 + call pixel_sa8d_16x16_internal ; [3] + add eax, r7d + RET +%endif ; ARCH_X86_64=1 %endif ; HIGH_BIT_DEPTH -; Input 16bpp, Output 8bpp +; Input 10bit, Output 8bit ;------------------------------------------------------------------------------------------------------------------------ -;void planecopy_sp(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask) +;void planecopy_sc(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask) ;------------------------------------------------------------------------------------------------------------------------ INIT_XMM sse2 cglobal downShift_16, 7,7,3 @@ -7078,7 +6878,7 @@ .end: RET -; Input 16bpp, Output 8bpp +; Input 10bit, Output 8bit ;------------------------------------------------------------------------------------------------------------------------------------- ;void planecopy_sp(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask) ;------------------------------------------------------------------------------------------------------------------------------------- @@ -7189,99 +6989,111 @@ .end: RET -; Input 8bpp, Output 16bpp +; Input 8bit, Output 10bit ;--------------------------------------------------------------------------------------------------------------------- ;void planecopy_cp(uint8_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift) 
;--------------------------------------------------------------------------------------------------------------------- INIT_XMM sse4 -cglobal upShift_8, 7,7,3 - - movd m2, r6d ; m0 = shift - add r3, r3 +cglobal upShift_8, 6,7,3 + movd xm2, r6m + add r3d, r3d dec r5d .loopH: xor r6, r6 .loopW: pmovzxbw m0,[r0 + r6] - pmovzxbw m1,[r0 + r6 + 8] + pmovzxbw m1,[r0 + r6 + mmsize/2] psllw m0, m2 psllw m1, m2 movu [r2 + r6 * 2], m0 - movu [r2 + r6 * 2 + 16], m1 + movu [r2 + r6 * 2 + mmsize], m1 - add r6, 16 + add r6d, mmsize cmp r6d, r4d - jl .loopW + jl .loopW ; move to next row add r0, r1 add r2, r3 dec r5d - jnz .loopH + jg .loopH -;processing last row of every frame [To handle width which not a multiple of 16] - -.loop16: + ; processing last row of every frame [To handle width which not a multiple of 16] + mov r1d, (mmsize/2 - 1) + and r1d, r4d + sub r1, mmsize/2 + + ; NOTE: Width MUST BE more than or equal to 8 + shr r4d, 3 ; log2(mmsize) +.loopW8: pmovzxbw m0,[r0] - pmovzxbw m1,[r0 + 8] psllw m0, m2 - psllw m1, m2 movu [r2], m0 - movu [r2 + 16], m1 - - add r0, mmsize - add r2, 2 * mmsize - sub r4d, 16 - jz .end - cmp r4d, 15 - jg .loop16 + add r0, mmsize/2 + add r2, mmsize + dec r4d + jg .loopW8 - cmp r4d, 8 - jl .process4 - pmovzxbw m0,[r0] + ; Mac OS X can't read beyond array bound, so rollback some bytes + pmovzxbw m0,[r0 + r1] psllw m0, m2 - movu [r2], m0 + movu [r2 + r1 * 2], m0 + RET - add r0, 8 - add r2, mmsize - sub r4d, 8 - jz .end -.process4: - cmp r4d, 4 - jl .process2 - movd m0,[r0] - pmovzxbw m0,m0 - psllw m0, m2 - movh [r2], m0 +;--------------------------------------------------------------------------------------------------------------------- +;void planecopy_cp(uint8_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift) +;--------------------------------------------------------------------------------------------------------------------- +%if ARCH_X86_64 +INIT_YMM avx2 +cglobal upShift_8, 6,7,3 + movd xm2, r6m + add r3d, r3d + dec r5d - add r0, 4 - add r2, 8 - sub r4d, 4 - jz .end +.loopH: + xor r6, r6 +.loopW: + pmovzxbw m0,[r0 + r6] + pmovzxbw m1,[r0 + r6 + mmsize/2] + psllw m0, xm2 + psllw m1, xm2 + movu [r2 + r6 * 2], m0 + movu [r2 + r6 * 2 + mmsize], m1 -.process2: - cmp r4d, 2 - jl .process1 - movzx r3d, byte [r0] - shl r3d, 2 - mov [r2], r3w - movzx r3d, byte [r0 + 1] - shl r3d, 2 - mov [r2 + 2], r3w + add r6d, mmsize + cmp r6d, r4d + jl .loopW - add r0, 2 - add r2, 4 - sub r4d, 2 - jz .end + ; move to next row + add r0, r1 + add r2, r3 + dec r5d + jg .loopH -.process1: - movzx r3d, byte [r0] - shl r3d, 2 - mov [r2], r3w -.end: + ; processing last row of every frame [To handle width which not a multiple of 32] + mov r1d, (mmsize/2 - 1) + and r1d, r4d + sub r1, mmsize/2 + + ; NOTE: Width MUST BE more than or equal to 16 + shr r4d, 4 ; log2(mmsize) +.loopW16: + pmovzxbw m0,[r0] + psllw m0, xm2 + movu [r2], m0 + add r0, mmsize/2 + add r2, mmsize + dec r4d + jg .loopW16 + + ; Mac OS X can't read beyond array bound, so rollback some bytes + pmovzxbw m0,[r0 + r1] + psllw m0, xm2 + movu [r2 + r1 * 2], m0 RET +%endif %macro ABSD2 6 ; dst1, dst2, src1, src2, tmp, tmp %if cpuflag(ssse3) @@ -7304,6 +7116,219 @@ %endif %endmacro + +; Input 10bit, Output 12bit +;------------------------------------------------------------------------------------------------------------------------ +;void planecopy_sp_shl(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask) 
+;------------------------------------------------------------------------------------------------------------------------ +INIT_XMM sse2 +cglobal upShift_16, 6,7,4 + movd m0, r6m ; m0 = shift + mova m3, [pw_pixel_max] + FIX_STRIDES r1d, r3d + dec r5d +.loopH: + xor r6d, r6d +.loopW: + movu m1, [r0 + r6 * SIZEOF_PIXEL] + movu m2, [r0 + r6 * SIZEOF_PIXEL + mmsize] + psllw m1, m0 + psllw m2, m0 + ; TODO: if input always valid, we can remove below 2 instructions. + pand m1, m3 + pand m2, m3 + movu [r2 + r6 * SIZEOF_PIXEL], m1 + movu [r2 + r6 * SIZEOF_PIXEL + mmsize], m2 + + add r6, mmsize * 2 / SIZEOF_PIXEL + cmp r6d, r4d + jl .loopW + + ; move to next row + add r0, r1 + add r2, r3 + dec r5d + jnz .loopH + +;processing last row of every frame [To handle width which not a multiple of 16] + +.loop16: + movu m1, [r0] + movu m2, [r0 + mmsize] + psllw m1, m0 + psllw m2, m0 + pand m1, m3 + pand m2, m3 + movu [r2], m1 + movu [r2 + mmsize], m2 + + add r0, 2 * mmsize + add r2, 2 * mmsize + sub r4d, 16 + jz .end + jg .loop16 + + cmp r4d, 8 + jl .process4 + movu m1, [r0] + psrlw m1, m0 + pand m1, m3 + movu [r2], m1 + + add r0, mmsize + add r2, mmsize + sub r4d, 8 + jz .end + +.process4: + cmp r4d, 4 + jl .process2 + movh m1,[r0] + psllw m1, m0 + pand m1, m3 + movh [r2], m1 + + add r0, 8 + add r2, 8 + sub r4d, 4 + jz .end + +.process2: + cmp r4d, 2 + jl .process1 + movd m1, [r0] + psllw m1, m0 + pand m1, m3 + movd [r2], m1 + + add r0, 4 + add r2, 4 + sub r4d, 2 + jz .end + +.process1: + movd m1, [r0] + psllw m1, m0 + pand m1, m3 + movd r3, m1 + mov [r2], r3w +.end: + RET + +; Input 10bit, Output 12bit +;------------------------------------------------------------------------------------------------------------------------------------- +;void planecopy_sp_shl(uint16_t *src, intptr_t srcStride, pixel *dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask) +;------------------------------------------------------------------------------------------------------------------------------------- +; TODO: NO TEST CODE! 
+INIT_YMM avx2 +cglobal upShift_16, 6,7,4 + movd xm0, r6m ; m0 = shift + vbroadcasti128 m3, [pw_pixel_max] + FIX_STRIDES r1d, r3d + dec r5d +.loopH: + xor r6d, r6d +.loopW: + movu m1, [r0 + r6 * SIZEOF_PIXEL] + movu m2, [r0 + r6 * SIZEOF_PIXEL + mmsize] + psllw m1, xm0 + psllw m2, xm0 + pand m1, m3 + pand m2, m3 + movu [r2 + r6 * SIZEOF_PIXEL], m1 + movu [r2 + r6 * SIZEOF_PIXEL + mmsize], m2 + + add r6, mmsize * 2 / SIZEOF_PIXEL + cmp r6d, r4d + jl .loopW + + ; move to next row + add r0, r1 + add r2, r3 + dec r5d + jnz .loopH + +; processing last row of every frame [To handle width which not a multiple of 32] + mov r6d, r4d + and r4d, 31 + shr r6d, 5 + +.loop32: + movu m1, [r0] + movu m2, [r0 + mmsize] + psllw m1, xm0 + psllw m2, xm0 + pand m1, m3 + pand m2, m3 + movu [r2], m1 + movu [r2 + mmsize], m2 + + add r0, 2*mmsize + add r2, 2*mmsize + dec r6d + jnz .loop32 + + cmp r4d, 16 + jl .process8 + movu m1, [r0] + psllw m1, xm0 + pand m1, m3 + movu [r2], m1 + + add r0, mmsize + add r2, mmsize + sub r4d, 16 + jz .end + +.process8: + cmp r4d, 8 + jl .process4 + movu xm1, [r0] + psllw xm1, xm0 + pand xm1, xm3 + movu [r2], xm1 + + add r0, 16 + add r2, 16 + sub r4d, 8 + jz .end + +.process4: + cmp r4d, 4 + jl .process2 + movq xm1,[r0] + psllw xm1, xm0 + pand xm1, xm3 + movq [r2], xm1 + + add r0, 8 + add r2, 8 + sub r4d, 4 + jz .end + +.process2: + cmp r4d, 2 + jl .process1 + movd xm1, [r0] + psllw xm1, xm0 + pand xm1, xm3 + movd [r2], xm1 + + add r0, 4 + add r2, 4 + sub r4d, 2 + jz .end + +.process1: + movd xm1, [r0] + psllw xm1, xm0 + pand xm1, xm3 + movd r3d, xm1 + mov [r2], r3w +.end: + RET + + ;--------------------------------------------------------------------------------------------------------------------- ;int psyCost_pp(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride) ;--------------------------------------------------------------------------------------------------------------------- @@ -8260,6 +8285,76 @@ %endif INIT_YMM avx2 +%if HIGH_BIT_DEPTH +cglobal psyCost_pp_4x4, 4, 5, 6 + add r1d, r1d + add r3d, r3d + lea r4, [r1 * 3] + movddup xm0, [r0] + movddup xm1, [r0 + r1] + movddup xm2, [r0 + r1 * 2] + movddup xm3, [r0 + r4] + + lea r4, [r3 * 3] + movddup xm4, [r2] + movddup xm5, [r2 + r3] + vinserti128 m0, m0, xm4, 1 + vinserti128 m1, m1, xm5, 1 + movddup xm4, [r2 + r3 * 2] + movddup xm5, [r2 + r4] + vinserti128 m2, m2, xm4, 1 + vinserti128 m3, m3, xm5, 1 + + mova m4, [hmul_8w] + pmaddwd m0, m4 + pmaddwd m1, m4 + pmaddwd m2, m4 + pmaddwd m3, m4 + paddd m5, m0, m1 + paddd m4, m2, m3 + paddd m5, m4 + psrldq m4, m5, 4 + paddd m5, m4 + psrld m5, 2 + + mova m4, m0 + paddd m0, m1 + psubd m1, m4 + mova m4, m2 + paddd m2, m3 + psubd m3, m4 + mova m4, m0 + paddd m0, m2 + psubd m2, m4 + mova m4, m1 + paddd m1, m3 + psubd m3, m4 + movaps m4, m0 + vshufps m4, m4, m2, 11011101b + vshufps m0, m0, m2, 10001000b + movaps m2, m1 + vshufps m2, m2, m3, 11011101b + vshufps m1, m1, m3, 10001000b + pabsd m0, m0 + pabsd m4, m4 + pmaxsd m0, m4 + pabsd m1, m1 + pabsd m2, m2 + pmaxsd m1, m2 + paddd m0, m1 + + vpermq m1, m0, 11110101b + paddd m0, m1 + psrldq m1, m0, 4 + paddd m0, m1 + psubd m0, m5 + + vextracti128 xm1, m0, 1 + psubd xm1, xm0 + pabsd xm1, xm1 + movd eax, xm1 + RET +%else ; !HIGH_BIT_DEPTH cglobal psyCost_pp_4x4, 4, 5, 6 lea r4, [3 * r1] movd xm0, [r0] @@ -8314,6 +8409,7 @@ pabsd m1, m1 movd eax, xm1 RET +%endif %macro PSY_PP_8x8 0 movddup m0, [r0 + r1 * 0] @@ -8495,7 +8591,149 @@ pabsd m0, m0 %endmacro +%macro PSY_PP_8x8_AVX2 0 + lea r4, [r1 * 3] + movu xm0, [r0] + 
movu xm1, [r0 + r1] + movu xm2, [r0 + r1 * 2] + movu xm3, [r0 + r4] + lea r5, [r0 + r1 * 4] + movu xm4, [r5] + movu xm5, [r5 + r1] + movu xm6, [r5 + r1 * 2] + movu xm7, [r5 + r4] + + lea r4, [r3 * 3] + vinserti128 m0, m0, [r2], 1 + vinserti128 m1, m1, [r2 + r3], 1 + vinserti128 m2, m2, [r2 + r3 * 2], 1 + vinserti128 m3, m3, [r2 + r4], 1 + lea r5, [r2 + r3 * 4] + vinserti128 m4, m4, [r5], 1 + vinserti128 m5, m5, [r5 + r3], 1 + vinserti128 m6, m6, [r5 + r3 * 2], 1 + vinserti128 m7, m7, [r5 + r4], 1 + + paddw m8, m0, m1 + paddw m8, m2 + paddw m8, m3 + paddw m8, m4 + paddw m8, m5 + paddw m8, m6 + paddw m8, m7 + pmaddwd m8, [pw_1] + + psrldq m9, m8, 8 + paddd m8, m9 + psrldq m9, m8, 4 + paddd m8, m9 + psrld m8, 2 + + psubw m9, m1, m0 + paddw m0, m1 + psubw m1, m3, m2 + paddw m2, m3 + punpckhwd m3, m0, m9 + punpcklwd m0, m9 + psubw m9, m3, m0 + paddw m0, m3 + punpckhwd m3, m2, m1 + punpcklwd m2, m1 + psubw m10, m3, m2 + paddw m2, m3 + psubw m3, m5, m4 + paddw m4, m5 + psubw m5, m7, m6 + paddw m6, m7 + punpckhwd m1, m4, m3 + punpcklwd m4, m3 + psubw m7, m1, m4 + paddw m4, m1 + punpckhwd m3, m6, m5 + punpcklwd m6, m5 + psubw m1, m3, m6 + paddw m6, m3 + psubw m3, m2, m0 + paddw m0, m2 + psubw m2, m10, m9 + paddw m9, m10 + punpckhdq m5, m0, m3 + punpckldq m0, m3 + psubw m10, m5, m0 + paddw m0, m5 + punpckhdq m3, m9, m2 + punpckldq m9, m2 + psubw m5, m3, m9 + paddw m9, m3 + psubw m3, m6, m4 + paddw m4, m6 + psubw m6, m1, m7 + paddw m7, m1 + punpckhdq m2, m4, m3 + punpckldq m4, m3 + psubw m1, m2, m4 + paddw m4, m2 + punpckhdq m3, m7, m6 + punpckldq m7, m6 + psubw m2, m3, m7 + paddw m7, m3 + psubw m3, m4, m0 + paddw m0, m4 + psubw m4, m1, m10 + paddw m10, m1 + punpckhqdq m6, m0, m3 + punpcklqdq m0, m3 + pabsw m0, m0 + pabsw m6, m6 + pmaxsw m0, m6 + punpckhqdq m3, m10, m4 + punpcklqdq m10, m4 + pabsw m10, m10 + pabsw m3, m3 + pmaxsw m10, m3 + psubw m3, m7, m9 + paddw m9, m7 + psubw m7, m2, m5 + paddw m5, m2 + punpckhqdq m4, m9, m3 + punpcklqdq m9, m3 + pabsw m9, m9 + pabsw m4, m4 + pmaxsw m9, m4 + punpckhqdq m3, m5, m7 + punpcklqdq m5, m7 + pabsw m5, m5 + pabsw m3, m3 + pmaxsw m5, m3 + paddd m0, m9 + paddd m0, m10 + paddd m0, m5 + psrld m9, m0, 16 + pslld m0, 16 + psrld m0, 16 + paddd m0, m9 + psrldq m9, m0, 8 + paddd m0, m9 + psrldq m9, m0, 4 + paddd m0, m9 + paddd m0, [pd_1] + psrld m0, 1 + psubd m0, m8 + + vextracti128 xm1, m0, 1 + psubd xm1, xm0 + pabsd xm1, xm1 +%endmacro + %if ARCH_X86_64 +%if HIGH_BIT_DEPTH +cglobal psyCost_pp_8x8, 4, 8, 11 + add r1d, r1d + add r3d, r3d + PSY_PP_8x8_AVX2 + movd eax, xm1 + RET +%else ; !HIGH_BIT_DEPTH INIT_YMM avx2 cglobal psyCost_pp_8x8, 4, 8, 13 lea r4, [3 * r1] @@ -8507,9 +8745,33 @@ movd eax, xm0 RET %endif - +%endif %if ARCH_X86_64 INIT_YMM avx2 +%if HIGH_BIT_DEPTH +cglobal psyCost_pp_16x16, 4, 10, 12 + add r1d, r1d + add r3d, r3d + pxor m11, m11 + + mov r8d, 2 +.loopH: + mov r9d, 2 +.loopW: + PSY_PP_8x8_AVX2 + + paddd xm11, xm1 + add r0, 16 + add r2, 16 + dec r9d + jnz .loopW + lea r0, [r0 + r1 * 8 - 32] + lea r2, [r2 + r3 * 8 - 32] + dec r8d + jnz .loopH + movd eax, xm11 + RET +%else ; !HIGH_BIT_DEPTH cglobal psyCost_pp_16x16, 4, 10, 14 lea r4, [3 * r1] lea r7, [3 * r3] @@ -8534,9 +8796,33 @@ movd eax, xm13 RET %endif - +%endif %if ARCH_X86_64 INIT_YMM avx2 +%if HIGH_BIT_DEPTH +cglobal psyCost_pp_32x32, 4, 10, 12 + add r1d, r1d + add r3d, r3d + pxor m11, m11 + + mov r8d, 4 +.loopH: + mov r9d, 4 +.loopW: + PSY_PP_8x8_AVX2 + + paddd xm11, xm1 + add r0, 16 + add r2, 16 + dec r9d + jnz .loopW + lea r0, [r0 + r1 * 8 - 64] + lea r2, [r2 + r3 * 8 - 64] + dec r8d + 
jnz .loopH + movd eax, xm11 + RET +%else ; !HIGH_BIT_DEPTH cglobal psyCost_pp_32x32, 4, 10, 14 lea r4, [3 * r1] lea r7, [3 * r3] @@ -8561,9 +8847,33 @@ movd eax, xm13 RET %endif - +%endif %if ARCH_X86_64 INIT_YMM avx2 +%if HIGH_BIT_DEPTH +cglobal psyCost_pp_64x64, 4, 10, 12 + add r1d, r1d + add r3d, r3d + pxor m11, m11 + + mov r8d, 8 +.loopH: + mov r9d, 8 +.loopW: + PSY_PP_8x8_AVX2 + + paddd xm11, xm1 + add r0, 16 + add r2, 16 + dec r9d + jnz .loopW + lea r0, [r0 + r1 * 8 - 128] + lea r2, [r2 + r3 * 8 - 128] + dec r8d + jnz .loopH + movd eax, xm11 + RET +%else ; !HIGH_BIT_DEPTH cglobal psyCost_pp_64x64, 4, 10, 14 lea r4, [3 * r1] lea r7, [3 * r3] @@ -8588,6 +8898,7 @@ movd eax, xm13 RET %endif +%endif ;--------------------------------------------------------------------------------------------------------------------- ;int psyCost_ss(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride) @@ -8747,7 +9058,7 @@ INIT_XMM sse4 cglobal psyCost_ss_8x8, 4, 6, 15 - mova m13, [hmul_w] + mova m13, [pw_pmpmpmpm] mova m14, [pw_1] add r1, r1 add r3, r3 @@ -9897,7 +10208,7 @@ INIT_XMM sse4 cglobal psyCost_ss_16x16, 4, 9, 16 - mova m13, [hmul_w] + mova m13, [pw_pmpmpmpm] mova m14, [pw_1] add r1, r1 add r3, r3 @@ -9925,7 +10236,7 @@ INIT_XMM sse4 cglobal psyCost_ss_32x32, 4, 9, 16 - mova m13, [hmul_w] + mova m13, [pw_pmpmpmpm] mova m14, [pw_1] add r1, r1 add r3, r3 @@ -9953,7 +10264,7 @@ INIT_XMM sse4 cglobal psyCost_ss_64x64, 4, 9, 16 - mova m13, [hmul_w] + mova m13, [pw_pmpmpmpm] mova m14, [pw_1] add r1, r1 add r3, r3 @@ -10394,7 +10705,7 @@ and rsp, ~63 mova m12, [pw_1] - mova m13, [hmul_w] + mova m13, [pw_pmpmpmpm] add r1, r1 add r3, r3 @@ -10414,7 +10725,7 @@ and rsp, ~63 mova m12, [pw_1] - mova m13, [hmul_w] + mova m13, [pw_pmpmpmpm] add r1, r1 add r3, r3 pxor m14, m14 @@ -10448,7 +10759,7 @@ and rsp, ~63 mova m12, [pw_1] - mova m13, [hmul_w] + mova m13, [pw_pmpmpmpm] add r1, r1 add r3, r3 pxor m14, m14 @@ -10482,7 +10793,7 @@ and rsp, ~63 mova m12, [pw_1] - mova m13, [hmul_w] + mova m13, [pw_pmpmpmpm] add r1, r1 add r3, r3 pxor m14, m14
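Throughout the satd/sa8d hunks above, the hand-rolled lane sum (pxor, movhlps, paddd, pshufd, paddd, movd) is replaced by the x86inc HADDD macro, which emits an equivalent horizontal reduction of the four dword lanes. A minimal C sketch of what that epilogue computes, assuming a four-lane accumulator; the helper name is illustrative, not x265 API:

    #include <stdint.h>

    /* Reduce four 32-bit lanes to one scalar, the way the removed
     * movhlps/pshufd sequence (and the HADDD macro) does. */
    static inline uint32_t haddd(const uint32_t lane[4])
    {
        uint32_t lo = lane[0] + lane[2]; /* movhlps + paddd: fold high qword */
        uint32_t hi = lane[1] + lane[3];
        return lo + hi;                  /* pshufd + paddd, then movd eax */
    }

Folding high-onto-low halves like this halves the vector width each step, which is why the same six-instruction pattern (or one macro) appears after every satd accumulator.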
View file
x265_1.7.tar.gz/source/common/x86/pixel-util.h -> x265_1.8.tar.gz/source/common/x86/pixel-util.h
Changed
@@ -24,117 +24,36 @@ #ifndef X265_PIXEL_UTIL_H #define X265_PIXEL_UTIL_H -void x265_getResidual4_sse2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); -void x265_getResidual8_sse2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); -void x265_getResidual16_sse2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); -void x265_getResidual16_sse4(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); -void x265_getResidual32_sse2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); -void x265_getResidual32_sse4(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); -void x265_getResidual16_avx2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); -void x265_getResidual32_avx2(const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); - -void x265_transpose4_sse2(pixel* dest, const pixel* src, intptr_t stride); -void x265_transpose8_sse2(pixel* dest, const pixel* src, intptr_t stride); -void x265_transpose16_sse2(pixel* dest, const pixel* src, intptr_t stride); -void x265_transpose32_sse2(pixel* dest, const pixel* src, intptr_t stride); -void x265_transpose64_sse2(pixel* dest, const pixel* src, intptr_t stride); - -void x265_transpose8_avx2(pixel* dest, const pixel* src, intptr_t stride); -void x265_transpose16_avx2(pixel* dest, const pixel* src, intptr_t stride); -void x265_transpose32_avx2(pixel* dest, const pixel* src, intptr_t stride); -void x265_transpose64_avx2(pixel* dest, const pixel* src, intptr_t stride); - -uint32_t x265_quant_sse4(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff); -uint32_t x265_quant_avx2(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff); -uint32_t x265_nquant_sse4(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff); -uint32_t x265_nquant_avx2(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff); -void x265_dequant_normal_sse4(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift); -void x265_dequant_normal_avx2(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift); - -int x265_count_nonzero_4x4_ssse3(const int16_t* quantCoeff); -int x265_count_nonzero_8x8_ssse3(const int16_t* quantCoeff); -int x265_count_nonzero_16x16_ssse3(const int16_t* quantCoeff); -int x265_count_nonzero_32x32_ssse3(const int16_t* quantCoeff); -int x265_count_nonzero_4x4_avx2(const int16_t* quantCoeff); -int x265_count_nonzero_8x8_avx2(const int16_t* quantCoeff); -int x265_count_nonzero_16x16_avx2(const int16_t* quantCoeff); -int x265_count_nonzero_32x32_avx2(const int16_t* quantCoeff); - -void x265_weight_pp_sse4(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset); -void x265_weight_pp_avx2(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset); -void x265_weight_sp_sse4(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset); - -void x265_pixel_ssim_4x4x2_core_mmx2(const uint8_t* pix1, intptr_t stride1, - const uint8_t* pix2, intptr_t stride2, int sums[2][4]); -void x265_pixel_ssim_4x4x2_core_sse2(const pixel* pix1, intptr_t stride1, - const pixel* pix2, intptr_t stride2, int 
sums[2][4]); -void x265_pixel_ssim_4x4x2_core_avx(const pixel* pix1, intptr_t stride1, - const pixel* pix2, intptr_t stride2, int sums[2][4]); -float x265_pixel_ssim_end4_sse2(int sum0[5][4], int sum1[5][4], int width); -float x265_pixel_ssim_end4_avx(int sum0[5][4], int sum1[5][4], int width); - -void x265_scale1D_128to64_ssse3(pixel*, const pixel*); -void x265_scale1D_128to64_avx2(pixel*, const pixel*); -void x265_scale2D_64to32_ssse3(pixel*, const pixel*, intptr_t); -void x265_scale2D_64to32_avx2(pixel*, const pixel*, intptr_t); - -int x265_scanPosLast_x64(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize); -int x265_scanPosLast_avx2_bmi2(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize); -uint32_t x265_findPosFirstLast_ssse3(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16]); - -#define SETUP_CHROMA_PIXELSUB_PS_FUNC(W, H, cpu) \ - void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t* dest, intptr_t destride, const pixel* src0, const pixel* src1, intptr_t srcstride0, intptr_t srcstride1); \ - void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t* src1, intptr_t srcStride0, intptr_t srcStride1); - -#define CHROMA_420_PIXELSUB_DEF(cpu) \ - SETUP_CHROMA_PIXELSUB_PS_FUNC(4, 4, cpu); \ - SETUP_CHROMA_PIXELSUB_PS_FUNC(8, 8, cpu); \ - SETUP_CHROMA_PIXELSUB_PS_FUNC(16, 16, cpu); \ - SETUP_CHROMA_PIXELSUB_PS_FUNC(32, 32, cpu); - -#define CHROMA_422_PIXELSUB_DEF(cpu) \ - SETUP_CHROMA_PIXELSUB_PS_FUNC(4, 8, cpu); \ - SETUP_CHROMA_PIXELSUB_PS_FUNC(8, 16, cpu); \ - SETUP_CHROMA_PIXELSUB_PS_FUNC(16, 32, cpu); \ - SETUP_CHROMA_PIXELSUB_PS_FUNC(32, 64, cpu); - -#define SETUP_LUMA_PIXELSUB_PS_FUNC(W, H, cpu) \ - void x265_pixel_sub_ps_ ## W ## x ## H ## cpu(int16_t* dest, intptr_t destride, const pixel* src0, const pixel* src1, intptr_t srcstride0, intptr_t srcstride1); \ - void x265_pixel_add_ps_ ## W ## x ## H ## cpu(pixel* dest, intptr_t destride, const pixel* src0, const int16_t* src1, intptr_t srcStride0, intptr_t srcStride1); - -#define LUMA_PIXELSUB_DEF(cpu) \ - SETUP_LUMA_PIXELSUB_PS_FUNC(8, 8, cpu); \ - SETUP_LUMA_PIXELSUB_PS_FUNC(16, 16, cpu); \ - SETUP_LUMA_PIXELSUB_PS_FUNC(32, 32, cpu); \ - SETUP_LUMA_PIXELSUB_PS_FUNC(64, 64, cpu); - -LUMA_PIXELSUB_DEF(_sse2); -CHROMA_420_PIXELSUB_DEF(_sse2); -CHROMA_422_PIXELSUB_DEF(_sse2); - -LUMA_PIXELSUB_DEF(_sse4); -CHROMA_420_PIXELSUB_DEF(_sse4); -CHROMA_422_PIXELSUB_DEF(_sse4); - -#define SETUP_LUMA_PIXELVAR_FUNC(W, H, cpu) \ - uint64_t x265_pixel_var_ ## W ## x ## H ## cpu(const pixel* pix, intptr_t pixstride); - -#define LUMA_PIXELVAR_DEF(cpu) \ - SETUP_LUMA_PIXELVAR_FUNC(8, 8, cpu); \ - SETUP_LUMA_PIXELVAR_FUNC(16, 16, cpu); \ - SETUP_LUMA_PIXELVAR_FUNC(32, 32, cpu); \ - SETUP_LUMA_PIXELVAR_FUNC(64, 64, cpu); - -LUMA_PIXELVAR_DEF(_sse2); -LUMA_PIXELVAR_DEF(_xop); -LUMA_PIXELVAR_DEF(_avx); - -#undef CHROMA_420_PIXELSUB_DEF -#undef CHROMA_422_PIXELSUB_DEF -#undef LUMA_PIXELSUB_DEF -#undef LUMA_PIXELVAR_DEF -#undef SETUP_CHROMA_PIXELSUB_PS_FUNC -#undef SETUP_LUMA_PIXELSUB_PS_FUNC -#undef SETUP_LUMA_PIXELVAR_FUNC +#define DEFINE_UTILS(cpu) \ + FUNCDEF_TU_S2(void, getResidual, cpu, const pixel* fenc, const pixel* pred, int16_t* residual, intptr_t stride); \ + FUNCDEF_TU_S2(void, transpose, cpu, pixel* dest, const pixel* src, intptr_t stride); \ + 
FUNCDEF_TU(int, count_nonzero, cpu, const int16_t* quantCoeff); \ + uint32_t PFX(quant_ ## cpu(const int16_t* coef, const int32_t* quantCoeff, int32_t* deltaU, int16_t* qCoef, int qBits, int add, int numCoeff)); \ + uint32_t PFX(nquant_ ## cpu(const int16_t* coef, const int32_t* quantCoeff, int16_t* qCoef, int qBits, int add, int numCoeff)); \ + void PFX(dequant_normal_ ## cpu(const int16_t* quantCoef, int16_t* coef, int num, int scale, int shift)); \ + void PFX(dequant_scaling_## cpu(const int16_t* src, const int32_t* dequantCoef, int16_t* dst, int num, int mcqp_miper, int shift)); \ + void PFX(weight_pp_ ## cpu(const pixel* src, pixel* dst, intptr_t stride, int width, int height, int w0, int round, int shift, int offset)); \ + void PFX(weight_sp_ ## cpu(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset)); \ + void PFX(scale1D_128to64_ ## cpu(pixel*, const pixel*)); \ + void PFX(scale2D_64to32_ ## cpu(pixel*, const pixel*, intptr_t)); \ + uint32_t PFX(costCoeffRemain_ ## cpu(uint16_t *absCoeff, int numNonZero, int idx)); \ + uint32_t PFX(costC1C2Flag_sse2(uint16_t *absCoeff, intptr_t numNonZero, uint8_t *baseCtxMod, intptr_t ctxOffset)); \ + +DEFINE_UTILS(sse2); +DEFINE_UTILS(ssse3); +DEFINE_UTILS(sse4); +DEFINE_UTILS(avx2); + +#undef DEFINE_UTILS + +void PFX(pixel_ssim_4x4x2_core_sse2(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums[2][4])); +void PFX(pixel_ssim_4x4x2_core_avx(const pixel* pix1, intptr_t stride1, const pixel* pix2, intptr_t stride2, int sums[2][4])); +float PFX(pixel_ssim_end4_sse2(int sum0[5][4], int sum1[5][4], int width)); +float PFX(pixel_ssim_end4_avx(int sum0[5][4], int sum1[5][4], int width)); + +int PFX(scanPosLast_x64(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize)); +int PFX(scanPosLast_avx2_bmi2(const uint16_t *scan, const coeff_t *coeff, uint16_t *coeffSign, uint16_t *coeffFlag, uint8_t *coeffNum, int numSig, const uint16_t* scanCG4x4, const int trSize)); +uint32_t PFX(findPosFirstLast_ssse3(const int16_t *dstCoeff, const intptr_t trSize, const uint16_t scanTbl[16])); +uint32_t PFX(costCoeffNxN_sse4(const uint16_t *scan, const coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, const uint8_t *tabSigCtx, uint32_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase)); #endif // ifndef X265_PIXEL_UTIL_H
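The header rewrite above swaps roughly a hundred per-function prototypes for the PFX/FUNCDEF macro family, so each CPU variant set is declared once through DEFINE_UTILS(cpu). A minimal sketch of how such a prefix macro could expand; the real definitions live elsewhere in x265's common headers, so treat the bodies below as illustrative assumptions, not the actual code:

    /* Illustrative only: a two-level paste so the prefix macro is
     * expanded before token concatenation takes place. */
    #define private_prefix x265
    #define PFX3(prefix, name) prefix ## _ ## name
    #define PFX2(prefix, name) PFX3(prefix, name)
    #define PFX(name)          PFX2(private_prefix, name)

    /* Under these assumptions, PFX(weight_pp_sse4(...)) declares
     * x265_weight_pp_sse4(...), and DEFINE_UTILS(sse4) stamps out the
     * whole sse4 prototype set in a single line. */

The same private_prefix token shows up on the asm side of this revision (cextern_naked private_prefix %+ _entropyStateBits), which is what keeps the C prototypes and the assembly symbol names in step.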
View file
x265_1.7.tar.gz/source/common/x86/pixel-util8.asm -> x265_1.8.tar.gz/source/common/x86/pixel-util8.asm
Changed
@@ -28,7 +28,12 @@ SECTION_RODATA 32 -%if BIT_DEPTH == 10 +%if BIT_DEPTH == 12 +ssim_c1: times 4 dd 107321.76 ; .01*.01*4095*4095*64 +ssim_c2: times 4 dd 60851437.92 ; .03*.03*4095*4095*64*63 +pf_64: times 4 dd 64.0 +pf_128: times 4 dd 128.0 +%elif BIT_DEPTH == 10 ssim_c1: times 4 dd 6697.7856 ; .01*.01*1023*1023*64 ssim_c2: times 4 dd 3797644.4352 ; .03*.03*1023*1023*64*63 pf_64: times 4 dd 64.0 @@ -45,18 +50,15 @@ times 16 db 0 deinterleave_shuf: times 2 db 0, 2, 4, 6, 8, 10, 12, 14, 1, 3, 5, 7, 9, 11, 13, 15 deinterleave_word_shuf: times 2 db 0, 1, 4, 5, 8, 9, 12, 13, 2, 3, 6, 7, 10, 11, 14, 15 -hmul_16p: times 16 db 1 - times 8 db 1, -1 hmulw_16p: times 8 dw 1 times 4 dw 1, -1 -trans8_shuf: dd 0, 4, 1, 5, 2, 6, 3, 7 - SECTION .text cextern pw_1 cextern pw_0_15 cextern pb_1 +cextern pb_128 cextern pw_00ff cextern pw_1023 cextern pw_3fff @@ -72,6 +74,10 @@ cextern pb_16 cextern pb_32 cextern pb_64 +cextern hmul_16p +cextern trans8_shuf +cextern_naked private_prefix %+ _entropyStateBits +cextern pb_movemask ;----------------------------------------------------------------------------- ; void getResidual(pixel *fenc, pixel *pred, int16_t *residual, intptr_t stride) @@ -627,7 +633,12 @@ movd xm6, r4d ; m6 = qbits8 ; fill offset +%if UNIX64 == 0 vpbroadcastd m5, r5m ; m5 = add +%else ; Mac + movd xm5, r5m + vpbroadcastd m5, xm5 ; m5 = add +%endif lea r5, [pw_1] @@ -699,7 +710,12 @@ movd xm6, r4d ; m6 = qbits8 ; fill offset - vpbroadcastd m5, r5m ; m5 = ad +%if UNIX64 == 0 + vpbroadcastd m5, r5m ; m5 = add +%else ; Mac + movd xm5, r5m + vpbroadcastd m5, xm5 ; m5 = add +%endif lea r5, [pd_1] @@ -817,7 +833,12 @@ INIT_YMM avx2 cglobal nquant, 3,5,7 +%if UNIX64 == 0 vpbroadcastd m4, r4m +%else ; Mac + movd xm4, r4m + vpbroadcastd m4, xm4 +%endif vpbroadcastd m6, [pw_1] mov r4d, r5m pxor m5, m5 ; m7 = numZero @@ -873,8 +894,8 @@ %if HIGH_BIT_DEPTH cmp r3d, 32767 jle .skip - shr r3d, 2 - sub r4d, 2 + shr r3d, (BIT_DEPTH - 8) + sub r4d, (BIT_DEPTH - 8) .skip: %endif movd m0, r4d ; m0 = shift @@ -903,6 +924,136 @@ jnz .loop RET +;---------------------------------------------------------------------------------------------------------------------- +;void dequant_scaling(const int16_t* src, const int32_t* dequantCoef, int16_t* dst, int num, int mcqp_miper, int shift) +;---------------------------------------------------------------------------------------------------------------------- +INIT_XMM sse4 +cglobal dequant_scaling, 6,6,6 + add r5d, 4 + shr r3d, 3 ; num/8 + cmp r5d, r4d + jle .skip + sub r5d, r4d + mova m0, [pd_1] + movd m1, r5d ; shift - per + dec r5d + movd m2, r5d ; shift - per - 1 + pslld m0, m2 ; 1 << shift - per - 1 + +.part0: + pmovsxwd m2, [r0] + pmovsxwd m4, [r0 + 8] + movu m3, [r1] + movu m5, [r1 + 16] + pmulld m2, m3 + pmulld m4, m5 + paddd m2, m0 + paddd m4, m0 + psrad m2, m1 + psrad m4, m1 + packssdw m2, m4 + movu [r2], m2 + + add r0, 16 + add r1, 32 + add r2, 16 + dec r3d + jnz .part0 + jmp .end + +.skip: + sub r4d, r5d ; per - shift + movd m0, r4d + +.part1: + pmovsxwd m2, [r0] + pmovsxwd m4, [r0 + 8] + movu m3, [r1] + movu m5, [r1 + 16] + pmulld m2, m3 + pmulld m4, m5 + packssdw m2, m4 + pmovsxwd m1, m2 + psrldq m2, 8 + pmovsxwd m2, m2 + pslld m1, m0 + pslld m2, m0 + packssdw m1, m2 + movu [r2], m1 + + add r0, 16 + add r1, 32 + add r2, 16 + dec r3d + jnz .part1 +.end: + RET + +;---------------------------------------------------------------------------------------------------------------------- +;void dequant_scaling(const int16_t* src, const int32_t* dequantCoef, int16_t* 
dst, int num, int mcqp_miper, int shift) +;---------------------------------------------------------------------------------------------------------------------- +INIT_YMM avx2 +cglobal dequant_scaling, 6,6,6 + add r5d, 4 + shr r3d, 4 ; num/16 + cmp r5d, r4d + jle .skip + sub r5d, r4d + mova m0, [pd_1] + movd xm1, r5d ; shift - per + dec r5d + movd xm2, r5d ; shift - per - 1 + pslld m0, xm2 ; 1 << shift - per - 1 + +.part0: + pmovsxwd m2, [r0] + pmovsxwd m4, [r0 + 16] + movu m3, [r1] + movu m5, [r1 + 32] + pmulld m2, m3 + pmulld m4, m5 + paddd m2, m0 + paddd m4, m0 + psrad m2, xm1 + psrad m4, xm1 + packssdw m2, m4 + vpermq m2, m2, 11011000b + movu [r2], m2 + + add r0, 32 + add r1, 64 + add r2, 32 + dec r3d + jnz .part0 + jmp .end + +.skip: + sub r4d, r5d ; per - shift + movd xm0, r4d + +.part1: + pmovsxwd m2, [r0] + pmovsxwd m4, [r0 + 16] + movu m3, [r1] + movu m5, [r1 + 32] + pmulld m2, m3 + pmulld m4, m5 + packssdw m2, m4 + vextracti128 xm4, m2, 1 + pmovsxwd m1, xm2 + pmovsxwd m2, xm4 + pslld m1, xm0 + pslld m2, xm0 + packssdw m1, m2 + movu [r2], m1 + + add r0, 32 + add r1, 64 + add r2, 32 + dec r3d + jnz .part1 +.end: + RET INIT_YMM avx2 cglobal dequant_normal, 5,5,7 @@ -912,14 +1063,16 @@ %if HIGH_BIT_DEPTH cmp r3d, 32767 jle .skip - shr r3d, 2 - sub r4d, 2 + shr r3d, (BIT_DEPTH - 8) + sub r4d, (BIT_DEPTH - 8) .skip: %endif movd xm0, r4d ; m0 = shift add r4d, -1+16 bts r3d, r4d - vpbroadcastd m1, r3d ; m1 = dword [add scale] + + movd xm1, r3d + vpbroadcastd m1, xm1 ; m1 = dword [add scale] ; m0 = shift ; m1 = scale @@ -950,9 +1103,9 @@ ;----------------------------------------------------------------------------- -; int x265_count_nonzero_4x4_ssse3(const int16_t *quantCoeff); +; int x265_count_nonzero_4x4_sse2(const int16_t *quantCoeff); ;----------------------------------------------------------------------------- -INIT_XMM ssse3 +INIT_XMM sse2 cglobal count_nonzero_4x4, 1,1,2 pxor m0, m0 @@ -974,23 +1127,18 @@ INIT_YMM avx2 cglobal count_nonzero_4x4, 1,1,2 pxor m0, m0 - - mova m1, [r0 + 0] - packsswb m1, [r0 + 16] - pcmpeqb m1, m0 - paddb m1, [pb_1] - - psadbw m1, m0 - pshufd m0, m1, 2 - paddd m1, m0 - movd eax, xm1 + movu m1, [r0] + pcmpeqw m1, m0 + pmovmskb eax, m1 + not eax + popcnt eax, eax + shr eax, 1 RET - ;----------------------------------------------------------------------------- -; int x265_count_nonzero_8x8_ssse3(const int16_t *quantCoeff); +; int x265_count_nonzero_8x8_sse2(const int16_t *quantCoeff); ;----------------------------------------------------------------------------- -INIT_XMM ssse3 +INIT_XMM sse2 cglobal count_nonzero_8x8, 1,1,3 pxor m0, m0 movu m1, [pb_4] @@ -1038,9 +1186,9 @@ ;----------------------------------------------------------------------------- -; int x265_count_nonzero_16x16_ssse3(const int16_t *quantCoeff); +; int x265_count_nonzero_16x16_sse2(const int16_t *quantCoeff); ;----------------------------------------------------------------------------- -INIT_XMM ssse3 +INIT_XMM sse2 cglobal count_nonzero_16x16, 1,1,3 pxor m0, m0 movu m1, [pb_16] @@ -1087,9 +1235,9 @@ ;----------------------------------------------------------------------------- -; int x265_count_nonzero_32x32_ssse3(const int16_t *quantCoeff); +; int x265_count_nonzero_32x32_sse2(const int16_t *quantCoeff); ;----------------------------------------------------------------------------- -INIT_XMM ssse3 +INIT_XMM sse2 cglobal count_nonzero_32x32, 1,1,3 pxor m0, m0 movu m1, [pb_64] @@ -1142,13 +1290,7 @@ INIT_XMM sse4 cglobal weight_pp, 4,7,7 %define correction (14 - BIT_DEPTH) -%if 
BIT_DEPTH == 10 - mova m6, [pw_1023] -%elif BIT_DEPTH == 12 - mova m6, [pw_3fff] -%else - %error Unsupported BIT_DEPTH! -%endif + mova m6, [pw_pixel_max] mov r6d, r6m mov r4d, r4m mov r5d, r5m @@ -1279,7 +1421,61 @@ %endif ; end of (HIGH_BIT_DEPTH == 0) +%if HIGH_BIT_DEPTH +INIT_YMM avx2 +cglobal weight_pp, 6, 7, 7 +%define correction (14 - BIT_DEPTH) + mov r6d, r6m + shl r6d, 16 - correction + or r6d, r5d ; assuming both w0 and round are using maximum of 16 bits each. + + movd xm0, r6d + vpbroadcastd m0, xm0 + mov r5d, r7m + sub r5d, correction + movd xm1, r5d + vpbroadcastd m2, r8m + mova m5, [pw_1] + mova m6, [pw_pixel_max] + add r2d, r2d + add r3d, r3d + sub r2d, r3d + shr r3d, 5 + +.loopH: + mov r5d, r3d + +.loopW: + movu m4, [r0] + punpcklwd m3, m4, m5 + pmaddwd m3, m0 + psrad m3, xm1 + paddd m3, m2 + + punpckhwd m4, m5 + pmaddwd m4, m0 + psrad m4, xm1 + paddd m4, m2 + + packusdw m3, m4 + pminuw m3, m6 + movu [r1], m3 + + add r0, 32 + add r1, 32 + + dec r5d + jnz .loopW + + lea r0, [r0 + r2] + lea r1, [r1 + r2] + + dec r4d + jnz .loopH +%undef correction + RET +%else INIT_YMM avx2 cglobal weight_pp, 6, 7, 6 @@ -1288,7 +1484,8 @@ shl r6d, 16 or r6d, r5d ; assuming both (w0<<6) and round are using maximum of 16 bits each. - vpbroadcastd m0, r6d + movd xm0, r6d + vpbroadcastd m0, xm0 movd xm1, r7m vpbroadcastd m2, r8m @@ -1328,20 +1525,14 @@ dec r4d jnz .loopH RET - +%endif ;------------------------------------------------------------------------------------------------------------------------------------------------- ;void weight_sp(int16_t *src, pixel *dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset) ;------------------------------------------------------------------------------------------------------------------------------------------------- %if HIGH_BIT_DEPTH INIT_XMM sse4 cglobal weight_sp, 6,7,8 -%if BIT_DEPTH == 10 - mova m1, [pw_1023] -%elif BIT_DEPTH == 12 - mova m1, [pw_3fff] -%else - %error Unsupported BIT_DEPTH! 
-%endif + mova m1, [pw_pixel_max] mova m2, [pw_1] mov r6d, r7m shl r6d, 16 @@ -1493,15 +1684,138 @@ dec r5d jnz .loopH RET +%endif -%if ARCH_X86_64 + +%if ARCH_X86_64 == 1 +%if HIGH_BIT_DEPTH +INIT_YMM avx2 +cglobal weight_sp, 6,7,9 + mova m1, [pw_pixel_max] + mova m2, [pw_1] + mov r6d, r7m + shl r6d, 16 + or r6d, r6m + movd xm3, r6d + vpbroadcastd m3, xm3 ; m3 = [round w0] + movd xm4, r8m ; m4 = [shift] + vpbroadcastd m5, r9m ; m5 = [offset] + + ; correct row stride + add r3d, r3d + add r2d, r2d + mov r6d, r4d + and r6d, ~(mmsize / SIZEOF_PIXEL - 1) + sub r3d, r6d + sub r3d, r6d + sub r2d, r6d + sub r2d, r6d + + ; generate partial width mask (MUST BE IN YMM0) + mov r6d, r4d + and r6d, (mmsize / SIZEOF_PIXEL - 1) + movd xm0, r6d + pshuflw m0, m0, 0 + punpcklqdq m0, m0 + vinserti128 m0, m0, xm0, 1 + pcmpgtw m0, [pw_0_15] + +.loopH: + mov r6d, r4d + +.loopW: + movu m6, [r0] + paddw m6, [pw_2000] + + punpcklwd m7, m6, m2 + pmaddwd m7, m3 ;(round w0) + psrad m7, xm4 ;(shift) + paddd m7, m5 ;(offset) + + punpckhwd m6, m2 + pmaddwd m6, m3 + psrad m6, xm4 + paddd m6, m5 + + packusdw m7, m6 + pminuw m7, m1 + + sub r6d, (mmsize / SIZEOF_PIXEL) + jl .width14 + movu [r1], m7 + lea r0, [r0 + mmsize] + lea r1, [r1 + mmsize] + je .nextH + jmp .loopW + +.width14: + add r6d, 16 + cmp r6d, 14 + jl .width12 + movu [r1], xm7 + vextracti128 xm8, m7, 1 + movq [r1 + 16], xm8 + pextrd [r1 + 24], xm8, 2 + je .nextH + +.width12: + cmp r6d, 12 + jl .width10 + movu [r1], xm7 + vextracti128 xm8, m7, 1 + movq [r1 + 16], xm8 + je .nextH + +.width10: + cmp r6d, 10 + jl .width8 + movu [r1], xm7 + vextracti128 xm8, m7, 1 + movd [r1 + 16], xm8 + je .nextH + +.width8: + cmp r6d, 8 + jl .width6 + movu [r1], xm7 + je .nextH + +.width6 + cmp r6d, 6 + jl .width4 + movq [r1], xm7 + pextrd [r1 + 8], xm7, 2 + je .nextH + +.width4: + cmp r6d, 4 + jl .width2 + movq [r1], xm7 + je .nextH + add r1, 4 + pshufd m6, m6, 1 + je .nextH + +.width2: + movd [r1], xm7 + +.nextH: + add r0, r2 + add r1, r3 + + dec r5d + jnz .loopH + RET + +%else INIT_YMM avx2 cglobal weight_sp, 6, 9, 7 mov r7d, r7m shl r7d, 16 or r7d, r6m - vpbroadcastd m0, r7d ; m0 = times 8 dw w0, round - movd xm1, r8m ; m1 = [shift] + movd xm0, r7d + vpbroadcastd m0, xm0 ; m0 = times 8 dw w0, round + movd xm1, r8m ; m1 = [shift] vpbroadcastd m2, r9m ; m2 = times 16 dw offset vpbroadcastw m3, [pw_1] vpbroadcastw m4, [pw_2000] @@ -1571,8 +1885,7 @@ jnz .loopH RET %endif -%endif ; end of (HIGH_BIT_DEPTH == 0) - +%endif ;----------------------------------------------------------------- ; void transpose_4x4(pixel *dst, pixel *src, intptr_t stride) @@ -3330,7 +3643,7 @@ TRANSPOSE4x4D 0, 1, 2, 3, 4 ; s1=m0, s2=m1, ss=m2, s12=m3 -%if BIT_DEPTH == 10 +%if BIT_DEPTH >= 10 cvtdq2ps m0, m0 cvtdq2ps m1, m1 cvtdq2ps m2, m2 @@ -5478,6 +5791,19 @@ RET %endmacro +%macro VAR_END_12bit 2 + HADDD m5, m1 + HADDD m6, m1 +%if ARCH_X86_64 + punpckldq m5, m6 + movq rax, m5 +%else + movd eax, m5 + movd edx, m6 +%endif + RET +%endmacro + %macro VAR_CORE 0 paddw m5, m0 paddw m5, m3 @@ -5493,9 +5819,9 @@ paddd m6, m4 %endmacro -%macro VAR_2ROW 3 +%macro VAR_2ROW 2 mov r2d, %2 -.loop%3: +%%loop: %if HIGH_BIT_DEPTH movu m0, [r0] movu m1, [r0+mmsize] @@ -5519,7 +5845,7 @@ %endif ; !HIGH_BIT_DEPTH VAR_CORE dec r2d - jg .loop%3 + jg %%loop %endmacro ;----------------------------------------------------------------------------- @@ -5529,144 +5855,377 @@ cglobal pixel_var_16x16, 2,3 FIX_STRIDES r1 VAR_START 0 - VAR_2ROW 8*SIZEOF_PIXEL, 16, 1 + VAR_2ROW 8*SIZEOF_PIXEL, 16 VAR_END 16, 16 cglobal pixel_var_8x8, 
2,3 FIX_STRIDES r1 VAR_START 0 - VAR_2ROW r1, 4, 1 + VAR_2ROW r1, 4 VAR_END 8, 8 %if HIGH_BIT_DEPTH %macro VAR 0 + +%if BIT_DEPTH <= 10 cglobal pixel_var_16x16, 2,3,8 FIX_STRIDES r1 VAR_START 0 - VAR_2ROW r1, 8, 1 + VAR_2ROW r1, 8 VAR_END 16, 16 -cglobal pixel_var_8x8, 2,3,8 - lea r2, [r1*3] - VAR_START 0 - movu m0, [r0] - movu m1, [r0+r1*2] - movu m3, [r0+r1*4] - movu m4, [r0+r2*2] - lea r0, [r0+r1*8] - VAR_CORE - movu m0, [r0] - movu m1, [r0+r1*2] - movu m3, [r0+r1*4] - movu m4, [r0+r2*2] - VAR_CORE - VAR_END 8, 8 - cglobal pixel_var_32x32, 2,6,8 FIX_STRIDES r1 mov r3, r0 VAR_START 0 - VAR_2ROW r1, 8, 1 + VAR_2ROW r1, 8 HADDW m5, m2 movd r4d, m5 pxor m5, m5 - VAR_2ROW r1, 8, 2 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 lea r0, [r3 + 32] - VAR_2ROW r1, 8, 3 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 - VAR_2ROW r1, 8, 4 + VAR_2ROW r1, 8 VAR_END 32, 32 cglobal pixel_var_64x64, 2,6,8 FIX_STRIDES r1 mov r3, r0 VAR_START 0 - VAR_2ROW r1, 8, 1 + VAR_2ROW r1, 8 HADDW m5, m2 movd r4d, m5 pxor m5, m5 - VAR_2ROW r1, 8, 2 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 - VAR_2ROW r1, 8, 3 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 - VAR_2ROW r1, 8, 4 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 lea r0, [r3 + 32] - VAR_2ROW r1, 8, 5 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 - VAR_2ROW r1, 8, 6 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 - VAR_2ROW r1, 8, 7 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 - VAR_2ROW r1, 8, 8 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 lea r0, [r3 + 64] - VAR_2ROW r1, 8, 9 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 - VAR_2ROW r1, 8, 10 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 - VAR_2ROW r1, 8, 11 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 - VAR_2ROW r1, 8, 12 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 lea r0, [r3 + 96] - VAR_2ROW r1, 8, 13 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 - VAR_2ROW r1, 8, 14 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 - VAR_2ROW r1, 8, 15 + VAR_2ROW r1, 8 HADDW m5, m2 movd r5d, m5 add r4, r5 pxor m5, m5 - VAR_2ROW r1, 8, 16 + VAR_2ROW r1, 8 VAR_END 64, 64 + +%else ; BIT_DEPTH <= 10 + +cglobal pixel_var_16x16, 2,3,8 + FIX_STRIDES r1 + VAR_START 0 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + mova m7, m5 + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m5, m7 + VAR_END_12bit 16, 16 + +cglobal pixel_var_32x32, 2,6,8 + FIX_STRIDES r1 + mov r3, r0 + VAR_START 0 + + VAR_2ROW r1, 4 + HADDUWD m5, m1 + mova m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + lea r0, [r3 + 32] + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m5, m7 + VAR_END_12bit 32, 32 + +cglobal pixel_var_64x64, 2,6,8 + FIX_STRIDES r1 + mov r3, r0 + VAR_START 0 + + VAR_2ROW r1, 4 + HADDUWD m5, m1 + mova m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor 
m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + lea r0, [r3 + 16 * SIZEOF_PIXEL] + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + lea r0, [r3 + 32 * SIZEOF_PIXEL] + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + lea r0, [r3 + 48 * SIZEOF_PIXEL] + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m7, m5 + + pxor m5, m5 + VAR_2ROW r1, 4 + HADDUWD m5, m1 + paddd m5, m7 + VAR_END_12bit 64, 64 + +%endif ; BIT_DEPTH <= 10 + +cglobal pixel_var_8x8, 2,3,8 + lea r2, [r1*3] + VAR_START 0 + movu m0, [r0] + movu m1, [r0+r1*2] + movu m3, [r0+r1*4] + movu m4, [r0+r2*2] + lea r0, [r0+r1*8] + VAR_CORE + movu m0, [r0] + movu m1, [r0+r1*2] + movu m3, [r0+r1*4] + movu m4, [r0+r2*2] + VAR_CORE + VAR_END 8, 8 + %endmacro ; VAR INIT_XMM sse2 @@ -6046,11 +6605,736 @@ pshufb m1, m0 ; get First and Last pos - xor eax, eax pmovmskb r0d, m1 - not r0w + not r0d bsr r1w, r0w - bsf ax, r0w + bsf eax, r0d ; side effect: clear AH to Zero shl r1d, 16 or eax, r1d RET + + +;void saoCuStatsE2_c(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int8_t *upBufft, int endX, int endY, int32_t *stats, int32_t *count) +;{ +; X265_CHECK(endX < MAX_CU_SIZE, "endX check failure\n"); +; X265_CHECK(endY < MAX_CU_SIZE, "endY check failure\n"); +; int x, y; +; int32_t tmp_stats[SAO::NUM_EDGETYPE]; +; int32_t tmp_count[SAO::NUM_EDGETYPE]; +; memset(tmp_stats, 0, sizeof(tmp_stats)); +; memset(tmp_count, 0, sizeof(tmp_count)); +; for (y = 0; y < endY; y++) +; { +; upBufft[0] = signOf(rec[stride] - rec[-1]); +; for (x = 0; x < endX; x++) +; { +; int signDown = signOf2(rec[x], rec[x + stride + 1]); +; X265_CHECK(signDown == signOf(rec[x] - rec[x + stride + 1]), "signDown check failure\n"); +; uint32_t edgeType = signDown + upBuff1[x] + 2; +; upBufft[x + 1] = (int8_t)(-signDown); +; tmp_stats[edgeType] += (fenc[x] - rec[x]); +; tmp_count[edgeType]++; +; } +; std::swap(upBuff1, upBufft); +; rec += stride; +; fenc += stride; +; } +; for (x = 0; x < SAO::NUM_EDGETYPE; x++) +; { +; stats[SAO::s_eoTable[x]] += 
tmp_stats[x]; +; count[SAO::s_eoTable[x]] += tmp_count[x]; +; } +;} + +%if ARCH_X86_64 +; TODO: x64 only because I need temporary register r7,r8, easy portab to x86 +INIT_XMM sse4 +cglobal saoCuStatsE2, 5,9,8,0-32 ; Stack: 5 of stats and 5 of count + mov r5d, r5m + + ; clear internal temporary buffer + pxor m0, m0 + mova [rsp], m0 + mova [rsp + mmsize], m0 + mova m0, [pb_128] + mova m5, [pb_1] + mova m6, [pb_2] + +.loopH: + ; TODO: merge into below SIMD + ; get upBuffX[0] + mov r6b, [r1 + r2] + sub r6b, [r1 - 1] + seta r6b + setb r7b + sub r6b, r7b + mov [r4], r6b + + ; backup unavailable pixels + movh m7, [r4 + r5 + 1] + + mov r6d, r5d +.loopW: + movu m1, [r1] + movu m2, [r1 + r2 + 1] + + ; signDown + pxor m1, m0 + pxor m2, m0 + pcmpgtb m3, m1, m2 + pand m3, m5 + pcmpgtb m2, m1 + por m2, m3 + pxor m3, m3 + psubb m3, m2 + + ; edgeType + movu m4, [r3] + paddb m4, m6 + paddb m2, m4 + + ; update upBuff1 + movu [r4 + 1], m3 + + ; stats[edgeType] + pxor m1, m0 + movu m3, [r0] + punpckhbw m4, m3, m1 + punpcklbw m3, m1 + pmaddubsw m3, [hmul_16p + 16] + pmaddubsw m4, [hmul_16p + 16] + + ; 16 pixels +%assign x 0 +%rep 16 + pextrb r7d, m2, x + inc word [rsp + r7 * 2] + + %if (x < 8) + pextrw r8d, m3, (x % 8) + %else + pextrw r8d, m4, (x % 8) + %endif + movsx r8d, r8w + add [rsp + 5 * 2 + r7 * 4], r8d + + dec r6d + jz .next +%assign x x+1 +%endrep + + add r0, 16 + add r1, 16 + add r3, 16 + add r4, 16 + jmp .loopW + +.next: + xchg r3, r4 + + ; restore pointer upBuff1 + mov r6d, r5d + and r6d, 15 + + ; move to next row + sub r6, r5 + add r3, r6 + add r4, r6 + add r6, r2 + add r0, r6 + add r1, r6 + + ; restore unavailable pixels + movh [r3 + r5 + 1], m7 + + dec byte r6m + jg .loopH + + ; sum to global buffer + mov r1, r7m + mov r0, r8m + + ; s_eoTable = {1,2,0,3,4} + movzx r6d, word [rsp + 0 * 2] + add [r0 + 1 * 4], r6d + movzx r6d, word [rsp + 1 * 2] + add [r0 + 2 * 4], r6d + movzx r6d, word [rsp + 2 * 2] + add [r0 + 0 * 4], r6d + movzx r6d, word [rsp + 3 * 2] + add [r0 + 3 * 4], r6d + movzx r6d, word [rsp + 4 * 2] + add [r0 + 4 * 4], r6d + + mov r6d, [rsp + 5 * 2 + 0 * 4] + add [r1 + 1 * 4], r6d + mov r6d, [rsp + 5 * 2 + 1 * 4] + add [r1 + 2 * 4], r6d + mov r6d, [rsp + 5 * 2 + 2 * 4] + add [r1 + 0 * 4], r6d + mov r6d, [rsp + 5 * 2 + 3 * 4] + add [r1 + 3 * 4], r6d + mov r6d, [rsp + 5 * 2 + 4 * 4] + add [r1 + 4 * 4], r6d + RET +%endif ; ARCH_X86_64 + + +;void saoStatE3(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count); +;{ +; memset(tmp_stats, 0, sizeof(tmp_stats)); +; memset(tmp_count, 0, sizeof(tmp_count)); +; for (y = startY; y < endY; y++) +; { +; for (x = startX; x < endX; x++) +; { +; int signDown = signOf2(rec[x], rec[x + stride - 1]); +; uint32_t edgeType = signDown + upBuff1[x] + 2; +; upBuff1[x - 1] = (int8_t)(-signDown); +; tmp_stats[edgeType] += (fenc[x] - rec[x]); +; tmp_count[edgeType]++; +; } +; upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]); +; rec += stride; +; fenc += stride; +; } +; for (x = 0; x < NUM_EDGETYPE; x++) +; { +; stats[s_eoTable[x]] += tmp_stats[x]; +; count[s_eoTable[x]] += tmp_count[x]; +; } +;} + +%if ARCH_X86_64 +INIT_XMM sse4 +cglobal saoCuStatsE3, 4,9,8,0-32 ; Stack: 5 of stats and 5 of count + mov r4d, r4m + mov r5d, r5m + + ; clear internal temporary buffer + pxor m0, m0 + mova [rsp], m0 + mova [rsp + mmsize], m0 + mova m0, [pb_128] + mova m5, [pb_1] + mova m6, [pb_2] + movh m7, [r3 + r4] + +.loopH: + mov r6d, r4d + +.loopW: + movu m1, [r1] + movu m2, [r1 + r2 - 1] + + ; 
signDown + pxor m1, m0 + pxor m2, m0 + pcmpgtb m3, m1, m2 + pand m3, m5 + pcmpgtb m2, m1 + por m2, m3 + pxor m3, m3 + psubb m3, m2 + + ; edgeType + movu m4, [r3] + paddb m4, m6 + paddb m2, m4 + + ; update upBuff1 + movu [r3 - 1], m3 + + ; stats[edgeType] + pxor m1, m0 + movu m3, [r0] + punpckhbw m4, m3, m1 + punpcklbw m3, m1 + pmaddubsw m3, [hmul_16p + 16] + pmaddubsw m4, [hmul_16p + 16] + + ; 16 pixels +%assign x 0 +%rep 16 + pextrb r7d, m2, x + inc word [rsp + r7 * 2] + + %if (x < 8) + pextrw r8d, m3, (x % 8) + %else + pextrw r8d, m4, (x % 8) + %endif + movsx r8d, r8w + add [rsp + 5 * 2 + r7 * 4], r8d + + dec r6d + jz .next +%assign x x+1 +%endrep + + add r0, 16 + add r1, 16 + add r3, 16 + jmp .loopW + +.next: + ; restore pointer upBuff1 + mov r6d, r4d + and r6d, 15 + + ; move to next row + sub r6, r4 + add r3, r6 + add r6, r2 + add r0, r6 + add r1, r6 + dec r5d + jg .loopH + + ; restore unavailable pixels + movh [r3 + r4], m7 + + ; sum to global buffer + mov r1, r6m + mov r0, r7m + + ; s_eoTable = {1,2,0,3,4} + movzx r6d, word [rsp + 0 * 2] + add [r0 + 1 * 4], r6d + movzx r6d, word [rsp + 1 * 2] + add [r0 + 2 * 4], r6d + movzx r6d, word [rsp + 2 * 2] + add [r0 + 0 * 4], r6d + movzx r6d, word [rsp + 3 * 2] + add [r0 + 3 * 4], r6d + movzx r6d, word [rsp + 4 * 2] + add [r0 + 4 * 4], r6d + + mov r6d, [rsp + 5 * 2 + 0 * 4] + add [r1 + 1 * 4], r6d + mov r6d, [rsp + 5 * 2 + 1 * 4] + add [r1 + 2 * 4], r6d + mov r6d, [rsp + 5 * 2 + 2 * 4] + add [r1 + 0 * 4], r6d + mov r6d, [rsp + 5 * 2 + 3 * 4] + add [r1 + 3 * 4], r6d + mov r6d, [rsp + 5 * 2 + 4 * 4] + add [r1 + 4 * 4], r6d + RET +%endif ; ARCH_X86_64 + + +; uint32_t costCoeffNxN(uint16_t *scan, coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, uint8_t *tabSigCtx, uint16_t scanFlagMask, uint8_t *baseCtx, int offset, int subPosBase) +;for (int i = 0; i < MLS_CG_SIZE; i++) +;{ +; tmpCoeff[i * MLS_CG_SIZE + 0] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 0]); +; tmpCoeff[i * MLS_CG_SIZE + 1] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 1]); +; tmpCoeff[i * MLS_CG_SIZE + 2] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 2]); +; tmpCoeff[i * MLS_CG_SIZE + 3] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 3]); +;} +;do +;{ +; uint32_t blkPos, sig, ctxSig; +; blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff]; +; const uint32_t posZeroMask = (subPosBase + scanPosSigOff) ? ~0 : 0; +; sig = scanFlagMask & 1; +; scanFlagMask >>= 1; +; if (scanPosSigOff + (subSet == 0) + numNonZero) +; { +; const uint32_t cnt = tabSigCtx[blkPos] + offset + posOffset; +; ctxSig = cnt & posZeroMask; +; +; const uint32_t mstate = baseCtx[ctxSig]; +; const uint32_t mps = mstate & 1; +; const uint32_t stateBits = x265_entropyStateBits[mstate ^ sig]; +; uint32_t nextState = (stateBits >> 24) + mps; +; if ((mstate ^ sig) == 1) +; nextState = sig; +; baseCtx[ctxSig] = (uint8_t)nextState; +; sum += stateBits; +; } +; absCoeff[numNonZero] = tmpCoeff[blkPos]; +; numNonZero += sig; +; scanPosSigOff--; +;} +;while(scanPosSigOff >= 0); +; sum &= 0xFFFFFF + +%if ARCH_X86_64 +; uint32_t costCoeffNxN(uint16_t *scan, coeff_t *coeff, intptr_t trSize, uint16_t *absCoeff, uint8_t *tabSigCtx, uint16_t scanFlagMask, uint8_t *baseCtx, int offset, int scanPosSigOff, int subPosBase) +INIT_XMM sse4 +cglobal costCoeffNxN, 6,11,5 + add r2d, r2d + + ; abs(coeff) + movh m1, [r1] + movhps m1, [r1 + r2] + movh m2, [r1 + r2 * 2] + lea r2, [r2 * 3] + movhps m2, [r1 + r2] + pabsw m1, m1 + pabsw m2, m2 + ; r[1-2] free here + + ; WARNING: beyond-bound read here! 
+ ; loading scan table + mov r2d, r8m + xor r2d, 15 + movu m0, [r0 + r2 * 2] + movu m3, [r0 + r2 * 2 + mmsize] + packuswb m0, m3 + pxor m0, [pb_15] + xchg r2d, r8m + ; r[0-1] free here + + ; reorder coeff + mova m3, [deinterleave_shuf] + pshufb m1, m3 + pshufb m2, m3 + punpcklqdq m3, m1, m2 + punpckhqdq m1, m2 + pshufb m3, m0 + pshufb m1, m0 + punpcklbw m2, m3, m1 + punpckhbw m3, m1 + ; r[0-1], m[1] free here + + ; loading tabSigCtx (+offset) + mova m1, [r4] + pshufb m1, m0 + movd m4, r7m + pxor m5, m5 + pshufb m4, m5 + paddb m1, m4 + + ; register mapping + ; m0 - Zigzag + ; m1 - sigCtx + ; {m3,m2} - abs(coeff) + ; r0 - x265_entropyStateBits + ; r1 - baseCtx + ; r2 - scanPosSigOff + ; r3 - absCoeff + ; r4 - nonZero + ; r5 - scanFlagMask + ; r6 - sum + lea r0, [private_prefix %+ _entropyStateBits] + mov r1, r6mp + xor r6d, r6d + xor r4d, r4d + xor r8d, r8d + + test r2d, r2d + jz .idx_zero + +.loop: +; { +; const uint32_t cnt = tabSigCtx[blkPos] + offset + posOffset; +; ctxSig = cnt & posZeroMask; +; const uint32_t mstate = baseCtx[ctxSig]; +; const uint32_t mps = mstate & 1; +; const uint32_t stateBits = x265_entropyStateBits[mstate ^ sig]; +; uint32_t nextState = (stateBits >> 24) + mps; +; if ((mstate ^ sig) == 1) +; nextState = sig; +; baseCtx[ctxSig] = (uint8_t)nextState; +; sum += stateBits; +; } +; absCoeff[numNonZero] = tmpCoeff[blkPos]; +; numNonZero += sig; +; scanPosSigOff--; + + pextrw [r3 + r4 * 2], m2, 0 ; absCoeff[numNonZero] = tmpCoeff[blkPos] + shr r5d, 1 + setc r8b ; r8 = sig + add r4d, r8d ; numNonZero += sig + palignr m4, m3, m2, 2 + psrldq m3, 2 + mova m2, m4 + movd r7d, m1 ; r7 = ctxSig + movzx r7d, r7b + psrldq m1, 1 + movzx r9d, byte [r1 + r7] ; mstate = baseCtx[ctxSig] + mov r10d, r9d + and r10d, 1 ; mps = mstate & 1 + xor r9d, r8d ; r9 = mstate ^ sig + add r6d, [r0 + r9 * 4] ; sum += x265_entropyStateBits[mstate ^ sig] + add r10b, byte [r0 + r9 * 4 + 3] ; nextState = (stateBits >> 24) + mps + cmp r9b, 1 + cmove r10d, r8d + mov byte [r1 + r7], r10b + + dec r2d + jg .loop + +.idx_zero: + pextrw [r3 + r4 * 2], m2, 0 ; absCoeff[numNonZero] = tmpCoeff[blkPos] + add r4b, r8m + xor r2d, r2d + cmp word r9m, 0 + sete r2b + add r4b, r2b + jz .exit + + dec r2b + movd r3d, m1 + and r2d, r3d + + movzx r3d, byte [r1 + r2] ; mstate = baseCtx[ctxSig] + mov r4d, r5d + xor r5d, r3d ; r0 = mstate ^ sig + and r3d, 1 ; mps = mstate & 1 + add r6d, [r0 + r5 * 4] ; sum += x265_entropyStateBits[mstate ^ sig] + add r3b, [r0 + r5 * 4 + 3] ; nextState = (stateBits >> 24) + mps + cmp r5b, 1 + cmove r3d, r4d + mov byte [r1 + r2], r3b + +.exit: +%ifnidn eax,r6d + mov eax, r6d +%endif + and eax, 0xFFFFFF + RET +%endif ; ARCH_X86_64 + + +;uint32_t goRiceParam = 0; +;int firstCoeff2 = 1; +;uint32_t baseLevelN = 0x5555AAAA; // 2-bits encode format baseLevel +;idx = 0; +;do +;{ +; int baseLevel = (baseLevelN & 3) | firstCoeff2; +; baseLevelN >>= 2; +; int codeNumber = absCoeff[idx] - baseLevel; +; if (codeNumber >= 0) +; { +; uint32_t length = 0; +; codeNumber = ((uint32_t)codeNumber >> goRiceParam) - COEF_REMAIN_BIN_REDUCTION; +; if (codeNumber >= 0) +; { +; { +; unsigned long cidx; +; CLZ(cidx, codeNumber + 1); +; length = cidx; +; } +; codeNumber = (length + length); +; } +; sum += (COEF_REMAIN_BIN_REDUCTION + 1 + goRiceParam + codeNumber); +; if (absCoeff[idx] > (COEF_REMAIN_BIN_REDUCTION << goRiceParam)) +; goRiceParam = (goRiceParam + 1) - (goRiceParam >> 2); +; } +; if (absCoeff[idx] >= 2) +; firstCoeff2 = 0; +; idx++; +;} +;while(idx < numNonZero); + +; uint32_t costCoeffRemain(uint16_t 
*absCoeff, int numNonZero, int idx)
+INIT_XMM sse4
+cglobal costCoeffRemain, 0,7,1
+ ; assign RCX to R3
+ ; RAX always maps to R6 and is free
+ %if WIN64
+ DECLARE_REG_TMP 3,1,2,0
+ mov t0, r0
+ mov r4d, r2d
+ %elif ARCH_X86_64
+ ; *nix x64: no register remapping needed
+ DECLARE_REG_TMP 0,1,2,3
+ mov r4d, r2d
+ %else ; X86_32
+ DECLARE_REG_TMP 6,3,2,1
+ mov t0, r0m
+ mov r4d, r2m
+ %endif
+
+ xor t3d, t3d
+ xor r5d, r5d
+
+ lea t0, [t0 + r4 * 2]
+ mov r2d, 3
+
+ ; register mapping
+ ; r2d - baseLevel & tmp
+ ; r4d - idx
+ ; t3 - goRiceParam
+ ; eax - absCoeff[idx] & tmp
+ ; r5 - sum
+
+.loop:
+ mov eax, 1
+ cmp r4d, 8
+ cmovge r2d, eax
+
+ movzx eax, word [t0]
+ add t0, 2
+ sub eax, r2d ; codeNumber = absCoeff[idx] - baseLevel
+ jl .next
+
+ shr eax, t3b ; codeNumber = ((uint32_t)codeNumber >> goRiceParam) - COEF_REMAIN_BIN_REDUCTION
+
+ lea r2d, [rax - 3 + 1] ; CLZ(cidx, codeNumber + 1);
+ bsr r2d, r2d
+ add r2d, r2d ; codeNumber = (length + length)
+
+ sub eax, 3
+ cmovge eax, r2d
+
+ lea eax, [3 + 1 + t3 + rax] ; sum += (COEF_REMAIN_BIN_REDUCTION + 1 + goRiceParam + codeNumber)
+ add r5d, eax
+
+ ; if (absCoeff[idx] > (COEF_REMAIN_BIN_REDUCTION << goRiceParam))
+ ; goRiceParam = (goRiceParam + 1) - (goRiceParam >> 2);
+ cmp t3d, 4
+ setl al
+
+ mov r2d, 3
+ shl r2d, t3b
+ cmp word [t0 - 2], r2w
+ setg r2b
+ and al, r2b
+ add t3b, al
+
+.next:
+ inc r4d
+ mov r2d, 2
+ cmp r4d, r1m
+ jl .loop
+
+ mov eax, r5d
+ RET
+
+
+; uint32_t costC1C2Flag(uint16_t *absCoeff, intptr_t numC1Flag, uint8_t *baseCtxMod, intptr_t ctxOffset)
+;idx = 0;
+;do
+;{
+; uint32_t symbol1 = absCoeff[idx] > 1;
+; uint32_t symbol2 = absCoeff[idx] > 2;
+; {
+; const uint32_t mstate = baseCtxMod[c1];
+; baseCtxMod[c1] = sbacNext(mstate, symbol1);
+; sum += sbacGetEntropyBits(mstate, symbol1);
+; }
+; if (symbol1)
+; c1Next = 0;
+; if (symbol1 + firstC2Flag == 3)
+; firstC2Flag = symbol2;
+; if (symbol1 + firstC2Idx == 9)
+; firstC2Idx = idx;
+; c1 = (c1Next & 3);
+; c1Next >>= 2;
+; idx++;
+;}
+;while(idx < numC1Flag);
+;if (!c1)
+;{
+; baseCtxMod = &m_contextState[(bIsLuma ? 0 : NUM_ABS_FLAG_CTX_LUMA) + OFF_ABS_FLAG_CTX + ctxSet];
+; {
+; const uint32_t mstate = baseCtxMod[0];
+; baseCtxMod[0] = sbacNext(mstate, firstC2Flag);
+; sum += sbacGetEntropyBits(mstate, firstC2Flag);
+; }
+;}
+;m_fracBits += (sum & 0xFFFFFF);
+
+
+; TODO: this needs more registers, so it is written as x64-only for now, but it is easy to port to the x86 platform
+%if ARCH_X86_64
+INIT_XMM sse2
+cglobal costC1C2Flag, 4,12,2
+
+ mova m0, [r0]
+ packsswb m0, m0
+
+ pcmpgtb m1, m0, [pb_1]
+ pcmpgtb m0, [pb_2]
+
+ ; get mask for 'X>1'
+ pmovmskb r0d, m1
+ mov r11d, r0d
+
+ ; clear unavailable coeff flags
+ xor r6d, r6d
+ bts r6d, r1d
+ dec r6d
+ and r11d, r6d
+
+ ; calculate firstC2Idx
+ or r11d, 0x100 ; default value set to 8
+ bsf r11d, r11d
+
+ lea r5, [private_prefix %+ _entropyStateBits]
+ xor r6d, r6d
+ mov r4d, 0xFFFFFFF9
+
+ ; register mapping
+ ; r4d - nextC1
+ ; r5 - x265_entropyStateBits
+ ; r6d - sum
+ ; r[7-10] - tmp
+ ; r11d - firstC2Idx (not used in loop)
+
+ ; process c1 flag
+.loop:
+ ; const uint32_t mstate = baseCtx[ctxSig];
+ ; const uint32_t mps = mstate & 1;
+ ; const uint32_t stateBits = x265_entropyStateBits[mstate ^ sig];
+ ; uint32_t nextState = (stateBits >> 24) + mps;
+ ; if ((mstate ^ sig) == 1)
+ ; nextState = sig;
+ mov r10d, r4d ; c1
+ and r10d, 3
+ shr r4d, 2
+
+ xor r7d, r7d
+ shr r0d, 1
+ cmovc r4d, r7d ; c1 <- 0 when C1Flag=1
+ setc r7b ; symbol1
+
+ movzx r8d, byte [r2 + r10] ; mstate = baseCtx[c1]
+ mov r9d, r7d ; sig = symbol1
+ xor r7d, r8d ; mstate ^ sig
+ and r8d, 1 ; mps = mstate & 1
+ add r6d, [r5 + r7 * 4] ; sum += x265_entropyStateBits[mstate ^ sig]
+ add r8b, [r5 + r7 * 4 + 3] ; nextState = (stateBits >> 24) + mps
+ cmp r7b, 1 ; if ((mstate ^ sig) == 1) nextState = sig;
+ cmove r8d, r9d
+ mov byte [r2 + r10], r8b
+
+ dec r1d
+ jg .loop
+
+ ; check and generate c1 flag
+ shl r4d, 30
+ jnz .quit
+
+ ; move to c2 ctx
+ add r2, r3
+
+ ; process c2 flag
+ pmovmskb r8d, m0
+ bt r8d, r11d
+ setc r7b
+
+ movzx r8d, byte [r2] ; mstate = baseCtx[c1]
+ mov r1d, r7d ; sig = symbol1
+ xor r7d, r8d ; mstate ^ sig
+ and r8d, 1 ; mps = mstate & 1
+ add r6d, [r5 + r7 * 4] ; sum += x265_entropyStateBits[mstate ^ sig]
+ add r8b, [r5 + r7 * 4 + 3] ; nextState = (stateBits >> 24) + mps
+ cmp r7b, 1 ; if ((mstate ^ sig) == 1) nextState = sig;
+ cmove r8d, r1d
+ mov byte [r2], r8b
+
+.quit:
+ shrd r4d, r11d, 4
+%ifnidn r6d,eax
+ mov eax, r6d
+%endif
+ and eax, 0x00FFFFFF
+ or eax, r4d
+ RET
+%endif ; ARCH_X86_64
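
All three kernels above (costCoeffNxN, costCoeffRemain, costC1C2Flag) revolve around the same CABAC bit-estimation step spelled out in the C reference comments: index the packed x265_entropyStateBits table by (mstate ^ symbol), accumulate the fractional bit cost, and advance the context state. The following scalar restatement of that commented reference is for orientation only; the table layout (bit cost in the low 24 bits, state increment in the top byte) is inferred from the asm above, and the table symbol as written here is an assumption.

#include <cstdint>

extern const uint32_t entropyStateBits[]; // indexed by (mstate ^ symbol); assumed declaration

static inline void estimateBit(uint8_t* baseCtx, uint32_t ctx, uint32_t symbol, uint32_t& sum)
{
    const uint32_t mstate = baseCtx[ctx];          // current context state, LSB = MPS
    const uint32_t mps    = mstate & 1;
    const uint32_t packed = entropyStateBits[mstate ^ symbol];
    uint32_t nextState    = (packed >> 24) + mps;  // normal state transition
    if ((mstate ^ symbol) == 1)                    // LPS at the terminal state: MPS flips
        nextState = symbol;
    baseCtx[ctx] = (uint8_t)nextState;
    sum += packed;                                 // cost accumulates in the low 24 bits
}

The callers then mask the final sum with 0xFFFFFF, exactly as the .exit and .quit paths above do.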
View file
x265_1.7.tar.gz/source/common/x86/pixel.h -> x265_1.8.tar.gz/source/common/x86/pixel.h
Changed
@@ -28,260 +28,41 @@ #ifndef X265_I386_PIXEL_H #define X265_I386_PIXEL_H -#define DECL_PIXELS(ret, name, suffix, args) \ - ret x265_pixel_ ## name ## _16x64_ ## suffix args; \ - ret x265_pixel_ ## name ## _16x32_ ## suffix args; \ - ret x265_pixel_ ## name ## _16x16_ ## suffix args; \ - ret x265_pixel_ ## name ## _16x12_ ## suffix args; \ - ret x265_pixel_ ## name ## _16x8_ ## suffix args; \ - ret x265_pixel_ ## name ## _16x4_ ## suffix args; \ - ret x265_pixel_ ## name ## _8x32_ ## suffix args; \ - ret x265_pixel_ ## name ## _8x16_ ## suffix args; \ - ret x265_pixel_ ## name ## _8x8_ ## suffix args; \ - ret x265_pixel_ ## name ## _8x4_ ## suffix args; \ - ret x265_pixel_ ## name ## _4x16_ ## suffix args; \ - ret x265_pixel_ ## name ## _4x8_ ## suffix args; \ - ret x265_pixel_ ## name ## _4x4_ ## suffix args; \ - ret x265_pixel_ ## name ## _32x8_ ## suffix args; \ - ret x265_pixel_ ## name ## _32x16_ ## suffix args; \ - ret x265_pixel_ ## name ## _32x24_ ## suffix args; \ - ret x265_pixel_ ## name ## _24x32_ ## suffix args; \ - ret x265_pixel_ ## name ## _32x32_ ## suffix args; \ - ret x265_pixel_ ## name ## _32x64_ ## suffix args; \ - ret x265_pixel_ ## name ## _64x16_ ## suffix args; \ - ret x265_pixel_ ## name ## _64x32_ ## suffix args; \ - ret x265_pixel_ ## name ## _64x48_ ## suffix args; \ - ret x265_pixel_ ## name ## _64x64_ ## suffix args; \ - ret x265_pixel_ ## name ## _48x64_ ## suffix args; \ - ret x265_pixel_ ## name ## _24x32_ ## suffix args; \ - ret x265_pixel_ ## name ## _12x16_ ## suffix args; \ - -#define DECL_X1(name, suffix) \ - DECL_PIXELS(int, name, suffix, (const pixel*, intptr_t, const pixel*, intptr_t)) - -#define DECL_X1_SS(name, suffix) \ - DECL_PIXELS(int, name, suffix, (const int16_t*, intptr_t, const int16_t*, intptr_t)) - -#define DECL_X1_SP(name, suffix) \ - DECL_PIXELS(int, name, suffix, (const int16_t*, intptr_t, const pixel*, intptr_t)) - -#define DECL_X4(name, suffix) \ - DECL_PIXELS(void, name ## _x3, suffix, (const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*)) \ - DECL_PIXELS(void, name ## _x4, suffix, (const pixel*, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*)) - -/* sad-a.asm */ -DECL_X1(sad, mmx2) -DECL_X1(sad, sse2) -DECL_X4(sad, sse2_misalign) -DECL_X1(sad, sse3) -DECL_X1(sad, sse2_aligned) -DECL_X1(sad, ssse3) -DECL_X1(sad, ssse3_aligned) -DECL_X1(sad, avx2) -DECL_X1(sad, avx2_aligned) -DECL_X4(sad, mmx2) -DECL_X4(sad, sse2) -DECL_X4(sad, sse3) -DECL_X4(sad, ssse3) -DECL_X4(sad, avx) -DECL_X4(sad, avx2) -DECL_X1(sad, cache32_mmx2); -DECL_X1(sad, cache64_mmx2); -DECL_X1(sad, cache64_sse2); -DECL_X1(sad, cache64_ssse3); -DECL_X4(sad, cache32_mmx2); -DECL_X4(sad, cache64_mmx2); -DECL_X4(sad, cache64_sse2); -DECL_X4(sad, cache64_ssse3); - -/* pixel-a.asm */ -DECL_X1(satd, mmx2) -DECL_X1(satd, sse2) -DECL_X1(satd, ssse3) -DECL_X1(satd, ssse3_atom) -DECL_X1(satd, sse4) -DECL_X1(satd, avx) -DECL_X1(satd, xop) -DECL_X1(satd, avx2) -int x265_pixel_satd_16x24_avx(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_satd_32x48_avx(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_satd_24x64_avx(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_satd_8x64_avx(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_satd_8x12_avx(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_satd_12x32_avx(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_satd_4x32_avx(const pixel*, intptr_t, const pixel*, intptr_t); -int 
x265_pixel_satd_8x32_sse2(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_satd_16x4_sse2(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_satd_16x12_sse2(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_satd_16x32_sse2(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_satd_16x64_sse2(const pixel*, intptr_t, const pixel*, intptr_t); - -DECL_X1(sa8d, mmx2) -DECL_X1(sa8d, sse2) -DECL_X1(sa8d, ssse3) -DECL_X1(sa8d, ssse3_atom) -DECL_X1(sa8d, sse4) -DECL_X1(sa8d, avx) -DECL_X1(sa8d, xop) -DECL_X1(sa8d, avx2) - -/* ssd-a.asm */ -DECL_X1(ssd, mmx) -DECL_X1(ssd, mmx2) -DECL_X1(ssd, sse2slow) -DECL_X1(ssd, sse2) -DECL_X1(ssd, ssse3) -DECL_X1(ssd, avx) -DECL_X1(ssd, xop) -DECL_X1(ssd, avx2) -DECL_X1_SS(ssd_ss, mmx) -DECL_X1_SS(ssd_ss, mmx2) -DECL_X1_SS(ssd_ss, sse2slow) -DECL_X1_SS(ssd_ss, sse2) -DECL_X1_SS(ssd_ss, ssse3) -DECL_X1_SS(ssd_ss, sse4) -DECL_X1_SS(ssd_ss, avx) -DECL_X1_SS(ssd_ss, xop) -DECL_X1_SS(ssd_ss, avx2) -DECL_X1_SP(ssd_sp, sse4) -#define DECL_HEVC_SSD(suffix) \ - int x265_pixel_ssd_32x64_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_16x64_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_32x32_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_32x16_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_16x32_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_32x24_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_24x32_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_32x8_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_8x32_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_16x16_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_16x8_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_8x16_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_16x12_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_16x4_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_8x8_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); \ - int x265_pixel_ssd_8x4_ ## suffix(const pixel*, intptr_t, const pixel*, intptr_t); -DECL_HEVC_SSD(sse2) -DECL_HEVC_SSD(ssse3) -DECL_HEVC_SSD(avx) - -int x265_pixel_ssd_12x16_sse4(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_ssd_24x32_sse4(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_ssd_48x64_sse4(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_ssd_64x16_sse4(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_ssd_64x32_sse4(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_ssd_64x48_sse4(const pixel*, intptr_t, const pixel*, intptr_t); -int x265_pixel_ssd_64x64_sse4(const pixel*, intptr_t, const pixel*, intptr_t); - -int x265_pixel_ssd_s_4_sse2(const int16_t*, intptr_t); -int x265_pixel_ssd_s_8_sse2(const int16_t*, intptr_t); -int x265_pixel_ssd_s_16_sse2(const int16_t*, intptr_t); -int x265_pixel_ssd_s_32_sse2(const int16_t*, intptr_t); -int x265_pixel_ssd_s_16_avx2(const int16_t*, intptr_t); -int x265_pixel_ssd_s_32_avx2(const int16_t*, intptr_t); - -#define ADDAVG(func) \ - void x265_ ## func ## _sse4(const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, 
intptr_t); \ - void x265_ ## func ## _avx2(const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); -ADDAVG(addAvg_2x4) -ADDAVG(addAvg_2x8) -ADDAVG(addAvg_4x2); -ADDAVG(addAvg_4x4) -ADDAVG(addAvg_4x8) -ADDAVG(addAvg_4x16) -ADDAVG(addAvg_6x8) -ADDAVG(addAvg_8x2) -ADDAVG(addAvg_8x4) -ADDAVG(addAvg_8x6) -ADDAVG(addAvg_8x8) -ADDAVG(addAvg_8x16) -ADDAVG(addAvg_8x32) -ADDAVG(addAvg_12x16) -ADDAVG(addAvg_16x4) -ADDAVG(addAvg_16x8) -ADDAVG(addAvg_16x12) -ADDAVG(addAvg_16x16) -ADDAVG(addAvg_16x32) -ADDAVG(addAvg_16x64) -ADDAVG(addAvg_24x32) -ADDAVG(addAvg_32x8) -ADDAVG(addAvg_32x16) -ADDAVG(addAvg_32x24) -ADDAVG(addAvg_32x32) -ADDAVG(addAvg_32x64) -ADDAVG(addAvg_48x64) -ADDAVG(addAvg_64x16) -ADDAVG(addAvg_64x32) -ADDAVG(addAvg_64x48) -ADDAVG(addAvg_64x64) - -ADDAVG(addAvg_2x16) -ADDAVG(addAvg_4x32) -ADDAVG(addAvg_6x16) -ADDAVG(addAvg_8x12) -ADDAVG(addAvg_8x64) -ADDAVG(addAvg_12x32) -ADDAVG(addAvg_16x24) -ADDAVG(addAvg_24x64) -ADDAVG(addAvg_32x48) - -void x265_downShift_16_sse2(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); -void x265_downShift_16_avx2(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); -void x265_upShift_8_sse4(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift); -int x265_psyCost_pp_4x4_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); -int x265_psyCost_pp_8x8_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); -int x265_psyCost_pp_16x16_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); -int x265_psyCost_pp_32x32_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); -int x265_psyCost_pp_64x64_sse4(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); -int x265_psyCost_ss_4x4_sse4(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); -int x265_psyCost_ss_8x8_sse4(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); -int x265_psyCost_ss_16x16_sse4(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); -int x265_psyCost_ss_32x32_sse4(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); -int x265_psyCost_ss_64x64_sse4(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); -void x265_pixel_avg_16x4_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_16x8_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_16x12_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_16x16_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_16x32_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_16x64_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_32x64_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void 
x265_pixel_avg_32x32_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_32x24_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_32x16_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_32x8_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_64x64_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_64x48_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_64x32_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); -void x265_pixel_avg_64x16_avx2(pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); - -void x265_pixel_add_ps_16x16_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); -void x265_pixel_add_ps_32x32_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); -void x265_pixel_add_ps_64x64_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); -void x265_pixel_add_ps_16x32_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); -void x265_pixel_add_ps_32x64_avx2(pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); - -void x265_pixel_sub_ps_16x16_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); -void x265_pixel_sub_ps_32x32_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); -void x265_pixel_sub_ps_64x64_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); -void x265_pixel_sub_ps_16x32_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); -void x265_pixel_sub_ps_32x64_avx2(int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); - -int x265_psyCost_pp_4x4_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); -int x265_psyCost_pp_8x8_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); -int x265_psyCost_pp_16x16_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); -int x265_psyCost_pp_32x32_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); -int x265_psyCost_pp_64x64_avx2(const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); - -int x265_psyCost_ss_4x4_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); -int x265_psyCost_ss_8x8_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); -int x265_psyCost_ss_16x16_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); -int x265_psyCost_ss_32x32_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); -int 
x265_psyCost_ss_64x64_avx2(const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride); -void x265_weight_sp_avx2(const int16_t* src, pixel* dst, intptr_t srcStride, intptr_t dstStride, int width, int height, int w0, int round, int shift, int offset); +void PFX(downShift_16_sse2)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); +void PFX(downShift_16_avx2)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); +void PFX(upShift_16_sse2)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); +void PFX(upShift_16_avx2)(const uint16_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift, uint16_t mask); +void PFX(upShift_8_sse4)(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift); +void PFX(upShift_8_avx2)(const uint8_t* src, intptr_t srcStride, pixel* dst, intptr_t dstStride, int width, int height, int shift); + +#define DECL_PIXELS(cpu) \ + FUNCDEF_PU(uint32_t, pixel_ssd, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \ + FUNCDEF_PU(int, pixel_sa8d, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \ + FUNCDEF_PU(void, pixel_sad_x3, cpu, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*); \ + FUNCDEF_PU(void, pixel_sad_x4, cpu, const pixel*, const pixel*, const pixel*, const pixel*, const pixel*, intptr_t, int32_t*); \ + FUNCDEF_PU(void, pixel_avg, cpu, pixel* dst, intptr_t dstride, const pixel* src0, intptr_t sstride0, const pixel* src1, intptr_t sstride1, int); \ + FUNCDEF_PU(void, pixel_add_ps, cpu, pixel* a, intptr_t dstride, const pixel* b0, const int16_t* b1, intptr_t sstride0, intptr_t sstride1); \ + FUNCDEF_PU(void, pixel_sub_ps, cpu, int16_t* a, intptr_t dstride, const pixel* b0, const pixel* b1, intptr_t sstride0, intptr_t sstride1); \ + FUNCDEF_CHROMA_PU(int, pixel_satd, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \ + FUNCDEF_CHROMA_PU(int, pixel_sad, cpu, const pixel*, intptr_t, const pixel*, intptr_t); \ + FUNCDEF_CHROMA_PU(uint32_t, pixel_ssd_ss, cpu, const int16_t*, intptr_t, const int16_t*, intptr_t); \ + FUNCDEF_CHROMA_PU(void, addAvg, cpu, const int16_t*, const int16_t*, pixel*, intptr_t, intptr_t, intptr_t); \ + FUNCDEF_CHROMA_PU(int, pixel_ssd_s, cpu, const int16_t*, intptr_t); \ + FUNCDEF_TU_S(int, pixel_ssd_s, cpu, const int16_t*, intptr_t); \ + FUNCDEF_TU(uint64_t, pixel_var, cpu, const pixel*, intptr_t); \ + FUNCDEF_TU(int, psyCost_pp, cpu, const pixel* source, intptr_t sstride, const pixel* recon, intptr_t rstride); \ + FUNCDEF_TU(int, psyCost_ss, cpu, const int16_t* source, intptr_t sstride, const int16_t* recon, intptr_t rstride) + +DECL_PIXELS(mmx); +DECL_PIXELS(mmx2); +DECL_PIXELS(sse2); +DECL_PIXELS(sse3); +DECL_PIXELS(sse4); +DECL_PIXELS(ssse3); +DECL_PIXELS(avx); +DECL_PIXELS(xop); +DECL_PIXELS(avx2); #undef DECL_PIXELS -#undef DECL_HEVC_SSD -#undef DECL_X1 -#undef DECL_X4 #endif // ifndef X265_I386_PIXEL_H
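
The rewritten header is mostly a deletion: the long hand-maintained prototype lists collapse into FUNCDEF_* macros that stamp out one prototype per partition size, with PFX() applying the per-build symbol prefix (private_prefix is remapped to X265_NS in the x86inc.asm hunk further down). A rough sketch of the mechanism follows; the real macros live in x265's common headers and enumerate every HEVC PU, and the prefix shown here is only an assumed example.

#include <cstdint>
typedef unsigned char pixel;             /* 16-bit under HIGH_BIT_DEPTH builds */

#define PFX(name) x265_8bit_ ## name     /* assumed stand-in for the real PFX */
#define FUNCDEF_PU(ret, name, cpu, ...) \
    ret PFX(name ## _8x8_   ## cpu)(__VA_ARGS__); \
    ret PFX(name ## _16x16_ ## cpu)(__VA_ARGS__); \
    ret PFX(name ## _32x32_ ## cpu)(__VA_ARGS__); \
    ret PFX(name ## _64x64_ ## cpu)(__VA_ARGS__)  /* ...plus all rectangular sizes */

/* one line now declares a whole family, e.g. x265_8bit_pixel_sad_8x8_sse2(...) */
FUNCDEF_PU(int, pixel_sad, sse2, const pixel*, intptr_t, const pixel*, intptr_t);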
View file
x265_1.7.tar.gz/source/common/x86/sad-a.asm -> x265_1.8.tar.gz/source/common/x86/sad-a.asm
Changed
@@ -7,6 +7,7 @@ ;* Fiona Glaser <fiona@x264.com> ;* Laurent Aimar <fenrir@via.ecp.fr> ;* Alex Izvorski <aizvorksi@gmail.com> +;* Min Chen <chenm003@163.com> ;* ;* This program is free software; you can redistribute it and/or modify ;* it under the terms of the GNU General Public License as published by @@ -32,15 +33,13 @@ SECTION_RODATA 32 MSK: db 255,255,255,255,255,255,255,255,255,255,255,255,0,0,0,0 -pb_shuf8x8c2: times 2 db 0,0,0,0,8,8,8,8,-1,-1,-1,-1,-1,-1,-1,-1 -hpred_shuf: db 0,0,2,2,8,8,10,10,1,1,3,3,9,9,11,11 SECTION .text cextern pb_3 cextern pb_shuf8x8c cextern pw_8 -cextern sw_64 +cextern pd_64 ;============================================================================= ; SAD MMX @@ -2784,6 +2783,83 @@ %endif %endmacro +%macro SAD_X4_START_2x32P_AVX2 0 + mova m4, [r0] + movu m0, [r1] + movu m2, [r2] + movu m1, [r3] + movu m3, [r4] + psadbw m0, m4 + psadbw m2, m4 + psadbw m1, m4 + psadbw m3, m4 + packusdw m0, m2 + packusdw m1, m3 + + mova m6, [r0+FENC_STRIDE] + movu m2, [r1+r5] + movu m4, [r2+r5] + movu m3, [r3+r5] + movu m5, [r4+r5] + psadbw m2, m6 + psadbw m4, m6 + psadbw m3, m6 + psadbw m5, m6 + packusdw m2, m4 + packusdw m3, m5 + paddd m0, m2 + paddd m1, m3 +%endmacro + +%macro SAD_X4_2x32P_AVX2 4 + mova m6, [r0+%1] + movu m2, [r1+%2] + movu m4, [r2+%2] + movu m3, [r3+%2] + movu m5, [r4+%2] + psadbw m2, m6 + psadbw m4, m6 + psadbw m3, m6 + psadbw m5, m6 + packusdw m2, m4 + packusdw m3, m5 + paddd m0, m2 + paddd m1, m3 + + mova m6, [r0+%3] + movu m2, [r1+%4] + movu m4, [r2+%4] + movu m3, [r3+%4] + movu m5, [r4+%4] + psadbw m2, m6 + psadbw m4, m6 + psadbw m3, m6 + psadbw m5, m6 + packusdw m2, m4 + packusdw m3, m5 + paddd m0, m2 + paddd m1, m3 +%endmacro + +%macro SAD_X4_4x32P_AVX2 2 +%if %1==0 + lea r6, [r5*3] + SAD_X4_START_2x32P_AVX2 +%else + SAD_X4_2x32P_AVX2 FENC_STRIDE*(0+(%1&1)*4), r5*0, FENC_STRIDE*(1+(%1&1)*4), r5*1 +%endif + SAD_X4_2x32P_AVX2 FENC_STRIDE*(2+(%1&1)*4), r5*2, FENC_STRIDE*(3+(%1&1)*4), r6 +%if %1 != %2-1 +%if (%1&1) != 0 + add r0, 8*FENC_STRIDE +%endif + lea r1, [r1+4*r5] + lea r2, [r2+4*r5] + lea r3, [r3+4*r5] + lea r4, [r4+4*r5] +%endif +%endmacro + %macro SAD_X3_END_AVX2 0 movifnidn r5, r5mp packssdw m0, m1 ; 0 0 1 1 0 0 1 1 @@ -2808,6 +2884,17 @@ RET %endmacro +%macro SAD_X4_32P_END_AVX2 0 + mov r0, r6mp + vextracti128 xm2, m0, 1 + vextracti128 xm3, m1, 1 + paddd xm0, xm2 + paddd xm1, xm3 + phaddd xm0, xm1 + mova [r0], xm0 + RET +%endmacro + ;----------------------------------------------------------------------------- ; void pixel_sad_x3_16x16( uint8_t *fenc, uint8_t *pix0, uint8_t *pix1, ; uint8_t *pix2, intptr_t i_stride, int scores[3] ) @@ -3320,7 +3407,12 @@ SAD_X%1_4x%2P_AVX2 x, %3/4 %assign x x+1 %endrep + + %if (%1==4) && (%2==32) + SAD_X%1_32P_END_AVX2 + %else SAD_X%1_END_AVX2 + %endif %endmacro INIT_YMM avx2 @@ -3333,6 +3425,12 @@ SAD_X_AVX2 4, 16, 12, 8 SAD_X_AVX2 4, 16, 8, 8 +SAD_X_AVX2 4, 32, 8, 8 +SAD_X_AVX2 4, 32, 16, 8 +SAD_X_AVX2 4, 32, 24, 8 +SAD_X_AVX2 4, 32, 32, 8 +SAD_X_AVX2 4, 32, 64, 8 + ;============================================================================= ; SAD cacheline split ;============================================================================= @@ -3440,7 +3538,7 @@ jle pixel_sad_%1x%2_mmx2 and eax, 7 shl eax, 3 - movd mm6, [sw_64] + movd mm6, [pd_64] movd mm7, eax psubw mm6, mm7 PROLOGUE 4,5
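
The added SAD_X4 macros extend the one-against-four search to 32-pixel-wide blocks: one fenc block held at a fixed stride is compared against four candidates that share a stride, with psadbw/packusdw folding the four running sums into two registers. Restated in scalar form (the FENC_STRIDE value of 64 is x265's fixed encode-buffer stride; treat the constant here as an assumption):

#include <cstdint>
#include <cstdlib>

enum { FENC_STRIDE = 64 };   // fixed stride of the cached source block (assumed)

static void sad_x4_c(int bx, int by, const uint8_t* fenc,
                     const uint8_t* p0, const uint8_t* p1,
                     const uint8_t* p2, const uint8_t* p3,
                     intptr_t stride, int32_t scores[4])
{
    const uint8_t* pix[4] = { p0, p1, p2, p3 };
    for (int i = 0; i < 4; i++)
    {
        int32_t sum = 0;
        for (int y = 0; y < by; y++)
            for (int x = 0; x < bx; x++)
                sum += abs(fenc[y * FENC_STRIDE + x] - pix[i][y * stride + x]);
        scores[i] = sum;     // SAD_X4_32P_END_AVX2 stores all four with one mova
    }
}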
View file
x265_1.7.tar.gz/source/common/x86/sad16-a.asm -> x265_1.8.tar.gz/source/common/x86/sad16-a.asm
Changed
@@ -6,6 +6,7 @@ ;* Authors: Oskar Arvidsson <oskar@irock.se> ;* Henrik Gramner <henrik@gramner.com> ;* Dnyaneshwar Gorade <dnyaneshwar@multicorewareinc.com> +;* Min Chen <chenm003@163.com> ;* ;* This program is free software; you can redistribute it and/or modify ;* it under the terms of the GNU General Public License as published by @@ -51,8 +52,14 @@ lea r2, [r2+2*r3] paddw m1, m2 paddw m3, m4 + %if BIT_DEPTH <= 10 paddw m0, m1 paddw m0, m3 + %else + paddw m1, m3 + pmaddwd m1, [pw_1] + paddd m0, m1 + %endif %endmacro %macro SAD_INC_2x8P_MMX 0 @@ -70,8 +77,14 @@ lea r2, [r2+4*r3] paddw m1, m2 paddw m3, m4 + %if BIT_DEPTH <= 10 paddw m0, m1 paddw m0, m3 + %else + paddw m1, m3 + pmaddwd m1, [pw_1] + paddd m0, m1 + %endif %endmacro %macro SAD_INC_2x4P_MMX 0 @@ -82,8 +95,14 @@ ABSW2 m1, m2, m1, m2, m3, m4 lea r0, [r0+4*r1] lea r2, [r2+4*r3] + %if BIT_DEPTH <= 10 paddw m0, m1 paddw m0, m2 + %else + paddw m1, m2 + pmaddwd m1, [pw_1] + paddd m0, m1 + %endif %endmacro ;----------------------------------------------------------------------------- @@ -103,9 +122,17 @@ jg .loop %endif %if %1*%2 == 256 + %if BIT_DEPTH <= 10 HADDUW m0, m1 + %else + HADDD m0, m1 + %endif %else + %if BIT_DEPTH <= 10 HADDW m0, m1 + %else + HADDD m0, m1 + %endif %endif movd eax, m0 RET @@ -276,8 +303,9 @@ ABSW2 m3, m4, m3, m4, m7, m5 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m0, m3 + paddw m1, m3 + pmaddwd m1, [pw_1] + paddd m0, m1 %else movu m1, [r2] movu m2, [r2+2*r3] @@ -286,8 +314,9 @@ ABSW2 m1, m2, m1, m2, m3, m4 lea r0, [r0+4*r1] lea r2, [r2+4*r3] - paddw m0, m1 - paddw m0, m2 + paddw m1, m2 + pmaddwd m1, [pw_1] + paddd m0, m1 %endif %endmacro @@ -307,8 +336,9 @@ ABSW2 m3, m4, m3, m4, m7, m5 paddw m1, m2 paddw m3, m4 - paddw m0, m1 - paddw m8, m3 + paddw m1, m3 + pmaddwd m1, [pw_1] + paddd m0, m1 %else movu m1, [r2] movu m2, [r2 + 2 * r3] @@ -317,8 +347,9 @@ ABSW2 m1, m2, m1, m2, m3, m4 lea r0, [r0 + 4 * r1] lea r2, [r2 + 4 * r3] - paddw m0, m1 - paddw m8, m2 + paddw m1, m2 + pmaddwd m1, [pw_1] + paddd m0, m1 %endif %endmacro @@ -326,7 +357,7 @@ ; int pixel_sad_NxM(uint16_t *, intptr_t, uint16_t *, intptr_t) ; ---------------------------------------------------------------------------- - %macro SAD 2 -cglobal pixel_sad_%1x%2, 4,5-(%2&4/4),8*(%1/mmsize) +cglobal pixel_sad_%1x%2, 4,5,8 pxor m0, m0 %if %2 == 4 SAD_INC_2ROW %1 @@ -338,12 +369,7 @@ dec r4d jg .loop %endif -%if %2 == 32 - HADDUWD m0, m1 HADDD m0, m1 -%else - HADDW m0, m1 -%endif movd eax, xm0 RET %endmacro @@ -352,21 +378,15 @@ ; int pixel_sad_Nx64(uint16_t *, intptr_t, uint16_t *, intptr_t) ; ---------------------------------------------------------------------------- - %macro SAD_Nx64 1 -cglobal pixel_sad_%1x64, 4,5-(64&4/4), 9 +cglobal pixel_sad_%1x64, 4,5, 8 pxor m0, m0 - pxor m8, m8 mov r4d, 64 / 2 .loop: SAD_INC_2ROW_Nx64 %1 dec r4d jg .loop - HADDUWD m0, m1 - HADDUWD m8, m1 HADDD m0, m1 - HADDD m8, m1 - paddd m0, m8 - movd eax, xm0 RET %endmacro @@ -392,6 +412,654 @@ SAD 16, 16 SAD 16, 32 +INIT_YMM avx2 +cglobal pixel_sad_16x64, 4,7,4 + pxor m0, m0 + pxor m3, m3 + mov r4d, 64 / 8 + add r3d, r3d + add r1d, r1d + lea r5, [r1 * 3] + lea r6, [r3 * 3] +.loop: + movu m1, [r2] + movu m2, [r2 + r3] + psubw m1, [r0] + psubw m2, [r0 + r1] + pabsw m1, m1 + pabsw m2, m2 + paddw m0, m1 + paddw m3, m2 + + movu m1, [r2 + 2 * r3] + movu m2, [r2 + r6] + psubw m1, [r0 + 2 * r1] + psubw m2, [r0 + r5] + pabsw m1, m1 + pabsw m2, m2 + paddw m0, m1 + paddw m3, m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + movu m1, [r2] + movu m2, [r2 + r3] + psubw m1, 
[r0] + psubw m2, [r0 + r1] + pabsw m1, m1 + pabsw m2, m2 + paddw m0, m1 + paddw m3, m2 + + movu m1, [r2 + 2 * r3] + movu m2, [r2 + r6] + psubw m1, [r0 + 2 * r1] + psubw m2, [r0 + r5] + pabsw m1, m1 + pabsw m2, m2 + paddw m0, m1 + paddw m3, m2 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + dec r4d + jg .loop + + HADDUWD m0, m1 + HADDUWD m3, m1 + HADDD m0, m1 + HADDD m3, m1 + paddd m0, m3 + + movd eax, xm0 + RET + +INIT_YMM avx2 +cglobal pixel_sad_32x8, 4,7,5 + pxor m0, m0 + mov r4d, 8/4 + add r3d, r3d + add r1d, r1d + lea r5, [r1 * 3] + lea r6, [r3 * 3] +.loop: + movu m1, [r2] + movu m2, [r2 + 32] + movu m3, [r2 + r3] + movu m4, [r2 + r3 + 32] + psubw m1, [r0] + psubw m2, [r0 + 32] + psubw m3, [r0 + r1] + psubw m4, [r0 + r1 + 32] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m0, m3 + + movu m1, [r2 + 2 * r3] + movu m2, [r2 + 2 * r3 + 32] + movu m3, [r2 + r6] + movu m4, [r2 + r6 + 32] + psubw m1, [r0 + 2 * r1] + psubw m2, [r0 + 2 * r1 + 32] + psubw m3, [r0 + r5] + psubw m4, [r0 + r5 + 32] + pabsw m1, m1 + pabsw m2, m2 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m0, m3 + + dec r4d + jg .loop + + HADDW m0, m1 + movd eax, xm0 + RET + +INIT_YMM avx2 +cglobal pixel_sad_32x16, 4,7,5 + pxor m0, m0 + mov r4d, 16/8 + add r3d, r3d + add r1d, r1d + lea r5, [r1 * 3] + lea r6, [r3 * 3] +.loop: + movu m1, [r2] + movu m2, [r2 + 32] + movu m3, [r2 + r3] + movu m4, [r2 + r3 + 32] + psubw m1, [r0] + psubw m2, [r0 + 32] + psubw m3, [r0 + r1] + psubw m4, [r0 + r1 + 32] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m0, m3 + + movu m1, [r2 + 2 * r3] + movu m2, [r2 + 2 * r3 + 32] + movu m3, [r2 + r6] + movu m4, [r2 + r6 + 32] + psubw m1, [r0 + 2 * r1] + psubw m2, [r0 + 2 * r1 + 32] + psubw m3, [r0 + r5] + psubw m4, [r0 + r5 + 32] + pabsw m1, m1 + pabsw m2, m2 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m0, m3 + + movu m1, [r2] + movu m2, [r2 + 32] + movu m3, [r2 + r3] + movu m4, [r2 + r3 + 32] + psubw m1, [r0] + psubw m2, [r0 + 32] + psubw m3, [r0 + r1] + psubw m4, [r0 + r1 + 32] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m0, m3 + + movu m1, [r2 + 2 * r3] + movu m2, [r2 + 2 * r3 + 32] + movu m3, [r2 + r6] + movu m4, [r2 + r6 + 32] + psubw m1, [r0 + 2 * r1] + psubw m2, [r0 + 2 * r1 + 32] + psubw m3, [r0 + r5] + psubw m4, [r0 + r5 + 32] + pabsw m1, m1 + pabsw m2, m2 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m0, m3 + + dec r4d + jg .loop + + HADDW m0, m1 + movd eax, xm0 + RET + +INIT_YMM avx2 +cglobal pixel_sad_32x24, 4,7,5 + pxor m0, m0 + mov r4d, 24/4 + add r3d, r3d + add r1d, r1d + lea r5, [r1 * 3] + lea r6, [r3 * 3] +.loop: + movu m1, [r2] + movu m2, [r2 + 32] + movu m3, [r2 + r3] + movu m4, [r2 + r3 + 32] + psubw m1, [r0] + psubw m2, [r0 + 32] + psubw m3, [r0 + r1] + psubw m4, [r0 + r1 + 32] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m0, m3 + + movu m1, [r2 + 2 * r3] + movu m2, [r2 + 2 * r3 + 32] + movu m3, [r2 + r6] + movu m4, [r2 + r6 + 32] + psubw m1, [r0 + 2 * r1] + psubw m2, [r0 + 2 * r1 + 32] + psubw m3, [r0 + r5] + psubw m4, [r0 + r5 + 32] + pabsw m1, m1 + pabsw m2, m2 + 
pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m0, m3 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + dec r4d + jg .loop + + HADDUWD m0, m1 + HADDD m0, m1 + movd eax, xm0 + RET + + +INIT_YMM avx2 +cglobal pixel_sad_32x32, 4,7,5 + pxor m0, m0 + mov r4d, 32/4 + add r3d, r3d + add r1d, r1d + lea r5, [r1 * 3] + lea r6, [r3 * 3] +.loop: + movu m1, [r2] + movu m2, [r2 + 32] + movu m3, [r2 + r3] + movu m4, [r2 + r3 + 32] + psubw m1, [r0] + psubw m2, [r0 + 32] + psubw m3, [r0 + r1] + psubw m4, [r0 + r1 + 32] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m0, m3 + + movu m1, [r2 + 2 * r3] + movu m2, [r2 + 2 * r3 + 32] + movu m3, [r2 + r6] + movu m4, [r2 + r6 + 32] + psubw m1, [r0 + 2 * r1] + psubw m2, [r0 + 2 * r1 + 32] + psubw m3, [r0 + r5] + psubw m4, [r0 + r5 + 32] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m0, m3 + + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + dec r4d + jg .loop + + HADDUWD m0, m1 + HADDD m0, m1 + movd eax, xm0 + RET + +INIT_YMM avx2 +cglobal pixel_sad_32x64, 4,7,6 + pxor m0, m0 + pxor m5, m5 + mov r4d, 64 / 4 + add r3d, r3d + add r1d, r1d + lea r5, [r1 * 3] + lea r6, [r3 * 3] +.loop: + movu m1, [r2] + movu m2, [r2 + 32] + movu m3, [r2 + r3] + movu m4, [r2 + r3 + 32] + psubw m1, [r0] + psubw m2, [r0 + 32] + psubw m3, [r0 + r1] + psubw m4, [r0 + r1 + 32] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m5, m3 + + movu m1, [r2 + 2 * r3] + movu m2, [r2 + 2 * r3 + 32] + movu m3, [r2 + r6] + movu m4, [r2 + r6 + 32] + psubw m1, [r0 + 2 * r1] + psubw m2, [r0 + 2 * r1 + 32] + psubw m3, [r0 + r5] + psubw m4, [r0 + r5 + 32] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m5, m3 + lea r0, [r0 + 4 * r1] + lea r2, [r2 + 4 * r3] + + dec r4d + jg .loop + + HADDUWD m0, m1 + HADDUWD m5, m1 + HADDD m0, m1 + HADDD m5, m1 + paddd m0, m5 + + movd eax, xm0 + RET + +INIT_YMM avx2 +cglobal pixel_sad_48x64, 4, 5, 7 + pxor m0, m0 + pxor m5, m5 + pxor m6, m6 + mov r4d, 64/2 + add r3d, r3d + add r1d, r1d +.loop: + movu m1, [r2 + 0 * mmsize] + movu m2, [r2 + 1 * mmsize] + movu m3, [r2 + 2 * mmsize] + psubw m1, [r0 + 0 * mmsize] + psubw m2, [r0 + 1 * mmsize] + psubw m3, [r0 + 2 * mmsize] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + paddw m0, m1 + paddw m5, m2 + paddw m6, m3 + + movu m1, [r2 + r3 + 0 * mmsize] + movu m2, [r2 + r3 + 1 * mmsize] + movu m3, [r2 + r3 + 2 * mmsize] + psubw m1, [r0 + r1 + 0 * mmsize] + psubw m2, [r0 + r1 + 1 * mmsize] + psubw m3, [r0 + r1 + 2 * mmsize] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + paddw m0, m1 + paddw m5, m2 + paddw m6, m3 + + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + + dec r4d + jg .loop + + HADDUWD m0, m1 + HADDUWD m5, m1 + HADDUWD m6, m1 + paddd m0, m5 + paddd m0, m6 + HADDD m0, m1 + movd eax, xm0 + RET + +INIT_YMM avx2 +cglobal pixel_sad_64x16, 4, 5, 5 + pxor m0, m0 + mov r4d, 16 / 2 + add r3d, r3d + add r1d, r1d +.loop: + movu m1, [r2 + 0] + movu m2, [r2 + 32] + movu m3, [r2 + 2 * 32] + movu m4, [r2 + 3 * 32] + psubw m1, [r0 + 0] + psubw m2, [r0 + 32] + psubw m3, [r0 + 2 * 32] + psubw m4, [r0 + 3 * 32] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m0, m3 + movu m1, [r2 + r3] + movu m2, [r2 + r3 + 32] + movu m3, [r2 + r3 + 64] + movu m4, [r2 + r3 + 96] + psubw m1, [r0 + r1] + 
psubw m2, [r0 + r1 + 32] + psubw m3, [r0 + r1 + 64] + psubw m4, [r0 + r1 + 96] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m0, m3 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + + dec r4d + jg .loop + + HADDUWD m0, m1 + HADDD m0, m1 + movd eax, xm0 + RET + +INIT_YMM avx2 +cglobal pixel_sad_64x32, 4, 5, 6 + pxor m0, m0 + pxor m5, m5 + mov r4d, 32 / 2 + add r3d, r3d + add r1d, r1d +.loop: + movu m1, [r2 + 0] + movu m2, [r2 + 32] + movu m3, [r2 + 2 * 32] + movu m4, [r2 + 3 * 32] + psubw m1, [r0 + 0] + psubw m2, [r0 + 32] + psubw m3, [r0 + 2 * 32] + psubw m4, [r0 + 3 * 32] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m5, m3 + + movu m1, [r2 + r3] + movu m2, [r2 + r3 + 32] + movu m3, [r2 + r3 + 64] + movu m4, [r2 + r3 + 96] + psubw m1, [r0 + r1] + psubw m2, [r0 + r1 + 32] + psubw m3, [r0 + r1 + 64] + psubw m4, [r0 + r1 + 96] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m1, m2 + paddw m3, m4 + paddw m0, m1 + paddw m5, m3 + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + + dec r4d + jg .loop + + HADDUWD m0, m1 + HADDUWD m5, m1 + paddd m0, m5 + HADDD m0, m1 + + movd eax, xm0 + RET + +INIT_YMM avx2 +cglobal pixel_sad_64x48, 4, 5, 8 + pxor m0, m0 + pxor m5, m5 + pxor m6, m6 + pxor m7, m7 + mov r4d, 48 / 2 + add r3d, r3d + add r1d, r1d +.loop: + movu m1, [r2 + 0] + movu m2, [r2 + 32] + movu m3, [r2 + 64] + movu m4, [r2 + 96] + psubw m1, [r0 + 0] + psubw m2, [r0 + 32] + psubw m3, [r0 + 64] + psubw m4, [r0 + 96] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m0, m1 + paddw m5, m2 + paddw m6, m3 + paddw m7, m4 + + movu m1, [r2 + r3] + movu m2, [r2 + r3 + 32] + movu m3, [r2 + r3 + 64] + movu m4, [r2 + r3 + 96] + psubw m1, [r0 + r1] + psubw m2, [r0 + r1 + 32] + psubw m3, [r0 + r1 + 64] + psubw m4, [r0 + r1 + 96] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m0, m1 + paddw m5, m2 + paddw m6, m3 + paddw m7, m4 + + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + + dec r4d + jg .loop + + HADDUWD m0, m1 + HADDUWD m5, m1 + HADDUWD m6, m1 + HADDUWD m7, m1 + paddd m0, m5 + paddd m0, m6 + paddd m0, m7 + HADDD m0, m1 + movd eax, xm0 + RET + +INIT_YMM avx2 +cglobal pixel_sad_64x64, 4, 5, 8 + pxor m0, m0 + pxor m5, m5 + pxor m6, m6 + pxor m7, m7 + mov r4d, 64 / 2 + add r3d, r3d + add r1d, r1d +.loop: + movu m1, [r2 + 0] + movu m2, [r2 + 32] + movu m3, [r2 + 64] + movu m4, [r2 + 96] + psubw m1, [r0 + 0] + psubw m2, [r0 + 32] + psubw m3, [r0 + 64] + psubw m4, [r0 + 96] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m0, m1 + paddw m5, m2 + paddw m6, m3 + paddw m7, m4 + + movu m1, [r2 + r3] + movu m2, [r2 + r3 + 32] + movu m3, [r2 + r3 + 64] + movu m4, [r2 + r3 + 96] + psubw m1, [r0 + r1] + psubw m2, [r0 + r1 + 32] + psubw m3, [r0 + r1 + 64] + psubw m4, [r0 + r1 + 96] + pabsw m1, m1 + pabsw m2, m2 + pabsw m3, m3 + pabsw m4, m4 + paddw m0, m1 + paddw m5, m2 + paddw m6, m3 + paddw m7, m4 + + lea r0, [r0 + 2 * r1] + lea r2, [r2 + 2 * r3] + + dec r4d + jg .loop + + HADDUWD m0, m1 + HADDUWD m5, m1 + HADDUWD m6, m1 + HADDUWD m7, m1 + paddd m0, m5 + paddd m0, m6 + paddd m0, m7 + HADDD m0, m1 + movd eax, xm0 + RET + ;------------------------------------------------------------------ ; int pixel_sad_32xN( uint16_t *, intptr_t, uint16_t *, intptr_t ) ;------------------------------------------------------------------ @@ -887,9 +1555,37 @@ SAD_X 4, 8, 4 INIT_YMM avx2 %define XMM_REGS 7 -SAD_X 3, 16, 16 +SAD_X 
3, 16, 4 SAD_X 3, 16, 8 +SAD_X 3, 16, 12 +SAD_X 3, 16, 16 +SAD_X 3, 16, 32 +SAD_X 3, 16, 64 +SAD_X 3, 32, 8 +SAD_X 3, 32, 16 +SAD_X 3, 32, 24 +SAD_X 3, 32, 32 +SAD_X 3, 32, 64 +SAD_X 3, 48, 64 +SAD_X 3, 64, 16 +SAD_X 3, 64, 32 +SAD_X 3, 64, 48 +SAD_X 3, 64, 64 %define XMM_REGS 9 -SAD_X 4, 16, 16 +SAD_X 4, 16, 4 SAD_X 4, 16, 8 +SAD_X 4, 16, 12 +SAD_X 4, 16, 16 +SAD_X 4, 16, 32 +SAD_X 4, 16, 64 +SAD_X 4, 32, 8 +SAD_X 4, 32, 16 +SAD_X 4, 32, 24 +SAD_X 4, 32, 32 +SAD_X 4, 32, 64 +SAD_X 4, 48, 64 +SAD_X 4, 64, 16 +SAD_X 4, 64, 32 +SAD_X 4, 64, 48 +SAD_X 4, 64, 64
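
The recurring %if BIT_DEPTH <= 10 guards in this file are overflow protection: the absolute difference of two 12-bit samples can be 4095, and a 16-bit lane wraps after only 16 such worst-case additions (17 x 4095 > 65535), while 10-bit data stays within 16 bits for the row counts used here. Above 10 bits the code therefore widens pairs of word sums into 32-bit lanes with pmaddwd against pw_1, a vector of ones. In scalar terms:

#include <cstdint>

// Model of "pmaddwd m1, [pw_1]": each adjacent pair of 16-bit partial sums
// becomes one 32-bit lane (a*1 + b*1), so further accumulation can use paddd.
static void widen_word_pairs(const uint16_t words[8], uint32_t dwords[4])
{
    for (int i = 0; i < 4; i++)
        dwords[i] = (uint32_t)words[2 * i] + words[2 * i + 1];
}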
View file
x265_1.7.tar.gz/source/common/x86/ssd-a.asm -> x265_1.8.tar.gz/source/common/x86/ssd-a.asm
Changed
@@ -113,6 +113,62 @@
 RET
 %endmacro
+; Function to compute the SSD of a 32x16 block (SSE2, 12-bit depth)
+; Defined separately so it can be called from the SSD_ONE_32 macro
+INIT_XMM sse2
+cglobal ssd_ss_32x16
+ pxor m8, m8
+ mov r4d, 16
+.loop:
+ movu m0, [r0]
+ movu m1, [r0+mmsize]
+ movu m2, [r0+2*mmsize]
+ movu m3, [r0+3*mmsize]
+ movu m4, [r2]
+ movu m5, [r2+mmsize]
+ movu m6, [r2+2*mmsize]
+ movu m7, [r2+3*mmsize]
+ psubw m0, m4
+ psubw m1, m5
+ psubw m2, m6
+ psubw m3, m7
+ add r0, r1
+ add r2, r3
+ pmaddwd m0, m0
+ pmaddwd m1, m1
+ pmaddwd m2, m2
+ pmaddwd m3, m3
+ paddd m2, m3
+ paddd m0, m1
+ paddd m0, m2
+ paddd m8, m0
+ dec r4d
+ jnz .loop
+
+ mova m4, m8
+ pxor m5, m5
+ punpckldq m8, m5
+ punpckhdq m4, m5
+ paddq m4, m8
+ movhlps m5, m4
+ paddq m4, m5
+ paddq m9, m4
+ ret
+
+%macro SSD_ONE_32 0
+cglobal pixel_ssd_ss_32x64, 4,7,10
+ add r1d, r1d
+ add r3d, r3d
+ pxor m9, m9
+ xor r4, r4
+ call ssd_ss_32x16
+ call ssd_ss_32x16
+ call ssd_ss_32x16
+ call ssd_ss_32x16
+ movq rax, m9
+ RET
+%endmacro
+
 %macro SSD_TWO 2
 cglobal pixel_ssd_ss_%1x%2, 4,7,8
 FIX_STRIDES r1, r3
@@ -312,6 +368,124 @@
 movd eax, xm0
 RET
 %endmacro
+
+INIT_YMM avx2
+cglobal pixel_ssd_16x16, 4,7,8
+ FIX_STRIDES r1, r3
+ lea r5, [3 * r1]
+ lea r6, [3 * r3]
+ mov r4d, 4
+ pxor m0, m0
+.loop:
+ movu m1, [r0]
+ movu m2, [r0 + r1]
+ movu m3, [r0 + r1 * 2]
+ movu m4, [r0 + r5]
+ movu m6, [r2]
+ movu m7, [r2 + r3]
+ psubw m1, m6
+ psubw m2, m7
+ movu m6, [r2 + r3 * 2]
+ movu m7, [r2 + r6]
+ psubw m3, m6
+ psubw m4, m7
+
+ lea r0, [r0 + r1 * 4]
+ lea r2, [r2 + r3 * 4]
+
+ pmaddwd m1, m1
+ pmaddwd m2, m2
+ pmaddwd m3, m3
+ pmaddwd m4, m4
+ paddd m1, m2
+ paddd m3, m4
+ paddd m0, m1
+ paddd m0, m3
+
+ dec r4d
+ jg .loop
+
+ HADDD m0, m5
+ movd eax, xm0
+ RET
+
+INIT_YMM avx2
+cglobal pixel_ssd_32x32, 4,7,8
+ add r1, r1
+ add r3, r3
+ mov r4d, 16
+ pxor m0, m0
+.loop:
+ movu m1, [r0]
+ movu m2, [r0 + 32]
+ movu m3, [r0 + r1]
+ movu m4, [r0 + r1 + 32]
+ movu m6, [r2]
+ movu m7, [r2 + 32]
+ psubw m1, m6
+ psubw m2, m7
+ movu m6, [r2 + r3]
+ movu m7, [r2 + r3 + 32]
+ psubw m3, m6
+ psubw m4, m7
+
+ lea r0, [r0 + r1 * 2]
+ lea r2, [r2 + r3 * 2]
+
+ pmaddwd m1, m1
+ pmaddwd m2, m2
+ pmaddwd m3, m3
+ pmaddwd m4, m4
+ paddd m1, m2
+ paddd m3, m4
+ paddd m0, m1
+ paddd m0, m3
+
+ dec r4d
+ jg .loop
+
+ HADDD m0, m5
+ movd eax, xm0
+ RET
+
+INIT_YMM avx2
+cglobal pixel_ssd_64x64, 4,7,8
+ FIX_STRIDES r1, r3
+ mov r4d, 64
+ pxor m0, m0
+.loop:
+ movu m1, [r0]
+ movu m2, [r0+32]
+ movu m3, [r0+32*2]
+ movu m4, [r0+32*3]
+ movu m6, [r2]
+ movu m7, [r2+32]
+ psubw m1, m6
+ psubw m2, m7
+ movu m6, [r2+32*2]
+ movu m7, [r2+32*3]
+ psubw m3, m6
+ psubw m4, m7
+
+ lea r0, [r0+r1]
+ lea r2, [r2+r3]
+
+ pmaddwd m1, m1
+ pmaddwd m2, m2
+ pmaddwd m3, m3
+ pmaddwd m4, m4
+ paddd m1, m2
+ paddd m3, m4
+ paddd m0, m1
+ paddd m0, m3
+
+ dec r4d
+ jg .loop
+
+ HADDD m0, m5
+ movd eax, xm0
+ RET
+
 INIT_MMX mmx2
 SSD_ONE 4, 4
 SSD_ONE 4, 8
@@ -338,7 +512,13 @@
 SSD_ONE 32, 16
 SSD_ONE 32, 24
 SSD_ONE 32, 32
-SSD_ONE 32, 64
+
+%if BIT_DEPTH <= 10
+ SSD_ONE 32, 64
+%else
+ SSD_ONE_32
+%endif
+
 SSD_TWO 48, 64
 SSD_TWO 64, 16
 SSD_TWO 64, 32
@@ -347,6 +527,10 @@
 INIT_YMM avx2
 SSD_ONE 16, 8
 SSD_ONE 16, 16
+SSD_ONE 32, 32
+SSD_ONE 64, 64
+SSD_ONE 16, 32
+SSD_ONE 32, 64
 %endif ; HIGH_BIT_DEPTH
 ;-----------------------------------------------------------------------------
@@ -983,7 +1167,7 @@
 mov al, %1*%2/mmsize/2
 %if %1 != %2
- jmp mangle(x265_pixel_ssd_%1x%1 %+ SUFFIX %+ .startloop)
+ jmp mangle(private_prefix %+ _ %+ pixel_ssd_%1x%1 %+ SUFFIX %+ .startloop)
 %else
.startloop:
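
The dedicated 12-bit path for 32x64 is again about overflow: a squared 12-bit difference reaches 4095^2 = 16769025, and 32x64 of them can total roughly 3.4e10, past what a 32-bit accumulator holds. ssd_ss_32x16 therefore sums only 16 rows per call, so each SSE2 dword lane sees at most 128 squares (just under 2^31), then folds the partial into a 64-bit total (the punpckldq/paddq sequence into m9). A scalar model under those same 12-bit bounds, with strides in int16_t elements where the asm doubles its byte strides up front:

#include <cstdint>

static uint64_t ssd_ss_32x64_c(const int16_t* a, intptr_t strideA,
                               const int16_t* b, intptr_t strideB)
{
    uint64_t total = 0;
    for (int chunk = 0; chunk < 4; chunk++)            // four 32x16 calls
    {
        uint32_t lane[4] = { 0, 0, 0, 0 };             // the 4 dword lanes of m8
        for (int y = 0; y < 16; y++, a += strideA, b += strideB)
            for (int x = 0; x < 32; x++)
            {
                int32_t d = a[x] - b[x];               // |d| <= 4095 for 12-bit input
                lane[x & 3] += (uint32_t)(d * d);      // 128 * 4095^2 < 2^31: no wrap
            }
        total += (uint64_t)lane[0] + lane[1] + lane[2] + lane[3];  // the paddq step
    }
    return total;
}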
View file
x265_1.7.tar.gz/source/common/x86/x86inc.asm -> x265_1.8.tar.gz/source/common/x86/x86inc.asm
Changed
@@ -37,7 +37,7 @@ ; to x264-devel@videolan.org . %ifndef private_prefix - %define private_prefix x265 + %define private_prefix X265_NS %endif %ifndef public_prefix @@ -1483,13 +1483,3 @@ %endif %endmacro %endif - -; workaround: vpbroadcastd with register, the yasm will generate wrong code -%macro vpbroadcastd 2 - %ifid %2 - movd %1 %+ xmm, %2 - vpbroadcastd %1, %1 %+ xmm - %else - vpbroadcastd %1, %2 - %endif -%endmacro
View file
x265_1.7.tar.gz/source/common/x86/x86util.asm -> x265_1.8.tar.gz/source/common/x86/x86util.asm
Changed
@@ -358,11 +358,11 @@ %if sizeof%1==32 ; %3 = abcdefgh ijklmnop (lower address) ; %2 = ABCDEFGH IJKLMNOP (higher address) -; vperm2i128 %5, %2, %3, q0003 ; %5 = ijklmnop ABCDEFGH -%if %4 < 16 - palignr %1, %5, %3, %4 ; %1 = bcdefghi jklmnopA + vperm2i128 %4, %1, %2, q0003 ; %4 = ijklmnop ABCDEFGH +%if %3 < 16 + palignr %1, %4, %2, %3 ; %1 = bcdefghi jklmnopA %else - palignr %1, %2, %5, %4-16 ; %1 = pABCDEFG HIJKLMNO + palignr %1, %2, %4, %3-16 ; %1 = pABCDEFG HIJKLMNO %endif %elif cpuflag(ssse3) %if %0==5
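
This PALIGNR hunk shifts the operand numbering to a four-argument form and re-enables the lane-crossing vperm2i128 step that the old body left commented out. What the construct computes, per the comment in the file, is a byte shift across the 64-byte concatenation of two ymm registers, keeping the low 32 bytes; AVX2's palignr only works within 16-byte lanes, hence the shuffle to build the crossing halves first. In scalar terms:

#include <cstdint>

// Scalar model of 256-bit PALIGNR: bytes n..n+31 of the concatenation [hi:lo].
// For n = 1 this turns "abcdefgh ijklmnop" / "ABCDEFGH IJKLMNOP" into
// "bcdefghi jklmnopA", matching the example in the macro's comment.
static void palignr256(uint8_t out[32], const uint8_t hi[32],
                       const uint8_t lo[32], unsigned n)   // n < 32
{
    for (unsigned i = 0; i < 32; i++)
        out[i] = (i + n < 32) ? lo[i + n] : hi[i + n - 32];
}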
View file
x265_1.7.tar.gz/source/common/yuv.cpp -> x265_1.8.tar.gz/source/common/yuv.cpp
Changed
@@ -28,7 +28,7 @@ #include "picyuv.h" #include "primitives.h" -using namespace x265; +using namespace X265_NS; Yuv::Yuv() {
View file
x265_1.7.tar.gz/source/common/yuv.h -> x265_1.8.tar.gz/source/common/yuv.h
Changed
@@ -27,7 +27,7 @@ #include "common.h" #include "primitives.h" -namespace x265 { +namespace X265_NS { // private namespace class ShortYuv;
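
The mechanical namespace x265 to X265_NS rename that runs through yuv.cpp, yuv.h and the rest of the tree is what enables the multilib builds of 1.8: each bit-depth configuration compiles the same sources into a differently named namespace, so an 8-bit and a 12-bit encoder can be linked into one binary without symbol clashes. A sketch of the idea only; the real definition comes from x265's build system, and the names below are assumed:

// hypothetical per-build mapping; x265's headers provide the real one
#if X265_DEPTH == 12
    #define X265_NS x265_12bit
#elif X265_DEPTH == 10
    #define X265_NS x265_10bit
#else
    #define X265_NS x265          // the 8-bit build keeps the historical name
#endif

namespace X265_NS { /* all encoder internals now live here, once per depth */ }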
View file
x265_1.7.tar.gz/source/compat/getopt/getopt.h -> x265_1.8.tar.gz/source/compat/getopt/getopt.h
Changed
@@ -144,23 +144,23 @@ /* Many other libraries have conflicting prototypes for getopt, with differences in the consts, in stdlib.h. To avoid compilation errors, only prototype getopt for the GNU C library. */ -extern int getopt (int __argc, char *const *__argv, const char *__shortopts); +extern int getopt (int argc, char *const *argv, const char *shortopts); # else /* not __GNU_LIBRARY__ */ extern int getopt (); # endif /* __GNU_LIBRARY__ */ # ifndef __need_getopt -extern int getopt_long (int __argc, char *const *__argv, const char *__shortopts, - const struct option *__longopts, int32_t *__longind); -extern int getopt_long_only (int __argc, char *const *__argv, - const char *__shortopts, - const struct option *__longopts, int32_t *__longind); +extern int getopt_long (int argc, char *const *argv, const char *shortopts, + const struct option *longopts, int32_t *longind); +extern int getopt_long_only (int argc, char *const *argv, + const char *shortopts, + const struct option *longopts, int32_t *longind); /* Internal only. Users should not call this directly. */ -extern int _getopt_internal (int __argc, char *const *__argv, - const char *__shortopts, - const struct option *__longopts, int32_t *__longind, - int __long_only); +extern int _getopt_internal (int argc, char *const *argv, + const char *shortopts, + const struct option *longopts, int32_t *longind, + int longonly); # endif #else /* not __STDC__ */ extern int getopt ();
View file
x265_1.7.tar.gz/source/compat/msvc/stdint.h -> x265_1.8.tar.gz/source/compat/msvc/stdint.h
Changed
@@ -8,6 +8,7 @@ #if !defined(UINT64_MAX) #include <limits.h> #define UINT64_MAX _UI64_MAX +#define INT16_MAX _I16_MAX #endif /* a minimal set of C99 types for use with MSVC (VC9) */
View file
x265_1.7.tar.gz/source/encoder/CMakeLists.txt -> x265_1.8.tar.gz/source/encoder/CMakeLists.txt
Changed
@@ -11,6 +11,20 @@ add_definitions(/wd4701) # potentially uninitialized local variable 'foo' used endif() +if(EXTRA_LIB) + if(LINKED_8BIT) + list(APPEND APIFLAGS "-DLINKED_8BIT=1") + endif(LINKED_8BIT) + if(LINKED_10BIT) + list(APPEND APIFLAGS "-DLINKED_10BIT=1") + endif(LINKED_10BIT) + if(LINKED_12BIT) + list(APPEND APIFLAGS "-DLINKED_12BIT=1") + endif(LINKED_12BIT) + string(REPLACE ";" " " APIFLAGSTR "${APIFLAGS}") + set_source_files_properties(api.cpp PROPERTIES COMPILE_FLAGS ${APIFLAGSTR}) +endif(EXTRA_LIB) + add_library(encoder OBJECT ../x265.h analysis.cpp analysis.h search.cpp search.h
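
These LINKED_* defines are applied only to api.cpp, the translation unit behind the public entry points; they tell the current build which sibling bit-depth libraries were linked alongside it so API lookups can be forwarded. A hedged sketch of the kind of dispatch this enables follows; the namespace hook and table names are illustrative, not x265's actual internal symbols.

#include "x265.h"                      // public x265_api table

#if LINKED_10BIT
namespace x265_10bit { const x265_api* api_get(); }   // assumed sibling-build hook
#endif

extern const x265_api thisBuildApi;    // assumed: this build's own API table

const x265_api* select_api(int bitDepth)
{
    if (bitDepth == X265_DEPTH)        // the depth this library was built for
        return &thisBuildApi;
#if LINKED_10BIT
    if (bitDepth == 10)                // hand off to the linked 10-bit build
        return x265_10bit::api_get();
#endif
    return NULL;                       // requested depth not linked in
}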
View file
x265_1.7.tar.gz/source/encoder/analysis.cpp -> x265_1.8.tar.gz/source/encoder/analysis.cpp
Changed
@@ -33,7 +33,7 @@ #include "rdcost.h" #include "encoder.h" -using namespace x265; +using namespace X265_NS; /* An explanation of rate distortion levels (--rd-level) * @@ -209,24 +209,20 @@ return; else if (md.bestMode->cu.isIntra(0)) { - m_quant.m_tqBypass = true; md.pred[PRED_LOSSLESS].initCosts(); md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom); PartSize size = (PartSize)md.pred[PRED_LOSSLESS].cu.m_partSize[0]; uint8_t* modes = md.pred[PRED_LOSSLESS].cu.m_lumaIntraDir; checkIntra(md.pred[PRED_LOSSLESS], cuGeom, size, modes, NULL); checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth); - m_quant.m_tqBypass = false; } else { - m_quant.m_tqBypass = true; md.pred[PRED_LOSSLESS].initCosts(); md.pred[PRED_LOSSLESS].cu.initLosslessCU(md.bestMode->cu, cuGeom); md.pred[PRED_LOSSLESS].predYuv.copyFromYuv(md.bestMode->predYuv); encodeResAndCalcRdInterCU(md.pred[PRED_LOSSLESS], cuGeom); checkBestMode(md.pred[PRED_LOSSLESS], cuGeom.depth); - m_quant.m_tqBypass = false; } } @@ -385,6 +381,8 @@ /* perform Mode task, repeat until no more work is available */ do { + uint32_t refMasks[2] = { 0, 0 }; + if (m_param->rdLevel <= 4) { switch (pmode.modes[task]) @@ -396,33 +394,33 @@ break; case PRED_2Nx2N: - slave.checkInter_rd0_4(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N); + slave.checkInter_rd0_4(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N, refMasks); if (m_slice->m_sliceType == B_SLICE) slave.checkBidir2Nx2N(md.pred[PRED_2Nx2N], md.pred[PRED_BIDIR], pmode.cuGeom); break; case PRED_Nx2N: - slave.checkInter_rd0_4(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N); + slave.checkInter_rd0_4(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N, refMasks); break; case PRED_2NxN: - slave.checkInter_rd0_4(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN); + slave.checkInter_rd0_4(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN, refMasks); break; case PRED_2NxnU: - slave.checkInter_rd0_4(md.pred[PRED_2NxnU], pmode.cuGeom, SIZE_2NxnU); + slave.checkInter_rd0_4(md.pred[PRED_2NxnU], pmode.cuGeom, SIZE_2NxnU, refMasks); break; case PRED_2NxnD: - slave.checkInter_rd0_4(md.pred[PRED_2NxnD], pmode.cuGeom, SIZE_2NxnD); + slave.checkInter_rd0_4(md.pred[PRED_2NxnD], pmode.cuGeom, SIZE_2NxnD, refMasks); break; case PRED_nLx2N: - slave.checkInter_rd0_4(md.pred[PRED_nLx2N], pmode.cuGeom, SIZE_nLx2N); + slave.checkInter_rd0_4(md.pred[PRED_nLx2N], pmode.cuGeom, SIZE_nLx2N, refMasks); break; case PRED_nRx2N: - slave.checkInter_rd0_4(md.pred[PRED_nRx2N], pmode.cuGeom, SIZE_nRx2N); + slave.checkInter_rd0_4(md.pred[PRED_nRx2N], pmode.cuGeom, SIZE_nRx2N, refMasks); break; default: @@ -441,7 +439,7 @@ break; case PRED_2Nx2N: - slave.checkInter_rd5_6(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N); + slave.checkInter_rd5_6(md.pred[PRED_2Nx2N], pmode.cuGeom, SIZE_2Nx2N, refMasks); md.pred[PRED_BIDIR].rdCost = MAX_INT64; if (m_slice->m_sliceType == B_SLICE) { @@ -452,27 +450,27 @@ break; case PRED_Nx2N: - slave.checkInter_rd5_6(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N); + slave.checkInter_rd5_6(md.pred[PRED_Nx2N], pmode.cuGeom, SIZE_Nx2N, refMasks); break; case PRED_2NxN: - slave.checkInter_rd5_6(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN); + slave.checkInter_rd5_6(md.pred[PRED_2NxN], pmode.cuGeom, SIZE_2NxN, refMasks); break; case PRED_2NxnU: - slave.checkInter_rd5_6(md.pred[PRED_2NxnU], pmode.cuGeom, SIZE_2NxnU); + slave.checkInter_rd5_6(md.pred[PRED_2NxnU], pmode.cuGeom, SIZE_2NxnU, refMasks); break; case PRED_2NxnD: - slave.checkInter_rd5_6(md.pred[PRED_2NxnD], pmode.cuGeom, SIZE_2NxnD); + slave.checkInter_rd5_6(md.pred[PRED_2NxnD], 
pmode.cuGeom, SIZE_2NxnD, refMasks); break; case PRED_nLx2N: - slave.checkInter_rd5_6(md.pred[PRED_nLx2N], pmode.cuGeom, SIZE_nLx2N); + slave.checkInter_rd5_6(md.pred[PRED_nLx2N], pmode.cuGeom, SIZE_nLx2N, refMasks); break; case PRED_nRx2N: - slave.checkInter_rd5_6(md.pred[PRED_nRx2N], pmode.cuGeom, SIZE_nRx2N); + slave.checkInter_rd5_6(md.pred[PRED_nRx2N], pmode.cuGeom, SIZE_nRx2N, refMasks); break; default: @@ -581,7 +579,8 @@ /* RD selection between merge, inter, bidir and intra */ if (!m_bChromaSa8d) /* When m_bChromaSa8d is enabled, chroma MC has already been done */ { - for (uint32_t puIdx = 0; puIdx < bestInter->cu.getNumPartInter(); puIdx++) + uint32_t numPU = bestInter->cu.getNumPartInter(0); + for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) { PredictionUnit pu(bestInter->cu, cuGeom, puIdx); motionCompensation(bestInter->cu, pu, bestInter->predYuv, false, true); @@ -617,7 +616,8 @@ else if (!md.bestMode->cu.m_mergeFlag[0]) { /* finally code the best mode selected from SA8D costs */ - for (uint32_t puIdx = 0; puIdx < md.bestMode->cu.getNumPartInter(); puIdx++) + uint32_t numPU = md.bestMode->cu.getNumPartInter(0); + for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) { PredictionUnit pu(md.bestMode->cu, cuGeom, puIdx); motionCompensation(md.bestMode->cu, pu, md.bestMode->predYuv, false, true); @@ -746,7 +746,7 @@ md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, cuAddr, cuGeom.absPartIdx); } -void Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp) +uint32_t Analysis::compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp) { uint32_t depth = cuGeom.depth; uint32_t cuAddr = parentCTU.m_cuAddr; @@ -756,24 +756,104 @@ bool mightSplit = !(cuGeom.flags & CUGeom::LEAF); bool mightNotSplit = !(cuGeom.flags & CUGeom::SPLIT_MANDATORY); uint32_t minDepth = topSkipMinDepth(parentCTU, cuGeom); - + bool earlyskip = false; + bool splitIntra = true; + uint32_t splitRefs[4] = { 0, 0, 0, 0 }; + /* Step 1. Evaluate Merge/Skip candidates for likely early-outs */ if (mightNotSplit && depth >= minDepth) { - bool bTryIntra = m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames; - /* Compute Merge Cost */ md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); checkMerge2Nx2N_rd0_4(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom); - - bool earlyskip = false; if (m_param->rdLevel) earlyskip = m_param->bEnableEarlySkip && md.bestMode && md.bestMode->cu.isSkipped(0); // TODO: sa8d threshold per depth + } + + bool bNoSplit = false; + if (md.bestMode) + { + bNoSplit = md.bestMode->cu.isSkipped(0); + if (mightSplit && depth && depth >= minDepth && !bNoSplit) + bNoSplit = recursionDepthCheck(parentCTU, cuGeom, *md.bestMode); + } + + /* Step 2. 
Evaluate each of the 4 split sub-blocks in series */ + if (mightSplit && !bNoSplit) + { + Mode* splitPred = &md.pred[PRED_SPLIT]; + splitPred->initCosts(); + CUData* splitCU = &splitPred->cu; + splitCU->initSubCU(parentCTU, cuGeom, qp); + + uint32_t nextDepth = depth + 1; + ModeDepth& nd = m_modeDepth[nextDepth]; + invalidateContexts(nextDepth); + Entropy* nextContext = &m_rqt[depth].cur; + int nextQP = qp; + splitIntra = false; + + for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++) + { + const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx); + if (childGeom.flags & CUGeom::PRESENT) + { + m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx); + m_rqt[nextDepth].cur.load(*nextContext); + + if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth) + nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom)); + + splitRefs[subPartIdx] = compressInterCU_rd0_4(parentCTU, childGeom, nextQP); + + // Save best CU and pred data for this sub CU + splitIntra |= nd.bestMode->cu.isIntra(0); + splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx); + splitPred->addSubCosts(*nd.bestMode); + + if (m_param->rdLevel) + nd.bestMode->reconYuv.copyToPartYuv(splitPred->reconYuv, childGeom.numPartitions * subPartIdx); + else + nd.bestMode->predYuv.copyToPartYuv(splitPred->predYuv, childGeom.numPartitions * subPartIdx); + if (m_param->rdLevel > 1) + nextContext = &nd.bestMode->contexts; + } + else + splitCU->setEmptyPart(childGeom, subPartIdx); + } + nextContext->store(splitPred->contexts); + + if (mightNotSplit) + addSplitFlagCost(*splitPred, cuGeom.depth); + else if (m_param->rdLevel > 1) + updateModeCost(*splitPred); + else + splitPred->sa8dCost = m_rdCost.calcRdSADCost((uint32_t)splitPred->distortion, splitPred->sa8dBits); + } + + /* Split CUs + * 0 1 + * 2 3 */ + uint32_t allSplitRefs = splitRefs[0] | splitRefs[1] | splitRefs[2] | splitRefs[3]; + /* Step 3. 
Evaluate ME (2Nx2N, rect, amp) and intra modes at current depth */ + if (mightNotSplit && depth >= minDepth) + { + if (m_slice->m_pps->bUseDQP && depth <= m_slice->m_pps->maxCuDQPDepth && m_slice->m_pps->maxCuDQPDepth != 0) + setLambdaFromQP(parentCTU, qp); if (!earlyskip) { + uint32_t refMasks[2]; + refMasks[0] = allSplitRefs; md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N); + checkInter_rd0_4(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N, refMasks); + + if (m_param->limitReferences & X265_REF_LIMIT_CU) + { + CUData& cu = md.pred[PRED_2Nx2N].cu; + uint32_t refMask = cu.getBestRefIdx(0); + allSplitRefs = splitRefs[0] = splitRefs[1] = splitRefs[2] = splitRefs[3] = refMask; + } if (m_slice->m_sliceType == B_SLICE) { @@ -784,13 +864,17 @@ Mode *bestInter = &md.pred[PRED_2Nx2N]; if (m_param->bEnableRectInter) { + refMasks[0] = splitRefs[0] | splitRefs[2]; /* left */ + refMasks[1] = splitRefs[1] | splitRefs[3]; /* right */ md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N); + checkInter_rd0_4(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, refMasks); if (md.pred[PRED_Nx2N].sa8dCost < bestInter->sa8dCost) bestInter = &md.pred[PRED_Nx2N]; + refMasks[0] = splitRefs[0] | splitRefs[1]; /* top */ + refMasks[1] = splitRefs[2] | splitRefs[3]; /* bot */ md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN); + checkInter_rd0_4(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks); if (md.pred[PRED_2NxN].sa8dCost < bestInter->sa8dCost) bestInter = &md.pred[PRED_2NxN]; } @@ -811,36 +895,45 @@ if (bHor) { + refMasks[0] = splitRefs[0] | splitRefs[1]; /* 25% top */ + refMasks[1] = allSplitRefs; /* 75% bot */ md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU); + checkInter_rd0_4(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, refMasks); if (md.pred[PRED_2NxnU].sa8dCost < bestInter->sa8dCost) bestInter = &md.pred[PRED_2NxnU]; + refMasks[0] = allSplitRefs; /* 75% top */ + refMasks[1] = splitRefs[2] | splitRefs[3]; /* 25% bot */ md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD); + checkInter_rd0_4(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks); if (md.pred[PRED_2NxnD].sa8dCost < bestInter->sa8dCost) bestInter = &md.pred[PRED_2NxnD]; } if (bVer) { + refMasks[0] = splitRefs[0] | splitRefs[2]; /* 25% left */ + refMasks[1] = allSplitRefs; /* 75% right */ md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N); + checkInter_rd0_4(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, refMasks); if (md.pred[PRED_nLx2N].sa8dCost < bestInter->sa8dCost) bestInter = &md.pred[PRED_nLx2N]; + refMasks[0] = allSplitRefs; /* 75% left */ + refMasks[1] = splitRefs[1] | splitRefs[3]; /* 25% right */ md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N); + checkInter_rd0_4(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks); if (md.pred[PRED_nRx2N].sa8dCost < bestInter->sa8dCost) bestInter = &md.pred[PRED_nRx2N]; } } - + bool bTryIntra = m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames; if (m_param->rdLevel >= 3) { /* Calculate RD cost of best inter option */ if (!m_bChromaSa8d) /* When m_bChromaSa8d is enabled, chroma MC has already been done */ { - for (uint32_t puIdx = 0; puIdx < bestInter->cu.getNumPartInter(); 
puIdx++) + uint32_t numPU = bestInter->cu.getNumPartInter(0); + for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) { PredictionUnit pu(bestInter->cu, cuGeom, puIdx); motionCompensation(bestInter->cu, pu, bestInter->predYuv, false, true); @@ -860,10 +953,18 @@ if ((bTryIntra && md.bestMode->cu.getQtRootCbf(0)) || md.bestMode->sa8dCost == MAX_INT64) { - md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); - checkIntraInInter(md.pred[PRED_INTRA], cuGeom); - encodeIntraInInter(md.pred[PRED_INTRA], cuGeom); - checkBestMode(md.pred[PRED_INTRA], depth); + if (!m_param->limitReferences || splitIntra) + { + ProfileCounter(parentCTU, totalIntraCU[cuGeom.depth]); + md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); + checkIntraInInter(md.pred[PRED_INTRA], cuGeom); + encodeIntraInInter(md.pred[PRED_INTRA], cuGeom); + checkBestMode(md.pred[PRED_INTRA], depth); + } + else + { + ProfileCounter(parentCTU, skippedIntraCU[cuGeom.depth]); + } } } else @@ -878,10 +979,18 @@ if (bTryIntra || md.bestMode->sa8dCost == MAX_INT64) { - md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); - checkIntraInInter(md.pred[PRED_INTRA], cuGeom); - if (md.pred[PRED_INTRA].sa8dCost < md.bestMode->sa8dCost) - md.bestMode = &md.pred[PRED_INTRA]; + if (!m_param->limitReferences || splitIntra) + { + ProfileCounter(parentCTU, totalIntraCU[cuGeom.depth]); + md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); + checkIntraInInter(md.pred[PRED_INTRA], cuGeom); + if (md.pred[PRED_INTRA].sa8dCost < md.bestMode->sa8dCost) + md.bestMode = &md.pred[PRED_INTRA]; + } + else + { + ProfileCounter(parentCTU, skippedIntraCU[cuGeom.depth]); + } } /* finally code the best mode selected by SA8D costs: @@ -895,7 +1004,8 @@ } else if (md.bestMode->cu.isInter(0)) { - for (uint32_t puIdx = 0; puIdx < md.bestMode->cu.getNumPartInter(); puIdx++) + uint32_t numPU = md.bestMode->cu.getNumPartInter(0); + for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) { PredictionUnit pu(md.bestMode->cu, cuGeom, puIdx); motionCompensation(md.bestMode->cu, pu, md.bestMode->predYuv, false, true); @@ -950,63 +1060,9 @@ addSplitFlagCost(*md.bestMode, cuGeom.depth); } - bool bNoSplit = false; - if (md.bestMode) - { - bNoSplit = md.bestMode->cu.isSkipped(0); - if (mightSplit && depth && depth >= minDepth && !bNoSplit) - bNoSplit = recursionDepthCheck(parentCTU, cuGeom, *md.bestMode); - } - if (mightSplit && !bNoSplit) { Mode* splitPred = &md.pred[PRED_SPLIT]; - splitPred->initCosts(); - CUData* splitCU = &splitPred->cu; - splitCU->initSubCU(parentCTU, cuGeom, qp); - - uint32_t nextDepth = depth + 1; - ModeDepth& nd = m_modeDepth[nextDepth]; - invalidateContexts(nextDepth); - Entropy* nextContext = &m_rqt[depth].cur; - int nextQP = qp; - - for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++) - { - const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx); - if (childGeom.flags & CUGeom::PRESENT) - { - m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx); - m_rqt[nextDepth].cur.load(*nextContext); - - if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth) - nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom)); - - compressInterCU_rd0_4(parentCTU, childGeom, nextQP); - - // Save best CU and pred data for this sub CU - splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx); - splitPred->addSubCosts(*nd.bestMode); - - if (m_param->rdLevel) - nd.bestMode->reconYuv.copyToPartYuv(splitPred->reconYuv, childGeom.numPartitions * subPartIdx); - else - 
nd.bestMode->predYuv.copyToPartYuv(splitPred->predYuv, childGeom.numPartitions * subPartIdx); - if (m_param->rdLevel > 1) - nextContext = &nd.bestMode->contexts; - } - else - splitCU->setEmptyPart(childGeom, subPartIdx); - } - nextContext->store(splitPred->contexts); - - if (mightNotSplit) - addSplitFlagCost(*splitPred, cuGeom.depth); - else if (m_param->rdLevel > 1) - updateModeCost(*splitPred); - else - splitPred->sa8dCost = m_rdCost.calcRdSADCost(splitPred->distortion, splitPred->sa8dBits); - if (!md.bestMode) md.bestMode = splitPred; else if (m_param->rdLevel > 1) @@ -1016,6 +1072,23 @@ checkDQPForSplitPred(*md.bestMode, cuGeom); } + + /* determine which motion references the parent CU should search */ + uint32_t refMask; + if (!(m_param->limitReferences & X265_REF_LIMIT_DEPTH)) + refMask = 0; + else if (md.bestMode == &md.pred[PRED_SPLIT]) + refMask = allSplitRefs; + else + { + /* use best merge/inter mode, in case of intra use 2Nx2N inter references */ + CUData& cu = md.bestMode->cu.isIntra(0) ? md.pred[PRED_2Nx2N].cu : md.bestMode->cu; + uint32_t numPU = cu.getNumPartInter(0); + refMask = 0; + for (uint32_t puIdx = 0, subPartIdx = 0; puIdx < numPU; puIdx++, subPartIdx += cu.getPUOffset(puIdx, 0)) + refMask |= cu.getBestRefIdx(subPartIdx); + } + if (mightNotSplit) { /* early-out statistics */ @@ -1029,11 +1102,13 @@ /* Copy best data to encData CTU and recon */ X265_CHECK(md.bestMode->ok(), "best mode is not ok"); md.bestMode->cu.copyToPic(depth); - if (md.bestMode != &md.pred[PRED_SPLIT] && m_param->rdLevel) + if (m_param->rdLevel) md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, cuAddr, cuGeom.absPartIdx); + + return refMask; } -void Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp) +uint32_t Analysis::compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp) { uint32_t depth = cuGeom.depth; ModeDepth& md = m_modeDepth[depth]; @@ -1066,19 +1141,94 @@ } } + bool foundSkip = false; + bool splitIntra = true; + uint32_t splitRefs[4] = { 0, 0, 0, 0 }; + /* Step 1. Evaluate Merge/Skip candidates for likely early-outs */ if (mightNotSplit) { md.pred[PRED_SKIP].cu.initSubCU(parentCTU, cuGeom, qp); md.pred[PRED_MERGE].cu.initSubCU(parentCTU, cuGeom, qp); checkMerge2Nx2N_rd5_6(md.pred[PRED_SKIP], md.pred[PRED_MERGE], cuGeom, false); - bool earlySkip = m_param->bEnableEarlySkip && md.bestMode && !md.bestMode->cu.getQtRootCbf(0); + foundSkip = md.bestMode && !md.bestMode->cu.getQtRootCbf(0); + } + + // estimate split cost + /* Step 2. 
Evaluate each of the 4 split sub-blocks in series */ + if (mightSplit && !foundSkip) + { + Mode* splitPred = &md.pred[PRED_SPLIT]; + splitPred->initCosts(); + CUData* splitCU = &splitPred->cu; + splitCU->initSubCU(parentCTU, cuGeom, qp); + + uint32_t nextDepth = depth + 1; + ModeDepth& nd = m_modeDepth[nextDepth]; + invalidateContexts(nextDepth); + Entropy* nextContext = &m_rqt[depth].cur; + int nextQP = qp; + splitIntra = false; - if (!earlySkip) + for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++) { + const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx); + if (childGeom.flags & CUGeom::PRESENT) + { + m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx); + m_rqt[nextDepth].cur.load(*nextContext); + + if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth) + nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom)); + + splitRefs[subPartIdx] = compressInterCU_rd5_6(parentCTU, childGeom, zOrder, nextQP); + + // Save best CU and pred data for this sub CU + splitIntra |= nd.bestMode->cu.isIntra(0); + splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx); + splitPred->addSubCosts(*nd.bestMode); + nd.bestMode->reconYuv.copyToPartYuv(splitPred->reconYuv, childGeom.numPartitions * subPartIdx); + nextContext = &nd.bestMode->contexts; + } + else + { + splitCU->setEmptyPart(childGeom, subPartIdx); + zOrder += g_depthInc[g_maxCUDepth - 1][nextDepth]; + } + } + nextContext->store(splitPred->contexts); + if (mightNotSplit) + addSplitFlagCost(*splitPred, cuGeom.depth); + else + updateModeCost(*splitPred); + + checkDQPForSplitPred(*splitPred, cuGeom); + } + + /* Split CUs + * 0 1 + * 2 3 */ + uint32_t allSplitRefs = splitRefs[0] | splitRefs[1] | splitRefs[2] | splitRefs[3]; + /* Step 3. 
Evaluate ME (2Nx2N, rect, amp) and intra modes at current depth */ + if (mightNotSplit) + { + if (m_slice->m_pps->bUseDQP && depth <= m_slice->m_pps->maxCuDQPDepth && m_slice->m_pps->maxCuDQPDepth != 0) + setLambdaFromQP(parentCTU, qp); + + if (!(foundSkip && m_param->bEnableEarlySkip)) + { + uint32_t refMasks[2]; + refMasks[0] = allSplitRefs; md.pred[PRED_2Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N); + checkInter_rd5_6(md.pred[PRED_2Nx2N], cuGeom, SIZE_2Nx2N, refMasks); checkBestMode(md.pred[PRED_2Nx2N], cuGeom.depth); + if (m_param->limitReferences & X265_REF_LIMIT_CU) + { + CUData& cu = md.pred[PRED_2Nx2N].cu; + uint32_t refMask = cu.getBestRefIdx(0); + allSplitRefs = splitRefs[0] = splitRefs[1] = splitRefs[2] = splitRefs[3] = refMask; + } + if (m_slice->m_sliceType == B_SLICE) { md.pred[PRED_BIDIR].cu.initSubCU(parentCTU, cuGeom, qp); @@ -1092,12 +1242,16 @@ if (m_param->bEnableRectInter) { + refMasks[0] = splitRefs[0] | splitRefs[2]; /* left */ + refMasks[1] = splitRefs[1] | splitRefs[3]; /* right */ md.pred[PRED_Nx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N); + checkInter_rd5_6(md.pred[PRED_Nx2N], cuGeom, SIZE_Nx2N, refMasks); checkBestMode(md.pred[PRED_Nx2N], cuGeom.depth); + refMasks[0] = splitRefs[0] | splitRefs[1]; /* top */ + refMasks[1] = splitRefs[2] | splitRefs[3]; /* bot */ md.pred[PRED_2NxN].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN); + checkInter_rd5_6(md.pred[PRED_2NxN], cuGeom, SIZE_2NxN, refMasks); checkBestMode(md.pred[PRED_2NxN], cuGeom.depth); } @@ -1117,37 +1271,53 @@ if (bHor) { + refMasks[0] = splitRefs[0] | splitRefs[1]; /* 25% top */ + refMasks[1] = allSplitRefs; /* 75% bot */ md.pred[PRED_2NxnU].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU); + checkInter_rd5_6(md.pred[PRED_2NxnU], cuGeom, SIZE_2NxnU, refMasks); checkBestMode(md.pred[PRED_2NxnU], cuGeom.depth); + refMasks[0] = allSplitRefs; /* 75% top */ + refMasks[1] = splitRefs[2] | splitRefs[3]; /* 25% bot */ md.pred[PRED_2NxnD].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD); + checkInter_rd5_6(md.pred[PRED_2NxnD], cuGeom, SIZE_2NxnD, refMasks); checkBestMode(md.pred[PRED_2NxnD], cuGeom.depth); } if (bVer) { + refMasks[0] = splitRefs[0] | splitRefs[2]; /* 25% left */ + refMasks[1] = allSplitRefs; /* 75% right */ md.pred[PRED_nLx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N); + checkInter_rd5_6(md.pred[PRED_nLx2N], cuGeom, SIZE_nLx2N, refMasks); checkBestMode(md.pred[PRED_nLx2N], cuGeom.depth); + refMasks[0] = allSplitRefs; /* 75% left */ + refMasks[1] = splitRefs[1] | splitRefs[3]; /* 25% right */ md.pred[PRED_nRx2N].cu.initSubCU(parentCTU, cuGeom, qp); - checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N); + checkInter_rd5_6(md.pred[PRED_nRx2N], cuGeom, SIZE_nRx2N, refMasks); checkBestMode(md.pred[PRED_nRx2N], cuGeom.depth); } } if (m_slice->m_sliceType != B_SLICE || m_param->bIntraInBFrames) { - md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); - checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, NULL); - checkBestMode(md.pred[PRED_INTRA], depth); + if (!m_param->limitReferences || splitIntra) + { + ProfileCounter(parentCTU, totalIntraCU[cuGeom.depth]); + md.pred[PRED_INTRA].cu.initSubCU(parentCTU, cuGeom, qp); + checkIntra(md.pred[PRED_INTRA], cuGeom, SIZE_2Nx2N, NULL, 
NULL); + checkBestMode(md.pred[PRED_INTRA], depth); - if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3) + if (cuGeom.log2CUSize == 3 && m_slice->m_sps->quadtreeTULog2MinSize < 3) + { + md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp); + checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL); + checkBestMode(md.pred[PRED_INTRA_NxN], depth); + } + } + else { - md.pred[PRED_INTRA_NxN].cu.initSubCU(parentCTU, cuGeom, qp); - checkIntra(md.pred[PRED_INTRA_NxN], cuGeom, SIZE_NxN, NULL, NULL); - checkBestMode(md.pred[PRED_INTRA_NxN], depth); + ProfileCounter(parentCTU, skippedIntraCU[cuGeom.depth]); } } } @@ -1159,59 +1329,32 @@ addSplitFlagCost(*md.bestMode, cuGeom.depth); } - // estimate split cost - if (mightSplit && (!md.bestMode || !md.bestMode->cu.isSkipped(0))) - { - Mode* splitPred = &md.pred[PRED_SPLIT]; - splitPred->initCosts(); - CUData* splitCU = &splitPred->cu; - splitCU->initSubCU(parentCTU, cuGeom, qp); - - uint32_t nextDepth = depth + 1; - ModeDepth& nd = m_modeDepth[nextDepth]; - invalidateContexts(nextDepth); - Entropy* nextContext = &m_rqt[depth].cur; - int nextQP = qp; - - for (uint32_t subPartIdx = 0; subPartIdx < 4; subPartIdx++) - { - const CUGeom& childGeom = *(&cuGeom + cuGeom.childOffset + subPartIdx); - if (childGeom.flags & CUGeom::PRESENT) - { - m_modeDepth[0].fencYuv.copyPartToYuv(nd.fencYuv, childGeom.absPartIdx); - m_rqt[nextDepth].cur.load(*nextContext); - - if (m_slice->m_pps->bUseDQP && nextDepth <= m_slice->m_pps->maxCuDQPDepth) - nextQP = setLambdaFromQP(parentCTU, calculateQpforCuSize(parentCTU, childGeom)); - - compressInterCU_rd5_6(parentCTU, childGeom, zOrder, nextQP); - - // Save best CU and pred data for this sub CU - splitCU->copyPartFrom(nd.bestMode->cu, childGeom, subPartIdx); - splitPred->addSubCosts(*nd.bestMode); - nd.bestMode->reconYuv.copyToPartYuv(splitPred->reconYuv, childGeom.numPartitions * subPartIdx); - nextContext = &nd.bestMode->contexts; - } - else - { - splitCU->setEmptyPart(childGeom, subPartIdx); - zOrder += g_depthInc[g_maxCUDepth - 1][nextDepth]; - } - } - nextContext->store(splitPred->contexts); - if (mightNotSplit) - addSplitFlagCost(*splitPred, cuGeom.depth); - else - updateModeCost(*splitPred); + /* compare split RD cost against best cost */ + if (mightSplit && !foundSkip) + checkBestMode(md.pred[PRED_SPLIT], depth); - checkDQPForSplitPred(*splitPred, cuGeom); - checkBestMode(*splitPred, depth); + /* determine which motion references the parent CU should search */ + uint32_t refMask; + if (!(m_param->limitReferences & X265_REF_LIMIT_DEPTH)) + refMask = 0; + else if (md.bestMode == &md.pred[PRED_SPLIT]) + refMask = allSplitRefs; + else + { + /* use best merge/inter mode, in case of intra use 2Nx2N inter references */ + CUData& cu = md.bestMode->cu.isIntra(0) ? 
md.pred[PRED_2Nx2N].cu : md.bestMode->cu; + uint32_t numPU = cu.getNumPartInter(0); + refMask = 0; + for (uint32_t puIdx = 0, subPartIdx = 0; puIdx < numPU; puIdx++, subPartIdx += cu.getPUOffset(puIdx, 0)) + refMask |= cu.getBestRefIdx(subPartIdx); } /* Copy best data to encData CTU and recon */ + X265_CHECK(md.bestMode->ok(), "best mode is not ok"); md.bestMode->cu.copyToPic(depth); - if (md.bestMode != &md.pred[PRED_SPLIT]) - md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, parentCTU.m_cuAddr, cuGeom.absPartIdx); + md.bestMode->reconYuv.copyToPicYuv(*m_frame->m_reconPic, parentCTU.m_cuAddr, cuGeom.absPartIdx); + + return refMask; } /* sets md.bestMode if a valid merge candidate is found, else leaves it NULL */ @@ -1271,7 +1414,7 @@ tempPred->distortion += primitives.chroma[m_csp].cu[sizeIdx].sa8d(fencYuv->m_buf[1], fencYuv->m_csize, tempPred->predYuv.m_buf[1], tempPred->predYuv.m_csize); tempPred->distortion += primitives.chroma[m_csp].cu[sizeIdx].sa8d(fencYuv->m_buf[2], fencYuv->m_csize, tempPred->predYuv.m_buf[2], tempPred->predYuv.m_csize); } - tempPred->sa8dCost = m_rdCost.calcRdSADCost(tempPred->distortion, tempPred->sa8dBits); + tempPred->sa8dCost = m_rdCost.calcRdSADCost((uint32_t)tempPred->distortion, tempPred->sa8dBits); if (tempPred->sa8dCost < bestPred->sa8dCost) { @@ -1324,7 +1467,7 @@ } /* sets md.bestMode if a valid merge candidate is found, else leaves it NULL */ -void Analysis::checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGeom, bool isSkipMode) +void Analysis::checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGeom, bool isShareMergeCand) { uint32_t depth = cuGeom.depth; @@ -1352,91 +1495,82 @@ bool triedPZero = false, triedBZero = false; bestPred->rdCost = MAX_INT64; - if (isSkipMode) + uint32_t first = 0, last = numMergeCand; + if (isShareMergeCand) { - uint32_t i = *m_reuseBestMergeCand; - bestPred->cu.m_mvpIdx[0][0] = (uint8_t)i; - bestPred->cu.m_interDir[0] = candDir[i]; - bestPred->cu.m_mv[0][0] = candMvField[i][0].mv; - bestPred->cu.m_mv[1][0] = candMvField[i][1].mv; - bestPred->cu.m_refIdx[0][0] = (int8_t)candMvField[i][0].refIdx; - bestPred->cu.m_refIdx[1][0] = (int8_t)candMvField[i][1].refIdx; - - motionCompensation(bestPred->cu, pu, bestPred->predYuv, true, true); - encodeResAndCalcRdSkipCU(*bestPred); + first = *m_reuseBestMergeCand; + last = first + 1; } - else + + for (uint32_t i = first; i < last; i++) { - for (uint32_t i = 0; i < numMergeCand; i++) + if (m_bFrameParallel && + (candMvField[i][0].mv.y >= (m_param->searchRange + 1) * 4 || + candMvField[i][1].mv.y >= (m_param->searchRange + 1) * 4)) + continue; + + /* the merge candidate list is packed with MV(0,0) ref 0 when it is not full */ + if (candDir[i] == 1 && !candMvField[i][0].mv.word && !candMvField[i][0].refIdx) { - if (m_bFrameParallel && - (candMvField[i][0].mv.y >= (m_param->searchRange + 1) * 4 || - candMvField[i][1].mv.y >= (m_param->searchRange + 1) * 4)) + if (triedPZero) continue; + triedPZero = true; + } + else if (candDir[i] == 3 && + !candMvField[i][0].mv.word && !candMvField[i][0].refIdx && + !candMvField[i][1].mv.word && !candMvField[i][1].refIdx) + { + if (triedBZero) + continue; + triedBZero = true; + } - /* the merge candidate list is packed with MV(0,0) ref 0 when it is not full */ - if (candDir[i] == 1 && !candMvField[i][0].mv.word && !candMvField[i][0].refIdx) - { - if (triedPZero) - continue; - triedPZero = true; - } - else if (candDir[i] == 3 && - !candMvField[i][0].mv.word && !candMvField[i][0].refIdx && - !candMvField[i][1].mv.word && 
!candMvField[i][1].refIdx) - { - if (triedBZero) - continue; - triedBZero = true; - } - - tempPred->cu.m_mvpIdx[0][0] = (uint8_t)i; /* merge candidate ID is stored in L0 MVP idx */ - tempPred->cu.m_interDir[0] = candDir[i]; - tempPred->cu.m_mv[0][0] = candMvField[i][0].mv; - tempPred->cu.m_mv[1][0] = candMvField[i][1].mv; - tempPred->cu.m_refIdx[0][0] = (int8_t)candMvField[i][0].refIdx; - tempPred->cu.m_refIdx[1][0] = (int8_t)candMvField[i][1].refIdx; - tempPred->cu.setPredModeSubParts(MODE_INTER); /* must be cleared between encode iterations */ + tempPred->cu.m_mvpIdx[0][0] = (uint8_t)i; /* merge candidate ID is stored in L0 MVP idx */ + tempPred->cu.m_interDir[0] = candDir[i]; + tempPred->cu.m_mv[0][0] = candMvField[i][0].mv; + tempPred->cu.m_mv[1][0] = candMvField[i][1].mv; + tempPred->cu.m_refIdx[0][0] = (int8_t)candMvField[i][0].refIdx; + tempPred->cu.m_refIdx[1][0] = (int8_t)candMvField[i][1].refIdx; + tempPred->cu.setPredModeSubParts(MODE_INTER); /* must be cleared between encode iterations */ - motionCompensation(tempPred->cu, pu, tempPred->predYuv, true, true); + motionCompensation(tempPred->cu, pu, tempPred->predYuv, true, true); - uint8_t hasCbf = true; - bool swapped = false; - if (!foundCbf0Merge) - { - /* if the best prediction has CBF (not a skip) then try merge with residual */ + uint8_t hasCbf = true; + bool swapped = false; + if (!foundCbf0Merge) + { + /* if the best prediction has CBF (not a skip) then try merge with residual */ - encodeResAndCalcRdInterCU(*tempPred, cuGeom); - hasCbf = tempPred->cu.getQtRootCbf(0); - foundCbf0Merge = !hasCbf; + encodeResAndCalcRdInterCU(*tempPred, cuGeom); + hasCbf = tempPred->cu.getQtRootCbf(0); + foundCbf0Merge = !hasCbf; - if (tempPred->rdCost < bestPred->rdCost) - { - std::swap(tempPred, bestPred); - swapped = true; - } - } - if (!m_param->bLossless && hasCbf) + if (tempPred->rdCost < bestPred->rdCost) { - /* try merge without residual (skip), if not lossless coding */ + std::swap(tempPred, bestPred); + swapped = true; + } + } + if (!m_param->bLossless && hasCbf) + { + /* try merge without residual (skip), if not lossless coding */ - if (swapped) - { - tempPred->cu.m_mvpIdx[0][0] = (uint8_t)i; - tempPred->cu.m_interDir[0] = candDir[i]; - tempPred->cu.m_mv[0][0] = candMvField[i][0].mv; - tempPred->cu.m_mv[1][0] = candMvField[i][1].mv; - tempPred->cu.m_refIdx[0][0] = (int8_t)candMvField[i][0].refIdx; - tempPred->cu.m_refIdx[1][0] = (int8_t)candMvField[i][1].refIdx; - tempPred->cu.setPredModeSubParts(MODE_INTER); - tempPred->predYuv.copyFromYuv(bestPred->predYuv); - } + if (swapped) + { + tempPred->cu.m_mvpIdx[0][0] = (uint8_t)i; + tempPred->cu.m_interDir[0] = candDir[i]; + tempPred->cu.m_mv[0][0] = candMvField[i][0].mv; + tempPred->cu.m_mv[1][0] = candMvField[i][1].mv; + tempPred->cu.m_refIdx[0][0] = (int8_t)candMvField[i][0].refIdx; + tempPred->cu.m_refIdx[1][0] = (int8_t)candMvField[i][1].refIdx; + tempPred->cu.setPredModeSubParts(MODE_INTER); + tempPred->predYuv.copyFromYuv(bestPred->predYuv); + } - encodeResAndCalcRdSkipCU(*tempPred); + encodeResAndCalcRdSkipCU(*tempPred); - if (tempPred->rdCost < bestPred->rdCost) - std::swap(tempPred, bestPred); - } + if (tempPred->rdCost < bestPred->rdCost) + std::swap(tempPred, bestPred); } } @@ -1463,7 +1597,7 @@ } } -void Analysis::checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize) +void Analysis::checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, uint32_t refMask[2]) { interMode.initCosts(); interMode.cu.setPartSizeSubParts(partSize); @@ 
-1472,7 +1606,8 @@ if (m_param->analysisMode == X265_ANALYSIS_LOAD && m_reuseInterDataCTU) { - for (uint32_t part = 0; part < interMode.cu.getNumPartInter(); part++) + uint32_t numPU = interMode.cu.getNumPartInter(0); + for (uint32_t part = 0; part < numPU; part++) { MotionData* bestME = interMode.bestME[part]; for (int32_t i = 0; i < numPredDir; i++) @@ -1483,7 +1618,7 @@ } } - predInterSearch(interMode, cuGeom, m_bChromaSa8d); + predInterSearch(interMode, cuGeom, m_bChromaSa8d, refMask); /* predInterSearch sets interMode.sa8dBits */ const Yuv& fencYuv = *interMode.fencYuv; @@ -1495,11 +1630,12 @@ interMode.distortion += primitives.chroma[m_csp].cu[part].sa8d(fencYuv.m_buf[1], fencYuv.m_csize, predYuv.m_buf[1], predYuv.m_csize); interMode.distortion += primitives.chroma[m_csp].cu[part].sa8d(fencYuv.m_buf[2], fencYuv.m_csize, predYuv.m_buf[2], predYuv.m_csize); } - interMode.sa8dCost = m_rdCost.calcRdSADCost(interMode.distortion, interMode.sa8dBits); + interMode.sa8dCost = m_rdCost.calcRdSADCost((uint32_t)interMode.distortion, interMode.sa8dBits); if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_reuseInterDataCTU) { - for (uint32_t puIdx = 0; puIdx < interMode.cu.getNumPartInter(); puIdx++) + uint32_t numPU = interMode.cu.getNumPartInter(0); + for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) { MotionData* bestME = interMode.bestME[puIdx]; for (int32_t i = 0; i < numPredDir; i++) @@ -1511,7 +1647,7 @@ } } -void Analysis::checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize) +void Analysis::checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, uint32_t refMask[2]) { interMode.initCosts(); interMode.cu.setPartSizeSubParts(partSize); @@ -1520,7 +1656,8 @@ if (m_param->analysisMode == X265_ANALYSIS_LOAD && m_reuseInterDataCTU) { - for (uint32_t puIdx = 0; puIdx < interMode.cu.getNumPartInter(); puIdx++) + uint32_t numPU = interMode.cu.getNumPartInter(0); + for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) { MotionData* bestME = interMode.bestME[puIdx]; for (int32_t i = 0; i < numPredDir; i++) @@ -1531,14 +1668,15 @@ } } - predInterSearch(interMode, cuGeom, true); + predInterSearch(interMode, cuGeom, true, refMask); /* predInterSearch sets interMode.sa8dBits, but this is ignored */ encodeResAndCalcRdInterCU(interMode, cuGeom); if (m_param->analysisMode == X265_ANALYSIS_SAVE && m_reuseInterDataCTU) { - for (uint32_t puIdx = 0; puIdx < interMode.cu.getNumPartInter(); puIdx++) + uint32_t numPU = interMode.cu.getNumPartInter(0); + for (uint32_t puIdx = 0; puIdx < numPU; puIdx++) { MotionData* bestME = interMode.bestME[puIdx]; for (int32_t i = 0; i < numPredDir; i++) @@ -1805,7 +1943,7 @@ else if (m_param->rdLevel <= 1) { mode.sa8dBits++; - mode.sa8dCost = m_rdCost.calcRdSADCost(mode.distortion, mode.sa8dBits); + mode.sa8dCost = m_rdCost.calcRdSADCost((uint32_t)mode.distortion, mode.sa8dBits); } else {
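Note on the analysis.cpp hunks above: this is the core of the new --limit-refs feature. Each recursive compressInterCU_rd0_4()/compressInterCU_rd5_6() call now returns a bitmask of the reference pictures its best mode actually used; the parent collects the four sub-CU masks in splitRefs[0..3] and, under X265_REF_LIMIT_DEPTH, restricts motion search of each larger partition to the union of the masks from the sub-CUs that partition covers. Likewise, when limit-refs is on and no sub-CU chose intra (splitIntra stays false), the intra candidate at the parent depth is skipped outright. A minimal sketch of the mask geometry, assuming one bit per reference index (helper names here are illustrative, not x265's):

    #include <cstdint>

    // Quadrant layout of the four sub-CUs, as in the diff:
    //   0 1
    //   2 3
    struct PuRefMasks { uint32_t pu[2]; };

    // Vertical split (SIZE_Nx2N): each half-PU inherits the refs of the
    // two quadrants it overlaps. The other rect shape mirrors this.
    PuRefMasks masksForNx2N(const uint32_t splitRefs[4])
    {
        PuRefMasks r;
        r.pu[0] = splitRefs[0] | splitRefs[2];   // left half
        r.pu[1] = splitRefs[1] | splitRefs[3];   // right half
        return r;
    }

    // AMP shapes give their 75% PU the full union:
    PuRefMasks masksFor2NxnU(const uint32_t splitRefs[4])
    {
        uint32_t all = splitRefs[0] | splitRefs[1] | splitRefs[2] | splitRefs[3];
        PuRefMasks r;
        r.pu[0] = splitRefs[0] | splitRefs[1];   // 25% top
        r.pu[1] = all;                           // 75% bottom
        return r;
    }

A zero mask is presumably treated by predInterSearch() as "no restriction", which is consistent with the diff returning refMask = 0 whenever X265_REF_LIMIT_DEPTH is disabled.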
View file
x265_1.7.tar.gz/source/encoder/analysis.h -> x265_1.8.tar.gz/source/encoder/analysis.h
Changed
@@ -35,7 +35,7 @@ #include "entropy.h" #include "search.h" -namespace x265 { +namespace X265_NS { // private namespace class Entropy; @@ -113,16 +113,16 @@ /* full analysis for a P or B slice CU */ void compressInterCU_dist(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp); - void compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp); - void compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp); + uint32_t compressInterCU_rd0_4(const CUData& parentCTU, const CUGeom& cuGeom, int32_t qp); + uint32_t compressInterCU_rd5_6(const CUData& parentCTU, const CUGeom& cuGeom, uint32_t &zOrder, int32_t qp); /* measure merge and skip */ void checkMerge2Nx2N_rd0_4(Mode& skip, Mode& merge, const CUGeom& cuGeom); - void checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGeom, bool isSkipMode); + void checkMerge2Nx2N_rd5_6(Mode& skip, Mode& merge, const CUGeom& cuGeom, bool isShareMergeCand); /* measure inter options */ - void checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize); - void checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize); + void checkInter_rd0_4(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, uint32_t refmask[2]); + void checkInter_rd5_6(Mode& interMode, const CUGeom& cuGeom, PartSize partSize, uint32_t refmask[2]); void checkBidir2Nx2N(Mode& inter2Nx2N, Mode& bidir2Nx2N, const CUGeom& cuGeom);
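Two interface changes are visible in this header: the compress functions now return the reference mask described above, and the last flag of checkMerge2Nx2N_rd5_6() was renamed from isSkipMode to isShareMergeCand, reflecting a refactor in the analysis.cpp hunk where the reused best merge candidate is no longer a separate code path but simply a one-element range over the normal candidate loop. Condensed from the diff (a fragment, not standalone code):

    uint32_t first = 0, last = numMergeCand;
    if (isShareMergeCand)        // analysis-reuse mode: test only the stored best
    {
        first = *m_reuseBestMergeCand;
        last  = first + 1;
    }
    for (uint32_t i = first; i < last; i++)
        /* evaluate merge candidate i, with and without residual */;

This removes the duplicated motion-compensation and RD code that the 1.7 version kept in its isSkipMode branch.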
View file
x265_1.7.tar.gz/source/encoder/api.cpp -> x265_1.8.tar.gz/source/encoder/api.cpp
Changed
@@ -31,25 +31,69 @@ #include "nal.h" #include "bitcost.h" -using namespace x265; +/* multilib namespace reflectors */ +#if LINKED_8BIT +namespace x265_8bit { +const x265_api* x265_api_get(int bitDepth); +const x265_api* x265_api_query(int bitDepth, int apiVersion, int* err); +} +#endif + +#if LINKED_10BIT +namespace x265_10bit { +const x265_api* x265_api_get(int bitDepth); +const x265_api* x265_api_query(int bitDepth, int apiVersion, int* err); +} +#endif + +#if LINKED_12BIT +namespace x265_12bit { +const x265_api* x265_api_get(int bitDepth); +const x265_api* x265_api_query(int bitDepth, int apiVersion, int* err); +} +#endif + +#if EXPORT_C_API +/* these functions are exported as C functions (default) */ +using namespace X265_NS; +extern "C" { +#else +/* these functions exist within private namespace (multilib) */ +namespace X265_NS { +#endif -extern "C" x265_encoder *x265_encoder_open(x265_param *p) { if (!p) return NULL; +#if _MSC_VER +#pragma warning(disable: 4127) // conditional expression is constant, yes I know +#endif + +#if HIGH_BIT_DEPTH + if (X265_DEPTH == 12) + x265_log(p, X265_LOG_WARNING, "Main12 is HIGHLY experimental, do not use!\n"); + else if (X265_DEPTH != 10 && X265_DEPTH != 12) +#else + if (X265_DEPTH != 8) +#endif + { + x265_log(p, X265_LOG_ERROR, "Build error, internal bit depth mismatch\n"); + return NULL; + } + Encoder* encoder = NULL; - x265_param* param = x265_param_alloc(); - x265_param* latestParam = x265_param_alloc(); + x265_param* param = PARAM_NS::x265_param_alloc(); + x265_param* latestParam = PARAM_NS::x265_param_alloc(); if (!param || !latestParam) goto fail; memcpy(param, p, sizeof(x265_param)); - x265_log(param, X265_LOG_INFO, "HEVC encoder version %s\n", x265_version_str); - x265_log(param, X265_LOG_INFO, "build info %s\n", x265_build_info_str); + x265_log(param, X265_LOG_INFO, "HEVC encoder version %s\n", PFX(version_str)); + x265_log(param, X265_LOG_INFO, "build info %s\n", PFX(build_info_str)); - x265_setup_primitives(param, param->cpuid); + x265_setup_primitives(param); if (x265_check_params(param)) goto fail; @@ -59,7 +103,7 @@ encoder = new Encoder; if (!param->rc.bEnableSlowFirstPass) - x265_param_apply_fastfirstpass(param); + PARAM_NS::x265_param_apply_fastfirstpass(param); // may change params for auto-detect, etc encoder->configure(param); @@ -87,12 +131,11 @@ fail: delete encoder; - x265_param_free(param); - x265_param_free(latestParam); + PARAM_NS::x265_param_free(param); + PARAM_NS::x265_param_free(latestParam); return NULL; } -extern "C" int x265_encoder_headers(x265_encoder *enc, x265_nal **pp_nal, uint32_t *pi_nal) { if (pp_nal && enc) @@ -109,7 +152,6 @@ return -1; } -extern "C" void x265_encoder_parameters(x265_encoder *enc, x265_param *out) { if (enc && out) @@ -119,7 +161,6 @@ } } -extern "C" int x265_encoder_reconfig(x265_encoder* enc, x265_param* param_in) { if (!enc || !param_in) @@ -140,7 +181,6 @@ return ret; } -extern "C" int x265_encoder_encode(x265_encoder *enc, x265_nal **pp_nal, uint32_t *pi_nal, x265_picture *pic_in, x265_picture *pic_out) { if (!enc) @@ -175,7 +215,6 @@ return numEncoded; } -extern "C" void x265_encoder_get_stats(x265_encoder *enc, x265_stats *outputStats, uint32_t statsSizeBytes) { if (enc && outputStats) @@ -185,17 +224,15 @@ } } -extern "C" -void x265_encoder_log(x265_encoder* enc, int argc, char **argv) +void x265_encoder_log(x265_encoder* enc, int, char **) { if (enc) { Encoder *encoder = static_cast<Encoder*>(enc); - encoder->writeLog(argc, argv); + x265_log(encoder->m_param, X265_LOG_WARNING, 
"x265_encoder_log is now deprecated\n"); } } -extern "C" void x265_encoder_close(x265_encoder *enc) { if (enc) @@ -210,7 +247,6 @@ } } -extern "C" void x265_cleanup(void) { if (!g_ctuSizeConfigured) @@ -220,13 +256,11 @@ } } -extern "C" x265_picture *x265_picture_alloc() { return (x265_picture*)x265_malloc(sizeof(x265_picture)); } -extern "C" void x265_picture_init(x265_param *param, x265_picture *pic) { memset(pic, 0, sizeof(x265_picture)); @@ -245,7 +279,6 @@ } } -extern "C" void x265_picture_free(x265_picture *p) { return x265_free(p); @@ -253,12 +286,24 @@ static const x265_api libapi = { - &x265_param_alloc, - &x265_param_free, - &x265_param_default, - &x265_param_parse, - &x265_param_apply_profile, - &x265_param_default_preset, + X265_MAJOR_VERSION, + X265_BUILD, + sizeof(x265_param), + sizeof(x265_picture), + sizeof(x265_analysis_data), + sizeof(x265_zone), + sizeof(x265_stats), + + PFX(max_bit_depth), + PFX(version_str), + PFX(build_info_str), + + &PARAM_NS::x265_param_alloc, + &PARAM_NS::x265_param_free, + &PARAM_NS::x265_param_default, + &PARAM_NS::x265_param_parse, + &PARAM_NS::x265_param_apply_profile, + &PARAM_NS::x265_param_default_preset, &x265_picture_alloc, &x265_picture_free, &x265_picture_init, @@ -271,12 +316,12 @@ &x265_encoder_log, &x265_encoder_close, &x265_cleanup, - x265_version_str, - x265_build_info_str, - x265_max_bit_depth, + + sizeof(x265_frame_stats), }; typedef const x265_api* (*api_get_func)(int bitDepth); +typedef const x265_api* (*api_query_func)(int bitDepth, int apiVersion, int* err); #define xstr(s) str(s) #define str(s) #s @@ -291,13 +336,25 @@ #define ext ".so" #endif -extern "C" +static int g_recursion /* = 0 */; + const x265_api* x265_api_get(int bitDepth) { if (bitDepth && bitDepth != X265_DEPTH) { +#if LINKED_8BIT + if (bitDepth == 8) return x265_8bit::x265_api_get(0); +#endif +#if LINKED_10BIT + if (bitDepth == 10) return x265_10bit::x265_api_get(0); +#endif +#if LINKED_12BIT + if (bitDepth == 12) return x265_12bit::x265_api_get(0); +#endif + const char* libname = NULL; const char* method = "x265_api_get_" xstr(X265_BUILD); + const char* multilibname = "libx265" ext; if (bitDepth == 12) libname = "libx265_main12" ext; @@ -309,33 +366,150 @@ return NULL; const x265_api* api = NULL; + int reqDepth = 0; + + if (g_recursion > 1) + return NULL; + else + g_recursion++; #if _WIN32 HMODULE h = LoadLibraryA(libname); + if (!h) + { + h = LoadLibraryA(multilibname); + reqDepth = bitDepth; + } if (h) { api_get_func get = (api_get_func)GetProcAddress(h, method); if (get) - api = get(0); + api = get(reqDepth); } #else void* h = dlopen(libname, RTLD_LAZY | RTLD_LOCAL); + if (!h) + { + h = dlopen(multilibname, RTLD_LAZY | RTLD_LOCAL); + reqDepth = bitDepth; + } if (h) { api_get_func get = (api_get_func)dlsym(h, method); if (get) - api = get(0); + api = get(reqDepth); + } +#endif + + g_recursion--; + + if (api && bitDepth != api->bit_depth) + { + x265_log(NULL, X265_LOG_WARNING, "%s does not support requested bitDepth %d\n", libname, bitDepth); + return NULL; + } + + return api; + } + + return &libapi; +} + +const x265_api* x265_api_query(int bitDepth, int apiVersion, int* err) +{ + if (apiVersion < 51) + { + /* builds before 1.6 had re-ordered public structs */ + if (err) *err = X265_API_QUERY_ERR_VER_REFUSED; + return NULL; + } + + if (err) *err = X265_API_QUERY_ERR_NONE; + + if (bitDepth && bitDepth != X265_DEPTH) + { +#if LINKED_8BIT + if (bitDepth == 8) return x265_8bit::x265_api_query(0, apiVersion, err); +#endif +#if LINKED_10BIT + if (bitDepth == 10) 
return x265_10bit::x265_api_query(0, apiVersion, err); +#endif +#if LINKED_12BIT + if (bitDepth == 12) return x265_12bit::x265_api_query(0, apiVersion, err); +#endif + + const char* libname = NULL; + const char* method = "x265_api_query"; + const char* multilibname = "libx265" ext; + + if (bitDepth == 12) + libname = "libx265_main12" ext; + else if (bitDepth == 10) + libname = "libx265_main10" ext; + else if (bitDepth == 8) + libname = "libx265_main" ext; + else + { + if (err) *err = X265_API_QUERY_ERR_LIB_NOT_FOUND; + return NULL; + } + + const x265_api* api = NULL; + int reqDepth = 0; + int e = X265_API_QUERY_ERR_LIB_NOT_FOUND; + + if (g_recursion > 1) + { + if (err) *err = X265_API_QUERY_ERR_LIB_NOT_FOUND; + return NULL; + } + else + g_recursion++; + +#if _WIN32 + HMODULE h = LoadLibraryA(libname); + if (!h) + { + h = LoadLibraryA(multilibname); + reqDepth = bitDepth; + } + if (h) + { + e = X265_API_QUERY_ERR_FUNC_NOT_FOUND; + api_query_func query = (api_query_func)GetProcAddress(h, method); + if (query) + api = query(reqDepth, apiVersion, err); + } +#else + void* h = dlopen(libname, RTLD_LAZY | RTLD_LOCAL); + if (!h) + { + h = dlopen(multilibname, RTLD_LAZY | RTLD_LOCAL); + reqDepth = bitDepth; + } + if (h) + { + e = X265_API_QUERY_ERR_FUNC_NOT_FOUND; + api_query_func query = (api_query_func)dlsym(h, method); + if (query) + api = query(reqDepth, apiVersion, err); } #endif - if (api && bitDepth != api->max_bit_depth) + g_recursion--; + + if (api && bitDepth != api->bit_depth) { x265_log(NULL, X265_LOG_WARNING, "%s does not support requested bitDepth %d\n", libname, bitDepth); + if (err) *err = X265_API_QUERY_ERR_WRONG_BITDEPTH; return NULL; } + if (err) *err = api ? X265_API_QUERY_ERR_NONE : e; return api; } return &libapi; } + +} /* end namespace or extern "C" */
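The rewritten api.cpp is the 1.8 multilib machinery: the public entry points are compiled either as C exports (EXPORT_C_API) or inside a per-bit-depth namespace, and when the caller asks for a bit depth the current library was not built for, x265_api_get()/x265_api_query() forward the request first to statically linked reflectors (LINKED_8BIT/10BIT/12BIT), then to a dlopen/LoadLibraryA of libx265_mainN or the combined libx265, with g_recursion guarding against the libraries loading each other forever. On the application side, the new queried entry point might be used like this (a sketch using only names visible in this diff):

    #include "x265.h"
    #include <cstdio>

    int main()
    {
        int err = X265_API_QUERY_ERR_NONE;
        // Ask for a 10-bit encoder; X265_BUILD vouches for our struct layouts.
        const x265_api* api = x265_api_query(10, X265_BUILD, &err);
        if (!api)
        {
            // err is one of the X265_API_QUERY_ERR_* codes set above:
            // VER_REFUSED, LIB_NOT_FOUND, FUNC_NOT_FOUND or WRONG_BITDEPTH.
            fprintf(stderr, "no 10-bit libx265 found (err %d)\n", err);
            api = x265_api_query(0, X265_BUILD, &err);  // 0 = native depth
        }
        if (!api)
            return 1;
        printf("got libx265 with bit_depth %d\n", api->bit_depth);
        return 0;
    }

Note the version gate: callers built against an API older than X265_BUILD 51 are refused outright (X265_API_QUERY_ERR_VER_REFUSED) because the public structs were re-ordered in 1.6.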
View file
x265_1.7.tar.gz/source/encoder/bitcost.cpp -> x265_1.8.tar.gz/source/encoder/bitcost.cpp
Changed
@@ -25,7 +25,7 @@ #include "primitives.h" #include "bitcost.h" -using namespace x265; +using namespace X265_NS; void BitCost::setQP(unsigned int qp) { @@ -45,7 +45,7 @@ // estimate same cost for negative and positive MVD for (int i = 0; i <= 2 * BC_MAX_MV; i++) - s_costs[qp][i] = s_costs[qp][-i] = (uint16_t)X265_MIN(s_bitsizes[i] * lambda + 0.5f, (1 << 16) - 1); + s_costs[qp][i] = s_costs[qp][-i] = (uint16_t)X265_MIN(s_bitsizes[i] * lambda + 0.5f, (1 << 15) - 1); } }
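The only functional change here is the tighter clamp on precomputed MV costs: s_costs entries are uint16_t, and capping them at (1 << 15) - 1 instead of (1 << 16) - 1 leaves headroom so that, presumably, the sum of an x-component and a y-component cost can no longer wrap a 16-bit accumulator. An illustration of that headroom argument (the rationale is an inference, not stated in the diff):

    #include <cstdint>

    // With each table entry <= 0x7FFF, the two-component sum is <= 0xFFFE
    // and still fits in uint16_t; the old 0xFFFF cap allowed wraparound.
    uint16_t mvBitCost(const uint16_t* xCost, const uint16_t* yCost, int x, int y)
    {
        return uint16_t(xCost[x] + yCost[y]);
    }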
View file
x265_1.7.tar.gz/source/encoder/bitcost.h -> x265_1.8.tar.gz/source/encoder/bitcost.h
Changed
@@ -28,7 +28,7 @@ #include "threading.h" #include "mv.h" -namespace x265 { +namespace X265_NS { // private x265 namespace class BitCost
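Nearly every file in this revision carries the same mechanical edit: namespace x265 becomes namespace X265_NS. Combined with the reflectors in api.cpp above, this is what lets 8-, 10- and 12-bit builds of the library coexist in one process. The macro itself is defined in the common headers rather than in this diff; its definition is presumably along these lines (assumed shape, for illustration only):

    #if EXPORT_C_API
    #define X265_NS x265            /* normal single-lib build */
    #elif HIGH_BIT_DEPTH && X265_DEPTH == 12
    #define X265_NS x265_12bit
    #elif HIGH_BIT_DEPTH
    #define X265_NS x265_10bit
    #else
    #define X265_NS x265_8bit
    #endif

which would match the x265_8bit/x265_10bit/x265_12bit reflector namespaces declared at the top of api.cpp.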
View file
x265_1.7.tar.gz/source/encoder/dpb.cpp -> x265_1.8.tar.gz/source/encoder/dpb.cpp
Changed
@@ -29,7 +29,7 @@ #include "dpb.h" -using namespace x265; +using namespace X265_NS; DPB::~DPB() {
View file
x265_1.7.tar.gz/source/encoder/dpb.h -> x265_1.8.tar.gz/source/encoder/dpb.h
Changed
@@ -26,7 +26,7 @@ #include "piclist.h" -namespace x265 { +namespace X265_NS { // private namespace for x265 class Frame;
View file
x265_1.7.tar.gz/source/encoder/encoder.cpp -> x265_1.8.tar.gz/source/encoder/encoder.cpp
Changed
@@ -39,21 +39,13 @@ #include "x265.h" -namespace x265 { +namespace X265_NS { const char g_sliceTypeToChar[] = {'B', 'P', 'I'}; } -static const char* summaryCSVHeader = - "Command, Date/Time, Elapsed Time, FPS, Bitrate, " - "Y PSNR, U PSNR, V PSNR, Global PSNR, SSIM, SSIM (dB), " - "I count, I ave-QP, I kpbs, I-PSNR Y, I-PSNR U, I-PSNR V, I-SSIM (dB), " - "P count, P ave-QP, P kpbs, P-PSNR Y, P-PSNR U, P-PSNR V, P-SSIM (dB), " - "B count, B ave-QP, B kpbs, B-PSNR Y, B-PSNR U, B-PSNR V, B-SSIM (dB), " - "Version\n"; - static const char* defaultAnalysisFileName = "x265_analysis.dat"; -using namespace x265; +using namespace X265_NS; Encoder::Encoder() { @@ -72,7 +64,6 @@ m_exportedPic = NULL; m_numDelayedPic = 0; m_outputCount = 0; - m_csvfpt = NULL; m_param = NULL; m_latestParam = NULL; m_cuOffsetY = NULL; @@ -103,7 +94,10 @@ // Do not allow WPP if only one row or fewer than 3 columns, it is pointless and unstable if (rows == 1 || cols < 3) + { + x265_log(p, X265_LOG_WARNING, "Too few rows/columns, --wpp disabled\n"); p->bEnableWavefront = 0; + } bool allowPools = !p->numaPools || strcmp(p->numaPools, "none"); @@ -149,6 +143,12 @@ p->bEnableWavefront = p->bDistributeModeAnalysis = p->bDistributeMotionEstimation = p->lookaheadSlices = 0; } + if (!p->bEnableWavefront && p->rc.vbvBufferSize) + { + x265_log(p, X265_LOG_ERROR, "VBV requires wavefront parallelism\n"); + m_aborted = true; + } + char buf[128]; int len = 0; if (p->bEnableWavefront) @@ -214,43 +214,6 @@ initSPS(&m_sps); initPPS(&m_pps); - /* Try to open CSV file handle */ - if (m_param->csvfn) - { - m_csvfpt = fopen(m_param->csvfn, "r"); - if (m_csvfpt) - { - /* file already exists, re-open for append */ - fclose(m_csvfpt); - m_csvfpt = fopen(m_param->csvfn, "ab"); - } - else - { - /* new CSV file, write header */ - m_csvfpt = fopen(m_param->csvfn, "wb"); - if (m_csvfpt) - { - if (m_param->logLevel >= X265_LOG_FRAME) - { - fprintf(m_csvfpt, "Encode Order, Type, POC, QP, Bits, "); - if (m_param->rc.rateControlMode == X265_RC_CRF) - fprintf(m_csvfpt, "RateFactor, "); - fprintf(m_csvfpt, "Y PSNR, U PSNR, V PSNR, YUV PSNR, SSIM, SSIM (dB), List 0, List 1"); - /* detailed performance statistics */ - fprintf(m_csvfpt, ", DecideWait (ms), Row0Wait (ms), Wall time (ms), Ref Wait Wall (ms), Total CTU time (ms), Stall Time (ms), Avg WPP, Row Blocks\n"); - } - else - fputs(summaryCSVHeader, m_csvfpt); - } - } - - if (!m_csvfpt) - { - x265_log(m_param, X265_LOG_ERROR, "Unable to open CSV log file <%s>, aborting\n", m_param->csvfn); - m_aborted = true; - } - } - int numRows = (m_param->sourceHeight + g_maxCUSize - 1) / g_maxCUSize; int numCols = (m_param->sourceWidth + g_maxCUSize - 1) / g_maxCUSize; for (int i = 0; i < m_param->frameNumThreads; i++) @@ -362,8 +325,6 @@ if (m_analysisFile) fclose(m_analysisFile); - if (m_csvfpt) - fclose(m_csvfpt); if (m_param) { @@ -372,15 +333,14 @@ free((char*)m_param->rc.statFileName); free((char*)m_param->analysisFileName); free((char*)m_param->scalingLists); - free((char*)m_param->csvfn); free((char*)m_param->numaPools); free((char*)m_param->masteringDisplayColorVolume); free((char*)m_param->contentLightLevelInfo); - x265_param_free(m_param); + PARAM_NS::x265_param_free(m_param); } - x265_param_free(m_latestParam); + PARAM_NS::x265_param_free(m_latestParam); } void Encoder::updateVbvPlan(RateControl* rc) @@ -570,6 +530,7 @@ if (outFrame) { Slice *slice = outFrame->m_encData->m_slice; + x265_frame_stats* frameData = NULL; /* Free up pic_in->analysisData since it has already been used */ if 
(m_param->analysisMode == X265_ANALYSIS_LOAD) @@ -582,6 +543,7 @@ pic_out->bitDepth = X265_DEPTH; pic_out->userData = outFrame->m_userData; pic_out->colorSpace = m_param->internalCsp; + frameData = &(pic_out->frameData); pic_out->pts = outFrame->m_pts; pic_out->dts = outFrame->m_dts; @@ -648,7 +610,12 @@ if (m_aborted) return -1; - finishFrameStats(outFrame, curEncoder, curEncoder->m_accessUnitBits); + finishFrameStats(outFrame, curEncoder, curEncoder->m_accessUnitBits, frameData); + + /* Write RateControl Frame level stats in multipass encodes */ + if (m_param->rc.bStatWrite) + if (m_rateControl->writeRateControlFrameStats(outFrame, &curEncoder->m_rce)) + m_aborted = true; /* Allow this frame to be recycled if no frame encoders are using it for reference */ if (!pic_out) @@ -729,7 +696,7 @@ m_aborted = true; } else if (m_encodedFrameNum) - m_rateControl->setFinalFrameCount(m_encodedFrameNum); + m_rateControl->setFinalFrameCount(m_encodedFrameNum); } while (m_bZeroLatency && ++pass < 2); @@ -787,38 +754,6 @@ m_totalQp += aveQp; } -char* Encoder::statsCSVString(EncStats& stat, char* buffer) -{ - if (!stat.m_numPics) - { - sprintf(buffer, "-, -, -, -, -, -, -, "); - return buffer; - } - - double fps = (double)m_param->fpsNum / m_param->fpsDenom; - double scale = fps / 1000 / (double)stat.m_numPics; - - int len = sprintf(buffer, "%-6u, ", stat.m_numPics); - - len += sprintf(buffer + len, "%2.2lf, ", stat.m_totalQp / (double)stat.m_numPics); - len += sprintf(buffer + len, "%-8.2lf, ", stat.m_accBits * scale); - if (m_param->bEnablePsnr) - { - len += sprintf(buffer + len, "%.3lf, %.3lf, %.3lf, ", - stat.m_psnrSumY / (double)stat.m_numPics, - stat.m_psnrSumU / (double)stat.m_numPics, - stat.m_psnrSumV / (double)stat.m_numPics); - } - else - len += sprintf(buffer + len, "-, -, -, "); - - if (m_param->bEnableSsim) - sprintf(buffer + len, "%.3lf, ", x265_ssim2dB(stat.m_globalSsim / (double)stat.m_numPics)); - else - sprintf(buffer + len, "-, "); - return buffer; -} - char* Encoder::statsString(EncStats& stat, char* buffer) { double fps = (double)m_param->fpsNum / m_param->fpsDenom; @@ -856,8 +791,6 @@ x265_log(m_param, X265_LOG_INFO, "frame P: %s\n", statsString(m_analyzeP, buffer)); if (m_analyzeB.m_numPics) x265_log(m_param, X265_LOG_INFO, "frame B: %s\n", statsString(m_analyzeB, buffer)); - if (m_analyzeAll.m_numPics) - x265_log(m_param, X265_LOG_INFO, "global : %s\n", statsString(m_analyzeAll, buffer)); if (m_param->bEnableWeightedPred && m_analyzeP.m_numPics) { x265_log(m_param, X265_LOG_INFO, "Weighted P-Frames: Y:%.1f%% UV:%.1f%%\n", @@ -891,6 +824,30 @@ x265_log(m_param, X265_LOG_INFO, "lossless compression ratio %.2f::1\n", uncompressed / m_analyzeAll.m_accBits); } + if (m_analyzeAll.m_numPics) + { + int p = 0; + double elapsedEncodeTime = (double)(x265_mdate() - m_encodeStartTime) / 1000000; + double elapsedVideoTime = (double)m_analyzeAll.m_numPics * m_param->fpsDenom / m_param->fpsNum; + double bitrate = (0.001f * m_analyzeAll.m_accBits) / elapsedVideoTime; + + p += sprintf(buffer + p, "\nencoded %d frames in %.2fs (%.2f fps), %.2f kb/s, Avg QP:%2.2lf", m_analyzeAll.m_numPics, + elapsedEncodeTime, m_analyzeAll.m_numPics / elapsedEncodeTime, bitrate, m_analyzeAll.m_totalQp / (double)m_analyzeAll.m_numPics); + + if (m_param->bEnablePsnr) + { + double globalPsnr = (m_analyzeAll.m_psnrSumY * 6 + m_analyzeAll.m_psnrSumU + m_analyzeAll.m_psnrSumV) / (8 * m_analyzeAll.m_numPics); + p += sprintf(buffer + p, ", Global PSNR: %.3f", globalPsnr); + } + + if (m_param->bEnableSsim) + p += 
sprintf(buffer + p, ", SSIM Mean Y: %.7f (%6.3f dB)", m_analyzeAll.m_globalSsim / m_analyzeAll.m_numPics, x265_ssim2dB(m_analyzeAll.m_globalSsim / m_analyzeAll.m_numPics)); + + sprintf(buffer + p, "\n"); + general_log(m_param, NULL, X265_LOG_INFO, buffer); + } + else + general_log(m_param, NULL, X265_LOG_INFO, "\nencoded 0 frames\n"); #if DETAILED_CU_STATS /* Summarize stats from all frame encoders */ @@ -949,10 +906,22 @@ x265_log(m_param, X265_LOG_INFO, "CU: %%%05.2lf time spent in motion estimation, averaging %.3lf CU inter modes per CTU\n", 100.0 * cuStats.motionEstimationElapsedTime / totalWorkerTime, (double)cuStats.countMotionEstimate / cuStats.totalCTUs); + + if (cuStats.skippedMotionReferences[0] || cuStats.skippedMotionReferences[1] || cuStats.skippedMotionReferences[2]) + x265_log(m_param, X265_LOG_INFO, "CU: Skipped motion searches per depth %%%.2lf %%%.2lf %%%.2lf %%%.2lf\n", + 100.0 * cuStats.skippedMotionReferences[0] / cuStats.totalMotionReferences[0], + 100.0 * cuStats.skippedMotionReferences[1] / cuStats.totalMotionReferences[1], + 100.0 * cuStats.skippedMotionReferences[2] / cuStats.totalMotionReferences[2], + 100.0 * cuStats.skippedMotionReferences[3] / cuStats.totalMotionReferences[3]); } x265_log(m_param, X265_LOG_INFO, "CU: %%%05.2lf time spent in intra analysis, averaging %.3lf Intra PUs per CTU\n", 100.0 * cuStats.intraAnalysisElapsedTime / totalWorkerTime, (double)cuStats.countIntraAnalysis / cuStats.totalCTUs); + if (cuStats.skippedIntraCU[0] || cuStats.skippedIntraCU[1] || cuStats.skippedIntraCU[2]) + x265_log(m_param, X265_LOG_INFO, "CU: Skipped intra CUs at depth %%%.2lf %%%.2lf %%%.2lf\n", + 100.0 * cuStats.skippedIntraCU[0] / cuStats.totalIntraCU[0], + 100.0 * cuStats.skippedIntraCU[1] / cuStats.totalIntraCU[1], + 100.0 * cuStats.skippedIntraCU[2] / cuStats.totalIntraCU[2]); x265_log(m_param, X265_LOG_INFO, "CU: %%%05.2lf time spent in inter RDO, measuring %.3lf inter/merge predictions per CTU\n", 100.0 * interRDOTotalTime / totalWorkerTime, (double)interRDOTotalCount / cuStats.totalCTUs); @@ -1027,143 +996,6 @@ #undef ELAPSED_SEC #undef ELAPSED_MSEC #endif - - if (!m_param->bLogCuStats) - return; - - for (int sliceType = 2; sliceType >= 0; sliceType--) - { - if (sliceType == P_SLICE && !m_analyzeP.m_numPics) - continue; - if (sliceType == B_SLICE && !m_analyzeB.m_numPics) - continue; - - StatisticLog finalLog; - for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) - { - int cuSize = g_maxCUSize >> depth; - - for (int i = 0; i < m_param->frameNumThreads; i++) - { - StatisticLog& enclog = m_frameEncoder[i]->m_sliceTypeLog[sliceType]; - if (!depth) - finalLog.totalCu += enclog.totalCu; - finalLog.cntIntra[depth] += enclog.cntIntra[depth]; - for (int m = 0; m < INTER_MODES; m++) - { - if (m < INTRA_MODES) - finalLog.cuIntraDistribution[depth][m] += enclog.cuIntraDistribution[depth][m]; - finalLog.cuInterDistribution[depth][m] += enclog.cuInterDistribution[depth][m]; - } - - if (cuSize == 8 && m_sps.quadtreeTULog2MinSize < 3) - finalLog.cntIntraNxN += enclog.cntIntraNxN; - if (sliceType != I_SLICE) - { - finalLog.cntTotalCu[depth] += enclog.cntTotalCu[depth]; - finalLog.cntInter[depth] += enclog.cntInter[depth]; - finalLog.cntSkipCu[depth] += enclog.cntSkipCu[depth]; - } - } - - uint64_t cntInter, cntSkipCu, cntIntra = 0, cntIntraNxN = 0, encCu = 0; - uint64_t cuInterDistribution[INTER_MODES], cuIntraDistribution[INTRA_MODES]; - - // check for 0/0, if true assign 0 else calculate percentage - for (int n = 0; n < INTER_MODES; n++) - { - if 
(!finalLog.cntInter[depth]) - cuInterDistribution[n] = 0; - else - cuInterDistribution[n] = (finalLog.cuInterDistribution[depth][n] * 100) / finalLog.cntInter[depth]; - - if (n < INTRA_MODES) - { - if (!finalLog.cntIntra[depth]) - { - cntIntraNxN = 0; - cuIntraDistribution[n] = 0; - } - else - { - cntIntraNxN = (finalLog.cntIntraNxN * 100) / finalLog.cntIntra[depth]; - cuIntraDistribution[n] = (finalLog.cuIntraDistribution[depth][n] * 100) / finalLog.cntIntra[depth]; - } - } - } - - if (!finalLog.totalCu) - encCu = 0; - else if (sliceType == I_SLICE) - { - cntIntra = (finalLog.cntIntra[depth] * 100) / finalLog.totalCu; - cntIntraNxN = (finalLog.cntIntraNxN * 100) / finalLog.totalCu; - } - else - encCu = ((finalLog.cntIntra[depth] + finalLog.cntInter[depth]) * 100) / finalLog.totalCu; - - if (sliceType == I_SLICE) - { - cntInter = 0; - cntSkipCu = 0; - } - else if (!finalLog.cntTotalCu[depth]) - { - cntInter = 0; - cntIntra = 0; - cntSkipCu = 0; - } - else - { - cntInter = (finalLog.cntInter[depth] * 100) / finalLog.cntTotalCu[depth]; - cntIntra = (finalLog.cntIntra[depth] * 100) / finalLog.cntTotalCu[depth]; - cntSkipCu = (finalLog.cntSkipCu[depth] * 100) / finalLog.cntTotalCu[depth]; - } - - // print statistics - char stats[256] = { 0 }; - int len = 0; - if (sliceType != I_SLICE) - len += sprintf(stats + len, " EncCU "X265_LL "%% Merge "X265_LL "%%", encCu, cntSkipCu); - - if (cntInter) - { - len += sprintf(stats + len, " Inter "X265_LL "%%", cntInter); - if (m_param->bEnableAMP) - len += sprintf(stats + len, "(%dx%d "X265_LL "%% %dx%d "X265_LL "%% %dx%d "X265_LL "%% AMP "X265_LL "%%)", - cuSize, cuSize, cuInterDistribution[0], - cuSize / 2, cuSize, cuInterDistribution[2], - cuSize, cuSize / 2, cuInterDistribution[1], - cuInterDistribution[3]); - else if (m_param->bEnableRectInter) - len += sprintf(stats + len, "(%dx%d "X265_LL "%% %dx%d "X265_LL "%% %dx%d "X265_LL "%%)", - cuSize, cuSize, cuInterDistribution[0], - cuSize / 2, cuSize, cuInterDistribution[2], - cuSize, cuSize / 2, cuInterDistribution[1]); - } - if (cntIntra) - { - len += sprintf(stats + len, " Intra "X265_LL "%%(DC "X265_LL "%% P "X265_LL "%% Ang "X265_LL "%%", - cntIntra, cuIntraDistribution[0], - cuIntraDistribution[1], cuIntraDistribution[2]); - if (sliceType != I_SLICE) - { - if (cuSize == 8 && m_sps.quadtreeTULog2MinSize < 3) - len += sprintf(stats + len, " %dx%d "X265_LL "%%", cuSize / 2, cuSize / 2, cntIntraNxN); - } - - len += sprintf(stats + len, ")"); - if (sliceType == I_SLICE) - { - if (cuSize == 8 && m_sps.quadtreeTULog2MinSize < 3) - len += sprintf(stats + len, " %dx%d: "X265_LL "%%", cuSize / 2, cuSize / 2, cntIntraNxN); - } - } - const char slicechars[] = "BPI"; - if (stats[0]) - x265_log(m_param, X265_LOG_INFO, "%c%-2d:%s\n", slicechars[sliceType], cuSize, stats); - } - } } void Encoder::fetchStats(x265_stats *stats, size_t statsSizeBytes) @@ -1191,6 +1023,33 @@ stats->bitrate = 0; stats->elapsedVideoTime = 0; } + + double fps = (double)m_param->fpsNum / m_param->fpsDenom; + double scale = fps / 1000; + + stats->statsI.numPics = m_analyzeI.m_numPics; + stats->statsI.avgQp = m_analyzeI.m_totalQp / (double)m_analyzeI.m_numPics; + stats->statsI.bitrate = m_analyzeI.m_accBits * scale / (double)m_analyzeI.m_numPics; + stats->statsI.psnrY = m_analyzeI.m_psnrSumY / (double)m_analyzeI.m_numPics; + stats->statsI.psnrU = m_analyzeI.m_psnrSumU / (double)m_analyzeI.m_numPics; + stats->statsI.psnrV = m_analyzeI.m_psnrSumV / (double)m_analyzeI.m_numPics; + stats->statsI.ssim = x265_ssim2dB(m_analyzeI.m_globalSsim / 
(double)m_analyzeI.m_numPics); + + stats->statsP.numPics = m_analyzeP.m_numPics; + stats->statsP.avgQp = m_analyzeP.m_totalQp / (double)m_analyzeP.m_numPics; + stats->statsP.bitrate = m_analyzeP.m_accBits * scale / (double)m_analyzeP.m_numPics; + stats->statsP.psnrY = m_analyzeP.m_psnrSumY / (double)m_analyzeP.m_numPics; + stats->statsP.psnrU = m_analyzeP.m_psnrSumU / (double)m_analyzeP.m_numPics; + stats->statsP.psnrV = m_analyzeP.m_psnrSumV / (double)m_analyzeP.m_numPics; + stats->statsP.ssim = x265_ssim2dB(m_analyzeP.m_globalSsim / (double)m_analyzeP.m_numPics); + + stats->statsB.numPics = m_analyzeB.m_numPics; + stats->statsB.avgQp = m_analyzeB.m_totalQp / (double)m_analyzeB.m_numPics; + stats->statsB.bitrate = m_analyzeB.m_accBits * scale / (double)m_analyzeB.m_numPics; + stats->statsB.psnrY = m_analyzeB.m_psnrSumY / (double)m_analyzeB.m_numPics; + stats->statsB.psnrU = m_analyzeB.m_psnrSumU / (double)m_analyzeB.m_numPics; + stats->statsB.psnrV = m_analyzeB.m_psnrSumV / (double)m_analyzeB.m_numPics; + stats->statsB.ssim = x265_ssim2dB(m_analyzeB.m_globalSsim / (double)m_analyzeB.m_numPics); } /* If new statistics are added to x265_stats, we must check here whether the @@ -1198,84 +1057,7 @@ * future safety) */ } -void Encoder::writeLog(int argc, char **argv) -{ - if (m_csvfpt) - { - if (m_param->logLevel >= X265_LOG_FRAME) - { - // adding summary to a per-frame csv log file needs a summary header - fprintf(m_csvfpt, "\nSummary\n"); - fputs(summaryCSVHeader, m_csvfpt); - } - // CLI arguments or other - for (int i = 1; i < argc; i++) - { - if (i) fputc(' ', m_csvfpt); - fputs(argv[i], m_csvfpt); - } - - // current date and time - time_t now; - struct tm* timeinfo; - time(&now); - timeinfo = localtime(&now); - char buffer[200]; - strftime(buffer, 128, "%c", timeinfo); - fprintf(m_csvfpt, ", %s, ", buffer); - - x265_stats stats; - fetchStats(&stats, sizeof(stats)); - - // elapsed time, fps, bitrate - fprintf(m_csvfpt, "%.2f, %.2f, %.2f,", - stats.elapsedEncodeTime, stats.encodedPictureCount / stats.elapsedEncodeTime, stats.bitrate); - - if (m_param->bEnablePsnr) - fprintf(m_csvfpt, " %.3lf, %.3lf, %.3lf, %.3lf,", - stats.globalPsnrY / stats.encodedPictureCount, stats.globalPsnrU / stats.encodedPictureCount, - stats.globalPsnrV / stats.encodedPictureCount, stats.globalPsnr); - else - fprintf(m_csvfpt, " -, -, -, -,"); - if (m_param->bEnableSsim) - fprintf(m_csvfpt, " %.6f, %6.3f,", stats.globalSsim, x265_ssim2dB(stats.globalSsim)); - else - fprintf(m_csvfpt, " -, -,"); - - fputs(statsCSVString(m_analyzeI, buffer), m_csvfpt); - fputs(statsCSVString(m_analyzeP, buffer), m_csvfpt); - fputs(statsCSVString(m_analyzeB, buffer), m_csvfpt); - fprintf(m_csvfpt, " %s\n", x265_version_str); - } -} - -/** - * Produce an ascii(hex) representation of picture digest. - * - * Returns: a statically allocated null-terminated string. DO NOT FREE. 
- */ -static const char*digestToString(const unsigned char digest[3][16], int numChar) -{ - const char* hex = "0123456789abcdef"; - static char string[99]; - int cnt = 0; - - for (int yuvIdx = 0; yuvIdx < 3; yuvIdx++) - { - for (int i = 0; i < numChar; i++) - { - string[cnt++] = hex[digest[yuvIdx][i] >> 4]; - string[cnt++] = hex[digest[yuvIdx][i] & 0xf]; - } - - string[cnt++] = ','; - } - - string[cnt - 1] = '\0'; - return string; -} - -void Encoder::finishFrameStats(Frame* curFrame, FrameEncoder *curEncoder, uint64_t bits) +void Encoder::finishFrameStats(Frame* curFrame, FrameEncoder *curEncoder, uint64_t bits, x265_frame_stats* frameStats) { PicYuv* reconPic = curFrame->m_reconPic; @@ -1346,109 +1128,63 @@ if (!IS_REFERENCED(curFrame)) c += 32; // lower case if unreferenced - // if debug log level is enabled, per frame console logging is performed - if (m_param->logLevel >= X265_LOG_DEBUG) + if (frameStats) { - char buf[1024]; - int p; - p = sprintf(buf, "POC:%d %c QP %2.2lf(%d) %10d bits", poc, c, curEncData.m_avgQpAq, slice->m_sliceQp, (int)bits); + frameStats->encoderOrder = m_outputCount++; + frameStats->sliceType = c; + frameStats->poc = poc; + frameStats->qp = curEncData.m_avgQpAq; + frameStats->bits = bits; if (m_param->rc.rateControlMode == X265_RC_CRF) - p += sprintf(buf + p, " RF:%.3lf", curEncData.m_rateFactor); - if (m_param->bEnablePsnr) - p += sprintf(buf + p, " [Y:%6.2lf U:%6.2lf V:%6.2lf]", psnrY, psnrU, psnrV); - if (m_param->bEnableSsim) - p += sprintf(buf + p, " [SSIM: %.3lfdB]", x265_ssim2dB(ssim)); - + frameStats->rateFactor = curEncData.m_rateFactor; + frameStats->psnrY = psnrY; + frameStats->psnrU = psnrU; + frameStats->psnrV = psnrV; + double psnr = (psnrY * 6 + psnrU + psnrV) / 8; + frameStats->psnr = psnr; + frameStats->ssim = ssim; if (!slice->isIntra()) { - int numLists = slice->isInterP() ? 1 : 2; - for (int list = 0; list < numLists; list++) - { - p += sprintf(buf + p, " [L%d ", list); - for (int ref = 0; ref < slice->m_numRefIdx[list]; ref++) - { - int k = slice->m_refPOCList[list][ref] - slice->m_lastIDR; - p += sprintf(buf + p, "%d ", k); - } - - p += sprintf(buf + p, "]"); - } - } + for (int ref = 0; ref < 16; ref++) + frameStats->list0POC[ref] = ref < slice->m_numRefIdx[0] ? slice->m_refPOCList[0][ref] - slice->m_lastIDR : -1; - if (m_param->decodedPictureHashSEI && m_param->logLevel >= X265_LOG_FULL) - { - const char* digestStr = NULL; - if (m_param->decodedPictureHashSEI == 1) - { - digestStr = digestToString(curEncoder->m_seiReconPictureDigest.m_digest, 16); - p += sprintf(buf + p, " [MD5:%s]", digestStr); - } - else if (m_param->decodedPictureHashSEI == 2) - { - digestStr = digestToString(curEncoder->m_seiReconPictureDigest.m_digest, 2); - p += sprintf(buf + p, " [CRC:%s]", digestStr); - } - else if (m_param->decodedPictureHashSEI == 3) + if (!slice->isInterP()) { - digestStr = digestToString(curEncoder->m_seiReconPictureDigest.m_digest, 4); - p += sprintf(buf + p, " [Checksum:%s]", digestStr); + for (int ref = 0; ref < 16; ref++) + frameStats->list1POC[ref] = ref < slice->m_numRefIdx[1] ? 
slice->m_refPOCList[1][ref] - slice->m_lastIDR : -1; } } - x265_log(m_param, X265_LOG_DEBUG, "%s\n", buf); - } - - if (m_param->logLevel >= X265_LOG_FRAME && m_csvfpt) - { - // per frame CSV logging if the file handle is valid - fprintf(m_csvfpt, "%d, %c-SLICE, %4d, %2.2lf, %10d,", m_outputCount++, c, poc, curEncData.m_avgQpAq, (int)bits); - if (m_param->rc.rateControlMode == X265_RC_CRF) - fprintf(m_csvfpt, "%.3lf,", curEncData.m_rateFactor); - double psnr = (psnrY * 6 + psnrU + psnrV) / 8; - if (m_param->bEnablePsnr) - fprintf(m_csvfpt, "%.3lf, %.3lf, %.3lf, %.3lf,", psnrY, psnrU, psnrV, psnr); - else - fputs(" -, -, -, -,", m_csvfpt); - if (m_param->bEnableSsim) - fprintf(m_csvfpt, " %.6f, %6.3f", ssim, x265_ssim2dB(ssim)); - else - fputs(" -, -", m_csvfpt); - if (slice->isIntra()) - fputs(", -, -", m_csvfpt); - else - { - int numLists = slice->isInterP() ? 1 : 2; - for (int list = 0; list < numLists; list++) - { - fprintf(m_csvfpt, ", "); - for (int ref = 0; ref < slice->m_numRefIdx[list]; ref++) - { - int k = slice->m_refPOCList[list][ref] - slice->m_lastIDR; - fprintf(m_csvfpt, " %d", k); - } - } - - if (numLists == 1) - fputs(", -", m_csvfpt); - } - #define ELAPSED_MSEC(start, end) (((double)(end) - (start)) / 1000) - // detailed frame statistics - fprintf(m_csvfpt, ", %.1lf, %.1lf, %.1lf, %.1lf, %.1lf, %.1lf", - ELAPSED_MSEC(0, curEncoder->m_slicetypeWaitTime), - ELAPSED_MSEC(curEncoder->m_startCompressTime, curEncoder->m_row0WaitTime), - ELAPSED_MSEC(curEncoder->m_row0WaitTime, curEncoder->m_endCompressTime), - ELAPSED_MSEC(curEncoder->m_row0WaitTime, curEncoder->m_allRowsAvailableTime), - ELAPSED_MSEC(0, curEncoder->m_totalWorkerElapsedTime), - ELAPSED_MSEC(0, curEncoder->m_totalNoWorkerTime)); + frameStats->decideWaitTime = ELAPSED_MSEC(0, curEncoder->m_slicetypeWaitTime); + frameStats->row0WaitTime = ELAPSED_MSEC(curEncoder->m_startCompressTime, curEncoder->m_row0WaitTime); + frameStats->wallTime = ELAPSED_MSEC(curEncoder->m_row0WaitTime, curEncoder->m_endCompressTime); + frameStats->refWaitWallTime = ELAPSED_MSEC(curEncoder->m_row0WaitTime, curEncoder->m_allRowsAvailableTime); + frameStats->totalCTUTime = ELAPSED_MSEC(0, curEncoder->m_totalWorkerElapsedTime); + frameStats->stallTime = ELAPSED_MSEC(0, curEncoder->m_totalNoWorkerTime); if (curEncoder->m_totalActiveWorkerCount) - fprintf(m_csvfpt, ", %.3lf", (double)curEncoder->m_totalActiveWorkerCount / curEncoder->m_activeWorkerCountSamples); + frameStats->avgWPP = (double)curEncoder->m_totalActiveWorkerCount / curEncoder->m_activeWorkerCountSamples; else - fputs(", 1", m_csvfpt); - fprintf(m_csvfpt, ", %d", curEncoder->m_countRowBlocks); - fprintf(m_csvfpt, "\n"); - fflush(stderr); + frameStats->avgWPP = 1; + frameStats->countRowBlocks = curEncoder->m_countRowBlocks; + + frameStats->cuStats.percentIntraNxN = curFrame->m_encData->m_frameStats.percentIntraNxN; + frameStats->avgChromaDistortion = curFrame->m_encData->m_frameStats.avgChromaDistortion; + frameStats->avgLumaDistortion = curFrame->m_encData->m_frameStats.avgLumaDistortion; + frameStats->avgPsyEnergy = curFrame->m_encData->m_frameStats.avgPsyEnergy; + frameStats->avgLumaLevel = curFrame->m_encData->m_frameStats.avgLumaLevel; + frameStats->maxLumaLevel = curFrame->m_encData->m_frameStats.maxLumaLevel; + for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) + { + frameStats->cuStats.percentSkipCu[depth] = curFrame->m_encData->m_frameStats.percentSkipCu[depth]; + frameStats->cuStats.percentMergeCu[depth] = curFrame->m_encData->m_frameStats.percentMergeCu[depth]; + 
frameStats->cuStats.percentInterDistribution[depth][0] = curFrame->m_encData->m_frameStats.percentInterDistribution[depth][0]; + frameStats->cuStats.percentInterDistribution[depth][1] = curFrame->m_encData->m_frameStats.percentInterDistribution[depth][1]; + frameStats->cuStats.percentInterDistribution[depth][2] = curFrame->m_encData->m_frameStats.percentInterDistribution[depth][2]; + for (int n = 0; n < INTRA_MODES; n++) + frameStats->cuStats.percentIntraDistribution[depth][n] = curFrame->m_encData->m_frameStats.percentIntraDistribution[depth][n]; + } } } @@ -1510,14 +1246,14 @@ char *opts = x265_param2string(m_param); if (opts) { - char *buffer = X265_MALLOC(char, strlen(opts) + strlen(x265_version_str) + - strlen(x265_build_info_str) + 200); + char *buffer = X265_MALLOC(char, strlen(opts) + strlen(PFX(version_str)) + + strlen(PFX(build_info_str)) + 200); if (buffer) { sprintf(buffer, "x265 (build %d) - %s:%s - H.265/HEVC codec - " "Copyright 2013-2015 (c) Multicoreware Inc - " "http://x265.org - options: %s", - X265_BUILD, x265_version_str, x265_build_info_str, opts); + X265_BUILD, PFX(version_str), PFX(build_info_str), opts); bs.resetBits(); SEIuserDataUnregistered idsei; @@ -1675,9 +1411,23 @@ } else if (p->keyframeMax <= 1) { + p->keyframeMax = 1; + // disable lookahead for all-intra encodes p->bFrameAdaptive = 0; p->bframes = 0; + p->bOpenGOP = 0; + p->bRepeatHeaders = 1; + p->lookaheadDepth = 0; + p->bframes = 0; + p->scenecutThreshold = 0; + p->bFrameAdaptive = 0; + p->rc.cuTree = 0; + p->bEnableWeightedPred = 0; + p->bEnableWeightedBiPred = 0; + + /* SPSs shall have sps_max_dec_pic_buffering_minus1[ sps_max_sub_layers_minus1 ] equal to 0 only */ + p->maxNumReferences = 1; } if (!p->keyframeMin) { @@ -1783,7 +1533,7 @@ if (p->analysisMode && (p->bDistributeModeAnalysis || p->bDistributeMotionEstimation)) { - x265_log(p, X265_LOG_ERROR, "Analysis load/save options incompatible with pmode/pme"); + x265_log(p, X265_LOG_WARNING, "Analysis load/save options incompatible with pmode/pme, Disabling pmode/pme\n"); p->bDistributeMotionEstimation = p->bDistributeModeAnalysis = 0; } @@ -1881,6 +1631,12 @@ } else m_param->rc.qgSize = p->maxCUSize; + + if (p->bLogCuStats) + x265_log(p, X265_LOG_WARNING, "--cu-stats option is now deprecated\n"); + + if (p->csvfn) + x265_log(p, X265_LOG_WARNING, "libx265 no longer supports CSV file statistics\n"); } void Encoder::allocAnalysis(x265_analysis_data* analysis)
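The hunk above moves per-frame reporting out of the library: finishFrameStats() now fills an x265_frame_stats record instead of printing console or CSV lines, and the encoder warns that --cu-stats and csvfn are deprecated. A minimal sketch of how a client might consume these stats, assuming the recon x265_picture exposes them as a frameData member (field names taken from the assignments above):

#include <cstdio>
#include <x265.h>

static void logFrame(const x265_picture& recon)
{
    const x265_frame_stats& fs = recon.frameData; // assumed member name
    printf("order %d POC %d %c-slice QP %.2f %llu bits, avg WPP %.2f\n",
           fs.encoderOrder, fs.poc, fs.sliceType, fs.qp,
           (unsigned long long)fs.bits, fs.avgWPP);
}

CSV formatting, previously done by the removed fprintf block, is now the application's responsibility.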
View file
x265_1.7.tar.gz/source/encoder/encoder.h -> x265_1.8.tar.gz/source/encoder/encoder.h
Changed
@@ -32,7 +32,7 @@ struct x265_encoder {}; -namespace x265 { +namespace X265_NS { // private namespace extern const char g_sliceTypeToChar[3]; @@ -105,7 +105,6 @@ EncStats m_analyzeI; EncStats m_analyzeP; EncStats m_analyzeB; - FILE* m_csvfpt; int64_t m_encodeStartTime; // weighted prediction @@ -149,14 +148,10 @@ void fetchStats(x265_stats* stats, size_t statsSizeBytes); - void writeLog(int argc, char **argv); - void printSummary(); char* statsString(EncStats&, char*); - char* statsCSVString(EncStats& stat, char* buffer); - void configure(x265_param *param); void updateVbvPlan(RateControl* rc); @@ -169,7 +164,7 @@ void writeAnalysisFile(x265_analysis_data* pic); - void finishFrameStats(Frame* pic, FrameEncoder *curEncoder, uint64_t bits); + void finishFrameStats(Frame* pic, FrameEncoder *curEncoder, uint64_t bits, x265_frame_stats* frameStats); protected:
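The header counterpart drops the CSV plumbing (m_csvfpt, writeLog, statsCSVString) and keeps fetchStats(), which services the public x265_encoder_get_stats() entry point. A hedged sketch of polling the cumulative statistics through that call; the struct fields are assumed from the 1.8 public header:

#include <cstdio>
#include <x265.h>

static void reportGlobal(x265_encoder* enc)
{
    x265_stats stats;
    x265_encoder_get_stats(enc, &stats, sizeof(stats)); // size arg guards ABI growth
    printf("%u frames, global PSNR %.3f, SSIM %.5f\n",
           stats.encodedPictureCount, stats.globalPsnr, stats.globalSsim);
}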
View file
x265_1.7.tar.gz/source/encoder/entropy.cpp -> x265_1.8.tar.gz/source/encoder/entropy.cpp
Changed
@@ -35,9 +35,7 @@ #define CU_DQP_EG_k 0 // exp-golomb order #define START_VALUE 8 // start value for dpcm mode -static const uint32_t g_puOffset[8] = { 0, 8, 4, 4, 2, 10, 1, 5 }; - -namespace x265 { +namespace X265_NS { Entropy::Entropy() { @@ -216,7 +214,7 @@ WRITE_FLAG(csp == X265_CSP_I420 || csp == X265_CSP_I400, "general_max_420chroma_constraint_flag"); WRITE_FLAG(csp == X265_CSP_I400, "general_max_monochrome_constraint_flag"); WRITE_FLAG(ptl.intraConstraintFlag, "general_intra_constraint_flag"); - WRITE_FLAG(0, "general_one_picture_only_constraint_flag"); + WRITE_FLAG(ptl.onePictureOnlyConstraintFlag,"general_one_picture_only_constraint_flag"); WRITE_FLAG(ptl.lowerBitRateConstraintFlag, "general_lower_bit_rate_constraint_flag"); WRITE_CODE(0 , 16, "XXX_reserved_zero_35bits[0..15]"); WRITE_CODE(0 , 16, "XXX_reserved_zero_35bits[16..31]"); @@ -862,12 +860,9 @@ void Entropy::codePUWise(const CUData& cu, uint32_t absPartIdx) { X265_CHECK(!cu.isIntra(absPartIdx), "intra block not expected\n"); - PartSize partSize = (PartSize)cu.m_partSize[absPartIdx]; - uint32_t numPU = (partSize == SIZE_2Nx2N ? 1 : (partSize == SIZE_NxN ? 4 : 2)); - uint32_t depth = cu.m_cuDepth[absPartIdx]; - uint32_t puOffset = (g_puOffset[uint32_t(partSize)] << (g_unitSizeDepth - depth) * 2) >> 4; + uint32_t numPU = cu.getNumPartInter(absPartIdx); - for (uint32_t puIdx = 0, subPartIdx = absPartIdx; puIdx < numPU; puIdx++, subPartIdx += puOffset) + for (uint32_t puIdx = 0, subPartIdx = absPartIdx; puIdx < numPU; puIdx++, subPartIdx += cu.getPUOffset(puIdx, absPartIdx)) { codeMergeFlag(cu, subPartIdx); if (cu.m_mergeFlag[subPartIdx]) @@ -1433,6 +1428,55 @@ encodeBin(cu.getCbf(absPartIdx, ttype, lowestTUDepth), m_contextState[OFF_QT_CBF_CTX + ctx]); } +#if CHECKED_BUILD || _DEBUG +uint32_t costCoeffRemain_c0(uint16_t *absCoeff, int numNonZero) +{ + uint32_t goRiceParam = 0; + int firstCoeff2 = 1; + uint32_t baseLevelN = 0x5555AAAA; // 2-bits encode format baseLevel + + uint32_t sum = 0; + int idx = 0; + do + { + int baseLevel = (baseLevelN & 3) | firstCoeff2; + X265_CHECK(baseLevel == ((idx < C1FLAG_NUMBER) ? 
(2 + firstCoeff2) : 1), "baseLevel check failurr\n"); + baseLevelN >>= 2; + int codeNumber = absCoeff[idx] - baseLevel; + + if (codeNumber >= 0) + { + //writeCoefRemainExGolomb(absCoeff[idx] - baseLevel, goRiceParam); + uint32_t length = 0; + + codeNumber = ((uint32_t)codeNumber >> goRiceParam) - COEF_REMAIN_BIN_REDUCTION; + if (codeNumber >= 0) + { + { + unsigned long cidx; + CLZ(cidx, codeNumber + 1); + length = cidx; + } + X265_CHECK((codeNumber != 0) || (length == 0), "length check failure\n"); + + codeNumber = (length + length); + } + sum += (COEF_REMAIN_BIN_REDUCTION + 1 + goRiceParam + codeNumber); + + if (absCoeff[idx] > (COEF_REMAIN_BIN_REDUCTION << goRiceParam)) + goRiceParam = (goRiceParam + 1) - (goRiceParam >> 2); + X265_CHECK(goRiceParam <= 4, "goRiceParam check failure\n"); + } + if (absCoeff[idx] >= 2) + firstCoeff2 = 0; + idx++; + } + while(idx < numNonZero); + + return sum; +} +#endif // debug only code + void Entropy::codeCoeffNxN(const CUData& cu, const coeff_t* coeff, uint32_t absPartIdx, uint32_t log2TrSize, TextType ttype) { uint32_t trSize = 1 << log2TrSize; @@ -1440,7 +1484,7 @@ // compute number of significant coefficients uint32_t numSig = primitives.cu[log2TrSize - 2].count_nonzero(coeff); X265_CHECK(numSig > 0, "cbf check fail\n"); - bool bHideFirstSign = cu.m_slice->m_pps->bSignHideEnabled && !tqBypass; + bool bHideFirstSign = cu.m_slice->m_pps->bSignHideEnabled & !tqBypass; if (log2TrSize <= MAX_LOG2_TS_SIZE && !tqBypass && cu.m_slice->m_pps->bTransformSkipEnabled) codeTransformSkipFlags(cu.m_transformSkip[ttype][absPartIdx], ttype); @@ -1489,9 +1533,11 @@ if (codingParameters.scanType == SCAN_VER) std::swap(pos[0], pos[1]); - int ctxIdx = bIsLuma ? (3 * (log2TrSize - 2) + ((log2TrSize - 1) >> 2)) : NUM_CTX_LAST_FLAG_XY_LUMA; - int ctxShift = bIsLuma ? ((log2TrSize + 1) >> 2) : log2TrSize - 2; + int ctxIdx = bIsLuma ? (3 * (log2TrSize - 2) + (log2TrSize == 5)) : NUM_CTX_LAST_FLAG_XY_LUMA; + int ctxShift = (bIsLuma ? (log2TrSize > 2) : (log2TrSize - 2)); uint32_t maxGroupIdx = (log2TrSize << 1) - 1; + X265_CHECK(((log2TrSize - 1) >> 2) == (uint32_t)(log2TrSize == 5), "ctxIdx check failure\n"); + X265_CHECK((uint32_t)ctxShift == (bIsLuma ? ((log2TrSize + 1) >> 2) : log2TrSize - 2), "ctxShift check failure\n"); uint8_t *ctx = &m_contextState[OFF_CTX_LAST_FLAG_X]; for (uint32_t i = 0; i < 2; i++, ctxIdx += NUM_CTX_LAST_FLAG_XY) @@ -1519,12 +1565,12 @@ uint8_t * const baseCtx = bIsLuma ? 
&m_contextState[OFF_SIG_FLAG_CTX] : &m_contextState[OFF_SIG_FLAG_CTX + NUM_SIG_FLAG_CTX_LUMA]; uint32_t c1 = 1; int scanPosSigOff = scanPosLast - (lastScanSet << MLS_CG_SIZE) - 1; - int absCoeff[1 << MLS_CG_SIZE]; - int numNonZero = 1; + ALIGN_VAR_32(uint16_t, absCoeff[(1 << MLS_CG_SIZE)]); + uint32_t numNonZero = 1; unsigned long lastNZPosInCG; unsigned long firstNZPosInCG; - absCoeff[0] = int(abs(coeff[posLast])); + absCoeff[0] = (uint16_t)abs(coeff[posLast]); for (int subSet = lastScanSet; subSet >= 0; subSet--) { @@ -1540,7 +1586,7 @@ // encode significant_coeffgroup_flag const int cgBlkPos = codingParameters.scanCG[subSet]; - const int cgPosY = cgBlkPos >> (log2TrSize - MLS_CG_LOG2_SIZE); + const int cgPosY = (uint32_t)cgBlkPos >> (log2TrSize - MLS_CG_LOG2_SIZE); const int cgPosX = cgBlkPos & ((1 << (log2TrSize - MLS_CG_LOG2_SIZE)) - 1); const uint64_t cgBlkPosMask = ((uint64_t)1 << cgBlkPos); @@ -1554,21 +1600,14 @@ } // encode significant_coeff_flag - if (sigCoeffGroupFlag64 & cgBlkPosMask) + if ((scanPosSigOff >= 0) && (sigCoeffGroupFlag64 & cgBlkPosMask)) { X265_CHECK((log2TrSize != 2) || (log2TrSize == 2 && subSet == 0), "log2TrSize and subSet mistake!\n"); const int patternSigCtx = Quant::calcPatternSigCtx(sigCoeffGroupFlag64, cgPosX, cgPosY, cgBlkPos, (trSize >> MLS_CG_LOG2_SIZE)); const uint32_t posOffset = (bIsLuma && subSet) ? 3 : 0; - static const uint8_t ctxIndMap4x4[16] = - { - 0, 1, 4, 5, - 2, 3, 4, 5, - 6, 6, 8, 8, - 7, 7, 8, 8 - }; // NOTE: [patternSigCtx][posXinSubset][posYinSubset] - static const uint8_t table_cnt[4][SCAN_SET_SIZE] = + static const uint8_t table_cnt[5][SCAN_SET_SIZE] = { // patternSigCtx = 0 { @@ -1597,50 +1636,61 @@ 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, + }, + // 4x4 + { + 0, 1, 4, 5, + 2, 3, 4, 5, + 6, 6, 8, 8, + 7, 7, 8, 8 } }; const int offset = codingParameters.firstSignificanceMapContext; - ALIGN_VAR_32(uint16_t, tmpCoeff[SCAN_SET_SIZE]); - // TODO: accelerate by PABSW const uint32_t blkPosBase = codingParameters.scan[subPosBase]; - for (int i = 0; i < MLS_CG_SIZE; i++) - { - tmpCoeff[i * MLS_CG_SIZE + 0] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 0]); - tmpCoeff[i * MLS_CG_SIZE + 1] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 1]); - tmpCoeff[i * MLS_CG_SIZE + 2] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 2]); - tmpCoeff[i * MLS_CG_SIZE + 3] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 3]); - } + X265_CHECK(scanPosSigOff >= 0, "scanPosSigOff check failure\n"); if (m_bitIf) { + ALIGN_VAR_32(uint16_t, tmpCoeff[SCAN_SET_SIZE]); + + // TODO: accelerate by PABSW + for (int i = 0; i < MLS_CG_SIZE; i++) + { + tmpCoeff[i * MLS_CG_SIZE + 0] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 0]); + tmpCoeff[i * MLS_CG_SIZE + 1] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 1]); + tmpCoeff[i * MLS_CG_SIZE + 2] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 2]); + tmpCoeff[i * MLS_CG_SIZE + 3] = (uint16_t)abs(coeff[blkPosBase + i * trSize + 3]); + } + if (log2TrSize == 2) { - uint32_t blkPos, sig, ctxSig; - for (; scanPosSigOff >= 0; scanPosSigOff--) + do { + uint32_t blkPos, sig, ctxSig; blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff]; sig = scanFlagMask & 1; scanFlagMask >>= 1; X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n"); { - ctxSig = ctxIndMap4x4[blkPos]; + ctxSig = table_cnt[4][blkPos]; X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, blkPos, bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");; encodeBin(sig, baseCtx[ctxSig]); } 
absCoeff[numNonZero] = tmpCoeff[blkPos]; numNonZero += sig; + scanPosSigOff--; } + while(scanPosSigOff >= 0); } else { X265_CHECK((log2TrSize > 2), "log2TrSize must be more than 2 in this path!\n"); const uint8_t *tabSigCtx = table_cnt[(uint32_t)patternSigCtx]; - - uint32_t blkPos, sig, ctxSig; - for (; scanPosSigOff >= 0; scanPosSigOff--) + do { + uint32_t blkPos, sig, ctxSig; blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff]; const uint32_t posZeroMask = (subPosBase + scanPosSigOff) ? ~0 : 0; sig = scanFlagMask & 1; @@ -1656,79 +1706,20 @@ } absCoeff[numNonZero] = tmpCoeff[blkPos]; numNonZero += sig; + scanPosSigOff--; } + while(scanPosSigOff >= 0); } } else // fast RD path { // maximum g_entropyBits are 18-bits and maximum of count are 16, so intermedia of sum are 22-bits - uint32_t sum = 0; - if (log2TrSize == 2) - { - uint32_t blkPos, sig, ctxSig; - for (; scanPosSigOff >= 0; scanPosSigOff--) - { - blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff]; - sig = scanFlagMask & 1; - scanFlagMask >>= 1; - X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n"); - { - ctxSig = ctxIndMap4x4[blkPos]; - X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, codingParameters.scan[subPosBase + scanPosSigOff], bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");; - //encodeBin(sig, baseCtx[ctxSig]); - const uint32_t mstate = baseCtx[ctxSig]; - const uint32_t mps = mstate & 1; - const uint32_t stateBits = g_entropyStateBits[mstate ^ sig]; - uint32_t nextState = (stateBits >> 23) + mps; - if ((mstate ^ sig) == 1) - nextState = sig; - X265_CHECK(sbacNext(mstate, sig) == nextState, "nextState check failure\n"); - X265_CHECK(sbacGetEntropyBits(mstate, sig) == (stateBits & 0xFFFFFF), "entropyBits check failure\n"); - baseCtx[ctxSig] = (uint8_t)nextState; - sum += stateBits; - } - absCoeff[numNonZero] = tmpCoeff[blkPos]; - numNonZero += sig; - } - } // end of 4x4 - else - { - X265_CHECK((log2TrSize > 2), "log2TrSize must be more than 2 in this path!\n"); - - const uint8_t *tabSigCtx = table_cnt[(uint32_t)patternSigCtx]; - - uint32_t blkPos, sig, ctxSig; - for (; scanPosSigOff >= 0; scanPosSigOff--) - { - blkPos = g_scan4x4[codingParameters.scanType][scanPosSigOff]; - const uint32_t posZeroMask = (subPosBase + scanPosSigOff) ? ~0 : 0; - sig = scanFlagMask & 1; - scanFlagMask >>= 1; - X265_CHECK((uint32_t)(tmpCoeff[blkPos] != 0) == sig, "sign bit mistake\n"); - if (scanPosSigOff != 0 || subSet == 0 || numNonZero) - { - const uint32_t cnt = tabSigCtx[blkPos] + offset; - ctxSig = (cnt + posOffset) & posZeroMask; - - X265_CHECK(ctxSig == Quant::getSigCtxInc(patternSigCtx, log2TrSize, trSize, codingParameters.scan[subPosBase + scanPosSigOff], bIsLuma, codingParameters.firstSignificanceMapContext), "sigCtx mistake!\n");; - //encodeBin(sig, baseCtx[ctxSig]); - const uint32_t mstate = baseCtx[ctxSig]; - const uint32_t mps = mstate & 1; - const uint32_t stateBits = g_entropyStateBits[mstate ^ sig]; - uint32_t nextState = (stateBits >> 23) + mps; - if ((mstate ^ sig) == 1) - nextState = sig; - X265_CHECK(sbacNext(mstate, sig) == nextState, "nextState check failure\n"); - X265_CHECK(sbacGetEntropyBits(mstate, sig) == (stateBits & 0xFFFFFF), "entropyBits check failure\n"); - baseCtx[ctxSig] = (uint8_t)nextState; - sum += stateBits; - } - absCoeff[numNonZero] = tmpCoeff[blkPos]; - numNonZero += sig; - } - } // end of non 4x4 path - sum &= 0xFFFFFF; + const uint8_t *tabSigCtx = table_cnt[(log2TrSize == 2) ? 
4 : (uint32_t)patternSigCtx]; + uint32_t sum = primitives.costCoeffNxN(g_scan4x4[codingParameters.scanType], &coeff[blkPosBase], (intptr_t)trSize, absCoeff + numNonZero, tabSigCtx, scanFlagMask, baseCtx, offset + posOffset, scanPosSigOff, subPosBase); +#if CHECKED_BUILD || _DEBUG + numNonZero = coeffNum[subSet]; +#endif // update RD cost m_fracBits += sum; } // end of fast RD path -- !m_bitIf @@ -1739,113 +1730,114 @@ numNonZero = coeffNum[subSet]; if (numNonZero > 0) { + uint32_t idx; X265_CHECK(subCoeffFlag > 0, "subCoeffFlag is zero\n"); CLZ(lastNZPosInCG, subCoeffFlag); CTZ(firstNZPosInCG, subCoeffFlag); bool signHidden = (lastNZPosInCG - firstNZPosInCG >= SBH_THRESHOLD); - uint32_t ctxSet = (subSet > 0 && bIsLuma) ? 2 : 0; - - if (c1 == 0) - ctxSet++; + const uint8_t ctxSet = (((subSet > 0) + bIsLuma) & 2) + !(c1 & 3); + X265_CHECK((((subSet > 0) & bIsLuma) ? 2 : 0) + !(c1 & 3) == ctxSet, "ctxSet check failure\n"); c1 = 1; - uint8_t *baseCtxMod = bIsLuma ? &m_contextState[OFF_ONE_FLAG_CTX + 4 * ctxSet] : &m_contextState[OFF_ONE_FLAG_CTX + NUM_ONE_FLAG_CTX_LUMA + 4 * ctxSet]; + uint8_t *baseCtxMod = &m_contextState[(bIsLuma ? 0 : NUM_ONE_FLAG_CTX_LUMA) + OFF_ONE_FLAG_CTX + 4 * ctxSet]; - int numC1Flag = X265_MIN(numNonZero, C1FLAG_NUMBER); - int firstC2FlagIdx = -1; - for (int idx = 0; idx < numC1Flag; idx++) + uint32_t numC1Flag = X265_MIN(numNonZero, C1FLAG_NUMBER); + X265_CHECK(numC1Flag > 0, "numC1Flag check failure\n"); + + if (!m_bitIf) { - uint32_t symbol = absCoeff[idx] > 1; - encodeBin(symbol, baseCtxMod[c1]); - if (symbol) - { - c1 = 0; + uint32_t sum = primitives.costC1C2Flag(absCoeff, numC1Flag, baseCtxMod, (bIsLuma ? 0 : NUM_ABS_FLAG_CTX_LUMA - NUM_ONE_FLAG_CTX_LUMA) + (OFF_ABS_FLAG_CTX - OFF_ONE_FLAG_CTX) - 3 * ctxSet); + uint32_t firstC2Idx = (sum >> 28); + c1 = ((sum >> 26) & 3); + m_fracBits += sum & 0x00FFFFFF; - if (firstC2FlagIdx == -1) - firstC2FlagIdx = idx; + const int hiddenShift = (bHideFirstSign & signHidden) ? -1 : 0; + //encodeBinsEP((coeffSigns >> hiddenShift), numNonZero - hiddenShift); + m_fracBits += (numNonZero + hiddenShift) << 15; + + if (numNonZero > firstC2Idx) + { + sum = primitives.costCoeffRemain(absCoeff, numNonZero, firstC2Idx); + X265_CHECK(sum == costCoeffRemain_c0(absCoeff, numNonZero), "costCoeffRemain check failure\n"); + m_fracBits += ((uint64_t)sum << 15); } - else if ((c1 < 3) && (c1 > 0)) - c1++; } - - if (!c1) + // Standard path + else { - baseCtxMod = bIsLuma ? &m_contextState[OFF_ABS_FLAG_CTX + ctxSet] : &m_contextState[OFF_ABS_FLAG_CTX + NUM_ABS_FLAG_CTX_LUMA + ctxSet]; + uint32_t firstC2Idx = 8; + uint32_t firstC2Flag = 2; + uint32_t c1Next = 0xFFFFFFFE; - X265_CHECK((firstC2FlagIdx != -1), "firstC2FlagIdx check failure\n"); - uint32_t symbol = absCoeff[firstC2FlagIdx] > 2; - encodeBin(symbol, baseCtxMod[0]); - } + idx = 0; + do + { + const uint32_t symbol1 = absCoeff[idx] > 1; + const uint32_t symbol2 = absCoeff[idx] > 2; + encodeBin(symbol1, baseCtxMod[c1]); - const int hiddenShift = (bHideFirstSign && signHidden) ? 1 : 0; - encodeBinsEP((coeffSigns >> hiddenShift), numNonZero - hiddenShift); + if (symbol1) + c1Next = 0; - if (!c1 || numNonZero > C1FLAG_NUMBER) - { - uint32_t goRiceParam = 0; - int firstCoeff2 = 1; - uint32_t baseLevelN = 0x5555AAAA; // 2-bits encode format baseLevel + firstC2Flag = (symbol1 + firstC2Flag == 3) ? symbol2 : firstC2Flag; + firstC2Idx = (symbol1 + firstC2Idx == 9) ? 
idx : firstC2Idx; + + c1 = (c1Next & 3); + c1Next >>= 2; + X265_CHECK(c1 <= 3, "c1 check failure\n"); + idx++; + } + while(idx < numC1Flag); - if (!m_bitIf) + if (!c1) { - // FastRd path - for (int idx = 0; idx < numNonZero; idx++) - { - int baseLevel = (baseLevelN & 3) | firstCoeff2; - X265_CHECK(baseLevel == ((idx < C1FLAG_NUMBER) ? (2 + firstCoeff2) : 1), "baseLevel check failurr\n"); - baseLevelN >>= 2; - int codeNumber = absCoeff[idx] - baseLevel; + baseCtxMod = &m_contextState[(bIsLuma ? 0 : NUM_ABS_FLAG_CTX_LUMA) + OFF_ABS_FLAG_CTX + ctxSet]; - if (codeNumber >= 0) - { - //writeCoefRemainExGolomb(absCoeff[idx] - baseLevel, goRiceParam); - uint32_t length = 0; - - codeNumber = ((uint32_t)codeNumber >> goRiceParam) - COEF_REMAIN_BIN_REDUCTION; - if (codeNumber >= 0) - { - { - unsigned long cidx; - CLZ(cidx, codeNumber + 1); - length = cidx; - } - X265_CHECK((codeNumber != 0) || (length == 0), "length check failure\n"); - - codeNumber = (length + length); - } - m_fracBits += (COEF_REMAIN_BIN_REDUCTION + 1 + goRiceParam + codeNumber) << 15; - - if (absCoeff[idx] > (COEF_REMAIN_BIN_REDUCTION << goRiceParam)) - goRiceParam = (goRiceParam + 1) - (goRiceParam >> 2); - X265_CHECK(goRiceParam <= 4, "goRiceParam check failure\n"); - } - if (absCoeff[idx] >= 2) - firstCoeff2 = 0; - } + X265_CHECK((firstC2Flag <= 1), "firstC2FlagIdx check failure\n"); + encodeBin(firstC2Flag, baseCtxMod[0]); } - else + + const int hiddenShift = (bHideFirstSign && signHidden) ? 1 : 0; + encodeBinsEP((coeffSigns >> hiddenShift), numNonZero - hiddenShift); + + if (!c1 || numNonZero > C1FLAG_NUMBER) { // Standard path - for (int idx = 0; idx < numNonZero; idx++) + uint32_t goRiceParam = 0; + int baseLevel = 3; + uint32_t threshold = COEF_REMAIN_BIN_REDUCTION; +#if CHECKED_BUILD || _DEBUG + int firstCoeff2 = 1; +#endif + idx = firstC2Idx; + do { - int baseLevel = (baseLevelN & 3) | firstCoeff2; + if (idx >= C1FLAG_NUMBER) + baseLevel = 1; + // TODO: fast algorithm maybe broken this check logic X265_CHECK(baseLevel == ((idx < C1FLAG_NUMBER) ? (2 + firstCoeff2) : 1), "baseLevel check failurr\n"); - baseLevelN >>= 2; if (absCoeff[idx] >= baseLevel) { writeCoefRemainExGolomb(absCoeff[idx] - baseLevel, goRiceParam); - if (absCoeff[idx] > (COEF_REMAIN_BIN_REDUCTION << goRiceParam)) - goRiceParam = (goRiceParam + 1) - (goRiceParam >> 2); + X265_CHECK(threshold == (uint32_t)(COEF_REMAIN_BIN_REDUCTION << goRiceParam), "COEF_REMAIN_BIN_REDUCTION check failure\n"); + const int adjust = (absCoeff[idx] > threshold) & (goRiceParam <= 3); + goRiceParam += adjust; + threshold += (adjust) ? 
threshold : 0; X265_CHECK(goRiceParam <= 4, "goRiceParam check failure\n"); } - if (absCoeff[idx] >= 2) - firstCoeff2 = 0; +#if CHECKED_BUILD || _DEBUG + firstCoeff2 = 0; +#endif + baseLevel = 2; + idx++; } + while(idx < numNonZero); } - } - } + } // end of !bitIf + } // end of (numNonZero > 0) + // Initialize value for next loop numNonZero = 0; scanPosSigOff = (1 << MLS_CG_SIZE) - 1; @@ -2243,28 +2235,6 @@ 0x0050c, 0x29bab, 0x004c1, 0x2a674, 0x004a7, 0x2aa5e, 0x0046f, 0x2b32f, 0x0041f, 0x2c0ad, 0x003e7, 0x2ca8d, 0x003ba, 0x2d323, 0x0010c, 0x3bfbb }; -// [8 24] --> [stateMPS BitCost], [stateLPS BitCost] -const uint32_t g_entropyStateBits[128] = -{ - // Corrected table, most notably for last state - 0x01007b23, 0x000085f9, 0x020074a0, 0x00008cbc, 0x03006ee4, 0x01009354, 0x040067f4, 0x02009c1b, - 0x050060b0, 0x0200a62a, 0x06005a9c, 0x0400af5b, 0x0700548d, 0x0400b955, 0x08004f56, 0x0500c2a9, - 0x09004a87, 0x0600cbf7, 0x0a0045d6, 0x0700d5c3, 0x0b004144, 0x0800e01b, 0x0c003d88, 0x0900e937, - 0x0d0039e0, 0x0900f2cd, 0x0e003663, 0x0b00fc9e, 0x0f003347, 0x0b010600, 0x10003050, 0x0c010f95, - 0x11002d4d, 0x0d011a02, 0x12002ad3, 0x0d012333, 0x1300286e, 0x0f012cad, 0x14002604, 0x0f0136df, - 0x15002425, 0x10013f48, 0x160021f4, 0x100149c4, 0x1700203e, 0x1201527b, 0x18001e4d, 0x12015d00, - 0x19001c99, 0x130166de, 0x1a001b18, 0x13017017, 0x1b0019a5, 0x15017988, 0x1c001841, 0x15018327, - 0x1d0016df, 0x16018d50, 0x1e0015d9, 0x16019547, 0x1f00147c, 0x1701a083, 0x2000138e, 0x1801a8a3, - 0x21001251, 0x1801b418, 0x22001166, 0x1901bd27, 0x23001068, 0x1a01c77b, 0x24000f7f, 0x1a01d18e, - 0x25000eda, 0x1b01d91a, 0x26000e19, 0x1b01e254, 0x27000d4f, 0x1c01ec9a, 0x28000c90, 0x1d01f6e0, - 0x29000c01, 0x1d01fef8, 0x2a000b5f, 0x1e0208b1, 0x2b000ab6, 0x1e021362, 0x2c000a15, 0x1e021e46, - 0x2d000988, 0x1f02285d, 0x2e000934, 0x20022ea8, 0x2f0008a8, 0x200239b2, 0x3000081d, 0x21024577, - 0x310007c9, 0x21024ce6, 0x32000763, 0x21025663, 0x33000710, 0x22025e8f, 0x340006a0, 0x22026a26, - 0x35000672, 0x23026f23, 0x360005e8, 0x23027ef8, 0x370005ba, 0x230284b5, 0x3800055e, 0x24029057, - 0x3900050c, 0x24029bab, 0x3a0004c1, 0x2402a674, 0x3b0004a7, 0x2502aa5e, 0x3c00046f, 0x2502b32f, - 0x3d00041f, 0x2502c0ad, 0x3e0003e7, 0x2602ca8d, 0x3e0003ba, 0x2602d323, 0x3f00010c, 0x3f03bfbb, -}; - const uint8_t g_nextState[128][2] = { { 2, 1 }, { 0, 3 }, { 4, 0 }, { 1, 5 }, { 6, 2 }, { 3, 7 }, { 8, 4 }, { 5, 9 }, @@ -2286,3 +2256,26 @@ }; } + +// [8 24] --> [stateMPS BitCost], [stateLPS BitCost] +extern "C" const uint32_t PFX(entropyStateBits)[128] = +{ + // Corrected table, most notably for last state + 0x02007B23, 0x000085F9, 0x040074A0, 0x00008CBC, 0x06006EE4, 0x02009354, 0x080067F4, 0x04009C1B, + 0x0A0060B0, 0x0400A62A, 0x0C005A9C, 0x0800AF5B, 0x0E00548D, 0x0800B955, 0x10004F56, 0x0A00C2A9, + 0x12004A87, 0x0C00CBF7, 0x140045D6, 0x0E00D5C3, 0x16004144, 0x1000E01B, 0x18003D88, 0x1200E937, + 0x1A0039E0, 0x1200F2CD, 0x1C003663, 0x1600FC9E, 0x1E003347, 0x16010600, 0x20003050, 0x18010F95, + 0x22002D4D, 0x1A011A02, 0x24002AD3, 0x1A012333, 0x2600286E, 0x1E012CAD, 0x28002604, 0x1E0136DF, + 0x2A002425, 0x20013F48, 0x2C0021F4, 0x200149C4, 0x2E00203E, 0x2401527B, 0x30001E4D, 0x24015D00, + 0x32001C99, 0x260166DE, 0x34001B18, 0x26017017, 0x360019A5, 0x2A017988, 0x38001841, 0x2A018327, + 0x3A0016DF, 0x2C018D50, 0x3C0015D9, 0x2C019547, 0x3E00147C, 0x2E01A083, 0x4000138E, 0x3001A8A3, + 0x42001251, 0x3001B418, 0x44001166, 0x3201BD27, 0x46001068, 0x3401C77B, 0x48000F7F, 0x3401D18E, + 0x4A000EDA, 0x3601D91A, 0x4C000E19, 0x3601E254, 0x4E000D4F, 0x3801EC9A, 
0x50000C90, 0x3A01F6E0, + 0x52000C01, 0x3A01FEF8, 0x54000B5F, 0x3C0208B1, 0x56000AB6, 0x3C021362, 0x58000A15, 0x3C021E46, + 0x5A000988, 0x3E02285D, 0x5C000934, 0x40022EA8, 0x5E0008A8, 0x400239B2, 0x6000081D, 0x42024577, + 0x620007C9, 0x42024CE6, 0x64000763, 0x42025663, 0x66000710, 0x44025E8F, 0x680006A0, 0x44026A26, + 0x6A000672, 0x46026F23, 0x6C0005E8, 0x46027EF8, 0x6E0005BA, 0x460284B5, 0x7000055E, 0x48029057, + 0x7200050C, 0x48029BAB, 0x740004C1, 0x4802A674, 0x760004A7, 0x4A02AA5E, 0x7800046F, 0x4A02B32F, + 0x7A00041F, 0x4A02C0AD, 0x7C0003E7, 0x4C02CA8D, 0x7C0003BA, 0x4C02D323, 0x7E00010C, 0x7E03BFBB, +}; +
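The entropy-bits table is now exported with C linkage for the assembly kernels, and its packing changed: per the comment, each 32-bit entry holds the packed next CABAC state in the top 8 bits and the fractional bit cost in the low 24, with MPS/LPS entries interleaved per state. A small sketch unpacking one entry; the 1/32768-bit cost scale is inferred from the "<< 15" scaling used in the fast-RD path above, not stated in the table itself:

#include <cstdint>
#include <cstdio>

int main()
{
    const uint32_t entry = 0x02007B23;    // first (state 0, MPS) table entry
    uint32_t nextState = entry >> 24;     // upper 8 bits: packed next state
    uint32_t cost = entry & 0x00FFFFFF;   // lower 24 bits: fractional bit cost
    printf("next state %u, cost %.4f bits\n", nextState, cost / 32768.0);
    return 0;
}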
View file
x265_1.7.tar.gz/source/encoder/entropy.h -> x265_1.8.tar.gz/source/encoder/entropy.h
Changed
@@ -31,7 +31,7 @@ #include "contexts.h" #include "slice.h" -namespace x265 { +namespace X265_NS { // private namespace struct SaoCtuParam;
View file
x265_1.7.tar.gz/source/encoder/frameencoder.cpp -> x265_1.8.tar.gz/source/encoder/frameencoder.cpp
Changed
@@ -35,7 +35,7 @@ #include "slicetype.h" #include "nal.h" -namespace x265 { +namespace X265_NS { void weightAnalyse(Slice& slice, Frame& frame, x265_param& param); FrameEncoder::FrameEncoder() @@ -59,7 +59,6 @@ m_cuGeoms = NULL; m_ctuGeomMap = NULL; m_localTldIdx = 0; - memset(&m_frameStats, 0, sizeof(m_frameStats)); memset(&m_rce, 0, sizeof(RateControlEntry)); } @@ -313,7 +312,7 @@ m_SSDY = m_SSDU = m_SSDV = 0; m_ssim = 0; m_ssimCnt = 0; - memset(&m_frameStats, 0, sizeof(m_frameStats)); + memset(&(m_frame->m_encData->m_frameStats), 0, sizeof(m_frame->m_encData->m_frameStats)); /* Emit access unit delimiter unless this is the first frame and the user is * not repeating headers (since AUD is supposed to be the first NAL in the access @@ -419,25 +418,6 @@ m_top->m_lastBPSEI = m_rce.encodeOrder; } - - // The recovery point SEI message assists a decoder in determining when the decoding - // process will produce acceptable pictures for display after the decoder initiates - // random access. The m_recoveryPocCnt is in units of POC(picture order count) which - // means pictures encoded after the CRA but precede it in display order(leading) are - // implicitly discarded after a random access seek regardless of the value of - // m_recoveryPocCnt. Our encoder does not use references prior to the most recent CRA, - // so all pictures following the CRA in POC order are guaranteed to be displayable, - // so m_recoveryPocCnt is always 0. - SEIRecoveryPoint sei_recovery_point; - sei_recovery_point.m_recoveryPocCnt = 0; - sei_recovery_point.m_exactMatchingFlag = true; - sei_recovery_point.m_brokenLinkFlag = false; - - m_bs.resetBits(); - sei_recovery_point.write(m_bs, *slice->m_sps); - m_bs.writeByteAlignment(); - - m_nalList.serialize(NAL_UNIT_PREFIX_SEI, m_bs); } if (m_param->bEmitHRDSEI || !!m_param->interlaceMode) @@ -475,6 +455,19 @@ m_nalList.serialize(NAL_UNIT_PREFIX_SEI, m_bs); } + /* CQP and CRF (without capped VBV) doesn't use mid-frame statistics to + * tune RateControl parameters for other frames. + * Hence, for these modes, update m_startEndOrder and unlock RC for previous threads waiting in + * RateControlEnd here, after the slicecontexts are initialized. For the rest - ABR + * and VBV, unlock only after rateControlUpdateStats of this frame is called */ + if (m_param->rc.rateControlMode != X265_RC_ABR && !m_top->m_rateControl->m_isVbv) + { + m_top->m_rateControl->m_startEndOrder.incr(); + + if (m_rce.encodeOrder < m_param->frameNumThreads - 1) + m_top->m_rateControl->m_startEndOrder.incr(); // faked rateControlEnd calls for negative frames + } + /* Analyze CTU rows, most of the hard work is done here. Frame is * compressed in a wave-front pattern if WPP is enabled. 
Row based loop * filters runs behind the CTU compression and reconstruction */ @@ -559,17 +552,56 @@ // accumulate intra,inter,skip cu count per frame for 2 pass for (uint32_t i = 0; i < m_numRows; i++) { - m_frameStats.mvBits += m_rows[i].rowStats.mvBits; - m_frameStats.coeffBits += m_rows[i].rowStats.coeffBits; - m_frameStats.miscBits += m_rows[i].rowStats.miscBits; - totalI += m_rows[i].rowStats.iCuCnt; - totalP += m_rows[i].rowStats.pCuCnt; - totalSkip += m_rows[i].rowStats.skipCuCnt; + m_frame->m_encData->m_frameStats.mvBits += m_rows[i].rowStats.mvBits; + m_frame->m_encData->m_frameStats.coeffBits += m_rows[i].rowStats.coeffBits; + m_frame->m_encData->m_frameStats.miscBits += m_rows[i].rowStats.miscBits; + totalI += m_rows[i].rowStats.intra8x8Cnt; + totalP += m_rows[i].rowStats.inter8x8Cnt; + totalSkip += m_rows[i].rowStats.skip8x8Cnt; } int totalCuCount = totalI + totalP + totalSkip; - m_frameStats.percentIntra = (double)totalI / totalCuCount; - m_frameStats.percentInter = (double)totalP / totalCuCount; - m_frameStats.percentSkip = (double)totalSkip / totalCuCount; + m_frame->m_encData->m_frameStats.percent8x8Intra = (double)totalI / totalCuCount; + m_frame->m_encData->m_frameStats.percent8x8Inter = (double)totalP / totalCuCount; + m_frame->m_encData->m_frameStats.percent8x8Skip = (double)totalSkip / totalCuCount; + } + for (uint32_t i = 0; i < m_numRows; i++) + { + m_frame->m_encData->m_frameStats.cntIntraNxN += m_rows[i].rowStats.cntIntraNxN; + m_frame->m_encData->m_frameStats.totalCu += m_rows[i].rowStats.totalCu; + m_frame->m_encData->m_frameStats.totalCtu += m_rows[i].rowStats.totalCtu; + m_frame->m_encData->m_frameStats.lumaDistortion += m_rows[i].rowStats.lumaDistortion; + m_frame->m_encData->m_frameStats.chromaDistortion += m_rows[i].rowStats.chromaDistortion; + m_frame->m_encData->m_frameStats.psyEnergy += m_rows[i].rowStats.psyEnergy; + m_frame->m_encData->m_frameStats.lumaLevel += m_rows[i].rowStats.lumaLevel; + + if (m_rows[i].rowStats.maxLumaLevel > m_frame->m_encData->m_frameStats.maxLumaLevel) + m_frame->m_encData->m_frameStats.maxLumaLevel = m_rows[i].rowStats.maxLumaLevel; + for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) + { + m_frame->m_encData->m_frameStats.cntSkipCu[depth] += m_rows[i].rowStats.cntSkipCu[depth]; + m_frame->m_encData->m_frameStats.cntMergeCu[depth] += m_rows[i].rowStats.cntMergeCu[depth]; + for (int m = 0; m < INTER_MODES; m++) + m_frame->m_encData->m_frameStats.cuInterDistribution[depth][m] += m_rows[i].rowStats.cuInterDistribution[depth][m]; + for (int n = 0; n < INTRA_MODES; n++) + m_frame->m_encData->m_frameStats.cuIntraDistribution[depth][n] += m_rows[i].rowStats.cuIntraDistribution[depth][n]; + } + } + m_frame->m_encData->m_frameStats.avgLumaDistortion = (double)(m_frame->m_encData->m_frameStats.lumaDistortion) / m_frame->m_encData->m_frameStats.totalCtu; + m_frame->m_encData->m_frameStats.avgChromaDistortion = (double)(m_frame->m_encData->m_frameStats.chromaDistortion) / m_frame->m_encData->m_frameStats.totalCtu; + m_frame->m_encData->m_frameStats.avgPsyEnergy = (double)(m_frame->m_encData->m_frameStats.psyEnergy) / m_frame->m_encData->m_frameStats.totalCtu; + m_frame->m_encData->m_frameStats.avgLumaLevel = m_frame->m_encData->m_frameStats.lumaLevel / m_frame->m_encData->m_frameStats.totalCtu; + m_frame->m_encData->m_frameStats.percentIntraNxN = (double)(m_frame->m_encData->m_frameStats.cntIntraNxN * 100) / m_frame->m_encData->m_frameStats.totalCu; + for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) + { + 
m_frame->m_encData->m_frameStats.percentSkipCu[depth] = (double)(m_frame->m_encData->m_frameStats.cntSkipCu[depth] * 100) / m_frame->m_encData->m_frameStats.totalCu; + m_frame->m_encData->m_frameStats.percentMergeCu[depth] = (double)(m_frame->m_encData->m_frameStats.cntMergeCu[depth] * 100) / m_frame->m_encData->m_frameStats.totalCu; + for (int n = 0; n < INTRA_MODES; n++) + m_frame->m_encData->m_frameStats.percentIntraDistribution[depth][n] = (double)(m_frame->m_encData->m_frameStats.cuIntraDistribution[depth][n] * 100) / m_frame->m_encData->m_frameStats.totalCu; + uint64_t cuInterRectCnt = 0; // sum of Nx2N, 2NxN counts + cuInterRectCnt += m_frame->m_encData->m_frameStats.cuInterDistribution[depth][1] + m_frame->m_encData->m_frameStats.cuInterDistribution[depth][2]; + m_frame->m_encData->m_frameStats.percentInterDistribution[depth][0] = (double)(m_frame->m_encData->m_frameStats.cuInterDistribution[depth][0] * 100) / m_frame->m_encData->m_frameStats.totalCu; + m_frame->m_encData->m_frameStats.percentInterDistribution[depth][1] = (double)(cuInterRectCnt * 100) / m_frame->m_encData->m_frameStats.totalCu; + m_frame->m_encData->m_frameStats.percentInterDistribution[depth][2] = (double)(m_frame->m_encData->m_frameStats.cuInterDistribution[depth][3] * 100) / m_frame->m_encData->m_frameStats.totalCu; } m_bs.resetBits(); @@ -638,7 +670,7 @@ m_endCompressTime = x265_mdate(); /* rateControlEnd may also block for earlier frames to call rateControlUpdateStats */ - if (m_top->m_rateControl->rateControlEnd(m_frame, m_accessUnitBits, &m_rce, &m_frameStats) < 0) + if (m_top->m_rateControl->rateControlEnd(m_frame, m_accessUnitBits, &m_rce) < 0) m_top->m_aborted = true; /* Decrement referenced frame reference counts, allow them to be recycled */ @@ -826,13 +858,6 @@ const uint32_t lineStartCUAddr = row * numCols; bool bIsVbv = m_param->rc.vbvBufferSize > 0 && m_param->rc.vbvMaxBitrate > 0; - /* These store the count of inter, intra and skip cus within quad tree structure of each CTU */ - uint32_t qTreeInterCnt[NUM_CU_DEPTH]; - uint32_t qTreeIntraCnt[NUM_CU_DEPTH]; - uint32_t qTreeSkipCnt[NUM_CU_DEPTH]; - for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) - qTreeIntraCnt[depth] = qTreeInterCnt[depth] = qTreeSkipCnt[depth] = 0; - while (curRow.completed < numCols) { ProfileScopeEvent(encodeCTU); @@ -904,30 +929,57 @@ // Completed CU processing curRow.completed++; - if (m_param->bLogCuStats || m_param->rc.bStatWrite) - curEncData.m_rowStat[row].sumQpAq += collectCTUStatistics(*ctu, qTreeInterCnt, qTreeIntraCnt, qTreeSkipCnt); - else if (m_param->rc.aqMode) - curEncData.m_rowStat[row].sumQpAq += calcCTUQP(*ctu); + FrameStats frameLog; + curEncData.m_rowStat[row].sumQpAq += collectCTUStatistics(*ctu, &frameLog); // copy no. 
of intra, inter Cu cnt per row into frame stats for 2 pass if (m_param->rc.bStatWrite) { - curRow.rowStats.mvBits += best.mvBits; + curRow.rowStats.mvBits += best.mvBits; curRow.rowStats.coeffBits += best.coeffBits; - curRow.rowStats.miscBits += best.totalBits - (best.mvBits + best.coeffBits); + curRow.rowStats.miscBits += best.totalBits - (best.mvBits + best.coeffBits); for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) { /* 1 << shift == number of 8x8 blocks at current depth */ int shift = 2 * (g_maxCUDepth - depth); - curRow.rowStats.iCuCnt += qTreeIntraCnt[depth] << shift; - curRow.rowStats.pCuCnt += qTreeInterCnt[depth] << shift; - curRow.rowStats.skipCuCnt += qTreeSkipCnt[depth] << shift; + int cuSize = g_maxCUSize >> depth; - // clear the row cu data from thread local object - qTreeIntraCnt[depth] = qTreeInterCnt[depth] = qTreeSkipCnt[depth] = 0; + if (cuSize == 8) + curRow.rowStats.intra8x8Cnt += (int)(frameLog.cntIntra[depth] + frameLog.cntIntraNxN); + else + curRow.rowStats.intra8x8Cnt += (int)(frameLog.cntIntra[depth] << shift); + + curRow.rowStats.inter8x8Cnt += (int)(frameLog.cntInter[depth] << shift); + curRow.rowStats.skip8x8Cnt += (int)((frameLog.cntSkipCu[depth] + frameLog.cntMergeCu[depth]) << shift); } } + curRow.rowStats.totalCtu++; + curRow.rowStats.lumaDistortion += best.lumaDistortion; + curRow.rowStats.chromaDistortion += best.chromaDistortion; + curRow.rowStats.psyEnergy += best.psyEnergy; + curRow.rowStats.cntIntraNxN += frameLog.cntIntraNxN; + curRow.rowStats.totalCu += frameLog.totalCu; + for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++) + { + curRow.rowStats.cntSkipCu[depth] += frameLog.cntSkipCu[depth]; + curRow.rowStats.cntMergeCu[depth] += frameLog.cntMergeCu[depth]; + for (int m = 0; m < INTER_MODES; m++) + curRow.rowStats.cuInterDistribution[depth][m] += frameLog.cuInterDistribution[depth][m]; + for (int n = 0; n < INTRA_MODES; n++) + curRow.rowStats.cuIntraDistribution[depth][n] += frameLog.cuIntraDistribution[depth][n]; + } + + /* calculate maximum and average luma levels */ + uint32_t ctuLumaLevel = 0; + uint32_t ctuNoOfPixels = best.fencYuv->m_size * best.fencYuv->m_size; + for (uint32_t i = 0; i < ctuNoOfPixels; i++) + { + pixel p = best.fencYuv->m_buf[0][i]; + ctuLumaLevel += p; + curRow.rowStats.maxLumaLevel = X265_MAX(p, curRow.rowStats.maxLumaLevel); + } + curRow.rowStats.lumaLevel += (double)(ctuLumaLevel) / ctuNoOfPixels; curEncData.m_cuStat[cuAddr].totalBits = best.totalBits; x265_emms(); @@ -1103,11 +1155,9 @@ } /* collect statistics about CU coding decisions, return total QP */ -int FrameEncoder::collectCTUStatistics(const CUData& ctu, uint32_t* qtreeInterCnt, uint32_t* qtreeIntraCnt, uint32_t* qtreeSkipCnt) +int FrameEncoder::collectCTUStatistics(const CUData& ctu, FrameStats* log) { - StatisticLog* log = &m_sliceTypeLog[ctu.m_slice->m_sliceType]; int totQP = 0; - if (ctu.m_slice->m_sliceType == I_SLICE) { uint32_t depth = 0; @@ -1117,14 +1167,12 @@ log->totalCu++; log->cntIntra[depth]++; - qtreeIntraCnt[depth]++; totQP += ctu.m_qp[absPartIdx] * (ctu.m_numPartitions >> (depth * 2)); if (ctu.m_predMode[absPartIdx] == MODE_NONE) { log->totalCu--; log->cntIntra[depth]--; - qtreeIntraCnt[depth]--; } else if (ctu.m_partSize[absPartIdx] != SIZE_2Nx2N) { @@ -1147,24 +1195,20 @@ depth = ctu.m_cuDepth[absPartIdx]; log->totalCu++; - log->cntTotalCu[depth]++; totQP += ctu.m_qp[absPartIdx] * (ctu.m_numPartitions >> (depth * 2)); if (ctu.m_predMode[absPartIdx] == MODE_NONE) - { log->totalCu--; - log->cntTotalCu[depth]--; - } else if 
(ctu.isSkipped(absPartIdx)) { - log->totalCu--; - log->cntSkipCu[depth]++; - qtreeSkipCnt[depth]++; + if (ctu.m_mergeFlag[0]) + log->cntMergeCu[depth]++; + else + log->cntSkipCu[depth]++; } else if (ctu.isInter(absPartIdx)) { log->cntInter[depth]++; - qtreeInterCnt[depth]++; if (ctu.m_partSize[absPartIdx] < AMP_ID) log->cuInterDistribution[depth][ctu.m_partSize[absPartIdx]]++; @@ -1174,7 +1218,6 @@ else if (ctu.isIntra(absPartIdx)) { log->cntIntra[depth]++; - qtreeIntraCnt[depth]++; if (ctu.m_partSize[absPartIdx] != SIZE_2Nx2N) { @@ -1194,21 +1237,6 @@ return totQP; } -/* iterate over coded CUs and determine total QP */ -int FrameEncoder::calcCTUQP(const CUData& ctu) -{ - int totQP = 0; - uint32_t depth = 0, numParts = ctu.m_numPartitions; - - for (uint32_t absPartIdx = 0; absPartIdx < ctu.m_numPartitions; absPartIdx += numParts) - { - depth = ctu.m_cuDepth[absPartIdx]; - numParts = ctu.m_numPartitions >> (depth * 2); - totQP += ctu.m_qp[absPartIdx] * numParts; - } - return totQP; -} - /* DCT-domain noise reduction / adaptive deadzone from libavcodec */ void FrameEncoder::noiseReductionUpdate() {
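The frame encoder now gathers statistics per row and folds them into m_frame->m_encData->m_frameStats, including a new average/maximum luma measurement taken over each CTU's source block. A standalone sketch of that luma pass, assuming an 8-bit pixel type:

#include <algorithm>
#include <cstdint>

typedef uint8_t pixel; // x265 uses uint16_t in high-bit-depth builds

static void ctuLumaStats(const pixel* buf, int size, double& avg, pixel& maxLevel)
{
    uint32_t sum = 0;
    maxLevel = 0;
    const int n = size * size;
    for (int i = 0; i < n; i++)
    {
        sum += buf[i];
        maxLevel = std::max(maxLevel, buf[i]);
    }
    avg = (double)sum / n; // one sample toward the row's lumaLevel average
}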
View file
x265_1.7.tar.gz/source/encoder/frameencoder.h -> x265_1.8.tar.gz/source/encoder/frameencoder.h
Changed
@@ -41,7 +41,7 @@ #include "reference.h" #include "nal.h" -namespace x265 { +namespace X265_NS { // private x265 namespace class ThreadPool; @@ -49,8 +49,6 @@ #define ANGULAR_MODE_ID 2 #define AMP_ID 3 -#define INTER_MODES 4 -#define INTRA_MODES 3 struct StatisticLog { @@ -156,8 +154,6 @@ MD5Context m_state[3]; uint32_t m_crc[3]; uint32_t m_checksum[3]; - StatisticLog m_sliceTypeLog[3]; // per-slice type CU statistics - FrameStats m_frameStats; // stats of current frame for multi-pass encodes volatile int m_activeWorkerCount; // count of workers currently encoding or filtering CTUs volatile int m_totalActiveWorkerCount; // sum of m_activeWorkerCount sampled at end of each CTU @@ -221,8 +217,7 @@ void encodeSlice(); void threadMain(); - int collectCTUStatistics(const CUData& ctu, uint32_t* qtreeInterCnt, uint32_t* qtreeIntraCnt, uint32_t* qtreeSkipCnt); - int calcCTUQP(const CUData& ctu); + int collectCTUStatistics(const CUData& ctu, FrameStats* frameLog); void noiseReductionUpdate(); /* Called by WaveFront::findJob() */
View file
x265_1.7.tar.gz/source/encoder/framefilter.cpp -> x265_1.8.tar.gz/source/encoder/framefilter.cpp
Changed
@@ -30,7 +30,7 @@ #include "frameencoder.h" #include "wavefront.h" -using namespace x265; +using namespace X265_NS; static uint64_t computeSSD(pixel *fenc, pixel *rec, intptr_t stride, uint32_t width, uint32_t height); static float calculateSSIM(pixel *pix1, intptr_t stride1, pixel *pix2, intptr_t stride2, uint32_t width, uint32_t height, void *buf, uint32_t& cnt);
View file
x265_1.7.tar.gz/source/encoder/framefilter.h -> x265_1.8.tar.gz/source/encoder/framefilter.h
Changed
@@ -30,7 +30,7 @@ #include "deblock.h" #include "sao.h" -namespace x265 { +namespace X265_NS { // private x265 namespace class Encoder;
View file
x265_1.7.tar.gz/source/encoder/level.cpp -> x265_1.8.tar.gz/source/encoder/level.cpp
Changed
@@ -25,7 +25,7 @@ #include "slice.h" #include "level.h" -namespace x265 { +namespace X265_NS { typedef struct { uint32_t maxLumaSamples; @@ -61,18 +61,37 @@ /* determine minimum decoder level required to decode the described video */ void determineLevel(const x265_param ¶m, VPS& vps) { + vps.ptl.onePictureOnlyConstraintFlag = param.totalFrames == 1; + vps.ptl.intraConstraintFlag = param.keyframeMax <= 1 || vps.ptl.onePictureOnlyConstraintFlag; + vps.ptl.bitDepthConstraint = param.internalBitDepth; + vps.ptl.chromaFormatConstraint = param.internalCsp; + + /* TODO: figure out HighThroughput signaling, aka: HbrFactor in section A.4.2, only available + * for intra-only profiles (vps.ptl.intraConstraintFlag) */ + vps.ptl.lowerBitRateConstraintFlag = true; + vps.maxTempSubLayers = param.bEnableTemporalSubLayers ? 2 : 1; - if (param.internalCsp == X265_CSP_I420) + + if (param.internalCsp == X265_CSP_I420 && param.internalBitDepth <= 10) { - if (param.internalBitDepth == 8) + /* Probably an HEVC v1 profile, but must check to be sure */ + if (param.internalBitDepth <= 8) { - if (param.keyframeMax == 1 && param.maxNumReferences == 1) + if (vps.ptl.onePictureOnlyConstraintFlag) vps.ptl.profileIdc = Profile::MAINSTILLPICTURE; + else if (vps.ptl.intraConstraintFlag) + vps.ptl.profileIdc = Profile::MAINREXT; /* Main Intra */ else vps.ptl.profileIdc = Profile::MAIN; } - else if (param.internalBitDepth == 10) - vps.ptl.profileIdc = Profile::MAIN10; + else if (param.internalBitDepth <= 10) + { + /* note there is no 10bit still picture profile */ + if (vps.ptl.intraConstraintFlag) + vps.ptl.profileIdc = Profile::MAINREXT; /* Main10 Intra */ + else + vps.ptl.profileIdc = Profile::MAIN10; + } } else vps.ptl.profileIdc = Profile::MAINREXT; @@ -162,17 +181,19 @@ return; } -#define CHECK_RANGE(value, main, high) (value > main && value <= high) +#define CHECK_RANGE(value, main, high) (high != MAX_UINT && value > main && value <= high) - if (CHECK_RANGE(bitrate, levels[i].maxBitrateMain, levels[i].maxBitrateHigh) && - CHECK_RANGE((uint32_t)param.rc.vbvBufferSize, levels[i].maxCpbSizeMain, levels[i].maxCpbSizeHigh) && - levels[i].maxBitrateHigh != MAX_UINT) + if (CHECK_RANGE(bitrate, levels[i].maxBitrateMain, levels[i].maxBitrateHigh) || + CHECK_RANGE((uint32_t)param.rc.vbvBufferSize, levels[i].maxCpbSizeMain, levels[i].maxCpbSizeHigh)) { - /* If the user has not enabled high tier, continue looking to see if we can encode at a higher level, main tier */ - if (!param.bHighTier && (levels[i].levelIdc < param.levelIdc)) - continue; - else + /* The bitrate or buffer size are out of range for Main tier, but in + * range for High tier. If the user requested High tier then give + * them High tier at this level. 
Otherwise allow the loop to + * progress to the Main tier of the next level */ + if (param.bHighTier) vps.ptl.tierFlag = Level::HIGH; + else + continue; } else vps.ptl.tierFlag = Level::MAIN; @@ -184,29 +205,68 @@ break; } - vps.ptl.intraConstraintFlag = false; - vps.ptl.lowerBitRateConstraintFlag = true; - vps.ptl.bitDepthConstraint = param.internalBitDepth; - vps.ptl.chromaFormatConstraint = param.internalCsp; - static const char *profiles[] = { "None", "Main", "Main 10", "Main Still Picture", "RExt" }; static const char *tiers[] = { "Main", "High" }; - const char *profile = profiles[vps.ptl.profileIdc]; + char profbuf[64]; + strcpy(profbuf, profiles[vps.ptl.profileIdc]); + + bool bStillPicture = false; if (vps.ptl.profileIdc == Profile::MAINREXT) { - if (param.internalCsp == X265_CSP_I422) - profile = "Main 4:2:2 10"; - if (param.internalCsp == X265_CSP_I444) + if (vps.ptl.bitDepthConstraint > 12 && vps.ptl.intraConstraintFlag) + { + if (vps.ptl.onePictureOnlyConstraintFlag) + { + strcpy(profbuf, "Main 4:4:4 16 Still Picture"); + bStillPicture = true; + } + else + strcpy(profbuf, "Main 4:4:4 16"); + } + else if (param.internalCsp == X265_CSP_I420) + { + X265_CHECK(vps.ptl.intraConstraintFlag || vps.ptl.bitDepthConstraint > 10, "rext fail\n"); + if (vps.ptl.bitDepthConstraint <= 8) + strcpy(profbuf, "Main"); + else if (vps.ptl.bitDepthConstraint <= 10) + strcpy(profbuf, "Main 10"); + else if (vps.ptl.bitDepthConstraint <= 12) + strcpy(profbuf, "Main 12"); + } + else if (param.internalCsp == X265_CSP_I422) + { + /* there is no Main 4:2:2 profile, so it must be signaled as Main10 4:2:2 */ + if (param.internalBitDepth <= 10) + strcpy(profbuf, "Main 4:2:2 10"); + else if (vps.ptl.bitDepthConstraint <= 12) + strcpy(profbuf, "Main 4:2:2 12"); + } + else if (param.internalCsp == X265_CSP_I444) { if (vps.ptl.bitDepthConstraint <= 8) - profile = "Main 4:4:4 8"; + { + if (vps.ptl.onePictureOnlyConstraintFlag) + { + strcpy(profbuf, "Main 4:4:4 Still Picture"); + bStillPicture = true; + } + else + strcpy(profbuf, "Main 4:4:4"); + } else if (vps.ptl.bitDepthConstraint <= 10) - profile = "Main 4:4:4 10"; + strcpy(profbuf, "Main 4:4:4 10"); + else if (vps.ptl.bitDepthConstraint <= 12) + strcpy(profbuf, "Main 4:4:4 12"); } + else + strcpy(profbuf, "Unknown"); + + if (vps.ptl.intraConstraintFlag && !bStillPicture) + strcat(profbuf, " Intra"); } x265_log(¶m, X265_LOG_INFO, "%s profile, Level-%s (%s tier)\n", - profile, levels[i].name, tiers[vps.ptl.tierFlag]); + profbuf, levels[i].name, tiers[vps.ptl.tierFlag]); } /* enforce a maximum decoder level requirement, in other words assure that a @@ -340,80 +400,88 @@ return true; } +} + +#if EXPORT_C_API + +/* these functions are exported as C functions (default) */ +using namespace X265_NS; +extern "C" { + +#else + +/* these functions exist within private namespace (multilib) */ +namespace X265_NS { + +#endif -extern "C" int x265_param_apply_profile(x265_param *param, const char *profile) { if (!param || !profile) return 0; -#if HIGH_BIT_DEPTH - if (!strcmp(profile, "main") || !strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp") || !strcmp(profile, "main444-8")) - { - x265_log(param, X265_LOG_ERROR, "%s profile not supported, compiled for Main10.\n", profile); - return -1; - } -#else - if (!strcmp(profile, "main10") || !strcmp(profile, "main422-10") || !strcmp(profile, "main444-10")) - { - x265_log(param, X265_LOG_ERROR, "%s profile not supported, compiled for Main.\n", profile); - return -1; - } + /* Check if profile bit-depth requirement is 
exceeded by internal bit depth */ + bool bInvalidDepth = false; +#if X265_DEPTH > 8 + if (!strcmp(profile, "main") || !strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp") || + !strcmp(profile, "main444-8") || !strcmp(profile, "main-intra") || + !strcmp(profile, "main444-intra") || !strcmp(profile, "main444-stillpicture")) + bInvalidDepth = true; #endif - - if (!strcmp(profile, "main")) +#if X265_DEPTH > 10 + if (!strcmp(profile, "main10") || !strcmp(profile, "main422-10") || !strcmp(profile, "main444-10") || + !strcmp(profile, "main10-intra") || !strcmp(profile, "main422-10-intra") || !strcmp(profile, "main444-10-intra")) + bInvalidDepth = true; +#endif +#if X265_DEPTH > 12 + if (!strcmp(profile, "main12") || !strcmp(profile, "main422-12") || !strcmp(profile, "main444-12") || + !strcmp(profile, "main12-intra") || !strcmp(profile, "main422-12-intra") || !strcmp(profile, "main444-12-intra")) + bInvalidDepth = true; +#endif + + if (bInvalidDepth) { - if (!(param->internalCsp & X265_CSP_I420)) - { - x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n", - profile, x265_source_csp_names[param->internalCsp]); - return -1; - } + x265_log(param, X265_LOG_ERROR, "%s profile not supported, internal bit depth %d.\n", profile, X265_DEPTH); + return -1; } - else if (!strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp")) - { - if (!(param->internalCsp & X265_CSP_I420)) - { - x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n", - profile, x265_source_csp_names[param->internalCsp]); - return -1; - } - /* SPSs shall have sps_max_dec_pic_buffering_minus1[ sps_max_sub_layers_minus1 ] equal to 0 only */ - param->maxNumReferences = 1; - - /* The bitstream shall contain only one picture (we do not enforce this) */ - /* just in case the user gives us more than one picture: */ + size_t l = strlen(profile); + bool bBoolIntra = (l > 6 && !strcmp(profile + l - 6, "-intra")) || + !strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp"); + if (bBoolIntra) + { + /* The profile may be detected as still picture if param->totalFrames is 1 */ param->keyframeMax = 1; - param->bOpenGOP = 0; - param->bRepeatHeaders = 1; - param->lookaheadDepth = 0; - param->bframes = 0; - param->scenecutThreshold = 0; - param->bFrameAdaptive = 0; - param->rc.cuTree = 0; - param->bEnableWeightedPred = 0; - param->bEnableWeightedBiPred = 0; } - else if (!strcmp(profile, "main10")) + + /* check that input color space is supported by profile */ + if (!strcmp(profile, "main") || !strcmp(profile, "main-intra") || + !strcmp(profile, "main10") || !strcmp(profile, "main10-intra") || + !strcmp(profile, "main12") || !strcmp(profile, "main12-intra") || + !strcmp(profile, "mainstillpicture") || !strcmp(profile, "msp")) { - if (!(param->internalCsp & X265_CSP_I420)) + if (param->internalCsp != X265_CSP_I420) { x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n", profile, x265_source_csp_names[param->internalCsp]); return -1; } } - else if (!strcmp(profile, "main422-10")) + else if (!strcmp(profile, "main422-10") || !strcmp(profile, "main422-10-intra") || + !strcmp(profile, "main422-12") || !strcmp(profile, "main422-12-intra")) { - if (!(param->internalCsp & (X265_CSP_I420 | X265_CSP_I422))) + if (param->internalCsp != X265_CSP_I420 && param->internalCsp != X265_CSP_I422) { x265_log(param, X265_LOG_ERROR, "%s profile not compatible with %s input color space.\n", profile, x265_source_csp_names[param->internalCsp]); return 
-1; } } - else if (!strcmp(profile, "main444-8") || !strcmp(profile, "main444-10")) + else if (!strcmp(profile, "main444-8") || + !strcmp(profile, "main444-intra") || !strcmp(profile, "main444-stillpicture") || + !strcmp(profile, "main444-10") || !strcmp(profile, "main444-10-intra") || + !strcmp(profile, "main444-12") || !strcmp(profile, "main444-12-intra") || + !strcmp(profile, "main444-16-intra") || !strcmp(profile, "main444-16-stillpicture")) { /* any color space allowed */ }
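Profile handling is substantially reworked here: the PTL constraint flags are derived up front, the RExt profile names (including the new -intra and still-picture variants listed above) are spelled out, and x265_param_apply_profile() now validates the requested profile against the build's bit depth. A sketch of requesting one of the new names; whether it succeeds depends on the compiled X265_DEPTH and the configured color space:

#include <x265.h>

static bool setupIntraEncode(x265_param* p)
{
    x265_param_default_preset(p, "medium", NULL);
    p->internalCsp = X265_CSP_I420;
    // any "-intra" profile name forces keyframeMax = 1 (see the diff above);
    // still-picture profiles are detected when totalFrames == 1
    return x265_param_apply_profile(p, "main-intra") == 0;
}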
View file
x265_1.7.tar.gz/source/encoder/level.h -> x265_1.8.tar.gz/source/encoder/level.h
Changed
@@ -27,7 +27,7 @@ #include "common.h" #include "x265.h" -namespace x265 { +namespace X265_NS { // encoder private namespace struct VPS;
View file
x265_1.7.tar.gz/source/encoder/motion.cpp -> x265_1.8.tar.gz/source/encoder/motion.cpp
Changed
@@ -31,7 +31,7 @@ #pragma warning(disable: 4127) // conditional expression is constant (macros use this construct) #endif -using namespace x265; +using namespace X265_NS; namespace { @@ -56,7 +56,7 @@ { 2, 8, 2, 8, true }, // 2x8 SATD HPEL + 2x8 SATD QPEL }; -int sizeScale[NUM_PU_SIZES]; +static int sizeScale[NUM_PU_SIZES]; #define SAD_THRESH(v) (bcost < (((v >> 4) * sizeScale[partEnum]))) /* radius 2 hexagon. repeated entries are to avoid having to compute mod6 every time. */ @@ -234,14 +234,9 @@ pix_base + (m1x) + (m1y) * stride, \ pix_base + (m2x) + (m2y) * stride, \ stride, costs); \ - const uint16_t *base_mvx = &m_cost_mvx[(bmv.x + (m0x)) << 2]; \ - const uint16_t *base_mvy = &m_cost_mvy[(bmv.y + (m0y)) << 2]; \ - X265_CHECK(mvcost((bmv + MV(m0x, m0y)) << 2) == (base_mvx[((m0x) - (m0x)) << 2] + base_mvy[((m0y) - (m0y)) << 2]), "mvcost() check failure\n"); \ - X265_CHECK(mvcost((bmv + MV(m1x, m1y)) << 2) == (base_mvx[((m1x) - (m0x)) << 2] + base_mvy[((m1y) - (m0y)) << 2]), "mvcost() check failure\n"); \ - X265_CHECK(mvcost((bmv + MV(m2x, m2y)) << 2) == (base_mvx[((m2x) - (m0x)) << 2] + base_mvy[((m2y) - (m0y)) << 2]), "mvcost() check failure\n"); \ - (costs)[0] += (base_mvx[((m0x) - (m0x)) << 2] + base_mvy[((m0y) - (m0y)) << 2]); \ - (costs)[1] += (base_mvx[((m1x) - (m0x)) << 2] + base_mvy[((m1y) - (m0y)) << 2]); \ - (costs)[2] += (base_mvx[((m2x) - (m0x)) << 2] + base_mvy[((m2y) - (m0y)) << 2]); \ + (costs)[0] += mvcost((bmv + MV(m0x, m0y)) << 2); \ + (costs)[1] += mvcost((bmv + MV(m1x, m1y)) << 2); \ + (costs)[2] += mvcost((bmv + MV(m2x, m2y)) << 2); \ } #define COST_MV_PT_DIST_X4(m0x, m0y, p0, d0, m1x, m1y, p1, d1, m2x, m2y, p2, d2, m3x, m3y, p3, d3) \ @@ -271,16 +266,10 @@ pix_base + (m2x) + (m2y) * stride, \ pix_base + (m3x) + (m3y) * stride, \ stride, costs); \ - const uint16_t *base_mvx = &m_cost_mvx[(omv.x << 2)]; \ - const uint16_t *base_mvy = &m_cost_mvy[(omv.y << 2)]; \ - X265_CHECK(mvcost((omv + MV(m0x, m0y)) << 2) == (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]), "mvcost() check failure\n"); \ - X265_CHECK(mvcost((omv + MV(m1x, m1y)) << 2) == (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]), "mvcost() check failure\n"); \ - X265_CHECK(mvcost((omv + MV(m2x, m2y)) << 2) == (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]), "mvcost() check failure\n"); \ - X265_CHECK(mvcost((omv + MV(m3x, m3y)) << 2) == (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]), "mvcost() check failure\n"); \ - costs[0] += (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]); \ - costs[1] += (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]); \ - costs[2] += (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]); \ - costs[3] += (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]); \ + costs[0] += mvcost((omv + MV(m0x, m0y)) << 2); \ + costs[1] += mvcost((omv + MV(m1x, m1y)) << 2); \ + costs[2] += mvcost((omv + MV(m2x, m2y)) << 2); \ + costs[3] += mvcost((omv + MV(m3x, m3y)) << 2); \ COPY2_IF_LT(bcost, costs[0], bmv, omv + MV(m0x, m0y)); \ COPY2_IF_LT(bcost, costs[1], bmv, omv + MV(m1x, m1y)); \ COPY2_IF_LT(bcost, costs[2], bmv, omv + MV(m2x, m2y)); \ @@ -296,17 +285,10 @@ pix_base + (m2x) + (m2y) * stride, \ pix_base + (m3x) + (m3y) * stride, \ stride, costs); \ - /* TODO: use restrict keyword in ICL */ \ - const uint16_t *base_mvx = &m_cost_mvx[(bmv.x << 2)]; \ - const uint16_t *base_mvy = &m_cost_mvy[(bmv.y << 2)]; \ - X265_CHECK(mvcost((bmv + MV(m0x, m0y)) << 2) == (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]), "mvcost() check failure\n"); \ - X265_CHECK(mvcost((bmv + MV(m1x, m1y)) << 2) == (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 
2]), "mvcost() check failure\n"); \ - X265_CHECK(mvcost((bmv + MV(m2x, m2y)) << 2) == (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]), "mvcost() check failure\n"); \ - X265_CHECK(mvcost((bmv + MV(m3x, m3y)) << 2) == (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]), "mvcost() check failure\n"); \ - (costs)[0] += (base_mvx[(m0x) << 2] + base_mvy[(m0y) << 2]); \ - (costs)[1] += (base_mvx[(m1x) << 2] + base_mvy[(m1y) << 2]); \ - (costs)[2] += (base_mvx[(m2x) << 2] + base_mvy[(m2y) << 2]); \ - (costs)[3] += (base_mvx[(m3x) << 2] + base_mvy[(m3y) << 2]); \ + (costs)[0] += mvcost((bmv + MV(m0x, m0y)) << 2); \ + (costs)[1] += mvcost((bmv + MV(m1x, m1y)) << 2); \ + (costs)[2] += mvcost((bmv + MV(m2x, m2y)) << 2); \ + (costs)[3] += mvcost((bmv + MV(m3x, m3y)) << 2); \ } #define DIA1_ITER(mx, my) \ @@ -639,36 +621,18 @@ } } + X265_CHECK(!(ref->isLowres && numCandidates), "lowres motion candidates not allowed\n") // measure SAD cost at each QPEL motion vector candidate - if (ref->isLowres) - { - for (int i = 0; i < numCandidates; i++) - { - MV m = mvc[i].clipped(qmvmin, qmvmax); - if (m.notZero() && m != pmv && m != bestpre) // check already measured - { - int cost = ref->lowresQPelCost(fenc, blockOffset, m, sad) + mvcost(m); - if (cost < bprecost) - { - bprecost = cost; - bestpre = m; - } - } - } - } - else + for (int i = 0; i < numCandidates; i++) { - for (int i = 0; i < numCandidates; i++) + MV m = mvc[i].clipped(qmvmin, qmvmax); + if (m.notZero() & (m != pmv ? 1 : 0) & (m != bestpre ? 1 : 0)) // check already measured { - MV m = mvc[i].clipped(qmvmin, qmvmax); - if (m.notZero() && m != pmv && m != bestpre) // check already measured + int cost = subpelCompare(ref, m, sad) + mvcost(m); + if (cost < bprecost) { - int cost = subpelCompare(ref, m, sad) + mvcost(m); - if (cost < bprecost) - { - bprecost = cost; - bestpre = m; - } + bprecost = cost; + bestpre = m; } } }
x265_1.7.tar.gz/source/encoder/motion.h -> x265_1.8.tar.gz/source/encoder/motion.h
Changed
@@ -30,7 +30,7 @@ #include "bitcost.h" #include "yuv.h" -namespace x265 { +namespace X265_NS { // private x265 namespace class MotionEstimate : public BitCost
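This rename recurs throughout these 1.8 diffs: the fixed `namespace x265` becomes the macro `X265_NS`, which lets builds at different internal bit depths place their symbols in distinct namespaces and coexist in a single library. A hedged sketch of the indirection; the macro spellings below are assumptions, the real definition lives in x265's common headers:

// Illustrative only; the actual macro values in x265 may differ.
#ifndef X265_DEPTH
#define X265_DEPTH 8
#endif

#if X265_DEPTH == 12
  #define X265_NS x265_12bit
#elif X265_DEPTH == 10
  #define X265_NS x265_10bit
#else
  #define X265_NS x265
#endif

namespace X265_NS { /* encoder internals resolve to depth-specific symbols */ }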
x265_1.7.tar.gz/source/encoder/nal.cpp -> x265_1.8.tar.gz/source/encoder/nal.cpp
Changed
@@ -25,7 +25,7 @@ #include "bitstream.h" #include "nal.h" -using namespace x265; +using namespace X265_NS; NALList::NALList() : m_numNal(0)
x265_1.7.tar.gz/source/encoder/nal.h -> x265_1.8.tar.gz/source/encoder/nal.h
Changed
@@ -27,7 +27,7 @@ #include "common.h" #include "x265.h" -namespace x265 { +namespace X265_NS { // private namespace class Bitstream;
x265_1.7.tar.gz/source/encoder/ratecontrol.cpp -> x265_1.8.tar.gz/source/encoder/ratecontrol.cpp
Changed
@@ -37,7 +37,7 @@ #define BR_SHIFT 6 #define CPB_SHIFT 4 -using namespace x265; +using namespace X265_NS; /* Amortize the partial cost of I frames over the next N frames */ @@ -181,6 +181,8 @@ m_bTerminated = false; m_finalFrameCount = 0; m_numEntries = 0; + m_isSceneTransition = false; + m_lastPredictorReset = 0; if (m_param->rc.rateControlMode == X265_RC_CRF) { m_param->rc.qp = (int)m_param->rc.rfConstant; @@ -273,7 +275,6 @@ if(m_param->rc.bStrictCbr) m_rateTolerance = 0.7; - m_leadingBframes = m_param->bframes; m_bframeBits = 0; m_leadingNoBSatd = 0; m_ipOffset = 6.0 * X265_LOG2(m_param->rc.ipFactor); @@ -282,6 +283,7 @@ /* Adjust the first frame in order to stabilize the quality level compared to the rest */ #define ABR_INIT_QP_MIN (24) #define ABR_INIT_QP_MAX (40) +#define ABR_SCENECUT_INIT_QP_MIN (12) #define CRF_INIT_QP (int)m_param->rc.rfConstant for (int i = 0; i < 3; i++) m_lastQScaleFor[i] = x265_qp2qScale(m_param->rc.rateControlMode == X265_RC_CRF ? CRF_INIT_QP : ABR_INIT_QP_MIN); @@ -369,20 +371,8 @@ m_accumPNorm = .01; m_accumPQp = (m_param->rc.rateControlMode == X265_RC_CRF ? CRF_INIT_QP : ABR_INIT_QP_MIN) * m_accumPNorm; - /* Frame Predictors and Row predictors used in vbv */ - for (int i = 0; i < 4; i++) - { - m_pred[i].coeff = 1.0; - m_pred[i].count = 1.0; - m_pred[i].decay = 0.5; - m_pred[i].offset = 0.0; - } - m_pred[0].coeff = m_pred[3].coeff = 0.75; - if (m_param->rc.qCompress >= 0.8) // when tuned for grain - { - m_pred[1].coeff = 0.75; - m_pred[0].coeff = m_pred[3].coeff = 0.50; - } + /* Frame Predictors used in vbv */ + initFramePredictors(); if (!m_statFileOut && (m_param->rc.bStatWrite || m_param->rc.bStatRead)) { /* If the user hasn't defined the stat filename, use the default value */ @@ -931,6 +921,24 @@ return X265_TYPE_AUTO; } +void RateControl::initFramePredictors() +{ + /* Frame Predictors used in vbv */ + for (int i = 0; i < 4; i++) + { + m_pred[i].coeff = 1.0; + m_pred[i].count = 1.0; + m_pred[i].decay = 0.5; + m_pred[i].offset = 0.0; + } + m_pred[0].coeff = m_pred[3].coeff = 0.75; + if (m_param->rc.qCompress >= 0.8) // when tuned for grain + { + m_pred[1].coeff = 0.75; + m_pred[0].coeff = m_pred[3].coeff = 0.50; + } +} + int RateControl::rateControlStart(Frame* curFrame, RateControlEntry* rce, Encoder* enc) { int orderValue = m_startEndOrder.get(); @@ -960,10 +968,20 @@ copyRceData(rce, &m_rce2Pass[rce->poc]); } rce->isActive = true; - if (m_sliceType == B_SLICE) - rce->bframes = m_leadingBframes; - else - m_leadingBframes = curFrame->m_lowres.leadingBframes; + bool isRefFrameScenecut = m_sliceType!= I_SLICE && m_curSlice->m_refPicList[0][0]->m_lowres.bScenecut == 1; + if (curFrame->m_lowres.bScenecut) + { + m_isSceneTransition = true; + m_lastPredictorReset = rce->encodeOrder; + initFramePredictors(); + } + else if (m_sliceType != B_SLICE && !isRefFrameScenecut) + m_isSceneTransition = false; + + if (rce->encodeOrder < m_lastPredictorReset + m_param->frameNumThreads) + { + rce->rowPreds[0][0].count = 0; + } rce->bLastMiniGopBFrame = curFrame->m_lowres.bLastMiniGopBFrame; rce->bufferRate = m_bufferRate; @@ -1040,6 +1058,10 @@ } } } + /* For a scenecut that occurs within the mini-gop, enable scene transition + * switch until the next mini-gop to ensure a min qp for all the frames within + * the scene-transition mini-gop */ + double q = x265_qScale2qp(rateEstimateQscale(curFrame, rce)); q = x265_clip3((double)QP_MIN, (double)QP_MAX_MAX, q); m_qp = int(q + 0.5); @@ -1087,18 +1109,6 @@ } m_framesDone++; - /* CQP and CRF (without capped VBV) doesn't use 
mid-frame statistics to - * tune RateControl parameters for other frames. - * Hence, for these modes, update m_startEndOrder and unlock RC for previous threads waiting in - * RateControlEnd here.those modes here. For the rest - ABR - * and VBV, unlock only after rateControlUpdateStats of this frame is called */ - if (m_param->rc.rateControlMode != X265_RC_ABR && !m_isVbv) - { - m_startEndOrder.incr(); - - if (rce->encodeOrder < m_param->frameNumThreads - 1) - m_startEndOrder.incr(); // faked rateControlEnd calls for negative frames - } return m_qp; } @@ -1394,6 +1404,13 @@ else q += m_pbOffset; + /* Set a min qp at scenechanges and transitions */ + if (m_isSceneTransition) + { + q = X265_MAX(ABR_SCENECUT_INIT_QP_MIN, q); + double minScenecutQscale =x265_qp2qScale(ABR_SCENECUT_INIT_QP_MIN); + m_lastQScaleFor[P_SLICE] = X265_MAX(minScenecutQscale, m_lastQScaleFor[P_SLICE]); + } double qScale = x265_qp2qScale(q); rce->qpNoVbv = q; double lmin = 0, lmax = 0; @@ -1556,11 +1573,19 @@ q = X265_MIN(lqmax, q); } q = x265_clip3(MIN_QPSCALE, MAX_MAX_QPSCALE, q); + /* Set a min qp at scenechanges and transitions */ + if (m_isSceneTransition) + { + double minScenecutQscale =x265_qp2qScale(ABR_SCENECUT_INIT_QP_MIN); + q = X265_MAX(minScenecutQscale, q); + m_lastQScaleFor[P_SLICE] = X265_MAX(minScenecutQscale, m_lastQScaleFor[P_SLICE]); + } rce->qpNoVbv = x265_qScale2qp(q); q = clipQscale(curFrame, rce, q); /* clip qp to permissible range after vbv-lookahead estimation to avoid possible - * mispredictions by initial frame size predictors */ - if (!m_2pass && m_isVbv && m_pred[m_predType].count == 1) + * mispredictions by initial frame size predictors, after each scenecut */ + bool isFrameAfterScenecut = m_sliceType!= I_SLICE && m_curSlice->m_refPicList[0][0]->m_lowres.bScenecut; + if (!m_2pass && m_isVbv && isFrameAfterScenecut) q = x265_clip3(lqmin, lqmax, q); } m_lastQScaleFor[m_sliceType] = q; @@ -1762,7 +1787,7 @@ } /* Try to get the buffer not more than 80% filled, but don't set an impossible goal. */ targetFill = x265_clip3(m_bufferSize * (1 - 0.2 * finalDur), m_bufferSize, m_bufferFill - totalDuration * m_vbvMaxRate * 0.5); - if (m_isCbr && bufferFillCur > targetFill) + if (m_isCbr && bufferFillCur > targetFill && !m_isSceneTransition) { q /= 1.01; loopTerminate |= 2; @@ -1904,6 +1929,7 @@ else if (picType == P_SLICE) { intraCostForPendingCus = curEncData.m_rowStat[row].intraSatdForVbv - curEncData.m_rowStat[row].diagIntraSatd; + intraCostForPendingCus >>= X265_DEPTH - 8; /* Our QP is lower than the reference! */ double pred_intra = predictSize(rce->rowPred[1], qScale, intraCostForPendingCus); /* Sum: better to overestimate than underestimate by using only one of the two predictors. */ @@ -1939,7 +1965,7 @@ uint64_t intraRowSatdCost = curEncData.m_rowStat[row].diagIntraSatd; if (row == 1) intraRowSatdCost += curEncData.m_rowStat[0].diagIntraSatd; - + intraRowSatdCost >>= X265_DEPTH - 8; updatePredictor(rce->rowPred[1], qScaleVbv, (double)intraRowSatdCost, encodedBits); } } @@ -2130,7 +2156,7 @@ { int predType = rce->sliceType; predType = rce->sliceType == B_SLICE && rce->keptAsRef ? 
3 : predType; - if (rce->lastSatd >= m_ncu) + if (rce->lastSatd >= m_ncu && rce->encodeOrder >= m_lastPredictorReset) updatePredictor(&m_pred[predType], x265_qp2qScale(rce->qpaRc), (double)rce->lastSatd, (double)bits); if (!m_isVbv) return; @@ -2146,7 +2172,7 @@ } /* After encoding one frame, update rate control state */ -int RateControl::rateControlEnd(Frame* curFrame, int64_t bits, RateControlEntry* rce, FrameStats* stats) +int RateControl::rateControlEnd(Frame* curFrame, int64_t bits, RateControlEntry* rce) { int orderValue = m_startEndOrder.get(); int endOrdinal = (rce->encodeOrder + m_param->frameNumThreads) * 2 - 1; @@ -2163,25 +2189,6 @@ FrameData& curEncData = *curFrame->m_encData; int64_t actualBits = bits; Slice *slice = curEncData.m_slice; - if (m_isAbr) - { - if (m_param->rc.rateControlMode == X265_RC_ABR && !m_param->rc.bStatRead) - checkAndResetABR(rce, true); - - if (m_param->rc.rateControlMode == X265_RC_CRF) - { - if (int(curEncData.m_avgQpRc + 0.5) == slice->m_sliceQp) - curEncData.m_rateFactor = m_rateFactorConstant; - else - { - /* If vbv changed the frame QP recalculate the rate-factor */ - double baseCplx = m_ncu * (m_param->bframes ? 120 : 80); - double mbtree_offset = m_param->rc.cuTree ? (1.0 - m_param->rc.qCompress) * 13.5 : 0; - curEncData.m_rateFactor = pow(baseCplx, 1 - m_qCompress) / - x265_qp2qScale(int(curEncData.m_avgQpRc + 0.5) + mbtree_offset); - } - } - } if (m_param->rc.aqMode || m_isVbv) { @@ -2207,35 +2214,26 @@ curEncData.m_avgQpAq = curEncData.m_avgQpRc; } - // Write frame stats into the stats file if 2 pass is enabled. - if (m_param->rc.bStatWrite) - { - char cType = rce->sliceType == I_SLICE ? (rce->poc > 0 && m_param->bOpenGOP ? 'i' : 'I') - : rce->sliceType == P_SLICE ? 'P' - : IS_REFERENCED(curFrame) ? 'B' : 'b'; - if (fprintf(m_statFileOut, - "in:%d out:%d type:%c q:%.2f q-aq:%.2f tex:%d mv:%d misc:%d icu:%.2f pcu:%.2f scu:%.2f ;\n", - rce->poc, rce->encodeOrder, - cType, curEncData.m_avgQpRc, curEncData.m_avgQpAq, - stats->coeffBits, - stats->mvBits, - stats->miscBits, - stats->percentIntra * m_ncu, - stats->percentInter * m_ncu, - stats->percentSkip * m_ncu) < 0) - goto writeFailure; - /* Don't re-write the data in multi-pass mode. */ - if (m_param->rc.cuTree && IS_REFERENCED(curFrame) && !m_param->rc.bStatRead) + if (m_isAbr) + { + if (m_param->rc.rateControlMode == X265_RC_ABR && !m_param->rc.bStatRead) + checkAndResetABR(rce, true); + + if (m_param->rc.rateControlMode == X265_RC_CRF) { - uint8_t sliceType = (uint8_t)rce->sliceType; - for (int i = 0; i < m_ncu; i++) - m_cuTreeStats.qpBuffer[0][i] = (uint16_t)(curFrame->m_lowres.qpCuTreeOffset[i] * 256.0); - if (fwrite(&sliceType, 1, 1, m_cutreeStatFileOut) < 1) - goto writeFailure; - if (fwrite(m_cuTreeStats.qpBuffer[0], sizeof(uint16_t), m_ncu, m_cutreeStatFileOut) < (size_t)m_ncu) - goto writeFailure; + if (int(curEncData.m_avgQpRc + 0.5) == slice->m_sliceQp) + curEncData.m_rateFactor = m_rateFactorConstant; + else + { + /* If vbv changed the frame QP recalculate the rate-factor */ + double baseCplx = m_ncu * (m_param->bframes ? 120 : 80); + double mbtree_offset = m_param->rc.cuTree ? 
(1.0 - m_param->rc.qCompress) * 13.5 : 0; + curEncData.m_rateFactor = pow(baseCplx, 1 - m_qCompress) / + x265_qp2qScale(int(curEncData.m_avgQpRc + 0.5) + mbtree_offset); + } } } + if (m_isAbr && !m_isAbrReset) { /* amortize part of each I slice over the next several frames, up to @@ -2317,12 +2315,43 @@ // Allow rateControlStart of next frame only when rateControlEnd of previous frame is over m_startEndOrder.incr(); return 0; +} -writeFailure: +/* called to write out the rate control frame stats info in multipass encodes */ +int RateControl::writeRateControlFrameStats(Frame* curFrame, RateControlEntry* rce) +{ + FrameData& curEncData = *curFrame->m_encData; + char cType = rce->sliceType == I_SLICE ? (rce->poc > 0 && m_param->bOpenGOP ? 'i' : 'I') + : rce->sliceType == P_SLICE ? 'P' + : IS_REFERENCED(curFrame) ? 'B' : 'b'; + if (fprintf(m_statFileOut, + "in:%d out:%d type:%c q:%.2f q-aq:%.2f tex:%d mv:%d misc:%d icu:%.2f pcu:%.2f scu:%.2f ;\n", + rce->poc, rce->encodeOrder, + cType, curEncData.m_avgQpRc, curEncData.m_avgQpAq, + curFrame->m_encData->m_frameStats.coeffBits, + curFrame->m_encData->m_frameStats.mvBits, + curFrame->m_encData->m_frameStats.miscBits, + curFrame->m_encData->m_frameStats.percent8x8Intra * m_ncu, + curFrame->m_encData->m_frameStats.percent8x8Inter * m_ncu, + curFrame->m_encData->m_frameStats.percent8x8Skip * m_ncu) < 0) + goto writeFailure; + /* Don't re-write the data in multi-pass mode. */ + if (m_param->rc.cuTree && IS_REFERENCED(curFrame) && !m_param->rc.bStatRead) + { + uint8_t sliceType = (uint8_t)rce->sliceType; + for (int i = 0; i < m_ncu; i++) + m_cuTreeStats.qpBuffer[0][i] = (uint16_t)(curFrame->m_lowres.qpCuTreeOffset[i] * 256.0); + if (fwrite(&sliceType, 1, 1, m_cutreeStatFileOut) < 1) + goto writeFailure; + if (fwrite(m_cuTreeStats.qpBuffer[0], sizeof(uint16_t), m_ncu, m_cutreeStatFileOut) < (size_t)m_ncu) + goto writeFailure; + } + return 0; + + writeFailure: x265_log(m_param, X265_LOG_ERROR, "RatecontrolEnd: stats file write failure\n"); return 1; } - #if defined(_MSC_VER) #pragma warning(disable: 4996) // POSIX function names are just fine, thank you #endif
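The ratecontrol.cpp changes above hang together as one feature: a scene cut resets the VBV frame predictors (initFramePredictors), suppresses the CBR buffer-fill easing, and, while m_isSceneTransition holds, enforces a minimum QP of ABR_SCENECUT_INIT_QP_MIN (12) so that unstable cost estimates around fades cannot drive the quantizer unrealistically low. A self-contained sketch of the floor, assuming x265's usual qp-to-qScale mapping (qScale = 0.85 * 2^((qp - 12) / 6)); names are illustrative:

#include <algorithm>
#include <cmath>
#include <cstdio>

static const int ABR_SCENECUT_INIT_QP_MIN = 12;

static double qp2qScale(double qp) { return 0.85 * pow(2.0, (qp - 12.0) / 6.0); }

static double applySceneTransitionFloor(double q, bool isSceneTransition)
{
    if (isSceneTransition)                      // set while a fade is in progress
        q = std::max(q, qp2qScale(ABR_SCENECUT_INIT_QP_MIN));
    return q;
}

int main()
{
    printf("qScale 0.400 -> %.3f\n", applySceneTransitionFloor(0.400, true)); // raised to 0.850
}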
x265_1.7.tar.gz/source/encoder/ratecontrol.h -> x265_1.8.tar.gz/source/encoder/ratecontrol.h
Changed
@@ -29,7 +29,7 @@ #include "common.h" #include "sei.h" -namespace x265 { +namespace X265_NS { // encoder namespace class Encoder; @@ -46,23 +46,6 @@ #define MIN_AMORTIZE_FRACTION 0.2 #define CLIP_DURATION(f) x265_clip3(MIN_FRAME_DURATION, MAX_FRAME_DURATION, f) -/* Current frame stats for 2 pass */ -struct FrameStats -{ - int mvBits; /* MV bits (MV+Ref+Block Type) */ - int coeffBits; /* Texture bits (DCT coefs) */ - int miscBits; - - int iCuCnt; - int pCuCnt; - int skipCuCnt; - - /* CU type counts stored as percentage */ - double percentIntra; - double percentInter; - double percentSkip; -}; - struct Predictor { double coeff; @@ -164,7 +147,6 @@ double m_pbOffset; int64_t m_bframeBits; int64_t m_currentSatd; - int m_leadingBframes; int m_qpConstant[3]; int m_lastNonBPictType; int m_framesDone; /* # of frames passed through RateCotrol already */ @@ -190,6 +172,8 @@ int64_t m_lastBsliceSatdCost; int m_numBframesInPattern; bool m_isPatternPresent; + bool m_isSceneTransition; + int m_lastPredictorReset; /* a common variable on which rateControlStart, rateControlEnd and rateControUpdateStats waits to * sync the calls to these functions. For example @@ -241,12 +225,12 @@ // to be called for each curFrame to process RateControl and set QP int rateControlStart(Frame* curFrame, RateControlEntry* rce, Encoder* enc); void rateControlUpdateStats(RateControlEntry* rce); - int rateControlEnd(Frame* curFrame, int64_t bits, RateControlEntry* rce, FrameStats* stats); + int rateControlEnd(Frame* curFrame, int64_t bits, RateControlEntry* rce); int rowDiagonalVbvRateControl(Frame* curFrame, uint32_t row, RateControlEntry* rce, double& qpVbv); int rateControlSliceType(int frameNum); bool cuTreeReadFor2Pass(Frame* curFrame); void hrdFullness(SEIBufferingPeriod* sei); - + int writeRateControlFrameStats(Frame* curFrame, RateControlEntry* rce); protected: static const int s_slidingWindowFrames; @@ -274,6 +258,7 @@ void checkAndResetABR(RateControlEntry* rce, bool isFrameDone); double predictRowsSizeSum(Frame* pic, RateControlEntry* rce, double qpm, int32_t& encodedBits); bool initPass2(); + void initFramePredictors(); double getDiffLimitedQScale(RateControlEntry *rce, double q); double countExpectedBits(); bool vbv2Pass(uint64_t allAvailableBits);
x265_1.7.tar.gz/source/encoder/rdcost.h -> x265_1.8.tar.gz/source/encoder/rdcost.h
Changed
@@ -27,7 +27,7 @@ #include "common.h" #include "slice.h" -namespace x265 { +namespace X265_NS { // private namespace class RDCost @@ -88,10 +88,17 @@ m_lambda = (uint64_t)floor(256.0 * lambda); } - inline uint64_t calcRdCost(uint32_t distortion, uint32_t bits) const + inline uint64_t calcRdCost(sse_ret_t distortion, uint32_t bits) const { +#if X265_DEPTH <= 10 X265_CHECK(bits <= (UINT64_MAX - 128) / m_lambda2, - "calcRdCost wrap detected dist: %u, bits %u, lambda: "X265_LL"\n", distortion, bits, m_lambda2); + "calcRdCost wrap detected dist: %u, bits %u, lambda: " X265_LL "\n", + distortion, bits, m_lambda2); +#else + X265_CHECK(bits <= (UINT64_MAX - 128) / m_lambda2, + "calcRdCost wrap detected dist: " X265_LL ", bits %u, lambda: " X265_LL "\n", + distortion, bits, m_lambda2); +#endif return distortion + ((bits * m_lambda2 + 128) >> 8); } @@ -108,7 +115,7 @@ } /* return the RD cost of this prediction, including the effect of psy-rd */ - inline uint64_t calcPsyRdCost(uint32_t distortion, uint32_t bits, uint32_t psycost) const + inline uint64_t calcPsyRdCost(sse_ret_t distortion, uint32_t bits, uint32_t psycost) const { return distortion + ((m_lambda * m_psyRd * psycost) >> 24) + ((bits * m_lambda2) >> 8); } @@ -116,15 +123,22 @@ inline uint64_t calcRdSADCost(uint32_t sadCost, uint32_t bits) const { X265_CHECK(bits <= (UINT64_MAX - 128) / m_lambda, - "calcRdSADCost wrap detected dist: %u, bits %u, lambda: "X265_LL"\n", sadCost, bits, m_lambda); + "calcRdSADCost wrap detected dist: %u, bits %u, lambda: " X265_LL "\n", sadCost, bits, m_lambda); return sadCost + ((bits * m_lambda + 128) >> 8); } - inline uint32_t scaleChromaDist(uint32_t plane, uint32_t dist) const + inline sse_ret_t scaleChromaDist(uint32_t plane, sse_ret_t dist) const { +#if X265_DEPTH <= 10 + X265_CHECK(dist <= (UINT64_MAX - 128) / m_chromaDistWeight[plane - 1], + "scaleChromaDist wrap detected dist: %u, lambda: %u\n", + dist, m_chromaDistWeight[plane - 1]); +#else X265_CHECK(dist <= (UINT64_MAX - 128) / m_chromaDistWeight[plane - 1], - "scaleChromaDist wrap detected dist: %u, lambda: %u\n", dist, m_chromaDistWeight[plane - 1]); - return (uint32_t)((dist * (uint64_t)m_chromaDistWeight[plane - 1] + 128) >> 8); + "scaleChromaDist wrap detected dist: " X265_LL " lambda: %u\n", + dist, m_chromaDistWeight[plane - 1]); +#endif + return (sse_ret_t)((dist * (uint64_t)m_chromaDistWeight[plane - 1] + 128) >> 8); } inline uint32_t getCost(uint32_t bits) const
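The substance of the rdcost.h change is the sse_ret_t type: summed SSE distortion stays in 32 bits through 10-bit depth but must widen to 64 bits for Main12, because the worst-case SSE of a single 64x64 CU no longer fits. The X265_DEPTH <= 10 pivot is exact, as a quick worked check shows:

#include <cstdint>
#include <cstdio>
#include <initializer_list>

int main()
{
    for (int depth : { 8, 10, 12 })
    {
        uint64_t maxDiff = (1ull << depth) - 1;           // largest per-pixel error
        uint64_t sse = maxDiff * maxDiff * 64ull * 64ull; // summed over a 64x64 CU
        printf("depth %2d: max SSE %12llu, fits in uint32_t: %s\n",
               depth, (unsigned long long)sse, sse <= UINT32_MAX ? "yes" : "no");
    }
}

At 10 bits the maximum is 4,286,582,784, just under UINT32_MAX; at 12 bits it is 68,685,926,400, which is why the distortion fields, calcRdCost() and scaleChromaDist() all move to sse_ret_t.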
x265_1.7.tar.gz/source/encoder/reference.cpp -> x265_1.8.tar.gz/source/encoder/reference.cpp
Changed
@@ -29,7 +29,7 @@ #include "reference.h" -using namespace x265; +using namespace X265_NS; MotionReference::MotionReference() {
x265_1.7.tar.gz/source/encoder/reference.h -> x265_1.8.tar.gz/source/encoder/reference.h
Changed
@@ -29,7 +29,7 @@ #include "lowres.h" #include "mv.h" -namespace x265 { +namespace X265_NS { // private x265 namespace struct WeightParam;
x265_1.7.tar.gz/source/encoder/sao.cpp -> x265_1.8.tar.gz/source/encoder/sao.cpp
Changed
@@ -42,15 +42,25 @@ return (x >> 31) | ((int)((((uint32_t)-x)) >> 31)); } +inline int signOf2(const int a, const int b) +{ + // NOTE: don't reorder below compare, both ICL, VC, GCC optimize strong depends on order! + int r = 0; + if (a < b) + r = -1; + if (a > b) + r = 1; + return r; +} + inline int64_t estSaoDist(int32_t count, int offset, int32_t offsetOrg) { return (count * offset - offsetOrg * 2) * offset; } - } // end anonymous namespace -namespace x265 { +namespace X265_NS { const uint32_t SAO::s_eoTable[NUM_EDGETYPE] = { @@ -213,14 +223,19 @@ frame->m_encData->m_saoParam = saoParam; } - rdoSaoUnitRowInit(saoParam); + saoParam->bSaoFlag[0] = true; + saoParam->bSaoFlag[1] = true; - // NOTE: Disable SAO automatic turn-off when frame parallelism is - // enabled for output exact independent of frame thread count - if (m_param->frameNumThreads > 1) + m_numNoSao[0] = 0; // Luma + m_numNoSao[1] = 0; // Chroma + + // NOTE: Allow SAO automatic turn-off only when frame parallelism is disabled. + if (m_param->frameNumThreads == 1) { - saoParam->bSaoFlag[0] = true; - saoParam->bSaoFlag[1] = true; + if (m_refDepth > 0 && m_depthSaoRate[0][m_refDepth - 1] > SAO_ENCODING_RATE) + saoParam->bSaoFlag[0] = false; + if (m_refDepth > 0 && m_depthSaoRate[1][m_refDepth - 1] > SAO_ENCODING_RATE_CHROMA) + saoParam->bSaoFlag[1] = false; } } @@ -656,7 +671,6 @@ /* Calculate SAO statistics for current CTU without non-crossing slice */ void SAO::calcSaoStatsCu(int addr, int plane) { - int x, y; const CUData* cu = m_frame->m_encData->getPicCTU(addr); const pixel* fenc0 = m_frame->m_fencPic->getPlaneAddr(plane, addr); const pixel* rec0 = m_frame->m_reconPic->getPlaneAddr(plane, addr); @@ -687,8 +701,6 @@ int startY; int endX; int endY; - int32_t* stats; - int32_t* count; int skipB = plane ? 2 : 4; int skipR = plane ? 3 : 5; @@ -698,34 +710,16 @@ // SAO_BO: { - const int boShift = X265_DEPTH - SAO_BO_BITS; - if (m_param->bSaoNonDeblocked) { skipB = plane ? 1 : 3; skipR = plane ? 2 : 4; } - stats = m_offsetOrg[plane][SAO_BO]; - count = m_count[plane][SAO_BO]; - - fenc = fenc0; - rec = rec0; endX = (rpelx == picWidth) ? ctuWidth : ctuWidth - skipR; endY = (bpely == picHeight) ? ctuHeight : ctuHeight - skipB; - for (y = 0; y < endY; y++) - { - for (x = 0; x < endX; x++) - { - int classIdx = 1 + (rec[x] >> boShift); - stats[classIdx] += (fenc[x] - rec[x]); - count[classIdx]++; - } - - fenc += stride; - rec += stride; - } + primitives.saoCuStatsBO(fenc0, rec0, stride, endX, endY, m_offsetOrg[plane][SAO_BO], m_count[plane][SAO_BO]); } { @@ -736,30 +730,11 @@ skipB = plane ? 1 : 3; skipR = plane ? 3 : 5; } - stats = m_offsetOrg[plane][SAO_EO_0]; - count = m_count[plane][SAO_EO_0]; - - fenc = fenc0; - rec = rec0; startX = !lpelx; endX = (rpelx == picWidth) ? ctuWidth - 1 : ctuWidth - skipR; - for (y = 0; y < ctuHeight - skipB; y++) - { - int signLeft = signOf(rec[startX] - rec[startX - 1]); - for (x = startX; x < endX; x++) - { - int signRight = signOf(rec[x] - rec[x + 1]); - int edgeType = signRight + signLeft + 2; - signLeft = -signRight; - - stats[s_eoTable[edgeType]] += (fenc[x] - rec[x]); - count[s_eoTable[edgeType]]++; - } - fenc += stride; - rec += stride; - } + primitives.saoCuStatsE0(fenc0 + startX, rec0 + startX, stride, endX - startX, ctuHeight - skipB, m_offsetOrg[plane][SAO_EO_0], m_count[plane][SAO_EO_0]); } // SAO_EO_1: // dir: | @@ -769,8 +744,6 @@ skipB = plane ? 2 : 4; skipR = plane ? 
2 : 4; } - stats = m_offsetOrg[plane][SAO_EO_1]; - count = m_count[plane][SAO_EO_1]; fenc = fenc0; rec = rec0; @@ -786,21 +759,7 @@ primitives.sign(upBuff1, rec, &rec[- stride], ctuWidth); - for (y = startY; y < endY; y++) - { - for (x = 0; x < endX; x++) - { - int8_t signDown = signOf(rec[x] - rec[x + stride]); - int edgeType = signDown + upBuff1[x] + 2; - upBuff1[x] = -signDown; - - stats[s_eoTable[edgeType]] += (fenc[x] - rec[x]); - count[s_eoTable[edgeType]]++; - } - - fenc += stride; - rec += stride; - } + primitives.saoCuStatsE1(fenc0 + startY * stride, rec0 + startY * stride, stride, upBuff1, endX, endY - startY, m_offsetOrg[plane][SAO_EO_1], m_count[plane][SAO_EO_1]); } // SAO_EO_2: // dir: 135 @@ -810,8 +769,6 @@ skipB = plane ? 2 : 4; skipR = plane ? 3 : 5; } - stats = m_offsetOrg[plane][SAO_EO_2]; - count = m_count[plane][SAO_EO_2]; fenc = fenc0; rec = rec0; @@ -829,23 +786,7 @@ primitives.sign(&upBuff1[startX], &rec[startX], &rec[startX - stride - 1], (endX - startX)); - for (y = startY; y < endY; y++) - { - upBufft[startX] = signOf(rec[startX + stride] - rec[startX - 1]); - for (x = startX; x < endX; x++) - { - int8_t signDown = signOf(rec[x] - rec[x + stride + 1]); - int edgeType = signDown + upBuff1[x] + 2; - upBufft[x + 1] = -signDown; - stats[s_eoTable[edgeType]] += (fenc[x] - rec[x]); - count[s_eoTable[edgeType]]++; - } - - std::swap(upBuff1, upBufft); - - rec += stride; - fenc += stride; - } + primitives.saoCuStatsE2(fenc0 + startX + startY * stride, rec0 + startX + startY * stride, stride, upBuff1 + startX, upBufft + startX, endX - startX, endY - startY, m_offsetOrg[plane][SAO_EO_2], m_count[plane][SAO_EO_2]); } // SAO_EO_3: // dir: 45 @@ -855,8 +796,6 @@ skipB = plane ? 2 : 4; skipR = plane ? 3 : 5; } - stats = m_offsetOrg[plane][SAO_EO_3]; - count = m_count[plane][SAO_EO_3]; fenc = fenc0; rec = rec0; @@ -875,22 +814,7 @@ primitives.sign(&upBuff1[startX - 1], &rec[startX - 1], &rec[startX - 1 - stride + 1], (endX - startX + 1)); - for (y = startY; y < endY; y++) - { - for (x = startX; x < endX; x++) - { - int8_t signDown = signOf(rec[x] - rec[x + stride - 1]); - int edgeType = signDown + upBuff1[x] + 2; - upBuff1[x - 1] = -signDown; - stats[s_eoTable[edgeType]] += (fenc[x] - rec[x]); - count[s_eoTable[edgeType]]++; - } - - upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]); - - rec += stride; - fenc += stride; - } + primitives.saoCuStatsE3(fenc0 + startX + startY * stride, rec0 + startX + startY * stride, stride, upBuff1 + startX, endX - startX, endY - startY, m_offsetOrg[plane][SAO_EO_3], m_count[plane][SAO_EO_3]); } } } @@ -1170,19 +1094,6 @@ memset(m_offsetOrg, 0, sizeof(PerClass) * NUM_PLANE); } -void SAO::rdoSaoUnitRowInit(SAOParam* saoParam) -{ - saoParam->bSaoFlag[0] = true; - saoParam->bSaoFlag[1] = true; - - m_numNoSao[0] = 0; // Luma - m_numNoSao[1] = 0; // Chroma - if (m_refDepth > 0 && m_depthSaoRate[0][m_refDepth - 1] > SAO_ENCODING_RATE) - saoParam->bSaoFlag[0] = false; - if (m_refDepth > 0 && m_depthSaoRate[1][m_refDepth - 1] > SAO_ENCODING_RATE_CHROMA) - saoParam->bSaoFlag[1] = false; -} - void SAO::rdoSaoUnitRowEnd(const SAOParam* saoParam, int numctus) { if (!saoParam->bSaoFlag[0]) @@ -1324,7 +1235,7 @@ } if (count) { - int offset = roundIBDI(offsetOrg, count << SAO_BIT_INC); + int offset = roundIBDI(offsetOrg << (X265_DEPTH - 8), count); offset = x265_clip3(-OFFSET_THRESH + 1, OFFSET_THRESH - 1, offset); if (typeIdx < SAO_BO) { @@ -1606,4 +1517,182 @@ } } } + +// NOTE: must put in namespace X265_NS since we need class SAO +void 
saoCuStatsBO_c(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) +{ + int x, y; + const int boShift = X265_DEPTH - SAO_BO_BITS; + + for (y = 0; y < endY; y++) + { + for (x = 0; x < endX; x++) + { + int classIdx = 1 + (rec[x] >> boShift); + stats[classIdx] += (fenc[x] - rec[x]); + count[classIdx]++; + } + + fenc += stride; + rec += stride; + } +} + +void saoCuStatsE0_c(const pixel *fenc, const pixel *rec, intptr_t stride, int endX, int endY, int32_t *stats, int32_t *count) +{ + int x, y; + int32_t tmp_stats[SAO::NUM_EDGETYPE]; + int32_t tmp_count[SAO::NUM_EDGETYPE]; + + memset(tmp_stats, 0, sizeof(tmp_stats)); + memset(tmp_count, 0, sizeof(tmp_count)); + + for (y = 0; y < endY; y++) + { + int signLeft = signOf(rec[0] - rec[-1]); + for (x = 0; x < endX; x++) + { + int signRight = signOf2(rec[x], rec[x + 1]); + X265_CHECK(signRight == signOf(rec[x] - rec[x + 1]), "signDown check failure\n"); + uint32_t edgeType = signRight + signLeft + 2; + signLeft = -signRight; + + X265_CHECK(edgeType <= 4, "edgeType check failure\n"); + tmp_stats[edgeType] += (fenc[x] - rec[x]); + tmp_count[edgeType]++; + } + + fenc += stride; + rec += stride; + } + + for (x = 0; x < SAO::NUM_EDGETYPE; x++) + { + stats[SAO::s_eoTable[x]] += tmp_stats[x]; + count[SAO::s_eoTable[x]] += tmp_count[x]; + } } + +void saoCuStatsE1_c(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count) +{ + X265_CHECK(endX <= MAX_CU_SIZE, "endX check failure\n"); + X265_CHECK(endY <= MAX_CU_SIZE, "endY check failure\n"); + + int x, y; + int32_t tmp_stats[SAO::NUM_EDGETYPE]; + int32_t tmp_count[SAO::NUM_EDGETYPE]; + + memset(tmp_stats, 0, sizeof(tmp_stats)); + memset(tmp_count, 0, sizeof(tmp_count)); + + for (y = 0; y < endY; y++) + { + for (x = 0; x < endX; x++) + { + int signDown = signOf2(rec[x], rec[x + stride]); + X265_CHECK(signDown == signOf(rec[x] - rec[x + stride]), "signDown check failure\n"); + uint32_t edgeType = signDown + upBuff1[x] + 2; + upBuff1[x] = (int8_t)(-signDown); + + tmp_stats[edgeType] += (fenc[x] - rec[x]); + tmp_count[edgeType]++; + } + fenc += stride; + rec += stride; + } + + for (x = 0; x < SAO::NUM_EDGETYPE; x++) + { + stats[SAO::s_eoTable[x]] += tmp_stats[x]; + count[SAO::s_eoTable[x]] += tmp_count[x]; + } +} + +void saoCuStatsE2_c(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int8_t *upBufft, int endX, int endY, int32_t *stats, int32_t *count) +{ + X265_CHECK(endX < MAX_CU_SIZE, "endX check failure\n"); + X265_CHECK(endY < MAX_CU_SIZE, "endY check failure\n"); + + int x, y; + int32_t tmp_stats[SAO::NUM_EDGETYPE]; + int32_t tmp_count[SAO::NUM_EDGETYPE]; + + memset(tmp_stats, 0, sizeof(tmp_stats)); + memset(tmp_count, 0, sizeof(tmp_count)); + + for (y = 0; y < endY; y++) + { + upBufft[0] = signOf(rec[stride] - rec[-1]); + for (x = 0; x < endX; x++) + { + int signDown = signOf2(rec[x], rec[x + stride + 1]); + X265_CHECK(signDown == signOf(rec[x] - rec[x + stride + 1]), "signDown check failure\n"); + uint32_t edgeType = signDown + upBuff1[x] + 2; + upBufft[x + 1] = (int8_t)(-signDown); + tmp_stats[edgeType] += (fenc[x] - rec[x]); + tmp_count[edgeType]++; + } + + std::swap(upBuff1, upBufft); + + rec += stride; + fenc += stride; + } + + for (x = 0; x < SAO::NUM_EDGETYPE; x++) + { + stats[SAO::s_eoTable[x]] += tmp_stats[x]; + count[SAO::s_eoTable[x]] += tmp_count[x]; + } +} + +void saoCuStatsE3_c(const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, 
int endY, int32_t *stats, int32_t *count) +{ + X265_CHECK(endX < MAX_CU_SIZE, "endX check failure\n"); + X265_CHECK(endY < MAX_CU_SIZE, "endY check failure\n"); + + int x, y; + int32_t tmp_stats[SAO::NUM_EDGETYPE]; + int32_t tmp_count[SAO::NUM_EDGETYPE]; + + memset(tmp_stats, 0, sizeof(tmp_stats)); + memset(tmp_count, 0, sizeof(tmp_count)); + + for (y = 0; y < endY; y++) + { + for (x = 0; x < endX; x++) + { + int signDown = signOf2(rec[x], rec[x + stride - 1]); + X265_CHECK(signDown == signOf(rec[x] - rec[x + stride - 1]), "signDown check failure\n"); + X265_CHECK(abs(upBuff1[x]) <= 1, "upBuffer1 check failure\n"); + + uint32_t edgeType = signDown + upBuff1[x] + 2; + upBuff1[x - 1] = (int8_t)(-signDown); + tmp_stats[edgeType] += (fenc[x] - rec[x]); + tmp_count[edgeType]++; + } + + upBuff1[endX - 1] = signOf(rec[endX - 1 + stride] - rec[endX]); + + rec += stride; + fenc += stride; + } + + for (x = 0; x < SAO::NUM_EDGETYPE; x++) + { + stats[SAO::s_eoTable[x]] += tmp_stats[x]; + count[SAO::s_eoTable[x]] += tmp_count[x]; + } +} + +void setupSaoPrimitives_c(EncoderPrimitives &p) +{ + // TODO: move other sao functions to here + p.saoCuStatsBO = saoCuStatsBO_c; + p.saoCuStatsE0 = saoCuStatsE0_c; + p.saoCuStatsE1 = saoCuStatsE1_c; + p.saoCuStatsE2 = saoCuStatsE2_c; + p.saoCuStatsE3 = saoCuStatsE3_c; +} +} +
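The large sao.cpp rewrite above hoists the per-CTU statistics loops out of calcSaoStatsCu() into standalone functions (saoCuStatsBO_c, saoCuStatsE0_c through saoCuStatsE3_c) registered through EncoderPrimitives, so SIMD implementations can replace them; the scalar versions also accumulate into small local tmp_stats/tmp_count arrays before scattering through s_eoTable, keeping the hot loop compact. The edge loops switch from the bit-twiddled signOf(a - b) to the compare-based signOf2(a, b); the in-loop X265_CHECKs assert the two agree, and an exhaustive standalone check over 8-bit pixel pairs confirms it:

#include <cassert>
#include <cstdint>

inline int signOf(int x)         { return (x >> 31) | ((int)(((uint32_t)-x) >> 31)); }
inline int signOf2(int a, int b) { int r = 0; if (a < b) r = -1; if (a > b) r = 1; return r; }

int main()
{
    // The same holds at any bit depth where a - b cannot overflow an int.
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            assert(signOf2(a, b) == signOf(a - b));
}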
x265_1.7.tar.gz/source/encoder/sao.h -> x265_1.8.tar.gz/source/encoder/sao.h
Changed
@@ -30,7 +30,7 @@ #include "frame.h" #include "entropy.h" -namespace x265 { +namespace X265_NS { // private namespace enum SAOTypeLen @@ -52,12 +52,12 @@ class SAO { -protected: +public: enum { SAO_MAX_DEPTH = 4 }; enum { SAO_BO_BITS = 5 }; enum { MAX_NUM_SAO_CLASS = 33 }; - enum { SAO_BIT_INC = X265_MAX(X265_DEPTH - 10, 0) }; + enum { SAO_BIT_INC = 0 }; /* in HM12.0, it wrote as X265_MAX(X265_DEPTH - 10, 0) */ enum { OFFSET_THRESH = 1 << X265_MIN(X265_DEPTH - 5, 5) }; enum { NUM_EDGETYPE = 5 }; enum { NUM_PLANE = 3 }; @@ -68,6 +68,8 @@ typedef int32_t (PerClass[MAX_NUM_SAO_TYPE][MAX_NUM_SAO_CLASS]); typedef int32_t (PerPlane[NUM_PLANE][MAX_NUM_SAO_TYPE][MAX_NUM_SAO_CLASS]); +protected: + /* allocated per part */ PerClass* m_count; PerClass* m_offset; @@ -142,7 +144,6 @@ int32_t* currentDistortionTableBo, double* currentRdCostTableBo); inline int64_t estSaoTypeDist(int plane, int typeIdx, double lambda, int32_t* currentDistortionTableBo, double* currentRdCostTableBo); - void rdoSaoUnitRowInit(SAOParam* saoParam); void rdoSaoUnitRowEnd(const SAOParam* saoParam, int numctus); void rdoSaoUnitRow(SAOParam* saoParam, int idxY); };
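Note the semantic change hiding in the sao.h enum: SAO_BIT_INC is pinned to 0 (the HM-derived X265_MAX(X265_DEPTH - 10, 0) survives only as a comment), and the matching sao.cpp hunk now scales the accumulated difference up to the coding bit depth before the rounded divide, offset = roundIBDI(offsetOrg << (X265_DEPTH - 8), count), instead of scaling the count. A sketch of the resulting offset derivation; roundIBDI is assumed here to be a rounded integer division, which is what the call site implies:

#include <algorithm>
#include <cstdio>

static int roundIBDI(int num, int den) { return (num + den / 2) / den; } // assumption

int main()
{
    const int depth = 10;                        // X265_DEPTH
    int offsetOrg = 37, count = 9;               // hypothetical accumulators
    int offset = roundIBDI(offsetOrg << (depth - 8), count);
    int thresh = 1 << std::min(depth - 5, 5);    // OFFSET_THRESH
    offset = std::min(std::max(offset, -thresh + 1), thresh - 1);
    printf("offset = %d (clamped to +/-%d)\n", offset, thresh - 1);
}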
x265_1.7.tar.gz/source/encoder/search.cpp -> x265_1.8.tar.gz/source/encoder/search.cpp
Changed
@@ -33,7 +33,7 @@ #include "analysis.h" // TLD #include "framedata.h" -using namespace x265; +using namespace X265_NS; #if _MSC_VER #pragma warning(disable: 4800) // 'uint8_t' : forcing value to bool 'true' or 'false' (performance warning) @@ -319,7 +319,7 @@ uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffY, log2TrSize, TEXT_LUMA, absPartIdx, false); if (numSig) { - m_quant.invtransformNxN(residual, stride, coeffY, log2TrSize, TEXT_LUMA, true, false, numSig); + m_quant.invtransformNxN(cu, residual, stride, coeffY, log2TrSize, TEXT_LUMA, true, false, numSig); primitives.cu[sizeIdx].add_ps(reconQt, reconQtStride, pred, residual, stride, stride); } else @@ -517,7 +517,7 @@ uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeff, log2TrSize, TEXT_LUMA, absPartIdx, useTSkip); if (numSig) { - m_quant.invtransformNxN(residual, stride, coeff, log2TrSize, TEXT_LUMA, true, useTSkip, numSig); + m_quant.invtransformNxN(cu, residual, stride, coeff, log2TrSize, TEXT_LUMA, true, useTSkip, numSig); primitives.cu[sizeIdx].add_ps(tmpRecon, tmpReconStride, pred, residual, stride, stride); } else if (useTSkip) @@ -530,7 +530,7 @@ // no residual coded, recon = pred primitives.cu[sizeIdx].copy_pp(tmpRecon, tmpReconStride, pred, stride); - uint32_t tmpDist = primitives.cu[sizeIdx].sse_pp(tmpRecon, tmpReconStride, fenc, stride); + sse_ret_t tmpDist = primitives.cu[sizeIdx].sse_pp(tmpRecon, tmpReconStride, fenc, stride); cu.setTransformSkipSubParts(useTSkip, TEXT_LUMA, absPartIdx, fullDepth); cu.setCbfSubParts((!!numSig) << tuDepth, TEXT_LUMA, absPartIdx, fullDepth); @@ -667,7 +667,7 @@ uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffY, log2TrSize, TEXT_LUMA, absPartIdx, false); if (numSig) { - m_quant.invtransformNxN(residual, stride, coeffY, log2TrSize, TEXT_LUMA, true, false, numSig); + m_quant.invtransformNxN(cu, residual, stride, coeffY, log2TrSize, TEXT_LUMA, true, false, numSig); primitives.cu[sizeIdx].add_ps(picReconY, picStride, pred, residual, stride, stride); cu.setCbfSubParts(1 << tuDepth, TEXT_LUMA, absPartIdx, fullDepth); } @@ -797,7 +797,7 @@ uint32_t qtLayer = log2TrSize - 2; uint32_t stride = mode.fencYuv->m_csize; const uint32_t sizeIdxC = log2TrSizeC - 2; - uint32_t outDist = 0; + sse_ret_t outDist = 0; uint32_t curPartNum = cuGeom.numPartitions >> tuDepthC * 2; const SplitType splitType = (m_csp == X265_CSP_I422) ? 
VERTICAL_SPLIT : DONT_SPLIT; @@ -841,7 +841,7 @@ uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffC, log2TrSizeC, ttype, absPartIdxC, false); if (numSig) { - m_quant.invtransformNxN(residual, stride, coeffC, log2TrSizeC, ttype, true, false, numSig); + m_quant.invtransformNxN(cu, residual, stride, coeffC, log2TrSizeC, ttype, true, false, numSig); primitives.cu[sizeIdxC].add_ps(reconQt, reconQtStride, pred, residual, stride, stride); cu.setCbfPartRange(1 << tuDepth, ttype, absPartIdxC, tuIterator.absPartIdxStep); } @@ -942,7 +942,7 @@ uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeff, log2TrSizeC, ttype, absPartIdxC, useTSkip); if (numSig) { - m_quant.invtransformNxN(residual, stride, coeff, log2TrSizeC, ttype, true, useTSkip, numSig); + m_quant.invtransformNxN(cu, residual, stride, coeff, log2TrSizeC, ttype, true, useTSkip, numSig); primitives.cu[sizeIdxC].add_ps(recon, reconStride, pred, residual, stride, stride); cu.setCbfPartRange(1 << tuDepth, ttype, absPartIdxC, tuIterator.absPartIdxStep); } @@ -956,7 +956,7 @@ primitives.cu[sizeIdxC].copy_pp(recon, reconStride, pred, stride); cu.setCbfPartRange(0, ttype, absPartIdxC, tuIterator.absPartIdxStep); } - uint32_t tmpDist = primitives.cu[sizeIdxC].sse_pp(recon, reconStride, fenc, stride); + sse_ret_t tmpDist = primitives.cu[sizeIdxC].sse_pp(recon, reconStride, fenc, stride); tmpDist = m_rdCost.scaleChromaDist(chromaId, tmpDist); cu.setTransformSkipPartRange(useTSkip, ttype, absPartIdxC, tuIterator.absPartIdxStep); @@ -1129,7 +1129,7 @@ uint32_t numSig = m_quant.transformNxN(cu, fenc, stride, residual, stride, coeffC, log2TrSizeC, ttype, absPartIdxC, false); if (numSig) { - m_quant.invtransformNxN(residual, stride, coeffC, log2TrSizeC, ttype, true, false, numSig); + m_quant.invtransformNxN(cu, residual, stride, coeffC, log2TrSizeC, ttype, true, false, numSig); primitives.cu[sizeIdxC].add_ps(picReconC, picStride, pred, residual, stride, stride); cu.setCbfPartRange(1 << tuDepth, ttype, absPartIdxC, tuIterator.absPartIdxStep); } @@ -1156,14 +1156,14 @@ cu.setPartSizeSubParts(partSize); cu.setPredModeSubParts(MODE_INTRA); - m_quant.m_tqBypass = !!cu.m_tqBypass[0]; uint32_t tuDepthRange[2]; cu.getIntraTUQtDepthRange(tuDepthRange, 0); intraMode.initCosts(); - intraMode.distortion += estIntraPredQT(intraMode, cuGeom, tuDepthRange, sharedModes); - intraMode.distortion += estIntraPredChromaQT(intraMode, cuGeom, sharedChromaModes); + intraMode.lumaDistortion += estIntraPredQT(intraMode, cuGeom, tuDepthRange, sharedModes); + intraMode.chromaDistortion += estIntraPredChromaQT(intraMode, cuGeom, sharedChromaModes); + intraMode.distortion += intraMode.lumaDistortion + intraMode.chromaDistortion; m_entropyCoder.resetBits(); if (m_slice->m_pps->bTransquantBypassEnabled) @@ -1378,8 +1378,9 @@ codeIntraLumaQT(intraMode, cuGeom, 0, 0, false, icosts, tuDepthRange); extractIntraResultQT(cu, *reconYuv, 0, 0); - intraMode.distortion = icosts.distortion; - intraMode.distortion += estIntraPredChromaQT(intraMode, cuGeom, NULL); + intraMode.lumaDistortion = icosts.distortion; + intraMode.chromaDistortion = estIntraPredChromaQT(intraMode, cuGeom, NULL); + intraMode.distortion = intraMode.lumaDistortion + intraMode.chromaDistortion; m_entropyCoder.resetBits(); if (m_slice->m_pps->bTransquantBypassEnabled) @@ -1861,6 +1862,29 @@ return outCost; } +/* find the lowres motion vector from lookahead in middle of current PU */ +MV Search::getLowresMV(const CUData& cu, const PredictionUnit& pu, int list, int ref) +{ + 
int diffPoc = abs(m_slice->m_poc - m_slice->m_refPicList[list][ref]->m_poc); + if (diffPoc > m_param->bframes + 1) + /* poc difference is out of range for lookahead */ + return 0; + + MV* mvs = m_frame->m_lowres.lowresMvs[list][diffPoc - 1]; + if (mvs[0].x == 0x7FFF) + /* this motion search was not estimated by lookahead */ + return 0; + + uint32_t block_x = (cu.m_cuPelX + g_zscanToPelX[pu.puAbsPartIdx] + pu.width / 2) >> 4; + uint32_t block_y = (cu.m_cuPelY + g_zscanToPelY[pu.puAbsPartIdx] + pu.height / 2) >> 4; + uint32_t idx = block_y * m_frame->m_lowres.maxBlocksInRow + block_x; + + X265_CHECK(block_x < m_frame->m_lowres.maxBlocksInRow, "block_x is too high\n"); + X265_CHECK(block_y < m_frame->m_lowres.maxBlocksInCol, "block_y is too high\n"); + + return mvs[idx] << 1; /* scale up lowres mv */ +} + /* Pick between the two AMVP candidates which is the best one to use as * MVP for the motion search, based on SAD cost */ int Search::selectMVP(const CUData& cu, const PredictionUnit& pu, const MV amvp[AMVP_NUM_CANDS], int list, int ref) @@ -1929,10 +1953,16 @@ /* Perform ME, repeat until no more work is available */ do { - if (meId < m_slice->m_numRefIdx[0]) - slave.singleMotionEstimation(*this, pme.mode, pme.pu, pme.puIdx, 0, meId); + if (meId < pme.m_jobs.refCnt[0]) + { + int refIdx = pme.m_jobs.ref[0][meId]; //L0 + slave.singleMotionEstimation(*this, pme.mode, pme.pu, pme.puIdx, 0, refIdx); + } else - slave.singleMotionEstimation(*this, pme.mode, pme.pu, pme.puIdx, 1, meId - m_slice->m_numRefIdx[0]); + { + int refIdx = pme.m_jobs.ref[1][meId - pme.m_jobs.refCnt[0]]; //L1 + slave.singleMotionEstimation(*this, pme.mode, pme.pu, pme.puIdx, 1, refIdx); + } meId = -1; pme.m_lock.acquire(); @@ -1950,13 +1980,18 @@ MotionData* bestME = interMode.bestME[part]; - MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 1]; + // 12 mv candidates including lowresMV + MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 2]; int numMvc = interMode.cu.getPMV(interMode.interNeighbours, list, ref, interMode.amvpCand[list][ref], mvc); const MV* amvp = interMode.amvpCand[list][ref]; int mvpIdx = selectMVP(interMode.cu, pu, amvp, list, ref); MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx]; + MV lmv = getLowresMV(interMode.cu, pu, list, ref); + if (lmv.notZero()) + mvc[numMvc++] = lmv; + setSearchRange(interMode.cu, mvp, m_param->searchRange, mvmin, mvmax); int satdCost = m_me.motionEstimate(&m_slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv); @@ -1983,23 +2018,22 @@ } /* find the best inter prediction for each PU of specified mode */ -void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC) +void Search::predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC, uint32_t refMasks[2]) { ProfileCUScope(interMode.cu, motionEstimationElapsedTime, countMotionEstimate); CUData& cu = interMode.cu; Yuv* predYuv = &interMode.predYuv; - MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 1]; + // 12 mv candidates including lowresMV + MV mvc[(MD_ABOVE_LEFT + 1) * 2 + 2]; const Slice *slice = m_slice; - int numPart = cu.getNumPartInter(); + int numPart = cu.getNumPartInter(0); int numPredDir = slice->isInterP() ? 
1 : 2; const int* numRefIdx = slice->m_numRefIdx; uint32_t lastMode = 0; int totalmebits = 0; - int numME = numRefIdx[0] + numRefIdx[1]; - bool bTryDistributed = m_param->bDistributeMotionEstimation && numME > 2; MV mvzero(0, 0); Yuv& tmpPredYuv = m_rqt[cuGeom.depth].tmpPredYuv; @@ -2039,6 +2073,10 @@ int mvpIdx = selectMVP(cu, pu, amvp, list, ref); MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx]; + MV lmv = getLowresMV(cu, pu, list, ref); + if (lmv.notZero()) + mvc[numMvc++] = lmv; + setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax); int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv); @@ -2060,17 +2098,38 @@ } bDoUnidir = false; } - else if (bTryDistributed) + else if (m_param->bDistributeMotionEstimation) { PME pme(*this, interMode, cuGeom, pu, puIdx); - pme.m_jobTotal = numME; - pme.m_jobAcquired = 1; /* reserve L0-0 */ + pme.m_jobTotal = 0; + pme.m_jobAcquired = 1; /* reserve L0-0 or L1-0 */ - if (pme.tryBondPeers(*m_frame->m_encData->m_jobProvider, numME - 1)) + uint32_t refMask = refMasks[puIdx] ? refMasks[puIdx] : (uint32_t)-1; + for (int list = 0; list < numPredDir; list++) + { + int idx = 0; + for (int ref = 0; ref < numRefIdx[list]; ref++) + { + if (!(refMask & (1 << ref))) + continue; + + pme.m_jobs.ref[list][idx++] = ref; + pme.m_jobTotal++; + } + pme.m_jobs.refCnt[list] = idx; + + /* the second list ref bits start at bit 16 */ + refMask >>= 16; + } + + if (pme.m_jobTotal > 2) { + pme.tryBondPeers(*m_frame->m_encData->m_jobProvider, pme.m_jobTotal - 1); + processPME(pme, *this); - singleMotionEstimation(*this, interMode, pu, puIdx, 0, 0); /* L0-0 */ + int ref = pme.m_jobs.refCnt[0] ? pme.m_jobs.ref[0][0] : pme.m_jobs.ref[1][0]; + singleMotionEstimation(*this, interMode, pu, puIdx, 0, ref); /* L0-0 or L1-0 */ bDoUnidir = false; @@ -2083,10 +2142,20 @@ } if (bDoUnidir) { + uint32_t refMask = refMasks[puIdx] ? 
refMasks[puIdx] : (uint32_t)-1; + for (int list = 0; list < numPredDir; list++) { for (int ref = 0; ref < numRefIdx[list]; ref++) { + ProfileCounter(interMode.cu, totalMotionReferences[cuGeom.depth]); + + if (!(refMask & (1 << ref))) + { + ProfileCounter(interMode.cu, skippedMotionReferences[cuGeom.depth]); + continue; + } + uint32_t bits = m_listSelBits[list] + MVP_IDX_BITS; bits += getTUBits(ref, numRefIdx[list]); @@ -2096,6 +2165,10 @@ int mvpIdx = selectMVP(cu, pu, amvp, list, ref); MV mvmin, mvmax, outmv, mvp = amvp[mvpIdx]; + MV lmv = getLowresMV(cu, pu, list, ref); + if (lmv.notZero()) + mvc[numMvc++] = lmv; + setSearchRange(cu, mvp, m_param->searchRange, mvmin, mvmax); int satdCost = m_me.motionEstimate(&slice->m_mref[list][ref], mvmin, mvmax, mvp, numMvc, mvc, m_param->searchRange, outmv); @@ -2116,6 +2189,8 @@ bestME[list].bits = bits; } } + /* the second list ref bits start at bit 16 */ + refMask >>= 16; } } @@ -2411,10 +2486,11 @@ // Luma int part = partitionFromLog2Size(cu.m_log2CUSize[0]); - interMode.distortion = primitives.cu[part].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); + interMode.lumaDistortion = primitives.cu[part].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); // Chroma - interMode.distortion += m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[part].sse_pp(fencYuv->m_buf[1], fencYuv->m_csize, reconYuv->m_buf[1], reconYuv->m_csize)); - interMode.distortion += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[part].sse_pp(fencYuv->m_buf[2], fencYuv->m_csize, reconYuv->m_buf[2], reconYuv->m_csize)); + interMode.chromaDistortion = m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[part].sse_pp(fencYuv->m_buf[1], fencYuv->m_csize, reconYuv->m_buf[1], reconYuv->m_csize)); + interMode.chromaDistortion += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[part].sse_pp(fencYuv->m_buf[2], fencYuv->m_csize, reconYuv->m_buf[2], reconYuv->m_csize)); + interMode.distortion = interMode.lumaDistortion + interMode.chromaDistortion; m_entropyCoder.load(m_rqt[depth].cur); m_entropyCoder.resetBits(); @@ -2464,7 +2540,7 @@ uint32_t tqBypass = cu.m_tqBypass[0]; if (!tqBypass) { - uint32_t cbf0Dist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size); + sse_ret_t cbf0Dist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, predYuv->m_buf[0], predYuv->m_size); cbf0Dist += m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[1], predYuv->m_csize, predYuv->m_buf[1], predYuv->m_csize)); cbf0Dist += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[2], predYuv->m_csize, predYuv->m_buf[2], predYuv->m_csize)); @@ -2535,14 +2611,16 @@ reconYuv->copyFromYuv(*predYuv); // update with clipped distortion and cost (qp estimation loop uses unclipped values) - uint32_t bestDist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); - bestDist += m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[1], fencYuv->m_csize, reconYuv->m_buf[1], reconYuv->m_csize)); - bestDist += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[2], fencYuv->m_csize, reconYuv->m_buf[2], reconYuv->m_csize)); + sse_ret_t bestLumaDist = primitives.cu[sizeIdx].sse_pp(fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); + sse_ret_t bestChromaDist = 
m_rdCost.scaleChromaDist(1, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[1], fencYuv->m_csize, reconYuv->m_buf[1], reconYuv->m_csize)); + bestChromaDist += m_rdCost.scaleChromaDist(2, primitives.chroma[m_csp].cu[sizeIdx].sse_pp(fencYuv->m_buf[2], fencYuv->m_csize, reconYuv->m_buf[2], reconYuv->m_csize)); if (m_rdCost.m_psyRd) interMode.psyEnergy = m_rdCost.psyCost(sizeIdx, fencYuv->m_buf[0], fencYuv->m_size, reconYuv->m_buf[0], reconYuv->m_size); interMode.totalBits = bits; - interMode.distortion = bestDist; + interMode.lumaDistortion = bestLumaDist; + interMode.chromaDistortion = bestChromaDist; + interMode.distortion = bestLumaDist + bestChromaDist; interMode.coeffBits = coeffBits; interMode.mvBits = bits - coeffBits; updateModeCost(interMode); @@ -2595,7 +2673,7 @@ if (numSigY) { - m_quant.invtransformNxN(curResiY, strideResiY, coeffCurY, log2TrSize, TEXT_LUMA, false, false, numSigY); + m_quant.invtransformNxN(cu, curResiY, strideResiY, coeffCurY, log2TrSize, TEXT_LUMA, false, false, numSigY); cu.setCbfSubParts(setCbf, TEXT_LUMA, absPartIdx, depth); } else @@ -2628,7 +2706,7 @@ uint32_t numSigU = m_quant.transformNxN(cu, fencCb, fencYuv->m_csize, curResiU, strideResiC, coeffCurU + subTUOffset, log2TrSizeC, TEXT_CHROMA_U, absPartIdxC, false); if (numSigU) { - m_quant.invtransformNxN(curResiU, strideResiC, coeffCurU + subTUOffset, log2TrSizeC, TEXT_CHROMA_U, false, false, numSigU); + m_quant.invtransformNxN(cu, curResiU, strideResiC, coeffCurU + subTUOffset, log2TrSizeC, TEXT_CHROMA_U, false, false, numSigU); cu.setCbfPartRange(setCbf, TEXT_CHROMA_U, absPartIdxC, tuIterator.absPartIdxStep); } else @@ -2642,7 +2720,7 @@ uint32_t numSigV = m_quant.transformNxN(cu, fencCr, fencYuv->m_csize, curResiV, strideResiC, coeffCurV + subTUOffset, log2TrSizeC, TEXT_CHROMA_V, absPartIdxC, false); if (numSigV) { - m_quant.invtransformNxN(curResiV, strideResiC, coeffCurV + subTUOffset, log2TrSizeC, TEXT_CHROMA_V, false, false, numSigV); + m_quant.invtransformNxN(cu, curResiV, strideResiC, coeffCurV + subTUOffset, log2TrSizeC, TEXT_CHROMA_V, false, false, numSigV); cu.setCbfPartRange(setCbf, TEXT_CHROMA_V, absPartIdxC, tuIterator.absPartIdxStep); } else @@ -2788,7 +2866,7 @@ if (cbfFlag[TEXT_LUMA][0]) { - m_quant.invtransformNxN(curResiY, strideResiY, coeffCurY, log2TrSize, TEXT_LUMA, false, false, numSig[TEXT_LUMA][0]); //this is for inter mode only + m_quant.invtransformNxN(cu, curResiY, strideResiY, coeffCurY, log2TrSize, TEXT_LUMA, false, false, numSig[TEXT_LUMA][0]); //this is for inter mode only // non-zero cost calculation for luma - This is an approximation // finally we have to encode correct cbf after comparing with null cost @@ -2885,7 +2963,7 @@ if (cbfFlag[chromaId][tuIterator.section]) { - m_quant.invtransformNxN(curResiC, strideResiC, coeffCurC + subTUOffset, + m_quant.invtransformNxN(cu, curResiC, strideResiC, coeffCurC + subTUOffset, log2TrSizeC, (TextType)chromaId, false, false, numSig[chromaId][tuIterator.section]); // non-zero cost calculation for luma, same as luma - This is an approximation @@ -2974,7 +3052,7 @@ m_entropyCoder.codeCoeffNxN(cu, m_tsCoeff, absPartIdx, log2TrSize, TEXT_LUMA); const uint32_t skipSingleBitsY = m_entropyCoder.getNumberOfWrittenBits(); - m_quant.invtransformNxN(m_tsResidual, trSize, m_tsCoeff, log2TrSize, TEXT_LUMA, false, true, numSigTSkipY); + m_quant.invtransformNxN(cu, m_tsResidual, trSize, m_tsCoeff, log2TrSize, TEXT_LUMA, false, true, numSigTSkipY); nonZeroDistY = primitives.cu[partSize].sse_ss(resiYuv.getLumaAddr(absPartIdx), 
resiYuv.m_size, m_tsResidual, trSize); @@ -3042,7 +3120,7 @@ m_entropyCoder.codeCoeffNxN(cu, m_tsCoeff, absPartIdxC, log2TrSizeC, (TextType)chromaId); singleBits[chromaId][tuIterator.section] = m_entropyCoder.getNumberOfWrittenBits(); - m_quant.invtransformNxN(m_tsResidual, trSizeC, m_tsCoeff, + m_quant.invtransformNxN(cu, m_tsResidual, trSizeC, m_tsCoeff, log2TrSizeC, (TextType)chromaId, false, true, numSigTSkipC); uint32_t dist = primitives.cu[partSizeC].sse_ss(resiYuv.getChromaAddr(chromaId, absPartIdxC), resiYuv.m_csize, m_tsResidual, trSizeC); nonZeroDistC = m_rdCost.scaleChromaDist(chromaId, dist); @@ -3382,7 +3460,7 @@ else if (m_param->rdLevel <= 1) { mode.sa8dBits++; - mode.sa8dCost = m_rdCost.calcRdSADCost(mode.distortion, mode.sa8dBits); + mode.sa8dCost = m_rdCost.calcRdSADCost((uint32_t)mode.distortion, mode.sa8dBits); } else { @@ -3427,7 +3505,7 @@ else if (m_param->rdLevel <= 1) { mode.sa8dBits++; - mode.sa8dCost = m_rdCost.calcRdSADCost(mode.distortion, mode.sa8dBits); + mode.sa8dCost = m_rdCost.calcRdSADCost((uint32_t)mode.distortion, mode.sa8dBits); } else {
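Two threads run through the search.cpp diff: reference masking, where refMasks[puIdx] feeds both the distributed-ME job table and the unidirectional loop so masked-out references are skipped (with matching skipped/total profiling counters), and lowres MV seeding, where getLowresMV() fetches the lookahead's half-resolution motion vector for the 16x16 block under the PU centre, rejects the 0x7FFF "not estimated" sentinel, and scales it by <<1 into the candidate list (whose capacity grows by one to 12). A toy version of the centre-to-block mapping, with hypothetical geometry:

#include <cstdint>
#include <cstdio>

int main()
{
    uint32_t cuPelX = 128, cuPelY = 64;  // CTU pixel origin (hypothetical)
    uint32_t puOffX = 16,  puOffY = 0;   // PU offset inside the CTU
    uint32_t puW = 32,     puH = 32;
    uint32_t maxBlocksInRow = 120;       // e.g. a 1920-pixel-wide frame: 1920 / 16

    uint32_t bx = (cuPelX + puOffX + puW / 2) >> 4;  // PU centre -> block column
    uint32_t by = (cuPelY + puOffY + puH / 2) >> 4;  // PU centre -> block row
    printf("lowresMvs index = %u\n", by * maxBlocksInRow + bx);
}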
x265_1.7.tar.gz/source/encoder/search.h -> x265_1.8.tar.gz/source/encoder/search.h
Changed
@@ -48,7 +48,7 @@ #define ProfileCounter(cu, count) #endif -namespace x265 { +namespace X265_NS { // private namespace class Entropy; @@ -109,7 +109,9 @@ uint64_t sa8dCost; // sum of partition sa8d distortion costs (sa8d(fenc, pred) + lambda * bits) uint32_t sa8dBits; // signal bits used in sa8dCost calculation uint32_t psyEnergy; // sum of partition psycho-visual energy difference - uint32_t distortion; // sum of partition SSE distortion + sse_ret_t lumaDistortion; + sse_ret_t chromaDistortion; + sse_ret_t distortion; // sum of partition SSE distortion uint32_t totalBits; // sum of partition bits (mv + coeff) uint32_t mvBits; // Mv bits + Ref + block type (or intra mode) uint32_t coeffBits; // Texture bits (DCT Coeffs) @@ -120,6 +122,8 @@ sa8dCost = 0; sa8dBits = 0; psyEnergy = 0; + lumaDistortion = 0; + chromaDistortion = 0; distortion = 0; totalBits = 0; mvBits = 0; @@ -133,7 +137,15 @@ sa8dCost = UINT64_MAX / 2; sa8dBits = MAX_UINT / 2; psyEnergy = MAX_UINT / 2; +#if X265_DEPTH <= 10 + lumaDistortion = MAX_UINT / 2; + chromaDistortion = MAX_UINT / 2; distortion = MAX_UINT / 2; +#else + lumaDistortion = UINT64_MAX / 2; + chromaDistortion = UINT64_MAX / 2; + distortion = UINT64_MAX / 2; +#endif totalBits = MAX_UINT / 2; mvBits = MAX_UINT / 2; coeffBits = MAX_UINT / 2; @@ -141,14 +153,29 @@ bool ok() const { +#if X265_DEPTH <= 10 + return !(rdCost >= UINT64_MAX / 2 || + sa8dCost >= UINT64_MAX / 2 || + sa8dBits >= MAX_UINT / 2 || + psyEnergy >= MAX_UINT / 2 || + lumaDistortion >= MAX_UINT / 2 || + chromaDistortion >= MAX_UINT / 2 || + distortion >= MAX_UINT / 2 || + totalBits >= MAX_UINT / 2 || + mvBits >= MAX_UINT / 2 || + coeffBits >= MAX_UINT / 2); +#else return !(rdCost >= UINT64_MAX / 2 || sa8dCost >= UINT64_MAX / 2 || sa8dBits >= MAX_UINT / 2 || psyEnergy >= MAX_UINT / 2 || - distortion >= MAX_UINT / 2 || + lumaDistortion >= UINT64_MAX / 2 || + chromaDistortion >= UINT64_MAX / 2 || + distortion >= UINT64_MAX / 2 || totalBits >= MAX_UINT / 2 || mvBits >= MAX_UINT / 2 || coeffBits >= MAX_UINT / 2); +#endif } void addSubCosts(const Mode& subMode) @@ -159,6 +186,8 @@ sa8dCost += subMode.sa8dCost; sa8dBits += subMode.sa8dBits; psyEnergy += subMode.psyEnergy; + lumaDistortion += subMode.lumaDistortion; + chromaDistortion += subMode.chromaDistortion; distortion += subMode.distortion; totalBits += subMode.totalBits; mvBits += subMode.mvBits; @@ -186,6 +215,11 @@ int64_t weightAnalyzeTime; // elapsed worker time analyzing reference weights int64_t totalCTUTime; // elapsed worker time in compressCTU (includes pmode master) + uint32_t skippedMotionReferences[NUM_CU_DEPTH]; + uint32_t totalMotionReferences[NUM_CU_DEPTH]; + uint32_t skippedIntraCU[NUM_CU_DEPTH]; + uint32_t totalIntraCU[NUM_CU_DEPTH]; + uint64_t countIntraRDO[NUM_CU_DEPTH]; uint64_t countInterRDO[NUM_CU_DEPTH]; uint64_t countIntraAnalysis; @@ -213,6 +247,10 @@ interRDOElapsedTime[i] += other.interRDOElapsedTime[i]; countIntraRDO[i] += other.countIntraRDO[i]; countInterRDO[i] += other.countInterRDO[i]; + skippedMotionReferences[i] += other.skippedMotionReferences[i]; + totalMotionReferences[i] += other.totalMotionReferences[i]; + skippedIntraCU[i] += other.skippedIntraCU[i]; + totalIntraCU[i] += other.totalIntraCU[i]; } intraAnalysisElapsedTime += other.intraAnalysisElapsedTime; @@ -301,7 +339,7 @@ void encodeIntraInInter(Mode& intraMode, const CUGeom& cuGeom); // estimation inter prediction (non-skip) - void predInterSearch(Mode& interMode, const CUGeom& cuGeom, bool bChromaMC); + void predInterSearch(Mode& interMode, const 
CUGeom& cuGeom, bool bChromaMC, uint32_t masks[2]); // encode residual and compute rd-cost for inter mode void encodeResAndCalcRdInterCU(Mode& interMode, const CUGeom& cuGeom); @@ -319,6 +357,8 @@ void checkDQP(Mode& mode, const CUGeom& cuGeom); void checkDQPForSplitPred(Mode& mode, const CUGeom& cuGeom); + MV getLowresMV(const CUData& cu, const PredictionUnit& pu, int list, int ref); + class PME : public BondedTaskGroup { public: @@ -329,6 +369,11 @@ const PredictionUnit& pu; int puIdx; + struct { + int ref[2][MAX_NUM_REF]; + int refCnt[2]; + } m_jobs; + PME(Search& s, Mode& m, const CUGeom& g, const PredictionUnit& u, int p) : master(s), mode(m), cuGeom(g), pu(u), puIdx(p) {} void processTasks(int workerThreadId); @@ -365,7 +410,7 @@ { uint64_t rdcost; uint32_t bits; - uint32_t distortion; + sse_ret_t distortion; uint32_t energy; Cost() { rdcost = 0; bits = 0; distortion = 0; energy = 0; } };
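The search.h header codifies both features: Mode now tracks luma and chroma distortion separately in sse_ret_t (with depth-dependent saturation values in invalidate() and ok()), the PME task group carries an explicit per-list job table instead of assuming every reference is searched, and predInterSearch() takes per-PU reference masks. In those masks, list 0 references occupy bits 0..15 and list 1 references bits 16..31, and an all-zero mask means "analyse everything". A small decoding sketch:

#include <cstdint>
#include <cstdio>

int main()
{
    int numRefIdx[2] = { 3, 1 };   // hypothetical active references per list
    uint32_t refMasks[2] = { (1u << 0) | (1u << 2) | (1u << 16), 0 }; // per-PU masks

    for (int puIdx = 0; puIdx < 2; puIdx++)
    {
        uint32_t mask = refMasks[puIdx] ? refMasks[puIdx] : (uint32_t)-1; // 0 => all
        for (int list = 0; list < 2; list++)
        {
            for (int ref = 0; ref < numRefIdx[list]; ref++)
                if (mask & (1u << ref))
                    printf("PU %d: analyse L%d ref %d\n", puIdx, list, ref);
            mask >>= 16;   // the second list's bits start at bit 16
        }
    }
}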
x265_1.7.tar.gz/source/encoder/sei.cpp -> x265_1.8.tar.gz/source/encoder/sei.cpp
Changed
@@ -26,7 +26,7 @@ #include "slice.h" #include "sei.h" -using namespace x265; +using namespace X265_NS; /* x265's identifying GUID */ const uint8_t SEIuserDataUnregistered::m_uuid_iso_iec_11578[16] = {
x265_1.7.tar.gz/source/encoder/sei.h -> x265_1.8.tar.gz/source/encoder/sei.h
Changed
@@ -28,7 +28,7 @@ #include "bitstream.h" #include "slice.h" -namespace x265 { +namespace X265_NS { // private namespace class SEI : public SyntaxElementWriter
x265_1.7.tar.gz/source/encoder/slicetype.cpp -> x265_1.8.tar.gz/source/encoder/slicetype.cpp
Changed
@@ -40,7 +40,7 @@
 #define ProfileLookaheadTime(elapsed, count)
 #endif

-using namespace x265;
+using namespace X265_NS;

 namespace {
@@ -94,9 +94,7 @@
     /* Actual adaptive quantization */
     int maxCol = curFrame->m_fencPic->m_picWidth;
     int maxRow = curFrame->m_fencPic->m_picHeight;
-    int blockWidth = ((param->sourceWidth / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
-    int blockHeight = ((param->sourceHeight / 2) + X265_LOWRES_CU_SIZE - 1) >> X265_LOWRES_CU_BITS;
-    int blockCount = blockWidth * blockHeight;
+    int blockCount = curFrame->m_lowres.maxBlocksInRow * curFrame->m_lowres.maxBlocksInCol;

     for (int y = 0; y < 3; y++)
     {
@@ -133,15 +131,16 @@
     {
         blockXY = 0;
         double avg_adj_pow2 = 0, avg_adj = 0, qp_adj = 0;
-        if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE)
+        double bias_strength = 0.f;
+        if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE || param->rc.aqMode == X265_AQ_AUTO_VARIANCE_BIASED)
         {
-            double bit_depth_correction = pow(1 << (X265_DEPTH - 8), 0.5);
+            double bit_depth_correction = 1.f / (1 << (2*(X265_DEPTH-8)));
             for (blockY = 0; blockY < maxRow; blockY += 16)
             {
                 for (blockX = 0; blockX < maxCol; blockX += 16)
                 {
                     uint32_t energy = acEnergyCu(curFrame, blockX, blockY, param->internalCsp);
-                    qp_adj = pow(energy + 1, 0.1);
+                    qp_adj = pow(energy * bit_depth_correction + 1, 0.1);
                     curFrame->m_lowres.qpCuTreeOffset[blockXY] = qp_adj;
                     avg_adj += qp_adj;
                     avg_adj_pow2 += qp_adj * qp_adj;
@@ -151,8 +150,9 @@
             avg_adj /= blockCount;
             avg_adj_pow2 /= blockCount;
-            strength = param->rc.aqStrength * avg_adj / bit_depth_correction;
-            avg_adj = avg_adj - 0.5f * (avg_adj_pow2 - (11.f * bit_depth_correction)) / avg_adj;
+            strength = param->rc.aqStrength * avg_adj;
+            avg_adj = avg_adj - 0.5f * (avg_adj_pow2 - (11.f)) / avg_adj;
+            bias_strength = param->rc.aqStrength;
         }
         else
             strength = param->rc.aqStrength * 1.0397f;
@@ -162,7 +162,12 @@
         {
             for (blockX = 0; blockX < maxCol; blockX += 16)
             {
-                if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE)
+                if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE_BIASED)
+                {
+                    qp_adj = curFrame->m_lowres.qpCuTreeOffset[blockXY];
+                    qp_adj = strength * (qp_adj - avg_adj) + bias_strength * (1.f - 11.f / (qp_adj * qp_adj));
+                }
+                else if (param->rc.aqMode == X265_AQ_AUTO_VARIANCE)
                 {
                     qp_adj = curFrame->m_lowres.qpCuTreeOffset[blockXY];
                     qp_adj = strength * (qp_adj - avg_adj);
@@ -464,6 +469,7 @@
     m_pool = pool;

     m_lastNonB = NULL;
+    m_isSceneTransition = false;
     m_scratch = NULL;
     m_tld = NULL;
     m_filled = false;
@@ -1248,7 +1254,9 @@
     int numBFrames = 0;
     int numAnalyzed = numFrames;
-    if (m_param->scenecutThreshold && scenecut(frames, 0, 1, true, origNumFrames, maxSearch))
+    bool isScenecut = scenecut(frames, 0, 1, true, origNumFrames);
+    /* When scenecut threshold is set, use scenecut detection for I frame placements */
+    if (m_param->scenecutThreshold && isScenecut)
     {
         frames[1]->sliceType = X265_TYPE_I;
         return;
@@ -1338,14 +1346,13 @@
         /* Check scenecut on the first minigop. */
         for (int j = 1; j < numBFrames + 1; j++)
         {
-            if (m_param->scenecutThreshold && scenecut(frames, j, j + 1, false, origNumFrames, maxSearch))
+            if (scenecut(frames, j, j + 1, false, origNumFrames))
             {
                 frames[j]->sliceType = X265_TYPE_P;
                 numAnalyzed = j;
                 break;
             }
         }
-
         resetStart = bKeyframe ? 1 : X265_MIN(numBFrames + 2, numAnalyzed + 1);
     }
     else
@@ -1369,50 +1376,99 @@
     if (bIsVbvLookahead)
         vbvLookahead(frames, numFrames, bKeyframe);

+    int maxp1 = X265_MIN(m_param->bframes + 1, origNumFrames);
     /* Restore frame types for all frames that haven't actually been decided yet. */
     for (int j = resetStart; j <= numFrames; j++)
+    {
         frames[j]->sliceType = X265_TYPE_AUTO;
+        /* If any frame marked as scenecut is being restarted for sliceDecision,
+         * undo scene Transition flag */
+        if (j <= maxp1 && frames[j]->bScenecut && m_isSceneTransition)
+            m_isSceneTransition = false;
+    }
 }

-bool Lookahead::scenecut(Lowres **frames, int p0, int p1, bool bRealScenecut, int numFrames, int maxSearch)
+bool Lookahead::scenecut(Lowres **frames, int p0, int p1, bool bRealScenecut, int numFrames)
 {
     /* Only do analysis during a normal scenecut check. */
     if (bRealScenecut && m_param->bframes)
     {
         int origmaxp1 = p0 + 1;
         /* Look ahead to avoid coding short flashes as scenecuts. */
-        if (m_param->bFrameAdaptive == X265_B_ADAPT_TRELLIS)
-            /* Don't analyse any more frames than the trellis would have covered. */
-            origmaxp1 += m_param->bframes;
-        else
-            origmaxp1++;
+        origmaxp1 += m_param->bframes;
         int maxp1 = X265_MIN(origmaxp1, numFrames);
-
+        bool fluctuate = false;
+        bool noScenecuts = false;
+        int64_t avgSatdCost = 0;
+        if (frames[0]->costEst[1][0] > -1)
+            avgSatdCost = frames[0]->costEst[1][0];
+        int cnt = 1;
         /* Where A and B are scenes: AAAAAABBBAAAAAA
          * If BBB is shorter than (maxp1-p0), it is detected as a flash
          * and not considered a scenecut. */
         for (int cp1 = p1; cp1 <= maxp1; cp1++)
         {
             if (!scenecutInternal(frames, p0, cp1, false))
+            {
                 /* Any frame in between p0 and cur_p1 cannot be a real scenecut. */
                 for (int i = cp1; i > p0; i--)
+                {
                     frames[i]->bScenecut = false;
+                    noScenecuts = false;
+                }
+            }
+            else if (scenecutInternal(frames, cp1 - 1, cp1, false))
+            {
+                /* If current frame is a Scenecut from p0 frame as well as Scenecut from
+                 * preceeding frame, mark it as a Scenecut */
+                frames[cp1]->bScenecut = true;
+                noScenecuts = true;
+            }
+
+            /* compute average satdcost of all the frames in the mini-gop to confirm
+             * whether there is any great fluctuation among them to rule out false positives */
+            X265_CHECK(frames[cp1]->costEst[cp1 - p0][0]!= -1, "costEst is not done \n");
+            avgSatdCost += frames[cp1]->costEst[cp1 - p0][0];
+            cnt++;
         }

-        /* Where A-F are scenes: AAAAABBCCDDEEFFFFFF
-         * If each of BB ... EE are shorter than (maxp1-p0), they are
-         * detected as flashes and not considered scenecuts.
-         * Instead, the first F frame becomes a scenecut.
-         * If the video ends before F, no frame becomes a scenecut. */
-        for (int cp0 = p0; cp0 <= maxp1; cp0++)
+        /* Identify possible scene fluctuations by comparing the satd cost of the frames.
+         * This could denote the beginning or ending of scene transitions.
+         * During a scene transition(fade in/fade outs), if fluctuate remains false,
+         * then the scene had completed its transition or stabilized */
+        if (noScenecuts)
         {
-            if (origmaxp1 > maxSearch || (cp0 < maxp1 && scenecutInternal(frames, cp0, maxp1, false)))
-                /* If cur_p0 is the p0 of a scenecut, it cannot be the p1 of a scenecut. */
-                frames[cp0]->bScenecut = false;
+            fluctuate = false;
+            avgSatdCost /= cnt;
+            for (int i = p1; i <= maxp1; i++)
+            {
+                int64_t curCost = frames[i]->costEst[i - p0][0];
+                int64_t prevCost = frames[i - 1]->costEst[i - 1 - p0][0];
+                if (fabs((double)(curCost - avgSatdCost)) > 0.1 * avgSatdCost ||
+                    fabs((double)(curCost - prevCost)) > 0.1 * prevCost)
+                {
+                    fluctuate = true;
+                    if (!m_isSceneTransition && frames[i]->bScenecut)
+                    {
+                        m_isSceneTransition = true;
+                        /* just mark the first scenechange in the scene transition as a scenecut. */
+                        for (int j = i + 1; j <= maxp1; j++)
+                            frames[j]->bScenecut = false;
+                        break;
+                    }
+                }
+                frames[i]->bScenecut = false;
+            }
         }
+        if (!fluctuate && !noScenecuts)
+            m_isSceneTransition = false; /* Signal end of scene transitioning */
     }

-    /* Ignore frames that are part of a flash, i.e. cannot be real scenecuts. */
+    /* A frame is always analysed with bRealScenecut = true first, and then bRealScenecut = false,
+       the former for I decisions and the latter for P/B decisions. It's possible that the first
+       analysis detected scenecuts which were later nulled due to scene transitioning, in which
+       case do not return a true scenecut for this frame */
+
     if (!frames[p1]->bScenecut)
         return false;

     return scenecutInternal(frames, p0, p1, bRealScenecut);
@@ -1432,22 +1488,23 @@
     /* magic numbers pulled out of thin air */
     float threshMin = (float)(threshMax * 0.25);
-    float bias;
-
-    if (m_param->keyframeMin == m_param->keyframeMax)
-        threshMin = threshMax;
-    if (gopSize <= m_param->keyframeMin / 4)
-        bias = threshMin / 4;
-    else if (gopSize <= m_param->keyframeMin)
-        bias = threshMin * gopSize / m_param->keyframeMin;
-    else
+    double bias = 0.05;
+    if (bRealScenecut)
     {
-        bias = threshMin
-            + (threshMax - threshMin)
-            * (gopSize - m_param->keyframeMin)
-            / (m_param->keyframeMax - m_param->keyframeMin);
+        if (m_param->keyframeMin == m_param->keyframeMax)
+            threshMin = threshMax;
+        if (gopSize <= m_param->keyframeMin / 4)
+            bias = threshMin / 4;
+        else if (gopSize <= m_param->keyframeMin)
+            bias = threshMin * gopSize / m_param->keyframeMin;
+        else
+        {
+            bias = threshMin
+                + (threshMax - threshMin)
+                * (gopSize - m_param->keyframeMin)
+                / (m_param->keyframeMax - m_param->keyframeMin);
+        }
     }
-
     bool res = pcost >= (1.0 - bias) * icost;

     if (res && bRealScenecut)
     {
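Two of the changes above are easier to digest in equation form. The new X265_AQ_AUTO_VARIANCE_BIASED branch (aq-mode 3 from the changelog) computes, per 16x16 lowres block, an offset that can be read directly off the hunk: with E the block's AC energy from acEnergyCu(), d = X265_DEPTH, a-bar the frame-wide mean of a, s = aqStrength * a-bar and s_aq = aqStrength,

    a = (E \cdot 2^{-2(d-8)} + 1)^{0.1}, \qquad
    \Delta qp = s\,(a - \bar{a}) + s_{aq}\left(1 - \frac{11}{a^{2}}\right)

For flat, low-energy blocks a^2 falls below 11, so the bias term goes negative, lowering QP and spending extra bits there; that matches the changelog's "additional biasing for low-light conditions". The scenecut rework in the same file replaces the old trellis-bounded flash search with a running SATD-cost average over the mini-GOP: when costs deviate more than 10% from that average (or from the previous frame), the frames are treated as a fade in progress, only the first scenechange is coded as a scenecut, and m_isSceneTransition remains set until the costs stop fluctuating.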
View file
x265_1.7.tar.gz/source/encoder/slicetype.h -> x265_1.8.tar.gz/source/encoder/slicetype.h
Changed
@@ -30,7 +30,7 @@
 #include "piclist.h"
 #include "threadpool.h"

-namespace x265 {
+namespace X265_NS {
 // private namespace

 struct Lowres;
@@ -127,7 +127,7 @@
     int m_numCoopSlices;
     int m_numRowsPerSlice;
     bool m_filled;
-
+    bool m_isSceneTransition;
     Lookahead(x265_param *param, ThreadPool *pool);
 #if DETAILED_CU_STATS
@@ -156,7 +156,7 @@
     void slicetypeAnalyse(Lowres **frames, bool bKeyframe);

     /* called by slicetypeAnalyse() to make slice decisions */
-    bool scenecut(Lowres **frames, int p0, int p1, bool bRealScenecut, int numFrames, int maxSearch);
+    bool scenecut(Lowres **frames, int p0, int p1, bool bRealScenecut, int numFrames);
     bool scenecutInternal(Lowres **frames, int p0, int p1, bool bRealScenecut);
     void slicetypePath(Lowres **frames, int length, char(*best_paths)[X265_LOOKAHEAD_MAX + 1]);
     int64_t slicetypePathCost(Lowres **frames, char *path, int64_t threshold);
View file
x265_1.7.tar.gz/source/encoder/weightPrediction.cpp -> x265_1.8.tar.gz/source/encoder/weightPrediction.cpp
Changed
@@ -31,7 +31,7 @@
 #include "mv.h"
 #include "bitstream.h"

-using namespace x265;
+using namespace X265_NS;

 namespace {
 struct Cache
 {
@@ -217,7 +217,7 @@
     }
 }

-namespace x265 {
+namespace X265_NS {
 void weightAnalyse(Slice& slice, Frame& frame, x265_param& param)
 {
     WeightParam wp[2][MAX_NUM_REF][3];
View file
x265_1.7.tar.gz/source/input/input.cpp -> x265_1.8.tar.gz/source/input/input.cpp
Changed
@@ -25,7 +25,7 @@
 #include "yuv.h"
 #include "y4m.h"

-using namespace x265;
+using namespace X265_NS;

 InputFile* InputFile::open(InputFileInfo& info, bool bForceY4m)
 {
View file
x265_1.7.tar.gz/source/input/input.h -> x265_1.8.tar.gz/source/input/input.h
Changed
@@ -31,9 +31,9 @@
 #define MIN_FRAME_RATE 1
 #define MAX_FRAME_RATE 300

-#include "x265.h"
+#include "common.h"

-namespace x265 {
+namespace X265_NS {
 // private x265 namespace

 struct InputFileInfo
@@ -79,6 +79,10 @@
     virtual bool isFail() = 0;

     virtual const char *getName() const = 0;
+
+    virtual int getWidth() const = 0;
+
+    virtual int getHeight() const = 0;
 };
 }
View file
x265_1.7.tar.gz/source/input/y4m.cpp -> x265_1.8.tar.gz/source/input/y4m.cpp
Changed
@@ -36,7 +36,7 @@
 #endif
 #endif

-using namespace x265;
+using namespace X265_NS;
 using namespace std;

 static const char header[] = "FRAME";
View file
x265_1.7.tar.gz/source/input/y4m.h -> x265_1.8.tar.gz/source/input/y4m.h
Changed
@@ -30,7 +30,7 @@

 #define QUEUE_SIZE 5

-namespace x265 {
+namespace X265_NS {
 // x265 private namespace

 class Y4MInput : public InputFile, public Thread
@@ -88,6 +88,10 @@
     bool readPicture(x265_picture&);

     const char *getName() const { return "y4m"; }
+
+    int getWidth() const { return width; }
+
+    int getHeight() const { return height; }
 };
 }
View file
x265_1.7.tar.gz/source/input/yuv.cpp -> x265_1.8.tar.gz/source/input/yuv.cpp
Changed
@@ -36,7 +36,7 @@
 #endif
 #endif

-using namespace x265;
+using namespace X265_NS;
 using namespace std;

 YUVInput::YUVInput(InputFileInfo& info)
View file
x265_1.7.tar.gz/source/input/yuv.h -> x265_1.8.tar.gz/source/input/yuv.h
Changed
@@ -30,7 +30,7 @@

 #define QUEUE_SIZE 5

-namespace x265 {
+namespace X265_NS {
 // private x265 namespace

 class YUVInput : public InputFile, public Thread
@@ -80,6 +80,10 @@
     bool readPicture(x265_picture&);

     const char *getName() const { return "yuv"; }
+
+    int getWidth() const { return width; }
+
+    int getHeight() const { return height; }
 };
 }
View file
x265_1.7.tar.gz/source/output/output.cpp -> x265_1.8.tar.gz/source/output/output.cpp
Changed
@@ -28,7 +28,7 @@

 #include "raw.h"

-using namespace x265;
+using namespace X265_NS;

 ReconFile* ReconFile::open(const char *fname, int width, int height, uint32_t bitdepth, uint32_t fpsNum, uint32_t fpsDenom, int csp)
 {
View file
x265_1.7.tar.gz/source/output/output.h -> x265_1.8.tar.gz/source/output/output.h
Changed
@@ -28,7 +28,7 @@
 #include "x265.h"
 #include "input/input.h"

-namespace x265 {
+namespace X265_NS {
 // private x265 namespace

 class ReconFile
View file
x265_1.7.tar.gz/source/output/raw.cpp -> x265_1.8.tar.gz/source/output/raw.cpp
Changed
@@ -24,7 +24,7 @@

 #include "raw.h"

-using namespace x265;
+using namespace X265_NS;
 using namespace std;

 RAWOutput::RAWOutput(const char* fname, InputFileInfo&)
View file
x265_1.7.tar.gz/source/output/raw.h -> x265_1.8.tar.gz/source/output/raw.h
Changed
@@ -30,7 +30,7 @@
 #include <fstream>
 #include <iostream>

-namespace x265 {
+namespace X265_NS {

 class RAWOutput : public OutputFile
 {
 protected:
View file
x265_1.7.tar.gz/source/output/reconplay.cpp -> x265_1.8.tar.gz/source/output/reconplay.cpp
Changed
@@ -27,7 +27,7 @@

 #include <signal.h>

-using namespace x265;
+using namespace X265_NS;

 #if _WIN32
 #define popen _popen
View file
x265_1.7.tar.gz/source/output/reconplay.h -> x265_1.8.tar.gz/source/output/reconplay.h
Changed
@@ -29,7 +29,7 @@
 #include "threading.h"
 #include <cstdio>

-namespace x265 {
+namespace X265_NS {
 // private x265 namespace

 class ReconPlay : public Thread
View file
x265_1.7.tar.gz/source/output/y4m.cpp -> x265_1.8.tar.gz/source/output/y4m.cpp
Changed
@@ -25,7 +25,7 @@
 #include "output.h"
 #include "y4m.h"

-using namespace x265;
+using namespace X265_NS;
 using namespace std;

 Y4MOutput::Y4MOutput(const char *filename, int w, int h, uint32_t fpsNum, uint32_t fpsDenom, int csp)
View file
x265_1.7.tar.gz/source/output/y4m.h -> x265_1.8.tar.gz/source/output/y4m.h
Changed
@@ -27,7 +27,7 @@
 #include "output.h"
 #include <fstream>

-namespace x265 {
+namespace X265_NS {
 // private x265 namespace

 class Y4MOutput : public ReconFile
View file
x265_1.7.tar.gz/source/output/yuv.cpp -> x265_1.8.tar.gz/source/output/yuv.cpp
Changed
@@ -25,7 +25,7 @@
 #include "output.h"
 #include "yuv.h"

-using namespace x265;
+using namespace X265_NS;
 using namespace std;

 YUVOutput::YUVOutput(const char *filename, int w, int h, uint32_t d, int csp)
View file
x265_1.7.tar.gz/source/output/yuv.h -> x265_1.8.tar.gz/source/output/yuv.h
Changed
@@ -29,7 +29,7 @@

 #include <fstream>

-namespace x265 {
+namespace X265_NS {
 // private x265 namespace

 class YUVOutput : public ReconFile
View file
x265_1.7.tar.gz/source/profile/vtune/vtune.cpp -> x265_1.8.tar.gz/source/profile/vtune/vtune.cpp
Changed
@@ -36,7 +36,7 @@
 }

-namespace x265 {
+namespace X265_NS {

 __itt_domain* domain;
 __itt_string_handle* taskHandle[NUM_VTUNE_TASKS];
View file
x265_1.7.tar.gz/source/profile/vtune/vtune.h -> x265_1.8.tar.gz/source/profile/vtune/vtune.h
Changed
@@ -26,7 +26,7 @@

 #include "ittnotify.h"

-namespace x265 {
+namespace X265_NS {

 #define CPU_EVENT(x) x,
 enum VTuneTasksEnum
View file
x265_1.7.tar.gz/source/test/CMakeLists.txt -> x265_1.8.tar.gz/source/test/CMakeLists.txt
Changed
@@ -1,3 +1,4 @@
+# vim: syntax=cmake
 enable_language(ASM_YASM)

 if(MSVC_IDE)
@@ -24,5 +25,9 @@
     intrapredharness.cpp intrapredharness.h)

 target_link_libraries(TestBench x265-static ${PLATFORM_LIBS})

 if(LINKER_OPTIONS)
-    set_target_properties(TestBench PROPERTIES LINK_FLAGS ${LINKER_OPTIONS})
+    if(EXTRA_LIB)
+        list(APPEND LINKER_OPTIONS "-L..")
+    endif(EXTRA_LIB)
+    string(REPLACE ";" " " LINKER_OPTION_STR "${LINKER_OPTIONS}")
+    set_target_properties(TestBench PROPERTIES LINK_FLAGS "${LINKER_OPTION_STR}")
 endif()
View file
x265_1.7.tar.gz/source/test/checkasm-a.asm -> x265_1.8.tar.gz/source/test/checkasm-a.asm
Changed
@@ -152,10 +152,12 @@
     jz .ok
     mov  r9, rax
+    mov  r10, rdx
     lea  r0, [error_message]
     call puts
     mov  r1, [rsp+max_args*8]
     mov  dword [r1], 0
+    mov  rdx, r10
     mov  rax, r9
 .ok:
     RET
@@ -191,12 +193,14 @@
     or   r3, r5
     jz .ok
     mov  r3, eax
+    mov  r4, edx
     lea  r1, [error_message]
     push r1
     call puts
     add  esp, 4
     mov  r1, r1m
     mov  dword [r1], 0
+    mov  edx, r4
     mov  eax, r3
 .ok:
     REP_RET
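For context on this checkasm fix: rdx/edx is caller-saved under the x86 calling conventions, and it can carry half of a wide return value (a 64-bit return on 32-bit x86 comes back in the edx:eax pair). The error path calls puts(), which is free to clobber it, so the wrapper now saves and restores it alongside rax/eax. A minimal sketch of the kind of primitive that would be corrupted without this (the function here is hypothetical, not from the source):

    #include <cstdint>
    #include <cstdio>

    // Hypothetical primitive returning 64 bits; on 32-bit x86 (cdecl) the
    // result travels in the edx:eax register pair.
    extern "C" uint64_t wide_sse(const int16_t* buf, int n)
    {
        uint64_t sum = 0;
        for (int i = 0; i < n; i++)
            sum += (int32_t)buf[i] * buf[i];
        return sum;
    }

    int main()
    {
        int16_t v[4] = { 100, -200, 300, -400 };
        // A checking wrapper that called puts() after wide_sse() returned,
        // but before forwarding its value, could lose the high half held in
        // edx; hence the new save/restore in checkasm-a.asm above.
        printf("%llu\n", (unsigned long long)wide_sse(v, 4));
        return 0;
    }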
View file
x265_1.7.tar.gz/source/test/intrapredharness.cpp -> x265_1.8.tar.gz/source/test/intrapredharness.cpp
Changed
@@ -25,12 +25,22 @@
 #include "predict.h"
 #include "intrapredharness.h"

-using namespace x265;
+using namespace X265_NS;

 IntraPredHarness::IntraPredHarness()
 {
     for (int i = 0; i < INPUT_SIZE; i++)
         pixel_buff[i] = rand() % PIXEL_MAX;
+
+    /* [0] --- Random values
+     * [1] --- Minimum
+     * [2] --- Maximum */
+    for (int i = 0; i < BUFFSIZE; i++)
+    {
+        pixel_test_buff[0][i] = rand() % PIXEL_MAX;
+        pixel_test_buff[1][i] = PIXEL_MIN;
+        pixel_test_buff[2][i] = PIXEL_MAX;
+    }
 }

 bool IntraPredHarness::check_dc_primitive(intra_pred_t ref, intra_pred_t opt, int width)
@@ -177,6 +187,27 @@
     return true;
 }

+bool IntraPredHarness::check_intra_filter_primitive(const intra_filter_t ref, const intra_filter_t opt)
+{
+    memset(pixel_out_c, 0, 64 * 64 * sizeof(pixel));
+    memset(pixel_out_vec, 0, 64 * 64 * sizeof(pixel));
+    int j = 0;
+
+    for (int i = 0; i < 100; i++)
+    {
+        int index = rand() % TEST_CASES;
+
+        ref(pixel_test_buff[index] + j, pixel_out_c);
+        checked(opt, pixel_test_buff[index] + j, pixel_out_vec);
+
+        if (memcmp(pixel_out_c, pixel_out_vec, 64 * 64 * sizeof(pixel)))
+            return false;
+
+        reportfail();
+        j += FENC_STRIDE;
+    }
+    return true;
+}
 bool IntraPredHarness::testCorrectness(const EncoderPrimitives& ref, const EncoderPrimitives& opt)
 {
     for (int i = BLOCK_4x4; i <= BLOCK_32x32; i++)
@@ -213,6 +244,14 @@
                 return false;
             }
         }
+        if (opt.cu[i].intra_filter)
+        {
+            if (!check_intra_filter_primitive(ref.cu[i].intra_filter, opt.cu[i].intra_filter))
+            {
+                printf("intra_filter_%dx%d failed\n", size, size);
+                return false;
+            }
+        }
     }

     return true;
@@ -268,5 +307,10 @@
                                pixel_out_vec, FENC_STRIDE, pixel_buff + srcStride, mode, bFilter);
             }
         }
+        if (opt.cu[i].intra_filter)
+        {
+            printf("intra_filter_%dx%d", size, size);
+            REPORT_SPEEDUP(opt.cu[i].intra_filter, ref.cu[i].intra_filter, pixel_buff, pixel_out_c);
+        }
     }
 }
View file
x265_1.7.tar.gz/source/test/intrapredharness.h -> x265_1.8.tar.gz/source/test/intrapredharness.h
Changed
@@ -34,7 +34,15 @@
     enum { INPUT_SIZE = 4 * 65 * 65 * 100 };
     enum { OUTPUT_SIZE = 64 * FENC_STRIDE };
     enum { OUTPUT_SIZE_33 = 33 * OUTPUT_SIZE };
+    enum { TEST_CASES = 3 };
+    enum { INCR = 32 };
+    enum { STRIDE = 64 };
+    enum { ITERS = 100 };
+    enum { MAX_HEIGHT = 64 };
+    enum { PAD_ROWS = 64 };
+    enum { BUFFSIZE = STRIDE * (MAX_HEIGHT + PAD_ROWS) + INCR * ITERS };

+    pixel pixel_test_buff[TEST_CASES][BUFFSIZE];
     ALIGN_VAR_16(pixel, pixel_buff[INPUT_SIZE]);
     pixel pixel_out_c[OUTPUT_SIZE];
     pixel pixel_out_vec[OUTPUT_SIZE];
@@ -45,6 +53,7 @@
     bool check_planar_primitive(intra_pred_t ref, intra_pred_t opt, int width);
     bool check_angular_primitive(const intra_pred_t ref[], const intra_pred_t opt[], int size);
     bool check_allangs_primitive(const intra_allangs_t ref, const intra_allangs_t opt, int size);
+    bool check_intra_filter_primitive(const intra_filter_t ref, const intra_filter_t opt);

 public:
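With the enum values just added, the size of the new shared test buffer is easy to sanity-check:

    \text{BUFFSIZE} = \text{STRIDE} \times (\text{MAX\_HEIGHT} + \text{PAD\_ROWS}) + \text{INCR} \times \text{ITERS}
                    = 64 \times (64 + 64) + 32 \times 100 = 11392

pixels per test case, times TEST_CASES = 3 buffers (one random, one all-minimum, one all-maximum, per the constructor comment in intrapredharness.cpp).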
View file
x265_1.7.tar.gz/source/test/ipfilterharness.cpp -> x265_1.8.tar.gz/source/test/ipfilterharness.cpp
Changed
@@ -27,7 +27,7 @@
 #include "common.h"
 #include "ipfilterharness.h"

-using namespace x265;
+using namespace X265_NS;

 IPFilterHarness::IPFilterHarness()
 {
@@ -122,7 +122,14 @@
                     coeffIdx);

             if (memcmp(IPF_vec_output_s, IPF_C_output_s, TEST_BUF_SIZE * sizeof(int16_t)))
+            {
+                ref(pixel_test_buff[index] + 3 * rand_srcStride,
+                    rand_srcStride,
+                    IPF_C_output_s,
+                    rand_dstStride,
+                    coeffIdx);
                 return false;
+            }

             reportfail();
         }
View file
x265_1.7.tar.gz/source/test/mbdstharness.cpp -> x265_1.8.tar.gz/source/test/mbdstharness.cpp
Changed
@@ -27,7 +27,7 @@
 #include "common.h"
 #include "mbdstharness.h"

-using namespace x265;
+using namespace X265_NS;

 struct DctConf
 {
@@ -53,7 +53,7 @@

 MBDstHarness::MBDstHarness()
 {
-    const int idct_max = (1 << (BIT_DEPTH + 4)) - 1;
+    const int idct_max = (1 << (X265_DEPTH + 4)) - 1;

     /* [0] --- Random values
      * [1] --- Minimum
@@ -215,8 +215,14 @@
     uint32_t optReturnValue = 0;
     uint32_t refReturnValue = 0;

-    int bits = (rand() % 24) + 8;
-    int valueToAdd = rand() % (1 << bits);
+    int sliceType = rand() % 2;
+    int log2TrSize = rand() % 4 + 2;
+    int qp = rand() % (QP_MAX_SPEC + QP_BD_OFFSET + 1);
+    int per = qp / 6;
+    int transformShift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - log2TrSize;
+
+    int bits = QUANT_SHIFT + per + transformShift;
+    int valueToAdd = (sliceType == 1 ? 171 : 85) << (bits - 9);
     int cmp_size = sizeof(int) * height * width;
     int cmp_size1 = sizeof(short) * height * width;
     int numCoeff = height * width;
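The rewritten nquant test now derives valueToAdd the way the encoder's quantizer does instead of drawing a random offset. Read off the hunk, with d = X265_DEPTH and T the transform size, the rounding offset being exercised is

    \text{bits} = \text{QUANT\_SHIFT} + \left\lfloor \frac{QP}{6} \right\rfloor
                + \left(\text{MAX\_TR\_DYNAMIC\_RANGE} - d - \log_2 T\right), \qquad
    f = \frac{q}{2^{9}} \cdot 2^{\text{bits}}, \quad q \in \{171, 85\}

Since 171/512 is roughly 1/3 and 85/512 roughly 1/6, these are the familiar HEVC deadzone offsets of about one third and one sixth of a quantization step; the conventional reading (not spelled out in the hunk itself) is that the larger offset applies to intra coding and the smaller to inter.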
View file
x265_1.7.tar.gz/source/test/pixelharness.cpp -> x265_1.8.tar.gz/source/test/pixelharness.cpp
Changed
@@ -23,8 +23,9 @@

 #include "pixelharness.h"
 #include "primitives.h"
+#include "entropy.h"

-using namespace x265;
+using namespace X265_NS;

 PixelHarness::PixelHarness()
 {
@@ -93,7 +94,7 @@
     return true;
 }

-bool PixelHarness::check_pixelcmp_ss(pixelcmp_ss_t ref, pixelcmp_ss_t opt)
+bool PixelHarness::check_pixel_sse(pixel_sse_t ref, pixel_sse_t opt)
 {
     int j = 0;
     intptr_t stride = STRIDE;
@@ -102,8 +103,29 @@
     {
         int index1 = rand() % TEST_CASES;
         int index2 = rand() % TEST_CASES;
-        int vres = (int)checked(opt, short_test_buff[index1], stride, short_test_buff[index2] + j, stride);
-        int cres = ref(short_test_buff[index1], stride, short_test_buff[index2] + j, stride);
+        sse_ret_t vres = (sse_ret_t)checked(opt, pixel_test_buff[index1], stride, pixel_test_buff[index2] + j, stride);
+        sse_ret_t cres = ref(pixel_test_buff[index1], stride, pixel_test_buff[index2] + j, stride);
+        if (vres != cres)
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
+bool PixelHarness::check_pixel_sse_ss(pixel_sse_ss_t ref, pixel_sse_ss_t opt)
+{
+    int j = 0;
+    intptr_t stride = STRIDE;
+
+    for (int i = 0; i < ITERS; i++)
+    {
+        int index1 = rand() % TEST_CASES;
+        int index2 = rand() % TEST_CASES;
+        sse_ret_t vres = (sse_ret_t)checked(opt, short_test_buff[index1], stride, short_test_buff[index2] + j, stride);
+        sse_ret_t cres = ref(short_test_buff[index1], stride, short_test_buff[index2] + j, stride);
         if (vres != cres)
             return false;
@@ -900,8 +922,8 @@
     ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
     ALIGN_VAR_16(pixel, opt_dest[64 * 64]);

-    memset(ref_dest, 0xCD, sizeof(ref_dest));
-    memset(opt_dest, 0xCD, sizeof(opt_dest));
+    for (int i = 0; i < 64 * 64; i++)
+        ref_dest[i] = opt_dest[i] = rand() % (PIXEL_MAX);

     int j = 0;
@@ -928,8 +950,8 @@
     ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
     ALIGN_VAR_16(pixel, opt_dest[64 * 64]);

-    memset(ref_dest, 0xCD, sizeof(ref_dest));
-    memset(opt_dest, 0xCD, sizeof(opt_dest));
+    for (int i = 0; i < 64 * 64; i++)
+        ref_dest[i] = opt_dest[i] = rand() % (PIXEL_MAX);

     int j = 0;
@@ -956,8 +978,8 @@
     ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
     ALIGN_VAR_16(pixel, opt_dest[64 * 64]);

-    memset(ref_dest, 0xCD, sizeof(ref_dest));
-    memset(opt_dest, 0xCD, sizeof(opt_dest));
+    for (int i = 0; i < 64 * 64; i++)
+        ref_dest[i] = opt_dest[i] = rand() % (PIXEL_MAX);

     for (int id = 0; id < 2; id++)
     {
@@ -992,8 +1014,8 @@
     ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
     ALIGN_VAR_16(pixel, opt_dest[64 * 64]);

-    memset(ref_dest, 0xCD, sizeof(ref_dest));
-    memset(opt_dest, 0xCD, sizeof(opt_dest));
+    for (int i = 0; i < 64 * 64; i++)
+        ref_dest[i] = opt_dest[i] = rand() % (PIXEL_MAX);

     int j = 0;
@@ -1016,13 +1038,234 @@
     return true;
 }

+bool PixelHarness::check_saoCuStatsBO_t(saoCuStatsBO_t ref, saoCuStatsBO_t opt)
+{
+    enum { NUM_EDGETYPE = 33 }; // classIdx = 1 + (rec[x] >> 3);
+    int32_t stats_ref[NUM_EDGETYPE];
+    int32_t stats_vec[NUM_EDGETYPE];
+
+    int32_t count_ref[NUM_EDGETYPE];
+    int32_t count_vec[NUM_EDGETYPE];
+
+    int j = 0;
+    for (int i = 0; i < ITERS; i++)
+    {
+        // initialize input data to random, the dynamic range wrong but good to verify our asm code
+        for (int x = 0; x < NUM_EDGETYPE; x++)
+        {
+            stats_ref[x] = stats_vec[x] = rand();
+            count_ref[x] = count_vec[x] = rand();
+        }
+
+        intptr_t stride = 16 * (rand() % 4 + 1);
+        int endX = MAX_CU_SIZE - (rand() % 5);
+        int endY = MAX_CU_SIZE - (rand() % 4) - 1;
+
+        ref(pbuf2 + j + 1, pbuf3 + 1, stride, endX, endY, stats_ref, count_ref);
+        checked(opt, pbuf2 + j + 1, pbuf3 + 1, stride, endX, endY, stats_vec, count_vec);
+
+        if (memcmp(stats_ref, stats_vec, sizeof(stats_ref)) || memcmp(count_ref, count_vec, sizeof(count_ref)))
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
+bool PixelHarness::check_saoCuStatsE0_t(saoCuStatsE0_t ref, saoCuStatsE0_t opt)
+{
+    enum { NUM_EDGETYPE = 5 };
+    int32_t stats_ref[NUM_EDGETYPE];
+    int32_t stats_vec[NUM_EDGETYPE];
+
+    int32_t count_ref[NUM_EDGETYPE];
+    int32_t count_vec[NUM_EDGETYPE];
+
+    int j = 0;
+    for (int i = 0; i < ITERS; i++)
+    {
+        // initialize input data to random, the dynamic range wrong but good to verify our asm code
+        for (int x = 0; x < NUM_EDGETYPE; x++)
+        {
+            stats_ref[x] = stats_vec[x] = rand();
+            count_ref[x] = count_vec[x] = rand();
+        }
+
+        intptr_t stride = 16 * (rand() % 4 + 1);
+        int endX = MAX_CU_SIZE - (rand() % 5) - 1;
+        int endY = MAX_CU_SIZE - (rand() % 4) - 1;
+
+        ref(pbuf2 + j + 1, pbuf3 + j + 1, stride, endX, endY, stats_ref, count_ref);
+        checked(opt, pbuf2 + j + 1, pbuf3 + j + 1, stride, endX, endY, stats_vec, count_vec);
+
+        if (memcmp(stats_ref, stats_vec, sizeof(stats_ref)) || memcmp(count_ref, count_vec, sizeof(count_ref)))
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
+bool PixelHarness::check_saoCuStatsE1_t(saoCuStatsE1_t ref, saoCuStatsE1_t opt)
+{
+    enum { NUM_EDGETYPE = 5 };
+    int32_t stats_ref[NUM_EDGETYPE];
+    int32_t stats_vec[NUM_EDGETYPE];
+
+    int32_t count_ref[NUM_EDGETYPE];
+    int32_t count_vec[NUM_EDGETYPE];
+
+    int8_t _upBuff1_ref[MAX_CU_SIZE + 2], *upBuff1_ref = _upBuff1_ref + 1;
+    int8_t _upBuff1_vec[MAX_CU_SIZE + 2], *upBuff1_vec = _upBuff1_vec + 1;
+
+    int j = 0;
+
+    for (int i = 0; i < ITERS; i++)
+    {
+        // initialize input data to random, the dynamic range wrong but good to verify our asm code
+        for (int x = 0; x < NUM_EDGETYPE; x++)
+        {
+            stats_ref[x] = stats_vec[x] = rand();
+            count_ref[x] = count_vec[x] = rand();
+        }
+
+        // initial sign
+        for (int x = 0; x < MAX_CU_SIZE + 2; x++)
+            _upBuff1_ref[x] = _upBuff1_vec[x] = (rand() % 3) - 1;
+
+        intptr_t stride = 16 * (rand() % 4 + 1);
+        int endX = MAX_CU_SIZE - (rand() % 5);
+        int endY = MAX_CU_SIZE - (rand() % 4) - 1;
+
+        ref(pbuf2 + 1, pbuf3 + 1, stride, upBuff1_ref, endX, endY, stats_ref, count_ref);
+        checked(opt, pbuf2 + 1, pbuf3 + 1, stride, upBuff1_vec, endX, endY, stats_vec, count_vec);
+
+        if (   memcmp(_upBuff1_ref, _upBuff1_vec, sizeof(_upBuff1_ref))
            || memcmp(stats_ref, stats_vec, sizeof(stats_ref))
            || memcmp(count_ref, count_vec, sizeof(count_ref)))
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
+bool PixelHarness::check_saoCuStatsE2_t(saoCuStatsE2_t ref, saoCuStatsE2_t opt)
+{
+    enum { NUM_EDGETYPE = 5 };
+    int32_t stats_ref[NUM_EDGETYPE];
+    int32_t stats_vec[NUM_EDGETYPE];
+
+    int32_t count_ref[NUM_EDGETYPE];
+    int32_t count_vec[NUM_EDGETYPE];
+
+    int8_t _upBuff1_ref[MAX_CU_SIZE + 2], *upBuff1_ref = _upBuff1_ref + 1;
+    int8_t _upBufft_ref[MAX_CU_SIZE + 2], *upBufft_ref = _upBufft_ref + 1;
+    int8_t _upBuff1_vec[MAX_CU_SIZE + 2], *upBuff1_vec = _upBuff1_vec + 1;
+    int8_t _upBufft_vec[MAX_CU_SIZE + 2], *upBufft_vec = _upBufft_vec + 1;
+
+    int j = 0;
+
+    // NOTE: verify more times since our asm is NOT exact match to C, the output of upBuff* will be DIFFERENT
+    for (int i = 0; i < ITERS * 10; i++)
+    {
+        // initialize input data to random, the dynamic range wrong but good to verify our asm code
+        for (int x = 0; x < NUM_EDGETYPE; x++)
+        {
+            stats_ref[x] = stats_vec[x] = rand();
+            count_ref[x] = count_vec[x] = rand();
+        }
+
+        // initial sign
+        for (int x = 0; x < MAX_CU_SIZE + 2; x++)
+        {
+            _upBuff1_ref[x] = _upBuff1_vec[x] = (rand() % 3) - 1;
+            _upBufft_ref[x] = _upBufft_vec[x] = (rand() % 3) - 1;
+        }
+
+        intptr_t stride = 16 * (rand() % 4 + 1);
+        int endX = MAX_CU_SIZE - (rand() % 5) - 1;
+        int endY = MAX_CU_SIZE - (rand() % 4) - 1;
+
+        ref(pbuf2 + 1, pbuf3 + 1, stride, upBuff1_ref, upBufft_ref, endX, endY, stats_ref, count_ref);
+        checked(opt, pbuf2 + 1, pbuf3 + 1, stride, upBuff1_vec, upBufft_vec, endX, endY, stats_vec, count_vec);
+
+        // TODO: don't check upBuff*, the latest output pixels different, and can move into stack temporary buffer in future
+        if (   memcmp(_upBuff1_ref, _upBuff1_vec, sizeof(_upBuff1_ref))
            || memcmp(_upBufft_ref, _upBufft_vec, sizeof(_upBufft_ref))
            || memcmp(stats_ref, stats_vec, sizeof(stats_ref))
            || memcmp(count_ref, count_vec, sizeof(count_ref)))
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
+bool PixelHarness::check_saoCuStatsE3_t(saoCuStatsE3_t ref, saoCuStatsE3_t opt)
+{
+    enum { NUM_EDGETYPE = 5 };
+    int32_t stats_ref[NUM_EDGETYPE];
+    int32_t stats_vec[NUM_EDGETYPE];
+
+    int32_t count_ref[NUM_EDGETYPE];
+    int32_t count_vec[NUM_EDGETYPE];
+
+    int8_t _upBuff1_ref[MAX_CU_SIZE + 2], *upBuff1_ref = _upBuff1_ref + 1;
+    int8_t _upBuff1_vec[MAX_CU_SIZE + 2], *upBuff1_vec = _upBuff1_vec + 1;
+
+    int j = 0;
+
+    // (const pixel *fenc, const pixel *rec, intptr_t stride, int8_t *upBuff1, int endX, int endY, int32_t *stats, int32_t *count)
+    for (int i = 0; i < ITERS; i++)
+    {
+        // initialize input data to random, the dynamic range wrong but good to verify our asm code
+        for (int x = 0; x < NUM_EDGETYPE; x++)
+        {
+            stats_ref[x] = stats_vec[x] = rand();
+            count_ref[x] = count_vec[x] = rand();
+        }
+
+        // initial sign
+        for (int x = 0; x < (int)sizeof(_upBuff1_ref); x++)
+        {
+            _upBuff1_ref[x] = _upBuff1_vec[x] = (rand() % 3) - 1;
+        }
+
+        intptr_t stride = 16 * (rand() % 4 + 1);
+        int endX = MAX_CU_SIZE - (rand() % 5) - 1;
+        int endY = MAX_CU_SIZE - (rand() % 4) - 1;
+
+        ref(pbuf2, pbuf3, stride, upBuff1_ref, endX, endY, stats_ref, count_ref);
+        checked(opt, pbuf2, pbuf3, stride, upBuff1_vec, endX, endY, stats_vec, count_vec);
+
+        if (   memcmp(_upBuff1_ref, _upBuff1_vec, sizeof(_upBuff1_ref))
            || memcmp(stats_ref, stats_vec, sizeof(stats_ref))
            || memcmp(count_ref, count_vec, sizeof(count_ref)))
+            return false;
+
+        reportfail();
+        j += INCR;
+    }
+
+    return true;
+}
+
 bool PixelHarness::check_saoCuOrgE3_32_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt)
 {
     ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
     ALIGN_VAR_16(pixel, opt_dest[64 * 64]);

-    memset(ref_dest, 0xCD, sizeof(ref_dest));
-    memset(opt_dest, 0xCD, sizeof(opt_dest));
+    for (int i = 0; i < 64 * 64; i++)
+        ref_dest[i] = opt_dest[i] = rand() % (PIXEL_MAX);

     int j = 0;
@@ -1061,8 +1304,8 @@
     for (int i = 0; i < ITERS; i++)
     {
         int index = i % TEST_CASES;
-        checked(opt, ushort_test_buff[index] + j, srcStride, opt_dest, dstStride, width, height, (int)8, (uint16_t)255);
-        ref(ushort_test_buff[index] + j, srcStride, ref_dest, dstStride, width, height, (int)8, (uint16_t)255);
+        checked(opt, ushort_test_buff[index] + j, srcStride, opt_dest, dstStride, width, height, (int)8, (uint16_t)((1 << X265_DEPTH) - 1));
+        ref(ushort_test_buff[index] + j, srcStride, ref_dest, dstStride, width, height, (int)8, (uint16_t)((1 << X265_DEPTH) - 1));

         if (memcmp(ref_dest, opt_dest, width * height * sizeof(pixel)))
             return false;
@@ -1076,8 +1319,8 @@

 bool PixelHarness::check_planecopy_cp(planecopy_cp_t ref, planecopy_cp_t opt)
 {
-    ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
-    ALIGN_VAR_16(pixel, opt_dest[64 * 64]);
+    ALIGN_VAR_16(pixel, ref_dest[64 * 64 * 2]);
+    ALIGN_VAR_16(pixel, opt_dest[64 * 64 * 2]);

     memset(ref_dest, 0xCD, sizeof(ref_dest));
     memset(opt_dest, 0xCD, sizeof(opt_dest));
@@ -1094,7 +1337,7 @@
         checked(opt, uchar_test_buff[index] + j, srcStride, opt_dest, dstStride, width, height, (int)2);
         ref(uchar_test_buff[index] + j, srcStride, ref_dest, dstStride, width, height, (int)2);

-        if (memcmp(ref_dest, opt_dest, width * height * sizeof(pixel)))
+        if (memcmp(ref_dest, opt_dest, sizeof(ref_dest)))
             return false;

         reportfail();
@@ -1181,8 +1424,8 @@
     ALIGN_VAR_16(pixel, ref_dest[64 * 64]);
     ALIGN_VAR_16(pixel, opt_dest[64 * 64]);

-    memset(ref_dest, 0xCD, sizeof(ref_dest));
-    memset(opt_dest, 0xCD, sizeof(opt_dest));
+    for (int i = 0; i < 64 * 64; i++)
+        ref_dest[i] = opt_dest[i] = rand() % (PIXEL_MAX);

     int j = 0;
@@ -1293,23 +1536,22 @@

 bool PixelHarness::check_findPosFirstLast(findPosFirstLast_t ref, findPosFirstLast_t opt)
 {
-    ALIGN_VAR_16(coeff_t, ref_src[32 * 32 + ITERS * 2]);
+    ALIGN_VAR_16(coeff_t, ref_src[4 * 32 + ITERS * 2]);
+    memset(ref_src, 0, sizeof(ref_src));

-    for (int i = 0; i < 32 * 32; i++)
+    // minus ITERS for keep probability to generate all zeros block
+    for (int i = 0; i < 4 * 32 - ITERS; i++)
     {
         ref_src[i] = rand() & SHORT_MAX;
     }

-    // extra test area all of 0x1234
-    for (int i = 0; i < ITERS * 2; i++)
-    {
-        ref_src[32 * 32 + i] = 0x1234;
-    }
+    // extra test area all of Zeros

     for (int i = 0; i < ITERS; i++)
     {
         int rand_scan_type = rand() % NUM_SCAN_TYPE;
         int rand_scan_size = (rand() % NUM_SCAN_SIZE) + 2;
+        const int trSize = (1 << rand_scan_size);
         coeff_t *rand_src = ref_src + i;

         const uint16_t* const scanTbl = g_scan4x4[rand_scan_type];
@@ -1319,22 +1561,193 @@
         {
             const uint32_t idxY = j / MLS_CG_SIZE;
             const uint32_t idxX = j % MLS_CG_SIZE;
-            if (rand_src[idxY * rand_scan_size + idxX]) break;
+            if (rand_src[idxY * trSize + idxX]) break;
         }

-        // fill one coeff when all coeff group are zero
+        uint32_t ref_scanPos = ref(rand_src, trSize, scanTbl);
+        uint32_t opt_scanPos = (int)checked(opt, rand_src, trSize, scanTbl);
+
+        // specially case: all coeff group are zero
         if (j >= SCAN_SET_SIZE)
-            rand_src[0] = 0x0BAD;
+        {
+            // all zero block the high 16-bits undefined
+            if ((uint16_t)ref_scanPos != (uint16_t)opt_scanPos)
+                return false;
+        }
+        else if (ref_scanPos != opt_scanPos)
+            return false;

-        uint32_t ref_scanPos = ref(rand_src, (1 << rand_scan_size), scanTbl);
-        uint32_t opt_scanPos = (int)checked(opt, rand_src, (1 << rand_scan_size), scanTbl);
+        reportfail();
+    }

-        if (ref_scanPos != opt_scanPos)
+    return true;
+}
+
+bool PixelHarness::check_costCoeffNxN(costCoeffNxN_t ref, costCoeffNxN_t opt)
+{
+    ALIGN_VAR_16(coeff_t, ref_src[32 * 32 + ITERS * 3]);
+    ALIGN_VAR_32(uint16_t, ref_absCoeff[1 << MLS_CG_SIZE]);
+    ALIGN_VAR_32(uint16_t, opt_absCoeff[1 << MLS_CG_SIZE]);
+
+    memset(ref_absCoeff, 0xCD, sizeof(ref_absCoeff));
+    memset(opt_absCoeff, 0xCD, sizeof(opt_absCoeff));
+
+    int totalCoeffs = 0;
+    for (int i = 0; i < 32 * 32; i++)
+    {
+        ref_src[i] = rand() & SHORT_MAX;
+
+        // more zero coeff
+        if (ref_src[i] < SHORT_MAX * 2 / 3)
+            ref_src[i] = 0;
+
+        // more negtive
+        if ((rand() % 10) < 8)
+            ref_src[i] *= -1;
+        totalCoeffs += (ref_src[i] != 0);
+    }
+
+    // extra test area all of 0x1234
+    for (int i = 0; i < ITERS * 3; i++)
+    {
+        ref_src[32 * 32 + i] = 0x1234;
+    }
+
+    // generate CABAC context table
+    uint8_t m_contextState_ref[OFF_SIG_FLAG_CTX + NUM_SIG_FLAG_CTX_LUMA];
+    uint8_t m_contextState_opt[OFF_SIG_FLAG_CTX + NUM_SIG_FLAG_CTX_LUMA];
+    for (int k = 0; k < (OFF_SIG_FLAG_CTX + NUM_SIG_FLAG_CTX_LUMA); k++)
+    {
+        m_contextState_ref[k] = (rand() % (125 - 2)) + 2;
+        m_contextState_opt[k] = m_contextState_ref[k];
+    }
+    uint8_t *const ref_baseCtx = m_contextState_ref;
+    uint8_t *const opt_baseCtx = m_contextState_opt;
+
+    for (int i = 0; i < ITERS * 2; i++)
+    {
+        int rand_scan_type = rand() % NUM_SCAN_TYPE;
+        int rand_scanPosSigOff = rand() % 16; //rand_scanPosSigOff range is [1,15]
+        int rand_patternSigCtx = rand() % 4; //range [0,3]
+        int rand_scan_size = rand() % NUM_SCAN_SIZE;
+        int offset; // the value have a exact range, details in CoeffNxN()
+        if (rand_scan_size == 2)
+            offset = 0;
+        else if (rand_scan_size == 3)
+            offset = 9;
+        else
+            offset = 12;
+
+        const int trSize = (1 << (rand_scan_size + 2));
+        ALIGN_VAR_32(static const uint8_t, table_cnt[5][SCAN_SET_SIZE]) =
+        {
+            // patternSigCtx = 0
+            {
+                2, 1, 1, 0,
+                1, 1, 0, 0,
+                1, 0, 0, 0,
+                0, 0, 0, 0,
+            },
+            // patternSigCtx = 1
+            {
+                2, 2, 2, 2,
+                1, 1, 1, 1,
+                0, 0, 0, 0,
+                0, 0, 0, 0,
+            },
+            // patternSigCtx = 2
+            {
+                2, 1, 0, 0,
+                2, 1, 0, 0,
+                2, 1, 0, 0,
+                2, 1, 0, 0,
+            },
+            // patternSigCtx = 3
+            {
+                2, 2, 2, 2,
+                2, 2, 2, 2,
+                2, 2, 2, 2,
+                2, 2, 2, 2,
+            },
+            // 4x4
+            {
+                0, 1, 4, 5,
+                2, 3, 4, 5,
+                6, 6, 8, 8,
+                7, 7, 8, 8
+            }
+        };
+        const uint8_t *rand_tabSigCtx = table_cnt[(rand_scan_size == 2) ? 4 : (uint32_t)rand_patternSigCtx];
+        const uint16_t* const scanTbl = g_scanOrder[rand_scan_type][rand_scan_size];
+        const uint16_t* const scanTblCG4x4 = g_scan4x4[rand_scan_size <= (MDCS_LOG2_MAX_SIZE - 2) ? rand_scan_type : SCAN_DIAG];
+
+        int rand_scanPosCG = rand() % (trSize * trSize / MLS_CG_BLK_SIZE);
+        int subPosBase = rand_scanPosCG * MLS_CG_BLK_SIZE;
+        int rand_numCoeff = 0;
+        uint32_t scanFlagMask = 0;
+        const int numNonZero = (rand_scanPosSigOff < (MLS_CG_BLK_SIZE - 1)) ? 1 : 0;
+
+        for(int k = 0; k <= rand_scanPosSigOff; k++)
+        {
+            uint32_t pos = scanTbl[subPosBase + k];
+            coeff_t tmp_coeff = ref_src[i + pos];
+            if (tmp_coeff != 0)
+            {
+                rand_numCoeff++;
+            }
+            scanFlagMask = scanFlagMask * 2 + (tmp_coeff != 0);
+        }
+
+        // can't process all zeros block
+        if (rand_numCoeff == 0)
+            continue;
+
+        const uint32_t blkPosBase = scanTbl[subPosBase];
+        uint32_t ref_sum = ref(scanTblCG4x4, &ref_src[blkPosBase + i], trSize, ref_absCoeff + numNonZero, rand_tabSigCtx, scanFlagMask, (uint8_t*)ref_baseCtx, offset, rand_scanPosSigOff, subPosBase);
+        uint32_t opt_sum = (uint32_t)checked(opt, scanTblCG4x4, &ref_src[blkPosBase + i], trSize, opt_absCoeff + numNonZero, rand_tabSigCtx, scanFlagMask, (uint8_t*)opt_baseCtx, offset, rand_scanPosSigOff, subPosBase);
+
+        if (ref_sum != opt_sum)
+            return false;
+        if (memcmp(ref_baseCtx, opt_baseCtx, sizeof(m_contextState_ref)))
+            return false;
+
+        // NOTE: just first rand_numCoeff valid, but I check full buffer for confirm no overwrite bug
+        if (memcmp(ref_absCoeff, opt_absCoeff, sizeof(ref_absCoeff)))
             return false;

         reportfail();
     }
+    return true;
+}
+bool PixelHarness::check_costCoeffRemain(costCoeffRemain_t ref, costCoeffRemain_t opt)
+{
+    ALIGN_VAR_32(uint16_t, absCoeff[1 << MLS_CG_SIZE]);
+    for (int i = 0; i < (1 << MLS_CG_SIZE); i++)
+    {
+        absCoeff[i] = rand() & SHORT_MAX;
+        // more coeff with value one
+        if (absCoeff[i] < SHORT_MAX * 2 / 3)
+            absCoeff[i] = 1;
+    }
+    for (int i = 0; i < ITERS; i++)
+    {
+        uint32_t firstC2Idx = 0;
+        int k = 0;
+        int numNonZero = rand() % 17; //can be random, range[1, 16]
+        for (k = 0; k < C1FLAG_NUMBER; k++)
+        {
+            if (absCoeff[k] >= 2)
+            {
+                break;
+            }
+        }
+        firstC2Idx = k; // it is index of exact first coeff that value more than 2
+        int ref_sum = ref(absCoeff, numNonZero, firstC2Idx);
+        int opt_sum = (int)checked(opt, absCoeff, numNonZero, firstC2Idx);
+        if (ref_sum != opt_sum)
+            return false;
+    }
     return true;
 }
@@ -1407,7 +1820,7 @@
     {
         if (opt.cu[part].sse_pp)
         {
-            if (!check_pixelcmp(ref.cu[part].sse_pp, opt.cu[part].sse_pp))
+            if (!check_pixel_sse(ref.cu[part].sse_pp, opt.cu[part].sse_pp))
             {
                 printf("sse_pp[%s]: failed!\n", lumaPartStr[part]);
                 return false;
@@ -1416,7 +1829,7 @@

         if (opt.cu[part].sse_ss)
         {
-            if (!check_pixelcmp_ss(ref.cu[part].sse_ss, opt.cu[part].sse_ss))
+            if (!check_pixel_sse_ss(ref.cu[part].sse_ss, opt.cu[part].sse_ss))
             {
                 printf("sse_ss[%s]: failed!\n", lumaPartStr[part]);
                 return false;
@@ -1497,6 +1910,14 @@
         }
         if (part < NUM_CU_SIZES)
         {
+            if (opt.chroma[i].cu[part].sse_pp)
+            {
+                if (!check_pixel_sse(ref.chroma[i].cu[part].sse_pp, opt.chroma[i].cu[part].sse_pp))
+                {
+                    printf("chroma_sse_pp[%s][%s]: failed!\n", x265_source_csp_names[i], chromaPartStr[i][part]);
+                    return false;
+                }
+            }
             if (opt.chroma[i].cu[part].sub_ps)
             {
                 if (!check_pixel_sub_ps(ref.chroma[i].cu[part].sub_ps, opt.chroma[i].cu[part].sub_ps))
@@ -1843,6 +2264,51 @@
         }
     }

+    if (opt.saoCuStatsBO)
+    {
+        if (!check_saoCuStatsBO_t(ref.saoCuStatsBO, opt.saoCuStatsBO))
+        {
+            printf("saoCuStatsBO failed\n");
+            return false;
+        }
+    }
+
+    if (opt.saoCuStatsE0)
+    {
+        if (!check_saoCuStatsE0_t(ref.saoCuStatsE0, opt.saoCuStatsE0))
+        {
+            printf("saoCuStatsE0 failed\n");
+            return false;
+        }
+    }
+
+    if (opt.saoCuStatsE1)
+    {
+        if (!check_saoCuStatsE1_t(ref.saoCuStatsE1, opt.saoCuStatsE1))
+        {
+            printf("saoCuStatsE1 failed\n");
+            return false;
+        }
+    }
+
+    if (opt.saoCuStatsE2)
+    {
+        if (!check_saoCuStatsE2_t(ref.saoCuStatsE2, opt.saoCuStatsE2))
+        {
+            printf("saoCuStatsE2 failed\n");
+            return false;
+        }
+    }
+
+    if (opt.saoCuStatsE3)
+    {
+        if (!check_saoCuStatsE3_t(ref.saoCuStatsE3, opt.saoCuStatsE3))
+        {
+            printf("saoCuStatsE3 failed\n");
+            return false;
+        }
+    }
+
     if (opt.planecopy_sp)
     {
         if (!check_planecopy_sp(ref.planecopy_sp, opt.planecopy_sp))
@@ -1852,6 +2318,15 @@
         }
     }

+    if (opt.planecopy_sp_shl)
+    {
+        if (!check_planecopy_sp(ref.planecopy_sp_shl, opt.planecopy_sp_shl))
+        {
+            printf("planecopy_sp_shl failed\n");
+            return false;
+        }
+    }
+
     if (opt.planecopy_cp)
     {
         if (!check_planecopy_cp(ref.planecopy_cp, opt.planecopy_cp))
@@ -1887,6 +2362,22 @@
             return false;
         }
     }
+    if (opt.costCoeffNxN)
+    {
+        if (!check_costCoeffNxN(ref.costCoeffNxN, opt.costCoeffNxN))
+        {
+            printf("costCoeffNxN failed!\n");
+            return false;
+        }
+    }
+    if (opt.costCoeffRemain)
+    {
+        if (!check_costCoeffRemain(ref.costCoeffRemain, opt.costCoeffRemain))
+        {
+            printf("costCoeffRemain failed!\n");
+            return false;
+        }
+    }

     return true;
 }
@@ -2014,6 +2505,11 @@
             HEADER("[%s] copy_sp[%s]", x265_source_csp_names[i], chromaPartStr[i][part]);
             REPORT_SPEEDUP(opt.chroma[i].cu[part].copy_sp, ref.chroma[i].cu[part].copy_sp, pbuf1, 64, sbuf3, 128);
         }
+        if (opt.chroma[i].cu[part].sse_pp)
+        {
+            HEADER("[%s] sse_pp[%s]", x265_source_csp_names[i], chromaPartStr[i][part]);
+            REPORT_SPEEDUP(opt.chroma[i].cu[part].sse_pp, ref.chroma[i].cu[part].sse_pp, pbuf1, STRIDE, fref, STRIDE);
+        }
         if (opt.chroma[i].cu[part].sub_ps)
         {
             HEADER("[%s] sub_ps[%s]", x265_source_csp_names[i], chromaPartStr[i][part]);
@@ -2108,7 +2604,8 @@
     if ((i < BLOCK_64x64) && opt.cu[i].cpy2Dto1D_shl)
     {
         HEADER("cpy2Dto1D_shl[%dx%d]", 4 << i, 4 << i);
-        REPORT_SPEEDUP(opt.cu[i].cpy2Dto1D_shl, ref.cu[i].cpy2Dto1D_shl, sbuf1, sbuf2, STRIDE, MAX_TR_DYNAMIC_RANGE - X265_DEPTH - (i + 2));
+        const int shift = MAX_TR_DYNAMIC_RANGE - X265_DEPTH - (i + 2);
+        REPORT_SPEEDUP(opt.cu[i].cpy2Dto1D_shl, ref.cu[i].cpy2Dto1D_shl, sbuf1, sbuf2, STRIDE, X265_MAX(0, shift));
     }

     if ((i < BLOCK_64x64) && opt.cu[i].cpy2Dto1D_shr)
@@ -2244,6 +2741,49 @@
         REPORT_SPEEDUP(opt.saoCuOrgB0, ref.saoCuOrgB0, pbuf1, psbuf1, 64, 64, 64);
     }

+    if (opt.saoCuStatsBO)
+    {
+        int32_t stats[33], count[33];
+        HEADER0("saoCuStatsBO");
+        REPORT_SPEEDUP(opt.saoCuStatsBO, ref.saoCuStatsBO, pbuf2, pbuf3, 64, 60, 61, stats, count);
+    }
+
+    if (opt.saoCuStatsE0)
+    {
+        int32_t stats[33], count[33];
+        HEADER0("saoCuStatsE0");
+        REPORT_SPEEDUP(opt.saoCuStatsE0, ref.saoCuStatsE0, pbuf2, pbuf3, 64, 60, 61, stats, count);
+    }
+
+    if (opt.saoCuStatsE1)
+    {
+        int32_t stats[5], count[5];
+        int8_t upBuff1[MAX_CU_SIZE + 2];
+        memset(upBuff1, 1, sizeof(upBuff1));
+        HEADER0("saoCuStatsE1");
+        REPORT_SPEEDUP(opt.saoCuStatsE1, ref.saoCuStatsE1, pbuf2, pbuf3, 64, upBuff1 + 1,60, 61, stats, count);
+    }
+
+    if (opt.saoCuStatsE2)
+    {
+        int32_t stats[5], count[5];
+        int8_t upBuff1[MAX_CU_SIZE + 2];
+        int8_t upBufft[MAX_CU_SIZE + 2];
+        memset(upBuff1, 1, sizeof(upBuff1));
+        memset(upBufft, -1, sizeof(upBufft));
+        HEADER0("saoCuStatsE2");
+        REPORT_SPEEDUP(opt.saoCuStatsE2, ref.saoCuStatsE2, pbuf2, pbuf3, 64, upBuff1 + 1, upBufft + 1, 60, 61, stats, count);
+    }
+
+    if (opt.saoCuStatsE3)
+    {
+        int8_t upBuff1[MAX_CU_SIZE + 2];
+        int32_t stats[5], count[5];
+        memset(upBuff1, 1, sizeof(upBuff1));
+        HEADER0("saoCuStatsE3");
+        REPORT_SPEEDUP(opt.saoCuStatsE3, ref.saoCuStatsE3, pbuf2, pbuf3, 64, upBuff1 + 1, 60, 61, stats, count);
+    }
+
     if (opt.planecopy_sp)
     {
         HEADER0("planecopy_sp");
@@ -2283,4 +2823,30 @@
         coefBuf[3 + 3 * 32] = 0x0BAD;
         REPORT_SPEEDUP(opt.findPosFirstLast, ref.findPosFirstLast, coefBuf, 32, g_scan4x4[SCAN_DIAG]);
     }
+    if (opt.costCoeffNxN)
+    {
+        HEADER0("costCoeffNxN");
+        coeff_t coefBuf[32 * 32];
+        uint16_t tmpOut[16];
+        memset(coefBuf, 1, sizeof(coefBuf));
+        ALIGN_VAR_32(static uint8_t const, ctxSig[]) =
+        {
+            0, 1, 4, 5,
+            2, 3, 4, 5,
+            6, 6, 8, 8,
+            7, 7, 8, 8
+        };
+        uint8_t ctx[OFF_SIG_FLAG_CTX + NUM_SIG_FLAG_CTX_LUMA];
+        memset(ctx, 120, sizeof(ctx));
+
+        REPORT_SPEEDUP(opt.costCoeffNxN, ref.costCoeffNxN, g_scan4x4[SCAN_DIAG], coefBuf, 32, tmpOut, ctxSig, 0xFFFF, ctx, 1, 15, 32);
+    }
+    if (opt.costCoeffRemain)
+    {
+        HEADER0("costCoeffRemain");
+        uint16_t abscoefBuf[32 * 32];
+        memset(abscoefBuf, 0, sizeof(abscoefBuf));
+        memset(abscoefBuf + 32 * 31, 1, 32 * sizeof(uint16_t));
+        REPORT_SPEEDUP(opt.costCoeffRemain, ref.costCoeffRemain, abscoefBuf, 16, 3);
+    }
 }
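All of the new check_saoCuStats* and check_costCoeff* routines above follow the same ref-vs-opt idiom the harness has always used. Stripped to its skeleton, it looks roughly like this (a sketch with generic names; sao_stats_t, the buffer sizes, and the fixed array length 5 stand in for the real primitive types and geometry):

    #include <cstdlib>
    #include <cstring>

    typedef void (*sao_stats_t)(const unsigned char* fenc, const unsigned char* rec,
                                int stride, int endX, int endY, int* stats, int* count);

    // Skeleton of the harness idiom: seed both stats/count sets with identical
    // random values (the primitives accumulate into them rather than overwrite),
    // run the C reference and the optimized candidate on the same inputs, then
    // require bit-exact agreement on every output buffer.
    static bool check_primitive(sao_stats_t ref, sao_stats_t opt, const unsigned char* buf)
    {
        int statsRef[5], statsOpt[5], countRef[5], countOpt[5];

        for (int i = 0; i < 100; i++)
        {
            for (int x = 0; x < 5; x++)
            {
                statsRef[x] = statsOpt[x] = rand();
                countRef[x] = countOpt[x] = rand();
            }

            int stride = 16 * (rand() % 4 + 1); // randomized geometry, as in the real checks
            int endX = 64 - (rand() % 5) - 1;
            int endY = 64 - (rand() % 4) - 1;

            ref(buf, buf, stride, endX, endY, statsRef, countRef);
            opt(buf, buf, stride, endX, endY, statsOpt, countOpt); // real harness wraps this in checked()

            if (memcmp(statsRef, statsOpt, sizeof(statsRef)) ||
                memcmp(countRef, countOpt, sizeof(countRef)))
                return false;
        }
        return true;
    }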
View file
x265_1.7.tar.gz/source/test/pixelharness.h -> x265_1.8.tar.gz/source/test/pixelharness.h
Changed
@@ -66,7 +66,8 @@
     double double_test_buff[TEST_CASES][BUFFSIZE];

     bool check_pixelcmp(pixelcmp_t ref, pixelcmp_t opt);
-    bool check_pixelcmp_ss(pixelcmp_ss_t ref, pixelcmp_ss_t opt);
+    bool check_pixel_sse(pixel_sse_t ref, pixel_sse_t opt);
+    bool check_pixel_sse_ss(pixel_sse_ss_t ref, pixel_sse_ss_t opt);
     bool check_pixelcmp_x3(pixelcmp_x3_t ref, pixelcmp_x3_t opt);
     bool check_pixelcmp_x4(pixelcmp_x4_t ref, pixelcmp_x4_t opt);
     bool check_copy_pp(copy_pp_t ref, copy_pp_t opt);
@@ -100,6 +101,11 @@
     bool check_saoCuOrgE3_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt);
     bool check_saoCuOrgE3_32_t(saoCuOrgE3_t ref, saoCuOrgE3_t opt);
     bool check_saoCuOrgB0_t(saoCuOrgB0_t ref, saoCuOrgB0_t opt);
+    bool check_saoCuStatsBO_t(saoCuStatsBO_t ref, saoCuStatsBO_t opt);
+    bool check_saoCuStatsE0_t(saoCuStatsE0_t ref, saoCuStatsE0_t opt);
+    bool check_saoCuStatsE1_t(saoCuStatsE1_t ref, saoCuStatsE1_t opt);
+    bool check_saoCuStatsE2_t(saoCuStatsE2_t ref, saoCuStatsE2_t opt);
+    bool check_saoCuStatsE3_t(saoCuStatsE3_t ref, saoCuStatsE3_t opt);
     bool check_planecopy_sp(planecopy_sp_t ref, planecopy_sp_t opt);
     bool check_planecopy_cp(planecopy_cp_t ref, planecopy_cp_t opt);
     bool check_cutree_propagate_cost(cutree_propagate_cost ref, cutree_propagate_cost opt);
@@ -108,6 +114,8 @@
     bool check_calSign(sign_t ref, sign_t opt);
     bool check_scanPosLast(scanPosLast_t ref, scanPosLast_t opt);
     bool check_findPosFirstLast(findPosFirstLast_t ref, findPosFirstLast_t opt);
+    bool check_costCoeffNxN(costCoeffNxN_t ref, costCoeffNxN_t opt);
+    bool check_costCoeffRemain(costCoeffRemain_t ref, costCoeffRemain_t opt);

 public:
View file
x265_1.7.tar.gz/source/test/regression-tests.txt -> x265_1.8.tar.gz/source/test/regression-tests.txt
Changed
@@ -12,50 +12,50 @@
 # not auto-detected.

 BasketballDrive_1920x1080_50.y4m,--preset faster --aq-strength 2 --merange 190
-BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 32
+BasketballDrive_1920x1080_50.y4m,--preset medium --ctu 16 --max-tu-size 8 --subme 7 --qg-size 16 --cu-lossless
 BasketballDrive_1920x1080_50.y4m,--preset medium --keyint -1 --nr-inter 100 -F4 --no-sao
-BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3 --qg-size 16
+BasketballDrive_1920x1080_50.y4m,--preset slow --nr-intra 100 -F4 --aq-strength 3 --qg-size 16 --limit-refs 1
 BasketballDrive_1920x1080_50.y4m,--preset slower --lossless --chromaloc 3 --subme 0
 BasketballDrive_1920x1080_50.y4m,--preset superfast --psy-rd 1 --ctu 16 --no-wpp
 BasketballDrive_1920x1080_50.y4m,--preset ultrafast --signhide --colormatrix bt709
 BasketballDrive_1920x1080_50.y4m,--preset veryfast --tune zerolatency --no-temporal-mvp
-BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode
+BasketballDrive_1920x1080_50.y4m,--preset veryslow --crf 4 --cu-lossless --pmode --limit-refs 1
 Coastguard-4k.y4m,--preset medium --rdoq-level 1 --tune ssim --no-signhide --me umh
-Coastguard-4k.y4m,--preset slow --tune psnr --cbqpoffs -1 --crqpoffs 1
+Coastguard-4k.y4m,--preset slow --tune psnr --cbqpoffs -1 --crqpoffs 1 --limit-refs 1
 Coastguard-4k.y4m,--preset superfast --tune grain --overscan=crop
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --aq-mode 0 --sar 2 --range full
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset faster --max-tu-size 4 --min-cu-size 32
-CrowdRun_1920x1080_50_10bit_422.yuv,--preset medium --no-wpp --no-cutree --no-strong-intra-smoothing
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset medium --no-wpp --no-cutree --no-strong-intra-smoothing --limit-refs 1
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset slow --no-wpp --tune ssim --transfer smpte240m
-CrowdRun_1920x1080_50_10bit_422.yuv,--preset slower --tune ssim --tune fastdecode
+CrowdRun_1920x1080_50_10bit_422.yuv,--preset slower --tune ssim --tune fastdecode --limit-refs 2
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset superfast --weightp --no-wpp --sao
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset ultrafast --weightp --tune zerolatency --qg-size 16
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset veryfast --temporal-layers --tune grain
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset medium --dither --keyint -1 --rdoq-level 1
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset superfast --weightp --dither --no-psy-rd
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset ultrafast --weightp --no-wpp --no-open-gop
-CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers
+CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryfast --temporal-layers --repeat-headers --limit-refs 2
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset veryslow --tskip --tskip-fast --no-scenecut
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset medium --tune psnr --bframes 16
-DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd --qg-size 32
+DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset slow --temporal-layers --no-psy-rd --qg-size 32 --limit-refs 0 --cu-lossless
 DucksAndLegs_1920x1080_60_10bit_422.yuv,--preset superfast --weightp --qg-size 16
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset medium --nr-inter 500 -F4 --no-psy-rdoq
-DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0
+DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset slower --no-weightp --rdoq-level 0 --limit-refs 3
 DucksAndLegs_1920x1080_60_10bit_444.yuv,--preset veryfast --weightp --nr-intra 1000 -F4
 FourPeople_1280x720_60.y4m,--preset medium --qp 38 --no-psy-rd
 FourPeople_1280x720_60.y4m,--preset superfast --no-wpp --lookahead-slices 2
 Keiba_832x480_30.y4m,--preset medium --pmode --tune grain
-Keiba_832x480_30.y4m,--preset slower --fast-intra --nr-inter 500 -F4
+Keiba_832x480_30.y4m,--preset slower --fast-intra --nr-inter 500 -F4 --limit-refs 0
 Keiba_832x480_30.y4m,--preset superfast --no-fast-intra --nr-intra 1000 -F4
 Kimono1_1920x1080_24_10bit_444.yuv,--preset medium --min-cu-size 32
 Kimono1_1920x1080_24_10bit_444.yuv,--preset superfast --weightb
 KristenAndSara_1280x720_60.y4m,--preset medium --no-cutree --max-tu-size 16
-KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8
-KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16 --qg-size 16
+KristenAndSara_1280x720_60.y4m,--preset slower --pmode --max-tu-size 8 --limit-refs 0
+KristenAndSara_1280x720_60.y4m,--preset superfast --min-cu-size 16 --qg-size 16 --limit-refs 1
 KristenAndSara_1280x720_60.y4m,--preset ultrafast --strong-intra-smoothing
-NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset medium --tune grain --limit-refs 2
 NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset superfast --tune psnr
-News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 32
+News-4k.y4m,--preset medium --tune ssim --no-sao --qg-size 16
 News-4k.y4m,--preset superfast --lookahead-slices 6 --aq-mode 0
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset medium --no-weightp
 OldTownCross_1920x1080_50_10bit_422.yuv,--preset slower --tune fastdecode
@@ -66,16 +66,16 @@
 RaceHorses_416x240_30.y4m,--preset medium --tskip-fast --tskip
 RaceHorses_416x240_30.y4m,--preset slower --keyint -1 --rdoq-level 0
 RaceHorses_416x240_30.y4m,--preset superfast --no-cutree
-RaceHorses_416x240_30.y4m,--preset veryslow --tskip-fast --tskip
-RaceHorses_416x240_30_10bit.yuv,--preset fast --lookahead-slices 2 --b-intra
+RaceHorses_416x240_30.y4m,--preset veryslow --tskip-fast --tskip --limit-refs 3
+RaceHorses_416x240_30_10bit.yuv,--preset fast --lookahead-slices 2 --b-intra --limit-refs 1
 RaceHorses_416x240_30_10bit.yuv,--preset faster --rdoq-level 0 --dither
 RaceHorses_416x240_30_10bit.yuv,--preset slow --tune grain
-RaceHorses_416x240_30_10bit.yuv,--preset ultrafast --tune psnr
+RaceHorses_416x240_30_10bit.yuv,--preset ultrafast --tune psnr --limit-refs 1
 RaceHorses_416x240_30_10bit.yuv,--preset veryfast --weightb
-RaceHorses_416x240_30_10bit.yuv,--preset placebo
+RaceHorses_416x240_30_10bit.yuv,--preset placebo --limit-refs 1
 SteamLocomotiveTrain_2560x1600_60_10bit_crop.yuv,--preset medium --dither
 big_buck_bunny_360p24.y4m,--preset faster --keyint 240 --min-keyint 60 --rc-lookahead 200
-big_buck_bunny_360p24.y4m,--preset medium --keyint 60 --min-keyint 48 --weightb
+big_buck_bunny_360p24.y4m,--preset medium --keyint 60 --min-keyint 48 --weightb --limit-refs 3
 big_buck_bunny_360p24.y4m,--preset slow --psy-rdoq 2.0 --rdoq-level 1 --no-b-intra
 big_buck_bunny_360p24.y4m,--preset superfast --psy-rdoq 2.0
 big_buck_bunny_360p24.y4m,--preset ultrafast --deblock=2
@@ -83,20 +83,20 @@
 city_4cif_60fps.y4m,--preset medium --crf 4 --cu-lossless --sao-non-deblock
 city_4cif_60fps.y4m,--preset superfast --rdpenalty 1 --tu-intra-depth 2
 city_4cif_60fps.y4m,--preset slower --scaling-list default
-city_4cif_60fps.y4m,--preset veryslow --rdpenalty 2 --sao-non-deblock --no-b-intra
+city_4cif_60fps.y4m,--preset veryslow --rdpenalty 2 --sao-non-deblock --no-b-intra --limit-refs 0
 ducks_take_off_420_720p50.y4m,--preset fast --deblock 6 --bframes 16 --rc-lookahead 40
-ducks_take_off_420_720p50.y4m,--preset faster --qp 24 --deblock -6
+ducks_take_off_420_720p50.y4m,--preset faster --qp 24 --deblock -6 --limit-refs 2
 ducks_take_off_420_720p50.y4m,--preset medium --tskip --tskip-fast --constrained-intra
 ducks_take_off_420_720p50.y4m,--preset slow --scaling-list default --qp 40
 ducks_take_off_420_720p50.y4m,--preset ultrafast --constrained-intra --rd 1
 ducks_take_off_420_720p50.y4m,--preset veryslow --constrained-intra --bframes 2
 ducks_take_off_444_720p50.y4m,--preset medium --qp 38 --no-scenecut
-ducks_take_off_444_720p50.y4m,--preset superfast --weightp --rd 0
-ducks_take_off_444_720p50.y4m,--preset slower --psy-rd 1 --psy-rdoq 2.0 --rdoq-level 1
+ducks_take_off_444_720p50.y4m,--preset superfast --weightp --rd 0 --limit-refs 2
+ducks_take_off_444_720p50.y4m,--preset slower --psy-rd 1 --psy-rdoq 2.0 --rdoq-level 1 --limit-refs 1
 mobile_calendar_422_ntsc.y4m,--preset medium --bitrate 500 -F4
 mobile_calendar_422_ntsc.y4m,--preset slower --tskip --tskip-fast
 mobile_calendar_422_ntsc.y4m,--preset superfast --weightp --rd 0
-mobile_calendar_422_ntsc.y4m,--preset veryslow --tskip
+mobile_calendar_422_ntsc.y4m,--preset veryslow --tskip --limit-refs 2
 old_town_cross_444_720p50.y4m,--preset faster --rd 1 --tune zero-latency
 old_town_cross_444_720p50.y4m,--preset medium --keyint -1 --no-weightp --ref 6
 old_town_cross_444_720p50.y4m,--preset slow --rdoq-level 1 --early-skip --ref 7 --no-b-pyramid
@@ -113,12 +113,19 @@
 vtc1nw_422_ntsc.y4m,--preset slower --nr-inter 1000 -F4 --tune fast-decode --qg-size 16
 vtc1nw_422_ntsc.y4m,--preset superfast --weightp --nr-intra 100 -F4
 washdc_422_ntsc.y4m,--preset faster --rdoq-level 1 --max-merge 5
-washdc_422_ntsc.y4m,--preset medium --no-weightp --max-tu-size 4
-washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2 --qg-size 32
+washdc_422_ntsc.y4m,--preset medium --no-weightp --max-tu-size 4 --limit-refs 1
+washdc_422_ntsc.y4m,--preset slower --psy-rdoq 2.0 --rdoq-level 2 --qg-size 32 --limit-refs 1
 washdc_422_ntsc.y4m,--preset superfast --psy-rd 1 --tune zerolatency
 washdc_422_ntsc.y4m,--preset ultrafast --weightp --tu-intra-depth 4
 washdc_422_ntsc.y4m,--preset veryfast --tu-inter-depth 4
-washdc_422_ntsc.y4m,--preset veryslow --crf 4 --cu-lossless
+washdc_422_ntsc.y4m,--preset veryslow --crf 4 --cu-lossless --limit-refs 3
+BasketballDrive_1920x1080_50.y4m,--preset medium --no-cutree --analysis-mode=save --bitrate 15000,--preset medium --no-cutree --analysis-mode=load --bitrate 13000,--preset medium --no-cutree --analysis-mode=load --bitrate 11000,--preset medium --no-cutree --analysis-mode=load --bitrate 9000,--preset medium --no-cutree --analysis-mode=load --bitrate 7000
+NebutaFestival_2560x1600_60_10bit_crop.yuv,--preset slow --no-cutree --analysis-mode=save --bitrate 15000,--preset slow --no-cutree --analysis-mode=load --bitrate 13000,--preset slow --no-cutree --analysis-mode=load --bitrate 11000,--preset slow --no-cutree --analysis-mode=load --bitrate 9000,--preset slow --no-cutree --analysis-mode=load --bitrate 7000
+old_town_cross_444_720p50.y4m,--preset veryslow --no-cutree --analysis-mode=save --bitrate 15000 --early-skip,--preset veryslow --no-cutree --analysis-mode=load --bitrate 13000 --early-skip,--preset veryslow --no-cutree --analysis-mode=load --bitrate 11000 --early-skip,--preset veryslow --no-cutree --analysis-mode=load --bitrate 9000 --early-skip,--preset veryslow --no-cutree --analysis-mode=load --bitrate 7000 --early-skip
+Johnny_1280x720_60.y4m,--preset medium --no-cutree --analysis-mode=save --bitrate 15000 --tskip-fast,--preset medium --no-cutree --analysis-mode=load --bitrate 13000 --tskip-fast,--preset medium --no-cutree --analysis-mode=load --bitrate 11000 --tskip-fast,--preset medium --no-cutree --analysis-mode=load --bitrate 9000 --tskip-fast,--preset medium --no-cutree --analysis-mode=load --bitrate 7000 --tskip-fast
+BasketballDrive_1920x1080_50.y4m,--preset medium --recon-y4m-exec "ffplay -i pipe:0 -autoexit"
+FourPeople_1280x720_60.y4m,--preset ultrafast --recon-y4m-exec "ffplay -i pipe:0 -autoexit"
+FourPeople_1280x720_60.y4m,--preset veryslow --recon-y4m-exec "ffplay -i pipe:0 -autoexit"

 # interlace test, even though input YUV is not field seperated
 CrowdRun_1920x1080_50_10bit_422.yuv,--preset fast --interlace bff
View file
x265_1.7.tar.gz/source/test/smoke-tests.txt -> x265_1.8.tar.gz/source/test/smoke-tests.txt
Changed
@@ -6,14 +6,14 @@

 big_buck_bunny_360p24.y4m,--preset=superfast --bitrate 400 --vbv-bufsize 600 --vbv-maxrate 400 --hrd --aud --repeat-headers
 big_buck_bunny_360p24.y4m,--preset=medium --bitrate 1000 -F4 --cu-lossless --scaling-list default
-big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --cu-stats --pme --qg-size 16
+big_buck_bunny_360p24.y4m,--preset=slower --no-weightp --pme --qg-size 16
 washdc_422_ntsc.y4m,--preset=faster --no-strong-intra-smoothing --keyint 1 --qg-size 16
 washdc_422_ntsc.y4m,--preset=medium --qp 40 --nr-inter 400 -F4
 washdc_422_ntsc.y4m,--preset=veryslow --pmode --tskip --rdoq-level 0
 old_town_cross_444_720p50.y4m,--preset=ultrafast --weightp --keyint -1
 old_town_cross_444_720p50.y4m,--preset=fast --keyint 20 --min-cu-size 16
 old_town_cross_444_720p50.y4m,--preset=slow --sao-non-deblock --pmode --qg-size 32
-RaceHorses_416x240_30_10bit.yuv,--preset=veryfast --cu-stats --max-tu-size 8
+RaceHorses_416x240_30_10bit.yuv,--preset=veryfast --max-tu-size 8
 RaceHorses_416x240_30_10bit.yuv,--preset=slower --bitrate 500 -F4 --rdoq-level 1
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset=ultrafast --constrained-intra --min-keyint 5 --keyint 10
 CrowdRun_1920x1080_50_10bit_444.yuv,--preset=medium --max-tu-size 16
View file
x265_1.7.tar.gz/source/test/testbench.cpp -> x265_1.8.tar.gz/source/test/testbench.cpp
Changed
@@ -32,7 +32,7 @@
 #include "param.h"
 #include "cpu.h"

-using namespace x265;
+using namespace X265_NS;

 const char* lumaPartStr[NUM_PU_SIZES] =
 {
@@ -95,7 +95,7 @@

 int main(int argc, char *argv[])
 {
-    int cpuid = x265::cpu_detect();
+    int cpuid = X265_NS::cpu_detect();
     const char *testname = 0;

     if (!(argc & 1))
@@ -137,8 +137,7 @@
     }

     int seed = (int)time(NULL);
-    const char *bpp[] = { "8bpp", "16bpp" };
-    printf("Using random seed %X %s\n", seed, bpp[HIGH_BIT_DEPTH]);
+    printf("Using random seed %X %dbit\n", seed, X265_DEPTH);
     srand(seed);

     // To disable classes of tests, simply comment them out in this list
@@ -174,7 +173,7 @@

     for (int i = 0; test_arch[i].flag; i++)
     {
-        if (test_arch[i].flag & cpuid)
+        if ((test_arch[i].flag & cpuid) == test_arch[i].flag)
         {
             printf("Testing primitives: %s\n", test_arch[i].name);
             fflush(stdout);
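Beyond the renames, the one behavioral change here is the CPU-flag test in the dispatch loop: `flag & cpuid` passes if any required feature bit is present, while `(flag & cpuid) == flag` demands all of them, which matters once a primitive table is keyed to a combination of features. A standalone illustration (the bit values below are invented for the example):

    #include <cstdio>

    enum { CPU_SSE4 = 1 << 0, CPU_AVX2 = 1 << 1, CPU_BMI2 = 1 << 2 }; // illustrative bits

    int main()
    {
        int required = CPU_AVX2 | CPU_BMI2; // a primitive set needing two features
        int detected = CPU_SSE4 | CPU_AVX2; // this machine lacks BMI2

        printf("any-bit test:  %d\n", (required & detected) != 0);        // 1: would wrongly run the tests
        printf("all-bits test: %d\n", (required & detected) == required); // 0: correctly skips them
        return 0;
    }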
View file
x265_1.7.tar.gz/source/test/testharness.h -> x265_1.8.tar.gz/source/test/testharness.h
Changed
@@ -31,18 +31,13 @@
 #pragma warning(disable: 4324) // structure was padded due to __declspec(align())
 #endif

-#if HIGH_BIT_DEPTH
-#define BIT_DEPTH 10
-#else
-#define BIT_DEPTH 8
-#endif
-#define PIXEL_MAX ((1 << BIT_DEPTH) - 1)
+#define PIXEL_MAX ((1 << X265_DEPTH) - 1)
 #define PIXEL_MIN 0
 #define SHORT_MAX 32767
 #define SHORT_MIN -32767
 #define UNSIGNED_SHORT_MAX 65535

-using namespace x265;
+using namespace X265_NS;

 extern const char* lumaPartStr[NUM_PU_SIZES];
 extern const char* const* chromaPartStr[X265_CSP_COUNT];
@@ -123,14 +118,14 @@
 extern "C" {
 #if X265_ARCH_X86
-int x265_stack_pagealign(int (*func)(), int align);
+int PFX(stack_pagealign)(int (*func)(), int align);

 /* detect when callee-saved regs aren't saved
  * needs an explicit asm check because it only sometimes crashes in normal use. */
-intptr_t x265_checkasm_call(intptr_t (*func)(), int *ok, ...);
-float x265_checkasm_call_float(float (*func)(), int *ok, ...);
+intptr_t PFX(checkasm_call)(intptr_t (*func)(), int *ok, ...);
+float PFX(checkasm_call_float)(float (*func)(), int *ok, ...);
 #else
-#define x265_stack_pagealign(func, align) func()
+#define PFX(stack_pagealign)(func, align) func()
 #endif

 #if X86_64
@@ -144,24 +139,24 @@
  * overwrite the junk written to the stack so there's no guarantee that it will always
  * detect all functions that assumes zero-extension.
  */
-void x265_checkasm_stack_clobber(uint64_t clobber, ...);
+void PFX(checkasm_stack_clobber)(uint64_t clobber, ...);

 #define checked(func, ...) ( \
     m_ok = 1, m_rand = (rand() & 0xffff) * 0x0001000100010001ULL, \
-    x265_checkasm_stack_clobber(m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, \
+    PFX(checkasm_stack_clobber)(m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, \
         m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, \
         m_rand, m_rand, m_rand, m_rand, m_rand), /* max_args+6 */ \
-    x265_checkasm_call((intptr_t(*)())func, &m_ok, 0, 0, 0, 0, __VA_ARGS__))
+    PFX(checkasm_call)((intptr_t(*)())func, &m_ok, 0, 0, 0, 0, __VA_ARGS__))

 #define checked_float(func, ...) ( \
     m_ok = 1, m_rand = (rand() & 0xffff) * 0x0001000100010001ULL, \
-    x265_checkasm_stack_clobber(m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, \
+    PFX(checkasm_stack_clobber)(m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, \
         m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, m_rand, \
         m_rand, m_rand, m_rand, m_rand, m_rand), /* max_args+6 */ \
-    x265_checkasm_call_float((float(*)())func, &m_ok, 0, 0, 0, 0, __VA_ARGS__))
+    PFX(checkasm_call_float)((float(*)())func, &m_ok, 0, 0, 0, 0, __VA_ARGS__))

 #define reportfail() if (!m_ok) { fflush(stdout); fprintf(stderr, "stack clobber check failed at %s:%d", __FILE__, __LINE__); abort(); }
 #elif ARCH_X86
-#define checked(func, ...) x265_checkasm_call((intptr_t(*)())func, &m_ok, __VA_ARGS__);
-#define checked_float(func, ...) x265_checkasm_call_float((float(*)())func, &m_ok, __VA_ARGS__);
+#define checked(func, ...) PFX(checkasm_call)((intptr_t(*)())func, &m_ok, __VA_ARGS__);
+#define checked_float(func, ...) PFX(checkasm_call_float)((float(*)())func, &m_ok, __VA_ARGS__);

 #else // if X86_64
 #define checked(func, ...) func(__VA_ARGS__)
View file
x265_1.8.tar.gz/source/x265-extras.cpp
Added
@@ -0,0 +1,341 @@
+/*****************************************************************************
+ * Copyright (C) 2015 x265 project
+ *
+ * Authors: Steve Borho <steve@borho.org>
+ *          Selvakumar Nithiyaruban <selvakumar@multicorewareinc.com>
+ *          Divya Manivannan <divya@multicorewareinc.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#include "x265.h"
+#include "x265-extras.h"
+
+#include "common.h"
+
+using namespace X265_NS;
+
+static const char* summaryCSVHeader =
+    "Command, Date/Time, Elapsed Time, FPS, Bitrate, "
+    "Y PSNR, U PSNR, V PSNR, Global PSNR, SSIM, SSIM (dB), "
+    "I count, I ave-QP, I kbps, I-PSNR Y, I-PSNR U, I-PSNR V, I-SSIM (dB), "
+    "P count, P ave-QP, P kbps, P-PSNR Y, P-PSNR U, P-PSNR V, P-SSIM (dB), "
+    "B count, B ave-QP, B kbps, B-PSNR Y, B-PSNR U, B-PSNR V, B-SSIM (dB), "
+    "Version\n";
+
+FILE* x265_csvlog_open(const x265_api& api, const x265_param& param, const char* fname, int level)
+{
+    if (sizeof(x265_stats) != api.sizeof_stats || sizeof(x265_picture) != api.sizeof_picture)
+    {
+        fprintf(stderr, "extras [error]: structure size skew, unable to create CSV logfile\n");
+        return NULL;
+    }
+
+    FILE *csvfp = fopen(fname, "r");
+    if (csvfp)
+    {
+        /* file already exists, re-open for append */
+        fclose(csvfp);
+        return fopen(fname, "ab");
+    }
+    else
+    {
+        /* new CSV file, write header */
+        csvfp = fopen(fname, "wb");
+        if (csvfp)
+        {
+            if (level)
+            {
+                fprintf(csvfp, "Encode Order, Type, POC, QP, Bits, ");
+                if (param.rc.rateControlMode == X265_RC_CRF)
+                    fprintf(csvfp, "RateFactor, ");
+                fprintf(csvfp, "Y PSNR, U PSNR, V PSNR, YUV PSNR, SSIM, SSIM (dB), List 0, List 1");
+                /* detailed performance statistics */
+                fprintf(csvfp, ", DecideWait (ms), Row0Wait (ms), Wall time (ms), Ref Wait Wall (ms), Total CTU time (ms), Stall Time (ms), Avg WPP, Row Blocks");
+                if (level >= 2)
+                {
+                    uint32_t size = param.maxCUSize;
+                    for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
+                    {
+                        fprintf(csvfp, ", Intra %dx%d DC, Intra %dx%d Planar, Intra %dx%d Ang", size, size, size, size, size, size);
+                        size /= 2;
+                    }
+                    fprintf(csvfp, ", 4x4");
+                    size = param.maxCUSize;
+                    if (param.bEnableRectInter)
+                    {
+                        for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
+                        {
+                            fprintf(csvfp, ", Inter %dx%d, Inter %dx%d (Rect)", size, size, size, size);
+                            if (param.bEnableAMP)
+                                fprintf(csvfp, ", Inter %dx%d (Amp)", size, size);
+                            size /= 2;
+                        }
+                    }
+                    else
+                    {
+                        for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
+                        {
+                            fprintf(csvfp, ", Inter %dx%d", size, size);
+                            size /= 2;
+                        }
+                    }
+                    size = param.maxCUSize;
+                    for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
+                    {
+                        fprintf(csvfp, ", Skip %dx%d", size, size);
+                        size /= 2;
+                    }
+                    size = param.maxCUSize;
+                    for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
+                    {
+                        fprintf(csvfp, ", Merge %dx%d", size, size);
+                        size /= 2;
+                    }
+                    fprintf(csvfp, ", Avg Luma Distortion, Avg Chroma Distortion, Avg psyEnergy, Avg Luma Level, Max Luma Level");
+                }
+                fprintf(csvfp, "\n");
+            }
+            else
+                fputs(summaryCSVHeader, csvfp);
+        }
+        return csvfp;
+    }
+}
+
+// per frame CSV logging
+void x265_csvlog_frame(FILE* csvfp, const x265_param& param, const x265_picture& pic, int level)
+{
+    if (!csvfp)
+        return;
+
+    const x265_frame_stats* frameStats = &pic.frameData;
+    fprintf(csvfp, "%d, %c-SLICE, %4d, %2.2lf, %10d,", frameStats->encoderOrder, frameStats->sliceType, frameStats->poc, frameStats->qp, (int)frameStats->bits);
+    if (param.rc.rateControlMode == X265_RC_CRF)
+        fprintf(csvfp, "%.3lf,", frameStats->rateFactor);
+    if (param.bEnablePsnr)
+        fprintf(csvfp, "%.3lf, %.3lf, %.3lf, %.3lf,", frameStats->psnrY, frameStats->psnrU, frameStats->psnrV, frameStats->psnr);
+    else
+        fputs(" -, -, -, -,", csvfp);
+    if (param.bEnableSsim)
+        fprintf(csvfp, " %.6f, %6.3f,", frameStats->ssim, x265_ssim2dB(frameStats->ssim));
+    else
+        fputs(" -, -,", csvfp);
+    if (frameStats->sliceType == 'I')
+        fputs(" -, -,", csvfp);
+    else
+    {
+        int i = 0;
+        while (frameStats->list0POC[i] != -1)
+            fprintf(csvfp, "%d ", frameStats->list0POC[i++]);
+        fprintf(csvfp, ",");
+        if (frameStats->sliceType != 'P')
+        {
+            i = 0;
+            while (frameStats->list1POC[i] != -1)
+                fprintf(csvfp, "%d ", frameStats->list1POC[i++]);
+            fprintf(csvfp, ",");
+        }
+        else
+            fputs(" -,", csvfp);
+    }
+    fprintf(csvfp, " %.1lf, %.1lf, %.1lf, %.1lf, %.1lf, %.1lf,", frameStats->decideWaitTime, frameStats->row0WaitTime, frameStats->wallTime, frameStats->refWaitWallTime, frameStats->totalCTUTime, frameStats->stallTime);
+    fprintf(csvfp, " %.3lf, %d", frameStats->avgWPP, frameStats->countRowBlocks);
+    if (level >= 2)
+    {
+        for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
+            fprintf(csvfp, ", %5.2lf%%, %5.2lf%%, %5.2lf%%", frameStats->cuStats.percentIntraDistribution[depth][0], frameStats->cuStats.percentIntraDistribution[depth][1], frameStats->cuStats.percentIntraDistribution[depth][2]);
+        fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentIntraNxN);
+        if (param.bEnableRectInter)
+        {
+            for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
+            {
+                fprintf(csvfp, ", %5.2lf%%, %5.2lf%%", frameStats->cuStats.percentInterDistribution[depth][0], frameStats->cuStats.percentInterDistribution[depth][1]);
+                if (param.bEnableAMP)
+                    fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentInterDistribution[depth][2]);
+            }
+        }
+        else
+        {
+            for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
+                fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentInterDistribution[depth][0]);
+        }
+        for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
+            fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentSkipCu[depth]);
+        for (uint32_t depth = 0; depth <= g_maxCUDepth; depth++)
+            fprintf(csvfp, ", %5.2lf%%", frameStats->cuStats.percentMergeCu[depth]);
+        fprintf(csvfp, ", %.2lf, %.2lf, %.2lf, %.2lf, %d", frameStats->avgLumaDistortion, frameStats->avgChromaDistortion, frameStats->avgPsyEnergy, frameStats->avgLumaLevel, frameStats->maxLumaLevel);
+    }
+    fprintf(csvfp, "\n");
+    fflush(stderr);
+}
+
+void x265_csvlog_encode(FILE* csvfp, const x265_api& api, const x265_param& param, const x265_stats& stats, int level, int argc, char** argv)
+{
+    if (!csvfp)
+        return;
+
+    if (level)
+    {
+        // adding summary to a per-frame csv log file, so it needs a summary header
+        fprintf(csvfp, "\nSummary\n");
+        fputs(summaryCSVHeader, csvfp);
+    }
+
+    // CLI arguments or other
+    for (int i = 1; i < argc; i++)
+    {
+        if (i) fputc(' ', csvfp);
+        fputs(argv[i], csvfp);
+    }
+
+    // current date and time
+    time_t now;
+    struct tm* timeinfo;
+    time(&now);
+    timeinfo = localtime(&now);
+    char buffer[200];
+    strftime(buffer, 128, "%c", timeinfo);
+    fprintf(csvfp, ", %s, ", buffer);
+
+    // elapsed time, fps, bitrate
+    fprintf(csvfp, "%.2f, %.2f, %.2f,",
+        stats.elapsedEncodeTime, stats.encodedPictureCount / stats.elapsedEncodeTime, stats.bitrate);
+
+    if (param.bEnablePsnr)
+        fprintf(csvfp, " %.3lf, %.3lf, %.3lf, %.3lf,",
+            stats.globalPsnrY / stats.encodedPictureCount, stats.globalPsnrU / stats.encodedPictureCount,
+            stats.globalPsnrV / stats.encodedPictureCount, stats.globalPsnr);
+    else
+        fprintf(csvfp, " -, -, -, -,");
+    if (param.bEnableSsim)
+        fprintf(csvfp, " %.6f, %6.3f,", stats.globalSsim, x265_ssim2dB(stats.globalSsim));
+    else
+        fprintf(csvfp, " -, -,");
+
+    if (stats.statsI.numPics)
+    {
+        fprintf(csvfp, " %-6u, %2.2lf, %-8.2lf,", stats.statsI.numPics, stats.statsI.avgQp, stats.statsI.bitrate);
+        if (param.bEnablePsnr)
+            fprintf(csvfp, " %.3lf, %.3lf, %.3lf,", stats.statsI.psnrY, stats.statsI.psnrU, stats.statsI.psnrV);
+        else
+            fprintf(csvfp, " -, -, -,");
+        if (param.bEnableSsim)
+            fprintf(csvfp, " %.3lf,", stats.statsI.ssim);
+        else
+            fprintf(csvfp, " -,");
+    }
+    else
+        fprintf(csvfp, " -, -, -, -, -, -, -,");
+
+    if (stats.statsP.numPics)
+    {
+        fprintf(csvfp, " %-6u, %2.2lf, %-8.2lf,", stats.statsP.numPics, stats.statsP.avgQp, stats.statsP.bitrate);
+        if (param.bEnablePsnr)
+            fprintf(csvfp, " %.3lf, %.3lf, %.3lf,", stats.statsP.psnrY, stats.statsP.psnrU, stats.statsP.psnrV);
+        else
+            fprintf(csvfp, " -, -, -,");
+        if (param.bEnableSsim)
+            fprintf(csvfp, " %.3lf,", stats.statsP.ssim);
+        else
+            fprintf(csvfp, " -,");
+    }
+    else
+        fprintf(csvfp, " -, -, -, -, -, -, -,");
+
+    if (stats.statsB.numPics)
+    {
+        fprintf(csvfp, " %-6u, %2.2lf, %-8.2lf,", stats.statsB.numPics, stats.statsB.avgQp, stats.statsB.bitrate);
+        if (param.bEnablePsnr)
+            fprintf(csvfp, " %.3lf, %.3lf, %.3lf,", stats.statsB.psnrY, stats.statsB.psnrU, stats.statsB.psnrV);
+        else
+            fprintf(csvfp, " -, -, -,");
+        if (param.bEnableSsim)
+            fprintf(csvfp, " %.3lf,", stats.statsB.ssim);
+        else
+            fprintf(csvfp, " -,");
+    }
+    else
+        fprintf(csvfp, " -, -, -, -, -, -, -,");
+
+    fprintf(csvfp, " %s\n", api.version_str);
+}
+
+/* The dithering algorithm is based on Sierra-2-4A error diffusion. */
+static void ditherPlane(pixel *dst, int dstStride, uint16_t *src, int srcStride,
+                        int width, int height, int16_t *errors, int bitDepth)
+{
+    const int lShift = 16 - bitDepth;
+    const int rShift = 16 - bitDepth + 2;
+    const int half = (1 << (16 - bitDepth + 1));
+    const int pixelMax = (1 << bitDepth) - 1;
+
+    memset(errors, 0, (width + 1) * sizeof(int16_t));
+    int pitch = 1;
+    for (int y = 0; y < height; y++, src += srcStride, dst += dstStride)
+    {
+        int16_t err = 0;
+        for (int x = 0; x < width; x++)
+        {
+            err = err * 2 + errors[x] + errors[x + 1];
+            dst[x * pitch] = (pixel)x265_clip3(0, pixelMax, ((src[x * 1] << 2) + err + half) >> rShift);
+            errors[x] = err = src[x * pitch] - (dst[x * pitch] << lShift);
+        }
+    }
+}
+
+void x265_dither_image(const x265_api& api, x265_picture& picIn, int picWidth, int picHeight, int16_t *errorBuf, int bitDepth)
+{
+    if (sizeof(x265_picture) != api.sizeof_picture)
+    {
+        fprintf(stderr, "extras [error]: structure size skew, unable to dither\n");
+        return;
+    }
+
+    if (picIn.bitDepth <= 8)
+    {
+        fprintf(stderr, "extras [error]: dither support enabled only for input bitdepth > 8\n");
+        return;
+    }
+
+    /* This portion of code is from readFrame in x264. */
+    for (int i = 0; i < x265_cli_csps[picIn.colorSpace].planes; i++)
+    {
+        if ((picIn.bitDepth & 7) && (picIn.bitDepth != 16))
+        {
+            /* upconvert non 16bit high depth planes to 16bit */
+            uint16_t *plane = (uint16_t*)picIn.planes[i];
+            uint32_t pixelCount = x265_picturePlaneSize(picIn.colorSpace, picWidth, picHeight, i);
+            int lShift = 16 - picIn.bitDepth;
+
+            /* This loop assumes width is equal to stride which
+             * happens to be true for file reader outputs */
+            for (uint32_t j = 0; j < pixelCount; j++)
+                plane[j] = plane[j] << lShift;
+        }
+    }
+
+    for (int i = 0; i < x265_cli_csps[picIn.colorSpace].planes; i++)
+    {
+        int height = (int)(picHeight >> x265_cli_csps[picIn.colorSpace].height[i]);
+        int width = (int)(picWidth >> x265_cli_csps[picIn.colorSpace].width[i]);
+
+        ditherPlane(((pixel*)picIn.planes[i]), picIn.stride[i] / sizeof(pixel), ((uint16_t*)picIn.planes[i]),
+                    picIn.stride[i] / 2, width, height, errorBuf, bitDepth);
+    }
+}
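Editor's note: the inner loop of ditherPlane rounds a 16-bit sample down to the target depth while feeding the truncation error forward along the row. A standalone trace of that arithmetic for a single pixel (the constants mirror ditherPlane for an 8-bit target; the sample value is arbitrary):

#include <cstdio>
#include <cstdint>
#include <algorithm>

int main()
{
    const int bitDepth = 8;                    // target depth, as in ditherPlane
    const int lShift = 16 - bitDepth;          // 8
    const int rShift = 16 - bitDepth + 2;      // 10
    const int half = 1 << (16 - bitDepth + 1); // 512
    const int pixelMax = (1 << bitDepth) - 1;  // 255

    uint16_t src = 32832; // arbitrary 16-bit sample
    int16_t err = 0;      // error diffused in from neighbours

    // same rounding as the dst[] assignment in ditherPlane
    int dst = std::min(std::max(((src << 2) + err + half) >> rShift, 0), pixelMax);
    // residual that ditherPlane stores in errors[] for the next row
    int residual = src - (dst << lShift);

    printf("dst=%d residual=%d\n", dst, residual); // prints: dst=128 residual=64
    return 0;
}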
View file
x265_1.8.tar.gz/source/x265-extras.h
Added
@@ -0,0 +1,66 @@
+/*****************************************************************************
+ * Copyright (C) 2015 x265 project
+ *
+ * Authors: Steve Borho <steve@borho.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02111, USA.
+ *
+ * This program is also available under a commercial proprietary license.
+ * For more information, contact us at license @ x265.com.
+ *****************************************************************************/
+
+#ifndef X265_EXTRAS_H
+#define X265_EXTRAS_H 1
+
+#include "x265.h"
+
+#include <stdio.h>
+#include <stdint.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#if _WIN32
+#define LIBAPI __declspec(dllexport)
+#else
+#define LIBAPI
+#endif
+
+/* Open a CSV log file. On success it returns a file handle which must be passed
+ * to x265_csvlog_frame() and/or x265_csvlog_encode(). The file handle must be
+ * closed by the caller using fclose(). If level is 0, then no frame logging
+ * header is written to the file. This function will return NULL if it is unable
+ * to open the file for write or if it detects a structure size skew */
+LIBAPI FILE* x265_csvlog_open(const x265_api& api, const x265_param& param, const char* fname, int level);
+
+/* Log frame statistics to the CSV file handle. level should have been non-zero
+ * in the call to x265_csvlog_open() if this function is called. */
+LIBAPI void x265_csvlog_frame(FILE* csvfp, const x265_param& param, const x265_picture& pic, int level);
+
+/* Log final encode statistics to the CSV file handle. 'argc' and 'argv' are
+ * intended to be command line arguments passed to the encoder. Encode
+ * statistics should be queried from the encoder just prior to closing it. */
+LIBAPI void x265_csvlog_encode(FILE* csvfp, const x265_api& api, const x265_param& param, const x265_stats& stats, int level, int argc, char** argv);
+
+/* In-place downshift from a bit-depth greater than 8 to a bit-depth of 8, using
+ * the residual bits to dither each row. */
+LIBAPI void x265_dither_image(const x265_api& api, x265_picture&, int picWidth, int picHeight, int16_t *errorBuf, int bitDepth);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
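Editor's note: a minimal sketch of how a client drives this logging interface, assuming an encoder and param obtained through the normal x265 API; the encode loop itself and error handling are elided:

// Sketch only: `encoder`, `param` and the recon picture come from a normal
// x265 encode session; `api` was obtained via x265_api_get()/x265_api_query().
#include "x265.h"
#include "x265-extras.h"

void log_example(const x265_api* api, x265_encoder* encoder,
                 x265_param* param, int argc, char** argv)
{
    int level = 1; // non-zero: per-frame logging with frame header
    FILE* csvfp = x265_csvlog_open(*api, *param, "encode.csv", level);
    if (!csvfp)
        return; // open failed, or structure size skew was detected

    // ... inside the encode loop, after each x265_encoder_encode() that
    // returns an output picture, log its statistics from the recon pic:
    //     x265_csvlog_frame(csvfp, *param, pic_out, level);

    // just before closing the encoder, append the run summary
    x265_stats stats;
    api->encoder_get_stats(encoder, &stats, sizeof(stats));
    x265_csvlog_encode(csvfp, *api, *param, stats, level, argc, argv);

    fclose(csvfp); // per the header comment, the caller owns the handle
}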
View file
x265_1.7.tar.gz/source/x265.cpp -> x265_1.8.tar.gz/source/x265.cpp
Changed
@@ -25,15 +25,17 @@
 #pragma warning(disable: 4127) // conditional expression is constant, yes I know
 #endif

+#include "x265.h"
+#include "x265-extras.h"
+#include "x265cli.h"
+
+#include "common.h"
 #include "input/input.h"
 #include "output/output.h"
 #include "output/reconplay.h"
-#include "filters/filters.h"
-#include "common.h"
+
 #include "param.h"
 #include "cpu.h"
-#include "x265.h"
-#include "x265cli.h"

 #if HAVE_VLD
 /* Visual Leak Detector */
@@ -59,7 +61,7 @@
 #define SetThreadExecutionState(es)
 #endif

-using namespace x265;
+using namespace X265_NS;

 /* Ctrl-C handler */
 static volatile sig_atomic_t b_ctrl_c /* = 0 */;
@@ -74,12 +76,15 @@
     ReconFile* recon;
     OutputFile* output;
     FILE* qpfile;
+    FILE* csvfpt;
+    const char* csvfn;
     const char* reconPlayCmd;
     const x265_api* api;
     x265_param* param;
     bool bProgress;
     bool bForceY4m;
     bool bDither;
+    int csvLogLevel;
     uint32_t seek;              // number of frames to skip from the beginning
     uint32_t framesToBeEncoded; // number of frames to encode
     uint64_t totalbytes;
@@ -95,6 +100,8 @@
         recon = NULL;
         output = NULL;
         qpfile = NULL;
+        csvfpt = NULL;
+        csvfn = NULL;
         reconPlayCmd = NULL;
         api = NULL;
         param = NULL;
@@ -105,6 +112,7 @@
         startTime = x265_mdate();
         prevUpdateTime = 0;
         bDither = false;
+        csvLogLevel = 0;
     }

     void destroy();
@@ -124,6 +132,9 @@
     if (qpfile)
         fclose(qpfile);
     qpfile = NULL;
+    if (csvfpt)
+        fclose(csvfpt);
+    csvfpt = NULL;
     if (output)
         output->release();
     output = NULL;
@@ -158,8 +169,8 @@

 bool CLIOptions::parse(int argc, char **argv)
 {
-    bool bError = 0;
-    int help = 0;
+    bool bError = false;
+    int bShowHelp = false;
     int inputBitDepth = 8;
     int outputBitDepth = 0;
     int reconFileBitDepth = 0;
@@ -188,8 +199,21 @@
             tune = optarg;
         else if (c == 'D')
             outputBitDepth = atoi(optarg);
+        else if (c == 'P')
+            profile = optarg;
         else if (c == '?')
-            showHelp(param);
+            bShowHelp = true;
+    }
+
+    if (!outputBitDepth && profile)
+    {
+        /* try to derive the output bit depth from the requested profile */
+        if (strstr(profile, "10"))
+            outputBitDepth = 10;
+        else if (strstr(profile, "12"))
+            outputBitDepth = 12;
+        else
+            outputBitDepth = 8;
     }

     api = x265_api_get(outputBitDepth);
@@ -212,6 +236,12 @@
         return true;
     }

+    if (bShowHelp)
+    {
+        printVersion(param, api);
+        showHelp(param);
+    }
+
     for (optind = 0;; )
     {
         int long_options_index = -1;
@@ -222,12 +252,13 @@
         switch (c)
         {
         case 'h':
+            printVersion(param, api);
             showHelp(param);
             break;

         case 'V':
-            printVersion(param);
-            x265_setup_primitives(param, -1);
+            printVersion(param, api);
+            x265_report_simd(param);
             exit(0);

         default:
@@ -264,6 +295,8 @@
         if (0) ;
         OPT2("frame-skip", "seek") this->seek = (uint32_t)x265_atoi(optarg, bError);
         OPT("frames") this->framesToBeEncoded = (uint32_t)x265_atoi(optarg, bError);
+        OPT("csv") this->csvfn = optarg;
+        OPT("csv-log-level") this->csvLogLevel = x265_atoi(optarg, bError);
         OPT("no-progress") this->bProgress = false;
         OPT("output") outputfn = optarg;
         OPT("input") inputfn = optarg;
@@ -272,9 +305,9 @@
         OPT("dither") this->bDither = true;
         OPT("recon-depth") reconFileBitDepth = (uint32_t)x265_atoi(optarg, bError);
         OPT("y4m") this->bForceY4m = true;
-        OPT("profile") profile = optarg; /* handled last */
-        OPT("preset") /* handled above */;
-        OPT("tune") /* handled above */;
+        OPT("profile") /* handled above */;
+        OPT("preset") /* handled above */;
+        OPT("tune") /* handled above */;
         OPT("output-depth") /* handled above */;
         OPT("recon-y4m-exec") reconPlayCmd = optarg;
         OPT("qpfile")
@@ -309,18 +342,22 @@
         return true;
     }

-    if (argc <= 1 || help)
+    if (argc <= 1)
+    {
+        api->param_default(param);
+        printVersion(param, api);
         showHelp(param);
+    }

-    if (inputfn == NULL || outputfn == NULL)
+    if (!inputfn || !outputfn)
     {
-        x265_log(param, X265_LOG_ERROR, "input or output file not specified, try -V for help\n");
+        x265_log(param, X265_LOG_ERROR, "input or output file not specified, try --help for help\n");
         return true;
     }

-    if (param->internalBitDepth != api->max_bit_depth)
+    if (param->internalBitDepth != api->bit_depth)
     {
-        x265_log(param, X265_LOG_ERROR, "Only bit depths of %d are supported in this build\n", api->max_bit_depth);
+        x265_log(param, X265_LOG_ERROR, "Only bit depths of %d are supported in this build\n", api->bit_depth);
         return true;
     }
@@ -465,7 +502,8 @@
  *         1 - unable to parse command line
  *         2 - unable to open encoder
  *         3 - unable to generate stream headers
- *         4 - encoder abort */
+ *         4 - encoder abort
+ *         5 - unable to open csv file */

 int main(int argc, char **argv)
 {
@@ -516,6 +554,19 @@
     /* get the encoder parameters post-initialization */
     api->encoder_parameters(encoder, param);

+    if (cliopt.csvfn)
+    {
+        cliopt.csvfpt = x265_csvlog_open(*api, *param, cliopt.csvfn, cliopt.csvLogLevel);
+        if (!cliopt.csvfpt)
+        {
+            x265_log(param, X265_LOG_ERROR, "Unable to open CSV log file <%s>, aborting\n", cliopt.csvfn);
+            cliopt.destroy();
+            if (cliopt.api)
+                cliopt.api->param_free(cliopt.param);
+            exit(5);
+        }
+    }
+
     /* Control-C handler */
     if (signal(SIGINT, sigint_handler) == SIG_ERR)
         x265_log(param, X265_LOG_ERROR, "Unable to register CTRL+C handler: %s\n", strerror(errno));
@@ -524,7 +575,7 @@
     x265_picture *pic_in = &pic_orig;
     /* Allocate recon picture if analysisMode is enabled */
     std::priority_queue<int64_t>* pts_queue = cliopt.output->needPTS() ? new std::priority_queue<int64_t>() : NULL;
-    x265_picture *pic_recon = (cliopt.recon || !!param->analysisMode || pts_queue || reconPlay) ? &pic_out : NULL;
+    x265_picture *pic_recon = (cliopt.recon || !!param->analysisMode || pts_queue || reconPlay || cliopt.csvLogLevel) ? &pic_out : NULL;
     uint32_t inFrameCount = 0;
     uint32_t outFrameCount = 0;
     x265_nal *p_nal;
@@ -581,7 +632,7 @@
         {
             if (pic_in->bitDepth > param->internalBitDepth && cliopt.bDither)
             {
-                ditherImage(*pic_in, param->sourceWidth, param->sourceHeight, errorBuf, param->internalBitDepth);
+                x265_dither_image(*api, *pic_in, cliopt.input->getWidth(), cliopt.input->getHeight(), errorBuf, param->internalBitDepth);
                 pic_in->bitDepth = param->internalBitDepth;
             }
             /* Overwrite PTS */
@@ -615,6 +666,8 @@
         }

         cliopt.printStatus(outFrameCount);
+        if (numEncoded && cliopt.csvLogLevel)
+            x265_csvlog_frame(cliopt.csvfpt, *param, *pic_recon, cliopt.csvLogLevel);
     }

     /* Flush the encoder */
@@ -645,6 +698,8 @@
         }

         cliopt.printStatus(outFrameCount);
+        if (numEncoded && cliopt.csvLogLevel)
+            x265_csvlog_frame(cliopt.csvfpt, *param, *pic_recon, cliopt.csvLogLevel);

         if (!numEncoded)
             break;
@@ -659,8 +714,8 @@
         delete reconPlay;

     api->encoder_get_stats(encoder, &stats, sizeof(stats));
-    if (param->csvfn && !b_ctrl_c)
-        api->encoder_log(encoder, argc, argv);
+    if (cliopt.csvfpt && !b_ctrl_c)
+        x265_csvlog_encode(cliopt.csvfpt, *api, *param, stats, cliopt.csvLogLevel, argc, argv);
     api->encoder_close(encoder);

     int64_t second_largest_pts = 0;
@@ -680,26 +735,6 @@
         general_log(param, NULL, X265_LOG_INFO, "aborted at input frame %d, output frame %d\n",
                     cliopt.seek + inFrameCount, stats.encodedPictureCount);

-    if (stats.encodedPictureCount)
-    {
-        char buffer[4096];
-        int p = sprintf(buffer, "\nencoded %d frames in %.2fs (%.2f fps), %.2f kb/s", stats.encodedPictureCount,
-                        stats.elapsedEncodeTime, stats.encodedPictureCount / stats.elapsedEncodeTime, stats.bitrate);
-
-        if (param->bEnablePsnr)
-            p += sprintf(buffer + p, ", Global PSNR: %.3f", stats.globalPsnr);
-
-        if (param->bEnableSsim)
-            p += sprintf(buffer + p, ", SSIM Mean Y: %.7f (%6.3f dB)", stats.globalSsim, x265_ssim2dB(stats.globalSsim));
-
-        sprintf(buffer + p, "\n");
-        general_log(param, NULL, X265_LOG_INFO, buffer);
-    }
-    else
-    {
-        general_log(param, NULL, X265_LOG_INFO, "\nencoded 0 frames\n");
-    }
-
     api->cleanup(); /* Free library singletons */

     cliopt.destroy();
View file
x265_1.7.tar.gz/source/x265.def.in -> x265_1.8.tar.gz/source/x265.def.in
Changed
@@ -21,3 +21,4 @@
 x265_encoder_close
 x265_cleanup
 x265_api_get_${X265_BUILD}
+x265_api_query
View file
x265_1.7.tar.gz/source/x265.h -> x265_1.8.tar.gz/source/x265.h
Changed
@@ -100,6 +100,50 @@
     uint32_t numPartitions;
 } x265_analysis_data;

+/* cu statistics */
+typedef struct x265_cu_stats
+{
+    double percentSkipCu[4];                // Percentage of skip cu in all depths
+    double percentMergeCu[4];               // Percentage of merge cu in all depths
+    double percentIntraDistribution[4][3];  // Percentage of DC, Planar, Angular intra modes in all depths
+    double percentInterDistribution[4][3];  // Percentage of 2Nx2N inter, rect and amp in all depths
+    double percentIntraNxN;                 // Percentage of 4x4 cu
+
+    /* All the above values will add up to 100%. */
+} x265_cu_stats;
+
+/* Frame level statistics */
+typedef struct x265_frame_stats
+{
+    double qp;
+    double rateFactor;
+    double psnrY;
+    double psnrU;
+    double psnrV;
+    double psnr;
+    double ssim;
+    double decideWaitTime;
+    double row0WaitTime;
+    double wallTime;
+    double refWaitWallTime;
+    double totalCTUTime;
+    double stallTime;
+    double avgWPP;
+    double avgLumaDistortion;
+    double avgChromaDistortion;
+    double avgPsyEnergy;
+    double avgLumaLevel;
+    uint64_t bits;
+    int encoderOrder;
+    int poc;
+    int countRowBlocks;
+    int list0POC[16];
+    int list1POC[16];
+    uint16_t maxLumaLevel;
+    char sliceType;
+    x265_cu_stats cuStats;
+} x265_frame_stats;
+
 /* Used to pass pictures into the encoder, and to get picture data back out of
  * the encoder. The input and output semantics are different */
 typedef struct x265_picture
@@ -161,6 +205,9 @@
      * this data structure */
     x265_analysis_data analysisData;

+    /* Frame level statistics */
+    x265_frame_stats frameData;
+
 } x265_picture;

 typedef enum
@@ -221,9 +268,8 @@
 #define X265_LOG_ERROR   0
 #define X265_LOG_WARNING 1
 #define X265_LOG_INFO    2
-#define X265_LOG_FRAME   3
-#define X265_LOG_DEBUG   4
-#define X265_LOG_FULL    5
+#define X265_LOG_DEBUG   3
+#define X265_LOG_FULL    4

 #define X265_B_ADAPT_NONE 0
 #define X265_B_ADAPT_FAST 1
@@ -249,6 +295,7 @@
 #define X265_AQ_NONE                 0
 #define X265_AQ_VARIANCE             1
 #define X265_AQ_AUTO_VARIANCE        2
+#define X265_AQ_AUTO_VARIANCE_BIASED 3

 /* NOTE! For this release only X265_CSP_I420 and X265_CSP_I444 are supported */
@@ -302,20 +349,35 @@
     X265_RC_CRF
 } X265_RC_METHODS;

+/* slice type statistics */
+typedef struct x265_sliceType_stats
+{
+    double avgQp;
+    double bitrate;
+    double psnrY;
+    double psnrU;
+    double psnrV;
+    double ssim;
+    uint32_t numPics;
+} x265_sliceType_stats;
+
 /* Output statistics from encoder */
 typedef struct x265_stats
 {
-    double globalPsnrY;
-    double globalPsnrU;
-    double globalPsnrV;
-    double globalPsnr;
-    double globalSsim;
-    double elapsedEncodeTime;     /* wall time since encoder was opened */
-    double elapsedVideoTime;      /* encoded picture count / frame rate */
-    double bitrate;               /* accBits / elapsed video time */
-    uint64_t accBits;             /* total bits output thus far */
-    uint32_t encodedPictureCount; /* number of output pictures thus far */
-    uint32_t totalWPFrames;       /* number of uni-directional weighted frames used */
+    double                globalPsnrY;
+    double                globalPsnrU;
+    double                globalPsnrV;
+    double                globalPsnr;
+    double                globalSsim;
+    double                elapsedEncodeTime;    /* wall time since encoder was opened */
+    double                elapsedVideoTime;     /* encoded picture count / frame rate */
+    double                bitrate;              /* accBits / elapsed video time */
+    uint64_t              accBits;              /* total bits output thus far */
+    uint32_t              encodedPictureCount;  /* number of output pictures thus far */
+    uint32_t              totalWPFrames;        /* number of uni-directional weighted frames used */
+    x265_sliceType_stats  statsI;               /* statistics of I slice */
+    x265_sliceType_stats  statsP;               /* statistics of P slice */
+    x265_sliceType_stats  statsB;               /* statistics of B slice */
 } x265_stats;

 /* String values accepted by x265_param_parse() (and CLI) for various parameters */
@@ -326,7 +388,7 @@
 static const char * const x265_colorprim_names[] = { "", "bt709", "undef", "", "bt470m", "bt470bg", "smpte170m", "smpte240m", "film", "bt2020", 0 };
 static const char * const x265_transfer_names[] = { "", "bt709", "undef", "", "bt470m", "bt470bg", "smpte170m", "smpte240m", "linear", "log100",
                                                     "log316", "iec61966-2-4", "bt1361e", "iec61966-2-1", "bt2020-10", "bt2020-12",
-                                                    "smpte-st-2084", "smpte-st-428", 0 };
+                                                    "smpte-st-2084", "smpte-st-428", "arib-std-b67", 0 };
 static const char * const x265_colmatrix_names[] = { "GBR", "bt709", "undef", "", "fcc", "bt470bg", "smpte170m", "smpte240m", "YCgCo", "bt2020nc", "bt2020c", 0 };
 static const char * const x265_sar_names[] = { "undef", "1:1", "12:11", "10:11", "16:11", "40:33", "24:11", "20:11",
@@ -439,8 +501,7 @@

     /*== Logging Features ==*/

-    /* Enable analysis and logging distribution of CUs encoded across various
-     * modes during mode decision. Default disabled */
+    /* Enable analysis and logging distribution of CUs. Now deprecated */
     int bLogCuStats;

     /* Enable the measurement and reporting of PSNR. Default is enabled */
@@ -453,11 +514,7 @@
      * X265_LOG_FULL, default is X265_LOG_INFO */
     int logLevel;

-    /* filename of CSV log. If logLevel greater than or equal to X265_LOG_FRAME,
-     * the encoder will emit per-slice statistics to this log file in encode
-     * order. Otherwise the encoder will emit per-stream statistics into the log
-     * file when x265_encoder_log is called (presumably at the end of the
-     * encode) */
+    /* Filename of CSV log. Now deprecated */
     const char* csvfn;

     /*== Internal Picture Specification ==*/
@@ -1143,11 +1200,31 @@
 #define X265_PARAM_BAD_VALUE  (-2)
 int x265_param_parse(x265_param *p, const char *name, const char *value);

-/* x265_param_apply_profile:
- *      Applies the restrictions of the given profile. (one of below) */
-static const char * const x265_profile_names[] = { "main", "main10", "mainstillpicture", 0 };
+static const char * const x265_profile_names[] = {
+    /* HEVC v1 */
+    "main", "main10", "mainstillpicture", /* alias */ "msp",
+
+    /* HEVC v2 (Range Extensions) */
+    "main-intra", "main10-intra",
+    "main444-8", "main444-intra", "main444-stillpicture",
 
-/* (can be NULL, in which case the function will do nothing
+    "main422-10", "main422-10-intra",
+    "main444-10", "main444-10-intra",
+
+    "main12", "main12-intra", /* Highly Experimental */
+    "main422-12", "main422-12-intra",
+    "main444-12", "main444-12-intra",
+
+    "main444-16-intra", "main444-16-stillpicture", /* Not Supported! */
+    0
+};
+
+/* x265_param_apply_profile:
+ *      Applies the restrictions of the given profile. (one of x265_profile_names)
+ *      (can be NULL, in which case the function will do nothing)
+ *      Note: the detected profile can be lower than the one specified to this
+ *      function. This function will force the encoder parameters to fit within
+ *      the specified profile, or fail if that is impossible.
  *      returns 0 on success, negative on failure (e.g. invalid profile name). */
 int x265_param_apply_profile(x265_param *, const char *profile);
@@ -1263,9 +1340,7 @@
 void x265_encoder_get_stats(x265_encoder *encoder, x265_stats *, uint32_t statsSizeBytes);

 /* x265_encoder_log:
- *      write a line to the configured CSV file.  If a CSV filename was not
- *      configured, or file open failed, or the log level indicated frame level
- *      logging, this function will perform no write. */
+ *      This function is deprecated */
 void x265_encoder_log(x265_encoder *encoder, int argc, char **argv);

 /* x265_encoder_close:
@@ -1276,15 +1351,28 @@
  *      release library static allocations, reset configured CTU size */
 void x265_cleanup(void);

+#define X265_MAJOR_VERSION 1
+
 /* === Multi-lib API ===
- * By using this method to gain access to the libx265 interfaces, you allow shim
- * implementations of x265_api_get() to choose between various available libx265
- * libraries based on the encoder parameters. The most likely use case is to
- * choose between 8bpp and 16bpp builds of libx265. */
+ * By using this method to gain access to the libx265 interfaces, you allow run-
+ * time selection between various available libx265 libraries based on the
+ * encoder parameters. The most likely use case is to choose between Main and
+ * Main10 builds of libx265. */

 typedef struct x265_api
 {
+    int           api_major_version;    /* X265_MAJOR_VERSION */
+    int           api_build_number;     /* X265_BUILD (soname) */
+    int           sizeof_param;         /* sizeof(x265_param) */
+    int           sizeof_picture;       /* sizeof(x265_picture) */
+    int           sizeof_analysis_data; /* sizeof(x265_analysis_data) */
+    int           sizeof_zone;          /* sizeof(x265_zone) */
+    int           sizeof_stats;         /* sizeof(x265_stats) */
+
+    int           bit_depth;
+    const char*   version_str;
+    const char*   build_info_str;
+
     /* libx265 public API functions, documented above with x265_ prefixes */
     x265_param*   (*param_alloc)(void);
     void          (*param_free)(x265_param*);
@@ -1304,9 +1392,9 @@
     void          (*encoder_log)(x265_encoder*, int, char**);
     void          (*encoder_close)(x265_encoder*);
     void          (*cleanup)(void);
-    const char*   version_str;
-    const char*   build_info_str;
-    int           max_bit_depth;
+
+    int           sizeof_frame_stats;   /* sizeof(x265_frame_stats) */
+
+    /* add new pointers to the end, or increment X265_MAJOR_VERSION */
 } x265_api;

 /* Force a link error in the case of linking against an incompatible API version.
@@ -1330,6 +1418,43 @@
  * Obviously the shared library file extension is platform specific */
 const x265_api* x265_api_get(int bitDepth);

+/* x265_api_query:
+ *   Retrieve the programming interface for a linked x265 library, like
+ *   x265_api_get(), except this function accepts X265_BUILD as the second
+ *   argument rather than using the build number as part of the function name.
+ *   Applications which dynamically link to libx265 can use this interface to
+ *   query the library API and achieve a relative amount of version skew
+ *   flexibility. The function may return NULL if the library determines that
+ *   the apiVersion that your application was compiled against is not compatible
+ *   with the library you have linked with.
+ *
+ *   api_major_version will be incremented any time non-backward compatible
+ *   changes are made to any public structures or functions. If
+ *   api_major_version does not match X265_MAJOR_VERSION from the x265.h your
+ *   application compiled against, your application must not use the returned
+ *   x265_api pointer.
+ *
+ *   Users of this API *must* also validate the sizes of any structures which
+ *   are not treated as opaque in application code. For instance, if your
+ *   application dereferences a x265_param pointer, then it must check that
+ *   api->sizeof_param matches the sizeof(x265_param) that your application
+ *   compiled with. */
+const x265_api* x265_api_query(int bitDepth, int apiVersion, int* err);
+
+#define X265_API_QUERY_ERR_NONE           0 /* returned API pointer is non-NULL */
+#define X265_API_QUERY_ERR_VER_REFUSED    1 /* incompatible version skew */
+#define X265_API_QUERY_ERR_LIB_NOT_FOUND  2 /* libx265_main10 not found, for ex */
+#define X265_API_QUERY_ERR_FUNC_NOT_FOUND 3 /* unable to bind x265_api_query */
+#define X265_API_QUERY_ERR_WRONG_BITDEPTH 4 /* libx265_main10 not 10bit, for ex */
+
+static const char * const x265_api_query_errnames[] = {
+    "api queried from libx265",
+    "libx265 version is not compatible with this application",
+    "unable to bind a libx265 with requested bit depth",
+    "unable to bind x265_api_query from libx265",
+    "libx265 has an invalid bitdepth"
+};
+
 #ifdef __cplusplus
 }
 #endif
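Editor's note: a sketch of the validation sequence the x265_api_query comments above prescribe, using only names declared in this header:

#include "x265.h"
#include <cstdio>

const x265_api* bind_x265(int bitDepth)
{
    int err = X265_API_QUERY_ERR_NONE;
    const x265_api* api = x265_api_query(bitDepth, X265_BUILD, &err);
    if (!api)
    {
        if (err > X265_API_QUERY_ERR_NONE && err <= X265_API_QUERY_ERR_WRONG_BITDEPTH)
            fprintf(stderr, "x265: %s\n", x265_api_query_errnames[err]);
        return NULL;
    }

    /* per the header: reject major-version skew and structure size skew
     * before dereferencing any non-opaque structures */
    if (api->api_major_version != X265_MAJOR_VERSION)
        return NULL;
    if (api->sizeof_param != (int)sizeof(x265_param))
        return NULL;

    return api;
}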
View file
x265_1.7.tar.gz/source/x265cli.h -> x265_1.8.tar.gz/source/x265cli.h
Changed
@@ -24,10 +24,13 @@
 #ifndef X265CLI_H
 #define X265CLI_H 1

+#include "common.h"
+#include "param.h"
+
 #include <getopt.h>

 #ifdef __cplusplus
-namespace x265 {
+namespace X265_NS {
 #endif

 static const char short_options[] = "o:D:P:p:f:F:r:I:i:b:s:t:q:m:hwV?";
@@ -54,6 +57,7 @@
     { "allow-non-conformance",no_argument, NULL, 0 },
     { "no-allow-non-conformance",no_argument, NULL, 0 },
     { "csv",            required_argument, NULL, 0 },
+    { "csv-log-level",  required_argument, NULL, 0 },
     { "no-cu-stats",          no_argument, NULL, 0 },
     { "cu-stats",             no_argument, NULL, 0 },
     { "y4m",                  no_argument, NULL, 0 },
@@ -121,6 +125,7 @@
     { "no-b-pyramid",         no_argument, NULL, 0 },
     { "b-pyramid",            no_argument, NULL, 0 },
     { "ref",            required_argument, NULL, 0 },
+    { "limit-refs",     required_argument, NULL, 0 },
     { "no-weightp",           no_argument, NULL, 0 },
     { "weightp",              no_argument, NULL, 'w' },
     { "no-weightb",           no_argument, NULL, 0 },
@@ -183,7 +188,8 @@
     { "transfer",       required_argument, NULL, 0 },
     { "colormatrix",    required_argument, NULL, 0 },
     { "chromaloc",      required_argument, NULL, 0 },
-    { "crop-rect",      required_argument, NULL, 0 },
+    { "display-window", required_argument, NULL, 0 },
+    { "crop-rect",      required_argument, NULL, 0 }, /* DEPRECATED */
     { "master-display", required_argument, NULL, 0 },
     { "max-cll",        required_argument, NULL, 0 },
     { "no-dither",            no_argument, NULL, 0 },
@@ -219,17 +225,15 @@
     { 0, 0, 0, 0 }
 };

-static void printVersion(x265_param *param)
+static void printVersion(x265_param *param, const x265_api* api)
 {
-    x265_log(param, X265_LOG_INFO, "HEVC encoder version %s\n", x265_version_str);
-    x265_log(param, X265_LOG_INFO, "build info %s\n", x265_build_info_str);
+    x265_log(param, X265_LOG_INFO, "HEVC encoder version %s\n", api->version_str);
+    x265_log(param, X265_LOG_INFO, "build info %s\n", api->build_info_str);
 }

 static void showHelp(x265_param *param)
 {
     int level = param->logLevel;
-    x265_param_default(param);
-    printVersion(param);

 #define OPT(value) (value ? "enabled" : "disabled")
 #define H0 printf
@@ -243,11 +247,11 @@
     H0("-V/--version                     Show version info and exit\n");
     H0("\nOutput Options:\n");
     H0("-o/--output <filename>           Bitstream output file name\n");
-    H0("-D/--output-depth 8|10           Output bit depth (also internal bit depth). Default %d\n", param->internalBitDepth);
-    H0("   --log-level <string>          Logging level: none error warning info debug full. Default %s\n", x265::logLevelNames[param->logLevel + 1]);
+    H0("-D/--output-depth 8|10|12        Output bit depth (also internal bit depth). Default %d\n", param->internalBitDepth);
+    H0("   --log-level <string>          Logging level: none error warning info debug full. Default %s\n", X265_NS::logLevelNames[param->logLevel + 1]);
     H0("   --no-progress                 Disable CLI progress reports\n");
-    H0("   --[no-]cu-stats               Enable logging stats about distribution of cu across all modes. Default %s\n",OPT(param->bLogCuStats));
-    H1("   --csv <filename>              Comma separated log file, log level >= 3 frame log, else one line per run\n");
+    H0("   --csv <filename>              Comma separated log file, if csv-log-level > 0 frame level statistics, else one line per run\n");
+    H0("   --csv-log-level               Level of csv logging, if csv-log-level > 0 frame level statistics, else one line per run: 0-2\n");
     H0("\nInput Options:\n");
     H0("   --input <filename>            Raw YUV or Y4M input file name. `-` for stdin\n");
     H1("   --y4m                         Force parsing of input stream as YUV4MPEG2 regardless of file extension\n");
@@ -302,10 +306,12 @@
     H0("   --[no-]signhide               Hide sign bit of one coeff per TU (rdo). Default %s\n", OPT(param->bEnableSignHiding));
     H1("   --[no-]tskip                  Enable intra 4x4 transform skipping. Default %s\n", OPT(param->bEnableTransformSkip));
     H0("\nTemporal / motion search options:\n");
+    H0("   --max-merge <1..5>            Maximum number of merge candidates. Default %d\n", param->maxNumMergeCand);
+    H0("   --ref <integer>               max number of L0 references to be allowed (1 .. 16) Default %d\n", param->maxNumReferences);
+    H0("   --limit-refs <0|1|2|3>        limit references per depth (1) or CU (2) or both (3). Default %d\n", param->limitReferences);
     H0("   --me <string>                 Motion search method dia hex umh star full. Default %d\n", param->searchMethod);
     H0("-m/--subme <integer>             Amount of subpel refinement to perform (0:least .. 7:most). Default %d \n", param->subpelRefine);
     H0("   --merange <integer>           Motion search range. Default %d\n", param->searchRange);
-    H0("   --max-merge <1..5>            Maximum number of merge candidates. Default %d\n", param->maxNumMergeCand);
     H0("   --[no-]rect                   Enable rectangular motion partitions Nx2N and 2NxN. Default %s\n", OPT(param->bEnableRectInter));
     H0("   --[no-]amp                    Enable asymmetric motion partitions, requires --rect. Default %s\n", OPT(param->bEnableAMP));
     H1("   --[no-]temporal-mvp           Enable temporal MV predictors. Default %s\n", OPT(param->bEnableTemporalMvp));
@@ -327,13 +333,6 @@
     H1("   --bframe-bias <integer>       Bias towards B frame decisions. Default %d\n", param->bFrameBias);
     H0("   --b-adapt <0..2>              0 - none, 1 - fast, 2 - full (trellis) adaptive B frame scheduling. Default %d\n", param->bFrameAdaptive);
     H0("   --[no-]b-pyramid              Use B-frames as references. Default %s\n", OPT(param->bBPyramid));
-    H0("   --ref <integer>               max number of L0 references to be allowed (1 .. 16) Default %d\n", param->maxNumReferences);
-    H1("   --zones <zone0>/<zone1>/...   Tweak the bitrate of regions of the video\n");
-    H1("                                 Each zone is of the form\n");
-    H1("                                   <start frame>,<end frame>,<option>\n");
-    H1("                                   where <option> is either\n");
-    H1("                                       q=<integer> (force QP)\n");
-    H1("                                       or b=<float> (bitrate multiplier)\n");
     H1("   --qpfile <string>             Force frametypes and QPs for some or all frames\n");
     H1("                                 Format of each line: framenumber frametype QP\n");
     H1("                                 QP is optional (none lets x265 choose). Frametypes: I,i,P,B,b.\n");
@@ -359,7 +358,7 @@
     H0("   --[no-]strict-cbr             Enable stricter conditions and tolerance for bitrate deviations in CBR mode. Default %s\n", OPT(param->rc.bStrictCbr));
     H0("   --analysis-mode <string|int>  save - Dump analysis info into file, load - Load analysis buffers from the file. Default %d\n", param->analysisMode);
     H0("   --analysis-file <filename>    Specify file name used for either dumping or reading analysis data.\n");
-    H0("   --aq-mode <integer>           Mode for Adaptive Quantization - 0:none 1:uniform AQ 2:auto variance. Default %d\n", param->rc.aqMode);
+    H0("   --aq-mode <integer>           Mode for Adaptive Quantization - 0:none 1:uniform AQ 2:auto variance 3:auto variance with bias to dark scenes. Default %d\n", param->rc.aqMode);
     H0("   --aq-strength <float>         Reduces blocking and blurring in flat and textured areas (0 to 3.0). Default %.2f\n", param->rc.aqStrength);
     H0("   --qg-size <int>               Specifies the size of the quantization group (64, 32, 16). Default %d\n", param->rc.qgSize);
     H0("   --[no-]cutree                 Enable cutree for Adaptive Quantization. Default %s\n", OPT(param->rc.cuTree));
@@ -370,6 +369,12 @@
     H1("   --cbqpoffs <integer>          Chroma Cb QP Offset [-12..12]. Default %d\n", param->cbQpOffset);
     H1("   --crqpoffs <integer>          Chroma Cr QP Offset [-12..12]. Default %d\n", param->crQpOffset);
     H1("   --scaling-list <string>       Specify a file containing HM style quant scaling lists or 'default' or 'off'. Default: off\n");
+    H1("   --zones <zone0>/<zone1>/...   Tweak the bitrate of regions of the video\n");
+    H1("                                 Each zone is of the form\n");
+    H1("                                   <start frame>,<end frame>,<option>\n");
+    H1("                                   where <option> is either\n");
+    H1("                                       q=<integer> (force QP)\n");
+    H1("                                       or b=<float> (bitrate multiplier)\n");
     H1("   --lambda-file <string>        Specify a file containing replacement values for the lambda tables\n");
     H1("                                 MAX_MAX_QP+1 floats for lambda table, then again for lambda2 table\n");
     H1("                                 Blank lines and lines starting with hash(#) are ignored\n");
@@ -383,7 +388,7 @@
     H0("                                 Choose from 0=undef, 1=1:1(\"square\"), 2=12:11, 3=10:11, 4=16:11,\n");
     H0("                                 5=40:33, 6=24:11, 7=20:11, 8=32:11, 9=80:33, 10=18:11, 11=15:11,\n");
     H0("                                 12=64:33, 13=160:99, 14=4:3, 15=3:2, 16=2:1 or custom ratio of <int:int>. Default %d\n", param->vui.aspectRatioIdc);
-    H1("   --crop-rect <string>          Add 'left,top,right,bottom' to the bitstream-level cropping rectangle\n");
+    H1("   --display-window <string>     Describe overscan cropping region as 'left,top,right,bottom' in pixels\n");
     H1("   --overscan <string>           Specify whether it is appropriate for decoder to show cropped region: undef, show or crop. Default undef\n");
     H0("   --videoformat <string>        Specify video format from undef, component, pal, ntsc, secam, mac. Default undef\n");
     H0("   --range <string>              Specify black level and range of luma and chroma signals as full or limited Default limited\n");
@@ -391,7 +396,7 @@
     H0("                                 smpte240m, film, bt2020. Default undef\n");
     H0("   --transfer <string>           Specify transfer characteristics from undef, bt709, bt470m, bt470bg, smpte170m,\n");
     H0("                                 smpte240m, linear, log100, log316, iec61966-2-4, bt1361e, iec61966-2-1,\n");
-    H0("                                 bt2020-10, bt2020-12. Default undef\n");
+    H0("                                 bt2020-10, bt2020-12, smpte-st-2084, smpte-st-428, arib-std-b67. Default undef\n");
    H1("   --colormatrix <string>        Specify color matrix setting from undef, bt709, fcc, bt470bg, smpte170m,\n");
    H1("                                 smpte240m, GBR, YCgCo, bt2020nc, bt2020c. Default undef\n");
    H1("   --chromaloc <integer>         Specify chroma sample location (0 to 5). Default of %d\n", param->vui.chromaSampleLocTypeTopField);
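Editor's note: taken together, the new 1.8 switches documented above combine along these lines; the invocations are illustrative only, and input/output file names are placeholders:

x265 --preset medium --limit-refs 3 --csv stats.csv --csv-log-level 2 -o out.hevc input.y4m
x265 --preset slow --aq-mode 3 -P main10 -D 10 -o out10.hevc input.y4m

The first run limits reference analysis per depth and per CU while writing frame-level CSV statistics; the second uses the new dark-scene-biased AQ mode and derives a 10-bit encode from the requested profile, as handled in the CLIOptions::parse() change shown earlier.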